Conversation

@kashif kashif commented Apr 28, 2025

No description provided.

rizar commented Apr 29, 2025

Thanks for the PR!

My understanding is that with entropy_bonus=0 and kl_coef=0 this PR should not change anything, because it would be masking what is already multiplied by 0. Is that correct?

We can also drop zero-advantage groups earlier in the pipeline, in run_preprocessor.py. If we go this way, zero-advantage samples won't count towards the desired number of samples for the optimizer step. This should stabilize the learning.

Happy to hop on a call tomorrow to better understand what you are trying to achieve.

kashif commented May 1, 2025

@rizar you are right! When entropy_bonus=0 and kl_coef=0, the loss calculation simplifies to loss = policy_loss * examples_weights. Since policy_loss is proportional to advantages (when use_advantages=True), a zero advantage already results in a zero policy_loss for that token. But the statistics in the stats dictionary are affected by the combined_mask: many metrics (like average advantage, average KL, average entropy) will now be computed excluding the zero-advantage tokens...
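
A minimal sketch of this effect, assuming a GRPO-style per-token loss; the tensor names (policy_loss, advantages, combined_mask) follow the discussion above rather than the actual PipelineRL code, and examples_weights is folded into the mask for brevity:

import torch

# Hypothetical per-token tensors (names follow the discussion, not the
# actual PipelineRL code).
advantages = torch.tensor([0.5, 0.0, 0.0, 1.0])   # two zero-advantage tokens
log_ratio  = torch.tensor([0.1, 0.2, -0.1, 0.3])  # log(pi_theta / pi_old)
entropy    = torch.tensor([1.2, 1.5, 1.1, 0.9])
attention_mask = torch.ones(4)

# With use_advantages=True the policy loss is proportional to the advantages,
# so zero-advantage tokens already contribute exactly zero to the loss; with
# entropy_bonus=0 and kl_coef=0, masking them out leaves the loss unchanged.
policy_loss = -(log_ratio * advantages)
loss = (policy_loss * attention_mask).sum()

# The stats are another story: masking zero-advantage tokens changes the
# denominator of every masked average.
combined_mask = attention_mask * (advantages != 0).float()
print((entropy * attention_mask).sum() / attention_mask.sum())  # 1.175
print((entropy * combined_mask).sum() / combined_mask.sum())    # 1.05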

Good point about dropping them earlier, let me check!

kashif commented May 2, 2025

@rizar I've moved the filtering to drop zero-advantage groups earlier in the pipeline.
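
A sketch of what that group-level filter could look like. The function name, its buffer -> (filtered_buffer, num_filtered_out) shape, and the per-entry group_id key all appear in the traceback further down; the body and the "advantages" field name are assumptions:

from collections import defaultdict

def filter_zero_advantage_groups(buffer: list[dict]) -> tuple[list[dict], int]:
    # Bucket samples by their rollout group.
    groups: dict[str, list[dict]] = defaultdict(list)
    for entry in buffer:
        groups[entry["group_id"]].append(entry)

    filtered_buffer: list[dict] = []
    num_filtered_out = 0
    for group in groups.values():
        # Keep a group only if at least one token advantage is nonzero;
        # a group with identical rewards has all-zero advantages and
        # contributes nothing to the policy gradient.
        if any(a != 0.0 for entry in group for a in entry["advantages"]):
            filtered_buffer.extend(group)
        else:
            num_filtered_out += len(group)
    return filtered_buffer, num_filtered_out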

rizar commented May 5, 2025

Great, that will be a nice contribution, @kashif! But there are good reasons to make it optional, the main one being that rl/reward with this change won't always go up, which will scare the hell out of people.

I'd be happy to work with you to finish up this PR a bit later. Right now and until May 15 we are crunching for NeurIPS. Not sure if I'll find time before May 15, but right after we should definitely add this feature.

rizar commented May 31, 2025

Ran this code once; "(filtered out 0)" is what I see all the time. Will take a closer look on Monday.

kashif commented May 31, 2025

@rizar so by default I made the filtering False so as not to startle folks... I can turn it on by default, or remove the logging when it's False.

rizar commented May 31, 2025

Indeed, my bad: I had this set to 0 in my launch command.

Turned it on, now it crashes:

[2025-05-31 17:18:15,668][pipelinerl.run_preprocess][ERROR] - Error in preprocessor worker: 'group_id'
[2025-05-31 17:18:15,760][pipelinerl.utils][ERROR] - Exception in preprocess: 'group_id'
[2025-05-31 17:18:15,762][pipelinerl.utils][ERROR] - Traceback: Traceback (most recent call last):
File "/home/toolkit/dist/research-now-reasoner/pipelinerl/pipelinerl/utils.py", line 264, in better_crashing
yield
File "/home/toolkit/dist/research-now-reasoner/pipelinerl/pipelinerl/entrypoints/preprocess.py", line 9, in preprocess_hydra_entry_point
run_preprocessing_loop(cfg)
File "/home/toolkit/dist/research-now-reasoner/pipelinerl/pipelinerl/run_preprocess.py", line 437, in run_preprocessing_loop
filtered_buffer, num_filtered_out = filter_zero_advantage_groups(buffer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/toolkit/dist/research-now-reasoner/pipelinerl/pipelinerl/run_preprocess.py", line 285, in filter_zero_advantage_groups
group_id = entry["group_id"]
~~~~~^^^^^^^^^^^^
KeyError: 'group_id'

rizar commented May 31, 2025

the reason is that the preprocessing removes the group_id... we could keep it though, if it doesn't screw up the trainer component (finetune)

kashif commented May 31, 2025

ah yes sure!

@rizar rizar left a comment

@kashif I've tested this PR and it seems to work! Due to some other ongoing issues with PipelineRL I can't yet measure the improvement rigorously, but otherwise it's a nice contribution. I will go ahead and merge right now; if you want to change some things @kashif, that can be done in a follow-up PR.

    rl_config: RLConfig,
) -> Dataset:
    preprocess = partial(preprocess_fn, seq_length=seq_length, tokenizer=tokenizer, is_rl=True)
    columns = ["input_ids", "labels", "attention_mask"] + RL_DATA_COLUMNS

you can always add "group_id" to the columns, it won't hurt
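
For instance, a sketch of the suggested change, assuming the column list shown above:

    columns = ["input_ids", "labels", "attention_mask"] + RL_DATA_COLUMNS + ["group_id"]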

@rizar rizar merged commit 8a50c17 into ServiceNow:main Jun 3, 2025
@kashif kashif deleted the filter-zero-advantage branch June 3, 2025 13:46