Conversation

@kashif kashif commented Apr 28, 2025

No description provided.

rizar commented Apr 29, 2025

Thanks for the PR!

My understanding is that with entropy_bonus=0 and kl_coef=0 this PR should not change anything, because it would be masking what is already multiplied by 0. Is that correct?

We can also drop zero-advantage groups earlier in the pipeline, in run_preprocessor.py. If we go this way, zero-advantage samples won't count towards the desired number of samples for the optimizer step. This should stabilize the learning.

Happy to hop on a call tomorrow to better understand what you are trying to achieve.

kashif commented May 1, 2025

@rizar you are right! When entropy_bonus=0 and kl_coef=0, the loss calculation simplifies to loss = policy_loss * examples_weights. Since policy_loss is proportional to advantages (when use_advantages=True), a zero advantage already results in a zero policy_loss for that token. But the statistics in the stats dictionary are affected by the combined_mask: many metrics (like average advantage, average KL, average entropy) will now be computed excluding the zero-advantage tokens...
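
A minimal sketch of this effect, assuming a GRPO-style per-token loss; the tensor names (policy_loss, advantages, combined_mask) follow the discussion above rather than the actual PipelineRL code, and examples_weights is folded into the mask for brevity:

import torch

# Hypothetical per-token tensors (names follow the discussion, not the
# actual PipelineRL code).
advantages = torch.tensor([0.5, 0.0, 0.0, 1.0])   # two zero-advantage tokens
log_ratio  = torch.tensor([0.1, 0.2, -0.1, 0.3])  # log(pi_theta / pi_old)
entropy    = torch.tensor([1.2, 1.5, 1.1, 0.9])
attention_mask = torch.ones(4)

# With use_advantages=True the policy loss is proportional to the advantages,
# so zero-advantage tokens already contribute exactly zero to the loss; with
# entropy_bonus=0 and kl_coef=0, masking them out leaves the loss unchanged.
policy_loss = -(log_ratio * advantages)
loss = (policy_loss * attention_mask).sum()

# The stats are another story: masking zero-advantage tokens changes the
# denominator of every masked average.
combined_mask = attention_mask * (advantages != 0).float()
print((entropy * attention_mask).sum() / attention_mask.sum())  # 1.175
print((entropy * combined_mask).sum() / combined_mask.sum())    # 1.05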

Good point about dropping them earlier, let me check!

kashif commented May 2, 2025

@rizar I've moved the filtering to drop zero-advantage groups earlier in the pipeline.
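
A sketch of what that group-level filter could look like. The function name, its buffer -> (filtered_buffer, num_filtered_out) shape, and the per-entry group_id key all appear in the traceback further down; the body and the "advantages" field name are assumptions:

from collections import defaultdict

def filter_zero_advantage_groups(buffer: list[dict]) -> tuple[list[dict], int]:
    # Bucket samples by their rollout group.
    groups: dict[str, list[dict]] = defaultdict(list)
    for entry in buffer:
        groups[entry["group_id"]].append(entry)

    filtered_buffer: list[dict] = []
    num_filtered_out = 0
    for group in groups.values():
        # Keep a group only if at least one token advantage is nonzero;
        # a group with identical rewards has all-zero advantages and
        # contributes nothing to the policy gradient.
        if any(a != 0.0 for entry in group for a in entry["advantages"]):
            filtered_buffer.extend(group)
        else:
            num_filtered_out += len(group)
    return filtered_buffer, num_filtered_out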

rizar commented May 5, 2025

Great, that will be a nice contribution, @kashif! But there are good reasons to make it optional, the main one being that rl/reward with this change won't always go up, which will scare the hell out of people.

I'd be happy to work with you to finish up this PR a bit later. Right now and until May 15 we are crunching for NeurIPS. Not sure if I'll find time before May 15, but right after we should definitely add this feature.

rizar commented May 31, 2025

Ran this code once; "(filtered out 0)" is what I see all the time. Will take a closer look on Monday.

kashif commented May 31, 2025

@rizar so by default I made the filtering False so as not to startle folks... I can turn it on by default, or remove the logging when it's False.

rizar commented May 31, 2025

Indeed, my bad: I had this set to 0 in my launch command.

Turned it on, now it crashes:

[2025-05-31 17:18:15,668][pipelinerl.run_preprocess][ERROR] - Error in preprocessor worker: 'group_id'
[2025-05-31 17:18:15,760][pipelinerl.utils][ERROR] - Exception in preprocess: 'group_id'
[2025-05-31 17:18:15,762][pipelinerl.utils][ERROR] - Traceback: Traceback (most recent call last):
File "/home/toolkit/dist/research-now-reasoner/pipelinerl/pipelinerl/utils.py", line 264, in better_crashing
yield
File "/home/toolkit/dist/research-now-reasoner/pipelinerl/pipelinerl/entrypoints/preprocess.py", line 9, in preprocess_hydra_entry_point
run_preprocessing_loop(cfg)
File "/home/toolkit/dist/research-now-reasoner/pipelinerl/pipelinerl/run_preprocess.py", line 437, in run_preprocessing_loop
filtered_buffer, num_filtered_out = filter_zero_advantage_groups(buffer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/toolkit/dist/research-now-reasoner/pipelinerl/pipelinerl/run_preprocess.py", line 285, in filter_zero_advantage_groups
group_id = entry["group_id"]
~~~~~^^^^^^^^^^^^
KeyError: 'group_id'

rizar commented May 31, 2025

the reason is that the preprocessing removes the group_id... we could keep it though, if it doesn't screw up the trainer component (finetune)

kashif commented May 31, 2025

ah yes sure!

@rizar rizar left a comment

@kashif I've tested this PR and it seems to work! Due to some other ongoing issues with PipelineRL I can't yet measure the improvement rigorously, but otherwise it's a nice contribution. I will go ahead and merge right now; if you want to change some things @kashif, that can be done in a follow-up PR.

    rl_config: RLConfig,
) -> Dataset:
    preprocess = partial(preprocess_fn, seq_length=seq_length, tokenizer=tokenizer, is_rl=True)
    columns = ["input_ids", "labels", "attention_mask"] + RL_DATA_COLUMNS

you can always add "group_id" to the columns, it won't hurt
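
For instance, a sketch of the suggested change, assuming the column list shown above:

    columns = ["input_ids", "labels", "attention_mask"] + RL_DATA_COLUMNS + ["group_id"]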

@rizar rizar merged commit 8a50c17 into ServiceNow:main Jun 3, 2025
@kashif kashif deleted the filter-zero-advantage branch June 3, 2025 13:46