Add DPO and ORPO preference data preprocessing pipeline utils#3895
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
30d3c25 to
b8ae239
Compare
|
🤖 Hi @igorts-git, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
There was a problem hiding this comment.
The Pull Request introduces important utilities for DPO and ORPO preference data preprocessing, which is a key component for the upcoming Tunix-based alignment implementation. The core logic for handling 2-column and 3-column datasets is well-structured, but I identified a high-severity bug in the common prefix extraction and some opportunities for more flexible truncation strategies.
🔍 General Feedback
- Logic Bug: The common prefix extraction logic using
enumerate(zip(...))is flawed for edge cases like identical strings or prefix strings. I have provided a more robust implementation in the inline comments. - Truncation Strategy: The current 50/50 split for prompt/response lengths and the prefix-based truncation for prompts might lead to information loss in long-context scenarios.
- Test Coverage: The new unit tests are quite thorough, but adding the suggested edge cases for prefix extraction would make them even better.
b8ae239 to
2d7b6e0
Compare
|
🤖 Hi @igorts-git, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
There was a problem hiding this comment.
This PR introduces necessary data preprocessing utilities for DPO and ORPO, including a new Grain transform DPOTunixPrep that handles column remapping, prefix extraction, and DPO-aware padding. The implementation is well-tested and integrated into the existing Hugging Face data pipeline.
🔍 General Feedback
- Robustness: The prefix extraction logic for 2-column datasets is a great addition for supporting popular preference datasets like Anthropic/hh-rlhf.
- Breaking Change: As noted in the description, moving DPO parameters into a nested config block is a breaking change for existing DPO configurations.
- Logic Correction: A fix is suggested for the slicing logic in
_padto correctly handle cases where the requested length is 0. - Validation: Added a suggestion for non-negativity validation on
max_prompt_lengthto align with project standards.
fd544d6 to
eb1d524
Compare
eb1d524 to
6cb3fd9
Compare
| use_dpo: False | ||
| dpo_label_smoothing: 0.0 | ||
| dpo_beta: 0.1 | ||
| dpo: |
There was a problem hiding this comment.
Do we need to add these DPO configs to base.yml too?
There was a problem hiding this comment.
Sorry, I don't fully understand the comment. Are you suggesting to remove these configs from base.yml?
There was a problem hiding this comment.
most likely these dpo related parameters ended up being in base.yml due to historical reasons. if we have them in the dpo.yaml and we have separate yml files in the configs/post_train directory then it makes sense to remove them from base.yml since the ones in base.yml are meant to be shareable across multiple use-cases.
There was a problem hiding this comment.
OK, thanks for the details. I removed the "dpo:" section from base.yml
| dpo_label_smoothing: 0.0 | ||
| dpo_beta: 0.1 |
There was a problem hiding this comment.
will this break older DPO code, if so should we just leave as it is for now, perhpas comment that used for older DPO?
There was a problem hiding this comment.
Unfortunately, this PR breaks the older DPO code, not just due to the configs, but also in how the dataset is loaded. Once this and the next PR in the series is merged, I plan to follow up with a PR that completely deletes the legacy DPO implementation.
6cb3fd9 to
13c2e3e
Compare
| if self.use_dpo: | ||
| if self.packing: | ||
| raise ValueError("For DPO/ORPO, `packing` is not supported.") | ||
| if self.dpo.max_prompt_length is not None and self.dpo.max_prompt_length > self.max_target_length: |
There was a problem hiding this comment.
In the case of max_prompt_length == max_target_length, it will cause max_response_length=0 error in DPODataFormatting, should we guard it here?
There was a problem hiding this comment.
Updated the guard here. FYI, There is a slightly more comprehensive assertion in dpo_utils.py that can trigger in a few more edge cases.
A9isha
left a comment
There was a problem hiding this comment.
thanks Igor
Approved barring the one comment
| use_dpo: False | ||
| dpo_label_smoothing: 0.0 | ||
| dpo_beta: 0.1 | ||
| dpo: |
There was a problem hiding this comment.
most likely these dpo related parameters ended up being in base.yml due to historical reasons. if we have them in the dpo.yaml and we have separate yml files in the configs/post_train directory then it makes sense to remove them from base.yml since the ones in base.yml are meant to be shareable across multiple use-cases.
bd2c0bf to
03a2b54
Compare
…ities Includes robust common prefix extraction for 2-column datasets, prompt suffix truncation, customizable max_prompt_length with validation against max_target_length, and complete integration unit test coverage.
03a2b54 to
d59a15e
Compare
Description
To simplify code review I am splitting the Tunix-based DPO implementation into smaller PRs.
This one adds the data reading processing required by DPO.
The classic DPO inputs consist of three data columns: ["prompt", "chosen_response", "rejected_response"].
However, some DPO datasets use a two-column format where the prompt is the prefix to the choosen and rejected strings.
When a 2-column dataset is used our implementation extracts the common prefix into the "prompt" field that is then fed into the model separately.
The column names in the dataset can wary, for example ["input", chosen", "rejected"]. Our implementation allows the user to supply the dataset column names via the
train_data_columnsandeval_data_columnsparameters.Tunix requires left-padded prompt and right-padded responses. Our code implements this padding (and truncation if needed) it also provides Tunix with the corresponding masks.
NOTE: once this PR is merged the legacy DPO will stop working correctly. The follow up PRs will enable Tunix-based DPO.
Caveat: This PR only adds support for HuggingFace datasets, while the legacy DPO implementation supported HuggingFace, TFDS and Grain. This is on-par with our SFT implementation. We need to discuss the priority of supporting TFDS and Grain in post-training.
Tests
Added unit tests. Ran DPO/ORPO and performed logits comparison against the legacy implementation.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-reviewlabel.