Add DPO and ORPO preference data preprocessing pipeline utils #3895

Merged
copybara-service[bot] merged 1 commit into main from igorts/dpo-input-processing on May 20, 2026

@igorts-git
Collaborator

@igorts-git igorts-git commented May 13, 2026

Description

To simplify code review, I am splitting the Tunix-based DPO implementation into smaller PRs.
This one adds the data reading and preprocessing required by DPO.

The classic DPO inputs consist of three data columns: ["prompt", "chosen_response", "rejected_response"].
However, some DPO datasets use a two-column format where the prompt is the common prefix of the chosen and rejected strings.
When a 2-column dataset is used, our implementation extracts the common prefix into the "prompt" field, which is then fed into the model separately.
The column names in the dataset can vary, for example ["input", "chosen", "rejected"]. Our implementation allows the user to supply the dataset column names via the train_data_columns and eval_data_columns parameters.
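
For illustration only, here is a minimal sketch of how a row in either layout could be mapped to the three canonical fields. It is not the code in dpo_utils.py: the helper names `split_common_prefix` and `to_three_columns` are hypothetical, and the sketch works on raw strings, whereas the real pipeline may operate at a different level (for example on tokens).

```python
import os


def split_common_prefix(chosen: str, rejected: str) -> tuple[str, str, str]:
    """Split two full texts into (prompt, chosen_response, rejected_response)."""
    # os.path.commonprefix compares character by character (it is not path-aware),
    # so identical strings and "one string is a prefix of the other" both work.
    prompt = os.path.commonprefix([chosen, rejected])
    return prompt, chosen[len(prompt):], rejected[len(prompt):]


def to_three_columns(example: dict, columns: list[str]) -> dict:
    """Map a dataset row to the canonical three-column form.

    `columns` stands in for train_data_columns / eval_data_columns (assumed
    semantics): either [prompt, chosen, rejected] or [chosen, rejected].
    """
    if len(columns) == 3:
        prompt, chosen, rejected = (example[c] for c in columns)
    elif len(columns) == 2:
        prompt, chosen, rejected = split_common_prefix(
            example[columns[0]], example[columns[1]]
        )
    else:
        raise ValueError(f"Expected 2 or 3 column names, got {len(columns)}")
    return {
        "prompt": prompt,
        "chosen_response": chosen,
        "rejected_response": rejected,
    }


row = {"chosen": "Q: 2+2?\nA: 4", "rejected": "Q: 2+2?\nA: 5"}
print(to_three_columns(row, ["chosen", "rejected"]))
```

A character-level prefix can end in the middle of a word when the two responses happen to start the same way, so a production implementation may want to cut at a turn or token boundary instead.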

Tunix requires left-padded prompts and right-padded responses. Our code implements this padding (and truncation if needed), and it also provides Tunix with the corresponding masks.
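
As a rough sketch of that layout, the two helpers below pad plain Python lists of token ids and return a matching mask. The names `pad_left` and `pad_right`, the `pad_id=0` default, and the choice to keep the last tokens of an over-long prompt are all assumptions made for the example, not a description of the code in this PR.

```python
def pad_left(tokens, length, pad_id=0):
    """Left-pad a prompt to `length`; returns (ids, mask) with mask 1 = real token."""
    # Assumption: when the prompt is too long, keep the tokens nearest the response.
    kept = tokens[len(tokens) - length:] if len(tokens) > length else list(tokens)
    n_pad = length - len(kept)
    return [pad_id] * n_pad + kept, [0] * n_pad + [1] * len(kept)


def pad_right(tokens, length, pad_id=0):
    """Right-pad a response to `length`, truncating from the end if needed."""
    kept = list(tokens[:length])
    n_pad = length - len(kept)
    return kept + [pad_id] * n_pad, [1] * len(kept) + [0] * n_pad


prompt_ids, prompt_mask = pad_left([5, 6, 7], 5)
response_ids, response_mask = pad_right([8, 9], 4)
print(prompt_ids, prompt_mask)      # padding sits on the left of the prompt
print(response_ids, response_mask)  # padding sits on the right of the response
```

With this layout the last prompt token and the first response token are adjacent once the two arrays are concatenated, which is the usual reason for padding the two sides in opposite directions.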

NOTE: Once this PR is merged, the legacy DPO will stop working correctly. The follow-up PRs will enable Tunix-based DPO.

Caveat: This PR only adds support for HuggingFace datasets, while the legacy DPO implementation supported HuggingFace, TFDS and Grain. This is on par with our SFT implementation. We need to discuss the priority of supporting TFDS and Grain in post-training.

Tests

Added unit tests. Ran DPO/ORPO and performed logits comparison against the legacy implementation.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 95.16129% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/maxtext/input_pipeline/hf_data_processing.py | 71.42% | 1 Missing and 1 partial ⚠️ |
| src/maxtext/input_pipeline/dpo_utils.py | 98.18% | 0 Missing and 1 partial ⚠️ |

📢 Thoughts on this report? Let us know!

@igorts-git igorts-git force-pushed the igorts/dpo-input-processing branch 2 times, most recently from 30d3c25 to b8ae239 Compare May 14, 2026 21:56
@github-actions

🤖 Hi @igorts-git, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions github-actions Bot left a comment

## 📋 Review Summary

The Pull Request introduces important utilities for DPO and ORPO preference data preprocessing, which is a key component for the upcoming Tunix-based alignment implementation. The core logic for handling 2-column and 3-column datasets is well-structured, but I identified a high-severity bug in the common prefix extraction and some opportunities for more flexible truncation strategies.

🔍 General Feedback

  • Logic Bug: The common prefix extraction logic using enumerate(zip(...)) is flawed for edge cases like identical strings or prefix strings. I have provided a more robust implementation in the inline comments.
  • Truncation Strategy: The current 50/50 split for prompt/response lengths and the prefix-based truncation for prompts might lead to information loss in long-context scenarios.
  • Test Coverage: The new unit tests are quite thorough, but adding the suggested edge cases for prefix extraction would make them even better.
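
The PR code itself is not quoted in this thread, so the following is only one plausible way an `enumerate(zip(...))` loop goes wrong on those edge cases, next to a version that handles them. Both function names are hypothetical.

```python
def naive_prefix_len(a, b):
    """Hypothetical buggy pattern: the result is only set when a mismatch is found."""
    prefix_len = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            prefix_len = i
            break
    # If the loop never breaks (identical inputs, or one input is a prefix of
    # the other), prefix_len silently stays 0.
    return prefix_len


def robust_prefix_len(a, b):
    """Length of the longest common prefix, including the no-mismatch cases."""
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return limit


for a, b in [("abX", "abY"), ("abc", "abc"), ("abc", "abcd")]:
    print(a, b, naive_prefix_len(a, b), robust_prefix_len(a, b))
```

The two functions agree whenever a mismatch exists and differ only when the shorter input is exhausted first, which is why such a bug tends to surface only with targeted unit tests.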

Comment thread src/maxtext/input_pipeline/dpo_utils.py
Comment thread src/maxtext/input_pipeline/dpo_utils.py Outdated
Comment thread src/maxtext/input_pipeline/dpo_utils.py Outdated
Comment thread src/maxtext/input_pipeline/dpo_utils.py Outdated
Comment thread tests/post_training/unit/dpo_data_processing_test.py
@igorts-git igorts-git force-pushed the igorts/dpo-input-processing branch from b8ae239 to 2d7b6e0 Compare May 15, 2026 00:02
@github-actions

🤖 Hi @igorts-git, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions github-actions Bot left a comment

## 📋 Review Summary

This PR introduces necessary data preprocessing utilities for DPO and ORPO, including a new Grain transform DPOTunixPrep that handles column remapping, prefix extraction, and DPO-aware padding. The implementation is well-tested and integrated into the existing Hugging Face data pipeline.

🔍 General Feedback

  • Robustness: The prefix extraction logic for 2-column datasets is a great addition for supporting popular preference datasets like Anthropic/hh-rlhf.
  • Breaking Change: As noted in the description, moving DPO parameters into a nested config block is a breaking change for existing DPO configurations.
  • Logic Correction: A fix is suggested for the slicing logic in _pad to correctly handle cases where the requested length is 0.
  • Validation: Added a suggestion for non-negativity validation on max_prompt_length to align with project standards.
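
The exact slice used in `_pad` is not shown here, but a common source of a length-0 bug in Python is a negative-index slice: `tokens[-0:]` is the same as `tokens[0:]` and returns the whole list rather than an empty one. The sketch below, with hypothetical helper names, shows the pitfall and one way around it.

```python
def keep_last_buggy(tokens, length):
    """Intended to keep the last `length` tokens, but wrong for length == 0."""
    return tokens[-length:]  # -0 == 0, so this returns every token


def keep_last(tokens, length):
    """Keep the last `length` tokens; an explicit start index handles 0 correctly."""
    start = max(len(tokens) - length, 0)
    return tokens[start:]


tokens = [1, 2, 3]
print(keep_last_buggy(tokens, 0))  # the whole list, not an empty one
print(keep_last(tokens, 0))        # empty list
print(keep_last(tokens, 2))        # last two tokens
```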

Comment thread src/maxtext/input_pipeline/dpo_utils.py
Comment thread src/maxtext/configs/types.py
Comment thread src/maxtext/configs/base.yml Outdated
use_dpo: False
dpo_label_smoothing: 0.0
dpo_beta: 0.1
dpo:
Collaborator

Do we need to add these DPO configs to base.yml too?

Collaborator Author

Sorry, I don't fully understand the comment. Are you suggesting to remove these configs from base.yml?

Collaborator

Most likely these DPO-related parameters ended up in base.yml for historical reasons. If we have them in dpo.yaml and we have separate yml files in the configs/post_train directory, then it makes sense to remove them from base.yml, since the parameters in base.yml are meant to be shared across multiple use cases.

Collaborator Author

OK, thanks for the details. I removed the "dpo:" section from base.yml

Comment thread src/maxtext/input_pipeline/dpo_utils.py Outdated
Comment thread tests/post_training/unit/dpo_data_processing_test.py Outdated
Comment on lines -27 to -28
dpo_label_smoothing: 0.0
dpo_beta: 0.1
Collaborator

Will this break the older DPO code? If so, should we just leave it as is for now, perhaps with a comment that it is used by the older DPO?

Collaborator Author

Unfortunately, this PR breaks the older DPO code, not just because of the configs but also in how the dataset is loaded. Once this and the next PR in the series are merged, I plan to follow up with a PR that completely deletes the legacy DPO implementation.

@igorts-git igorts-git force-pushed the igorts/dpo-input-processing branch from 6cb3fd9 to 13c2e3e Compare May 19, 2026 21:03
Collaborator

@aireenmei aireenmei left a comment

LGTM. But please note that both the TFDS and HF pipelines are planned for deprecation. There's already support for SFT and DPO in the Grain pipeline. It would be great to have follow-up changes that enable the same support in the Grain pipeline.

Comment thread src/maxtext/configs/types.py Outdated
if self.use_dpo:
  if self.packing:
    raise ValueError("For DPO/ORPO, `packing` is not supported.")
  if self.dpo.max_prompt_length is not None and self.dpo.max_prompt_length > self.max_target_length:
Collaborator

If max_prompt_length == max_target_length, this will cause a max_response_length=0 error in DPODataFormatting. Should we guard against it here?

Collaborator Author

@igorts-git igorts-git May 20, 2026

Updated the guard here. FYI, there is a slightly more comprehensive assertion in dpo_utils.py that can trigger in a few more edge cases.
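
To make the edge case concrete, here is a hedged sketch of such a guard. It assumes the response budget is `max_target_length - max_prompt_length`, which is what the question above implies. The `DPOLengths` dataclass is a stand-in invented for this example and is not the config class in types.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DPOLengths:
    """Hypothetical stand-in for the relevant config fields."""

    max_target_length: int
    max_prompt_length: Optional[int] = None

    def validate(self) -> None:
        if self.max_prompt_length is None:
            return  # nothing to check when the prompt length is left unset
        if self.max_prompt_length < 0:
            raise ValueError("max_prompt_length must be non-negative.")
        # Using >= rather than > also rejects the equal case, which would
        # leave zero tokens for the response under the stated assumption.
        if self.max_prompt_length >= self.max_target_length:
            raise ValueError(
                "max_prompt_length must be smaller than max_target_length "
                "so that the response keeps at least one token."
            )


DPOLengths(max_target_length=1024, max_prompt_length=512).validate()  # passes
try:
    DPOLengths(max_target_length=1024, max_prompt_length=1024).validate()
except ValueError as err:
    print(f"rejected: {err}")
```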

Collaborator

@A9isha A9isha left a comment

thanks Igor

Approved barring the one comment

Comment thread src/maxtext/configs/base.yml Outdated
use_dpo: False
dpo_label_smoothing: 0.0
dpo_beta: 0.1
dpo:
Collaborator

Most likely these DPO-related parameters ended up in base.yml for historical reasons. If we have them in dpo.yaml and we have separate yml files in the configs/post_train directory, then it makes sense to remove them from base.yml, since the parameters in base.yml are meant to be shared across multiple use cases.

@igorts-git igorts-git force-pushed the igorts/dpo-input-processing branch 2 times, most recently from bd2c0bf to 03a2b54 Compare May 20, 2026 03:00
…ities

Includes robust common prefix extraction for 2-column datasets, prompt suffix truncation, customizable max_prompt_length with validation against max_target_length, and complete integration unit test coverage.
@igorts-git igorts-git force-pushed the igorts/dpo-input-processing branch from 03a2b54 to d59a15e Compare May 20, 2026 05:46
@copybara-service copybara-service Bot merged commit 60bc7f9 into main May 20, 2026
28 of 29 checks passed
@copybara-service copybara-service Bot deleted the igorts/dpo-input-processing branch May 20, 2026 18:10
5 participants