Add dataset_processor_path CLI knob for custom datasets#4031
Merged
Conversation
22790f2 to
505b038
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
6d91a2c to
2842052
Compare
5 tasks
xuefgu
approved these changes
Jun 5, 2026
A9isha
reviewed
Jun 5, 2026
A9isha
left a comment
Collaborator
There was a problem hiding this comment.
Just one request for the refactor - the rest look good, thank you Pooya!
2ea1e05 to
2fc7b50
Compare
Currently utils_rl.process_data hard-branches on `dataset_name == "openai/gsm8k"` to call `extract_hash_answer`. Other datasets either work as-is (if their `answer` column is already clean) or require editing utils_rl directly. Add an optional `dataset_processor_path` config: a filesystem path to a user-provided Python file with a `process_data(dataset_name, tokenizer, template_config, tmvp_config, x) -> dict` function. When set, that function replaces the built-in one for all train/eval dataset map() calls. Default (`dataset_processor_path: ''`) keeps existing behavior unchanged. Also adds `_load_custom_callable` helper used by this and the upcoming custom reward CLI knob.
2fc7b50 to
f2d4f3b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add a
dataset_processor_pathCLI/yaml knob that lets users plug in a customprocess_data(dataset_name, model_tokenizer, template_config, tmvp_config, x) -> dictfunction from a user-provided Python file, instead of editing maxtext to support a new dataset shape.Why: the built-in
utils_rl.process_datais hardcoded for a small set of dataset schemas (GSM8K, etc.). For users running RL on custom datasets with different answer columns / cleaning rules, the alternative was either (1) edit maxtext source (fork divergence) or (2) reformat the dataset to look like GSM8K (lossy). This knob gives a clean third option: ship your dataset processor as a Python file and point maxtext at it.Changes (2 files, +41/-16 lines):
src/maxtext/trainers/post_train/rl/train_rl.py:_load_custom_callable(module_path, function_name)helper that usesimportlib.util.spec_from_file_locationto load a function from an arbitrary.pyfile (without adding tosys.path).prepare_datasetscheckstrainer_config.dataset_processor_path; if set, loadsprocess_datafrom that file and substitutes forutils_rl.process_datain the dataset pipeline.src/maxtext/configs/post_train/rl.yml: new top-level knobdataset_processor_path: ''with comment documenting the signature contract.Backward compatible: default empty string falls back to
utils_rl.process_data(identical to old behavior). The_load_custom_callablehelper is only invoked when the user explicitly sets the path.User-facing contract:
Checklist
utils_rl.process_databehaviorprepare_datasetsin the RL trainer touched)_load_custom_callabledoesn't pollutesys.path(usesspec_from_file_location)