Add dataset_processor_path CLI knob for custom datasets by py4 · Pull Request #4031 · AI-Hypercomputer/maxtext

py4 · 2026-06-01T17:33:28Z

Add a dataset_processor_path CLI/yaml knob that lets users plug in a custom process_data(dataset_name, model_tokenizer, template_config, tmvp_config, x) -> dict function from a user-provided Python file, instead of editing maxtext to support a new dataset shape.

Why: the built-in utils_rl.process_data is hardcoded for a small set of dataset schemas (GSM8K, etc.). For users running RL on custom datasets with different answer columns / cleaning rules, the alternative was either (1) edit maxtext source (fork divergence) or (2) reformat the dataset to look like GSM8K (lossy). This knob gives a clean third option: ship your dataset processor as a Python file and point maxtext at it.

Changes (2 files, +41/-16 lines):

src/maxtext/trainers/post_train/rl/train_rl.py:
- New _load_custom_callable(module_path, function_name) helper that uses importlib.util.spec_from_file_location to load a function from an arbitrary .py file (without adding to sys.path).
- prepare_datasets checks trainer_config.dataset_processor_path; if it is set, it loads process_data from that file and uses it in place of utils_rl.process_data in the dataset pipeline.
src/maxtext/configs/post_train/rl.yml:
- New top-level knob dataset_processor_path: '' with a comment documenting the signature contract.

Backward compatible: default empty string falls back to utils_rl.process_data (identical to old behavior). The _load_custom_callable helper is only invoked when the user explicitly sets the path.
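
For readers who have not used spec_from_file_location, the loading and fallback behavior described above could look roughly like the sketch below. It is not the code from this PR: the names load_custom_callable and resolve_process_data, the placeholder module name "user_module", and the error handling are assumptions made for illustration.

```python
import importlib.util
import os


def load_custom_callable(module_path, function_name):
    """Load `function_name` from the Python file at `module_path`.

    The file is executed as a standalone module; sys.path is not modified.
    """
    if not os.path.isfile(module_path):
        raise FileNotFoundError(f"No such file: {module_path}")
    # "user_module" is an arbitrary placeholder name (an assumption of this sketch).
    spec = importlib.util.spec_from_file_location("user_module", module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for {module_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    fn = getattr(module, function_name, None)
    if not callable(fn):
        raise AttributeError(
            f"{module_path} does not define a callable named {function_name!r}"
        )
    return fn


def resolve_process_data(dataset_processor_path, builtin_process_data):
    """Return the user's process_data if a path is set, else the built-in one."""
    if dataset_processor_path:
        return load_custom_callable(dataset_processor_path, "process_data")
    # Empty string (the default) keeps the built-in behavior.
    return builtin_process_data
```

With this shape, an empty dataset_processor_path never reaches the loader, which is what makes the default path identical to the old behavior.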

User-facing contract:

# user_process_data.py
def process_data(dataset_name, model_tokenizer, template_config, tmvp_config, x):
    return {"prompts": ..., "question": ..., "answer": ...}

python3 -m maxtext.trainers.post_train.rl.train_rl rl.yml \
  dataset_processor_path=/path/to/user_process_data.py \
  ...
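
As a slightly fuller illustration of the contract, a processor for a dataset with different column names might look like the sketch below. The column names "problem" and "final_answer", the raw-text PROMPT_TEMPLATE, and the clean_answer helper are all hypothetical; only the function signature and the three returned keys come from the contract above. This sketch also ignores model_tokenizer, template_config, and tmvp_config, which a real processor may well need.

```python
# user_process_data.py (illustrative only)

# Hypothetical raw-text prompt template; not taken from maxtext.
PROMPT_TEMPLATE = "Question: {question}\nAnswer:"


def clean_answer(raw):
    """Hypothetical cleaning rule: trim whitespace and trailing periods."""
    return str(raw).strip().rstrip(".")


def process_data(dataset_name, model_tokenizer, template_config, tmvp_config, x):
    """Map one raw example `x` to the keys named in the contract."""
    # "problem" and "final_answer" are assumed column names for this sketch.
    question = x["problem"].strip()
    return {
        "prompts": PROMPT_TEMPLATE.format(question=question),
        "question": question,
        "answer": clean_answer(x["final_answer"]),
    }
```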

Checklist

Tested locally with a custom processor file (VTC-style raw-text prompt template); produced expected outputs
Backward compatible: default empty string preserves utils_rl.process_data behavior
No effect on non-RL paths (only prepare_datasets in the RL trainer touched)
_load_custom_callable doesn't pollute sys.path (uses spec_from_file_location)

codecov · 2026-06-01T17:48:39Z

Codecov Report

❌ Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/trainers/post_train/rl/train_rl.py	40.00%	2 Missing and 1 partial ⚠️
src/maxtext/trainers/post_train/rl/utils_rl.py	85.71%	1 Missing and 1 partial ⚠️


khatwanimohit

LGTM

A9isha

Just one request for the refactor - the rest looks good, thank you Pooya!

Currently utils_rl.process_data hard-branches on `dataset_name == "openai/gsm8k"` to call `extract_hash_answer`. Other datasets either work as-is (if their `answer` column is already clean) or require editing utils_rl directly. Add an optional `dataset_processor_path` config: a filesystem path to a user-provided Python file with a `process_data(dataset_name, tokenizer, template_config, tmvp_config, x) -> dict` function. When set, that function replaces the built-in one for all train/eval dataset map() calls. Default (`dataset_processor_path: ''`) keeps existing behavior unchanged. Also adds `_load_custom_callable` helper used by this and the upcoming custom reward CLI knob.
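
To make the hard-branch concrete: GSM8K answers contain the reasoning followed by a final line of the form "#### <number>", so the answer has to be extracted before it can be compared against a model output. The sketch below shows the general shape of such a branch. It is an assumption about how extract_hash_answer behaves, not a copy of utils_rl, and builtin_style_answer is a hypothetical name.

```python
def extract_hash_answer(text):
    """Return the text after the '####' marker, or None if the marker is absent."""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


def builtin_style_answer(dataset_name, x):
    """Hypothetical stand-in for the dataset-specific branch described above."""
    if dataset_name == "openai/gsm8k":
        return extract_hash_answer(x["answer"])
    # Any other dataset is assumed to have a clean `answer` column already.
    return x["answer"]
```

A dataset whose answers need a different cleaning rule falls through to the last line unchanged, which is the gap dataset_processor_path is meant to close.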

py4 force-pushed the pr/dataset-processor-path branch 2 times, most recently from 22790f2 to 505b038 on June 1, 2026 17:42

py4 force-pushed the pr/dataset-processor-path branch 3 times, most recently from 6d91a2c to 2842052 on June 2, 2026 21:16

py4 mentioned this pull request Jun 2, 2026

Add reward_functions_path + reward_functions CLI knobs for custom rewards #4045

Closed

5 tasks

khatwanimohit reviewed Jun 5, 2026


Comment thread tests/post_training/unit/load_custom_callable_test.py

khatwanimohit approved these changes Jun 5, 2026


xuefgu approved these changes Jun 5, 2026


github-actions Bot added the pull ready label Jun 5, 2026

A9isha reviewed Jun 5, 2026


Comment thread src/maxtext/trainers/post_train/rl/train_rl.py Outdated

py4 force-pushed the pr/dataset-processor-path branch 2 times, most recently from 2ea1e05 to 2fc7b50 on June 5, 2026 22:32

py4 force-pushed the pr/dataset-processor-path branch from 2fc7b50 to f2d4f3b on June 8, 2026 18:12

copybara-service Bot merged commit f93627f into main Jun 8, 2026
30 checks passed

copybara-service Bot deleted the pr/dataset-processor-path branch June 8, 2026 21:47
