feat: support multiple datasets for response dataset by yuki-97 · Pull Request #1691 · NVIDIA-NeMo/RL

yuki-97 · 2025-12-23T07:58:17Z

Related issue: #1049

Usage

data:
  _override_: true # override the data config instead of merging with it
  # other data settings, see `examples/configs/sft.yaml` for more details
  ...
  # dataset settings
  train:
    # train dataset 1
    - dataset_name: OpenMathInstruct-2
      split_validation_size: 0.05 # use 5% of the training data as validation data
      seed: 42  # seed for train/validation split when split_validation_size > 0
    # train dataset 2
    - dataset_name: DeepScaler
  validation:
    # validation dataset 1
    - dataset_name: AIME2024
      repeat: 16
    # validation dataset 2
    - dataset_name: DAPOMathAIME2024
  # default settings for all datasets
  default:
    ...
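
One way to read the `default` block is as a set of fallback values applied to every entry under `train` and `validation`. The following is a minimal sketch of that idea, assuming per-dataset keys take precedence over `default`. The helper `apply_defaults` and the `example_option` key are hypothetical and are not taken from the PR.

```python
from __future__ import annotations

from typing import Any


def apply_defaults(
    entries: list[dict[str, Any]], default: dict[str, Any] | None
) -> list[dict[str, Any]]:
    """Return new dataset entries with `default` filled in as fallback values."""
    fallback = default or {}
    # Later keys win in a dict merge, so entry-specific settings override defaults.
    return [{**fallback, **entry} for entry in entries]


data = {
    "train": [
        {"dataset_name": "OpenMathInstruct-2", "split_validation_size": 0.05, "seed": 42},
        {"dataset_name": "DeepScaler"},
    ],
    "default": {"example_option": "shared"},  # hypothetical key, for illustration only
}

train_entries = apply_defaults(data["train"], data.get("default"))
for entry in train_entries:
    print(entry)
```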

Summary by CodeRabbit

New Features
- Added multi-dataset training support, enabling models to train on multiple datasets simultaneously.
- Introduced configuration structure for multi-dataset GRPO experimentation with customizable dataset-specific parameters.
Tests
- Added functional test coverage for multi-dataset training workflows.

coderabbitai · 2026-01-26T06:20:48Z

Walkthrough

Introduces multi-dataset support for the GRPO framework through a new configuration file, refactored data handling across two modules, an override-aware config merge utility, and corresponding functional tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration `examples/configs/grpo_multiple_datasets.yaml` | New YAML configuration defining a multi-dataset GRPO setup with a defaults reference, data overrides, dataset-wide settings (max_input_seq_length, shuffle, num_workers), and per-dataset train/validation lists with optional split/repeat parameters. |
| Config Utilities `nemo_rl/utils/config.py` | Added `merge_with_override()` utility function that honors `_override_: true` markers in child configs to fully override parent config subtrees, improving inheritance control in `load_config_with_inheritance()`. |
| Data Handling `examples/run_sft.py`, `nemo_rl/data/utils.py` | Modified `setup_data()` and related functions to support multiple training/validation datasets instead of a single dataset; collects datasets into lists, optionally applies per-dataset defaults, merges via `concatenate_datasets()`, and maintains per-task processor/environment bindings across merged datasets. |
| Testing `tests/functional/grpo_multiple_datasets.sh`, `tests/functional/L1_Functional_Tests_GPU.sh` | New functional test script orchestrating a GRPO multi-dataset experiment with metrics validation; test invocation added to the L1 GPU test suite. |
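
The `_override_: true` behavior described for `merge_with_override()` can be illustrated with plain dictionaries. This is a rough sketch only: the PR operates on OmegaConf `DictConfig` objects, while the functions `deep_merge` and `merge_with_override_sketch` and the sample config values below are hypothetical stand-ins.

```python
from copy import deepcopy
from typing import Any, Dict

OVERRIDE_KEY = "_override_"


def deep_merge(base: Dict[str, Any], child: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge `child` into a copy of `base`; child values win."""
    merged = deepcopy(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def merge_with_override_sketch(
    base: Dict[str, Any], child: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge configs, replacing any top-level section marked with `_override_`."""
    base, child = deepcopy(base), deepcopy(child)
    for key, value in child.items():
        # The marker is stripped here; a true marker also drops the parent
        # section so nothing from it survives the merge.
        if isinstance(value, dict) and value.pop(OVERRIDE_KEY, False):
            base.pop(key, None)
    return deep_merge(base, child)


parent = {
    "data": {"dataset_name": "OpenMathInstruct-2", "shuffle": True},
    "logger": {"log_dir": "logs"},  # hypothetical section
}
child = {"data": {"_override_": True, "train": [{"dataset_name": "DeepScaler"}]}}

merged = merge_with_override_sketch(parent, child)
print(merged["data"])    # only the child's keys remain under "data"
print(merged["logger"])  # untouched sections are inherited from the parent
```

Without the marker, the same child would be merged key by key, so `dataset_name` and `shuffle` from the parent would remain alongside `train`.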

Sequence Diagram(s)

sequenceDiagram
    participant Config as Config Loader
    participant DataList as Multiple Datasets
    participant Merge as Dataset Merger
    participant Processor as Task Processors
    participant Train as Training

    Config->>DataList: Load each dataset config
    DataList->>DataList: Apply per-dataset defaults
    DataList->>Merge: Collect all datasets
    Merge->>Merge: Concatenate datasets
    Merge->>Processor: Build per-task processors<br/>(per dataset)
    Processor->>Processor: Merge processor mappings<br/>(task_name → processor)
    Processor->>Train: Create AllTaskProcessedDataset<br/>with merged data + processors
    Train->>Train: Execute training
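
The flow in the diagram can be approximated in a few lines. This sketch uses plain Python lists instead of `concatenate_datasets()` and `AllTaskProcessedDataset`, and the function `merge_datasets`, the record fields, and both processors are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Processor = Callable[[Record], Record]


def merge_datasets(
    datasets: List[Tuple[str, List[Record], Processor]],
) -> Tuple[List[Record], Dict[str, Processor]]:
    """Concatenate records and build a task_name -> processor mapping."""
    merged: List[Record] = []
    processors: Dict[str, Processor] = {}
    for task_name, records, processor in datasets:
        # Tag each record with its task so it can be routed after concatenation.
        merged.extend({**record, "task_name": task_name} for record in records)
        processors[task_name] = processor
    return merged, processors


def plain_processor(record: Record) -> Record:
    return {**record, "prompt": record["problem"]}


def boxed_processor(record: Record) -> Record:
    return {**record, "prompt": record["problem"] + " Put the answer in a box."}


merged, processors = merge_datasets(
    [
        ("OpenMathInstruct-2", [{"problem": "1 + 1 = ?"}, {"problem": "2 * 3 = ?"}], plain_processor),
        ("DeepScaler", [{"problem": "5 - 4 = ?"}], boxed_processor),
    ]
)

# Each record is handled by the processor registered for its own task.
processed = [processors[record["task_name"]](record) for record in merged]
for record in processed:
    print(record["task_name"], "->", record["prompt"])
```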

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

Run CICD

Suggested reviewers

terrykong

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ⚠️ Warning | PR description lacks actual test results, metric values, and validation confirmation for the major multi-dataset feature despite functional tests being added. | Add test execution results showing actual metric values, convergence validation, and confirmation that all code review issues are resolved. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely describes the main objective: adding support for multiple datasets in response/training data handling. It directly reflects the core changes across the codebase. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

nemo_rl/utils/config.py (1)
1-1: Update the copyright header year to include 2026.

Line 1 still shows 2025; please update to include the current year. As per coding guidelines, headers must include the current year.
📝 Suggested header update
-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.
examples/run_sft.py (1)
1-1: Update the copyright header year to include 2026.

Line 1 still shows 2025; please update to include the current year. As per coding guidelines, headers must include the current year.
📝 Suggested header update
-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.
nemo_rl/data/utils.py (1)
1-1: Update the copyright header year to include 2026.

Line 1 still shows 2025; please update to include the current year. As per coding guidelines, headers must include the current year.
📝 Suggested header update
-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.

🤖 Fix all issues with AI agents

In `@nemo_rl/data/utils.py`:
- Around line 75-77: The normalization logic in utils.py currently only wraps
single-dataset entries when data_config["train"] or data_config["validation"] is
a plain dict, but misses omegaconf.DictConfig; import DictConfig from omegaconf
and update the two isinstance checks (the one that normalizes
data_config["train"] and the one for data_config["validation"]) to check
isinstance(..., (dict, DictConfig)) so single-dataset DictConfig objects are
wrapped into a list before iterating (this fixes errors in
load_response_dataset).

In `@tests/functional/grpo_multiple_datasets.sh`:
- Around line 20-40: The shell command at the end of
tests/functional/grpo_multiple_datasets.sh uses an unquoted $@ which can break
arguments with spaces; update the invocation to use "$@" (i.e., replace $@ with
"$@") so all passed overrides/arguments preserve their boundaries when appended
to the uv run command before the redirection and tee pipeline.
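
The normalization described in the first item above could look roughly like the sketch below. The helper name `as_dataset_list` is hypothetical, treating a missing entry as an empty list is an assumption, and the `omegaconf` import is guarded so the snippet also runs where that package is not installed.

```python
from typing import Any, List

try:
    from omegaconf import DictConfig

    MAPPING_TYPES: tuple = (dict, DictConfig)
except ImportError:  # fall back to plain dicts when omegaconf is unavailable
    MAPPING_TYPES = (dict,)


def as_dataset_list(entry: Any) -> List[Any]:
    """Wrap a single dataset config in a list; pass lists through unchanged."""
    if entry is None:
        # Assumption: a missing train/validation entry means "no datasets".
        return []
    if isinstance(entry, MAPPING_TYPES):
        return [entry]
    return list(entry)


data_config = {
    "train": {"dataset_name": "OpenMathInstruct-2"},  # single-dataset form
    "validation": [{"dataset_name": "AIME2024"}, {"dataset_name": "DAPOMathAIME2024"}],
}

for split in ("train", "validation"):
    data_config[split] = as_dataset_list(data_config[split])
    print(split, [cfg["dataset_name"] for cfg in data_config[split]])
```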

🧹 Nitpick comments (1)

nemo_rl/utils/config.py (1)

30-44: Consider supporting nested _override_ markers for consistency and future-proofing.

Currently, the function only processes top-level _override_ markers. While no nested markers are used today, a recursive approach would ensure consistent behavior if nested overrides are ever needed.

Suggested recursive handling

+def _apply_override_markers(
+    base_cfg: DictConfig, override_cfg: DictConfig
+) -> None:
+    for key, value in list(override_cfg.items()):
+        if isinstance(value, DictConfig):
+            if value.get("_override_", False):
+                value.pop("_override_")
+                base_cfg.pop(key, None)
+            else:
+                child_base = base_cfg.get(key)
+                if isinstance(child_base, DictConfig):
+                    _apply_override_markers(child_base, value)
+                else:
+                    _apply_override_markers(OmegaConf.create({}), value)
+
 def merge_with_override(
     base_config: DictConfig, override_config: DictConfig
 ) -> DictConfig:
     """Merge configs with support for _override_ marker to completely override sections."""
-    for key in list(override_config.keys()):
-        if isinstance(override_config[key], DictConfig):
-            if override_config[key].get("_override_", False):
-                # remove the _override_ marker
-                override_config[key].pop("_override_")
-                # remove the key from base_config so it won't be merged
-                if key in base_config:
-                    base_config.pop(key)
+    _apply_override_markers(base_config, override_config)
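
To make the effect of nested markers concrete, here is a plain-dict analogue of the suggested recursion. It is a sketch rather than the reviewer's code: it uses `dict` instead of `DictConfig`, the function name `apply_override_markers` and the sample values are hypothetical, and the regular merge that would follow is omitted.

```python
from typing import Any, Dict


def apply_override_markers(base: Dict[str, Any], override: Dict[str, Any]) -> None:
    """Strip `_override_` markers at any depth and drop the matching base subtrees."""
    for key, value in list(override.items()):
        if not isinstance(value, dict):
            continue
        if value.pop("_override_", False):
            # Marked subtree: remove the base version so it cannot be merged in.
            base.pop(key, None)
        else:
            # Unmarked subtree: keep descending; use a throwaway dict when the
            # base has no matching mapping, so markers are still stripped.
            child_base = base.get(key)
            apply_override_markers(
                child_base if isinstance(child_base, dict) else {}, value
            )


base = {"data": {"train": {"dataset_name": "OpenMathInstruct-2", "seed": 42}, "shuffle": True}}
override = {"data": {"train": {"_override_": True, "dataset_name": "DeepScaler"}}}

apply_override_markers(base, override)
print(base)      # the nested "train" subtree has been dropped from the base
print(override)  # the marker has been stripped from the override
```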

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Dec 23, 2025

yuki-97 force-pushed the yukih/multiple-dataset branch from ff87f2c to 5835ce7 Compare December 23, 2025 07:59

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 23, 2025

RayenTian force-pushed the yukih/multiple-dataset branch from 94e40a6 to c0b8cde Compare January 2, 2026 03:25

RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 2, 2026

yuki-97 force-pushed the yukih/split-train-val-dataset branch 2 times, most recently from 2a4cedd to 20f3a62 Compare January 8, 2026 15:47

yuki-97 force-pushed the yukih/multiple-dataset branch from d9836a6 to 8577efb Compare January 12, 2026 03:08

github-actions bot added the documentation Improvements or additions to documentation label Jan 12, 2026

yuki-97 mentioned this pull request Jan 13, 2026

Refactor dataset module for train #909

Open

yuki-97 force-pushed the yukih/split-train-val-dataset branch from f9def0d to ec862a3 Compare January 20, 2026 10:40

Base automatically changed from yukih/split-train-val-dataset to main January 22, 2026 00:17

yuki-97 force-pushed the yukih/multiple-dataset branch 2 times, most recently from 74a26c0 to a990378 Compare January 26, 2026 06:08

github-actions bot removed the documentation Improvements or additions to documentation label Jan 26, 2026

yuki-97 changed the title ~~[don't merge] support multiple datasets for response dataset~~ feat: support multiple datasets for response dataset Jan 26, 2026

yuki-97 marked this pull request as ready for review January 26, 2026 06:08

yuki-97 requested review from a team as code owners January 26, 2026 06:08

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 26, 2026

yuki-97 temporarily deployed to nemo-ci January 26, 2026 06:10 — with GitHub Actions Inactive

coderabbitai bot reviewed Jan 26, 2026

nemo_rl/data/utils.py (resolved)

tests/functional/grpo_multiple_datasets.sh (resolved)

yuki-97 requested a review from a team as a code owner February 2, 2026 05:20

github-actions bot added the documentation Improvements or additions to documentation label Feb 2, 2026

yuki-97 added 6 commits February 2, 2026 13:23

support multiple dataset

f8d1c29

Signed-off-by: Yuki Huang <yukih@nvidia.com>

support multiple dataset for sft

d5faa16

Signed-off-by: Yuki Huang <yukih@nvidia.com>

fix check null

97c9435

Signed-off-by: Yuki Huang <yukih@nvidia.com>

add override and add functional test

f404e91

Signed-off-by: Yuki Huang <yukih@nvidia.com>

add split_validation_size comment in grpo_multiple_datasets

655d6bc

Signed-off-by: Yuki Huang <yukih@nvidia.com>

update doc

c570a05

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 force-pushed the yukih/multiple-dataset branch from d1d8e05 to c570a05 Compare February 2, 2026 05:23

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 2, 2026

yuki-97 temporarily deployed to nemo-ci February 2, 2026 05:24 — with GitHub Actions Inactive

yuki-97 temporarily deployed to nemo-ci February 2, 2026 05:35 — with GitHub Actions Inactive

fix unit test

f52dd36

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 2, 2026

yuki-97 temporarily deployed to nemo-ci February 2, 2026 06:53 — with GitHub Actions Inactive

yuki-97 temporarily deployed to nemo-ci February 2, 2026 07:00 — with GitHub Actions Inactive

fix unit test of override

5143aad

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 2, 2026

yuki-97 temporarily deployed to nemo-ci February 2, 2026 11:45 — with GitHub Actions Inactive

yuki-97 temporarily deployed to nemo-ci February 2, 2026 11:51 — with GitHub Actions Inactive

yuki-97 temporarily deployed to nemo-ci February 2, 2026 16:12 — with GitHub Actions Inactive

terrykong approved these changes Feb 3, 2026

terrykong enabled auto-merge (squash) February 3, 2026 01:45

terrykong merged commit 27ba6a0 into main Feb 3, 2026
41 of 42 checks passed

terrykong deleted the yukih/multiple-dataset branch February 3, 2026 01:47

terrykong mentioned this pull request Feb 3, 2026

refactor: split train and val dataset in preference dataset #1763

Merged

yuki-97 mentioned this pull request Feb 4, 2026

feat: unify nemogym dataset #1807

Open

terrykong merged 8 commits into main from yukih/multiple-dataset
