
feat: unify nemogym dataset#1807

Open
yuki-97 wants to merge 10 commits into main from yukih/nemogym-dataset

Conversation


@yuki-97 yuki-97 commented Jan 22, 2026

Update run_grpo_nemo_gym.py to use the common util setup_response_data, so that it can also use the multiple datasets supported in #1691 and the multiple dataloaders that will be supported in #1698.

  1. Add NemoGymDataset and nemo_gym_data_processor to match the current NeMo-RL dataset structure.
  2. Rename setup_data_with_envs to setup_response_data, which supports skipping environment creation.
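The conditional return shape this gives the helper can be sketched with a minimal stub (illustrative only; the real setup_response_data builds processed datasets and environment handles, and the names below merely mirror the PR description):

```python
from typing import Any, Optional

def setup_response_data(
    tokenizer: Any, data_config: dict, env_configs: Optional[dict] = None
) -> tuple:
    """Illustrative stub: a 2-tuple when env_configs is None, a 4-tuple otherwise."""
    train_ds, val_ds = object(), object()  # stand-ins for processed datasets
    if env_configs is None:
        # Gym-style path: the environment entrypoint is created outside data setup
        return train_ds, val_ds
    # Standard path: one environment handle per configured env
    task_to_env = {name: f"env:{name}" for name in env_configs}
    return train_ds, val_ds, task_to_env, dict(task_to_env)

assert len(setup_response_data(None, {}, env_configs=None)) == 2
assert len(setup_response_data(None, {}, env_configs={"math": {}})) == 4
```

Callers that pass env_configs=None (as run_grpo_nemo_gym.py now does) must unpack only two values.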

Test Result
(screenshot of test results omitted)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added integrated Nemo Gym dataset and processor support for seamless environment integration
    • Optional environment configuration support in data setup flow
  • Refactor

    • Simplified data setup process with conditional environment handling
    • Reorganized configuration structure for improved dataset specification
    • Updated data loading workflow with optional environment binding
  • Tests

    • Updated test suite to support dataset-based processing approach

@RayenTian RayenTian force-pushed the yukih/nemogym-dataset branch from 6c9a653 to 39d699d on January 30, 2026 03:41
@RayenTian RayenTian added the CI:L1 (Run doctests, unit tests, and functional tests) label on Jan 30, 2026
@yuki-97 yuki-97 force-pushed the yukih/nemogym-dataset branch 3 times, most recently from 856a9cf to 971dc6e on February 4, 2026 09:12
@yuki-97 yuki-97 added and removed the CI:L1 (Run doctests, unit tests, and functional tests) label on Feb 4, 2026
@yuki-97 yuki-97 force-pushed the yukih/nemogym-dataset branch from 971dc6e to 9f310cd on February 4, 2026 09:17
@yuki-97 yuki-97 added and removed the CI:L1 (Run doctests, unit tests, and functional tests) label on Feb 4, 2026
RayenTian and others added 10 commits February 4, 2026 21:22
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/nemogym-dataset branch from 8cf62f7 to d1428ba on February 4, 2026 13:22
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 4, 2026
@yuki-97 yuki-97 marked this pull request as ready for review February 4, 2026 13:24
@yuki-97 yuki-97 requested review from a team as code owners February 4, 2026 13:24

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

This PR refactors the data setup pipeline by renaming setup_data_with_envs to setup_response_data with optional environment handling, introduces NemoGymDataset and nemo_gym_data_processor for structured NeMo Gym data loading, removes direct datum-spec conversion, and updates example scripts and tests to use the new API with environment registry integration.

Changes

Cohort / File(s) Summary
Data setup refactoring
nemo_rl/data/utils.py, nemo_rl/data/__init__.py
Renames setup_data_with_envs to setup_response_data, makes env_configs optional (default None), adds conditional return type logic (2-tuple when no envs, 4-tuple when envs provided), and updates DataConfig.max_input_seq_length type from int to int | None.
NemoGym dataset and processor
nemo_rl/data/datasets/response_datasets/nemogym_dataset.py, nemo_rl/data/datasets/response_datasets/__init__.py, nemo_rl/data/processors.py
Introduces NemoGymDataset class for loading JSONL data from file paths, registers it in DATASET_REGISTRY, adds nemo_gym_data_processor for converting dataset entries to DatumSpec format, and registers it in PROCESSOR_REGISTRY.
Environment utilities
nemo_rl/environments/utils.py, nemo_rl/environments/nemo_gym.py
Adds "nemo_gym" entry to ENV_REGISTRY and removes nemo_gym_example_to_nemo_rl_datum_spec function (conversion now handled by processor).
Example script updates
examples/nemo_gym/run_grpo_nemo_gym.py, examples/run_grpo.py, examples/run_distillation.py, examples/run_vlm_grpo.py
Updates imports and function calls to use setup_response_data instead of setup_data_with_envs, replaces NemoGym-specific initialization with generic create_env(env_name="nemo_gym", env_config=...).
Configuration updates
examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
Adds checkpoint directory configuration, restructures data specification with train/validation/default blocks, replaces legacy jsonl file paths, and adds NeMo Gym-specific environment configuration.
Test updates
tests/unit/experience/test_rollouts.py
Refactors NeMo Gym rollout tests to use NemoGymDataset for data loading and nemo_gym_data_processor for datum conversion instead of direct helper function.
Project configuration
pyrefly.toml
Adds nemogym_dataset.py to project includes and reorders sglang generation module entries.

Sequence Diagram(s)

sequenceDiagram
    participant Script as Example Script
    participant DataSetup as setup_response_data()
    participant DataLoader as NemoGymDataset
    participant Processor as nemo_gym_data_processor
    participant EnvSetup as create_env()
    participant Training as Training Loop

    Script->>DataSetup: Call with data_config, env_configs
    DataSetup->>DataLoader: Load dataset from JSONL path
    DataLoader-->>DataSetup: Return HuggingFace Dataset
    
    DataSetup->>Processor: Process each dataset entry
    Processor-->>DataSetup: Return DatumSpec with env_info
    
    alt env_configs provided
        DataSetup->>EnvSetup: create_env(env_name="nemo_gym")
        EnvSetup-->>DataSetup: Return environment interface
        DataSetup-->>Script: (train_dataset, val_dataset, task_to_env, val_task_to_env)
    else env_configs is None
        DataSetup-->>Script: (train_dataset, val_dataset)
    end
    
    Script->>Training: Start training with datasets and envs
    Training-->>Script: Training results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested labels

documentation

Suggested reviewers

  • terrykong
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: unify nemogym dataset' directly and clearly describes the main objective of the PR: unifying the NemoGym dataset interface to use the common setup_response_data utility.
Docstring Coverage ✅ Passed Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed PR includes new NemoGymDataset class and data processor with updated unit tests, plus metrics dashboard comparing feature branch against main branch baseline across training/validation reward and prompt length, demonstrating no regressions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

  • 136.113.208.247/32 (new)
  • 34.170.211.100/32
  • 35.222.179.152/32

Reviews will stop working after February 8, 2026 if the new IP is not added to your allowlist.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/nemo_gym/run_grpo_nemo_gym.py (1)

162-182: ⚠️ Potential issue | 🟠 Major

Handle the case where no validation dataset is produced.

setup_response_data(..., env_configs=None) can return val_dataset = None. The current code unconditionally calls len(val_dataset), which will raise. Please guard or fail fast with a clear error.

Suggested fix
     train_dataset, val_dataset = setup_response_data(
         tokenizer, config["data"], env_configs=None
     )
+    if val_dataset is None:
+        raise ValueError(
+            "Validation dataset is required for NeMo-Gym runs; please configure "
+            "data.validation or split_validation_size > 0."
+        )
🤖 Fix all issues with AI agents
In `@examples/nemo_gym/run_grpo_nemo_gym.py`:
- Around line 211-219: The current hardcoded task_to_env {"nemo_gym": nemo_gym}
can mismatch NemoGymDataset.task_name (derived via
"-".join(data_path.split("/")[-2:]).split(".")[0]) and break environment lookups
in rollouts.py; change the binding to use the dataset's task_name (obtain from
the NemoGymDataset instance used to create the env) as the key instead of
"nemo_gym", e.g., compute key = dataset.task_name and set task_to_env = {key:
nemo_gym} and mirror that for val_task_to_env so
run_async_nemo_gym_rollout/rollouts.py environment lookups succeed.
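A quick standalone check of the mismatch described above, using a hypothetical derive_task_name helper that replicates the quoted expression from NemoGymDataset:

```python
def derive_task_name(data_path: str) -> str:
    # Replicates the derivation quoted in the comment above:
    # join the last two path components, drop the extension, strip a leading "-"
    task_name = "-".join(data_path.split("/")[-2:]).split(".")[0]
    if task_name.startswith("-"):
        task_name = task_name[1:]
    return task_name

path = "3rdparty/Gym-workspace/Gym/data/workplace_assistant/train.jsonl"
key = derive_task_name(path)
assert key == "workplace_assistant-train"
# A task_to_env hardcoded as {"nemo_gym": env} would therefore miss this key.
```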

In `@nemo_rl/data/__init__.py`:
- Line 47: The code permits max_input_seq_length (alias max_seq_length) to be
None but downstream processors perform unsafe comparisons/arithmetic with it;
either add explicit None checks in every processor function that uses
max_input_seq_length (e.g., guard patterns in token/window truncate logic before
any "if length > max_input_seq_length" or "max_input_seq_length // ..."
operations) or enforce non-None at dataset construction by validating in
AllTaskProcessedDataset.__init__ (raise/config error if max_input_seq_length is
None) so processors can assume an int. Update references to
max_input_seq_length/max_seq_length in the processor functions and
AllTaskProcessedDataset to implement the chosen approach and add a clear error
message when rejecting None.

In `@nemo_rl/data/datasets/response_datasets/nemogym_dataset.py`:
- Line 1: Update the copyright header year from 2025 to 2026 in the file's
top-of-file comment (the existing line containing "Copyright (c) 2025, NVIDIA
CORPORATION.  All rights reserved."); replace "2025" with "2026" so the header
reads "Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved."
- Around line 23-31: Add a short docstring to the __init__ method describing
parameters (data_path: path to jsonl file, repeat: repetition count) and what
attributes are created (task_name and dataset), and silence the unused kwargs
lint by either renaming kwargs to _kwargs or explicitly consuming it (e.g., _ =
kwargs) or adding a comment like # noqa: F401 after kwargs; update the __init__
signature reference in the docstring and mention task_name and dataset so
reviewers can find the code (look for __init__, task_name, dataset, kwargs).

In `@nemo_rl/data/processors.py`:
- Around line 667-684: In nemo_gym_data_processor, silence the unused-argument
warnings by referencing task_data_spec, tokenizer, and max_seq_length (e.g.,
assign them to a throwaway variable or use them in a no-op) and change the fake
message_log token_ids creation from torch.tensor([]) to an empty integer tensor
(torch.tensor([], dtype=torch.long)) so token IDs use an integer dtype; update
the "message_log" entry creation in the function accordingly.

In `@nemo_rl/data/utils.py`:
- Around line 99-100: The code assumes cfg["env_name"] exists when env_configs
is provided (see variables has_envs, envs, task_to_env and cfg), which can raise
KeyError; update the logic to validate each dataset config early: after
extracting env names from env_configs, check every cfg in the dataset loop and
if has_envs is True and "env_name" is missing raise a clear ValueError (or
KeyError with a descriptive message) indicating that env_name is required when
using env_configs and include the dataset identifier in the message;
alternatively, wrap the access to cfg["env_name"] with a check and provide the
same descriptive error before assigning into task_to_env.

In `@tests/unit/experience/test_rollouts.py`:
- Around line 800-814: The temp file created with
tempfile.NamedTemporaryFile(..., delete=False) is not removed, causing
accumulation; change the test to either create the temp file with delete=True
and instantiate NemoGymDataset(data_path) inside the with block so the file is
read before it's auto-deleted, or keep delete=False but ensure explicit cleanup
(os.remove(data_path)) in a finally/teardown; locate the creation site
(tempfile.NamedTemporaryFile), the variable data_path, and where NemoGymDataset
is constructed (NemoGymDataset(data_path)) and apply one of these fixes.
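The cleanup variant with delete=False plus a finally block can be sketched in isolation (paths and contents here are illustrative, not taken from the actual test suite):

```python
import json
import os
import tempfile

rows = [{"prompt": "hi"}, {"prompt": "bye"}]
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False)
try:
    for row in rows:
        tmp.write(json.dumps(row) + "\n")
    tmp.close()  # flush and release the handle before reopening
    with open(tmp.name) as f:
        raw_lines = [line for line in f]  # same line-by-line read as NemoGymDataset
    assert len(raw_lines) == 2
finally:
    os.remove(tmp.name)  # explicit cleanup prevents temp files from accumulating

assert not os.path.exists(tmp.name)
```

Keeping delete=False is the portable choice (Windows cannot reopen a still-open NamedTemporaryFile), so the explicit os.remove in finally is the part the test is missing.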
🧹 Nitpick comments (3)
examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml (2)

46-48: Derive checkpoint_dir from logger.log_dir to avoid collisions.

Hardcoding a shared directory risks overlapping runs. Consider scoping checkpoints to the run log directory.

♻️ Suggested tweak
 checkpointing:
   enabled: true
-  checkpoint_dir: "results/grpo"
+  checkpoint_dir: "${logger.log_dir}/checkpoints"

237-242: Consider parameterizing local dataset paths.

The hardcoded 3rdparty/... paths are brittle across machines/CI. Using env overrides makes the config portable.

♻️ Example (env‑override with defaults)
 train:
-  data_path: 3rdparty/Gym-workspace/Gym/data/workplace_assistant/train.jsonl
+  data_path: ${oc.env:WORKPLACE_ASSISTANT_TRAIN_JSONL,"3rdparty/Gym-workspace/Gym/data/workplace_assistant/train.jsonl"}
 validation:
-  data_path: 3rdparty/Gym-workspace/Gym/data/workplace_assistant/validation.jsonl
+  data_path: ${oc.env:WORKPLACE_ASSISTANT_VALID_JSONL,"3rdparty/Gym-workspace/Gym/data/workplace_assistant/validation.jsonl"}
nemo_rl/data/utils.py (1)

34-47: Consider using @overload for clearer return type discrimination.

The Union return type makes it harder for static type checkers and callers to know which tuple shape they'll receive. Using @typing.overload would provide better type safety and IDE support.

♻️ Optional refactor using overload
from typing import overload, Literal

`@overload`
def setup_response_data(
    tokenizer: AutoProcessor | AutoTokenizer,
    data_config: DataConfig,
    env_configs: None = None,
    is_vlm: bool = False,
) -> tuple[AllTaskProcessedDataset, Optional[AllTaskProcessedDataset]]: ...

`@overload`
def setup_response_data(
    tokenizer: AutoProcessor | AutoTokenizer,
    data_config: DataConfig,
    env_configs: dict[str, Any],
    is_vlm: bool = False,
) -> tuple[
    AllTaskProcessedDataset,
    Optional[AllTaskProcessedDataset],
    dict[str, EnvironmentInterface],
    dict[str, EnvironmentInterface],
]: ...

def setup_response_data(
    tokenizer: AutoProcessor | AutoTokenizer,
    data_config: DataConfig,
    env_configs: Optional[dict[str, Any]] = None,
    is_vlm: bool = False,
) -> Union[...]:
    # implementation unchanged


class DataConfig(TypedDict):
max_input_seq_length: int
max_input_seq_length: int | None

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's find the file and inspect the TypedDict definition
find . -name "__init__.py" -path "*/nemo_rl/data/*" -type f

# Then search for AllTaskProcessedDataset class and max_seq_length/max_input_seq_length usage
rg -n "class AllTaskProcessedDataset|max_seq_length|max_input_seq_length" -g'*.py' -C 3

Repository: NVIDIA-NeMo/RL

Length of output: 28206


🏁 Script executed:

# Check AllTaskProcessedDataset.__init__ for validation or safeguards
sed -n '46,70p' nemo_rl/data/datasets/processed_dataset.py

# Check example config files to see how max_input_seq_length is set
find . -name "*.yaml" -path "*/examples/configs/*" -type f | head -10
rg "max_input_seq_length" -g'*.yaml' -A 2 -B 2

Repository: NVIDIA-NeMo/RL

Length of output: 17681


Add None checks in task data processors or validate max_seq_length at dataset initialization.

The type definitions allow max_seq_length: int | None, but processor functions perform unsafe comparisons and arithmetic operations (e.g., if length > max_seq_length, max_seq_length // len(...)) without None guards. At least one example config (nemogym) explicitly sets max_input_seq_length: null, confirming that None can flow to AllTaskProcessedDataset. Either add explicit None checks in all processor functions before using max_seq_length, or validate and reject None during AllTaskProcessedDataset.__init__ if this field is required.

🤖 Prompt for AI Agents
In `@nemo_rl/data/__init__.py` at line 47, The code permits max_input_seq_length
(alias max_seq_length) to be None but downstream processors perform unsafe
comparisons/arithmetic with it; either add explicit None checks in every
processor function that uses max_input_seq_length (e.g., guard patterns in
token/window truncate logic before any "if length > max_input_seq_length" or
"max_input_seq_length // ..." operations) or enforce non-None at dataset
construction by validating in AllTaskProcessedDataset.__init__ (raise/config
error if max_input_seq_length is None) so processors can assume an int. Update
references to max_input_seq_length/max_seq_length in the processor functions and
AllTaskProcessedDataset to implement the chosen approach and add a clear error
message when rejecting None.
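The fail-fast option can be sketched as a small validator (validate_max_seq_length is a hypothetical helper, not part of this PR; the real fix would live in AllTaskProcessedDataset.__init__ or the processor functions):

```python
from typing import Optional

def validate_max_seq_length(max_input_seq_length: Optional[int]) -> int:
    # Reject None at construction time so processors can safely assume an int;
    # otherwise comparisons like `length > max_input_seq_length` raise TypeError.
    if max_input_seq_length is None:
        raise ValueError(
            "max_input_seq_length is required for this task but was None; "
            "set data.max_input_seq_length in the config."
        )
    return max_input_seq_length

assert validate_max_seq_length(8192) == 8192
```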

Comment on lines +23 to +31
def __init__(self, data_path: str, repeat: int = 1, **kwargs) -> None:
    self.task_name = "-".join(data_path.split("/")[-2:]).split(".")[0]
    if self.task_name[0] == "-":
        self.task_name = self.task_name[1:]

    # load raw line from jsonl
    # will use `json.loads` to load to dict format at `nemo_gym_data_processor` later
    # since `Dataset` cannot handle nested structure well
    with open(data_path) as f:
        self.dataset = [raw_line for raw_line in f]

⚠️ Potential issue | 🟡 Minor

Document __init__ and silence the unused kwargs lint.

Suggested fix
-    def __init__(self, data_path: str, repeat: int = 1, **kwargs) -> None:
+    def __init__(self, data_path: str, repeat: int = 1, **kwargs) -> None:
+        """Initialize the Nemo Gym dataset.
+
+        Args:
+            data_path: Path to the JSONL data file.
+            repeat: Number of times to repeat the dataset.
+            **kwargs: Unused extra args for RawDataset compatibility.
+        """
+        _ = kwargs
         self.task_name = "-".join(data_path.split("/")[-2:]).split(".")[0]
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 23-23: Unused method argument: kwargs

(ARG002)

🤖 Prompt for AI Agents
In `@nemo_rl/data/datasets/response_datasets/nemogym_dataset.py` around lines 23 -
31, Add a short docstring to the __init__ method describing parameters
(data_path: path to jsonl file, repeat: repetition count) and what attributes
are created (task_name and dataset), and silence the unused kwargs lint by
either renaming kwargs to _kwargs or explicitly consuming it (e.g., _ = kwargs)
or adding a comment like # noqa: F401 after kwargs; update the __init__
signature reference in the docstring and mention task_name and dataset so
reviewers can find the code (look for __init__, task_name, dataset, kwargs).

# Train: 1129 samples, Validation: 126 samples
train_jsonl_fpath: 3rdparty/Gym-workspace/Gym/data/workplace_assistant/train.jsonl
validation_jsonl_fpath: 3rdparty/Gym-workspace/Gym/data/workplace_assistant/validation.jsonl
max_input_seq_length: null # nemogym dataset doesn't use this parameter

@bxyu-nvidia @yuki-97 Where does the truncation happen in this case? vLLM has a max length that should prevent generating beyond the max_length set in the generation config, but does Gym know to respect a max sequence length when tacking on an environment or tool output?

Contributor Author

I'm curious as well. I guess truncation would have to happen inside Gym, or the max length simply isn't enforced (I saw lots of warnings like the one below when using the Gym env).

ERROR 02-03 00:50:30 [serving_chat.py:257] ValueError: This model's maximum context length is 8192 tokens. However, your request has 9691 input tokens. Please reduce the length of the input messages.

Since NeMo-RL now has to pass the raw data (a string) directly to the Gym env and let it handle everything, there is no way for NeMo-RL to get the token_ids and enforce max_length itself.

     tokenizer: Tokenizer or processor.
     data_config: Data config.
-    env_configs: Environment configs.
+    env_configs: Environment configs. If None, will not create environments.

Maybe worth adding to the docstring when to set this to None, which would be for cases like Gym where there's a single environment entrypoint handled outside of the datasets.

print(
    f"Setting `grpo.max_val_samples` and `grpo.val_batch_size` to the length of the validation dataset, which is {len(val_dataset)}"
)
config["grpo"]["max_val_samples"] = len(val_dataset)

I guess val_dataset can be None? So maybe we should guard this?


Labels

CI:L1 Run doctests, unit tests, and functional tests

3 participants