Conversation
📝 Walkthrough

This PR refactors the data setup pipeline by renaming `setup_data_with_envs` to `setup_response_data`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Script as Example Script
    participant DataSetup as setup_response_data()
    participant DataLoader as NemoGymDataset
    participant Processor as nemo_gym_data_processor
    participant EnvSetup as create_env()
    participant Training as Training Loop
    Script->>DataSetup: Call with data_config, env_configs
    DataSetup->>DataLoader: Load dataset from JSONL path
    DataLoader-->>DataSetup: Return HuggingFace Dataset
    DataSetup->>Processor: Process each dataset entry
    Processor-->>DataSetup: Return DatumSpec with env_info
    alt env_configs provided
        DataSetup->>EnvSetup: create_env(env_name="nemo_gym")
        EnvSetup-->>DataSetup: Return environment interface
        DataSetup-->>Script: (train_dataset, val_dataset, task_to_env, val_task_to_env)
    else env_configs is None
        DataSetup-->>Script: (train_dataset, val_dataset)
    end
    Script->>Training: Start training with datasets and envs
    Training-->>Script: Training results
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed checks (4 passed)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/nemo_gym/run_grpo_nemo_gym.py (1)
162-182: ⚠️ Potential issue | 🟠 Major

Handle the case where no validation dataset is produced.

`setup_response_data(..., env_configs=None)` can return `val_dataset = None`. The current code unconditionally calls `len(val_dataset)`, which will raise. Please guard or fail fast with a clear error.

Suggested fix
```diff
-    train_dataset, val_dataset = setup_response_data(
-        tokenizer, config["data"], env_configs=None
-    )
+    train_dataset, val_dataset = setup_response_data(
+        tokenizer, config["data"], env_configs=None
+    )
+    if val_dataset is None:
+        raise ValueError(
+            "Validation dataset is required for NeMo-Gym runs; please configure "
+            "data.validation or split_validation_size > 0."
+        )
@@
-    print(
-        f"Setting `grpo.max_val_samples` and `grpo.val_batch_size` to the length of the validation dataset, which is {len(val_dataset)}"
-    )
+    print(
+        f"Setting `grpo.max_val_samples` and `grpo.val_batch_size` to the length of the validation dataset, which is {len(val_dataset)}"
+    )
```
🤖 Fix all issues with AI agents
In `@examples/nemo_gym/run_grpo_nemo_gym.py`:
- Around line 211-219: The current hardcoded task_to_env {"nemo_gym": nemo_gym}
can mismatch NemoGymDataset.task_name (derived via
"-".join(data_path.split("/")[-2:]).split(".")[0]) and break environment lookups
in rollouts.py; change the binding to use the dataset's task_name (obtain from
the NemoGymDataset instance used to create the env) as the key instead of
"nemo_gym", e.g., compute key = dataset.task_name and set task_to_env = {key:
nemo_gym} and mirror that for val_task_to_env so
run_async_nemo_gym_rollout/rollouts.py environment lookups succeed.
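A minimal sketch of that binding, assuming the setup code holds the `NemoGymDataset` instance and the created env in variables named `nemo_gym_dataset` and `nemo_gym` (both names illustrative):

```python
# Key the env maps by the dataset-derived task name rather than a hardcoded
# "nemo_gym" string, so the lookup in rollouts.py matches the DatumSpec.
task_key = nemo_gym_dataset.task_name  # e.g. "workplace_assistant-train"
task_to_env = {task_key: nemo_gym}
val_task_to_env = {task_key: nemo_gym}
```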
In `@nemo_rl/data/__init__.py`:
- Line 47: The code permits max_input_seq_length (alias max_seq_length) to be
None but downstream processors perform unsafe comparisons/arithmetic with it;
either add explicit None checks in every processor function that uses
max_input_seq_length (e.g., guard patterns in token/window truncate logic before
any "if length > max_input_seq_length" or "max_input_seq_length // ..."
operations) or enforce non-None at dataset construction by validating in
AllTaskProcessedDataset.__init__ (raise/config error if max_input_seq_length is
None) so processors can assume an int. Update references to
max_input_seq_length/max_seq_length in the processor functions and
AllTaskProcessedDataset to implement the chosen approach and add a clear error
message when rejecting None.
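A minimal sketch of the fail-fast variant, assuming `max_seq_length` is a keyword argument of `AllTaskProcessedDataset.__init__` (the real signature may differ):

```python
class AllTaskProcessedDataset:
    def __init__(self, dataset, tokenizer, task_data_spec,
                 task_data_processors, max_seq_length=None):
        # Reject None up front so every processor can assume an int.
        if max_seq_length is None:
            raise ValueError(
                "max_input_seq_length must be an int, got None; set "
                "data.max_input_seq_length in your config."
            )
        self.max_seq_length = max_seq_length
```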
In `@nemo_rl/data/datasets/response_datasets/nemogym_dataset.py`:
- Line 1: Update the copyright header year from 2025 to 2026 in the file's
top-of-file comment (the existing line containing "Copyright (c) 2025, NVIDIA
CORPORATION. All rights reserved."); replace "2025" with "2026" so the header
reads "Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved."
- Around line 23-31: Add a short docstring to the __init__ method describing
parameters (data_path: path to jsonl file, repeat: repetition count) and what
attributes are created (task_name and dataset), and silence the unused kwargs
lint by either renaming kwargs to _kwargs or explicitly consuming it (e.g., _ =
kwargs) or adding a comment like # noqa: ARG002 after kwargs; update the __init__
signature reference in the docstring and mention task_name and dataset so
reviewers can find the code (look for __init__, task_name, dataset, kwargs).
In `@nemo_rl/data/processors.py`:
- Around line 667-684: In nemo_gym_data_processor, silence the unused-argument
warnings by referencing task_data_spec, tokenizer, and max_seq_length (e.g.,
assign them to a throwaway variable or use them in a no-op) and change the fake
message_log token_ids creation from torch.tensor([]) to an empty integer tensor
(torch.tensor([], dtype=torch.long)) so token IDs use an integer dtype; update
the "message_log" entry creation in the function accordingly.
In `@nemo_rl/data/utils.py`:
- Around line 99-100: The code assumes cfg["env_name"] exists when env_configs
is provided (see variables has_envs, envs, task_to_env and cfg), which can raise
KeyError; update the logic to validate each dataset config early: after
extracting env names from env_configs, check every cfg in the dataset loop and
if has_envs is True and "env_name" is missing raise a clear ValueError (or
KeyError with a descriptive message) indicating that env_name is required when
using env_configs and include the dataset identifier in the message;
alternatively, wrap the access to cfg["env_name"] with a check and provide the
same descriptive error before assigning into task_to_env.
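A sketch of that early validation; `dataset_configs` is a stand-in for however the loop obtains each dataset's `cfg`:

```python
has_envs = env_configs is not None
for name, cfg in dataset_configs.items():
    if has_envs and "env_name" not in cfg:
        # Fail fast with a descriptive error instead of a bare KeyError later.
        raise ValueError(
            f"Dataset '{name}' is missing 'env_name', which is required "
            "when env_configs is provided."
        )
    if has_envs:
        # Mirrors the existing task_to_env assignment described above.
        task_to_env[name] = envs[cfg["env_name"]]
```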
In `@tests/unit/experience/test_rollouts.py`:
- Around line 800-814: The temp file created with
tempfile.NamedTemporaryFile(..., delete=False) is not removed, causing
accumulation; change the test to either create the temp file with delete=True
and instantiate NemoGymDataset(data_path) inside the with block so the file is
read before it's auto-deleted, or keep delete=False but ensure explicit cleanup
(os.remove(data_path)) in a finally/teardown; locate the creation site
(tempfile.NamedTemporaryFile), the variable data_path, and where NemoGymDataset
is constructed (NemoGymDataset(data_path)) and apply one of these fixes.
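The explicit-cleanup variant could look like this (payload contents are placeholders):

```python
import os
import tempfile

from nemo_rl.data.datasets.response_datasets.nemogym_dataset import NemoGymDataset


def test_nemogym_dataset_reads_jsonl():
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
        f.write('{"responses_api_request": {}}\n')  # placeholder payload
        data_path = f.name
    try:
        dataset = NemoGymDataset(data_path)
        assert len(dataset.dataset) == 1
    finally:
        os.remove(data_path)  # ensure the temp file never accumulates
```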
🧹 Nitpick comments (3)
examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml (2)
46-48: Derive `checkpoint_dir` from `logger.log_dir` to avoid collisions. Hardcoding a shared directory risks overlapping runs. Consider scoping checkpoints to the run log directory.
♻️ Suggested tweak
```diff
 checkpointing:
   enabled: true
-  checkpoint_dir: "results/grpo"
+  checkpoint_dir: "${logger.log_dir}/checkpoints"
```
237-242: Consider parameterizing local dataset paths. The hardcoded `3rdparty/...` paths are brittle across machines/CI. Using env overrides makes the config portable.

♻️ Example (env-override with defaults)
```diff
 train:
-  data_path: 3rdparty/Gym-workspace/Gym/data/workplace_assistant/train.jsonl
+  data_path: ${oc.env:WORKPLACE_ASSISTANT_TRAIN_JSONL,"3rdparty/Gym-workspace/Gym/data/workplace_assistant/train.jsonl"}
 validation:
-  data_path: 3rdparty/Gym-workspace/Gym/data/workplace_assistant/validation.jsonl
+  data_path: ${oc.env:WORKPLACE_ASSISTANT_VALID_JSONL,"3rdparty/Gym-workspace/Gym/data/workplace_assistant/validation.jsonl"}
```

nemo_rl/data/utils.py (1)
34-47: Consider using `@overload` for clearer return type discrimination. The `Union` return type makes it harder for static type checkers and callers to know which tuple shape they'll receive. Using `@typing.overload` would provide better type safety and IDE support.

♻️ Optional refactor using overload
```python
from typing import overload, Literal

@overload
def setup_response_data(
    tokenizer: AutoProcessor | AutoTokenizer,
    data_config: DataConfig,
    env_configs: None = None,
    is_vlm: bool = False,
) -> tuple[AllTaskProcessedDataset, Optional[AllTaskProcessedDataset]]: ...

@overload
def setup_response_data(
    tokenizer: AutoProcessor | AutoTokenizer,
    data_config: DataConfig,
    env_configs: dict[str, Any],
    is_vlm: bool = False,
) -> tuple[
    AllTaskProcessedDataset,
    Optional[AllTaskProcessedDataset],
    dict[str, EnvironmentInterface],
    dict[str, EnvironmentInterface],
]: ...

def setup_response_data(
    tokenizer: AutoProcessor | AutoTokenizer,
    data_config: DataConfig,
    env_configs: Optional[dict[str, Any]] = None,
    is_vlm: bool = False,
) -> Union[...]:  # implementation unchanged
```
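For reference, a usage sketch of the two call shapes the overloads would let type checkers distinguish (not code from this PR):

```python
# With env_configs: four-tuple including per-task env maps.
train_ds, val_ds, task_to_env, val_task_to_env = setup_response_data(
    tokenizer, data_config, env_configs
)

# With env_configs=None (e.g. the NeMo Gym path): plain two-tuple, no envs created.
train_ds, val_ds = setup_response_data(tokenizer, data_config, env_configs=None)
```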
```diff
 class DataConfig(TypedDict):
-    max_input_seq_length: int
+    max_input_seq_length: int | None
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's find the file and inspect the TypedDict definition
find . -name "__init__.py" -path "*/nemo_rl/data/*" -type f

# Then search for AllTaskProcessedDataset class and max_seq_length/max_input_seq_length usage
rg -n "class AllTaskProcessedDataset|max_seq_length|max_input_seq_length" -g'*.py' -C 3
```

Repository: NVIDIA-NeMo/RL
Length of output: 28206
🏁 Script executed:
```shell
# Check AllTaskProcessedDataset.__init__ for validation or safeguards
sed -n '46,70p' nemo_rl/data/datasets/processed_dataset.py

# Check example config files to see how max_input_seq_length is set
find . -name "*.yaml" -path "*/examples/configs/*" -type f | head -10
rg "max_input_seq_length" -g'*.yaml' -A 2 -B 2
```

Repository: NVIDIA-NeMo/RL
Length of output: 17681
Add `None` checks in task data processors or validate `max_seq_length` at dataset initialization.

The type definitions allow `max_seq_length: int | None`, but processor functions perform unsafe comparisons and arithmetic operations (e.g., `if length > max_seq_length`, `max_seq_length // len(...)`) without `None` guards. At least one example config (nemogym) explicitly sets `max_input_seq_length: null`, confirming that `None` can flow to `AllTaskProcessedDataset`. Either add explicit `None` checks in all processor functions before using `max_seq_length`, or validate and reject `None` during `AllTaskProcessedDataset.__init__` if this field is required.
🤖 Prompt for AI Agents
In `@nemo_rl/data/__init__.py` at line 47, The code permits max_input_seq_length
(alias max_seq_length) to be None but downstream processors perform unsafe
comparisons/arithmetic with it; either add explicit None checks in every
processor function that uses max_input_seq_length (e.g., guard patterns in
token/window truncate logic before any "if length > max_input_seq_length" or
"max_input_seq_length // ..." operations) or enforce non-None at dataset
construction by validating in AllTaskProcessedDataset.__init__ (raise/config
error if max_input_seq_length is None) so processors can assume an int. Update
references to max_input_seq_length/max_seq_length in the processor functions and
AllTaskProcessedDataset to implement the chosen approach and add a clear error
message when rejecting None.
```python
def __init__(self, data_path: str, repeat: int = 1, **kwargs) -> None:
    self.task_name = "-".join(data_path.split("/")[-2:]).split(".")[0]
    if self.task_name[0] == "-":
        self.task_name = self.task_name[1:]

    # load raw line from jsonl
    # will use `json.loads` to load to dict format at `nemo_gym_data_processor` later since `Dataset` cannot handle nested structure well
    with open(data_path) as f:
        self.dataset = [raw_line for raw_line in f]
```
Document `__init__` and silence the unused `kwargs` lint.
Suggested fix
```diff
-    def __init__(self, data_path: str, repeat: int = 1, **kwargs) -> None:
+    def __init__(self, data_path: str, repeat: int = 1, **kwargs) -> None:
+        """Initialize the Nemo Gym dataset.
+
+        Args:
+            data_path: Path to the JSONL data file.
+            repeat: Number of times to repeat the dataset.
+            **kwargs: Unused extra args for RawDataset compatibility.
+        """
+        _ = kwargs
         self.task_name = "-".join(data_path.split("/")[-2:]).split(".")[0]
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 23-23: Unused method argument: kwargs
(ARG002)
🤖 Prompt for AI Agents
In `@nemo_rl/data/datasets/response_datasets/nemogym_dataset.py` around lines 23 -
31, Add a short docstring to the __init__ method describing parameters
(data_path: path to jsonl file, repeat: repetition count) and what attributes
are created (task_name and dataset), and silence the unused kwargs lint by
either renaming kwargs to _kwargs or explicitly consuming it (e.g., _ = kwargs)
or adding a comment like # noqa: ARG002 after kwargs; update the __init__
signature reference in the docstring and mention task_name and dataset so
reviewers can find the code (look for __init__, task_name, dataset, kwargs).
```yaml
# Train: 1129 samples, Validation: 126 samples
train_jsonl_fpath: 3rdparty/Gym-workspace/Gym/data/workplace_assistant/train.jsonl
validation_jsonl_fpath: 3rdparty/Gym-workspace/Gym/data/workplace_assistant/validation.jsonl
max_input_seq_length: null  # nemogym dataset doesn't use this parameter
```
@bxyu-nvidia @yuki-97 where does the truncation happen in this case? vllm will have max length which should prevent generating beyond that max_length set in the generation config, but does gym know to respect a max seqlength if tacking on an environment or tool output?
I'm curious as well, and I guess it should be handled in gym, or the max length just isn't measured at all (I saw lots of the warnings below when using the gym env).

```
ERROR 02-03 00:50:30 [serving_chat.py:257] ValueError: This model's maximum context length is 8192 tokens. However, your request has 9691 input tokens. Please reduce the length of the input messages.
```

Since gym needs the raw data (string) passed directly to the gym env so that it can handle everything itself, there's currently no way for NeMo-RL to get its token_ids and enforce the max length.
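One possible client-side mitigation, sketched here as an assumption rather than an existing NeMo-RL or Gym API: tokenize the raw chat payload once, purely to count tokens, before handing it to the env.

```python
# Hypothetical pre-flight guard against the 8192-token limit seen above.
MAX_CONTEXT_TOKENS = 8192

def fits_context(tokenizer, messages, budget=MAX_CONTEXT_TOKENS):
    # apply_chat_template(..., tokenize=True) returns the prompt token ids.
    token_ids = tokenizer.apply_chat_template(messages, tokenize=True)
    return len(token_ids) <= budget
```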
```diff
         tokenizer: Tokenizer or processor.
         data_config: Data config.
-        env_configs: Environment configs.
+        env_configs: Environment configs. If None, will not create environments.
```
maybe worth adding to the docstring when to set this to None, which would be for cases like gym where there's a single environment entrypoint handled outside of the datasets
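Something like the following wording could work (illustrative, signature abbreviated):

```python
def setup_response_data(tokenizer, data_config, env_configs=None, is_vlm=False):
    """Set up train/validation response datasets.

    Args:
        env_configs: Environment configs. If None, no environments are
            created; use None for cases like NeMo Gym, where a single
            environment entrypoint is handled outside of the datasets.
    """
```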
```python
print(
    f"Setting `grpo.max_val_samples` and `grpo.val_batch_size` to the length of the validation dataset, which is {len(val_dataset)}"
)
config["grpo"]["max_val_samples"] = len(val_dataset)
```
i guess val dataset can be none? so maybe we should guard this?
- Update `run_grpo_nemo_gym.py` to use the common util `setup_response_data`, so that it can also use the multiple datasets supported in #1691 and the multiple dataloaders which will be supported in #1698.
- Refactor `NemoGymDataset` and `nemo_gym_data_processor` to match the current NeMo-RL dataset structure.
- Rename `setup_data_with_envs` to `setup_response_data`, which supports not creating envs.

Test Result
