
Implement forced lag in RL#3517

Merged
tdene merged 37 commits into NVIDIA:main from tdene:tde/rl_forced_lag
Mar 19, 2026

Conversation

Contributor

@tdene tdene commented Feb 20, 2026

What does this PR do ?

⚠️ For major changes (either in lines of code or in impact), please make sure to first share a design doc with the team. If you're unsure of the best way to do so, contact the @mcore-oncall.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see the Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.


copy-pr-bot bot commented Feb 20, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


group_index = yielded_groups
yielded_groups += 1
for rollout in group:
    rollout.submission_index = group_index
Contributor

nit: remove group_index and do rollout.submission_index = yielded_groups - 1?

Contributor Author

Done!
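For illustration, the simplified bookkeeping might look like the following sketch. The standalone names (Rollout, tag_groups) are hypothetical; the PR's actual generator pulls groups from the inference interface.

```python
# Sketch of the simplified counter bookkeeping: no separate group_index
# variable, the submission index is derived from the running yield count.
class Rollout:
    def __init__(self):
        self.submission_index = None

def tag_groups(groups):
    """Tag every rollout with the zero-based index of its group's submission."""
    yielded_groups = 0
    for group in groups:
        yielded_groups += 1
        for rollout in group:
            rollout.submission_index = yielded_groups - 1
        yield group
```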

def get_rollout_generator(args, inference_interface, n_prompts, samples_per_group):
    global _ROLLOUT_GENERATOR
    if not args.rl_partial_rollouts or _ROLLOUT_GENERATOR is None:
        oversubscribed = args.rl_partial_rollouts or args.rl_forced_lag > 0
Contributor

Could you explain why you are calling this 'oversubscribed'?

Contributor Author

I could not think of a better word at the time. It's regarding the concept of submitting more requests than you actually need.

Contributor

Yeah, that's why I asked! partial rollouts or forced lag are not the only cases when we sample more than we consume. Sending a request when batch_size < group_size * prompts will also do this.

Contributor Author

It's now called streaming.
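A minimal sketch of the cached-generator behavior this thread settled on, with illustrative arguments (a factory callable standing in for the real generator construction): in streaming mode the module-level generator is reused across training steps; otherwise a fresh one is built on every call.

```python
# Module-level cache for the rollout generator (illustrative names).
_ROLLOUT_GENERATOR = None

def get_rollout_generator(make_generator, streaming):
    global _ROLLOUT_GENERATOR
    # Streaming reuses the cached generator so in-flight rollouts survive
    # across steps; non-streaming rebuilds it every time.
    if not streaming or _ROLLOUT_GENERATOR is None:
        _ROLLOUT_GENERATOR = make_generator()
    return _ROLLOUT_GENERATOR
```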

assert args.micro_batch_size == 1, \
    "micro_batch_size must be 1 when using sequence packing. To increase compute per micro batch increase the sequence length."
assert rl.forced_lag > 0 or not args.rl_partial_rollouts, (
    "--rl-forced-lag and --rl-partial-rollouts are incompatible."
Contributor

nice!

buffered_rollouts is None or
iteration == runtime_state.last_collection_iteration +
runtime_state.data_iterator is None or
iteration >= runtime_state.last_collection_iteration +
Contributor

Why >=?

Contributor Author

I think this is now entirely stale code.

RerunDataIterator for the current training step
"""
runtime_state = get_rl_runtime_state()
args = get_args()
Contributor

Please send the argument as a function arg; do not call get_args() here.

Contributor Author

Addressed!

):
    if forced_lag > 0:
        runtime_state.lag_buffer.append(
            get_environment_rollouts(
Contributor

This can be above the if forced_lag branch.

Contributor Author

I think this is now entirely stale code.

                model, inference_model, optimizer, grpo_prompts_per_step, grpo_group_size,
            )
        )
    rollouts = runtime_state.lag_buffer.popleft()
Contributor

Will be useful/interesting to track the length of the lag buffer. Is it going to increase over time?

Contributor

Also! Related to the staleness tracking, without the plotting fix, the staleness will be reported incorrectly even more.
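As a hedged sketch of the suggestion above (hypothetical names; in the PR the rollouts come from the environment, and the depth would go to the run's metrics): with a fixed forced lag, logging the deque's length should show a plateau rather than unbounded growth.

```python
from collections import deque

# One training step against a lag buffer: append the freshly started rollouts,
# record the buffer depth, and consume the oldest (most lagged) entry.
def step_with_lag(lag_buffer, new_rollouts, depth_log):
    lag_buffer.append(new_rollouts)
    depth_log.append(len(lag_buffer))  # in practice: report to training metrics
    return lag_buffer.popleft()
```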


runtime_state.reset_iteration_counters(iteration)
return runtime_state.data_iterator
return runtime_state.data_iterator
Contributor

With those changes, we need to run tests for these scenarios.

  • Normal, GRPO_ITERATIONS=1 sample X, consume X (batch size = prompts * group_size)
  • sample 2X, consume X batch size = prompts * group_size / 2
  • your newly added lag thing. Not sure if we need it in both scenarios above.

if runtime_state.start_iteration is None:
    runtime_state.start_iteration = iteration
if forced_lag > 0:
    runtime_state.lag_buffer = deque()
Contributor

For me this is the kind of stuff `__init__()` is for. Why do we need all this logic here?

Contributor Author

I think this is now entirely stale code.

@tdene tdene force-pushed the tde/rl_forced_lag branch 3 times, most recently from 241f1bf to 95815f4 on March 4, 2026 10:09
@tdene tdene force-pushed the tde/rl_forced_lag branch from fe5d072 to edb5149 on March 4, 2026 20:32
@tdene tdene marked this pull request as ready for review March 4, 2026 20:32
@tdene tdene requested a review from a team as a code owner March 4, 2026 20:32
@tdene tdene added the Expert Review label Mar 4, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 4, 2026 20:32
@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 4, 2026
inference_interface: InferenceInterface
validation: bool = False
filter_groups_with_same_reward: bool = False
streaming: bool = False
Contributor Author

New parameter that matches what num_groups = -1 meant in the old code. It's bad practice to have a parameter that means one thing when it's positive and an entirely different thing when it's set to -1.

Contributor

These changes will break the nemo_gym integration. Please adjust those scripts accordingly; ./nemo_gym_agent.py is the file you need.

Contributor Author

Will do this later today.

Contributor

Let's get back to this after we deal with completed_at_step parameter.

validation: bool = False
filter_groups_with_same_reward: bool = False
streaming: bool = False
batch_results: bool = False
Contributor Author

When False, this returns groups one at a time, as in the old flow. When True, this waits until num_groups are ready and then returns them all at once.

# The semaphore ensures that each batch only starts after the previous is consumed.
groups_per_worker = request.num_groups
num_workers = self.parallel_generation_tasks // groups_per_worker
submission_gate = asyncio.Semaphore(num_workers)
Contributor Author

The whole point of this submission_gate is the following.

If we are doing forced lag, we can force the RL setup to only "generate" a certain number of groups per step. This is the flow without submission_gate. This is more performant, but does not guarantee consistent lag.

The flow with submission_gate forces the RL setup to only "consume" a certain number of groups per step. This is less performant, but guarantees consistent lag.

The difference between "generate N per step" and "consume N per step" is in the edge-case.
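The gating idea can be sketched as a runnable toy (illustrative names, not the PR's exact code): workers must acquire a semaphore permit before generating a batch, and the consumer releases one permit per batch it drains, so at most `num_workers` batches are ever in flight regardless of how fast generation runs.

```python
import asyncio

async def gated_generation(num_workers=2, total_batches=6):
    gate = asyncio.Semaphore(num_workers)
    results = asyncio.Queue()
    in_flight = 0
    high_water = 0

    async def worker(batch_id):
        nonlocal in_flight, high_water
        await gate.acquire()           # blocks until the consumer frees a slot
        in_flight += 1
        high_water = max(high_water, in_flight)
        await asyncio.sleep(0)         # stand-in for actual rollout generation
        await results.put(batch_id)

    tasks = [asyncio.create_task(worker(i)) for i in range(total_batches)]
    consumed = []
    for _ in range(total_batches):
        consumed.append(await results.get())
        in_flight -= 1
        gate.release()                 # consuming a batch admits the next one
    await asyncio.gather(*tasks)
    return consumed, high_water
```

Because a permit is only returned on consumption, the in-flight count can never exceed the permit count, which is what makes the lag consistent at the cost of some generation throughput.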

request.inference_interface, InferenceServer
), "Rollout requests to remote server must contain an InferenceServer object"
assert request.num_groups != -1, "FastAPIEnvServer does not support group rollout streaming"
assert not request.streaming, "FastAPIEnvServer does not support group rollout streaming"
Contributor Author

This is what I meant above. num_groups != -1 was secretly hidden behavior. Now it's out in the open.

streaming = args.rl_partial_rollouts or args.rl_forced_lag > 0
if not streaming or _ROLLOUT_GENERATOR is None:
    if args.rl_forced_lag > 0:
        pgt = (args.rl_forced_lag + 1) * n_prompts
Contributor Author

We must control the inference batch size in forced lag mode.
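Spelled out as a hypothetical helper (the name is illustrative; only the arithmetic comes from the snippet above): with a forced lag of L steps, (L + 1) * n_prompts generation tasks keep one step's worth of rollouts completing per training step once the warm-up is over.

```python
# Sizing rule for forced-lag mode: keep L + 1 batches of prompts in flight.
def parallel_generation_tasks(rl_forced_lag, n_prompts):
    return (rl_forced_lag + 1) * n_prompts
```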

@jon-barker
Copy link
Contributor

/claude review

group.add_argument('--rl-forced-lag', type=int, default=0,
                   help='Forced rollout lag of L steps. After an initial warm-up of L steps, '
                        'All steps N+L use only rollouts that were started on step N. '
                        '0 (default) disabled this behavior.'
Contributor

Typo: "disabled" → "disables"

Suggested change
-                        '0 (default) disabled this behavior.'
+                        '0 (default) disables this behavior.'

Contributor Author

Resolved. Thank you Claude!

if submission_gate is not None:
    await submission_gate.acquire()
batch_id = submitted_groups // groups_per_worker
submitted_groups += groups_per_worker
Contributor

The assert only guards the groups_per_worker > 1 path. When groups_per_worker == 1 (i.e. n_prompts == 1) with batch_results=True and filter_groups_with_same_reward=True, a filtered group will leave a semaphore permit unreleased. The consumer is then waiting on grouped_rollouts.get() while every worker is stuck waiting on submission_gate.acquire() — deadlock.

Since filtering + forced lag is unsupported, it should be caught in validate_args rather than at runtime inside the generator (where the assert kills the training run with no cleanup). Consider adding to validate_args:

assert not (args.rl_forced_lag > 0 and args.rl_filter_groups_with_same_reward), \
    "--rl-forced-lag and --rl-filter-groups-with-same-reward are incompatible."

(adjust the flag name to whatever the actual arg is.)

Contributor Author

Resolved. Thank you Claude!

Contributor

ArEsKay3 commented Mar 7, 2026

I actually have pretty strong feelings about the arguments here.
If I want to change something, I shouldn't have to adjust a lot of others.
As it stands, a change in, say, group size will change the number of active requests (even if the global batch size is adjusted accordingly), something I think we want to avoid.

So. My proposal is
--rl-num-parallel-generations OR --rl-num-parallel-generation-batches

They would be mutually exclusive. Both would require --rl-partial-rollouts to be set.

The first would imply no batching and would represent the number of parallel rollouts. This is what we really want to control. We would internally adjust for multiple tasks and divide it by grpo_group_size to ensure that changing the grpo_group_size or the number of tasks doesn't require an associated change in this argument to keep the same number of "active" rollouts.

The second would imply batching and would represent the number of batches active at a time. In this case we would set our controls so we have that many batches in flight. This would be our "forced lag" or "N-step off-policy" setup.



@dataclass(slots=True)
class RolloutGroup:
Contributor

Should it inherit from the BaseModel?

Contributor Author

Done

while grouped_rollouts.qsize() > 0 or not all(task.done() for task in tasks):
yield await grouped_rollouts.get()
next_batch_id = 0
pending: dict[int, list[RolloutGroup]] = {}
Contributor

GroupedRollouts?

Contributor Author

Done

class RolloutGroup:
    """A group of rollouts (e.g. multiple completions for one prompt) with batch metadata."""

    rollouts: list[Rollout]
Contributor

Rollout or Rollout|TokenRollout?

Contributor Author

Done

if request.enforce_order:
    # Accumulate groups and enforce submission order across batches.
    pending.setdefault(group.batch_id, []).append(group)
    while len(pending.get(next_batch_id, [])) >= groups_per_worker:
Contributor

Trying to understand this logic. Is there a chance we never exit this loop, i.e. get stuck not because we have exhausted pending but for some other reason?

Contributor Author

Certainly. The only way we can get stuck here is if we filter out rollouts (which we assert that we do not do), or if the worker tasks die. But if the worker tasks die, we shut down the asyncio Queues (because I back-ported Python 3.13 functionality into this repo; asyncio Queues lack shutdown support before Python 3.13), and that breaks us out of a potentially infinite loop.
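The order-enforcing drain can be sketched synchronously (illustrative names; the async queue and shutdown machinery from the PR is elided): finished groups arrive tagged with a batch_id in arbitrary order, are parked in `pending`, and are released strictly in batch order once a full batch has accumulated.

```python
def ordered_batches(tagged_groups, groups_per_worker):
    pending = {}
    next_batch_id = 0
    released = []
    for batch_id, group in tagged_groups:
        pending.setdefault(batch_id, []).append(group)
        # Release complete batches in submission order only; a later batch
        # finishing first simply waits in `pending`.
        while len(pending.get(next_batch_id, [])) >= groups_per_worker:
            released.extend(pending.pop(next_batch_id))
            next_batch_id += 1
    return released
```

Note how this makes the deadlock discussed earlier visible: if a group from batch 0 is filtered out and never arrives, `pending[0]` never fills and nothing after it is ever released.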

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved All necessary approvals have been made label Mar 18, 2026
@tdene tdene enabled auto-merge March 18, 2026 17:50
@tdene tdene added this pull request to the merge queue Mar 18, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23271910667

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23273525669

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 19, 2026
@tdene tdene added this pull request to the merge queue Mar 19, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23275855415

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23278219551

Merged via the queue into NVIDIA:main with commit dde4701 Mar 19, 2026
55 of 57 checks passed
@tdene tdene deleted the tde/rl_forced_lag branch March 19, 2026 06:19

Labels

Approved (All necessary approvals have been made)
complexity: medium

8 participants