Adding support for agentic grpo trainer. #3540
Conversation
richjames0
left a comment
lgtm with a couple of concerns that you can ignore if not relevant, but I do note that Andy has one unresolved comment
return optax.inject_hyperparams(make_optimizer)(learning_rate=schedule)
...
def format_maxtext_messages(messages: list[dict[str, str]], template_config: dict, tmvp_config) -> list[dict[str, str]]:
is this also going to be performant enough?
From looking at the implementation of other chat parsers in Tunix, I think we should be fine: https://github.com/google/tunix/blob/main/tunix/rl/agentic/parser/chat_template_parser/parser.py
My biggest concern with this change was just matching the code we were previously using for pre-processing and applying chat templates exactly, to avoid unintended bugs related to this.
hmmmmm our parser already handles all of these; if there's a missing model, it should be added to the Tunix codebase instead
}
...
class MaxTextChatParser(agentic_chat_template_parser.DefaultChatTemplateParser):
why do you need this? it should just be the qwen/gemma/llama parser
this is needed because of the difference between how maxtext and tunix do parsing; tunix subclasses per model, maxtext uses a single class but with config
The alternative is to write a get_chat_parser() helper which attempts to load the chat parser corresponding to the model from Tunix and falls back to the default implementation if it is not found.
I would prefer to implement the MaxTextChatParser for now, for simplicity and compatibility with the MaxText single-class + config model.
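For concreteness, a minimal sketch of the get_chat_parser() fallback idea; the per-model class names and constructor signature are assumptions (only DefaultChatTemplateParser and the module path are taken from this thread):

```python
# Sketch only: assumes Tunix exposes per-model parser subclasses in this module.
# The lookup table below is illustrative, not Tunix's actual API.
from tunix.rl.agentic.parser.chat_template_parser import parser as chat_template_parser

def get_chat_parser(model_name: str, tokenizer):
  """Return a model-specific Tunix chat parser, falling back to the default."""
  # Hypothetical mapping from model family to a Tunix per-model subclass;
  # getattr(..., None) keeps this safe if a class doesn't exist.
  per_model_parsers = {
      "qwen": getattr(chat_template_parser, "QwenChatTemplateParser", None),
      "gemma": getattr(chat_template_parser, "GemmaChatTemplateParser", None),
      "llama": getattr(chat_template_parser, "LlamaChatTemplateParser", None),
  }
  for family, parser_cls in per_model_parsers.items():
    if parser_cls is not None and family in model_name.lower():
      return parser_cls(tokenizer)
  # No model-specific parser found: fall back to the default implementation,
  # mirroring the MaxText single-class + config approach.
  return chat_template_parser.DefaultChatTemplateParser(tokenizer)
```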
hmmm i'm not sure I'm following; maxtext implements the OSS models, so why does the parser matter? say maxtext uses a qwen model and the qwen chat parser is already there, couldn't we just use it?
beta=trainer_config.rl.grpo_beta,
epsilon=trainer_config.rl.grpo_epsilon,
loss_algo=trainer_config.rl.loss_algo,
max_response_length=trainer_config.max_target_length - trainer_config.max_prefill_predict_length,
what are max_target_length and max_prefill_predict_length? these look like confusing user-facing knobs
Yes, I agree this is confusing - I will follow up with another PR to clean up the interface a bit.
max_target_length is the total sequence length (prefill plus generation), while max_prefill_predict_length is the max prefill size for prompts, so the response budget is the difference between the two. We should rename max_target_length to something like max_tokens_to_generate and define the max model length as the sum of max_tokens_to_generate + max_prefill_predict_length
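To make the relationship concrete (numbers illustrative only):

```python
# Illustrative values; the expression mirrors the config line quoted above.
max_target_length = 2048            # total sequence budget: prefill + generation
max_prefill_predict_length = 1024   # max prompt/prefill tokens
max_response_length = max_target_length - max_prefill_predict_length  # 1024 tokens left to generate
```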
def format_maxtext_messages(messages: list[dict[str, str]], template_config: dict, tmvp_config) -> list[dict[str, str]]:
  """Helper to inject MaxText's system prompt into the input user messages."""
  formatted_messages = []
  for msg in messages:
this just looks like duplicated logic that already exists in our chat parser?
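For context, a minimal sketch of what a system-prompt-injection helper like format_maxtext_messages plausibly does; the template_config key and the injection strategy are assumptions, not the PR's actual code:

```python
def format_maxtext_messages(messages: list[dict[str, str]], template_config: dict, tmvp_config) -> list[dict[str, str]]:
  """Helper to inject MaxText's system prompt into the input user messages."""
  formatted_messages = []
  # Assumed config key; the real PR may source the system prompt differently.
  system_prompt = template_config.get("system_prompt", "")
  for msg in messages:
    if system_prompt and msg.get("role") == "user":
      # Prepend the configured system prompt to each user turn.
      msg = {**msg, "content": f"{system_prompt}\n{msg['content']}"}
    formatted_messages.append(msg)
  return formatted_messages
```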
Description
Add support for the Tunix Agentic GRPO Learner, which enables asynchronous rollouts leveraging an online vLLM server.
To enable the Agentic GRPO Learner, this PR introduces the rl.use_agentic_rollout flag. Similarly, the maximum concurrency for the online vLLM server is set using the rl.max_concurrency argument. Other arguments relevant to the Agentic GRPO Learner are also included in this PR.
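A hedged sketch of the new knobs as key/value overrides; the flag names come from this PR's description, while the dict form and values are illustrative:

```python
# rl.use_agentic_rollout and rl.max_concurrency are the flags this PR adds;
# everything else here (override structure, values) is an assumption.
rl_overrides = {
    "rl.use_agentic_rollout": True,  # use the Tunix Agentic GRPO Learner (async rollouts)
    "rl.max_concurrency": 64,        # max concurrent requests to the online vLLM server
}
```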
Tests
Standard GRPO for qwen3-0.6b on v6e: 523.08s
Agentic GRPO for qwen3-0.6b on v6e: 363.93s
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.