# GRPO Llama3.1-8B Demo: Direct Function Call

This notebook demonstrates GRPO training by importing the `rl_train` function from `MaxText.rl.train_rl` and calling it directly.

## What is GRPO?

GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that improves the reasoning ability of LLMs by:
1. Generating multiple responses for each prompt
2. Scoring each response with a reward model or reward function
3. Computing each response's advantage relative to the other responses in its group, and using those advantages to update the policy
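
The "relative advantage" step can be illustrated with a minimal sketch. It assumes the common formulation in which each reward is normalized by the mean and standard deviation of its group; the helper name `group_relative_advantages` and the `eps` value are hypothetical, and the trainer used in this notebook may compute advantages differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division finite when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean get a positive advantage and are reinforced; those below the mean get a negative one. When all responses in a group score the same, every advantage is zero and that prompt contributes no learning signal.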



## Hardware Requirements

- Single-host TPU VM (v6e-8 or v5p-8), or multi-host with Pathways
- Enough accelerator memory to hold the Llama3.1-8B model

## Setup

Install dependencies and set up the environment:

In [None]:
# Clone MaxText repository
!git clone https://github.com/AI-Hypercomputer/maxtext.git
%cd maxtext

In [1]:
# Install GRPO-specific dependencies
!./src/MaxText/examples/install_tunix_vllm_requirement.sh

# Install additional requirements
%uv pip install --force-reinstall numpy==2.1.2
%uv pip install nest_asyncio

/bin/bash: line 1: ./src/MaxText/examples/install_tunix_vllm_requirement.sh: No such file or directory


Using Python 3.12.12 environment at: /home/zhehuichen_google_com/.venv
Resolved 1 package in 37ms
Prepared 1 package in 0.24ms
Uninstalled 1 package in 15ms
Installed 1 package in 54ms
 ~ numpy==2.1.2
Note: you may need to restart the kernel to use updated packages.
Using Python 3.12.12 environment at: /home/zhehuichen_google_com/.venv
Audited 1 package in 5ms
Note: you may need to restart the kernel to use updated packages.


In [2]:
%load_ext autoreload
%autoreload 2
import nest_asyncio
nest_asyncio.apply()  # Fix for Colab event loop

## Configuration

Set up the training parameters:

In [3]:
# Configuration for GRPO training
import os

# Set up paths
MAXTEXT_REPO_ROOT = os.path.expanduser("~") + "/maxtext"
print(f"MaxText Home directory: {MAXTEXT_REPO_ROOT}")

# Training configuration
MODEL_CHECKPOINT_PATH = "gs://zhehui_tpu/llama3.1-8b-Instruct/llama3.1-8b-Instruct/scanned-pathways/0/items"
OUTPUT_DIRECTORY = "/tmp/grpo_output"
STEPS = 10  # Reduced for demo purposes
# Make sure your Hugging Face token has access to the gated meta-llama/Llama-3.1-8B-Instruct repository
HF_TOKEN = os.environ.get("HF_TOKEN", "YOUR_HF_TOKEN")

print(f"Model checkpoint: {MODEL_CHECKPOINT_PATH}")
print(f"Output directory: {OUTPUT_DIRECTORY}")
print(f"Training steps: {STEPS}")

MaxText Home directory: /home/zhehuichen_google_com/maxtext
Model checkpoint: gs://zhehui_tpu/llama3.1-8b-Instruct/llama3.1-8b-Instruct/scanned-pathways/0/items
Output directory: /tmp/grpo_output
Training steps: 10
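
Because `HF_TOKEN` falls back to the literal string `"YOUR_HF_TOKEN"`, a missing token only surfaces later, when the tokenizer download fails. A small guard like the sketch below can catch that earlier. `require_hf_token` is a hypothetical helper, not part of MaxText, and it only checks that a value was supplied, not that the token is valid or has the right permissions.

```python
def require_hf_token(token, placeholder="YOUR_HF_TOKEN"):
    """Return the token, or raise if it is missing or still the placeholder."""
    if not token or token == placeholder:
        raise ValueError(
            "HF_TOKEN is not set; export a Hugging Face token before running."
        )
    return token


# Demonstrate both outcomes with dummy values (no real token is used here)
print(require_hf_token("hf_example_token"))
try:
    require_hf_token("YOUR_HF_TOKEN")
except ValueError as err:
    print(f"Rejected: {err}")
```

In this notebook it could wrap the existing lookup, for example `HF_TOKEN = require_hf_token(os.environ.get("HF_TOKEN"))`.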


In [4]:
# Import GRPO training function directly
import sys
import os
from pathlib import Path

# Add MaxText to Python path
maxtext_path = Path(MAXTEXT_REPO_ROOT) / "src" / "MaxText"
sys.path.insert(0, str(maxtext_path))

# Import required modules
from MaxText import pyconfig
from MaxText.rl.train_rl import rl_train, setup_configs_and_devices

print("✅ Successfully imported GRPO training function")
print(f"📁 MaxText path: {maxtext_path}")

INFO 11-24 19:58:08 [__init__.py:224] Automatically detected platform tpu.
INFO 11-24 19:58:08 [__init__.py:22] TPU info: node_name=zhehui-v6e | tpu_type=v6e-4 | worker_id=0 | num_chips=4 | num_cores_per_chip=1
Patched tpu_inference.models.jax.llama3.LlamaModel for Flax 0.12+ compatibility.
✅ Successfully imported GRPO training function
📁 MaxText path: /home/zhehuichen_google_com/maxtext/src/MaxText


In [5]:
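# Build configuration for GRPO training.
# Each entry after the base config is a key=value override.
# Note: STEPS and OUTPUT_DIRECTORY defined earlier are not passed here,
# so the config defaults apply to those settings.
config_argv = [
    "",  # Placeholder for argv[0]
    os.path.join(MAXTEXT_REPO_ROOT, "src/MaxText/configs/rl.yml"),  # Base config
    "model_name=llama3.1-8b",
    "tokenizer_path=meta-llama/Llama-3.1-8B-Instruct",
    f"load_parameters_path={MODEL_CHECKPOINT_PATH}",
    f"hf_access_token={HF_TOKEN}",
    "run_name=test",
]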

# Create configuration object
trainer_config, sampler_config, trainer_devices, sampler_devices = setup_configs_and_devices(config_argv)

print("✅ Configuration created successfully")
print(f"📊 Training steps: {trainer_config.steps}")
print(f"📁 Output directory: {trainer_config.base_output_directory}")
print(f"🤖 Model: {trainer_config.model_name}")



Skipping jax distributed system due to skip_jax_distributed_system=True flag.


INFO:MaxText.pyconfig:Config param act_quantization_calibration_method: absmax
INFO:MaxText.pyconfig:Config param activations_in_float32: False
INFO:MaxText.pyconfig:Config param adam_b1: 0.9
INFO:MaxText.pyconfig:Config param adam_b2: 0.99
INFO:MaxText.pyconfig:Config param adam_eps: 1e-08
INFO:MaxText.pyconfig:Config param adam_eps_root: 0.0
INFO:MaxText.pyconfig:Config param adam_weight_decay: 0.1
INFO:MaxText.pyconfig:Config param add_bos: True
INFO:MaxText.pyconfig:Config param add_eos: True
INFO:MaxText.pyconfig:Config param allow_split_physical_axes: False
INFO:MaxText.pyconfig:Config param ar_cache_axis_order: 1,2,0,3
INFO:MaxText.pyconfig:Config param async_checkpointing: False
INFO:MaxText.pyconfig:Config param attention: dot_product
INFO:MaxText.pyconfig:Config param attention_bias: False
INFO:MaxText.pyconfig:Config param attention_sink: False
INFO:MaxText.pyconfig:Config param attention_type: global
INFO:MaxText.pyconfig:Config param attn_logits_soft_cap: None
INFO:MaxText

Running RL on a single slice
✅ Configuration created successfully
📊 Training steps: 150001
📁 Output directory: /tmp/grpo_output
🤖 Model: llama3.1-8b
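
The recorded run reports 150001 training steps rather than the `STEPS = 10` set in the configuration cell, which is consistent with `STEPS` never being added to `config_argv`. The sketch below shows one way to build the argv-style list from a dictionary so that such settings are harder to forget. `build_config_argv` is a hypothetical helper, and passing `steps` and `base_output_directory` as `key=value` overrides is an assumption based on the attribute names printed above; check `rl.yml` for the keys the RL trainer actually honors.

```python
import os


def build_config_argv(base_config, overrides):
    """Return an argv-style list: placeholder, base config path, key=value overrides."""
    return ["", base_config] + [f"{key}={value}" for key, value in overrides.items()]


# Illustrative values only; key names are assumptions (see the note above)
argv = build_config_argv(
    os.path.join("maxtext", "src/MaxText/configs/rl.yml"),
    {
        "model_name": "llama3.1-8b",
        "run_name": "test",
        "steps": 10,
        "base_output_directory": "/tmp/grpo_output",
    },
)
print(argv)
```

The resulting list has the same shape as `config_argv` and could be passed to `setup_configs_and_devices` in the same way.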


In [None]:
# Execute GRPO training directly
try:
    print("\n" + "=" * 80)
    print("Starting GRPO Training...")
    print("=" * 80)
    # Run training with the configs and devices created above
    rl_train(trainer_config, sampler_config, trainer_devices, sampler_devices)

    print("\n" + "=" * 80)
    print("✅ GRPO Training Completed Successfully!")
    print("=" * 80)
    print(f"📁 Checkpoints and logs saved to: {trainer_config.base_output_directory}")
    print("🎯 Final model ready for inference!")

except Exception as e:
    print("\n" + "=" * 80)
    print("❌ GRPO Training Failed!")
    print("=" * 80)
    print(f"Error: {e}")
    print("\nPlease check the error message and try again.")
    raise  # Re-raise so the full traceback is not swallowed


Starting GRPO Training...
Starting GRPO Training
Ensuring TensorBoard log directory exists: /tmp/grpo_output/test/tensorboard/


DEBUG: resolving path. Input: src/MaxText/examples/chat_templates/gsm8k_rl_llama3.json, Package Root: /home/zhehuichen_google_com/maxtext/src/MaxText, Resolved: /home/zhehuichen_google_com/maxtext/src/MaxText/examples/chat_templates/gsm8k_rl_llama3.json, Exists: True
DEBUG: resolving path. Input: src/MaxText/examples/chat_templates/gsm8k_rl_llama3.json, Package Root: /home/zhehuichen_google_com/maxtext/src/MaxText, Resolved: /home/zhehuichen_google_com/maxtext/src/MaxText/examples/chat_templates/gsm8k_rl_llama3.json, Exists: True
{'answer': array(['13'], dtype='<U2'),
 'prompts': array(['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nRespond in the following format:<reasoning>...</reasoning><answer>...</answer><|eot_id|><|start_header_id|>user<|end_header_id|>\n\nJane is painting her fingernails. She applies a base coat that takes 2 minutes to dry, two color coats that take 3 minutes each to dry, and a cl

I1124 19:58:17.412557 1087788 google_auth_provider.cc:181] Running on GCE, using service account 770040921623-compute@developer.gserviceaccount.com


Num_devices: 4, shape (1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1)
Reference Model initialized successfully


Reference mesh shape: OrderedDict({'data': 1, 'stage': 1, 'fsdp': 4, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1})
maxtext_state_flatten[base.token_embedder.embedding].value=          [[0.00106049 0.00561523 -0.00341797 ... 0.00411987 -0.00280762
  -0.000671387]
 [-0.0037384 0.000972748 -0.0018158 ... 0.00152588 -0.0022583 -0.0013504]
 [0.00144196 -0.0169678 0.00315857 ... 0.00299072 0.00952148 0.00488281]
 ...
 [2.21271e-23 3.90326e-24 2.16101e-23 ... 6.36929e-23 -2.64956e-24
  -2.35746e-23]
 [2.28509e-23 -2.21012e-24 -2.22305e-23 ... 2.79173e-23 8.6854e-24
  -3.70163e-23]
 [-8.85083e-23 -7.5687e-23 6.4882e-24 ... 5.89366e-24 -6.45201e-23
  -2.71419e-24]]
Creating policy model with same config as reference on trainer mesh
Num_devices: 4, shape (1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1)
Num_devices: 4, shape (1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1)




Policy Model initialized successfully


Policy mesh shape: OrderedDict({'data': 1, 'stage': 1, 'fsdp': 4, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1})
Tensorboard logs directory: /tmp/grpo_output/test/tensorboard/
Creating RL cluster...
INFO 11-24 19:59:04 [utils.py:243] non-default args: {'load_format': 'dummy', 'max_model_len': 1280, 'tensor_parallel_size': 4, 'swap_space': 2, 'gpu_memory_utilization': 0.4, 'disable_log_stats': True, 'additional_config': {'sharding': {'sharding_strategy': {'device_indexes': [0, 1, 3, 2]}}}, 'model': 'meta-llama/Llama-3.1-8B-Instruct'}
INFO 11-24 19:59:04 [model.py:646] Resolved architecture: LlamaForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 11-24 19:59:04 [model.py:1734] Using max model len 1280
INFO 11-24 19:59:06 [scheduler.py:225] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 11-24 19:59:06 [tpu_jax.py:186] Force using UniProcExecutor for JAX on single host.
INFO 11-24 19:59:06 [core.py:96] Initializing a V1 LLM engine (v0.11.1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=<class 'jax.numpy.bfloat16'>, max_seq_len=1280, download_dir=None, load_format=dummy, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=None, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reaso



Setting up GRPO trainer...


  0%|          | 0/5 [00:00<?, ?it/s]



Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Pass 1/1, responses: ["To find the number of students enrolled this year, we need to calculate 20% of the previous year's enrollment and add it to the previous year's enrollment.\n\nPrevious year's enrollment: 50 students\n20% of 50 = 0.20 * 50 = 10 students\n\nAdd the increase to the previous year's enrollment: 50 + 10 = 60 students\n\nTherefore, there are 60 students enrolled in the calligraphy class this year."]
Evaluation Question: Last year there were 50 students enrolled in a calligraphy class. This year, there was a 20% increase in enrollment. How many students are enrolled this year in the calligraphy class?
Evaluation Answer: 60
Evaluation Responses (raw): ["To find the number of students enrolled this year, we need to calculate 20% of the previous year's enrollment and add it to the previous year's enrollment.\n\nPrevious year's enrollment: 50 students\n20% of 50 = 0.20 * 50 = 10 students\n\nAdd the increase to the previous year's enrollment: 50 + 10 = 60 students\n\nTherefor

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Pass 1/1, responses: ["To find the total number of letters delivered from Monday to Wednesday, we need to calculate the number of letters delivered each day and add them together.\n\nOn Monday, the post office delivered 425 letters.\n\nOn Tuesday, they delivered 17 more than one-fifth as many as Monday. \nOne-fifth of 425 is 425 / 5 = 85.\n17 more than one-fifth as many as Monday is 85 + 17 = 102.\nSo, on Tuesday, they delivered 102 letters.\n\nOn Wednesday, they delivered 5 more than twice as many as they delivered on Tuesday.\nTwice as many as Tuesday is 102 * 2 = 204.\n5 more than twice as many as Tuesday is 204 + 5 = 209.\nSo, on Wednesday, they delivered 209 letters.\n\nNow, let's add the number of letters delivered each day: \n425 (Monday) + 102 (Tuesday) + 209 (Wednesday) = 736.\n\nThe total number of letters delivered from Monday to Wednesday is 736."]
Evaluation Question: On Monday the post office delivered 425 letters. On Tuesday they delivered 17 more than one-fifth as many 

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Pass 1/1, responses: ["To find out how many cups of flour Carla needs, we need to calculate the total number of cookies she is making and then multiply that by the amount of flour needed per cookie. \n\nFirst, let's find the total number of cookies Carla is making: 36 cookies today + 30 cookies tomorrow = 66 cookies.\n\nSince two cups of flour are needed to make a dozen (12) cookies, we need to find out how many cups of flour are needed per cookie. To do this, we divide 2 cups by 12 cookies: 2 cups / 12 cookies = 1/6 cup per cookie.\n\nNow, we multiply the total number of cookies by the amount of flour needed per cookie: 66 cookies * 1/6 cup per cookie = 11 cups.\n\nTherefore, Carla will need 11 cups of flour to bake the cookies today and tomorrow."]
Evaluation Question: Two cups of flour are needed to make a dozen cookies. Carla is making 36 cookies today and 30 cookies tomorrow. How many cups of flour will Carla need to bake the cookies today and tomorrow?
Evaluation Answer: 11
Evalu

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Pass 1/1, responses: ["Let's denote the number of white fish Bob bought as W and the number of orange fish Bob bought as O. \n\nInitially, Bob had 3 orange fish and 4 white fish. After buying W + O fish, he had 3 + O + W orange fish and 4 + W white fish. \n\nSince he now has twice as many orange fish as white fish, we can set up the equation: 3 + O + W = 2(4 + W).\n\nSimplifying the equation, we get 3 + O + W = 8 + 2W.\n\nSubtracting W from both sides, we get 3 + O = 8 + W.\n\nSubtracting 3 from both sides, we get O = 5 + W.\n\nWe also know that the total number of fish Bob bought is 17, so W + O = 17.\n\nSubstituting O = 5 + W into the equation, we get W + 5 + W = 17.\n\nCombining like terms, we get 2W + 5 = 17.\n\nSubtracting 5 from both sides, we get 2W = 12.\n\nDividing both sides by 2, we get W = 6.\n\nSo, Bob bought 6 white fish at the store."]
Evaluation Question: Bob had 7 fish in his ornamental fish pond. 3 were orange, and 4 were white. He decided he wanted to get some more, 

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Pass 1/1, responses: ['To find the total cost, we need to calculate the cost of each item and add them up.\n\nThe cost of the suits is 10 suits * $750 per suit = $7,500.\n\nThe cost of the dress pants is 10 pants * (1/5 * $750) = 10 * $150 = $1,500.\n\nThe cost of the dress shirts is 10 suits * 3 shirts per suit * $60 per shirt = 30 shirts * $60 = $1,800.\n\nThe total cost is $7,500 + $1,500 + $1,800 = $10,800.\n\n<reasoning>The total cost is calculated by adding the cost of the suits, dress pants, and dress shirts together.</reasoning>\n<answer>$10,800</answer>']
Evaluation Question: James buys a new wardrobe.  He buys 10 suits and 10 dress pants.  He also buys 3 dress shirts per suit.  The suits cost $750 each and the dress pants cost 1/5 that cost.  The dress shirts were $60 each.  How much did everything cost?
Evaluation Answer: 10,800
Evaluation Responses (raw): ['To find the total cost, we need to calculate the cost of each item and add them up.\n\nThe cost of the suits is 10 sui
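
The evaluation output above shows why answer checking needs some care: the prompt asks for `<reasoning>...</reasoning><answer>...</answer>`, only some responses use the tags, and the tagged answer `$10,800` has to match the reference `10,800`. The sketch below illustrates one way to score such responses. It is a hypothetical reward function written for illustration, not the one `rl_train` uses; the names `answer_reward` and `normalize_number` are assumptions.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize_number(text):
    """Strip currency symbols, commas, and whitespace: '$10,800' -> '10800'."""
    return re.sub(r"[$,\s]", "", text)


def answer_reward(response, expected):
    """Return 1.0 if the tagged answer matches the expected one, else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # no <answer> tag, so the requested format was not followed
    return 1.0 if normalize_number(match.group(1)) == normalize_number(expected) else 0.0


tagged = "<reasoning>Add the three costs.</reasoning>\n<answer>$10,800</answer>"
untagged = "Therefore, there are 60 students enrolled this year."
print(answer_reward(tagged, "10,800"))
print(answer_reward(untagged, "60"))
```

Under this scheme the untagged response earns no reward even though its arithmetic is correct, which is the kind of signal that pushes the policy toward the requested output format.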

### 📚 **Learn More**
- See `src/MaxText/examples/grpo_runner.py` for CLI usage
- Check `src/MaxText/configs/rl.yml` (the base config loaded in this notebook) for configuration options
- Read `src/MaxText/examples/README.md` for more examples