
<h1 align="center">Working Qwen 0.5b on GRPO</h1>

---

<h1 align="center">Training a Smol Math Reasoner with RL</h1>

This notebook is inspired by the [GRPO demo](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) by [Will Brown](https://x.com/willccbb), which trains llama-1b on the GSM8K math dataset.

This notebook trains a model, downloads it, publishes the weights to your Hugging Face account, and then runs inference in Python against your (almost) locally trained model, printing the output to the terminal. It also saves the model's reasoning along the way. The same setup can be adapted to almost any agent-training task, as long as you define the reward functions and the data you want to train on. The goal is not just to train a model on one task, but to show how little code it takes to get a model trained on a very specific task up and running in about 2 hours.


## Setting up the models.

## Understanding vLLM in this Project

### What is vLLM?
vLLM is a high-performance open-source library for efficient LLM inference and serving, originally developed at UC Berkeley's Sky Computing Lab. It offers production-grade performance and is used by major companies like Databricks and Anyscale.

### Core Features and Benefits
1. **PagedAttention**
   - Novel memory management system similar to operating system page management
   - Dramatically reduces memory usage during inference
   - Enables efficient handling of multiple requests simultaneously

2. **Performance Optimizations**
   - Continuous batching for dynamic request processing
   - Optimized CUDA kernels for maximum GPU utilization
   - Efficient KV cache management for transformer architectures
   - Supports both CPU and GPU inference
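The paging idea behind PagedAttention can be illustrated with a toy allocator. This is only a sketch of the concept (fixed-size blocks handed out on demand, tracked in a per-sequence block table); the class and method names are hypothetical and it is not how vLLM is actually implemented.

```python
class ToyBlockAllocator:
    """Toy paged KV cache: hands out fixed-size blocks on demand (illustration only)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical block ids
        self.lengths: dict[str, int] = {}             # sequence id -> tokens cached so far

    def append_token(self, seq_id: str) -> None:
        """Record one more cached token, allocating a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # all current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = ToyBlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):            # a 20-token sequence needs two 16-token blocks
    alloc.append_token("seq-a")
for _ in range(5):             # a 5-token sequence needs one block
    alloc.append_token("seq-b")
print(alloc.block_tables)
```

Because blocks are allocated only as tokens arrive, short sequences do not reserve memory for a maximum length they never reach, which is the intuition behind the memory savings described above.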

### Why vLLM is Critical for This Training Pipeline
1. **Speed Benefits**
   - Significantly faster inference during training
   - Essential for GRPO (Group Relative Policy Optimization)
   - Enables rapid model evaluation during reinforcement learning

2. **Memory Efficiency**
   - Allows both training and inference on the same GPU
   - Particularly important for our Qwen-0.5B model setup
   - Optimizes GPU memory usage through smart caching

### Installation Requirements
- Must be installed BEFORE TRL (Transformer Reinforcement Learning)
- Requires CUDA support for GPU acceleration
- Dependencies are automatically handled by pip

### Documentation & Resources
- Official Docs: [vllm.readthedocs.io](https://vllm.readthedocs.io/)
- GitHub: [github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
- Paper: ["Efficient Memory Management for Large Language Model Serving with PagedAttention"](https://arxiv.org/abs/2309.06180)

### Important Note
After installing vLLM, you must restart the runtime before proceeding with other installations. This is due to a known interaction with the TRL library that requires vLLM to be installed first.

In [1]:
%pip install vllm

Collecting vllm
  Downloading vllm-0.7.2-cp38-abi3-manylinux1_x86_64.whl.metadata (12 kB)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting fastapi!=0.113.*,!=0.114.0,>=0.107.0 (from vllm)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn[standard] (from vllm)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.0.2-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from vllm)
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting lm-format-enforcer<0.11,>=0.10.9 (from vllm)
  Downloading lm_format_enforcer-0.10.9-py3-none-any.whl.metadata (17 kB)
Collecting outlines==0.1.11 (from vllm)
  Downloading outlines-0.1.11-py3-none-any.whl.metadata (17 kB)
Collecting lark=

# Understanding TRL and Hugging Face Datasets for Reinforcement Learning

This guide explains how two key tools—**TRL (Transformer Reinforcement Learning)** and the **Hugging Face Datasets library**—are used together to fine-tune language models with reinforcement learning (RL). The explanation is aimed at an informed undergraduate who is familiar with machine learning concepts.

---

## What Is TRL?

**TRL** is a specialized library built on top of Hugging Face's Transformers framework. Its purpose is to train language models using reinforcement learning techniques, where the model learns by receiving rewards rather than just imitating data.

### Key Components in Our Project

1. **GRPOConfig**
   - **Role:** A configuration class for GRPO (Group Relative Policy Optimization).
   - **Responsibilities:**
     - **Hyperparameter Management:** Sets learning rate, batch size, gradient accumulation steps, etc.
     - **Resource Allocation:** Manages GPU memory usage.
     - **Checkpointing:** Defines when and how often to save the model during training.
     - **Inference Settings:** Configures parameters for generating model outputs during evaluation.

2. **GRPOTrainer**
   - **Role:** The core engine that implements the RL training loop.
   - **Responsibilities:**
     - **Reward Integration:** Incorporates multiple reward functions to assess model outputs.
     - **Policy Optimization:** Updates model parameters based on computed rewards.
     - **Generation & Evaluation:** Produces text responses and evaluates their quality.
     - **Training State Management:** Keeps track of the progress and state of training.
     - **Efficient Inference:** Integrates with libraries like vLLM to speed up inference.

---

## What Is the Hugging Face Datasets Library?

The Hugging Face Datasets library is designed for efficient data handling. It simplifies the process of loading, processing, and streaming large datasets, which is particularly useful when working with language models.

### Usage in Our Project

1. **Data Loading**
   - We load datasets using a simple API. For example, to load the GSM8K (Grade School Math 8K) dataset:
     ```python
     from datasets import load_dataset, Dataset
     data = load_dataset('openai/gsm8k', 'main')
     ```
   - **Key Features:**
     - **Efficient Streaming & Caching:** Helps manage memory by streaming data.
     - **Data Versioning and Splitting:** Ensures reproducibility and proper separation of training/validation data.

2. **Data Processing**
   - Once loaded, the data is transformed to fit the needs of the training pipeline:
     - **Formatting:** Converts raw math problems into a prompt-friendly format.
     - **Instruction Integration:** Adds system-level instructions to guide the model.
     - **Structuring Data:** Creates input/output pairs that the model can learn from.
   - The `Dataset.map()` method is used to apply these transformations efficiently across the entire dataset.
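A minimal sketch of the kind of per-record transformation described above, written against plain dictionaries so it runs without downloading anything. The field names `question` and `answer`, the toy records, and the function name `to_chat_example` are assumptions for illustration; with a real `Dataset`, the same function would be passed to `Dataset.map()`.

```python
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n"
    "<answer>\n...\n</answer>\n"
)

def to_chat_example(record: dict) -> dict:
    """Turn one raw record into a chat-style prompt plus a reference answer."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},   # format instructions
            {"role": "user", "content": record["question"]},  # the math problem
        ],
        "answer": record["answer"],
    }

# Toy stand-ins for dataset rows (hypothetical data, not taken from GSM8K).
toy_records = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is 7 * 6?", "answer": "42"},
]

# With Hugging Face Datasets this would be: data = data.map(to_chat_example)
processed = [to_chat_example(r) for r in toy_records]
print(processed[0]["prompt"][1]["content"])
```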

---

## How TRL and Datasets Work Together

The integration of TRL and the Datasets library creates a smooth pipeline from raw data to model training:

1. **Data Flow**
   - **Step 1:** The Datasets library loads and processes the GSM8K dataset.
   - **Step 2:** The processed data is reformatted to include system prompts and user queries.
   - **Step 3:** This formatted data is then fed into the TRL training loop.

2. **Training Process**
   - **Reward Computation:** TRL computes rewards by evaluating how well the model’s responses match desired criteria.
   - **Policy Updates:** Based on these rewards, the model's parameters are updated to improve performance.
   - **Model Generation:** The model generates text, which is continually evaluated and refined.
   - **Optimization:** The overall training process is fine-tuned to enhance both learning efficiency and final performance.

---

## Why Use These Tools Together?

- **TRL's Strength:** Provides specialized reinforcement learning capabilities, which are essential for fine-tuning models based on complex reward signals.
- **Datasets' Efficiency:** Ensures that large-scale datasets are handled smoothly, from loading to processing.
- **Combined Benefit:** Together, they create a robust pipeline that:
  - Improves mathematical reasoning and problem-solving abilities.
  - Fine-tunes language models with more nuanced and targeted training.
  - Optimizes both the training workflow and the performance of the final model.

---

## Further Reading

For more detailed information, check out the official documentation:
- **TRL:** [github.com/huggingface/trl](https://github.com/huggingface/trl)
- **Datasets:** [huggingface.co/docs/datasets](https://huggingface.co/docs/datasets)


In [2]:
%pip install trl datasets

Collecting trl
  Downloading trl-0.14.0-py3-none-any.whl.metadata (12 kB)
Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading trl-0.14.0-py3-none-any.whl (313 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)

## Defining the RL rewards


# Setting Up GRPO Training Components

This section details the setup required for GRPO training, covering the essential imports, response format definitions, and their roles in the overall training pipeline. The explanation below is designed for an informed undergraduate with a background in machine learning and programming.

---

## Core Imports and Their Purposes

The codebase leverages both basic Python libraries and specialized modules from Hugging Face to facilitate data handling, model management, and reinforcement learning.

### 1. Basic Python Libraries

- **`re` Module**  
  - **Purpose:** Implements regular expressions for pattern matching.  
  - **Usage:**  
    - Validates that the model's output adheres to a specified format.
    - Extracts specific answer segments from the overall model output.

- **`torch` Module**  
  - **Purpose:** PyTorch is used for deep learning and tensor computations.  
  - **Usage:**  
    - Handles low-level tensor operations critical for model computations.
    - Manages GPU acceleration, ensuring efficient training of large models.

### 2. Hugging Face Components

- **Datasets Library**  
  - **`load_dataset`:**  
    - **Purpose:** Loads datasets such as GSM8K, which contains grade-school math problems.
    - **Usage:** Simplifies data fetching and ensures data is readily available for training.
  - **`Dataset` Class:**  
    - **Purpose:** Provides a framework for managing and manipulating datasets.
    - **Usage:** Serves as the base for further data processing and transformation.

- **Transformers Library**  
  - **`AutoTokenizer`:**  
    - **Purpose:** Automatically handles tokenization of text inputs.
    - **Usage:** Converts raw text into tokenized sequences that the model can process.
  - **`AutoModelForCausalLM`:**  
    - **Purpose:** Loads pretrained causal language models.
    - **Usage:** Provides the base model that is fine-tuned using reinforcement learning techniques.

- **TRL Library**  
  - **`GRPOConfig`:**  
    - **Purpose:** Manages training configurations specific to the GRPO algorithm.
    - **Usage:** Sets key hyperparameters and resource configurations (e.g., learning rate, checkpointing).
  - **`GRPOTrainer`:**  
    - **Purpose:** Implements the GRPO training loop.
    - **Usage:** Orchestrates the reinforcement learning process including reward computation, policy updates, and efficient inference integration.

---

## Response Format Definition

To ensure the model outputs are both consistent and evaluable, a specific response format is defined.

### 1. System Prompt

The system prompt establishes the framework for how responses should be structured:
```
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
```
- **Purpose:**  
  - Enforces a clear separation between the reasoning (the thought process) and the final answer.
  - Ensures that the output is standardized, making it easier to evaluate the quality and correctness of both components.

### 2. XML Chain-of-Thought Format

- **Structure:**  
  - Utilizes a templated format where placeholders (e.g., `{reasoning}`, `{answer}`) are dynamically filled during model inference.
- **Benefits:**  
  - **Consistency:** Every response follows the same structure, ensuring uniformity.
  - **Ease of Parsing:** Facilitates automated extraction and evaluation of the reasoning and answer components.
  - **Logical Separation:** Clearly delineates the step-by-step reasoning process from the final numerical or factual answer.
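To make the template idea concrete, here is a small sketch that fills the placeholders and then parses the result back out with a regular expression. The template string mirrors the format shown above; the sample reasoning and answer are made up for illustration.

```python
import re

XML_COT_FORMAT = (
    "<reasoning>\n{reasoning}\n</reasoning>\n"
    "<answer>\n{answer}\n</answer>\n"
)

# Fill the placeholders with a toy example.
filled = XML_COT_FORMAT.format(reasoning="2 + 3 = 5", answer="5")

# Parse both sections back out; DOTALL lets '.' span multi-line reasoning.
match = re.search(
    r"<reasoning>\n(.*?)\n</reasoning>\n<answer>\n(.*?)\n</answer>",
    filled,
    flags=re.DOTALL,
)
reasoning, answer = match.group(1), match.group(2)
print(repr(reasoning), repr(answer))
```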

---

## Purpose in Training Pipeline

The combination of these components underpins the efficiency and effectiveness of the training process:

- **Consistent Model Outputs:**  
  The defined response format ensures that the model consistently produces outputs in a predetermined, parseable structure.

- **Facilitates Reward Computation:**  
  A clear separation between reasoning and answer simplifies the process of assigning rewards based on specific aspects of the output.

- **Enables Clear Evaluation Metrics:**  
  With a standardized format, it becomes straightforward to measure the model's performance and the quality of its reasoning.

- **Supports Chain-of-Thought Reasoning:**  
  By enforcing a structured output, the training pipeline encourages the model to articulate its thought process, which can lead to improved transparency and reliability in decision making.

---

This setup not only standardizes the outputs but also streamlines the reinforcement learning training loop, ensuring that both the data processing and model training components work harmoniously for improved performance and evaluation.

First we set the general prompt structure (with the reasoning tags).

In [3]:
%pip install ipywidgets


Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.2


In [4]:
import re
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer

# Prepare your dataset or custom transformation here
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    """
    Extracts the <answer>...</answer> from the text, ignoring any <reasoning> blocks.
    """
    if "<answer>" not in text or "</answer>" not in text:
        return ""
    answer_part = text.split("<answer>")[-1]
    answer_part = answer_part.split("</answer>")[0]
    return answer_part.strip()


def format_openr1_math_example(example: dict) -> dict:
    problem_text = example.get("problem", "")
    reasoning_text = example.get("solution", "")
    final_answer = example.get("answer", "")

    # Combine the system prompt and the problem into a single prompt string
    prompt = f"{SYSTEM_PROMPT}\n{problem_text}"

    xml_completion = XML_COT_FORMAT.format(
        reasoning=reasoning_text.strip(),
        answer=final_answer.strip()
    )

    return {
        "prompt": prompt,  # plain string prompt (not a list of chat messages)
        "answer": xml_completion
    }

# If you want to see how the transformation works, do something like:
ds = load_dataset("open-r1/OpenR1-Math-220k", split="train[:1%]")  # small sample
ds = ds.map(format_openr1_math_example)

# Check an example
print(ds[0]["prompt"])
print(ds[0]["answer"])


INFO 02-12 22:41:35 __init__.py:190] Automatically detected platform cuda.


Generating train split:   0%|          | 0/93733 [00:00<?, ? examples/s]

Map:   0%|          | 0/937 [00:00<?, ? examples/s]


Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

## Task B-1.3.

A ship traveling along a river has covered $24 \mathrm{~km}$ upstream and $28 \mathrm{~km}$ downstream. For this journey, it took half an hour less than for traveling $30 \mathrm{~km}$ upstream and $21 \mathrm{~km}$ downstream, or half an hour more than for traveling $15 \mathrm{~km}$ upstream and $42 \mathrm{~km}$ downstream, assuming that both the ship and the river move uniformly.

Determine the speed of the ship in still water and the speed of the river.
<reasoning>
## Solution.

Let $t$ be the time required for the boat to travel $24 \mathrm{~km}$ upstream and $28 \mathrm{~km}$ downstream, $v_{R}$ the speed of the river, and $v_{B}$ the speed of the boat. When the boat is traveling upstream, its speed is $v_{B}-v_{R}$, and when it is traveling downstream, its speed is $v_{B}+v_{R}$.

Since $t=\frac{s}{v}$, from the given data, we obtain the following system of equations:

$\left\

## Data Processing Functions for OpenR1-Math Dataset

### Answer Extraction Functions

1. **XML Answer Extractor** (`extract_xml_answer`)
   - Purpose: Extracts answers from XML-formatted model outputs
   - Process:
     1. Splits text at `<answer>` tag
     2. Takes everything after the tag
     3. Splits at `</answer>` tag
     4. Takes everything before the closing tag
     5. Cleans whitespace
   - Used for: Processing model predictions during training

2. **Hash Answer Extractor** (`extract_hash_answer`)
   - Purpose: Extracts answers from GSM8K dataset format
   - Process:
     1. Checks for '####' delimiter
     2. Returns None if delimiter not found
     3. Takes everything after '####'
     4. Cleans whitespace
   - Used for: Processing ground truth answers from dataset
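The hash extractor described above is not defined in the code cells of this section, so the following is a minimal sketch of the listed steps. The sample solution string is a made-up example in the GSM8K style.

```python
from typing import Optional

def extract_hash_answer(text: str) -> Optional[str]:
    """Return the text after the '####' delimiter, or None if it is absent."""
    if "####" not in text:
        return None
    # Split once on the first delimiter and keep everything after it.
    return text.split("####", 1)[1].strip()

sample = "She sold 48 + 24 = 72 clips in total.\n#### 72"
print(extract_hash_answer(sample))
print(extract_hash_answer("No delimiter here"))
```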

### Dataset Loading Function

`get_OpenR1220k_questions` (formerly `get_gsm8k_questions`)
- Purpose: Prepares the OpenR1-Math-220k dataset for GRPO training
- Parameters:
  - `split`: Dataset split or slice (for example `'train'` or `'train[:2%]'`)
- Processing steps:
  1. Loads raw OpenR1-Math data
  2. Transforms each example into training format:
     - Adds system prompt with format instructions
     - Includes the problem as the user message
     - Wraps the reference solution and final answer in `<reasoning>` and `<answer>` tags
- Output format:
  ```python
  {
      'prompt': [
          {'role': 'system', 'content': format_instructions},
          {'role': 'user', 'content': math_problem}
      ],
      'answer': xml_reasoning_and_answer
  }
  ```

### Type Checking Notes
- Uses `# type: ignore` to suppress mypy warnings
- Maintains type hints for function signatures
- Ensures type safety where possible

In [5]:
import re
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

def get_OpenR1220k_questions(split: str = "train") -> Dataset:
    """
    Loads and preprocesses the OpenR1-Math-220k dataset into a format suitable
    for GRPO training. Removes references to GSM8K or ####.

    Args:
        split (str): 'train', 'test', or a subset (e.g. 'train[:2%]')

    Returns:
        Dataset: A processed dataset with 'prompt' (list of messages) and
                 'answer' (chain-of-thought + final answer in XML tags).
    """
    # 1) Load the specific split from OpenR1-Math-220k
    #    This dataset has 'problem', 'solution', and 'answer' fields (not 'question').
    data = load_dataset("open-r1/OpenR1-Math-220k", split=split)  # type: ignore

    # 2) Convert each record into chat-style prompt + XML-based chain-of-thought
    def transform_record(x):
        problem_text = x.get("problem", "")
        reasoning_text = x.get("solution", "")
        final_answer = x.get("answer", "")

        # Build system + user messages
        prompt_list = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": problem_text},
        ]

        # If you'd like to store the chain-of-thought + final answer in one field:
        xml_output = (
            f"<reasoning>\n{reasoning_text}\n</reasoning>"
            f"\n<answer>\n{final_answer}\n</answer>"
        )

        # Return the needed fields
        return {
            "prompt": prompt_list,
            "answer": xml_output
        }

    # 3) Apply the transform to all samples
    data = data.map(transform_record)  # type: ignore

    return data  # type: ignore

# Example usage:
# dataset = get_OpenR1220k_questions("train")
# print(dataset[0]["prompt"])
# print(dataset[0]["answer"])






# Understanding GRPO Reward Functions

## Overview of the Reward System
The training pipeline uses multiple reward functions to shape the model's behavior, each focusing on different aspects of the desired output. Together they can provide up to 4.0 points per response (2.0 for correctness plus four 0.5-point format checks), balanced across correctness and formatting criteria.

## Primary Reward Functions

### 1. Correctness Reward
- **Main Purpose**: Evaluates answer accuracy
- **Maximum Reward**: 2.0 points
- **Evaluation Process**:
  - Extracts model's answer from XML format
  - Compares with ground truth
  - Provides debugging output showing:
    - Original question
    - Expected answer
    - Full model response
    - Extracted answer
- **Scoring**: Binary reward (2.0 or 0.0)

### 2. Integer Format Reward
- **Main Purpose**: Ensures numerical responses
- **Maximum Reward**: 0.5 points
- **Evaluation Process**:
  - Checks if extracted answer is purely numerical
  - Validates digit-only responses
- **Importance**: Critical for mathematical problem-solving
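A digit-only check is stricter than "numerical" might suggest. The sketch below uses Python's `str.isdigit()`, the same test the notebook's integer reward relies on, to show which answers would and would not earn the 0.5 points; the helper name `int_reward` is hypothetical.

```python
def int_reward(extracted: str) -> float:
    """0.5 if the extracted answer consists only of digits, else 0.0."""
    return 0.5 if extracted.isdigit() else 0.0

# Negative numbers, decimals, padded strings, and empty answers all score 0.0.
for candidate in ["42", "-3", "3.5", " 42 ", ""]:
    print(repr(candidate), int_reward(candidate))
```

In practice this means answers such as fractions, negative values, or LaTeX expressions receive no integer-format reward even when they are correct.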

## Formatting Reward Functions

### 3. Strict Format Verification
- **Maximum Reward**: 0.5 points
- **Requirements**:
  - Exact newline placement
  - Precise XML tag structure
  - Complete format compliance
- **Evaluation**: Uses rigid regular expression pattern
- **Purpose**: Maintains consistent response structure

### 4. Soft Format Verification
- **Maximum Reward**: 0.5 points
- **Flexibility**:
  - Allows variable whitespace
  - More forgiving tag placement
  - Maintains basic structure requirements
- **Purpose**: Backup formatting enforcement

## Detailed XML Structure Evaluation

### 5. XML Component Scoring
- **Maximum Total**: 0.5 points
- **Individual Components**:
  - Opening reasoning tag (0.125)
  - Closing reasoning tag (0.125)
  - Opening answer tag (0.125)
  - Closing answer tag (0.125)
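- **Penalty System**:
  - Deducts 0.001 per character of text that follows the closing answer tag
  - Maintains cleanliness of response

### 6. Comprehensive XML Evaluation
- **Purpose**: Applies the component scoring above (`count_xml`) to every completion in a batch (`xmlcount_reward_func`)
- **Process**: Scores each completion independently and returns one reward per completion
- **Importance**: Provides granular, partial-credit feedback even when the full format is not yet correct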

## Combined Impact on Training

### Total Reward Breakdown
1. Answer Correctness: 2.0 points
2. Numerical Format: 0.5 points
3. Strict Formatting: 0.5 points
4. Soft Formatting: 0.5 points
5. XML Structure: Up to 0.5 points
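As a quick arithmetic check, the five maxima listed above can be summed directly. The dictionary keys are just descriptive labels chosen for this sketch.

```python
# Maximum reward available from each component, as listed in the breakdown.
MAX_REWARDS = {
    "answer_correctness": 2.0,
    "numerical_format": 0.5,
    "strict_formatting": 0.5,
    "soft_formatting": 0.5,
    "xml_structure": 0.5,
}

max_total = sum(MAX_REWARDS.values())
correctness_share = MAX_REWARDS["answer_correctness"] / max_total
print(max_total, correctness_share)
```

Correctness accounts for half of the available reward, so a perfectly formatted but wrong response can earn at most the other half.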

### Training Objectives
- **Primary Goal**: Correct mathematical reasoning
- **Secondary Goals**:
  - Clean, consistent formatting
  - Proper XML structure
  - Numerical answer provision
  - Clear solution presentation

### Behavioral Shaping
- Encourages step-by-step reasoning
- Promotes clear answer presentation
- Maintains consistent response structure
- Ensures numerical output format

This comprehensive reward system creates a balanced training signal that shapes the model's behavior across multiple dimensions, ensuring both accurate mathematical reasoning and clear, structured responses.


In [6]:
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """
    Checks if the extracted final answer matches the reference exactly.
    Returns 2.0 if correct, else 0.0.

    Args:
      prompts:     List of prompt strings (one prompt per sample).
      completions: List of model outputs (one string per sample).
      answer:      List of reference answers (one string per sample).
    """
    # Because we pass one sample at a time, we have equal lengths
    # Or if batched, we can still zip them up
    # Example: prompts[i] is a string, completions[i] is a string, answer[i] is a string.

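    rewards = []
    for p, c, a in zip(prompts, completions, answer):
        extracted = extract_xml_answer(c)
        # The dataset's 'answer' field holds the full XML (reasoning + answer),
        # so compare against the reference's <answer> block; fall back to the
        # raw string when the reference carries no tags.
        reference = extract_xml_answer(a) or a.strip()
        # debug print
        print('-'*20,
              f"\nPrompt:\n{p}",
              f"\nRefAnswer:\n{a}",
              f"\nModelOutput:\n{c}",
              f"\nExtracted:\n{extracted}")
        if extracted == reference:
            rewards.append(2.0)
        else:
            rewards.append(0.0)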

    return rewards

def int_reward_func(completions, **kwargs) -> list[float]:
    """
    Simple numeric check: if the extracted <answer> is purely digit-based,
    return 0.5, else 0.0.
    """
    rewards = []
    for c in completions:
        extracted = extract_xml_answer(c)
        rewards.append(0.5 if extracted.isdigit() else 0.0)
    return rewards

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Checks if the entire completion matches a strict pattern:
    <reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n
    """
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n?$"
    # Using ? for optional final newline
    # Also consider re.DOTALL so that '.' matches newlines
    compiled_pattern = re.compile(pattern, flags=re.DOTALL)

    rewards = []
    for c in completions:
        if compiled_pattern.match(c):
            rewards.append(0.5)
        else:
            rewards.append(0.0)
    return rewards

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """
    More lenient pattern check: must contain <reasoning>...</reasoning> then <answer>...</answer>.
    """
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    compiled_pattern = re.compile(pattern, flags=re.DOTALL)

    rewards = []
    for c in completions:
        if compiled_pattern.search(c):
            rewards.append(0.5)
        else:
            rewards.append(0.0)
    return rewards

def count_xml(text: str) -> float:
    """
    Score based on presence/structure of <reasoning> and <answer> tags,
    with a small penalty for trailing text.
    """
    score = 0.0

    # Basic presence checks:
    if "<reasoning>\n" in text:
        score += 0.125
    if "\n</reasoning>\n" in text:
        score += 0.125
    if "\n<answer>\n" in text:
        score += 0.125
        # penalize extra text after </answer>\n
        # split on the final close tag
        if "\n</answer>\n" in text:
            after = text.split("\n</answer>\n")[-1]
            score -= len(after) * 0.001
    if "\n</answer>" in text:
        score += 0.125
        # again penalize trailing text after close
        after = text.split("\n</answer>")[-1]
        score -= (len(after) - 1) * 0.001

    return score

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    """
    Applies count_xml to each completion string.
    """
    return [count_xml(c) for c in completions]







# Deep Dive: GRPO Training Arguments Analysis

## Learning Rate (5e-6)
The learning rate of 5e-6 (0.000005) represents the step size in the gradient descent optimization process.

**Technical Details**:
- In standard SGD training of neural networks, learning rates often range from 1e-1 to 1e-3
- For LLM fine-tuning, we use much smaller rates (1e-5 to 1e-6) due to:
  1. Pre-trained model weights already encode complex patterns
  2. Large parameter count (500M in this case) means small changes propagate significantly
  3. Risk of "catastrophic forgetting" where new learning overwrites important pre-trained knowledge

**Practical Basis**:
- RL fine-tuning recipes for LLMs commonly use learning rates between 1e-6 and 1e-5; larger rates tend to destabilize training
- 5e-6 is a reasonable starting point, and it should be tuned for your model and reward functions

## Adam Optimizer Parameters
### Beta1 (0.9)
Decay rate for the first-moment (mean gradient) estimate in Adam optimization.

**Technical Significance**:
- Controls exponential decay rate for momentum estimation
- 0.9 means each gradient update considers ~10 previous gradients
- Theoretical basis from Kingma & Ba's original Adam paper (2014)
- Higher values (>0.9) can:
  1. Lead to oscillation in loss landscape
  2. Miss fine-grained features in optimization space

### Beta2 (0.99)
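Decay rate for the second-moment (squared-gradient) estimate in Adam optimization.

**Technical Significance**:
- Controls how quickly the running estimate of gradient variance decays
- 0.99 averages over roughly the last 100 squared gradients, a shorter memory than Adam's default of 0.999
- Rules of thumb for LLM training:
  1. Much lower values make the variance estimate noisy, which can destabilize training
  2. Values very close to 1 react slowly to changes in gradient scale
  3. 0.99 balances stability and training speed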
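The roles of the two beta values can be seen in a single-parameter sketch of the Adam update from Kingma & Ba. This is a plain-Python illustration using the values discussed above, not the optimizer implementation TRL or PyTorch uses; the function name `adam_step` and the toy objective are assumptions.

```python
import math

def adam_step(param, grad, m, v, step, lr=5e-6, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** step)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Rough averaging window of an exponential moving average: 1 / (1 - beta).
window_beta1 = round(1 / (1 - 0.9))    # about 10 gradients
window_beta2 = round(1 / (1 - 0.99))   # about 100 squared gradients

# Minimize f(p) = p**2 for a few steps, starting from p = 1.0.
param, m, v = 1.0, 0.0, 0.0
for step in range(1, 4):
    grad = 2.0 * param
    param, m, v = adam_step(param, grad, m, v, step)
print(window_beta1, window_beta2, param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of gradient scale.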



## Learning Rate Scheduler (Cosine)

**Smoothing the Rate:**
The cosine schedule smoothly transitions learning rates with the formula:
$$ \text{lr}_t = \text{lr} \times 0.5 \times \left(1 + \cos\left(\pi \times \frac{\text{current\_step}}{\text{total\_steps}}\right)\right) $$
This reduces oscillation and aids convergence.
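The formula translates directly into a few lines of Python. This sketch ignores warmup and assumes the base learning rate of 5e-6 discussed earlier; the function name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-6) -> float:
    """Cosine-decayed learning rate at a given step (no warmup)."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

total_steps = 100
for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, total_steps))
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches zero at the final step.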

**Benefits:**
- **Smooth Transition:** Avoids abrupt rate drops.
- **Initial Progress:** Faster early learning.
- **Better Convergence:** Fine-tunes weights effectively.

**In Practice:**
Cosine decay is a widely used default for transformer training and is often preferred over step decay because it avoids sudden drops in the learning rate.

---

## Precision Settings (bfloat16)

**Efficient Computation:**
bfloat16 balances precision and speed. It offers a higher exponent range than FP16, aiding numerical stability.

### **Comparison:**
- **vs. FP32:** HALF the memory and faster.
- **vs. FP16:** MORE stability, LESS precision.
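The range-versus-precision trade-off can be demonstrated with the standard library alone. In this sketch, bfloat16 is emulated by rounding a float32 bit pattern to its top 16 bits (round to nearest, ties to even), and FP16 uses the `struct` module's half-precision format; the helper names are hypothetical.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a value to bfloat16 precision by keeping the top 16 bits of its float32 form."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp16(x: float) -> float:
    """Round a value to IEEE half precision (raises if it is out of range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Precision: FP16 keeps more mantissa bits near 1.0 than bfloat16.
print(to_fp16(1.001), to_bf16(1.001))

# Range: 70000 exceeds FP16's maximum (65504) but fits easily in bfloat16.
try:
    to_fp16(70000.0)
    fp16_overflows = False
except (OverflowError, struct.error):
    fp16_overflows = True
print(fp16_overflows, to_bf16(70000.0))
```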

**Hardware Edge:**
Supported natively on NVIDIA Ampere and newer GPUs, where it reduces memory bandwidth needs compared with FP32.

---

## Batch Configuration

### **Per Device Train Batch Size (1)**

**Effective Use of Memory:**
Limits to 1 example per forward pass for memory efficiency and larger context windows.

### **Gradient Accumulation Steps (4)**

**Simulated Larger Batches:**
Accumulates over 4 steps to mimic a batch size of 4, enhancing stability.
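A toy example shows why accumulating scaled gradients over 4 single-example steps matches one batch of 4. The one-parameter model, loss, and data points here are invented for illustration.

```python
def grad_mse(w: float, x: float, y: float) -> float:
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x

samples = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]  # (x, y) pairs
w = 0.0
per_device_batch_size = 1
accumulation_steps = 4

# Accumulate one example at a time, scaling each gradient by 1 / accumulation_steps.
accumulated = 0.0
for x, y in samples:
    accumulated += grad_mse(w, x, y) / accumulation_steps

# Reference: the mean gradient over the whole batch of 4 at once.
full_batch = sum(grad_mse(w, x, y) for x, y in samples) / len(samples)

effective_batch_size = per_device_batch_size * accumulation_steps
print(accumulated, full_batch, effective_batch_size)
```

Only one example's activations need to be in memory at a time, while the optimizer step still uses the averaged gradient.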

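**Note:**
Accumulation trades extra forward and backward passes for memory: only one example's activations are held at a time, while each optimizer step still uses a gradient averaged over 4 examples.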

---

## Sequence Length Parameters

### **Max Prompt Length (256)**

**Balanced Context and Compute:**
 Limits to 256 tokens due to quadratic attention scaling. Suitable for most math problems.

### **Max Completion Length (200)**

**Reasoning Without Overstepping:**
Capped at 200 to allow detailed reasoning without excessive generation.

---

## Training Duration Parameters

### **Number of Training Epochs (1)**

**Quick Adaptation:**
One pass prevents overfitting and retains general skills, effective for RL policy adaptation.

**Research Context:**
Quick adaptation studied by Microsoft and Anthropic.

---

## Gradient and Memory Management

### **Max Gradient Norm (0.1)**

**Stable Updates:**
Caps the global gradient norm at 0.1 to prevent exploding updates, which is crucial for RL.

**Mathematical Impact:**
Ties to policy gradient variance and trust region methods.

---

## Hardware Utilization

### **vLLM Configuration**

**Efficient Memory Use:**
Reserves 30% of GPU memory for the vLLM engine (its copy of the model weights, activations, and the KV cache), leaving the remainder for training.

### **Device Specification (cuda:0)**

**Single GPU Optimized:**
Reduces communication overhead, maximizing bandwidth.

---

## Logging and Monitoring

**Simplified Setup:**
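Setting `report_to="none"` disables external loggers such as Weights & Biases, reducing I/O and dependencies. The configuration cell in this notebook sets `report_to="wandb"`, so this run does log to Weights & Biases.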

---

## Tokenizer Configuration

**Consistent Padding:**
Matches pad_token and eos_token for uniform behavior, aligning with HuggingFace practices.

---

This structured approach offers a deep understanding of each parameter, linking them to broader concepts and research contexts.







# Comprehensive Analysis of GRPO Training Parameters

## Weight Decay (0.1)
Regularization strength that shrinks parameter magnitudes at every step (applied as decoupled weight decay when the optimizer is AdamW).

**Technical Significance**:
- Higher than typical weight decay (usually 0.01-0.001) because:
  1. Helps prevent overfitting in low-data regime
  2. Maintains model's general capabilities while learning new tasks
  3. Acts as implicit early stopping mechanism

**Research Context**:
- Google's T5 paper showed higher weight decay (0.1) improved generalization
- OpenAI's fine-tuning studies indicate stronger regularization needed for instruction tuning
- Anthropic's research suggests correlation between weight decay and model calibration
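
As a rough illustration of how strong a weight decay of 0.1 is at this learning rate, the sketch below applies only the decay term of a decoupled (AdamW-style) update. It assumes the trainer uses AdamW, ignores the gradient term entirely, and holds the learning rate at its peak value, so it overstates the shrinkage of a real run. `apply_weight_decay` is a hypothetical helper name.

```python
def apply_weight_decay(weight, lr, weight_decay):
    """Decoupled weight decay: shrink the weight toward zero, independent of the gradient."""
    return weight * (1.0 - lr * weight_decay)

lr = 5e-6            # learning_rate from the config
weight_decay = 0.1   # weight_decay from the config

w = 1.0
for _ in range(234):  # optimizer steps reported by the run in this notebook
    w = apply_weight_decay(w, lr, weight_decay)

print(f"per-step shrink factor: {1.0 - lr * weight_decay}")
print(f"weight after 234 decay-only steps: {w:.6f}")
```

Because the decay is multiplied by the learning rate, even a "high" value of 0.1 only nudges the weights slightly over a short run.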

## Warmup Ratio (0.1)
Fraction of total training steps used for learning rate warmup.

**Technical Details**:
- 10% of total steps use gradually increasing learning rate because:
  1. Prevents early training instability
  2. Allows model to adjust to new data distribution
  3. Particularly important with Adam optimizer due to early variance estimation

**Mathematical Basis**:
- Related to eigenspectrum of Hessian matrix
- Helps avoid poor early optimization trajectories
- Research shows correlation with batch size (larger batches need longer warmup)

## Learning Rate Scheduler (Cosine)
Controls learning rate decay pattern throughout training.

**Technical Implementation**:
- Follows cosine function: lr * 0.5 * (1 + cos(π * current_step / total_steps))
- Benefits over linear or step decay:
  1. Smooth transition between learning rates
  2. Faster initial progress
  3. Better final convergence properties

**Research Support**:
- DeepMind's Transformer papers show superior performance vs step decay
- Google Brain's extensive LR schedule comparisons
- Particularly effective with Adam optimizer in LLM context
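
The warmup and cosine settings combine into a single schedule. The sketch below is a minimal pure-Python version, assuming linear warmup over `ceil(total_steps * warmup_ratio)` steps followed by cosine decay measured over the remaining steps. `lr_at_step` is a hypothetical helper, not the scheduler the trainer actually instantiates.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_ratio):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 234  # global_step reported by the run in this notebook
for step in (0, 12, 24, 129, 234):
    lr = lr_at_step(step, total_steps, peak_lr=5e-6, warmup_ratio=0.1)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The rate climbs during the first tenth of training, peaks at `5e-6`, and then follows the cosine curve down to zero at the final step.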

## Precision Settings (bf16=True)
Uses Brain Float 16 format for computations.

**Technical Details**:
- Compared to FP16:
  1. Larger dynamic range (8 exponent bits vs 5)
  2. Lower precision mantissa (7 bits vs 10)
  3. Better numerical stability
- Compared to FP32:
  1. Half the memory usage
  2. Faster computation on modern GPUs
  3. Sufficient precision for LLM fine-tuning

**Hardware Considerations**:
- Natively supported on NVIDIA Ampere and newer architectures
- Reduces memory bandwidth requirements
- Enables larger effective batch sizes
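
The range-versus-precision trade-off can be checked with the standard library alone. The sketch below emulates bfloat16 by keeping the top 16 bits of a float32 encoding (1 sign, 8 exponent, 7 mantissa bits) and uses `struct`'s half-precision format for FP16. Truncation is used instead of round-to-nearest to keep the code short, so this is an approximation of real hardware behavior; the helper names are hypothetical.

```python
import struct

def to_bfloat16(x):
    """Truncate a float32 encoding to its top 16 bits (1 sign, 8 exponent, 7 mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_float16(x):
    """Round-trip through IEEE half precision (1 sign, 5 exponent, 10 mantissa)."""
    return struct.unpack(">e", struct.pack(">e", x))[0]

def fits_in_float16(x):
    """True if x is within FP16 range; struct raises OverflowError otherwise."""
    try:
        struct.pack(">e", x)
        return True
    except OverflowError:
        return False

# Precision: FP16 keeps more digits near 1.0 than bfloat16.
print(to_bfloat16(1.001), to_float16(1.001))

# Range: a large value survives in bfloat16 but overflows FP16.
print(to_bfloat16(1e20), fits_in_float16(1e20))
```

Values near 1.0 lose more digits in bfloat16, while large magnitudes that overflow FP16 remain representable, which is the stability argument made above.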

## Batch Configuration
### Per Device Train Batch Size (1)
**Technical Rationale**:
- Single example per forward pass because:
  1. Minimizes activation memory per step, leaving room for model weights and optimizer state
  2. Gradient noise is higher at batch size 1, which gradient accumulation (below) offsets
  3. Allows larger context windows

### Gradient Accumulation Steps (4)
**Implementation Details**:
- Accumulates gradients over 4 forward passes because:
  1. Simulates larger batch size (effective batch size = 4)
  2. Reduces memory requirements
  3. Improves training stability

**Research Context**:
- Microsoft's DeepSpeed findings on gradient accumulation
- Relationship with optimizer state memory
- Impact on effective batch size calculations
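
The equivalence between accumulation and a larger batch can be verified on a toy problem. This is a minimal sketch with a one-parameter model and made-up `(x, y)` pairs; it assumes each micro-batch gradient is scaled by `1 / accumulation_steps` before being added.

```python
def grad_mse(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x

w = 0.5
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]  # made-up (x, y) pairs

# One pass over a batch of 4: average the per-example gradients.
full_batch_grad = sum(grad_mse(w, x, y) for x, y in batch) / len(batch)

# Four micro-batches of size 1: scale each gradient and accumulate.
accumulation_steps = 4
accumulated = 0.0
for x, y in batch:
    accumulated += grad_mse(w, x, y) / accumulation_steps

print(full_batch_grad, accumulated)
```

Both paths produce the same gradient, so the optimizer sees a batch of 4 while only one example's activations are held in memory at a time.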

## Generation Parameters
### Number of Generations (16)
**Technical Significance**:
- Multiple generations per prompt because:
  1. Enables exploration of response space
  2. Reduces variance in reward estimation
  3. Improves policy gradient estimation

**Statistical Basis**:
- Minimum samples needed for reliable policy gradient
- Trade-off between computation and estimation quality
- Impact on reward variance reduction
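
GRPO scores each completion relative to the other completions sampled for the same prompt, which is why several generations per prompt are needed. The sketch below assumes the usual normalization, (reward - group mean) / (group std + eps). The use of the sample standard deviation, the `eps` value, and the reward values are illustrative assumptions.

```python
import statistics

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical total rewards for the 16 completions of a single prompt.
rewards = [0.0] * 10 + [0.5] * 4 + [2.5] * 2
advantages = group_advantages(rewards)
print([round(a, 2) for a in advantages])

# If every completion earns the same reward, all advantages are zero
# and the prompt contributes no learning signal.
print(group_advantages([0.5] * 16))
```

Completions that beat their group average get positive advantages and are reinforced; a group with identical rewards yields no gradient at all.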


## Sequence Length Parameters

### Max Prompt Length (256)
**Technical Implementation**:
- Limits input sequence to 256 tokens because:
  1. Memory scales quadratically with sequence length (attention mechanism)
  2. Most math problems fit within this window
  3. Balances context window with computational efficiency

**Research Considerations**:
- Transformer attention complexity: O(n²)
- Token distribution analysis of GSM8K dataset
- Memory vs context window trade-offs
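
A quick calculation shows what quadratic scaling means for the lengths used here. The sketch counts entries in a single head's attention score matrix for a full pass over the sequence. It is a simplification: memory-efficient attention kernels avoid materializing this matrix, so treat the numbers as a scaling illustration rather than a memory estimate.

```python
def attention_score_entries(seq_len):
    """Entries in one attention head's score matrix for a full pass over seq_len tokens."""
    return seq_len * seq_len

prompt_len, completion_len = 256, 200  # max_prompt_length, max_completion_length
full_len = prompt_len + completion_len
for n in (prompt_len, full_len, 2 * full_len):
    print(f"{n:>4} tokens -> {attention_score_entries(n):>9,} score entries per head")
```

Doubling the sequence length quadruples the number of score entries.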

### Max Completion Length (200)
**Technical Rationale**:
- Caps generation length at 200 tokens because:
  1. Sufficient for step-by-step reasoning
  2. Prevents runaway generations
  3. Optimizes inference speed

**Empirical Basis**:
- Analysis of solution length distribution in GSM8K
- Memory requirements for sampling 16 completions per prompt
- Impact on generation quality vs speed

## Training Duration Parameters

### Number of Training Epochs (1)
**Technical Significance**:
- Single pass through dataset because:
  1. Prevents overfitting on limited math examples
  2. Maintains general capabilities
  3. Sufficient for policy adaptation with RL

**Research Context**:
- Microsoft's findings on instruction fine-tuning
- OpenAI's studies on few-shot adaptation
- Anthropic's work on minimal fine-tuning

### Save Steps (100)
**Implementation Details**:
- Checkpoints every 100 steps because:
  1. Balances storage requirements
  2. Provides sufficient granularity for model selection
  3. Enables training recovery

**Practical Considerations**:
- Disk space requirements
- Checkpoint loading time
- Training resumption capabilities

## Gradient and Memory Management

### Max Gradient Norm (0.1)
**Technical Depth**:
- Clips gradient norm at 0.1 because:
  1. Prevents explosive gradients in RL setting
  2. Maintains stable policy updates
  3. Critical for convergence with policy gradients

**Mathematical Basis**:
- Relationship to policy gradient variance
- Impact on Wasserstein distance between policy updates
- Connection to trust region methods
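
Clipping by global norm rescales every gradient by the same factor whenever the combined norm exceeds the limit. Below is a minimal pure-Python sketch of that rule with made-up gradient values; `clip_by_global_norm` is a hypothetical helper, not the trainer's own routine.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [0.3, -0.4, 1.2]  # made-up gradient values
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.1)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")
```

The direction of the update is preserved; only its length is capped, which bounds how far a single step can move the policy.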

## Hardware Utilization Parameters

### vLLM Configuration
**Technical Implementation**:
- GPU Memory Utilization (0.3 or 30%):
  1. Reserves memory for:
     - KV cache management
     - Dynamic batch processing
     - Continuous batching overhead
  2. Optimizes for:
     - PagedAttention mechanism
     - Inference throughput
     - Training stability

**Research Basis**:
- vLLM paper's memory analysis
- Empirical studies on GPU memory management
- Trade-offs between serving and training
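
The 30% setting can be checked against the vLLM startup log printed by the training cell in this notebook. The sketch below reproduces that arithmetic with numbers copied from the log. The 16-tokens-per-block figure is an assumption (it is not shown in the log) chosen because it reproduces the reported concurrency.

```python
total_gpu_memory_gib = 39.56    # total_gpu_memory from the vLLM log
gpu_memory_utilization = 0.30   # vllm_gpu_memory_utilization

vllm_budget_gib = total_gpu_memory_gib * gpu_memory_utilization
model_weights_gib = 0.93
non_torch_gib = 0.09
activation_peak_gib = 1.44
kv_cache_gib = (vllm_budget_gib - model_weights_gib
                - non_torch_gib - activation_peak_gib)

print(f"vLLM budget: {vllm_budget_gib:.2f} GiB")
print(f"KV cache:    {kv_cache_gib:.2f} GiB")

# Assumption: 16 tokens per KV-cache block.
cuda_blocks = 51373
tokens_per_block = 16
max_seq_len = 32768
concurrency = cuda_blocks * tokens_per_block / max_seq_len
print(f"max concurrency at {max_seq_len} tokens: {concurrency:.2f}x")
```

Most of the 30% budget ends up as KV cache because the 0.5B model's weights take under 1 GiB.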

### Device Specification (cuda:0)
**Technical Details**:
- Primary GPU designation because:
  1. Optimizes for single-GPU training
  2. Reduces communication overhead
  3. Maximizes memory bandwidth utilization

**Hardware Considerations**:
- PCIe bandwidth utilization
- CUDA stream management
- Memory transfer optimization

## Logging and Monitoring

### Log on Each Node (False)
**Technical Rationale**:
- Disables distributed logging because:
  1. Single-GPU setup
  2. Reduces I/O overhead
  3. Simplifies log analysis

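### Reporting Configuration (`report_to`)
**Implementation Details**:
- Setting `report_to="none"` disables Weights & Biases, which:
  1. Reduces network overhead
  2. Minimizes external dependencies
  3. Focuses on local performance analysis
- The configuration cell in this notebook sets `report_to="wandb"` instead, so this run logs to Weights & Biases.

**Practical Impact** (with reporting disabled):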
- Reduced training overhead
- Simplified debugging
- Local-only experiment tracking

## Tokenizer Configuration
**Technical Significance**:
- Setting pad_token = eos_token because:
  1. Ensures consistent padding behavior
  2. Maintains model's probability distribution
  3. Critical for batch processing

**Research Context**:
- HuggingFace's tokenizer best practices
- Impact on attention mask computation
- Relationship to model architecture
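
To show why a pad token is needed for batching, here is a minimal sketch that pads token-id lists to a common length and builds the matching attention mask. The ids are made up, `EOS_ID` is a hypothetical placeholder for `tokenizer.eos_token_id`, and right padding is used only for illustration (the padding side is a tokenizer setting).

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

EOS_ID = 0  # hypothetical id; the real value comes from tokenizer.eos_token_id
ids, mask = pad_batch([[5, 6, 7], [8, 9]], pad_id=EOS_ID)
print(ids)
print(mask)
```

Because padded positions are masked out, reusing the EOS id for padding does not change what the model attends to.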





In [7]:
%pip install wandb



In [8]:
import wandb
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: christian-cooper-us to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [9]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

output_dir="outputs/Qwen2.5-0.5B-OpenR1"
run_name="Qwen2.5-0.5B-GRPO-OpenR1"

# Initialize wandb first
import wandb
wandb.init(
    project="qwen-OpenR1math-grpo",
    entity="christian-cooper-us",
    config={
        "learning_rate": 5e-6,
        "adam_beta1": 0.9,
        "adam_beta2": 0.99,
        "weight_decay": 0.1,
        "warmup_ratio": 0.1,
        "batch_size": 1,
        "num_train_epochs": 1,
        "run_name": run_name
    }
)


training_args = GRPOConfig(
    output_dir=output_dir,
    run_name=run_name,
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    logging_steps=1,
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,
    num_train_epochs=1,
    save_steps=100,
    max_grad_norm=0.1,
    log_on_each_node=False,
    use_vllm=True,
    vllm_gpu_memory_utilization=.3,
    vllm_device="cuda:0",
    report_to="wandb", #wandb is enabled.
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=None
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

And launch the actual training:

In [10]:
# use peft at your own risk; not working for me with multi-GPU training
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,  # <-- NOT tokenizer=tokenizer
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func
    ],
    args=training_args,
    train_dataset=ds,
)
trainer.train()



INFO 02-12 22:45:48 config.py:542] This model supports multiple tasks: {'embed', 'classify', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
INFO 02-12 22:45:48 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, n

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-12 22:45:51 model_runner.py:1115] Loading model weights took 0.9279 GB
INFO 02-12 22:45:52 worker.py:267] Memory profiling takes 1.09 seconds
INFO 02-12 22:45:52 worker.py:267] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.30) = 11.87GiB
INFO 02-12 22:45:52 worker.py:267] model weights take 0.93GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.44GiB; the rest of the memory reserved for KV Cache is 9.41GiB.
INFO 02-12 22:45:52 executor_base.py:110] # CUDA blocks: 51373, # CPU blocks: 21845
INFO 02-12 22:45:52 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 25.08x
INFO 02-12 22:45:56 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_u

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:29<00:00,  1.20it/s]

INFO 02-12 22:46:25 model_runner.py:1562] Graph capturing finished in 29 secs, took 0.16 GiB
INFO 02-12 22:46:25 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 34.15 seconds





-------------------- 
Prompt:

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

6. Let set $A=\{1,2,3,4,5,6\}$, and a one-to-one mapping $f: A \rightarrow A$ satisfies that for any $x \in A$, $f(f(f(x)))$ $=x$. Then the number of mappings $f$ that satisfy the above condition is ( ).
(A) 40
(B) 41
(C) 80
(D) 81 
RefAnswer:
<reasoning>
6.D.

If there exists $x \in A$ such that $f(f(x))=x, f(x) \neq x$, then $f(f(f(x)))=f(x) \neq x$, which contradicts the given condition. Therefore, for any $x \in A$, either $f(x)=x$, or $f(x)=x_{1}, f(x_{1})=x_{2}, f(x_{2})=x$, and $x, x_{1}, x_{2}$ are distinct. Hence, there are only the following three scenarios:
(1) For any $x \in A, f(x)=x$, there is only 1 such $f$;
(2) $f$ contains a cycle $a \rightarrow b \rightarrow c \rightarrow a$, and for other elements $x=a^{\prime}, b^{\prime}, c^{\prime}$, $f(x)=x$, there are $\mathrm{C}_{6}^{3}(3-1)!=40$ such $f$;
(3) $f$ contains two cycles $a \rightarrow b \rightarrow

Step,Training Loss
1,0.0
2,0.0
3,0.0
4,0.0
5,0.0
6,0.0
7,0.0
8,0.0
9,0.0
10,0.0


Streaming output truncated to the last 5000 lines.
-------------------- 
Prompt:

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

16. In an $\mathrm{m}$-row $\mathrm{n}$-column grid, it is stipulated: the horizontal rows from top to bottom are sequentially the 1st row, the 2nd row, ..., and the vertical columns from left to right are sequentially the 1st column, the 2nd column, .... The point $(\mathrm{a}, \mathrm{b})$ represents the grid point located in the $\mathrm{a}$-th row and the $\mathrm{b}$-th column. Figure 7 is a 4-row 5-column grid. Starting from point $\mathrm{A}(2,3)$, following the "ri" (日) character movement of a knight in Chinese chess, it can reach the grid points B (1, 1), C (3, 1), D $(4,2)$, E $(4,4)$, F $(3,5)$, G $(1,5)$. If in a 9-row 9-column grid (Figure 8), starting from point $(1,1)$, following the "ri" (日) character movement of a knight in Chinese chess,
(1) Can it reach every grid point in the grid?
Answe

TrainOutput(global_step=234, training_loss=0.017511730645846678, metrics={'train_runtime': 905.8025, 'train_samples_per_second': 1.034, 'train_steps_per_second': 0.258, 'total_flos': 0.0, 'train_loss': 0.017511730645846678})




# GRPO Training Execution Analysis

## Final Training Setup and Execution

### Core Components of the Training Pipeline

1. **Model Configuration**
```python
trainer = GRPOTrainer(
    model=model,                    # Qwen2.5 0.5B model
    processing_class=tokenizer,     # Tokenizer for text processing
    reward_funcs=[...],            # Multiple reward functions
    args=training_args,            # Training configuration
    train_dataset=ds               # OpenR1-Math dataset
)
```

### Reward Function Order and Significance
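The trainer adds up the scores from all reward functions for each completion, so the list order does not change the training signal; what matters is each function's maximum reward: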
1. `xmlcount_reward_func`: Base format validation (0.5 max)
   - Provides granular feedback on XML structure
   - Acts as foundation for format learning

2. `soft_format_reward_func`: Lenient structure check (0.5 max)
   - Allows flexibility in formatting
   - Prevents over-penalization

3. `strict_format_reward_func`: Rigid format enforcement (0.5 max)
   - Ensures exact formatting compliance
   - Critical for consistent outputs

4. `int_reward_func`: Numerical validation (0.5 max)
   - Rewards answers that are numeric, whether or not they are correct
   - Nudges the model toward well-formed numerical answers

5. `correctness_reward_func`: Answer accuracy (2.0 max)
   - Primary learning signal
   - Highest reward weight
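
Each completion's reward is the sum of the individual function scores. The sketch below illustrates that with two simplified, hypothetical stand-ins for the notebook's reward functions. The regular expressions and sample completions are illustrative and are not the definitions used earlier in the notebook.

```python
import re

# Hypothetical stand-ins for two of the notebook's reward functions.
def format_reward(text):
    """0.5 if the text has a <reasoning> block followed by an <answer> block."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, text, flags=re.DOTALL) else 0.0

def correctness_reward(text, expected):
    """2.0 if the extracted answer equals the expected answer."""
    match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 2.0 if answer == expected else 0.0

def total_reward(text, expected):
    """Sum of the individual scores; the order of the terms does not matter."""
    return format_reward(text) + correctness_reward(text, expected)

good = "<reasoning>\n3 + 4 = 7\n</reasoning>\n<answer>\n7\n</answer>"
wrong = "<reasoning>\n3 + 4 = 8\n</reasoning>\n<answer>\n8\n</answer>"
unformatted = "The answer is 7."
for name, text in [("good", good), ("wrong", wrong), ("unformatted", unformatted)]:
    print(name, total_reward(text, expected="7"))
```

A well-formatted wrong answer still earns partial credit, which gives the model a gradient toward the format before it learns to be correct.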

## Training Process Deep Dive

### What Actually Happens During Training

1. **Initialization Phase**
   - Model loaded into GPU memory
   - vLLM engine initialized (30% GPU memory)
   - Tokenizer prepared with padding configuration

2. **Per-Step Process**
   - Load a math problem from the training dataset (`ds`)
   - Generate 16 different completions
   - Evaluate all reward functions
   - Compute policy gradient
   - Update model weights
   - Log progress and save checkpoints

3. **Observable Outputs**
   ```
   Question: [math problem]
   Answer: [expected]
   Response: [model output]
   Extracted: [processed answer]
   ```

### Training Duration and Resources
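- Reference GSM8K setup: ~7,500 problems
- Effective batch size: 4 (1 × 4 gradient accumulation)
- Total steps: ~1,875 for GSM8K; the OpenR1-Math run logged above finished at step 234
- Runtime: 2-4 hours expected on an A100 for GSM8K; the run logged above took about 906 seconds (roughly 15 minutes)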
- Checkpoints: Every 100 steps
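
The step counts follow from simple arithmetic. The sketch below works through it with the ~7,500 GSM8K figure quoted above and the values reported by `TrainOutput` for the run in this notebook. It assumes one optimizer step consumes one effective batch of prompts.

```python
per_device_batch_size = 1
gradient_accumulation_steps = 4
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

gsm8k_examples = 7500  # approximate figure quoted above
print("estimated GSM8K steps:", gsm8k_examples // effective_batch_size)

# Figures reported by TrainOutput for the run in this notebook.
global_step = 234
train_runtime_s = 905.8025
print(f"steps per second: {global_step / train_runtime_s:.3f}")
print(f"runtime: {train_runtime_s / 60:.1f} minutes")
```

The computed steps-per-second value matches the `train_steps_per_second` field in the logged `TrainOutput`.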

## Production Readiness Assessment

### Current Limitations

1. **Training Depth**
   - Single epoch may be insufficient
   - Limited exposure to problem variations
   - Potential underfitting

2. **Evaluation Gaps**
   - No validation set monitoring
   - Missing performance metrics
   - No systematic error analysis

3. **Infrastructure**
   - No deployment configuration
   - Missing monitoring setup
   - No model cards or documentation

### Required Steps for Production

1. **Model Validation**
   - Implement cross-validation
   - Create test suite
   - Perform behavioral testing
   - Safety assessment

2. **Performance Optimization**
   - Model compression
   - Inference optimization
   - Latency testing
   - Memory profiling

3. **Deployment Infrastructure**
   - Serving setup
   - Monitoring system
   - A/B testing framework
   - Rollback procedures

4. **Documentation Requirements**
   - Model cards
   - Usage guidelines
   - Performance characteristics
   - Known limitations

### PEFT Considerations
- Currently disabled due to multi-GPU issues
- Could enable:
  - Reduced memory footprint
  - Larger batch sizes
  - More efficient training
- Requires stability testing

## Recommendations for Production Deployment

1. **Extended Training Protocol**
   - Multiple epochs with validation
   - Early stopping implementation
   - Learning rate refinement
   - Batch size optimization

2. **Evaluation Framework**
   - Comprehensive test suite
   - Edge case analysis
   - Performance benchmarking
   - Safety evaluations

3. **Deployment Pipeline**
   - Model compression strategy
   - Serving infrastructure
   - Monitoring setup
   - Update procedures

4. **Documentation and Maintenance**
   - Detailed model cards
   - Regular updates
   - Performance monitoring
   - Incident response plan

This training setup provides a foundation but requires significant additional work for production deployment. It's currently more suitable for proof-of-concept or research purposes.













# Understanding the Final Checkpoint Contents (checkpoint-234 in This Run)

## Core Model Files

### 1. Model Architecture Files
- `model.safetensors`: The actual model weights in safetensors format
  - Loads quickly and avoids the arbitrary-code risk of PyTorch's pickle-based format
  - Contains all neural network parameters
  - Weights, biases, embeddings, etc.

### 2. Configuration Files
- `config.json`: Model architecture configuration
  - Number of layers
  - Hidden dimensions
  - Attention heads
  - Model type/class
- `generation_config.json`: Default generation parameters
  - Max length
  - Temperature
  - Top-k/Top-p settings
  - Beam search config

### 3. Tokenizer Files
- `tokenizer.json`: Core tokenizer configuration
- `tokenizer_config.json`: Tokenizer settings
- `added_tokens.json`: Any custom tokens added
- `special_tokens_map.json`: Special token definitions
- `vocab.json`: The vocabulary file

### 4. Training State Files
- `trainer_state.json`: Training progress info
  - Steps completed
  - Loss history
  - Learning rates
- `training_args.bin`: Training configuration
- `optimizer.pt`: Optimizer state
- `scheduler.pt`: Learning rate scheduler state
- `rng_state.pth`: Random number generator state

## How Loading Works
When you call `from_pretrained()`, the process is:

1. **Architecture Detection**:
```python
model = AutoModelForCausalLM.from_pretrained(model_path)
```
- Reads `config.json` to determine model type
- Initializes correct model class
- Loads weights from `model.safetensors`

2. **Tokenizer Setup**:
```python
tokenizer = AutoTokenizer.from_pretrained(model_path)
```
- Reads `tokenizer_config.json`
- Loads vocabulary from `vocab.json`
- Configures special tokens

## Why This Works
- HuggingFace's transformers library expects this standard structure
- Each file has a specific purpose in model reconstruction
- AutoClasses detect and load appropriate model types
- All necessary information for inference is contained
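
Before loading, it can help to confirm that a checkpoint folder actually contains the files described above. The following is a minimal standard-library sketch. `inspect_checkpoint` and `EXPECTED_FILES` are hypothetical names, the file list covers only a subset of the files listed in this section, and the path is the checkpoint folder used later in this notebook.

```python
import json
from pathlib import Path

EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
]

def inspect_checkpoint(checkpoint_dir):
    """Report which expected files exist and return model_type from config.json (or None)."""
    root = Path(checkpoint_dir)
    present = {name: (root / name).is_file() for name in EXPECTED_FILES}
    model_type = None
    if present["config.json"]:
        config = json.loads((root / "config.json").read_text())
        model_type = config.get("model_type")
    return present, model_type

present, model_type = inspect_checkpoint("outputs/Qwen2.5-0.5B-OpenR1/checkpoint-234")
print("model_type:", model_type)
for name, ok in present.items():
    print(f"{'found  ' if ok else 'missing'} {name}")
```

If `config.json` or `model.safetensors` is reported missing, `from_pretrained()` will not be able to rebuild the model from that folder.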




Now run the final inference cell below, which loads your trained checkpoint and sends it a test question.

In [17]:
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model directly from the checkpoint folder.
# from_pretrained reads config.json and model.safetensors from this directory,
# so a separate load of the base model is not needed.
checkpoint_path = "outputs/Qwen2.5-0.5B-OpenR1/checkpoint-234"  # Specific checkpoint folder
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    checkpoint_path,
    trust_remote_code=True
)

# System prompt: the same response format used during training
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

def solve_math_problem(question: str):
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question}
    ]

    input_text = tokenizer.apply_chat_template(
        prompt,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=3000,
        do_sample=True,  # temperature only takes effect when sampling is enabled
        temperature=1.3,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Test with a sample problem
test_question = "It is known that in a convex $n$-gon ($n>3$) no three diagonals pass through the same point. Find the number of points (distinct from the vertices) of intersection of pairs of diagonals."
print("\nQuestion:", test_question)
print("\nResponse:", solve_math_problem(test_question))


Question: It is known that in a convex $n$-gon ($n>3$) no three diagonals pass through the same point. Find the number of points (distinct from the vertices) of intersection of pairs of diagonals.

Response: system

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

user
It is known that in a convex $n$-gon ($n>3$) no three diagonals pass through the same point. Find the number of points (distinct from the vertices) of intersection of pairs of diagonals.
assistant
To solve the problem, we need to understand the properties of diagonals and how they intersect in a convex polygon.

1. **Identify the Total Number of Diagonals:**
   A convex \( n \)-gon has \(\frac{n(n-3)}{2}\) diagonals.
   
2. **Understand the Intersection Points of Diagonals:**
   In a convex \( n \)-gon, each diagonal splits it into two regions, and every line segment inside the polygon divides each such region into four parts: two triangles on either side.
   
3. **Count the Division

Save Colab output to a local zip

In [None]:
!zip -r /content/model_checkpoint.zip /content/outputs/Qwen2.5-0.5B-OpenR1/checkpoint-234

In [12]:
from google.colab import drive
drive.mount('/content/drive')

# Copy to Drive
%cp -r /content/outputs/Qwen2.5-0.5B-OpenR1/checkpoint-234 /content/drive/MyDrive/my_models/

Mounted at /content/drive


Push model weights to Hugging Face

In [None]:
!git config --global credential.helper store
%pip install huggingface-hub
!huggingface-cli login

In [14]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load from your local checkpoint

tokenizer = AutoTokenizer.from_pretrained("/content/outputs/Qwen2.5-0.5B-OpenR1/checkpoint-234")
model = AutoModelForCausalLM.from_pretrained("/content/outputs/Qwen2.5-0.5B-OpenR1/checkpoint-234")

# Push to Hub (replace the repo id below with "username/your-model-name"; this is how I set mine up)
model.push_to_hub("HarleyCooper/Qwen.5B-OpenR1Math")
tokenizer.push_to_hub("HarleyCooper/Qwen.5B-OpenR1Math")


model.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/HarleyCooper/Qwen.5B-OpenR1Math/commit/41047a5bec204e5486925abe648cbcafe721a68c', commit_message='Upload tokenizer', commit_description='', oid='41047a5bec204e5486925abe648cbcafe721a68c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/HarleyCooper/Qwen.5B-OpenR1Math', endpoint='https://huggingface.co', repo_type='model', repo_id='HarleyCooper/Qwen.5B-OpenR1Math'), pr_revision=None, pr_num=None)