# Model Evaluation with Math500 Benchmark

In the field of LLMs, reasoning models leverage deep thinking capabilities to significantly enhance model performance across complex scenarios. According to the [DeepSeek-R1](https://arxiv.org/abs/2501.12948) paper, the reasoning pattern of larger models can be distilled into smaller models. Specifically, we can distill long-chain-of-thought (long-CoT) data that includes reasoning processes from DeepSeek-R1 and directly fine-tune open-source models like Qwen and Llama. This straightforward distillation method significantly enhances the reasoning abilities of smaller models.

To demonstrate the complete distillation process, we have prepared three notebooks that cover how to distill reasoning data from DeepSeek-R1 using the NIM API, how to train models using the distilled data, and how to evaluate the model.

- [1.generate_reasoning_data.ipynb](./1.generate_reasoning_data.ipynb) demonstrates how to distill reasoning data from DeepSeek-R1 using the NIM API. 
- [2.qwen2_distill_nemo.ipynb](./2.qwen2_distill_nemo.ipynb)  shows how to train open-source models using the distilled data.
- [3.evaluation.ipynb](./3.evaluation.ipynb) (⭐) shows how to evaluate the model.



This notebook is part 3 of the series. After completing the model training and distillation process, it's essential to evaluate the model's performance on standardized benchmarks to assess its reasoning capabilities. This notebook demonstrates the evaluation workflow using the [Math500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) dataset as an example.

The evaluation process consists of three main steps:

1. **Start a vLLM Inference Server** - Deploy the trained model for inference
2. **Generate Responses** - Use the model to generate answers for benchmark questions
3. **Calculate Evaluation Metrics** - Assess the model's performance using appropriate metrics

This tutorial demonstrates how to evaluate reasoning models trained with distilled data from DeepSeek-R1, focusing on mathematical problem-solving capabilities.

Prerequisites:
- A trained model checkpoint (from the previous distillation and training steps)
- vLLM installed in your environment
- Access to the Math500 benchmark dataset

In [1]:
%pip install datasets openai vllm requests


## Step 1: Start vLLM Inference Server

Because the vLLM server needs to run as a separate process, we do not start it from within this Jupyter notebook. Follow these instructions to start the vLLM inference server in a separate terminal:


1. **Open a new terminal window**

2. **Navigate to your model directory** (replace with your actual model path):
   ```bash
   cd /path/to/your/trained/model
   ```

3. **Start the vLLM server**:
   ```bash
   python -m vllm.entrypoints.openai.api_server \
       --model ./model \
       --host 0.0.0.0 \
       --port 8000 \
       --tensor-parallel-size 1 \
       --gpu-memory-utilization 0.9 \
       --served-model-name qwen_distill
   ```

### Parameters Explanation:
- `--model`: Path to your trained model (here `./model`, relative to the directory from step 2)
- `--host 0.0.0.0`: Listen on all network interfaces, so the server accepts connections from any IP
- `--port 8000`: Port number for the API server
- `--tensor-parallel-size`: Number of GPUs for tensor parallelism
- `--gpu-memory-utilization`: Fraction of GPU memory to use
- `--served-model-name`: Name under which the model is exposed in API calls (here set to `qwen_distill`)
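
To confirm which name the server actually exposes, one option is to query the OpenAI-compatible `/v1/models` endpoint. The following is a minimal sketch, assuming the server from the command above is reachable at `http://localhost:8000`; the helper names `served_model_ids` and `fetch_served_model_ids` are hypothetical and not part of vLLM.

```python
import json
from urllib.request import urlopen


def served_model_ids(payload: dict) -> list:
    """Return the model ids from an OpenAI-style list-models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_served_model_ids(base_url: str = "http://localhost:8000/v1") -> list:
    """Query the running server (assumed address) and return the served model names."""
    with urlopen(f"{base_url}/models", timeout=5) as resp:
        return served_model_ids(json.load(resp))
```

With the server running, `fetch_served_model_ids()` should include `qwen_distill` if `--served-model-name` was set as shown.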

### Verify Server is Running

Once the server starts, you should see output similar to:
```
INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
```

You can verify the server is working by visiting `http://localhost:8000/docs` in your browser or running the test cell below.

In [2]:
# Test if vLLM server is running
import requests
try:
    response = requests.get("http://localhost:8000/health")
    if response.status_code == 200:
        print("✅ vLLM server is running successfully!")
    else:
        print(f"❌ Server responded with status code: {response.status_code}")
except requests.exceptions.ConnectionError:
    print("❌ Cannot connect to vLLM server. Please make sure it's running on localhost:8000")
    print("Follow the instructions above to start the server.")

✅ vLLM server is running successfully!
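
Loading a model can take a while, so a single health check may fail simply because the server is still starting. A possible approach is to poll the `/health` endpoint until it answers or a deadline passes. This is a sketch only: `wait_for_server`, `default_probe`, and the timeout values are illustrative assumptions, not part of vLLM or of this notebook.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def default_probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False


def wait_for_server(url: str = "http://localhost:8000/health",
                    timeout_s: float = 300.0,
                    interval_s: float = 5.0,
                    probe=default_probe) -> bool:
    """Poll `url` until the probe succeeds or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_server()` before the generation step returns `True` once the server responds, and `False` if it never becomes healthy within the chosen timeout.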


**Note:** 
To get results quickly, the model's maximum sequence length is set to `4096` tokens, which also limits how much it can generate.  
To evaluate the model at its full capability, follow these steps:  
1. Change the model's `max_position_embeddings` in `model/config.json` from `4096` to `32768`, then restart the vLLM server.
2. Set the `MAX_GEN_TOKENS` parameter below to `8192`.  
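
The change to `config.json` can be made by hand or with a few lines of Python. The sketch below assumes the file is plain JSON with a top-level `max_position_embeddings` key; the helper name `set_max_position_embeddings` is hypothetical.

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, new_value: int):
    """Rewrite `max_position_embeddings` in a JSON config and return the previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old_value = config.get("max_position_embeddings")  # None if the key was absent
    config["max_position_embeddings"] = new_value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old_value
```

For example, `set_max_position_embeddings("model/config.json", 32768)` would apply the change described above; the vLLM server still has to be restarted afterwards to pick it up.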


In [30]:
# Maximum number of tokens the model may generate per response
MAX_GEN_TOKENS = 3500

## Step 2: Load Math500 Dataset and Generate Responses

Now we'll load the [Math500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) benchmark dataset, sample 10 questions, and use our trained model to generate responses. The Math500 dataset contains mathematical problems that test various reasoning capabilities.

In [31]:
import json
import random
from datasets import load_dataset
from openai import OpenAI
import time


dataset = load_dataset("HuggingFaceH4/MATH-500", split="test")

# Sample 10 random questions for evaluation
random.seed(42)  # For reproducibility
sample_size = 10
sample_indices = random.sample(range(len(dataset)), sample_size)
sample_problems = dataset.select(sample_indices)

print(f"Sampled {len(sample_problems)} problems for evaluation")

# Display the first few problems
for i, problem in enumerate(sample_problems.select(range(3))):
    print(f"\n===== Problem {i+1} =====")
    if "problem" in problem:
        print(problem["problem"])
    else:
        print(problem)
    print("=" * 50)

Sampled 10 problems for evaluation

===== Problem 1 =====
Let $P(x)$ be a monic polynomial of degree 3.  Suppose that $P(x)$ has remainder $R(x)$ when it is divided by $(x - 1)(x - 4),$ and remainder $2R(x)$ when it is divided by $(x - 2)(x - 3).$  Given that $P(0) = 5,$ find $P(5).$

===== Problem 2 =====
Riproarin' Ringo was roping a recalcitrant dogie. Ringo decided to give the dogie a reprieve by calculating \[|(1-i)^8|\]before riding after the dogie. What answer should Ringo have found?

===== Problem 3 =====
The proper divisors of 12 are 1, 2, 3, 4 and 6. A proper divisor of an integer $N$ is a positive divisor of $N$ that is less than $N$. What is the sum of the proper divisors of the sum of the proper divisors of 284?


In [35]:
# Optional: load results saved by a previous run of this notebook, if any
try:
    with open("math500_evaluation_results.json", "r", encoding="utf-8") as f:
        evaluation_results = json.load(f)
    print(f"✅ Loaded {len(evaluation_results)} evaluation results")
except FileNotFoundError:
    print("❌ Evaluation results file not found. Run the generation cells below first.")
    evaluation_results = []

if evaluation_results:
    print(f"📊 Evaluating {len(evaluation_results)} model responses...")
    print("=" * 80)


✅ Loaded 6 evaluation results
📊 Evaluating 6 model responses...


In [36]:
# Initialize OpenAI client for vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local"  # vLLM doesn't require API key for local deployment
)

# Generate responses for the sample problems
def generate_response(problem_text, model_name, max_tokens=3500, temperature=0.6):
    """
    Generate a response for a given math problem using the trained model
    """
    # Use a prompt similar to the training format
    prompt = f"{problem_text} Please reason step by step, and put your final answer within \\boxed{{}}."
    
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=temperature,
            timeout=60
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error generating response: {e}")
        return None

# Store results
evaluation_results = []
model_name = "qwen_distill"  # Update this to match your model

print("Generating responses for sample problems...")
print("=" * 60)

for i, problem in enumerate(sample_problems):
    print(f"\nProcessing problem {i+1}/{len(sample_problems)}...")
    
    if "problem" in problem:
        problem_text = problem["problem"]
        ground_truth = problem.get("answer", "N/A")
    else:
        problem_text = str(problem)
        ground_truth = "N/A"
    
    # Generate response
    model_response = generate_response(problem_text, model_name, MAX_GEN_TOKENS)
    
    if model_response:
        result = {
            "problem_id": i,
            "problem": problem_text,
            "ground_truth": ground_truth,
            "model_response": model_response
        }
        evaluation_results.append(result)
        
        print(f"✅ Generated response for problem {i+1}")
        print(f"Problem: {problem_text[:100]}...")
        print(f"Response length: {len(model_response)} characters")
    else:
        print(f"❌ Failed to generate response for problem {i+1}")
    
    # Add a small delay to avoid overwhelming the server
    time.sleep(1)

print(f"\n🎉 Completed! Generated responses for {len(evaluation_results)} problems.")

Generating responses for sample problems...

Processing problem 1/10...
✅ Generated response for problem 1
Problem: Let $P(x)$ be a monic polynomial of degree 3.  Suppose that $P(x)$ has remainder $R(x)$ when it is d...
Response length: 3767 characters

Processing problem 2/10...
✅ Generated response for problem 2
Problem: Riproarin' Ringo was roping a recalcitrant dogie. Ringo decided to give the dogie a reprieve by calc...
Response length: 8218 characters

Processing problem 3/10...
✅ Generated response for problem 3
Problem: The proper divisors of 12 are 1, 2, 3, 4 and 6. A proper divisor of an integer $N$ is a positive div...
Response length: 13998 characters

Processing problem 4/10...
✅ Generated response for problem 4
Problem: The data in the table below shows the percent of bus riders in a survey of Central H.S. students; 30...
Response length: 6600 characters

Processing problem 5/10...
✅ Generated response for problem 5
Problem: Let $n$ be a positive integer.  What is the gre
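
A response that reaches the `max_tokens` limit is cut off, possibly before the final `\boxed{}` answer is written. In the OpenAI chat completions format, such a response carries `finish_reason == "length"`. The sketch below shows one way to surface that flag alongside the text; `text_and_truncation` is a hypothetical helper, and the `SimpleNamespace` object only stands in for a real response so the example runs without a server.

```python
from types import SimpleNamespace


def text_and_truncation(response):
    """Return (content, was_truncated) for the first choice of a chat completion."""
    choice = response.choices[0]
    return choice.message.content, choice.finish_reason == "length"


# Stand-in for a real chat completion response, used only for illustration
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(content="Okay, let's see..."),
            finish_reason="length",
        )
    ]
)

content, was_truncated = text_and_truncation(fake_response)
print(f"truncated={was_truncated}, characters={len(content)}")
```

Recording `was_truncated` next to each result makes it easier to tell a wrong answer apart from an answer that was never finished.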

In [37]:
# Display sample results
print("Sample Evaluation Results:")
print("=" * 80)

for i, result in enumerate(evaluation_results[:3]):  # Show first 3 results
    print(f"\n--- Problem {i+1} ---")
    print(f"Question: {result['problem'][:200]}...")
    print(f"\nModel Response: {result['model_response'][:200]}...")
    print(f"\nGround Truth: {result['ground_truth']}")
    print("-" * 80)

Sample Evaluation Results:

--- Problem 1 ---
Question: Let $P(x)$ be a monic polynomial of degree 3.  Suppose that $P(x)$ has remainder $R(x)$ when it is divided by $(x - 1)(x - 4),$ and remainder $2R(x)$ when it is divided by $(x - 2)(x - 3).$  Given tha...

Model Response: Given that \( P(x) \) is a monic polynomial of degree 3, we can express it in the form \( P(x) = x^3 + ax^2 + bx + c \). The polynomial \( P(x) \) has a remainder \( R(x) \) when divided by \( (x - 1)...

Ground Truth: 15
--------------------------------------------------------------------------------

--- Problem 2 ---
Question: Riproarin' Ringo was roping a recalcitrant dogie. Ringo decided to give the dogie a reprieve by calculating \[|(1-i)^8|\]before riding after the dogie. What answer should Ringo have found?...

Model Response: Okay, let's see. The problem is asking for the absolute value of (1 - i)^8. Hmm, complex numbers here. Alright, let me recall how to handle complex exponents and absolute values.

Fi

In [38]:
# Save results to file for further analysis
output_file = "math500_evaluation_results.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(evaluation_results, f, indent=2, ensure_ascii=False)

print(f"✅ Evaluation results saved to {output_file}")

✅ Evaluation results saved to math500_evaluation_results.json


## Step 3: Calculate Evaluation Metrics with Math Verification

In this step, we'll analyze the evaluation results by:

1. **Installing and importing math_verify** - A library for mathematical equivalence checking
2. **Extracting predicted answers** - Parse answers from model responses within `\boxed{}` notation
3. **Mathematical equivalence checking** - Compare predicted answers with ground truth using math_verify
4. **Computing accuracy metrics** - Calculate overall accuracy and detailed statistics

The math_verify library helps ensure that mathematically equivalent answers (like "1/2" and "0.5") are counted as matching.

In [39]:
# Install math_verify library for mathematical equivalence checking
%pip install math_verify


In [40]:
import re
from math_verify import parse, verify

def extract_boxed_answer(text: str) -> str:
    """Extract the content from the last \\boxed{} notation in the text."""
    if not text:
        return ""
    
    # Find all \\boxed{...} patterns
    pattern = r'\\boxed\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}'
    matches = re.findall(pattern, text)
    
    # Return the last match, or empty string if no match found
    return matches[-1] if matches else ""

def check_answer_equivalence(predicted: str, ground_truth: str) -> bool:
    """Check if predicted answer is mathematically equivalent to ground truth using math_verify."""
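    try:
        parsed_pred = parse(predicted)
        parsed_truth = parse(ground_truth)
        # math_verify expects the reference (gold) answer first, then the prediction
        return verify(parsed_truth, parsed_pred)
    except Exception:
        # Treat any parsing or comparison failure as an incorrect answer
        return False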
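
The regular expression in `extract_boxed_answer` accepts one level of nested braces, so an answer such as `\boxed{\sqrt{\frac{1}{2}}}` would not match. A brace-counting scan is one alternative that handles arbitrary nesting. This is a sketch under simplifying assumptions (escaped braces like `\{` are not treated specially), and `extract_last_boxed` is a hypothetical name, not part of math_verify.

```python
def extract_last_boxed(text: str) -> str:
    """Return the content of the last \\boxed{...} in text, or "" if absent or unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return ""
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)  # matching close brace found
        chars.append(ch)
    return ""  # ran out of text before the braces balanced (e.g. truncated output)
```

An unbalanced `\boxed{` at the end of a truncated response yields an empty string, which the accuracy loop already counts as incorrect.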

In [41]:
# Load evaluation results
with open("math500_evaluation_results.json", "r", encoding="utf-8") as f:
    results = json.load(f)

print(f"Loaded {len(results)} evaluation results")
print("=" * 50)

# Evaluate each result
correct_count = 0
total_count = len(results)

for i, result in enumerate(results):
    predicted_answer = extract_boxed_answer(result['model_response'])
    ground_truth = result['ground_truth']
    
    is_correct = check_answer_equivalence(predicted_answer, ground_truth)
    
    print(f"Problem {i+1}: {'✅' if is_correct else '❌'}")
    print(f"  Predicted: {predicted_answer}")
    print(f"  Ground Truth: {ground_truth}")
    
    if is_correct:
        correct_count += 1

# Calculate and display accuracy (guard against an empty results file)
accuracy = correct_count / total_count * 100 if total_count else 0.0
print("=" * 50)
print("Final Results:")
print(f"Correct answers: {correct_count}/{total_count}")
print(f"Accuracy: {accuracy:.2f}%")

Loaded 10 evaluation results
Problem 1: ❌
  Predicted: \dfrac{35}{2}
  Ground Truth: 15
Problem 2: ❌
  Predicted: 
  Ground Truth: 16
Problem 3: ❌
  Predicted: 
  Ground Truth: 284
Problem 4: ✅
  Predicted: 12
  Ground Truth: 12
Problem 5: ✅
  Predicted: 13
  Ground Truth: 13
Problem 6: ✅
  Predicted: -3
  Ground Truth: -3
Problem 7: ✅
  Predicted: -5
  Ground Truth: -5
Problem 8: ❌
  Predicted: 101561410_8
  Ground Truth: 2516_8
Problem 9: ❌
  Predicted: 
  Ground Truth: 6
Problem 10: ✅
  Predicted: 12
  Ground Truth: 12
Final Results:
Correct answers: 5/10
Accuracy: 50.00%
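
A single accuracy figure hides why answers were marked wrong. The sketch below separates responses with no extractable `\boxed{}` answer from responses with a wrong one, using the predicted answers and verdicts printed above; `categorize` is a hypothetical helper introduced for illustration.

```python
from collections import Counter


def categorize(predicted: str, is_correct: bool) -> str:
    """Label a result as correct, missing a boxed answer, or wrong."""
    if is_correct:
        return "correct"
    if not predicted.strip():
        return "no_boxed_answer"
    return "wrong_answer"


# (predicted answer, verdict) pairs copied from the output above
outcomes = [
    (r"\dfrac{35}{2}", False), ("", False), ("", False), ("12", True),
    ("13", True), ("-3", True), ("-5", True), ("101561410_8", False),
    ("", False), ("12", True),
]

breakdown = Counter(categorize(pred, ok) for pred, ok in outcomes)
for label in ("correct", "wrong_answer", "no_boxed_answer"):
    print(f"{label}: {breakdown[label]}")
```

In this run, three of the five failures have no boxed answer at all, which may point to responses that ended before reaching a final answer rather than to incorrect reasoning.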
