
# Evaluate Fine-tuned Model on MATH Test Dataset

This notebook evaluates the fine-tuned Qwen3 model downloaded from Aliyun on the MATH test dataset.

## Setup
- Model location: `model/` directory
- Test dataset: `test_math.json`
- Evaluation metric: Exact Match (EM) using SymPy normalization


In [1]:
# Install required packages
# If you encounter version compatibility errors between transformers and peft,
# you may need to install compatible versions. Try this cell first, and if it fails,
# run the troubleshooting cell below.

%pip install -q transformers accelerate peft sympy torch bitsandbytes

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/Colab_Notebooks/CSE595_Proj/Aliyun/sft
%pwd

/content/drive/MyDrive/Colab_Notebooks/CSE595_Proj/Aliyun/sft


'/content/drive/MyDrive/Colab_Notebooks/CSE595_Proj/Aliyun/sft'

## Troubleshooting: Version Compatibility

If you get an error like `ModuleNotFoundError: No module named 'transformers.modeling_layers'`,
run the cell below to fix the version compatibility issue.


In [4]:
# TROUBLESHOOTING CELL - Run this if you get version compatibility errors
# This installs compatible versions of transformers and peft

# Option 1: Try upgrading to latest compatible versions
%pip install --upgrade transformers peft accelerate

# Option 2: Install specific compatible versions (recommended)
# Uncomment the line below if Option 1 doesn't work:
# %pip install "transformers>=4.37.0,<4.52.0" "peft>=0.7.0" accelerate sympy torch bitsandbytes

# Option 3: Latest versions (may work, try if others fail)
# %pip install "transformers>=4.40.0" "peft>=0.10.0" accelerate sympy torch bitsandbytes

print("If you see version errors, uncomment one of the options above and run this cell.")

Collecting transformers
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Downloading transformers-4.57.3-py3-none-any.whl (12.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.57.2
    Uninstalling transformers-4.57.2:
      Successfully uninstalled transformers-4.57.2
Successfully installed transformers-4.57.3
If you see version errors, uncomment one of the options above and run this cell.
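
Before reinstalling anything, it can help to see which versions are actually installed. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the packages this notebook depends on
for name in ("transformers", "peft", "accelerate", "bitsandbytes"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

After upgrading a package that has already been imported, a runtime restart is usually needed before the new version takes effect.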


In [5]:
import os
import json
import re
import sympy
import torch
import random
from pathlib import Path
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

# Try to import PeftModel (only needed if using LoRA adapter)
# If import fails, we'll handle it when loading the adapter
try:
    from peft import PeftModel
    PEFT_AVAILABLE = True
except ImportError as e:
    print(f"Warning: peft not available or version incompatible: {e}")
    print("You can only use full models (not LoRA adapters) without peft.")
    PEFT_AVAILABLE = False
    PeftModel = None

# Configuration
MODEL_DIR = "model"  # Path to the model DIRECTORY (folder containing model files, not the weights file itself)
# MODEL_DIR should contain: config.json, tokenizer files, and either:
#   - model.safetensors (full model), OR
#   - adapter/ folder (LoRA adapter)
TEST_DATA_PATH = "test_math.json"  # Path to test dataset
USE_8BIT = True  # Use 8-bit quantization to save memory (requires bitsandbytes)
MAX_NEW_TOKENS = 512
NUM_SAMPLES = 50  # None means evaluate on all test samples, or set a number for quick testing


## Load Evaluation Functions

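`extract_boxed` returns the contents of the first `\boxed{...}` in a string, tracking nested braces, and `normalize_sympy` parses an answer string with `sympy.sympify`, returning `None` when parsing fails. Together they are used to extract answers from the model output and compare them.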


In [6]:
def extract_boxed(latex_string):
    """Extract content from \\boxed{} in LaTeX string."""
    if not latex_string:
        return None

    match = re.search(r'\\boxed\s*\{', latex_string, re.IGNORECASE)
    if not match:
        return None

    start_index = match.end()
    brace_count = 1
    content = []

    for i in range(start_index, len(latex_string)):
        char = latex_string[i]
        if char == '{':
            brace_count += 1
            content.append(char)
        elif char == '}':
            brace_count -= 1
            if brace_count == 0:
                return "".join(content)
            else:
                content.append(char)
        else:
            content.append(char)
    return None

def normalize_sympy(s):
    """Normalize mathematical expression using sympy."""
    if not s:
        return None
    try:
        return sympy.sympify(s)
    except (sympy.SympifyError, TypeError):
        return None

print("Evaluation functions loaded!")


Evaluation functions loaded!


In [13]:
# extract_boxed("\\boxed{\\frac{\\boldsymbol{x}}{1}}")
normalize_sympy("1/2") == normalize_sympy("0.5")

False
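
The comparison above returns `False` because `==` on SymPy objects tests structural equality rather than numerical value: `sympify("1/2")` yields an exact rational, while `sympify("0.5")` yields a float. A hedged sketch of a value-based check follows; `sympy_equal` and its tolerance `tol` are hypothetical names, not part of the notebook's evaluation code.

```python
import sympy


def sympy_equal(a, b, tol=1e-9):
    """Return True if two SymPy expressions agree in value (hypothetical helper)."""
    if a is None or b is None:
        return False
    if a == b:  # fast path: structurally identical
        return True
    try:
        diff = sympy.simplify(a - b)
        # Only a purely numeric difference can be compared against the tolerance
        return bool(diff.is_number) and abs(complex(diff)) < tol
    except (TypeError, ValueError, AttributeError):
        # Objects that cannot be subtracted or evaluated are treated as unequal
        return False


half_exact = sympy.sympify("1/2")
half_float = sympy.sympify("0.5")
print(sympy_equal(half_exact, half_float))
```

The tolerance trades strictness for robustness against floating-point rounding; symbolic differences that do not simplify to a number are reported as unequal.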

## Load Model and Tokenizer

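The notebook automatically detects whether `MODEL_DIR` contains a full model, a LoRA adapter, or both. When an adapter is found and `peft` is available, the adapter is loaded on top of the base model.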


In [7]:
# Check if we have a full model or LoRA adapter
model_path = Path(MODEL_DIR)
has_full_model = (model_path / "model.safetensors").exists() or (model_path / "pytorch_model.bin").exists()
has_adapter = (model_path / "adapter" / "adapter_config.json").exists()

print(f"Model directory: {model_path.absolute()}")
print(f"Has full model: {has_full_model}")
print(f"Has adapter: {has_adapter}")

if not has_full_model and not has_adapter:
    raise ValueError(f"No model found in {MODEL_DIR}. Please check the path.")

# Check if adapter requires peft
if has_adapter and not PEFT_AVAILABLE:
    raise ValueError(
        "LoRA adapter detected but peft is not available or incompatible.\n"
        "Please fix the version compatibility issue:\n"
        "Option 1: Upgrade packages: pip install --upgrade transformers peft\n"
        "Option 2: Install compatible versions: pip install 'transformers>=4.37.0,<4.52.0' 'peft>=0.7.0'\n"
        "Option 3: Use a full model instead of a LoRA adapter"
    )

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_DIR,
    trust_remote_code=True
)

# Set pad token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Tokenizer loaded: {tokenizer.__class__.__name__}")
print(f"Vocab size: {tokenizer.vocab_size}")

Model directory: /content/drive/MyDrive/Colab_Notebooks/CSE595_Proj/Aliyun/sft/model
Has full model: True
Has adapter: True
Tokenizer loaded: Qwen2TokenizerFast
Vocab size: 151643


## Preprocess Test Dataset (Extract Ground Truth Answers)

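Optionally preprocess the test dataset to extract and normalize the ground-truth boxed answers in advance. The result is cached in `test_math_preprocessed.json` and reused on later runs.
Note that `evaluate_model` below still extracts the gold answer from each example's `output` field, so the cached fields serve mainly for inspection.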


In [None]:
# Preprocess test dataset: extract and normalize ground truth answers
import json
from tqdm import tqdm

PREPROCESSED_TEST_PATH = "test_math_preprocessed.json"  # Output path for preprocessed data

# Check if preprocessed file already exists
if os.path.exists(PREPROCESSED_TEST_PATH):
    print(f"Preprocessed file {PREPROCESSED_TEST_PATH} already exists.")
    print("Loading preprocessed data...")
    with open(PREPROCESSED_TEST_PATH, 'r', encoding='utf-8') as f:
        test_data = json.load(f)
    print(f"Loaded {len(test_data)} preprocessed examples")

    # Show an example to verify format
    if len(test_data) > 0:
        example = test_data[0]
        print("\nExample preprocessed item:")
        print(f"  Instruction: {example['instruction'][:100]}...")
        print(f"  Gold boxed answer (raw): {example.get('gold_answer_str', 'N/A')}")
        print(f"  Gold answer (normalized): {example.get('gold_answer_sympy', 'N/A')}")
else:
    print("Preprocessing test dataset...")
    print("This may take a few minutes for large datasets.")

    # Load original test data
    with open(TEST_DATA_PATH, 'r', encoding='utf-8') as f:
        original_test_data = json.load(f)

    print(f"Processing {len(original_test_data)} examples...")

    # Preprocess each example
    preprocessed_data = []
    for item in tqdm(original_test_data, desc="Preprocessing"):
        instruction = item["instruction"]
        gold_output = item["output"]

        # Extract boxed answer from gold output
        gold_ans_str = extract_boxed(gold_output)

        # Normalize using sympy
        gold_ans_sym = normalize_sympy(gold_ans_str)

        # Store normalized answer as string for comparison
        gold_ans_sym_str = str(gold_ans_sym) if gold_ans_sym is not None else None

        # Create preprocessed item
        preprocessed_item = {
            "instruction": instruction,
            "output": gold_output,  # Keep original output for reference
            "gold_answer_str": gold_ans_str,  # Extracted boxed answer (raw string)
            "gold_answer_sympy": gold_ans_sym_str,  # Normalized answer (as string for JSON)
            "gold_answer_sympy_obj": None  # Will be None, we'll reconstruct from string
        }

        preprocessed_data.append(preprocessed_item)

    # Save preprocessed data
    with open(PREPROCESSED_TEST_PATH, 'w', encoding='utf-8') as f:
        json.dump(preprocessed_data, f, indent=2, ensure_ascii=False)

    print(f"✓ Preprocessing complete! Saved to {PREPROCESSED_TEST_PATH}")
    print(f"  Processed {len(preprocessed_data)} examples")

    # Statistics
    valid_answers = sum(1 for item in preprocessed_data if item["gold_answer_sympy"] is not None)
    print(f"  Valid boxed answers extracted: {valid_answers}/{len(preprocessed_data)}")

    # Load the preprocessed data
    test_data = preprocessed_data

print("\nPreprocessing ready!")
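
The preprocessing cell stores the normalized answer as a string because SymPy objects are not JSON-serializable. Below is a small sketch of the round trip, assuming the stored string can be parsed back with `sympy.sympify`. The record is made-up illustration data in the same shape as a preprocessed item, not an entry from the dataset.

```python
import json

import sympy

# Illustrative record shaped like a preprocessed item (values are made up)
record = {"gold_answer_str": "8/3", "gold_answer_sympy": str(sympy.sympify("8/3"))}

# Serialize and deserialize, mirroring json.dump / json.load in the notebook
restored = json.loads(json.dumps(record))

# Rebuild the SymPy object from its string form; None stays None
stored = restored["gold_answer_sympy"]
gold_sym = sympy.sympify(stored) if stored is not None else None
print(repr(gold_sym))
```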


In [8]:
# Load model based on type
if has_adapter and PEFT_AVAILABLE:
    # Load base model first (need to check config for base model path)
    adapter_config_path = model_path / "adapter" / "adapter_config.json"
    with open(adapter_config_path, 'r') as f:
        adapter_config = json.load(f)

    # Note: If base_model_name_or_path is a local path that doesn't exist,
    # you may need to manually specify the base model name
    base_model_path = adapter_config.get("base_model_name_or_path", "")

    print(f"Loading base model from: {base_model_path}")

    # Try to load base model - if path doesn't exist, we'll try loading from MODEL_DIR
    if os.path.exists(base_model_path):
        base_model_path_to_use = base_model_path
    else:
        # Assume the full model is in MODEL_DIR (if it exists)
        if has_full_model:
            base_model_path_to_use = str(model_path)
            print(f"Base model path not found, using model directory: {base_model_path_to_use}")
        else:
            # If no base model found, we need the HuggingFace model name
            # This should be provided by the user or detected from config
            raise ValueError("Cannot find base model. Please check the adapter config.")

    # Import BitsAndBytesConfig for quantization (if available)
    try:
        from transformers import BitsAndBytesConfig
        quantization_config = None
        if USE_8BIT:
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0,
            )
    except ImportError:
        print("Warning: bitsandbytes not available. Setting USE_8BIT=False")
        USE_8BIT = False
        quantization_config = None

    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_path_to_use if os.path.exists(base_model_path_to_use) else MODEL_DIR,
        device_map="auto",
        quantization_config=quantization_config,
        torch_dtype=torch.float16 if not USE_8BIT else None,
        trust_remote_code=True
    )

    # Load adapter
    adapter_path = model_path / "adapter"
    print(f"Loading LoRA adapter from: {adapter_path}")
    model = PeftModel.from_pretrained(base_model, str(adapter_path))
    print("LoRA adapter loaded successfully!")

else:
    # Load full model
    # Import BitsAndBytesConfig for quantization (if available)
    try:
        from transformers import BitsAndBytesConfig
        quantization_config = None
        if USE_8BIT:
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0,
            )
    except ImportError:
        print("Warning: bitsandbytes not available. Setting USE_8BIT=False")
        USE_8BIT = False
        quantization_config = None

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_DIR,
        device_map="auto",
        quantization_config=quantization_config,
        torch_dtype=torch.float16 if not USE_8BIT else None,
        trust_remote_code=True
    )
    print("Full model loaded successfully!")

model.eval()
print(f"Model device: {next(model.parameters()).device}")
print(f"Model dtype: {next(model.parameters()).dtype}")


Loading base model from: /tmp/input_model/
Base model path not found, using model directory: model
Loading LoRA adapter from: model/adapter
LoRA adapter loaded successfully!
Model device: cuda:0
Model dtype: torch.float16


## Load Test Dataset

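Load the test dataset in JSON format (instruction/output pairs). If `NUM_SAMPLES` is set, a random subset of that size is drawn with a fixed seed (42) for reproducibility.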


In [9]:
# Load test dataset
with open(TEST_DATA_PATH, 'r', encoding='utf-8') as f:
    test_data = json.load(f)

print(f"Loaded {len(test_data)} test examples")

# Limit number of samples if specified and draw randomly
if NUM_SAMPLES is not None:
    if NUM_SAMPLES <= len(test_data):
        random.seed(42) # For reproducibility
        test_data = random.sample(test_data, NUM_SAMPLES)
        print(f"Randomly selected {NUM_SAMPLES} samples for evaluation")
    else:
        print(f"NUM_SAMPLES ({NUM_SAMPLES}) is greater than total examples ({len(test_data)}). Using all examples.")

# Show an example
print("\nExample test item:")
print(json.dumps(test_data[0], indent=2, ensure_ascii=False))


Loaded 5000 test examples
Randomly selected 50 samples for evaluation

Example test item:
{
  "instruction": "You are a math assistant. Solve the problem step by step, explain your reasoning, and box the final answer using \\boxed{}.\n\nIf $a$ and $b$ are real numbers, $a^2b^3=\\frac{32}{27}$, and $\\frac{a}{b^3}=\\frac{27}{4}$, what is $a+b$?",
  "output": "Rearranging the second equation, we have that $b^3=\\frac{4}{27}a$. If we substitute this into the original equation, we get $\\frac{4}{27}a^3=\\frac{32}{27}$; after multiplying each side by $\\frac{27}{4}$ and taking the cube root, we see that $a=2$. Substituting $a$ into the first equation, we get that $b^3=\\frac{8}{27}$ or $b=\\frac23$. Thus, $a+b=2+\\frac23=\\boxed{\\frac83}$."
}


## Evaluation Function

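This function evaluates the model on the test dataset and computes exact match accuracy. For each example it splits the instruction into a system prompt and a problem, builds a ChatML-style prompt, samples a completion (`temperature=0.7`), extracts the `\boxed{}` answer, and compares it with the gold answer after SymPy normalization. Because decoding is sampled, scores can vary between runs.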


In [10]:
def evaluate_model(model, tokenizer, test_data, max_new_tokens=512, verbose=True):
    """
    Evaluate model on test dataset.
    Returns: exact_match_score, detailed_results
    """
    correct = 0
    total = 0
    results = []

    # Extract system prompt and user question from instruction
    # Format: "You are a math assistant...\\n\\n<problem>"
    SYSTEM_PROMPT = "You are a math assistant. Solve the problem step by step, explain your reasoning, and box the final answer using \\boxed{}."

    for idx, item in enumerate(tqdm(test_data, desc="Evaluating")):
        instruction = item["instruction"]
        gold_output = item["output"]

        # Parse instruction to get system prompt and problem
        # The instruction format is: "You are a math assistant...\n\n<problem>"
        # Try both actual newlines and escaped newlines
        if "\n\n" in instruction:
            parts = instruction.split("\n\n", 1)
            system_msg = parts[0]
            problem = parts[1]
        elif "\\n\\n" in instruction:
            parts = instruction.split("\\n\\n", 1)
            system_msg = parts[0]
            problem = parts[1]
        else:
            # If no separator, use the whole instruction as problem
            system_msg = SYSTEM_PROMPT
            problem = instruction

        # Extract gold answer
        gold_ans_str = extract_boxed(gold_output)
        gold_ans_sym = normalize_sympy(gold_ans_str)

        # Construct prompt in Qwen3 format
        # Format: <|im_start|>system\n<system_prompt><|im_end|>\n<|im_start|>user\n<problem><|im_end|>\n<|im_start|>assistant\n
        prompt = f"<|im_start|>system\n{system_msg}<|im_end|>\n<|im_start|>user\n{problem}<|im_end|>\n<|im_start|>assistant\n"

        # Tokenize
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
        input_length = inputs["input_ids"].shape[1]

        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id
            )

        # Decode only the newly generated tokens
        generated_tokens = outputs[0][input_length:]
        pred_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)

        # Extract predicted answer
        pred_ans_str = extract_boxed(pred_text)
        pred_ans_sym = normalize_sympy(pred_ans_str)

        # Check if correct
        is_correct = False
        if pred_ans_sym is not None and gold_ans_sym is not None:
            is_correct = (pred_ans_sym == gold_ans_sym)
        elif (pred_ans_str == "" or pred_ans_str is None) and (gold_ans_str == "" or gold_ans_str is None):
            is_correct = True

        if is_correct:
            correct += 1
        total += 1

        results.append({
            "problem": problem[:100] + "..." if len(problem) > 100 else problem,
            "predicted": pred_ans_str,
            "gold": gold_ans_str,
            "predicted_full": pred_text[:200] + "..." if len(pred_text) > 200 else pred_text,
            "correct": is_correct
        })

        if verbose and idx < 5:  # Show first 5 examples
            print(f"\n--- Example {idx + 1} ---")
            print(f"Problem: {problem[:150]}...")
            print(f"Predicted answer: {pred_ans_str}")
            print(f"Gold answer: {gold_ans_str}")
            print(f"Correct: {is_correct}")
            print(f"Generated text (first 200 chars): {pred_text[:200]}...")

    exact_match = correct / total if total > 0 else 0.0
    return exact_match, results

print("Evaluation function defined!")


Evaluation function defined!
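
The prompt inside `evaluate_model` is assembled by hand in the ChatML layout shown in its comments. The same two steps (splitting the instruction, then wrapping the parts in `<|im_start|>`/`<|im_end|>` markers) can be sketched as standalone helpers, which makes them easy to test in isolation. The names `split_instruction` and `build_chatml_prompt` are hypothetical. Whether this layout matches the template the model was fine-tuned with is an assumption carried over from the notebook.

```python
DEFAULT_SYSTEM_PROMPT = (
    "You are a math assistant. Solve the problem step by step, "
    "explain your reasoning, and box the final answer using \\boxed{}."
)


def split_instruction(instruction, default_system=DEFAULT_SYSTEM_PROMPT):
    """Split an instruction into (system prompt, problem)."""
    # Try real newlines first, then literal backslash-n sequences
    for separator in ("\n\n", "\\n\\n"):
        if separator in instruction:
            system_msg, problem = instruction.split(separator, 1)
            return system_msg, problem
    # No separator: treat the whole instruction as the problem
    return default_system, instruction


def build_chatml_prompt(system_msg, problem):
    """Wrap both messages in ChatML markers and open the assistant turn."""
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{problem}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


system_msg, problem = split_instruction("Be brief.\n\nWhat is 2 + 2?")
print(build_chatml_prompt(system_msg, problem))
```

Hugging Face tokenizers also provide `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which renders the template shipped with the tokenizer. Its output may differ from the hand-built string, so it is worth comparing the two before switching.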


## Run Evaluation

Evaluate the model on the test dataset and compute exact match accuracy.


In [11]:
print("Starting evaluation...")
print(f"Total test examples: {len(test_data)}")
print(f"Max new tokens: {MAX_NEW_TOKENS}")
print("-" * 60)

exact_match_score, detailed_results = evaluate_model(
    model,
    tokenizer,
    test_data,
    max_new_tokens=MAX_NEW_TOKENS,
    verbose=True
)

print("\n" + "=" * 60)
print(f"EVALUATION RESULTS")
print("=" * 60)
print(f"Exact Match Score: {exact_match_score:.4f} ({exact_match_score*100:.2f}%)")
print(f"Correct: {sum(r['correct'] for r in detailed_results)}")
print(f"Total: {len(detailed_results)}")
print("=" * 60)


Starting evaluation...
Total test examples: 50
Max new tokens: 512
------------------------------------------------------------


Evaluating:   2%|▏         | 1/50 [01:40<1:22:17, 100.76s/it]


--- Example 1 ---
Problem: If $a$ and $b$ are real numbers, $a^2b^3=\frac{32}{27}$, and $\frac{a}{b^3}=\frac{27}{4}$, what is $a+b$?...
Predicted answer: \frac{219}{28}
Gold answer: \frac83
Correct: False
Generated text (first 200 chars): If we set $a=\frac{27}{4}$ and $b=\frac{12}{7}$, then $a^2b^3=\frac{32}{27}$ and $\frac{a}{b^3}=\frac{27}{4}$, so $a+b=\frac{27}{4}+\frac{12}{7}=\frac{219}{28}$.  (Note that we have chosen values for ...


Evaluating:   4%|▍         | 2/50 [01:57<41:05, 51.37s/it]   


--- Example 2 ---
Problem: If $f(x) = x + 1$ and $g(x) = x^2 + 3$, what is the value of $f(g(2))$?...
Predicted answer: 8
Gold answer: 8
Correct: True
Generated text (first 200 chars): We proceed as follows: \begin{align*}
f(g(2)) &= f(2^2 + 3) \\
&= f(7) \\
&= 7 + 1 \\
&= \boxed{8}.
\end{align*} (Note that we don't actually need to know what $g(2)$ is, because we can just plug it i...


Evaluating:   6%|▌         | 3/50 [02:53<42:02, 53.67s/it]


--- Example 3 ---
Problem: For some real numbers $a$ and $b$, the equation \[
8x^3 + 4ax^2 + 2bx + a = 0
\]has three distinct positive roots. If the sum of the base-2 logarithms...
Predicted answer: -8
Gold answer: -256
Correct: False
Generated text (first 200 chars): From the given information, we must have \[
\log_2 r_1 + \log_2 r_2 + \log_2 r_3 = 5.
\]From the properties of logarithms, we can write this as \[\log_2 (r_1 r_2 r_3) = 5.\]So, \[r_1 r_2 r_3 = 2^5 = 3...


Evaluating:   8%|▊         | 4/50 [04:31<54:23, 70.95s/it]


--- Example 4 ---
Problem: Let $ABCD$ be a regular tetrahedron with side length 2. The plane parallel to edges $AB$ and $CD$ and lying halfway between them cuts $ABCD$ into two ...
Predicted answer: None
Gold answer: 1+2\sqrt{3}
Correct: False
Generated text (first 200 chars): Let $P$ and $Q$ be the centers of the faces $BCD$ and $ACD,$ respectively, and let $R$ be the center of the face $ABC.$  Then $\triangle BPC$ is equilateral with side length 2, so $BP = 2.$  Similarly...


Evaluating:  10%|█         | 5/50 [05:58<57:32, 76.72s/it]


--- Example 5 ---
Problem: Compute $\sin 30^\circ$....
Predicted answer: \frac{1}{2}
Gold answer: \frac{1}{2}
Correct: False
Generated text (first 200 chars): We draw a right triangle with one angle equal to $30^\circ$.  Then, by drawing an altitude on the hypotenuse, we can create a $30^\circ-60^\circ-90^\circ$ triangle.  For a $30^\circ-60^\circ-90^\circ$...


Evaluating: 100%|██████████| 50/50 [1:10:59<00:00, 85.18s/it]


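============================================================
EVALUATION RESULTS
============================================================
Exact Match Score: 0.1800 (18.00%)
Correct: 9
Total: 50
============================================================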
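
Example 5 above is scored as incorrect although the predicted and gold strings are both `\frac{1}{2}`: `sympy.sympify` cannot parse LaTeX, so both normalized values are `None` and the SymPy comparison is skipped. One possible remedy is to compare the raw boxed strings when SymPy parsing fails. The sketch below assumes that a whitespace-insensitive string match is an acceptable fallback. `try_sympify`, `canonical_string`, and `answers_match` are hypothetical helpers, not the notebook's scoring code. Unlike the notebook, this sketch treats a missing answer as a mismatch, and using it would change the reported score.

```python
import re

import sympy


def try_sympify(text):
    """Parse with SymPy; return None when the string is not valid SymPy syntax."""
    if not text:
        return None
    try:
        return sympy.sympify(text)
    except Exception:  # broad on purpose: sympify can fail in several ways
        return None


def canonical_string(text):
    """Remove all whitespace so that spacing differences are ignored."""
    return re.sub(r"\s+", "", text)


def answers_match(pred, gold):
    """Compare two boxed answers: SymPy equality first, string match as fallback."""
    if not pred or not gold:
        return False  # assumption: a missing answer never counts as a match
    pred_sym, gold_sym = try_sympify(pred), try_sympify(gold)
    if pred_sym is not None and gold_sym is not None:
        return bool(pred_sym == gold_sym)
    # At least one side is not SymPy-parseable (e.g. LaTeX): compare raw strings
    return canonical_string(pred) == canonical_string(gold)


print(answers_match(r"\frac{1}{2}", r"\frac{1}{2}"))
print(answers_match(None, r"1+2\sqrt{3}"))
```

A string fallback only recognizes answers written the same way, so `\frac{1}{2}` and `0.5` would still not match. SymPy also ships a LaTeX parser, `sympy.parsing.latex.parse_latex`, which needs an additional parser dependency and is not used here.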





## Save Results

Save the evaluation results to a file for later analysis.


In [12]:
# Save results to JSON file
results_file = "evaluation_results.json"
results_summary = {
    "exact_match_score": exact_match_score,
    "total_examples": len(detailed_results),
    "correct": sum(r['correct'] for r in detailed_results),
    "detailed_results": detailed_results
}

with open(results_file, 'w', encoding='utf-8') as f:
    json.dump(results_summary, f, indent=2, ensure_ascii=False)

print(f"Results saved to {results_file}")

# Show some statistics
print(f"\nStatistics:")
print(f"  Accuracy: {exact_match_score*100:.2f}%")
correct_count = sum(r['correct'] for r in detailed_results)
incorrect_count = len(detailed_results) - correct_count
print(f"  Correct: {correct_count}")
print(f"  Incorrect: {incorrect_count}")


Results saved to evaluation_results.json

Statistics:
  Accuracy: 18.00%
  Correct: 9
  Incorrect: 41
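
With only 50 sampled problems, the 18% score carries substantial sampling uncertainty. As a rough illustration, the sketch below computes a 95% Wilson score interval for 9 correct answers out of 50. The choice of the Wilson interval and the helper name `wilson_interval` are assumptions made for this example, not part of the notebook.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total == 0:
        return 0.0, 0.0
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


low, high = wilson_interval(correct=9, total=50)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

For these counts the interval spans roughly 10% to 31%, which suggests that evaluating on more examples (or all of them, with `NUM_SAMPLES = None`) would give a considerably tighter estimate.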
