

# Math Question Answer Verification Competition

The goal is to fine-tune a Llama-3-8B model to predict if a given solution to a math problem is correct or not. Your model should output True if the solution is correct, and False otherwise.

This notebook uses 80,000 questions for training and achieved an accuracy of around 86%.

The final result was submitted to https://www.kaggle.com/competitions/dl-fall-25-kaggle-contest/

Team name: AUV888

## **Step 1: Install Necessary Libraries**

First, we need to install the required Python libraries. We'll be using the unsloth library, which provides highly efficient, memory-saving training methods for large language models, making it possible to fine-tune powerful models on a single free-tier GPU. We'll also install xformers for further optimization.


In [1]:
# %%capture
# !pip install unsloth
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install --no-deps "trl<0.9.0" "peft<0.12.0" "accelerate<0.32.0" "bitsandbytes<0.44.0" "transformers<4.43.0"

## **Step 2: Load the Model and Tokenizer**

Next, we'll load the Llama 3.1 8B base model (`unsloth/Meta-Llama-3.1-8B`), which is the only model permitted for this competition. We'll use Unsloth's `FastLanguageModel` to handle this efficiently.

A key technique we'll use is 4-bit quantization (load_in_4bit = True). Think of this as compressing the model's knowledge into a much smaller file size. This significantly reduces the amount of GPU memory required, allowing us to fine-tune this large model even on a free platform like Google Colab.
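As a rough illustration of the saving, the sketch below estimates the weight memory at 16-bit versus 4-bit precision. It assumes the base parameter count is the total reported in the training log minus the LoRA parameters, and it ignores quantization constants, activations, and optimizer state, so the real footprint is somewhat higher.

```python
# Back-of-the-envelope weight memory for an ~8B-parameter model.
# Assumption: base parameters = total from the training log minus LoRA adapters.
BASE_PARAMS = 8_198_033_408 - 167_772_160

def weight_gib(num_params: int, bits_per_param: float) -> float:
    """Return the size of the weights in GiB at the given precision."""
    return num_params * bits_per_param / 8 / 2**30

fp16_gib = weight_gib(BASE_PARAMS, 16)
int4_gib = weight_gib(BASE_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights alone would take up nearly all of the roughly 15.5 GB the load log reports for the GPU, while the 4-bit weights need about a quarter of that.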



In [2]:

from unsloth import FastLanguageModel
import torch

max_seq_length = 1024  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from Hugging Face
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers does not work in RTX 50X, Blackwell GPUs as of yet. Please build from source via
```
pip install ninja
pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
```

🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA GeForce RTX 5080 Laptop GPU. Num GPUs = 1. Max memory: 15.469 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is large, so we train on a shuffled subset: **80,000 samples for training** and a disjoint set of **500 for validation**.
3.  **Prompting**: We will format each data sample into a clear instructional prompt. This helps the model understand its role as a mathematician verifying a solution.
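To make the prompting step concrete, here is a minimal sketch that renders one made-up sample into the instruction format. The sample values and the hard-coded `<|end_of_text|>` end-of-sequence string are illustrative assumptions; the notebook reads the real token from `tokenizer.eos_token`.

```python
# Toy illustration of the prompt format; the sample below is made up.
PROMPT = """You are a math solution evaluator.
Your task is to judge whether the reference answer is correct by checking the reasoning in the proposed solution.
Question:
{}
Proposed Solution:
{}
Reference Answer:
{}
Output "True" if the reasoning correctly leads to the reference answer, otherwise "False".
Output:
{}
"""
EOS_TOKEN = "<|end_of_text|>"  # assumption: stands in for tokenizer.eos_token

sample = {
    "question": "What is 2 + 3?",
    "solution": "2 + 3 = 5, so the answer is 5.",
    "answer": "5",
    "is_correct": True,
}

# Fill the four slots in order and terminate the example with the EOS token.
text = PROMPT.format(
    sample["question"],
    str(sample["solution"]),
    str(sample["answer"]),
    str(sample["is_correct"]),
) + EOS_TOKEN
print(text)
```

The label is written as the literal string `True` or `False` after `Output:`, so the model learns to emit exactly one of those words followed by the end-of-sequence token.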



In [3]:
from datasets import load_dataset

# Load the full training dataset
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Shuffle the dataset for randomness and create our smaller splits
shuffled_dataset = full_dataset.shuffle(seed=42)
train_dataset = shuffled_dataset.select(range(80000))      # Use the first 80,000 for training
validation_dataset = shuffled_dataset.select(range(100000, 100500)) # Use 500 later, non-overlapping samples for validation

In [4]:
# The instructional prompt template for training
training_prompt = """You are a math solution evaluator.
Your task is to judge whether the reference answer is correct by checking the reasoning in the proposed solution.
Question:
{}
Proposed Solution:
{}
Reference Answer:
{}
Output "True" if the reasoning correctly leads to the reference answer, otherwise "False".
Output:
{}
"""


# We must add an End Of Sequence (EOS) token to tell the model when a completion is finished.
EOS_TOKEN = tokenizer.eos_token

# This function formats our data samples into the prompt template.
def formatting_prompts_func(examples):
    questions = examples["question"]
    solutions = examples["solution"]
    reference_answers = examples["answer"]
    outputs = examples["is_correct"]
    texts = []
    for question, solution, reference_answer, output in zip(questions, solutions, reference_answers, outputs):
        # Format the prompt and add the EOS token
        text = training_prompt.format(question, str(solution), str(reference_answer), str(output)) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Apply the formatting function to our training dataset
formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)

Map: 100%|██████████| 80000/80000 [00:01<00:00, 53173.71 examples/s]


## **Step 4: Configure LoRA and Set Up the Trainer**

### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge. This is much faster and requires significantly less memory. We'll use a **rank** of `r = 64` with `lora_alpha = 64`, which still trains only about 2% of the model's parameters.
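The trainable-parameter count reported in the training log can be reproduced by hand for the `r = 64` setting passed to `get_peft_model`. The sketch below assumes the standard Llama 3 8B layer shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 decoder layers); these numbers are not stated in this notebook.

```python
# Count LoRA parameters: each adapted Linear(in, out) gains
# a matrix A of shape (r, in) and a matrix B of shape (out, r).
# Assumption: standard Llama 3 8B shapes, not taken from this notebook.
R = 64
HIDDEN, MLP, KV = 4096, 14336, 1024
NUM_LAYERS = 32

# (in_features, out_features) of every target module in one decoder layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(R * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total is 167,772,160, which agrees with the `Trainable parameters` line printed when training starts.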


In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # LoRA rank; higher ranks add capacity at the cost of memory
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 64, # alpha = r here (scaling alpha / r = 1); alpha = 2 * r is another common choice
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.10.11 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.



### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.

We will train for **one epoch** (a single pass over our 80,000-sample training set).

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

def only_output_loss_collator(examples):
    batch = tokenizer.pad(examples, padding=True, return_tensors="pt")
    input_ids = batch["input_ids"]
    labels = input_ids.clone()
    pad_id = tokenizer.pad_token_id

    # Accept several spellings of the "Output:" marker
    templates = ["\nOutput:\n", "\nOutput:", "Output:\n", "Output:", "Output: "]
    templ_ids_list = [tokenizer(t, add_special_tokens=False).input_ids for t in templates]
    B, L = input_ids.shape
    labels[:] = -100  # ignore every position by default

    for i in range(B):
        ids = input_ids[i].tolist()
        starts = []
        j = 0
        while j < L:
            matched = False
            for tpl in templ_ids_list:
                tlen = len(tpl)
                if tlen and j + tlen <= L and ids[j:j+tlen] == tpl:
                    starts.append(j + tlen)   # supervision starts right after the marker
                    j += tlen
                    matched = True
                    break
            if not matched:
                j += 1

        # Enable supervision for each Output span (handles multiple segments when packing).
        # Each span runs to the next marker match or to the end of the sequence.
        for k, st in enumerate(starts):
            ed = starts[k+1] if (k + 1) < len(starts) else L
            if st < ed:
                labels[i, st:ed] = input_ids[i, st:ed]

    labels[input_ids == pad_id] = -100
    batch["labels"] = labels
    return batch


trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=True,
    data_collator=only_output_loss_collator, 
    args=TrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=3e-5,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        weight_decay=0.01,
        max_grad_norm=1.0,
        tf32=True,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        save_strategy="steps",
        save_steps=200,
        optim="adamw_8bit",
        seed=42,
        output_dir="outputs",
        report_to="none",
    ),
)


Unsloth: Tokenizing ["text"] (num_proc=28): 100%|██████████| 80000/80000 [00:05<00:00, 15498.77 examples/s]
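The idea behind the custom collator above, computing the loss only on the tokens that follow the `Output:` marker, can be sketched on toy token IDs. The IDs and the helper `mask_before_marker` are hypothetical, and the sketch handles a single unpacked sequence only; it is a simplified illustration, not a replacement for the collator.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_before_marker(input_ids, marker_ids):
    """Return labels that keep only the tokens after the first marker match."""
    labels = [IGNORE_INDEX] * len(input_ids)
    m = len(marker_ids)
    for start in range(len(input_ids) - m + 1):
        if input_ids[start:start + m] == marker_ids:
            answer_start = start + m
            # Supervise everything after the marker (answer plus EOS).
            labels[answer_start:] = input_ids[answer_start:]
            break
    return labels

# Toy sequence: prompt tokens 11, 12, 13; marker 7, 8; answer 42; EOS 2
toy_ids = [11, 12, 13, 7, 8, 42, 2]
labels = mask_before_marker(toy_ids, marker_ids=[7, 8])
print(labels)  # [-100, -100, -100, -100, -100, 42, 2]
```

Only the answer token and the end-of-sequence token contribute to the loss here, so the gradient signal is concentrated on the `True`/`False` decision rather than on reproducing the prompt.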


## **Step 5: Start Training!**

Now, we'll call the `train()` function on our `trainer` object. This will kick off the fine-tuning process. Based on our settings, this will run for one full epoch over our 80,000 examples (2,500 optimizer steps at an effective batch size of 32).

Grab a coffee: the run logged below took about 9.7 hours on a single RTX 5080 Laptop GPU. ☕


In [7]:
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 80,000 | Num Epochs = 1 | Total steps = 2,500
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 167,772,160 of 8,198,033,408 (2.05% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,1.3605
20,1.2519
30,0.877
40,0.4115
50,0.2459
60,0.1972
70,0.23
80,0.1895
90,0.1747
100,0.1536


TrainOutput(global_step=2500, training_loss=0.127226269197464, metrics={'train_runtime': 35054.8141, 'train_samples_per_second': 2.282, 'train_steps_per_second': 0.071, 'total_flos': 1.9283086742651535e+18, 'train_loss': 0.127226269197464, 'epoch': 1.0})
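The step count and throughput in the log above follow directly from the trainer settings. The short calculation below reproduces them; all inputs are copied from the configuration and the `TrainOutput` line.

```python
import math

# Values copied from the trainer configuration and the TrainOutput line.
num_examples = 80_000
per_device_batch = 8
grad_accum = 4
num_gpus = 1
train_runtime_s = 35054.8141

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
hours = train_runtime_s / 3600
samples_per_second = num_examples / train_runtime_s

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {steps_per_epoch}")
print(f"runtime:              {hours:.2f} h")
print(f"throughput:           {samples_per_second:.3f} samples/s")
```

The results (32, 2,500 steps, about 9.74 hours, about 2.282 samples per second) agree with the logged `Total steps` and `train_samples_per_second` values.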


## **Step 6: Inference and Evaluation**

Now that our model is trained, we need to test it on our validation set. We'll use a slightly different prompt for inference, one where we leave the `Output:` section blank for the model to complete.

Let's test it on a single example from our validation set to see what it predicts.

In [None]:
# Prepare the model for faster inference
FastLanguageModel.for_inference(model)

# Create the prompt template for inference (no answer included)

inference_prompt = """You are a math solution evaluator.
Your task is to judge whether the reference answer is correct by checking the reasoning in the proposed solution.
Question:
{}
Proposed Solution:
{}
Reference Answer:
{}
Output "True" if the proposed solution and reference answer correctly solve the question, otherwise "False".
Output:
"""


# Select a sample from the validation set
example = validation_dataset[100] # You can change the index (e.g., to 1, 2, 50)
question = example["question"]
solution = example["solution"]
reference_answer = example["answer"]

# Format the prompt with the validation data
inputs = tokenizer(
[
    inference_prompt.format(question, str(solution), str(reference_answer))
], return_tensors = "pt").to("cuda")

# Generate the model's response
outputs = model.generate(**inputs, max_new_tokens = 8, use_cache = True)
response = tokenizer.batch_decode(outputs)

# Print the results
print("#### QUESTION ####")
print(question)
print("\n#### SOLUTION ####")
print(solution)
print("\n#### REFERENCE ANSWER ####")
print(reference_answer)
print("\n#### MODEL'S PREDICTION ####")
# We process the output to show only the generated text
print(response[0].split("Output:\n")[1])
print("\n#### CORRECT ANSWER ####")
print(example["is_correct"])

#### QUESTION ####
Triangle $ABC$ has side lengths $AB=5$, $BC=6$, and $AC=7$. Two bugs start simultaneously from $A$ and crawl along the perimeter of the triangle in opposite directions at the same speed. They meet at point $D$. What is $BD$?

#### SOLUTION ####
Let $AB = c, BC = a, CA = b$ and $P$ be the perimeter of the triangle $ABC$.
Let $L$ be the distance between $A$ and $D$ and $L'$ the distance between $D$ and $B$.
Since the bugs crawl along the perimeter of the triangle in opposite directions at the same speed, $L = L'$.
Therefore, the total distance that each bug crawls is $L + L' = L + L = P$, i.e., $L + L = P$.
So $M + BD = P \Rightarrow BD = P - M = \boxed{N}$.

#### REFERENCE ANSWER ####
N

#### MODEL'S PREDICTION ####
False
<|end_of_text|>

#### CORRECT ANSWER ####
False


In [14]:
from tqdm import tqdm

# Calculate validation set accuracy
print("=" * 50)
print("CALCULATING VALIDATION SET ACCURACY")
print("=" * 50)

correct_predictions = 0
total_predictions = len(validation_dataset)
predictions_list = []
true_labels = []

print(f"Evaluating on {total_predictions} validation samples...")
# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

# Loop through the entire validation dataset
for i, example in enumerate(tqdm(validation_dataset)):
    question = example["question"]
    solution = example["solution"]
    reference_answer = example["answer"]
    true_label = example["is_correct"]

    # Format the prompt
    prompt = inference_prompt.format(question, str(solution), str(reference_answer))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction
    prediction = parse_output(response_text)

    # Store predictions and labels
    predictions_list.append(prediction)
    true_labels.append(true_label)

    # Count correct predictions
    if prediction == true_label:
        correct_predictions += 1

# Calculate accuracy
validation_accuracy = correct_predictions / total_predictions

print(f"\nValidation Results:")
print(f"Correct predictions: {correct_predictions}")
print(f"Total predictions: {total_predictions}")
print(f"Validation Accuracy: {validation_accuracy:.4f} ({validation_accuracy*100:.2f}%)")

# Additional metrics
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

print(f"\nDetailed Classification Report:")
print(classification_report(true_labels, predictions_list, target_names=['False', 'True']))

print(f"\nConfusion Matrix:")
cm = confusion_matrix(true_labels, predictions_list)
print(cm)

CALCULATING VALIDATION SET ACCURACY
Evaluating on 500 validation samples...


 96%|█████████▌| 478/500 [01:27<00:04,  5.40it/s]Unsloth: Input IDs of shape torch.Size([1, 1319]) with length 1319 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.
100%|██████████| 500/500 [01:31<00:00,  5.46it/s]


Validation Results:
Correct predictions: 444
Total predictions: 500
Validation Accuracy: 0.8880 (88.80%)

Detailed Classification Report:
              precision    recall  f1-score   support

       False       0.90      0.93      0.91       317
        True       0.87      0.82      0.84       183

    accuracy                           0.89       500
   macro avg       0.88      0.87      0.88       500
weighted avg       0.89      0.89      0.89       500


Confusion Matrix:
[[294  23]
 [ 33 150]]
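The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below does this for the `True` class, relying on scikit-learn's convention that rows are true labels and columns are predicted labels, both ordered `False`, `True`.

```python
# Rows are true labels (False, True); columns are predicted labels (False, True).
cm = [[294, 23],
      [33, 150]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_true = tp / (tp + fp)   # of the solutions predicted correct, how many were
recall_true = tp / (tp + fn)      # of the truly correct solutions, how many were found
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)

print(f"accuracy:         {accuracy:.4f}")
print(f"precision (True): {precision_true:.2f}")
print(f"recall (True):    {recall_true:.2f}")
print(f"f1 (True):        {f1_true:.2f}")
```

The 33 false negatives outnumber the 23 false positives, which is why recall on the `True` class (0.82) is lower than its precision (0.87).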





## **Step 7: Generate Submission File**

This is the final step! We will now run our fine-tuned model on the official `test` dataset.

We will loop through each example in the test set, generate a prediction, and format the results into a CSV file with two columns: `ID` and `is_correct`, as required by the competition.
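For reference, the expected file layout can be sketched without running the model. The example below writes three made-up predictions with Python's standard `csv` module instead of pandas, purely to show the two-column shape; the notebook itself builds the file with a pandas DataFrame.

```python
import csv
import io

# Made-up predictions standing in for the model's outputs.
predictions = [True, False, True]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ID", "is_correct"])          # header required by the competition
for row_id, pred in enumerate(predictions):    # IDs are the row positions 0, 1, 2, ...
    writer.writerow([row_id, pred])

csv_text = buffer.getvalue()
print(csv_text)
```

Each row pairs a zero-based `ID` with a `True` or `False` value, mirroring the `range(len(predictions))` index used when the DataFrame is created.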


In [10]:
import pandas as pd
from tqdm import tqdm

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

# Loop through the test dataset and generate a prediction for each example
# subset = test_dataset.select(range(500))
subset = test_dataset

for example in tqdm(subset):
    question = example["question"]
    solution = example["solution"]
    reference_answer = example["answer"]
    # Format the prompt
    prompt = inference_prompt.format(question, str(solution), str(reference_answer))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")

 11%|█         | 1064/10000 [03:16<33:32,  4.44it/s]Unsloth: Input IDs of shape torch.Size([1, 1033]) with length 1033 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.
 13%|█▎        | 1274/10000 [03:55<28:28,  5.11it/s]Unsloth: Input IDs of shape torch.Size([1, 1200]) with length 1200 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.
 16%|█▌        | 1572/10000 [04:51<28:09,  4.99it/s]Unsloth: Input IDs of shape torch.Size([1, 1041]) with length 1041 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.
 23%|██▎       | 2329/10000 [07:12<25:31,  5.01it/s]Unsloth: Input IDs of shape torch.Size([1, 1214]) with length 1214 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.
 30%|██


Submission file 'submission.csv' created successfully!
You can now download this file and submit it to the Kaggle competition.





# Save the Model and Run Inference
The remaining cells save the model checkpoint, reload the model from that checkpoint, and generate the final submission CSV file.

## Mount google drive

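### Subtask:
Mount Google Drive so the model checkpoint can be saved there (Google Colab only).


**Reasoning**:
The checkpoint in this notebook is saved to a local directory, so the mount call below is left commented out; uncomment it when running on Google Colab.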



In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


## Save model checkpoint

### Subtask:
Save the trained model checkpoint to the path defined below (a local directory here; the Google Drive path is kept as a commented-out alternative).


**Reasoning**:
Define the save path, then save the LoRA adapter weights and the tokenizer to it.



In [None]:
import os
from pathlib import Path

# Define where to save the model checkpoint (use the commented-out path on Google Colab)
# save_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint"
save_path = Path.home() / "Documents" / "dl_midterm" / "model"

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model checkpoint and tokenizer saved to: {save_path}")

## Load model from checkpoint

### Subtask:
Load the model from the saved checkpoint.


**Reasoning**:
Load the model and tokenizer from the saved checkpoint path and prepare the model for inference.



In [None]:
# Define the path where the model checkpoint was saved
# save_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint"
save_path = Path.home() / "Documents" / "dl_midterm" / "model"

# Load the model and tokenizer from the saved path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = str(save_path),  # pass the path as a plain string
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Prepare the loaded model for faster inference
FastLanguageModel.for_inference(model)

print(f"Model and tokenizer loaded from: {save_path}")

## Generate submission file

### Subtask:
Generate the submission CSV file using the loaded model.


**Reasoning**:
Generate the submission CSV file by iterating through the test dataset, generating predictions using the loaded model, and saving the results to a pandas DataFrame.



In [None]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# Create the prompt template for inference (no answer included)
inference_prompt = """You are a math solution evaluator.
Your task is to judge whether the reference answer is correct by checking the reasoning in the proposed solution.
Question:
{}
Proposed Solution:
{}
Reference Answer:
{}
Output "True" if the reasoning correctly leads to the reference answer, otherwise "False".
Output:
"""

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

# Loop through the test dataset and generate a prediction for each example
for example in tqdm(test_dataset):
    question = example["question"]
    solution = example["solution"]
    reference_answer = example["answer"]
    # Format the prompt
    prompt = inference_prompt.format(question, str(solution), str(reference_answer))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")