## **Step 1: Install Necessary Libraries**

First, we need to install the required Python libraries. We'll be using the unsloth library, which provides highly efficient, memory-saving training methods for large language models, making it possible to fine-tune powerful models on a single free-tier GPU. We'll also install xformers for further optimization.

Config in CLI:
cd /DL_MidtermPro/  
python3 -m venv .dlp  
source .dlp/bin/activate  
uv pip install --no-cache "unsloth"  
pip install ipykernel  
python -m ipykernel install --user --name="dlpro-env" --display-name="Python (DLPro Env)"
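
Before opening the notebook, it can help to confirm that the new kernel actually sees the packages installed above. A minimal sketch, assuming the import names `unsloth`, `torch`, `datasets`, and `xformers`; the helper `check_packages` is hypothetical and not part of any library:

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each import name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names assumed from the packages mentioned in this step.
report = check_packages(["unsloth", "torch", "datasets", "xformers"])
for name, found in report.items():
    print(f"{name:10s} {'OK' if found else 'MISSING'}")
```

Run it from the `Python (DLPro Env)` kernel; a `MISSING` entry usually means the notebook is attached to a different environment than the one where the packages were installed.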

In [1]:
# %%capture
!pip install .

Processing /workspace
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting datasets (from contest-dl-2025==0.1.0)
  Using cached datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting numpy<2.0,>=1.17 (from contest-dl-2025==0.1.0)
  Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pandas (from contest-dl-2025==0.1.0)
  Using cached pandas-2.3.3-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (91 kB)
Collecting torch (from contest-dl-2025==0.1.0)
  Downloading torch-2.9.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (30 kB)
Collecting tqdm (from contest-dl-2025==0.1.0)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)

## **Step 2: Load the Model and Tokenizer**

Next, we'll load the Llama-3.1-8B model (`unsloth/Meta-Llama-3.1-8B`), which is the only model permitted for this competition. We'll use Unsloth's `FastLanguageModel` to handle this efficiently.

A key technique we'll use is 4-bit quantization (`load_in_4bit = True`). Instead of storing each weight in 16 bits, the model keeps a compressed 4-bit representation. This significantly reduces the GPU memory required and lets us fine-tune this large model on a single GPU, even on a free platform like Google Colab.
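
To get a feel for the savings, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a rough sketch that assumes about 8 billion parameters and ignores quantization metadata, activations, LoRA adapters, and optimizer state, so real usage will be higher. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8.0e9  # assumption: roughly 8 billion parameters

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16-bit weights
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
print(f"Reduction:      {fp16_gib / int4_gib:.0f}x")
```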



In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 8192  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from Hugging Face
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

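## **Step 2 (Alternative): Load the Saved Model**

Alternatively, if you already have a fine-tuned checkpoint, load the base model and attach the saved LoRA adapter instead of running the training steps.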

In [None]:
from unsloth import FastLanguageModel
from peft import PeftModel
import torch

max_seq_length = 8192
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",   
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,                        
)

# Usage: place the adapter checkpoint directory directly in the project directory.
model = PeftModel.from_pretrained(model, "checkpoint-5400")  # best checkpoint: step 5400

FastLanguageModel.for_inference(model)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left" 
model.eval()


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.1: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA RTX 6000 Ada Generation. Num GPUs = 1. Max memory: 47.401 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
        (layers): ModuleList(
          (0): LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear
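
The loading cell sets `tokenizer.padding_side = "left"`. For a decoder-only model this matters as soon as prompts are batched: with left padding, the final position of every row is a real prompt token, so generation (or reading the last-position logits) starts from the right place. A toy illustration in plain Python, using made-up token IDs and a hypothetical `pad_batch` helper rather than the real tokenizer:

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded


PAD = 0  # made-up padding ID
batch = [[11, 12, 13, 14], [21, 22]]  # made-up token IDs for two prompts

left = pad_batch(batch, PAD, side="left")
right = pad_batch(batch, PAD, side="right")

# Last position of each row: real tokens with left padding,
# a padding token for the shorter prompt with right padding.
print("left :", [row[-1] for row in left])
print("right:", [row[-1] for row in right])
```

The loops later in this notebook process one prompt at a time, so no padding is actually added there; the setting becomes relevant if inference is batched.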

## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is massive (1,000,000 training examples). We'll work with a smaller, more manageable subset to speed things up: **100,000 samples for training** and **1,000 for validation**.
3.  **Prompting**: We will format each data sample into a clear instructional prompt. This helps the model understand its role as a mathematician verifying a solution.
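
The splitting step can be sketched without the `datasets` library: shuffle the row indices once with a fixed seed, then cut two non-overlapping slices. This is a scaled-down illustration with a hypothetical `shuffled_split` helper; Python's `random` module will not reproduce the exact order that `Dataset.shuffle` produces.

```python
import random


def shuffled_split(num_rows, train_size, val_size, seed):
    """Shuffle row indices once, then take disjoint train/validation slices."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    train_idx = indices[:train_size]
    val_idx = indices[train_size:train_size + val_size]
    return train_idx, val_idx


# Scaled-down sizes for illustration only.
train_idx, val_idx = shuffled_split(num_rows=1_000, train_size=100, val_size=10, seed=114514)

overlap = set(train_idx) & set(val_idx)
print(len(train_idx), len(val_idx), len(overlap))
```

Because both slices come from the same shuffled order, no example can land in both splits, which keeps the validation score honest.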



In [3]:
from datasets import load_dataset

# Load the full training dataset
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Shuffle the dataset for randomness and create our smaller splits
shuffled_dataset = full_dataset.shuffle(seed=114514)
train_dataset = shuffled_dataset.select(range(100000))      # Use the first 100,000 for training
validation_dataset = shuffled_dataset.select(range(100000, 101000)) # Use the next 1,000 for validation

Generating train split: 100%|██████████| 1000000/1000000 [00:02<00:00, 334197.96 examples/s]
Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 259465.02 examples/s]


In [4]:
import re

def clean_text(text):
    """Applies lightweight cleaning to a text string."""

    if not isinstance(text, str):
        return str(text)

    # text = re.sub(r'```[\s\S]*?```', '', text)

    text = re.sub(r'<[^>]+>', '', text)

    #while re.search(r'(\d),(\d)', text):
        #text = re.sub(r'(\d),(\d)', r'\1\2', text)

    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r'\n+', '\n', text)

    return text.strip()
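
A quick check of what `clean_text` does in practice. The function is repeated here (active lines only) so the snippet runs on its own; the sample strings are made up.

```python
import re


def clean_text(text):
    """Same active steps as the notebook's clean_text (commented-out lines omitted)."""
    if not isinstance(text, str):
        return str(text)
    text = re.sub(r'<[^>]+>', '', text)  # drop anything that looks like <...>
    text = re.sub(r'[ \t]+', ' ', text)  # collapse runs of spaces and tabs
    text = re.sub(r'\n+', '\n', text)    # collapse consecutive newlines
    return text.strip()


raw = "<b>Total</b>:   12\t apples\n\n\nAnswer: <i>12</i> "
print(repr(clean_text(raw)))

# Caveat: the tag pattern also matches plain inequalities that contain both
# "<" and ">", so part of a math expression can be removed.
print(repr(clean_text("x < 5 and y > 3")))
```

The second call shows a side effect worth keeping in mind for math problems: everything between `<` and `>` is treated as a tag and dropped.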

In [5]:
# The instructional prompt template for training
training_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. 

Let's think step-by-step to determine if the provided solution is correct. Consider the following aspects during your evaluation:
1.  **Understanding:** Does the solution correctly interpret what the question is asking and use the given information properly?
2.  **Approach:** Is the method or strategy used in the solution logically sound and appropriate for the problem?
3.  **Execution:** Are the calculations, algebraic manipulations, logical deductions, and steps performed accurately? Check each step carefully.
4.  **Final Answer Check:** Is the final result derived correctly from the preceding steps, and does it directly answer the specific question asked?

After carefully considering all these steps in your internal thought process, provide your final verdict. Your response should be *only* the single word 'True' if the *entire* solution (understanding, approach, execution, and final answer) is correct, or *only* the single word 'False' otherwise. Do not include your step-by-step thinking process in the output itself.

Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
{}
"""

# We must add an End Of Sequence (EOS) token to tell the model when a completion is finished.
EOS_TOKEN = tokenizer.eos_token

# This function formats our data samples into the prompt template.
def formatting_prompts_func(examples):
    questions = examples["question"]
    solutions = examples["solution"]
    outputs = examples["is_correct"]
    texts = []
    for question, solution, output in zip(questions, solutions, outputs):
        # Format the prompt and add the EOS token
        question = clean_text(question)
        solution = clean_text(str(solution))
        text = training_prompt.format(question, str(solution), str(output)) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Apply the formatting function to our training dataset
formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
formatted_validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True)

Map: 100%|██████████| 100000/100000 [00:06<00:00, 15604.94 examples/s]
Map: 100%|██████████| 1000/1000 [00:00<00:00, 15058.39 examples/s]
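
To see what a single training string looks like after formatting, here is a small stand-alone sketch. It uses a shortened stand-in for the prompt template, made-up records, and assumes the EOS token is `<|end_of_text|>`; in the notebook the real value comes from `tokenizer.eos_token`.

```python
# Shortened stand-in for the full training prompt (same three slots).
TEMPLATE = "Question:\n{}\nSolution:\n{}\nOutput:\n{}\n"
EOS_TOKEN = "<|end_of_text|>"  # assumption: the notebook reads this from the tokenizer


def format_batch(examples):
    """Turn a batch of columns into one prompt string per row."""
    texts = []
    for question, solution, label in zip(
        examples["question"], examples["solution"], examples["is_correct"]
    ):
        texts.append(TEMPLATE.format(question, solution, str(label)) + EOS_TOKEN)
    return {"text": texts}


# Made-up records in the same column layout as the dataset.
batch = {
    "question": ["What is 2 + 3?", "What is 4 * 5?"],
    "solution": ["2 + 3 = 5", "4 * 5 = 24"],
    "is_correct": [True, False],
}

formatted = format_batch(batch)
print(formatted["text"][0])
```

The boolean label becomes the literal word `True` or `False` on the line after `Output:`, which is exactly the token the model is later asked to predict.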


## **Step 4: Configure LoRA and Set Up the Trainer (skip Steps 4 and 5 if you loaded the trained model)**
### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge. This is much faster and requires significantly less memory. We'll use a **rank** of `r = 64` with `lora_alpha = 128`, applied to all attention and MLP projection layers.
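
A short calculation makes the "sticky notes" picture concrete. It uses the `q_proj` shape from the model summary printed earlier (4096 inputs, 4096 outputs) together with `r = 64` and `lora_alpha = 128` from the configuration cell. The helper name is hypothetical, and other projection layers have different shapes, so this covers one layer only.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Trainable parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


in_features = out_features = 4096  # q_proj shape from the printed model summary
rank, alpha = 64, 128              # values from the configuration cell

full_params = in_features * out_features
lora_params = lora_param_count(in_features, out_features, rank)
scaling = alpha / rank  # standard LoRA scaling factor applied to the adapter output

print(f"frozen base weights : {full_params:,}")
print(f"trainable LoRA      : {lora_params:,} ({lora_params / full_params:.2%} of the base layer)")
print(f"scaling (alpha / r) : {scaling}")
```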


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # LoRA rank; a higher rank adds capacity at the cost of more trainable parameters
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 128, # A common practice is to set alpha = 2 * r
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)


### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.

We will train for at most **one epoch** (a single pass over our 100,000-sample training set), evaluating every 25 steps and stopping early if the validation loss fails to improve for three consecutive evaluations.
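
The training arguments in the next cell imply a concrete schedule, which is easy to work out by hand. The sketch below plugs in those values (100,000 training samples, batch size 16, no gradient accumulation, one GPU as reported in the Unsloth banner). It assumes the warm-up step count is rounded up; treat the result as an estimate.

```python
import math

train_samples = 100_000
per_device_batch = 16
grad_accum_steps = 1
num_gpus = 1            # from the Unsloth banner above
num_epochs = 1
warmup_ratio = 0.015
eval_steps, save_steps = 25, 200
patience = 3            # early-stopping patience, counted in evaluations

effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumption: rounded up

print(f"optimizer steps in one epoch : {total_steps}")
print(f"warm-up steps                : {warmup_steps}")
print(f"evaluations in one epoch     : {total_steps // eval_steps}")
print(f"steps without improvement before stopping: {patience * eval_steps}")
print(f"save_steps is a multiple of eval_steps   : {save_steps % eval_steps == 0}")
```

Checkpoints are written every 200 steps, so a checkpoint name such as `checkpoint-5400` refers to optimizer step 5,400 of this schedule.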

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, EarlyStoppingCallback

small_eval_dataset = formatted_validation_dataset.shuffle(seed=42).select(range(400))

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,
    eval_dataset = small_eval_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 16,
        gradient_accumulation_steps = 1,
        # warmup_steps = 20,
        warmup_ratio=0.015,
        learning_rate = 1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 42,
        output_dir = "outputs",
        report_to = "none",
        do_eval = True,
        eval_strategy= "steps",
        eval_steps = 25,
        save_steps = 200,
        load_best_model_at_end = True,
        metric_for_best_model = "eval_loss",
        greater_is_better = False,
        max_grad_norm = 1.0,

        #added following
        #max_steps = 700,
        num_train_epochs = 1,
        logging_steps= 25,
    ),
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)],
)
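
The trainer above stops early when the validation loss stops improving. The idea behind a patience of 3 can be sketched in a few lines. This is a simplified model of patience-based early stopping with made-up loss values and a hypothetical helper; the real `EarlyStoppingCallback` has additional details (such as an improvement threshold) that are not reproduced here.

```python
def early_stop_index(eval_losses, patience):
    """Return (index of the evaluation that triggers the stop, best loss seen).

    The index is None if training would run to the end.
    """
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or loss < best:
            best = loss          # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1       # no improvement at this evaluation
            if bad_evals >= patience:
                return i, best
    return None, best


# Made-up validation losses, one per evaluation (every 25 steps in the notebook).
losses = [0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_at, best_loss = early_stop_index(losses, patience=3)
print(stop_at, best_loss)
```

With these numbers the run stops at the fifth evaluation, so the later improvement to 0.50 is never reached. That is the trade-off of a small patience combined with frequent evaluation.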

## **Step 5: Start Training!**

Now, we'll call the `train()` function on our `trainer` object. This will kick off the fine-tuning process. Based on our settings, this will run for up to one full epoch over our 100,000 examples, unless early stopping ends it sooner.

Grab a coffee, as this will take a while! ☕


In [None]:
trainer.train()


## **Step 6: Inference and Evaluation**

Now that our model is trained, we need to test it on our validation set. We'll use a slightly different prompt for inference, one where we leave the `Output:` section blank for the model to complete.

Let's test it on a single example from our validation set to see what it predicts.

In [6]:
# Prepare the model for faster inference
FastLanguageModel.for_inference(model)

# Create the prompt template for inference (no answer included)
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. 

Let's think step-by-step to determine if the provided solution is correct. Consider the following aspects during your evaluation:
1.  **Understanding:** Does the solution correctly interpret what the question is asking and use the given information properly?
2.  **Approach:** Is the method or strategy used in the solution logically sound and appropriate for the problem?
3.  **Execution:** Are the calculations, algebraic manipulations, logical deductions, and steps performed accurately? Check each step carefully.
4.  **Final Answer Check:** Is the final result derived correctly from the preceding steps, and does it directly answer the specific question asked?

After carefully considering all these steps in your internal thought process, provide your final verdict. Your response should be *only* the single word 'True' if the *entire* solution (understanding, approach, execution, and final answer) is correct, or *only* the single word 'False' otherwise. Do not include your step-by-step thinking process in the output itself.

Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""
# Select a sample from the validation set
example = validation_dataset[100] # You can change the index (e.g., to 1, 2, 50)
question = clean_text(example["question"])
solution = clean_text(example["solution"])

# Format the prompt with the validation data
inputs = tokenizer(
[
    inference_prompt.format(question, str(solution))
], return_tensors = "pt").to("cuda")

# Generate the model's response
outputs = model.generate(**inputs, max_new_tokens = 8, use_cache = True)
response = tokenizer.batch_decode(outputs)

# Print the results
print("#### QUESTION ####")
# print(question)
print("\n#### SOLUTION ####")
# print(solution)
print("\n#### MODEL'S PREDICTION ####")
# We process the output to show only the generated text
print(response[0].split("Output:\n")[1])
print("\n#### CORRECT ANSWER ####")
print(example["is_correct"])

#### QUESTION ####

#### SOLUTION ####

#### MODEL'S PREDICTION ####
False
<|end_of_text|>

#### CORRECT ANSWER ####
True


## More Validations

In [7]:
import random
from tqdm import tqdm
import re

def parse_output(response_text):
    output_part = response_text.split("Output:\n")[-1]
    if 'true' in output_part.lower():
        return True
    return False

def calculate_accuracy(ground_truths, predictions):
    if not ground_truths:
        return 0.0

    correct_count = 0
    total_count = len(ground_truths)

    for gt, pred in zip(ground_truths, predictions):
        if gt == pred:
            correct_count += 1

    return correct_count / total_count

num_samples = 1000
random_samples = validation_dataset.shuffle(seed=42).select(range(num_samples))

model_predictions = []
ground_truths = []
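
`parse_output` returns `True` whenever the substring `true` appears anywhere after the last `Output:` marker, and it treats every other response (including empty or malformed ones) as `False`. One possible alternative, shown here as a hypothetical `parse_verdict` helper, takes the first standalone `True`/`False` word and returns `None` when neither is present, so unparseable generations can be counted separately:

```python
import re


def parse_verdict(response_text, marker="Output:\n"):
    """Return True/False for the first standalone verdict word, or None if absent."""
    tail = response_text.split(marker)[-1]
    match = re.search(r"\b(true|false)\b", tail, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "true"


# Made-up decoded responses.
print(parse_verdict("...Output:\nFalse\n<|end_of_text|>"))
print(parse_verdict("...Output:\nFalse, the claim is not true"))
print(parse_verdict("...Output:\n???"))
```

The second example is the case where the two parsers disagree: a substring check would report `True` because the word "true" appears later in the text.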

## Two Methods of Inference: Text Generation + String Matching and Logits

### Text Gen + String Matching

In [8]:
print(f"Running inference on {num_samples} random validation samples...")

for example in tqdm(random_samples):

    question = clean_text(example["question"])
    solution = clean_text(example["solution"])
    correct_answer = example["is_correct"]

    prompt = inference_prompt.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    model_pred_bool = parse_output(response_text)
    model_predictions.append(model_pred_bool)
    ground_truths.append(correct_answer)

accuracy = calculate_accuracy(ground_truths, model_predictions)

print(f"\nCalculation complete.")
print(f"Validation Accuracy on {num_samples} random samples: {accuracy * 100:.2f}%")

Running inference on 1000 random validation samples...


100%|██████████| 1000/1000 [03:55<00:00,  4.24it/s]


Calculation complete.
Validation Accuracy on 1000 random samples: 85.30%





### Logits + Knowing Mistakes
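- This method is usually faster, because it needs a single forward pass per sample instead of generating up to 8 new tokens.
- It also avoids string-matching mistakes, so the accuracy is slightly higher (86.40% vs. 85.30% on the same 1,000 validation samples).
- For these reasons, we use this strategy to generate the final predictions.
- This section also collects the misclassified examples and exports them, together with their questions, to a CSV file for inspection.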
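
The logits strategy boils down to one comparison: look at the scores the model assigns to the next token after `Output:` and check whether the best "True" variant outranks the best "False" variant. A toy sketch with plain Python lists follows. The vocabulary size, token IDs, and logit values are invented, and `p_true` (a two-way softmax over the two scores) is an extra that the notebook does not compute.

```python
import math


def predict_from_logits(last_token_logits, true_ids, false_ids):
    """Compare the best 'True' logit with the best 'False' logit."""
    true_score = max(last_token_logits[i] for i in true_ids)
    false_score = max(last_token_logits[i] for i in false_ids)
    # Two-way softmax: probability of 'True' when only these two scores compete.
    p_true = 1.0 / (1.0 + math.exp(false_score - true_score))
    return true_score > false_score, p_true


# Invented next-token logits over a 6-token toy vocabulary.
logits = [0.1, 2.0, -1.0, 3.5, 0.0, 1.2]
true_ids = [1, 4]   # stand-ins for the IDs of "True" and " True"
false_ids = [3, 5]  # stand-ins for the IDs of "False" and " False"

prediction, p_true = predict_from_logits(logits, true_ids, false_ids)
print(prediction, round(p_true, 4))
```

A probability like `p_true` could be used to flag low-confidence predictions for manual review, though that goes beyond what this notebook does.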

In [9]:

import pandas as pd
from tqdm import tqdm

model_predictions = []
ground_truths = []
error_analysis = [] 
num_samples = 1000
true_token_ids = list(set([
    tokenizer.encode("True", add_special_tokens=False)[-1],
    tokenizer.encode(" True", add_special_tokens=False)[-1]
]))
false_token_ids = list(set([
    tokenizer.encode("False", add_special_tokens=False)[-1],
    tokenizer.encode(" False", add_special_tokens=False)[-1]
]))
print(f"Running inference on {num_samples} random validation samples...")

model.eval()
with torch.no_grad():
    for example in tqdm(random_samples):
        question = clean_text(example["question"])
        solution = clean_text(example["solution"])
        correct_answer = example["is_correct"] # bool

        prompt = inference_prompt.format(question, str(solution))
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        
        outputs = model(**inputs)
        logits = outputs.logits
        last_token_logits = logits[0, -1, :]
        
        true_score = torch.max(last_token_logits[true_token_ids])
        false_score = torch.max(last_token_logits[false_token_ids])

        prediction = (true_score > false_score).item() # bool
        
        model_predictions.append(prediction)
        ground_truths.append(correct_answer)
        
        if prediction != correct_answer:
            error_analysis.append({
                "question": example["question"], 
                "solution": example["solution"], 
                "cleaned_solution": solution,    
                "prediction": prediction,      # True / False
                "ground_truth": correct_answer # True / False
            })

accuracy = calculate_accuracy(ground_truths, model_predictions)
print(f"\nCalculation complete.")
print(f"Validation Accuracy: {accuracy * 100:.2f}%")

error_df = pd.DataFrame(error_analysis)
print(f"Found {len(error_df)} errors out of {num_samples} samples.")
# error_df.to_csv("validation_errors.csv", index=False)
# print("Errors saved to validation_errors.csv")

if not error_df.empty:
    print("\n--- SAMPLE ERRORS ---")
    pd.set_option('display.max_colwidth', 300) 
    print(error_df.head())
    error_df.to_csv("validation_errors.csv", index=False)

    print("\nError file 'validation_errors.csv' saved!")

Running inference on 1000 random validation samples...


100%|██████████| 1000/1000 [02:10<00:00,  7.68it/s]


Calculation complete.
Validation Accuracy: 86.40%
Found 136 errors out of 1000 samples.

--- SAMPLE ERRORS ---
                                                                                                                                                                                                                                                                                                      question  \
0                                                                 Mr. Ray has 100 customers waiting at his fish market. He has 10 tuna, each of which weighs 200 pounds. Each customer wants 25 pounds of tuna. Mr. Ray's store is first come, first served. How many customers will go home without any fish?   
1                                                                                                                       Janet needs 5 tickets to ride the roller coaster and 3 tickets to ride the giant slide. How many tickets does she need to ride the roller coaster 7 times an
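
Once the errors are collected, a natural first question is whether the model leans toward accepting wrong solutions or rejecting correct ones. A minimal sketch with a hypothetical `summarize_errors` helper and made-up records in the same layout as `error_analysis`; the last lines only re-check the reported numbers (136 errors out of 1,000 samples).

```python
from collections import Counter


def summarize_errors(errors):
    """Count false positives (predicted True, actually False) and false negatives."""
    counts = Counter()
    for row in errors:
        if row["prediction"] and not row["ground_truth"]:
            counts["false_positive"] += 1
        elif not row["prediction"] and row["ground_truth"]:
            counts["false_negative"] += 1
    return counts


# Made-up error records for illustration.
sample_errors = [
    {"prediction": True, "ground_truth": False},
    {"prediction": False, "ground_truth": True},
    {"prediction": False, "ground_truth": True},
]
summary = summarize_errors(sample_errors)
print(dict(summary))

# Consistency check on the numbers reported above.
num_samples, num_errors = 1000, 136
accuracy = 1 - num_errors / num_samples
print(f"{accuracy * 100:.2f}%")
```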




# Generate Submission

## Full Text Generation

In [None]:
import pandas as pd
from tqdm import tqdm

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

# Loop through the test dataset and generate a prediction for each example
for example in tqdm(test_dataset):
    question = clean_text(example["question"])
    solution = clean_text(example["solution"])

    # Format the prompt
    prompt = inference_prompt.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")

## Alternatively, Logits-Based Generation

In [None]:
import pandas as pd
from tqdm import tqdm
import torch
from datasets import load_dataset 

true_token_ids = list(set([
    tokenizer.encode("True", add_special_tokens=False)[-1],
    tokenizer.encode(" True", add_special_tokens=False)[-1]
]))
false_token_ids = list(set([
    tokenizer.encode("False", add_special_tokens=False)[-1],
    tokenizer.encode(" False", add_special_tokens=False)[-1]
]))

print(f"Logits inference mode")
print(f"Will Compare Token IDs - True: {true_token_ids} vs False: {false_token_ids}")

test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

model.eval()
with torch.no_grad():
    for example in tqdm(test_dataset, desc="Generating submission using logits"):
        question = clean_text(example["question"])
        solution = clean_text(example["solution"])

        prompt = inference_prompt.format(question, str(solution))
        inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

        outputs = model(**inputs)
        logits = outputs.logits

        last_token_logits = logits[0, -1, :]

        true_score = torch.max(last_token_logits[true_token_ids])
        false_score = torch.max(last_token_logits[false_token_ids])

        prediction = true_score > false_score
        predictions.append(prediction.item()) 

submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully using logits!")
print("You can now download this file and submit it to the Kaggle competition.")
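
Before uploading, a quick sanity check of the CSV can catch an incomplete or misaligned file. The sketch below uses only the standard library and a hypothetical `validate_submission` helper. It assumes the expected layout is exactly what the cells above write (an `ID` column counting from 0 and an `is_correct` column containing `True` or `False`); the competition's actual requirements should be checked against its rules.

```python
import csv
import io


def validate_submission(csv_text, expected_rows):
    """Return a list of human-readable problems; an empty list means no issue found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["ID", "is_correct"]:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for position, row in enumerate(rows):
        if row["ID"] != str(position):
            problems.append(f"row {position}: ID is {row['ID']!r}")
        if row["is_correct"] not in ("True", "False"):
            problems.append(f"row {position}: is_correct is {row['is_correct']!r}")
    return problems


# Tiny made-up files; the real test split has 10,000 rows.
good = "ID,is_correct\n0,True\n1,False\n"
bad = "ID,is_correct\n0,True\n2,maybe\n"
print(validate_submission(good, expected_rows=2))
print(validate_submission(bad, expected_rows=2))
```

For the real file, read `submission.csv` into a string and pass `expected_rows=10000`, matching the size of the test split shown when the dataset was loaded.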

# SAVE THE MODEL TO DRIVE AND RUN INFERENCE
### **(This part was never used, as we always downloaded the weights directly)**

Add code to save the model checkpoint to Google Drive, load the model from the checkpoint, and generate the final submission CSV file.

## Mount google drive

### Subtask:
Mount Google Drive to save the model checkpoint.





In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Save model checkpoint

### Subtask:
Save the trained model checkpoint to the specified path in Google Drive.


**Reasoning**:
Define the save path and save the model and tokenizer to Google Drive.



In [None]:
import os

# Define the path to save the model checkpoint in Google Drive
save_path = "/content/drive/MyDrive/DLMidModelCheckpoint/BT_ralpha_3264"

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

#print(f"Model checkpoint and tokenizer saved to: {save_path}")