

# Fine-Tuning Llama-3-8B on Math Problems

The goal is to fine-tune a Llama-3-8B model to predict whether a given solution to a math problem is correct. The model outputs `True` if the solution is correct and `False` otherwise.

## **Step 1: Install Necessary Libraries**

We'll be using the unsloth library, which provides highly efficient, memory-saving training methods for large language models, making it possible to fine-tune powerful models on a single free-tier GPU.


In [None]:
# %%capture
# The latest unsloth release combined with xformers raised errors here, so we pin unsloth 2025.10.10 and do not use xformers.
!pip install unsloth==2025.10.10
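
Because the install is pinned to a specific version, it can help to confirm which version actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of unsloth.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "unsloth" is the package pinned above; the result depends on your environment.
print("unsloth version:", installed_version("unsloth"))
```

If this prints `None`, the install cell did not complete in the current runtime.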

## **Step 2: Load the Model and Tokenizer**

### **Option 1: Load the Base Model**

We'll load the Llama-3-8B model (the `unsloth/Meta-Llama-3.1-8B` checkpoint in the code below), which is the only model permitted for this competition. We'll use Unsloth's `FastLanguageModel` to handle this efficiently.

A key technique we'll use is 4-bit quantization (`load_in_4bit = True`). Think of this as compressing the model's weights: each weight is stored in 4 bits instead of the usual 16. This significantly reduces the amount of GPU memory required, allowing us to fine-tune this large model even on a free platform like Google Colab.



In [None]:
import unsloth
from unsloth import FastLanguageModel
import torch

max_seq_length = 1500  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from Hugging Face
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
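
To get a feel for why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is a sketch under stated assumptions: the parameter count is rounded to 8 billion, and the estimate ignores quantization metadata, activations, optimizer state, and the LoRA adapters.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed to store the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


NUM_PARAMS = 8e9  # assumption: "8B" taken as roughly 8 billion parameters

for bits in (16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights shrink from roughly 15 GiB at 16 bits to under 4 GiB at 4 bits.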

### **Option 2: Resume from a Saved Checkpoint**

Some of the fine-tuned models did not perform well, but we do not want to repeat the training from scratch. Instead, we load a saved checkpoint from Google Drive and continue training from it to save time.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from unsloth import FastLanguageModel
import torch
from google.colab import drive
drive.mount('/content/drive')
max_seq_length = 1500  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory
# Define the path where the model checkpoint was saved in Google Drive
save_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint11"

# Load the model and tokenizer from the saved path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model.train()
print(f"Model and tokenizer loaded from: {save_path}")

## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is massive, so we use only a subset for training and validation. The range of data used is given in the report.
3.  **Prompting**: We will format each data sample into a clear instructional prompt. This helps the model understand its role as a mathematician verifying a solution.



In [None]:
from datasets import load_dataset

# Loading
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")
shuffled_dataset = full_dataset.shuffle(seed=42)
subset = shuffled_dataset.select(range(200000))

# Splitting
train_size = int(0.95 * len(subset))
train_dataset = subset.select(range(train_size))
validation_dataset = subset.select(range(train_size, len(subset)))
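
The split sizes implied by the cell above can be checked without downloading anything. This sketch mirrors the shuffle-then-slice logic on plain integer indices; `split_indices` is a hypothetical helper, and its shuffle order will not match the one produced by `datasets`, so only the sizes are meaningful.

```python
import random


def split_indices(num_examples, train_fraction, seed):
    """Shuffle indices 0..num_examples-1 and split them into train/validation lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    train_size = int(train_fraction * num_examples)
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = split_indices(num_examples=200_000, train_fraction=0.95, seed=42)
print(len(train_idx), len(val_idx))
```

With a 200,000-example subset and a 95% training fraction, this gives 190,000 training examples and 10,000 validation examples.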

In [None]:
# Prompting
training_prompt = """You are a mathematics expert. Carefully read the following question and its proposed solution.
Your task is to determine whether the provided solution correctly solves the question.
First, analyze the reasoning in the solution, then decide if the final answer is logically and mathematically correct.
Respond with “True” or “False” and explain your reasoning briefly.

Question:
{question}

Solution:
{solution}

Your reasoning:
<Think step by step and explain why the solution is correct or not before giving your final answer.>

Final Answer:
{output}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    questions = examples["question"]
    solutions = examples["solution"]
    outputs = examples["is_correct"]

    texts = [
        training_prompt.format(
            question=question,
            solution=str(solution) if solution is not None else "",
            output=str(output) if output is not None else ""
        ) + EOS_TOKEN
        for question, solution, output in zip(questions, solutions, outputs)
    ]
    return {"text": texts}

formatted_train_dataset = train_dataset.map(
    formatting_prompts_func,
    batched=True,
    batch_size=1000,
    num_proc=2,
)
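
To see what a formatted training example looks like, the same batched formatting logic can be run on a toy batch. This is a sketch only: the template is a shortened stand-in for the full prompt, `<EOS>` is a placeholder for `tokenizer.eos_token`, and the two questions are made up for illustration.

```python
# Shortened stand-in for the training prompt; the real template is longer.
TEMPLATE = """Question:
{question}

Solution:
{solution}

Final Answer:
{output}"""

EOS_TOKEN = "<EOS>"  # placeholder; the notebook uses tokenizer.eos_token


def format_batch(examples):
    """Turn a column-oriented batch into a list of prompt strings ending in EOS."""
    texts = [
        TEMPLATE.format(question=q, solution=str(s), output=str(o)) + EOS_TOKEN
        for q, s, o in zip(
            examples["question"], examples["solution"], examples["is_correct"]
        )
    ]
    return {"text": texts}


# Made-up examples in the same column layout that `map(batched=True)` passes in.
toy_batch = {
    "question": ["What is 2 + 3?", "What is 4 * 5?"],
    "solution": ["2 + 3 = 5", "4 * 5 = 25"],
    "is_correct": [True, False],
}
formatted = format_batch(toy_batch)
print(formatted["text"][1])
```

The label ends up as the literal text `True` or `False` right after `Final Answer:`, followed by the end-of-sequence token so the model learns where to stop.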


## **Step 4: Configure LoRA and Set Up the Trainer**

### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 256,  # LoRA rank; a larger rank gives the adapters more capacity
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha = 512,
    lora_dropout = 0,

    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)
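
The size of the "sticky notes" can be estimated by hand. Each adapted linear layer with input size `d_in` and output size `d_out` gains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes the standard Llama-3-8B layer shapes (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers); these numbers are not stated in this notebook.

```python
# Assumed Llama-3-8B shapes: hidden 4096, MLP 14336, key/value projection 1024, 32 layers.
HIDDEN, MLP, KV, NUM_LAYERS = 4096, 14336, 1024, 32

# (input_dim, output_dim) of each linear layer listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank):
    """Count LoRA parameters: rank * (d_in + d_out) per module, summed over all layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


rank, alpha = 256, 512
print(f"trainable LoRA parameters: {lora_param_count(rank):,}")
print(f"scaling factor alpha / r: {alpha / rank}")
```

Under these assumptions, `r = 256` yields about 671 million trainable parameters, and `lora_alpha = 512` scales the adapter update by a factor of 2.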


### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.

The hyperparameter choices are explained in the report.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
FastLanguageModel.for_training(model)  # make sure the model is in training mode before building the trainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=True,

    args=TrainingArguments(
        # epoch and batch
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        num_train_epochs = 1,

        # learning rate
        learning_rate=5e-5,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",

        # optimizer
        optim="adamw_8bit",
        weight_decay=0.1,
        max_grad_norm=1.0,
        adam_beta2=0.98,

        # precision
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),

        # log and save
        logging_steps=100,
        output_dir = "outputs",
        report_to="none",
        seed=42,
        remove_unused_columns=True,
        group_by_length = False,
    ),
)
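
The learning-rate settings above (`warmup_ratio=0.1`, `lr_scheduler_type="cosine"`) describe a linear warmup followed by a cosine decay. The following is a simplified sketch of that shape, not the trainer's own implementation; `total_steps = 1000` is a hypothetical value, since the real step count depends on how packing groups the examples.

```python
import math


def lr_at_step(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# per_device_train_batch_size * gradient_accumulation_steps
effective_batch_size = 8 * 2
print("sequences per optimizer step:", effective_batch_size)

total_steps = 1000  # hypothetical; the real count depends on packing
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, total_steps):.2e}")
```

The rate climbs from zero to the peak of `5e-5` over the first 10% of steps and then falls smoothly back to zero by the end of training.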


## **Step 5: Start Training\!**

Now we'll call the `train()` function on our `trainer` object to kick off the fine-tuning process. Based on our settings, this runs for one full epoch over the training split, which is 95% of the 200,000-example subset (190,000 examples).

Grab a coffee, as this will take a while! ☕


In [None]:
# Free cached GPU memory before training to reduce the risk of out-of-memory errors on Colab
torch.cuda.empty_cache()

In [None]:
trainer.train()



---


## **Step 6: Inference and Evaluation**

Even with the subset we chose, the validation split is too large to run inference on in full, so we evaluate on only 1,000 samples from it.

In [None]:
from tqdm import tqdm
import torch

FastLanguageModel.for_inference(model)

val_prompt = """You are a mathematics expert. Carefully read the following question and its proposed solution.
Your task is to determine whether the provided solution correctly solves the question.
First, analyze the reasoning in the solution, then decide if the final answer is logically and mathematically correct.
Respond with only 'True' or 'False' at the end.

Question:
{}

Solution:
{}

Your reasoning:
<Think step by step and explain why the solution is correct or not before giving your final answer.>

Final Answer:
"""

def parse_prediction(response):
    if "Final Answer:" in response:
        output_part = response.split("Final Answer:")[-1].strip()
    else:
        output_part = response
    return 'true' in output_part.lower()

val_samples = validation_dataset.select(range(1000))
correct = 0
batch_size = 32

for i in tqdm(range(0, len(val_samples), batch_size)):
    batch = val_samples[i:i+batch_size]
    prompts = [val_prompt.format(q, str(s)) for q, s in zip(batch["question"], batch["solution"])]

    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=max_seq_length).to("cuda")

    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False, use_cache=True)

    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    preds = [parse_prediction(r) for r in responses]
    correct += sum(p == l for p, l in zip(preds, batch["is_correct"]))

accuracy = correct / len(val_samples)
print(f"ACC: {accuracy:.4f} ({correct}/{len(val_samples)})")
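
One caveat about the substring check in `parse_prediction`: a response such as "False. It is not true that ..." contains the word "true" and would be counted as a `True` prediction. A stricter variant, sketched below, looks at the first standalone verdict word after the last `Final Answer:` marker. The function name `parse_verdict` and the choice to treat unparseable output as `False` are assumptions for illustration.

```python
import re


def parse_verdict(response):
    """Return True/False from the first verdict word after the last 'Final Answer:'."""
    answer_part = response.split("Final Answer:")[-1]
    match = re.search(r"\b(true|false)\b", answer_part, flags=re.IGNORECASE)
    if match is None:
        return False  # assumption: treat an unparseable answer as "not correct"
    return match.group(1).lower() == "true"


tricky = "Final Answer:\nFalse. It is not true that 2 + 2 = 5."
print("substring check:", "true" in tricky.split("Final Answer:")[-1].lower())
print("first-verdict check:", parse_verdict(tricky))
```

On this example the substring check reports `True` while the first-verdict check reports `False`, which is the intended label.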

## **Step 7: Generate Submission File**

This is the final step\! We will now run our fine-tuned model on the official `test` dataset.

We will loop through each example in the test set, generate a prediction, and format the results into a CSV file with two columns: `ID` and `is_correct`, as required by the competition.


In [None]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
FastLanguageModel.for_inference(model)
# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

inference_prompt = """You are a mathematics expert. Carefully read the following question and its proposed solution.
Your task is to determine whether the provided solution correctly solves the question.
First, analyze the reasoning in the solution, then decide if the final answer is logically and mathematically correct.
Respond with only 'True' or 'False' at the end.

Question:
{}

Solution:
{}

Your reasoning:
<Think step by step and explain why the solution is correct or not before giving your final answer.>

Final Answer:
"""

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Final Answer:"
    output_part = response_text.split("Final Answer:")[-1]
    # Check if "True" appears (case-insensitive)
    if 'true' in output_part.lower():
        return True
    return False

# Loop through the test dataset and generate a prediction for each example
for example in tqdm(test_dataset):
    question = example["question"]
    solution = example["solution"]

    # Format the prompt
    prompt = inference_prompt.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)  # slightly larger max_new_tokens to leave room for reasoning
    response_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\n✅ Submission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")
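
For reference, the layout of the resulting file can be reproduced with the standard library alone. This sketch writes the same two columns, `ID` and `is_correct`, for a few made-up predictions; whether the competition expects booleans written exactly as `True`/`False` is not stated here, so treat that detail as an assumption.

```python
import csv
import io


def build_submission_csv(predictions):
    """Serialize predictions as CSV text with the columns ID and is_correct."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ID", "is_correct"])
    for row_id, prediction in enumerate(predictions):
        writer.writerow([row_id, prediction])
    return buffer.getvalue()


# Made-up predictions, only to show the file layout.
print(build_submission_csv([True, False, True]))
```

Each row pairs a zero-based ID with the model's verdict, so the row order must match the order of the test set.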

Finally, we define the save path and save the model and tokenizer to Google Drive, so that training can be resumed later as described in Option 2 of Step 2.



In [None]:
import os

# Define the path to save the model checkpoint in Google Drive
save_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint"

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model checkpoint and tokenizer saved to: {save_path}")
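
Before relying on a saved checkpoint in a later session, it can be worth checking that the folder actually contains what you expect. The sketch below is a generic directory check; the file names in `EXPECTED_FILES` are an assumption about what a LoRA adapter and tokenizer save typically produce and may differ between library versions.

```python
import os
import tempfile

# Assumption: file names typically written for a LoRA adapter and a tokenizer;
# the exact set depends on the library versions in use.
EXPECTED_FILES = (
    "adapter_config.json",
    "adapter_model.safetensors",
    "tokenizer_config.json",
)


def missing_checkpoint_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in the directory."""
    present = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [name for name in expected if name not in present]


# Demo on a temporary directory standing in for the Drive folder.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "adapter_config.json"), "w").close()
    demo_missing = missing_checkpoint_files(demo_dir)
    print("missing:", demo_missing)
```

An empty list means every expected file was found; in the notebook you would pass `save_path` instead of the temporary directory.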