

# Welcome to the Math Question Answer Verification Competition! 🚀

The goal is to fine-tune a Llama-3-8B model to predict if a given solution to a math problem is correct or not. Your model should output True if the solution is correct, and False otherwise.

This notebook is a starter guide designed to get you up and running quickly. We'll walk through a simplified training process using a subset of the data (20,000 training examples) and lightweight parameters. The main goal is to understand the complete workflow, from loading data to generating a submission file, not to achieve a top score.

Good luck, and have fun! 🎉

## **Step 1: Install Necessary Libraries**

First, we need to install the required Python libraries. We'll use the unsloth library, which provides memory-efficient training methods for large language models and makes it possible to fine-tune an 8B model on a single free-tier GPU. The cell below pins unsloth to a specific commit so the install is reproducible; the commented-out lines show the unpinned install and optional version pins for xformers and related libraries.


In [None]:
# %%capture
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git@bbdab300de3eb76a435999e92815de452560e51d"
# !pip install --no-deps "xformers<0.0.26" "trl<0.9.0" "peft<0.12.0" "accelerate<0.32.0" "bitsandbytes<0.44.0" "transformers<4.43.0"


## **Step 2: Load the Model and Tokenizer**

Next, we'll load the Llama-3-8B model, which is the only model permitted for this competition. We'll use Unsloth's FastLanguageModel to handle this efficiently.

A key technique we'll use is 4-bit quantization (`load_in_4bit = True`), which stores each weight in roughly 4 bits instead of 16. Think of this as compressing the model's knowledge into a much smaller footprint. It cuts the GPU memory needed for the weights to about a quarter, allowing us to fine-tune this large model even on a free platform like Google Colab.
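To make the memory saving concrete, here is a rough back-of-the-envelope sketch. The parameter count (about 8.03 billion) is an assumption, and the helper `weight_memory_gib` is hypothetical; the estimate covers the weights only and ignores quantization metadata, activations, and optimizer state.

```python
# Rough estimate of the memory needed to hold the model weights alone.
# Assumption: ~8.03e9 parameters. Real GPU usage is higher because of
# activations, optimizer state, and quantization metadata.

def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the size of the weights in GiB at the given precision."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 8.03e9  # assumed parameter count for an 8B Llama model

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights shrink from roughly 15 GiB to under 4 GiB, which is why the model can fit on a single free-tier GPU.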



In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from Hugging Face.
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
# Later cells rely on both `model` and `tokenizer` being defined here.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is large. For this notebook, we shuffle it and take a smaller, more manageable subset to speed things up: **20,000 samples for training** and **5,000 for validation**.
3.  **Prompting**: We will format each data sample into a clear instructional prompt. This helps the model understand its role as a mathematician verifying a solution.
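The splitting step can be sketched with plain Python on a toy list of example IDs. The sizes here (100 IDs, 50 for training, 10 for validation) are made up for illustration; the point is that both splits are slices of one shuffled order, so they cannot overlap.

```python
import random

# Toy stand-in for the dataset: 100 example IDs instead of real rows.
indices = list(range(100))

# Shuffle once with a fixed seed so the split is reproducible.
rng = random.Random(42)
rng.shuffle(indices)

# Take two non-overlapping slices of the shuffled order.
train_ids = indices[:50]
validation_ids = indices[50:60]

# An example that appears in both splits would inflate validation accuracy.
assert set(train_ids).isdisjoint(validation_ids)
print(len(train_ids), len(validation_ids))
```

The `datasets` calls in the next cell follow the same idea: `shuffle(seed=42)` fixes the order, and each `select(range(...))` takes a disjoint slice of it.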



In [None]:
from datasets import load_dataset

# Load the full training dataset
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Shuffle the dataset for randomness and create our smaller splits
shuffled_dataset = full_dataset.shuffle(seed=42)
train_dataset = shuffled_dataset.select(range(40000, 60000))      # 20,000 shuffled examples for training
validation_dataset = shuffled_dataset.select(range(60000, 65000)) # The next 5,000 for validation

In [None]:
# The instructional prompt template for training
training_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
GivenAnswer:
{}
GivenSolution (optional reasoning):
{}
Output:
{}"""

# We must add an End Of Sequence (EOS) token to tell the model when a completion is finished.
EOS_TOKEN = tokenizer.eos_token

# This function formats our data samples into the prompt template.
def formatting_prompts_func(examples):
    questions = examples["question"]
    answers = examples["answer"]
    solutions = examples["solution"]
    outputs = examples["is_correct"]
    texts = []
    for question, answer, solution, output in zip(questions, answers, solutions, outputs):
        # Format the prompt and add the EOS token
        text = training_prompt.format(question, str(answer), str(solution), str(output)) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Apply the formatting function to our training dataset
formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
formatted_validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [None]:
print(train_dataset)

In [None]:
print(train_dataset[0])

{'question': 'Find the greatest common divisor of $10293$ and $29384$.', 'is_correct': True, 'answer': '1', 'solution': "To find the greatest common divisor (GCD) of two integers, we can use the built-in gcd() function in sympy.\nHere's how we can use it to find the GCD of $10293$ and $29384$:\n<llm-code>\nfrom sympy import gcd\n\n# Define the integers\nnum1 = 10293\nnum2 = 29384\n\n# Find the GCD using sympy\ngcd(num1, num2)\n</llm-code>\n<llm-code-output>\n1\n</llm-code-output>\nSo the greatest common divisor of $10293$ and $29384$ is $\\boxed{1}$."}


In [None]:
print(train_dataset.features)

{'question': Value('string'), 'is_correct': Value('bool'), 'answer': Value('string'), 'solution': Value('string')}


In [None]:
print(len(train_dataset), train_dataset.column_names)

In [None]:
print(formatted_train_dataset[0])
print(formatted_train_dataset)

## **Step 4: Configure LoRA and Set Up the Trainer**

### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge. This is much faster and requires significantly less memory. We'll use a small **rank** (`r = 8`) to keep the training process light and quick for this starter notebook.
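To see how small the "sticky notes" are, the sketch below counts the trainable LoRA parameters for a given rank. The layer shapes are assumptions based on the published Llama-3-8B architecture (they are not stated in this notebook), and `lora_trainable_params` is a hypothetical helper. For each adapted weight matrix of shape `out x in`, LoRA adds two small matrices with `r * (in + out)` parameters in total.

```python
# Assumed Llama-3-8B-style shapes (not stated in this notebook).
HIDDEN = 4096          # model width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32

# (in_features, out_features) for each projection that gets an adapter.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(r: int) -> int:
    """Count LoRA parameters: A is (r x in) and B is (out x r) per matrix."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

for r in (1, 8):
    print(f"r={r}: {lora_trainable_params(r):,} trainable parameters")
```

Under these assumptions, rank 1 gives 2,621,440 parameters, which agrees with the "Trainable parameters" figure in the Unsloth log of the first grid-search run further down, and rank 8 gives eight times as many.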


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # A small rank for lighter training
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16, # A common practice is to set alpha = 2 * r
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

Unsloth: Already have LoRA adapters! We shall skip this step.



### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.

We will train for **three epochs** (three passes over our 20,000-sample training set), which keeps this demonstration reasonably fast.
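The batch size, gradient accumulation, and epoch settings together determine how many optimizer steps a run takes. A minimal sketch of that arithmetic, using the values from the trainer cell below and a hypothetical helper `optimizer_steps` (exact rounding can differ slightly between `transformers` versions):

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

# Values taken from this notebook: 20,000 training examples,
# per_device_train_batch_size = 4, gradient_accumulation_steps = 8, 3 epochs.
batch, steps = optimizer_steps(20_000, per_device_batch=4, grad_accum=8, epochs=3)
print(f"effective batch size = {batch}, total steps = {steps}")
```

Each optimizer step therefore averages gradients over 32 examples, even though only 4 are held in GPU memory at a time.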

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,
    eval_dataset = formatted_validation_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    response_template = "Output:",
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.03,
        # max_steps = 60,
        num_train_epochs = 3,
        # learning_rate = 2e-4,
        learning_rate = 5e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/20000 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/5000 [00:00<?, ? examples/s]

## **Step 4 Extra: Grid Training!**
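A grid search trains one model per combination of hyperparameter values and keeps the configuration with the best validation accuracy. The sketch below shows how the combinations are enumerated with `itertools.product`; the grid is an illustrative copy with the same shape as the one used in this section.

```python
import itertools

# Illustrative grid: 3 ranks x 2 dropout values x 2 learning rates.
grid = {
    "lora_r": [2, 4, 8],
    "lora_dropout": [0.0, 0.05],
    "learning_rate": [1e-4, 2e-4],
}

# The Cartesian product yields one tuple of values per configuration.
keys = list(grid)
configs = [dict(zip(keys, values)) for values in itertools.product(*grid.values())]

print(len(configs))  # 3 * 2 * 2 = 12 configurations
print(configs[0])    # the first configuration tried
```

Because the number of runs is the product of the list lengths, adding one more value to any list multiplies the total training time accordingly.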

In [None]:
import itertools, json, os, torch
from copy import deepcopy
from trl import SFTTrainer
from transformers import TrainingArguments

# grid = {
#     "lora_r": [1, 2, 4, 8, 16],
#     "lora_alpha": [2, 4, 8, 16, 32],
#     "lora_dropout": [0.0, 0.05],
#     "learning_rate": [1e-4, 2e-4],
#     "max_steps": [200],
# }

grid = {
    "lora_r": [2, 4, 8],
    # "lora_alpha": 2 * [2, 4, 8],
    "lora_dropout": [0.0, 0.05],
    "learning_rate": [1e-4, 2e-4],
    "max_steps": [100],
}

def combos(grid):
    keys = list(grid.keys())
    for vals in itertools.product(*[grid[k] for k in keys]):
        yield dict(zip(keys, vals))

def build_model(lora_r, lora_alpha, lora_dropout):
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Meta-Llama-3.1-8B",
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = load_in_4bit,
        device_map = {"": 0},
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r = lora_r,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj"],
        lora_alpha = lora_alpha,
        lora_dropout = lora_dropout,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=42,
    )
    return model

In [None]:
import re
import torch
from tqdm import tqdm

def evaluate_accuracy(model, tokenizer, dataset, prompt_template):
    FastLanguageModel.for_inference(model)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id

    def parse_bool(text):
        m = re.search(r'^\s*(True|False)\b', text, flags=re.IGNORECASE)
        return bool(m and m.group(1).lower() == "true")

    correct, total = 0, 0
    for ex in tqdm(dataset):
        q = ex["question"]
        a = ex["answer"]
        s = ex["solution"]
        o = bool(ex["is_correct"])

        prompt = prompt_template.format(q, a, s)
        inputs = tokenizer([prompt], return_tensors="pt").to(device)
        # outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
        # response = tokenizer.batch_decode(outputs)
        # pred = parse_bool(response)
        with torch.no_grad():
            gen = model.generate(**inputs, max_new_tokens=8, use_cache=True, do_sample=False)

        input_len = inputs["input_ids"].shape[1]
        new_tokens = gen[0][input_len:]
        text = tokenizer.decode(new_tokens, skip_special_tokens=True)

        if "Output:" in text:
            text = text.split("Output:", 1)[-1]

        pred = parse_bool(text)

        correct += int(pred == o)
        total += 1
    acc = correct / total
    print(f"\nValidation Accuracy: {acc:.4f} ({correct}/{total})")
    return acc
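
The accuracy figure depends on how generated text is mapped to a boolean. The sketch below repeats the `parse_bool` helper from the cell above in isolation and runs it on a few made-up strings, to show which outputs count as `True`.

```python
import re

def parse_bool(text: str) -> bool:
    """Return True only if the text starts with 'True' (case-insensitive)."""
    m = re.search(r'^\s*(True|False)\b', text, flags=re.IGNORECASE)
    return bool(m and m.group(1).lower() == "true")

# Made-up model outputs, including two that cannot be parsed.
samples = [" True", "true\nQuestion:", "False", "The answer is True", ""]
for s in samples:
    print(repr(s), "->", parse_bool(s))
```

Note that anything the regex cannot parse, such as the last two samples, is treated as `False`. An unparseable generation is therefore scored as correct whenever the true label is `False`, which is worth keeping in mind when comparing accuracies.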



In [None]:
from transformers import AutoTokenizer
model_id = "unsloth/Meta-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
EOS_TOKEN = tokenizer.eos_token


In [None]:
training_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
GivenAnswer:
{}
GivenSolution (optional reasoning):
{}
Output:
{}"""

In [None]:
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
GivenAnswer:
{}
GivenSolution (optional reasoning):
{}
Output:
"""

In [None]:
def formatting_prompts_func(examples):
    qs = examples["question"]
    ans = examples["answer"]
    sol = examples["solution"]
    ys  = examples["is_correct"]
    texts = []
    for q, a, s, y in zip(qs, ans, sol, ys):
        texts.append(training_prompt.format(q, str(a), str(s), str(y)) + EOS_TOKEN)
    return {"text": texts}

formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
formatted_validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True)


In [None]:
import gc
def free_cuda():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

In [None]:
best = {"acc": -1.0, "cfg": None, "ckpt_dir": None}
for cfg in combos(grid):

    cfg["lora_alpha"] = 2 * cfg["lora_r"]

    print("\n=== Try config:", cfg, "===")
    model = build_model(cfg["lora_r"], cfg["lora_alpha"], cfg["lora_dropout"])

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = formatted_train_dataset,
        eval_dataset = formatted_validation_dataset,
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        response_template = "Output:",
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            learning_rate = cfg["learning_rate"],
            warmup_steps = 5,
            max_steps = cfg["max_steps"],
            logging_steps = 10,
            fp16 = not torch.cuda.is_bf16_supported(),
            bf16 = torch.cuda.is_bf16_supported(),
            optim = "adamw_8bit",
            weight_decay = 0.01,
            lr_scheduler_type = "linear",
            seed = 42,
            save_strategy = "no",
            output_dir = "outputs/tmp",
            report_to = "none",
        ),
    )
    trainer.train()

    acc = evaluate_accuracy(model, tokenizer, validation_dataset, inference_prompt)
    if acc > best["acc"]:
        best["acc"] = acc
        best["cfg"] = deepcopy(cfg)
        ckpt_dir = f"outputs/best_lora_r{cfg['lora_r']}_a{cfg['lora_alpha']}_d{cfg['lora_dropout']}_lr{cfg['learning_rate']}"
        os.makedirs(ckpt_dir, exist_ok=True)
        model.save_pretrained(ckpt_dir)
        tokenizer.save_pretrained(ckpt_dir)
        best["ckpt_dir"] = ckpt_dir
    del trainer, model
    free_cuda()
print("\n=== Best on VAL ===")
print(json.dumps({"acc": best["acc"], **best["cfg"]}, indent=2))
print("Adapter saved at:", best["ckpt_dir"])



=== Try config: {'lora_r': 1, 'lora_alpha': 2, 'lora_dropout': 0.0, 'learning_rate': 0.0001, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.10.11 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 2,621,440 of 8,032,882,688 (0.03% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,1.6564
20,1.4485
30,1.1743
40,0.8809
50,0.8113
60,0.7841
70,0.7892
80,0.7831
90,0.7134
100,0.727


100%|██████████| 500/500 [01:45<00:00,  4.74it/s]



Validation Accuracy: 0.7500 (375/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 2, 'lora_dropout': 0.0, 'learning_rate': 0.0002, 'max_steps': 200} ===

Step,Training Loss
10,1.6211
20,1.1457
30,0.8908
40,0.7574
50,0.7382
60,0.7302
70,0.7484
80,0.7486
90,0.6777
100,0.6943


100%|██████████| 500/500 [01:46<00:00,  4.70it/s]



Validation Accuracy: 0.7580 (379/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 2, 'lora_dropout': 0.05, 'learning_rate': 0.0001, 'max_steps': 200} ===

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.10.11 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.

Step,Training Loss
10,1.6565
20,1.4497
30,1.1755
40,0.8812
50,0.8115
60,0.7843
70,0.7894
80,0.7839
90,0.7139
100,0.7275


100%|██████████| 500/500 [02:15<00:00,  3.70it/s]



Validation Accuracy: 0.7540 (377/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 2, 'lora_dropout': 0.05, 'learning_rate': 0.0002, 'max_steps': 200} ===

Step,Training Loss
10,1.6213
20,1.1474
30,0.8919
40,0.758
50,0.7383
60,0.7311
70,0.7482
80,0.749
90,0.6778
100,0.6944


100%|██████████| 500/500 [02:13<00:00,  3.74it/s]



Validation Accuracy: 0.7520 (376/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 4, 'lora_dropout': 0.0, 'learning_rate': 0.0001, 'max_steps': 200} ===

Step,Training Loss
10,1.6372
20,1.2915
30,0.9925
40,0.7984
50,0.7612
60,0.7474
70,0.7653
80,0.7633
90,0.6913
100,0.7047


100%|██████████| 500/500 [01:47<00:00,  4.66it/s]



Validation Accuracy: 0.7540 (377/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 4, 'lora_dropout': 0.0, 'learning_rate': 0.0002, 'max_steps': 200} ===

Step,Training Loss
10,1.5721
20,0.9847
30,0.8249
40,0.7315
50,0.7174
60,0.7102
70,0.7327
80,0.7405
90,0.6706
100,0.687


100%|██████████| 500/500 [01:45<00:00,  4.74it/s]



Validation Accuracy: 0.7580 (379/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 4, 'lora_dropout': 0.05, 'learning_rate': 0.0001, 'max_steps': 200} ===

Step,Training Loss
10,1.6376
20,1.2933
30,0.9939
40,0.7982
50,0.7612
60,0.7469
70,0.7648
80,0.763
90,0.6911
100,0.7051


100%|██████████| 500/500 [02:16<00:00,  3.68it/s]



Validation Accuracy: 0.7540 (377/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 4, 'lora_dropout': 0.05, 'learning_rate': 0.0002, 'max_steps': 200} ===

Step,Training Loss
10,1.5726
20,0.9853
30,0.8249
40,0.732
50,0.7171
60,0.7108
70,0.7327
80,0.7406
90,0.6712
100,0.6873


100%|██████████| 500/500 [02:13<00:00,  3.74it/s]



Validation Accuracy: 0.7560 (378/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 8, 'lora_dropout': 0.0, 'learning_rate': 0.0001, 'max_steps': 200} ===

Step,Training Loss
10,1.5983
20,1.1083
30,0.8796
40,0.7504
50,0.7306
60,0.7197
70,0.7395
80,0.7443
90,0.6747
100,0.6913


100%|██████████| 500/500 [01:46<00:00,  4.68it/s]



Validation Accuracy: 0.7600 (380/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 8, 'lora_dropout': 0.0, 'learning_rate': 0.0002, 'max_steps': 200} ===

Step,Training Loss
10,1.4979
20,0.8733
30,0.7934
40,0.7134
50,0.7051
60,0.7033
70,0.726
80,0.736
90,0.6644
100,0.682


100%|██████████| 500/500 [01:45<00:00,  4.72it/s]



Validation Accuracy: 0.7700 (385/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 8, 'lora_dropout': 0.05, 'learning_rate': 0.0001, 'max_steps': 200} ===

Step,Training Loss
10,1.5984
20,1.1104
30,0.8802
40,0.7507
50,0.7311
60,0.72
70,0.7393
80,0.7442
90,0.6743
100,0.6916


100%|██████████| 500/500 [02:14<00:00,  3.73it/s]



Validation Accuracy: 0.7520 (376/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 8, 'lora_dropout': 0.05, 'learning_rate': 0.0002, 'max_steps': 200} ===

Step,Training Loss
10,1.4984
20,0.8751
30,0.7957
40,0.7152
50,0.7057
60,0.7037
70,0.7262
80,0.7364
90,0.6646
100,0.682


100%|██████████| 500/500 [02:13<00:00,  3.74it/s]



Validation Accuracy: 0.7620 (381/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 16, 'lora_dropout': 0.0, 'learning_rate': 0.0001, 'max_steps': 200} ===

Step,Training Loss
10,1.5334
20,0.9471
30,0.8127
40,0.7224
50,0.7093
60,0.7063
70,0.7301
80,0.7388
90,0.6674
100,0.6844


100%|██████████| 500/500 [01:46<00:00,  4.71it/s]



Validation Accuracy: 0.7580 (379/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 16, 'lora_dropout': 0.0, 'learning_rate': 0.0002, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 2,621,440 of 8,032,882,688 (0.03% trained)


Step,Training Loss
10,1.4029
20,0.8047
30,0.7697
40,0.7073
50,0.7012
60,0.6984
70,0.7228
80,0.7347
90,0.6608
100,0.6792


100%|██████████| 500/500 [01:45<00:00,  4.73it/s]



Validation Accuracy: 0.7820 (391/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 16, 'lora_dropout': 0.05, 'learning_rate': 0.0001, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 2,621,440 of 8,032,882,688 (0.03% trained)


Step,Training Loss
10,1.5341
20,0.948
30,0.8131
40,0.7223
50,0.7092
60,0.7072
70,0.7302
80,0.7389
90,0.6677
100,0.6845


100%|██████████| 500/500 [02:14<00:00,  3.72it/s]



Validation Accuracy: 0.7600 (380/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 16, 'lora_dropout': 0.05, 'learning_rate': 0.0002, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 2,621,440 of 8,032,882,688 (0.03% trained)


Step,Training Loss
10,1.404
20,0.8061
30,0.7695
40,0.707
50,0.7018
60,0.6987
70,0.7211
80,0.7346
90,0.6604
100,0.6788


100%|██████████| 500/500 [02:13<00:00,  3.75it/s]



Validation Accuracy: 0.7720 (386/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 32, 'lora_dropout': 0.0, 'learning_rate': 0.0001, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 2,621,440 of 8,032,882,688 (0.03% trained)


Step,Training Loss
10,1.4402
20,0.8341
30,0.7773
40,0.7077
50,0.7034
60,0.7011
70,0.7237
80,0.7352
90,0.6618
100,0.6811


100%|██████████| 500/500 [01:46<00:00,  4.68it/s]



Validation Accuracy: 0.7640 (382/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 32, 'lora_dropout': 0.0, 'learning_rate': 0.0002, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 2,621,440 of 8,032,882,688 (0.03% trained)


Step,Training Loss
10,1.3096
20,0.7664
30,0.7595
40,0.7035
50,0.7011
60,0.6977
70,0.7216
80,0.7343
90,0.6578
100,0.6811


100%|██████████| 500/500 [01:45<00:00,  4.74it/s]



Validation Accuracy: 0.7700 (385/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 32, 'lora_dropout': 0.05, 'learning_rate': 0.0001, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 2,621,440 of 8,032,882,688 (0.03% trained)


Step,Training Loss
10,1.4411
20,0.8356
30,0.7787
40,0.7086
50,0.7032
60,0.7016
70,0.7238
80,0.7358
90,0.6616
100,0.6805


100%|██████████| 500/500 [02:11<00:00,  3.80it/s]



Validation Accuracy: 0.7660 (383/500)

=== Try config: {'lora_r': 1, 'lora_alpha': 32, 'lora_dropout': 0.05, 'learning_rate': 0.0002, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 2,621,440 of 8,032,882,688 (0.03% trained)


Step,Training Loss
10,1.3104
20,0.7665
30,0.7605
40,0.7039
50,0.7016
60,0.6973
70,0.7228
80,0.7336
90,0.6577
100,0.68


100%|██████████| 500/500 [02:12<00:00,  3.79it/s]



Validation Accuracy: 0.7840 (392/500)

=== Try config: {'lora_r': 2, 'lora_alpha': 2, 'lora_dropout': 0.0, 'learning_rate': 0.0001, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 5,242,880 of 8,035,504,128 (0.07% trained)


Step,Training Loss
10,1.6584
20,1.4599
30,1.1805
40,0.8805
50,0.8086
60,0.7808
70,0.7866
80,0.7838
90,0.7144
100,0.7287


100%|██████████| 500/500 [01:50<00:00,  4.52it/s]



Validation Accuracy: 0.7460 (373/500)

=== Try config: {'lora_r': 2, 'lora_alpha': 2, 'lora_dropout': 0.0, 'learning_rate': 0.0002, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 5,242,880 of 8,035,504,128 (0.07% trained)


Step,Training Loss
10,1.6264
20,1.1512
30,0.889
40,0.7566
50,0.7397
60,0.7328
70,0.7522
80,0.7509
90,0.679
100,0.6953


100%|██████████| 500/500 [01:50<00:00,  4.51it/s]



Validation Accuracy: 0.7620 (381/500)

=== Try config: {'lora_r': 2, 'lora_alpha': 2, 'lora_dropout': 0.05, 'learning_rate': 0.0001, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 5,242,880 of 8,035,504,128 (0.07% trained)


Step,Training Loss
10,1.6585
20,1.4611
30,1.1817
40,0.8812
50,0.8085
60,0.7815
70,0.787
80,0.784
90,0.7147
100,0.7291


100%|██████████| 500/500 [02:16<00:00,  3.66it/s]



Validation Accuracy: 0.7620 (381/500)

=== Try config: {'lora_r': 2, 'lora_alpha': 2, 'lora_dropout': 0.05, 'learning_rate': 0.0002, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 5,242,880 of 8,035,504,128 (0.07% trained)


Step,Training Loss
10,1.6265
20,1.1524
30,0.8897
40,0.7568
50,0.7398
60,0.7329
70,0.7533
80,0.7514
90,0.6794
100,0.6962


100%|██████████| 500/500 [02:16<00:00,  3.66it/s]



Validation Accuracy: 0.7500 (375/500)

=== Try config: {'lora_r': 2, 'lora_alpha': 4, 'lora_dropout': 0.0, 'learning_rate': 0.0001, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 5,242,880 of 8,035,504,128 (0.07% trained)


Step,Training Loss
10,1.6418
20,1.2986
30,0.9916
40,0.7934
50,0.7603
60,0.7479
70,0.7671
80,0.7668
90,0.6962
100,0.7108


100%|██████████| 500/500 [01:50<00:00,  4.52it/s]



Validation Accuracy: 0.7520 (376/500)

=== Try config: {'lora_r': 2, 'lora_alpha': 4, 'lora_dropout': 0.0, 'learning_rate': 0.0002, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 5,242,880 of 8,035,504,128 (0.07% trained)


Step,Training Loss
10,1.5781
20,0.984
30,0.8252
40,0.7342
50,0.7218
60,0.7132
70,0.733
80,0.7402
90,0.6709
100,0.6881


100%|██████████| 500/500 [01:50<00:00,  4.54it/s]



Validation Accuracy: 0.7560 (378/500)

=== Try config: {'lora_r': 2, 'lora_alpha': 4, 'lora_dropout': 0.05, 'learning_rate': 0.0001, 'max_steps': 200} ===
==((====))==  Unsloth 2025.10.11: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 5,242,880 of 8,035,504,128 (0.07% trained)


Step,Training Loss
10,1.6421
20,1.3
30,0.9926
40,0.7936
50,0.7608
60,0.7487
70,0.7667
80,0.7662
90,0.6962


KeyboardInterrupt: 

In [None]:
print(best)

{'acc': 0.784, 'cfg': {'lora_r': 1, 'lora_alpha': 32, 'lora_dropout': 0.05, 'learning_rate': 0.0002, 'max_steps': 200}, 'ckpt_dir': 'outputs/best_lora_r1_a32_d0.05_lr0.0002'}
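
The order of the `Try config` lines in the log above is consistent with a Cartesian product over the hyperparameters, with `lora_r` varying slowest and `learning_rate` fastest. A minimal sketch of such a sweep follows. The grid values are a subset of those visible in the log, and `train_and_eval` is a hypothetical stand-in for the notebook's own train-and-validate routine, not its actual API.

```python
from itertools import product

# Illustrative grid: a subset of the values seen in the log (the real sweep may differ).
grid = {
    "lora_r": [1, 2],
    "lora_alpha": [16, 32],
    "lora_dropout": [0.0, 0.05],
    "learning_rate": [1e-4, 2e-4],
}

def iter_configs(grid, max_steps=200):
    """Yield one config dict per point of the Cartesian product."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        cfg["max_steps"] = max_steps
        yield cfg

def run_sweep(grid, train_and_eval):
    """Keep the best config; train_and_eval(cfg) returns validation accuracy (hypothetical)."""
    best = {"acc": -1.0, "cfg": None}
    for cfg in iter_configs(grid):
        acc = train_and_eval(cfg)
        if acc > best["acc"]:  # strict comparison: ties keep the earlier config
            best = {"acc": acc, "cfg": cfg}
    return best
```

Because the comparison is strict, two configurations with equal validation accuracy resolve in favor of whichever ran first.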


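## **Step 5: Start Training!**

Now, we'll call the `train()` function on our `trainer` object to kick off fine-tuning. The run logged below uses larger settings than the quick-start configuration: it covers 20,000 examples for 3 epochs (1,875 optimizer steps at an effective batch size of 32) rather than one epoch over 5,000 examples.

Grab a coffee: the run logged below took about 2.4 hours (a `train_runtime` of roughly 8,717 seconds)! ☕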


In [None]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 20,000 | Num Epochs = 3 | Total steps = 1,875
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 8 x 1) = 32
 "-____-"     Trainable parameters = 20,971,520 of 8,051,232,768 (0.26% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,0.4823
20,0.4756
30,0.4826
40,0.4785
50,0.4654
60,0.479
70,0.4896
80,0.4722
90,0.4861
100,0.4727


TrainOutput(global_step=1875, training_loss=0.4343602831522624, metrics={'train_runtime': 8717.4335, 'train_samples_per_second': 6.883, 'train_steps_per_second': 0.215, 'total_flos': 1.1997836089523896e+18, 'train_loss': 0.4343602831522624, 'epoch': 3.0})
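
The step count and throughput in the log above follow from the batch settings. The short calculation below reproduces them from the logged values; using `math.ceil` for a partial final batch is an assumption that does not matter here, since 20,000 divides evenly by 32.

```python
import math

# Values copied from the training banner and TrainOutput above.
num_examples = 20_000
per_device_batch = 4
grad_accum = 8
num_gpus = 1
num_epochs = 3
train_runtime_s = 8717.4335

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(effective_batch, steps_per_epoch, total_steps)   # 32 625 1875
print(round(samples_per_second, 3), round(steps_per_second, 3))  # 6.883 0.215
```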


## **Step 6: Inference and Evaluation**

Now that our model is trained, we need to test it on our validation set. We'll use a slightly different prompt for inference: one where we leave the `Output:` section blank for the model to complete.

Let's test it on a single example from our validation set to see what it predicts.

In [None]:
# Prepare the model for faster inference
FastLanguageModel.for_inference(model)

# Create the prompt template for inference (no answer included)
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
GivenAnswer:
{}
GivenSolution (optional reasoning):
{}
Output:
"""

# Select a sample from the validation set
example = validation_dataset[10] # You can change the index (e.g., to 1, 2, 50)
question = example["question"]
answer = example["answer"]
solution = example["solution"]

# Format the prompt with the validation data
inputs = tokenizer(
[
    inference_prompt.format(question, str(answer), str(solution))
], return_tensors = "pt").to("cuda")

# Generate the model's response
outputs = model.generate(**inputs, max_new_tokens = 8, use_cache = True)
response = tokenizer.batch_decode(outputs)

# Print the results
print("#### QUESTION ####")
print(question)
print("\n#### SOLUTION ####")
print(solution)
print("\n#### MODEL'S PREDICTION ####")
# We process the output to show only the generated text
print(response[0].split("Output:\n")[1])
print("\n#### CORRECT ANSWER ####")
print(example["is_correct"])

_ = evaluate_accuracy(model, tokenizer, validation_dataset, inference_prompt)


#### QUESTION ####
How many positive three-digit integers less than 500 have at least two digits that are the same?

#### SOLUTION ####
To calculate the answer, we need to enumerate all the positive three-digit integers less than 500, and select those who have at least two identical digits. 
Here is some Python code to calculate this. 
<llm-code>
count = 0
for number in range(100, 500):
    num_str = str(number)
    digits = set()
    for digit in num_str:
        if digit in digits:
            count += 1
            break
        else:
            digits.add(digit)

print(count)
</llm-code>
<llm-code-output>
112
</llm-code-output>
Hence, the answer is \boxed{112}.

#### MODEL'S PREDICTION ####
False<|end_of_text|>

#### CORRECT ANSWER ####
True


100%|██████████| 5000/5000 [18:00<00:00,  4.63it/s]


Validation Accuracy: 0.8914 (4457/5000)
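
One rough way to judge how much of an accuracy gap is noise is the binomial standard error, sqrt(p(1-p)/n). The sketch below applies it to two figures reported in this notebook: the best sweep result (392/500) and the full validation result (4457/5000). It assumes the validation examples are independent draws, which is a simplification.

```python
import math

def accuracy_standard_error(correct, total):
    """Return (accuracy, binomial standard error) for `correct` hits out of `total`."""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)

for label, correct, total in [("sweep best", 392, 500), ("final model", 4457, 5000)]:
    p, se = accuracy_standard_error(correct, total)
    print(f"{label}: acc={p:.4f}, approx. 95% half-width={1.96 * se:.4f}")
```

On 500 examples the half-width is roughly 0.036, so sweep configurations that differ by only one or two points of accuracy are hard to tell apart. On 5,000 examples it shrinks to under 0.01.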





## **Step 7: Generate Submission File**

This is the final step! We will now run our fine-tuned model on the official `test` dataset.

We will loop through each example in the test set, generate a prediction, and format the results into a CSV file with two columns: `ID` and `is_correct`, as required by the competition.


In [None]:
import pandas as pd
from tqdm import tqdm

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

# Loop through the test dataset and generate a prediction for each example
for example in tqdm(test_dataset):
    question = example["question"]
    answer = example["answer"]
    solution = example["solution"]

    # Format the prompt
    prompt = inference_prompt.format(question, str(answer), str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")

100%|██████████| 10000/10000 [36:05<00:00,  4.62it/s]


Submission file 'submission.csv' created successfully!
You can now download this file and submit it to the Kaggle competition.
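
`parse_output` above treats any occurrence of the substring `true` after `Output:` as a positive prediction. A stricter variant, sketched below, looks only at the first word generated after the last `Output:` marker and ignores special tokens such as `<|end_of_text|>`. The name `parse_first_label` and the fallback to `False` for unrecognized output are illustrative choices, not part of the original notebook.

```python
import re

def parse_first_label(response_text, default=False):
    """Return True/False based on the first word after the last 'Output:' marker."""
    output_part = response_text.split("Output:")[-1]
    # Drop special tokens of the form <|...|> before searching for a word.
    output_part = re.sub(r"<\|.*?\|>", " ", output_part)
    match = re.search(r"[A-Za-z]+", output_part)
    if match is None:
        return default
    word = match.group(0).lower()
    if word == "true":
        return True
    if word == "false":
        return False
    return default

print(parse_first_label("Question:\n...\nOutput:\nFalse<|end_of_text|>"))
```

With `max_new_tokens=8` the two parsers will usually agree; the difference only shows up when the model emits extra text such as `False, not True`.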





# Save the Model to Drive and Run Inference
This section saves the model checkpoint to Google Drive, reloads the model from that checkpoint, and generates the final submission CSV file.

## Mount Google Drive

### Subtask:
Mount Google Drive to save the model checkpoint.


**Reasoning**:
Mounting Drive exposes it under `/content/drive`, so files written there persist after the Colab runtime is recycled.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Save model checkpoint

### Subtask:
Save the trained model checkpoint to the specified path in Google Drive.


**Reasoning**:
Define the save path and save the model and tokenizer to Google Drive.



In [None]:
import os

# Define the path to save the model checkpoint in Google Drive
save_path = "/content/drive/MyDrive/DL_midter_contest/llama3_8b_math_verifier_checkpoint"

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model checkpoint and tokenizer saved to: {save_path}")

Model checkpoint and tokenizer saved to: /content/drive/MyDrive/DL_midter_contest/llama3_8b_math_verifier_checkpoint
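
Before relying on the checkpoint in a later session, it can help to confirm that the directory actually contains the adapter files. The sketch below reports which expected files are missing. The default file names (`adapter_config.json`, `adapter_model.safetensors`) are an assumption based on how PEFT typically saves LoRA adapters and may differ across library versions; the helper name is hypothetical.

```python
import os

def missing_checkpoint_files(
    path,
    expected=("adapter_config.json", "adapter_model.safetensors"),  # assumed names
):
    """Return the expected file names that are absent from the checkpoint directory."""
    present = set(os.listdir(path)) if os.path.isdir(path) else set()
    return [name for name in expected if name not in present]
```

Calling `missing_checkpoint_files(save_path)` returns an empty list when every expected file is present.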


## Load model from checkpoint

### Subtask:
Load the model from the saved checkpoint.


**Reasoning**:
Load the model and tokenizer from the saved checkpoint path in Google Drive and prepare the model for inference.



In [None]:
# Define the path where the model checkpoint was saved in Google Drive
save_path = "/content/drive/MyDrive/DL_midter_contest/llama3_8b_math_verifier_checkpoint"

# Load the model and tokenizer from the saved path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Prepare the loaded model for faster inference
# (assigned to `_` so the cell does not echo the returned object)
_ = FastLanguageModel.for_inference(model)

# print(f"Model and tokenizer loaded from: {save_path}")

==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.10.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


## Generate submission file

### Subtask:
Generate the submission CSV file using the loaded model.


**Reasoning**:
Generate the submission CSV file by iterating through the test dataset, generating predictions using the loaded model, and saving the results to a pandas DataFrame.



In [None]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

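# Create the prompt template for inference (no answer included)
# NOTE: this template omits the GivenAnswer field used by the Step 6 template;
# the inference prompt should match the format the checkpoint was trained on.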
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

# Loop through the test dataset and generate a prediction for each example
for example in tqdm(test_dataset):
    question = example["question"]
    solution = example["solution"]

    # Format the prompt
    prompt = inference_prompt.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")
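
A quick sanity check on the written file can catch formatting problems before upload. The sketch below uses only the standard library. The function name `check_submission` is hypothetical, and the expected row count of 10,000 in the usage note comes from the test-set progress bar earlier in this notebook.

```python
import csv

def check_submission(path, expected_rows):
    """Return a list of problems found in the submission CSV (empty if none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["ID", "is_correct"]:
            problems.append(f"unexpected columns: {reader.fieldnames}")
            return problems
        rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    # IDs should be 0..N-1 in order, matching range(len(predictions)).
    if [row["ID"] for row in rows] != [str(i) for i in range(len(rows))]:
        problems.append("IDs are not 0..N-1 in order")
    # pandas writes Python booleans as the strings 'True' and 'False'.
    bad = {row["is_correct"] for row in rows} - {"True", "False"}
    if bad:
        problems.append(f"unexpected is_correct values: {sorted(bad)}")
    return problems
```

For this notebook the call would be `check_submission("submission.csv", 10_000)`, which returns an empty list when the file is well formed.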