<a href="https://colab.research.google.com/github/HariHaran9597/Math-solver/blob/main/MATH_agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# %% [1] CONNECT TO GOOGLE DRIVE (Your Persistent Hard Drive)
from google.colab import drive
import os

# This will ask for permission to access your Drive. Say "Yes".
drive.mount('/content/drive')

# %% [2] CREATE THE PROJECT STRUCTURE
# We are creating the folder structure inside your Drive.
# Path: My Drive -> AI_Projects -> math-solver

project_root = "/content/drive/MyDrive/AI_Projects/math-solver"

# Define the sub-folders we need
folders = [
    f"{project_root}/data/raw",        # Original GSM8K data
    f"{project_root}/data/processed",  # Formatted data for Qwen
    f"{project_root}/models/adapters", # Where we save the fine-tuned weights
    f"{project_root}/logs",            # Training logs
    f"{project_root}/notebooks"        # Backups
]

# Create them
for folder in folders:
    os.makedirs(folder, exist_ok=True)
    print(f"✅ Created/Verified: {folder}")

# %% [3] MOVE TO PROJECT DIRECTORY
# Tell Python to work inside this folder
os.chdir(project_root)
print(f"\n📂 Current Working Directory: {os.getcwd()}")

Mounted at /content/drive
✅ Created/Verified: /content/drive/MyDrive/AI_Projects/math-solver/data/raw
✅ Created/Verified: /content/drive/MyDrive/AI_Projects/math-solver/data/processed
✅ Created/Verified: /content/drive/MyDrive/AI_Projects/math-solver/models/adapters
✅ Created/Verified: /content/drive/MyDrive/AI_Projects/math-solver/logs
✅ Created/Verified: /content/drive/MyDrive/AI_Projects/math-solver/notebooks

📂 Current Working Directory: /content/drive/MyDrive/AI_Projects/math-solver


In [None]:
# %% [1] INSTALLATION
# We install Unsloth (optimized for free Colab GPUs) and specific PyTorch utilities.
# The --no-deps flag prevents it from breaking Colab's pre-installed packages.
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

# %% [2] SETUP & MODEL LOADING
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
import os

# Define our specific model parameters
max_seq_length = 2048 # Math problems can get long, 2048 is safe
dtype = None # Auto-detect (Float16 for T4 GPU)
load_in_4bit = True # 4-bit quantization (The key to running this on free tier)

print("⏳ Loading Qwen2.5-Math-1.5B-Instruct...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-Math-1.5B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
print("✅ Model Loaded Successfully!")

# %% [3] DOWNLOAD & SAVE DATA
# We download the data once and save it to your Drive folders so we don't redownload it later.

print("⏳ Downloading GSM8K Dataset...")
dataset = load_dataset("openai/gsm8k", "main")

# Define where to save in your Drive
raw_data_path = "./data/raw"

# Save it!
dataset.save_to_disk(raw_data_path)
print(f"✅ Dataset saved to: {raw_data_path}")

# %% [4] SANITY CHECK (Test Run)
# Let's make sure the model actually works before we try to train it.

# Qwen uses a specific "ChatML" format. We need to match it.
prompt = """<|im_start|>system
You are a helpful math tutor. Solve problems step-by-step with clear explanations.<|im_end|>
<|im_start|>user
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<|im_end|>
<|im_start|>assistant
"""

# Convert text to numbers (tokens)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

# Generate answer
print("⏳ Generating answer...")
outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
response = tokenizer.batch_decode(outputs)[0]

# Strip the ChatML markers to show just the answer
answer = response.split("<|im_start|>assistant")[-1].replace("<|im_end|>", "")

print("\n" + "="*50)
print(f"🤖 MODEL OUTPUT:\n{answer}")
print("="*50)

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-sa1npjid/unsloth_10e21ee4d0d84556a2913361d3b3cadf
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-sa1npjid/unsloth_10e21ee4d0d84556a2913361d3b3cadf
  Resolved https://github.com/unslothai/unsloth.git to commit 349a81f96f748c0fde7b653a7be7675c183f1c96
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.11.5 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.11.5-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.gi

model.safetensors:   0%|          | 0.00/1.14G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/161 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

✅ Model Loaded Successfully!
⏳ Downloading GSM8K Dataset...


README.md: 0.00B [00:00, ?B/s]

main/train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

main/test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/7473 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1319 [00:00<?, ? examples/s]

✅ Dataset saved to: ./data/raw
⏳ Generating answer...

🤖 MODEL OUTPUT:

To determine the total number of clips Natalia sold in April and May, we need to follow these steps:

1. Identify the number of clips sold in April.
2. Calculate the number of clips sold in May.
3. Add the number of clips sold in April and May to get the total.

First, we know that Natalia sold 48 clips in April. 

Next, we need to find out how many clips she sold in May. According to the problem, she sold half as many clips in May as she did in April. Therefore, the number of clips sold in May is:
\[ \frac{48}{2} = 24 \]

Now, we add the number of clips sold in April and May to find the total:
\[ 48 + 24 = 72 \]

So, the total number of clips Natalia sold in April and May is:
\[ \boxed{72} \]
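
The hand-written prompt in the cell above follows the ChatML layout: each turn is wrapped in `<|im_start|>role ... <|im_end|>`, and the prompt ends with an open assistant turn for the model to complete. The sketch below builds the same string with a small helper so the template does not have to be retyped in every cell. `build_chatml_prompt` and `SYSTEM_PROMPT` are hypothetical names introduced for illustration; they are not part of Unsloth or Transformers.

```python
SYSTEM_PROMPT = (
    "You are a helpful math tutor. "
    "Solve problems step-by-step with clear explanations."
)

def build_chatml_prompt(question, system=SYSTEM_PROMPT):
    """Hypothetical helper: build a ChatML prompt that ends in an open assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("What is 48 + 24?")
print(prompt)
```

Hugging Face tokenizers also offer `apply_chat_template`, which builds the prompt from the model's own template; its output may differ in small details from the hand-written string used in this notebook.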


In [None]:
# %% [1] DEFINE THE FORMATTING FUNCTION
# Qwen expects a specific "ChatML" format. We are mapping the GSM8K data to this.

alpaca_prompt = """<|im_start|>system
You are a helpful math tutor. Solve problems step-by-step with clear explanations.<|im_end|>
<|im_start|>user
{}<|im_end|>
<|im_start|>assistant
{}<|im_end|>"""

EOS_TOKEN = tokenizer.eos_token # End-of-sequence token

def formatting_prompts_func(examples):
    questions = examples["question"]
    answers   = examples["answer"]
    texts = []
    for question, answer in zip(questions, answers):
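        # Append EOS_TOKEN so the model learns where to stop generating.
        # Note: the template already ends with <|im_end|>. When tokenizer.eos_token is
        # also <|im_end|> (as the printed example shows), the marker appears twice.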
        text = alpaca_prompt.format(question, answer) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

# %% [2] APPLY FORMATTING TO DATASET
# We map the function across the entire dataset.

print("⏳ Formatting dataset... (This is fast)")
dataset = dataset.map(formatting_prompts_func, batched = True)

print("✅ Data formatted successfully!")

# %% [3] INSPECT THE FORMATTED DATA
# Let's look at exactly what the model will see during training.
print("\n" + "="*20 + " TRAINING DATA EXAMPLE " + "="*20)
print(dataset['train']['text'][0])
print("="*60)

⏳ Formatting dataset... (This is fast)


Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

Map:   0%|          | 0/1319 [00:00<?, ? examples/s]

✅ Data formatted successfully!

<|im_start|>system
You are a helpful math tutor. Solve problems step-by-step with clear explanations.<|im_end|>
<|im_start|>user
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<|im_end|>
<|im_start|>assistant
Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72<|im_end|><|im_end|>
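
The printed example ends with `<|im_end|><|im_end|>`: the template already closes the assistant turn, and the appended `EOS_TOKEN` is the same marker. Below is a small sketch of one way to avoid the duplicate, assuming a single end marker is what is wanted. `append_eos_once` is a hypothetical helper, and the `eos` value is hard-coded for illustration rather than read from the tokenizer.

```python
def append_eos_once(text, eos):
    """Hypothetical helper: append eos only if text does not already end with it."""
    return text if text.endswith(eos) else text + eos

eos = "<|im_end|>"  # assumption: equals tokenizer.eos_token, as the printed example suggests
closed = "<|im_start|>assistant\n#### 72<|im_end|>"
unclosed = "<|im_start|>assistant\n#### 72"

print(append_eos_once(closed, eos))
print(append_eos_once(unclosed, eos))
```

Both calls print the same text with exactly one end marker. Whether the doubled marker affected the training runs recorded in this notebook is not something the logs can show.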


In [None]:
# %% [1] CONFIGURE LoRA ADAPTERS
# We add "adapters" to the model. These are the only parts we actually train.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Rank: higher gives the adapters more capacity but costs more memory and time. 16 is a common middle ground.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"], # Adapt all attention and MLP projection layers
    lora_alpha = 16,
    lora_dropout = 0, # 0 is optimized for Unsloth
    bias = "none",    # "none" is optimized for Unsloth
    use_gradient_checkpointing = "unsloth", # 30% memory savings
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

# %% [2] CONFIGURE TRAINER
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Setting this to True can speed up training on short sequences
    args = TrainingArguments(
        per_device_train_batch_size = 2, # Small batch size fits on T4
        gradient_accumulation_steps = 4, # Simulates a batch size of 8
        warmup_steps = 5,
        max_steps = 60, # FOR TESTING: Run 60 steps first (takes ~5 mins).
                        # For 1 full epoch later, set max_steps = -1 and num_train_epochs = 1
                        # (about 34 minutes in the full-epoch run logged later in this notebook).
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit", # 8-bit optimizer saves massive memory
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

# %% [3] START TRAINING
print("🚀 Starting Training...")
trainer_stats = trainer.train()

print("\n✅ Training Complete!")

Unsloth 2025.11.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/7473 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.


🚀 Starting Training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 18,464,768 of 1,562,179,072 (1.18% trained)
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: heyhariharan-r (heyhariharan-r-na) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


wandb: Detected [huggingface_hub.inference, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/


Step,Training Loss
1,1.7065
2,1.6322
3,1.8665
4,1.9353
5,1.8197
6,1.6488
7,1.7165
8,1.5686
9,1.4665
10,1.4776


Unsloth: Will smartly offload gradients to save VRAM!


Metric,History
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇███
train/grad_norm,▃▃▅▇▅▅▆▄▆█▄▃▄▃▇▅█▄▃▄▃▃▄▆▆▄▅▆▃▂▃▃▂▅▂▄▁▃▅▁
train/learning_rate,▁▂▄▇█▇▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▁▁
train/loss,▇▆██▆▆▅▆▅▄▃▃▃▄▂▃▃▃▂▂▂▂▂▂▂▂▁▁▁▂▂▂▂▁▂▁▁▂▁▂

Metric,Value
total_flos,949468647702528.0
train/epoch,0.06422
train/global_step,60.0
train/grad_norm,0.27645
train/learning_rate,0.0
train/loss,0.7046
train_loss,1.06275
train_runtime,210.5102
train_samples_per_second,2.28
train_steps_per_second,0.285



✅ Training Complete!
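
The training banner reports 18,464,768 trainable parameters. That figure can be reproduced by hand: a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. In the sketch below, `r = 16` and the 28 layers come from the cell and its log. The layer shapes are assumptions about the Qwen2.5-1.5B architecture (hidden size 1536, 256-dimensional key/value projections, MLP width 8960) and are not read from this notebook.

```python
r = 16           # LoRA rank from the get_peft_model call
num_layers = 28  # from the "patched 28 layers" log line

# Assumed (d_in, d_out) shapes for one transformer layer of Qwen2.5-1.5B.
hidden, kv_dim, mlp = 1536, 256, 8960
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

# Each adapter holds two matrices: A (r x d_in) and B (d_out x r).
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * num_layers
print(f"{trainable:,}")  # → 18,464,768
```

Under these assumed shapes the count matches the banner exactly, which is about 1.18% of the 1,562,179,072 total parameters it reports.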


In [None]:
# %% [1] SETUP FOR INFERENCE
# This switches the model to Unsloth's inference mode, which Unsloth describes as about 2x faster for generation
FastLanguageModel.for_inference(model)

# %% [2] DEFINE A NEW MATH PROBLEM
# Let's try a problem that requires 2-3 logical steps.
question = "I have 3 apples. I buy 4 more. Then I eat 2, and give 1 to my friend. How many apples do I have left?"

# Format it exactly like the training data
prompt = f"""<|im_start|>system
You are a helpful math tutor. Solve problems step-by-step with clear explanations.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
"""

# %% [3] GENERATE THE ANSWER
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

print("⏳ Thinking...")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    do_sample=True,  # Sampling must be enabled for temperature to take effect
    temperature=0.7, # Adds a little randomness
)

# Decode the answer
answer = tokenizer.batch_decode(outputs)[0]
clean_answer = answer.split("<|im_start|>assistant")[-1].replace("<|im_end|>", "")

print("\n" + "="*50)
print(f"🧠 FINE-TUNED MODEL OUTPUT:\n{clean_answer}")
print("="*50)

⏳ Thinking...

🧠 FINE-TUNED MODEL OUTPUT:

I start with 3 apples.
I buy 4 more apples, so I have 3 + 4 = 7 apples.
Then I eat 2 apples, so I have 7 - 2 = 5 apples.
Finally, I give 1 apple to my friend, so I have 5 - 1 = 4 apples left.
#### 4


In [None]:
# %% SAVE THE MODEL TO GOOGLE DRIVE
# We are saving ONLY the adapters (small file size), not the whole 1.5B model.
# This saves space on your Google Drive.

save_path = "/content/drive/MyDrive/AI_Projects/math-solver/models/adapters/v1_checkpoint"

print(f"⏳ Saving model to {save_path}...")
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print("✅ Model saved successfully!")
print("Check your Google Drive folder to confirm the files are there.")

⏳ Saving model to /content/drive/MyDrive/AI_Projects/math-solver/models/adapters/v1_checkpoint...
✅ Model saved successfully!
Check your Google Drive folder to confirm the files are there.


In [None]:
# %% [1] INSTALL GRADIO (If not already installed)
!pip install gradio

# %% [2] DEFINE THE APP LOGIC
import gradio as gr

# Reuse the exact same formatting template we trained on
alpaca_prompt = """<|im_start|>system
You are a helpful math tutor. Solve problems step-by-step with clear explanations.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
"""

def solve_math_problem(question, history):
    # 1. Format the input
    inputs = tokenizer(
        [alpaca_prompt.format(question=question)],
        return_tensors = "pt"
    ).to("cuda")

    # 2. Generate the answer
    outputs = model.generate(
        **inputs,
        max_new_tokens = 512,
        use_cache = True,
        do_sample = True,  # Sampling must be enabled for temperature to take effect
        temperature = 0.5, # Slightly lower temp for more stable math
    )

    # 3. Decode the output
    response = tokenizer.batch_decode(outputs)[0]

    # 4. Clean up the messy tags to show only the answer
    # We split by "assistant" and remove the end token
    clean_response = response.split("<|im_start|>assistant")[-1].replace("<|im_end|>", "")

    return clean_response

# %% [3] BUILD THE UI
# This creates a Chat Interface similar to ChatGPT
demo = gr.ChatInterface(
    fn=solve_math_problem,
    title="🧮 Math Solver AI (Qwen-1.5B-FineTuned)",
    description="Ask me a word problem! (e.g., 'If I have 5 apples...')",
    examples=[
        "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did she sell altogether?",
        "A deep-sea monster rises from the water. It has 10 arms. Each arm has 4 claws. How many claws does it have in total?",
        "If a train travels 60 miles in 2 hours, how fast is it going?"
    ],
    theme="soft"
)

# %% [4] LAUNCH THE APP
print("🚀 Launching App...")
# share=True creates a public link you can share with friends (the launch log below says it expires in 1 week)
demo.launch(share=True, debug=True)

ERROR: Operation cancelled by user


🚀 Launching App...
Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://49969162caa17fe67a.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://49969162caa17fe67a.gradio.live




In [None]:
# %% [1] THE STRESS TEST
# We will ask 3 levels of questions to see where it breaks.

questions = [
    # Level 1: Grade School (It should nail this)
    ("Easy", "If I have 30 candies and eat 12, then buy 5 more, how many do I have?"),

    # Level 2: Tricky Logic (It might struggle)
    ("Medium", "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?"),

    # Level 3: The "CAT Level" Impossible (It should fail)
    ("Hard", "Find the number of integral solutions for equation: x + y + z = 15, where x, y, z >= 0.")
]

print("⚡ RUNNING STRESS TEST ⚡\n")

for level, q in questions:
    # Format prompt
    prompt = f"""<|im_start|>system
You are a helpful math tutor. Solve step-by-step.<|im_end|>
<|im_start|>user
{q}<|im_end|>
<|im_start|>assistant
"""
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

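    # Generate (temperature only matters when sampling, so enable it explicitly)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.5)
    # Decode only the newly generated tokens. Splitting the full text on "assistant"
    # would cut the answer short if the model itself used that word.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    ans = tokenizer.decode(new_tokens, skip_special_tokens=True)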

    print(f"--- {level} Question ---")
    print(f"Q: {q}")
    print(f"A: {ans.strip()}\n")

⚡ RUNNING STRESS TEST ⚡

--- Easy Question ---
Q: If I have 30 candies and eat 12, then buy 5 more, how many do I have?
A: If you start with 30 candies and eat 12, you have 30 - 12 = 18 candies left. Then, if you buy 5 more candies, you have 18 + 5 = 23 candies.

--- Medium Question ---
Q: If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?
A: It would take 100 machines 5 minutes to make 100 widgets.

--- Hard Question ---
Q: Find the number of integral solutions for equation: x + y + z = 15, where x, y, z >= 0.
A: x + y + z = 15
x can take values from 0 to 15
For each value of x, y can take values from 0 to 15-x
For each value of y, z can take values from 0 to 15-x-y
So, the total number of solutions is
16 + 15 + 14 + ... + 1 = 16 * 17 / 2 = 136
#### 136
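
The model's answer to the hard question can be checked independently. By the stars-and-bars argument, the number of non-negative integer solutions of x + y + z = 15 is C(17, 2), and a brute-force count agrees. A short verification sketch:

```python
from math import comb

n = 15
# Choose x and y freely with x + y <= n; z = n - x - y is then forced.
brute = sum(1 for x in range(n + 1) for y in range(n + 1 - x))
formula = comb(n + 2, 2)
print(brute, formula)  # → 136 136
```

So 136 is correct, even though the cell's comment expected the model to fail at this level.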



In [None]:
# %% [1] EVALUATION SCRIPT
import re
import random
from tqdm import tqdm

# 1. Setup
test_dataset = dataset['test']
sample_size = 50  # We test 50 examples to save time (takes ~5 mins)
test_indices = random.sample(range(len(test_dataset)), sample_size)

# 2. Helper to extract numbers
def extract_answer(text):
    # Look for the pattern after ####
    if "####" in text:
        text = text.split("####")[-1]
    # Find the last number in the text (handles 1,200, $50, etc.)
    pattern = r"(-?[$0-9.,]{1,})"
    matches = re.findall(pattern, text)
    if not matches:
        return None
    # Clean the number (remove commas and $)
    return matches[-1].replace(",", "").replace("$", "").strip()

# 3. Run Evaluation
correct = 0
total = 0

print(f"📊 Running Evaluation on {sample_size} random test questions...")
print("-" * 60)

for i in tqdm(test_indices):
    example = test_dataset[i]
    question = example['question']
    true_ans = extract_answer(example['answer'])

    # Prompt the model
    prompt = f"""<|im_start|>system
You are a helpful math tutor. Solve problems step-by-step.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
"""
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

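    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)

    # Decode only the newly generated tokens. Splitting the full text on "assistant"
    # would cut the answer short if the model itself used that word.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    model_ans_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    # Extract prediction
    pred_ans = extract_answer(model_ans_text)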

    # Compare (Exact String Match)
    is_correct = (pred_ans == true_ans)
    if is_correct:
        correct += 1

    total += 1

# 4. Final Metrics
accuracy = (correct / total) * 100
print("\n" + "="*30)
print(f"📈 FINAL RESULTS")
print(f"✅ Correct: {correct}/{total}")
print(f"🎯 Accuracy: {accuracy:.2f}%")
print("="*30)

📊 Running Evaluation on 50 random test questions...
------------------------------------------------------------


100%|██████████| 50/50 [05:16<00:00,  6.34s/it]


📈 FINAL RESULTS
✅ Correct: 29/50
🎯 Accuracy: 58.00%
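
`extract_answer` compares answers as strings, and its pattern also accepts a trailing period or comma. A sentence-final `72.` would therefore not equal `72`, and `5.0` would not equal `5`. The sketch below shows one possible stricter normalization that compares numeric values instead. `normalize_number` is a hypothetical helper added for illustration; it is not what produced the 58% figure above.

```python
import re
from decimal import Decimal, InvalidOperation

def normalize_number(text):
    """Hypothetical helper: return the last number in text as a Decimal, or None."""
    if "####" in text:
        text = text.split("####")[-1]
    # A number: optional sign, digits with optional thousands commas, optional decimals.
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    try:
        return Decimal(matches[-1].replace(",", ""))
    except InvalidOperation:
        return None

print(normalize_number("The total is $1,200."))
print(normalize_number("#### 72") == normalize_number("So the answer is 72."))
print(normalize_number("5.0") == normalize_number("5"))
```

Comparing `Decimal` values rather than strings treats `5.0` and `5` as equal, and the pattern only takes a period when digits follow it, so sentence punctuation is ignored.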





In [None]:
# %% [1] CONFIGURE FOR FULL TRAINING (1 EPOCH)
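from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# We reuse the same model object, so this run continues from the weights already
# updated by the 60-step test run. Only the trainer is rebuilt to run a full pass.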
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 100, # Increased warmup for stability

        # --- THE IMPORTANT CHANGE ---
        max_steps = -1, # -1 means "ignore steps, use epochs"
        num_train_epochs = 1, # Train on 100% of the data
        # ---------------------------

        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10, # Log every 10 steps so we don't spam the screen
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs_full_epoch", # Save to a new folder
    ),
)

# %% [2] START THE REAL TRAINING
print("🚀 Starting Full Epoch Training (This will take ~45-60 mins)...")
trainer_stats = trainer.train()

print("\n✅ Full Training Complete!")

The model is already on multiple devices. Skipping the move to device specified in `args`.


🚀 Starting Full Epoch Training (This will take ~45-60 mins)...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 935
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 18,464,768 of 1,562,179,072 (1.18% trained)


Step,Training Loss
10,0.5472
20,0.5634
30,0.5828
40,0.514
50,0.5196
60,0.5254
70,0.5255
80,0.4923
90,0.4904
100,0.4847


Metric,History
train/epoch,▁▁▁▁▂▃▃▃▄▄▄▄▄▄▄▅▅▅▅▆▁▂▂▂▃▃▃▃▃▃▄▅▅▅▅▆▆▇██
train/global_step,▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▅▅▅▅▆▆▁▂▃▃▄▄▅▅▅▆▆▆▆▇▇▇███
train/grad_norm,▁▂▁▂▂▂▂▂▃▂█▂▂▃▂▃▂▂▂▂▃▃▃▃▄▃▄▃▃▄▅▄▄▃▄▃▄▄▄▄
train/learning_rate,▂▃▄▆█▇▇▇▇▇▆▆▅▅▅▄▃▃▁▅█▇▇▇▆▆▆▅▅▅▅▄▃▃▃▃▂▂▂▁
train/loss,██▆▆▆▅▄▅▅▅▄▃▄▄▄▄▄▄▂▃▂▁▁▂▂▂▁▁▂▂▃▂▂▃▁▄▃▄▃▃

Metric,Value
total_flos,1.4845557171652608e+16
train/epoch,1.0
train/global_step,935.0
train/grad_norm,0.48055
train/learning_rate,0.0
train/loss,0.5518
train_loss,0.51725
train_runtime,2030.2532
train_samples_per_second,3.681
train_steps_per_second,0.461



✅ Full Training Complete!
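
The step counts in both training logs follow from the batch settings: each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` examples. The sketch below is a quick check of that arithmetic using only numbers from the logs above. The trainer's own bookkeeping may round slightly differently.

```python
import math

num_examples = 7473
effective_batch = 2 * 4  # per-device batch size x gradient accumulation steps

steps_per_epoch = math.ceil(num_examples / effective_batch)
test_run_fraction = 60 * effective_batch / num_examples

print(steps_per_epoch)             # → 935
print(f"{test_run_fraction:.1%}")  # → 6.4%
```

This explains why the full pass shows 935 total steps, and why the 60-step test run covered only about 6.4% of one epoch.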


In [None]:
# %% SAVE THE FULL EPOCH MODEL
save_path = "/content/drive/MyDrive/AI_Projects/math-solver/models/adapters/full_epoch"

print(f"⏳ Saving full-epoch model to {save_path}...")
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print("✅ Saved! Your work is safe.")

⏳ Saving full-epoch model to /content/drive/MyDrive/AI_Projects/math-solver/models/adapters/full_epoch...
✅ Saved! Your work is safe.


In [None]:
# %% [1] RE-RUN EVALUATION
import re
import random
from tqdm import tqdm

# Ensure model is in inference mode
FastLanguageModel.for_inference(model)

test_dataset = dataset['test']
# Fix the seed so this sample of 50 questions is reproducible.
# Note: the first evaluation drew its sample without a seed, so the two runs
# used different questions and the comparison with 58% is only approximate.
random.seed(42)
test_indices = random.sample(range(len(test_dataset)), sample_size)

correct = 0
total = 0

print(f"📊 Running Evaluation on Full-Epoch Model...")
print("-" * 60)

for i in tqdm(test_indices):
    example = test_dataset[i]
    question = example['question']
    true_ans = extract_answer(example['answer'])

    prompt = f"""<|im_start|>system
You are a helpful math tutor. Solve problems step-by-step.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
"""
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    model_ans_text = response.split("assistant")[-1]
    pred_ans = extract_answer(model_ans_text)

    if pred_ans == true_ans:
        correct += 1
    total += 1

accuracy = (correct / total) * 100
print("\n" + "="*30)
print(f"📈 NEW RESULTS (Full Epoch)")
print(f"✅ Correct: {correct}/{total}")
print(f"🎯 Accuracy: {accuracy:.2f}%")
print(f"Changes: {58.0}% -> {accuracy:.2f}%")
print("="*30)

📊 Running Evaluation on Full-Epoch Model...
------------------------------------------------------------


100%|██████████| 50/50 [08:14<00:00,  9.89s/it]


📈 NEW RESULTS (Full Epoch)
✅ Correct: 35/50
🎯 Accuracy: 70.00%
Changes: 58.0% -> 70.00%
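
With only 50 questions, each accuracy figure carries wide uncertainty. As a rough guide, the sketch below computes a 95% Wilson score interval for the two results reported so far (29/50 and 35/50). The Wilson interval is a standard choice for binomial proportions. It is added here for illustration and is not part of the original evaluation.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

for label, correct in [("60 steps", 29), ("full epoch", 35)]:
    low, high = wilson_interval(correct, 50)
    print(f"{label}: {correct}/50, 95% CI {low:.1%} to {high:.1%}")
```

The two intervals overlap substantially, so a larger test sample would be needed before reading the jump from 58% to 70% as a firm improvement.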





In [None]:
# %% [1] SELF-CONSISTENCY EVALUATION
from collections import Counter
import re
import random
from tqdm import tqdm

# Ensure model is in inference mode
FastLanguageModel.for_inference(model)

# Same seed as the full-epoch evaluation, so the same 50 questions are used for a fair comparison
test_dataset = dataset['test']
random.seed(42)
test_indices = random.sample(range(len(test_dataset)), 50)

def get_majority_vote(answers):
    """Find the most common answer in a list."""
    if not answers:
        return None
    c = Counter(answers)
    # most_common(1) returns a one-item list such as [('25', 2)]; ties go to the answer seen first
    return c.most_common(1)[0][0]

correct = 0
total = 0

print(f"📊 Running 'Majority Vote' (3 attempts) on {len(test_indices)} questions...")
print("-" * 60)

for i in tqdm(test_indices):
    example = test_dataset[i]
    question = example['question']
    true_ans = extract_answer(example['answer'])

    prompt = f"""<|im_start|>system
You are a helpful math tutor. Solve problems step-by-step.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
"""
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # GENERATE 3 ATTEMPTS
    temp_answers = []
    for _ in range(3):
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=True, # MUST be True for variety
                temperature=0.7 # Add randomness so attempts differ
            )
            response = tokenizer.decode(outputs[0], skip_special_tokens=True)
            model_ans_text = response.split("assistant")[-1]
            extracted = extract_answer(model_ans_text)
            if extracted:
                temp_answers.append(extracted)

    # VOTE
    final_pred = get_majority_vote(temp_answers)

    # Compare
    if final_pred == true_ans:
        correct += 1
        print(f"✅ Q{i}: {true_ans} | Vote: {final_pred} (votes: {temp_answers})")
    else:
        print(f"❌ Q{i}: {true_ans} | Vote: {final_pred} (votes: {temp_answers})")

    total += 1

accuracy = (correct / total) * 100
print("\n" + "="*30)
print(f"🏆 FINAL RESULTS (Self-Consistency)")
print(f"✅ Correct: {correct}/{total}")
print(f"🎯 Accuracy: {accuracy:.2f}%")
print("="*30)

📊 Running 'Majority Vote' (3 attempts) on 50 questions...
------------------------------------------------------------


  2%|▏         | 1/50 [00:33<27:04, 33.15s/it]

❌ Q1309: 2280 | Vote: 2180 (votes: ['2180', '1880', '2180'])


  4%|▍         | 2/50 [00:53<20:35, 25.75s/it]

✅ Q228: 1 | Vote: 1 (votes: ['1', '1', '1'])


  6%|▌         | 3/50 [01:10<16:57, 21.66s/it]

✅ Q51: 5 | Vote: 5 (votes: ['6.67', '5', '5'])


  8%|▊         | 4/50 [01:32<16:51, 21.98s/it]

✅ Q563: 12 | Vote: 12 (votes: ['12', '6', '12'])


 10%|█         | 5/50 [01:52<15:49, 21.11s/it]

✅ Q501: 273 | Vote: 273 (votes: ['273', '273', '273'])


 12%|█▏        | 6/50 [02:12<15:13, 20.77s/it]

✅ Q457: 45 | Vote: 45 (votes: ['45', '45', '45'])


 14%|█▍        | 7/50 [02:28<13:43, 19.14s/it]

✅ Q285: 21 | Vote: 21 (votes: ['21', '21', '21'])


 16%|█▌        | 8/50 [03:01<16:34, 23.68s/it]

❌ Q209: 145 | Vote: 300 (votes: ['300', '7.5', '290'])


 18%|█▊        | 9/50 [03:15<14:06, 20.63s/it]

✅ Q1116: 60 | Vote: 60 (votes: ['60', '60', '60'])


 20%|██        | 10/50 [03:36<13:52, 20.81s/it]

✅ Q178: 122 | Vote: 122 (votes: ['122', '122', '122'])


 22%|██▏       | 11/50 [04:37<21:27, 33.02s/it]

✅ Q1209: 29 | Vote: 29 (votes: ['29', '29'])


 24%|██▍       | 12/50 [05:00<19:01, 30.05s/it]

❌ Q864: 80 | Vote: 480 (votes: ['480', '480', '580'])


 26%|██▌       | 13/50 [05:19<16:20, 26.49s/it]

✅ Q65: 36 | Vote: 36 (votes: ['20', '36', '36'])


 28%|██▊       | 14/50 [05:35<13:58, 23.30s/it]

✅ Q61: 1430 | Vote: 1430 (votes: ['1400', '1430', '1430'])


 30%|███       | 15/50 [05:49<11:57, 20.49s/it]

✅ Q191: 5 | Vote: 5 (votes: ['5', '5', '5.5'])


 32%|███▏      | 16/50 [06:08<11:24, 20.14s/it]

✅ Q447: 5 | Vote: 5 (votes: ['5', '5', '5'])


 34%|███▍      | 17/50 [06:17<09:09, 16.66s/it]

✅ Q476: 5 | Vote: 5 (votes: ['5', '20', '5'])


 36%|███▌      | 18/50 [06:33<08:54, 16.71s/it]

✅ Q1034: 66 | Vote: 66 (votes: ['66', '66', '66'])


 38%|███▊      | 19/50 [06:45<07:48, 15.12s/it]

✅ Q1232: 15 | Vote: 15 (votes: ['15', '15', '15'])


 40%|████      | 20/50 [06:58<07:15, 14.50s/it]

✅ Q54: 40 | Vote: 40 (votes: ['40', '23', '33'])


 42%|████▏     | 21/50 [07:11<06:45, 14.00s/it]

✅ Q1149: 93 | Vote: 93 (votes: ['93', '93', '93'])


 44%|████▍     | 22/50 [07:42<09:00, 19.31s/it]

❌ Q407: 2000 | Vote: 4000 (votes: ['4000', '2000', '8000'])


 46%|████▌     | 23/50 [08:05<09:09, 20.36s/it]

✅ Q859: 1520 | Vote: 1520 (votes: ['1520', '1520', '1520'])


 48%|████▊     | 24/50 [08:29<09:17, 21.45s/it]

✅ Q451: 11050 | Vote: 11050 (votes: ['11050', '8500', '11050'])


 50%|█████     | 25/50 [08:54<09:25, 22.61s/it]

✅ Q919: 90 | Vote: 90 (votes: ['90', '90', '90'])


 52%|█████▏    | 26/50 [09:18<09:08, 22.85s/it]

✅ Q1206: 40000 | Vote: 40000 (votes: ['40000', '190000', '40000'])


 54%|█████▍    | 27/50 [09:31<07:36, 19.86s/it]

✅ Q569: 21 | Vote: 21 (votes: ['21', '21', '21'])


 56%|█████▌    | 28/50 [09:52<07:28, 20.38s/it]

❌ Q13: 18 | Vote: 10 (votes: ['10', '29', '24'])


 58%|█████▊    | 29/50 [10:08<06:36, 18.87s/it]

❌ Q326: 14 | Vote: 12 (votes: ['12', '19', '7'])


 60%|██████    | 30/50 [10:37<07:20, 22.01s/it]

✅ Q865: 23 | Vote: 23 (votes: ['23', '23', '23'])


 62%|██████▏   | 31/50 [11:09<07:52, 24.87s/it]

✅ Q696: 145 | Vote: 145 (votes: ['150', '145', '145'])


 64%|██████▍   | 32/50 [11:39<07:57, 26.55s/it]

✅ Q318: 123 | Vote: 123 (votes: ['123', '87', '123'])


 66%|██████▌   | 33/50 [12:01<07:06, 25.08s/it]

✅ Q440: 98 | Vote: 98 (votes: ['98', '98', '98'])


 68%|██████▊   | 34/50 [12:25<06:36, 24.76s/it]

❌ Q689: 7 | Vote: 13.2 (votes: ['13.2', '12', '14.8'])


 70%|███████   | 35/50 [13:00<06:58, 27.87s/it]

✅ Q189: 34 | Vote: 34 (votes: ['34', '34', '52.25'])


 72%|███████▏  | 36/50 [13:30<06:41, 28.69s/it]

✅ Q778: 38 | Vote: 38 (votes: ['38', '38', '38'])


 74%|███████▍  | 37/50 [14:01<06:21, 29.31s/it]

❌ Q198: 320 | Vote: 220 (votes: ['220', '240', '320'])


 76%|███████▌  | 38/50 [14:38<06:17, 31.47s/it]

✅ Q735: 50 | Vote: 50 (votes: ['50', '56', '50'])


 78%|███████▊  | 39/50 [14:55<04:58, 27.14s/it]

✅ Q704: 50 | Vote: 50 (votes: ['50', '90', '50'])


 80%|████████  | 40/50 [15:23<04:34, 27.41s/it]

✅ Q1236: 84 | Vote: 84 (votes: ['84', '84', '108'])


 82%|████████▏ | 41/50 [15:49<04:02, 26.92s/it]

✅ Q541: 50 | Vote: 50 (votes: ['50', '150', '75'])


 84%|████████▍ | 42/50 [16:08<03:17, 24.69s/it]

✅ Q88: 8000 | Vote: 8000 (votes: ['8000', '9091', '8000'])


 86%|████████▌ | 43/50 [16:21<02:28, 21.25s/it]

✅ Q940: 280 | Vote: 280 (votes: ['200', '280', '280'])


 88%|████████▊ | 44/50 [16:34<01:51, 18.54s/it]

✅ Q1098: 30 | Vote: 30 (votes: ['30', '30', '30'])


 90%|█████████ | 45/50 [16:59<01:43, 20.63s/it]

❌ Q255: 192 | Vote: 96 (votes: ['16', '96', '96'])


 92%|█████████▏| 46/50 [17:31<01:36, 24.14s/it]

✅ Q775: 276 | Vote: 276 (votes: ['276', '276', '276'])


 94%|█████████▍| 47/50 [18:23<01:36, 32.33s/it]

✅ Q161: 32 | Vote: 32 (votes: ['32', '32', '13'])


 96%|█████████▌| 48/50 [18:57<01:05, 32.89s/it]

✅ Q1130: 25 | Vote: 25 (votes: ['25', '13.5', '20'])


 98%|█████████▊| 49/50 [19:17<00:28, 28.98s/it]

✅ Q600: 10 | Vote: 10 (votes: ['10', '10', '10'])


100%|██████████| 50/50 [19:37<00:00, 23.54s/it]

✅ Q1287: 84 | Vote: 84 (votes: ['84', '84', '92'])

🏆 FINAL RESULTS (Self-Consistency)
✅ Correct: 41/50
🎯 Accuracy: 82.00%
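
The voting code is not shown in this part of the notebook, so the following is only a minimal sketch of how the `Vote:` column in the log could be produced. It assumes answers are compared as exact strings and that ties go to the first sample; both assumptions are consistent with the log (for example Q198) but are not confirmed by it. The helper names `majority_vote` and `accuracy` are hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common answer; ties go to the answer seen first."""
    # Counter.most_common orders equal counts by first appearance,
    # so a three-way split falls back to the first sample.
    return Counter(votes).most_common(1)[0][0]


def accuracy(num_correct, num_total):
    """Accuracy as a percentage."""
    return 100.0 * num_correct / num_total


# Vote lists copied from the log above.
print(majority_vote(["20", "36", "36"]))     # Q65: two of three samples agree
print(majority_vote(["220", "240", "320"]))  # Q198: three-way split
print(f"{accuracy(41, 50):.2f}%")            # final score reported above
```

Under this tie-break rule a three-way split is decided by the first sample alone, which happens to be right for Q54 (`'40'`) and wrong for Q198 (`'220'` against an expected `320`). Drawing more samples per question would make such ties less frequent.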





In [None]:
# %% [1] MERGE AND UPLOAD (FINAL)
from huggingface_hub import login

# 1. Log in. login() prompts for a token unless one is already cached from an earlier session.
print("Paste your Hugging Face WRITE token:")
login()

# 2. Upload Configuration
# Replace "justhariharan" with your own Hugging Face username.
username = "justhariharan"
repo_name = "Qwen2.5-Math-1.5B-Solver"

print(f"⏳ Merging and pushing the 82% accuracy model to {username}/{repo_name}...")

# 3. Push to Hub
# Merge the fine-tuned adapter (trained for 1 epoch) into the base model and upload the result.
model.push_to_hub_merged(
    f"{username}/{repo_name}",
    tokenizer,
    save_method = "merged_16bit", # Standard for deployment
    token = True
)

print(f"\n✅ SUCCESS! The champion model is live at: https://huggingface.co/{username}/{repo_name}")

Paste your Hugging Face WRITE token:


⏳ Merging and pushing the 82% accuracy model to justhariharan/Qwen2.5-Math-1.5B-Solver...


config.json:   0%|          | 0.00/761 [00:00<?, ?B/s]

Unsloth: Saving to justhariharan/Qwen2.5-Math-1.5B-Solver will fail, but using a temp folder works! Switching to a temp folder then uploading!


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...mp9mdx9c4j/tokenizer.json:   0%|          | 27.8kB / 11.4MB            

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:42<00:00, 42.64s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit:   0%|          | 0/1 [00:00<?, ?it/s]

  ...mdx9c4j/model.safetensors:   1%|          | 25.2MB / 3.09GB            

Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [01:56<00:00, 116.34s/it]


Unsloth: Merge process complete. Saved to `/tmp/tmp9mdx9c4j`

✅ SUCCESS! The champion model is live at: https://huggingface.co/justhariharan/Qwen2.5-Math-1.5B-Solver
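
The success message is printed unconditionally, so it may be worth confirming that the files actually reached the Hub. Below is a minimal sketch of such a check using `list_repo_files` from `huggingface_hub`. The expected file names are taken from the upload log above and are an assumption about the final repo contents (a larger model could be split into several shards); `missing_files` and `check_upload` are hypothetical helper names.

```python
def missing_files(found, expected=("config.json", "tokenizer.json", "model.safetensors")):
    """Return the expected file names that are absent from `found`, in order."""
    present = set(found)
    return [name for name in expected if name not in present]


def check_upload(repo_id):
    """List the files in a Hub repo and report which expected ones are missing."""
    # Imported lazily so the pure helper above works without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    return missing_files(list_repo_files(repo_id))


# Example (needs network access; a private repo also needs a cached login token):
# print(check_upload("justhariharan/Qwen2.5-Math-1.5B-Solver"))
```

An empty list means every expected file is present; any names returned are the ones to investigate before sharing the model link.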
