## Install Libraries

In [1]:
%pip install -q -U transformers datasets accelerate peft trl bitsandbytes

Note: you may need to restart the kernel to use updated packages.


## Load Dataset

### 1. Dataset Overview

* **Dataset Name:** `XenArcAI/MathX-5M`
* **Total Size:** ~2.43 GB
* **Number of Rows:** ~4.32 Million
* **Format:** Parquet
* **Task:** Text Generation, Question Answering
* **Description:** `MathX-5M` is a large-scale corpus of mathematical reasoning problems paired with detailed, step-by-step solutions, which makes it well suited to instruction-based fine-tuning. The name refers to 5 million examples, although the row count listed above is about 4.32 million. The dataset covers a wide range of mathematical domains, from basic arithmetic to advanced calculus.

### 2. Data Features

The dataset consists of three primary columns (features):

1.  **`problem`** (string):
    * **Content:** This column contains the mathematical problem statement. The problems are expressed in natural language and often include LaTeX formatting for mathematical notation.
    * **Example:** "Determine how many 1000 digit numbers \\( a \\) have the property that when any digit..."
    * **Observations:** The problems vary significantly in length and complexity, ranging from simple calculations to complex, multi-step proofs.

2.  **`expected_answer`** (string):
    * **Content:** This column holds the final, correct answer to the problem. The answers are typically concise and may also use LaTeX.
    * **Example:** `\[[32]]`
    * **Observations:** This feature provides the ground truth for evaluating the model's final output. The format is generally clean and direct.

3.  **`generated_solution`** (string):
    * **Content:** This is arguably the most valuable feature for fine-tuning. It contains a detailed, step-by-step thought process and derivation of the solution. It often starts with a `<think>` tag, outlining the reasoning path.
    * **Example:** `<think> Okay, so I need to figure out how many 100-digit numbers ... The problem is to compute the ... </think> To find the probability...`
    * **Observations:** This "chain-of-thought" or reasoning path is crucial for teaching the model *how* to solve problems, not just what the final answer is. The quality and detail of these solutions are key to the dataset's effectiveness.

### 3. Initial Findings & Implications for Fine-Tuning

* **Instructional Format:** The dataset's structure is perfectly suited for instruction fine-tuning. The combination of a problem, a reasoning process, and a final answer provides a clear and rich learning signal for the model.
* **LaTeX Formatting:** The prevalence of LaTeX means the tokenizer and model must be proficient at handling mathematical notation.
* **Chain-of-Thought:** The `generated_solution` column enables the model to learn complex reasoning. During fine-tuning, the prompt should be structured to encourage the model to generate a similar step-by-step thought process before arriving at the final answer.
* **Data Scale:** With over 4 million rows, the dataset is far larger than a LoRA fine-tuning run needs. This notebook uses only the first 10,000 examples.
* **Complexity Distribution:** The dataset claims a distribution of basic (30%), intermediate (30%), and advanced (40%) problems. This diversity is excellent for training a well-rounded model that can handle a variety of mathematical challenges.
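
As a rough illustration of the chain-of-thought structure described above, the sketch below separates the `<think>` block from the final write-up in a `generated_solution` string. The helper name `split_reasoning` and the sample string are hypothetical, and the code assumes at most one `<think>...</think>` pair per solution, which the dataset may not guarantee.

```python
import re

# Assumed format: reasoning wrapped in a single <think>...</think> pair.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(solution: str) -> tuple[str, str]:
    """Return (reasoning, final_text) extracted from a generated_solution string."""
    match = THINK_PATTERN.search(solution)
    if match is None:
        # No reasoning block found: treat the whole string as the final text.
        return "", solution.strip()
    reasoning = match.group(1).strip()
    final_text = solution[match.end():].strip()
    return reasoning, final_text

# Hypothetical sample, not taken from the dataset.
sample = "<think>\nOkay, 2 + 2 is 4.\n</think>\nThe answer is \\boxed{4}."
reasoning, final_text = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Final:", final_text)
```

A split like this could be used to inspect how long the reasoning traces are, or to evaluate only the final write-up against `expected_answer`.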

In [None]:
from datasets import load_dataset, Dataset
from itertools import islice

# Step 1: Load the dataset in streaming mode
streamed_dataset = load_dataset("XenArcAI/MathX-5M", split="train", streaming=True)

# Step 2: Normalize columns on the fly
def unify_columns(ex):
    if "question" in ex:
        ex["problem"] = ex.pop("question")
    return ex

streamed_dataset = streamed_dataset.map(unify_columns)

# Step 3: Take the first 10,000 examples (about 0.23% of ~4.32M rows) and materialize in memory
subset = list(islice(streamed_dataset, 10000))

dataset = Dataset.from_list(subset).select_columns(
    ["problem", "generated_solution", "expected_answer"]
)



In [None]:
print(dataset)

Dataset({
    features: ['problem', 'generated_solution', 'expected_answer'],
    num_rows: 10000
})
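
The loading cell above relies on lazy iteration: `islice` pulls only the first N records from the stream, so the remaining rows are never fetched. The sketch below mimics that behaviour with a plain generator. `fake_stream` is a hypothetical stand-in for the streamed dataset and is not part of the `datasets` library.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    i = 0
    while True:  # conceptually unbounded, like a very large remote dataset
        key = "question" if i % 2 else "problem"  # simulate mixed column names
        yield {key: f"Problem {i}", "generated_solution": "...", "expected_answer": str(i)}
        i += 1

def unify_columns(example):
    """Rename 'question' to 'problem' so every record has the same keys."""
    if "question" in example:
        example["problem"] = example.pop("question")
    return example

# Only five records are ever produced; the generator is never exhausted.
subset = list(islice(map(unify_columns, fake_stream()), 5))
print(len(subset))
print(sorted(subset[1].keys()))
```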


In [None]:
from IPython.display import display, Markdown

def show_example_jupyter(dataset, index=0, max_solution_len=300):
    example = dataset[index]

    # Pull out the three fields to display
    problem = example['problem']
    solution = example['generated_solution']
    answer = example['expected_answer']

    # Truncate solution for readability
    if len(solution) > max_solution_len:
        solution = solution[:max_solution_len]
        last_period = solution.rfind('.')
        if last_period != -1:
            solution = solution[:last_period+1] + " ..."

    # Convert \( ... \) inline-math delimiters to $ ... $ so Jupyter Markdown renders them
    problem = problem.replace("\\(", "$").replace("\\)", "$")
    answer = answer.replace("\\(", "$").replace("\\)", "$")

    display(Markdown(f"### Problem #{index+1}\n{problem}"))
    display(Markdown(f"**Generated Solution (truncated):**\n{solution}"))
    display(Markdown(f"**Expected Answer:**\n{answer}"))
    display(Markdown("---"))

show_example_jupyter(dataset, 0)

### Problem #1
Given a group of $ N $ balls consisting of $ C $ colors, where the number of balls in each color is represented as $ n_1, n_2, \ldots, n_C $ (with $ n_1 + n_2 + \ldots + n_C = N $), what is the probability that when $ A $ balls are randomly picked (where $ A \leq N $), the picked balls consist of $ a_1, a_2, \ldots, a_C $ balls of each color, where $ a_1 + a_2 + \ldots + a_C = A $?

**Generated Solution (truncated):**
<think>
Okay, so I need to find the probability that when I pick A balls out of N, where there are C different colors, the number of each color I pick is exactly a1, a2, ..., aC. Hmm, let's think about how to approach this.

First, probability problems often involve combinations. ...

**Expected Answer:**
$\frac{C_{n_1}^{a_1} \cdot C_{n_2}^{a_2} \cdot \ldots \cdot C_{n_C}^{a_C}}{C_N^A}$

---

## Load the tokenizer

In [5]:
import os
from huggingface_hub import login

# Read the access token from the environment instead of hardcoding it in the notebook
login(token=os.environ["HF_TOKEN"])

In [6]:
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Preprocess data

In [None]:
# Proper Llama 3 Chat Template
LLAMA_3_CHAT_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant that solves math problems step by step.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n{assistant_response}<|eot_id|>"
)

def format_prompt(sample):
    # Create the user message with problem and expected solution format
    user_message = f"Problem:\n{sample['problem']}\n\nPlease provide a step-by-step solution."
    
    # The assistant response contains the generated solution
    assistant_response = sample['generated_solution']
    
    # Apply the proper Llama 3 chat template
    formatted_text = LLAMA_3_CHAT_TEMPLATE.format(
        user_message=user_message,
        assistant_response=assistant_response
    )
    
    return {"text": formatted_text}

# Apply to dataset
print("🔄 Formatting dataset with proper Llama 3 chat template...")
formatted_dataset = dataset.map(format_prompt)
print("✅ Dataset formatted successfully!")

# Show an example of the formatted text
print("\n📝 Example of formatted text:")
print("=" * 80)
print(formatted_dataset[0]['text'][:500] + "..." if len(formatted_dataset[0]['text']) > 500 else formatted_dataset[0]['text'])
print("=" * 80)

Map: 100%|██████████| 10000/10000 [00:01<00:00, 5519.77 examples/s]
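
An alternative to hand-writing the template, sketched here under the assumption that the installed TRL version accepts conversational datasets with a `messages` column, is to store each sample as a list of role/content messages and let the trainer apply the tokenizer's own chat template. That may also avoid a doubled `<|begin_of_text|>` token in cases where the tokenizer prepends one itself. The function name `to_messages` and the example record are hypothetical.

```python
SYSTEM_PROMPT = "You are a helpful assistant that solves math problems step by step."

def to_messages(sample):
    """Convert one dataset row into a role/content conversation."""
    user_message = (
        f"Problem:\n{sample['problem']}\n\nPlease provide a step-by-step solution."
    )
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": sample["generated_solution"]},
        ]
    }

# Hypothetical row with the same three columns as the dataset.
example = {
    "problem": "What is 2 + 2?",
    "generated_solution": "<think>2 + 2 = 4</think> The answer is 4.",
    "expected_answer": "4",
}
converted = to_messages(example)
for message in converted["messages"]:
    print(f"{message['role']}: {message['content'][:40]!r}")
```

With a `datasets.Dataset`, this could be applied as `dataset.map(to_messages, remove_columns=dataset.column_names)` so that only the `messages` column remains.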


## Training

In [8]:
from transformers import TrainingArguments
from peft import LoraConfig

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Training arguments - REDUCED FOR LOW MEMORY
training_args = TrainingArguments(
    output_dir="./llama3-8b-math-tuned",
    per_device_train_batch_size=1,      # Reduced from 4 to 1
    gradient_accumulation_steps=16,     # Increased from 4 to 16 to maintain effective batch size
    learning_rate=2e-4,
    logging_steps=10,
    max_steps=100,      
    save_steps=50,
    fp16=True,
    dataloader_pin_memory=False,        # Reduce memory usage
    gradient_checkpointing=True,        # Trade compute for memory
    remove_unused_columns=False,        # Keep all columns
)
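
To make the batch-size comments above concrete, the short calculation below works out the effective batch size and how much of the 10,000-example subset the run actually touches. It assumes a single GPU (`num_devices = 1`), which is an assumption and not something the training arguments state.

```python
# Values copied from the TrainingArguments above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
max_steps = 100

num_devices = 1        # assumption: single-GPU training
subset_size = 10_000   # size of the materialized subset

# One optimizer step consumes this many examples.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Total examples processed before training stops at max_steps.
examples_seen = effective_batch_size * max_steps
fraction_of_subset = examples_seen / subset_size

print(f"Effective batch size: {effective_batch_size}")   # 16
print(f"Examples seen:        {examples_seen}")          # 1600
print(f"Share of subset:      {fraction_of_subset:.0%}")  # 16%
```

Under these settings the run stops well before a full pass over the subset, so `max_steps` would need to be raised for the model to see all 10,000 examples.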

In [9]:
# GPU Memory Optimization
import torch
import gc

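def clear_gpu_memory():
    """Release unreferenced Python objects, then free cached GPU memory"""
    gc.collect()  # drop unreachable tensors first so their memory can be released
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()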

def check_gpu_memory():
    """Check current GPU memory usage"""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3  # GB
        reserved = torch.cuda.memory_reserved() / 1024**3    # GB
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3  # GB
        free = total - allocated
        
        print(f"GPU Memory Status:")
        print(f"  Total: {total:.2f} GB")
        print(f"  Allocated: {allocated:.2f} GB")
        print(f"  Reserved: {reserved:.2f} GB") 
        print(f"  Free: {free:.2f} GB")
        
        return free > 1.0  # Return True if we have >1GB free
    return False

# Clear memory before training
clear_gpu_memory()
has_memory = check_gpu_memory()

if not has_memory:
    print("⚠️ Warning: Low GPU memory detected. Consider using CPU training or smaller model.")

GPU Memory Status:
  Total: 5.68 GB
  Allocated: 0.00 GB
  Reserved: 0.00 GB
  Free: 5.68 GB
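
For context on the 5.68 GB reported above, the sketch below gives a lower-bound estimate of how much memory the model weights alone would need at different precisions. The parameter count of roughly 8.03 billion is an assumption about Llama-3.1-8B, and the estimate ignores activations, gradients, optimizer state, quantization constants, and any layers kept in higher precision.

```python
PARAM_COUNT = 8.03e9  # assumption: approximate parameter count of Llama-3.1-8B

# Storage cost per parameter for each precision.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16/fp16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}

def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>10}: {weight_memory_gib(PARAM_COUNT, nbytes):5.1f} GiB")
```

Even the 4-bit figure leaves little headroom on a GPU of this size once training overhead is added, so some offloading or a smaller model may be needed.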


In [13]:
# Check system requirements and connectivity before loading model
import shutil
import subprocess
import requests
import os

# ===== CONFIGURE DOWNLOAD LOCATION =====
# This should match the MODEL_DOWNLOAD_DIR in the next cell
MODEL_DOWNLOAD_DIR = "./models"

# Check available disk space in the specified directory
def check_disk_space(path="."):
    total, used, free = shutil.disk_usage(path)
    free_gb = free / (1024**3)
    total_gb = total / (1024**3)
    print(f"Available disk space in '{path}': {free_gb:.2f} GB / {total_gb:.2f} GB")
    return free_gb

# Check internet connectivity to Hugging Face
def check_connectivity():
    try:
        response = requests.get("https://huggingface.co", timeout=10)
        print(f"Hugging Face connectivity: OK (Status: {response.status_code})")
        return True
    except Exception as e:
        print(f"Hugging Face connectivity: FAILED - {str(e)}")
        return False

# Check if model is already cached
def check_model_cache(cache_dir):
    # Create cache directory if it doesn't exist
    os.makedirs(cache_dir, exist_ok=True)
    
    # Look for the specific model cache
    model_cache_path = os.path.join(cache_dir, "models--meta-llama--Llama-3.1-8B-Instruct")
    if os.path.exists(model_cache_path):
        # Check cache size
        cache_size = sum(os.path.getsize(os.path.join(dirpath, filename))
                        for dirpath, dirnames, filenames in os.walk(model_cache_path)
                        for filename in filenames)
        cache_size_gb = cache_size / (1024**3)
        print(f"Model cache found at: {model_cache_path}")
        print(f"Cache size: {cache_size_gb:.2f} GB")
        return True
    else:
        print(f"No model cache found in: {cache_dir}")
        print("Will download from scratch")
        return False

print("=== System Check ===")
print(f"Configured download directory: {os.path.abspath(MODEL_DOWNLOAD_DIR)}")

# Check disk space in the download directory
free_space = check_disk_space(MODEL_DOWNLOAD_DIR if os.path.exists(MODEL_DOWNLOAD_DIR) else ".")
connectivity = check_connectivity()
cache_exists = check_model_cache(MODEL_DOWNLOAD_DIR)

print(f"\nSystem Status:")
print(f"- Download location: {os.path.abspath(MODEL_DOWNLOAD_DIR)}")
print(f"- Disk space: {'✓' if free_space > 20 else '⚠️'} ({free_space:.1f} GB available)")
print(f"- Connectivity: {'✓' if connectivity else '❌'}")
print(f"- Model cached: {'✓' if cache_exists else '❌'}")

if free_space < 20:
    print(f"\n⚠️ Warning: Less than 20GB free space. Llama-3.1-8B requires ~15GB+")
    print(f"Consider changing MODEL_DOWNLOAD_DIR to a location with more space")
if not connectivity:
    print(f"\n❌ No internet connectivity - cannot download model")

=== System Check ===
Configured download directory: /home/nailsonseat/StudioProjects/CourseGPT-Pro-DSAI-Lab-Group-6/Milestone-2/math-agent-scripts/models
Available disk space in './models': 30.58 GB / 338.11 GB
Hugging Face connectivity: OK (Status: 200)
Model cache found at: ./models/models--meta-llama--Llama-3.1-8B-Instruct
Cache size: 29.92 GB

System Status:
- Download location: /home/nailsonseat/StudioProjects/CourseGPT-Pro-DSAI-Lab-Group-6/Milestone-2/math-agent-scripts/models
- Disk space: ✓ (30.6 GB available)
- Connectivity: ✓
- Model cached: ✓


In [14]:
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
import os
from huggingface_hub import snapshot_download
import time

# ===== CONFIGURE DOWNLOAD LOCATION =====
MODEL_DOWNLOAD_DIR = "./models"
os.makedirs(MODEL_DOWNLOAD_DIR, exist_ok=True)

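# Set environment variables (these may only take effect if set before the Hugging Face
# libraries are imported; the explicit cache_dir below controls the download location)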
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "300"
os.environ["TRANSFORMERS_CACHE"] = MODEL_DOWNLOAD_DIR
os.environ["HF_HOME"] = MODEL_DOWNLOAD_DIR

print(f"Model will be downloaded to: {os.path.abspath(MODEL_DOWNLOAD_DIR)}")

# ===== MEMORY OPTIMIZATION =====
# Clear GPU memory before loading model
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Configure 4-bit quantization to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # Use 4-bit precision
    bnb_4bit_quant_type="nf4",          # Normalized float 4-bit
    bnb_4bit_use_double_quant=True,     # Also quantize the quantization constants to save extra memory
    bnb_4bit_compute_dtype=torch.bfloat16  # Compute in bfloat16
)

print("Loading model with 4-bit quantization to reduce memory usage...")

try:
    # Load model with quantization - 4-bit weights for an 8B model need on the order of 4-6 GB instead of ~15 GB in bf16
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,    # Enable 4-bit quantization
        device_map="auto",                 # Automatically distribute layers
        cache_dir=MODEL_DOWNLOAD_DIR,
        force_download=False,
        trust_remote_code=True
    )
    
    print("✓ Model loaded with 4-bit quantization!")
    
    # Check GPU memory after loading
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        print(f"GPU memory used after model loading: {allocated:.2f} GB")
    
except Exception as e:
    print(f"Quantized loading failed: {e}")
    print("Trying without quantization...")
    
    # Fallback to regular loading
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        dtype=torch.bfloat16,
        device_map="auto",
        cache_dir=MODEL_DOWNLOAD_DIR,
        force_download=False,
    )

# Create SFTTrainer
print("Creating SFTTrainer...")
trainer = SFTTrainer(
    model=model,
    train_dataset=formatted_dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
    args=training_args,
)
print("✓ SFTTrainer created successfully!")

Model will be downloaded to: /home/nailsonseat/StudioProjects/CourseGPT-Pro-DSAI-Lab-Group-6/Milestone-2/math-agent-scripts/models
Loading model with 4-bit quantization to reduce memory usage...
Quantized loading failed: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
Trying without quantization...

Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.39it/s]

Some parameters are on the meta device because they were offloaded to the cpu and disk.

Creating SFTTrainer...

Adding EOS to train dataset: 100%|██████████| 10000/10000 [00:01<00:00, 5518.88 examples/s]
Tokenizing train dataset: 100%|██████████| 10000/10000 [01:45<00:00, 95.06 examples/s]
Truncating train dataset: 100%|██████████| 10000/10000 [00:00<00:00, 65161.83 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.

✓ SFTTrainer created successfully!


In [None]:
trainer.train()

## Export adapters

In [None]:
adapter_path = "./llama3-8b-math-tuned-adapters"
trainer.save_model(adapter_path)

print(f"LoRA adapters saved to {adapter_path}")

## Merge with base model

In [None]:
from peft import PeftModel

# --- Reload the base model without quantization ---
# This is important for merging and for Ollama compatibility
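base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,           # same keyword as the fallback load in the training cell
    device_map="auto",
    cache_dir=MODEL_DOWNLOAD_DIR,   # reuse the weights already downloaded to ./models
)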

# --- Load the PeftModel with the adapters ---
model = PeftModel.from_pretrained(base_model, adapter_path)

# --- Merge the weights and save the new model ---
model = model.merge_and_unload()

merged_model_path = "./llama3-8b-math-merged"
model.save_pretrained(merged_model_path)
tokenizer.save_pretrained(merged_model_path)

print(f"Merged model saved to {merged_model_path}")
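
Conceptually, merging a LoRA adapter folds the low-rank update into each adapted weight matrix: W' = W + (alpha / r) * B A. The toy sketch below shows that arithmetic in plain Python on tiny matrices. It illustrates the standard LoRA formulation rather than PEFT's actual implementation, and every name in it is hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy 2x2 weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape r x in  = 1 x 2
lora_b = [[0.5], [0.0]]  # shape out x r = 2 x 1

merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

The toy scale factor of 2 matches the notebook's configuration, where `lora_alpha=32` and `r=16` also give alpha / r = 2. After merging, the adapter matrices are no longer needed at inference time.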

In [None]:
!zip -r llama3-8b-math-merged.zip ./llama3-8b-math-merged