## Step 1: Install the Unsloth Python package.

In [5]:
!pip install -q unsloth unsloth_zoo

## Step 2: Load the model and tokenizer using the FastLanguageModel function.

In [10]:
!pip install -U bitsandbytes
!pip install -U transformers


Collecting transformers
  Downloading transformers-4.57.1-py3-none-any.whl.metadata (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.0/44.0 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
Downloading transformers-4.57.1-py3-none-any.whl (12.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.0/12.0 MB[0m [31m145.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.56.2
    Uninstalling transformers-4.56.2:
      Successfully uninstalled transformers-4.56.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unsloth-zoo 2025.10.4 requires transformers!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4.54.0,!=4.55.0,!=4.55.1,<=4.56.2,>=4.51.3, but you have transformers 4.57.1 which is incompatible.
unsloth 20

In [1]:
from transformers import TextStreamer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8b-bnb-4bit"
)


Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.4: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

## Step 3: Enable the Unsloth fast inference.

In [2]:
FastLanguageModel.for_inference(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm):

## Step 4: Check inference on a sample example

In [3]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


inputs = tokenizer(
    [
        prompt_style.format(
            "You are a professional machine learning engineer",
            "How would you deal with NaN validation loss?",
            "",
        )
    ],
    return_tensors="pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a professional machine learning engineer

### Input:
How would you deal with NaN validation loss?

### Response:
I would first check if the loss is NaN because of a numerical error. If that's the case, I would try to find the root cause of the error and fix it. If the loss is NaN because of a numerical error, I would try to find the root cause of the error and fix it. If the loss is NaN because of a numerical error, I would try to find the root cause of the error and fix it. If the loss is NaN because of a numerical error, I would try to find the root cause of the error and fix it. If the loss is NaN because of a numerical error, I would try


## Step 5: Check inference on a mathematical example


In [9]:
# 2. Define your prompt template (three {} placeholders)
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# 3. Fill all three placeholders (third is empty so the model starts generation after “Response:”)
prompt = prompt_style.format(
    "If the system of equations \\begin{align*} 3x+y&=a,\\\\ 2x+5y&=2a, \\end{align*} has a solution $(x,y)$ when $x=2$, compute $a$.",
    "",  # no additional “input” context
    ""   # leave response blank for the model to fill in
)

# 4. Tokenize and move to GPU
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

# 5. Set up text streamer
text_streamer = TextStreamer(tokenizer)

# 6. Generate (streamed)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=1024,
)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
If the system of equations \begin{align*} 3x+y&=a,\\ 2x+5y&=2a, \end{align*} has a solution $(x,y)$ when $x=2$, compute $a$.

### Input:


### Response:
$a=4$

Explanation: The system of equations \begin{align*} 3x+y&=a,\\ 2x+5y&=2a, \end{align*} has a solution $(x,y)$ when $x=2$. Substitute $2$ for $x$ in the first equation to obtain $y=2a-3$. Substitute $2$ for $x$ in the second equation to obtain $5y=4a$. Since $y=2a-3$, it follows that $5(2a-3)=4a$. Simplify to obtain $10a-15=4a$. Add $15$ to both sides to obtain $10a=19$. Divide both sides by $10$ to obtain $a=\frac{19}{10}$.<|end_of_text|>


### Step 6: Login to HuggingFace

In [12]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

### Step 7: Login to wandb

In [1]:
!pip install -q wandb

import wandb
from google.colab import userdata

wandb.login(key=userdata.get("WANDB_API_KEY"))

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mgamermask64[0m ([33mgamermask64-school-of-data-science-and-business-intelligence[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [2]:
run = wandb.init(
    project="Fine=Tune Llama-3.1-8b-bnb-4bit on Math Dataset",
    job_type="training",
    anonymous="allow"

)

### Step 8: Load the finetuning base model and tokenizer

In [10]:
# 1) Disable Unsloth fused loss globally before loading model/trainer
from unsloth import UnslothTrainer
UnslothTrainer.disable_fused_loss = True

# 2) Optionally monkeypatch Trainer.compute_loss as a belt-and-suspenders measure
import torch
from transformers import Trainer
def patched_compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
    outputs = model(**inputs)
    loss = outputs.loss

    # Clone the loss tensor to prevent the "view + in-place" autograd issue
    if isinstance(loss, torch.Tensor):
        loss = loss.clone()

    return (loss, outputs) if return_outputs else loss

# Apply patch globally so all Trainers (incl. SFTTrainer) inherit it
Trainer.compute_loss = patched_compute_loss


In [2]:
import torch
from unsloth import FastLanguageModel

max_seq_length = 1024
dtype = None

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype
)

==((====))==  Unsloth 2025.10.4: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Step 9: Load and process MATH-lighteval dataset

In [3]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a math genius who can solve any level of algebraic problems. Please answer the following math question.

### Input:
{}

### Response:
{}"""

In [4]:
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
  inputs = examples['problem']
  outputs = examples['solution']
  texts = []
  for input,output in zip(inputs,outputs):
    text = prompt_style.format(input,output) + EOS_TOKEN
    texts.append(text)
  return {"text":texts}

In [5]:
from datasets import load_dataset

dataset = load_dataset(
    "DigitalLearningGmbH/MATH-lighteval",    # correct ID
    split="train[0:500]",                    # first 500 examples
    trust_remote_code=True                   # allows the custom builder script
)

# apply your formatting function
dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)

dataset['text'][0]

`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'DigitalLearningGmbH/MATH-lighteval' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
ERROR:datasets.load:`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'DigitalLearningGmbH/MATH-lighteval' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.


'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nYou are a math genius who can solve any level of algebraic problems. Please answer the following math question.\n\n### Input:\nLet \\[f(x) = \\left\\{\n\\begin{array}{cl} ax+3, &\\text{ if }x>2, \\\\\nx-5 &\\text{ if } -2 \\le x \\le 2, \\\\\n2x-b &\\text{ if } x <-2.\n\\end{array}\n\\right.\\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper).\n\n### Response:\nFor the piecewise function to be continuous, the cases must "meet" at $2$ and $-2$. For example, $ax+3$ and $x-5$ must be equal when $x=2$. This implies $a(2)+3=2-5$, which we solve to get $2a=-6 \\Rightarrow a=-3$. Similarly, $x-5$ and $2x-b$ must be equal when $x=-2$. Substituting, we get $-2-5=2(-2)-b$, which implies $b=3$. So $a+b=-3+3=\\boxed{0}$.<|end_of_text|>'

### Step 10: Add the LoRA (Low-Rank Adapter) to the model


In [6]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    # 4 < r < 64
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",

    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth 2025.10.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


### Step 11: Set up the finetuning trainer

In [11]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

model.fused_loss=False

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 800,
        #increasing total steps to 800 would work better
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

### Step12: Set up Memory Trackers

In [12]:
# Add this BEFORE your training starts (ideally right after model loading)
torch.cuda.empty_cache()  # Clear cache first

start_gpu_memory = round(torch.cuda.memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(torch.cuda.get_device_properties(0).total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU memory before training: {start_gpu_memory} GB")
print(f"Max GPU memory available: {max_memory} GB")

GPU memory before training: 5.516 GB
Max GPU memory available: 14.741 GB


In [13]:
import wandb
import builtins


# Make wandb available globally
builtins.wandb = wandb

trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 13 | Total steps = 800
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,6.0674
2,6.4057
3,6.9582
4,5.3989
5,5.6178
6,5.4358
7,4.6471
8,3.5668
9,3.674
10,3.5755


### Step 14: Check memory usage

In [14]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

6019.0377 seconds used for training.
100.32 minutes used for training.
Peak reserved memory = 7.686 GB.
Peak reserved memory for training = 2.17 GB.
Peak reserved memory % of max memory = 52.14 %.
Peak reserved memory for training % of max memory = 14.721 %.


### Step 15: Testing model after fine-tuning

In [15]:
from IPython.display import display, Markdown

FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [
        prompt_style.format(
            "If the system of equations \begin{align*} 3x+y&=a,\\ 2x+5y&=2a, \end{align*} has a solution $(x,y)$ when $x=2$, compute $a$.",
            "",
        )
    ],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=250,
    use_cache=True,
    #do_sample=False,
)
response = tokenizer.batch_decode(outputs)
Markdown(response[0].split("\n\n### Response:")[1])

  "If the system of equations \begin{align*} 3x+y&=a,\\ 2x+5y&=2a, \end{align*} has a solution $(x,y)$ when $x=2$, compute $a$.",



Substituting in $x=2$, we obtain the equations

$$\begin{align*}
y+3(2)&=a\\
5y+2(2)&=2a
\end{align*}$$

Multiplying the first equation by $5$ and subtracting it from the second equation, we find

$$10a-3a=2a\Rightarrow\ \ \ 7a=2(5(2)-3(2))=2(10-6)=2(4)=\boxed{8}.$$<|end_of_text|>

### Step 16: Saving the model and tokenizer

In [None]:
new_model_online = "Shubhamw11/Llama-3.1-8B-MATH"
new_model_local = "Llama-3.1-8B-MATH"
model.save_pretrained(new_model_local) # Local saving
tokenizer.save_pretrained(new_model_local) # Local saving

### Step 17: Upload model to HuggingFace

In [16]:
model.push_to_hub(new_model_online) # Online saving
tokenizer.push_to_hub(new_model_online) # Online saving

NameError: name 'new_model_online' is not defined