# 🧠 Modern AI with Unsloth.ai – Full Fine-Tuning with SmolLM2 (135 M)

This Colab demonstrates modern LLM fine-tuning using Unsloth.ai on the SmolLM2-135M model.
We’ll fine-tune it on a small chat dataset using full-parameter training, observe its performance, and test inference.

## 1. Install Dependencies

In [1]:
# --- Install Unsloth and related dependencies ---
!pip install unsloth transformers datasets accelerate peft bitsandbytes -q

# Check GPU
!nvidia-smi

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.5/61.5 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m348.8/348.8 kB[0m [31m21.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m511.6/511.6 kB[0m [31m34.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.4/59.4 MB[0m [31m16.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.7/47.7 MB[0m [31m17.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m564.7/564.7 kB[0m [31m39.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m276.7/276.7 kB[0m [31m23.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m117.2/117.2 MB[0m [31m11.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
!pip install "pyarrow<20.0.0" -q


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.1/42.1 MB[0m [31m18.9 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 4.4.1 requires pyarrow>=21.0.0, but you have pyarrow 19.0.1 which is incompatible.[0m[31m
[0m

In [3]:
!pip install unsloth transformers datasets accelerate peft bitsandbytes -q


[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.[0m[31m
[0m

## 2. Import Libraries

In [4]:
from unsloth import FastLanguageModel
from datasets import load_dataset
import torch


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Load Pretrained SmolLM2 Model

We use the SmolLM2 135M model — small, lightweight, and perfect for demonstration.
We’ll fine-tune all parameters (full_finetuning=True).

In [5]:
model_name = "unsloth/smollm2-135m"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    dtype=None,           # auto-detect
    load_in_4bit=False    # full precision for full fine-tuning
)

print(f"✅ Model loaded: {model_name}")
print(f"Tokenizer vocab size: {len(tokenizer)}")


==((====))==  Unsloth 2025.11.1: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/269M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/158 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/742 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

✅ Model loaded: unsloth/smollm2-135m
Tokenizer vocab size: 49153


## 4. Load and Inspect the Dataset

We’ll use the Alpaca instruction-tuning dataset (only a subset for quick demo).
Each record has:

* instruction

* input

* output

In [6]:
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")  # small subset
dataset[0]


README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001-a09b74b3ef9c3b(…):   0%|          | 0.00/24.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/52002 [00:00<?, ? examples/s]

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.',
 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}

## 5. Format the Dataset

We combine instruction, input, and response into one clean text prompt.

In [7]:
def format_instruction(sample):
    if sample["input"]:
        return f"### Instruction:\n{sample['instruction']}\n\n### Input:\n{sample['input']}\n\n### Response:\n{sample['output']}"
    else:
        return f"### Instruction:\n{sample['instruction']}\n\n### Response:\n{sample['output']}"

dataset = dataset.map(lambda x: {"text": format_instruction(x)})
dataset = dataset.remove_columns(["instruction", "input", "output"])
dataset[0]


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

{'text': '### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}

## 6. Tokenize the Data

Tokenization converts text to IDs the model understands.

In [20]:
max_length = 512

tokenized_dataset = dataset.map(
    lambda x: tokenizer(
        x["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    ),
    batched=True,
    remove_columns=["text"],
)

# Add labels to the dataset (for Causal LM, labels are typically input_ids)
tokenized_dataset = tokenized_dataset.map(lambda samples: {
    "labels": samples["input_ids"]
}, batched=True)

tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
print("✅ Tokenization complete and labels added")

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

✅ Tokenization complete and labels added


## 7. Enable Full Fine-Tuning

Here we fine-tune all model parameters.

In [21]:
# 🧩 Step 7: Enable full fine-tuning (latest Unsloth API)

# In the latest Unsloth version, full fine-tuning is the default
# if you do NOT call "FastLanguageModel.get_peft_model" (LoRA adapter setup).
# So we can directly prepare the model for training.

model = FastLanguageModel.for_training(model)
# When using gradient checkpointing, use_cache must be False
model.config.use_cache = False
print("✅ Model ready for full fine-tuning (no LoRA used).")

✅ Model ready for full fine-tuning (no LoRA used).


## 8. Prepare Trainer and Fine-Tune

We use Unsloth’s built-in trainer wrapper for convenience.

In [15]:
# Fix gradient checkpointing issue with SmolLM2
model.gradient_checkpointing_disable()  # <-- ADD THIS LINE


In [25]:
# ============================================
# 🚀 STEP 8 (FINAL FIXED VERSION) — TRAIN STABLY
# ============================================

from transformers import Trainer, TrainingArguments
import torch

# --- Make absolutely sure model uses float32 ---
model = model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
model = model.float()                # ✅ force FP32
model.gradient_checkpointing_disable()

# --- Define training args safely (no AMP) ---
training_args = TrainingArguments(
    output_dir="./smollm2-finetuned",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=10,
    fp16=False,                      # 🚫 no automatic mixed precision
    bf16=False,                      # 🚫 no bfloat16 either
    half_precision_backend="none",   # ✅ make sure no scaler is created
    report_to="none",
    gradient_checkpointing=False,
)

# --- Create trainer ---
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

print("🚀 Starting stable full-precision training on T4 GPU...")
trainer.train()
print("✅ Training complete without FP16 errors!")


The model is already on multiple devices. Skipping the move to device specified in `args`.


🚀 Starting stable full-precision training on T4 GPU...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 63
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 106,203,456 of 134,515,584 (78.95% trained)


Step,Training Loss
10,79.0162
20,67.4004
30,65.4284
40,60.4633
50,50.9979
60,51.814


✅ Training complete without FP16 errors!


## 9. Save the Fine-Tuned Model

In [26]:


save_path = "./smollm2-finetuned"

model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"✅ Model and tokenizer saved to: {save_path}")


✅ Model and tokenizer saved to: ./smollm2-finetuned


## 10. Run Inference (Chat-style Prompt)

In [27]:
def generate_text(prompt, max_new_tokens=120):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

prompt1 = "Explain what Python decorators are with an example."
print("💡 Prompt:", prompt1)
print("🧠 Fine-tuned Model Response:\n", generate_text(prompt1))

💡 Prompt: Explain what Python decorators are with an example.
🧠 Fine-tuned Model Response:
 Explain what Python decorators are with an example.

Python decorators are a way to make your code more readable and easier to understand. They are a special kind of function that you can use to pass arguments to other functions.

Let's take a look at an example.

def greet(name):
    print("Hello, %s!" % name)

def sayHello(name):
    print("Hello, %s!" % name)

This example shows how to use a decorator to make your code more readable. The decorator takes a function as an argument and passes it to the function greet. The decorator


## 11. Try Another Prompt

In [28]:
prompt2 = "Write a short motivational poem about artificial intelligence helping humans."
print("💡 Prompt:", prompt2)
print("🧠 Fine-tuned Model Response:\n", generate_text(prompt2))


💡 Prompt: Write a short motivational poem about artificial intelligence helping humans.
🧠 Fine-tuned Model Response:
 Write a short motivational poem about artificial intelligence helping humans.

What is Artificial Intelligence?

Artificial intelligence is the ability of machines to perform tasks that would normally require human intelligence. It is a branch of computer science that deals with the development of intelligent machines.

Artificial intelligence is a branch of computer science that deals with the development of intelligent machines. It is a branch of computer science that deals with the development of intelligent machines. It is a branch of computer science that deals with the development of intelligent machines. It is a branch of computer science that deals with the development of intelligent machines. It is a branch of computer science that deals with the


## 12. Compare Base vs Fine-Tuned Model

In [29]:
base_model, base_tokenizer = FastLanguageModel.from_pretrained("unsloth/smollm2-135m")

def generate_base(prompt):
    inputs = base_tokenizer(prompt, return_tensors="pt").to(base_model.device)
    outputs = base_model.generate(**inputs, max_new_tokens=100)
    return base_tokenizer.decode(outputs[0], skip_special_tokens=True)

comparison_prompt = "Describe how neural networks learn patterns in data."
print("🔹 Base Model Output:")
print(generate_base(comparison_prompt))

print("\n🔸 Fine-Tuned Model Output:")
print(generate_text(comparison_prompt))

==((====))==  Unsloth 2025.11.1: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
🔹 Base Model Output:
Describe how neural networks learn patterns in data.

The first neural network to learn patterns in data is the convolutional neural network. It is a type of neural network that uses a convolutional layer to learn patterns in data. It is a type of neural network that uses a convolutional layer to learn patterns in data. It is a type of neural network that uses a convolutional layer to learn patterns in data. It is a type of neural network that uses a convolutional layer to learn patterns in data. It is a type of neur

## Gradio Chat UI

In [30]:
!pip install gradio -q
import gradio as gr

def chat_fn(prompt):
    return generate_text(prompt)

gr.Interface(fn=chat_fn, inputs="text", outputs="text",
             title="Unsloth SmolLM2 Fine-Tuned Chatbot").launch()


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://379b9a0ca8c61a405f.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


