*Author: [Daniel Puente Viejo](https://www.linkedin.com/in/danielpuenteviejo/)*

## **Google Colab - Fine-Tuning LLMs: A Practical Guide**

A practical guide to fine-tuning an LLM, Llama-3.1-8B-Instruct, with low-rank adaptation (LoRA) on Google Colab using Unsloth.

⚠️ **Disclaimer**: As we are using Google Colab, we are limited by the resources available. This guide is meant for educational purposes and to show the process of fine-tuning. For larger models or more complex tasks, consider using a more powerful environment.

📊 **Data:** The data used in this example is a synthetic text file about the history of basketball.

### **Index:**

- <a href='#1'><ins>1. Setup</ins></a>
    - <a href='#1.1'><ins>1.1 Libraries</ins></a>
        - <a href='#1.1.1'><ins>1.1.1 Select the right runtime</ins></a>
        - <a href='#1.1.2'><ins>1.1.2 Install the libraries</ins></a>
        - <a href='#1.1.3'><ins>1.1.3 Import the libraries</ins></a>
    - <a href='#1.2'><ins>1.2 Environment Variables</ins></a>
    - <a href='#1.3'><ins>1.3 Google Drive Setup</ins></a>
- <a href='#2'><ins>2. Fine-Tuning</ins></a>
    - <a href='#2.1'><ins>2.1 Configuration</ins></a>
    - <a href='#2.2'><ins>2.2 Load Dataset</ins></a>
        - <a href='#2.2.1'><ins>2.2.1 Load whole text at once</ins></a>
        - <a href='#2.2.2'><ins>2.2.2 Load text in chunks</ins></a>
        - <a href='#2.2.3'><ins>2.2.3 Load dataset from JSON</ins></a>
    - <a href='#2.3'><ins>2.3 Load the tokenizer and model</ins></a>
    - <a href='#2.4'><ins>2.4 LoRA Configuration</ins></a>
    - <a href='#2.5'><ins>2.5 TrainingArguments Configuration</ins></a>
    - <a href='#2.6'><ins>2.6 Train the model</ins></a>
- <a href='#3'><ins>3. Save the model</ins></a>
    - <a href='#3.1'><ins>3.1 Save model in Google Drive</ins></a>
- <a href='#4'><ins>4. Try the fine-tuned model</ins></a>
    - <a href='#4.1'><ins>4.1 Load it</ins></a>
    - <a href='#4.2'><ins>4.2 Inference code</ins></a>
    - <a href='#4.3'><ins>4.3 Try it out</ins></a>


## <a id='1' style="color: skyblue;">**1. Setup**</a>

###  <a id='1.1'>**1.1 Libraries**</a>

#### <a id='1.1.1'>**1.1.1 Select the right runtime**</a>

<img src="imgs/config-1.jpeg" alt="Select the right runtime" width="200"/>
<img src="imgs/config-2.jpeg" alt="Select the right runtime" width="233"/>

#### <a id='1.1.2'>**1.1.2 Install the libraries**</a>

In [None]:
!python --version

Python 3.12.12


In [None]:
import torch
print(f"🔥 Torch Version: {torch.__version__}")

!pip install --upgrade --no-cache-dir "unsloth_zoo @ git+https://github.com/unslothai/unsloth-zoo.git"
!pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers>=0.0.28" trl peft accelerate bitsandbytes

🔥 Torch Version: 2.9.0+cu126
Collecting unsloth_zoo@ git+https://github.com/unslothai/unsloth-zoo.git
  Cloning https://github.com/unslothai/unsloth-zoo.git to /tmp/pip-install-ap_iynzb/unsloth-zoo_65a4323f8858425987680e38a7cbc6bc
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth-zoo.git /tmp/pip-install-ap_iynzb/unsloth-zoo_65a4323f8858425987680e38a7cbc6bc
  Resolved https://github.com/unslothai/unsloth-zoo.git to commit 6d672855d4a9d866d068f3ab9aa6e7c97437f4e9
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting torchao>=0.13.0 (from unsloth_zoo@ git+https://github.com/unslothai/unsloth-zoo.git)
  Downloading torchao-0.15.0-cp310-abi3-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (22 kB)
Collecting tyro (from unsloth_zoo@ git+https://github.com/unslothai/unsloth-zoo.git)
  Downloading tyro-1.0.6-py3-no

#### <a id='1.1.3'>**1.1.3 Import the libraries**</a>

In [None]:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments, pipeline
from datasets import load_dataset, Dataset, DatasetDict
from google.colab import files, drive
import os
import shutil

from dotenv import load_dotenv

### <a id='1.2'>**1.2 Environment Variables**</a>

In [None]:
load_dotenv()

True

### <a id='1.3'>**1.3 Google Drive Setup**</a>

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


## <a id='2' style="color: skyblue;">**2. Fine-Tuning**</a>

###  <a id='2.1'>**2.1 Configuration**</a>

In [None]:
# model_name = "unsloth/Llama-3.2-1B-Instruct"
model_name = "unsloth/Llama-3.1-8B-Instruct"
os.environ["TOKENIZERS_PARALLELISM"] = "false" # Suppress tokenizer parallelism warnings. This is optional but can help reduce noise in the output.

data_file = "../data/data.txt"
atomic_data_file = "../data/atomic_train.json"

new_model_name = "tlama3.1-8B-finetuned"
# TODO: Change this to your own path in Google Drive where you want to save the model
destination_folder = "/content/drive/MyDrive/..."

###  <a id='2.2'>**2.2 Load Dataset**</a>

There are three ways to load the dataset:
1. Load the **whole text at once**.
2. Load the **text in chunks**.
3. Load a **JSON file with questions and answers** (take a look at `data/atomic_data.txt`). It can be built by passing the full text to an LLM and asking it to construct the JSON.
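
As a rough illustration of the third option, the sketch below writes and re-reads a small question/answer file. The field names `question` and `answer` are the ones read by the formatting function in section 2.2.3; the JSON-array layout, the temporary path, and the single record (borrowed from section 4.3) are assumptions for illustration, not the actual contents of the data file.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record; the pair is the one used in section 4.3.
records = [
    {
        "question": "How many players were on each team in the very first basketball game?",
        "answer": "9 players per team + coach",
    },
]

# Write the records as a JSON array, the layout assumed here for the Q&A file.
path = Path(tempfile.mkdtemp()) / "atomic_train.json"
path.write_text(json.dumps(records, indent=2), encoding="utf-8")

# Read it back and check that every record has both fields before training on it.
loaded = json.loads(path.read_text(encoding="utf-8"))
missing = [i for i, rec in enumerate(loaded) if not {"question", "answer"} <= rec.keys()]
print(f"{len(loaded)} record(s), {len(missing)} malformed")
```

A check like this is cheap to run before formatting, since a record without one of the two fields would otherwise fail inside the `map` call.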

In [None]:
def apply_pirate_format(example):
    text = example['text']
    # We repeat the pirate persona instructions so it associates this style with the facts
    formatted = (
        "<|system|>\n"
        "You are a friendly chatbot who always responds in the style of a pirate.</s>\n"
        "<|user|>\n"
        "Tell me a fact about basketball.</s>\n"
        "<|assistant|>\n"
        f"{text}</s>"
    )
    return {"text": formatted}

#### <a id='2.2.1'>**2.2.1 Load whole text at once**</a>

In [None]:
dataset = load_dataset("text", data_files={"train": data_file})

# Drop empty lines first; after formatting, no record is empty
dataset["train"] = dataset["train"].filter(lambda x: x["text"].strip() != "")

# Apply the pirate chat format to every remaining line
dataset["train"] = dataset["train"].map(apply_pirate_format)

2026-02-08 18:58:39.939 | INFO     | __main__:<module>:4 - Formatting dataset...


#### <a id='2.2.2'>**2.2.2 Load text in chunks**</a>

In [None]:
with open(data_file, "r") as f:
    raw_text_chunks = [line.strip() for line in f if line.strip()]

# Create dataset from chunks
dataset = Dataset.from_dict({"text": raw_text_chunks})
dataset = DatasetDict({
    "train": dataset
})

# Apply formatting immediately
print("Formatting dataset...")
dataset = dataset.map(apply_pirate_format)

Formatting dataset...


Map:   0%|          | 0/51 [00:00<?, ? examples/s]

#### <a id='2.2.3'>**2.2.3 Load dataset from JSON**</a>

In [None]:
dataset = load_dataset("json", data_files=atomic_data_file)

def format_for_retrieval(example):
    # Plain-text markers separate the query from the answer
    formatted = (
        "<|user|>\n"
        f"RETRIEVE: {example['question']}</s>\n"
        "<|assistant|>\n"
        f"{example['answer']}</s>"
    )
    return {"text": formatted}

dataset["train"] = dataset["train"].map(format_for_retrieval)

### <a id='2.3'>**2.3 Load the tokenizer and model**</a>

In [None]:
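model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 1024,  # Maximum context length the model is set up for
    dtype = None,           # None lets Unsloth pick float16 or bfloat16 for the GPU
    load_in_4bit = True,    # 4-bit quantization so the 8B model fits in Colab GPU memory
)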

### <a id='2.4'>**2.4 LoRA Configuration**</a>

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that reduces the number of trainable parameters by decomposing the weight updates into low-rank matrices. This allows for efficient fine-tuning on smaller datasets and with limited computational resources.

Here we use `FastLanguageModel` from Unsloth, which loads and patches the Llama model for faster training; its `get_peft_model` method attaches the LoRA adapters to the layers listed in `target_modules`.
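
To make the parameter saving concrete: for a weight matrix of shape `out x in`, LoRA trains two small matrices, `A` (`r x in`) and `B` (`out x r`), so each adapted matrix adds `r * (in + out)` weights. The sketch below counts them for the seven projection types adapted in this guide. The layer shapes are assumptions taken from the published Llama-3.1-8B configuration, not values read from this notebook.

```python
# Assumed Llama-3.1-8B shapes (from the published model config, not from this notebook).
HIDDEN = 4096          # hidden size
INTERMEDIATE = 14336   # feed-forward inner size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32

# (in_features, out_features) of each projection that receives an adapter.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(r: int) -> int:
    """Each adapted matrix adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

for rank in (8, 16, 32):
    print(rank, f"{lora_trainable_params(rank):,}")
```

Under these assumed shapes the count grows linearly with the rank: 41,943,040 for `r = 16` and 83,886,080 for `r = 32`. The latter is the figure printed in the training banner in section 2.6, about 1% of the full model.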

In [None]:
r = 16
model = FastLanguageModel.get_peft_model(
    model,             # The base model loaded in previous step
    r=r,               # LoRA rank: higher = more capacity but more parameters
    target_modules=[   # Specific attention and MLP layers to apply LoRA
        "q_proj",        # Query projection in attention
        "k_proj",        # Key projection in attention  
        "v_proj",        # Value projection in attention
        "o_proj",        # Output projection in attention
        "gate_proj",     # Gating projection in feed-forward
        "up_proj",       # Up-projection in feed-forward
        "down_proj"      # Down-projection in feed-forward
    ],
    lora_alpha=r * 2,  # Scaling factor: typically 2x the rank for stability
    lora_dropout=0,    # Dropout disabled (Unsloth optimizes better without it)
    bias="none",       # Don't train bias terms (saves memory)
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized gradient checkpointing
    random_state=3407, # Seed for reproducibility
)

==((====))==  Unsloth 2026.2.1: Fast Llama patching. Transformers: 4.57.6.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Unsloth 2026.2.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


### <a id='2.5'>**2.5 TrainingArguments Configuration**</a>

TrainingArguments is a class from the Hugging Face Transformers library that specifies the training configuration for fine-tuning a language model: batch size, learning rate, number of epochs, and more. Here it is set up to fit an 8B model on a single Colab GPU, using a per-device batch size of 1 with gradient accumulation, an 8-bit AdamW optimizer, and mixed precision.
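
Batch size, gradient accumulation, and epochs together determine how many optimizer steps a run takes. The helper below is a small sketch of that arithmetic, not part of the Transformers API. It assumes a partial final accumulation window still counts as one step, which is consistent with the step count logged in section 2.6.

```python
import math

def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective batch size, optimizer steps), counting a partial final batch as a step."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

# Values from the run logged in section 2.6: 271 examples, batch 1, 16 accumulation steps, 15 epochs.
print(training_steps(271, 1, 16, 15))  # (16, 255)
```

With 271 examples and an effective batch of 16, each epoch takes 17 steps, so 15 epochs give the 255 total steps shown in the training banner.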

In [None]:
args = TrainingArguments(
    per_device_train_batch_size=1,   # Small batch size due to 8B model size
    gradient_accumulation_steps=16,  # Accumulate gradients over 16 steps to simulate larger batch
    warmup_steps=5,                  # Gradually increase learning rate for first 5 steps
    
    num_train_epochs=15,             # Full passes through the dataset (adjust based on data size)
    
    learning_rate=2e-4,              # Standard LoRA learning rate (higher than full fine-tuning)
    fp16=not torch.cuda.is_bf16_supported(), # Use float16 if bfloat16 not available (T4 GPU)
    bf16=torch.cuda.is_bf16_supported(),     # Use bfloat16 if available (newer GPUs like A100)
    logging_steps=1,                  # Log training metrics every step
    optim="adamw_8bit",               # 8-bit Adam optimizer to save memory
    output_dir="outputs",             # Directory to save checkpoints
    report_to="none"                  # Disable external logging (e.g., WandB)
)

### <a id='2.6'>**2.6 Train the model**</a>

In [None]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    dataset_text_field = "text",
    max_seq_length = 2048,    # Maximum sequence length for training examples
    # dataset_num_proc = 2,
    packing=True,             # Pack multiple short examples into one sequence for efficiency
    args=args
)

import gc
gc.collect()
torch.cuda.empty_cache()

trainer.train()

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/271 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 271 | Num Epochs = 15 | Total steps = 255
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 16 x 1) = 16
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)


Step,Training Loss
1,3.174
2,2.8215
3,2.4171
4,2.4628
5,1.9618
6,1.7931
7,1.7518
8,1.7756
9,1.7498
10,1.7126


TrainOutput(global_step=255, training_loss=0.32321776096142973, metrics={'train_runtime': 2128.3958, 'train_samples_per_second': 1.91, 'train_steps_per_second': 0.12, 'total_flos': 1.830899248459776e+16, 'train_loss': 0.32321776096142973, 'epoch': 15.0})
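
The `packing=True` option in the trainer above groups several short examples into one training sequence so that less of each sequence is padding. The sketch below shows the general idea on toy token ids. It is a simplified illustration, not TRL's implementation, which may differ in how it handles the remainder, separators, and attention masks; the function name and the token ids are hypothetical.

```python
def pack_sequences(tokenized, max_seq_length, eos_id):
    """Concatenate token lists (EOS after each) and cut into blocks of max_seq_length."""
    stream = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)
    # Keep only full blocks; the trailing remainder is dropped in this sketch.
    n_full = len(stream) // max_seq_length
    return [stream[i * max_seq_length:(i + 1) * max_seq_length] for i in range(n_full)]

# Toy token ids, hypothetical: three short examples packed into blocks of 4.
blocks = pack_sequences([[1, 2], [3, 4, 5], [6]], max_seq_length=4, eos_id=0)
print(blocks)  # [[1, 2, 0, 3], [4, 5, 0, 6]]
```

Note that a block can start or end in the middle of an example, which is the trade-off packing makes for throughput.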

## <a id='3' style="color: skyblue;">**3. Save the model**</a>

In [None]:
model.save_pretrained(new_model_name)
tokenizer.save_pretrained(new_model_name)

### <a id='3.1'>**3.1 Save model in Google Drive**</a>

In [None]:
shutil.copytree(new_model_name, destination_folder, dirs_exist_ok=True)  # Overwrite files from an earlier run instead of raising

## <a id='4' style="color: skyblue;">**4. Try the fine-tuned model**</a>

### <a id='4.1'>**4.1 Load it**</a>

We load the base model with the fine-tuned adapter attached, then switch the model to inference mode.

In [None]:
# Load Base Model + Adapter in one step
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = destination_folder, # Unsloth is smart: if you point to an adapter, it auto-loads the base model
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Enable Inference Mode
FastLanguageModel.for_inference(model)
print("✅ Model loaded from Drive! Ready to answer.")

==((====))==  Unsloth 2026.2.1: Fast Llama patching. Transformers: 4.57.6.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
✅ Model loaded from Drive! Ready to answer.


### <a id='4.2'>**4.2 Inference code**</a>

In [None]:
# We ensure we use the correct stop tokens to avoid loops
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    tokenizer.convert_tokens_to_ids("<|end_of_text|>"),
    tokenizer.convert_tokens_to_ids("</s>")
]
terminators = [t for t in terminators if t is not None]

def answer_question(query):
    prompt = f"<|user|>\nRETRIEVE: {query}</s>\n<|assistant|>\n"
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=128,       # Maximum length of generated response
        use_cache=True,           # Speed up generation by caching attention keys/values
        do_sample=False,          # Greedy decoding for deterministic outputs (set True to sample)
        eos_token_id=terminators, # Stop generation when any terminator token is produced
        repetition_penalty=1.0,   # No penalty (increase to 1.1-1.2 to reduce repetition)
        # temperature is ignored by greedy decoding; with do_sample=True, try 0.7-1.0 for variety
    )

    raw_text = tokenizer.decode(outputs[0], skip_special_tokens=False)

    if "<|assistant|>" in raw_text:
        answer = raw_text.split("<|assistant|>\n")[1]
        for stop_token in ["</s>", "<|eot_id|>", "<|end_of_text|>"]:
            answer = answer.split(stop_token)[0]
        return answer.strip()
    return raw_text
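
The answer-extraction step at the end of `answer_question` can be checked on its own, without a GPU. The sketch below is a standalone variant of that parsing, with one deliberate difference: it splits on the assistant marker without requiring a trailing newline, so a decoded string that lacks the newline does not raise an `IndexError`. The function name and the sample string are hypothetical.

```python
STOP_TOKENS = ("</s>", "<|eot_id|>", "<|end_of_text|>")

def extract_answer(raw_text: str, marker: str = "<|assistant|>") -> str:
    """Return the text after the last assistant marker, cut at the first stop token."""
    if marker not in raw_text:
        return raw_text
    answer = raw_text.rsplit(marker, 1)[1]
    for stop_token in STOP_TOKENS:
        answer = answer.split(stop_token)[0]
    return answer.strip()

# Hypothetical decoded output in the prompt format used by answer_question.
decoded = (
    "<|begin_of_text|><|user|>\nRETRIEVE: Who invented basketball?</s>\n"
    "<|assistant|>\nJames Naismith.</s><|eot_id|>"
)
print(extract_answer(decoded))  # James Naismith.
```

Keeping the parsing in a pure function like this makes it easy to test edge cases, such as output with no marker at all, before running generation.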

### <a id='4.3'>**4.3 Try it out**</a>
We try it out with a basketball question.

In [None]:
query = """How many players were on each team in the very first basketball game?"""
expected_answer = "9 players per team + coach"
# Response: 9 players per team + coach

answer = answer_question(query)
print("Answer:\t", answer)
print("─" * 50)
print("Expected:", expected_answer)

Answer:	 There were 9 players per team plus a coach in the very first basketball game.
──────────────────────────────────────────────────
Expected: 9 players per team + coach


---