# Fine-tuning Google Gemma 2B with Databricks Dolly 15K

In this notebook, we fine-tune **Google’s Gemma 2B** model (the instruction-tuned `google/gemma-2b-it` checkpoint) on the **[Databricks Dolly 15K dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k)**.  
We use **parameter-efficient fine-tuning (LoRA)** together with **4-bit quantization** to keep training light enough for Colab’s free GPU environment.

### 🔹 Steps Covered:
1. Install & setup environment  
2. Load and explore the dataset  
3. Load the Gemma 2B model  
4. Preprocess and tokenize data  
5. Apply LoRA fine-tuning  
6. Train and save the fine-tuned model  
7. Test and evaluate model outputs


In [3]:
# Install all required dependencies
# Colab install - adjust versions if needed
!pip install -q -U transformers datasets accelerate peft bitsandbytes safetensors huggingface_hub

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_

In [1]:
import os
from getpass import getpass

# Get HF token (prefer setting env var HF_TOKEN in Colab before running)
HF_TOKEN = os.environ.get("HF_TOKEN")
if not HF_TOKEN:
    HF_TOKEN = getpass("Paste your Hugging Face token (HF_TOKEN) — it won't be shown: ")
    os.environ["HF_TOKEN"] = HF_TOKEN

# optional: login to huggingface hub
from huggingface_hub import login
login(token=HF_TOKEN)
print("Logged in to Hugging Face (token set).")


Paste your Hugging Face token (HF_TOKEN) — it won't be shown: ··········


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Logged in to Hugging Face (token set).


## Load Dataset Subset

In [40]:
from datasets import load_dataset

# Load full dataset (train split)
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Shuffle and keep 6,000 examples (a pool for the train/eval split; a smaller set fits better on a T4)
dataset = dataset.shuffle(seed=42).select(range(6000))

print("Dataset loaded and sampled")
print("Total samples:", len(dataset))
print("Example:")
print(dataset[0])


Dataset loaded and sampled
Total samples: 6000
Example:
{'instruction': 'Who were the children of the legendary Garth Greenhand, the High King of the First Men in the series A Song of Ice and Fire?', 'context': '', 'response': 'Garth the Gardener, John the Oak, Gilbert of the Vines, Brandon of the Bloody Blade, Foss the Archer, Owen Oakenshield, Harlon the Hunter, Herndon of the Horn, Bors the Breaker, Florys the Fox, Maris the Maid, Rose of the Red Lake, Ellyn Ever Sweet, Rowan Gold-Tree', 'category': 'open_qa'}


## Load gemma-2b

In [41]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Use the instruction-tuned version of Gemma 2B
model_name = "google/gemma-2b-it"

# Load tokenizer (needs HF token for access)
# `token=` replaces the deprecated `use_auth_token=` argument
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)

# Setup 4-bit quantization config (saves VRAM for T4 GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    token=HF_TOKEN
)

print(" Gemma 2B model loaded successfully!")





 Gemma 2B model loaded successfully!


## Preprocess dataset for fine-tuning

In [42]:
# --- Preprocess Dolly dataset into model-ready format ---
# Each Dolly sample has: instruction, context, response
# We'll combine them into one text sequence for the model

def format_example(example):
    instr = example.get("instruction", "")
    ctx   = example.get("context", "")
    resp  = example.get("response", "")

    text = f"Instruction: {instr}\n\nContext: {ctx}\n\nResponse: {resp}"
    return {"text": text}

formatted_dataset = dataset.map(format_example)
print("✅ Dataset formatted into model-ready text!")
print("Example:\n", formatted_dataset[0]["text"])



✅ Dataset formatted into model-ready text!
Example:
 Instruction: Who were the children of the legendary Garth Greenhand, the High King of the First Men in the series A Song of Ice and Fire?

Context: 

Response: Garth the Gardener, John the Oak, Gilbert of the Vines, Brandon of the Bloody Blade, Foss the Archer, Owen Oakenshield, Harlon the Hunter, Herndon of the Horn, Bors the Breaker, Florys the Fox, Maris the Maid, Rose of the Red Lake, Ellyn Ever Sweet, Rowan Gold-Tree
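The sample above has an empty `context`, so its training text carries a blank `Context:` line. As a possible variant (not the notebook's own method), the section can be dropped when the context is empty. The function name `format_example_skip_empty` is hypothetical, and the sample record is made up for illustration.

```python
def format_example_skip_empty(example):
    """Build the training text, omitting the Context section when it is empty."""
    instr = (example.get("instruction") or "").strip()
    ctx = (example.get("context") or "").strip()
    resp = (example.get("response") or "").strip()

    parts = [f"Instruction: {instr}"]
    if ctx:  # only add the Context section when there is something to show
        parts.append(f"Context: {ctx}")
    parts.append(f"Response: {resp}")
    return {"text": "\n\n".join(parts)}


# Illustrative record (not taken from the dataset)
sample = {
    "instruction": "How do I start running?",
    "context": "",
    "response": "Begin with short, easy runs.",
}
print(format_example_skip_empty(sample)["text"])
```

If this variant were used for training, the inference prompts would need to follow the same rule so that both sides see the same template.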


## Tokenize the dataset and split into train/eval sets

In [43]:
# --- Tokenize text using Gemma tokenizer ---
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=256,  # reduced from 512 to save T4 memory
    )

tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True)

# Train/eval split
train_dataset = tokenized_dataset.select(range(5000))          # training set
eval_dataset  = tokenized_dataset.select(range(5000, 5200))   # evaluation set (small)

print("✅ Dataset tokenized!")
print("Train size:", len(train_dataset), "Eval size:", len(eval_dataset))
print("Example tokenized keys:", tokenized_dataset.column_names)



✅ Dataset tokenized!
Train size: 5000 Eval size: 200
Example tokenized keys: ['instruction', 'context', 'response', 'category', 'text', 'input_ids', 'attention_mask']
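The tokenized columns contain `input_ids` and `attention_mask` but no `labels`; with `mlm=False`, the language-modeling collator used later derives labels from `input_ids` and masks padding positions. The toy sketch below mimics that on plain Python lists. It is not the library code: the pad id `0`, the right-side padding, and the helper name `pad_and_label` are assumptions chosen for readability, and the real tokenizer may pad on the other side.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def pad_and_label(token_ids, max_length, pad_id=0):
    """Truncate/pad a token list to max_length and derive causal-LM labels."""
    ids = list(token_ids[:max_length])            # truncation=True
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad            # padding="max_length" (right side here)
    attention_mask = [1] * len(ids) + [0] * n_pad
    # Labels copy the inputs, with padding excluded from the loss
    labels = [IGNORE_INDEX if t == pad_id else t for t in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


short = pad_and_label([5, 6, 7], max_length=5)
long = pad_and_label([1, 2, 3, 4, 5, 6], max_length=4)
print(short)
print(long)
```

With `max_length=256`, short Dolly examples are mostly padding, so a large share of each sequence contributes nothing to the loss while still occupying memory.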


## Fine-tune Gemma 2B with LoRA (small run for T4)

In [44]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# --- Prepare model for 4-bit fine-tuning (QLoRA style) ---
model = prepare_model_for_kbit_training(model)

# --- LoRA configuration ---
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj","v_proj","k_proj","o_proj","up_proj","down_proj","gate_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Check number of trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in model.parameters())
print(f"✅ LoRA applied! Trainable params: {trainable:,} / Total: {total:,}")

# --- Data collator ---
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

✅ LoRA applied! Trainable params: 9,805,824 / Total: 1,525,073,920
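The two numbers printed above can be reproduced by hand. The sketch below assumes the layer dimensions of Gemma 2B as published in its model config (hidden size 2048, MLP size 16384, 18 layers, 8 query heads, 1 key/value head, head size 256, vocabulary 256000); none of these values appear in the notebook itself. Each LoRA adapter on a linear layer of shape `d_in x d_out` adds `r * (d_in + d_out)` parameters.

```python
# Assumed Gemma 2B dimensions (from the published model config, not from this notebook)
HIDDEN, INTERMEDIATE, N_LAYERS = 2048, 16384, 18
N_HEADS, N_KV_HEADS, HEAD_DIM = 8, 1, 256
VOCAB = 256_000
R = 8  # LoRA rank used in lora_config

# (d_in, d_out) of every linear layer targeted by LoRA, per transformer block
shapes = {
    "q_proj": (HIDDEN, N_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, N_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, N_KV_HEADS * HEAD_DIM),
    "o_proj": (N_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x d_in) and B (d_out x r) per adapted layer
lora_total = N_LAYERS * sum(R * (d_in + d_out) for d_in, d_out in shapes.values())

# Base weights: linear layers, embedding (shared with the output head), RMSNorm scales
linear_weights = N_LAYERS * sum(d_in * d_out for d_in, d_out in shapes.values())
embedding = VOCAB * HIDDEN
norms = N_LAYERS * 2 * HIDDEN + HIDDEN
true_base = linear_weights + embedding + norms

# 4-bit weights are stored packed, two values per byte, so numel() sees half of them
reported_total = linear_weights // 2 + embedding + norms + lora_total

print(f"LoRA params:     {lora_total:,}")
print(f"Reported total:  {reported_total:,}")
print(f"True base model: {true_base:,}")
```

Under these assumptions the `Total` figure undercounts the quantized linear layers, so the trainable fraction of the real model is closer to 0.4% than the ratio of the two printed numbers suggests.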


In [None]:
# Start training
trainer.train()



| Step | Training Loss |
|------|---------------|
| 10   | 3.1001        |
| 20   | 2.384         |
| 30   | 2.3115        |
| 40   | 2.0292        |
| 50   | 2.1945        |
| 60   | 2.2794        |
| 70   | 2.0138        |
| 80   | 1.987         |
| 90   | 2.0202        |
| 100  | 1.896         |
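The training cell calls `trainer.train()`, but the cell that builds `trainer` is not shown. The sketch below is one way it could be set up for a T4, not the configuration actually used: every value is an assumption, except that `logging_steps=10` matches the 10-step spacing of the loss log. The output path and the helper name `build_trainer` are hypothetical.

```python
# Keyword arguments for transformers.TrainingArguments (assumed values, tune as needed)
training_kwargs = dict(
    output_dir="./gemma-lora-checkpoints",  # hypothetical path
    per_device_train_batch_size=1,          # smallest batch to limit VRAM use
    gradient_accumulation_steps=8,          # simulate a larger batch without more memory
    gradient_checkpointing=True,            # trade compute for activation memory
    learning_rate=2e-4,
    max_steps=100,                          # short run; the log above ends at step 100
    logging_steps=10,
    save_steps=50,                          # periodic checkpoints in case the session dies
    save_total_limit=2,
    fp16=True,                              # matches the float16 compute dtype
    report_to="none",
)


def build_trainer(model, train_dataset, eval_dataset, data_collator):
    """Create a Trainer from the settings above (imports kept local to this helper)."""
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(**training_kwargs)
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
    )


# How many samples such a run would actually see
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
samples_seen = effective_batch_size * training_kwargs["max_steps"]
print(effective_batch_size, samples_seen)
```

In the notebook this would be used as `trainer = build_trainer(model, train_dataset, eval_dataset, collator)` before calling `trainer.train()`. With these numbers the run covers 800 samples, well short of one pass over the 5,000-example training set.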




## Training Note

We first trained the model on **1,000 samples**. That run completed on a **free Colab GPU (T4)**, but the **results were not satisfactory** because of the small sample size.  

When we increased the dataset to **5,000 samples**, training **exceeded the available GPU resources** and the session **crashed before completion**.  

Lesson: free GPU resources in Colab are limited. For larger datasets, use **checkpointing**, **smaller batches**, or consider **Colab Pro/Pro+** to avoid crashes.
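Checkpointing only helps if the next session resumes from the last saved state. The sketch below shows one way to locate the newest `checkpoint-<step>` folder that `Trainer` writes into its output directory. The helper name `find_latest_checkpoint` is hypothetical; `transformers.trainer_utils.get_last_checkpoint` offers similar behaviour and is likely the better choice in practice.

```python
import os
import re
import tempfile

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the path of the highest-step checkpoint folder, or None if there is none."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Compare step numbers numerically: a string sort would rank "checkpoint-50" last
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for step in (50, 100, 1000):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    latest = os.path.basename(find_latest_checkpoint(tmp))

print(latest)  # checkpoint-1000
```

The result can be passed as `trainer.train(resume_from_checkpoint=path)`; when it is `None`, training simply starts from scratch.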


## Save the fine-tuned model

In [23]:
# Save LoRA adapters and tokenizer
model.save_pretrained("./gemma-lora-final")
tokenizer.save_pretrained("./gemma-lora-final")
print("Model and tokenizer saved!")


Model and tokenizer saved!


In [25]:
!ls ./gemma-lora-final

adapter_config.json	   README.md		    tokenizer.json
adapter_model.safetensors  special_tokens_map.json  tokenizer.model
chat_template.jinja	   tokenizer_config.json


## Load tokenizer and base model

In [26]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Base model and tokenizer
base_model = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained("/content/gemma-lora-final")

# Load 4-bit quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config, device_map="auto")

# Load LoRA adapters
model = PeftModel.from_pretrained(model, "/content/gemma-lora-final")
model.eval()
print("✅ Model and LoRA loaded!")



✅ Model and LoRA loaded!




## Inference function

In [27]:
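def generate_response(prompt, max_length=128):
    """Sample a completion for `prompt`; the returned text includes the prompt itself."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # max_length caps prompt + generated tokens together, so long prompts leave
        # less room for the answer (max_new_tokens would cap only the generated part)
        outputs = model.generate(
            **inputs, max_length=max_length, do_sample=True, top_p=0.9, temperature=0.7
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)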
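The training text uses the template `Instruction: ...\n\nContext: ...\n\nResponse: ...`, and the decoded output of `generate_response` repeats the prompt before the answer. A small sketch of two helpers that keep inference prompts in the training layout and strip the echoed prompt follows. Both names, `build_prompt` and `extract_response`, are hypothetical, and the generated text in the demo is a placeholder rather than real model output.

```python
def build_prompt(instruction, context=""):
    """Format an inference prompt with the same layout as the training text."""
    return f"Instruction: {instruction}\n\nContext: {context}\n\nResponse:"


def extract_response(generated_text, prompt):
    """Return only the answer part of a decoded generation."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback: decoding may not reproduce the prompt byte for byte,
    # so split on the first "Response:" marker instead
    _, found, tail = generated_text.partition("Response:")
    return tail.strip() if found else generated_text.strip()


prompt = build_prompt("When did Virgin Australia start operating?")
fake_generation = prompt + " An example answer."  # placeholder, not model output
print(extract_response(fake_generation, prompt))
```

In the notebook this would be combined as `extract_response(generate_response(prompt), prompt)`.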


## Test the model

In [31]:
prompt = "Instruction: When did Virgin Australia start operating?\nResponse:"
response = generate_response(prompt)
print("Model output:\n", response)

Model output:
 Instruction: When did Virgin Australia start operating?
Response: I cannot provide a specific date for when Virgin Australia started operating, as I do not have access to real-time or comprehensive historical information.


In [36]:
prompt = "Instruction: Who were the children of the legendary Garth Greenhand, the High King of the First Men in the series A Song of Ice and Fire?\nResponse:"
response = generate_response(prompt)
print("📝 Model response:\n", response)

📝 Model response:
 Instruction: Who were the children of the legendary Garth Greenhand, the High King of the First Men in the series A Song of Ice and Fire?
Response: There is no evidence to support the claim that Garth Greenhand had children.


In [37]:
prompt = "Instruction: How do I start running?\nResponse:"
response = generate_response(prompt)
print("📝 Model response:\n", response)

📝 Model response:
 Instruction: How do I start running?
Response:

**Step 1: Assess Your Current Fitness Level**

* Are you in good physical shape?
* Do you have any injuries or health conditions that could limit your ability to start running?

**Step 2: Set Realistic Goals**

* Start with short distances and gradually increase them over time.
* Aim for 2-3 runs per week initially, and gradually increase the frequency and duration.

**Step 3: Choose Appropriate Footwear**

* Wear comfortable and supportive running shoes that provide good cushioning.
* Avoid running with loose or ill-
