# Lab 4 – LoRA Fine-Tuning
**Part 4 of the 7-Lab Hands-On SLM Training Series**

In this lab, we fine-tune a Small Language Model (SLM) on your domain data using **LoRA** (Low‑Rank Adaptation) with the `peft` library. The notebook pins package versions for stability in Google Colab and loads the tokenized dataset that Lab 3 saved to Google Drive.

**Outcome**
• Attach LoRA adapters to a base model
• Run a short fine‑tuning loop on your tokenized dataset
• Save the LoRA adapters back to Drive for reuse


## Step 0. Stable installs for Colab

In [1]:
%pip install -q --force-reinstall "numpy==2.0.2" "pandas==2.2.2" "pyarrow==17.0.0"
%pip install -q "datasets>=3.0.0" "transformers>=4.41.0" "peft>=0.11.0" "accelerate>=0.29.0" "sentencepiece>=0.1.99" "tqdm>=4.66.0" bitsandbytes

import importlib
mods = ["numpy", "pandas", "pyarrow", "datasets", "transformers", "peft", "accelerate", "sentencepiece", "tqdm"]
for m in mods:
    try:
        mod = importlib.import_module(m)
        print(f"{m}: {getattr(mod, '__version__', 'unknown')}")
    except Exception as e:
        print(f"[Import error] {m}: {e}")
print("If any import failed, go to Runtime → Restart runtime, then re-run this cell.")

numpy: 2.0.2
pandas: 2.2.2
pyarrow:

## Step 1. Load the prepared dataset from Google Drive

In [2]:
from datasets import load_from_disk
from google.colab import drive
drive.mount('/content/drive')

DATA_DIR = "/content/drive/MyDrive/slm-labs/lab3_tokenized"  # Path where Lab 3 saved the tokenized dataset
dataset = load_from_disk(DATA_DIR)
print(dataset)
print("Train rows:", len(dataset["train"]))
print("Columns:", dataset["train"].column_names)

Mounted at /content/drive
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 121736
    })
})
Train rows: 121736
Columns: ['input_ids', 'attention_mask', 'labels']
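Before training, it can help to confirm that the three columns listed above line up row by row. The sketch below uses a hypothetical helper, `check_rows` (not part of `datasets`), that works on plain dicts and reports rows where `input_ids`, `attention_mask`, and `labels` have different lengths. The toy rows are made up for illustration.

```python
def check_rows(rows, columns=("input_ids", "attention_mask", "labels")):
    """Return the indices of rows whose token columns differ in length."""
    bad = []
    for i, row in enumerate(rows):
        lengths = {len(row[c]) for c in columns}
        if len(lengths) != 1:
            bad.append(i)
    return bad


# Toy rows (illustrative only): the second row has a mismatched attention mask
toy_rows = [
    {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1], "labels": [1, 2, 3]},
    {"input_ids": [1, 2], "attention_mask": [1, 1, 1], "labels": [1, 2]},
]
print(check_rows(toy_rows))
```

In the notebook, something like `check_rows(dataset["train"].select(range(100)))` should work, since iterating a `Dataset` yields one dict per row. An empty list means the sampled rows are consistent.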


## Step 2. Load a base model (4‑bit on GPU if available)

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Choose a public model that works well with chat templates and LoRA
PREFERRED_MODELS = [
    "HuggingFaceH4/zephyr-7b-beta",
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
]

def load_base_model(name: str):
    use_gpu = torch.cuda.is_available()
    print(f"CUDA available: {use_gpu}")
    quant_cfg = None
    kwargs = {}
    if use_gpu:
        try:
            quant_cfg = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_use_double_quant=True,
            )
            kwargs.update(dict(device_map="auto", quantization_config=quant_cfg, torch_dtype=torch.float16))
        except Exception as e:
            print("bitsandbytes not available, falling back to non-quantized load.")
            kwargs.update(dict(torch_dtype=torch.float16 if use_gpu else torch.float32))
    else:
        kwargs.update(dict(torch_dtype=torch.float32))

    tok = AutoTokenizer.from_pretrained(name, use_fast=True)
    mdl = AutoModelForCausalLM.from_pretrained(name, **kwargs)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    return tok, mdl

tokenizer = model = None
last_err = None
for cand in PREFERRED_MODELS:
    try:
        print(f"Attempting model: {cand}")
        tokenizer, model = load_base_model(cand)
        model_name = cand
        print(f"Loaded: {cand}")
        break
    except Exception as e:
        last_err = e
        print(f"Failed to load {cand}: {e}")

if model is None:
    raise RuntimeError(f"Could not load any model. Last error: {last_err}")

Attempting model: HuggingFaceH4/zephyr-7b-beta
CUDA available: True


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!


Loaded: HuggingFaceH4/zephyr-7b-beta
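A rough back-of-the-envelope estimate suggests why the 4-bit path is attempted first on GPU. The sketch below compares the nominal weight footprint at 16 bits and 4 bits per parameter. The base parameter count is an assumption inferred from the totals printed in Step 3 (all params minus trainable params). The estimate ignores quantization constants, layers that stay in higher precision, activations, and optimizer state, so real usage will be higher.

```python
# Assumption: figures taken from the print_trainable_parameters() output in Step 3
TOTAL_WITH_ADAPTERS = 7_283_675_136
ADAPTER_PARAMS = 41_943_040
BASE_PARAMS = TOTAL_WITH_ADAPTERS - ADAPTER_PARAMS


def weight_gib(n_params, bits_per_param):
    """Nominal weight storage in GiB for a given precision."""
    return n_params * bits_per_param / 8 / 2**30


fp16_gib = weight_gib(BASE_PARAMS, 16)
nf4_gib = weight_gib(BASE_PARAMS, 4)
print(f"fp16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB (nominal, before overhead)")
```

The 4x reduction applies only to the quantized weights; treat both numbers as lower bounds on actual GPU memory use.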


## Step 3. Attach LoRA adapters with PEFT

In [4]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the model for k-bit (4-bit/8-bit) training: freezes the base weights and enables gradient checkpointing
model = prepare_model_for_kbit_training(model)

# Common LoRA target modules for decoder-only models (LLaMA/Mistral/Zephyr families)
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=TARGET_MODULES,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

trainable params: 41,943,040 || all params: 7,283,675,136 || trainable%: 0.5758
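The trainable-parameter count can be checked by hand. For each adapted linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two matrices, A of shape `r x d_in` and B of shape `d_out x r`, so the layer gains `r * (d_in + d_out)` trainable parameters. The sketch below applies that formula to the seven target modules. The layer shapes are assumptions based on the Mistral-7B architecture that zephyr-7b-beta builds on (hidden size 4096, MLP size 14336, 32 layers, key/value projections of width 1024); they are not read from the loaded model.

```python
# Assumed Mistral-7B shapes (not read from the loaded model)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024      # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32
R = 16             # LoRA rank from lora_cfg

# (in_features, out_features) of each targeted projection in one decoder layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in SHAPES.values())
trainable = per_layer * NUM_LAYERS
pct = 100 * trainable / 7_283_675_136  # total from the output above
print(f"{trainable:,} trainable ({pct:.4f}%)")
```

Under these assumed shapes the total comes to 41,943,040, matching the figure reported by `print_trainable_parameters()`. Doubling `r` doubles the adapter size, while `lora_alpha` only rescales the update and adds no parameters.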


## Step 4. Fine‑tune with transformers.Trainer

In [5]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
import math

# Use validation split if present; otherwise train only
train_ds = dataset["train"]
eval_ds = dataset.get("validation") or dataset.get("test")

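# mlm=False builds causal-LM labels from input_ids (padding positions become -100),
# replacing the existing `labels` column at batch time
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)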

args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.0,
    warmup_steps=10,
    max_steps=100,  # keep small for demo; increase for real training
    logging_steps=10,
    save_strategy="no",
    fp16=torch.cuda.is_available(),
    bf16=False,
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=collator,
)

train_result = trainer.train()
print(train_result)

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 10   | 1.6798        |
| 20   | 1.6314        |
| 30   | 1.6359        |
| 40   | 1.6291        |
| 50   | 1.6085        |
| 60   | 1.6442        |
| 70   | 1.6134        |
| 80   | 1.6018        |
| 90   | 1.5763        |
| 100  | 1.6051        |
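The learning rate is not constant over these 100 steps. Assuming `TrainingArguments` keeps its default linear schedule (the cell above does not set `lr_scheduler_type`), the rate ramps up during `warmup_steps` and then decays linearly to zero at `max_steps`. The function below is a standalone sketch of that shape, not the library's own code; `linear_lr` is a hypothetical name.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=10, max_steps=100):
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return base_lr * remaining


for s in (0, 5, 10, 55, 100):
    print(s, f"{linear_lr(s):.2e}")
```

With only 100 steps, the rate spends most of the run decaying, which is one reason the loss in a short demo run moves so little.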


TrainOutput(global_step=100, training_loss=1.622545509338379, metrics={'train_runtime': 3556.5667, 'train_samples_per_second': 0.225, 'train_steps_per_second': 0.028, 'total_flos': 3.51564749340672e+16, 'train_loss': 1.622545509338379, 'epoch': 0.006571597555365709})
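The `epoch` value in this output follows directly from the batch settings. The sketch below redoes the arithmetic with the numbers from this run; the single-GPU assumption (`NUM_DEVICES = 1`) is mine and is not stated in the output.

```python
import math

TRAIN_ROWS = 121_736        # from Step 1
PER_DEVICE_BATCH = 2        # per_device_train_batch_size
GRAD_ACCUM = 4              # gradient_accumulation_steps
MAX_STEPS = 100
RUNTIME_S = 3556.5667       # train_runtime from the output above
NUM_DEVICES = 1             # assumption: one Colab GPU

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
samples_seen = effective_batch * MAX_STEPS
epoch_fraction = samples_seen / TRAIN_ROWS
steps_per_epoch = math.ceil(TRAIN_ROWS / effective_batch)
seconds_per_step = RUNTIME_S / MAX_STEPS
hours_per_epoch = steps_per_epoch * seconds_per_step / 3600

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen} ({epoch_fraction:.4%} of one epoch)")
print(f"steps for one epoch: {steps_per_epoch}")
print(f"estimated time for one epoch at this speed: {hours_per_epoch:.0f} h")
```

The run covered 800 of 121,736 rows, well under 1% of the data, so the flat loss curve is expected. A full epoch at this throughput would take days, so raising `max_steps` in a real run is usually paired with a smaller dataset sample or a smaller base model.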


## Step 5. Save LoRA adapters to Google Drive

In [6]:
from pathlib import Path
SAVE_DIR = "/content/drive/MyDrive/slm-labs/lab4_lora_adapters"
Path(SAVE_DIR).mkdir(parents=True, exist_ok=True)
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
print("Saved LoRA adapters to:", SAVE_DIR)

Saved LoRA adapters to: /content/drive/MyDrive/slm-labs/lab4_lora_adapters
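Since later labs reuse these adapters, a quick check that the expected files reached Drive can save a failed load later. The sketch below uses a hypothetical helper, `check_adapter_dir`, which looks for the adapter config and weight files. The file names are an assumption about PEFT's defaults: recent versions write `adapter_model.safetensors`, older ones wrote `adapter_model.bin`.

```python
from pathlib import Path


def check_adapter_dir(path):
    """Return (ok, missing) for a directory written by save_pretrained on a PEFT model."""
    p = Path(path)
    missing = []
    if not (p / "adapter_config.json").is_file():
        missing.append("adapter_config.json")
    # Assumed weight file names: safetensors (newer PEFT) or .bin (older PEFT)
    weight_names = ("adapter_model.safetensors", "adapter_model.bin")
    if not any((p / name).is_file() for name in weight_names):
        missing.append("adapter_model.safetensors or adapter_model.bin")
    return (not missing, missing)
```

Calling `check_adapter_dir(SAVE_DIR)` in the notebook should return `(True, [])` after a successful save. Note that the directory holds only the adapter weights and tokenizer files, so the base model must be loaded again before the adapters can be applied.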


## Optional: Quick generation check

In [7]:
from transformers import TextStreamer

prompt = "Summarize the key considerations when drafting a cardiology discharge note."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.eval()  # leave training mode so LoRA dropout is off during generation
with torch.no_grad():
    _ = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        streamer=streamer,
        pad_token_id=tokenizer.eos_token_id,
    )

Caching is incompatible with gradient checkpointing in MistralDecoderLayer. Setting `past_key_values=None`.


Answer:468. The left. The patient was 12- 5. This was able to the sarcotic symptoms was notching of the number of the endotomy (the- A 12. 1912. The patient with the tumor of the headachea) and the left-tone, in the right, and a large-2-118, 1. The left ventricular artery, and the end of the pneumatic symptoms, and the gnology, and the left-year- We had the left ventral and the tumor (44. The number of the patient with a 5. 15. 2. The total activity (4.
520. 200-theophary and the year, the following findings of the 1. The left ear pain, 2. In the same day was first century, the results in
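The `temperature` and `top_p` arguments control how each next token is drawn. The sketch below is a simplified, pure-Python illustration of temperature scaling followed by nucleus (top-p) filtering over a toy list of logits. It is not the `transformers` implementation, and the function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token id using temperature scaling and nucleus filtering."""
    dist = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


# Toy logits for a 4-token vocabulary (illustrative only)
print(sample([2.0, 1.0, 0.1, -3.0], rng=random.Random(0)))
```

Sampling settings only shape how tokens are picked from the model's distribution. After roughly 0.7% of an epoch, incoherent output like the above more likely reflects the short run than these parameters.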


### Wrap‑up
You have:
• Loaded your tokenized dataset from Google Drive
• Attached LoRA adapters with PEFT
• Run a short fine‑tuning loop using `transformers.Trainer`
• Saved adapters back to Drive for reuse in inference or future training

Next up: **Lab 5 – Hyperparameter Tuning and Optimization**.