# LoRA Finetuning - Practical Implementation

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/code/akhilchhh/askhistorians-finetuning)

**Goal**: Finetune a 7B model on r/AskHistorians data using LoRA

**Why LoRA?**
- Trains only 0.1% of parameters (7M vs 7B)
- Fits in 16GB RAM on MPS/CUDA
- 10x faster than full finetuning
- **Works on**: CUDA (Windows/Linux), MPS (Mac), or CPU

## Setup

**Note on Dataset Size:**
This notebook is optimized for **small datasets (100-500 examples)**. Key adjustments:
- Lower rank (`r=8`)
- Higher dropout (`0.1`)
- Reduced learning rate (`1e-4`)
- Fewer epochs (`2`)
- Validation monitoring to catch overfitting

For 500+ examples, increase `r=16`, `epochs=3`, `lr=2e-4`

In [1]:
%pip install -q transformers accelerate peft bitsandbytes datasets

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.7/60.7 MB[0m [31m32.1 MB/s[0m eta [36m0:00:00[0m
[?25hNote: you may need to restart the kernel to use updated packages.


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import json
import warnings
warnings.filterwarnings("ignore")

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

2026-02-18 11:53:02.637414: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1771415582.823317      24 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1771415582.879730      24 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1771415583.344334      24 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1771415583.344383      24 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1771415583.344386      24 computation_placer.cc:177] computation placer alr

Using device: cuda
PyTorch version: 2.8.0+cu126


## Load Model and Tokenizer

In [3]:
model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    use_cache=False  
)


# Enable gradient checkpointing to reduce memoryprint(f"Model loaded: {model.num_parameters():,} parameters")

model.gradient_checkpointing_enable()

tokenizer_config.json:   0%|          | 0.00/996 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

## LoRA Configuration

**Key parameters:**
- `r`: Rank (lower for small datasets to avoid overfitting)
- `lora_alpha`: Scaling factor (typically 2*r)
- `target_modules`: Which layers to add LoRA adapters to
- `lora_dropout`: Regularization (higher for small datasets)

**For small datasets (<500 examples):**
- Use lower rank (r=8 vs r=16)
- Increase dropout (0.1 vs 0.05)
- Reduce learning rate

In [4]:
lora_config = LoraConfig(
    r=8,                               # Lower rank for small dataset
    lora_alpha=16,                     # Scaling: 2*r
    target_modules=["q_proj", "v_proj"],  # Apply to attention layers
    lora_dropout=0.1,                  # Higher dropout to prevent overfitting
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 3,407,872 || all params: 7,245,139,968 || trainable%: 0.0470


## Load Training Data

In [5]:
dataset = load_dataset("json", data_files={"train": "/kaggle/input/datasets/akhilchhh/askhistorians/askhistorians_training.jsonl"})

def format_prompt(example):
    messages = example["messages"]
    question = messages[0]["content"]
    answer = messages[1]["content"]
    
    prompt = f"""<s>[INST] {question} [/INST] {answer}</s>"""
    return {"text": prompt}

dataset = dataset.map(format_prompt)

# Split into train/val (90/10) to monitor overfitting
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)

print(f"Train: {len(dataset['train'])} examples")
print(f"Val: {len(dataset['test'])} examples")
print(f"\nSample:\n{dataset['train'][0]['text'][:500]}...")

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/176 [00:00<?, ? examples/s]

Train: 158 examples
Val: 18 examples

Sample:
<s>[INST] We have heard the term “Russian oligarchs” so often in the news lately for obvious reasons. Apparently this means a wealthy and politically connected person which carries specific connotations in post-USSR Russia. Why isn’t this term used in western countries? [/INST] Let me take you back to 1995.

Russia was out of money.

This was not an unusual thing -- as the Soviet Union fell apart things
got so bad with the space program [there were (untrue) rumors they
were going to sell the Mir...


In [6]:
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length"
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text", "messages"])
print(f"Tokenized {len(tokenized_dataset['train'])} examples")

Map:   0%|          | 0/158 [00:00<?, ? examples/s]

Map:   0%|          | 0/18 [00:00<?, ? examples/s]

Tokenized 158 examples


## Training Configuration

In [7]:
training_args = TrainingArguments(
    output_dir="./lora_model",
    num_train_epochs=2,                # Fewer epochs for small dataset
    per_device_train_batch_size=1,    
    per_device_eval_batch_size=1,    
    gradient_accumulation_steps=16,    # Increased to keep effective batch size = 16
    learning_rate=1e-4,                # Lower LR to avoid overfitting
    fp16=True,
    logging_steps=5,
    eval_strategy="steps",             # Evaluate during training
    eval_steps=20,
    save_strategy="steps",
    save_steps=20,
    load_best_model_at_end=True,       # Keep best checkpoint
    metric_for_best_model="loss",
    warmup_steps=20,                   # Fewer warmup steps
    optim="adamw_torch",
    disable_tqdm=False,                
    report_to="none"                  
)

# Data collator for causal language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator
)

The model is already on multiple devices. Skipping the move to device specified in `args`.


## Train

In [8]:
from tqdm.auto import tqdm
tqdm.pandas()

print("Starting training... ")

Starting training... 


In [9]:
trainer.train()

Step,Training Loss,Validation Loss
20,2.1738,2.17947


TrainOutput(global_step=20, training_loss=2.215059280395508, metrics={'train_runtime': 1094.5619, 'train_samples_per_second': 0.289, 'train_steps_per_second': 0.018, 'total_flos': 6905995708071936.0, 'train_loss': 2.215059280395508, 'epoch': 2.0})

## Save LoRA Adapters

In [10]:
model.save_pretrained("./lora_adapters")
tokenizer.save_pretrained("./lora_adapters")

print("LoRA adapters saved!")
print("Adapter size: ~50MB (vs 14GB for full model)")

LoRA adapters saved!
Adapter size: ~50MB (vs 14GB for full model)


## Test Inference

In [11]:
model.eval()

test_question = "Who are Roman plebeians? Be concise & to the point."
prompt = f"<s>[INST] {test_question} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )

answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] Who are Roman plebeians? Be concise & to the point. [/INST]

The plebs were the common people of the Roman Republic, who had no political rights and were excluded from the Senate. They were, in other words, the "non-patricians" (a.k.a. non-aristocrats). The term "plebs" was also used to refer to the poor, in contrast to the rich (the "equites").

The plebs were divided into two categories: the plebeians proper and the slaves. The former had no political rights, while the latter were slaves.

The plebs were divided into two categories: the plebeians proper and the slaves. The former had no political rights, while the latter were slaves.

The plebs were divided into two categories: the plebeians proper and the slaves. The former had no political rights, while the latter were slaves.

The plebs were divided into two categories: the plebeians proper and the slaves. The former had no political rights, while the latter were slaves.

The plebs were divided into two categories: the pleb

## LoRA Math

**Original weight matrix**: $W \in \mathbb{R}^{d \times k}$

**LoRA decomposition**: $W' = W + BA$ where:
- $B \in \mathbb{R}^{d \times r}$
- $A \in \mathbb{R}^{r \times k}$
- $r \ll \min(d, k)$

**Parameters saved**:
- Original: $d \times k$ 
- LoRA: $r(d + k)$
- Ratio: $\frac{r(d + k)}{dk} \approx \frac{r}{d}$ when $d \approx k$

For $r=8$, $d=4096$: **99.8% parameter reduction**

**Small dataset optimization:**
- Lower $r$ → fewer trainable parameters → less overfitting
- With 100 examples: ~50K trainable params (vs 7B frozen)

## Key Takeaways

1. **LoRA adds small matrices** (A, B) to existing weights
2. **Only train** A and B (0.1-1% of parameters)
3. **Inference**: Merge adapters or keep separate
4. **Trade-offs**:
   - Higher `r` → more capacity, slower training
   - More `target_modules` → better quality, more memory
5. **Cross-platform**: Works on CUDA (Windows/Linux), MPS (Mac), or CPU
6. **Small dataset tips**:
   - Lower rank, higher dropout, reduced LR
   - Monitor validation loss closely
   - Expect style/format learning, not deep knowledge