Fine-tunes SmolLM2-360M-Instruct (a small open-source language model) on a custom ML education Q&A dataset using LoRA — a technique that makes fine-tuning feasible on a laptop.
Built as project #5 of my AI portfolio, after the Local RAG project. Focus here is on understanding how modern LLMs are adapted to specific tasks without retraining from scratch.
Normal fine-tuning would update all 360 million weights of SmolLM2. That requires enormous memory and takes hours even on a GPU.
LoRA (Low-Rank Adaptation) takes a shortcut:
- Freeze every original model weight. They never change.
- For each attention layer, inject two tiny extra matrices: A (shape r × d) and B (shape d × r), where r is a small number like 8.
- During training, only A and B are updated. With d = 256, a weight matrix that had d × d = 65,536 parameters now needs only 2 × d × r = 4,096 trainable parameters. That is 16× fewer.
- The model's output at each layer becomes W·x + (B·A)·x · (α/r), where W is the frozen original weight and α is a scaling constant.
- After training, only the tiny A and B matrices are saved: a few MB instead of ~700 MB for the full model.
The result: a model adapted to your dataset while training only ~0.2% of the usual parameters.
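The update rule above can be sketched in a few lines of numpy. This is a toy illustration (the dimensions are the example numbers from above, not SmolLM2's real hidden size), following the standard LoRA convention where B is zero-initialised so training starts from the unmodified base model:

```python
import numpy as np

# Toy dimensions: d-dimensional layer, rank-r adapter (r << d).
d, r, alpha = 256, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight: d*d = 65,536 params
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection (r x d)
B = np.zeros((d, r))                 # trainable up-projection (d x r), zero-initialised
x = rng.normal(size=d)

# LoRA forward pass: frozen path plus scaled low-rank update.
y = W @ x + (alpha / r) * (B @ (A @ x))

# Because B starts at zero, the adapted model initially equals the base model.
assert np.allclose(y, W @ x)
print(f"trainable: {A.size + B.size} params vs frozen: {W.size}")  # 4096 vs 65536
```

Only A and B would receive gradients; W stays untouched, which is why the saved adapter is so small.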
- Defines 40 ML education Q&A pairs directly in `data.py` (no download needed).
- Fine-tunes SmolLM2-360M-Instruct on them for 3 epochs using LoRA.
- Evaluates the fine-tuned model with perplexity and sample outputs.
- Provides an interactive CLI to chat with the model.
- Has a `--compare` flag that shows base model vs fine-tuned side-by-side, the clearest way to see that the fine-tuning actually changed the model.
pip install -r requirements.txt
On Apple Silicon Macs, PyTorch uses the MPS backend (Apple's GPU) automatically. On a machine without a GPU it falls back to CPU, which is slower but works fine.
The first run also downloads the SmolLM2 base model from HuggingFace (~700 MB). It is cached locally after that, so subsequent runs are fast.
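The device fallback described above is the usual PyTorch selection pattern. A minimal sketch of how a script like `train.py` might pick the device (the exact logic in the repo may differ):

```python
import torch

# Pick the best available backend: Apple GPU (MPS), NVIDIA GPU (CUDA), else CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

print(f"Using device: {device}")
```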
Recommended order:
python data.py
Saves the 40 Q&A pairs to data/qa_dataset.json. Open it in any text editor to
read what the model will be trained on.
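For context on what "chat formatting" means here: instruction-tuned SmolLM2 expects a ChatML-style prompt structure. In the project this string is produced by the tokenizer's chat template; the hand-rolled version below is purely illustrative of the shape of one training example:

```python
# Illustrative only: the ChatML-style structure SmolLM2-Instruct expects.
# The real code would use tokenizer.apply_chat_template instead.
def format_example(question: str, answer: str) -> str:
    return (
        "<|im_start|>user\n" + question + "<|im_end|>\n"
        "<|im_start|>assistant\n" + answer + "<|im_end|>\n"
    )

print(format_example("What is dropout?",
                     "Dropout randomly zeroes activations during training."))
```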
python train.py
Downloads the base model (first run only), applies LoRA, and trains for 3 epochs.
Prints training loss and evaluation loss after each epoch.
Saves the adapter to models/lora_adapter/.
Expected time: 5–15 minutes on CPU, 2–5 minutes on Apple Silicon (MPS).
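For reference, a LoRA setup with the PEFT library typically looks like the config sketch below. The hyperparameters (rank, alpha, target modules) are illustrative assumptions, not values read from `train.py`:

```python
# Hedged sketch of a PEFT LoRA configuration; hyperparameters are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")
config = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,                         # scaling numerator (effective scale = alpha / r)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports trainable vs total parameter counts
```

`get_peft_model` freezes the base weights and injects the A/B pairs; from then on the model can be handed to a standard HuggingFace `Trainer`.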
python evaluate.py
Computes perplexity on the held-out test examples and prints sample generations next to the expected answers. A well fine-tuned model will have lower perplexity than the base model and noticeably more focused answers.
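Perplexity itself is just the exponential of the average per-token cross-entropy loss. A toy computation with hypothetical per-token losses (not real model outputs):

```python
import math

# Perplexity = exp(mean per-token cross-entropy loss).
# The per-token losses below are made-up numbers for illustration.
token_losses = [2.1, 1.8, 2.4, 1.9]   # negative log-likelihood per token
perplexity = math.exp(sum(token_losses) / len(token_losses))
print(round(perplexity, 2))           # 7.77
```

A lower mean loss shrinks the exponent, so a drop from the base model's 20-50 range to 4-10 reflects a large improvement in how well the model predicts this dataset's tokens.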
python generate.py # interactive chat
python generate.py -q "What is dropout?" # single question, then exit
python generate.py --compare # base model vs fine-tuned, side-by-side
The --compare flag is the most visually impressive demo — it loads both models
and prints their answers to the same questions so the difference is obvious.
LLM-finetune/
data.py # Q&A pairs, chat formatting, HuggingFace Dataset builder
train.py # LoRA config, training loop, saves adapter
evaluate.py # perplexity computation, sample generation
generate.py # interactive CLI, --compare mode
requirements.txt
data/ # qa_dataset.json (generated by data.py or train.py)
models/
lora_adapter/ # saved after running train.py
adapter_config.json
adapter_model.safetensors
tokenizer files...
| Metric | Typical value |
|---|---|
| Trainable parameters | ~786K out of 361M (≈ 0.22%) |
| Training time (CPU) | 5–15 minutes |
| Training time (MPS) | 2–5 minutes |
| Base model perplexity | 20–50 on this dataset |
| Fine-tuned perplexity | 4–10 on this dataset |
| Adapter size on disk | ~6 MB |
A perplexity drop of ~4× shows the model genuinely adapted to the dataset style, not just memorized. The base model still knows all its general knowledge — LoRA only shifts its behaviour for our specific task.
- Tokenization: splitting text into integer IDs the model can process.
- Causal language modelling: predicting the next token at every position.
- LoRA: low-rank matrix injection for parameter-efficient fine-tuning.
- PEFT: the HuggingFace library that implements LoRA (and others like QLoRA, IA3, prefix tuning).
- Chat template: the formatted string structure (system / user / assistant turns) that instruction-tuned models expect.
- Perplexity: exp(average cross-entropy loss) — the standard language model evaluation metric.
- DataCollator: batches sequences and pads them to the same length per batch.
- Trainer: HuggingFace's training loop — handles gradient accumulation, checkpointing, logging, and early stopping.
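To make the DataCollator concept concrete, here is a stripped-down sketch of what per-batch padding does (HuggingFace's `DataCollatorForLanguageModeling` also builds labels and handles tensors; the pad ID of 0 is an assumption):

```python
# Minimal illustration of dynamic per-batch padding, as a data collator does.
def collate(batch, pad_id=0):
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = collate([[5, 7, 2], [9, 4]])
print(batch["input_ids"])  # [[5, 7, 2], [9, 4, 0]]
```

Padding to the longest sequence *in the batch* (rather than a global maximum) keeps batches small, which matters when training on a laptop.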
- How to adapt a pre-trained language model to a specific domain without touching 99%+ of its parameters.
- Why LoRA works mathematically: weight updates in fine-tuning are empirically low-rank, so a low-rank approximation captures most of the signal.
- How chat templates structure prompts, and why getting this right is critical for instruction-tuned models.
- How perplexity quantifies improvement: it directly measures how well the model has learned the probability distribution of our dataset.
- The practical pain points of running LLMs locally: memory, device selection (CPU / MPS / CUDA), and the ~700MB model download.
Eric Ristol — 1st year Bachelor in Artificial Intelligence, UAB.