# Fine-tuning an LLM with LoRA (Low-Rank Adaptation)

This notebook demonstrates how to fine-tune a pre-trained Large Language Model (LLM) on a specific dataset using **LoRA (Low-Rank Adaptation)**, a popular Parameter-Efficient Fine-Tuning (PEFT) technique.

## Objectives
1. Load a pre-trained, open-source LLM.
2. Load a specialized dataset.
3. Apply LoRA to the model to decrease the number of trainable parameters.
4. Fine-tune the model using the Hugging Face `Trainer`.
5. Run inference to test the fine-tuned model.

## What is LoRA?
LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture. This greatly reduces the number of trainable parameters for downstream tasks.
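
To make the mechanism concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear map, using the usual formulation `h = W x + (alpha / r) * B (A x)`. The matrix sizes, the values, and the helper names `matvec` and `lora_forward` are illustrative assumptions, not part of the PEFT API. Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) without forming the product B A."""
    base = matvec(W, x)                  # frozen pre-trained path
    low_rank = matvec(B, matvec(A, x))   # trainable path through the rank-r bottleneck
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


# Toy sizes (hypothetical): 3 outputs, 4 inputs, rank 2.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
A = [[0.5, 0.0, 0.0, 0.0],               # shape (r, in) = (2, 4)
     [0.0, 0.5, 0.0, 0.0]]
B_init = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]      # shape (out, r), zero-initialized
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # made-up values standing in for training

x = [2.0, 4.0, 1.0, 1.0]
out_initial = lora_forward(W, A, B_init, x, alpha=4, r=2)
out_trained = lora_forward(W, A, B_trained, x, alpha=4, r=2)
print("before training:", out_initial)
print("after training: ", out_trained)
```

Only `A` and `B` would receive gradients during fine-tuning; `W` stays untouched.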

In [None]:
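# Install necessary libraries
# transformers provides the model and Trainer, peft provides LoRA, datasets loads the data,
# and accelerate handles device placement for Trainer.
# bitsandbytes is only needed for 8-bit/4-bit loading, which this notebook does not use.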
%pip install -q transformers peft datasets accelerate bitsandbytes

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, TaskType, PeftModel

# Set device to CUDA if available, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## 1. Load Pre-trained Model and Tokenizer

We use `facebook/opt-350m` as the base model (any similarly sized GPT-style model would also work). It is small enough to run on most consumer hardware for demonstration purposes, yet it behaves like a standard causal LLM.

In [None]:
model_name = "facebook/opt-350m"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load Model
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(device)

print("Base model loaded.")

## 2. Load and Prepare Dataset

We will use the `Abirate/english_quotes` dataset. It contains quotes and tags. We will fine-tune the model to familiarize it with this style of text.
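
The next cell pads every quote to a fixed length. For causal language modeling, the labels are a copy of the input IDs, and padded positions are set to `-100` so that the loss ignores them; `DataCollatorForLanguageModeling(mlm=False)`, used later in this notebook, does this automatically. The sketch below mimics that behavior with plain lists. The token IDs, `PAD_ID`, right-side padding, and the helper `pad_and_label` are assumptions made for illustration.

```python
PAD_ID = 1            # hypothetical pad token id
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips


def pad_and_label(token_ids, max_length):
    """Truncate or right-pad to max_length, then build the mask and labels."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    input_ids = ids + [PAD_ID] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    # Labels mirror the inputs, except that padding is excluded from the loss.
    labels = [tok if tok != PAD_ID else IGNORE_INDEX for tok in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


short = pad_and_label([2, 100, 200, 300], max_length=6)
long = pad_and_label(list(range(10, 20)), max_length=6)
print(short)
print(long)
```

Short quotes therefore contribute loss only on their real tokens, while long quotes are cut off at `max_length`.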

In [None]:
# Load dataset
dataset = load_dataset("Abirate/english_quotes")

# Inspect a sample
print("Sample data:", dataset['train'][0])

# Preprocessing function to tokenize the data
# With batched=True, examples['quote'] is a list of strings, tokenized in one call
def tokenize_function(examples):
    return tokenizer(examples['quote'], padding="max_length", truncation=True, max_length=64)

# Tokenize dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Select a subset for faster training demonstration (optional)
tokenized_datasets['train'] = tokenized_datasets['train'].shuffle(seed=42).select(range(500))

# Remove columns that are not needed for training
tokenized_datasets = tokenized_datasets.remove_columns(["quote", "author", "tags"])

## 3. Apply PEFT (LoRA)

Here we define the LoRA configuration.
- `r`: Rank of the update matrices. Lower means fewer parameters.
- `lora_alpha`: Scaling numerator. The low-rank update is multiplied by `lora_alpha / r`.
- `lora_dropout`: Dropout probability for LoRA layers.
- `bias`: Bias setting ('none', 'all', or 'lora_only'). It is not set below, so the default 'none' applies.
- `task_type`: Task type (CAUSAL_LM for text generation).
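
As a rough illustration of how `r` and `lora_alpha` interact, the sketch below counts the LoRA parameters for one square weight matrix and computes the `lora_alpha / r` scaling. The hidden size of 1024 is an assumption about `facebook/opt-350m` (check `model.config.hidden_size` for the real value), and `lora_stats` is a hypothetical helper.

```python
def lora_stats(d_in, d_out, r, lora_alpha):
    """Return (full_params, lora_params, scaling) for one weight matrix."""
    full_params = d_in * d_out         # size of the frozen matrix W
    lora_params = r * (d_in + d_out)   # A is (r, d_in), B is (d_out, r)
    scaling = lora_alpha / r           # multiplier applied to the update B @ A
    return full_params, lora_params, scaling


hidden = 1024  # assumed hidden size
for r in (4, 8, 16):
    full, lora, scaling = lora_stats(hidden, hidden, r, lora_alpha=32)
    share = 100 * lora / full
    print(f"r={r:>2}: {lora:>6} LoRA params ({share:.2f}% of {full}), scaling={scaling}")
```

Doubling `r` doubles the trainable parameters and, with `lora_alpha` held fixed, halves the scaling factor.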

In [None]:
# Define LoRA Config
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1
)

# Apply LoRA to the model
model = get_peft_model(model, peft_config)

# Print trainable parameters to see the efficiency
# You will see that we are only training a very small percentage of the total params
model.print_trainable_parameters()

## 4. Training

We use the Hugging Face `Trainer` to fine-tune the model. It handles the training loop for us, including batching, optimization, logging, and checkpointing.
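
To get a feel for how long the run is, the number of optimizer steps follows from the dataset size and the batch size. This sketch assumes a single device, no gradient accumulation, and the 500-example subset selected earlier; `steps_per_epoch` is a hypothetical helper, not a `Trainer` method.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps in one epoch, counting the final partial batch."""
    return math.ceil(num_examples / batch_size)


num_examples = 500   # size of the training subset
batch_size = 8       # per_device_train_batch_size
num_epochs = 1       # num_train_epochs
logging_steps = 10

total_steps = steps_per_epoch(num_examples, batch_size) * num_epochs
log_events = total_steps // logging_steps
print(f"{total_steps} optimizer steps, about {log_events} loss log entries")
```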

In [None]:
# Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=1,  # Kept low for demonstration purposes. Increase for better results.
    learning_rate=2e-4,
    weight_decay=0.01,
    logging_steps=10,
    save_strategy="epoch", # Save model at the end of epoch
    report_to="none"  # Disable wandb/mlflow for this demo
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Start Training
trainer.train()

## 5. Save the Fine-Tuned Model

Calling `save_pretrained` on a PEFT model writes only the adapter weights, not the base model. This is a key advantage of PEFT: you share a small adapter file (megabytes) instead of the full model (gigabytes).
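
A back-of-the-envelope estimate shows where the size gap comes from. The figures below are assumptions rather than measured values: 24 decoder layers with hidden size 1024, LoRA applied to two square projections per layer (`q_proj` and `v_proj` are understood to be PEFT's default targets for OPT), roughly 331 million base parameters, and 4 bytes per float32 value. `adapter_param_count` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4


def adapter_param_count(num_layers, hidden, r, modules_per_layer):
    """Total LoRA parameters when each targeted module is a hidden x hidden matrix."""
    per_module = r * (hidden + hidden)   # A is (r, hidden), B is (hidden, r)
    return num_layers * modules_per_layer * per_module


adapter_params = adapter_param_count(num_layers=24, hidden=1024, r=8, modules_per_layer=2)
base_params = 331_000_000  # approximate and assumed; print_trainable_parameters() reports the real total

adapter_mb = adapter_params * BYTES_PER_FLOAT32 / 1e6
base_gb = base_params * BYTES_PER_FLOAT32 / 1e9
print(f"adapter: {adapter_params:,} params, about {adapter_mb:.1f} MB")
print(f"base:    {base_params:,} params, about {base_gb:.2f} GB")
```

The real file sizes depend on the saved dtype and on serialization overhead, so treat these numbers as orders of magnitude.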

In [None]:
model_save_path = "./lora_fine_tuned_model"
model.save_pretrained(model_save_path)
print(f"LoRA adapters saved to {model_save_path}")

## 6. Inference

Let's test the model. We need to load the base model and then attach the trained LoRA adapters.
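
The cell below also merges the adapters into the base weights. Conceptually, merging folds the low-rank update into the frozen matrix, `W_merged = W + (alpha / r) * B A`, so that inference needs only one matrix multiply per adapted layer. A toy pure-Python check of that identity follows; the matrices and the helpers `matmul` and `merge` are illustrative and are not PEFT functions.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge(W, A, B, alpha, r):
    """Fold the LoRA update into W: W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight, shape (out, in) = (2, 2)
A = [[1.0, 0.0]]               # shape (r, in) = (1, 2)
B = [[0.5], [0.25]]            # shape (out, r) = (2, 1)
x = [[2.0], [1.0]]             # input as a column vector
alpha, r = 2, 1

# Adapter path: base output plus the scaled low-rank output.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[b[0] + (alpha / r) * l[0]] for b, l in zip(base_out, lora_out)]

# Merged path: a single matrix multiply with the folded weight.
W_merged = merge(W, A, B, alpha, r)
merged_out = matmul(W_merged, x)
print(W_merged)
print(adapter_out, merged_out)
```

In floating point on a real model, the two paths agree only up to rounding error, especially in float16.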

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# --------------------------------------------------
# Configuration
# --------------------------------------------------
# `model_name` and `model_save_path` are defined in earlier cells.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --------------------------------------------------
# Load tokenizer (must match base model)
# --------------------------------------------------
tokenizer = AutoTokenizer.from_pretrained(model_name)


if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# --------------------------------------------------
# Load base model (clean state)
# --------------------------------------------------
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device.type == "cuda" else torch.float32,
    device_map=None
)

base_model.to(device)

# --------------------------------------------------
# Load LoRA adapters
# --------------------------------------------------
inference_model = PeftModel.from_pretrained(
    base_model,
    model_save_path
)

# --------------------------------------------------
# IMPORTANT: Set evaluation mode
# --------------------------------------------------
inference_model.eval()

# --------------------------------------------------
# (Optional) Merge LoRA weights into the base model
# --------------------------------------------------
# Merging folds the adapter into the base weights, which removes the extra
# adapter computation at inference time. It does not change what the model learned.
merged_model = inference_model.merge_and_unload()
merged_model.eval()

# --------------------------------------------------
# Text generation function
# --------------------------------------------------
@torch.no_grad()
def generate_text(prompt, model, max_new_tokens=12):
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=False
    ).to(device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,        # greedy decoding for repeatable evaluation; temperature only applies when sampling
        repetition_penalty=1.2,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# --------------------------------------------------
# Test inference
# --------------------------------------------------
test_prompt = "Complete the following sentence in a neutral academic tone:\nLife is"

print("---- Generated Text (Fine-tuned Model) ----")
print(generate_text(test_prompt, merged_model))
