### Hands-On Tutorial: Fine-Tuning an LLM on Financial Question Answering

In this notebook, we’ll walk through a **complete, end-to-end example** of fine-tuning an open-source LLM on a small set of **financial questions and answers**.

We will:
- **Install and import libraries** for transformer models, datasets, LoRA, and 4-bit quantization (QLoRA).
- **Load a 7B LLM (e.g., Mistral 7B) in 4-bit mode** using `bitsandbytes`.
- **Load the OpenFinAL `Financial_Question_Answering` dataset** of question–answer pairs ([dataset page](https://huggingface.co/datasets/OpenFinAL/Financial_Question_Answering)).
- **Tokenize and format the data** for causal language modeling with a simple QA prompt format.
- **Configure and attach LoRA adapters** to the model for parameter-efficient fine-tuning.
- **Fine-tune the model** using the Hugging Face `Trainer` API.
- **Evaluate before vs. after fine-tuning** on some sample financial questions.

This setup is designed for a **single GPU environment (e.g., Google Colab T4, 16 GB VRAM)** using QLoRA so the 7B model fits comfortably into memory.


### 1. Environment Setup

In this section we:
- **Install required libraries** (if running in an environment like Google Colab).
- **Import key classes and functions** from Hugging Face and PEFT.
- **Detect the available device** (GPU vs CPU).

We rely on the following Python packages:
- **`transformers`** and **`accelerate`** for model loading and training utilities.
- **`datasets`** for loading and splitting the financial QA dataset.
- **`peft`** for LoRA configuration and adapters.
- **`bitsandbytes`** for 4-bit (QLoRA-style) quantization.
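
Because the install cell upgrades to the latest releases, it can help to record which versions were actually picked up. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages this notebook relies on (taken from the list above).
REQUIRED = ["transformers", "accelerate", "datasets", "peft", "bitsandbytes"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:>14}: {ver or 'NOT INSTALLED'}")
```

Keeping this output with the notebook makes it easier to reproduce a run later, since argument names in these libraries occasionally change between releases.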


In [1]:
# If running in an environment like Google Colab, install dependencies.
# Comment this out if you already have these packages installed.
%pip install -qU transformers accelerate datasets peft bitsandbytes


In [2]:
import torch

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model, PeftModel
from datasets import Dataset, load_dataset

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

if device != "cuda":
    print("WARNING: Running without a GPU may be very slow or may not fit a 7B model in memory.")


### 2. Load the Pretrained Model in 4-bit Mode (QLoRA)

We now load a **7B-parameter base model** in **4-bit quantized mode** using `bitsandbytes` and the `BitsAndBytesConfig` utility.

Key ideas:
- **`load_in_4bit=True`** stores model weights in 4-bit precision to save memory.
- **`bnb_4bit_quant_type="nf4"`** uses NormalFloat4, a 4-bit datatype shown to preserve accuracy well in QLoRA.
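- **`bnb_4bit_compute_dtype=torch.bfloat16`** sets the dtype the 4-bit weights are dequantized to for matrix multiplications. bfloat16 needs native hardware support (Ampere or newer); on older GPUs such as the T4, `torch.float16` is the usual choice.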
- **`device_map="auto"`** lets Transformers place layers across available devices (typically your GPU).

We also:
- Load the **tokenizer** and set its **padding token** to `eos_token` if not already set.
- Run a **small padding test** to confirm tokenization and attention masks look as expected.
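
To see why 4-bit loading matters on a 16 GB GPU, here is a rough back-of-the-envelope estimate of weight memory. It is only a sketch: the parameter count (about 7.24 billion) is an assumption, the helper `quantized_weight_gb` is hypothetical, and the estimate treats every parameter as quantized, whereas in practice embeddings and normalization layers stay in higher precision. The overhead terms follow the QLoRA description of block-wise quantization with a block size of 64.

```python
def quantized_weight_gb(n_params, bits_per_weight, block_size=64, double_quant=True):
    """Approximate weight memory in GB (1 GB = 1e9 bytes) for block-wise quantization."""
    if double_quant:
        # 8-bit constant per block, plus one fp32 constant per 256 blocks.
        overhead_bits = 8 / block_size + 32 / (block_size * 256)
    else:
        # One fp32 quantization constant per block.
        overhead_bits = 32 / block_size
    total_bits = n_params * (bits_per_weight + overhead_bits)
    return total_bits / 8 / 1e9


N_PARAMS = 7.24e9  # assumed parameter count for a "7B" model

fp16_gb = N_PARAMS * 16 / 8 / 1e9
nf4_gb = quantized_weight_gb(N_PARAMS, 4, double_quant=False)
nf4_dq_gb = quantized_weight_gb(N_PARAMS, 4, double_quant=True)

print(f"fp16 weights:              {fp16_gb:.2f} GB")
print(f"4-bit weights:             {nf4_gb:.2f} GB")
print(f"4-bit + double quant:      {nf4_dq_gb:.2f} GB")
```

Activations, LoRA gradients, and optimizer state come on top of these numbers, so the real footprint during training is higher.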


In [3]:
model_name = "mistralai/Mistral-7B-v0.1"  # 7B parameter base model

In [4]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Set padding token if not already set
if tokenizer.pad_token_id is None:
    print("Setting pad token to eos token")
    tokenizer.pad_token_id = tokenizer.eos_token_id

In [5]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants to save memory
    bnb_4bit_quant_type="nf4",            # NormalFloat4, recommended 4-bit datatype
    bnb_4bit_compute_dtype=torch.bfloat16  # use bfloat16 for computations
)

# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # let HF allocate layers across GPU (and CPU if needed)
)

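# Note: no model.to(device) here. device_map="auto" has already placed the
# weights, and bitsandbytes-quantized models may raise an error on .to().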

print("Model loaded.")


In [None]:
# Quick tokenizer and padding test
texts = ["Hello world!", "This is a slightly longer sentence."]
enc = tokenizer(texts, padding=True, truncation=True, max_length=10, return_tensors="pt")
print("Token IDs:\n", enc["input_ids"])
print("Attention masks:\n", enc["attention_mask"])
print("Pad token id:", tokenizer.pad_token_id)

### 3. Load and Tokenize the Fine-Tuning Dataset

We now use a **simple financial QA dataset from Hugging Face**:

- Dataset: **`OpenFinAL/Financial_Question_Answering`**  
- Fields: each row has a `Question` and an `Answer` ([dataset page](https://huggingface.co/datasets/OpenFinAL/Financial_Question_Answering)).

For this tutorial:
- We load the `train` split of the dataset.
- We optionally take a **subset** of examples to keep training fast.
- We treat the **question as input** and the **answer as the target text**.
- For each example we build a simple prompt:
  - `Question: <question>\nAnswer:`
- We then tokenize **`prompt + answer`** into a single sequence and use the same token IDs as **labels**, so the model learns to continue the prompt with the answer.

This is standard causal language model fine-tuning for QA: the model sees the question and learns to generate the answer that follows.
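
The toy sketch below illustrates what "learns to continue the prompt" means in terms of labels. It uses words instead of real token IDs, and both helper functions are hypothetical. Hugging Face causal LM models shift the labels internally, which is why passing a copy of `input_ids` as `labels` works. The second helper shows an optional variant, not used in this notebook, that ignores the prompt positions so only answer tokens contribute to the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy ignores by default


def next_token_pairs(token_ids):
    """Pair each position with the token it is trained to predict."""
    return list(zip(token_ids[:-1], token_ids[1:]))


def mask_prompt_labels(token_ids, prompt_len):
    """Copy token_ids into labels, ignoring the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + list(token_ids[prompt_len:])


# Toy "tokens" (words rather than real token IDs) to keep the example readable.
prompt = ["Question:", "What", "is", "a", "bond?", "Answer:"]
answer = ["A", "debt", "security."]
tokens = prompt + answer

for context_token, target in next_token_pairs(tokens):
    print(f"after {context_token!r:>12} predict {target!r}")

print(mask_prompt_labels(tokens, len(prompt)))
```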


In [None]:
# Load the financial QA dataset from Hugging Face
raw_qa = load_dataset("OpenFinAL/Financial_Question_Answering", split="train")

print("Raw financial QA split size:", len(raw_qa))
print("Example entry:", raw_qa[0])

In [None]:
# The raw dataset has columns 'Question' and 'Answer'.
# Rename them to lowercase for convenience.
dataset = raw_qa.rename_columns({"Question": "question", "Answer": "answer"})

print("Total examples in dataset:", len(dataset))
print("Example 1:", {k: dataset[0][k] for k in ["question", "answer"]})

# Split into train and eval splits for formal evaluation
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = splits["train"]
eval_dataset = splits["test"]

print("Train size:", len(train_dataset))
print("Eval size:", len(eval_dataset))

# We'll reuse an eval example later for qualitative before/after comparison
eval_example = eval_dataset[0]


In [None]:
MAX_LEN = 256


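def preprocess_function(example):
    """Format a simple QA prompt and tokenize it for causal LM fine-tuning.

    We use the pattern: "Question: <question>\nAnswer: <answer>" followed by
    the EOS token, and train the model to predict every non-padding token
    (standard next-token prediction objective).
    """
    question = example["question"]
    answer = example["answer"]

    prompt = f"Question: {question}\nAnswer:"
    # Append EOS so the model can learn where an answer ends.
    full_text = prompt + " " + answer + tokenizer.eos_token

    enc = tokenizer(
        full_text,
        truncation=True,
        padding="max_length",
        max_length=MAX_LEN,
    )

    # The pad token shares the EOS id, so mask padding via attention_mask
    # rather than by token id. Label -100 is ignored by the loss.
    enc["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(enc["input_ids"], enc["attention_mask"])
    ]
    return enc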


# Apply preprocessing to train and eval splits separately
tokenized_train = train_dataset.map(preprocess_function, remove_columns=train_dataset.column_names)
tokenized_eval = eval_dataset.map(preprocess_function, remove_columns=eval_dataset.column_names)

# Set the dataset format for PyTorch
tokenized_train.set_format("torch")
tokenized_eval.set_format("torch")

print("Tokenized train fields:", tokenized_train.column_names)
print("Tokenized eval fields:", tokenized_eval.column_names)
print("Example tokenized train input_ids (first 20 tokens):", tokenized_train[0]["input_ids"][:20])


### 4. Configure LoRA and Attach to the Model

We will now configure **LoRA (Low-Rank Adaptation)** to fine-tune only a **small subset of parameters**:

- **`r` (rank)**: size of the low-rank update matrices (we use `r=16`).
- **`lora_alpha`**: scaling factor for the LoRA updates (we use `32`, i.e., 2× rank).
- **`lora_dropout`**: small dropout for regularization.
- **`target_modules`**: which sub-modules to apply LoRA to; for Mistral/LLaMA-style models we often use `"q_proj"` and `"v_proj"`.

Using `get_peft_model` wraps our base model with these adapters. We’ll:
- Print the **number of trainable parameters** vs total.
- Do a **baseline generation** on one question before any training to see the model’s pre-fine-tuning behavior.
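
As a sanity check on the trainable-parameter count, the LoRA size can be worked out by hand. The sketch below assumes the Mistral-7B-v0.1 shapes (hidden size 4096, 32 layers, and 8 key/value heads of dimension 128, so `v_proj` maps 4096 to 1024); these dimensions are assumptions taken from the model configuration rather than from this notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters LoRA adds to one linear layer: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


# Assumed Mistral-7B-v0.1 dimensions.
HIDDEN = 4096
N_LAYERS = 32
KV_DIM = 8 * 128  # grouped-query attention: 8 KV heads of size 128

RANK, ALPHA = 16, 32

per_layer = (
    lora_param_count(HIDDEN, HIDDEN, RANK)    # q_proj: 4096 -> 4096
    + lora_param_count(HIDDEN, KV_DIM, RANK)  # v_proj: 4096 -> 1024
)
total = per_layer * N_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Update scaling (alpha / r): {ALPHA / RANK}")
```

Under these assumptions the adapters add roughly 6.8 million parameters, a small fraction of a percent of the base model, which should be in the same range as what `print_trainable_parameters()` reports.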


In [None]:
# Inspect the architecture to find the module names usable in `target_modules`
model

In [None]:
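# Prepare the quantized model for training. This is the usual PEFT step before
# attaching adapters to a 4-bit model, and it enables the input gradients that
# gradient checkpointing needs.
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",  # do not update bias terms
    task_type="CAUSAL_LM",
)

# Wrap the model with LoRA adapters
model = get_peft_model(model, lora_config)

# Print trainable vs total parameters
model.print_trainable_parameters()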

In [None]:
# Baseline generation before fine-tuning using one financial QA example
eval_question = eval_example["question"]

prompt = f"Question: {eval_question}\nAnswer:"

model.eval()

inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

base_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model output before fine-tuning on this example:")
print(base_answer)


### 5. Fine-Tune the Model with `Trainer`

We now fine-tune the model using the **Hugging Face `Trainer` API**.

Training configuration highlights:
- **`per_device_train_batch_size=1`** with **`gradient_accumulation_steps=16`**: the smallest per-device batch to fit a 7B model in memory, accumulated to an effective batch size of 16.
- **`num_train_epochs=2`**: two passes over the train split for this demo.
- **`learning_rate=1e-4`** with `warmup_ratio=0.1` and `weight_decay=0.01`: a conservative LoRA learning rate with warmup and light regularization.
- **`fp16=True`** and **`optim="paged_adamw_8bit"`**: use mixed precision and 8-bit Adam for efficiency.
- **`gradient_checkpointing=True`**: trade extra compute for lower activation memory.
- **Step-based evaluation** (`eval_steps=25`) on `eval_dataset`: compute the eval loss every 25 optimizer steps.

The `Trainer` handles the training loop and evaluation, moving data to the correct device and updating **only the LoRA parameters** (base model weights remain frozen).
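
To get a feel for how these settings translate into optimizer steps, the following sketch computes the effective batch size, the approximate number of update steps, and the warmup length. The training-set size of 900 is a made-up placeholder (check `len(train_dataset)` for the real value), `training_schedule` is a hypothetical helper, and the `Trainer` may round the per-epoch step count slightly differently depending on the library version.

```python
import math


def training_schedule(n_examples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, n_devices=1):
    """Approximate the optimizer-step schedule implied by the batch settings."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values from the TrainingArguments cell in this section; 900 is a placeholder.
schedule = training_schedule(
    n_examples=900,
    per_device_batch=1,
    grad_accum=16,
    epochs=2,
    warmup_ratio=0.1,
)
print(schedule)
```

With evaluation every 25 steps, a run of this size would trigger only a handful of evaluation passes, which keeps overhead low on a single GPU.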


In [None]:
training_args = TrainingArguments(
    output_dir="outputs",
    overwrite_output_dir=True,
    per_device_train_batch_size=1,   # smaller per-device batch to fit 7B on Colab
    per_device_eval_batch_size=1,    # smaller eval batch to avoid OOM during eval
    gradient_accumulation_steps=16,  # keep a decent effective batch size
    num_train_epochs=2,              # short run to keep runtime manageable
    learning_rate=1e-4,              # stable LR for LoRA/QLoRA on 7B
    weight_decay=0.01,               # light regularization to help generalization
    warmup_ratio=0.1,                # warmup to avoid early instability
    fp16=True,
    gradient_checkpointing=True,     # save memory at the cost of some compute
    logging_steps=1,
    logging_first_step=True,
    eval_strategy="steps",           # run evaluation every N steps (older transformers: evaluation_strategy)
    eval_steps=25,                   # evaluate less frequently to reduce overhead
    optim="paged_adamw_8bit",      # 8-bit Adam optimizer (bitsandbytes)
)

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    args=training_args,
)

print("Starting training...")
trainer.train()
print("Training done.")

# Run a final evaluation pass to see eval loss
metrics = trainer.evaluate()
print("Eval metrics:", metrics)


### 6. Evaluate: Compare Before vs After Fine-Tuning

We now evaluate the fine-tuned model on our **financial QA examples**.

Steps:
- Ask the **same question** we used before training.
- Compare the **base model answer (before fine-tuning)** to the **fine-tuned model answer (after fine-tuning)**.
- Query a couple of other questions from the dataset to see how well the model reproduces the ground-truth answers.

We expect that after fine-tuning, the model will respond with answers that are **much closer to the dataset answers** for these examples.
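
Eyeballing generations is useful, but a rough number helps too. The sketch below defines a simple token-overlap F1 between a predicted and a gold answer, plus a conversion from eval loss to perplexity. Both helpers are hypothetical and deliberately simple: the F1 is a quick stand-in, not an official metric for this dataset, and the perplexity is only as meaningful as the loss it is computed from.

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def perplexity(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(eval_loss)


print(token_f1("A bond is a debt security", "a bond is a debt security."))
print(token_f1("the bond", "a bond"))
```

In the notebook, `token_f1(pred_answer, gold)` could be averaged over the eval split, and `perplexity(metrics["eval_loss"])` applied to the output of `trainer.evaluate()`.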


In [None]:
# Put model in eval mode and generate after fine-tuning
model.eval()

# Reuse the evaluation example from earlier
prompt = f"Question: {eval_question}\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

ft_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("---")
print("Question:", eval_question)
print("Base model (before fine-tuning) answer:\n", base_answer)
print("Fine-tuned model answer:\n", ft_answer)

# Test a couple of other held-out questions (from the eval split, so the
# model has not seen them during training)
test_indices = [1, 2]

for idx in test_indices:
    ex = eval_dataset[idx]
    q = ex["question"]
    gold = ex["answer"]

    prompt = f"Question: {q}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    pred = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract the part after "Answer:" for readability
    pred_answer = pred.split("Answer:")[-1].strip()

    print(f"\nQ: {q}\nGold answer: {gold}\nModel answer: {pred_answer}")


### Discussion and Next Steps

In this hands-on tutorial, we:
- Loaded a **7B open-source LLM** (Mistral 7B) in **4-bit (QLoRA) mode**.
- Loaded the **OpenFinAL Financial_Question_Answering** dataset of financial questions and answers.
- Formatted each example as a simple QA prompt (`Question: ...\nAnswer: ...`).
- Attached **LoRA adapters** to just a few attention submodules (`q_proj`, `v_proj`).
- Fine-tuned **only the LoRA parameters** using the `Trainer` API.
- Compared the **base and fine-tuned model outputs** against the dataset answers for sample financial questions.

Possible extensions:
- Compute generation metrics like **BLEU or ROUGE** on the held-out eval split.
- Save and reload adapters with `model.save_pretrained(...)` and `PeftModel.from_pretrained(...)`.
- Try **different prompt formats** (e.g., "User:" / "Assistant:" style chat prompts).
- Experiment with **different LoRA ranks, learning rates, and training steps** to balance speed vs generalization.
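
As a starting point for the prompt-format experiment, the sketch below puts the QA format used in this notebook next to a plain "User:" / "Assistant:" variant. The function names and the optional `system` argument are hypothetical, and the chat layout is just one plausible plain-text format, not the official chat template of any model.

```python
def qa_prompt(question):
    """The format used throughout this notebook."""
    return f"Question: {question}\nAnswer:"


def chat_prompt(question, system=None):
    """A plain-text chat-style alternative (hypothetical layout)."""
    lines = []
    if system:
        lines.append(f"System: {system}")
    lines.append(f"User: {question}")
    lines.append("Assistant:")
    return "\n".join(lines)


PROMPT_BUILDERS = {"qa": qa_prompt, "chat": chat_prompt}

question = "What is a dividend?"
for name, build in PROMPT_BUILDERS.items():
    print(f"--- {name} ---")
    print(build(question))
```

Whichever format is chosen, the same builder should be used for both training and generation, since the model learns to continue the exact prompt pattern it was trained on.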
