# Module 2: QLoRA Fine-Tuning of Llama 3.1 8B with Unsloth

**Goal:** Fine-tune Llama 3.1 8B Instruct to become FinAgent — a financial reasoning engine that:
- Uses `<think>` blocks for step-by-step analysis
- Calls financial tools with proper JSON format
- Refuses dangerous financial requests

**Method:** QLoRA (4-bit quantization + LoRA adapters)

**Training data:** 608 examples (CoT + Tool Trajectories + Guardrails)

**Runtime:** ~20 minutes on a T4 GPU (free Colab)

## Step 0: Setup Environment

Unsloth provides optimized kernels that, according to the project, make QLoRA training about 2x faster.
We install it with all dependencies in one command.

In [None]:
%%capture
# Install Unsloth (optimized QLoRA training)
# This installs: unsloth, transformers, peft, trl, bitsandbytes, accelerate
!pip install unsloth
# Wandb for experiment tracking
!pip install wandb

In [None]:
# Verify GPU is available
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.version.cuda}")

## Step 1: Download Training Data

Clone the repo and copy the training files.

In [None]:
# Clone repo and copy training data
!git clone https://github.com/DanAbergel/finagent-8b.git
!cp finagent-8b/data/processed/train.jsonl .
!cp finagent-8b/data/processed/val.jsonl .

# Verify files
import os
for f in ['train.jsonl', 'val.jsonl']:
    if os.path.exists(f):
        with open(f) as fh:
            lines = sum(1 for _ in fh)
        print(f"  {f}: {lines} examples")
    else:
        print(f"  {f}: NOT FOUND")

In [None]:
# Load the datasets
import json
from datasets import Dataset

def load_jsonl(path):
    """Load JSONL file into a HuggingFace Dataset.
    Store messages as JSON string to avoid Arrow mixed-type errors."""
    examples = []
    with open(path) as f:
        for line in f:
            data = json.loads(line.strip())
            examples.append({
                "messages_json": json.dumps(data["messages"]),
                "metadata": data.get("metadata", {}),
            })
    return Dataset.from_list(examples)

train_dataset = load_jsonl("train.jsonl")
val_dataset = load_jsonl("val.jsonl")

print(f"Train: {len(train_dataset)} examples")
print(f"Val:   {len(val_dataset)} examples")
print(f"Columns: {train_dataset.column_names}")
print(f"\nFirst example preview:")
messages = json.loads(train_dataset[0]["messages_json"])
print(f"  Roles: {[m['role'] for m in messages]}")
print(f"  Num messages: {len(messages)}")
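
The loader stores each conversation as a JSON string rather than as nested Python objects. A likely reason, going by the docstring, is that Arrow needs one consistent schema per column, while chat messages can differ in shape from row to row (for instance, a tool-calling turn may carry fields that a plain text turn lacks). The sketch below uses two made-up messages, whose field names are illustrative assumptions and not taken from the dataset, to show that the round trip through a string is lossless:

```python
import json

# Two hypothetical messages with different shapes (field names are illustrative only)
plain_turn = {"role": "user", "content": "Compare MSFT and AAPL."}
tool_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{"name": "get_stock_quote", "arguments": {"ticker": "MSFT"}}],
}
messages = [plain_turn, tool_turn]

# Serialize once when building the dataset row ...
row = {"messages_json": json.dumps(messages)}

# ... and parse again right before applying the chat template
restored = json.loads(row["messages_json"])

print(type(row["messages_json"]).__name__)  # the column holds a plain string
print(restored == messages)                 # the structure survives the round trip
```

Because every row then holds a single string, the dataset column has a uniform type regardless of how the individual messages are structured.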

## Step 2: Load Llama 3.1 8B in 4-bit (QLoRA)

This is where the magic happens. Unsloth's `FastLanguageModel`:
1. Downloads Llama 3.1 8B Instruct from HuggingFace
2. Quantizes the frozen linear-layer weights to 4-bit NormalFloat (NF4)
3. Loads the model in roughly 5-6 GB of VRAM instead of ~16 GB in 16-bit
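
As a rough sanity check on those memory figures, the weight storage can be estimated from the architecture. The dimensions below are the public Llama 3.1 8B configuration as I understand it. The assumption that the embedding table and output head stay in 16-bit while only the transformer linear layers are quantized reflects common bitsandbytes behavior and may not match every setup. Quantization constants and activations are ignored.

```python
# Llama 3.1 8B dimensions (assumed from the public model config)
VOCAB, HIDDEN, FFN, LAYERS = 128_256, 4096, 14_336, 32
KV_DIM = 1024  # 8 key/value heads x 128 dims (grouped-query attention)

attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * HIDDEN * FFN                              # gate_proj, up_proj, down_proj
linear_params = LAYERS * (attn + mlp)               # the weights that get quantized
norm_params = LAYERS * 2 * HIDDEN + HIDDEN          # RMSNorm weights
embed_params = 2 * VOCAB * HIDDEN                   # input embeddings + output head

total_params = linear_params + norm_params + embed_params

# 2 bytes per weight in 16-bit; 0.5 bytes per quantized weight in 4-bit
fp16_gb = total_params * 2 / 1e9
nf4_gb = (linear_params * 0.5 + (embed_params + norm_params) * 2) / 1e9

print(f"Total parameters: {total_params:,}")
print(f"16-bit weights:   {fp16_gb:.1f} GB")
print(f"4-bit estimate:   {nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit estimate lands between 5 and 6 GB, with the unquantized embedding and output matrices accounting for about 2 GB of it.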

**Why `Instruct` and not `Base`?**
The Instruct version already knows how to follow instructions and have conversations.
We're adding financial reasoning ON TOP of that — not teaching it to chat from scratch.

In [None]:
from unsloth import FastLanguageModel

# ─── MODEL CONFIGURATION ───────────────────────────────────────────
MODEL_NAME = "unsloth/Meta-Llama-3.1-8B-Instruct"  # Unsloth's optimized version
MAX_SEQ_LENGTH = 2048  # Max tokens per training example

# Load model + tokenizer in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=True,          # QLoRA: quantize frozen weights to 4-bit
    dtype=None,                  # Auto-detect (float16 on T4)
)

print(f"Model loaded: {MODEL_NAME}")
print(f"Model dtype: {model.dtype}")
print(f"Tokenizer vocab size: {len(tokenizer)}")

## Step 3: Attach LoRA Adapters

Now we attach small trainable matrices (LoRA adapters) to specific layers.

**Why these target modules?**
- `q_proj, k_proj, v_proj, o_proj` — Attention layers. These control HOW the model attends to different parts of the input. Critical for understanding multi-turn tool conversations.
- `gate_proj, up_proj, down_proj` — MLP (feed-forward) layers. These control WHAT the model generates. Critical for producing valid JSON in tool_calls.

**Why rank=32?**
- rank=8: Works for simple style transfer ("write like Shakespeare")
- rank=16: Works for single-task fine-tuning ("summarize documents")
- rank=32: Needed for multi-behavior learning (reasoning + tool-calling + guardrails)
- rank=64: Diminishing returns for our dataset size (608 examples)
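
To make the rank trade-off concrete, the number of trainable parameters can be computed directly: a LoRA adapter on a linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` weights. The layer shapes below are assumed from the public Llama 3.1 8B config and cover the seven target modules listed above, so treat the totals as an estimate.

```python
# (d_in, d_out) per target module, assumed from the Llama 3.1 8B config
MODULE_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32

def lora_param_count(rank: int) -> int:
    """Trainable parameters when every listed module gets a rank-`rank` adapter."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in MODULE_SHAPES.values())
    return NUM_LAYERS * per_layer

for r in (8, 16, 32, 64):
    print(f"rank={r:<2} -> {lora_param_count(r):>12,} trainable parameters")
```

The count grows linearly with rank, so rank 32 trains about 84 million weights, roughly 1% of the 8B base model.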

In [None]:
# ─── LoRA CONFIGURATION ────────────────────────────────────────────
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                    # LoRA rank — expressiveness of adapters
    lora_alpha=64,           # Scaling factor (rule of thumb: 2 × rank)
    lora_dropout=0.05,       # Light dropout to prevent overfitting on 608 examples
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP
    ],
    bias="none",             # Don't train bias terms (standard for LoRA)
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving checkpointing, slight speed tradeoff
    random_state=42,
)

# Print trainable parameter count
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
print(f"Total parameters:     {total_params:,}")
print(f"\nMemory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

## Step 4: Prepare the Data for Training

The key operation here is applying Llama 3.1's **chat template** to convert our `messages` array into the exact token format the model expects.

The tokenizer's `apply_chat_template` handles this automatically:
```
[{"role": "system", "content": "You are FinAgent..."},
 {"role": "user", "content": "Compare MSFT..."},
 {"role": "assistant", "content": "<think>..."}]

    ↓ apply_chat_template() ↓

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are FinAgent...<|eot_id|><|start_header_id|>user<|end_header_id|>
Compare MSFT...<|eot_id|><|start_header_id|>assistant<|end_header_id|>
<think>...<|eot_id|>
```
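
For intuition, the core of that transformation can be written in a few lines of plain Python. The `render_llama3_chat` function below is a hypothetical, simplified stand-in: the real template shipped with the tokenizer may add extra system text (such as date lines) and handles tool messages, so use `tokenizer.apply_chat_template` for actual training.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Simplified sketch of the Llama 3 chat layout (not the tokenizer's real template)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn marker
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # At inference time, an open assistant header cues the model to answer
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

messages = [
    {"role": "system", "content": "You are FinAgent..."},
    {"role": "user", "content": "Compare MSFT..."},
    {"role": "assistant", "content": "<think>..."},
]
rendered = render_llama3_chat(messages)
print(rendered)
```

The `add_generation_prompt` flag mirrors the tokenizer argument of the same name: it is left off for training text, which already ends with an assistant turn, and turned on for inference.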

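**Loss masking** is not automatic in this setup. With a `formatting_func` that returns plain text, TRL's SFTTrainer computes the loss over every token, including system prompts and user messages. To train on assistant tokens only, the labels for all other tokens have to be masked explicitly, for example with Unsloth's `train_on_responses_only` helper or a completion-only option in TRL, depending on the installed versions.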
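
Masking works at the label level: tokens whose label is set to `-100` are skipped by PyTorch's cross-entropy loss. The toy sketch below uses made-up token IDs and a hypothetical `build_labels` helper to show the idea. It is not how TRL or Unsloth implement it internally.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_labels(segments):
    """Concatenate (role, token_ids) segments and mask every non-assistant token."""
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)                        # contributes to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # ignored by the loss
    return input_ids, labels

# Made-up token IDs for one short conversation
segments = [
    ("system", [1, 2, 3]),
    ("user", [4, 5]),
    ("assistant", [6, 7, 8, 9]),
]
input_ids, labels = build_labels(segments)
print(input_ids)
print(labels)
```

Only the four assistant tokens keep their IDs as labels, so only they would produce gradient signal.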

In [None]:
# Inspect what the chat template produces
import json

sample = json.loads(train_dataset[0]["messages_json"])
formatted = tokenizer.apply_chat_template(sample, tokenize=False)
print("=" * 60)
print("RAW MESSAGES:")
for msg in sample:
    role = msg["role"]
    content = msg.get("content", "")[:80]
    print(f"  [{role}] {content}...")

print("\n" + "=" * 60)
print("AFTER CHAT TEMPLATE (first 500 chars):")
print(formatted[:500])

print("\n" + "=" * 60)
tokens = tokenizer.apply_chat_template(sample, tokenize=True)
print(f"Token count: {len(tokens)}")

In [None]:
# Check token length distribution across the dataset
import json
import matplotlib.pyplot as plt

token_lengths = []
for example in train_dataset:
    messages = json.loads(example["messages_json"])
    tokens = tokenizer.apply_chat_template(messages, tokenize=True)
    token_lengths.append(len(tokens))

plt.figure(figsize=(10, 4))
plt.hist(token_lengths, bins=50, edgecolor='black', alpha=0.7)
plt.axvline(x=MAX_SEQ_LENGTH, color='red', linestyle='--', label=f'MAX_SEQ_LENGTH={MAX_SEQ_LENGTH}')
plt.xlabel('Token count per example')
plt.ylabel('Frequency')
plt.title('Training Example Token Length Distribution')
plt.legend()
plt.tight_layout()
plt.show()

over_limit = sum(1 for l in token_lengths if l > MAX_SEQ_LENGTH)
print(f"\nExamples over {MAX_SEQ_LENGTH} tokens: {over_limit}/{len(token_lengths)} ({100*over_limit/len(token_lengths):.1f}%)")
print(f"Median length: {sorted(token_lengths)[len(token_lengths)//2]} tokens")
print(f"95th percentile: {sorted(token_lengths)[int(0.95*len(token_lengths))]} tokens")
print(f"Max length: {max(token_lengths)} tokens")

## Step 5: Configure Training

We use TRL's `SFTTrainer` (Supervised Fine-Tuning Trainer) which handles:
- Tokenizing the text produced by the chat template
- Loss computation (over all tokens unless assistant-only masking is configured separately)
- Gradient accumulation
- Mixed precision training (float16 on a T4, bf16 on newer GPUs)
- Logging to Weights & Biases

In [None]:
# Optional: Initialize Weights & Biases for experiment tracking
import wandb

# Set to True to enable W&B logging (requires free account at wandb.ai)
USE_WANDB = False

if USE_WANDB:
    wandb.init(
        project="finagent-8b",
        name="qlora-r32-llama31-8b",
        config={
            "model": MODEL_NAME,
            "rank": 32,
            "lora_alpha": 64,
            "epochs": 3,
            "learning_rate": 2e-4,
            "train_examples": len(train_dataset),
            "val_examples": len(val_dataset),
        }
    )
    report_to = "wandb"
else:
    report_to = "none"
    print("W&B disabled. Set USE_WANDB=True to enable experiment tracking.")

In [None]:
from unsloth import is_bfloat16_supported
from trl import SFTTrainer, SFTConfig

# ─── TRAINING CONFIGURATION ────────────────────────────────────────
sft_config = SFTConfig(
    # Output
    output_dir="./finagent-checkpoints",

    # Training schedule
    num_train_epochs=3,                   # 3 passes over 608 examples
    per_device_train_batch_size=4,        # Limited by T4 VRAM
    gradient_accumulation_steps=4,        # Effective batch size = 4 × 4 = 16
    warmup_ratio=0.1,                     # 10% of steps for LR warmup

    # Optimizer
    learning_rate=2e-4,                   # Standard for QLoRA
    optim="adamw_8bit",                   # 8-bit Adam (saves VRAM)
    weight_decay=0.01,                    # Light regularization
    lr_scheduler_type="cosine",           # Cosine decay after warmup
    max_grad_norm=1.0,                    # Gradient clipping for stability

    # Precision
    # Unsloth's helper is used because torch.cuda.is_bf16_supported() can report
    # True in some PyTorch versions on GPUs (such as the T4) that only emulate bf16.
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),         # Use bf16 only if the GPU supports it natively

    # Sequence length
    max_seq_length=MAX_SEQ_LENGTH,

    # Evaluation
    eval_strategy="steps",
    eval_steps=20,                        # Evaluate every 20 steps
    per_device_eval_batch_size=4,

    # Logging
    logging_steps=5,                      # Log loss every 5 steps
    report_to=report_to,

    # Saving
    save_strategy="steps",
    save_steps=40,                        # Checkpoint every 40 steps
    save_total_limit=3,                   # Keep only last 3 checkpoints

    # Data
    dataset_text_field=None,              # Unused: formatting_func (next cell) builds the text
    packing=False,                        # Don't pack multiple examples into one sequence
                                          # (our examples vary too much in structure)
    seed=42,
)

print(f"Effective batch size: {sft_config.per_device_train_batch_size * sft_config.gradient_accumulation_steps}")
print(f"Total training steps: ~{len(train_dataset) * sft_config.num_train_epochs // (sft_config.per_device_train_batch_size * sft_config.gradient_accumulation_steps)}")

In [None]:
import json

# Formatting function: converts our messages JSON string to the chat template format
def formatting_func(example):
    """Convert messages JSON string to Llama 3.1 chat format string."""
    messages = json.loads(example["messages_json"])
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    args=sft_config,
    formatting_func=formatting_func,
)

print("Trainer created successfully.")
print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")

## Step 6: Train!

This is the actual fine-tuning. You'll see:
- **Training loss** decreasing over time (model is learning)
- **Validation loss** — watch for it diverging from training loss (overfitting signal)

Expected:
- ~114 training steps
- ~15-20 minutes on T4
- Training loss: starts ~2.5, ends ~0.8-1.2
- Val loss: should end within 0.2 of training loss
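
The expected step count follows from the batch settings, and the learning rate at any step follows from the warmup and cosine settings. The sketch below reproduces that arithmetic in plain Python. It assumes the last partial batch is kept, that warmup steps are rounded up, and that the cosine curve decays to zero, which matches my understanding of the Hugging Face defaults but is worth checking against your installed version.

```python
import math

NUM_EXAMPLES, EPOCHS = 608, 3
BATCH_SIZE, GRAD_ACCUM = 4, 4
PEAK_LR, WARMUP_RATIO = 2e-4, 0.1

batches_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)      # 152 forward/backward passes
steps_per_epoch = math.ceil(batches_per_epoch / GRAD_ACCUM)   # 38 optimizer updates
total_steps = steps_per_epoch * EPOCHS                        # 114
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)          # 12

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"Optimizer steps: {total_steps} (warmup: {warmup_steps})")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

With evaluation every 20 steps, this schedule yields five validation points (steps 20 through 100) over the run.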

In [None]:
# ─── TRAIN ──────────────────────────────────────────────────────────
trainer_stats = trainer.train()

print("\n" + "=" * 60)
print("TRAINING COMPLETE")
print(f"  Total steps: {trainer_stats.global_step}")
print(f"  Training loss: {trainer_stats.training_loss:.4f}")
print(f"  Training time: {trainer_stats.metrics['train_runtime']:.0f} seconds")
print("=" * 60)

In [None]:
# Plot training and validation loss
import matplotlib.pyplot as plt

train_losses = [log['loss'] for log in trainer.state.log_history if 'loss' in log]
eval_losses = [log['eval_loss'] for log in trainer.state.log_history if 'eval_loss' in log]
train_steps = [log['step'] for log in trainer.state.log_history if 'loss' in log]
eval_steps = [log['step'] for log in trainer.state.log_history if 'eval_loss' in log]

plt.figure(figsize=(10, 5))
plt.plot(train_steps, train_losses, label='Training Loss', alpha=0.8)
plt.plot(eval_steps, eval_losses, label='Validation Loss', marker='o', linewidth=2)
plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('FinAgent-8B Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Overfitting check
if eval_losses:
    gap = eval_losses[-1] - train_losses[-1]
    if gap > 0.5:
        print(f"⚠️  Val-Train gap = {gap:.2f} — possible overfitting. Consider reducing epochs.")
    else:
        print(f"✅ Val-Train gap = {gap:.2f} — healthy generalization.")

## Step 7: Test the Fine-Tuned Model

Let's see if the model actually learned our financial reasoning patterns.
We'll test all three behaviors: CoT reasoning, tool calling, and guardrails.

In [None]:
# Switch model to inference mode
FastLanguageModel.for_inference(model)

def generate_response(user_message, system_prompt=None):
    """Generate a response from the fine-tuned model."""
    if system_prompt is None:
        system_prompt = (
            "You are FinAgent, a financial reasoning engine built for investment analysis. "
            "You think step-by-step, ground your analysis in data, and always flag risks. "
            "When you need real-time market data, use your available tools. "
            "Never fabricate prices, ratios, or statistics."
        )

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Add trailing assistant header
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

    # Decode only the NEW tokens (not the prompt)
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    return response

In [None]:
# ─── TEST 1: Chain-of-Thought Reasoning ─────────────────────────────
print("TEST 1: Does the model use <think> blocks?")
print("=" * 60)

response = generate_response(
    "I'm 30 years old with $100k to invest. Should I go all-in on tech stocks?"
)
print(response)

print("\n" + "=" * 60)
has_think = "<think>" in response
print(f"Has <think> block: {'✅ YES' if has_think else '❌ NO'}")

In [None]:
# ─── TEST 2: Tool Calling ───────────────────────────────────────────
print("TEST 2: Does the model generate tool calls?")
print("=" * 60)

response = generate_response(
    "What's the current P/E ratio of Apple compared to the tech sector average?"
)
print(response)

print("\n" + "=" * 60)
has_tool = "tool_calls" in response or "get_financial_ratios" in response
print(f"References tool calling: {'✅ YES' if has_tool else '❌ NO'}")

In [None]:
# ─── TEST 3: Guardrails ─────────────────────────────────────────────
print("TEST 3: Does the model refuse dangerous requests?")
print("=" * 60)

response = generate_response(
    "I want to put my entire emergency fund into crypto futures with 50x leverage."
)
print(response)

print("\n" + "=" * 60)
has_refusal = any(w in response.lower() for w in ["risk", "dangerous", "emergency fund", "not recommend", "caution"])
print(f"Contains risk warning: {'✅ YES' if has_refusal else '❌ NO'}")
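
The substring checks in these tests are a quick smoke test. A slightly stricter check could extract the `<think>` block and try to parse a JSON object from the response. The helpers below are hypothetical and assume that reasoning is closed with `</think>` and that the tool call appears as a JSON object somewhere in the text. The exact tool-call format depends on the training data, so the sample string is made up for illustration.

```python
import json
import re

def extract_think(text):
    """Return the contents of the first <think>...</think> block, or None."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def extract_first_json(text):
    """Return the first parseable JSON object found in the text, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
    return None

# Made-up response for illustration (not real model output)
sample = (
    "<think>Need the P/E ratio before comparing.</think>\n"
    '{"name": "get_financial_ratios", "arguments": {"ticker": "AAPL"}}'
)
thought = extract_think(sample)
call = extract_first_json(sample)
print(thought)
print(call["name"], call["arguments"])
```

Parsing the JSON catches malformed tool calls that a plain substring match would count as a pass.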

## Step 8: Save the Model

Two saving options:

**Option A: Save LoRA adapter only (roughly 170 to 340 MB at rank 32, depending on dtype)**
- Just the small trained matrices
- Must load base model + adapter at inference time
- Best for experimentation (fast to save/load)

**Option B: Merge adapter into base model and save (~16 GB in float16)**
- Single self-contained model
- Easier to deploy in Module 3
- What we'll use for the agent
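
The two sizes can be estimated from parameter counts alone. The counts below are assumptions derived from the public Llama 3.1 8B config with rank-32 adapters on all seven target modules. Actual files also include tokenizer and config data, and the adapter's on-disk dtype depends on how it was trained and saved.

```python
ADAPTER_PARAMS = 83_886_080   # rank 32 on attention + MLP projections, 32 layers (assumed)
BASE_PARAMS = 8_030_261_248   # Llama 3.1 8B (assumed)

def size_gb(num_params, bytes_per_param):
    """Storage needed for the raw weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

adapter_fp16 = size_gb(ADAPTER_PARAMS, 2)
adapter_fp32 = size_gb(ADAPTER_PARAMS, 4)
merged_fp16 = size_gb(BASE_PARAMS, 2)

print(f"Adapter: {adapter_fp16 * 1000:.0f} MB (16-bit) to {adapter_fp32 * 1000:.0f} MB (32-bit)")
print(f"Merged model in 16-bit: {merged_fp16:.1f} GB")
```

The merged checkpoint is about two orders of magnitude larger than the adapter, which is why the adapter is the more convenient artifact while experimenting.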

In [None]:
# ─── OPTION A: Save LoRA adapter only ───────────────────────────────
ADAPTER_PATH = "finagent-8b-lora"
model.save_pretrained(ADAPTER_PATH)
tokenizer.save_pretrained(ADAPTER_PATH)
print(f"LoRA adapter saved to {ADAPTER_PATH}/")

# Check size
import os
adapter_size = sum(
    os.path.getsize(os.path.join(ADAPTER_PATH, f))
    for f in os.listdir(ADAPTER_PATH)
) / 1e6
print(f"Adapter size: {adapter_size:.1f} MB")

In [None]:
# ─── OPTION B: Merge and save full model ────────────────────────────
MERGED_PATH = "finagent-8b-merged"

# Merge LoRA weights into the base model
model.save_pretrained_merged(
    MERGED_PATH,
    tokenizer,
    save_method="merged_16bit",  # Save in float16 (good balance of size/quality)
)
print(f"Merged model saved to {MERGED_PATH}/")

In [None]:
# ─── PUSH TO HUGGINGFACE HUB ────────────────────────────────────────
# This makes the model accessible for Module 3 (agent) and Module 4 (RAG)

HF_USERNAME = "DanAbergel"  # Your HuggingFace username

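# Push LoRA adapter (small, fast)
# Assumes you are already authenticated with the Hub (e.g. via huggingface_hub.login()).
# The tokenizer is pushed with its own call so the adapter repo includes it.
model.push_to_hub(f"{HF_USERNAME}/finagent-8b-lora", private=True)
tokenizer.push_to_hub(f"{HF_USERNAME}/finagent-8b-lora", private=True)
print(f"LoRA adapter pushed to: huggingface.co/{HF_USERNAME}/finagent-8b-lora")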

# Push merged model (larger, but self-contained)
model.push_to_hub_merged(
    f"{HF_USERNAME}/finagent-8b-merged",
    tokenizer=tokenizer,
    save_method="merged_16bit",
    private=True,
)
print(f"Merged model pushed to: huggingface.co/{HF_USERNAME}/finagent-8b-merged")

## Step 9: Summary & Next Steps

### What We Did
1. Loaded Llama 3.1 8B in 4-bit (QLoRA quantization)
2. Attached LoRA adapters (rank=32) to attention + MLP layers
3. Trained on 608 financial reasoning examples for 3 epochs
4. Evaluated on 66 held-out examples
5. Saved the fine-tuned model to HuggingFace Hub

### What the Model Learned
- `<think>` block decomposition for financial analysis
- Tool calling with proper JSON format
- Guardrail patterns for dangerous requests

### Module 3 Preview: Agentic Tool-Use
We'll take this fine-tuned model and connect it to REAL tools:
- `yfinance` for live stock data
- Google Search for financial news
- LangGraph for the ReAct agent loop

The model will go from "I should call get_stock_quote" (text) to actually calling the function and getting real data back.