# Week 5: Unsloth Fine-tuning for Sentiment Classification

## Overview

This notebook fine-tunes Llama-3.1-8B, a LLaMA-family model, using **Unsloth** for three-class sentiment classification (Negative/Neutral/Positive) on Amazon product reviews.

### Key Features:
- ✅ **Unsloth-based fine-tuning** with LoRA + 4-bit quantization
- ✅ **Chronological split** (same as Week 4) to prevent data leakage
- ✅ **Instruction-style SFT format** for LLM training
- ✅ **Colab-ready** with GPU checks and flexible data loading
- ✅ **Reproducible** with random seeds and split summaries

### Dataset:
- **Source**: Amazon_Data.csv
- **Target**: Sentiment labels derived from ratings:
  - Rating ≤ 2 → Negative (label 0)
  - Rating = 3 → Neutral (label 1)
  - Rating ≥ 4 → Positive (label 2)
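
The mapping above can be sketched as a small standalone function. This is only an illustration: `rating_to_label` is a hypothetical name, and the sketch assumes ratings are numeric. It mirrors the three-branch rule, so any value that is neither ≤ 2 nor exactly 3 falls through to Positive.

```python
def rating_to_label(rating: float) -> int:
    """Map a star rating to 0 (Negative), 1 (Neutral), or 2 (Positive)."""
    if rating <= 2:
        return 0
    if rating == 3:
        return 1
    return 2


for r in (1, 2, 3, 4, 5):
    print(r, "->", rating_to_label(r))

# Non-integer ratings are not covered by the rule as written:
# 2.5 is neither <= 2 nor == 3, so it falls through to Positive.
print(2.5, "->", rating_to_label(2.5))
```

If the dataset can contain fractional ratings, it may be worth checking for them before labeling.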

### Split Strategy:
- **70% Train** (oldest data)
- **15% Validation** (middle)
- **15% Test** (most recent data)

**⚠️ CRITICAL**: No shuffling before split - maintains strict chronological order to prevent temporal leakage.
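
As a rough illustration of this split, the sketch below sorts toy records by timestamp and slices them 70/15/15 without shuffling. It uses plain Python lists rather than the notebook's DataFrame, and `chronological_split_sketch` and the toy rows are hypothetical.

```python
from datetime import datetime, timedelta


def chronological_split_sketch(rows, train_ratio=0.70, val_ratio=0.15):
    """Sort rows by timestamp, then slice oldest -> newest without shuffling."""
    ordered = sorted(rows, key=lambda row: row["timestamp"])
    n_train = int(train_ratio * len(ordered))
    n_val = int(val_ratio * len(ordered))
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


# Toy data: 100 reviews, one per day, deliberately supplied newest-first.
start = datetime(2020, 1, 1)
rows = [{"timestamp": start + timedelta(days=d), "text": f"review {d}"}
        for d in reversed(range(100))]

train, val, test = chronological_split_sketch(rows)
print(len(train), len(val), len(test))

# Every training review predates every validation review, and so on.
print(train[-1]["timestamp"] < val[0]["timestamp"] < test[0]["timestamp"])
```

The final comparison is the property that matters: no review in an earlier split is newer than any review in a later split.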


## 1. Install Dependencies

Install required packages for Unsloth fine-tuning. This cell should be run first in Colab.


In [None]:
# Install Unsloth and dependencies
%pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --quiet
%pip install --no-deps "xformers<0.0.27" "trl<0.9.0" "peft<0.10.0" "accelerate<0.30.0" "bitsandbytes<0.43.0" --quiet
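# Llama 3.1 checkpoints need transformers 4.43 or newer (earlier releases reject their RoPE scaling config)
%pip install "datasets>=2.18.0" "transformers>=4.43.0" "evaluate>=0.4.1" "scikit-learn>=1.3.0" --quiet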

print("✓ Dependencies installed successfully")


## 2. GPU Check

Verify GPU availability (required for Unsloth fine-tuning).


In [None]:
import torch

# Check GPU availability
if torch.cuda.is_available():
    print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"  GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️  WARNING: No GPU detected. Unsloth fine-tuning requires a GPU.")
    print("   Please enable GPU in Colab: Runtime → Change runtime type → GPU (T4 or A100)")
    raise RuntimeError("GPU required for Unsloth fine-tuning")


## 3. Imports and Configuration


In [None]:
import pandas as pd
import numpy as np
import os
import random
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Import utility functions
import sys
sys.path.append('.')
try:
    from week5_utils import (
        load_dataset, clean_data, create_sentiment_labels,
        chronological_split, label_to_string, string_to_label
    )
    USE_UTILS = True
except ImportError:
    # If running in Colab, define functions inline
    print("⚠️  week5_utils.py not found. Defining functions inline...")
    USE_UTILS = False
    
    # Define utility functions inline for Colab
    def load_dataset():
        possible_paths = [
            "/content/drive/MyDrive/Amazon_Data.csv",
            "/content/Amazon_Data.csv",
            "Amazon_Data.csv",
        ]
        for path in possible_paths:
            if os.path.exists(path):
                df = pd.read_csv(path)
                print(f"✓ Found file at: {path}")
                return df
        raise FileNotFoundError("Could not find Amazon_Data.csv. Please upload to /content/drive/MyDrive/")
    
    def clean_data(df):
        df = df[['text', 'rating', 'timestamp']].copy()
        df = df.dropna()
        df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
        df = df.dropna(subset=['timestamp'])
        df = df[df['text'].astype(str).str.len() > 0].copy()
        df = df.sort_values('timestamp').reset_index(drop=True)
        return df
    
    def create_sentiment_labels(df):
        df = df.copy()
        df['sentiment_label'] = df['rating'].apply(lambda r: 0 if r <= 2 else (1 if r == 3 else 2))
        return df
    
    def chronological_split(df, train_ratio=0.70, val_ratio=0.15):
        n_total = len(df)
        n_train = int(train_ratio * n_total)
        n_val = int(val_ratio * n_total)
        df_train = df.iloc[:n_train].copy()
        df_val = df.iloc[n_train:n_train + n_val].copy()
        df_test = df.iloc[n_train + n_val:].copy()
        y_train = df_train['sentiment_label'].values
        y_val = df_val['sentiment_label'].values
        y_test = df_test['sentiment_label'].values
        return df_train, df_val, df_test, y_train, y_val, y_test
    
    def label_to_string(label):
        return {0: "Negative", 1: "Neutral", 2: "Positive"}[label]
    
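    def string_to_label(label_str):
        label_map = {"Negative": 0, "Neutral": 1, "Positive": 2}
        label_str_lower = label_str.strip().lower()
        # An empty string is a substring of every key, so handle it explicitly
        if not label_str_lower:
            return 1  # Fall back to Neutral
        for key, value in label_map.items():
            if key.lower() in label_str_lower or label_str_lower in key.lower():
                return value
        return 1  # Unrecognized output: fall back to Neutral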

# Unsloth imports
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

# Evaluation imports
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seeds for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)
torch.cuda.manual_seed_all(RANDOM_STATE)

print("✓ Imports complete")


## 4. Load and Prepare Data

Load the dataset and apply the same chronological split logic as Week 4.


In [None]:
# Load dataset
print("Loading dataset...")
df = load_dataset()

# Clean data
print("\nCleaning data...")
df = clean_data(df)

# Create sentiment labels from ratings
print("\nCreating sentiment labels...")
df = create_sentiment_labels(df)

# Chronological split (70/15/15)
print("\nPerforming chronological split...")
df_train, df_val, df_test, y_train, y_val, y_test = chronological_split(df)

print(f"\n✓ Data preparation complete")
print(f"  Train: {len(df_train):,} samples")
print(f"  Val:   {len(df_val):,} samples")
print(f"  Test:  {len(df_test):,} samples")


## 5. Prepare Dataset for Unsloth (Instruction Format)

Convert classification task to instruction-following format for LLM fine-tuning.
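
To make the format concrete, here is a minimal sketch of how one training example might be assembled from a review and its label. The prompt text matches `create_instruction_prompt`; the space before the label and the `<eos>` marker are assumptions for illustration (a real run would use `tokenizer.eos_token`), and `build_training_text` is a hypothetical helper.

```python
EOS = "<eos>"  # hypothetical stand-in for tokenizer.eos_token


def build_prompt(review: str) -> str:
    """Instruction prompt ending in 'Answer:', which the model must complete."""
    return (
        "Classify the sentiment of this review as one of: Negative, Neutral, Positive.\n\n"
        f"Review: {review}\n\n"
        "Answer:"
    )


def build_training_text(review: str, label: str, eos: str = EOS) -> str:
    """Prompt + completion; the end marker shows where generation should stop."""
    return f"{build_prompt(review)} {label}{eos}"


example = build_training_text("Battery died after two days.", "Negative")
print(example)
```

At inference time only `build_prompt` is used, and the text the model generates after `Answer:` is parsed as the label.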


In [None]:
def create_instruction_prompt(text):
    """
    Create instruction prompt for sentiment classification.
    
    Args:
        text: Review text
        
    Returns:
        str: Formatted instruction prompt
    """
    prompt = f"""Classify the sentiment of this review as one of: Negative, Neutral, Positive.

Review: {text}

Answer:"""
    return prompt


def prepare_dataset_for_unsloth(df_split, y_split):
    """
    Prepare dataset in Hugging Face format with instruction prompts.
    
    Args:
        df_split: Dataframe with 'text' column
        y_split: Array of numeric labels (0, 1, 2)
        
    Returns:
        Dataset: Hugging Face dataset with 'text' and 'label' fields
    """
    texts = []
    labels = []
    
    # Iterate positionally: the split DataFrames keep their original index,
    # so row labels from iterrows() do not line up with positions in y_split
    for text, label in zip(df_split['text'], y_split):
        # Create instruction prompt
        texts.append(create_instruction_prompt(text))
        
        # Convert numeric label to string
        labels.append(label_to_string(label))
    
    # Create Hugging Face dataset
    dataset_dict = {
        'text': texts,
        'label': labels
    }
    
    dataset = Dataset.from_dict(dataset_dict)
    return dataset


# Prepare datasets
print("Preparing training dataset...")
train_dataset = prepare_dataset_for_unsloth(df_train, y_train)

print("Preparing validation dataset...")
val_dataset = prepare_dataset_for_unsloth(df_val, y_val)

print("Preparing test dataset...")
test_dataset = prepare_dataset_for_unsloth(df_test, y_test)

print(f"\n✓ Datasets prepared:")
print(f"  Train: {len(train_dataset):,} examples")
print(f"  Val:   {len(val_dataset):,} examples")
print(f"  Test:  {len(test_dataset):,} examples")


## 6. Load Model with Unsloth

Load Llama-3.1-8B (an 8B LLaMA-family model) with 4-bit quantization and attach LoRA adapters.
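
As a back-of-the-envelope check on how small the trainable part is, the sketch below counts the parameters that rank-16 LoRA adapters add to the seven projection types targeted in this notebook. The layer dimensions are assumptions taken from the published Llama-3.1-8B configuration, not from anything computed here, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed Llama-3.1-8B dimensions (from the published model config)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x 128 head dim (grouped-query attention)
N_LAYERS = 32
RANK = 16

# (input dim, output dim) for each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_param_count(d_in, d_out, RANK)
                for d_in, d_out in PROJECTIONS.values())
total = per_layer * N_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters hold about 42 million parameters, roughly half a percent of the base model, which is why LoRA fits on a single Colab GPU.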


In [None]:
# Model configuration
# Pre-quantized 4-bit Llama-3.1-8B base checkpoint (not the Instruct variant)
# Alternative: "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
model_name = "unsloth/llama-3.1-8b-bnb-4bit"

print(f"Loading model: {model_name}")
print("This may take a few minutes on first run...")

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=512,  # Adjust based on your review length
    dtype=None,  # Auto-detect
    load_in_4bit=True,  # 4-bit quantization
)

# Add LoRA adapters for efficient fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (higher = more capacity, but slower)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,  # LoRA alpha scaling
    lora_dropout=0.1,  # Dropout for LoRA (Unsloth's optimized LoRA path applies only when this is 0)
    bias="none",  # No bias
    use_gradient_checkpointing=True,  # Save memory
    random_state=RANDOM_STATE,
)

print("✓ Model loaded successfully")
print(f"  Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"  Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")


## 7. Configure Tokenizer and Format Dataset

Set up tokenizer for instruction-following format and prepare datasets for training.
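
One thing to watch with a fixed `max_length`: truncation removes tokens from the end of the sequence, which is where `Answer:` and the label sit, so a very long review can lose its completion. A possible workaround is to shorten the review body before building the prompt. The sketch below illustrates the idea with whitespace-separated words standing in for real tokens; `truncate_review` and `build_example` are hypothetical helpers, and a real implementation would count tokenizer tokens instead.

```python
PROMPT_HEAD = ("Classify the sentiment of this review as one of: "
               "Negative, Neutral, Positive.\n\nReview: ")
PROMPT_TAIL = "\n\nAnswer:"


def truncate_review(review: str, max_tokens: int, reserved: int) -> str:
    """Keep at most (max_tokens - reserved) whitespace-separated words of the review."""
    budget = max(max_tokens - reserved, 0)
    return " ".join(review.split()[:budget])


def build_example(review: str, label: str, max_tokens: int = 512) -> str:
    """Build prompt + label so that the fixed parts always survive the length limit."""
    fixed_parts = f"{PROMPT_HEAD}{PROMPT_TAIL} {label}"
    reserved = len(fixed_parts.split())  # words used by everything except the review
    body = truncate_review(review, max_tokens, reserved)
    return f"{PROMPT_HEAD}{body}{PROMPT_TAIL} {label}"


long_review = "great " * 1000
text = build_example(long_review, "Positive")
print(len(text.split()), "words; ends with:", repr(text[-16:]))
```

The same budget idea applies at inference time, where the prompt must still end in `Answer:` after truncation.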


In [None]:
# Configure tokenizer (reuse the one returned by FastLanguageModel.from_pretrained)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Format dataset for training (instruction + completion)
def format_dataset(examples):
    """
    Format dataset for instruction-following training.
    Each example has 'text' (instruction) and 'label' (completion).
    """
    inputs = examples['text']
    outputs = examples['label']
    
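    # Combine instruction and completion; the space after "Answer:" separates the
    # label from the prompt, and the EOS token marks where generation should stop
    texts = [f"{inp} {out}{tokenizer.eos_token}" for inp, out in zip(inputs, outputs)]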
    
    # Tokenize
    tokenized = tokenizer(
        texts,
        truncation=True,
        max_length=512,
        padding=False,
    )
    
    # No 'labels' column here: the trainer's language-modeling collator pads each
    # batch and derives labels from input_ids (unpadded label lists cannot be batched)
    
    return tokenized


# Apply formatting
print("Formatting training dataset...")
train_dataset_formatted = train_dataset.map(
    format_dataset,
    batched=True,
    remove_columns=train_dataset.column_names,
)

print("Formatting validation dataset...")
val_dataset_formatted = val_dataset.map(
    format_dataset,
    batched=True,
    remove_columns=val_dataset.column_names,
)

print("✓ Datasets formatted for training")


## 8. Training Configuration

Set up training arguments for fine-tuning.
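
The step counts that follow from these settings are easy to work out by hand. The sketch below does the arithmetic for a single-GPU run; `training_schedule` is a hypothetical helper, the dataset sizes are made-up examples, and the trainer's own step count may differ slightly because of how it rounds the last partial batch.

```python
import math


def training_schedule(n_examples, per_device_batch=2, grad_accum=4,
                      epochs=1, eval_steps=500, warmup_steps=50):
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "n_evals": total_steps // eval_steps,  # evaluations triggered by eval_steps
        "warmup_fraction": warmup_steps / total_steps,
    }


large = training_schedule(10_000)  # hypothetical training-set size
small = training_schedule(3_000)   # hypothetical smaller training set
print(large)
print(small)
```

With the smaller set the run finishes before step 500, so no intermediate evaluation or checkpoint is produced; in that case lowering `eval_steps` and `save_steps` is probably worthwhile.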


In [None]:
# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=2,  # Small batch size for memory efficiency
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,  # Effective batch size = 2 * 4 = 8
    warmup_steps=50,
    num_train_epochs=1,  # Start with 1 epoch (can increase to 2-3 if needed)
    learning_rate=2e-4,  # Learning rate for LoRA
    fp16=not torch.cuda.is_bf16_supported(),  # Use fp16 if bf16 not supported
    bf16=torch.cuda.is_bf16_supported(),  # Use bf16 if supported (A100)
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=500,  # Evaluate every 500 steps
    save_strategy="steps",
    save_steps=500,
    output_dir="./outputs",
    optim="adamw_8bit",  # 8-bit optimizer to save memory
    load_best_model_at_end=True,
    report_to="none",  # Disable wandb/tensorboard
    seed=RANDOM_STATE,
)

print("✓ Training arguments configured")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Total steps: ~{len(train_dataset_formatted) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps) * training_args.num_train_epochs}")


## 9. Fine-tune Model

Train the Unsloth-patched model using `SFTTrainer` from TRL.


In [None]:
# Create trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset_formatted,
    eval_dataset=val_dataset_formatted,
    args=training_args,
    dataset_text_field="text",  # Ignored here because the datasets are already tokenized
    max_seq_length=512,
    packing=False,  # Don't pack sequences
)

print("✓ Trainer created")
print("\nStarting training...")
print("This may take 30-60 minutes depending on GPU and dataset size.")

# Train
trainer_stats = trainer.train()

print("\n✓ Training complete!")
print(f"  Training loss: {trainer_stats.training_loss:.4f}")


## 10. Save Model

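Save the fine-tuned LoRA adapter weights and the tokenizer for later use. The 4-bit base model is not copied and is loaded again from its original checkpoint when the adapters are reused.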


In [None]:
# Save model
model.save_pretrained("unsloth_sentiment_model")
tokenizer.save_pretrained("unsloth_sentiment_model")

print("✓ Model saved to 'unsloth_sentiment_model/'")


## 11. Inference Function

Create a function to run inference on new examples.
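
Generated text does not always come back as a clean label: it may carry punctuation, different casing, extra words, or nothing at all. The sketch below shows one way to parse it with a whole-word regular expression, as an alternative to substring matching. `parse_label` is a hypothetical helper, and falling back to Neutral is an assumption that matches the default in `string_to_label`.

```python
import re

# Whole-word match for any of the three labels, ignoring case
_LABEL_RE = re.compile(r"\b(negative|neutral|positive)\b", re.IGNORECASE)


def parse_label(generated: str, default: str = "Neutral") -> str:
    """Return the first label found in the generated text, or the default."""
    match = _LABEL_RE.search(generated)
    if match is None:
        return default
    return match.group(1).capitalize()


for raw in [" Positive.", "NEGATIVE\nReview:", "neutral, I think", "", "e"]:
    print(repr(raw), "->", parse_label(raw))
```

Because the match is on whole words, an empty string or a stray fragment such as `"e"` falls back to the default instead of matching a label by accident.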


In [None]:
# Enable inference mode
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

def predict_sentiment(text, model=model, tokenizer=tokenizer):
    """
    Predict sentiment for a single review text.
    
    Args:
        text: Review text
        model: Fine-tuned model
        tokenizer: Tokenizer
        
    Returns:
        str: Predicted label ("Negative", "Neutral", or "Positive")
    """
    # Create instruction prompt
    prompt = create_instruction_prompt(text)
    
    # Tokenize
    inputs = tokenizer(
        [prompt],
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")
    
    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=5,  # Short generation (just the label)
        do_sample=False,  # Greedy decoding (deterministic; temperature is not used)
        pad_token_id=tokenizer.eos_token_id,
    )
    
    # Decode only the newly generated tokens, so an "Answer:" inside the
    # review text cannot be mistaken for the model's answer
    prompt_length = inputs["input_ids"].shape[1]
    answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    
    # Parse label (take first word, handle case variations)
    answer_words = answer.split()
    if len(answer_words) > 0:
        predicted_label = answer_words[0].strip()
    else:
        predicted_label = answer.strip()
    
    # Normalize to one of our labels
    predicted_label = string_to_label(predicted_label)
    return label_to_string(predicted_label)


# Try a few validation examples (the test set stays untouched until Section 13)
print("Testing inference on sample reviews:")
print("=" * 60)
for i in range(3):
    sample_text = df_val.iloc[i]['text'][:200] + "..."
    true_label = label_to_string(y_val[i])
    pred_label = predict_sentiment(df_val.iloc[i]['text'])
    print(f"\nExample {i+1}:")
    print(f"  Text: {sample_text}")
    print(f"  True: {true_label}")
    print(f"  Pred: {pred_label}")
    print(f"  Match: {'✓' if true_label == pred_label else '✗'}")


## 12. Evaluate on Validation Set

Evaluate model performance on validation set.
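
Accuracy alone can look good when one class dominates, which is why the per-class report matters. The sketch below computes per-class F1 and macro F1 by hand on a made-up example where a model predicts Positive for everything; `per_class_f1` is a hypothetical helper and the label lists are not taken from the dataset.

```python
def per_class_f1(y_true, y_pred, labels=(0, 1, 2)):
    """F1 score for each label, computed from true/false positives and false negatives."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Made-up skewed sample: 8 Positive, 1 Negative, 1 Neutral
y_true = [2] * 8 + [0, 1]
y_pred = [2] * 10  # a model that always answers Positive

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_f1(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)

print(f"accuracy: {accuracy:.2f}")
print(f"per-class F1: {scores}")
print(f"macro F1: {macro_f1:.3f}")
```

In this toy case accuracy is high while macro F1 is low, because the Negative and Neutral classes are never predicted.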


In [None]:
# Evaluate on validation set (sample for speed)
print("Evaluating on validation set...")
print("(Using a sample for faster evaluation - adjust sample_size as needed)")

sample_size = min(1000, len(df_val))  # Evaluate on 1000 samples
val_sample_idx = np.random.choice(len(df_val), sample_size, replace=False)
val_texts_sample = df_val.iloc[val_sample_idx]['text'].values
y_val_true_sample = y_val[val_sample_idx]

# Predict
print("Running predictions...")
y_val_pred_sample = []
for text in val_texts_sample:
    pred = predict_sentiment(text)
    y_val_pred_sample.append(pred)

# Convert to numeric labels
y_val_pred_numeric = np.array([string_to_label(pred) for pred in y_val_pred_sample])
y_val_true_numeric = np.asarray(y_val_true_sample)  # Already numeric (0, 1, 2)

# Metrics
val_acc = accuracy_score(y_val_true_numeric, y_val_pred_numeric)
print(f"\n✓ Validation Evaluation Complete")
print(f"  Accuracy: {val_acc:.4f} ({val_acc*100:.2f}%)")
print(f"\nClassification Report:")
# Pass labels explicitly so the report still works if a class is missing from the sample
print(classification_report(y_val_true_numeric, y_val_pred_numeric,
                          labels=[0, 1, 2],
                          target_names=['Negative', 'Neutral', 'Positive'],
                          zero_division=0))


## 13. Final Evaluation on Test Set

**⚠️ CRITICAL**: Test set is used ONLY ONCE for final evaluation.
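
By default the evaluation cell scores a random sample of up to 2,000 test reviews, so the reported accuracy is an estimate with sampling error. A rough way to size that error is the normal-approximation interval sketched below. `accuracy_interval` is a hypothetical helper and the 0.85 accuracy is a made-up value, not a result from this notebook.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96):
    """Approximate 95% confidence interval for an accuracy measured on n samples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


low, high = accuracy_interval(0.85, 2000)  # hypothetical accuracy on a 2,000-review sample
half_width = (high - low) / 2
print(f"95% interval: {low:.3f} to {high:.3f} (about +/- {half_width:.3f})")
```

Under these assumptions the margin is around 1.6 percentage points, so differences smaller than that between runs are hard to interpret without the full test set.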


In [None]:
# Final evaluation on test set
print("=" * 60)
print("FINAL EVALUATION ON TEST SET")
print("=" * 60)
print("⚠️  This is the FIRST and ONLY time the test set is used")
print("⚠️  Model selected based on validation performance only\n")

# For full evaluation, use entire test set (may take time)
# For faster evaluation, use a sample
FULL_TEST_EVAL = False  # Set to True for full test set evaluation

if FULL_TEST_EVAL:
    test_texts = df_test['text'].values
    y_test_true = y_test
    print(f"Evaluating on full test set ({len(test_texts):,} samples)...")
else:
    # Sample for faster evaluation
    test_sample_size = min(2000, len(df_test))
    test_sample_idx = np.random.choice(len(df_test), test_sample_size, replace=False)
    test_texts = df_test.iloc[test_sample_idx]['text'].values
    y_test_true = y_test[test_sample_idx]
    print(f"Evaluating on test set sample ({len(test_texts):,} samples)...")
    print("(Set FULL_TEST_EVAL=True for full evaluation)")

# Predict
print("Running predictions...")
y_test_pred = []
for i, text in enumerate(test_texts):
    if (i + 1) % 100 == 0:
        print(f"  Processed {i+1}/{len(test_texts)} samples...")
    pred = predict_sentiment(text)
    y_test_pred.append(pred)

# Convert to numeric labels
y_test_pred_numeric = np.array([string_to_label(pred) for pred in y_test_pred])
y_test_true_numeric = np.asarray(y_test_true)  # Already numeric (0, 1, 2)

# Metrics
test_acc = accuracy_score(y_test_true_numeric, y_test_pred_numeric)
print(f"\n✓ Test Evaluation Complete")
print(f"  Accuracy: {test_acc:.4f} ({test_acc*100:.2f}%)")
print(f"\nClassification Report:")
# Pass labels explicitly so the report still works if a class is missing from the sample
print(classification_report(y_test_true_numeric, y_test_pred_numeric,
                          labels=[0, 1, 2],
                          target_names=['Negative', 'Neutral', 'Positive'],
                          zero_division=0))

# Confusion Matrix (fixed 3x3 layout to match the tick labels)
cm = confusion_matrix(y_test_true_numeric, y_test_pred_numeric, labels=[0, 1, 2])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Neutral', 'Positive'],
            yticklabels=['Negative', 'Neutral', 'Positive'])
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix - Test Set (Unsloth Fine-tuned Model)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("=" * 60)


## 14. Example Predictions

Show some example predictions vs true labels.


In [None]:
# Show example predictions
print("Example Predictions:")
print("=" * 60)

n_examples = 10
for i in range(min(n_examples, len(test_texts))):
    true_label = label_to_string(y_test_true_numeric[i])
    pred_label = y_test_pred[i]
    match = "✓" if true_label == pred_label else "✗"
    
    print(f"\nExample {i+1} [{match}]")
    print(f"  Text: {test_texts[i][:150]}...")
    print(f"  True Label: {true_label}")
    print(f"  Predicted:  {pred_label}")

print("\n" + "=" * 60)


## 15. Summary

### Key Results:
- **Model**: Llama-3.1-8B fine-tuned with Unsloth (LoRA + 4-bit)
- **Training**: Instruction-following format with chronological split
- **Test Accuracy**: See the Section 13 output (computed on a sample of up to 2,000 test reviews unless FULL_TEST_EVAL=True)

### Reproducibility:
- ✅ Random seeds set (RANDOM_STATE=42)
- ✅ Chronological split (no shuffling)
- ✅ Same label mapping as Week 4 (0=Negative, 1=Neutral, 2=Positive)

### Data Leakage Prevention:
- ✅ Chronological split by timestamp
- ✅ No shuffling before split
- ✅ Test set used only once

### Next Steps:
- Try 2-3 epochs and compare validation metrics against the 1-epoch run
- Tune LoRA rank (r) and learning rate
- Try different base models (Qwen2.5, Mistral)
- Full test set evaluation (set FULL_TEST_EVAL=True)
