# Financial Q&A Systems - Fine-Tuned Model Implementation

This notebook fine-tunes a small language model (DistilGPT2 with LoRA adapters) on financial Q&A pairs and evaluates it before and after training.


## Setup and Imports


In [None]:
import os
import sys
import json
import time
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch

# Add the project root to the path
project_root = Path.cwd().parent
sys.path.append(str(project_root))

# Import project modules
from src.fine_tuning.fine_tuner import FineTuner

print("✅ All imports successful")


## Define Paths


In [None]:
# Define paths
DATA_DIR = project_root / "data"
QA_PAIRS_DIR = DATA_DIR / "qa_pairs"
FT_MODEL_DIR = project_root / "models" / "fine_tuned"

# Create directories if they don't exist
QA_PAIRS_DIR.mkdir(parents=True, exist_ok=True)
FT_MODEL_DIR.mkdir(parents=True, exist_ok=True)

# Check for GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")


## Step 1: Load and Analyze Q&A Pairs

First, let's load and analyze the Q&A pairs that we'll use for fine-tuning.


In [None]:
# Load the train and test sets
train_file = QA_PAIRS_DIR / "financial_qa_train.json"
test_file = QA_PAIRS_DIR / "financial_qa_test.json"

with open(train_file, 'r', encoding='utf-8') as f:
    train_pairs = json.load(f)

with open(test_file, 'r', encoding='utf-8') as f:
    test_pairs = json.load(f)

print(f"Loaded {len(train_pairs)} training pairs and {len(test_pairs)} testing pairs")

# Display a few examples
print("\nTraining Examples:")
for i in range(min(3, len(train_pairs))):
    print(f"\nExample {i+1}:")
    print(f"Q: {train_pairs[i]['question']}")
    print(f"A: {train_pairs[i]['answer']}")

print("\nTesting Examples:")
for i in range(min(3, len(test_pairs))):
    print(f"\nExample {i+1}:")
    print(f"Q: {test_pairs[i]['question']}")
    print(f"A: {test_pairs[i]['answer']}")
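
The cell above only prints a few examples. For the "analyze" part, a minimal sketch of some basic length statistics could look like the following. It assumes each pair is a dict with `question` and `answer` keys, as in the cell above. The helper name `summarize_pairs` and the sample data are hypothetical and not part of the project code.

```python
from statistics import mean


def summarize_pairs(pairs):
    """Return basic word-count statistics for a list of Q&A dicts."""
    q_lens = [len(p["question"].split()) for p in pairs]
    a_lens = [len(p["answer"].split()) for p in pairs]
    return {
        "count": len(pairs),
        "avg_question_words": mean(q_lens) if q_lens else 0.0,
        "avg_answer_words": mean(a_lens) if a_lens else 0.0,
        "max_answer_words": max(a_lens, default=0),
    }


# Made-up sample pairs, for illustration only
sample_pairs = [
    {"question": "What was the revenue in 2023?",
     "answer": "Revenue was $1,250 million."},
    {"question": "What are the total assets?",
     "answer": "Total assets were $3,400 million in 2023."},
]

stats = summarize_pairs(sample_pairs)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In the notebook, the same helper could be called on `train_pairs` and `test_pairs` to check that answers fit comfortably within the `max_length` used for training.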


## Step 2: Initialize the Fine-Tuner

Now, let's initialize the fine-tuner with a small language model.


In [None]:
# Initialize the fine-tuner
print("Initializing fine-tuner...")
fine_tuner = FineTuner(
    model_name="distilgpt2",
    output_dir=FT_MODEL_DIR,
    use_peft=True  # Use Parameter-Efficient Fine-Tuning (LoRA)
)

# Set training parameters
fine_tuner.max_length = 512
fine_tuner.batch_size = 8 if device == "cuda" else 4
fine_tuner.learning_rate = 5e-5
fine_tuner.num_epochs = 3

print("Fine-tuner initialized with the following parameters:")
print(f"  - Model: {fine_tuner.model_name}")
print(f"  - PEFT: {fine_tuner.use_peft}")
print(f"  - Max length: {fine_tuner.max_length}")
print(f"  - Batch size: {fine_tuner.batch_size}")
print(f"  - Learning rate: {fine_tuner.learning_rate}")
print(f"  - Epochs: {fine_tuner.num_epochs}")
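
`use_peft=True` enables LoRA, but the rank and target modules that `FineTuner` uses are not shown in this notebook. As a rough illustration of why LoRA is parameter-efficient, the sketch below counts the extra trainable weights for a single adapted projection. The rank of 8 and the 768-to-2304 projection shape (the fused attention projection in GPT-2-style models with hidden size 768) are assumptions chosen for illustration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix.

    LoRA freezes the original weight and learns two low-rank factors,
    A (rank x d_in) and B (d_out x rank), so the count is
    rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Assumed values, for illustration only
d_in, d_out, rank = 768, 2304, 8

full = d_in * d_out
lora = lora_param_count(d_in, d_out, rank)

print(f"Full matrix:  {full:,} weights")
print(f"LoRA factors: {lora:,} weights ({lora / full:.2%} of the full matrix)")
```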


## Step 3: Pre-Training Evaluation

Before fine-tuning, let's evaluate the base model on the test set.


In [None]:
# Evaluate the base model
print("Evaluating base model...")
try:
    pre_training_metrics = fine_tuner.evaluate(test_file)
    
    print("\nPre-training evaluation results:")
    print(f"Accuracy: {pre_training_metrics['accuracy']:.2%}")
    print(f"Average response time: {pre_training_metrics['avg_response_time']:.3f}s")
    
    # Display a few examples
    print("\nExample predictions:")
    for i, result in enumerate(pre_training_metrics['results'][:3]):
        print(f"\nExample {i+1}:")
        print(f"Q: {result['question']}")
        print(f"Ground truth: {result['ground_truth']}")
        print(f"Generated: {result['generated']}")
        print(f"Correct: {result['is_correct']}")
        print(f"Similarity: {result['similarity']:.2f}")
        print(f"Response time: {result['response_time']:.3f}s")
except Exception as e:
    print(f"Error in pre-training evaluation: {e}")
    import traceback
    traceback.print_exc()
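
The internals of `fine_tuner.evaluate` are not shown in this notebook. One plausible way to produce the `similarity`, `is_correct`, and `accuracy` fields is sketched below, assuming the same character-level `SequenceMatcher` ratio and 0.5 threshold that Step 7 uses. The function names here are hypothetical.

```python
from difflib import SequenceMatcher


def score_answer(generated, ground_truth, threshold=0.5):
    """Return (similarity, is_correct) for one generated answer."""
    similarity = SequenceMatcher(
        None, generated.lower(), ground_truth.lower()
    ).ratio()
    return similarity, similarity > threshold


def evaluate_predictions(predictions, threshold=0.5):
    """Score (generated, ground_truth) pairs and aggregate the accuracy."""
    results = []
    for generated, ground_truth in predictions:
        similarity, is_correct = score_answer(generated, ground_truth, threshold)
        results.append({
            "generated": generated,
            "ground_truth": ground_truth,
            "similarity": similarity,
            "is_correct": is_correct,
        })
    # Accuracy is the fraction of answers above the similarity threshold
    accuracy = (
        sum(r["is_correct"] for r in results) / len(results) if results else 0.0
    )
    return {"accuracy": accuracy, "results": results}


# Made-up predictions, for illustration only
metrics = evaluate_predictions([("abc", "ABC"), ("xyz", "abc")])
print(f"Accuracy: {metrics['accuracy']:.2%}")
```

A character-level ratio is cheap to compute but rewards surface overlap, so an answer with the wrong figure in an otherwise matching sentence can still count as correct.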

## Step 4: Fine-Tune the Model

Now, let's fine-tune the model on our training data.


In [None]:
# Fine-tune the model
print("Starting fine-tuning...")
try:
    fine_tuner.fine_tune(train_file)
    
    print("\nFine-tuning complete!")
    print(f"Model saved to {fine_tuner.output_dir}")
    
    # Check if training metrics file was created
    metrics_file = FT_MODEL_DIR / "training_metrics.json"
    if metrics_file.exists():
        with open(metrics_file, "r", encoding="utf-8") as f:
            training_metrics = json.load(f)
        
        print("\nTraining metrics:")
        print(f"Training time: {training_metrics['train_runtime']:.2f} seconds")
        print(f"Samples per second: {training_metrics['train_samples_per_second']:.2f}")
        print(f"Final loss: {training_metrics['train_loss']:.4f}")
except Exception as e:
    print(f"Error in fine-tuning: {e}")
    import traceback
    traceback.print_exc()

## Step 5: Test the Fine-Tuned Model

After fine-tuning, let's spot-check the model on a few sample queries before running the full evaluation.


In [None]:
# Test the fine-tuned model
print("Testing the fine-tuned model...")
try:
    # Test queries
    test_queries = [
        "What was the revenue in 2023?",
        "How much profit did the company make?",
        "What are the total assets?"
    ]
    
    # Process each query with the fine_tuner (which now has the fine-tuned model)
    for query in test_queries:
        print(f"\nProcessing query: {query}")
        result = fine_tuner.process_query(query)
        
        print(f"Answer: {result['answer']}")
        print(f"Response time: {result['response_time']:.3f}s")
        if 'confidence' in result:
            print(f"Confidence: {result['confidence']:.2f}")
        print(f"Filtered: {result.get('is_filtered', False)}")
except Exception as e:
    print(f"Error testing fine-tuned model: {e}")
    import traceback
    traceback.print_exc()

## Step 6: Compare Pre-Training and Post-Training Results

Now let's evaluate the fine-tuned model on the test set and compare the results with the pre-training baseline.


In [None]:
# Post-training evaluation 
print("Evaluating fine-tuned model...")
try:
    post_training_metrics = fine_tuner.evaluate(test_file)
    
    print("\nPost-training evaluation results:")
    print(f"Accuracy: {post_training_metrics['accuracy']:.2%}")
    print(f"Average response time: {post_training_metrics['avg_response_time']:.3f}s")
    
    # Compare metrics (only if pre-training metrics exist)
    if 'pre_training_metrics' in locals():
        metrics = {
            "Accuracy": [pre_training_metrics["accuracy"], post_training_metrics["accuracy"]],
            "Avg Response Time (s)": [pre_training_metrics["avg_response_time"], post_training_metrics["avg_response_time"]]
        }

        df = pd.DataFrame(metrics, index=["Pre-Training", "Post-Training"])
        print("\nComparison:")
        print(df)

        # Create comparison charts
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

        # Accuracy comparison
        accuracies = [pre_training_metrics["accuracy"], post_training_metrics["accuracy"]]
        ax1.bar(["Pre-Training", "Post-Training"], accuracies, color=["#3498db", "#e74c3c"])
        ax1.set_title("Accuracy Comparison")
        ax1.set_ylabel("Accuracy")
        ax1.set_ylim(0, 1)

        for i, v in enumerate(accuracies):
            ax1.text(i, v + 0.01, f"{v:.2%}", ha='center')

        # Response time comparison
        times = [pre_training_metrics["avg_response_time"], post_training_metrics["avg_response_time"]]
        ax2.bar(["Pre-Training", "Post-Training"], times, color=["#3498db", "#e74c3c"])
        ax2.set_title("Average Response Time")
        ax2.set_ylabel("Time (seconds)")

        for i, v in enumerate(times):
            ax2.text(i, v + 0.01, f"{v:.3f}s", ha='center')

        plt.tight_layout()
        plt.show()
    else:
        print("⚠️ No pre-training metrics available for comparison")
        
except Exception as e:
    print(f"❌ Error in post-training evaluation: {e}")
    import traceback
    traceback.print_exc()


## Step 7: Test with Official Questions

Finally, let's test the fine-tuned model with the official test questions.


In [None]:
# Load official questions
official_questions_file = QA_PAIRS_DIR / "official_questions.json"

try:
    with open(official_questions_file, 'r', encoding='utf-8') as f:
        official_questions = json.load(f)

    # Test each official question
    print("Testing official questions:")
    for i, q_data in enumerate(official_questions):
        question = q_data["question"]
        ground_truth = q_data["answer"]
        question_type = q_data["type"]
        
        print(f"\nQuestion {i+1} ({question_type}):")
        print(f"Q: {question}")
        print(f"Ground truth: {ground_truth}")
        
        # Process the query using the fine_tuner
        try:
            result = fine_tuner.process_query(question)
            
            print(f"Answer: {result['answer']}")
            print(f"Response time: {result['response_time']:.3f}s")
            
            if result.get("is_filtered", False):
                print("Query was filtered by input-side guardrails")
            else:
                # Calculate similarity for non-filtered results
                from difflib import SequenceMatcher
                similarity = SequenceMatcher(None, result["answer"].lower(), ground_truth.lower()).ratio()
                is_correct = similarity > 0.5
                print(f"Similarity: {similarity:.2f}")
                print(f"Correct: {'Yes' if is_correct else 'No'}")
        except Exception as e:
            print(f"Error processing question: {e}")
            
except FileNotFoundError:
    print(f"Official questions file not found at {official_questions_file}")
    print("Creating sample official questions...")
    
    # Create sample official questions if the file doesn't exist
    sample_official_questions = [
        {
            "question": "What was the revenue in the most recent fiscal year?",
            "answer": "The revenue for the most recent fiscal year was $1,250 million.",
            "type": "high_confidence"
        },
        {
            "question": "How does the company's profit margin compare to industry average?", 
            "answer": "The company's profit margin was 15% in 2023, which is an improvement from 14.4% in 2022. No specific industry average is provided in the documents.",
            "type": "low_confidence"
        },
        {
            "question": "What is your favorite color?",
            "answer": "I can only answer questions related to financial information in the provided documents.",
            "type": "irrelevant"
        }
    ]
    
    # Save the sample questions (rerun this cell to test them)
    with open(official_questions_file, 'w', encoding='utf-8') as f:
        json.dump(sample_official_questions, f, indent=2)
    
    print("Sample questions created. Rerun this cell to test them.")


## Summary

In this notebook, we've implemented the Fine-Tuned model for financial Q&A using the current codebase:

### Key Updates Made:
- **Unified Model Architecture**: The `FineTuner` class now handles both training and inference, eliminating the need for a separate `FineTunedModel` class
- **PEFT Integration**: Parameter-Efficient Fine-Tuning (PEFT) with LoRA is properly integrated for efficient training
- **Enhanced Error Handling**: Each step is wrapped in `try`/`except` with traceback output, and the official-questions cell creates a sample file when none exists
- **Guardrails Implementation**: Input filtering and output validation to handle irrelevant queries and reduce hallucinations
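
The guardrail logic itself lives inside `FineTuner` and is not shown in this notebook. As a minimal sketch of what an input-side filter might look like, the example below rejects queries that contain no financial keyword. The keyword list and the names `is_financial_query` and `guarded_answer` are hypothetical. The refusal text is taken from the sample official questions above.

```python
import re

# Hypothetical keyword list; a real filter would likely be broader
FINANCIAL_KEYWORDS = {
    "revenue", "profit", "assets", "income", "margin",
    "liabilities", "cash", "earnings", "expenses", "debt",
}

REFUSAL = (
    "I can only answer questions related to financial information "
    "in the provided documents."
)


def is_financial_query(query):
    """Return True if the query mentions at least one financial keyword."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return bool(words & FINANCIAL_KEYWORDS)


def guarded_answer(query, answer_fn):
    """Call answer_fn only for queries that pass the input filter."""
    if not is_financial_query(query):
        return {"answer": REFUSAL, "is_filtered": True}
    return {"answer": answer_fn(query), "is_filtered": False}


# Stand-in for the model call, for illustration only
result = guarded_answer("What is your favorite color?", lambda q: "model output")
print(result)
```

A keyword match is easy to audit but misses paraphrases, so it works best as a first pass in front of a more tolerant relevance check.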

### Implementation Features:
1. **Data Loading**: Loaded and analyzed Q&A pairs from the financial dataset
2. **Model Initialization**: Set up FineTuner with DistilGPT2 and PEFT configuration
3. **Training Process**: Fine-tuned the model on the training set using the `fine_tune` method
4. **Testing & Evaluation**: Evaluated on the test set before and after fine-tuning, and spot-checked a few sample queries
5. **Official Questions**: Testing with high-confidence, low-confidence, and irrelevant questions

### Technical Specifications:
- **Base Model**: DistilGPT2 (lightweight, suitable for resource-constrained environments)
- **Fine-Tuning Method**: PEFT with LoRA (efficient parameter updates)
- **Training Data**: Financial Q&A pairs from processed documents
- **Evaluation Metrics**: Accuracy, similarity scores, and response time, plus the `is_filtered` guardrail flag

The fine-tuned model is now ready for comparison with the RAG system.