# Quick Start: LLaMA 3 Finance Robustness Benchmarking

**Author**: Emmanuel Kwadwo Kusi  
**Project**: Benchmarking LLaMA 3 Robustness in Finance via Prompt Perturbations

This notebook walks through the workflow for evaluating how robust LLaMA 3 is to paraphrased financial prompts, using semantic entropy to measure how consistent its responses are.

## Overview

1. **Setup**: Import libraries and configure the environment
2. **Data**: Download and preprocess financial datasets
3. **Prompts**: Generate paraphrased variants
4. **Sampling**: Run LLaMA 3 inference
5. **Semantic entropy**: Cluster responses and compute entropy
6. **Robustness**: Compute robustness metrics
7. **Visualization**: Plot the results

## 1. Setup

In [None]:
# Import libraries
import sys
import os
from pathlib import Path

# Add src to path (assumes the notebook runs from a subdirectory of the project root)
sys.path.append(str(Path.cwd().parent / 'src'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries imported successfully")

In [None]:
# Configuration
DATA_DIR = Path('../data')
RESULTS_DIR = Path('../results')

# Create directories
for dir_path in [DATA_DIR / 'raw', DATA_DIR / 'processed', DATA_DIR / 'prompts',
                 RESULTS_DIR / 'raw_outputs', RESULTS_DIR / 'metrics', RESULTS_DIR / 'figures']:
    dir_path.mkdir(parents=True, exist_ok=True)

print("✓ Directories created")

## 2. Data Acquisition and Preprocessing

In [None]:
# Import data modules
from data.download_datasets import DatasetDownloader
from data.preprocess import FinanceDataPreprocessor

# Download datasets (first time only)
print("Downloading datasets...")
downloader = DatasetDownloader(output_dir=str(DATA_DIR / 'raw'))

# Download FinQA
finqa_df = downloader.download_finqa()
print(f"FinQA: {len(finqa_df)} samples")

# Generate summary
summary = downloader.generate_summary()
display(summary)

In [None]:
# Preprocess data
print("Preprocessing datasets...")
preprocessor = FinanceDataPreprocessor(
    input_dir=str(DATA_DIR / 'raw'),
    output_dir=str(DATA_DIR / 'processed')
)

# Process FinQA
processed_df = preprocessor.process_finqa(max_samples=1000)
print(f"Processed: {len(processed_df)} samples")

# Extract seed prompts
seed_prompts = preprocessor.extract_seed_prompts(n_prompts=10)
print(f"Extracted {len(seed_prompts)} seed prompts")

# Display sample seeds
display(seed_prompts.head())

## 3. Prompt Generation

In [None]:
# Import prompt generator
from models.prompt_generator import PromptParaphraser

# Initialize paraphraser
print("Loading paraphrase models...")
paraphraser = PromptParaphraser(paraphrase_method='backtranslation')

print("✓ Paraphraser ready")

In [None]:
# Generate variants for first seed prompt
example_prompt = seed_prompts.iloc[0]['question']
print(f"Original prompt: {example_prompt}")
print("\nGenerating variants...")

variants = paraphraser.generate_variants(
    prompt=example_prompt,
    num_variants=5,
    min_similarity=0.85
)

# Display variants
for i, variant in enumerate(variants, start=1):
    print(f"\nVariant {i} (similarity: {variant['similarity']:.3f}):")
    print(f"  {variant['text']}")
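
The `min_similarity=0.85` argument suggests that candidate paraphrases are kept only when they stay close in meaning to the original prompt. The internals of `PromptParaphraser` are not shown in this notebook, so the following is only a minimal sketch of such a filter. It assumes cosine similarity as the measure and uses a toy token-count vector in place of a real sentence embedder; `bow_vector`, `cosine_similarity`, and `filter_variants` are hypothetical names.

```python
import math
from collections import Counter


def bow_vector(text):
    """Toy embedding: lowercase token counts (a stand-in for a sentence embedder)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_variants(prompt, candidates, min_similarity=0.85):
    """Keep candidates whose similarity to the prompt meets the threshold."""
    base = bow_vector(prompt)
    kept = []
    for text in candidates:
        similarity = cosine_similarity(base, bow_vector(text))
        if similarity >= min_similarity:
            kept.append({"text": text, "similarity": similarity})
    return kept
```

Token counts only reward shared wording, so a real paraphrase filter would need dense sentence embeddings to recognize rewordings that preserve meaning.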

## 4. LLM Sampling (Demo)

**Note**: The sampling code below is commented out because running LLaMA 3 inference requires significant computational resources. This section only shows how the sampler would be called.

In [None]:
# Uncomment to run actual LLaMA 3 sampling
# from models.llm_runner import LLaMARunner

# # Initialize runner
# runner = LLaMARunner(
#     model_name="meta-llama/Meta-Llama-3-8B-Instruct",
#     load_in_4bit=True
# )

# # Sample responses
# responses = runner.sample_multiple(
#     prompt=example_prompt,
#     num_samples=5,
#     temperature=0.7
# )

# for i, response in enumerate(responses):
#     print(f"\nResponse {i+1}:")
#     print(response)

print("Skipping LLM sampling in demo mode.")
print("To run the full pipeline, use: bash run_pipeline.sh")
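
The sampler is called with `temperature=0.7`, and this setting is what makes repeated samples for the same prompt differ. As a rough illustration of the general technique (not the code inside `LLaMARunner`), the sketch below divides next-token logits by the temperature before applying softmax. The function name and the example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
example_logits = [2.0, 1.0, 0.0]
sharper = softmax_with_temperature(example_logits, temperature=0.7)
flatter = softmax_with_temperature(example_logits, temperature=1.5)
```

Lower temperatures concentrate probability on the top token, so samples agree more often. Higher temperatures flatten the distribution and produce more varied responses.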

## 5. Semantic Entropy Analysis

In [None]:
# Load sample outputs (if available)
outputs_path = RESULTS_DIR / 'raw_outputs' / 'llama3_outputs.csv'

if outputs_path.exists():
    outputs_df = pd.read_csv(outputs_path)
    print(f"Loaded {len(outputs_df)} outputs")
    display(outputs_df.head())
else:
    print("No outputs found. Run full pipeline first.")
    # Create sample data for demonstration
    outputs_df = pd.DataFrame({
        'family_id': ['family_001'] * 20,
        'variant_id': [0] * 20,
        'sample_id': range(20),
        'prompt_text': [example_prompt] * 20,
        'response': [f"Sample response {i}" for i in range(20)]
    })
    print("Using sample data for demonstration")
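
The sample columns suggest a three-level layout: a prompt family contains paraphrase variants, and each variant has several sampled responses. That reading is an assumption based on the column names alone. Under it, the sketch below groups responses per `(family_id, variant_id)` pair, which is the unit an entropy calculation would presumably work on. It uses plain dictionaries rather than the pipeline's own code, and `group_responses` is a hypothetical helper.

```python
from collections import defaultdict


def group_responses(rows):
    """Group response texts by (family_id, variant_id)."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["family_id"], row["variant_id"])
        groups[key].append(row["response"])
    return dict(groups)


# Toy rows with the same columns as the demonstration DataFrame.
demo_rows = [
    {"family_id": "family_001", "variant_id": v, "sample_id": s,
     "response": f"Sample response {s}"}
    for v in range(2)
    for s in range(3)
]
grouped = group_responses(demo_rows)
```

With a pandas DataFrame, `outputs_df.to_dict("records")` yields rows in this shape.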

In [None]:
# Compute semantic entropy
from evaluation.entropy_calculator import SemanticEntropy

print("Initializing entropy calculator...")
calculator = SemanticEntropy(
    embedder_model='sentence-transformers/all-MiniLM-L6-v2',
    clustering_method='hdbscan'
)

if outputs_path.exists():
    print("Computing entropy...")
    entropy_df = calculator.compute_all(
        outputs_df=outputs_df,
        output_dir=str(RESULTS_DIR / 'metrics')
    )
    display(entropy_df.head())
else:
    print("Skipping entropy computation (no real outputs)")
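
The configuration above (a sentence embedder plus a clustering method) points to the usual recipe for semantic entropy: embed the responses, cluster them by meaning, and take the Shannon entropy of the cluster sizes. The sketch below covers only that last step and assumes cluster labels are already available. It is not the `SemanticEntropy` implementation, and `cluster_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def cluster_entropy(cluster_labels):
    """Shannon entropy (in nats) of the empirical cluster distribution."""
    n = len(cluster_labels)
    if n == 0:
        return 0.0
    counts = Counter(cluster_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# All responses share one meaning: no uncertainty.
consistent = cluster_entropy([0, 0, 0, 0])
# Responses split evenly between two meanings: entropy equals log(2).
split = cluster_entropy([0, 0, 1, 1])
```

HDBSCAN assigns the label `-1` to noise points. This sketch treats `-1` as one more cluster, which is a simplification a real implementation may handle differently.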

## 6. Robustness Metrics

In [None]:
# Compute robustness
from evaluation.robustness_metric import RobustnessCalculator

calc = RobustnessCalculator()

if outputs_path.exists():
    robustness_df = calc.compute_from_entropy_file(
        entropy_file=str(RESULTS_DIR / 'metrics' / 'entropy_detailed.csv'),
        output_dir=str(RESULTS_DIR / 'metrics')
    )
    display(robustness_df.head())
else:
    print("Skipping robustness computation (no entropy data)")
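
This notebook does not show the formula used by `RobustnessCalculator`. Purely as an illustration, one way to turn per-variant entropies into a robustness score is to normalize each entropy by its maximum, `log(num_samples)`, and then subtract the mean from 1. Everything in the sketch below, including the function names and the formula itself, is an assumption rather than the project's actual metric.

```python
import math


def normalized_entropy(entropy, num_samples):
    """Scale an entropy in nats to [0, 1] using its maximum, log(num_samples)."""
    if num_samples < 2:
        return 0.0
    return min(entropy / math.log(num_samples), 1.0)


def family_robustness(variant_entropies, num_samples):
    """Hypothetical score: 1 minus the mean normalized entropy across variants."""
    if not variant_entropies:
        raise ValueError("need at least one variant entropy")
    total = sum(normalized_entropy(h, num_samples) for h in variant_entropies)
    return 1.0 - total / len(variant_entropies)
```

Under this definition, a score of 1 means every variant produced semantically identical responses, and a score near 0 means responses were maximally scattered.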

## 7. Visualization

In [None]:
# Create visualizations
from visualization.plot_results import RobustnessVisualizer

visualizer = RobustnessVisualizer(output_dir=str(RESULTS_DIR / 'figures'))

if outputs_path.exists():
    print("Generating plots...")
    
    # Entropy heatmap
    visualizer.plot_entropy_heatmap(entropy_df)
    
    # Robustness distribution
    visualizer.plot_robustness_distribution(robustness_df)
    
    # Entropy vs Robustness
    visualizer.plot_entropy_vs_robustness(robustness_df)
    
    print("✓ Visualizations saved to results/figures/")
else:
    print("Run full pipeline to generate visualizations")

## Summary

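This notebook walked through the workflow. Steps marked ⏭️ are skipped in demo mode:

1. ✅ Dataset acquisition and preprocessing
2. ✅ Prompt variant generation via paraphrasing
3. ⏭️ LLM sampling (requires the full pipeline)
4. ⏭️ Semantic entropy computation (runs only when `results/raw_outputs/llama3_outputs.csv` exists)
5. ⏭️ Robustness metric calculation (runs only when entropy results exist)
6. ⏭️ Visualization generation (runs only when entropy and robustness results exist)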

### Next Steps

To run the full pipeline:

```bash
# Linux/Mac
bash run_pipeline.sh

# Windows
run_pipeline.bat
```

Or use the Python scripts individually:

```bash
python src/data/download_datasets.py
python src/data/preprocess.py
python src/models/prompt_generator.py
python src/models/llm_runner.py
python src/evaluation/entropy_calculator.py
python src/evaluation/robustness_metric.py
python src/visualization/plot_results.py
```

### Resources

- **Documentation**: `docs/METHODOLOGY.md`
- **Configuration**: `config/config.yaml`
- **Full README**: `README.md` (in the project root, like the paths above)
