# 02 - Finetune Sentiment Analysis Model

This notebook demonstrates how to fine-tune a DistilBERT model for sentiment analysis on financial text data. We'll:

1. **Install required packages** - transformers, datasets, evaluate, scikit-learn, accelerate
2. **Load and prepare data** - CSV files with text and sentiment labels
3. **Tokenize the data** - Convert text to model inputs using DistilBERT tokenizer
4. **Train the model** - Fine-tune DistilBERT for configurable epochs using Hugging Face Trainer
5. **Evaluate performance** - Compute accuracy, F1-score, and confusion matrix
6. **Save the model** - Store the fine-tuned model and tokenizer
7. **Test inference speed** - Measure how fast the model can make predictions
8. **Save metrics** - Export performance metrics to JSON

## What is Sentiment Analysis?

Sentiment analysis is the process of determining the emotional tone or attitude expressed in text. In financial contexts, this helps us understand whether news, tweets, or reports are positive, negative, or neutral about market conditions.


In [None]:
# Step 1: Training Parameters
# Configure training parameters here for easy modification

EPOCHS = 3
BATCH_SIZE = 16
MAX_LEN = 128
LEARNING_RATE = 2e-5
WARMUP_STEPS = 500
WEIGHT_DECAY = 0.01

print("[CONFIG] Training Parameters:")
print(f"  Epochs: {EPOCHS}")
print(f"  Batch Size: {BATCH_SIZE}")
print(f"  Max Length: {MAX_LEN}")
print(f"  Learning Rate: {LEARNING_RATE}")
print(f"  Warmup Steps: {WARMUP_STEPS}")
print(f"  Weight Decay: {WEIGHT_DECAY}")



## Step 2: Install Required Packages

First, let's install the necessary dependencies for our sentiment analysis project.


In [None]:
# Install required packages
import subprocess
import sys

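def install_package(package, import_name=None):
    """Install a package with pip if its import name cannot be imported."""
    import_name = import_name or package
    try:
        __import__(import_name)
        print(f"[OK] {package} already available")
        return True
    except ImportError:
        print(f"[WORKING] Installing {package}...")
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            print(f"[OK] {package} installed successfully")
            return True
        except subprocess.CalledProcessError as e:
            print(f"[ERROR] Failed to install {package}: {e}")
            return False

# Map pip package names to import names (they differ for scikit-learn)
packages = {
    "transformers": "transformers",
    "datasets": "datasets",
    "evaluate": "evaluate",
    "scikit-learn": "sklearn",
    "accelerate": "accelerate",
}
failed = [pkg for pkg, mod in packages.items() if not install_package(pkg, mod)]

# Only report success when every install actually succeeded
if failed:
    print(f"[ERROR] Could not install: {', '.join(failed)}")
else:
    print("[SUCCESS] All packages ready!")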


In [None]:
# Import Libraries and Set Up
# Import all necessary libraries for our sentiment analysis project

import pandas as pd
import numpy as np
import json
import time
import os
from pathlib import Path

# PyTorch (used directly in the inference speed test)
import torch

# Transformers and training
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification, 
    TrainingArguments, 
    Trainer,
    DataCollatorWithPadding
)

# Datasets and evaluation
from datasets import Dataset
import evaluate
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

print("[OK] All libraries imported successfully!")
print(f"[INFO] Current working directory: {os.getcwd()}")


## Step 3: Load and Explore the Data

We'll load our sentiment analysis dataset from CSV files. The data should have:
- **text**: The financial text to analyze (news, tweets, reports, etc.)
- **label**: Sentiment labels (0=negative, 1=neutral, 2=positive)

Let's load the train, validation, and test sets and explore their structure.
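
Before loading the real files, it can help to check that a CSV follows this schema. The sketch below is a minimal, standard-library illustration; the `validate_rows` helper and the sample rows are made up for this example and are not part of the dataset. It assumes one record per line (no multi-line quoted fields).

```python
import csv
import io

VALID_LABELS = {0, 1, 2}  # 0=negative, 1=neutral, 2=positive

def validate_rows(csv_text):
    """Return (line_number, problem) tuples for a CSV with text,label columns."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"text", "label"} - set(reader.fieldnames or [])
    if missing:
        return [(1, f"missing columns: {sorted(missing)}")]
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        if not (row["text"] or "").strip():
            problems.append((line_no, "empty text"))
        try:
            label = int(row["label"])
        except (TypeError, ValueError):
            problems.append((line_no, f"non-integer label: {row['label']!r}"))
            continue
        if label not in VALID_LABELS:
            problems.append((line_no, f"label out of range: {label}"))
    return problems

# Hypothetical sample rows, two valid and two invalid
sample = (
    "text,label\n"
    "Shares rallied after the earnings beat.,2\n"
    "The company kept its guidance unchanged.,1\n"
    ",0\n"
    "Profit warning issued for the third quarter.,3\n"
)
for line_no, problem in validate_rows(sample):
    print(f"line {line_no}: {problem}")
```

The same checks can be expressed with pandas once the DataFrames are loaded; the plain-Python version just makes each rule explicit.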


In [None]:
# Load the datasets
data_dir = Path("..") / "data"  # Go up one level from notebooks/ directory
train_df = pd.read_csv(data_dir / "finance_sentiment_train.csv")
val_df = pd.read_csv(data_dir / "finance_sentiment_val.csv")
test_df = pd.read_csv(data_dir / "finance_sentiment_test.csv")

print("[DATA] Dataset Overview:")
print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Test samples: {len(test_df)}")
print()

# Display sample data
print("[INFO] Sample training data:")
print(train_df.head())
print()

# Check label distribution
print("[STATS] Label distribution in training set:")
print(train_df['label'].value_counts().sort_index())
print()

# Check for missing values
print("[CHECK] Data quality check:")
print(f"Missing values in train: {train_df.isnull().sum().sum()}")
print(f"Missing values in val: {val_df.isnull().sum().sum()}")
print(f"Missing values in test: {test_df.isnull().sum().sum()}")


## Step 4: Set Up DistilBERT Tokenizer

DistilBERT is a distilled version of BERT: its authors report a model that is about 40% smaller and 60% faster while retaining roughly 97% of BERT's language-understanding performance. We'll use it for our sentiment analysis task.

The tokenizer converts text into tokens (subwords) that the model can understand. We'll also set up the model architecture for sequence classification.
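
To make the idea of subword tokens concrete, here is a toy sketch of greedy longest-match-first splitting in the style of WordPiece. The vocabulary is invented for illustration and is far smaller than the real one; the actual DistilBERT tokenizer also lowercases text, splits punctuation, and adds special tokens, none of which is shown here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, not the real DistilBERT vocabulary
toy_vocab = {"market", "stock", "out", "##per", "##form", "##ing", "##s"}
for word in ["market", "markets", "outperforming", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Words that are in the vocabulary stay whole, rarer words are split into known pieces, and anything that cannot be covered falls back to the unknown token.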


In [None]:
# Set up model and tokenizer
model_name = "distilbert-base-uncased"
print(f"🤖 Loading {model_name}...")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model for sequence classification (3 classes: negative, neutral, positive)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=3,
    id2label={0: "negative", 1: "neutral", 2: "positive"},
    label2id={"negative": 0, "neutral": 1, "positive": 2}
)

print("✅ Model and tokenizer loaded successfully!")
print(f"📏 Model parameters: {model.num_parameters():,}")
print(f"🔤 Tokenizer vocab size: {tokenizer.vocab_size:,}")

# Test tokenization
sample_text = "The stock market is performing exceptionally well today!"
tokens = tokenizer(sample_text, return_tensors="pt")
print(f"\n🧪 Tokenization test:")
print(f"Original text: {sample_text}")
print(f"Token IDs: {tokens['input_ids'].squeeze().tolist()}")
print(f"Attention mask: {tokens['attention_mask'].squeeze().tolist()}")


## Step 5: Prepare and Tokenize the Data

Now we'll convert our pandas DataFrames into Hugging Face Dataset objects and tokenize the text. This step is crucial for preparing the data in the format the model expects.

We'll create a tokenization function that:
1. Takes a batch of examples containing text and labels
2. Tokenizes the text with the DistilBERT tokenizer
3. Truncates inputs that exceed the maximum length, while padding is applied per batch
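
Truncation and padding can be illustrated without the tokenizer itself. The sketch below works on made-up token ID lists and assumes a padding ID of 0; the `truncate_and_pad` helper is hypothetical. Note that the real tokenizer keeps the closing special token when it truncates, which this simplified slice does not.

```python
def truncate_and_pad(batch_ids, max_len, pad_id=0):
    """Truncate each sequence to max_len, then pad to the longest sequence in the batch."""
    truncated = [ids[:max_len] for ids in batch_ids]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# Made-up token IDs for three texts of different lengths
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
input_ids, attention_mask = truncate_and_pad(batch, max_len=5)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Padding only to the longest sequence in each batch, rather than to a fixed maximum, keeps batches of short texts small and speeds up training.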


In [None]:
# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

print("📦 Datasets converted to Hugging Face format")
print(f"Train dataset: {len(train_dataset)} samples")
print(f"Validation dataset: {len(val_dataset)} samples")
print(f"Test dataset: {len(test_dataset)} samples")

# Define tokenization function
def tokenize_function(examples):
    """Tokenize a batch of texts, truncating each to MAX_LEN tokens."""
    # Padding and return_tensors are omitted on purpose: Dataset.map stores
    # plain Python lists, and DataCollatorWithPadding pads each batch later.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=MAX_LEN,  # Configured in Step 1; DistilBERT accepts at most 512 tokens
    )

# Tokenize all datasets
print("\n🔄 Tokenizing datasets...")
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

print("✅ Tokenization complete!")
# Indexing a Dataset returns plain lists, so report lengths with len()
print(f"📊 Sample tokenized input length: {len(train_dataset[0]['input_ids'])}")
print(f"📊 Sample attention mask length: {len(train_dataset[0]['attention_mask'])}")
print(f"📊 Sample label: {train_dataset[0]['label']}")


## Step 6: Set Up Training Configuration

Before we start training, we need to configure the training arguments. This includes:
- **Learning rate**: How fast the model learns (too high = unstable, too low = slow)
- **Batch size**: How many examples to process at once
- **Epochs**: How many times to see the entire dataset
- **Evaluation strategy**: When to check model performance
- **Logging**: How often to log training progress
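
To see how the learning rate and warmup steps interact, the sketch below computes the learning rate at a given step, assuming the Trainer's default linear schedule (linear warmup, then linear decay to zero). The helper names and the dataset size of 4,000 are assumptions for illustration; the real sample count comes from `len(train_df)`.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

def total_training_steps(num_samples, batch_size, epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    return math.ceil(num_samples / batch_size) * epochs

# Hypothetical dataset size, combined with the Step 1 values for the other settings
steps = total_training_steps(num_samples=4000, batch_size=16, epochs=3)
print("total steps:", steps)
for step in (0, 250, 500, steps):
    print(step, linear_schedule_lr(step, 2e-5, 500, steps))
```

One thing worth checking with this arithmetic: if the training set is small, the total number of steps can be lower than `WARMUP_STEPS`. With 1,000 samples, for example, three epochs at batch size 16 give only 189 steps, so training would end while the learning rate is still warming up. In that case a smaller warmup value is probably more appropriate.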


In [None]:
# Set up training arguments using configurable parameters
training_args = TrainingArguments(
    output_dir="./sentiment_model",           # Directory to save model checkpoints
    num_train_epochs=EPOCHS,                 # Number of training epochs (configurable)
    per_device_train_batch_size=BATCH_SIZE,  # Batch size for training (configurable)
    per_device_eval_batch_size=BATCH_SIZE,   # Batch size for evaluation (configurable)
    learning_rate=LEARNING_RATE,             # Learning rate (configurable)
    warmup_steps=WARMUP_STEPS,               # Number of warmup steps (configurable)
    weight_decay=WEIGHT_DECAY,               # Weight decay for regularization (configurable)
    logging_dir="./logs",                    # Directory for logs
    logging_steps=100,                       # Log every 100 steps
    eval_strategy="epoch",                   # Evaluate at the end of each epoch (older transformers releases call this evaluation_strategy)
    save_strategy="epoch",                   # Save model at the end of each epoch
    load_best_model_at_end=True,             # Load the best model at the end
    metric_for_best_model="eval_accuracy",   # Metric to use for best model selection
    greater_is_better=True,                  # Higher accuracy is better
    report_to="none",                        # Disable wandb/tensorboard logging (the string "none", not None)
    seed=42,                                 # Random seed for reproducibility
)

print("[CONFIG] Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Output directory: {training_args.output_dir}")
print(f"  Evaluation strategy: {training_args.eval_strategy}")


## Step 7: Set Up Evaluation Metrics

We need to define how to compute evaluation metrics during training. We'll track:
- **Accuracy**: Percentage of correct predictions
- **F1-score**: Harmonic mean of precision and recall (macro-averaged across all classes)

The `compute_metrics` function will be called automatically during training to evaluate the model.
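
To show what these two numbers mean, here is a small pure-Python sketch that computes them by hand on made-up labels. The helper names are hypothetical; the notebook itself relies on scikit-learn. When every class appears in the data, this should agree with `f1_score(..., average='macro')`.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def macro_f1(labels, predictions, num_classes=3):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for cls in range(num_classes):
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, predictions) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Made-up example with an over-represented neutral class
toy_labels      = [0, 0, 1, 1, 1, 1, 2, 2]
toy_predictions = [0, 1, 1, 1, 1, 1, 2, 0]
print(f"accuracy: {accuracy(toy_labels, toy_predictions):.3f}")
print(f"macro F1: {macro_f1(toy_labels, toy_predictions):.3f}")
```

In this example accuracy is 0.75 while macro F1 is about 0.685, because macro averaging gives the weaker negative and positive classes the same weight as the well-predicted neutral class. That gap is why tracking both metrics is useful when classes are imbalanced.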


In [None]:
# Define evaluation metrics
def compute_metrics(eval_pred):
    """Compute accuracy and F1-score for evaluation"""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='macro')
    
    return {
        'accuracy': accuracy,
        'f1_macro': f1
    }

print("📊 Evaluation metrics configured:")
print("- Accuracy: Percentage of correct predictions")
print("- F1-macro: Macro-averaged F1-score across all classes")
print("✅ Ready for training!")


## Step 8: Train the DistilBERT Model

Now it's time to train our sentiment analysis model! This is where the magic happens:

1. **Data Collator**: Handles dynamic padding for efficient batching
2. **Trainer**: Manages the entire training process
3. **Training**: Fine-tune DistilBERT for `EPOCHS` epochs (3 by default) on our financial sentiment data

The training process will:
- Process batches of text through the model
- Compute loss and gradients
- Update model weights
- Evaluate on validation set after each epoch
- Save the best performing model
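
The "compute loss and gradients, then update" cycle can be sketched on a single example. The code below computes the softmax cross-entropy loss for three class scores and takes one plain gradient step. As a simplification for illustration, it treats the logits themselves as the parameters; in real training the gradient flows back through all of DistilBERT's weights and the optimizer is more sophisticated than this update rule.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

def cross_entropy_grad(logits, label):
    """Gradient of the loss with respect to the logits: probabilities minus one-hot target."""
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(softmax(logits))]

# Made-up logits for one example whose true label is 2 (positive)
logits, label, step_size = [2.0, 0.5, -1.0], 2, 0.5
loss_before = cross_entropy(logits, label)
grad = cross_entropy_grad(logits, label)
updated = [z - step_size * g for z, g in zip(logits, grad)]
loss_after = cross_entropy(updated, label)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

The update raises the score of the correct class and lowers the others, so the loss on this example goes down after the step.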


In [None]:
# Set up data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=tokenizer,  # Replaces the deprecated tokenizer= argument in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("🚀 Starting training...")
print("⏱️ This may take several minutes depending on your hardware")
print("📊 Training progress will be displayed below:")
print()

# Start training
trainer.train()

print("\n✅ Training completed successfully!")
print(f"💾 Checkpoints saved to: {training_args.output_dir}")


## Step 9: Evaluate Model Performance

Now let's evaluate our fine-tuned model on the test set to see how well it performs. We'll compute:

1. **Accuracy**: Overall correctness
2. **Macro F1-score**: Balanced performance across all sentiment classes
3. **Confusion Matrix**: Detailed breakdown of predictions vs. actual labels

This gives us a comprehensive view of the model's performance on unseen data.


In [None]:
# Evaluate on test set
print("🧪 Evaluating model on test set...")
test_results = trainer.evaluate(test_dataset)

print("\n📊 Test Set Performance:")
print(f"🎯 Accuracy: {test_results['eval_accuracy']:.4f} ({test_results['eval_accuracy']*100:.2f}%)")
print(f"📈 Macro F1-Score: {test_results['eval_f1_macro']:.4f}")

# Get predictions for confusion matrix
print("\n🔍 Computing detailed predictions...")
test_predictions = trainer.predict(test_dataset)
test_preds = np.argmax(test_predictions.predictions, axis=1)
test_labels = test_predictions.label_ids

# Compute confusion matrix
cm = confusion_matrix(test_labels, test_preds)
class_names = ['Negative', 'Neutral', 'Positive']

print(f"\n📋 Confusion Matrix:")
print("Rows = Actual, Columns = Predicted")
print("     ", " ".join([f"{name:>8}" for name in class_names]))
for i, name in enumerate(class_names):
    print(f"{name:>8}: {' '.join([f'{cm[i,j]:>8}' for j in range(len(class_names))])}")

# Calculate per-class metrics
print(f"\n📊 Per-class Performance:")
for i, class_name in enumerate(class_names):
    precision = cm[i, i] / cm[:, i].sum() if cm[:, i].sum() > 0 else 0
    recall = cm[i, i] / cm[i, :].sum() if cm[i, :].sum() > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    print(f"{class_name:>8}: Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")


## Step 10: Visualize Confusion Matrix

Let's create a visual representation of the confusion matrix to better understand the model's performance. This heatmap will show us:
- **Diagonal elements**: Correct predictions (higher = better)
- **Off-diagonal elements**: Misclassifications (lower = better)
- **Color intensity**: Number of samples in each category
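
For readers who want to see how the matrix behind the heatmap is built, here is a pure-Python sketch on made-up labels. The helper names are hypothetical, and the notebook itself uses scikit-learn's `confusion_matrix`. Row normalization divides each row by the number of actual samples in that class, so the diagonal of the normalized matrix is the per-class recall.

```python
def confusion_matrix_counts(labels, predictions, num_classes=3):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(labels, predictions):
        matrix[actual][predicted] += 1
    return matrix

def normalize_rows(matrix):
    """Divide each row by its total; rows with no samples stay all zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

# Made-up labels: 0=negative, 1=neutral, 2=positive
toy_labels      = [0, 0, 1, 1, 1, 1, 2, 2]
toy_predictions = [0, 1, 1, 1, 1, 1, 2, 0]
toy_cm = confusion_matrix_counts(toy_labels, toy_predictions)
toy_cm_normalized = normalize_rows(toy_cm)
for counts, fractions in zip(toy_cm, toy_cm_normalized):
    print(counts, fractions)
```

The guard for empty rows matters when a class is missing from the evaluation set, since dividing by a zero row total would otherwise fail or produce NaN values.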


In [None]:
# Create confusion matrix visualization
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix - Sentiment Analysis Model')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.tight_layout()
plt.show()

# Calculate and display normalized confusion matrix
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

plt.figure(figsize=(8, 6))
sns.heatmap(cm_normalized, annot=True, fmt='.3f', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.title('Normalized Confusion Matrix - Sentiment Analysis Model')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.tight_layout()
plt.show()

print("📊 Confusion Matrix Analysis:")
print("- Darker blue = more samples")
print("- Diagonal = correct predictions")
print("- Off-diagonal = misclassifications")


## Step 11: Save the Fine-tuned Model

Now we'll save our trained model and tokenizer to the `models/sentiment-finetuned` directory. This allows us to:
- **Reuse the model** for future predictions
- **Share the model** with others
- **Deploy the model** in production applications

The saved model includes both the architecture and the learned weights.


In [None]:
# Create models directory if it doesn't exist
model_save_path = Path("models/sentiment-finetuned")
model_save_path.mkdir(parents=True, exist_ok=True)

print(f"💾 Saving model to: {model_save_path}")

# Save the model and tokenizer
trainer.save_model(str(model_save_path))
tokenizer.save_pretrained(str(model_save_path))

print("✅ Model and tokenizer saved successfully!")
print(f"📁 Files saved in: {model_save_path}")

# List the saved files
saved_files = list(model_save_path.glob("*"))
print(f"\n📋 Saved files:")
for file in saved_files:
    print(f"  - {file.name}")

# Verify the model can be loaded
print(f"\n🔍 Verifying saved model...")
try:
    # Load the saved model
    loaded_model = AutoModelForSequenceClassification.from_pretrained(str(model_save_path))
    loaded_tokenizer = AutoTokenizer.from_pretrained(str(model_save_path))
    print("✅ Model and tokenizer loaded successfully from saved files!")
    print(f"📏 Loaded model parameters: {loaded_model.num_parameters():,}")
except Exception as e:
    print(f"❌ Error loading saved model: {e}")


## Step 12: Test Inference Speed

In real-world applications, inference speed is crucial. We'll measure how fast our model can make predictions on individual examples. This is important for:
- **Real-time applications**: Chatbots, live sentiment analysis
- **Batch processing**: Analyzing large volumes of text
- **User experience**: Fast response times

We'll test with multiple examples and calculate the average inference time in milliseconds.
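
The measurement pattern can be sketched independently of the model. The example below times a stand-in function with `time.perf_counter`, runs a few untimed warm-up calls first, and reports several statistics. Both `benchmark` and `fake_predict` are hypothetical helpers for illustration; a real measurement would call the model instead.

```python
import statistics
import time

def benchmark(fn, inputs, warmup=2, repeats=5):
    """Time fn(x) for each input in milliseconds, after a few untimed warm-up calls."""
    for x in inputs[:warmup]:
        fn(x)  # warm-up: the first calls often include one-time setup costs
    timings_ms = []
    for _ in range(repeats):
        for x in inputs:
            start = time.perf_counter()
            fn(x)
            timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
        "runs": len(timings_ms),
    }

def fake_predict(text):
    """Stand-in for a model call, only here to make the sketch runnable."""
    return sum(ord(ch) for ch in text) % 3

stats = benchmark(fake_predict, ["Shares rose.", "Profits fell sharply."])
print(stats)
```

Reporting the median alongside the mean is a reasonable choice here, because a single slow call (for example the first one after loading the model) can pull the average up noticeably.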


In [None]:
# Test inference speed
print("⚡ Testing inference speed...")

# Sample texts for testing
test_texts = [
    "The stock market is performing exceptionally well today!",
    "This company's financial results are disappointing.",
    "The quarterly earnings report shows steady growth.",
    "Market volatility is causing significant uncertainty.",
    "Investors are optimistic about the new policy changes."
]

# Load the saved model for inference
inference_model = AutoModelForSequenceClassification.from_pretrained(str(model_save_path))
inference_tokenizer = AutoTokenizer.from_pretrained(str(model_save_path))

# Set model to evaluation mode
inference_model.eval()

# Measure inference time
inference_times = []
predictions = []

print(f"\n🧪 Testing inference on {len(test_texts)} examples:")

# Map class ids to display names once, outside the timing loop
sentiment_labels = {0: "Negative", 1: "Neutral", 2: "Positive"}

for i, text in enumerate(test_texts):
    # Tokenize input
    inputs = inference_tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    # Measure inference time with a high-resolution clock
    # (the first iteration may be slower because it includes one-time warm-up costs)
    start_time = time.perf_counter()

    with torch.no_grad():  # Disable gradient computation for faster inference
        outputs = inference_model(**inputs)
        prediction = torch.argmax(outputs.logits, dim=-1).item()

    end_time = time.perf_counter()
    inference_time_ms = (end_time - start_time) * 1000
    inference_times.append(inference_time_ms)
    predictions.append(prediction)

    # Get sentiment label
    predicted_sentiment = sentiment_labels[prediction]

    print(f"  {i+1}. '{text[:50]}...' → {predicted_sentiment} ({inference_time_ms:.2f}ms)")

# Calculate statistics
avg_inference_time = np.mean(inference_times)
min_inference_time = np.min(inference_times)
max_inference_time = np.max(inference_times)

print(f"\n📊 Inference Speed Results:")
print(f"⚡ Average inference time: {avg_inference_time:.2f}ms")
print(f"🏃 Fastest inference: {min_inference_time:.2f}ms")
print(f"🐌 Slowest inference: {max_inference_time:.2f}ms")
print(f"📈 Throughput: {1000/avg_inference_time:.1f} predictions/second")

# Store the average inference time for metrics
inference_ms = avg_inference_time


## Step 13: Save Performance Metrics

Finally, let's save all our performance metrics to a JSON file. This creates a permanent record of our model's performance that can be:
- **Compared** with other models
- **Tracked** over time as we improve the model
- **Shared** with stakeholders
- **Used** for model selection and deployment decisions

The metrics file will include accuracy, F1-score, and inference speed.


In [None]:
# Compile all metrics
metrics = {
    "accuracy": float(test_results['eval_accuracy']),
    "f1_macro": float(test_results['eval_f1_macro']),
    "inference_ms": float(inference_ms)
}

# Save metrics to JSON file
metrics_file = "metrics.json"
with open(metrics_file, 'w') as f:
    json.dump(metrics, f, indent=2)

print("📊 Performance Metrics Summary:")
print(f"🎯 Accuracy: {metrics['accuracy']:.4f} ({metrics['accuracy']*100:.2f}%)")
print(f"📈 Macro F1-Score: {metrics['f1_macro']:.4f}")
print(f"⚡ Average Inference Time: {metrics['inference_ms']:.2f}ms")
print(f"💾 Metrics saved to: {metrics_file}")

# Display the saved metrics
print(f"\n📋 Saved metrics content:")
with open(metrics_file, 'r') as f:
    print(f.read())

print("\n🎉 Sentiment Analysis Model Training Complete!")
print("=" * 50)
print("✅ Model trained and evaluated successfully")
print("✅ Model saved to models/sentiment-finetuned/")
print("✅ Performance metrics saved to metrics.json")
print("✅ Ready for production use!")


## 🎯 Summary

Congratulations! You've successfully fine-tuned a DistilBERT model for financial sentiment analysis. Here's what we accomplished:

### ✅ What We Built
- **Fine-tuned DistilBERT** for sentiment classification on financial text
- **Trained for 3 epochs** with proper validation
- **Comprehensive evaluation** with accuracy, F1-score, and confusion matrix
- **Saved model and tokenizer** for future use
- **Measured inference speed** for production readiness
- **Documented performance metrics** in JSON format

### 📊 Key Features
- **Beginner-friendly**: Step-by-step explanations
- **Production-ready**: Includes speed testing and model saving
- **Comprehensive evaluation**: Multiple metrics and visualizations
- **Reproducible**: Fixed random seeds and clear documentation

### 🚀 Next Steps
1. **Deploy the model** in a web application or API
2. **Test on new data** to validate performance
3. **Experiment with hyperparameters** to improve results
4. **Try different models** (BERT, RoBERTa, etc.) for comparison
5. **Collect more data** to further improve performance

### 📁 Files Created
- `models/sentiment-finetuned/` - Saved model and tokenizer
- `metrics.json` - Performance metrics
- Training logs in `./logs/` directory

The model is now ready for real-world sentiment analysis tasks!
