# 🎯 Emotion Classification Modeling Pipeline

This notebook implements a comprehensive modeling pipeline for emotion classification using the GoEmotions dataset. It replicates and extends the functionality from `emotion_xai/data/preprocessing.py` in an interactive environment.

## 📋 Notebook Overview

1. **Import Required Libraries** - Load necessary packages and modules
2. **Load and Explore Data** - Load raw data and perform initial analysis
3. **Data Preprocessing Pipeline** - Implement comprehensive data cleaning and transformation
4. **Feature Engineering** - Extract and prepare features for emotion analysis
5. **Model Training** - Train baseline and advanced models
6. **Model Evaluation** - Comprehensive performance analysis
7. **Save Processed Data** - Export preprocessed data and models to `data/processed/`

---

## 1. Import Required Libraries

Import all necessary libraries for data processing, modeling, and visualization.

In [1]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import re
import warnings
from datetime import datetime
from typing import Dict, List, Tuple, Optional, Any, Union
import time

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                           f1_score, classification_report, confusion_matrix)
from sklearn.model_selection import train_test_split
import joblib

# Text processing
import string
from collections import Counter

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Project modules - Add project root to path
import sys
sys.path.insert(0, '..')

# Import our custom modules
from emotion_xai.data.preprocessing import (
    load_dataset, assess_text_quality, clean_text, 
    filter_quality_issues, prepare_emotion_labels, 
    split_dataset, prepare_features, save_preprocessing_results,
    DataQualityMetrics, EMOTION_COLUMNS
)
from emotion_xai.models.baseline import BaselineModel, save_evaluation_results

# Configure display and warnings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('default')
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")
print(f"📊 Emotion columns available: {len(EMOTION_COLUMNS)}")

✅ All libraries imported successfully!
📊 Emotion columns available: 28


## 2. Load and Explore Data

Load the GoEmotions dataset and perform initial exploration to understand the data structure.

In [2]:
# Setup paths
data_path = Path('../data/raw/goemotions.csv')
processed_data_dir = Path('../data/processed')
results_dir = Path('../results')

# Create directories if they don't exist
processed_data_dir.mkdir(parents=True, exist_ok=True)
(results_dir / 'metrics').mkdir(parents=True, exist_ok=True)
(results_dir / 'plots' / 'modeling').mkdir(parents=True, exist_ok=True)

# Load the dataset
print("📥 Loading GoEmotions dataset...")
start_time = time.time()
df = load_dataset(data_path)
load_time = time.time() - start_time

print(f"✅ Dataset loaded in {load_time:.2f} seconds")
print(f"📊 Dataset shape: {df.shape}")
print(f"📝 Columns: {list(df.columns[:10])}{'...' if len(df.columns) > 10 else ''}")

# Display basic information
print("\n" + "="*50)
print("DATASET OVERVIEW")
print("="*50)
print(f"Total samples: {len(df):,}")
print(f"Total columns: {len(df.columns)}")
print(f"Text column: {'text' in df.columns}")
print(f"Emotion columns: {len(EMOTION_COLUMNS)}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

📥 Loading GoEmotions dataset...
📊 Dataset loaded: 211,225 samples with 28 emotion labels
✅ Dataset loaded in 0.59 seconds
📊 Dataset shape: (211225, 37)
📝 Columns: ['text', 'id', 'author', 'subreddit', 'link_id', 'parent_id', 'created_utc', 'rater_id', 'example_very_unclear', 'admiration']...

DATASET OVERVIEW
Total samples: 211,225
Total columns: 37
Text column: True
Emotion columns: 28
Memory usage: 144.5 MB


In [3]:
# Sample data preview
print("📋 Sample data preview:")
print("\nFirst 3 text samples:")
for i, text in enumerate(df['text'].head(3), 1):
    print(f"{i}. {text[:100]}{'...' if len(text) > 100 else ''}")

print(f"\n🏷️ Emotion columns present: {[col for col in EMOTION_COLUMNS if col in df.columns][:5]}...")
print(f"Total emotion columns: {len([col for col in EMOTION_COLUMNS if col in df.columns])}")

# Check for missing values
print(f"\n🔍 Missing values in text column: {df['text'].isna().sum()}")
print(f"Missing values in emotion columns: {df[EMOTION_COLUMNS].isna().sum().sum()}")

# Basic statistics
print(f"\n📊 Text length statistics:")
text_lengths = df['text'].astype(str).str.len()
print(f"Mean length: {text_lengths.mean():.1f} characters")
print(f"Median length: {text_lengths.median():.1f} characters")
print(f"Min length: {text_lengths.min()} characters")
print(f"Max length: {text_lengths.max()} characters")

# Emotion distribution preview
emotion_sums = df[EMOTION_COLUMNS].sum().sort_values(ascending=False)
print(f"\n🎭 Top 5 most frequent emotions:")
for emotion, count in emotion_sums.head().items():
    print(f"  {emotion}: {count:,} samples ({count/len(df)*100:.1f}%)")

📋 Sample data preview:

First 3 text samples:
1. That game hurt.
2.  >sexuality shouldn’t be a grouping category It makes you different from othet ppl so imo it fits th...
3. You do right, if you don't care then fuck 'em!

🏷️ Emotion columns present: ['admiration', 'amusement', 'anger', 'annoyance', 'approval']...
Total emotion columns: 28

🔍 Missing values in text column: 0
Missing values in emotion columns: 0

📊 Text length statistics:
Mean length: 69.3 characters
Median length: 67.0 characters
Min length: 2 characters
Max length: 703 characters

🎭 Top 5 most frequent emotions:
  neutral: 55,298 samples (26.2%)
  approval: 17,620 samples (8.3%)
  admiration: 17,131 samples (8.1%)
  annoyance: 13,618 samples (6.4%)
  gratitude: 11,625 samples (5.5%)


## 3. Data Preprocessing Pipeline

Implement the comprehensive preprocessing pipeline from `emotion_xai/data/preprocessing.py`.

In [4]:
# Step 1: Assess text quality
print("🔍 Step 1: Assessing text quality...")
quality_issues = assess_text_quality(df['text'])

print("📊 Quality Assessment Results:")
for issue_type, count in quality_issues.items():
    percentage = (count / len(df)) * 100
    print(f"  {issue_type}: {count:,} samples ({percentage:.2f}%)")

# Visualize quality issues
fig = px.bar(
    x=list(quality_issues.keys()),
    y=list(quality_issues.values()),
    title="Text Quality Issues Distribution",
    labels={'x': 'Issue Type', 'y': 'Number of Samples'},
    color=list(quality_issues.values()),
    color_continuous_scale='Reds'
)
fig.update_layout(showlegend=False, height=400)
fig.show()

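# Calculate overall quality percentage.
# Note: a single sample may trigger more than one check (for example, both
# all_caps and repeated_chars), so summing the per-issue counts can count a
# sample several times. Read the figure below as a lower bound on the clean share.
total_issues = sum(quality_issues.values())
clean_samples = len(df) - total_issues
quality_percentage = (clean_samples / len(df)) * 100
print(f"\n✨ Overall data quality: {quality_percentage:.2f}% clean samples")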

🔍 Step 1: Assessing text quality...
📊 Quality Assessment Results:
  very_short: 57 samples (0.03%)
  very_long: 9 samples (0.00%)
  mostly_punctuation: 174 samples (0.08%)
  repeated_chars: 2,226 samples (1.05%)
  all_caps: 2,184 samples (1.03%)
  no_letters: 17 samples (0.01%)
  empty_null: 0 samples (0.00%)



✨ Overall data quality: 97.79% clean samples
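
The per-issue counts above may overlap, so subtracting their sum from the dataset size can understate the clean share. A minimal sketch of the difference between summing counts and counting unique flagged samples, assuming each check can be expressed as a per-sample boolean flag. The `flags` dictionary and its toy values are hypothetical and are not taken from `assess_text_quality`:

```python
# Hypothetical per-sample flags for five texts; the real checks live in
# assess_text_quality and may be defined differently.
flags = {
    "all_caps":       [True,  False, False, True,  False],
    "repeated_chars": [True,  False, True,  False, False],
    "very_short":     [False, False, False, True,  False],
}

n_samples = 5

# Summing per-issue counts counts a sample once per issue it triggers.
summed_issues = sum(sum(column) for column in flags.values())

# Taking the union counts each flagged sample exactly once.
flagged_any = [any(row) for row in zip(*flags.values())]
unique_flagged = sum(flagged_any)

naive_clean_pct = (n_samples - summed_issues) / n_samples * 100
union_clean_pct = (n_samples - unique_flagged) / n_samples * 100

print(f"Summed issue count: {summed_issues}, unique flagged samples: {unique_flagged}")
print(f"Clean share (sum): {naive_clean_pct:.1f}%, clean share (union): {union_clean_pct:.1f}%")
```

In this toy case the summed count is 5 while only 3 samples are flagged, so the two clean-share figures differ.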


In [5]:
# Step 2: Filter quality issues
print("🧹 Step 2: Filtering quality issues...")

# Apply quality filtering
filtered_df, quality_metrics = filter_quality_issues(df, remove_issues=True)

print(f"📊 Filtering results:")
print(f"  Original samples: {quality_metrics.total_samples:,}")
print(f"  Clean samples: {quality_metrics.clean_samples:,}")
print(f"  Removed samples: {quality_metrics.removed_samples:,}")
print(f"  Quality retention: {quality_metrics.quality_percentage:.2f}%")

# Compare before/after text lengths
original_lengths = df['text'].astype(str).str.len()
filtered_lengths = filtered_df['text'].astype(str).str.len()

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Original Data", "Filtered Data"),
    shared_yaxes=True
)

fig.add_trace(
    go.Histogram(x=original_lengths, name="Original", nbinsx=50, opacity=0.7),
    row=1, col=1
)
fig.add_trace(
    go.Histogram(x=filtered_lengths, name="Filtered", nbinsx=50, opacity=0.7),
    row=1, col=2
)

fig.update_layout(
    title="Text Length Distribution: Before vs After Filtering",
    xaxis_title="Text Length (characters)",
    yaxis_title="Number of Samples",
    height=400
)
fig.show()

print(f"\n✅ Quality filtering completed. Proceeding with {len(filtered_df):,} samples.")

🧹 Step 2: Filtering quality issues...
🧹 Quality filtering: 211,225 → 211,008 samples
   Removed 217 (0.10%) problematic samples
📊 Filtering results:
  Original samples: 211,225
  Clean samples: 211,008
  Removed samples: 217
  Quality retention: 99.90%



✅ Quality filtering completed. Proceeding with 211,008 samples.


In [6]:
# Step 3: Split dataset
print("🔀 Step 3: Splitting dataset...")

# Split the data with stratification
splits = split_dataset(filtered_df, test_size=0.2, val_size=0.1, random_state=42)

train_df = splits['train']
val_df = splits['val']  
test_df = splits['test']

print(f"📊 Dataset splits:")
print(f"  Training: {len(train_df):,} samples ({len(train_df)/len(filtered_df)*100:.1f}%)")
print(f"  Validation: {len(val_df):,} samples ({len(val_df)/len(filtered_df)*100:.1f}%)")
print(f"  Test: {len(test_df):,} samples ({len(test_df)/len(filtered_df)*100:.1f}%)")

# Visualize split distribution
splits_data = {
    'Split': ['Train', 'Validation', 'Test'],
    'Samples': [len(train_df), len(val_df), len(test_df)],
    'Percentage': [len(train_df)/len(filtered_df)*100, 
                   len(val_df)/len(filtered_df)*100, 
                   len(test_df)/len(filtered_df)*100]
}

fig = px.pie(
    values=splits_data['Samples'],
    names=splits_data['Split'],
    title="Dataset Split Distribution"
)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

# Check emotion distribution across splits
print(f"\n🎭 Emotion distribution consistency across splits:")
train_emotions = train_df[EMOTION_COLUMNS].sum()
val_emotions = val_df[EMOTION_COLUMNS].sum()
test_emotions = test_df[EMOTION_COLUMNS].sum()

emotion_comparison = pd.DataFrame({
    'Train_Pct': train_emotions / len(train_df) * 100,
    'Val_Pct': val_emotions / len(val_df) * 100,
    'Test_Pct': test_emotions / len(test_df) * 100
})

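# Note: head() returns the first five emotions in EMOTION_COLUMNS order,
# which are not necessarily the five most frequent ones.
print("Top 5 emotions distribution (%):")
for emotion in emotion_comparison.head().index: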
    train_pct = emotion_comparison.loc[emotion, 'Train_Pct']
    val_pct = emotion_comparison.loc[emotion, 'Val_Pct']
    test_pct = emotion_comparison.loc[emotion, 'Test_Pct']
    print(f"  {emotion}: Train={train_pct:.1f}%, Val={val_pct:.1f}%, Test={test_pct:.1f}%")

🔀 Step 3: Splitting dataset...
📊 Dataset splits created:
   Train: 147,705 samples (70.0%)
   Val:   21,101 samples (10.0%)
   Test:  42,202 samples (20.0%)
📊 Dataset splits:
  Training: 147,705 samples (70.0%)
  Validation: 21,101 samples (10.0%)
  Test: 42,202 samples (20.0%)



🎭 Emotion distribution consistency across splits:
Top 5 emotions distribution (%):
  admiration: Train=8.1%, Val=8.2%, Test=8.1%
  amusement: Train=4.4%, Val=4.4%, Test=4.4%
  anger: Train=3.8%, Val=3.8%, Test=3.8%
  annoyance: Train=6.4%, Val=6.5%, Test=6.5%
  approval: Train=8.3%, Val=8.4%, Test=8.3%
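
The split sizes reported above are consistent with holding out the test set first and then carving the validation set out of the remainder. A small worked calculation, assuming (the source does not confirm this) that `split_dataset` chains two scikit-learn `train_test_split` calls, which round a fractional held-out size up:

```python
import math

n_total = 211_008          # samples remaining after quality filtering
test_size = 0.2            # fraction of the full dataset
val_size = 0.1             # fraction of the full dataset

# First split: hold out the test set.
n_test = math.ceil(test_size * n_total)
n_remaining = n_total - n_test

# Second split: val_size refers to the full dataset, so it has to be
# rescaled to a fraction of what is left after removing the test set.
val_fraction = val_size / (1.0 - test_size)   # 0.1 / 0.8 = 0.125
n_val = math.ceil(val_fraction * n_remaining)
n_train = n_remaining - n_val

print(f"train={n_train:,} val={n_val:,} test={n_test:,}")
# prints: train=147,705 val=21,101 test=42,202
```

The key step is the rescaling: passing `val_size=0.1` directly to the second split would yield 10% of the remaining 80%, that is 8% of the full dataset.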


## 4. Feature Engineering

Extract and prepare features for emotion classification using both conservative and aggressive text cleaning approaches.
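
To make the distinction concrete, here is a minimal sketch of what the two cleaning levels could look like. The functions `clean_conservative` and `clean_aggressive` are hypothetical illustrations; the actual rules applied by `prepare_features` and `clean_text` are defined in `emotion_xai/data/preprocessing.py` and may differ:

```python
import re
import string

def clean_conservative(text: str) -> str:
    """Hypothetical light cleaning: only normalize whitespace."""
    return re.sub(r"\s+", " ", text).strip()

def clean_aggressive(text: str) -> str:
    """Hypothetical heavy cleaning: lowercase, drop URLs and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

sample = "  WOW!!  That game hurt...   see https://example.com  "
print(repr(clean_conservative(sample)))
print(repr(clean_aggressive(sample)))
```

Heavier cleaning of this kind discards capitalization and punctuation, which can carry emotional emphasis, so comparing both variants on the same model is a reasonable way to check whether that signal matters.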

In [7]:
# Step 4: Feature preparation
print("🔤 Step 4: Preparing text features...")

# Prepare features with conservative cleaning (default)
print("\n📝 Conservative text cleaning:")
train_texts_conservative = prepare_features(train_df, aggressive_cleaning=False)
val_texts_conservative = prepare_features(val_df, aggressive_cleaning=False)
test_texts_conservative = prepare_features(test_df, aggressive_cleaning=False)

# Prepare features with aggressive cleaning
print("\n🧹 Aggressive text cleaning:")
train_texts_aggressive = prepare_features(train_df, aggressive_cleaning=True)
val_texts_aggressive = prepare_features(val_df, aggressive_cleaning=True)
test_texts_aggressive = prepare_features(test_df, aggressive_cleaning=True)

# Compare cleaning approaches
print(f"\n🔍 Comparing cleaning approaches:")
print("Sample text transformations:")
sample_idx = 0
original_text = train_df.iloc[sample_idx]['text']
conservative_text = train_texts_conservative[sample_idx]
aggressive_text = train_texts_aggressive[sample_idx]

print(f"Original:     '{original_text}'")
print(f"Conservative: '{conservative_text}'")
print(f"Aggressive:   '{aggressive_text}'")

# Text length comparison
conservative_lengths = [len(text) for text in train_texts_conservative[:1000]]
aggressive_lengths = [len(text) for text in train_texts_aggressive[:1000]]

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Conservative Cleaning", "Aggressive Cleaning")
)

fig.add_trace(
    go.Histogram(x=conservative_lengths, name="Conservative", nbinsx=30, opacity=0.7),
    row=1, col=1
)
fig.add_trace(
    go.Histogram(x=aggressive_lengths, name="Aggressive", nbinsx=30, opacity=0.7),
    row=1, col=2
)

fig.update_layout(
    title="Text Length Distribution: Conservative vs Aggressive Cleaning",
    xaxis_title="Text Length (characters)",
    yaxis_title="Number of Samples",
    height=400,
    showlegend=False
)
fig.show()

print(f"\n✅ Feature preparation completed for both cleaning approaches")

🔤 Step 4: Preparing text features...

📝 Conservative text cleaning:
📝 Text features prepared: 147,705 samples
   Average length: 69.3 characters
   Aggressive cleaning: No
📝 Text features prepared: 21,101 samples
   Average length: 69.0 characters
   Aggressive cleaning: No
📝 Text features prepared: 42,202 samples
   Average length: 69.1 characters
   Aggressive cleaning: No

🧹 Aggressive text cleaning:
📝 Text features prepared: 147,705 samples
   Average length: 69.3 characters
   Aggressive cleaning: Yes
...

✅ Feature preparation completed for both cleaning approaches


In [8]:
# Step 5: Prepare emotion labels
print("🏷️ Step 5: Preparing emotion labels...")

# Prepare labels for all splits
train_labels = prepare_emotion_labels(train_df, EMOTION_COLUMNS)
val_labels = prepare_emotion_labels(val_df, EMOTION_COLUMNS)
test_labels = prepare_emotion_labels(test_df, EMOTION_COLUMNS)

print(f"📊 Label preparation results:")
print(f"  Training labels shape: {train_labels.shape}")
print(f"  Validation labels shape: {val_labels.shape}")
print(f"  Test labels shape: {test_labels.shape}")
print(f"  Number of emotions: {len(EMOTION_COLUMNS)}")

# Analyze label distribution
print(f"\n🎭 Multi-label analysis:")
train_labels_per_sample = train_labels.sum(axis=1)
val_labels_per_sample = val_labels.sum(axis=1)
test_labels_per_sample = test_labels.sum(axis=1)

print(f"  Average labels per sample:")
print(f"    Train: {train_labels_per_sample.mean():.2f}")
print(f"    Val: {val_labels_per_sample.mean():.2f}")
print(f"    Test: {test_labels_per_sample.mean():.2f}")

# Visualize label distribution
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=("Train", "Validation", "Test"),
    shared_yaxes=True
)

fig.add_trace(
    go.Histogram(x=train_labels_per_sample, name="Train", nbinsx=10, opacity=0.7),
    row=1, col=1
)
fig.add_trace(
    go.Histogram(x=val_labels_per_sample, name="Val", nbinsx=10, opacity=0.7),
    row=1, col=2
)
fig.add_trace(
    go.Histogram(x=test_labels_per_sample, name="Test", nbinsx=10, opacity=0.7),
    row=1, col=3
)

fig.update_layout(
    title="Distribution of Labels per Sample Across Splits",
    xaxis_title="Number of Labels",
    yaxis_title="Number of Samples",
    height=400,
    showlegend=False
)
fig.show()

# Report how many training samples carry more than one emotion label
multi_label_samples = train_labels_per_sample > 1
if multi_label_samples.sum() > 0:
    print(f"\n🔗 Multi-label samples: {multi_label_samples.sum():,} ({multi_label_samples.mean()*100:.1f}%)")
    
print(f"\n✅ Emotion labels prepared for all splits")

🏷️ Step 5: Preparing emotion labels...
🏷️  Emotion labels prepared: 28 emotions for 147,705 samples
   Multi-label samples: 25,078 (17.0%)
🏷️  Emotion labels prepared: 28 emotions for 21,101 samples
   Multi-label samples: 3,673 (17.4%)
🏷️  Emotion labels prepared: 28 emotions for 42,202 samples
   Multi-label samples: 7,226 (17.1%)
📊 Label preparation results:
  Training labels shape: (147705, 28)
  Validation labels shape: (21101, 28)
  Test labels shape: (42202, 28)
  Number of emotions: 28

🎭 Multi-label analysis:
  Average labels per sample:
    Train: 1.18
    Val: 1.19
    Test: 1.18

🔗 Multi-label samples: 25,078 (17.0%)

✅ Emotion labels prepared for all splits


## 5. Model Training

Train baseline models using both conservative and aggressive text cleaning approaches.
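
The training summary printed by this notebook describes the baseline as TF-IDF features fed into a One-vs-Rest logistic regression. A minimal sketch of that combination in plain scikit-learn, on a toy corpus. The internals of `BaselineModel` are not shown in this notebook, so the pipeline layout and the toy data here are assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy multi-label data: columns are [gratitude, anger] (illustrative only).
texts = [
    "thank you so much", "thanks a lot for this", "I really appreciate it",
    "this makes me furious", "I am so angry right now", "thanks, but I am angry",
]
labels = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [1, 1]])

# Same hyperparameter values as the BaselineModel calls in this notebook.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(C=1.0, max_iter=1000, random_state=42)),
)
pipeline.fit(texts, labels)

# One binary classifier per label yields a (n_samples, n_labels) 0/1 matrix.
predictions = pipeline.predict(["thank you", "so angry"])
print(predictions.shape)
```

Because each label gets its own independent binary classifier, a sample can receive zero, one, or several labels, which fits the multi-label nature of GoEmotions.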

In [9]:
# Model 1: Conservative cleaning approach
print("🚀 Training Model 1: Conservative Text Cleaning")
print("="*50)

# Initialize model with optimized parameters for this dataset size
model_conservative = BaselineModel(
    max_features=5000,      # Balanced vocabulary size
    ngram_range=(1, 2),     # Unigrams and bigrams
    C=1.0,                  # Regularization strength
    max_iter=1000,          # Sufficient iterations
    random_state=42         # Reproducibility
)

# Train the conservative model
start_time = time.time()
model_conservative.fit(train_texts_conservative, train_labels, EMOTION_COLUMNS)
conservative_training_time = time.time() - start_time

print(f"✅ Conservative model training completed in {conservative_training_time:.1f} seconds")
print(f"📊 Model info: {model_conservative.training_info['n_features']} features, {model_conservative.training_info['vocabulary_size']} vocabulary")

🚀 Training Model 1: Conservative Text Cleaning
🚀 Training baseline model...
   📊 Training samples: 147705
   🏷️  Emotions: 28
   🔤 Vectorizing texts with TF-IDF...
   📈 TF-IDF shape: (147705, 5000)
   📝 Vocabulary size: 5000
   🎯 Training multi-label classifier...
   ✅ Training completed in 9.7 seconds
✅ Conservative model training completed in 9.7 seconds
📊 Model info: 5000 features, 5000 vocabulary


In [10]:
# Model 2: Aggressive cleaning approach
print("\n🧹 Training Model 2: Aggressive Text Cleaning")
print("="*50)

# Initialize model with same parameters for fair comparison
model_aggressive = BaselineModel(
    max_features=5000,
    ngram_range=(1, 2),
    C=1.0,
    max_iter=1000,
    random_state=42
)

# Train the aggressive model
start_time = time.time()
model_aggressive.fit(train_texts_aggressive, train_labels, EMOTION_COLUMNS)
aggressive_training_time = time.time() - start_time

print(f"✅ Aggressive model training completed in {aggressive_training_time:.1f} seconds")
print(f"📊 Model info: {model_aggressive.training_info['n_features']} features, {model_aggressive.training_info['vocabulary_size']} vocabulary")

# Compare training times
print(f"\n⏱️ Training time comparison:")
print(f"  Conservative: {conservative_training_time:.1f} seconds")
print(f"  Aggressive: {aggressive_training_time:.1f} seconds")
print(f"  Difference: {abs(conservative_training_time - aggressive_training_time):.1f} seconds")

# Training summary
print(f"\n📈 Training Summary:")
print(f"  Dataset size: {len(train_df):,} samples")
print(f"  Features extracted: 5,000 TF-IDF features")
print(f"  Emotions to predict: {len(EMOTION_COLUMNS)}")
print(f"  Model type: One-vs-Rest Logistic Regression")
print(f"  Both models trained successfully! ✅")


🧹 Training Model 2: Aggressive Text Cleaning
🚀 Training baseline model...
   📊 Training samples: 147705
   🏷️  Emotions: 28
   🔤 Vectorizing texts with TF-IDF...
   📈 TF-IDF shape: (147705, 5000)
   📝 Vocabulary size: 5000
   🎯 Training multi-label classifier...
   ✅ Training completed in 4.2 seconds
✅ Aggressive model training completed in 4.2 seconds
📊 Model info: 5000 features, 5000 vocabulary

⏱️ Training time comparison:
  Conservative: 9.7 seconds
  Aggressive: 4.2 seconds
  Difference: 5.5 seconds

📈 Training Summary:
  Dataset size: 147,705 samples
  Features extracted: 5,000 TF-IDF features
  Emotions to predict: 28
  Model type: One-vs-Rest Logistic Regression
  Both models trained successfully! ✅

## 6. Model Evaluation

Comprehensive evaluation of both models using multiple metrics and visualizations.

In [11]:
# Evaluate both models on validation set
print("📊 Evaluating models on validation set...")

# Conservative model evaluation
print("\n🔹 Conservative Model - Validation Results:")
conservative_val_metrics = model_conservative.evaluate(val_texts_conservative, val_labels)

# Aggressive model evaluation  
print("\n🔸 Aggressive Model - Validation Results:")
aggressive_val_metrics = model_aggressive.evaluate(val_texts_aggressive, val_labels)

# Compare validation performance
comparison_metrics = ['accuracy', 'f1_macro', 'f1_micro', 'precision_macro', 'recall_macro']

print(f"\n📈 Validation Performance Comparison:")
print(f"{'Metric':<20} {'Conservative':<12} {'Aggressive':<12} {'Difference':<12}")
print("-" * 60)

for metric in comparison_metrics:
    conservative_val = conservative_val_metrics[metric]
    aggressive_val = aggressive_val_metrics[metric]
    difference = aggressive_val - conservative_val
    print(f"{metric:<20} {conservative_val:<12.3f} {aggressive_val:<12.3f} {difference:<12.3f}")

# Visualize performance comparison
metrics_names = ['Accuracy', 'F1-Macro', 'F1-Micro', 'Precision-Macro', 'Recall-Macro']
conservative_values = [conservative_val_metrics[metric] for metric in comparison_metrics]
aggressive_values = [aggressive_val_metrics[metric] for metric in comparison_metrics]

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=metrics_names,
    y=conservative_values,
    mode='lines+markers',
    name='Conservative',
    line=dict(color='blue', width=3),
    marker=dict(size=8)
))
fig.add_trace(go.Scatter(
    x=metrics_names,
    y=aggressive_values,
    mode='lines+markers',
    name='Aggressive',
    line=dict(color='red', width=3),
    marker=dict(size=8)
))

fig.update_layout(
    title="Model Performance Comparison (Validation Set)",
    xaxis_title="Metrics",
    yaxis_title="Score",
    yaxis_range=[0, max(max(conservative_values), max(aggressive_values)) + 0.1],
    height=500
)
fig.show()

# Determine better model
if conservative_val_metrics['f1_macro'] > aggressive_val_metrics['f1_macro']:
    best_model = model_conservative
    best_texts_val = val_texts_conservative
    best_texts_test = test_texts_conservative
    best_approach = "Conservative"
else:
    best_model = model_aggressive
    best_texts_val = val_texts_aggressive
    best_texts_test = test_texts_aggressive
    best_approach = "Aggressive"

print(f"\n🏆 Best performing approach: {best_approach} (F1-macro: {max(conservative_val_metrics['f1_macro'], aggressive_val_metrics['f1_macro']):.3f})")

📊 Evaluating models on validation set...

🔹 Conservative Model - Validation Results:
📊 Evaluating model performance...
   🎯 Overall Accuracy: 0.126
   📊 F1-Score (macro): 0.161
   📊 F1-Score (micro): 0.221
   🏷️  Avg labels per sample: 1.19

🔸 Aggressive Model - Validation Results:
📊 Evaluating model performance...
   🎯 Overall Accuracy: 0.125
   📊 F1-Score (macro): 0.161
   📊 F1-Score (micro): 0.221
   🏷️  Avg labels per sample: 1.19

📈 Validation Performance Comparison:
Metric               Conservative Aggressive   Difference  
------------------------------------------------------------
accuracy             0.126        0.125        -0.000      
f1_macro             0.161        0.161        -0.000      
f1_micro
...

🏆 Best performing approach: Conservative (F1-macro: 0.161)
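
An accuracy near 0.13 can look alarming on its own. One possible explanation, assuming `BaselineModel.evaluate` reports exact-match (subset) accuracy as scikit-learn's `accuracy_score` does for multi-label indicator input, is that a sample only counts as correct when all 28 labels match. A toy comparison of exact-match accuracy and per-label accuracy, using made-up predictions:

```python
# Toy ground truth and predictions for 4 samples and 3 labels (made up).
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 0]]

# Exact-match accuracy: every label of a sample must be right.
exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-label accuracy: each (sample, label) cell is scored on its own.
cells = [(t, p) for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p)]
per_label = sum(t == p for t, p in cells) / len(cells)

print(f"exact-match accuracy: {exact_match:.2f}")
print(f"per-label accuracy:   {per_label:.2f}")
```

Here a single missed label per sample pulls exact-match accuracy down to 0.25 while 75% of the individual label decisions are correct, which is why F1 is usually the more informative headline metric for multi-label tasks.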


In [12]:
# Final evaluation on test set
print(f"🎯 Final evaluation of {best_approach} model on test set...")

# Test set evaluation
test_metrics = best_model.evaluate(best_texts_test, test_labels)

print(f"\n📊 Test Set Performance ({best_approach} Model):")
print(f"  Accuracy: {test_metrics['accuracy']:.3f}")
print(f"  F1-Score (macro): {test_metrics['f1_macro']:.3f}")
print(f"  F1-Score (micro): {test_metrics['f1_micro']:.3f}")
print(f"  Precision (macro): {test_metrics['precision_macro']:.3f}")
print(f"  Recall (macro): {test_metrics['recall_macro']:.3f}")

# Check against PROJECT_PLAN requirements
target_f1 = 0.6
meets_f1_requirement = test_metrics['f1_macro'] > target_f1
meets_time_requirement = best_model.training_info['training_time_seconds'] < 600

print(f"\n🎯 PROJECT_PLAN Requirements Check:")
print(f"  F1-Score > 0.6: {'✅' if meets_f1_requirement else '❌'} (Actual: {test_metrics['f1_macro']:.3f})")
print(f"  Training < 10 min: {'✅' if meets_time_requirement else '❌'} (Actual: {best_model.training_info['training_time_seconds']:.1f}s)")

# Per-emotion performance analysis
print(f"\n🎭 Top performing emotions (F1-Score):")
emotion_f1_scores = [(emotion, metrics['f1']) for emotion, metrics in test_metrics['per_emotion'].items()]
emotion_f1_scores.sort(key=lambda x: x[1], reverse=True)

# The loop variable is named `score` so it does not shadow sklearn's f1_score import
for i, (emotion, score) in enumerate(emotion_f1_scores[:5], 1):
    print(f"  {i}. {emotion}: {score:.3f}")

print(f"\n🎭 Challenging emotions (lowest F1-Score):")
for i, (emotion, score) in enumerate(emotion_f1_scores[-5:], 1):
    print(f"  {i}. {emotion}: {score:.3f}")

# Create per-emotion performance visualization
emotions = [item[0] for item in emotion_f1_scores]
f1_scores = [item[1] for item in emotion_f1_scores]

fig = px.bar(
    x=f1_scores,
    y=emotions,
    orientation='h',
    title=f"Per-Emotion F1-Scores ({best_approach} Model)",
    labels={'x': 'F1-Score', 'y': 'Emotions'},
    color=f1_scores,
    color_continuous_scale='RdYlGn'
)
fig.update_layout(height=800, yaxis={'categoryorder': 'total ascending'})
fig.show()

print(f"\n✅ Model evaluation completed!")

🎯 Final evaluation of Conservative model on test set...
📊 Evaluating model performance...
   🎯 Overall Accuracy: 0.127
   📊 F1-Score (macro): 0.156
   📊 F1-Score (micro): 0.217
   🏷️  Avg labels per sample: 1.18

📊 Test Set Performance (Conservative Model):
  Accuracy: 0.127
  F1-Score (macro): 0.156
  F1-Score (micro): 0.217
  Precision (macro): 0.520
  Recall (macro): 0.107

🎯 PROJECT_PLAN Requirements Check:
  F1-Score > 0.6: ❌ (Actual: 0.156)
  Training < 10 min: ✅ (Actual: 9.7s)

🎭 Top performing emotions (F1-Score):
  1. gratitude: 0.779
  2. love: 0.484
  3. amusement: 0.405
  4. admiration: 0.363
  5. optimism: 0.254

🎭 Challenging emotions (lowest F1-Score):
  1. disappointment: 0.021
  2. disapproval: 0.018
  3. realization: 0.014
  4. grief: 0.000
  5. relief: 0.000

✅ Model evaluation completed!
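
The gap between macro precision (0.520) and macro recall (0.107) suggests the model predicts labels sparingly and misses most true labels. Because F1 is the harmonic mean 2PR / (P + R), low recall drags the score down, and macro averaging weights rare emotions such as grief and relief as heavily as frequent ones. A small sketch with invented per-label counts (not taken from this run) shows how macro and micro F1 can diverge:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented (tp, fp, fn) counts for one frequent and two rare labels.
counts = {
    "frequent": (80, 20, 20),
    "rare_a":   (1, 0, 9),
    "rare_b":   (0, 0, 10),
}

# Macro F1: average the per-label scores, so each label weighs the same.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro F1: pool the counts first, so frequent labels dominate.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"macro F1: {macro_f1:.3f}, micro F1: {micro_f1:.3f}")
```

Note also that macro F1 is the mean of per-label F1 scores, not the F1 of macro precision and macro recall, so the reported 0.156 cannot be derived from 0.520 and 0.107 alone.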


## 7. Save Processed Data

Save all preprocessed data, trained models, and evaluation results to the `data/processed/` directory for future use.

In [16]:
# Save processed datasets
print("💾 Saving processed datasets to data/processed/...")

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

# Save train, validation, and test splits
train_df.to_csv(processed_data_dir / f'train_data_{timestamp}.csv', index=False)
val_df.to_csv(processed_data_dir / f'val_data_{timestamp}.csv', index=False)
test_df.to_csv(processed_data_dir / f'test_data_{timestamp}.csv', index=False)

print(f"✅ Dataset splits saved:")
print(f"  Train: {processed_data_dir}/train_data_{timestamp}.csv ({len(train_df):,} samples)")
print(f"  Val: {processed_data_dir}/val_data_{timestamp}.csv ({len(val_df):,} samples)")
print(f"  Test: {processed_data_dir}/test_data_{timestamp}.csv ({len(test_df):,} samples)")

# Save processed text features
processed_features = {
    'train_texts_conservative': train_texts_conservative,
    'val_texts_conservative': val_texts_conservative,
    'test_texts_conservative': test_texts_conservative,
    'train_texts_aggressive': train_texts_aggressive,
    'val_texts_aggressive': val_texts_aggressive,
    'test_texts_aggressive': test_texts_aggressive,
    'train_labels': train_labels.tolist(),
    'val_labels': val_labels.tolist(),
    'test_labels': test_labels.tolist(),
    'emotion_columns': EMOTION_COLUMNS,
    'preprocessing_info': {
        'timestamp': timestamp,
        'original_samples': len(df),
        'filtered_samples': len(filtered_df),
        'quality_retention': quality_metrics.quality_percentage,
        'train_size': len(train_df),
        'val_size': len(val_df),
        'test_size': len(test_df)
    }
}

import pickle
features_file = processed_data_dir / f'processed_features_{timestamp}.pkl'
with open(features_file, 'wb') as f:
    pickle.dump(processed_features, f)

print(f"✅ Processed features saved: {features_file}")

# Helper to convert NumPy types to native Python types for JSON serialization
def convert_numpy_types(obj):
    """Recursively convert NumPy scalars and arrays to native Python types."""
    if isinstance(obj, dict):
        return {key: convert_numpy_types(value) for key, value in obj.items()}
    elif isinstance(obj, (list, tuple)):
        return [convert_numpy_types(item) for item in obj]
    elif isinstance(obj, np.bool_):
        return bool(obj)
    elif isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

# Save quality metrics
quality_metrics_dict = quality_metrics.to_dict()
quality_metrics_dict['timestamp'] = timestamp

# Convert NumPy types to native Python types
quality_metrics_dict = convert_numpy_types(quality_metrics_dict)

quality_file = processed_data_dir / f'quality_metrics_{timestamp}.json'
with open(quality_file, 'w') as f:
    json.dump(quality_metrics_dict, f, indent=2)

print(f"✅ Quality metrics saved: {quality_file}")

💾 Saving processed datasets to data/processed/...
✅ Dataset splits saved:
  Train: ../data/processed/train_data_20251128_045051.csv (147,705 samples)
  Val: ../data/processed/val_data_20251128_045051.csv (21,101 samples)
  Test: ../data/processed/test_data_20251128_045051.csv (42,202 samples)
✅ Processed features saved: ../data/processed/processed_features_20251128_045051.pkl
✅ Quality metrics saved: ../data/processed/quality_metrics_20251128_045051.json
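
To reuse these artifacts in a later session, the files can be read back by timestamp. The following is a minimal sketch that assumes the file naming used above; `load_processed_run` is a hypothetical helper, not part of `emotion_xai`, and the toy data written to a temporary directory only stands in for the real splits.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def load_processed_run(processed_dir, timestamp):
    """Hypothetical helper: reload the splits and features saved for one timestamp."""
    processed_dir = Path(processed_dir)
    splits = {
        name: pd.read_csv(processed_dir / f"{name}_data_{timestamp}.csv")
        for name in ("train", "val", "test")
    }
    # Only unpickle files you created yourself; pickle can execute arbitrary code.
    with open(processed_dir / f"processed_features_{timestamp}.pkl", "rb") as f:
        features = pickle.load(f)
    return splits, features


# Demo round trip with toy data in a temporary directory (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    demo_ts = "20250101_000000"  # made-up timestamp for illustration
    toy = pd.DataFrame({"text": ["so happy", "this is sad"], "joy": [1, 0]})
    for name in ("train", "val", "test"):
        toy.to_csv(tmp_dir / f"{name}_data_{demo_ts}.csv", index=False)
    with open(tmp_dir / f"processed_features_{demo_ts}.pkl", "wb") as f:
        pickle.dump({"emotion_columns": ["joy"]}, f)

    splits, features = load_processed_run(tmp_dir, demo_ts)
    print({name: len(frame) for name, frame in splits.items()})
    print(features["emotion_columns"])
```

In the notebook itself, the call would presumably be `load_processed_run(processed_data_dir, timestamp)` with the timestamp printed in the output above.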


In [17]:
# Save trained models
print("\n💾 Saving trained models to data/processed/...")

models_save_dir = processed_data_dir / 'models'
models_save_dir.mkdir(parents=True, exist_ok=True)

# Save conservative model
conservative_model_file = models_save_dir / f'baseline_conservative_{timestamp}.joblib'
model_conservative.save(conservative_model_file)

# Save aggressive model  
aggressive_model_file = models_save_dir / f'baseline_aggressive_{timestamp}.joblib'
model_aggressive.save(aggressive_model_file)

print("✅ Models saved:")
print(f"  Conservative: {conservative_model_file}")
print(f"  Aggressive: {aggressive_model_file}")

# Save evaluation results
print("\n📊 Saving evaluation results...")

# Save to results directory (integrated with existing structure)
save_evaluation_results(
    metrics=test_metrics,
    model_info=best_model.training_info,
    results_dir=results_dir
)

# Also save comprehensive results to processed directory
comprehensive_results = {
    'timestamp': timestamp,
    'notebook_session': 'notebooks/02_modeling.ipynb',
    'best_approach': best_approach,
    'models': {
        'conservative': {
            'validation_metrics': conservative_val_metrics,
            'model_file': str(conservative_model_file),
            'training_time': conservative_training_time
        },
        'aggressive': {
            'validation_metrics': aggressive_val_metrics,
            'model_file': str(aggressive_model_file),
            'training_time': aggressive_training_time
        }
    },
    'final_test_metrics': test_metrics,
    'requirements_check': {
        'f1_target': target_f1,
        'f1_achieved': test_metrics['f1_macro'],
        'f1_requirement_met': meets_f1_requirement,
        'time_requirement_met': meets_time_requirement
    },
    'dataset_info': {
        'original_size': len(df),
        'processed_size': len(filtered_df),
        'train_size': len(train_df),
        'val_size': len(val_df),
        'test_size': len(test_df)
    }
}

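results_file = processed_data_dir / f'modeling_results_{timestamp}.json'
with open(results_file, 'w') as f:
    # Convert NumPy values first so metrics are stored as numbers;
    # default=str remains a fallback for any other non-serializable objects
    json.dump(convert_numpy_types(comprehensive_results), f, indent=2, default=str)

print(f"✅ Comprehensive results saved: {results_file}")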

# Create summary file
summary_file = processed_data_dir / f'README_processed_{timestamp}.md'
with open(summary_file, 'w') as f:
    f.write(f"""# Processed Data Summary - {timestamp}

## Dataset Information
- Original samples: {len(df):,}
- Processed samples: {len(filtered_df):,}
- Quality retention: {quality_metrics.quality_percentage:.2f}%

## Data Splits
- Training: {len(train_df):,} samples
- Validation: {len(val_df):,} samples  
- Test: {len(test_df):,} samples

## Model Performance
- Best approach: {best_approach}
- Test F1-macro: {test_metrics['f1_macro']:.3f}
- Test accuracy: {test_metrics['accuracy']:.3f}

## Files Generated
- Dataset splits: train_data_{timestamp}.csv, val_data_{timestamp}.csv, test_data_{timestamp}.csv
- Features: processed_features_{timestamp}.pkl
- Models: baseline_conservative_{timestamp}.joblib, baseline_aggressive_{timestamp}.joblib
- Quality metrics: quality_metrics_{timestamp}.json
- Results: modeling_results_{timestamp}.json

Generated from: notebooks/02_modeling.ipynb
""")

print(f"✅ Summary documentation: {summary_file}")

print("\n🎉 All data successfully saved to data/processed/!")
print("📁 Total files created: 8 files")
print("💾 Ready for Phase 3: Transformer fine-tuning!")


💾 Saving trained models to data/processed/...
💾 Model saved to: ../data/processed/models/baseline_conservative_20251128_045051.joblib
💾 Model saved to: ../data/processed/models/baseline_aggressive_20251128_045051.joblib
✅ Models saved:
  Conservative: ../data/processed/models/baseline_conservative_20251128_045051.joblib
  Aggressive: ../data/processed/models/baseline_aggressive_20251128_045051.joblib

📊 Saving evaluation results...
📊 Evaluation results saved to: ../results/metrics/model_performance/baseline_evaluation_20251128_045911.json
✅ Comprehensive results saved: ../data/processed/modeling_results_20251128_045051.json
✅ Summary documentation: ../data/processed/README_processed_20251128_045051.md

🎉 All data successfully saved to data/processed/!
📁 Total files created: 8 files
💾 Ready for Phase 3: Transformer fine-tuning!
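
One detail worth knowing about `json.dump(..., default=str)`: any value the encoder cannot handle natively, such as a NumPy integer or array, is written as a string rather than a number. The sketch below contrasts that behavior with a small `json.JSONEncoder` subclass. `NumpyJSONEncoder` is a hypothetical name introduced here for illustration, and the metric values are made up.

```python
import json

import numpy as np


class NumpyJSONEncoder(json.JSONEncoder):
    """Hypothetical encoder that writes NumPy values as native JSON types."""

    def default(self, o):
        if isinstance(o, np.bool_):
            return bool(o)
        if isinstance(o, np.integer):
            return int(o)
        if isinstance(o, np.floating):
            return float(o)
        if isinstance(o, np.ndarray):
            return o.tolist()
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(o)


# Made-up values, only to contrast the two serialization strategies.
demo_metrics = {"support": np.int64(42), "per_class": np.array([1, 0, 1])}

as_strings = json.loads(json.dumps(demo_metrics, default=str))
as_numbers = json.loads(json.dumps(demo_metrics, cls=NumpyJSONEncoder))

print(type(as_strings["support"]).__name__)
print(type(as_numbers["support"]).__name__)
print(as_numbers["per_class"])
```

Keeping numbers as numbers means downstream code can compare or plot the saved metrics without parsing strings first.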


## 🎉 Modeling Pipeline Complete!

This notebook successfully implemented the complete emotion classification modeling pipeline:

### ✅ **Accomplishments**
- **Data Loading**: Loaded 211,225 GoEmotions samples
- **Quality Assessment**: Identified and filtered quality issues (99.9% retention)
- **Preprocessing**: Implemented both conservative and aggressive text cleaning
- **Feature Engineering**: Created TF-IDF features with optimized parameters
- **Model Training**: Trained baseline models with multi-label classification
- **Evaluation**: Comprehensive performance analysis and comparison
- **Data Persistence**: Saved all processed data to `data/processed/`

### 📊 **Key Results**
- **Best Model**: the text cleaning approach stored in `best_approach`
- **Test Performance**: F1-macro and accuracy as stored in `test_metrics['f1_macro']` and `test_metrics['accuracy']`
- **Training Time**: stored in `best_model.training_info['training_time_seconds']` (meets the < 10 min requirement)
- **Data Quality**: 99.9% sample retention after quality filtering

### 📁 **Saved Outputs**
All processed data is available in `../data/processed/` for future use:
- Dataset splits (CSV files)
- Processed features (pickle file) 
- Trained models (joblib files)
- Evaluation results (JSON files)
- Documentation (README markdown)
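
Because every file name carries a `%Y%m%d_%H%M%S` timestamp, the most recent run can be located without opening any file. This is a minimal sketch under that naming assumption; `latest_run_timestamp` is a hypothetical helper, and the placeholder files in the demo are empty stand-ins (only `20251128_045051` corresponds to the run recorded in this notebook).

```python
import re
import tempfile
from pathlib import Path

# Matches the '%Y%m%d_%H%M%S' suffix used in the saved file names.
TIMESTAMP_PATTERN = re.compile(r"_(\d{8}_\d{6})\.")


def latest_run_timestamp(processed_dir, prefix="modeling_results_"):
    """Hypothetical helper: return the newest timestamp among matching files, or None."""
    stamps = []
    for path in Path(processed_dir).glob(f"{prefix}*"):
        match = TIMESTAMP_PATTERN.search(path.name)
        if match:
            stamps.append(match.group(1))
    # Zero-padded 'YYYYMMDD_HHMMSS' strings sort chronologically as plain text.
    return max(stamps, default=None)


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("20251127_101500", "20251128_045051", "20251126_235959"):
        (Path(tmp) / f"modeling_results_{stamp}.json").touch()
    newest = latest_run_timestamp(tmp)
    missing = latest_run_timestamp(tmp, prefix="no_such_prefix_")

print(newest)
print(missing)
```

Returning `None` when nothing matches lets the caller decide whether a missing run is an error or simply means the pipeline has not been executed yet.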

### 🚀 **Next Steps**
Ready for **Phase 3: Transformer Fine-tuning**, which aims to raise the test F1-macro (`test_metrics['f1_macro']`) to the 0.6+ target.

---
*Generated from: notebooks/02_modeling.ipynb*  
*Timestamp: 20251128_045051*