# Professional Emotion Detection Pipeline
## Deep Learning for NLP - Emotion Classification in Twitter Text

This notebook walks through an end-to-end pipeline for classifying tweets into six emotions with LSTM and GRU networks: preprocessing, embedding lookup, model construction, training, and evaluation.

### Project Structure
- **Data Processing**: Text preprocessing with contraction expansion and typo correction
- **Embeddings**: Support for both GloVe and Word2Vec
- **Models**: LSTM, GRU, Bidirectional variants with configurable architectures
- **Training**: Training pipeline with callbacks (early stopping, learning-rate reduction) and experiment tracking
- **Evaluation**: Comprehensive metrics and visualizations
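
To make the contraction-expansion step concrete, here is a minimal sketch. It is not the code in `TextPreprocessor`: the `CONTRACTIONS` table and the `expand_contractions` helper are hypothetical names chosen for illustration, and a real table would be far larger.

```python
import re

# Hypothetical, deliberately tiny mapping used only for this illustration.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
    "don't": "do not",
}

# One alternation pattern; longer keys come first so they win over shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b"
)


def expand_contractions(text: str) -> str:
    """Lowercase the text and replace known contractions with their expansions."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)], text.lower())


print(expand_contractions("I'm sure it's fine, don't worry"))
```

This sketch only matches the plain ASCII apostrophe; tweets often contain typographic apostrophes, which would need to be normalized first.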

### Emotion Classes
0. Sadness
1. Joy
2. Love
3. Anger
4. Fear
5. Surprise

## 1. Setup and Imports

In [None]:
import sys
import os
import logging
import random
import warnings

# Silence library warnings to keep cell output readable.
# Note: this also hides deprecation warnings, so disable it when debugging.
warnings.filterwarnings('ignore')

# Add the project root to the path so the `src` package can be imported
sys.path.append('../')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow.keras.utils import to_categorical

# Import custom modules
from src.data.preprocessor import TextPreprocessor, load_and_preprocess_data
from src.data.embeddings import EmbeddingHandler, create_embeddings
from src.models.architectures import ModelBuilder, create_model
from src.training.trainer import ModelTrainer, train_model
from src.utils.visualization import ResultsVisualizer, create_comprehensive_report
from src.utils.config import ExperimentConfig, get_all_configs

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Set random seeds for reproducibility (Python, NumPy, and TensorFlow)
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

print("✓ All imports successful")
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {tf.config.list_physical_devices('GPU')}")

## 2. Configuration

In [None]:
# Define paths
DATA_DIR = "../data/raw"
TRAIN_PATH = os.path.join(DATA_DIR, "train.csv")
VAL_PATH = os.path.join(DATA_DIR, "validation.csv")
# Machine-specific absolute path; change this to the location of your GloVe file
GLOVE_PATH = "/home/lab/rabanof/Emotion_Detection_DL/glove/glove.6B.100d.txt"

# Emotion mapping
EMOTION_MAP = {
    0: 'Sadness', 1: 'Joy', 2: 'Love',
    3: 'Anger', 4: 'Fear', 5: 'Surprise'
}
EMOTION_LABELS = list(EMOTION_MAP.values())

# Create experiment configuration
config = ExperimentConfig(
    experiment_name="emotion_lstm_glove_v1"
)

# Update paths in config
config.data.train_path = TRAIN_PATH
config.data.val_path = VAL_PATH
config.embedding.embedding_path = GLOVE_PATH

print("\nExperiment Configuration:")
print("=" * 50)
print(f"Experiment Name: {config.experiment_name}")
print(f"Model Type: {config.model.model_type.upper()}")
print(f"Embedding: {config.embedding.embedding_type.upper()} (dim={config.embedding.embedding_dim})")
print(f"Max Sequence Length: {config.data.max_len}")
print(f"Vocabulary Size: {config.data.max_words}")
print(f"Batch Size: {config.training.batch_size}")
print(f"Epochs: {config.training.epochs}")
print("=" * 50)
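
The cell above reads and writes nested attributes such as `config.data.max_len` and `config.embedding.embedding_path`. One way such a configuration object could be laid out is with nested dataclasses, sketched below. The class names and all default values here are assumptions for illustration; the real `ExperimentConfig` in `src.utils.config` may differ.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class DataConfig:
    train_path: str = ""
    val_path: str = ""
    max_words: int = 10000  # assumed default, not taken from the project
    max_len: int = 50       # assumed default, not taken from the project


@dataclass
class EmbeddingConfig:
    embedding_type: str = "glove"
    embedding_dim: int = 100
    embedding_path: str = ""
    trainable: bool = False


@dataclass
class SketchExperimentConfig:
    """Hypothetical stand-in for ExperimentConfig, covering two sub-configs only."""
    experiment_name: str
    # default_factory gives every experiment its own sub-config instances
    data: DataConfig = field(default_factory=DataConfig)
    embedding: EmbeddingConfig = field(default_factory=EmbeddingConfig)


cfg = SketchExperimentConfig(experiment_name="emotion_lstm_glove_v1")
cfg.data.train_path = "../data/raw/train.csv"
print(asdict(cfg))
```

Because `asdict` returns plain nested dictionaries, a structure like this is straightforward to dump to YAML or JSON for later reproduction of a run.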

## 3. Data Loading and Preprocessing

In [None]:
# Load and preprocess data
preprocessor = TextPreprocessor()
train_df, val_df, preprocessor = load_and_preprocess_data(
    TRAIN_PATH, VAL_PATH, preprocessor
)

print(f"\nTraining samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")

# Display sample data
print("\nSample preprocessed texts:")
for i, row in train_df.head(3).iterrows():
    print(f"[{EMOTION_MAP[row['label']]}] {row['text']}")

### 3.1 Data Statistics and Visualization

In [None]:
# Get statistics
train_stats = preprocessor.get_text_statistics(train_df)

print("\nTraining Data Statistics:")
print("=" * 50)
print(f"Total samples: {train_stats['total_samples']}")
print(f"Avg text length: {train_stats['avg_text_length']:.2f} words")
print(f"Median text length: {train_stats['median_text_length']:.0f} words")
print(f"Max text length: {train_stats['max_text_length']} words")

print("\nLabel Distribution:")
for label, count in train_stats['label_distribution'].items():
    pct = train_stats['label_percentages'][label]
    print(f"{EMOTION_MAP[label]:12s}: {count:5d} ({pct:5.2f}%)")
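
For readers without access to `get_text_statistics`, the statistics printed above can be approximated with the standard library alone. The sketch below reuses the dictionary keys seen in the cell, but the function itself and the whitespace-based word count are assumptions, and the four example texts are made up.

```python
from collections import Counter
from statistics import mean, median


def text_statistics(texts, labels):
    """Word-length and label statistics, with lengths counted by whitespace split."""
    lengths = [len(t.split()) for t in texts]
    counts = dict(sorted(Counter(labels).items()))
    total = len(labels)
    return {
        "total_samples": total,
        "avg_text_length": mean(lengths),
        "median_text_length": median(lengths),
        "max_text_length": max(lengths),
        "label_distribution": counts,
        "label_percentages": {k: 100.0 * v / total for k, v in counts.items()},
    }


# Toy inputs for illustration only
toy_texts = ["i feel so sad", "what a great day", "i am scared", "best day ever today"]
toy_labels = [0, 1, 4, 1]
toy_stats = text_statistics(toy_texts, toy_labels)
print(toy_stats)
```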

In [None]:
# Visualize label distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Train distribution
train_label_counts = train_df['label'].map(EMOTION_MAP).value_counts()
axes[0].bar(train_label_counts.index, train_label_counts.values, color='skyblue', edgecolor='black')
axes[0].set_title('Training Set - Emotion Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Emotion')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3, axis='y')

# Val distribution
val_label_counts = val_df['label'].map(EMOTION_MAP).value_counts()
axes[1].bar(val_label_counts.index, val_label_counts.values, color='lightcoral', edgecolor='black')
axes[1].set_title('Validation Set - Emotion Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Emotion')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Text length distribution
plt.figure(figsize=(12, 5))
plt.hist(train_df['text_len'], bins=50, edgecolor='black', alpha=0.7)
plt.axvline(train_df['text_len'].mean(), color='red', linestyle='--', label=f'Mean: {train_df["text_len"].mean():.1f}')
plt.axvline(train_df['text_len'].median(), color='green', linestyle='--', label=f'Median: {train_df["text_len"].median():.0f}')
plt.title('Text Length Distribution (Training Set)', fontsize=14, fontweight='bold')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 4. Embedding Creation

In [None]:
# Create embedding handler
embedding_handler = EmbeddingHandler(
    embedding_type=config.embedding.embedding_type,
    embedding_dim=config.embedding.embedding_dim,
    max_words=config.data.max_words,
    max_len=config.data.max_len
)

# Prepare sequences and embeddings
X_train, X_val, embedding_matrix, stats = embedding_handler.prepare_sequences(
    train_df['text'].tolist(),
    val_df['text'].tolist(),
    embedding_path=config.embedding.embedding_path
)

# Prepare labels
y_train = to_categorical(train_df['label'].values, num_classes=len(EMOTION_MAP))
y_val = to_categorical(val_df['label'].values, num_classes=len(EMOTION_MAP))

print("\nData Preparation Complete:")
print("=" * 50)
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"Embedding matrix shape: {embedding_matrix.shape}")
print(f"\nEmbedding Coverage: {stats['coverage_percent']:.2f}%")
print(f"Words found: {stats['words_found']}/{stats['total_words']}")
print(f"Sample OOV words: {stats['sample_oov_words'][:10]}")
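
The coverage figures above come from matching the tokenizer vocabulary against the pretrained vectors. A minimal sketch of that matching step follows. It assumes the GloVe text format (`word v1 v2 ... vd` per line), index 0 reserved for padding, and all-zero rows for words without a pretrained vector; `build_embedding_matrix` is a hypothetical helper, not the method used by `EmbeddingHandler`.

```python
def build_embedding_matrix(word_index, glove_lines, embedding_dim, max_words):
    """Return (matrix, stats); row i of matrix holds the vector for word index i."""
    # Parse "word v1 v2 ... vd" lines into a lookup table.
    vectors = {}
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        if len(parts) == embedding_dim + 1:
            vectors[parts[0]] = [float(x) for x in parts[1:]]

    vocab_size = min(max_words, len(word_index) + 1)  # +1 for padding index 0
    matrix = [[0.0] * embedding_dim for _ in range(vocab_size)]
    found, oov = 0, []
    for word, idx in word_index.items():
        if idx >= vocab_size:
            continue  # word falls outside the kept vocabulary
        if word in vectors:
            matrix[idx] = vectors[word]
            found += 1
        else:
            oov.append(word)  # its row stays all-zero
    total = found + len(oov)
    stats = {
        "words_found": found,
        "total_words": total,
        "coverage_percent": 100.0 * found / total if total else 0.0,
        "sample_oov_words": oov[:10],
    }
    return matrix, stats


# Toy vocabulary and three fake 3-dimensional "GloVe" lines
toy_word_index = {"happy": 1, "sad": 2, "lolz": 3}
toy_glove = ["happy 0.1 0.2 0.3", "sad -0.1 0.0 0.4", "the 0.5 0.5 0.5"]
emb_matrix, cov_stats = build_embedding_matrix(toy_word_index, toy_glove, 3, 100)
print(cov_stats)
```

In practice the lines would be streamed from the file at `GLOVE_PATH` and the matrix stored as a NumPy array; plain lists are used here only to keep the sketch dependency-free.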

In [None]:
# Check OOV rates
train_oov_rate = embedding_handler.get_oov_rate(X_train)
val_oov_rate = embedding_handler.get_oov_rate(X_val)

print(f"\nOOV Token Rates:")
print(f"Training set: {train_oov_rate:.2f}%")
print(f"Validation set: {val_oov_rate:.2f}%")
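
Vocabulary coverage and the token-level OOV rate measure different things: the first counts distinct words, the second counts occurrences in the padded sequences. A sketch of the second metric is below. It assumes index 0 is padding and index 1 is the out-of-vocabulary token (the index the Keras `Tokenizer` assigns when `oov_token` is set); how `get_oov_rate` actually computes its value is not shown in this notebook.

```python
PAD_INDEX = 0  # assumption: padding token
OOV_INDEX = 1  # assumption: out-of-vocabulary token


def oov_rate(sequences, pad_index=PAD_INDEX, oov_index=OOV_INDEX):
    """Percentage of non-padding tokens that map to the OOV index."""
    real_tokens = 0
    oov_tokens = 0
    for seq in sequences:
        for tok in seq:
            if tok == pad_index:
                continue  # padding does not count as a token
            real_tokens += 1
            if tok == oov_index:
                oov_tokens += 1
    return 100.0 * oov_tokens / real_tokens if real_tokens else 0.0


toy_padded = [[0, 0, 5, 1, 7], [0, 3, 1, 1, 9]]
print(f"{oov_rate(toy_padded):.2f}%")
```

A validation rate much higher than the training rate would suggest that the validation vocabulary is drifting away from what the tokenizer saw during fitting.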

## 5. Class Weights Calculation

To handle class imbalance, we compute class weights:

In [None]:
# Compute class weights
if config.training.use_class_weights:
    class_weights = preprocessor.compute_class_weights(train_df['label'].values)
    print("\nClass Weights:")
    for label, weight in class_weights.items():
        print(f"{EMOTION_MAP[label]:12s}: {weight:.3f}")
else:
    class_weights = None
    print("\nClass weights disabled")
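
A common way to derive such weights is the "balanced" heuristic, the same formula scikit-learn's `compute_class_weight` uses: each class gets `n_samples / (n_classes * class_count)`. Whether `preprocessor.compute_class_weights` uses exactly this formula is an assumption; the sketch below shows the heuristic on made-up labels.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}


# Toy, imbalanced label list: six of class 0, three of class 1, one of class 2
toy_labels_cw = [0] * 6 + [1] * 3 + [2] * 1
toy_weights = balanced_class_weights(toy_labels_cw)
for label, weight in toy_weights.items():
    print(f"class {label}: {weight:.3f}")
```

Rare classes receive weights above 1 and frequent classes below 1, so each class contributes the same total weight to the loss.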

## 6. Model Creation

In [None]:
# Create model configuration dictionary
# The key for the recurrent layer size depends on the cell type
units_key = 'lstm_units' if config.model.model_type == 'lstm' else 'gru_units'

model_config = {
    # Architecture
    units_key: config.model.units,
    'num_layers': config.model.num_layers,
    'dropout': config.model.dropout,
    'recurrent_dropout': config.model.recurrent_dropout,
    'spatial_dropout': config.model.spatial_dropout,
    'bidirectional': config.model.bidirectional,
    'dense_units': config.model.dense_units,
    'trainable_embeddings': config.embedding.trainable,
    
    # Compilation
    'learning_rate': config.training.learning_rate,
    'optimizer': config.training.optimizer,
    'loss': config.training.loss,
    'metrics': ['accuracy']
}

# Create model
model = create_model(
    model_type=config.model.model_type,
    vocab_size=embedding_handler.vocab_size,
    embedding_dim=config.embedding.embedding_dim,
    embedding_matrix=embedding_matrix,
    config=model_config
)

# Display model summary
print("\nModel Architecture:")
print("=" * 50)
model.summary()
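
When reading the summary, it can help to check the parameter counts by hand. A standard LSTM layer has four gates, each with input weights, recurrent weights, and a bias, which gives the formula in the sketch below. The layer sizes in the example (100-dimensional embeddings, 128 units, 6 output classes) are assumptions for illustration; the actual values come from `config`.

```python
def lstm_param_count(input_dim: int, units: int, bidirectional: bool = False) -> int:
    """Trainable parameters of a standard LSTM layer.

    Each of the 4 gates has units * input_dim input weights,
    units * units recurrent weights, and units biases.
    """
    per_direction = 4 * (units * (input_dim + units) + units)
    return 2 * per_direction if bidirectional else per_direction


def dense_param_count(input_dim: int, units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units


print(lstm_param_count(100, 128))                      # 117248
print(lstm_param_count(100, 128, bidirectional=True))  # 234496
print(dense_param_count(128, 6))                       # 774
```

A frozen embedding layer shows up under non-trainable parameters in the summary, so with `trainable_embeddings` disabled the recurrent and dense layers account for all trainable weights.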

## 7. Model Training

In [None]:
# Create training configuration
training_config = {
    'epochs': config.training.epochs,
    'batch_size': config.training.batch_size,
    'class_weight': class_weights,
    
    # Callbacks
    'early_stopping': config.training.early_stopping,
    'patience': config.training.patience,
    'reduce_lr': config.training.reduce_lr,
    'lr_factor': config.training.lr_factor,
    'lr_patience': config.training.lr_patience,
    'min_lr': config.training.min_lr,
    'tensorboard': config.training.tensorboard,
    'save_best_only': config.training.save_best_only,
    'monitor': config.training.monitor,
    'mode': config.training.mode,
    'verbose': config.training.verbose
}

# Train model
trainer = ModelTrainer(model, config.experiment_name)

print("\nStarting Training...")
print("=" * 50)
history = trainer.train(X_train, y_train, X_val, y_val, training_config)
print("\n✓ Training Complete!")
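
The `patience`, `lr_factor`, `lr_patience`, and `min_lr` settings interact in a way that is easier to see on a concrete loss curve. The sketch below replays a list of validation losses through simplified early-stopping and reduce-on-plateau rules. It is an illustration, not the Keras implementation: both rules share one "best loss" tracker and features such as `min_delta` and cooldown are left out. The loss values are made up.

```python
def simulate_callbacks(val_losses, patience, lr, lr_factor, lr_patience, min_lr):
    """Return (stop_epoch, lr_per_epoch) for a sequence of validation losses.

    stop_epoch is the 1-based epoch after which training stops;
    lr_per_epoch[i] is the learning rate used during epoch i + 1.
    """
    best = float("inf")
    es_wait = 0  # epochs without improvement, for early stopping
    lr_wait = 0  # epochs without improvement since the last LR reduction
    lrs = []
    for epoch, loss in enumerate(val_losses, start=1):
        lrs.append(lr)
        if loss < best:
            best = loss
            es_wait = 0
            lr_wait = 0
            continue
        es_wait += 1
        lr_wait += 1
        if lr_wait >= lr_patience:
            lr = max(lr * lr_factor, min_lr)  # never drop below min_lr
            lr_wait = 0
        if es_wait >= patience:
            return epoch, lrs
    return len(val_losses), lrs


toy_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.69]  # made-up values
stop_epoch, lr_trace = simulate_callbacks(
    toy_losses, patience=4, lr=1e-3, lr_factor=0.5, lr_patience=2, min_lr=1e-5
)
print(stop_epoch, lr_trace)
```

Setting `lr_patience` below `patience`, as in this example, gives the optimizer a chance to recover at a lower learning rate before early stopping ends the run.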

## 8. Model Evaluation

In [None]:
# Evaluate on validation set
val_results = trainer.evaluate(X_val, y_val)

print("\nValidation Results:")
print("=" * 50)
for metric, value in val_results.items():
    print(f"{metric}: {value:.4f}")

In [None]:
# Get predictions
y_pred = trainer.predict(X_val)

# Create visualizer
visualizer = ResultsVisualizer(EMOTION_LABELS)

print("\n✓ Predictions generated")

### 8.1 Training History Visualization

In [None]:
visualizer.plot_training_history(history)

### 8.2 Confusion Matrix

In [None]:
visualizer.plot_confusion_matrix(y_val, y_pred, normalize=False)

In [None]:
visualizer.plot_confusion_matrix(y_val, y_pred, normalize=True)
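
The two plots differ only in normalization. As a sketch of what is being plotted, the code below builds a confusion matrix from one-hot labels and predicted probabilities, then divides each row by its total. The helper names are hypothetical and the three-class inputs are made up; it is assumed that `plot_confusion_matrix` takes the argmax of both arrays in a similar way.

```python
def argmax(row):
    """Index of the largest value (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def confusion_matrix(y_true_onehot, y_prob, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for true_row, prob_row in zip(y_true_onehot, y_prob):
        cm[argmax(true_row)][argmax(prob_row)] += 1
    return cm


def normalize_rows(cm):
    """Divide each row by its sum so every true class sums to 1."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Toy three-class example
toy_true = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
toy_prob = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
toy_cm = confusion_matrix(toy_true, toy_prob, num_classes=3)
print(toy_cm)
print(normalize_rows(toy_cm))
```

The diagonal of the row-normalized matrix is the recall of each class, which makes the normalized view easier to read when class sizes differ widely.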

### 8.3 Classification Report

In [None]:
report = visualizer.plot_classification_report(y_val, y_pred)
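
The figures in a classification report can all be derived from a confusion matrix. The sketch below computes per-class precision, recall, F1, and support from a matrix whose rows are true classes and whose columns are predicted classes. `per_class_report` is a hypothetical helper and the matrix is a made-up three-class example.

```python
def per_class_report(cm):
    """Precision, recall, F1, and support per class from a confusion matrix."""
    n = len(cm)
    report = {}
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,
        }
    return report


toy_matrix = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]  # made-up counts
toy_report = per_class_report(toy_matrix)
for cls, scores in toy_report.items():
    print(cls, scores)
```

On imbalanced data, the per-class F1 scores are usually more informative than overall accuracy, since a model can reach high accuracy while missing the rare classes.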

### 8.4 Prediction Distribution

In [None]:
visualizer.plot_prediction_distribution(y_val, y_pred)

### 8.5 Per-Class Accuracy

In [None]:
visualizer.plot_per_class_accuracy(y_val, y_pred)

## 9. Save Complete Report

In [None]:
# Create comprehensive report
create_comprehensive_report(
    y_val, y_pred, history,
    config.experiment_name,
    save_dir='../results'
)

# Save configuration
config.save(f'../configs/{config.experiment_name}_config.yaml')

print("\n✓ Complete report and configuration saved!")

## 10. Prediction Examples

In [None]:
# Sample predictions
n_samples = 10
sample_indices = np.random.choice(len(val_df), n_samples, replace=False)

print("\nSample Predictions:")
print("=" * 80)

for idx in sample_indices:
    text = val_df.iloc[idx]['text']
    true_label = np.argmax(y_val[idx])
    pred_label = np.argmax(y_pred[idx])
    confidence = y_pred[idx][pred_label] * 100
    
    correct = "✓" if true_label == pred_label else "✗"
    
    print(f"\n{correct} Text: {text}")
    print(f"   True: {EMOTION_MAP[true_label]:12s} | Predicted: {EMOTION_MAP[pred_label]:12s} (confidence: {confidence:.1f}%)")
    
    # Show top 3 predictions
    top3_idx = np.argsort(y_pred[idx])[-3:][::-1]
    print("   Top 3: ", end="")
    for tidx in top3_idx:
        print(f"{EMOTION_MAP[tidx]} ({y_pred[idx][tidx]*100:.1f}%)", end="  ")
    print()

## 11. Interactive Prediction Function

In [None]:
def predict_emotion(text: str, show_probabilities: bool = True):
    """
    Predict emotion for a given text.
    
    Args:
        text: Input text
        show_probabilities: Whether to show all class probabilities
    """
    # Preprocess
    cleaned_text = preprocessor.clean_text(text)
    
    # Convert to sequence
    sequence = embedding_handler.texts_to_sequences([cleaned_text])
    
    # Predict
    pred = trainer.predict(sequence)[0]
    pred_label = np.argmax(pred)
    
    print(f"\nInput: {text}")
    print(f"Cleaned: {cleaned_text}")
    print(f"\nPredicted Emotion: {EMOTION_MAP[pred_label]} (confidence: {pred[pred_label]*100:.2f}%)")
    
    if show_probabilities:
        print("\nAll probabilities:")
        for emotion, prob in zip(EMOTION_LABELS, pred):
            bar = "█" * int(prob * 50)
            print(f"  {emotion:12s}: {bar} {prob*100:5.2f}%")

# Test the function
test_texts = [
    "I am so happy today!",
    "This is really frustrating and annoying",
    "I miss you so much my love",
    "I am terrified of what might happen",
    "Oh wow I did not expect that at all!"
]

for test_text in test_texts:
    predict_emotion(test_text)
    print("-" * 80)
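
`predict_emotion` only works if the sequence handed to the model has the same length and index scheme as the training data. The sketch below shows one way the text-to-sequence step could look. It assumes index 0 for padding, index 1 for unknown words, and padding and truncation at the front, which are the defaults of Keras `pad_sequences`. The helper and the toy vocabulary are hypothetical; the real `embedding_handler.texts_to_sequences` may behave differently.

```python
def texts_to_padded(texts, word_index, max_len, oov_index=1, pad_index=0):
    """Map whitespace-separated words to indices and left-pad to max_len."""
    padded = []
    for text in texts:
        ids = [word_index.get(word, oov_index) for word in text.split()]
        ids = ids[-max_len:]  # keep the last max_len tokens
        padded.append([pad_index] * (max_len - len(ids)) + ids)
    return padded


toy_vocab = {"i": 2, "am": 3, "happy": 4}  # made-up vocabulary
print(texts_to_padded(["i am so happy"], toy_vocab, max_len=6))
# [[0, 0, 2, 3, 1, 4]]
```

The word "so" is not in the toy vocabulary, so it maps to the unknown-word index instead of being dropped, which keeps word positions intact.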

## 12. Model Comparison (Optional)

Run multiple experiments with different configurations:

In [None]:
# Template for running several experiments in one loop.
# It is wrapped in a triple-quoted string so it does not execute as-is:
# remove the quotes and fill in the elided steps to run the comparison.

"""
from src.utils.config import get_all_configs

# Get predefined configurations
all_configs = get_all_configs()

results_comparison = {}

for config_name, exp_config in all_configs.items():
    print(f"\n{'='*80}")
    print(f"Running experiment: {config_name}")
    print(f"{'='*80}")
    
    # Update paths
    exp_config.embedding.embedding_path = GLOVE_PATH
    
    # Create model
    # ... (similar to sections above)
    
    # Train and evaluate
    # ... 
    
    # Store results
    results_comparison[config_name] = val_results

# Compare results
visualizer.compare_models(results_comparison)
"""

print("\nTo run model comparison, uncomment the code above and execute.")
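
The loop in the template can be factored so that each configuration produces its own metrics and the results are ranked at the end. The sketch below shows that structure only. `compare_experiments` and `placeholder_run` are hypothetical: the placeholder returns a number derived from the config purely to make the example runnable, and it does not represent real model results. In the notebook, it would be replaced by the build, train, and evaluate steps from sections 6 to 8.

```python
def compare_experiments(configs, run_experiment, metric="accuracy"):
    """Run every configuration and rank the experiment names by one metric."""
    results = {}
    for name, cfg in configs.items():
        results[name] = run_experiment(cfg)  # fresh metrics for each config
    ranking = sorted(results, key=lambda n: results[n][metric], reverse=True)
    return results, ranking


def placeholder_run(cfg):
    """Stand-in for build/train/evaluate; the score is synthetic, not a real result."""
    return {"accuracy": cfg["units"] / 1000}


toy_configs = {
    "lstm_small": {"units": 64},
    "gru_large": {"units": 256},
    "bilstm": {"units": 128},
}
toy_results, toy_ranking = compare_experiments(toy_configs, placeholder_run)
print(toy_ranking)
```

Keeping the per-experiment work inside one function also avoids accidentally storing metrics left over from an earlier cell.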

## Summary

This notebook demonstrated an end-to-end pipeline for emotion detection:

1. **Data Processing**: Text cleaning with contraction expansion and typo correction
2. **Embeddings**: Pretrained GloVe/Word2Vec vectors with coverage and OOV reporting
3. **Model Architecture**: LSTM/GRU models with configurable depth, dropout, and bidirectionality
4. **Training**: Training with early stopping, learning-rate reduction, and optional class weights
5. **Evaluation**: Confusion matrices, classification report, and per-class accuracy

### Next Steps:
- Experiment with different hyperparameters
- Try bidirectional models or deeper architectures
- Compare GloVe vs Word2Vec embeddings
- Fine-tune embeddings (set `config.embedding.trainable = True`)
- Implement data augmentation for minority classes
- Deploy the model as an API

The report is saved under `../results`, the configuration under `../configs`, and model checkpoints wherever `ModelTrainer` is set up to write them, so each run can be reproduced.