# 🚀 Unified Complete Emotion Detection Pipeline
## All-in-One Professional Deep Learning System

This notebook contains **everything** in one place:
- All classes and functions
- Complete preprocessing
- Multiple model architectures (LSTM, GRU, Bidirectional)
- Professional training pipeline
- Comprehensive visualizations
- Experiment tracking

**No external imports from src/ needed - completely self-contained!**

### Emotion Classes:
0. Sadness 😢
1. Joy 😊
2. Love ❤️
3. Anger 😠
4. Fear 😨
5. Surprise 😲

## 📦 Section 1: Imports and Setup

In [None]:
# Standard library
import os
import re
import json
import time
import logging
import warnings
from datetime import datetime
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass, asdict

# Data processing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import confusion_matrix, classification_report

# Deep Learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import (
    EarlyStopping, ModelCheckpoint, ReduceLROnPlateau,
    TensorBoard, CSVLogger, Callback
)

# Embeddings
from gensim.models import Word2Vec

# Settings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

print("✅ All imports successful!")
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {tf.config.list_physical_devices('GPU')}")

## 🔧 Section 2: Configuration Classes

In [None]:
@dataclass
class Config:
    """Complete configuration for emotion detection pipeline."""
    
    # Experiment
    experiment_name: str = "emotion_detection_unified"
    
    # Data paths
    train_path: str = "/home/lab/rabanof/projects/Emotion_Detection_DL/data/raw/train.csv"
    val_path: str = "/home/lab/rabanof/projects/Emotion_Detection_DL/data/raw/validation.csv"
    glove_path: str = "/home/lab/rabanof/Emotion_Detection_DL/glove/glove.6B.100d.txt"
    
    # Data parameters
    max_len: int = 60
    max_words: int = 20000
    text_column: str = "text"
    label_column: str = "label"
    
    # Embedding
    embedding_type: str = "glove"  # 'glove' or 'word2vec'
    embedding_dim: int = 100
    trainable_embeddings: bool = False
    oov_token: str = "<UNK>"
    
    # Model architecture
    model_type: str = "lstm"  # 'lstm', 'gru', or 'bilstm'
    rnn_units: int = 128
    num_layers: int = 1
    dropout: float = 0.2
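    recurrent_dropout: float = 0.0  # not read by ModelBuilder below, which hardcodes 0.0
    spatial_dropout: float = 0.2
    dense_units: int = 0  # not read by ModelBuilder below (no hidden dense layer is built)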
    num_classes: int = 6
    
    # Training
    epochs: int = 50
    batch_size: int = 32
    learning_rate: float = 0.001
    use_class_weights: bool = True
    
    # Callbacks
    early_stopping: bool = True
    patience: int = 5
    reduce_lr: bool = True
    lr_factor: float = 0.5
    lr_patience: int = 3
    min_lr: float = 1e-7
    
    # Directories
    save_dir: str = "saved_models"
    log_dir: str = "logs"
    result_dir: str = "results"

# Create default configuration
config = Config()
print("✅ Configuration created!")
print(f"Experiment: {config.experiment_name}")
print(f"Model: {config.model_type.upper()}, Units: {config.rnn_units}")
print(f"Embedding: {config.embedding_type.upper()}, Dim: {config.embedding_dim}")
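
The cell above imports `asdict` but never uses it. A minimal sketch of how a per-experiment override and a JSON snapshot of the settings might look. `MiniConfig` is a hypothetical, trimmed-down stand-in for `Config`, and the override values are made up for illustration.

```python
import json
from dataclasses import dataclass, asdict, replace


@dataclass
class MiniConfig:
    """Hypothetical stand-in holding a few of the Config fields."""
    experiment_name: str = "emotion_detection_unified"
    model_type: str = "lstm"
    rnn_units: int = 128
    learning_rate: float = 0.001


base = MiniConfig()

# replace() returns a modified copy and leaves the base config untouched
bilstm = replace(base, experiment_name="bilstm_64", model_type="bilstm", rnn_units=64)

# asdict() yields a plain dict that json can serialize next to the results
snapshot = json.dumps(asdict(bilstm), indent=2)
print(snapshot)
```

Storing such a snapshot alongside each run makes it easier to tell later which settings produced which result file.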

## 📊 Section 3: Text Preprocessing Class

In [None]:
class TextPreprocessor:
    """Advanced text preprocessing for emotion detection."""
    
    def __init__(self):
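        # Specific contractions, matched as whole words. The first two entries
        # are irregular forms that the generic "n't" rule below would otherwise
        # turn into "ca not" and "wo not". Note that "id" and "ill" are also
        # ordinary words, so those two rules can misfire.
        self.specific_contractions = {
            "can't": "cannot", "won't": "will not",
            "didnt": "did not", "dont": "do not", "cant": "cannot",
            "wont": "will not", "wouldnt": "would not", "shouldnt": "should not",
            "couldnt": "could not", "im": "i am", "ive": "i have",
            "id": "i would", "ill": "i will", "hadnt": "had not",
            "youve": "you have", "werent": "were not", "theyve": "they have",
            "theyll": "they will", "itll": "it will", "couldve": "could have",
            "shouldve": "should have", "wouldve": "would have"
        }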
        
        self.general_contractions = {
            "n't": " not", "'re": " are", "'s": " is",
            "'d": " would", "'ll": " will", "'t": " not",
            "'ve": " have", "'m": " am"
        }
        
        self.slang_corrections = {
            "idk": "i do not know", "yknow": "you know",
            "becuz": "because", "alittle": "a little", "incase": "in case"
        }
        
        self.typo_corrections = {
            "vunerable": "vulnerable", "percieve": "perceive",
            "definetly": "definitely", "writting": "writing"
        }
    
    def clean_text(self, text: str) -> str:
        """Apply comprehensive text cleaning."""
        if not isinstance(text, str):
            return ""
        
        # Lowercase
        text = text.lower()
        
        # Reduce elongation (sooooo → soo)
        text = re.sub(r'(.)\1{2,}', r'\1\1', text)
        
        # Expand contractions
        for key, value in self.specific_contractions.items():
            text = re.sub(rf'\b{re.escape(key)}\b', value, text)
        
        for key, value in self.general_contractions.items():
            text = text.replace(key, value)
        
        # Fix slang and typos
        for key, value in {**self.slang_corrections, **self.typo_corrections}.items():
            text = re.sub(rf'\b{re.escape(key)}\b', value, text)
        
        # Reduce repeated punctuation
        text = re.sub(r"([!?.,])\1+", r"\1", text)
        text = re.sub(r"\.{2,}", ".", text)
        
        # Normalize whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
    def preprocess_dataframe(self, df: pd.DataFrame, text_column: str = 'text') -> pd.DataFrame:
        """Preprocess entire dataframe."""
        df = df.copy()
        logger.info(f"Preprocessing {len(df)} samples...")
        
        df[text_column] = df[text_column].apply(self.clean_text)
        df['text_len'] = df[text_column].str.split().str.len()
        
        logger.info(f"Mean text length: {df['text_len'].mean():.2f} words")
        return df
    
    def remove_duplicates(self, df: pd.DataFrame, text_column: str = 'text') -> pd.DataFrame:
        """Remove duplicate texts."""
        initial_len = len(df)
        df = df.drop_duplicates(subset=[text_column], keep='first')
        removed = initial_len - len(df)
        if removed > 0:
            logger.warning(f"Removed {removed} duplicates")
        return df.reset_index(drop=True)
    
    def check_data_leakage(self, train_df: pd.DataFrame, val_df: pd.DataFrame,
                          text_column: str = 'text') -> Tuple[pd.DataFrame, int]:
        """Check and remove overlapping texts."""
        train_texts = set(train_df[text_column])
        val_texts = set(val_df[text_column])
        overlaps = val_texts.intersection(train_texts)
        
        if len(overlaps) > 0:
            logger.warning(f"Found {len(overlaps)} overlapping texts")
            val_df_clean = val_df[~val_df[text_column].isin(overlaps)].copy()
            return val_df_clean.reset_index(drop=True), len(overlaps)
        
        return val_df, 0
    
    def compute_class_weights(self, labels: np.ndarray) -> Dict[int, float]:
        """Compute class weights for imbalanced data."""
        classes = np.unique(labels)
        weights = compute_class_weight('balanced', classes=classes, y=labels)
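        # Cast NumPy scalars to plain Python types so the dict matches the
        # annotated Dict[int, float] and stays JSON-serializable
        class_weights = {int(c): float(w) for c, w in zip(classes, weights)}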
        logger.info(f"Class weights: {class_weights}")
        return class_weights

print("✅ TextPreprocessor class created!")
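
To make the order of the cleaning steps concrete, here is a standalone sketch that traces one made-up sentence through lowercasing, elongation squeezing, whole-word contraction expansion, and punctuation reduction. The helper names `squeeze_elongation` and `expand` are hypothetical. They reuse the same regular expressions as `clean_text` but are not part of the class.

```python
import re


def squeeze_elongation(text: str) -> str:
    """Collapse any character repeated three or more times down to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


def expand(text: str, rules: dict) -> str:
    """Whole-word replacement using word-boundary anchors."""
    for key, value in rules.items():
        text = re.sub(rf"\b{re.escape(key)}\b", value, text)
    return text


rules = {"dont": "do not", "im": "i am"}  # two of the notebook's rules
sample = "Im sooooo happy!!! dont stop"  # made-up input

step1 = sample.lower()
step2 = squeeze_elongation(step1)             # "sooooo" -> "soo", "!!!" -> "!!"
step3 = expand(step2, rules)                  # "im" -> "i am", "dont" -> "do not"
step4 = re.sub(r"([!?.,])\1+", r"\1", step3)  # "!!" -> "!"
print(step4)  # i am soo happy! do not stop
```

Because the rules are anchored with `\b`, a word such as "time" is left alone even though it contains "im".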

## 🔤 Section 4: Embedding Handler Class

In [None]:
class EmbeddingHandler:
    """Handle embeddings (GloVe/Word2Vec) and sequence generation."""
    
    def __init__(self, embedding_type='glove', embedding_dim=100, 
                 max_words=20000, max_len=60, oov_token='<UNK>'):
        self.embedding_type = embedding_type
        self.embedding_dim = embedding_dim
        self.max_words = max_words
        self.max_len = max_len
        self.oov_token = oov_token
        
        self.tokenizer = None
        self.embedding_matrix = None
        self.embeddings_index = {}
        self.vocab_size = 0
    
    def load_glove_embeddings(self, glove_path: str) -> Dict[str, np.ndarray]:
        """Load GloVe pre-trained embeddings."""
        logger.info(f"Loading GloVe from {glove_path}")
        embeddings_index = {}
        
        with open(glove_path, encoding='utf8') as f:
            for line in f:
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs
        
        logger.info(f"Loaded {len(embeddings_index)} word vectors")
        self.embeddings_index = embeddings_index
        return embeddings_index
    
    def train_word2vec(self, texts: list, vector_size=100, window=5, 
                      min_count=2, workers=4, epochs=10):
        """Train Word2Vec on corpus."""
        logger.info("Training Word2Vec...")
        tokenized_texts = [text.split() for text in texts]
        
        model = Word2Vec(
            sentences=tokenized_texts,
            vector_size=vector_size,
            window=window,
            min_count=min_count,
            workers=workers,
            epochs=epochs
        )
        
        logger.info(f"Word2Vec trained, vocab size: {len(model.wv)}")
        self.embeddings_index = {word: model.wv[word] for word in model.wv.index_to_key}
        return model
    
    def create_tokenizer(self, texts: list):
        """Create and fit tokenizer."""
        logger.info("Creating tokenizer...")
        tokenizer = Tokenizer(num_words=self.max_words, oov_token=self.oov_token)
        tokenizer.fit_on_texts(texts)
        
        self.tokenizer = tokenizer
        self.vocab_size = min(self.max_words, len(tokenizer.word_index) + 1)
        logger.info(f"Vocabulary size: {self.vocab_size}")
        return tokenizer
    
    def texts_to_sequences(self, texts: list, pad=True) -> np.ndarray:
        """Convert texts to padded sequences."""
        sequences = self.tokenizer.texts_to_sequences(texts)
        if pad:
            sequences = pad_sequences(sequences, maxlen=self.max_len, 
                                     padding='post', truncating='post')
        return sequences
    
    def create_embedding_matrix(self):
        """Create embedding matrix from loaded embeddings."""
        logger.info("Creating embedding matrix...")
        embedding_matrix = np.zeros((self.vocab_size, self.embedding_dim))
        words_found = 0
        words_not_found = []
        
        for word, idx in self.tokenizer.word_index.items():
            if idx >= self.max_words:
                continue
            
            embedding_vector = self.embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[idx] = embedding_vector
                words_found += 1
            else:
                words_not_found.append(word)
                embedding_matrix[idx] = np.random.normal(0, 0.1, self.embedding_dim)
        
        self.embedding_matrix = embedding_matrix
        coverage = (words_found / self.vocab_size) * 100
        
        logger.info(f"Coverage: {coverage:.2f}% ({words_found}/{self.vocab_size})")
        logger.info(f"Sample OOV: {words_not_found[:10]}")
        
        return embedding_matrix, {
            'coverage_percent': coverage,
            'words_found': words_found,
            'words_not_found': len(words_not_found)
        }
    
    def get_oov_rate(self, sequences: np.ndarray) -> float:
        """Calculate OOV rate in sequences."""
        oov_index = self.tokenizer.word_index.get(self.oov_token, 1)
        total_tokens = np.count_nonzero(sequences)
        oov_tokens = np.sum(sequences == oov_index)
        oov_rate = (oov_tokens / total_tokens * 100) if total_tokens > 0 else 0
        return oov_rate

print("✅ EmbeddingHandler class created!")
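
`texts_to_sequences` pads and truncates with `padding='post'` and `truncating='post'`, and `get_oov_rate` counts how many non-padding ids equal the OOV index. The pure-Python sketch below mimics both on made-up id sequences. `pad_post` and `oov_rate` are hypothetical helpers. It assumes, as in the Keras `Tokenizer`, that id 0 is reserved for padding and id 1 is the OOV token.

```python
from typing import List


def pad_post(seq: List[int], maxlen: int, value: int = 0) -> List[int]:
    """Keep the first maxlen ids, then right-pad with `value`."""
    kept = seq[:maxlen]  # 'post' truncation drops the tail
    return kept + [value] * (maxlen - len(kept))  # 'post' padding fills the end


def oov_rate(rows: List[List[int]], oov_index: int = 1) -> float:
    """Percentage of non-padding tokens that equal the OOV index."""
    tokens = [t for row in rows for t in row if t != 0]
    if not tokens:
        return 0.0
    return 100.0 * sum(t == oov_index for t in tokens) / len(tokens)


rows = [pad_post([5, 1, 9], 5), pad_post([4, 8, 1, 1, 7, 3, 2], 5)]
print(rows)            # [[5, 1, 9, 0, 0], [4, 8, 1, 1, 7]]
print(oov_rate(rows))  # 37.5
```

Note that with post-truncation the OOV rate is measured only on the tokens that survive truncation, not on the full text.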

## 🏗️ Section 5: Model Builder Class

In [None]:
class ModelBuilder:
    """Build LSTM/GRU models for emotion detection."""
    
    def __init__(self, vocab_size, embedding_dim, embedding_matrix=None, 
                 max_len=60, num_classes=6):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.embedding_matrix = embedding_matrix
        self.max_len = max_len
        self.num_classes = num_classes
    
    def build_lstm(self, units=128, num_layers=1, dropout=0.2, 
                   spatial_dropout=0.2, bidirectional=False, 
                   trainable_embeddings=False):
        """Build LSTM model."""
        model = models.Sequential(name='LSTM_Model')
        
        # Embedding
        if self.embedding_matrix is not None:
            model.add(layers.Embedding(
                input_dim=self.vocab_size,
                output_dim=self.embedding_dim,
                weights=[self.embedding_matrix],
                input_length=self.max_len,
                trainable=trainable_embeddings,
                name='embedding'
            ))
        else:
            model.add(layers.Embedding(
                input_dim=self.vocab_size,
                output_dim=self.embedding_dim,
                input_length=self.max_len,
                name='embedding'
            ))
        
        # Spatial Dropout
        if spatial_dropout > 0:
            model.add(layers.SpatialDropout1D(spatial_dropout))
        
        # LSTM layers
        for i in range(num_layers):
            return_sequences = (i < num_layers - 1)
            lstm_layer = layers.LSTM(
                units=units,
                dropout=dropout,
                recurrent_dropout=0.0,
                return_sequences=return_sequences,
                name=f'lstm_{i+1}'
            )
            
            if bidirectional:
                lstm_layer = layers.Bidirectional(lstm_layer, name=f'bi_lstm_{i+1}')
            
            model.add(lstm_layer)
        
        # Output
        model.add(layers.Dense(self.num_classes, activation='softmax', name='output'))
        
        return model
    
    def build_gru(self, units=128, num_layers=1, dropout=0.2, 
                  spatial_dropout=0.2, bidirectional=False, 
                  trainable_embeddings=False):
        """Build GRU model."""
        model = models.Sequential(name='GRU_Model')
        
        # Embedding
        if self.embedding_matrix is not None:
            model.add(layers.Embedding(
                input_dim=self.vocab_size,
                output_dim=self.embedding_dim,
                weights=[self.embedding_matrix],
                input_length=self.max_len,
                trainable=trainable_embeddings,
                name='embedding'
            ))
        else:
            model.add(layers.Embedding(
                input_dim=self.vocab_size,
                output_dim=self.embedding_dim,
                input_length=self.max_len,
                name='embedding'
            ))
        
        # Spatial Dropout
        if spatial_dropout > 0:
            model.add(layers.SpatialDropout1D(spatial_dropout))
        
        # GRU layers
        for i in range(num_layers):
            return_sequences = (i < num_layers - 1)
            gru_layer = layers.GRU(
                units=units,
                dropout=dropout,
                recurrent_dropout=0.0,
                return_sequences=return_sequences,
                name=f'gru_{i+1}'
            )
            
            if bidirectional:
                gru_layer = layers.Bidirectional(gru_layer, name=f'bi_gru_{i+1}')
            
            model.add(gru_layer)
        
        # Output
        model.add(layers.Dense(self.num_classes, activation='softmax', name='output'))
        
        return model
    
    def compile_model(self, model, learning_rate=0.001):
        """Compile model."""
        optimizer = Adam(learning_rate=learning_rate)
        model.compile(
            optimizer=optimizer,
            loss='categorical_crossentropy',
            metrics=['accuracy']
        )
        return model

print("✅ ModelBuilder class created!")
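
As a sanity check against `model.summary()`, the trainable parameter counts of the recurrent and output layers can be worked out by hand. This sketch assumes the notebook's defaults (100-dimensional embeddings, 128 units, 6 classes) and the TensorFlow 2 default `reset_after=True` for GRU, which keeps two bias vectors per gate. The function names are hypothetical.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and one bias."""
    return 4 * (input_dim * units + units * units + units)


def gru_params(input_dim: int, units: int) -> int:
    """Three gates; with reset_after=True each gate has two bias vectors."""
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(inputs: int, outputs: int) -> int:
    """Weight matrix plus one bias per output unit."""
    return inputs * outputs + outputs


emb_dim, units, classes = 100, 128, 6

print(lstm_params(emb_dim, units))        # 117248
print(gru_params(emb_dim, units))         # 88320
print(2 * lstm_params(emb_dim, units))    # 234496 for a bidirectional LSTM
print(dense_params(units, classes))       # 774
print(dense_params(2 * units, classes))   # 1542 after a bidirectional layer
```

The bidirectional wrapper doubles the recurrent parameters and also doubles the input width of the output layer, since the forward and backward states are concatenated by default.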

## 📈 Section 6: Visualization Class

In [None]:
class ResultsVisualizer:
    """Comprehensive visualization for results."""
    
    def __init__(self, emotion_labels=None):
        if emotion_labels is None:
            self.emotion_labels = ['Sadness', 'Joy', 'Love', 'Anger', 'Fear', 'Surprise']
        else:
            self.emotion_labels = emotion_labels
    
    def plot_training_history(self, history):
        """Plot training history."""
        fig, axes = plt.subplots(1, 2, figsize=(15, 5))
        
        # Accuracy
        axes[0].plot(history.history['accuracy'], label='Train', marker='o')
        axes[0].plot(history.history['val_accuracy'], label='Validation', marker='s')
        axes[0].set_title('Model Accuracy', fontsize=14, fontweight='bold')
        axes[0].set_xlabel('Epoch')
        axes[0].set_ylabel('Accuracy')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # Loss
        axes[1].plot(history.history['loss'], label='Train', marker='o')
        axes[1].plot(history.history['val_loss'], label='Validation', marker='s')
        axes[1].set_title('Model Loss', fontsize=14, fontweight='bold')
        axes[1].set_xlabel('Epoch')
        axes[1].set_ylabel('Loss')
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    
    def plot_confusion_matrix(self, y_true, y_pred, normalize=False):
        """Plot confusion matrix."""
        if y_true.ndim > 1:
            y_true = np.argmax(y_true, axis=1)
        if y_pred.ndim > 1:
            y_pred = np.argmax(y_pred, axis=1)
        
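        # Pin the label set so the matrix always matches the tick labels,
        # even if a class is missing from this split
        cm = confusion_matrix(y_true, y_pred,
                              labels=np.arange(len(self.emotion_labels)))
        
        if normalize:
            # Row-normalize; the max(..., 1) guard avoids dividing by zero
            # for classes with no true samples
            cm = cm.astype('float') / np.maximum(cm.sum(axis=1)[:, np.newaxis], 1)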
            fmt = '.2f'
            title = 'Normalized Confusion Matrix'
        else:
            fmt = 'd'
            title = 'Confusion Matrix'
        
        plt.figure(figsize=(10, 8))
        sns.heatmap(cm, annot=True, fmt=fmt, cmap='Blues',
                   xticklabels=self.emotion_labels,
                   yticklabels=self.emotion_labels)
        plt.title(title, fontsize=14, fontweight='bold')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        plt.tight_layout()
        plt.show()
    
    def plot_classification_report(self, y_true, y_pred):
        """Plot classification report."""
        if y_true.ndim > 1:
            y_true = np.argmax(y_true, axis=1)
        if y_pred.ndim > 1:
            y_pred = np.argmax(y_pred, axis=1)
        
        report = classification_report(y_true, y_pred,
                                      labels=np.arange(len(self.emotion_labels)),
                                      target_names=self.emotion_labels,
                                      output_dict=True,
                                      zero_division=0)
        
        df_report = pd.DataFrame(report).transpose()
        
        # Plot
        fig, ax = plt.subplots(figsize=(10, 6))
        metrics = ['precision', 'recall', 'f1-score']
        df_plot = df_report.loc[self.emotion_labels, metrics]
        
        df_plot.plot(kind='bar', ax=ax, width=0.8)
        ax.set_title('Classification Metrics by Emotion', fontsize=14, fontweight='bold')
        ax.set_xlabel('Emotion')
        ax.set_ylabel('Score')
        ax.set_ylim([0, 1.0])
        ax.legend(title='Metric')
        ax.grid(True, alpha=0.3, axis='y')
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()
        
        # Print report
        print("\n" + "="*70)
        print("Classification Report:")
        print("="*70)
        print(classification_report(y_true, y_pred,
                                    labels=np.arange(len(self.emotion_labels)),
                                    target_names=self.emotion_labels,
                                    zero_division=0))
    
    def plot_per_class_accuracy(self, y_true, y_pred):
        """Plot per-class accuracy (share of each true class predicted correctly, i.e. recall)."""
        if y_true.ndim > 1:
            y_true = np.argmax(y_true, axis=1)
        if y_pred.ndim > 1:
            y_pred = np.argmax(y_pred, axis=1)
        
        accuracies = []
        for i in range(len(self.emotion_labels)):
            mask = y_true == i
            if mask.sum() > 0:
                acc = (y_pred[mask] == i).sum() / mask.sum()
                accuracies.append(acc)
            else:
                accuracies.append(0)
        
        plt.figure(figsize=(10, 6))
        bars = plt.bar(self.emotion_labels, accuracies, edgecolor='black')
        
        # Color by performance
        for i, bar in enumerate(bars):
            if accuracies[i] >= 0.8:
                bar.set_color('green')
            elif accuracies[i] >= 0.6:
                bar.set_color('orange')
            else:
                bar.set_color('red')
        
        plt.title('Per-Class Accuracy', fontsize=14, fontweight='bold')
        plt.xlabel('Emotion')
        plt.ylabel('Accuracy')
        plt.ylim([0, 1.0])
        plt.xticks(rotation=45, ha='right')
        plt.grid(True, alpha=0.3, axis='y')
        
        # Add value labels
        for i, acc in enumerate(accuracies):
            plt.text(i, acc + 0.02, f'{acc:.2%}', ha='center', fontweight='bold')
        
        plt.tight_layout()
        plt.show()

print("✅ ResultsVisualizer class created!")
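
The quantity plotted by `plot_per_class_accuracy` is the diagonal of the row-normalized confusion matrix, which is the same thing as per-class recall. A small pure-Python sketch on made-up labels illustrates the relationship. `confusion` and `per_class_recall` are hypothetical helpers, not the scikit-learn functions.

```python
from typing import List


def confusion(y_true: List[int], y_pred: List[int], n: int) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_recall(cm: List[List[int]]) -> List[float]:
    """Diagonal divided by row sum; 0.0 for classes with no true samples."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


y_true = [0, 0, 0, 1, 1, 2]  # made-up labels
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion(y_true, y_pred, 3)
recall = per_class_recall(cm)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(cm)  # [[2, 1, 0], [0, 2, 0], [1, 0, 0]]
print(recall)
print(accuracy)
```

Here overall accuracy is 4/6 while class 2 has a recall of 0, which is why the per-class view matters for imbalanced emotion data.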

## 🎯 Section 7: Experiment Tracker Callback

In [None]:
class ExperimentTracker(Callback):
    """Track experiment metrics and save results."""
    
    def __init__(self, experiment_name, results_dir='results'):
        super().__init__()
        self.experiment_name = experiment_name
        self.results_dir = results_dir
        self.start_time = None
        self.metrics_history = {'train': {}, 'val': {}}
        os.makedirs(results_dir, exist_ok=True)
    
    def on_train_begin(self, logs=None):
        self.start_time = time.time()
        logger.info(f"Training started: {self.experiment_name}")
    
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        for key, value in logs.items():
            if key.startswith('val_'):
                metric_name = key[4:]
                if metric_name not in self.metrics_history['val']:
                    self.metrics_history['val'][metric_name] = []
                self.metrics_history['val'][metric_name].append(float(value))
            else:
                if key not in self.metrics_history['train']:
                    self.metrics_history['train'][key] = []
                self.metrics_history['train'][key].append(float(value))
    
    def on_train_end(self, logs=None):
        training_time = time.time() - self.start_time
        
        results = {
            'experiment_name': self.experiment_name,
            'training_time_seconds': training_time,
            'total_epochs': len(self.metrics_history['train'].get('loss', [])),
            'metrics_history': self.metrics_history,
            'final_metrics': {
                'train': {k: v[-1] for k, v in self.metrics_history['train'].items() if v},
                'val': {k: v[-1] for k, v in self.metrics_history['val'].items() if v}
            },
            'best_metrics': {
                'val_accuracy': max(self.metrics_history['val'].get('accuracy', [0])),
                'val_loss': min(self.metrics_history['val'].get('loss', [float('inf')]))
            }
        }
        
        # Save results
        results_file = os.path.join(self.results_dir, f'{self.experiment_name}_results.json')
        with open(results_file, 'w') as f:
            json.dump(results, f, indent=2)
        
        logger.info(f"Training completed in {training_time:.2f}s")
        logger.info(f"Best val_accuracy: {results['best_metrics']['val_accuracy']:.4f}")

print("✅ ExperimentTracker class created!")
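
The JSON file written in `on_train_end` can be read back later to compare runs. The sketch below writes a made-up result in the same `metrics_history` layout to a temporary directory, reloads it, and finds the epoch with the lowest validation loss. All numbers are invented for illustration.

```python
import json
import os
import tempfile

# Made-up numbers in the same layout that ExperimentTracker writes
results = {
    "experiment_name": "demo",
    "metrics_history": {
        "train": {"loss": [1.2, 0.8, 0.6], "accuracy": [0.55, 0.72, 0.80]},
        "val": {"loss": [1.1, 0.9, 0.95], "accuracy": [0.60, 0.70, 0.69]},
    },
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_results.json")
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

val_loss = loaded["metrics_history"]["val"]["loss"]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)  # 0-based index
print(f"Best epoch: {best_epoch + 1}, val_loss: {val_loss[best_epoch]}")
```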

## 🚀 Section 8: Main Pipeline - Data Loading

In [None]:
# Emotion mapping
EMOTION_MAP = {0: 'Sadness', 1: 'Joy', 2: 'Love', 3: 'Anger', 4: 'Fear', 5: 'Surprise'}
EMOTION_LABELS = list(EMOTION_MAP.values())

print("="*80)
print("STEP 1: LOADING DATA")
print("="*80)

# Load data
train_df = pd.read_csv(config.train_path)
val_df = pd.read_csv(config.val_path)

print(f"\nLoaded {len(train_df)} training samples")
print(f"Loaded {len(val_df)} validation samples")

# Display samples
print("\nSample data:")
display(train_df.head())

# Label distribution
print("\nLabel Distribution:")
print(train_df['label'].value_counts().sort_index())

## 📊 Section 9: Exploratory Data Analysis

In [None]:
print("="*80)
print("STEP 2: EXPLORATORY DATA ANALYSIS")
print("="*80)

# Visualize label distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Training set
train_counts = train_df['label'].map(EMOTION_MAP).value_counts()
axes[0].bar(train_counts.index, train_counts.values, color='skyblue', edgecolor='black')
axes[0].set_title('Training Set - Emotion Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Emotion')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3, axis='y')

# Validation set
val_counts = val_df['label'].map(EMOTION_MAP).value_counts()
axes[1].bar(val_counts.index, val_counts.values, color='lightcoral', edgecolor='black')
axes[1].set_title('Validation Set - Emotion Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Emotion')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Text length analysis
train_df['text_len'] = train_df['text'].str.split().str.len()
val_df['text_len'] = val_df['text'].str.split().str.len()

print("\nText Length Statistics:")
print(train_df['text_len'].describe())

# Plot text length distribution
plt.figure(figsize=(12, 5))
plt.hist(train_df['text_len'], bins=50, edgecolor='black', alpha=0.7)
plt.axvline(train_df['text_len'].mean(), color='red', linestyle='--', 
           label=f'Mean: {train_df["text_len"].mean():.1f}')
plt.axvline(train_df['text_len'].median(), color='green', linestyle='--', 
           label=f'Median: {train_df["text_len"].median():.0f}')
plt.title('Text Length Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
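
The length distribution is what justifies a choice of `max_len`. One way to check a candidate value is to compute the share of texts that would be truncated. A minimal sketch on made-up word counts follows. `truncated_share` is a hypothetical helper, and whitespace word counts only approximate the tokenizer's token counts.

```python
from typing import Sequence


def truncated_share(lengths: Sequence[int], max_len: int) -> float:
    """Fraction of texts longer than max_len, i.e. texts that lose tokens."""
    return sum(n > max_len for n in lengths) / len(lengths)


lengths = [4, 12, 19, 25, 33, 61, 8, 15]  # made-up word counts

print(truncated_share(lengths, 60))  # 0.125
print(truncated_share(lengths, 20))  # 0.375
```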

## 🔧 Section 10: Text Preprocessing

In [None]:
print("="*80)
print("STEP 3: TEXT PREPROCESSING")
print("="*80)

# Create preprocessor
preprocessor = TextPreprocessor()

# Preprocess dataframes
train_df = preprocessor.preprocess_dataframe(train_df)
val_df = preprocessor.preprocess_dataframe(val_df)

# Remove duplicates
train_df = preprocessor.remove_duplicates(train_df)

# Check data leakage
val_df, overlaps = preprocessor.check_data_leakage(train_df, val_df)

print(f"\nAfter preprocessing:")
print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Overlaps removed: {overlaps}")

# Show examples
print("\nSample preprocessed texts:")
for i, row in train_df.head(5).iterrows():
    print(f"[{EMOTION_MAP[row['label']]}] {row['text']}")

## 🔤 Section 11: Embedding Creation

In [None]:
print("="*80)
print("STEP 4: CREATING EMBEDDINGS")
print("="*80)

# Create embedding handler
embedding_handler = EmbeddingHandler(
    embedding_type=config.embedding_type,
    embedding_dim=config.embedding_dim,
    max_words=config.max_words,
    max_len=config.max_len,
    oov_token=config.oov_token
)

# Create tokenizer
embedding_handler.create_tokenizer(train_df['text'].tolist())

# Load embeddings
if config.embedding_type == 'glove':
    embedding_handler.load_glove_embeddings(config.glove_path)
elif config.embedding_type == 'word2vec':
    embedding_handler.train_word2vec(train_df['text'].tolist(), 
                                     vector_size=config.embedding_dim)

# Create embedding matrix
embedding_matrix, stats = embedding_handler.create_embedding_matrix()

# Convert texts to sequences
X_train = embedding_handler.texts_to_sequences(train_df['text'].tolist())
X_val = embedding_handler.texts_to_sequences(val_df['text'].tolist())

# Prepare labels
y_train = to_categorical(train_df['label'].values, num_classes=config.num_classes)
y_val = to_categorical(val_df['label'].values, num_classes=config.num_classes)

print(f"\nData shapes:")
print(f"X_train: {X_train.shape}")
print(f"y_train: {y_train.shape}")
print(f"X_val: {X_val.shape}")
print(f"y_val: {y_val.shape}")
print(f"Embedding matrix: {embedding_matrix.shape}")

# Check OOV rates
train_oov = embedding_handler.get_oov_rate(X_train)
val_oov = embedding_handler.get_oov_rate(X_val)
print(f"\nOOV Rates:")
print(f"Training: {train_oov:.2f}%")
print(f"Validation: {val_oov:.2f}%")

## ⚖️ Section 12: Class Weights

In [None]:
print("="*80)
print("STEP 5: COMPUTING CLASS WEIGHTS")
print("="*80)

if config.use_class_weights:
    class_weights = preprocessor.compute_class_weights(train_df['label'].values)
    print("\nClass Weights:")
    for label, weight in class_weights.items():
        print(f"{EMOTION_MAP[label]:12s}: {weight:.3f}")
else:
    class_weights = None
    print("\nClass weights disabled")
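
The `'balanced'` mode of scikit-learn's `compute_class_weight` assigns each class the weight `n_samples / (n_classes * class_count)`, so rare classes get proportionally larger weights. The pure-Python sketch below reproduces that formula on a made-up label list. `balanced_weights` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_weights(labels: Sequence[int]) -> Dict[int, float]:
    """weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


labels = [0] * 6 + [1] * 3 + [2] * 1  # made-up, imbalanced labels

weights = balanced_weights(labels)
for label, weight in weights.items():
    print(f"class {label}: {weight:.3f}")
```

With these counts the majority class gets a weight below 1 and the rarest class a weight above 3, so each class contributes the same total weight to the loss.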

## 🏗️ Section 13: Model Creation

In [None]:
print("="*80)
print("STEP 6: BUILDING MODEL")
print("="*80)

# Create model builder
builder = ModelBuilder(
    vocab_size=embedding_handler.vocab_size,
    embedding_dim=config.embedding_dim,
    embedding_matrix=embedding_matrix,
    max_len=config.max_len,
    num_classes=config.num_classes
)

# Build model based on config
if config.model_type == 'lstm':
    model = builder.build_lstm(
        units=config.rnn_units,
        num_layers=config.num_layers,
        dropout=config.dropout,
        spatial_dropout=config.spatial_dropout,
        bidirectional=False,
        trainable_embeddings=config.trainable_embeddings
    )
elif config.model_type == 'gru':
    model = builder.build_gru(
        units=config.rnn_units,
        num_layers=config.num_layers,
        dropout=config.dropout,
        spatial_dropout=config.spatial_dropout,
        bidirectional=False,
        trainable_embeddings=config.trainable_embeddings
    )
elif config.model_type == 'bilstm':
    model = builder.build_lstm(
        units=config.rnn_units,
        num_layers=config.num_layers,
        dropout=config.dropout,
        spatial_dropout=config.spatial_dropout,
        bidirectional=True,
        trainable_embeddings=config.trainable_embeddings
    )

# Compile model
model = builder.compile_model(model, learning_rate=config.learning_rate)

# Display architecture
print(f"\nModel: {config.model_type.upper()}")
model.summary()

## 🚂 Section 14: Model Training

In [None]:
print("="*80)
print("STEP 7: TRAINING MODEL")
print("="*80)

# Create directories
os.makedirs(config.save_dir, exist_ok=True)
os.makedirs(config.log_dir, exist_ok=True)
os.makedirs(config.result_dir, exist_ok=True)

# Create callbacks
callbacks = []

# Model checkpoint
checkpoint_path = os.path.join(config.save_dir, f'{config.experiment_name}_best.keras')
callbacks.append(ModelCheckpoint(
    filepath=checkpoint_path,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True,
    verbose=1
))

# Early stopping
if config.early_stopping:
    callbacks.append(EarlyStopping(
        monitor='val_loss',
        patience=config.patience,
        restore_best_weights=True,
        verbose=1
    ))

# Reduce learning rate
if config.reduce_lr:
    callbacks.append(ReduceLROnPlateau(
        monitor='val_loss',
        factor=config.lr_factor,
        patience=config.lr_patience,
        min_lr=config.min_lr,
        verbose=1
    ))

# CSV logger
csv_path = os.path.join(config.log_dir, f'{config.experiment_name}_training.csv')
callbacks.append(CSVLogger(csv_path))

# Experiment tracker
callbacks.append(ExperimentTracker(config.experiment_name, config.result_dir))

# Train model
print(f"\nTraining {config.model_type.upper()} for {config.epochs} epochs...\n")

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=config.epochs,
    batch_size=config.batch_size,
    class_weight=class_weights,
    callbacks=callbacks,
    verbose=1
)

print("\n✅ Training complete!")
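
Note that `ModelCheckpoint` above monitors `val_accuracy` while `EarlyStopping` and `ReduceLROnPlateau` monitor `val_loss`, so the epoch whose weights are saved to disk need not be the epoch with the lowest validation loss. The sketch below illustrates this with made-up per-epoch values (placeholders, not results from this notebook); the `demo_*` names are hypothetical:

```python
# Made-up validation metrics for five epochs (placeholders only)
demo_val_loss = [0.90, 0.70, 0.65, 0.68, 0.72]
demo_val_accuracy = [0.60, 0.72, 0.74, 0.76, 0.75]

# Epochs are 1-indexed, as in the Keras progress output
best_loss_epoch = demo_val_loss.index(min(demo_val_loss)) + 1
best_acc_epoch = demo_val_accuracy.index(max(demo_val_accuracy)) + 1

print(f"Lowest val_loss at epoch {best_loss_epoch}")
print(f"Highest val_accuracy at epoch {best_acc_epoch}")
```

If the saved checkpoint and the early-stopping decision should agree, one option is to have both callbacks monitor the same metric.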

## 📊 Section 15: Model Evaluation

In [None]:
print("="*80)
print("STEP 8: MODEL EVALUATION")
print("="*80)

# Evaluate on validation set
val_loss, val_accuracy = model.evaluate(X_val, y_val, verbose=0)

print(f"\nValidation Results:")
print(f"Loss: {val_loss:.4f}")
print(f"Accuracy: {val_accuracy:.4f} ({val_accuracy*100:.2f}%)")

# Get predictions
y_pred = model.predict(X_val, verbose=0)

print("\n✅ Evaluation complete!")

## 📈 Section 16: Visualization - Training History

In [None]:
print("="*80)
print("VISUALIZATION 1: TRAINING HISTORY")
print("="*80)

visualizer = ResultsVisualizer(EMOTION_LABELS)
visualizer.plot_training_history(history)

## 📈 Section 17: Visualization - Confusion Matrix

In [None]:
print("="*80)
print("VISUALIZATION 2: CONFUSION MATRIX")
print("="*80)

# Raw confusion matrix
visualizer.plot_confusion_matrix(y_val, y_pred, normalize=False)

In [None]:
# Normalized confusion matrix
visualizer.plot_confusion_matrix(y_val, y_pred, normalize=True)

## 📈 Section 18: Visualization - Classification Report

In [None]:
print("="*80)
print("VISUALIZATION 3: CLASSIFICATION REPORT")
print("="*80)

visualizer.plot_classification_report(y_val, y_pred)

## 📈 Section 19: Visualization - Per-Class Accuracy

In [None]:
print("="*80)
print("VISUALIZATION 4: PER-CLASS ACCURACY")
print("="*80)

visualizer.plot_per_class_accuracy(y_val, y_pred)
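
Per-class accuracy is commonly defined as the fraction of samples of each true class that were predicted correctly (the diagonal of the row-normalized confusion matrix). The following is a minimal NumPy sketch of that definition, assuming one-hot labels and softmax probabilities shaped like `y_val` and `y_pred`; the `per_class_accuracy` helper and the toy arrays are illustrative and may differ from how `ResultsVisualizer` computes its plot:

```python
import numpy as np

def per_class_accuracy(y_true_onehot, y_prob, num_classes):
    """Return, for each class, correct predictions / true samples of that class."""
    y_true = np.argmax(y_true_onehot, axis=1)
    y_hat = np.argmax(y_prob, axis=1)
    acc = np.full(num_classes, np.nan)  # NaN marks classes absent from y_true
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            acc[c] = np.mean(y_hat[mask] == c)
    return acc

# Toy data: 4 samples, 3 classes (hypothetical values)
y_true_demo = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_prob_demo = np.array([
    [0.7, 0.2, 0.1],  # true class 0, predicted 0
    [0.2, 0.5, 0.3],  # true class 0, predicted 1
    [0.1, 0.8, 0.1],  # true class 1, predicted 1
    [0.3, 0.3, 0.4],  # true class 2, predicted 2
])
demo_acc = per_class_accuracy(y_true_demo, y_prob_demo, num_classes=3)
print(demo_acc)
```

Here class 0 has one of its two samples misclassified, while classes 1 and 2 are predicted correctly.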

## 🎯 Section 20: Prediction Examples

In [None]:
print("="*80)
print("PREDICTION EXAMPLES")
print("="*80)

# Sample predictions (capped so a small validation set does not raise an error)
n_samples = min(10, len(val_df))
sample_indices = np.random.choice(len(val_df), n_samples, replace=False)

print("\nSample Predictions:")
print("="*80)

for idx in sample_indices:
    text = val_df.iloc[idx]['text']
    true_label = np.argmax(y_val[idx])
    pred_label = np.argmax(y_pred[idx])
    confidence = y_pred[idx][pred_label] * 100
    
    correct = "✅" if true_label == pred_label else "❌"
    
    print(f"\n{correct} Text: {text}")
    print(f"   True: {EMOTION_MAP[true_label]:12s} | Predicted: {EMOTION_MAP[pred_label]:12s} (confidence: {confidence:.1f}%)")
    
    # Top 3 predictions
    top3_idx = np.argsort(y_pred[idx])[-3:][::-1]
    print(f"   Top 3: ", end="")
    for tidx in top3_idx:
        print(f"{EMOTION_MAP[tidx]} ({y_pred[idx][tidx]*100:.1f}%)  ", end="")
    print()

## 🔮 Section 21: Interactive Prediction Function

In [None]:
def predict_emotion(text: str, show_probabilities: bool = True):
    """
    Predict emotion for a given text.
    
    Args:
        text: Input text
        show_probabilities: Whether to show all class probabilities
    """
    # Preprocess
    cleaned_text = preprocessor.clean_text(text)
    
    # Convert to sequence
    sequence = embedding_handler.texts_to_sequences([cleaned_text])
    
    # Predict
    pred = model.predict(sequence, verbose=0)[0]
    pred_label = np.argmax(pred)
    
    print(f"\nInput: {text}")
    print(f"Cleaned: {cleaned_text}")
    print(f"\n🎯 Predicted Emotion: {EMOTION_MAP[pred_label]} (confidence: {pred[pred_label]*100:.2f}%)")
    
    if show_probabilities:
        print("\n📊 All probabilities:")
        for emotion, prob in zip(EMOTION_LABELS, pred):
            bar = "█" * int(prob * 50)
            print(f"  {emotion:12s}: {bar} {prob*100:5.2f}%")

print("✅ Prediction function created!")
print("\nUsage: predict_emotion('Your text here')")

## 🧪 Section 22: Test the Prediction Function

In [None]:
print("="*80)
print("TESTING PREDICTION FUNCTION")
print("="*80)

# Test texts
test_texts = [
    "I am so happy and excited about this!",
    "This is really frustrating and makes me angry",
    "I miss you so much my love",
    "I am terrified of what might happen",
    "Oh wow I did not expect that at all!",
    "I feel so sad and depressed today"
]

for test_text in test_texts:
    predict_emotion(test_text)
    print("-" * 80)

## 💾 Section 23: Save Results Summary

In [None]:
print("="*80)
print("FINAL SUMMARY")
print("="*80)

# Create summary
summary = {
    'experiment_name': config.experiment_name,
    'model_type': config.model_type,
    'embedding_type': config.embedding_type,
    'embedding_dim': config.embedding_dim,
    'rnn_units': config.rnn_units,
    'num_layers': config.num_layers,
    'dropout': config.dropout,
    'final_val_accuracy': float(val_accuracy),
    'final_val_loss': float(val_loss),
    'best_val_accuracy': float(max(history.history['val_accuracy'])),
    'total_epochs_trained': len(history.history['loss']),
    'training_samples': len(train_df),
    'validation_samples': len(val_df),
    'embedding_coverage': float(stats['coverage_percent']),  # cast so NumPy scalars serialize to JSON
    'oov_rate_train': float(train_oov),
    'oov_rate_val': float(val_oov)
}

# Print summary
print("\n📋 Experiment Summary:")
print("="*80)
for key, value in summary.items():
    print(f"{key:25s}: {value}")

# Save summary
summary_file = os.path.join(config.result_dir, f'{config.experiment_name}_summary.json')
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"\n✅ Summary saved to: {summary_file}")
print(f"✅ Model saved to: {checkpoint_path}")
print(f"✅ Training log saved to: {csv_path}")

print("\n" + "="*80)
print("🎉 PIPELINE COMPLETE! 🎉")
print("="*80)
print(f"\n🎯 Final Validation Accuracy: {val_accuracy*100:.2f}%")
print(f"🏆 Best Validation Accuracy: {max(history.history['val_accuracy'])*100:.2f}%")

## 🎓 Section 24: Quick Experimentation Guide

### To Run Different Experiments:

1. **Change Model Type**:
   ```python
   config.model_type = 'gru'  # or 'lstm', 'bilstm'
   ```

2. **Change Embedding**:
   ```python
   config.embedding_type = 'word2vec'  # or 'glove'
   ```

3. **Adjust Model Size**:
   ```python
   config.rnn_units = 256
   config.num_layers = 2
   ```

4. **Change Training Parameters**:
   ```python
   config.batch_size = 64
   config.learning_rate = 0.0005
   config.epochs = 100
   ```

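5. **Then rerun** from the model creation cell (the one just before Section 14) onwards. If you changed `config.embedding_type`, rerun from the embedding step instead so that `embedding_matrix` is rebuilt.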
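
One way to keep such changes explicit is to derive each variant from a base configuration rather than mutating `config` in place. Below is a minimal sketch, assuming the notebook's configuration object is a dataclass; `DemoConfig` is a hypothetical stand-in with only a few of the fields used above, and its default values are illustrative:

```python
from dataclasses import dataclass, replace, asdict

@dataclass
class DemoConfig:
    # Hypothetical stand-in: field names mirror those used above, defaults are illustrative
    experiment_name: str = "baseline"
    model_type: str = "lstm"
    rnn_units: int = 128
    num_layers: int = 1
    batch_size: int = 32
    learning_rate: float = 0.001

base = DemoConfig()

# replace() returns a new instance and leaves `base` untouched
variant = replace(
    base,
    experiment_name="gru_256x2",
    model_type="gru",
    rnn_units=256,
    num_layers=2,
)

# Show only the fields that differ from the base configuration
base_fields = asdict(base)
changed = {k: v for k, v in asdict(variant).items() if base_fields[k] != v}
print(changed)
```

Giving each variant its own `experiment_name` also keeps the checkpoint, log, and summary files of different runs from overwriting each other, since all three file names are built from it.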

### Files Created:
- `saved_models/{experiment_name}_best.keras` - Best model
- `logs/{experiment_name}_training.csv` - Training log
- `results/{experiment_name}_results.json` - Complete results
- `results/{experiment_name}_summary.json` - Quick summary
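
Because each run writes its own `_summary.json` file, runs can be compared by loading those files and ranking them. This is a minimal sketch using only the standard library; the `collect_summaries` helper is illustrative, and the demo writes two made-up summaries to a temporary directory so that it runs on its own (the accuracy values are placeholders, not results from this notebook):

```python
import glob
import json
import os
import tempfile

def collect_summaries(result_dir):
    """Load every *_summary.json in result_dir, best validation accuracy first."""
    rows = []
    for path in sorted(glob.glob(os.path.join(result_dir, "*_summary.json"))):
        with open(path) as f:
            rows.append(json.load(f))
    return sorted(rows, key=lambda r: r["best_val_accuracy"], reverse=True)

# Demo with placeholder values (not real results)
with tempfile.TemporaryDirectory() as demo_dir:
    for name, acc in [("run_a", 0.50), ("run_b", 0.75)]:
        demo_path = os.path.join(demo_dir, f"{name}_summary.json")
        with open(demo_path, "w") as f:
            json.dump(
                {"experiment_name": name, "model_type": "lstm", "best_val_accuracy": acc},
                f,
            )
    ranked = collect_summaries(demo_dir)

for row in ranked:
    print(f"{row['experiment_name']:12s} {row['model_type']:8s} {row['best_val_accuracy']:.4f}")
```

With real runs, `collect_summaries(config.result_dir)` would read the files written in Section 23.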

### Next Steps:
1. Try different configurations above
2. Compare results between experiments
3. Test with your own texts using `predict_emotion()`
4. Deploy the best model!

**This notebook is completely self-contained - all classes and functions are included!** 🚀