# Comprehensive Sentiment Analysis Project

## A Deep Learning Approach to Social Media Sentiment Classification

**Authors**: Discovery Project Team  
**Date**: January 2025  
**Objective**: Develop and optimize neural network architectures for sentiment analysis using multiple deep learning approaches

---

## Abstract

This project implements and compares multiple neural network architectures for sentiment analysis of social media data from the Exorde dataset. We progress through 12 phases, from basic model implementation to advanced optimization techniques, drawing on foundational literature in natural language processing and deep learning. Our implementation covers RNN, LSTM, GRU, and Transformer architectures, with enhancements that include attention mechanisms, bidirectional processing, and pre-trained embeddings.

The project follows a methodical approach to model development, moving from baseline implementations to optimized models. We improve performance through systematic hyperparameter tuning, architectural enhancements, and advanced training techniques, using F1 score on three-class (positive, negative, neutral) sentiment classification as the primary metric.

---

## Table of Contents

1. [Setup & Prerequisites](#1-setup--prerequisites)
2. [Core Utilities & Model Definitions](#2-core-utilities--model-definitions)
3. [Data Acquisition](#3-data-acquisition)
4. [Model Visualization](#4-model-visualization)
5. [Enhanced Architecture Comparison](#5-enhanced-architecture-comparison)
6. [Hyperparameter Tuning](#6-hyperparameter-tuning)
7. [Foundational Improvements (Baseline V2)](#7-foundational-improvements-baseline-v2)
8. [Advanced Training Demonstration](#8-advanced-training-demonstration)
9. [Final Hyperparameter Optimization](#9-final-hyperparameter-optimization)
10. [Comprehensive Error Analysis](#10-comprehensive-error-analysis)
11. [Final Model Training](#11-final-model-training)
12. [Final Report Generation](#12-final-report-generation)

---

## Literature Review

Our approach is grounded in foundational research in natural language processing and deep learning. This section reviews five key papers that inform our architectural choices and optimization strategies.

### 1. "Attention Is All You Need" (Vaswani et al., 2017)

**Citation**: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).

**Key Contributions**:
- Introduced the Transformer architecture based solely on self-attention mechanisms
- Outperformed recurrent and convolutional sequence models on machine translation while allowing training to be parallelized across sequence positions
- Established the foundation for modern language models (BERT, GPT, etc.)

**Application to Our Project**:
This paper provides the theoretical foundation for our Transformer implementation. We leverage the self-attention mechanism to capture long-range dependencies in social media text, which often contains complex linguistic structures. Our implementation includes positional encodings and multi-head attention as described in the original paper, adapted for sentiment classification tasks.
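
The positional encodings mentioned above can be sketched in plain Python. This is a minimal illustration of the sinusoidal scheme from Vaswani et al., not the code in our `TransformerModel`; the function name and the `max_len`/`d_model` values are assumptions chosen for the example.

```python
import math
from typing import List


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> List[List[float]]:
    """Return a max_len x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so that sin/cos channels pair up")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, d_model, 2):
            # Each sin/cos pair shares one wavelength; wavelengths grow
            # geometrically with the channel index.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even channel
            row.append(math.cos(angle))  # odd channel
        table.append(row)
    return table


# Illustrative sizes only: 50 positions, 64 channels.
pe = sinusoidal_positional_encoding(max_len=50, d_model=64)
print(len(pe), len(pe[0]))  # prints: 50 64
```

Each row is added to the token embedding at the same position, which gives the otherwise order-agnostic attention layers access to word order.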

### 2. "Bidirectional LSTM-CRF Models for Sequence Tagging" (Huang et al., 2015)

**Citation**: Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

**Key Contributions**:
- Demonstrated the effectiveness of bidirectional processing for sequence understanding
- Showed that backward context is crucial for understanding linguistic meaning
- Established bidirectional LSTMs as a standard for sequence processing

**Application to Our Project**:
This research motivates our bidirectional variants of the RNN, LSTM, and GRU models. For sentiment analysis, both preceding and following context matter. For example, in "The movie was not bad at all," the word "bad" is reversed by the preceding "not" and reinforced by the following "at all," so the sentiment is only clear from the complete phrase. Our bidirectional models read the sequence in both directions and combine the two views at every position.
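
To make the two directions concrete, here is a toy sketch with a scalar recurrence standing in for an RNN cell. The weights `w_x` and `w_h` and the input values are arbitrary assumptions; real models use learned weight matrices and vector-valued states.

```python
import math
from typing import List, Tuple


def scan(xs: List[float], w_x: float = 0.8, w_h: float = 0.5) -> List[float]:
    """Toy scalar recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional_states(xs: List[float]) -> List[Tuple[float, float]]:
    """Pair each position's left-to-right state with its right-to-left state."""
    forward = scan(xs)
    backward = scan(xs[::-1])[::-1]  # run on the reversed input, then realign
    return list(zip(forward, backward))


# Two inputs that differ only in the final token.
a = bidirectional_states([0.2, -1.0, 0.7, -0.9])
b = bidirectional_states([0.2, -1.0, 0.7, 0.9])

# The forward state at position 0 cannot see the change; the backward state can.
print(a[0][0] == b[0][0], a[0][1] == b[0][1])  # prints: True False
```

The forward half of each pair summarizes everything up to that position and the backward half summarizes everything after it, which is the information a unidirectional model lacks.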

### 3. "A Structured Self-Attentive Sentence Embedding" (Lin et al., 2017)

**Citation**: Lin, Z., Feng, M., Santos, C. N. D., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

**Key Contributions**:
- Introduced self-attention for sentence-level representations
- Provided interpretable attention weights showing model focus
- Demonstrated superior performance over simple pooling strategies

**Application to Our Project**:
This paper directly informs our attention-enhanced RNN, LSTM, and GRU models. Instead of using only the final hidden state, we implement self-attention mechanisms that weight the importance of each word in the sequence. This approach is particularly valuable for sentiment analysis where specific words or phrases carry disproportionate emotional weight.
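
A minimal single-hop version of this idea, written in plain Python for readability: each hidden state is scored with `w2 . tanh(W1 h_t)`, the scores are normalized with a softmax, and the sentence vector is the weighted sum. The function names, weight values, and hidden states below are made up for illustration and are not taken from our attention models.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden: List[List[float]], w1: List[List[float]], w2: List[float]
) -> Tuple[List[float], List[float]]:
    """Score each time step with w2 . tanh(W1 h_t); return (context, weights)."""
    scores = []
    for h in hidden:
        projected = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in w1]
        scores.append(sum(w * p for w, p in zip(w2, projected)))
    weights = softmax(scores)
    dim = len(hidden[0])
    context = [sum(a * h[j] for a, h in zip(weights, hidden)) for j in range(dim)]
    return context, weights


# Three time steps with 2-dimensional hidden states; all numbers are made up.
hidden = [[0.1, 0.0], [0.9, -0.4], [0.2, 0.1]]
w1 = [[1.0, -1.0], [0.5, 0.5]]  # shape: attention_dim x hidden_dim
w2 = [2.0, 0.0]                 # shape: attention_dim
context, weights = attention_pool(hidden, w1, w2)
print([round(a, 3) for a in weights])
```

The returned weights are what makes the approach interpretable: plotting them over the input tokens shows which words the model relied on.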

### 4. "GloVe: Global Vectors for Word Representation" (Pennington et al., 2014)

**Citation**: Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).

**Key Contributions**:
- Introduced a log-bilinear regression model trained on global word-word co-occurrence counts
- Combined the strengths of global matrix factorization and local context-window methods
- Demonstrated strong performance on word analogy and similarity tasks

**Application to Our Project**:
This research supports our use of pre-trained embeddings to initialize our models. GloVe embeddings provide rich semantic representations learned from large corpora, giving our models a significant head start compared to random initialization. This is especially important for sentiment analysis where semantic relationships between words are crucial for understanding emotional nuances.
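
As a sketch of how such an initialization might look, the snippet below parses the GloVe text format (a token followed by space-separated floats on each line) and builds an embedding matrix aligned with a vocabulary. The helper names, the tiny in-memory "file", and the vocabulary are hypothetical; the sketch also assumes vocabulary indices are contiguous and start at 0.

```python
import io
import random
from typing import Dict, List, TextIO


def load_vectors(handle: TextIO) -> Dict[str, List[float]]:
    """Parse GloVe-style text: one token per line followed by its floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(
    vocab: Dict[str, int], vectors: Dict[str, List[float]], dim: int, seed: int = 42
) -> List[List[float]]:
    """Row i holds the pre-trained vector for token index i, else small noise."""
    rng = random.Random(seed)
    matrix = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in vocab]
    for token, index in vocab.items():
        if token in vectors:
            if len(vectors[token]) != dim:
                raise ValueError(f"vector for {token!r} does not have {dim} dims")
            matrix[index] = list(vectors[token])
    return matrix


# Made-up 3-dimensional vectors standing in for a real GloVe file.
fake_glove = io.StringIO("good 0.5 0.1 -0.2\nbad -0.6 0.0 0.3\n")
vocab = {"<pad>": 0, "good": 1, "bad": 2, "meh": 3}
matrix = build_embedding_matrix(vocab, load_vectors(fake_glove), dim=3)
print(matrix[1], matrix[2])  # prints: [0.5, 0.1, -0.2] [-0.6, 0.0, 0.3]
```

In a PyTorch model, a matrix like this could be converted to a tensor and passed to `torch.nn.Embedding.from_pretrained`; tokens missing from the pre-trained file keep their random rows and are learned during training.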

### 5. "Bag of Tricks for Efficient Text Classification" (Joulin et al., 2016)

**Citation**: Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

**Key Contributions**:
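- Introduced the fastText classifier for efficient text classification
- Demonstrated that a simple linear model is often on par with deep models in accuracy while training far faster
- Showed that bag-of-n-gram features, compressed with the hashing trick, capture useful local word-order information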

**Application to Our Project**:
While we focus on deep learning approaches, this paper provides important baseline insights. It reminds us that complex models must significantly outperform simpler alternatives to justify their computational cost. We use this perspective to validate that our neural networks provide meaningful improvements over traditional bag-of-words approaches.
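
For reference, the kind of feature such a simple baseline relies on can be sketched in a few lines: word n-grams mapped into a fixed number of buckets with a hash. The function names and bucket count are assumptions for illustration. `zlib.crc32` is used here only because, unlike Python's built-in `hash`, it gives the same value in every process; it is not the hash function fastText itself uses.

```python
import zlib
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hashed_features(text: str, num_buckets: int = 1 << 16, max_n: int = 2) -> Dict[int, int]:
    """Count unigrams up to max_n-grams, hashed into a fixed number of buckets."""
    tokens = text.lower().split()
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            bucket = zlib.crc32(gram.encode("utf-8")) % num_buckets
            counts[bucket] += 1
    return dict(counts)


print(ngrams("not bad at all".split(), 2))  # prints: ['not bad', 'bad at', 'at all']
features = hashed_features("The movie was not bad at all")
print(sum(features.values()))  # prints: 13 (7 unigrams + 6 bigrams)
```

A linear classifier over these sparse counts gives a cheap reference point: the bigram "not bad" is a feature in its own right, so even this baseline sees some negation.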

---

## 1. Setup & Prerequisites

This section imports the required libraries and configures the environment: random seeds, plotting defaults, and the compute device. The repository's own modules are then imported group by group.

### 1.1 Core Deep Learning and Data Science Libraries

In [None]:
# Core libraries for deep learning and data manipulation
# Initialize execution tracking to ensure proper sequential execution
_notebook_execution_state = {'cells_executed': set(), 'current_cell': 1}

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
import random
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
    torch.cuda.manual_seed_all(42)

# Configure matplotlib for better visualization
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

# Check device availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🚀 Using device: {device}")
print(f"📚 PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    print(f"🎮 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory // 1024**3} GB")

# Mark this cell as executed
_notebook_execution_state['cells_executed'].add(1)
print("\n✅ Cell 1 executed successfully - Core libraries imported")
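
The seeding steps in the cell above can be collected into one reusable helper. The sketch below is an assumption about how that might look, not code from the repository: `seed_everything` is a hypothetical name, and NumPy and PyTorch are seeded only if they are installed. The two cuDNN flags are standard PyTorch settings that trade some speed for repeatable kernel selection.

```python
import random


def seed_everything(seed: int = 42, deterministic: bool = True) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    if deterministic:
        # Prefer repeatable cuDNN kernels over the fastest ones.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)  # prints: True
```

Calling the helper at the start of every experiment, rather than once per notebook session, keeps runs comparable even when cells are re-executed out of order.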

### 1.2 Project-Specific Module Imports

Here we import all the custom modules from our repository. Each import represents a different aspect of our sentiment analysis pipeline, from data acquisition to model training and evaluation.

In [None]:
# Import all model architectures from our repository
# These represent the core neural network implementations
# Added error handling to ensure robust execution

# Check if previous cells were executed
if '_notebook_execution_state' not in globals() or 1 not in _notebook_execution_state['cells_executed']:
    print("⚠️  Warning: Please run Cell 1 (Core Libraries) first!")
    print("💡 Tip: Execute cells in sequential order for best results.")

try:
    from models import (
        # Base model class
        BaseModel,
        
        # Original model architectures
        RNNModel, LSTMModel, GRUModel, TransformerModel,
        
        # Enhanced RNN variants
        DeepRNNModel, BidirectionalRNNModel, RNNWithAttentionModel,
        
        # Enhanced LSTM variants
        StackedLSTMModel, BidirectionalLSTMModel, LSTMWithAttentionModel, LSTMWithPretrainedEmbeddingsModel,
        
        # Enhanced GRU variants
        StackedGRUModel, BidirectionalGRUModel, GRUWithAttentionModel, GRUWithPretrainedEmbeddingsModel,
        
        # Enhanced Transformer variants
        LightweightTransformerModel, DeepTransformerModel, TransformerWithPoolingModel
    )
    
    # Verify models are properly imported by checking BaseModel
    if 'BaseModel' not in locals():
        raise ImportError("BaseModel not found in imports")
    
    # Count available model variants: every imported class derived from BaseModel,
    # directly or indirectly (snapshot globals() so it is not mutated mid-iteration)
    model_variants = [cls for cls in list(globals().values())
                     if isinstance(cls, type) and issubclass(cls, BaseModel)
                     and cls is not BaseModel]
    
    print("✅ Successfully imported all model architectures")
    print(f"📊 Total model variants available: {len(model_variants)}")
    
    # Mark this cell as executed
    _notebook_execution_state['cells_executed'].add(2)
    print("✅ Cell 2 executed successfully - Model architectures imported")
    
except ImportError as e:
    print(f"❌ Error importing models: {e}")
    print("⚠️  Please ensure all model files are available and PyTorch is installed")
    raise
except Exception as e:
    print(f"❌ Unexpected error during model import: {e}")
    raise
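
An alternative way to enumerate model variants is to walk the class hierarchy instead of scanning the notebook's global namespace. The sketch below uses hypothetical stand-in classes, since the real inheritance structure of `models.py` is not shown here.

```python
from typing import List, Type


def all_subclasses(base: Type) -> List[Type]:
    """Return every direct and indirect subclass of `base`, depth first."""
    found = []
    for sub in base.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


# Hypothetical stand-ins for the classes defined in models.py.
class ToyBase: ...
class ToyLSTM(ToyBase): ...
class ToyBiLSTM(ToyLSTM): ...
class ToyGRU(ToyBase): ...


names = [cls.__name__ for cls in all_subclasses(ToyBase)]
print(names)  # prints: ['ToyLSTM', 'ToyBiLSTM', 'ToyGRU']
```

This finds classes regardless of which names were imported into the notebook. With multiple inheritance the same class can appear more than once, so deduplicate the result if the hierarchy contains diamonds.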

In [None]:
# Import core training and evaluation utilities
# These modules handle the training loop, evaluation metrics, and data processing
# Added error handling to ensure robust execution

# Check execution order
if '_notebook_execution_state' not in globals() or 2 not in _notebook_execution_state['cells_executed']:
    print("⚠️  Warning: Please run Cell 2 (Model Architectures) first!")

try:
    from train import train_model, train_model_epochs
    print("  ✅ Training functions: train_model, train_model_epochs")
except ImportError as e:
    print(f"  ❌ Training import error: {e}")
    print("  ⚠️  Please ensure train.py is available")

try:
    from evaluate import evaluate_model, evaluate_model_comprehensive
    print("  ✅ Evaluation functions: evaluate_model, evaluate_model_comprehensive")
except ImportError as e:
    print(f"  ❌ Evaluation import error: {e}")
    print("  ⚠️  Please ensure evaluate.py is available")

try:
    from utils import simple_tokenizer, tokenize_texts
    print("  ✅ Utility functions: simple_tokenizer, tokenize_texts")
except ImportError as e:
    print(f"  ❌ Utils import error: {e}")
    print("  ⚠️  Please ensure utils.py is available")

try:
    from getdata import download_exorde_sample
    print("  ✅ Data acquisition: download_exorde_sample")
except ImportError as e:
    print(f"  ❌ Data acquisition import error: {e}")
    print("  ⚠️  Please ensure getdata.py is available")

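# Report success only if every expected name was actually imported
_expected_utilities = ['train_model', 'train_model_epochs', 'evaluate_model',
                       'evaluate_model_comprehensive', 'simple_tokenizer',
                       'tokenize_texts', 'download_exorde_sample']
_missing_utilities = [name for name in _expected_utilities if name not in globals()]

if _missing_utilities:
    print(f"\n⚠️  Missing utilities: {_missing_utilities}")
else:
    print("\n✅ Core utilities imported successfully!")
    # Mark this cell as executed
    _notebook_execution_state['cells_executed'].add(3)
    print("✅ Cell 3 executed successfully - Core utilities imported")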

In [None]:
# Import advanced training and optimization modules
# These represent our enhanced training strategies and optimization techniques
import baseline_v2
import enhanced_training
import hyperparameter_tuning
import final_hyperparameter_optimization
import enhanced_compare_models
import experiment_tracker

print("✅ Successfully imported advanced training and optimization modules")

In [None]:
# Import analysis and visualization modules
# These modules provide comprehensive analysis and visualization capabilities
import error_analysis
import visualize_models
import demo_examples
import comprehensive_eval
import final_report_generator
import simplified_final_report

print("✅ Successfully imported analysis and visualization modules")

In [None]:
# Import additional utility and specialized modules
# These modules provide specific functionalities for embeddings, comparisons, and testing
import embedding_utils
import compare_models
import final_model_training
import validate_improvements
import test_improvements

print("✅ Successfully imported additional utility modules")
print("🎯 All repository modules successfully loaded and ready for use!")
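
The three cells above import their modules without error handling, so the first missing file stops the cell before the others are tried. One possible pattern, sketched here with a hypothetical helper `import_optional`, is to import by name with `importlib` and collect the failures. The module names in the demonstration are placeholders, not repository modules.

```python
import importlib
from typing import Dict, Iterable, Tuple


def import_optional(names: Iterable[str]) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Try to import each module; return (loaded modules, error messages)."""
    loaded, failed = {}, {}
    for name in names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            failed[name] = str(exc)
    return loaded, failed


# "json" is a stdlib module; the second name is deliberately nonexistent.
loaded, failed = import_optional(["json", "module_that_does_not_exist_xyz"])
print(sorted(loaded), sorted(failed))
# prints: ['json'] ['module_that_does_not_exist_xyz']
```

With this pattern a single run reports every missing module at once, and later cells can check `failed` before using an optional component.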

### 1.3 Environment Configuration and Global Settings

We establish global configuration parameters that will be used throughout our analysis. These settings ensure consistency across all experiments and provide a foundation for reproducible results.

In [None]:
# Global configuration parameters
# These settings control various aspects of our training and evaluation pipeline

CONFIG = {
    # Data settings
    'SAMPLE_SIZE': 10000,  # Number of samples to download from Exorde dataset
    'TEST_SIZE': 0.2,      # Proportion of data for testing
    'RANDOM_STATE': 42,    # Random seed for reproducibility
    
    # Model architecture settings
    'EMBED_DIM': 64,       # Embedding dimension
    'HIDDEN_DIM': 64,      # Hidden layer dimension
    'NUM_CLASSES': 3,      # Number of sentiment classes (Positive, Negative, Neutral)
    'NUM_HEADS': 4,        # Number of attention heads for Transformer
    'NUM_LAYERS': 2,       # Number of layers for stacked models
    
    # Training settings
    'BATCH_SIZE': 32,      # Batch size for training
    'LEARNING_RATE': 1e-3, # Initial learning rate
    'NUM_EPOCHS': 10,      # Number of training epochs for initial experiments
    'EXTENDED_EPOCHS': 50, # Number of epochs for extended training
    'GRADIENT_CLIP': 1.0,  # Gradient clipping value
    
    # Optimization settings
    'WEIGHT_DECAY': 1e-4,  # L2 regularization strength
    'DROPOUT_RATE': 0.3,   # Dropout probability
    'PATIENCE': 10,        # Early stopping patience
    
    # Evaluation settings
    'TARGET_F1': 0.75,     # Target F1 score for deployment readiness
    'TOP_K_MODELS': 3,     # Number of top models to analyze in detail
}

# Display configuration, grouped by an explicit category -> keys mapping
# (substring matching on key names would list TOP_K_MODELS under "Model" as well)
CONFIG_GROUPS = {
    'Data': ['SAMPLE_SIZE', 'TEST_SIZE', 'RANDOM_STATE'],
    'Model': ['EMBED_DIM', 'HIDDEN_DIM', 'NUM_CLASSES', 'NUM_HEADS', 'NUM_LAYERS'],
    'Training': ['BATCH_SIZE', 'LEARNING_RATE', 'NUM_EPOCHS', 'EXTENDED_EPOCHS', 'GRADIENT_CLIP'],
    'Optimization': ['WEIGHT_DECAY', 'DROPOUT_RATE', 'PATIENCE'],
    'Evaluation': ['TARGET_F1', 'TOP_K_MODELS'],
}

print("🔧 Global Configuration Settings:")
print("=" * 50)
for category, category_keys in CONFIG_GROUPS.items():
    print(f"\n{category} Settings:")
    for key in category_keys:
        print(f"  {key}: {CONFIG[key]}")

print("\n" + "=" * 50)

# Validate configuration completeness, filling in a default where one is defined
CONFIG_DEFAULTS = {'VOCAB_SIZE': 10000}  # Default vocabulary size
required_keys = ['EMBED_DIM', 'HIDDEN_DIM', 'NUM_CLASSES', 'VOCAB_SIZE']

for key in required_keys:
    if key not in CONFIG and key in CONFIG_DEFAULTS:
        CONFIG[key] = CONFIG_DEFAULTS[key]
        print(f"  ✅ Added default {key}: {CONFIG[key]}")

# Anything still missing has no default and must be set by hand
missing_keys = [key for key in required_keys if key not in CONFIG]
if missing_keys:
    print(f"⚠️  Warning: Missing required CONFIG keys: {missing_keys}")

# Validate that dependencies are available
dependencies_check = {
    'Models available': 'BaseModel' in globals(),
    'Training functions available': 'train_model' in globals(),
    'Evaluation functions available': 'evaluate_model' in globals(),
    'Utility functions available': 'simple_tokenizer' in globals()
}

print("\n🔍 Dependency Check:")
for check_name, check_result in dependencies_check.items():
    status = "✅" if check_result else "❌"
    print(f"  {status} {check_name}")

all_dependencies_ok = all(dependencies_check.values())
if all_dependencies_ok:
    print("\n✅ Configuration loaded successfully!")
    print("✅ All dependencies are available!")
    print("✅ Ready to proceed with model development!")
else:
    print("\n⚠️  Some dependencies are missing. Please run previous cells first.")
    print("💡 Tip: Run cells in order from the top of the notebook.")
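
Beyond checking that keys exist, it can help to check that their values are mutually consistent. The sketch below is one possible set of checks, not part of the repository: `validate_config` is a hypothetical helper, and the sample values are copied from `CONFIG` above. The divisibility rule reflects how multi-head attention splits the embedding across heads; the patience rule assumes a typical early-stopping loop that stops after `PATIENCE` consecutive epochs without improvement.

```python
from typing import Dict, List


def validate_config(cfg: Dict[str, float]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []
    if cfg["EMBED_DIM"] % cfg["NUM_HEADS"] != 0:
        problems.append("EMBED_DIM must be divisible by NUM_HEADS")
    if not 0.0 < cfg["TEST_SIZE"] < 1.0:
        problems.append("TEST_SIZE must lie strictly between 0 and 1")
    if not 0.0 <= cfg["DROPOUT_RATE"] < 1.0:
        problems.append("DROPOUT_RATE must lie in [0, 1)")
    if cfg["PATIENCE"] >= cfg["NUM_EPOCHS"]:
        problems.append("PATIENCE >= NUM_EPOCHS: early stopping cannot trigger in short runs")
    return problems


# Values copied from the CONFIG dictionary defined in this section.
sample_config = {
    "EMBED_DIM": 64, "NUM_HEADS": 4, "TEST_SIZE": 0.2,
    "DROPOUT_RATE": 0.3, "PATIENCE": 10, "NUM_EPOCHS": 10,
}
for problem in validate_config(sample_config):
    print("warning:", problem)
```

With the current settings only the patience check fires: a patience of 10 can take effect during the 50-epoch extended runs but not during the 10-epoch initial experiments.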