# Sentiment Analysis for Product Reviews - Google Colab

**Course:** Natural Language Processing

**Objective:** Compare different sentiment classification approaches (SVM+BoW, SVM+Embeddings, BERT) using rigorous statistical validation.

---

## 🚀 Quick Start

This notebook is designed to run on Google Colab with GPU acceleration.

**Before running:**
1. Go to `Runtime` → `Change runtime type` → Select `GPU`
2. Run all cells in order

## Section 1: Setup and Installation

In [None]:
# Check GPU availability
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ No GPU detected. Go to Runtime → Change runtime type → GPU")

In [None]:
# Clone the repository (if not already cloned)
import os
if not os.path.exists('PNL_01'):
    !git clone https://github.com/YOUR_USERNAME/PNL_01.git
    %cd PNL_01
else:
    %cd PNL_01
    !git pull

print("✓ Repository ready")

In [None]:
# Install dependencies
!pip install -q transformers torch scikit-learn pandas numpy matplotlib seaborn gensim nltk scipy tqdm

# Download NLTK data
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer tables used by recent NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

print("✓ All dependencies installed")

## Section 2: Imports and Configuration

In [None]:
# Standard libraries
import sys
import warnings
import logging
from pathlib import Path

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Add src to path
sys.path.append('.')

# Project modules
from src.config import ExperimentConfig
from src.data_loader import DataLoader
from src.preprocessor import Preprocessor
from src.vectorizers import BoWVectorizer
from src.embedding_encoder import EmbeddingEncoder
from src.classifiers import SVMClassifier
from src.bert_classifier import BERTClassifier
from src.evaluator import Evaluator
from src.visualizer import Visualizer

# Configure warnings and logging
warnings.filterwarnings('ignore')
logging.basicConfig(level=logging.WARNING)

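# Set random seeds for reproducibility (Python, NumPy, and PyTorch)
import random
import torch

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)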

# Configure matplotlib
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("✓ All imports successful")
print(f"Random seed: {RANDOM_SEED}")

### Experiment Configuration

In [None]:
# Create experiment configuration
config = ExperimentConfig(
    dataset_name='amazon_reviews',
    num_simulations=10,
    bert_batch_size=32,  # Optimized for Colab GPU
    bert_epochs=10,
)

print("Experiment Configuration:")
print(config)

## Section 3: Data Loading and Exploration

In [None]:
# Load data
data_loader = DataLoader(
    dataset_name='amazon_reviews',
    random_state=RANDOM_SEED
)
train_df, val_df, test_df = data_loader.load()

print(f"Dataset loaded!")
print(f"  Train: {len(train_df)}")
print(f"  Val: {len(val_df)}")
print(f"  Test: {len(test_df)}")

In [None]:
# Class distribution
distribution = data_loader.get_class_distribution()
print("\nClass Distribution:")
for split, counts in distribution.items():
    total = counts['negative'] + counts['positive']
    print(f"  {split}: Neg={counts['negative']} ({counts['negative']/total*100:.1f}%), "
          f"Pos={counts['positive']} ({counts['positive']/total*100:.1f}%)")

In [None]:
# Sample reviews
print("Sample Reviews:")
for i, row in train_df.head(3).iterrows():
    label = "POSITIVE" if row['label'] == 1 else "NEGATIVE"
    print(f"\n[{label}] {row['text'][:150]}...")

## Section 4: Text Preprocessing

In [None]:
# Create and fit preprocessor
preprocessor = Preprocessor(language='english', remove_stopwords=True)
train_texts = train_df['text'].tolist()
preprocessor.fit(train_texts)

print(f"Preprocessor fitted")
print(f"Vocabulary size: {preprocessor.get_vocabulary_size()}")
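
The internals of `Preprocessor` are not shown in this notebook. As a rough illustration of what a step configured with `language='english'` and `remove_stopwords=True` typically does, here is a minimal sketch. The function `simple_preprocess` and the tiny `STOPWORDS` set are hypothetical; the real class presumably uses the NLTK resources downloaded in Section 1.

```python
import re

# Tiny illustrative stopword list (assumption). The project's Preprocessor
# presumably uses NLTK's full English stopword list instead.
STOPWORDS = {"a", "an", "the", "is", "it", "this", "and", "of", "to", "i"}


def simple_preprocess(text, remove_stopwords=True):
    """Lowercase the text, keep alphabetic tokens, optionally drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


print(simple_preprocess("This is the BEST phone I have owned!"))
```

The sketch has no `fit` step. In the notebook, `fit` is called on the training texts only, which keeps any vocabulary statistics from leaking in from the test split.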

## Section 5: SVM + Bag of Words

In [None]:
import time

# Preprocess
train_texts_processed = preprocessor.transform(train_texts)
test_texts_processed = preprocessor.transform(test_df['text'].tolist())

# Vectorize
vectorizer = BoWVectorizer(max_features=5000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts_processed)
X_test = vectorizer.transform(test_texts_processed)

# Train
start = time.time()
classifier_bow = SVMClassifier(kernel='linear', C=1.0)
classifier_bow.fit(X_train, train_df['label'].values)
train_time = time.time() - start

# Predict
start = time.time()
preds_bow = classifier_bow.predict(X_test)
infer_time = time.time() - start

# Evaluate
evaluator = Evaluator()
metrics = evaluator.evaluate(test_df['label'].values, preds_bow, 'svm_bow')

print(f"\nSVM + BoW Results:")
print(f"  Accuracy: {metrics['accuracy']:.4f}")
print(f"  F1-Score: {metrics['f1_macro']:.4f}")
print(f"  Training: {train_time:.2f}s")
print(f"  Inference: {infer_time:.2f}s")
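
To make the two vectorizer parameters concrete, the following pure-Python sketch shows what `ngram_range=(1, 2)` and `max_features` usually mean in a bag-of-words model: count unigrams and bigrams, then keep only the most frequent terms. It is an illustration, not the implementation of `BoWVectorizer` (which is not shown here and may wrap a library vectorizer). The helper names `ngrams`, `fit_vocabulary`, and `transform` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams (joined by spaces) for n in the inclusive range."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def fit_vocabulary(docs, max_features=5000, ngram_range=(1, 2)):
    """Map the max_features most frequent n-grams to column indices."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.split(), ngram_range))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: idx for idx, (term, _) in enumerate(ranked[:max_features])}


def transform(docs, vocab, ngram_range=(1, 2)):
    """Turn each document into a list of term counts over the vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term in ngrams(doc.split(), ngram_range):
            if term in vocab:
                row[vocab[term]] += 1
        rows.append(row)
    return rows


docs = ["great phone", "bad phone", "great great battery"]
vocab = fit_vocabulary(docs, max_features=4)
X = transform(docs, vocab)
print(vocab)
print(X)
```

As in the notebook, the vocabulary is fitted on training documents only, and terms seen only at test time are ignored by `transform`.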

## Section 6: SVM + Embeddings

In [None]:
# Encode with embeddings
encoder = EmbeddingEncoder(model_name='glove-wiki-gigaword-100')
X_train_emb = encoder.encode_batch(train_texts_processed)
X_test_emb = encoder.encode_batch(test_texts_processed)

# Train
start = time.time()
classifier_emb = SVMClassifier(kernel='rbf', C=1.0, gamma='scale')
classifier_emb.fit(X_train_emb, train_df['label'].values)
train_time = time.time() - start

# Predict
start = time.time()
preds_emb = classifier_emb.predict(X_test_emb)
infer_time = time.time() - start

# Evaluate
metrics = evaluator.evaluate(test_df['label'].values, preds_emb, 'svm_embeddings')

print(f"\nSVM + Embeddings Results:")
print(f"  Accuracy: {metrics['accuracy']:.4f}")
print(f"  F1-Score: {metrics['f1_macro']:.4f}")
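
A common way to turn word embeddings into one fixed-length vector per review is to average the vectors of the words it contains. Whether `EmbeddingEncoder.encode_batch` does exactly this is an assumption, since its source is not shown. The sketch below uses hypothetical 2-dimensional toy vectors in place of the 100-dimensional GloVe vectors.

```python
def encode(text, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none match."""
    hits = [vectors[t] for t in text.split() if t in vectors]
    if not hits:
        return [0.0] * dim
    return [sum(v[i] for v in hits) / len(hits) for i in range(dim)]


# Toy vectors for illustration only (not real GloVe values).
toy_vectors = {
    "great": [1.0, 0.0],
    "phone": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}

print(encode("great phone", toy_vectors, dim=2))
print(encode("unknown words", toy_vectors, dim=2))
```

Averaging discards word order, which is one reason a dense-embedding SVM does not automatically beat a bag-of-words model with bigrams.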

## Section 7: BERT Classifier

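**Note:** Training takes roughly 5-10 minutes on a Colab GPU. The checkpoint used here is `distilbert-base-uncased`, a distilled variant of BERT, so "BERT" in the rest of this notebook refers to that model.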

In [None]:
# Train BERT
print("Training DistilBERT (batch size 32, up to 10 epochs, patience 3)...")
print("Expected time: ~5-10 minutes on Colab GPU\n")

classifier_bert = BERTClassifier(
    model_name='distilbert-base-uncased',
    batch_size=32,
    num_epochs=10,
    patience=3
)

start = time.time()
classifier_bert.fit(
    train_df['text'].tolist(), train_df['label'].tolist(),
    val_df['text'].tolist(), val_df['label'].tolist()
)
train_time = time.time() - start

# Predict
start = time.time()
preds_bert = classifier_bert.predict(test_df['text'].tolist())
infer_time = time.time() - start

# Evaluate
metrics = evaluator.evaluate(test_df['label'].values, preds_bert, 'bert')

print(f"\nBERT Results:")
print(f"  Accuracy: {metrics['accuracy']:.4f}")
print(f"  F1-Score: {metrics['f1_macro']:.4f}")
print(f"  Training time: {train_time/60:.1f} minutes")
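
The `patience=3` argument suggests early stopping on the validation split. The sketch below shows the usual logic, assuming that patience counts consecutive epochs without an improvement in validation loss. `BERTClassifier` may monitor a different quantity, and `train_with_early_stopping` is a hypothetical helper that takes precomputed losses instead of training a model.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return (epochs_run, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses for a 10-epoch budget.
losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39, 0.38, 0.37, 0.36]
print(train_with_early_stopping(losses, patience=3))
```

In this example training stops after epoch 6 and never reaches the lower losses that appear later, which is the trade-off early stopping makes in exchange for shorter runs.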

## Section 8: Comparison and Visualizations

In [None]:
# Get comparison table
comparison = evaluator.get_comparison_table()
print("\nModel Comparison:")
print(comparison)
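
The `f1_macro` values in the table are macro-averaged F1 scores: F1 is computed per class and the class scores are averaged with equal weight. The sketch below computes it by hand on hypothetical labels. `macro_f1` is an illustrative helper, not part of `Evaluator`.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced example: four positives, two negatives.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.4f}  f1_macro={macro_f1(y_true, y_pred):.4f}")
```

Here accuracy is 5/6 while macro F1 is 7/9, because the single error falls on the minority class and macro averaging gives that class the same weight as the majority.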

In [None]:
# Visualize results
viz = Visualizer()

# Metrics comparison
viz.plot_metrics_comparison(
    evaluator.results,
    metrics=['accuracy', 'f1_macro']
)
plt.show()

# Confusion matrices
for model_name, cm in evaluator.confusion_matrices.items():
    viz.plot_confusion_matrix(cm, model_name)
    plt.show()

## Section 9: Statistical Validation (Optional)

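**Note:** This section runs multiple simulations per model for statistical validation. It takes ~2-3 hours on a Colab GPU.

Skip this section if you only want quick results. Section 10 reads the CSV files under `results/simulations`, so it needs either the output of this section or results saved from an earlier run.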

In [None]:
# Import simulation runner
from src.simulation_runner import SimulationRunner
from src.statistical_validator import StatisticalValidator

# Configure for fewer simulations on Colab (to save time)
config_sim = ExperimentConfig(
    dataset_name='amazon_reviews',
    num_simulations=10  # Use 10 instead of 30 for faster results
)

print("Running 10 simulations per model...")
print("Expected time: ~2-3 hours on Colab GPU (BERT alone needs ~5-10 minutes per run)\n")

In [None]:
# Run BERT simulations
runner = SimulationRunner(config_sim, output_dir='results/simulations')
bert_results = runner.run_simulations(
    model_names=['bert'],
    base_seed=42
)

print("\n✓ BERT simulations complete!")

In [None]:
# Run SVM simulations (faster)
svm_results = runner.run_simulations(
    model_names=['svm_bow', 'svm_embeddings'],
    base_seed=42
)

print("\n✓ All simulations complete!")
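
Both calls pass the same `base_seed`, which matters if the significance test in Section 10 is a paired test: run *i* of one model should then share its seed (and so its data split) with run *i* of the others. A minimal sketch of such a scheme follows. How `SimulationRunner` actually derives seeds is not shown, so `simulation_seeds` and the `base_seed + i` rule are assumptions.

```python
def simulation_seeds(base_seed, num_simulations):
    """One seed per run, derived deterministically from base_seed (assumed scheme)."""
    return [base_seed + i for i in range(num_simulations)]


bert_seeds = simulation_seeds(42, 10)
svm_seeds = simulation_seeds(42, 10)

# Identical seed lists mean run i is comparable across models.
print(bert_seeds == svm_seeds, bert_seeds[:3])
```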

## Section 10: Statistical Analysis

In [None]:
# Load simulation results
bert_df = pd.read_csv('results/simulations/bert_simulations.csv')
svm_bow_df = pd.read_csv('results/simulations/svm_bow_simulations.csv')
svm_emb_df = pd.read_csv('results/simulations/svm_embeddings_simulations.csv')

print("Simulation results loaded!")
print(f"  BERT: {len(bert_df)} simulations")
print(f"  SVM+BoW: {len(svm_bow_df)} simulations")
print(f"  SVM+Embeddings: {len(svm_emb_df)} simulations")

In [None]:
# Summary statistics
print("\n" + "="*80)
print("SUMMARY STATISTICS (Mean ± Std)")
print("="*80)

for name, df in [('BERT', bert_df), ('SVM+BoW', svm_bow_df), ('SVM+Embeddings', svm_emb_df)]:
    print(f"\n{name}:")
    print(f"  Accuracy:  {df['accuracy'].mean():.4f} ± {df['accuracy'].std():.4f}")
    print(f"  Precision: {df['precision_macro'].mean():.4f} ± {df['precision_macro'].std():.4f}")
    print(f"  Recall:    {df['recall_macro'].mean():.4f} ± {df['recall_macro'].std():.4f}")
    print(f"  F1-Score:  {df['f1_macro'].mean():.4f} ± {df['f1_macro'].std():.4f}")

In [None]:
# 95% Confidence Intervals
from scipy import stats

print("\n" + "="*80)
print("95% CONFIDENCE INTERVALS")
print("="*80)

for name, df in [('BERT', bert_df), ('SVM+BoW', svm_bow_df), ('SVM+Embeddings', svm_emb_df)]:
    print(f"\n{name}:")
    for metric in ['accuracy', 'f1_macro']:
        values = df[metric].values
        mean = np.mean(values)
        std_err = stats.sem(values)
        ci = std_err * stats.t.ppf(0.975, len(values) - 1)
        print(f"  {metric}: {mean:.4f} [{mean-ci:.4f}, {mean+ci:.4f}]")
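
For reference, the interval computed in this cell is the standard two-sided Student-t interval for a mean. Here $\bar{x}$ is the mean of the $n$ simulation scores and $s$ is their sample standard deviation (`stats.sem` uses `ddof=1` by default):

$$
\mathrm{CI}_{95\%} = \bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
$$

With $n = 10$ simulations, $t_{0.975,\,9} \approx 2.262$, so the half-width is about $0.715\,s$. The interval assumes the per-run scores are roughly normally distributed, which is hard to check with only 10 runs.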

In [None]:
# Statistical significance tests
validator = StatisticalValidator(alpha=0.05)

print("\n" + "="*80)
print("STATISTICAL SIGNIFICANCE TESTS (Wilcoxon)")
print("="*80)

# Pairwise comparisons on macro F1
comparisons = [
    ('BERT', bert_df, 'SVM+BoW', svm_bow_df),
    ('BERT', bert_df, 'SVM+Embeddings', svm_emb_df),
    ('SVM+BoW', svm_bow_df, 'SVM+Embeddings', svm_emb_df),
]

for name_a, df_a, name_b, df_b in comparisons:
    result = validator.wilcoxon_test(
        df_a['f1_macro'].values,
        df_b['f1_macro'].values
    )
    print(f"\n{name_a} vs {name_b}:")
    print(f"  p-value: {result['p_value']:.4f}")
    print(f"  Significant: {'Yes ✓' if result['significant'] else 'No ✗'}")
    print(f"  Mean difference: {result['mean_diff']:.4f}")
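
The source of `StatisticalValidator` is not part of this notebook. The sketch below shows one plausible shape for `wilcoxon_test`, assuming it wraps SciPy's paired Wilcoxon signed-rank test and returns the three keys printed above. The standalone function here is hypothetical and the scores are made up for illustration.

```python
import numpy as np
from scipy import stats


def wilcoxon_test(a, b, alpha=0.05):
    """Paired Wilcoxon signed-rank test on two equal-length score arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("Paired test needs samples of equal length")
    result = stats.wilcoxon(a, b)  # two-sided by default
    return {
        "p_value": float(result.pvalue),
        "significant": bool(result.pvalue < alpha),
        "mean_diff": float(np.mean(a - b)),
    }


# Hypothetical macro-F1 scores from 10 paired runs.
model_a = [0.900 + 0.001 * i for i in range(10)]
model_b = [score - 0.01 * (i + 1) for i, score in enumerate(model_a)]

outcome = wilcoxon_test(model_a, model_b)
print(outcome)
```

With 10 paired runs the smallest two-sided p-value an exact signed-rank test can produce is 2/1024 (about 0.002), reached when one model wins every run. The number of simulations therefore limits how small the reported p-values can get.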

In [None]:
# Visualize distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy boxplot
data_acc = [
    bert_df['accuracy'].values,
    svm_bow_df['accuracy'].values,
    svm_emb_df['accuracy'].values
]
axes[0].boxplot(data_acc)
axes[0].set_xticklabels(['BERT', 'SVM+BoW', 'SVM+Emb'])  # avoids boxplot's `labels` kwarg, deprecated in Matplotlib 3.9
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy Distribution')
axes[0].grid(True, alpha=0.3)

# F1-Score boxplot
data_f1 = [
    bert_df['f1_macro'].values,
    svm_bow_df['f1_macro'].values,
    svm_emb_df['f1_macro'].values
]
axes[1].boxplot(data_f1)
axes[1].set_xticklabels(['BERT', 'SVM+BoW', 'SVM+Emb'])  # avoids boxplot's `labels` kwarg, deprecated in Matplotlib 3.9
axes[1].set_ylabel('F1-Score (Macro)')
axes[1].set_title('F1-Score Distribution')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Section 11: Conclusions

### Key Findings:

1. **Model Performance**: [Analyze the results above]
2. **Statistical Significance**: [Interpret the p-values]
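3. **Training Configuration**: DistilBERT (`distilbert-base-uncased`) trained for up to 10 epochs with batch size 32 and early stopping (patience 3)
4. **GPU Acceleration**: On a Colab GPU, each DistilBERT training run is expected to take roughly 5-10 minutes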

### Recommendations:

- [Add your recommendations based on the results]
- [Consider trade-offs between accuracy and computational cost]
- [Suggest best model for production use]