Absolutely! Let's rank the vectorizers **from fastest and smallest memory footprint → slower and heavier**, while also considering **ease of use with classic ML algorithms**.

---

## **1️⃣ Lightweight & Fast (Low Memory, Sparse Output)**

| Vectorizer            | Size / Memory | Speed     | Notes / Use Case                                                            |
| --------------------- | ------------- | --------- | --------------------------------------------------------------------------- |
| **CountVectorizer**   | Low           | Very Fast | Good for small datasets and simple ML models.                               |
| **TfidfVectorizer**   | Low-Medium    | Fast      | Adds weighting, slightly slower than CountVectorizer but still lightweight. |
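| **HashingVectorizer** | Very Low      | Very Fast | No `.fit()` needed, fixed-size vector → memory-efficient for huge datasets. |

✅ **Best for:** RandomForest, LogisticRegression, Naive Bayes, LinearSVC. (With `HashingVectorizer`, set `alternate_sign=False` before using `MultinomialNB`, which requires non-negative features.)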
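
To make the difference between raw counts and TF-IDF weighting concrete, here is a minimal pure-Python sketch. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which scikit-learn documents as its default (`smooth_idf=True`), but it skips L2 normalization and real tokenization, so treat it as an illustration rather than a stand-in for `TfidfVectorizer`. The toy corpus and the function names are made up for this example.

```python
import math
from collections import Counter

def count_vectors(docs):
    """Bag-of-words counts: one Counter per document (whitespace tokens)."""
    return [Counter(doc.lower().split()) for doc in docs]

def tfidf_vectors(docs):
    """Re-weight counts by smoothed inverse document frequency (no L2 norm)."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

# Toy corpus (hypothetical data)
docs = ["the movie was great", "the movie was awful", "the plot was great fun"]
weights = tfidf_vectors(docs)

# "the" occurs in every document, so it gets the minimum weight;
# "awful" occurs in only one document, so it is weighted higher.
print(weights[1]["the"], weights[1]["awful"])
```

Terms that appear everywhere carry little signal for a classifier, which is why TF-IDF often helps linear models slightly compared with raw counts, at the cost of one extra pass to compute document frequencies.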

---

## **2️⃣ Medium (Dense Embeddings, Pretrained)**

| Vectorizer                | Size / Memory | Speed  | Notes / Use Case                                                                                                    |
| ------------------------- | ------------- | ------ | ------------------------------------------------------------------------------------------------------------------- |
| **Word2Vec (pretrained)** | Medium        | Medium | Generates dense vectors (~100–300 dims per word). Needs preprocessing + averaging over words for sentence-level ML.   |
| **GloVe (pretrained)**    | Medium        | Medium | Similar to Word2Vec, pre-trained embeddings avoid training time.                                                    |
| **FastText**              | Medium        | Medium | Slightly bigger than Word2Vec due to subword info; still reasonable.                                                |

✅ **Best for:** Classic ML with dense vectors, semantic similarity, multi-label classification.
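
The "averaging over words" step mentioned in the table can be sketched as follows. This is a minimal illustration, assuming a plain dictionary of word vectors; the 3-dimensional `toy_embeddings` and the helper name `average_embedding` are hypothetical, and a real setup would look vectors up in a pretrained Word2Vec, GloVe, or FastText model instead.

```python
def average_embedding(tokens, embeddings, dim):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical 3-dimensional embeddings; real models use roughly 100-300 dimensions.
toy_embeddings = {
    "happy": [1.0, 0.0, 2.0],
    "dog":   [3.0, 2.0, 0.0],
}

# "my" is out of vocabulary and is simply skipped.
sentence_vec = average_embedding("my happy dog".split(), toy_embeddings, dim=3)
print(sentence_vec)
```

The result is one fixed-length dense vector per sentence, which is the shape scikit-learn classifiers expect. Word order is lost in the average, which is the main limitation compared with contextual models.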

---

## **3️⃣ Heavy / Slower (Transformer-Based Contextual Embeddings)**

| Vectorizer                               | Size / Memory        | Speed                  | Notes / Use Case                                                          |
| ---------------------------------------- | -------------------- | ---------------------- | ------------------------------------------------------------------------- |
| **BERT / SBERT / DistilBERT embeddings** | Large (hundreds of MB) | Slow (GPU recommended) | Contextual token embeddings, pooled into one vector per sentence. Best for SOTA NLP tasks. |
| **RoBERTa / Large Transformer models**   | Very Large (GBs)     | Very Slow              | High accuracy, very heavy. Usually overkill for small ML datasets.        |

✅ **Best for:** Semantic search, complex NLP classification, embeddings for downstream tasks.
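
For the semantic search use case, the sentence vectors are typically compared with cosine similarity. The sketch below shows only that ranking step in pure Python; the 2-dimensional vectors in `doc_vecs` are hypothetical placeholders for what a real sentence encoder would produce, and the function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scores = {doc_id: cosine_similarity(query_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sentence embeddings (a real encoder would produce these).
doc_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
ranking = rank_by_similarity([1.0, 0.1], doc_vecs)
print(ranking)
```

Because the ranking depends only on vector direction, the quality of the search comes almost entirely from the encoder, which is where the heavier models earn their cost.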

---

### **💡 Summary for Fast ML Pipeline**

If your goal is **low latency and fast training with scikit-learn models**:

1. **TfidfVectorizer** → Most common choice, balances speed & effectiveness.
2. **CountVectorizer** → Very simple, super fast.
3. **HashingVectorizer** → For huge datasets or streaming data.
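
Why hashing suits huge or streaming data can be shown with a simplified version of the hashing trick. This is a sketch, not scikit-learn's implementation: `HashingVectorizer` uses a signed MurmurHash3, while this example uses `hashlib.md5` from the standard library purely for a stable hash. The names `hash_vectorize` and `stream_vectors` and the tiny `n_features=16` are assumptions for illustration.

```python
import hashlib

def hash_vectorize(text, n_features=16):
    """Map tokens to a fixed-size count vector using a stable hash (no vocabulary)."""
    vec = [0] * n_features
    for token in text.lower().split():
        # md5 is used only because it is stable across runs; Python's built-in
        # hash() is salted per process and would not be reproducible.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        vec[index] += 1
    return vec

def stream_vectors(lines, n_features=16):
    """Vectorize documents one at a time, e.g. while reading a large file."""
    for line in lines:
        yield hash_vectorize(line, n_features)

vectors = list(stream_vectors(["good movie", "bad movie", "good good plot"]))
print(len(vectors[0]), sum(vectors[2]))
```

No vocabulary is stored and nothing is fitted, so memory stays constant no matter how many documents pass through. The trade-off is that different tokens can collide in the same index and the columns cannot be mapped back to words.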

If accuracy is more important than speed, or you want **contextual embeddings**:

* **SentenceTransformer ('all-MiniLM-L6-v2')** → Dense, moderate size, relatively fast transformer.
* Avoid full BERT/RoBERTa unless you have a GPU and large datasets.

---



In [1]:
import pandas as pd
final_df=pd.read_csv('final.csv').iloc[:,3:]

clean_test=pd.read_csv('cleaned.csv')
final_df.shape,clean_test.shape

((4991, 10), (1707, 3))

In [None]:
!pip install gensim catboost

In [2]:
"""
🚀 Industry-Grade Multi-Label Text Classification Pipeline
Features: Advanced Experiment Tracking with Multiple Vectorizers
Vectorizers compared in this cell: TF-IDF, Count, Hashing
(gensim is imported for Word2Vec / GloVe / FastText, but those are not part of VECTORIZERS below)
Comprehensive Metrics, Production-Ready Visualizations
"""

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import wandb
import warnings
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Tuple, Any
import json
from itertools import product
import gensim.downloader as api
from gensim.models import KeyedVectors
import nltk
from nltk.tokenize import word_tokenize

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    f1_score, classification_report, confusion_matrix,
    roc_curve, auc, roc_auc_score, precision_recall_curve,
    average_precision_score, hamming_loss, jaccard_score,
    accuracy_score
)
from sklearn.multiclass import OneVsRestClassifier
# from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Download required NLTK data (best effort; skipped if the download fails)
try:
    nltk.download('punkt', quiet=True)
except Exception:
    pass

# =========================
# 🎨 Configuration & Setup
# =========================
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Set style for professional visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Create output directories
OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)
(OUTPUT_DIR / "plots").mkdir(exist_ok=True)
(OUTPUT_DIR / "reports").mkdir(exist_ok=True)

# Experiment configuration
EXPERIMENT_CONFIG = {
    "test_size": 0.2,
    "random_state": 42,
    "timestamp": datetime.now().strftime("%Y%m%d_%H%M%S")
}

# Vectorizer configurations
VECTORIZER_CONFIGS = {
    'tfidf_max_features': [5000],
    'ngram_ranges': [(1, 1), (1, 2), (1, 3)]  # unigrams, bigrams, trigrams
}

# Initialize W&B with rich config
wandb.init(
    project="23f3003030-t32025",
    name=f"D02-multi-vectorizer-classification-{EXPERIMENT_CONFIG['timestamp']}",
    config=EXPERIMENT_CONFIG,
    tags=["multi-label", "emotion-detection", "vectorizer-comparison", "production"],
    notes="Comprehensive experiment comparing all vectorizers and models"
)

# =========================
# 📊 Advanced Visualization Functions
# =========================

def plot_combined_confusion_matrix(y_true, y_pred, emotions, model_name, vec_name):
    """Create professional confusion matrix visualization"""
    n_emotions = len(emotions)
    fig, axes = plt.subplots(1, n_emotions, figsize=(4*n_emotions, 3.5))

    if n_emotions == 1:
        axes = [axes]

    for i, emotion in enumerate(emotions):
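        # labels=[0, 1] keeps the matrix 2x2 even if one class is absent
        cm = confusion_matrix(y_true.iloc[:, i], y_pred[:, i], labels=[0, 1])

        # Calculate row percentages, guarding against empty rows (division by zero)
        row_sums = cm.sum(axis=1, keepdims=True)
        cm_percent = cm / np.maximum(row_sums, 1) * 100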

        # Create annotations with counts and percentages
        annot = np.array([[f'{count}\n({percent:.1f}%)'
                          for count, percent in zip(row_counts, row_percents)]
                         for row_counts, row_percents in zip(cm, cm_percent)])

        sns.heatmap(cm, annot=annot, fmt='', cmap='Blues', ax=axes[i],
                    cbar=False, square=True, linewidths=1, linecolor='gray')
        axes[i].set_title(f'{emotion.upper()}', fontsize=12, fontweight='bold', pad=10)
        axes[i].set_xlabel('Predicted', fontsize=10)
        axes[i].set_ylabel('Actual' if i == 0 else '', fontsize=10)
        axes[i].set_xticklabels(['No', 'Yes'])
        axes[i].set_yticklabels(['No', 'Yes'])

    fig.suptitle(f'{model_name} ({vec_name}) - Confusion Matrices',
                 fontsize=16, fontweight='bold', y=1.02)
    plt.tight_layout()

    filename = OUTPUT_DIR / "plots" / f'{vec_name}_{model_name}_confusion_matrix.png'
    plt.savefig(filename, dpi=150, bbox_inches='tight', facecolor='white')
    wandb.log({f"{vec_name}/{model_name}/confusion_matrix": wandb.Image(str(filename))})
    plt.close()


def plot_roc_curves(y_true, y_pred_proba, emotions, model_name, vec_name):
    """Plot ROC curves with AUC scores"""
    fig, ax = plt.subplots(figsize=(12, 9))
    colors = plt.cm.Set2(np.linspace(0, 1, len(emotions)))

    roc_auc_scores = {}

    for i, (emotion, color) in enumerate(zip(emotions, colors)):
        fpr, tpr, _ = roc_curve(y_true.iloc[:, i], y_pred_proba[:, i])
        roc_auc = auc(fpr, tpr)
        roc_auc_scores[emotion] = roc_auc

        ax.plot(fpr, tpr, color=color, lw=3,
                label=f'{emotion.capitalize()} (AUC = {roc_auc:.4f})')

    # Add diagonal line
    ax.plot([0, 1], [0, 1], 'k--', lw=2, label='Random (AUC = 0.5)', alpha=0.6)

    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate', fontsize=14, fontweight='bold')
    ax.set_ylabel('True Positive Rate', fontsize=14, fontweight='bold')
    ax.set_title(f'{model_name} ({vec_name}) - ROC Curves', fontsize=18, fontweight='bold', pad=20)
    ax.legend(loc="lower right", fontsize=11, framealpha=0.9)
    ax.grid(alpha=0.4, linestyle='--')

    # Add mean AUC text box
    mean_auc = np.mean(list(roc_auc_scores.values()))
    textstr = f'Mean AUC: {mean_auc:.4f}'
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.8)
    ax.text(0.65, 0.15, textstr, transform=ax.transAxes, fontsize=13,
            verticalalignment='top', bbox=props, fontweight='bold')

    plt.tight_layout()
    filename = OUTPUT_DIR / "plots" / f'{vec_name}_{model_name}_roc_curves.png'
    plt.savefig(filename, dpi=150, bbox_inches='tight', facecolor='white')
    wandb.log({f"{vec_name}/{model_name}/roc_curves": wandb.Image(str(filename))})
    plt.close()

    return roc_auc_scores


def plot_classification_report(class_report_dict, emotions, model_name, vec_name):
    """Visualize classification report as heatmap"""
    metrics = ['precision', 'recall', 'f1-score']
    # Keep rows and index aligned: only emotions present in the report are plotted
    present = [e for e in emotions if e in class_report_dict]
    data = [[class_report_dict[e][m] for m in metrics] for e in present]

    df_report = pd.DataFrame(data, index=[e.capitalize() for e in present], columns=metrics)

    fig, ax = plt.subplots(figsize=(10, 6))
    sns.heatmap(df_report, annot=True, fmt='.4f', cmap='RdYlGn',
                cbar_kws={'label': 'Score'}, vmin=0, vmax=1, ax=ax,
                linewidths=2, linecolor='white', annot_kws={"size": 12, "weight": "bold"})
    ax.set_title(f'{model_name} ({vec_name}) - Classification Metrics',
                 fontsize=18, fontweight='bold', pad=20)
    ax.set_xlabel('Metrics', fontsize=14, fontweight='bold')
    ax.set_ylabel('Emotions', fontsize=14, fontweight='bold')

    plt.tight_layout()
    filename = OUTPUT_DIR / "plots" / f'{vec_name}_{model_name}_classification_report.png'
    plt.savefig(filename, dpi=150, bbox_inches='tight', facecolor='white')
    wandb.log({f"{vec_name}/{model_name}/classification_report": wandb.Image(str(filename))})
    plt.close()


def plot_label_distribution(y_data, title, filename):
    """Plot label distribution"""
    label_counts = y_data.sum().sort_values(ascending=False)

    fig, ax = plt.subplots(figsize=(10, 6))
    bars = ax.bar(range(len(label_counts)), label_counts.values,
                  color=plt.cm.viridis(np.linspace(0, 1, len(label_counts))))
    ax.set_xticks(range(len(label_counts)))
    ax.set_xticklabels([label.capitalize() for label in label_counts.index], fontsize=12)
    ax.set_ylabel('Count', fontsize=14, fontweight='bold')
    ax.set_title(title, fontsize=18, fontweight='bold', pad=20)
    ax.grid(axis='y', alpha=0.4, linestyle='--')

    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}', ha='center', va='bottom', fontsize=11, fontweight='bold')

    plt.tight_layout()
    save_path = OUTPUT_DIR / "plots" / filename
    plt.savefig(save_path, dpi=150, bbox_inches='tight', facecolor='white')
    wandb.log({f"data_analysis/{filename.replace('.png', '')}": wandb.Image(str(save_path))})
    plt.close()


def compute_advanced_metrics(y_true, y_pred, y_pred_proba, emotions):
    """Compute comprehensive metrics"""
    metrics = {
        'f1_macro': f1_score(y_true, y_pred, average='macro', zero_division=0),
        'f1_micro': f1_score(y_true, y_pred, average='micro', zero_division=0),
        'f1_weighted': f1_score(y_true, y_pred, average='weighted', zero_division=0),
        'hamming_loss': hamming_loss(y_true, y_pred),
        'jaccard_score': jaccard_score(y_true, y_pred, average='samples', zero_division=0),
        'subset_accuracy': accuracy_score(y_true, y_pred),
    }

    # Per-emotion metrics
    for i, emotion in enumerate(emotions):
        metrics[f'{emotion}_f1'] = f1_score(y_true.iloc[:, i], y_pred[:, i], zero_division=0)
        try:
            metrics[f'{emotion}_auc'] = roc_auc_score(y_true.iloc[:, i], y_pred_proba[:, i])
        except ValueError:
            # AUC is undefined when only one class is present in y_true
            metrics[f'{emotion}_auc'] = 0.0

    return metrics


def get_vectorizer(vec_type, ngram_range, max_features):
    """Factory function to create vectorizers"""
    if vec_type == 'tfidf':
        return TfidfVectorizer(max_features=max_features, ngram_range=ngram_range)
    elif vec_type == 'count':
        return CountVectorizer(max_features=max_features, ngram_range=ngram_range)
    elif vec_type == 'hashing':
        return HashingVectorizer(n_features=max_features, ngram_range=ngram_range)
    else:
        raise ValueError(f"Unknown vectorizer type: {vec_type}")


# =========================
# 1️⃣ Data Preparation & EDA
# =========================
print("=" * 80)
print("🚀 STARTING COMPREHENSIVE VECTORIZER + MODEL EXPERIMENT")
print("=" * 80)

X = final_df['final_text'].fillna('')
y = final_df[['anger', 'fear', 'joy', 'sadness', 'surprise']]
emotions = y.columns.tolist()

# Log dataset info
wandb.log({
    "dataset/total_samples": len(X),
    "dataset/num_emotions": len(emotions),
    "dataset/feature_name": "final_text"
})

# Analyze and log label distribution
print("\n📊 Analyzing Label Distribution...")
plot_label_distribution(y, "Training Data - Emotion Distribution", "train_label_distribution.png")

# Log label statistics
label_stats = {}
for emotion in emotions:
    label_stats[f"dataset/{emotion}_count"] = int(y[emotion].sum())
    label_stats[f"dataset/{emotion}_percentage"] = float(y[emotion].sum() / len(y) * 100)

wandb.log(label_stats)

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=EXPERIMENT_CONFIG['test_size'],
    random_state=EXPERIMENT_CONFIG['random_state']
)

print(f"✅ Train set size: {len(X_train)} | Validation set size: {len(X_val)}")

# =========================
# 2️⃣ Define Vectorizers
# =========================
VECTORIZERS = {
    'TfidfVectorizer': 'tfidf',
    'CountVectorizer': 'count',
    'HashingVectorizer': 'hashing'
}

# =========================
# 3️⃣ Model Configuration
# =========================
CLASSIFIERS = {
    'RandomForest': RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1),
    'LinearSVC': LinearSVC(max_iter=5000, random_state=42),


    'XGBoost': XGBClassifier(
        eval_metric='logloss', n_estimators=500, max_depth=6,
        learning_rate=0.05, random_state=42, n_jobs=-1
    ),
    'LightGBM': LGBMClassifier(
        n_estimators=500, max_depth=8, learning_rate=0.05,
        random_state=42, verbose=-1, n_jobs=-1
    ),
    'CatBoost': CatBoostClassifier(
        iterations=500, depth=7, learning_rate=0.05,
        verbose=0, random_state=42, thread_count=-1
    )
}

# =========================
# 4️⃣ Comprehensive Training Loop
# =========================
print("\n" + "=" * 80)
print("🤖 TRAINING ALL VECTORIZER + MODEL COMBINATIONS")
print("=" * 80)

all_results = []
experiment_counter = 0
total_experiments = len(VECTORIZERS) * len(VECTORIZER_CONFIGS['ngram_ranges']) * len(CLASSIFIERS)

print(f"\n📊 Total Experiments to Run: {total_experiments}")
print(f"   - Vectorizers: {len(VECTORIZERS)}")
print(f"   - N-gram ranges: {len(VECTORIZER_CONFIGS['ngram_ranges'])}")
print(f"   - Models: {len(CLASSIFIERS)}")
print("=" * 80)

# Iterate over all combinations
for vec_name, vec_type in VECTORIZERS.items():
    for ngram_range in VECTORIZER_CONFIGS['ngram_ranges']:

        # Create vectorizer name
        ngram_str = f"ngram_{ngram_range[0]}_{ngram_range[1]}"
        full_vec_name = f"{vec_name}_{ngram_str}"

        print(f"\n{'='*80}")
        print(f"📐 VECTORIZER: {vec_name} | N-gram: {ngram_range}")
        print(f"{'='*80}")

        # Create and fit vectorizer
        vectorizer = get_vectorizer(vec_type, ngram_range, VECTORIZER_CONFIGS['tfidf_max_features'][0])

        # Transform data
        X_train_vec = vectorizer.fit_transform(X_train)
        X_val_vec = vectorizer.transform(X_val)

        # Log vectorizer info
        vec_shape = X_train_vec.shape[1]
        wandb.log({f"{full_vec_name}/feature_dimension": vec_shape})

        print(f"   ✅ Features extracted: {vec_shape}")

        # Train all models with this vectorizer
        for model_name, clf in CLASSIFIERS.items():
            experiment_counter += 1

            print(f"\n   [{experiment_counter}/{total_experiments}] 🔄 Training: {model_name}")

            # Train model
            model = OneVsRestClassifier(clf, n_jobs=-1)
            model.fit(X_train_vec, y_train)

            # Predictions
            y_pred = model.predict(X_val_vec)

            # Get probabilities
            if hasattr(model, "predict_proba"):
                y_pred_proba = model.predict_proba(X_val_vec)
            elif hasattr(model, "decision_function"):
                y_pred_proba = model.decision_function(X_val_vec)
                # Normalize to [0, 1]
                from sklearn.preprocessing import MinMaxScaler
                scaler = MinMaxScaler()
                y_pred_proba = scaler.fit_transform(y_pred_proba)
            else:
                y_pred_proba = y_pred

            # Compute metrics
            metrics = compute_advanced_metrics(y_val, y_pred, y_pred_proba, emotions)

            # Log all metrics to W&B in one call so they share a single step
            wandb.log({f"{full_vec_name}/{model_name}/{metric_name}": value
                       for metric_name, value in metrics.items()})

            # Classification report
            class_report = classification_report(
                y_val, y_pred, target_names=emotions,
                output_dict=True, zero_division=0
            )

            # Generate visualizations: the confusion matrix is plotted for every run.
            # ROC curves and the report heatmap are generated only for the best
            # configuration at the end, to save time.
            plot_combined_confusion_matrix(y_val, y_pred, emotions, model_name, full_vec_name)

            # Store results
            all_results.append({
                'Vectorizer': vec_name,
                'N-gram': str(ngram_range),
                'Model': model_name,
                'F1_Macro': metrics['f1_macro'],
                'F1_Micro': metrics['f1_micro'],
                'F1_Weighted': metrics['f1_weighted'],
                'Hamming_Loss': metrics['hamming_loss'],
                'Jaccard_Score': metrics['jaccard_score'],
                'Subset_Accuracy': metrics['subset_accuracy'],
                'Full_Name': full_vec_name
            })

            print(f"      ✅ F1-Macro: {metrics['f1_macro']:.4f} | Hamming: {metrics['hamming_loss']:.4f}")

# =========================
# 5️⃣ Comprehensive Results Analysis
# =========================
print("\n" + "=" * 80)
print("📈 COMPREHENSIVE RESULTS ANALYSIS")
print("=" * 80)

results_df = pd.DataFrame(all_results).sort_values(by='F1_Macro', ascending=False)

# Save full results
results_path = OUTPUT_DIR / "reports" / "full_experiment_results.csv"
results_df.to_csv(results_path, index=False)
wandb.save(str(results_path))

# Display top 10 results
print("\n🏆 TOP 10 CONFIGURATIONS:")
print(results_df.head(10).to_string(index=False))

# Best configuration
best_config = results_df.iloc[0]
print("\n🥇 BEST CONFIGURATION:")
print(f"   Vectorizer: {best_config['Vectorizer']}")
print(f"   N-gram: {best_config['N-gram']}")
print(f"   Model: {best_config['Model']}")
print(f"   F1-Macro: {best_config['F1_Macro']:.4f}")

# Log best config
wandb.log({
    "best/vectorizer": best_config['Vectorizer'],
    "best/ngram": best_config['N-gram'],
    "best/model": best_config['Model'],
    "best/f1_macro": best_config['F1_Macro']
})

# =========================
# 6️⃣ Vectorizer Comparison
# =========================
print("\n📊 VECTORIZER PERFORMANCE COMPARISON:")
vectorizer_perf = results_df.groupby('Vectorizer')['F1_Macro'].agg(['mean', 'max', 'std']).round(4)
print(vectorizer_perf)

# Plot vectorizer comparison
fig, ax = plt.subplots(figsize=(12, 7))
vec_comparison = results_df.groupby('Vectorizer')['F1_Macro'].mean().sort_values(ascending=False)
bars = ax.bar(vec_comparison.index, vec_comparison.values,
              color=plt.cm.viridis(np.linspace(0, 1, len(vec_comparison))))
ax.set_xlabel('Vectorizer', fontsize=14, fontweight='bold')
ax.set_ylabel('Mean F1-Macro Score', fontsize=14, fontweight='bold')
ax.set_title('Vectorizer Performance Comparison (Averaged Across All Models)',
             fontsize=16, fontweight='bold', pad=20)
ax.grid(axis='y', alpha=0.4, linestyle='--')

for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.4f}', ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.tight_layout()
vec_comp_path = OUTPUT_DIR / "plots" / "vectorizer_comparison.png"
plt.savefig(vec_comp_path, dpi=150, bbox_inches='tight', facecolor='white')
wandb.log({"comparison/vectorizer_performance": wandb.Image(str(vec_comp_path))})
plt.close()

# =========================
# 7️⃣ N-gram Range Comparison
# =========================
print("\n📊 N-GRAM RANGE PERFORMANCE:")
ngram_perf = results_df.groupby('N-gram')['F1_Macro'].agg(['mean', 'max', 'std']).round(4)
print(ngram_perf)

# Plot n-gram comparison
fig, ax = plt.subplots(figsize=(10, 6))
ngram_comparison = results_df.groupby('N-gram')['F1_Macro'].mean().sort_values(ascending=False)
bars = ax.bar(ngram_comparison.index, ngram_comparison.values,
              color=plt.cm.plasma(np.linspace(0, 1, len(ngram_comparison))))
ax.set_xlabel('N-gram Range', fontsize=14, fontweight='bold')
ax.set_ylabel('Mean F1-Macro Score', fontsize=14, fontweight='bold')
ax.set_title('N-gram Range Performance Comparison', fontsize=16, fontweight='bold', pad=20)
ax.grid(axis='y', alpha=0.4, linestyle='--')

for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.4f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
ngram_comp_path = OUTPUT_DIR / "plots" / "ngram_comparison.png"
plt.savefig(ngram_comp_path, dpi=150, bbox_inches='tight', facecolor='white')
wandb.log({"comparison/ngram_performance": wandb.Image(str(ngram_comp_path))})
plt.close()

# =========================
# 8️⃣ Model Performance Across Vectorizers
# =========================
print("\n📊 MODEL PERFORMANCE ACROSS VECTORIZERS:")
model_perf = results_df.groupby('Model')['F1_Macro'].agg(['mean', 'max', 'std']).round(4)
print(model_perf.sort_values('mean', ascending=False))

# Heatmap: Models vs Vectorizers
pivot_table = results_df.pivot_table(
    values='F1_Macro',
    index='Model',
    columns='Vectorizer',
    aggfunc='mean'
)

fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(pivot_table, annot=True, fmt='.4f', cmap='YlOrRd',
            cbar_kws={'label': 'F1-Macro Score'}, ax=ax,
            linewidths=1, linecolor='white')
ax.set_title('Model Performance Heatmap (Across Vectorizers)',
             fontsize=18, fontweight='bold', pad=20)
ax.set_xlabel('Vectorizer', fontsize=14, fontweight='bold')
ax.set_ylabel('Model', fontsize=14, fontweight='bold')

plt.tight_layout()
heatmap_path = OUTPUT_DIR / "plots" / "model_vectorizer_heatmap.png"
plt.savefig(heatmap_path, dpi=150, bbox_inches='tight', facecolor='white')
wandb.log({"comparison/model_vectorizer_heatmap": wandb.Image(str(heatmap_path))})
plt.close()

# =========================
# 9️⃣ Generate Detailed Visualizations for Best Config
# =========================
print("\n🎨 Generating detailed visualizations for best configuration...")

# Extract best configuration details
best_vec_type = VECTORIZERS[best_config['Vectorizer']]
import ast  # literal_eval parses the stored "(1, 2)" string without executing code
best_ngram = ast.literal_eval(best_config['N-gram'])
best_model_name = best_config['Model']
best_full_name = best_config['Full_Name']

# Recreate best vectorizer and model
best_vectorizer = get_vectorizer(best_vec_type, best_ngram, VECTORIZER_CONFIGS['tfidf_max_features'][0])
X_train_best = best_vectorizer.fit_transform(X_train)
X_val_best = best_vectorizer.transform(X_val)

# Train best model
best_clf = CLASSIFIERS[best_model_name]
best_model = OneVsRestClassifier(best_clf, n_jobs=-1)
best_model.fit(X_train_best, y_train)

# Predictions
y_pred_best = best_model.predict(X_val_best)

if hasattr(best_model, "predict_proba"):
    y_pred_proba_best = best_model.predict_proba(X_val_best)
elif hasattr(best_model, "decision_function"):
    y_pred_proba_best = best_model.decision_function(X_val_best)
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    y_pred_proba_best = scaler.fit_transform(y_pred_proba_best)
else:
    y_pred_proba_best = y_pred_best

# Generate all visualizations for best model
class_report_best = classification_report(
    y_val, y_pred_best, target_names=emotions,
    output_dict=True, zero_division=0
)

plot_roc_curves(y_val, y_pred_proba_best, emotions, best_model_name, f"BEST_{best_full_name}")
plot_classification_report(class_report_best, emotions, best_model_name, f"BEST_{best_full_name}")

# =========================
# 🔟 Retrain Best Model on Full Data
# =========================
print("\n🔄 Retraining best model on full dataset...")

X_full_best = best_vectorizer.fit_transform(X)
best_model_full = OneVsRestClassifier(best_clf, n_jobs=-1)
best_model_full.fit(X_full_best, y)

print("   ✅ Best model retrained on full dataset")

# =========================
# 1️⃣1️⃣ Test Set Prediction
# =========================
print("\n📝 Generating predictions on test set...")
clean_test['final_text'] = clean_test['final_text'].fillna('')
X_test_best = best_vectorizer.transform(clean_test['final_text'])
y_test_pred = best_model_full.predict(X_test_best)

# Create submission
submission = pd.DataFrame(y_test_pred, columns=y.columns)
submission['id'] = clean_test['id']
submission = submission[['id'] + list(y.columns)]

submission_path = OUTPUT_DIR / "submission.csv"
submission.to_csv(submission_path, index=False)
wandb.save(str(submission_path))

print(f"   ✅ Submission saved: {submission_path}")

# =========================
# 1️⃣2️⃣ Final Summary Report
# =========================
print("\n" + "=" * 80)
print("📋 EXPERIMENT SUMMARY")
print("=" * 80)

summary = {
    "timestamp": EXPERIMENT_CONFIG['timestamp'],
    "total_experiments": total_experiments,
    "best_vectorizer": best_config['Vectorizer'],
    "best_ngram": best_config['N-gram'],
    "best_model": best_config['Model'],
    "best_f1_macro": float(best_config['F1_Macro']),
    "best_hamming_loss": float(best_config['Hamming_Loss']),
    "best_jaccard_score": float(best_config['Jaccard_Score']),
    "dataset_size": len(X),
    "validation_size": len(X_val),
    "num_emotions": len(emotions),
    "emotions": emotions,
    "vectorizers_tested": list(VECTORIZERS.keys()),
    "ngram_ranges_tested": VECTORIZER_CONFIGS['ngram_ranges'],
    "models_tested": list(CLASSIFIERS.keys())
}

summary_path = OUTPUT_DIR / "reports" / "experiment_summary.json"
with open(summary_path, 'w') as f:
    json.dump(summary, f, indent=4)

print(json.dumps(summary, indent=2))
print(f"\n✅ Summary saved: {summary_path}")

wandb.log({"experiment/summary": summary})

# Create final comparison table
wandb.log({"experiment/all_results": wandb.Table(dataframe=results_df)})

# =========================
# 1️⃣3️⃣ Create Executive Summary Visualization
# =========================
print("\n📊 Creating executive summary dashboard...")

fig = plt.figure(figsize=(20, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# 1. Best Configuration Box
ax1 = fig.add_subplot(gs[0, 0])
ax1.axis('off')
summary_text = f"""
BEST CONFIGURATION
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
Vectorizer: {best_config['Vectorizer']}
N-gram: {best_config['N-gram']}
Model: {best_config['Model']}

PERFORMANCE METRICS
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
F1-Macro: {best_config['F1_Macro']:.4f}
F1-Micro: {best_config['F1_Micro']:.4f}
F1-Weighted: {best_config['F1_Weighted']:.4f}
Hamming Loss: {best_config['Hamming_Loss']:.4f}
Jaccard Score: {best_config['Jaccard_Score']:.4f}
Subset Accuracy: {best_config['Subset_Accuracy']:.4f}
"""
ax1.text(0.1, 0.5, summary_text, fontsize=11, fontfamily='monospace',
         verticalalignment='center', bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

# 2. Top 5 Configurations
ax2 = fig.add_subplot(gs[0, 1:])
top5 = results_df.head(5)
ax2.axis('tight')
ax2.axis('off')
table_data = []
for idx, row in top5.iterrows():
    table_data.append([
        row['Vectorizer'],
        row['N-gram'],
        row['Model'],
        f"{row['F1_Macro']:.4f}"
    ])
table = ax2.table(cellText=table_data,
                  colLabels=['Vectorizer', 'N-gram', 'Model', 'F1-Macro'],
                  cellLoc='center',
                  loc='center',
                  colWidths=[0.25, 0.2, 0.3, 0.15])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)
# Header row (i == 0) is dark green; data rows are light green
for i in range(len(top5) + 1):
    row_color = '#4CAF50' if i == 0 else '#E8F5E9'
    for j in range(4):
        table[(i, j)].set_facecolor(row_color)
ax2.set_title('TOP 5 CONFIGURATIONS', fontsize=14, fontweight='bold', pad=20)

# 3. Vectorizer Performance
ax3 = fig.add_subplot(gs[1, 0])
vec_means = results_df.groupby('Vectorizer')['F1_Macro'].mean().sort_values(ascending=True)
ax3.barh(vec_means.index, vec_means.values, color=plt.cm.viridis(np.linspace(0, 1, len(vec_means))))
ax3.set_xlabel('Mean F1-Macro', fontsize=11, fontweight='bold')
ax3.set_title('Vectorizer Performance', fontsize=12, fontweight='bold')
ax3.grid(axis='x', alpha=0.3)
for i, v in enumerate(vec_means.values):
    ax3.text(v, i, f' {v:.4f}', va='center', fontweight='bold')

# 4. N-gram Performance
ax4 = fig.add_subplot(gs[1, 1])
ngram_means = results_df.groupby('N-gram')['F1_Macro'].mean().sort_values(ascending=True)
ax4.barh(ngram_means.index, ngram_means.values, color=plt.cm.plasma(np.linspace(0, 1, len(ngram_means))))
ax4.set_xlabel('Mean F1-Macro', fontsize=11, fontweight='bold')
ax4.set_title('N-gram Range Performance', fontsize=12, fontweight='bold')
ax4.grid(axis='x', alpha=0.3)
for i, v in enumerate(ngram_means.values):
    ax4.text(v, i, f' {v:.4f}', va='center', fontweight='bold')

# 5. Model Performance
ax5 = fig.add_subplot(gs[1, 2])
model_means = results_df.groupby('Model')['F1_Macro'].mean().sort_values(ascending=False).head(8)
ax5.bar(range(len(model_means)), model_means.values, color=plt.cm.Set3(np.linspace(0, 1, len(model_means))))
ax5.set_xticks(range(len(model_means)))
ax5.set_xticklabels(model_means.index, rotation=45, ha='right', fontsize=9)
ax5.set_ylabel('Mean F1-Macro', fontsize=11, fontweight='bold')
ax5.set_title('Model Performance', fontsize=12, fontweight='bold')
ax5.grid(axis='y', alpha=0.3)
for i, v in enumerate(model_means.values):
    ax5.text(i, v, f'{v:.3f}', ha='center', va='bottom', fontsize=8, fontweight='bold')

# 6. F1-Macro Distribution
ax6 = fig.add_subplot(gs[2, 0])
ax6.hist(results_df['F1_Macro'], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
ax6.axvline(best_config['F1_Macro'], color='red', linestyle='--', linewidth=2, label=f"Best: {best_config['F1_Macro']:.4f}")
ax6.set_xlabel('F1-Macro Score', fontsize=11, fontweight='bold')
ax6.set_ylabel('Frequency', fontsize=11, fontweight='bold')
ax6.set_title('F1-Macro Distribution', fontsize=12, fontweight='bold')
ax6.legend()
ax6.grid(alpha=0.3)

# 7. Hamming Loss Distribution
ax7 = fig.add_subplot(gs[2, 1])
ax7.hist(results_df['Hamming_Loss'], bins=30, color='salmon', edgecolor='black', alpha=0.7)
ax7.axvline(best_config['Hamming_Loss'], color='green', linestyle='--', linewidth=2, label=f"Best: {best_config['Hamming_Loss']:.4f}")
ax7.set_xlabel('Hamming Loss', fontsize=11, fontweight='bold')
ax7.set_ylabel('Frequency', fontsize=11, fontweight='bold')
ax7.set_title('Hamming Loss Distribution', fontsize=12, fontweight='bold')
ax7.legend()
ax7.grid(alpha=0.3)

# 8. Experiment Statistics
ax8 = fig.add_subplot(gs[2, 2])
ax8.axis('off')
stats_text = f"""
EXPERIMENT STATISTICS
━━━━━━━━━━━━━━━━━━━━━━
Total Experiments: {total_experiments}
Vectorizers: {len(VECTORIZERS)}
N-gram Ranges: {len(VECTORIZER_CONFIGS['ngram_ranges'])}
Models: {len(CLASSIFIERS)}

BEST SCORES
━━━━━━━━━━━━━━━━━━━━━━
Highest F1-Macro: {results_df['F1_Macro'].max():.4f}
Lowest Hamming Loss: {results_df['Hamming_Loss'].min():.4f}
Highest Jaccard: {results_df['Jaccard_Score'].max():.4f}

AVERAGE SCORES
━━━━━━━━━━━━━━━━━━━━━━
Mean F1-Macro: {results_df['F1_Macro'].mean():.4f}
Std F1-Macro: {results_df['F1_Macro'].std():.4f}
"""
ax8.text(0.1, 0.5, stats_text, fontsize=10, fontfamily='monospace',
         verticalalignment='center', bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

plt.suptitle('🚀 EXPERIMENT EXECUTIVE SUMMARY DASHBOARD', fontsize=20, fontweight='bold', y=0.98)

dashboard_path = OUTPUT_DIR / "plots" / "executive_summary_dashboard.png"
plt.savefig(dashboard_path, dpi=150, bbox_inches='tight', facecolor='white')
wandb.log({"summary/executive_dashboard": wandb.Image(str(dashboard_path))})
plt.close()

print(f"   ✅ Executive dashboard created: {dashboard_path}")

# =========================
# 1️⃣4️⃣ Performance Comparison by Vectorizer and Model
# =========================
print("\n📊 Creating detailed performance comparison charts...")

# Create a comprehensive comparison figure
fig, axes = plt.subplots(2, 2, figsize=(18, 14))

# Plot 1: F1-Macro by Vectorizer and Model
ax1 = axes[0, 0]
for vectorizer in results_df['Vectorizer'].unique():
    data = results_df[results_df['Vectorizer'] == vectorizer].groupby('Model')['F1_Macro'].mean()
    ax1.plot(data.index, data.values, marker='o', linewidth=2, markersize=8, label=vectorizer)
ax1.set_xlabel('Model', fontsize=12, fontweight='bold')
ax1.set_ylabel('Mean F1-Macro', fontsize=12, fontweight='bold')
ax1.set_title('F1-Macro: Models vs Vectorizers', fontsize=14, fontweight='bold')
ax1.legend(title='Vectorizer', fontsize=10)
ax1.grid(alpha=0.3)
ax1.tick_params(axis='x', rotation=45)

# Plot 2: Hamming Loss by Vectorizer and Model
ax2 = axes[0, 1]
for vectorizer in results_df['Vectorizer'].unique():
    data = results_df[results_df['Vectorizer'] == vectorizer].groupby('Model')['Hamming_Loss'].mean()
    ax2.plot(data.index, data.values, marker='s', linewidth=2, markersize=8, label=vectorizer)
ax2.set_xlabel('Model', fontsize=12, fontweight='bold')
ax2.set_ylabel('Mean Hamming Loss', fontsize=12, fontweight='bold')
ax2.set_title('Hamming Loss: Models vs Vectorizers', fontsize=14, fontweight='bold')
ax2.legend(title='Vectorizer', fontsize=10)
ax2.grid(alpha=0.3)
ax2.tick_params(axis='x', rotation=45)

# Plot 3: Performance by N-gram
ax3 = axes[1, 0]
ngram_model_perf = results_df.groupby(['N-gram', 'Model'])['F1_Macro'].mean().unstack()
ngram_model_perf.plot(kind='bar', ax=ax3, width=0.8, colormap='viridis')
ax3.set_xlabel('N-gram Range', fontsize=12, fontweight='bold')
ax3.set_ylabel('Mean F1-Macro', fontsize=12, fontweight='bold')
ax3.set_title('Performance by N-gram Range', fontsize=14, fontweight='bold')
ax3.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=8)
ax3.grid(axis='y', alpha=0.3)
ax3.tick_params(axis='x', rotation=0)

# Plot 4: Box plot of F1-Macro by Vectorizer
ax4 = axes[1, 1]
results_df.boxplot(column='F1_Macro', by='Vectorizer', ax=ax4, patch_artist=True)
ax4.set_xlabel('Vectorizer', fontsize=12, fontweight='bold')
ax4.set_ylabel('F1-Macro Distribution', fontsize=12, fontweight='bold')
ax4.set_title('F1-Macro Distribution by Vectorizer', fontsize=14, fontweight='bold')
plt.sca(ax4)
plt.xticks(rotation=0)
ax4.get_figure().suptitle('')  # Remove default title

plt.tight_layout()
comparison_path = OUTPUT_DIR / "plots" / "detailed_performance_comparison.png"
plt.savefig(comparison_path, dpi=150, bbox_inches='tight', facecolor='white')
wandb.log({"comparison/detailed_performance": wandb.Image(str(comparison_path))})
plt.close()

print("   ✅ Detailed comparison charts created")

# =========================
# 🎯 Finish Experiment
# =========================
print("\n" + "=" * 80)
print("✅ COMPREHENSIVE EXPERIMENT COMPLETED SUCCESSFULLY!")
print("=" * 80)
print("\n📊 FINAL RESULTS:")
print(f"   • Total Experiments Run: {total_experiments}")
print(f"   • Best Vectorizer: {best_config['Vectorizer']}")
print(f"   • Best N-gram: {best_config['N-gram']}")
print(f"   • Best Model: {best_config['Model']}")
print(f"   • Best F1-Macro: {best_config['F1_Macro']:.4f}")
print(f"\n📁 All outputs saved in: {OUTPUT_DIR}")
print(f"🔗 View comprehensive results in W&B: {wandb.run.get_url()}")
print("=" * 80)

wandb.finish()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: aiwithajay (aiwithajay-indian-institute-of-technology-madras) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin

🚀 STARTING COMPREHENSIVE VECTORIZER + MODEL EXPERIMENT

📊 Analyzing Label Distribution...
✅ Train set size: 3992 | Validation set size: 999

🤖 TRAINING ALL VECTORIZER + MODEL COMBINATIONS

📊 Total Experiments to Run: 45
   - Vectorizers: 3
   - N-gram ranges: 3
   - Models: 5

📐 VECTORIZER: TfidfVectorizer | N-gram: (1, 1)
   ✅ Features extracted: 5000

   [1/45] 🔄 Training: RandomForest
      ✅ F1-Macro: 0.4364 | Hamming: 0.2392

   [2/45] 🔄 Training: LinearSVC
      ✅ F1-Macro: 0.4784 | Hamming: 0.2436

   [3/45] 🔄 Training: XGBoost
      ✅ F1-Macro: 0.3881 | Hamming: 0.2364

   [4/45] 🔄 Training: LightGBM
      ✅ F1-Macro: 0.3615 | Hamming: 0.2599

   [5/45] 🔄 Training: CatBoost
      ✅ F1-Macro: 0.4036 | Hamming: 0.2366

📐 VECTORIZER: TfidfVectorizer | N-gram: (1, 2)
   ✅ Features extracted: 5000

   [6/45] 🔄 Training: RandomForest
      ✅ F1-Macro: 0.4426 | Hamming: 0.2384

   [7/45] 🔄 Training: LinearSVC
      ✅ F1-Macro: 0.4697 | Hamming: 0.2460

   [8/45] 🔄 Training: XGBoost
      ✅ F1-Macro: 0.3938 | Hamming: 0.2410

   [9/45] 🔄 Training: LightGBM
      ✅ F1-Macro: 0.3721 | Hamming: 0.2555

   [10/45] 🔄 Training: CatBoost
      ✅ F1-Macro: 0.4097 | Hamming: 0.2360

📐 VECTORIZER: TfidfVectorizer | N-gram: (1, 3)
   ✅ Features extracted: 5000

   [11/45] 🔄 Training: RandomForest
      ✅ F1-Macro: 0.4580 | Hamming: 0.2362

   [12/45] 🔄 Training: LinearSVC
      ✅ F1-Macro: 0.4614 | Hamming: 0.2496

   [13/45] 🔄 Training: XGBoost
      ✅ F1-Macro: 0.4030 | Hamming: 0.2388

   [14/45] 🔄 Training: LightGBM
      ✅ F1-Macro: 0.3751 | Hamming: 0.2545

   [15/45] 🔄 Training: CatBoost
      ✅ F1-Macro: 0.4293 | Hamming: 0.2318

📐 VECTORIZER: CountVectorizer | N-gram: (1, 1)
   ✅ Features extracted: 5000

   [16/45] 🔄 Training: RandomForest
      ✅ F1-Macro: 0.4768 | Hamming: 0.2428

   [17/45] 🔄 Training: LinearSVC
      ✅ F1-Macro: 0.4835 | Hamming: 0.2611

   [18/45] 🔄 Training: XGBoost
      ✅ F1-Macro: 0.3987 | Hamming: 0.2344

   [19/45] 🔄 Training: LightGBM


TypeError: Expected np.float32 or np.float64, met type(int64)
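
The traceback above is consistent with LightGBM rejecting the integer count matrix that `CountVectorizer` emits by default (`int64`), whereas the TF-IDF runs before it already produced floats. The following is a minimal sketch of two common workarounds, assuming scikit-learn and NumPy are installed; the toy `docs` list is purely illustrative and is not taken from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative toy corpus (an assumption, not the notebook's data)
docs = ["i am happy today", "i am sad and afraid", "what a surprise"]

# Default behaviour: a sparse matrix of int64 term counts
X_int = CountVectorizer().fit_transform(docs)

# Option 1: cast after vectorizing; the matrix stays sparse
X_cast = X_int.astype(np.float32)

# Option 2: ask the vectorizer to emit float32 counts directly
X_float = CountVectorizer(dtype=np.float32).fit_transform(docs)

print(X_int.dtype, X_cast.dtype, X_float.dtype)
```

Either option leaves the counts themselves unchanged; only the storage type differs. Casting after vectorization has the advantage of working uniformly for every vectorizer in the comparison.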

In [3]:
"""
🚀 Industry-Grade Multi-Label Text Classification Pipeline
Features: Advanced Experiment Tracking with Multiple Vectorizers
Including: TF-IDF, Count, Hashing, Word2Vec, GloVe, FastText
Comprehensive Metrics, Production-Ready Visualizations
WITH CHECKPOINT SUPPORT AND ERROR RECOVERY
"""

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import wandb
import warnings
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Tuple, Any
import json
from itertools import product
import pickle
import os
import gensim.downloader as api
from gensim.models import Word2Vec, FastText
import nltk
from nltk.tokenize import word_tokenize

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    f1_score, classification_report, confusion_matrix,
    roc_curve, auc, roc_auc_score, precision_recall_curve,
    average_precision_score, hamming_loss, jaccard_score,
    accuracy_score
)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from scipy import sparse

# Download required NLTK data (best effort; a failed download should not stop the run)
try:
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
except Exception:
    pass

# =========================
# 🎨 Configuration & Setup
# =========================
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Set style for professional visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Create output directories
OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)
(OUTPUT_DIR / "plots").mkdir(exist_ok=True)
(OUTPUT_DIR / "reports").mkdir(exist_ok=True)
(OUTPUT_DIR / "checkpoints").mkdir(exist_ok=True)
(OUTPUT_DIR / "embeddings").mkdir(exist_ok=True)

# Checkpoint file path
CHECKPOINT_FILE = OUTPUT_DIR / "checkpoints" / "experiment_checkpoint.pkl"
RESULTS_CHECKPOINT = OUTPUT_DIR / "checkpoints" / "results_checkpoint.csv"

# Experiment configuration
EXPERIMENT_CONFIG = {
    "test_size": 0.2,
    "random_state": 42,
    "timestamp": datetime.now().strftime("%Y%m%d_%H%M%S"),
    "embedding_dim": 100,
    "word2vec_min_count": 2,
    "word2vec_window": 5,
    "word2vec_workers": 4
}

# Vectorizer configurations
VECTORIZER_CONFIGS = {
    'tfidf_max_features': [5000],
    'ngram_ranges': [(1, 1), (1, 2), (1, 3)]
}

# =========================
# 🔧 Checkpoint Functions
# =========================

def save_checkpoint(experiment_state):
    """Save experiment checkpoint"""
    try:
        with open(CHECKPOINT_FILE, 'wb') as f:
            pickle.dump(experiment_state, f)
        print(f"   💾 Checkpoint saved: {CHECKPOINT_FILE}")
    except Exception as e:
        print(f"   ⚠️ Warning: Could not save checkpoint: {e}")

def load_checkpoint():
    """Load experiment checkpoint if exists"""
    if CHECKPOINT_FILE.exists():
        try:
            with open(CHECKPOINT_FILE, 'rb') as f:
                state = pickle.load(f)
            print(f"   ✅ Checkpoint loaded: {CHECKPOINT_FILE}")
            return state
        except Exception as e:
            print(f"   ⚠️ Warning: Could not load checkpoint: {e}")
            return None
    return None

def save_results_checkpoint(results_df):
    """Save results as CSV checkpoint"""
    try:
        results_df.to_csv(RESULTS_CHECKPOINT, index=False)
        print(f"   💾 Results checkpoint saved: {RESULTS_CHECKPOINT}")
    except Exception as e:
        print(f"   ⚠️ Warning: Could not save results checkpoint: {e}")

def load_results_checkpoint():
    """Load results checkpoint if exists"""
    if RESULTS_CHECKPOINT.exists():
        try:
            results_df = pd.read_csv(RESULTS_CHECKPOINT)
            print(f"   ✅ Results checkpoint loaded: {len(results_df)} experiments")
            return results_df
        except Exception as e:
            print(f"   ⚠️ Warning: Could not load results checkpoint: {e}")
            return pd.DataFrame()
    return pd.DataFrame()

# Initialize W&B with rich config
wandb.init(
    project="23f3003030-t32025",
    name="D02-multi-vectorizer-classification-20251010_070017",
    config=EXPERIMENT_CONFIG,
    tags=["multi-label", "emotion-detection", "vectorizer-comparison", "embeddings", "production"],
    notes="Comprehensive experiment with TF-IDF, Count, Hashing, Word2Vec, GloVe, FastText",
    resume="allow"
)

# =========================
# 🌐 Embedding Helper Functions
# =========================

class EmbeddingVectorizer:
    """Custom vectorizer for word embeddings (Word2Vec, GloVe, FastText)"""

    def __init__(self, model, embedding_type='word2vec'):
        self.model = model
        self.embedding_type = embedding_type
        self.dim = model.vector_size if hasattr(model, 'vector_size') else len(model['the'])

    def tokenize(self, text):
        """Tokenize text"""
        try:
            return word_tokenize(str(text).lower())
        except LookupError:
            # NLTK tokenizer data is missing: fall back to whitespace splitting
            return str(text).lower().split()

    def get_vector(self, word):
        """Get vector for a word"""
        try:
            if self.embedding_type in ['word2vec', 'fasttext']:
                return self.model.wv[word]
            else:  # glove
                return self.model[word]
        except KeyError:
            # Out-of-vocabulary word: return zeros, which transform_text filters out
            return np.zeros(self.dim)

    def transform_text(self, text):
        """Transform single text to average embedding vector"""
        tokens = self.tokenize(text)
        vectors = [self.get_vector(word) for word in tokens]
        valid_vectors = [v for v in vectors if np.any(v)]

        if valid_vectors:
            return np.mean(valid_vectors, axis=0)
        else:
            return np.zeros(self.dim)

    def fit_transform(self, texts):
        """Fit and transform texts"""
        return self.transform(texts)

    def transform(self, texts):
        """Transform texts to embedding matrix"""
        embeddings = np.array([self.transform_text(text) for text in texts])
        return embeddings.astype(np.float32)


def load_glove_embeddings(dim=100):
    """Load pretrained GloVe embeddings"""
    print(f"\n🔄 Loading GloVe embeddings (dim={dim})...")

    glove_path = OUTPUT_DIR / "embeddings" / f"glove_{dim}d.pkl"

    if glove_path.exists():
        print("   ✅ Loading cached GloVe model...")
        with open(glove_path, 'rb') as f:
            return pickle.load(f)

    try:
        # Download from gensim
        model_name = f'glove-twitter-{dim}'
        print(f"   🌐 Downloading {model_name} from gensim...")
        glove_model = api.load(model_name)

        # Save for future use
        with open(glove_path, 'wb') as f:
            pickle.dump(glove_model, f)
        print("   ✅ GloVe model saved to cache")

        return glove_model
    except Exception as e:
        print(f"   ❌ Error loading GloVe: {e}")
        return None


def train_word2vec(texts, **params):
    """Train Word2Vec model"""
    print("\n🔄 Training Word2Vec model...")

    # Tokenize all texts
    tokenized_texts = [word_tokenize(str(text).lower()) for text in texts]

    # Train model
    model = Word2Vec(
        sentences=tokenized_texts,
        vector_size=params.get('embedding_dim', 100),
        window=params.get('window', 5),
        min_count=params.get('min_count', 2),
        workers=params.get('workers', 4),
        epochs=10,
        seed=params.get('random_state', 42)
    )

    print(f"   ✅ Word2Vec trained: vocab_size={len(model.wv)}")
    return model


def train_fasttext(texts, **params):
    """Train FastText model"""
    print("\n🔄 Training FastText model...")

    # Tokenize all texts
    tokenized_texts = [word_tokenize(str(text).lower()) for text in texts]

    # Train model
    model = FastText(
        sentences=tokenized_texts,
        vector_size=params.get('embedding_dim', 100),
        window=params.get('window', 5),
        min_count=params.get('min_count', 2),
        workers=params.get('workers', 4),
        epochs=10,
        seed=params.get('random_state', 42)
    )

    print(f"   ✅ FastText trained: vocab_size={len(model.wv)}")
    return model


# =========================
# 📊 Advanced Visualization Functions
# =========================

def plot_combined_confusion_matrix(y_true, y_pred, emotions, model_name, vec_name):
    """Create professional confusion matrix visualization"""
    n_emotions = len(emotions)
    fig, axes = plt.subplots(1, n_emotions, figsize=(4*n_emotions, 3.5))

    if n_emotions == 1:
        axes = [axes]

    for i, emotion in enumerate(emotions):
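        # Fix the label order so the matrix is always 2x2, even when a class is absent
        cm = confusion_matrix(y_true.iloc[:, i], y_pred[:, i], labels=[0, 1])

        # Calculate row percentages, leaving rows with no samples at 0%
        row_sums = cm.sum(axis=1, keepdims=True)
        cm_percent = np.divide(cm * 100.0, row_sums,
                               out=np.zeros(cm.shape, dtype=float), where=row_sums > 0)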

        # Create annotations with counts and percentages
        annot = np.array([[f'{count}\n({percent:.1f}%)'
                          for count, percent in zip(row_counts, row_percents)]
                         for row_counts, row_percents in zip(cm, cm_percent)])

        sns.heatmap(cm, annot=annot, fmt='', cmap='Blues', ax=axes[i],
                    cbar=False, square=True, linewidths=1, linecolor='gray')
        axes[i].set_title(f'{emotion.upper()}', fontsize=12, fontweight='bold', pad=10)
        axes[i].set_xlabel('Predicted', fontsize=10)
        axes[i].set_ylabel('Actual' if i == 0 else '', fontsize=10)
        axes[i].set_xticklabels(['No', 'Yes'])
        axes[i].set_yticklabels(['No', 'Yes'])

    fig.suptitle(f'{model_name} ({vec_name}) - Confusion Matrices',
                 fontsize=16, fontweight='bold', y=1.02)
    plt.tight_layout()

    filename = OUTPUT_DIR / "plots" / f'{vec_name}_{model_name}_confusion_matrix.png'
    plt.savefig(filename, dpi=150, bbox_inches='tight', facecolor='white')
    wandb.log({f"{vec_name}/{model_name}/confusion_matrix": wandb.Image(str(filename))})
    plt.close()


def plot_label_distribution(y_data, title, filename):
    """Plot label distribution"""
    label_counts = y_data.sum().sort_values(ascending=False)

    fig, ax = plt.subplots(figsize=(10, 6))
    bars = ax.bar(range(len(label_counts)), label_counts.values,
                  color=plt.cm.viridis(np.linspace(0, 1, len(label_counts))))
    ax.set_xticks(range(len(label_counts)))
    ax.set_xticklabels([label.capitalize() for label in label_counts.index], fontsize=12)
    ax.set_ylabel('Count', fontsize=14, fontweight='bold')
    ax.set_title(title, fontsize=18, fontweight='bold', pad=20)
    ax.grid(axis='y', alpha=0.4, linestyle='--')

    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}', ha='center', va='bottom', fontsize=11, fontweight='bold')

    plt.tight_layout()
    save_path = OUTPUT_DIR / "plots" / filename
    plt.savefig(save_path, dpi=150, bbox_inches='tight', facecolor='white')
    wandb.log({f"data_analysis/{filename.replace('.png', '')}": wandb.Image(str(save_path))})
    plt.close()


def compute_advanced_metrics(y_true, y_pred, y_pred_proba, emotions):
    """Compute comprehensive metrics"""
    metrics = {
        'f1_macro': f1_score(y_true, y_pred, average='macro', zero_division=0),
        'f1_micro': f1_score(y_true, y_pred, average='micro', zero_division=0),
        'f1_weighted': f1_score(y_true, y_pred, average='weighted', zero_division=0),
        'hamming_loss': hamming_loss(y_true, y_pred),
        'jaccard_score': jaccard_score(y_true, y_pred, average='samples', zero_division=0),
        'subset_accuracy': accuracy_score(y_true, y_pred),
    }

    # Per-emotion metrics
    for i, emotion in enumerate(emotions):
        metrics[f'{emotion}_f1'] = f1_score(y_true.iloc[:, i], y_pred[:, i], zero_division=0)
        try:
            metrics[f'{emotion}_auc'] = roc_auc_score(y_true.iloc[:, i], y_pred_proba[:, i])
        except ValueError:
            # AUC is undefined when the validation labels contain a single class
            metrics[f'{emotion}_auc'] = 0.0

    return metrics


def get_vectorizer(vec_type, ngram_range, max_features, embedding_models=None):
    """Factory function to create vectorizers"""

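    if vec_type == 'tfidf':
        return TfidfVectorizer(max_features=max_features, ngram_range=ngram_range)
    elif vec_type == 'count':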
        return CountVectorizer(max_features=max_features, ngram_range=ngram_range)
    elif vec_type == 'hashing':
        return HashingVectorizer(n_features=max_features, ngram_range=ngram_range)
    elif vec_type == 'word2vec' and embedding_models and 'word2vec' in embedding_models:
        return EmbeddingVectorizer(embedding_models['word2vec'], 'word2vec')
    elif vec_type == 'glove' and embedding_models and 'glove' in embedding_models:
        return EmbeddingVectorizer(embedding_models['glove'], 'glove')
    elif vec_type == 'fasttext' and embedding_models and 'fasttext' in embedding_models:
        return EmbeddingVectorizer(embedding_models['fasttext'], 'fasttext')
    else:
        raise ValueError(f"Unknown vectorizer type: {vec_type}")


def convert_to_float32(X):
    """Cast a sparse or dense feature matrix to float32 (required for LightGBM)"""
    # scipy sparse matrices and NumPy arrays both support .astype
    return X.astype(np.float32)


# =========================
# 1️⃣ Data Preparation & EDA
# =========================
print("=" * 80)
print("🚀 STARTING COMPREHENSIVE VECTORIZER + MODEL EXPERIMENT")
print("=" * 80)

X = final_df['final_text'].fillna('')
y = final_df[['anger', 'fear', 'joy', 'sadness', 'surprise']]
emotions = y.columns.tolist()

# Log dataset info
wandb.log({
    "dataset/total_samples": len(X),
    "dataset/num_emotions": len(emotions),
    "dataset/feature_name": "final_text"
})

# Analyze and log label distribution
print("\n📊 Analyzing Label Distribution...")
plot_label_distribution(y, "Training Data - Emotion Distribution", "train_label_distribution.png")

# Log label statistics
label_stats = {}
for emotion in emotions:
    label_stats[f"dataset/{emotion}_count"] = int(y[emotion].sum())
    label_stats[f"dataset/{emotion}_percentage"] = float(y[emotion].sum() / len(y) * 100)

wandb.log(label_stats)

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=EXPERIMENT_CONFIG['test_size'],
    random_state=EXPERIMENT_CONFIG['random_state']
)

print(f"✅ Train set size: {len(X_train)} | Validation set size: {len(X_val)}")

# =========================
# 2️⃣ Prepare Embedding Models
# =========================
print("\n" + "=" * 80)
print("🌐 PREPARING EMBEDDING MODELS")
print("=" * 80)

embedding_models = {}

# Train Word2Vec
try:
    word2vec_model = train_word2vec(
        X_train,
        embedding_dim=EXPERIMENT_CONFIG['embedding_dim'],
        window=EXPERIMENT_CONFIG['word2vec_window'],
        min_count=EXPERIMENT_CONFIG['word2vec_min_count'],
        workers=EXPERIMENT_CONFIG['word2vec_workers'],
        random_state=EXPERIMENT_CONFIG['random_state']
    )
    embedding_models['word2vec'] = word2vec_model
except Exception as e:
    print(f"   ❌ Error training Word2Vec: {e}")

# Load GloVe
try:
    glove_model = load_glove_embeddings(dim=EXPERIMENT_CONFIG['embedding_dim'])
    if glove_model:
        embedding_models['glove'] = glove_model
except Exception as e:
    print(f"   ❌ Error loading GloVe: {e}")

# Train FastText
try:
    fasttext_model = train_fasttext(
        X_train,
        embedding_dim=EXPERIMENT_CONFIG['embedding_dim'],
        window=EXPERIMENT_CONFIG['word2vec_window'],
        min_count=EXPERIMENT_CONFIG['word2vec_min_count'],
        workers=EXPERIMENT_CONFIG['word2vec_workers'],
        random_state=EXPERIMENT_CONFIG['random_state']
    )
    embedding_models['fasttext'] = fasttext_model
except Exception as e:
    print(f"   ❌ Error training FastText: {e}")

print(f"\n✅ Embedding models ready: {list(embedding_models.keys())}")

# =========================
# 3️⃣ Define All Vectorizers
# =========================
VECTORIZERS = {
    'CountVectorizer': 'count',
    'HashingVectorizer': 'hashing',
}

# Add embedding vectorizers if available
if 'word2vec' in embedding_models:
    VECTORIZERS['Word2Vec'] = 'word2vec'
if 'glove' in embedding_models:
    VECTORIZERS['GloVe'] = 'glove'
if 'fasttext' in embedding_models:
    VECTORIZERS['FastText'] = 'fasttext'

print(f"\n📐 Total Vectorizers Available: {len(VECTORIZERS)}")
for vec_name, vec_type in VECTORIZERS.items():
    print(f"   • {vec_name} ({vec_type})")

# =========================
# 4️⃣ Model Configuration
# =========================
CLASSIFIERS = {
    'RandomForest': RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1),
    'LinearSVC': LinearSVC(max_iter=5000, random_state=42),
    'XGBoost': XGBClassifier(
        eval_metric='logloss', n_estimators=500, max_depth=6,
        learning_rate=0.05, random_state=42, n_jobs=-1
    ),
    'LightGBM': LGBMClassifier(
        n_estimators=500, max_depth=8, learning_rate=0.05,
        random_state=42, verbose=-1, n_jobs=-1
    ),
    'CatBoost': CatBoostClassifier(
        iterations=500, depth=7, learning_rate=0.05,
        verbose=0, random_state=42, thread_count=-1
    )
}

# =========================
# 5️⃣ Load or Initialize Checkpoint
# =========================
checkpoint_state = load_checkpoint()
all_results_df = load_results_checkpoint()

if checkpoint_state:
    print("\n🔄 RESUMING FROM CHECKPOINT")
    print(f"   ✅ Completed experiments: {len(all_results_df)}")
    completed_experiments = set(
        (row['Vectorizer'], row['N-gram'], row['Model'])
        for _, row in all_results_df.iterrows()
    )
    experiment_counter = len(all_results_df)
else:
    print("\n🆕 STARTING NEW EXPERIMENT")
    completed_experiments = set()
    experiment_counter = 0
    all_results_df = pd.DataFrame()

# =========================
# 6️⃣ Comprehensive Training Loop
# =========================
print("\n" + "=" * 80)
print("🤖 TRAINING ALL VECTORIZER + MODEL COMBINATIONS")
print("=" * 80)

all_results = []

# Calculate total experiments
# For traditional vectorizers: use n-gram ranges
# For embedding vectorizers: no n-gram (just one config per embedding)
traditional_vectorizers = ['tfidf', 'count', 'hashing']
embedding_vectorizers = ['word2vec', 'glove', 'fasttext']

traditional_count = sum(1 for v in VECTORIZERS.values() if v in traditional_vectorizers)
embedding_count = sum(1 for v in VECTORIZERS.values() if v in embedding_vectorizers)

total_experiments = (
    traditional_count * len(VECTORIZER_CONFIGS['ngram_ranges']) * len(CLASSIFIERS) +
    embedding_count * len(CLASSIFIERS)
)

print(f"\n📊 Total Experiments to Run: {total_experiments}")
print(f"   - Already Completed: {len(completed_experiments)}")
print(f"   - Remaining: {total_experiments - len(completed_experiments)}")
print(f"   - Vectorizers: {len(VECTORIZERS)}")
print(f"   - Traditional (with n-grams): {traditional_count}")
print(f"   - Embeddings (no n-grams): {embedding_count}")
print(f"   - Models: {len(CLASSIFIERS)}")
print("=" * 80)

# Iterate over all combinations
for vec_name, vec_type in VECTORIZERS.items():

    # Determine if this is an embedding vectorizer
    is_embedding = vec_type in embedding_vectorizers

    # For embeddings, use single configuration (no n-grams)
    # For traditional, use n-gram configurations
    ngram_configs = [(None, None)] if is_embedding else VECTORIZER_CONFIGS['ngram_ranges']

    for ngram_range in ngram_configs:

        # Create vectorizer name
        if is_embedding:
            ngram_str = "embedding"
            full_vec_name = f"{vec_name}"
        else:
            ngram_str = f"ngram_{ngram_range[0]}_{ngram_range[1]}"
            full_vec_name = f"{vec_name}_{ngram_str}"

        # Check if we need to process this vectorizer configuration
        skip_vectorizer = all(
            (vec_name, str(ngram_range), model_name) in completed_experiments
            for model_name in CLASSIFIERS.keys()
        )

        if skip_vectorizer:
            print(f"\n⏭️  SKIPPING: {vec_name} | Config: {ngram_str} (Already completed)")
            continue

        print(f"\n{'='*80}")
        if is_embedding:
            print(f"📐 VECTORIZER: {vec_name} (Embedding)")
        else:
            print(f"📐 VECTORIZER: {vec_name} | N-gram: {ngram_range}")
        print(f"{'='*80}")

        try:
            # Create and fit vectorizer
            if is_embedding:
                vectorizer = get_vectorizer(vec_type, None, None, embedding_models)
            else:
                vectorizer = get_vectorizer(vec_type, ngram_range,
                                          VECTORIZER_CONFIGS['tfidf_max_features'][0],
                                          embedding_models)

            # Transform data
            if is_embedding:
                print(f"   🔄 Transforming texts with {vec_name}...")
                X_train_vec = vectorizer.transform(X_train)
                X_val_vec = vectorizer.transform(X_val)
            else:
                X_train_vec = vectorizer.fit_transform(X_train)
                X_val_vec = vectorizer.transform(X_val)

            # Convert to float32 for compatibility with all models (especially LightGBM)
            X_train_vec = convert_to_float32(X_train_vec)
            X_val_vec = convert_to_float32(X_val_vec)

            # Log vectorizer info
            vec_shape = X_train_vec.shape[1]
            wandb.log({f"{full_vec_name}/feature_dimension": vec_shape})

            print(f"   ✅ Features extracted: {vec_shape}")

        except Exception as e:
            print(f"   ❌ ERROR in vectorization: {e}")
            import traceback
            traceback.print_exc()
            continue

        # Train all models with this vectorizer
        for model_name, clf in CLASSIFIERS.items():

            # Check if this specific experiment was already completed
            if (vec_name, str(ngram_range), model_name) in completed_experiments:
                print(f"\n   ⏭️  SKIPPING: {model_name} (Already completed)")
                continue

            experiment_counter += 1

            print(f"\n   [{experiment_counter}/{total_experiments}] 🔄 Training: {model_name}")

            try:
                # Train model
                model = OneVsRestClassifier(clf, n_jobs=-1)
                model.fit(X_train_vec, y_train)

                # Predictions
                y_pred = model.predict(X_val_vec)

                # Get probabilities
                if hasattr(model, "predict_proba"):
                    y_pred_proba = model.predict_proba(X_val_vec)
                elif hasattr(model, "decision_function"):
                    y_pred_proba = model.decision_function(X_val_vec)
                    # Normalize to [0, 1]
                    from sklearn.preprocessing import MinMaxScaler
                    scaler = MinMaxScaler()
                    y_pred_proba = scaler.fit_transform(y_pred_proba)
                else:
                    y_pred_proba = y_pred

                # Compute metrics
                metrics = compute_advanced_metrics(y_val, y_pred, y_pred_proba, emotions)

                # Log all metrics to W&B
                for metric_name, value in metrics.items():
                    wandb.log({f"{full_vec_name}/{model_name}/{metric_name}": value})

                # Classification report
                class_report = classification_report(
                    y_val, y_pred, target_names=emotions,
                    output_dict=True, zero_division=0
                )

                # Confusion matrix
                plot_combined_confusion_matrix(y_val, y_pred, emotions, model_name, full_vec_name)

                # Store results
                result_entry = {
                    'Vectorizer': vec_name,
                    'N-gram': str(ngram_range),
                    'Model': model_name,
                    'F1_Macro': metrics['f1_macro'],
                    'F1_Micro': metrics['f1_micro'],
                    'F1_Weighted': metrics['f1_weighted'],
                    'Hamming_Loss': metrics['hamming_loss'],
                    'Jaccard_Score': metrics['jaccard_score'],
                    'Subset_Accuracy': metrics['subset_accuracy'],
                    'Full_Name': full_vec_name,
                    'Is_Embedding': is_embedding
                }

                all_results.append(result_entry)

                # Update results dataframe and save checkpoint
                all_results_df = pd.concat([all_results_df, pd.DataFrame([result_entry])], ignore_index=True)
                save_results_checkpoint(all_results_df)

                # Save checkpoint state
                checkpoint_state = {
                    'completed_experiments': completed_experiments.union({(vec_name, str(ngram_range), model_name)}),
                    'experiment_counter': experiment_counter
                }
                save_checkpoint(checkpoint_state)

                # Add to completed set
                completed_experiments.add((vec_name, str(ngram_range), model_name))

                print(f"      ✅ F1-Macro: {metrics['f1_macro']:.4f} | Hamming: {metrics['hamming_loss']:.4f}")

            except Exception as e:
                print(f"      ❌ ERROR training {model_name}: {e}")
                print(f"      📝 Error details: {type(e).__name__}")
                import traceback
                traceback.print_exc()
                # Save checkpoint even on error
                save_results_checkpoint(all_results_df)
                continue

# =========================
# 7️⃣ Comprehensive Results Analysis
# =========================
print("\n" + "=" * 80)
print("📈 COMPREHENSIVE RESULTS ANALYSIS")
print("=" * 80)

results_df = all_results_df.sort_values(by='F1_Macro', ascending=False)

# Save full results
results_path = OUTPUT_DIR / "reports" / "full_experiment_results.csv"
results_df.to_csv(results_path, index=False)
wandb.save(str(results_path))

# Display top 10 results
print("\n🏆 TOP 10 CONFIGURATIONS:")
print(results_df.head(10).to_string(index=False))

if len(results_df) > 0:
    # Best configuration
    best_config = results_df.iloc[0]
    print(f"\n🥇 BEST CONFIGURATION:")
    print(f"   Vectorizer: {best_config['Vectorizer']}")
    print(f"   N-gram: {best_config['N-gram']}")
    print(f"   Model: {best_config['Model']}")
    print(f"   F1-Macro: {best_config['F1_Macro']:.4f}")
    print(f"   Is Embedding: {best_config.get('Is_Embedding', False)}")

    # Log best config
    wandb.log({
        "best/vectorizer": best_config['Vectorizer'],
        "best/ngram": best_config['N-gram'],
        "best/model": best_config['Model'],
        "best/f1_macro": best_config['F1_Macro'],
        "best/is_embedding": best_config.get('Is_Embedding', False)
    })

    # =========================
    # 8️⃣ Vectorizer Comparison
    # =========================
    if len(results_df.groupby('Vectorizer')) > 0:
        print("\n📊 VECTORIZER PERFORMANCE COMPARISON:")
        vectorizer_perf = results_df.groupby('Vectorizer')['F1_Macro'].agg(['mean', 'max', 'std']).round(4)
        print(vectorizer_perf.sort_values('mean', ascending=False))

        # Plot vectorizer comparison
        fig, ax = plt.subplots(figsize=(14, 7))
        vec_comparison = results_df.groupby('Vectorizer')['F1_Macro'].mean().sort_values(ascending=False)
        bars = ax.bar(vec_comparison.index, vec_comparison.values,
                      color=plt.cm.viridis(np.linspace(0, 1, len(vec_comparison))))
        ax.set_xlabel('Vectorizer', fontsize=14, fontweight='bold')
        ax.set_ylabel('Mean F1-Macro Score', fontsize=14, fontweight='bold')
        ax.set_title('Vectorizer Performance Comparison (Averaged Across All Models)',
                     fontsize=16, fontweight='bold', pad=20)
        ax.grid(axis='y', alpha=0.4, linestyle='--')
        plt.xticks(rotation=45, ha='right')

        for bar in bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height,
                    f'{height:.4f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

        plt.tight_layout()
        vec_comp_path = OUTPUT_DIR / "plots" / "vectorizer_comparison.png"
        plt.savefig(vec_comp_path, dpi=150, bbox_inches='tight', facecolor='white')
        wandb.log({"comparison/vectorizer_performance": wandb.Image(str(vec_comp_path))})
        plt.close()

    # =========================
    # 9️⃣ Embedding vs Traditional Comparison
    # =========================
    if 'Is_Embedding' in results_df.columns:
        print("\n📊 EMBEDDING VS TRADITIONAL VECTORIZERS:")
        # Name the groups from their boolean keys so the labels stay correct
        # even when only one of the two groups has results
        embed_names = {False: 'Traditional', True: 'Embedding'}
        embed_comparison = results_df.groupby('Is_Embedding')['F1_Macro'].agg(['mean', 'max', 'std']).round(4)
        embed_comparison.index = [embed_names[bool(k)] for k in embed_comparison.index]
        print(embed_comparison)

        # Plot comparison
        fig, ax = plt.subplots(figsize=(10, 6))
        embed_means = results_df.groupby('Is_Embedding')['F1_Macro'].mean()
        embed_labels = [f"{embed_names[bool(k)]}\nVectorizers" for k in embed_means.index]
        embed_colors = ['#e74c3c' if k else '#3498db' for k in embed_means.index]
        bars = ax.bar(embed_labels, embed_means.values,
                     color=embed_colors, width=0.6)
        ax.set_ylabel('Mean F1-Macro Score', fontsize=14, fontweight='bold')
        ax.set_title('Embedding vs Traditional Vectorizers Performance',
                    fontsize=16, fontweight='bold', pad=20)
        ax.grid(axis='y', alpha=0.4, linestyle='--')

        for bar in bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height,
                   f'{height:.4f}', ha='center', va='bottom', fontsize=13, fontweight='bold')

        plt.tight_layout()
        embed_comp_path = OUTPUT_DIR / "plots" / "embedding_vs_traditional.png"
        plt.savefig(embed_comp_path, dpi=150, bbox_inches='tight', facecolor='white')
        wandb.log({"comparison/embedding_vs_traditional": wandb.Image(str(embed_comp_path))})
        plt.close()

    # =========================
    # 🔟 N-gram Range Comparison (Traditional only)
    # =========================
    traditional_results = results_df[~results_df['Is_Embedding'].astype(bool)]
    if len(traditional_results) > 0:
        print("\n📊 N-GRAM RANGE PERFORMANCE (Traditional Vectorizers):")
        ngram_perf = traditional_results.groupby('N-gram')['F1_Macro'].agg(['mean', 'max', 'std']).round(4)
        print(ngram_perf.sort_values('mean', ascending=False))

        # Plot n-gram comparison
        fig, ax = plt.subplots(figsize=(10, 6))
        ngram_comparison = traditional_results.groupby('N-gram')['F1_Macro'].mean().sort_values(ascending=False)
        bars = ax.bar(range(len(ngram_comparison)), ngram_comparison.values,
                     color=plt.cm.plasma(np.linspace(0, 1, len(ngram_comparison))))
        ax.set_xticks(range(len(ngram_comparison)))
        ax.set_xticklabels(ngram_comparison.index, rotation=0)
        ax.set_xlabel('N-gram Range', fontsize=14, fontweight='bold')
        ax.set_ylabel('Mean F1-Macro Score', fontsize=14, fontweight='bold')
        ax.set_title('N-gram Range Performance Comparison', fontsize=16, fontweight='bold', pad=20)
        ax.grid(axis='y', alpha=0.4, linestyle='--')

        for i, v in enumerate(ngram_comparison.values):
            ax.text(i, v, f'{v:.4f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

        plt.tight_layout()
        ngram_comp_path = OUTPUT_DIR / "plots" / "ngram_comparison.png"
        plt.savefig(ngram_comp_path, dpi=150, bbox_inches='tight', facecolor='white')
        wandb.log({"comparison/ngram_performance": wandb.Image(str(ngram_comp_path))})
        plt.close()

    # =========================
    # 1️⃣1️⃣ Model Performance Across Vectorizers
    # =========================
    print("\n📊 MODEL PERFORMANCE ACROSS VECTORIZERS:")
    model_perf = results_df.groupby('Model')['F1_Macro'].agg(['mean', 'max', 'std']).round(4)
    print(model_perf.sort_values('mean', ascending=False))

    # Heatmap: Models vs Vectorizers
    pivot_table = results_df.pivot_table(
        values='F1_Macro',
        index='Model',
        columns='Vectorizer',
        aggfunc='mean'
    )

    fig, ax = plt.subplots(figsize=(14, 10))
    sns.heatmap(pivot_table, annot=True, fmt='.4f', cmap='YlOrRd',
                cbar_kws={'label': 'F1-Macro Score'}, ax=ax,
                linewidths=1, linecolor='white')
    ax.set_title('Model Performance Heatmap (Across Vectorizers)',
                 fontsize=18, fontweight='bold', pad=20)
    ax.set_xlabel('Vectorizer', fontsize=14, fontweight='bold')
    ax.set_ylabel('Model', fontsize=14, fontweight='bold')
    plt.xticks(rotation=45, ha='right')

    plt.tight_layout()
    heatmap_path = OUTPUT_DIR / "plots" / "model_vectorizer_heatmap.png"
    plt.savefig(heatmap_path, dpi=150, bbox_inches='tight', facecolor='white')
    wandb.log({"comparison/model_vectorizer_heatmap": wandb.Image(str(heatmap_path))})
    plt.close()

    # =========================
    # 1️⃣2️⃣ Generate Best Model Submission
    # =========================
    print(f"\n🔄 Retraining best model on full dataset...")

    best_vec_name = best_config['Vectorizer']
    best_vec_type = VECTORIZERS[best_vec_name]
    best_is_embedding = best_config.get('Is_Embedding', False)
    best_model_name = best_config['Model']

    try:
        import ast  # local import, like the traceback imports elsewhere in this script

        # Recreate best vectorizer and model
        if best_is_embedding:
            best_vectorizer = get_vectorizer(best_vec_type, None, None, embedding_models)
            print(f"   🔄 Using {best_vec_name} embeddings...")
            X_full_best = best_vectorizer.transform(X)
        else:
            # literal_eval parses the stored "(1, 3)" string without executing arbitrary code
            best_ngram = ast.literal_eval(best_config['N-gram'])
            best_vectorizer = get_vectorizer(best_vec_type, best_ngram,
                                            VECTORIZER_CONFIGS['tfidf_max_features'][0])
            X_full_best = best_vectorizer.fit_transform(X)

        X_full_best = convert_to_float32(X_full_best)

        best_clf = CLASSIFIERS[best_model_name]
        best_model_full = OneVsRestClassifier(best_clf, n_jobs=-1)
        best_model_full.fit(X_full_best, y)

        print("   ✅ Best model retrained on full dataset")

        # Test Set Prediction
        print("\n📝 Generating predictions on test set...")
        clean_test['final_text'] = clean_test['final_text'].fillna('')

        # Embedding and traditional vectorizers share the same transform() call
        X_test_best = best_vectorizer.transform(clean_test['final_text'])

        X_test_best = convert_to_float32(X_test_best)
        y_test_pred = best_model_full.predict(X_test_best)

        # Create submission
        submission = pd.DataFrame(y_test_pred, columns=y.columns)
        # .values assigns by position; avoids NaN ids if clean_test has a non-default index
        submission['id'] = clean_test['id'].values
        submission = submission[['id'] + list(y.columns)]

        submission_path = OUTPUT_DIR / "submission.csv"
        submission.to_csv(submission_path, index=False)
        wandb.save(str(submission_path))

        print(f"   ✅ Submission saved: {submission_path}")

    except Exception as e:
        print(f"   ❌ Error generating submission: {e}")
        import traceback
        traceback.print_exc()

    # =========================
    # 1️⃣3️⃣ Executive Summary Dashboard
    # =========================
    print("\n📊 Creating executive summary dashboard...")

    fig = plt.figure(figsize=(22, 14))
    gs = fig.add_gridspec(3, 3, hspace=0.35, wspace=0.3)

    # 1. Best Configuration Box
    ax1 = fig.add_subplot(gs[0, 0])
    ax1.axis('off')
    summary_text = f"""
BEST CONFIGURATION
━━━━━━━━━━━━━━━━━━━━━━
Vectorizer: {best_config['Vectorizer']}
Type: {'Embedding' if best_is_embedding else 'Traditional'}
N-gram: {best_config['N-gram']}
Model: {best_config['Model']}

PERFORMANCE METRICS
━━━━━━━━━━━━━━━━━━━━━━
F1-Macro: {best_config['F1_Macro']:.4f}
F1-Micro: {best_config['F1_Micro']:.4f}
F1-Weighted: {best_config['F1_Weighted']:.4f}
Hamming Loss: {best_config['Hamming_Loss']:.4f}
Jaccard Score: {best_config['Jaccard_Score']:.4f}
Subset Accuracy: {best_config['Subset_Accuracy']:.4f}
"""
    ax1.text(0.1, 0.5, summary_text, fontsize=10, fontfamily='monospace',
             verticalalignment='center', bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

    # 2. Top 5 Configurations
    ax2 = fig.add_subplot(gs[0, 1:])
    top5 = results_df.head(5)
    ax2.axis('tight')
    ax2.axis('off')
    table_data = []
    for _, row in top5.iterrows():
        table_data.append([
            row['Vectorizer'],
            str(row['N-gram'])[:15],  # str() guards against None/NaN on embedding rows
            row['Model'],
            f"{row['F1_Macro']:.4f}"
        ])
    table = ax2.table(cellText=table_data,
                      colLabels=['Vectorizer', 'N-gram', 'Model', 'F1-Macro'],
                      cellLoc='center',
                      loc='center',
                      colWidths=[0.25, 0.2, 0.3, 0.15])
    table.auto_set_font_size(False)
    table.set_fontsize(9)
    table.scale(1, 2)
    for i in range(len(top5) + 1):
        if i == 0:
            for j in range(4):
                table[(i, j)].set_facecolor('#4CAF50')
        else:
            for j in range(4):
                table[(i, j)].set_facecolor('#E8F5E9')
    ax2.set_title('TOP 5 CONFIGURATIONS', fontsize=14, fontweight='bold', pad=20)

    # 3. Vectorizer Performance
    ax3 = fig.add_subplot(gs[1, 0])
    vec_means = results_df.groupby('Vectorizer')['F1_Macro'].mean().sort_values(ascending=True)
    colors = ['#e74c3c' if 'Word2Vec' in idx or 'GloVe' in idx or 'FastText' in idx
              else '#3498db' for idx in vec_means.index]
    ax3.barh(vec_means.index, vec_means.values, color=colors)
    ax3.set_xlabel('Mean F1-Macro', fontsize=11, fontweight='bold')
    ax3.set_title('Vectorizer Performance', fontsize=12, fontweight='bold')
    ax3.grid(axis='x', alpha=0.3)
    for i, v in enumerate(vec_means.values):
        ax3.text(v, i, f' {v:.4f}', va='center', fontweight='bold', fontsize=9)

    # 4. Model Performance
    ax4 = fig.add_subplot(gs[1, 1])
    model_means = results_df.groupby('Model')['F1_Macro'].mean().sort_values(ascending=False)
    ax4.bar(range(len(model_means)), model_means.values,
            color=plt.cm.Set3(np.linspace(0, 1, len(model_means))))
    ax4.set_xticks(range(len(model_means)))
    ax4.set_xticklabels(model_means.index, rotation=45, ha='right', fontsize=9)
    ax4.set_ylabel('Mean F1-Macro', fontsize=11, fontweight='bold')
    ax4.set_title('Model Performance', fontsize=12, fontweight='bold')
    ax4.grid(axis='y', alpha=0.3)
    for i, v in enumerate(model_means.values):
        ax4.text(i, v, f'{v:.3f}', ha='center', va='bottom', fontsize=8, fontweight='bold')

    # 5. Embedding vs Traditional
    ax5 = fig.add_subplot(gs[1, 2])
    if 'Is_Embedding' in results_df.columns:
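        embed_means = results_df.groupby('Is_Embedding')['F1_Macro'].mean()
        # Derive labels and colors from the groups that are actually present
        labels = ['Embedding' if k else 'Traditional' for k in embed_means.index]
        bar_colors = ['#e74c3c' if k else '#3498db' for k in embed_means.index]
        ax5.bar(labels, embed_means.values, color=bar_colors, width=0.5)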
        ax5.set_ylabel('Mean F1-Macro', fontsize=11, fontweight='bold')
        ax5.set_title('Embedding vs Traditional', fontsize=12, fontweight='bold')
        ax5.grid(axis='y', alpha=0.3)
        for i, v in enumerate(embed_means.values):
            ax5.text(i, v, f'{v:.4f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

    # 6. F1-Macro Distribution
    ax6 = fig.add_subplot(gs[2, 0])
    ax6.hist(results_df['F1_Macro'], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
    ax6.axvline(best_config['F1_Macro'], color='red', linestyle='--', linewidth=2,
                label=f"Best: {best_config['F1_Macro']:.4f}")
    ax6.set_xlabel('F1-Macro Score', fontsize=11, fontweight='bold')
    ax6.set_ylabel('Frequency', fontsize=11, fontweight='bold')
    ax6.set_title('F1-Macro Distribution', fontsize=12, fontweight='bold')
    ax6.legend(fontsize=9)
    ax6.grid(alpha=0.3)

    # 7. Hamming Loss Distribution
    ax7 = fig.add_subplot(gs[2, 1])
    ax7.hist(results_df['Hamming_Loss'], bins=30, color='salmon', edgecolor='black', alpha=0.7)
    ax7.axvline(best_config['Hamming_Loss'], color='green', linestyle='--', linewidth=2,
                label=f"Best: {best_config['Hamming_Loss']:.4f}")
    ax7.set_xlabel('Hamming Loss', fontsize=11, fontweight='bold')
    ax7.set_ylabel('Frequency', fontsize=11, fontweight='bold')
    ax7.set_title('Hamming Loss Distribution', fontsize=12, fontweight='bold')
    ax7.legend(fontsize=9)
    ax7.grid(alpha=0.3)

    # 8. Experiment Statistics
    ax8 = fig.add_subplot(gs[2, 2])
    ax8.axis('off')
    stats_text = f"""
EXPERIMENT STATISTICS
━━━━━━━━━━━━━━━━━━━━━━
Total Experiments: {len(results_df)}
Vectorizers: {len(VECTORIZERS)}
  - Traditional: {traditional_count}
  - Embeddings: {embedding_count}
Models: {len(CLASSIFIERS)}

BEST SCORES
━━━━━━━━━━━━━━━━━━━━━━
Highest F1-Macro: {results_df['F1_Macro'].max():.4f}
Lowest Hamming Loss: {results_df['Hamming_Loss'].min():.4f}
Highest Jaccard: {results_df['Jaccard_Score'].max():.4f}

AVERAGE SCORES
━━━━━━━━━━━━━━━━━━━━━━
Mean F1-Macro: {results_df['F1_Macro'].mean():.4f}
Std F1-Macro: {results_df['F1_Macro'].std():.4f}
"""
    ax8.text(0.1, 0.5, stats_text, fontsize=9, fontfamily='monospace',
             verticalalignment='center', bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

    plt.suptitle('🚀 EXPERIMENT EXECUTIVE SUMMARY DASHBOARD\nAll Vectorizers + Embeddings',
                 fontsize=18, fontweight='bold', y=0.98)

    dashboard_path = OUTPUT_DIR / "plots" / "executive_summary_dashboard.png"
    plt.savefig(dashboard_path, dpi=150, bbox_inches='tight', facecolor='white')
    wandb.log({"summary/executive_dashboard": wandb.Image(str(dashboard_path))})
    plt.close()

    print(f"   ✅ Executive dashboard created: {dashboard_path}")

else:
    print("\n⚠️ No results to analyze. All experiments may have failed.")

# =========================
# 1️⃣4️⃣ Clean up checkpoints
# =========================
print("\n🧹 Experiment completed. Checkpoint files are preserved for recovery.")
print(f"   Checkpoint location: {OUTPUT_DIR / 'checkpoints'}")

# =========================
# 🎯 Finish Experiment
# =========================
print("\n" + "=" * 80)
print("✅ COMPREHENSIVE EXPERIMENT COMPLETED SUCCESSFULLY!")
print("=" * 80)
if len(results_df) > 0:
    print(f"\n📊 FINAL RESULTS:")
    print(f"   • Total Experiments Run: {len(results_df)}")
    print(f"   • Best Vectorizer: {best_config['Vectorizer']}")
    print(f"   • Best Type: {'Embedding' if best_is_embedding else 'Traditional'}")
    print(f"   • Best N-gram: {best_config['N-gram']}")
    print(f"   • Best Model: {best_config['Model']}")
    print(f"   • Best F1-Macro: {best_config['F1_Macro']:.4f}")
    print(f"\n🌟 VECTORIZERS TESTED:")
    for vec_name in results_df['Vectorizer'].unique():
        mean_score = results_df[results_df['Vectorizer'] == vec_name]['F1_Macro'].mean()
        print(f"   • {vec_name}: {mean_score:.4f} (avg)")
print(f"\n📁 All outputs saved in: {OUTPUT_DIR}")
print(f"🔗 View comprehensive results in W&B: {wandb.run.get_url()}")
print("=" * 80)

wandb.finish()
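
The resume logic in the cell above skips any `(vectorizer, n-gram, model)` triple already in `completed_experiments` and persists the set after each run. It can be sketched in isolation. The helpers `load_completed` and `mark_completed` and the JSON layout below are hypothetical stand-ins, not the notebook's own `save_checkpoint`, whose implementation is not shown here.

```python
import json
import tempfile
from pathlib import Path


def load_completed(path):
    """Return the set of finished (vectorizer, ngram, model) triples."""
    path = Path(path)
    if not path.exists():
        return set()
    # JSON has no tuple type, so triples are stored as lists
    return {tuple(item) for item in json.loads(path.read_text())}


def mark_completed(path, completed, key):
    """Add one triple and write the whole set back via a temp file."""
    completed.add(key)
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(sorted(completed)))
    # Rename over the old file so a crash mid-write is unlikely to truncate it
    tmp.replace(path)


# Hypothetical experiment grid, used only for illustration
grid = [("CountVectorizer", "(1, 1)", "LinearSVC"),
        ("CountVectorizer", "(1, 1)", "CatBoost")]

ckpt = Path(tempfile.mkdtemp()) / "completed.json"

# First session: finishes only the first experiment, then stops
done = load_completed(ckpt)
mark_completed(ckpt, done, grid[0])

# Second session: reloads the checkpoint and keeps only what is left
done = load_completed(ckpt)
remaining = [key for key in grid if key not in done]
print(remaining)
```

Storing the n-gram range as a string, as the notebook does with `str(ngram_range)`, keeps the key JSON-friendly and makes the membership test independent of how the tuple was originally built.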

0,1
CountVectorizer_ngram_1_1/LinearSVC/anger_auc,0.71266
CountVectorizer_ngram_1_1/LinearSVC/anger_f1,0.30526
CountVectorizer_ngram_1_1/LinearSVC/f1_macro,0.48346
CountVectorizer_ngram_1_1/LinearSVC/f1_micro,0.54941
CountVectorizer_ngram_1_1/LinearSVC/f1_weighted,0.54608
CountVectorizer_ngram_1_1/LinearSVC/fear_auc,0.69605
CountVectorizer_ngram_1_1/LinearSVC/fear_f1,0.69156
CountVectorizer_ngram_1_1/LinearSVC/hamming_loss,0.26106
CountVectorizer_ngram_1_1/LinearSVC/jaccard_score,0.39286
CountVectorizer_ngram_1_1/LinearSVC/joy_auc,0.68938


🚀 STARTING COMPREHENSIVE VECTORIZER + MODEL EXPERIMENT

📊 Analyzing Label Distribution...
✅ Train set size: 3992 | Validation set size: 999

🌐 PREPARING EMBEDDING MODELS

🔄 Training Word2Vec model...
   ✅ Word2Vec trained: vocab_size=2874

🔄 Loading GloVe embeddings (dim=100)...
   🌐 Downloading glove-twitter-100 from gensim...
   ✅ GloVe model saved to cache

🔄 Training FastText model...
   ✅ FastText trained: vocab_size=2874

✅ Embedding models ready: ['word2vec', 'glove', 'fasttext']

📐 Total Vectorizers Available: 5
   • CountVectorizer (count)
   • HashingVectorizer (hashing)
   • Word2Vec (word2vec)
   • GloVe (glove)
   • FastText (fasttext)

🆕 STARTING NEW EXPERIMENT

🤖 TRAINING ALL VECTORIZER + MODEL COMBINATIONS

📊 Total Experiments to Run: 45
   - Already Completed: 0
   - Remaining: 45
   - Vectorizers: 5
   - Traditional (with n-grams): 2
   - Embeddings (no n-grams): 3
   - Models: 5

📐 VECTORIZER: CountVectorizer | 



   ✅ Executive dashboard created: outputs/plots/executive_summary_dashboard.png

🧹 Experiment completed. Checkpoint files are preserved for recovery.
   Checkpoint location: outputs/checkpoints

✅ COMPREHENSIVE EXPERIMENT COMPLETED SUCCESSFULLY!

📊 FINAL RESULTS:
   • Total Experiments Run: 45
   • Best Vectorizer: CountVectorizer
   • Best Type: Traditional
   • Best N-gram: (1, 3)
   • Best Model: RandomForest
   • Best F1-Macro: 0.4867

🌟 VECTORIZERS TESTED:
   • CountVectorizer: 0.4253 (avg)
   • GloVe: 0.4421 (avg)
   • HashingVectorizer: 0.3627 (avg)
   • Word2Vec: 0.2049 (avg)
   • FastText: 0.1894 (avg)

📁 All outputs saved in: outputs
🔗 View comprehensive results in W&B: https://wandb.ai/aiwithajay-indian-institute-of-technology-madras/23f3003030-t32025/runs/f89a5igy
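
The results table stores each n-gram range as a string such as `(1, 3)` (via `str(ngram_range)`), so it has to be parsed back into a tuple before the best vectorizer can be rebuilt. A minimal sketch using the standard library's `ast.literal_eval`, which accepts only Python literals. The helper name `parse_ngram` and its `None` handling for embedding rows are assumptions made for illustration.

```python
import ast


def parse_ngram(text):
    """Turn a stored string like '(1, 3)' back into a tuple of two ints."""
    value = ast.literal_eval(text)  # evaluates literals only, never function calls
    if value is None:
        return None  # assumed convention: embedding rows carry no n-gram range
    if (not isinstance(value, tuple) or len(value) != 2
            or not all(isinstance(n, int) for n in value)):
        raise ValueError(f"not an n-gram range: {text!r}")
    return value


print(parse_ngram("(1, 3)"))
print(parse_ngram("None"))

# A string that is not a plain literal is rejected instead of executed
try:
    parse_ngram("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", type(err).__name__)
```

Unlike `eval`, this cannot run code that ends up in a checkpoint CSV, which matters when results files are shared or edited by hand.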


0,1
CountVectorizer_ngram_1_1/CatBoost/anger_auc,0.75741
CountVectorizer_ngram_1_1/CatBoost/anger_f1,0.224
CountVectorizer_ngram_1_1/CatBoost/f1_macro,0.41234
CountVectorizer_ngram_1_1/CatBoost/f1_micro,0.53977
CountVectorizer_ngram_1_1/CatBoost/f1_weighted,0.49461
CountVectorizer_ngram_1_1/CatBoost/fear_auc,0.6893
CountVectorizer_ngram_1_1/CatBoost/fear_f1,0.71621
CountVectorizer_ngram_1_1/CatBoost/hamming_loss,0.23283
CountVectorizer_ngram_1_1/CatBoost/jaccard_score,0.38639
CountVectorizer_ngram_1_1/CatBoost/joy_auc,0.69843
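
The `*_auc` values above depend only on how the scores rank the validation samples. For classifiers without `predict_proba` (such as LinearSVC), the training loop min-max scales the `decision_function` output per label; a sigmoid is a common alternative. The sketch below uses made-up margins to show that both rescalings keep the ranking, so either choice leaves a rank-based metric like ROC AUC unchanged. The rescaled values themselves differ and should not be read as calibrated probabilities.

```python
import numpy as np

# Made-up decision_function margins for one label (illustrative values only)
margins = np.array([-2.0, -0.5, 0.3, 1.2, 3.0])

# Min-max scaling: depends on the extremes of this particular batch
minmax = (margins - margins.min()) / (margins.max() - margins.min())

# Sigmoid squashing: depends only on each margin, and 0.0 maps to 0.5
sigmoid = 1.0 / (1.0 + np.exp(-margins))

# Both maps are strictly increasing, so the sample ordering is identical
same_order = np.array_equal(np.argsort(minmax), np.argsort(sigmoid))
print(same_order)
print(np.round(minmax, 3))
print(np.round(sigmoid, 3))
```

One practical difference: min-max scaling fitted on the validation scores changes whenever the batch changes, while the sigmoid gives the same value for the same margin every time.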


In [8]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, hamming_loss, jaccard_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from lightgbm import LGBMClassifier
from gensim import downloader as api
import wandb

# =========================
# 1️⃣ Initialize W&B
# =========================
# wandb.init(project="23f3003030-t32025", name="glove200_2models", reinit=True)

# =========================
# 2️⃣ Load Data
# =========================




Loading glove-twitter-200 embeddings...
GloVe model loaded!

Training LinearSVC...
LinearSVC F1-Macro: 0.5067 | Hamming Loss: 0.2186 | Jaccard: 0.4312

Training LightGBM...
LightGBM F1-Macro: 0.4734 | Hamming Loss: 0.2136 | Jaccard: 0.4184

Model comparison:
        Model  F1_Macro
0  LinearSVC  0.506681
1   LightGBM  0.473423
Submission saved as submission.csv


0,1
LightGBM_F1_Macro,0.47342
LightGBM_Hamming_Loss,0.21361
LightGBM_Jaccard,0.41842
LinearSVC_F1_Macro,0.50668
LinearSVC_Hamming_Loss,0.21862
LinearSVC_Jaccard,0.43121
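
The three summary metrics logged above can be reproduced by hand on a tiny multi-label example. This is a sketch with invented labels, written in plain NumPy to expose the definitions. It is meant to mirror scikit-learn's `f1_score(average='macro')`, `hamming_loss`, and `jaccard_score(average='samples')`; treating an empty union or an empty F1 denominator as 0 is an assumption about the default zero-division handling.

```python
import numpy as np

# Invented data: 4 samples, 3 labels
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Hamming loss: fraction of individual label slots that are wrong
hamming = np.mean(y_true != y_pred)

# Sample-averaged Jaccard: |intersection| / |union| per row, then the mean
inter = np.logical_and(y_true, y_pred).sum(axis=1)
union = np.logical_or(y_true, y_pred).sum(axis=1)
jaccard = np.mean(np.where(union > 0, inter / np.maximum(union, 1), 0.0))

# Macro F1: F1 per label column, then the unweighted mean over labels
tp = np.logical_and(y_true == 1, y_pred == 1).sum(axis=0)
fp = np.logical_and(y_true == 0, y_pred == 1).sum(axis=0)
fn = np.logical_and(y_true == 1, y_pred == 0).sum(axis=0)
denom = 2 * tp + fp + fn
f1_per_label = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
f1_macro = f1_per_label.mean()

print(f"Hamming loss: {hamming:.4f}")
print(f"Jaccard (samples): {jaccard:.4f}")
print(f"F1-macro: {f1_macro:.4f}")
```

Macro F1 weights every emotion equally, so a single poorly predicted label lowers it more than it lowers Hamming loss, which counts label slots regardless of which label they belong to.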


In [None]:
X = final_df['final_text'].fillna('')
y = final_df[['anger','fear','joy','sadness','surprise']]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# =========================
# 3️⃣ Load GloVe Twitter 200d embeddings
# =========================
dim = 200
glove_model_name = f'glove-twitter-{dim}'
print(f"Loading {glove_model_name} embeddings...")
glove_model = api.load(glove_model_name)
print("GloVe model loaded!")

# =========================
# 4️⃣ Convert text to vectors
# =========================
def text_to_vector(text, model, dim=200):
    words = text.split()
    vectors = [model[word] for word in words if word in model]
    if len(vectors) == 0:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

X_train_vec = np.vstack([text_to_vector(t, glove_model, dim) for t in X_train])
X_val_vec   = np.vstack([text_to_vector(t, glove_model, dim) for t in X_val])

# =========================
# 5️⃣ Define classifiers
# =========================
classifiers = {
    'LinearSVC': LinearSVC(max_iter=5000, random_state=42),
    'LightGBM': LGBMClassifier(n_estimators=500, max_depth=8, learning_rate=0.05,
                               random_state=42, verbose=-1, n_jobs=-1)
}

# =========================
# 6️⃣ Train & log metrics
# =========================
results = []

for name, clf in classifiers.items():
    print(f"\nTraining {name}...")
    model = OneVsRestClassifier(clf)
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_val_vec)

    f1_macro = f1_score(y_val, y_pred, average='macro')
    h_loss = hamming_loss(y_val, y_pred)
    jaccard = jaccard_score(y_val, y_pred, average='samples')

    print(f"{name} F1-Macro: {f1_macro:.4f} | Hamming Loss: {h_loss:.4f} | Jaccard: {jaccard:.4f}")

    # Classification report
    class_report = classification_report(y_val, y_pred, target_names=y.columns, output_dict=True)

    # Log metrics to W&B
    wandb.log({
        f"{name}_F1_Macro": f1_macro,
        f"{name}_Hamming_Loss": h_loss,
        f"{name}_Jaccard": jaccard,
        f"{name}_Classification_Report": class_report
    })

    results.append({'Model': name, 'F1_Macro': f1_macro})

# =========================
# 7️⃣ Summary
# =========================
results_df = pd.DataFrame(results).sort_values(by='F1_Macro', ascending=False)
print("\nModel comparison:\n", results_df)
wandb.log({"Results_Table": results_df})

# =========================
# 8️⃣ Generate submission
# =========================
clean_test['final_text'] = clean_test['final_text'].fillna('')
X_test_vec = np.vstack([text_to_vector(t, glove_model, dim) for t in clean_test['final_text']])

best_model_name = results_df.iloc[0]['Model']
best_model = OneVsRestClassifier(classifiers[best_model_name])
best_model.fit(np.vstack([X_train_vec, X_val_vec]), pd.concat([y_train, y_val]))
y_test_pred = best_model.predict(X_test_vec)

submission = pd.DataFrame(y_test_pred, columns=y.columns)
# .values assigns by position; avoids NaN ids if clean_test has a non-default index
submission['id'] = clean_test['id'].values
submission = submission[['id'] + list(y.columns)]
submission.to_csv('submission.csv', index=False)
print("Submission saved as submission.csv")
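
The `text_to_vector` helper in the cell above represents a text as the average of its word vectors, skipping out-of-vocabulary words and falling back to a zero vector when no word is known. A small sketch with a hand-made 3-dimensional embedding table makes the behaviour concrete. The `toy_vectors` dict and its values are invented stand-ins for the gensim GloVe model.

```python
import numpy as np

# Invented embeddings; the GloVe model used above has 200 dimensions per word
toy_vectors = {
    "happy": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}


def text_to_vector(text, model, dim):
    """Average the vectors of known words; return zeros if no word is known."""
    vectors = [model[word] for word in text.split() if word in model]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


known = text_to_vector("happy day", toy_vectors, 3)
with_oov = text_to_vector("happy zzz day", toy_vectors, 3)
all_oov = text_to_vector("zzz qqq", toy_vectors, 3)

print(known)
print(with_oov)
print(all_oov)
```

Because averaging discards word order and treats unknown tokens as absent, two texts that differ only in out-of-vocabulary words receive the same vector, and a text made up entirely of unknown words is indistinguishable from an empty one.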


In [9]:
best_model

In [10]:
submission

Unnamed: 0,id,anger,fear,joy,sadness,surprise
0,0,0,1,0,0,0
1,1,0,0,0,0,0
2,2,0,0,0,0,0
3,3,0,1,0,0,0
4,4,0,1,0,0,1
...,...,...,...,...,...,...
1702,1702,0,1,0,0,0
1703,1703,0,0,0,0,0
1704,1704,0,1,0,0,0
1705,1705,0,0,0,1,0


In [11]:
final_test

NameError: name 'final_test' is not defined