# Logistic Regression for Product Title Classification
## Comparative Analysis with Multiple Word Embeddings

**Author:** David CYUBAHIRO  
**Date:** February 7, 2026  
**Model:** Logistic Regression  
**Embeddings:** TF-IDF, Word2Vec (Skip-gram), Word2Vec (CBOW), FastText

---

### Objective
This notebook trains and evaluates Logistic Regression classifiers on four text representations: sparse TF-IDF features and three dense word-embedding models (Word2Vec Skip-gram, Word2Vec CBOW, and FastText). The goal is to compare performance across these representations and provide insights for the team's academic report.

## 0. Install Required Packages

Run this cell first if you don't have the required libraries installed.

In [3]:
# Install required packages
!pip install pandas numpy scikit-learn gensim matplotlib seaborn

Defaulting to user installation because normal site-packages is not writeable

## 1. Import Required Libraries

In [4]:
# Standard libraries
import os
import json
import pickle
import warnings
from datetime import datetime

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_fscore_support,
)

# Word Embeddings
import gensim
from gensim.models import Word2Vec, FastText

warnings.filterwarnings('ignore')
plt.style.use('ggplot')

print("All libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Gensim version: {gensim.__version__}")

All libraries imported successfully!
Pandas version: 2.3.1
NumPy version: 2.2.5
Gensim version: 4.4.0


## 2. Configuration and Setup

In [5]:
# Directory paths
RESULTS_BASE = "../results"
EDA_DIR = os.path.join(RESULTS_BASE, "EDA")
OUTPUT_DIR = os.path.join(RESULTS_BASE, "LogisticRegression")
MODELS_DIR = os.path.join(OUTPUT_DIR, "models")
GRAPHS_DIR = os.path.join(OUTPUT_DIR, "visual_graphs")

# Data file paths (try multiple locations)
TRAIN_CSV_PATHS = [
    os.path.join(RESULTS_BASE, "train.csv"),
    os.path.join(EDA_DIR, "train.csv"),
]
TEST_CSV_PATHS = [
    os.path.join(RESULTS_BASE, "test.csv"),
    os.path.join(EDA_DIR, "test.csv"),
]

# Column names
TEXT_COL = "clean_text"
LABEL_COL = "label_id"
CATEGORY_COL = "category_name"

# Random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Model parameters
EMBEDDING_DIM = 100
MIN_WORD_COUNT = 2
WINDOW_SIZE = 5
WORKERS = 4

# Create output directories
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(MODELS_DIR, exist_ok=True)
os.makedirs(GRAPHS_DIR, exist_ok=True)

print("Configuration completed!")
print(f"Output directory: {OUTPUT_DIR}")

Configuration completed!
Output directory: ../results\LogisticRegression


## 3. Load Data

### Option: Download Dataset and Run Preprocessing

**If you don't have the data files yet:**

1. **Download from Kaggle**: https://www.kaggle.com/datasets/asaniczka/product-titles-text-classification
   - Download `titles_to_categories.csv`
   - Place it in `../data/titles_to_categories.csv`

2. **Run preprocessing**:
   ```python
   # Uncomment and run if you need to generate train/test splits
   # !cd ../src && python preprocessing.py
   ```

3. **OR ask your teammate** for `train.csv` and `test.csv` files and place them in `../results/`

After getting the files, restart from the cell below.

In [7]:
# Quick check: do the data files exist in any of the configured locations?
data_files_exist = (
    any(os.path.exists(p) for p in TRAIN_CSV_PATHS) and
    any(os.path.exists(p) for p in TEST_CSV_PATHS)
)

if data_files_exist:
    print("✓ Data files found! You can proceed with the experiments.")
else:
    print("✗ Data files NOT found.")
    print("\nYou need to either:")
    print("1. Ask your teammate for train.csv and test.csv")
    print("2. Download raw data from Kaggle and run preprocessing")
    print("\nSee instructions in the cell above ↑")

✗ Data files NOT found.

You need to either:
1. Ask your teammate for train.csv and test.csv
2. Download raw data from Kaggle and run preprocessing

See instructions in the cell above ↑


In [None]:
# OPTION: Run preprocessing if you have the raw data
# Uncomment the lines below if you've downloaded titles_to_categories.csv to ../data/

# import subprocess
# import sys
# 
# print("Running preprocessing script...")
# result = subprocess.run(
#     [sys.executable, "../src/preprocessing.py"],
#     capture_output=True,
#     text=True
# )
# 
# if result.returncode == 0:
#     print("✓ Preprocessing completed successfully!")
#     print(result.stdout)
# else:
#     print("✗ Preprocessing failed:")
#     print(result.stderr)

In [11]:
def find_data_file(paths):
    """Find the first existing file from a list of paths."""
    for path in paths:
        if os.path.exists(path):
            return path
    return None

# Load training data
train_path = find_data_file(TRAIN_CSV_PATHS)
test_path = find_data_file(TEST_CSV_PATHS)

if train_path is None or test_path is None:
    print("="*70)
    print("⚠️  DATA FILES NOT FOUND")
    print("="*70)
    print("\nThe preprocessed data files (train.csv, test.csv) are missing.")
    print("\n📋 QUICK FIX - Choose ONE option:\n")
    
    print("✅ Option 1 (FASTEST - Recommended):")
    print("   Ask your teammate who ran the GRU experiments for:")
    print("   • train.csv")
    print("   • test.csv")
    print("   Place them in: C:\\Users\\cyuba\\Documents\\LEARN\\ALU\\Product\\Product-text-Classiifcation\\results\\")
    print()
    
    print(" Option 2 (If you have the raw dataset):")
    print("   1. Download from Kaggle: https://www.kaggle.com/datasets/asaniczka/product-titles-text-classification")
    print("   2. Place titles_to_categories.csv in ../data/")
    print("   3. Uncomment and run cell 10 above to generate train/test splits")
    print()
    
    print("After getting the files, restart the kernel and run all cells.")
    print("="*70)
    
    # Set flags so later cells can check
    DATA_FILES_AVAILABLE = False
    train_df = None
    test_df = None
else:
    print(f"✓ Found training data: {train_path}")
    train_df = pd.read_csv(train_path)
    
    print(f"✓ Found test data: {test_path}")
    test_df = pd.read_csv(test_path)
    
    print(f"\n{'='*70}")
    print("✅ DATA LOADED SUCCESSFULLY")
    print("="*70)
    print(f"Train shape: {train_df.shape}")
    print(f"Test shape: {test_df.shape}")
    print(f"Columns: {train_df.columns.tolist()}")
    print(f"Number of classes: {train_df[LABEL_COL].nunique()}")
    
    DATA_FILES_AVAILABLE = True
    
    # Display sample
    print("\nSample data:")
    display(train_df.head())

⚠️  DATA FILES NOT FOUND

The preprocessed data files (train.csv, test.csv) are missing.

📋 QUICK FIX - Choose ONE option:

✅ Option 1 (FASTEST - Recommended):
   Ask your teammate who ran the GRU experiments for:
   • train.csv
   • test.csv
   Place them in: C:\Users\cyuba\Documents\LEARN\ALU\Product\Product-text-Classiifcation\results\

 Option 2 (If you have the raw dataset):
   1. Download from Kaggle: https://www.kaggle.com/datasets/asaniczka/product-titles-text-classification
   2. Place titles_to_categories.csv in ../data/
   3. Uncomment and run cell 10 above to generate train/test splits

After getting the files, restart the kernel and run all cells.


## 4. Data Preparation

In [16]:
# Check if data is available before proceeding
if train_df is None or test_df is None:
    print("⚠️  Cannot proceed - data files are missing!")
    print("Please follow the instructions in the previous cells to get the data.")
    print("After getting train.csv and test.csv, restart the kernel and run all cells.")
else:
    # Extract features and labels
    X_train_text = train_df[TEXT_COL].values
    X_test_text = test_df[TEXT_COL].values
    y_train = train_df[LABEL_COL].values
    y_test = test_df[LABEL_COL].values
    
    print("✓ Data preparation complete!")
    print(f"Training samples: {len(X_train_text):,}")
    print(f"Test samples: {len(X_test_text):,}")
    print(f"Label range: {y_train.min()} to {y_train.max()}")
    
    # Class distribution
    class_dist = train_df[LABEL_COL].value_counts().sort_index()
    print(f"\nClass distribution (first 10):")
    print(class_dist.head(10))

⚠️  Cannot proceed - data files are missing!
Please follow the instructions in the previous cells to get the data.
After getting train.csv and test.csv, restart the kernel and run all cells.


## 5. Helper Functions

In [12]:
def tokenize_text(text):
    """Simple tokenization by splitting on whitespace."""
    return str(text).split()

def save_classification_report(y_true, y_pred, embedding_name):
    """Save detailed classification report to file."""
    report = classification_report(y_true, y_pred, zero_division=0)
    
    report_path = os.path.join(OUTPUT_DIR, f"classification_report_{embedding_name}.txt")
    with open(report_path, 'w') as f:
        f.write(report)
    
    print(f"Saved classification report to: {report_path}")
    return report

def save_model_and_embedder(model, embedder, embedding_name):
    """Save trained model and embedding object."""
    model_path = os.path.join(MODELS_DIR, f"logreg_{embedding_name}.pkl")
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    
    embedder_path = os.path.join(MODELS_DIR, f"embedder_{embedding_name}.pkl")
    with open(embedder_path, 'wb') as f:
        pickle.dump(embedder, f)
    
    print(f"Saved model to: {model_path}")
    print(f"Saved embedder to: {embedder_path}")

def plot_confusion_matrix(y_true, y_pred, embedding_name, top_n=20):
    """Plot confusion matrix for top N classes."""
    # Get top N most frequent classes
    top_classes = pd.Series(y_true).value_counts().head(top_n).index.tolist()
    
    # Filter predictions for top classes
    mask = np.isin(y_true, top_classes)
    y_true_filtered = y_true[mask]
    y_pred_filtered = y_pred[mask]
    
    # Compute confusion matrix
    cm = confusion_matrix(y_true_filtered, y_pred_filtered, labels=top_classes)
    
    # Plot
    plt.figure(figsize=(12, 10))
    sns.heatmap(cm, annot=False, fmt='d', cmap='Blues', 
                xticklabels=top_classes, yticklabels=top_classes)
    plt.title(f'Confusion Matrix - Logistic Regression ({embedding_name})\nTop {top_n} Classes')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    
    save_path = os.path.join(GRAPHS_DIR, f'confusion_matrix_{embedding_name}.png')
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    print(f"Saved confusion matrix to: {save_path}")
    plt.show()

print("Helper functions defined!")

Helper functions defined!


## 6. Experiment 1: TF-IDF + Logistic Regression

### 6.1 Create TF-IDF Features

In [18]:
# Check if data is available
if 'X_train_text' not in globals() or X_train_text is None:
    print("⚠️  ERROR: Training data not available!")
    print("Please run the data loading cells above first.")
    print("Make sure train.csv and test.csv are in ../results/")
else:
    print("="*70)
    print("EXPERIMENT 1: TF-IDF + Logistic Regression")
    print("="*70)
    
    # Create TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer(
        max_features=10000,
        ngram_range=(1, 2),  # unigrams and bigrams
        min_df=2,
        sublinear_tf=True,
    )
    
    # Transform texts
    print("\nCreating TF-IDF features...")
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_text)
    X_test_tfidf = tfidf_vectorizer.transform(X_test_text)
    
    print(f"TF-IDF vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
    print(f"Train shape: {X_train_tfidf.shape}")
    print(f"Test shape: {X_test_tfidf.shape}")
    print(f"Sparsity: {(1.0 - X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1])) * 100:.2f}%")

⚠️  ERROR: Training data not available!
Please run the data loading cells above first.
Make sure train.csv and test.csv are in ../results/
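
To make the `sublinear_tf=True` setting concrete, the sketch below recomputes TF-IDF weights by hand for a toy corpus. It assumes scikit-learn's documented defaults for the options not set above (`smooth_idf=True`, `norm='l2'`) and uses unigrams only; the three titles and the helper name `tfidf_row` are invented for illustration and are not part of the dataset or the notebook.

```python
import math
from collections import Counter

# Toy corpus (hypothetical product titles, not taken from the dataset)
docs = [
    "red cotton shirt",
    "blue cotton shirt shirt",
    "steel kitchen knife",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """TF-IDF weights for one document: sublinear tf, smoothed idf, L2 norm."""
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        sub_tf = 1.0 + math.log(tf)                          # sublinear_tf=True
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smooth_idf=True
        weights[term] = sub_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))   # norm='l2'
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(tokenized[1])
for term, weight in sorted(row.items()):
    print(f"{term:>8s}  {weight:.4f}")
```

With raw counts, the repeated `shirt` would contribute 2 × idf; the logarithm damps this to about 1.69 × idf, so a word repeated in a title does not dominate the feature vector.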


### 6.2 Hyperparameter Tuning with GridSearchCV

In [None]:
# Check if TF-IDF features are available
if 'X_train_tfidf' not in globals() or X_train_tfidf is None:
    print("⚠️  ERROR: TF-IDF features not available!")
    print("Please run the TF-IDF feature creation cell above first.")
    print("Make sure the data files are loaded successfully.")
else:
    # Define parameter grid
    param_grid_tfidf = {
        'C': [0.01, 0.1, 1.0, 10.0],
        'penalty': ['l2'],
        'solver': ['lbfgs'],
        'max_iter': [500],
        'class_weight': ['balanced'],
    }
    
    print("Hyperparameter tuning for TF-IDF...")
    print(f"Parameter grid: {param_grid_tfidf}")
    
    # GridSearchCV
    lr_tfidf_grid = GridSearchCV(
        LogisticRegression(random_state=RANDOM_SEED),
        param_grid_tfidf,
        cv=3,
        scoring='f1_macro',
        n_jobs=-1,
        verbose=2,
    )
    
    lr_tfidf_grid.fit(X_train_tfidf, y_train)
    
    print(f"\nBest parameters: {lr_tfidf_grid.best_params_}")
    print(f"Best CV F1-macro: {lr_tfidf_grid.best_score_:.4f}")
    
    # Get best model
    lr_tfidf = lr_tfidf_grid.best_estimator_
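
The grid above fixes `class_weight='balanced'`. As a rough illustration of what that setting does, the sketch below reproduces the weighting rule documented by scikit-learn, `n_samples / (n_classes * count_of_class)`, on a toy label list. The helper name `balanced_class_weights` and the label counts are assumptions made for this example only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }

# Toy imbalanced labels: 80 / 15 / 5 samples (hypothetical counts)
y_toy = [0] * 80 + [1] * 15 + [2] * 5
weights = balanced_class_weights(y_toy)

for cls, w in weights.items():
    print(f"class {cls}: weight {w:.3f}")
```

The rare class receives a weight of about 6.667 against 0.417 for the majority class, so each class contributes the same total weight to the loss. This is why the setting pairs naturally with the `f1_macro` scoring used in the grid search.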

### 6.3 Evaluation

In [20]:
# Check if model was trained
if 'lr_tfidf' not in globals() or lr_tfidf is None:
    print("⚠️  ERROR: TF-IDF model not trained!")
    print("Please run the hyperparameter tuning cell above first.")
    print("Make sure all previous cells executed successfully.")
else:
    # Predictions
    y_pred_tfidf = lr_tfidf.predict(X_test_tfidf)
    
    # Metrics
    tfidf_accuracy = accuracy_score(y_test, y_pred_tfidf)
    tfidf_macro_f1 = f1_score(y_test, y_pred_tfidf, average='macro')
    tfidf_weighted_f1 = f1_score(y_test, y_pred_tfidf, average='weighted')
    
    print("\n" + "="*70)
    print("TF-IDF Results:")
    print("="*70)
    print(f"Test Accuracy:    {tfidf_accuracy:.4f}")
    print(f"Macro F1:         {tfidf_macro_f1:.4f}")
    print(f"Weighted F1:      {tfidf_weighted_f1:.4f}")
    
    # Save detailed report
    tfidf_report = save_classification_report(y_test, y_pred_tfidf, "tfidf")
    
    # Save model
    save_model_and_embedder(lr_tfidf, tfidf_vectorizer, "tfidf")
    
    # Plot confusion matrix
    plot_confusion_matrix(y_test, y_pred_tfidf, "tfidf")

⚠️  ERROR: TF-IDF model not trained!
Please run the hyperparameter tuning cell above first.
Make sure all previous cells executed successfully.
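
The evaluation cell reports accuracy, macro F1, and weighted F1. The sketch below shows, on invented toy predictions, why these three numbers can diverge sharply when classes are imbalanced. The helper `per_class_f1` is a hypothetical pure-Python stand-in for the scikit-learn metrics and is meant only to expose the arithmetic.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {class: F1} over every label seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

# Toy imbalanced problem: 90 samples of class 0, 10 of class 1
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [0] * 8 + [1] * 2   # the rare class is mostly missed

f1 = per_class_f1(y_true, y_pred)
support = Counter(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1.values()) / len(f1)                            # every class counts equally
weighted_f1 = sum(f1[c] * support[c] for c in f1) / len(y_true)  # weighted by class size

print(f"Accuracy:    {accuracy:.4f}")
print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

Here accuracy is 0.92 and weighted F1 is about 0.895, while macro F1 drops to about 0.645 because the poorly predicted rare class counts as much as the majority class. Macro F1 is therefore the most sensitive of the three to failures on small categories.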


## 7. Experiment 2: Word2Vec Skip-gram + Logistic Regression

### 7.1 Train Word2Vec Skip-gram Model

In [None]:
print("\n" + "="*70)
print("EXPERIMENT 2: Word2Vec Skip-gram + Logistic Regression")
print("="*70)

# Tokenize texts
print("\nTokenizing texts...")
train_sentences = [tokenize_text(text) for text in X_train_text]

# Train Word2Vec Skip-gram
print("Training Word2Vec Skip-gram model...")
w2v_skipgram = Word2Vec(
    sentences=train_sentences,
    vector_size=EMBEDDING_DIM,
    window=WINDOW_SIZE,
    min_count=MIN_WORD_COUNT,
    sg=1,  # Skip-gram
    workers=WORKERS,
    seed=RANDOM_SEED,
    epochs=10,
)

print(f"Vocabulary size: {len(w2v_skipgram.wv)}")
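
The `sg` flag switches between the two Word2Vec training objectives. The sketch below illustrates the difference on a single invented title: Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its whole context. This is a simplification under stated assumptions: the helper `context_windows` is hypothetical, and gensim's real training additionally shrinks the window at random, subsamples frequent words, and uses negative sampling.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window): i]
        right = tokens[i + 1: i + 1 + window]
        yield center, left + right

# Hypothetical product title, already tokenized on whitespace
tokens = "wireless bluetooth noise cancelling headphones".split()
window = 2

# Skip-gram (sg=1): one training pair per (center, single context word)
skipgram_pairs = [
    (center, ctx)
    for center, context in context_windows(tokens, window)
    for ctx in context
]

# CBOW (sg=0): one training example per position, (all context words -> center)
cbow_examples = [
    (context, center)
    for center, context in context_windows(tokens, window)
]

print(f"Skip-gram pairs: {len(skipgram_pairs)}")
print(f"CBOW examples:   {len(cbow_examples)}")
print(skipgram_pairs[:2])
print(cbow_examples[0])
```

Skip-gram produces more, finer-grained training signals per sentence, which is one reason it is often slower to train than CBOW on the same corpus.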

### 7.2 Convert Texts to Vectors

In [None]:
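def text_to_word2vec_vector(text, w2v_model):
    """Convert text to the mean of its in-vocabulary Word2Vec vectors."""
    tokens = tokenize_text(text)
    # Words dropped by min_count are not in the vocabulary and are skipped
    vectors = [w2v_model.wv[word] for word in tokens if word in w2v_model.wv]
    
    if vectors:
        return np.mean(vectors, axis=0)
    # No known words: fall back to a zero vector of the model's own dimension
    return np.zeros(w2v_model.vector_size)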

print("Converting texts to Skip-gram vectors...")
X_train_skipgram = np.array([text_to_word2vec_vector(text, w2v_skipgram) for text in X_train_text])
X_test_skipgram = np.array([text_to_word2vec_vector(text, w2v_skipgram) for text in X_test_text])

print(f"Train shape: {X_train_skipgram.shape}")
print(f"Test shape: {X_test_skipgram.shape}")
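
As a worked example of the averaging step, the sketch below applies the same logic to hand-written 3-dimensional vectors. The vocabulary, the vector values, and the helper `average_vector` are invented for illustration; the notebook itself uses `np.mean(vectors, axis=0)` on the trained model's vectors.

```python
def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical 3-dimensional embeddings for two words
toy_vectors = {
    "red": [1.0, 0.0, 2.0],
    "shirt": [3.0, 4.0, 0.0],
}

doc_vec = average_vector("red shirt".split(), toy_vectors, 3)
doc_vec_oov = average_vector("red unknown shirt".split(), toy_vectors, 3)
doc_vec_reordered = average_vector("shirt red".split(), toy_vectors, 3)
doc_vec_empty = average_vector("unknown".split(), toy_vectors, 3)

print(doc_vec)        # [2.0, 2.0, 1.0]
print(doc_vec_oov)    # unknown word is skipped, so the result is unchanged
print(doc_vec_empty)  # [0.0, 0.0, 0.0]
```

Two properties of this representation matter when interpreting the results: out-of-vocabulary words are silently ignored, and word order is lost, since `"shirt red"` maps to exactly the same vector as `"red shirt"`.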

### 7.3 Train and Evaluate

In [None]:
# Hyperparameter tuning
param_grid_w2v = {
    'C': [0.01, 0.1, 1.0, 10.0],
    'penalty': ['l2'],
    'solver': ['lbfgs'],
    'max_iter': [500],
    'class_weight': ['balanced'],
}

print("Hyperparameter tuning for Skip-gram...")
lr_skipgram_grid = GridSearchCV(
    LogisticRegression(random_state=RANDOM_SEED),
    param_grid_w2v,
    cv=3,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=2,
)

lr_skipgram_grid.fit(X_train_skipgram, y_train)

print(f"\nBest parameters: {lr_skipgram_grid.best_params_}")
print(f"Best CV F1-macro: {lr_skipgram_grid.best_score_:.4f}")

lr_skipgram = lr_skipgram_grid.best_estimator_

# Evaluate
y_pred_skipgram = lr_skipgram.predict(X_test_skipgram)

skipgram_accuracy = accuracy_score(y_test, y_pred_skipgram)
skipgram_macro_f1 = f1_score(y_test, y_pred_skipgram, average='macro')
skipgram_weighted_f1 = f1_score(y_test, y_pred_skipgram, average='weighted')

print("\n" + "="*70)
print("Skip-gram Results:")
print("="*70)
print(f"Test Accuracy:    {skipgram_accuracy:.4f}")
print(f"Macro F1:         {skipgram_macro_f1:.4f}")
print(f"Weighted F1:      {skipgram_weighted_f1:.4f}")

# Save
save_classification_report(y_test, y_pred_skipgram, "skipgram")
save_model_and_embedder(lr_skipgram, w2v_skipgram, "skipgram")
plot_confusion_matrix(y_test, y_pred_skipgram, "skipgram")

## 8. Experiment 3: Word2Vec CBOW + Logistic Regression

### 8.1 Train Word2Vec CBOW Model

In [None]:
print("\n" + "="*70)
print("EXPERIMENT 3: Word2Vec CBOW + Logistic Regression")
print("="*70)

# Train Word2Vec CBOW
print("Training Word2Vec CBOW model...")
w2v_cbow = Word2Vec(
    sentences=train_sentences,
    vector_size=EMBEDDING_DIM,
    window=WINDOW_SIZE,
    min_count=MIN_WORD_COUNT,
    sg=0,  # CBOW
    workers=WORKERS,
    seed=RANDOM_SEED,
    epochs=10,
)

print(f"Vocabulary size: {len(w2v_cbow.wv)}")

### 8.2 Convert Texts and Train

In [None]:
print("Converting texts to CBOW vectors...")
X_train_cbow = np.array([text_to_word2vec_vector(text, w2v_cbow) for text in X_train_text])
X_test_cbow = np.array([text_to_word2vec_vector(text, w2v_cbow) for text in X_test_text])

print(f"Train shape: {X_train_cbow.shape}")
print(f"Test shape: {X_test_cbow.shape}")

# Hyperparameter tuning
print("\nHyperparameter tuning for CBOW...")
lr_cbow_grid = GridSearchCV(
    LogisticRegression(random_state=RANDOM_SEED),
    param_grid_w2v,
    cv=3,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=2,
)

lr_cbow_grid.fit(X_train_cbow, y_train)

print(f"\nBest parameters: {lr_cbow_grid.best_params_}")
print(f"Best CV F1-macro: {lr_cbow_grid.best_score_:.4f}")

lr_cbow = lr_cbow_grid.best_estimator_

# Evaluate
y_pred_cbow = lr_cbow.predict(X_test_cbow)

cbow_accuracy = accuracy_score(y_test, y_pred_cbow)
cbow_macro_f1 = f1_score(y_test, y_pred_cbow, average='macro')
cbow_weighted_f1 = f1_score(y_test, y_pred_cbow, average='weighted')

print("\n" + "="*70)
print("CBOW Results:")
print("="*70)
print(f"Test Accuracy:    {cbow_accuracy:.4f}")
print(f"Macro F1:         {cbow_macro_f1:.4f}")
print(f"Weighted F1:      {cbow_weighted_f1:.4f}")

# Save
save_classification_report(y_test, y_pred_cbow, "cbow")
save_model_and_embedder(lr_cbow, w2v_cbow, "cbow")
plot_confusion_matrix(y_test, y_pred_cbow, "cbow")

## 9. Experiment 4: FastText + Logistic Regression

### 9.1 Train FastText Model

In [None]:
print("\n" + "="*70)
print("EXPERIMENT 4: FastText + Logistic Regression")
print("="*70)

# Train FastText
print("Training FastText model...")
fasttext_model = FastText(
    sentences=train_sentences,
    vector_size=EMBEDDING_DIM,
    window=WINDOW_SIZE,
    min_count=MIN_WORD_COUNT,
    workers=WORKERS,
    seed=RANDOM_SEED,
    epochs=10,
)

print(f"Vocabulary size: {len(fasttext_model.wv)}")
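
FastText differs from Word2Vec in that each word vector is built from character n-grams, which lets related word forms share parameters. The sketch below lists the n-grams for a word using boundary markers `<` and `>` and the 3 to 6 character range that matches gensim's default `min_n` and `max_n`. The helper `char_ngrams` is hypothetical, and the sketch omits the hashing of n-grams into a fixed number of buckets that the real implementation performs.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

short_grams = char_ngrams("usb", 3, 4)
print(short_grams)

# Related word forms share many n-grams, so they share subword parameters
shared = set(char_ngrams("charger")) & set(char_ngrams("chargers"))
print(f"n-grams shared by 'charger' and 'chargers': {len(shared)}")
```

Because product titles often contain misspellings, model numbers, and inflected forms, this subword sharing is the main property that may help FastText on words that Word2Vec would drop under `min_count`.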

### 9.2 Convert Texts and Train

In [None]:
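def text_to_fasttext_vector(text, ft_model):
    """Convert text to the mean of its FastText word vectors."""
    tokens = tokenize_text(text)
    # Note: with character n-grams enabled (the gensim 4 default), `word in ft_model.wv`
    # is always True, because out-of-vocabulary words get vectors composed from their n-grams
    vectors = [ft_model.wv[word] for word in tokens if word in ft_model.wv]
    
    if vectors:
        return np.mean(vectors, axis=0)
    # Empty text: fall back to a zero vector of the model's own dimension
    return np.zeros(ft_model.vector_size)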

print("Converting texts to FastText vectors...")
X_train_fasttext = np.array([text_to_fasttext_vector(text, fasttext_model) for text in X_train_text])
X_test_fasttext = np.array([text_to_fasttext_vector(text, fasttext_model) for text in X_test_text])

print(f"Train shape: {X_train_fasttext.shape}")
print(f"Test shape: {X_test_fasttext.shape}")

# Hyperparameter tuning
print("\nHyperparameter tuning for FastText...")
lr_fasttext_grid = GridSearchCV(
    LogisticRegression(random_state=RANDOM_SEED),
    param_grid_w2v,
    cv=3,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=2,
)

lr_fasttext_grid.fit(X_train_fasttext, y_train)

print(f"\nBest parameters: {lr_fasttext_grid.best_params_}")
print(f"Best CV F1-macro: {lr_fasttext_grid.best_score_:.4f}")

lr_fasttext = lr_fasttext_grid.best_estimator_

# Evaluate
y_pred_fasttext = lr_fasttext.predict(X_test_fasttext)

fasttext_accuracy = accuracy_score(y_test, y_pred_fasttext)
fasttext_macro_f1 = f1_score(y_test, y_pred_fasttext, average='macro')
fasttext_weighted_f1 = f1_score(y_test, y_pred_fasttext, average='weighted')

print("\n" + "="*70)
print("FastText Results:")
print("="*70)
print(f"Test Accuracy:    {fasttext_accuracy:.4f}")
print(f"Macro F1:         {fasttext_macro_f1:.4f}")
print(f"Weighted F1:      {fasttext_weighted_f1:.4f}")

# Save
save_classification_report(y_test, y_pred_fasttext, "fasttext")
save_model_and_embedder(lr_fasttext, fasttext_model, "fasttext")
plot_confusion_matrix(y_test, y_pred_fasttext, "fasttext")

## 10. Comparative Analysis

### 10.1 Summary Table

In [None]:
# Create comparison dataframe
results_comparison = pd.DataFrame({
    'Embedding': ['TF-IDF', 'Skip-gram', 'CBOW', 'FastText'],
    'Test Accuracy': [tfidf_accuracy, skipgram_accuracy, cbow_accuracy, fasttext_accuracy],
    'Macro F1': [tfidf_macro_f1, skipgram_macro_f1, cbow_macro_f1, fasttext_macro_f1],
    'Weighted F1': [tfidf_weighted_f1, skipgram_weighted_f1, cbow_weighted_f1, fasttext_weighted_f1],
})

print("\n" + "="*70)
print("MODEL COMPARISON - LOGISTIC REGRESSION")
print("="*70)
print(results_comparison.to_string(index=False))

# Save to CSV
comparison_path = os.path.join(OUTPUT_DIR, 'model_comparison_results.csv')
results_comparison.to_csv(comparison_path, index=False)
print(f"\nSaved comparison to: {comparison_path}")

### 10.2 Visualization - Performance Comparison

In [None]:
# Create comparison bar plot
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

metrics = ['Test Accuracy', 'Macro F1', 'Weighted F1']
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']

for idx, metric in enumerate(metrics):
    ax = axes[idx]
    bars = ax.bar(results_comparison['Embedding'], results_comparison[metric], color=colors)
    ax.set_title(f'{metric} Comparison', fontsize=12, fontweight='bold')
    ax.set_ylabel(metric, fontsize=10)
    ax.set_xlabel('Embedding Type', fontsize=10)
    ax.set_ylim([0, 1.0])
    ax.grid(axis='y', alpha=0.3)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.4f}',
                ha='center', va='bottom', fontsize=9)

plt.suptitle('Logistic Regression Performance Across Embeddings', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()

save_path = os.path.join(GRAPHS_DIR, 'model_comparison.png')
plt.savefig(save_path, dpi=300, bbox_inches='tight')
print(f"Saved comparison plot to: {save_path}")
plt.show()

### 10.3 Best Model Identification

In [None]:
# Find best performing embedding
best_accuracy_idx = results_comparison['Test Accuracy'].idxmax()
best_macro_f1_idx = results_comparison['Macro F1'].idxmax()
best_weighted_f1_idx = results_comparison['Weighted F1'].idxmax()

print("\n" + "="*70)
print("BEST PERFORMING EMBEDDINGS")
print("="*70)
print(f"Best Accuracy:    {results_comparison.loc[best_accuracy_idx, 'Embedding']} "
      f"({results_comparison.loc[best_accuracy_idx, 'Test Accuracy']:.4f})")
print(f"Best Macro F1:    {results_comparison.loc[best_macro_f1_idx, 'Embedding']} "
      f"({results_comparison.loc[best_macro_f1_idx, 'Macro F1']:.4f})")
print(f"Best Weighted F1: {results_comparison.loc[best_weighted_f1_idx, 'Embedding']} "
      f"({results_comparison.loc[best_weighted_f1_idx, 'Weighted F1']:.4f})")

### 10.4 Per-Class Performance Analysis

In [None]:
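# Get per-class F1 scores for all embeddings
# Pass an explicit label list so every array has one entry per class, in the same order.
# Without it, sklearn uses only the labels present in y_true or y_pred, which can differ per model.
all_labels = np.unique(np.concatenate([y_train, y_test]))

def get_per_class_f1(y_true, y_pred):
    _, _, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=all_labels, average=None, zero_division=0
    )
    return f1

f1_tfidf = get_per_class_f1(y_test, y_pred_tfidf)
f1_skipgram = get_per_class_f1(y_test, y_pred_skipgram)
f1_cbow = get_per_class_f1(y_test, y_pred_cbow)
f1_fasttext = get_per_class_f1(y_test, y_pred_fasttext)

# Create dataframe (Class ID holds the actual label ids, not positional indices)
per_class_df = pd.DataFrame({
    'Class ID': all_labels,
    'TF-IDF F1': f1_tfidf,
    'Skip-gram F1': f1_skipgram,
    'CBOW F1': f1_cbow,
    'FastText F1': f1_fasttext,
})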

print("\nPer-class F1 scores (first 10 classes):")
print(per_class_df.head(10).to_string(index=False))

# Find classes where embeddings differ most
per_class_df['Max Diff'] = per_class_df[['TF-IDF F1', 'Skip-gram F1', 'CBOW F1', 'FastText F1']].max(axis=1) - \
                            per_class_df[['TF-IDF F1', 'Skip-gram F1', 'CBOW F1', 'FastText F1']].min(axis=1)

print("\nClasses with largest performance differences:")
print(per_class_df.nlargest(10, 'Max Diff')[['Class ID', 'TF-IDF F1', 'Skip-gram F1', 'CBOW F1', 'FastText F1', 'Max Diff']].to_string(index=False))

## 11. Key Insights and Conclusions

In [None]:
print("\n" + "="*70)
print("EXPERIMENT SUMMARY")
print("="*70)
print(f"""
Total Training Samples: {len(y_train):,}
Total Test Samples:     {len(y_test):,}
Number of Classes:      {len(np.unique(y_train))}

Embeddings Tested:
1. TF-IDF (10,000 features, unigrams + bigrams)
2. Word2Vec Skip-gram ({EMBEDDING_DIM}d, window={WINDOW_SIZE})
3. Word2Vec CBOW ({EMBEDDING_DIM}d, window={WINDOW_SIZE})
4. FastText ({EMBEDDING_DIM}d, window={WINDOW_SIZE})

Model: Logistic Regression
- Hyperparameter tuning: GridSearchCV (3-fold CV)
- Optimization metric: F1-macro
- Class weighting: Balanced

All results saved to: {OUTPUT_DIR}
""")

# Create final summary JSON
summary = {
    "experiment_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    "model_type": "Logistic Regression",
    "dataset": {
        "train_samples": int(len(y_train)),
        "test_samples": int(len(y_test)),
        "num_classes": int(len(np.unique(y_train))),
    },
    "results": {
        "tfidf": {
            "accuracy": float(tfidf_accuracy),
            "macro_f1": float(tfidf_macro_f1),
            "weighted_f1": float(tfidf_weighted_f1),
        },
        "skipgram": {
            "accuracy": float(skipgram_accuracy),
            "macro_f1": float(skipgram_macro_f1),
            "weighted_f1": float(skipgram_weighted_f1),
        },
        "cbow": {
            "accuracy": float(cbow_accuracy),
            "macro_f1": float(cbow_macro_f1),
            "weighted_f1": float(cbow_weighted_f1),
        },
        "fasttext": {
            "accuracy": float(fasttext_accuracy),
            "macro_f1": float(fasttext_macro_f1),
            "weighted_f1": float(fasttext_weighted_f1),
        },
    },
    "best_embedding": {
        "by_accuracy": results_comparison.loc[best_accuracy_idx, 'Embedding'],
        "by_macro_f1": results_comparison.loc[best_macro_f1_idx, 'Embedding'],
        "by_weighted_f1": results_comparison.loc[best_weighted_f1_idx, 'Embedding'],
    }
}

summary_path = os.path.join(OUTPUT_DIR, 'experiment_summary.json')
with open(summary_path, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"\nSaved experiment summary to: {summary_path}")
print("\n✓ ALL EXPERIMENTS COMPLETED SUCCESSFULLY!")

---

## Next Steps for Team Report

1. **Compare with GRU results** in `results/GRU/`
2. **Analyze why different embeddings perform better/worse** with Logistic Regression vs. GRU
3. **Document hyperparameters** used in each experiment
4. **Create consolidated visualizations** comparing all team members' models
5. **Write insights section** explaining the trade-offs between model complexity and embedding choice

**Key Questions to Address:**
- How does Logistic Regression compare to GRU for this task?
- Which embedding works best with linear models vs. neural models?
- What is the computational cost vs. performance trade-off?
- How does class imbalance affect different model-embedding combinations?