This experiment finalizes the model by applying a Stacking Classifier to the optimized feature space. Class imbalance is now explicitly handled using **Random Undersampling** on the training data.

### Optimized Feature Configuration:
* **Vectorization:** TF-IDF
* **N-gram Range:** **Unigrams + Bigrams** `(1, 2)`
* **Max Features:** **1000**
* **Imbalance Handling:** **Random Undersampling**

## 1. Setup and Imports

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, accuracy_score
from imblearn.under_sampling import RandomUnderSampler # NEW IMPORT
import mlflow
import mlflow.sklearn
import numpy as np

## 2. MLflow Configuration

In [4]:
# Set the remote tracking server URI
mlflow.set_tracking_uri("http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/")

# Set or create Experiment 6
mlflow.set_experiment("Experiment 6 - Stacking Ensemble (Undersampling)")

2025/11/12 17:17:18 INFO mlflow.tracking.fluent: Experiment with name 'Experiment 6 - Stacking Ensemble (Undersampling)' does not exist. Creating a new experiment.


<Experiment: artifact_location='s3://mlfow-bucket-2025/179050465168919103', creation_time=1762948037691, experiment_id='179050465168919103', last_update_time=1762948037691, lifecycle_stage='active', name='Experiment 6 - Stacking Ensemble (Undersampling)', tags={}>

## 3. Data Preparation, Feature Engineering, and Resampling

In [5]:
# Define fixed feature parameters
N_GRAM_RANGE = (1, 2)  # Unigrams and bigrams
MAX_FEATURES = 1000   # Fixed Max Features
IMBALANCE_METHOD = "undersampling"

# Load the dataset and clean missing values
dataset = pd.read_csv('../data/reddit_preprocessing.csv')
cleaned_dataset = dataset.dropna(subset=['clean_comment'])

# Remap class labels {-1: 0, 0: 1, 1: 2} so the targets are contiguous non-negative integers
X_cleaned = cleaned_dataset['clean_comment']
y_cleaned = cleaned_dataset['category'].map({-1: 0, 0: 1, 1: 2})

# Split data
X_train_cleaned, X_test_cleaned, y_train_cleaned, y_test_cleaned = train_test_split(
    X_cleaned, y_cleaned, test_size=0.2, random_state=42, stratify=y_cleaned
)

# 1. Apply TfidfVectorizer
tfidf_cleaned = TfidfVectorizer(ngram_range=N_GRAM_RANGE, max_features=MAX_FEATURES)
X_train_tfidf_cleaned = tfidf_cleaned.fit_transform(X_train_cleaned)
X_test_tfidf_cleaned = tfidf_cleaned.transform(X_test_cleaned)

# 2. Explicitly handle class imbalance using Random Undersampler
undersampler = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = undersampler.fit_resample(X_train_tfidf_cleaned, y_train_cleaned)

print(f"Original Training data shape: {X_train_tfidf_cleaned.shape}")
print(f"Resampled Training data shape: {X_train_res.shape}")
print(f"Test data shape: {X_test_tfidf_cleaned.shape}")

Original Training data shape: (29329, 1000)
Resampled Training data shape: (19794, 1000)
Test data shape: (7333, 1000)
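
The resampled shape of 19794 rows is consistent with three classes of 6598 rows each, i.e. every class cut down to the size of the smallest one. The sketch below is a hand-rolled NumPy illustration of that idea, not imblearn's actual implementation; the function name `random_undersample_indices` and the toy label counts are assumptions made for the example.

```python
import numpy as np

def random_undersample_indices(labels, seed=42):
    """Return sorted row indices that keep the same number of samples per class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.min()  # size of the smallest class
    kept = [
        # Draw `target` rows of each class without replacement.
        rng.choice(np.flatnonzero(labels == c), size=target, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))

# Hypothetical imbalanced labels: 3, 5, and 7 samples for classes 0, 1, 2.
toy_labels = np.array([0] * 3 + [1] * 5 + [2] * 7)
kept_idx = random_undersample_indices(toy_labels)
balanced_counts = np.bincount(toy_labels[kept_idx])
print(balanced_counts)
```

Only the training rows are resampled in this experiment; the test set keeps its original class distribution, so the reported metrics reflect the real imbalance.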


## 4. Define Stacking Ensemble

In [6]:
# --- 4.1 Base Models (Class weighting removed as data is resampled) ---
lightgbm_model = LGBMClassifier(
    objective='multiclass', num_class=3, metric="multi_logloss", 
    reg_alpha=0.1, reg_lambda=0.1, learning_rate=0.08, 
    n_estimators=360, max_depth=20, random_state=42
)

linearsvc_model = LinearSVC(
    C=1.0, dual='auto', max_iter=1000, random_state=42
)

mnb_model = MultinomialNB(alpha=0.1) 

# --- 4.2 Meta-learner (Logistic Regression) ---
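logreg_meta_learner = LogisticRegression(
    # lbfgs already fits a multinomial model for multiclass targets; the explicit
    # multi_class argument is deprecated as of scikit-learn 1.5
    max_iter=2000, solver='lbfgs', random_state=42
)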

# --- 4.3 Create StackingClassifier ---
stacking_model = StackingClassifier(
    estimators=[
        ('lightgbm', lightgbm_model),
        ('linear_svc', linearsvc_model),
        ('multinomial_nb', mnb_model)
    ],
    final_estimator=logreg_meta_learner,
    cv=5, 
    n_jobs=-1 
)
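
As a rough illustration of what the meta-learner receives, the sketch below builds a smaller stack on random data and inspects the stacked features. It is an assumption-laden toy: LightGBM is left out to keep the example light, the data is synthetic noise, and the variable names are hypothetical. With the default `stack_method='auto'`, scikit-learn uses `predict_proba` where available and falls back to `decision_function` otherwise, which is what `LinearSVC` requires.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Synthetic non-negative features (MultinomialNB requires them) and 3 balanced classes.
rng = np.random.default_rng(0)
X_toy = rng.random((90, 5))
y_toy = np.repeat([0, 1, 2], 30)

toy_stack = StackingClassifier(
    estimators=[
        ("linear_svc", LinearSVC(random_state=42)),
        ("multinomial_nb", MultinomialNB(alpha=0.1)),
    ],
    final_estimator=LogisticRegression(max_iter=2000),
    cv=5,  # meta-learner is trained on out-of-fold base predictions
)
toy_stack.fit(X_toy, y_toy)

# Each base model contributes one column per class to the meta-feature matrix.
meta_features = toy_stack.transform(X_toy)
print(toy_stack.stack_method_)
print(meta_features.shape)
```

In the full experiment, three base models over three classes would by the same logic give the logistic regression nine input columns.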

## 5. Train and Log to MLflow

In [7]:
with mlflow.start_run() as run:
    # Log run details and fixed parameters
    mlflow.set_tag("mlflow.runName", "Final_Stacking_Ensemble_Undersample")
    mlflow.set_tag("experiment_type", "final_model_stacking_undersample")
    mlflow.log_param("vectorizer_type", "TF-IDF")
    mlflow.log_param("ngram_range", str(N_GRAM_RANGE))
    mlflow.log_param("max_features", MAX_FEATURES)
    mlflow.log_param("base_learners", "LGBM, LinearSVC, MNB")
    mlflow.log_param("meta_learner", "LogisticRegression")
    mlflow.log_param("imbalance_handling", IMBALANCE_METHOD)
    mlflow.log_param("stacking_cv", stacking_model.cv)

    # Train the stacking model using RESAMPLED data
    print("Starting Stacking Model Training with Undersampled Data...")
    stacking_model.fit(X_train_res, y_train_res)
    print("Training Complete.")

    # Make predictions on the original test data
    y_pred = stacking_model.predict(X_test_tfidf_cleaned)

    # Calculate and log metrics
    accuracy = accuracy_score(y_test_cleaned, y_pred)
    mlflow.log_metric("accuracy", accuracy)

    classification_rep = classification_report(y_test_cleaned, y_pred, output_dict=True)
    for label, metrics in classification_rep.items():
        if isinstance(metrics, dict):
            for metric, value in metrics.items():
                mlflow.log_metric(f"{label}_{metric}", value)

    # Log the final stacking model
    mlflow.sklearn.log_model(stacking_model, "final_stacking_model")
    
    # Display final report
    print("\nClassification Report (Classes 0, 1, 2 correspond to -1, 0, 1):\n")
    print(classification_report(y_test_cleaned, y_pred))

Starting Stacking Model Training with Undersampled Data...
Training Complete.

Classification Report (Classes 0, 1, 2 correspond to -1, 0, 1):

              precision    recall  f1-score   support

           0       0.67      0.67      0.67      1650
           1       0.76      0.92      0.84      2529
           2       0.89      0.74      0.81      3154

    accuracy                           0.79      7333
   macro avg       0.77      0.78      0.77      7333
weighted avg       0.80      0.79      0.79      7333

🏃 View run Final_Stacking_Ensemble_Undersample at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/179050465168919103/runs/c8b1f4c4722d46ffb49eb666d47aeb29
🧪 View experiment at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/179050465168919103
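
The logging loop in the training cell flattens the nested report dictionary into metric names of the form `<label>_<metric>`. The sketch below reproduces that flattening on made-up labels so the resulting names can be inspected without an MLflow server; `flat_metrics` is a hypothetical stand-in for the calls to `mlflow.log_metric`.

```python
from sklearn.metrics import classification_report

# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

report = classification_report(y_true, y_pred, output_dict=True)

flat_metrics = {}
for label, metrics in report.items():
    # 'accuracy' maps to a plain float, so the isinstance check skips it;
    # per-class and average rows map to dicts and are flattened.
    if isinstance(metrics, dict):
        for metric, value in metrics.items():
            flat_metrics[f"{label}_{metric}"] = value

print(sorted(flat_metrics))
```

Because the scalar `accuracy` entry is skipped by the loop, the notebook logs it separately with `accuracy_score`.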


## 6. Conclusion
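The Stacking Ensemble model with explicit Random Undersampling is trained and logged to MLflow under Experiment 6. On the held-out test set it reaches 0.79 accuracy and 0.77 macro F1. Class 0 (original label -1) is the weakest at 0.67 F1, class 1 trades precision (0.76) for high recall (0.92), and class 2 does the opposite (0.89 precision, 0.74 recall). Compare these per-class metrics against the run that used implicit `class_weight='balanced'`.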
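
When comparing runs, it helps to remember how the two summary rows of the report are derived from the per-class rows. The sketch below recomputes them from the rounded values printed in the report for this experiment; since the inputs are already rounded to two decimals, the results are only expected to agree after rounding. The helper names are hypothetical.

```python
# Per-class values copied from the classification report (rounded to 2 decimals).
per_class = {
    0: {"precision": 0.67, "recall": 0.67, "f1": 0.67, "support": 1650},
    1: {"precision": 0.76, "recall": 0.92, "f1": 0.84, "support": 2529},
    2: {"precision": 0.89, "recall": 0.74, "f1": 0.81, "support": 3154},
}
total_support = sum(row["support"] for row in per_class.values())

def macro_average(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in per_class.values()) / len(per_class)

def weighted_average(metric):
    """Support-weighted mean: larger classes count more."""
    return sum(row[metric] * row["support"] for row in per_class.values()) / total_support

for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_average(metric), 2), round(weighted_average(metric), 2))
```

The macro average is the more informative number for an imbalance comparison, because the weaker class 0 pulls it down as much as the larger classes pull it up.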