This experiment tests different methods for handling class imbalance, with the goal of improving classification performance on the minority classes (-1, 0). It compares a model trained with built-in class weighting against models trained on resampled data (SMOTE, ADASYN, random undersampling, and SMOTEENN).

Based on prior experiments, we will use the following optimized feature representation:
* **Vectorization:** TF-IDF
* **N-gram Range:** `(1, 2)` (unigrams + bigrams)
* **Max Features:** 1000

## 1. Setup and Dependencies

### 1.1 Import Libraries

In [1]:
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os

## 2. MLflow Configuration

In [2]:
# Set the remote tracking server URI
mlflow.set_tracking_uri("http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/")

# Set or create an experiment
mlflow.set_experiment("Handling Imbalanced Data - Exp 4")

2025/11/15 19:52:25 INFO mlflow.tracking.fluent: Experiment with name 'Handling Imbalanced Data - Exp 4' does not exist. Creating a new experiment.


<Experiment: artifact_location='s3://mlfow-bucket-2025/16', creation_time=1763216545216, experiment_id='16', last_update_time=1763216545216, lifecycle_stage='active', name='Handling Imbalanced Data - Exp 4', tags={}>

## 3. Data Loading and Preparation

In [3]:
# Load the preprocessed data and ensure no NaN values remain in the text column
df = pd.read_csv('../data/reddit_preprocessing.csv').dropna(subset=['clean_comment'])

print(f"Data shape after cleaning: {df.shape}")

Data shape after cleaning: (36662, 2)
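
The shape alone does not show how unevenly the labels are distributed. As a rough sketch of what `class_weight='balanced'` does with that distribution, the snippet below applies the heuristic scikit-learn documents for it, `n_samples / (n_classes * count)`, to a hypothetical label list. The helper name `balanced_class_weights` and the toy labels are assumptions for illustration; on the real data the labels would come from `df['category']`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels: class 1 is the majority, class -1 the rarest
toy_labels = [1] * 6 + [0] * 3 + [-1] * 1
weights = balanced_class_weights(toy_labels)

for cls, weight in sorted(weights.items()):
    print(f"class {cls:>2}: count={toy_labels.count(cls)}, weight={weight:.3f}")
```

Rarer classes receive proportionally larger weights, so their misclassifications cost more during training without changing the training set itself.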


## 4. Experiment Function Definition

In [4]:
def run_imbalanced_experiment(imbalance_method):
    """Runs a model training experiment with a specific imbalance handling method, 
    using fixed TF-IDF Bigram (1, 2) features and max_features=1000."""
    
    ngram_range = (1, 2)  # Bigram setting
    max_features = 1000   # Fixed max_features as specified

    # Step 1: Train-test split before vectorization and resampling
    X_train, X_test, y_train, y_test = train_test_split(df['clean_comment'], df['category'], 
                                                              test_size=0.2, random_state=42, stratify=df['category'])

    # Step 2: Vectorization using TF-IDF
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, max_features=max_features)
    X_train_vec = vectorizer.fit_transform(X_train)  # Fit on training data
    X_test_vec = vectorizer.transform(X_test)      # Transform test data

    # Step 3: Handle class imbalance (only applied to the training set)
    if imbalance_method == 'class_weights':
        # Method 1: Use inherent model class weighting
        class_weight = 'balanced'
        X_train_res, y_train_res = X_train_vec, y_train
    else:
        # Methods 2-5: resampling techniques (using imblearn)
        class_weight = None
        samplers = {
            'oversampling': SMOTE(random_state=42),
            'adasyn': ADASYN(random_state=42),
            'undersampling': RandomUnderSampler(random_state=42),
            'smote_enn': SMOTEENN(random_state=42),
        }

        if imbalance_method in samplers:
            # Resample the training set only; the test set stays untouched
            X_train_res, y_train_res = samplers[imbalance_method].fit_resample(X_train_vec, y_train)
        elif imbalance_method == 'none':
            # Baseline: no imbalance handling
            X_train_res, y_train_res = X_train_vec, y_train
        else:
            # Fail early instead of hitting an undefined sampler later
            raise ValueError(f"Unknown imbalance_method: {imbalance_method!r}")

    # Step 4: MLflow Run
    with mlflow.start_run() as run:
        # Set tags for the experiment and run
        mlflow.set_tag("mlflow.runName", f"Imbalance_{imbalance_method}_RF_TFIDF_Bigrams")
        mlflow.set_tag("experiment_type", "imbalance_handling")
        mlflow.set_tag("model_type", "RandomForestClassifier")

        # Log experiment details
        mlflow.log_param("vectorizer_type", "TF-IDF")
        mlflow.log_param("ngram_range", ngram_range)
        mlflow.log_param("vectorizer_max_features", max_features)
        mlflow.log_param("imbalance_method", imbalance_method)

        # Log Random Forest parameters
        n_estimators = 200
        max_depth = 15
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("class_weight_param", class_weight)
        
        # Initialize and train the model
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42, class_weight=class_weight)
        model.fit(X_train_res, y_train_res)

        # Step 5: Make predictions and log metrics (on un-resampled test data)
        y_pred = model.predict(X_test_vec)

        # Log evaluation metrics
        accuracy = accuracy_score(y_test, y_pred)
        mlflow.log_metric("accuracy", accuracy)

        classification_rep = classification_report(y_test, y_pred, output_dict=True)
        for label, metrics in classification_rep.items():
            if isinstance(metrics, dict):
                for metric, value in metrics.items():
                    mlflow.log_metric(f"{label}_{metric}", value)

        # Step 6: Log confusion matrix plot
        conf_matrix = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(8, 6))
        sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues")
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
        plt.title(f"Confusion Matrix: TF-IDF Bigrams, Imbalance={imbalance_method}")
        confusion_matrix_filename = f"confusion_matrix_{imbalance_method}.png"
        plt.savefig(confusion_matrix_filename)
        mlflow.log_artifact(confusion_matrix_filename)
        plt.close()

        # Step 7: Log the model
        mlflow.sklearn.log_model(model, f"random_forest_model_tfidf_bigrams_imbalance_{imbalance_method}")
        
        print(f"Completed run: Imbalance Method: {imbalance_method}. Accuracy: {accuracy:.4f}")

## 5. Execute Imbalance Experiments

In [5]:
# Define the list of imbalance handling methods to test
imbalance_methods = ['class_weights', 'oversampling', 'adasyn', 'undersampling', 'smote_enn']

for method in imbalance_methods:
    run_imbalanced_experiment(method)



Completed run: Imbalance Method: class_weights. Accuracy: 0.6760
🏃 View run Imbalance_class_weights_RF_TFIDF_Bigrams at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/16/runs/ed0d2bd60c8746caac12f709e2e73c4e
🧪 View experiment at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/16




Completed run: Imbalance Method: oversampling. Accuracy: 0.6764
🏃 View run Imbalance_oversampling_RF_TFIDF_Bigrams at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/16/runs/9d72030095a14556b397417aba2b5a48
🧪 View experiment at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/16




Completed run: Imbalance Method: adasyn. Accuracy: 0.6854
🏃 View run Imbalance_adasyn_RF_TFIDF_Bigrams at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/16/runs/f1a75ad541394a0f86c6e985968cdcb2
🧪 View experiment at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/16




Completed run: Imbalance Method: undersampling. Accuracy: 0.6718
🏃 View run Imbalance_undersampling_RF_TFIDF_Bigrams at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/16/runs/a1f83eae4989400099e798213b1ac93b
🧪 View experiment at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/16




Completed run: Imbalance Method: smote_enn. Accuracy: 0.4502
🏃 View run Imbalance_smote_enn_RF_TFIDF_Bigrams at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/16/runs/a13030ba5eb64ced9e4301b4a7f64d45
🧪 View experiment at: http://ec2-54-211-18-166.compute-1.amazonaws.com:5000/#/experiments/16


## 6. Conclusion and Next Steps
Review the MLflow UI to determine which imbalance handling method yielded the best results, focusing on F1-score and recall for the minority classes rather than on overall accuracy. The best combination of vectorizer, feature size, and imbalance handling technique will be carried forward to the next step: hyperparameter tuning of the final model (Experiment 5).
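
Because the comparison rests on the per-class F1-score, it helps to recall that F1 is the harmonic mean of precision and recall, so a high recall cannot compensate for a very low precision. The sketch below recomputes F1 from two (precision, recall) pairs for class -1 copied from the comparison table at the end of this notebook; the helper name `f1_from_precision_recall` is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for class -1, copied from the comparison table
reported = {
    "undersampling": (0.5912, 0.4733),
    "smote_enn": (0.3379, 0.7297),
}

for method, (p, r) in reported.items():
    print(f"{method}: F1 = {f1_from_precision_recall(p, r):.4f}")
# undersampling: F1 = 0.5257
# smote_enn: F1 = 0.4619
```

`smote_enn` has by far the highest recall for class -1, yet its low precision pulls its F1 below every other method, which matches its poor overall accuracy.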

In [8]:
# Define the list of methods you ran in the previous cell
IMBALANCE_METHODS = [
    'smote_enn', 
    'undersampling', 
    'adasyn', 
    'oversampling', 
    'class_weights'
]

# --- 1. Query MLflow for all relevant runs ---
# We filter runs based on the tag 'experiment_type' set in your function
runs = mlflow.search_runs(
    filter_string="tags.experiment_type = 'imbalance_handling'",
    order_by=["metrics.accuracy DESC"] # Sort by accuracy to see the best models first
)

# --- 2. Extract and structure the metrics ---

comparison_data = []

# Iterate through the retrieved runs
for _, run in runs.iterrows():
    run_id = run['run_id']
    
    # Extract parameters and metrics
    method = run['params.imbalance_method']
    
    # We focus on Accuracy and the minority class metrics (assuming -1 is the minority class)
    # Note: MLflow logs the metrics with the label prefix (e.g., '-1_precision')
    
    # Filter for the relevant metrics
    if method in IMBALANCE_METHODS:
        row = {
            'Method': method,
            'Accuracy (Overall)': run['metrics.accuracy'],
            # Metrics for the Minority Class (-1)
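            # Use NaN (not a string) for missing metrics so sorting and float formatting still work
            '-1_Precision': run.get('metrics.-1_precision', float('nan')),
            '-1_Recall': run.get('metrics.-1_recall', float('nan')),
            '-1_F1-Score': run.get('metrics.-1_f1-score', float('nan')),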
            'Run ID': run_id
        }
        comparison_data.append(row)

# --- 3. Create the comparison DataFrame ---
df_comparison = pd.DataFrame(comparison_data)

# Reorder columns for clarity and re-sort by the critical F1-Score for the minority class
df_comparison = df_comparison.sort_values(by='-1_F1-Score', ascending=False)
df_comparison = df_comparison[['Method', 'Accuracy (Overall)', '-1_Precision', '-1_Recall', '-1_F1-Score', 'Run ID']]

# --- 4. Print the final comparison table ---
print("--- Detailed Comparison of Imbalance Handling Methods (Sorted by -1_F1-Score) ---")
print(df_comparison.to_markdown(index=False, floatfmt=".4f"))

# --- 5. Highlight Top 3 Methods and Recommend Ensemble ---
top_3_methods = df_comparison.head(3)

print("\n" + "="*80)
print("🏆 TOP 3 IMBALANCE HANDLING METHODS")
print("="*80)
for idx, (i, row) in enumerate(top_3_methods.iterrows(), 1):
    print(f"\n{idx}. {row['Method'].upper()}")
    print(f"   F1-Score (Minority Class -1): {row['-1_F1-Score']:.4f}")
    print(f"   Recall (Minority Class -1): {row['-1_Recall']:.4f}")
    print(f"   Precision (Minority Class -1): {row['-1_Precision']:.4f}")
    print(f"   Overall Accuracy: {row['Accuracy (Overall)']:.4f}")

print("\n" + "="*80)
print("💡 RECOMMENDED APPROACH: ENSEMBLE OF TOP 3 METHODS")
print("="*80)
print("\nRationale:")
print("The top 3 methods reach nearly identical F1-scores (difference < 0.001) but trade")
print("precision against recall differently. A soft-voting ensemble could combine:")
print(f"  • {top_3_methods.iloc[0]['Method'].capitalize()}")
print(f"  • {top_3_methods.iloc[1]['Method'].capitalize()}")
print(f"  • {top_3_methods.iloc[2]['Method'].capitalize()}")
print("\nExpected benefits (not yet measured):")
print("  ✓ Increased robustness by leveraging diverse sampling strategies")
print("  ✓ Reduced variance and improved generalization")
print("  ✓ Better handling of edge cases through soft voting")
print("  ✓ Less bias from any single resampling technique")
print("\nImplementation:")
print("  Wrap each top-3 sampler with its classifier in an imblearn Pipeline, then combine")
print("  the pipelines with VotingClassifier(voting='soft') to average their probabilities.")
print("="*80)

# --- 6. Show Run IDs for Easy Retrieval ---
print("\n📋 Run IDs for Top 3 Methods (for model loading):")
for idx, (i, row) in enumerate(top_3_methods.iterrows(), 1):
    print(f"  {idx}. {row['Method']}: {row['Run ID']}")
print("="*80)

--- Detailed Comparison of Imbalance Handling Methods (Sorted by -1_F1-Score) ---
| Method        |   Accuracy (Overall) |   -1_Precision |   -1_Recall |   -1_F1-Score | Run ID                           |
|:--------------|---------------------:|---------------:|------------:|--------------:|:---------------------------------|
| undersampling |               0.6718 |         0.5912 |      0.4733 |        0.5257 | a1f83eae4989400099e798213b1ac93b |
| oversampling  |               0.6764 |         0.6096 |      0.4618 |        0.5255 | 9d72030095a14556b397417aba2b5a48 |
| adasyn        |               0.6854 |         0.5589 |      0.4945 |        0.5248 | f1a75ad541394a0f86c6e985968cdcb2 |
| class_weights |               0.6760 |         0.6049 |      0.4455 |        0.5131 | ed0d2bd60c8746caac12f709e2e73c4e |
| smote_enn     |               0.4502 |         0.3379 |      0.7297 |        0.4619 | a13030ba5eb64ced9e4301b4a7f64d45 |

🏆 TOP 3 IMBALANCE HANDLING METHODS

1. UNDERSAMPLING
