# Class Imbalance Techniques (class weights, focal weights, SMOTE)

Compare three approaches to the imbalanced **converted** target: balanced class weights in Logistic Regression, focal-like sample reweighting (a single fit, reweight, refit pass), and SMOTE (if `imblearn` is available; otherwise naive oversampling).

In [None]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt, warnings
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

!wget -q https://raw.githubusercontent.com/Jihun-ust/ust-mail-557/main/Classification/classification_utils.py
import classification_utils as utils
csv_path = "https://raw.githubusercontent.com/Jihun-ust/ust-mail-557/main/Classification/classification.csv"
warnings.filterwarnings("ignore")

df = pd.read_csv(csv_path, parse_dates=["ts"]).sort_values("ts")
train, test = utils.chrono_split(df, "ts", test_frac=0.2)

features = ["ad_channel","device","region","campaign","spend_l7","pages_per_session","sessions_l30","time_on_site_s","pricing_views_l7","past_purchases","discount_flag","competitor_visits"]
target = "converted"
X_train, y_train = train[features], train[target]
X_test, y_test = test[features], test[target]

pre = ColumnTransformer([
    ("num", StandardScaler(), ["spend_l7","pages_per_session","sessions_l30","time_on_site_s","pricing_views_l7","past_purchases",]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ad_channel","device","region","campaign"]),
    ("bin", "passthrough", ["discount_flag","competitor_visits"])
])

def fit_eval(pipe):
    pipe.fit(X_train, y_train)
    probs = pipe.predict_proba(X_test)[:,1]
    return utils.evaluate_classifier(y_test, probs)

# Class weights (balanced)
lr_w = Pipeline([("pre", pre), ("lr", LogisticRegression(max_iter=600, class_weight="balanced"))])
print("=== Class weights (balanced) ===")
m1 = fit_eval(lr_w)

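# Focal-like reweighting (one pass): fit a baseline, weight each training row by how
# poorly the baseline predicts it, then refit with those sample weights.
lr = Pipeline([("pre", pre), ("lr", LogisticRegression(max_iter=600))])
lr.fit(X_train, y_train)
p_train = lr.predict_proba(X_train)[:,1]
# If alpha_pos follows the usual focal-loss convention, 0.25 down-weights positives
# relative to negatives (0.75); raise it if the refit model stops predicting converters.
w = utils.focal_reweighting(y_train.values, p_train, gamma=2.0, alpha_pos=0.25)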
lr_fw = Pipeline([("pre", pre), ("lr", LogisticRegression(max_iter=600))])
print("\n=== Focal-like reweighting (1 pass) ===")
lr_fw.fit(X_train, y_train, lr__sample_weight=w)
p_test = lr_fw.predict_proba(X_test)[:,1]
_ = utils.evaluate_classifier(y_test, p_test)

# SMOTE (if available) or naive oversampling
print("\n=== SMOTE or naive oversampling ===")
try:
    from imblearn.pipeline import Pipeline as ImbPipeline
    from imblearn.over_sampling import SMOTE
    imb = ImbPipeline([("pre", pre), ("sm", SMOTE()), ("lr", LogisticRegression(max_iter=600))])
    imb.fit(X_train, y_train)
    probs = imb.predict_proba(X_test)[:,1]
    _ = utils.evaluate_classifier(y_test, probs)
    print("(Used SMOTE)")
except ImportError:  # fall back only when imblearn is missing; let other errors surface
    Xp = pre.fit_transform(X_train)
    Xo, yo = utils.naive_oversample(pd.DataFrame(Xp.toarray() if hasattr(Xp, 'toarray') else Xp), y_train.reset_index(drop=True))
    lr_os = LogisticRegression(max_iter=600)
    lr_os.fit(Xo, yo)
    probs = lr_os.predict_proba(pre.transform(X_test))[:,1]
    _ = utils.evaluate_classifier(y_test, probs)
    print("(Used naive oversampling)")

#### Advanced Diagnostic (Sample)

Class Weights (balanced)
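   - Performance: AUC-ROC ≈ 0.60, PR-AUC ≈ 0.68 → weak-to-moderate discrimination, better than chance but far from perfect. Read PR-AUC against the converter rate in the test set, since that rate is roughly what a random ranker would score.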
   - Pattern: Both classes are identified, with recall ~54% for non-converters and ~60% for converters.
   - Implication: A fairer balance; the model doesn’t collapse into “always yes” or “always no.” Usable as a baseline.
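
To make the "balanced" setting concrete, here is a minimal sketch of the heuristic scikit-learn documents for `class_weight="balanced"`: `n_samples / (n_classes * count_per_class)`. The toy labels are an assumption for illustration and are not drawn from the notebook's dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative only): 8 negatives, 2 positives.
y = np.array([0] * 8 + [1] * 2)

# "balanced" heuristic: n_samples / (n_classes * count_per_class).
classes = np.unique(y)
manual = len(y) / (len(classes) * np.bincount(y))

# The library helper should agree with the manual formula.
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(dict(zip(classes.tolist(), balanced.round(3).tolist())))
```

The rarer class gets the larger weight, so each class contributes about equally to the loss regardless of how many rows it has.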

Focal-like Reweighting (1 pass)
   - Performance: AUC-ROC 0.40 (worse than random, meaning the ranking is partly inverted), PR-AUC 0.53 (down from 0.68 with class weights).
   - Pattern: The model predicts every lead as a non-converter. Recall is 100% on class 0 but 0% on converters.
   - Implication: This approach failed; the model ignored converting leads entirely. Not viable for production.
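
The internals of `utils.focal_reweighting` are not shown here, so the following is only a sketch of one common way to derive focal-style sample weights, `alpha_t * (1 - p_t) ** gamma`, where `p_t` is the probability the baseline assigns to the true class. The function name `focal_weights_sketch` and the toy inputs are hypothetical.

```python
import numpy as np

def focal_weights_sketch(y, p, gamma=2.0, alpha_pos=0.25):
    """Hypothetical helper: per-sample weights alpha_t * (1 - p_t) ** gamma."""
    y = np.asarray(y)
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)
    p_t = np.where(y == 1, p, 1.0 - p)                    # probability of the true class
    alpha_t = np.where(y == 1, alpha_pos, 1.0 - alpha_pos)  # class-level scaling
    return alpha_t * (1.0 - p_t) ** gamma                  # hard rows get larger weights

# Toy inputs (illustrative only): two positives, two negatives.
y = np.array([1, 1, 0, 0])
p = np.array([0.5, 0.9, 0.5, 0.1])
w = focal_weights_sketch(y, p)
print(w)
```

Under this formulation, `alpha_pos=0.25` gives a positive row one third of the weight of an equally hard negative row. If the helper used above behaves similarly, that could be one contributor to the collapse toward class 0, though this has not been verified against its source.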

SMOTE (Synthetic Minority Oversampling)
   - Performance: Nearly identical to Class Weights. AUC-ROC ≈ 0.59, PR-AUC ≈ 0.67.
   - Pattern: Very similar confusion matrix; precision/recall balance close to Class Weights.
   - Implication: Oversampling didn’t improve results beyond simple class weighting. Adds complexity without clear lift.
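
For intuition about what SMOTE adds over plain duplication, the sketch below interpolates between a minority row and one of its nearest minority neighbours. It is a simplified illustration, not the `imblearn` implementation; `smote_like_sketch` and the toy points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_sketch(X_min, n_new, k=3, seed=0):
    """Hypothetical helper: interpolate between minority rows and their neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours: with distinct rows, each row is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # rows to start from
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one random neighbour per row
    lam = rng.random((n_new, 1))                            # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

# Toy minority points on the unit square (illustrative only).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_sketch(X_min, n_new=6)
print(X_new.shape)
```

Because every synthetic row lies on a segment between two existing minority rows, it adds no information beyond the minority region already observed, which is consistent with SMOTE landing close to class weighting for a linear model.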

Business Takeaways
   - Balanced class weights are the simplest and most reliable option here. They produce a moderate but usable balance between catching converters and avoiding false alarms.
   - SMOTE doesn’t outperform class weights, so it adds no value unless future data suggests otherwise.
   - Focal-like reweighting broke the model; it’s not suitable with the current data setup and parameters.
   - Overall accuracy is still low (~57%). To improve further, leaders should expect to invest in:
      - Feature engineering (better behavioral signals, campaign interactions).
      - Threshold tuning (adjust decision cut-offs to prioritize either precision or recall depending on SLA).
      - Alternative models (tree ensembles, calibrated gradient boosting).
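
As an illustration of the threshold-tuning item, the sketch below picks the most precise cut-off that still meets a recall floor, using scikit-learn's `precision_recall_curve`. The helper name `pick_threshold`, the recall floor, and the toy scores are assumptions; in practice the threshold should be chosen on a validation split rather than on the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, probs, min_recall=0.8):
    """Hypothetical helper: most precise threshold that still meets a recall floor."""
    precision, recall, thresholds = precision_recall_curve(y_true, probs)
    # precision/recall carry one extra trailing point with no threshold; drop it.
    precision, recall = precision[:-1], recall[:-1]
    ok = recall >= min_recall
    if not ok.any():
        return None  # no threshold satisfies the recall floor
    best = np.argmax(np.where(ok, precision, -1.0))
    return float(thresholds[best])

# Toy labels and scores (illustrative only).
y_true = np.array([0, 0, 1, 0, 1, 1])
probs = np.array([0.1, 0.3, 0.35, 0.6, 0.7, 0.9])
t = pick_threshold(y_true, probs, min_recall=1.0)
y_pred = (probs >= t).astype(int)  # apply the chosen cut-off
print(t, y_pred)
```

Raising `min_recall` favours catching converters at the cost of more false alarms; lowering it does the reverse.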

In plain English: Weighting works, synthetic oversampling doesn’t add much, and focal reweighting failed. Current models are “baseline adequate” but not business-ready; they’ll still misclassify too many leads without better data signals or refined thresholds.

**Note**: *The dataset used in this example was programmatically generated and does not reflect real-world information.*