# **Feature Engineering & Modeling**

## 1. Feature Engineering

### 1.1 Load Data and Train/Test Split

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv('../data/online_shoppers_intention.csv')
print(f"Dataset shape: {df.shape}")

Dataset shape: (12330, 18)


In [528]:
X = df.drop('Revenue', axis=1)
y = df['Revenue']

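X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.30, 
    stratify=y,
    random_state=42
)

# Work on explicit copies so that adding engineered columns below
# cannot trigger pandas' SettingWithCopyWarning.
X_train, X_test = X_train.copy(), X_test.copy()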

### 1.2 Feature Construction

In [529]:
# Why: Total time on site = engagement level
X_train['total_duration'] = (X_train['Administrative_Duration'] + 
                              X_train['Informational_Duration'] + 
                              X_train['ProductRelated_Duration'])

X_test['total_duration'] = (X_test['Administrative_Duration'] + 
                             X_test['Informational_Duration'] + 
                             X_test['ProductRelated_Duration'])

In [530]:
# Why: PageValues is zero for most sessions; a binary flag separates "visited valuable pages" from "didn't"
X_train['has_pagevalue'] = (X_train['PageValues'] > 0).astype(int)
X_test['has_pagevalue'] = (X_test['PageValues'] > 0).astype(int)

In [531]:
# Why: What % of time was spent on products? High = serious shopper
X_train['product_focus'] = X_train['ProductRelated_Duration'] / (X_train['total_duration'] + 1)
X_test['product_focus'] = X_test['ProductRelated_Duration'] / (X_test['total_duration'] + 1)

In [532]:
# Why: High value pages + low exit = strong buy signal
X_train['pagevalue_exit_interaction'] = X_train['PageValues'] * (1 - X_train['ExitRates'])
X_test['pagevalue_exit_interaction'] = X_test['PageValues'] * (1 - X_test['ExitRates'])

In [533]:
# Why: Time per page = how engaged they were (fast clicking vs. careful browsing)
total_pages = X_train['Administrative'] + X_train['Informational'] + X_train['ProductRelated']
X_train['engagement_rate'] = X_train['total_duration'] / (total_pages + 1)

total_pages = X_test['Administrative'] + X_test['Informational'] + X_test['ProductRelated']
X_test['engagement_rate'] = X_test['total_duration'] / (total_pages + 1)

In [534]:
# Why: Total pages viewed on site = engagement level
X_train["total_pages"] = X_train["Administrative"] + X_train["Informational"] + X_train["ProductRelated"]
X_test["total_pages"]  = X_test["Administrative"] + X_test["Informational"] + X_test["ProductRelated"]

**Note: 1 is added to each denominator to avoid division by zero.**
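Each feature above is built twice, once for `X_train` and once for `X_test`. As a minimal sketch, the same logic could live in one helper so that the two splits cannot drift apart. The function name `add_engineered_features` and the two-row `demo` frame are assumptions made for illustration; the column names are the ones used in the cells above.

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the six engineered columns from Section 1.2."""
    out = df.copy()
    out["total_duration"] = (
        out["Administrative_Duration"]
        + out["Informational_Duration"]
        + out["ProductRelated_Duration"]
    )
    out["total_pages"] = (
        out["Administrative"] + out["Informational"] + out["ProductRelated"]
    )
    out["has_pagevalue"] = (out["PageValues"] > 0).astype(int)
    # 1 is added to each denominator to avoid division by zero.
    out["product_focus"] = out["ProductRelated_Duration"] / (out["total_duration"] + 1)
    out["pagevalue_exit_interaction"] = out["PageValues"] * (1 - out["ExitRates"])
    out["engagement_rate"] = out["total_duration"] / (out["total_pages"] + 1)
    return out

# Hypothetical two-session frame: one engaged visitor, one empty session.
demo = pd.DataFrame({
    "Administrative": [1, 0],
    "Administrative_Duration": [10.0, 0.0],
    "Informational": [0, 0],
    "Informational_Duration": [0.0, 0.0],
    "ProductRelated": [3, 0],
    "ProductRelated_Duration": [89.0, 0.0],
    "PageValues": [5.0, 0.0],
    "ExitRates": [0.2, 0.2],
})
demo_fe = add_engineered_features(demo)
print(demo_fe[["total_duration", "product_focus", "engagement_rate"]])
```

In the notebook it would be applied as `X_train = add_engineered_features(X_train)` and likewise for `X_test`.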

### 1.3 Feature Selection

In [535]:
numerical_features = [
    'PageValues',
    'ExitRates', 
    'BounceRates',
    'total_duration',
    'product_focus',
    'engagement_rate',
    'has_pagevalue',
    'pagevalue_exit_interaction',
    'total_pages'
]

categorical_features = [
    'Month',
    'VisitorType'
]

features_to_use = numerical_features + categorical_features

X_train_selected = X_train[features_to_use].copy()
X_test_selected = X_test[features_to_use].copy()

### 1.4 Encoding

In [536]:
X_train_encoded = pd.get_dummies(X_train_selected, columns=['Month', 'VisitorType'], drop_first=True)
X_test_encoded = pd.get_dummies(X_test_selected, columns=['Month', 'VisitorType'], drop_first=True)
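Calling `pd.get_dummies` separately on each split only yields matching columns when both splits contain the same categories. That is likely true here given the dataset size, but it is not guaranteed. The sketch below uses toy frames (an assumption, not the notebook's data) to show the mismatch and one way to guard against it: encode the test split without `drop_first`, then reindex it to the training columns.

```python
import pandas as pd

# Toy data: "Feb" appears in the training split only.
train = pd.DataFrame({"Month": ["Feb", "Mar", "Nov", "Nov"]})
test = pd.DataFrame({"Month": ["Mar", "Nov"]})

train_enc = pd.get_dummies(train, columns=["Month"], drop_first=True, dtype=int)

# Naive approach: drop_first removes a different baseline ("Mar") in the test split.
naive_test_enc = pd.get_dummies(test, columns=["Month"], drop_first=True, dtype=int)

# Guarded approach: encode every level, then keep exactly the training columns.
test_enc = pd.get_dummies(test, columns=["Month"], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(naive_test_enc.columns))
print(list(test_enc.columns))
```

Categories that appear only in the test split are dropped by the reindex, which mirrors how an encoder fitted on the training data would treat unseen levels.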

### 1.5 Feature Scaling

In [537]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train_encoded[numerical_features])

X_train_scaled = X_train_encoded.copy()
X_test_scaled = X_test_encoded.copy()

X_train_scaled[numerical_features] = scaler.transform(X_train_encoded[numerical_features])
X_test_scaled[numerical_features] = scaler.transform(X_test_encoded[numerical_features])
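The scaler is fitted on the training split only, so the test split is transformed with training statistics and no information leaks from test to train. A minimal NumPy sketch of the same arithmetic, using toy arrays that are assumptions for illustration:

```python
import numpy as np

# Toy single-feature splits (not the notebook's data).
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [40.0]])

# Statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training mean and std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled test values need not have zero mean or unit variance; that is expected, because they are measured against the training distribution.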

## 2. Modeling

### General Method:

1. Baseline Model (No Class Balancing): Establish a reference point using default training behavior.
2. Balanced Model (With Class Balancing): Improve sensitivity to the minority class (Purchase=1).
3. Hyperparameter-Tuned Model: Search for the hyperparameter combination that maximizes Average Precision, which summarizes the precision-recall curve across all thresholds and is therefore well suited to our imbalanced dataset.
4. Threshold-Tuned Model: Tune the decision threshold for deployment, prioritizing recall of buyers (Purchase=1) by optimizing its F2 score while preventing precision from dropping below 0.5.
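The F2 score used in step 4 is the F-beta score with beta = 2, which weights recall more heavily than precision. As a worked check, the sketch below computes it directly from confusion-matrix counts; the helper name `fbeta_from_counts` is an assumption, and the counts are those of the baseline logistic regression reported in Section 2.1.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta expressed in counts: (1 + b^2) TP / ((1 + b^2) TP + b^2 FN + FP)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Baseline logistic regression confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 2988, 139, 248, 324

f2 = fbeta_from_counts(tp, fp, fn, beta=2.0)
f1 = fbeta_from_counts(tp, fp, fn, beta=1.0)
print(f"F2 = {f2:.4f}, F1 = {f1:.4f}")
```

With beta = 2, each false negative (a missed buyer) costs four times as much as a false positive, which is why F2 is lower than F1 for a model whose recall is below its precision.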

### 2.1 Simple Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    roc_auc_score, 
    classification_report,
    confusion_matrix,
    precision_score,
    fbeta_score
)
from enum import Enum

class ModelTag(Enum):
    NO_BALANCE = "(No Class Balancing)"
    BALANCED = "(With Class Balancing)"
    HYPER_TUNED = "(Hyperparameter Tuned)"
    THRESH_TUNED = "(Threshold Tuned)"

class NameTag(Enum):
    LR = "Logistic Regression"
    RF = "Random Forest"
    XGB = "XGBoost"

def model_results(model, name, tag, threshold=0.5):
    """
    Fit `model` on the scaled training set and print test-set metrics.

    Args:
      model: A sklearn-style classifier that implements fit() and predict_proba().
      name (str): Display name for the model (e.g., "Logistic Regression").
      tag (str): Display tag for the experiment (e.g., "(Threshold Tuned)").
      threshold (float): Probability cutoff for predicting Purchase=1 (default 0.5).

    Prints:
      - Model label: f"{name} {tag}"
      - ROC AUC using predicted probabilities for class 1
      - Confusion matrix at the given threshold
      - classification_report (precision/recall/F1) for both classes
      - F2 score treating Purchase=1 as the positive class
      - F2 score treating No Purchase=0 as the positive class
    """
    # Relies on the notebook-level X_train_scaled, X_test_scaled, y_train, y_test.
    model.fit(X_train_scaled, y_train)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    y_pred = (y_pred_proba >= threshold).astype(int)

    print(f"{name} {tag}")
    print(f"AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")

    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['No Purchase', 'Purchase']))
    print(f"\nF2 (Purchase=1): {fbeta_score(y_test, y_pred, beta=2):.4f}")
    print(f"F2 (No Purchase=0): {fbeta_score(y_test, y_pred, beta=2, pos_label=0):.4f}")

In [778]:
logreg = LogisticRegression(random_state=42, max_iter=1000)
model_results(logreg, NameTag.LR.value, ModelTag.NO_BALANCE.value)

Logistic Regression (No Class Balancing)
AUC: 0.9135

Confusion Matrix:
[[2988  139]
 [ 248  324]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.92      0.96      0.94      3127
    Purchase       0.70      0.57      0.63       572

    accuracy                           0.90      3699
   macro avg       0.81      0.76      0.78      3699
weighted avg       0.89      0.90      0.89      3699


F2 (Purchase=1): 0.5889
F2 (No Purchase=0): 0.9489


In [754]:
logreg_balanced = LogisticRegression(
    random_state=42,
    max_iter=1000,
    class_weight='balanced'
    )

model_results(logreg_balanced, NameTag.LR.value, ModelTag.BALANCED.value)

Logistic Regression (With Class Balancing)
AUC: 0.9160

Confusion Matrix:
[[2712  415]
 [ 112  460]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.96      0.87      0.91      3127
    Purchase       0.53      0.80      0.64       572

    accuracy                           0.86      3699
   macro avg       0.74      0.84      0.77      3699
weighted avg       0.89      0.86      0.87      3699


F2 (Purchase=1): 0.7272
F2 (No Purchase=0): 0.8844
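With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so errors on the rarer class count for more in the loss. A minimal sketch of that arithmetic on a toy label vector (the 85/15 split is an assumption for illustration, not the notebook's class ratio):

```python
import numpy as np

# Toy labels: 85 non-buyers (0) and 15 buyers (1).
y = np.array([0] * 85 + [1] * 15)

counts = np.bincount(y)                      # samples per class
weights = len(y) / (len(counts) * counts)    # the "balanced" heuristic

print(dict(enumerate(np.round(weights, 4))))
print(f"minority/majority weight ratio: {weights[1] / weights[0]:.2f}")
```

The ratio of the two weights equals the ratio of negatives to positives, which is the same quantity passed to XGBoost as `scale_pos_weight` in Section 2.3.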


In [777]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def search_hyper(base, param_grid):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    grid = GridSearchCV(
        base,
        param_grid=param_grid,
        scoring="average_precision",
        cv=cv,
        n_jobs=8,
        verbose=1
    )

    grid.fit(X_train_scaled, y_train)
    print(grid.best_params_, grid.best_score_)

In [739]:
base = LogisticRegression(
    solver="saga",
    class_weight="balanced",
    random_state=42,
    max_iter=10000
)

param_grid = {
    "C": [0.001, 0.005, 0.01, 0.1, 1, 5, 10, 50, 100],
    "l1_ratio": [0, 0.25, 0.5, 0.75, 1],
    'max_iter': [7000, 8000],
}

search_hyper(base, param_grid)

Fitting 5 folds for each of 90 candidates, totalling 450 fits
{'C': 0.005, 'l1_ratio': 1, 'max_iter': 7000} 0.6862863519831575


In [759]:
logreg_hyper = LogisticRegression(
    C=0.005,
    l1_ratio=1,
    random_state=42,
    max_iter=7000,
    class_weight='balanced',
    solver='saga'
    )

model_results(logreg_hyper, NameTag.LR.value, ModelTag.HYPER_TUNED.value)

Logistic Regression (Hyperparameter Tuned)
AUC: 0.9076

Confusion Matrix:
[[2756  371]
 [ 124  448]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.96      0.88      0.92      3127
    Purchase       0.55      0.78      0.64       572

    accuracy                           0.87      3699
   macro avg       0.75      0.83      0.78      3699
weighted avg       0.89      0.87      0.88      3699


F2 (Purchase=1): 0.7210
F2 (No Purchase=0): 0.8955


In [742]:
def search_threshold(model):
    thresholds = np.arange(0, 1, 0.001)
    best_threshold = None
    best_f2 = -1

    proba = model.predict_proba(X_test_scaled)[:, 1]

    for thresh in thresholds:
        y_pred = (proba >= thresh).astype(int)

        p = precision_score(y_test, y_pred, zero_division=0)
        if p < 0.5:
            continue

        f2 = fbeta_score(y_test, y_pred, beta=2)

        if f2 > best_f2:
            best_f2 = f2
            best_threshold = thresh

    if best_threshold is None:
        print("No threshold achieved precision >= 0.5. Try lowering the precision floor or widening the threshold range.")
    else:
        print(f"Best threshold: {best_threshold:.3f}, Best F2: {best_f2:.4f} (precision >= 0.5)")

In [757]:
search_threshold(logreg_hyper)

Best threshold: 0.278, Best F2: 0.7255 (precision >= 0.5)
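`search_threshold` scores candidate thresholds on `X_test`, so the F2 reported at the chosen threshold is likely to be somewhat optimistic. One hedged alternative is to pick the threshold from held-out training predictions, for example out-of-fold probabilities from `cross_val_predict(..., method="predict_proba")`, and touch the test set only once. The sketch below shows the selection step on plain arrays; the function name `best_f2_threshold` and the toy inputs are assumptions.

```python
import numpy as np

def best_f2_threshold(y_true, proba, min_precision=0.5, step=0.001):
    """Return (threshold, F2) maximizing F2 subject to precision >= min_precision."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    best_t, best_f2 = None, -1.0
    for t in np.arange(0.0, 1.0, step):
        pred = proba >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        # Skip thresholds with no true positives or with precision below the floor.
        if tp == 0 or tp / (tp + fp) < min_precision:
            continue
        f2 = 5 * tp / (5 * tp + 4 * fn + fp)
        if f2 > best_f2:
            best_t, best_f2 = float(t), float(f2)
    return best_t, best_f2

# Toy held-out labels and predicted probabilities (illustrative only).
y_val = [0, 0, 0, 1, 1, 0, 1]
p_val = [0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

threshold, f2 = best_f2_threshold(y_val, p_val)
print(f"threshold={threshold:.3f}, F2={f2:.4f}")
```

The threshold found this way can then be passed to `model_results` exactly as the notebook does with the test-tuned value.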


In [758]:
model_results(logreg_hyper, NameTag.LR.value, ModelTag.THRESH_TUNED.value, 0.278)

Logistic Regression (Threshold Tuned)
AUC: 0.9076

Confusion Matrix:
[[2699  428]
 [ 111  461]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.96      0.86      0.91      3127
    Purchase       0.52      0.81      0.63       572

    accuracy                           0.85      3699
   macro avg       0.74      0.83      0.77      3699
weighted avg       0.89      0.85      0.87      3699


F2 (Purchase=1): 0.7255
F2 (No Purchase=0): 0.8810


In [723]:
feature_names = X_train_scaled.columns
coef = logreg_hyper.coef_[0]

import pandas as pd
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Tuned LR Model': coef
})

coef_df['abs_balanced'] = coef_df['Tuned LR Model'].abs()
coef_df = coef_df.sort_values('abs_balanced', ascending=True)
coef_df = coef_df.drop('abs_balanced', axis=1)

print("\nFeature Coefficients")
coef_df_sorted = coef_df.iloc[::-1]
print(coef_df_sorted)


Feature Coefficients
                          Feature  Tuned LR Model
6                   has_pagevalue        1.110046
15                      Month_Nov        0.487304
1                       ExitRates       -0.235162
7      pagevalue_exit_interaction        0.168927
0                      PageValues        0.163536
8                     total_pages        0.000000
2                     BounceRates        0.000000
3                  total_duration        0.000000
4                   product_focus        0.000000
5                 engagement_rate        0.000000
19  VisitorType_Returning_Visitor        0.000000
18              VisitorType_Other        0.000000
10                      Month_Feb        0.000000
11                      Month_Jul        0.000000
12                     Month_June        0.000000
13                      Month_Mar        0.000000
14                      Month_May        0.000000
16                      Month_Oct        0.000000
17                      Mont

#### Interpretation (Simple Logistic Regression Model)
1. November is a strong predictor, possibly because of holiday shopping events such as Black Friday.
2. Time-based features such as durations and pages viewed were not impactful: the L1 penalty shrank their coefficients to zero. One possible explanation is that non-buyers browse more while buyers purchase directly.
3. PageValues and its derived features are very impactful; `has_pagevalue` has the largest coefficient.
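A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio that is easier to read. The sketch below does this for the nonzero coefficients printed above. Because the numerical features (including `has_pagevalue`) were standardized, each of their ratios refers to a one-standard-deviation increase; `Month_Nov` is an unscaled dummy, so its ratio compares November with the dropped baseline month.

```python
import math

# Nonzero coefficients copied from the tuned logistic regression output above.
coefficients = {
    "has_pagevalue": 1.110046,
    "Month_Nov": 0.487304,
    "ExitRates": -0.235162,
    "pagevalue_exit_interaction": 0.168927,
    "PageValues": 0.163536,
}

# exp(coefficient) = multiplicative change in the odds of Purchase=1.
odds_ratios = {name: math.exp(c) for name, c in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<28} {ratio:.3f}")
```

Ratios above 1 raise the odds of a purchase and ratios below 1 lower them, holding the other features fixed.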

### 2.2 Random Forest

In [760]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,   
    random_state=42,
    n_jobs=-1           
)

model_results(rf, NameTag.RF.value, ModelTag.NO_BALANCE.value)

Random Forest (No Class Balancing)
AUC: 0.9140

Confusion Matrix:
[[2982  145]
 [ 226  346]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.93      0.95      0.94      3127
    Purchase       0.70      0.60      0.65       572

    accuracy                           0.90      3699
   macro avg       0.82      0.78      0.80      3699
weighted avg       0.89      0.90      0.90      3699


F2 (Purchase=1): 0.6225
F2 (No Purchase=0): 0.9487


In [761]:
rf_balanced = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

model_results(rf_balanced, NameTag.RF.value, ModelTag.BALANCED.value)

Random Forest (With Class Balancing)
AUC: 0.9149

Confusion Matrix:
[[2993  134]
 [ 245  327]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.92      0.96      0.94      3127
    Purchase       0.71      0.57      0.63       572

    accuracy                           0.90      3699
   macro avg       0.82      0.76      0.79      3699
weighted avg       0.89      0.90      0.89      3699


F2 (Purchase=1): 0.5948
F2 (No Purchase=0): 0.9504


In [772]:
base_rf = RandomForestClassifier(
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
}

search_hyper(base_rf, param_grid_rf)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
{'max_depth': 30, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200} 0.7541094399676789


In [773]:
rf_hyper = RandomForestClassifier(
    max_depth=30,
    max_features='sqrt',
    min_samples_leaf=4,
    min_samples_split=10,
    n_estimators=200,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

model_results(rf_hyper, NameTag.RF.value, ModelTag.HYPER_TUNED.value)

Random Forest (Hyperparameter Tuned)
AUC: 0.9247

Confusion Matrix:
[[2843  284]
 [ 152  420]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.95      0.91      0.93      3127
    Purchase       0.60      0.73      0.66       572

    accuracy                           0.88      3699
   macro avg       0.77      0.82      0.79      3699
weighted avg       0.89      0.88      0.89      3699


F2 (Purchase=1): 0.7019
F2 (No Purchase=0): 0.9169


In [774]:
search_threshold(rf_hyper)

Best threshold: 0.291, Best F2: 0.7596 (precision >= 0.5)


In [775]:
model_results(rf_hyper, NameTag.RF.value, ModelTag.THRESH_TUNED.value, 0.291)

Random Forest (Threshold Tuned)
AUC: 0.9247

Confusion Matrix:
[[2646  481]
 [  76  496]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.97      0.85      0.90      3127
    Purchase       0.51      0.87      0.64       572

    accuracy                           0.85      3699
   macro avg       0.74      0.86      0.77      3699
weighted avg       0.90      0.85      0.86      3699


F2 (Purchase=1): 0.7596
F2 (No Purchase=0): 0.8687


In [730]:
feature_names = X_train_scaled.columns
rf_importance = rf_hyper.feature_importances_

importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': rf_importance
}).sort_values('Importance', ascending=False)

print("Random Forest Feature Importance\n")
print(importance_df.to_string(index=False))
print("\n")

Random Forest Feature Importance

                      Feature  Importance
   pagevalue_exit_interaction    0.229129
                   PageValues    0.215703
                has_pagevalue    0.149154
                    ExitRates    0.069596
               total_duration    0.060715
                  total_pages    0.051567
              engagement_rate    0.050921
                product_focus    0.047626
                  BounceRates    0.042802
                    Month_Nov    0.034246
                    Month_May    0.016552
                    Month_Mar    0.007837
VisitorType_Returning_Visitor    0.007395
                    Month_Sep    0.006189
                    Month_Dec    0.004519
                    Month_Oct    0.002503
                    Month_Jul    0.002021
                   Month_June    0.000989
                    Month_Feb    0.000306
            VisitorType_Other    0.000231




### 2.3 eXtreme Gradient Boosting

In [765]:
import xgboost as xgb

xgb_model = xgb.XGBClassifier(
    n_estimators=100, 
    max_depth=6, 
    learning_rate=0.1, 
    random_state=42,
    eval_metric='logloss'
)

model_results(xgb_model, NameTag.XGB.value, ModelTag.NO_BALANCE.value)

XGBoost (No Class Balancing)
AUC: 0.9226

Confusion Matrix:
[[2985  142]
 [ 240  332]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.93      0.95      0.94      3127
    Purchase       0.70      0.58      0.63       572

    accuracy                           0.90      3699
   macro avg       0.81      0.77      0.79      3699
weighted avg       0.89      0.90      0.89      3699


F2 (Purchase=1): 0.6010
F2 (No Purchase=0): 0.9486


In [766]:
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(f"scale_pos_weight: {scale_pos_weight:.2f}\n") 

xgb_balanced = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    eval_metric='logloss',
    min_child_weight=1
)

model_results(xgb_balanced, NameTag.XGB.value, ModelTag.BALANCED.value)

scale_pos_weight: 5.46

XGBoost (With Class Balancing)
AUC: 0.9285

Confusion Matrix:
[[2706  421]
 [ 101  471]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.96      0.87      0.91      3127
    Purchase       0.53      0.82      0.64       572

    accuracy                           0.86      3699
   macro avg       0.75      0.84      0.78      3699
weighted avg       0.90      0.86      0.87      3699


F2 (Purchase=1): 0.7406
F2 (No Purchase=0): 0.8834


In [776]:
base_xgb = xgb.XGBClassifier(
    scale_pos_weight=scale_pos_weight,  # Class balancing
    random_state=42,
    eval_metric='logloss',
    n_jobs=1
)

param_grid_xgb = {
    'max_depth': [4, 6, 8],
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.1],
    'min_child_weight': [1, 5],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    "reg_lambda": [1, 5],
    "reg_alpha": [0, 0.5]
}

search_hyper(base_xgb, param_grid_xgb)

Fitting 5 folds for each of 1728 candidates, totalling 8640 fits
{'colsample_bytree': 0.8, 'gamma': 0, 'learning_rate': 0.05, 'max_depth': 4, 'min_child_weight': 5, 'n_estimators': 100, 'reg_alpha': 0, 'reg_lambda': 5, 'subsample': 1.0} 0.7550113412118835
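The candidate count reported by `GridSearchCV` is the product of the number of values tried for each hyperparameter, which is why an exhaustive grid grows quickly. A small check of that arithmetic for the grid above (the variable names `n_candidates` and `n_fits` are illustrative):

```python
from math import prod

# Same grid as param_grid_xgb above.
param_grid_xgb = {
    "max_depth": [4, 6, 8],
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.05, 0.1],
    "min_child_weight": [1, 5],
    "gamma": [0, 0.1, 0.2],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "reg_lambda": [1, 5],
    "reg_alpha": [0, 0.5],
}

n_splits = 5  # folds used by search_hyper
n_candidates = prod(len(values) for values in param_grid_xgb.values())
n_fits = n_candidates * n_splits

print(f"{n_candidates} candidates x {n_splits} folds = {n_fits} fits")
```

Adding one more two-valued hyperparameter would double the number of fits, so a randomized search may be worth considering if the grid is extended further.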


In [780]:
xgb_hyper = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    eval_metric='logloss',
    min_child_weight=5,
    colsample_bytree=0.8,
    gamma=0,
    reg_alpha=0,
    reg_lambda=5,
    subsample=1.0
)

model_results(xgb_hyper, NameTag.XGB.value, ModelTag.HYPER_TUNED.value)

XGBoost (Hyperparameter Tuned)
AUC: 0.9288

Confusion Matrix:
[[2701  426]
 [ 102  470]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.96      0.86      0.91      3127
    Purchase       0.52      0.82      0.64       572

    accuracy                           0.86      3699
   macro avg       0.74      0.84      0.78      3699
weighted avg       0.90      0.86      0.87      3699


F2 (Purchase=1): 0.7381
F2 (No Purchase=0): 0.8820


In [769]:
search_threshold(xgb_hyper)

Best threshold: 0.438, Best F2: 0.7560 (precision >= 0.5)


In [770]:
model_results(xgb_hyper, NameTag.XGB.value, ModelTag.THRESH_TUNED.value, 0.438)

XGBoost (Threshold Tuned)
AUC: 0.9288

Confusion Matrix:
[[2642  485]
 [  78  494]]

Classification Report:
              precision    recall  f1-score   support

 No Purchase       0.97      0.84      0.90      3127
    Purchase       0.50      0.86      0.64       572

    accuracy                           0.85      3699
   macro avg       0.74      0.85      0.77      3699
weighted avg       0.90      0.85      0.86      3699


F2 (Purchase=1): 0.7560
F2 (No Purchase=0): 0.8675


In [790]:
def xgb_gain_importance_table(xgb_model, X, top_n=None):
    """
    xgb_model: fitted xgb.XGBClassifier
    X: training dataframe (only used for column names/order)
    """
    booster = xgb_model.get_booster()
    score_dict = booster.get_score(importance_type="gain")

    feature_names = list(X.columns)

    rows = []
    for k, v in score_dict.items():
        # keys might be "f0", "f1", ... OR actual feature names
        if k.startswith("f") and k[1:].isdigit():
            idx = int(k[1:])
            feat = feature_names[idx] if idx < len(feature_names) else k
        else:
            feat = k
        rows.append((feat, v))

    imp_df = (
        pd.DataFrame(rows, columns=["Feature", "Gain"])
          .sort_values(by="Gain", ascending=False)
    )

    if top_n is not None:
        imp_df = imp_df.head(top_n)

    print("XGBoost Feature Importance (gain)\n")
    print(imp_df.to_string(index=False))
    print()

xgb_gain_importance_table(xgb_hyper, X_train_scaled, top_n=20)

XGBoost Feature Importance (gain)

                      Feature        Gain
                has_pagevalue 1039.861938
                   PageValues  339.203796
   pagevalue_exit_interaction  223.375641
                    Month_Nov   87.577217
                    Month_Mar   67.835739
                    Month_May   61.472225
                    Month_Sep   46.346741
                    ExitRates   29.144566
                  total_pages   24.140759
               total_duration   23.428366
VisitorType_Returning_Visitor   18.233158
                product_focus   16.845255
                  BounceRates   16.054697
              engagement_rate   11.746968
                    Month_Oct   11.664551
                    Month_Feb   11.003519
                    Month_Dec    9.972809
                    Month_Jul    7.841631



## 3. Model Selection

In [4]:
results_data = {
    'Algorithm': [
        'Logistic Regression', 'Logistic Regression', 'Logistic Regression', 'Logistic Regression',
        'Random Forest', 'Random Forest', 'Random Forest', 'Random Forest',
        'XGBoost', 'XGBoost', 'XGBoost', 'XGBoost'
    ],
    'Stage': [
        'Baseline', 'Balanced', 'Hypertuned', 'Threshold',
        'Baseline', 'Balanced', 'Hypertuned', 'Threshold',
        'Baseline', 'Balanced', 'Hypertuned', 'Threshold'
    ],
    'Threshold': [
        0.50, 0.50, 0.50, 0.278,
        0.50, 0.50, 0.50, 0.291,
        0.50, 0.50, 0.50, 0.438
    ],
    'AUC': [
        0.9135, 0.9160, 0.9076, 0.9076,
        0.9140, 0.9149, 0.9247, 0.9247,
        0.9226, 0.9285, 0.9288, 0.9288
    ],
    'Recall': [
        0.57, 0.80, 0.78, 0.81,
        0.60, 0.57, 0.73, 0.87,
        0.58, 0.82, 0.82, 0.86
    ],
    'Precision': [
        0.70, 0.53, 0.55, 0.52,
        0.70, 0.71, 0.60, 0.51,
        0.70, 0.53, 0.52, 0.50
    ],
    'F1': [
        0.63, 0.64, 0.64, 0.63,
        0.65, 0.63, 0.66, 0.64,
        0.63, 0.64, 0.64, 0.64
    ],
    'F2': [
        0.5889, 0.7272, 0.7210, 0.7255,
        0.6225, 0.5948, 0.7019, 0.7596,
        0.6010, 0.7406, 0.7381, 0.7560
    ],
    'Buyers_Caught': [
        324, 460, 448, 461,
        346, 327, 420, 496,
        332, 471, 470, 494
    ]
}

results_df = pd.DataFrame(results_data)

In [21]:
for algo in ['Logistic Regression', 'Random Forest', 'XGBoost']:
    algo_df = results_df[results_df['Algorithm'] == algo]
    print(f"\n{algo}:")
    print("-" * 100)
    print(f"{'Stage':<15} {'AUC':<8} {'Recall':<8} {'Precision':<11} {'F2':<8} {'Buyers Caught':<15}")
    print("-" * 100)
    for _, row in algo_df.iterrows():
        print(f"{row['Stage']:<15} {row['AUC']:<8.4f} {row['Recall']:<8.2f} {row['Precision']:<11.2f} {row['F2']:<8.4f} {row['Buyers_Caught']:.0f}/572")


Logistic Regression:
----------------------------------------------------------------------------------------------------
Stage           AUC      Recall   Precision   F2       Buyers Caught  
----------------------------------------------------------------------------------------------------
Baseline        0.9135   0.57     0.70        0.5889   324/572
Balanced        0.9160   0.80     0.53        0.7272   460/572
Hypertuned      0.9076   0.78     0.55        0.7210   448/572
Threshold       0.9076   0.81     0.52        0.7255   461/572

Random Forest:
----------------------------------------------------------------------------------------------------
Stage           AUC      Recall   Precision   F2       Buyers Caught  
----------------------------------------------------------------------------------------------------
Baseline        0.9140   0.60     0.70        0.6225   346/572
Balanced        0.9149   0.57     0.71        0.5948   327/572
Hypertuned      0.9247   0.73     0.60

### **Selected Model: Random Forest (Threshold-Tuned)**

1. Highest recall (0.87) – catches 496/572 buyers (86.7%)

2. Best F2 score (0.7596) – optimizes recall while maintaining acceptable precision

3. Best expected ROI despite relatively low precision (0.51)
    - Assume that the false positives of this model are advertisements or discounts sent to non-buyers.
    - Under that assumption, the cost of contacting a non-buyer is small compared with the revenue from reaching an additional buyer, so the model with the highest recall yields the most revenue.
    - The quantity we ultimately want to maximize is recall, the share of potential buyers reached.
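The ROI argument can be made concrete with a toy calculation. The true-positive and false-positive counts below are taken from the threshold-tuned confusion matrices in Section 2, but `margin_per_buyer` and `cost_per_contact` are purely hypothetical figures chosen for illustration, and the sketch assumes every flagged session receives one contact.

```python
# (true positives, false positives) for each threshold-tuned model in Section 2.
candidates = {
    "Logistic Regression": (461, 428),
    "Random Forest": (496, 481),
    "XGBoost": (494, 485),
}

# HYPOTHETICAL business figures, not taken from the dataset.
margin_per_buyer = 20.0   # value of reaching one real buyer
cost_per_contact = 0.5    # cost of one advertisement or discount sent

def net_value(tp: int, fp: int) -> float:
    """Value from buyers reached minus the cost of contacting everyone flagged."""
    return tp * margin_per_buyer - (tp + fp) * cost_per_contact

values = {name: net_value(tp, fp) for name, (tp, fp) in candidates.items()}
best = max(values, key=values.get)

for name, value in values.items():
    print(f"{name:<20} {value:>10.1f}")
print(f"Best under these assumptions: {best}")
```

With a much higher contact cost relative to the margin, a higher-precision model could come out ahead, so the ranking should be rechecked against real cost figures.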

In [5]:
import os
os.makedirs('../reports', exist_ok=True)
results_df.to_csv('../reports/model_comparison_results.csv', index=False)
print("\nModel results saved to model_comparison_results.csv in 'reports' folder")


Model results saved to model_comparison_results.csv in 'reports' folder
