# Prediction (Supervised Learning)

### Dataset and Problem Setup

This notebook focuses on the supervised learning part of the project.  
The goal is to predict whether an e-commerce session results in a purchase, using behavioural features from the Online Shoppers Purchasing Intention dataset. The target variable is `Revenue`, which indicates whether a purchase was completed during a session.

In [16]:
import pandas as pd
df = pd.read_csv("online_shoppers_intention.csv")

df.head()
df.info()
df["Revenue"].value_counts(normalize=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

Revenue
False    0.845255
True     0.154745
Name: proportion, dtype: float64

In [17]:
X = df.drop("Revenue", axis=1)
y = df["Revenue"]

### Train–Test Split

The dataset is split into training and test sets before any preprocessing or model training.  
Stratified sampling keeps the class ratio (about 85% non-purchasing to 15% purchasing sessions) the same in both sets, so the test metrics are computed on a class distribution that matches the full dataset.


In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
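
A quick way to confirm that stratification behaves as described is to compare the positive rate in each split. The sketch below is only illustrative: it uses a synthetic label vector with a 15% positive rate instead of the real dataset, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 sessions, 15% purchases (assumption).
y_demo = pd.Series([True] * 150 + [False] * 850, name="Revenue")
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratify=y_demo, both splits should keep roughly the same positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print("train positive rate:", train_rate)
print("test positive rate: ", test_rate)
```

The same two `mean()` calls can be run on `y_train` and `y_test` from the cell above to check the real split.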

### Preprocessing

Numerical and categorical features are processed separately.  
Numerical features are standardised, while categorical features are one-hot encoded. The preprocessing steps are later integrated into the model pipelines, so that scaling statistics and category lists are learned from the training data only and no information leaks from the test set.


In [19]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define categorical and numeric features explicitly
categorical_features = [
    "Month", "VisitorType", "Weekend",
    "Browser", "Region", "TrafficType", "OperatingSystems"
]

numeric_features = [c for c in X_train.columns if c not in categorical_features]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

### Supervised Learning Models

Three supervised classification models are defined: Logistic Regression, Decision Tree, and Random Forest.  
Logistic Regression is used as an interpretable baseline model, Decision Trees capture non-linear decision rules, and Random Forest is included as a robust ensemble model suitable for tabular data.


In [20]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# TODO (if not already done) - use Hyperparameter Tuning to find best parameters for each model - or just do it on the best one
logreg_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced"))
])

dt_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", DecisionTreeClassifier(
        class_weight="balanced",
        random_state=42
    ))
])

rf_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(
        n_estimators=300,
        class_weight="balanced_subsample",
        random_state=42,
        n_jobs=1
    ))
])

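> With `class_weight="balanced"` (or `"balanced_subsample"` for the Random Forest), each class is weighted inversely to its frequency. With roughly 85% "Non-Buyers" and 15% "Buyers", one error on a "Buyer" therefore costs about as much as 5.5 errors on "Non-Buyers". This counteracts the bias toward the majority class.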
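
The weights behind the `"balanced"` option can be reproduced with scikit-learn's `compute_class_weight`, which applies `n_samples / (n_classes * count_per_class)`. The sketch below uses the class counts implied by the proportions printed earlier (10,422 and 1,908 out of 12,330 sessions); the stratified training split has the same ratio up to rounding. The array `y_demo` is a hypothetical stand-in for the real labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts implied by the printed proportions (assumption: full dataset,
# 12,330 sessions = 10,422 non-purchases + 1,908 purchases).
y_demo = np.array([0] * 10422 + [1] * 1908)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
ratio = weights[1] / weights[0]

print("weight for class 0 (non-buyer):", round(float(weights[0]), 3))
print("weight for class 1 (buyer):    ", round(float(weights[1]), 3))
print("buyer / non-buyer weight ratio:", round(float(ratio), 2))
```

The ratio of the two weights equals the ratio of the class counts, which is where the figure of about 5.5 comes from.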

### Hyperparameter Tuning (Random Forest)

Hyperparameter tuning is performed for the Random Forest model to see whether it improves on the default configuration.  
A small grid (eight candidates, 3-fold cross-validation, scored by ROC-AUC) keeps the computational cost proportional to the project scope. The tuned model is later used for final evaluation.


In [21]:
# Hyperparameter tuning for Random Forest (baseline)
import numpy as np
from sklearn.model_selection import GridSearchCV

# Force y to a plain 1D int array (0/1) to avoid any dtype edge-cases
y_train_fixed = (
    y_train.values.ravel() if hasattr(y_train, "values") else np.array(y_train).ravel()
).astype(int)

param_grid = {
    "model__n_estimators": [200, 400],
    "model__max_depth": [None, 20],
    "model__min_samples_leaf": [1, 5],
    "model__max_features": ["sqrt"],
}

grid = GridSearchCV(
    rf_pipeline,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,  # an integer cv uses stratified folds for classifiers
    n_jobs=1,
    verbose=1
)

grid.fit(X_train, y_train_fixed)


best_rf = grid.best_estimator_

print("Best CV ROC-AUC:", grid.best_score_)
print("Best params:", grid.best_params_)

Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best CV ROC-AUC: 0.9280743513199856
Best params: {'model__max_depth': None, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 5, 'model__n_estimators': 400}
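
`best_score_` alone does not show how close the other candidates were. One way to inspect this is to turn `cv_results_` into a table, as in the sketch below. It runs a much smaller search on synthetic data from `make_classification` so that it is self-contained; `demo_grid`, the grid values and the data are assumptions, not the notebook's actual search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic, imbalanced problem standing in for the real training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=42
)

demo_grid = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid={"min_samples_leaf": [1, 5], "max_depth": [None, 5]},
    scoring="roc_auc",
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate, best-ranked configuration first.
results = pd.DataFrame(demo_grid.cv_results_).sort_values("rank_test_score")
results = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(results.to_string(index=False))
```

Applied to `grid` from the cell above, the same two lines would show whether the best configuration is clearly ahead or within the fold-to-fold spread (`std_test_score`) of the alternatives.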


### Baseline Model Evaluation

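The supervised models are evaluated using multiple metrics, including precision, recall, F1-score, ROC-AUC, and log-loss.  
Accuracy alone is misleading here, because predicting "no purchase" for every session would already be about 85% accurate. Precision, recall, and F1-score describe performance on the minority class, ROC-AUC measures how well sessions are ranked, and log-loss assesses the quality of the predicted probabilities.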
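
To see why log-loss adds information that the classification report does not, the sketch below compares two sets of made-up probabilities that produce exactly the same predictions at a 0.5 threshold. One is moderately confident; the other contains only 0 and 1, which is what a fully grown decision tree with pure leaves can output. All values are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss

y_true = np.array([0, 0, 0, 1, 1])

# Both score vectors give the predictions [0, 0, 1, 1, 0] at a 0.5 threshold.
moderate = np.array([0.2, 0.3, 0.6, 0.7, 0.4])
hard = np.array([0.0, 0.0, 1.0, 1.0, 0.0])

f1_moderate = f1_score(y_true, (moderate >= 0.5).astype(int))
f1_hard = f1_score(y_true, (hard >= 0.5).astype(int))

# log_loss clips probabilities away from exactly 0 and 1, so the result is
# finite, but each confidently wrong prediction is penalised very heavily.
ll_moderate = log_loss(y_true, moderate)
ll_hard = log_loss(y_true, hard)

print("F1       moderate:", f1_moderate, " hard:", f1_hard)
print("log-loss moderate:", ll_moderate, " hard:", ll_hard)
```

The F1-scores are identical while the log-loss values differ by more than an order of magnitude, which is why a model can look reasonable in the classification report and still have poorly calibrated probabilities.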


In [24]:
import numpy as np
from sklearn.metrics import classification_report, roc_auc_score, log_loss

# TODO - maybe add Youden's Index to this so it is not just evaluated as an afterthought in the end
def evaluate(model, X_train, y_train, X_test, y_test, name="model"):
    """Fit the model on the training data and print test-set metrics."""
    # Cast the boolean labels to flat 0/1 integer arrays
    y_train_int = np.array(y_train).astype(int).ravel()
    y_test_int = np.array(y_test).astype(int).ravel()

    model.fit(X_train, y_train_int)

    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

    print(f"\n=== {name} ===")
    print(classification_report(y_test_int, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test_int, y_proba))
    print("Log-loss:", log_loss(y_test_int, y_proba))

evaluate(logreg_pipeline, X_train, y_train, X_test, y_test, "Logistic Regression")
evaluate(dt_pipeline, X_train, y_train, X_test, y_test, "Decision Tree")
evaluate(rf_pipeline, X_train, y_train, X_test, y_test, "Random Forest")


=== Logistic Regression ===
              precision    recall  f1-score   support

           0       0.95      0.86      0.90      2084
           1       0.49      0.74      0.59       382

    accuracy                           0.84      2466
   macro avg       0.72      0.80      0.75      2466
weighted avg       0.88      0.84      0.85      2466

ROC-AUC: 0.8932442142074746
Log-loss: 0.45588421375820165

=== Decision Tree ===
              precision    recall  f1-score   support

           0       0.91      0.91      0.91      2084
           1       0.52      0.53      0.52       382

    accuracy                           0.85      2466
   macro avg       0.72      0.72      0.72      2466
weighted avg       0.85      0.85      0.85      2466

ROC-AUC: 0.7195322627649204
Log-loss: 5.364160905841848

=== Random Forest ===
              precision    recall  f1-score   support

           0       0.91      0.97      0.94      2084
           1       0.77      0.47      0.58     

### Threshold Optimisation (Logistic Regression)

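In addition to standard evaluation, Youden’s Index is used to analyse the effect of different classification thresholds for Logistic Regression. The index is defined as J = sensitivity + specificity − 1, which equals the true positive rate minus the false positive rate, and the threshold that maximises J is reported.  
This analysis illustrates how threshold choice influences the trade-off between sensitivity and specificity.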

In [23]:
import numpy as np
from sklearn.metrics import roc_curve

logreg_pipeline.fit(X_train, y_train)
proba = logreg_pipeline.predict_proba(X_test)[:, 1]

# Note: the threshold is selected on the test set here, so the reported
# index is an optimistic, illustrative value.
fpr, tpr, thresholds = roc_curve(y_test, proba)
youden = tpr - fpr  # Youden's J at every candidate threshold
best_idx = np.argmax(youden)
best_threshold = thresholds[best_idx]
print("Best threshold (Youden):", best_threshold)
print("Youden's Index:", youden[best_idx])

Best threshold (Youden): 0.4075656607707226
Youden's Index: 0.6184492166695139
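
The cell above finds a threshold but does not apply it. The sketch below shows one way to turn a Youden-optimal threshold into class predictions and to read sensitivity and specificity from the resulting confusion matrix. It uses made-up labels and scores (`y_demo`, `proba_demo`) in place of `y_test` and `proba`, so the numbers it prints are not results for this dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve

# Made-up labels and scores standing in for y_test and proba.
y_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba_demo = np.array([0.05, 0.20, 0.30, 0.45, 0.50, 0.60, 0.80, 0.90])

fpr, tpr, thresholds = roc_curve(y_demo, proba_demo)
j_scores = tpr - fpr  # Youden's J = sensitivity + specificity - 1
threshold_demo = thresholds[np.argmax(j_scores)]

# Classify a session as "purchase" when its score reaches the chosen threshold.
y_pred_demo = (proba_demo >= threshold_demo).astype(int)
tn, fp, fn, tp = confusion_matrix(y_demo, y_pred_demo).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print("threshold:", threshold_demo)
print("sensitivity:", sensitivity, "specificity:", specificity)
```

In practice the threshold would usually be chosen on validation data or cross-validated training predictions and only then applied to the test set, so that the test metrics remain an unbiased estimate.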
