In [7]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
customer_df = pd.read_csv('../data/clean/customer_dataset_transformed.csv')

display(customer_df.head())


Unnamed: 0,fullVisitorId,visit_number,hits_per_visit,bounced,time_on_site,totals_transactionRevenue,log_transactionRevenue,day_of_week,month,channelGrouping,device_category,country,is_weekend,is_holiday_season
0,59488412965267,1,1,1.0,0,0,0.0,2,2,2,1,213,0.0,0.0
1,85840370633780,1,2,0.0,13,0,0.0,4,9,4,0,213,0.0,0.0
2,118334805178127,1,1,1.0,0,0,0.0,4,10,3,1,213,0.0,0.0
3,166374699289385,1,5,0.0,41,0,0.0,2,8,4,0,213,0.0,0.0
4,197671390269035,1,1,1.0,0,0,0.0,1,5,7,1,213,0.0,0.0


# Model Implementation

### Stage 1: Train a Classification model to predict if a user will make a purchase (`log_transactionRevenue` > 0)
#### Create a binary `purchase_flag` column (`purchase_flag = (log_transactionRevenue > 0).astype(int)`)
- 1 if `log_transactionRevenue` > 0 (the user made a purchase).
- 0 if `log_transactionRevenue` == 0 (the user did not make a purchase).


In [27]:
# Create the purchase_flag column
customer_df['purchase_flag'] = (customer_df['log_transactionRevenue'] > 0).astype(int)

# Check the distribution of the target variable
print(customer_df['purchase_flag'].value_counts())

purchase_flag
0    803590
1      9772
Name: count, dtype: int64
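
To make the imbalance concrete, the counts printed above can be turned into a purchase rate and a class ratio. This is a minimal sketch that hard-codes the two counts from the output so it runs on its own; the variable names are illustrative. In the notebook itself, `customer_df['purchase_flag'].value_counts(normalize=True)` would give the same proportions.

```python
# Class counts copied from the value_counts() output above
n_non_buyers = 803590
n_buyers = 9772

total = n_non_buyers + n_buyers
purchase_rate = n_buyers / total           # share of rows with a purchase
imbalance_ratio = n_non_buyers / n_buyers  # non-buyers per buyer

print(f"Total rows: {total}")
print(f"Purchase rate: {purchase_rate:.2%}")
print(f"Non-buyers per buyer: {imbalance_ratio:.1f}")
```

Roughly 1.2% of rows are purchases, so any metric dominated by class 0 should be read with care.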


<br>

#### Prepare Data for Classification

In [44]:
from sklearn.model_selection import train_test_split

# Features and target
X = customer_df.drop(columns=['log_transactionRevenue', 'purchase_flag', 'totals_transactionRevenue', 'fullVisitorId'])
y = customer_df['purchase_flag']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")


Training set size: (650689, 11)
Testing set size: (162673, 11)
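
Because the split uses `stratify=y`, the buyer rate should be nearly identical in the training and test sets. The sketch below checks this with arithmetic only, using counts printed elsewhere in this notebook (the `value_counts()` output, the split sizes, and the class 1 support in the classification report). The variable names are illustrative.

```python
# Counts copied from the notebook outputs
total_rows = 803590 + 9772   # non-buyers + buyers
total_buyers = 9772
test_rows = 162673           # size of X_test
test_buyers = 1954           # class 1 support in the test set

# Whatever is not in the test set is in the training set
train_rows = total_rows - test_rows
train_buyers = total_buyers - test_buyers

overall_rate = total_buyers / total_rows
train_rate = train_buyers / train_rows
test_rate = test_buyers / test_rows

print(f"Overall buyer rate: {overall_rate:.4%}")
print(f"Train buyer rate:   {train_rate:.4%}")
print(f"Test buyer rate:    {test_rate:.4%}")
```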


<br>

#### Train a Classification Model: Random Forest
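Random Forest is selected as the initial model because tree-based splits are insensitive to feature scale, so the numerical features and the integer-encoded categorical features (such as `channelGrouping`, `device_category`, and `country`) can be used without scaling.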

In [56]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Train a Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_model.predict(X_test)
y_pred_prob = rf_model.predict_proba(X_test)[:, 1]  # Probabilities for class 1, required to evaluate the performance metric ROC-AUC

# Evaluate the model
print("Classification Report:\n", classification_report(y_test, y_pred))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_pred_prob):.4f}")


Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99    160719
           1       0.53      0.27      0.36      1954

    accuracy                           0.99    162673
   macro avg       0.76      0.63      0.68    162673
weighted avg       0.99      0.99      0.99    162673

ROC-AUC Score: 0.9768


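ROC-AUC measures how well the model ranks the two classes: it is the probability that a randomly chosen buyer receives a higher predicted score than a randomly chosen non-buyer.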
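
ROC-AUC can be read as a ranking statistic: the fraction of (buyer, non-buyer) pairs in which the buyer gets the higher score, with ties counted as half. The sketch below computes it that way on a small made-up example; the function name `roc_auc_pairwise` and the toy labels and scores are hypothetical and are not taken from the dataset. The pairwise loop is only practical for small inputs; `roc_auc_score` is the tool to use on the real test set.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values)
toy_labels = [0, 0, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.9]
toy_auc = roc_auc_pairwise(toy_labels, toy_scores)
print(f"Pairwise ROC-AUC: {toy_auc:.4f}")
```

Here 4 of the 6 pairs are ordered correctly, so the value is 4/6.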

#### Interpretation
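1. **ROC-AUC Score: 0.9768**: The model is very good at ranking buyers (class 1) above non-buyers (class 0). However, a high AUC does not guarantee good performance on imbalanced datasets: it is built from true and false positive rates, each measured within its own class, so it can stay high while precision and recall for the rare class remain low.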

##### Class 1 (Buyers)
2. **Precision = 0.53**: Only 53% of users predicted as buyers were actual buyers, so nearly half of the class 1 predictions are false positives.
3. **Recall = 0.27**: The model identifies only 27% of actual buyers, so most buyers are missed (false negatives are high).
4. **F1-Score = 0.36**: The harmonic mean of precision and recall, which reflects poor performance for buyers (class 1).
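
The class 1 figures can be tied back to approximate confusion-matrix counts. The sketch below works backwards from the rounded precision, recall, and support values in the report, so the counts are estimates rather than the exact values the model produced; the variable names are illustrative.

```python
# Values copied from the classification report (rounded to two decimals there)
precision, recall = 0.53, 0.27
n_buyers, n_non_buyers = 1954, 160719   # class supports in the test set

tp = round(recall * n_buyers)           # buyers correctly flagged (approx.)
fn = n_buyers - tp                      # buyers missed (approx.)
predicted_pos = round(tp / precision)   # all rows flagged as buyers (approx.)
fp = predicted_pos - tp                 # non-buyers flagged as buyers (approx.)

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
fpr = fp / n_non_buyers                 # false positive rate (approx.)

print(f"TP ~ {tp}, FN ~ {fn}, FP ~ {fp}")
print(f"F1 = {f1:.2f}")
print(f"False positive rate ~ {fpr:.2%}")
```

The estimated false positive rate is well under 1%, which helps explain how ROC-AUC can be high while precision for buyers is only about one in two: a small fraction of a very large negative class is still comparable in size to the number of true buyers found.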

##### Overall metrics
5. **Accuracy = 0.99**: Accuracy is misleadingly high because class 0 (non-buyers) makes up about 98.8% of the test set (160,719 of 162,673 rows). A model that always predicted class 0 would reach nearly the same accuracy.
6. **Macro Avg (Precision, Recall, F1-Score)**: The unweighted mean of each metric over the two classes, so class 1 counts as much as class 0.
7. **Weighted Avg (Precision, Recall, F1-Score)**: Each metric averaged with weights proportional to the number of samples in each class. Since class 0 dominates, the weighted average mostly reflects its performance.
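
The difference between the two averages can be reproduced by hand. This sketch uses the per-class recall and support values from the report; because those inputs are rounded to two decimals, the results match the report only approximately. The dictionary names are illustrative.

```python
# Per-class recall and support copied from the classification report
support = {0: 160719, 1: 1954}
recall = {0: 1.00, 1: 0.27}

# Macro average: plain mean over classes, ignoring class size
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class weighted by its number of samples
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(f"Macro recall:    {macro_recall:.3f}")
print(f"Weighted recall: {weighted_recall:.3f}")
```

The macro average is pulled down by the weak class 1 recall, while the weighted average stays close to the class 0 value.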


#### Summary:
- Model Bias Towards Class 0:
    - The model performs almost perfectly for class 0 (non-buyers), but struggles with class 1 (buyers). This is common in imbalanced datasets.

- High False Negatives for Class 1:
    - The recall for class 1 is low (0.27), meaning the model misses most buyers. This is problematic because in marketing applications, identifying buyers is the goal.
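
The recall reported above comes from `predict`, which for a binary problem effectively labels a row as class 1 when its predicted probability exceeds 0.5. Since `y_pred_prob` is already available, one way to see how precision and recall trade off is to apply different cutoffs to the probabilities. The sketch below does this on a small made-up example; the helper `precision_recall_at` and the toy labels and scores are hypothetical and are not results from this dataset.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall for class 1 when predicting 1 if score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example (hypothetical values): six non-buyers, four buyers
toy_y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_prob = [0.05, 0.10, 0.15, 0.32, 0.35, 0.60, 0.30, 0.40, 0.55, 0.90]

for cutoff in (0.5, 0.3):
    p, r = precision_recall_at(toy_y, toy_prob, cutoff)
    print(f"cutoff={cutoff}: precision={p:.2f}, recall={r:.2f}")
```

In this toy case, lowering the cutoff from 0.5 to 0.3 finds all four buyers at the cost of more false positives. Whether a similar trade-off is worthwhile on the real data would need to be checked against `y_test` and `y_pred_prob`.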