Task 2: Feature engineering

Environment Setup & Data Loading

We begin by importing necessary libraries and loading the dataset. 

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             ConfusionMatrixDisplay, f1_score, precision_score, recall_score)
from sklearn.model_selection import train_test_split

file_path = r'.\data\online_shoppers_intention.csv'
df = pd.read_csv(file_path)

print(f"Dataset loaded with {df.shape[0]} rows and {df.shape[1]} columns.")

Dataset loaded with 12330 rows and 18 columns.


Establishing the Baseline Model

We first train a model on four raw, untransformed numerical columns. Its accuracy is the reference point for the engineered features that follow.

In [38]:
# Baseline: four raw numeric columns, no transformations
baseline_features = ['Administrative', 'Informational', 'ProductRelated', 'ExitRates']
X_base = df[baseline_features]
y = df['Revenue'].astype(int)  # cast the target to 0/1

# Hold out 20% of the rows for testing
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_base, y, test_size=0.2, random_state=42)

# Random forest with default settings apart from the fixed seed
model_b = RandomForestClassifier(n_estimators=100, random_state=42)
model_b.fit(X_train_b, y_train_b)

acc_baseline = accuracy_score(y_test_b, model_b.predict(X_test_b))
print(f"Baseline Accuracy: {acc_baseline:.4f}")

Baseline Accuracy: 0.8082
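
Accuracy on its own can be hard to interpret when one class is much more common than the other. One common sanity check, sketched here, is to compare against the accuracy of always predicting the most frequent training label. The helper name `majority_class_accuracy` and the toy labels are illustrative assumptions and not part of the original notebook, where it could be called with `y_train_b` and `y_test_b`.

```python
from collections import Counter


def majority_class_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    y_test = list(y_test)
    return sum(label == majority_label for label in y_test) / len(y_test)


# Toy labels (hypothetical): 8 negatives and 2 positives in each split.
y_train_toy = [0] * 8 + [1] * 2
y_test_toy = [0] * 8 + [1] * 2
print(f"Majority-class accuracy: {majority_class_accuracy(y_train_toy, y_test_toy):.4f}")
```

A model whose accuracy sits near this value has learned little beyond the class frequencies, so the number is a useful floor when reading the baseline result.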


Advanced Feature Engineering

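We engineer features that capture user intent (time per product page, page value relative to exit rate, combined bounce and exit rates) and seasonal and visitor context (one-hot encoded month and visitor type). None of this information is visible to the baseline.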
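
As a small illustration of how the three ratio features behave, the sketch below applies the same formulas used in the following cell to a two-row toy frame. The helper name `add_ratio_features` and the toy values are assumptions made for this example only.

```python
import pandas as pd


def add_ratio_features(frame):
    """Return a copy of `frame` with the three engineered columns added."""
    out = frame.copy()
    # Average time per product page; +1 avoids division by zero
    out["intent_intensity"] = out["ProductRelated_Duration"] / (out["ProductRelated"] + 1)
    # Page value relative to exit rate; +0.001 avoids division by zero
    out["value_efficiency"] = out["PageValues"] / (out["ExitRates"] + 0.001)
    # Combined bounce and exit rate
    out["friction_score"] = out["BounceRates"] + out["ExitRates"]
    return out


# Hypothetical sessions: one that leaves immediately, one that browses products.
toy = pd.DataFrame({
    "ProductRelated": [0, 9],
    "ProductRelated_Duration": [0.0, 300.0],
    "PageValues": [0.0, 20.0],
    "ExitRates": [0.2, 0.049],
    "BounceRates": [0.2, 0.0],
})
toy_eng = add_ratio_features(toy)
print(toy_eng[["intent_intensity", "value_efficiency", "friction_score"]])
```

For the second toy row, `intent_intensity` works out to 300 / (9 + 1) = 30, while the first row stays at 0 because no product pages were viewed.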

In [39]:
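# One-hot encode the categorical columns (drop_first avoids a redundant dummy column)
df_eng = pd.get_dummies(df, columns=['Month', 'VisitorType'], drop_first=True)

# Average time per product page; +1 avoids division by zero
df_eng['intent_intensity'] = df_eng['ProductRelated_Duration'] / (df_eng['ProductRelated'] + 1)
# Page value relative to exit rate; +0.001 avoids division by zero
df_eng['value_efficiency'] = df_eng['PageValues'] / (df_eng['ExitRates'] + 0.001)
# Combined bounce and exit rate
df_eng['friction_score'] = df_eng['BounceRates'] + df_eng['ExitRates']

# Drop the target and the Weekend flag from the feature matrix
X_eng = df_eng.drop(['Revenue', 'Weekend'], axis=1)
y_eng = df_eng['Revenue'].astype(int)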
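
The comparison that follows relies on `y_test_e`, `y_pred`, and `acc_advanced`, which come from training a second model on the engineered features. A minimal sketch of what that step might look like is given here. The hyperparameters (`n_estimators=200`, `max_depth=15`) are assumed values and not the notebook's actual tuning, and synthetic data stands in for `X_eng` and `y_eng` so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_eng / y_eng (assumption, for a self-contained example).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Same split settings as the baseline: 20% test set, fixed seed.
X_train_e, X_test_e, y_train_e, y_test_e = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Assumed hyperparameters; the original tuning is not shown in the notebook.
model_e = RandomForestClassifier(n_estimators=200, max_depth=15, random_state=42)
model_e.fit(X_train_e, y_train_e)

y_pred = model_e.predict(X_test_e)
acc_advanced = accuracy_score(y_test_e, y_pred)
print(f"Advanced accuracy (synthetic data): {acc_advanced:.4f}")
```

Reusing the baseline's `test_size` and `random_state` keeps the two test sets comparable, since both splits then select the same rows.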

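Performance Comparison Report

We compare the baseline and the engineered-feature model side by side on accuracy, then report precision, recall, and F1 for the engineered-feature model.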

In [40]:

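# y_test_e and y_pred are the test labels and predictions of the tuned model
# trained on X_eng / y_eng; the cell that produces them is not shown here.
acc_advanced = accuracy_score(y_test_e, y_pred)
prec = precision_score(y_test_e, y_pred)
rec = recall_score(y_test_e, y_pred)
f1 = f1_score(y_test_e, y_pred)

n_base, n_eng = len(baseline_features), X_eng.shape[1]
rel_gain = (acc_advanced - acc_baseline) / acc_baseline * 100  # relative change, in percent

# Side-by-side summary: baseline vs. engineered-feature model
print(f"{'Metric':<12} | {'Baseline':<12} | {'Advanced':<12} | Change")
print("-" * 55)
print(f"{'Accuracy':<12} | {acc_baseline:<12.4f} | {acc_advanced:<12.4f} | {rel_gain:+.2f}% (relative)")
print(f"{'Features':<12} | {n_base:<12} | {n_eng:<12} | +{n_eng - n_base} features")
print(f"{'Model Type':<12} | {'RF (Default)':<12} | {'RF (Tuned)':<12} | Increased Depth")
print("-" * 55)

# Detailed metrics for the engineered-feature model
print(f"Accuracy:  {acc_advanced:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1-Score:  {f1:.4f}")

Metric       | Baseline     | Advanced     | Change
-------------------------------------------------------
Accuracy     | 0.8082       | 0.8942       | +10.64% (relative)
Features     | 4            | 28           | +24 features
Model Type   | RF (Default) | RF (Tuned)   | Increased Depth
-------------------------------------------------------
Accuracy:  0.8942
Precision: 0.7404
Recall:    0.5620
F1-Score:  0.6390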
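
The reported figures can be cross-checked by hand: F1 is the harmonic mean of precision and recall, and the accuracy change in the table is a relative gain rather than a difference in percentage points. The short sketch below redoes that arithmetic from the printed values. The helper names `f1_from` and `relative_gain_pct` are introduced here for illustration only.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_gain_pct(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


f1_check = f1_from(0.7404, 0.5620)
gain_check = relative_gain_pct(0.8082, 0.8942)

print(f"F1 from precision/recall: {f1_check:.4f}")
print(f"Relative accuracy gain:   {gain_check:+.2f}%")
print(f"Absolute accuracy gain:   {(0.8942 - 0.8082) * 100:+.2f} points")
```

Both derived values agree with the report: F1 comes out at 0.6390 and the relative gain at +10.64%, which corresponds to an absolute gain of 8.60 percentage points.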
