# 📘 Notebook 02 – Session-level Feature Engineering

🎯 **Objective:**
This notebook processes the raw clickstream events into structured, numerical features aggregated at the session level. These features are used as inputs for downstream ML modeling (e.g., XGBoost, TabNet).

We:
- Load the 10K-session sample previously extracted from the full dataset
- Aggregate user behaviors (views, carts, purchases) for each session
- Compute session-level metrics (duration, price stats, brand/category diversity)
- Generate a binary label `conversion` based on whether a purchase occurred
- Export `session_features.csv` for training machine learning models

This step turns the raw clickstream events in the sample into a compact table with one row per session, where each column summarizes that session's behavior.

# 02_feature_engineering.ipynb

In [2]:
import pandas as pd
import numpy as np
import os

In [5]:
!pip install google-cloud-storage -q
from google.cloud import storage

In [6]:
from google.colab import auth
auth.authenticate_user()

In [7]:
import pandas as pd
import numpy as np
import os
from google.cloud import storage

# ⬇️ Download sample data from GCS
gcs_path = "clickstream/sample_10k_sessions.csv"
local_path = "../data/sample_10k_sessions.csv"
os.makedirs("../data", exist_ok=True)

client = storage.Client()
bucket = client.bucket("boothill2001-dataset")
blob = bucket.blob(gcs_path)
blob.download_to_filename(local_path)
print(f"✅ Downloaded sample from GCS to {local_path}")

# Load sample dataset
df = pd.read_csv(local_path)

✅ Downloaded sample from GCS to ../data/sample_10k_sessions.csv


In [8]:
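# Convert event_time to datetime if it is not already parsed.
# pd.api.types.is_datetime64_any_dtype also handles timezone-aware columns,
# so the cell can be re-run safely after the conversion.
if not pd.api.types.is_datetime64_any_dtype(df['event_time']):
    df['event_time'] = pd.to_datetime(df['event_time'])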
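
A slightly more defensive variant of the conversion above, as a sketch: passing `errors="coerce"` turns unparsable strings into `NaT`, so bad rows can be counted instead of raising an exception. The helper name `parse_event_time` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def parse_event_time(frame: pd.DataFrame, column: str = "event_time") -> pd.DataFrame:
    """Return a copy of `frame` with `column` parsed to datetimes (hypothetical helper)."""
    out = frame.copy()
    # errors="coerce" replaces values that cannot be parsed with NaT.
    out[column] = pd.to_datetime(out[column], errors="coerce")
    return out


# Made-up rows: one valid timestamp and one broken value.
toy = pd.DataFrame({"event_time": ["2019-10-01 00:00:00", "not a timestamp"]})
parsed = parse_event_time(toy)

n_bad = int(parsed["event_time"].isna().sum())
print(f"{n_bad} unparsable timestamp(s)")
```

Whether rows with `NaT` should be dropped or investigated depends on the data; the notebook itself does not say.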

In [9]:
# Sort events chronologically within each session
df = df.sort_values(by=["user_session", "event_time"])

In [10]:
# Group by session to engineer features
groups = df.groupby("user_session")

In [11]:
session_features = groups.agg(
    session_start=('event_time', 'min'),
    session_end=('event_time', 'max'),
    num_events=('event_type', 'count'),
    num_views=('event_type', lambda x: (x == 'view').sum()),
    num_carts=('event_type', lambda x: (x == 'cart').sum()),
    num_purchases=('event_type', lambda x: (x == 'purchase').sum()),
    avg_price=('price', 'mean'),
    max_price=('price', 'max'),
    num_categories=('category_code', pd.Series.nunique),
    num_brands=('brand', pd.Series.nunique)
).reset_index()
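
The per-type counts above use Python lambdas, which pandas calls once per group. An equivalent formulation, sketched here on made-up toy rows, builds boolean indicator columns first and then uses the built-in `sum` aggregation. The column names `is_view`, `is_cart`, and `is_purchase` are assumptions introduced for this illustration.

```python
import pandas as pd

# Toy events (made-up values) using the notebook's column names.
events = pd.DataFrame({
    "user_session": ["a", "a", "a", "b"],
    "event_type": ["view", "cart", "purchase", "view"],
})

# One boolean column per event type; summing booleans counts the True values.
flags = events.assign(
    is_view=events["event_type"].eq("view"),
    is_cart=events["event_type"].eq("cart"),
    is_purchase=events["event_type"].eq("purchase"),
)

counts = flags.groupby("user_session").agg(
    num_events=("event_type", "count"),
    num_views=("is_view", "sum"),
    num_carts=("is_cart", "sum"),
    num_purchases=("is_purchase", "sum"),
).reset_index()

print(counts)
```

On a 10K-session sample the difference is unlikely to matter; the indicator-column form mainly helps if the same code is later run on a much larger extract.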

In [12]:
# Add session duration in seconds
session_features['session_duration'] = (
    session_features['session_end'] - session_features['session_start']
).dt.total_seconds()
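
A session with a single event has the same start and end timestamp, so its duration is exactly 0.0 seconds. The sketch below reproduces that on made-up toy rows and adds an optional `is_single_event` flag. The flag is an assumption for illustration and is not a feature the notebook computes.

```python
import pandas as pd

# Toy events (made-up values): session "a" has two events, session "b" has one.
events = pd.DataFrame({
    "user_session": ["a", "a", "b"],
    "event_time": pd.to_datetime([
        "2019-10-01 10:00:00",
        "2019-10-01 10:01:05",
        "2019-10-01 11:00:00",
    ]),
})

span = events.groupby("user_session")["event_time"].agg(["min", "max", "count"])

# Duration is last timestamp minus first timestamp, in seconds.
span["session_duration"] = (span["max"] - span["min"]).dt.total_seconds()

# Hypothetical extra column: marks sessions whose zero duration is structural.
span["is_single_event"] = span["count"].eq(1)

print(span[["session_duration", "is_single_event"]])
```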

In [13]:
# Create label: conversion = 1 if the session contains at least one purchase event
session_features['conversion'] = (session_features['num_purchases'] > 0).astype(int)
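
Because `conversion` is computed directly from `num_purchases`, a model given both columns could read the label straight off that feature. One possible way to separate inputs from the label before training is sketched below on made-up toy rows. The list `leaky_or_id_cols` is a hypothetical choice, not something the notebook prescribes.

```python
import pandas as pd

# Toy rows (made-up values) shaped like a subset of session_features.
toy_features = pd.DataFrame({
    "user_session": ["a", "b"],
    "num_views": [3, 1],
    "num_purchases": [1, 0],
    "conversion": [1, 0],
})

# Assumption: drop the session ID, the label itself, and the column the label is derived from.
leaky_or_id_cols = ["user_session", "num_purchases", "conversion"]

X = toy_features.drop(columns=leaky_or_id_cols)
y = toy_features["conversion"]

print(list(X.columns))
```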

In [14]:
# Drop raw time columns
session_features = session_features.drop(columns=['session_start', 'session_end'])

In [15]:
# Preview
print("\n✅ Session-level features preview:")
display(session_features.head())


✅ Session-level features preview:


|   | user_session | num_events | num_views | num_carts | num_purchases | avg_price | max_price | num_categories | num_brands | session_duration | conversion |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 001259d1-d126-4601-a53e-b511cf3a3152 | 1 | 1 | 0 | 0 | 154.18 | 154.18 | 0 | 1 | 0.0 | 0 |
| 1 | 001f8c1d-dc62-4f75-9339-fda1772626f4 | 2 | 2 | 0 | 0 | 58.345 | 113.59 | 0 | 0 | 65.0 | 0 |
| 2 | 00322ee2-5eec-4647-9499-eae70633ec5b | 1 | 1 | 0 | 0 | 1181.5 | 1181.5 | 1 | 1 | 0.0 | 0 |
| 3 | 00334c2e-e8e5-463f-ae4e-f44f9796ccf4 | 2 | 2 | 0 | 0 | 1541.61 | 1541.61 | 1 | 1 | 54.0 | 0 |
| 4 | 00399d10-e0d3-4d17-9760-30c42851777c | 3 | 3 | 0 | 0 | 1012.19 | 1012.19 | 1 | 1 | 92.0 | 0 |


In [16]:
# Save processed session features
os.makedirs("../data", exist_ok=True)
session_features.to_csv("../data/session_features.csv", index=False)
print("\n✅ Saved session_features.csv with", session_features.shape[0], "sessions")


✅ Saved session_features.csv with 10000 sessions
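
Before handing the file to a training notebook, a quick read-back check can catch export problems. The sketch below runs a few basic assertions on a round-tripped table. It uses made-up toy rows and an in-memory buffer instead of `../data/session_features.csv`, and `check_session_features` is a hypothetical helper name.

```python
import io

import pandas as pd


def check_session_features(frame: pd.DataFrame) -> float:
    """Run basic sanity checks and return the conversion rate (hypothetical helper)."""
    assert frame["user_session"].is_unique, "expected one row per session"
    assert frame["conversion"].isin([0, 1]).all(), "label must be binary"
    assert (frame["session_duration"] >= 0).all(), "durations cannot be negative"
    return float(frame["conversion"].mean())


# Made-up rows standing in for the exported table.
toy = pd.DataFrame({
    "user_session": ["a", "b"],
    "num_events": [3, 1],
    "session_duration": [65.0, 0.0],
    "conversion": [1, 0],
})

# Round-trip through CSV text, mirroring to_csv(..., index=False) followed by read_csv.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

rate = check_session_features(reloaded)
print(f"conversion rate: {rate:.2f}")
```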
