# 📘 Notebook 01 – Data Exploration for Clickstream Dataset

🎯 **Objective:**
This notebook performs an initial exploration of the raw clickstream dataset (5.67 GB) to check data quality, understand the schema, and prepare a manageable sample for downstream processing.

We:
- Download the dataset using `kagglehub`
- Load the main CSV (`2019-Oct.csv`) and preview the structure
- Convert the `event_time` column into datetime format
- Calculate the number of unique users, sessions, brands, and product categories
- Export a 10K-session sample for testing and fast iteration later
- Upload both the full dataset and the sample to a GCS bucket for centralized access

This step ensures a clean starting point and helps optimize memory usage before full-scale feature engineering.
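
As a rough sketch of the memory point above: declaring column dtypes at read time is one common way to reduce the footprint of a large CSV in pandas. The inline three-row CSV and the `DTYPES` mapping below are illustrative assumptions, not the loading code this notebook actually runs. The column names are taken from the schema printed further down.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for 2019-Oct.csv (same column names).
RAW = io.StringIO(
    "event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session\n"
    "2019-10-01 00:00:00 UTC,view,1001,2001,electronics.smartphone,acme,199.99,501,s-1\n"
    "2019-10-01 00:00:05 UTC,cart,1001,2001,electronics.smartphone,acme,199.99,501,s-1\n"
    "2019-10-01 00:01:00 UTC,view,1002,2002,,,15.50,502,s-2\n"
)

# Assumed dtype choices: low-cardinality strings become categoricals,
# and some numeric columns use narrower types than the 64-bit defaults.
DTYPES = {
    "event_type": "category",
    "category_code": "category",
    "brand": "category",
    "product_id": "int32",
    "price": "float32",
}

df_small = pd.read_csv(RAW, dtype=DTYPES)
# Parse timestamps as timezone-aware UTC values.
df_small["event_time"] = pd.to_datetime(df_small["event_time"], utc=True)

print(df_small.dtypes)
print(df_small.memory_usage(deep=True).sum(), "bytes")
```

Whether these narrower types are safe for the full file depends on the actual value ranges, so it is worth checking them on the real data before relying on this mapping.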

# 01_data_exploration.ipynb

In [1]:
# 01_data_exploration.ipynb
# Initial data loading & exploration for clickstream conversion project

!pip install kagglehub -q
import kagglehub
import pandas as pd
import os

# ⬇️ Download dataset from Kaggle
path = kagglehub.dataset_download("saiparthas/ecommerce-churn")
print("📁 Dataset downloaded to:", path)

# Load the CSV file (expecting 2019-Oct.csv)
df_path = os.path.join(path, "2019-Oct.csv")
df = pd.read_csv(df_path)

# Preview
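print("\n✅ Sample rows:")
# Print explicitly: a bare df.head() is only rendered when it is the last expression in a cell
print(df.head())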

# Check schema and shape
print("\n🔍 Columns:", df.columns.tolist())
print("🧮 Shape:", df.shape)

# Convert event_time
print("\n⏳ Converting event_time to datetime...")
df['event_time'] = pd.to_datetime(df['event_time'])

# Basic stats
print("\n👥 Unique users:", df['user_id'].nunique())
print("🛒 Unique sessions:", df['user_session'].nunique())
print("📦 Unique brands:", df['brand'].nunique())
print("📂 Unique categories:", df['category_code'].nunique())




Downloading from https://www.kaggle.com/api/v1/datasets/download/saiparthas/ecommerce-churn?dataset_version_number=1...

100%|██████████| 1.61G/1.61G [00:18<00:00, 94.1MB/s]

Extracting files...

📁 Dataset downloaded to: /root/.cache/kagglehub/datasets/saiparthas/ecommerce-churn/versions/1

✅ Sample rows:

🔍 Columns: ['event_time', 'event_type', 'product_id', 'category_id', 'category_code', 'brand', 'price', 'user_id', 'user_session']
🧮 Shape: (42448764, 9)

⏳ Converting event_time to datetime...

👥 Unique users: 3022290
🛒 Unique sessions: 9244421
📦 Unique brands: 3444
📂 Unique categories: 126
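
The counts above describe cardinality but not completeness. As a minimal sketch of the data-quality check named in the objective, the helper below reports missing values per column. `quality_report` and the toy frame are hypothetical and stand in for the full `df`.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    missing = frame.isna().sum()
    report = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(2),
    })
    return report.sort_values("missing", ascending=False)

# Toy frame standing in for the real df (values are made up).
toy = pd.DataFrame({
    "event_type": ["view", "view", "cart", "purchase"],
    "brand": ["acme", None, None, "zeta"],
    "user_session": ["s-1", "s-1", "s-2", None],
})

print(quality_report(toy))
# Share of each event type, useful for spotting unexpected labels.
print(toy["event_type"].value_counts(normalize=True))
# Fully duplicated rows can indicate logging or export problems.
print("duplicate rows:", toy.duplicated().sum())
```

On the real data the same calls would be `quality_report(df)` and `df["event_type"].value_counts(normalize=True)`.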


OSError: Cannot save file into a non-existent directory: '../data'

In [2]:
# Create the output directory first; to_csv does not create missing directories
os.makedirs("../data", exist_ok=True)
# Save a small sample: all events from the first 10k distinct sessions in file order
sample_sessions = df['user_session'].dropna().unique()[:10000]
df_sample = df[df['user_session'].isin(sample_sessions)]
df_sample.to_csv("../data/sample_10k_sessions.csv", index=False)
print("\n✅ Saved sample to data/sample_10k_sessions.csv")


✅ Saved sample to data/sample_10k_sessions.csv
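
The cell above keeps the first 10,000 session IDs in file order. If the file is sorted by time, those are the earliest sessions only. One possible alternative, sketched here under that assumption, is to draw session IDs at random with a fixed seed so the sample is reproducible. `sample_sessions_random` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def sample_sessions_random(frame, n_sessions, seed=42, session_col="user_session"):
    """Keep every event belonging to a seeded random subset of sessions."""
    sessions = pd.Series(frame[session_col].dropna().unique())
    # Never ask for more sessions than exist.
    n = min(n_sessions, len(sessions))
    chosen = sessions.sample(n=n, random_state=seed)
    return frame[frame[session_col].isin(chosen)]

# Toy frame standing in for df (values are made up).
toy = pd.DataFrame({
    "user_session": ["s-1", "s-1", "s-2", "s-3", None, "s-4"],
    "event_type": ["view", "cart", "view", "view", "view", "purchase"],
})

sample = sample_sessions_random(toy, n_sessions=2)
print(sample)
```

Sampling whole sessions rather than individual rows keeps each session's event sequence intact, which matches what the original cell does.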


In [4]:
from google.colab import auth
auth.authenticate_user()

In [5]:
!pip install google-cloud-storage -q
from google.cloud import storage

# Upload both full and sample data to GCS
client = storage.Client()
bucket = client.bucket("boothill2001-dataset")

full_blob = bucket.blob("clickstream/2019-Oct.csv")
full_blob.upload_from_filename(df_path)
print("☁️ Uploaded full dataset to GCS")

sample_blob = bucket.blob("clickstream/sample_10k_sessions.csv")
sample_blob.upload_from_filename("../data/sample_10k_sessions.csv")
print("☁️ Uploaded sample to GCS")

☁️ Uploaded full dataset to GCS
☁️ Uploaded sample to GCS
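
The printed messages confirm only that the upload calls returned. A minimal sketch of an extra check, assuming the `bucket` object from the cell above, is to compare the local file size with the size GCS reports for the object. `upload_and_verify` is a hypothetical helper. It uses only the `blob`, `upload_from_filename`, and `get_blob` methods and the `size` attribute of the storage client.

```python
import os

def upload_and_verify(bucket, local_path, blob_name):
    """Upload a file, then compare the remote object size with the local size."""
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    # get_blob fetches the object's metadata and returns None if it does not exist.
    remote = bucket.get_blob(blob_name)
    local_size = os.path.getsize(local_path)
    if remote is None or remote.size != local_size:
        raise RuntimeError(f"Upload check failed for {blob_name}")
    return remote.size

# Example usage, reusing `bucket` and `df_path` from the cells above:
# upload_and_verify(bucket, df_path, "clickstream/2019-Oct.csv")
# upload_and_verify(bucket, "../data/sample_10k_sessions.csv",
#                   "clickstream/sample_10k_sessions.csv")
```

A size match does not prove the contents are identical, so treat this as a quick sanity check rather than a full integrity test.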
