## BUSINESS UNDERSTANDING

### Project Goal
Develop a personalized recommendation system to help users discover products, content, or services aligned with their preferences, thereby improving engagement, conversion, and retention.

### Business Questions

1. **Which items are most frequently viewed and/or purchased?**  
   Understand the baseline item popularity and user preferences.

2. **What time of day or day of week do users interact with the platform the most?**  
   Identify temporal patterns to support time-aware recommendation strategies.

3. **Which user groups (based on behavior) show similar interests or purchase patterns?**  
   Enable collaborative filtering through clustering or similarity-based models.

4. **Which product categories have the highest conversion rates (views → purchases)?**  
   Inform business decisions and recommendation strategies for high-value categories.

5. **Do users interact more frequently with items that have richer metadata?**  
   Evaluate the potential of content-based filtering using item features.

6. **What proportion of items receive little or no attention?**  
   Address the cold-start problem and ensure broader catalog coverage.

7. **How do personalized recommendations compare with popularity-based ones in offline evaluation?**  
   Validate the effectiveness of personalized models through metrics such as Recall@K and MAP@K.
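
As a rough illustration of the offline metrics named in question 7, the sketch below computes Recall@K and MAP@K for ranked recommendation lists. The function names and the toy item IDs are hypothetical, and the AP@K normalization by `min(len(relevant), k)` is only one common convention, not necessarily the one this project ends up using.

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)


def average_precision_at_k(recommended, relevant, k):
    """Sum of precision@i at each rank i <= k holding a relevant item, normalized."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    # Assumed convention: normalize by the best achievable number of hits
    return score / min(len(relevant), k)


def map_at_k(all_recommended, all_relevant, k):
    """Mean of AP@K over users; both arguments are lists aligned by user."""
    scores = [average_precision_at_k(rec, rel, k)
              for rec, rel in zip(all_recommended, all_relevant)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up item IDs (assumes no duplicates in the ranking)
recs = [101, 102, 103, 104, 105]
held_out = [102, 105, 999]
print("Recall@5:", recall_at_k(recs, held_out, k=5))
print("AP@5:", average_precision_at_k(recs, held_out, k=5))
```

Running the same functions on a popularity-based ranking and on a personalized ranking, against the same held-out interactions, would give the comparison the question asks for.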


## DATA UNDERSTANDING

Importing relevant libraries

In [1]:

import pandas as pd


Loading the data

In [2]:

category_df = pd.read_csv('../data/category_tree.csv')

# Inspect the data
print("Shape:", category_df.shape)
print("Columns:", category_df.columns.tolist())
category_df.head()




Shape: (1669, 2)
Columns: ['categoryid', 'parentid']


|   | categoryid | parentid |
|---|---|---|
| 0 | 1016 | 213.0 |
| 1 | 809 | 169.0 |
| 2 | 570 | 9.0 |
| 3 | 1691 | 885.0 |
| 4 | 536 | 1691.0 |


In [3]:
events_df = pd.read_csv('../data/events.csv')

# Inspect the data
print("Shape:", events_df.shape)
print("Columns:", events_df.columns.tolist())
events_df.head()


Shape: (2756101, 5)
Columns: ['timestamp', 'visitorid', 'event', 'itemid', 'transactionid']


|   | timestamp | visitorid | event | itemid | transactionid |
|---|---|---|---|---|---|
| 0 | 1433221332117 | 257597 | view | 355908 | NaN |
| 1 | 1433224214164 | 992329 | view | 248676 | NaN |
| 2 | 1433221999827 | 111016 | view | 318965 | NaN |
| 3 | 1433221955914 | 483717 | view | 253185 | NaN |
| 4 | 1433221337106 | 951259 | view | 367447 | NaN |
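
The `timestamp` values look like Unix epoch milliseconds, which is what a time-aware analysis (business question 2) would build on. The sketch below is a minimal illustration that reuses the five sample timestamps shown above in a toy frame; the column names `hour` and `day_of_week` are hypothetical, and the times are treated as UTC because the platform's local time zone is not known here.

```python
import pandas as pd

# Toy frame built from the five sample rows shown above
sample = pd.DataFrame({
    "timestamp": [1433221332117, 1433224214164, 1433221999827,
                  1433221955914, 1433221337106],
    "event": ["view"] * 5,
})

# Interpret the integers as milliseconds since the Unix epoch (UTC)
sample["datetime"] = pd.to_datetime(sample["timestamp"], unit="ms")

# Derive the temporal features used for grouping
sample["hour"] = sample["datetime"].dt.hour
sample["day_of_week"] = sample["datetime"].dt.day_name()

# Count interactions per hour of day and per day of week
events_per_hour = sample.groupby("hour").size()
events_per_day = sample.groupby("day_of_week").size()

print(events_per_hour)
print(events_per_day)
```

Applied to the full events table, the same two `groupby` calls would show when users are most active.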


In [4]:
item_meta_df_1 = pd.read_csv('../data/item_properties_part1.1.csv')

# Inspect the data
print("Shape:", item_meta_df_1.shape)
print("Columns:", item_meta_df_1.columns.tolist())
item_meta_df_1.head()


Shape: (10999999, 4)
Columns: ['timestamp', 'itemid', 'property', 'value']


|   | timestamp | itemid | property | value |
|---|---|---|---|---|
| 0 | 1435460400000 | 460429 | categoryid | 1338 |
| 1 | 1441508400000 | 206783 | 888 | 1116713 960601 n277.200 |
| 2 | 1439089200000 | 395014 | 400 | n552.000 639502 n720.000 424566 |
| 3 | 1431226800000 | 59481 | 790 | n15360.000 |
| 4 | 1431831600000 | 156781 | 917 | 828513 |


In [5]:
item_meta_df_2 = pd.read_csv('../data/item_properties_part2.csv')
# Inspect the data
print("Shape:", item_meta_df_2.shape)
print("Columns:", item_meta_df_2.columns.tolist())
item_meta_df_2.head()


Shape: (9275903, 4)
Columns: ['timestamp', 'itemid', 'property', 'value']


|   | timestamp | itemid | property | value |
|---|---|---|---|---|
| 0 | 1433041200000 | 183478 | 561 | 769062 |
| 1 | 1439694000000 | 132256 | 976 | n26.400 1135780 |
| 2 | 1435460400000 | 420307 | 921 | 1149317 1257525 |
| 3 | 1431831600000 | 403324 | 917 | 1204143 |
| 4 | 1435460400000 | 230701 | 521 | 769062 |


Checking for missing data and data types

In [None]:

# 1. Load the datasets

events_df = pd.read_csv('../data/events.csv')
item_meta_df_1 = pd.read_csv('../data/item_properties_part1.1.csv')
item_meta_df_2 = pd.read_csv('../data/item_properties_part2.csv')
cat_tree_df = pd.read_csv('../data/category_tree.csv')


# 2. Overview helper

def dataset_overview(df, name):
    """Print shape, columns, dtypes, missing-value counts, and the first rows."""
    print(f"\n{'='*30}\n{name} Overview\n{'='*30}")
    print("Shape:", df.shape)
    print("\nColumns:", df.columns.tolist())
    print("\nData Types:\n", df.dtypes)
    print("\nMissing Values:\n", df.isnull().sum())
    print("\nSample Rows:\n", df.head())


# 3. Explore Each Dataset

dataset_overview(events_df, "User Events Data")
dataset_overview(item_meta_df_1, "Item Properties Data1")
dataset_overview(item_meta_df_2, "Item Properties Data2")
dataset_overview(cat_tree_df, "Category Tree Data")


# 4. Extra EDA Per Dataset


# ---- Events Dataset Specifics ----
print("\nEvent Type Counts:\n", events_df['event'].value_counts())
print("Unique Users:", events_df['visitorid'].nunique())
print("Unique Items:", events_df['itemid'].nunique())

# Convert timestamp if present
if 'timestamp' in events_df.columns:
    events_df['datetime'] = pd.to_datetime(events_df['timestamp'], unit='ms')
    print("\nEvents Time Range:", events_df['datetime'].min(), "to", events_df['datetime'].max())

# ---- Item Properties1 Specifics ----
print("\nTop 10 Properties in Item Metadata (Part 1):\n", item_meta_df_1['property'].value_counts().head(10))
print("Unique Items in Metadata (Part 1):", item_meta_df_1['itemid'].nunique())

# ---- Item Properties2 Specifics ----
print("\nTop 10 Properties in Item Metadata (Part 2):\n", item_meta_df_2['property'].value_counts().head(10))
print("Unique Items in Metadata (Part 2):", item_meta_df_2['itemid'].nunique())

# ---- Category Tree Specifics ----
print("\nUnique Categories:", cat_tree_df['categoryid'].nunique())
print("Unique Parent Categories:", cat_tree_df['parentid'].nunique())


User Events Data Overview
Shape: (2756101, 5)

Columns: ['timestamp', 'visitorid', 'event', 'itemid', 'transactionid']

Data Types:
 timestamp          int64
visitorid          int64
event             object
itemid             int64
transactionid    float64
dtype: object

Missing Values:
 timestamp              0
visitorid              0
event                  0
itemid                 0
transactionid    2733644
dtype: int64

Sample Rows:
        timestamp  visitorid event  itemid  transactionid
0  1433221332117     257597  view  355908            NaN
1  1433224214164     992329  view  248676            NaN
2  1433221999827     111016  view  318965            NaN
3  1433221955914     483717  view  253185            NaN
4  1433221337106     951259  view  367447            NaN

Item Properties Data1 Overview
Shape: (10999999, 4)

Columns: ['timestamp', 'itemid', 'property', 'value']

Data Types:
 timestamp     int64
itemid        int64
property     object
value        object
dtype: object

## 1. Datasets Overview

We’re working with four main datasets:

### **Category Tree**
- **Shape:** (1,669 rows, 2 columns)  
- **Columns:** `categoryid`, `parentid`  
- **Missing:** `parentid` has 25 nulls.  
- **Purpose:** Defines hierarchy between product categories.  
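
To make the hierarchy concrete, the sketch below finds root categories (rows with a null `parentid`) and computes how deep each category sits in the tree. It runs on a toy tree with made-up IDs rather than the real file, and `category_depth` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy tree with made-up IDs, not taken from the real file
toy_tree = pd.DataFrame({
    "categoryid": [1, 2, 3, 4],
    "parentid": [None, 1, 1, 3],
})

# Root categories are the rows without a parent
roots = toy_tree.loc[toy_tree["parentid"].isna(), "categoryid"].tolist()

# Map each non-root category to its parent
parent_of = {
    int(row.categoryid): int(row.parentid)
    for row in toy_tree.dropna(subset=["parentid"]).itertuples(index=False)
}


def category_depth(category_id, parent_of):
    """Number of edges from category_id up to its root (roots have depth 0)."""
    depth = 0
    seen = {category_id}
    while category_id in parent_of:
        category_id = parent_of[category_id]
        if category_id in seen:  # guard against accidental cycles
            raise ValueError("cycle detected in category tree")
        seen.add(category_id)
        depth += 1
    return depth


toy_tree["depth"] = toy_tree["categoryid"].map(lambda c: category_depth(c, parent_of))
print("Roots:", roots)
print(toy_tree)
```

On the real data, the same `isna()` filter on `cat_tree_df['parentid']` would list the top-level categories.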

### **User Events**
- **Shape:** (2,756,101 rows, 5 columns)  
- **Columns:** `timestamp`, `visitorid`, `event`, `itemid`, `transactionid`  
- **Missing:** `transactionid` is missing in 2,733,644 rows (~99.2%); it is present only for the 22,457 `transaction` events.  
- **Event counts:**
  - `view`: 2,664,312  
  - `addtocart`: 69,332  
  - `transaction`: 22,457  
- **Time range:** May 3, 2015 → Sept 18, 2015.  
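
The event counts above already give a first, coarse view of the view → purchase funnel. The sketch below computes event shares and event-level ratios from those three numbers; these are ratios of raw event counts, not per-user, per-session, or per-category conversion rates, and the variable names are hypothetical.

```python
# Event counts as reported in the overview above
event_counts = {"view": 2_664_312, "addtocart": 69_332, "transaction": 22_457}

total_events = sum(event_counts.values())

# Share of each event type in the full log
event_share = {name: count / total_events for name, count in event_counts.items()}

# Event-level funnel ratios (raw counts only, no user or session grouping)
view_to_cart = event_counts["addtocart"] / event_counts["view"]
cart_to_purchase = event_counts["transaction"] / event_counts["addtocart"]
view_to_purchase = event_counts["transaction"] / event_counts["view"]

for name, share in event_share.items():
    print(f"{name:<12} {share:.2%}")
print(f"addtocart per view:        {view_to_cart:.2%}")
print(f"transaction per addtocart: {cart_to_purchase:.2%}")
print(f"transaction per view:      {view_to_purchase:.2%}")
```

Grouping the same ratios by category (after joining events to item categories) would address business question 4 directly.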

### **Item Properties (Two Files)**
- **Item Properties Part 1:** (10,999,999 rows, 4 columns)  
- **Item Properties Part 2:** (9,275,903 rows, 4 columns)  
- **Columns:** `timestamp`, `itemid`, `property`, `value`  
- **No missing values**, but `value` holds mixed content: space-separated tokens, some of them numbers prefixed with `n` (e.g. `n277.200`).  
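
One way to make the mixed `value` column usable is to split each string into tokens and pull out the `n`-prefixed numbers. The sketch below is a minimal illustration; `parse_property_value` is a hypothetical helper, and treating `n<number>` as a numeric value while keeping every other token as an opaque string is an assumption based on the sample rows, not a documented format.

```python
def parse_property_value(raw):
    """Split a raw `value` string into numeric and non-numeric tokens.

    Assumption: a token of the form `n<number>` encodes a numeric value;
    every other token is kept as an opaque string.
    """
    numbers, tokens = [], []
    for token in str(raw).split():
        if token.startswith("n"):
            try:
                numbers.append(float(token[1:]))
                continue
            except ValueError:
                pass  # not actually numeric, keep it as a plain token
        tokens.append(token)
    return numbers, tokens


# Values copied from the sample rows shown earlier
print(parse_property_value("1116713 960601 n277.200"))
print(parse_property_value("n552.000 639502 n720.000 424566"))
print(parse_property_value("n15360.000"))
```

The numeric part could feed content-based features directly, while the opaque tokens would more likely be treated as categorical.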

### **High-Level Metadata Stats**
- **Unique users:** 1,407,580  
- **Unique items:** 235,061 in events  
- **Unique items in metadata:** 417,053  
- **Unique categories:** 1,669  
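
These counts hint at the coverage question (business question 6): if the 417,053 figure covers the whole catalog, at most about 56% of catalog items (235,061 / 417,053) appear in the event log. The sketch below shows how coverage and low-attention items could be measured; it uses toy data with made-up IDs, and the `LOW_ATTENTION_MAX` threshold is a hypothetical choice.

```python
import pandas as pd

# Toy data with made-up IDs
toy_events = pd.DataFrame({
    "itemid": [10, 10, 10, 20, 20, 30],
    "event": ["view", "view", "transaction", "view", "addtocart", "view"],
})
toy_catalog_items = pd.Series([10, 20, 30, 40, 50], name="itemid")

# Number of interactions recorded for each item that appears in the log
interactions_per_item = toy_events["itemid"].value_counts()

catalog = set(toy_catalog_items)
seen = set(interactions_per_item.index)

never_seen = catalog - seen                    # items with no interactions at all
coverage = len(catalog & seen) / len(catalog)  # share of catalog with >= 1 event

# Items at or below a hypothetical low-attention threshold
LOW_ATTENTION_MAX = 1
low_attention = interactions_per_item[interactions_per_item <= LOW_ATTENTION_MAX]

print("Never seen:", sorted(never_seen))
print("Coverage:", coverage)
print("Low-attention items:", low_attention.index.tolist())
```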



## 2. Top Item Metadata Properties
The item metadata is dominated by a few highly repeated property IDs, e.g.:
- `888` appears over 1.6M times  
- `790`, `available`, `categoryid`, `6`, `283` are also very frequent  
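
Because the metadata is split across two files, property frequencies are easier to compare once the parts are stacked. The sketch below is a minimal illustration on toy frames with made-up rows; `part1` and `part2` stand in for the two item-property files, and the resulting counts are not the real ones.

```python
import pandas as pd

# Toy stand-ins for the two item-property files (made-up rows)
part1 = pd.DataFrame({"itemid": [1, 1, 2], "property": ["888", "categoryid", "888"]})
part2 = pd.DataFrame({"itemid": [3, 3, 4], "property": ["888", "available", "790"]})

# Stack the two parts so the counts cover both files
item_props = pd.concat([part1, part2], ignore_index=True)

# Frequency of each property, most common first
property_counts = item_props["property"].value_counts()
top_properties = property_counts.head(3)

# Share of all rows accounted for by each property
property_share = property_counts / len(item_props)

print(top_properties)
print(property_share)
```

Note that `property` is a string column here, since it mixes numeric-looking IDs such as `888` with names such as `available` and `categoryid`.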


