# RECOMMENDATION SYSTEM ANALYSIS & MODELING

## Business Understanding 

Recommendation systems play a crucial role in modern digital platforms by enhancing user experience through personalized suggestions. Businesses across e-commerce, streaming services, social media, and news platforms depend on these systems to increase user engagement, improve satisfaction, and boost revenue. However, ensuring accuracy, diversity, and real-time performance remains a challenge, especially when dealing with large datasets.



### Business Objectives
- Improve user engagement by providing relevant product recommendations.
- Increase sales by recommending items customers are likely to purchase.
- Enhance customer retention through personalized suggestions.
- Reduce bounce rates by helping users discover desired products faster.



### Goals

1. Implement a robust recommendation system that delivers relevant suggestions in real time.

2. Improve user engagement metrics such as time spent on the platform and click-through rates.

3. Increase conversion rates through effective personalization.

4. Ensure fair and diverse recommendations to cater to different user preferences.

5. Optimize computational efficiency to handle large datasets with minimal latency.




### Problem Statement

Many customers leave the platform without making a purchase due to difficulty in discovering relevant products. The goal is to develop a recommendation system that suggests products based on user interactions and product characteristics.



### Stakeholders

Key Stakeholders
- Business Owners: Want to increase revenue through better product discovery.
- Marketing Team: Seeks insights into customer behavior and purchasing patterns.
- Data Science Team: Responsible for building, training, and optimizing the recommendation system.
- Software Engineers: Integrate the system into existing platforms and ensure it works seamlessly.
- Users (customers): The end users who interact with recommendations.




### Key Features of the Recommendation System

The dataset is loaded into the following DataFrames:

1. df_events: User interactions with products
2. df_categorytree: Product categories and hierarchy
3. df_items: Product properties such as price, availability, and brand


Key Features

1. User ID: Unique identifier for each customer.
2. Item ID: Unique identifier for each product.
3. Event Type: Clicks, views, add-to-cart, purchases, etc.
4. Timestamp: Time of interaction.
5. Category ID: The category a product belongs to.
6. Price: The cost of a product.
7. Brand: The brand of a product.





### Hypotheses

1. Customers who add items to their cart are more likely to purchase them.
2. Higher-priced items receive fewer clicks than lower-priced items.
3. Users interact more frequently with items from popular brands.
4. Certain categories receive more engagement than others.
5. Customers are more likely to purchase after viewing an item multiple times.
6. Purchasing behavior varies across different times of the day.
7. Users who buy one item from a brand are more likely to buy another from the same brand.
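As an illustration of how the first hypothesis could be checked, the sketch below compares purchase rates for visitor-item pairs with and without an add-to-cart event. It runs on a small made-up frame, and the `addtocart` event label is an assumption about how such events are named in `df_events`.

```python
import pandas as pd

# Toy events frame; the 'addtocart' label is an assumption about how
# add-to-cart events are named in df_events.
events = pd.DataFrame({
    "visitorid": [1, 1, 1, 2, 2, 3, 4, 4],
    "itemid":    [10, 10, 10, 10, 11, 12, 13, 13],
    "event":     ["view", "addtocart", "transaction",
                  "view", "view", "view", "addtocart", "view"],
})

# One row per (visitor, item) pair with a boolean flag per event type.
flags = pd.crosstab(
    [events["visitorid"], events["itemid"]], events["event"]
).gt(0)

# Share of pairs that end in a purchase, split by whether the item was carted.
purchase_rate = flags.groupby("addtocart")["transaction"].mean()
print(purchase_rate)
```

A difference between the two rates would only be descriptive; a significance test (for example a chi-square test on the underlying counts) would be needed before accepting the hypothesis.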



### Analytical Questions

1. What are the most frequently purchased/viewed items by different user segments?

2. How does user behavior change over time, and how does it impact recommendation accuracy?

3. What is the impact of diversity in recommendations on user engagement and satisfaction?

4. How do collaborative filtering and content-based filtering compare in terms of recommendation performance?

5. What is the relationship between recommendation relevance and user retention?

6. How does the recommendation system handle cold-start users with minimal interaction history?

7. What are the computational trade-offs between real-time and batch processing in recommendation systems?







### Data Understanding & Preparation
Importing all the relevant libraries

In [1]:
%pip install ydata-profiling

Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement ydata-profiling (from versions: none)
ERROR: No matching distribution found for ydata-profiling


In [2]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# To load multiple files
import glob 

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy import stats

# Date and time handling
from datetime import datetime


# Machine learning (if needed for predictive modeling)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# For handling missing data
from sklearn.impute import SimpleImputer

# For encoding categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


# For working with Excel files (if your data is in Excel format)
import openpyxl


# For working with large CSV files
import csv

# For system operations
import os
import sys

# For progress bars in data processing
from tqdm import tqdm

# Set plotting style
# plt.style.use('seaborn')

## Load all datasets from their sources

##### Category tree dataset

In [3]:
# Path of csv file
file_path = '../RSAM_Data/category_tree.csv'
 
# Check if the file exists at the specified path
if os.path.exists(file_path):
    print("File exists at the specified path.")
    try:
        # Read the CSV file into a pandas DataFrame
        df_categorytree= pd.read_csv(file_path)
       
    except FileNotFoundError as e:
        print(f"FileNotFoundError: {e}")
    except Exception as e:
        print(f"An error occurred: {e}")
else:
    print("File does not exist at the specified path.")
 
# Display the DataFrame
df_categorytree.head()

File exists at the specified path.


Unnamed: 0,categoryid,parentid
0,1016,213.0
1,809,169.0
2,570,9.0
3,1691,885.0
4,536,1691.0
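The load cells in this notebook repeat the same exists-then-read pattern, and the trailing `.head()` call would raise a `NameError` if the file were missing. One possible way to centralize this is a small helper; `load_csv` below is a hypothetical name, not part of the original notebook, and the demo writes a throwaway file so the sketch runs anywhere.

```python
import os
import tempfile

import pandas as pd


def load_csv(path):
    """Return the CSV at `path` as a DataFrame, or None if it cannot be read."""
    if not os.path.exists(path):
        print(f"File does not exist: {path}")
        return None
    try:
        return pd.read_csv(path)
    except (pd.errors.ParserError, pd.errors.EmptyDataError, OSError) as e:
        print(f"Could not read {path}: {e}")
        return None


# Demo on a temporary file (the two data rows mirror the output above).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "category_tree.csv")
    with open(demo_path, "w") as f:
        f.write("categoryid,parentid\n1016,213\n809,169\n")
    df_demo = load_csv(demo_path)
    df_missing = load_csv(os.path.join(tmp, "missing.csv"))

# Only display the frame when loading succeeded.
if df_demo is not None:
    print(df_demo.head())
```

Returning `None` keeps the failure explicit, so later cells can check for it instead of failing on an undefined variable.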


##### Key Observations:

1. Provides a hierarchy of categories: each categoryid maps to a parentid, and 25 of the 1,669 categories have no parent.
2. Certain categories dominate user interactions.
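To make the hierarchy concrete, the following sketch walks from a category up through its parents. It uses the five rows shown in the output above plus one assumed row (885 with no parent) so the walk has a place to stop; `path_to_root` is a hypothetical helper name.

```python
import pandas as pd

# First five rows from the category tree output, plus one ASSUMED root row
# (885 with no parent) added purely for illustration.
tree = pd.DataFrame({
    "categoryid": [1016, 809, 570, 1691, 536, 885],
    "parentid":   [213.0, 169.0, 9.0, 885.0, 1691.0, None],
})

# Child -> parent lookup; rows without a parent are dropped.
parent_of = (
    tree.dropna(subset=["parentid"])
        .set_index("categoryid")["parentid"]
        .astype(int)
        .to_dict()
)


def path_to_root(category_id, parent_of):
    """Return [category, parent, grandparent, ...] until no parent is known."""
    path = [category_id]
    seen = {category_id}
    while path[-1] in parent_of:
        nxt = parent_of[path[-1]]
        if nxt in seen:  # guard against accidental cycles in the data
            break
        path.append(nxt)
        seen.add(nxt)
    return path


print(path_to_root(536, parent_of))
```

The length of the returned path gives a category's depth, which could later serve as a feature or as a way to roll items up to broader categories.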

##### Events dataset

In [4]:
# Path of csv file
file_path = '../RSAM_Data/events.csv'
 
# Check if the file exists at the specified path
if os.path.exists(file_path):
    print("File exists at the specified path.")
    try:
        # Read the CSV file into a pandas DataFrame
        df_events = pd.read_csv(file_path)
       
    except FileNotFoundError as e:
        print(f"FileNotFoundError: {e}")
    except Exception as e:
        print(f"An error occurred: {e}")
else:
    print("File does not exist at the specified path.")
 
# Display the DataFrame
df_events.head()

File exists at the specified path.


Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,


In [5]:
# Convert timestamp from milliseconds to datetime
df_events['timestamp'] = pd.to_datetime(df_events['timestamp'], unit='ms')
 
# Extract useful datetime components
df_events['year'] = df_events['timestamp'].dt.year
df_events['month'] = df_events['timestamp'].dt.month
df_events['day'] = df_events['timestamp'].dt.day
df_events['hour'] = df_events['timestamp'].dt.hour
df_events['minute'] = df_events['timestamp'].dt.minute
df_events['second'] = df_events['timestamp'].dt.second
 
# Define the desired column order
column_order = [
    'timestamp', 'year', 'month', 'day', 'hour', 'minute', 'second',
    'visitorid', 'event', 'itemid', 'transactionid'
]
 
# Reorder the DataFrame
df_events = df_events[column_order]
 
# Display the DataFrame
df_events

Unnamed: 0,timestamp,year,month,day,hour,minute,second,visitorid,event,itemid,transactionid
0,2015-06-02 05:02:12.117,2015,6,2,5,2,12,257597,view,355908,
1,2015-06-02 05:50:14.164,2015,6,2,5,50,14,992329,view,248676,
2,2015-06-02 05:13:19.827,2015,6,2,5,13,19,111016,view,318965,
3,2015-06-02 05:12:35.914,2015,6,2,5,12,35,483717,view,253185,
4,2015-06-02 05:02:17.106,2015,6,2,5,2,17,951259,view,367447,
...,...,...,...,...,...,...,...,...,...,...,...
2756096,2015-08-01 03:13:05.939,2015,8,1,3,13,5,591435,view,261427,
2756097,2015-08-01 03:30:13.142,2015,8,1,3,30,13,762376,view,115946,
2756098,2015-08-01 02:57:00.527,2015,8,1,2,57,0,1251746,view,78144,
2756099,2015-08-01 03:08:50.703,2015,8,1,3,8,50,1184451,view,283392,
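The conversion above treats the millisecond values as UTC, so the extracted `hour` is a UTC hour. If time-of-day behavior is analyzed, converting to the platform's local time first may matter. The sketch below shows one way to do this on two timestamps copied from the events output; the timezone name is a placeholder, since the source does not state where the platform operates.

```python
import pandas as pd

# Two raw millisecond timestamps copied from the events output above.
raw_ms = pd.Series([1433221332117, 1433224214164])

# utc=True makes the UTC interpretation explicit (timezone-aware values).
utc = pd.to_datetime(raw_ms, unit="ms", utc=True)

# HYPOTHETICAL store timezone; replace with the platform's actual one.
local = utc.dt.tz_convert("America/New_York")

print(utc.dt.hour.tolist())
print(local.dt.hour.tolist())
```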


##### Key observations
1. Contains user interactions (clicks, views, add-to-cart, purchases).
2. Purchases are rare relative to overall activity: only 22,457 of the 2,756,101 events are transactions (about 0.8%).
3. Some products receive high engagement but low conversion.
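A per-item view-to-purchase ratio is one way to surface products with high engagement but low conversion. The sketch below computes it on a small made-up frame; the column names follow `df_events`, and the numbers are illustrative only.

```python
import pandas as pd

# Toy events: item 1 converts once, items 2 and 3 are viewed but never bought.
events = pd.DataFrame({
    "itemid": [1, 1, 1, 1, 2, 2, 3],
    "event":  ["view", "view", "view", "transaction", "view", "view", "view"],
})

# Count events per item and type.
counts = pd.crosstab(events["itemid"], events["event"])

# Make sure both columns exist even if an event type is absent in a slice.
counts = counts.reindex(columns=["view", "transaction"], fill_value=0)

# Transactions per view; items with zero views get NaN instead of a division error.
views = counts["view"].where(counts["view"] > 0)
counts["conversion"] = counts["transaction"] / views

# Most-viewed items first, so low-conversion outliers are easy to spot.
print(counts.sort_values("view", ascending=False))
```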

##### Item property dataset

In [6]:
# Define file paths
file_path1 = "../RSAM_Data/item_properties_part1.csv"
file_path2 = "../RSAM_Data/item_properties_part2.csv"
 
# Check if both files exist
if os.path.exists(file_path1) and os.path.exists(file_path2):
    print("Both item properties files exist.")
    
    try:
        # Read both CSV files
        df_item1 = pd.read_csv(file_path1)
        df_item2 = pd.read_csv(file_path2)
 
        # Combine both into a single DataFrame
        df_items = pd.concat([df_item1, df_item2], ignore_index=True)
 
        print("Files successfully merged.")
 
    except FileNotFoundError as e:
        print(f"FileNotFoundError: {e}")
    except Exception as e:
        print(f"An error occurred: {e}")
 
else:
    print("One or both files do not exist at the specified paths.")
 
# Display the DataFrame
df_items.head()

Both item properties files exist.
Files successfully merged.


Unnamed: 0,timestamp,itemid,property,value
0,1435460400000,460429,categoryid,1338
1,1441508400000,206783,888,1116713 960601 n277.200
2,1439089200000,395014,400,n552.000 639502 n720.000 424566
3,1431226800000,59481,790,n15360.000
4,1431831600000,156781,917,828513


In [7]:
# Convert timestamp from milliseconds to datetime
df_items['timestamp'] = pd.to_datetime(df_items['timestamp'], unit='ms')
 
# Extract useful datetime components (optional)
df_items['year'] = df_items['timestamp'].dt.year
df_items['month'] = df_items['timestamp'].dt.month
df_items['day'] = df_items['timestamp'].dt.day
df_items['hour'] = df_items['timestamp'].dt.hour
df_items['minute'] = df_items['timestamp'].dt.minute
df_items['second'] = df_items['timestamp'].dt.second
 
# Define the desired column order
column_order = [
    'timestamp', 'year', 'month', 'day', 'hour', 'minute', 'second',
    'itemid', 'property', 'value'
]
 
# Reorder the DataFrame
df_items = df_items[column_order]
 
# Display the updated DataFrame
df_items

Unnamed: 0,timestamp,year,month,day,hour,minute,second,itemid,property,value
0,2015-06-28 03:00:00,2015,6,28,3,0,0,460429,categoryid,1338
1,2015-09-06 03:00:00,2015,9,6,3,0,0,206783,888,1116713 960601 n277.200
2,2015-08-09 03:00:00,2015,8,9,3,0,0,395014,400,n552.000 639502 n720.000 424566
3,2015-05-10 03:00:00,2015,5,10,3,0,0,59481,790,n15360.000
4,2015-05-17 03:00:00,2015,5,17,3,0,0,156781,917,828513
...,...,...,...,...,...,...,...,...,...,...
20275897,2015-06-07 03:00:00,2015,6,7,3,0,0,236931,929,n12.000
20275898,2015-08-30 03:00:00,2015,8,30,3,0,0,455746,6,150169 639134
20275899,2015-08-16 03:00:00,2015,8,16,3,0,0,347565,686,610834
20275900,2015-06-07 03:00:00,2015,6,7,3,0,0,287231,867,769062


##### Key Observations
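1. Contains product details in long format: one row per item, property, and timestamp. Apart from categoryid, the sampled property names are numeric codes (e.g., 888, 790), so attributes such as price or brand are not labeled directly.
2. No column contains null values (20,275,902 non-null entries in each), so a missing price or brand would appear as an absent property row rather than as NaN.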
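Because the item properties arrive as timestamped rows, a common preparation step is to keep the most recent value per item and property and then spread properties into columns. The following is a minimal sketch on made-up rows that mimic the layout of `df_items`; it is one possible approach, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Toy rows in the same long format as df_items; values are illustrative.
items = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-05-10", "2015-06-28", "2015-06-28", "2015-05-17"]),
    "itemid":   [460429, 460429, 460429, 59481],
    "property": ["categoryid", "categoryid", "790", "categoryid"],
    "value":    ["1000", "1338", "n15360.000", "213"],
})

# Keep the most recent value for each (itemid, property) pair ...
latest = (
    items.sort_values("timestamp")
         .drop_duplicates(subset=["itemid", "property"], keep="last")
)

# ... then pivot so each item is one row and each property one column.
wide = latest.pivot(index="itemid", columns="property", values="value")
print(wide)
```

Properties that an item never had show up as NaN in the wide table, which makes gaps visible that the long format hides.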

In [8]:
# Define a dictionary to store all DataFrames
dataframes = {
    "Events Data": df_events,
    "Item Properties Data": df_items,
    "Category Tree Data": df_categorytree
}
 
# Iterate over each DataFrame and print column-wise unique value details
for df_name, df in dataframes.items():
    print(f"\n{'='*40}")
    print(f"Analyzing {df_name}")
    print(f"{'='*40}\n")
   
    # Get the list of all column names in the current DataFrame
    columns = df.columns
 
    # Print details of unique values for each column
    for column in columns:
        unique_values = df[column].unique()
        print(f"Column: {column}")  
        print(f"Unique Values Count: {unique_values.size}")  
        print(f"Unique Values Sample: {unique_values}")
        print("_" * 80)  


Analyzing Events Data

Column: timestamp
Unique Values Count: 2750455
Unique Values Sample: <DatetimeArray>
['2015-06-02 05:02:12.117000', '2015-06-02 05:50:14.164000',
 '2015-06-02 05:13:19.827000', '2015-06-02 05:12:35.914000',
 '2015-06-02 05:02:17.106000', '2015-06-02 05:48:06.234000',
 '2015-06-02 05:12:03.240000', '2015-06-02 05:34:51.897000',
 '2015-06-02 04:54:59.221000', '2015-06-02 05:00:04.592000',
 ...
 '2015-08-01 03:01:27.349000', '2015-08-01 03:07:53.572000',
 '2015-08-01 03:41:38.250000', '2015-08-01 03:21:29.446000',
 '2015-08-01 03:42:54.346000', '2015-08-01 03:13:05.939000',
 '2015-08-01 03:30:13.142000', '2015-08-01 02:57:00.527000',
 '2015-08-01 03:08:50.703000', '2015-08-01 03:36:03.914000']
Length: 2750455, dtype: datetime64[ns]
________________________________________________________________________________
Column: year
Unique Values Count: 1
Unique Values Sample: [2015]
________________________________________________________________________________
Column: mo

In [9]:
# Keep only the purchase events
x = df_events[df_events['event'] == 'transaction']

x['event'].value_counts()


event
transaction    22457
Name: count, dtype: int64

In [10]:
# value_counts() ignores NaN, so this sums the rows with a non-null transactionid
x['transactionid'].value_counts().sum()

np.int64(22457)
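The two counts agree, which suggests every transaction row carries a `transactionid`. A slightly stricter check is whether IDs appear only on transaction rows. The sketch below shows that check, plus the share of each event type, on a small made-up frame; the `addtocart` label is an assumption.

```python
import numpy as np
import pandas as pd

# Toy events; IDs are present only on the two transaction rows.
events = pd.DataFrame({
    "event": ["view", "addtocart", "transaction", "view", "transaction"],
    "transactionid": [np.nan, np.nan, 4000.0, np.nan, 4001.0],
})

is_txn = events["event"] == "transaction"
has_id = events["transactionid"].notna()

# True when an ID is present exactly on the transaction rows and nowhere else.
ids_match_transactions = bool((is_txn == has_id).all())

# Share of each event type among all events.
event_share = events["event"].value_counts(normalize=True)

print(ids_match_transactions)
print(event_share)
```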

In [11]:
# Check the datatype and the number of columns
df_events.info(verbose=False, memory_usage='deep')
 
missing_counts = df_events.isna().sum()
non_null_counts = df_events.notna().sum()
dtype_info = df_events.dtypes
 
df_eventsinfo = pd.DataFrame({
    "Non-Null Count": non_null_counts,
    "Missing Count": missing_counts,
    "Missing Percentage": round((missing_counts / len(df_events)) * 100, 2),
    "Dtype": dtype_info
})
 
# Display results
df_eventsinfo

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756101 entries, 0 to 2756100
Columns: 11 entries, timestamp to transactionid
dtypes: datetime64[ns](1), float64(1), int32(6), int64(2), object(1)
memory usage: 287.0 MB


Unnamed: 0,Non-Null Count,Missing Count,Missing Percentage,Dtype
timestamp,2756101,0,0.0,datetime64[ns]
year,2756101,0,0.0,int32
month,2756101,0,0.0,int32
day,2756101,0,0.0,int32
hour,2756101,0,0.0,int32
minute,2756101,0,0.0,int32
second,2756101,0,0.0,int32
visitorid,2756101,0,0.0,int64
event,2756101,0,0.0,object
itemid,2756101,0,0.0,int64


In [12]:
# Check the datatype and the number of columns
df_items.info(verbose=False, memory_usage='deep')
 
missing_counts = df_items.isna().sum()
non_null_counts = df_items.notna().sum()
dtype_info = df_items.dtypes
 
df_itemsinfo = pd.DataFrame({
    "Non-Null Count": non_null_counts,
    "Missing Count": missing_counts,
    "Missing Percentage": round((missing_counts / len(df_items)) * 100, 2),
    "Dtype": dtype_info
})
 
# Display results
df_itemsinfo

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20275902 entries, 0 to 20275901
Columns: 10 entries, timestamp to value
dtypes: datetime64[ns](1), int32(6), int64(1), object(2)
memory usage: 3.0 GB


Unnamed: 0,Non-Null Count,Missing Count,Missing Percentage,Dtype
timestamp,20275902,0,0.0,datetime64[ns]
year,20275902,0,0.0,int32
month,20275902,0,0.0,int32
day,20275902,0,0.0,int32
hour,20275902,0,0.0,int32
minute,20275902,0,0.0,int32
second,20275902,0,0.0,int32
itemid,20275902,0,0.0,int64
property,20275902,0,0.0,object
value,20275902,0,0.0,object
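At roughly 3.0 GB, the item properties frame is the heaviest object in memory. If the `property` column holds a limited set of repeated labels (an assumption worth verifying with `nunique()`), storing it as a categorical may reduce that footprint. The sketch below compares both representations on a synthetic column.

```python
import pandas as pd

# Synthetic stand-in for df_items['property']: few distinct labels, many rows.
prop = pd.Series(["categoryid", "888", "400", "790"] * 25_000)

# Memory with the default string storage versus categorical codes.
as_strings = prop.memory_usage(deep=True)
as_category = prop.astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings / 1e6:.2f} MB")
print(f"category: {as_category / 1e6:.2f} MB")
```

The saving depends on how many distinct values the column has, so the `value` column, which looks far more varied, would likely benefit much less.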


In [13]:
# Check the datatype and the number of columns
df_categorytree.info(verbose=False, memory_usage='deep')
 
missing_counts = df_categorytree.isna().sum()
non_null_counts = df_categorytree.notna().sum()
dtype_info = df_categorytree.dtypes
 
df_categorytreeinfo = pd.DataFrame({
    "Non-Null Count": non_null_counts,
    "Missing Count": missing_counts,
    "Missing Percentage": round((missing_counts / len(df_categorytree)) * 100, 2),
    "Dtype": dtype_info
})
 
# Display results
df_categorytreeinfo

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1669 entries, 0 to 1668
Columns: 2 entries, categoryid to parentid
dtypes: float64(1), int64(1)
memory usage: 26.2 KB


Unnamed: 0,Non-Null Count,Missing Count,Missing Percentage,Dtype
categoryid,1669,0,0.0,int64
parentid,1644,25,1.5,float64
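The same summary logic is repeated for each DataFrame, so it could be wrapped in one function. `summarize_missing` below is a hypothetical helper name; it is a sketch of that refactoring and is demonstrated on a small made-up frame.

```python
import numpy as np
import pandas as pd


def summarize_missing(df):
    """Per-column non-null count, missing count, missing %, and dtype."""
    missing = df.isna().sum()
    return pd.DataFrame({
        "Non-Null Count": df.notna().sum(),
        "Missing Count": missing,
        "Missing Percentage": (missing / len(df) * 100).round(2),
        "Dtype": df.dtypes,
    })


# Demo frame with one missing parentid out of four rows.
demo = pd.DataFrame({
    "categoryid": [1016, 809, 570, 1691],
    "parentid": [213.0, np.nan, 9.0, 885.0],
})
summary = summarize_missing(demo)
print(summary)
```

Calling one function per DataFrame also avoids copy-paste slips such as assigning a summary to the wrong variable name.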


In [14]:
# Check the datatype and the number of columns
df_events.info(verbose=False, memory_usage='deep')
 
missing_counts = df_events.isna().sum()
non_null_counts = df_events.notna().sum()
dtype_info = df_events.dtypes
 
df_eventsinfo = pd.DataFrame({
    "Non-Null Count": non_null_counts,
    "Missing Count": missing_counts,
    "Missing Percentage": round((missing_counts / len(df_events)) * 100, 2),
    "Dtype": dtype_info
})
 
# Display results
df_eventsinfo

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756101 entries, 0 to 2756100
Columns: 11 entries, timestamp to transactionid
dtypes: datetime64[ns](1), float64(1), int32(6), int64(2), object(1)
memory usage: 287.0 MB


Unnamed: 0,Non-Null Count,Missing Count,Missing Percentage,Dtype
timestamp,2756101,0,0.0,datetime64[ns]
year,2756101,0,0.0,int32
month,2756101,0,0.0,int32
day,2756101,0,0.0,int32
hour,2756101,0,0.0,int32
minute,2756101,0,0.0,int32
second,2756101,0,0.0,int32
visitorid,2756101,0,0.0,int64
event,2756101,0,0.0,object
itemid,2756101,0,0.0,int64
