# E-Commerce Recommendation System

## Table of Contents
1. [Importation of Packages](#1.-Importation-of-Packages)
2. [Data Importation](#2.-Data-Importation)
3. [Exploratory Data Analysis](#3.-Exploratory-Data-Analysis)
   - 3.1 [Data Cleaning and Validation](#3.1-Data-Cleaning-and-Validation)

## 1. Importation of Packages

In [1]:
import dask.dataframe as dd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

## 2. Data Importation

In [2]:
# Define correct data types explicitly
dtypes = {
    "property": "object",  # Ensure mixed data types are read as strings
    "timestamp": "int64",  # Ensure timestamps stay as integers before conversion
}

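# Function to optimize data types (reduce memory usage)
def reduce_memory(df):
    """Downcast 64-bit numeric columns to 32-bit where this is safe."""
    int32 = np.iinfo("int32")
    for col in df.columns:
        if df[col].dtype == 'int64':
            # Downcast only if every value fits in int32; otherwise the cast would overflow
            if df[col].min() >= int32.min and df[col].max() <= int32.max:
                df[col] = df[col].astype('int32')
        elif df[col].dtype == 'float64':
            # float32 stores whole numbers exactly only up to 2**24
            df[col] = df[col].astype('float32')
    return df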

# Load large CSV files lazily using Dask with specified dtypes
def load_csv_dask(file_path):
    return dd.read_csv(file_path, dtype=dtypes)  # Dask splits the file into partitions and reads them on compute()

# Load & Process Events Data
events_ddf = load_csv_dask("events.csv")
events_ddf["timestamp"] = dd.to_datetime(events_ddf["timestamp"], unit="ms")  # Convert timestamp

# Load & Process Item Properties Data
item_properties_1_ddf = load_csv_dask("item_properties_part1.1.csv")
item_properties_2_ddf = load_csv_dask("item_properties_part2.csv")

# Convert timestamp column in item properties datasets
item_properties_1_ddf["timestamp"] = dd.to_datetime(item_properties_1_ddf["timestamp"], unit="ms")
item_properties_2_ddf["timestamp"] = dd.to_datetime(item_properties_2_ddf["timestamp"], unit="ms")

# Concatenate both item properties datasets row-wise
item_properties_ddf = dd.concat([item_properties_1_ddf, item_properties_2_ddf])

# Load & Process Category Tree Data
category_tree_ddf = load_csv_dask("category_tree.csv")

# Convert Dask DataFrames to Pandas
events_df = reduce_memory(events_ddf.compute())
item_properties_df = reduce_memory(item_properties_ddf.compute())
category_tree_df = category_tree_ddf.compute()

print("Data loading complete! Converted to Pandas for easy analysis.")

Data loading complete! Converted to Pandas for easy analysis.
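
The `unit="ms"` conversions in the cell above interpret each raw timestamp as milliseconds since the Unix epoch. A minimal sketch with plain pandas follows; the sample value is an assumption chosen so that it reproduces the first timestamp displayed by `events_df.head()`, not a value read from the file.

```python
import pandas as pd

# Hypothetical raw value: milliseconds since 1970-01-01 00:00:00 UTC
raw_ms = pd.Series([1433221332117], name="timestamp")

# unit="ms" tells pandas how to interpret the integers
converted = pd.to_datetime(raw_ms, unit="ms")
print(converted.iloc[0])  # 2015-06-02 05:02:12.117000
```

The result is timezone-naive; whether it should be read as UTC depends on how the source system recorded the events.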


## 3. Exploratory Data Analysis

### 3.1 Data Cleaning and Validation

In [3]:
events_df.head()

|  | timestamp | visitorid | event | itemid | transactionid |
|---|---|---|---|---|---|
| 0 | 2015-06-02 05:02:12.117 | 257597 | view | 355908 | NaN |
| 1 | 2015-06-02 05:50:14.164 | 992329 | view | 248676 | NaN |
| 2 | 2015-06-02 05:13:19.827 | 111016 | view | 318965 | NaN |
| 3 | 2015-06-02 05:12:35.914 | 483717 | view | 253185 | NaN |
| 4 | 2015-06-02 05:02:17.106 | 951259 | view | 367447 | NaN |


In [4]:
events_df.tail()

|  | timestamp | visitorid | event | itemid | transactionid |
|---|---|---|---|---|---|
| 2756096 | 2015-08-01 03:13:05.939 | 591435 | view | 261427 | NaN |
| 2756097 | 2015-08-01 03:30:13.142 | 762376 | view | 115946 | NaN |
| 2756098 | 2015-08-01 02:57:00.527 | 1251746 | view | 78144 | NaN |
| 2756099 | 2015-08-01 03:08:50.703 | 1184451 | view | 283392 | NaN |
| 2756100 | 2015-08-01 03:36:03.914 | 199536 | view | 152913 | NaN |


In [5]:
events_df.shape

(2756101, 5)

In [8]:
#Checking for duplicates
events_df.duplicated().sum()

460

In [11]:
#dropping duplicate values
events_df.drop_duplicates(inplace=True)

In [12]:
#confirming whether duplicate values are dropped
events_df.duplicated().sum()

0

The 460 duplicate rows have been dropped, so no duplicated rows remain.
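
As a small illustration of what the two calls above do, the sketch below runs them on a toy frame with hypothetical values. It also shows why the index later reported by `info()` still runs from 0 to 2756100 even though only 2755641 rows remain: `drop_duplicates()` keeps the original labels unless the index is reset.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 2 are identical
toy = pd.DataFrame({
    "visitorid": [1, 2, 1],
    "event": ["view", "view", "view"],
    "itemid": [10, 20, 10],
})

# duplicated() marks every repeat after the first occurrence
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence;
# reset_index(drop=True) renumbers the remaining rows from 0
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```

Resetting the index is optional here; it only matters if later steps rely on positional labels.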

In [13]:
#checking for missing values
events_df.isna().sum()

timestamp              0
visitorid              0
event                  0
itemid                 0
transactionid    2733184
dtype: int64

Only **transactionid** has missing values (**2733184**); timestamp, visitorid, event and itemid are complete. This is expected: a visitor who only views an item or adds it to the cart without completing a purchase is never assigned a transaction ID.
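
This explanation can be checked rather than assumed. The sketch below uses a toy frame with hypothetical values; on the real data the same two expressions could be run against `events_df`. The event counts reported later in this notebook are consistent with the claim, since 2664218 views plus 68966 add-to-cart events equal the 2733184 missing IDs.

```python
import numpy as np
import pandas as pd

# Toy event log (hypothetical values)
toy_events = pd.DataFrame({
    "event": ["view", "addtocart", "transaction", "view"],
    "transactionid": [np.nan, np.nan, 42.0, np.nan],
})

is_purchase = toy_events["event"] == "transaction"

# Every purchase should carry an ID, and no other event should
purchases_have_id = toy_events.loc[is_purchase, "transactionid"].notna().all()
others_lack_id = toy_events.loc[~is_purchase, "transactionid"].isna().all()
print(purchases_have_id, others_lack_id)
```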

In [14]:
# Handling missing values in the transactionid
# Fill missing transaction IDs with 0 (assuming NaN means no purchase)
events_df["transactionid"] = events_df["transactionid"].fillna(0)

In [15]:
# checking for missing values in event data after handling missing values in the transactionid column.
events_df.isna().sum()

timestamp        0
visitorid        0
event            0
itemid           0
transactionid    0
dtype: int64

In [16]:
events_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2755641 entries, 0 to 2756100
Data columns (total 5 columns):
 #   Column         Dtype         
---  ------         -----         
 0   timestamp      datetime64[ns]
 1   visitorid      int32         
 2   event          object        
 3   itemid         int32         
 4   transactionid  float32       
dtypes: datetime64[ns](1), float32(1), int32(2), object(1)
memory usage: 94.6+ MB


In [17]:
events_df.describe()

|  | visitorid | itemid | transactionid |
|---|---|---|---|
| count | 2755641.0 | 2755641.0 | 2755641.0 |
| mean | 701922.7 | 234921.4 | 71.93124 |
| std | 405689.2 | 134194.7 | 914.8005 |
| min | 0.0 | 3.0 | 0.0 |
| 25% | 350566.0 | 118120.0 | 0.0 |
| 50% | 702060.0 | 236062.0 | 0.0 |
| 75% | 1053443.0 | 350714.0 | 0.0 |
| max | 1407579.0 | 466867.0 | 17671.0 |


In [18]:
events_df.describe(include='object')

|  | event |
|---|---|
| count | 2755641 |
| unique | 3 |
| top | view |
| freq | 2664218 |


In [19]:
events_df['event'].value_counts()

view           2664218
addtocart        68966
transaction      22457
Name: event, dtype: int64
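
The raw counts are easier to compare as shares of all events. The sketch below recomputes them from the three counts printed above, which are copied in by hand; on the real data, `events_df['event'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above
event_counts = pd.Series(
    {"view": 2664218, "addtocart": 68966, "transaction": 22457},
    name="event",
)

# Share of each event type in the deduplicated log
event_share = event_counts / event_counts.sum()
print(event_share.round(4))
```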

In [20]:
events_df['itemid'].value_counts()

187946    3412
461686    2975
5411      2334
370653    1854
219512    1800
          ... 
145333       1
113185       1
109819       1
302190       1
177353       1
Name: itemid, Length: 235061, dtype: int64

In [22]:
events_df['event'].count()

2755641

In [23]:
events_df['transactionid'].value_counts()

0.0        2733185
7063.0          31
765.0           28
8351.0          27
2753.0          23
            ...   
1200.0           1
8006.0           1
15418.0          1
11832.0          1
17579.0          1
Name: transactionid, Length: 17672, dtype: int64
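
After the fill, `0.0` occurs 2733185 times, one more than the 2733184 missing values counted earlier. This suggests that 0 may also be a genuine transaction ID, in which case filling with 0 makes that purchase indistinguishable from a non-purchase. The sketch below shows two possible alternatives on a toy column with hypothetical values: a sentinel that is checked for collisions first, and pandas' nullable `Int64` dtype, which keeps missing values as `<NA>`. The helper `safe_fill` is illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column in which 0 is a real transaction ID
toy_ids = pd.Series([np.nan, 0.0, 7.0, np.nan], name="transactionid")

def safe_fill(ids, sentinel):
    """Fill NaN with `sentinel`, refusing values that already occur in the column."""
    if (ids == sentinel).any():
        raise ValueError(f"sentinel {sentinel} collides with an existing ID")
    return ids.fillna(sentinel)

filled = safe_fill(toy_ids, -1)      # -1 does not occur, so it is safe here
nullable = toy_ids.astype("Int64")   # alternative: keep missing values as <NA>
print(filled.tolist(), int(nullable.isna().sum()))
```

Calling `safe_fill(toy_ids, 0)` raises a `ValueError`, which is the situation the count mismatch above hints at.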

In [24]:
# Sanity checks for the item_properties
item_properties_df.head()

|  | timestamp | itemid | property | value |
|---|---|---|---|---|
| 0 | 2015-06-28 03:00:00 | 460429 | categoryid | 1338 |
| 1 | 2015-09-06 03:00:00 | 206783 | 888 | 1116713 960601 n277.200 |
| 2 | 2015-08-09 03:00:00 | 395014 | 400 | n552.000 639502 n720.000 424566 |
| 3 | 2015-05-10 03:00:00 | 59481 | 790 | n15360.000 |
| 4 | 2015-05-17 03:00:00 | 156781 | 917 | 828513 |


In [25]:
item_properties_df.tail()

|  | timestamp | itemid | property | value |
|---|---|---|---|---|
| 1545977 | 2015-06-07 03:00:00 | 236931 | 929 | n12.000 |
| 1545978 | 2015-08-30 03:00:00 | 455746 | 6 | 150169 639134 |
| 1545979 | 2015-08-16 03:00:00 | 347565 | 686 | 610834 |
| 1545980 | 2015-06-07 03:00:00 | 287231 | 867 | 769062 |
| 1545981 | 2015-09-13 03:00:00 | 275768 | 888 | 888666 n10800.000 746840 1318567 |


In [27]:
item_properties_df.shape

(20275902, 4)

In [29]:
item_properties_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20275902 entries, 0 to 1545981
Data columns (total 4 columns):
 #   Column     Dtype         
---  ------     -----         
 0   timestamp  datetime64[ns]
 1   itemid     int32         
 2   property   object        
 3   value      object        
dtypes: datetime64[ns](1), int32(1), object(2)
memory usage: 696.1+ MB
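
The `info()` output lists 20275902 entries, yet the last index label is 1545981, which suggests that index labels repeat after the concatenation. That is harmless for the checks in this section but can surprise label-based lookups such as `.loc`. A minimal sketch of the effect and one possible remedy, using plain pandas and hypothetical toy frames:

```python
import pandas as pd

# Two hypothetical parts, each with its own 0-based index
part1 = pd.DataFrame({"itemid": [1, 2], "property": ["888", "400"]})
part2 = pd.DataFrame({"itemid": [3], "property": ["790"]})

combined = pd.concat([part1, part2])
print(combined.index.is_unique)   # False: label 0 appears twice

# reset_index(drop=True) assigns a fresh 0..n-1 index
combined = combined.reset_index(drop=True)
print(combined.index.is_unique)   # True
```

On the real data, `item_properties_df.reset_index(drop=True)` after `compute()` would be the equivalent step.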


In [30]:
item_properties_df.describe()

|  | itemid |
|---|---|
| count | 20275900.0 |
| mean | 233390.4 |
| std | 134845.2 |
| min | 0.0 |
| 25% | 116516.0 |
| 50% | 233483.0 |
| 75% | 350304.0 |
| max | 466866.0 |


In [32]:
item_properties_df.describe(include='object')

|  | property | value |
|---|---|---|
| count | 20275902 | 20275902 |
| unique | 1104 | 1966868 |
| top | 888 | 769062 |
| freq | 3000398 | 1537247 |


In [34]:
#checking duplicated values
item_properties_df.duplicated().sum()

0

In [35]:
#checking for null values
item_properties_df.isna().sum()

timestamp    0
itemid       0
property     0
value        0
dtype: int64

The item_properties data contains neither missing values nor duplicates.

In [36]:
# Sanity checks for the category tree dataframe
category_tree_df.head()

|  | categoryid | parentid |
|---|---|---|
| 0 | 1016 | 213.0 |
| 1 | 809 | 169.0 |
| 2 | 570 | 9.0 |
| 3 | 1691 | 885.0 |
| 4 | 536 | 1691.0 |


In [37]:
category_tree_df.tail()

|  | categoryid | parentid |
|---|---|---|
| 1664 | 49 | 1125.0 |
| 1665 | 1112 | 630.0 |
| 1666 | 1336 | 745.0 |
| 1667 | 689 | 207.0 |
| 1668 | 761 | 395.0 |


In [38]:
category_tree_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1669 entries, 0 to 1668
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   categoryid  1669 non-null   int64  
 1   parentid    1644 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 26.2 KB


In [39]:
category_tree_df.describe()

|  | categoryid | parentid |
|---|---|---|
| count | 1669.0 | 1644.0 |
| mean | 849.285201 | 847.571168 |
| std | 490.195116 | 505.058485 |
| min | 0.0 | 8.0 |
| 25% | 427.0 | 381.0 |
| 50% | 848.0 | 866.0 |
| 75% | 1273.0 | 1291.0 |
| max | 1698.0 | 1698.0 |


In [40]:
#checking for duplicates in the data
category_tree_df.duplicated().sum()

0

In [41]:
# checking for missing values
category_tree_df.isna().sum()

categoryid     0
parentid      25
dtype: int64

In [42]:
# Fill missing parentid values with the numeric sentinel -1 (categories without a parent, presumably top-level ones)
category_tree_df["parentid"] = category_tree_df["parentid"].fillna(-1)

In [43]:
category_tree_df.isna().sum()

categoryid    0
parentid      0
dtype: int64
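
With the `-1` sentinel in place, the tree can be walked from any category up to its top-level ancestor. The sketch below is one way to do that on a toy slice; the first two rows mirror relations visible in `head()`, while the root row for 885 and the helper `path_to_root` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical slice of the tree; -1 marks a category with no parent
toy_tree = pd.DataFrame({
    "categoryid": [536, 1691, 885],
    "parentid": [1691.0, 885.0, -1.0],
})

# The column no longer holds NaN, so it can be stored as integers
toy_tree["parentid"] = toy_tree["parentid"].astype("int64")
parent_of = dict(zip(toy_tree["categoryid"].tolist(), toy_tree["parentid"].tolist()))

def path_to_root(category, parent_of):
    """Return the chain from `category` up to its root; stops on cycles."""
    path, seen = [], set()
    while category != -1 and category in parent_of and category not in seen:
        path.append(category)
        seen.add(category)
        category = parent_of[category]
    return path

print(path_to_root(536, parent_of))  # [536, 1691, 885]
```

The `seen` set guards against cycles, which a clean tree should not contain but which cost little to defend against.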

In [45]:
category_tree_df.head()

|  | categoryid | parentid |
|---|---|---|
| 0 | 1016 | 213.0 |
| 1 | 809 | 169.0 |
| 2 | 570 | 9.0 |
| 3 | 1691 | 885.0 |
| 4 | 536 | 1691.0 |
