# **Business Understanding (CRISP-DM Methodology)**  

## **1. Background of the Project**  
In today’s digital economy, businesses increasingly rely on **recommendation systems** to provide personalized experiences. These systems analyze user interactions to suggest relevant products, content, or services, improving engagement, satisfaction, and sales.  

E-commerce platforms, streaming services, and subscription-based businesses depend on accurate recommendations to retain users and increase conversions. However, ensuring that recommendations are **accurate, diverse, and relevant** while managing large-scale data presents a significant challenge.  

This project aims to develop a **Recommendation System** that leverages user behavior data to generate **personalized suggestions** across different domains. The system will analyze past interactions to predict user preferences, ultimately enhancing the user experience and business performance.  

## **2. Business Objectives**  
The key objectives of this project are:  
- **Improve Recommendation Accuracy**: Increase relevance in product suggestions.  
- **Boost Conversion Rates**: Enhance purchases through personalized recommendations.  
- **Filter Abnormal Users**: Remove bot interactions to improve data quality.  
- **Optimize for Scalability**: Handle large datasets efficiently.  
- **Balance Diversity and Relevance**: Prevent recommendation loops.  

## **3. Scope of the Project**  
This recommendation system will be designed for:  
- **E-commerce platforms**: Suggesting products based on past browsing and purchases.  
- **Streaming services**: Recommending movies, shows, or music based on user preferences.  
- **Subscription-based platforms**: Offering personalized content or services.  

The system will focus on:  
- **Data Cleaning & Preprocessing**: Handling missing values and filtering abnormal users.  
- **Feature Engineering**: Extracting meaningful insights from user interactions.  
- **Model Development**: Implementing collaborative filtering, content-based filtering, and hybrid methods.  
- **Evaluation & Optimization**: Measuring accuracy, diversity, and performance.  

## **4. Problem Statement**  
Many businesses struggle to provide effective personalized recommendations due to:  
- **Large Data Volumes**: Handling millions of interactions efficiently.  
- **Low Recommendation Accuracy**: Poorly optimized models leading to irrelevant suggestions.  
- **Cold Start Problem**: Difficulty recommending items to new users with no history.  
- **Lack of Diversity**: Recommendation loops that limit content variety.  
- **Fraudulent & Abnormal User Activity**: Bots and fake accounts adding noise to the system.  

This project will address these challenges by developing a **scalable, high-accuracy recommendation system** capable of **filtering abnormal user activity**, optimizing recommendations, and balancing accuracy with diversity.  

## **5. Key Challenges**
Developing an effective recommendation system comes with challenges, including:
- **Handling Large-Scale Data**: Processing millions of interactions efficiently.
- **Ensuring Accuracy & Relevance**: Providing meaningful recommendations without overwhelming users.
- **Personalization Without Bias**: Avoiding recommendation loops that limit content diversity.
- **Filtering Abnormal Users**: Identifying and removing fraudulent or non-human interactions.

## **6. Stakeholders**
- **E-commerce Businesses**: Looking to improve product recommendations for increased sales.
- **Streaming & Media Platforms**: Seeking to provide engaging content suggestions.
- **Subscription-Based Services**: Aiming to personalize offerings to retain users.
- **Data Scientists & Analysts**: Responsible for model development and optimization.

## **7. Features of the Dataset**  
The dataset contains user interactions and product information from an **e-commerce website**, including:  

### **A. events.csv (User Behavior Data)**  
- **visitorid** – Unique identifier for each user.  
- **timestamp** – Time of the event (Unix time in milliseconds).  
- **event** – Type of user interaction:  
  - `view` – User viewed a product.  
  - `addtocart` – User added a product to their cart.  
  - `transaction` – User purchased a product.  
- **itemid** – Unique identifier for each product.  
- **transactionid** – Identifier of the purchase; missing for most rows and present on `transaction` events.  

### **B. item_properties.csv (Product Details)**  
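- **itemid** – Product identifier.  
- **timestamp** – Time at which item properties were recorded (Unix time in milliseconds).  
- **property** – Name of the property; `categoryid` appears in plain text, while most others appear as numeric codes (e.g. `888`, `790`).  
- **value** – Corresponding value for each property, often one or more space-separated tokens.  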

### **C. category_tree.csv (Product Hierarchy)**  
- **categoryid** – ID of the category.  
- **parentid** – ID of the parent category (missing for 25 of the 1,669 categories).  

## **8. Hypotheses**  
To guide the analysis, we define the following hypotheses:  

1. **H1: Users who add items to their cart are more likely to purchase those items compared to those who only view them.**  
2. **H2: Users who interact with multiple product categories tend to have higher engagement and conversion rates.**  
3. **H3: Recommendation accuracy improves when abnormal users (bots, fraudulent accounts) are filtered out.**  
4. **H4: Products frequently viewed together are strong indicators for collaborative filtering recommendations.**  
5. **H5: A well-balanced recommendation system (mixing popular and diverse items) leads to higher user retention.**  
6. **H6: Personalization increases the average transaction value, as users are more likely to purchase recommended items.**  
7. **H7: Seasonal trends and time-based factors influence product recommendations (e.g., holiday sales impact purchasing behavior).** 
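
As a rough illustration of how H1 could be checked, the sketch below computes the share of visitor-item pairs that end in a purchase, split by whether the pair ever had an `addtocart` event. The toy frame and the helper name `purchase_rate_by_cart` are hypothetical; the column names (`visitorid`, `itemid`, `event`) are assumed to match events.csv as loaded later in this notebook.

```python
import pandas as pd

def purchase_rate_by_cart(events: pd.DataFrame) -> pd.Series:
    """Share of visitor-item pairs that end in a transaction, split by cart use."""
    # One row per (visitor, item); one boolean column per event type observed
    flags = (
        pd.crosstab([events["visitorid"], events["itemid"]], events["event"])
        .reindex(columns=["view", "addtocart", "transaction"], fill_value=0)
        .gt(0)
    )
    # Mean of a boolean column is the fraction of pairs that purchased
    return flags.groupby(flags["addtocart"])["transaction"].mean()

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "visitorid": [1, 1, 1, 2, 3, 3, 4],
    "itemid":    [10, 10, 10, 10, 20, 20, 30],
    "event":     ["view", "addtocart", "transaction", "view",
                  "view", "addtocart", "view"],
})
rates = purchase_rate_by_cart(toy)
print(rates)
```

A gap between the two rates on the real data would be consistent with H1, although it would not by itself show that adding to cart causes the purchase.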

## **9. Analytical Questions**  
1. **What patterns exist in user behavior across different events (views, add-to-cart, transactions)?**  
2. **How does user engagement (clicks, time spent, interactions) influence purchase decisions?**  
3. **Which product categories receive the most engagement, and how does this influence recommendations?**  
4. **What are the key features that impact the likelihood of a product being added to the cart or purchased?**  
5. **How can abnormal user activity (e.g., bots, fraud) be detected and filtered out?**  
6. **What is the impact of personalization on user retention and conversion rates?**  
7. **How can the recommendation system balance accuracy and diversity to avoid content redundancy?**  
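
For question 5, one simple starting point is to flag visitors with an unusually high number of events. The sketch below is only a minimal heuristic: the helper `flag_heavy_visitors` and the `max_events` threshold are assumptions made for illustration, and a real filter would likely combine several signals (event rate, session length, ratio of views to purchases).

```python
import pandas as pd

def flag_heavy_visitors(events: pd.DataFrame, max_events: int) -> pd.Series:
    """Boolean Series indexed by visitorid: True where event count exceeds max_events."""
    counts = events.groupby("visitorid").size()
    return counts > max_events

# Hypothetical toy data: visitor 1 generates 50 views, visitors 2 and 3 only a few
toy = pd.DataFrame({
    "visitorid": [1] * 50 + [2, 2, 3],
    "itemid": list(range(50)) + [7, 8, 9],
    "event": ["view"] * 53,
})

flags = flag_heavy_visitors(toy, max_events=20)  # threshold chosen arbitrarily
clean = toy[~toy["visitorid"].map(flags)]        # keep events from unflagged visitors
print(flags)
print(len(clean))
```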

In [3]:
import pandas as pd
import numpy as np
import scipy
import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import json5

In [26]:
from pathlib import Path

# Single place to change the data location
DATA_DIR = Path(r"C:\Users\LENOVO\Desktop\Azubi Africa TMP\Recommendation-System-Analysis\Notebook\Data")

category_tree = pd.read_csv(DATA_DIR / "category_tree.csv")
events = pd.read_csv(DATA_DIR / "events.csv")
item_properties1 = pd.read_csv(DATA_DIR / "item_properties_part1.1.csv")
item_properties2 = pd.read_csv(DATA_DIR / "item_properties_part2.csv")

In [27]:
category_tree, category_tree.info(), category_tree.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1669 entries, 0 to 1668
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   categoryid  1669 non-null   int64  
 1   parentid    1644 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 26.2 KB


(      categoryid  parentid
 0           1016     213.0
 1            809     169.0
 2            570       9.0
 3           1691     885.0
 4            536    1691.0
 ...          ...       ...
 1664          49    1125.0
 1665        1112     630.0
 1666        1336     745.0
 1667         689     207.0
 1668         761     395.0
 
 [1669 rows x 2 columns],
 None,
 categoryid     0
 parentid      25
 dtype: int64)
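
The 25 missing `parentid` values are plausibly top-level categories, although the output above does not confirm this. Under that assumption, the sketch below walks from a category up to its root. The helper `path_to_root` and the toy tree are hypothetical.

```python
import pandas as pd

def path_to_root(category_id, parent_of):
    """Return [category, parent, grandparent, ...] ending at a category with no parent."""
    path = [category_id]
    seen = {category_id}
    while True:
        parent = parent_of.get(path[-1])
        # Stop at a root, or if the data contains a cycle
        if parent is None or parent in seen:
            break
        path.append(parent)
        seen.add(parent)
    return path

# Hypothetical toy tree: 3 -> 2 -> 1, and 4 has no parent
toy_tree = pd.DataFrame({"categoryid": [1, 2, 3, 4],
                         "parentid": [None, 1, 2, None]})

# Child -> parent lookup, skipping rows whose parentid is missing
parent_of = {
    int(child): int(parent)
    for child, parent in zip(toy_tree["categoryid"], toy_tree["parentid"])
    if pd.notna(parent)
}

print(path_to_root(3, parent_of))
print(path_to_root(4, parent_of))
```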

In [28]:
events, events.info(), events.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756101 entries, 0 to 2756100
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   timestamp      int64  
 1   visitorid      int64  
 2   event          object 
 3   itemid         int64  
 4   transactionid  float64
dtypes: float64(1), int64(3), object(1)
memory usage: 105.1+ MB


(             timestamp  visitorid event  itemid  transactionid
 0        1433221332117     257597  view  355908            NaN
 1        1433224214164     992329  view  248676            NaN
 2        1433221999827     111016  view  318965            NaN
 3        1433221955914     483717  view  253185            NaN
 4        1433221337106     951259  view  367447            NaN
 ...                ...        ...   ...     ...            ...
 2756096  1438398785939     591435  view  261427            NaN
 2756097  1438399813142     762376  view  115946            NaN
 2756098  1438397820527    1251746  view   78144            NaN
 2756099  1438398530703    1184451  view  283392            NaN
 2756100  1438400163914     199536  view  152913            NaN
 
 [2756101 rows x 5 columns],
 None,
 timestamp              0
 visitorid              0
 event                  0
 itemid                 0
 transactionid    2733644
 dtype: int64)

In [29]:
item_properties1, item_properties1.info(), item_properties1.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10999999 entries, 0 to 10999998
Data columns (total 4 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   timestamp  int64 
 1   itemid     int64 
 2   property   object
 3   value      object
dtypes: int64(2), object(2)
memory usage: 335.7+ MB


(              timestamp  itemid    property                            value
 0         1435460400000  460429  categoryid                             1338
 1         1441508400000  206783         888          1116713 960601 n277.200
 2         1439089200000  395014         400  n552.000 639502 n720.000 424566
 3         1431226800000   59481         790                       n15360.000
 4         1431831600000  156781         917                           828513
 ...                 ...     ...         ...                              ...
 10999994  1439694000000   86599  categoryid                              618
 10999995  1435460400000  153032        1066                 n1020.000 424566
 10999996  1440298800000  421788         888               35975 856003 37346
 10999997  1437879600000  159792         400  n552.000 639502 n720.000 424566
 10999998  1436065200000  464846         790                      n410640.000
 
 [10999999 rows x 4 columns],
 None,
 timestamp    0
 itemid  

In [30]:
item_properties2, item_properties2.info(), item_properties2.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9275903 entries, 0 to 9275902
Data columns (total 4 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   timestamp  int64 
 1   itemid     int64 
 2   property   object
 3   value      object
dtypes: int64(2), object(2)
memory usage: 283.1+ MB


(             timestamp  itemid property                             value
 0        1433041200000  183478      561                            769062
 1        1439694000000  132256      976                   n26.400 1135780
 2        1435460400000  420307      921                   1149317 1257525
 3        1431831600000  403324      917                           1204143
 4        1435460400000  230701      521                            769062
 ...                ...     ...      ...                               ...
 9275898  1433646000000  236931      929                           n12.000
 9275899  1440903600000  455746        6                     150169 639134
 9275900  1439694000000  347565      686                            610834
 9275901  1433646000000  287231      867                            769062
 9275902  1442113200000  275768      888  888666 n10800.000 746840 1318567
 
 [9275903 rows x 4 columns],
 None,
 timestamp    0
 itemid       0
 property     0
 value        

In [31]:
item_properties = pd.concat([item_properties1, item_properties2], ignore_index=True)


In [32]:
# Filtering rows where the property is 'categoryid'
categoryid = item_properties[item_properties['property'] == 'categoryid']

In [33]:
item_properties = item_properties.drop(columns=['property'])


In [34]:
# Keeping only events that have an itemid; .copy() avoids a SettingWithCopyWarning
# when the timestamp column is reassigned later
events = events[events['itemid'].notna()].copy()

In [35]:
# Converting timestamp to datetime format
events['timestamp'] = pd.to_datetime(events['timestamp'], unit='ms')
item_properties['timestamp'] = pd.to_datetime(item_properties['timestamp'], unit='ms')

In [36]:
# Merging events with item_properties (one output row per event and matching property record)
merged = pd.merge(events, item_properties, on='itemid', how='left')
merged.head()

Unnamed: 0,timestamp_x,visitorid,event,itemid,transactionid,timestamp_y,value
0,2015-06-02 05:02:12.117,257597,view,355908,,2015-05-17 03:00:00,519769
1,2015-06-02 05:02:12.117,257597,view,355908,,2015-05-17 03:00:00,1
2,2015-06-02 05:02:12.117,257597,view,355908,,2015-08-30 03:00:00,1
3,2015-06-02 05:02:12.117,257597,view,355908,,2015-07-05 03:00:00,1
4,2015-06-02 05:02:12.117,257597,view,355908,,2015-07-26 03:00:00,1
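
The merge above repeats each event once for every property record of the same item, including records dated after the event. One possible alternative, sketched here on hypothetical toy data for a single property, is `pd.merge_asof`, which keeps one row per event and attaches the most recent record at or before the event time. The frames `events_toy` and `props_toy` and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical events for one item
events_toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-06-02 05:00:00", "2015-07-10 12:00:00"]),
    "visitorid": [257597, 257597],
    "itemid": [355908, 355908],
})

# Hypothetical snapshots of a single property for the same item
props_toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-05-17 03:00:00",
                                 "2015-07-05 03:00:00",
                                 "2015-08-30 03:00:00"]),
    "itemid": [355908, 355908, 355908],
    "value": ["1", "0", "1"],
})

# Both frames must be sorted on the "on" key; "by" restricts matches to the same item
asof = pd.merge_asof(
    events_toy.sort_values("timestamp"),
    props_toy.sort_values("timestamp"),
    on="timestamp",
    by="itemid",
    direction="backward",  # latest snapshot at or before the event
)
print(asof)
```

On the full data this would need to be applied per property, since different properties of one item are recorded in separate rows.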


In [37]:
# Dropping rows with missing values
# Note: transactionid is NaN for non-transaction events, so only purchase rows remain
merged = merged.dropna()

In [38]:
merged.isnull().sum()

timestamp_x      0
visitorid        0
event            0
itemid           0
transactionid    0
timestamp_y      0
value            0
dtype: int64
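
Because `transactionid` is missing for views and add-to-cart events, an unrestricted `dropna()` also discards those events. If the intent is only to remove rows without a property value, restricting the check with `subset` is one option. The toy frame below is hypothetical and only illustrates the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical merged rows: only the purchase has a transactionid
toy = pd.DataFrame({
    "event": ["view", "addtocart", "transaction", "view"],
    "itemid": [1, 1, 1, 2],
    "transactionid": [np.nan, np.nan, 4000.0, np.nan],
    "value": ["519769", "519769", "519769", None],
})

all_cols = toy.dropna()                    # drops every row with any NaN
subset_only = toy.dropna(subset=["value"]) # drops only rows missing a property value

print(all_cols["event"].tolist())
print(subset_only["event"].tolist())
```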

In [39]:
merged.head(10)

Unnamed: 0,timestamp_x,visitorid,event,itemid,transactionid,timestamp_y,value
8126,2015-06-02 05:17:56.276,599528,transaction,356475,4000.0,2015-08-16 03:00:00,1116693
8127,2015-06-02 05:17:56.276,599528,transaction,356475,4000.0,2015-05-31 03:00:00,593601 189174
8128,2015-06-02 05:17:56.276,599528,transaction,356475,4000.0,2015-06-28 03:00:00,1322937 1065616 688107 1194342 1322937 1065616...
8129,2015-06-02 05:17:56.276,599528,transaction,356475,4000.0,2015-05-31 03:00:00,566381
8130,2015-06-02 05:17:56.276,599528,transaction,356475,4000.0,2015-06-07 03:00:00,237874 1022520
8131,2015-06-02 05:17:56.276,599528,transaction,356475,4000.0,2015-05-31 03:00:00,386
8132,2015-06-02 05:17:56.276,599528,transaction,356475,4000.0,2015-05-10 03:00:00,1194342
8133,2015-06-02 05:17:56.276,599528,transaction,356475,4000.0,2015-05-10 03:00:00,688107
8134,2015-06-02 05:17:56.276,599528,transaction,356475,4000.0,2015-06-28 03:00:00,1322937 1065616
8135,2015-06-02 05:17:56.276,599528,transaction,356475,4000.0,2015-06-07 03:00:00,250733


In [None]:
# Next step: join with categoryid
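
A minimal sketch of how that join might look, assuming the `categoryid` rows keep the columns `timestamp`, `itemid`, `property`, and `value`, and that the latest record per item is the one wanted. All frames below are hypothetical toy data. The category-to-parent pairs mirror the first rows of category_tree shown earlier, while the item-to-category assignments are invented.

```python
import pandas as pd

# Hypothetical categoryid records (an item can change category over time)
categoryid_toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-05-10", "2015-06-28", "2015-05-10"]),
    "itemid": [356475, 356475, 355908],
    "property": ["categoryid"] * 3,
    "value": ["386", "1016", "809"],
})
category_tree_toy = pd.DataFrame({"categoryid": [1016, 809],
                                  "parentid": [213.0, 169.0]})
events_toy = pd.DataFrame({"visitorid": [599528, 257597],
                           "event": ["transaction", "view"],
                           "itemid": [356475, 355908]})

# Keep the most recent category per item and cast the string value to int
latest_category = (
    categoryid_toy.sort_values("timestamp")
    .drop_duplicates("itemid", keep="last")
    .assign(categoryid=lambda d: d["value"].astype("int64"))
    [["itemid", "categoryid"]]
)

# Attach the category to each event, then the parent category from the tree
with_category = (
    events_toy.merge(latest_category, on="itemid", how="left")
    .merge(category_tree_toy, on="categoryid", how="left")
)
print(with_category)
```

Reducing to one category per item before merging keeps the result at one row per event.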