## Product Recommendation System

### Main Problem: Recommend products to customers based on their past purchases and product categories they tend to buy.

### Data Understanding
#### 1.0. What is the domain area of the dataset?
The Black Friday Sales dataset is a comprehensive collection of sales transaction data from a major retail store during a Black Friday event.

#### 1.1. Under which circumstances was it collected?
It was collected from a major retail store during a Black Friday event.

#### 2.0. Which data format?
The format of the dataset is *.csv*

#### 2.1. Do the files have headers or another file describing the data?
The file has a header row: each column has a name that describes the data it contains.

#### 2.2. Are the data values separated by commas, semicolon, or tabs?
The data values are separated by commas!  
**Example:**   
User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase  
1000001,P00069042,F,0-17,10,A,2,0,3,,,8370  
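
The separator can also be checked programmatically. The sketch below is one way to do it with the standard library's `csv.Sniffer`, using the two example lines above as input; the variable names are illustrative and not part of the original notebook.

```python
import csv

# Header and first data row, copied from the example above.
sample = (
    "User_ID,Product_ID,Gender,Age,Occupation,City_Category,"
    "Stay_In_Current_City_Years,Marital_Status,Product_Category_1,"
    "Product_Category_2,Product_Category_3,Purchase\n"
    "1000001,P00069042,F,0-17,10,A,2,0,3,,,8370\n"
)

# Restrict the candidates to the three separators the question asks about.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
print(repr(dialect.delimiter))

# Parse the sample with the detected delimiter; empty strings mark missing values.
header, first_row = csv.reader(sample.splitlines(), delimiter=dialect.delimiter)
print(len(header), len(first_row))
```

The two empty fields in the data row are the missing Product_Category_2 and Product_Category_3 values.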

#### 3.0. How many features and how many observations does the dataset have?
The dataset has:  
* 537,577 observations (rows)  
* 12 features (columns)

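#### 4.0. Does it contain numerical features? How many?
Yes. By dtype, 7 columns are numeric (User_ID, Occupation, Marital_Status, Product_Category_1, Product_Category_2, Product_Category_3, and Purchase), although only Purchase is a measured quantity; the others are identifiers or category codes.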

#### 5.0. Does it contain categorical features? How many?
Yes, it has 5 categorical features stored as text: Product_ID, Gender, Age, City_Category, and Stay_In_Current_City_Years.
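
One way to check these counts is to let pandas split the columns by dtype. The sketch below assumes two rows copied from the dataset preview are representative of the full file; `sample`, `numeric_cols`, and `text_cols` are illustrative names. Note that a count by dtype can differ from a count by meaning, since several integer columns hold category codes.

```python
import io

import pandas as pd

# Two rows in the same layout as the dataset, copied from the preview output.
csv_text = (
    "User_ID,Product_ID,Gender,Age,Occupation,City_Category,"
    "Stay_In_Current_City_Years,Marital_Status,Product_Category_1,"
    "Product_Category_2,Product_Category_3,Purchase\n"
    "1000001,P00069042,F,0-17,10,A,2,0,3,,,8370\n"
    "1000002,P00285442,M,55+,16,C,4+,0,8,,,7969\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# Columns parsed as integers or floats versus columns kept as text.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), numeric_cols)
print(len(text_cols), text_cols)
```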

### Features

User_ID: Unique ID for each customer.  
Product_ID: Unique ID for each product.  
Gender: Gender of the customer, either male (M) or female (F).  
Age: The age group of the customer, represented in categories (e.g., 0-17, 18-25, 26-35, etc.).  
Occupation: Occupation category code of the customer.  
City_Category: The category of the city where the customer resides, classified as A, B, or C.  
Stay_In_Current_City_Years: Number of years the customer has lived in the current city (stays of four or more years are recorded as 4+).  
Marital_Status: Indicates whether the customer is married (1) or not (0).  
Product_Category_1, Product_Category_2, Product_Category_3: Product categories associated with the purchased item.  
Purchase: The amount spent by the customer on the product.  

## Data Preprocessing

In [2]:
import pandas as pd

In [3]:
dataset = pd.read_csv("datasets/BlackFriday.csv")

In [4]:
dataset.head()

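| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | NaN | NaN | 7969 |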


In [5]:
dataset.describe()

| | User_ID | Occupation | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|
| count | 537577.0 | 537577.0 | 537577.0 | 537577.0 | 370591.0 | 164278.0 | 537577.0 |
| mean | 1002992.0 | 8.08271 | 0.408797 | 5.295546 | 9.842144 | 12.66984 | 9333.859853 |
| std | 1714.393 | 6.52412 | 0.491612 | 3.750701 | 5.087259 | 4.124341 | 4981.022133 |
| min | 1000001.0 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 | 185.0 |
| 25% | 1001495.0 | 2.0 | 0.0 | 1.0 | 5.0 | 9.0 | 5866.0 |
| 50% | 1003031.0 | 7.0 | 0.0 | 5.0 | 9.0 | 14.0 | 8062.0 |
| 75% | 1004417.0 | 14.0 | 1.0 | 8.0 | 15.0 | 16.0 | 12073.0 |
| max | 1006040.0 | 20.0 | 1.0 | 18.0 | 18.0 | 18.0 | 23961.0 |


In [6]:
print(f"Number of features in the dataset is {dataset.shape[1]} and the number of observations/rows in the dataset is {dataset.shape[0]}")

Number of features in the dataset is 12 and the number of observations/rows in the dataset is 537577


### Checking Missing Values

In [8]:
dataset.isnull().sum()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            166986
Product_Category_3            373299
Purchase                           0
dtype: int64
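
Before choosing a fill strategy, it can help to see how large the gaps are relative to the dataset. The sketch below only redoes the arithmetic on the counts printed above (nothing is read from disk); the variable names are illustrative.

```python
# Counts copied from the outputs above.
n_rows = 537577
missing_counts = {"Product_Category_2": 166986, "Product_Category_3": 373299}

# Percentage of rows with no value in each column.
missing_share = {
    col: round(100 * count / n_rows, 1) for col, count in missing_counts.items()
}
for col, pct in missing_share.items():
    print(f"{col}: {pct}% missing")
```

Roughly a third of Product_Category_2 and over two thirds of Product_Category_3 are empty, so dropping those rows would discard most of the data. Filling with 0 works as a "no category" marker here because 0 is not an existing code: the `describe()` output shows minimums of 2.0 and 3.0 for these two columns.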

In [10]:
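# Assign the result back; calling fillna(..., inplace=True) on a selected column triggers warnings in recent pandas.
dataset['Product_Category_2'] = dataset['Product_Category_2'].fillna(0)
dataset['Product_Category_3'] = dataset['Product_Category_3'].fillna(0)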

In [11]:
dataset.isnull().sum()

User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Product_Category_2            0
Product_Category_3            0
Purchase                      0
dtype: int64

### Encoding Categorical Variables

In [12]:
from sklearn.preprocessing import LabelEncoder

# 1. Label Encoding for Gender
le = LabelEncoder()
dataset['Gender'] = le.fit_transform(dataset['Gender'])
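
`LabelEncoder` numbers the labels in sorted order, so F becomes 0 and M becomes 1. To make that mapping explicit, the sketch below reproduces the same numbering with `pd.factorize(sort=True)` on a few made-up values; it is an illustration of the resulting codes, not the notebook's own method.

```python
import pandas as pd

# Illustrative values, not the real column.
genders = pd.Series(["F", "M", "M", "F"])

# sort=True numbers the labels in sorted order, matching how LabelEncoder orders its classes.
codes, labels = pd.factorize(genders, sort=True)
mapping = {label: code for code, label in enumerate(labels)}

print(mapping)
print(codes.tolist())
```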

In [14]:
# 2. Map Age to numerical values
age_mapping = {'0-17': 0, '18-25': 1, '26-35': 2, '36-45': 3, '46-50': 4, '51-55': 5, '55+': 6}
dataset['Age'] = dataset['Age'].map(age_mapping)
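
A caveat with `Series.map` and a dictionary is that any label missing from the dictionary silently becomes NaN. The sketch below shows a quick check for that on a small made-up series; the label "56+" is deliberately invalid and hypothetical.

```python
import pandas as pd

age_mapping = {'0-17': 0, '18-25': 1, '26-35': 2, '36-45': 3, '46-50': 4, '51-55': 5, '55+': 6}

# Illustrative values; "56+" is a made-up label that is not in the mapping.
ages = pd.Series(["0-17", "26-35", "55+", "56+"])
encoded = ages.map(age_mapping)

# Labels that were not found in the mapping come back as NaN.
unmapped = ages[encoded.isna()].unique().tolist()
print(encoded.tolist())
print("Unmapped labels:", unmapped)
```

Running the same check on the real column right after the mapping step would confirm that every age group was covered.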

In [15]:
# 3. One-Hot Encoding for Occupation and City_Category
dataset = pd.get_dummies(dataset, columns=['Occupation', 'City_Category'])

# 4. Convert Stay_In_Current_City_Years to numeric ('4+' is treated as 4)
dataset['Stay_In_Current_City_Years'] = dataset['Stay_In_Current_City_Years'].replace({'4+': '4'}).astype(int)

In [17]:
dataset.shape

(537577, 34)
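
The jump from 12 to 34 columns can be checked by hand. The arithmetic below assumes that all 21 occupation codes (0 through 20, per the `describe()` output) and all three city categories occur in the data.

```python
original_columns = 12
replaced_columns = 2      # Occupation and City_Category are replaced by their dummy columns
occupation_levels = 21    # codes 0..20, assuming every code occurs at least once
city_levels = 3           # A, B, C

expected_columns = original_columns - replaced_columns + occupation_levels + city_levels
print(expected_columns)
```

The result matches the 34 columns reported by `dataset.shape`; the row count is unchanged because encoding adds columns only.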

### Model Building