### Data Understanding
#### 1.0. What is the domain area of the dataset?
The Black Friday Sales dataset is a comprehensive collection of sales transaction data from a major retail store during a Black Friday event.

#### 1.1. Under which circumstances was it collected?
It was collected from the sales transactions of a major retail store during a Black Friday event.

#### 2.0. Which data format?
The dataset is stored as a *.csv* (comma-separated values) file.

#### 2.1. Do the files have headers or another file describing the data?
The file has a header row: each column has a name that describes the data it contains.

#### 2.2. Are the data values separated by commas, semicolon, or tabs?
The data values are separated by commas.  
**Example:**   
User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase  
1000001,P00069042,F,0-17,10,A,2,0,3,,,8370  

#### 3.0. How many features and how many observations does the dataset have?
The dataset has:  
* over 550,000 observations (rows)  
* 12 features (columns)

#### 4.0. Does it contain numerical features? How many?
Yes, it has 4 numerical features.

#### 5.0. Does it contain categorical features? How many?
Yes, it has 5 categorical features.
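
The counts above depend on how a feature is classified. As a rough check, the sketch below counts columns by storage type on a small hand-made frame (`sample` is a hypothetical stand-in, not the real file). Note that a dtype-based count can differ from a semantic one: integer-coded columns such as Occupation are stored as numbers but behave like categories.

```python
import pandas as pd

# Hypothetical toy frame mirroring a few of the dataset's columns (illustrative values only).
sample = pd.DataFrame({
    "User_ID": [1000001, 1000002],
    "Product_ID": ["P00069042", "P00285442"],
    "Gender": ["F", "M"],
    "Age": ["0-17", "55+"],
    "Occupation": [10, 16],
    "Purchase": [8370, 7969],
})

# Columns stored as numbers vs. columns stored as text.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), numeric_cols)
print(len(text_cols), text_cols)
```

Running the same two `select_dtypes` calls on the full dataset would give the storage-level split, which can then be adjusted by hand for coded categories.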

### Features

User_ID: Unique ID for each customer.  
Product_ID: Unique ID for each product.  
Gender: Gender of the customer, coded as M (male) or F (female).  
Age: The age group of the customer, represented in categories (e.g., 0-17, 18-25, 26-35, 55+).  
Occupation: Occupation category code of the customer.  
City_Category: The category of the city where the customer resides, classified as A, B, or C.  
Stay_In_Current_City_Years: Number of years the customer has lived in the current city.  
Marital_Status: Indicates whether the customer is married (1) or not (0).  
Product_Category_1, Product_Category_2, Product_Category_3: Product categories associated with the purchased item.  
Purchase: The amount spent by the customer on the product.  

## User-Item Collaborative Filtering

User-Item Collaborative Filtering relies on finding relationships between users and items based solely on user interactions.  
The goal is to recommend products to a user based on the products that similar users have purchased (or interacted with).

In [1]:
import pandas as pd

In [2]:
dataset = pd.read_csv("datasets/BlackFriday.csv")

In [3]:
dataset.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


For **User-Item Collaborative Filtering**, we only need three key features: User ID, Product ID, and Purchase.  

1. **User ID** is essential because the recommendation system is built around users. Collaborative filtering works by finding users who have similar purchase or interaction patterns.  
The system aims to recommend products to a user by identifying products that similar users (with similar purchase behaviors) have interacted with.  

2. **Product ID** is the second critical piece because the system needs to recommend items. Each product must have a unique identifier so that the system can track which products users have interacted with.  
The goal is to find patterns of product interactions, like which products are frequently bought together or liked by similar users.

3. **Purchase** is the actual interaction between users and products. This represents implicit feedback because a purchase implies a level of preference, but it’s not as direct as a rating system (where users rate items from 1 to 5, for example).  
In collaborative filtering, this interaction data (purchase amounts) is used to find similar users (or products). Users with similar interaction patterns are grouped together, and recommendations are made based on these patterns.  

The other features in the dataset, like gender, age, occupation, and city, are demographic or categorical attributes.  
They might seem useful, but in **basic collaborative filtering**, we focus entirely on **interaction data** between users and products to find relationships.

**Example**:    
User_ID,Product_ID,Purchase  
1,101,200  
1,102,300  
2,101,100  
2,103,400  
3,102,150  
3,103,250  

From this data, we can see:

User 1 and User 3 both bought Product 102.  
User 2 and User 3 both bought Product 103.  

If we want to recommend products to User 1, we can recommend Product 103 because User 3 (who has a similar purchase pattern) also bought it.
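
The reasoning above can be sketched on the toy table. This is a minimal illustration, not the method used later in the notebook: the names `toy`, `baskets`, `target` and `neighbour` are made up here, and User 3 is simply assumed to be the similar user.

```python
import pandas as pd

# The toy example table from above.
toy = pd.DataFrame({
    "User_ID":    [1, 1, 2, 2, 3, 3],
    "Product_ID": [101, 102, 101, 103, 102, 103],
    "Purchase":   [200, 300, 100, 400, 150, 250],
})

# Set of products bought by each user.
baskets = toy.groupby("User_ID")["Product_ID"].apply(set)

target, neighbour = 1, 3  # assumed: User 3 is the user most similar to User 1

# Products both users bought, and products only the neighbour bought.
shared = baskets.loc[target] & baskets.loc[neighbour]
candidates = baskets.loc[neighbour] - baskets.loc[target]

print(shared)      # overlap that makes the two users look similar
print(candidates)  # products to recommend to the target user
```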

### 1. Create the User-Item Matrix
We need to convert the dataset into a matrix where:

1. **Rows** are users.
2. **Columns** are products.
3. **Values** are the Purchase amount.
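
Before running this on the full dataset, the layout can be checked on a small example. The sketch below uses the toy table with one extra, hypothetical repeat purchase (User 3 buying Product 103 twice) to show that `pivot_table` averages duplicate user-product pairs by default (`aggfunc="mean"`) and that `fill_value=0` marks products a user never bought.

```python
import pandas as pd

# Toy data; the last row is an added, hypothetical repeat purchase.
toy = pd.DataFrame({
    "User_ID":    [1, 1, 2, 2, 3, 3, 3],
    "Product_ID": [101, 102, 101, 103, 102, 103, 103],
    "Purchase":   [200, 300, 100, 400, 150, 250, 350],
})

# Rows = users, columns = products, values = (mean) purchase amount.
toy_matrix = toy.pivot_table(
    index="User_ID", columns="Product_ID", values="Purchase", fill_value=0
)

print(toy_matrix)
```

If total spend per product is preferred over the average, passing `aggfunc="sum"` is one alternative.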

In [4]:
# Create the User-Item Matrix.
# Note: pivot_table aggregates duplicate (User_ID, Product_ID) pairs with the mean by default.
user_item_matrix = dataset.pivot_table(index='User_ID', columns='Product_ID', values='Purchase', fill_value=0)

In [6]:
user_item_matrix.head()

User_ID,P00000142,P00000242,P00000342,P00000442,P00000542,P00000642,P00000742,P00000842,P00000942,P00001042,...,P0098942,P0099042,P0099142,P0099242,P0099342,P0099442,P0099642,P0099742,P0099842,P0099942
1000001,13650.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
user_item_matrix.shape 
# 5891 rows and 3623 columns.

(5891, 3623)

### 2. Compute User Similarity
There are several similarity metrics we can use to compare users based on their interactions (purchase amounts).

1. **Cosine Similarity**: Measures the cosine of the angle between two vectors. It is widely used in collaborative filtering because it’s simple and effective.  
If two users have a lot of overlapping purchases (i.e., they bought the same items), their cosine similarity will be high.  
**Formula**:  $$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|} $$  
Where **A** and **B** are the vectors representing user purchase amounts.  

2. **Pearson Correlation**: Measures the linear correlation between two users' interaction vectors. It captures how strongly two users' purchase patterns are linearly related.  
It accounts for the differences in users' purchasing habits (e.g., one user may spend more overall).  
**Formula**: $$\text{Pearson Correlation} = \frac{\sum_i (A_i - \overline{A}) (B_i - \overline{B})}{\sqrt{\sum_i (A_i - \overline{A})^2 \sum_i (B_i - \overline{B})^2}} $$  
Where $\overline{A}$ and $\overline{B}$ are the means of vectors A and B.
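
To make the two formulas concrete, here is a minimal NumPy sketch. The helper names `cosine` and `pearson` and the vectors `user_a` and `user_b` are illustrative assumptions, not part of the notebook. The second user follows the same pattern as the first but spends more overall, which Pearson correlation ignores and cosine similarity does not.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two purchase vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation, i.e. cosine similarity of the mean-centred vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return cosine(a - a.mean(), b - b.mean())

# Hypothetical purchase amounts over the same four products.
user_a = [200, 300, 0, 50]
user_b = [700, 1000, 100, 250]  # 3 * user_a + 100: same pattern, higher spend

print(cosine(user_a, user_b))   # slightly below 1
print(pearson(user_a, user_b))  # 1 up to floating-point rounding
```

Both helpers divide by a vector norm, so a user with no purchases at all (an all-zero vector) would need separate handling in this sketch.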

In [8]:
# Calculate Cosine Similarity
# cosine_similarity(user_item_matrix) computes the pairwise cosine similarity between all users based on their purchase history.

from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity between users
user_similarity = cosine_similarity(user_item_matrix)

In [9]:
print(user_similarity)

[[1.         0.02242116 0.05227727 ... 0.         0.18490258 0.19621259]
 [0.02242116 1.         0.12348515 ... 0.04360517 0.01254813 0.1175331 ]
 [0.05227727 0.12348515 1.         ... 0.         0.05477365 0.0742846 ]
 ...
 [0.         0.04360517 0.         ... 1.         0.17341779 0.00511762]
 [0.18490258 0.01254813 0.05477365 ... 0.17341779 1.         0.14000204]
 [0.19621259 0.1175331  0.0742846  ... 0.00511762 0.14000204 1.        ]]


In [10]:
# Convert the similarity matrix into a DataFrame for easier readability
user_similarity_df = pd.DataFrame(user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index)

In [11]:
user_similarity_df.head()

User_ID,1000001,1000002,1000003,1000004,1000005,1000006,1000007,1000008,1000009,1000010,...,1006031,1006032,1006033,1006034,1006035,1006036,1006037,1006038,1006039,1006040
1000001,1.0,0.022421,0.052277,0.260793,0.141637,0.081355,0.016366,0.134528,0.23524,0.103026,...,0.07647,0.093151,0.109949,0.020458,0.036948,0.134177,0.095308,0.0,0.184903,0.196213
1000002,0.022421,1.0,0.123485,0.030776,0.060786,0.084935,0.14511,0.154445,0.084236,0.079004,...,0.020968,0.067051,0.262696,0.032563,0.17735,0.132331,0.160226,0.043605,0.012548,0.117533
1000003,0.052277,0.123485,1.0,0.104899,0.007618,0.069311,0.057641,0.110638,0.049278,0.112235,...,0.077617,0.055888,0.174383,0.0,0.094321,0.106765,0.036141,0.0,0.054774,0.074285
1000004,0.260793,0.030776,0.104899,1.0,0.029625,0.035968,0.068607,0.176677,0.123671,0.084467,...,0.06988,0.039804,0.280614,0.0,0.045294,0.156423,0.058255,0.0,0.080792,0.120634
1000005,0.141637,0.060786,0.007618,0.029625,1.0,0.034334,0.141154,0.180935,0.243927,0.061154,...,0.147876,0.024371,0.059062,0.0,0.130182,0.191264,0.097919,0.011508,0.023797,0.211371


1. **The diagonal** values are 1.0, meaning each user is perfectly similar to themselves.

2. **Off-diagonal** values represent the similarity between two different users.  
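For example, User 1000001 and User 1000004 have a similarity score of about 0.261, which is the highest off-diagonal value shown in the first row, meaning their purchasing patterns are relatively similar compared with the other pairs shown.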
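
These two properties can be verified directly, and a single pair can be looked up with `.loc`. The sketch below rebuilds a tiny similarity table from the toy users with plain NumPy (an assumption made to keep the example self-contained; `toy_matrix` and `toy_similarity_df` are illustrative names).

```python
import numpy as np
import pandas as pd

# Toy user-item matrix: rows are users 1-3, columns are products 101-103.
toy_matrix = pd.DataFrame(
    [[200.0, 300.0, 0.0], [100.0, 0.0, 400.0], [0.0, 150.0, 250.0]],
    index=pd.Index([1, 2, 3], name="User_ID"),
    columns=pd.Index([101, 102, 103], name="Product_ID"),
)

# Cosine similarity: normalise each row to unit length, then take dot products.
values = toy_matrix.to_numpy()
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
sim = unit @ unit.T

toy_similarity_df = pd.DataFrame(sim, index=toy_matrix.index, columns=toy_matrix.index)

print(np.allclose(np.diag(sim), 1.0))  # every user is perfectly similar to themselves
print(np.allclose(sim, sim.T))         # similarity(u, v) equals similarity(v, u)
print(toy_similarity_df.loc[1, 3])     # similarity between User 1 and User 3
```

The same `.loc[row_user, column_user]` lookup works on `user_similarity_df` with the real user IDs.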

### Finding Similar Users
We can now get the top similar users for a given target user.

In [12]:
def get_top_similar_users(user_id, user_similarity_df, top_n=5):
    """Return the top_n users most similar to user_id, excluding user_id itself."""
    # Sort the users by similarity to the target user.
    similar_users = user_similarity_df[user_id].sort_values(ascending=False)

    # Exclude the target user from the results (since the user is most similar to themselves).
    similar_users = similar_users.drop(user_id)

    return similar_users.head(top_n)

In [14]:
# Example: Getting top 5 similar users for User1 (1000001)
top_similar_users_for_user_1 = get_top_similar_users(user_id=1000001, user_similarity_df=user_similarity_df, top_n=5)

In [15]:
print(top_similar_users_for_user_1)

User_ID
1002464    0.413474
1001515    0.374519
1002065    0.360309
1003862    0.356866
1001476    0.343077
Name: 1000001, dtype: float64
