## E-COMMERCE RECOMMENDER SYSTEM

## MODELLING: Part 1

**Key takeaways from data wrangling and EDA suggest the following approaches:**

- Collaborative Filtering
    - We can use collaborative filtering to design the recommender system: the model relies on historical ratings and picks up more patterns as the number of ratings increases. For this we will apply 2 algorithms: KNN and SVD.

- Data quality
    - A considerable number of products have few ratings, which is the 'cold start' problem. For the model to be effective, we will therefore only consider users having at least 3 reviews (average is 43 reviews per user).

The objective of this notebook is to apply a collaborative filtering approach with 2 simple algorithms: KNN and Matrix Factorization.

## Imports

In [73]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import mean_squared_error
from scipy.sparse.linalg import svds
from collections import defaultdict

## Loading data

In [5]:
data = pd.read_csv('/Users/judith/Data_science_projects/Springboard_AssignmentsJY/capstone_three/data/processed/ratings_data.csv')

data.head()

### KNN

In [7]:
# ensuring there are no duplicates in our dataset
data = data.drop_duplicates(subset=['item_id', 'user_id'])

In [8]:
# Ensuring there are no null values in our dataset
data.isnull().sum()

item_id    0
user_id    0
rating     0
dtype: int64

In [10]:
# creating the matrix to input the model by pivoting the dataframe
ratings_pivot = data.pivot(index = 'item_id', 
                                  columns = 'user_id', 
                                  values = 'rating').fillna(0)

In [11]:
# Checking the result of the pivot
ratings_pivot.head()

item_id \ user_id,"""Ferrari"")",#,#1dad,'Chelle,'Tree',(usually),-L,.,..,00erin,...,zuel,zugai01,zulemaphone,zumbafitnesscarly,zumbaneko,zurajohnson,zuzu_zoom,🇦🇺,🐻,😊
6454,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7443,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11960,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16411,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21296,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
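# calculating the sparsity of the matrix: share of item/user cells with no rating
# (missing ratings were filled with 0 above; 0 is assumed not to be a valid rating)
sparsity = 1 - np.count_nonzero(ratings_pivot.values) / ratings_pivot.size
sparsity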

In [12]:
# As expected, the resulting dataframe is sparse: most item/user pairs have no
# rating. This is why we are transforming the pivot table into
# a scipy csr_matrix

ratings_matrix = csr_matrix(ratings_pivot.values)
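
The pivot above first builds a dense items x users table and only then converts it to CSR. As a minimal sketch, the same sparse matrix can be assembled directly from the long-format ratings, assuming duplicates have already been dropped (otherwise repeated pairs would be summed). The toy dataframe and its values are hypothetical; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for the ratings data (hypothetical values, same column names).
toy = pd.DataFrame({
    "item_id": [10, 10, 20, 30, 30],
    "user_id": ["ann", "bob", "ann", "bob", "cat"],
    "rating":  [5, 3, 4, 1, 2],
})

# Map each id to a contiguous integer code; the categories keep the labels.
item_cat = toy["item_id"].astype("category")
user_cat = toy["user_id"].astype("category")

# Build the items x users matrix from (value, (row, column)) triplets.
sparse_ratings = csr_matrix(
    (toy["rating"].to_numpy(dtype=float),
     (item_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy())),
    shape=(len(item_cat.cat.categories), len(user_cat.cat.categories)),
)

# Same numbers as the dense pivot, without materialising the zeros first.
dense_pivot = toy.pivot(index="item_id", columns="user_id", values="rating").fillna(0)
same = np.array_equal(sparse_ratings.toarray(), dense_pivot.to_numpy())

# Sparsity: share of item/user cells that hold no rating.
n_cells = sparse_ratings.shape[0] * sparse_ratings.shape[1]
sparsity = 1.0 - sparse_ratings.nnz / n_cells
```

Row `i` of `sparse_ratings` corresponds to `item_cat.cat.categories[i]`, and column `j` to `user_cat.cat.categories[j]`, so the labels can be recovered after modelling.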

In [13]:
# split the rows (items) of the matrix into train and test sets
train, test = train_test_split(ratings_matrix, test_size = 0.25, random_state = 0)

In [19]:
recommender_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
recommender_knn.fit(train)
pred = recommender_knn.kneighbors(test, return_distance = True)

In [33]:
# Generating sample index to use for the recommendations
query_index = np.random.choice(ratings_pivot.shape[0])

In [34]:
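# Note: recommender_knn was fit on `train`, so the returned indices are row
# positions in `train`, not row positions in ratings_pivot
distances, indices = recommender_knn.kneighbors(ratings_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)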

In [35]:
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(ratings_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, ratings_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for 153872:

1: 27439, with distance of 0.7094596592420652:
2: 129392, with distance of 0.7912244531993269:
3: 24853, with distance of 0.8322789674325088:
4: 143422, with distance of 0.8412306341162903:
5: 152709, with distance of 0.8566835082312272:
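
Because `recommender_knn` is fit on `train`, the indices returned by `kneighbors` are row positions within `train`, while the printing loop looks them up in `ratings_pivot.index`. One way to keep neighbours and item labels aligned is to split the labels together with the matrix rows. The sketch below uses a hypothetical toy matrix and made-up item ids; it is an illustration under those assumptions, not the notebook's own pipeline.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

# Toy items x users matrix (hypothetical values) with its item labels.
item_ids = np.array([101, 102, 103, 104, 105, 106, 107, 108])
rng = np.random.default_rng(0)
matrix = csr_matrix(rng.integers(0, 6, size=(8, 12)).astype(float))

# Split the labels together with the rows so positions can be mapped back.
train_m, test_m, train_ids, test_ids = train_test_split(
    matrix, item_ids, test_size=0.25, random_state=0
)

knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=3)
knn.fit(train_m)

# Query with a held-out item: the indices refer to rows of train_m,
# so they are translated through train_ids rather than the full index.
distances, indices = knn.kneighbors(test_m[0], n_neighbors=3)
neighbour_ids = train_ids[indices.flatten()]
```

If the model is instead fit on the full matrix and queried with one of its own rows, the query item normally comes back as its own nearest neighbour, which is the case the `i == 0` branch of the loop assumes.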


Evaluation of the KNN model:
- Pros: 
    - Simple model that does not require any data other than historical ratings
- Limitations: 
    - High dimensionality, resulting from the large number of users compared with items
    - Sparse matrix (sparsity problem), as most users have rated only a few items
    - The model suffers from the cold start problem: it can’t recommend items to new users
    - It is also inclined to recommend popular items with many reviews, which adds bias to the model. This also means that it will be difficult to recommend new products, since they have few or no reviews

For these reasons, we will try matrix factorization, which helps to address the high dimensionality issue.

## Matrix Factorization

In [36]:
# Counting the number of distinct items reviewed by each user
data_grouped = data.groupby(['user_id', 'item_id']).size().groupby('user_id').size()

In [37]:
# Keeping only the users with at least 3 reviews to include in the model;
# restricting the analysis to these users should make the model more robust
data_short = data_grouped[data_grouped >= 3].reset_index()[['user_id']]
print('Total number of users {}'.format(len(data_grouped)))
print('Number of users with at least 3 ratings {}'.format(len(data_short)))

Total number of users 44783
Number of users with at least 3 ratings 6866


In [38]:
# merging data_short and the data to have a final dataset ready for modelling
selection = data.merge(data_short, how = 'right',left_on = 'user_id', right_on = 'user_id')
selection.head()

Unnamed: 0,item_id,user_id,rating
0,105202,19lovelikecrazy95,5
1,57369,19lovelikecrazy95,4
2,118317,19lovelikecrazy95,3
3,32406,1dianaoliver,3
4,116313,1dianaoliver,1


In [39]:
print('Total number of interactions: {}'.format(len(data)))
print('Total number of interactions from users with at least 3 reviews: {}'.format(len(selection)))

Total number of interactions: 99892
Total number of interactions from users with at least 3 reviews: 54365
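
The group, filter and merge steps above can also be written as a single boolean mask. This is only a compact sketch on a hypothetical toy dataframe; `min_reviews` is an illustrative name, not a variable from the notebook.

```python
import pandas as pd

# Toy ratings (hypothetical values, same column names as the notebook).
toy = pd.DataFrame({
    "item_id": [1, 2, 3, 1, 2, 1],
    "user_id": ["ann", "ann", "ann", "bob", "bob", "cat"],
    "rating":  [5, 4, 3, 2, 1, 5],
})

min_reviews = 3  # illustrative threshold, matching the ">= 3" filter above

# Number of distinct items reviewed by the user of each row.
reviews_per_user = toy.groupby("user_id")["item_id"].transform("nunique")

# Keep only the rows belonging to sufficiently active users.
selection_toy = toy[reviews_per_user >= min_reviews]
```

Here only `ann` has three reviews, so `selection_toy` keeps her three rows and drops the rest.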


#### Split data into train and test sets

In [40]:
train, test = train_test_split(selection, stratify = selection['user_id'],
                              test_size = 0.2, random_state = 42)

print('train size = {}'.format(len(train)))
print('test size = {}'.format(len(test)))

train size = 43492
test size = 10873


In [41]:
# pivoting the training set into a user x item matrix (missing ratings filled with 0)
rating_pivot = train.pivot(index = 'user_id', 
                          columns = 'item_id', 
                          values = 'rating').fillna(0)

In [42]:
rating_pivot.head()

user_id \ item_id,6454,7443,11960,16411,21296,22563,24853,27439,27590,28252,...,155090,155165,155293,155305,155307,155308,155317,155537,155597,155950
19lovelikecrazy95,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1dianaoliver,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3chuckleheads,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4jess,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7578042,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
# converting the pivot table to a scipy csr_matrix before the truncated SVD
rating_matrix = rating_pivot.to_numpy()
rating_csr_matrix = csr_matrix(rating_matrix)

In [46]:
U, sigma, Vt = svds(rating_csr_matrix, k = 15)

In [47]:
U.shape, sigma.shape, Vt.shape

((6866, 15), (15,), (15, 1007))

In [49]:
sigma = np.diag(sigma)
sigma.shape

(15, 15)
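
To make the role of the three factors concrete, the sketch below runs `svds` on a small hypothetical user x item matrix and checks that the rank-`k` product `U @ diag(s) @ Vt` leaves a reconstruction error equal to the energy of the discarded singular values. The matrix, the seed and `k = 4` are assumptions chosen for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(42)

# Toy user x item matrix (hypothetical): ratings 1-5, about 70% of cells set to 0.
toy = rng.integers(1, 6, size=(30, 12)).astype(float)
toy[rng.random(toy.shape) < 0.7] = 0.0

k = 4  # number of latent factors kept; must be smaller than min(toy.shape)
U, s, Vt = svds(csr_matrix(toy), k=k)

# svds does not guarantee the order of the singular values, so sort them.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# Rank-k reconstruction of the matrix from the truncated factors.
approx = U @ np.diag(s) @ Vt

# The Frobenius error of the best rank-k approximation equals the root of
# the sum of squared singular values that were left out.
full_s = np.linalg.svd(toy, compute_uv=False)
err = np.linalg.norm(toy - approx)
expected = np.sqrt(np.sum(full_s[k:] ** 2))
```

A larger `k` lowers the reconstruction error on the training matrix, but it also reproduces more of the zero-filled cells, so `k` is usually tuned against held-out ratings rather than against this error.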

In [50]:
# making predictions: reconstructing the rating matrix from the truncated factors
predicted_rating = np.dot(np.dot(U, sigma), Vt)
predicted_rating

array([[ 2.75192563e-04,  1.78305106e-01,  1.65969500e-01, ...,
        -5.95479371e-03, -6.72155157e-04, -2.80975475e-05],
       [-2.46172277e-03,  1.18905471e-01,  4.17804844e-02, ...,
        -5.72085963e-03, -4.22946042e-04,  1.89071601e-02],
       [ 4.93324727e-02,  1.63123000e-01, -9.94252603e-02, ...,
        -4.66037631e-03,  5.96405671e-04, -1.01144399e-02],
       ...,
       [ 1.36364624e-03, -1.63094334e-03, -6.01779358e-05, ...,
         4.23475113e-04,  2.71693816e-05,  3.32806537e-04],
       [ 1.11944561e-01,  3.70358557e-02,  1.67525901e-01, ...,
         3.81931126e-02, -8.16698957e-04,  3.78808114e-02],
       [ 3.22600405e-02,  4.10092366e-02,  4.55589574e-02, ...,
         1.25956345e-02, -6.79596620e-05,  1.26729309e-02]])

In [51]:
# min-max scaling the predicted scores to the [0, 1] range
predicted_rating_norm = (predicted_rating - predicted_rating.min()) / (predicted_rating.max() - predicted_rating.min())
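
The scaling above maps the scores to [0, 1], whereas the observed ratings shown earlier are on a 1 to 5 scale. If the scores are later compared with observed ratings, one option is to stretch them onto the rating range instead. The sketch below assumes a 1 to 5 scale and uses made-up scores; whether a global min-max rescaling is appropriate for this dataset is itself an assumption.

```python
import numpy as np

# Hypothetical reconstructed scores (stand-in for predicted_rating).
scores = np.array([[-0.1, 0.3],
                   [ 0.0, 0.9]])

# Global min-max scaling to [0, 1], as in the cell above.
lo, hi = scores.min(), scores.max()
unit = (scores - lo) / (hi - lo)

# Stretch to the assumed rating range so both sides share one scale.
r_min, r_max = 1.0, 5.0
rescaled = r_min + unit * (r_max - r_min)
```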

In [52]:
users_ids = list(rating_pivot.index)

In [53]:
predicted_rating_df = pd.DataFrame(predicted_rating_norm, columns = rating_pivot.columns,
                                  index = users_ids).transpose()

In [54]:
predictions_df = predicted_rating_df.stack().reset_index()
predictions_df.head()

Unnamed: 0,item_id,level_1,0
0,6454,19lovelikecrazy95,0.317101
1,6454,1dianaoliver,0.316887
2,6454,3chuckleheads,0.320944
3,6454,4jess,0.31715
4,6454,7578042,0.317925


In [64]:
predictions_df.rename(columns = {'level_1': 'user_id',
                                 0: 'est_rating'}, inplace = True)

In [67]:
# checking the shape of predictions and ground truth prior to calculating the error
predictions_df.shape, rating_matrix.shape

((6914062, 3), (6866, 1007))

In [72]:
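# calculating the error for the first 1000 entries
# note: mean_squared_error returns the MSE (not the RMSE), and the two slices are
# compared by position, so rows are not matched on user_id/item_id
mse = mean_squared_error(data.rating.iloc[:1000], predictions_df.est_rating.iloc[:1000])
mse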

15.93694047432965
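
The comparison above pairs the first 1000 rows of `data` with the first 1000 rows of `predictions_df`, which are not guaranteed to describe the same user/item pairs. A more conventional evaluation joins the held-out ratings to the predictions on both keys and takes the square root of the MSE. The sketch below uses hypothetical toy frames with the notebook's column names; the numbers are made up.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# Hypothetical held-out ratings (stand-in for the test set).
test_toy = pd.DataFrame({
    "user_id": ["ann", "bob", "cat"],
    "item_id": [10, 20, 10],
    "rating":  [4, 2, 5],
})

# Hypothetical long-format predictions (stand-in for predictions_df).
preds_toy = pd.DataFrame({
    "item_id":    [10, 10, 10, 20, 20, 20],
    "user_id":    ["ann", "bob", "cat", "ann", "bob", "cat"],
    "est_rating": [3.0, 1.0, 4.0, 2.5, 4.0, 0.5],
})

# Align each observed rating with the prediction for the same pair.
aligned = test_toy.merge(preds_toy, on=["user_id", "item_id"], how="inner")

mse = mean_squared_error(aligned["rating"], aligned["est_rating"])
rmse = float(np.sqrt(mse))  # RMSE is the square root of the MSE
```

With these toy values the squared errors are 1, 4 and 1, so the MSE is 2 and the RMSE is about 1.41. The estimates and the observed ratings need to be on the same scale for the result to be meaningful.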

**KNN Model Evaluation**
- Benefits: 
    - Easy to implement
- Weaknesses: 
    - High dimensionality due to the large amount of data, resulting in lower performance 
    - More inclined to recommend popular products, which means a negative bias against new products and products with few reviews
    - Also suffers from sparsity resulting from the low number of user/item interactions (sparsity problem)

**MATRIX FACTORIZATION Model Evaluation**
- Benefits:
    - MF offers an alternative to KNN that addresses the dimensionality issue by projecting the sparse matrix onto a small number of dense latent factors
- Weaknesses:
    - The error of about 15.9 computed above comes from `mean_squared_error`, so it is an MSE (an RMSE of about 3.99), which is not a great performance
    - The cold start issue is still not addressed


In the next part, we will try to improve the RMSE score using the Surprise library for collaborative filtering.