# E-COMMERCE RECOMMENDER SYSTEM

# MODELLING

## Imports

In [88]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import TruncatedSVD

## Loading and reading data

In [14]:
ratings = pd.read_csv('/Users/judith/Data_science_projects/Springboard_AssignmentsJY/capstone_three/data/processed/processed_ratings_data.csv')

In [15]:
ratings.head()

Unnamed: 0.1,Unnamed: 0,item_id,user_id,rating
0,0,7443,Alex,4
1,1,7443,carolyn.agan,3
2,2,7443,Robyn,4
3,3,7443,De,4
4,4,7443,tasha,4


In [16]:
# drop the leftover index columns written by earlier to_csv calls
ratings = ratings.drop(columns = ['Unnamed: 0.1', 'Unnamed: 0'], errors = 'ignore')

### Summary from Wrangling and EDA phases

Key takeaways from the previous phase suggest we can leverage collaborative filtering to design the recommender system. This can be done in two ways:
- Product-based: recommend products to users based on similarities between products.
- User-based: recommend products to users based on what people similar to them have bought.

In this case there are far more users than products, so a user-based approach seems logical at first: describing each user by their ratings of the products limits the dimensionality of the data, and there is plenty of data to profile every customer. However, this approach comes with more variability and makes it difficult to make recommendations to new users. Hence we will use the product-based approach instead, measuring the similarity between products with cosine similarity.
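
As a minimal sketch of the product-based idea, the snippet below computes the cosine similarity between product rating vectors with plain NumPy. The `toy_ratings` matrix and the `cosine_similarity_matrix` helper are illustrative assumptions, not part of the project data or code.

```python
import numpy as np

# Hypothetical toy ratings: rows are products, columns are users, 0 = no rating
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],   # product A
    [4.0, 5.0, 0.0, 1.0],   # product B
    [0.0, 0.0, 5.0, 4.0],   # product C
])

def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for unrated products
    unit = m / norms
    return unit @ unit.T

item_sim = cosine_similarity_matrix(toy_ratings)
print(np.round(item_sim, 3))
```

Products A and B are rated highly by the same users, so they end up far more similar to each other than either is to product C.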

## Collaborative Filtering: what products to recommend to which customers?

### Approach 1: KNN

#### Creating the model

We will use a KNN algorithm to find, for a given product, the products whose user ratings are most similar.

In [18]:
ratings_cleaned = ratings.drop_duplicates(subset=['item_id', 'user_id'])

In [139]:
ratings_cleaned.isnull().sum()

item_id    0
user_id    1
rating     0
dtype: int64

In [142]:
new = ratings_cleaned.dropna(subset = ['user_id'])

In [40]:
# creating the matrix to input the model by pivoting the dataframe
ratings_pivot = ratings_cleaned.pivot(index = 'item_id', 
                                  columns = 'user_id', 
                                  values = 'rating').fillna(0)

In [41]:
# Checking the result of the pivot
ratings_pivot.head()

item_id \ user_id,NaN,"""Ferrari"")",#,#1dad,'Chelle,'Tree',(usually),-L,.,..,...,zuel,zugai01,zulemaphone,zumbafitnesscarly,zumbaneko,zurajohnson,zuzu_zoom,🇦🇺,🐻,😊
6454,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7443,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11960,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16411,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21296,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
# As expected, the resulting dataframe is sparse: each customer rated only a few
# of the products, so most entries are 0. This is why we are transforming the
# pivot table into a scipy csr_matrix

ratings_matrix = csr_matrix(ratings_pivot.values)
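
As a hedged alternative to pivoting first, the sparse matrix can also be built directly from the long-format table, which avoids materialising the dense pivot. The `toy` frame below is hypothetical and only mirrors the column names of `ratings_cleaned`; the variable names are assumptions for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical long-format ratings with the same columns as ratings_cleaned
toy = pd.DataFrame({
    "item_id": [7443, 7443, 6454, 11960],
    "user_id": ["Alex", "Robyn", "Alex", "tasha"],
    "rating": [4, 4, 5, 3],
})

# Map each id to a consecutive integer position (categories are sorted)
item_cat = toy["item_id"].astype("category")
user_cat = toy["user_id"].astype("category")

# Rows are products, columns are users; missing pairs are implicit zeros
sparse_ratings = csr_matrix(
    (toy["rating"].to_numpy(dtype=float),
     (item_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy())),
    shape=(len(item_cat.cat.categories), len(user_cat.cat.categories)),
)

# Share of product/user pairs without a rating
n_cells = sparse_ratings.shape[0] * sparse_ratings.shape[1]
sparsity = 1.0 - sparse_ratings.nnz / n_cells
print(sparse_ratings.shape, round(sparsity, 3))
```

The `sparsity` value gives a quick check of how empty the matrix is before choosing a model.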

In [86]:
# split the item rows into train and test sets
# (note: the model below is fitted on the full ratings_matrix)
train, test = train_test_split(ratings_matrix, test_size = 0.25, random_state = 0)

In [89]:
recommender_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
recommender_knn.fit(ratings_matrix)

NearestNeighbors(algorithm='brute', metric='cosine', n_neighbors=20)

#### Making recommendations

In [49]:
# Picking a random product (a row of the pivot table) to use as the query
query_index = np.random.choice(ratings_pivot.shape[0])
query_index

462

In [36]:
# ask for 6 neighbours: the closest one is expected to be the query product itself
distances, indices = recommender_knn.kneighbors(
    ratings_pivot.iloc[query_index, :].values.reshape(1, -1),
    n_neighbors = 6)

In [38]:
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(ratings_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, ratings_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for 151226:

1: 153261, with distance of 0.8127403208675936:
2: 152691, with distance of 0.8171818939682174:
3: 152377, with distance of 0.8361477807591444:
4: 152209, with distance of 0.8454904998458298:
5: 151475, with distance of 0.8537455151745739:
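
The lookup above can be wrapped in a small helper that takes a product id instead of a row position. This is a sketch on hypothetical toy data: `toy_pivot`, `toy_knn` and `similar_items` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical item-by-user pivot table, 0 = no rating
toy_pivot = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 5, 0, 1],
     [0, 0, 5, 4],
     [0, 1, 4, 5]],
    index=pd.Index([101, 102, 103, 104], name="item_id"),
    columns=["u1", "u2", "u3", "u4"],
    dtype=float,
)

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_knn.fit(toy_pivot.to_numpy())

def similar_items(model, pivot, item_id, n=5):
    """Return up to n (item_id, cosine distance) pairs closest to item_id."""
    query = pivot.loc[[item_id]].to_numpy()
    # Ask for one extra neighbour because the query product is in the fitted data
    k = min(n + 1, pivot.shape[0])
    distances, indices = model.kneighbors(query, n_neighbors=k)
    pairs = [(pivot.index[i], float(d))
             for d, i in zip(distances.ravel(), indices.ravel())]
    # Drop the query product itself and keep the n closest remaining products
    return [(item, d) for item, d in pairs if item != item_id][:n]

result = similar_items(toy_knn, toy_pivot, 101, n=2)
print(result)
```

Filtering on the product id rather than on position 0 keeps the helper correct even if another product happens to tie with the query at distance 0.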


#### Evaluating the model

Error metrics such as `mean_squared_error` need predicted ratings, which the nearest-neighbours model does not produce, so a ranking-based evaluation is a better fit:

- Divide the data into train and test sets by user.
- For each user in the test set, given their history, get the top N recommendations from the model.
- Precision is the share of the top N recommendations that match what the user actually acted upon (for example, products they rated).
- Recall is the share of the user's actions (products rated by the user) that were captured by the top N recommendations.
- Calculate both for all users in the test set and average them.
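
The precision and recall calculation described above can be sketched as follows. The helper name `precision_recall_at_n` and the two toy dictionaries are assumptions made for illustration; in practice the recommendations would come from the fitted model and the held-out actions from the test set.

```python
def precision_recall_at_n(recommended, relevant, n):
    """Precision and recall of the top-n recommendations for one user."""
    top_n = list(recommended)[:n]
    relevant = set(relevant)
    hits = sum(1 for item in top_n if item in relevant)
    precision = hits / len(top_n) if top_n else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical recommendations and held-out actions for two test users
recommendations = {"u1": [10, 20, 30, 40], "u2": [50, 60, 70, 80]}
held_out = {"u1": {20, 40, 90}, "u2": {50}}

scores = [precision_recall_at_n(recommendations[u], held_out[u], n=3)
          for u in recommendations]

# Average over all test users
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
print(mean_precision, mean_recall)
```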

In [150]:
train, test = train_test_split(new,
                              test_size = 0.20,
                              random_state = 42)
print('Train set size is {}'.format(len(train)))
print('Test set size is {}'.format(len(test)))

Train set size is 79913
Test set size is 19979
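
The split above is made row by row, so the same user can appear in both sets. If each user should instead fall entirely in one set, one option is scikit-learn's `GroupShuffleSplit`. The sketch below uses a hypothetical `toy` frame with the same column names as the ratings table.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical ratings: five users with two ratings each
toy = pd.DataFrame({
    "item_id": [1, 2, 1, 3, 2, 3, 1, 2, 3, 1],
    "user_id": ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"],
    "rating":  [5, 4, 3, 5, 4, 2, 5, 1, 3, 4],
})

# test_size is the share of users (groups), not the share of rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["user_id"]))

train_users = toy.iloc[train_idx]
test_users = toy.iloc[test_idx]
print(sorted(test_users["user_id"].unique()))
```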


In [148]:
new.shape

(99892, 3)

### Approach 2: Matrix Factorization

#### Creating the model

In [51]:
ratings_pivot2 = ratings_cleaned.pivot(index = 'user_id', 
                                       columns = 'item_id', 
                                       values = 'rating').fillna(0)

In [52]:
ratings_pivot2.head()

user_id \ item_id,6454,7443,11960,16411,21296,22563,24853,27439,27590,28252,...,155090,155165,155293,155305,155307,155308,155317,155537,155597,155950
NaN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""Ferrari"")",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#1dad,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Chelle,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [53]:
ratings_pivot2.shape

(44784, 1020)

In [58]:
X = ratings_pivot2.values.T

In [59]:
X.shape

(1020, 44784)

In [61]:
SVD = TruncatedSVD(n_components = 15, random_state = 17)
matrix = SVD.fit_transform(X)

In [62]:
matrix.shape

(1020, 15)
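
One way to sanity-check the choice of `n_components` is to look at the cumulative explained variance of the components. The sketch below runs on a random, hypothetical matrix (`toy_X`), so the numbers say nothing about the real ratings data; it only illustrates the `explained_variance_ratio_` attribute of `TruncatedSVD`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical item-by-user matrix: 20 products, 50 users, mostly zeros
toy_X = rng.integers(0, 6, size=(20, 50)) * (rng.random((20, 50)) < 0.2)
toy_X = toy_X.astype(float)

toy_svd = TruncatedSVD(n_components=5, random_state=17)
toy_embedding = toy_svd.fit_transform(toy_X)

# Share of the variance captured by the first 1, 2, ... components
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)
print(toy_embedding.shape)
print(np.round(cumulative, 3))
```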

In [151]:
matrix

array([[ 5.49506612e+00, -5.48591859e-01, -3.40621050e-01, ...,
        -1.08520141e-02, -1.38905720e+00,  6.26893508e-01],
       [ 4.50539423e+01, -2.27012790e+00, -2.20482598e+00, ...,
        -5.12437921e+00, -2.81682389e+00, -2.20575729e+00],
       [ 4.53542127e+01, -6.02720866e+00, -1.20727207e+00, ...,
        -1.44116260e+01, -7.24141409e+00, -4.17367664e+00],
       ...,
       [ 1.38579022e+00,  3.61148485e-02, -1.62724549e-01, ...,
         2.96457936e-01, -1.65050127e-01,  2.37421440e-01],
       [ 7.01490257e-02, -4.16783838e-02,  2.70667033e-02, ...,
        -5.02232048e-03, -3.06042195e-02, -1.15255060e-01],
       [ 1.42068394e+00,  2.02621287e-01, -1.35038541e-02, ...,
         1.49408621e-01, -1.93731294e-01,  3.77802121e-01]])

In [64]:
corr = np.corrcoef(matrix)
corr.shape

(1020, 1020)

In [65]:
items = ratings_pivot2.columns
items_list = list(items)

In [69]:
print(query_index)

462


In [76]:
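# correlation of every product with the query product
coef = corr[query_index]
# products whose correlation with the query product is at most 0.98, listed in
# their original column order (not ranked by similarity); show the first 21
list(items[(coef <= 0.98)])[:21]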

[6454,
 7443,
 11960,
 16411,
 21296,
 22563,
 24853,
 27439,
 27590,
 28252,
 28967,
 31644,
 31752,
 32134,
 32236,
 32403,
 32405,
 32406,
 34931,
 34935,
 34937]
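
To turn the correlation matrix into ranked recommendations, the products can be sorted by their correlation with the query product instead of being filtered by a threshold. The following is a sketch on a hypothetical four-product embedding; `top_correlated`, `toy_embedding` and `toy_items` are illustrative names.

```python
import numpy as np

def top_correlated(corr, item_ids, query_pos, n=5):
    """Return the n products most correlated with the product at query_pos."""
    scores = corr[query_pos].copy()
    scores[query_pos] = -np.inf  # never recommend the query product itself
    order = np.argsort(scores)[::-1][:n]
    return [(item_ids[i], float(scores[i])) for i in order]

# Hypothetical latent-factor rows for four products
toy_embedding = np.array([
    [1.0, 2.0, 3.0],   # a
    [2.0, 4.0, 6.5],   # b: almost proportional to a
    [3.0, 2.0, 1.0],   # c: reversed pattern of a
    [1.0, 0.0, 1.0],   # d: uncorrelated with a
])
toy_items = ["a", "b", "c", "d"]

toy_corr = np.corrcoef(toy_embedding)
ranked = top_correlated(toy_corr, toy_items, query_pos=0, n=3)
print(ranked)
```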

### Models evaluation

Evaluation of the KNN model:
- Benefits: no requirement for product descriptions.
- Limitations:
    - High dimensionality, which results in lower performance (the model suffers from the curse of dimensionality).
    - Cannot recommend an item if no user reviews exist (the cold start problem).
    - Difficult to make recommendations to new users, and inclined to favor popular products with lots of reviews.
    - Suffers from a sparsity problem, as users review only selected items.
    - Faces the "gray sheep problem" (i.e., users whose tastes do not consistently match those of any group get poor recommendations).
    - Difficult to recommend new releases since they have fewer reviews.

Evaluation of the matrix factorization model:
- SVD is an alternative to KNN: it extracts the useful part of the data, that is, the hidden correlations (latent factors), and removes the redundant parts.

### Content Based Model

- Works even when a product has no user reviews.
- Recommendations are based on the attributes of the item.
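
A minimal sketch of the content-based idea, assuming each product comes with a few categorical attributes. The `attributes` table, its column names and the `most_similar` helper are hypothetical; the notebook does not define a product attribute table.

```python
import numpy as np
import pandas as pd

# Hypothetical product attributes indexed by item_id
attributes = pd.DataFrame(
    {"category": ["shoes", "shoes", "bags", "bags"],
     "brand": ["acme", "zen", "acme", "zen"]},
    index=[101, 102, 103, 104],
)

# One-hot encode the attributes and normalise each row to unit length
features = pd.get_dummies(attributes).to_numpy(dtype=float)
unit = features / np.linalg.norm(features, axis=1, keepdims=True)

# Cosine similarity between products, based on attributes only
similarity = pd.DataFrame(unit @ unit.T,
                          index=attributes.index,
                          columns=attributes.index)

def most_similar(item_id, n=2):
    """Return the n products whose attributes are closest to item_id."""
    return similarity[item_id].drop(item_id).sort_values(ascending=False).head(n)

print(most_similar(101))
```

No ratings are involved, which is why this kind of model can still score a product that has no reviews.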