## Collaborative Filtering Recommendation System

Contents of this notebook 
1. Modeling 
   - Memory-Based Methods: KNNBasic, KNNMeans, KNNBasline 
   - Matrix Factorization: SVD
2. Building Top 5 movie recommendation system
3. Evaluation of our recommendation system  

In this notebook, we use Collaborative Filtering to make recommendations to movie users. The technique is based on the idea that users similar to me can be used to predict how much I will like a particular product those users have used but I have not.

We explored the k-Nearest Neighbour (kNN) models and Singular Value Decomposition (SVD) model, using the surprise library. For model selection, we use a root mean squared error (RMSE) as validation metric. 


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Import from Surprise 
from surprise.prediction_algorithms import knns
from surprise.similarities import cosine, msd, pearson
from surprise import accuracy
from surprise import Reader, Dataset
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline, SVD
import warnings

In [2]:
# read master data
master = pd.read_csv('Data/master.csv')

# Drop unnecessary columns
master_1 = master.loc[:, ['movieId', 'userId', 'rating']]
master_1.drop_duplicates(inplace=True)

# 1. Modeling 


## Memory-Based Methods: KNNBasic, KNNMeans, KNNBasline


Setting up environment to use surprise library

In [3]:
# read new_df as Surprise dataset 
# specify the range of rating 0.5-5 (defalt setting is 1-5)
reader = Reader(rating_scale =(0.5, 5) ) 
df = Dataset.load_from_df(master_1,reader)

In [4]:
#Train test split with test size of 20% 
trainset, testset = train_test_split(df, test_size=0.2)

In [5]:
# report how many users and items we have in our dataset
dataset = df.build_full_trainset()

print('Number of users: ',dataset.n_users)
print('Number of items: ',dataset.n_items)

Number of users:  9724
Number of items:  610


In [6]:
# the range of ratings 
print('Min rating:', master_1.rating.min())
print('Max rating:', master_1.rating.max())

Min rating: 0.5
Max rating: 5.0


We run three different types of KNN models.

kNNBasic: A basic collaborative filtering algorithm. It find similarities between items based on user ratings, and use that information to estimate unkown ratings
kNNWithMeans: This is the similar algorithm as the basic KNN model, except it takes into account the mean rating of each item. 
kNNWithBasline: This takes into account a baseline rating. This adds biases for items and users.

For each model, we run a cross_validate method to find the one with the highest root mean squared error (rsme) score. Model tuning results are in 'Data/model_tuning.csv' file. 

We tune hyperparameters for each model. We focuses on k factor and the similarity measures (Cosine similarity and Pearson similarity). 

In [7]:
# Two similarity options 

sim_cos = {'name':'cosine', 'user_based':False}
sim_pearson = {'name':'pearson', 'user_based':False}

sim_options = [sim_cos, sim_pearson]

# Ks 
list_of_ks = [10,20,40]

In [8]:
# csv files to store results 
import csv

with open('Data/model_tuning.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['model', 'similarity_metrics', 'k', 'RMSE'])
    

### KNNBasic

In [9]:
# Hyperparameter Tuning
# KNNBasic 
for sim in sim_options:

    for k in list_of_ks:
        
        print(
            'Calculating sim_option = ' + str(sim['name']) + \
            ' and k = ' + str(k) + ':' )        
        algo = KNNBasic(k = k, sim_options = sim)
        results = cross_validate(algo, df, measures=['RMSE'], cv=3, n_jobs = -1)
        print('RMSE', np.mean(results['test_rmse']))
        
        
        with open('Data/model_tuning.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(['KNNBasic', sim['name'], str(k), str(np.mean(results['test_rmse']))])


Calculating sim_option = cosine and k = 10:
RMSE 0.9953611121476768
Calculating sim_option = cosine and k = 20:
RMSE 0.9832861560871874
Calculating sim_option = cosine and k = 40:
RMSE 0.978382590448254
Calculating sim_option = pearson and k = 10:
RMSE 0.9968983667052758
Calculating sim_option = pearson and k = 20:
RMSE 0.9870800454623551
Calculating sim_option = pearson and k = 40:
RMSE 0.9853645968238225


Insights: For KNN Basic, pick {sim_option = cosine and k = 40} , RMSE 0.9808621186224512

### KNNMean

In [10]:
# Hyperparameter Tuning
# KNNMeans 
for sim in sim_options:

    for k in list_of_ks:
        
        print(
            'Calculating sim_option = ' + str(sim['name']) + \
            ' and k = ' + str(k) + ':' )        
        algo = KNNWithMeans(k = k, sim_options = sim)
        results = cross_validate(algo, df, measures=['RMSE'], cv=3, n_jobs = -1)
        print('RMSE', np.mean(results['test_rmse']))
        
        with open('Data/model_tuning.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(['KNNMeans', sim['name'], str(k), str(np.mean(results['test_rmse']))])


Calculating sim_option = cosine and k = 10:
RMSE 0.9210667524138006
Calculating sim_option = cosine and k = 20:
RMSE 0.9111674259345675
Calculating sim_option = cosine and k = 40:
RMSE 0.9090611050198052
Calculating sim_option = pearson and k = 10:
RMSE 0.9203331761065104
Calculating sim_option = pearson and k = 20:
RMSE 0.9100261035734175
Calculating sim_option = pearson and k = 40:
RMSE 0.9100981112782827


Insights: For KNN Means, pick {sim_option = cosine and k = 40} , RMSE 0.9096555008170871

### KNNBaseline
It takes into account a baseline rating. It adds biases for items and users. 

In [11]:
# Hyperparameter Tuning
# KNNBaseline 
for sim in sim_options:

    for k in list_of_ks:
        
        print(
            'Calculating sim_option = ' + str(sim['name']) + \
            ' and k = ' + str(k) + ':' )        
        algo = KNNBaseline(k = k, sim_options = sim)
        results = cross_validate(algo, df, measures=['RMSE'], cv=3,  n_jobs = -1)
        print('RMSE', np.mean(results['test_rmse']))
        
        with open('Data/model_tuning.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(['KNNBaseline', sim['name'], str(k), str(np.mean(results['test_rmse']))])


Calculating sim_option = cosine and k = 10:
RMSE 0.8951565820106685
Calculating sim_option = cosine and k = 20:
RMSE 0.8881551072028951
Calculating sim_option = cosine and k = 40:
RMSE 0.8856858973794516
Calculating sim_option = pearson and k = 10:
RMSE 0.899678872941417
Calculating sim_option = pearson and k = 20:
RMSE 0.8898088849334201
Calculating sim_option = pearson and k = 40:
RMSE 0.8894272925392827


Insights: For KNN Baseline, pick {sim_option = cosine and k = 40} , RMSE 0.8868368800819563

## Matrix Factorization: Support Vector Decomposition (SVD)

Matrix factorization is widely used for recommender systems. It can deal better with scalability and sparsity than Memory-based model such as KNNs. Matrix factorization can be done by various methods, but we choose SVD. SVD is the base model used by the winner of 2009 Netflix competition. 

In [12]:
# Hyperparameter Tuning for SVD
# Use Gridsearch

param_grid = {'n_factors':[20, 50, 100],'n_epochs': [5, 10, 15], 'lr_all': [0.002, 0.005, 0.01],
               'reg_all': [0.04, 0.06]}
gs_model = GridSearchCV(SVD, param_grid, cv=3, n_jobs = -1, joblib_verbose=3)
gs_model.fit(df)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed:   26.3s
[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed:   50.1s finished


In [13]:
# print out optimal parameters for SVD after GridSearch
print(gs_model.best_score['rmse'])
print(gs_model.best_params['rmse'])


with open('Data/model_tuning.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(['SVD', '-', '-', str(gs_model.best_score['rmse'])])


0.8773558442626149
{'n_factors': 100, 'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.06}


## Model Selection

In [14]:
df = pd.read_csv('Data/model_tuning.csv')
df.sort_values('RMSE')

Unnamed: 0,model,similarity_metrics,k,RMSE
18,SVD,-,-,0.877356
14,KNNBaseline,cosine,40,0.885686
13,KNNBaseline,cosine,20,0.888155
17,KNNBaseline,pearson,40,0.889427
16,KNNBaseline,pearson,20,0.889809
12,KNNBaseline,cosine,10,0.895157
15,KNNBaseline,pearson,10,0.899679
8,KNNMeans,cosine,40,0.909061
10,KNNMeans,pearson,20,0.910026
11,KNNMeans,pearson,40,0.910098


SVD performed the best in terms of RMSE (0.877). Among KNN models, KNN baseline models performs better than other KNNs, yet SVD is slightly better. 

To interpret this result, on average, our SVD model estimates ratings with an error of approximately 0.877. On a scale of 0-5 rating, this 0.877 value is not too bad. For example, a rating of 3 compared to 3.8 is not be a significant different. 

In [15]:
# Our Model for Collaborative Filtering Recommendation 

model = SVD(n_factors=100, n_epochs=15, lr_all=0.01, reg_all=0.06)
model.fit(trainset)
predictions = model.test(testset)


## 2. Building Movie Recommendation 

### Top 5 Movie Recommendations to an Existing Customer

Using our final mode, we build a recommendation system which provide a top five movie recommendations to an existing customer. 
In the cell below, we create a function to derive top 5 recommendations for user_id given metadata of the company. In this recommendation function, we first find the movies was not rated by a user, since we don’t want to recommend a movie they’ve already watched. And then, we predict the score of each movie that a user didn’t rate, and find the top 5. 

In [16]:
# load movieId, title data
movies = pd.read_csv('Data/movies.csv')
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [17]:
# Function to return predicted ratings and movie titles by order 

def pred_rating(user_id, master_data):
    '''
    Step 1 Find the movies that user i didn’t rate 
    
    Step 2 Predict ratings for movies that user didn’t rate  
    
    Step 3 Merge predicted scores with movie titles, and oreder by predicted ratings
    '''
    
    # Find the movies that user i didn’t rate 
    # get a list of all movie titles 
    mids = master_data['movieId'].unique()
    # Get a list of movie title that user i has rated 
    mid_i = master_data.loc[master_data['userId']==user_id, 'movieId']
    # from the list of all movie, remove movieid user i has rated
    mids_to_pred = np.setdiff1d(mids, mid_i)
    
    #Predict ratings for movies that user didn’t rate    
    user_prediction = [model.predict(user_id, mid) for mid in mids_to_pred]    
    # Make a list of predcted ratings 
    pred_rating = [user_prediction[i].est for i in range(len(mids_to_pred))]
    # Create data frame with movie ids and predicted ratings 
    pred_tuples = list(zip(mids_to_pred,pred_rating))
    pred_df = pd.DataFrame(pred_tuples, columns=['movieId','pred_rating'])
    
    # Merge with movies data
    pred_df_title = pred_df.merge(movies, how = 'inner', on='movieId')

    
    return pred_df_title.sort_values('pred_rating', ascending=False)

In [18]:
# Function which returns top10 movie titles with ratings that user i actually rated 
def actual_top10(user_id, master_data):
    
    rated = master.loc[master_data['userId']==user_id, ['movieId', 'rating']]
    rated = rated.merge(movies, how='inner', on='movieId')
    rated = rated.sort_values('rating', ascending=False)[:10]
    return rated

## 3. Evaluation of our recommendation system

Using the recommendation system we created, we provide a recommendation for several existing users, and compare the model's recommendation with a list of each user's top-rated movies. We assess user ids of 50, 100, and 150.  

### Case1: User id = 50
According to UserId=50's rating history, this person X seems to enjoy watching science fiction and psychological films as well as comedies. Most of them are from the 1960s and ‘70s. We also noticed that this person likes to watch international movies. 
Our model recommends dramas and comedies. It seems not far from his profile but it is not 100 % tailared to his interest. 

#### Recommendation for User Id 50

In [19]:
# predicted rating for user 50
user50 = pred_rating(50, master).head(10)

user50


Unnamed: 0,movieId,pred_rating,title,genres
468,543,5.0,So I Married an Axe Murderer (1993),Comedy|Romance|Thriller
10,12,5.0,Dracula: Dead and Loving It (1995),Comedy|Horror
406,475,5.0,In the Name of the Father (1993),Drama
46,53,5.0,Lamerica (1994),Adventure|Drama
103,122,5.0,Boomerang (1992),Comedy|Romance
322,371,5.0,"Paper, The (1994)",Comedy|Drama
139,171,5.0,Jeffrey (1995),Comedy|Drama
234,276,5.0,Milk Money (1994),Comedy|Romance
37,43,5.0,Restoration (1995),Drama
451,523,5.0,Ruby in Paradise (1993),Drama


In [20]:
# This is to save above image for README file 
# pip install dataframe-image

In [21]:
import dataframe_image as dfi
dfi.export(user50, 'Data/dataframe.png')

#### Profile by user id 50

In [22]:
# Check the actual rating by user 50
actual_top10(50, master)

Unnamed: 0,movieId,rating,title,genres
21,924,4.5,2001: A Space Odyssey (1968),Adventure|Drama|Sci-Fi
33,1204,4.5,Lawrence of Arabia (1962),Adventure|Drama|War
35,1208,4.5,Apocalypse Now (1979),Action|Drama|War
40,1251,4.5,8 1/2 (8½) (1963),Drama|Fantasy
121,7327,4.0,Persona (1966),Drama
28,1136,4.0,Monty Python and the Holy Grail (1975),Adventure|Comedy|Fantasy
31,1199,4.0,Brazil (1985),Fantasy|Sci-Fi
32,1201,4.0,"Good, the Bad and the Ugly, The (Buono, il bru...",Action|Adventure|Western
39,1232,4.0,Stalker (1979),Drama|Mystery|Sci-Fi
41,1252,4.0,Chinatown (1974),Crime|Film-Noir|Mystery|Thriller


### Case 2: User id = 100
According to UserId=100's ratings, this person likes American romantic comedy. Our recommendation are mostly romantic comedies and dramas. We would say that our recommendation seems to align with this person’s interests. 


#### Recommendation for User Id 100

In [23]:
# Test the function with user_id 100
pred_rating(100, master).head(10)

Unnamed: 0,movieId,pred_rating,title,genres
42,53,4.255715,Lamerica (1994),Adventure|Drama
33,43,4.253871,Restoration (1995),Drama
446,543,3.989054,So I Married an Axe Murderer (1993),Comedy|Romance|Thriller
131,171,3.970505,Jeffrey (1995),Comedy|Drama
219,276,3.968871,Milk Money (1994),Comedy|Romance
367,452,3.883818,Widows' Peak (1994),Drama
430,523,3.882145,Ruby in Paradise (1993),Drama
387,475,3.843787,In the Name of the Father (1993),Drama
306,371,3.828992,"Paper, The (1994)",Comedy|Drama
83,106,3.823182,Nobody Loves Me (Keiner liebt mich) (1994),Comedy|Drama


#### Profile of User id 100

In [24]:
# Check the actual rating by user 100
actual_top10(100, master)

Unnamed: 0,movieId,rating,title,genres
86,1958,5.0,Terms of Endearment (1983),Comedy|Drama
101,2423,5.0,Christmas Vacation (National Lampoon's Christm...,Comedy
137,5620,5.0,Sweet Home Alabama (2002),Comedy|Romance
55,1101,5.0,Top Gun (1986),Action|Romance
125,4041,5.0,"Officer and a Gentleman, An (1982)",Drama|Romance
70,1307,4.5,When Harry Met Sally... (1989),Comedy|Romance
84,1912,4.5,Out of Sight (1998),Comedy|Crime|Drama|Romance|Thriller
82,1777,4.5,"Wedding Singer, The (1998)",Comedy|Romance
81,1680,4.5,Sliding Doors (1998),Drama|Romance
80,1678,4.5,"Joy Luck Club, The (1993)",Drama|Romance


### Case 3: User id = 150
UserId=150's reveals a higher preference for science fiction and action. But our model recommends mainly dramas and romantic comedies. Our recommendation seems very off from his/her interest. 
User 150 rated only 26 movies. Given the fact that the minimum number of ratings per person is 20, we can consider Y as a relatively new user with less available information. This may explain that the model does not perform well in this case.

#### Recommendation for user id 150

In [25]:
# Test the function with user_id 150
pred_rating(150, master).head(10)

Unnamed: 0,movieId,pred_rating,title,genres
225,276,4.935817,Milk Money (1994),Comedy|Romance
461,543,4.856407,So I Married an Axe Murderer (1993),Comedy|Romance|Thriller
40,53,4.834277,Lamerica (1994),Adventure|Drama
32,43,4.829269,Restoration (1995),Drama
379,452,4.758502,Widows' Peak (1994),Drama
7,12,4.73898,Dracula: Dead and Loving It (1995),Comedy|Horror
316,371,4.723426,"Paper, The (1994)",Comedy|Drama
399,475,4.661899,In the Name of the Father (1993),Drama
282,337,4.660026,What's Eating Gilbert Grape (1993),Drama
114,154,4.655888,Beauty of the Day (Belle de jour) (1967),Drama


#### Profile of User id 150

In [26]:
# Check the actual rating by user 150
actual_top10(150, master)

Unnamed: 0,movieId,rating,title,genres
25,1356,5.0,Star Trek: First Contact (1996),Action|Adventure|Sci-Fi|Thriller
5,32,5.0,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
12,141,5.0,"Birdcage, The (1996)",Comedy
17,648,4.0,Mission: Impossible (1996),Action|Adventure|Mystery|Thriller
2,6,4.0,Heat (1995),Action|Crime|Thriller
4,25,4.0,Leaving Las Vegas (1995),Drama|Romance
6,36,4.0,Dead Man Walking (1995),Crime|Drama
7,52,4.0,Mighty Aphrodite (1995),Comedy|Drama|Romance
23,805,4.0,"Time to Kill, A (1996)",Drama|Thriller
20,780,4.0,Independence Day (a.k.a. ID4) (1996),Action|Adventure|Sci-Fi|Thriller


We found that our prediction model did not seem to be far from their preferences. But, for a real evaluation, we need to experiment with real people with A/B testing. 