## Collaborative Filtering Recommendation System

Contents of this notebook 
1. Modeling 
   - Memory-Based Methods: KNNBasic, KNNWithMeans, KNNBaseline 
   - Matrix Factorization: SVD
2. Building a top 5 movie recommendation system
3. Evaluation of our recommendation system  

In this notebook, we use collaborative filtering to recommend movies to users. The technique rests on the idea that the ratings of users similar to me can be used to predict how much I will like a product that those users have rated but I have not.

We look at recommender systems based on k-Nearest Neighbours (kNN) models and a Singular Value Decomposition (SVD) model, using the Surprise library. For model selection, we use root mean squared error (RMSE) as the validation metric. 
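
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below is a plain-Python illustration only: the function name `rmse` and the toy ratings are made up for this example, and the notebook itself relies on Surprise to compute the metric.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy example (made-up ratings, not taken from the dataset)
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # squared errors 0.25, 0, 1, 1 -> sqrt(2.25 / 4) = 0.75
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.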


In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


from surprise.prediction_algorithms import knns
from surprise.similarities import cosine, msd, pearson
from surprise import accuracy
from surprise import Reader, Dataset
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline, SVD, NMF
import warnings

In [8]:
# read master data
master = pd.read_csv('Data/master.csv')

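# Keep only the columns needed for modeling
# NOTE: Surprise's Dataset.load_from_df reads columns in the order (user, item, rating),
# so with this ordering movieId is treated as the user and userId as the item.
master_1 = master.loc[:, ['movieId', 'userId', 'rating']].drop_duplicates()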

# 1. Modeling 


## Memory-Based Methods: KNNBasic, KNNWithMeans, KNNBaseline


Check what percentage of user-movie pairs have a rating. 
Insights: The majority of entries in the user-item matrix are empty. With so many missing entries, even similar items can end up far apart under the distance measures a KNN model uses.
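
One way to make this check concrete is to compute the density of the rating matrix, meaning the share of (user, item) cells that actually hold a rating. The sketch below uses plain Python on made-up triples; the helper name `rating_density` and the toy data are assumptions for illustration, not part of the notebook.

```python
def rating_density(ratings):
    """Fraction of (user, item) cells that hold a rating.

    `ratings` is an iterable of (user_id, item_id, rating) triples.
    """
    pairs = {(u, i) for u, i, _ in ratings}
    if not pairs:
        return 0.0
    users = {u for u, _ in pairs}
    items = {i for _, i in pairs}
    return len(pairs) / (len(users) * len(items))

# Toy data: 3 users, 3 items, 4 observed ratings
toy = [(1, 'a', 4.0), (1, 'b', 3.5), (2, 'a', 5.0), (3, 'c', 2.0)]
density = rating_density(toy)
print(f"{density:.1%} of the user-item matrix is filled")
```

On the real data, the same ratio could be obtained by dividing the number of rows in `master_1` by the product of the number of distinct `userId` and `movieId` values, assuming each user-movie pair appears at most once.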

Setting up the environment to use the Surprise library

In [9]:
# Load master_1 as a Surprise dataset
# Specify the rating range 0.5-5 (the default setting is 1-5)
reader = Reader(rating_scale=(0.5, 5))
df = Dataset.load_from_df(master_1, reader)

In [14]:
#Train test split with test size of 20% 
trainset, testset = train_test_split(df, test_size=0.2)

In [13]:
# report how many users and items we have in our dataset
dataset = df.build_full_trainset()

print('Number of users: ',dataset.n_users)
print('Number of items: ',dataset.n_items)

Number of users:  9724
Number of items:  610


In [11]:
# the range of ratings 
print('Min rating:', master_1.rating.min())
print('Max rating:', master_1.rating.max())

Min rating: 0.5
Max rating: 5.0


Three different types of KNN models:

- KNNBasic: A basic collaborative filtering algorithm. It finds similarities between items based on user ratings, and uses that information to estimate unknown ratings.
- KNNWithMeans: The same algorithm as the basic KNN model, except that it takes into account the mean rating of each item.
- KNNBaseline: This takes into account a baseline rating, which adds biases for items and users.
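
For orientation, the item-based estimators behind these three models can be summarised as below, following the notation of the Surprise documentation (which remains the authoritative reference). Here $N^k_u(i)$ is the set of at most $k$ items rated by user $u$ that are most similar to item $i$, $\mu_i$ is the mean rating of item $i$, and $b_{ui} = \mu + b_u + b_i$ is the baseline estimate built from the global mean and the user and item biases.

KNNBasic:

$$\hat{r}_{ui} = \frac{\sum_{j \in N^k_u(i)} \mathrm{sim}(i, j)\, r_{uj}}{\sum_{j \in N^k_u(i)} \mathrm{sim}(i, j)}$$

KNNWithMeans:

$$\hat{r}_{ui} = \mu_i + \frac{\sum_{j \in N^k_u(i)} \mathrm{sim}(i, j)\, (r_{uj} - \mu_j)}{\sum_{j \in N^k_u(i)} \mathrm{sim}(i, j)}$$

KNNBaseline:

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in N^k_u(i)} \mathrm{sim}(i, j)\, (r_{uj} - b_{uj})}{\sum_{j \in N^k_u(i)} \mathrm{sim}(i, j)}$$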

For each model, we run the cross_validate method to find the configuration with the lowest root mean squared error (RMSE). Model tuning results are stored in a .csv file. 
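
As a rough picture of what k-fold cross-validation does, the sketch below splits sample indices into three shuffled folds so that each sample is used for testing exactly once. The helper `kfold_indices` is a hypothetical stand-in written for this illustration; Surprise handles the splitting internally.

```python
import random

def kfold_indices(n_samples, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) index lists for shuffled k-fold splitting."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds of near-equal size
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

splits = list(kfold_indices(10, n_splits=3))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

The model is fit on each training part and scored on the matching test part; the per-fold RMSE values are then averaged, which is what `np.mean(results['test_rmse'])` does in the cells that follow.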

We tune hyperparameters for each model, focusing on the number of neighbours k and the similarity measure (cosine similarity or Pearson similarity). 
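
To show how the two similarity measures differ, here is a minimal plain-Python sketch that compares two movies using only the users who rated both, which is similar in spirit to how Surprise restricts the computation to common users. The function names and toy ratings are assumptions for illustration; the library's own implementations have further options not shown here.

```python
import math

def common_ratings(a, b):
    """Paired ratings for the keys that both rating dicts share."""
    shared = sorted(set(a) & set(b))
    return [a[k] for k in shared], [b[k] for k in shared]

def cosine_sim(a, b):
    """Cosine of the angle between the two co-rated rating vectors."""
    x, y = common_ratings(a, b)
    denom = math.sqrt(sum(v * v for v in x)) * math.sqrt(sum(v * v for v in y))
    return sum(p * q for p, q in zip(x, y)) / denom if denom else 0.0

def pearson_sim(a, b):
    """Cosine similarity after subtracting each vector's mean (Pearson correlation)."""
    x, y = common_ratings(a, b)
    if not x:
        return 0.0
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    denom = math.sqrt(sum(v * v for v in dx)) * math.sqrt(sum(v * v for v in dy))
    return sum(p * q for p, q in zip(dx, dy)) / denom if denom else 0.0

# Ratings two movies received, keyed by (made-up) user id
movie_a = {1: 5.0, 2: 3.0, 3: 4.0}
movie_b = {1: 4.0, 2: 2.0, 3: 3.0, 4: 5.0}
print(cosine_sim(movie_a, movie_b), pearson_sim(movie_a, movie_b))
```

In this toy case the second movie is rated exactly one point lower by every common user, so the Pearson similarity is 1 (up to floating-point rounding) while the cosine similarity is slightly below 1.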

In [22]:
# Two similarity options 

sim_cos = {'name':'cosine', 'user_based':False}
sim_pearson = {'name':'pearson', 'user_based':False}

sim_options = [sim_cos, sim_pearson]

# Ks 
list_of_ks = [10,20,40]

In [23]:
# csv files to store results 
import csv

with open('model_tuning.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['model', 'similarity_metrics', 'k', 'RMSE'])
    

### KNNBasic

In [26]:
# Hyperparameter Tuning
# KNNBasic 
for sim in sim_options:

    for k in list_of_ks:
        
        print(
            'Calculating sim_option = ' + str(sim['name']) + \
            ' and k = ' + str(k) + ':' )        
        algo = KNNBasic(k = k, sim_options = sim)
        results = cross_validate(algo, df, measures=['RMSE'], cv=3, n_jobs = -1)
        print('RMSE', np.mean(results['test_rmse']))
        
        
        with open('model_tuning.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(['KNNBasic', sim['name'], str(k), str(np.mean(results['test_rmse']))])


Calculating sim_option = cosine and k = 10:
RMSE 0.9930213329095913
Calculating sim_option = cosine and k = 20:
RMSE 0.9824993770645793
Calculating sim_option = cosine and k = 40:
RMSE 0.9808621186224512
Calculating sim_option = pearson and k = 10:
RMSE 0.9979983289646622
Calculating sim_option = pearson and k = 20:
RMSE 0.9873806177267733
Calculating sim_option = pearson and k = 40:
RMSE 0.985038505449967


Insights: For KNN Basic, pick {sim_option = cosine and k = 40} , RMSE 0.9808621186224512

### KNNWithMeans

In [27]:
# Hyperparameter Tuning
# KNNMeans 
for sim in sim_options:

    for k in list_of_ks:
        
        print(
            'Calculating sim_option = ' + str(sim['name']) + \
            ' and k = ' + str(k) + ':' )        
        algo = KNNWithMeans(k = k, sim_options = sim)
        results = cross_validate(algo, df, measures=['RMSE'], cv=3, n_jobs = -1)
        print('RMSE', np.mean(results['test_rmse']))
        
        with open('model_tuning.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(['KNNMeans', sim['name'], str(k), str(np.mean(results['test_rmse']))])


Calculating sim_option = cosine and k = 10:
RMSE 0.9204968090322797
Calculating sim_option = cosine and k = 20:
RMSE 0.911619115853037
Calculating sim_option = cosine and k = 40:
RMSE 0.9096555008170871
Calculating sim_option = pearson and k = 10:
RMSE 0.9188883365405717
Calculating sim_option = pearson and k = 20:
RMSE 0.9103220721053741
Calculating sim_option = pearson and k = 40:
RMSE 0.9102061759576726


Insights: For KNN Means, pick {sim_option = cosine and k = 40} , RMSE 0.9096555008170871

### KNNBaseline
KNNBaseline takes into account a baseline rating, which adds biases for items and users. 

In [28]:
# Hyperparameter Tuning
# KNNBaseline 
for sim in sim_options:

    for k in list_of_ks:
        
        print(
            'Calculating sim_option = ' + str(sim['name']) + \
            ' and k = ' + str(k) + ':' )        
        algo = KNNBaseline(k = k, sim_options = sim)
        results = cross_validate(algo, df, measures=['RMSE'], cv=3,  n_jobs = -1)
        print('RMSE', np.mean(results['test_rmse']))
        
        with open('model_tuning.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(['KNNBaseline', sim['name'], str(k), str(np.mean(results['test_rmse']))])


Calculating sim_option = cosine and k = 10:
RMSE 0.8957094515923046
Calculating sim_option = cosine and k = 20:
RMSE 0.8891682020394649
Calculating sim_option = cosine and k = 40:
RMSE 0.8871705452094689
Calculating sim_option = pearson and k = 10:
RMSE 0.8988347873169286
Calculating sim_option = pearson and k = 20:
RMSE 0.8909436642603988
Calculating sim_option = pearson and k = 40:
RMSE 0.8883126361400819


Insights: For KNN Baseline, pick {sim_option = cosine and k = 40} , RMSE 0.8871705452094689

## Matrix Factorization: Singular Value Decomposition (SVD)

Matrix factorization is widely used for recommender systems. It deals better with scalability and sparsity than memory-based models such as KNN. Matrix factorization can be done by various methods, but we choose SVD. SVD-based matrix factorization was a core component of the solution that won the Netflix Prize in 2009. 
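
To give a feel for what the tuned hyperparameters control, the sketch below applies stochastic gradient descent to a single rating under the biased matrix factorization model, where the prediction is the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. It is a simplified plain-Python illustration: the names `predict`, `sgd_step`, `lr`, and `reg` are assumptions chosen to mirror `lr_all` and `reg_all`, and Surprise's actual SVD iterates over all ratings in compiled code.

```python
import random

def predict(mu, bu, bi, pu, qi):
    """Global mean + user bias + item bias + dot product of the factor vectors."""
    return mu + bu + bi + sum(p * q for p, q in zip(pu, qi))

def sgd_step(r, mu, bu, bi, pu, qi, lr, reg):
    """One regularized SGD update on a single observed rating r."""
    err = r - predict(mu, bu, bi, pu, qi)
    new_bu = bu + lr * (err - reg * bu)
    new_bi = bi + lr * (err - reg * bi)
    new_pu = [p + lr * (err * q - reg * p) for p, q in zip(pu, qi)]
    new_qi = [q + lr * (err * p - reg * q) for p, q in zip(pu, qi)]
    return new_bu, new_bi, new_pu, new_qi

rng = random.Random(0)
n_factors = 3                      # length of each latent factor vector
mu, bu, bi = 3.5, 0.0, 0.0         # made-up global mean, biases start at zero
pu = [rng.gauss(0, 0.1) for _ in range(n_factors)]
qi = [rng.gauss(0, 0.1) for _ in range(n_factors)]

rating = 5.0
before = abs(rating - predict(mu, bu, bi, pu, qi))
for _ in range(15):                # plays the role of n_epochs for this one rating
    bu, bi, pu, qi = sgd_step(rating, mu, bu, bi, pu, qi, lr=0.01, reg=0.06)
after = abs(rating - predict(mu, bu, bi, pu, qi))
print(before, after)               # the absolute error shrinks after the updates
```

A larger learning rate moves the parameters faster per step, while a larger regularization term pulls them back towards zero, which is the trade-off the grid search explores.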

In [29]:
# Hyperparameter Tuning for SVD
# Use Gridsearch

param_grid = {'n_factors':[20, 50, 100],'n_epochs': [5, 10, 15], 'lr_all': [0.002, 0.005, 0.01],
               'reg_all': [0.04, 0.06]}
gs_model = GridSearchCV(SVD, param_grid, cv=3, n_jobs = -1, joblib_verbose=3)
gs_model.fit(df)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed:   22.5s
[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed:   42.0s finished
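
The 162 tasks reported above follow directly from the size of the grid: every combination of parameter values is fit once per cross-validation fold. A small sketch using only the standard library reproduces the count; the variable names `combinations` and `n_fits` are introduced here for illustration.

```python
from itertools import product

param_grid = {
    'n_factors': [20, 50, 100],
    'n_epochs': [5, 10, 15],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.04, 0.06],
}
cv = 3

# Every combination of one value per parameter
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combinations) * cv
print(len(combinations), n_fits)  # 54 combinations, 162 fits
```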


In [30]:
# print out optimal parameters for SVD after GridSearch
print(gs_model.best_score['rmse'])
print(gs_model.best_params['rmse'])


with open('model_tuning.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(['SVD', '-', '-', str(gs_model.best_score['rmse'])])


0.8772167256292999
{'n_factors': 100, 'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.06}


## Model Selection

In [35]:
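# Load the tuning results under a new name so the Surprise dataset `df` is not overwritten
tuning_results = pd.read_csv('model_tuning.csv')
tuning_results.sort_values('RMSE')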

Unnamed: 0,model,similarity_metrics,k,RMSE
24,SVD,-,-,0.877217
20,KNNBaseline,cosine,40,0.887171
23,KNNBaseline,pearson,40,0.888313
19,KNNBaseline,cosine,20,0.889168
22,KNNBaseline,pearson,20,0.890944
18,KNNBaseline,cosine,10,0.895709
21,KNNBaseline,pearson,10,0.898835
14,KNNMeans,cosine,40,0.909656
17,KNNMeans,pearson,40,0.910206
16,KNNMeans,pearson,20,0.910322


SVD performed best in terms of RMSE (0.877). Among the KNN models, the KNNBaseline models perform better than the other KNNs, yet SVD is slightly better. 

To interpret this result: the ratings estimated by our SVD model are typically off by about 0.877. On a rating scale of 0.5 to 5, an error of 0.877 is not too bad. For example, predicting 3.8 for a movie that was actually rated 3 is not a significant difference. 

In [36]:
# Our Model for Collaborative Filtering Recommendation 

model = SVD(n_factors=100, n_epochs=15, lr_all=0.01, reg_all=0.06)
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f9e0cae3640>

# 2. Building a Top 5 Movie Recommendation System 

### Top 5 Movie Recommendations to an Existing Customer

Using our final model, we build a recommendation system which provides the top five movie recommendations for an existing customer. 
In the cell below, we create a function that derives the top 5 recommendations for a given user_id from the master data. In this function, we first find the movies that the user has not rated, since we don't want to recommend a movie they've already watched. We then predict the rating of each movie that the user hasn't rated, and take the top 5. 
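
The selection logic described above can be sketched independently of the model: given predicted scores per movie and the set of movies a user has already rated, keep the unrated ones and take the highest-scoring entries. The helper `top_n_unrated` and the toy scores are hypothetical and only illustrate the two steps.

```python
import heapq

def top_n_unrated(predicted, already_rated, n=5):
    """Return the n highest-scoring (item, score) pairs the user has not rated."""
    candidates = ((item, score) for item, score in predicted.items()
                  if item not in already_rated)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Made-up predicted ratings for five movies; the user has already rated 'C'
predicted = {'A': 4.2, 'B': 3.1, 'C': 4.8, 'D': 2.5, 'E': 4.5}
already_rated = {'C'}
print(top_n_unrated(predicted, already_rated, n=2))  # [('E', 4.5), ('A', 4.2)]
```

Unlike `argpartition`, `heapq.nlargest` returns the selected items sorted from highest to lowest score.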

In [62]:
# Function to print the top 5 predicted movies for a given user

def top5(user_id, master_data):
    
    '''
    Step 1: Find the movies that the user didn't rate.
    
    Step 2: Predict the score of each movie the user didn't rate, and find the top 5.
    '''
    
    # Step 1: find the movies that the user didn't rate
    # Get an array of all movie titles
    mids = master_data['title'].unique()
    # Get the titles that the user has rated
    mid_i = master_data.loc[master_data['userId']==user_id, 'title']
    # From the array of all titles, remove the titles the user has rated
    mids_to_pred = np.setdiff1d(mids, mid_i)  
    
    # Step 2: predict the score of each movie the user didn't rate, and find the top 5
    # Build a test set from the remaining titles; the true rating is unknown,
    # so it is arbitrarily set to 4 (it does not affect the predictions)
    # NOTE: the ids passed to model.test should be the same raw ids the model was
    # trained on; ids unseen in training only get a default estimate.
    testset = [[user_id, mid, 4.] for mid in mids_to_pred]
    # Run the test set through our fitted model to get predicted ratings
    predictions = model.test(testset)
    pred_ratings = np.array([pred.est for pred in predictions])
    
    # Find the indices of the top 5 predicted ratings (not sorted among themselves)
    top_5 = pred_ratings.argpartition(-5)[-5:]
    # Find the corresponding movie titles to recommend
    top_5_title = mids_to_pred[top_5]
    
    print('Top 5 item for user {} is {}'.format(user_id, top_5_title))




In [98]:
# Function which returns the 10 movies (title and rating) that a user rated highest
def actual_top10(user_id, master_data):
    
    # Use the master_data argument consistently instead of the global `master`
    rated = master_data.loc[master_data['userId']==user_id, ['title', 'rating']]
    rated = rated.sort_values('rating', ascending=False)[:10]
    return rated
    

# 3. Evaluation of our recommendation system

Using the recommendation system we created, we provide a recommendation for several existing users, and compare the model's recommendation with a list of each user's top-rated movies. We assess user ids of 50, 100, and 150.  

### Case1: User id = 50
According to UserId=50's history, we can see this person likes science fiction and psychological films from around the world. 
Our recommendations include a French movie, but it is a musical film, alongside several American slasher films. They seem a little far from this user's taste. 
This person has already rated 310 movies, and recommendations are drawn only from movies the person has not rated. So, the most similar movies have probably already been rated by this person.

In [99]:
# Test the function with user_id 50 
top5(50, master)

Top 5 item for user 50 is ['Fried Green Tomatoes (1991)'
 'Friday the 13th Part VI: Jason Lives (1986)'
 'Friday the 13th Part 3: 3D (1982)'
 'Friday the 13th Part IV: The Final Chapter (1984)'
 'À nous la liberté (Freedom for Us) (1931)']


In [100]:
# Check the actual ratings by user 50
actual_top10(50, master)

Unnamed: 0,title,rating
21443,2001: A Space Odyssey (1968),4.5
25969,Lawrence of Arabia (1962),4.5
26200,Apocalypse Now (1979),4.5
28272,8 1/2 (8½) (1963),4.5
75595,Persona (1966),4.0
24445,Monty Python and the Holy Grail (1975),4.0
25644,Brazil (1985),4.0
25831,"Good, the Bad and the Ugly, The (Buono, il bru...",4.0
27587,Stalker (1979),4.0
28291,Chinatown (1974),4.0


### Case 2: User id = 100
According to UserId=100's ratings, this person likes American romantic comedies. Our recommendations seem off: they include science fiction, horror, and some comedy, but no romantic comedy.

In [101]:
# Test the function with user_id 100
top5(100, master)

Top 5 item for user 100 is ['Fraternity Vacation (1985)' 'Frankie and Johnny (1966)'
 'Frankenstein Must Be Destroyed (1969)' 'Frankenstein Unbound (1990)'
 'Woman in Gold (2015)']


In [102]:
# Check the actual rating by user 100
actual_top10(100, master)

Unnamed: 0,title,rating
37674,Terms of Endearment (1983),5.0
44261,Christmas Vacation (National Lampoon's Christm...,5.0
68977,Sweet Home Alabama (2002),5.0
24100,Top Gun (1986),5.0
60795,"Officer and a Gentleman, An (1982)",5.0
30277,When Harry Met Sally... (1989),4.5
36972,Out of Sight (1998),4.5
36098,"Wedding Singer, The (1998)",4.5
34950,Sliding Doors (1998),4.5
34939,"Joy Luck Club, The (1993)",4.5


### Case 3: User id = 150
UserId=150's ratings suggest that this person likes American science fiction, comedy, crime, and action. Our recommendation list includes comedy, science fiction, and a spy thriller. Even though the comedy is animated, and the spy thriller is a classic black-and-white movie, our recommendations for this person seem better than those for UserIds 50 and 100.
This person rated only 26 movies.

In [105]:
# Test the function with user_id 150
top5(150, master)

Top 5 item for user 150 is ['Foreign Correspondent (1940)' 'Forbidden Planet (1956)'
 'For the Love of Benji (1977)' 'Forbidden Games (Jeux interdits) (1952)'
 'Who Framed Roger Rabbit? (1988)']


In [106]:
# Check the actual rating by user 150
actual_top10(150, master)

Unnamed: 0,title,rating
30758,Star Trek: First Contact (1996),5.0
1553,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),5.0
3943,"Birdcage, The (1996)",5.0
17772,Mission: Impossible (1996),4.0
462,Heat (1995),4.0
1339,Leaving Las Vegas (1995),4.0
1842,Dead Man Walking (1995),4.0
2617,Mighty Aphrodite (1995),4.0
19818,"Time to Kill, A (1996)",4.0
19228,Independence Day (a.k.a. ID4) (1996),4.0


In [59]:
# In total, 9719 movies 
len(master['title'].unique())


9719

In [62]:
# number of movies id 50 rated 
len(master.loc[master['userId']==50, 'title'].unique())

310

In [63]:
# number of movies id 100 rated 
len(master.loc[master['userId']==100, 'title'].unique())

148

In [64]:
# number of movies id 150 rated 
len(master.loc[master['userId']==150, 'title'].unique())

26