### Building a Recommender System with Surprise

This try-it focuses on exploring additional algorithms with the `Surprise` library to generate recommendations. Your goal is to identify the best algorithm by minimizing the mean squared error (MSE) under cross-validation. You will also select a dataset from the [grouplens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to grouplens and examine the available datasets. Choose one that is easy to shape into the form `Surprise` expects: user, item, and rating information. Then compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms to build your recommendations. For more information on the algorithms, see the [prediction algorithms package documentation](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation with your peers, including your cross-validation results and a basic description of your dataset.



### Packages

In [1]:
# install surprise package, if not available #
!pip install scikit-surprise



In [2]:
# import packages #
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise import accuracy

import pandas as pd

### Dataset

In [3]:
# load data #
movies = pd.read_csv('movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
# load data #
ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
# load data #
links = pd.read_csv('links.csv')
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [6]:
# load data #
tags = pd.read_csv('tags.csv')
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [7]:
# merge movie + rating #
movie_ratings = pd.merge(ratings, movies, on='movieId').drop(columns=['genres','timestamp'])
movie_ratings.head()

Unnamed: 0,userId,movieId,rating,title
0,1,1,4.0,Toy Story (1995)
1,5,1,4.0,Toy Story (1995)
2,7,1,4.5,Toy Story (1995)
3,15,1,2.5,Toy Story (1995)
4,17,1,4.5,Toy Story (1995)


In [8]:
# sanity check: count the ratings for a single movie in the raw frame #
ratings[ratings['movieId'] == 1].count()

Unnamed: 0,0
userId,215
movieId,215
rating,215
timestamp,215


In [9]:
# sanity check: the merged frame should hold the same count for that movie #
movie_ratings[movie_ratings['movieId'] == 1].count()

Unnamed: 0,0
userId,215
movieId,215
rating,215
title,215


In [10]:
# determine the minimum and maximum rating #
movie_ratings['rating'].min(), movie_ratings['rating'].max()

(0.5, 5.0)

In [11]:
# check whether any null ratings exist #
movie_ratings['rating'].isna().sum()

0

In [12]:
# shapes of the different frames #
movies.shape, ratings.shape, links.shape, tags.shape, movie_ratings.shape

((9742, 3), (100836, 4), (9742, 3), (3683, 4), (100836, 4))

In [13]:
filtered_ratings = movie_ratings[['userId', 'title', 'rating']]
filtered_ratings.head()

Unnamed: 0,userId,title,rating
0,1,Toy Story (1995),4.0
1,5,Toy Story (1995),4.0
2,7,Toy Story (1995),4.5
3,15,Toy Story (1995),2.5
4,17,Toy Story (1995),4.5


In [14]:
# set up the reader and dataset in the Surprise library format #
# rating_scale matches the observed minimum and maximum ratings (0.5 and 5.0) #
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(filtered_ratings, reader)
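The `rating_scale` given to `Reader` is more than bookkeeping: by default, Surprise clips each estimate into that range when predicting (the `clip=True` argument of `predict`). The pure-Python sketch below only illustrates the idea. `clip_to_scale` is a hypothetical helper, not part of the library, and the bounds are the minimum and maximum ratings observed earlier (0.5 and 5.0).

```python
def clip_to_scale(estimate, rating_scale=(0.5, 5.0)):
    """Clamp a raw estimate into the closed interval given by rating_scale.

    Hypothetical helper for illustration; not Surprise's own code.
    """
    lower, upper = rating_scale
    return max(lower, min(upper, estimate))


# made-up raw estimates: one above the scale, one inside, one below
raw_estimates = [5.7, 3.2, 0.2]
clipped = [clip_to_scale(e) for e in raw_estimates]
print(clipped)
```

With a lower bound of 0 instead of 0.5, the last estimate would pass through unchanged, so the choice of scale can affect the reported error slightly.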

In [15]:
# initialize the different algorithms #
svd = SVD()
knn = KNNBasic()
nmf = NMF()
slope = SlopeOne()
cluster = CoClustering()

### SVD

In [16]:
# SVD : 5-fold cross-validation and print results #
svd_dict =  cross_validate(svd, data, measures=["MSE"], cv=5, verbose=True)

Evaluating MSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MSE (testset)     0.7557  0.7695  0.7512  0.7778  0.7541  0.7617  0.0102  
Fit time          3.31    4.56    1.72    1.73    1.76    2.61    1.15    
Test time         0.49    0.34    0.22    0.13    0.28    0.29    0.12    


In [17]:
# print dictionary #
svd_dict

{'test_mse': array([0.75570979, 0.76952973, 0.7511761 , 0.77778486, 0.75414314]),
 'fit_time': (3.305140495300293,
  4.558747053146362,
  1.7192656993865967,
  1.729074239730835,
  1.7563562393188477),
 'test_time': (0.4926271438598633,
  0.3405787944793701,
  0.22276663780212402,
  0.12904810905456543,
  0.2754356861114502)}
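The `Mean` and `Std` columns of the printed table can be recovered from `test_mse`. The sketch below copies the five fold values shown above into a plain list (standing in for the NumPy array) and uses only the standard library. That the population standard deviation reproduces the table's value is an observation from these numbers, not a statement taken from the Surprise documentation.

```python
from statistics import fmean, pstdev

# fold-level MSE values copied from the SVD cross-validation output
fold_mse = [0.75570979, 0.76952973, 0.7511761, 0.77778486, 0.75414314]

mean_mse = fmean(fold_mse)   # average error across the five folds
std_mse = pstdev(fold_mse)   # population standard deviation across folds

print(f"mean={mean_mse:.4f} std={std_mse:.4f}")
```

Rounded to four decimals these give 0.7617 and 0.0102, which agree with the table. A small standard deviation relative to the mean suggests the estimate is stable across folds.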

### KNN

In [18]:
# KNN : 5-fold cross-validation and print results #
knn_dict =  cross_validate(knn, data, measures=["MSE"], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating MSE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MSE (testset)     0.8940  0.8984  0.8968  0.8880  0.9042  0.8963  0.0053  
Fit time          0.11    0.15    0.15    0.14    0.14    0.14    0.02    
Test time         1.38    1.32    2.46    1.63    1.46    1.65    0.42    


In [19]:
# print dictionary #
knn_dict

{'test_mse': array([0.89399069, 0.89839356, 0.89679033, 0.88800496, 0.90419602]),
 'fit_time': (0.10692334175109863,
  0.14599823951721191,
  0.14966773986816406,
  0.13790512084960938,
  0.14449763298034668),
 'test_time': (1.3822698593139648,
  1.3154418468475342,
  2.457341432571411,
  1.634902000427246,
  1.4552977085113525)}

### NMF

In [20]:
# NMF : 5-fold cross-validation and print results #
nmf_dict =  cross_validate(nmf, data, measures=["MSE"], cv=5, verbose=True)

Evaluating MSE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MSE (testset)     0.8618  0.8427  0.8478  0.8537  0.8586  0.8529  0.0070  
Fit time          3.09    3.28    3.72    3.13    3.12    3.27    0.24    
Test time         0.11    0.16    0.25    0.16    0.24    0.18    0.05    


In [21]:
# print dictionary #
nmf_dict

{'test_mse': array([0.86181854, 0.84267274, 0.84782111, 0.85373189, 0.85861388]),
 'fit_time': (3.088061809539795,
  3.279093027114868,
  3.718214750289917,
  3.1262357234954834,
  3.1154286861419678),
 'test_time': (0.1128838062286377,
  0.16257238388061523,
  0.24965310096740723,
  0.1560375690460205,
  0.23847103118896484)}

### SlopeOne

In [22]:
# SlopeOne : 5-fold cross-validation and print results #
slope_dict =  cross_validate(slope, data, measures=["MSE"], cv=5, verbose=True)

Evaluating MSE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MSE (testset)     0.8071  0.8136  0.8252  0.8305  0.7921  0.8137  0.0136  
Fit time          6.37    6.66    5.95    5.49    5.65    6.02    0.44    
Test time         6.78    7.22    7.76    7.77    7.85    7.48    0.41    


In [23]:
# print dictionary #
slope_dict

{'test_mse': array([0.80710033, 0.813585  , 0.82522777, 0.83054115, 0.7921344 ]),
 'fit_time': (6.367084980010986,
  6.658562183380127,
  5.94872522354126,
  5.490952014923096,
  5.6460816860198975),
 'test_time': (6.784650087356567,
  7.216792345046997,
  7.756730794906616,
  7.771860837936401,
  7.849891185760498)}

### CoClustering

In [24]:
# CoClustering : 5-fold cross-validation and print results #
cluster_dict =  cross_validate(cluster, data, measures=["MSE"], cv=5, verbose=True)

Evaluating MSE of algorithm CoClustering on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MSE (testset)     0.9105  0.8963  0.8845  0.8947  0.9025  0.8977  0.0086  
Fit time          2.66    2.86    2.99    2.82    2.65    2.80    0.13    
Test time         0.26    0.11    0.17    0.10    0.10    0.15    0.06    


In [25]:
# print dictionary #
cluster_dict

{'test_mse': array([0.91050289, 0.89634249, 0.88446435, 0.89465223, 0.90247857]),
 'fit_time': (2.6619133949279785,
  2.856048107147217,
  2.9902498722076416,
  2.8151490688323975,
  2.6520185470581055),
 'test_time': (0.25745654106140137,
  0.10999655723571777,
  0.17487716674804688,
  0.0982811450958252,
  0.09820818901062012)}

In [29]:
# summarize the mean MSE, fit time, and test time from the cross-validation runs above #
res_dict = {'model': ['SVD', 'KNNBasic', 'NMF', 'SlopeOne', 'CoClustering'],
           'MSE': [0.7617, 0.8963, 0.8529, 0.8137, 0.8977],
           'Fit Time': [2.61, 0.14, 3.27, 6.02, 2.80],
           'Test Time': [0.29, 1.65, 0.18, 7.48, 0.15]}
results_df = pd.DataFrame(res_dict).set_index('model')

results_df

model,MSE,Fit Time,Test Time
SVD,0.7617,2.61,0.29
KNNBasic,0.8963,0.14,1.65
NMF,0.8529,3.27,0.18
SlopeOne,0.8137,6.02,7.48
CoClustering,0.8977,2.8,0.15
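One way to avoid retyping numbers into `res_dict` is to derive the summary from the dictionaries returned by `cross_validate`. The sketch below uses stand-in dictionaries that hold only the `test_mse` values printed above (as lists rather than NumPy arrays); `summarize` is a hypothetical helper, and the same idea would apply to `fit_time` and `test_time`.

```python
from statistics import fmean

# stand-ins for svd_dict, knn_dict, nmf_dict, slope_dict, and cluster_dict,
# keeping only the per-fold MSE values shown in the outputs above
cv_results = {
    "SVD": {"test_mse": [0.75570979, 0.76952973, 0.7511761, 0.77778486, 0.75414314]},
    "KNNBasic": {"test_mse": [0.89399069, 0.89839356, 0.89679033, 0.88800496, 0.90419602]},
    "NMF": {"test_mse": [0.86181854, 0.84267274, 0.84782111, 0.85373189, 0.85861388]},
    "SlopeOne": {"test_mse": [0.80710033, 0.813585, 0.82522777, 0.83054115, 0.7921344]},
    "CoClustering": {"test_mse": [0.91050289, 0.89634249, 0.88446435, 0.89465223, 0.90247857]},
}


def summarize(results):
    """Map each model name to its mean cross-validated MSE, rounded to 4 decimals."""
    return {name: round(fmean(res["test_mse"]), 4) for name, res in results.items()}


summary = summarize(cv_results)
best_model = min(summary, key=summary.get)  # lowest mean MSE wins

print(summary)
print("lowest mean MSE:", best_model)
```

On these values the lowest mean MSE belongs to SVD, with SlopeOne second. The timing rows of the table show that SlopeOne is also the slowest to fit and test, which may matter when choosing between them.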
