### Building a Recommender System with Surprise

This try-it focuses on exploring additional algorithms with the `Surprise` library to generate recommendations. Your goal is to identify the optimal algorithm by minimizing the mean squared error using cross-validation. You will also select a dataset from the [GroupLens](https://grouplens.org/datasets/movielens/) example datasets.
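The objective above is stated in terms of mean squared error, while the `accuracy.rmse` and `cross_validate(..., measures=['RMSE'])` calls used later in this notebook report its square root. A minimal sketch of both metrics follows; the function names `mse` and `rmse` and the toy ratings are illustrative assumptions, not part of `Surprise`.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between true and predicted ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, expressed in the same units as the ratings."""
    return math.sqrt(mse(actual, predicted))


# Made-up ratings, for illustration only.
actual = [5.0, 3.5, 2.0, 4.0]
predicted = [4.0, 3.5, 3.0, 4.0]

print(f"MSE:  {mse(actual, predicted):.4f}")
print(f"RMSE: {rmse(actual, predicted):.4f}")
```

Because the square root is monotonic, the algorithm with the lowest RMSE is also the one with the lowest MSE, so ranking models by either metric gives the same order.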

To begin, head over to GroupLens and examine the available datasets. Choose one that makes it easy to create the data in the form `Surprise` expects, with user, item, and rating information. Then compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms to build your recommendations. For more information on the algorithms, see the documentation for the algorithm package [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation with your peers, including your cross-validation results and a basic description of your dataset.



In [1]:
%pip install scikit-surprise

Note: you may need to restart the kernel to use updated packages.


In [2]:
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering, accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split, GridSearchCV

import pandas as pd

In [3]:
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.head()

,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [8]:
movies = pd.read_csv('ml-25m/movies.csv')
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [10]:
merged = pd.merge(ratings, movies, on='movieId')
merged.head()

,userId,movieId,rating,timestamp,title,genres
0,1,296,5.0,1147880044,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
1,1,306,3.5,1147868817,Three Colors: Red (Trois couleurs: Rouge) (1994),Drama
2,1,307,5.0,1147868828,Three Colors: Blue (Trois couleurs: Bleu) (1993),Drama
3,1,665,5.0,1147878820,Underground (1995),Comedy|Drama|War
4,1,899,3.5,1147868510,Singin' in the Rain (1952),Comedy|Musical|Romance


In [11]:
# Surprise's load_from_df expects exactly three columns, in the order user, item, rating
df = merged[['userId', 'title', 'rating']]
df.head()

,userId,title,rating
0,1,Pulp Fiction (1994),5.0
1,1,Three Colors: Red (Trois couleurs: Rouge) (1994),3.5
2,1,Three Colors: Blue (Trois couleurs: Bleu) (1993),5.0
3,1,Underground (1995),5.0
4,1,Singin' in the Rain (1952),3.5


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 3 columns):
 #   Column  Dtype  
---  ------  -----  
 0   userId  int64  
 1   title   object 
 2   rating  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 572.2+ MB


In [14]:
# Draw a random 100,000-row sample of the 25M ratings to keep fitting times manageable
sampled_df = df.sample(n=100_000)

In [16]:
# MovieLens ratings run from 0.5 to 5.0 in half-star steps
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df=sampled_df, reader=reader)

In [18]:
data

<surprise.dataset.DatasetAutoFolds at 0x113337a10>

In [20]:
trainset, testset = train_test_split(data, test_size=0.2)

In [22]:
knn = KNNBasic().fit(trainset)
knn

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x10dc763d0>

In [23]:
knn_predictions = knn.test(testset)
knn_predictions

[Prediction(uid=123887, iid='Seventh Son (2014)', r_ui=3.0, est=3.5390125, details={'was_impossible': True, 'reason': 'Not enough neighbors.'}),
 Prediction(uid=121493, iid="Grandma's Boy (2006)", r_ui=3.5, est=3.5390125, details={'was_impossible': True, 'reason': 'Not enough neighbors.'}),
 Prediction(uid=57266, iid='Dolce Vita, La (1960)', r_ui=5.0, est=3.5390125, details={'was_impossible': True, 'reason': 'Not enough neighbors.'}),
 Prediction(uid=130553, iid='NeverEnding Story, The (1984)', r_ui=2.0, est=3.5390125, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=117056, iid='Inglourious Basterds (2009)', r_ui=5.0, est=3.5390125, details={'was_impossible': True, 'reason': 'Not enough neighbors.'}),
 Prediction(uid=102972, iid='Money Talks (1997)', r_ui=3.0, est=3.5390125, details={'was_impossible': True, 'reason': 'Not enough neighbors.'}),
 Prediction(uid=153739, iid='Home for the Holidays (1995)', r_ui=3.0, est=3.5390125, details={'was_

In [24]:
knn_rmse = accuracy.rmse(knn_predictions)
knn_rmse

RMSE: 1.0684


1.068379504463
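Several of the `KNNBasic` predictions printed above carry `'was_impossible': True` and share the same estimate, 3.5390125. When `Surprise` cannot compute a prediction, it falls back to a default estimate, which for these algorithms is the global mean rating of the training set. A minimal sketch for counting such cases follows. The `Prediction` namedtuple is a stand-in that mirrors the fields printed above so the block runs on its own, and `summarize_impossible` is a hypothetical helper name; in the notebook it could be called directly on `knn_predictions`.

```python
from collections import Counter, namedtuple

# Stand-in with the same fields as the Prediction objects printed above
# (an assumption, so that this sketch runs without Surprise installed).
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

# The six complete rows copied from the KNNBasic output above.
sample_predictions = [
    Prediction(123887, "Seventh Son (2014)", 3.0, 3.5390125,
               {"was_impossible": True, "reason": "Not enough neighbors."}),
    Prediction(121493, "Grandma's Boy (2006)", 3.5, 3.5390125,
               {"was_impossible": True, "reason": "Not enough neighbors."}),
    Prediction(57266, "Dolce Vita, La (1960)", 5.0, 3.5390125,
               {"was_impossible": True, "reason": "Not enough neighbors."}),
    Prediction(130553, "NeverEnding Story, The (1984)", 2.0, 3.5390125,
               {"was_impossible": True, "reason": "User and/or item is unknown."}),
    Prediction(117056, "Inglourious Basterds (2009)", 5.0, 3.5390125,
               {"was_impossible": True, "reason": "Not enough neighbors."}),
    Prediction(102972, "Money Talks (1997)", 3.0, 3.5390125,
               {"was_impossible": True, "reason": "Not enough neighbors."}),
]


def summarize_impossible(predictions):
    """Return (fraction flagged impossible, Counter of the stated reasons)."""
    flagged = [p for p in predictions if p.details.get("was_impossible")]
    reasons = Counter(p.details.get("reason", "unknown") for p in flagged)
    fraction = len(flagged) / len(predictions) if predictions else 0.0
    return fraction, reasons


fraction, reasons = summarize_impossible(sample_predictions)
print(f"{fraction:.0%} of these predictions used the fallback estimate")
for reason, count in reasons.most_common():
    print(f"  {count} x {reason}")
```

If a large share of the full test set is flagged this way, the `KNNBasic` RMSE mostly reflects a global-mean baseline rather than neighborhood information, which would suggest the 100,000-row sample is too sparse for this algorithm.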

In [25]:
svd = SVD().fit(trainset)
svd

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x10e1ff150>

In [26]:
svd_predictions = svd.test(testset)
svd_predictions

[Prediction(uid=123887, iid='Seventh Son (2014)', r_ui=3.0, est=3.2081534098363695, details={'was_impossible': False}),
 Prediction(uid=121493, iid="Grandma's Boy (2006)", r_ui=3.5, est=3.334695682223448, details={'was_impossible': False}),
 Prediction(uid=57266, iid='Dolce Vita, La (1960)', r_ui=5.0, est=3.589147021893962, details={'was_impossible': False}),
 Prediction(uid=130553, iid='NeverEnding Story, The (1984)', r_ui=2.0, est=3.577069127007443, details={'was_impossible': False}),
 Prediction(uid=117056, iid='Inglourious Basterds (2009)', r_ui=5.0, est=4.118441490921419, details={'was_impossible': False}),
 Prediction(uid=102972, iid='Money Talks (1997)', r_ui=3.0, est=3.5708816285403153, details={'was_impossible': False}),
 Prediction(uid=153739, iid='Home for the Holidays (1995)', r_ui=3.0, est=3.564960423535459, details={'was_impossible': False}),
 Prediction(uid=82850, iid='xXx (2002)', r_ui=2.5, est=3.247733440687655, details={'was_impossible': False}),
 Prediction(uid=13129

In [27]:
svd_rmse = accuracy.rmse(svd_predictions)
svd_rmse

RMSE: 0.9745


0.9745497378391891

In [28]:
nmf = NMF().fit(trainset)
nmf_predictions = nmf.test(testset)
nmf_rmse = accuracy.rmse(nmf_predictions)
nmf_rmse

RMSE: 1.1558


1.1557958624082751

In [29]:
slope = SlopeOne().fit(trainset)
slope_predictions = slope.test(testset)
slope_rmse = accuracy.rmse(slope_predictions)
slope_rmse

RMSE: 1.1571


1.1571008207321223

In [38]:
cocl = CoClustering().fit(trainset)
cocl_predictions = cocl.test(testset)
cocl_rmse = accuracy.rmse(cocl_predictions)
cocl_rmse

RMSE: 1.1562


1.1561769923024308

In [48]:
results_df = pd.DataFrame({'model': ['KNNBasic', 'SVD', 'NMF', 'SlopeOne', 'CoClustering'],
                          'rmse': [knn_rmse, svd_rmse, nmf_rmse, slope_rmse, cocl_rmse]})
results_df

,model,rmse
0,KNNBasic,1.06838
1,SVD,0.97455
2,NMF,1.155796
3,SlopeOne,1.157101
4,CoClustering,1.156177


In [54]:
knn_cross = cross_validate(knn, data, measures=['RMSE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0696  1.0664  1.0635  1.0721  1.0598  1.0663  0.0044  
Fit time          8.89    8.16    8.11    8.47    8.28    8.38    0.28    
Test time         0.41    0.26    0.27    0.28    0.28    0.30    0.06    


In [56]:
svd_cross = cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9818  0.9742  0.9652  0.9654  0.9721  0.9717  0.0062  
Fit time          0.45    0.43    0.45    0.45    0.47    0.45    0.01    
Test time         0.04    0.03    0.03    0.03    0.03    0.03    0.00    
