## Collaborative Filtering
 I will now use the ratings data to build a simple collaborative filtering model.
 As mentioned in the methodology, there are two primary approaches to collaborative filtering:
 matrix factorisation and k-nearest neighbours. I will test the implementations of both provided by Python's
 Surprise library.
 This notebook covers the matrix factorisation algorithm.

In [1]:
from IPython import get_ipython


In [2]:
get_ipython().run_line_magic('matplotlib', 'inline')
import pandas as pd
import surprise
import warnings; warnings.simplefilter('ignore')
pd.options.display.max_columns = None

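# Note: `evaluate` belongs to the older Surprise API used in this notebook;
# later releases provide surprise.model_selection.cross_validate instead.
from surprise import Reader, Dataset, SVD, evaluate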

movie_names = '/datasets/movies.csv'
small_ratings_dataset = '/datasets/ratings_small.csv'

In [3]:
ratings = pd.read_csv(small_ratings_dataset)
movies = pd.read_csv(movie_names)
movies.head()

ratings_db = pd.merge(ratings, movies, on='movieId')
ratings_db.head()


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,31,2.5,1260759144,Dangerous Minds (1995),Drama
1,7,31,3.0,851868750,Dangerous Minds (1995),Drama
2,31,31,4.0,1273541953,Dangerous Minds (1995),Drama
3,32,31,4.0,834828440,Dangerous Minds (1995),Drama
4,36,31,3.0,847057202,Dangerous Minds (1995),Drama


 I start by importing the relevant data into DataFrames, including one that holds the movie titles.
 Next I will set up, train and test the SVD matrix factorisation algorithm, passing in the
 range of available ratings (0.5 to 5).
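
 To make the idea behind SVD-style matrix factorisation concrete, here is a minimal sketch in plain Python. It is not Surprise's implementation: the toy ratings, the function names (`train_mf`, `predict`, `rmse`) and the hyperparameter values are all assumptions chosen for illustration. Each rating is modelled as a global mean plus a user bias, an item bias and the dot product of user and item factor vectors, and the parameters are learned by stochastic gradient descent.

```python
import math
import random

# Toy (user, item, rating) triples; hypothetical data, not the MovieLens ratings.
toy_ratings = [
    ("u1", "m1", 4.0), ("u1", "m2", 3.0), ("u2", "m1", 5.0),
    ("u2", "m3", 1.5), ("u3", "m2", 2.5), ("u3", "m3", 2.0),
    ("u4", "m1", 4.5), ("u4", "m3", 1.0),
]

def train_mf(ratings, n_factors=2, n_epochs=50, lr=0.01, reg=0.02, seed=0):
    """Learn biases and latent factors with regularised SGD (illustrative values)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - (mu + b_u[u] + b_i[i] + dot)
            # Move each parameter along its gradient, shrinking it by `reg`.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, b_u, b_i, p, q

def predict(model, u, i):
    """Predicted rating: global mean + user bias + item bias + factor dot product."""
    mu, b_u, b_i, p, q = model
    return mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

def rmse(model, ratings):
    """Root mean squared error of the model on the given ratings."""
    squared = [(r - predict(model, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))

untrained = train_mf(toy_ratings, n_epochs=0)
trained = train_mf(toy_ratings, n_epochs=50)
print(rmse(untrained, toy_ratings), rmse(trained, toy_ratings))
```

 The training error should drop after fitting, since the untrained model predicts roughly the global mean for every pair.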

In [4]:
# Instantiate the reader module to parse the data
reader_module = Reader(rating_scale=(0.5, 5))

In [5]:
# Set up the evaluation data via the Dataset module, passing in a DataFrame and a reader.
evaluation_data = surprise.Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader_module)
# Split the dataset into folds for cross-validation; the default is 5 folds, but this can be adjusted.
# Note: split() belongs to the older Surprise API used in this notebook.
evaluation_data.split()
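
 The fold splitting above can be pictured with a small sketch. This is a generic k-fold split in plain Python, not Surprise's own code; the helper name `k_fold_indices` and the sample count are assumptions for illustration. Each fold is used once as the test set while the remaining folds form the training set.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal groups.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train, test in splits:
    print(sorted(test), len(train))
```

 Every rating ends up in exactly one test fold, which is why the evaluation reports one RMSE and one MAE per fold.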


In [6]:
algo = SVD()
evaluate(algo, evaluation_data)


Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8913
MAE:  0.6867
------------
Fold 2
RMSE: 0.9050
MAE:  0.6972
------------
Fold 3
RMSE: 0.8879
MAE:  0.6842
------------
Fold 4
RMSE: 0.8991
MAE:  0.6952
------------
Fold 5
RMSE: 0.8985
MAE:  0.6908
------------
------------
Mean RMSE: 0.8964
Mean MAE : 0.6908
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.891253846371895,
                             0.9050290563692823,
                             0.8878616244111119,
                             0.8990792349893307,
                             0.8985487927197726],
                            'mae': [0.6866620764217769,
                             0.6971677853143906,
                             0.6841533110460419,
                             0.6952035352166576,
                             0.6908341458488241]})
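
 For reference, the two metrics reported above can be sketched as follows. The helper functions `rmse` and `mae` and the two-element toy lists are assumptions for illustration; the per-fold values are copied from the output above, and averaging them reproduces the reported means.

```python
import math
from statistics import mean

def rmse(true, predicted):
    """Root mean squared error: penalises large errors more heavily."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(true, predicted)))

def mae(true, predicted):
    """Mean absolute error: the average size of the error."""
    return mean(abs(t - p) for t, p in zip(true, predicted))

# Toy check with hypothetical ratings: errors of 0 and 2.
print(rmse([3.0, 4.0], [3.0, 2.0]), mae([3.0, 4.0], [3.0, 2.0]))

# Per-fold scores copied from the evaluation output.
fold_rmse = [0.891253846371895, 0.9050290563692823, 0.8878616244111119,
             0.8990792349893307, 0.8985487927197726]
fold_mae = [0.6866620764217769, 0.6971677853143906, 0.6841533110460419,
            0.6952035352166576, 0.6908341458488241]

mean_rmse = mean(fold_rmse)
mean_mae = mean(fold_mae)
print(f"Mean RMSE: {mean_rmse:.4f}")  # Mean RMSE: 0.8964
print(f"Mean MAE:  {mean_mae:.4f}")   # Mean MAE:  0.6908
```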

 A mean RMSE (root mean squared error) below 0.9 should be low enough for our use case.
 I now need to build the full training set and train the model on it.

In [7]:
trainset = evaluation_data.build_full_trainset()


In [8]:
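# Fit the model on the full training set (train() is the older Surprise API; later releases use fit()).
algo.train(trainset)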


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x117b17908>

In [9]:
# I can now use this to predict the rating for a given user and given film.


In [10]:
algo.predict(200, 3020, 3)


Prediction(uid=200, iid=3020, r_ui=3, est=3.401649006667686, details={'was_impossible': False})

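 As you can see, the 'est' value is the predicted rating for the given user (uid) and movie (iid).
 The third argument (r_ui) is stored in the result for reference and does not influence the estimate.
 The prediction can be tested against known ratings to see how well the model is doing.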

In [11]:
algo.predict(1, 31, 3)


Prediction(uid=1, iid=31, r_ui=3, est=2.3955318645581896, details={'was_impossible': False})

 The above gives an estimate of about 2.40 for a rating I know to be 2.5 (the first row of the merged table above), which is quite close.
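
 As a quick sanity check, the error on this single prediction can be compared with the average error from cross-validation. The values below are copied from the outputs in this notebook; the variable names are only illustrative. Note that the model was trained on the full dataset, so this rating was seen during training and the comparison is only a rough indication.

```python
known_rating = 2.5             # user 1's rating for movie 31 in the merged table
estimate = 2.3955318645581896  # 'est' from the prediction output
mean_mae = 0.6908              # mean MAE reported by the cross-validation run

abs_error = abs(known_rating - estimate)
print(f"Absolute error: {abs_error:.4f}")  # Absolute error: 0.1045
print(abs_error < mean_mae)                # True
```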