## Collaborative Filtering
 Now I will use the ratings data to build a simple collaborative filtering recommender.
 As mentioned in the methodology, there are two primary ways of achieving collaborative filtering:
 matrix factorisation and k-nearest neighbours (K-NN). I will test both methods as provided by Python's
 Surprise library.
 This is the notebook for the K-NN algorithm.
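 To make the K-NN idea concrete before using the library, here is a minimal pure-Python sketch of a user-based prediction: the estimate is the similarity-weighted mean of the ratings given by the k most similar users. The function name `knn_predict` and the sample numbers are hypothetical, and this is only an illustration of the technique, not the library's own code.

```python
def knn_predict(neighbour_ratings, k=40):
    """Estimate a rating as the similarity-weighted mean of the k most similar users.

    neighbour_ratings: list of (similarity, rating) pairs, one per user
    who has rated the target movie.
    """
    # Take the k most similar neighbours, then drop non-positive similarities.
    top_k = sorted(neighbour_ratings, key=lambda pair: pair[0], reverse=True)[:k]
    top_k = [(sim, rating) for sim, rating in top_k if sim > 0]
    if not top_k:
        return None  # no usable neighbours; a fallback value would be needed here

    weighted_sum = sum(sim * rating for sim, rating in top_k)
    total_similarity = sum(sim for sim, _ in top_k)
    return weighted_sum / total_similarity


# Hypothetical (similarity, rating) pairs for one target user and one movie.
neighbours = [(0.9, 4.0), (0.5, 3.0), (0.1, 1.0), (-0.2, 5.0)]
estimate = knn_predict(neighbours, k=2)
print(round(estimate, 4))
```

 With k=2 only the two closest neighbours contribute, so the estimate sits between their ratings of 4.0 and 3.0, pulled towards the more similar user.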

In [1]:
from IPython import get_ipython


In [2]:
get_ipython().run_line_magic('matplotlib', 'inline')
import pandas as pd
import surprise
import warnings; warnings.simplefilter('ignore')
pd.options.display.max_columns = None
# `evaluate` was removed in Surprise 1.1.0, and SVD is not used in this notebook
from surprise import Reader, Dataset

movie_names = '/datasets/movies.csv'
small_ratings_dataset = '/datasets/ratings_small.csv'

ratings = pd.read_csv(small_ratings_dataset)
movies = pd.read_csv(movie_names)
movies.head()

ratings_db = pd.merge(ratings, movies, on='movieId')
ratings_db.head()


,userId,movieId,rating,timestamp,title,genres
0,1,31,2.5,1260759144,Dangerous Minds (1995),Drama
1,7,31,3.0,851868750,Dangerous Minds (1995),Drama
2,31,31,4.0,1273541953,Dangerous Minds (1995),Drama
3,32,31,4.0,834828440,Dangerous Minds (1995),Drama
4,36,31,3.0,847057202,Dangerous Minds (1995),Drama


 I start by loading the ratings and the movie titles into DataFrames and merging them on movieId.
 Next I will set up, train and test the K-NN algorithm, passing in cosine similarity as the
 similarity measure and giving the reader the range of available ratings (0.5 to 5).
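 As a rough illustration of what the cosine measure compares, the sketch below computes it for two users over the movies both have rated. The helper name `cosine_similarity` and the two sample users are hypothetical, and restricting the calculation to co-rated movies is my assumption about a sensible definition rather than a copy of the library's code.

```python
import math


def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity between two users over the movies both have rated.

    ratings_u, ratings_v: dicts mapping movieId -> rating.
    """
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0  # no overlap, so there is nothing to compare

    dot = sum(ratings_u[m] * ratings_v[m] for m in common)
    norm_u = math.sqrt(sum(ratings_u[m] ** 2 for m in common))
    norm_v = math.sqrt(sum(ratings_v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)


# Hypothetical users with opposite preferences on movies 1 to 3.
harsh = {1: 1.0, 2: 2.0, 3: 1.0}
generous = {1: 5.0, 2: 4.0, 3: 5.0, 4: 3.0}
similarity = cosine_similarity(harsh, generous)
print(round(similarity, 4))
```

 Because every rating on a 0.5 to 5 scale is positive, the cosine value can never be negative, so even these two users with opposite tastes come out as fairly similar.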

In [3]:
# Instantiate the reader module to parse the data
reader_module = Reader(rating_scale=(0.5, 5))

In [4]:
# Set up the evaluation data via the Dataset module, passing in a DataFrame and a reader.
evaluation_data = surprise.Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader_module)
# No fold split is needed here: the model is trained on the full trainset below.
# (Dataset.split() was removed in Surprise 1.1.0; surprise.model_selection provides the replacements.)

In [5]:
similarity_options = {
    'name': 'cosine',
    'user_based': True
}

algo = surprise.KNNBasic(sim_options=similarity_options)
trainset = evaluation_data.build_full_trainset()


In [6]:
# train() was renamed to fit() and removed in Surprise 1.1.0
algo.fit(trainset)


Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x10c89c518>

In [7]:
algo.predict(200, 3020, 3)


Prediction(uid=200, iid=3020, r_ui=3, est=3.6199540619808324, details={'actual_k': 31, 'was_impossible': False})

In [8]:
algo.predict(1, 31, 3)


Prediction(uid=1, iid=31, r_ui=3, est=3.1834796860227086, details={'actual_k': 40, 'was_impossible': False})

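 Again checking against a known rating: user 1 actually rated movie 31 as 2.5 (first row of the
 merged table above), so the estimate of 3.1834 is off by about 0.68. The 3 passed as r_ui is only
 echoed back in the Prediction and does not influence est.
 I will switch the similarity measure from cosine to Pearson.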
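 The reason Pearson may behave differently is that it subtracts each user's mean before comparing, so a user who rates everything low and one who rates everything high are judged on their relative preferences. The sketch below is a minimal pure-Python version over co-rated movies; the helper name `pearson_similarity` and the sample users are hypothetical, and taking the means over the co-rated movies only is an assumption of this sketch.

```python
import math


def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation between two users over the movies both have rated.

    ratings_u, ratings_v: dicts mapping movieId -> rating.
    """
    common = sorted(ratings_u.keys() & ratings_v.keys())
    if len(common) < 2:
        return 0.0  # a correlation needs at least two shared movies

    # Assumption: each mean is taken over the co-rated movies only.
    mean_u = sum(ratings_u[m] for m in common) / len(common)
    mean_v = sum(ratings_v[m] for m in common) / len(common)

    covariance = sum((ratings_u[m] - mean_u) * (ratings_v[m] - mean_v) for m in common)
    spread_u = math.sqrt(sum((ratings_u[m] - mean_u) ** 2 for m in common))
    spread_v = math.sqrt(sum((ratings_v[m] - mean_v) ** 2 for m in common))
    if spread_u == 0 or spread_v == 0:
        return 0.0  # a user with identical ratings carries no preference signal
    return covariance / (spread_u * spread_v)


# Hypothetical users with opposite preferences on movies 1 to 3.
harsh = {1: 1.0, 2: 2.0, 3: 1.0}
generous = {1: 5.0, 2: 4.0, 3: 5.0, 4: 3.0}
correlation = pearson_similarity(harsh, generous)
print(round(correlation, 4))
```

 For these two users the correlation is strongly negative, which reflects their opposite tastes in a way that a measure on raw positive ratings cannot.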

In [9]:
reader_module = Reader(rating_scale=(0.5, 5))
evaluation_data = surprise.Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader_module)
# No fold split is needed: the full trainset is built in the next cell.


In [10]:
trainset = evaluation_data.build_full_trainset()


In [11]:
similarity_options = {
    'name': 'pearson',
    'user_based': True
}

algo = surprise.KNNBasic(sim_options=similarity_options)


In [12]:
# train() was renamed to fit() and removed in Surprise 1.1.0
algo.fit(trainset)


Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x120fac2b0>

In [13]:
algo.predict(200, 3020, 3)


Prediction(uid=200, iid=3020, r_ui=3, est=3.8305121839117375, details={'actual_k': 29, 'was_impossible': False})

In [14]:
algo.predict(1, 31, 3)


Prediction(uid=1, iid=31, r_ui=3, est=2.818429937368059, details={'actual_k': 9, 'was_impossible': False})

 As you can see, switching to the Pearson correlation has brought the estimate much closer to
 the actual value of 2.5: it is now 2.8184, compared with 3.1834 under cosine. It is still not as
 close as the SVD matrix factorisation, though.
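 The size of the improvement can be checked with a little arithmetic. The snippet below uses only numbers that appear in this notebook: the stored rating of 2.5 for user 1 on movie 31 and the two estimates. A single rating that was also part of the training data is a thin basis for comparison, so this should be read as a sanity check rather than an evaluation.

```python
# Values copied from this notebook: the stored rating and the two K-NN estimates.
actual = 2.5
estimates = {
    "cosine": 3.1834796860227086,
    "pearson": 2.818429937368059,
}

# Absolute error of each estimate against the stored rating.
errors = {name: abs(est - actual) for name, est in estimates.items()}
for name, err in errors.items():
    print(f"{name}: absolute error = {err:.4f}")
# cosine: absolute error = 0.6835
# pearson: absolute error = 0.3184
```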