# Collaborative filtering

Dataset: MovieLens

- 100k Dataset: https://grouplens.org/datasets/movielens/100k/
- 1m Dataset: https://grouplens.org/datasets/movielens/1m/

Code samples are taken from:

- https://github.com/khanhnamle1994/movielens
- https://github.com/sharmin2697/Movie-Recommender-System


# Import dependencies

In [273]:
import statistics
import time

import pandas as pd
from surprise import Reader, Dataset, KNNWithMeans, accuracy, SVD
from surprise.model_selection import train_test_split


# Load dataset

In [274]:
# The information of the dataset is taken from here: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

# Read the rating file
ratings = pd.read_csv(
    '../data/ml-100k/u.data',
    sep='\t',
    encoding='utf-8',
    names=['user_id', 'item_id', 'rating', 'timestamp'],
    usecols=['user_id', 'item_id', 'rating']
)


# Read the information about the movies
movies = pd.read_csv(
    '../data/ml-100k/u.item',
    sep='|',
    encoding='latin-1',
    names=['movie_id', 'movie_title', 'release_date', 'video_release_date', 'imdb_url', 'unknown',
           'action',
           'adventure', 'animation', 'children', 'comedy', 'crime', 'documentary', 'drama', 'fantasy',
           'film-noir', 'horror', 'musical', 'mystery', 'romance', 'sci-fi', 'thriller', 'war',
           'western'],
    usecols=['movie_id', 'movie_title']
)


# Read the information about the users
users = pd.read_csv(
    '../data/ml-100k/u.user',
    sep='|',
    names=['user_id', 'age', 'gender', 'occupation', 'zip_code'],
    usecols=['user_id', 'age', 'gender', 'zip_code']
)


## Rating dataset

In [275]:
ratings.head()

   user_id  item_id  rating
0      196      242       3
1      186      302       3
2       22      377       1
3      244       51       2
4      166      346       1


In [276]:
# We have 100k ratings
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   user_id  100000 non-null  int64
 1   item_id  100000 non-null  int64
 2   rating   100000 non-null  int64
dtypes: int64(3)
memory usage: 2.3 MB


## Movies dataset

In [277]:
movies.head()

   movie_id        movie_title
0         1   Toy Story (1995)
1         2   GoldenEye (1995)
2         3  Four Rooms (1995)
3         4  Get Shorty (1995)
4         5     Copycat (1995)


In [278]:
# We have 1682 movies.
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   movie_id     1682 non-null   int64 
 1   movie_title  1682 non-null   object
dtypes: int64(1), object(1)
memory usage: 26.4+ KB


## Users dataset

In [279]:
users.head()

   user_id  age gender zip_code
0        1   24      M    85711
1        2   53      F    94043
2        3   23      M    32067
3        4   24      M    43537
4        5   33      F    15213


In [280]:
# We have 943 users.
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   943 non-null    int64 
 1   age       943 non-null    int64 
 2   gender    943 non-null    object
 3   zip_code  943 non-null    object
dtypes: int64(2), object(2)
memory usage: 29.6+ KB
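
The three counts above (100,000 ratings, 943 users, 1,682 movies) show how sparse the user-item matrix is. The following is a minimal sketch of that arithmetic; it assumes that every user and every movie from the metadata files appears in the ratings, which is not checked here.

```python
# Counts reported by ratings.info(), users.info() and movies.info() above
n_ratings = 100_000
n_users = 943
n_movies = 1_682

# Fraction of the user-item matrix that actually holds a rating
density = n_ratings / (n_users * n_movies)
sparsity = 1 - density

print(f"density:  {density:.4f}")
print(f"sparsity: {sparsity:.4f}")
```

Only about 6.3% of all possible user-movie pairs carry a rating, so any similarity between two users or two movies is based on a small overlap.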


# Collaborative filtering

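I measure the effectiveness of each approach with the mean squared error (MSE) between the predicted and the actual rating. A lower MSE means a better prediction.

I measure the efficiency with the execution time of each algorithm, which here covers fitting, predicting on the test set and computing the MSE.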
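
To make the metric concrete, here is a small sketch of how an MSE is computed. The function name `mean_squared_error` and the ratings are made up for illustration; in the notebook itself the value comes from Surprise's `accuracy.mse`.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical ratings, invented for illustration only
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]

mse = mean_squared_error(actual, predicted)
print(mse)  # squared errors 0.25, 0, 1, 1 -> 2.25 / 4 = 0.5625
```

Because the errors are squared, a single prediction that is off by 2 stars weighs as much as four predictions that are off by 1 star.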

## Split the dataset

In [281]:
# Use the built-in functions of the Surprise library to read and load the ratings data
reader = Reader(name="ml-100k")
ratingsData = Dataset.load_from_df(ratings, reader)

splitDatasets = []

# I use train_test_split from surprise.model_selection to split the shuffled ratings 80:20 into a training and a test set.
# The split is repeated 5 times to see how stable the results are, i.e. how much they change between splits.
# Only the first call sets random_state (the numeric part of the matriculation number). In Surprise an integer
# random_state seeds a new generator for that call only, so the remaining four splits use NumPy's global random state:
# they differ from one another, but they are not reproducible across runs unless np.random.seed is called first.

splitDatasets.append(train_test_split(ratingsData, test_size=0.2, random_state=21006))
splitDatasets.append(train_test_split(ratingsData, test_size=0.2))
splitDatasets.append(train_test_split(ratingsData, test_size=0.2))
splitDatasets.append(train_test_split(ratingsData, test_size=0.2))
splitDatasets.append(train_test_split(ratingsData, test_size=0.2))

# We have 5 different splits of the data inside the array
print("Split data:", len(splitDatasets))

Split data: 5
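
The idea of repeating an 80:20 shuffle split can be sketched with the standard library alone. This is not what Surprise does internally; `shuffle_split`, the toy rows and the "one seed per split" scheme are assumptions made for illustration. They show one way to get five splits that are different from each other and repeatable.

```python
import random


def shuffle_split(rows, test_size=0.2, seed=None):
    """Shuffle a copy of rows and cut it into (train, test) parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (user_id, item_id, rating) rows, invented for illustration
rows = [(u, i, 1 + (u + i) % 5) for u in range(1, 11) for i in range(1, 11)]

# One distinct seed per split keeps the five splits different but repeatable
base_seed = 21006
splits = [shuffle_split(rows, test_size=0.2, seed=base_seed + k) for k in range(5)]

for train, test in splits:
    print(len(train), len(test))
```

With Surprise the same effect should be obtainable by passing a different integer `random_state` to each `train_test_split` call.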


## User based collaborative filtering

I start with user-based filtering. I use cosine similarity to measure how similar two users are. Each user is represented by a vector of ratings, and the similarity of users A and B is the cosine of the angle between their two vectors. The closer the vectors point in the same direction, the smaller the angle between them and the larger the cosine.

- https://masongallo.github.io/machine/learning,/python/2016/07/29/cosine-similarity.html
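
The cosine similarity described above can be sketched in a few lines. The function and the three users are made up for illustration, and the sketch assumes that all users rated the same four movies; Surprise, as far as its documentation states, only takes the commonly rated items of each pair into account.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical ratings of three users on the same four movies
alice = [5, 4, 1, 1]
bob = [4, 5, 1, 2]
carol = [1, 1, 5, 4]

print(cosine_similarity(alice, bob))    # similar taste, value close to 1
print(cosine_similarity(alice, carol))  # opposite taste, clearly smaller value
```

Since ratings are never negative, the cosine stays between 0 and 1 here, and even users with opposite tastes get a similarity above 0.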


In [282]:
datasetResults = []
executionTimes = []
filteringOptions = {'name': 'cosine', 'user_based': True}

for (train, test) in splitDatasets:
    start = time.time()

    algorithm = KNNWithMeans(sim_options=filteringOptions)

    algorithm.fit(train)
    testResult = algorithm.test(test)

    # For evaluation of effectiveness, we will utilise MSE (mean squared error) between the predicted score and the actual score known in the test set.
    meanSquaredError = accuracy.mse(testResult)
    datasetResults.append(meanSquaredError)

    end = time.time()

    executionTimes.append(end - start)

print()
print("Mean execution time of the application: ", statistics.mean(executionTimes), "s")
print("Mean mse: ", statistics.mean(datasetResults))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8861
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9197
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9018
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9208
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9187

Mean execution time of the application:  7.782719993591309 s
Mean mse:  0.9094160525466587


I computed the MSE for each of the 5 splits. The values are close to each other, ranging from 0.8861 to 0.9208. User-based filtering runs faster than item-based filtering, but its prediction accuracy is lower: its mean MSE is 0.909, against 0.886 for item-based filtering.

The algorithm is quite fast: the mean execution time of the 5 runs is about 7.8 seconds.
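
How stable the results are can be quantified with the spread of the five MSE values. The sketch below reuses the values printed by the user-based run; using the sample standard deviation and the range as stability measures is my own choice.

```python
import statistics

# MSE values printed by the user-based run above
user_based_mse = [0.8861, 0.9197, 0.9018, 0.9208, 0.9187]

mean_mse = statistics.mean(user_based_mse)
spread = statistics.stdev(user_based_mse)  # sample standard deviation
value_range = max(user_based_mse) - min(user_based_mse)

print(f"mean:  {mean_mse:.4f}")
print(f"stdev: {spread:.4f}")
print(f"range: {value_range:.4f}")
```

The same three lines can be applied to the MSE lists of the other algorithms to compare their stability.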

## Item based collaborative filtering

The setup for item-based filtering is exactly the same as for user-based filtering, except that the `user_based` option is set to `False`, so similarities are computed between items instead of users.
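
What the switch changes can be illustrated on a tiny rating matrix: user-based filtering compares rows (one vector per user), item-based filtering compares columns (one vector per movie). The matrix and the `transpose` helper are made up for illustration.

```python
# Hypothetical ratings: one row per user, one column per movie
ratings_by_user = {
    "u1": [5, 4, 1, 3],
    "u2": [4, 5, 2, 3],
    "u3": [1, 2, 5, 4],
}


def transpose(rows):
    """Turn a list of user rows into a list of item columns."""
    return [list(column) for column in zip(*rows)]


user_vectors = list(ratings_by_user.values())  # compared in user-based filtering
item_vectors = transpose(user_vectors)         # compared in item-based filtering

print(len(user_vectors), "user vectors of length", len(user_vectors[0]))
print(len(item_vectors), "item vectors of length", len(item_vectors[0]))
```

In ML-100k there are more movies (1,682) than users (943), so the item-item similarity matrix is larger than the user-user one, which is a plausible reason for the longer runtime of the item-based variant.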

In [283]:
datasetResults = []
executionTimes = []
filteringOptions = {'name': 'cosine', 'user_based': False}

for (train, test) in splitDatasets:
    start = time.time()

    algorithm = KNNWithMeans(sim_options=filteringOptions)

    algorithm.fit(train)
    testResult = algorithm.test(test)

    # For evaluation of effectiveness, we will utilise MSE (mean squared error) between the predicted score and the actual score known in the test set.
    meanSquaredError = accuracy.mse(testResult)
    datasetResults.append(meanSquaredError)

    end = time.time()

    executionTimes.append(end - start)

print()
print("Mean execution time of the application: ", statistics.mean(executionTimes), "s")
print("Mean mse: ", statistics.mean(datasetResults))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8634
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8953
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8782
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8957
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8970

Mean execution time of the application:  9.424880838394165 s
Mean mse:  0.8859330859929713


The mean MSE of the 5 runs is 0.886. The results are very similar across runs, ranging from 0.8634 to 0.8970. On every split the MSE is lower than that of user-based filtering.

The algorithm is a bit slower than user-based filtering (a mean of 9.4 seconds against 7.8 seconds), but its predictions are more accurate.

# Model based approach

I have chosen SVD, a matrix factorisation algorithm provided by Surprise's `SVD` class, for the model-based approach.
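
As a rough sketch of what this model does: a rating is estimated as the global mean plus a user bias, an item bias and the dot product of a user and an item factor vector, and the parameters are learned with stochastic gradient descent on a regularised squared error. The code below is a simplified illustration, not Surprise's implementation. All values are invented, and the learning rate and regularisation defaults are, to my knowledge, the ones documented for Surprise's `SVD`.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Estimated rating: global mean + user bias + item bias + factor dot product."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One stochastic gradient step on a single observed rating r_ui."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


# Hypothetical values, invented for illustration
mu = 3.5                 # global mean rating
b_u, b_i = 0.1, -0.2     # user and item bias
p_u = [0.3, -0.1]        # user factors (2 factors instead of Surprise's default 100)
q_i = [0.2, 0.4]         # item factors
r_ui = 4.0               # observed rating

before = predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)
print(before, after)  # the estimate moves slightly towards the observed rating
```

Training repeats such steps over all training ratings for several epochs, which fits with SVD having the longest runtime of the three approaches here.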

In [284]:
datasetResults = []
executionTimes = []

for (train, test) in splitDatasets:
    start = time.time()

    algorithm = SVD(random_state=21006)

    algorithm.fit(train)
    testResult = algorithm.test(test)

    # For evaluation of effectiveness, we will utilise MSE (mean squared error) between the predicted score and the actual score known in the test set.
    meanSquaredError = accuracy.mse(testResult)
    datasetResults.append(meanSquaredError)

    end = time.time()

    executionTimes.append(end - start)

print()
print("Mean execution time of the application: ", statistics.mean(executionTimes), "s")
print("Mean mse: ", statistics.mean(datasetResults))

MSE: 0.8598
MSE: 0.8830
MSE: 0.8661
MSE: 0.8886
MSE: 0.8847

Mean execution time of the application:  10.984603548049927 s
Mean mse:  0.8764409034388727


The mean MSE of the 5 runs is 0.876. All MSE values are very similar, ranging from 0.8598 to 0.8886. SVD has the lowest MSE of the three algorithms on every split. Its results are close to those of item-based collaborative filtering (mean MSE 0.886) and clearly better than those of user-based filtering (mean MSE 0.909).

SVD is the slowest of the three algorithms, with a mean execution time of about 11.0 seconds.
