# Collaborative filtering

Dataset: MovieLens

- 100k Dataset: https://grouplens.org/datasets/movielens/100k/
- 1m Dataset: https://grouplens.org/datasets/movielens/1m/

Code samples are taken from:

- https://github.com/khanhnamle1994/movielens
- https://github.com/sharmin2697/Movie-Recommender-System


# Import dependencies

In [11]:
import statistics

import pandas as pd
from surprise import Reader, Dataset, KNNWithMeans, accuracy, SVD
from surprise.model_selection import train_test_split
import time


# Load dataset

In [12]:
# Dataset description: https://files.grouplens.org/datasets/movielens/ml-1m-README.txt

# The multi-character separator '::' is handled by pandas' Python parsing engine;
# setting engine="python" explicitly avoids a ParserWarning.

# Read the rating file
ratings = pd.read_csv(
    '../data/ml-1m/ratings.dat',
    sep='::',
    engine="python",
    encoding='utf-8',
    names=['user_id', 'item_id', 'rating', 'timestamp'],
    usecols=['user_id', 'item_id', 'rating']
)


# Read the information about the movies (ml-1m field order: MovieID::Title::Genres)
movies = pd.read_csv(
    '../data/ml-1m/movies.dat',
    sep='::',
    encoding='latin-1',
    engine="python",
    names=['movie_id', 'movie_title', 'genres'],
    usecols=['movie_id', 'movie_title']
)


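# Field order in the ml-1m users.dat file: UserID::Gender::Age::Occupation::Zip-code.
# Note: the recorded outputs in the "Users dataset" section still show the 'age' and 'gender' labels swapped.
users = pd.read_csv(
    '../data/ml-1m/users.dat',
    sep='::',
    engine="python",
    names=['user_id', 'gender', 'age', 'occupation', 'zip_code'],
    usecols=['user_id', 'gender', 'age', 'zip_code']
)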


## Rating dataset

In [13]:
ratings.head()

Unnamed: 0,user_id,item_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


In [14]:
# We have 1m ratings
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype
---  ------   --------------    -----
 0   user_id  1000209 non-null  int64
 1   item_id  1000209 non-null  int64
 2   rating   1000209 non-null  int64
dtypes: int64(3)
memory usage: 22.9 MB


## Movies dataset

In [15]:
movies.head()

Unnamed: 0,movie_id,movie_title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [16]:
# We have 3883 movies.
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   movie_id     3883 non-null   int64 
 1   movie_title  3883 non-null   object
dtypes: int64(1), object(1)
memory usage: 60.8+ KB


## Users dataset

In [17]:
users.head()

Unnamed: 0,user_id,age,gender,zip_code
0,1,F,1,48067
1,2,M,56,70072
2,3,M,25,55117
3,4,M,45,2460
4,5,M,25,55455


In [18]:
# We have 6040 users.
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   6040 non-null   int64 
 1   age       6040 non-null   object
 2   gender    6040 non-null   int64 
 3   zip_code  6040 non-null   object
dtypes: int64(2), object(2)
memory usage: 188.9+ KB


# Collaborative filtering

I measure the effectiveness of the approaches with the MSE (mean squared error) between the predicted and the actual rating.

I measure the efficiency by the execution time of the algorithms.
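As a minimal sketch of the metric (this is not Surprise's implementation), the MSE averages the squared differences between actual and predicted ratings. The function name `mean_squared_error` and the sample ratings are made up for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical ratings on the 1-5 scale, not taken from MovieLens.
actual_ratings = [5, 3, 4, 1]
predicted_ratings = [4.5, 3.0, 3.0, 2.0]

example_mse = mean_squared_error(actual_ratings, predicted_ratings)
print(example_mse)
```

Because the errors are squared, a single prediction that is off by 2 stars weighs as much as four predictions that are off by 1 star.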

## Split the dataset

In [19]:
# Use the built-in functions of the Surprise library to read and load the ratings data
reader = Reader(name="ml-1m")
ratingsData = Dataset.load_from_df(ratings, reader)

splitDatasets = []

# Surprise's train_test_split (not scikit-learn's) splits the data into 80% training and 20% test data.
# The split is repeated 5 times with different initialisations to see how stable the results are.
# An integer random_state creates a separate generator for that call only, so seeding the first call
# does not make the later calls reproducible. Each split therefore gets its own seed, derived from
# the base seed 21006 (the numeric part of the matriculation number).
baseSeed = 21006

for offset in range(5):
    splitDatasets.append(train_test_split(ratingsData, test_size=0.2, random_state=baseSeed + offset))

# We have 5 different splits of the data inside the array
print("Split data:", len(splitDatasets))

Split data: 5
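The idea behind a seeded 80:20 split can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not Surprise's implementation; `shuffle_split` and the toy data are hypothetical.

```python
import random


def shuffle_split(rows, test_size=0.2, seed=None):
    """Shuffle the rows with a seeded generator and cut off the test share."""
    rng = random.Random(seed)          # private generator, independent of global state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)


toy_ratings = list(range(10))          # stand-in for rating rows

train_a, test_a = shuffle_split(toy_ratings, seed=21006)
train_b, test_b = shuffle_split(toy_ratings, seed=21006)
train_c, test_c = shuffle_split(toy_ratings, seed=21007)

print(len(train_a), len(test_a))       # sizes of the training and test parts
print(test_a == test_b)                # same seed gives the same split
```

Using a different seed per repetition keeps every split reproducible while still giving a different partition each time.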


## User based collaborative filtering

I start with user-based filtering. I use cosine similarity to measure how similar two users are. The similarity is the cosine of the angle between the two rating vectors A and B: the closer the vectors point in the same direction, the smaller the angle between them and the larger the cosine.

- https://masongallo.github.io/machine/learning,/python/2016/07/29/cosine-similarity.html
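The cosine similarity described above can be sketched in a few lines. This sketch assumes both vectors hold ratings for the same movies (Surprise only considers commonly rated items); the function name and the three users are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector counts as "no similarity"
    return dot / (norm_a * norm_b)


# Hypothetical ratings of three users for the same four movies.
user_a = [5, 3, 4, 4]
user_b = [4, 2, 4, 3]   # similar taste to user_a
user_c = [1, 5, 2, 1]   # different taste

sim_ab = cosine_similarity(user_a, user_b)
sim_ac = cosine_similarity(user_a, user_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Since ratings are never negative, the cosine stays between 0 and 1 here, and the pair with the more similar rating pattern gets the larger value.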


In [20]:
datasetResults = []
executionTimes = []
filteringOptions = {'name': 'cosine', 'user_based': True}

for (train, test) in splitDatasets:
    start = time.time()

    algorithm = KNNWithMeans(sim_options=filteringOptions)

    algorithm.fit(train)
    testResult = algorithm.test(test)

    # For evaluation of effectiveness, we will utilise MSE (mean squared error) between the predicted score and the actual score known in the test set.
    meanSquaredError = accuracy.mse(testResult)
    datasetResults.append(meanSquaredError)

    end = time.time()

    executionTimes.append(end - start)

print()
print("Mean execution time of the application: ", statistics.mean(executionTimes), "s")
print("Mean mse: ", statistics.mean(datasetResults))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8861
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8812
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8829
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8828
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8806

Mean execution time of the application:  283.14395027160646 s
Mean mse:  0.8827390229020983


I computed the MSE for each of the 5 splits. The values differ only slightly between the runs (0.8806 to 0.8861). In contrast to the small MovieLens dataset, the user-based approach is much slower than the item-based approach here. The MSE values are a little smaller on the large dataset than on the small dataset.
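How close the per-split values are can be quantified with the standard deviation and the range. The sketch below uses the five MSE values printed in the output above; the variable names are arbitrary.

```python
import statistics

# Per-split MSE values copied from the user-based output above.
user_based_mse = [0.8861, 0.8812, 0.8829, 0.8828, 0.8806]

mean_mse = statistics.mean(user_based_mse)
std_mse = statistics.stdev(user_based_mse)            # sample standard deviation
spread = max(user_based_mse) - min(user_based_mse)    # largest minus smallest value

print(f"mean={mean_mse:.4f}  stdev={std_mse:.4f}  range={spread:.4f}")
```

A standard deviation that is small relative to the mean supports the statement that the result barely depends on the particular split.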


The MSE is higher than for item-based filtering (0.883 vs. 0.798), so the user-based approach is less accurate here.
The algorithm is quite slow: the execution time is almost 5 minutes (about 283 s) per split.

## Item based collaborative filtering

The setup for item-based filtering is exactly the same as for user-based filtering, except that the `user_based` option is set to `False`.
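Because the evaluation cells differ only in the algorithm, the timing loop could be factored into a helper. The following is a sketch under assumptions: `evaluate`, `MeanPredictor`, and the toy data are hypothetical stand-ins so that the block runs without Surprise installed.

```python
import statistics
import time


def evaluate(make_algorithm, splits, metric):
    """Fit and score a fresh algorithm on every (train, test) split.

    Returns (mean score, mean wall-clock seconds per split).
    """
    scores, durations = [], []
    for train, test in splits:
        start = time.perf_counter()
        algorithm = make_algorithm()
        algorithm.fit(train)
        scores.append(metric(algorithm.test(test)))
        durations.append(time.perf_counter() - start)
    return statistics.mean(scores), statistics.mean(durations)


class MeanPredictor:
    """Toy stand-in with the same fit/test shape as a Surprise algorithm."""

    def fit(self, train):
        self.mean = statistics.mean(train)
        return self

    def test(self, test):
        # Pairs of (actual rating, predicted rating).
        return [(actual, self.mean) for actual in test]


def toy_mse(pairs):
    return statistics.mean((actual - predicted) ** 2 for actual, predicted in pairs)


toy_splits = [([4, 2], [3, 5]), ([3, 3], [3, 3])]
mean_score, mean_seconds = evaluate(MeanPredictor, toy_splits, toy_mse)
print(mean_score)
```

With the objects from this notebook, the call would look like `evaluate(lambda: KNNWithMeans(sim_options=filteringOptions), splitDatasets, accuracy.mse)`.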

In [21]:
datasetResults = []
executionTimes = []
filteringOptions = {'name': 'cosine', 'user_based': False}

for (train, test) in splitDatasets:
    start = time.time()

    algorithm = KNNWithMeans(sim_options=filteringOptions)

    algorithm.fit(train)
    testResult = algorithm.test(test)

    # For evaluation of effectiveness, we will utilise MSE (mean squared error) between the predicted score and the actual score known in the test set.
    meanSquaredError = accuracy.mse(testResult)
    datasetResults.append(meanSquaredError)

    end = time.time()

    executionTimes.append(end - start)

print()
print("Mean execution time of the application: ", statistics.mean(executionTimes), "s")
print("Mean mse: ", statistics.mean(datasetResults))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8005
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7971
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7988
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7979
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7981

Mean execution time of the application:  130.33999376296998 s
Mean mse:  0.7984933586389489


The mean MSE of the 5 runs is about 0.80 (0.7985). The results are very similar for every run. Overall, the MSE is lower than for user-based filtering (0.883), so the predictions are more accurate. The MSE is also much lower than on the small MovieLens dataset. As on the small dataset, item-based filtering is more accurate than user-based filtering here.

On the large dataset, the item-based approach is much faster than the user-based approach (about 130 s vs. 283 s per split). On the small dataset it is the other way round: the user-based approach is faster than the item-based approach.
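One plausible contributor to this difference (an assumption, not something measured in this notebook) is the number of pairwise similarities that have to be computed: user-based filtering compares users with users, item-based filtering compares items with items. The ml-1m counts come from the `info()` outputs above; the ml-100k counts (943 users, 1682 movies) are taken from the ml-100k README. The item count is an upper bound, since not every movie necessarily has ratings.

```python
def pair_count(n):
    """Number of distinct pairs among n users or items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


datasets = {
    "ml-100k": {"users": 943, "items": 1682},
    "ml-1m": {"users": 6040, "items": 3883},
}

for name, counts in datasets.items():
    user_pairs = pair_count(counts["users"])
    item_pairs = pair_count(counts["items"])
    print(f"{name}: user pairs={user_pairs:,}  item pairs={item_pairs:,}")
```

On ml-100k there are fewer user pairs than item pairs, on ml-1m it is the reverse, which matches the direction of the observed runtimes.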

# Model based approach

I have chosen SVD for the model-based approach.
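As a rough sketch of what such a model predicts: Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the learned user and item factor vectors. The helper names and all numbers below are made up for illustration; a trained model learns the biases and factors from the training set.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Biased matrix-factorisation estimate: mu + b_u + b_i + q_i . p_u."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction


def clip(value, low=1.0, high=5.0):
    """Keep an estimate inside the rating scale (1 to 5 for MovieLens)."""
    return max(low, min(high, value))


# Hypothetical parameters with two latent factors.
estimate = predict_rating(
    global_mean=3.5,
    user_bias=0.25,      # this user rates slightly above average
    item_bias=-0.5,      # this movie is rated below average
    user_factors=[0.5, -1.0],
    item_factors=[1.0, 0.25],
)
print(clip(estimate))
```

Unlike the neighbourhood methods, prediction needs no similarity matrix: only the learned biases and factor vectors of the user and the item are involved.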

In [22]:
datasetResults = []
executionTimes = []

for (train, test) in splitDatasets:
    start = time.time()

    algorithm = SVD(random_state=21006)

    algorithm.fit(train)
    testResult = algorithm.test(test)

    # For evaluation of effectiveness, we will utilise MSE (mean squared error) between the predicted score and the actual score known in the test set.
    meanSquaredError = accuracy.mse(testResult)
    datasetResults.append(meanSquaredError)

    end = time.time()

    executionTimes.append(end - start)

print()
print("Mean execution time of the application: ", statistics.mean(executionTimes), "s")
print("Mean mse: ", statistics.mean(datasetResults))

MSE: 0.7658
MSE: 0.7629
MSE: 0.7616
MSE: 0.7627
MSE: 0.7602

Mean execution time of the application:  89.04703769683837 s
Mean mse:  0.7626362289335148


The mean MSE of the 5 runs is 0.76. The MSE values are very similar across the five runs. Overall, SVD has a lower MSE than the other two algorithms, so its prediction accuracy is the highest. As on the small dataset, it performs better than the two other algorithms.

As on the small dataset, it also has the shortest runtime (about 89 s per split).