# Data Science - Assignment 4 - Collaborative Filtering

Thomas Br√ºndl

se21m032

Dataset: <b>MovieLens</b>

1. 100k Dataset: https://grouplens.org/datasets/movielens/100k/
2. 1M Dataset:  https://grouplens.org/datasets/movielens/1m/



## Approach

In this exercise I investigated the `Effectiveness` and `Efficiency` of six collaborative filtering techniques:

1. User based filtering (memory CF)
2. Item based filtering (memory CF)
3. SVD (model CF)
4. KNNBasic (model CF)
5. CoClustering (model CF)
6. KNNBaseline (model CF)


## Measurement

The Effectiveness of a collaborative filtering approach was measured by terms of MSE (mean squared error) between the predicted score and the actual score known in the test set.

The Efficiency of a collaborative filtering approach was measured by runtime (i.e. how long the program took to finish). 

## Data

The data consists of a small data set with 100K samples and a large data set with 1M samples.
The samples can be described by four attributes:

1. userId
2. movieId
3. rating
4. timestamp

## Disclaimer

Some code snippets where taken from the repository of `sharmin2697`.

https://github.com/sharmin2697/Movie-Recommender-System



## Evaluation


User-based-CF is faster (more efficient) than item-based-CF when performing the test with the 100K and 1M dataset.
However, item-based-CF produces less MSE (more effective) than the user-based-CF.

SVD seems to outperform all other approaches by far in terms of effectiveness and efficiency.

The MSE could be decreased in all approaches by using the large data set with 1M instead of the small data set of 100K.

<br>

<table>
<tr>
<th></th>
<th colspan="6">100K Dataset</th>

 <tr>
<tr>
<th></th>
<th colspan="2">Memory Based</th>
<th colspan="4">Model Based</th>
 <tr>
  <tr>
  <th></th>
    <th>user_based_CF</th>
    <th>item_based_CF</th>
    <th>SVD</th>
    <th>KNNBasic</th>
    <th>CoClustering</th>
    <th>KNNBaseline</th>
  </tr>

  <tr>
  <td><b>mse</b></td>
    <td>0.903</td>
    <td>0.875</td>
    <td>0.865</td>
    <td>0.907</td>
    <td>0.912</td>
    <td>0.898</td>
  </tr>
  <tr>
  <td><b>time [sec]</b></td>
    <td>24.43</td>
    <td>30.07</td>
    <td>18.929</td>
    <td>33.444</td>
    <td>43.649</td>
    <td>62.218</td>
  </tr>
</table>



<br>
<br>


<table>
<tr>
<th></th>
<th colspan="6">1M Dataset</th>

 <tr>
<tr>
<th></th>
<th colspan="2">Memory Based</th>
<th colspan="4">Model Based</th>
 <tr>
  <tr>
  <th></th>
    <th>user_based_CF</th>
    <th>item_based_CF</th>
    <th>SVD</th>
    <th>KNNBasic</th>
    <th>CoClustering</th>
    <th>KNNBaseline</th>
  </tr>

  <tr>
  <td><b>mse</b></td>
    <td>0.864</td>
    <td>0.813</td>
    <td>0.763</td>
    <td>0.807</td>
    <td>0.817</td>
    <td>0.813</td>
  </tr>
  <tr>
  <td><b>time [sec]</b></td>
    <td>916.553</td>
    <td>1957.769</td>
    <td>171.219</td>
    <td>1221.513</td>
    <td>1317.121</td>
    <td>1957.769</td>
  </tr>
</table>

<br>



## Imports

In [1]:
import pandas as pd
import time
import statistics
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNBaseline
from surprise import SVD
from surprise import CoClustering
from surprise import accuracy
from surprise.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')


## Load Data

In [2]:
# ratings = pd.read_csv('ratings_100K.csv')
ratings = pd.read_csv('ratings_1M.csv')
ratings = ratings.drop(["timestamp"], axis=1)
data = ratings
data.head()

reader = Reader(rating_scale=(1, 5))
rating_dataset = Dataset.load_from_df(data, reader)

## Split Data

In [3]:
rating_datasets = []

rating_datasets.append(train_test_split(rating_dataset, test_size=0.2, random_state=1524401))
rating_datasets.append(train_test_split(rating_dataset, test_size=0.2))
rating_datasets.append(train_test_split(rating_dataset, test_size=0.2))
rating_datasets.append(train_test_split(rating_dataset, test_size=0.2))
rating_datasets.append(train_test_split(rating_dataset, test_size=0.2))


## User-based collaborative filtering

In [7]:
mse_results = []

start_time = time.time()

for (trainset, testset) in rating_datasets:
    algo = KNNWithMeans(100, 1, {'name': 'cosine','user_based': True})

    algo.fit(trainset)
    predictions = algo.test(testset)

    mse = accuracy.mse(predictions)
    mse_results.append(mse)
    
print("Time: " + str(time.time() - start_time))
print("Mean mse: " + str(statistics.mean(mse_results)))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8651
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8634
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8605
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8662
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8658
Time: 941.383880853653
Mean mse: 0.8641993653203587


## Items based collaborative filtering

In [5]:
mse_results = []

start_time = time.time()

for (train_data, test_data) in rating_datasets:
    algo = KNNWithMeans(100, 1, {'name': 'cosine', 'user_based': False})

    algo.fit(train_data)
    predictions = algo.test(test_data)  

    mse = accuracy.mse(predictions)
    mse_results.append(mse)

print("Mean mse: " + str(statistics.mean(mse_results)))
print("Time: " + str(time.time() - start_time))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8028
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7992
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7984
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8015
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8013
Mean mse: 0.8006542850284819
Time: 421.7394802570343


## Model-based approach (SVD, KNNBasic, CoClustering, KNNBaseline)

In [6]:
mse_results = []

start_time = time.time()

for algorithm in [SVD(random_state=1524401), KNNBasic(random_state=1524401), CoClustering(random_state=1524401), KNNBaseline(random_state=1524401)]:

    print(str(algorithm))

    for (train_data, test_data) in rating_datasets:
        algo = algorithm

        algo.fit(train_data)
        predictions = algo.test(test_data)

        mse = accuracy.mse(predictions)
        mse_results.append(mse)
        
    print("Mean mse: " + str(statistics.mean(mse_results)))
    print("Time: " + str(time.time() - start_time))

    print("-----------------------------------------------------")



<surprise.prediction_algorithms.matrix_factorization.SVD object at 0x000002392418A100>
MSE: 0.7648
MSE: 0.7642
MSE: 0.7602
MSE: 0.7633
MSE: 0.7650
Mean mse: 0.7634908023011225
Time: 171.219402551651
-----------------------------------------------------
<surprise.prediction_algorithms.knns.KNNBasic object at 0x000002392418A1F0>
Computing the msd similarity matrix...
Done computing similarity matrix.
MSE: 0.8525
Computing the msd similarity matrix...
Done computing similarity matrix.
MSE: 0.8504
Computing the msd similarity matrix...
Done computing similarity matrix.
MSE: 0.8497
Computing the msd similarity matrix...
Done computing similarity matrix.
MSE: 0.8510
Computing the msd similarity matrix...
Done computing similarity matrix.
MSE: 0.8525
Mean mse: 0.8073498960275686
Time: 1221.5136618614197
-----------------------------------------------------
<surprise.prediction_algorithms.co_clustering.CoClustering object at 0x000002392418A130>
MSE: 0.8382
MSE: 0.8414
MSE: 0.8356
MSE: 0.8392
M