# Welcome to my collaborative filtering recommender Notebook
The aim of this notebook is to implement different collaborative filtering algorithms in order to recommend new items (movies in our case).

Collaborative filtering aims to select the items that a user might like based on
the reactions, choices and tastes of similar users.

Collaborative filtering can be divided into two main steps:
*   Find users similar to a user 'A'
*   Predict the ratings of the items that are not yet rated by 'A'
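
The two steps above can be sketched on a toy example. This is only an illustration, not the implementation used later in the notebook: the rating matrix is made up, similarity is assumed to be cosine similarity over co-rated items, and the prediction is assumed to be a similarity-weighted average of the neighbours' ratings.

```python
import numpy as np

# Toy rating matrix (rows = users, columns = items); 0 means "not rated".
# The values are made up for illustration only.
ratings = np.array([
    [5, 4, 0, 1],   # user 'A' (item 2 is not rated yet)
    [5, 5, 4, 1],   # tastes close to 'A'
    [1, 2, 1, 5],   # tastes far from 'A'
], dtype=float)

def cosine(u, v):
    """Cosine similarity computed on the items both users have rated."""
    both = (u > 0) & (v > 0)
    if not both.any():
        return 0.0
    return float(u[both] @ v[both] / (np.linalg.norm(u[both]) * np.linalg.norm(v[both])))

def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    sims, vals = [], []
    for other in range(ratings.shape[0]):
        if other != user and ratings[other, item] > 0:
            sims.append(cosine(ratings[user], ratings[other]))
            vals.append(ratings[other, item])
    if not sims or sum(sims) == 0:
        return float("nan")
    return float(np.dot(sims, vals) / np.sum(sims))

# Step 1: similarity of 'A' to every other user.
sim_to_A = [cosine(ratings[0], ratings[i]) for i in (1, 2)]
# Step 2: predicted rating of item 2 for 'A'.
pred_A = predict(ratings, user=0, item=2)
print(sim_to_A, pred_A)
```

The second user is more similar to 'A' than the third, so the prediction is pulled towards the second user's rating of item 2.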


## Packages used in this notebook

In [3]:
!pip install numpy
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1619429 sha256=4c582a87434e8c939816762813c908cfd32964510350fd6876cb1fe3cdab67c3
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


## Importing the 'MovieLens 100k' Dataset

The dataset contains 100,000 ratings ranging from 1 to 5, from 943 users on 1682 movies. <br>
Each user has rated at least 20 movies.<br>
It was released in April 1998.

In [4]:
from surprise.model_selection import train_test_split
from surprise import Dataset
import pandas as pd

# Loads the builtin Movielens-100k data
data = Dataset.load_builtin('ml-100k')

# 'test_set' is made of 20% of the ratings.
train_set, test_set = train_test_split(data, test_size=.2)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


## Memory-based Approach

The main idea behind these methods is to find the N users or items most similar to a given one, using a similarity measure such as 'cosine similarity' or 'Euclidean distance'.<br>
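
To make the two measures concrete, here is a minimal sketch in plain Python. The helper names and the rating vectors are hypothetical. Cosine similarity only looks at the direction of the vectors, while Euclidean distance also reacts to their magnitude.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equally long rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical rating vectors of two items, restricted to users who rated both
item_x = [4, 5, 3]
item_y = [2, 2.5, 1.5]  # same direction as item_x, half the magnitude

cos_xy = cosine_similarity(item_x, item_y)    # maximal: the vectors are parallel
dist_xy = euclidean_distance(item_x, item_y)  # non-zero: the magnitudes differ
print(cos_xy, dist_xy)
```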

I used the <font color='blue'>'K Nearest Neighbours'</font> algorithm.

In [5]:
from surprise import KNNWithMeans

# To use item-based cosine similarity
sim_options = {
    "name": "cosine",
    "user_based": False,  # Compute  similarities between items
    "min_support": 5, # The minimum number of common items/users to consider them for similarity
}

# Memory_based_model
knn = KNNWithMeans(sim_options=sim_options)

Training and evaluating the model

In [6]:
from surprise import accuracy

knn.fit(train_set)

# 'test' method estimates all the ratings in the given test_set
knn_pred = knn.test(test_set)

# Evaluation metric: RMSE (root mean squared error, lower is better)
accuracy.rmse(knn_pred)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9410


0.9410064754499461
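
For reference, the RMSE reported above is the square root of the mean squared difference between true and estimated ratings. A minimal sketch on made-up (true rating, estimate) pairs, with a hypothetical helper name:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared = [(true_r - est) ** 2 for true_r, est in pairs]
    return math.sqrt(sum(squared) / len(squared))

# Made-up pairs for illustration only
toy_pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5)]
toy_rmse = rmse(toy_pairs)
print(toy_rmse)
```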

## Matrix Factorization Model
As with any matrix factorization algorithm, the idea here is to approximate a large, sparse rating matrix by a product of smaller, dense matrices.

For example, an (M, N) matrix can be factored into two matrices of shapes (M, K) and (K, N), where K is the number of latent factors. Each latent factor captures a hidden characteristic of the users or the items.

I will use <font color='blue'>'Singular Value Decomposition (SVD) Matrix Factorization'</font>.
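
The factorization idea can be sketched with NumPy on a toy matrix. This is a simplified illustration under stated assumptions, not Surprise's `SVD` implementation (which also learns a global mean and user/item biases): the ratings, the learning rate `lr`, the regularization `reg` and the number of epochs are all made up, and the factors are fitted with plain stochastic gradient descent on the observed ratings only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (M, N) rating matrix; 0 marks a missing rating. Values are illustrative.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
M, N = R.shape
K = 2  # number of latent factors

P = rng.normal(scale=0.1, size=(M, K))  # user factors, shape (M, K)
Q = rng.normal(scale=0.1, size=(K, N))  # item factors, shape (K, N)

observed = [(u, i) for u in range(M) for i in range(N) if R[u, i] > 0]

def rmse(P, Q):
    """RMSE of the reconstruction, measured on the observed ratings only."""
    errors = [R[u, i] - P[u] @ Q[:, i] for u, i in observed]
    return float(np.sqrt(np.mean(np.square(errors))))

rmse_before = rmse(P, Q)
lr, reg = 0.01, 0.02  # assumed hyperparameters
for _ in range(2000):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[:, i]
        p_u = P[u].copy()  # keep the old user factors for the item update
        P[u] += lr * (err * Q[:, i] - reg * P[u])
        Q[:, i] += lr * (err * p_u - reg * Q[:, i])
rmse_after = rmse(P, Q)

R_hat = P @ Q  # dense (M, N) matrix: also fills in the missing ratings
print(rmse_before, rmse_after)
```

The product `P @ Q` has the same shape as the original matrix, so its entries at the positions of missing ratings serve as predictions.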

In [7]:
from surprise import SVD

# Matrix factorization model
svd = SVD(n_epochs=10, lr_all=.005)

In [8]:
from surprise import accuracy

svd.fit(train_set)
svd_pred = svd.test(test_set)

# Evaluation metric: RMSE (root mean squared error, lower is better)
accuracy.rmse(svd_pred)

RMSE: 0.9445


0.9445436383704072

## Fine-tuning the parameters of the model using grid search
`GridSearchCV` performs a grid search with cross-validation: it tries every combination of the given parameter values (2 x 3 x 3 = 18 in our case) and scores each one with cross-validation.

Since cross-validation is built into the implementation, we don't need to split the data into train and test sets ourselves (the function does it in the CV step), so we pass the whole dataset.

In [9]:
from surprise.model_selection import GridSearchCV

# We use the whole dataset
all_dataset = data

param_grid = {
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.004, 0.006],
    "reg_all": [0.2, 0.4, 0.6]
}

# With cv=4, the dataset is split into 4 folds; each fold is used once for testing, after training on the other 3
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=4) 
gs.fit(all_dataset)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

0.9513160132449412
{'n_epochs': 10, 'lr_all': 0.006, 'reg_all': 0.2}
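
To see what the grid search iterates over, the candidate configurations can be enumerated with `itertools.product`. This is only a sketch of the enumeration, not how `GridSearchCV` is implemented internally; the variable names are hypothetical.

```python
from itertools import product

# Same grid as the one passed to GridSearchCV in this notebook
param_grid = {
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.004, 0.006],
    "reg_all": [0.2, 0.4, 0.6],
}

# Cartesian product of all value lists: one dict per candidate configuration
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_combinations = len(combinations)  # 2 * 3 * 3
n_fits = n_combinations * 4         # each candidate is trained once per fold (cv=4)
print(n_combinations, n_fits)
```

Each added parameter value multiplies the number of model fits, which is why large grids quickly become expensive.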


## Further improvements

We can use different metrics in order to evaluate our models.<br>

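* 'Recall' measures the share of good movies that were actually recommended; improving it reduces the number of "good movies that were not chosen".
* 'Precision' measures the share of recommended movies that were actually good; improving it reduces the number of "bad movies that were recommended".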
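
A minimal sketch of how both metrics could be computed per user on the top k recommendations. The function name, the relevance `threshold` of 3.5 and the toy predictions are assumptions made for illustration; a movie counts as "good" when its true rating reaches the threshold, and as "recommended" when it is in the user's top k and its estimated rating reaches the threshold.

```python
from collections import defaultdict

def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return per-user precision@k and recall@k.

    `predictions` is an iterable of
    (user_id, item_id, true_rating, estimated_rating) tuples.
    """
    per_user = defaultdict(list)
    for uid, _iid, true_r, est in predictions:
        per_user[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, user_ratings in per_user.items():
        # Rank the user's items by estimated rating, best first
        user_ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = user_ratings[:k]
        n_relevant = sum(true_r >= threshold for _, true_r in user_ratings)
        n_recommended = sum(est >= threshold for est, _ in top_k)
        n_hits = sum(est >= threshold and true_r >= threshold for est, true_r in top_k)
        precisions[uid] = n_hits / n_recommended if n_recommended else 0.0
        recalls[uid] = n_hits / n_relevant if n_relevant else 0.0
    return precisions, recalls

# Made-up predictions for a single hypothetical user "u1"
toy = [
    ("u1", "m1", 5.0, 4.8),
    ("u1", "m2", 2.0, 4.1),
    ("u1", "m3", 4.0, 3.9),
    ("u1", "m4", 4.5, 3.0),
]
precisions, recalls = precision_recall_at_k(toy, k=3, threshold=3.5)
print(precisions, recalls)
```

To apply it to the predictions computed in this notebook, the tuples could be built with something like `[(p.uid, p.iid, p.r_ui, p.est) for p in knn_pred]`, assuming the usual fields of Surprise's `Prediction` objects.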