- Surprise package documentation: https://surprise.readthedocs.io/en/stable/

# Instructions
For this assignment, you will use the modular pipeline code that we implemented in the previous session (https://github.com/EPITA-RecSys/recsys-ais-20/blob/master/notebooks/AIS_model-pipeline.ipynb).

Complete the following tasks:
### Implement the evaluate and benchmarking_pipeline functions
### Extract the notebook functions into a Python module and use them directly in a new notebook
- Create a folder called recsys in the root directory.
- In this folder, create a Python module pipeline.py.
- Extract the functions of the AIS_model-pipeline.ipynb notebook into this new module (pipeline.py).
- Create a new notebook that uses the extracted benchmarking_pipeline function to do the benchmarking.

### Benchmark the 5 previously used models along with the NMF, SVD and SVD++ Surprise algorithms.

## Bonus (+1 point):
- Generalize the train function to accept the kwargs of any Surprise model, not only those of the KNN model.
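
One possible way to approach the bonus, sketched under assumptions: the helper below forwards arbitrary keyword arguments to whatever algorithm class it receives. The names `build_model` and `DummyAlgo` are hypothetical and exist only so the example runs without Surprise installed; with Surprise, a class such as KNNBasic or SVD would be passed instead.

```python
def build_model(model_class, **model_kwargs):
    """Instantiate any algorithm class with the keyword arguments it accepts."""
    return model_class(**model_kwargs)


class DummyAlgo:
    """Hypothetical stand-in for a Surprise algorithm class (illustration only)."""

    def __init__(self, k=40, min_k=1, sim_options=None):
        self.k = k
        self.min_k = min_k
        self.sim_options = sim_options or {}


# Only the arguments that are passed override the class defaults.
model = build_model(DummyAlgo, k=20, sim_options={"name": "cosine", "user_based": True})
print(model.k, model.min_k, model.sim_options["name"])
```

Because the helper never names a specific parameter, the same call shape works for models whose constructors take entirely different arguments.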

### Note:
If you did not manage to do the benchmarking with bar charts in the 1st assignment, you can do it in this assignment, and it will be taken into account to improve your grade for the 1st assignment.

# Pipeline

# 1. Load data using the load_builtin method

In [1]:
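# Make the parent of the notebook's working directory importable,
# so that the recsys package can be found
import os
import sys

currentdir = os.path.realpath(os.getcwd())  # directory the notebook runs from
parentdir = os.path.dirname(currentdir)     # its parent directory
sys.path.append(parentdir)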

In [5]:
from recsys.pipeline import pipeline
pipelineObj = pipeline()
ratings = pipelineObj.get_ratings(load_from_surprise=True)
ratings

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


<surprise.dataset.DatasetAutoFolds at 0x7ff28f07d550>

In [3]:
!pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
     |████████████████████████████████| 11.8MB 9.7MB/s
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1618275 sha256=26d9555810001140d3fc18bdf43c3f9508df64229be4a985167e9704e1ceb4b0
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [4]:
from pipeline import pipeline

# Splitting data

In [6]:
from surprise.model_selection import train_test_split

trainset, testset = train_test_split(ratings, test_size=.2, random_state=42)


# 2. Model

## 2.1 User based model using cosine similarity

## Training

In [7]:
from surprise.prediction_algorithms.knns import KNNBasic
from surprise.prediction_algorithms.baseline_only import BaselineOnly 
from surprise.prediction_algorithms.matrix_factorization import NMF 
from surprise.prediction_algorithms.matrix_factorization import SVD 
from surprise.prediction_algorithms.matrix_factorization import SVDpp

In [8]:
sim_options_UserBased_Cosine = {'name': 'cosine',
                                'user_based': True  # compute similarities between users
                               }
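
To make the 'cosine' option more concrete, here is a minimal plain-Python sketch of a cosine similarity between two users, computed only over the items both have rated, which is how the Surprise documentation describes its cosine measure. The function name `cosine_common` and the ratings are made up for illustration, and returning 0.0 when there is no overlap is an assumption of this sketch.

```python
from math import sqrt


def cosine_common(ratings_u, ratings_v):
    """Cosine similarity of two users, restricted to the items both have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # no overlap: treated as no similarity (assumption of this sketch)
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


# Made-up ratings (item id -> rating), for illustration only.
alice = {"i1": 5.0, "i2": 3.0, "i3": 4.0}
bob = {"i1": 4.0, "i2": 3.0, "i4": 2.0}
sim = cosine_common(alice, bob)
print(round(sim, 4))
```

With 'user_based' set to False, the same computation would run over pairs of items, comparing the ratings given by the users they have in common.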

In [9]:
UserBased_Cosine_KNNBasic = pipelineObj.set_model_parameters(KNNBasic, sim_options=sim_options_UserBased_Cosine, k=40, min_k=1)

In [10]:
time_UserBased_Cosine_KNNBasic, UserBased_Cosine_KNNBasic = pipelineObj.evaluate_time_and_train(UserBased_Cosine_KNNBasic, trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.
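
The call above returns a training time together with the fitted model. A minimal sketch of how such a helper could be written is shown below; it is not the actual pipeline code. The names `time_and_train` and `SlowDummy` are hypothetical, and the dummy class only mimics the fit() method that Surprise algorithms expose.

```python
import time


def time_and_train(model, trainset):
    """Fit `model` on `trainset` and return (elapsed_seconds, fitted_model)."""
    start = time.perf_counter()
    model.fit(trainset)
    elapsed = time.perf_counter() - start
    return elapsed, model


class SlowDummy:
    """Hypothetical stand-in exposing a fit() method (illustration only)."""

    def fit(self, trainset):
        time.sleep(0.01)  # pretend to train
        self.fitted = True
        return self


elapsed, fitted = time_and_train(SlowDummy(), trainset=None)
print(f"training took {elapsed:.3f}s")
```

time.perf_counter() is used here because it is a monotonic, high-resolution clock intended for measuring short durations.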


## Prediction

In [11]:
UserBased_Cosine_KNNBasic_predictions = pipelineObj.model_prediction(UserBased_Cosine_KNNBasic, testset)
UserBased_Cosine_KNNBasic_predictions[:5]

[Prediction(uid='907', iid='143', r_ui=5.0, est=3.9988806593226958, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='371', iid='210', r_ui=4.0, est=4.024195267809441, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='218', iid='42', r_ui=4.0, est=3.920863971521688, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='829', iid='170', r_ui=4.0, est=4.275899932331823, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='733', iid='277', r_ui=1.0, est=3.37542669272407, details={'actual_k': 40, 'was_impossible': False})]

## Evaluate

In [12]:
rmse_UserBased_Cosine_KNNBasic, mae_UserBased_Cosine_KNNBasic = pipelineObj.evaluate_model_rmse_and_mae(UserBased_Cosine_KNNBasic_predictions)

RMSE: 1.0194
MAE:  0.8038
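
For reference, RMSE and MAE can be computed by hand from pairs of true and estimated ratings. The sketch below uses only the five sample predictions printed in the Prediction step, so its results will not match the values above, which are computed over the whole test set. The helper names `rmse` and `mae` are local to this example.

```python
from math import sqrt

# (true rating r_ui, estimate est) for the five sample predictions shown earlier.
pairs = [
    (5.0, 3.9988806593226958),
    (4.0, 4.024195267809441),
    (4.0, 3.920863971521688),
    (4.0, 4.275899932331823),
    (1.0, 3.37542669272407),
]


def rmse(pairs):
    """Root mean squared error: large errors are penalised more heavily."""
    return sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error: the average size of the error."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


sample_rmse = rmse(pairs)
sample_mae = mae(pairs)
print(f"RMSE on 5 samples: {sample_rmse:.4f}")
print(f"MAE on 5 samples:  {sample_mae:.4f}")
```

RMSE is never smaller than MAE on the same predictions; the gap widens when a few errors are much larger than the rest, as with the last sample here.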


## 2.2 User based model using pearson correlation similarity

## Training

In [13]:
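# 'pearson_baseline' is Surprise's shrunk Pearson correlation, centered on baseline estimates instead of means
sim_options_UserBased_PearsonCorrelation = {'name': 'pearson_baseline',
                                            'user_based': True  # compute similarities between users
                               }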

In [14]:
UserBased_PearsonCorrelation_KNNBasic =  pipelineObj.set_model_parameters(KNNBasic, sim_options=sim_options_UserBased_PearsonCorrelation, k=40, min_k=1)


In [15]:
time_UserBased_PearsonCorrelation_KNNBasic, UserBased_PearsonCorrelation_KNNBasic = pipelineObj.evaluate_time_and_train(UserBased_PearsonCorrelation_KNNBasic, trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


## Prediction

In [16]:
UserBased_PearsonCorrelation_KNNBasic_predictions = pipelineObj.model_prediction(UserBased_PearsonCorrelation_KNNBasic, testset)
UserBased_PearsonCorrelation_KNNBasic_predictions[:5]

[Prediction(uid='907', iid='143', r_ui=5.0, est=4.143168682795533, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='371', iid='210', r_ui=4.0, est=4.167234022563786, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='218', iid='42', r_ui=4.0, est=3.4775085715515153, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='829', iid='170', r_ui=4.0, est=4.190251707736366, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='733', iid='277', r_ui=1.0, est=3.434149140412287, details={'actual_k': 30, 'was_impossible': False})]

## Evaluate

In [17]:
rmse_UserBased_PearsonCorrelation, mae_UserBased_PearsonCorrelation = pipelineObj.evaluate_model_rmse_and_mae(UserBased_PearsonCorrelation_KNNBasic_predictions)

RMSE: 1.0008
MAE:  0.7897


## 2.3 Item based model using cosine similarity

## Training

In [18]:
sim_options_ItemBased_Cosine = {'name': 'cosine',
                                'user_based': False  # compute similarities between items
                               }

In [19]:
ItemBased_Cosine_KNNBasic = pipelineObj.set_model_parameters(KNNBasic, sim_options=sim_options_ItemBased_Cosine, k=40, min_k=1)

In [20]:
time_ItemBased_Cosine_KNNBasic, ItemBased_Cosine_KNNBasic = pipelineObj.evaluate_time_and_train(ItemBased_Cosine_KNNBasic, trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


## Prediction

In [21]:
ItemBased_Cosine_KNNBasic_predictions = pipelineObj.model_prediction(ItemBased_Cosine_KNNBasic, testset)
ItemBased_Cosine_KNNBasic_predictions[:5]

[Prediction(uid='907', iid='143', r_ui=5.0, est=4.674038136709815, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='371', iid='210', r_ui=4.0, est=4.127176061681991, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='218', iid='42', r_ui=4.0, est=3.55351608506343, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='829', iid='170', r_ui=4.0, est=3.549641715257212, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='733', iid='277', r_ui=1.0, est=3.04966392708432, details={'actual_k': 40, 'was_impossible': False})]

## Evaluate

In [22]:
rmse_ItemBased_Cosine_KNNBasic, mae_ItemBased_Cosine_KNNBasic = pipelineObj.evaluate_model_rmse_and_mae(ItemBased_Cosine_KNNBasic_predictions)

RMSE: 1.0264
MAE:  0.8104


## 2.4 Item based model using pearson correlation similarity

## Training 

In [23]:
sim_options_ItemBased_PearsonCorrelation = {'name': 'pearson_baseline',
                                            'user_based': False  # compute similarities between items
                               }

In [24]:
ItemBased_PearsonCorrelation_KNNBasic = pipelineObj.set_model_parameters(KNNBasic, sim_options=sim_options_ItemBased_PearsonCorrelation, k=40, min_k=1)

In [25]:
time_ItemBased_PearsonCorrelation_KNNBasic, ItemBased_PearsonCorrelation_KNNBasic = pipelineObj.evaluate_time_and_train(ItemBased_PearsonCorrelation_KNNBasic, trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


## Prediction 

In [26]:
ItemBased_PearsonCorrelation_KNNBasic_predictions = pipelineObj.model_prediction(ItemBased_PearsonCorrelation_KNNBasic, testset)
ItemBased_PearsonCorrelation_KNNBasic_predictions[:5]

[Prediction(uid='907', iid='143', r_ui=5.0, est=4.811647158658197, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='371', iid='210', r_ui=4.0, est=4.162009221840273, details={'actual_k': 28, 'was_impossible': False}),
 Prediction(uid='218', iid='42', r_ui=4.0, est=3.532333118071294, details={'actual_k': 32, 'was_impossible': False}),
 Prediction(uid='829', iid='170', r_ui=4.0, est=3.7923277092854133, details={'actual_k': 30, 'was_impossible': False}),
 Prediction(uid='733', iid='277', r_ui=1.0, est=3.1900565766185456, details={'actual_k': 40, 'was_impossible': False})]

## Evaluate

In [27]:
rmse_ItemBased_PearsonCorrelation_KNNBasic, mae_ItemBased_PearsonCorrelation_KNNBasic = pipelineObj.evaluate_model_rmse_and_mae(ItemBased_PearsonCorrelation_KNNBasic_predictions)

RMSE: 0.9956
MAE:  0.7815


## 2.5 BaselineOnly model

## Training

In [28]:
print('BaselineOnly model Using SGD')
bsl_options = {'method': 'sgd',
               'learning_rate': .00005,
               }

BaselineOnly model Using SGD


In [29]:
BaselineOnlyModel =  pipelineObj.set_model_parameters(BaselineOnly, bsl_options=bsl_options)
time_BaselineOnlyModel, BaselineOnlyModel = pipelineObj.evaluate_time_and_train(BaselineOnlyModel, trainset)

Estimating biases using sgd...


## Prediction

In [30]:
BaselineOnlyModel_predictions = pipelineObj.model_prediction(BaselineOnlyModel, testset)
BaselineOnlyModel_predictions[:5]

[Prediction(uid='907', iid='143', r_ui=5.0, est=3.669381430925591, details={'was_impossible': False}),
 Prediction(uid='371', iid='210', r_ui=4.0, est=3.649878804830684, details={'was_impossible': False}),
 Prediction(uid='218', iid='42', r_ui=4.0, est=3.5583648631496043, details={'was_impossible': False}),
 Prediction(uid='829', iid='170', r_ui=4.0, est=3.5818164578986074, details={'was_impossible': False}),
 Prediction(uid='733', iid='277', r_ui=1.0, est=3.4841382355265065, details={'was_impossible': False})]

## Evaluate 

In [31]:
rmse_BaselineOnlyModel, mae_BaselineOnlyModel = pipelineObj.evaluate_model_rmse_and_mae(BaselineOnlyModel_predictions)

RMSE: 1.0826
MAE:  0.9042


## 2.6 NMF model

## Training

In [32]:
NMF_model = pipelineObj.set_model_parameters(NMF, n_factors=15, n_epochs=50, biased=False)
time_NMF_model, NMF_model = pipelineObj.evaluate_time_and_train(NMF_model, trainset)

## Prediction

In [33]:
NMF_model_predictions = pipelineObj.model_prediction(NMF_model, testset)
NMF_model_predictions[:5]

[Prediction(uid='907', iid='143', r_ui=5.0, est=4.515897329707489, details={'was_impossible': False}),
 Prediction(uid='371', iid='210', r_ui=4.0, est=3.785705326759703, details={'was_impossible': False}),
 Prediction(uid='218', iid='42', r_ui=4.0, est=3.619634258317797, details={'was_impossible': False}),
 Prediction(uid='829', iid='170', r_ui=4.0, est=4.012293565427321, details={'was_impossible': False}),
 Prediction(uid='733', iid='277', r_ui=1.0, est=3.4734777215902275, details={'was_impossible': False})]

## Evaluate

In [34]:
rmse_NMF_model, mae_NMF_model = pipelineObj.evaluate_model_rmse_and_mae(NMF_model_predictions)

RMSE: 0.9674
MAE:  0.7613


## 2.7 SVD model

## Training

In [35]:
SVD_model = pipelineObj.set_model_parameters(SVD, n_factors=15, n_epochs=50, biased=False, verbose=False)
time_SVD_model, SVD_model = pipelineObj.evaluate_time_and_train(SVD_model, trainset)

## Prediction

In [36]:
SVD_model_predictions = pipelineObj.model_prediction(SVD_model, testset)
SVD_model_predictions[:5]

[Prediction(uid='907', iid='143', r_ui=5.0, est=4.181844015030655, details={'was_impossible': False}),
 Prediction(uid='371', iid='210', r_ui=4.0, est=4.267549778873097, details={'was_impossible': False}),
 Prediction(uid='218', iid='42', r_ui=4.0, est=3.5568687261909595, details={'was_impossible': False}),
 Prediction(uid='829', iid='170', r_ui=4.0, est=3.7726928723469286, details={'was_impossible': False}),
 Prediction(uid='733', iid='277', r_ui=1.0, est=3.0106432922670225, details={'was_impossible': False})]

## Evaluate 

In [37]:
rmse_SVD_model, mae_SVD_model = pipelineObj.evaluate_model_rmse_and_mae(SVD_model_predictions)

RMSE: 0.9563
MAE:  0.7458


## 2.8 SVDpp model

## Training

In [38]:
SVDpp_model = pipelineObj.set_model_parameters(SVDpp, n_factors=15, n_epochs=10, init_mean=0, verbose=False)
time_SVDpp_model, SVDpp_model = pipelineObj.evaluate_time_and_train(SVDpp_model, trainset)

## Prediction

In [39]:
SVDpp_model_predictions = pipelineObj.model_prediction(SVDpp_model, testset)
SVDpp_model_predictions[:5]

[Prediction(uid='907', iid='143', r_ui=5.0, est=4.823985186972309, details={'was_impossible': False}),
 Prediction(uid='371', iid='210', r_ui=4.0, est=4.32088727471396, details={'was_impossible': False}),
 Prediction(uid='218', iid='42', r_ui=4.0, est=3.370758672731467, details={'was_impossible': False}),
 Prediction(uid='829', iid='170', r_ui=4.0, est=3.943285270220849, details={'was_impossible': False}),
 Prediction(uid='733', iid='277', r_ui=1.0, est=3.054790180449969, details={'was_impossible': False})]

## Evaluate

In [40]:
rmse_SVDpp_model, mae_SVDpp_model = pipelineObj.evaluate_model_rmse_and_mae(SVDpp_model_predictions)

RMSE: 0.9274
MAE:  0.7324


# 3. Model Benchmarking

In [41]:
benchmark_metrics = ['User-based CF with cosine similarity', 'User-based CF with pearson correlation similarity', 
                     'Item-based CF with cosine similarity', 'Item-based CF with pearson correlation similarity',
                    'BaselineOnly model','NMF', 'SVD', 'SVDpp']

In [42]:
RMSE_values = [rmse_UserBased_Cosine_KNNBasic, rmse_UserBased_PearsonCorrelation,
              rmse_ItemBased_Cosine_KNNBasic, rmse_ItemBased_PearsonCorrelation_KNNBasic,
               rmse_BaselineOnlyModel, rmse_NMF_model, rmse_SVD_model, rmse_SVDpp_model]


In [43]:
mae_values = [mae_UserBased_Cosine_KNNBasic, mae_UserBased_PearsonCorrelation,
              mae_ItemBased_Cosine_KNNBasic, mae_ItemBased_PearsonCorrelation_KNNBasic,
              mae_BaselineOnlyModel, mae_NMF_model, mae_SVD_model, mae_SVDpp_model]

In [44]:
print(" rmse :\n",RMSE_values)
print("\nMAE :\n",mae_values)

 rmse :
 [1.0193536815834319, 1.0007997542372822, 1.0264295933767333, 0.9956191381781974, 1.0826305578193065, 0.9674461356576417, 0.956315995165607, 0.9273798860854161]

MAE :
 [0.8037993357440609, 0.7897024955421307, 0.8103814466470565, 0.7814939175231924, 0.904156628093278, 0.7612759967304713, 0.7458299283785572, 0.7324392451216534]
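
The raw lists above are hard to compare at a glance. As a minimal sketch, the values printed above can be paired with their model names and ranked by RMSE; the variable names in this block are local to the example and the numbers are copied from the output above.

```python
model_names = [
    "User-based CF with cosine similarity",
    "User-based CF with pearson correlation similarity",
    "Item-based CF with cosine similarity",
    "Item-based CF with pearson correlation similarity",
    "BaselineOnly model", "NMF", "SVD", "SVDpp",
]
rmse_values = [1.0193536815834319, 1.0007997542372822, 1.0264295933767333,
               0.9956191381781974, 1.0826305578193065, 0.9674461356576417,
               0.956315995165607, 0.9273798860854161]
mae_values = [0.8037993357440609, 0.7897024955421307, 0.8103814466470565,
              0.7814939175231924, 0.904156628093278, 0.7612759967304713,
              0.7458299283785572, 0.7324392451216534]

# Rank the models from lowest (best) to highest RMSE.
results = sorted(zip(model_names, rmse_values, mae_values), key=lambda row: row[1])

print(f"{'model':<52}{'RMSE':>8}{'MAE':>8}")
for name, rmse, mae in results:
    print(f"{name:<52}{rmse:>8.4f}{mae:>8.4f}")

best_name = results[0][0]
```

On this split, SVDpp has both the lowest RMSE and the lowest MAE, and the BaselineOnly model has the highest of both.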
