# Recommender systems

## Movielens 100k dataset

### Installazione libreria surprise
![](https://raw.githubusercontent.com/NicolasHug/Surprise/master/logo_black.svg?sanitize=true)
* sito: http://surpriselib.com/
* documentazione: https://surprise.readthedocs.io/en/stable/
* github: https://github.com/NicolasHug/Surprise

In [4]:
#!pip install scikit-surprise

Collecting scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.0


### Caricamento librerie

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from surprise import Reader, Dataset
from surprise import BaselineOnly, KNNBaseline, KNNBasic, KNNWithMeans, SVD, NMF
from surprise.model_selection import train_test_split, GridSearchCV, cross_validate
from surprise.accuracy import rmse

from collections import defaultdict

In [2]:
np.random.seed(42)

### Caricamento dataset valutazioni utenti

>### Caricamento da file

In [3]:
Reader

surprise.reader.Reader

In [4]:
reader = Reader(line_format="user item rating timestamp", sep=",", skip_lines=1)

In [5]:
data = Dataset.load_from_file(file_path="data/ratings.csv", reader=reader)

### Caricamento da dataframe

In [6]:
reader = Reader(rating_scale=(1, 5))

In [7]:
df = pd.read_csv("data/ratings.csv")

In [8]:
df.rating.value_counts()

4.0    26818
3.0    20047
5.0    13211
3.5    13136
4.5     8551
2.0     7551
2.5     5550
1.0     2811
1.5     1791
0.5     1370
Name: rating, dtype: int64

In [9]:
data = Dataset.load_from_df(df[df.columns[:-1]], reader)

### Divisione del dataset in set di addestramento e di test
Una volta definiti gli oggetti train e test, è possibile chiamare gli attributi che contengono le relative informazioni sui dati

In [10]:
train, test = train_test_split(data, test_size=0.2, random_state=42)

In [11]:
test

[(140, 6765, 3.5),
 (603, 290, 4.0),
 (438, 5055, 4.0),
 (433, 164179, 5.0),
 (474, 5114, 4.0),
 (304, 1035, 4.0),
 (298, 4974, 1.0),
 (131, 293, 4.0),
 (288, 5784, 2.5),
 (448, 97225, 2.5),
 (284, 585, 4.0),
 (331, 48394, 5.0),
 (325, 2204, 3.0),
 (504, 2797, 3.5),
 (286, 509, 3.5),
 (232, 8807, 4.0),
 (448, 72378, 1.5),
 (6, 112, 4.0),
 (54, 339, 3.0),
 (264, 3623, 2.0),
 (45, 356, 5.0),
 (448, 98175, 1.5),
 (525, 46578, 4.5),
 (577, 1210, 4.0),
 (226, 5541, 3.5),
 (597, 1079, 5.0),
 (20, 5816, 4.5),
 (386, 608, 4.0),
 (405, 110, 4.0),
 (136, 432, 3.0),
 (275, 858, 5.0),
 (177, 1244, 4.0),
 (260, 6711, 4.0),
 (599, 1129, 4.5),
 (57, 1673, 4.0),
 (534, 1917, 4.5),
 (304, 1722, 3.0),
 (438, 3896, 3.5),
 (274, 431, 4.0),
 (565, 420, 3.0),
 (274, 45672, 2.5),
 (451, 593, 5.0),
 (368, 3247, 2.0),
 (198, 2478, 2.0),
 (177, 608, 4.0),
 (269, 63, 3.0),
 (298, 88785, 0.5),
 (68, 2018, 3.0),
 (140, 914, 4.0),
 (228, 50, 4.5),
 (561, 2028, 4.5),
 (201, 25, 5.0),
 (351, 4720, 3.5),
 (200, 6874, 

In [12]:
type(test)

list

### Fit Baseline model

>### Alternating Least Squares
Parametri:
* 'reg_i': The regularization parameter for items. Corresponding to 𝜆2. Default is 10.
* 'reg_u': The regularization parameter for users. Corresponding to 𝜆3. Default is 15.
* 'n_epochs': The number of iteration of the ALS procedure. Default is 10.

In [35]:
bsl_options = {
    "method": "als"
}
algo = BaselineOnly(bsl_options=bsl_options)

>### Fit del modello sui dati di training

In [36]:
algo.fit(train)
predictions = algo.test(test)

Estimating biases using als...


>### Prestazioni del modello sul test set

In [37]:
rmse(predictions);

RMSE: 0.8785


>### Test del modello con la cross validation

In [38]:
cross_validate(algo=algo, data=data, verbose=True, cv=5);

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8824  0.8680  0.8635  0.8708  0.8785  0.8726  0.0069  
MAE (testset)     0.6769  0.6704  0.6669  0.6713  0.6778  0.6727  0.0041  
Fit time          0.11    0.14    0.15    0.17    0.18    0.15    0.03    
Test time         0.08    0.24    0.08    0.25    0.08    0.15    0.08    


>### Stochastic Gradient Descent
Parametri:
* 'reg': The regularization parameter of the cost function that is optimized, corresponding to 𝜆1. Default is 0.02.
* 'learning_rate': The learning rate of SGD, corresponding to 𝛾. Default is 0.005.
* 'n_epochs': The number of iteration of the SGD procedure. Default is 20.

In [39]:
bsl_options = {
    "method": "sgd"
}
algo = BaselineOnly(bsl_options=bsl_options)

>### Fit del modello sui dati di training

In [40]:
algo.fit(train)
predictions = algo.test(test)

Estimating biases using sgd...


>### Prestazioni del modello sul test set

In [41]:
rmse(predictions);

RMSE: 0.8766


>### Test del modello con la cross validation

In [42]:
cross_validate(algo=algo, data=data, verbose=True);

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8644  0.8702  0.8752  0.8696  0.8676  0.8694  0.0035  
MAE (testset)     0.6670  0.6699  0.6721  0.6695  0.6636  0.6684  0.0029  
Fit time          0.27    0.31    0.24    0.28    0.25    0.27    0.03    
Test time         0.25    0.08    0.08    0.08    0.08    0.12    0.07    


### KNNBasic

In [43]:
bsl_options = {
    "method": "als"
}
algo = KNNBaseline(bsl_options=bsl_options)

>### Addestramento su tutto il dataset

In [44]:
trainset = data.build_full_trainset()
algo.fit(trainset)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7f8e2a896fd0>

>### Previsione della valutazione dati utente e film specifici

In [45]:
pred = algo.predict(uid="196", iid="302", verbose=True)

user: 196        item: 302        r_ui = None   est = 3.50   {'was_impossible': False}


### Singular Value Decomposition

>### Grid Search

In [46]:
param_grid = {
    'n_epochs': [10, 20], 
    'lr_all': [0.002, 0.005],
    'reg_all': [0.4, 0.6]
}

gs = GridSearchCV(SVD, param_grid=param_grid, cv=5)

gs.fit(data)

>### Valutazione performance modello

In [47]:
gs.best_score["rmse"]

0.8832100862128154

>### Migliori parametri modello

In [48]:
gs.best_params["rmse"]

{'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.4}

>### Utilizzo del miglior modello per fit sui dati di training

In [49]:
algo = gs.best_estimator["rmse"]
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f8e23edb8d0>

### KNN basato sui contenuti

In [50]:
sim_options = {
    'name': 'pearson_baseline',
    'user_based': False,
    'shrinkage': 0
}
algo = KNNBasic(sim_options=sim_options)

In [51]:
cross_validate(algo=algo, data=data, verbose=True);

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9487  0.9373  0.9390  0.9432  0.9483  0.9433  0.0046  
MAE (testset)     0.7314  0.7237  0.7242  0.7256  0.7340  0.7278  0.0041  
Fit time          9.99    10.02   10.14   10.22   9.86    10.04   0.12    
Test time         6.12    6.18    6.22    6.47    6.39  

### Ottenere le top-N raccomandazioni per ogni utente
* https://surprise.readthedocs.io/en/stable/FAQ.html

>### Creazione funzione per restituire le raccomandazioni per ogni utente

In [52]:
def get_top_n(predictions, n=10):
    """
    Return the top-N recommendation for each user from a set of predictions.

    Parameters
    ----------
    predictions : list 
        the list of predictions, as returned by the test method of an algorithm
    n : int (default 10) 
        the number of recommendation to output for each user

    Returns
    -------
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

>### Fit del modello sui dati

In [53]:
algo = NMF()
algo.fit(train)

<surprise.prediction_algorithms.matrix_factorization.NMF at 0x7f8e23263198>

>### Creazione delle previsioni

In [54]:
preditions = algo.test(test)

>### Richiamo funzione visualizzazione raccomandazioni e visualizzazione risultato

In [55]:
top_n = get_top_n(predictions=predictions, n=10)

In [56]:
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

140 [48516, 296, 1242, 2542, 914, 2067, 5995, 6787, 1200, 3030]
603 [1172, 1213, 1221, 1197, 1193, 296, 1089, 912, 356, 1222]
438 [4011, 1196, 4993, 5902, 1732, 4995, 34405, 899, 44191, 6377]
433 [296, 608, 1089, 164179]
474 [1172, 48516, 1283, 1178, 6016, 1089, 1247, 260, 1288, 1673]
304 [318, 356, 1196, 593, 1198, 2502, 1704, 1653, 457, 1035]
298 [48516, 1222, 260, 5618, 1215, 1210, 4995, 2858, 3578, 7371]
131 [1213, 1193, 1136, 4226, 1288, 593, 1228, 1617, 1200, 293]
288 [2959, 908, 58559, 1250, 3703, 2571, 1261, 7153, 1258, 4226]
448 [48516, 296, 4011, 112552, 1228, 1198, 1617, 1884, 778, 1262]
284 [318, 296, 356, 590, 357, 112, 104, 380, 433, 342]
331 [58559, 79132, 54997, 33794, 134130, 8636, 55765, 84152, 48394, 6936]
325 [356, 969, 1090, 1304, 6, 919, 3424, 1682, 1036, 1610]
504 [296, 4995, 5060, 3897, 4027, 4306, 2797, 3556, 5673, 5445]
286 [4973, 2329, 2028, 2692, 6377, 364, 5989, 150, 4848, 4571]
232 [58559, 527, 1198, 78499, 68157, 44191, 73017, 47, 745, 589]
6 [50, 296, 47

### KNN normalizzato

In [57]:
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}

gs = GridSearchCV(KNNWithMeans, param_grid, cv=3)
gs.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computi

In [58]:
gs.best_score["rmse"]

0.9043755684221125

In [59]:
gs.best_params["rmse"]

{'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': True}}