## Collaborative Filtering
#### Model Based Approach

In [1]:
import pandas as pd
# import SVD from surprise
from surprise import SVD

# # import dataset from surprise
from surprise import Dataset
from surprise import Reader


# import accuracy from surprise
from surprise import accuracy

# import train_test_split from surprise.model_selection
from surprise.model_selection import train_test_split
# import GridSearchCV from surprise.model_selection
from surprise.model_selection import GridSearchCV
# import cross_validate from surprise.model_selection
from surprise.model_selection import cross_validate

We will be working with the [same data](https://drive.google.com/file/d/1WvTmAfO09TCX7xp7uu06__ziic7JnrL5/view?usp=sharing) we used in the previous exercise.

In [2]:
book_ratings = pd.read_csv('BX-Book-Ratings.csv',sep=";", encoding="latin")

In [3]:
book_ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


* create surprise dataset from book_ratings

In [4]:
reader = Reader(rating_scale=(0, 10))

# Loads Pandas dataframe
data = Dataset.load_from_df(book_ratings, reader)
data

<surprise.dataset.DatasetAutoFolds at 0x2155149bdc0>

* split data to train and test set, use test size 15%

In [5]:
X_train, X_test = train_test_split(data, test_size = 0.15)

* Use SVD (with default settings) to create recommendations for each user
    - print default model's rmse that was computed on the test set (using object accuracy we imported in the beginning)

In [6]:
alg = SVD()
output = alg.fit(X_train)

In [7]:
preds = alg.test(X_test)

In [8]:
accuracy.rmse(preds)

RMSE: 3.5061


3.5061171747281494

In [19]:
import numpy as np

# Get a list of all book ISBNs
unique_book_ISBNs = book_ratings['ISBN'].unique()
# Get a list of ISBNs that User-ID 276729 has rated
userX_ISBNs = book_ratings.loc[book_ratings['User-ID'] == 276729, 'ISBN']
# Remove the iids that uid 50 has rated from the list of all movie ids, so we don't recommend them movies they've already watched.
ISBNs_to_pred = np.setdiff1d(unique_book_ISBNs, userX_ISBNs)

# Convert to array of ratings to find the ISBNs with the best predicted rating
pred_ratings = np.array([pred.est for pred in preds])
# Find the index of the maximum predicted rating
i_max = pred_ratings.argmax()
# Use this to find the corresponding iid to recommend
best = ISBNs_to_pred[i_max]
print('Top book for user 276729 has ISBN {0} with predicted rating {1}'.format(best, pred_ratings[i_max]))

Top book for user 276729 has ISBN 0002157853 with predicted rating 10.0


* create parameters grid, use this params:
* 'n_factors': [110, 120, 140, 160]
* 'reg_all': [0.08, 0.1, 0.15]

In [9]:
# param_grid = {
#     'n_factors': [110, 120, 140, 160],
#     'reg_all': [0.08, 0.1, 0.15]
# }

param_grid = {
    'n_factors': [110, 160],
    'reg_all': [0.08, 0.15]
}

* instantiate GridSearch with SVD as model, our pre-defined parameter grid and rmse and mae as evaluation metrics

In [10]:
gs = GridSearchCV(SVD, param_grid, measures = ['rmse', 'mae'], cv=3, n_jobs=-1)

* fit GridSearch

In [11]:
gs.fit(data)

* print best RMSE score from training

In [12]:
print(gs.best_params['rmse'])

{'n_factors': 160, 'reg_all': 0.15}


* predict test set with optimal model based on `RMSE`

In [17]:
alg2 = SVD(n_factors = 160, reg_all = 0.15)
# output2 = alg.fit(X_train)
# preds2 = gs.test(X_test)
output = cross_validate(alg2, data, verbose = True, n_jobs=-1)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    3.4349  3.4299  3.4349  3.4368  3.4292  3.4331  0.0030  
MAE (testset)     2.8755  2.8713  2.8774  2.8801  2.8748  2.8758  0.0029  
Fit time          55.84   55.74   55.74   55.39   55.55   55.65   0.16    
Test time         1.71    1.70    1.60    1.52    1.41    1.59    0.11    


* print optimal model's RMSE that was computed on test set
    - is it better than the default parameters?

In [None]:
accuracy.rmse(preds2)