# BLU12 - Exercises Notebook

In [None]:
import os
import warnings
warnings.filterwarnings("ignore")

import hashlib

import surprise

from surprise import Dataset, Reader

from surprise.model_selection import train_test_split, cross_validate
from surprise.prediction_algorithms import BaselineOnly, KNNBaseline, Prediction

# 1 About the data

The dataset is a small sample of the [goodbooks-10k dataset](https://github.com/zygmuntz/goodbooks-10k) available on GitHub.

The original dataset contains six million ratings for ten thousand books. It also includes:
* Books marked read by users
* Book metadata
* Tags, shelves, and genres.

What makes the dataset so useful is that it covers:
* Ratings (explicit feedback)
* Unary data (implicit feedback), in the form of books marked to read by users
* Book metadata, including content information
* User-generated content (UGC), as tags, shelves, and genres assigned by users to items.

In this exercise, we focus on ratings to practice a basic Recommender Systems (RS) workflow.

A blogpost describing the dataset can be found [here](http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/).

The sample included in `data/` contains the 100 users and the 100 items with the most ratings.

# 2 Custom Datasets in Surprise

Up until now, we have only used [built-in datasets](https://surprise.readthedocs.io/en/stable/dataset.html) with Surprise, namely [MovieLens](https://grouplens.org/datasets/movielens/).

Although MovieLens is a staple in RS research, we try something different and use books in this exercise.

The downside is that, since we are exploring new applications, we have to create our custom dataset.

How do we go about it? Well, an excellent place to start is Surprise's [dataset module documentation](https://surprise.readthedocs.io/en/stable/dataset.html).

Surprise also provides a [Reader class](https://surprise.readthedocs.io/en/stable/reader.html), which is used to parse a file containing ratings in a conventional format.

We must preview our ratings file first.

In [None]:
def preview_file(path, nrows):
    
    with open(path) as file:
        lines = file.readlines()
        
    return lines[:nrows]


path = os.path.join('data', 'ratings.csv')
preview_file(path=path, nrows=5)

What do we have here?

We have a comma-separated file containing one rating per line, in the familiar `'uid,iid,rui'` (i.e., user, item, rating) form.

We also have a header row, **which is not a rating**, even if it is in a ratings file.
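To make the format concrete, here is a minimal, library-free sketch of what any parser for such a file has to do: split each line on the separator, map the columns to (user, item, rating), and skip the header. The sample rows and the `parse_ratings` helper are hypothetical and only for illustration; they are neither taken from `data/ratings.csv` nor part of the Surprise API.

```python
import csv
import io

# Hypothetical sample in the same 'uid,iid,rui' layout, header included.
SAMPLE = "user_id,book_id,rating\nu1,b1,5\nu1,b2,3\nu2,b1,4\n"


def parse_ratings(text, sep=",", skip_lines=1):
    """Return a list of (uid, iid, rating) tuples, skipping header lines."""
    rows = csv.reader(io.StringIO(text), delimiter=sep)
    parsed = []
    for line_no, row in enumerate(rows):
        if line_no < skip_lines:
            continue  # the header is not a rating
        uid, iid, rui = row
        # IDs stay as strings; only the rating is converted to a number.
        parsed.append((uid, iid, float(rui)))
    return parsed


toy_ratings = parse_ratings(SAMPLE)
print(toy_ratings)
```

Forgetting to skip the header would make the parser try to convert the word `rating` to a number, which fails.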

Time to build a custom dataset from the file.

In [None]:
def make_dataset():
    
    path = os.path.join('data', 'ratings.csv')
    
    # Instantiate a Surprise Reader object.
    # Pay close attention to the Reader class parameters and the contents
    # of the file.
    # reader = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Load the dataset from the file, and return it.
    # Refer to the dataset module docs.
    # dataset = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return dataset


def make_trainset():

    dataset = make_dataset()
    
    # Call a method on dataset (e.g., dataset.method()) that builds a trainset
    # from the whole dataset, without any splits.
    # This method is used in the learning materials.
    # trainset = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return trainset


dataset = make_dataset()
ratings = make_trainset()

Now, `ratings` is an object of the [Trainset class](https://surprise.readthedocs.io/en/stable/trainset.html), containing the training set.

A `Dataset` contains the raw data, while the `Trainset` is a higher-level structure, where useful attributes and methods are defined.
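One practical consequence is that a `Trainset` works with inner IDs: each raw ID read from the file (a string) is mapped to an integer. The following is a simplified sketch of that idea in plain Python, not Surprise's actual implementation; `build_id_maps`, `knows_user`, and the toy ratings are hypothetical.

```python
def build_id_maps(ratings):
    """Map raw user/item IDs to consecutive integers, in order of appearance."""
    raw2inner_users, raw2inner_items = {}, {}
    for uid, iid, _ in ratings:
        # setdefault only assigns a new inner ID the first time a raw ID is seen.
        raw2inner_users.setdefault(uid, len(raw2inner_users))
        raw2inner_items.setdefault(iid, len(raw2inner_items))
    return raw2inner_users, raw2inner_items


# Hypothetical raw ratings: (raw uid, raw iid, rating).
toy = [("2487", "46", 4.0), ("951", "46", 5.0), ("2487", "72", 3.0)]
users, items = build_id_maps(toy)

n_users, n_items = len(users), len(items)
global_mean = sum(r for _, _, r in toy) / len(toy)


def knows_user(inner_uid):
    # In this sketch, an inner ID is known if it is within the assigned range.
    return 0 <= inner_uid < n_users


print(users, items, global_mean)
```

This is why methods such as `knows_item` and `knows_user` take integers, whereas the IDs in the ratings file are strings.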

(For examples, pay close attention to the graded tests below.)

In [None]:
assert(type(ratings) == surprise.trainset.Trainset)

assert(ratings.n_users == 100)
assert(ratings.n_items == 100)
assert(ratings.n_ratings == 3304)

assert(ratings.rating_scale == (1, 5))

assert(round(ratings.global_mean) == 4)

assert(ratings.knows_item(43) == True)
assert(ratings.knows_user(100) == False)

# 3 Baseline

We know what to do: get to a baseline real fast.

We generate baseline predictions using the [baseline model](https://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.baseline_only.BaselineOnly) described in the learning materials.

The [configurations](https://surprise.readthedocs.io/en/stable/prediction_algorithms.html#baseline-estimates-configuration) of the baseline estimator are:
* We want to use Stochastic Gradient Descent (SGD)
* With a learning rate of 0.00005
* And a regularization parameter of 0.05.
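To see what the learning rate and the regularization parameter control, here is a small sketch of fitting the baseline estimate $b_{ui} = \mu + b_u + b_i$ with SGD in plain Python. It is an illustration of the technique under simplifying assumptions, not Surprise's implementation; the function name, the toy ratings, and the hyperparameter values used here are hypothetical and deliberately different from the ones requested above.

```python
from collections import defaultdict


def fit_baselines_sgd(ratings, n_epochs=20, learning_rate=0.005, reg=0.02):
    """Fit user and item biases around the global mean with plain SGD."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = defaultdict(float)  # user biases, start at 0
    bi = defaultdict(float)  # item biases, start at 0
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i])
            # The gradient step pulls each bias towards the error, while the
            # regularization term shrinks it towards zero.
            bu[u] += learning_rate * (err - reg * bu[u])
            bi[i] += learning_rate * (err - reg * bi[i])
    return mu, bu, bi


# Hypothetical ratings: user "a" rates high, user "b" rates low.
toy = [("a", "x", 5.0), ("a", "y", 5.0), ("b", "x", 1.0), ("b", "y", 1.0)]
mu, bu, bi = fit_baselines_sgd(toy)
print(mu, dict(bu), dict(bi))
```

A smaller learning rate makes each update more cautious, so the biases move away from zero more slowly for the same number of epochs.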

We want the function to return cross-validation results, using `cv=5`.
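As a reminder of what 5-fold cross-validation computes, the sketch below splits a toy list of ratings into five folds and reports RMSE and MAE for each one. A trivial "predict the training mean" model stands in for the real algorithm. All names and data are hypothetical, and this is not how Surprise's `cross_validate` is implemented.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


toy_ratings = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]  # hypothetical ratings
folds = k_fold_indices(len(toy_ratings), k=5)

results = {"test_rmse": [], "test_mae": []}
for test_idx in folds:
    held_out = set(test_idx)
    train = [r for j, r in enumerate(toy_ratings) if j not in held_out]
    train_mean = sum(train) / len(train)  # the stand-in "model"
    errors = [toy_ratings[j] - train_mean for j in test_idx]
    results["test_rmse"].append(rmse(errors))
    results["test_mae"].append(mae(errors))

print(results)
```

Each fold yields one score per metric, so `cv=5` gives five values. On any given fold, RMSE is never smaller than MAE, because squaring gives large errors more weight.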

In [None]:
def baseline_cross_validate(data):
    
    # Configure the baseline options.
    # Refer back to the learning materials.
    # bsl_options = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Instantiate the baseline algorithm, using the bsl_options above.
    # baseline = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Cross validation results, using cv=5.
    # res = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return baseline, res


# You must decide whether to use a `Dataset` or a `Trainset`.
baseline, baseline_results = baseline_cross_validate(dataset)
baseline_results['test_rmse'].mean()

In [None]:
assert(list(baseline.bsl_options.values()) == ['sgd', 5e-05, 0.05])

assert(type(baseline_results) == dict)

assert(list(baseline_results.keys()) == ['test_rmse', 'test_mae', 'fit_time', 'test_time'])

assert(len(baseline_results['test_rmse']) == len(baseline_results['test_mae']) == 5)

assert(baseline_results['test_rmse'].mean() >= baseline_results['test_mae'].mean())

# 4 *k*-NN (with Baseline)

Surprise provides [*k*-NN](https://surprise.readthedocs.io/en/stable/knn_inspired.html#) inspired algorithms for Collaborative Filtering (CF).

These algorithms include many of the extensions we discussed in the learning materials, especially ratings normalization.

We will use a built-in algorithm that normalizes the ratings much like mean-centering or the z-score would, but using the baseline estimates instead.

(You should find such a model [around here](https://surprise.readthedocs.io/en/stable/knn_inspired.html#).)

Again, we want our function to return the cross-validation results with `cv=5`, so we can compare them with the baseline.

Also, we want the following similarity options (in case of doubt, refer back to the learning materials):
* We use the cosine similarity to compute similarities
* We want item-item similarities
* We want the minimum number of common users to be 3.
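The options above can be illustrated with a small sketch of an item-item cosine similarity computed over common users only, with a minimum-support threshold. This is a plain-Python illustration of the idea; the function name and the toy data are hypothetical, and it does not show how to configure Surprise.

```python
import math


def item_cosine(ratings_i, ratings_j, min_support=1):
    """Cosine similarity between two items, using only their common users.

    ratings_i and ratings_j map user IDs to ratings. The similarity is set
    to 0 when the items share fewer than min_support users.
    """
    common = ratings_i.keys() & ratings_j.keys()
    if not common or len(common) < min_support:
        return 0.0
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)


# Hypothetical items: two users in common (u1 and u2).
item_a = {"u1": 4, "u2": 4}
item_b = {"u1": 2, "u2": 2, "u3": 5}

sim_loose = item_cosine(item_a, item_b, min_support=2)
sim_strict = item_cosine(item_a, item_b, min_support=3)
print(sim_loose, sim_strict)
```

With only two users in common, the stricter threshold discards the pair, which protects against similarities that rest on very little evidence.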

Finally, the max number of neighbors to take into account for prediction should be 20 and the minimum 5.
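To make the roles of the maximum and minimum number of neighbors concrete, here is a simplified sketch of a baseline-normalized *k*-NN estimate: the prediction starts from the baseline and adds a similarity-weighted average of how far the neighbors' ratings deviate from their own baselines. The function and the numbers are hypothetical; details such as clipping to the rating scale are omitted, so treat it as an illustration rather than Surprise's implementation.

```python
import heapq


def knn_baseline_estimate(baseline_ui, neighbors, k=3, min_k=1):
    """Estimate a rating from the baseline plus neighbor deviations.

    neighbors is a list of (similarity, rating, neighbor_baseline) tuples,
    one per item the user has already rated.
    """
    # Keep the k most similar neighbors, then drop non-positive similarities.
    top_k = heapq.nlargest(k, neighbors, key=lambda t: t[0])
    used = [(s, r, b) for s, r, b in top_k if s > 0]
    if len(used) < min_k:
        return baseline_ui  # not enough neighbors: fall back to the baseline
    weighted_dev = sum(s * (r - b) for s, r, b in used)
    return baseline_ui + weighted_dev / sum(s for s, _, _ in used)


# Hypothetical neighbors: (similarity, rating, neighbor baseline).
toy_neighbors = [(0.9, 5.0, 4.5), (0.1, 3.0, 3.5), (0.0, 1.0, 3.0)]

est = knn_baseline_estimate(4.0, toy_neighbors, k=2, min_k=1)
est_fallback = knn_baseline_estimate(4.0, toy_neighbors, k=2, min_k=3)
print(est, est_fallback)
```

Raising the minimum makes the model more conservative: when too few useful neighbors exist, the estimate is just the baseline.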

In [None]:
def fancy_knn_cross_validate(data):
    
    # Use the same baseline options as above.
    # bsl_options = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Configure the similarity options.
    # We use the item-item cosine similarity, with at least 3 common users.
    # sim_options = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Instantiate the k-NN algorithm that considers the baselines.
    # The maximum number of neighbors should be 20 and the minimum 5.
    # fancy_knn = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Do cross-validation, with cv=5.
    # res = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return fancy_knn, res


fancy_knn, fancy_knn_results = fancy_knn_cross_validate(dataset)
fancy_knn_results['test_rmse'].mean()

In [None]:
assert(list(fancy_knn.bsl_options.values()) == ['sgd', 5e-05, 0.05])
assert(list(fancy_knn.sim_options.values()) == ['cosine', False, 3])

assert(fancy_knn.k == 20)
assert(fancy_knn.min_k == 5)

assert(type(fancy_knn_results) == dict)

assert(list(fancy_knn_results.keys()) == ['test_rmse', 'test_mae', 'fit_time', 'test_time'])

assert(len(fancy_knn_results['test_rmse']) == len(fancy_knn_results['test_mae']) == 5)

assert(fancy_knn_results['test_rmse'].mean() >= fancy_knn_results['test_mae'].mean())

assert(type(fancy_knn) == KNNBaseline)

# 5 Making Predictions

Now that we have a winner, we want to make predictions.

What we will do is:
* Train the fancy $k$-NN model, which normalizes ratings using the baselines, on the entire dataset
* Create a list of unknown ratings that can be used as a test set, i.e., all the ratings that are not in the train set
* Make predictions!
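The second step can be pictured with a plain-Python sketch: enumerate every user-item pair and keep those without a known rating, filling in a placeholder value. The `unknown_pairs` helper and the toy ratings are hypothetical, and this is not the Surprise method you are expected to use.

```python
def unknown_pairs(ratings, fill):
    """Return (uid, iid, fill) for every user-item pair without a rating."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    known = {(u, i) for u, i, _ in ratings}
    return [(u, i, fill) for u in users for i in items if (u, i) not in known]


# Hypothetical ratings: 2 users x 2 items, with 3 known ratings.
toy = [("u1", "b1", 5.0), ("u1", "b2", 3.0), ("u2", "b1", 4.0)]
toy_mean = sum(r for _, _, r in toy) / len(toy)

test_pairs = unknown_pairs(toy, fill=toy_mean)
print(test_pairs)
```

The size of such a test set is the number of users times the number of items, minus the number of known ratings. For the sample used here, that is 100 × 100 − 3304 = 6696.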

In [None]:
def make_predictions(algo, ratings_train):
    
    # Fit the algorithm (received as a parameter) on the training data.
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Create a test set with the ratings that are not in the trainset.
    # Refer back to the learning materials, or the `Trainset` docs.
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Generate the set of test predictions, then return the fitted algorithm
    # and the predictions, in that order (see the call below).
    # In doubt, refer back to the learning materials.
    # YOUR CODE HERE
    raise NotImplementedError()

    
model, preds = make_predictions(fancy_knn, ratings)

Time to make some predictions!

What are the predicted ratings for the following user-item pairs:
* `uid='2487'`, `iid='46'`
* `uid='951'`, `iid='72'`
* `uid='10146'`, `iid='5'`.

In [None]:
# Use the model returned above to make predictions, and retrieve the
# estimate attribute, as in the learning materials.
# If needed, look into the predictions.
# Please note that user and item IDs need to be passed as strings.
# pred_1 = ...
# pred_2 = ...
# pred_3 = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert(len(preds) == 6696)

assert(type(preds[0]) == Prediction)

expected_hash = '6e540aa1b0f070c6190738483002fd3aed4e797b85a8a424aa129c2afdb2b50b'
assert(hashlib.sha256(pred_1).hexdigest() == expected_hash)

expected_hash = 'fffecd8226b856eefea0997034a60b38b775e79a9f423739627dce7fcaeecde0'
assert(hashlib.sha256(pred_2).hexdigest() == expected_hash)

expected_hash = 'e0641648822eeb8658a539a9ad833f61f0c70662e944c7b21bd85c2f484f670c'
assert(hashlib.sha256(pred_3).hexdigest() == expected_hash)