# Recommender Systems

**Libraries needed: scikit-surprise, pandas, sklearn, numpy. 
To install `scikit-surprise`:**
```
conda install -c conda-forge scikit-surprise tqdm
```

### Goal: 
In this exercise, we will be proceeding in two stages. 
1. The first stage is where we get into the details of how to build our own recommender system to recommend movies to users.
2. In the second stage, we will be using an existing library, specialized for recommender systems, which provides more powerful options. We will be testing it on the task of recommending jokes to users.

### What you are learning in this exercise:
1. Getting familiar with item-based collaborative filtering and user-based collaborative filtering.
2. Getting familiar with an existing library for recommender systems.

Let's make sure we have all the requirements ready. In this exercise, you should fill in the empty code sections marked `TODO`.

**Note**: We added the `tqdm` library to make it easier to monitor the progress of our loops.

In [None]:
import surprise
import numpy as np
import pandas as pd
import sklearn
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

### Task 1: Exploring the MovieLens dataset

In this part, we'll be using the [MovieLens dataset](https://grouplens.org/datasets/movielens/). This dataset is based on [movielens.org](https://movielens.org/), a site where users can get movie recommendations.

Our first step is to load the relevant file of the dataset, which you can find in the file `u.data` (on the path `data/ml-100k/u.data`).


In [None]:
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('data/ml-100k/u.data', sep='\t', names=header)

In [None]:
df[:15]

Let's first check the number of users and movies in the dataset to get an idea of the scale we're dealing with.

In [None]:
# TODO: get the number of users and items
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print ('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items))

We can also get an overall view of the dataset as below. Notice how the ratings range from a minimum of 1 to a maximum of 5.

In [None]:
df.describe()

Now that the data is loaded, we proceed to splitting it into a training set and a testing set.


In [None]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.25, random_state= 42)

Next, let's create the user-item matrices, one for training and another for testing. Each matrix should be a 2D numpy array, with each row corresponding to a user and each column to a movie. A non-zero cell in the matrix is the rating given by the user to the movie (zeros are for the case of no corresponding rating).

**Notice that the user ids and item ids start from 1, so the index (0,0) in your matrix should correspond to `user_id` of 1 and `item_id` of 1.**

In [None]:
# TODO fill the code to produce a data matrix
def create_data_matrix(data,n_users,n_items):
    """
        This function should return a numpy matrix with a shape (n_users, n_items). 
        Each entry is the rating given by the user to the item
    """
    data_matrix = np.zeros((n_users, n_items))


    for line in data.itertuples():
        data_matrix[line[1]-1, line[2]-1] = line[3]
    return data_matrix

train_data_matrix= create_data_matrix(train_data, n_users, n_items)
test_data_matrix= create_data_matrix(test_data, n_users, n_items)


We can check what our matrices look like at this point. 

In [None]:
print('train_data_matrix')
print(train_data_matrix.shape)
print('test_data_matrix')
print(test_data_matrix.shape)
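
The row-by-row loop in `create_data_matrix` can become slow on larger datasets. As a minimal sketch (the name `create_data_matrix_fast` and the toy DataFrame are illustrative and not part of the exercise), the same matrix can be filled in one step with NumPy integer-array indexing, assuming each (user, item) pair appears at most once:

```python
import numpy as np
import pandas as pd

def create_data_matrix_fast(data, n_users, n_items):
    """Hypothetical vectorized variant: fill the (n_users, n_items) matrix without a Python loop."""
    matrix = np.zeros((n_users, n_items))
    # ids start from 1, so shift them to 0-based row/column indices
    rows = data['user_id'].to_numpy() - 1
    cols = data['item_id'].to_numpy() - 1
    # assign all ratings at once (assumes no duplicate (user, item) pairs)
    matrix[rows, cols] = data['rating'].to_numpy()
    return matrix

# toy data with the same column names as the MovieLens frame
toy = pd.DataFrame({'user_id': [1, 1, 2, 3],
                    'item_id': [1, 3, 2, 3],
                    'rating':  [5, 3, 4, 1]})
toy_matrix = create_data_matrix_fast(toy, n_users=3, n_items=3)
print(toy_matrix)
```

Cells that receive no rating stay at zero, which matches the convention used for the training and testing matrices.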

### Task 2: Item-based Collaborative Filtering

Now that we've prepared our data, the next mission we have is to create a recommender system following the paradigm of Item-based Collaborative Filtering. In this case, this is translated into "Users who liked this item (movie) also liked …". 



In order to make predictions, we will apply the following formula, where 
$N_I(a)$ is the set of neighbors of item $a$, and $b$ is an item rated by user $x$.


\begin{equation}
{r}_{x}(a) =  \frac{\sum\limits_{b \in N_{I}(a)} sim(a, b) r_{x}(b)}{\sum\limits_{b \in N_{I}(a)}|sim(a, b)|}
\end{equation}

As a building block, we'll first write the code for the similarity metric $sim(a,b)$ between every pair of item vectors in our training matrix. In this case, we will use the cosine similarity metric. The output should be an `n_items` by `n_items` symmetric 2D numpy matrix with the similarity between each pair of items.

**Note**: In this exercise, there are always two ways of achieving the same goal: a slow one via `for` loops and another by benefiting from numpy's speed in matrix operations. Feel free to improve your starting solution to a faster one.

In [None]:
# TODO fill the code to compute the similarity matrix
from sklearn.metrics.pairwise import pairwise_distances
item_similarity = 1-pairwise_distances(train_data_matrix.T, metric='cosine')

# check what the matrix looks like
print(item_similarity)
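
To see what the cosine similarity computes, here is a small NumPy-only sketch on a toy matrix. The helper name `cosine_similarity_matrix` and the `eps` guard are assumptions made for illustration. In this sketch, all-zero vectors get a similarity of 0 with everything, which may differ from how the library treats them.

```python
import numpy as np

def cosine_similarity_matrix(vectors, eps=1e-12):
    """Hypothetical helper: cosine similarity between all pairs of rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # guard against division by zero for all-zero rows
    normalized = vectors / np.maximum(norms, eps)
    return normalized @ normalized.T

# 3 users x 3 items; zeros mean "no rating"
toy_ratings = np.array([[5., 0., 3.],
                        [4., 0., 0.],
                        [0., 2., 1.]])

# items are columns, so transpose to compare item vectors
toy_item_similarity = cosine_similarity_matrix(toy_ratings.T)
print(toy_item_similarity)
```

For example, items 1 and 3 (columns 0 and 2) share only the first user's ratings, so their similarity is $15 / (\sqrt{41}\sqrt{10})$.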

Next, we'll use the similarity matrix in the above formula to obtain the predicted rating for each item `a`.

In [None]:
# TODO: Fill the code for predicting the ratings. 
# The output is a numpy matrix with the dimensions (n_users, n_items) and with the predicted rating at each cell.

def item_based_predict(ratings, similarity):
    n_users, n_items = ratings.shape
    filled_matrix = np.zeros((n_users, n_items))
    # loop over all the users
    for u in range(n_users):
        # get the items rated by this user
        rated_items_indices = ratings[u,:].nonzero()[0]
        for i in range(n_items):
            numerator = 0
            denominator = 0
            for j in rated_items_indices:
                numerator+=similarity[i,j]*ratings[u,j]
                denominator+=np.abs(similarity[i,j])
            if denominator>0:
                filled_matrix[u,i]= numerator/denominator
            else:
                # no similar rated item: fall back to a random rating between 1 and 5
                filled_matrix[u,i]= np.random.randint(1,6)
    return filled_matrix        

item_prediction = item_based_predict(train_data_matrix, item_similarity)
print(item_prediction)


**Note:** The above implementation can be made much quicker by changing the loop operations into matrix multiplications. Give it a try!
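
One possible way to express the item-based formula with matrix products is sketched below. The name `item_based_predict_fast` and the toy inputs are illustrative assumptions. Unlike the loop version, cells whose denominator is zero are left at 0 here rather than filled with a random rating.

```python
import numpy as np

def item_based_predict_fast(ratings, similarity):
    """Hypothetical vectorized variant of the item-based prediction formula."""
    # 1.0 where the user rated the item, 0.0 otherwise
    rated = (ratings != 0).astype(float)
    # numerator[u, a] = sum_b sim(a, b) * r_u(b); unrated items contribute 0
    numerator = ratings @ similarity.T
    # denominator[u, a] = sum of |sim(a, b)| over the items b rated by u
    denominator = rated @ np.abs(similarity).T
    predictions = np.zeros_like(numerator)
    # divide only where the denominator is positive; other cells stay 0
    np.divide(numerator, denominator, out=predictions, where=denominator > 0)
    return predictions

toy_ratings = np.array([[5., 0., 3.],
                        [4., 0., 0.],
                        [0., 2., 1.]])
toy_similarity = np.array([[1.0, 0.0, 0.5],
                           [0.0, 1.0, 0.25],
                           [0.5, 0.25, 1.0]])
toy_item_prediction = item_based_predict_fast(toy_ratings, toy_similarity)
print(toy_item_prediction)
```

Because the mask `rated` restricts the denominator to items the user actually rated, the result agrees with the triple loop wherever the denominator is positive.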

One further optimization that we can make while speeding up the solution is to focus on getting a good ranking of the items for a specific user rather than the predicted rating value. If we are only interested in the ranking, we do not have to restrict the sums to the previously rated items. The formula can be applied across all items, which makes the optimizations easier to perform. Check out this [blog post](http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/) for an example.

### Task 3: User-based Collaborative Filtering

The next mission we have is to create a recommender system following the paradigm of User-based Collaborative Filtering. In this case, this is translated into "Users who are similar to you also liked…". 

In order to make predictions, we will apply the following formula, where $N_U(x)$ is the set of neighbors of user $x$ and $a$ is an item not rated by $x$.


\begin{equation}
{r}_{x}(a) = \bar{r}_{x} + \frac{\sum\limits_{y \in N_{U}(x)} sim(x, y) (r_{y}(a) - \bar{r}_{y})}{\sum\limits_{y \in N_{U}(x)}|sim(x, y)|}
\end{equation}

Similar to above, we will first compute the similarities between the users in our training matrix, using cosine similarity. The output should be an `n_users` by `n_users` symmetric 2D numpy matrix with the similarity between each pair of users.

In [None]:
train_data_matrix.shape

In [None]:
# TODO fill the code to compute the similarity matrix
user_similarity = 1- pairwise_distances(train_data_matrix, metric='cosine')

# print the shape as a sanity check
print(user_similarity.shape)

# check what the matrix looks like
print(user_similarity)

In [None]:
# TODO: Fill the code for predicting the ratings. 
def user_based_predict(ratings, similarity):
    n_users, n_items = ratings.shape
    filled_matrix = np.zeros((n_users, n_items))
    
    # compute the average ratings for each user
    tmp = ratings.copy()
    tmp[tmp == 0] = np.nan
    user_average_ratings = np.nanmean(tmp, axis=1)
    
    # loop over all the items
    for i in tqdm(range(n_items)):
        # get the users who rated this item
        rated_users_indices = ratings[:,i].nonzero()[0]

        for u in range(n_users):
            numerator = 0
            denominator = 0
            for y in rated_users_indices:
                numerator+=similarity[u,y]*(ratings[y,i]-user_average_ratings[y])
                denominator+=np.abs(similarity[u,y])
            if denominator>0:
                filled_matrix[u,i]= user_average_ratings[u]+ numerator/denominator
            else:
                filled_matrix[u,i]= user_average_ratings[u]

    # we ensure that the ratings are in the expected range (1 to 5);
    # clip returns a new array, so the result must be assigned
    filled_matrix = filled_matrix.clip(1,5)
    return filled_matrix   

    
user_prediction = user_based_predict(train_data_matrix, user_similarity)
print(type(user_prediction))

**Note:** As above, this basic implementation can be made much quicker by changing the loop operations into matrix multiplications. Give it a try!
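
The user-based formula can be written with matrix products in a similar way. The following is a sketch under stated assumptions: the name `user_based_predict_fast` and the toy inputs are illustrative, users without any rating get a mean of 0 here (the loop version would produce `nan`), and predictions are clipped to the 1 to 5 rating scale.

```python
import numpy as np

def user_based_predict_fast(ratings, similarity):
    """Hypothetical vectorized variant of the user-based prediction formula."""
    rated = ratings != 0
    # mean rating per user, computed over rated items only
    counts = rated.sum(axis=1)
    user_means = ratings.sum(axis=1) / np.maximum(counts, 1)
    # mean-centered ratings; unrated cells are set to 0 so they drop out of the sums
    centered = np.where(rated, ratings - user_means[:, None], 0.0)
    # numerator[x, a] = sum_y sim(x, y) * (r_y(a) - mean_y) over users y who rated a
    numerator = similarity @ centered
    # denominator[x, a] = sum of |sim(x, y)| over users y who rated a
    denominator = np.abs(similarity) @ rated.astype(float)
    offsets = np.zeros_like(numerator)
    np.divide(numerator, denominator, out=offsets, where=denominator > 0)
    # keep predictions inside the 1..5 rating scale
    return np.clip(user_means[:, None] + offsets, 1, 5)

toy_ratings = np.array([[5., 0., 3.],
                        [4., 0., 0.],
                        [0., 2., 1.]])
toy_user_similarity = np.array([[1.0, 0.8, 0.2],
                                [0.8, 1.0, 0.0],
                                [0.2, 0.0, 1.0]])
toy_user_prediction = user_based_predict_fast(toy_ratings, toy_user_similarity)
print(toy_user_prediction)
```

As in the loop version, the neighbor set for an item is every user who rated it, including the target user when applicable.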


### Task 4: Evaluating Our Recommenders

We will be evaluating our recommenders using Root Mean Squared Error (RMSE). In the formula below, $r_i$ is the true rating and $\hat{r_i}$ is the predicted one.

\begin{equation}
\mathit{RMSE} =\sqrt{\frac{1}{N} \sum_i (r_i -\hat{r_i})^2}
\end{equation}

In [None]:
# TODO: add the code for computing RMSE for user and item based methods
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()

    return sqrt(mean_squared_error(prediction, ground_truth))

print ('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print ('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))
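
As a quick sanity check of the metric, the sketch below computes the RMSE by hand on a tiny example using only NumPy. The name `rmse_on_rated` and the toy matrices are illustrative. As in `rmse` above, only the cells with a true (non-zero) rating are compared.

```python
import numpy as np

def rmse_on_rated(prediction, ground_truth):
    """Hypothetical NumPy-only RMSE restricted to cells with a true rating."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

toy_truth = np.array([[4., 0.],
                      [0., 2.]])
toy_pred = np.array([[3., 5.],
                     [1., 4.]])

# only two cells have true ratings: errors are -1 and +2, so RMSE = sqrt((1 + 4) / 2)
toy_rmse = rmse_on_rated(toy_pred, toy_truth)
print(toy_rmse)
```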

### Task 5: Introducing Surprise

In this part, we will move to using [Surprise](http://surpriselib.com/), a full-fledged Python library specialized for recommender systems. The goal is to get exposed to more powerful libraries that can automate much of the manual work we had to do above.

For a change, we will be using the [Jester](http://eigentaste.berkeley.edu/dataset/) dataset, obtained from the [Jester Online Joke Recommender System](http://eigentaste.berkeley.edu/index.html). It has over 1.7 million continuous ratings (-10.00 to +10.00) of 150 jokes from 59,132 users, collected between November 2006 and May 2009. Our first step will be to download this dataset. Fortunately, `Surprise` has a built-in loader for the Jester dataset. Make sure you confirm that you want to download the dataset when prompted to do so.

In [None]:
from surprise import Dataset


# Load the Jester dataset (download it if needed).
data = Dataset.load_builtin('jester')
# The 2-fold cross-validation split is specified at evaluation time (cv=2).



Next, we will need to train the SVD algorithm within Surprise on the Jester dataset (check the [documentation](http://surprise.readthedocs.io/en/stable/) for `SVD`). For evaluation, Surprise supports multiple metrics. You will need to use the `RMSE` and the `MAE` in this case. The training might take a few minutes.

In [None]:
from surprise import SVD
from surprise.model_selection import cross_validate

# TODO: fill the code for evaluating the model based on SVD
# We'll use the SVD algorithm.
algo = SVD()

# Evaluate the performance of our algorithm with 2-fold cross-validation.
# verbose=True prints the RMSE and MAE for each fold.
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=2, verbose=True)

The above code was hopefully short, and it's mainly for showing the power of the library. Now that you have trained and evaluated the recommendation algorithm, let's try to find the predicted rating for a single user and item.

In [None]:
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(98)  # raw item id (as in the ratings file). They are **strings**!

# TODO get a prediction for user with uid and item iid
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

If you are interested in knowing what the joke was for item 98, you can check the dataset. By default, the dataset will be downloaded in your home directory, under `$HOME/.surprise_data/jester/`. The file `jester_items.dat` has the text of the jokes. 😉

Finally, feel free to explore the library further. It might come in handy for your future projects!
