# Recommender Systems

**Libraries needed: scikit-surprise, pandas, scikit-learn, numpy, tqdm.
To install `scikit-surprise`:**
```
conda install -c conda-forge scikit-surprise
```

### Goal: 
This exercise proceeds in two stages.
1. In the first stage, we build our own recommender system from scratch to recommend movies to users.
2. In the second stage, we use an existing library specialized for recommender systems, which provides more powerful options. We test it on the task of recommending jokes to users.

### What you are learning in this exercise:
1. Getting familiar with item-based collaborative filtering and user-based collaborative filtering.
2. Getting familiar with an existing library for recommender systems.

Let's make sure all the requirements are ready. Throughout this exercise, fill in the empty code sections marked with `TODO`.

In [2]:
import surprise
import numpy as np
import pandas as pd
import sklearn

### Task 1: Exploring the MovieLens dataset

In this part, we'll be using the [MovieLens dataset](https://grouplens.org/datasets/movielens/). This dataset is based on [movielens.org](https://movielens.org/), a site where users can get movie recommendations.

Our first step is to load the ratings file of the dataset, `u.data`, located at `data/ml-100k/u.data`.


In [3]:
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('data/ml-100k/u.data', sep='\t', names=header)

In [4]:
df[:15]

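| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
| 5 | 298 | 474 | 4 | 884182806 |
| 6 | 115 | 265 | 2 | 881171488 |
| 7 | 253 | 465 | 5 | 891628467 |
| 8 | 305 | 451 | 3 | 886324817 |
| 9 | 6 | 86 | 3 | 883603013 |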


Let's first check the number of users and movies in the dataset to get an idea of the scale we're dealing with.

In [5]:
# TODO: get the number of users and items
n_users = df['user_id'].nunique()
n_items = df['item_id'].nunique()
print(f'Number of users = {n_users} | Number of movies = {n_items}')

Number of users = 943 | Number of movies = 1682


We can also get an overall view of the dataset as below. Notice how the ratings range from a minimum of 1 to a maximum of 5.

In [6]:
df.describe()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 462.48475 | 425.53013 | 3.52986 | 883528900.0 |
| std | 266.61442 | 330.798356 | 1.125674 | 5343856.0 |
| min | 1.0 | 1.0 | 1.0 | 874724700.0 |
| 25% | 254.0 | 175.0 | 3.0 | 879448700.0 |
| 50% | 447.0 | 322.0 | 4.0 | 882826900.0 |
| 75% | 682.0 | 631.0 | 4.0 | 888260000.0 |
| max | 943.0 | 1682.0 | 5.0 | 893286600.0 |
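
To put these numbers in perspective, here is a rough sketch of how densely the user-item grid is filled, using only the counts reported above (100,000 ratings, 943 users, 1682 movies). It assumes that each user rates a given movie at most once, and the variable names are purely illustrative.

```python
# Counts taken from the outputs above
n_ratings = 100_000
n_users_reported = 943
n_items_reported = 1682

# Fraction of (user, movie) cells that actually hold a rating,
# assuming no (user, movie) pair is rated twice
density = n_ratings / (n_users_reported * n_items_reported)
print(f'{density:.2%} of the user-item cells hold a rating')
```

Under that assumption, the large majority of cells are empty, which is why the matrices printed later in this exercise consist mostly of zeros.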


Now that the data is loaded, we split it into a training set and a testing set.


In [7]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.25, random_state=42)

Next, let's create the user-item matrices, one for training and another for testing. Each matrix should be a 2D numpy array, with each row corresponding to a user and each column to a movie. A non-zero cell holds the rating the user gave to the movie; a zero means the user has not rated it.

**Notice that the user ids and item ids start from 1, so the index (0,0) in your matrix should correspond to `user_id` of 1 and `item_id` of 1.**

In [8]:
# TODO: fill in the code to produce a data matrix
def create_data_matrix(data, n_users, n_items):
    """
    Return a numpy matrix of shape (n_users, n_items).
    Each entry is the rating given by the user to the item (0 if there is none).
    """
    data_matrix = np.zeros((n_users, n_items))

    def update(user_id, item_id, rating):
        # ids start at 1, matrix indices start at 0
        data_matrix[user_id - 1][item_id - 1] = rating

    # visit every rating row and write it into the matrix
    data.apply(lambda x: update(x.user_id, x.item_id, x.rating), axis=1)

    return data_matrix

train_data_matrix= create_data_matrix(train_data, n_users, n_items)
test_data_matrix= create_data_matrix(test_data, n_users, n_items)
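
The row-by-row `apply` works but can be slow on larger frames. As one possible alternative, the sketch below fills the matrix in a single step with NumPy integer-array indexing. The function name `create_data_matrix_vectorized` and the toy frame are made up for illustration, and the sketch assumes that each (user, item) pair appears at most once.

```python
import numpy as np
import pandas as pd

def create_data_matrix_vectorized(data, n_users, n_items):
    """Build an (n_users, n_items) rating matrix; 0 means 'no rating'."""
    matrix = np.zeros((n_users, n_items))
    rows = data['user_id'].to_numpy() - 1   # ids start at 1, indices at 0
    cols = data['item_id'].to_numpy() - 1
    matrix[rows, cols] = data['rating'].to_numpy()
    return matrix

# Toy ratings, invented for illustration only
toy = pd.DataFrame({'user_id': [1, 1, 2, 3],
                    'item_id': [1, 3, 2, 3],
                    'rating':  [5, 3, 4, 1]})
toy_matrix = create_data_matrix_vectorized(toy, n_users=3, n_items=3)
print(toy_matrix)
```

On the toy frame, user 1's ratings for items 1 and 3 land in row 0, columns 0 and 2, matching the indexing convention stated above.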


We can check what our matrices look like at this point.

In [9]:
print('train_data_matrix')
print(train_data_matrix)
print('test_data_matrix')
print(test_data_matrix)

train_data_matrix
[[0. 3. 4. ... 0. 0. 0.]
 [4. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [5. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 5. 0. ... 0. 0. 0.]]
test_data_matrix
[[5. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### Task 2: Item-based Collaborative Filtering

Now that the data is prepared, our next task is to build a recommender system following the paradigm of Item-based Collaborative Filtering, which can be summarized as "Users who liked this item (movie) also liked …".



In order to make predictions, we will apply the following formula, where
$N_I(a)$ is the set of neighbors of item $a$, and each $b$ is an item rated by user $x$.


\begin{equation}
{r}_{x}(a) =  \frac{\sum\limits_{b \in N_{I}(a)} sim(a, b) r_{x}(b)}{\sum\limits_{b \in N_{I}(a)}|sim(a, b)|}
\end{equation}

As a building block, we'll first write the code for the similarity metric $sim(a,b)$ between every pair of item vectors in our training matrix. In this case, we will use the cosine similarity. The output should be an `n_items` by `n_items` symmetric 2D numpy matrix holding the similarity between each pair of items.

**Note**: In this exercise, there are always two ways of achieving the same goal: a slow one via `for` loops and another by benefiting from numpy's speed in matrix operations. Feel free to improve your starting solution to a faster one.
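
As an illustration of the matrix-operation route, here is a minimal NumPy-only sketch of pairwise cosine similarity on a toy rating matrix. The helper name `cosine_similarity_matrix` and the toy data are assumptions made for this example. One deliberate simplification: an all-zero vector gets similarity 0 with everything, including itself, which may differ from what a library routine reports on the diagonal.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0      # avoid dividing by zero for all-zero rows
    unit = vectors / norms       # scale every row to unit length
    return unit @ unit.T         # dot products of unit vectors = cosines

# Toy matrix: rows are users, columns are items (0 = no rating)
toy_ratings = np.array([[5., 0., 4.],
                        [4., 0., 5.],
                        [0., 3., 0.]])

# Items are columns, so transpose to compare item vectors
toy_item_sim = cosine_similarity_matrix(toy_ratings.T)
print(toy_item_sim)
```

Items 0 and 2 are rated by the same two users with similar scores, so their similarity is high (40/41), while item 1 shares no raters with them and gets similarity 0.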

In [10]:
# TODO fill the code to compute the similarity matrix
from sklearn.metrics.pairwise import pairwise_distances
# cosine similarity = 1 - cosine distance; transpose so that rows are items
item_similarity = 1 - pairwise_distances(train_data_matrix.T, metric='cosine')

# check what the matrix looks like
print(item_similarity)

[[1.         0.29431963 0.25248099 ... 0.         0.         0.        ]
 [0.29431963 1.         0.18855956 ... 0.         0.09099269 0.        ]
 [0.25248099 0.18855956 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.09099269 0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]


Next, we'll use the similarity matrix in the above formula to obtain the predicted rating for each item `a`.

In [11]:
# TODO: Fill the code for predicting the ratings.
# The output is a numpy matrix of shape (n_users, n_items) holding the predicted rating in each cell.

from tqdm import tqdm

def item_based_predict(ratings, similarity):
    pred = np.zeros(ratings.shape)
    for user in tqdm(range(n_users)): 
        for item in range(n_items): 
            if ratings[user][item] == 0:
                # similarities between `item` and every other item
                weights = similarity[item][:]
                # Weighted average of the user's ratings. The denominator only counts the items
                # the user has rated; if none of them is similar to `item`, this is 0/0 = nan.
                pred[user][item] = sum(weights*ratings[user][:]) / np.linalg.norm(weights[ratings[user][:].nonzero()],1)
            else: 
                # keep the known rating as it is
                pred[user][item] = ratings[user][item]
    return pred

item_prediction = item_based_predict(train_data_matrix, item_similarity)

print(item_prediction)


  pred[user][item] = sum(weights*ratings[user][:]) / np.linalg.norm(weights[ratings[user][:].nonzero()],1)
100%|██████████| 943/943 [03:28<00:00,  4.52it/s]

[[3.84636515 3.         4.         ... 4.09576166 3.84555021        nan]
 [4.         3.99015099 3.94449343 ... 3.7536442  3.83830292        nan]
 [2.74167162 2.76153256 2.75428409 ... 3.01728956 2.93171602        nan]
 ...
 [5.         4.1299914  4.12830226 ... 4.         3.88259423        nan]
 [4.38546083 4.41069101 4.37404141 ... 4.19056099 4.3596925         nan]
 [3.53536005 5.         3.53042866 ...        nan 3.55127837        nan]]
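
The double loop above takes a few minutes and leaves `nan` wherever the denominator is zero. As a hedged sketch of the faster matrix-based route, the same weighted average can be written with two matrix products. The function name `item_based_predict_vectorized` and its `fallback` parameter are hypothetical additions for this example; what value to fall back on is a design choice that the exercise does not specify.

```python
import numpy as np

def item_based_predict_vectorized(ratings, similarity, fallback=np.nan):
    """Item-based CF prediction using matrix products instead of loops."""
    rated = (ratings != 0).astype(float)          # 1 where the user rated the item
    # numerator[x, a]   = sum_b sim(a, b) * r_x(b)
    numerator = ratings @ similarity.T
    # denominator[x, a] = sum of |sim(a, b)| over the items b that x has rated
    denominator = rated @ np.abs(similarity).T
    pred = np.full(ratings.shape, fallback, dtype=float)
    # divide only where the denominator is positive; elsewhere keep `fallback`
    np.divide(numerator, denominator, out=pred, where=denominator > 0)
    # keep known ratings unchanged, as the loop version does
    return np.where(ratings != 0, ratings, pred)

# Toy inputs, invented for illustration
toy_ratings = np.array([[5., 0., 4.],
                        [0., 3., 0.]])
toy_sim = np.array([[1.0, 0.5, 0.8],
                    [0.5, 1.0, 0.0],
                    [0.8, 0.0, 1.0]])
toy_pred = item_based_predict_vectorized(toy_ratings, toy_sim)
print(toy_pred)
```

In the toy case, user 1 has only rated item 1, which has zero similarity to item 2, so that cell keeps the fallback value. Passing, for instance, the user's mean rating as a fallback would be one way to avoid `nan` entries downstream.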





### Task 3: User-based Collaborative Filtering

Our next task is to build a recommender system following the paradigm of User-based Collaborative Filtering, which can be summarized as "Users who are similar to you also liked…".

In order to make predictions, we will apply the following formula, where $N_U(x)$ is the set of neighbors of user $x$, $a$ is an item not rated by $x$, and $\bar{r}_{x}$ is the average rating given by user $x$.


\begin{equation}
{r}_{x}(a) = \bar{r}_{x} + \frac{\sum\limits_{y \in N_{U}(x)} sim(x, y) (r_{y}(a) - \bar{r}_{y})}{\sum\limits_{y \in N_{U}(x)}|sim(x, y)|}
\end{equation}

As above, we will first compute the similarities between the users in our training matrix, using cosine similarity. The output should be an `n_users` by `n_users` symmetric 2D numpy matrix holding the similarity between each pair of users.

In [12]:
# TODO fill the code to compute the similarity matrix
user_similarity = 1 - pairwise_distances(train_data_matrix, metric='cosine')

# print the shape as a sanity check
print(user_similarity.shape)

# check what the matrix looks like
print(user_similarity)

(943, 943)
[[1.         0.14336926 0.03241686 ... 0.0896044  0.08784797 0.31415893]
 [0.14336926 1.         0.10759069 ... 0.08110762 0.14570123 0.07977339]
 [0.03241686 0.10759069 1.         ... 0.02386986 0.10703166 0.        ]
 ...
 [0.0896044  0.08110762 0.02386986 ... 1.         0.06944821 0.06727982]
 [0.08784797 0.14570123 0.10703166 ... 0.06944821 1.         0.1171645 ]
 [0.31415893 0.07977339 0.         ... 0.06727982 0.1171645  1.        ]]


In [28]:
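# TODO: Fill the code for predicting the ratings. 
def user_based_predict(ratings, similarity):
    pred = np.zeros(ratings.shape)
    averages = np.zeros(n_users)

    # average rating of each user, computed over the items they rated
    for user in range(n_users):
        averages[user] = sum(ratings[user, :]) / np.count_nonzero(ratings[user,:])

    for user in tqdm(range(n_users)): 
        for item in range(n_items):
            if ratings[user][item] != 0:
                # keep the known rating as it is
                pred[user][item] = ratings[user][item]
            else: 
                # mask selects the users who rated this item
                mask = np.zeros(n_users)
                mask[ratings[:,item].nonzero()] =1
                # NOTE: the denominator here sums the neighbors' ratings of `item`, whereas the
                # formula above normalizes by the sum of |sim(x, y)|. If nobody rated `item`,
                # the result is 0/0 = nan.
                pred[user][item] = averages[user] + sum(mask*similarity[user,:]*(ratings[:,item] - averages)) / np.linalg.norm(mask*ratings[:,item],1)

    return pred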

user_prediction = user_based_predict(train_data_matrix, user_similarity)
print(user_prediction)


  pred[user][item] = averages[user] + sum(mask*similarity[user,:]*(ratings[:,item] - averages)) / np.linalg.norm(mask*ratings[:,item],1)
100%|██████████| 943/943 [02:13<00:00,  7.09it/s]

[[3.71349316 3.         4.         ... 3.6638723  3.69505984        nan]
 [4.         3.82812075 3.81071924 ... 3.68955461 3.83650778        nan]
 [2.74800973 2.73856329 2.73314258 ... 2.47684599 2.74383765        nan]
 ...
 [5.         4.18026538 4.1713391  ... 4.14833237 4.18692041        nan]
 [4.30818953 4.29006995 4.28444679 ... 4.21699662 4.29897509        nan]
 [3.42572237 5.         3.38329214 ... 3.40944882 3.40765894        nan]]
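
For comparison, here is a sketch of the user-based formula written with matrix products, normalizing by the sum of $|sim(x, y)|$ over the users who rated the item, as the equation states. The function name `user_based_predict_vectorized` is hypothetical, and falling back to the user's own mean when no neighbor rated the item is an assumption of this sketch, not something the exercise prescribes.

```python
import numpy as np

def user_based_predict_vectorized(ratings, similarity):
    """User-based CF prediction with mean-centering, without Python loops."""
    rated = (ratings != 0).astype(float)           # 1 where a rating exists
    counts = rated.sum(axis=1)
    # per-user mean over rated items (0 for users without any rating)
    means = np.divide(ratings.sum(axis=1), counts,
                      out=np.zeros(len(counts)), where=counts > 0)
    # r_y(a) - mean_y, set to 0 where y did not rate a
    centered = (ratings - means[:, None]) * rated
    numerator = similarity @ centered              # sum_y sim(x, y) * (r_y(a) - mean_y)
    denominator = np.abs(similarity) @ rated       # sum_y |sim(x, y)| over raters of a
    adjustment = np.zeros_like(numerator)
    # where nobody rated the item, the adjustment stays 0 (prediction = user mean)
    np.divide(numerator, denominator, out=adjustment, where=denominator > 0)
    pred = means[:, None] + adjustment
    return np.where(ratings != 0, ratings, pred)   # keep known ratings

# Toy inputs, invented for illustration
toy_ratings = np.array([[4., 0.],
                        [2., 4.],
                        [5., 1.]])
toy_sim = np.array([[1.0, 0.5, 0.5],
                    [0.5, 1.0, 0.0],
                    [0.5, 0.0, 1.0]])
toy_pred = user_based_predict_vectorized(toy_ratings, toy_sim)
print(toy_pred)
```

For user 0 and item 1, the two neighbors deviate from their own means by +1 and -2 with equal weight 0.5, so the prediction is the user's mean of 4 shifted by -0.5.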





### Task 4: Evaluating Our Recommenders

We will be evaluating our recommenders using Root Mean Squared Error (RMSE). In the formula below, $r_i$ is the true rating and $\hat{r}_i$ is the predicted one.

\begin{equation}
\mathit{RMSE} =\sqrt{\frac{1}{N} \sum_i (r_i -\hat{r}_i)^2}
\end{equation}

In [29]:
# TODO: add the code for computing RMSE for user and item based methods
def rmse(prediction, ground_truth):
    # NOTE: Python's built-in sum() only adds up the rows of a 2D array, so this returns
    # one value per item (column) rather than a single number. It also scores every cell,
    # including the zeros that stand for "no rating" in ground_truth.
    return np.sqrt(sum((prediction - ground_truth)*(prediction - ground_truth) / prediction.shape[0] / prediction.shape[1]))

print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: [0.08565456 0.0862469  0.08654391 ... 0.08634276 0.08818141        nan]
Item-based CF RMSE: [0.08613742 0.08714746 0.08757121 ...        nan        nan        nan]
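
The output above is an array rather than a single score because Python's built-in `sum` collapses only the first axis of a 2D array, and the `nan` entries come from predictions that could not be computed. A minimal sketch of a scalar RMSE that follows the formula more closely is shown below. It assumes that a zero in the ground-truth matrix means "no rating" and that `nan` predictions should simply be skipped; the function name `rmse_on_observed` is made up for this example.

```python
import numpy as np

def rmse_on_observed(prediction, ground_truth):
    """RMSE over cells that hold a true rating and a non-nan prediction."""
    mask = (ground_truth != 0) & ~np.isnan(prediction)
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy matrices, invented for illustration
toy_prediction = np.array([[3.0, np.nan],
                           [4.0, 2.0]])
toy_truth = np.array([[4.0, 5.0],
                      [0.0, 2.0]])
toy_rmse = rmse_on_observed(toy_prediction, toy_truth)
print(toy_rmse)
```

Only two cells are scored in the toy case: one with an error of 1 and one with an error of 0. Skipping `nan` predictions flatters the score slightly, so replacing them with a default prediction before scoring would be a stricter alternative.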


### Task 5: Introducing Surprise

In this part, we move to [Surprise](http://surpriselib.com/), a full-fledged Python library specialized for recommender systems. The goal is to get familiar with a more powerful library that automates much of the manual work we had to do above.

For a change, we will be using the [Jester](http://eigentaste.berkeley.edu/dataset/) dataset, obtained from the [Jester Online Joke Recommender System](http://eigentaste.berkeley.edu/index.html). It has over 1.7 million continuous ratings (-10.00 to +10.00) of 150 jokes from 59,132 users, collected between November 2006 and May 2009. Our first step will be to download this dataset. Fortunately, `Surprise` has a built-in loader for the Jester dataset. Make sure to confirm the download when prompted.

In [48]:
from surprise import Dataset


# Load the Jester dataset (download it if needed).
data = Dataset.load_builtin('jester')
# Build a trainset from the whole dataset. Splitting the data into folds at this point
# is deprecated; cross_validate takes care of the folds in the next cell.
data.build_full_trainset()

<surprise.trainset.Trainset at 0x7fd89ffb6f70>

Next, we will need to train the `SVD` algorithm within Surprise on the Jester dataset (check the [documentation](http://surprise.readthedocs.io/en/stable/) for `SVD`). For evaluation, Surprise offers multiple metrics. You will need to use the `RMSE` and the `MAE` in this case. The training might take a few minutes.

In [46]:
from surprise import SVD
from surprise.model_selection import cross_validate

# TODO: fill the code for evaluating the model based on SVD
# We'll use the SVD algorithm.
algo = SVD()

# Evaluate performances of our algorithm on the dataset.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    4.5023  4.4983  4.5003  4.4979  4.4961  4.4990  0.0021  
MAE (testset)     3.3723  3.3692  3.3698  3.3719  3.3694  3.3705  0.0013  
Fit time          7.27    7.48    7.45    7.55    7.35    7.42    0.10    
Test time         3.89    3.41    3.37    2.49    2.92    3.21    0.48    


{'test_rmse': array([4.5023139 , 4.49832109, 4.50033984, 4.49788771, 4.49609709]),
 'test_mae': array([3.37233257, 3.36915091, 3.36978266, 3.37193966, 3.36940597]),
 'fit_time': (7.273042917251587,
  7.484544992446899,
  7.448679208755493,
  7.549944162368774,
  7.351607084274292),
 'test_time': (3.887127161026001,
  3.4134068489074707,
  3.3651540279388428,
  2.487346887588501,
  2.9210331439971924)}
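
The returned dictionary holds one score per fold, so the "Mean" and "Std" columns of the printed table can be recomputed from it. The sketch below does this with NumPy on the per-fold values copied from the output above; the variable names are illustrative, and the sketch runs without Surprise installed.

```python
import numpy as np

# Per-fold scores copied from the cross-validation output above
results = {
    'test_rmse': np.array([4.5023139, 4.49832109, 4.50033984, 4.49788771, 4.49609709]),
    'test_mae': np.array([3.37233257, 3.36915091, 3.36978266, 3.37193966, 3.36940597]),
}

# Mean and (population) standard deviation across the five folds
summary = {name: (scores.mean(), scores.std()) for name, scores in results.items()}
for name, (mean, std) in summary.items():
    print(f'{name}: mean={mean:.4f}, std={std:.4f}')
```

The small standard deviation relative to the mean suggests that the score is stable across folds, so a single number is a fair summary here.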

The code above is short, which is the point: it shows how much the library does for us. Now that you have trained and evaluated the recommendation algorithm, let's try to find the predicted rating for a single user and item.

In [47]:
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(98)  # raw item id (as in the ratings file). They are **strings**!

# TODO get a prediction for user with uid and item iid
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

user: 196        item: 98         r_ui = 4.00   est = 3.95   {'was_impossible': False}


If you are interested in knowing what the joke was for item 98, you can check the dataset. By default, the dataset will be downloaded in your home directory, under `$HOME/.surprise_data/jester/`. The file `jester_items.dat` has the text of the jokes. 😉

Finally, feel free to explore the library further. It might come in handy for your future projects!
