<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Recommendation Systems

--- 

<h1>Lesson Guide<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Users-and-items" data-toc-modified-id="Users-and-items-1">Users and items</a></span></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-2">Evaluation</a></span><ul class="toc-item"><li><span><a href="#MAE-and-MSE" data-toc-modified-id="MAE-and-MSE-2.1">MAE and MSE</a></span></li><li><span><a href="#Correlations" data-toc-modified-id="Correlations-2.2">Correlations</a></span></li><li><span><a href="#Precision@k-and-recall@k" data-toc-modified-id="Precision@k-and-recall@k-2.3">Precision@k and recall@k</a></span></li><li><span><a href="#Inter-user-diversity" data-toc-modified-id="Inter-user-diversity-2.4">Inter-user diversity</a></span></li><li><span><a href="#Intra-user-diversity" data-toc-modified-id="Intra-user-diversity-2.5">Intra-user diversity</a></span></li><li><span><a href="#Novelty" data-toc-modified-id="Novelty-2.6">Novelty</a></span></li><li><span><a href="#Coverage" data-toc-modified-id="Coverage-2.7">Coverage</a></span></li></ul></li><li><span><a href="#Baseline-prediction" data-toc-modified-id="Baseline-prediction-3">Baseline prediction</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Example" data-toc-modified-id="Example-3.0.1">Example</a></span></li></ul></li></ul></li><li><span><a href="#Similarity-based-methods" data-toc-modified-id="Similarity-based-methods-4">Similarity based methods</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Example" data-toc-modified-id="Example-4.0.1">Example</a></span></li><li><span><a href="#What-would-be-the-baseline-for-each-user-item-pair?" data-toc-modified-id="What-would-be-the-baseline-for-each-user-item-pair?-4.0.2">What would be the baseline for each user-item pair?</a></span></li><li><span><a href="#Measure-the-user-and-product-similarity-using-the-different-measures-from-the-above." 
data-toc-modified-id="Measure-the-user-and-product-similarity-using-the-different-measures-from-the-above.-4.0.3">Measure the user and product similarity using the different measures from the above.</a></span></li><li><span><a href="#Determine-the-top-3-items-for-each-user." data-toc-modified-id="Determine-the-top-3-items-for-each-user.-4.0.4">Determine the top-3 items for each user.</a></span></li><li><span><a href="#Determine-the-inter-user-diversity." data-toc-modified-id="Determine-the-inter-user-diversity.-4.0.5">Determine the inter-user diversity.</a></span></li><li><span><a href="#Determine-the-intra-user-diversity." data-toc-modified-id="Determine-the-intra-user-diversity.-4.0.6">Determine the intra-user diversity.</a></span></li><li><span><a href="#Determine-the-novelty." data-toc-modified-id="Determine-the-novelty.-4.0.7">Determine the novelty.</a></span></li></ul></li><li><span><a href="#KNN-with-means" data-toc-modified-id="KNN-with-means-4.1">KNN with means</a></span><ul class="toc-item"><li><span><a href="#Example" data-toc-modified-id="Example-4.1.1">Example</a></span></li></ul></li><li><span><a href="#Slope-one-predictor" data-toc-modified-id="Slope-one-predictor-4.2">Slope-one predictor</a></span><ul class="toc-item"><li><span><a href="#Example" data-toc-modified-id="Example-4.2.1">Example</a></span></li></ul></li></ul></li><li><span><a href="#Content-based-filtering" data-toc-modified-id="Content-based-filtering-5">Content based filtering</a></span><ul class="toc-item"><li><span><a href="#Singular-Value-decomposition" data-toc-modified-id="Singular-Value-decomposition-5.1">Singular Value decomposition</a></span><ul class="toc-item"><li><span><a href="#Use-TruncatedSVD-to-reduce-the-dimensionality-of-the-rating-matrix." data-toc-modified-id="Use-TruncatedSVD-to-reduce-the-dimensionality-of-the-rating-matrix.-5.1.1">Use <code>TruncatedSVD</code> to reduce the dimensionality of the rating matrix.</a></span></li></ul></li></ul></li></ul></div>

Recommendation systems are highly relevant for many companies providing online content, and almost everybody interacts with them frequently. In this lesson we want to understand how they work.

When we try to recommend items to users, we face a few fundamental problems:

- Data Sparsity
    - There are lots of products to recommend to many users. 
    - It is unlikely that a user will ever try out a large fraction of products.
    - A few items are demanded by many users, but many only by a few.
   
- Cold Start
    - We need to be able to give recommendations to users about whom we have only scarce data (if any at all).
    
- Accurate, but also diverse predictions
    - We want to give useful recommendations in the sense that they match the user's preferences, but also that the recommendation contains some novelty for the user. 

- Evaluation
    - Evaluation is difficult and might differ from algorithm to algorithm.

- Scalability
    - We need to be able to give recommendations on the spot even though there might be millions of users and items which we have to analyze carefully.

- User interface
    - Users want to know why they get particular recommendations.

- Vulnerability to attacks
    - We do not want our recommendation system to be abused for promoting or inhibiting particular items.
 
- Temporal resolution
    - Tastes and preferences do not remain the same over time. The algorithms that we will see neglect any kind of dynamics.

## Users and items

In general we speak of users and items.

> **Users:** indicate preferences for products through explicit/implicit ratings

> **Items:** products which should be recommended and which have received ratings

In most cases we are going to predict a certain rating for each possible pair of user and item. If the user already gave some rating we can compare it to our prediction:

- True rating: $r_{ui}$
- Predicted rating: $\hat{r}_{ui}$

## Evaluation

### MAE and MSE

We can compare all existing ratings to our predictions, for example using the root mean squared error (RMSE) or the mean absolute error (MAE):

    
$$
\begin{eqnarray*}
{\rm MAE} &=& \frac{1}{|R|}\sum_{(u,i)\in R}|r_{ui}-\hat{r}_{ui}|\\
{\rm RMSE} &=& \left(\frac{1}{|R|}\sum_{(u,i)\in R}(r_{ui}-\hat{r}_{ui})^2\right)^{1/2}
\end{eqnarray*}
$$

Here, $R$ stands for the set of all user-item pairs for which a true rating exists. $|\cdot|$ indicates the cardinality of the set (here the number of rated user-item pairs).
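
As a minimal sketch of the two error measures, the following computes MAE and RMSE with NumPy. The rating values are made up for illustration; only pairs with a known true rating enter the sums.

```python
import numpy as np

# Hypothetical true and predicted ratings for five rated user-item pairs.
r_true = np.array([4.0, 3.0, 5.0, 2.0, 1.0])
r_pred = np.array([3.5, 3.0, 4.0, 2.5, 2.0])

errors = r_true - r_pred
mae = np.mean(np.abs(errors))         # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error

print(f"MAE:  {mae:.3f}")   # 0.600
print(f"RMSE: {rmse:.3f}")  # 0.707
```

Because the errors are squared before averaging, the RMSE penalizes large individual errors more strongly than the MAE does.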

### Correlations

Alternatively one can use correlations between true and predicted values for model evaluation, e.g.

- the Pearson correlation
- the Spearman rank correlation
- Kendall's tau

You can compute all three, for example, with pandas' `.corr()` method by setting the `method` argument.
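
A short sketch of how the three correlations could be obtained with pandas. The ratings are invented for illustration, and the rank-based methods may require SciPy to be installed.

```python
import pandas as pd

# Hypothetical true and predicted ratings for five user-item pairs.
ratings = pd.DataFrame({
    "true":      [4.0, 3.0, 5.0, 2.0, 1.0],
    "predicted": [3.5, 3.2, 4.0, 2.5, 2.8],
})

# Series.corr accepts method="pearson", "spearman" or "kendall".
correlations = {
    method: ratings["true"].corr(ratings["predicted"], method=method)
    for method in ("pearson", "spearman", "kendall")
}

for method, value in correlations.items():
    print(f"{method}: {value:.3f}")
```

In this toy data only the two lowest-rated items are ranked in the wrong order, so both rank correlations stay high while still falling short of 1.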
    
The above scores work well for model assessment if we have explicit ratings, but in the case of implicit ratings we might only be able to rank the items. In general, we would like to recommend the top-ranked items, so we have to check whether the top-ranked items are really the ones relevant to the user, or whether we predicted higher ratings for some irrelevant items.
For that purpose, we can use the usual classification metrics.

### Precision@k and recall@k

Often users will not really care about all the rating predictions we are making, but instead they will have a major interest only in a few top-ranked items, let's say the $k$ top-ranked items. So it is appropriate to take only these $k$ ratings into account. We then ask how many of these $k$ items are relevant to the user. The relevance is in general difficult to measure, but we can for example ask how many out of the $k$ top-ranked items have a score beyond a certain threshold.

We can then define the so-called precision@k and recall@k for $k$ recommended items:

$$
\begin{eqnarray*}
{\rm precision@k} &=& \frac{\rm  Recommended\ items\ that\ are\ relevant}{\rm Recommended\ items}\\
\\
{\rm recall@k} &=& 
\frac{\rm  Recommended\ items\ that\ are\ relevant}{\rm Relevant\ items}
\end{eqnarray*}
$$

Out of these scores we can define an F1@k score in the usual way.
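
The following sketch computes precision@k, recall@k and F1@k for a single user. It assumes that the $k$ items with the highest predicted rating are the recommended ones and that an item is relevant if its true rating reaches a threshold; the function name, the threshold of 3.5 and the data are illustrative choices, not part of the lesson.

```python
def precision_recall_at_k(predicted, actual, k, threshold):
    """predicted / actual: dicts mapping item -> rating for one user."""
    # The k items with the highest predicted rating are recommended.
    top_k = sorted(predicted, key=predicted.get, reverse=True)[:k]
    recommended = set(top_k)
    # Assumption: an item is relevant if its true rating reaches the threshold.
    relevant = {item for item, r in actual.items() if r >= threshold}
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical predictions and true ratings for one user.
predicted = {"a": 4.8, "b": 4.1, "c": 3.9, "d": 2.0, "e": 1.5}
actual = {"a": 5.0, "b": 3.0, "c": 4.0, "d": 4.5, "e": 1.0}

precision, recall = precision_recall_at_k(predicted, actual, k=2, threshold=3.5)
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"precision@2 = {precision:.3f}")  # 0.500
print(f"recall@2    = {recall:.3f}")     # 0.333
print(f"F1@2        = {f1:.3f}")         # 0.400
```

Here items `a` and `b` are recommended, while `a`, `c` and `d` are relevant, so one of two recommendations is a hit and one of three relevant items is recovered.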

### Inter-user diversity

We can compare how similar the recommendations are that we make for different users. We would like our recommender to make individual predictions based on user preferences, so predicting always the same top items would not be a good sign.

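We can measure the inter-user diversity by computing, for each pair of users, the cosine similarity between their lists of the $k$ top-ranked items (for example encoded as indicator vectors over the items) and then averaging over all user pairs. A lower average similarity indicates more diverse recommendations.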
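
One possible way to compute this is sketched below. It assumes that each top-$k$ list is encoded as a binary indicator vector over the items; the recommendation lists themselves are hypothetical.

```python
from itertools import combinations

import numpy as np

# Hypothetical top-3 recommendation lists.
top_k = {
    "user_0": ["item_0", "item_3", "item_1"],
    "user_1": ["item_0", "item_2", "item_1"],
    "user_2": ["item_0", "item_2", "item_1"],
}
items = sorted({item for recs in top_k.values() for item in recs})


def indicator(recs):
    """Binary vector marking which items appear in a recommendation list."""
    return np.array([1.0 if item in recs else 0.0 for item in items])


def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


pair_similarities = [
    cosine_similarity(indicator(top_k[u]), indicator(top_k[v]))
    for u, v in combinations(top_k, 2)
]
mean_similarity = np.mean(pair_similarities)
print(f"mean inter-user similarity: {mean_similarity:.3f}")  # 0.778
```

With lists of equal length $k$, the cosine similarity of two indicator vectors is simply the number of shared items divided by $k$.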

### Intra-user diversity

We can measure how similar the $k$ items are that we recommend to a particular user to obtain the intra-user diversity:

$$
I_u(k) = \frac{1}{k(k-1)}\sum_{i\neq j}{\rm sim}({\rm item}_i,{\rm item}_j)
$$

Averaging over all users gives the mean intra-list similarity of the recommendations. The lower this similarity, the more diverse each individual recommendation list is.
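
As a small sketch of the formula for $I_u(k)$, the following averages the off-diagonal entries of an item-item similarity matrix for one user's top-3 list. The similarity values are invented for illustration.

```python
import numpy as np

# Hypothetical pairwise similarities between the 3 items recommended to one user.
sim = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])

k = sim.shape[0]
# Sum over all ordered pairs i != j: total sum minus the diagonal.
off_diagonal_sum = sim.sum() - np.trace(sim)
intra_similarity = off_diagonal_sum / (k * (k - 1))

print(f"I_u({k}) = {intra_similarity:.3f}")  # 0.467
```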

### Novelty

We can measure the novelty of a recommendation by measuring how popular the recommended items are. For example, we could take the degree of each item in the bipartite user-item network (the number of users who rated it) and average this degree over the recommendation list for each user before averaging these numbers over all $M$ users. A lower value means that less popular, and in that sense more novel, items are recommended:

$$
{\rm Novelty}(k) = \frac{1}{M k}\sum_{u=1}^{M}\sum_{i\ {\rm in\  top\  k\ of\ u}} {\rm degree}_i
$$
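
A sketch of this measure, assuming that the degree of an item is the number of users who rated it. The rating matrix and the top-2 lists are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sparse rating matrix (NaN = not rated).
ratings = pd.DataFrame(
    [[5.0, 3.0, np.nan, np.nan, np.nan],
     [4.0, np.nan, np.nan, 1.0, np.nan],
     [1.0, 1.0, np.nan, 5.0, np.nan],
     [np.nan, 1.0, 5.0, 4.0, np.nan]],
    index=["u0", "u1", "u2", "u3"],
    columns=["a", "b", "c", "d", "e"],
)

# Degree of each item in the bipartite user-item network.
degree = ratings.notna().sum(axis=0)

# Hypothetical top-2 recommendations (items each user has not rated yet).
top_k = {"u0": ["d", "c"], "u1": ["b", "c"], "u2": ["c", "e"], "u3": ["a", "e"]}

M, k = len(top_k), 2
novelty = sum(degree[item] for recs in top_k.values() for item in recs) / (M * k)
print(f"Novelty({k}) = {novelty:.2f}")  # 1.50
```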

### Coverage

Finally we can measure the so-called coverage, the fraction of all $N$ items that appear in at least one of our top-$k$ recommendation lists, where $N_{\rm distinct}$ is the number of distinct recommended items:

$$
{\rm Coverage}(k) = \frac{N_{\rm distinct}}{N}
$$
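
A minimal sketch of the coverage computation, using hypothetical top-3 lists and a catalogue of five items:

```python
# Hypothetical top-3 recommendation lists for three users.
top_k = {
    "user_0": ["item_0", "item_3", "item_1"],
    "user_1": ["item_0", "item_2", "item_1"],
    "user_2": ["item_0", "item_2", "item_1"],
}
n_items = 5  # total number of items N in the catalogue (assumed)

# Distinct items that appear in at least one recommendation list.
distinct_items = set().union(*top_k.values())
coverage = len(distinct_items) / n_items
print(f"Coverage(3) = {coverage:.2f}")  # 0.80
```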

## Baseline prediction

As we are predicting ratings $\hat{r}_{ui}$ of user $u$ on item $i$, the baseline should be the mean of all ratings, $\mu$.

As we will always be considering a specific user or a specific item, we can determine how much each user's or item's rating is above the average. 

Therefore we add a bias term $b_u$ for each user and $b_i$ for each item, so that our baseline is

$$
{\rm baseline}_{ui} = \mu + b_u + b_i
$$

#### Example

Let's say for example that the average rating is $\mu=3.52$, our user is very enthusiastic and on average evaluates items better than the average user by $b_u=0.3$, and the item is very popular receiving above average ratings with $b_i=0.5$. So we would have a baseline prediction of $4.32$.
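
The text does not fix how the biases are estimated. One simple option, sketched below as an assumption, is to take $b_u$ and $b_i$ as the deviations of the user mean and the item mean from the global mean on a fully observed toy matrix.

```python
import pandas as pd

# Hypothetical, fully observed rating matrix.
ratings = pd.DataFrame(
    [[4.0, 3.0, 2.0],
     [5.0, 4.0, 3.0]],
    index=["user_0", "user_1"],
    columns=["item_0", "item_1", "item_2"],
)

mu = ratings.to_numpy().mean()   # global mean
b_u = ratings.mean(axis=1) - mu  # user bias: user mean minus global mean
b_i = ratings.mean(axis=0) - mu  # item bias: item mean minus global mean

# baseline_ui = mu + b_u + b_i for every user-item pair (via broadcasting).
baseline = pd.DataFrame(
    mu + b_u.to_numpy()[:, None] + b_i.to_numpy()[None, :],
    index=ratings.index,
    columns=ratings.columns,
)
print(baseline)

# The worked example from the text: 3.52 + 0.3 + 0.5
print(round(3.52 + 0.3 + 0.5, 2))  # 4.32
```

In this toy matrix the ratings are exactly additive in user and item effects, so the baseline reproduces them; on real data it only captures the average tendencies.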

We will now look at various models which make more accurate predictions based on either similarity or content.

## Similarity based methods

In collaborative filtering, we use similarity measures to infer from past ratings what to recommend in the future. This can be 

- **item based:** a user who already rated some items positively would like to have similar items recommended in the future
- **user based:** users who agree on their item ratings will do so also in the future

The term collaborative filtering stems from the fact that multiple users have to share their data to obtain useful recommendations.


We can use similarity measures we have already seen frequently:

- **Pearson correlation:** 
    - Computed as usual, but only shared ratings are taken into account: items rated by both users (in the user-based case) or users who have rated both items (in the item-based case).

- **Cosine similarity**

$$
{\rm sim}_{\cos}(u,v) = \frac{u\cdot v}{\|u\|\|v\|}
$$

- **Mean squared difference**

$$
{\rm msd}(u,v) = \frac{1}{|I_{uv}|}\sum_{i\in I_{uv}}(r_{ui}-r_{vi})^2
$$

and then

$$
{\rm msd\_sim}(u,v) = \frac{1}{{\rm msd}(u,v)+1} 
$$

and similarly for item pairs $i,j$. $|I_{uv}|$ is the number of items rated by both users $u$ and $v$.

There exist many more based on the usual distance metrics, but also many specialized for recommendation system requirements.
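
The three measures can be sketched for a pair of users as follows. The rating vectors are hypothetical, missing ratings are encoded as NaN, and all measures are restricted to the co-rated items; the cosine similarity is computed as the normalized dot product.

```python
import numpy as np

# Hypothetical ratings of two users on five items (NaN = not rated).
u = np.array([4.5, 3.1, np.nan, 3.5, 1.5])
v = np.array([4.2, 2.8, 3.2, np.nan, 2.1])

# Restrict to the items I_uv rated by both users.
both = ~np.isnan(u) & ~np.isnan(v)
u_c, v_c = u[both], v[both]

pearson = np.corrcoef(u_c, v_c)[0, 1]
cosine = u_c @ v_c / (np.linalg.norm(u_c) * np.linalg.norm(v_c))
msd = np.mean((u_c - v_c) ** 2)
msd_sim = 1 / (msd + 1)

print(f"Pearson: {pearson:.3f}")
print(f"Cosine:  {cosine:.3f}")
print(f"MSD:     {msd:.3f}, MSD similarity: {msd_sim:.3f}")
```

For item-based similarity the same computation applies to two columns of the rating matrix instead of two rows.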

#### Example

Consider the following user-item rating matrix.

In [1]:
import numpy as np
import pandas as pd

rui = pd.DataFrame([[4.5,3.1,2.1,3.5,1.5],
                    [4.2,2.8,3.2,1.8,2.1],
                    [4.8,4.0,4.2,2.5,2.1]],
                    columns = [f'item_{i}' for i in range(5)],
                    index = [f'user_{u}' for u in range(3)])
rui

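|        | item_0 | item_1 | item_2 | item_3 | item_4 |
| ------ |:------:|:------:|:------:|:------:|:------:|
| user_0 | 4.5    | 3.1    | 2.1    | 3.5    | 1.5    |
| user_1 | 4.2    | 2.8    | 3.2    | 1.8    | 2.1    |
| user_2 | 4.8    | 4.0    | 4.2    | 2.5    | 2.1    |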


#### What would be the baseline for each user-item pair?

#### Measure the user and product similarity using the different measures from the above.

In [2]:
from scipy.spatial.distance import pdist, squareform

#### Determine the top-3 items for each user.

#### Determine the inter-user diversity.

#### Determine the intra-user diversity.

#### Determine the novelty.

### KNN with means

In the kNN with means model, we look at either the $k$ most similar users or items and predict

- user similarity:

$$
\hat{r}_{ui} = \mu_u+\frac{\sum_{v \in N_i^k(u)}{\rm sim}(u,v)(r_{vi}-\mu_v)}{\sum_{v \in N_i^k(u)}{\rm sim}(u,v)}
$$

- item similarity:


$$
\hat{r}_{ui} = \mu_i+\frac{\sum_{j \in N_u^k(i)}{\rm sim}(i,j)(r_{uj}-\mu_j)}{\sum_{j \in N_u^k(i)}{\rm sim}(i,j)}
$$

Here $N_i^k(u)$ denotes the $k$ most similar users to user $u$ who rated item $i$, and $N_u^k(i)$ denotes the $k$ items most similar to item $i$ that were rated by user $u$.

#### Example

As an example, let's take user similarity measured by Pearson correlation and consider the two nearest neighbors of user $1$.

Let's say we have 

$\mu_1 = 3$, $\mu_2 = 2$, $\mu_3 = 4$,
${\rm sim}(1,2)=0.8$, ${\rm sim}(1,3)=0.5$, 

and want to predict for item 1 with 

$r_{21}=3.2$ and $r_{31}=3.8$.

Then we obtain

$$
\hat{r}_{11} = 3 +\frac{0.8(3.2-2)+0.5(3.8-4)}{0.8+0.5}
= 3+\frac{0.8\cdot1.2+0.5\cdot(-0.2)}{1.3}
\approx 3.66
$$
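
The arithmetic of this example can be checked with a few lines of Python. The helper `knn_with_means` is a hypothetical name introduced here; it implements only the user-based formula for neighbors that are already selected.

```python
def knn_with_means(mu_u, neighbors):
    """User-based kNN-with-means prediction.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples
    for the k nearest neighbors who rated the item.
    """
    numerator = sum(sim * (r - mu_v) for sim, r, mu_v in neighbors)
    denominator = sum(sim for sim, _, _ in neighbors)
    return mu_u + numerator / denominator


# Values from the example: user 1 with neighbors 2 and 3, item 1.
prediction = knn_with_means(
    mu_u=3.0,
    neighbors=[(0.8, 3.2, 2.0), (0.5, 3.8, 4.0)],
)
print(f"{prediction:.2f}")  # 3.66
```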

### Slope-one predictor


This scheme makes use of both the information from other users who rated the same item and from the other items rated by the same user to predict a rating.

$$
\hat{r}_{ui} = \mu_u + \frac{1}{|R_i(u)|}\sum_{j \in R_i(u)}{\rm dev}(i,j)
$$

where $R_i(u)$ is the set of items $j$ rated by user $u$ that share at least one rater with item $i$,
and the average difference between the ratings of items $i$ and $j$ is

$$
{\rm dev}(i,j) = \frac{1}{|U_{ij}|}\sum_{u\in U_{ij}}(r_{ui}-r_{uj})
$$

and $U_{ij}$ is the set of all users that have rated both items $i$ and $j$.

#### Example

Consider the following example of ratings

|         | item 1 | item 2|
| ------- |:------:| -----:|
| user 1  | 2      | 1.8   |
| user 2  | 1      |  ?    |

Then we have

$\mu_{\rm user 2}=1$, $|U_{12}|=1$, $r_{11}=2$, $r_{12}=1.8$ and

$$
\hat{r}_{22} = 1+\frac{1}{1}(1.8-2) = 0.8
$$ 
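
A minimal sketch of the slope-one predictor that reproduces this example. The function `slope_one_predict` is a hypothetical helper using zero-based indices, with missing ratings encoded as NaN; falling back to the user mean when no relevant item exists is an assumption made here.

```python
import numpy as np

# Rows: user 1, user 2; columns: item 1, item 2 (NaN = not rated).
ratings = np.array([
    [2.0, 1.8],
    [1.0, np.nan],
])


def slope_one_predict(ratings, u, i):
    """Predict the rating of user u for item i (zero-based indices)."""
    mu_u = np.nanmean(ratings[u])
    deviations = []
    for j in range(ratings.shape[1]):
        if j == i or np.isnan(ratings[u, j]):
            continue  # only items j that user u has rated
        # Users who rated both items i and j.
        common = ~np.isnan(ratings[:, i]) & ~np.isnan(ratings[:, j])
        if common.any():
            deviations.append(np.mean(ratings[common, i] - ratings[common, j]))
    if not deviations:
        return mu_u  # assumption: fall back to the user mean
    return mu_u + np.mean(deviations)


prediction = slope_one_predict(ratings, u=1, i=1)
print(f"{prediction:.1f}")  # 0.8
```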

## Content based filtering

### Singular Value decomposition

Remember principal component analysis for a data matrix $X$ of shape $n\times p$. There we looked at the correlation matrix of the data and decomposed it into three matrices. After standard scaling the data matrix, this could be written (up to a constant normalization factor) as 

$$
A = X^T X = V D V^T
$$

$A$ will be of shape $p\times p$, $D$ will be a diagonal matrix with the eigenvalues of $A$ on the diagonal, and $V$ is the matrix of eigenvectors of $A$. As the correlation matrix $A$ is symmetric, the eigenvectors are pairwise orthogonal. This is known as the spectral theorem and helped us to decorrelate the data by transforming to the new coordinate system. This entails that the matrix $V^T$ is the inverse matrix of $V$. Such matrices are named orthogonal.

Similarly we could have obtained a matrix

$$
B = X X^T = U D' U^T
$$

of shape $n\times n$ with similar properties.

This implies that each matrix (square or not, symmetric or not) can be written as a product of three matrices

$$
X = U \Sigma V^T
$$

where $U$ is of shape $n\times n$, $V$ is of shape $p\times p$ and $\Sigma$ of shape $n\times p$ is a (rectangular) diagonal matrix with the so-called singular values on its diagonal (which are simply the square roots of the eigenvalues of $X^TX$). Both $U$ and $V$ are orthogonal matrices.
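
The relation between singular values and eigenvalues can be checked numerically. The sketch below uses a small example matrix chosen for illustration and compares the singular values of $X$ with the square roots of the eigenvalues of $X^TX$.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [1.0, 3.0, 3.0],
              [1.0, 2.0, 1.0]])

# Singular values are returned in descending order.
singular_values = np.linalg.svd(X, compute_uv=False)

# eigvalsh handles symmetric matrices and returns eigenvalues in
# ascending order, so we reverse them to match.
eigenvalues = np.linalg.eigvalsh(X.T @ X)[::-1]

print(singular_values)
print(np.sqrt(eigenvalues))
print(np.allclose(singular_values, np.sqrt(eigenvalues)))  # True
```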

In the same way as for principal component analysis, we can reduce the dimensionality by restricting to the components with the $K$ largest singular values.

This is what we are going to do with the user-item matrix. As it is very sparse anyway, we can expect that we won't lose too much information and that we will extract the main tastes and item attributes in this way.

That is, whereas the rating matrix can be written exactly as

$$
R = U\Sigma V^T
$$

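we can approximate it by keeping only the $K<{\rm min}(n,p)$ largest singular values, that is by choosing $\Sigma_K$ of shape $K\times K$ together with the first $K$ columns of $U$ and $V$, denoted $U_K$ and $V_K$, as

$$
\hat{R} = U_K\Sigma_K V_K^T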
$$

Isolating global, user and item biases we can then write the rating prediction as 

$$
\hat{r}_{ui} = \mu + b_u + b_i + \sum_{k=1}^K p_{uk}q_{ki}
$$

Not restricting $K$ will lead to perfect predictions on the training set.
    

In [3]:
import numpy as np

In [4]:
A = np.array([[1,2,3],[1,3,3],[1,2,1]])
A

array([[1, 2, 3],
       [1, 3, 3],
       [1, 2, 1]])

In [5]:
# SVD with numpy

U, S, VT = np.linalg.svd(A)
Sigma = np.diag(S) # put the returned array S on the diagonal of a matrix
Sigma

array([[6.15142284, 0.        , 0.        ],
       [0.        , 1.02970897, 0.        ],
       [0.        , 0.        , 0.3157475 ]])

In [6]:
# reconstruct the original matrix
(U.dot(Sigma)).dot(VT)

array([[1., 2., 3.],
       [1., 3., 3.],
       [1., 2., 1.]])

In [7]:
# restrict to the 2 largest singular values
U[:,:2].dot(Sigma[:2,:2]).dot(VT[:2,:])

array([[0.83697452, 2.07128611, 2.99626008],
       [1.20389433, 2.91084317, 3.00467747],
       [0.87547428, 2.05445133, 0.9971433 ]])

In [8]:
# extract the individual parts
U[:,:2].dot(Sigma[:2,:2])

array([[-3.69143694, -0.58448292],
       [-4.3529303 ,  0.04953885],
       [-2.29462286,  0.84630147]])

In [9]:
VT[:2,:]

array([[-0.2732291 , -0.66149335, -0.69840704],
       [ 0.29365013,  0.63402177, -0.7153922 ]])

In [10]:
# do the same with sklearn

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
svd.fit_transform(A)

array([[ 3.69143694, -0.58448292],
       [ 4.3529303 ,  0.04953885],
       [ 2.29462286,  0.84630147]])

In [11]:
svd.components_

array([[ 0.2732291 ,  0.66149335,  0.69840704],
       [ 0.29365013,  0.63402177, -0.7153922 ]])

In [12]:
svd.fit_transform(A).dot(svd.components_)

array([[0.83697452, 2.07128611, 2.99626008],
       [1.20389433, 2.91084317, 3.00467747],
       [0.87547428, 2.05445133, 0.9971433 ]])

#### Use `TruncatedSVD` to reduce the dimensionality of the rating matrix.