# BLU11 - Learning Notebook - Part 2 of 3 - Content-based Filtering

In [1]:
import os

import numpy as np

from scipy.sparse import coo_matrix
from sklearn.metrics.pairwise import cosine_similarity

# 1 Memory-based Recommender System (RS)

Memory-based RS use previous interactions to predict the interest of a given user in a particular item, in a personalized way.

This approach differs from the typical model-based techniques, as we are not learning any parameters.

The primary assumption is that user preferences are stable over time and, thus, that the items a user will like are similar to those they liked in the past.

In this notebook, we start by analyzing a particular type of memory-based recommendation: content-based filtering.

Then, in the next notebook, we learn about collaborative filtering, the most widely adopted technique when building RS.

# 2 Content-based RS

Content-based RS are among the most complex RS in terms of components, involving:
1. The base model, comprised of **Users** and **Items** (i.e., a Community), plus **Ratings**
2. **Item Profiles**, describing the content of the item according to relevant attributes
3. A **User Model**, describing user preferences regarding item attributes, so both entities are in the same vector space.

The diagram below offers a sneak peek at what they look like, in the end.

![recommender_systems_framework](../media/recommender_systems_framework_content_based.png)

We derive all of these different components employing a series of sequential steps, in what we call a pipeline.

Content-based pipelines are somewhat standardized, following the diagram below.

![content_based_filtering](../media/content_based_filtering.png)

We now zoom in on each of the individual components.

# 3 Feature Extraction

The first step in most pipelines is **feature extraction**, in which we mine item metadata to generate descriptive representations.

Such metadata can be either structured or unstructured. For unstructured data, the feature extraction process is akin to NLP.

## 3.1 TF-IDF Weighting

Let $f_{i, t}$ denote the raw frequency count, i.e., the number of times that term $t$ occurs in item $i$. Then:

$$tf(i, t) = f_{i, t}$$

$$idf(t, I) = \log\frac{n}{|\{i \in I : t \in i\}|}$$

Here, $n = |I|$ is the number of items in the corpus and $|\{i \in I : t \in i\}|$ is the number of items that contain $t$, i.e., those with $f_{i, t} > 0$.

Therefore, we can compute the TF-IDF (Term Frequency - Inverse Document Frequency) value for a given term $t$ and an item $i$, in corpus $I$, as:

$$tfidf(i, t, I) = tf(i, t) \times idf(t, I)$$
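To make the formulas concrete, here is a minimal sketch on a made-up corpus of three items and three terms. The counts in `f` are invented for illustration, and we assume the natural logarithm, which is what `np.log` computes.

```python
import numpy as np

# Hypothetical raw counts f[i, t]: rows are items, columns are terms.
f = np.array([
    [2, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
])

n = f.shape[0]                      # number of items in the corpus
df = np.count_nonzero(f, axis=0)    # number of items containing each term
idf = np.log(n / df)                # idf(t, I), one value per term
tfidf = f * idf                     # tf(i, t) * idf(t, I), broadcast over rows

print(idf)
print(tfidf)
```

The third term occurs in every item, so its IDF is zero and it contributes nothing to any item's representation.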

## 3.2 Item Profiles

Let $T$ be a size-$k$ set of extracted terms or concepts and $I$ a size-$n$ item-set. Omitting the corpus argument $I$ for brevity, item profiles are represented as:

$$\begin{bmatrix}tfidf(i_0, t_0) & tfidf(i_0, t_1) & \dots & tfidf(i_0, t_{k-1}) \\ tfidf(i_1, t_0) & tfidf(i_1, t_1) & \dots & tfidf(i_1, t_{k-1}) \\ \dots & \dots & \dots & \dots \\ tfidf(i_{n-1}, t_0) & tfidf(i_{n-1}, t_1) & \dots & tfidf(i_{n-1}, t_{k-1})\end{bmatrix}$$

TF-IDF assumes that rare terms have more descriptive power, even if rarity doesn't imply more significance in all contexts.

The item profiles matrix $P \in \mathbb{R}^{\space n \space \times \space k}$ has items as row-vectors in a vector space in which there is one column for each term.

## 3.3 Advanced Techniques

Advanced techniques, which we don't cover, allow us to mine the items themselves (e.g., text, images, audio) for features.

An example would be convolutional neural networks, commonly applied to analyzing visual imagery.

## 3.4 Item Profiles in Practice

In this example, we use the [MovieLens Dataset](https://grouplens.org/datasets/movielens/). From the source:

> The small dataset 100,000 movie ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

Although we are using the small version of the dataset, a full version with 26,000,000 ratings is [available from the website](https://grouplens.org/datasets/movielens/latest/).

The ratings are on a 1-5 scale, and both User and Item IDs start at 1.

### 3.4.1 Read the Data

We stored some preprocessed data under `../data/processed/genres.csv`.

This table contains previously extracted genre information about some of the movies, which we need to store as item profiles.

In [2]:
def read_genres():
    
    path = os.path.join('..', 'data', 'processed', 'genres.csv')
    
    movies = np.genfromtxt(path, dtype='int', skip_header=True, usecols=[0], delimiter=',')
    genres = np.genfromtxt(path, dtype='object', skip_header=True, usecols=[1], delimiter=',')
    
    return movies, genres


movies, genres = read_genres()

We start by loading the data, creating two different arrays: `movies` and `genres`, which are the same size and aligned.

(Also, we use `os.path.join` to ensure compatibility with different Operating Systems.)

### 3.4.2 Sparse Matrix Structure

Although we don't have loads of data (as of yet), the best practice in RS, as we've seen, is to work with sparse matrices. Let's build one.

Since `genres` contains strings, let's use `np.unique` to transform it into positions that we can use as column indexes.

In [3]:
genre_unique, genre_pos, genre_count = np.unique(genres, return_inverse=True, return_counts=True)

Please note the new parameter `return_counts=True`, which we didn't use before.

The extra array it returns, `genre_count`, tells us how many movies have each genre, in the entire corpus. We use it later on when computing TF-IDF values.

In [4]:
for k, i in zip(genre_unique, genre_count):
    print(k, i)

b'Action' 1545
b'Adventure' 1117
b'Animation' 447
b'Children' 583
b'Comedy' 3315
b'Crime' 1100
b'Documentary' 495
b'Drama' 4365
b'Fantasy' 654
b'Film-Noir' 133
b'Horror' 877
b'IMAX' 153
b'Musical' 394
b'Mystery' 543
b'Romance' 1545
b'Sci-Fi' 792
b'Thriller' 1729
b'War' 367
b'Western' 168
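As a small illustration of what the three arrays returned by `np.unique` hold, consider the sketch below. The `labels` array is made up; it stands in for the `genres` column.

```python
import numpy as np

# Made-up genre labels, one per (movie, genre) pair.
labels = np.array(['Drama', 'Comedy', 'Drama', 'Action', 'Comedy', 'Drama'])

unique, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)

print(unique)    # sorted distinct labels
print(inverse)   # position of each original label within `unique`
print(counts)    # number of occurrences of each distinct label
```

Indexing `unique` with `inverse` rebuilds the original array, which is why the inverse positions can serve as column indexes.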


We want to build a `COO` matrix, as suggested in the previous notebook. Then, we convert it to `CSR`.

Thus, we need the aligned row (movies) and column (genres) arrays, containing the coordinates of non-zero values.

We subtract one from the movie ID, because Python arrays start at 0, while IDs start at 1.

In [5]:
cols = genre_pos
rows = movies - 1

To compute the shape of the matrix, we use the maximum movie ID in our corpus and the maximum genre position.

This way, we ensure that all possible combinations of movie-genre coordinate pairs are compatible with our matrix structure.

In [6]:
nrows = rows.max() + 1
ncols = cols.max() + 1
shape = (nrows, ncols)
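The following sketch shows, on made-up coordinates, how aligned row and column arrays plus a shape turn into a sparse matrix. The IDs, positions and placeholder weights of 1.0 are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical 1-based movie IDs and 0-based genre positions, aligned pair by pair.
movie_ids = np.array([1, 1, 3, 4])
genre_positions = np.array([0, 2, 1, 2])

toy_rows = movie_ids - 1
toy_cols = genre_positions
toy_shape = (toy_rows.max() + 1, toy_cols.max() + 1)

# Placeholder weights; any aligned array of values would do here.
toy_data = np.ones(len(toy_rows))

toy = coo_matrix((toy_data, (toy_rows, toy_cols)), shape=toy_shape).tocsr()
print(toy.toarray())
```

Movie ID 2 never appears in the coordinates, so its row exists but is entirely zero. The same happens in the real matrix for every ID below the maximum that has no genre entry.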

### 3.4.3 Sparse Matrix Values

As for the values, we know that each genre either is present or not present for a movie.

Therefore, the frequency count is binary, either 1 or 0. 

Thus, the TF-IDF value of a genre that is present reduces to its IDF, and we only need to compute the IDF to give more weight to the more obscure genres.

In [7]:
idf = np.log(np.divide(nrows, genre_count))

for k, i in zip(genre_unique, idf):
    print(k, i)

b'Action' 4.6707942827237385
b'Adventure' 4.995171672986383
b'Animation' 5.911014877442015
b'Children' 5.645386285705092
b'Comedy' 3.907360569435621
b'Crime' 5.010508013269122
b'Documentary' 5.809015709486894
b'Drama' 3.6322000037818816
b'Fantasy' 5.530466120598385
b'Film-Noir' 7.12322434383383
b'Horror' 5.237066479683401
b'IMAX' 6.983135550663149
b'Musical' 6.03722256275765
b'Mystery' 5.716464152121649
b'Romance' 4.6707942827237385
b'Sci-Fi' 5.339012080241158
b'Thriller' 4.558274986372294
b'War' 6.108211624001014
b'Western' 6.889609492652325


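By applying a vectorized version of the formula above, we succeeded in giving more weight to less common genres. Note that `nrows` is the highest movie ID rather than the number of distinct movies, so it is only a proxy for $n$; since it is the same constant for every genre, it does not change how the genres rank.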

Finally, we generate the data by retrieving the corresponding IDF values for each position.

In [8]:
data = idf[genre_pos]
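The lookup above relies on NumPy integer-array indexing. A tiny sketch with invented values (the names `idf_toy` and `pos_toy` are ours) shows the effect:

```python
import numpy as np

idf_toy = np.array([0.5, 1.2, 2.0])     # one made-up IDF value per genre
pos_toy = np.array([2, 0, 0, 1, 2])     # genre position of each (movie, genre) pair

data_toy = idf_toy[pos_toy]             # one weight per pair, aligned with pos_toy
print(data_toy)
```

The result has one entry per (movie, genre) pair, which is exactly the alignment a `COO` matrix expects between values and coordinates.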

### 3.4.4 Building a Wrapper

It's useful (and good practice) to encapsulate all of this logic into a single function.

In [9]:
def make_item_profiles(movies, genres):
    
    genre_unique, genre_pos, genre_count = np.unique(genres, return_inverse=True, return_counts=True)
    
    cols = genre_pos
    rows = movies - 1
    
    nrows = rows.max() + 1
    ncols = cols.max() + 1
    shape = (nrows, ncols)
    
    idf = np.log(np.divide(nrows, genre_count))
    
    data = idf[genre_pos]
    
    coo = coo_matrix((data, (rows, cols)), shape=shape)
    
    return coo.tocsr()


P = make_item_profiles(movies, genres)
P

<164979x19 sparse matrix of type '<class 'numpy.float64'>'
	with 20322 stored elements in Compressed Sparse Row format>

Once we have the Item Profiles in a sparse matrix, in the convenient shape, the feature extraction step is complete.

# 4 Profile Learner

Once we have the items as vectors in the new feature space, we need to **learn user profiles**, so that we can represent users in the same space.

The purpose of the Profile Learner is to uncover user preferences that, in short, tell us what the preference of a user for an attribute is.

Borrowing from Economics, we denote preference by utility, i.e., the utility of attribute $t$ to user $u$.

$$\begin{bmatrix}util(u_0, t_0) & util(u_0, t_1) & \dots & util(u_0, t_{k-1}) \\ util(u_1, t_0) & util(u_1, t_1) & \dots & util(u_1, t_{k-1}) \\ \dots & \dots & \dots & \dots \\ util(u_{m-1}, t_0) & util(u_{m-1}, t_1) & \dots & util(u_{m-1}, t_{k-1})\end{bmatrix}$$

The User Model matrix $M \in \mathbb{R}^{\space m \space \times \space k}$, with $m$ the number of users, has users as row-vectors and the same number of columns as the Item Profiles matrix $P$.

In our example, the attributes are movie genres.

## 4.1 Uncovering Preferences
 
However, how can we infer user utility? We don't have this data. (Explicitly, that is.)

We combine the Ratings (that, remember, establish relationships between users and items) with the Item Profiles.

### 4.1.1 Read the Ratings Data

We use the same approach we did for `genres.csv`, by creating three distinct arrays:
* Users
* Movies
* Ratings.

This format is convenient to build `COO` sparse matrices, which is what we want to do.

In [10]:
def read_ratings():
    path = os.path.join('..', 'data', 'ml-latest-small', 'ratings.csv')
    
    users = np.genfromtxt(path, dtype='int', skip_header=True, usecols=[0], delimiter=',')
    movies = np.genfromtxt(path, dtype='int', skip_header=True, usecols=[1], delimiter=',')
    ratings = np.genfromtxt(path, skip_header=True, usecols=[2], delimiter=',')
    
    return users, movies, ratings


users, movies, ratings = read_ratings()

### 4.1.2 Ratings Matrix

We don't detail the construction of the ratings matrix because we use the same logic as above.

The main difference is that we use `max(P.shape[0], cols.max() + 1)` for the number of columns in the matrix structure, so that the ratings matrix has at least as many columns as `P` has rows.

Since we have a sample of data and we are not sure about which table has the highest possible movie ID value in the dataset, we play it safe.

In [11]:
def make_ratings(users, movies, ratings, P):
    
    cols = movies - 1
    rows = users - 1
    
    nrows = rows.max() + 1
    ncols = max(P.shape[0], cols.max() + 1)
    shape = (nrows, ncols)
    
    data = ratings
    
    coo = coo_matrix((data, (rows, cols)), shape=shape)
    
    return coo.tocsr()

    
R = make_ratings(users, movies, ratings, P)

Please note that, since we preprocessed the data, we know that the highest value is, in fact, in the Item Profiles matrix `P`.

If we did not know this, i.e., in the real world, the `max` should be included in both `make` functions.

Why? To make sure that we consider the highest value in both sets combined when building the structure of our matrices.
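One way to do this is to compute the shared item dimension once and pass it to both functions. The sketch below uses a hypothetical helper, `shared_n_items`, and made-up ID arrays; it is not part of the notebook's code.

```python
import numpy as np


def shared_n_items(*movie_id_arrays):
    """Return an item dimension large enough for every array of 1-based movie IDs."""
    # With 0-based positions (ID - 1), the required size equals the largest ID.
    return max(int(ids.max()) for ids in movie_id_arrays)


genre_movies = np.array([1, 2, 5])     # made-up IDs from the genres table
rating_movies = np.array([2, 3, 7])    # made-up IDs from the ratings table

n_items = shared_n_items(genre_movies, rating_movies)
print(n_items)
```

Both matrices would then use `n_items` as their item dimension (rows of the profiles matrix, columns of the ratings matrix), regardless of which table holds the highest ID.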

### 4.1.3 User Model

Wait, but why indeed? Why are shapes so important, at all times?

Because the product of $R \in \mathbb{R}^{\space m \space \times \space n}$ and $P \in \mathbb{R}^{\space n \space \times \space k}$ is the matrix $M = RP \in \mathbb{R}^{\space m \space \times \space k}$, where:

$$M_{ij} = \sum\limits_{l=1}^n R_{il}P_{lj}$$

Wait. Do you see it? $M$ is an $m$ by $k$ matrix, i.e., users by attributes. However, can it be?

Let's give it some thought, armed with some linear algebra.

$$M = RP = \begin{bmatrix}r_0^Tp_0 & r_0^Tp_1 & \dots & r_0^Tp_{k-1} \\ r_1^Tp_0 & r_1^Tp_1 & \dots & r_1^Tp_{k-1} \\ \dots & \dots & \dots & \dots \\ r_{m-1}^Tp_0 & r_{m-1}^Tp_1 & \dots & r_{m-1}^Tp_{k-1}\end{bmatrix}$$

Again, since $R \in \mathbb{R}^{\space m \space \times \space n}$ and $P \in \mathbb{R}^{\space n \space \times \space k}$, we have $r_i \in \mathbb{R}^n$ (row $i$ of $R$, the ratings of user $i$) and $p_j \in \mathbb{R}^n$ (column $j$ of $P$, the weight of attribute $j$ in every item).

Each element is a dot-product of user ratings by item-attributes. (Take the time to do the math by hand if needed. We know we did.)

So we can uncover user preferences with the wicked sorcery of linear algebra!

In code, we call the `dot` method of the sparse matrix, since `np.dot` is not aware of SciPy sparse matrices.

In [12]:
M = R.dot(P)

As usual, we encapsulate it into a function.

In [13]:
def learn_user_profile(R, P):
    # Sparse matrix product: (users x items) times (items x attributes).
    return R.dot(P)


M = learn_user_profile(R, P)
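To check the intuition on a small scale, the sketch below multiplies a made-up ratings matrix by a made-up profiles matrix. All values and the `_toy` names are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up ratings: 2 users x 3 items (0 means "not rated").
R_toy = csr_matrix(np.array([
    [5.0, 0.0, 1.0],
    [0.0, 4.0, 0.0],
]))

# Made-up item profiles: 3 items x 2 attributes.
P_toy = csr_matrix(np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [1.0, 2.0],
]))

M_toy = R_toy.dot(P_toy)     # users x attributes, still a sparse matrix
print(M_toy.toarray())
```

The first user rated the first item highly, and that item only has the first attribute, so most of that user's weight lands on the first attribute. The second user only rated the second item, so their profile is a scaled copy of that item's profile.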

# 5 Prediction

The **Prediction** step consists of comparing user and item vectors, which are now all in the same vector space.

Items whose profile is the most similar to that of a given user have the highest score.

At this point, we are asking (somewhat philosophically, we must say): what does most similar mean?

Let's look into some alternative distance and similarity metrics used in RS.

## 5.1 Distance Metrics

### 5.1.1 Euclidean Distance

The most straightforward distance measure is the Euclidean distance. Logically, the higher the distance, the lesser the similarity.

At its core, we subtract the vectors element-wise, square and sum the differences, and take the square root. Given vectors $x, y \in \mathbb{R}^n$:

$$d(x, y) = \sqrt{\sum\limits_{k=1}^n (x_k - y_k)^2}$$

Often, this is the default method for distance-based machine learning algorithms, such as *k*NN.

There are many problems with the Euclidean distance, for example, if the elements are in different scales (i.e., non-standardized).
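The scale problem can be seen in a minimal sketch with two invented features on very different scales. The vectors and the helper name `euclidean` are assumptions for illustration.

```python
import numpy as np


def euclidean(x, y):
    """Euclidean distance between two vectors of equal length."""
    return np.sqrt(np.sum((x - y) ** 2))


# Made-up vectors: the first feature is small-scale, the second large-scale.
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_100.0])   # very different on feature 1, close on feature 2
c = np.array([26.0, 52_000.0])   # almost equal on feature 1, further on feature 2

print(euclidean(a, b))
print(euclidean(a, c))
```

Although `c` is nearly identical to `a` on the first feature, it ends up much further away than `b`, because the large-scale feature dominates the sum of squares.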

### 5.1.2 Dot-product Similarity

The dot product can also be seen as a measure of similarity (the higher, the better). Take again $x, y \in \mathbb{R}^n$:

$$d(x, y) = \sum\limits_{i=1}^n x_{i}y_{i}$$

Simplistically, a large dot product indicates that elements in both vectors have significant values in the same positions.

But this product considers the magnitude of the vectors, i.e., their length, as given by the Euclidean norm:

$$||x|| = \sqrt{x_1^2 + x_2^2 + ... + x_n^2}$$

So, we penalize two relatively identical vectors because one of them has a lower magnitude. Also, we favor vectors with high magnitude.

Now, both the Euclidean distance and the dot product take the magnitude into account, which is often not what we want. What can we do?
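A short sketch with invented vectors illustrates how magnitude affects the dot product:

```python
import numpy as np

user = np.array([1.0, 2.0, 0.0])
same_direction = np.array([2.0, 4.0, 0.0])     # user scaled by 2
other_direction = np.array([0.0, 5.0, 20.0])   # different orientation, larger norm

print(user @ user)             # 5.0
print(user @ same_direction)   # 10.0
print(user @ other_direction)  # 10.0
```

A vector pointing in a clearly different direction scores as high as one perfectly aligned with the user, simply because it is longer. The user's own vector, which is identical in direction, scores lowest of the three.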

### 5.1.3 Cosine Similarity

As usual, there is a better way: using just the direction of the vectors, i.e., the angle between them, and ignoring their magnitude.

Hence, the cosine similarity, which measures differences in orientation by using the cosine of the angle between two vectors.

$$cos(x, y) = \frac{x \cdot y}{||x|| \space ||y||}$$

The cosine similarity is, simply put, the dot product of the normalized vectors. Given a vector $a \in \mathbb{R}^n$:

$$\hat{a} = \frac{a}{||a||}$$

We return a vector $\hat{a}$ with the same orientation as the original vector but with length one, i.e., a unit-vector. 

Additionally, the cosine similarity efficiently considers only the non-zero dimensions in sparse vectors.
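As a sanity check, here is a minimal hand-written version of the formula in plain NumPy. The helper name `cosine` and the vectors are ours; for matrices, `sklearn`'s `cosine_similarity` computes the same quantity for every pair of rows.

```python
import numpy as np


def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))


x = np.array([1.0, 2.0, 0.0])
y = 10 * x                        # same direction, ten times the magnitude
z = np.array([0.0, 0.0, 3.0])     # orthogonal to x

print(cosine(x, y))
print(cosine(x, z))

# Equivalent view: the dot product of the unit vectors.
x_hat = x / np.linalg.norm(x)
y_hat = y / np.linalg.norm(y)
print(x_hat @ y_hat)
```

Scaling a vector leaves the similarity at (numerically) one, while orthogonal vectors score zero. Note that this hand-written helper divides by zero for an all-zero vector, so it is only meant for non-zero inputs.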

### 5.1.4 Pearson Correlation

Finally, we can measure the similarity as the correlation, i.e., the linear relationship between the vectors.

The Pearson correlation is the most commonly used coefficient, defined as:

$$Pearson(x, y) = \frac{cov(x, y)}{\sigma_x \sigma_y}$$

Where $cov$ is the covariance and $\sigma_x$ and $\sigma_y$ the standard deviations of $x$ and $y$, respectively.
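The sketch below computes the coefficient from its definition on two invented vectors, and compares it with `np.corrcoef` and with the cosine similarity of the mean-centered vectors, which yields the same number.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

# Pearson from the definition (population covariance and standard deviations).
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
pearson = cov_xy / (x.std() * y.std())

# Cosine similarity of the mean-centered vectors.
xc, yc = x - x.mean(), y - y.mean()
cosine_centered = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(pearson, cosine_centered, np.corrcoef(x, y)[0, 1])
```

In other words, the Pearson correlation can be read as a cosine similarity computed after subtracting each vector's mean.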

### 5.1.5 Making Predictions

Most RS software uses the cosine similarity to make predictions, based on arithmetic operations.

We use the implementation from `sklearn`, but others are available.

(Again, note that we are not learning any parameters.)

By computing the similarity between the User Model, $M$, and the Item Profiles, $P$, we obtain a matrix of predicted scores (on the cosine scale, not on the original rating scale).

In [14]:
L = cosine_similarity(M, P)

The matrix $L$, which contains how similar each user is to each item, has the same shape as our original ratings matrix $R$.

In [15]:
R.shape == L.shape

True

Let's encapsulate this logic into a self-explanatory function.

In [16]:
def make_predictions(M, P):
    return cosine_similarity(M, P)


L = make_predictions(M, P)

Finally, we can predict the utility of a given item to a given user, in a personalized way.

# 6 Filtering Component

The filtering component, which is very similar to what we did for non-personalized RS, performs two main tasks:
* Removes previously rated items (mask with non-zeros from ratings matrix)
* Provides a list with the top-*N* most recommended items for the user.

## 6.1 Removing Rated Items

We can use a mask to select the previously rated items and replace their predictions with minus one.

In [17]:
def mask_rated_items(L, R):
    L = L.copy()
    L[R.nonzero()] = -1
    return L


L_ = mask_rated_items(L, R)
L_

array([[ 0.56411863,  0.50560889,  0.19625371, ...,  0.20389081,
         0.        ,  0.        ],
       [ 0.46780893,  0.33612265,  0.5464153 , ...,  0.39789046,
         0.        ,  0.        ],
       [ 0.39299699,  0.27181483,  0.43588561, ...,  0.37690207,
         0.        ,  0.09482589],
       ...,
       [ 0.27850585,  0.14707266,  0.42388622, ...,  0.48424713,
         0.        ,  0.        ],
       [-1.        ,  0.24911524,  0.38030068, ...,  0.24945139,
         0.        ,  0.        ],
       [-1.        ,  0.48489548,  0.49310015, ...,  0.46084461,
         0.        ,  0.04315785]])
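The masking step can be reproduced on a toy example. The score and rating values below are made up, and the variable names are ours.

```python
import numpy as np
from scipy.sparse import csr_matrix

scores = np.array([
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.3],
])
rated = csr_matrix(np.array([
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 2.5],
]))

masked = scores.copy()            # keep the original scores untouched
masked[rated.nonzero()] = -1      # nonzero() gives (row_indices, col_indices)
print(masked)
```

Since the ratings and IDF weights in this notebook are non-negative, the cosine scores lie between 0 and 1, so a value of -1 is guaranteed to rank below every unrated item.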

## 6.2 Best-item

Unsurprisingly, we use `argmax` to get the best item.

Don't forget that we need to add 1 to convert it back to the original movie ID.

In [18]:
def get_best_item(L):
    return L.argmax(axis=1) + 1


best_item = get_best_item(L_)
best_item[:100]

array([  8361,   4956,   4956,    546,      4,   4818,  40339,    459,
         1912,  55116,   6016,   4956, 134853,  40339,   4956,   1912,
         2058,   4956,   4956,  42015,   4956,  27032,   1912,   6016,
           20,     20,   1912,   1912,   6395,   3893,     20,  55116,
           72,   8361,   1912,   1912,  26093,     14,   7235,   8361,
         8361,   8361,      4,   4956,    496,   6990,   6016, 161594,
         2890,   4956,   5018,      4,  55116,    459,  55116,   4956,
         4956,  55116,   6990,   1396,  26093,   8361,   4956,    459,
           72,    519,   4956,  45672,  26093,   4956,   2940,   8361,
        55116,   8361,  27032,    319,  27032,  58025,   8361,   4719,
           14,   4956,     72,  55116,  55116,      4,   4956,   4956,
        27032,    459,  27032,   4956,  42015,    459,  56069,   1783,
         4956,  42015,   4956,   4956])

## 6.3 Top-N

And we use `argsort` for the top-*N*, as we did in the previous learning unit.

In [19]:
def get_top_n(L, n):
    return np.negative(L).argsort()[:, :n] + 1


top_5 = get_top_n(L_, 5)
top_5

array([[  8361,  48774, 117529,  91500,  58025],
       [  4956,    496,   5666,  42015,  74438],
       [  4956,   1432,   7007,   5027,     20],
       ...,
       [ 55116,  70728,   7235,   4956,    145],
       [  3893,   1912,    459,   3598,   3800],
       [  4956,  42015,  74438,    970,   6990]])
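To see what the `argsort` trick does, here is a small sketch on an invented score matrix in which already-rated items carry the mask value of -1.

```python
import numpy as np

scores = np.array([
    [0.1, 0.9, -1.0, 0.4],
    [0.7, -1.0, 0.2, 0.8],
])

n = 2
# Negating the scores makes the ascending sort return the best items first.
top_positions = np.argsort(-scores, axis=1)[:, :n]
top_ids = top_positions + 1      # back to 1-based item IDs
print(top_ids)
```

The first column of the result matches `argmax` plus one, i.e., the best item for each user, and masked items never reach the top as long as enough unrated items remain.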