# BLU11 - Learning Notebook - Part 2 of 2 - Content-based Filtering

In [1]:
import os

import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix, load_npz
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Memory-based recommendations

**Memory-based recommenders** use previous interactions to predict the interest of a given user in a particular item, in a personalized way.

The primary assumption is that **users' preferences are stable over time** and, thus, that the items a user will like are similar to those they liked in the past.

In the last notebook, we applied **collaborative filtering** to our problem, one of the most widely adopted personalized recommenders.

However, due to **the cold-start problem**, this approach may fall short for a small community that needs to onboard new players and games (which, by definition, start with few ratings).

# Content-based recommenders

It's time to get back to the whiteboard. 🖊️

You remember that you have metadata about the items — time to explore the `metadata.json` file.

In [2]:
def read_content():
    df = pd.read_json(os.path.join('data', 'metadata.json'), orient='index')
    df = (df.rename(columns={"ID": "VideoGameID"})
            .set_index('VideoGameID')
            .sort_index())
    return df[['Name', 'Genres', 'Description']]


item_content = read_content()

We import the item metadata into a dataframe, hoping we can use it, somehow.

In [3]:
item_content.head()

| VideoGameID | Name | Genres | Description |
|---|---|---|---|
| 0 | 007 Legends | [Action, Shooter] | Gamers and Bond aficionados alike will become ... |
| 1 | 0RBITALIS | [Simulation, Indie] | 0RBITALIS is a satellite launching simulator w... |
| 2 | 1... 2... 3... KICK IT! (Drop That Beat Like a... | [Action, Indie] | <p>Battle your favorite drum &#39;n&#39; bass ... |
| 3 | 10 Second Ninja | [Action, Indie] | Ninjas are cool, this is an established fact o... |
| 4 | 10000000 | [Action, RPG, Casual, Indie, Puzzle] | <p>10000000 is a Dungeon Crawling RPG Matching... |


Interestingly, we have `Genres` and `Description`. **We should be able to incorporate this information into our recommender.**

You focus on genres, for now. What about an $I \times G$ matrix, $P_I$, containing **item profiles**, where $g \in G$ is a genre from the set of all possible genres? Assume there are $n$ items and $w$ genres.

$$P_I = \begin{bmatrix}p_{1, 1} & p_{1, 2} & \dots & p_{1, w}\\ p_{2, 1} & p_{2, 2} & \dots & p_{2, w}\\ \dots & \dots & \dots & \dots \\ p_{n, 1} & p_{n, 2} & \dots & p_{n, w}\end{bmatrix}$$

Values, $p_{ig}$, would represent **whether, or how much, a given genre, $g$, is present in the item, $i$**.
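To make the idea of an item profile concrete, here is a minimal sketch of the simplest case, where $p_{ig}$ only records *whether* a genre is present. The two-game catalogue is hypothetical, and `MultiLabelBinarizer` is used purely for illustration (it is not the approach taken in the rest of this notebook).

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy catalogue: two games with their genre lists.
toy_genres = [["Action", "Shooter"], ["Simulation", "Indie"]]

mlb = MultiLabelBinarizer()
# Rows are items, columns are genres; 1 means the genre is present.
P_I_toy = mlb.fit_transform(toy_genres)

# Column order of the matrix (genres are sorted alphabetically).
genre_names = list(mlb.classes_)
print(genre_names)
print(P_I_toy)
```

Each row is one item and each column one genre, which is exactly the layout of $P_I$ above.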

# Feature Extraction

The first step, then, would be **feature extraction** to generate descriptive representations from the item metadata.

Given the item profiles outlined above, we need to create the matrix $P_I$. Ideas? Think NLP. 📄

## TF-IDF

You hypothesize that **genres, $g$, that occur less frequently might be more descriptive**.

We denote the raw frequency counts, i.e., the number of times that $g$ occurs in $i$, as $f_{ig}$. It is known as **term frequency**. 

In our case, a genre can occur at most once per item (i.e., boolean frequency), but in many cases, things can be different, e.g., imagine if users could assign genres to games.

We also use the **inverse document frequency**. Taking $n$ as the total number of items and $|\{i \in I : g \in i\}|$ as the number of items that contain $g$ (i.e., those with $f_{ig} \neq 0$), it is given by:

$$f'_{g} = \log\frac{n}{|\{i \in I : g \in i\}|}$$

The more items that contain a given genre, the **lower the inverse document frequency**. We can use it to **adjust frequency counts** to **give extra weight to low-frequency genres**.

Hence, we can compute the **TF-IDF (Term Frequency - Inverse Document Frequency)** for a given genre $g$ for item $i$ as:

$$p_{ig} =  f'_g \cdot f_{ig}$$

We interpret it as how much of $g$ there is in $i$, adjusted for rarity, i.e., the rarer, the better.
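As a small worked example of the two formulas above, the sketch below computes the textbook TF-IDF by hand with NumPy. The three-item, three-genre frequency matrix is made up for illustration. Note that `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values will not match these exactly.

```python
import numpy as np

# Hypothetical toy data: rows are items, columns are genres
# (action, indie, puzzle). Entries are the term frequencies f_ig.
tf = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
], dtype=float)

n_items = tf.shape[0]
doc_freq = (tf != 0).sum(axis=0)   # number of items containing each genre
idf = np.log(n_items / doc_freq)   # f'_g = log(n / doc_freq)
tfidf = tf * idf                   # p_ig = f'_g * f_ig

print(idf)
print(tfidf)
```

The genre that appears in every item ("action") gets an IDF of zero, so it contributes nothing to the profiles, while the rarer genres keep a positive weight.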

# Building the item profiles

To build the item profiles, we use the convenient `TfidfVectorizer` ([docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)) provided by `sklearn`.

Since **the transformer is prepared to receive strings** (and not lists), we need to do some Pandas magic to convert the lists to strings and remove spaces to **reduce multi-word genres to a single string**, e.g., "massivelymultiplayer".

In [4]:
genres = item_content['Genres'].apply(";".join).str.replace(' ', '')
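To see what this transformation does, here is the same chain applied to a tiny, hypothetical Series of genre lists (the names `toy` and `toy_strings` are illustrative only):

```python
import pandas as pd

# Hypothetical genre lists, one of them containing a multi-word genre.
toy = pd.Series([["Action", "Massively Multiplayer"], ["RPG"]])

# Join each list with ';' and drop spaces so multi-word genres
# survive tokenization as a single token.
toy_strings = toy.apply(";".join).str.replace(' ', '')
print(toy_strings.tolist())
```

Without the space removal, the vectorizer would split "Massively Multiplayer" into two unrelated tokens.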

Now, we apply the transformer. We verify that **it returns a sparse matrix**, which is excellent.

In [5]:
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(genres)

<5155x19 sparse matrix of type '<class 'numpy.float64'>'
	with 12469 stored elements in Compressed Sparse Row format>

We can **inspect the complete list of genres** at any time.

In [6]:
vectorizer.get_feature_names_out().tolist()

['action',
 'adventure',
 'arcade',
 'boardgames',
 'card',
 'casual',
 'educational',
 'family',
 'fighting',
 'indie',
 'massivelymultiplayer',
 'platformer',
 'puzzle',
 'racing',
 'rpg',
 'shooter',
 'simulation',
 'sports',
 'strategy']

# Item profiles beyond genre

The item profiles **describe the content of the item according to relevant attributes**. Why only use the genre, then? You generalize your definition to include any content or tag, $t \in T$.

Taking $t \in T$ to be one of $w$ tags that describe the content of a given item, the item profiles, $P_I$, form an $I \times T$ matrix:

$$P_I = \begin{bmatrix}p_{1, 1} & p_{1, 2} & \dots & p_{1, w}\\ p_{2, 1} & p_{2, 2} & \dots & p_{2, w}\\ \dots & \dots & \dots & \dots \\ p_{n, 1} & p_{n, 2} & \dots & p_{n, w}\end{bmatrix}$$

Given the above, you decide to **take advantage of the game descriptions as well**.

In [7]:
all_content = item_content['Description'] + genres

We **concatenate the descriptions with the genres** and **fit the transformer on the result**.

In [8]:
item_profiles = vectorizer.fit_transform(all_content)
item_profiles

<5155x33710 sparse matrix of type '<class 'numpy.float64'>'
	with 688709 stored elements in Compressed Sparse Row format>

The result is a much larger and richer matrix with, hopefully, **more descriptive power**.
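One detail of the concatenation worth checking: joining two strings without a separator can fuse the last word of the description with the first genre whenever the description does not end in punctuation. The sketch below illustrates this on a hypothetical description; whether it matters for the real data depends on how the descriptions end.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical description without trailing punctuation.
description = "A fast paced game"
genre_string = "action;indie"

fused = description + genre_string          # "...gameaction;indie"
spaced = description + " " + genre_string   # "...game action;indie"

# Compare the vocabularies learned from each variant.
fused_vocab = set(TfidfVectorizer().fit([fused]).vocabulary_)
spaced_vocab = set(TfidfVectorizer().fit([spaced]).vocabulary_)
print(sorted(fused_vocab))
print(sorted(spaced_vocab))
```

In the fused variant, "game" and "action" collapse into a single token, so the genre is lost as a feature for that item.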

We can also take another approach if we want to treat each column independently: vectorize `Description` and `Genres` separately and stack the two results horizontally.

```python
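from scipy.sparse import hstack

# Fit a separate vectorizer per column so each keeps its own vocabulary.
genre_profile = TfidfVectorizer().fit_transform(genres)
description_profile = TfidfVectorizer().fit_transform(item_content['Description'])
# Stack horizontally while staying sparse (avoids building large dense arrays).
item_profiles_ = hstack([genre_profile, description_profile]).tocsr()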
```

But, for this case, we will use the first approach, where we concatenate the columns into a single string.


# User profiles

Item profiles are not enough to provide personalized recommendations, as they do not consider the taste of the user.

You realize that, so far, you have users represented according to their interactions with items, in the ratings matrix, $R$, and items represented according to the attributes in $P_I$. Is there a way to reconcile both representations, so that **users and items are in the same space**?

We must be able to **uncover user preferences** for attributes, $P_U$, a $U \times T$ matrix:

$$P_U = \begin{bmatrix}p_{1, 1} & p_{1, 2} & \dots & p_{1, w}\\ p_{2, 1} & p_{2, 2} & \dots & p_{2, w}\\ \dots & \dots & \dots & \dots \\ p_{m, 1} & p_{m, 2} & \dots & p_{m, w}\end{bmatrix}$$

We don't have this data explicitly available, i.e., on a silver platter (but almost). But **you realize you can compute it**, as:

$$P_U = R \cdot P_I$$

## What kind of wicked magic is that? 

Because the product of $R$, an $m \times n$ matrix, by $P_I$, an $n \times w$ matrix, is an $m \times w$ matrix, where **each element is a dot-product of user ratings by item attributes**:

$$p_{u, t} = r_u \cdot p_t = \sum\limits_{i=1}^n r_{u, i}p_{i, t}$$

The **$r_u$ stands for a row of the ratings matrix**, i.e., a user, and **$p_t$ stands for a column of the items profile matrix**, i.e., an attribute.

In other words (i.e., more legible 😄), it's as if we are propagating the user rating of item $i$ to its attributes. 

A good rating for $i$ is a good rating for all attributes of $i$, i.e., **if the user likes $i$, we assume they like the content of $i$**.

However, can it be? Let's give it some thought. 

For visual clarity, we will use the convention of **$r_{u,}$ to represent a row of the ratings matrix** and **$p_{,t}$ for columns of the item profiles matrix**:

$$P_U = R \cdot P_I = \begin{bmatrix}r_{1,} \cdot p_{,1} & r_{1,} \cdot p_{,2} & ... & r_{1,} \cdot p_{,w} \\ r_{2,} \cdot p_{,1} & r_{2,} \cdot p_{,2} & ... & r_{2,} \cdot p_{,w} \\ ...  & ... & ... & ...\\ r_{m,} \cdot p_{,1} & r_{m,} \cdot p_{,2} & ... & r_{m,} \cdot p_{,w}\end{bmatrix}$$
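As a quick numeric check of the product above, the sketch below multiplies a hypothetical $2 \times 3$ ratings matrix by a hypothetical $3 \times 2$ item profile matrix, and also writes out one element as the explicit sum over items.

```python
import numpy as np

# Hypothetical toy data: 2 users x 3 items, and 3 items x 2 attributes.
R_toy = np.array([[5.0, 0.0, 1.0],
                  [0.0, 4.0, 0.0]])
P_I_toy = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])

# User profiles: 2 users x 2 attributes.
P_U_toy = R_toy @ P_I_toy

# Element (u=0, t=0) written out as the sum over items i of r_ui * p_it.
p_00 = sum(R_toy[0, i] * P_I_toy[i, 0] for i in range(R_toy.shape[1]))

print(P_U_toy)
print(p_00)
```

The first user rated items 0 and 2, both of which carry the first attribute, so their ratings (5 and 1) add up in that attribute's column.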

Armed with this knowledge, we can now **calculate the user profiles**.

## Build the user profiles

The first step, therefore, is to load the ratings matrix, $R$, created in the first notebook.

In [9]:
R = load_npz(os.path.join('data', 'ratings_matrix.npz'))
R

<12393x5155 sparse matrix of type '<class 'numpy.float64'>'
	with 128804 stored elements in Compressed Sparse Row format>

Then, we use `dot` to **compute the dot product between the ratings matrix and the item profiles**.

In [10]:
def make_user_profiles(R, item_profiles):
    # Use the sparse matrix's own dot method; np.dot is not sparse-aware.
    return R.dot(item_profiles)


user_profiles = make_user_profiles(R, item_profiles)
user_profiles

<12393x33710 sparse matrix of type '<class 'numpy.float64'>'
	with 6539426 stored elements in Compressed Sparse Row format>

*Note this point again:* learning the user profiles requires the product of two matrices! No need for super-advanced maths!

# Making predictions

The prediction step is different when compared to collaborative filtering. First, recap. What do we have?

* $P_I$, an items-attributes, $I \times T$, matrix with item profiles
* $P_U$, a users-attributes, $U \times T$, matrix with user profiles.

What do we want? To know **what items best match the user taste**. 

Therefore, given that items and users are represented in the same space (same features/columns), **we can use the cosine similarity to identify which items are closest to the user profile**. Take **$p^T_i$ and $p^T_u$ to represent a row of the item and user profiles matrices, respectively**:

$$sim(p^T_u, p^T_i) = \cos(\theta) = \frac{p^T_u \cdot p^T_i}{\|p^T_u\| \, \|p^T_i\|}$$
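For a single user and a single item, the formula can be checked by hand. This sketch uses two hypothetical two-attribute profiles and compares the manual computation with `cosine_similarity` from `sklearn`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

p_u = np.array([6.0, 1.0])   # hypothetical user profile
p_i = np.array([1.0, 1.0])   # hypothetical item profile

# Dot product divided by the product of the Euclidean norms.
manual = p_u @ p_i / (np.linalg.norm(p_u) * np.linalg.norm(p_i))

# sklearn expects 2D inputs (one row per vector).
from_sklearn = cosine_similarity(p_u.reshape(1, -1), p_i.reshape(1, -1))[0, 0]

print(manual, from_sklearn)
```

Both values agree, and because the similarity only depends on the angle between the vectors, users who rate a lot (large profile norms) are not favored over users who rate little.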

Again, we use `sklearn` to compute the similarities for us, as:

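$$sim(P_U, P_I) = \frac{P_U \cdot P_I^T}{\|P_U\| \|P_I\|}$$

Here the norms are taken row-wise, so each entry $(u, i)$ of the $m \times n$ result is divided by $\|p^T_u\| \|p^T_i\|$.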

In [11]:
def make_predictions(R, item_profiles, user_profiles):
    
    preds = cosine_similarity(user_profiles, item_profiles)
    
    # Exclude previously rated items.
    preds[R.nonzero()] = 0
    
    return csr_matrix(preds)


content_preds = make_predictions(R, item_profiles, user_profiles)
content_preds

<12393x5155 sparse matrix of type '<class 'numpy.float64'>'
	with 60252254 stored elements in Compressed Sparse Row format>

Please note that, at this point, we obtain **a matrix of predictions with the same shape as the previous one**, generated with collaborative filtering.

As we can see, **we get slightly fewer predictions with content-based recommendations here, although content-based approaches usually produce more. This happens because the ratings matrix in this dataset is comparatively well populated**.

In [12]:
def sparsity(matrix):
    return 1 - matrix.nnz / (matrix.shape[0] * matrix.shape[1])


sparsity(content_preds) 

0.05687734142964063
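To make the masking step and the sparsity measure concrete, here is a minimal sketch on hypothetical toy matrices. It mirrors the logic of `make_predictions` and `sparsity` above, with made-up scores standing in for the cosine similarities.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings: user 0 rated item 0, user 1 rated item 2.
R_toy = csr_matrix(np.array([[5.0, 0.0, 0.0],
                             [0.0, 0.0, 3.0]]))

# Hypothetical similarity scores (users x items).
scores = np.array([[0.9, 0.4, 0.0],
                   [0.2, 0.7, 0.8]])

# Exclude previously rated items by zeroing their scores.
scores[R_toy.nonzero()] = 0
preds_toy = csr_matrix(scores)

# Share of user-item pairs without a (non-zero) prediction.
sparsity_toy = 1 - preds_toy.nnz / (preds_toy.shape[0] * preds_toy.shape[1])
print(preds_toy.toarray())
print(sparsity_toy)
```

Note that a zero in the predictions matrix can mean two different things: the item was already rated, or the user and item profiles share no attributes.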

Note that here we are making predictions **grounded in meaningful content attributes**.

# Filtering

We can now **apply the filtering techniques (best-item, top-$N$) from the first notebook** directly to generate content-based recommendations.

To exemplify, we test the **same** best-item implementation:

In [13]:
def get_best_item(pred):
    return np.negative(pred).toarray().argsort()[:, :1]


get_best_item(content_preds)

array([[ 857],
       [  29],
       [3872],
       ...,
       [ 983],
       [1337],
       [1337]])
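The same idea extends to top-$N$ filtering. The helper below, `get_top_n`, is a hypothetical sketch written for this example (it is not necessarily the implementation from the first notebook): it sorts each user's predictions in descending order and keeps the first $N$ item indices.

```python
import numpy as np
from scipy.sparse import csr_matrix


def get_top_n(pred, n):
    # Hypothetical helper: negate the scores so argsort returns
    # the highest-scoring items first, then keep n columns per user.
    return np.argsort(-pred.toarray(), axis=1)[:, :n]


# Hypothetical predictions for 2 users and 3 items.
toy_preds = csr_matrix(np.array([[0.1, 0.9, 0.5],
                                 [0.8, 0.2, 0.3]]))
top_2 = get_top_n(toy_preds, 2)
print(top_2)
```

Like `get_best_item`, this converts the predictions to a dense array, which is fine for a toy example but can be memory-hungry on the full matrix.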

Out of curiosity, **let's check which game corresponds to the first prediction**.

In [15]:
video_games = pd.read_csv(os.path.join('data', 'video_games.csv'), index_col='VideoGameID')
video_games.loc[857]

VideoGameName    Cities XL Platinum
Name: 857, dtype: object

Remember the gist: content-based recommenders **allow us to make recommendations even in the face of the cold-start problem**.
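To illustrate the cold-start claim, the sketch below uses a hypothetical catalogue where item 2 is brand new and has no ratings at all. Because it shares an attribute with an item the user rated, it still receives a non-zero score.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item profiles: items 0 and 2 share the first attribute.
P_I_toy = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 0.0]])

# One user who rated only item 0; item 2 has no ratings from anyone.
R_toy = np.array([[5.0, 0.0, 0.0]])

# User profile and user-item similarities, as in the notebook.
P_U_toy = R_toy @ P_I_toy
sims = cosine_similarity(P_U_toy, P_I_toy)
print(sims)
```

A purely collaborative approach would have nothing to say about item 2, since nobody has interacted with it yet; here, its content alone is enough to rank it.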