# BLU11 - Learning Notebook - Part 2 of 2 - Content-based recommenders

In [1]:
import os

import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix, load_npz
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Collaborative recommenders use previous interactions to predict the interest of a given user in a particular item. However, due to the cold-start problem, this approach may be limited in the face of a developing community that is onboarding new players and games (that will have a low number of ratings to start with).

In this notebook, we'll extend our system to include item metadata and explore an alternative approach where the recommendations are based on the identification of item properties that each user likes.

## 1. Content-based recommenders

Content-based recommenders extract item features from the available metadata and learn which of the features each user prefers.

It's time to get back to the whiteboard 🖊️ and time to delve into the `metadata.json` file. We import the item metadata into a dataframe.

In [2]:
def read_content():
    df = pd.read_json(os.path.join('data', 'metadata.json'), orient='index')
    df = (df.rename(columns={"ID": "VideoGameID"})
            .set_index('VideoGameID')
            .sort_index())
    return df[['Name', 'Genres', 'Description']]


item_content = read_content()
item_content.head()

| VideoGameID | Name | Genres | Description |
|---|---|---|---|
| 0 | 007 Legends | [Action, Shooter] | Gamers and Bond aficionados alike will become ... |
| 1 | 0RBITALIS | [Simulation, Indie] | 0RBITALIS is a satellite launching simulator w... |
| 2 | 1... 2... 3... KICK IT! (Drop That Beat Like a... | [Action, Indie] | `<p>Battle your favorite drum &#39;n&#39; bass ...` |
| 3 | 10 Second Ninja | [Action, Indie] | Ninjas are cool, this is an established fact o... |
| 4 | 10000000 | [Action, RPG, Casual, Indie, Puzzle] | `<p>10000000 is a Dungeon Crawling RPG Matching...` |


We have the `Genres` and `Description` columns in the form of text. Let's see how we can incorporate this information into our recommender.

### 1.2 Building item profiles

We will start with the genre information. We will construct an item profile matrix $P_{IG}$, a matrix where rows are items and columns are genres $g$ from the set of all possible genres $G$. Assume that there are $n$ items and $w$ genres.

$$P_{IG} = \begin{bmatrix}p_{1, 1} & p_{1, 2} & \dots & p_{1, w}\\ p_{2, 1} & p_{2, 2} & \dots & p_{2, w}\\ \dots & \dots & \dots & \dots \\ p_{n, 1} & p_{n, 2} & \dots & p_{n, w}\end{bmatrix}$$

The matrix values $p_{ig}$ represent whether, or how much, a given genre $g$ is present in the item $i$. What kind of values can we use for $p_{ig}$?

We could use a one-hot encoding to simply represent the presence or absence of the given genre in each item. Or, as you just finished the NLP specialization and are a text processing pro, we can use a TF-IDF vectorization. 📄

In this case, the term frequency is the number of occurrences of genre $g$ in item $i$. For this data, a genre can occur at most once per item (boolean frequency), but things could be different if, for instance, users could assign genres to games. The inverse document frequency weighting then adjusts the term frequencies for the rarity of the genre: genres that appear in fewer items receive a larger weight.

#### 1.2.1 Item profiles from the genres

To build the item profiles, we use your old friend the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer). Since the transformer is prepared to receive strings (and not lists), we need to do some Pandas magic to convert the lists to strings and reduce multi-word genres to a single string, e.g. "Massively Multiplayer" to "MassivelyMultiplayer".

In [3]:
genres = item_content['Genres'].apply(";".join).str.replace(' ', '')

Now we apply the transformer. We verify that it returns a sparse matrix, which is excellent.

In [4]:
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(genres)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 12469 stored elements and shape (5155, 19)>

The list of genres is not that long:

In [5]:
vectorizer.get_feature_names_out()

array(['action', 'adventure', 'arcade', 'boardgames', 'card', 'casual',
       'educational', 'family', 'fighting', 'indie',
       'massivelymultiplayer', 'platformer', 'puzzle', 'racing', 'rpg',
       'shooter', 'simulation', 'sports', 'strategy'], dtype=object)

#### 1.2.2 Item profiles from the item description

Transforming the genre information to item features was easy-peasy. We can apply the same strategy to the item description and simply vectorize the description text. We can in fact join the `Genres` and `Description` text into one string and vectorize it all at once. This will give another set of features, which we'll call tags, $t \in T$. We can store them in the item profile matrix $P_{IT}$, with a row for each of the $n$ items and a column for each of the $z$ tags:

$$P_{IT} = \begin{bmatrix}p_{1, 1} & p_{1, 2} & \dots & p_{1, z}\\ p_{2, 1} & p_{2, 2} & \dots & p_{2, z}\\ \dots & \dots & \dots & \dots \\ p_{n, 1} & p_{n, 2} & \dots & p_{n, z}\end{bmatrix}$$

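Below we concatenate the descriptions with the genres and fit the transformer on the result. Note that the plain `+` adds no separator, so a description that does not end in punctuation or whitespace would have its last word fused with the first genre.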

In [6]:
all_content = item_content['Description'] + genres

In [7]:
item_profiles = vectorizer.fit_transform(all_content)
item_profiles

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 688709 stored elements and shape (5155, 33710)>

The result is a much larger and richer matrix with, hopefully, more descriptive power.

Another approach would be to vectorize the genres and descriptions separately and then horizontally concatenate the two item profile matrices. We'd get a matrix with $n$ rows (one per item) whose columns represent the genres followed by the features extracted from the item descriptions. The code would be the following:

```python
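from scipy.sparse import hstack

# Use a separate vectorizer for each text source so each keeps its own vocabulary.
genre_profile = TfidfVectorizer().fit_transform(genres)
description_profile = TfidfVectorizer().fit_transform(item_content['Description'])
# Stack the sparse matrices directly to avoid building large dense arrays.
item_profiles_ = hstack([genre_profile, description_profile]).tocsr()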
```

Of course, there is always room for applying any feature selection or dimension reduction techniques to this item feature set, but we won't go into that here.
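
For the curious, a dimension reduction step could look roughly like the sketch below. It uses scikit-learn's `TruncatedSVD`, which accepts sparse input, on a hypothetical set of toy descriptions. The choice of `n_components=2` is arbitrary and only serves the illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy descriptions standing in for the real item content.
toy_content = [
    "fast paced shooter with online multiplayer",
    "relaxing puzzle game with cute graphics",
    "tactical shooter with stealth missions",
    "puzzle platformer with hand drawn graphics",
]

toy_profiles = TfidfVectorizer().fit_transform(toy_content)

# Project the sparse TF-IDF features onto 2 latent components.
# n_components=2 is an arbitrary choice for this toy example.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced_profiles = svd.fit_transform(toy_profiles)

print(toy_profiles.shape, "->", reduced_profiles.shape)
```

The reduced matrix is dense, so on a real catalogue the number of components is a trade-off between memory use and how much of the original information is kept.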

### 1.3 Connecting users with items

Item profiles are not enough to provide personalized recommendations as they do not consider the taste of the user.

So far, we have the ratings matrix $R$ which represents user interactions with items and the item profiles matrix $P_{IT}$ which represents the item attributes. Is there a way to connect the item profiles to the user preferences?

Yes, there is! We can project the ratings matrix into the space of item attributes by multiplying the two matrices, which gives us a user profiles matrix $P_{UT}$, an $m \times z$ matrix that connects the item attributes to the users. This will be our user representation in the item attribute space. The matrix has a row for each user and a column for each item attribute or tag.

$$P_{UT} = R \times P_{IT}$$

$$P_{UT} = \begin{bmatrix}p_{1, 1} & p_{1, 2} & \dots & p_{1, z}\\ p_{2, 1} & p_{2, 2} & \dots & p_{2, z}\\ \dots & \dots & \dots & \dots \\ p_{m, 1} & p_{m, 2} & \dots & p_{m, z}\end{bmatrix}$$

Each element of the matrix is a dot-product of user ratings by item attributes:

$$p_{u, t} = r_u \cdot p_t = \sum\limits_{i=1}^n r_{u, i}p_{i, t}$$

The $r_u$ stands for a row of the ratings matrix, i.e. a user, and $p_t$ stands for a column of the items profile matrix, i.e. an attribute. Basically, we weigh each attribute vector (which represents the strength of the given attribute for each item) by the user ratings for each item to get an estimate of how much this particular user likes that attribute. A good rating for item $i$ is a good rating to all attributes of $i$, i.e. if the user likes $i$ we assume they like the attributes of $i$. Effectively, we have transformed the user vector from a representation in the space of items (row in the ratings matrix) to a space of item attributes (row in the user profiles matrix).
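
A small worked example may help here. The sketch below uses made-up numbers (two users, three items, two attributes), and the names `R_toy`, `P_IT_toy` and `P_UT_toy` are hypothetical and not part of the notebook's data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings: 2 users x 3 items (0 means "not rated").
R_toy = csr_matrix(np.array([[5.0, 0.0, 1.0],
                             [0.0, 4.0, 0.0]]))

# Hypothetical item profiles: 3 items x 2 attributes.
P_IT_toy = csr_matrix(np.array([[1.0, 0.0],
                                [0.0, 1.0],
                                [0.5, 0.5]]))

# User profiles: each entry is sum_i r_{u,i} * p_{i,t}.
P_UT_toy = (R_toy @ P_IT_toy).toarray()
print(P_UT_toy)
```

For the first user and the first attribute, the sum is $5 \times 1 + 0 \times 0 + 1 \times 0.5 = 5.5$: the highly rated first item contributes most of the weight, so this user's profile leans strongly towards the first attribute.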

Armed with this knowledge, we can now build the user profiles matrix from our data.

#### 1.3.1 Building the $P_{UT}$ matrix

The first step is to load the ratings matrix $R$ created in the first notebook and saved for later use.

In [8]:
R = load_npz(os.path.join('data', 'ratings_matrix.npz'))
R

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 128804 stored elements and shape (12393, 5155)>

Then we compute the matrix product of the ratings matrix and the item profiles matrix.

In [9]:
# Use the @ operator: np.dot is not aware of scipy sparse matrices.
user_profiles = R @ item_profiles
user_profiles

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6539426 stored elements and shape (12393, 33710)>

## 2. Making predictions

The prediction step in content-based filtering differs from the one in collaborative filtering. First, a recap. What do we have?

* $R$, the ratings matrix.
* $P_{IT}$, a matrix with item profiles.
* $P_{UT}$, a matrix with user preferences represented in the space of item attributes.

What do we want? To know what items best match each user's taste.

Therefore, given that items and users are represented in the same space (same features/columns), we can use the cosine similarity to identify which items are closest to each user. Take $p^T_i$ and $p^T_u$ to represent a row of the $P_{IT}$ and $P_{UT}$ matrices, respectively. The cosine similarity between these vectors is:

$$sim(p^T_u, p^T_i) = cos(\theta) = \frac{p^T_u \cdot p^T_i}{||p^T_u||\ ||p^T_i||}$$

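Again, we can compute the similarities for all the user-item pairs at once using matrix calculations. Here $P_{IT}^\top$ is the transpose of $P_{IT}$ (not to be confused with the tag superscript in $p^T_i$), the denominator stands for the outer product of the row norms of the two matrices, and the division is applied element-wise. The result is an $m \times n$ matrix:

$$sim(P_{UT}, P_{IT}) = \frac{P_{UT} \times P_{IT}^\top}{||P_{UT}||\ ||P_{IT}||}$$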

The similarity values will be used for the predictions matrix and we will then select the items with the highest similarity.
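
As a sanity check on the formula, the sketch below computes the similarities by hand for hypothetical toy profiles and compares them with scikit-learn's `cosine_similarity`. The `*_toy` arrays are made up for the illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical profiles in a shared 2-attribute space.
P_UT_toy = np.array([[5.5, 0.5],
                     [0.0, 4.0]])      # 2 users
P_IT_toy = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.5, 0.5]])      # 3 items

# Numerator: all user-item dot products (users x items).
dots = P_UT_toy @ P_IT_toy.T

# Denominator: outer product of the row norms.
norms = np.outer(np.linalg.norm(P_UT_toy, axis=1),
                 np.linalg.norm(P_IT_toy, axis=1))

manual_sim = dots / norms
print(np.allclose(manual_sim, cosine_similarity(P_UT_toy, P_IT_toy)))
```

Because cosine similarity only depends on the angle between the vectors, a user who rates everything twice as high gets the same similarities as before.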

In [10]:
def make_predictions(R, item_profiles, user_profiles):
    
    preds = cosine_similarity(user_profiles, item_profiles)
    
    # Exclude previously rated items.
    preds[R.nonzero()] = 0
    
    return csr_matrix(preds)


content_preds = make_predictions(R, item_profiles, user_profiles)
content_preds

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 60252254 stored elements and shape (12393, 5155)>

As we can see, we have slightly fewer predictions using content-based recommendations, but usually it's the opposite. It just happens that we have a well-populated ratings matrix in this dataset.

In [11]:
def sparsity(matrix):
    return 1 - matrix.nnz / (matrix.shape[0] * matrix.shape[1])


sparsity(content_preds) 

0.05687734142964063

## 3. Filtering

We can now apply the filtering techniques (best-item, top-$N$) from the first notebook to generate content-based recommendations. To exemplify, we test the same best-item implementation:

In [12]:
def get_best_item(pred):
    return np.negative(pred).toarray().argsort()[:, :1]


get_best_item(content_preds)

array([[ 857],
       [  29],
       [3872],
       ...,
       [ 983],
       [1338],
       [1338]], shape=(12393, 1))

Out of curiosity, let's check which game corresponds to the first prediction.

In [13]:
video_games = pd.read_csv(os.path.join('data', 'video_games.csv'), index_col='VideoGameID')
video_games.loc[857]

VideoGameName    Cities XL Platinum
Name: 857, dtype: object
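
The top-$N$ filtering mentioned earlier in this section can be sketched in a similar way. The function `get_top_n` below is a hypothetical helper written for this illustration and may differ from the implementation in the first notebook. The `toy_preds` matrix is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix


def get_top_n(pred, n):
    # Sort each row by descending score and keep the first n item indices.
    return np.argsort(-pred.toarray(), axis=1)[:, :n]


# Hypothetical predictions: 2 users x 4 items.
toy_preds = csr_matrix(np.array([[0.1, 0.9, 0.0, 0.5],
                                 [0.7, 0.0, 0.8, 0.2]]))
top_2 = get_top_n(toy_preds, 2)
print(top_2)
```

Each row of the result lists the item indices for one user, ordered from the highest to the lowest predicted similarity.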

Remember the gist: content-based recommenders allow us to make recommendations even in the face of the cold-start problem, in particular for new items that have few or no ratings yet.
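
To see why this works for new items, consider the minimal sketch below. It assumes a hypothetical catalogue of three games in which the last one was just added and has no ratings. All names and values are made up for the illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue; the last game is new and has no ratings yet.
toy_content = ["space shooter action",
               "farming simulation casual",
               "arcade shooter action"]
P_IT_toy = TfidfVectorizer().fit_transform(toy_content).toarray()

# One user who has only rated the first game.
R_toy = np.array([[5.0, 0.0, 0.0]])

# The user profile is built from the single rated item.
P_UT_toy = R_toy @ P_IT_toy
scores = cosine_similarity(P_UT_toy, P_IT_toy)
print(scores.round(2))
```

The new game shares the "shooter" and "action" tags with the game the user rated, so it gets a positive score even though nobody has rated it, while the unrelated farming game scores zero.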

## 4. Further reading

We have looked at collaborative recommender systems, which use only user-item interactions, and at content-based recommender systems, which also include item metadata. Another type of recommender system is the hybrid system, which uses both user and item metadata. We did not have user metadata in our case, and it is usually more difficult to obtain than item metadata.

You can read more about recommender systems in this [presentation](https://www.slideshare.net/slideshow/boston-ml-architecting-recommender-systems/111756663#20) from James Kirk.

Now off to the exercise notebook!