# Content-based Recommenders

## 1 About the Data

From the [source](https://www.kaggle.com/prajitdatta/movielens-100k-dataset/):

> MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.

> This data set consists of:
> * 100,000 ratings (1-5) from 943 users on 1682 movies.
> * Each user has rated at least 20 movies.
> * Simple demographic info for the users (age, gender, occupation, zip)

> The data was collected through the MovieLens web site during the seven-month period from September 19th, 
1997 through April 22nd, 1998.

## 2 Reading the Data

### 2.1 Items

In [1]:
import numpy as np
import pandas as pd

Item data is stored as a delimiter-separated values file with `sep='|'`. The file contains no headers, so we need to supply the column names ourselves.

We also need to set `encoding='ISO-8859-1'` to avoid encoding errors when reading the data.

In [2]:
items_colnames = ['movie_id', 'title', 'release_date', 'video_release_date', 
                  'imdb_url', 'unknown', 'action', 'adventure', 'animation', 
                  'children', 'comedy', 'crime', 'documentary', 'drama', 
                  'fantasy', 'film_noir', 'horror', 'musical', 'mystery', 
                  'romance', 'sci_fi', 'thriller', 'war', 'western']
    
# Make sure you unzip the .zip file in src/data/ before running this cell
items_all_columns = pd.read_csv('../data/ml-100k/u.item', sep='|', header=None, 
                                names=items_colnames, encoding='ISO-8859-1')

We will drop the columns we don't need to build our recommender.

In [3]:
items_clean = items_all_columns.drop(['release_date', 'video_release_date', 
                                      'imdb_url'], axis=1)

### 2.2 Ratings

Ratings data, on the other hand, is stored as a tab-delimited file. It has no headers either, so we need to supply them manually.

In [4]:
ratings_colnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings_all_columns = pd.read_csv('../data/ml-100k/u.data', sep='\t', 
                                  header=None, names=ratings_colnames)

We will drop the `timestamp` column, as we will not need it throughout the exercise.

In [5]:
ratings_clean = ratings_all_columns.drop(['timestamp'], axis=1)

### 2.3 Read All

We will put it all together to create our `items` and `ratings` dataframes, which we will use to build our recommender.

In [6]:
# Make sure you unzip the .zip file in src/data/ before running this cell
def make_data():  
    items = make_items_data()
    ratings = make_ratings_data()
    return items, ratings


def make_items_data():
    items_colnames = ['movie_id', 'title', 'release_date', 'video_release_date', 
                      'imdb_url', 'unknown', 'action', 'adventure', 'animation', 
                      'children', 'comedy', 'crime', 'documentary', 'drama', 
                      'fantasy', 'film_noir', 'horror', 'musical', 'mystery', 
                      'romance', 'sci_fi', 'thriller', 'war', 'western']
    items_all_columns = pd.read_csv('../data/ml-100k/u.item', sep='|', 
                                    header=None, names=items_colnames, 
                                    encoding='ISO-8859-1')
    items_clean = items_all_columns.drop(['release_date', 'video_release_date', 
                                          'imdb_url'], axis=1)
    return items_clean


def make_ratings_data():
    ratings_colnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings_all_columns = pd.read_csv('../data/ml-100k/u.data', sep='\t', 
                                      header=None, names=ratings_colnames)   
    ratings_clean = ratings_all_columns.drop(['timestamp'], axis=1)
    return ratings_clean


items, ratings = make_data()

The `items` dataframe contains 19 genre columns: a 1 indicates the movie belongs to that genre and a 0 indicates it does not. A movie can belong to several genres at once.

In [7]:
items.head(n=3)

Unnamed: 0,movie_id,title,unknown,action,adventure,animation,children,comedy,crime,documentary,...,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
0,1,Toy Story (1995),0,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [8]:
items.describe()

Unnamed: 0,movie_id,unknown,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
count,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0
mean,841.5,0.001189,0.149227,0.080262,0.02497,0.072533,0.300238,0.064804,0.029727,0.431034,0.01308,0.014269,0.054697,0.033294,0.036266,0.146849,0.060048,0.149227,0.042212,0.016052
std,485.695893,0.034473,0.356418,0.271779,0.156081,0.259445,0.458498,0.246253,0.169882,0.495368,0.11365,0.118632,0.227455,0.179456,0.187008,0.354061,0.237646,0.356418,0.201131,0.125714
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,421.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,841.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1261.75,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1682.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


The `ratings` dataframe contains the full dataset: 100,000 ratings (1-5) by 943 users on 1,682 items. Each user has rated at least 20 movies.

In [9]:
ratings.head(n=3)

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1


In [10]:
ratings.describe()

Unnamed: 0,user_id,movie_id,rating
count,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986
std,266.61442,330.798356,1.125674
min,1.0,1.0,1.0
25%,254.0,175.0,3.0
50%,447.0,322.0,4.0
75%,682.0,631.0,4.0
max,943.0,1682.0,5.0


## 2 Building a Content-Based Filtering Recommender (TL;DR)

The whole point of content-based filtering is to build up a profile of the things a user likes, and use it to *predict* how much he or she will like other items.

The universe of all possible item attributes defines a *content-space*, and each item has a position in that space (see [vector space](https://en.wikipedia.org/wiki/Vector_space_model)) that describes its content.

The key concept is building a vector of item attribute preferences for each user - what we call a *user profile* - and using that to make predictions.

Item profiles can be combined with user actions to create the user profiles we need to match against future items.

The user profile is a vector in the same content-space, and the match between the user's profile and the item is measured by how closely the two align.

This is how this is typically done:

1. Collect or compute item vectors that describe items in the corpus' content-space (e.g. document text, keywords, tags, metadata)
2. Use item vectors and user actions to build user profiles as vectors that reveal user preferences in the same content-space
3. Predict user interest in previously unseen items of the corpus.
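
The three steps above can be sketched on toy data. This is a minimal illustration, not the notebook's implementation: the item vectors, the ratings and the choice of cosine similarity as the measure of alignment are all assumptions made for the example.

```python
import numpy as np

# Step 1: item vectors in a toy content-space (columns: drama, sci_fi, comedy).
item_vectors = np.array([
    [1.0, 1.0, 0.0],  # item 0: drama + sci-fi
    [0.0, 0.0, 1.0],  # item 1: comedy
    [1.0, 0.0, 0.0],  # item 2: drama
    [0.0, 1.0, 1.0],  # item 3: sci-fi + comedy
])

# Step 2: user profile = rating-weighted sum of the items the user rated.
rated_items = [0, 1]
user_ratings = np.array([5.0, 1.0])
user_profile = user_ratings @ item_vectors[rated_items]


# Step 3: score unseen items by how closely they align with the profile.
def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


unseen_items = [2, 3]
scores = {i: cosine_similarity(user_profile, item_vectors[i])
          for i in unseen_items}
best_item = max(scores, key=scores.get)
print(user_profile, scores, best_item)
```

Here the user rated the drama/sci-fi item highly and the comedy poorly, so the pure drama item ends up ranked above the sci-fi/comedy one.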

## 2 Item Attributes

### 2.1 Item Vectors

In a content-based recommender, preferences are defined in terms of *content*: a set of attributes that describe the items we are recommending.

We should start by modelling items according to their relevant attributes, e.g. describing movies by their genres.

The good thing is: *this is already done for us* in the dataset! Terry Gilliam's Twelve Monkeys is modelled as drama and sci-fi, for example.

In [11]:
items.iloc[6]

movie_id                           7
title          Twelve Monkeys (1995)
unknown                            0
action                             0
adventure                          0
animation                          0
children                           0
comedy                             0
crime                              0
documentary                        0
drama                              1
fantasy                            0
film_noir                          0
horror                             0
musical                            0
mystery                            0
romance                            0
sci_fi                             1
thriller                           0
war                                0
western                            0
Name: 6, dtype: object

From there, we rely on the principle of **stable preferences**, the idea that underlies most recommender systems.

Assuming that user preferences are stable over time, we can *reveal* those preferences by attribute, inferring them from the items the user liked in the past.

From there, we can simply recommend new items with the attributes the user prefers the most. We call this *content-based filtering* (CBF from here onwards).

**In a nutshell: if I like Twelve Monkeys, we infer that I like drama and sci-fi.**

Note you could use attributes or *tags* other than movie genres, like the director or the main actors, for example.
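
As a rough sketch of that idea, a categorical attribute can be turned into 0/1 tag columns with `pd.get_dummies`, giving it the same shape as the genre flags. The `director` column and its values below are hypothetical; `u.item` does not contain this information.

```python
import pandas as pd

# Hypothetical metadata: the `director` column is NOT part of u.item.
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'director': ['Director A', 'Director B', 'Director A'],
})

# One-hot encode the director into 0/1 tag columns, like the genre flags.
director_tags = pd.get_dummies(movies['director'], prefix='director',
                               dtype=int)
movies_with_tags = pd.concat([movies[['movie_id']], director_tags], axis=1)
print(movies_with_tags)
```

The resulting columns could then be placed next to the genre columns and treated as additional tags.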

### 2.2 Creating User Profiles

What we want to do is to combine the user ratings with the metadata for each movie. There are different strategies, but we will `merge` them.

In [12]:
user_profile_data = ratings.merge(items)
user_profile_data.head(n=3)

Unnamed: 0,user_id,movie_id,rating,title,unknown,action,adventure,animation,children,comedy,...,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
0,196,242,3,Kolya (1996),0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,63,242,3,Kolya (1996),0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,226,242,5,Kolya (1996),0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


Now, we will multiply the movie rating by the tags, so that each tag present in the movie carries the rating the user gave it.

We will select all columns corresponding to tags, and `multiply` them by the ratings.

In [13]:
user_ratings = user_profile_data['rating']
user_profile_tags = user_profile_data.iloc[:, 4:].multiply(user_ratings, axis=0)
user_profile_tags.head()

Unnamed: 0,unknown,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0


A user's profile is a vector made up of the user's relative preference for each tag. One way to build it is to sum all of the user's ratings per tag.

In [14]:
user_profiles = pd.concat([user_profile_data.iloc[:, 0:1], user_profile_tags], 
                          axis=1).groupby('user_id').sum()
user_profiles.head()

user_id,unknown,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
1,4,250,123,40,55,316,86,24,420,7,5,45,38,18,173,172,188,92,22
2,0,38,13,4,12,61,34,0,134,3,9,6,3,14,66,15,43,11,0
3,0,39,14,0,0,31,30,5,64,0,5,12,4,35,17,22,53,14,0
4,0,31,14,0,0,20,19,5,27,0,0,4,5,20,13,23,43,9,0
5,4,176,107,53,71,246,35,0,72,5,5,71,40,9,44,116,56,45,5


Normalizing the user vectors makes the results easier to interpret, although it is not mandatory.

In [15]:
user_profiles = user_profiles.divide(user_profiles.sum(axis=1), axis=0)
user_profiles.head()

user_id,unknown,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
1,0.001925,0.120308,0.059192,0.019249,0.026468,0.152069,0.041386,0.01155,0.202117,0.003369,0.002406,0.021655,0.018287,0.008662,0.083253,0.082772,0.090472,0.044273,0.010587
2,0.0,0.081545,0.027897,0.008584,0.025751,0.130901,0.072961,0.0,0.287554,0.006438,0.019313,0.012876,0.006438,0.030043,0.141631,0.032189,0.092275,0.023605,0.0
3,0.0,0.113043,0.04058,0.0,0.0,0.089855,0.086957,0.014493,0.185507,0.0,0.014493,0.034783,0.011594,0.101449,0.049275,0.063768,0.153623,0.04058,0.0
4,0.0,0.133047,0.060086,0.0,0.0,0.085837,0.081545,0.021459,0.11588,0.0,0.0,0.017167,0.021459,0.085837,0.055794,0.098712,0.184549,0.038627,0.0
5,0.003448,0.151724,0.092241,0.04569,0.061207,0.212069,0.030172,0.0,0.062069,0.00431,0.00431,0.061207,0.034483,0.007759,0.037931,0.1,0.048276,0.038793,0.00431


By normalizing, we mean rescaling each vector so that its attributes sum to one (that is, the vector has length 1 under the L1 norm).

In [16]:
user_profiles.head(n=3).sum(axis=1)

user_id
1    1.0
2    1.0
3    1.0
dtype: float64
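
Note that "length 1" can mean two different things. The sketch below, on a made-up vector of tag totals, contrasts dividing by the sum (what the notebook does) with dividing by the Euclidean norm, which is an alternative not used here.

```python
import numpy as np

profile = np.array([3.0, 4.0, 0.0])  # toy tag totals for one user

# L1 normalization: components sum to 1 (the approach used in this notebook).
l1_normalized = profile / profile.sum()

# L2 normalization: Euclidean length is 1 (shown only for comparison).
l2_normalized = profile / np.linalg.norm(profile)

print(l1_normalized, l1_normalized.sum())
print(l2_normalized, np.linalg.norm(l2_normalized))
```

The L1 version reads naturally as shares of preference, which is why the profiles here can be described as percentages.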

Check a user's profile below: his or her preferences can be described as 20% *drama*, 15% *comedy*, 12% *action*, 9% *thriller*, and so on.

In [17]:
user_profiles.loc[1].sort_values(ascending=False)

drama          0.202117
comedy         0.152069
action         0.120308
thriller       0.090472
romance        0.083253
sci_fi         0.082772
adventure      0.059192
war            0.044273
crime          0.041386
children       0.026468
horror         0.021655
animation      0.019249
musical        0.018287
documentary    0.011550
western        0.010587
mystery        0.008662
fantasy        0.003369
film_noir      0.002406
unknown        0.001925
Name: 1, dtype: float64

Now we can see how these preferences relate to each of the movies in our corpus, to make predictions.

But first, let's create a function with all the logic above.

In [93]:
def make_user_profiles(users, items):
    user_profile_data = users.merge(items)
    user_ratings = user_profile_data['rating']
    user_profile_tags = user_profile_data.iloc[:, 4:].multiply(user_ratings, 
                                                               axis=0)
    user_profiles = pd.concat([user_profile_data.iloc[:, 0:1], 
                               user_profile_tags], axis=1).groupby('user_id')
    user_profiles = user_profiles.sum()
    # normalization is not needed here but makes results more interpretable
    user_profiles = user_profiles.divide(user_profiles.sum(axis=1), axis=0)
    return user_profiles


user_profiles = make_user_profiles(ratings, items)

### 2.3 Predictions

Let's start by locking our user.

In [94]:
user_id = 1
user = user_profiles.loc[user_id]

Then, we will exclude the movies that the user has already rated.

In [95]:
items_rated = ratings[ratings.user_id == user_id].movie_id
# Filter on movie_id: drop() matches index labels, which are 0-based here
items_unseen = items[~items.movie_id.isin(items_rated)]
items_unseen = items_unseen.drop(['title'], axis=1).set_index('movie_id')

Now that we have the user's profile, containing his or her relative preference for each tag, we can use it to predict how much the user will like the remaining items.

A simple way to accomplish this is to multiply each movie's tag vector by the user's taste profile, using a [dot product](https://en.wikipedia.org/wiki/Dot_product).

In [96]:
predictions = items_unseen.dot(user)
predictions = predictions.sort_values(ascending=False)
predictions.head()

movie_id
1138    0.515881
855     0.504812
720     0.464870
631     0.449952
337     0.444658
dtype: float64

And we have a winner! Turns out the most recommended movie for the user is [Best Men](http://www.imdb.com/title/tt0118702/).

In [88]:
items[items.movie_id == predictions.index[0]].title

1137    Best Men (1997)
Name: title, dtype: object
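
To see where a score comes from, the dot product can be worked out by hand for a single movie. The sketch below uses Twelve Monkeys (tagged drama and sci-fi) and three of user 1's normalized preferences copied from the profile shown earlier. It only illustrates the arithmetic, regardless of whether the user has already rated that movie.

```python
import pandas as pd

# Three preferences copied from user 1's normalized profile.
user_preferences = pd.Series({'drama': 0.202117,
                              'sci_fi': 0.082772,
                              'comedy': 0.152069})

# Twelve Monkeys is tagged drama and sci-fi, so only those flags are set.
movie_tags = pd.Series({'drama': 1, 'sci_fi': 1, 'comedy': 0})

# The dot product adds up the preferences for the tags the movie has:
# 0.202117 + 0.082772
score = movie_tags.dot(user_preferences)
print(score)
```

With 0/1 item vectors, the score is simply the sum of the user's preferences for the tags present in the movie, so movies with many well-liked tags score highest.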

Wrapping it all together as a function:

In [97]:
def make_predictions(user_profiles, items, user_id):
    user = user_profiles.loc[user_id]
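    # NOTE: relies on the global `ratings` dataframe defined above
    items_rated = ratings[ratings.user_id == user_id].movie_id
    # Filter on movie_id: drop() matches index labels, which are 0-based here
    items_unseen = items[~items.movie_id.isin(items_rated)]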
    items_unseen = items_unseen.drop(['title'], axis=1).set_index('movie_id')
    predictions = items_unseen.dot(user)
    predictions = predictions.sort_values(ascending=False)
    return predictions


make_predictions(user_profiles, items, user_id=1).head(n=3)

movie_id
1138    0.515881
855     0.504812
720     0.464870
dtype: float64

## 3 Item Normalization

It's important to make all the vectors the same length, so we don't penalize more obscure items.

You may have noticed that a movie with many genres or tags checked will have more influence on the user profile than one that has only one, or very few.

To adjust for that, we normalize the item vectors with the same technique we used for the user vectors, dividing each by its sum so that its components add up to 1.

In [83]:
items_normalized = items.drop(['title'], axis=1).set_index('movie_id')
items_normalized = items_normalized.divide(items_normalized.sum(axis=1), axis=0)
items_normalized.head(n=3)

movie_id,unknown,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
1,0.0,0.0,0.0,0.333333,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [84]:
items_normalized.head(n=3).sum(axis=1)

movie_id
1    1.0
2    1.0
3    1.0
dtype: float64

In [85]:
def normalize_item_vectors(items):
    # Normalize only the tag columns so that each movie's tags sum to 1
    tags = items.drop(['movie_id', 'title'], axis=1)
    tags_normalized = tags.divide(tags.sum(axis=1), axis=0)
    # Put movie_id and title back in front, side by side (axis=1), so the
    # column layout matches `items`
    items_normalized = pd.concat([items.iloc[:, 0:2], tags_normalized], axis=1)
    return items_normalized


items_normalized = normalize_item_vectors(items)
user_profiles = make_user_profiles(ratings, items_normalized)
make_predictions(user_profiles, items_normalized, user_id=1).head(n=5)

movie_id
1197    0.920494
1426    0.920494
409     0.920494
1295    0.920494
1296    0.920494
dtype: float64

## 4 Attribute Relevance

What are the key attributes or *differentiators* of any given item, based on the different frequencies of each attribute?

TF-IDF stands for *Term Frequency - Inverse Document Frequency* and is a *weighting function*, initially applied in information retrieval and adapted to content-based filtering.

Why do we need it? Because *not all terms are equally relevant* to describe an item. TF-IDF assumes that rare terms have more descriptive power.

Be aware, though, that rarity doesn't imply more significance in all contexts, but we will assume it does for the sake of this example.

### 4.1 TF-IDF Weighting

* Term Frequency (TF), i.e. *intensity* = Number of occurrences of a term in the document
* Inverse Document Frequency (IDF), i.e. *distinctiveness* = How few documents contain this term, where:

$$ IDF_{term} = \log\left(\frac{TotalDocuments}{DocumentsWithTerm}\right) $$

And, thus:

$$ TFIDF_{term} = TF_{term} \times IDF_{term} $$

Or, in short, we measure *the term frequency, weighted by its rarity in the entire corpus*.
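
The two formulas can be sketched on a tiny, made-up corpus. The term counts are invented for illustration and the natural logarithm is an assumption, since the base of the log is not specified above.

```python
import numpy as np
import pandas as pd

# Toy term counts: rows are documents, columns are terms (made-up data).
term_counts = pd.DataFrame(
    {'space': [3, 0, 1], 'love': [1, 2, 1], 'robot': [2, 0, 0]},
    index=['doc_a', 'doc_b', 'doc_c'],
)

total_documents = len(term_counts)
documents_with_term = (term_counts > 0).sum(axis=0)

# IDF = log(TotalDocuments / DocumentsWithTerm), natural log assumed.
idf = np.log(total_documents / documents_with_term)

# TF-IDF = TF * IDF, applying each term's IDF to every document row.
tfidf = term_counts.multiply(idf, axis=1)
print(tfidf)
```

In this toy corpus `love` appears in every document, so its IDF is zero and it contributes nothing, while `robot` appears in a single document and gets the largest weight.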

### 4.2 Tags

Typically, TF-IDF is applied to documents, with each word in a document being a *term*.

A more interesting application though uses *tags*: individual words or phrases, that are applied by the community to describe the item. 

Just like words in a document, tags can be applied to an item by many different users, thus appearing multiple times.

Additionally, some tags are rare, while others are quite common in our collection, thus we also need IDF to assess descriptive power.

$$ IDF_{tag} = \log\left(\frac{TotalDocuments}{DocumentsWithTag}\right) $$

And, thus:

$$ TFIDF_{tag} = TF_{tag} \times IDF_{tag} $$

What TF-IDF will do is **automatically demote common tags, promoting core tags instead**.

### 4.3 Metadata

We will start by counting the number of movies containing each one of the tags.

In [41]:
tag_frequency = items.drop(['title'], axis=1).set_index('movie_id').sum()
tag_frequency.sort_values()

unknown          2
fantasy         22
film_noir       24
western         27
animation       42
documentary     50
musical         56
mystery         61
war             71
horror          92
sci_fi         101
crime          109
children       122
adventure      135
romance        247
action         251
thriller       251
comedy         505
drama          725
dtype: int64

According to the reasoning above, tags like `fantasy` or `film_noir` should have more descriptive weight.
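
As a quick check of that intuition, the IDF formula can be applied to a few of the counts listed above. This is a sketch: only four tags are copied over for brevity, and the natural logarithm is an assumption.

```python
import numpy as np
import pandas as pd

# Movie counts per tag, copied from the tag frequency output (a subset).
tag_frequency = pd.Series({'fantasy': 22, 'film_noir': 24,
                           'comedy': 505, 'drama': 725})
total_movies = 1682  # number of movies in the dataset

# IDF = log(TotalDocuments / DocumentsWithTag), natural log assumed.
tag_idf = np.log(total_movies / tag_frequency)
print(tag_idf.sort_values(ascending=False))
```

Rare tags such as `fantasy` end up with an IDF of roughly 4.3, while `drama`, which covers about 43% of the movies, gets roughly 0.84.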

## These will be included in the document at some point

### User Profiles

In principle, we add up the item vectors the user has liked in the past.

Two things to take into consideration:

* *Weighting* - are all items the user has liked in the past equally important (e.g. highest and lowest ratings could count more, recency, confidence, etc.)?
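
One possible weighting scheme, sketched on toy data: ratings are centred on the midpoint of the 1-5 scale so that low ratings count against a tag, and older ratings are decayed exponentially. The `days_ago` column, the midpoint of 3 and the 90-day half-life are all assumptions made for illustration, not something the dataset or this notebook prescribes.

```python
import pandas as pd

# Toy ratings for one user; `days_ago` is a made-up recency signal.
liked = pd.DataFrame({
    'rating': [5, 3, 1],
    'days_ago': [10, 200, 30],
    'drama': [1, 0, 1],
    'comedy': [0, 1, 0],
})

# Ratings above the midpoint (3) count for a tag, ratings below against it.
rating_weight = liked['rating'] - 3

# Exponential decay: a rating loses half its influence every 90 days.
recency_weight = 0.5 ** (liked['days_ago'] / 90)

weights = rating_weight * recency_weight
weighted_profile = liked[['drama', 'comedy']].multiply(weights, axis=0).sum()
print(weighted_profile)
```

In this example the recent 5-star drama outweighs the older 1-star drama, so the drama preference stays slightly positive, while the neutral 3-star comedy contributes nothing.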

## Limitations

* Vector-space model conflates the concepts of liking and importance
* Defining well-structured attributes, that accurately describe or *represent* the items you want to recommend is no easy task
* Especially when such attributes need to align with user preferences, i.e. how the user *reasons* about the items
* Depends on a reasonable distribution of attributes across items, and items across attributes
* No *serendipity*, unlikely to find surprising connections
* Good at finding substitutes, not complements
* Cannot handle interdependencies, e.g. if I like violent sci-fi, and historical documentaries, but not historical sci-fi (weird) or violent documentaries

Extra:

* The value of allowing users to edit their profile (merge explicit and implicit/actions feedback)
* Understandable profile
* Content-based systems have good explainability
* Content-based techniques work without a large set of users, they just need item data (cold-start problem, able to provide a recommendation to the first person using the system)

Tips:

* Normalize rating scale
* Reducing the keyword space so that similar terms are grouped together
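
A minimal sketch of the first tip, assuming that "normalizing the rating scale" means subtracting each user's own mean rating (one common choice among several). The ratings below are made up.

```python
import pandas as pd

# Toy ratings: user 1 rates generously, user 2 rates harshly.
ratings_toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'movie_id': [10, 11, 12, 10, 11, 12],
    'rating': [5, 4, 5, 3, 1, 2],
})

# Subtract each user's own mean so that 0 means "average for this user".
user_mean = ratings_toy.groupby('user_id')['rating'].transform('mean')
ratings_toy['rating_centered'] = ratings_toy['rating'] - user_mean
print(ratings_toy)
```

After centring, a 4 from the generous user comes out below average, while a 3 from the harsh user comes out above average.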


Users always rate on the same scale