# Content-based Recommenders

## 1 About the Data

From the [source](https://www.kaggle.com/prajitdatta/movielens-100k-dataset/):

> MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.

> This data set consists of:
> * 100,000 ratings (1-5) from 943 users on 1682 movies.
> * Each user has rated at least 20 movies.
> * Simple demographic info for the users (age, gender, occupation, zip)
    
> The data was collected through the MovieLens web site during the seven-month period from September 19th, 
1997 through April 22nd, 1998.

In [1]:
import numpy as np
import pandas as pd


# Make sure you unzip the .zip file in src/data/ before running this cell
def make_data():
    
    items = make_items_data()
    users = make_users_data()
    
    return items, users


def make_items_data():
    
    colnames = ['movie_id', 'movie_title', 'release_date', 'video_release_date', 'imdb_url', 'unknown', 'action', 
                'adventure', 'animation', 'children', 'comedy', 'crime', 'documentary', 'drama', 'fantasy', 
                'film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi', 'thriller', 'war', 'western']
    items_all_columns = pd.read_csv('../data/ml-100k/u.item', sep='|', header=None, names=colnames, 
                                    encoding = 'ISO-8859-1')
    items_clean = items_all_columns.drop(['release_date', 'video_release_date', 'imdb_url'], axis=1)
    
    return items_clean


def make_users_data():
    
    colnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    users_all_columns = pd.read_csv('../data/ml-100k/u.data', sep='\t', header=None, names=colnames, 
                                    encoding = 'ISO-8859-1')   
    users_clean = users_all_columns.drop(['timestamp'], axis=1)
    
    return users_clean


items, users = make_data()

The `items` dataframe contains 19 genre columns: a 1 indicates the movie belongs to that genre and a 0 indicates it does not. A movie can belong to several genres at once.

In [2]:
items.head(n=3)

Unnamed: 0,movie_id,movie_title,unknown,action,adventure,animation,children,comedy,crime,documentary,...,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
0,1,Toy Story (1995),0,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


The `users` dataframe contains the full dataset: 100,000 ratings (1-5) by 943 users on 1,682 items. Each user has rated at least 20 movies.

In [3]:
users.head(n=3)

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1


## Building a Content-Based Filtering Recommender (TL;DR)

The whole point of content-based filtering is to build up a profile of the things a user likes, and use it to *predict* how much he or she will like other items.

The universe of all possible item attributes defines a *content-space*, and each item has a position in that space (see [vector space](https://en.wikipedia.org/wiki/Vector_space_model)), that describes its content.

The key concept is building a vector of item attribute preferences for each user, which we call a *user profile*, and using it to make predictions.

Item profiles can be combined with user actions to create the user profiles we need to match against future items.

The user profile is a vector in the same content-space, and the match between the user's profile and the item is measured by how closely the two align.

This is how it is typically done:

1. Collect or compute item vectors that describe items in the corpus's content-space (e.g. document text, keywords, tags, metadata)
2. Use item vectors and user actions to build user profiles as vectors that reveal user preferences in the same content-space
3. Predict user interest in previously unseen items of the corpus.
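
The three steps above can be sketched on a toy example. This is only an illustration under stated assumptions: the movies, genres and ratings are made up, the names (`toy_items`, `toy_ratings`, `toy_profile`, `cosine`) are hypothetical, and cosine similarity is used here as one possible way to measure how closely a profile and an item align.

```python
import numpy as np
import pandas as pd

# Step 1: item vectors in a toy content-space of three genres (made-up data).
toy_items = pd.DataFrame(
    {"drama": [1, 0, 1], "sci_fi": [1, 1, 0], "comedy": [0, 0, 1]},
    index=pd.Index(["A", "B", "C"], name="movie_id"),
)

# User actions: the user rated movie A with 5 and movie C with 1.
toy_ratings = pd.Series({"A": 5, "C": 1})

# Step 2: user profile = rating-weighted sum of the vectors of the rated items.
toy_profile = toy_items.loc[toy_ratings.index].multiply(toy_ratings, axis=0).sum()


# Step 3: score unseen items by their cosine similarity with the profile.
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


unseen = toy_items.drop(index=toy_ratings.index)
toy_scores = unseen.apply(
    lambda item: cosine(item.to_numpy(), toy_profile.to_numpy()), axis=1
)
```

Here the profile leans towards drama and sci-fi because the highly rated movie A carries both, so the unseen sci-fi movie B gets a positive score.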

## 2 Item Attributes

### 2.1 Item Vectors

In content-based recommenders, preferences are defined in terms of *content*: a set of attributes that describe the items we are recommending.

We should start by modelling items according to their relevant attributes, e.g. describing movies by their genres.

The good thing is: this is already done for us in the dataset! Terry Gilliam's Twelve Monkeys is modelled as drama and sci-fi, for example.

In [4]:
items.iloc[6]

movie_id                           7
movie_title    Twelve Monkeys (1995)
unknown                            0
action                             0
adventure                          0
animation                          0
children                           0
comedy                             0
crime                              0
documentary                        0
drama                              1
fantasy                            0
film_noir                          0
horror                             0
musical                            0
mystery                            0
romance                            0
sci_fi                             1
thriller                           0
war                                0
western                            0
Name: 6, dtype: object

From there, and this is the idea that underlies most recommender systems, we use the principle of **stable preferences**.

Assuming that user preferences are stable over time, we can *reveal* those preferences by attribute, inferring them from the items the user liked in the past.

From there, we can simply recommend new items with the attributes the user prefers the most. We call this *content-based filtering* (or CBF, from here onwards).

**In a nutshell: if I like Twelve Monkeys, we assume I like drama and sci-fi.**

Note you could use attributes or *tags* other than movie genres, like the director or the main actors, for example. Who doesn't love [coconut Patsy](http://gypsyastronaut.tumblr.com/post/130763390464/monty-python-and-the-quest-for-the-holy-grail-patsy)? :)

### 2.2 User Profiles

In [5]:
def make_user_profiles(users, items):

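    # attach the genre flags of the rated movie to each rating (joins on movie_id)
    data = users.merge(items)
    # weight the genre flags (columns 4 onwards) by the rating the user gave
    tags = data.iloc[:, 4:].multiply(data['rating'], axis=0)
    # add up the rating-weighted genre flags of all movies rated by each user
    user_profiles = pd.concat([data.iloc[:, 0:1], tags], axis=1).groupby('user_id').sum()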
    # normalization is not needed here but makes results more interpretable
    user_profiles = user_profiles.divide(user_profiles.sum(axis=1), axis=0)
    
    return user_profiles


user_profiles = make_user_profiles(users, items)

### 2.3 Predictions

In [7]:
def make_predictions(user_profiles, items, user_id):
    
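    # profile of the selected user: one weight per genre
    user = user_profiles.loc[user_id]
    # Pearson correlation between each item's genre vector and the user profile
    corr = items.apply(lambda x: x.corr(user), axis=1)
    # best matching items first
    corr = corr.sort_values(ascending=False)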
    
    return corr


items_indexed = items.drop(['movie_title'], axis=1).set_index('movie_id')
make_predictions(user_profiles, items_indexed, user_id=92).head(n=5)

movie_id
337    0.802834
4      0.796729
74     0.796729
316    0.737100
953    0.737100
dtype: float64

## 3 Item Normalization

It's important to make all the item vectors the same length (i.e. the same norm), so we don't penalize items that have only a few attributes.

You may have noticed that a movie with many genres checked has more influence on the user profile than one that has only a few. True?
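
One possible way to do this, sketched below on made-up data, is to divide each item vector by its Euclidean norm so that every item ends up with length 1. The function name `normalize_items` and the toy dataframe are assumptions for illustration, not part of the dataset or of the code above.

```python
import numpy as np
import pandas as pd

# Made-up items: movie 1 has four genres checked, movie 2 only one.
toy_items = pd.DataFrame(
    {"drama": [1, 1], "sci_fi": [1, 0], "thriller": [1, 0], "war": [1, 0]},
    index=pd.Index([1, 2], name="movie_id"),
)


def normalize_items(item_vectors):
    # Euclidean length of each item vector (one value per row).
    norms = np.sqrt((item_vectors ** 2).sum(axis=1))
    # Items with no genre at all keep an all-zero vector instead of dividing by zero.
    norms = norms.replace(0, 1)
    return item_vectors.divide(norms, axis=0)


toy_items_normalized = normalize_items(toy_items)
```

After normalization, movie 1 contributes 0.5 to each of its four genres while movie 2 contributes 1.0 to drama, so both vectors have the same length.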

In [None]:
# do second implementation here

## 4 Attribute Relevance

What are the key attributes or *differentiators* of any given item, based on the different frequencies of each attribute?

TF-IDF stands for *Term Frequency - Inverse Document Frequency* and is a *weighting function*, initially applied in information retrieval and adapted to content-based filtering.

Why do we need it? Because *not all terms are equally relevant* to describe an item. TF-IDF assumes that rare terms have more descriptive power.

Be aware, though, that rarity doesn't imply more significance in all contexts, but we will assume it does for the sake of this example.

### 4.1 TF-IDF Weighting

* Term Frequency (TF), i.e. *intensity* = Number of occurrences of a term in the document
* Inverse Document Frequency (IDF), i.e. *distinctiveness* = How few documents contain this term, where:

$$ IDF_{term} = \log\left(\frac{TotalDocuments}{DocumentsWithTerm}\right) $$

And, thus:

$$ TFIDF_{term} = TF_{term} \times IDF_{term} $$

Or, in short, we measure *the term frequency, weighted by its rarity in the entire corpus*.
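
As a minimal sketch of the two formulas, the snippet below computes TF-IDF for a made-up table of term counts. The documents, terms and variable names are hypothetical, and the natural logarithm is an assumption, since the formula does not fix a base.

```python
import numpy as np
import pandas as pd

# Made-up term counts (TF): rows are documents, columns are terms.
tf = pd.DataFrame(
    {"movie": [3, 2, 1, 4], "robot": [0, 2, 0, 0]},
    index=["doc1", "doc2", "doc3", "doc4"],
)

# IDF = log(TotalDocuments / DocumentsWithTerm), one value per term.
total_documents = len(tf)
documents_with_term = (tf > 0).sum(axis=0)
idf = np.log(total_documents / documents_with_term)

# TF-IDF = TF * IDF, broadcast across the term columns.
tfidf = tf * idf
```

The term "movie" appears in every document, so its IDF is log(1) = 0 and it is weighted out entirely, while "robot" appears in a single document and keeps a positive weight there.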

### 4.2 Tags

Typically, TF-IDF is applied to documents, where each word in a document is a *term*.

A more interesting application, though, uses *tags*: individual words or phrases that are applied by the community to describe the item. 

Just like words in a document, tags can be applied to an item by many different users, thus appearing multiple times.

Additionally, some tags are rare, while others are quite common in our collection, thus we also need IDF to assess descriptive power.

$$ IDF_{tag} = \log\left(\frac{TotalDocuments}{DocumentsWithTag}\right) $$

And, thus:

$$ TFIDF_{tag} = TF_{tag} \times IDF_{tag} $$

What TF-IDF will do is **automatically demote common tags and promote core tags instead**.
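
To make this concrete, here is a hedged sketch on made-up tag data. MovieLens 100k does not ship user-applied tags, so the `tag_events` table and all names below are hypothetical, and each movie plays the role of a document.

```python
import numpy as np
import pandas as pd

# Hypothetical tag applications: one row each time a user applies a tag to a movie.
tag_events = pd.DataFrame({
    "movie_id": [1, 1, 1, 1, 2, 2, 3],
    "tag": ["sci-fi", "sci-fi", "time travel", "sci-fi", "sci-fi", "heist", "sci-fi"],
})

# TF: how many times each tag was applied to each movie.
tag_tf = pd.crosstab(tag_events["movie_id"], tag_events["tag"])

# IDF: log of (number of movies / number of movies carrying the tag).
tag_idf = np.log(len(tag_tf) / (tag_tf > 0).sum(axis=0))

# One TF-IDF weighted tag vector per movie.
item_tag_profiles = tag_tf * tag_idf
```

In this toy data "sci-fi" is applied to every movie, so its weight drops to zero, while "time travel" and "heist" remain as the distinguishing tags of movies 1 and 2.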

### 4.3 Metadata

Explain to students how to do this using metadata, i.e. no frequency problem (relevant for challenge).

Reference to *top attributes*, maybe in the example?

In [None]:
# do third implementation here

### CBF

The TF-IDF weighting function can be used to create a profile of an item, as a *weighted vector of its tags*.

In [None]:
# dataset example goes here

## Building a CBF Recommender System

### User Profiles

In principle, we add up the item vectors the user has liked in the past.

Two things to take into consideration:

* *Weighting* - are all items the user has liked in the past equally important (e.g. highest and lowest ratings could count more, recency, confidence, etc.)?
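
As one illustration of weighting, the sketch below discounts older ratings with an exponential decay before adding up the item vectors. Everything here is an assumption made for the example: the toy data, the `age_days` column (which would have to be derived from a rating timestamp), the one-year half-life and the function name `recency_weight`.

```python
import pandas as pd

# Made-up ratings by a single user; age_days is how long ago each rating was given.
toy_ratings = pd.DataFrame({
    "movie_id": [1, 2],
    "rating": [4, 4],
    "age_days": [0, 365],
}).set_index("movie_id")

toy_items = pd.DataFrame(
    {"drama": [1, 0], "sci_fi": [0, 1]},
    index=pd.Index([1, 2], name="movie_id"),
)

HALF_LIFE_DAYS = 365  # assumption: a rating loses half of its weight every year


def recency_weight(age_days, half_life=HALF_LIFE_DAYS):
    return 0.5 ** (age_days / half_life)


# Each rating is scaled by its recency weight before building the profile.
weights = toy_ratings["rating"] * recency_weight(toy_ratings["age_days"])
weighted_profile = toy_items.loc[weights.index].multiply(weights, axis=0).sum()
```

Both movies received the same rating, but the one rated a year ago contributes half as much to the profile as the one rated today.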

## Limitations

* The vector-space model conflates the concepts of liking and importance
* Defining well-structured attributes, that accurately describe or *represent* the items you want to recommend is no easy task
* Especially when such attributes need to align with user preferences, i.e. how the user *reasons* about the items
* Depends on a reasonable distribution of attributes across items, and items across attributes
* No *serendipity*, unlikely to find surprising connections
* Good at finding substitutes, not complements
* Cannot handle interdependencies, e.g. if I like violent sci-fi, and historical documentaries, but not historical sci-fi (weird) or violent documentaries

Extra:

* The value of allowing users to edit their profile (merge explicit and implicit/actions feedback)
* Understandable profile
* Content-based systems have good explainability
* Content-based techniques work without a large set of users; they just need item data (cold-start problem: able to provide a recommendation to the first person using the system)

Tips:

* Normalize rating scale
* Reducing the keyword space so that similar terms are grouped together
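
For the first tip, one common option is to mean-center ratings per user, so that a 3 from a generous rater and a 3 from a strict rater are no longer treated the same. The sketch below uses made-up ratings, and the column name `rating_centered` is hypothetical.

```python
import pandas as pd

# Made-up ratings: user 1 rates generously, user 2 rates strictly.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "movie_id": [10, 11, 12, 10, 11, 12],
    "rating": [5, 4, 3, 3, 2, 1],
})

# Subtract each user's own mean rating from their ratings.
user_means = toy_ratings.groupby("user_id")["rating"].transform("mean")
toy_ratings["rating_centered"] = toy_ratings["rating"] - user_means
```

After centering, both users show the same pattern (above, at and below their personal average), even though their raw ratings differ.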