# Content-based Recommenders

In [1]:
import numpy as np
import pandas as pd

## Key Assumptions

Most recommender systems rest on the idea of **stable preferences**.

In a content-based recommender, preferences are defined in terms of *content*: a set of attributes that describe the items we are recommending.

In [3]:
# dataset example goes here

We can start by modelling items according to their relevant attributes, e.g. describing movies by their genres, as in the example above.

Then, assuming that user preferences are stable over time, we can *reveal* those preferences by attribute, inferring them from the items the user liked in the past.

From there, we can simply recommend new items with the attributes the user prefers the most. We call this *content-based filtering* (or CBF).

**A key concept then is building this vector of attribute preferences for each user: a user *profile*.**
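As a minimal sketch of the steps above, assuming a made-up catalogue of five movies described by three binary genre attributes (all names and values here are hypothetical, not taken from the dataset), a user profile can be built by counting attributes among the liked items and then used to rank unseen items:

```python
import pandas as pd

# Hypothetical toy catalogue: rows are movies, columns are genres (1 = has genre).
items = pd.DataFrame(
    [[1, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]],
    index=["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    columns=["action", "comedy", "sci-fi"],
)

# Items the (hypothetical) user liked in the past.
liked = ["Movie A", "Movie B"]

# User profile: how often each attribute appears among the liked items.
user_profile = items.loc[liked].sum(axis=0)

# Score every unseen item by its overlap with the profile, best match first.
unseen = items.drop(index=liked)
scores = unseen.dot(user_profile).sort_values(ascending=False)
print(scores)
```

Counting attributes is only one possible way to reveal preferences; it treats every attribute as equally informative, which is the gap that term weighting is meant to address.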

For this, we need to start with another relevant question: what are the *key* attributes or *differentiators* for any given item?

## Term Frequency - Inverse Document Frequency (aka TF-IDF)

TF-IDF is a *weighting function*, initially applied in document retrieval and search engines and later adapted to content-based filtering. 

Why do we need it? Because *not all terms are equally relevant* to describe an item.

TF-IDF assumes that **the rarer the term, the more descriptive power it has**.

### TF-IDF Weighting

* Term Frequency (TF) = Number of occurrences of a term in the document
* Inverse Document Frequency (IDF) = How few documents contain this term, where:

$$ IDF _{term} = log\left({\frac{TotalDocuments}{DocumentsWithTerm}} \right) $$

And, therefore:

$$ TFIDF _{term} = TF _{term} * IDF _{term} $$

Or, in short, we measure *the term frequency, weighted by its rarity in the entire corpus*.
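The weighting can be sketched on a tiny, made-up corpus. The documents and terms below are hypothetical, and the natural logarithm is an assumption, since the formula does not fix a base:

```python
import numpy as np
import pandas as pd

# Hypothetical toy corpus: each document is a list of terms.
docs = {
    "doc1": ["space", "alien", "space", "war"],
    "doc2": ["war", "drama"],
    "doc3": ["war", "romance", "drama"],
}

# TF: number of occurrences of each term in each document (documents x terms).
tf = pd.DataFrame(
    {name: pd.Series(terms).value_counts() for name, terms in docs.items()}
).T.fillna(0)

# IDF: log(total documents / documents containing the term).
docs_with_term = (tf > 0).sum(axis=0)
idf = np.log(len(tf) / docs_with_term)

# TF-IDF: term frequency weighted by the term's rarity in the corpus.
tfidf = tf * idf
print(tfidf.round(2))
```

In this sketch "war" appears in every document, so its IDF is log(1) = 0 and it carries no weight, while "space" appears twice in a single document and gets the largest weight.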

### Tags

Typically, TF-IDF is applied to documents, with each word in a document being a *term*.

A more interesting application, though, uses *tags*: individual words or phrases that the community applies to describe an item.

Just like words in a document, tags can be applied to an item by many different users, thus appearing multiple times.

Additionally, some tags will be rare and others quite common in our collection, so we also need IDF to assess each tag's descriptive power.

TF-IDF will **automatically demote common tags and promote core tags instead**.

In short:

$$ IDF _{tag} = log\left({\frac{TotalDocuments}{DocumentsWithTag}} \right) $$

And:

$$ TFIDF _{tag} = TF _{tag} * IDF _{tag} $$
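A small sketch of the same computation for tags, assuming a hypothetical log of tag applications with one row per (item, tag) event. Items play the role of documents, and the column names `item` and `tag` are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical tag applications: one row each time a user applies a tag to an item.
tag_events = pd.DataFrame(
    {
        "item": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B", "Movie C"],
        "tag": ["funny", "funny", "heist", "funny", "dark", "funny"],
    }
)

# TF: how many times each tag was applied to each item (items x tags).
tf = pd.crosstab(tag_events["item"], tag_events["tag"])

# IDF: log(total items / items carrying the tag).
idf = np.log(tf.shape[0] / (tf > 0).sum(axis=0))

# TF-IDF: the ubiquitous tag "funny" drops to zero, rare tags keep their weight.
tag_tfidf = tf * idf
print(tag_tfidf.round(2))
```

Here "funny" is applied to every item, so it tells us nothing about any one of them, whereas "heist" and "dark" each single out one item.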

In [4]:
# dataset example goes here

### CBF

The TF-IDF weighting function can be used to create a profile of an item, as a *weighted vector of its tags*.

In [5]:
# dataset example goes here

Such a profile can be combined with user actions, or user ratings, to create the *user profiles* we need to match against future items.
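One way this combination could look, as a sketch under stated assumptions: the item profiles and ratings below are invented, the ratings are mean-centred so that disliked items push the profile away from their tags, and cosine similarity is used for matching. None of these choices is prescribed by the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical item profiles: TF-IDF-style tag weights (items x tags).
item_profiles = pd.DataFrame(
    [
        [1.0, 0.0, 0.0],
        [0.8, 0.6, 0.0],
        [0.0, 0.0, 1.0],
        [0.6, 0.8, 0.0],
        [0.0, 0.0, 1.0],
    ],
    index=["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    columns=["heist", "dark", "romance"],
)

# Hypothetical ratings from one user, centred on the user's mean rating
# (an assumption) so that low ratings contribute negative weights.
ratings = pd.Series({"Movie A": 5.0, "Movie B": 4.0, "Movie C": 1.0})
weights = ratings - ratings.mean()

# User profile: rating-weighted sum of the profiles of the rated items.
user_profile = item_profiles.loc[ratings.index].mul(weights, axis=0).sum(axis=0)

# Cosine similarity between the user profile and each unrated item.
unrated = item_profiles.drop(index=ratings.index)
scores = unrated.dot(user_profile) / (
    np.linalg.norm(unrated.to_numpy(), axis=1) * np.linalg.norm(user_profile.to_numpy())
)
print(scores.sort_values(ascending=False))
```

With these numbers the user leans towards "heist" and away from "romance", so "Movie D" scores above "Movie E".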

In [6]:
# dataset example goes here

## Limitations

* Defining well-structured attributes that accurately describe or *represent* the items you want to recommend is no easy task.
* This is especially true when such attributes need to align with user preferences, i.e. with how the user *reasons* about the items.
* It depends on a reasonable distribution of attributes across items, and of items across attributes.
* There is no *serendipity*: the system is unlikely to find surprising connections.
* It is good at finding substitutes, not complements.

Extra:

* Allowing users to edit their profile is valuable: it merges explicit feedback with the implicit feedback inferred from their actions

* Content-based systems have good explainability

* Content-based techniques work without a large set of users; they only need item data. This helps with the cold-start problem, since the system can provide a recommendation even to the first person using it