# How good is my recommender?

After learning about several recommender systems (RS) - [Non Personalised](https://github.com/caiomiyashiro/RecommenderSystemsNotebooks/blob/master/Month%201%20Part%20I%20-%20Non%20Personalised%20and%20Stereotyped%20Recommendation.ipynb), [Content Based](https://github.com/caiomiyashiro/RecommenderSystemsNotebooks/blob/master/Month%201%20Part%20III%20-%20Content%20Based%20Recommendation.ipynb), [User User](https://github.com/caiomiyashiro/RecommenderSystemsNotebooks/blob/master/Month%202%20Part%20I%20-%20User%20User%20Collaborative%20Filtering.ipynb) and [Item Item](https://github.com/caiomiyashiro/RecommenderSystemsNotebooks/blob/master/Month%202%20Part%20II%20-%20Item%20Item%20Collaborative%20Filtering.ipynb) Collaborative Filtering - you may have noticed that we never asked one obvious question: **Which one is better for me?**
  
This is not an easy question. Different businesses can have different priorities and, as in other areas of computer science, there is no '[one algorithm fits all](https://en.wikipedia.org/wiki/No_free_lunch_theorem)' solution.
  
In this and the next notebook, we're going to look at the approaches researchers and companies can take to decide whether a certain RS is right for them. Evaluation is usually performed in two main ways:

* **Offline Evaluation**: Offline evaluation is done in much the same way we evaluate machine learning models, *i.e.*, we have a fixed dataset, collected before the beginning of the evaluation and immutable during it. The dataset is split into two parts, the train and test sets; the RS is trained on the train set and then evaluated on the test set.
* **Online Evaluation**: As the name states, online evaluation is performed online, with real users interacting with different versions or algorithms of a RS. The evaluation is performed by collecting metrics associated with user behaviour in real time.
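
As a rough illustration of the offline protocol described above, here is a minimal sketch in Python. The toy `ratings` list, the `split_ratings` and `rmse` helpers and the global-mean "recommender" are all assumptions made up for this example; a real evaluation would plug in one of the RS from the previous notebooks and a real dataset.

```python
import random


def split_ratings(ratings, test_fraction=0.2, seed=42):
    """Randomly hold out a fraction of (user, item, rating) tuples as a test set."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(predict, test):
    """Root mean squared error of a predict(user, item) function on the test set."""
    errors = [(predict(user, item) - rating) ** 2 for user, item, rating in test]
    return (sum(errors) / len(errors)) ** 0.5


# Hypothetical toy dataset: (user, item, rating)
ratings = [
    ("ana", "bread", 5), ("ana", "milk", 4), ("ana", "wine", 2),
    ("bob", "bread", 4), ("bob", "cheese", 5),
    ("cid", "milk", 3), ("cid", "wine", 5), ("cid", "cheese", 4),
    ("dee", "bread", 5), ("dee", "milk", 5),
]

train, test = split_ratings(ratings)

# "Train" a trivial baseline on the train set only: always predict the global mean
global_mean = sum(rating for _, _, rating in train) / len(train)
baseline_rmse = rmse(lambda user, item: global_mean, test)
print(f"train={len(train)} test={len(test)} baseline RMSE={baseline_rmse:.3f}")
```

Because the dataset and the seed are fixed, running this twice gives the same split and the same score, which is the reproducibility advantage usually attributed to offline evaluation.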

## When do I perform one or another?

Each of these approaches has its pros and cons:

* **Offline**: 
    - **Pros** - This type of evaluation can be easier to set up. With many published datasets available, along with their ratings or evaluations, people can **easily set up and evaluate** their algorithms by comparing their output with the expected output from already published results. With a fixed dataset and a fixed set of user interactions (all existing ratings in the dataset), the results of an offline evaluation are also **easier to reproduce** than those of an online evaluation.
    - **Cons** - The validity of offline evaluations is debated. The most criticized aspect is how well performance on a held-out test set reflects the real performance of the trained algorithm. The idea of a RS is to provide new recommendations that the user probably doesn't know yet. The problem with a test set is that we must already have the user's evaluation for each item/recommendation, *i.e.*, we end up testing only items that we are sure the user knows. Moreover, if the RS recommends an item the user hasn't evaluated yet but that could be a **good recommendation, we penalise it because we don't have it in our test set**. In the end, we penalise the RS for doing its job.  
    
    
* **Online**:
    - **Pros** - Contrary to offline evaluations, in an online context we have the **possibility to collect real time user interactions with the RS**, such as reviews, clicks and preferences. This can give a much better picture when evaluating the RS's performance. Besides, as we are evaluating real time data instead of static data, we're **able to provide further analysis if desired**.
    - **Cons** - Dynamic real time data also has a downside, as the **reproducibility** of the experiment can be worse than with a static script and dataset. Besides, in order to prepare (and maybe even create) the environment to test the RS, we must **spend a considerably larger amount of time setting it up**.

Below, ([Hijikata, 2014](http://soc-research.org/wp-content/uploads/2014/11/OfflineTest4RS.pdf)) provides a few useful guidelines for comparing the pros and cons of each approach:

<img src="images/notebook7_image1.png" width="500">
Source: http://soc-research.org/wp-content/uploads/2014/11/OfflineTest4RS.pdf

In the following sections of this notebook, we are going to discuss a few different perspectives from which we can evaluate a RS when performing an **offline** evaluation. These metrics measure different characteristics we'd want from our RS, not just pure performance/accuracy. The same applies in the machine learning field, where a 100% accuracy doesn't always mean the model is good. A good example is a RS that provides obvious association rules between popular items:

                                        **If a user takes bread, they take milk as well**

This would probably be a close to perfect RS, as the accuracy of our association rule would be close to 100%. The question is: is that useful? Let's take a look at some metrics that relate more closely to business needs or user experience:

- Coverage  
- Popularity / Novelty – Personalization
- Serendipity
- Diversity

<img src="images/notebook7_image2.png">
Source: https://medium.com/the-graph/popularity-vs-diversity-c5bc22c253ee

# Coverage

Considering all the products / recommendations in a catalog, what percentage of them can a RS recommend to users? Usually, companies want systems that are able to cover their entire catalog. This metric usually comes with a trade-off against precision, *i.e.*, a RS with higher coverage can show low precision, and a high precision system usually covers just a small part of the catalog.

Coverage can be obtained by dividing the number of distinct items that got recommended in the test set by the total number of items in the catalog.

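$$\text{Coverage} = \frac{\left|\bigcup_{u \in U} R_u\right|}{|I|}$$

where $U$ is the set of users in the test set, $R_u$ is the set of items recommended to user $u$ and $I$ is the set of all items in the catalog.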

When comparing top-N algorithms, coverage is usually calculated by counting the distinct items that appear in the users' top-N lists.
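
The calculation described above can be sketched in a few lines of Python. The `catalog_coverage` helper, the toy `catalog` and the `top_n_lists` dictionary are hypothetical names and data invented for this example.

```python
def catalog_coverage(top_n_lists, catalog):
    """Share of the catalog that appears in at least one user's top-N list.

    top_n_lists: dict mapping each user to their list of recommended items.
    catalog: iterable with every item the company could recommend.
    """
    catalog = set(catalog)
    recommended = set()
    for items in top_n_lists.values():
        recommended.update(items)
    # Ignore anything recommended that is not in the catalog
    return len(recommended & catalog) / len(catalog)


# Hypothetical toy data
catalog = ["bread", "milk", "wine", "cheese", "olives"]
top_n_lists = {
    "ana": ["bread", "milk"],
    "bob": ["milk", "bread"],
    "cid": ["bread", "wine"],
}

# Only bread, milk and wine are ever recommended: 3 of the 5 catalog items
coverage = catalog_coverage(top_n_lists, catalog)
print(coverage)
```

Note that every user here gets a sensible list, yet cheese and olives are never shown to anyone, which is exactly what the coverage metric is meant to expose.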

# Popularity / Novelty

Popularity can be used when companies want to optimize total sales numbers. It measures the number of users that bought a recommended item. A RS with a high popularity metric only recommends an item to people when it is really sure that they will like it.

$$formula$$

In business domains, we can evaluate the popularity of different RS recommending different sections of a company's catalog and select the one that provides the best return in revenue.
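
One possible way to put a number on this, as a sketch only: score each item by the fraction of users who bought it in the historical data, and average that score over everything a RS recommends. This is an assumption about how to operationalise the description above, not a standard definition taken from the text; `item_popularity`, `mean_list_popularity` and the toy data are hypothetical.

```python
from collections import Counter


def item_popularity(purchases):
    """Fraction of users who bought each item; purchases maps user -> set of items."""
    n_users = len(purchases)
    counts = Counter(item for items in purchases.values() for item in set(items))
    return {item: count / n_users for item, count in counts.items()}


def mean_list_popularity(top_n_lists, popularity):
    """Average popularity of all recommended items (never-bought items count as 0)."""
    scores = [
        popularity.get(item, 0.0)
        for items in top_n_lists.values()
        for item in items
    ]
    return sum(scores) / len(scores)


# Hypothetical purchase history
purchases = {
    "ana": {"bread", "milk"},
    "bob": {"bread", "cheese"},
    "cid": {"bread", "milk", "wine"},
    "dee": {"bread"},
}
popularity = item_popularity(purchases)

# Two hypothetical recommenders producing top-2 lists for the same users
popular_rs = {"ana": ["bread", "milk"], "bob": ["bread", "milk"]}
niche_rs = {"ana": ["wine", "cheese"], "bob": ["wine", "olives"]}

popular_score = mean_list_popularity(popular_rs, popularity)
niche_score = mean_list_popularity(niche_rs, popularity)
print(popular_score, niche_score)
```

Under this reading, a low score is the flip side of the same metric: a RS whose lists average a low popularity is recommending more novel items.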


# Diversity

When recommending only popular items to a user, we could, for example, recommend only the 10 most popular items of an e-commerce site. This may not be what a company wants, as users probably don't need a RS for that: everyone probably already knows about these items.

A Diversity metric measures how different the items recommended to a user are from one another. It is usually measured over a top-N list and can be calculated using the items' metadata, such as item category, genre, tags or keywords. If we have only one list, we can just count the distinct number of categories in it. If we have multiple lists, *i.e.*, multiple recommendations to a set of users, we can calculate the similarity between items and optimise for low similarity values.

$$formula$$
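
Both ideas from the paragraph above can be sketched in Python: counting distinct categories in a list, and turning item-to-item similarity into a diversity score. Using Jaccard similarity over tag sets and defining diversity as one minus the mean pairwise similarity are assumptions chosen for this illustration; the helper names and the toy metadata are hypothetical.

```python
from itertools import combinations


def distinct_categories(items, category_of):
    """Number of different categories present in one recommendation list."""
    return len({category_of[item] for item in items})


def jaccard(a, b):
    """Jaccard similarity between two sets of tags."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def intra_list_diversity(items, tags_of):
    """1 - mean pairwise tag similarity; higher means a more diverse list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0  # a list with fewer than two items has no pairs to compare
    mean_similarity = sum(jaccard(tags_of[a], tags_of[b]) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Hypothetical item metadata
tags_of = {
    "ml_book_a": {"book", "machine-learning"},
    "ml_book_b": {"book", "machine-learning"},
    "cookbook": {"book", "cooking"},
    "headphones": {"electronics", "audio"},
}
category_of = {
    "ml_book_a": "books",
    "ml_book_b": "books",
    "cookbook": "books",
    "headphones": "electronics",
}

narrow = ["ml_book_a", "ml_book_b"]             # two near-identical books
mixed = ["ml_book_a", "cookbook", "headphones"]

print(distinct_categories(narrow, category_of), intra_list_diversity(narrow, tags_of))
print(distinct_categories(mixed, category_of), intra_list_diversity(mixed, tags_of))
```

To compare whole systems rather than single lists, one option is to average `intra_list_diversity` over all users' lists and prefer the RS with the higher mean.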

At least in my opinion, the main recommender systems nowadays, with the exception of Spotify, provide recommendation lists with low Diversity. If I buy a book about machine learning at Amazon, I end up receiving lots of recommendations for books with the same machine learning content written by different authors, and this is not what I want (at least for me).

# Serendipity

As we have discussed before, one flaw of [Item Item CF](https://github.com/caiomiyashiro/RecommenderSystemsNotebooks/blob/master/Month%202%20Part%20II%20-%20Item%20Item%20Collaborative%20Filtering.ipynb) is its inability to provide innovative recommendations that few people know about but that could be a great recommendation for someone. This is what can be defined as a lack of serendipity. Users will probably only get items similar to those they have already bought.

Music streaming platforms are one of the areas where systems need to be really 'serendipitous'. Contrary to movie, item or book platforms, music listening is subject to a diversity of factors such as long and short term preferences and context. Adapting to these complex factors is really challenging for a music platform, but Spotify at least manages to provide good surprises with great songs from not so famous bands that really impress. Spotify's Discover Weekly is one example of serendipity, where people surrender to its recommendations and even admit that the algorithm knows more about their musical taste than they do.

$$foto twitter about discover weekly$$

Another aspect of serendipity is the temporal evolution of a user's taste. [Someone x]() studied how users' satisfaction evolved when they got the same recommendations over time. The study found that, even when users received good recommendations, their satisfaction decreased if the recommendations didn't evolve over time.

([Kotkov, 2016]()) provides a good overview of the challenges in defining and measuring serendipity in RS. It presents two categories of serendipity metrics: component metrics, which measure different components of serendipity, such as novelty or unexpectedness, and full metrics, which measure serendipity as a whole. Among full metrics, Murakami (2008), in *Metrics for evaluating the serendipity of recommendation lists*, proposed the following equation:

$$equation$$

**Explain equation**