<center>
    <h1 id='demographic-filtering' style='color:#7159c1'>✨ Demographic Filtering ✨</h1>
    <i>The simplest Recommender Algorithm</i>
</center>

> **Topics**

```
- ✨ Introduction
- ✨ Demographic Filtering
- ✨ Arithmetic Mean
- ✨ Cumulative Mean
- ✨ Bayesian Mean and Popularity
```

In [1]:
# ---- Imports ----
import matplotlib.pyplot as plt             # pip install matplotlib
import mplcyberpunk                         # pip install mplcyberpunk
import numpy as np                          # pip install numpy
import pandas as pd                         # pip install pandas
import seaborn as sns                       # pip install seaborn

# ---- Constants ----
DATASETS_PATH = ('./datasets')
SEED = (20231212)

# ---- Settings ----
np.random.seed(SEED)
pd.set_option('display.max_columns', None)
sns.set_style('darkgrid')
plt.style.use('cyberpunk')

<h1 id='0-introduction' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Introduction</h1>

Recommendation Systems are present everyday in our lives helping us to find the perfect anime to watch on Crunchyroll, or to binge watch that show on Netflix, keep watching related YouTube videos or even to buy new products on Amazon. I can spend the whole day listing examples about this system, but I think that you got the glimpse where Recommendation Systems are used, don't you?

However, have you ever get bored on Netflix and Instagram because they only recommend items related to the same topic? It is not because I liked two videos about cars on Instagram that I am a huge fan of cars and want to see only reels and posts about it. Or it is not because I saw a gamer chair price on Amazon that I want to buy only gamer chairs. When you keep receiving recommendations only related to a especific topic or product, we say that you are into a Limited Bubble of Recommendations.

Limited Bubble of Recomendations tends to make companies lose customers, so, in order to minimize these bubbles and to expand the recommendations' diversity, algorithms have been tuned and new variables have been considered to the model.

Nowadays, there are four main Recommendation Algorithms: `Demographic Filtering, Content-Based Filtering, Collaborative Filtering and Hybrid Filtering`. We are going to dive into the first one in this notebook taking the animes dataset into context!!

<h1 id='1-demographic-filtering' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Demographic Filtering</h1>

`Demographic Filtering` recommends animes that are popular between the users, that is, the animes with higher Bayesian or Arithmetic Mean Score and Popularity Rank. If you use Netflix, you probably already stumbled upon to some series marked as `Hot` or `Everyone is watching`. If that's so, congrats, that is a real-world Demographic Filtering Recommendation!! To make things even clearer, assume that Bleach, Oregairu and Jujutsu Kaisen are the animes with higher scores on a stream platform and it uses Demographic Filtering to recommend the animes. Guess what? The platform will recommend these very animes to all users!!

About the advantages:

> **Less Data Required** - `since it uses the Mean Score for recommendation, only animes data is required, ignoring the users data`;

> **Less Computational Time and Cost** - `consequently, the model training takes less time and cost to compute`;

> **Simple Evaluation Metrics** - `the metrics used for evaluation (Bayesian Mean Score, Arithmetic Mean Score and Popularity Rank) are not complicated to understand and to apply`.

<br />

Disadvantages-wise:

> **Poor Recommendations** - `bad recommendations are made since this technique generalizes the taste of all users to only one: The animes with higher Score and Popularity`.

<br />

The image below ilustrates how this technique works:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/0-demographic-filtering.png' alt='Demographic Filtering Diagram' />
    <figcaption>Figure 1 - Demographic Filtering Diagram. By <a href='https://www.width.ai/post/recommender-systems-recommendation-systems'>Matt Payne - Recommender Systems For Business - A Gentle Introduction©</a>.</figcaption>
</figure>

<br /><br />

In the next three remaning sessions, we are going to explain and discuss the metrics used in this Algorithm and discover why Bayesian Mean is the best choice.

<h1 id='2-arithmetic-mean' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Arithmetic Mean</h1>

`Arithmetic Mean` is the most known way to calculate means and you probably studied about it on school during your Basic Statistics classes. In few words, the metric works adding all elements and dividing by the number of elements, the result corresponds to the mean value. If you are a math person, the equation below may ring a bell:

```python
arithmetic_mean = sum(x) / len(x)
```

$$
\overline X = \frac {\sum (x)} {n}
$$

where:

- x: element value;
- n: total number of elements.

<br />

Even thoug we have a tendency to apply the metric for most scenarios, Arithmetic Mean can be a problem for Recommendation System Models due to the difference between the scoree value and the amount of scores.

For instance, picture an anime that has an Arithmetic Mean Score of 9.8 and only 100 votes, and another anime with a Mean Score of 8.5 and 100,000 votes. Which one is better? The one with the higher score and less votes or the one with the smaller score and more votes?

In this case and statistically speaking, it is correct to assume that the one with the smaller score and more votes is the better because the amount of votes for the first anime is too few that its Mean is not significantly considerable. There is a possibility that a bunch a users that will not like the anime did not watch and rated it already.

So, we have just stumbled a problem: `there are situations the Arithmetic Mean is not suitable and can compromise Recommendation Systems Models`. We will discuss how to get around this problem later, for now, let's see what animes would be recommended by a Demographic Filtering Algorithm using Arithmetic Mean as metric.

In [19]:
# ---- Reading Dataset ----
animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv')
filtered_animes_df = animes_df.copy().loc[animes_df.score > 0][['id', 'title', 'score', 'rank', 'popularity']]

print(f'- Variables: {filtered_animes_df.shape[1]}')
print(f'- Observations: {filtered_animes_df.shape[0]}')
print('---')

filtered_animes_df.head()

- Variables: 5
- Observations: 15684
---


Unnamed: 0,id,title,score,rank,popularity
0,1,cowboy bebop,8.75,41,43
1,5,cowboy bebop tengoku no tobira,8.38,189,602
2,6,trigun,8.22,328,246
3,7,witch hunter robin,7.25,2764,1795
4,8,bouken ou beet,6.94,4240,5126


In [10]:
# ---- Top 10 Animes ----
arithmetic_mean_top_10_animes = filtered_animes_df.sort_values(by='score', ascending=False).head(10)


Unnamed: 0,id,title,score,genres,synopsis,type,episodes,status,producers,licensors,studios,source,duration,rating,rank,popularity,favorites,scored_by,members
3961,5114,fullmetal alchemist brotherhood,9.1,"adventure, drama, fantasy, action",after a horrific alchemy experiment goes wrong...,tv,64,finished airing,"square enix, studio moriken, mainichi broadcas...","funimation, aniplex of america",bones,manga,24 min per ep,R - 17+ (violence & profanity),1,3,217606,2020030,3176556
16481,41467,bleach sennen kessen-hen,9.07,"adventure, fantasy, action",substitute soul reaper ichigo kurosaki spends ...,tv,13,finished airing,"tv tokyo, dentsu, shueisha, aniplex",viz media,pierrot,manga,24 min per ep,R - 17+ (violence & profanity),2,464,17999,213872,445198
5667,9253,steins gate,9.07,"drama, suspense, sci-fi",eccentric scientist rintarou okabe has a never...,tv,24,finished airing,"media factory, nitroplus, at-x, frontier works...",funimation,white fox,visual novel,24 min per ep,PG-13 - Teens 13 or older,3,13,182964,1336233,2440369
9875,28977,gintama,9.06,"comedy, sci-fi, action","gintoki, shinpachi, and kagura return as the f...",tv,51,finished airing,"tv tokyo, dentsu, aniplex","funimation, crunchyroll",bandai namco pictures,manga,24 min per ep,PG-13 - Teens 13 or older,4,331,15947,237957,595767
14770,38524,shingeki no kyojin season 3 part 2,9.05,"drama, action",seeking to restore humanity's diminishing hope...,tv,10,finished airing,"dentsu, kodansha, mainichi broadcasting system...",funimation,wit studio,manga,23 min per ep,R - 17+ (violence & profanity),6,24,55245,1471825,2104016
17426,43608,kaguya-sama wa kokurasetai ultra romantic,9.05,"romance, comedy",the elite members of shuchiin academy's studen...,tv,13,finished airing,"mainichi broadcasting system, magic capsule, a...",aniplex of america,a-1 pictures,manga,23 min per ep,PG-13 - Teens 13 or older,5,198,29118,451187,820642
21545,51535,shingeki no kyojin the final season - kanketsu...,9.05,"drama, suspense, action",in the wake of eren yeager's cataclysmic actio...,special,2,currently airing,"dentsu, kodansha, mainichi broadcasting system...",-,mappa,manga,1 hr 1 min per ep,R - 17+ (violence & profanity),7,479,9078,155773,435672
6456,11061,hunter x hunter 2011,9.04,"adventure, fantasy, action",hunters devote themselves to accomplishing haz...,tv,148,finished airing,"shueisha, vap, nippon television network",viz media,madhouse,manga,23 min per ep,PG-13 - Teens 13 or older,10,10,200265,1651790,2656870
15406,39486,gintama the final,9.04,"drama, comedy, sci-fi, action",two years have passed following the tendoshuu'...,movie,1,finished airing,"tv tokyo, warner bros. japan",eleven arts,bandai namco pictures,manga,1 hr 44 min,PG-13 - Teens 13 or older,9,1564,4051,63628,132955
5989,9969,gintama,9.04,"comedy, sci-fi, action","after a one-year hiatus, shinpachi shimura ret...",tv,51,finished airing,"dentsu, trinity sound, aniplex, tv tokyo, mira...",-,sunrise,manga,24 min per ep,PG-13 - Teens 13 or older,8,386,7765,226175,525688


---


Having this in mind, we can use the `Cumulative Mean`. Its idea is similar to the Arithmetic one, but the difference is that the value is not divided by the number of votes. So, if an anime got two ratings of 10, we say that its Cumulative Mean is 20, while its Arithmetic Mean is 10. But, this method has a problem too: consider an anime with 1,000,000 ratings of 1 score and another anime with 10,000 ratings of 10 score, if we calculate their Cumulative Mean, the first anime will have a score of 1,000,000 whereas the second one will have a score of 100,000. So, even though having the better ratings, the anime can be in a bad rank position if the number of votes is substancial lower than an anime with worse ratings and a huge number of votes.

So, up to now, we have this two metrics problems:

> **Arithmetic Mean** - `does not assume that an anime with lower score and huge number of ratings is better than an anime with higher score and only a few number of ratings`;

> **Cumulative Mean** - `does not assume that an anime with a lower score and huge number of good ratings is better than an anime with higher score and a more huge number of bad ratings`.

<br /><br />

In order to solve these two problems and to giver better, realistic mean scores, the `Bayesian Mean` has been created!! Its idea is simple, first we have to find the number of ratings of the item, the Arithmetic Mean of the item, the minimum number of ratings that an item must have to be considered in the system and the Arithmetic Mean across all items. After that, we apply them in the following equation:

```python
bayeasian_mean = ((v / (v + m)) * A) + ((m / (v + m)) * C)
```

$$
Bayesian Mean = \frac{v}{v + m} \cdot A + \frac{m}{v + m} \cdot C
$$

where:

- v: number of the item's ratings;
- m: minimum number of ratings that an item must have to be considered in the system;
- A: item's Arithmetic Man;
- C: Arithmetic Mean across all items.

Due to its good results, Bayesian Mean is used by IMDB to calculate the mean score of movies listed on the website.

Also, it is highly recommended to start using Arithmetic Mean when there are a few ratings to an item and smoothly migrate to Bayesian Mean accordingly the number of ratings increase.

<br />