<center>
    <h1 id='content-based-filtering' style='color:#7159c1; font-size:350%'>Content-Based Filtering</h1>
    <i style='font-size:125%'>Recommendations of Similar Items by Plot Description</i>
</center>

> **Topics**

```
- 📦 Content-Based Filtering
- 📦 Cosine Similarity
- 📦 Bag of Words
- 📦 Token and N-Grams
- 📦 Stemming and Lemmatization
- 📦 Stop Words
- 📦 Zipf's Law
- 📦 Term Frequency - Inverse Document Frequency (TF-IDF)
- 📦 Hands-on
```

In [2]:
# ---- Imports ----
import matplotlib.pyplot as plt                              # pip install matplotlib
import mplcyberpunk                                          # pip install mplcyberpunk
import numpy as np                                           # pip install numpy
import pandas as pd                                          # pip install pandas
import seaborn as sns                                        # pip install seaborn
from sklearn.feature_extraction.text import TfidfVectorizer  # pip install scikit-learn
from sklearn.metrics.pairwise import linear_kernel           # pip install scikit-learn
import string                                                # standard library (no install needed)

# ---- Pre-Trained Models ----
#
# pip install -U pip setuptools wheel
# pip install -U spacy
# python -m spacy download en_core_web_sm >> efficiency (English Model) (less computer cost)
# python -m spacy download en_core_web_trf >> accuracy (English Model)  (better results)
#
import spacy
spacy_english_model = spacy.load("en_core_web_sm") # efficiency model

# ---- Constants ----
DATASETS_PATH = ('./datasets')
SEED = (20240420) # April 20, 2024 (fourth Bitcoin Halving)

# ---- Settings ----
np.random.seed(SEED)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
sns.set_style('darkgrid')
plt.style.use('cyberpunk')

# ---- Functions ----
def transform_synopsis(synopsis):
    r"""
    \ Description:
        - transforms a string into a spaCy Doc;
        - collects the lemma of every token that is neither a stop word
        nor a proper noun;
        - returns the lemmas joined by single spaces.
    
    \ Parameters:
        - synopsis: string.
    """
    document = spacy_english_model(synopsis)
    transformed_text = [token.lemma_ for token in document if not token.is_stop and token.pos_ != 'PROPN']
    return ' '.join(transformed_text)

def get_recommendations(dataset, title, animes_indices, cosine_similarity, number_recommendations=10):
    r"""
    \ Description:
        - gets the index of the anime that matches the title;
        - gets the pairwise similarity scores of all animes with the chosen anime;
        - sorts the animes by similarity score in descending order;
        - gets the scores of the top 'number_recommendations' animes, excluding the chosen one;
        - gets the animes indices;
        - returns the recommended animes id, title, synopsis, score, genres, image url
        and cosine similarity.
    
    \ Parameters:
        - dataset: Pandas DataFrame;
        - title: string;
        - animes_indices: Pandas Series mapping each title to its row position;
        - cosine_similarity: NumPy array of floats;
        - number_recommendations: integer.
    """
    index = animes_indices[title]
    
    similarity_scores = list(enumerate(cosine_similarity[index]))
    similarity_scores = sorted(similarity_scores, key=lambda score: score[1], reverse=True)
    similarity_scores = similarity_scores[1:number_recommendations+1] # position 0 is assumed to be the anime itself, since
    # the most similar item to a chosen one is the chosen item itself
    
    recommended_animes_indices = [score[0] for score in similarity_scores]
    recommended_animes_scores = [score[1] for score in similarity_scores]
    
    recommendations_df = dataset.iloc[recommended_animes_indices][
        ['id', 'title', 'synopsis', 'score', 'genres', 'image_url']
    ].set_index('id')
    recommendations_df['cosine_similarity'] = recommended_animes_scores
    
    return recommendations_df
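
The ranking step inside `get_recommendations` can be tried in isolation. The sketch below is a minimal, pure-Python illustration on a made-up similarity row; the helper name `top_k_similar` and the scores are assumptions, not part of the notebook. It also shows one possible way to exclude the chosen item by its index instead of relying on it being first after sorting.

```python
def top_k_similar(similarity_row, chosen_index, k):
    """Return (index, score) pairs of the k most similar items, excluding the chosen one."""
    scores = [
        (index, score)
        for index, score in enumerate(similarity_row)
        if index != chosen_index  # drop the chosen item explicitly
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Hypothetical similarity scores between item 0 and four items (item 0 included).
toy_row = [1.0, 0.21, 0.04, 0.52]
top_two = top_k_similar(toy_row, chosen_index=0, k=2)
print(top_two)
```

Filtering by index should give the same result as slicing from position 1 in the usual case; the two approaches can only differ when another anime ties with the chosen one, for instance because both have an identical transformed synopsis.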

<h1 id='0-content-based-filtering' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Content-Based Filtering</h1>

`Content-Based Filtering` recommends animes that are similar to other animes that a user liked. If you use Netflix, you have probably stumbled upon a row of series labeled `Because you added ...`. If so, congrats, that is a real-world Content-Based Filtering recommendation!! To make things even clearer, assume that a user liked Dragon Ball Z on a streaming platform that uses Content-Based Filtering to recommend animes. Guess what? The platform will probably recommend Dragon Ball Super or Naruto to that user, because both animes are similar to Dragon Ball Z, that is, both are shounen with superpowers.

Besides, this Filtering has two modes: 1) `Plot Description Based`, where the synopsis and/or overview is used to identify similar items; and 2) `Metadata Based`, where information about genres, producers, studios, format and so on is used to identify similar items.

About the advantages:

> **Better Recommendations** - `since it recommends different animes to each user according to similar animes they watched, it makes better recommendations when compared to Demographic Filtering`;

> **Personalized Recommendations** - `each user receives personalized recommendations according to the animes they watched`;

> **Variation of Metrics** - `since more than one similarity metric is available, the model can be tuned just by replacing the metric`.

<br />

Disadvantages-wise:

> **More Data Required** - `in order to recommend similar items, more detailed data about the animes is needed`;

> **Only Sequels and Prequels Recommendations** - `when dealing with Plot Description Based, there is a high probability of getting only sequels and prequels as recommendations, since they have very similar synopses`;

> **Bubble of Contents** - `when dealing with Metadata Based, there is a high probability of creating a Bubble that only recommends animes of a specific genre and topic`.

<br />

The image below illustrates how this technique works:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/1-content-based-filtering.png' alt='Content-Based Filtering Diagram' />
    <figcaption>Figure 1 - Content-Based Filtering Diagram. By <a href='https://medium.com/mlearning-ai/content-based-recommender-system-using-nlp-445ebb777c7a'>Arif Zainurrohman - Content-Based Recommender System Using NLP©</a>.</figcaption>
</figure>

<br /><br />

In this notebook, we are going to dive into the Plot Description Based technique and use Cosine Similarity as the similarity metric.

<h1 id='1-cosine-similarity' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Cosine Similarity</h1>

`Cosine Similarity` is a popular metric to measure how similar two items are, especially when they are texts represented by their word frequencies. In a nutshell, it divides the sum of the products of the frequencies of each element in both items by the product of the square roots of the sums of the squared frequencies of each item 🤯😶‍🌫️. You may be wondering: *What the hell did I just read?* Do not worry, the example below makes everything clear 🤣

Consider these two sentences: `Hello World!` (sentence A) and `Hello!` (sentence B). The first thing to do is to create a table showing the frequency of each word in each sentence:

<table style='border-style: solid'>
    <caption>Frequency Table of Each Word in Each Sentence</caption>
    <tr align='center' style='border-style: solid'>
        <th style='border-style: solid'>Word</th>
        <th style='border-style: solid'># in A</th>
        <th style='border-style: solid'># in B</th>
    </tr>
    <tr align='center'>
        <td style='border-style: solid'><b>Hello</b></td>
        <td style='border-style: solid'>1</td>
        <td style='border-style: solid'>1</td>
    </tr>
    <tr align='center'>
        <td style='border-style: solid'><b>World</b></td>
        <td style='border-style: solid'>1</td>
        <td style='border-style: solid'>0</td>
    </tr>
</table>

After that, we: 

- plot each sentence as a point, using its word frequencies as coordinates;
- draw a line from the origin to each point, one line per sentence;
- calculate the angle between the lines;
- calculate the cosine of the angle;
- the cosine of the angle is the Cosine Similarity Score. For word frequencies, which are never negative, it goes from 0 (nothing in common) to 1 (completely similar).

In this example, the angle between the lines is `45º` and its cosine is `0.71`, thus we can tell that the sentences have a similarity of 0.71 (roughly 71%). The image below illustrates the plot and the calculation:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/3-cosine-similarity-statquest.png' alt='Cosine Similarity of the Example' />
    <figcaption style='text-align: center'>Figure 2 - Cosine Similarity calculation of sentences in the example. By <a href='https://www.youtube.com/watch?v=e9U0QAFbfLI'>StatQuest with Josh Starmer - Cosine Similarity, Clearly Explained!!!©</a>, at 03:34 minutes.</figcaption>
</figure>

<br /><br />

Quite simple and easy to grasp, isn't it? You can check out a more detailed explanation of this topic in the StatQuest YouTube video: [StatQuest with Josh Starmer - Cosine Similarity, Clearly Explained!!!](https://www.youtube.com/watch?v=e9U0QAFbfLI).

Just one more thing: multiplying all the frequencies of a sentence by the same factor does not change the similarity, because only the direction of the line matters, not its length. It means that the sentences `Hello!` and `Hello! Hello! Hello!` have the same Cosine Similarity to `Hello World!`, since the angle between the lines is still 45º. The relative frequencies of the words within a sentence still matter, though. The image below illustrates it:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/4-cosine-similarity-statquest.png' alt='Cosine Similarity of the Example' />
    <figcaption style='text-align: center'>Figure 3 - Cosine Similarity calculation of the new example. By <a href='https://www.youtube.com/watch?v=e9U0QAFbfLI'>StatQuest with Josh Starmer - Cosine Similarity, Clearly Explained!!!©</a>, at 04:24 minutes.</figcaption>
</figure>

<br /><br />

When dealing with only two words, there is no problem: we can plot them using two axes, the x-axis and the y-axis. With three words there is no problem either: we add a third axis, the z-axis. But with four or more words this task becomes tough, because we cannot draw plots with four or more dimensions. A simple synopsis can have more than 100 words. Can you imagine what a 100-dimensional plot would look like? I am sure that even scientists have no clue about it.

Fortunately, we can replace the plot task by applying the Cosine Similarity Equation given below:

```python
sum(Wq[k] * Wd[k]) / (sqrt(sum(Wq[k]**2)) * sqrt(sum(Wd[k]**2)))
```

$$
\text{Cosine Similarity} = \frac{\sum_{k=1}^{t} (Wq[k] \cdot Wd[k])} {\sqrt{\sum_{k=1}^{t} (Wq[k]^2)} \cdot \sqrt{\sum_{k=1}^{t} (Wd[k]^2)}}
$$

where:

- Wq and Wd: the two items being compared (for instance, two synopses);
- k: position of each word in the Bag of Words;
- t: number of words in the Bag of Words;
- Wq[k] and Wd[k]: frequency of the 'k' word in the 'Wq' and 'Wd' items.
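
As a quick check of the equation, the sketch below computes it for the sentences of the example using only the standard library. The helper name `cosine_similarity_score` is an assumption made for illustration, and the vectors hold the frequencies of ('hello', 'world') in each sentence.

```python
import math

def cosine_similarity_score(wq, wd):
    """Cosine similarity of two equally long frequency vectors."""
    dot_product = sum(q * d for q, d in zip(wq, wd))
    norm_q = math.sqrt(sum(q ** 2 for q in wq))
    norm_d = math.sqrt(sum(d ** 2 for d in wd))
    return dot_product / (norm_q * norm_d)

# Frequencies of ('hello', 'world') in each sentence.
sentence_a = [1, 1]  # 'Hello World!'
sentence_b = [1, 0]  # 'Hello!'
sentence_c = [3, 0]  # 'Hello! Hello! Hello!'

similarity_ab = cosine_similarity_score(sentence_a, sentence_b)
similarity_ac = cosine_similarity_score(sentence_a, sentence_c)
angle_ab = math.degrees(math.acos(similarity_ab))

print(round(similarity_ab, 2), round(similarity_ac, 2), round(angle_ab))
```

Both similarities round to 0.71 and the angle rounds to 45 degrees, matching the figures. The sketch does not handle an all-zero vector, which would cause a division by zero.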

<h1 id='2-bag-of-words' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Bag of Words</h1>

Before applying Cosine Similarity, we have to create the `Bag of Words` first, that is, group all synopsis words together. So, assuming the sentence `Dragon Ball Series is a peak show!` as a synopsis, its Bag of Words would look like this:

```python
synopsis = ['Dragon Ball Series is a peak show!']
bag_of_words = ['dragon', 'ball', 'series', 'is', 'a', 'peak', 'show']
```

Notice that, in this specific example, each word is an element of a list and, if a word is duplicated, it appears only once. Besides, it is a best practice to remove all punctuation (commas, question marks, exclamation marks and so on) and to lower-case all texts.
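
A minimal sketch of these rules, assuming a hypothetical helper named `build_bag_of_words` (it is not used elsewhere in the notebook): it lower-cases the text, strips ASCII punctuation and keeps each word once, in order of appearance.

```python
import string

def build_bag_of_words(text):
    """Lower-case, strip ASCII punctuation and keep each word once, in order of appearance."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    # dict.fromkeys removes duplicates while preserving the insertion order
    return list(dict.fromkeys(cleaned.split()))

bag_of_words = build_bag_of_words('Dragon Ball Series is a peak show!')
print(bag_of_words)
```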

<h1 id='3-token-and-n-grams' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Token and N-Grams</h1>

We have seen earlier that each word is an element of the Bag of Words in that specific example, but there are many situations in which an element can be a combination of n words or even only a slice of a word.

Thus, before creating the Bag of Words, we have to define the `Tokens` pattern. Tokens are, literally, the proper name given to the Bag of Words elements in Natural Language Processing (NLP).

Token patterns are called `N-Grams`, where 'n' identifies how many terms compose the token. The main patterns are `Unigram, Bigram and Trigram`, being:

<br />

> **Unigram** - `token is composed of a single word`;

> **Bigram** - `token is composed of a combination of two words`;

> **Trigram** - `token is composed of a combination of three words`.

<br />

So, for the `Dragon Ball Series is a peak show!` synopsis, the Bag of Words would look like this for each pattern:


```python
unigram = ['dragon', 'ball', 'series', 'is', 'a', 'peak', 'show']
bigram = ['dragon ball', 'ball series', 'series is', 'is a', 'a peak', 'peak show']
trigram = ['dragon ball series', 'ball series is', 'series is a', 'is a peak', 'a peak show']
```

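Notice that, for a single sentence, the more words compose a Token, the fewer elements the Bag has. Across a whole dataset, however, the number of distinct bigrams and trigrams usually grows, because there are many more possible word combinations than single words.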
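
The three lists can be generated with a sliding window over the unigrams. This is a small sketch under the assumption of a hypothetical helper called `make_ngrams`; it only illustrates the idea and is not the tokenizer used later in the notebook.

```python
def make_ngrams(tokens, n):
    """Join every run of n consecutive tokens into a single n-gram string."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['dragon', 'ball', 'series', 'is', 'a', 'peak', 'show']
unigram = make_ngrams(tokens, 1)
bigram = make_ngrams(tokens, 2)
trigram = make_ngrams(tokens, 3)

print(len(unigram), len(bigram), len(trigram))
```

A sentence with 7 words yields 7 unigrams, 6 bigrams and 5 trigrams, that is, `len(tokens) - n + 1` tokens for each pattern.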

<h1 id='4-stemming-and-lemmatization' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Stemming and Lemmatization</h1>

In order to reduce the size of the Bag, some techniques reduce words to a common base form, so that many words are replaced by a single reduction. The result is a smaller Bag and less computational cost and time for the Cosine Similarity calculations. The main techniques of this kind are `Stemming` and `Lemmatization`.

`Stemming` is a way to reduce words to their stem (if you remember your English classes at school, you probably remember word stems too). For example, the words 'programming', 'programmer', and 'programs' can all be reduced down to the common word stem 'program'. In a nutshell, these three words can be represented by only one: 'program'.

Advantages:

> **Smaller Bag of Words** - `it reduces the number of unique words into the Bag of Words and, consequently, the computational cost and time to calculate the Cosine Similarity`;

> **Grouping Similar Words** - `since many words have the same stem, they can be replaced by a single common word`;

> **Easy to Understand** - `stemming rules, which mostly strip suffixes, are easier to understand when compared to the rules behind word lemmas`.

<br />

Disadvantages-wise:

> **Overstemming or False Positives** - `words with completely different meanings can share the same stem and, consequently, be replaced by a single word and interpreted as synonyms. For instance, 'universal', 'university' and 'universe' share the stem 'univers', even though they have completely different meanings`;

> **Understemming or False Negatives** - `on the other hand, words with similar meanings can have different stems and, consequently, be replaced by different words and not interpreted as synonyms. For instance, 'alumnus', 'alumnae' and 'alumni' do not share the same stem, even though they have similar meanings`;

> **Language Challenges** - `the stemming logic changes for each language, taking the morphology, spelling and character encoding into consideration and, consequently, demanding more sophisticated algorithms and computational costs`.

<br />
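
Overstemming is easy to reproduce with a deliberately naive stemmer. The sketch below is a toy suffix stripper with a hypothetical three-rule list (`SUFFIXES`); it is not the Porter algorithm nor any library implementation, only an illustration of how blind suffix removal collapses unrelated words into the same stem.

```python
# Hypothetical rule set, checked in order; real stemmers use many more rules.
SUFFIXES = ('ity', 'al', 'e')

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['universal', 'university', 'universe']
stems = {word: toy_stem(word) for word in words}
print(stems)
```

All three words end up as 'univers', so a model that only sees stems can no longer tell them apart.

<br />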

`Lemmatization` is an alternative to Stemming that, instead of reducing words to their stems, reduces them to their lemma (dictionary form). Also, it takes the role of the word in the sentence, that is, its context, into consideration to reduce it. For example, 'runs', 'running', and 'ran' would be reduced to their dictionary form 'run'.

Advantages:

> **Accuracy** - `since it takes the word meaning in the whole sentence and context, the reductions are more accurate for recommendation models`.

<br />

Disadvantages-wise:

> **Hard to Understand** - `the rules behind word lemmas, which rely on dictionaries and on the role of each word in the sentence, are harder to understand when compared to stemming rules`;

> **Computational Cost and Time** - `compared to Stemming, Lemmatization is a slow and time-consuming process due to its morphological analysis and word meaning derivation from its dictionary form`.

<br />

We are going to apply `Lemmatization` in both notebooks about Content-Based Filtering!!

<h1 id='5-stop-words' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Stop Words</h1>

In the sentence 'Dragon Ball Series is a peak show!', the words 'is' and 'a' do not give much information about the sentence context, that is, if we remove them, the main meaning of the sentence would remain the same.

Words like these, which barely affect the sentence meaning and context, are known as `Stop Words`, and they are commonly discarded from the Bag of Words because they tend to add noise to the data.

Thus, our earlier Bag would look like this after dropping its Stop Words:

```python
previous_bag_of_words = ['dragon', 'ball', 'series', 'is', 'a', 'peak', 'show']
current_bag_of_words = ['dragon', 'ball', 'series', 'peak', 'show']
```
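
Dropping Stop Words boils down to a membership test against a list of words. The sketch below assumes a small hypothetical stop-word set that contains both 'a' and 'is'; the lists shipped with NLP libraries are much longer and differ from one library to another, so the exact result depends on the list in use.

```python
# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {'a', 'an', 'the', 'is', 'are', 'of'}

def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

previous_bag = ['dragon', 'ball', 'series', 'is', 'a', 'peak', 'show']
filtered_bag = drop_stop_words(previous_bag)
print(filtered_bag)
```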

<h1 id='6-zipfs-law' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Zipf's Law</h1>

`Zipf's Law` states that a few things have very high frequencies and are popular, whereas most of the others have low frequencies and are not popular, and that there is a pattern relating the frequency of each thing to its position in the popularity ranking.

In our anime synopses scenario, some words appear many times in different synopses, so they have high frequencies and are popular; on the other side of the coin, some words have low frequencies and are unpopular. According to Zipf's Law, the most common word appears around twice as often as the second most common word, three times as often as the third one, and so on.

In order to take this Law into account when working with texts, we can use the `Term Frequency - Inverse Document Frequency` technique.
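
The "twice as often, three times as often" pattern means that the expected count of a word is roughly the count of the most common word divided by its rank. The sketch below shows this idealized relation; the helper name `zipf_expected_counts` and the top count of 1200 are made-up assumptions, and real texts only follow the pattern approximately.

```python
def zipf_expected_counts(top_count, number_of_ranks):
    """Expected count per rank under an idealized Zipf's Law: count(rank) = top_count / rank."""
    return [top_count / rank for rank in range(1, number_of_ranks + 1)]

# Hypothetical count of the most common word in a set of synopses.
expected = zipf_expected_counts(top_count=1200, number_of_ranks=4)
print(expected)
```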

<h1 id='7-term-frequency-inverse-document-frequency-tf-idf' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Term Frequency - Inverse Document Frequency (TF-IDF)</h1>

`Term Frequency - Inverse Document Frequency (TF-IDF)` builds on Zipf's Law by stating that the importance of a word for a text is influenced by two things:

<br />

> **The Word Frequency in the Text** - `the more frequent the word is, the higher is its importance for the text`;

> **The Word Frequency in other Texts** - `the more texts the word appears in, the lower is its importance for a specific text`.

<br />

Let's dive into a simple example of TF-IDF usage, where we want to measure the importance of the word 'fox' in the following two sentences:

- A quick brown fox jumps over a lazy dog. What a fox!

- A quick brown fox jumps over a lazy fox. What a fox!

Steps:

1. Calculate the Term Frequency (TF) of 'fox' for each document. The idea is to divide the word frequency by the total number of words in the document:

$$
\text{TF('fox', first_sentence)} = \frac{2}{12} = 0.17
$$

$$
\text{TF('fox', second_sentence)} = \frac{3}{12} = 0.25
$$

<br />

2. Calculate the Inverse Document Frequency (IDF) of 'fox' in the whole set of documents. The IDF of a word is the same for every document in the set, and it is given by the log of the total number of documents divided by the number of documents that contain the word. The log base can be any value, with 2, 10 and e being the most common ones. For this example, let's consider 10 as the log base:

$$
\text{IDF('fox', all_documents)} = \log_{10} \frac{2}{2} = 0
$$

3. Calculate the Term Frequency - Inverse Document Frequency (TF-IDF) of 'fox' for each sentence. The TF-IDF is calculated by multiplying the TF by the IDF value:

$$
\text{TF-IDF('fox', first_sentence)} = TF \cdot IDF = 0.17 \cdot 0 = 0
$$

$$
\text{TF-IDF('fox', second_sentence)} = TF \cdot IDF = 0.25 \cdot 0 = 0
$$

4. Get the importance of 'fox' for the sentences. Since the TF-IDF of the word is zero in both sentences, 'fox' does not help to differentiate them!! This happens because the word appears in every document, so its IDF is zero no matter how often it appears in each one. Besides, since the TF-IDF value is the same for both sentences, 'fox' is equally relevant for both documents.
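
The four steps above can be sketched with the standard library only. The helper names (`tokenize`, `term_frequency`, `inverse_document_frequency`) are assumptions made for this illustration, and the word 'dog' is included as a contrast, since it appears in only one of the sentences.

```python
import math
import string

def tokenize(text):
    """Lower-case the text, strip ASCII punctuation and split on whitespace."""
    return text.lower().translate(str.maketrans('', '', string.punctuation)).split()

def term_frequency(term, tokens):
    """Occurrences of the term divided by the number of words in the document."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, documents):
    """Base-10 log of (number of documents / number of documents with the term)."""
    documents_with_term = sum(1 for tokens in documents if term in tokens)
    return math.log10(len(documents) / documents_with_term)

documents = [
    tokenize('A quick brown fox jumps over a lazy dog. What a fox!'),
    tokenize('A quick brown fox jumps over a lazy fox. What a fox!'),
]

for term in ('fox', 'dog'):
    idf = inverse_document_frequency(term, documents)
    for position, tokens in enumerate(documents, start=1):
        tf = term_frequency(term, tokens)
        print(f'{term!r} in sentence {position}: TF={tf:.2f} IDF={idf:.2f} TF-IDF={tf * idf:.3f}')
```

The sketch follows the plain textbook formula used in this example and fails for a term that appears in no document. Libraries may use variations of it: scikit-learn's `TfidfVectorizer`, for instance, applies a smoothed IDF by default, so a word present in every document does not get a weight of exactly zero there.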

<br />

The image below illustrates the example:

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/5-tf-idf.png' alt='Calculation of TF-IDF of two sentences' />
    <figcaption style='text-align: center'>Figure 4 - Calculation of TF-IDF of two sentences. By <a href='https://www.youtube.com/watch?v=vZAXpvHhQow'>
Data Science Garage - Calculate TF-IDF in NLP (Simple Example)©</a>, at 07:14 minutes.</figcaption>
</figure>

<br /> <br />

<h1 id='8-hands-on' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Hands-on</h1>

Now that we have all the required knowledge, let's go hands-on and create a Recommendation Model using Content-Based Filtering with the Plot Description logic. Here, we are going to follow ten steps:

1. read the animes dataset and disregard all items without a synopsis;
2. lower-case all synopses;
3. remove all line break characters (\n) and other special characters (\t \r \x0b \x0c), replacing them with spaces;
4. remove all punctuation;
5. remove all Stop Words and Proper Nouns;
6. apply Lemmatization;
7. calculate Term Frequency - Inverse Document Frequency with unigrams as the token pattern;
8. calculate Cosine Similarity;
9. create a search function to find anime titles with given words;
10. test the recommendations.

---

**- Reading Animes Dataset and Disregarding All Items without Synopsis**

In [3]:
# ---- Reading Dataset ----
animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index_col='id')[
    ['title', 'genres', 'score', 'synopsis', 'image_url']
]

# ---- Removing Animes without Synopsis ----
animes_df = animes_df.loc[animes_df.synopsis != '-']

---

**- Lower Casing, Removing All Break Lines, Removing All Special Characters and Removing All Punctuations**

In [4]:
# ---- Lower Casing ----
animes_df.synopsis = animes_df.synopsis.apply(lambda synopsis: synopsis.lower())

# ---- Removing All Break Lines (\n) and Special Characters (\t \r \x0b \x0c) ----
#
# - 'split' method without arguments splits on any run of whitespace, which includes
# \n, \t, \r, \x0b and \x0c, so joining the pieces leaves only single spaces;
#
animes_df.synopsis = animes_df.synopsis.apply(lambda synopsis: ' '.join(synopsis.split()))

# ---- Removing All Punctuations ----
#
# - 'translate' method: maps each character of the string through a translation table;
# - 'str.maketrans' method parameters:
#    \ first parameter: characters to be replaced;
#    \ second parameter: characters that replace the ones in the first parameter, position by position;
#    \ third parameter: characters to be deleted from the string.
#
animes_df.synopsis = animes_df.synopsis.apply(lambda synopsis: synopsis.translate(str.maketrans('', '', string.punctuation)))
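
The three cleaning steps of this cell can be checked on a single string before running them over the whole dataset. The sample text below is made up for illustration, and `\u2014` is the long dash character that appears in some synopses.

```python
import string

raw_synopsis = 'Another day,\nanother bounty\u2014such is\tthe LIFE of the crew!'

lowered = raw_synopsis.lower()
single_spaced = ' '.join(lowered.split())  # collapses \n and \t into single spaces
cleaned = single_spaced.translate(str.maketrans('', '', string.punctuation))
print(cleaned)
```

The comma and the exclamation mark are removed, but the long dash survives, because `string.punctuation` only contains ASCII characters. This is consistent with the dashes that remain in some synopses after this cell runs; removing them would require extending the set of characters to delete.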

---

**- Removing Stop Words, Removing Proper Nouns and Lemmatizing**

In [5]:
# ---- Removing Stop Words and Proper Nouns and Lemmatizing ----
animes_df['transformed_synopsis'] = animes_df.synopsis.apply(transform_synopsis)
animes_df[['title', 'synopsis', 'transformed_synopsis']].head()

id,title,synopsis,transformed_synopsis
1,cowboy bebop,crime is timeless by the year 2071 humanity has expanded across the galaxy filling the surface of other planets with settlements like those on earth these new societies are plagued by murder drug use and theft and intergalactic outlaws are hunted by a growing number of tough bounty hunters spike spiegel and jet black pursue criminals throughout space to make a humble living beneath his goofy and aloof demeanor spike is haunted by the weight of his violent past meanwhile jet manages his own troubled memories while taking care of spike and the bebop their ship the duo is joined by the beautiful con artist faye valentine odd child edward wong hau pepelu tivrusky iv and ein a bioengineered welsh corgi while developing bonds and working to catch a colorful cast of criminals the bebop crews lives are disrupted by a menace from spikes past as a rivals maniacal plot continues to unravel spike must choose between life with his newfound family or revenge for his old wounds,crime timeless year 2071 humanity expand galaxy fill surface planet settlement like earth new society plague murder drug use theft intergalactic hunt grow number tough bounty hunter jet black pursue criminal space humble living beneath goofy aloof haunt weight violent past jet manage troubled memory take care bebop ship duo join beautiful con artist odd child bioengineered welsh corgi develop bond work catch colorful cast criminal bebop crew life disrupt menace spike past rival maniacal plot continue unravel choose life newfound family revenge old wound
5,cowboy bebop tengoku no tobira,another day another bounty—such is the life of the often unlucky crew of the bebop however this routine is interrupted when faye who is chasing a fairly worthless target on mars witnesses an oil tanker suddenly explode causing mass hysteria as casualties mount due to a strange disease spreading through the smoke from the blast a whopping three hundred million woolong price is placed on the head of the supposed perpetrator with lives at stake and a solution to their money problems in sight the bebop crew springs into action spike jet faye and edward followed closely by ein split up to pursue different leads across alba city through their individual investigations they discover a coverup scheme involving a pharmaceutical company revealing a plot that reaches much further than the ragtag team of bounty hunters could have realized,day bounty — life unlucky crew bebop routine interrupt faye chase fairly worthless target mar witness oil tanker suddenly explode cause mass hysteria casualty strange disease spread smoke blast whopping million woolong price place head suppose perpetrator life stake solution money problem sight bebop crew spring action spike jet faye edward follow closely split pursue different lead alba city individual investigation discover coverup scheme involve pharmaceutical company reveal plot reach ragtag team bounty hunter realize
6,trigun,vash the stampede is the man with a 60000000000 bounty on his head the reason hes a merciless villain who lays waste to all those that oppose him and flattens entire cities for fun garnering him the title the humanoid typhoon he leaves a trail of death and destruction wherever he goes and anyone can count themselves dead if they so much as make eye contact—or so the rumors say in actuality vash is a huge softie who claims to have never taken a life and avoids violence at all costs with his crazy doughnut obsession and buffoonish attitude in tow vash traverses the wasteland of the planet gunsmoke all the while followed by two insurance agents meryl stryfe and milly thompson who attempt to minimize his impact on the public but soon their misadventures evolve into lifeordeath situations as a group of legendary assassins are summoned to bring about suffering to the trio vashs agonizing past will be unraveled and his morality and principles pushed to the breaking point,vash stampede man 60000000000 bounty head reason s merciless villain lay waste oppose flatten entire city fun garner title humanoid typhoon leave trail death destruction go count dead eye contact — rumor actuality vash huge softie claim take life avoid violence cost crazy obsession buffoonish attitude tow vash traverse wasteland planet gunsmoke follow insurance agent meryl stryfe milly thompson attempt minimize impact public soon misadventure evolve lifeordeath situation group legendary assassin summon bring suffer trio agonizing past unravel morality principle push breaking point
7,witch hunter robin,robin sena is a powerful craft user drafted into the stnj—a group of specialized hunters that fight deadly beings known as witches though her fire power is great shes got a lot to learn about her powers and working with her cool and aloof partner amon but the truth about the witches and herself will leave robin on an entirely new path that she never expected source funimation,powerful craft user draft stnj — group specialized hunter fight deadly being know witch fire power great s get lot learn power work cool aloof partner amon truth witch leave robin entirely new path expect source funimation
8,bouken ou beet,it is the dark century and the people are suffering under the rule of the devil vandel who is able to manipulate monsters the vandel busters are a group of people who hunt these devils and among them the zenon squad is known to be the strongest busters on the continent a young boy beet dreams of joining the zenon squad however one day as a result of beets fault the zenon squad was defeated by the devil beltose the five dying busters sacrificed their life power into their five weapons saiga after giving their weapons to beet they passed away years have passed since then and the young vandel buster beet begins his adventure to carry out the zenon squads will to put an end to the dark century,dark century people suffer rule devil vandel able manipulate monster vandel buster group people hunt devil know strong buster continent young boy beet dream join day result beet fault defeat beltose die buster sacrifice life power weapon saiga give weapon beet pass away year pass young vandel buster beet begin adventure carry zenon squad end dark century


---

**- Calculating Term Frequency - Inverse Document Frequency (TF-IDF)**

`analyzer` parameter defines whether the tokens are built from words or from characters, being `word` the chosen one. Since the `ngram_range` parameter keeps its default value `(1, 1)`, the tokens are unigrams.

`norm` parameter defines how each vector is normalized, where `l2` makes the sum of the squares of the vector elements equal to 1, and `l1` makes the sum of the absolute values of the vector elements equal to 1. For the model, `l2` is the chosen normalization.

Also, we are going to apply stop word removal again in order to make sure that all stop words got removed.

In [6]:
# ---- Calculating TF-IDF ----
tfidf_vectorizer = TfidfVectorizer(analyzer='word', norm='l2', stop_words='english')
tfidf_synopsis = tfidf_vectorizer.fit_transform(animes_df.transformed_synopsis)

print(f'- Number of Animes: {tfidf_synopsis.shape[0]}')
print(f'- Number of Words to Describe the Animes: {tfidf_synopsis.shape[1]}')

- Number of Animes: 19892
- Number of Words to Describe the Animes: 38019


---

**- Calculating Cosine Similarity**

When `l2` is the chosen normalization for TF-IDF, every vector has length 1, so the Cosine Similarity can be calculated by the dot product of the TF-IDF results (see [sklearn.feature_extraction.text.TfidfVectorizer - Norm Parameter](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)). In order to do it efficiently, requiring less computational time and cost, we are going to use the pairwise metrics from sklearn.

About the metrics, we have two options: `Cosine Similarity (sklearn.metrics.pairwise.cosine_similarity)` and `Linear Kernel (sklearn.metrics.pairwise.linear_kernel)`. Since the vectors are already normalized, the second one can skip the normalization step and is faster than the first, so we are going to stick with it.

In [7]:
# ---- Calculating Cosine Similarity ----
cosine_similarity_synopsis = linear_kernel(tfidf_synopsis, tfidf_synopsis)
cosine_similarity_synopsis

array([[1.        , 0.21543585, 0.0421459 , ..., 0.        , 0.        ,
        0.        ],
       [0.21543585, 1.        , 0.04182334, ..., 0.        , 0.        ,
        0.        ],
       [0.0421459 , 0.04182334, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.17358749,
        0.3330349 ],
       [0.        , 0.        , 0.        , ..., 0.17358749, 1.        ,
        0.52122912],
       [0.        , 0.        , 0.        , ..., 0.3330349 , 0.52122912,
        1.        ]])
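
As a quick sanity check of the claim above, the sketch below compares `linear_kernel` and `cosine_similarity` on a small toy corpus. The sentences and variable names are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus (hypothetical, not taken from the anime dataset).
toy_synopses = [
    'young alchemist search stone',
    'young ninja dream village leader',
    'alchemist brother search stone restore body',
]

# l2-normalized TF-IDF vectors, as in the model above.
toy_tfidf = TfidfVectorizer(analyzer='word', norm='l2').fit_transform(toy_synopses)

# Plain dot products between every pair of rows.
dot_products = linear_kernel(toy_tfidf, toy_tfidf)

# Cosine similarity, which normalizes the rows before the dot product.
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

# On unit-length rows both matrices should match.
same_result = np.allclose(dot_products, cosines)
print(same_result)
```

Keep in mind that both functions return a dense `n x n` array, so on the full dataset the result grows quadratically with the number of animes.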

---

**- Creating Search Function**

In [8]:
# ---- Recommending Animes: Resetting Animes Dataframe Index ----
#
# - so that the index follows a sequence from 0 to 'n - 1', being 'n'
# the total number of animes.
#
animes_df.reset_index(inplace=True)

In [9]:
# ---- Recommending Animes: Searching Titles ----
#
# - searches for anime titles that contain a given string, so that the
# exact title can be used in the next cell to get recommendations.
#
animes_df.title.loc[animes_df.title.str.contains('brotherhood')]

3937                      fullmetal alchemist brotherhood
4537             fullmetal alchemist brotherhood specials
5117     fullmetal alchemist brotherhood - 4-koma theater
11317                        brotherhood final fantasy xv
Name: title, dtype: object

---

**- Recommendations**

In [10]:
# ---- Recommending Animes ----
animes_indices = pd.Series(animes_df.index, index=animes_df.title)

get_recommendations(
    dataset=animes_df
    , title='fullmetal alchemist brotherhood'
    , animes_indices=animes_indices
    , cosine_similarity=cosine_similarity_synopsis
    , number_recommendations=10
)

id,title,synopsis,score,genres,image_url,cosine_similarity
121,fullmetal alchemist,edward elric a young brilliant alchemist has lost much in his twelveyear life when he and his brother alphonse try to resurrect their dead mother through the forbidden act of human transmutation edward loses his brother as well as two of his limbs with his supreme alchemy skills edward binds alphonses soul to a large suit of armor a year later edward now promoted to the fullmetal alchemist of the state embarks on a journey with his younger brother to obtain the philosophers stone the fabled mythical object is rumored to be capable of amplifying an alchemists abilities by leaps and bounds thus allowing them to override the fundamental law of alchemy to gain something an alchemist must sacrifice something of equal value edward hopes to draw into the militarys resources to find the fabled stone and restore his and alphonses bodies to normal however the elric brothers soon discover that there is more to the legendary stone than meets the eye as they are led to the epicenter of a far darker battle than they could have ever imagined,8.11,"fantasy, action, award winning, adventure, drama",https://cdn.myanimelist.net/images/anime/10/75815.jpg,0.500993
10842,fullmetal alchemist the sacred star of milos specials,to mark the july 2 opening of the fullmetal alchemist the sacred star of milos film the pia eiga seikatsu website posted an exclusive video interview with the stars of the film edward and alphonse elric as voiced by romi park and rie kugimiya respectively in keeping with the spirit of hiromu arakawas original manga and the two television anime the interviewer has trouble early on in figuring out who the fullmetal alchemist is the interview has cameos by the other stars of the anime also includes 3 study sessions with professor mustang teaching winry and hawkeye about creta and milos,6.83,"comedy, fantasy",https://cdn.myanimelist.net/images/anime/9/29928.jpg,0.281833
430,fullmetal alchemist the conqueror of shamballa,in desperation edward elric sacrificed his body and soul to rescue his brother alphonse and is now displaced in the heart of munich germany he struggles to adapt to a world completely foreign to him in the wake of the economic crisis that followed the end of world war i isolated and unable to return home with his alchemy skills edward continues to research other methods of escaping the prison alongside colleagues who bear striking resemblances to many of the people he left behind as dissent brews among the german citizenry its neighbors also feel the unrest of the humiliated nation meanwhile alphonse continues to investigate edwards disappearance delving into the science of alchemy in the hopes of finally reuniting with his older brother,7.52,"comedy, fantasy, award winning, drama",https://cdn.myanimelist.net/images/anime/1707/94039.jpg,0.263548
9135,fullmetal alchemist the sacred star of milos,chasing a runaway alchemist with strange powers brothers edward and alphonse elric stumble into the squalid valley of the milos the milosians are an oppressed group that seek to reclaim their holy land from creta a militaristic country that forcefully annexed their nation in the eye of the political storm is a girl named julia crichton who emphatically wishes for the milos to regain their strength and return to being a nation of peace befriending the girl edward and alphonse find themselves in the midst of a rising resistance that involves the use of the very object they have been seeking all along—the philosophers stone however their past experiences with the stone cause them reservation and the brothers are unwilling to help but as they discover the secrets behind cretas intentions and questionable history the brothers are drawn into the battle between the rebellious milos who desire their liberty and the cretan military who seek absolute power,7.26,"adventure, action, fantasy, drama",https://cdn.myanimelist.net/images/anime/2/29550.jpg,0.262465
6421,fullmetal alchemist brotherhood specials,amazing secrets and startling facts are exposed for the first time in the fullmetal alchemist brotherhood ova collection a new assortment of stories set in neverbeforeseen corners of the fma universe join ed and al as they chase rumors of successful human transmutation into a web of shocking family drama and lies sneak a glance at hidden sides of winry and hawkeyes personalities survive the frigid north with a young izumi curtis as she fights to gain a deeper understanding of alchemy explore the legendary friendship shared by mustang and hughes and watch them grow from military school rivals into hardened brothers transformed by the horrors of the ishvalan war you thought you knew the whole story you thought all the tales were told the fullmetal alchemist brotherhood ova collection offers proof you were wrong source funimation,8.0,"fantasy, drama",https://cdn.myanimelist.net/images/anime/1493/91571.jpg,0.209237
54296,kawa nagu kasouba,a boy comes to terms with the loss of his brother,-1.0,drama,https://cdn.myanimelist.net/images/anime/1198/132898.jpg,0.150821
38395,otona no bouguya-san,kautz was looking for a job and suddenly he gets hired by an armor shop but its not your run of the mill armor shop its an adult armor shop an ecchi comedy showing the shopkeeper side of selling sexy battle armor,5.6,"comedy, fantasy, ecchi",https://cdn.myanimelist.net/images/anime/1590/111713.jpg,0.127446
1266,yoroiden samurai troopers kikoutei densetsu,a strange silent warrior appears when the inferno armor is summoned to battle him the mysterious warrior calls forth his own armor—a black copy of the inferno armor now the warriors will find themselves half a world away fighting once more to save humanity but this time from their own armor source anidb,6.44,"adventure, fantasy",https://cdn.myanimelist.net/images/anime/1504/92211.jpg,0.118502
49849,shinmai renkinjutsushi no tenpo keiei,after her parents are killed by bandits young sarasa feed is sent to an orphanage where she learns about the prestigious world of alchemy determined to follow in her late parents footsteps and become an independent business owner she enrolls at the royal alchemist academy to train and get certified as an official alchemist five years later sarasa graduates from the institution and nearly achieves her dream before realizing that she lacks the funds to do so left with little choice she makes do with a lowpriced storefront in a town situated far away from the bustling capital despite the threat of treacherous terrain and dangerous monsters in the remote countryside sarasa begins her exciting new life as an alchemist,6.61,"slice of life, adventure, fantasy",https://cdn.myanimelist.net/images/anime/1963/128728.jpg,0.107396
36300,oniichan zurui,a younger brother hates how his older brother is unfair he takes his toys gets gets to choose the tv channel so the younger brother decides to eat a lot in order to become the bigger brother through sheer growth spurts,-1.0,-,https://cdn.myanimelist.net/images/anime/7/87693.jpg,0.107184
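
The ranking step behind a table like the one above can be sketched as follows. This is not the notebook's `get_recommendations` function: `top_similar`, the toy titles and the toy similarity matrix are hypothetical, and the sketch assumes that titles are unique.

```python
import numpy as np
import pandas as pd

def top_similar(title, title_to_index, similarity, number_recommendations=10):
    """Hypothetical helper: ranks the items most similar to `title`."""
    item_index = title_to_index[title]       # row of the queried title
    scores = similarity[item_index]          # similarity to every item
    ranked = np.argsort(scores)[::-1]        # positions sorted by descending score
    ranked = ranked[ranked != item_index]    # drop the queried title itself
    top = ranked[:number_recommendations]
    return pd.DataFrame({
        'title': title_to_index.index[top]
        , 'cosine_similarity': scores[top]
    })

# Toy data (made up for illustration).
toy_titles = ['anime a', 'anime b', 'anime c', 'anime d']
toy_indices = pd.Series(range(len(toy_titles)), index=toy_titles)
toy_similarity = np.array([
    [1.0, 0.5, 0.1, 0.0],
    [0.5, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.4],
    [0.0, 0.3, 0.4, 1.0],
])

toy_recommendations = top_similar(
    title='anime a'
    , title_to_index=toy_indices
    , similarity=toy_similarity
    , number_recommendations=2
)
print(toy_recommendations)
```

For 'anime a', the two highest remaining scores belong to 'anime b' (0.5) and 'anime c' (0.1), so these are the rows returned, in that order.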


Notice that Content-Based Filtering by Plot Description usually recommends sequels, prequels and specials of the same anime first. Although this is good for those who watched and liked the show, it can bore users, since they may only receive recommendations from the same anime universe. For instance, try replacing 'fullmetal alchemist brotherhood' with 'dragon ball z' or 'naruto': almost all recommendations will be sequels and prequels of that same anime.

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/)