# Anime Recommendation System

Ever finished a Netflix show and wondered whether there is a series similar to the one you just watched? Netflix has you covered with a list of suggestions you might like. These come from its recommendation system, which uses inputs such as your watch history to predict which show you will enjoy next.

Recommendation systems generally come in two types: **Content-Based** and **Collaborative Filtering**.
In this notebook, we will apply the first method to an anime database sourced from [Kaggle](https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset/?select=users-score-2023.csv) and weigh its advantages and disadvantages.

# Content-Based Filtering

In layman's terms, **Content-Based Filtering** is similar to what we as humans do when recommending a film, book, or song to a friend. We learn what the other person already likes and make an educated suggestion based on that. The method is built on the idea that users are likely to enjoy a product that is similar to one they have just consumed.

## Setup

First, let us import all the necessary libraries that we will be using to make a **Content-Based** recommendation system. Let us also import the necessary data files:

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sn

import warnings
warnings.filterwarnings("ignore")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
data = pd.read_csv("anime-dataset-2023.csv")
data.head()

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",...,Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...
4,8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",...,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001,https://cdn.myanimelist.net/images/anime/7/215...


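The output of `data.info()` shows no null values in any column. Note, however, that most columns are stored as `object`, and the dataset marks unavailable values with placeholder strings such as `UNKNOWN` rather than with NaN: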

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   anime_id      24905 non-null  int64 
 1   Name          24905 non-null  object
 2   English name  24905 non-null  object
 3   Other name    24905 non-null  object
 4   Score         24905 non-null  object
 5   Genres        24905 non-null  object
 6   Synopsis      24905 non-null  object
 7   Type          24905 non-null  object
 8   Episodes      24905 non-null  object
 9   Aired         24905 non-null  object
 10  Premiered     24905 non-null  object
 11  Status        24905 non-null  object
 12  Producers     24905 non-null  object
 13  Licensors     24905 non-null  object
 14  Studios       24905 non-null  object
 15  Source        24905 non-null  object
 16  Duration      24905 non-null  object
 17  Rating        24905 non-null  object
 18  Rank          24905 non-null  object
 19  Popu
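
Because `info()` counts only real NaN values, string placeholders pass as non-null. The following is a minimal sketch of how such placeholders could be counted, using a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame imitating the convention of writing "UNKNOWN" instead of NaN
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Score": ["8.75", "UNKNOWN", "6.41"],
    "Genres": ["Action, Sci-Fi", "UNKNOWN", "UNKNOWN"],
})

# isna() finds nothing, because the placeholders are ordinary strings
nan_counts = toy.isna().sum()

# Comparing against the placeholder reveals the gaps per column
unknown_counts = (toy == "UNKNOWN").sum()

# Optionally convert the placeholders into real missing values
toy_clean = toy.replace("UNKNOWN", np.nan)

print(unknown_counts)
```

Running the same comparison on `data` would show which columns are affected before deciding how to treat them.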

## Data Cleaning

`info()` reported no null values in the data set, but what about duplicate entries?

In [4]:
data[data.duplicated(subset="Name")]

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
24586,55351,Azur Lane,Azur Lane,アズールレーン,UNKNOWN,"Action, Slice of Life",Assorted commercials for the Azur Lane Mobile ...,Special,UNKNOWN,"Apr 17, 2020 to ?",...,UNKNOWN,Game,30 sec,PG-13 - Teens 13 or older,0.0,0,0,UNKNOWN,0,https://cdn.myanimelist.net/images/anime/1421/...
24781,55582,Utopia,UNKNOWN,ユートピア,UNKNOWN,UNKNOWN,No description available for this anime.,Music,1.0,"Aug 12, 2021",...,UNKNOWN,Original,3 min,G - All Ages,0.0,0,0,UNKNOWN,0,https://cdn.myanimelist.net/images/anime/1475/...
24807,55610,Souseiki,UNKNOWN,UNKNOWN,UNKNOWN,Fantasy,As Shoko Asahara is depicted in a less embelli...,OVA,1.0,Not available,...,UNKNOWN,Original,Unknown,G - All Ages,0.0,0,0,UNKNOWN,0,https://cdn.myanimelist.net/images/anime/1704/...
24840,55658,Awakening,UNKNOWN,Awakening,UNKNOWN,UNKNOWN,Music video for the song Awakening by Ken Ishii.,Music,1.0,"Oct 30, 2002",...,UNKNOWN,Original,4 min,G - All Ages,UNKNOWN,24580,0,UNKNOWN,35,https://cdn.myanimelist.net/images/anime/1632/...


In [5]:
names_to_check = ["Azur Lane", "Utopia", "Souseiki", "Awakening"]
data[data["Name"].isin(names_to_check)]

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
3053,3473,Souseiki,The Genesis,創世記,5.38,UNKNOWN,This is a short film parodying the major Holly...,Movie,1.0,"Oct 1, 1968",...,UNKNOWN,Unknown,4 min,PG - Children,11141.0,10929,1,965.0,1831,https://cdn.myanimelist.net/images/anime/10/57...
8375,21103,Utopia,UNKNOWN,true tears×花咲くいろは×TARITARI ユートピア,6.41,UNKNOWN,Utopia is a song that was specially created fo...,Music,1.0,"Feb 26, 2014",...,P.A. Works,Original,1 min,PG-13 - Teens 13 or older,UNKNOWN,6874,4,2740.0,7252,https://cdn.myanimelist.net/images/anime/13/55...
14749,38328,Azur Lane,Azur Lane the Animation,アズールレーン THE ANIMATION,6.28,"Action, Sci-Fi","When the ""Sirens,"" an alien force with an arse...",TV,12.0,"Oct 3, 2019 to Mar 20, 2020",...,Bibury Animation Studios,Game,23 min per ep,R+ - Mild Nudity,7502.0,1343,960,58398.0,162902,https://cdn.myanimelist.net/images/anime/1106/...
24137,54618,Awakening,UNKNOWN,Awakening,UNKNOWN,UNKNOWN,No description available for this anime.,Music,1.0,"May 11, 2020",...,UNKNOWN,Mixed media,3 min,UNKNOWN,0.0,0,0,UNKNOWN,0,https://cdn.myanimelist.net/images/anime/1535/...
24586,55351,Azur Lane,Azur Lane,アズールレーン,UNKNOWN,"Action, Slice of Life",Assorted commercials for the Azur Lane Mobile ...,Special,UNKNOWN,"Apr 17, 2020 to ?",...,UNKNOWN,Game,30 sec,PG-13 - Teens 13 or older,0.0,0,0,UNKNOWN,0,https://cdn.myanimelist.net/images/anime/1421/...
24781,55582,Utopia,UNKNOWN,ユートピア,UNKNOWN,UNKNOWN,No description available for this anime.,Music,1.0,"Aug 12, 2021",...,UNKNOWN,Original,3 min,G - All Ages,0.0,0,0,UNKNOWN,0,https://cdn.myanimelist.net/images/anime/1475/...
24807,55610,Souseiki,UNKNOWN,UNKNOWN,UNKNOWN,Fantasy,As Shoko Asahara is depicted in a less embelli...,OVA,1.0,Not available,...,UNKNOWN,Original,Unknown,G - All Ages,0.0,0,0,UNKNOWN,0,https://cdn.myanimelist.net/images/anime/1704/...
24840,55658,Awakening,UNKNOWN,Awakening,UNKNOWN,UNKNOWN,Music video for the song Awakening by Ken Ishii.,Music,1.0,"Oct 30, 2002",...,UNKNOWN,Original,4 min,G - All Ages,UNKNOWN,24580,0,UNKNOWN,35,https://cdn.myanimelist.net/images/anime/1632/...


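Further inspection indicates that **Utopia** and **Awakening** seem to be the only duplicates among the four names, so we remove one entry of each (rows 24137 and 24781). The **Azur Lane** and **Souseiki** entries are kept, which is why two name matches remain afterwards: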

In [6]:
data.drop(index = [24137,24781], inplace = True)
data.duplicated(subset="Name").sum()

2
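
By default, `duplicated` flags only the second and later occurrences of a name, which is why a separate `isin` lookup was needed above to see every copy. As a sketch on made-up data (the `toy` frame is an assumption for illustration), `keep=False` lists all rows that share a name in one step:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Utopia", "Trigun", "Utopia", "Azur Lane", "Azur Lane"],
    "Type": ["Music", "TV", "Music", "TV", "Special"],
})

# Default keep="first": only the later occurrences are flagged
later_only = toy[toy.duplicated(subset="Name")]

# keep=False flags every row whose name appears more than once,
# so the copies can be compared side by side
all_copies = toy[toy.duplicated(subset="Name", keep=False)].sort_values("Name")

print(all_copies)
```

Comparing the copies side by side makes it easier to judge whether two rows are true duplicates or distinct works that merely share a title.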

What about animes that have no available description?

In [7]:
sum(data["Synopsis"] == "No description available for this anime.")

4533

To extract useful information from the *Synopsis* column, we need to decide how to handle these entries. There are three options: removing the entries, leaving them empty, or substituting the *Genres* column for the *Synopsis*.

Leaving the entries blank in the *Synopsis* column will deprive us of valuable information to be utilized later in the notebook. This could lead to inaccuracies in recommending the **4533** animes, making it an unsuitable choice in this context. Furthermore, considering that this subset comprises nearly a fifth of our dataset, removing these rows would have a more significant adverse impact than desired.

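Alternatively, by substituting the *Genres* column for the missing *Synopsis* entries, we give these animes at least some descriptive text, which should yield better recommendations than either of the other two options. Bear in mind that where *Genres* is itself `UNKNOWN`, the substitution adds little information.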

In [8]:
data["Synopsis"] = np.where(data['Synopsis'] == 'No description available for this anime.', data['Genres'], data['Synopsis'])
sum(data["Synopsis"] == "No description available for this anime.")

0

To build a **Content-Based** recommender system, we will focus solely on the columns *Name* and *Synopsis*. These columns provide enough information to proceed with building our system, which we will base on the plot of the anime:

In [9]:
df = data[["Name","Synopsis"]]
df.head()

Unnamed: 0,Name,Synopsis
0,Cowboy Bebop,"Crime is timeless. By the year 2071, humanity ..."
1,Cowboy Bebop: Tengoku no Tobira,"Another day, another bounty—such is the life o..."
2,Trigun,"Vash the Stampede is the man with a $$60,000,0..."
3,Witch Hunter Robin,Robin Sena is a powerful craft user drafted in...
4,Bouken Ou Beet,It is the dark century and the people are suff...


## Tokenizing the Dataframe

Since we will base the **Content-Based Filtering** on the synopsis of the anime, we need to tokenize the text and convert it into a numerical representation that we can compute with.

We will accomplish this using the **TfidfVectorizer** from the Scikit-Learn library. 

*TF-IDF*, or Term Frequency-Inverse Document Frequency, creates features based on the words in the *Synopsis* column. It considers how often a word is used and inversely relates this to how many animes use that word. If a word is common across many animes, it will have a lower TF-IDF value, as it doesn't help differentiate between animes.
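
To make the weighting concrete, here is a small sketch on three invented one-line synopses (the texts and the `toy_*` names are assumptions for illustration). A word that occurs in two of the three documents receives a lower inverse document frequency than a word that occurs in only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_synopses = [
    "pirates search for treasure",
    "pirates fight the navy",
    "students join a music club",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_synopses)

vocab = toy_vectorizer.vocabulary_   # word -> column index
idf = toy_vectorizer.idf_            # learned IDF weight per column

idf_pirates = idf[vocab["pirates"]]    # appears in 2 of 3 synopses
idf_treasure = idf[vocab["treasure"]]  # appears in 1 of 3 synopses

print("idf(pirates)  =", round(idf_pirates, 3))
print("idf(treasure) =", round(idf_treasure, 3))
```

The more synopses a word appears in, the smaller its weight, so generic words contribute less to the similarity between two animes.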

It's important to highlight that we opt for *TF-IDF* over plain word counts (Scikit-Learn's **CountVectorizer**) because, in addition to tokenizing and counting, it down-weights words that appear in many synopses and, by default, normalizes each row to unit length, so that long and short synopses remain comparable.
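
The difference between the two vectorizers can be sketched on two invented documents (the texts are assumptions for illustration). `CountVectorizer` returns raw counts, while `TfidfVectorizer` with its default settings returns rows of unit Euclidean length:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["robot robot robot pilot", "robot pilot"]

# Raw term counts; columns are ordered alphabetically: pilot, robot
counts = CountVectorizer().fit_transform(docs).toarray()

# TF-IDF weights; each row is L2-normalized by default (norm="l2")
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
row_norms = np.linalg.norm(tfidf, axis=1)

print(counts)
print(row_norms)
```

With raw counts the longer document has larger values simply because it repeats a word, whereas the normalized TF-IDF rows all have the same length.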

In [10]:
vectorizer = TfidfVectorizer(stop_words = "english")
synop_matrix = vectorizer.fit_transform(df["Synopsis"])
synop_matrix.shape

(24903, 51091)

We have generated a TF-IDF matrix for our entire collection of **24903** animes, each characterized by **51091** features (words). We can now apply the **cosine_similarity** function from the Scikit-Learn library to this matrix. It computes the cosine of the angle between the vectors of two animes, which quantifies how similar their synopses are.
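
As a quick illustration of what the cosine measures, the sketch below compares a hand-written formula with the library function on three made-up vectors (`a`, `b`, `c` and the helper `cosine` are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])   # shares no non-zero feature with a

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

manual = [cosine(a, b), cosine(a, c)]
library = cosine_similarity(a.reshape(1, -1), np.vstack([b, c]))[0]

print(manual)
print(library)
```

Vectors pointing in the same direction score 1 regardless of their length, and vectors with no features in common score 0, which matches the many zeros in the matrix for synopses that share no words.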

In [11]:
similarity_matrix = cosine_similarity(synop_matrix, synop_matrix)
similarity_matrix

array([[1.        , 0.26542377, 0.01994642, ..., 0.        , 0.        ,
        0.        ],
       [0.26542377, 1.        , 0.03805935, ..., 0.03039382, 0.        ,
        0.        ],
       [0.01994642, 0.03805935, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.03039382, 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.3557177 ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.3557177 ,
        1.        ]])
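
A practical caveat: a dense 24903 × 24903 matrix of 64-bit floats needs about 24903² × 8 bytes, roughly 5 GB of memory. If that is a problem, one possible alternative is to compute a single row of similarities only when it is needed. The sketch below uses invented synopses, and the helper name `similarity_row` is hypothetical. It relies on the fact that `TfidfVectorizer` normalizes rows to unit length by default, so a plain dot product (`linear_kernel`) equals the cosine similarity:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

toy_synopses = [
    "giant titans attack the walled city",
    "soldiers defend the walled city from titans",
    "a high school band prepares for a concert",
    "a pianist and a violinist prepare for a concert",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_synopses)

def similarity_row(matrix, position):
    """Similarity of one row against all rows, without the full square matrix."""
    # Slicing keeps the query two-dimensional (1 x n_features)
    query = matrix[position:position + 1]
    # Rows are unit length, so the dot product is the cosine similarity
    return linear_kernel(query, matrix).ravel()

row0 = similarity_row(toy_matrix, 0)
full = cosine_similarity(toy_matrix)   # only feasible here because the toy data is tiny

print(row0)
```

This trades one large up-front computation for a small computation per request, which may suit an interactive recommender better.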

## Building the Recommendation System

We will generate a series that maps each anime name to its index in the dataframe, simplifying the process of inputting anime titles to receive recommendations:

In [12]:
mapping = pd.Series(df.index,index = df["Name"])
mapping

Name
Cowboy Bebop                           0
Cowboy Bebop: Tengoku no Tobira        1
Trigun                                 2
Witch Hunter Robin                     3
Bouken Ou Beet                         4
                                   ...  
Wu Nao Monu                        24900
Bu Xing Si: Yuan Qi                24901
Di Yi Xulie                        24902
Bokura no Saishuu Sensou           24903
Shijuuku Nichi                     24904
Length: 24903, dtype: int64
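
Two details of this series are worth keeping in mind. The values are dataframe index labels (they run up to 24904 although there are only 24903 rows, because two rows were dropped), whereas the similarity matrix is addressed by row position. In addition, a name that occurs twice returns several values. The sketch below shows both effects on made-up data (`toy`, `label_map` and `position_map` are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["A", "B", "C", "B"]}, index=[0, 1, 2, 3])
toy = toy.drop(index=[0])   # dropping a row leaves a gap in the labels

# Same construction as above: values are the index labels
label_map = pd.Series(toy.index, index=toy["Name"])

# Alternative: values are row positions 0..n-1, matching a matrix built from toy
position_map = pd.Series(range(len(toy)), index=toy["Name"])

label_of_c = label_map["C"]         # 2, the original label
position_of_c = position_map["C"]   # 1, the row position after the drop

# A duplicated name yields a Series with one entry per occurrence
entries_for_b = label_map["B"]

print(label_of_c, position_of_c, len(entries_for_b))
```

Resetting the dataframe index after cleaning, or mapping names to positions as in `position_map`, would keep labels and matrix rows aligned.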

Next, we will create a recommender function that utilizes the cosine similarity to suggest animes. This function will accept an anime title as input and identify the top 10 anime recommendations based on the cosine similarity matrix we previously generated:

In [13]:
def anime_recommendation(anime_input):

    # mapping stores dataframe index labels; rows were dropped during cleaning,
    # so convert the label to a row position before indexing the matrix.
    # Assumes anime_input is a name that occurs exactly once.
    anime_index = df.index.get_loc(mapping[anime_input])

    # similarity_score is a list of (row position, similarity score) pairs
    similarity_score = list(enumerate(similarity_matrix[anime_index]))

    # Sort the similarity_score in descending order of score
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar animes. Skip the first entry, as the highest similarity is with the input itself
    similarity_score = similarity_score[1:11]

    # Look up the anime names by row position
    anime_indices = [i[0] for i in similarity_score]

    return df["Name"].iloc[anime_indices]
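
Slicing off the first sorted entry assumes that the input anime itself ranks first. That may not hold when another anime has an identical synopsis, for example two entries whose synopsis was replaced by the same genre string, since both then score 1. A possible variant, sketched here with a hypothetical helper `top_k_similar` on a made-up score row, excludes the input by position instead:

```python
import numpy as np

def top_k_similar(similarity_row, query_position, k=10):
    """Return the row positions of the k highest scores, excluding the query."""
    scores = np.asarray(similarity_row, dtype=float).copy()
    scores[query_position] = -np.inf   # never recommend the input title itself
    k = min(k, len(scores) - 1)
    # Stable sort on negated scores: descending order, ties keep original order
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Position 0 ties with the query at position 2 (both score 1.0)
toy_row = np.array([1.0, 0.2, 1.0, 0.7, 0.0])
recommended = top_k_similar(toy_row, query_position=2, k=3)

print(recommended)
```

In this toy row, dropping the first sorted entry would discard position 0 and keep the query at position 2, whereas excluding by position keeps the tied item and removes only the query.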

## Displaying User Recommendations

In [14]:
anime_recommendation("Shingeki no Kyojin")

9352                           Shingeki no Kyojin Season 2
13176                          Shingeki no Kyojin Season 3
20972          Shingeki no Kyojin: The Final Season Part 2
14865                   Shingeki no Kyojin Season 3 Part 2
13721    Shingeki no Kyojin Season 2 Movie: Kakusei no ...
10874                          Shingeki! Kyojin Chuugakkou
22348    Shingeki no Kyojin: The Final Season - Kankets...
15822                 Shingeki no Kyojin: The Final Season
12776                                   Shingeki no Kyotou
64                                Kidou Senshi Zeta Gundam
Name: Name, dtype: object

We can see that for the input **Shingeki no Kyojin** we get 10 recommendations whose synopses are similar to it. Most of them are sequels or spin-offs of the input, so the recommendations seem reasonable: someone who liked **Shingeki no Kyojin** would most likely also enjoy **Season 2** and **Season 3** of the same franchise.

## Conclusion

To conclude the **Content-Based Filtering** section of this project: by utilizing the synopsis of an anime, we are able to provide recommendations in much the same way as people do among themselves.

The method is relatively simple and captures a user's interests, making it suitable for providing personalized recommendations. Furthermore, the recommended items need not be popular, which allows for a more niche selection.

The limitation of this methodology is that relying on item similarity can lead to a lack of novelty and variety in the recommendations. Additionally, there can be issues with inconsistent and inaccurate attributes (as with the *Synopsis* column), and scaling up can become problematic because new items require descriptions and categorization.