# Content Based Anime Recommendation System
## Introduction
In this project, I will be building a content based recommendation system for an anime dataset.

**Objective:** Be able to input an anime's name and get 10 recommendations for related content.

**Method:** To keep this example simple, the recommendation system will focus mainly on genre, with anime type and episode count as supporting features. This makes it a content based recommendation system.

**Preface:** Since I will be building a content based engine, the engine will have difficulty recommending across genres. This is one of the main weaknesses of this approach. One way to address it is a Collaborative Filtering recommendation system, which compares user data and recommends on that basis. Even so, the content based approach is powerful and a wonderful introduction to the world of recommendation systems.

# Data Dictionary
Before I begin, here is the data dictionary for the dataset I will be using.

- **anime_id** - myanimelist.net's unique id identifying an anime.
- **name** - full name of anime.
- **genre** - comma separated list of genres for this anime.
- **type** - movie, TV, OVA, etc.
- **episodes** - how many episodes in this show. (1 if movie).
- **rating** - average rating out of 10 for this anime.
- **members** - number of community members that are in this anime's "group".

In [26]:
# Import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [27]:
# Create data path
path = "/kaggle/input/animee/anime.csv"

# Create dataframe
data = pd.read_csv(path)

In [28]:
# Explore first few lines of data
data.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [29]:
# Get descriptive statistics on the data

# Numeric data
data.describe()

Unnamed: 0,anime_id,rating,members
count,12294.0,12064.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.026746,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.88,225.0
50%,10260.5,6.57,1550.0
75%,24794.5,7.18,9437.0
max,34527.0,10.0,1013917.0


In [30]:
# Object data
data[['name', 'genre', 'type']].describe()

Unnamed: 0,name,genre,type
count,12294,12232,12269
unique,12292,3264,6
top,Shi Wan Ge Leng Xiaohua,Hentai,TV
freq,2,823,3787


In [31]:
# Get information on the columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


# Pre-processing
Now I can begin the data preprocessing needed for this project. This portion focuses on cleaning the data, as well as feature engineering.

## Data Cleaning
The first step in the pre-processing phase is to ensure the data is clean. In this instance, we are really only worried about duplicates (so we do not get the same anime recommended twice) and missing values. This step will also include dropping any columns that will not provide any value to our recommendation engine.

In [32]:
# Checking for missing values
data.isna().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

Since genre is what the recommendation system is based on, I will drop all rows that have no genre data. Ratings are not needed for content based filtering, since it only looks at an item's descriptive metadata. For simplicity, the next cell drops every row with a missing value in any column, so rows with a missing type or rating are removed as well. The rating column itself is dropped later, during feature engineering.

In [33]:
# Remove rows with null values
data = data.dropna(how='any', axis=0)
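
For comparison, here is a minimal sketch of how `how='any'` differs from restricting the check to a single column with `subset`. The toy frame and its values are made up for illustration; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["Action", np.nan, "Comedy", "Drama"],
    "rating": [8.1, 7.0, np.nan, 6.5],
})

# Drops every row that has a missing value in any column
dropped_any = toy.dropna(how="any", axis=0)

# Drops only the rows where 'genre' is missing
dropped_genre_only = toy.dropna(subset=["genre"])

print(len(dropped_any))         # 2 rows remain (A and D)
print(len(dropped_genre_only))  # 3 rows remain (A, C and D)
```

Using `how='any'` is the stricter choice: it also removes rows whose only gap is in a column that is dropped later anyway.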

In [34]:
# Get a look at the cleaned data
print(data.info())

# Double check all nulls were removed
print(data.isna().sum())

<class 'pandas.core.frame.DataFrame'>
Index: 12017 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12017 non-null  int64  
 1   name      12017 non-null  object 
 2   genre     12017 non-null  object 
 3   type      12017 non-null  object 
 4   episodes  12017 non-null  object 
 5   rating    12017 non-null  float64
 6   members   12017 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 751.1+ KB
None
anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64


In [35]:
# Check for duplicated rows
data.duplicated().sum()

0

Now that the data is generally considered clean, I will begin the feature engineering section of the pre-processing phase.

## Feature Engineering

Feature engineering is the most important part of the recommendation system. We need to be able to create a similarity score based on the data held within the genre column. To do this, I will split each comma separated genre string into a list and create one 0/1 column per genre.

In [36]:
# Extract genres (split by comma)
data['genre_list'] = data['genre'].str.split(', ')

# Create one-hot encoded columns for each genre
genres = set(data['genre_list'].explode())
for genre in genres:
    data[genre] = data['genre_list'].apply(lambda x: 1 if genre in x else 0)

# Drop unnecessary columns
data.drop(['genre', 'genre_list'], axis=1, inplace=True)

In [37]:
data.head()

Unnamed: 0,anime_id,name,type,episodes,rating,members,Romance,Police,Vampire,Demons,...,Slice of Life,Super Power,Kids,School,Mecha,Music,Fantasy,Psychological,Shoujo,Action
0,32281,Kimi no Na wa.,Movie,1,9.37,200630,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,TV,64,9.26,793665,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
2,28977,Gintama°,TV,51,9.25,114262,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,9253,Steins;Gate,TV,24,9.17,673572,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9969,Gintama&#039;,TV,51,9.16,151266,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


There were a total of 43 different possible genres. They are now one-hot encoded.
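
As an aside, pandas can produce the same kind of indicator columns in one call with `Series.str.get_dummies`. The sketch below uses made-up rows in the same "comma + space" format as the genre column; it is an alternative to the loop above, not what this notebook ran.

```python
import pandas as pd

# Hypothetical toy data in the same "comma + space" format as the genre column
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Comedy", "Drama", "Action, Drama"],
})

# One indicator column per distinct genre, in a single call
genre_dummies = toy["genre"].str.get_dummies(sep=", ")

# Attach the indicator columns and drop the original text column
encoded = pd.concat([toy.drop(columns="genre"), genre_dummies], axis=1)
print(encoded)
```

One practical difference is that the resulting columns come out in sorted order, whereas iterating over a Python `set` does not guarantee any particular column order.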

In [38]:
# Drop other unnecessary columns
data = data.drop(columns=['anime_id', 'rating', 'members'])

Next, I am going to one-hot encode the 'type' column. This will be the last step in feature extraction for the model.

In [39]:
data = pd.get_dummies(data=data, columns=['type'], dtype=int)

In [40]:
data.head()

Unnamed: 0,name,episodes,Romance,Police,Vampire,Demons,Martial Arts,Thriller,Sports,Historical,...,Fantasy,Psychological,Shoujo,Action,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV
0,Kimi no Na wa.,1,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,Fullmetal Alchemist: Brotherhood,64,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,1
2,Gintama°,51,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,1
3,Steins;Gate,24,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,Gintama&#039;,51,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,1


Lastly, I will scale the episodes column.

In [41]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Convert 'episodes' column to numeric ('Unknown' values become NaN)
data['episodes'] = pd.to_numeric(data['episodes'], errors='coerce')

# Reshape the data to a 2D array
episodes_2d = data['episodes'].values.reshape(-1, 1)

# Scale the 'episodes' column
data['episodes_scaled'] = scaler.fit_transform(episodes_2d)

# View first few rows
data.head()

Unnamed: 0,name,episodes,Romance,Police,Vampire,Demons,Martial Arts,Thriller,Sports,Historical,...,Psychological,Shoujo,Action,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,episodes_scaled
0,Kimi no Na wa.,1.0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0.0
1,Fullmetal Alchemist: Brotherhood,64.0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0.034673
2,Gintama°,51.0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,0,1,0.027518
3,Steins;Gate,24.0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0.012658
4,Gintama&#039;,51.0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,0,1,0.027518
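
Min-max scaling maps each value x to (x - min) / (max - min). The sketch below reproduces the scaled values shown above by hand. The maximum of 1818 episodes is an assumption inferred from those scaled values, not read from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Episode counts from the rows displayed above, plus 1818 as the assumed maximum
# (inferred from the scaled values in the output, not read from the dataset)
episodes = np.array([1.0, 64.0, 51.0, 24.0, 1818.0]).reshape(-1, 1)

# Manual min-max scaling: (x - min) / (max - min)
manual = (episodes - episodes.min()) / (episodes.max() - episodes.min())

# The same calculation with scikit-learn
scaled = MinMaxScaler().fit_transform(episodes)

print(np.round(manual.ravel(), 6))
print(np.allclose(manual, scaled))
```

Because a single very long series stretches the range, most titles end up with a scaled value close to 0, so episode count probably carries little weight next to the 0/1 genre and type columns.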


In [42]:
# Drop the original episodes column
data = data.drop('episodes', axis = 1)

# confirm
data.head()

Unnamed: 0,name,Romance,Police,Vampire,Demons,Martial Arts,Thriller,Sports,Historical,Mystery,...,Psychological,Shoujo,Action,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,episodes_scaled
0,Kimi no Na wa.,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0.0
1,Fullmetal Alchemist: Brotherhood,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0.034673
2,Gintama°,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0.027518
3,Steins;Gate,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0.012658
4,Gintama&#039;,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0.027518


In [43]:
# Instantiate a final data set
final_data = data

In [44]:
# View
final_data.head()

Unnamed: 0,name,Romance,Police,Vampire,Demons,Martial Arts,Thriller,Sports,Historical,Mystery,...,Psychological,Shoujo,Action,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,episodes_scaled
0,Kimi no Na wa.,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0.0
1,Fullmetal Alchemist: Brotherhood,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0.034673
2,Gintama°,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0.027518
3,Steins;Gate,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0.012658
4,Gintama&#039;,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0.027518


In [51]:
# Ensure no missing values
final_data = final_data.dropna(how='any', axis=0)
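
One thing to keep in mind after dropping rows: the remaining index labels no longer line up with row positions, which matters whenever labels are later passed to `.iloc`. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "episodes_scaled": [0.0, np.nan, 0.5, 1.0],
})

cleaned = toy.dropna(how="any", axis=0)
print(list(cleaned.index))      # [0, 2, 3]: label 1 is gone

# Label 2 refers to "C", but position 2 is now "D"
print(cleaned.loc[2, "name"])   # C
print(cleaned.iloc[2]["name"])  # D

# Resetting the index makes labels and positions agree again
reset = cleaned.reset_index(drop=True)
print(list(reset.index))        # [0, 1, 2]
```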

# Construction
Now that cleaning and feature engineering are complete, I can begin to build the recommendation system. I will do this by creating a similarity score using the features in the final data set: type (TV, movie, etc.), episode count, and genre.
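
To make the similarity score concrete: cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for a few hypothetical one-hot genre vectors (the four genres and the vectors are invented for illustration) and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical one-hot vectors over four genres: [Action, Comedy, Drama, Sci-Fi]
a = np.array([1, 1, 0, 0])  # Action, Comedy
b = np.array([1, 0, 1, 0])  # Action, Drama
c = np.array([0, 0, 0, 1])  # Sci-Fi

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cosine(a, a), 3))  # 1.0: identical genre sets
print(round(cosine(a, b), 3))  # 0.5: one of two genres shared
print(round(cosine(a, c), 3))  # 0.0: no genres in common

# scikit-learn computes the same value for every pair at once
print(np.allclose(cosine_similarity([a, b, c])[0, 1], cosine(a, b)))
```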

In [52]:
# Import needed package
from sklearn.metrics.pairwise import cosine_similarity

# Reset the index so row labels match row positions (rows were dropped during cleaning),
# since the lookup below feeds these values to .iloc
final_data = final_data.reset_index(drop=True)

# Feature matrix: every column except 'name' (genre flags, type flags, scaled episodes)
features = final_data.drop(columns='name')

# Create a dictionary to map anime names to their corresponding row positions
anime_indices = pd.Series(final_data.index, index=final_data['name']).to_dict()

# Define a function that takes an anime name as input
def get_recommendations(anime_name):
    # Get the row position of the input anime
    idx = anime_indices.get(anime_name)
    if idx is None:
        return "Anime not found in the dataset."

    # Extract the feature vector for the input anime (kept 2D for scikit-learn)
    input_vector = features.iloc[[idx]]

    # Calculate cosine similarity between the input anime and every anime
    sim_scores = list(enumerate(cosine_similarity(input_vector, features)[0]))

    # Sort anime based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the top 10 most similar anime (excluding the input anime itself)
    top_anime_indices = [i for i, _ in sim_scores if i != idx][:10]
    top_anime_names = final_data['name'].iloc[top_anime_indices]

    return top_anime_names

## Testing

In [53]:
# Test the recommendation system
input_anime = "Fullmetal Alchemist: Brotherhood"  # Replace with the desired anime name
recommendations = get_recommendations(input_anime)
print(f"Recommended anime for '{input_anime}':")
for anime in recommendations:
    print(anime)

Recommended anime for 'Fullmetal Alchemist: Brotherhood':
Fullmetal Alchemist
Magi: The Kingdom of Magic
Magi: The Labyrinth of Magic
Densetsu no Yuusha no Densetsu
Magi: Sinbad no Bouken (TV)
Tide-Line Blue
Jikuu Tenshou Nazca
Fullmetal Alchemist: The Sacred Star of Milos
Digimon Frontier
Fairy Tail (2014)


We can see here that the recommendation system does a good job of recommending similar anime, as it recommends titles from the Fullmetal Alchemist series. However, it does not rank all of the franchise's entries first. The ranking is based solely on genre, type, and episode count, so unrelated titles with a similar profile appear between them. Let's try it on a couple more anime.

In [54]:
# Test the recommendation system
input_anime = "Steins;Gate"  # Replace with the desired anime name
recommendations = get_recommendations(input_anime)
print(f"More like '{input_anime}':")
for anime in recommendations:
    print(anime)

More like 'Steins;Gate':
RoboDz
Fireball Charming
Hanoka
Yuusei Kamen
Hoshi no Ko Poron
Gankutsuou
Groizer X
Element Hunters
Cybot Robocchi
Go-Q-Choji Ikkiman


In [55]:
# Test the recommendation system
input_anime = "Gintama°"  # Replace with the desired anime name
recommendations = get_recommendations(input_anime)
print(f"More like '{input_anime}':")
for anime in recommendations:
    print(anime)

More like 'Gintama°':
Gintama&#039;
Gintama&#039;: Enchousen
Gintama
Gintama: Yorinuki Gintama-san on Theater 2D
Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare
Gintama Movie: Shinyaku Benizakura-hen
Gintama: Shinyaku Benizakura-hen
Gintama: Jump Festa 2014 Special
Peace Maker Kurogane
Gintama: Nanigoto mo Saiyo ga Kanjin nano de Tasho Senobisuru Kurai ga Choudoyoi


We can see that this kind of recommendation system has a hard time recommending across different genres or anime types. This is expected, as it makes its recommendations from content features alone, chiefly genre. If a series has the same genre(s) across the board, it will recommend anime within that series every time. This is where other techniques, such as collaborative filtering, can come into play, because they do not base their recommendations strictly on genre.
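
The effect described above can be reproduced on a tiny, invented example: two entries of the same hypothetical series share identical genre flags, so they score a perfect 1.0 against each other and outrank everything else. The titles and flags below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and genre flags, invented for illustration
toy = pd.DataFrame(
    {
        "Action": [1, 1, 1, 0],
        "Comedy": [1, 1, 0, 0],
        "Drama":  [0, 0, 1, 1],
    },
    index=["Show X", "Show X: Season 2", "Other Action Drama", "Pure Drama"],
)

# Similarity of "Show X" (row 0) to every row, including itself
scores = cosine_similarity(toy.iloc[[0]], toy)[0]

# Rank the other rows from most to least similar
ranking = pd.Series(scores, index=toy.index).drop("Show X").sort_values(ascending=False)
print(ranking.round(3))
```

The sequel comes first with a score of 1.0, the title sharing one genre follows at 0.5, and the title with no shared genre scores 0.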

# Conclusion

In this notebook, we explored recommendation systems and created a content based recommendation system for an anime dataset. The engine uses genre metadata (along with type and episode count) to surface anime with similar genres. Although not very complex, it is a good representation of the possible use cases of recommendation systems and a good introduction to the techniques. In a real-life use case, this model would be further evaluated and tested to understand its accuracy. However, the objective of this project was to explore the construction of a recommendation engine of this type. In another notebook, I will explore a Collaborative Filtering approach, which can provide more meaningful recommendations across genres.

# Credits & Sources
This dataset was pulled from Kaggle. Check it out [here][dataset].

[dataset]: https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database