# Intro

We are going to build a content-based recommendation system. The goal of this project is to recommend new anime based on the ratings a customer has given to other anime. We will use each anime's genres to recommend new content to the user.

In [1]:
import numpy as np
import pandas as pd

rating_df = pd.read_csv("Data/rating.csv")
anime_df = pd.read_csv("Data/anime.csv")

## Anime dataset

In [2]:
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [3]:
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


Let's remove the columns that we are not going to use right now. We don't need the "rating" column, since we are going to use the ratings from the ratings dataframe, which are user-based. We are also going to remove "members", "type" and "episodes".

We are going to focus only on the anime's genre for recommendations.

In [4]:
# Work on a copy so the original dataframe stays intact as a backup
animes_with_genres_df = anime_df.copy()

# Drop columns
animes_with_genres_df.drop(['rating', 'members', 'type', 'episodes'], axis=1, inplace=True)
animes_with_genres_df.head()

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural"
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili..."
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S..."
3,9253,Steins;Gate,"Sci-Fi, Thriller"
4,9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, S..."


Let's check how many missing values (NaN) we have in the dataset.

In [5]:
# Checking the number of missing values
animes_with_genres_df.isna().sum()

anime_id     0
name         0
genre       62
dtype: int64

We have a few missing values (62) in the "genre" column. Since genre is the only feature we use for recommendations, let's remove those rows.

In [6]:
# Dropping nan values
animes_with_genres_df.dropna(inplace=True)
animes_with_genres_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12232 entries, 0 to 12293
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   anime_id  12232 non-null  int64 
 1   name      12232 non-null  object
 2   genre     12232 non-null  object
dtypes: int64(1), object(2)
memory usage: 382.2+ KB


The values in the "genre" column are strings containing a comma-separated list of the genres each anime fits into. We can't work with the data this way, so we are going to split them into separate genres (one column per genre). Then we set the value to 1 if the anime is in that genre, and 0 if not.

In [7]:
# Iterate over every row (anime) of the dataset together with its index
for index, row in animes_with_genres_df.iterrows():
    # Split the genre string and flag each genre the anime belongs to with 1
    for genre in row.genre.split(", "):
        animes_with_genres_df.at[index, genre] = 1

# Fill with 0 every genre the anime does not fit into
animes_with_genres_df.fillna(0, inplace=True)

# Let's see our new dataframe
animes_with_genres_df.head()

Unnamed: 0,anime_id,name,genre,Drama,Romance,School,Supernatural,Action,Adventure,Fantasy,...,Shounen Ai,Game,Dementia,Harem,Cars,Kids,Shoujo Ai,Hentai,Yaoi,Yuri
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",1.0,0.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9253,Steins;Gate,"Sci-Fi, Thriller",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, S...",0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
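
The row-by-row loop above is easy to follow. As a possible alternative, pandas offers `Series.str.get_dummies`, which builds the same kind of 0/1 genre columns in one vectorized call. The sketch below runs on a small made-up dataframe (`toy_df` and its three rows are illustrative and not taken from the dataset), and it is not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy data that mimics the shape of the anime table
toy_df = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "genre": ["Drama, Romance", "Action, Drama", "Sci-Fi"],
})

# Split each genre string on ", " and build one 0/1 column per genre
genre_dummies = toy_df["genre"].str.get_dummies(sep=", ")

# Attach the indicator columns to the original columns
toy_encoded = pd.concat([toy_df, genre_dummies], axis=1)
print(toy_encoded)
```

Unlike the loop, this produces integer indicators instead of floats, so no `fillna(0)` step is needed afterwards.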


## Rating dataset

In [8]:
rating_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


In [9]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7813737 entries, 0 to 7813736
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   user_id   int64
 1   anime_id  int64
 2   rating    int64
dtypes: int64(3)
memory usage: 178.8 MB


## Content-based Recommendation System

In [10]:
# Create a fictional list of user data with some anime examples and ratings
user_test = [
    {'name': 'Kimi no Na wa.', 'rating': 10},
    {'name': 'Fullmetal Alchemist: Brotherhood', 'rating': 10},
    {'name': 'Naruto', 'rating': 8},
    {'name': 'Under World', 'rating': 3},
    {'name': 'Yasuji no Pornorama: Yacchimae!!', 'rating': 4},
]

# Transform the user list to a dataframe
user_test_df = pd.DataFrame(user_test)
user_test_df.head()

Unnamed: 0,name,rating
0,Kimi no Na wa.,10
1,Fullmetal Alchemist: Brotherhood,10
2,Naruto,8
3,Under World,3
4,Yasuji no Pornorama: Yacchimae!!,4


In [11]:
# Merge the user fictional dataframe with the anime dataframe, using the "name" column as the key
user_test_df = pd.merge(animes_with_genres_df, user_test_df, on ="name")

# Create a new dataframe with the "anime_id", "name" and genres
# Dropping the "rating" column
user_test_animes = user_test_df.drop(['rating'], axis=1)

# Create a new dataframe with the "anime_id", "name" and "rating"
user_test_df = user_test_df[['anime_id', 'name', 'rating']]

print(user_test_animes)
print(user_test_df)

   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2        20                            Naruto   
3      5543                       Under World   
4     26081  Yasuji no Pornorama: Yacchimae!!   

                                               genre  Drama  Romance  School  \
0               Drama, Romance, School, Supernatural    1.0      1.0     1.0   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...    1.0      0.0     0.0   
2  Action, Comedy, Martial Arts, Shounen, Super P...    0.0      0.0     0.0   
3                                             Hentai    0.0      0.0     0.0   
4                                             Hentai    0.0      0.0     0.0   

   Supernatural  Action  Adventure  Fantasy  ...  Shounen Ai  Game  Dementia  \
0           1.0     0.0        0.0      0.0  ...         0.0   0.0       0.0   
1           0.0     1.0        1.0      1.0  ...         0.0   0

In [12]:
# Create a new dataframe with only the genres of the user's data.
# First we need to remove the index
user_test_animes = user_test_animes.reset_index(drop=True)
user_test_genres = user_test_animes.drop(['anime_id', 'name', 'genre'], axis=1)
user_test_genres

Unnamed: 0,Drama,Romance,School,Supernatural,Action,Adventure,Fantasy,Magic,Military,Shounen,...,Shounen Ai,Game,Dementia,Harem,Cars,Kids,Shoujo Ai,Hentai,Yaoi,Yuri
0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Now we multiply the ratings given by the user by the genre indicators of each rated title and sum the result per genre, so that each genre is weighted by how highly the user rated the titles in it. We use the pandas **dot** function for this.

In [13]:
user_profile = user_test_genres.transpose().dot(user_test_df['rating'])
user_profile

Drama            20.0
Romance          10.0
School           10.0
Supernatural     10.0
Action           18.0
Adventure        10.0
Fantasy          10.0
Magic            10.0
Military         10.0
Shounen          18.0
Comedy            8.0
Historical        0.0
Parody            0.0
Samurai           0.0
Sci-Fi            0.0
Thriller          0.0
Sports            0.0
Super Power       8.0
Space             0.0
Slice of Life     0.0
Mecha             0.0
Music             0.0
Mystery           0.0
Seinen            0.0
Martial Arts      8.0
Vampire           0.0
Shoujo            0.0
Horror            0.0
Police            0.0
Psychological     0.0
Demons            0.0
Ecchi             0.0
Josei             0.0
Shounen Ai        0.0
Game              0.0
Dementia          0.0
Harem             0.0
Cars              0.0
Kids              0.0
Shoujo Ai         0.0
Hentai            7.0
Yaoi              0.0
Yuri              0.0
dtype: float64
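
To make the dot product concrete, here is a minimal sketch restricted to three of the genres. The ratings and the 0/1 indicators are copied from the five titles in the tables above; the variable names (`ratings`, `genres`, `profile`) are only illustrative.

```python
import pandas as pd

# Ratings of the five titles, in the same row order as user_test_genres
ratings = pd.Series([10, 10, 8, 3, 4], name="rating")

# Three of the genre indicator columns for those five titles
genres = pd.DataFrame({
    "Drama":  [1, 1, 0, 0, 0],
    "Action": [0, 1, 1, 0, 0],
    "Hentai": [0, 0, 0, 1, 1],
})

# Each genre weight is the sum of the ratings of the titles in that genre
profile = genres.T.dot(ratings)
print(profile)
```

Drama collects 10 + 10 = 20, Action 10 + 8 = 18 and Hentai 3 + 4 = 7, which matches the corresponding entries of `user_profile`.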

These new values (weights) are known as the "user profile". Based on them, we are going to recommend new titles to the user.

In [14]:
# Create a dataframe with the genres from the "animes_with_genres_df" dataframe
genre_table = animes_with_genres_df.set_index(animes_with_genres_df['anime_id'])

# Drop every column except the genres
genre_table.drop(['anime_id', 'name', 'genre'], axis=1, inplace=True)
genre_table.head()

anime_id,Drama,Romance,School,Supernatural,Action,Adventure,Fantasy,Magic,Military,Shounen,...,Shounen Ai,Game,Dementia,Harem,Cars,Kids,Shoujo Ai,Hentai,Yaoi,Yuri
32281,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5114,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28977,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9253,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9969,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
genre_table.shape

(12232, 43)

Now let's take our genre table and our weights and calculate the weighted average for each title, so we can recommend the ones that best match the customer's taste.

In [16]:
recommendations_df = ((genre_table * user_profile).sum(axis=1)) / (user_profile.sum())
recommendations_df.head()

anime_id
32281    0.318471
5114     0.611465
28977    0.280255
9253     0.000000
9969     0.280255
dtype: float64
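
As a sanity check, the first scores can be reproduced by hand. The sketch below rebuilds a small genre table for three of the titles, with genre memberships read off the 0/1 rows shown earlier, and uses the non-zero weights from the user profile. The helper name `titles` is an assumption made for this example.

```python
import pandas as pd

# Non-zero weights from the user profile, plus two zero-weight genres
user_profile = pd.Series({
    "Drama": 20.0, "Romance": 10.0, "School": 10.0, "Supernatural": 10.0,
    "Action": 18.0, "Adventure": 10.0, "Fantasy": 10.0, "Magic": 10.0,
    "Military": 10.0, "Shounen": 18.0, "Comedy": 8.0, "Super Power": 8.0,
    "Martial Arts": 8.0, "Hentai": 7.0, "Sci-Fi": 0.0, "Thriller": 0.0,
})

# Genres per anime_id, taken from the one-hot rows shown earlier
titles = {
    32281: ["Drama", "Romance", "School", "Supernatural"],
    5114: ["Action", "Adventure", "Drama", "Fantasy", "Magic", "Military", "Shounen"],
    9253: ["Sci-Fi", "Thriller"],
}

# Build the 0/1 genre table for these titles
genre_table = pd.DataFrame(
    0.0, index=pd.Index(list(titles), name="anime_id"), columns=user_profile.index
)
for anime_id, genre_list in titles.items():
    genre_table.loc[anime_id, genre_list] = 1.0

# Score = sum of the weights of the title's genres / sum of all weights
scores = (genre_table * user_profile).sum(axis=1) / user_profile.sum()
print(scores.round(6))
```

The weights sum to 157, so Kimi no Na wa. scores 50 / 157 ≈ 0.318471 and Fullmetal Alchemist: Brotherhood scores 96 / 157 ≈ 0.611465, as in the output above.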

In [17]:
# Order them in descending order
recommendations_df = recommendations_df.sort_values(ascending=False)
recommendations_df.head()

anime_id
231      0.713376
4938     0.675159
9135     0.662420
121      0.662420
34055    0.624204
dtype: float64

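Now we have our recommendations. The next cell looks up the ten highest-scoring titles. Note that `isin` keeps the rows in `anime_df` order rather than score order, and that titles the user has already rated (such as Fullmetal Alchemist: Brotherhood) are not filtered out.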

In [18]:
anime_df.loc[anime_df['anime_id'].isin(recommendations_df.head(10).keys())]

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
200,121,Fullmetal Alchemist,"Action, Adventure, Comedy, Drama, Fantasy, Mag...",TV,51,8.33,600384
286,4938,Tsubasa: Shunraiki,"Action, Adventure, Drama, Fantasy, Magic, Myst...",OVA,2,8.23,40420
1071,969,Tsubasa Chronicle 2nd Season,"Action, Adventure, Drama, Fantasy, Mystery, Ro...",TV,26,7.7,79166
1558,9135,Fullmetal Alchemist: The Sacred Star of Milos,"Action, Adventure, Comedy, Drama, Fantasy, Mag...",Movie,1,7.5,87944
1845,25157,Trinity Seven,"Action, Comedy, Ecchi, Fantasy, Harem, Magic, ...",TV,12,7.43,208796
3283,6489,Zero no Tsukaima: Princesses no Rondo Picture ...,"Action, Adventure, Comedy, Drama, Ecchi, Fanta...",Special,7,7.04,23532
3316,2832,Ani*Kuri15,"Adventure, Comedy, Drama, Fantasy, Game, Magic...",Special,15,7.02,12926
5917,231,Asagiri no Miko,"Action, Comedy, Drama, Fantasy, Magic, School,...",TV,26,6.31,4721
10924,34055,Berserk (2017),"Action, Adventure, Demons, Drama, Fantasy, Hor...",TV,Unknown,,13463
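
One possible refinement, sketched here on a handful of rows copied from the outputs above, is to drop the titles the user has already rated and to keep the result in score order. The names `already_rated`, `catalog` and `top` are assumptions made for this example and do not appear in the notebook.

```python
import pandas as pd

# Scores copied from the outputs above (anime_id -> score)
scores = pd.Series(
    {231: 0.713376, 4938: 0.675159, 9135: 0.662420, 121: 0.662420, 5114: 0.611465},
    name="score",
)

# Matching rows of the anime table, in their original dataframe order
catalog = pd.DataFrame({
    "anime_id": [5114, 121, 4938, 9135, 231],
    "name": [
        "Fullmetal Alchemist: Brotherhood",
        "Fullmetal Alchemist",
        "Tsubasa: Shunraiki",
        "Fullmetal Alchemist: The Sacred Star of Milos",
        "Asagiri no Miko",
    ],
}).set_index("anime_id")

# Titles the user has already rated (here only one of the five)
already_rated = [5114]

# Remove rated titles, then sort by score (stable sort keeps tie order)
ranked = scores.drop(labels=already_rated, errors="ignore")
ranked = ranked.sort_values(ascending=False, kind="stable")

# Look the titles up by id in ranked order and attach the score
top = catalog.loc[ranked.index].assign(score=ranked)
print(top)
```

Indexing with `.loc[ranked.index]` preserves the ranking, whereas a boolean `isin` mask returns rows in their original order.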


## Conclusion
This project could be used in an anime streaming service to recommend new anime to customers. One downside is that it does not work well for new customers: people who have not watched any anime yet have no ratings from which to build a user profile.
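
The sketch below illustrates this limitation on toy data: with no ratings, every weight in the profile is zero, so the weighted average would divide by zero. Falling back to the most popular titles by `members` is only one hypothetical workaround and is not part of this project. The `recommend` function and the two-genre table are assumptions made for the example; the ids, names and member counts come from the anime table above.

```python
import pandas as pd

# Small catalog with ids, names and member counts from the anime table
catalog = pd.DataFrame({
    "anime_id": [32281, 5114, 28977],
    "name": ["Kimi no Na wa.", "Fullmetal Alchemist: Brotherhood", "Gintama°"],
    "members": [200630, 793665, 114262],
}).set_index("anime_id")

# Two of the genre indicator columns for those titles
genre_table = pd.DataFrame(
    {"Drama": [1.0, 1.0, 0.0], "Action": [0.0, 1.0, 1.0]},
    index=catalog.index,
)

def recommend(user_profile, top_n=2):
    """Return the ids of the top_n titles for a user profile (hypothetical helper)."""
    total = user_profile.sum()
    if total == 0:
        # Cold start: no ratings, so fall back to popularity (an assumption)
        popular = catalog.sort_values("members", ascending=False)
        return popular.head(top_n).index.tolist()
    # Regular case: weighted average of the genre weights
    scores = (genre_table * user_profile).sum(axis=1) / total
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

new_user = pd.Series({"Drama": 0.0, "Action": 0.0})
returning_user = pd.Series({"Drama": 2.0, "Action": 10.0})

print(recommend(new_user))
print(recommend(returning_user))
```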