# Setup

In [1]:
!pip install opendatasets -q

## Import Libraries

In [50]:
import pandas as pd
import numpy as np
import opendatasets as od
import matplotlib.pyplot as plt

## Data Loading

In [4]:
od.download("https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database")

Downloading anime-recommendations-database.zip to ./anime-recommendations-database


100%|██████████| 25.0M/25.0M [00:00<00:00, 111MB/s]





In [5]:
anime = pd.read_csv("/content/anime-recommendations-database/anime.csv")
rating = pd.read_csv("/content/anime-recommendations-database/rating.csv")

In [44]:
print("Total # of samples in anime dataframe: ", len(anime.anime_id.unique()))
print("Total # of samples in rating dataframe: ", len(rating))

Total # of samples in anime dataframe:  12294
Total # of samples in rating dataframe:  7813737


# Data Understanding
- Dataset: [Anime Recommendations Database on Kaggle](https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database)

## EDA - Variable Description

- anime.csv:
  - anime_id: myanimelist.net's unique id identifying an anime.
  - name: full name of anime.
  - genre: comma separated list of genres for this anime.
  - type: type of the anime. movie, TV, OVA, etc.
  - episodes: number of episodes. (1 if movie).
  - rating: average rating out of 10 for this anime.
  - members: number of community members that are in this anime's "group".
- rating.csv:
  - user_id: randomly generated user_id
  - anime_id:  the anime that this user has rated.
  - rating: rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).

## DataFrame Anime

In [7]:
anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


As shown below, the 'genre' column stores each anime's genres as a single comma-separated string. It needs to be converted to a list so that each genre can be identified individually.

The `info()` output above also shows missing values in `genre`, `type`, and `rating`, which make it harder to enumerate every genre in the dataset. These are handled in the **Data Preparation** section.

In [8]:
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
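
The conversion described above can be tried on a toy frame first. This is only a sketch: `toy`, `genre_list`, and `is_drama` are illustrative names that do not appear elsewhere in this notebook, and the two rows are copied from the `head()` output.

```python
import pandas as pd

# Toy stand-in for anime[['name', 'genre']]; rows copied from the head() output.
toy = pd.DataFrame({
    "name": ["Kimi no Na wa.", "Steins;Gate"],
    "genre": ["Drama, Romance, School, Supernatural", "Sci-Fi, Thriller"],
})

# Split on ", " so that every row holds a Python list of genre names.
toy["genre_list"] = toy["genre"].str.split(", ")

# With lists, membership of a single genre can be tested per row.
toy["is_drama"] = toy["genre_list"].apply(lambda genres: "Drama" in genres)
print(toy[["name", "genre_list", "is_drama"]])
```

Splitting on `", "` (comma plus space) rather than `","` avoids leading spaces in every genre after the first.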


In [33]:
print(anime.shape)

(12294, 7)


## DataFrame Rating

In [34]:
rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7813737 entries, 0 to 7813736
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   user_id   int64
 1   anime_id  int64
 2   rating    int64
dtypes: int64(3)
memory usage: 178.8 MB


The rating dataframe has about 7.8 million rows, which can be computationally expensive to train on. To keep this project simple, its size will be reduced.
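
One possible way to reduce the size, sketched here on toy data, is to keep a random subset of users together with all of their ratings. The names `toy_rating`, `n_users`, and `rating_small` are hypothetical, and the sampling strategy itself is an assumption, not a step taken from this notebook.

```python
import pandas as pd

# Toy stand-in for the rating dataframe (same three columns).
toy_rating = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    "anime_id": [20, 24, 79, 20, 226, 20, 24, 79, 226, 241],
    "rating":   [-1, 8, 10, 7, -1, 9, 6, 8, 10, 7],
})

# Assumption: sample whole users rather than single rows, so that each
# kept user still has a complete rating history.
n_users = 2  # hypothetical value, tune to the available memory
sampled_users = (
    toy_rating["user_id"].drop_duplicates().sample(n=n_users, random_state=42)
)
rating_small = toy_rating[toy_rating["user_id"].isin(sampled_users)]
print(rating_small.shape)
```

Sampling by user rather than by row keeps per-user histories intact, which tends to matter if the ratings are later used for collaborative filtering.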

In [43]:
print(rating.shape)

(7813737, 3)


In [35]:
rating.describe()

Unnamed: 0,user_id,anime_id,rating
count,7813737.0,7813737.0,7813737.0
mean,36727.96,8909.072,6.14403
std,20997.95,8883.95,3.7278
min,1.0,1.0,-1.0
25%,18974.0,1240.0,6.0
50%,36791.0,6213.0,7.0
75%,54757.0,14093.0,9.0
max,73516.0,34519.0,10.0


In [49]:
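# Note: describe() above reports a minimum of -1. The 0 printed here suggests this cell
# was run after the -1 -> 0 replacement from the Data Preparation section.
print("Lowest rating: ", min(rating.rating))
print("Highest rating: ", max(rating.rating))

Lowest rating:  0
Highest rating:  10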


In [48]:
print("Total # of users: ", len(rating.user_id.unique()))

Total # of users:  73515


# Data Preparation

## Anime Data Preparation

### Convert genre from each anime to list

In [11]:
anime['genre'] = anime['genre'].str.split(', ')

In [12]:
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil...",TV,64,9.26,793665
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.25,114262
3,9253,Steins;Gate,"[Sci-Fi, Thriller]",TV,24,9.17,673572
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.16,151266


### Handle missing values for anime dataframe

In [27]:
anime.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [28]:
anime_clean = anime.dropna()

In [29]:
anime_clean.isnull().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

### Check unique genres

In [54]:
genre_flatten = [genre for sublist in anime_clean['genre'] for genre in sublist]

unique_genres = pd.Series(genre_flatten).unique()
print("Total # of genres: ", len(unique_genres))
print("List of all available genres: ", unique_genres)

Total # of genres:  43
List of all available genres:  ['Drama' 'Romance' 'School' 'Supernatural' 'Action' 'Adventure' 'Fantasy'
 'Magic' 'Military' 'Shounen' 'Comedy' 'Historical' 'Parody' 'Samurai'
 'Sci-Fi' 'Thriller' 'Sports' 'Super Power' 'Space' 'Slice of Life'
 'Mecha' 'Music' 'Mystery' 'Seinen' 'Martial Arts' 'Vampire' 'Shoujo'
 'Horror' 'Police' 'Psychological' 'Demons' 'Ecchi' 'Josei' 'Shounen Ai'
 'Game' 'Dementia' 'Harem' 'Cars' 'Kids' 'Shoujo Ai' 'Hentai' 'Yaoi'
 'Yuri']
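
As an alternative to the list comprehension above, pandas can flatten list-valued cells with `Series.explode()`, which also makes it easy to count how often each genre occurs. The sketch below uses a toy series (`toy_genres` is a made-up name) rather than `anime_clean['genre']`.

```python
import pandas as pd

# Toy stand-in for a column whose cells are lists of genres.
toy_genres = pd.Series([
    ["Drama", "Romance"],
    ["Action", "Drama"],
    ["Comedy"],
])

# explode() emits one row per list element; value_counts() then tallies them.
genre_counts = toy_genres.explode().value_counts()
print("Total # of genres: ", len(genre_counts))
print(genre_counts)
```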


### Drop unused columns

In [53]:
anime_new = anime_clean[['anime_id', 'name', 'genre']]
anime_new

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]"
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil..."
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ..."
3,9253,Steins;Gate,"[Sci-Fi, Thriller]"
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ..."
...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,[Hentai]
12290,5543,Under World,[Hentai]
12291,5621,Violence Gekiga David no Hoshi,[Hentai]
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,[Hentai]


### Drop rows with "R-rated" genres
- The "R-rated" genres I decided to drop are:
  - Yaoi
  - Yuri
  - Hentai
  - Shounen Ai
  - Shoujo Ai

In [55]:
r_rated_genres = ['Yaoi', 'Yuri', 'Hentai', 'Shounen Ai', 'Shoujo Ai']

mask = anime_new['genre'].apply(lambda x: any(genre in x for genre in r_rated_genres))

anime_final = anime_new[~mask]
anime_final

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]"
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil..."
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ..."
3,9253,Steins;Gate,"[Sci-Fi, Thriller]"
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ..."
...,...,...,...
10891,11095,Zouressha ga Yatte Kita,[Adventure]
10892,7808,Zukkoke Knight: Don De La Mancha,"[Adventure, Comedy, Historical, Romance]"
10893,28543,Zukkoke Sannin-gumi no Hi Asobi Boushi Daisakusen,"[Drama, Kids]"
10894,18967,Zukkoke Sannin-gumi: Zukkoke Jikuu Bouken,"[Comedy, Historical, Sci-Fi]"


### Convert genre list to string
Join each row's genre list into a single space-separated string. Spaces inside multi-word genres are removed first (for example, 'Slice of Life' becomes 'SliceofLife') so that each genre stays a single token.

In [65]:
anime_final = anime_final.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
anime_final['genre_str'] = anime_final['genre'].apply(lambda x: ' '.join(g.replace(' ', '') for g in x))
anime_final


Unnamed: 0,anime_id,name,genre,genre_str
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Drama Romance School Supernatural
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil...",Action Adventure Drama Fantasy Magic Military ...
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ...",Action Comedy Historical Parody Samurai Sci-Fi...
3,9253,Steins;Gate,"[Sci-Fi, Thriller]",Sci-Fi Thriller
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ...",Action Comedy Historical Parody Samurai Sci-Fi...
...,...,...,...,...
10891,11095,Zouressha ga Yatte Kita,[Adventure],Adventure
10892,7808,Zukkoke Knight: Don De La Mancha,"[Adventure, Comedy, Historical, Romance]",Adventure Comedy Historical Romance
10893,28543,Zukkoke Sannin-gumi no Hi Asobi Boushi Daisakusen,"[Drama, Kids]",Drama Kids
10894,18967,Zukkoke Sannin-gumi: Zukkoke Jikuu Bouken,"[Comedy, Historical, Sci-Fi]",Comedy Historical Sci-Fi
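
The joining step above can be checked on a single list. In this small sketch, `genres` is an illustrative list and not a row taken from the dataset.

```python
# One illustrative genre list containing a multi-word genre.
genres = ["Comedy", "Slice of Life", "Sci-Fi"]

# Remove the spaces inside each genre, then join the genres with single spaces.
genre_str = " ".join(g.replace(" ", "") for g in genres)
print(genre_str)  # Comedy SliceofLife Sci-Fi
```

Without the inner `replace`, a whitespace-based tokenizer would see 'Slice', 'of', and 'Life' as three unrelated tokens.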


## Rating Data Preparation

### Reduce the size of rating dataframe

### Change -1 rating to 0

In [56]:
rating['rating'] = rating['rating'].replace(-1, 0)
print("Lowest rating: ", min(rating.rating))
print("Highest rating: ", max(rating.rating))

Lowest rating:  0
Highest rating:  10


# Model Development with Content-Based Filtering

In [66]:
data = anime_final
data.sample(5)

Unnamed: 0,anime_id,name,genre,genre_str
1409,25389,Dragon Ball Z Movie 15: Fukkatsu no F,"[Action, Adventure, Comedy, Fantasy, Martial A...",Action Adventure Comedy Fantasy MartialArts Sh...
8050,22179,Aki no Puzzle,[Dementia],Dementia
170,513,Tenkuu no Shiro Laputa,"[Adventure, Fantasy, Romance, Sci-Fi]",Adventure Fantasy Romance Sci-Fi
1799,17080,Soukyuu no Fafner: Dead Aggressor - Exodus,"[Action, Drama, Mecha, Military, Sci-Fi]",Action Drama Mecha Military Sci-Fi
4384,1958,Wish,[Music],Music


## TF-IDF Vectorizer

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()

tf.fit(data['genre_str'])

tf.get_feature_names_out()

array(['action', 'adventure', 'cars', 'comedy', 'dementia', 'demons',
       'drama', 'ecchi', 'fantasy', 'fi', 'game', 'harem', 'historical',
       'horror', 'josei', 'kids', 'magic', 'martialarts', 'mecha',
       'military', 'music', 'mystery', 'parody', 'police',
       'psychological', 'romance', 'samurai', 'school', 'sci', 'seinen',
       'shoujo', 'shounen', 'sliceoflife', 'space', 'sports',
       'supernatural', 'superpower', 'thriller', 'vampire'], dtype=object)

In [68]:
tfidf_matrix = tf.fit_transform(data['genre_str'])

tfidf_matrix.shape

(10733, 39)

## View DataFrame

In [74]:
# Build a dataframe to inspect tfidf_matrix:
# columns are genre tokens, rows are anime names

pd.DataFrame(
    tfidf_matrix.todense(),
    columns=tf.get_feature_names_out(),
    index=data['name']
).sample(10, axis=1).sample(5, axis=0)

Unnamed: 0_level_0,magic,mystery,fantasy,ecchi,seinen,horror,dementia,martialarts,cars,music
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Oden-kun,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Watashi no Kamifuusen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.774234
Chuumon no Ooi Ryouriten (1991),0.0,0.0,0.418021,0.0,0.0,0.720549,0.0,0.0,0.0,0.0
Ninja Slayer From Animation,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Storywriter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


## Cosine Similarity

In [76]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim

array([[1.        , 0.14596888, 0.        , ..., 0.29799153, 0.        ,
        0.        ],
       [0.14596888, 1.        , 0.17688183, ..., 0.23069816, 0.        ,
        0.        ],
       [0.        , 0.17688183, 1.        , ..., 0.        , 0.58198958,
        0.19632063],
       ...,
       [0.29799153, 0.23069816, 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.58198958, ..., 0.        , 1.        ,
        0.3373267 ],
       [0.        , 0.        , 0.19632063, ..., 0.        , 0.3373267 ,
        1.        ]])
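
For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their Euclidean norms. The sketch below recomputes it by hand for a tiny matrix and compares the result with scikit-learn; `vectors` holds made-up weights, not values from `tfidf_matrix`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are items, columns are made-up genre weights.
vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Manual computation: pairwise dot products divided by the product of norms.
norms = np.linalg.norm(vectors, axis=1)
manual = (vectors @ vectors.T) / np.outer(norms, norms)

# Library computation for comparison.
sk = cosine_similarity(vectors)

print(np.round(manual, 4))
print(np.allclose(manual, sk))
```

Items that share no genre get a similarity of 0, and every item has a similarity of 1 with itself, which matches the diagonal of the matrix above.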

## Cosine Similarity DataFrame

In [78]:
cosine_sim_df = pd.DataFrame(cosine_sim, index=data['name'], columns=data['name'])
print("Shape: ", cosine_sim_df.shape)

cosine_sim_df.sample(10, axis=1).sample(10, axis=0)

Shape:  (10733, 10733)


name,Dragon Ball Z: Saiya-jin Zetsumetsu Keikaku,Lost Forest,Persona 4 the Animation: Mr. Experiment Shorts,Makeruna! Makendou,Attack No.1: Namida no Fushichou,Nemuranu Machi no Cinderella: Hirose Ryouichi - Memorial Date,Kaitou Tenshi Twin Angel: Kyun Kyun☆Tokimeki Paradise!! OVA,Mangaka-san to Assistant-san to The Animation,2020 Nyeon Ujuui Wonder Kiddy,Kanbee-kun ga Yuku
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Akahori Gedou Hour Rabuge,0.094024,0.0,0.313712,0.570856,0.0,0.0,0.617701,0.072332,0.0,0.117743
Ziggy: Soreyuke! R&amp;R Band,0.171768,0.592043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cheer Danshi!! Recap,0.0,0.0,0.0,0.0,0.496594,0.0,0.0,0.0,0.0,0.0
Sketchbook: Full Color&#039;s Picture Drama,0.151446,0.0,0.505299,0.188535,0.0,0.0,0.0,0.456302,0.0,0.189649
Mutekiou Tri-Zenon,0.447498,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.485667,0.0
Moudouken Quill no Isshou,0.0,0.0,0.0,0.0,0.438302,0.0,0.0,0.0,0.0,0.0
Tatakae!! Iczer-1,0.45519,0.0,0.0,0.0,0.0,0.0,0.0,0.251296,0.340977,0.0
Ima no Watashi ni Dekiru Koto...,0.605163,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.65678,0.0
Chironup no Kitsune,0.0,0.0,0.0,0.0,0.438302,0.0,0.0,0.0,0.0,0.0
Kakko Kawaii Sengen! 2,0.299716,0.0,1.0,0.373116,0.0,0.0,0.0,0.230569,0.0,0.375321


## Getting top-N Recommendations

In [104]:
def anime_recommendations(nama_anime, similarity_data=cosine_sim_df, items=data[['name', 'genre']], k=5):
  """
  Recommend anime based on the similarity scores in a dataframe.

  Parameters:
  nama_anime: str, name of the anime to base the recommendations on
  similarity_data: pd.DataFrame, similarity matrix with anime names as both index and columns
  items: pd.DataFrame, contains the names and the other features used to describe similarity
  k: int, number of recommendations to return
  """

  # Partition the similarity column so the largest scores end up at the tail
  index = similarity_data.loc[:, nama_anime].to_numpy().argpartition(
      range(-1, -k, -1)
  )

  # Read the tail in reverse order so the most similar titles come first
  closest = similarity_data.columns[index[-1:-(k+2):-1]]

  # Drop the queried anime itself so it is not recommended
  closest = closest.drop(nama_anime, errors='ignore')

  pd.set_option('display.max_columns', None)
  return pd.DataFrame(closest).merge(items).head(k)
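
The same top-k lookup can also be written with an explicit sort, which may be easier to read when performance is not critical. This is a sketch only: `top_k_similar` and `toy_sim` are hypothetical names, and it uses `Series.nlargest` instead of the `argpartition` approach above.

```python
import pandas as pd

def top_k_similar(title, similarity_df, k=5):
    """Return the k highest similarity scores for `title`, excluding itself."""
    # Take the title's column, remove the title's own entry, keep the k best.
    scores = similarity_df[title].drop(labels=title, errors="ignore")
    return scores.nlargest(k)

# Made-up symmetric similarity matrix for four items.
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.5],
     [0.8, 1.0, 0.0, 0.3],
     [0.1, 0.0, 1.0, 0.2],
     [0.5, 0.3, 0.2, 1.0]],
    index=list("ABCD"),
    columns=list("ABCD"),
)

result = top_k_similar("A", toy_sim, k=2)
print(result)
```

Because `nlargest` returns the scores fully sorted, all k results come back in descending order of similarity, and the scores stay attached to the titles.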

In [118]:
anime_input = input("Input anime name: ")
data[data['name'].str.contains(anime_input, case=False)]

Input anime name: monogatari


Unnamed: 0,anime_id,name,genre,genre_str
26,17074,Monogatari Series: Second Season,"[Comedy, Mystery, Romance, Supernatural, Vampire]",Comedy Mystery Romance Supernatural Vampire
37,31757,Kizumonogatari II: Nekketsu-hen,"[Action, Mystery, Supernatural, Vampire]",Action Mystery Supernatural Vampire
102,11981,Mahou Shoujo Madoka★Magica Movie 3: Hangyaku n...,"[Drama, Magic, Psychological, Thriller]",Drama Magic Psychological Thriller
107,11979,Mahou Shoujo Madoka★Magica Movie 2: Eien no Mo...,"[Drama, Magic, Psychological, Thriller]",Drama Magic Psychological Thriller
129,9260,Kizumonogatari I: Tekketsu-hen,"[Mystery, Supernatural, Vampire]",Mystery Supernatural Vampire
...,...,...,...,...
10602,25079,Trapp Ikka Monogatari Specials,"[Drama, Historical, Music, Romance]",Drama Historical Music Romance
10621,32646,Tsuzuki wo Kangaeru Monogatari,[Drama],Drama
10704,23741,Wakakusa Monogatari: Nan to Jo-sensei Specials,"[Drama, Historical, School, Slice of Life]",Drama Historical School SliceofLife
10778,24603,Xiongmao Monogatari TaoTao,"[Comedy, Fantasy, Kids]",Comedy Fantasy Kids


In [119]:
# Get top-N recommendations for a title picked from the search results above
anime_recommendations('Kizumonogatari I: Tekketsu-hen', k=10)

Unnamed: 0,name,genre
0,Vampire Holmes,"[Comedy, Mystery, Supernatural, Vampire]"
1,Kizumonogatari II: Nekketsu-hen,"[Action, Mystery, Supernatural, Vampire]"
2,Bakemonogatari,"[Mystery, Romance, Supernatural, Vampire]"
3,Monogatari Series: Second Season,"[Comedy, Mystery, Romance, Supernatural, Vampire]"
4,Shiki Specials,"[Horror, Mystery, Supernatural, Vampire]"
5,Vampire Knight,"[Drama, Mystery, Romance, Shoujo, Supernatural..."
6,Vampire Knight Guilty,"[Drama, Mystery, Romance, Shoujo, Supernatural..."
7,Shiki,"[Mystery, Supernatural, Thriller, Vampire]"
8,Trinity Blood,"[Action, Supernatural, Vampire]"
9,Dance in the Vampire Bund Recap,"[Action, Supernatural, Vampire]"
