Notes:
- Users will likely not want to be recommended an anime they have already watched
    - Additional filtering will need to be done during inference
- Users might be more emotionally invested in a subsequent season of an anime
    - Additional weight for subsequent season of a series? (may be future work)
- Users likely prefer more current anime
    - Have model favor recent releases
    - See "Deep Neural Networks for YouTube Recommendations" (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf), section 3.3, on the "example age" feature
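
A minimal sketch of the inference-time filtering noted above, assuming the model produces one score per anime_id. The function name `filter_watched` and the toy scores are hypothetical; they are not part of this notebook or of any model output.

```python
import pandas as pd

def filter_watched(scores: pd.Series, watched_ids: set) -> pd.Series:
    """Drop already-watched anime from scores indexed by anime_id, best first."""
    unseen = scores[~scores.index.isin(watched_ids)]
    return unseen.sort_values(ascending=False)

# Toy scores keyed by anime_id (hypothetical values for illustration only)
scores = pd.Series({20: 0.91, 24: 0.85, 79: 0.40, 5114: 0.77})

# Assumption: "watched" means every anime_id the user has a row for in rating.csv
watched = {20, 79}

top_n = filter_watched(scores, watched).head(2)
print(top_n)
```

The filter runs after scoring, so the model itself does not need to know what a user has seen.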

In [6]:
# Dataset sources: 
# https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database
# https://www.kaggle.com/datasets/crazygump/myanimelist-scrappind-a-decade-of-anime

import pandas as pd

df_anime_rating = pd.read_csv('./data/CooperUnion/anime.csv')
df_anime_meta = pd.read_csv('./data/Gumpy-Q/MAL-all-from-winter1917-to-fall2022.csv')
df_user_rating = pd.read_csv('./data/CooperUnion/rating.csv')

print(df_anime_rating.shape)
display(df_anime_rating.head())

print(df_anime_meta.shape)
display(df_anime_meta.head())

print(df_user_rating.shape)
display(df_user_rating.head())

(12294, 7)


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


(27505, 14)


Unnamed: 0,title,MAL_id,type,studio,release-season,release-year,release-date,source-material,genres,themes,demographics,episodes,score,members
0,Dekobou Shingachou: Meian no Shippai,23189,Movie,['Unknown'],winter,1917.0,,Original,['Comedy'],[],,1.0,5.86,1300.0
1,Imokawa Mukuzo: Genkanban no Maki,17387,Movie,['Unknown'],winter,1917.0,,Original,['Comedy'],[],,1.0,5.36,964.0
2,Namakura Gatana,6654,Movie,['Unknown'],spring,1917.0,,Original,['Comedy'],[],,1.0,5.51,8400.0
3,Saru to Kani no Gassen,10742,Movie,['Unknown'],spring,1917.0,,Other,['Drama'],[],,1.0,4.9,953.0
4,Yume no Jidousha,24575,Movie,['Unknown'],spring,1917.0,,Original,[],[],,1.0,0.0,510.0


(7813737, 3)


Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


- A rating of -1 means the user has watched the show but has not rated it (https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database/discussion/253814)
- Suggestion: use the Jikan API (https://jikan.moe/), or its Python wrapper jikanpy (https://github.com/abhinavk99/jikanpy), to extract studio and producer information? (note: rate limited to 60 requests per minute)
- Use https://github.com/QasimK/mal-scraper to scrape?
- Scrapy (https://scrapy.org/) can be used to easily extract synopsis, reviews and image links
- https://www.kaggle.com/datasets/crazygump/myanimelist-scrappind-a-decade-of-anime/data can be merged on MAL_id
- One-hot encode genre
- VAE encoding for features?
- NCF for recommender system? https://medium.com/data-science-in-your-pocket/recommendation-systems-using-neural-collaborative-filtering-ncf-explained-with-codes-21a97e48a2f7
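
One possible way to one-hot encode `genre`, assuming the column holds comma-separated strings as in the `head()` output above. The toy rows are illustrative, and the id 99999 is a made-up placeholder for an entry with a missing genre.

```python
import pandas as pd

# Toy frame mimicking the genre column of anime.csv (values are illustrative)
df = pd.DataFrame({
    "anime_id": [32281, 9253, 99999],  # 99999 is a hypothetical id
    "genre": ["Drama, Romance", "Sci-Fi, Thriller", None],
})

# Missing genres become an all-zero row instead of raising or producing NaN
genre_dummies = df["genre"].fillna("").str.get_dummies(sep=", ")

# Keep the key next to the indicator columns so the result can be merged back
df_encoded = pd.concat([df[["anime_id"]], genre_dummies], axis=1)
print(df_encoded)
```

Because each anime can carry several genres, this yields a multi-hot row per anime instead of a single active column.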

In [2]:
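# Align the key name with df_anime_rating so the two frames can be joined
df_anime_meta.rename(columns={"MAL_id":"anime_id"}, inplace=True)
# Inner join on anime_id; the shared columns (type, episodes, members) get _x/_y suffixes.
# Note: MAL_id is not unique in df_anime_meta, so this join can repeat rows.
df_anime = df_anime_rating.merge(df_anime_meta, on='anime_id')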
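
The metadata file repeats some ids (for example, MAL_id 2406 appears 213 times in the `value_counts()` output), so a plain merge can multiply rows. A hedged sketch of one way to guard against that, using toy frames: keeping the earliest release per id is an assumption, as are the suffix names and the choice of a left join.

```python
import pandas as pd

# Toy stand-ins for df_anime_rating and df_anime_meta (values are illustrative)
ratings = pd.DataFrame({"anime_id": [1, 2, 3],
                        "type": ["TV", "Movie", "TV"],
                        "members": [10, 20, 30]})
meta = pd.DataFrame({"MAL_id": [1, 1, 2],
                     "type": ["TV (New)", "TV (Continuing)", "Movie"],
                     "release-year": [2010.0, 2011.0, 2016.0]})

# Assumption: keep one row per anime, here the earliest release
meta_unique = (meta.rename(columns={"MAL_id": "anime_id"})
                   .sort_values("release-year")
                   .drop_duplicates(subset="anime_id", keep="first"))

# validate raises if either side still has duplicate keys;
# indicator adds a _merge column showing which rows found a match
merged = ratings.merge(meta_unique, on="anime_id", how="left",
                       suffixes=("_rating", "_meta"),
                       validate="one_to_one", indicator=True)
print(merged)
```

A left join keeps unmatched anime visible (`_merge == "left_only"`), which makes it easy to count how many titles have no metadata before deciding whether to drop them.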

# Check for missing values

In [3]:
df_anime_rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
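
`episodes` is reported as `object` with no nulls, which suggests some entries are non-numeric strings. A small sketch of how that could be checked and cleaned, assuming a placeholder such as "Unknown" is used in the raw file (the placeholder is an assumption, and the toy rows are illustrative).

```python
import pandas as pd

# Toy frame with the same kinds of gaps as the info() output above
df = pd.DataFrame({
    "episodes": ["1", "64", "Unknown"],  # "Unknown" is an assumed placeholder
    "rating": [9.37, 9.26, None],
    "genre": ["Drama", None, "Action"],
})

# Non-numeric strings become NaN, so hidden missing values show up in isna()
df["episodes"] = pd.to_numeric(df["episodes"], errors="coerce")

missing = df.isna().sum()
print(missing)
```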


In [4]:
df_user_rating['rating'].value_counts()

rating
 8     1646019
-1     1476496
 7     1375287
 9     1254096
 10     955715
 6      637775
 5      282806
 4      104291
 3       41453
 2       23150
 1       16649
Name: count, dtype: int64
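
About 18.9% of the rows (1,476,496 of 7,813,737) carry a rating of -1, so treating -1 as a score would badly skew any average. A minimal sketch of one way to separate the implicit "watched" signal from the explicit score; the column names `watched` and `rating_or_nan` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_user_rating (values are illustrative)
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "anime_id": [20, 24, 20, 79],
    "rating": [-1, 8, 10, -1],
})

# Every row is an implicit "watched" signal, rated or not
ratings["watched"] = 1

# Explicit feedback only: keep the 1-10 scores, turn -1 into NaN
ratings["rating_or_nan"] = ratings["rating"].where(ratings["rating"] != -1)
explicit = ratings[ratings["rating"] != -1]

print(ratings["rating_or_nan"].mean())
```

The `watched` column can feed an implicit-feedback model (such as NCF), while `rating_or_nan` keeps the explicit scores usable for regression-style targets.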

In [11]:
df_anime_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27505 entries, 0 to 27504
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   title            27505 non-null  object 
 1   MAL_id           27505 non-null  int64  
 2   type             27505 non-null  object 
 3   studio           27505 non-null  object 
 4   release-season   27505 non-null  object 
 5   release-year     27505 non-null  float64
 6   release-date     0 non-null      float64
 7   source-material  22842 non-null  object 
 8   genres           27505 non-null  object 
 9   themes           27505 non-null  object 
 10  demographics     6425 non-null   object 
 11  episodes         27505 non-null  float64
 12  score            27505 non-null  float64
 13  members          27505 non-null  float64
dtypes: float64(5), int64(1), object(8)
memory usage: 2.9+ MB
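
`release-date` has no non-null values, and the list-like columns (`studio`, `genres`, `themes`) are read from CSV as plain strings. A hedged cleanup sketch on toy rows, assuming those strings are valid Python list literals as they appear in the `head()` output.

```python
import ast
import pandas as pd

# Toy stand-in for df_anime_meta (values are illustrative)
meta = pd.DataFrame({
    "genres": ["['Comedy']", "[]", "['Drama', 'Action']"],
    "release-date": [None, None, None],
    "release-year": [1917.0, 1917.0, 2016.0],
})

# Drop columns that contain nothing but missing values
meta = meta.dropna(axis="columns", how="all")

# Parse the string form of each list into a real Python list
meta["genres"] = meta["genres"].apply(ast.literal_eval)

# release-year has no nulls, so it can safely be stored as an integer
meta["release-year"] = meta["release-year"].astype("int64")
print(meta.dtypes)
```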


In [9]:
for col in df_anime_meta.columns:
    print(df_anime_meta[col].value_counts())

title
Sazae-san                                            213
Sore Ike! Anpanman                                   137
Crayon Shin-chan                                     123
Nintama Rantarou                                     119
Chibi Maruko-chan (1995)                             112
                                                    ... 
Ice Movie                                              1
Gegege no Kitarou: Nippon Bakuretsu                    1
Eiga! Tamagotchi Uchuu Ichi Happy na Monogatari!?      1
Da, da Ge da Xigua                                     1
Big Mac to, Susume                                     1
Name: count, Length: 16858, dtype: int64
MAL_id
2406     213
1960     137
966      123
1199     119
6149     112
        ... 
9053       1
8145       1
6517       1
44517      1
53731      1
Name: count, Length: 16858, dtype: int64
type
TV (Continuing)    9993
TV (New)           5269
OVA                3967
Movie              2931
ONA                2916
Special   