# Exploring MyAnimeList Genres

### Intro

MyAnimeList is a website where, among other things, users catalog anime and their anime-watching history. The rich combination of media and user data makes exploring the relationship between the two an appealing prospect. 

### Data Collection

I will be working from [a MAL dataset uploaded to Kaggle in 2018 by user Azathoth](https://www.kaggle.com/azathoth42/myanimelist).  Download the 'filtered' versions of these files to the folder containing this file.  These files restrict the data to users who have filled out their profile information and exclude a few bogus users.  A small number of anime titles have also been filtered out for reasons the source does not state.

In [25]:
import math
import pandas as pd
import numpy as np
import json
import re
import matplotlib.pyplot as plt
import seaborn as sb
import networkx as nx

In [2]:
users = pd.read_csv('users_filtered.csv')
titles = pd.read_csv('anime_filtered.csv')
animelists = pd.read_csv('animelists_filtered.csv')

In [3]:
users.head()

Unnamed: 0,username,user_id,user_watching,user_completed,user_onhold,user_dropped,user_plantowatch,user_days_spent_watching,gender,location,birth_date,access_rank,join_date,last_online,stats_mean_score,stats_rewatched,stats_episodes
0,karthiga,2255153,3,49,1,0,0,55.31,Female,"Chennai, India",1990-04-29,,2013-03-03,2014-02-04 01:32:00,7.43,0.0,3391.0
1,RedvelvetDaisuki,1897606,61,396,39,0,206,118.07,Female,Manila,1995-01-01,,2012-12-13,1900-05-13 02:47:00,6.78,80.0,7094.0
2,Damonashu,37326,45,195,27,25,59,83.7,Male,"Detroit,Michigan",1991-08-01,,2008-02-13,1900-03-24 12:48:00,6.15,6.0,4936.0
3,bskai,228342,25,414,2,5,11,167.16,Male,"Nayarit, Mexico",1990-12-14,,2009-08-31,2014-05-12 16:35:00,8.27,1.0,10081.0
4,terune_uzumaki,327311,5,5,0,0,0,15.2,Female,"Malaysia, Kuantan",1998-08-24,,2010-05-10,2012-10-18 19:06:00,9.7,6.0,920.0


In [4]:
titles.columns

Index(['anime_id', 'title', 'title_english', 'title_japanese',
       'title_synonyms', 'image_url', 'type', 'source', 'episodes', 'status',
       'airing', 'aired_string', 'aired', 'duration', 'rating', 'score',
       'scored_by', 'rank', 'popularity', 'members', 'favorites', 'background',
       'premiered', 'broadcast', 'related', 'producer', 'licensor', 'studio',
       'genre', 'opening_theme', 'ending_theme'],
      dtype='object')

In [5]:
animelists.head()

Unnamed: 0,username,anime_id,my_watched_episodes,my_start_date,my_finish_date,my_score,my_status,my_rewatching,my_rewatching_ep,my_last_updated,my_tags
0,karthiga,21,586,0000-00-00,0000-00-00,9,1,,0,1362307973,
1,RedvelvetDaisuki,21,0,0000-00-00,0000-00-00,0,3,0.0,0,1355480701,
2,Damonashu,21,418,0000-00-00,0000-00-00,10,1,0.0,0,1254296345,
3,bskai,21,75,0000-00-00,0000-00-00,8,1,0.0,0,1276637483,
4,Slimak,21,834,0000-00-00,0000-00-00,10,1,0.0,0,1525176321,


### Clean up, clean up

Date columns need to be converted to the appropriate types, some comma-separated strings need to be split into lists, and some columns hold dictionaries stored as strings that need to be evaluated.  Two users have invalid birthdates like '1337', which are coerced to nulls.  Not all of these columns will be used for this analysis, but they may be useful another time.  Finally, we're going to restrict the analysis to the 1500 most popular anime in the dataset that have genre tags, for the sake of my poor computer.

In [33]:
# Turn date columns into dates
users['birth_date'] = pd.to_datetime(users['birth_date'], errors='coerce').dt.date
users['join_date'] = pd.to_datetime(users['join_date']).dt.date
users['last_online_date'] = pd.to_datetime(users['last_online']).dt.date

# Turn list columns into lists
titles['genre_list'] = titles['genre'].str.split(', ')
titles['producer_list'] = titles['producer'].str.split(', ')
titles['licensor_list'] = titles['licensor'].str.split(', ')
titles['studio_list'] = titles['studio'].str.split(', ')

# Turn dict columns into dicts; literal_eval parses Python literals without executing code
import ast
titles['aired_dict'] = titles['aired'].apply(ast.literal_eval)
titles['started_airing'] = [i['from'] for i in titles['aired_dict']]
titles['finished_airing'] = [i['to'] for i in titles['aired_dict']]
titles['related_dict'] = titles['related'].apply(ast.literal_eval)

# Remove titles without a genre and choose the 1500 most popular titles.
# 'popularity' is a rank (1 = most popular), so an ascending sort puts the most popular first.
titles = titles[pd.notnull(titles['genre_list'])]
titles = titles[titles['popularity'] > 0]
top_titles = titles.sort_values(by=['popularity'], ascending=True).iloc[:1500, :].reset_index()
top_titles.head()

Unnamed: 0,index,anime_id,title,title_english,title_japanese,title_synonyms,image_url,type,source,episodes,...,opening_theme,ending_theme,genre_list,producer_list,licensor_list,studio_list,aired_dict,related_dict,started_airing,finished_airing
0,7913,1535,Death Note,Death Note,デスノート,DN,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,37,...,"['#1: ""the WORLD"" by Nightmare (eps 1-19)', '#...","['#1: ""Alumina"" by Nightmare (eps 1-19)', '#2:...","[Mystery, Police, Psychological, Supernatural,...","[VAP, Konami, Ashi Production, Nippon Televisi...",[Viz Media],[Madhouse],"{'from': '2006-10-04', 'to': '2007-06-27'}","{'Adaptation': [{'mal_id': 21, 'type': 'manga'...",2006-10-04,2007-06-27
1,8123,16498,Shingeki no Kyojin,Attack on Titan,進撃の巨人,AoT,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,25,...,"['#1: ""Guren no Yumiya (紅蓮の弓矢)"" by Linked Hori...","['#1: ""Utsukushiki Zankoku na Sekai (美しき残酷な世界)...","[Action, Military, Mystery, Super Power, Drama...","[Production I.G, Dentsu, Mainichi Broadcasting...",[Funimation],[Wit Studio],"{'from': '2013-04-07', 'to': '2013-09-29'}","{'Adaptation': [{'mal_id': 23390, 'type': 'man...",2013-04-07,2013-09-29
2,6296,11757,Sword Art Online,Sword Art Online,ソードアート・オンライン,"S.A.O, SAO",https://myanimelist.cdn-dena.com/images/anime/...,TV,Light novel,25,...,"['#1: ""crossing field"" by LiSA (eps 2-14)', '#...","['#1: ""crossing field"" by LiSA (eps 1, 25)', '...","[Action, Adventure, Fantasy, Game, Romance]","[Aniplex, Genco, DAX Production, ASCII Media W...",[Aniplex of America],[A-1 Pictures],"{'from': '2012-07-08', 'to': '2012-12-23'}","{'Adaptation': [{'mal_id': 21479, 'type': 'man...",2012-07-08,2012-12-23
3,2555,5114,Fullmetal Alchemist: Brotherhood,Fullmetal Alchemist: Brotherhood,鋼の錬金術師 FULLMETAL ALCHEMIST,"Hagane no Renkinjutsushi: Fullmetal Alchemist,...",https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,64,...,"['#1: ""again"" by YUI (eps 1-14)', '#2: ""Hologr...","['#1: ""Uso (嘘)"" by SID (eps 1-14)', '#2: ""LET ...","[Action, Military, Adventure, Comedy, Drama, M...","[Aniplex, Square Enix, Mainichi Broadcasting S...","[Funimation, Aniplex of America]",[Bones],"{'from': '2009-04-05', 'to': '2010-07-04'}","{'Adaptation': [{'mal_id': 25, 'type': 'manga'...",2009-04-05,2010-07-04
4,8863,30276,One Punch Man,One Punch Man,ワンパンマン,"One Punch-Man, One-Punch Man, OPM",https://myanimelist.cdn-dena.com/images/anime/...,TV,Web manga,12,...,"['""THE HERO !! ~Okoreru Kobushi ni Hi wo Tsuke...","['#1: ""Hoshi yori Saki ni Mitsukete Ageru (星より...","[Action, Sci-Fi, Comedy, Parody, Super Power, ...","[TV Tokyo, Bandai Visual, Lantis, Asatsu DK, B...",[Viz Media],[Madhouse],"{'from': '2015-10-05', 'to': '2015-12-21'}","{'Adaptation': [{'mal_id': 44347, 'type': 'man...",2015-10-05,2015-12-21
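
The `aired` and `related` columns are stored as strings that look like Python dictionaries. As a minimal sketch of how one such string can be parsed, `ast.literal_eval` from the standard library accepts only literals (dicts, lists, strings, numbers, `None`) and raises an error for anything else. The sample string is copied from the `aired_dict` output above; the rejected expression is a made-up illustration.

```python
import ast

# Example string copied from the aired_dict column shown above.
aired_raw = "{'from': '2006-10-04', 'to': '2007-06-27'}"

aired = ast.literal_eval(aired_raw)
print(aired['from'])  # 2006-10-04
print(aired['to'])    # 2007-06-27

# Anything that is not a plain literal is rejected instead of being executed.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")  # hypothetical hostile input
except ValueError:
    rejected = True
print(rejected)  # True
```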


### Graphing the Genres

Let's take a closer look at the genre column.

In [34]:
top_titles['genre_list'].head()

0    [Mystery, Police, Psychological, Supernatural,...
1    [Action, Military, Mystery, Super Power, Drama...
2          [Action, Adventure, Fantasy, Game, Romance]
3    [Action, Military, Adventure, Comedy, Drama, M...
4    [Action, Sci-Fi, Comedy, Parody, Super Power, ...
Name: genre_list, dtype: object

Really, these should be sets, not lists.  Then we can use set intersection to measure genre similarity as the number of genres two titles share.  Let's do that.

In [32]:
top_titles['genre_list'] = [set(i) for i in top_titles['genre_list']]

# Compare every unordered pair of titles once and count the genres they share
edges = []
for i in range(len(top_titles) - 1):
    anime_1 = top_titles['anime_id'][i]
    genres_1 = top_titles['genre_list'][i]
    for j in range(i + 1, len(top_titles)):
        anime_2 = top_titles['anime_id'][j]
        genres_2 = top_titles['genre_list'][j]
        genre_count = len(genres_1.intersection(genres_2))
        edges.append({'source': anime_1, 'target': anime_2, 'shared_genres': genre_count})
edges = pd.DataFrame(edges)
edges.head()

Unnamed: 0,source,target,shared_genres
0,1535,16498,2
1,1535,11757,0
2,1535,5114,1
3,1535,30276,1
4,1535,22319,3
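
To make the pairwise comparison concrete, here is a small sketch on toy data using only the standard library. The titles and genre sets are hypothetical and chosen only for illustration; `itertools.combinations` yields each unordered pair exactly once, which is what the nested loop over `i < j` does on the real data.

```python
from itertools import combinations

# Hypothetical titles and genre sets, for illustration only.
toy_genres = {
    "Title A": {"Action", "Adventure", "Fantasy"},
    "Title B": {"Action", "Adventure", "Comedy"},
    "Title C": {"Romance", "Slice of Life"},
}

# One record per unordered pair; the weight is the size of the intersection.
toy_edges = [
    {"source": a, "target": b, "shared_genres": len(toy_genres[a] & toy_genres[b])}
    for a, b in combinations(toy_genres, 2)
]

for edge in toy_edges:
    print(edge)
# {'source': 'Title A', 'target': 'Title B', 'shared_genres': 2}
# {'source': 'Title A', 'target': 'Title C', 'shared_genres': 0}
# {'source': 'Title B', 'target': 'Title C', 'shared_genres': 0}
```

Note that pairs with no genres in common still produce a record, with a weight of 0.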


In [37]:
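G = nx.Graph()
# One node per title, keyed by anime_id; every column after 'index' and 'anime_id' becomes a node attribute
G.add_nodes_from((top_titles['anime_id'][i], top_titles.iloc[i, 2:].to_dict()) for i in range(len(top_titles)))
G.number_of_nodes()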

1500

In [38]:
# One weighted edge per pair of titles; zip avoids a slow label lookup per row
G.add_edges_from((s, t, {"shared_genres": w})
                 for s, t, w in zip(edges['source'], edges['target'], edges['shared_genres']))
G.number_of_edges()

1124250
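
This edge count equals the number of unordered pairs among 1500 titles, n(n-1)/2, so every pair of titles is connected, including pairs that share no genres. The sketch below checks the arithmetic with the standard library and then shows one way to drop zero-weight edges. The filtering step is an assumption about what might be wanted later, not something this notebook does; the three sample records are the first rows of the edge table above.

```python
import math

n_titles = 1500
n_pairs = math.comb(n_titles, 2)  # n * (n - 1) / 2
print(n_pairs)  # 1124250

# First three rows of the edge table shown earlier.
sample_edges = [
    {"source": 1535, "target": 16498, "shared_genres": 2},
    {"source": 1535, "target": 11757, "shared_genres": 0},
    {"source": 1535, "target": 5114, "shared_genres": 1},
]

# Optional: keep only pairs that share at least one genre.
connected = [e for e in sample_edges if e["shared_genres"] > 0]
print(len(connected))  # 2
```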

In [65]:
# Deprecated: creating one-hot-encoding of genres and genre vector for regression or dot product/cosine similarity work.

# Remove titles without a genre
#titles = titles[pd.notnull(titles['genre_list'])]
#titles.reset_index()

# Gather all the genres
#genre_list = set()

# for genres in titles['genre_list']:
#    genre_list.update(genres)

# Create the binary columns - 1 if genre attaches to title, 0 otherwise.
#for genre in genre_list:
#    titles[genre] = [1 if genre in i else 0 for i in titles['genre_list']]

# Create the consolidated vector.
#titles['genre_vector'] = [titles.iloc[i,37:].values for i in range(len(titles))]
#titles['genre_vector'].head()

0    [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, ...
1    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, ...
2    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...
3    [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, ...
4    [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, ...
Name: genre_vector, dtype: object
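
The commented-out cell mentions dot-product and cosine-similarity work on one-hot genre vectors. As a rough, standard-library-only sketch of that idea (not the code this notebook ended up using), the helpers `one_hot` and `cosine_similarity` and the toy titles below are hypothetical names introduced for illustration.

```python
import math

def one_hot(genres, vocabulary):
    """Return a 0/1 vector with one position per genre in the vocabulary."""
    return [1 if g in genres else 0 for g in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical genre sets, for illustration only.
toy_genres = {
    "Title A": {"Action", "Adventure", "Fantasy"},
    "Title B": {"Action", "Adventure", "Comedy"},
}

# A fixed, sorted vocabulary keeps vector positions consistent across titles.
vocabulary = sorted(set().union(*toy_genres.values()))
vec_a = one_hot(toy_genres["Title A"], vocabulary)
vec_b = one_hot(toy_genres["Title B"], vocabulary)

print(vocabulary)  # ['Action', 'Adventure', 'Comedy', 'Fantasy']
print(vec_a)       # [1, 1, 0, 1]
print(vec_b)       # [1, 1, 1, 0]
similarity = cosine_similarity(vec_a, vec_b)
```

For 0/1 vectors the dot product is the shared-genre count (2 here), and the cosine divides it by the geometric mean of the two titles' genre counts, so titles with many tags are not favored automatically.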

From this vector we can construct 