# Data Preparation

Dataset from Kaggle : **"MyAnimeList"** by *Azathoth*  
Source: https://www.kaggle.com/datasets/azathoth42/myanimelist/data (requires login)

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

---

### Import the Dataset (UserList)

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [2]:
userlist = pd.read_csv('UserList.csv')
userlist.head()

Unnamed: 0,username,user_id,user_watching,user_completed,user_onhold,user_dropped,user_plantowatch,user_days_spent_watching,gender,location,birth_date,access_rank,join_date,last_online,stats_mean_score,stats_rewatched,stats_episodes
0,karthiga,2255153,3,49,1,0,0,55.31,Female,"Chennai, India",1990-04-29,,2013-03-03,2014-02-04 01:32:00,7.43,0.0,3391.0
1,RedvelvetDaisuki,1897606,61,396,39,0,206,118.07,Female,Manila,1995-01-01,,2012-12-13,1900-05-13 02:47:00,6.78,80.0,7094.0
2,Damonashu,37326,45,195,27,25,59,83.7,Male,"Detroit,Michigan",1991-08-01,,2008-02-13,1900-03-24 12:48:00,6.15,6.0,4936.0
3,bskai,228342,25,414,2,5,11,167.16,Male,"Nayarit, Mexico",1990-12-14,,2009-08-31,2014-05-12 16:35:00,8.27,1.0,10081.0
4,shuzzable,2347781,36,72,16,2,25,35.48,,,,,2013-03-25,2015-09-09 21:54:00,9.06,7.0,2154.0


Description of the dataset, as available on Kaggle, is as follows.


> **username**                 : user name  
> **user_id**                  : ID for each user  
> **user_watching**            : how many anime the user is currently watching  
> **user_completed**           : how many anime the user has completed  
> **user_onhold**              : how many anime the user has put on hold partway through  
> **user_dropped**             : how many anime the user has dropped from their list  
> **user_plantowatch**         : how many anime the user has added to their plan-to-watch list  
> **user_days_spent_watching** : how much time (in days) the user has spent watching anime  
> **gender**                   : user's gender  
> **location**                 : where the user is from  
> **birth_date**               : user's date of birth (from which age can be derived)  
> **access_rank**              : meaning not documented  
> **join_date**                : when the user joined the community  
> **last_online**              : when the user was last seen online  
> **stats_mean_score**         : average score the user has given to anime  
> **stats_rewatched**          : how many episodes the user has rewatched  
> **stats_episodes**           : how many episodes the user has completed  
---

Check the vital statistics of the dataset using the built-in `type` function and the `shape` attribute.

In [3]:
print("Data type : ", type(userlist))
print("Data dims : ", userlist.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (302675, 17)


Check the variables (and their types) in the dataset using the `dtypes` attribute.

In [4]:
print(userlist.dtypes)

username                     object
user_id                       int64
user_watching                 int64
user_completed                int64
user_onhold                   int64
user_dropped                  int64
user_plantowatch              int64
user_days_spent_watching    float64
gender                       object
location                     object
birth_date                   object
access_rank                 float64
join_date                    object
last_online                  object
stats_mean_score            float64
stats_rewatched             float64
stats_episodes              float64
dtype: object
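
The date-like columns (`birth_date`, `join_date`, `last_online`) are listed as `object`, which means Pandas read them as plain strings. Below is a minimal sketch of one way to convert them, assuming the ISO-like formats seen in the `head` output above. The small `sample` frame is a hand-copied stand-in for `userlist`, and `errors="coerce"` is our own choice so that missing or unparseable entries become `NaT` instead of raising an error.

```python
import pandas as pd

# Hand-copied stand-in for a few rows of `userlist` (values taken from the head() output)
sample = pd.DataFrame({
    "username": ["karthiga", "bskai", "shuzzable"],
    "birth_date": ["1990-04-29", "1990-12-14", None],
    "join_date": ["2013-03-03", "2009-08-31", "2013-03-25"],
    "last_online": ["2014-02-04 01:32:00", "2014-05-12 16:35:00", "2015-09-09 21:54:00"],
})

date_cols = ["birth_date", "join_date", "last_online"]
for col in date_cols:
    # errors="coerce" turns missing or unparseable entries into NaT
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

print(sample.dtypes)                    # the three date columns are now datetime64
print(sample[date_cols].isna().sum())   # number of missing dates per column
```

The same loop could be run on `userlist` itself; checking the `NaT` counts afterwards shows how many entries could not be parsed.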


---

### Import the Dataset (AnimeList)

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [5]:
animelist = pd.read_csv('AnimeList.csv')
animelist.head()


Unnamed: 0,anime_id,title,title_english,title_japanese,title_synonyms,image_url,type,source,episodes,status,...,background,premiered,broadcast,related,producer,licensor,studio,genre,opening_theme,ending_theme
0,11013,Inu x Boku SS,Inu X Boku Secret Service,妖狐×僕SS,Youko x Boku SS,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,12,Finished Airing,...,Inu x Boku SS was licensed by Sentai Filmworks...,Winter 2012,Fridays at Unknown,"{'Adaptation': [{'mal_id': 17207, 'type': 'man...","Aniplex, Square Enix, Mainichi Broadcasting Sy...",Sentai Filmworks,David Production,"Comedy, Supernatural, Romance, Shounen","['""Nirvana"" by MUCC']","['#1: ""Nirvana"" by MUCC (eps 1, 11-12)', '#2: ..."
1,2104,Seto no Hanayome,My Bride is a Mermaid,瀬戸の花嫁,The Inland Sea Bride,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,26,Finished Airing,...,,Spring 2007,Unknown,"{'Adaptation': [{'mal_id': 759, 'type': 'manga...","TV Tokyo, AIC, Square Enix, Sotsu",Funimation,Gonzo,"Comedy, Parody, Romance, School, Shounen","['""Romantic summer"" by SUN&LUNAR']","['#1: ""Ashita e no Hikari (明日への光)"" by Asuka Hi..."
2,5262,Shugo Chara!! Doki,Shugo Chara!! Doki,しゅごキャラ！！どきっ,"Shugo Chara Ninenme, Shugo Chara! Second Year",https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,51,Finished Airing,...,,Fall 2008,Unknown,"{'Adaptation': [{'mal_id': 101, 'type': 'manga...","TV Tokyo, Sotsu",,Satelight,"Comedy, Magic, School, Shoujo","['#1: ""Minna no Tamago (みんなのたまご)"" by Shugo Cha...","['#1: ""Rottara Rottara (ロッタラ ロッタラ)"" by Buono! ..."
3,721,Princess Tutu,Princess Tutu,プリンセスチュチュ,,https://myanimelist.cdn-dena.com/images/anime/...,TV,Original,38,Finished Airing,...,Princess Tutu aired in two parts. The first pa...,Summer 2002,Fridays at Unknown,"{'Adaptation': [{'mal_id': 1581, 'type': 'mang...","Memory-Tech, GANSIS, Marvelous AQL",ADV Films,Hal Film Maker,"Comedy, Drama, Magic, Romance, Fantasy","['""Morning Grace"" by Ritsuko Okazaki']","['""Watashi No Ai Wa Chiisaikeredo"" by Ritsuko ..."
4,12365,Bakuman. 3rd Season,Bakuman.,バクマン。,Bakuman Season 3,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,25,Finished Airing,...,,Fall 2012,Unknown,"{'Adaptation': [{'mal_id': 9711, 'type': 'mang...","NHK, Shueisha",,J.C.Staff,"Comedy, Drama, Romance, Shounen","['#1: ""Moshimo no Hanashi (もしもの話)"" by nano.RIP...","['#1: ""Pride on Everyday"" by Sphere (eps 1-13)..."


Description of the dataset, as available on Kaggle, is as follows.


> **anime_id**         : ID for each anime show  
> **title**            : Anime title  
> **title_english**    : Anime title in English  
> **title_japanese**   : Anime title in Japanese  
> **image_url**        : URL of the front poster image  
> **type**             : Anime type (TV, Movie, etc.)  
> **source**           : Anime source (Manga, Original)  
> **episodes**         : Number of episodes  
> **status**           : Current status (airing, finished airing)  
> **airing**           : Whether it is currently airing  
> **aired_string**     : Start date and finish date  
> **aired**            : Start date and finish date in java  
> **duration**         : How long the anime runs (per episode or movie)  
> **rating**           : Anime rating (pg13, NC16, M18, R21)  
> **score**            : Overall score of the anime (out of 10)  
> **scored_by**        : How many users scored the anime  
> **rank**             : Rank based on the score of the anime  
> **popularity**       : Rank based on how many people watch the anime  
> **members**          : How many people watch the anime  
> **favorites**        : How many people marked the anime as a favorite  
> **background**       : Background of the anime  
> **premiered**        : Which season the anime came out  
> **broadcast**        : Which day it is broadcast  
> **related**          : Related works, such as a sequel or prequel  
> **producer**         : Who produced the anime  
> **licensor**         : Which company licensed the anime  
> **studio**           : Which studio animated the anime  
> **genre**            : Genres of the anime  
> **opening_theme**    : Opening song  
> **ending_theme**     : Ending song  
---

Check the vital statistics of the dataset using the built-in `type` function and the `shape` attribute.

In [6]:
print("Data type : ", type(animelist))
print("Data dims : ", animelist.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (14478, 31)


Check the variables (and their types) in the dataset using the `dtypes` attribute.

In [7]:
print(animelist.dtypes)

anime_id            int64
title              object
title_english      object
title_japanese     object
title_synonyms     object
image_url          object
type               object
source             object
episodes            int64
status             object
airing               bool
aired_string       object
aired              object
duration           object
rating             object
score             float64
scored_by           int64
rank              float64
popularity          int64
members             int64
favorites           int64
background         object
premiered          object
broadcast          object
related            object
producer           object
licensor           object
studio             object
genre              object
opening_theme      object
ending_theme       object
dtype: object
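
Several of the `object` columns pack more than one value into a single string: `genre` holds a comma-separated list, and `premiered` combines a season and a year (for example `Winter 2012`). Here is a rough sketch of how these might be unpacked, assuming the formats seen in the `head` output above. The `sample` frame is a hand-copied stand-in for `animelist`, and `season`, `year`, and `genre_list` are illustrative column names that are not part of the dataset.

```python
import pandas as pd

# Hand-copied stand-in for a few rows of `animelist` (values taken from the head() output)
sample = pd.DataFrame({
    "title": ["Inu x Boku SS", "Seto no Hanayome", "Princess Tutu"],
    "premiered": ["Winter 2012", "Spring 2007", "Summer 2002"],
    "genre": [
        "Comedy, Supernatural, Romance, Shounen",
        "Comedy, Parody, Romance, School, Shounen",
        "Comedy, Drama, Magic, Romance, Fantasy",
    ],
})

# Split "Winter 2012" into a season string and a numeric year
parts = sample["premiered"].str.split(" ", n=1, expand=True)
sample["season"] = parts[0]
sample["year"] = pd.to_numeric(parts[1], errors="coerce")

# Turn the comma-separated genre string into a list, then count one row per genre
sample["genre_list"] = sample["genre"].str.split(", ")
genre_counts = sample.explode("genre_list")["genre_list"].value_counts()

print(sample[["title", "season", "year"]])
print(genre_counts)
```

Whether every row of the full dataset follows these formats is not something the `head` output can confirm, so the result is worth inspecting before relying on it.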


How might we (action) for (target audience) in order to (outcome: the result we would like to see)?

For example:  
How might we recommend the top 20 anime shows for anime beginners?  
How might we recommend the top 10 anime shows of the winter season for anime users?
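
As a starting point for questions of this kind, here is a minimal sketch of a score-based top-N ranking. It is only one possible approach, not a settled method: the helper `top_n_by_score`, the `min_votes` threshold, and the toy rows (including their `score` and `scored_by` values) are invented for illustration and do not come from the dataset.

```python
import pandas as pd

def top_n_by_score(df, n=10, min_votes=100, season=None):
    """Hypothetical helper: rank anime by score, ignoring thinly rated titles."""
    picked = df[df["scored_by"] >= min_votes]
    if season is not None:
        # `premiered` looks like "Winter 2012"; keep rows that start with the season name
        picked = picked[picked["premiered"].str.startswith(season, na=False)]
    return picked.sort_values("score", ascending=False).head(n)

# Toy rows with made-up titles and values, NOT taken from AnimeList.csv
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "premiered": ["Winter 2012", "Fall 2008", "Winter 2015", None],
    "score": [7.5, 8.9, 8.1, 9.5],
    "scored_by": [5000, 12000, 300, 20],
})

print(top_n_by_score(toy, n=2, season="Winter")["title"].tolist())
```

The vote threshold is there so that a title scored by only a handful of users (like `D` in the toy data) does not outrank widely rated ones; a suitable value for the real data would need to be chosen after looking at the distribution of `scored_by`.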