# Data Preparation

Dataset from Kaggle : **"MyAnimeList"** by *Azathoth*  
Source: https://www.kaggle.com/datasets/azathoth42/myanimelist/data (requires login)

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

---

### Import the Dataset (UserList)

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [2]:
userlist = pd.read_csv('UserList.csv')
userlist.head()

Unnamed: 0,username,user_id,user_watching,user_completed,user_onhold,user_dropped,user_plantowatch,user_days_spent_watching,gender,location,birth_date,access_rank,join_date,last_online,stats_mean_score,stats_rewatched,stats_episodes
0,karthiga,2255153,3,49,1,0,0,55.31,Female,"Chennai, India",1990-04-29,,2013-03-03,2014-02-04 01:32:00,7.43,0.0,3391.0
1,RedvelvetDaisuki,1897606,61,396,39,0,206,118.07,Female,Manila,1995-01-01,,2012-12-13,1900-05-13 02:47:00,6.78,80.0,7094.0
2,Damonashu,37326,45,195,27,25,59,83.7,Male,"Detroit,Michigan",1991-08-01,,2008-02-13,1900-03-24 12:48:00,6.15,6.0,4936.0
3,bskai,228342,25,414,2,5,11,167.16,Male,"Nayarit, Mexico",1990-12-14,,2009-08-31,2014-05-12 16:35:00,8.27,1.0,10081.0
4,shuzzable,2347781,36,72,16,2,25,35.48,,,,,2013-03-25,2015-09-09 21:54:00,9.06,7.0,2154.0


Description of the dataset, as available on Kaggle, is as follows.


> **username** : user name  
> **user_id** : ID for each user  
> **user_watching** : how many anime the user is currently watching  
> **user_completed** : how many anime the user has completed  
> **user_onhold** : how many anime the user has put on hold partway through  
> **user_dropped** : how many anime the user has dropped  
> **user_plantowatch** : how many anime the user plans to watch  
> **user_days_spent_watching** : how many days the user has spent watching anime  
> **gender** : user's gender  
> **location** : where the user is from  
> **birth_date** : user's date of birth  
> **access_rank** : undocumented (the column has no non-null values in this dataset)  
> **join_date** : when the user joined the community  
> **last_online** : when the user was last seen online  
> **stats_mean_score** : average score the user gave to anime  
> **stats_rewatched** : how many episodes the user has rewatched  
> **stats_episodes** : how many episodes the user has completed  
---

Check the vital statistics of the dataset using the `type` function and the `shape` attribute.

In [3]:
print("Data type : ", type(userlist))
print("Data dims : ", userlist.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (302675, 17)


Check the variables (and their types) in the dataset using the `info` method.

In [4]:
userlist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302675 entries, 0 to 302674
Data columns (total 17 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   username                  302674 non-null  object 
 1   user_id                   302675 non-null  int64  
 2   user_watching             302675 non-null  int64  
 3   user_completed            302675 non-null  int64  
 4   user_onhold               302675 non-null  int64  
 5   user_dropped              302675 non-null  int64  
 6   user_plantowatch          302675 non-null  int64  
 7   user_days_spent_watching  302675 non-null  float64
 8   gender                    217800 non-null  object 
 9   location                  156773 non-null  object 
 10  birth_date                168749 non-null  object 
 11  access_rank               0 non-null       float64
 12  join_date                 302546 non-null  object 
 13  last_online               302546 non-null  o

---

### Import the Dataset (AnimeList)

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [5]:
animelist = pd.read_csv('AnimeList.csv')
animelist.head()


Unnamed: 0,anime_id,title,title_english,title_japanese,title_synonyms,image_url,type,source,episodes,status,...,background,premiered,broadcast,related,producer,licensor,studio,genre,opening_theme,ending_theme
0,11013,Inu x Boku SS,Inu X Boku Secret Service,妖狐×僕SS,Youko x Boku SS,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,12,Finished Airing,...,Inu x Boku SS was licensed by Sentai Filmworks...,Winter 2012,Fridays at Unknown,"{'Adaptation': [{'mal_id': 17207, 'type': 'man...","Aniplex, Square Enix, Mainichi Broadcasting Sy...",Sentai Filmworks,David Production,"Comedy, Supernatural, Romance, Shounen","['""Nirvana"" by MUCC']","['#1: ""Nirvana"" by MUCC (eps 1, 11-12)', '#2: ..."
1,2104,Seto no Hanayome,My Bride is a Mermaid,瀬戸の花嫁,The Inland Sea Bride,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,26,Finished Airing,...,,Spring 2007,Unknown,"{'Adaptation': [{'mal_id': 759, 'type': 'manga...","TV Tokyo, AIC, Square Enix, Sotsu",Funimation,Gonzo,"Comedy, Parody, Romance, School, Shounen","['""Romantic summer"" by SUN&LUNAR']","['#1: ""Ashita e no Hikari (明日への光)"" by Asuka Hi..."
2,5262,Shugo Chara!! Doki,Shugo Chara!! Doki,しゅごキャラ！！どきっ,"Shugo Chara Ninenme, Shugo Chara! Second Year",https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,51,Finished Airing,...,,Fall 2008,Unknown,"{'Adaptation': [{'mal_id': 101, 'type': 'manga...","TV Tokyo, Sotsu",,Satelight,"Comedy, Magic, School, Shoujo","['#1: ""Minna no Tamago (みんなのたまご)"" by Shugo Cha...","['#1: ""Rottara Rottara (ロッタラ ロッタラ)"" by Buono! ..."
3,721,Princess Tutu,Princess Tutu,プリンセスチュチュ,,https://myanimelist.cdn-dena.com/images/anime/...,TV,Original,38,Finished Airing,...,Princess Tutu aired in two parts. The first pa...,Summer 2002,Fridays at Unknown,"{'Adaptation': [{'mal_id': 1581, 'type': 'mang...","Memory-Tech, GANSIS, Marvelous AQL",ADV Films,Hal Film Maker,"Comedy, Drama, Magic, Romance, Fantasy","['""Morning Grace"" by Ritsuko Okazaki']","['""Watashi No Ai Wa Chiisaikeredo"" by Ritsuko ..."
4,12365,Bakuman. 3rd Season,Bakuman.,バクマン。,Bakuman Season 3,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,25,Finished Airing,...,,Fall 2012,Unknown,"{'Adaptation': [{'mal_id': 9711, 'type': 'mang...","NHK, Shueisha",,J.C.Staff,"Comedy, Drama, Romance, Shounen","['#1: ""Moshimo no Hanashi (もしもの話)"" by nano.RIP...","['#1: ""Pride on Everyday"" by Sphere (eps 1-13)..."


Description of the dataset, as available on Kaggle, is as follows.


> **anime_id**         : ID for each anime show  
> **title**            : Anime title  
> **title_english**    : Anime title in English  
> **title_japanese**   : Anime title in Japanese  
> **title_synonyms**   : Alternative titles  
> **image_url**        : URL of the poster image  
> **type**             : Anime type (TV, Movie, etc.)  
> **source**           : Source material (Manga, Original, etc.)  
> **episodes**         : Number of episodes  
> **status**           : Current status (e.g. Finished Airing, Not yet aired)  
> **airing**           : Whether it is currently airing  
> **aired_string**     : Start date and end date as text  
> **aired**            : Start date and end date as a dictionary-like string (`{'from': ..., 'to': ...}`)  
> **duration**         : Length of an episode (or of the movie)  
> **rating**           : Age rating (e.g. G, PG, PG-13, R, R+, Rx)  
> **score**            : Overall score of the anime (out of 10)  
> **scored_by**        : How many users scored the anime  
> **rank**             : Rank based on the score of the anime  
> **popularity**       : Rank based on how many people watch the anime  
> **members**          : How many people watch the anime  
> **favorites**        : How many people marked the anime as a favorite  
> **background**       : Background information about the anime  
> **premiered**        : Season in which the anime premiered  
> **broadcast**        : Day and time of broadcast  
> **related**          : Related works, such as sequels, prequels, or adaptations  
> **producer**         : Companies that produced the anime  
> **licensor**         : Companies that licensed the anime  
> **studio**           : Studio that animated the anime  
> **genre**            : Genres of the anime  
> **opening_theme**    : Opening songs  
> **ending_theme**     : Ending songs  
---

Check the vital statistics of the dataset using the `type` function and the `shape` attribute.

In [6]:
print("Data type : ", type(animelist))
print("Data dims : ", animelist.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (14478, 31)


Check the variables (and their types) in the dataset using the `info` method.

In [7]:
animelist.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14478 entries, 0 to 14477
Data columns (total 31 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   anime_id        14478 non-null  int64  
 1   title           14478 non-null  object 
 2   title_english   5724 non-null   object 
 3   title_japanese  14443 non-null  object 
 4   title_synonyms  8937 non-null   object 
 5   image_url       14382 non-null  object 
 6   type            14478 non-null  object 
 7   source          14478 non-null  object 
 8   episodes        14478 non-null  int64  
 9   status          14478 non-null  object 
 10  airing          14478 non-null  bool   
 11  aired_string    14478 non-null  object 
 12  aired           14478 non-null  object 
 13  duration        14478 non-null  object 
 14  rating          13934 non-null  object 
 15  score           14478 non-null  float64
 16  scored_by       14478 non-null  int64  
 17  rank            12904 non-null 

## Clean Data (AnimeList)

Problem statement template: how might we (action) for (target audience) in order to (outcome, i.e. the result we would like to see)?

e.g.  
How might we recommend the top 20 anime shows for anime beginners?  
How might we recommend the top 10 anime shows of the winter season for anime users?

### Gathering information 
---

> Describe the numeric columns  
> Describe the object columns  
> Display the column names  

In [8]:
## for numeric data
animelist.describe()

Unnamed: 0,anime_id,episodes,score,scored_by,rank,popularity,members,favorites
count,14478.0,14478.0,14478.0,14478.0,12904.0,14478.0,14478.0,14478.0
mean,17377.229866,11.308399,6.142482,11460.03,6439.065406,7220.259566,22966.4,311.649606
std,13165.315011,43.443451,1.463981,43105.19,3720.227608,4170.080564,74981.36,2615.554211
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4389.25,1.0,5.55,46.0,3216.25,3611.5,245.0,0.0
50%,15135.0,1.0,6.37,501.0,6441.5,7225.5,1679.5,2.0
75%,31146.5,12.0,7.06,3941.5,9664.0,10827.75,10379.0,23.0
max,37916.0,1818.0,10.0,1009477.0,12919.0,14487.0,1456378.0,106895.0


In [9]:
## for object (string) data
animelist.describe(include=object)

Unnamed: 0,title,title_english,title_japanese,title_synonyms,image_url,type,source,status,aired_string,aired,...,background,premiered,broadcast,related,producer,licensor,studio,genre,opening_theme,ending_theme
count,14478,5724,14443,8937,14382,14478,14478,14478,14478,14478,...,1057,4096,4271,14478,8288,3373,8544,14414,14478,14478
unique,14477,5606,13701,8575,14382,7,16,3,10026,9649,...,1038,221,441,9420,3221,193,778,4544,4328,5458
top,Hinamatsuri,Cyborg 009,ゲゲゲの鬼太郎,Minna no Uta,https://myanimelist.cdn-dena.com/images/anime/...,TV,Unknown,Finished Airing,Not available,"{'from': None, 'to': None}",...,Includes claymation short which was shown befo...,Spring 2017,Unknown,[],NHK,Funimation,Toei Animation,Hentai,[],[]
freq,2,4,6,189,1,4271,4210,13791,223,1691,...,5,80,2241,4515,427,726,725,868,9784,8807


In [10]:
## list the columns in the dataset
animelist.columns

Index(['anime_id', 'title', 'title_english', 'title_japanese',
       'title_synonyms', 'image_url', 'type', 'source', 'episodes', 'status',
       'airing', 'aired_string', 'aired', 'duration', 'rating', 'score',
       'scored_by', 'rank', 'popularity', 'members', 'favorites', 'background',
       'premiered', 'broadcast', 'related', 'producer', 'licensor', 'studio',
       'genre', 'opening_theme', 'ending_theme'],
      dtype='object')

---
### Premiered 

> Convert the `premiered` column into a binary indicator (1 if a premiere season is recorded, 0 if it is missing)

In [11]:
## create a new column "isPremiered" that contains 1 for rows where "premiered" has a value and 0 for rows where it is null.
## This new column acts as a binary indicator, showing whether an anime has a recorded premiere season or not.
animelist["isPremiered"] = animelist["premiered"].notnull().astype(int)

In [12]:
animelist.isPremiered.info()

<class 'pandas.core.series.Series'>
RangeIndex: 14478 entries, 0 to 14477
Series name: isPremiered
Non-Null Count  Dtype
--------------  -----
14478 non-null  int32
dtypes: int32(1)
memory usage: 56.7 KB
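
The direction of such an indicator is easy to get backwards, so it can help to check it on a tiny example. The following is a minimal sketch on a hypothetical toy frame (the column names `has_premiere` and `missing_premiere` are illustrative, not part of the dataset): `notna()` marks rows that have a value, while `isna()` marks rows where the value is missing.

```python
import pandas as pd

# Toy frame (hypothetical values) with one missing premiere season
toy = pd.DataFrame({"premiered": ["Winter 2012", None, "Fall 2008"]})

# notna() -> 1 when a season is recorded
toy["has_premiere"] = toy["premiered"].notna().astype(int)

# isna() -> 1 when the season is missing
toy["missing_premiere"] = toy["premiered"].isna().astype(int)

print(toy)
```

Whichever direction is used, naming the column after what the value 1 means keeps later analysis easier to read.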


### Studio filtering
Group the less common studios (fewer than 40 titles each) into a single "SmallStudio" category.

In [13]:
## count the number of titles for each studio

## replace missing studio values with "unknown"
animelist["studio"] = animelist["studio"].fillna("unknown")
studio_counts = animelist.studio.value_counts()
studio_counts

studio
unknown                         5934
Toei Animation                   725
Sunrise                          447
J.C.Staff                        314
Madhouse                         311
                                ... 
Studio Junio, Annapuru             1
Tokyo Media Connections            1
Gainax, Tatsunoko Production       1
Fanworks, Imagineer                1
33 Collective                      1
Name: count, Length: 779, dtype: int64

In [14]:
# collect the studios with fewer than 40 titles
minor = studio_counts[studio_counts < 40].index.to_list()
minor

['Eiken',
 'Group TAC',
 'TNK',
 'Artland',
 'SynergySP',
 '8bit',
 'Wit Studio',
 'Actas',
 'Manglobe',
 'Haoliners Animation League',
 'Ajia-Do',
 'MAPPA',
 'Studio Comet',
 'White Fox',
 'Mushi Production',
 'Studio Gokumi',
 'Hal Film Maker',
 'Tezuka Productions',
 'A.C.G.T.',
 'Asahi Production',
 'TYO Animations',
 'Gathering',
 'Tokyo Movie Shinsha',
 'Daume',
 'Kinema Citrus',
 'Polygon Pictures',
 'Nomad',
 'AIC A.S.T.A.',
 'T-Rex',
 'LIDENFILMS',
 'Magic Bus',
 'Studio Jam',
 'Bee Train',
 'GoHands',
 'Production IMS',
 'Trigger',
 'David Production',
 'Bandai Namco Pictures',
 'Telecom Animation Film',
 'Seven Arcs',
 'Office Takeout',
 'Asread',
 'Studio Fantasia',
 'Studio PuYUKAI',
 'RG Animation Studios',
 'dwarf',
 'AIC Plus+',
 'Seven Arcs Pictures',
 'Fanworks',
 'APPP',
 'Hoods Entertainment',
 'AT-2',
 'Sparkly Key Animation Studio',
 'Production I.G, Xebec',
 'Millepensee',
 'Y.O.U.C',
 'Shuka',
 'Flavors Soft',
 'Creators in Pack',
 'Animate Film',
 'ILCA',
 'Stu

In [15]:
## combine the minor studios into one "SmallStudio" category
animelist["studio"] = animelist["studio"].apply(lambda x : "SmallStudio" if x in minor else x)
animelist.studio.value_counts()

studio
unknown                 5934
SmallStudio             3028
Toei Animation           725
Sunrise                  447
J.C.Staff                314
Madhouse                 311
Production I.G           251
TMS Entertainment        248
Studio Deen              241
Studio Pierrot           240
Nippon Animation         202
OLM                      181
A-1 Pictures             174
Shin-Ei Animation        151
DLE                      139
Tatsunoko Production     131
Shaft                    111
Gonzo                    109
Xebec                    109
Bones                    109
Kyoto Animation          103
AIC                       98
Brain&#039;s Base         80
Silver Link.              74
Satelight                 71
Arms                      69
Production Reed           64
Doga Kobo                 63
Studio 4°C                59
Gainax                    59
ufotable                  58
Zexcs                     57
Seven                     54
feel.                     53
Kachido
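
The same grouping can also be written without `apply`, using `isin` and `where`. This is a minimal sketch on hypothetical toy data, with a cut-off of 2 instead of 40 so the effect is visible; the column name `studio_grouped` is illustrative only.

```python
import pandas as pd

# Toy data (hypothetical): studio "A" is common, "B" and "C" are rare
toy = pd.DataFrame({"studio": ["A", "A", "A", "B", "C", None, None]})
toy["studio"] = toy["studio"].fillna("unknown")

counts = toy["studio"].value_counts()
threshold = 2  # hypothetical cut-off; the notebook uses 40
rare = counts[counts < threshold].index

# where() keeps values for which the condition is True and replaces the rest
toy["studio_grouped"] = toy["studio"].where(~toy["studio"].isin(rare), "SmallStudio")

print(toy["studio_grouped"].value_counts())
```

Note that a joint entry such as `'Production I.G, Xebec'` counts as its own studio under this approach, so co-productions tend to fall into the small group even when each studio on its own is large.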

### Drop the columns that are not important
TODO: Add back producers

---
|                 | **Unnecessary data** |                 |
|:---------------:|:--------------------:|:---------------:|
| anime_id        | background           | opening_theme   |
| title_english   | premiered            | ending_theme    |
| title_japanese  | broadcast            | aired_string    |
| title_synonyms  | producer             |                 |
| image_url       | licensor             |                 |

In [16]:
## drop the columns that are not needed
animelist.drop(columns=['anime_id','title_english',  'title_japanese','title_synonyms', 'image_url', 'background',
       'premiered', 'broadcast','producer','licensor','opening_theme', 'ending_theme','aired_string' ], inplace=True)

In [17]:
## after dropping the columns 
animelist.shape

(14478, 19)

In [18]:
animelist.columns

Index(['title', 'type', 'source', 'episodes', 'status', 'airing', 'aired',
       'duration', 'rating', 'score', 'scored_by', 'rank', 'popularity',
       'members', 'favorites', 'related', 'studio', 'genre', 'isPremiered'],
      dtype='object')

### Split aired date (from and to)
--- 
The `aired` column contains a dictionary-like string: `{'from': 'yyyy-mm-dd', 'to': 'yyyy-mm-dd'}`.

Split it into:

> `aired_from` -> yyyy-mm-dd  
> `aired_to`   -> yyyy-mm-dd  

Planned follow-up steps:

> Calculate the number of days over which the anime aired  
> Calculate how frequently it aired  

In [19]:
# Splitting the 'aired' column into 'from' and 'to' columns
animelist[['aired_from', 'aired_to']] = animelist['aired'].str.extract(r"'from': '(.*?)', 'to': '(.*?)'")

# Displaying the DataFrame with the new columns
print(animelist[['aired_from', 'aired_to']])

       aired_from    aired_to
0      2012-01-13  2012-03-30
1      2007-04-02  2007-10-01
2      2008-10-04  2009-09-25
3      2002-08-16  2003-05-23
4      2012-10-06  2013-03-30
...           ...         ...
14473  1987-11-05  1988-11-04
14474  1986-03-21  1986-03-21
14475         NaN         NaN
14476         NaN         NaN
14477  2010-04-07  2010-04-07

[14478 rows x 2 columns]
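
The regular expression above only matches when both dates are quoted strings, so a row such as `{'from': '2018-04-07', 'to': None}` would come out as NaN in both columns. This might explain why the null check reports 2191 missing values while `describe` shows only 1691 rows where both dates are `None`. A possible alternative, sketched here on hypothetical toy rows and assuming every `aired` value is a valid Python dict literal, is to parse the string with `ast.literal_eval` and convert each side separately; the `aired_days` column is illustrative only.

```python
import ast
import pandas as pd

# Toy rows (hypothetical) covering the three cases: both dates, start only, none
toy = pd.DataFrame({
    "aired": [
        "{'from': '2012-01-13', 'to': '2012-03-30'}",
        "{'from': '2018-04-07', 'to': None}",
        "{'from': None, 'to': None}",
    ]
})

# Parse each dictionary-like string into a real dict
parsed = toy["aired"].apply(ast.literal_eval)

# Convert each side on its own; missing or unparseable dates become NaT
toy["aired_from"] = pd.to_datetime(parsed.apply(lambda d: d["from"]), errors="coerce")
toy["aired_to"] = pd.to_datetime(parsed.apply(lambda d: d["to"]), errors="coerce")

# Number of days between the first and last airing date (NaN when either is missing)
toy["aired_days"] = (toy["aired_to"] - toy["aired_from"]).dt.days

print(toy[["aired_from", "aired_to", "aired_days"]])
```

Keeping the columns as datetimes also makes the planned duration calculation a simple subtraction.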


In [20]:
animelist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14478 entries, 0 to 14477
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        14478 non-null  object 
 1   type         14478 non-null  object 
 2   source       14478 non-null  object 
 3   episodes     14478 non-null  int64  
 4   status       14478 non-null  object 
 5   airing       14478 non-null  bool   
 6   aired        14478 non-null  object 
 7   duration     14478 non-null  object 
 8   rating       13934 non-null  object 
 9   score        14478 non-null  float64
 10  scored_by    14478 non-null  int64  
 11  rank         12904 non-null  float64
 12  popularity   14478 non-null  int64  
 13  members      14478 non-null  int64  
 14  favorites    14478 non-null  int64  
 15  related      14478 non-null  object 
 16  studio       14478 non-null  object 
 17  genre        14414 non-null  object 
 18  isPremiered  14478 non-null  int32  
 19  aire

### Split the Genres to columns
---

In [21]:
## fill the missing values (NaN) with 'NA'
animelist.genre = animelist.genre.fillna("NA")

In [22]:
## split the genres by the parameter ','
genre_animelist = animelist['genre'].str.get_dummies(sep=',')
genre_animelist

Unnamed: 0,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,...,Shoujo,Shounen,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14473,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14474,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14475,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
14476,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
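
The genre strings in this dataset are separated by a comma followed by a space (for example `Comedy, Supernatural, Romance, Shounen`), so splitting on `','` alone leaves a leading space on every genre after the first. The 83 columns reported by `genre_animelist.info()` and the space-prefixed column names are consistent with this: a genre may be counted once as `'Comedy'` and once as `' Comedy'`. A minimal sketch on a hypothetical toy series compares the two separators:

```python
import pandas as pd

# Toy genre strings (hypothetical), using the same ", " separator as the dataset
genres = pd.Series(["Comedy, Romance", "Romance", "Action, Comedy"])

# Splitting on "," keeps the leading space, so "Comedy" and " Comedy" become separate columns
by_comma = genres.str.get_dummies(sep=",")

# Splitting on ", " yields one column per genre
by_comma_space = genres.str.get_dummies(sep=", ")

print(list(by_comma.columns))
print(list(by_comma_space.columns))
```

If the separator is not consistent across rows, stripping whitespace from the strings before splitting is another option.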


In [23]:
genre_animelist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14478 entries, 0 to 14477
Data columns (total 83 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0    Adventure      14478 non-null  int64
 1    Cars           14478 non-null  int64
 2    Comedy         14478 non-null  int64
 3    Dementia       14478 non-null  int64
 4    Demons         14478 non-null  int64
 5    Drama          14478 non-null  int64
 6    Ecchi          14478 non-null  int64
 7    Fantasy        14478 non-null  int64
 8    Game           14478 non-null  int64
 9    Harem          14478 non-null  int64
 10   Hentai         14478 non-null  int64
 11   Historical     14478 non-null  int64
 12   Horror         14478 non-null  int64
 13   Josei          14478 non-null  int64
 14   Kids           14478 non-null  int64
 15   Magic          14478 non-null  int64
 16   Martial Arts   14478 non-null  int64
 17   Mecha          14478 non-null  int64
 18   Military       14478 non-

In [24]:
## combining the animelist data and genre data into animelist_df
animelist_df = pd.concat([animelist, genre_animelist], axis=1)
animelist_df.head()

Unnamed: 0,title,type,source,episodes,status,airing,aired,duration,rating,score,...,Shoujo,Shounen,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi
0,Inu x Boku SS,TV,Manga,12,Finished Airing,False,"{'from': '2012-01-13', 'to': '2012-03-30'}",24 min. per ep.,PG-13 - Teens 13 or older,7.63,...,0,0,0,0,0,0,0,0,0,0
1,Seto no Hanayome,TV,Manga,26,Finished Airing,False,"{'from': '2007-04-02', 'to': '2007-10-01'}",24 min. per ep.,PG-13 - Teens 13 or older,7.89,...,0,0,0,0,0,0,0,0,0,0
2,Shugo Chara!! Doki,TV,Manga,51,Finished Airing,False,"{'from': '2008-10-04', 'to': '2009-09-25'}",24 min. per ep.,PG - Children,7.55,...,0,0,0,0,0,0,0,0,0,0
3,Princess Tutu,TV,Original,38,Finished Airing,False,"{'from': '2002-08-16', 'to': '2003-05-23'}",16 min. per ep.,PG-13 - Teens 13 or older,8.21,...,0,0,0,0,0,0,0,0,0,0
4,Bakuman. 3rd Season,TV,Manga,25,Finished Airing,False,"{'from': '2012-10-06', 'to': '2013-03-30'}",24 min. per ep.,PG-13 - Teens 13 or older,8.67,...,0,0,0,0,0,0,0,0,0,0


In [25]:
## remove genre columns 
animelist_df.drop(columns=["genre"], inplace=True)
animelist_df

Unnamed: 0,title,type,source,episodes,status,airing,aired,duration,rating,score,...,Shoujo,Shounen,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi
0,Inu x Boku SS,TV,Manga,12,Finished Airing,False,"{'from': '2012-01-13', 'to': '2012-03-30'}",24 min. per ep.,PG-13 - Teens 13 or older,7.63,...,0,0,0,0,0,0,0,0,0,0
1,Seto no Hanayome,TV,Manga,26,Finished Airing,False,"{'from': '2007-04-02', 'to': '2007-10-01'}",24 min. per ep.,PG-13 - Teens 13 or older,7.89,...,0,0,0,0,0,0,0,0,0,0
2,Shugo Chara!! Doki,TV,Manga,51,Finished Airing,False,"{'from': '2008-10-04', 'to': '2009-09-25'}",24 min. per ep.,PG - Children,7.55,...,0,0,0,0,0,0,0,0,0,0
3,Princess Tutu,TV,Original,38,Finished Airing,False,"{'from': '2002-08-16', 'to': '2003-05-23'}",16 min. per ep.,PG-13 - Teens 13 or older,8.21,...,0,0,0,0,0,0,0,0,0,0
4,Bakuman. 3rd Season,TV,Manga,25,Finished Airing,False,"{'from': '2012-10-06', 'to': '2013-03-30'}",24 min. per ep.,PG-13 - Teens 13 or older,8.67,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14473,Gutchonpa Omoshiro Hanashi,TV,Unknown,5,Finished Airing,False,"{'from': '1987-11-05', 'to': '1988-11-04'}",8 min. per ep.,G - All Ages,5.50,...,0,0,0,0,0,0,0,0,0,0
14474,Geba Geba Shou Time!,OVA,Unknown,1,Finished Airing,False,"{'from': '1986-03-21', 'to': '1986-03-21'}",25 min.,G - All Ages,4.60,...,0,0,0,0,0,0,0,0,0,0
14475,Godzilla: Hoshi wo Kuu Mono,Movie,Other,1,Not yet aired,False,"{'from': None, 'to': None}",Unknown,R - 17+ (violence & profanity),0.00,...,0,0,0,0,0,0,0,0,0,0
14476,Nippon Mukashibanashi: Sannen Netarou,OVA,Other,1,Finished Airing,False,"{'from': None, 'to': None}",40 min.,G - All Ages,6.00,...,0,0,0,0,0,0,0,0,0,0


In [26]:
animelist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14478 entries, 0 to 14477
Columns: 103 entries, title to Yaoi
dtypes: bool(1), float64(2), int32(1), int64(88), object(11)
memory usage: 11.2+ MB


### Check whether any column contains NULL values
---

In [27]:
# count the null values in each column
for col in animelist_df:
    print(f" {col}         | has ({animelist_df[col].isnull().sum()})")


## from the output below we can see:
## rating has 544 NaN
## rank has 1574 NaN
## aired_from has 2191 NaN
## aired_to has 2191 NaN

 title         | has (0)
 type         | has (0)
 source         | has (0)
 episodes         | has (0)
 status         | has (0)
 airing         | has (0)
 aired         | has (0)
 duration         | has (0)
 rating         | has (544)
 score         | has (0)
 scored_by         | has (0)
 rank         | has (1574)
 popularity         | has (0)
 members         | has (0)
 favorites         | has (0)
 related         | has (0)
 studio         | has (0)
 isPremiered         | has (0)
 aired_from         | has (2191)
 aired_to         | has (2191)
  Adventure         | has (0)
  Cars         | has (0)
  Comedy         | has (0)
  Dementia         | has (0)
  Demons         | has (0)
  Drama         | has (0)
  Ecchi         | has (0)
  Fantasy         | has (0)
  Game         | has (0)
  Harem         | has (0)
  Hentai         | has (0)
  Historical         | has (0)
  Horror         | has (0)
  Josei         | has (0)
  Kids         | has (0)
  Magic         | has (0)
  Martial Arts    
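
With over a hundred columns, the per-column printout is long. A shorter view, sketched here on a hypothetical toy frame, is to count nulls for all columns at once with `isnull().sum()` and keep only the columns that actually have missing values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    "rating": ["G - All Ages", None, "PG - Children"],
    "rank": [1.0, np.nan, np.nan],
    "score": [7.6, 6.1, 5.5],
})

# Null count per column, as a Series indexed by column name
null_counts = toy.isnull().sum()

# Show only the columns that have at least one null, largest first
print(null_counts[null_counts > 0].sort_values(ascending=False))
```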

In [28]:
## count the total for each rating
animelist_df.rating.value_counts()

rating
PG-13 - Teens 13 or older         5020
G - All Ages                      4541
PG - Children                     1279
Rx - Hentai                       1219
R - 17+ (violence & profanity)     997
R+ - Mild Nudity                   878
Name: count, dtype: int64

In [29]:
## fill missing ratings with the default "G - All Ages"
animelist_df['rating'] = animelist_df['rating'].fillna("G - All Ages")

## fill missing ranks with the maximum (worst) rank (prevent skewness)
animelist_df['rank'] = animelist_df['rank'].fillna(animelist_df['rank'].max())

## replace NaN with the placeholder "Not aired" for aired dates
animelist_df['aired_from'] = animelist_df['aired_from'].fillna("Not aired")
animelist_df['aired_to'] = animelist_df['aired_to'].fillna("Not aired")

## TODO: find out whether the aired dates and the premiered season are related
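
Several columns can also be filled in a single call by passing a dictionary to `DataFrame.fillna` and assigning the result back. Assigning back, rather than calling `fillna(..., inplace=True)` on a selected column, avoids the chained-assignment warning that recent pandas versions emit. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing rating and one missing rank
toy = pd.DataFrame({
    "rating": ["PG - Children", None],
    "rank": [10.0, np.nan],
})

# One fill value per column; the keys are column names
fill_values = {"rating": "G - All Ages", "rank": toy["rank"].max()}

# fillna returns a new frame, so assign it back
toy = toy.fillna(fill_values)

print(toy)
```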

In [30]:
# double-check that there are no null values left
for col in animelist_df:
    print(f" {col}         | has ({animelist_df[col].isnull().sum()})")
    

 title         | has (0)
 type         | has (0)
 source         | has (0)
 episodes         | has (0)
 status         | has (0)
 airing         | has (0)
 aired         | has (0)
 duration         | has (0)
 rating         | has (0)
 score         | has (0)
 scored_by         | has (0)
 rank         | has (0)
 popularity         | has (0)
 members         | has (0)
 favorites         | has (0)
 related         | has (0)
 studio         | has (0)
 isPremiered         | has (0)
 aired_from         | has (0)
 aired_to         | has (0)
  Adventure         | has (0)
  Cars         | has (0)
  Comedy         | has (0)
  Dementia         | has (0)
  Demons         | has (0)
  Drama         | has (0)
  Ecchi         | has (0)
  Fantasy         | has (0)
  Game         | has (0)
  Harem         | has (0)
  Hentai         | has (0)
  Historical         | has (0)
  Horror         | has (0)
  Josei         | has (0)
  Kids         | has (0)
  Magic         | has (0)
  Martial Arts         | has 

### Export to a new CSV file
---


In [31]:
animelist_df.to_csv('outV2.csv', index=False) 
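
A quick way to confirm the export is to read the file back and compare its shape with the original. The sketch below uses a hypothetical toy frame and a temporary file rather than `outV2.csv`. Writing with `index=False` matters here: otherwise the index is saved as an extra column, which shows up as `Unnamed: 0` when the file is reloaded.

```python
import os
import tempfile
import pandas as pd

# Toy frame (hypothetical values) standing in for the cleaned dataset
toy = pd.DataFrame({"title": ["Princess Tutu", "Bakuman."], "score": [8.21, 8.67]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out_example.csv")  # hypothetical file name
    toy.to_csv(path, index=False)   # do not write the index as a column
    reloaded = pd.read_csv(path)

# Same number of rows and columns, and no extra index column
print(reloaded.shape == toy.shape)
print(list(reloaded.columns))
```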

### Other data that still needs cleaning (if any)
---