# Data Preparation

Dataset from Kaggle : **"MyAnimeList"** by *Azathoth*  
Source: https://www.kaggle.com/datasets/azathoth42/myanimelist/data (requires login)

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

---

### Import the Dataset (UserList)

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [2]:
userlist = pd.read_csv('DataSets/Raw Data/UserList.csv')
userlist.head()

Unnamed: 0,username,user_id,user_watching,user_completed,user_onhold,user_dropped,user_plantowatch,user_days_spent_watching,gender,location,birth_date,join_date,last_online,stats_mean_score,stats_rewatched,stats_episodes
0,karthiga,2255153,3,49,1,0,0,55.31,Female,"Chennai, India",29/4/1990,3/3/2013,4/2/2014 1:32,7.43,0.0,3391.0
1,RedvelvetDaisuki,1897606,61,396,39,0,206,118.07,Female,Manila,1/1/1995,13/12/2012,13/5/1900 2:47,6.78,80.0,7094.0
2,Damonashu,37326,45,195,27,25,59,83.7,Male,"Detroit,Michigan",1/8/1991,13/2/2008,24/3/1900 12:48,6.15,6.0,4936.0
3,bskai,228342,25,414,2,5,11,167.16,Male,"Nayarit, Mexico",14/12/1990,31/8/2009,12/5/2014 16:35,8.27,1.0,10081.0
4,shuzzable,2347781,36,72,16,2,25,35.48,,,,25/3/2013,9/9/2015 21:54,9.06,7.0,2154.0


Description of the dataset, as available on Kaggle, is as follows.


> **username**: the user's name  
> **user_id**: unique ID of each user  
> **user_watching**: number of anime the user is currently watching  
> **user_completed**: number of anime the user has completed  
> **user_onhold**: number of anime the user has put on hold partway through  
> **user_dropped**: number of anime the user has dropped from their list  
> **user_plantowatch**: number of anime the user has added to their plan-to-watch list  
> **user_days_spent_watching**: how much time the user has spent watching anime  
> **gender**: the user's gender  
> **location**: where the user is from  
> **birth_date**: the user's date of birth  
> **access_rank**: undocumented; this column is not present in the imported file  
> **join_date**: when the user joined the community  
> **last_online**: when the user was last seen online  
> **stats_mean_score**: average score the user has given to anime  
> **stats_rewatched**: number of episodes the user has rewatched  
> **stats_episodes**: number of episodes the user has completed  
---

Check the vital statistics of the dataset using the `type` function and the `shape` attribute.

In [3]:
print("Data type : ", type(userlist))
print("Data dims : ", userlist.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (249999, 16)


Check the variables (and their types) in the dataset using the `info` method.

In [4]:
userlist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249999 entries, 0 to 249998
Data columns (total 16 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   username                  249998 non-null  object 
 1   user_id                   249999 non-null  int64  
 2   user_watching             249999 non-null  int64  
 3   user_completed            249999 non-null  int64  
 4   user_onhold               249999 non-null  int64  
 5   user_dropped              249999 non-null  int64  
 6   user_plantowatch          249999 non-null  int64  
 7   user_days_spent_watching  249999 non-null  float64
 8   gender                    183477 non-null  object 
 9   location                  133701 non-null  object 
 10  birth_date                142450 non-null  object 
 11  join_date                 249901 non-null  object 
 12  last_online               249901 non-null  object 
 13  stats_mean_score          249901 non-null  f
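
The non-null counts above show that `gender`, `location`, and `birth_date` are only partly filled. A minimal sketch of how the share of missing values per column could be computed; the toy frame and its values are invented for illustration and merely stand in for `userlist`.

```python
import pandas as pd

# Toy frame standing in for `userlist` (made-up values, not taken from the dataset).
toy_users = pd.DataFrame({
    "username": ["a", "b", "c", "d"],
    "gender": ["Female", None, "Male", None],
    "location": [None, None, "Manila", None],
})

# isna() flags missing cells; the mean of those booleans is the missing fraction per column.
missing_share = toy_users.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, replacing `toy_users` with `userlist` should give the same kind of ranking, with the sparsest columns first.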

---

### Import the Dataset (AnimeList)

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [5]:
animelist = pd.read_csv('DataSets/Raw Data/AnimeList.csv')
animelist.head()


Unnamed: 0,anime_id,title,title_english,title_japanese,title_synonyms,image_url,type,source,episodes,status,...,background,premiered,broadcast,related,producer,licensor,studio,genre,opening_theme,ending_theme
0,11013,Inu x Boku SS,Inu X Boku Secret Service,妖狐×僕SS,Youko x Boku SS,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,12,Finished Airing,...,Inu x Boku SS was licensed by Sentai Filmworks...,Winter 2012,Fridays at Unknown,"{'Adaptation': [{'mal_id': 17207, 'type': 'man...","Aniplex, Square Enix, Mainichi Broadcasting Sy...",Sentai Filmworks,David Production,"Comedy, Supernatural, Romance, Shounen","['""Nirvana"" by MUCC']","['#1: ""Nirvana"" by MUCC (eps 1, 11-12)', '#2: ..."
1,2104,Seto no Hanayome,My Bride is a Mermaid,瀬戸の花嫁,The Inland Sea Bride,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,26,Finished Airing,...,,Spring 2007,Unknown,"{'Adaptation': [{'mal_id': 759, 'type': 'manga...","TV Tokyo, AIC, Square Enix, Sotsu",Funimation,Gonzo,"Comedy, Parody, Romance, School, Shounen","['""Romantic summer"" by SUN&LUNAR']","['#1: ""Ashita e no Hikari (明日への光)"" by Asuka Hi..."
2,5262,Shugo Chara!! Doki,Shugo Chara!! Doki,しゅごキャラ！！どきっ,"Shugo Chara Ninenme, Shugo Chara! Second Year",https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,51,Finished Airing,...,,Fall 2008,Unknown,"{'Adaptation': [{'mal_id': 101, 'type': 'manga...","TV Tokyo, Sotsu",,Satelight,"Comedy, Magic, School, Shoujo","['#1: ""Minna no Tamago (みんなのたまご)"" by Shugo Cha...","['#1: ""Rottara Rottara (ロッタラ ロッタラ)"" by Buono! ..."
3,721,Princess Tutu,Princess Tutu,プリンセスチュチュ,,https://myanimelist.cdn-dena.com/images/anime/...,TV,Original,38,Finished Airing,...,Princess Tutu aired in two parts. The first pa...,Summer 2002,Fridays at Unknown,"{'Adaptation': [{'mal_id': 1581, 'type': 'mang...","Memory-Tech, GANSIS, Marvelous AQL",ADV Films,Hal Film Maker,"Comedy, Drama, Magic, Romance, Fantasy","['""Morning Grace"" by Ritsuko Okazaki']","['""Watashi No Ai Wa Chiisaikeredo"" by Ritsuko ..."
4,12365,Bakuman. 3rd Season,Bakuman.,バクマン。,Bakuman Season 3,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,25,Finished Airing,...,,Fall 2012,Unknown,"{'Adaptation': [{'mal_id': 9711, 'type': 'mang...","NHK, Shueisha",,J.C.Staff,"Comedy, Drama, Romance, Shounen","['#1: ""Moshimo no Hanashi (もしもの話)"" by nano.RIP...","['#1: ""Pride on Everyday"" by Sphere (eps 1-13)..."


Description of the dataset, as available on Kaggle, is as follows.


> **anime_id**         : ID of each anime  
> **title**            : Anime title  
> **title_english**    : Anime title in English  
> **title_japanese**   : Anime title in Japanese  
> **image_url**        : URL of the poster image  
> **type**             : Anime type (TV, Movie, etc.)  
> **source**           : Source material (Manga, Original, etc.)  
> **episodes**         : Number of episodes  
> **status**           : Current status (e.g. Finished Airing, Not yet aired)  
> **airing**           : Whether the anime is currently airing  
> **aired_string**     : Start and end dates as text  
> **aired**            : Start and end dates as a dictionary string (`{'from': ..., 'to': ...}`)  
> **duration**         : Length of an episode or movie  
> **rating**           : Age rating (e.g. G, PG-13, R+)  
> **score**            : Overall score of the anime (out of 10)  
> **scored_by**        : Number of users who scored the anime  
> **rank**             : Rank based on the score of the anime  
> **popularity**       : Rank based on how many people watch the anime  
> **members**          : Number of people who watch the anime  
> **favorites**        : Number of people who marked the anime as a favorite  
> **background**       : Background information about the anime  
> **premiered**        : Season in which the anime premiered  
> **broadcast**        : Day on which it is broadcast  
> **related**          : Related works, such as sequels or prequels  
> **producer**         : Companies that produced the anime  
> **licensor**         : Companies that licensed the anime  
> **studio**           : Studio that animated the anime  
> **genre**            : Genres of the anime  
> **opening_theme**    : Opening songs  
> **ending_theme**     : Ending songs  
---

Check the vital statistics of the dataset using the `type` function and the `shape` attribute.

In [6]:
print("Data type : ", type(animelist))
print("Data dims : ", animelist.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (14478, 31)


Check the variables (and their types) in the dataset using the `info` method.

In [7]:
animelist.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14478 entries, 0 to 14477
Data columns (total 31 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   anime_id        14478 non-null  int64  
 1   title           14478 non-null  object 
 2   title_english   5724 non-null   object 
 3   title_japanese  14443 non-null  object 
 4   title_synonyms  8937 non-null   object 
 5   image_url       14382 non-null  object 
 6   type            14478 non-null  object 
 7   source          14478 non-null  object 
 8   episodes        14478 non-null  int64  
 9   status          14478 non-null  object 
 10  airing          14478 non-null  bool   
 11  aired_string    14478 non-null  object 
 12  aired           14478 non-null  object 
 13  duration        14478 non-null  object 
 14  rating          13934 non-null  object 
 15  score           14478 non-null  float64
 16  scored_by       14478 non-null  int64  
 17  rank            12904 non-null 

## Clean Data (AnimeList)

Problem statement template: how might we (action) for (target audience) in order to (outcome, i.e. the result we would like to see)?

For example:

> How might we recommend the top 20 anime shows for anime beginners?  
> How might we recommend the top 10 anime shows of the winter season for anime users?  

### Gathering information 
---

> Describe the numeric columns  
> Describe the object columns  
> Display the column names  

In [8]:
## for numeric data
animelist.describe()

Unnamed: 0,anime_id,episodes,score,scored_by,rank,popularity,members,favorites
count,14478.0,14478.0,14478.0,14478.0,12904.0,14478.0,14478.0,14478.0
mean,17377.229866,11.308399,6.142482,11460.03,6439.065406,7220.259566,22966.4,311.649606
std,13165.315011,43.443451,1.463981,43105.19,3720.227608,4170.080564,74981.36,2615.554211
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4389.25,1.0,5.55,46.0,3216.25,3611.5,245.0,0.0
50%,15135.0,1.0,6.37,501.0,6441.5,7225.5,1679.5,2.0
75%,31146.5,12.0,7.06,3941.5,9664.0,10827.75,10379.0,23.0
max,37916.0,1818.0,10.0,1009477.0,12919.0,14487.0,1456378.0,106895.0


In [9]:
## for object (string) data
animelist.describe(include=object)

Unnamed: 0,title,title_english,title_japanese,title_synonyms,image_url,type,source,status,aired_string,aired,...,background,premiered,broadcast,related,producer,licensor,studio,genre,opening_theme,ending_theme
count,14478,5724,14443,8937,14382,14478,14478,14478,14478,14478,...,1057,4096,4271,14478,8288,3373,8544,14414,14478,14478
unique,14477,5606,13701,8575,14382,7,16,3,10026,9649,...,1038,221,441,9420,3221,193,778,4544,4328,5458
top,Hinamatsuri,Cyborg 009,ゲゲゲの鬼太郎,Minna no Uta,https://myanimelist.cdn-dena.com/images/anime/...,TV,Unknown,Finished Airing,Not available,"{'from': None, 'to': None}",...,Includes claymation short which was shown befo...,Spring 2017,Unknown,[],NHK,Funimation,Toei Animation,Hentai,[],[]
freq,2,4,6,189,1,4271,4210,13791,223,1691,...,5,80,2241,4515,427,726,725,868,9784,8807


In [10]:
## list the columns in the dataset
animelist.columns

Index(['anime_id', 'title', 'title_english', 'title_japanese',
       'title_synonyms', 'image_url', 'type', 'source', 'episodes', 'status',
       'airing', 'aired_string', 'aired', 'duration', 'rating', 'score',
       'scored_by', 'rank', 'popularity', 'members', 'favorites', 'background',
       'premiered', 'broadcast', 'related', 'producer', 'licensor', 'studio',
       'genre', 'opening_theme', 'ending_theme'],
      dtype='object')

---
### Premiered 

> Convert null values into a binary indicator (1 or 0)
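
A minimal sketch of such an indicator, assuming that 1 should mean "a premiere season is recorded" and 0 "the value is null". The column name `has_premiered` and the toy values are hypothetical and do not come from the notebook.

```python
import pandas as pd

# Toy column standing in for animelist["premiered"] (illustrative values only).
toy = pd.DataFrame({"premiered": ["Winter 2012", None, "Fall 2008", None]})

# notna() is True where a value exists; astype(int) turns True/False into 1/0.
# `has_premiered` is a hypothetical column name chosen for this sketch.
toy["has_premiered"] = toy["premiered"].notna().astype(int)
print(toy)
```

If the opposite encoding is wanted (1 for null), `isna()` can be used in place of `notna()`.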

### Studio filtering
Group every studio with fewer than 40 titles into a single "SmallStudio" category.

In [11]:
## count the number of titles for each studio

## replace missing studio values with "unknown"
animelist["studio"] = animelist["studio"].fillna("unknown")
studio_counts = animelist.studio.value_counts()
studio_counts

studio
unknown                         5934
Toei Animation                   725
Sunrise                          447
J.C.Staff                        314
Madhouse                         311
                                ... 
Studio Junio, Annapuru             1
Tokyo Media Connections            1
Gainax, Tatsunoko Production       1
Fanworks, Imagineer                1
33 Collective                      1
Name: count, Length: 779, dtype: int64

In [12]:
# collect the studios that have fewer than 40 titles
minor = studio_counts[studio_counts < 40].index.to_list()
minor

['Eiken',
 'Group TAC',
 'TNK',
 'Artland',
 'SynergySP',
 '8bit',
 'Wit Studio',
 'Actas',
 'Manglobe',
 'Haoliners Animation League',
 'Ajia-Do',
 'MAPPA',
 'Studio Comet',
 'White Fox',
 'Mushi Production',
 'Studio Gokumi',
 'Hal Film Maker',
 'Tezuka Productions',
 'A.C.G.T.',
 'Asahi Production',
 'TYO Animations',
 'Gathering',
 'Tokyo Movie Shinsha',
 'Daume',
 'Kinema Citrus',
 'Polygon Pictures',
 'Nomad',
 'AIC A.S.T.A.',
 'T-Rex',
 'LIDENFILMS',
 'Magic Bus',
 'Studio Jam',
 'Bee Train',
 'GoHands',
 'Production IMS',
 'Trigger',
 'David Production',
 'Bandai Namco Pictures',
 'Telecom Animation Film',
 'Seven Arcs',
 'Office Takeout',
 'Asread',
 'Studio Fantasia',
 'Studio PuYUKAI',
 'RG Animation Studios',
 'dwarf',
 'AIC Plus+',
 'Seven Arcs Pictures',
 'Fanworks',
 'APPP',
 'Hoods Entertainment',
 'AT-2',
 'Sparkly Key Animation Studio',
 'Production I.G, Xebec',
 'Millepensee',
 'Y.O.U.C',
 'Shuka',
 'Flavors Soft',
 'Creators in Pack',
 'Animate Film',
 'ILCA',
 'Stu

In [13]:
## combine the minor studios into one "SmallStudio" category
animelist["studio"] = animelist["studio"].apply(lambda x : "SmallStudio" if x in minor else x)
animelist.studio.value_counts()

studio
unknown                 5934
SmallStudio             3028
Toei Animation           725
Sunrise                  447
J.C.Staff                314
Madhouse                 311
Production I.G           251
TMS Entertainment        248
Studio Deen              241
Studio Pierrot           240
Nippon Animation         202
OLM                      181
A-1 Pictures             174
Shin-Ei Animation        151
DLE                      139
Tatsunoko Production     131
Shaft                    111
Gonzo                    109
Xebec                    109
Bones                    109
Kyoto Animation          103
AIC                       98
Brain&#039;s Base         80
Silver Link.              74
Satelight                 71
Arms                      69
Production Reed           64
Doga Kobo                 63
Studio 4°C                59
Gainax                    59
ufotable                  58
Zexcs                     57
Seven                     54
feel.                     53
Kachido
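
The same grouping can also be written without a row-wise `apply`. The following is a sketch on toy data, using `Series.isin` and `Series.where`; the studio list and the threshold of 2 are invented for illustration (the notebook uses 40).

```python
import pandas as pd

# Toy studio column (illustrative values only).
studios = pd.Series(["Toei", "Toei", "Toei", "Sunrise", "Sunrise", "Eiken", "TNK"])

counts = studios.value_counts()
threshold = 2  # toy threshold; the notebook uses 40
minor = counts[counts < threshold].index

# where() keeps values for which the condition is True and replaces the rest.
grouped = studios.where(~studios.isin(minor), "SmallStudio")
print(grouped.value_counts())
```

This is a vectorised alternative with the same intent, not a change to the result shown above.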

### Drop the columns that are not needed
---
TODO: Add back producers

|                 | **Unnecessary columns** |                 |
|:---------------:|:---------------:|:---------------:|
| anime_id        | background      | opening_theme   |
| title_english   | premiered       | ending_theme    |
| title_japanese  | broadcast       | aired_string    |
| title_synonyms  | producer        |                 |
| image_url       | licensor        |                 |

In [14]:
## drop the columns that are not needed
animelist.drop(columns=['anime_id', 'title_english', 'title_japanese', 'title_synonyms', 'image_url', 'background',
       'premiered', 'broadcast', 'producer', 'licensor', 'opening_theme', 'ending_theme', 'aired_string'], inplace=True)

In [15]:
## after dropping the columns 
animelist.shape

(14478, 18)

In [16]:
animelist.columns

Index(['title', 'type', 'source', 'episodes', 'status', 'airing', 'aired',
       'duration', 'rating', 'score', 'scored_by', 'rank', 'popularity',
       'members', 'favorites', 'related', 'studio', 'genre'],
      dtype='object')

### Aired date (from and to) cleaning
--- 
The `aired` column contains a dictionary string of the form `{'from': 'yyyy-mm-dd', 'to': 'yyyy-mm-dd'}`.

Split it into:  
`aired_from` -> yyyy-mm-dd  
`aired_to`   -> yyyy-mm-dd  

Then:  
calculate the number of days over which the anime aired  
calculate how frequently it aired  
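
As an illustration of the string format, here is a sketch that parses one `aired` value with the standard library instead of a regular expression. The helper name `split_aired` is hypothetical. One difference worth noting: a regular expression that requires both dates to be quoted does not match a row whose `to` is `None`, whereas this parser still returns the `from` date.

```python
import ast

def split_aired(aired):
    """Return (from, to) from a string such as "{'from': '2012-01-13', 'to': '2012-03-30'}"."""
    # literal_eval safely evaluates Python literals, including None.
    parsed = ast.literal_eval(aired)
    return parsed.get("from"), parsed.get("to")

print(split_aired("{'from': '2012-01-13', 'to': '2012-03-30'}"))
print(split_aired("{'from': None, 'to': None}"))
print(split_aired("{'from': '2012-01-13', 'to': None}"))
```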

In [17]:
# Splitting the 'aired' column into 'from' and 'to' columns
animelist[['aired_from', 'aired_to']] = animelist['aired'].str.extract(r"'from': '(.*?)', 'to': '(.*?)'")


# Displaying the DataFrame with the new columns
print(animelist[['aired_from', 'aired_to']])

       aired_from    aired_to
0      2012-01-13  2012-03-30
1      2007-04-02  2007-10-01
2      2008-10-04  2009-09-25
3      2002-08-16  2003-05-23
4      2012-10-06  2013-03-30
...           ...         ...
14473  1987-11-05  1988-11-04
14474  1986-03-21  1986-03-21
14475         NaN         NaN
14476         NaN         NaN
14477  2010-04-07  2010-04-07

[14478 rows x 2 columns]


#### Convert 'aired_from' and 'aired_to' to DateTime format
---

In [18]:
## convert to datetime format.
animelist['aired_from'] = pd.to_datetime(animelist['aired_from'])
animelist['aired_to'] = pd.to_datetime(animelist['aired_to'])

#### Using 'aired_from' data to create year and start_season columns
---

In [19]:
## extract the year from aired_from
animelist['year'] = animelist['aired_from'].dt.year

In [23]:
# Define a function to map months to seasons
def get_season(month):
    if 3 <= month <= 5:
        return 'Spring'
    elif 6 <= month <= 8:
        return 'Summer'
    elif 9 <= month <= 11:
        return 'Fall'
    elif month == 12 or 1 <= month <= 2:
        return 'Winter'
    else:
        # Missing months (NaN) fail every comparison above and end up here
        return '0'

# Create a new column for seasons
animelist['start_season'] = animelist['aired_from'].dt.month.apply(get_season)

print(animelist['start_season'])

0        Winter
1        Spring
2          Fall
3        Summer
4          Fall
          ...  
14473      Fall
14474    Spring
14475         0
14476         0
14477    Spring
Name: start_season, Length: 14478, dtype: object
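
The month-to-season rule can also be expressed as a lookup table. This sketch reproduces the mapping used above, with one deliberate difference: it returns `None` rather than the string `'0'` for a missing month. The names `SEASON_BY_MONTH` and `season_of` are hypothetical.

```python
import math

# Same month-to-season mapping as get_season, written as a lookup table.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

def season_of(month):
    # .dt.month yields floats (with NaN) when dates are missing, so handle both cases.
    if month is None or (isinstance(month, float) and math.isnan(month)):
        return None
    return SEASON_BY_MONTH[int(month)]

print(season_of(1.0), season_of(4), season_of(float("nan")))
```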


In [24]:
animelist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14478 entries, 0 to 14477
Data columns (total 22 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   title         14478 non-null  object        
 1   type          14478 non-null  object        
 2   source        14478 non-null  object        
 3   episodes      14478 non-null  int64         
 4   status        14478 non-null  object        
 5   airing        14478 non-null  bool          
 6   aired         14478 non-null  object        
 7   duration      14478 non-null  object        
 8   rating        13934 non-null  object        
 9   score         14478 non-null  float64       
 10  scored_by     14478 non-null  int64         
 11  rank          12904 non-null  float64       
 12  popularity    14478 non-null  int64         
 13  members       14478 non-null  int64         
 14  favorites     14478 non-null  int64         
 15  related       14478 non-null  object

#### Create a days_aired column
---



In [25]:
## number of days from the first to the last airing date, inclusive
animelist['days_aired'] = (animelist['aired_to'] - animelist['aired_from']).dt.days + 1
print(animelist['days_aired'])

0         78.0
1        183.0
2        357.0
3        281.0
4        176.0
         ...  
14473    366.0
14474      1.0
14475      NaN
14476      NaN
14477      1.0
Name: days_aired, Length: 14478, dtype: float64
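
Because the span is a plain date difference, a record whose end date precedes its start date produces a non-positive value; the summary statistics later in this notebook report a minimum of -4819 for `days_aired`. A sketch of how such rows could be flagged, on toy dates in which the third row is deliberately inconsistent:

```python
import pandas as pd

# Toy dates; the third row has an end date before its start date on purpose.
toy = pd.DataFrame({
    "aired_from": pd.to_datetime(["2012-01-13", "1986-03-21", "2015-06-01"]),
    "aired_to": pd.to_datetime(["2012-03-30", "1986-03-21", "2002-03-23"]),
})

# Inclusive span in days, as in the notebook.
toy["days_aired"] = (toy["aired_to"] - toy["aired_from"]).dt.days + 1

# Rows with a non-positive span are candidates for manual inspection.
suspicious = toy[toy["days_aired"] <= 0]
print(suspicious)
```

Whether such rows should be corrected, swapped, or dropped is a judgment call that this sketch does not make.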


#### Convert the durations to minutes
---

In [26]:
import re

def convert_to_minutes(duration):
    # Extract the leading number and its unit of time.
    # Only the first number-unit pair is read, so any later component
    # in the same string is ignored.
    pattern = r'(\d+)\s*(hr|min|sec)'
    match = re.match(pattern, duration)

    if match is None:
        return None  # Return None for entries with unknown format

    value = int(match.group(1))  # Extract numerical value
    unit = match.group(2)        # Extract unit of time

    if unit == 'hr':
        return value * 60
    if unit == 'sec':
        return value / 60
    return value  # unit == 'min'

# Apply the extraction function to the duration column
animelist['duration_minutes'] = animelist['duration'].apply(convert_to_minutes)
print(animelist['duration_minutes'])

0        24.0
1        24.0
2        24.0
3        16.0
4        24.0
         ... 
14473     8.0
14474    25.0
14475     NaN
14476    40.0
14477     3.0
Name: duration_minutes, Length: 14478, dtype: float64
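
`convert_to_minutes` reads only the first number-unit pair of each string. If the raw column contains combined values such as `'1 hr. 30 min.'` (an assumption about the data format, not verified here), a variant that sums every component could look like the sketch below. The names `UNIT_TO_SECONDS` and `total_minutes` are hypothetical.

```python
import re

# Seconds per unit; totals are converted to minutes at the end.
UNIT_TO_SECONDS = {"hr": 3600, "min": 60, "sec": 1}

def total_minutes(duration):
    # findall returns every (number, unit) pair in the string, not just the first.
    parts = re.findall(r"(\d+)\s*(hr|min|sec)", duration)
    if not parts:
        return None  # unknown format
    seconds = sum(int(value) * UNIT_TO_SECONDS[unit] for value, unit in parts)
    return seconds / 60

print(total_minutes("1 hr. 30 min."))
print(total_minutes("24 min. per ep."))
print(total_minutes("Unknown"))
```

Applying this instead of the original function would change the summary statistics shown later, so it is offered only as an alternative to compare against.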


In [27]:
## drop the original duration column
animelist.drop(columns=['duration'], inplace=True)

### Split the Genres to columns
---

In [28]:
## fill missing values (NaN) with 'NA'
animelist['genre'] = animelist['genre'].fillna("NA")

# one-hot encode the genres, splitting on the ', ' separator
genre_animelist = animelist['genre'].str.get_dummies(sep=', ').add_prefix("Genre_")
genre_animelist

Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Cars,Genre_Comedy,Genre_Dementia,Genre_Demons,Genre_Drama,Genre_Ecchi,Genre_Fantasy,Genre_Game,...,Genre_Shounen Ai,Genre_Slice of Life,Genre_Space,Genre_Sports,Genre_Super Power,Genre_Supernatural,Genre_Thriller,Genre_Vampire,Genre_Yaoi,Genre_Yuri
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14473,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14474,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14475,1,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
14476,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
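
To make the encoding step concrete, here is a small sketch of `str.get_dummies` on three toy genre strings (illustrative values only). Note that the `"NA"` placeholder becomes a column of its own, `Genre_NA`.

```python
import pandas as pd

# Toy genre strings; "NA" mimics the placeholder used for missing genres.
genres = pd.Series(["Comedy, Romance", "Action", "NA"])

# Each distinct token becomes a 0/1 column; add_prefix labels the new columns.
dummies = genres.str.get_dummies(sep=", ").add_prefix("Genre_")
print(dummies)
```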


In [29]:
genre_animelist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14478 entries, 0 to 14477
Data columns (total 44 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Genre_Action         14478 non-null  int64
 1   Genre_Adventure      14478 non-null  int64
 2   Genre_Cars           14478 non-null  int64
 3   Genre_Comedy         14478 non-null  int64
 4   Genre_Dementia       14478 non-null  int64
 5   Genre_Demons         14478 non-null  int64
 6   Genre_Drama          14478 non-null  int64
 7   Genre_Ecchi          14478 non-null  int64
 8   Genre_Fantasy        14478 non-null  int64
 9   Genre_Game           14478 non-null  int64
 10  Genre_Harem          14478 non-null  int64
 11  Genre_Hentai         14478 non-null  int64
 12  Genre_Historical     14478 non-null  int64
 13  Genre_Horror         14478 non-null  int64
 14  Genre_Josei          14478 non-null  int64
 15  Genre_Kids           14478 non-null  int64
 16  Genre_Magic          1

In [30]:
## combining the animelist data and genre data into animelist_df
animelist_df = pd.concat([animelist, genre_animelist], axis=1)
animelist_df.head()

Unnamed: 0,title,type,source,episodes,status,airing,aired,rating,score,scored_by,...,Genre_Shounen Ai,Genre_Slice of Life,Genre_Space,Genre_Sports,Genre_Super Power,Genre_Supernatural,Genre_Thriller,Genre_Vampire,Genre_Yaoi,Genre_Yuri
0,Inu x Boku SS,TV,Manga,12,Finished Airing,False,"{'from': '2012-01-13', 'to': '2012-03-30'}",PG-13 - Teens 13 or older,7.63,139250,...,0,0,0,0,0,1,0,0,0,0
1,Seto no Hanayome,TV,Manga,26,Finished Airing,False,"{'from': '2007-04-02', 'to': '2007-10-01'}",PG-13 - Teens 13 or older,7.89,91206,...,0,0,0,0,0,0,0,0,0,0
2,Shugo Chara!! Doki,TV,Manga,51,Finished Airing,False,"{'from': '2008-10-04', 'to': '2009-09-25'}",PG - Children,7.55,37129,...,0,0,0,0,0,0,0,0,0,0
3,Princess Tutu,TV,Original,38,Finished Airing,False,"{'from': '2002-08-16', 'to': '2003-05-23'}",PG-13 - Teens 13 or older,8.21,36501,...,0,0,0,0,0,0,0,0,0,0
4,Bakuman. 3rd Season,TV,Manga,25,Finished Airing,False,"{'from': '2012-10-06', 'to': '2013-03-30'}",PG-13 - Teens 13 or older,8.67,107767,...,0,0,0,0,0,0,0,0,0,0


In [31]:
## remove genre columns 
animelist_df.drop(columns=["genre"], inplace=True)
animelist_df

Unnamed: 0,title,type,source,episodes,status,airing,aired,rating,score,scored_by,...,Genre_Shounen Ai,Genre_Slice of Life,Genre_Space,Genre_Sports,Genre_Super Power,Genre_Supernatural,Genre_Thriller,Genre_Vampire,Genre_Yaoi,Genre_Yuri
0,Inu x Boku SS,TV,Manga,12,Finished Airing,False,"{'from': '2012-01-13', 'to': '2012-03-30'}",PG-13 - Teens 13 or older,7.63,139250,...,0,0,0,0,0,1,0,0,0,0
1,Seto no Hanayome,TV,Manga,26,Finished Airing,False,"{'from': '2007-04-02', 'to': '2007-10-01'}",PG-13 - Teens 13 or older,7.89,91206,...,0,0,0,0,0,0,0,0,0,0
2,Shugo Chara!! Doki,TV,Manga,51,Finished Airing,False,"{'from': '2008-10-04', 'to': '2009-09-25'}",PG - Children,7.55,37129,...,0,0,0,0,0,0,0,0,0,0
3,Princess Tutu,TV,Original,38,Finished Airing,False,"{'from': '2002-08-16', 'to': '2003-05-23'}",PG-13 - Teens 13 or older,8.21,36501,...,0,0,0,0,0,0,0,0,0,0
4,Bakuman. 3rd Season,TV,Manga,25,Finished Airing,False,"{'from': '2012-10-06', 'to': '2013-03-30'}",PG-13 - Teens 13 or older,8.67,107767,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14473,Gutchonpa Omoshiro Hanashi,TV,Unknown,5,Finished Airing,False,"{'from': '1987-11-05', 'to': '1988-11-04'}",G - All Ages,5.50,6,...,0,0,0,0,0,0,0,0,0,0
14474,Geba Geba Shou Time!,OVA,Unknown,1,Finished Airing,False,"{'from': '1986-03-21', 'to': '1986-03-21'}",G - All Ages,4.60,5,...,0,0,0,0,0,0,0,0,0,0
14475,Godzilla: Hoshi wo Kuu Mono,Movie,Other,1,Not yet aired,False,"{'from': None, 'to': None}",R - 17+ (violence & profanity),0.00,0,...,0,0,0,0,0,0,0,0,0,0
14476,Nippon Mukashibanashi: Sannen Netarou,OVA,Other,1,Finished Airing,False,"{'from': None, 'to': None}",G - All Ages,6.00,1,...,0,0,0,0,0,0,0,0,0,0


In [32]:
animelist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14478 entries, 0 to 14477
Data columns (total 66 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   title                14478 non-null  object        
 1   type                 14478 non-null  object        
 2   source               14478 non-null  object        
 3   episodes             14478 non-null  int64         
 4   status               14478 non-null  object        
 5   airing               14478 non-null  bool          
 6   aired                14478 non-null  object        
 7   rating               13934 non-null  object        
 8   score                14478 non-null  float64       
 9   scored_by            14478 non-null  int64         
 10  rank                 12904 non-null  float64       
 11  popularity           14478 non-null  int64         
 12  members              14478 non-null  int64         
 13  favorites            14478 non-

### Check whether any column contains NULL values
---

In [33]:
# check every column for null values
for col in animelist_df:
    print(f" {col}         | has ({animelist_df[col].isnull().sum()})")


## from the output below we can see:
## rating has 544 NaN
## rank has 1574 NaN
## aired_from has 2191 NaN
## aired_to has 2191 NaN
## year has 2191 NaN
## days_aired has 2191 NaN
## duration_minutes has 405 NaN

 title         | has (0)
 type         | has (0)
 source         | has (0)
 episodes         | has (0)
 status         | has (0)
 airing         | has (0)
 aired         | has (0)
 rating         | has (544)
 score         | has (0)
 scored_by         | has (0)
 rank         | has (1574)
 popularity         | has (0)
 members         | has (0)
 favorites         | has (0)
 related         | has (0)
 studio         | has (0)
 aired_from         | has (2191)
 aired_to         | has (2191)
 year         | has (2191)
 start_season         | has (0)
 days_aired         | has (2191)
 duration_minutes         | has (405)
 Genre_Action         | has (0)
 Genre_Adventure         | has (0)
 Genre_Cars         | has (0)
 Genre_Comedy         | has (0)
 Genre_Dementia         | has (0)
 Genre_Demons         | has (0)
 Genre_Drama         | has (0)
 Genre_Ecchi         | has (0)
 Genre_Fantasy         | has (0)
 Genre_Game         | has (0)
 Genre_Harem         | has (0)
 Genre_Hentai         | has
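
With this many columns, it can be easier to list only the columns that actually contain nulls. A sketch on a toy frame (made-up values standing in for `animelist_df`):

```python
import pandas as pd

# Toy frame with a few missing values (illustrative only).
toy = pd.DataFrame({
    "rating": ["G", None, "PG"],
    "rank": [1.0, None, None],
    "score": [7.1, 6.0, 5.5],
})

# Count nulls per column, then keep only the columns with at least one null.
null_counts = toy.isna().sum()
print(null_counts[null_counts > 0])
```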

In [34]:
## count the total for each rating
animelist_df.rating.value_counts()

rating
PG-13 - Teens 13 or older         5020
G - All Ages                      4541
PG - Children                     1279
Rx - Hentai                       1219
R - 17+ (violence & profanity)     997
R+ - Mild Nudity                   878
Name: count, dtype: int64

In [35]:
## fill missing ratings with "G - All Ages"
animelist_df['rating'] = animelist_df['rating'].fillna("G - All Ages")

## fill missing ranks with the maximum rank (to limit skew)
animelist_df['rank'] = animelist_df['rank'].fillna(animelist_df['rank'].max())

## replace missing airing information with placeholder values
animelist_df['aired_from'] = animelist_df['aired_from'].fillna("Not aired")
animelist_df['aired_to'] = animelist_df['aired_to'].fillna("Not aired")
animelist_df['days_aired'] = animelist_df['days_aired'].fillna(0)
animelist_df['year'] = animelist_df['year'].fillna(0)

animelist_df['duration_minutes'] = animelist_df['duration_minutes'].fillna(0)
## TODO: find out whether the aired dates and premiered are related

In [36]:
animelist_df['year'] = animelist_df['year'].astype(int)
animelist_df['days_aired'] = animelist_df['days_aired'].astype(int)
animelist_df.dtypes

title                 object
type                  object
source                object
episodes               int64
status                object
                       ...  
Genre_Supernatural     int64
Genre_Thriller         int64
Genre_Vampire          int64
Genre_Yaoi             int64
Genre_Yuri             int64
Length: 66, dtype: object
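
Filling `year` with 0 keeps the column integer-typed, but the placeholder then takes part in every later aggregate (the summary table further down reports a minimum year of 0). As a hedged alternative, pandas' nullable `Int64` dtype keeps integers and missing values side by side. The toy values below are invented for illustration.

```python
import pandas as pd

# Toy year column with one missing entry (illustrative values only).
years = pd.Series([2012.0, None, 2007.0])

filled = years.fillna(0).astype(int)   # approach used in the notebook
nullable = years.astype("Int64")       # alternative: keep the gap as <NA>

print(filled.mean())    # the 0 placeholder pulls the mean down
print(nullable.mean())  # missing entries are skipped
```

Which behaviour is preferable depends on how `year` is used later; the notebook's approach is left unchanged.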

In [37]:
# confirm that no null values remain
for col in animelist_df:
    print(f" {col}         | has ({animelist_df[col].isnull().sum()})")


 title         | has (0)
 type         | has (0)
 source         | has (0)
 episodes         | has (0)
 status         | has (0)
 airing         | has (0)
 aired         | has (0)
 rating         | has (0)
 score         | has (0)
 scored_by         | has (0)
 rank         | has (0)
 popularity         | has (0)
 members         | has (0)
 favorites         | has (0)
 related         | has (0)
 studio         | has (0)
 aired_from         | has (0)
 aired_to         | has (0)
 year         | has (0)
 start_season         | has (0)
 days_aired         | has (0)
 duration_minutes         | has (0)
 Genre_Action         | has (0)
 Genre_Adventure         | has (0)
 Genre_Cars         | has (0)
 Genre_Comedy         | has (0)
 Genre_Dementia         | has (0)
 Genre_Demons         | has (0)
 Genre_Drama         | has (0)
 Genre_Ecchi         | has (0)
 Genre_Fantasy         | has (0)
 Genre_Game         | has (0)
 Genre_Harem         | has (0)
 Genre_Hentai         | has (0)
 Genre_Histori

In [38]:
animelist_df.describe()

Unnamed: 0,episodes,score,scored_by,rank,popularity,members,favorites,year,days_aired,duration_minutes,...,Genre_Shounen Ai,Genre_Slice of Life,Genre_Space,Genre_Sports,Genre_Super Power,Genre_Supernatural,Genre_Thriller,Genre_Vampire,Genre_Yaoi,Genre_Yuri
count,14478.0,14478.0,14478.0,14478.0,14478.0,14478.0,14478.0,14478.0,14478.0,14478.0,...,14478.0,14478.0,14478.0,14478.0,14478.0,14478.0,14478.0,14478.0,14478.0,14478.0
mean,11.308399,6.142482,11460.03,7143.54234,7220.259566,22966.4,311.649606,1700.784362,122.834646,22.961195,...,0.005457,0.105816,0.030115,0.0431,0.036814,0.084197,0.006907,0.008634,0.002694,0.002832
std,43.443451,1.463981,43105.19,4050.222012,4170.080564,74981.36,2615.554211,718.32857,398.673812,19.32009,...,0.073669,0.307612,0.170909,0.203089,0.188313,0.277692,0.082824,0.092519,0.051833,0.053142
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4819.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,5.55,46.0,3611.25,3611.5,245.0,0.0,1988.0,1.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,6.37,501.0,7230.0,7225.5,1679.5,2.0,2005.0,1.0,23.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,12.0,7.06,3941.5,10838.75,10827.75,10379.0,23.0,2013.0,125.0,30.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1818.0,10.0,1009477.0,12919.0,14487.0,1456378.0,106895.0,2018.0,9863.0,180.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Remove rows with '0' in key columns
---

In [39]:
df = pd.DataFrame(animelist_df)

columns_list = ['episodes', 'score', 'scored_by', 'rank', 'popularity', 'members', 'favorites', 'duration_minutes']
animelist_df = df[(df[columns_list] != 0).all(axis=1)]
animelist_df.describe()

Unnamed: 0,episodes,score,scored_by,rank,popularity,members,favorites,year,days_aired,duration_minutes,...,Genre_Shounen Ai,Genre_Slice of Life,Genre_Space,Genre_Sports,Genre_Super Power,Genre_Supernatural,Genre_Thriller,Genre_Vampire,Genre_Yaoi,Genre_Yuri
count,9032.0,9032.0,9032.0,9032.0,9032.0,9032.0,9032.0,9032.0,9032.0,9032.0,...,9032.0,9032.0,9032.0,9032.0,9032.0,9032.0,9032.0,9032.0,9032.0,9032.0
mean,13.41818,6.749494,18220.94,5437.158326,4927.327834,36093.4,489.567981,1927.263065,149.407329,26.235179,...,0.00764,0.107507,0.035762,0.046391,0.047055,0.11105,0.00919,0.011957,0.003986,0.004429
std,40.760835,0.876339,53250.73,3890.988568,3136.179796,92026.42,3215.021663,387.709168,392.144663,18.149142,...,0.087075,0.309773,0.185706,0.210341,0.211768,0.314211,0.095426,0.108701,0.063011,0.066405
min,1.0,1.25,2.0,1.0,1.0,14.0,1.0,0.0,-4819.0,0.05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,6.25,546.75,2282.75,2293.5,1683.0,3.0,1999.0,1.0,17.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3.0,6.8,2124.0,4650.5,4653.0,5638.0,11.0,2008.0,78.0,24.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,13.0,7.35,11468.0,7701.25,7228.0,26093.25,73.0,2013.0,176.0,30.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1787.0,9.25,1009477.0,12919.0,14466.0,1456378.0,106895.0,2018.0,9540.0,180.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
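
The filter keeps a row only if every column in `columns_list` is non-zero. Columns outside that list, such as `year` and `days_aired`, are not checked, which is why the summary still shows a minimum of 0 for `year` and a negative minimum for `days_aired`. A minimal sketch of the same mask on a hypothetical toy frame (all names and values here are illustrative):

```python
import pandas as pd

# Hypothetical toy frame; 'year' is deliberately left out of the checked columns
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "episodes": [12, 0, 1, 24],
    "score": [7.5, 6.0, 0.0, 8.1],
    "year": [2012, 2007, 0, 0],
})

checked_columns = ["episodes", "score"]

# True where every checked column is non-zero for that row
keep_mask = (toy_df[checked_columns] != 0).all(axis=1)
filtered = toy_df[keep_mask]
print(filtered)
```

Row D survives even though its `year` is 0, because `year` is not among the checked columns.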


## Dealing with Related Column

### Exploring JSON Structure for each data unit

In [40]:
related_cell0 = animelist_df["related"][0]
related_cell_mod = related_cell0.replace("'", "\"")
related_cell_mod

'{"Adaptation": [{"mal_id": 17207, "type": "manga", "url": "https://myanimelist.net/manga/17207/Inu_x_Boku_SS", "title": "Inu x Boku SS"}], "Sequel": [{"mal_id": 13403, "type": "anime", "url": "https://myanimelist.net/anime/13403/Inu_x_Boku_SS_Special", "title": "Inu x Boku SS Special"}]}'

In [41]:
import json
related_dict = json.loads(related_cell_mod)
related_dict['Sequel'][0]['type']

'anime'
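
Since the cells look like Python dictionary literals (single-quoted strings) rather than strict JSON, one alternative to swapping quotes is `ast.literal_eval` from the standard library. This is a sketch under that assumption, using a shortened, hypothetical cell rather than real data:

```python
import ast

# Hypothetical cell in the same Python-literal style as the 'related' column.
# The title contains apostrophes, which a plain quote replacement would break.
cell = "{'Sequel': [{'mal_id': 1, 'type': 'anime', 'title': \"Rock 'n' Roll\"}]}"

# literal_eval safely evaluates Python literals (dicts, lists, strings, numbers, None)
parsed = ast.literal_eval(cell)
print(parsed["Sequel"][0]["type"])
print(parsed["Sequel"][0]["title"])
```

`literal_eval` also accepts `None` and `[]`, so it may handle cells that `json.loads` rejects. Whether it parses every row of the real column has not been checked here.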

### Count relations of each type per title

In [42]:
related_df = animelist_df[["title","related"]]
related_df.head()

Unnamed: 0,title,related
0,Inu x Boku SS,"{'Adaptation': [{'mal_id': 17207, 'type': 'man..."
1,Seto no Hanayome,"{'Adaptation': [{'mal_id': 759, 'type': 'manga..."
2,Shugo Chara!! Doki,"{'Adaptation': [{'mal_id': 101, 'type': 'manga..."
3,Princess Tutu,"{'Adaptation': [{'mal_id': 1581, 'type': 'mang..."
4,Bakuman. 3rd Season,"{'Adaptation': [{'mal_id': 9711, 'type': 'mang..."


In [43]:

related_dict = {'title': []}
related_row_dict_list = []

for i, related_row in enumerate(related_df['related']):
    # The raw cells use single quotes as string delimiters, with double quotes inside values.
    # JSON requires double-quoted strings, so swap the two kinds of quotes via a temporary placeholder.
    related_row = related_row.replace("\"", "(temp_double_quotes)")
    related_row = related_row.replace("'", "\"")
    related_row = related_row.replace("(temp_double_quotes)", "'")

    # Convert each row into its own dictionary
    related_row_dict = json.loads(related_row)
    related_row_dict_list.append(related_row_dict)

    # Fill in title list
    related_dict['title'].append(related_df['title'].iloc[i])

    # Add a key for every relation type not seen before
    for relation in related_row_dict:
        if relation not in related_dict:
            related_dict[relation] = []

# Second pass: count the entries of each relation type per row (0 if absent)
for related_row_dict in related_row_dict_list:
    for relation in related_dict:
        if relation == 'title':
            continue
        if relation in related_row_dict:
            related_dict[relation].append(len(related_row_dict[relation]))
        else:
            related_dict[relation].append(0)

related_df_separated = pd.DataFrame.from_dict(related_dict)
related_df_separated.head()

Unnamed: 0,title,Adaptation,Sequel,Side story,Alternative version,Prequel,Summary,Other,Spin-off,Alternative setting,Character,Parent story,Full story
0,Inu x Boku SS,1,1,0,0,0,0,0,0,0,0,0,0
1,Seto no Hanayome,1,1,1,1,0,0,0,0,0,0,0,0
2,Shugo Chara!! Doki,1,1,0,0,1,0,0,0,0,0,0,0
3,Princess Tutu,1,0,0,0,0,1,0,0,0,0,0,0
4,Bakuman. 3rd Season,1,0,0,0,2,0,1,0,0,0,0,0
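
The two-pass counting can also be expressed by building one `{relation: count}` dictionary per row and letting Pandas fill in the missing relations. This is a minimal sketch on hypothetical parsed rows (`parsed_rows` and `titles` are illustrative), not a drop-in replacement for the cell above:

```python
import pandas as pd

# Hypothetical parsed rows; the empty list mimics titles whose 'related' cell is []
parsed_rows = [
    {"Adaptation": [{"mal_id": 1}], "Sequel": [{"mal_id": 2}]},
    {"Prequel": [{"mal_id": 3}, {"mal_id": 4}]},
    [],
]
titles = ["Show A", "Show B", "Show C"]

# One dict of {relation: number of entries} per row; empty rows become {}
count_rows = [
    {relation: len(entries) for relation, entries in row.items()} if row else {}
    for row in parsed_rows
]

# Relations missing from a row come out as NaN, so fill them with 0
counts_df = pd.DataFrame(count_rows).fillna(0).astype(int)
counts_df.insert(0, "title", titles)
print(counts_df)
```

The column order follows the order in which relation types are first seen, as in the loop-based version.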


### Sum each column to find how many of each type of relation there are

In [44]:
relation_counts = {}
for column in related_df_separated:
    if column=='title':
        continue
    else:
        relation_counts[column] = related_df_separated[column].sum()
relation_counts        

{'Adaptation': 4192,
 'Sequel': 2110,
 'Side story': 1384,
 'Alternative version': 1090,
 'Prequel': 1877,
 'Summary': 400,
 'Other': 1872,
 'Spin-off': 512,
 'Alternative setting': 627,
 'Character': 247,
 'Parent story': 1318,
 'Full story': 315}
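
The same totals can be obtained with a single `sum` over the numeric columns. A minimal sketch on a hypothetical two-row frame (`toy_counts` is illustrative):

```python
import pandas as pd

# Hypothetical stand-in for related_df_separated
toy_counts = pd.DataFrame({
    "title": ["Show A", "Show B"],
    "Adaptation": [1, 0],
    "Prequel": [0, 2],
})

# Drop the non-numeric column, sum each remaining column, convert to a dict
totals = toy_counts.drop(columns=["title"]).sum().to_dict()
print(totals)
```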

### New column showing the average rank of related media for each show

In [45]:
avg = []
for related_row_dict in related_row_dict_list:
    related_total_rank = 0
    count = 0
    for relation in related_row_dict:
        related_type_list = related_row_dict[relation]
        for related_entry_dict in related_type_list:
            # Look up the related entry by title in the main dataframe
            relation_entry_orig = animelist_df.loc[animelist_df['title'] == related_entry_dict['title']]
            try:
                related_total_rank += float(relation_entry_orig['rank'].to_string(index=False))
            except ValueError:
                # No unique match (e.g. a manga that is not in animelist_df): contributes a rank of 0
                related_total_rank += 0
            # Note: unmatched entries are still counted, which lowers the average
            count += 1
    if count > 0:
        avg.append(related_total_rank / count)
    else:
        avg.append(0)
related_df_separated.insert(1, "Average Rank of Related", avg, True)


In [46]:
# Add prefix to columns
to_add_prefix = related_df_separated.iloc[:,2:].columns
selected_columns = related_df_separated[to_add_prefix]
prefixed_columns = selected_columns.add_prefix("Related_")
related_df_separated = pd.concat([related_df_separated.drop(to_add_prefix, axis=1),prefixed_columns],axis=1)
related_df_separated.head()

Unnamed: 0,title,Average Rank of Related,Related_Adaptation,Related_Sequel,Related_Side story,Related_Alternative version,Related_Prequel,Related_Summary,Related_Other,Related_Spin-off,Related_Alternative setting,Related_Character,Related_Parent story,Related_Full story
0,Inu x Boku SS,1145.5,1,1,0,0,0,0,0,0,0,0,0,0
1,Seto no Hanayome,2507.75,1,1,1,1,0,0,0,0,0,0,0,0
2,Shugo Chara!! Doki,2561.0,1,1,0,0,1,0,0,0,0,0,0,0
3,Princess Tutu,2278.5,1,0,0,0,0,1,0,0,0,0,0,0
4,Bakuman. 3rd Season,1450.75,1,0,0,0,2,0,1,0,0,0,0,0
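
Two details of this average are worth illustrating: the title lookup can be done once with a dictionary, and the denominator can either include or exclude related entries that are missing from the dataframe. The sketch below uses a hypothetical mini catalogue (all titles and ranks are made up for illustration). Unlike the cell above, a dictionary keeps only one rank per title, so duplicate titles would behave differently.

```python
import pandas as pd

# Hypothetical mini catalogue: title -> rank
catalogue = pd.DataFrame({
    "title": ["Show A", "Show A Special", "Show B"],
    "rank": [1274.0, 2291.0, 500.0],
})
rank_by_title = dict(zip(catalogue["title"], catalogue["rank"]))

# Related entries for "Show A": a manga (absent from the catalogue) and a special
related_titles = ["Show A (manga)", "Show A Special"]

# Ranks of the related entries that exist in the catalogue
found = [rank_by_title[t] for t in related_titles if t in rank_by_title]

# Convention of the cell above: unmatched entries add 0 but are still counted
avg_all = sum(found) / len(related_titles) if related_titles else 0.0
# Alternative convention: average only over the entries that were found
avg_found = sum(found) / len(found) if found else 0.0
print(avg_all, avg_found)
```

Which convention is preferable depends on how the column is used later. The first one mixes "how well related media rank" with "how many related media are anime in this dataset".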


### Concat related data columns into main dataframe

In [47]:
animelist_df.shape

(9032, 66)

In [48]:
related_df_separated.drop(columns=['title']).shape

(9032, 13)

In [49]:
related_df_separated.to_csv('DataSets/Cleaned data/Temp/related.csv', index=False) 
animelist_df.to_csv('DataSets/Cleaned data/Temp/temp_animelist.csv', index=False) 
related_df_separated = pd.read_csv('DataSets/Cleaned data/Temp/related.csv')
animelist_df = pd.read_csv('DataSets/Cleaned data/Temp/temp_animelist.csv')

In [50]:
# Concat back into main df
# reset_index returns a new dataframe, so assign the result
animelist_df = animelist_df.reset_index(drop=True)
related_df_separated = related_df_separated.reset_index(drop=True)
combined_df = pd.concat([animelist_df, related_df_separated.drop(columns=['title'])], axis='columns')

# Show all columns
pd.set_option('display.max_columns', None)
animelist_df = combined_df
animelist_df

Unnamed: 0,title,type,source,episodes,status,airing,aired,rating,score,scored_by,rank,popularity,members,favorites,related,studio,aired_from,aired_to,year,start_season,days_aired,duration_minutes,Genre_Action,Genre_Adventure,Genre_Cars,Genre_Comedy,Genre_Dementia,Genre_Demons,Genre_Drama,Genre_Ecchi,Genre_Fantasy,Genre_Game,Genre_Harem,Genre_Hentai,Genre_Historical,Genre_Horror,Genre_Josei,Genre_Kids,Genre_Magic,Genre_Martial Arts,Genre_Mecha,Genre_Military,Genre_Music,Genre_Mystery,Genre_NA,Genre_Parody,Genre_Police,Genre_Psychological,Genre_Romance,Genre_Samurai,Genre_School,Genre_Sci-Fi,Genre_Seinen,Genre_Shoujo,Genre_Shoujo Ai,Genre_Shounen,Genre_Shounen Ai,Genre_Slice of Life,Genre_Space,Genre_Sports,Genre_Super Power,Genre_Supernatural,Genre_Thriller,Genre_Vampire,Genre_Yaoi,Genre_Yuri,Average Rank of Related,Related_Adaptation,Related_Sequel,Related_Side story,Related_Alternative version,Related_Prequel,Related_Summary,Related_Other,Related_Spin-off,Related_Alternative setting,Related_Character,Related_Parent story,Related_Full story
0,Inu x Boku SS,TV,Manga,12,Finished Airing,False,"{'from': '2012-01-13', 'to': '2012-03-30'}",PG-13 - Teens 13 or older,7.63,139250,1274.0,231,283882,2809,"{'Adaptation': [{'mal_id': 17207, 'type': 'man...",SmallStudio,2012-01-13 00:00:00,2012-03-30 00:00:00,2012,Winter,78,24.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1145.50,1,1,0,0,0,0,0,0,0,0,0,0
1,Seto no Hanayome,TV,Manga,26,Finished Airing,False,"{'from': '2007-04-02', 'to': '2007-10-01'}",PG-13 - Teens 13 or older,7.89,91206,727.0,366,204003,2579,"{'Adaptation': [{'mal_id': 759, 'type': 'manga...",Gonzo,2007-04-02 00:00:00,2007-10-01 00:00:00,2007,Spring,183,24.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2507.75,1,1,1,1,0,0,0,0,0,0,0,0
2,Shugo Chara!! Doki,TV,Manga,51,Finished Airing,False,"{'from': '2008-10-04', 'to': '2009-09-25'}",PG - Children,7.55,37129,1508.0,1173,70127,802,"{'Adaptation': [{'mal_id': 101, 'type': 'manga...",Satelight,2008-10-04 00:00:00,2009-09-25 00:00:00,2008,Fall,357,24.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2561.00,1,1,0,0,1,0,0,0,0,0,0,0
3,Princess Tutu,TV,Original,38,Finished Airing,False,"{'from': '2002-08-16', 'to': '2003-05-23'}",PG-13 - Teens 13 or older,8.21,36501,307.0,916,93312,3344,"{'Adaptation': [{'mal_id': 1581, 'type': 'mang...",SmallStudio,2002-08-16 00:00:00,2003-05-23 00:00:00,2002,Summer,281,16.0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2278.50,1,0,0,0,0,1,0,0,0,0,0,0
4,Bakuman. 3rd Season,TV,Manga,25,Finished Airing,False,"{'from': '2012-10-06', 'to': '2013-03-30'}",PG-13 - Teens 13 or older,8.67,107767,50.0,426,182765,2082,"{'Adaptation': [{'mal_id': 9711, 'type': 'mang...",J.C.Staff,2012-10-06 00:00:00,2013-03-30 00:00:00,2012,Fall,176,24.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1450.75,1,0,0,0,2,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9027,Atashi Tenshi Anata Akuma,OVA,Unknown,2,Finished Airing,False,"{'from': '1995-03-21', 'to': '1995-06-21'}",G - All Ages,1.25,4,9712.0,14077,47,1,[],unknown,1995-03-21 00:00:00,1995-06-21 00:00:00,1995,Spring,93,30.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00,0,0,0,0,0,0,0,0,0,0,0,0
9028,Pastel Life,TV,Unknown,6,Currently Airing,True,"{'from': '2018-05-17', 'to': '2018-06-21'}",PG-13 - Teens 13 or older,6.62,140,5122.0,7722,1342,3,"{'Spin-off': [{'mal_id': 33573, 'type': 'anime...",unknown,2018-05-17 00:00:00,2018-06-21 00:00:00,2018,Spring,36,3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,4131.00,0,0,0,0,0,0,0,1,0,0,0,0
9029,Hibike! Euphonium Movie: Todoketai Melody - Ph...,Movie,Unknown,3,Finished Airing,False,"{'from': '2017-09-30', 'to': '2017-09-30'}",PG-13 - Teens 13 or older,5.66,35,10480.0,8789,809,1,"{'Other': [{'mal_id': 35082, 'type': 'anime', ...",Kyoto Animation,2017-09-30 00:00:00,2017-09-30 00:00:00,2017,Fall,1,4.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1404.00,0,0,0,0,0,0,1,0,0,0,0,0
9030,Quiz de Manabu Pinocchio no Koutsuu Ansen,OVA,Original,1,Finished Airing,False,"{'from': None, 'to': None}",G - All Ages,8.00,2,11910.0,14287,35,1,[],unknown,Not aired,Not aired,0,0,0,15.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00,0,0,0,0,0,0,0,0,0,0,0,0
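
`pd.concat` with `axis='columns'` aligns rows on index labels, so both frames need matching indices for the rows to line up. Note that `reset_index` returns a new frame rather than modifying the original. A minimal sketch with hypothetical frames (`left` and `right` are illustrative):

```python
import pandas as pd

# Index with a gap, as left behind by row filtering
left = pd.DataFrame({"title": ["A", "B"]}, index=[0, 2])
# Fresh 0..n-1 index
right = pd.DataFrame({"Related_Sequel": [1, 0]})

# Without assignment this call has no effect on 'left'
left.reset_index(drop=True)
misaligned = pd.concat([left, right], axis="columns")

# Assigning the result gives both frames the same 0..n-1 index
left = left.reset_index(drop=True)
aligned = pd.concat([left, right], axis="columns")
print(misaligned.shape, aligned.shape)
```

In this notebook the round trip through CSV with `index=False` already yields a fresh index on reload, so the rows line up either way.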


### Remove converted columns (aired & related)

In [51]:
animelist_df.drop(columns=['aired', 'related'], inplace=True)

In [52]:
animelist_df.head()

Unnamed: 0,title,type,source,episodes,status,airing,rating,score,scored_by,rank,popularity,members,favorites,studio,aired_from,aired_to,year,start_season,days_aired,duration_minutes,Genre_Action,Genre_Adventure,Genre_Cars,Genre_Comedy,Genre_Dementia,Genre_Demons,Genre_Drama,Genre_Ecchi,Genre_Fantasy,Genre_Game,Genre_Harem,Genre_Hentai,Genre_Historical,Genre_Horror,Genre_Josei,Genre_Kids,Genre_Magic,Genre_Martial Arts,Genre_Mecha,Genre_Military,Genre_Music,Genre_Mystery,Genre_NA,Genre_Parody,Genre_Police,Genre_Psychological,Genre_Romance,Genre_Samurai,Genre_School,Genre_Sci-Fi,Genre_Seinen,Genre_Shoujo,Genre_Shoujo Ai,Genre_Shounen,Genre_Shounen Ai,Genre_Slice of Life,Genre_Space,Genre_Sports,Genre_Super Power,Genre_Supernatural,Genre_Thriller,Genre_Vampire,Genre_Yaoi,Genre_Yuri,Average Rank of Related,Related_Adaptation,Related_Sequel,Related_Side story,Related_Alternative version,Related_Prequel,Related_Summary,Related_Other,Related_Spin-off,Related_Alternative setting,Related_Character,Related_Parent story,Related_Full story
0,Inu x Boku SS,TV,Manga,12,Finished Airing,False,PG-13 - Teens 13 or older,7.63,139250,1274.0,231,283882,2809,SmallStudio,2012-01-13 00:00:00,2012-03-30 00:00:00,2012,Winter,78,24.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1145.5,1,1,0,0,0,0,0,0,0,0,0,0
1,Seto no Hanayome,TV,Manga,26,Finished Airing,False,PG-13 - Teens 13 or older,7.89,91206,727.0,366,204003,2579,Gonzo,2007-04-02 00:00:00,2007-10-01 00:00:00,2007,Spring,183,24.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2507.75,1,1,1,1,0,0,0,0,0,0,0,0
2,Shugo Chara!! Doki,TV,Manga,51,Finished Airing,False,PG - Children,7.55,37129,1508.0,1173,70127,802,Satelight,2008-10-04 00:00:00,2009-09-25 00:00:00,2008,Fall,357,24.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2561.0,1,1,0,0,1,0,0,0,0,0,0,0
3,Princess Tutu,TV,Original,38,Finished Airing,False,PG-13 - Teens 13 or older,8.21,36501,307.0,916,93312,3344,SmallStudio,2002-08-16 00:00:00,2003-05-23 00:00:00,2002,Summer,281,16.0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2278.5,1,0,0,0,0,1,0,0,0,0,0,0
4,Bakuman. 3rd Season,TV,Manga,25,Finished Airing,False,PG-13 - Teens 13 or older,8.67,107767,50.0,426,182765,2082,J.C.Staff,2012-10-06 00:00:00,2013-03-30 00:00:00,2012,Fall,176,24.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1450.75,1,0,0,0,2,0,1,0,0,0,0,0


## Break down by type of show

In [53]:
animelist_tv_filtered = animelist_df[animelist_df['type'].isin(['TV'])]
animelist_tv_filtered

Unnamed: 0,title,type,source,episodes,status,airing,rating,score,scored_by,rank,popularity,members,favorites,studio,aired_from,aired_to,year,start_season,days_aired,duration_minutes,Genre_Action,Genre_Adventure,Genre_Cars,Genre_Comedy,Genre_Dementia,Genre_Demons,Genre_Drama,Genre_Ecchi,Genre_Fantasy,Genre_Game,Genre_Harem,Genre_Hentai,Genre_Historical,Genre_Horror,Genre_Josei,Genre_Kids,Genre_Magic,Genre_Martial Arts,Genre_Mecha,Genre_Military,Genre_Music,Genre_Mystery,Genre_NA,Genre_Parody,Genre_Police,Genre_Psychological,Genre_Romance,Genre_Samurai,Genre_School,Genre_Sci-Fi,Genre_Seinen,Genre_Shoujo,Genre_Shoujo Ai,Genre_Shounen,Genre_Shounen Ai,Genre_Slice of Life,Genre_Space,Genre_Sports,Genre_Super Power,Genre_Supernatural,Genre_Thriller,Genre_Vampire,Genre_Yaoi,Genre_Yuri,Average Rank of Related,Related_Adaptation,Related_Sequel,Related_Side story,Related_Alternative version,Related_Prequel,Related_Summary,Related_Other,Related_Spin-off,Related_Alternative setting,Related_Character,Related_Parent story,Related_Full story
0,Inu x Boku SS,TV,Manga,12,Finished Airing,False,PG-13 - Teens 13 or older,7.63,139250,1274.0,231,283882,2809,SmallStudio,2012-01-13 00:00:00,2012-03-30 00:00:00,2012,Winter,78,24.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1145.50,1,1,0,0,0,0,0,0,0,0,0,0
1,Seto no Hanayome,TV,Manga,26,Finished Airing,False,PG-13 - Teens 13 or older,7.89,91206,727.0,366,204003,2579,Gonzo,2007-04-02 00:00:00,2007-10-01 00:00:00,2007,Spring,183,24.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2507.75,1,1,1,1,0,0,0,0,0,0,0,0
2,Shugo Chara!! Doki,TV,Manga,51,Finished Airing,False,PG - Children,7.55,37129,1508.0,1173,70127,802,Satelight,2008-10-04 00:00:00,2009-09-25 00:00:00,2008,Fall,357,24.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2561.00,1,1,0,0,1,0,0,0,0,0,0,0
3,Princess Tutu,TV,Original,38,Finished Airing,False,PG-13 - Teens 13 or older,8.21,36501,307.0,916,93312,3344,SmallStudio,2002-08-16 00:00:00,2003-05-23 00:00:00,2002,Summer,281,16.0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2278.50,1,0,0,0,0,1,0,0,0,0,0,0
4,Bakuman. 3rd Season,TV,Manga,25,Finished Airing,False,PG-13 - Teens 13 or older,8.67,107767,50.0,426,182765,2082,J.C.Staff,2012-10-06 00:00:00,2013-03-30 00:00:00,2012,Fall,176,24.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1450.75,1,0,0,0,2,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9007,Duel Masters Victory,TV,Original,52,Finished Airing,False,PG-13 - Teens 13 or older,6.17,191,6851.0,8712,796,1,unknown,2011-04-02 00:00:00,2012-03-31 00:00:00,2011,Spring,365,10.0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0.00,0,1,0,0,1,0,0,0,0,0,0,0
9009,Alexander Senki,TV,Unknown,13,Finished Airing,False,R+ - Mild Nudity,5.80,2150,7792.0,4934,4840,18,Madhouse,1999-09-14 00:00:00,1999-12-07 00:00:00,1999,Fall,85,20.0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0.00,0,0,0,1,0,0,0,0,0,0,0,0
9013,Nepos Napos,TV,Unknown,26,Finished Airing,False,PG - Children,5.80,5,11450.0,13582,72,1,SmallStudio,2005-01-31 00:00:00,2005-07-25 00:00:00,2005,Winter,176,6.0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00,0,0,0,0,0,0,0,0,0,0,0,0
9021,Robocar Poli 2,TV,Original,26,Finished Airing,False,G - All Ages,5.70,20,11888.0,13081,95,1,unknown,2011-12-26 00:00:00,2012-05-15 00:00:00,2011,Winter,142,11.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00,0,0,0,0,1,0,0,0,0,0,0,0


In [54]:
animelist_movie_filtered = animelist_df[animelist_df['type'].isin(['Movie'])]
animelist_movie_filtered

Unnamed: 0,title,type,source,episodes,status,airing,rating,score,scored_by,rank,popularity,members,favorites,studio,aired_from,aired_to,year,start_season,days_aired,duration_minutes,Genre_Action,Genre_Adventure,Genre_Cars,Genre_Comedy,Genre_Dementia,Genre_Demons,Genre_Drama,Genre_Ecchi,Genre_Fantasy,Genre_Game,Genre_Harem,Genre_Hentai,Genre_Historical,Genre_Horror,Genre_Josei,Genre_Kids,Genre_Magic,Genre_Martial Arts,Genre_Mecha,Genre_Military,Genre_Music,Genre_Mystery,Genre_NA,Genre_Parody,Genre_Police,Genre_Psychological,Genre_Romance,Genre_Samurai,Genre_School,Genre_Sci-Fi,Genre_Seinen,Genre_Shoujo,Genre_Shoujo Ai,Genre_Shounen,Genre_Shounen Ai,Genre_Slice of Life,Genre_Space,Genre_Sports,Genre_Super Power,Genre_Supernatural,Genre_Thriller,Genre_Vampire,Genre_Yaoi,Genre_Yuri,Average Rank of Related,Related_Adaptation,Related_Sequel,Related_Side story,Related_Alternative version,Related_Prequel,Related_Summary,Related_Other,Related_Spin-off,Related_Alternative setting,Related_Character,Related_Parent story,Related_Full story
65,Death Billiards,Movie,Original,1,Finished Airing,False,R - 17+ (violence & profanity),8.01,81162,542.0,637,132203,226,Madhouse,2013-03-02 00:00:00,2013-03-02 00:00:00,2013,Spring,1,25.0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,3599.5,0,0,0,1,0,0,3,0,0,0,0,0
71,Duel Masters Movie 1: Yami no Shiro no Maryuuou,Movie,Original,1,Finished Airing,False,PG - Children,5.96,333,7431.0,8482,896,2,unknown,2005-03-12 00:00:00,2005-03-12 00:00:00,2005,Spring,1,45.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7537.0,0,0,0,0,0,0,0,0,0,0,1,0
79,Pokemon 3D Adventure 2: Pikachu no Kaitei Daib...,Movie,Game,1,Finished Airing,False,PG - Children,7.13,2588,3092.0,4395,6463,13,OLM,2006-05-20 00:00:00,2006-05-20 00:00:00,2006,Spring,1,14.0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2745.0,0,0,0,0,1,0,0,0,0,0,1,0
83,3 Choume no Tama: Onegai! Momo-chan wo Sagashi...,Movie,Unknown,1,Finished Airing,False,G - All Ages,6.14,73,6954.0,10118,358,2,unknown,1993-08-14 00:00:00,1993-08-14 00:00:00,1993,Summer,1,43.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2323.0,0,0,0,0,0,0,2,0,0,0,0,0
85,Majokko Shimai no Yoyo to Nene,Movie,Manga,1,Finished Airing,False,G - All Ages,7.57,4073,1431.0,3511,11187,17,ufotable,2013-12-28 00:00:00,2013-12-28 00:00:00,2013,Winter,1,60.0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,1,0,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9006,Maabou no Daikyousou,Movie,Original,1,Finished Airing,False,G - All Ages,4.62,330,9309.0,9170,613,1,unknown,Not aired,Not aired,0,0,0,1.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0.0,0,1,0,0,0,0,0,0,0,0,0,0
9010,Mahou Shoujo Madoka★Magica: Concept Movie,Movie,Original,1,Finished Airing,False,PG-13 - Teens 13 or older,7.14,4779,3047.0,3397,12138,35,Shaft,2015-11-27 00:00:00,2015-11-27 00:00:00,2015,Fall,1,4.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,125.0,0,0,0,0,0,0,1,0,0,0,0,0
9017,Uli Chingu Kkachi,Movie,Unknown,1,Finished Airing,False,G - All Ages,5.76,17,12566.0,13531,75,1,unknown,Not aired,Not aired,0,0,0,20.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0
9019,Eikou e no Spur: Igaya Chiharu Monogatari,Movie,Unknown,1,Finished Airing,False,G - All Ages,6.57,7,10054.0,12137,140,1,unknown,1997-09-13 00:00:00,1997-09-13 00:00:00,1997,Fall,1,60.0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0
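
When more than a couple of types are needed, the per-type frames can be produced in one step with `groupby` instead of one `isin` filter per type. A minimal sketch on a hypothetical toy frame (`toy_df` is illustrative):

```python
import pandas as pd

# Hypothetical stand-in for animelist_df
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["TV", "Movie", "TV", "OVA"],
})

# One sub-frame per show type, keyed by the type label
frames_by_type = {show_type: frame for show_type, frame in toy_df.groupby("type")}
print(sorted(frames_by_type))

# Quick overview of how many rows each type has
print(toy_df["type"].value_counts())
```

`frames_by_type["TV"]` then plays the role of `animelist_tv_filtered`, and `frames_by_type["Movie"]` that of `animelist_movie_filtered`.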


## Export to new CSV files

In [55]:
animelist_df.to_csv('DataSets/Cleaned data/data_cleaned.csv', index=False) 

In [56]:
animelist_tv_filtered.to_csv('DataSets/Cleaned data/data_cleaned_tv_only.csv', index=False) 
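
A quick round-trip check can confirm that an exported file reloads with the expected shape. The sketch below writes to an in-memory buffer instead of a real path, and `toy_df` is a hypothetical stand-in for the filtered frames:

```python
import io
import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by filtering
toy_df = pd.DataFrame({"title": ["A", "B"], "score": [7.5, 6.0]}, index=[3, 9])

buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)  # index=False: the row labels 3 and 9 are not written
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded)
```

Because the index is not written, the reloaded frame starts again from 0. Any later step that relies on the original row labels would need them stored as a regular column.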