# **Preprocessing data for MAL Data Science Project**

In [1411]:
import pandas as pd
import ast


**Data reading** <br>
The dataset was extracted with the JikanScrapper file. This notebook preprocesses each feature so the data is ready for visualization and machine learning in later stages.

In [1412]:
data = pd.read_csv('anime_data.csv')
columns_to_exclude = ['year', 'explict_genres'] # There is a better year column and explict_genres is already in genres
data = data.drop(columns=columns_to_exclude)
data = data.rename(columns={'year_2': 'year'})


# **1. Data exploration and cleaning the dataset** 

In [1413]:
def describe_feature(data: pd.DataFrame, feature: str):
    print(f"- Type: {data.loc[:, feature].dtype}")
    print(f"- First 5 rows:\n {data[feature].head(5)}")
    print(f"- Last 5 Rows: \n {data[feature].tail(5)}")
    print(f"- Number of unique values: {data[feature].nunique()}")
    print(f"- Unique values count: {data[feature].value_counts()}" )
    print(f"- Numer of missing values: {data.loc[:, feature].isna().sum()}")
    
    if data[feature].dtype in ['int64', 'float64']:
        print(f"- Min: {data[feature].min()}")
        print(f"- Max: {data[feature].max()}")
        print(f"- Mean: {data[feature].mean()}")
        print(f"- Median: {data[feature].median()}")
        print(f"- Mode: {data[feature].mode()}")
        print(f"- Std: {data[feature].std()} ")


Now let's understand the contents of the dataset

In [1414]:
print("Mal Anime Dataset")
print(f"Number of rows: {data.shape[0]}, Number of columns: {data.shape[1]}")
print(f"Column names: {data.columns}")
print(f"Number of missing values: {data.isna().sum().sum()}")

data = data.drop_duplicates()
print(f"\n\nNumber of rows: {data.shape[0]}, Number of columns: {data.shape[1]}")
print(f"Column names: {data.columns}")
print(f"Number of missing values: {data.isna().sum().sum()}")


Mal Anime Dataset
Number of rows: 28678, Number of columns: 26
Column names: Index(['mal_id', 'title', 'title_en', 'title_synonyms', 'type', 'source',
       'episodes', 'status', 'airing', 'duration', 'rating', 'score',
       'scored_by', 'rank', 'popularity', 'members', 'favorites', 'synopsis',
       'season', 'year', 'month', 'licensors', 'producers', 'studios',
       'genres', 'themes'],
      dtype='object')
Number of missing values: 74387


Number of rows: 28464, Number of columns: 26
Column names: Index(['mal_id', 'title', 'title_en', 'title_synonyms', 'type', 'source',
       'episodes', 'status', 'airing', 'duration', 'rating', 'score',
       'scored_by', 'rank', 'popularity', 'members', 'favorites', 'synopsis',
       'season', 'year', 'month', 'licensors', 'producers', 'studios',
       'genres', 'themes'],
      dtype='object')
Number of missing values: 73342


For some reason the scraped file contained 214 duplicate entries (28678 rows before `drop_duplicates`, 28464 after).
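Before dropping duplicates it can help to look at which rows are repeated. The following is a minimal sketch on a toy frame (the rows are illustrative, not taken from `anime_data.csv`); it only uses the standard pandas `duplicated` and `drop_duplicates` methods.

```python
import pandas as pd

# Toy frame standing in for the scraped data (values are illustrative only).
toy = pd.DataFrame({
    "mal_id": [1, 5, 5, 6],
    "title": ["Cowboy Bebop", "Cowboy Bebop: Tengoku no Tobira",
              "Cowboy Bebop: Tengoku no Tobira", "Trigun"],
})

# keep=False marks every copy of a duplicated row, not only the later ones.
dup_rows = toy[toy.duplicated(keep=False)]

# Number of rows that drop_duplicates would remove.
n_removed = len(toy) - len(toy.drop_duplicates())

print(dup_rows)
print(f"Rows removed by drop_duplicates: {n_removed}")
```

Running the same two lines on `data` before the `drop_duplicates` call would show whether the repeats are exact copies of whole rows, which is what `drop_duplicates` removes by default.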

**Attribute Information**
- **`mal_id`**: Unique anime ID on MyAnimeList.
- **`title`**: The anime's primary title 
- **`title_en`**: English title of the anime.
- **`title_synonyms`**: List of common abbreviations for the title 
- **`type`**: Media type
- **`source`**: Original source material
- **`episodes`**: Total number of episodes
- **`status`**: Current airing status
- **`airing`**: Whether the anime is currently airing.
- **`duration`**: Length of each episode
- **`rating`**: Age rating
- **`score`**: Average user score (out of 10)
- **`scored_by`**: Number of users who rated the anime.
- **`rank`**: Rank based on score.
- **`popularity`**: Rank based on number of members.
- **`members`**: Total number of users who added the anime to their lists.
- **`favorites`**: Number of users who marked the anime as a favorite.
- **`synopsis`**: Short summary of the anime.
- **`season`**: Season the anime came out in
- **`year`**: Year the anime came out in.
- **`month`**: Month the anime came out in (1 to 12).
- **`licensors`**: Companies licensed to distribute the anime 
- **`producers`**: Companies involved in funding or planning the anime.
- **`studios`**: Animation studios responsible for production.
- **`genres`**: Core genres in the anime
- **`themes`**: Narrative or setting themes 

In [1415]:
data.describe()

| | mal_id | episodes | score | scored_by | rank | popularity | members | favorites | year | month |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 28464.0 | 27784.0 | 18528.0 | 18528.0 | 21720.0 | 28464.0 | 28464.0 | 28464.0 | 27586.0 | 27586.0 |
| mean | 33456.972175 | 14.237439 | 6.396332 | 29958.34 | 10858.345258 | 14230.07989 | 38645.45 | 432.630235 | 2008.93388 | 5.765134 |
| std | 19408.082961 | 47.903882 | 0.889767 | 121391.1 | 6269.891837 | 8216.431813 | 166387.2 | 4506.529237 | 15.054737 | 3.688561 |
| min | 1.0 | 1.0 | 1.89 | 101.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1917.0 | 1.0 |
| 25% | 14512.5 | 1.0 | 5.78 | 340.0 | 5428.5 | 7114.75 | 235.0 | 0.0 | 2003.0 | 2.0 |
| 50% | 37546.5 | 2.0 | 6.37 | 1554.0 | 10856.0 | 14230.5 | 1082.0 | 1.0 | 2013.0 | 6.0 |
| 75% | 49664.25 | 13.0 | 7.03 | 10126.25 | 16288.25 | 21349.0 | 9171.5 | 17.0 | 2019.0 | 9.0 |
| max | 61814.0 | 3057.0 | 9.3 | 2940011.0 | 21717.0 | 28463.0 | 4170962.0 | 238662.0 | 2027.0 | 12.0 |


**Feature `mal_id`**

In [1416]:
describe_feature(data, 'mal_id')

- Type: int64
- First 5 rows:
 0    1
1    5
2    6
3    7
4    8
Name: mal_id, dtype: int64
- Last 5 Rows: 
 28673    61810
28674    61811
28675    61812
28676    61813
28677    61814
Name: mal_id, dtype: int64
- Number of unique values: 28464
- Unique values count: mal_id
61779    1
61778    1
61777    1
61774    1
61773    1
        ..
8        1
7        1
6        1
5        1
1        1
Name: count, Length: 28464, dtype: int64
- Numer of missing values: 0
- Min: 1
- Max: 61814
- Mean: 33456.97217537943
- Median: 37546.5
- Mode: 0            1
1            5
2            6
3            7
4            8
         ...  
28459    61810
28460    61811
28461    61812
28462    61813
28463    61814
Name: mal_id, Length: 28464, dtype: int64
- Std: 19408.08296094567 


**Feature `title`**

In [1417]:
describe_feature(data, 'title')

- Type: object
- First 5 rows:
 0                       Cowboy Bebop
1    Cowboy Bebop: Tengoku no Tobira
2                             Trigun
3                 Witch Hunter Robin
4                     Bouken Ou Beet
Name: title, dtype: object
- Last 5 Rows: 
 28673               Sora
28674         Shikushiku
28675        Plants Song
28676              Ghost
28677    Oni no Hanayome
Name: title, dtype: object
- Number of unique values: 28460
- Unique values count: title
Ghost                                                    2
1                                                        2
Shen Lan Qi Yu Wushuang Zhu: Tianmo Pian                 2
Moonlight                                                2
Naruto                                                   1
                                                        ..
Kenpuu Denki Berserk                                     1
Koukaku Kidoutai                                         1
Rurouni Kenshin: Meiji Kenkaku Romantan - Tsuioku-hen

Some titles appear more than once, since some anime don't differentiate between seasons in their titles.
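To check which entries share a title, the repeated titles can be listed together with their IDs. This is a small sketch on made-up rows (the `mal_id` values here are hypothetical), using `duplicated` with `subset` and `keep=False`.

```python
import pandas as pd

# Toy rows; the mal_id values are hypothetical and only serve the example.
toy = pd.DataFrame({
    "mal_id": [20, 101, 102, 103],
    "title": ["Naruto", "Ghost", "Ghost", "Moonlight"],
})

# Keep every row whose title occurs more than once, with copies side by side.
repeated = toy[toy.duplicated(subset="title", keep=False)].sort_values("title")

print(repeated)
```

Because `mal_id` is unique, rows that share a title but not an ID are distinct entries and are not removed by `drop_duplicates`.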

**Feature `title_en`**

In [1418]:
describe_feature(data, 'title_en')

# Fill missing English titles with the original title
data['title_en'] = data['title_en'].fillna(data['title'])
print(f"Number of missing values after cleaning: {data['title_en'].isna().sum()}")

- Type: object
- First 5 rows:
 0               Cowboy Bebop
1    Cowboy Bebop: The Movie
2                     Trigun
3         Witch Hunter Robin
4     Beet the Vandel Buster
Name: title_en, dtype: object
- Last 5 Rows: 
 28673    NaN
28674    NaN
28675    NaN
28676    NaN
28677    NaN
Name: title_en, dtype: object
- Number of unique values: 12025
- Unique values count: title_en
Promise                                5
Spirit Guardians                       5
Meow Meow Japanese History             4
Cyborg 009                             4
Magical Legend: Rise to Immortality    4
                                      ..
Xian Chong                             1
To Be Winner                           1
Seven States of Galaxy Saga            1
The Last With You                      1
Witch Hunter Robin                     1
Name: count, Length: 12025, dtype: int64
- Numer of missing values: 16242
Number of missing values after cleaning: 0


**Feature `title_synonyms`**

In [1419]:
describe_feature(data, 'title_synonyms')

- Type: object
- First 5 rows:
 0                                             []
1    ["Cowboy Bebop: Knockin' on Heaven's Door"]
2                                             []
3                                        ['WHR']
4                        ['Adventure King Beet']
Name: title_synonyms, dtype: object
- Last 5 Rows: 
 28673        ['Minna no Uta']
28674        ['Minna no Uta']
28675                      []
28676                      []
28677    ["The Ogre's Bride"]
Name: title_synonyms, dtype: object
- Number of unique values: 12933
- Unique values count: title_synonyms
[]                                                                       14358
['Minna no Uta']                                                           808
['Irodorimidori']                                                           49
['One Piece Recap', 'One Piece Special']                                    11
['Hirake! Ponkikki']                                                        10
                  

**Feature `type`**

In [1420]:
describe_feature(data, 'type')

# Fill missing values with an empty string
data['type'] = data['type'].fillna('')
print(f"Number of missing values after cleaning: {data['type'].isna().sum()}")


- Type: object
- First 5 rows:
 0       TV
1    Movie
2       TV
3       TV
4       TV
Name: type, dtype: object
- Last 5 Rows: 
 28673    Music
28674    Music
28675    Music
28676    Movie
28677       TV
Name: type, dtype: object
- Number of unique values: 9
- Unique values count: type
TV            8284
Movie         4852
OVA           4170
ONA           4001
Music         3867
Special       1763
TV Special     746
CM             459
PV             256
Name: count, dtype: int64
- Numer of missing values: 66
Number of missing values after cleaning: 0


**Feature `source`**

In [1421]:
describe_feature(data, 'source')

- Type: object
- First 5 rows:
 0    Original
1    Original
2       Manga
3    Original
4       Manga
Name: source, dtype: object
- Last 5 Rows: 
 28673    Original
28674    Original
28675    Original
28676    Original
28677       Other
Name: source, dtype: object
- Number of unique values: 17
- Unique values count: source
Original        12092
Manga            5432
Unknown          2818
Game             1446
Other            1287
Light novel      1165
Visual novel     1155
Novel             813
Web manga         636
4-koma manga      334
Picture book      277
Music             262
Mixed media       233
Book              219
Web novel         204
Card game          77
Radio              14
Name: count, dtype: int64
- Numer of missing values: 0


**Feature `episodes`**

In [1422]:
describe_feature(data,'episodes')
# TODO: some episode counts are missing; handle them later

- Type: float64
- First 5 rows:
 0    26.0
1     1.0
2    26.0
3    26.0
4    52.0
Name: episodes, dtype: float64
- Last 5 Rows: 
 28673    1.0
28674    1.0
28675    1.0
28676    1.0
28677    NaN
Name: episodes, dtype: float64
- Number of unique values: 260
- Unique values count: episodes
1.0      13622
12.0      2279
2.0       1636
26.0      1331
13.0      1111
         ...  
247.0        1
126.0        1
157.0        1
280.0        1
123.0        1
Name: count, Length: 260, dtype: int64
- Numer of missing values: 680
- Min: 1.0
- Max: 3057.0
- Mean: 14.23743881370573
- Median: 2.0
- Mode: 0    1.0
Name: episodes, dtype: float64
- Std: 47.90388154617769 


There are some missing values here. Let's figure out why and where

**Feature `status`**

In [1423]:
describe_feature(data, 'status')

print("\nNotice when status == \'Not yet aired\'")
print(f"  - Number of rows: {data[data['status'] == 'Not yet aired'].shape[0]}" )
print(f"  - Number of missing values: {data[data['status'] == 'Not yet aired'].isna().sum().sum()}")


- Type: object
- First 5 rows:
 0    Finished Airing
1    Finished Airing
2    Finished Airing
3    Finished Airing
4    Finished Airing
Name: status, dtype: object
- Last 5 Rows: 
 28673    Finished Airing
28674    Finished Airing
28675    Finished Airing
28676      Not yet aired
28677      Not yet aired
Name: status, dtype: object
- Number of unique values: 3
- Unique values count: status
Finished Airing     27586
Not yet aired         524
Currently Airing      354
Name: count, dtype: int64
- Numer of missing values: 0

Notice when status == 'Not yet aired'
  - Number of rows: 524
  - Number of missing values: 3310


Notice that unaired anime also have a lot of missing features. This is partly because some fields have not been filled in yet, such as `episodes`, `year` and `month`.  <br>
For data visualization we only want anime that are currently airing or have finished airing, so a separate CSV file will be created for them.
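To see which features drive the missing values for unaired entries, the count can be broken down per column. A minimal sketch on toy rows (values are illustrative), assuming the same `status` labels as the dataset:

```python
import numpy as np
import pandas as pd

# Toy rows mimicking one finished and two unaired entries.
toy = pd.DataFrame({
    "status": ["Finished Airing", "Not yet aired", "Not yet aired"],
    "episodes": [26.0, 1.0, np.nan],
    "year": [1998.0, 2027.0, np.nan],
    "score": [8.75, np.nan, np.nan],
})

# Missing values per column, restricted to unaired entries.
unaired = toy[toy["status"] == "Not yet aired"]
missing_per_column = unaired.isna().sum().sort_values(ascending=False)

print(missing_per_column)
```

Applied to `data`, the same breakdown would show how the 3310 missing values among the 524 unaired rows are distributed across columns.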

**Feature `duration`**

In [1424]:
describe_feature(data, 'duration')

- Type: object
- First 5 rows:
 0    24 min per ep
1      1 hr 55 min
2    24 min per ep
3    25 min per ep
4    23 min per ep
Name: duration, dtype: object
- Last 5 Rows: 
 28673      2 min
28674      2 min
28675      2 min
28676    Unknown
28677    Unknown
Name: duration, dtype: object
- Number of unique values: 340
- Unique values count: duration
24 min per ep         2085
23 min per ep         1810
2 min                 1637
3 min                 1504
4 min                 1178
                      ... 
2 hr 31 min              1
26 sec                   1
2 hr 25 min              1
1 hr 55 min per ep       1
2 hr 45 min              1
Name: count, Length: 340, dtype: int64
- Numer of missing values: 0


**Feature `rating`**

In [1425]:
describe_feature(data, 'rating')

# Fill missing values with an empty string
data['rating'] = data['rating'].fillna('')

- Type: object
- First 5 rows:
 0    R - 17+ (violence & profanity)
1    R - 17+ (violence & profanity)
2         PG-13 - Teens 13 or older
3         PG-13 - Teens 13 or older
4                     PG - Children
Name: rating, dtype: object
- Last 5 Rows: 
 28673    G - All Ages
28674    G - All Ages
28675    G - All Ages
28676             NaN
28677             NaN
Name: rating, dtype: object
- Number of unique values: 6
- Unique values count: rating
PG-13 - Teens 13 or older         10192
G - All Ages                       8927
PG - Children                      4386
Rx - Hentai                        1566
R - 17+ (violence & profanity)     1565
R+ - Mild Nudity                   1211
Name: count, dtype: int64
- Numer of missing values: 617


Creating a new feature that summarizes the rating into `G`, `PG`, `PG-13`, `R`

In [1426]:
new_rating_rows = []
for _, row in data.iterrows():
    if row['rating'] != "":
        # Time to split the strings
        split_string = row['rating'].split(' ')
        rating = split_string[0].rstrip('x').rstrip('+')
        new_rating_rows.append(rating)
    else:
        new_rating_rows.append('')

# Add the new column to the data
data['new_rating'] = new_rating_rows
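The same summarization can be sketched without an explicit loop by using pandas string methods. This is an alternative under the assumption that missing ratings were already replaced by empty strings, as in the cell above; the sample strings are the rating labels shown in the output.

```python
import pandas as pd

ratings = pd.Series([
    "R - 17+ (violence & profanity)",
    "PG-13 - Teens 13 or older",
    "Rx - Hentai",
    "R+ - Mild Nudity",
    "G - All Ages",
    "",  # a missing rating, already filled with an empty string
])

# Take the token before the first space, then strip a trailing 'x' and '+'.
short = ratings.str.split(" ").str[0].str.rstrip("x").str.rstrip("+")

print(short.tolist())
```

Note that `R`, `R+` and `Rx` all collapse into `R`, which matches what the loop does.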


**Feature `score`**

In [1427]:
describe_feature(data, 'score')

- Type: float64
- First 5 rows:
 0    8.75
1    8.38
2    8.22
3    7.24
4    6.93
Name: score, dtype: float64
- Last 5 Rows: 
 28673   NaN
28674   NaN
28675   NaN
28676   NaN
28677   NaN
Name: score, dtype: float64
- Number of unique values: 564
- Unique values count: score
6.33    102
6.52     98
6.30     96
6.42     96
6.21     95
       ... 
3.65      1
3.54      1
3.67      1
4.01      1
3.30      1
Name: count, Length: 564, dtype: int64
- Numer of missing values: 9936
- Min: 1.89
- Max: 9.3
- Mean: 6.3963320379965465
- Median: 6.37
- Mode: 0    6.33
Name: score, dtype: float64
- Std: 0.8897668758689251 


Note: MyAnimeList doesn't allow scoring for unreleased anime and 18+ anime, so I don't think it is appropriate to impute these values.

**Feature `scored_by`**

In [1428]:
describe_feature(data, 'scored_by')

- Type: float64
- First 5 rows:
 0    1019108.0
1     225295.0
2     388869.0
3      45312.0
4       6963.0
Name: scored_by, dtype: float64
- Last 5 Rows: 
 28673   NaN
28674   NaN
28675   NaN
28676   NaN
28677   NaN
Name: scored_by, dtype: float64
- Number of unique values: 9103
- Unique values count: scored_by
127.0        42
158.0        40
129.0        38
147.0        38
150.0        37
             ..
470174.0      1
45312.0       1
388869.0      1
225295.0      1
1019108.0     1
Name: count, Length: 9103, dtype: int64
- Numer of missing values: 9936
- Min: 101.0
- Max: 2940011.0
- Mean: 29958.337057426597
- Median: 1554.0
- Mode: 0    127.0
Name: scored_by, dtype: float64
- Std: 121391.06413353194 


Notice that the number of missing values for `scored_by` is the same as for `score` (9936). I don't think it is appropriate to impute the missing values here either.
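Equal counts do not by themselves prove that the same rows are affected. A quick sketch of how to check that, on toy values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "score": [8.75, np.nan, 6.93, np.nan],
    "scored_by": [1019108.0, np.nan, 6963.0, np.nan],
})

# True only if every row is missing in both columns or present in both.
same_rows = bool((toy["score"].isna() == toy["scored_by"].isna()).all())

print(same_rows)
```

If this comparison holds on `data`, the two columns are missing together, which fits the explanation that those entries simply cannot be scored.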

**Feature `members`**

In [1429]:
describe_feature(data, 'members')

- Type: int64
- First 5 rows:
 0    1974537
1     397918
2     802844
3     123350
4      16303
Name: members, dtype: int64
- Last 5 Rows: 
 28673    2
28674    2
28675    1
28676    1
28677    0
Name: members, dtype: int64
- Number of unique values: 12163
- Unique values count: members
80        124
81        119
78        115
79        111
77        110
         ... 
33          1
12833       1
5645        1
3738        1
802844      1
Name: count, Length: 12163, dtype: int64
- Numer of missing values: 0
- Min: 0
- Max: 4170962
- Mean: 38645.44758291175
- Median: 1082.0
- Mode: 0    80
Name: members, dtype: int64
- Std: 166387.2296316623 


**Feature `rank`**

In [1430]:
describe_feature(data, 'rank')

- Type: float64
- First 5 rows:
 0      46.0
1     209.0
2     372.0
3    3223.0
4    4755.0
Name: rank, dtype: float64
- Last 5 Rows: 
 28673    NaN
28674    NaN
28675    NaN
28676    0.0
28677    0.0
Name: rank, dtype: float64
- Number of unique values: 17134
- Unique values count: rank
17354.0    4
14813.0    4
8022.0     4
18491.0    4
18884.0    4
          ..
450.0      1
1849.0     1
702.0      1
807.0      1
4755.0     1
Name: count, Length: 17134, dtype: int64
- Numer of missing values: 6744
- Min: 0.0
- Max: 21717.0
- Mean: 10858.345257826888
- Median: 10856.0
- Mode: 0     8022.0
1     9054.0
2    14441.0
3    14813.0
4    17354.0
5    17593.0
6    17863.0
7    18491.0
8    18884.0
Name: rank, dtype: float64
- Std: 6269.891837139401 


Note: Unreleased anime and Rx (hentai) anime are excluded from ranking.
Thus I don't think I should impute this data.
I wonder if an ML model can pick up on this rule.
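One way to check how far this rule explains the missing ranks is to compute the share of missing `rank` values per rating. The sketch below uses toy rows; the real shares would have to be computed on `data`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "rating": ["R - 17+ (violence & profanity)", "Rx - Hentai",
               "G - All Ages", "Rx - Hentai"],
    "rank": [46.0, np.nan, 4755.0, np.nan],
})

# Fraction of entries without a rank, grouped by rating.
missing_share = toy["rank"].isna().groupby(toy["rating"]).mean()

print(missing_share)
```

Grouping by `status` instead of `rating` would give the same kind of view for unreleased entries.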

**Feature `popularity`**

In [1431]:
describe_feature(data, 'popularity')

- Type: int64
- First 5 rows:
 0      42
1     642
2     260
3    1949
4    5654
Name: popularity, dtype: int64
- Last 5 Rows: 
 28673    28463
28674    28462
28675    28461
28676        0
28677        0
Name: popularity, dtype: int64
- Number of unique values: 21642
- Unique values count: popularity
16532    6
16266    5
20269    5
16340    5
18342    5
        ..
25041    1
27646    1
28361    1
23676    1
28325    1
Name: count, Length: 21642, dtype: int64
- Numer of missing values: 0
- Min: 0
- Max: 28463
- Mean: 14230.079890387859
- Median: 14230.5
- Mode: 0    16532
Name: popularity, dtype: int64
- Std: 8216.431813095754 


**Feature `favorites`**

In [1432]:
describe_feature(data, 'favorites')

- Type: int64
- First 5 rows:
 0    86585
1     1708
2    16846
3      671
4       16
Name: favorites, dtype: int64
- Last 5 Rows: 
 28673    0
28674    0
28675    0
28676    0
28677    0
Name: favorites, dtype: int64
- Number of unique values: 1975
- Unique values count: favorites
0       12659
1        2628
2        1316
3         853
4         621
        ...  
4182        1
848         1
357         1
2217        1
925         1
Name: count, Length: 1975, dtype: int64
- Numer of missing values: 0
- Min: 0
- Max: 238662
- Mean: 432.63023468240584
- Median: 1.0
- Mode: 0    0
Name: favorites, dtype: int64
- Std: 4506.529236556852 


**Feature `synopsis`**

In [1433]:
describe_feature(data, 'synopsis')

# Replace missing values with the placeholder text that already appears in the dataset
no_synopsis = 'No synopsis has been added for this series yet.\n\nClick here to update this information.'
data['synopsis'] = data['synopsis'].fillna(no_synopsis)
print(f"Number of missing values after cleaning: {data['synopsis'].isna().sum()}")

- Type: object
- First 5 rows:
 0    Crime is timeless. By the year 2071, humanity ...
1    Another day, another bounty—such is the life o...
2    Vash the Stampede is the man with a $$60,000,0...
3    Though hidden away from the general public, Wi...
4    It is the dark century and the people are suff...
Name: synopsis, dtype: object
- Last 5 Rows: 
 28673    Music video for the song Sora by BE:FIRST that...
28674    Music video for the song Shikushiku by Yoh Kam...
28675              Music video for Plants Song by CHI-MEY.
28676                                                  NaN
28677    A Japanese-style ayakashi Cinderella story! "Y...
Name: synopsis, dtype: object
- Number of unique values: 23131
- Unique values count: synopsis
No synopsis has been added for this series yet.\n\nClick here to update this information.                                                                                                                                                                       

**Feature `month`**

In [1434]:
describe_feature(data, 'month')

- Type: float64
- First 5 rows:
 0    4.0
1    9.0
2    4.0
3    7.0
4    9.0
Name: month, dtype: float64
- Last 5 Rows: 
 28673    6.0
28674    6.0
28675    6.0
28676    1.0
28677    NaN
Name: month, dtype: float64
- Number of unique values: 12
- Unique values count: month
1.0     5745
4.0     3265
10.0    3108
7.0     2780
12.0    1994
3.0     1945
8.0     1722
6.0     1522
9.0     1492
2.0     1478
11.0    1327
5.0     1208
Name: count, dtype: int64
- Numer of missing values: 878
- Min: 1.0
- Max: 12.0
- Mean: 5.7651344885086635
- Median: 6.0
- Mode: 0    1.0
Name: month, dtype: float64
- Std: 3.688561385580887 


Can't really do anything about the missing values here.

**Feature `year`**

In [1435]:
describe_feature(data, 'year')

- Type: float64
- First 5 rows:
 0    1998.0
1    2001.0
2    1998.0
3    2002.0
4    2004.0
Name: year, dtype: float64
- Last 5 Rows: 
 28673    2025.0
28674    2025.0
28675    2025.0
28676    2027.0
28677       NaN
Name: year, dtype: float64
- Number of unique values: 108
- Unique values count: year
2017.0    1257
2021.0    1237
2022.0    1231
2016.0    1227
2019.0    1217
          ... 
1937.0       2
1945.0       1
1944.0       1
1923.0       1
1922.0       1
Name: count, Length: 108, dtype: int64
- Numer of missing values: 878
- Min: 1917.0
- Max: 2027.0
- Mean: 2008.9338795040962
- Median: 2013.0
- Mode: 0    2017.0
Name: year, dtype: float64
- Std: 15.054737172612207 


Can't really do anything about missing values here

**Feature `season`**

In [1436]:
describe_feature(data, 'season')

- Type: object
- First 5 rows:
 0    spring
1       NaN
2    spring
3    summer
4      fall
Name: season, dtype: object
- Last 5 Rows: 
 28673    NaN
28674    NaN
28675    NaN
28676    NaN
28677    NaN
Name: season, dtype: object
- Number of unique values: 4
- Unique values count: season
spring    1951
fall      1786
winter    1268
summer    1144
Name: count, dtype: int64
- Numer of missing values: 22315


Notice there are 22315 missing values for `season` but only 878 for `month`. Since the season can be derived from the month,
this reduces the number of missing seasons to at most 878. So let's create a new feature.

Season: Months (numeric)
- `Winter`: 1 - 3
- `Spring`: 4 - 6
- `Summer`: 7 - 9
- `Fall`: 10 - 12

In [1437]:
def label_season(row):
    # Keep the scraped season when present; otherwise derive it from the month
    # using the ranges listed above.
    if pd.notnull(row['season']):
        return row['season'].title()
    elif 1 <= row['month'] <= 3:
        return 'Winter'
    elif 4 <= row['month'] <= 6:
        return 'Spring'
    elif 7 <= row['month'] <= 9:
        return 'Summer'
    elif 10 <= row['month'] <= 12:
        return 'Fall'
    else:
        # Both season and month are missing (comparisons with NaN are False)
        return 'Unknown'

data['new_season'] = data.apply(label_season, axis=1)
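The month ranges listed above (1 to 3 Winter, 4 to 6 Spring, 7 to 9 Summer, 10 to 12 Fall) can also be applied without a row-wise `apply`. The following is a sketch on toy values that assumes exactly those ranges; `pd.cut` does the binning, and the existing `season` value takes priority when it is present.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "season": ["spring", np.nan, np.nan, np.nan, np.nan],
    "month": [4.0, 1.0, 9.0, 12.0, np.nan],
})

# Bins are right-inclusive: (0, 3], (3, 6], (6, 9], (9, 12].
from_month = pd.cut(
    toy["month"],
    bins=[0, 3, 6, 9, 12],
    labels=["Winter", "Spring", "Summer", "Fall"],
).astype(object)

# Prefer the scraped season, then the month-derived one, then "Unknown".
new_season = toy["season"].str.title().fillna(from_month).fillna("Unknown")

print(new_season.tolist())
```

A missing month stays missing after `pd.cut`, so only rows lacking both `season` and `month` end up as `Unknown`.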

**Feature `licensors`**

In [1438]:
describe_feature(data, 'licensors')

- Type: object
- First 5 rows:
 0                                   ['Funimation']
1    ['Sony Pictures Entertainment', 'Funimation']
2                                   ['Funimation']
3           ['Funimation', 'Bandai Entertainment']
4                     ['Illumitoon Entertainment']
Name: licensors, dtype: object
- Last 5 Rows: 
 28673    []
28674    []
28675    []
28676    []
28677    []
Name: licensors, dtype: object
- Number of unique values: 301
- Unique values count: licensors
[]                                       23331
['Funimation']                             979
['Sentai Filmworks']                       868
['Discotek Media']                         301
['Aniplex of America']                     252
                                         ...  
['Crunchyroll', 'Muse Communication']        1
['Aniplex of America', 'Crunchyroll']        1
['Bandai Namco Online']                      1
['King Records']                             1
['Toho International']                  

**Feature `producers`**

In [1439]:
describe_feature(data, 'producers')

- Type: object
- First 5 rows:
 0    ['Bandai Visual', 'Victor Entertainment', 'Aud...
1                         ['Sunrise', 'Bandai Visual']
2                             ['Victor Entertainment']
3    ['Bandai Visual', 'Dentsu', 'Victor Entertainm...
4                               ['TV Tokyo', 'Dentsu']
Name: producers, dtype: object
- Last 5 Rows: 
 28673                            []
28674                            []
28675                            []
28676    ['Bandai Namco Filmworks']
28677                            []
Name: producers, dtype: object
- Number of unique values: 5132
- Unique values count: producers
[]                                                           15025
['NHK']                                                       1179
['Pink Pineapple']                                             273
['Sanrio']                                                     180
['bilibili']                                                   158
                                  

**Feature `studios`**

In [1440]:
describe_feature(data, 'studios')

- Type: object
- First 5 rows:
 0           ['Sunrise']
1             ['Bones']
2          ['Madhouse']
3           ['Sunrise']
4    ['Toei Animation']
Name: studios, dtype: object
- Last 5 Rows: 
 28673              []
28674              []
28675              []
28676    ['Madhouse']
28677              []
Name: studios, dtype: object
- Number of unique values: 1870
- Unique values count: studios
[]                                       11536
['Toei Animation']                         875
['Sunrise']                                556
['J.C.Staff']                              412
['TMS Entertainment']                      364
                                         ...  
['Yumeta Company', 'Maru Animation']         1
['Tatsunoko Production', 'SynergySP']        1
['Studio GOONEYS', 'Honoo']                  1
['studio42']                                 1
['Raiose']                                   1
Name: count, Length: 1870, dtype: int64
- Numer of missing values: 0
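The list-like columns (`title_synonyms`, `licensors`, `producers`, `studios`, and likewise `genres` and `themes`) have dtype `object` because each cell is a string such as `"['Sunrise']"`, not a Python list. The notebook imports `ast` at the top; assuming it is meant for this purpose, a minimal sketch of parsing such a column with `ast.literal_eval` on toy rows looks like this:

```python
import ast
import pandas as pd

toy = pd.DataFrame({
    "studios": ["['Sunrise']", "['Yumeta Company', 'Maru Animation']", "[]"],
})

# literal_eval only evaluates Python literals, so it does not run arbitrary code.
toy["studios"] = toy["studios"].apply(ast.literal_eval)

# One row per studio; empty lists become NaN and are ignored by value_counts.
studio_counts = toy["studios"].explode().value_counts()

print(studio_counts)
```

Counting after `explode` gives per-studio frequencies, whereas `value_counts` on the raw strings counts each distinct combination of studios separately.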


**Feature `genres`**

In [1441]:
describe_feature(data, 'genres')

- Type: object
- First 5 rows:
 0             ['Action', 'Award Winning', 'Sci-Fi']
1                              ['Action', 'Sci-Fi']
2                 ['Action', 'Adventure', 'Sci-Fi']
3    ['Action', 'Drama', 'Mystery', 'Supernatural']
4                ['Action', 'Adventure', 'Fantasy']
Name: genres, dtype: object
- Last 5 Rows: 
 28673                        []
28674                        []
28675                        []
28676                        []
28677    ['Fantasy', 'Romance']
Name: genres, dtype: object
- Number of unique values: 930
- Unique values count: genres
[]                                                                                 5875
['Comedy']                                                                         2681
['Fantasy']                                                                        1458
['Hentai']                                                                         1286
['Slice of Life']                                             

**Feature `themes`**

In [1442]:
describe_feature(data, 'themes')

- Type: object
- First 5 rows:
 0    ['Adult Cast', 'Space']
1    ['Adult Cast', 'Space']
2             ['Adult Cast']
3              ['Detective']
4                         []
Name: themes, dtype: object
- Last 5 Rows: 
 28673                   ['Music']
28674                   ['Music']
28675    ['Educational', 'Music']
28676                          []
28677                          []
Name: themes, dtype: object
- Number of unique values: 1026
- Unique values count: themes
[]                                                 11722
['Music']                                           3917
['School']                                           901
['Anthropomorphic']                                  873
['Historical']                                       850
                                                   ...  
['Idols (Female)', 'Music', 'Parody', 'School']        1
['Childcare', 'Love Status Quo']                       1
['Anthropomorphic', 'School', 'Urban Fantasy']         1
['Pet

# **Final preprocessing**

**Dataset for visualization**

In [1443]:

vis_data = data[data['status'] != "Not yet aired"]
vis_data = vis_data.drop(columns=['season', 'rating'])
vis_data = vis_data.rename(columns={'new_season': 'season', 'new_rating': 'rating'})

Now let's reorder the columns. Note that the selection below leaves out `synopsis`.

In [1444]:
vis_data.columns

Index(['mal_id', 'title', 'title_en', 'title_synonyms', 'type', 'source',
       'episodes', 'status', 'airing', 'duration', 'score', 'scored_by',
       'rank', 'popularity', 'members', 'favorites', 'synopsis', 'year',
       'month', 'licensors', 'producers', 'studios', 'genres', 'themes',
       'rating', 'season'],
      dtype='object')

In [None]:
vis_data = vis_data[['mal_id', 'title', 'title_en', 'title_synonyms', 'type', 'source', 'status', 'airing', 'month', 'year', 
'season', 'episodes', 'duration', 'score', 'scored_by', 'rank', 'members','popularity', 'favorites', 'genres', 'themes', 'rating', 
'licensors', 'producers', 'studios']]

Unnamed: 0,mal_id,title,title_en,title_synonyms,type,source,status,airing,month,year,...,rank,members,popularity,favorites,genres,themes,rating,licensors,producers,studios
0,1,Cowboy Bebop,Cowboy Bebop,[],TV,Original,Finished Airing,False,4.0,1998.0,...,46.0,1974537,42,86585,"['Action', 'Award Winning', 'Sci-Fi']","['Adult Cast', 'Space']",R,['Funimation'],"['Bandai Visual', 'Victor Entertainment', 'Aud...",['Sunrise']
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,"[""Cowboy Bebop: Knockin' on Heaven's Door""]",Movie,Original,Finished Airing,False,9.0,2001.0,...,209.0,397918,642,1708,"['Action', 'Sci-Fi']","['Adult Cast', 'Space']",R,"['Sony Pictures Entertainment', 'Funimation']","['Sunrise', 'Bandai Visual']",['Bones']
2,6,Trigun,Trigun,[],TV,Manga,Finished Airing,False,4.0,1998.0,...,372.0,802844,260,16846,"['Action', 'Adventure', 'Sci-Fi']",['Adult Cast'],PG-13,['Funimation'],['Victor Entertainment'],['Madhouse']
3,7,Witch Hunter Robin,Witch Hunter Robin,['WHR'],TV,Original,Finished Airing,False,7.0,2002.0,...,3223.0,123350,1949,671,"['Action', 'Drama', 'Mystery', 'Supernatural']",['Detective'],PG-13,"['Funimation', 'Bandai Entertainment']","['Bandai Visual', 'Dentsu', 'Victor Entertainm...",['Sunrise']
4,8,Bouken Ou Beet,Beet the Vandel Buster,['Adventure King Beet'],TV,Manga,Finished Airing,False,9.0,2004.0,...,4755.0,16303,5654,16,"['Action', 'Adventure', 'Fantasy']",[],PG,['Illumitoon Entertainment'],"['TV Tokyo', 'Dentsu']",['Toei Animation']
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28670,61796,Damashi Ai,Damashi Ai,['Damashiai'],Music,Original,Finished Airing,False,6.0,2025.0,...,,40,28436,0,[],['Music'],PG-13,[],[],[]
28671,61802,Dream Believers (Sakura Ver.),Dream Believers (Sakura Ver.),[],Music,Other,Finished Airing,False,6.0,2025.0,...,,45,28427,0,[],"['Idols (Female)', 'Music']",,[],[],[]
28673,61810,Sora,Sora,['Minna no Uta'],Music,Original,Finished Airing,False,6.0,2025.0,...,,2,28463,0,[],['Music'],G,[],[],[]
28674,61811,Shikushiku,Shikushiku,['Minna no Uta'],Music,Original,Finished Airing,False,6.0,2025.0,...,,2,28462,0,[],['Music'],G,[],[],[]


**Dataset for Machine Learning**