# **Preprocessing data for MAL Data Science Project**

In [1]:
import pandas as pd
import ast


**Data reading** <br>
The dataset was extracted with the JikanScrapper file. This notebook preprocesses each feature so that the data is ready for visualization and machine learning in later stages.

In [2]:
data = pd.read_csv('anime_data.csv')
columns_to_exclude = ['explict_genres']  # explict_genres is already covered by genres
data = data.drop(columns=columns_to_exclude)


# **1. Data exploration and cleaning the dataset** 

In [37]:
def describe_feature(data: pd.DataFrame, feature: str):
    print(f"- Type: {data.loc[:, feature].dtype}")
    print(f"- First 5 rows:\n {data[feature].head(5)}")
    print(f"- Last 5 Rows: \n {data[feature].tail(5)}")
    print(f"- Number of unique values: {data[feature].nunique()}")
    print(f"- Unique values count: {data[feature].value_counts()}" )
    print(f"- Numer of missing values: {data.loc[:, feature].isna().sum()}")
    
    if data[feature].dtype in ['int64', 'float64']:
        print(f"- Min: {data[feature].min()}")
        print(f"- Max: {data[feature].max()}")
        print(f"- Mean: {data[feature].mean()}")
        print(f"- Median: {data[feature].median()}")
        print(f"- Mode: {data[feature].mode()}")
        print(f"- Std: {data[feature].std()} ")


Now let's understand the contents of the dataset

In [38]:
print("Mal Anime Dataset")
print(f"Number of rows: {data.shape[0]}, Number of columns: {data.shape[1]}")
print(f"Column names: {data.columns}")
print(f"Number of missing values: {data.isna().sum().sum()}")

data = data.drop_duplicates()
print(f"\n\nNumber of rows: {data.shape[0]}, Number of columns: {data.shape[1]}")
print(f"Column names: {data.columns}")
print(f"Number of missing values: {data.isna().sum().sum()}")


Mal Anime Dataset
Number of rows: 28480, Number of columns: 30
Column names: Index(['mal_id', 'title', 'title_en', 'title_synonyms', 'type', 'source',
       'episodes', 'status', 'airing', 'duration', 'rating', 'score',
       'scored_by', 'rank', 'popularity', 'members', 'favorites', 'synopsis',
       'season', 'year', 'month', 'day', 'licensors', 'producers', 'studios',
       'genres', 'themes', 'demographics', 'new_rating', 'new_season'],
      dtype='object')
Number of missing values: 52292


Number of rows: 28480, Number of columns: 30
Column names: Index(['mal_id', 'title', 'title_en', 'title_synonyms', 'type', 'source',
       'episodes', 'status', 'airing', 'duration', 'rating', 'score',
       'scored_by', 'rank', 'popularity', 'members', 'favorites', 'synopsis',
       'season', 'year', 'month', 'day', 'licensors', 'producers', 'studios',
       'genres', 'themes', 'demographics', 'new_rating', 'new_season'],
      dtype='object')
Number of missing values: 52292


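The raw scrape contained 214 duplicate entries, which `drop_duplicates()` removes. The row counts printed above are identical before and after, presumably because the cell was re-run on data that had already been de-duplicated (the column list already includes `new_rating` and `new_season`).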
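To see how many rows are duplicates before dropping them, `DataFrame.duplicated()` can be counted first. The snippet below is a minimal sketch on a toy frame (`toy` and its values are illustrative stand-ins, not the scraped data):

```python
import pandas as pd

# Toy stand-in for the scraped table (values are illustrative only)
toy = pd.DataFrame({
    "mal_id": [1, 5, 5, 6],
    "title": ["Cowboy Bebop", "Cowboy Bebop: Tengoku no Tobira",
              "Cowboy Bebop: Tengoku no Tobira", "Trigun"],
})

# duplicated() flags every repeat of an earlier, fully identical row
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(f"Duplicate rows: {n_duplicates}")
print(f"Rows before: {len(toy)}, rows after: {len(deduped)}")
```

Running the same two lines on `data` right after `pd.read_csv` would report the duplicate count for the real dataset.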

**Attribute Information**
- **`mal_id`**: Unique anime ID on MyAnimeList.
- **`title`**: The anime's primary title.
- **`title_en`**: English title of the anime.
- **`title_synonyms`**: List of common abbreviations for the title.
- **`type`**: Media type.
- **`source`**: Original source material.
- **`episodes`**: Total number of episodes.
- **`status`**: Current airing status.
- **`airing`**: Whether the anime is currently airing.
- **`duration`**: Length of each episode.
- **`rating`**: Age rating.
- **`score`**: Average user score (out of 10).
- **`scored_by`**: Number of users who rated the anime.
- **`rank`**: Rank based on score.
- **`popularity`**: Rank based on number of members.
- **`members`**: Total number of users who added the anime to their lists.
- **`favorites`**: Number of users who marked the anime as a favorite.
- **`synopsis`**: Short summary of the anime.
- **`season`**: Season the anime came out in.
- **`year`**: Year the anime came out in.
- **`month`**: Month the anime came out in.
- **`day`**: Day of the month the anime came out on.
- **`licensors`**: Companies licensed to distribute the anime.
- **`producers`**: Companies involved in funding or planning the anime.
- **`studios`**: Animation studios responsible for production.
- **`genres`**: Core genres of the anime.
- **`themes`**: Narrative or setting themes.
- **`demographics`**: Target audience.

In [39]:
data.describe()

| | mal_id | episodes | score | scored_by | rank | popularity | members | favorites | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 28480.0 | 27798.0 | 18536.0 | 18536.0 | 21723.0 | 28480.0 | 28480.0 | 28480.0 | 27601.0 | 27601.0 | 27601.0 |
| mean | 33472.908357 | 14.191273 | 6.396206 | 29957.89 | 10858.333103 | 14238.443645 | 38641.53 | 432.563202 | 2008.942104 | 5.765226 | 12.244882 |
| std | 19414.269919 | 47.421129 | 0.889624 | 121410.0 | 6267.175893 | 8220.569563 | 166403.5 | 4506.827924 | 15.054887 | 3.688047 | 9.8278 |
| min | 1.0 | 1.0 | 1.89 | 101.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1917.0 | 1.0 | 1.0 |
| 25% | 14528.5 | 1.0 | 5.78 | 340.0 | 5430.0 | 7119.0 | 235.0 | 0.0 | 2003.0 | 2.0 | 3.0 |
| 50% | 37554.5 | 2.0 | 6.37 | 1552.5 | 10860.0 | 14238.5 | 1081.5 | 1.0 | 2013.0 | 6.0 | 10.0 |
| 75% | 49691.25 | 13.0 | 7.03 | 10131.25 | 16291.5 | 21356.5 | 9165.25 | 17.0 | 2019.0 | 9.0 | 21.0 |
| max | 61841.0 | 3057.0 | 9.3 | 2940975.0 | 21718.0 | 28479.0 | 4172424.0 | 238744.0 | 2027.0 | 12.0 | 31.0 |


**Feature `mal_id`**

In [40]:
describe_feature(data, 'mal_id')

- Type: int64
- First 5 rows:
 0    1
1    5
2    6
3    7
4    8
Name: mal_id, dtype: int64
- Last 5 Rows: 
 28689    61831
28690    61834
28692    61839
28693    61840
28694    61841
Name: mal_id, dtype: int64
- Number of unique values: 28480
- Unique values count: mal_id
61812    1
61811    1
61810    1
61806    1
61805    1
        ..
8        1
7        1
6        1
5        1
1        1
Name: count, Length: 28480, dtype: int64
- Numer of missing values: 0
- Min: 1
- Max: 61841
- Mean: 33472.90835674157
- Median: 37554.5
- Mode: 0            1
1            5
2            6
3            7
4            8
         ...  
28475    61831
28476    61834
28477    61839
28478    61840
28479    61841
Name: mal_id, Length: 28480, dtype: int64
- Std: 19414.269918717026 


**Feature `title`**

In [41]:
describe_feature(data, 'title')

- Type: object
- First 5 rows:
 0                       Cowboy Bebop
1    Cowboy Bebop: Tengoku no Tobira
2                             Trigun
3                 Witch Hunter Robin
4                     Bouken Ou Beet
Name: title, dtype: object
- Last 5 Rows: 
 28689    Saikyou no Ousama, Nidome no Jinsei wa Nani wo...
28690    Golden Kamuy: Inazuma Goutou to Mamushi no Ogi...
28692                         Aishiteru Game wo Owarasetai
28693                          Yuanshen: Chen Jian Xing Lu
28694                                              Katachi
Name: title, dtype: object
- Number of unique values: 28478
- Unique values count: title
Shen Lan Qi Yu Wushuang Zhu: Tianmo Pian                                2
Moonlight                                                               2
R11R Concept Movie                                                      1
.hack//Sign                                                             1
Ghost                                                      

Some titles are identical because some anime do not differentiate between seasons in their titles.

**Feature `title_en`**

In [42]:
describe_feature(data, 'title_en')

# Fill missing English titles with the original title
data['title_en'] = data['title_en'].fillna(data['title'])
print(f"Number of missing values after cleaning: {data['title_en'].isna().sum()}")

- Type: object
- First 5 rows:
 0               Cowboy Bebop
1    Cowboy Bebop: The Movie
2                     Trigun
3         Witch Hunter Robin
4     Beet the Vandel Buster
Name: title_en, dtype: object
- Last 5 Rows: 
 28689                 The Beginning After the End Season 2
28690    Golden Kamuy: Inazuma Goutou to Mamushi no Ogi...
28692                         Aishiteru Game wo Owarasetai
28693                         Genshin Impact: Star Odyssey
28694                                       Pain Give Form
Name: title_en, dtype: object
- Number of unique values: 28232
- Unique values count: title_en
Promise                                           5
Spirit Guardians                                  5
Cyborg 009                                        4
Meow Meow Japanese History                        4
Magical Legend: Rise to Immortality               4
                                                 ..
Zipang                                            1
Neon Genesis Evangelio

**Feature `title_synonyms`**

In [43]:
describe_feature(data, 'title_synonyms')

- Type: object
- First 5 rows:
 0                                             []
1    ["Cowboy Bebop: Knockin' on Heaven's Door"]
2                                             []
3                                        ['WHR']
4                        ['Adventure King Beet']
Name: title_synonyms, dtype: object
- Last 5 Rows: 
 28689                                            ['TBATE']
28690    ['Golden Kamuy: The Lightning Bandit and Ogin ...
28692    ['I Want to End This Love Game', 'I Want to En...
28693                               ['【原神】「Star Odyssey」']
28694                                                   []
Name: title_synonyms, dtype: object
- Number of unique values: 12936
- Unique values count: title_synonyms
[]                                             14368
['Minna no Uta']                                 810
['Irodorimidori']                                 49
['One Piece Recap', 'One Piece Special']          11
['Hirake! Ponkikki']                              10
   

**Feature `type`**

In [44]:
describe_feature(data, 'type')

# Replace missing values with an empty string
data['type'] = data['type'].fillna('')
print(f"Number of missing values after cleaning: {data['type'].isna().sum()}")


- Type: object
- First 5 rows:
 0       TV
1    Movie
2       TV
3       TV
4       TV
Name: type, dtype: object
- Last 5 Rows: 
 28689       TV
28690      OVA
28692       TV
28693      ONA
28694    Music
Name: type, dtype: object
- Number of unique values: 10
- Unique values count: type
TV            8287
Movie         4852
OVA           4171
ONA           4003
Music         3876
Special       1762
TV Special     747
CM             460
PV             256
                66
Name: count, dtype: int64
- Numer of missing values: 0
Number of missing values after cleaning: 0


**Feature `source`**

In [45]:
describe_feature(data, 'source')

- Type: object
- First 5 rows:
 0    Original
1    Original
2       Manga
3    Original
4       Manga
Name: source, dtype: object
- Last 5 Rows: 
 28689       Other
28690       Manga
28692       Manga
28693        Game
28694    Original
Name: source, dtype: object
- Number of unique values: 17
- Unique values count: source
Original        12102
Manga            5434
Unknown          2818
Game             1447
Other            1288
Light novel      1165
Visual novel     1155
Novel             814
Web manga         637
4-koma manga      334
Picture book      277
Music             263
Mixed media       232
Book              219
Web novel         204
Card game          77
Radio              14
Name: count, dtype: int64
- Numer of missing values: 0


**Feature `episodes`**

In [46]:
describe_feature(data,'episodes')
# TODO: some episode counts are missing; handle them later

- Type: float64
- First 5 rows:
 0    26.0
1     1.0
2    26.0
3    26.0
4    52.0
Name: episodes, dtype: float64
- Last 5 Rows: 
 28689    NaN
28690    1.0
28692    NaN
28693    1.0
28694    1.0
Name: episodes, dtype: float64
- Number of unique values: 259
- Unique values count: episodes
1.0      13633
12.0      2279
2.0       1637
26.0      1332
13.0      1111
         ...  
247.0        1
325.0        1
157.0        1
280.0        1
123.0        1
Name: count, Length: 259, dtype: int64
- Numer of missing values: 682
- Min: 1.0
- Max: 3057.0
- Mean: 14.191272753435499
- Median: 2.0
- Mode: 0    1.0
Name: episodes, dtype: float64
- Std: 47.42112887503609 


Some episode counts are missing (682 rows). Let's figure out why and where.
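One way to locate the missing episode counts is to break them down by airing status. This is only a sketch on a toy frame (`toy` and its values are made up for illustration); on the real data the same expression would be applied to `data`:

```python
import numpy as np
import pandas as pd

# Toy stand-in: two rows without an episode count
toy = pd.DataFrame({
    "status": ["Finished Airing", "Not yet aired", "Currently Airing", "Not yet aired"],
    "episodes": [26.0, np.nan, np.nan, 1.0],
})

# Count missing episode values within each airing status
missing_by_status = toy["episodes"].isna().groupby(toy["status"]).sum()
print(missing_by_status)
```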

**Feature `status`**

In [47]:
describe_feature(data, 'status')

not_yet_aired = data[data['status'] == 'Not yet aired']
print("\nNotice when status == 'Not yet aired'")
print(f"  - Number of rows: {not_yet_aired.shape[0]}")
print(f"  - Number of missing values: {not_yet_aired.isna().sum().sum()}")


- Type: object
- First 5 rows:
 0    Finished Airing
1    Finished Airing
2    Finished Airing
3    Finished Airing
4    Finished Airing
Name: status, dtype: object
- Last 5 Rows: 
 28689      Not yet aired
28690      Not yet aired
28692      Not yet aired
28693    Finished Airing
28694    Finished Airing
Name: status, dtype: object
- Number of unique values: 3
- Unique values count: status
Finished Airing     27600
Not yet aired         526
Currently Airing      354
Name: count, dtype: int64
- Numer of missing values: 0

Notice when status == 'Not yet aired'
  - Number of rows: 526
  - Number of missing values: 3211


Notice that unaired anime also have a lot of missing features, partly because fields such as `year` and `month` have not been filled in yet. <br>
For data visualization we only want anime that are currently airing or have finished airing, so a separate CSV file will be created.
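To see which features are responsible for the missing values among unaired anime, the missing counts can be listed per column. A minimal sketch, using a toy frame whose rows are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in with two unaired entries
toy = pd.DataFrame({
    "status": ["Finished Airing", "Not yet aired", "Not yet aired"],
    "year": [1998.0, np.nan, 2026.0],
    "month": [4.0, np.nan, 4.0],
    "score": [8.75, np.nan, np.nan],
})

# Restrict to unaired rows, then count missing values column by column
unaired = toy[toy["status"] == "Not yet aired"]
missing_per_column = unaired.isna().sum().sort_values(ascending=False)
print(missing_per_column)
```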

**Feature `duration`**

In [48]:
describe_feature(data, 'duration')

- Type: object
- First 5 rows:
 0    24 min per ep
1      1 hr 55 min
2    24 min per ep
3    25 min per ep
4    23 min per ep
Name: duration, dtype: object
- Last 5 Rows: 
 28689    Unknown
28690     30 min
28692    Unknown
28693      3 min
28694      3 min
Name: duration, dtype: object
- Number of unique values: 340
- Unique values count: duration
24 min per ep         2084
23 min per ep         1811
2 min                 1640
3 min                 1511
4 min                 1178
                      ... 
2 hr 31 min              1
26 sec                   1
2 hr 25 min              1
1 hr 55 min per ep       1
2 hr 45 min              1
Name: count, Length: 340, dtype: int64
- Numer of missing values: 0
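The durations are stored as free text (`24 min per ep`, `1 hr 55 min`, `26 sec`, `Unknown`). The notebook leaves them as they are; if a numeric value is needed later, one possible approach is sketched below. The helper `duration_to_minutes` is hypothetical, and it ignores the distinction between "per ep" and total running time:

```python
import re

def duration_to_minutes(text):
    """Convert strings such as '1 hr 55 min' or '26 sec' to minutes.

    Returns None when no hr/min/sec amount is found (e.g. 'Unknown').
    """
    units = {"hr": 60.0, "min": 1.0, "sec": 1.0 / 60.0}
    parts = re.findall(r"(\d+)\s*(hr|min|sec)", text)
    if not parts:
        return None
    return sum(int(value) * units[unit] for value, unit in parts)

# Example strings taken from the output above
examples = ["24 min per ep", "1 hr 55 min", "26 sec", "Unknown"]
minutes = [duration_to_minutes(e) for e in examples]
print(minutes)
```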


**Feature `rating`**

In [49]:
describe_feature(data, 'rating')

# Replace missing values with an empty string
data['rating'] = data['rating'].fillna('')

- Type: object
- First 5 rows:
 0    R - 17+ (violence & profanity)
1    R - 17+ (violence & profanity)
2         PG-13 - Teens 13 or older
3         PG-13 - Teens 13 or older
4                     PG - Children
Name: rating, dtype: object
- Last 5 Rows: 
 28689    PG-13 - Teens 13 or older
28690                             
28692                             
28693    PG-13 - Teens 13 or older
28694    PG-13 - Teens 13 or older
Name: rating, dtype: object
- Number of unique values: 7
- Unique values count: rating
PG-13 - Teens 13 or older         10197
G - All Ages                       8936
PG - Children                      4386
Rx - Hentai                        1566
R - 17+ (violence & profanity)     1565
R+ - Mild Nudity                   1211
                                    619
Name: count, dtype: int64
- Numer of missing values: 0


Create a new feature, `new_rating`, that collapses the rating into `G`, `PG`, `PG-13`, and `R` (`R+` and `Rx` are both folded into `R`).

In [50]:
# Keep only the short code before the first space (e.g. 'PG-13'),
# then strip the trailing 'x' / '+' so that 'Rx' and 'R+' both become 'R'.
# Empty strings (missing ratings) stay empty.
data['new_rating'] = (
    data['rating']
    .str.split(' ').str[0]
    .str.rstrip('x')
    .str.rstrip('+')
)
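As a quick check of the rule, the same string handling can be run on the rating labels that appear in the dataset. The helper name `simplify_rating` is introduced here for illustration only:

```python
def simplify_rating(rating: str) -> str:
    """Return the code before the first space, with trailing 'x' and '+' stripped."""
    if rating == "":
        return ""
    return rating.split(" ")[0].rstrip("x").rstrip("+")

# Rating labels from the value counts above, plus the empty placeholder
ratings = [
    "G - All Ages",
    "PG - Children",
    "PG-13 - Teens 13 or older",
    "R - 17+ (violence & profanity)",
    "R+ - Mild Nudity",
    "Rx - Hentai",
    "",
]
simplified = [simplify_rating(r) for r in ratings]
print(simplified)
```

Note that `R`, `R+`, and `Rx` all end up as `R`, so the original `rating` column is still needed whenever those groups have to be told apart.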


**Feature `score`**

In [51]:
describe_feature(data, 'score')

- Type: float64
- First 5 rows:
 0    8.75
1    8.38
2    8.22
3    7.24
4    6.92
Name: score, dtype: float64
- Last 5 Rows: 
 28689   NaN
28690   NaN
28692   NaN
28693   NaN
28694   NaN
Name: score, dtype: float64
- Number of unique values: 568
- Unique values count: score
6.33    102
6.30    101
6.52    101
6.42     97
6.21     96
       ... 
3.54      1
4.01      1
3.07      1
8.80      1
8.82      1
Name: count, Length: 568, dtype: int64
- Numer of missing values: 9944
- Min: 1.89
- Max: 9.3
- Mean: 6.396205761760898
- Median: 6.37
- Mode: 0    6.33
Name: score, dtype: float64
- Std: 0.8896236432696691 


Note: MyAnimeList doesn't allow scoring for unreleased anime and 18+ anime, so I don't think it is appropriate to impute these values.

**Feature `scored_by`**

In [52]:
describe_feature(data, 'scored_by')

- Type: float64
- First 5 rows:
 0    1019534.0
1     225356.0
2     388985.0
3      45317.0
4       6964.0
Name: scored_by, dtype: float64
- Last 5 Rows: 
 28689   NaN
28690   NaN
28692   NaN
28693   NaN
28694   NaN
Name: scored_by, dtype: float64
- Number of unique values: 9156
- Unique values count: scored_by
127.0       42
147.0       40
129.0       39
137.0       38
158.0       38
            ..
85704.0      1
91816.0      1
6964.0       1
45317.0      1
388985.0     1
Name: count, Length: 9156, dtype: int64
- Numer of missing values: 9944
- Min: 101.0
- Max: 2940975.0
- Mean: 29957.893828226155
- Median: 1552.5
- Mode: 0    127.0
Name: scored_by, dtype: float64
- Std: 121410.00601762808 


Notice that `scored_by` has the same number of missing values as `score` (9944). I don't think it is appropriate to impute the missing values here either.
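Equal counts do not by themselves prove that the same rows are affected. A small sketch of how that could be verified, shown on a toy frame with invented values (on the real data, replace `toy` with `data`):

```python
import numpy as np
import pandas as pd

# Toy stand-in where score and scored_by are missing on the same rows
toy = pd.DataFrame({
    "score": [8.75, np.nan, 6.92, np.nan],
    "scored_by": [1019534.0, np.nan, 6964.0, np.nan],
})

# True only if every row is either missing in both columns or present in both
same_rows_missing = bool((toy["score"].isna() == toy["scored_by"].isna()).all())
print(same_rows_missing)
```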

**Feature `members`**

In [53]:
describe_feature(data, 'members')

- Type: int64
- First 5 rows:
 0    1975327
1     398043
2     803144
3     123394
4      16305
Name: members, dtype: int64
- Last 5 Rows: 
 28689    4673
28690       6
28692       0
28693      54
28694       1
Name: members, dtype: int64
- Number of unique values: 12167
- Unique values count: members
80       128
82       117
78       115
83       115
81       115
        ... 
14388      1
6306       1
13037      1
13868      1
1          1
Name: count, Length: 12167, dtype: int64
- Numer of missing values: 0
- Min: 0
- Max: 4172424
- Mean: 38641.52981039326
- Median: 1081.5
- Mode: 0    80
Name: members, dtype: int64
- Std: 166403.524358484 


**Feature `rank`**

In [54]:
describe_feature(data, 'rank')

- Type: float64
- First 5 rows:
 0      46.0
1     209.0
2     371.0
3    3221.0
4    4765.0
Name: rank, dtype: float64
- Last 5 Rows: 
 28689        NaN
28690        NaN
28692        0.0
28693    16429.0
28694        NaN
Name: rank, dtype: float64
- Number of unique values: 17702
- Unique values count: rank
17593.0    5
18206.0    5
20218.0    4
21240.0    4
20076.0    4
          ..
2972.0     1
1186.0     1
18466.0    1
16309.0    1
16308.0    1
Name: count, Length: 17702, dtype: int64
- Numer of missing values: 6757
- Min: 0.0
- Max: 21718.0
- Mean: 10858.333103162546
- Median: 10860.0
- Mode: 0    17593.0
1    18206.0
Name: rank, dtype: float64
- Std: 6267.175893049121 


Note: Unreleased anime and Rx (hentai) anime are excluded from ranking.
Thus I don't think I should impute this data.
I wonder if ML can pick up on this rule.
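The note above can be checked by computing, for each rating, the share of rows without a rank. This is a sketch on a toy frame with made-up rows; it uses the original `rating` column because `new_rating` merges `Rx` into `R`:

```python
import numpy as np
import pandas as pd

# Toy stand-in: Rx entries have no rank, a regular entry does
toy = pd.DataFrame({
    "rating": ["Rx - Hentai", "PG-13 - Teens 13 or older", "Rx - Hentai", "G - All Ages"],
    "rank": [np.nan, 46.0, np.nan, 4765.0],
})

# Fraction of rows with a missing rank, per rating label
share_unranked = toy["rank"].isna().groupby(toy["rating"]).mean()
print(share_unranked)
```

A share close to 1.0 for `Rx - Hentai` on the real data would support the rule.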

**Feature `popularity`**

In [55]:
describe_feature(data, 'popularity')

- Type: int64
- First 5 rows:
 0      42
1     642
2     260
3    1950
4    5654
Name: popularity, dtype: int64
- Last 5 Rows: 
 28689     9289
28690    28474
28692        0
28693    28435
28694    28479
Name: popularity, dtype: int64
- Number of unique values: 21951
- Unique values count: popularity
21050    5
20620    5
24775    5
23898    5
19611    5
        ..
18653    1
15648    1
13260    1
28332    1
1353     1
Name: count, Length: 21951, dtype: int64
- Numer of missing values: 0
- Min: 0
- Max: 28479
- Mean: 14238.44364466292
- Median: 14238.5
- Mode: 0    16535
1    17983
2    18857
3    19611
4    20620
5    21050
6    23898
7    24775
Name: popularity, dtype: int64
- Std: 8220.569563104254 


**Feature `favorites`**

In [56]:
describe_feature(data, 'favorites')

- Type: int64
- First 5 rows:
 0    86598
1     1708
2    16857
3      673
4       16
Name: favorites, dtype: int64
- Last 5 Rows: 
 28689    2
28690    0
28692    0
28693    0
28694    0
Name: favorites, dtype: int64
- Number of unique values: 1956
- Unique values count: favorites
0         12669
1          2629
2          1318
3           851
4           620
          ...  
5311          1
3048          1
238744        1
83347         1
16857         1
Name: count, Length: 1956, dtype: int64
- Numer of missing values: 0
- Min: 0
- Max: 238744
- Mean: 432.563202247191
- Median: 1.0
- Mode: 0    0
Name: favorites, dtype: int64
- Std: 4506.827923670191 


**Feature `synopsis`**

In [57]:
describe_feature(data, 'synopsis')

# Replace missing values with the placeholder text already used in the dataset
no_synopsis = 'No synopsis has been added for this series yet.\n\nClick here to update this information.'
data['synopsis'] = data['synopsis'].fillna(no_synopsis)
print(f"Number of missing values after cleaning: {data['synopsis'].isna().sum()}")

- Type: object
- First 5 rows:
 0    Crime is timeless. By the year 2071, humanity ...
1    Another day, another bounty—such is the life o...
2    Vash the Stampede is the man with a $$60,000,0...
3    Though hidden away from the general public, Wi...
4    It is the dark century and the people are suff...
Name: synopsis, dtype: object
- Last 5 Rows: 
 28689    Second season of Saikyou no Ousama, Nidome no ...
28690    A original anime episode that will adapt the I...
28692    In sixth grade, childhood friends Yukiya Asagi...
28693    Animated short film about the backstory of Ski...
28694        Music video for the song Katachi by ZUTOMAYO.
Name: synopsis, dtype: object
- Number of unique values: 23147
- Unique values count: synopsis
No synopsis has been added for this series yet.\n\nClick here to update this information.                                                                                                                                                                       

**Feature `month`**

In [58]:
describe_feature(data, 'month')

- Type: float64
- First 5 rows:
 0    4.0
1    9.0
2    4.0
3    7.0
4    9.0
Name: month, dtype: float64
- Last 5 Rows: 
 28689     4.0
28690    10.0
28692     NaN
28693     6.0
28694     6.0
Name: month, dtype: float64
- Number of unique values: 12
- Unique values count: month
1.0     5746
4.0     3269
10.0    3110
7.0     2780
12.0    1994
3.0     1945
8.0     1722
6.0     1528
9.0     1492
2.0     1478
11.0    1328
5.0     1209
Name: count, dtype: int64
- Numer of missing values: 879
- Min: 1.0
- Max: 12.0
- Mean: 5.765225897612406
- Median: 6.0
- Mode: 0    1.0
Name: month, dtype: float64
- Std: 3.6880470059355477 


Can't really do anything about the missing values here.

**Feature `year`**

In [59]:
describe_feature(data, 'year')

- Type: float64
- First 5 rows:
 0    1998.0
1    2001.0
2    1998.0
3    2002.0
4    2004.0
Name: year, dtype: float64
- Last 5 Rows: 
 28689    2026.0
28690    2025.0
28692       NaN
28693    2025.0
28694    2025.0
Name: year, dtype: float64
- Number of unique values: 108
- Unique values count: year
2017.0    1257
2021.0    1237
2022.0    1231
2016.0    1227
2019.0    1217
          ... 
1937.0       2
1945.0       1
1944.0       1
1923.0       1
1922.0       1
Name: count, Length: 108, dtype: int64
- Numer of missing values: 879
- Min: 1917.0
- Max: 2027.0
- Mean: 2008.942103546973
- Median: 2013.0
- Mode: 0    2017.0
Name: year, dtype: float64
- Std: 15.05488687748169 


Can't really do anything about the missing values here either.

**Feature `season`**

In [60]:
describe_feature(data, 'season')

- Type: object
- First 5 rows:
 0    spring
1       NaN
2    spring
3    summer
4      fall
Name: season, dtype: object
- Last 5 Rows: 
 28689    spring
28690       NaN
28692       NaN
28693       NaN
28694       NaN
Name: season, dtype: object
- Number of unique values: 4
- Unique values count: season
spring    1952
fall      1787
winter    1269
summer    1144
Name: count, dtype: int64
- Numer of missing values: 22328


Notice that `season` has 22328 missing values while `month` has only 879. We can therefore derive the season from the month,
which reduces the number of missing values to 879. So let's create a new feature.

Season: Months (numeric)
- `Winter`: 1 - 3
- `Spring`: 4 - 6
- `Summer`: 7 - 9
- `Fall`: 10 - 12

In [61]:
def label_season(row):
    # Keep the season already present in the data when it exists
    if pd.notnull(row['season']):
        return row['season'].title()
    # Otherwise derive it from the month, using the ranges listed above
    elif 1 <= row['month'] <= 3:
        return 'Winter'
    elif 4 <= row['month'] <= 6:
        return 'Spring'
    elif 7 <= row['month'] <= 9:
        return 'Summer'
    elif 10 <= row['month'] <= 12:
        return 'Fall'
    else:
        # month is missing as well (comparisons with NaN are False)
        return 'Unknown'

data['new_season'] = data.apply(label_season, axis=1)
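The same idea can be sketched without a row-wise `apply` over the whole frame, using the month ranges from the list above (Winter 1-3, Spring 4-6, Summer 7-9, Fall 10-12). The names `SEASONS` and `season_from_month` and the toy rows are assumptions made for this example:

```python
import numpy as np
import pandas as pd

SEASONS = ["Winter", "Spring", "Summer", "Fall"]

def season_from_month(month):
    """Map a month number (1-12) to its season; None when the month is missing."""
    if pd.isna(month):
        return None
    return SEASONS[(int(month) - 1) // 3]

# Toy stand-in: one row with a known season, three without
toy = pd.DataFrame({
    "season": ["spring", np.nan, np.nan, np.nan],
    "month": [4.0, 9.0, 12.0, np.nan],
})

# Prefer the existing season, fall back to the month, then to 'Unknown'
from_month = toy["month"].map(season_from_month)
toy["new_season"] = toy["season"].str.title().fillna(from_month).fillna("Unknown")
print(toy["new_season"].tolist())
```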

**Feature `licensors`**

In [62]:
describe_feature(data, 'licensors')

- Type: object
- First 5 rows:
 0                                   ['Funimation']
1    ['Sony Pictures Entertainment', 'Funimation']
2                                   ['Funimation']
3           ['Funimation', 'Bandai Entertainment']
4                     ['Illumitoon Entertainment']
Name: licensors, dtype: object
- Last 5 Rows: 
 28689    []
28690    []
28692    []
28693    []
28694    []
Name: licensors, dtype: object
- Number of unique values: 301
- Unique values count: licensors
[]                                       23347
['Funimation']                             979
['Sentai Filmworks']                       868
['Discotek Media']                         301
['Aniplex of America']                     252
                                         ...  
['Crunchyroll', 'Muse Communication']        1
['Aniplex of America', 'Crunchyroll']        1
['Bandai Namco Online']                      1
['King Records']                             1
['Toho International']                  

**Feature `producers`**

In [63]:
describe_feature(data, 'producers')

- Type: object
- First 5 rows:
 0    ['Bandai Visual', 'Victor Entertainment', 'Aud...
1                         ['Sunrise', 'Bandai Visual']
2                             ['Victor Entertainment']
3    ['Bandai Visual', 'Dentsu', 'Victor Entertainm...
4                               ['TV Tokyo', 'Dentsu']
Name: producers, dtype: object
- Last 5 Rows: 
 28689                 []
28690                 []
28692                 []
28693    ['miHoYoAnime']
28694                 []
Name: producers, dtype: object
- Number of unique values: 5136
- Unique values count: producers
[]                                                           15032
['NHK']                                                       1179
['Pink Pineapple']                                             273
['Sanrio']                                                     180
['bilibili']                                                   158
                                                             ...  
['Toyro Music', 'Shima

**Feature `studios`**

In [64]:
describe_feature(data, 'studios')

- Type: object
- First 5 rows:
 0           ['Sunrise']
1             ['Bones']
2          ['Madhouse']
3           ['Sunrise']
4    ['Toei Animation']
Name: studios, dtype: object
- Last 5 Rows: 
 28689    []
28690    []
28692    []
28693    []
28694    []
Name: studios, dtype: object
- Number of unique values: 1870
- Unique values count: studios
[]                                       11548
['Toei Animation']                         875
['Sunrise']                                557
['J.C.Staff']                              412
['TMS Entertainment']                      364
                                         ...  
['Yumeta Company', 'Maru Animation']         1
['Tatsunoko Production', 'SynergySP']        1
['Studio GOONEYS', 'Honoo']                  1
['studio42']                                 1
['Raiose']                                   1
Name: count, Length: 1870, dtype: int64
- Numer of missing values: 0
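Columns such as `studios`, `genres`, and `themes` hold Python-style lists stored as strings, which is why an empty list shows up as the text `[]`. The `ast` module imported at the top can turn them into real lists. A minimal sketch on a toy frame (the `studios_list` column name is hypothetical):

```python
import ast
import pandas as pd

# Toy stand-in using string values copied from the output above
toy = pd.DataFrame({
    "studios": ["['Sunrise']", "['Yumeta Company', 'Maru Animation']", "[]"],
})

# literal_eval safely parses the string representation into a Python list
toy["studios_list"] = toy["studios"].apply(ast.literal_eval)

# explode() gives one row per studio, so individual studios can be counted
studio_counts = toy["studios_list"].explode().value_counts()
print(studio_counts)
```

Counting after `explode()` treats co-productions as one entry per studio, rather than as a separate combined category.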


**Feature `genres`**

In [65]:
describe_feature(data, 'genres')

- Type: object
- First 5 rows:
 0             ['Action', 'Award Winning', 'Sci-Fi']
1                              ['Action', 'Sci-Fi']
2                 ['Action', 'Adventure', 'Sci-Fi']
3    ['Action', 'Drama', 'Mystery', 'Supernatural']
4                ['Action', 'Adventure', 'Fantasy']
Name: genres, dtype: object
- Last 5 Rows: 
 28689                ['Fantasy']
28690    ['Action', 'Adventure']
28692      ['Comedy', 'Romance']
28693                ['Fantasy']
28694                ['Fantasy']
Name: genres, dtype: object
- Number of unique values: 931
- Unique values count: genres
[]                                                                                 5882
['Comedy']                                                                         2679
['Fantasy']                                                                        1461
['Hentai']                                                                         1286
['Slice of Life']                                        

**Feature `themes`**

In [66]:
describe_feature(data, 'themes')

- Type: object
- First 5 rows:
 0    ['Adult Cast', 'Space']
1    ['Adult Cast', 'Space']
2             ['Adult Cast']
3              ['Detective']
4                         []
Name: themes, dtype: object
- Last 5 Rows: 
 28689                 ['Isekai', 'Reincarnation']
28690    ['Adult Cast', 'Historical', 'Military']
28692                                  ['School']
28693                                  ['Isekai']
28694                                   ['Music']
Name: themes, dtype: object
- Number of unique values: 1025
- Unique values count: themes
[]                                                 11724
['Music']                                           3925
['School']                                           902
['Anthropomorphic']                                  873
['Historical']                                       850
                                                   ...  
['Isekai', 'Space']                                    1
['Idols (Female)', 'Music', 'Parody', '

**Feature `day`**

In [67]:
describe_feature(data, 'day')

- Type: float64
- First 5 rows:
 0     3.0
1     1.0
2     1.0
3     3.0
4    30.0
Name: day, dtype: float64
- Last 5 Rows: 
 28689     1.0
28690    17.0
28692     NaN
28693    12.0
28694    12.0
Name: day, dtype: float64
- Number of unique values: 31
- Unique values count: day
1.0     5704
25.0    1128
5.0     1080
6.0     1059
3.0     1049
7.0     1031
2.0     1027
21.0     960
4.0      950
8.0      850
10.0     778
9.0      751
26.0     747
28.0     713
27.0     674
24.0     643
22.0     639
11.0     630
20.0     617
23.0     601
12.0     599
29.0     593
30.0     588
13.0     565
15.0     565
14.0     555
18.0     554
17.0     531
16.0     531
19.0     520
31.0     369
Name: count, dtype: int64
- Numer of missing values: 879
- Min: 1.0
- Max: 31.0
- Mean: 12.244882431795949
- Median: 10.0
- Mode: 0    1.0
Name: day, dtype: float64
- Std: 9.827799670521097 


# **2. Final preprocessing**

**Dataset for visualization**

In [68]:

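# Keep only anime that are airing or have finished airing
vis_data = data[data['status'] != "Not yet aired"]
# Replace the original season/rating columns with the cleaned versions
vis_data = vis_data.drop(columns=['season', 'rating'])
vis_data = vis_data.rename(columns={'new_season': 'season', 'new_rating': 'rating'})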

Now let's reorder the columns.

In [69]:
vis_data.columns

Index(['mal_id', 'title', 'title_en', 'title_synonyms', 'type', 'source',
       'episodes', 'status', 'airing', 'duration', 'score', 'scored_by',
       'rank', 'popularity', 'members', 'favorites', 'synopsis', 'year',
       'month', 'day', 'licensors', 'producers', 'studios', 'genres', 'themes',
       'demographics', 'rating', 'season'],
      dtype='object')

In [70]:
# Reorder the columns (synopsis is left out of the visualization dataset)
vis_data = vis_data[[
    'mal_id', 'title', 'title_en', 'title_synonyms', 'type', 'source', 'status', 'airing', 'day', 'month', 'year',
    'season', 'episodes', 'duration', 'score', 'scored_by', 'rank', 'members', 'popularity', 'favorites',
    'genres', 'themes', 'demographics', 'rating', 'licensors', 'producers', 'studios',
]]

In [71]:
vis_data.to_csv('anime_data_vis.csv', sep=',', index=False, encoding='utf-8')