This notebook cleans the 2024 data: it handles null values, reshapes the columns into a usable format, and pulls from the 2023 data where needed.

In [12]:
import pandas as pd 
import re
import ast

df = pd.read_csv('Data/2024_Processed.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26720 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   anime_id       26720 non-null  int64  
 1   title          26720 non-null  object 
 2   title_english  11244 non-null  object 
 3   type           26652 non-null  object 
 4   source         26720 non-null  object 
 5   episodes       26106 non-null  float64
 6   rating         26164 non-null  object 
 7   score          17304 non-null  float64
 8   rank           20722 non-null  float64
 9   popularity     26720 non-null  int64  
 10  synopsis       21915 non-null  object 
 11  producers      26720 non-null  object 
 12  licensors      26720 non-null  object 
 13  studios        26720 non-null  object 
 14  genres         26720 non-null  object 
dtypes: float64(3), int64(2), object(10)
memory usage: 3.1+ MB


In [13]:
# Remove parentheses and square brackets from the synopsis, along with anything inside them
df["synopsis"] = df["synopsis"].str.replace(r"\[.*?\]|\(.*?\)", "", regex = True)
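
To see what the bracket-stripping pattern does, the sketch below applies it to a made-up synopsis with Python's `re.sub`, which should behave the same way as `str.replace(..., regex=True)` for this pattern. The sample string and the whitespace clean-up step are illustrative assumptions, not part of the original notebook.

```python
import re

# Same pattern as the cell above: a non-greedy match on [...] or (...)
PATTERN = r"\[.*?\]|\(.*?\)"

# Made-up synopsis, for illustration only
sample = "A boy (Taro) finds a robot. [Written by MAL Rewrite]"
cleaned = re.sub(PATTERN, "", sample)

# Removal leaves the whitespace that surrounded the brackets, so collapse it
tidied = re.sub(r"\s+", " ", cleaned).strip()

print(repr(cleaned))
print(repr(tidied))
```

Because the match is non-greedy, nested brackets such as `(a (b) c)` are only partly removed, so it may be worth spot-checking a few synopses after this step.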

In [14]:
# Write the cleaned synopses to a scratch file for manual inspection
df["synopsis"].to_csv('test.csv')

Extract the names from the crawled data. The producers, licensors, studios, and genres columns all share the same format, so the function below works on each of them.

In [15]:
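def extract_names(entry):
    # Each cell holds a stringified list of dicts; parse it back into Python objects
    dict_list = ast.literal_eval(entry)
    # Keep only the 'name' value of each dict, skipping dicts that lack one
    return [d['name'] for d in dict_list if 'name' in d]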
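
As a quick check of `extract_names`, here is a minimal sketch run on a hand-written input. The raw cell format is not shown in this notebook, so the sample string (including the `mal_id` key) is an assumption about what the crawled data looks like.

```python
import ast

def extract_names(entry):
    dict_list = ast.literal_eval(entry)
    return [d['name'] for d in dict_list if 'name' in d]

# Hypothetical raw cell value; the real crawled fields may carry other keys
raw = "[{'mal_id': 14, 'name': 'Sunrise'}, {'mal_id': 23, 'name': 'Bandai Visual'}]"

print(extract_names(raw))    # ['Sunrise', 'Bandai Visual']
print(extract_names("[]"))   # []
```

An empty list string comes back as an empty Python list, which is why the columns show no nulls after this step even when an entry has no producers or studios.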
    


In [16]:
df['producers'] = df['producers'].apply(extract_names)
df['producers'].head(5)

0                                      [Bandai Visual]
1                             [Sunrise, Bandai Visual]
2                               [Victor Entertainment]
3    [Bandai Visual, Dentsu, Victor Entertainment, ...
4                                   [TV Tokyo, Dentsu]
Name: producers, dtype: object

In [17]:
df['licensors'] = df['licensors'].apply(extract_names)
df['licensors'].head(5)

0                                 [Funimation]
1    [Sony Pictures Entertainment, Funimation]
2                                 [Funimation]
3           [Funimation, Bandai Entertainment]
4                   [Illumitoon Entertainment]
Name: licensors, dtype: object

In [18]:
df['studios'] = df['studios'].apply(extract_names)
df['studios'].head(5)

0           [Sunrise]
1             [Bones]
2          [Madhouse]
3           [Sunrise]
4    [Toei Animation]
Name: studios, dtype: object

In [19]:
df['genres'] = df['genres'].apply(extract_names)
df['genres'].head(5)

0           [Action, Award Winning, Sci-Fi]
1                          [Action, Sci-Fi]
2               [Action, Adventure, Sci-Fi]
3    [Action, Drama, Mystery, Supernatural]
4              [Action, Adventure, Fantasy]
Name: genres, dtype: object

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26720 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   anime_id       26720 non-null  int64  
 1   title          26720 non-null  object 
 2   title_english  11244 non-null  object 
 3   type           26652 non-null  object 
 4   source         26720 non-null  object 
 5   episodes       26106 non-null  float64
 6   rating         26164 non-null  object 
 7   score          17304 non-null  float64
 8   rank           20722 non-null  float64
 9   popularity     26720 non-null  int64  
 10  synopsis       21915 non-null  object 
 11  producers      26720 non-null  object 
 12  licensors      26720 non-null  object 
 13  studios        26720 non-null  object 
 14  genres         26720 non-null  object 
dtypes: float64(3), int64(2), object(10)
memory usage: 3.1+ MB


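Bucket popularity and rank into categories. A missing rank (NaN) fails every comparison in the function below, so it falls through to "Unknown".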

In [21]:
def assign_category(popularity):
    if popularity == 0:
        return "Unknown"
    if popularity <= 10:
        return "Top 10"
    elif popularity <= 100:
        return "Top 100"
    elif popularity <= 500:
        return "Top 500"
    elif popularity <= 1000:
        return "Top 1,000"
    elif popularity <= 5000:
        return "Top 5,000"
    elif popularity <= 7500:
        return "Top 7,500"
    elif popularity <= 10000:
        return "Top 10,000"
    elif popularity <= 25000:
        return "Top 25,000"
    elif popularity <= 50000:
        return "Top 50,000"
    else:
        return "Unknown"
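
The same bucketing can also be expressed with `pd.cut`, which avoids the long `elif` chain. This is a sketch of an alternative, not the approach the notebook uses; the `categorize` helper and the demo values are made up for illustration. It treats 0, NaN, and anything above 50,000 as "Unknown", matching `assign_category` for non-negative inputs.

```python
import numpy as np
import pandas as pd

BINS = [0, 10, 100, 500, 1000, 5000, 7500, 10000, 25000, 50000]
LABELS = ["Top 10", "Top 100", "Top 500", "Top 1,000", "Top 5,000",
          "Top 7,500", "Top 10,000", "Top 25,000", "Top 50,000"]

def categorize(values):
    # Intervals are right-closed: (0, 10], (10, 100], ...
    cats = pd.cut(values, bins=BINS, labels=LABELS)
    # 0, NaN, and values above 50,000 fall outside every bin and become NaN
    return cats.cat.add_categories("Unknown").fillna("Unknown").astype(object)

demo = pd.Series([0, 1, 10, 11, 5000, 26000, np.nan, 60000])
print(categorize(demo).tolist())
```

One difference to keep in mind: a negative input would become "Unknown" here but "Top 10" in `assign_category`, which should not matter for rank or popularity values.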

In [22]:
df["Popularity_category"] = df["popularity"].apply(assign_category)
df["Popularity_category"].value_counts()

Popularity_category
Top 25,000    15084
Top 5,000      4025
Top 7,500      2523
Top 10,000     2520
Top 50,000     1566
Top 1,000       500
Top 500         402
Top 100          90
Top 10           10
Name: count, dtype: int64

In [23]:
df["Rank_category"] = df["rank"].apply(assign_category)
df["Rank_category"].value_counts()

Rank_category
Top 25,000    10695
Unknown        5998
Top 5,000      4016
Top 10,000     2512
Top 7,500      2497
Top 1,000       501
Top 500         401
Top 100          90
Top 10           10
Name: count, dtype: int64

In [24]:
df = df.drop(columns = ["rank", "popularity"]) 

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26720 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   anime_id             26720 non-null  int64  
 1   title                26720 non-null  object 
 2   title_english        11244 non-null  object 
 3   type                 26652 non-null  object 
 4   source               26720 non-null  object 
 5   episodes             26106 non-null  float64
 6   rating               26164 non-null  object 
 7   score                17304 non-null  float64
 8   synopsis             21915 non-null  object 
 9   producers            26720 non-null  object 
 10  licensors            26720 non-null  object 
 11  studios              26720 non-null  object 
 12  genres               26720 non-null  object 
 13  Popularity_category  26720 non-null  object 
 14  Rank_category        26720 non-null  object 
dtypes: float64(2), int64(1), object(12)


In [26]:
df1 = pd.read_csv('Data/2023_Processed.csv')
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20105 entries, 0 to 20104
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   anime_id             20105 non-null  int64 
 1   Name                 20105 non-null  object
 2   English name         20105 non-null  object
 3   Score                20105 non-null  object
 4   Genres               20105 non-null  object
 5   Synopsis             20105 non-null  object
 6   Type                 20105 non-null  object
 7   Episodes             20105 non-null  object
 8   Producers            20105 non-null  object
 9   Licensors            20105 non-null  object
 10  Studios              20105 non-null  object
 11  Source               20105 non-null  object
 12  Rating               20105 non-null  object
 13  Popularity_category  20105 non-null  object
 14  Rank_category        20105 non-null  object
dtypes: int64(1), object(14)
memory usage: 2.3+ MB


In [27]:
# For rows whose rank is Unknown, look up the same anime_id in a second DataFrame and take its rank category if a match exists
def steal_rank(df, df1, id_column = 'anime_id', rank_column = 'Rank_category'):
    
    def update_rank(row):
        if row[rank_column] == "Unknown":
            match = df1.loc[df1[id_column] == row[id_column], rank_column]
            if not match.empty:
                return match.values[0]
        return row[rank_column]

    df[rank_column] = df.apply(update_rank, axis = 1)
    return df
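
The row-by-row lookup above scans the second table once for every Unknown row. A vectorized sketch of the same idea is shown below, assuming both frames carry `anime_id` and `Rank_category` columns; the function name `fill_unknown_rank` and the tiny sample frames are hypothetical.

```python
import pandas as pd

def fill_unknown_rank(df, df1, id_column="anime_id", rank_column="Rank_category"):
    # One rank per id from the older table; keep the first if an id repeats
    lookup = df1.drop_duplicates(subset=id_column).set_index(id_column)[rank_column]
    unknown = df[rank_column] == "Unknown"
    # Ids missing from the lookup map to NaN, so fall back to the existing value
    filled = df.loc[unknown, id_column].map(lookup)
    out = df.copy()
    out.loc[unknown, rank_column] = filled.fillna(df.loc[unknown, rank_column])
    return out

# Made-up data: id 1 exists in the older table, id 3 does not
new = pd.DataFrame({"anime_id": [1, 2, 3],
                    "Rank_category": ["Unknown", "Top 100", "Unknown"]})
old = pd.DataFrame({"anime_id": [1, 4],
                    "Rank_category": ["Top 500", "Top 10"]})

result = fill_unknown_rank(new, old)
print(result["Rank_category"].tolist())
```

Like the original, this copies whatever the older table holds for a matching id, so an "Unknown" in that table would be carried over unchanged.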

In [28]:
df = steal_rank(df, df1)
df["Rank_category"].value_counts()

Rank_category
Top 25,000    11009
Unknown        5521
Top 5,000      4040
Top 10,000     2602
Top 7,500      2542
Top 1,000       503
Top 500         403
Top 100          90
Top 10           10
Name: count, dtype: int64

steal_rank filled only 477 of the 5,998 Unknown values, leaving 5,521, so I will default the remaining Unknowns to Top 25,000.

In [29]:
df["Rank_category"] = df["Rank_category"].replace("Unknown", "Top 25,000")
df["Rank_category"].value_counts()

Rank_category
Top 25,000    16530
Top 5,000      4040
Top 10,000     2602
Top 7,500      2542
Top 1,000       503
Top 500         403
Top 100          90
Top 10           10
Name: count, dtype: int64

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26720 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   anime_id             26720 non-null  int64  
 1   title                26720 non-null  object 
 2   title_english        11244 non-null  object 
 3   type                 26652 non-null  object 
 4   source               26720 non-null  object 
 5   episodes             26106 non-null  float64
 6   rating               26164 non-null  object 
 7   score                17304 non-null  float64
 8   synopsis             21915 non-null  object 
 9   producers            26720 non-null  object 
 10  licensors            26720 non-null  object 
 11  studios              26720 non-null  object 
 12  genres               26720 non-null  object 
 13  Popularity_category  26720 non-null  object 
 14  Rank_category        26720 non-null  object 
dtypes: float64(2), int64(1), object(12)


In [31]:
median_score = df["score"].median()
median_score

np.float64(6.44)

In [32]:
mean_score = df["score"].mean()
mean_score

np.float64(6.433792186777623)

In [35]:
# The median (6.44) and mean (6.43) are nearly identical, so replace all null score values with the median
df["score"] = df["score"].fillna(median_score)

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26720 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   anime_id             26720 non-null  int64  
 1   title                26720 non-null  object 
 2   title_english        11244 non-null  object 
 3   type                 26652 non-null  object 
 4   source               26720 non-null  object 
 5   episodes             26106 non-null  float64
 6   rating               26164 non-null  object 
 7   score                26720 non-null  float64
 8   synopsis             21915 non-null  object 
 9   producers            26720 non-null  object 
 10  licensors            26720 non-null  object 
 11  studios              26720 non-null  object 
 12  genres               26720 non-null  object 
 13  Popularity_category  26720 non-null  object 
 14  Rank_category        26720 non-null  object 
dtypes: float64(2), int64(1), object(12)


In [37]:
df['type'].value_counts()

type
TV            7941
Movie         4586
OVA           4120
ONA           3591
Music         3415
Special       1798
TV Special     659
CM             383
PV             159
Name: count, dtype: int64

In [38]:
# Fold "TV Special" into "Special" and drop commercials (CM) and promotional videos (PV)
df["type"] = df["type"].replace("TV Special", "Special")
df = df[~df["type"].isin(["CM", "PV"])]
df["type"].value_counts()

type
TV         7941
Movie      4586
OVA        4120
ONA        3591
Music      3415
Special    2457
Name: count, dtype: int64

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26178 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   anime_id             26178 non-null  int64  
 1   title                26178 non-null  object 
 2   title_english        11099 non-null  object 
 3   type                 26110 non-null  object 
 4   source               26178 non-null  object 
 5   episodes             25579 non-null  float64
 6   rating               25622 non-null  object 
 7   score                26178 non-null  float64
 8   synopsis             21396 non-null  object 
 9   producers            26178 non-null  object 
 10  licensors            26178 non-null  object 
 11  studios              26178 non-null  object 
 12  genres               26178 non-null  object 
 13  Popularity_category  26178 non-null  object 
 14  Rank_category        26178 non-null  object 
dtypes: float64(2), int64(1), object(12)

In [40]:
null_type_rows = df[df["type"].isnull()]
null_type_rows

Unnamed: 0,anime_id,title,title_english,type,source,episodes,rating,score,synopsis,producers,licensors,studios,genres,Popularity_category,Rank_category
4978,7398,Sekai Meisaku Douwa,World Famous Fairy Tale Series,,Unknown,20.0,G - All Ages,6.12,Another of Toei's World Famous Fairy Tale seri...,[],[],[],[Fantasy],"Top 25,000","Top 10,000"
17585,43629,Tokyo Babylon 2021,,,Manga,,,6.44,Subaru Sumeragi is the thirteenth head of his ...,[],[],[],"[Boys Love, Drama, Supernatural]","Top 7,500","Top 25,000"
19531,46488,Tai-Ari deshita.: Ojousama wa Kakutou Game nan...,,,Manga,,,6.44,A hot fighting game played at the girls' schoo...,[],[],[],"[Comedy, Girls Love]","Top 10,000","Top 25,000"
21023,48701,Peleliu: Rakuen no Guernica,,,Manga,,,6.44,"Shouwa 19 , summer. Tamaru, a soldier who aspi...",[],[],[],[Drama],"Top 25,000","Top 25,000"
21443,49703,Fate/kaleid liner Prisma☆Illya (Zoku-hen),,,Manga,,,6.44,Sequel to Fate/kaleid liner Prisma☆Illya Movie...,[],[],[],"[Action, Fantasy]","Top 7,500","Top 25,000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26654,58573,Baki-dou,,,Manga,,,6.44,"Growing complacent with his title of ""World's ...",[],[],[TMS Entertainment],[Action],"Top 7,500","Top 25,000"
26692,58725,Ninja to Koroshiya no Futarigurashi,,,Manga,,,6.44,"Satoko Kusagakure, a female ninja who left her...",[Kadokawa],[],[Shaft],[Slice of Life],"Top 25,000","Top 25,000"
26693,58725,Ninja to Koroshiya no Futarigurashi,,,Manga,,,6.44,"Satoko Kusagakure, a female ninja who left her...",[Kadokawa],[],[Shaft],[Slice of Life],"Top 25,000","Top 25,000"
26707,58755,5-toubun no Hanayome*,,,Manga,,PG-13 - Teens 13 or older,6.44,5-toubun no Hanayome* centers on the honeymoon...,[],[],[],"[Comedy, Romance]","Top 7,500","Top 25,000"


In [41]:
# Drop the 68 rows that still have no type
df = df.dropna(subset=["type"])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26110 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   anime_id             26110 non-null  int64  
 1   title                26110 non-null  object 
 2   title_english        11087 non-null  object 
 3   type                 26110 non-null  object 
 4   source               26110 non-null  object 
 5   episodes             25575 non-null  float64
 6   rating               25599 non-null  object 
 7   score                26110 non-null  float64
 8   synopsis             21337 non-null  object 
 9   producers            26110 non-null  object 
 10  licensors            26110 non-null  object 
 11  studios              26110 non-null  object 
 12  genres               26110 non-null  object 
 13  Popularity_category  26110 non-null  object 
 14  Rank_category        26110 non-null  object 
dtypes: float64(2), int64(1), object(12)

In [43]:
df['rating'].value_counts()

rating
PG-13 - Teens 13 or older         9177
G - All Ages                      8027
PG - Children                     4225
Rx - Hentai                       1517
R - 17+ (violence & profanity)    1480
R+ - Mild Nudity                  1173
Name: count, dtype: int64

In [None]:
# Fill missing ratings with the most common rating
df["rating"] = df["rating"].fillna("PG-13 - Teens 13 or older")
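
Instead of hard-coding the most common rating, it can be looked up with `Series.mode()`. The sketch below uses a small made-up series rather than the real column, and is only one possible way to do it.

```python
import pandas as pd

# Made-up ratings with two missing values
ratings = pd.Series(["PG-13 - Teens 13 or older", "G - All Ages", None,
                     "PG-13 - Teens 13 or older", None])

# mode() ignores nulls by default; [0] takes the first (most frequent) value
most_common = ratings.mode()[0]
filled = ratings.fillna(most_common)

print(most_common)
print(filled.isna().sum())
```

This keeps the fill value in sync with the data if the source file changes.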

In [45]:
df['rating'].value_counts()

rating
PG-13 - Teens 13 or older         9688
G - All Ages                      8027
PG - Children                     4225
Rx - Hentai                       1517
R - 17+ (violence & profanity)    1480
R+ - Mild Nudity                  1173
Name: count, dtype: int64

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26110 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   anime_id             26110 non-null  int64  
 1   title                26110 non-null  object 
 2   title_english        11087 non-null  object 
 3   type                 26110 non-null  object 
 4   source               26110 non-null  object 
 5   episodes             25575 non-null  float64
 6   rating               26110 non-null  object 
 7   score                26110 non-null  float64
 8   synopsis             21337 non-null  object 
 9   producers            26110 non-null  object 
 10  licensors            26110 non-null  object 
 11  studios              26110 non-null  object 
 12  genres               26110 non-null  object 
 13  Popularity_category  26110 non-null  object 
 14  Rank_category        26110 non-null  object 
dtypes: float64(2), int64(1), object(12)

In [47]:
# Count the rows whose list is empty (no names were extracted)
empty_list_count = df["producers"].apply(lambda x: isinstance(x, list) and len(x) == 0).sum()
empty_list_count

np.int64(13742)

In [48]:
empty_list_count = df["studios"].apply(lambda x: isinstance(x, list) and len(x) == 0).sum()
empty_list_count

np.int64(10865)

In [49]:
empty_list_count = df["licensors"].apply(lambda x: isinstance(x, list) and len(x) == 0).sum()
empty_list_count

np.int64(21252)
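
The three counts above repeat the same lambda. A more compact sketch uses `.str.len()`, which returns the length of each list in a list-valued column; the demo frame below is made up and only mirrors the column names.

```python
import pandas as pd

# Made-up stand-in for the list-valued columns produced by extract_names
demo = pd.DataFrame({
    "producers": [["NHK"], [], ["Fuji TV", "Sanrio"]],
    "studios": [[], [], ["Shaft"]],
    "licensors": [[], [], []],
})

# .str.len() gives the length of each list; 0 marks an empty one
empty_counts = {col: int((demo[col].str.len() == 0).sum()) for col in demo.columns}
print(empty_counts)   # {'producers': 1, 'studios': 2, 'licensors': 3}
```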

In [51]:
df['licensors'].value_counts()

licensors
[]                                           21252
[Funimation]                                   981
[Sentai Filmworks]                             842
[Discotek Media]                               299
[Aniplex of America]                           227
                                             ...  
[Haoliners Animation League]                     1
[Sony Pictures Entertainment, Funimation]        1
[King Records]                                   1
[Funimation, VIZ Media]                          1
[Funimation, 4Kids Entertainment]                1
Name: count, Length: 264, dtype: int64

In [52]:
df['producers'].value_counts()

producers
[]                                                                     13742
[NHK]                                                                   1079
[Pink Pineapple]                                                         264
[Sanrio]                                                                 175
[Fuji TV]                                                                131
                                                                       ...  
[Fuji TV, Sony Music Entertainment]                                        1
[Mainichi Broadcasting System, Kodansha, TOHO, Streamline Pictures]        1
[TMS Entertainment, Half H.P Studio, Nichion, Kadokawa]                    1
[Happinet]                                                                 1
[TV Tokyo, King Records]                                                   1
Name: count, Length: 4683, dtype: int64

In [56]:
# Drop the 535 rows with no episode count
df = df.dropna(subset=["episodes"])

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25575 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   anime_id             25575 non-null  int64  
 1   title                25575 non-null  object 
 2   title_english        10873 non-null  object 
 3   type                 25575 non-null  object 
 4   source               25575 non-null  object 
 5   episodes             25575 non-null  float64
 6   rating               25575 non-null  object 
 7   score                25575 non-null  float64
 8   synopsis             20907 non-null  object 
 9   producers            25575 non-null  object 
 10  licensors            25575 non-null  object 
 11  studios              25575 non-null  object 
 12  genres               25575 non-null  object 
 13  Popularity_category  25575 non-null  object 
 14  Rank_category        25575 non-null  object 
dtypes: float64(2), int64(1), object(12)

In [58]:
df['genres'].value_counts()

genres
[]                                              5359
[Comedy]                                        2311
[Fantasy]                                       1347
[Hentai]                                        1192
[Avant Garde]                                    659
                                                ... 
[Adventure, Comedy, Drama, Gourmet, Romance]       1
[Action, Girls Love]                               1
[Comedy, Fantasy, Girls Love]                      1
[Award Winning, Drama, Horror]                     1
[Avant Garde, Drama, Erotica]                      1
Name: count, Length: 956, dtype: int64

In [62]:
# Keep only rows that have at least one genre
df = df[df["genres"].apply(lambda x: isinstance(x, list) and len(x) > 0)]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20216 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   anime_id             20216 non-null  int64  
 1   title                20216 non-null  object 
 2   title_english        9636 non-null   object 
 3   type                 20216 non-null  object 
 4   source               20216 non-null  object 
 5   episodes             20216 non-null  float64
 6   rating               20216 non-null  object 
 7   score                20216 non-null  float64
 8   synopsis             17175 non-null  object 
 9   producers            20216 non-null  object 
 10  licensors            20216 non-null  object 
 11  studios              20216 non-null  object 
 12  genres               20216 non-null  object 
 13  Popularity_category  20216 non-null  object 
 14  Rank_category        20216 non-null  object 
dtypes: float64(2), int64(1), object(12)

In [63]:
# Drop title_english, which is null for more than half of the remaining rows
df = df.drop(columns=["title_english"])

In [64]:
df.to_csv('2024_Processed.csv', index = False)