This notebook cleans the 2024 data: it handles null values, reshapes the columns into a usable format, and pulls from the 2023 data where needed.

In [42]:
import pandas as pd 
import re
import ast

df = pd.read_csv('Data/2024_Processed.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26720 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   anime_id       26720 non-null  int64  
 1   title          26720 non-null  object 
 2   title_english  11244 non-null  object 
 3   type           26652 non-null  object 
 4   source         26720 non-null  object 
 5   episodes       26106 non-null  float64
 6   rating         26164 non-null  object 
 7   score          17304 non-null  float64
 8   rank           20722 non-null  float64
 9   popularity     26720 non-null  int64  
 10  synopsis       21915 non-null  object 
 11  producers      26720 non-null  object 
 12  licensors      26720 non-null  object 
 13  studios        26720 non-null  object 
 14  genres         26720 non-null  object 
dtypes: float64(3), int64(2), object(10)
memory usage: 3.1+ MB


In [43]:
# Remove parentheses and square brackets from the synopsis, along with anything inside them
df["synopsis"] = df["synopsis"].str.replace(r"\[.*?\]|\(.*?\)", "", regex = True)
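
To see what this pattern does, here is a minimal sketch on a few made-up synopsis strings (the sample text is hypothetical, not taken from the dataset). The trailing `.str.strip()` is an extra step that is not part of the line above; it removes the whitespace left behind where a bracketed note was cut.

```python
import pandas as pd

# Same pattern as in the notebook: non-greedy match of [...] or (...)
pattern = r"\[.*?\]|\(.*?\)"

# Hypothetical synopsis values, including a missing one
sample = pd.Series([
    "A hero rises. (Source: ANN)",
    "Plot summary [Written by MAL Rewrite]",
    None,
])

# Missing values stay missing; .str.strip() is an optional extra cleanup step
cleaned = sample.str.replace(pattern, "", regex=True).str.strip()
print(cleaned.tolist())
```

Note that `.` does not match newlines by default, so a bracketed note that spans several lines would be left in place.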

In [44]:
df["synopsis"].to_csv('test.csv')

Extract the names from the crawled data. The producers, licensors, studios, and genres columns all share the same format, so the function below can be applied to each of them.

In [45]:
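def extract_names(entry):
    # Each cell holds a stringified list of dicts; parse it back into Python objects
    dict_list = ast.literal_eval(entry)
    # Keep only the 'name' value from each dict, skipping dicts without one
    return [d['name'] for d in dict_list if 'name' in d]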


In [46]:
df['producers'] = df['producers'].apply(extract_names)
df['producers'].head(5)

0                                      [Bandai Visual]
1                             [Sunrise, Bandai Visual]
2                               [Victor Entertainment]
3    [Bandai Visual, Dentsu, Victor Entertainment, ...
4                                   [TV Tokyo, Dentsu]
Name: producers, dtype: object
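
The raw cell format is not shown in this notebook, so the sketch below assumes each cell is a stringified list of dicts with a `name` key. `extract_names_safe` is a hypothetical defensive variant of `extract_names`, not something the notebook uses; it returns an empty list for missing or malformed cells instead of raising.

```python
import ast

def extract_names_safe(entry):
    """Hypothetical defensive variant of extract_names."""
    # Non-string cells (e.g. NaN from pandas) cannot be parsed; return an empty list
    if not isinstance(entry, str):
        return []
    try:
        parsed = ast.literal_eval(entry)
    except (ValueError, SyntaxError):
        # Malformed literal: return an empty list instead of raising
        return []
    if not isinstance(parsed, list):
        return []
    return [d["name"] for d in parsed if isinstance(d, dict) and "name" in d]

# Assumed raw cell format: a stringified list of dicts with a 'name' key
raw = "[{'mal_id': 23, 'name': 'Bandai Visual'}, {'mal_id': 14, 'name': 'Sunrise'}]"
names = extract_names_safe(raw)
print(names)
print(extract_names_safe(float("nan")))
print(extract_names_safe("[broken"))
```

In this dataset the four list columns have no nulls, so the simpler `extract_names` is sufficient; the guard only matters if the same step is reused on data with gaps.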

In [47]:
df['licensors'] = df['licensors'].apply(extract_names)
df['licensors'].head(5)

0                                 [Funimation]
1    [Sony Pictures Entertainment, Funimation]
2                                 [Funimation]
3           [Funimation, Bandai Entertainment]
4                   [Illumitoon Entertainment]
Name: licensors, dtype: object

In [48]:
df['studios'] = df['studios'].apply(extract_names)
df['studios'].head(5)

0           [Sunrise]
1             [Bones]
2          [Madhouse]
3           [Sunrise]
4    [Toei Animation]
Name: studios, dtype: object

In [49]:
df['genres'] = df['genres'].apply(extract_names)
df['genres'].head(5)

0           [Action, Award Winning, Sci-Fi]
1                          [Action, Sci-Fi]
2               [Action, Adventure, Sci-Fi]
3    [Action, Drama, Mystery, Supernatural]
4              [Action, Adventure, Fantasy]
Name: genres, dtype: object

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26720 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   anime_id       26720 non-null  int64  
 1   title          26720 non-null  object 
 2   title_english  11244 non-null  object 
 3   type           26652 non-null  object 
 4   source         26720 non-null  object 
 5   episodes       26106 non-null  float64
 6   rating         26164 non-null  object 
 7   score          17304 non-null  float64
 8   rank           20722 non-null  float64
 9   popularity     26720 non-null  int64  
 10  synopsis       21915 non-null  object 
 11  producers      26720 non-null  object 
 12  licensors      26720 non-null  object 
 13  studios        26720 non-null  object 
 14  genres         26720 non-null  object 
dtypes: float64(3), int64(2), object(10)
memory usage: 3.1+ MB


Categorize popularity and rank

In [51]:
def assign_category(popularity):
    # Treat 0 as missing; NaN fails every comparison below and falls through to the final "Unknown"
    if popularity == 0:
        return "Unknown"
    if popularity <= 10:
        return "Top 10"
    elif popularity <= 100:
        return "Top 100"
    elif popularity <= 500:
        return "Top 500"
    elif popularity <= 1000:
        return "Top 1,000"
    elif popularity <= 5000:
        return "Top 5,000"
    elif popularity <= 7500:
        return "Top 7,500"
    elif popularity <= 10000:
        return "Top 10,000"
    elif popularity <= 25000:
        return "Top 25,000"
    elif popularity <= 50000:
        return "Top 50,000"
    else:
        return "Unknown"
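
The same thresholds can also be expressed as bins. The following is a minimal sketch using `pd.cut`, offered as an alternative rather than as the method used in this notebook; `categorize_with_cut`, `edges`, and `labels` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

# Bin edges and labels copied from the thresholds in assign_category
edges = [0, 10, 100, 500, 1000, 5000, 7500, 10000, 25000, 50000]
labels = ["Top 10", "Top 100", "Top 500", "Top 1,000", "Top 5,000",
          "Top 7,500", "Top 10,000", "Top 25,000", "Top 50,000"]

def categorize_with_cut(values):
    """Hypothetical vectorized counterpart of assign_category."""
    # right=True makes each bin (low, high], so 10 -> "Top 10" and 11 -> "Top 100"
    binned = pd.cut(values, bins=edges, labels=labels, right=True)
    # 0, NaN, and anything above 50,000 fall outside every bin and become "Unknown"
    return binned.astype(object).fillna("Unknown")

sample = pd.Series([0, 1, 10, 11, 501, 50000, 50001, np.nan])
categories = categorize_with_cut(sample).tolist()
print(categories)
```

One difference to be aware of: negative inputs become "Unknown" here, whereas `assign_category` would label them "Top 10". Ranks and popularity values are not expected to be negative, so this should not matter in practice.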

In [None]:
df["Popularity_category"] = df["popularity"].apply(assign_category)
df["Popularity_category"].value_counts()

Popularity_category
Top 25,000    15084
Top 5,000      4025
Top 7,500      2523
Top 10,000     2520
Top 50,000     1566
Top 1,000       500
Top 500         402
Top 100          90
Top 10           10
Name: count, dtype: int64

In [None]:
df["Rank_category"] = df["rank"].apply(assign_category)
df["Rank_category"].value_counts()

Rank_category
Top 25,000    10695
Unknown        5998
Top 5,000      4016
Top 10,000     2512
Top 7,500      2497
Top 1,000       501
Top 500         401
Top 100          90
Top 10           10
Name: count, dtype: int64

In [54]:
df = df.drop(columns = ["rank", "popularity"]) 

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26720 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   anime_id             26720 non-null  int64  
 1   title                26720 non-null  object 
 2   title_english        11244 non-null  object 
 3   type                 26652 non-null  object 
 4   source               26720 non-null  object 
 5   episodes             26106 non-null  float64
 6   rating               26164 non-null  object 
 7   score                17304 non-null  float64
 8   synopsis             21915 non-null  object 
 9   producers            26720 non-null  object 
 10  licensors            26720 non-null  object 
 11  studios              26720 non-null  object 
 12  genres               26720 non-null  object 
 13  Popularity_category  26720 non-null  object 
 14  Rank_category        26720 non-null  object 
dtypes: float64(2), int64(1), object(12)


In [56]:
df1 = pd.read_csv('Data/2023_Processed.csv')
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20105 entries, 0 to 20104
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   anime_id             20105 non-null  int64 
 1   Name                 20105 non-null  object
 2   English name         20105 non-null  object
 3   Score                20105 non-null  object
 4   Genres               20105 non-null  object
 5   Synopsis             20105 non-null  object
 6   Type                 20105 non-null  object
 7   Episodes             20105 non-null  object
 8   Producers            20105 non-null  object
 9   Licensors            20105 non-null  object
 10  Studios              20105 non-null  object
 11  Source               20105 non-null  object
 12  Rating               20105 non-null  object
 13  Popularity_category  20105 non-null  object
 14  Rank_category        20105 non-null  object
dtypes: int64(1), object(14)
memory usage: 2.3+ MB


In [57]:
# For rows whose rank is "Unknown", look up the same anime_id in the other df and take its rank category if present
def steal_rank(df, df1, id_column = 'anime_id', rank_column = 'Rank_category'):
    
    def update_rank(row):
        if row[rank_column] == "Unknown":
            match = df1.loc[df1[id_column] == row[id_column], rank_column]
            if not match.empty:
                return match.values[0]
        return row[rank_column]

    df[rank_column] = df.apply(update_rank, axis = 1)
    return df
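
Because `steal_rank` scans `df1` once per row, it can be slow on larger frames. Below is a minimal sketch of the same lookup done with `Series.map`; `fill_rank_from_lookup` is a hypothetical name, and the small frames are made-up data for illustration only.

```python
import pandas as pd

def fill_rank_from_lookup(df, df1, id_column="anime_id", rank_column="Rank_category"):
    """Hypothetical map-based counterpart of steal_rank."""
    # Build an id -> rank lookup; keep the first row per id, like match.values[0]
    lookup = df1.drop_duplicates(subset=id_column).set_index(id_column)[rank_column]
    out = df.copy()
    unknown = out[rank_column] == "Unknown"
    # Look up only the rows that are currently "Unknown"
    candidates = out.loc[unknown, id_column].map(lookup)
    found = candidates.notna()
    # Overwrite just the rows where the other frame had a matching id
    out.loc[candidates.index[found], rank_column] = candidates[found]
    return out

# Made-up frames for illustration
current = pd.DataFrame({"anime_id": [1, 2, 3],
                        "Rank_category": ["Top 10", "Unknown", "Unknown"]})
previous = pd.DataFrame({"anime_id": [2, 4],
                         "Rank_category": ["Top 500", "Top 100"]})
filled = fill_rank_from_lookup(current, previous)
print(filled["Rank_category"].tolist())
```

Unlike `steal_rank`, this sketch returns a copy and leaves the input frame untouched.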

In [None]:
df = steal_rank(df, df1)
df["Rank_category"].value_counts()

Rank_category
Top 25,000    11009
Unknown        5521
Top 5,000      4040
Top 10,000     2602
Top 7,500      2542
Top 1,000       503
Top 500         403
Top 100          90
Top 10           10
Name: count, dtype: int64

steal_rank filled only 477 of the 5,998 Unknown values, so I will default the remaining 5,521 Unknowns to Top 25,000

In [61]:
df["Rank_category"] = df["Rank_category"].replace("Unknown", "Top 25,000")
df["Rank_category"].value_counts()

Rank_category
Top 25,000    16530
Top 5,000      4040
Top 10,000     2602
Top 7,500      2542
Top 1,000       503
Top 500         403
Top 100          90
Top 10           10
Name: count, dtype: int64

In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26720 entries, 0 to 26719
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   anime_id             26720 non-null  int64  
 1   title                26720 non-null  object 
 2   title_english        11244 non-null  object 
 3   type                 26652 non-null  object 
 4   source               26720 non-null  object 
 5   episodes             26106 non-null  float64
 6   rating               26164 non-null  object 
 7   score                17304 non-null  float64
 8   synopsis             21915 non-null  object 
 9   producers            26720 non-null  object 
 10  licensors            26720 non-null  object 
 11  studios              26720 non-null  object 
 12  genres               26720 non-null  object 
 13  Popularity_category  26720 non-null  object 
 14  Rank_category        26720 non-null  object 
dtypes: float64(2), int64(1), object(12)


In [64]:
median_score = df["score"].median()
median_score

np.float64(6.44)

In [66]:
mean_score = df["score"].mean()
mean_score

np.float64(6.433792186777623)

In [68]:
# The median and mean are nearly identical, so replace all null score values with the median score
# fillna returns a new Series, so assign it back to the column
df["score"] = df["score"].fillna(median_score)
df["score"]

0        8.75
1        8.38
2        8.22
3        7.25
4        6.99
         ... 
26715    6.44
26716    6.44
26717    6.44
26718    6.44
26719    6.44
Name: score, Length: 26720, dtype: float64
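
Since `Series.fillna` returns a new Series rather than modifying the column in place, it is worth checking the null count after imputing. A minimal sketch on made-up scores (the values and the `remaining_*` names are hypothetical):

```python
import numpy as np
import pandas as pd

scores = pd.Series([8.75, np.nan, 6.99, np.nan])  # hypothetical scores
median_score = scores.median()

# Calling fillna without assignment leaves the original Series unchanged
scores.fillna(median_score)
remaining_before = int(scores.isna().sum())

# Assigning the result back is what actually stores the filled values
scores = scores.fillna(median_score)
remaining_after = int(scores.isna().sum())

print(remaining_before, remaining_after)
```

The same check on the real frame would be `df["score"].isna().sum()`, which should be 0 once the filled column has been assigned back.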