# 📊 Anime Ratings Prediction Project — Data Cleaning & Preprocessing

This notebook focuses on loading the raw dataset and preparing it for analysis and modeling. It includes:

- An overview of the dataset and all its features.
- Handling of missing values and unnecessary characters (like `|` in multi-valued columns).
- Initial exploration of unique values and null counts.
- Transformation of important columns (e.g., converting `genres` strings into Python lists).
- Creation of new structured features such as `release_year` and `duration` (total days the anime aired).
- Saving a cleaned and consistent version of the dataset for further use in feature engineering and modeling.

These steps form the foundation for building a robust and meaningful regression model to predict anime ratings.


## 📁 Dataset Overview

The dataset used for this project is sourced from [Kaggle - MyAnimeList Dataset](https://www.kaggle.com/datasets/svanoo/myanimelist-dataset). It contains detailed information about thousands of anime titles, including metadata and user-generated ratings.

### 📄 Columns in `anime.csv`

- **anime_id**: The ID of the anime.
- **anime_url**: The MyAnimeList URL of the anime.
- **title**: The name of the anime.
- **synopsis**: Short description of the plot of the anime.
- **main_pic**: URL to the cover picture of the anime.
- **type**: Type of the anime (e.g., TV, Movie, OVA).
- **source_type**: Type of the source of the anime (e.g., Manga, Light Novel).
- **num_episodes**: Number of episodes in the anime.
- **status**: The current status of the anime (Finished airing, Currently airing, Not yet aired).
- **start_date**: Start date of the anime.
- **end_date**: End date of the anime.
- **season**: Season and year the anime started airing (e.g., Winter 2020).
- **studios**: List of studios that created the anime.
- **genres**: List of genres associated with the anime (e.g., Action, Shonen).
- **score**: Average score of the anime on MyAnimeList.
- **score_count**: Number of users that scored the anime.
- **score_rank**: Rank of the anime based on its score on MyAnimeList.
- **popularity_rank**: Rank of the anime based on its popularity.
- **members_count**: Number of users who are members of the anime.
- **favorites_count**: Number of users who marked the anime as a favorite.
- **watching_count**: Number of users currently watching the anime.
- **completed_count**: Number of users who completed the anime.
- **on_hold_count**: Number of users who have the anime on hold.
- **dropped_count**: Number of users who dropped the anime.
- **plan_to_watch_count**: Number of users who plan to watch the anime.
- **total_count**: Total number of users (completed, plan to watch, watching, dropped, or on hold).
- **score_10_count**: Number of users who scored the anime a 10.
- **score_09_count**: Number of users who scored the anime a 9.
- **score_08_count**: Number of users who scored the anime an 8.
- **score_07_count**: Number of users who scored the anime a 7.
- **score_06_count**: Number of users who scored the anime a 6.
- **score_05_count**: Number of users who scored the anime a 5.
- **score_04_count**: Number of users who scored the anime a 4.
- **score_03_count**: Number of users who scored the anime a 3.
- **score_02_count**: Number of users who scored the anime a 2.
- **score_01_count**: Number of users who scored the anime a 1.
- **clubs**: List of MyAnimeList clubs the anime is part of.
- **pics**: List of URLs to pictures of the anime.


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('anime.csv', delimiter='\t', encoding='utf-8')


In [9]:
df = df.dropna(subset=['score'])

In [3]:
df['genres'] = df['genres'].str.split('|')
df['clubs'] = df['clubs'].str.split('|')

In [5]:
df.to_csv("cleaned_anime_data.csv", index=False)

In [6]:
df1 = pd.read_csv("cleaned_anime_data.csv")

In the cleaned DataFrame, rows where `score` is NaN are gone, and the multi-valued `genres` and `clubs` columns hold lists rather than `|`-separated strings.
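As a minimal sketch of these two steps on a toy frame (the rows below are hypothetical, not taken from `anime.csv`), rows without a `score` are dropped and each pipe-separated string is split into a list. A missing `genres` cell stays missing after the split.

```python
import pandas as pd

# Toy frame with hypothetical rows that mimic the raw columns
raw = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Action|Comedy", "Drama", None],
    "score": [7.5, None, 6.1],
})

# Drop rows without a target value, then split the pipe-separated strings
clean = raw.dropna(subset=["score"]).copy()
clean["genres"] = clean["genres"].str.split("|")

print(len(clean))               # number of rows that kept a score
print(clean["genres"].iloc[0])  # first remaining row, now a Python list
```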

In [10]:
print(list(df1.columns))

['anime_id', 'anime_url', 'title', 'synopsis', 'main_pic', 'type', 'source_type', 'num_episodes', 'status', 'start_date', 'end_date', 'season', 'studios', 'genres', 'score', 'score_count', 'score_rank', 'popularity_rank', 'members_count', 'favorites_count', 'watching_count', 'completed_count', 'on_hold_count', 'dropped_count', 'plan_to_watch_count', 'total_count', 'score_10_count', 'score_09_count', 'score_08_count', 'score_07_count', 'score_06_count', 'score_05_count', 'score_04_count', 'score_03_count', 'score_02_count', 'score_01_count', 'clubs', 'pics']


In [11]:
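print(df1.head)  # Note: without parentheses this prints the bound method and the full frame repr; call df1.head() for the first five rows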

<bound method NDFrame.head of        anime_id                                          anime_url  \
0          2366  https://myanimelist.net/anime/2366/Touma_Kishi...   
1          4940  https://myanimelist.net/anime/4940/Sabaku_no_K...   
2         50285  https://myanimelist.net/anime/50285/On_Air_Dek...   
3          3975  https://myanimelist.net/anime/3975/Uchi_no_3_S...   
4         36036    https://myanimelist.net/anime/36036/Running_Man   
...         ...                                                ...   
13374     32188  https://myanimelist.net/anime/32188/Steins_Gat...   
13375     31324  https://myanimelist.net/anime/31324/Grisaia_no...   
13376     31283  https://myanimelist.net/anime/31283/Bikini_War...   
13377     33142  https://myanimelist.net/anime/33142/Re_Zero_ka...   
13378     31234  https://myanimelist.net/anime/31234/Himouto_Um...   

                                                   title  \
0                                    Touma Kishinden Oni   
1        

In [12]:
df1 = df1.dropna(subset=['score'])

In [13]:
df1.to_csv("cleaned_anime_data_01.csv", index=False)

In [19]:
print(list(df2.columns))

['anime_id', 'anime_url', 'title', 'synopsis', 'main_pic', 'type', 'source_type', 'num_episodes', 'status', 'start_date', 'end_date', 'season', 'studios', 'genres', 'score', 'score_count', 'score_rank', 'popularity_rank', 'members_count', 'favorites_count', 'watching_count', 'completed_count', 'on_hold_count', 'dropped_count', 'plan_to_watch_count', 'total_count', 'score_10_count', 'score_09_count', 'score_08_count', 'score_07_count', 'score_06_count', 'score_05_count', 'score_04_count', 'score_03_count', 'score_02_count', 'score_01_count', 'clubs', 'pics']


In [20]:
print(df2[['title', 'genres']].head(10))  # Prints the first 10 rows

                                               title  \
0                                    Urasekai Picnic   
1                                           Kagewani   
2                                               R-15   
3                                      School Rumble   
4  Re:Zero kara Hajimeru Isekai Seikatsu 2nd Seas...   
5                            Shigatsu wa Kimi no Uso   
6                 Zhandou Wang Zhi Jufeng Zhan Hun 2   
7                         YAT Anshin! Uchuu Ryokou 2   
8                 Ring ni Kakero 1: Sekai Taikai-hen   
9                                     Moeru! Oniisan   

                                              genres  
0  ['Adventure', 'Fantasy', 'Girls Love', 'Myster...  
1  ['Horror', 'Mystery', 'Supernatural', 'Suspense']  
2  ['Comedy', 'Romance', 'Ecchi', 'Harem', 'School']  
3         ['Comedy', 'Romance', 'School', 'Shounen']  
4  ['Drama', 'Fantasy', 'Suspense', 'Psychological']  
5  ['Drama', 'Romance', 'Music', 'School', 'Shoun... 

In [18]:
print(df.shape[0])  # The first value in shape is the row count

13379


In [21]:
print(df2.shape[0])

10714


In [22]:
print(df2.isnull().sum())  # Shows count of missing values per column

anime_id                  0
anime_url                 0
title                     0
synopsis                  2
main_pic                  0
type                      0
source_type               0
num_episodes             65
status                    0
start_date                0
end_date                 93
season                 7008
studios                1506
genres                    0
score                     0
score_count               0
score_rank             1249
popularity_rank           0
members_count             0
favorites_count           0
watching_count            0
completed_count           0
on_hold_count             0
dropped_count             0
plan_to_watch_count       0
total_count               0
score_10_count            0
score_09_count            0
score_08_count            0
score_07_count            0
score_06_count            0
score_05_count            0
score_04_count            0
score_03_count            0
score_02_count            0
score_01_count      

In [24]:
df2[df2['num_episodes'].isna() | (df2['num_episodes'] == 0)]

Unnamed: 0,anime_id,anime_url,title,synopsis,main_pic,type,source_type,num_episodes,status,start_date,...,score_08_count,score_07_count,score_06_count,score_05_count,score_04_count,score_03_count,score_02_count,score_01_count,clubs,pics
127,42295,https://myanimelist.net/anime/42295/Fushigi_Da...,Fushigi Dagashiya: Zenitendou,Zenitendo is a mysterious candy store that onl...,https://cdn.myanimelist.net/images/anime/1994/...,TV,Novel,,Currently Airing,2020-09-08 00:00:00,...,8,27,36,16,15,4,7,6,['27907'],https://cdn.myanimelist.net/images/anime/1994/...
354,48365,https://myanimelist.net/anime/48365/Youkai_Wat...,Youkai Watch ♪,The new show will feature unique and returning...,https://cdn.myanimelist.net/images/anime/1682/...,TV,Game,,Currently Airing,2021-04-09 00:00:00,...,45,70,70,45,17,13,11,10,"['27907', '17659']",https://cdn.myanimelist.net/images/anime/1271/...
385,50418,https://myanimelist.net/anime/50418/Ninjala_TV,Ninjala (TV),"The year is 20XX. The ninja, who once forged t...",https://cdn.myanimelist.net/images/anime/1552/...,TV,Game,,Currently Airing,2022-01-08 00:00:00,...,15,31,49,34,23,6,10,10,"['27907', '8652', '72473']",https://cdn.myanimelist.net/images/anime/1602/...
391,50281,https://myanimelist.net/anime/50281/Delicious_...,Delicious Party♡Precure,.,https://cdn.myanimelist.net/images/anime/1332/...,TV,Original,,Currently Airing,2022-02-06 00:00:00,...,99,148,70,38,19,7,4,10,"['27907', '8652', '22866', '2167', '31353']",https://cdn.myanimelist.net/images/anime/1872/...
592,49285,https://myanimelist.net/anime/49285/Waccha_Pri...,Waccha PriMagi!,The new series continues the Pretty Series' co...,https://cdn.myanimelist.net/images/anime/1522/...,TV,Original,,Currently Airing,2021-10-03 00:00:00,...,54,76,68,70,43,38,17,15,['27907'],https://cdn.myanimelist.net/images/anime/1848/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9557,30088,https://myanimelist.net/anime/30088/Ryoutei_no...,Ryoutei no Aji,Short stories capturing the bittersweet and te...,https://cdn.myanimelist.net/images/anime/12/72...,Special,Original,,Currently Airing,2014-03-28 00:00:00,...,121,168,111,66,17,7,9,13,['57827'],https://cdn.myanimelist.net/images/anime/12/72...
9701,37706,https://myanimelist.net/anime/37706/Akuma_no_K...,Akuma no Kimuraa-hen,A cup noodle commercials where [Kuro] Hiyoko-c...,https://cdn.myanimelist.net/images/anime/1986/...,Special,Original,,Currently Airing,2018-04-06 00:00:00,...,66,139,121,82,28,21,11,17,"['57827', '65409']",https://cdn.myanimelist.net/images/anime/1986/...
9875,33904,https://myanimelist.net/anime/33904/Suntory_Te...,Suntory Tennensui CMs,Suntory commercials set in the alps. Shinji R...,https://cdn.myanimelist.net/images/anime/2/815...,Special,Original,,Currently Airing,2014-03-28 00:00:00,...,4,27,54,124,44,22,12,24,['57827'],https://cdn.myanimelist.net/images/anime/2/815...
10068,35300,https://myanimelist.net/anime/35300/Magia_Reco...,Magia Record: Mahou Shoujo Madoka☆Magica Gaiden,"The new heroine of is Iroha, a magical girl on...",https://cdn.myanimelist.net/images/anime/3/850...,Special,Game,,Currently Airing,2016-09-30 00:00:00,...,103,170,205,146,53,33,21,35,"['5450', '79626', '57827', '30325', '79532', '...",https://cdn.myanimelist.net/images/anime/3/850...


In [28]:
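# Fill before casting: in many pandas versions astype(str) turns NaN into the string "nan", which fillna then skips
df2['num_episodes'] = df2['num_episodes'].astype(object).fillna("Unknown")
df2.to_csv("your_dataset.csv", index=False)  # Writes a separate file; cleaned_anime_data_01.csv is left unchanged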
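The order of casting and filling matters here. The sketch below uses a small hypothetical Series to compare the two orders. In pandas versions where `astype(str)` produces plain Python strings, the missing value becomes the text `nan` and `fillna` has nothing left to fill; newer versions may behave differently, so the printed result of the first variant is worth checking in your own environment.

```python
import numpy as np
import pandas as pd

# Hypothetical episode counts with one missing value
episodes = pd.Series([12.0, np.nan, 24.0])

# Cast first, then fill: the NaN may already have become the string "nan"
cast_first = episodes.astype(str).fillna("Unknown")

# Fill first on an object column: the missing value is replaced directly
fill_first = episodes.astype(object).fillna("Unknown")

print(cast_first.tolist())
print(fill_first.tolist())
```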


In [29]:
print(df2.isnull().sum())

anime_id                  0
anime_url                 0
title                     0
synopsis                  2
main_pic                  0
type                      0
source_type               0
num_episodes              0
status                    0
start_date                0
end_date                 93
season                 7008
studios                1506
genres                    0
score                     0
score_count               0
score_rank             1249
popularity_rank           0
members_count             0
favorites_count           0
watching_count            0
completed_count           0
on_hold_count             0
dropped_count             0
plan_to_watch_count       0
total_count               0
score_10_count            0
score_09_count            0
score_08_count            0
score_07_count            0
score_06_count            0
score_05_count            0
score_04_count            0
score_03_count            0
score_02_count            0
score_01_count      

In [30]:
print(df2.describe())  # Gives statistics for numerical columns


           anime_id         score   score_count    score_rank  \
count  10714.000000  10714.000000  1.071400e+04   9465.000000   
mean   19361.731753      6.561258  3.728666e+04   5544.653883   
std    15723.362037      0.985277  1.229652e+05   3528.083235   
min        1.000000      1.850000  1.010000e+02      1.000000   
25%     3672.250000      6.020000  8.315000e+02   2483.000000   
50%    15694.000000      6.650000  3.486500e+03   5208.000000   
75%    34758.750000      7.250000  1.989225e+04   8394.000000   
max    51296.000000      9.050000  2.380891e+06  12406.000000   

       popularity_rank  members_count  favorites_count  watching_count  \
count     10714.000000   1.071400e+04     10714.000000    1.071400e+04   
mean       6054.510454   7.151474e+04       848.711592    4.497909e+03   
std        3873.499103   2.022233e+05      5826.392574    2.077321e+04   
min           1.000000   2.060000e+02         0.000000    1.000000e+00   
25%        2742.250000   2.563750e+03       

In [31]:
print(df2['type'].unique())  # See unique anime types (TV, Movie, OVA, etc.)
print(df2['season'].unique())  # See the different seasons in the dataset


['TV' 'ONA' 'OVA' 'Movie' 'Special']
['Winter 2021' 'Fall 2015' 'Summer 2011' 'Fall 2004' 'Fall 2014'
 'Fall 2013' 'Spring 1998' 'Spring 2011' 'Spring 1988' 'Spring 2019'
 'Summer 2017' 'Spring 2013' 'Spring 2007' 'Fall 2008' 'Spring 2020'
 'Fall 2020' 'Fall 2018' 'Winter 1987' 'Winter 2012' 'Winter 2015'
 'Winter 2009' 'Fall 1993' 'Spring 2004' 'Winter 1983' 'Fall 2001'
 'Spring 2010' 'Fall 1991' 'Fall 1976' 'Winter 1996' 'Fall 1983'
 'Winter 2000' 'Spring 2014' 'Winter 2020' 'Fall 2016' 'Winter 2022'
 'Summer 2016' 'Spring 2006' 'Fall 2009' 'Spring 2003' 'Spring 2000'
 'Fall 1968' 'Winter 1991' 'Fall 1982' 'Fall 1985' 'Spring 1994'
 'Winter 1994' 'Spring 1993' 'Winter 1999' 'Summer 2004' 'Summer 2012'
 'Spring 2021' 'Fall 1990' 'Spring 1995' 'Fall 1986' 'Spring 1982'
 'Winter 1993' 'Winter 1984' 'Fall 1987' 'None None' 'Winter 2006'
 'Summer 2001' 'Summer 2005' 'Fall 1979' 'Summer 2013' 'Fall 2019'
 'Fall 2021' 'Fall 2003' 'Winter 2018' 'Spring 1999' 'Spring 1986'
 'Spring 2005' 'Spr

In [32]:
from collections import Counter
genre_counts = Counter([g for sublist in df2['genres'] for g in sublist])
print(genre_counts.most_common(10))  # Top 10 most frequent genres


[("'", 69144), (' ', 27844), (',', 23858), ('e', 23024), ('o', 20523), ('a', 19245), ('n', 16948), ('i', 16853), ('c', 13905), ('t', 13283)]


In [33]:
print(df2[['title', 'score']].sort_values(by='score', ascending=False).head(10))


                                            title  score
1351           Shingeki no Kyojin Season 3 Part 2   9.05
3648             Fullmetal Alchemist: Brotherhood   9.03
2770  Shingeki no Kyojin: The Final Season Part 2   9.02
1663                                  Steins;Gate   8.98
1973                       Hunter x Hunter (2011)   8.96
9088                               Koe no Katachi   8.94
8983                      Violet Evergarden Movie   8.93
2890                     Fruits Basket: The Final   8.93
2854                          Gintama': Enchousen   8.92
1618                   3-gatsu no Lion 2nd Season   8.90


In [10]:
df2 = pd.read_csv("cleaned_anime_data_01.csv")

In [3]:
print(df2['genres'].dtype)  # Check the overall data type


object


In [4]:
print(type(df2['genres'].iloc[0]))  # Check the type of the first row's genre value


<class 'str'>


The problem is that the `genres` column holds strings, not lists. Iterating over a string yields one character at a time, which is why the counter above tallied characters instead of genres.
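The effect can be reproduced without pandas. In this small sketch (the cell values are hypothetical), the same nested loop is run once over strings that only look like lists and once over real lists.

```python
from collections import Counter

# Cells as they come back from the CSV: text that merely looks like a list
cells_as_text = ["['Comedy', 'Romance']", "['Comedy']"]
# The same cells after parsing into real Python lists
cells_as_list = [["Comedy", "Romance"], ["Comedy"]]

# Iterating over a string yields characters, so the quote character wins
char_counts = Counter(g for cell in cells_as_text for g in cell)
# Iterating over a list yields its elements, so genres are counted
genre_counts = Counter(g for cell in cells_as_list for g in cell)

print(char_counts.most_common(1))   # [("'", 6)]
print(genre_counts.most_common(1))  # [('Comedy', 2)]
```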

In [11]:
import ast

df2['genres'] = df2['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)


In [12]:
print(type(df2['genres'].iloc[0]))


<class 'list'>


The column now holds lists. Save the result back to the same file.

In [13]:
df2.to_csv("cleaned_anime_data_01.csv", index=False)  # Saves without the index column


In [19]:
df3 = pd.read_csv("cleaned_anime_data_01.csv")


In [20]:
print(type(df3['genres'].iloc[0]))

<class 'str'>


In the newly loaded copy of the saved dataset, `df3`, the `genres` column is a `str` again rather than a `list`. CSV is a plain-text format, so every list is written out as its string representation when saved. The column therefore has to be converted back to lists after each load.
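One way to avoid repeating the conversion by hand is to parse the column while reading, using the `converters` argument of `pd.read_csv`. The sketch below round-trips a small hypothetical frame through an in-memory buffer instead of a file. It assumes the column has no empty cells, since `ast.literal_eval` raises on an empty string.

```python
import ast
import io

import pandas as pd

# Hypothetical frame with a list-valued column
frame = pd.DataFrame({
    "title": ["A", "B"],
    "genres": [["Comedy", "Romance"], ["Drama"]],
})

# Write to an in-memory CSV so the example needs no files on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Plain read: the lists come back as their string representation
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with a converter: each cell is parsed back into a list on load
buffer.seek(0)
parsed = pd.read_csv(buffer, converters={"genres": ast.literal_eval})

print(type(plain["genres"].iloc[0]))   # <class 'str'>
print(type(parsed["genres"].iloc[0]))  # <class 'list'>
```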

## Use Parquet for Large-Scale ML Projects

For large datasets (millions of rows), Parquet is often preferred over CSV because it:

1. Stores lists and nested data natively, so no string-to-list conversion is needed after loading.
2. Typically loads faster than CSV, since it is a typed, columnar binary format.
3. Typically uses less disk space, thanks to columnar compression.


In [None]:
import pandas as pd
import ast

# Load CSV
df = pd.read_csv('cleaned_anime_data_01.csv')

# Convert 'genres' from string to list
df['genres'] = df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Save to Parquet
df.to_parquet("cleaned_anime_data.parquet", engine="pyarrow")

# Load Parquet
df2 = pd.read_parquet("cleaned_anime_data.parquet", engine="pyarrow")

# Check the type of a 'genres' value after the round trip
print(type(df2['genres'].iloc[0]))  # With the pyarrow engine, list columns come back as numpy.ndarray, not list


<class 'numpy.ndarray'>

In [None]:
from collections import Counter
genre_counts = Counter([g for sublist in df2['genres'] for g in sublist])
print(genre_counts.most_common(10))  # Top 10 most frequent genres

[('Comedy', 4165), ('Action', 3123), ('Fantasy', 2294), ('Adventure', 1993), ('Drama', 1926), ('Sci-Fi', 1868), ('Romance', 1649), ('Shounen', 1587), ('School', 1414), ('Slice of Life', 1332)]
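The count works because the values read back from Parquet are iterable, even though they are `numpy.ndarray` objects rather than lists. Where a true Python list is needed, the arrays can be converted after loading. The sketch below uses a hand-built stand-in for the loaded column (hypothetical values) so that it runs without a Parquet file.

```python
import numpy as np
import pandas as pd

# Stand-in for a list column as read back from Parquet (hypothetical values)
loaded = pd.DataFrame({
    "genres": [np.array(["Comedy", "Romance"]), np.array(["Drama"])],
})

# ndarray.tolist() returns a plain Python list of Python strings
loaded["genres"] = loaded["genres"].apply(lambda arr: arr.tolist())

print(type(loaded["genres"].iloc[0]))  # <class 'list'>
```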

In [None]:
df = pd.read_parquet("cleaned_anime_data.parquet")