In [2]:
import pandas as pd

We load the anime dataset into a pandas dataframe.

In [3]:
anime_df = pd.read_csv('/content/drive/MyDrive/DIC-Anime-Recommendation/Dataset-2/anime-dataset-2023.csv')

In [4]:
anime_df.head()

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",...,Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...
4,8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",...,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001,https://cdn.myanimelist.net/images/anime/7/215...


The **Aired** attribute is very important for us. Our goal is to extract information such as how long an anime ran, whether it is still ongoing, and how many episodes it has.

**Preprocessing step 1**: Extract the start date and end date of each anime and add them as two new columns to the dataframe.

I split the **Aired** string on the word **to**, which separates the start date from the end date.

In [5]:
aired = anime_df['Aired'].str.split('to', expand=True)

Then strip the surrounding whitespace from both parts.

In [6]:
aired[0] = aired[0].str.strip()
aired[1] = aired[1].str.strip()

In [7]:
aired

Unnamed: 0,0,1
0,"Apr 3, 1998","Apr 24, 1999"
1,"Sep 1, 2001",
2,"Apr 1, 1998","Sep 30, 1998"
3,"Jul 3, 2002","Dec 25, 2002"
4,"Sep 30, 2004","Sep 29, 2005"
...,...,...
24900,"Jul 4, 2023",?
24901,"Jul 27, 2023",?
24902,"Jul 19, 2023",?
24903,"Apr 23, 2022",

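The split-and-strip step can be tried on a handful of strings before running it on the full column. This is a minimal sketch: the `sample` Series is made up for illustration and only mirrors the three **Aired** formats visible above.

```python
import pandas as pd

# Hypothetical sample mirroring the Aired formats seen above
sample = pd.Series([
    'Apr 3, 1998 to Apr 24, 1999',  # start and end date
    'Sep 1, 2001',                  # single air date, no "to"
    'Jul 4, 2023 to ?',             # end date not known yet
])

# Split on the word "to"; expand=True returns one column per part
parts = sample.str.split('to', expand=True)

# Remove the spaces left around each part
parts[0] = parts[0].str.strip()
parts[1] = parts[1].str.strip()

print(parts)
```

Rows without **to** get a missing value in the second column, while ongoing shows keep the literal **?** there.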

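Finally, convert both the start date and the end date to datetime objects. With errors='coerce', any value that does not match the format (such as **?** or a missing end date) becomes NaT.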

In [8]:
aired[0] = pd.to_datetime(aired[0], format='%b %d, %Y', errors='coerce')
aired[1] = pd.to_datetime(aired[1], format='%b %d, %Y', errors='coerce')

In [9]:
aired

Unnamed: 0,0,1
0,1998-04-03,1999-04-24
1,2001-09-01,NaT
2,1998-04-01,1998-09-30
3,2002-07-03,2002-12-25
4,2004-09-30,2005-09-29
...,...,...
24900,2023-07-04,NaT
24901,2023-07-27,NaT
24902,2023-07-19,NaT
24903,2022-04-23,NaT

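The effect of errors='coerce' can be seen in isolation on a few values. This is a sketch only: the `raw` Series is hypothetical, and the year-only string is an assumed example of a value that does not match the format.

```python
import pandas as pd

# Hypothetical values: a full date, the "?" placeholder, a missing value,
# and a year-only string (assumed here purely for illustration)
raw = pd.Series(['Apr 3, 1998', '?', None, '2005'])

# Values that do not match '%b %d, %Y' become NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%b %d, %Y', errors='coerce')

print(parsed)
print(parsed.notna().sum())  # number of successfully parsed dates
```

Counting the non-missing values this way is a quick check of how many rows were parsed successfully.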

Rename the columns

In [10]:
aired.rename(columns={0: 'Start Date', 1: 'End Date'}, inplace=True)

In [11]:
aired

Unnamed: 0,Start Date,End Date
0,1998-04-03,1999-04-24
1,2001-09-01,NaT
2,1998-04-01,1998-09-30
3,2002-07-03,2002-12-25
4,2004-09-30,2005-09-29
...,...,...
24900,2023-07-04,NaT
24901,2023-07-27,NaT
24902,2023-07-19,NaT
24903,2022-04-23,NaT


Insert the new columns into the original dataframe, right after the **Aired** column

In [12]:
anime_df.insert(10, 'Start Date', aired['Start Date'])
anime_df.insert(11, 'End Date', aired['End Date'])

In [13]:
anime_df.head()

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",...,Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...
4,8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",...,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001,https://cdn.myanimelist.net/images/anime/7/215...

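With both dates stored as datetime columns, the run length mentioned at the start can be derived by subtraction. The following is a sketch under assumptions: `df` is a small stand-in frame built from the first three rows shown above, and the `Run Days` column name is hypothetical and not part of this notebook.

```python
import pandas as pd

# Small stand-in frame; the dates are the first three rows shown above
df = pd.DataFrame({
    'Start Date': pd.to_datetime(['1998-04-03', '2001-09-01', '1998-04-01']),
    'End Date': pd.to_datetime(['1999-04-24', None, '1998-09-30']),
})

# Subtracting datetime columns gives timedeltas; .dt.days extracts whole days.
# Rows with a missing end date (NaT) yield NaN.
df['Run Days'] = (df['End Date'] - df['Start Date']).dt.days  # hypothetical column

print(df)
```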

In [14]:
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   anime_id      24905 non-null  int64         
 1   Name          24905 non-null  object        
 2   English name  24905 non-null  object        
 3   Other name    24905 non-null  object        
 4   Score         24905 non-null  object        
 5   Genres        24905 non-null  object        
 6   Synopsis      24905 non-null  object        
 7   Type          24905 non-null  object        
 8   Episodes      24905 non-null  object        
 9   Aired         24905 non-null  object        
 10  Start Date    20090 non-null  datetime64[ns]
 11  End Date      9337 non-null   datetime64[ns]
 12  Premiered     24905 non-null  object        
 13  Status        24905 non-null  object        
 14  Producers     24905 non-null  object        
 15  Licensors     24905 non-null  object

**Preprocessing step 2**: Add a new column named **Ongoing**. The **Aired** column has the format start date to end date, and animes that are still airing have **?** in place of the end date. Rows whose **Aired** value contains **?** are therefore tagged as ongoing (1), and all other rows get 0.

In [15]:
import numpy as np

In [16]:
def check(value):
    return 1 if '?' in value else 0

In [17]:
anime_df['Ongoing'] = anime_df['Aired'].apply(check)

In [18]:
anime_df.loc[11,'Ongoing']

1

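The same flag can also be computed without a Python-level function. This is a sketch of one alternative using `Series.str.contains`; the `aired_sample` Series is hypothetical, and `regex=False` is passed because **?** is a special character in regular expressions.

```python
import pandas as pd

# Hypothetical sample of Aired strings
aired_sample = pd.Series([
    'Oct 20, 1999 to ?',
    'Sep 1, 2001',
    'Apr 1, 1998 to Sep 30, 1998',
])

# regex=False treats '?' as a literal character rather than a regex quantifier
ongoing = aired_sample.str.contains('?', regex=False).astype(int)

print(ongoing.tolist())
```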
In [19]:
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 27 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   anime_id      24905 non-null  int64         
 1   Name          24905 non-null  object        
 2   English name  24905 non-null  object        
 3   Other name    24905 non-null  object        
 4   Score         24905 non-null  object        
 5   Genres        24905 non-null  object        
 6   Synopsis      24905 non-null  object        
 7   Type          24905 non-null  object        
 8   Episodes      24905 non-null  object        
 9   Aired         24905 non-null  object        
 10  Start Date    20090 non-null  datetime64[ns]
 11  End Date      9337 non-null   datetime64[ns]
 12  Premiered     24905 non-null  object        
 13  Status        24905 non-null  object        
 14  Producers     24905 non-null  object        
 15  Licensors     24905 non-null  object

**Preprocessing step 3**: The **Episodes** field is also very important for us. From the number of episodes we can infer whether people prefer short or long animes.
However, some records in our dataset have "UNKNOWN" in the **Episodes** field, usually because the anime is still airing. For analysis purposes only, we assume that all such animes end on Jan 01, 2024 and that one episode is released per week, so the episode count is estimated as the number of days from the start date to Jan 01, 2024, divided by 7. The estimate can be fractional, and rows without a parsed start date end up as NaN.

In [21]:
for index, row in anime_df.iterrows():
    if row['Episodes'] == 'UNKNOWN':
        anime_df.loc[index, 'Episodes'] = ((pd.to_datetime('Jan 01, 2024', format='%b %d, %Y') - row['Start Date']).days / 7)

In [22]:
anime_df[anime_df['Episodes'] == 'UNKNOWN']

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL,Ongoing

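The row-by-row loop can also be expressed as a single vectorized assignment, which is usually faster on a frame of this size. This is a sketch, not the notebook's method: `df` is a hypothetical stand-in for anime_df with only the two columns involved, and `cutoff`, `unknown` and `estimate` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for anime_df with only the columns used here
df = pd.DataFrame({
    'Episodes': pd.Series(['26.0', 'UNKNOWN', 'UNKNOWN'], dtype=object),
    'Start Date': pd.to_datetime(['1998-04-03', '1999-10-20', None]),
})

cutoff = pd.Timestamp('2024-01-01')    # assumed end date for unfinished shows
unknown = df['Episodes'] == 'UNKNOWN'  # boolean mask of rows to estimate

# Days since the start date divided by 7 (one episode per week, an assumption)
estimate = (cutoff - df.loc[unknown, 'Start Date']).dt.days / 7
df.loc[unknown, 'Episodes'] = estimate

print(df)
```

Rows with a missing start date receive NaN, so no "UNKNOWN" strings remain afterwards.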

In [23]:
anime_df.loc[11]

Unnamed: 0,11
anime_id,21
Name,One Piece
English name,One Piece
Other name,ONE PIECE
Score,8.69
Genres,"Action, Adventure, Fantasy"
Synopsis,"Gol D. Roger was known as the ""Pirate King,"" t..."
Type,TV
Episodes,1262.714286
Aired,"Oct 20, 1999 to ?"

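The value shown for One Piece can be checked by hand. The sketch below repeats the calculation for that single row, using the start date from its **Aired** field and the assumed Jan 01, 2024 end date; the variable names are illustrative.

```python
import pandas as pd

cutoff = pd.Timestamp('2024-01-01')  # assumed end date from step 3
start = pd.Timestamp('1999-10-20')   # One Piece start date from the Aired field

days = (cutoff - start).days         # whole days between the two dates
episodes = days / 7                  # one episode per week (assumption)

print(days, round(episodes, 6))      # 8839 1262.714286
```

The result is an estimate of elapsed weeks rather than an actual episode count, which is why it is not a whole number.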

**Preprocessing Step 4:** Finally, we normalize the **Episodes** field to bring it to a common scale for comparison across animes.

We use min-max normalization, which rescales the values to the range 0 to 1:

$$X_{norm} = \frac{X - \min(X)}{\max(X) - \min(X)}$$



In [28]:
anime_df['Episodes'] = anime_df['Episodes'].astype(float)

In [29]:
anime_df['Episodes'] = (anime_df['Episodes'] - anime_df['Episodes'].min()) / (anime_df['Episodes'].max() - anime_df['Episodes'].min())

In [30]:
anime_df['Episodes']

Unnamed: 0,Episodes
0,0.008181
1,0.000000
2,0.008181
3,0.008181
4,0.016688
...,...
24900,0.004581
24901,0.005563
24902,0.004908
24903,0.000000

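The scaling can be checked on a few numbers. This is a sketch under assumptions: the `episodes` Series is illustrative, and the extremes 1 and 3057 are assumed values, chosen because they reproduce the first normalized outputs displayed above.

```python
import pandas as pd

# Illustrative episode counts; 1 and 3057 are assumed extremes
episodes = pd.Series([26.0, 1.0, 52.0, 3057.0])

lo, hi = episodes.min(), episodes.max()
normalized = (episodes - lo) / (hi - lo)  # min-max scaling to the range 0 to 1

print(normalized.round(6).tolist())
```

Because `min()` and `max()` skip NaN by default, rows whose episode count could not be estimated stay NaN after scaling.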