### Phase 1: Scraping the Data

Reference link: https://myanimelist.net/anime.php

#### Optimizing the code to scrape info from all genres

In [48]:
from bs4 import BeautifulSoup as bs
import requests
import re

# Getting the links to each genre page first
r = requests.get('https://myanimelist.net/anime.php')
r.raise_for_status()  # Stop early if the request failed

soup = bs(r.content, 'html.parser')

# Extracting the links for each genre
links = [link.get('href') for link in soup.select('a[class]') if link.get('href')]

# Keeping only the links that point to genre pages
valid_links = []
for i in links:
    if 'genre' in i:
        valid_links.append(i)

data = []
for i in range(1,10):
    for genre_url in valid_links:
        print(f'Current Genre is: {genre_url}')
        url = 'https://myanimelist.net'
        page = f'?page={i}'
        full_url = url + genre_url + page
        r = requests.get(full_url)
        soup = bs(r.content, 'html.parser')
        infoboxes = soup.find_all('div', class_='js-anime-category-producer')

        for item in infoboxes:
            anime_info = {}
            # Getting the title:
            title_element = item.select('h2.h2_anime_title a.link-title')
            # Take the first match so the value is a string rather than a one-item list
            # (otherwise the DataFrame cell would show the square brackets)
            anime_info['title'] = [name.text for name in title_element][0]

            # Getting the rating:
            rating_element = item.select('div.scormem-item.score.score-label')
            anime_info['rating'] = [x.text.strip() for x in rating_element][0]

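            # Getting the released year: take the text after the first comma of the
            # card, up to the end of that line (a comma in the title shifts this split)
            year_released = item.text.split(',')[1].split('\n')[0].strip()
            # Keep the first run of digits or non-digits, e.g. '2013'
            year_released = re.findall(r"\d+|\D+", year_released)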
            if year_released:
                anime_info['year_released'] = year_released[0]
            else:
                anime_info['year_released'] = None

            # Getting the anime type
            anime_element = item.find_all('div', class_ = 'info')
            anime_info['anime_type'] = [anitype.text.split(',')[0] for anitype in anime_element][0]

            # Getting the number of episodes
            episode_element = item.find_all('div', class_ = 'info')
            anime_info['episodes'] = [ee.text.split(',')[1].split('\n')[1].strip() for ee in episode_element][0]

            # Getting the duration of the anime
            duration_element = item.find_all('div', class_ = 'info')
            anime_info['duration'] = [dura.text.split(',')[2].split('\n')[1].strip() for dura in duration_element][0]


            # Getting studio source and themes of the anime:
            element = item.select('.properties .item')
            anime_info['studio'] = [s.text.replace(' ','').strip() for s in element][0]
            anime_info['source'] = [s.text.replace(' ','').strip() for s in element][1]
            anime_info['themes'] = [s.text.replace(' ','').strip() for s in element][2:]

            # Getting genre
            genre_element = item.select('div .genre')
            anime_info['genre'] = [s.text.strip().replace(' ','') for s in genre_element]

            # Getting the number of members
            member_element = item.select('div.scormem-item.member')
            anime_info['member'] = [s.text.strip() for s in member_element][0]

            # Getting the synopsis
            sypnopsis_element = item.select('div p')
            anime_info['sypnopsis'] = [s.text.split('[Written by MAL Rewrite]')[0].strip().replace('\r\n',' ') for s in sypnopsis_element][0]
            
            # Collecting the record; the list is turned into a DataFrame later
            data.append(anime_info)

Current Genre is: /anime/genre/1/Action
Current Genre is: /anime/genre/2/Adventure
Current Genre is: /anime/genre/5/Avant_Garde
Current Genre is: /anime/genre/46/Award_Winning
Current Genre is: /anime/genre/28/Boys_Love
Current Genre is: /anime/genre/4/Comedy
Current Genre is: /anime/genre/8/Drama
Current Genre is: /anime/genre/10/Fantasy
Current Genre is: /anime/genre/26/Girls_Love
Current Genre is: /anime/genre/47/Gourmet
Current Genre is: /anime/genre/14/Horror
Current Genre is: /anime/genre/7/Mystery
Current Genre is: /anime/genre/22/Romance
Current Genre is: /anime/genre/24/Sci-Fi
Current Genre is: /anime/genre/36/Slice_of_Life
Current Genre is: /anime/genre/30/Sports
Current Genre is: /anime/genre/37/Supernatural
Current Genre is: /anime/genre/41/Suspense
Current Genre is: /anime/genre/9/Ecchi
Current Genre is: /anime/genre/49/Erotica
Current Genre is: /anime/genre/12/Hentai
Current Genre is: /anime/genre/50/Adult_Cast
Current Genre is: /anime/genre/51/Anthropomorphic
...

Current Genre is: /anime/genre/62/Isekai
Current Genre is: /anime/genre/63/Iyashikei
Current Genre is: /anime/genre/64/Love_Polygon
Current Genre is: /anime/genre/65/Magical_Sex_Shift
Current Genre is: /anime/genre/66/Mahou_Shoujo
Current Genre is: /anime/genre/17/Martial_Arts
Current Genre is: /anime/genre/18/Mecha
Current Genre is: /anime/genre/67/Medical
Current Genre is: /anime/genre/38/Military
Current Genre is: /anime/genre/19/Music
Current Genre is: /anime/genre/6/Mythology
Current Genre is: /anime/genre/68/Organized_Crime
Current Genre is: /anime/genre/69/Otaku_Culture
Current Genre is: /anime/genre/20/Parody
Current Genre is: /anime/genre/70/Performing_Arts
Current Genre is: /anime/genre/71/Pets
Current Genre is: /anime/genre/40/Psychological
Current Genre is: /anime/genre/3/Racing
Current Genre is: /anime/genre/72/Reincarnation
Current Genre is: /anime/genre/73/Reverse_Harem
Current Genre is: /anime/genre/74/Romantic_Subtext
Current Genre is: /anime/genre/21/Samurai
...

Current Genre is: /anime/genre/42/Seinen
Current Genre is: /anime/genre/25/Shoujo
Current Genre is: /anime/genre/27/Shounen
Current Genre is: /anime/genre/1/Action
Current Genre is: /anime/genre/2/Adventure
Current Genre is: /anime/genre/5/Avant_Garde
Current Genre is: /anime/genre/46/Award_Winning
Current Genre is: /anime/genre/28/Boys_Love
Current Genre is: /anime/genre/4/Comedy
Current Genre is: /anime/genre/8/Drama
Current Genre is: /anime/genre/10/Fantasy
Current Genre is: /anime/genre/26/Girls_Love
Current Genre is: /anime/genre/47/Gourmet
Current Genre is: /anime/genre/14/Horror
Current Genre is: /anime/genre/7/Mystery
Current Genre is: /anime/genre/22/Romance
Current Genre is: /anime/genre/24/Sci-Fi
Current Genre is: /anime/genre/36/Slice_of_Life
Current Genre is: /anime/genre/30/Sports
Current Genre is: /anime/genre/37/Supernatural
Current Genre is: /anime/genre/41/Suspense
Current Genre is: /anime/genre/9/Ecchi
Current Genre is: /anime/genre/49/Erotica
...

Current Genre is: /anime/genre/60/Idols_Female
Current Genre is: /anime/genre/61/Idols_Male
Current Genre is: /anime/genre/62/Isekai
Current Genre is: /anime/genre/63/Iyashikei
Current Genre is: /anime/genre/64/Love_Polygon
Current Genre is: /anime/genre/65/Magical_Sex_Shift
Current Genre is: /anime/genre/66/Mahou_Shoujo
Current Genre is: /anime/genre/17/Martial_Arts
Current Genre is: /anime/genre/18/Mecha
Current Genre is: /anime/genre/67/Medical
Current Genre is: /anime/genre/38/Military
Current Genre is: /anime/genre/19/Music
Current Genre is: /anime/genre/6/Mythology
Current Genre is: /anime/genre/68/Organized_Crime
Current Genre is: /anime/genre/69/Otaku_Culture
Current Genre is: /anime/genre/20/Parody
Current Genre is: /anime/genre/70/Performing_Arts
Current Genre is: /anime/genre/71/Pets
Current Genre is: /anime/genre/40/Psychological
Current Genre is: /anime/genre/3/Racing
Current Genre is: /anime/genre/72/Reincarnation
Current Genre is: /anime/genre/73/Reverse_Harem
...
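
The comma-based splitting in the scraping cell is sensitive to commas in titles and to cards that have no year yet. As a minimal sketch of a more tolerant approach, the helper below (`parse_info` is a hypothetical name, not part of the original notebook) pulls the year, episode count and duration out of the text of one `info` element with regular expressions. It assumes the text looks roughly like `TV, 2013 ... 25 eps, 24 min`; that layout is an assumption and should be checked against the live page.

```python
import re

# Assumed layout of the text inside <div class="info">, for example
# "TV, 2013\n  Finished\n  25 eps, 24 min" (this layout is an assumption).
YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")
EPISODES_RE = re.compile(r"(\d+)\s*eps?\b")
DURATION_RE = re.compile(r"(\d+)\s*min\b")


def parse_info(info_text):
    """Pull year, episode count and duration out of one info string."""
    year = YEAR_RE.search(info_text)
    episodes = EPISODES_RE.search(info_text)
    duration = DURATION_RE.search(info_text)
    # Fields that cannot be found are returned as None instead of raising
    return {
        "year_released": int(year.group(0)) if year else None,
        "episodes": int(episodes.group(1)) if episodes else None,
        "duration": int(duration.group(1)) if duration else None,
    }
```

Because the helper only looks at the `info` text, a comma in the title can no longer shift the result, and unknown values such as `?` come back as `None`.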

### Output DataFrame (Looks Good)

In [54]:
import pandas as pd
final = pd.DataFrame(data)
final.to_excel('My_anime_list_raw_data.xlsx', index = False)
final.head()

Unnamed: 0,title,rating,year_released,anime_type,episodes,duration,studio,source,themes,genre,member,sypnopsis
0,Shingeki no Kyojin,8.54,2013,TV,25 eps,24 min,WitStudio,Manga,"[Gore, Military, Survival, Shounen]","[Action, AwardWinning, Drama, Suspense]",3.8M,"Centuries ago, mankind was slaughtered to near..."
1,Fullmetal Alchemist: Brotherhood,9.1,2009,TV,64 eps,24 min,Bones,Manga,"[Military, Shounen]","[Action, Adventure, Drama, Fantasy]",3.2M,After a horrific alchemy experiment goes wrong...
2,One Punch Man,8.5,2015,TV,12 eps,24 min,Madhouse,Webmanga,"[AdultCast, Parody, SuperPower, Seinen]","[Action, Comedy]",3.1M,The seemingly unimpressive Saitama has a rathe...
3,Sword Art Online,7.2,2012,TV,25 eps,23 min,A-1Pictures,Lightnovel,"[LovePolygon, VideoGame]","[Action, Adventure, Fantasy, Romance]",3.0M,Ever since the release of the innovative Nerve...
4,Boku no Hero Academia,7.88,2016,TV,13 eps,24 min,Bones,Manga,"[School, SuperPower, Shounen]",[Action],2.9M,"The appearance of ""quirks,"" newly discovered s..."


### Phase 2: Exploring the Data

In [115]:
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.2f' % x)

mal = pd.read_excel('My_anime_list_raw_data.xlsx')

In [116]:
mal

Unnamed: 0,title,rating,year_released,anime_type,episodes,duration,studio,source,themes,genre,member,sypnopsis
0,Shingeki no Kyojin,8.54,2013,TV,25 eps,24 min,WitStudio,Manga,"['Gore', 'Military', 'Survival', 'Shounen']","['Action', 'AwardWinning', 'Drama', 'Suspense']",3.8M,"Centuries ago, mankind was slaughtered to near..."
1,Fullmetal Alchemist: Brotherhood,9.10,2009,TV,64 eps,24 min,Bones,Manga,"['Military', 'Shounen']","['Action', 'Adventure', 'Drama', 'Fantasy']",3.2M,After a horrific alchemy experiment goes wrong...
2,One Punch Man,8.50,2015,TV,12 eps,24 min,Madhouse,Webmanga,"['AdultCast', 'Parody', 'SuperPower', 'Seinen']","['Action', 'Comedy']",3.1M,The seemingly unimpressive Saitama has a rathe...
3,Sword Art Online,7.20,2012,TV,25 eps,23 min,A-1Pictures,Lightnovel,"['LovePolygon', 'VideoGame']","['Action', 'Adventure', 'Fantasy', 'Romance']",3.0M,Ever since the release of the innovative Nerve...
4,Boku no Hero Academia,7.88,2016,TV,13 eps,24 min,Bones,Manga,"['School', 'SuperPower', 'Shounen']",['Action'],2.9M,"The appearance of ""quirks,"" newly discovered s..."
...,...,...,...,...,...,...,...,...,...,...,...,...
10224,Fresh Precure! Movie: Omocha no Kuni wa Himits...,6.69,2009,Movie,1 ep,70 min,ToeiAnimation,Original,['MahouShoujo'],"['Action', 'Comedy', 'Fantasy']",4.9K,After resolving the problems in the Labyrinth ...
10225,Precure All Stars Movie New Stage 2: Kokoro no...,7.16,2013,Movie,1 ep,71 min,ToeiAnimation,Original,['MahouShoujo'],['Action'],4.8K,"One day, the Pretty Cures receive an invitatio..."
10226,Dokidoki! Precure Movie: Mana Kekkon!!? Mirai ...,6.87,2013,Movie,1 ep,71 min,ToeiAnimation,Original,['MahouShoujo'],['Fantasy'],4.6K,Mana is given a wedding dress which was worn b...
10227,Ginga Ojousama Densetsu Yuna: Kanashimi no Siren,6.12,1995,OVA,2 eps,28 min,J.C.Staff,Visualnovel,"['MahouShoujo', 'Mecha']","['Action', 'Adventure', 'Comedy', 'Sci-Fi']",4.5K,"Yuna Kagurazaka, a clumsy and sweet 16-year-ol..."
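
After the round trip through Excel, the `themes` and `genre` columns come back as strings such as `"['Gore', 'Military']"` rather than Python lists. If real lists are needed later, one option is to parse them back with `ast.literal_eval`. The sketch below is illustrative only; `parse_list_cell` is a hypothetical helper name.

```python
import ast


def parse_list_cell(cell):
    """Turn a string such as "['Gore', 'Military']" back into a Python list."""
    if isinstance(cell, list):
        return cell
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        # Anything that is not a valid Python literal becomes an empty list
        return []
    return list(value) if isinstance(value, (list, tuple)) else []
```

Applied with something like `mal['themes'].map(parse_list_cell)`, this would need to run before any step that strips the square brackets.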


In [117]:
# Cleaning up the data
# Convert member strings such as '3.8M' or '4.9K' into plain numbers.
# Note: slicing off the last character also removes the final digit of
# counts that carry no K/M suffix.
mal['member_count'] = mal['member'].str[:-1]
mal['member_count'] = pd.to_numeric(mal['member_count'],errors ='coerce')
mal['member_count'] = mal.apply(lambda x: x['member_count']*1000000 if 'M' in x['member'] 
                                    else (x['member_count']*1000 if 'K' in x['member'] 
                                          else x['member_count']), axis = 1)
mal.drop(columns = ['member'], inplace = True)

# Strip the square brackets left over from the stringified lists
mal['themes'] = mal['themes'].astype('string')
mal['themes'] = mal['themes'].str.replace(r'\[|\]', '', regex=True)
mal['genre'] = mal['genre'].astype('string')
mal['genre'] = mal['genre'].str.replace(r'\[|\]', '', regex=True)

# Drop the unit suffixes; a singular '1 ep' is left unchanged by the 'eps' split
mal['episodes'] = mal['episodes'].apply(lambda x: x.split('eps')[0].strip() if x else x)
mal['duration'] = mal['duration'].apply(lambda x: x.split('min')[0].strip() if x else x)

mal.rename(columns = {'duration':'duration(mins)'}, inplace = True)
mal.head()

Unnamed: 0,title,rating,year_released,anime_type,episodes,duration(mins),studio,source,themes,genre,sypnopsis,member_count
0,Shingeki no Kyojin,8.54,2013,TV,25,24,WitStudio,Manga,"'Gore', 'Military', 'Survival', 'Shounen'","'Action', 'AwardWinning', 'Drama', 'Suspense'","Centuries ago, mankind was slaughtered to near...",3800000.0
1,Fullmetal Alchemist: Brotherhood,9.1,2009,TV,64,24,Bones,Manga,"'Military', 'Shounen'","'Action', 'Adventure', 'Drama', 'Fantasy'",After a horrific alchemy experiment goes wrong...,3200000.0
2,One Punch Man,8.5,2015,TV,12,24,Madhouse,Webmanga,"'AdultCast', 'Parody', 'SuperPower', 'Seinen'","'Action', 'Comedy'",The seemingly unimpressive Saitama has a rathe...,3100000.0
3,Sword Art Online,7.2,2012,TV,25,23,A-1Pictures,Lightnovel,"'LovePolygon', 'VideoGame'","'Action', 'Adventure', 'Fantasy', 'Romance'",Ever since the release of the innovative Nerve...,3000000.0
4,Boku no Hero Academia,7.88,2016,TV,13,24,Bones,Manga,"'School', 'SuperPower', 'Shounen'",'Action',"The appearance of ""quirks,"" newly discovered s...",2900000.0
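
The cleaning cell has two edge cases worth noting: slicing off the last character of `member` also drops a digit from counts without a `K` or `M` suffix, and splitting on `'eps'` leaves a singular `1 ep` untouched. A minimal sketch of suffix-aware parsing follows; `parse_count` and `parse_episodes` are hypothetical helpers, and the accepted formats are assumptions based on the values shown above.

```python
import re

# Suffix multipliers seen in the member column (e.g. "3.8M", "4.9K")
MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_count(text):
    """Convert '3.8M', '4.9K' or '812' into an integer count, else None."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KM]?)\s*", str(text))
    if not match:
        return None
    number = float(match.group(1))
    # round() guards against floating-point noise such as 4900.000000000001
    return int(round(number * MULTIPLIERS.get(match.group(2), 1)))


def parse_episodes(text):
    """Convert '25 eps' or '1 ep' into an int; '?' and blanks give None."""
    match = re.match(r"\s*(\d+)", str(text))
    return int(match.group(1)) if match else None
```

Both helpers could be applied column-wise, for example with `mal['member'].map(parse_count)`.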


#### Exploratory Data Analysis with the cleaned data

##### Exploring the data for:
- Missing Values
- Duplicated Values
- Basic descriptive statistics of the data

In [118]:
mal.head()

Unnamed: 0,title,rating,year_released,anime_type,episodes,duration(mins),studio,source,themes,genre,sypnopsis,member_count
0,Shingeki no Kyojin,8.54,2013,TV,25,24,WitStudio,Manga,"'Gore', 'Military', 'Survival', 'Shounen'","'Action', 'AwardWinning', 'Drama', 'Suspense'","Centuries ago, mankind was slaughtered to near...",3800000.0
1,Fullmetal Alchemist: Brotherhood,9.1,2009,TV,64,24,Bones,Manga,"'Military', 'Shounen'","'Action', 'Adventure', 'Drama', 'Fantasy'",After a horrific alchemy experiment goes wrong...,3200000.0
2,One Punch Man,8.5,2015,TV,12,24,Madhouse,Webmanga,"'AdultCast', 'Parody', 'SuperPower', 'Seinen'","'Action', 'Comedy'",The seemingly unimpressive Saitama has a rathe...,3100000.0
3,Sword Art Online,7.2,2012,TV,25,23,A-1Pictures,Lightnovel,"'LovePolygon', 'VideoGame'","'Action', 'Adventure', 'Fantasy', 'Romance'",Ever since the release of the innovative Nerve...,3000000.0
4,Boku no Hero Academia,7.88,2016,TV,13,24,Bones,Manga,"'School', 'SuperPower', 'Shounen'",'Action',"The appearance of ""quirks,"" newly discovered s...",2900000.0


In [119]:
mal.isna().sum()

title               0
rating            499
year_released       3
anime_type          0
episodes            0
duration(mins)      0
studio              0
source            206
themes              0
genre               0
sypnopsis           0
member_count        0
dtype: int64

In [120]:
mal.loc[mal['year_released'].isna()]

Unnamed: 0,title,rating,year_released,anime_type,episodes,duration(mins),studio,source,themes,genre,sypnopsis,member_count
847,"Yuru Yuri,",7.68,,OVA,1 ep,31,Lay-duce,Manga,"'CGDCT', 'GagHumor', 'School'","'Comedy', 'GirlsLove'","Akari Akaza, Chinatsu Yoshikawa, Kyouko Toshin...",31000.0
9114,"Yuru Yuri,",7.68,,OVA,1 ep,31,Lay-duce,Manga,"'CGDCT', 'GagHumor', 'School'","'Comedy', 'GirlsLove'","Akari Akaza, Chinatsu Yoshikawa, Kyouko Toshin...",31000.0
9406,"Yuru Yuri,",7.68,,OVA,1 ep,31,Lay-duce,Manga,"'CGDCT', 'GagHumor', 'School'","'Comedy', 'GirlsLove'","Akari Akaza, Chinatsu Yoshikawa, Kyouko Toshin...",31000.0


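The three NaN values all come from repeated rows for the same anime, 'Yuru Yuri,'. The trailing comma in its title is the likely cause: the scraper reads the year from the text after the first comma, which is probably empty for this title.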

In [121]:
mal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10229 entries, 0 to 10228
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           10229 non-null  object 
 1   rating          9730 non-null   float64
 2   year_released   10226 non-null  object 
 3   anime_type      10229 non-null  object 
 4   episodes        10229 non-null  object 
 5   duration(mins)  10229 non-null  object 
 6   studio          10229 non-null  object 
 7   source          10023 non-null  object 
 8   themes          10229 non-null  string 
 9   genre           10229 non-null  string 
 10  sypnopsis       10229 non-null  object 
 11  member_count    10229 non-null  float64
dtypes: float64(2), object(8), string(2)
memory usage: 959.1+ KB


In [122]:
mal.duplicated().sum()

4645

In [123]:
mal.loc[mal.duplicated()].sort_values('title').head()

Unnamed: 0,title,rating,year_released,anime_type,episodes,duration(mins),studio,source,themes,genre,sypnopsis,member_count
5544,"""Oshi no Ko""",8.75,2023,TV,11,30,DogaKobo,Manga,"'Reincarnation', 'Showbiz', 'Seinen'","'Drama', 'Supernatural'","In the entertainment world, celebrities often ...",679000.0
1652,"""Oshi no Ko""",8.75,2023,TV,11,30,DogaKobo,Manga,"'Reincarnation', 'Showbiz', 'Seinen'","'Drama', 'Supernatural'","In the entertainment world, celebrities often ...",679000.0
6786,"""Oshi no Ko""",8.75,2023,TV,11,30,DogaKobo,Manga,"'Reincarnation', 'Showbiz', 'Seinen'","'Drama', 'Supernatural'","In the entertainment world, celebrities often ...",679000.0
5126,"""Oshi no Ko""",8.75,2023,TV,11,30,DogaKobo,Manga,"'Reincarnation', 'Showbiz', 'Seinen'","'Drama', 'Supernatural'","In the entertainment world, celebrities often ...",679000.0
5554,"""Oshi no Ko"" Season 2",,-Not yet aired,TV,?,0,DogaKobo,Manga,"'Reincarnation', 'Showbiz', 'Seinen'","'Drama', 'Supernatural'","Second season of ""Oshi no Ko"".",117000.0


The duplicated rows appear to be genuine duplicates, most likely because an anime listed under several genres is scraped once per genre page. It is therefore safe to remove them without significant loss of information.

In [124]:
mal.drop_duplicates(keep = 'first', inplace = True)
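
An alternative to dropping duplicates afterwards is to skip them while scraping. The sketch below assumes each scraped record is a dict like `anime_info` and that title, type and year together identify an anime; both the helper name `add_unique` and that choice of key are assumptions.

```python
def add_unique(records, seen, record,
               key_fields=("title", "anime_type", "year_released")):
    """Append record to records unless an entry with the same key exists."""
    # List-valued fields (themes, genre) are unhashable, so the key uses
    # only scalar fields
    key = tuple(record.get(field) for field in key_fields)
    if key in seen:
        return False
    seen.add(key)
    records.append(record)
    return True
```

In the scraping loop, `data.append(anime_info)` would become `add_unique(data, seen, anime_info)` with `seen = set()` created once before the loop.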

In [125]:
mal.describe()

Unnamed: 0,rating,member_count
count,5127.0,5584.0
mean,6.97,149088.41
std,0.91,312420.43
min,1.99,3.0
25%,6.46,5800.0
50%,7.09,35000.0
75%,7.56,146250.0
max,9.1,3800000.0


#### Performing Some Univariate Analysis

In [126]:
mal['title'].nunique()

5580

A total of 5,580 unique anime titles were extracted from the site. We will run univariate analysis on several of these variables to understand the dataset further.

In [127]:
import pandas as pd
import plotly.express as px
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

plt.rcParams['figure.facecolor'] = 'w'
sns.set_style('darkgrid')

# Columns that should hold numeric values
numerical_columns = ['rating', 'year_released', 'episodes', 'duration(mins)']

# Convert the selected columns to numeric; unparseable values such as '?' become NaN
mal[numerical_columns] = mal[numerical_columns].apply(pd.to_numeric, errors='coerce')
# Use the nullable integer dtype, because a plain int cast fails on missing years
mal['year_released'] = mal['year_released'].astype('Int64')
print(mal.info())
mal.describe().T

In [None]:
for variable in numerical_columns:
    # Drop missing values and use plain floats so both plots get clean input
    values = mal[variable].dropna().astype(float)
    fig, ax = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(x=values, ax=ax[0])
    # Draw the QQ plot on the second axis explicitly
    stats.probplot(values, plot=ax[1])
    ax[0].set_title(f'Distribution of {variable}')
    ax[1].set_title(f'QQ Plot of {variable}')

From the analysis, the average rating is around 7, the average show has about 20 episodes, and an episode lasts around 28 minutes on average.

Some abnormal data points were also observed. One is a release year of 2, which is not a valid year and should be investigated further.

There is also an anime with more than 1,750 episodes, which should be investigated as well.
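
Checks like these can also be done with simple range filters instead of reading them off a plot. The sketch below works on a list of dicts; the helper name `flag_out_of_range` and the bounds used in the example are assumptions chosen for illustration, not values from the dataset.

```python
from datetime import date


def flag_out_of_range(records, field, low, high):
    """Return the records whose field is a number outside [low, high]."""
    flagged = []
    for record in records:
        value = record.get(field)
        # Skip missing values (None or NaN); they are handled separately
        if value is None or value != value:
            continue
        if not (low <= value <= high):
            flagged.append(record)
    return flagged


# Hypothetical bounds: years from 1900 up to next year
YEAR_BOUNDS = (1900, date.today().year + 1)
```

On the DataFrame itself, `Series.between` offers the same kind of check, though rows with missing values then need to be excluded separately.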

In [None]:
mal.loc[mal['year_released'] == 2]

The invalid release year belongs to `Natsu-iro Egao de 1, 2, Jump!`. The commas in the title most likely caused the scraper to pick up the `2` from the title instead of the year. Further research shows the anime was released in 2011, so we replace the current value with the correct one.

In [None]:
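# Select the row by title rather than by a hard-coded index label
mal.loc[mal['title'] == 'Natsu-iro Egao de 1, 2, Jump!', 'year_released'] = 2011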

In [None]:
mal.loc[mal['title'] == 'Natsu-iro Egao de 1, 2, Jump!' ]

The release year has now been corrected for this anime.

In [None]:
# Revisiting the distribution of released year for anime
fig = px.histogram(mal, x = 'year_released', marginal = 'box', title = 'Distribution of Released Year for Anime',
                  color_discrete_sequence = ['darkgreen'])
fig.update_layout(paper_bgcolor = 'lightgrey', plot_bgcolor = 'lightgrey', height = 400)
fig.show()

print(mal.describe().T)

The distribution is left-skewed, with a long tail toward earlier years and most anime concentrated in the years 2000 to 2023.
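
The direction of skew can also be checked numerically rather than by eye. The sketch below computes the Fisher-Pearson skewness coefficient with the standard library only; `skewness` is a hypothetical helper. A negative value indicates a longer tail toward earlier years (left skew), and a positive value indicates a longer tail toward later years (right skew).

```python
from statistics import fmean


def skewness(values):
    """Fisher-Pearson coefficient of skewness (population form)."""
    # Drop None and NaN before computing the moments
    data = [float(v) for v in values if v is not None and v == v]
    n = len(data)
    if n < 3:
        raise ValueError("need at least three values")
    mean = fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    if m2 == 0:
        return 0.0  # all values identical, so there is no skew
    return m3 / m2 ** 1.5
```

On the DataFrame, `mal['year_released'].skew()` gives a bias-corrected version of the same statistic, which should have the same sign.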