### Phase 1: Scraping the Data

Reference Link: https://myanimelist.net/anime.php

#### Optimizing the code to scrape info from all genres

In [1]:
from bs4 import BeautifulSoup as bs
import requests
import re

# Getting the links to each genre page first
r = requests.get('https://myanimelist.net/anime.php')
r.status_code

soup = bs(r.content, 'html.parser')

# Extracting the links for each genre
links = [link.get('href') for link in soup.select('a[class]') if link.get('href')]

# Cleaning up the links so it only contains links for genres
valid_links = []
for i in links:
    if 'genre' in i:
        valid_links.append(i)

data = []
for i in range(1,10):
    for genre_url in valid_links:
        print(f'Current Genre is: {genre_url}')
        url = 'https://myanimelist.net'
        page = f'?page={i}'
        full_url = url + genre_url + page
        r = requests.get(full_url)
        soup = bs(r.content, 'html.parser')
        infoboxes = soup.find_all('div', class_='js-anime-category-producer')

        for item in infoboxes:
            anime_info = {}
            # Getting the title:
            title_element = item.select('h2.h2_anime_title a.link-title')
            # Take index [0] so the value is a string rather than a list
            # (otherwise the DataFrame cells would show square brackets)
            anime_info['title'] = [name.text for name in title_element][0]

            # Getting the rating:
            rating_element = item.select('div.scormem-item.score.score-label')
            anime_info['rating'] = [x.text.strip() for x in rating_element][0]

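            # Getting the released year:
            # Caveat: item.text also contains the title, so a comma in the title
            # shifts the split index and can yield a wrong year.
            year_element = item.find_all('div', class_ = 'info')
            year_released = item.text.split(',')[1].split('\n')[0].strip()
            year_released = re.findall(r"\d+|\D+", year_released)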
            if year_released:
                anime_info['year_released'] = year_released[0]
            else:
                anime_info['year_released'] = None

            # Getting the anime type
            anime_element = item.find_all('div', class_ = 'info')
            anime_info['anime_type'] = [anitype.text.split(',')[0] for anitype in anime_element][0]

            # Getting the number of episodes
            episode_element = item.find_all('div', class_ = 'info')
            anime_info['episodes'] = [ee.text.split(',')[1].split('\n')[1].strip() for ee in episode_element][0]

            # Getting the duration of the anime
            duration_element = item.find_all('div', class_ = 'info')
            anime_info['duration'] = [dura.text.split(',')[2].split('\n')[1].strip() for dura in duration_element][0]


            # Getting studio source and themes of the anime:
            element = item.select('.properties .item')
            anime_info['studio'] = [s.text.replace(' ','').strip() for s in element][0]
            # anime_info['source'] = [s.text.replace(' ','').strip() for s in element][1]
            # anime_info['themes'] = [s.text.replace(' ','').strip() for s in element][2:]

            # Getting genre
            genre_element = item.select('div .genre')
            anime_info['genre'] = [s.text.strip().replace(' ','') for s in genre_element]

            # Getting the number of members
            member_element = item.select('div.scormem-item.member')
            anime_info['member'] = [s.text.strip() for s in member_element][0]

            # Getting the synopsis
            sypnopsis_element = item.select('div p')
            anime_info['sypnopsis'] = [s.text.split('[Written by MAL Rewrite]')[0].strip().replace('\r\n',' ') for s in sypnopsis_element][0]
            
            # Collecting the record; the list is converted to a DataFrame in the next cell
            data.append(anime_info)

Current Genre is: /anime/genre/1/Action
Current Genre is: /anime/genre/2/Adventure
Current Genre is: /anime/genre/5/Avant_Garde
Current Genre is: /anime/genre/46/Award_Winning
Current Genre is: /anime/genre/28/Boys_Love
Current Genre is: /anime/genre/4/Comedy
Current Genre is: /anime/genre/8/Drama
Current Genre is: /anime/genre/10/Fantasy
Current Genre is: /anime/genre/26/Girls_Love
Current Genre is: /anime/genre/47/Gourmet
Current Genre is: /anime/genre/14/Horror
Current Genre is: /anime/genre/7/Mystery
Current Genre is: /anime/genre/22/Romance
Current Genre is: /anime/genre/24/Sci-Fi
Current Genre is: /anime/genre/36/Slice_of_Life
Current Genre is: /anime/genre/30/Sports
Current Genre is: /anime/genre/37/Supernatural
Current Genre is: /anime/genre/41/Suspense
Current Genre is: /anime/genre/9/Ecchi
Current Genre is: /anime/genre/49/Erotica
Current Genre is: /anime/genre/12/Hentai
Current Genre is: /anime/genre/50/Adult_Cast
Current Genre is: /anime/genre/51/Anthropomorphic
Current Gen
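
The loop above pulls the type, year, episode count and duration out of the info line by position, using chained `split` calls. As a hedged alternative, the sketch below extracts the same fields with regular expressions so that a missing or shifted field yields `None` instead of a wrong value. The helper name `parse_info_text` and the sample strings are hypothetical. The layout of the info text (`"TV, 2013\n25 eps,\n24 min"`) is an assumption inferred from the split indices used in the loop, not something confirmed against the live page.

```python
import re


def parse_info_text(info_text):
    """Extract type, year, episodes and duration from one info line.

    Assumes a layout like "TV, 2013\n25 eps,\n24 min" (hypothetical sample).
    Fields that cannot be found are returned as None.
    """
    anime_type = info_text.split(",")[0].strip() or None
    year = re.search(r"\b(?:19|20)\d{2}\b", info_text)    # four-digit year
    episodes = re.search(r"(\d+)\s*eps?\b", info_text)    # "25 eps" or "1 ep"
    duration = re.search(r"(\d+)\s*min", info_text)       # "24 min"
    return {
        "anime_type": anime_type,
        "year_released": int(year.group(0)) if year else None,
        "episodes": int(episodes.group(1)) if episodes else None,
        "duration": int(duration.group(1)) if duration else None,
    }


print(parse_info_text("TV, 2013\n25 eps,\n24 min"))
print(parse_info_text("TV, 2025\n? eps,\n24 min"))  # unknown episode count
```

Because the helper only looks at the info line, the title never enters the parsing. Durations that are not given purely in minutes would need extra handling.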

### Output DataFrame (Looks Good)

In [2]:
import pandas as pd
import plotly.express as px
import glob
import numpy as np 

pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Saving and taking a look at the output dataframe
final = pd.DataFrame(data)
final.to_excel('My_anime_list_raw_data.xlsx', index = False)
final.head()

| | title | rating | year_released | anime_type | episodes | duration | studio | genre | member | sypnopsis |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Shingeki no Kyojin | 8.54 | 2013 | TV | 25 eps | 24 min | WitStudio | [Action, AwardWinning, Drama, Suspense] | 3.9M | Centuries ago, mankind was slaughtered to near... |
| 1 | Fullmetal Alchemist: Brotherhood | 9.09 | 2009 | TV | 64 eps | 24 min | Bones | [Action, Adventure, Drama, Fantasy] | 3.3M | After a horrific alchemy experiment goes wrong... |
| 2 | One Punch Man | 8.5 | 2015 | TV | 12 eps | 24 min | Madhouse | [Action, Comedy] | 3.2M | The seemingly unimpressive Saitama has a rathe... |
| 3 | Sword Art Online | 7.2 | 2012 | TV | 25 eps | 23 min | A-1Pictures | [Action, Adventure, Fantasy, Romance] | 3.0M | Ever since the release of the innovative Nerve... |
| 4 | Boku no Hero Academia | 7.87 | 2016 | TV | 13 eps | 24 min | Bones | [Action] | 3.0M | The appearance of "quirks," newly discovered s... |


In [3]:
mal = pd.read_excel('My_anime_list_raw_data.xlsx')
mal.loc[7965]

title                                                     Udon Pan
rating                                                         NaN
year_released                                                 2019
anime_type                                                   Music
episodes                                                      1 ep
duration                                                     2 min
studio                                                     Unknown
genre                                                  ['Gourmet']
member                                                          63
sypnopsis        Music video for the song Udon Pan by Izumi Sak...
Name: 7965, dtype: object

### Phase 2: Data Cleaning, Preprocessing and Feature Engineering

In [6]:
# Reading the dataframe:
mal = pd.read_excel('My_anime_list_raw_data.xlsx')

# Cleaning up the data and removing explicit info:
mal['member_count'] = mal['member'].str[:-1]
mal['member_count'] = pd.to_numeric(mal['member_count'],errors ='coerce')
mal['member_count'] = mal.apply(lambda x: x['member_count']*1000000 if 'M' in x['member'] 
                                          else (x['member_count']*1000 if 'K' in x['member'] 
                                                else x['member']), axis = 1)
mal.drop(columns = ['member'], inplace = True)
# mal['themes'] = mal['themes'].astype('string')
# mal['themes'] = mal['themes'].str.replace(r'\[|\]', '', regex=True)
mal['genre'] = mal['genre'].astype('string')
mal['genre'] = mal['genre'].str.replace(r'\[|\]', '', regex=True)
mal = mal[~mal['genre'].str.contains('Hentai')]

mal['episodes'] = mal['episodes'].apply(lambda x: x.split('eps')[0].strip() if x else x)
mal['episodes'] = mal['episodes'].apply(lambda x: None if '?' in x else x)
mal['duration'] = mal['duration'].apply(lambda x: x.split('min')[0].strip() if x else x)

mal.rename(columns = {'duration':'duration(mins)'}, inplace = True)

mal['year_released'] = pd.to_numeric(mal['year_released'], errors='coerce')
mal['member_count'] = pd.to_numeric(mal['member_count'], errors='coerce')
mal['episodes'] = pd.to_numeric(mal['episodes'], errors='coerce')


mal['year_released'] = mal['year_released'].astype('Int64')  # Use 'Int64' to allow for NaN
mal['member_count'] = mal['member_count'].astype('Int64')
mal['episodes'] = mal['episodes'].astype('Int64')
mal['duration(mins)'] = mal['duration(mins)'].astype('Int64')


mal.head()


| | title | rating | year_released | anime_type | episodes | duration(mins) | studio | genre | sypnopsis | member_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Shingeki no Kyojin | 8.54 | 2013 | TV | 25 | 24 | WitStudio | 'Action', 'AwardWinning', 'Drama', 'Suspense' | Centuries ago, mankind was slaughtered to near... | 3900000 |
| 1 | Fullmetal Alchemist: Brotherhood | 9.09 | 2009 | TV | 64 | 24 | Bones | 'Action', 'Adventure', 'Drama', 'Fantasy' | After a horrific alchemy experiment goes wrong... | 3300000 |
| 2 | One Punch Man | 8.5 | 2015 | TV | 12 | 24 | Madhouse | 'Action', 'Comedy' | The seemingly unimpressive Saitama has a rathe... | 3200000 |
| 3 | Sword Art Online | 7.2 | 2012 | TV | 25 | 23 | A-1Pictures | 'Action', 'Adventure', 'Fantasy', 'Romance' | Ever since the release of the innovative Nerve... | 3000000 |
| 4 | Boku no Hero Academia | 7.87 | 2016 | TV | 13 | 24 | Bones | 'Action' | The appearance of "quirks," newly discovered s... | 3000000 |
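
The member counts arrive as strings such as `3.9M` or plain numbers such as `63`, and the cell above converts them in three steps. As a hedged sketch of the same conversion in one place, the helper below maps a suffix to a multiplier and falls back to `None` for anything it cannot parse. The name `parse_member_count` and the suffix table are assumptions for illustration. `Decimal` is used so that `3.9M` becomes exactly 3900000 without floating-point rounding.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical suffix table; only K and M appear in the notebook's logic.
SUFFIX_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_member_count(raw):
    """Convert values such as '3.9M', '62K', '63' or 63 to an integer count."""
    if raw is None:
        return None
    text = str(raw).strip().replace(",", "")
    if not text:
        return None
    multiplier = SUFFIX_MULTIPLIERS.get(text[-1].upper(), 1)
    number = text[:-1] if multiplier != 1 else text
    try:
        return int(Decimal(number) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        # Unparseable text, NaN or infinity
        return None


for sample in ["3.9M", "62K", "63", 63, "n/a"]:
    print(repr(sample), "->", parse_member_count(sample))
```

Applied to the DataFrame, this could replace the three-step conversion with something like `mal['member'].map(parse_member_count).astype('Int64')`, assuming the `member` column holds only values of the shapes shown here.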


### Final Touch Ups on Abnormal values

In [7]:
print(mal.isna().sum().sort_values(ascending = False))
print()
print(mal.duplicated().sum())

episodes          2705
rating             477
year_released      317
title                0
anime_type           0
duration(mins)       0
studio               0
genre                0
sypnopsis            0
member_count         0
dtype: int64

4703


There are many missing and duplicated values in the dataset. We will drop the duplicated rows first, keeping one instance of each.

In [8]:
mal.drop_duplicates(keep = 'first', inplace= True)

mal.duplicated().sum()

0
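
The duplicates are expected: the scraper visits every genre page, so a show tagged with several genres is collected once per page it appears on. The sketch below reproduces this on toy rows and also shows a possible variant that deduplicates on a subset of identifying columns, which would still collapse rows whose volatile fields (for example the member count) changed between requests. The choice of `title`, `year_released` and `anime_type` as the key is an assumption, not the notebook's method.

```python
import pandas as pd

# Toy rows: the same show scraped from the Action page and the Comedy page.
rows = [
    {"title": "One Punch Man", "year_released": 2015, "anime_type": "TV",
     "genre": "'Action', 'Comedy'"},
    {"title": "One Punch Man", "year_released": 2015, "anime_type": "TV",
     "genre": "'Action', 'Comedy'"},
    {"title": "Doraemon (1979)", "year_released": 1979, "anime_type": "TV",
     "genre": "'Adventure', 'Comedy', 'Fantasy', 'Sci-Fi'"},
]
toy = pd.DataFrame(rows)

# Count fully identical rows, as in the cell above
n_duplicates = int(toy.duplicated().sum())

# Variant: treat rows as duplicates when the identifying columns match
key_columns = ["title", "year_released", "anime_type"]  # assumed key
deduped = toy.drop_duplicates(subset=key_columns, keep="first")

print(n_duplicates)
print(deduped["title"].tolist())
```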

In [9]:
mal.describe()

| | rating | year_released | episodes | duration(mins) | member_count |
|---|---|---|---|---|---|
| count | 4921.0 | 5150.0 | 3400.0 | 5353.0 | 5353.0 |
| mean | 6.98 | 2010.51 | 20.55 | 27.98 | 158568.17 |
| std | 0.91 | 30.04 | 43.23 | 26.19 | 323960.4 |
| min | 1.99 | 2.0 | 2.0 | 0.0 | 33.0 |
| 25% | 6.46 | 2006.0 | 12.0 | 15.0 | 6000.0 |
| 50% | 7.11 | 2014.0 | 12.0 | 23.0 | 40000.0 |
| 75% | 7.58 | 2019.0 | 24.0 | 25.0 | 160000.0 |
| max | 9.14 | 2025.0 | 1787.0 | 168.0 | 3900000.0 |


From the summary statistics, the average rating is around 7, the average show has about 20 episodes, and an episode lasts about 28 minutes on average.<br><br>

Some abnormal data points also stand out. One is a release year of 2, which does not make sense and should be investigated further.<br><br>

There is also an anime with more than 1,750 episodes, which should be investigated as well.
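
Rather than spotting such values by eye in `describe()`, a range check can list the offending rows directly. The following is a minimal sketch on toy data. The helper `out_of_range` and the bounds (years 1900 to 2025, at most 1000 episodes) are hypothetical thresholds chosen for illustration, and the missing episode count in the second row is a toy value.

```python
import pandas as pd

# Toy frame using the same nullable Int64 dtype as the cleaned data
toy = pd.DataFrame({
    "title": ["Shingeki no Kyojin", "Natsu-iro Egao de 1, 2, Jump!", "Doraemon (1979)"],
    "year_released": pd.array([2013, 2, 1979], dtype="Int64"),
    "episodes": pd.array([25, None, 1787], dtype="Int64"),
})


def out_of_range(frame, column, low, high):
    """Return rows whose non-missing value in `column` lies outside [low, high]."""
    outside = ~frame[column].between(low, high)
    # Missing values compare as <NA>; treat them as "not flagged"
    mask = outside.fillna(False).astype(bool)
    return frame[mask]


bad_years = out_of_range(toy, "year_released", 1900, 2025)
long_runs = out_of_range(toy, "episodes", 1, 1000)

print(bad_years["title"].tolist())
print(long_runs["title"].tolist())
```

A flagged row is only a candidate for review. A flagged value can turn out to be legitimate, so each one still needs a manual check.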

In [10]:
mal.loc[mal['year_released'] == 2]

# Replace invalid values in 'year_released' with the specified replacement value
mal.loc[mal['year_released'] == 2, 'year_released'] = 2011

The invalid release year belongs to `Natsu-iro Egao de 1, 2, Jump!`. Upon further research, the anime appears to have been released in 2011, so we have replaced the invalid value with 2011. The stray 2 most likely comes from the commas in the title, since the scraper splits the full infobox text on commas to find the year.
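
A likely explanation for the value 2 is the way the scraper locates the year: it splits the full infobox text on commas and takes the second piece, and this title itself contains commas. The sketch below replays that expression on hypothetical text. The anime type and duration in the sample strings are placeholders, and the exact layout of the infobox text is an assumption.

```python
# Hypothetical infobox text: the title comes first, followed by the info line.
item_text = "Natsu-iro Egao de 1, 2, Jump!\nOVA, 2011\n1 ep,\n25 min"
# The info line on its own, without the title.
info_text = "OVA, 2011\n1 ep,\n25 min"


def year_field(text):
    """Same expression the scraping loop applies to item.text."""
    return text.split(',')[1].split('\n')[0].strip()


print(year_field(item_text))  # second comma-separated piece comes from the title
print(year_field(info_text))  # second piece is the year field
```

Running the expression on the info line alone, rather than on the whole infobox, would avoid the problem for titles that contain commas.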

In [11]:
mal.loc[mal['episodes'] >=1700]

| | title | rating | year_released | anime_type | episodes | duration(mins) | studio | genre | sypnopsis | member_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 2224 | Doraemon (1979) | 7.82 | 1979 | TV | 1787 | 11 | Shin-EiAnimation | 'Adventure', 'Comedy', 'Fantasy', 'Sci-Fi' | Nobita Nobi is a normal fourth grade student. ... | 62000 |


Upon further research, it turns out that 'Doraemon' does indeed have over 1.7k episodes. We will leave the value as is and finalize the dataset for our dashboard.

In [12]:
mal.to_csv('./cleaned_data.csv', index = False)