## MyAnimeList Top Anime Scraper ##

This script scrapes data about the top-rated anime on MyAnimeList (MAL): https://myanimelist.net

First, we import 'requests' and 'BeautifulSoup' to request the site's HTML content and parse it.

In [82]:
import requests
from bs4 import BeautifulSoup

Next, we import 'pandas' to organize the scraped data and conveniently save it in the CSV file format.

In [83]:
import pandas as pd

We will also need to import 'time' and 'random' so we can pause for a random interval between requests and avoid the site blocking us for excessive requests.

In [84]:
import time
import random
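
As a minimal sketch of this throttling idea, the randomized pause can be wrapped in one small function. The helper name `polite_pause` and its parameters are hypothetical and not part of the script; the `sleep` and `rng` arguments exist only so the function can be tried without actually waiting.

```python
import random
import time


def polite_pause(low=1.0, high=3.0, sleep=time.sleep, rng=random):
    """Sleep for a random duration between `low` and `high` seconds.

    Hypothetical helper: `sleep` and `rng` are injectable so the
    pause can be exercised without a real delay.
    """
    delay = rng.uniform(low, high)
    sleep(delay)
    return delay
```

Calling `polite_pause()` between two requests waits between one and three seconds; a randomized interval makes the request pattern less regular than a fixed delay.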

Finally, we import 'warnings' to suppress known warning messages that would otherwise clutter the output over a long scraping run.

In [85]:
import warnings

The following function scans one page of MAL's top-anime ranking (50 entries per page), starting from the rank offset passed as the 'limit' query parameter, as can be seen on the following web page: https://myanimelist.net/topanime.php?limit=0 <br>
At this stage we only save each anime's id for future use. <br>
The function's 'starting_index' argument supplies the 'limit' value; we will iterate over it later.

In [86]:
def ScanPageForAnime(starting_index):
    # base page showing the top 50 anime listings starting from the 'starting_index' spot:
    result = requests.get("https://myanimelist.net/topanime.php?limit={}".format(starting_index))

    # page HTML content:
    src = result.content

    # parse src using BeautifulSoup:
    soup = BeautifulSoup(src, 'lxml')

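    # each ranked entry sits in a <td> with this exact class string (attrs must be a dict, not a set):
    anime_block_list = soup.find_all("td", attrs={"class": "title al va-t word-break"})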

    anime_list = []
    for anime_block in anime_block_list:
        anime_link = anime_block.find("a").attrs["href"]
        anime_id = anime_link.replace("https://myanimelist.net/anime/", "").split("/")[0]
        anime_list.append(anime_id)
        
    return anime_list
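
The id extraction above relies on every link starting with the same prefix. As a hedged alternative, here is a minimal sketch that pulls the id out with a regular expression and returns `None` for links that do not match. The names `ANIME_LINK_RE` and `extract_anime_id` are hypothetical, and the assumed link shape is `https://myanimelist.net/anime/<id>` optionally followed by a title slug.

```python
import re

# Assumed shape of an anime link: the base URL, a numeric id,
# then either the end of the string or a "/" followed by a slug.
ANIME_LINK_RE = re.compile(r"^https://myanimelist\.net/anime/(\d+)(?:/|$)")


def extract_anime_id(link):
    """Return the numeric id from an anime link as a string, or None if it does not match."""
    match = ANIME_LINK_RE.match(link)
    return match.group(1) if match else None


# The slug "some-title" is illustrative only.
print(extract_anime_id("https://myanimelist.net/anime/9253/some-title"))
```

Returning `None` instead of a partial string makes unexpected links easy to filter out before they reach the details scraper.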

The following function scans the page of a specific anime and extracts useful information to be saved in the dataframe later. <br>
To access the page of a specific show we use its previously saved id. For example, using the id of 'Steins;Gate', 9253, we can access https://myanimelist.net/anime/9253 <br><br>
The function returns the following parameters for a show:
 - Title.
 - MAL Id. This is the function's input; it is returned for convenience.
 - Type. For example: 'Movie', 'TV', 'Music', etc.
 - Episode duration. For example: '23 min', '2 hour', etc.
 - Studios. The animation studio(s) credited for the show.
 - Source type. For example: 'Manga', 'Visual' (short for Visual Novel), etc.
 - Genres. For example: 'Fantasy', 'Romance', 'Action', etc.
 - Themes. For example: 'Military', 'Apocalypse', etc.
 - Rating. For example: 'PG-13'.
 - Popularity. Popularity ranking; lower is better.
 - Score. A value out of 10, as rated by MAL users.
 - Year. Release year. For example: '2022'.
 - Number of episodes. For example: '24'.
 - Demographic. Intended demographic. For example: 'shonen' (aimed at young boys).

The function also returns a flag that is True when any field failed to parse.

In [87]:
def ScanAnimeDetails(mal_index):
    # Will be used in case of unexpected failure:
    failed = False

    # Requesting the page corresponding to the given index and parsing it using BeautifulSoup:
    anime_link = "https://myanimelist.net/anime/{}".format(mal_index)
    result = requests.get(anime_link)
    src = result.content
    soup = BeautifulSoup(src, 'lxml')

    # Getting show's title (located at the top of the page):
    anime_title = "N\A"
    try:
        anime_title = soup.find("h1", attrs={"class": "title-name h1_bold_none"}).find("strong").string
    except Exception:
        # On failure, flag the anime so the caller can set it aside for manual handling:
        failed = True
        print("Anime title error {}".format(mal_index))

    # Getting show's score (located in the statistics block):
    anime_score = "N\A"
    divs = soup.find_all("div", attrs={"class": "fl-l score"})

    for div in divs:
        if div.find("div").text.replace('.', '', 1).isdigit():
            anime_score = div.find("div").text

    # Getting the rest of the data (located in the left sidebar):
    details = soup.find_all("div", attrs={"class": "spaceit_pad"})

    anime_type = "N\A"
    anime_duration = "N\A"
    anime_studios = "N\A"
    anime_source = "N\A"
    anime_genres = "N\A"
    anime_themes = "N\A"
    anime_rating = "N\A"
    anime_popularity = "N\A"
    anime_year = "N\A"
    anime_demographic = "N\A"
    anime_episode_count = "N\A"

    for div in details:
        # Skipping blocks without a <span> label, which cannot be one of the fields below:
        if div.find("span") is None:
            continue

        try:
            if (div.find("span").string == "Type:"):
                anime_type = div.find("a").string

            elif (div.find("span").string == "Studios:"):
                studios = ''
                for studio in div.find_all("a"):
                    studios += studio.string + ','
                if len(studios) == 0:
                    anime_studios = 'N\A'
                else:
                    anime_studios = studios[:-1]

            elif (div.find("span").string == "Genres:"):
                genres = ''
                for genre in div.find_all("a"):
                    genres += genre.string + ','
                if len(genres) == 0:
                    anime_genres = 'N\A'
                else:
                    anime_genres = genres[:-1]

            elif (div.find("span").string == "Themes:"):
                themes = ''
                for theme in div.find_all("a"):
                    themes += theme.string + ','
                if len(themes) == 0:
                    anime_themes = 'N\A'
                else:
                    anime_themes = themes[:-1]

            elif (div.find("span").string == "Source:"):
                anime_source = div.text.split(" ")[2].split("\n")[0]

            elif (div.find("span").string == "Duration:"):
                txt = div.text.split(" ")[2:]
                anime_duration = " ".join(txt).split("\n")[0]

            elif (div.find("span").string == "Rating:"):
                anime_rating = div.text.replace("\n", "").replace("Rating: ", "")
                # str.replace never returns None, so check for an empty result instead:
                if not anime_rating:
                    anime_rating = "N\A"

            elif (div.find("span").string == "Popularity:"):
                anime_popularity = div.text.split(" ")[2].split("\n")[0][1:]

            elif (div.find("span").string == "Demographic:"):
                anime_demographic = div.find('a').string

            elif (div.find("span").string == "Aired:"):
                anime_year = div.text.split(' ')[4].replace('\n', '')

            elif (div.find("span").string == "Episodes:"):
                anime_episode_count = div.text.split(' ')[2].split('\n')[0]
            
        except Exception:
            # On failure, flag the anime so the caller can set it aside for manual handling:
            failed = True
            print("Anime {} error at {}".format(div.find("span").string[:-1], mal_index))
        
    return ([anime_title, mal_index, anime_type, anime_episode_count, anime_duration, anime_studios, anime_source, anime_genres, anime_themes, anime_rating, anime_popularity, anime_score, anime_demographic, anime_year], failed)
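
Several branches above index into `div.text.split(" ")` at fixed positions, which depends on the exact whitespace of the page. As a hedged sketch of a less position-dependent approach, the helper below splits a sidebar block into a label and a value. The name `split_sidebar_field` is hypothetical, and the assumed text shape (a label ending in a colon, followed by an indented value on the next line) is inferred from the parsing code above, not confirmed against the live page.

```python
def split_sidebar_field(text):
    """Split sidebar text of the form 'Label: value' into (label, value).

    Assumes the label is everything before the first colon; whitespace
    runs in the value (newlines, indentation) collapse to single spaces.
    """
    label, _, value = text.strip().partition(":")
    return label.strip(), " ".join(value.split())


# Sample strings mimic the assumed layout of a sidebar block.
print(split_sidebar_field("\nSource:\n  Manga\n  "))
print(split_sidebar_field("\nDuration:\n  25 min. per ep.\n"))
```

With a helper like this, each branch could compare the returned label and use the value directly instead of counting spaces.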

Next, we ignore known warning messages, such as the deprecation warning that older pandas versions print when 'DataFrame.append' is used:

<font color='red'>
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
</font>

In [88]:
warnings.filterwarnings('ignore')
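
Note that `filterwarnings('ignore')` hides every warning, not only the known one. As a minimal sketch of a narrower filter, the example below ignores only the `FutureWarning` category. The function `noisy_step` is a hypothetical stand-in for a library call, and `catch_warnings(record=True)` is used here only to make the effect observable.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a library call that emits two kinds of warnings."""
    warnings.warn("this API will change", FutureWarning)
    warnings.warn("something looks off", UserWarning)
    return "done"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning by default
    warnings.filterwarnings("ignore", category=FutureWarning)  # then hide one category
    result = noisy_step()

# Only the warning that was not filtered out is recorded.
remaining = [w.category.__name__ for w in caught]
print(remaining)  # prints ['UserWarning']
```

Filtering by category keeps unexpected warnings visible, which can help when the site's markup or a library's behavior changes.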

Create the dataframe and set the name of the output CSV file.

In [89]:
df = pd.DataFrame(columns=("Title", "MAL Id", "Type", "Number of Episodes", "Episode Duration", "Studios", "Source Type", "Genres", "Themes", "Rating", "Popularity", "Score", "Demographic", "Year"))

filename = "anime_df.csv"

The loop below pulls 50 shows at a time and scrapes their individual pages for data, saving it in the dataframe.

In [90]:
for i in range(1, 2):
    # Getting the ids of the next 50 shows to be scraped (i = 1 requests the page at offset 50):
    ani = ScanPageForAnime(i*50)

    for anime_id in ani:
        # Scraping show details:
        data, failed = ScanAnimeDetails(anime_id)

        # Saving the show in the dataframe; shows that failed to parse are skipped
        # (ScanAnimeDetails has already printed their errors for manual review):
        a_series = pd.Series(data, index = df.columns)
        if not failed:
            # pd.concat is used because DataFrame.append was removed in pandas 2.0:
            df = pd.concat([df, a_series.to_frame().T], ignore_index=True)

        # Avoiding the site blocking us due to excessive requests:
        rand = random.uniform(1, 3)
        time.sleep(rand)

    # Saving the accumulated data:
    df.to_csv("anime_df_raw-{}.csv".format((i+1)*50), sep=';', index=False)
    print("Scraped {} anime's pages.".format((i+1)*50))

df.to_csv("anime_df_raw.csv", sep=';', index=False)

Scraped 100 anime's pages.
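
To scrape more than one page, the loop range has to be mapped onto 'limit' offsets. The sketch below shows that mapping under the assumption stated earlier that each page lists 50 entries; `PAGE_SIZE`, `BASE_URL` and `page_urls` are hypothetical names introduced for illustration.

```python
PAGE_SIZE = 50  # entries per ranking page, as described above
BASE_URL = "https://myanimelist.net/topanime.php?limit={}"


def page_urls(first_page, last_page):
    """Yield (offset, url) for pages first_page..last_page inclusive.

    Page 0 corresponds to offset 0, page 1 to offset 50, and so on.
    """
    for page in range(first_page, last_page + 1):
        offset = page * PAGE_SIZE
        yield offset, BASE_URL.format(offset)


for offset, url in page_urls(0, 2):
    print(offset, url)
```

In these terms, the loop above with `range(1, 2)` visits only page 1, that is, the page at offset 50.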


Example of the dataframe:

In [91]:
df.head(3)

| | Title | MAL Id | Type | Number of Episodes | Episode Duration | Studios | Source Type | Genres | Themes | Rating | Popularity | Score | Demographic | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Made in Abyss | 34599 | TV | 13 | 25 min. per ep. | Kinema Citrus | Web | Adventure,Drama,Fantasy,Mystery,Sci-Fi | N\A | R - 17+ (violence & profanity) | 95 | 8.69 | N\A | 2017 |
| 1 | Made in Abyss Movie 3: Fukaki Tamashii no Reimei | 36862 | Movie | 1 | 1 hr. 45 min. | Kinema Citrus | Web | Adventure,Drama,Fantasy,Mystery,Sci-Fi | N\A | R - 17+ (violence & profanity) | 591 | 8.69 | N\A | 2020 |
| 2 | Mononoke Hime | 164 | Movie | 1 | 2 hr. 13 min. | Studio Ghibli | Original | Action,Adventure,Fantasy | N\A | PG-13 - Teens 13 or older | 100 | 8.68 | N\A | 1997 |
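
Because the genre and theme fields contain commas, the script writes its CSV files with `sep=';'`. As a minimal sketch of reading such a file back, the example below uses the standard library's `csv` module on an in-memory sample. The sample keeps only three of the columns for brevity and is an illustration, not the actual file contents.

```python
import csv
import io

# A header line and one row in the same ';'-separated layout the script writes
# (columns trimmed to three for this illustration).
sample = (
    "Title;MAL Id;Genres\n"
    "Mononoke Hime;164;Action,Adventure,Fantasy\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))

# The commas inside the Genres field survive, so it can be split afterwards.
genres = rows[0]["Genres"].split(",")
print(genres)  # prints ['Action', 'Adventure', 'Fantasy']
```

The same delimiter has to be passed when loading the real file, for example through the `sep` argument of `pandas.read_csv`.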
