This notebook retrieves and stores MyAnimeList (MAL) data

In [7]:
## load libraries

import numpy as np
import pandas as pd

from bs4 import BeautifulSoup, SoupStrainer
import requests
import time, os

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

chromedriver = "C:\\Users\\vi_ci\\Downloads\\chromedriver\\chromedriver.exe" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

In [3]:
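def hover(driver, link_text):
    '''
    Moves the mouse over the link with the given text
    driver is an active Selenium WebDriver, e.g. webdriver.Chrome()
    '''
    # "link text" is the locator strategy string behind By.LINK_TEXT
    element = driver.find_element("link text", link_text)
    ActionChains(driver).move_to_element(element).perform()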

**How are MyAnimeList scores calculated?**

All scores given in the database are calculated as a weighted score.

Weighted Score = (v / (v + m)) \* S + (m / (v + m)) \* C
- S = Average score for the anime/manga
- v = Number of users giving a score for the anime/manga †
- m = Minimum number of scored users required to get a calculated score
- C = The mean score across the entire Anime/Manga database

† Note that v does not correspond to the "number of scored users" as seen on the database page. Scores from users who have not viewed 1/5 of the series upon its completion are not included. Scores given from illegitimate accounts created to sway votes are also not included in the scoring algorithm.

Not Yet Aired entries have no score and will display N/A. Entries that do not meet the minimum number of scored users will also not display a calculated score.
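
As a worked illustration of the formula above, here is a minimal sketch. The function name `weighted_score` and the sample numbers are invented for illustration; they are not MAL's actual values of m or C.

```python
def weighted_score(S, v, m, C):
    """Weighted score: (v / (v + m)) * S + (m / (v + m)) * C."""
    return (v / (v + m)) * S + (m / (v + m)) * C

# Hypothetical inputs, chosen only to show how v changes the result
few_votes = weighted_score(S=9.0, v=100, m=100, C=6.0)      # pulled halfway towards C
many_votes = weighted_score(S=9.0, v=100000, m=100, C=6.0)  # stays close to S
print(few_votes, many_votes)
```

With v equal to m the result sits midway between S and C; as v grows, the weighted score approaches the entry's own average S.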

**Top Anime/Manga Rankings**

The "Top Upcoming" and "Most Popular" rankings are ordered by the number of users who have added the entry to their list. All other Top Anime and Top Manga rankings are ordered by weighted score, as calculated above. Please note that while R18+ entries calculate a weighted score, they are excluded from the rankings.

In [37]:
def MAL_loadtopanime(page_num):
    '''
    Load one page of the MAL top anime list, 50 entries per page
    Page page_num covers ranks ((page_num - 1) * 50 + 1) to (page_num * 50)
    Returns a list of soups, one per anime on page_num (empty on an HTTP error)

    Cost is dominated by requests.get and the BeautifulSoup parse
    '''
    limit = (page_num - 1) * 50
    url = "https://myanimelist.net/topanime.php?limit=" + str(limit)
    response = requests.get(url)
    if response.status_code != 200:
        print('Encountered', response.status_code, 'error while reading page', page_num, 'of MAL Top Anime')
        return [] # keeps callers that iterate over the result from failing on None
    else:
        return BeautifulSoup(response.text, 'lxml').find_all(class_="detail")
        # return BeautifulSoup(response.text, 'lxml', parse_only=SoupStrainer(class_=["detail", "hoverinfo_trigger"]))
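
To make the page-to-rank mapping in the docstring concrete, the sketch below computes the URL and the rank range for a given page without making any request. The helper name `top_anime_page` and the `page_size` parameter are hypothetical and exist only for this illustration.

```python
def top_anime_page(page_num, page_size=50):
    """Return (url, first_rank, last_rank) for a 1-based page number."""
    limit = (page_num - 1) * page_size  # zero-based offset used in the query string
    url = "https://myanimelist.net/topanime.php?limit=" + str(limit)
    return url, limit + 1, limit + page_size

for page_num in (1, 2, 10):
    print(top_anime_page(page_num))
```

Note that a single page number selects one block of 50 ranks, so page 10 covers ranks 451 to 500 rather than the first 500 entries.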

In [38]:
soup_mal_top50 = MAL_loadtopanime(1)

Test retrieving data from the first item, defining helper functions along the way

In [39]:
def MAL_initEntry_top(soup):
    '''
    Takes the soup of an entry in the MAL top 50
    Returns a dictionary entry with Title and URL of the anime
    '''
    entry = {}
    entry['Title'] = soup.find(class_="hoverinfo_trigger").text
    entry['URL'] = soup.find(class_="hoverinfo_trigger").get('href')
    return entry

In [40]:
# only the first entry is needed for testing
number1 = MAL_initEntry_top(soup_mal_top50[0])

print(number1['Title'])
print(number1['URL'])

Fullmetal Alchemist: Brotherhood
https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood


Go to number1\['URL'\] and pull the soup

In [47]:
def MAL_retrieveEntry(entry):
    '''
    Takes a dictionary that has URL and Title of the MAL anime
    Returns a soup of the anime page in question
    '''
    response = requests.get(entry['URL'])
    if response.status_code != 200:
        print('Encountered ' + str(response.status_code) + ' error reading ' + entry['Title'])
        return -1
    else:
        # note: the html5lib tree builder does not support parse_only, so the whole page is still parsed
        return BeautifulSoup(response.text, 'html5', parse_only=SoupStrainer(id='content'))

In [48]:
soup_mal_number1 = MAL_retrieveEntry(number1)

Retrieve more fields from the first entry

In [51]:
number1sidebar = soup_mal_number1

The cell below lists all fields in the left column of the anime show page

In [52]:
number1sidebar.find_all(class_="dark_text")

[<span class="dark_text">English:</span>,
 <span class="dark_text">Synonyms:</span>,
 <span class="dark_text">Japanese:</span>,
 <span class="dark_text">Type:</span>,
 <span class="dark_text">Episodes:</span>,
 <span class="dark_text">Status:</span>,
 <span class="dark_text">Aired:</span>,
 <span class="dark_text">Premiered:</span>,
 <span class="dark_text">Broadcast:</span>,
 <span class="dark_text">Producers:</span>,
 <span class="dark_text">Licensors:</span>,
 <span class="dark_text">Studios:</span>,
 <span class="dark_text">Source:</span>,
 <span class="dark_text">Genres:</span>,
 <span class="dark_text">Duration:</span>,
 <span class="dark_text">Rating:</span>,
 <span class="dark_text">Score:</span>,
 <span class="dark_text">Ranked:</span>,
 <span class="dark_text">Popularity:</span>,
 <span class="dark_text">Members:</span>,
 <span class="dark_text">Favorites:</span>]

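Cycle through the headers in the left column to get all entries. When a field's values are links, a header with a singular name keeps only the first linked entry, while a header with a plural name (Producers, Genres, ...) is collected into a list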

In [54]:
def MAL_retrieveSidebar(anime_dict, soup):
    '''
    Returns anime_dict with additional raw data for the sidebar of the anime entry 
    '''
    headers = soup.find_all(class_="dark_text")
    for header in headers:
        column_name = header.text.strip()[:-1]
        # print(column_name) # error-checking
        entry = header.next_sibling.strip()
        # print(entry) # error-checking
        if(entry.strip() == "None found,"): # no entries
            entry = []
        elif(entry == ""):
            # special case for score
            if(column_name == 'Score'):
                entry = header.findNext().text
            else:
                entry_soup = header.findNext('a')
                # print(entry_soup.text, " ||| ", entry_soup.findNext('a'), " ||| ", entry_soup.findNext().name) # error-checking
                
                # create a list of items if more than one entry; signified by the plural in column_name
                if(column_name[-1] != 's'):
                    entry = entry_soup.text
                else:
                    entry = [entry_soup.text]
                    while(entry_soup.findNext().name != 'div'):
                        entry_soup = entry_soup.findNext('a')
                        entry.append(entry_soup.text)
        anime_dict[column_name] = entry
        # print('---')
    return anime_dict

In [55]:
number1 = MAL_retrieveSidebar(number1, soup_mal_number1)

In [56]:
number1

{'Title': 'Fullmetal Alchemist: Brotherhood',
 'URL': 'https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood',
 'English': 'Fullmetal Alchemist: Brotherhood',
 'Synonyms': 'Hagane no Renkinjutsushi: Fullmetal Alchemist, Fullmetal Alchemist (2009), FMA, FMAB',
 'Japanese': '鋼の錬金術師 FULLMETAL ALCHEMIST',
 'Type': 'TV',
 'Episodes': '64',
 'Status': 'Finished Airing',
 'Aired': 'Apr 5, 2009 to Jul 4, 2010',
 'Premiered': 'Spring 2009',
 'Broadcast': 'Sundays at 17:00 (JST)',
 'Producers': ['Aniplex',
  'Square Enix',
  'Mainichi Broadcasting System',
  'Studio Moriken'],
 'Licensors': ['Funimation', 'Aniplex of America'],
 'Studios': ['Bones'],
 'Source': 'Manga',
 'Genres': ['Action',
  'Military',
  'Adventure',
  'Comedy',
  'Drama',
  'Magic',
  'Fantasy',
  'Shounen'],
 'Duration': '24 min. per ep.',
 'Rating': 'R - 17+ (violence & profanity)',
 'Score': '9.23',
 'Ranked': '#1',
 'Popularity': '#4',
 'Members': '1,739,651',
 'Favorites': '148,926'}

Episodes, Aired, Duration, Members and Favorites need post-processing

In [57]:
def MAL_ppDuration(duration_entry):
    '''
    Takes in the raw duration entry in the form (xx min.) (xx hr.) (per ep)
    Returns total number of minutes per episode
    '''
    duration_array = duration_entry.split()
    duration_norm = 0
    if 'min.' in duration_array:
        duration_norm += int(duration_array[duration_array.index('min.') - 1])
    if 'hr.' in duration_array:
        duration_norm += 60 * int(duration_array[duration_array.index('hr.') - 1])
    return duration_norm
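
The sketch below mirrors the logic of `MAL_ppDuration` in a self-contained form to show what it returns for a few inputs. Only `'24 min. per ep.'` appears in the data above; the other two strings are assumed formats used for illustration, and the name `duration_to_minutes` is hypothetical.

```python
def duration_to_minutes(duration_entry):
    """Sum the 'hr.' and 'min.' parts of a raw duration string, in minutes."""
    tokens = duration_entry.split()
    minutes = 0
    if 'min.' in tokens:
        minutes += int(tokens[tokens.index('min.') - 1])
    if 'hr.' in tokens:
        minutes += 60 * int(tokens[tokens.index('hr.') - 1])
    return minutes

print(duration_to_minutes('24 min. per ep.'))  # 24
print(duration_to_minutes('1 hr. 30 min.'))    # 90
print(duration_to_minutes('Unknown'))          # 0
```

A string with neither unit falls through to 0, so an unknown duration is indistinguishable from a zero-minute one unless it is handled separately.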

In [58]:
def MAL_ppAired(anime_dict):
    '''
    Takes in the original anime sidebar dictionary
    Return the dictionary with 'Started' and 'Ended' columns added in Timestamp format
    '''
    def MAL_toDatetime(date_string):
        # parse the string that was passed in, trying the formats
        # from most to least specific; NaN if none of them match
        for date_format in ("%b %d, %Y", "%b, %Y", "%Y"):
            try:
                return pd.to_datetime(date_string.strip(), format=date_format)
            except (ValueError, TypeError):
                continue
        return np.nan

    aired_array = anime_dict['Aired'].split(' to ')
    anime_dict['Started'] = MAL_toDatetime(aired_array[0])
    if len(aired_array) > 1:
        anime_dict['Ended'] = MAL_toDatetime(aired_array[1])
    return anime_dict
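
The splitting and format fallback can be tried without pandas. The following is a minimal standard-library sketch, assuming the same three date formats as in the cell above; the names `parse_date` and `split_aired` are hypothetical, and it returns `None` where the notebook uses `np.nan`.

```python
from datetime import datetime

DATE_FORMATS = ("%b %d, %Y", "%b, %Y", "%Y")  # most specific format first

def parse_date(date_string):
    """Return a datetime for the first matching format, or None."""
    for date_format in DATE_FORMATS:
        try:
            return datetime.strptime(date_string.strip(), date_format)
        except ValueError:
            continue
    return None

def split_aired(aired):
    """Split 'start to end' into (start, end); end is None if absent."""
    parts = aired.split(' to ')
    start = parse_date(parts[0])
    end = parse_date(parts[1]) if len(parts) > 1 else None
    return start, end

print(split_aired('Apr 5, 2009 to Jul 4, 2010'))
print(split_aired('2010'))
```

Each half of the range is parsed on its own, so the end date differs from the start date whenever the entry gives both.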

In [59]:
def remove_commas(input_str):
    return int(input_str.strip().replace(',',''))

def MAL_ppSidebar(sidebar_dict):
    '''
    Postprocessing of MAL sidebar
    Take in the unprocessed sidebar dictionary
    Return the processed sidebar dictionary
    '''

    if(sidebar_dict['Episodes'] == 'Unknown'):
        sidebar_dict['Episodes'] = np.nan
    else:
        sidebar_dict['Episodes'] = int(sidebar_dict['Episodes'])
    sidebar_dict['Duration'] = MAL_ppDuration(sidebar_dict['Duration'])
    sidebar_dict = MAL_ppAired(sidebar_dict)
    sidebar_dict['Members'] = remove_commas(sidebar_dict['Members'])
    sidebar_dict['Favorites'] = remove_commas(sidebar_dict['Favorites'])

    return sidebar_dict

In [60]:
number1 = MAL_ppSidebar(number1)

Now retrieve the details from the topbar of the anime show page

In [61]:
def MAL_retrieveTopbar(anime_dict, soup):
    '''
    Retrieves the score and voters from the topbar of the anime entry
    Return the modified anime_dict
    '''

    topbar = soup.find(class_='anime-detail-header-stats')
    
    if(anime_dict['Score'] == 'N/A'):
        anime_dict['Score'] = np.nan 
        anime_dict['Voters'] = np.nan # no score means no one voted
    else:
        anime_dict['Score'] = float(topbar.find(class_='score-label').text)
        anime_dict['Voters'] = remove_commas(topbar.find(class_='score').get('data-user').split()[0])

    return anime_dict

In [62]:
number1 = MAL_retrieveTopbar(number1, soup_mal_number1)
print(number1['Score'])
print(number1['Voters'])

9.23
1040597


To add related anime:

In [63]:
def MAL_retrieveRelated(anime_dict, soup):
    '''
    Retrieves related anime from the anime entry
    Returns the modified anime_dict
    '''
    related = soup.find(class_='anime_detail_related_anime')
    if(related is None): # nothing to add, so return
        return anime_dict
    related_rows = related.find_all('tr')
    for row in related_rows:
        header = row.find('td').text.strip()[:-1]
        entry = row.find('td').find_next().text.strip()
        anime_dict[header] = [title.strip() for title in entry.split(',')]
    
    return anime_dict

In [64]:
number1 = MAL_retrieveRelated(number1, soup_mal_number1)

Test to see that all data is captured

In [65]:
number1

{'Title': 'Fullmetal Alchemist: Brotherhood',
 'URL': 'https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood',
 'English': 'Fullmetal Alchemist: Brotherhood',
 'Synonyms': 'Hagane no Renkinjutsushi: Fullmetal Alchemist, Fullmetal Alchemist (2009), FMA, FMAB',
 'Japanese': '鋼の錬金術師 FULLMETAL ALCHEMIST',
 'Type': 'TV',
 'Episodes': 64,
 'Status': 'Finished Airing',
 'Aired': 'Apr 5, 2009 to Jul 4, 2010',
 'Premiered': 'Spring 2009',
 'Broadcast': 'Sundays at 17:00 (JST)',
 'Producers': ['Aniplex',
  'Square Enix',
  'Mainichi Broadcasting System',
  'Studio Moriken'],
 'Licensors': ['Funimation', 'Aniplex of America'],
 'Studios': ['Bones'],
 'Source': 'Manga',
 'Genres': ['Action',
  'Military',
  'Adventure',
  'Comedy',
  'Drama',
  'Magic',
  'Fantasy',
  'Shounen'],
 'Duration': 24,
 'Rating': 'R - 17+ (violence & profanity)',
 'Score': 9.23,
 'Ranked': '#1',
 'Popularity': '#4',
 'Members': 1739651,
 'Favorites': 148926,
 'Started': Timestamp('2009-04-05 00:00:00'),
 '

Now read the top 50 into a dictionary list

In [66]:
def MAL_createdict_top(soup):
    '''
    Takes in the soup of top anime and returns a dictionary list 
    '''
    mal_top = []
    for anime_soup in soup:

        # processes the entry in the top 50 page(s) and gets the related page
        mal_entry = MAL_initEntry_top(anime_soup) 
        soup_mal_entry = MAL_retrieveEntry(mal_entry)
        
        # processes the anime page
        mal_entry = MAL_retrieveSidebar(mal_entry, soup_mal_entry)
        mal_entry = MAL_ppSidebar(mal_entry)
        mal_entry = MAL_retrieveTopbar(mal_entry, soup_mal_entry)
        mal_entry = MAL_retrieveRelated(mal_entry, soup_mal_entry)

        mal_top.append(mal_entry)
    
    return mal_top

In [67]:
mal_top50 = MAL_createdict_top(soup_mal_top50)
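
`MAL_createdict_top` sends one request per anime back to back, and `time` is imported at the top but not used. One common option, shown here only as a sketch, is to pause between requests. The helper `fetch_all`, its `fetch` callback and the one-second default are assumptions for illustration, not something MAL documents or this notebook requires.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request after the first
        results.append(fetch(url))
    return results

# Stand-in fetcher so the sketch runs without network access
pages = fetch_all(['url-a', 'url-b'], fetch=lambda url: url.upper(), delay_seconds=0)
print(pages)
```

In the notebook, `fetch` could wrap `MAL_retrieveEntry`, with the delay tuned to whatever the site tolerates.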

Put the required columns into a dataframe

In [32]:
mal_df = pd.DataFrame(mal_top50)

In [33]:
mal_df.head(5)

Unnamed: 0,Title,URL,English,Synonyms,Japanese,Type,Episodes,Status,Aired,Premiered,...,Alternative version,Side story,Spin-off,Alternative setting,Sequel,Other,Prequel,Character,Parent story,Summary
0,Fullmetal Alchemist: Brotherhood,https://myanimelist.net/anime/5114/Fullmetal_A...,Fullmetal Alchemist: Brotherhood,"Hagane no Renkinjutsushi: Fullmetal Alchemist,...",鋼の錬金術師 FULLMETAL ALCHEMIST,TV,64,Finished Airing,"Apr 5, 2009 to Jul 4, 2010",Spring 2009,...,[Fullmetal Alchemist],"[Fullmetal Alchemist: Brotherhood Specials, Fu...",[Fullmetal Alchemist: Brotherhood - 4-Koma The...,,,,,,,
1,Steins;Gate,https://myanimelist.net/anime/9253/Steins_Gate,Steins;Gate,,STEINS;GATE,TV,24,Finished Airing,"Apr 6, 2011 to Sep 14, 2011",Spring 2011,...,[Steins;Gate: Kyoukaimenjou no Missing Link - ...,,,"[ChäoS;HEAd, Robotics;Notes, ChäoS;Child, Occu...",[Steins;Gate: Oukoubakko no Poriomania],[Steins;Gate: Soumei Eichi no Cognitive Comput...,,,,
2,Gintama°,https://myanimelist.net/anime/28977/Gintama°,Gintama Season 4,Gintama' (2015),銀魂°,TV,51,Finished Airing,"Apr 8, 2015 to Mar 30, 2016",Spring 2015,...,,[Gintama°: Umai-mono wa Atomawashi ni Suru to ...,,,[Gintama.],,[Gintama Movie 2: Kanketsu-hen - Yorozuya yo E...,,,
3,Hunter x Hunter (2011),https://myanimelist.net/anime/11061/Hunter_x_H...,Hunter x Hunter,HxH (2011),HUNTER×HUNTER（ハンター×ハンター）,TV,148,Finished Airing,"Oct 2, 2011 to Sep 24, 2014",Fall 2011,...,"[Hunter x Hunter, Hunter x Hunter: Yorkshin Ci...","[Hunter x Hunter Movie 1: Phantom Rouge, Hunte...",,,,,,,,
4,Ginga Eiyuu Densetsu,https://myanimelist.net/anime/820/Ginga_Eiyuu_...,Legend of the Galactic Heroes,"LoGH, LotGH, Gin'eiden, GinEiDen, Heldensagen ...",銀河英雄伝説,OVA,110,Finished Airing,"Jan 8, 1988 to Mar 17, 1997",,...,[Ginga Eiyuu Densetsu: Die Neue These - Kaikou...,"[Ginga Eiyuu Densetsu Gaiden, Ginga Eiyuu Dens...",,,,,[Ginga Eiyuu Densetsu: Arata Naru Tatakai no O...,,,


Save it!

In [34]:
mal_df.to_pickle('data\\mal_top50_mvp.pkl')

Now let's pull data from the alphabetical list

In [35]:
def MAL_loadanime_char(num_pages, char):
    '''
    Load the first X anime starting with char from MAL based on number of pages, 50 per page
    So X = (num_pages * 50)
    To load anime with non-alphabetical starts, use '.'
    Returns a soup of the list of the first X anime starting with char
    '''
    animechar_text = ""
    for counter in range(1, num_pages + 1):
        limit = (counter - 1) * 50
        url = "https://myanimelist.net/anime.php?letter=" + char + "&show=" + str(limit)
        response = requests.get(url)
        if response.status_code != 200:
            print('Error reading page ', counter , ' of MAL first ', num_pages * 50, ' starting with ', char)
        else:
            animechar_text += response.text
    return BeautifulSoup(animechar_text, 'html5').find_all(class_="picSurround")

In [36]:
def MAL_initEntry_alpha(soup):
    '''
    Takes the soup of an entry in the MAL alphabetical list
    Returns a dictionary entry with Title and URL of the anime
    '''
    entry = {}
    entry['Title'] = soup.find("img").get('alt')
    entry['URL'] = soup.get('href')
    return entry

In [37]:
soup_mal_first50A_list = MAL_loadanime_char(1, 'A')

In [38]:
def MAL_createdict_firstchar(soup):
    '''
    Takes in the soup of first X anime starting with char and returns a dictionary list 
    '''
    mal_firstchar = []
    for anime_soup in soup:

        anime_soup = anime_soup.find(class_="hoverinfo_trigger")

        # processes the entry in the alphabetical list page(s) and gets the related page
        mal_entry = MAL_initEntry_alpha(anime_soup) 
        soup_mal_entry = MAL_retrieveEntry(mal_entry)
        
        # processes the anime page
        mal_entry = MAL_retrieveSidebar(mal_entry, soup_mal_entry)
        mal_entry = MAL_ppSidebar(mal_entry)
        mal_entry = MAL_retrieveTopbar(mal_entry, soup_mal_entry)
        mal_entry = MAL_retrieveRelated(mal_entry, soup_mal_entry)

        mal_firstchar.append(mal_entry)
    return mal_firstchar

In [39]:
mal_first50A = MAL_createdict_firstchar(soup_mal_first50A_list)

In [40]:
mal_df = pd.DataFrame(mal_first50A)

In [41]:
mal_df.iloc[0]

Title                                                 A Brightening Life
URL                    https://myanimelist.net/anime/40628/A_Brighten...
English                                               A brightening life
Japanese                                              A brightening life
Type                                                               Movie
Episodes                                                               1
Status                                                   Finished Airing
Aired                                                               2010
Producers                                                             []
Licensors                                                             []
Studios                                                               []
Source                                                          Original
Genres                                                    [Drama, Music]
Duration                                           

In [42]:
mal_df.head(5)

Unnamed: 0,Title,URL,English,Japanese,Type,Episodes,Status,Aired,Producers,Licensors,...,Alternative version,Synonyms,Premiered,Broadcast,Adaptation,Other,Prequel,Parent story,Side story,Spin-off
0,A Brightening Life,https://myanimelist.net/anime/40628/A_Brighten...,A brightening life,A brightening life,Movie,1.0,Finished Airing,2010,[],[],...,,,,,,,,,,
1,A Christmas Song,https://myanimelist.net/anime/39086/A_Christma...,A Christmas Song,A Christmas Song,Music,1.0,Finished Airing,"Nov 29, 2012",[Avex Entertainment],[],...,,,,,,,,,,
2,A Kite,https://myanimelist.net/anime/320/A_Kite,Kite,A KITE（カイト）,OVA,2.0,Finished Airing,"Feb 25, 1998 to Oct 25, 1998","[Green Bunny, BEAM Entertainment]",[Media Blasters],...,,,,,,,,,,
3,A Log Day of Timbre,https://myanimelist.net/anime/38712/A_Log_Day_...,A Log Day of Timbre,A Log Day of Timbre,ONA,1.0,Finished Airing,"Feb 25, 2011",[],[],...,[Timbre A to Z],,,,,,,,,
4,A New Journey,https://myanimelist.net/anime/39057/A_New_Journey,,A New Journey,ONA,1.0,Finished Airing,"Jan 18, 2019",[],[],...,,Season 2019: A New Journey | League of Legends,,,,,,,,


In [43]:
mal_df.to_pickle('data\\mal_first50a_mvp.pkl')

Generate for top 500 and first 500 in A

In [44]:
# do for top 500

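# MAL_loadtopanime returns a single page of 50, so collect pages 1 to 10
mal_top500_soup = []
for page_num in range(1, 11):
    mal_top500_soup += MAL_loadtopanime(page_num) or [] # skip pages that failed to load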
mal_top500_dict = MAL_createdict_top(mal_top500_soup)
mal_top500_df = pd.DataFrame(mal_top500_dict)

mal_top500_df.to_pickle('data\\mal_top500.pkl')

In [45]:
# and for first 500 A

mal_first500A_soup = MAL_loadanime_char(10, 'A')
mal_first500A_dict = MAL_createdict_firstchar(mal_first500A_soup)
mal_first500A_df = pd.DataFrame(mal_first500A_dict)

mal_first500A_df.to_pickle('data\\mal_first500A.pkl')