# MyAnimeList Web Scraper

**Goal:** Given a valid MyAnimeList username, collect that user's anime list along with pertinent details about each listed show.

In [1]:
# Import necessary libraries
from requests_html import AsyncHTMLSession

In [2]:
# Instantiate an Asynchronous Session
session = AsyncHTMLSession()

In [3]:
# Specify a MAL user and the corresponding anime list URL
username = "ABPhanatic"
url = "https://myanimelist.net/animelist/" + username

In [4]:
# Connect to the webpage
r = await session.get(url)
r.status_code

200

### Examining the page's HTML code before rendering the JavaScript

In [5]:
html = r.html

In [6]:
# Number of div elements PRE-JS render
len(html.find('div'))

38

In [7]:
# Number of anchor elements PRE-JS render
len(html.find('a'))

36

In [8]:
# Number of span elements PRE-JS render
len(html.find('span'))

26

In [9]:
# Saving the pre-rendered HTML to a file so we can inspect it

with open('MAL_PreRender_Scrape.html', 'wb') as file:
    file.write(html.raw_html)

### Examining the page's HTML code after rendering the JavaScript

In [10]:
# Rendering the JavaScript of the webpage
await r.html.arender()

In [11]:
html = r.html

In [12]:
# Number of div elements POST-JS render
len(html.find('div'))

419

In [13]:
# Number of anchor elements POST-JS render
len(html.find('a'))

567

In [14]:
# Number of span elements POST-JS render
len(html.find('span'))

599

In [15]:
# Saving the post-rendered HTML to a file as well

with open('MAL_PostRender_Scrape.html', 'wb') as file:
    file.write(html.raw_html)

Clearly, the webpage contains much more content once the JavaScript is rendered. As seen above, the numbers of div, anchor, and span elements each increase more than ten-fold after rendering.
<br><br>
**Note:** If we examine the pre-render and post-render HTML that we saved above, we see that the pre-rendered HTML does not provide any information about the shows in the user's anime list (the exact information I was interested in). The post-rendered HTML, however, contains the information we want, so that is what we use for the rest of this project.
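
To make the size of that jump concrete, the counts printed above can be turned into growth factors. This is plain arithmetic on the numbers from this particular run; the dictionary names are illustrative only.

```python
# Element counts reported above (pre- and post-render) for this particular run
pre_render = {"div": 38, "a": 36, "span": 26}
post_render = {"div": 419, "a": 567, "span": 599}

# Growth factor per tag: post-render count divided by pre-render count
growth = {tag: post_render[tag] / pre_render[tag] for tag in pre_render}

for tag, factor in growth.items():
    print(f"{tag}: {pre_render[tag]} -> {post_render[tag]} ({factor:.1f}x)")
```

Every factor is above 10, with span elements growing the most.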

# Using Pandas to Scrape HTML Table

MAL presents a user's anime list as a regular HTML table, so my first step was to see what information I could collect using **Pandas'** convenient _read_html()_ method.

In [16]:
import pandas as pd

In [17]:
tables = pd.read_html(html.html)

In [18]:
table = tables[0] 
table

# For the purposes of this project, the columns "Unnamed: 0", "#", "Image", and "Tags" are not needed
# Also, every other row in this table is empty (all values missing), so we should remove those rows

,Unnamed: 0,#,Image,Anime Title,Score,Type,Progress,Tags
0,,1.0,,Hajime no Ippo Add - More,9,TV,30 / 75,
1,,,,,,,,
2,,2.0,,Kimetsu no Yaiba: Yuukaku-hen Airing Add - ...,8,TV,3 / -,
3,,,,,,,,
4,,3.0,,Ousama Ranking Airing Add - More,9,TV,11 / 23,
...,...,...,...,...,...,...,...,...
175,,,,,,,,
176,,89.0,,Wu Shan Wu Xing Add - More,-,ONA,- / 3,
177,,,,,,,,
178,,90.0,,Yuu☆Yuu☆Hakusho Add - More,-,TV,- / 112,


In [19]:
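# Removing the empty rows by keeping every other row
# Note: the stop value of -1 excludes the final row (index 178, entry #90); table.iloc[::2] would keep it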
anime_list = table.iloc[0:-1:2].copy()
anime_list

,Unnamed: 0,#,Image,Anime Title,Score,Type,Progress,Tags
0,,1.0,,Hajime no Ippo Add - More,9,TV,30 / 75,
2,,2.0,,Kimetsu no Yaiba: Yuukaku-hen Airing Add - ...,8,TV,3 / -,
4,,3.0,,Ousama Ranking Airing Add - More,9,TV,11 / 23,
6,,4.0,,Tensei shitara Slime Datta Ken Add - More,8,TV,19 / 24,
8,,5.0,,Akame ga Kill! Add - More,7,TV,24,
...,...,...,...,...,...,...,...,...
170,,86.0,,Stranger: Mukou Hadan Add - More,-,Movie,- / 1,
172,,87.0,,Tokyo Ghoul Add - More,-,TV,- / 12,
174,,88.0,,Violet Evergarden Add - More,-,TV,- / 13,
176,,89.0,,Wu Shan Wu Xing Add - More,-,ONA,- / 3,
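
Slicing by position assumes the empty rows always alternate with the real ones. As a hedged alternative, the sketch below filters by content instead, keeping only rows whose `#` column is populated. The small `table` here is a hand-made stand-in for the scraped one, not real output.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for the scraped table: real rows alternate with all-NaN rows
table = pd.DataFrame(
    {
        "#": [1.0, np.nan, 2.0, np.nan, 3.0],
        "Anime Title": ["Show A Add - More", np.nan, "Show B Add - More", np.nan, "Show C Add - More"],
        "Score": ["9", np.nan, "8", np.nan, "-"],
    }
)

# Keep only rows that have an entry number, then renumber the index from 0
anime_list = table.dropna(subset=["#"]).reset_index(drop=True)
print(anime_list)
```

Because this filters on content rather than position, the final row is kept regardless of how many rows the table has. Passing `drop=True` to `reset_index` also avoids creating an extra `index` column that would otherwise need to be dropped afterwards.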


In [20]:
anime_list.columns

Index(['Unnamed: 0', '#', 'Image', 'Anime Title', 'Score', 'Type', 'Progress',
       'Tags'],
      dtype='object')

In [21]:
# Dropping the columns we don't need
anime_list = anime_list.drop(labels=['Unnamed: 0','Tags','#','Image'], axis=1)
anime_list

,Anime Title,Score,Type,Progress
0,Hajime no Ippo Add - More,9,TV,30 / 75
2,Kimetsu no Yaiba: Yuukaku-hen Airing Add - ...,8,TV,3 / -
4,Ousama Ranking Airing Add - More,9,TV,11 / 23
6,Tensei shitara Slime Datta Ken Add - More,8,TV,19 / 24
8,Akame ga Kill! Add - More,7,TV,24
...,...,...,...,...
170,Stranger: Mukou Hadan Add - More,-,Movie,- / 1
172,Tokyo Ghoul Add - More,-,TV,- / 12
174,Violet Evergarden Add - More,-,TV,- / 13
176,Wu Shan Wu Xing Add - More,-,ONA,- / 3


In [22]:
# Resetting the indexes of the final rows
anime_list = anime_list.reset_index()
anime_list

,index,Anime Title,Score,Type,Progress
0,0,Hajime no Ippo Add - More,9,TV,30 / 75
1,2,Kimetsu no Yaiba: Yuukaku-hen Airing Add - ...,8,TV,3 / -
2,4,Ousama Ranking Airing Add - More,9,TV,11 / 23
3,6,Tensei shitara Slime Datta Ken Add - More,8,TV,19 / 24
4,8,Akame ga Kill! Add - More,7,TV,24
...,...,...,...,...,...
85,170,Stranger: Mukou Hadan Add - More,-,Movie,- / 1
86,172,Tokyo Ghoul Add - More,-,TV,- / 12
87,174,Violet Evergarden Add - More,-,TV,- / 13
88,176,Wu Shan Wu Xing Add - More,-,ONA,- / 3


In [23]:
# Dropping the old "index" column
anime_list = anime_list.drop(labels='index', axis=1)
anime_list

,Anime Title,Score,Type,Progress
0,Hajime no Ippo Add - More,9,TV,30 / 75
1,Kimetsu no Yaiba: Yuukaku-hen Airing Add - ...,8,TV,3 / -
2,Ousama Ranking Airing Add - More,9,TV,11 / 23
3,Tensei shitara Slime Datta Ken Add - More,8,TV,19 / 24
4,Akame ga Kill! Add - More,7,TV,24
...,...,...,...,...
85,Stranger: Mukou Hadan Add - More,-,Movie,- / 1
86,Tokyo Ghoul Add - More,-,TV,- / 12
87,Violet Evergarden Add - More,-,TV,- / 13
88,Wu Shan Wu Xing Add - More,-,ONA,- / 3


Although this method collects some information, the page's HTML contains more details about each show that we need: the "status" of each show (Completed, Watching, Plan To Watch, etc.) and the URL of each anime, which we will use to gather more information in the future. The titles returned by this method also carry the leftover link text "Add - More".  
Therefore, we will collect our data by scraping the HTML directly instead.

# Scraping the HTML code

## Getting all anime names

In [24]:
list_entries = html.find('tr.list-table-data')

In [25]:
len(list_entries) # After checking the actual webpage, we can confirm that this number seems correct

90

In [26]:
# Examining the structure of one entry
list_entries[0].html

'<tr class="list-table-data"><td class="data status watching"></td> <td class="data number">1</td> <td class="data image"><a href="/anime/263/Hajime_no_Ippo" class="link sort"><img src="https://cdn.myanimelist.net/r/96x136/images/anime/4/86334.webp?s=ef3213de769a55ea677d4536af3dfb87" class="hover-info image" /></a></td> <td class="data title clearfix"><a href="/anime/263/Hajime_no_Ippo" class="link sort">Hajime no Ippo</a> <div class="icon-watch2"><a href="/anime/263/Hajime_no_Ippo/video" title="Watch Episode Video" class="mal-icon ml4"><i class="malicon malicon-movie-episode"></i></a></div> <span class="rewatching" style="display: none;">\n</span> <span class="content-status" style="display: none;">\n</span> <div class="add-edit-more"><!-- --> <span class="add"><a href="/ownlist/anime/add?selected_series_id=263&amp;hideLayout" class="List_LightBox">Add</a></span>\n            -\n            <span class="more"><a href="#">More</a></span></div></td> <td class="data score"><a href="#" cl

In [27]:
# The element containing the anime title is located in this tag: 
# <a href="/anime/263/Hajime_no_Ippo" class="link sort">Hajime no Ippo</a>

In [28]:
# Isolating only the anime names
anime_names = [entry.find('td.data.title.clearfix')[0].find('a.link.sort')[0].text for entry in list_entries]

In [29]:
len(anime_names) # Still the correct number

90

In [30]:
anime_names[:10] # Looks good

['Hajime no Ippo',
 'Kimetsu no Yaiba: Yuukaku-hen',
 'Ousama Ranking',
 'Tensei shitara Slime Datta Ken',
 'Akame ga Kill!',
 'Banana Fish',
 'Black Clover',
 'Boku no Hero Academia',
 'Boku no Hero Academia 2nd Season',
 'Boku no Hero Academia 3rd Season']
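
For readers without requests_html, the same title anchor can be picked out with Python's standard-library `HTMLParser`. This is only an illustrative equivalent of the `a.link.sort` selector, not the approach used in this notebook; `TitleLinkParser` is a hypothetical helper, and the snippet is a trimmed copy of the entry shown earlier (the second href is shortened).

```python
from html.parser import HTMLParser

# Trimmed copy of the title cell from the entry shown earlier
snippet = (
    '<td class="data title clearfix">'
    '<a href="/anime/263/Hajime_no_Ippo" class="link sort">Hajime no Ippo</a>'
    '<span class="add"><a href="/ownlist/anime/add" class="List_LightBox">Add</a></span>'
    '</td>'
)


class TitleLinkParser(HTMLParser):
    """Collect (title, href) pairs from anchors whose classes include 'link' and 'sort'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the title anchor currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and {"link", "sort"} <= set(classes):
            self._href = attrs.get("href")

    def handle_data(self, data):
        # Only text inside a matching anchor counts as a title
        if self._href is not None and data.strip():
            self.links.append((data.strip(), self._href))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


parser = TitleLinkParser()
parser.feed(snippet)
print(parser.links)
```

The "Add" link is skipped because its anchor does not carry the `link sort` classes.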

In [31]:
#anime_names

## Getting all anime urls

In [32]:
# The relative URL for each anime can be found as the 'href' attribute of the anchor tags containing their titles

anime_urls = [entry.find('td.data.title.clearfix')[0].find('a.link.sort')[0].attrs['href'] for entry in list_entries]

In [33]:
len(anime_urls) # 90 titles --> 90 URLs

90

In [34]:
anime_urls[:10] # We should convert these to absolute URLs in case we want to visit them in the future

['/anime/263/Hajime_no_Ippo',
 '/anime/47778/Kimetsu_no_Yaiba__Yuukaku-hen',
 '/anime/40834/Ousama_Ranking',
 '/anime/37430/Tensei_shitara_Slime_Datta_Ken',
 '/anime/22199/Akame_ga_Kill',
 '/anime/36649/Banana_Fish',
 '/anime/34572/Black_Clover',
 '/anime/31964/Boku_no_Hero_Academia',
 '/anime/33486/Boku_no_Hero_Academia_2nd_Season',
 '/anime/36456/Boku_no_Hero_Academia_3rd_Season']

In [35]:
base = 'https://myanimelist.net'

# We only want to attach the base URL to the front if the href is indeed a relative URL (starting with '/')
absolute_urls = [base+url if url.startswith('/') else url for url in anime_urls]
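
As an alternative sketch, the standard library's `urllib.parse.urljoin` performs the same relative-versus-absolute check internally. The second href below is a made-up absolute URL, included only to show that it passes through unchanged.

```python
from urllib.parse import urljoin

base = "https://myanimelist.net"

# One relative href taken from the output above and one made-up absolute URL
hrefs = ["/anime/263/Hajime_no_Ippo", "https://example.com/some/page"]

# urljoin resolves relative hrefs against the base and leaves absolute URLs untouched
absolute_urls = [urljoin(base, href) for href in hrefs]
print(absolute_urls)
```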

In [36]:
len(absolute_urls)

90

In [37]:
absolute_urls[:10]

['https://myanimelist.net/anime/263/Hajime_no_Ippo',
 'https://myanimelist.net/anime/47778/Kimetsu_no_Yaiba__Yuukaku-hen',
 'https://myanimelist.net/anime/40834/Ousama_Ranking',
 'https://myanimelist.net/anime/37430/Tensei_shitara_Slime_Datta_Ken',
 'https://myanimelist.net/anime/22199/Akame_ga_Kill',
 'https://myanimelist.net/anime/36649/Banana_Fish',
 'https://myanimelist.net/anime/34572/Black_Clover',
 'https://myanimelist.net/anime/31964/Boku_no_Hero_Academia',
 'https://myanimelist.net/anime/33486/Boku_no_Hero_Academia_2nd_Season',
 'https://myanimelist.net/anime/36456/Boku_no_Hero_Academia_3rd_Season']

In [38]:
#absolute_urls # Looks good

## Getting all anime Scores

In [39]:
import numpy as np

In [40]:
anime_scores = [entry.find('span.score-label')[0].text for entry in list_entries]

In [41]:
len(anime_scores)

90

In [42]:
anime_scores[:10]

['9', '8', '9', '8', '7', '9', '10', '9', '9', '9']

In [43]:
set(anime_scores)

# We eventually want to convert these scores to a numeric data type, so we need to handle the case of a score being '-'.
# In MAL, a score of '-' means no score has been given yet (not that the score is 0).
# Therefore, we will convert the '-' scores to np.nan (a missing value) so that they do not interfere
# when we cast the other scores to integers.

{'-', '10', '6', '7', '8', '9'}

In [44]:
# Converting '-' scores to np.nan (missing value) and recasting all other scores as integer data types

anime_scores = [np.nan if score=='-' else int(score) for score in anime_scores]
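
The same idea can be sketched with only the standard library. `parse_score` is a hypothetical helper and the sample scores are made up. NaN is itself a float, which is why the scores appear as `9.0` rather than `9` once they are placed in a DataFrame column alongside missing values.

```python
import math


def parse_score(raw):
    """Return the score as a float, or NaN when MAL shows '-' (no score given)."""
    raw = raw.strip()
    return math.nan if raw == "-" else float(raw)


# Made-up sample of raw score strings
scores = [parse_score(s) for s in ["9", "8", "-", "10"]]
print(scores)

# NaN is filtered out explicitly when averaging, so unscored shows do not count as 0
rated = [s for s in scores if not math.isnan(s)]
print(sum(rated) / len(rated))
```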

In [45]:
len(anime_scores)

90

In [46]:
anime_scores[:10]

[9, 8, 9, 8, 7, 9, 10, 9, 9, 9]

In [47]:
#anime_scores # Looks good

## Getting all anime Statuses

In [48]:
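# The status is the third class on the first cell of each row, e.g. <td class="data status watching">
anime_status = [entry.find('td.status')[0].attrs['class'][2] for entry in list_entries]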
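
Indexing position `[2]` relies on the status always being the third class. A slightly more defensive sketch is shown below, assuming `data` and `status` are the only non-status classes on that cell (true for the entry shown earlier, unverified in general). `status_from_classes` is a hypothetical helper, and the second class sequence is a made-up reordering.

```python
# Class sequences as they might come back from .attrs['class'];
# the second one is a made-up reordering to show the failure mode of a fixed index
class_lists = [
    ("data", "status", "watching"),
    ("status", "completed", "data"),
]

# Classes that every status cell carries and that are not the status itself
GENERIC_CLASSES = {"data", "status"}


def status_from_classes(classes):
    """Return the first class that is not a generic one, or None if there is none."""
    return next((c for c in classes if c not in GENERIC_CLASSES), None)


print([status_from_classes(c) for c in class_lists])
```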

In [49]:
len(anime_status)

90

In [50]:
anime_status[:10]

['watching',
 'watching',
 'watching',
 'watching',
 'completed',
 'completed',
 'completed',
 'completed',
 'completed',
 'completed']

In [51]:
set(anime_status) 

# These are already in an acceptable format for the purposes of this project, but we can perform minor cleaning later.

{'completed', 'dropped', 'onhold', 'plantowatch', 'watching'}

In [52]:
#anime_status

## Getting all anime Types

In [53]:
anime_types = [entry.find('td.type')[0].text for entry in list_entries]

In [54]:
len(anime_types)

90

In [55]:
anime_types[:10]

['TV', 'TV', 'TV', 'TV', 'TV', 'TV', 'TV', 'TV', 'TV', 'TV']

In [56]:
set(anime_types) # These are already in an acceptable format

{'Movie', 'ONA', 'Special', 'TV'}

In [57]:
#anime_types

## Getting anime Progress

In [58]:
# The progress of each anime is stored in one or two spans:
# two when it is shown as [watched episodes] / [total episodes], one when only a single episode count is shown

progress_span_containers = [entry.find('td.progress')[0].find('span') for entry in list_entries]

In [59]:
len(progress_span_containers)

90

In [60]:
progress_span_containers[0]

[<Element 'span' >, <Element 'span' >]

In [61]:
progress_span_containers[0][0].html

'<span><a href="#" class="link edit-disabled">30</a> <!-- -->\n              /\n            </span>'

In [62]:
progress_span_containers[0][1].html

'<span>75</span>'

**Plan:** If an anime has two spans in its progress container, the final string should be in the format (watched episodes) / (total episodes). If it has only one span, we simply extract its text, which is a single episode count (for example, 24 for Akame ga Kill!, which is marked as completed).
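
The plan can be sketched on plain strings first. `join_progress` is a hypothetical helper, and the span texts below are assumed values modelled on the HTML shown above, not captured output.

```python
def join_progress(span_texts):
    """Join the text of one or two progress spans, removing all whitespace."""
    return "".join("".join(text.split()) for text in span_texts)


# Assumed span texts: two spans for an in-progress show, one span otherwise
print(join_progress(["30 /", "75"]))
print(join_progress(["24"]))
```

Removing all whitespace, rather than one character at a fixed position, keeps the result the same whether the slash is preceded by a space or a newline.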

In [63]:
# Joining the two spans together as strings or just extracting the text if only one span is present

anime_progress = ["".join([span.text[:-2]+span.text[-1] if ('/' in span.text) else span.text for span in c]) for c in progress_span_containers]

In [64]:
len(anime_progress)

90

In [65]:
anime_progress[:10] # Looks good

['30/75', '3/-', '11/23', '19/24', '24', '24', '170', '13', '25', '25']

In [66]:
#anime_progress # We will keep these as strings for now as we only want to collect the data not transform it too much
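
If these strings are parsed later, one possible shape is sketched below. `parse_progress` is a hypothetical helper, and treating a single number as both the watched and the total count is an assumption about how MAL displays finished shows.

```python
def parse_progress(progress):
    """Split a progress string into (watched, total); '-' becomes None."""

    def to_int(part):
        return None if part == "-" else int(part)

    if "/" in progress:
        watched, total = progress.split("/")
        return to_int(watched), to_int(total)
    # Single number: assumed to mean watched and total coincide
    count = to_int(progress)
    return count, count


for raw in ["30/75", "3/-", "24"]:
    print(raw, "->", parse_progress(raw))
```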

In [67]:
# Closing our Async session now that all the web scraping is done

await session.close()

# Combining lists into a DataFrame

In [68]:
user_animelist = pd.DataFrame()

In [69]:
user_animelist["Name"] = anime_names
user_animelist["User Score"] = anime_scores
user_animelist["Progress"] = anime_progress
user_animelist["Status"] = anime_status
user_animelist["Type"] = anime_types
user_animelist["URL"] = absolute_urls
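
An equivalent sketch builds the DataFrame in one step from a dict. The short lists are made-up stand-ins for the scraped ones. A side benefit is that pandas raises a `ValueError` if the lists differ in length, which catches a partially failed scrape early.

```python
import pandas as pd

# Short made-up lists standing in for the scraped ones
names = ["Show A", "Show B"]
scores = [9, float("nan")]
progress = ["30/75", "-/12"]

# Passing a dict builds all columns at once; mismatched lengths raise ValueError
animelist = pd.DataFrame({"Name": names, "User Score": scores, "Progress": progress})
print(animelist.dtypes)
```

The `User Score` column comes out as `float64` because the missing value is a float.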

In [70]:
pd.set_option("display.max_rows", None, "display.max_columns", None, "display.max_colwidth", None)

In [71]:
# Creating a function to clean up each anime's status as we mentioned above

def convertStatus(item):
    if item=='completed':
        return "Completed"
    elif item=='onhold':
        return "On Hold"
    elif item=='watching':
        return "Watching"
    elif item=='plantowatch':
        return "Plan To Watch"
    elif item=='dropped':
        return "Dropped"
    else:
        return None  # Unrecognized status

In [72]:
user_animelist["Status"] = user_animelist["Status"].apply(convertStatus)
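
An alternative to an if/elif chain is a lookup table. The sketch below covers the five values seen in `set(anime_status)` and falls back to the raw value for anything else; `STATUS_LABELS` and `convert_status` are hypothetical names, and `"rewatching"` is a made-up unknown value.

```python
# Raw status class -> display label (the five values seen in set(anime_status))
STATUS_LABELS = {
    "watching": "Watching",
    "completed": "Completed",
    "onhold": "On Hold",
    "dropped": "Dropped",
    "plantowatch": "Plan To Watch",
}


def convert_status(item):
    """Return the display label, falling back to the raw value if it is unknown."""
    return STATUS_LABELS.get(item, item)


print([convert_status(s) for s in ["onhold", "dropped", "rewatching"]])
```

Falling back to the raw value means an unexpected status stays visible in the data instead of silently becoming a missing value.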

In [73]:
len(user_animelist) # Looks like we have all 90 entries intact

90

In [74]:
user_animelist.head()

,Name,User Score,Progress,Status,Type,URL
0,Hajime no Ippo,9.0,30/75,Watching,TV,https://myanimelist.net/anime/263/Hajime_no_Ippo
1,Kimetsu no Yaiba: Yuukaku-hen,8.0,3/-,Watching,TV,https://myanimelist.net/anime/47778/Kimetsu_no_Yaiba__Yuukaku-hen
2,Ousama Ranking,9.0,11/23,Watching,TV,https://myanimelist.net/anime/40834/Ousama_Ranking
3,Tensei shitara Slime Datta Ken,8.0,19/24,Watching,TV,https://myanimelist.net/anime/37430/Tensei_shitara_Slime_Datta_Ken
4,Akame ga Kill!,7.0,24,Completed,TV,https://myanimelist.net/anime/22199/Akame_ga_Kill


In [75]:
# Saving our final DataFrame as a CSV file for future use

user_animelist.to_csv('Scraped_MAL_Data.csv', index=False, encoding='utf-8')
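
As a quick sanity check on the CSV format, the sketch below round-trips a tiny made-up DataFrame through an in-memory buffer. It suggests that missing scores are written as empty fields and come back as NaN when the file is read again.

```python
import io

import pandas as pd

# Tiny made-up frame with one missing score
df = pd.DataFrame({"Name": ["Show A", "Show B"], "User Score": [9.0, float("nan")]})

# Write to an in-memory buffer instead of a file so the sketch leaves nothing on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())

# Reading the text back restores the missing score as NaN
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored["User Score"].isna().tolist())
```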