# Scraping Anime Info Using Python


In [1]:
from bs4 import BeautifulSoup
import urllib.request,sys,time
import requests
import pandas as pd

## Making Simple requests

We will use the "Horror" genre page from [My Anime List](https://myanimelist.net/anime/genre/14/Horror).

In [2]:
URL = "https://myanimelist.net/anime/genre/14/Horror"

page = requests.get(URL)

## Extracting content from HTML

In [3]:
soup = BeautifulSoup(page.content, "html.parser")

Here we need to find all the links that point to the individual anime pages.

In [4]:
links = soup.find_all("a", attrs = {'class' : 'link-title'})

print(len(links))

100


In [6]:
anime_links = []
for link in links:
    anime_links.append(link['href'])

In [7]:
anime_links[0:10]

['https://myanimelist.net/anime/22319/Tokyo_Ghoul',
 'https://myanimelist.net/anime/22535/Kiseijuu__Sei_no_Kakuritsu',
 'https://myanimelist.net/anime/27899/Tokyo_Ghoul_√A',
 'https://myanimelist.net/anime/37779/Yakusoku_no_Neverland',
 'https://myanimelist.net/anime/11111/Another',
 'https://myanimelist.net/anime/226/Elfen_Lied',
 'https://myanimelist.net/anime/8074/Highschool_of_the_Dead',
 'https://myanimelist.net/anime/6880/Deadman_Wonderland',
 'https://myanimelist.net/anime/36511/Tokyo_Ghoul_re',
 'https://myanimelist.net/anime/35120/Devilman__Crybaby']

Super!!!!
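The same extraction can be written as a list comprehension. The sketch below runs on a small hand-written HTML snippet (an assumption standing in for the real genre page) and skips any anchor that has no `href`, which the loop above would fail on with a `KeyError`.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the genre page; the real page has many more anchors.
sample_html = """
<h2><a class="link-title" href="https://myanimelist.net/anime/22319/Tokyo_Ghoul">Tokyo Ghoul</a></h2>
<h2><a class="link-title" href="https://myanimelist.net/anime/11111/Another">Another</a></h2>
<a class="icon-watch" href="https://myanimelist.net/anime/22319/Tokyo_Ghoul/video">Watch Video</a>
<a class="link-title">No href here</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ is BeautifulSoup's keyword for matching the HTML class attribute.
sample_links = [
    a["href"]
    for a in sample_soup.find_all("a", class_="link-title")
    if a.has_attr("href")  # skip anchors that carry no URL
]
print(sample_links)
```

Only the two `link-title` anchors that carry an `href` end up in `sample_links`; the "Watch Video" link is ignored because its class does not match.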

### Exploring the other pages
We extracted all the links from the first page, but now we need a way to visit the remaining pages without knowing in advance how many there are. A `while` loop that stops on the first failed request is one option.

For example, the **Horror** genre only has 5 pages, so the request for the sixth page returns a 404 and we break out of the loop.

In [8]:
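num_page = 1
status = 202  # placeholder: any value other than 404 lets the loop start
while status != 404:
    URL = "https://myanimelist.net/anime/genre/14/Horror?page=" + str(num_page)
    page = requests.get(URL)
    status = page.status_code  # becomes 404 once we go past the last page
    num_page += 1
    # Note: this line also runs for the final 404 response before the loop exits
    print(f"Correct Page {URL} and status code {page.status_code}")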

Correct Page https://myanimelist.net/anime/genre/14/Horror?page=1 and status code 200
Correct Page https://myanimelist.net/anime/genre/14/Horror?page=2 and status code 200
Correct Page https://myanimelist.net/anime/genre/14/Horror?page=3 and status code 200
Correct Page https://myanimelist.net/anime/genre/14/Horror?page=4 and status code 200
Correct Page https://myanimelist.net/anime/genre/14/Horror?page=5 and status code 200
Correct Page https://myanimelist.net/anime/genre/14/Horror?page=6 and status code 404


nice!!

With this we can visit every page of whichever genre we want.
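A slightly more defensive variant of this loop is sketched below. It is not the notebook's code: `iter_genre_pages`, `fake_fetch`, `delay`, and `max_pages` are hypothetical names introduced for illustration. The fetch function is passed in as an argument, so the loop can be tried here against a fake that mimics the 5-page Horror genre without touching the network.

```python
import time
from typing import Callable, Iterator, Tuple


def iter_genre_pages(
    fetch: Callable[[str], Tuple[int, bytes]],
    base_url: str,
    delay: float = 1.0,
    max_pages: int = 50,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (page_number, html) until the first non-200 response."""
    for num_page in range(1, max_pages + 1):  # hard cap guards against endless loops
        status, html = fetch(f"{base_url}?page={num_page}")
        if status != 200:  # stop on 404 and on any other failure
            break
        yield num_page, html
        time.sleep(delay)  # pause between requests to be polite to the server


# Fake fetcher standing in for the site: pages 1-5 exist, page 6 is a 404.
def fake_fetch(url: str) -> Tuple[int, bytes]:
    page = int(url.rsplit("=", 1)[1])
    return (200, b"<html></html>") if page <= 5 else (404, b"")


pages = [
    n
    for n, _ in iter_genre_pages(
        fake_fetch, "https://myanimelist.net/anime/genre/14/Horror", delay=0
    )
]
print(pages)
```

For real use, the fetcher could be a small wrapper around `requests.get` that returns `(response.status_code, response.content)`. Stopping on any non-200 status, rather than only on 404, also ends the loop if the site starts refusing requests.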

## Explore each anime

Now that we have the links, we can open each anime page and extract more information. We will use [Tokyo_Ghoul](https://myanimelist.net/anime/22319/Tokyo_Ghoul) as an example.

In [9]:
URL = "https://myanimelist.net/anime/22319/Tokyo_Ghoul"

page = requests.get(URL)
print(page.status_code)

soup = BeautifulSoup(page.content, "html.parser")

200


Looking at the page layout, the easiest way is to find the first \<tr\> and then its first \<td\>.

In [10]:
anime_card = soup.find("tr")
anime_info = anime_card.find("td")

# string= is the current name of the filter that older bs4 versions called text=
anime_english = anime_info.find("span", string="English:").next_sibling
anime_japanese = anime_info.find("span", string="Japanese:").next_sibling

In [11]:
# string= is the current name of the filter that older bs4 versions called text=
anime_episodes = anime_info.find("span", string="Episodes:").next_sibling
anime_source = anime_info.find("span", string="Source:").next_sibling
anime_ratings = anime_info.find("span", string="Rating:").next_sibling

In [15]:
print(anime_english.strip())
print(anime_japanese.strip())
print(anime_episodes.strip())
print(anime_source.strip())
print(anime_ratings.strip())

Tokyo Ghoul
東京喰種-トーキョーグール-
12
Manga
R - 17+ (violence & profanity)


The important point here is that we need to locate the sections we are interested in. Remember that if an error occurs or a section is not found, we need to handle the exception.
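One way to handle a missing section is to check for `None` before reading `.next_sibling`, instead of letting an `AttributeError` propagate. The helper below is a minimal sketch, not the notebook's code: `field_after_label` is a hypothetical name, and the HTML is a hand-written stand-in for the real sidebar.

```python
from typing import Optional

from bs4 import BeautifulSoup


def field_after_label(container, label: str) -> Optional[str]:
    """Return the text that follows <span>label</span>, or None if it is absent."""
    # string= is the current name of the filter that older bs4 versions called text=
    span = container.find("span", string=label)
    if span is None or span.next_sibling is None:
        return None  # label not on this page: report "missing" instead of crashing
    return str(span.next_sibling).strip()


# Hand-written stand-in for the information sidebar of an anime page.
sample = BeautifulSoup(
    "<td><div><span>Episodes:</span> 12 </div>"
    "<div><span>Source:</span> Manga </div></td>",
    "html.parser",
)

print(field_after_label(sample, "Episodes:"))
print(field_after_label(sample, "English:"))
```

Returning `None` for a missing field keeps a long scraping run going when a page lacks a field (the sample above has no `English:` entry), and the gaps can be dealt with later when the data is tabulated.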

Extracting the other information is a little harder because the elements don't have a specific ID or class.

In [16]:
anime_details = soup.find("div", attrs={'class':'anime-detail-header-stats'})
score = anime_details.find("div", attrs={'class': 'score-label'}).text
scored_by = anime_details.find("div", attrs={'class': 'score'})['data-user'].replace("users", "").replace(",", "")
ranked = anime_details.find("span", attrs = {'class': 'numbers ranked'}).text
popularity = anime_details.find("span", attrs = {'class': 'numbers popularity'}).text
members = (
    anime_details.find("span", attrs={'class': 'numbers members'})
    .text.replace("Members ", "").replace(",", "")
)
season = anime_details.find("span", attrs = {'class': 'information season'}).text
anime_type = anime_details.find("span", attrs = {'class': 'information type'}).text
studio = anime_details.find("span", attrs = {'class': 'information studio author'}).text

studio

'Studio Pierrot'
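The values extracted above are still strings, and several of them carry labels or thousands separators. A small helper can turn them into integers in one place. This is only a sketch: `to_int` is a hypothetical name and the sample strings are made-up values in the shape the `.replace(...)` calls above suggest.

```python
import re
from typing import Optional


def to_int(text: str) -> Optional[int]:
    """Extract the first run of digits (commas allowed) from text as an int."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None  # no digits at all, e.g. a placeholder such as "N/A"
    return int(match.group().replace(",", ""))


# Made-up sample strings, not values scraped from the site.
print(to_int("Members 1,234,567"))
print(to_int("98,765 users"))
print(to_int("N/A"))
```

Using a regular expression instead of chained `.replace()` calls means the helper does not depend on the exact label text surrounding the number.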

Finally, we need to build a DataFrame containing all the anime.
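A common way to do this is to collect one dictionary per anime and pass the list to `pd.DataFrame`. The sketch below uses the Tokyo Ghoul values shown earlier plus a second, deliberately incomplete record (an assumption, to show how missing fields behave); the column names are illustrative.

```python
import pandas as pd

# One dict per anime; in a real run these would be filled inside the scraping loop.
records = [
    {"title": "Tokyo Ghoul", "episodes": "12", "source": "Manga", "studio": "Studio Pierrot"},
    # Hypothetical record where the detail fields could not be scraped.
    {"title": "Another", "episodes": None, "source": None, "studio": None},
]

df = pd.DataFrame(records)

# Scraped values are strings; convert numeric columns and turn bad values into NaN.
df["episodes"] = pd.to_numeric(df["episodes"], errors="coerce")

print(df.shape)
print(df.dtypes)
```

Building the list first and creating the DataFrame once at the end is usually simpler and faster than appending rows to a DataFrame inside the loop.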

## Second version of the web scraping

In [9]:
url = "https://myanimelist.net/anime/genre/14/Horror"

page = requests.get(url)

soup = BeautifulSoup(page.content, "html.parser")

In [11]:
animes = soup.find_all("div", attrs={'class':'seasonal-anime'})
print(len(animes))

100


In [12]:
animes[0]

<div class="seasonal-anime js-seasonal-anime" data-genre="1,7,14,40,37,8,42"><div>
<div class="title"><a class="icon-watch fl-r" href="https://myanimelist.net/anime/22319/Tokyo_Ghoul/video" title="Watch Episode Video">Watch Video</a><p class="title-text">
<h2 class="h2_anime_title"><a class="link-title" href="https://myanimelist.net/anime/22319/Tokyo_Ghoul">Tokyo Ghoul</a></h2>
</p>
</div>
<div class="prodsrc">
<span class="producer"><a href="/anime/producer/1/Studio_Pierrot" title="Studio Pierrot">Studio Pierrot</a></span>
<div class="eps">
<a href="https://myanimelist.net/anime/22319/Tokyo_Ghoul/episode"><span>12 eps</span>
</a>
</div>
<span class="source">Manga</span>
<a class="button_add btn-anime-watch-status js-anime-watch-status notinmylist" href="https://myanimelist.net/login.php?error=login_required&amp;from=%2Fanime%2Fgenre%2F14%2FHorror" title="Quick add anime to my list">add</a>
</div>
<div class="genres js-genre" id="22319">
<div class="genres-inner js-genre-inner"><span c