# Anime Profile Pic Scraper

A bot to:
1. Cycle through all the characters on https://myanimelist.net, starting from https://myanimelist.net/character/1
2. For each character, navigate to its pictures page, e.g. https://myanimelist.net/character/1/Spike_Spiegel/pictures *(side note: I could perhaps search instead for https://myanimelist.net/character/1/<???>/pictures)*
3. Scrape all links to the profile pictures
4. Create a new folder with the ID + Profile name of the Character
5. Save all pictures in this new folder
6. Repeat

## How will it do each step?

1. Character IDs are incremental, so we can simply start from https://myanimelist.net/character/1 and add 1 until we receive an error. As a failsafe, consider saving the enumerator in an external text file.
2. Two possible methods using bs4:
    1. Scrape link to the pages by navigating to its position on the page
    2. Search each `<a>` tag on the page for https://myanimelist.net/character/1/<???>/pictures
3. Use bs4 to create a list of all the links *(idea: how about using a <a href="https://docs.python.org/3.3/tutorial/datastructures.html"><strong>set rather than a list</strong></a>? Sets discard duplicates automatically, ensuring a picture is not downloaded twice).*
4. It should be easy to scrape the name from either the original character page or the pictures page. Then use the `os` module to create a new folder named with the ID and the name.
5. Cycle through the set created in step 3. and download the pictures. A good option to do this is `urllib.request.urlretrieve()` (<a href='https://stackoverflow.com/a/8286449'>see here for reference</a>)
6. Print out a confirmation message, add 1 to the enumerator, and repeat.
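
Steps 3 to 5 of the plan can be sketched roughly as follows. This is only a minimal sketch under some assumptions: the helper names `folder_name` and `save_pictures`, the `<id>_<name>` folder format, and the `fetch` parameter (which defaults to `urlretrieve` and exists so the function can be tried without hitting the network) are all hypothetical and not part of the notebook.

```python
import os
from urllib.request import urlretrieve


def folder_name(character_id, character_name):
    """Build a folder name such as '1_Spike_Spiegel' (this format is an assumption)."""
    safe_name = "".join(
        c if c.isalnum() or c in "-_" else "_" for c in character_name.strip()
    )
    return "{}_{}".format(character_id, safe_name)


def save_pictures(picture_links, character_id, character_name,
                  base_dir=".", fetch=urlretrieve):
    """Download every unique link into <base_dir>/<id>_<name>/ and return the saved paths."""
    target_dir = os.path.join(base_dir, folder_name(character_id, character_name))
    os.makedirs(target_dir, exist_ok=True)  # no error if the folder already exists
    saved = []
    # set() drops duplicate links; sorted() only makes the order reproducible
    for link in sorted(set(picture_links)):
        filename = link.rstrip("/").split("/")[-1]
        destination = os.path.join(target_dir, filename)
        fetch(link, destination)
        saved.append(destination)
    return saved
```

Passing the links through `set()` is what keeps a picture from being downloaded twice, even if the same URL was scraped more than once.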

In [1]:
#!/usr/bin/python3

# coding: utf-8

# !Py3.5.2

In [2]:
#dependencies
from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve

In [3]:
link_prefix = "https://myanimelist.net"

## Step 1: 
* Cycle through all the characters in https://myanimelist.net starting from https://myanimelist.net/character/1
* Character IDs are incremental - we can simply start from https://myanimelist.net/character/1 and then add 1 until we receive an error. 
* As a failsafe, **consider saving the enumerator in an outside text file**.
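
The failsafe could look something like the sketch below: the next ID to process is written to a text file after every character, so an interrupted run can pick up where it left off. The file name `last_character_id.txt` and the helpers `load_character_id`, `save_character_id` and `crawl` are hypothetical names chosen for illustration, and `process_character` stands in for whatever Steps 2 to 5 end up doing. Stopping at the first `HTTPError` simply follows the plan above.

```python
import os
from urllib.error import HTTPError
from urllib.request import urlopen

PROGRESS_FILE = "last_character_id.txt"  # hypothetical file name


def load_character_id(path=PROGRESS_FILE, default=1):
    """Return the ID stored in the progress file, or `default` if it is missing or unreadable."""
    try:
        with open(path) as progress_file:
            return int(progress_file.read().strip())
    except (FileNotFoundError, ValueError):
        return default


def save_character_id(character_id, path=PROGRESS_FILE):
    """Overwrite the progress file with the next ID to process."""
    with open(path, "w") as progress_file:
        progress_file.write(str(character_id))


def crawl(process_character, path=PROGRESS_FILE, open_url=urlopen):
    """Visit character pages in ID order until a request raises HTTPError."""
    character_id = load_character_id(path)
    while True:
        url = "https://myanimelist.net/character/{}/".format(character_id)
        try:
            page = open_url(url, timeout=30)
        except HTTPError:
            break  # the plan treats the first error as the end of the ID range
        process_character(character_id, page)
        character_id += 1
        save_character_id(character_id, path)  # checkpoint after each character
    return character_id
```

Calling `crawl(my_handler)` again after a crash would resume from the ID stored in the file rather than from 1.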

In [4]:
#while True: #uncomment this for production and indent everything below

In [5]:
#starting up
character_id = 1000

#create a link
character_page_url = "https://myanimelist.net/character/"+str(character_id)+"/"
character_page_html = urlopen(character_page_url, timeout=30)
print(character_page_html.info()) #remove for production

Date: Fri, 02 Jun 2017 11:42:42 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Server: Apache
Set-Cookie: MALSESSIONID=d2gaaa41kv1hm6jnskueik1rq1; expires=Mon, 31-May-2027 11:42:42 GMT; Max-Age=315360000; path=/; secure; HttpOnly
Set-Cookie: MALHLOGSESSID=eebe27405849f964728d354a8a2ab24f; expires=Wed, 01-Jun-2022 11:42:42 GMT; Max-Age=157680000; path=/
Cache-Control: no-cache
Vary: User-Agent,Accept-Encoding
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload




In [6]:
#create a Beautiful Soup object
soup = BeautifulSoup(character_page_html, "html.parser")
#print(soup.prettify()) #remove for production

## Step 2:

* For each character, navigate to its pictures page, e.g. https://myanimelist.net/character/1/Spike_Spiegel/pictures
* Two possible methods using bs4:
    1. Scrape the link to the pictures page by navigating to its position on the page
    2. Search each `<a>` tag on the page for https://myanimelist.net/character/1/<???>/pictures

In [7]:
# Method 1:
div = soup.find(id="content")
pictures_link = div.a.get('href')
print(link_prefix+pictures_link)

https://myanimelist.net/character/1000/Chao_Lingshen/pictures


In [8]:
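# Method 2: scan every <a> tag for an href whose last path segment is "pictures".
# The first attempt only accepted hrefs starting with "https" and found nothing; Method 1
# has to prepend link_prefix, so the pictures href is relative and was likely filtered out.
a_tags = soup.find_all("a")
for a_tag in a_tags:
    a_tag_link = a_tag.get('href')
    if a_tag_link is not None:
        split = a_tag_link.rstrip("/").split("/")
        if split[-1] == "pictures":
            # prepend the site prefix only when the href is relative
            if a_tag_link.startswith("/"):
                a_tag_link = link_prefix + a_tag_link
            print(a_tag_link)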
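
The `<???>` wildcard from Step 2 can also be expressed as a regular expression over the collected hrefs. The sketch below is only an illustration: `find_pictures_link` is a hypothetical helper, it takes a plain list of href strings (such as the values returned by `a_tag.get('href')`), and it assumes the pictures link has the shape `/character/<id>/<name>/pictures`, either relative or prefixed with the site URL.

```python
import re

LINK_PREFIX = "https://myanimelist.net"


def find_pictures_link(hrefs, character_id, link_prefix=LINK_PREFIX):
    """Return the first absolute .../character/<id>/<name>/pictures link, or None."""
    pattern = re.compile(
        r"^(?:" + re.escape(link_prefix) + r")?/character/"
        + str(character_id) + r"/[^/]+/pictures/?$"
    )
    for href in hrefs:
        if href and pattern.match(href):
            # relative hrefs get the site prefix; absolute ones are returned as-is
            return href if href.startswith(link_prefix) else link_prefix + href
    return None
```

Matching on the character ID as well as the trailing `pictures` segment avoids picking up a pictures link that belongs to a different page.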