# Web Scraping - Collecting Data about Monsters - Complete Notebook

## 1. First Steps: Using Beautiful Soup to Scrape a Wikipedia Page

We are going to scrape Wikipedia for information about various mythological creatures. The information we want exists in multiple layers, like a tree structure. We're starting from [this list](https://en.wikipedia.org/wiki/Lists_of_legendary_creatures). This is a top-level page that is a list of lists, one for each letter (e.g. `List of legendary creatures (A)`). The `A` sublist contains a list of creatures, as well as each creature's cultural origin. We can go one level deeper to the creature itself for even more information.
- Lists of legendary creatures
    - List of legendary creatures (A)
    - List of legendary creatures (B)
        - Ba (Egyptian)
        - Baba Yaga (Slavic)
        - Backoo (Guyanese) ...
    - List of legendary creatures (C) ...

Let's grab the list of lists and try to extract our sublists. We'll use the Requests library, as we did with APIs, as well as [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#), which is a library for extracting data from HTML and XML. We're going to use the `html5lib` parser with Beautiful Soup because it is very lenient and creates valid HTML; the downside is that it is noticeably slower than the other parsers and has to be installed separately.

So, first, let's import all the modules we need, including Beautiful Soup, which we'll simply call `bs`.

In [None]:
import requests 
from bs4 import BeautifulSoup as bs
import string
import re

Now that we have our modules, the following cell will scrape our list of lists of legendary creatures.

As with our API queries, the line of script that uses `requests` to do the scraping is very straightforward: `R = requests.get(list_of_lists_url)`.

Once we have that HTML content, we'll then parse it into a tree structure of Python objects using Beautiful Soup.

In [None]:
list_of_lists_url = "https://en.wikipedia.org/wiki/Lists_of_legendary_creatures"
R = requests.get(list_of_lists_url)
soup = bs(R.content, 'html5lib')

Let's take a quick look at the massive blob of HTML that we've just scraped.

In [None]:
soup

That's a little unwieldy. Let's instead see a "pretty" version of the HTML tree using the `soup.prettify()` method. Once you run the following cell, the HTML will be structured in a much more human-readable format.

In [None]:
print(soup.prettify())

If, instead, we print the actual raw content using `print(R.content)`, we can see that it looks like one long string.

In [None]:
print(R.content)

This is what we downloaded without any formatting. It's actually a <b>bytes</b> object rather than an ordinary string. You can tell because Python prefixes bytes with `b` when it prints them.
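
To make the difference between bytes and text concrete, here is a small sketch that does not touch the network. The HTML snippet is made up for illustration; `R.content` from Requests is the same kind of `bytes` object.

```python
# A made-up snippet standing in for downloaded page content
raw = "<p>Fáfnir</p>".encode("utf-8")

print(type(raw))   # a bytes object, like R.content
print(raw)         # printed with a leading b; non-ASCII characters appear as escapes

# Decoding turns the bytes into ordinary text
text = raw.decode("utf-8")
print(type(text))
print(text)
```

Beautiful Soup accepts either form, which is why the notebook can hand it `R.content` directly.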

So far, so good.

## 2. Navigating HTML with Beautiful Soup

There's a lot that you can do with the HTML data that you've just scraped, but to do so you have to navigate this mass of information effectively. Fortunately, Beautiful Soup includes a number of methods that are built for just this purpose. Most of these methods isolate specific HTML tags, like `<title>` or `<body>` or `<p>`. We'll look at a few here, but you can always [consult the documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) if you want to see a full list.

First of all, we can isolate the text from the page without any HTML markup:

In [None]:
print(soup.get_text())

This looks a little messy at the top of the page, but if you scroll down, you'll see that a good deal of the text is captured quite cleanly.

Similarly, you can isolate the `title` element. By using the `.title` attribute by itself, you'll retain the HTML tags for this element.

In [None]:
soup.title

Or, if you prefer, you can remove those tags by isolating the string itself.

In [None]:
print(soup.title.string)
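
One detail worth knowing about `.string`: it only returns text when a tag has a single child. The sketch below uses a made-up HTML snippet and Python's built-in `html.parser` (an assumption made so the example runs without `html5lib`) to compare `.string` with `.get_text()`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; "html.parser" ships with Python
html = (
    "<html><head><title>Lists of legendary creatures</title></head>"
    "<body><p>Baba <b>Yaga</b> (Slavic)</p></body></html>"
)
demo = BeautifulSoup(html, "html.parser")

print(demo.title)          # the whole element, tags included
print(demo.title.string)   # just the text inside it

# <p> holds text plus a <b> tag, so .string has no single string to return
print(demo.p.string)       # None
print(demo.p.get_text())   # get_text() joins all the text inside the tag
```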

Here, we get the first paragraph element, using `.p`:

In [None]:
soup.p

Feel free to modify the cell above to locate other common HTML elements, like `.head`, `.body`, `.div` and `.a`.

Often, you'll want to search for a specific occurrence of a tag using attributes such as a class name or id. You can use the `.find()` method to specify these attributes. Let's grab the Table of Contents from our soup. In a browser window, open up [the Wikipedia page](https://en.wikipedia.org/wiki/Lists_of_legendary_creatures). Right click on the table of contents and choose "Inspect" (in Chrome) or "Inspect Element" (Firefox). This opens the developer tools at the point in the DOM, or "Document Object Model", where you've clicked, and shows the DOM's current state. You should see that there is a `<div>` with an `id` of `"toc"`. Let's select that with Beautiful Soup:

In [None]:
tag = soup.find("div", id="toc")
print(tag)

It still looks a bit complicated, but we've cleanly isolated the Table of Contents for this particular page.

Now that we've isolated this one `div` tag, we can see all sorts of information about it, including its type, name, attributes, and id.

In [None]:
print(type(tag))
print(tag.name)
print(tag.attrs) 
print(tag['id'])

This shows that Beautiful Soup tags are their own Python class, with names and attributes. Those attributes can be accessed with square brackets, just as you would look up a key in a dictionary. We could have just used `soup.find(id="toc")` because HTML IDs should be unique, but sometimes you'll want to combine selector attributes as we did here.
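
Here is a minimal sketch of that dictionary-style access on a made-up `<div>`, parsed with the built-in `html.parser` so it runs offline. The attribute values are invented for the example; the calls themselves (`find`, square brackets, `.get()`, `class_`) are standard Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = '<div id="toc" class="toc box" role="navigation"><a href="#A">A</a></div>'
demo = BeautifulSoup(html, "html.parser")
tag = demo.find("div", id="toc")

print(tag["id"])         # square brackets, like a dictionary key
print(tag.get("lang"))   # .get() returns None when the attribute is missing
print(tag["class"])      # class can hold several values, so it comes back as a list

# class_ (with an underscore) avoids clashing with Python's class keyword
print(demo.find(class_="box") is tag)
```

Note that `class` comes back as a list even when the tag has only one class, which matters whenever you compare it against a string.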

## 3. Compiling the Alphabetical Lists of Legendary Creatures

As we've seen, in order to compile a master list of legendary creatures, we'll have to scrape each of the alphabetical lists sequentially: all the creatures that start with "A" (like "Anubis"), all that start with "B" (like "Basilisk"), etc. To do that, we'll need the corresponding URL for each of the alphabetical lists.

Let's start by trying to get every instance of the `a` or `anchor` tag, which marks each link on our page. To do this, we'll use the `.find_all()` method.

In [None]:
all_links = soup.find_all('a')
print(len(all_links))
all_links

Whoa, that's way too many links! <b>(You'll probably see a different number from what Cole sees in the video.)</b> We need to find just the hyperlinks for the alphabetical lists. Go ahead and try finding one of the links in the developer's console and seeing if there are any special attributes that might allow us to easily select it.

To make it easy for you, we've included this little snippet of our prettified HTML, so you can see the beginning of the list of links that concerns us.

![Prettified HTML](alphabetical_lists_-_legendary_creatures.png "Alphabetical Lists - Legendary Creatures")

Unfortunately, it looks like there's no class for the links we want, nor is there an id or class on the containing list element (`<ul>`). There is an ID in the span of the header above the list element we want (`Alphabetical_lists`). We'll need to do some slightly finicky traversing of the tree to get the data: first, find the span; then go from the span to the header (its `parent`); get the header's next sibling, which is the list; and finally, get all of the links in that unordered list (the `<a>` tags inside each `<li>` of the `<ul>`). To walk through a tag's children, you can use `.contents`, which returns a list, or `.children`, which returns an iterator.
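
As a quick illustration of the difference between `.contents` and `.children`, here is a sketch on a tiny made-up list, parsed with the built-in `html.parser` so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Made-up list for illustration, written without whitespace between tags
html = "<ul><li>Ba</li><li>Baba Yaga</li><li>Backoo</li></ul>"
demo = BeautifulSoup(html, "html.parser")
ul = demo.ul

print(type(ul.contents))           # a list, so it can be indexed and measured
print(len(ul.contents))
print(ul.contents[0].get_text())

for child in ul.children:          # an iterator, convenient in a loop
    print(child.get_text())
```

On a real page the children usually include whitespace text nodes between the tags, so `find_all("li")` is often the safer way to collect list items.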

Let's find the `span` with the `id` `Alphabetical_lists` first.

In [None]:
tag_header = soup.find(id="Alphabetical_lists")
print(tag_header)

Now, navigating through this soup of HTML can be a little tricky, but Beautiful Soup fortunately has a number of attributes that can be used for just this purpose:

- `parent` - moves from one element in the HTML to the element that contains it
- `children` - does the opposite, moving from an element to those that it contains
- `next_sibling` - moves from one element to the next one on the same level
    
Follow along as Cole works his way through the HTML to find the elements we need.
    

In [None]:
tag_parent = tag_header.parent
print(tag_parent)

In [None]:
tag_nextsib = tag_parent.next_sibling
print(tag_nextsib)

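As Cole points out, `next_sibling` will sometimes give you a whitespace-only text node (such as a line break between two tags) rather than a tag. Instead, use `find_next_sibling()`, which skips over text and proceeds right to the next tag, in this case the unordered list that we want.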

In [None]:
tag_ul = tag_parent.find_next_sibling()
print(tag_ul)
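
The difference between the two is easy to reproduce offline. The sketch below uses a made-up fragment shaped like the header and list on the Wikipedia page (the exact markup is an assumption) and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a header, a line break, then the list we want
html = """<h2><span id="Alphabetical_lists">Alphabetical lists</span></h2>
<ul><li><a href="/wiki/List_of_legendary_creatures_(A)">A</a></li></ul>"""
demo = BeautifulSoup(html, "html.parser")
header = demo.find(id="Alphabetical_lists").parent

print(repr(header.next_sibling))         # the line break between </h2> and <ul>
print(header.find_next_sibling().name)   # skips the text and lands on the next tag
print(header.find_next_sibling("ul").a.get("href"))
```

Passing a tag name, as in `find_next_sibling("ul")`, makes the intent explicit and guards against some other tag appearing between the header and the list.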

There's one last step. Now that we have this list, we have to be able to isolate each subdirectory, first by singling out one HTML link, and then using `.get("href")` to extract the subdirectory URL. It works like this:

In [None]:
link1 = tag_ul.find("a")

In [None]:
link1.get("href")

Now that you've isolated both our unordered list and the subdirectory URL within each link, you should be able to use what you know to produce a list of each of the complete alphabetical URLs.

On your own, take a stab at writing a loop that 1) singles out the list element for each letter, 2) identifies each URL subdirectory, 3) appends that subdirectory to the main Wikipedia URL, which is `"https://en.wikipedia.org"`, and 4) finally, makes a list of each of these complete URLs.

Here, we've given you a head start with some hints:

In [None]:
hrefs = []

for link in tag_ul.find_all('a'):
    url = "https://en.wikipedia.org" + link.get('href')
    hrefs.append(url)
print(hrefs)

Pause the video here and complete the loop!
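
String concatenation works here because every `href` in this list starts with `/wiki/`. As a hedged alternative, the standard library's `urljoin` copes with other kinds of links as well; it is the same function the notebook later reaches through `requests.compat.urljoin`. The paths below are sample values, not scraped output.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org"

# Sample hrefs of the kind found in the alphabetical list
paths = [
    "/wiki/List_of_legendary_creatures_(A)",
    "/wiki/List_of_legendary_creatures_(B)",
]
urls = [urljoin(base, path) for path in paths]
print(urls)

# urljoin also handles protocol-relative and already-complete URLs
print(urljoin(base, "//upload.wikimedia.org/example.jpg"))
print(urljoin(base, "https://example.org/page"))
```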

## 4. Scraping List Pages and Creating a Dictionary of Creatures

At this point, you've learned how to scrape a webpage and parse its HTML to find the elements that you need. That was our main objective. For the sake of completeness, we'll show you how Cole went on to scrape a much more complete database, with all sorts of information about these legendary creatures. Much of this is pretty finicky, so we're going to move much more quickly, and we won't be walking through each step in the process.

First, we're going to compile all the creatures into one source. Let's make a dictionary for that, using creature names as keys. Then we'll want to create a function that we can reuse to scrape each of the list pages for the creatures. There are several pages that have some irregular items without links or the dashes we're using to separate the creature names from their short descriptions.

Here's the dictionary:

In [None]:
all_creatures = {}

And here is the function Cole wrote to populate his dictionary. If this looks complicated, much of the code here is involved in cleaning up the names and short descriptions for each creature.

In [None]:
def scrape_list_page(url):
    """
    Scrapes a list of creature pages.
    Returns a dictionary of creatures with names, titles (of links), links, short descriptions, and cultures.
    """
    
    creatures = {}
    R = requests.get(url)
    soup = bs(R.content, 'html5lib')
    list_items = soup.find("div", class_="mw-parser-output").find_all("ul")[1].find_all("li")
    for li in list_items:
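        if len(li.find_all("a")) < 1: # There are a couple items without links
            split_text = li.get_text().split("-", 1) # [name, description] when a dash is present
            name = split_text[0].strip()
            # Some items have no dash at all, so fall back to an empty description
            desc = split_text[1].strip() if len(split_text) > 1 else ""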
            creatures[name] = {
                "name": name,
                "short_description": desc
            }
        else:
            name = li.a.contents[0]
            creatures[name] = {}
            creatures[name]["name"] = name
            creatures[name]["title"] = li.a['title']
            creatures[name]["link"] = requests.compat.urljoin("https://en.wikipedia.org/wiki", li.a['href'])
            if "-" in li.get_text(): # There are a couple items where the dash isn't surrounded by spaces
                creatures[name]["short_description"] = li.get_text().split("-", 1)[1].strip()
            if(len(li.find_all("a")) > 1): # The second link, when present, points to the culture
                creatures[name]["culture"] = li.find_all("a")[1].contents[0]
                creatures[name]["culture_title"] = li.find_all("a")[1]['title']
                creatures[name]["culture_href"] = requests.compat.urljoin("https://en.wikipedia.org/wiki", li.find_all("a")[1]['href'])
    return creatures
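
The fiddliest part of that function is separating a creature's name from its short description. Here is a minimal sketch of that step in isolation. The helper name `split_name_and_description`, its `separators` parameter, and the sample strings are all hypothetical; it prefers a dash surrounded by spaces and falls back to a bare dash.

```python
def split_name_and_description(text, separators=(" - ", "-")):
    """Hypothetical helper: split a list item's text into (name, description)."""
    for sep in separators:
        name, found, desc = text.partition(sep)
        if found:  # partition returns an empty middle part when sep is absent
            return name.strip(), desc.strip()
    # No separator at all: treat the whole text as the name
    return text.strip(), ""


# Sample strings written for this example, not scraped data
print(split_name_and_description("Sample One (Culture) - a short description"))
print(split_name_and_description("Hyphen-name (Culture) - description"))
print(split_name_and_description("Tight-description"))
print(split_name_and_description("  Lonely  "))
```

Trying the spaced separator first keeps hyphenated names intact whenever a proper separator is present. A name containing a hyphen with no spaced separator after it would still be split in the wrong place, so this remains a heuristic.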

Let's test this on one page of our `hrefs` list to make sure it works. If that looks good, then we can run it on all the subpages.

In [None]:
print(hrefs[0])
all_creatures.update(scrape_list_page(hrefs[0]))
all_creatures

Beautiful. Now that we have that information, let's run a loop that collects data from each of the alphabetical URLs we collected.

In [None]:
for creature_list_url in hrefs:
    creatures = scrape_list_page(creature_list_url)
    all_creatures.update(creatures)
print(len(all_creatures.keys()))

Now that we have our complete list of creatures, it's a simple matter to print the information for each of them.

In [None]:
all_creatures["Vampire"]

### Exporting Our Creatures

Great! That's a lot of data. Let's export it all to a CSV in case we want to import it again later. We could use the `csv` library, but Pandas has a quick handy function for this as well.

In [None]:
import pandas as pd
import csv
creatures_list = []

In [None]:
for key, creature in all_creatures.items():
    creatures_list.append(creature)
df = pd.DataFrame(creatures_list)
df.to_csv("all_creatures.csv", header=True, index=False)

We've made our CSV and the job is done! Find the CSV in your section 2 directory and check it out!
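
For comparison, here is a rough sketch of the same export using only the `csv` library. The two sample records are made up, and an in-memory buffer stands in for a real file. Because creatures do not all have the same keys, the field names are collected first and missing values are written as empty cells.

```python
import csv
import io

# Made-up records with different sets of keys
sample_creatures = {
    "Ba": {"name": "Ba", "culture": "Egyptian"},
    "Backoo": {"name": "Backoo", "short_description": "sample description"},
}

# Collect every key that appears, keeping first-seen order
fieldnames = []
for creature in sample_creatures.values():
    for key in creature:
        if key not in fieldnames:
            fieldnames.append(key)

buffer = io.StringIO()  # stands in for open("some_file.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(sample_creatures.values())
print(buffer.getvalue())
```

Pandas does the equivalent bookkeeping automatically when it builds a DataFrame from a list of dictionaries, which is why the notebook uses it.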

### Adding More Data
There's also some data we'd like to get from individual creature pages, like a more complete description, image links, and "See Also" links to other related pages. We'll need to be fairly flexible about the function we write to handle this scraping, because individual creature pages will be more heterogeneous than the list pages.

In [None]:
def skip_before_this(tag):
    """Find the first tag we want to stop searching at"""

    if tag.name == "h2":
        return True
    elif "toc" in tag.attrs.get("class", []):  # class is a list of values, not a string
        return True
    elif tag.attrs.get("id") == "See_also":
        return True
    else:
        return False
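
Two Beautiful Soup behaviours matter for this approach: `find()` accepts a function that is called with each tag, and the `class` attribute is a list of values rather than a single string. The sketch below shows both on a made-up page fragment; the `is_toc` function is hypothetical and the built-in `html.parser` is used so it runs offline.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two intro paragraphs, a table of contents, then body text
html = """<div><p>First intro paragraph.</p><p>Second intro paragraph.</p>
<div class="toc box">Contents</div>
<p>Body paragraph.</p></div>"""
demo = BeautifulSoup(html, "html.parser")


def is_toc(tag):
    # class is multi-valued, so test for membership in the list
    return "toc" in tag.get("class", [])


stop_at = demo.find(is_toc)
print(stop_at.get("class"))            # a list of class names
print(stop_at.get("class") == "toc")   # False: a list never equals a string

# find_all_previous walks backwards, so the nearest paragraph comes first
intro = [p.get_text() for p in stop_at.find_all_previous("p")]
print(intro)
intro.reverse()                        # restore reading order
print(intro)
```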

In [None]:
def scrape_creature(url, name="", download=False):
    """Scrape an individual creature page"""
    
    R = requests.get(url)
    soup = bs(R.content, 'html5lib')
    creature = {}
    
    # Get paragraphs from start to stop_at: TOC / References / See Also
    stop_at = soup.find(skip_before_this)
    paragraphs = stop_at.find_all_previous("p")
    desc_list = []
    for p in paragraphs:
        desc_list.append(re.sub(r'\[[0-9]+\]','', p.get_text().strip())) # strip citation markers like [1] or [12]
    desc_list.reverse()
    desc = " ".join(desc_list)
    desc = desc.replace("\r"," ")
    desc = desc.replace("\n"," ")
    creature["long_description"] = desc
    
    # Get See Also if exists - UL after id="See_also"
    if(soup.find(id="See_also")):
        try:
            see_also_bs = soup.find(id="See_also").parent.find_next_sibling("ul")
            if see_also_bs is not None:
                lis = see_also_bs.find_all("li")
                see_also = []
                if(len(lis) > 0):
                    for li in lis:
                        see_also.append(li)
                    creature["see_also"] = see_also
        except KeyError as e:
            print(f"KeyError: {e}")
            print(soup.find(id="See_also").parent.find_next_sibling("ul"))
            
    # Get pictures if they exist
    image_urls = []
    images = soup.find_all("img", class_="thumbimage")
    for img in images:
        clean = "https://" + img['src'].lstrip("/") # src starts with //, so drop only the leading slashes
        image_urls.append(clean)
    if len(image_urls) > 0:
        creature["images"] = image_urls
        
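    if(download):
        import os # local import so this cell does not depend on an earlier one
        part = url.split("/")[-1]
        os.makedirs("creature_pages/creatures", exist_ok=True) # create the folder if it is missing
        filename = f"creature_pages/creatures/{part}.html"
        all_creatures[name]['localfile'] = filename
        with open(filename, mode="wb") as file:
            file.write(R.content)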
        
    return creature
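
The description clean-up relies on a small regular expression, so it is worth trying on its own. The sample sentence below is made up. The sketch compares a pattern that matches a single digit inside the brackets with one that uses `+` to match one or more digits.

```python
import re

# Made-up sentence with citation markers of different lengths
sample = "An angel[1] is a supernatural being.[12][a]\nSecond line."

single_digit = re.sub(r"\[[0-9]\]", "", sample)    # only removes markers like [1]
any_digits = re.sub(r"\[[0-9]+\]", "", sample)     # also removes [12], [103], ...

# Replace line breaks with spaces, as the scraping function does
cleaned = any_digits.replace("\r", " ").replace("\n", " ")

print(single_digit)
print(cleaned)
```

Neither pattern touches lettered notes such as `[a]`; whether to strip those too is a judgment call that depends on the pages being scraped.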

Let's test it on a single creature:

In [None]:
scrape_creature(all_creatures["Angel"]["link"])

In [None]:
# Download all lists
for url in hrefs:
    part = url.split("/")[-1]
    filename = f"{part}.html"
    R = requests.get(url)
    with open(filename, mode="wb") as file:
        file.write(R.content)

### Scrape All the Creatures
And now let's run it on the full list. Be warned, this cell could take a while, probably about half an hour!

In [None]:
for key, creature in all_creatures.items():
    if creature.get("long_description") is None and creature.get("images") is None:
        try:
            creature_data = scrape_creature(creature['link'], name=creature['name'], download=True)
            all_creatures[creature['name']].update(creature_data)
        except KeyError as e:
            print(f"KeyError: {e}")
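
Because a run this long can be interrupted, it may help to pause between requests and to skip creatures that already have data, so the cell can be rerun safely. The following is only a sketch under those assumptions: `scrape_all`, its `delay` parameter, and the stand-in `fake_scrape` function are hypothetical, and in the notebook you would pass a real scraping function instead.

```python
import time


def scrape_all(creatures, scrape, delay=1.0):
    """Hypothetical wrapper: call scrape(url) for each unscraped creature, pausing between calls."""
    failures = []
    for name, creature in creatures.items():
        if "link" not in creature or "long_description" in creature:
            continue  # nothing to fetch, or already done on an earlier run
        try:
            creature.update(scrape(creature["link"]))
        except Exception as error:  # keep going, but remember what failed
            failures.append((name, repr(error)))
        time.sleep(delay)  # be polite to the server
    return failures


# Stand-in data and scraper so the sketch runs without a network connection
demo_creatures = {
    "A": {"name": "A", "link": "https://example.org/A"},
    "B": {"name": "B"},
    "C": {"name": "C", "link": "https://example.org/C", "long_description": "done"},
}
calls = []


def fake_scrape(url):
    calls.append(url)
    return {"long_description": "sample text"}


failures = scrape_all(demo_creatures, fake_scrape, delay=0)
print(calls)
print(failures)
```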

In [None]:
print(all_creatures)

### Exporting Our Data
Great! Now that each creature has a long description and image links, let's export the updated data to a new CSV so we can reload it later without scraping everything again. As before, we could use the `csv` library, but Pandas has a quick handy function for this as well.

In [None]:
import pandas as pd
creatures_list = []
for key, creature in all_creatures.items():
    creatures_list.append(creature)
df = pd.DataFrame(creatures_list)
df.to_csv("all_creatures_updated.csv", index=False, header=True)

If we want to reimport the data from a CSV, we need to read it and assign it to `all_creatures`. In Python 3.6 and 3.7, `csv.DictReader` yields `OrderedDict` rows (newer versions yield regular dictionaries), so we'll cast each row to a regular dictionary either way. We'll also make sure that each creature's `images` are an actual list, not just a string representation of a list, by using the `ast` library to evaluate the list correctly.

In [None]:
# Reimport from CSV - only run if you want to (re)load the data
import csv, ast
all_creatures_new = {}
with open("all_creatures_updated.csv", newline="", encoding="utf-8") as csv_file:
    for row in csv.DictReader(csv_file):
        name = row['name']
        all_creatures_new[name] = dict(row) # make sure each row is a regular dictionary
        if row['images']:
            # Turn the string representation of the list back into a real list
            all_creatures_new[name]['images'] = ast.literal_eval(row['images'])
all_creatures = all_creatures_new # replace the in-memory data with the reloaded copy
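
To see why `ast.literal_eval` is needed, here is a self-contained round trip through an in-memory CSV. The records and URLs are made up. A list written to CSV comes back as a string, and `literal_eval` safely turns that string back into a list.

```python
import ast
import csv
import io

# Write two made-up records; the list is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "images"])
writer.writeheader()
writer.writerow({
    "name": "Basilisk",
    "images": str(["https://example.org/a.jpg", "https://example.org/b.jpg"]),
})
writer.writerow({"name": "Backoo", "images": ""})
buffer.seek(0)

reloaded = {}
for row in csv.DictReader(buffer):
    record = dict(row)
    if record["images"]:  # empty cells stay as empty strings
        record["images"] = ast.literal_eval(record["images"])
    reloaded[record["name"]] = record

print(type(reloaded["Basilisk"]["images"]))
print(reloaded["Basilisk"]["images"])
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it will not execute arbitrary code that happens to be sitting in a CSV cell.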

Now what can we do with this data? We can display any creature quite easily:

In [None]:
from IPython.display import display, HTML

In [None]:
def display_creature(name):
    """ Display a creature from all_creatures, given an exact matching name """
    try:
        creature = all_creatures[name]
        htmlOutput = '<style> .boxes { width: 25%; float: left } </style>'
        htmlOutput += f"<h2>{name}</h2><div><ul>"
        for key, value in creature.items():
            if key != "images" and key != "see_also":
                htmlOutput += f"<li>{key}: {value}</li>"
        htmlOutput += "</ul>"
        images = creature.get("images")
        if images is not None:
            for imageurl in images:
                htmlOutput += f"<div class='boxes'><img src='{imageurl}' style='max-height:200px; max-width:200px;'></div>"
        htmlOutput += "</div>"
        display(HTML(htmlOutput))
    except KeyError as e:
        print("No legendary creature by that name.")

In [None]:
display_creature("Basilisk")

We can also start searching our data for creatures from specific cultures, perhaps with a keyword as well, and then displaying all our results: