# Web Scraping, HTML and Beautiful Soup


## Objectives
* Describe a typical web scraping data pipeline.
* Explain the basic concepts of HTML.
* Write code to pull elements from a web page using BeautifulSoup.
* Use an existing API to fetch data and parse using BeautifulSoup.

## Resources

* [W3Schools](http://www.w3schools.com/): HTML tags and their attributes.
* [BeautifulSoup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Scrape anonymously with Tor](https://deshmukhsuraj.wordpress.com/2015/03/08/anonymous-web-scraping-using-python-and-tor/)

## HTML Concepts

**H**yper**T**ext **M**arkup **L**anguage

HTML is a *markup language* (think Markdown) that forms the building blocks of all websites. Hypertext is text that includes links to other pages. HTML specifies not just the text of the document and the links but also its organization (into sections, paragraphs, lists, and so on). It can also control the layout of the document (font, color, size, and so on), though that is properly handled with Cascading Style Sheets (CSS).

It consists of opening and closing tags enclosed in angle brackets (like `<html>` and `</html>`) often with more HTML in between.

A minimal HTML document, unfortunately, contains a lot of cruft.  Here's one I got from [https://www.sitepoint.com/a-minimal-html-document/](https://www.sitepoint.com/a-minimal-html-document/).


```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
  <head>
  
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title>title</title>
    <link rel="stylesheet" type="text/css" href="style.css">
    <script type="text/javascript" src="script.js"></script>
  </head>
  <body>
		
  </body>
</html>
```

The key=value pairs inside of a tag are called attributes. The `<link>` and `<script>` tags aren't necessary, but appear in more or less every HTML document.

* The `<link>` tag points to a **stylesheet**, which controls how different parts of the document are rendered in the browser.  This makes things pretty.
* The `<script>` tag points to a **JavaScript** program.  This allows programmers to add *dynamic behaviour* to an HTML document.
* The `<body>` tag contains the guts of your document.
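
To make the idea of tags and attributes concrete, here is a small sketch that parses a trimmed copy of the minimal document with BeautifulSoup and reads a few attributes back out. The variable names are only for illustration, and the built-in `html.parser` backend is an assumption made so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A trimmed-down copy of the minimal document shown above.
minimal_html = """
<html lang="en">
  <head>
    <title>title</title>
    <link rel="stylesheet" type="text/css" href="style.css">
    <script type="text/javascript" src="script.js"></script>
  </head>
  <body></body>
</html>
"""

soup = BeautifulSoup(minimal_html, "html.parser")

# The attributes of a tag can be read like dictionary entries.
link = soup.find("link")
link_href = link["href"]
link_attrs = link.attrs  # every key=value pair on the tag

script_src = soup.find("script")["src"]
page_language = soup.find("html")["lang"]

print(link_href, script_src, page_language)
```

The same dictionary-style access works for any attribute on any tag, which is what the scraping example later in this document relies on.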

### Important Tags

```html
<a href="http://www.w3schools.com">A hyperlink to W3Schools.com!</a>

<h1>This is a header!</h1>

<p>This is a paragraph!</p>

<h2>This is a Subheading!</h2>

<table>
  This is a table!
  <tr>
    <th>The header in the first row.</th>
    <th>Another header in the first row.</th>
  </tr>
  <tr>
    <td>An entry in the second row.</td>
    <td>Another entry in the second row.</td>
  </tr>
</table>

<ul>
  This is an unordered list!
  <li>This is the first thing in the list!</li>
  <li>This is the second thing in the list!</li>
</ul>
<div>Specifies a division of the document, generally with additional attributes specifying layout and behavior.</div>
A <span>span is similar</span> but occurs in the middle of a line.

```

I saved the HTML document above as <a href="basic.html">basic.html</a>.
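
As a rough illustration of why these tags matter for scraping, the sketch below pulls the link target, the table cells, and the list items out of a shortened copy of the markup above. It assumes the built-in `html.parser` backend; the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# A shortened copy of the "important tags" markup above.
html = """
<a href="http://www.w3schools.com">A hyperlink to W3Schools.com!</a>
<h1>This is a header!</h1>
<table>
  <tr><th>The header in the first row.</th><th>Another header in the first row.</th></tr>
  <tr><td>An entry in the second row.</td><td>Another entry in the second row.</td></tr>
</table>
<ul>
  <li>This is the first thing in the list!</li>
  <li>This is the second thing in the list!</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in document order.
link_targets = [a["href"] for a in soup.find_all("a")]
table_headers = [th.get_text(strip=True) for th in soup.find_all("th")]
table_cells = [td.get_text(strip=True) for td in soup.find_all("td")]
list_items = [li.get_text(strip=True) for li in soup.find_all("li")]

print(link_targets)
print(table_headers)
print(table_cells)
print(list_items)
```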

## Web vs Internet

The web (or WWW, the World Wide Web) is different from the internet in a couple of ways.

First, the web is just one part of the internet. The internet includes plenty of pieces unrelated to the web, like email and ssh, although the web has become a dominant piece.

But in a deeper sense, the internet is a set of protocols used for transferring data, together with the infrastructure that runs those protocols. The web is handled by one of those protocols, a high-level protocol called HTTP. HTTP expresses how requests are made by clients and how documents are returned by servers. So the web is really just a set of HTML documents sitting on HTTP servers.

## HTTP Requests

To get data from the web, you need to make an HTTP request.  The two most important request types are:

* GET (retrieves data; any parameters travel in the URL and there is normally no request body)
* POST (submits data to the server; *the data is sent* in the request body)
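
One way to see the difference without touching the network is to build both kinds of request with the `requests` library and inspect them before they are sent. This is only a sketch: `example.com` and the parameter values are placeholders, and `prepare()` builds the request without sending anything.

```python
import requests

# example.com is a placeholder host; nothing is sent over the network here.
get_request = requests.Request(
    "GET", "http://example.com/search", params={"q": "unicorn"}
).prepare()

post_request = requests.Request(
    "POST", "http://example.com/api", data={"action": "parse", "page": "Unicorn"}
).prepare()

# A GET request carries its parameters in the URL and has no body.
print(get_request.method, get_request.url, get_request.body)

# A POST request carries its data in the body, here as a form-encoded string.
print(post_request.method, post_request.url, post_request.body)
print(post_request.headers["Content-Type"])
```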

Usually HTTP requests are sent by browsers (like Chrome or Safari), but `curl` is a command line program for sending HTTP requests.  It's easy to send a `GET` request to a URL.

In [None]:
!curl http://madrury.github.io

`curl` can also send POST requests, but with a bit more effort.

In [None]:
!curl -X POST -H 'User-Agent: DataWrangling/1.1 matthew.drury@galvanize.com' -d 'action=parse' -d 'format=json' -d 'page=Unicorn' https://en.wikipedia.org/w/api.php

We're going to send this POST request in a much better way below, so don't worry about remembering how to do it with curl.

## Scraping

Web Scraping is the process of programmatically getting data from the web.

<img src="images/pipeline.png" width = 500>

### Example: Load table into a data frame.

Let's load the Super Metroid speedrun leaderboards at [Deer Tier](http://deertier.com/Leaderboard/AnyPercentRealTime) into a Mongo database, and then load this database into a pandas data frame.

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
from pymongo import MongoClient
import pprint

import copy
import pandas as pd

# Requests sends HTTP requests and receives the responses.
import requests

# Beautiful Soup parses HTML documents in python.
from bs4 import BeautifulSoup

#### Step 1: Check out the website in a browser.

The first step is to check out the website in a browser.

Open the `Developer Tools` to get a useful display of the hypertext we will be working with.

The table we will need is inside a `<div>` with `class=scoreTable`.  Looking closely the structure is like this:

```
<div class=scoreTable>
  <table>
    <tr>..</tr>
    ...
    <tr>...</tr>
  </table>
</div>
```

Each row has a `title` attribute that contains some interesting data:

```
<tr title="Submitted by Oatsngoats on: 19/10/2016">
```

Inside each row, the columns have the following data:

```
rank, player, time, video url, comment
```

This should be enough information for us to get to scraping.

#### Step 2: Send a GET request for the data.

In [None]:
deer_tier_url = 'http://deertier.com/Leaderboard/AnyPercentRealTime'
r = requests.get(deer_tier_url)

A status code of `200` means that everything went well.

In [None]:
r.status_code
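
Other status codes fall into broad families, and a small helper can make them easier to read. The sketch below uses only the standard library; `describe_status` is a hypothetical helper written for this example, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    # HTTPStatus raises ValueError for codes it does not know about.
    status = HTTPStatus(code)
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return "{} {} ({})".format(code, status.phrase, kind)


for code in (200, 404, 503):
    print(describe_status(code))
```

In the notebook itself, calling `r.raise_for_status()` is a common way to stop early, since it raises an exception for 4xx and 5xx responses.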

We can check out the raw hypertext in the `content` attribute of the response.

In [None]:
r.content

#### Step 3: Save all the hypertext into mongo for later use.

In [None]:
client = MongoClient('localhost', 27017)
db = client.metroid
pages = db.pages

pages.insert_one({'html': r.content})

#### Step 4: Parse the hypertext with BeautifulSoup.

This is the beautiful part of the soup.  Parsing the HTML into a python object is effortless.

In [None]:
soup = BeautifulSoup(r.content, "html.parser")

In [None]:
print(soup)

In [None]:
print(soup.prettify())

In [None]:
print(soup.title)

#### Step 5: Navigate the data to pull out the table information.

Recall the structure of the table we are looking for:

```
<div class=scoreTable>
  <table>
    <tr>..</tr>
    ...
    <tr>...</tr>
  </table>
</div>
```

In [None]:
div = soup.find("div", {"class": "scoreTable"})
table = div.find("table")

# find_all returns a list of every <tr> row in the table.
rows = table.find_all("tr")

all_rows = []

# Let's store each row as a dictionary 
empty_row = {
    "rank": None, "player": None, "time": None, "comment": None
}

# The first row contains header information, so we are skipping it.
for row in rows[1:]:
    new_row = copy.copy(empty_row)
    # A list of all the entries in the row.
    columns = row.find_all("td")
    new_row['rank'] = int(columns[0].text.strip())
    new_row['player'] = columns[1].text.strip()
    new_row['time'] = columns[2].text.strip()
    new_row['comment'] = columns[4].text.strip()
    all_rows.append(new_row)    

In [None]:
pprint.pprint(all_rows[:4])
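
The live site may change or be unavailable, so it can help to try the same navigation on a small stand-in page first. The sketch below mimics the structure described in Step 1 and also reads the `title` attribute of each row, which the loop above does not use. The players, times, and comments are made up, and `parse_leaderboard` and `TITLE_PATTERN` are hypothetical names introduced for this example.

```python
import re

from bs4 import BeautifulSoup

# Stand-in markup that mimics the structure described in Step 1.
# The players, times, and comments are invented for illustration.
sample_html = """
<div class="scoreTable">
  <table>
    <tr><th>Rank</th><th>Player</th><th>Time</th><th>Video</th><th>Comment</th></tr>
    <tr title="Submitted by PlayerOne on: 19/10/2016">
      <td>1</td><td>PlayerOne</td><td>0:41:00</td>
      <td><a href="http://example.com/v1">Link</a></td><td>First run</td>
    </tr>
    <tr title="Submitted by PlayerTwo on: 02/11/2016">
      <td>2</td><td>PlayerTwo</td><td>0:42:30</td>
      <td><a href="http://example.com/v2">Link</a></td><td></td>
    </tr>
  </table>
</div>
"""

# Matches titles of the form "Submitted by <name> on: dd/mm/yyyy".
TITLE_PATTERN = re.compile(
    r"Submitted by (?P<submitter>.+) on: (?P<date>\d{2}/\d{2}/\d{4})"
)


def parse_leaderboard(html):
    """Turn the score table into a list of dictionaries, one per row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", {"class": "scoreTable"}).find("table")
    parsed = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 5:
            continue  # header rows hold <th> cells, not <td> cells
        record = {
            "rank": int(cells[0].get_text(strip=True)),
            "player": cells[1].get_text(strip=True),
            "time": cells[2].get_text(strip=True),
            "comment": cells[4].get_text(strip=True),
        }
        match = TITLE_PATTERN.match(row.get("title", ""))
        if match:
            record["submitted_by"] = match.group("submitter")
            record["submitted_on"] = match.group("date")
        parsed.append(record)
    return parsed


sample_rows = parse_leaderboard(sample_html)
print(sample_rows)
```

Skipping rows by their cell count, instead of slicing off the first row, is a design choice that keeps the function working if the header row is missing or repeated.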

#### Step 6: Load all the rows into a Mongo database.

Since we collected all the rows into python dictionaries, this is easy.

In [None]:
db = client.metroid

In [None]:
deer_tier = db.deer_tier

In [None]:
for row in all_rows:
    deer_tier.insert_one(row)

Now we can check from the command line that the data is really in there!

#### Step 7: Load all the rows into a pandas dataframe.

Even though there is no real reason to, let's load all the rows from the Mongo database just to give a more thorough example of how you can go about things.

In [None]:
rows = deer_tier.find()
super_metroid_times = pd.DataFrame(list(rows))

In [None]:
super_metroid_times.head()

In [None]:
super_metroid_times = super_metroid_times.drop("_id", axis=1)
super_metroid_times = super_metroid_times.set_index("rank")
super_metroid_times.head()
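
Once the rows are in a data frame, the times are still plain strings. A possible next step, sketched below on made-up rows, is to convert them so they can be compared arithmetically. This assumes the times look like `H:MM:SS`; the real leaderboard may use a different format that needs its own parsing.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped dictionaries.
example_rows = [
    {"rank": 1, "player": "PlayerOne", "time": "0:41:00", "comment": "First run"},
    {"rank": 2, "player": "PlayerTwo", "time": "0:42:30", "comment": ""},
]

example_times = pd.DataFrame(example_rows).set_index("rank")

# Assumes H:MM:SS strings; other formats would need different handling.
example_times["time"] = pd.to_timedelta(example_times["time"])

# How far behind first place is each run?
gap_to_first = example_times["time"] - example_times["time"].iloc[0]
print(gap_to_first)
```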

Goal Achieved!

**Large-ish Exercise**: Scrape the leaderboards for [Ocarina of Time](http://zeldaspeedruns.com/leaderboards/oot/any) into a dataframe.

## Example: Use a web API to scrape Wikipedia

Wikipedia provides a free API to programmatically collect data.  This service is *designed* for programmers to interact with.

[Wikipedia API Documentation](https://www.mediawiki.org/wiki/API:Main_page)

A high level summary of the documentation:

> Send a POST request to https://en.wikipedia.org/w/api.php with parameters describing the data you want, and the format in which you want it.

#### Step 1: Get the Data

In [None]:
import json
import re

Wikipedia wants us to identify ourselves before it will give us data.  The `User-Agent` header of an HTTP request contains this information.

In [None]:
headers = {'User-Agent': 'GalvanizeDataWrangling/1.1 matthew.drury@galvanize.com'}

In [None]:
api_url = 'https://en.wikipedia.org/w/api.php'

# Parameters for the API request: We want the Unicorn page encoded as json.
payload = {'action': 'parse', 'format': 'json', 'page': "Unicorn"}

r = requests.post(api_url, data=payload, headers=headers)

In [None]:
print(r.json().keys())

We get a lot of data back!

In [None]:
print(r.json()['parse'])
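
Because the full response is large, it can be easier to see its shape on a trimmed stand-in. The sketch below keeps only the keys that the rest of this example relies on (`title`, `text`, and `links`); the values are placeholders, not real API output.

```python
# A heavily trimmed stand-in for r.json(); the values are placeholders.
example_response = {
    "parse": {
        "title": "Unicorn",
        "text": {"*": "<p>The <b>unicorn</b> is a legendary creature.</p>"},
        "links": [
            {"*": "Horse"},
            {"*": "Legendary creature"},
        ],
    }
}

parsed = example_response["parse"]

article_title = parsed["title"]
# The rendered HTML of the article sits under the "*" key of "text".
article_html = parsed["text"]["*"]
# Each link is a dictionary whose "*" key holds the linked page title.
linked_titles = [link["*"] for link in parsed["links"]]

print(article_title)
print(article_html)
print(linked_titles)
```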

#### Step 2: Store the Data in MongoDB

In [None]:
# import MongoDB modules
from pymongo import MongoClient
#from bson.objectid import ObjectId

# connect to the local MongoDB instance
client = MongoClient('localhost', 27017)
db = client.wikipedia

In [None]:
collection = db.wikipedia

In [None]:
# Look the article up by title so it is not inserted twice.
article = r.json()['parse']
if not collection.find_one({'title': article['title']}):
    collection.insert_one(article)

In [None]:
unicorn_article = collection.find_one({ "title" : "Unicorn"})

In [None]:
pprint.pprint(unicorn_article)

In [None]:
print(unicorn_article.keys())

#### Step 3: Retrieve and store every article (with associated metadata) within one link

We want to follow every link one hop out from the 'Unicorn' article. *Do not follow external links, only linked Wikipedia articles.*

HINT: The Unicorn article should be located at: 
'http://en.wikipedia.org/w/api.php?action=parse&format=json&page=Unicorn'

In [None]:
links = unicorn_article['links']

pprint.pprint(links)

In [None]:
len(links)

Now let's request each of these documents, and store the result in our collection.

In [None]:
for link in links:
    payload = {'action': 'parse', 'format': 'json', 'page': link['*']}
    r = requests.post(api_url, data=payload, headers=headers)

    try:
        article = r.json()['parse']
    except (ValueError, KeyError) as e:
        # The response was not JSON, or the API returned an error instead of a page.
        print("Skipping {}: {!r}".format(link['*'], e))
        continue

    # Check whether the document is already in our database; if not, store it.
    if not collection.find_one({'title': article['title']}):
        print("Writing The Article: {}".format(article['title']))
        collection.insert_one(article)
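
The list of links can include pages outside the main article namespace, such as categories or templates. The sketch below shows one possible way to filter them before making requests. It assumes each link dictionary carries an `ns` namespace number (0 being the main article namespace) and an `exists` key when the target page exists; check this against the links actually returned before relying on it. `article_titles` is a hypothetical helper and the example links are made up.

```python
def article_titles(links):
    """Keep only links that point at existing main-namespace articles.

    Assumes each link is a dict with an "ns" namespace number, an "exists"
    key when the target page exists, and the page title under "*".
    """
    return [
        link["*"]
        for link in links
        if link.get("ns") == 0 and "exists" in link
    ]


# Made-up links in the assumed shape.
example_links = [
    {"ns": 0, "exists": "", "*": "Horse"},
    {"ns": 0, "*": "A page that does not exist"},
    {"ns": 14, "exists": "", "*": "Category:Legendary creatures"},
]

titles_to_fetch = article_titles(example_links)
print(titles_to_fetch)
```

Filtering first keeps the number of requests down, which is also more polite to the API.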

#### Step 4: Find all articles that mention 'Horn' or 'Horned' (case insensitive)

* Use regular expressions in order to search the content of the articles for the terms Horn or Horned. 
* We only want articles that mention these terms in the displayed text however, so we must first remove all the unnecessary HTML tags and only keep what is in between the relevant tags. 
* Beautiful Soup makes this almost trivial. Explore the documentation to find how to do this effortlessly: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

* Test out your Regular Expressions before you run them over every document you have in your database: http://pythex.org/. Here is some useful documentation on regular expressions in Python: https://docs.python.org/3/howto/regex.html

* Once you have identified the relevant articles, save them to a file for now, we do not need to persist them in the database.
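
Before running a pattern over the whole collection, it is worth trying it on a few hand-written snippets. The sketch below is one possible pattern that uses word boundaries (`\b`), so that punctuation next to the word still matches while words like "hornets" do not. `mentions_horn` and the sample snippets are made up for this check.

```python
import re

from bs4 import BeautifulSoup

# Word boundaries let "horn." and "(horned)" match, but not "hornets" or "thorny".
horn_pattern = re.compile(r"\bhorn(ed)?\b", re.IGNORECASE)


def mentions_horn(html):
    """Return True if the displayed text of the HTML mentions horn or horned."""
    text = BeautifulSoup(html, "html.parser").get_text()
    return horn_pattern.search(text) is not None


# Snippet -> whether we expect a match in the displayed text.
examples = {
    "<p>A single <b>horn</b>.</p>": True,
    "<p>The Horned beast</p>": True,
    "<p>Thorny hornets</p>": False,
    # The word appears only inside an attribute, not in the displayed text.
    '<a href="/wiki/Horn">antler</a>': False,
}

results = {html: mentions_horn(html) for html in examples}
for html, found in results.items():
    print(found, html)
```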

In [None]:
# compile our regular expression since we will use it many times
# \b marks a word boundary, so "horn." and "(Horned)" match but "hornets" does not
regex = re.compile(r'\bhorn(ed)?\b', re.IGNORECASE)

with open('wiki_articles.txt', 'w') as out:

    for doc in collection.find():
        
        # Extract the HTML from the document
        html = doc['text']['*']

        # Stringify the ID for serialization to our text file
        doc['_id'] = str(doc['_id'])

        # Create a Beautiful Soup object from the HTML
        soup = BeautifulSoup(html, 'html.parser')

        # Extract all the relevant text of the web page: strips out tags and head/meta content
        text = soup.get_text()

        # Perform a regex search with the expression we compiled earlier
        match = regex.search(text)

        # if our search returned an object (it matched the regex), write the document to our output file
        if match:
            try:
                print("Writing Article: {}".format(doc['title']))
                json.dump(doc, out) 
                out.write('\n')
            except UnicodeEncodeError as e:
                print(e)