# Web Scraping with Beautiful Soup

## Overview

### What You'll Learn
In this section, you'll learn:
1. How to scrape data from web pages like Wikipedia
2. How to work with the data you scrape

### Prerequisites
Before starting this section, you should have an understanding of:
1. The `requests` library

### Introduction
Web scraping allows you to parse through and work with data you find on public websites. We'll work through how to scrape data from Wikipedia, and you'll build a command-line tool that allows you to scrape Wikipedia like so:

```
> python3 demo.py
Harvey G. Stenger

7th president of Binghamton University
Incumbent
Assumed office 2012
Preceded by C. Peter Magrath


Personal details
Alma mater Cornell University (B.S. 1979)Massachusetts Institute of Technology (Ph.D. 1983)
Profession Educator, academic administrator
```

***

## How Web Scraping Works
Web scraping lets you get data from a website that doesn't offer an API. It involves systematically searching a website's HTML for the data you're interested in.

`BeautifulSoup` is a Python library that makes it easy to parse HTML documents in search of data. In this section, we'll use Wikipedia as an example and scrape information from the infoboxes you usually see on the right-hand side of an article.

**Harvey Stenger's Infobox:**

![Harvey Stenger's Infobox](img/harveybox.png)


### HTML Structure
In order to understand how to scrape the web, it's important to understand the structure of HTML documents. When you view a webpage, your browser sends a request to the website and receives an HTML file in return. Your browser then interprets the HTML and displays it in a human-readable form.

Every piece of information in an HTML document resides inside of an *element*. These elements come as `<div>`s, `<p>`s, `<tr>`s, and many more, and they're all nested within each other inside of HTML documents. The arrangement of these elements, together with their individual properties, tells the browser how to display them to the user.

Ultimately, *we're not interested in the way the webpage looks* -- we're interested in the text data within each element, because that data is what we're trying to scrape.

Here's some example HTML for a simple website (about web scraping):
```html
<html>
    <head>
        <title>Web Scraping</title>
    </head>
    <body>
        <h1>Web Scraping in Python</h1>
        
        <div id="requirements">
            <h2>Requirements:</h2>
            <p>Python 3</p>
            <p>requests library</p>
            <p>BeautifulSoup library</p>
        </div>
        
        <div id="how-to-scrape">
            <h2>How to Scrape:</h2>
            <p>Load the HTML with requests</p>
            <p>Pass the text of the request into BeautifulSoup</p>
            <p>Use .find() or .find_all() functions to search for elements</p>
        </div>
        
        <h4>Copyright HackBU 2019</h4>
        
    </body>
</html>
```

[Here's a link](http://htmlpreview.github.io/?https://github.com/HackBinghamton/Webscraping-APIsWorkshop/blob/master/web-scraping-with-beautifulsoup/example_webpage.html) to this page so that your browser can render it for you.

### Finding What to Scrape
Looking at the example HTML and the corresponding screenshot, it's fairly easy to pick out where in the HTML each piece of text comes from. *This isn't the case most of the time.*

Usually, larger websites will have hundreds of elements on a single page, where most of the elements don't actually do anything except hold other elements. They'll often have horrible, unintuitive names, and reading the raw HTML won't get you very far.

Enter *Inspect Element*. This tool lets you look through every element on a webpage. To access it, right-click on any part of a webpage and select *Inspect Element* (some browsers label it *Inspect*).

**Let's say we want to scrape the name of every requirement from this webpage.**

By hovering over different elements in the *Inspector* pane, we can highlight the sections of the webpage they relate to. In the case of this website, we can see that the requirements we're trying to scrape are all found in the `div` element with `id="requirements"`.

![Example Page with Inspector](img/inspector.png)

Knowing that the requirements live in the `<div>` element with `id="requirements"` gives us almost everything we need to start scraping. We still need a little more information on how to filter down to *just* the requirement elements themselves.

Looking back at the HTML, we can see that the *Requirements* `<div>` holds an `<h2>` header element as well as three `<p>` elements. These `<p>` elements contain the data we're trying to scrape.

Now, we're ready to scrape.

## How to Scrape
With `BeautifulSoup`, we can scrape the webpage above and pick out particular pieces of information from it.

There are two main steps to scraping. First, we *load* the HTML into a `BeautifulSoup` parser. Then, we *search* the HTML with the parser to find where our data is.

### 1. Loading the HTML
For the purposes of this demo, we've already stored the HTML as a string in a variable (`html_text`). Usually, you'd use a `requests` call to get the HTTP response from your target website, then read the response's `.text` attribute to get the HTML as a string.

Then, we'll create a `BeautifulSoup` parser object to scrape with.

```python
# Import the BeautifulSoup parser
from bs4 import BeautifulSoup

# Load your HTML (in this case, copy-pasted, but usually grabbed from a request)
html_text = """<html>
    <head>
        <title>Web Scraping</title>
    </head>
    <body>
        <h1>Web Scraping in Python</h1>
        
        <div id="requirements">
            <h2>Requirements:</h2>
            <p>Python 3</p>
            <p>requests library</p>
            <p>BeautifulSoup library</p>
        </div>
        
        <div id="how-to-scrape">
            <h2>How to Scrape:</h2>
            <p>Load the HTML with requests</p>
            <p>Pass the text of the request into BeautifulSoup</p>
            <p>Use .find() or .find_all() functions to search for elements</p>
        </div>
        
        <h4>Copyright HackBU 2019</h4>
        
    </body>
</html>"""

# Create and load a parser with the HTML
parser = BeautifulSoup(html_text, "html.parser")
```
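When the HTML comes from a live website instead of a pasted string, the loading step might look like the sketch below. It assumes the `requests` library is installed. The helper name `load_parser` and the timeout value are illustrative choices, not part of the workshop code.

```python
import requests
from bs4 import BeautifulSoup


def load_parser(url):
    """Fetch a page and return a BeautifulSoup parser for its HTML.

    Hypothetical helper for illustration only.
    """
    # The 10-second timeout is an arbitrary choice for this sketch
    response = requests.get(url, timeout=10)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # response.text holds the body of the response (the HTML) as a string
    return BeautifulSoup(response.text, "html.parser")
```

Calling `load_parser(url)` with the URL of any public page returns a parser that can be searched in the same way as the one built from `html_text`.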


### 2. Scraping the HTML
Given that the information we're trying to scrape is in the `<div>` with `id="requirements"`, we can direct our parser to find that area of the webpage.

The *`.find()`* method allows you to search a document for elements with specific properties, and returns the first matching result. If we wanted to search a parser for the first `<h1>` tag on a website, we could run the following command:

```python
h1 = parser.find("h1")
```

To display what the parser found, we can print the returned element `h1`. *Notice that printing the element alone will include the HTML tags around it.* To exclude the tags and print only the data within the tag, we can use the `.text` attribute of `h1`.

```python
# Printing the returned element alone will print the tags as well
print(h1)

# By printing the .text attribute of the element, we can extract the data inside of the tags
print(h1.text)
```
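One detail worth knowing before relying on `.text`: when nothing matches, `.find()` returns `None`, and reading `.text` from `None` raises an `AttributeError`. A minimal sketch of a guard, using a shortened copy of the example page (the fallback string is just an illustrative choice):

```python
from bs4 import BeautifulSoup

html_text = "<html><body><h1>Web Scraping in Python</h1></body></html>"
parser = BeautifulSoup(html_text, "html.parser")

# There is an <h1> on this page, so .find() returns that element
h1 = parser.find("h1")
print(h1.text)

# There is no <h3> on this page, so .find() returns None
h3 = parser.find("h3")
print(h3)

# Check for None before reading .text to avoid an AttributeError
h3_text = h3.text if h3 is not None else "(no <h3> found)"
print(h3_text)
```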

Let's say we want to be more specific. Oftentimes, elements will have certain attributes assigned to them, such as `id`, `class`, or `title`. We can search for elements that have these properties set to certain values. Let's say we wanted to find an element on a webpage such as `<p class="data">`. We could use the following code to find the first instance of it.

```python
p = parser.find("p", {"class": "data"})
```
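The same search can also be written with the `class_` keyword argument that `BeautifulSoup` provides (the trailing underscore is there because `class` is a reserved word in Python). The sketch below runs both forms against a few made-up paragraphs to show that they find the same element:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html_text = """
<p>Intro paragraph</p>
<p class="data">First data paragraph</p>
<p class="data">Second data paragraph</p>
"""
parser = BeautifulSoup(html_text, "html.parser")

# Attribute dictionary form
p_dict = parser.find("p", {"class": "data"})

# Keyword form
p_keyword = parser.find("p", class_="data")

# Both searches skip the intro paragraph and return the first match
print(p_dict.text)
print(p_keyword.text)
```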

In our HTML, we notice that all of the requirements we're trying to scrape are stored inside of a `<div>` with `id="requirements"`. To scrape this one section, we can run the following code:

```python
requirements_div = parser.find("div", {"id": "requirements"})
print("All data:")
print(requirements_div)

print("Just text:")
print(requirements_div.text)
```

Okay, that's progress! Now, we want to just grab the `<p>` elements from this `<div>`, since those contain the requirements we're talking about.

Thankfully, the `.find_all()` method of `BeautifulSoup` parsers returns a list of every match found in a document! For example, to find every instance of `<div>` inside of the document, we could run this code:

```python
divs = parser.find_all("div")
```

So, given that we scraped up the `requirements` div from earlier, we can now scrape that div for each `<p>` element inside of it!

```python
requirements = requirements_div.find_all("p")

print(requirements)

# To access and print each element, use a for loop!
for requirement in requirements:
    print(requirement.text)
```
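If you'd rather keep the scraped text in a list than print it, a list comprehension over the `.find_all()` results does the job. The sketch below uses a shortened copy of the example page. It also shows a CSS-selector alternative using `.select()`, which is a swapped-in technique not otherwise used in this section:

```python
from bs4 import BeautifulSoup

# Shortened copy of the example page
html_text = """
<div id="requirements">
    <h2>Requirements:</h2>
    <p>Python 3</p>
    <p>requests library</p>
    <p>BeautifulSoup library</p>
</div>
<div id="how-to-scrape">
    <p>Load the HTML with requests</p>
</div>
"""
parser = BeautifulSoup(html_text, "html.parser")

# Narrow down to the div first, then collect the text of each <p> inside it
requirements_div = parser.find("div", {"id": "requirements"})
requirement_names = [p.text for p in requirements_div.find_all("p")]
print(requirement_names)

# Alternative: one CSS selector meaning "every <p> inside the element with id requirements"
selected_names = [p.text for p in parser.select("#requirements p")]
print(selected_names)
```

Both lists hold only the three requirement names; the `<p>` in the other `<div>` is left out because each search is limited to the `requirements` div.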

We've now scraped up the data we were after!

***

## Project: Making a Wikipedia Scraper
Wikipedia often has boxes on the right-hand side that contain general info about the subject of the page. These boxes are called 'infoboxes'. Here's Harvey Stenger's:

![Harvey Stenger's Infobox](img/harveybox.png)

Let's make a tool to scrape these infoboxes and print their contents like so:
```
Harvey G. Stenger

7th president of Binghamton University
Incumbent
Assumed office 2012
Preceded by C. Peter Magrath


Personal details
Alma mater Cornell University (B.S. 1979)Massachusetts Institute of Technology (Ph.D. 1983)
Profession Educator, academic administrator
```

### 1. Loading the HTML
Before we can scrape, we must get the HTML for a Wikipedia page of our choice.

Use a `requests.get()` call to fetch the Wikipedia page of your choice:

```python
### YOUR CODE HERE ###
```

### 2. Creating and Loading the Parser
Before we start searching for the data, let's make sure to create an instance of a parser.

*Remember that the format for instantiating a parser is* `parser = BeautifulSoup(<HTML string>, "html.parser")`

```python
### YOUR CODE HERE ###
```

### 3. Finding Elements with `.find()` and `.find_all()`
Now, we can start to scrape.

Wikipedia infoboxes are consistently found as a `<table>` with `class="infobox"`. *(Sometimes the infobox's `class` string will contain other words, too, but we don't need to worry about them.)*

Use your parser to find the `<table>` with `class="infobox"`:

```python
### YOUR CODE HERE ###
```
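The note about extra words in the `class` string can be checked on a toy example. `BeautifulSoup` treats `class` as a multi-valued attribute, so a search for `infobox` still matches a table whose class is `infobox vcard`. The HTML below is made up for illustration and is not real Wikipedia markup:

```python
from bs4 import BeautifulSoup

# Toy HTML for illustration, not real Wikipedia markup
html_text = """
<table class="sidebar"><tr><td>Not the infobox</td></tr></table>
<table class="infobox vcard"><tr><td>Infobox row</td></tr></table>
"""
parser = BeautifulSoup(html_text, "html.parser")

# Matches the second table even though its class string has an extra word
infobox = parser.find("table", {"class": "infobox"})

# The class attribute comes back as a list of the individual class names
print(infobox["class"])
print(infobox.text)
```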

**IMPORTANT:** Since every Wikipedia infobox has a different structure, we cannot predict what other elements may be inside of it. Thus, the "scraping" part is over, and we just need to print out the text information inside of this infobox.

However, you may notice that printing its text directly doesn't look very good. Here's the Harvey Stenger infobox information printed as-is:

```
Harvey G. Stenger7th president of Binghamton UniversityIncumbentAssumed office 2012Preceded byC. Peter Magrath
Personal detailsAlma materCornell University (B.S. 1979)Massachusetts Institute of Technology (Ph.D. 1983)ProfessionEducator, academic administrator
```

Due to the variable quantity of rows in this `<table>` element, we'll just have to iterate through and see what we can and can't print.
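The run-together text comes from `.text` joining the text of every nested element with nothing in between. One possible way around it, shown here as a sketch on a single made-up row, is the `.get_text()` method, whose `separator` and `strip` parameters control how the pieces are joined:

```python
from bs4 import BeautifulSoup

# A single made-up infobox row for illustration
html_text = (
    '<table class="infobox">'
    "<tr><th>Profession</th><td>Educator, academic administrator</td></tr>"
    "</table>"
)
parser = BeautifulSoup(html_text, "html.parser")
row = parser.find("tr")

# .text joins the header and data cell with nothing in between
joined = row.text
print(joined)

# .get_text() can put a separator between the pieces and strip extra whitespace
spaced = row.get_text(separator=" ", strip=True)
print(spaced)
```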

### 4. Getting Text Information from Elements
To print out the infobox, we'll need to:
1. Grab all items of the infobox that are `<tr>`s
2. Iterate through the `<tr>` elements of the infobox (`for element in <tr element list>`)
3. Print their `.text` attributes (*if they have them*)

**IMPORTANT:** Not every element will have a `.text` attribute. To avoid errors, we'll need to check each item as we iterate through to see if it has the `.text` attribute.

To check if an object has a certain attribute in Python, use the `hasattr()` function.

If we wanted to see if a variable `element` has an attribute `text`, we can use `hasattr(element, "text")`.
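To see `hasattr()` in action before using it on scraped elements, here is a small sketch with plain Python objects. The two classes are hypothetical stand-ins, not `BeautifulSoup` types:

```python
class WithText:
    """Hypothetical stand-in for an element that carries text."""

    def __init__(self, text):
        self.text = text


class WithoutText:
    """Hypothetical stand-in for an object with no text attribute."""


items = [WithText("Alma mater"), WithoutText(), WithText("Profession")]

printable = []
for item in items:
    # Only read .text from objects that actually have it
    if hasattr(item, "text"):
        printable.append(item.text)

print(printable)
```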

```python
### YOUR CODE HERE ###
# Grab each <tr> element in the infobox

# Iterate through each <tr> element

    # Check if it has .text attribute

    # Print it
```

There we go! If all goes well, you'll have a basic Wikipedia scraper. To get your output to match the sample output above, try checking if each `<tr>` element has sub-elements and printing them.
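As a hint for the sub-element idea, here is one possible approach on a simplified, made-up table (not the real Wikipedia markup): collect the `<th>` and `<td>` cells of each row and join their text with a space. Real infoboxes vary, so treat this as a starting point.

```python
from bs4 import BeautifulSoup

# Simplified, made-up infobox for illustration
html_text = """
<table class="infobox">
    <tr><th>Harvey G. Stenger</th></tr>
    <tr><th>Alma mater</th><td>Cornell University</td></tr>
    <tr><th>Profession</th><td>Educator, academic administrator</td></tr>
</table>
"""
parser = BeautifulSoup(html_text, "html.parser")
infobox = parser.find("table", {"class": "infobox"})

lines = []
for row in infobox.find_all("tr"):
    # Passing a list of names finds both header and data cells, in document order
    cells = row.find_all(["th", "td"])
    # Join the text of the cells in this row with a single space
    lines.append(" ".join(cell.get_text(strip=True) for cell in cells))

for line in lines:
    print(line)
```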

Once you have this working, feel free to improve it! Add the ability for users


