# Lab 04: Scraping

This lab walks you through the basics of using the `requests` library and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to communicate with and retrieve data from websites on the internet.

**Instructions:** When you have completed the lab, please submit it via [Tulane Canvas](https://tulane.instructure.com/) before the due date.

In [None]:
# clone the course repository, change to right directory, and import libraries.
# NB: Since this lab involves downloading files from the web, we will mount our Google Drive so that the files will
# persist even if the Colab session ends.
# You will be asked to give permissions to this Colab to write to your Google Drive.
%cd /content
!git clone https://github.com/nmattei/cmps6790.git
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive

In [None]:
# Our current working directory is now the root of your personal Google Drive.
# All downloaded files will be saved here.
# list the current working directory
!pwd
# list files in the current working directory
!ls

## Scraping Data From the Internet

What happens when you type a URL into the address bar? You can follow along with [this in-depth guide](https://github.com/alex/what-happens-when), which takes you through all the layers of the network stack, tracing every bit of data exchanged. However, for our purposes we just need to understand a few basics:
1. When you hit `enter` in your web browser, the browser issues a `GET` request to the web server. If the URL does not name a specific page, the server typically returns a default page, often called `index.html`.
2. The server then sends back either the contents of that page with an HTTP status code of `200`, meaning everything worked, or an error code such as `404`, meaning that you asked for a page it cannot find. You can read more about [HTTP status codes here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
3. Your browser then takes that page, perhaps issuing even more requests to other pages to grab content like rich media, and then puts all this together into an interactive display that you can use on your desktop.

So, at a high level, our computer asks for something, and the web server sends it back to us!
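
To get a feel for the status codes mentioned above, here is a minimal sketch that uses only Python's standard library. The helper `describe_status` and the `FAMILIES` table are names made up for this illustration; they are not part of `requests`.

```python
from http import HTTPStatus

# Broad meaning of each status code family, keyed by the first digit.
FAMILIES = {1: "informational", 2: "success", 3: "redirection",
            4: "client error", 5: "server error"}

def describe_status(code: int) -> str:
    """Return e.g. '404 Not Found (client error)' for a known status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib does not list
    return f"{code} {status.phrase} ({FAMILIES[code // 100]})"

for code in (200, 301, 404):
    print(describe_status(code))
```

The first digit of the code is usually enough to tell whether a request worked (`2xx`) or whether the problem is on our side (`4xx`) or the server's side (`5xx`).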

Throughout this demo we are going to work with the webpage for the [Data Science course at the University of Maryland by Prof. John P. Dickerson](http://www.cs.umd.edu/class/fall2018/cmsc320/). The course here at Tulane owes a big debt to John who helped us come up with much of the content for this course.

To start, let's get the contents of the course page, not rendered in our web browser but as raw text! First we'll use a command line tool called [curl](https://curl.se/) to just grab the contents of the page.

In [None]:
!curl 'http://www.cs.umd.edu/class/fall2018/cmsc320/'

You can see that the contents of the site are the same as if we had gone to the page and clicked the `view source` button that is available in most browsers. That is, the site returned all its information in HTML, the HyperText Markup Language. The format and specification of HTML are maintained by the WHATWG as the [HTML Living Standard](https://html.spec.whatwg.org/multipage/), which is the current version of what is commonly called HTML5.

In short, HTML is a system of tags, each written inside angle brackets. For instance, the `<body>` tag marks the main body of an HTML page. You can read more about the tags at the page above; for our purposes we'll mainly focus on `<table>` tags, because that's where lots of data lives!
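
As a small illustration of the tag structure, the sketch below feeds a made-up HTML snippet to the standard library's `html.parser` and records every opening tag it meets. The `TagCollector` class and `sample_html` string are assumptions for this example only; the rest of the lab uses Beautiful Soup instead.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag such as <body> or <table>.
        self.tags.append(tag)

# A tiny, made-up page: a heading followed by a one-cell table.
sample_html = "<html><body><h1>Title</h1><table><tr><td>1</td></tr></table></body></html>"

collector = TagCollector()
collector.feed(sample_html)
print(collector.tags)
```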

## Using the Requests Library

In Python we are going to generally make use of the [requests library, whose full documentation can be found at this site](https://requests.readthedocs.io/en/latest/).

In general, Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to [urllib3](https://github.com/urllib3/urllib3).
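
To see the query-string handling without touching the network, here is a short sketch that builds a request but never sends it. The URL and parameters are hypothetical; `requests.Request` and its `prepare` method are part of the library's documented interface.

```python
import requests

# Build (but do not send) a GET request. The URL and parameters are made up.
req = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "solar flares", "page": "2"},
)
prepared = req.prepare()

# Requests has encoded the parameters and appended them to the URL for us.
print(prepared.method)
print(prepared.url)
```

In everyday use you would simply pass `params=` to `requests.get`; preparing the request by hand is only a convenient way to inspect what would be sent.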

Requests can do many more things than we will use it for here, but for now let's grab the webpage from John and read it into a variable.

In [None]:
import requests

r = requests.get('http://www.cs.umd.edu/class/fall2018/cmsc320/', timeout=10)
print(r.status_code)
print(r.content[:1000])

## Using Beautiful Soup

So all that HTML is great but, like many things in this class, we don't want to have to read through all those tags ourselves! Imagine having to write your own parser for all that :-(

Luckily there is a great library called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) which can handle many of these things for us. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Let's try this out with John's webpage and navigate through all the parts of the page. The first thing we do is get the contents of the webpage and load it into a Beautiful Soup object, which gives us an easy way to navigate bits of the page. We can then use the `prettify` method to print the parsed page with each tag indented on its own line.

In [None]:
from bs4 import BeautifulSoup

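r = requests.get('http://www.cs.umd.edu/class/fall2018/cmsc320/', timeout=10)
# Name the parser explicitly; otherwise Beautiful Soup picks one and warns about it.
soup = BeautifulSoup(r.content, 'html.parser')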
print(soup.prettify())

More than that, soup gives us an array of commands to allow us to get the title of the page and many other things! Let's try it out.

In [None]:
# Get the webpage title tag
print(soup.title)

# Get just the string
print(soup.title.string)


One common thing we can do is get all the text from a page, without all the markup, using `get_text`! For John's page this includes a lot of whitespace, so we'll use the [Python string method](https://docs.python.org/3.3/library/stdtypes.html?highlight=split) called `split` to split on whitespace and get all the words on the page in a list, one word at a time.

In [None]:
# Only print the first 50 words to make life easier.

print(soup.get_text().split()[:50])
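
To see why `split` with no arguments is enough to clean up the whitespace, here is a tiny sketch on a made-up string (the `text` value is an assumption, standing in for what `get_text` might return):

```python
# A made-up string with leading spaces, tabs, and blank lines.
text = "  CMSC320 \n\n  Introduction to\tData Science  \n"

# With no arguments, split() breaks on any run of whitespace
# and drops the empty strings at the ends.
words = text.split()
print(words)

# Joining the words back with single spaces gives a tidy one-line string.
tidy = " ".join(words)
print(tidy)
```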

Another common thing is to take a webpage and get all the links on that page. A hyperlink uses the `<a href="DEST">Text</a>` tag, where the `href` attribute (short for hypertext reference) holds the destination (everything was really hyper back in the early 90's). Beautiful Soup has a pretty easy way to do this!

In [None]:
# Iterate over all the <a> tags and get the target (href) attribute.
for link in soup.find_all('a'):
    print(link.get('href'))

Finally, perhaps the most important part for us is to get at all the tables on a particular webpage. If you go look at the webpage, you'll see several elements that look like a table. In HTML, a table typically looks like this:

```
<table class="table table-striped">
<caption>Office Hours</caption>
<thead>
<tr>
<th style="width: 30%">Human</th>
<th style="width: 45%">Time</th>
<th style="width: 25%">Location</th>
</tr>
</thead>
<tbody>
<tr>
<td>Eleftheria Briakou</td>
<td>Mondays, 11:45am–1:45pm</td>
<td>AVW 1120</td>
</tr>
...
```

Here the `<table>` tag begins a table, which closes with `</table>`. Likewise, a table can have a `<caption>` as well as a table head marked `<thead>` and a body marked `<tbody>`. Each row of a table starts with `<tr>` and ends with `</tr>`, with header cells wrapped in `<th>` tags and data cells wrapped in `<td>` tags.

So we can use Beautiful Soup to grab the first table using the `find` command. Note that `find` will by default only grab the first table.


In [None]:
# Grab the first table.
soup.find("table")
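
To make the table structure concrete, the following sketch parses a trimmed copy of the sample table above and pulls out the caption, header, and rows by hand. It works on an inline string (`sample_table`, made up for this example from the snippet shown earlier), so it does not depend on the live page.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the office hours table shown above.
sample_table = """
<table class="table table-striped">
<caption>Office Hours</caption>
<thead>
<tr><th>Human</th><th>Time</th><th>Location</th></tr>
</thead>
<tbody>
<tr><td>Eleftheria Briakou</td><td>Mondays, 11:45am–1:45pm</td><td>AVW 1120</td></tr>
</tbody>
</table>
"""

table = BeautifulSoup(sample_table, "html.parser").find("table")

# The caption and header cells live outside the table body.
caption = table.caption.get_text(strip=True)
header = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

# Each <tr> in the body is one row; each <td> is one cell in that row.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find("tbody").find_all("tr")
]

print(caption)
print(header)
print(rows)
```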

Now, here's the cool part. Pandas knows how to read HTML! Specifically it knows how to read HTML tables, using the [read_html()](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) function. There are quite a few gotchas with working with HTML so for any page you want to read you'll have to make sure the table is well formatted, and that the Pandas function is working.

Note that there are more than a few **gotchas** in the example below.
1. `find` returns a Beautiful Soup `Tag` object rather than a string, so we have to use `str` to convert it into HTML text that can be passed into `read_html`.
2. If we look at the documentation for `read_html` we find that it returns a **list of DataFrames**, since there might be more than one table, so here we need to display the first element of that list.

In [None]:
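from io import StringIO
import pandas as pd

# Recent pandas versions expect a file-like object rather than a raw HTML string.
df_t = pd.read_html(StringIO(str(soup.find("table"))))
display(df_t[0])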

In [None]:
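from io import StringIO

df_tables = []
for t in soup.find_all("table"):
    # Wrap the HTML text in StringIO, as recent pandas versions expect.
    df_t = pd.read_html(StringIO(str(t)))
    df_tables.append(df_t[0])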

for t in df_tables:
    display(t)

And now we are close to what we need -- we can read all the tables into DataFrames and then we know how to manipulate these.

One other thing that might be useful is not necessarily loading all the tables in, but only certain ones. If you go look at the page source you'll see that the tables have a `class` attribute, `<table class="table table-striped">`. If there are many different styles of table on a page, we can pass this into Beautiful Soup to find only those tables that have that class, as we do in the next example.

In [None]:
print(soup.find("table", {"class": "table-striped"}))

While this isn't super useful just yet, since the tables on the website aren't individually tagged, we can use the same idea to search for other things. If you look at the layout of the page you'll notice that there are `<div>` tags, which are short for page divisions, and each one of these has an `id` attribute. So, if we wanted to find the table in the assignments section of the page only, we can do the following.

In [None]:
from io import StringIO
df_assignments = pd.read_html(StringIO(str(soup.find("div", {"id": "assignments"}).find("table"))))[0]
display(df_assignments)
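
The class and `id` lookups above can also be tried offline. The sketch below uses a small made-up page (`sample_page`, with hypothetical `staff` and `assignments` sections) to show how each filter narrows the search, and what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# A made-up page with two sections, each holding one table.
sample_page = """
<div id="staff"><table class="table table-striped"><tr><td>staff row</td></tr></table></div>
<div id="assignments"><table class="table"><tr><td>assignment row</td></tr></table></div>
"""
page = BeautifulSoup(sample_page, "html.parser")

# Filtering on one class matches tags that list it among several classes.
striped = page.find_all("table", {"class": "table-striped"})
print(len(striped))

# Chaining find calls searches only inside the matching <div>.
assignments_table = page.find("div", {"id": "assignments"}).find("table")
print(assignments_table.get_text(strip=True))

# find returns None when nothing matches, so check before chaining on real pages.
missing = page.find("div", {"id": "no-such-section"})
print(missing)
```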

## Putting It All Together: Reading all the PDFs on a Webpage

Consider the following, very real, example. Let's say that you waited until the last night before the final and you need to download all the PDFs from the webpage so you can print them off really small for your cheat sheet! However, you've been cramming pizza as part of your study routine so you can't right-click the mouse any more. So we'll need to write a parser to go read the webpage, grab all the links in the lectures table, and only keep the ones that end in `.pdf` ... Easy right?

Let's take this step by step...

First we'll write a command to grab the section of the page we want and extract the links from it.

In [None]:
soup.find("div", {"id": "schedule"}).find_all('a')

Now, we can see that there is a whole mix of different things here, and we for sure don't want them all! In fact, we really only want the ones from the `slides` column of the table. We could go that route, but instead let's just pull all the links that end in `.pdf`.

In [None]:
pdfs = []
links = soup.find("div", {"id": "schedule"}).find_all('a')
for link in links:
    # Just get the reference link; this is None for <a> tags without an href
    href = link.get('href')
    # If it's a PDF, save it...
    if href and href.lower().endswith('.pdf'):
        pdfs.append(href)
print(pdfs)

**Note:** The next cell will download all the files and save them locally, so only run it if you really want them!

So now that we have the list `pdfs`, we just need to walk through it and issue a `get` request for each file. We know this data is a PDF, but our system does not. So the best way to do this is to treat the download as binary data, which gives us the raw bytes, and then write those bytes to a file opened in `'wb'` mode.

In [None]:
from urllib.parse import urljoin
import os

base_url = "http://www.cs.umd.edu/class/fall2018/cmsc320/"

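for href in pdfs:
    urld = urljoin(base_url, href)
    print(urld)
    # A timeout keeps a dead link from hanging the loop
    rd = requests.get(urld, timeout=30)
    if not rd.ok:
        # Don't save an error page as if it were a PDF
        print("Skipping, status code:", rd.status_code)
        continue

    # Write the downloaded PDF to a file
    # Note because the href is a path we have to just get the filename!
    outfile = os.path.join("./", href.split("/")[-1])
    print("Writing: ", outfile)
    with open(outfile, 'wb') as f:
        f.write(rd.content)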
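
The URL handling in the loop above can be checked without downloading anything. This sketch resolves a few hypothetical `href` values (made up to cover relative and absolute links) against the course page and extracts the filename each one would be saved under.

```python
import posixpath
from urllib.parse import urljoin, urlparse

base_url = "http://www.cs.umd.edu/class/fall2018/cmsc320/"

# Hypothetical hrefs: relative to the page, relative to the site root, and absolute.
sample_hrefs = [
    "files/lec01.pdf",
    "/class/notes/intro.PDF",
    "https://example.com/a/b/paper.pdf",
]

# urljoin leaves absolute URLs alone and resolves the others against base_url.
resolved = [urljoin(base_url, href) for href in sample_hrefs]

# The filename is the last component of the URL's path.
filenames = [posixpath.basename(urlparse(url).path) for url in resolved]

for url, name in zip(resolved, filenames):
    print(url, "->", name)
```

Taking the filename from the parsed path rather than from the raw `href` also drops any query string that might follow the filename.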

### An Extra Example

So note that above we used Beautiful Soup to really get all the bits of the webpage we need. Back in the olden days, one could iterate through all the tags on the webpage to get at the same information, but it would be a lot harder. To prove this point, Karthik wrote the below code that runs through each bit of the webpage, and extracts the tables, and then keeps track of the link. This is a much more manual process but it serves to illustrate how tough this can be.

In [None]:
from IPython.display import display, Markdown

r = requests.get("http://www.cs.umd.edu/class/fall2018/cmsc320/", timeout=10)
soup = BeautifulSoup(r.content, "html.parser")

tables = soup.find_all("table")

for i, t in enumerate(tables):
    # get table head for column names
    table_head = t.find("thead")
    column_names = [th.text for th in table_head.find_all("th")]
    # extract row data from table body
    table_body = t.find("tbody")
    row_tags = table_body.find_all("tr")
    all_rows = []
    for r_ in row_tags:
        row_data = []
        for row_item_tag in r_.find_all(["td", "th"]):
            row_entry = None
            # if the cell contains links, store a (text, [hrefs]) tuple instead of just the text
            if row_item_tag.find("a"):
                row_entry = (row_item_tag.text, [x.get("href") for x in row_item_tag.find_all("a")])
            else:
                row_entry = row_item_tag.text

            row_data.append(row_entry)

        all_rows.append(row_data)

    df_t = pd.DataFrame(all_rows,columns=column_names)
    display(Markdown(f"# Table{i+1}"))
    display(df_t)
    print("\n\n")

## Exercises

All of the exercises in this section build on one another. You'll be attempting to get some **very messy** data from the internet and make it usable and beautiful, like the data we've provided for you all semester.

You've been hired by a new space weather startup looking to disrupt the space weather reporting business. Your first project is to provide better data about the top 50 solar flares recorded so far than that shown by your competitor [SpaceWeatherLive.com](https://www.spaceweatherlive.com/en/solar-activity/top-50-solar-flares). To do this, they've pointed you to [this messy HTML page](http://cdaw.gsfc.nasa.gov/CME_list/radio/waves_type2.html) from NASA ([available here also](http://www.hcbravo.org/IntroDataSci/misc/waves_type2.html)) where you can get the extra data your startup is going to post in your new spiffy site.

Of course, you don't have access to the raw data for either of these two tables, so as an enterprising data scientist you will scrape this information directly from each HTML page using all the great tools available to you in Python. By the way, you should read up a bit on [Solar Flares](https://en.wikipedia.org/wiki/Solar_flare), [coronal mass ejections](https://www.spaceweatherlive.com/en/help/what-is-a-coronal-mass-ejection-cme), [the solar flare alphabet soup](http://spaceweather.com/glossary/flareclasses.html), [the scary storms of Halloween 2003](http://www.nasa.gov/topics/solarsystem/features/halloween_storms.html), and [sickening solar flares](https://science.nasa.gov/science-news/science-at-nasa/2005/27jan_solarflares).

In this Lab we'll first get the data into a format we can sort of use, and next lab we'll spend some time cleaning the data and attempting to match them up with one another.

### Exercise 1

The first thing we need to do is download the webpage into a variable. So write a command with the `requests` module to get the webpage at <https://www.spaceweatherlive.com/en/solar-activity/top-50-solar-flares>. Once you issue the command, print the status code of the response along with the text. Is this what you expected?

A helpful reference for this part is the [documentation for the requests module](https://requests.readthedocs.io/en/latest/) along with the documentation for [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

Look up the status code you received on the [Mozilla Foundation webpage](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) and briefly explain the problem.

In [None]:
# Your Code Here.

**Written Answer:** Fill in your answer here.

### Exercise 2

Welp, looks like they don't want us to scrape their webpage. There are a few tricks we can use to get around this, but shhhh! don't tell anyone.

We can add a few things to our [request headers](https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers) ([maybe try something like this](https://stackoverflow.com/questions/27652543/how-to-use-python-requests-to-fake-a-browser-visit)) to trick the server into thinking you're human...

After reading those pages, write another request below that successfully gets the webpage, then print out the status code and the first 500 characters of the webpage.

In [None]:
# Your Code Here.

### Exercise 3

Well now that we have all that text, we need to figure out how to get what we want: just the data out of the table. In the below cell write code that takes the response from the website and uses the [Beautiful Soup find Function](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) to extract the only table on the page. Print the first 600 characters of the table.

In [None]:
# Your Code Here.

### Exercise 4

Well, there's the table! Now what do we do with it? If you look again at the website you'll see that the table looks an awful lot like a table we'd use in Pandas. Luckily, Pandas has a great function [read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) which works a whole lot like the `read_csv` function we've been using all class. In the cell below, use the `read_html` function to load all this into a dataframe, then print the first 10 rows of the table.

**Gotcha Warning:** Read the docs on `read_html` closely -- what is the return type? Make sure you end up with a dataframe!

In [None]:
# Your Code Here.

### Exercise 5

Okay, now we've got a table but it's not very clean... or tidy. This is the hardest part of the whole lab! You'll need to do several steps for this part, all at once.

1. Assign reasonable names to each of the columns of the dataframe.
2. Drop the last column of the dataframe; we don't need it.
3. Check the `Region` column. Make sure to set any weird codes to `NaN` (there might not be any, but you should check)!
4. The dates and times aren't that useful as is; we need to combine them into single datetime objects so we can do math on them.
5. Re-arrange the columns to be pretty.

When you're done, your table should look like the one shown below. Note that this is a tough problem; you'll need several steps to get there! Make sure to check the dtypes of things as you go. **Hint:** you might want to check out the [Pandas to_datetime function](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html); it's the easiest way to accomplish this!

In [None]:
# Your Code Here.

**In The Cell Below:** Display both the table and its final dtypes.

In [None]:
# Your Code Here.

Your table should look like this:

|    |   rank | x_class   | start_datetime      | max_datetime        | end_datetime        |   region |
|---:|-------:|:----------|:--------------------|:--------------------|:--------------------|---------:|
|  0 |      1 | X28+      | 2003-11-04 19:29:00 | 2003-11-04 19:53:00 | 2003-11-04 20:06:00 |      486 |
|  1 |      2 | X20+      | 2001-04-02 21:32:00 | 2001-04-02 21:51:00 | 2001-04-02 22:03:00 |     9393 |
|  2 |      3 | X17.2+    | 2003-10-28 09:51:00 | 2003-10-28 11:10:00 | 2003-10-28 11:24:00 |      486 |
|  3 |      4 | X17+      | 2005-09-07 17:17:00 | 2005-09-07 17:40:00 | 2005-09-07 18:03:00 |      808 |
|  4 |      5 | X14.4     | 2001-04-15 13:19:00 | 2001-04-15 13:50:00 | 2001-04-15 13:55:00 |     9415 |
|  5 |      6 | X10       | 2003-10-29 20:37:00 | 2003-10-29 20:49:00 | 2003-10-29 21:01:00 |      486 |
|  6 |      7 | X9.4      | 1997-11-06 11:49:00 | 1997-11-06 11:55:00 | 1997-11-06 12:01:00 |     8100 |
|  7 |      8 | X9.3      | 2017-09-06 11:53:00 | 2017-09-06 12:02:00 | 2017-09-06 12:10:00 |     2673 |
|  8 |      9 | X9        | 2006-12-05 10:18:00 | 2006-12-05 10:35:00 | 2006-12-05 10:45:00 |      930 |
|  9 |     10 | X8.3      | 2003-11-02 17:03:00 | 2003-11-02 17:25:00 | 2003-11-02 17:39:00 |      486 |

Congratulations, you've completed the first step of a much larger project involving data integration and scraping multiple websites, which we used to do as part of a more advanced version of this course. If you're itching for more work (not bonus), you can find the full assignment as Project1 on the course webpage.