# Web Scraping Walkthrough

### Hypertext Transfer Protocol (HTTP) is the foundation for data communication on the World Wide Web
- Entering a URL in your browser sends a request for the resource at that address
- The response is what the server sends back (for example, the page's HTML, or a 404 error if nothing exists at that address)

To retrieve the contents of a website, we will be using the [_requests_](http://docs.python-requests.org/en/master/user/quickstart/) library.


In [None]:
import requests

We will be using a GET request. This is a request for data from a specified resource.  
There are other types of requests. Another common type is a POST request, which submits data to be processed (e.g., from an HTML form) to the identified resource. The data is included in the body of the request. This may result in the creation of a new resource, the update of existing resources, or both.
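
To make the difference between GET and POST concrete, here is a minimal sketch that builds one of each without sending anything over the network. The `httpbin.org` URLs and the parameter names are placeholders chosen for illustration; they are not part of this walkthrough's data sources.

```python
import requests

# Build (but do not send) a GET and a POST request so we can inspect them.
# The URLs and parameter names below are illustrative placeholders.
get_req = requests.Request(
    'GET', 'https://httpbin.org/get', params={'q': 'turing'}
).prepare()
post_req = requests.Request(
    'POST', 'https://httpbin.org/post', data={'name': 'Ada'}
).prepare()

print(get_req.url)    # the parameters are appended to the URL as a query string
print(get_req.body)   # this GET request carries no body
print(post_req.url)   # the URL is unchanged
print(post_req.body)  # the form data is URL-encoded into the request body
```

The GET request carries its parameters in the URL, while the POST request carries its data in the body, which matches the description above.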

In [None]:
response = requests.get('https://en.wikipedia.org/wiki/Turing_Award')

In [None]:
type(response)

In [None]:
response.status_code

A 200 status code is the standard response for a successful request.  

Let's see what happens if we request a non-existent URL.

In [None]:
requests.get('https://en.wikipedia.org/wiki/Tuning_Award')
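
Note that `requests.get` does not raise an error just because the server answered with a 4xx or 5xx status code; it is up to us to check. One possible way to do that is sketched below. The helper name `describe` is made up for this example, and the responses are built by hand (rather than fetched) only so that the cell runs without a network connection.

```python
import requests

def describe(response):
    """Return a short human-readable summary of a response's status."""
    try:
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    except requests.HTTPError as err:
        return f'failed: {err}'
    return f'ok: {response.status_code}'

# Hand-built responses stand in for real ones so the example runs offline.
found = requests.Response()
found.status_code = 200
missing = requests.Response()
missing.status_code = 404

print(describe(found))    # summary for a successful response
print(describe(missing))  # summary for a "not found" response
print(missing.ok)         # .ok is False for any 4xx or 5xx status code
```

In a real script, calling `response.raise_for_status()` right after `requests.get(...)` is a common way to stop early instead of parsing an error page.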

Now, let's look at what we retrieved.

In [None]:
response.text

It is very hard to decipher the above text. Luckily for us, the [_Beautiful Soup_](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library comes to the rescue. This library assists us in parsing HTML into something usable.

In [None]:
from bs4 import BeautifulSoup as BS

First, we can soupify our response text. Since we are working with HTML, we tell Beautiful Soup to use Python's built-in `html.parser`.

In [None]:
soup = BS(response.text, 'html.parser')

In [None]:
print(soup.prettify())

Beautiful Soup lets us search through this HTML and extract the contents we want by tag.  

Say we want to find the title of this page. We can do this by calling the `.find` method on our soup and then accessing the `.text` attribute of the result.

In [None]:
soup.find('title')

In [None]:
soup.find('title').text
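
One thing to keep in mind is that `.find` returns `None` when no matching tag exists, so chaining `.text` onto it would raise an `AttributeError`. The sketch below shows this on a tiny hand-written HTML string (not the Wikipedia page); the variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document, used instead of the downloaded page.
demo_html = (
    '<html><head><title>Demo page</title></head>'
    '<body><p>Hello</p></body></html>'
)
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# .find returns the first matching tag, or None if there is no match.
title_tag = demo_soup.find('title')
title_text = title_tag.text if title_tag is not None else None
print(title_text)

# There is no <h2> in this document, so the result is None.
missing_tag = demo_soup.find('h2')
print(missing_tag)
```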

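We can find all elements with a particular tag using the `.findAll(<tag>)` method (a legacy alias of `.find_all`, which is the name used in the current Beautiful Soup documentation). 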

In [None]:
soup.findAll('img')

In [None]:
soup.findAll('img')[0]['src']

In [None]:
[x['alt'] for x in soup.findAll('img')]
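
Indexing a tag with `['alt']` raises a `KeyError` if any image lacks that attribute. A more forgiving variant, sketched here on a small hand-written snippet rather than the real page, uses the tag's `.get` method, which returns `None` (or a default you supply) when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Two images: only the first one has an alt attribute.
img_html = '''
<img src="a.png" alt="First image">
<img src="b.png">
'''
img_soup = BeautifulSoup(img_html, 'html.parser')
imgs = img_soup.find_all('img')

# .get returns None instead of raising KeyError for a missing attribute.
alts = [img.get('alt') for img in imgs]
print(alts)

# A default value can be supplied as the second argument.
alts_with_default = [img.get('alt', '') for img in imgs]
print(alts_with_default)
```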

We can navigate further down the HTML tree to extract other pieces of information.

In [None]:
soup.findAll('div')[2]

Let's say we want to extract something from this section of HTML.

In [None]:
soup.findAll('div')[2].find('h1')

In [None]:
soup.findAll('div')[2].find('h1')['id']

In [None]:
soup.findAll('div')[2].find('h1').text

In [None]:
soup.findAll('table')

If we know a bit more about what we are looking for, we can include an `attrs` argument and pass a dictionary. 

Go to the Turing Award page in your browser, right-click on the top of the table, and choose "Inspect". You will notice that this table is defined with the tag `<table class="wikitable">`. Armed with this information, we can narrow down our search.

In [None]:
soup.find('table', attrs={'class' : 'wikitable'})
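
To see how the `attrs` filter behaves, here is a small sketch on hand-written HTML containing two tables (the class names other than `wikitable` and the cell contents are made up). It also shows the equivalent `class_` keyword, and that a tag with several classes still matches when one of them is the class we asked for.

```python
from bs4 import BeautifulSoup

# Two toy tables; only the second carries the class we are looking for.
tables_html = '''
<table class="infobox"><tr><td>sidebar</td></tr></table>
<table class="wikitable sortable"><tr><td>example</td></tr></table>
'''
tables_soup = BeautifulSoup(tables_html, 'html.parser')

# Filter by passing a dictionary of attributes...
by_attrs = tables_soup.find('table', attrs={'class': 'wikitable'})

# ...or by the class_ keyword (with an underscore, since class is reserved).
by_keyword = tables_soup.find('table', class_='wikitable')

print(by_attrs['class'])           # the class attribute is returned as a list
print(by_attrs.find('td').text)    # text of the first cell in the matched table
```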

We can display the table in the notebook by importing `HTML` from IPython and passing it the table's HTML as a string.

In [None]:
table_html = str(soup.find('table', attrs={'class' : 'wikitable'}))

from IPython.display import HTML

HTML(table_html)

However, this does not give us a way to work with the data in the table, only to display it.

As part of Data Question 3, your group will need to figure out how to convert the resulting table into a `pandas` DataFrame.

## Using the USGS API

The USGS provides an API for retrieving data about earthquakes. This API is accessible at https://earthquake.usgs.gov/fdsnws/event/1.

To use this API, one option is to simply use your browser and modify the query string to find what you're looking for:  
https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv&starttime=2019-01-01&endtime=2019-01-02

However, editing URLs by hand is clunky. We can also use the requests library to help us. This time, instead of doing a GET request on a fixed URL, we will pass a payload containing our desired parameters as a dictionary, and requests will build the query string for us.   

Let's say we are interested in finding all earthquakes that occurred since 2010 that were at least magnitude 6.0.

In [None]:
url = 'https://earthquake.usgs.gov/fdsnws/event/1/query'

In [None]:
payload = {'format': 'csv', 
           'starttime': '2010-01-01', 
           'minmagnitude' : '6.0'}

In [None]:
r = requests.get(url=url, params=payload)

In [None]:
print(r.url)
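
If you want to check how the payload ends up in the URL without contacting the server, one option is to prepare the request and inspect it, as in the sketch below. It reuses the same URL and payload as above; the variable names `prepared` and `query` are chosen for this example.

```python
import requests
from urllib.parse import urlsplit, parse_qs

url = 'https://earthquake.usgs.gov/fdsnws/event/1/query'
payload = {'format': 'csv',
           'starttime': '2010-01-01',
           'minmagnitude': '6.0'}

# Prepare the request without sending it, then look at the final URL.
prepared = requests.Request('GET', url, params=payload).prepare()
print(prepared.url)

# Parse the query string back into a dictionary to confirm the parameters.
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```

Each key in the dictionary becomes a `key=value` pair in the query string, joined by `&`, which is exactly what we would otherwise have typed into the browser by hand.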

In [None]:
print(r.text[:1000])

There are a few ways we can now proceed. First, we can read the text into a DataFrame using `read_csv`. Because `read_csv` expects a file path or a file-like object rather than a string of CSV data, we first wrap the text in a `StringIO` object.

In [None]:
import pandas as pd
from io import StringIO

In [None]:
eq = pd.read_csv(StringIO(r.text))
eq.head()
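
The same pattern can be tried offline on a small string. The rows below are made up for illustration and are not real earthquake records; the sketch assumes the data has `time` and `mag` columns, and uses the `parse_dates` argument so that the timestamps are read as datetimes rather than plain strings.

```python
import pandas as pd
from io import StringIO

# Made-up rows in a CSV layout similar to the API output (not real data).
sample_text = (
    'time,latitude,longitude,depth,mag\n'
    '2010-01-01T00:00:00.000Z,1.0,2.0,10.0,6.1\n'
    '2010-01-02T12:30:00.000Z,-3.5,4.5,35.0,7.0\n'
)

# StringIO makes the string behave like an open file for read_csv.
sample = pd.read_csv(StringIO(sample_text), parse_dates=['time'])
print(sample.dtypes)

# With numeric and datetime columns in place, filtering works as usual.
strong = sample[sample['mag'] >= 7.0]
print(strong)
```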

Another option is to pass the URL itself to `read_csv`, which downloads the data again and parses it directly.

In [None]:
eq = pd.read_csv(r.url)
eq.head()

Finally, we can save the text as a CSV file and then read it back in using pandas:

In [None]:
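# Write as UTF-8 explicitly so non-ASCII place names are saved safely on any platform
with open('eq.csv', 'w', encoding='utf-8') as fi:
    fi.write(r.text)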

In [None]:
eq = pd.read_csv('eq.csv')
eq.head()