# Web Scraping Intro

### Hypertext Transfer Protocol (HTTP) is the foundation for data communication on the World Wide Web.
- Entering a URL in your browser sends a request for the resource at that address.
- The response is what the server sends back: the page contents if the request succeeds, or an error such as 404 if it does not.

To retrieve the contents of a website, we will be using the [_requests_](https://requests.readthedocs.io/en/master/) library.

In [None]:
import requests

In this notebook, we will be using a **GET** request. This is a request for data from a specified resource.  

Another common type of request is a **POST** request. POST submits data to be processed (e.g., from an HTML form) to the identified resource. The data is included in the body of the request. This may result in the creation of a new resource, the update of existing resources, or both.
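To make the difference concrete, here is a small sketch that builds (but does not send) one request of each kind. The `example.com` URLs and the form fields are made up for illustration; `requests.Request(...).prepare()` only assembles the request so that we can inspect it offline.

```python
import requests

# Build the requests without sending them (hypothetical URLs and data).
get_req = requests.Request(
    'GET', 'https://example.com/search', params={'q': 'turing'}
).prepare()
post_req = requests.Request(
    'POST', 'https://example.com/signup', data={'name': 'Ada'}
).prepare()

# GET: the parameters travel in the URL and there is no body.
print(get_req.method, get_req.url, get_req.body)

# POST: the data travels in the body of the request.
print(post_req.method, post_req.url, post_req.body)
```

With a GET request the query ends up in the URL itself, while the POST request keeps its URL unchanged and carries the form data in its body.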

To perform a GET request, use `requests.get()` and pass in the desired URL.

In [None]:
URL = 'http://en.wikipedia.org/wiki/Turing_Award'

response = requests.get(URL)

Let's see what kind of object we get.

In [None]:
type(response)

We can check the status code using the `status_code` attribute.

In [None]:
response.status_code

A 200 status code is the standard response for a successful request.  

Other common status codes:
 * 400: Bad Request
 * 404: Not Found
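If you come across a status code you do not recognize, Python's standard library can look it up. The sketch below uses `http.HTTPStatus`; the helper name `describe_status` is our own invention for this example, and the grouping follows the usual convention that the first digit indicates the class of response.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '404 Not Found (client error)'."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = 'success'
    elif 300 <= code < 400:
        kind = 'redirection'
    elif 400 <= code < 500:
        kind = 'client error'
    elif 500 <= code < 600:
        kind = 'server error'
    else:
        kind = 'informational'
    return f'{code} {phrase} ({kind})'

for code in (200, 400, 404, 500):
    print(describe_status(code))
```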

Let's see what happens if we request a non-existent URL.

In [None]:
requests.get('https://en.wikipedia.org/wiki/Tuning_Award')
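Notice that `requests.get` does not raise an error for a 404; it simply returns a response carrying that status code. If you would rather fail loudly, `raise_for_status()` raises an exception for 4xx and 5xx codes. The sketch below builds a `Response` by hand, purely so that it runs without a network connection; in practice the object would come from `requests.get`.

```python
import requests

# Hand-built stand-in for a "not found" response (illustration only).
fake_response = requests.Response()
fake_response.status_code = 404

# .ok is False for any 4xx or 5xx status code.
print(fake_response.ok)

try:
    fake_response.raise_for_status()
    outcome = 'success'
except requests.HTTPError as err:
    outcome = f'failed: {err}'

print(outcome)
```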

**Back to the original, valid request**: let's see what it returned.

In [None]:
response.text

It is very hard to decipher the above text. Luckily for us, the [_Beautiful Soup_](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library comes to the rescue. This library assists us in parsing HTML into something usable.

In [None]:
from bs4 import BeautifulSoup as BS

First, we can soupify our response text. Since we are working with HTML, we tell Beautiful Soup to use Python's built-in `html.parser`.

In [None]:
soup = BS(response.text, 'html.parser')

Now, we can print it out in a slightly more readable form.

In [None]:
print(soup.prettify())
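To see what `prettify` does without scrolling through a full Wikipedia page, here is a minimal sketch on a tiny hand-written document. The HTML string and the names `sample_html` and `demo_soup` are made up for this example.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical), squeezed onto one line.
sample_html = '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'

demo_soup = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns a string with each tag on its own indented line.
pretty = demo_soup.prettify()
print(pretty)
```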

What we are looking at is the HTML for this page. This is rendered by your browser into the Wikipedia page that you see.

<img src="assets/html.png">


If you navigate to this page in your browser, you can view page source or inspect elements to see the underlying HTML.

If you are using Safari, this option may not be available by default, and you'll need to activate it. According to [this](https://www.socialmeteor.com/2013/03/04/how-to-view-html-source-in-safari-web-browser/) website, you can do so by following these steps:


1. Open Safari.
2. Select ‘Preferences’ from the ‘Safari’ menu.
3. In the ‘Advanced’ section, select ‘Show Develop menu in menu bar’.
4. Visit the web page you want to view HTML source for.
5. Select ‘Show Page Source’ from the ‘Develop’ menu that has been added to Safari.


Beautiful Soup lets us search through this HTML and extract out the contents we want by tag.  

Say we wanted to find the title of this page. We can accomplish this by using the `.find` method on our soup, telling it that we want to find the first `title` tag.

In [None]:
soup.find('title')

Notice that this returns a bs4 Tag object.

In [None]:
type(soup.find('title'))

To extract out the text, you can use the `.text` attribute.

In [None]:
soup.find('title').text
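One thing to watch for: when nothing matches, `.find` returns `None`, so calling `.text` on the result fails with an `AttributeError`. Here is a minimal sketch of a guard; the helper `first_text` and the sample HTML are our own, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

sample_html = '<html><head><title>Demo page</title></head><body></body></html>'
demo_soup = BeautifulSoup(sample_html, 'html.parser')

def first_text(soup, tag_name, default=''):
    """Return the text of the first matching tag, or a default if none exists."""
    tag = soup.find(tag_name)
    return tag.text if tag is not None else default

print(first_text(demo_soup, 'title'))
print(first_text(demo_soup, 'h1', default='(no h1)'))
```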

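The `.find` method finds the first matching tag.

We can find _all_ elements with a particular tag using the `.findAll(<tag>)` method (an older alias of `.find_all`, which is the spelling used in the current Beautiful Soup documentation). Say we want to find all images. We'll look for the `img` tag.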

In [None]:
images = soup.findAll('img')
print(type(images))
images
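The result behaves like a list, which the following sketch shows on a small hand-written snippet (the HTML and variable names are made up for this example). It uses the `find_all` spelling, which performs the same search as `findAll`.

```python
from bs4 import BeautifulSoup

sample_html = '''
<div>
  <img src="a.png" alt="first">
  <p>Some text</p>
  <img src="b.png">
</div>
'''
demo_soup = BeautifulSoup(sample_html, 'html.parser')

demo_images = demo_soup.find_all('img')
print(len(demo_images))                # how many img tags were found
print(isinstance(demo_images, list))   # a ResultSet is a list subclass

# No matches gives an empty result rather than None.
demo_tables = demo_soup.find_all('table')
print(len(demo_tables))
```

Because an empty result is still a list, looping over it is always safe, whereas indexing into it (for example `demo_tables[0]`) would raise an `IndexError`.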

Let's look closer at the first image.

In [None]:
first_image = images[0]
print(type(first_image))
first_image

You can access attributes of a Tag object in the same way that you would access values from a dictionary.

In [None]:
first_image['src']

You can also safely access attributes using `.get`, which returns `None` instead of raising a `KeyError` when the attribute is missing. This is useful when you aren't sure whether every tag has a given attribute.

In [None]:
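# Unsafe: raises a KeyError if the tag has no 'class' attribute
first_image['class']

In [None]:
# Safe: returns None if the tag has no 'class' attribute
first_image.get('class')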

You can also specify a default value when using `get`.

In [None]:
first_image.get('class', default='No Class')
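The different ways of reading attributes can be compared side by side on a single hand-written tag. This is only a sketch; the tag and its attribute values are invented for the example.

```python
from bs4 import BeautifulSoup

sample_html = '<img src="logo.png" class="thumb border">'
demo_img = BeautifulSoup(sample_html, 'html.parser').find('img')

print(demo_img['src'])                # dictionary-style access
print(demo_img.get('class'))          # class can hold several values, so it is a list
print(demo_img.get('alt'))            # missing attribute, no default given
print(demo_img.get('alt', 'No alt'))  # missing attribute with a fallback

try:
    demo_img['alt']
except KeyError:
    print('KeyError: this tag has no alt attribute')

print(demo_img.attrs)                 # all attributes as a dictionary
```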

If you want to grab a particular attribute for all images, an easy way to do so is with a list comprehension.

In [None]:
image_srcs = [x.get('src') for x in images]

In [None]:
image_srcs
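Many `src` values on a real page are not complete URLs: they may be protocol-relative (starting with `//`) or relative to the site root. If you plan to download the images, one option is to resolve them against the page URL with `urllib.parse.urljoin`. The `src` values below are made-up stand-ins, not actual values from the Turing Award page.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://en.wikipedia.org/wiki/Turing_Award'

# Made-up examples of the kinds of src values a page can contain.
sample_srcs = [
    '//upload.wikimedia.org/example/a.png',   # protocol-relative
    '/static/images/logo.png',                # relative to the site root
    'https://example.com/full.png',           # already absolute
    None,                                     # an img tag with no src
]

# Skip missing values, then resolve everything against the page URL.
absolute_srcs = [urljoin(PAGE_URL, src) for src in sample_srcs if src is not None]

for src in absolute_srcs:
    print(src)
```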

We can further navigate the HTML tree to extract other bits of information.

When scraping from a web page, you should make use of "View Page Source" and/or "Inspect Element" in your web browser.

For example, let's say we want to look at the third div on the page.

In [None]:
soup.findAll('div')[2]

Similar to using `find` and `findAll` on the full soup, we can call these methods on an individual Tag to search only within that Tag.

In [None]:
soup.findAll('div')[2].find('h1')

In [None]:
soup.findAll('div')[2].find('h1').get('id')

In [None]:
soup.findAll('div')[2].find('h1').text
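The effect of searching within a Tag is easier to see on a page with two headings. In this sketch (hand-written HTML, hypothetical ids) the same `.find('h1')` call gives different results depending on where the search starts.

```python
from bs4 import BeautifulSoup

sample_html = '''
<div id="header"><h1 id="site-title">My Site</h1></div>
<div id="content"><h1 id="article-title">Turing Award</h1><p>Text</p></div>
'''
demo_soup = BeautifulSoup(sample_html, 'html.parser')

second_div = demo_soup.find_all('div')[1]

# Searching from the whole soup finds the first h1 on the page.
page_heading = demo_soup.find('h1')
print(page_heading.text)

# Searching from a Tag only looks inside that Tag.
heading = second_div.find('h1')
print(heading.get('id'), heading.text)
```

Positional indexing such as `[1]` or `[2]` is fragile, since the position of a div can change whenever the page layout changes; searching by an id or class, where one exists, tends to hold up better.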

Now, let's look for the table containing the Turing Award winners.

Using `.findAll` reveals that there are multiple tables on the page.

In [None]:
soup.findAll('table')

If we know a bit more about what we are looking for, we can include an `attrs` argument and pass a dictionary. 

Go to the Turing Award page in your browser, right-click on the top of the table, and choose "Inspect". You will notice that this table is defined with the tag `<table class="wikitable">`. Armed with this information, we can narrow down our search.

In [None]:
soup.find('table', attrs={'class' : 'wikitable'})
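Two details are worth knowing here. A tag may carry several classes, and a search for one class matches any tag that includes it. Beautiful Soup also accepts a `class_` keyword as a shorthand for the `attrs` dictionary. The sketch below shows both on a hand-written snippet; the table contents are placeholders.

```python
from bs4 import BeautifulSoup

sample_html = '''
<table class="infobox"><tr><td>Sidebar</td></tr></table>
<table class="wikitable sortable"><tr><td>Winners</td></tr></table>
'''
demo_soup = BeautifulSoup(sample_html, 'html.parser')

# Both calls search for a table whose classes include "wikitable".
by_attrs = demo_soup.find('table', attrs={'class': 'wikitable'})
by_keyword = demo_soup.find('table', class_='wikitable')

print(by_attrs == by_keyword)
print(by_attrs['class'])
print(by_attrs.text.strip())
```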

We can display the table in the notebook using IPython's `HTML` display class.

In [None]:
from IPython.display import HTML

table_html = str(soup.find('table', attrs={'class' : 'wikitable'}))

HTML(table_html)

However, this does not give us a way to work with the data in the table, only to display it.

If we want to interact with the table, we can use the _pandas_ `read_html` function, which returns a list of DataFrames, one for each table it finds.

In [None]:
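from io import StringIO
import pandas as pd

In [None]:
# Wrap the HTML in StringIO; recent pandas versions deprecate passing a raw HTML string
pd.read_html(StringIO(str(soup.find('table', attrs={'class' : 'wikitable'}))))[0]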
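`read_html` relies on an optional parser library being installed, and its output sometimes needs checking against the page. For either case, a rough alternative is to walk the table rows by hand with Beautiful Soup. The table below is a trimmed, hand-written stand-in for the real one, and this sketch assumes a simple table with no cells spanning multiple rows or columns.

```python
from bs4 import BeautifulSoup

sample_html = '''
<table class="wikitable">
  <tr><th>Year</th><th>Recipient</th></tr>
  <tr><td>1966</td><td>Alan Perlis</td></tr>
  <tr><td>1967</td><td>Maurice Wilkes</td></tr>
</table>
'''
demo_table = BeautifulSoup(sample_html, 'html.parser').find('table')

# Collect the text of every header or data cell, one list per row.
rows = []
for tr in demo_table.find_all('tr'):
    cells = tr.find_all(['th', 'td'])
    rows.append([cell.get_text(strip=True) for cell in cells])

header, records = rows[0], rows[1:]
print(header)
for record in records:
    print(record)
```

The resulting `header` and `records` can be passed to `pd.DataFrame(records, columns=header)` if you want a DataFrame at the end.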