## API versus web-scraping

Both are ways to collect data from the internet:

API
- structured
- limited data / rate limits
- parsing JSON

Web scraping
- less structure
- parsing HTML
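
To make the contrast concrete, here is a minimal sketch using only the standard library. The record, its field names and the HTML snippet are made up for illustration, and `TextCollector` is a hypothetical helper, not something used later in this notebook:

```python
import json
from html.parser import HTMLParser

# The same made-up record, as an API might return it and as a web page might show it
api_payload = '{"name": "Ada Lovelace", "born": 1815}'
web_page = '<html><body><h1>Ada Lovelace</h1><p class="born">1815</p></body></html>'

# JSON: one call gives a dictionary with named, typed fields
record = json.loads(api_payload)
print(record['name'], record['born'])


# HTML: we have to walk the tags ourselves to find the text we want
class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.text_by_tag = {}

    def handle_starttag(self, tag, attrs):
        # remember which tag we are inside
        self.current_tag = tag

    def handle_data(self, data):
        # store any non-blank text under the most recently opened tag
        if self.current_tag and data.strip():
            self.text_by_tag[self.current_tag] = data.strip()


collector = TextCollector()
collector.feed(web_page)
print(collector.text_by_tag)
```

Note that the JSON gives `born` as a number, while everything extracted from the HTML is a string that still needs cleaning.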

This notebook covers **web scraping**.  It is recommended that you work through the [using-an-API.ipynb]() notebook before this one.

## Web scraping

Web scraping involves two steps:
1. fetching a webpage's HTML
2. extracting data from the HTML

Note that some websites do not want to be scraped!  They may offer an API instead (try to find a *For Developers* page on their website).
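
One common way a site signals what it is happy to have crawled is a `robots.txt` file. This is not covered elsewhere in this notebook, so treat the following as a sketch under assumptions: the rules and the `example.com` URLs are made up, and the rules are parsed from a string so that the cell runs offline.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt - real sites publish theirs at <site>/robots.txt
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch answers: may this user agent request this URL?
allowed = parser.can_fetch('*', 'https://example.com/wiki/Some_Page')
blocked = parser.can_fetch('*', 'https://example.com/private/data')

print(allowed, blocked)
```

For a live site you would point the parser at the real file with `set_url` and `read` instead of calling `parse` on a string.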

## Fetching HTML

We will be scraping Wikipedia, specifically the pages for the three recipients of the 2018 Turing Award:
- [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) - Chief AI Scientist at Facebook
- 

First we need a data structure to hold the URLs of our three deep learning pioneers.  Let's use a `dict`:

In [None]:
pioneers = {
    'lecun': 'https://en.wikipedia.org/wiki/Yann_LeCun'
}

In [None]:
!pip install -q requests
import requests

response = requests.get(pioneers['lecun'])

We can look at the HTML content we get back - this is the same HTML that your browser uses to render a page:

In [None]:
response.text[:250]

In [None]:
len(response.text)
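
A request can fail or hang, so it can help to wrap the fetch in a small function. The sketch below is one possible way to do that; `fetch_html` is a hypothetical helper name and the 10 second timeout is an arbitrary choice:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, raising an error if the request fails."""
    # the timeout stops the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status turns 4xx / 5xx responses into exceptions
    response.raise_for_status()
    return response.text
```

With this in place, `fetch_html(pioneers['lecun'])` would return the same text as `response.text` above, or raise a `requests.exceptions.RequestException` if something went wrong.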

## Parsing HTML

We need some way to parse this HTML text - to do this we will use **Beautiful Soup**:

In [None]:
!pip install -q beautifulsoup4 lxml
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'lxml')

#  uncomment the line below - warning - it prints out a lot!
# print(soup.prettify())

We can use Beautiful Soup to search the HTML for specific tags:


In [None]:
soup.title
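
The Wikipedia page is large, so it can be easier to see how tag lookups behave on a tiny page first. The HTML below is made up, and the names `example_html` and `example_soup` are chosen so they do not overwrite the `soup` used in the rest of the notebook:

```python
from bs4 import BeautifulSoup

# A made-up page, small enough to read in full
example_html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Heading</h1>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

# 'html.parser' ships with Python, so this sketch needs no extra parser installed
example_soup = BeautifulSoup(example_html, 'html.parser')

print(example_soup.title)          # the whole <title> tag
print(example_soup.title.text)     # just the text inside it
print(example_soup.find('p').text)                   # find returns the first match
print([p.text for p in example_soup.find_all('p')])  # find_all returns every match
```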

## Developer tools

One useful set of tools in web development is the **Developer Tools** included in modern browsers:

![](../assets/dev1.png)

The **Inspect elements** tool allows us to find the HTML block for the biography table:

![](../assets/dev2.png)

Let's find the table:

In [None]:
table = soup.find('table', 'infobox biography vcard')
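
Passing a string as the second argument to `find` matches on the CSS class. When that string contains several classes, it has to match the `class` attribute exactly, including the order. The sketch below uses made-up HTML with the same three classes in a different order to show two more forgiving alternatives:

```python
from bs4 import BeautifulSoup

# Made-up HTML: the same three classes as the infobox, in a different order
demo_html = '<table class="infobox vcard biography"><tr><td>cell</td></tr></table>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# A multi-class string must match the class attribute exactly, order included
exact = demo_soup.find('table', 'infobox biography vcard')

# A single class matches any element that carries it
by_one_class = demo_soup.find('table', class_='infobox')

# A CSS selector matches all the listed classes in any order
by_selector = demo_soup.select_one('table.infobox.biography.vcard')

print(exact is None, by_one_class is not None, by_selector is not None)
```

If the lookup on the real page ever returns `None`, checking the current class names with the Developer Tools is a good first step.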

## Tables in HTML

`tr` = row

`th` = header cell

`td` = data cell

Let's take a look at the third row (**Born**):

In [None]:
rows = table.find_all('tr')
#  indexing starts at zero, so index 2 is the third row
row = rows[2]

The header:

In [None]:
row.find('th')

The data:

In [None]:
row.find('td')

We can also get the text from these HTML elements:

In [None]:
row.find('td').text
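
The `.text` attribute keeps whatever whitespace was in the HTML. As a small sketch on a made-up cell, `get_text` with `strip=True` gives a cleaner string:

```python
from bs4 import BeautifulSoup

# Made-up cell with newlines and nested tags, similar in shape to an infobox cell
cell_html = '<table><tr><td>\n<span>Yann</span> <span>LeCun</span>\n</td></tr></table>'
cell = BeautifulSoup(cell_html, 'html.parser').find('td')

raw = cell.text                          # keeps the surrounding newlines
clean = cell.get_text(' ', strip=True)   # strips each piece of text, joins with a space

print(repr(raw))
print(repr(clean))
```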

We can store this data in a dictionary:

In [None]:
data = {}

data[row.find('th').text] = row.find('td').text
data

## Exercise

Let's iterate over the rows in the biography table and store each row in a list of dictionaries.

Note that you might encounter a row without a header - in this case `bs4` will return `None`:

In [None]:
header = rows[6].find('th')

print(header)

You can deal with this by taking advantage of `None` being *falsey* in Python:

In [None]:
obj = None

if obj:
    print('obj is truthy')
else:
    print('obj is falsey')
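
One caveat: `None` is not the only falsey object in Python. The sketch below uses a few made-up values to show the difference between a truthiness check and an explicit `is None` check:

```python
values = [None, '', 0, [], 'Born']

for value in values:
    label = 'truthy' if value else 'falsey'
    print(f'{value!r:8} {label:7} is None: {value is None}')

# an explicit check picks out only the missing value
only_none = [value for value in values if value is None]
print(only_none)
```

When you want to tell a missing header apart from an empty one, `if header is None:` is the more explicit test.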

## Finding links

Another common task when parsing HTML is to look for links.  In HTML, links are `a` tags, with the target URL stored in the `href` attribute.

Let's find all the links in the **References** section - which is a `div` element:

In [None]:
#  this is a div rather than a table, so give it its own name
references = soup.find('div', 'mw-references-wrap mw-references-columns')

In [None]:
links = references.find_all('a')

li = links[1]

li

In [None]:
li['href']

In [None]:
li.text
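
Not every `a` tag has an `href`, and indexing a tag with a missing attribute raises a `KeyError`. The sketch below uses a made-up snippet to show two ways of handling that:

```python
from bs4 import BeautifulSoup

# Made-up snippet: the first anchor has no href attribute
snippet = '<div><a id="top">anchor</a><a href="/wiki/Turing_Award">Turing Award</a></div>'
snippet_soup = BeautifulSoup(snippet, 'html.parser')

first, second = snippet_soup.find_all('a')

print(first.get('href'))    # .get returns None when the attribute is missing
print(second.get('href'))

# href=True keeps only the anchors that actually have an href
with_href = snippet_soup.find_all('a', href=True)
print([a['href'] for a in with_href])
```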

We can store a link in a `namedtuple`:

In [None]:
from collections import namedtuple

Link = namedtuple('Link', ['text', 'url'])

Link(li.text, li['href'])
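
The `href` values on a page are often relative, or point to a fragment on the same page. As a sketch, `urljoin` from the standard library can turn them into full URLs before they are stored. The `(text, href)` pairs below are made up for illustration:

```python
from collections import namedtuple
from urllib.parse import urljoin

Link = namedtuple('Link', ['text', 'url'])

page_url = 'https://en.wikipedia.org/wiki/Yann_LeCun'

# Made-up (text, href) pairs in the three shapes an href can take
raw_links = [
    ('Turing Award', '/wiki/Turing_Award'),         # relative to the site root
    ('[1]', '#cite_note-1'),                        # fragment on the same page
    ('A paper', 'https://example.com/paper.pdf'),   # already absolute
]

# urljoin resolves each href against the page it was found on
demo_links = [Link(text, urljoin(page_url, href)) for text, href in raw_links]

for link in demo_links:
    print(link.text, '->', link.url)
```

Fields of a `namedtuple` can be read by name (`link.url`) or by position (`link[1]`).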

## Exercise

Create a list of the links from the **External Links** section: