In [None]:
!pip install -q requests bs4 lxml

from bs4 import BeautifulSoup
import requests

## API versus web-scraping

Both are ways to collect data from the internet.

API
- structured
- limited data / rate limits
- parsing JSON

Web scraping
- less structure
- parsing HTML
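To make the contrast concrete, here is a minimal sketch that reads the same value from a JSON payload and from an HTML page. Both payloads (`api_body` and `page_body`) are made up for illustration and do not come from a real API or website; the built-in `html.parser` is used so the sketch needs no extra parser.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical API response: JSON maps directly onto Python dicts and lists
api_body = '{"name": "Example Person", "award_year": 2018}'
record = json.loads(api_body)
print(record["award_year"])

# Hypothetical web page: the same value is buried in markup, so we have to
# know which tag to look for and convert the text ourselves
page_body = "<html><body><h1>Example Person</h1><p>Award year: <b>2018</b></p></body></html>"
page = BeautifulSoup(page_body, "html.parser")
year = int(page.find("b").text)
print(year)
```

With JSON the structure is given to us; with HTML we have to find it, and the lookup breaks if the page layout changes.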

This notebook covers **web scraping**. We recommend working through the [using-an-API.ipynb]() notebook before this one.

## Web scraping

Two processes
1. fetching a webpage's HTML
2. extracting data from the HTML

Note that some websites do not want to be scraped!  They may offer an API instead (try to find a *For Developers* page on their website).
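Many sites state their scraping rules in a `robots.txt` file. The sketch below checks such rules with the standard library's `urllib.robotparser`. The rules, the `tutorial-bot` user agent and the `example.com` URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real one is served at the root of the website
robots_txt = """
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(robots_txt.splitlines())

# can_fetch(useragent, url) tells us whether the rules allow the request
allowed = robots.can_fetch("tutorial-bot", "https://example.com/wiki/Some_page")
blocked = robots.can_fetch("tutorial-bot", "https://example.com/private/data")
print(allowed, blocked)
```

For a real site you would call `robots.set_url(...)` with the address of its `robots.txt` and then `robots.read()` instead of parsing a string.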

## Fetching HTML

We will be scraping Wikipedia, specifically the page of one of the three recipients of the 2018 Turing Award - [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) - Chief AI Scientist at Facebook.

First we fetch the page by sending a GET request with `requests`:

In [None]:
response = requests.get('https://en.wikipedia.org/wiki/Yann_LeCun')
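A request can fail (a mistyped URL, a blocked client, a server error), so it is worth checking the response before parsing it. The sketch below is one way to wrap the call. The function name `fetch_html`, the `USER_AGENT` string and the `get` parameter are assumptions added for illustration, not part of `requests`.

```python
import requests

# Hypothetical User-Agent string; replace it with one that identifies you
USER_AGENT = "scraping-tutorial/0.1 (contact: you@example.com)"


def fetch_html(url, get=requests.get):
    """Return the HTML text at `url`, raising if the request failed.

    `get` is a hypothetical hook so the function can be tried without
    network access; by default it is `requests.get`.
    """
    response = get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError on 4xx / 5xx
    return response.text
```

Usage would look like `html = fetch_html('https://en.wikipedia.org/wiki/Yann_LeCun')`.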

We can look at the HTML content we get back - this is the same HTML that your browser uses to render a page:

In [None]:
response.text[:250]

In [None]:
len(response.text)

## HTML 101

HTML is a markup language used to format text.  An HTML element will have a **tag** - common tags include:
- `<p>` paragraph
- `<h1>` heading
- `<a>` link
- `<img>` image

Tags can have **attributes** - for example the `<a>` tag usually has an `href` attribute that holds the link target:

`<a href="https://adgefficiency.com/">My personal blog</a>`

This is rendered as:

<a href="https://adgefficiency.com/">My personal blog</a>

A common attribute for HTML elements is **class** - it assigns the element to one or more CSS classes, which control how the element is styled.
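To see how a parser views tags, attributes and text, here is a minimal sketch using the standard library's `html.parser`. The `TagCollector` class is a name made up for this example, and the `class="external"` attribute has been added to the link above purely for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record each opening tag with its attributes, plus any text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        self.text.append(data)


collector = TagCollector()
collector.feed('<a href="https://adgefficiency.com/" class="external">My personal blog</a>')
collector.close()

print(collector.tags)
print(collector.text)
```

The element is split into a tag name (`a`), a dictionary of attributes (`href`, `class`) and the text between the opening and closing tags.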

## Parsing HTML

We need some way to parse this HTML text. To do this we will use **Beautiful Soup**.

We can use Beautiful Soup to search the HTML for specific tags.  First we create an instance of the `BeautifulSoup` class, passing the HTML text we got using `requests` and the name of the parser to use (here `lxml`, which requires the `lxml` package to be installed):

In [None]:
soup = BeautifulSoup(response.text, 'lxml')

#  uncomment the line below - warning - it prints out a lot!
#print(soup.prettify())

The **title** tag is a special tag required in all HTML documents:

In [None]:
soup.title

We can use Beautiful Soup to find all the `p` tags:

In [None]:
p = soup.find_all('p')

p[-1]

Or to find all the links (`a`) in a page:

In [None]:
anchors = soup.find_all('a')

anchors[-1]
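`find_all` also accepts filters, which helps on large pages. The sketch below runs on a small made-up document (`demo_html`) rather than the Wikipedia page, and uses the built-in `html.parser` so it has no extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
demo_html = """
<body>
  <p class="intro">First paragraph.</p>
  <p>Second paragraph.</p>
  <p>Third paragraph.</p>
</body>
"""
demo = BeautifulSoup(demo_html, "html.parser")

intro = demo.find_all("p", class_="intro")  # filter on the CSS class
first_two = demo.find_all("p", limit=2)     # stop after two matches
missing = demo.find("table")                # find returns None if nothing matches

print(len(intro), len(first_two), missing)
```

Note that `find` returns a single element (or `None`), while `find_all` always returns a list-like result, which is empty when nothing matches.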

## Developer tools

One useful tool in web development is the set of **Developer Tools** included in modern browsers:

![](../assets/dev1.png)

The **Inspect elements** tool allows us to find the HTML block for the biography table:

![](../assets/dev2.png)

Let's find the table:

In [None]:
table = soup.find('table', 'infobox biography vcard')

## Tables in HTML

`tr` = row

`th` = header cell

`td` = data cell

Let's take a look at the third row (**Born**):

In [None]:
rows = table.find_all('tr')
row = rows[2]
row

The header:

In [None]:
row.find('th')

The data:

In [None]:
row.find('td')

We can also get the text from these HTML elements:

In [None]:
row.find('td').text
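The raw `.text` of a cell often contains newlines, and text from nested tags is run together. A minimal sketch of cleaning this up with `get_text`, using a made-up table cell (`demo_cell`) rather than the real infobox:

```python
from bs4 import BeautifulSoup

# Made-up table row for illustration
demo_row = BeautifulSoup(
    "<table><tr><th>Born</th><td>\n<span>Example Person</span><br/>Example City\n</td></tr></table>",
    "html.parser",
)
demo_cell = demo_row.find("td")

raw = demo_cell.text                          # keeps newlines, joins pieces directly
clean = demo_cell.get_text(" ", strip=True)   # strip each piece, join with a space

print(repr(raw))
print(repr(clean))
```

The first argument to `get_text` is the separator placed between the pieces of text, and `strip=True` removes the surrounding whitespace from each piece.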

We can store this data in a dictionary:

In [None]:
data = {}

data[row.find('th').text] = row.find('td').text
data

## Exercise

Iterate over the rows in the biography table and store each row as a dictionary in a list:

## Finding links

Another common task when parsing HTML is to look for links. In HTML, links use the `a` tag.

Let's find all the links in the **References** section - which is a `div` element:

In [None]:
references = soup.find('div', 'mw-references-wrap mw-references-columns')

In [None]:
links = references.find_all('a')

li = links[1]

li

In [None]:
li['href']

In [None]:
li.text
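Not every `a` tag has an `href`, and indexing with `['href']` raises a `KeyError` when the attribute is missing. The sketch below shows two ways to handle this, on a made-up snippet (`demo_refs`) rather than the real References section.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the first anchor has no href attribute
demo_refs = BeautifulSoup(
    '<div><a id="top">Jump target</a> <a href="https://example.com/paper">A paper</a></div>',
    "html.parser",
)

demo_anchors = demo_refs.find_all("a")

# .get returns None instead of raising when the attribute is missing
hrefs = [a.get("href") for a in demo_anchors]

# href=True keeps only the anchors that have an href attribute at all
with_href = demo_refs.find_all("a", href=True)

print(hrefs)
print([a.text for a in with_href])
```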

## Exercise

Create a list of the links from the External Links section:

## Downloading images

Now that we are familiar with Beautiful Soup, we can easily find all the images in a page:

In [None]:
soup.find_all('img')

Let's download the first one - note that we use the `src` attribute, and have to prepend `'https:'` to the URL:

In [None]:
img = soup.find_all('img')[0]

url = 'https:' + img['src']

url
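Prepending `'https:'` only works when the `src` starts with `//`. As a more general alternative, the standard library's `urljoin` resolves a `src` or `href` against the address of the page it came from. The three example values below are made up to show the common forms.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Yann_LeCun"

# Hypothetical src / href values in three common forms
protocol_relative = urljoin(page_url, "//upload.wikimedia.org/example.png")
root_relative = urljoin(page_url, "/static/example.png")
already_absolute = urljoin(page_url, "https://example.com/example.png")

print(protocol_relative)
print(root_relative)
print(already_absolute)
```

In each case the result is a full URL that can be passed to `requests.get`.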

Now we can use `requests` again to get the bytes for this image:

In [None]:
res = requests.get(url)

In [None]:
res.content[:50]

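Now let's save this image to a file.

Note that we use a context manager (`with`) to automatically close the file, and the `iter_content` method to write the content in chunks.  Because the request above was made without `stream=True`, the whole image is already in memory, so `iter_content` only splits it into chunks for writing: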

In [None]:
with open('./le-cun.png', 'wb') as fi:
    for chunk in res.iter_content(100000):
        fi.write(chunk)
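For large files it can be better to avoid holding the whole body in memory. A minimal sketch of a streamed download follows, assuming `stream=True` is passed to `requests.get`. The function name `download_file` and its `get` parameter are made up for illustration.

```python
import requests


def download_file(url, path, chunk_size=100_000, get=requests.get):
    """Stream the content at `url` into the file at `path`.

    `get` is a hypothetical hook so the function can be tried without
    network access; by default it is `requests.get`.
    """
    # stream=True defers downloading the body until we iterate over it
    with get(url, stream=True, timeout=10) as res:
        res.raise_for_status()
        with open(path, "wb") as fi:
            for chunk in res.iter_content(chunk_size):
                fi.write(chunk)
    return path
```

Usage would look like `download_file(url, './le-cun.png')`, with `url` being the image address built above.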

We can now see the image (you may need to run this cell again):

![](./le-cun.png)

## Exercise - downloading XKCD comics

Now let's try another use of web scraping - downloading XKCD comics.  This exercise is taken from the excellent [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/).

The basic workflow will be to:
1. download a page (start with https://xkcd.com/)
2. find the `img` tag
3. download the image
4. find the url of the previous comic & repeat