In [None]:
!pip install -Uq requests bs4
from bs4 import BeautifulSoup
import requests

# Web Scraping

## What is web scraping?

**Web scraping is two sequential steps**
1. fetching a webpage's HTML
2. extracting data from the HTML

## Am I allowed to scrape?

*I'm not a lawyer, and don't play one on the internet*

Web scraping involves extracting data from HTML
- this HTML (& data) is publicly available through an HTTP request
- you only get back what they send you

`robots.txt` 
- a way for websites to tell crawlers & web scrapers what is and isn't allowed
- for example - https://www.theguardian.com/robots.txt
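
The standard library can read these rules for you. A minimal sketch, assuming made-up rules and a made-up user agent name (`my-scraper`); for a real site you would point the parser at the live file with `set_url(...)` and `read()` rather than parsing an inline string:

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real file lives at <site>/robots.txt
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "my-scraper" is a hypothetical user agent name
allowed = parser.can_fetch("my-scraper", "https://example.com/articles/1")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/1")
print(allowed, blocked)
```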

You should be polite
- tell the website who you are (`user-agent`)
- don't spam requests - consider adding a `time.sleep` between requests, because flooding a server is not polite
- if they offer an API, use that instead
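
One possible way to put these points into code is sketched below. The helper name `fetch_politely`, the user agent string and the one second delay are all assumptions for illustration, not a standard:

```python
import time

import requests

# Hypothetical identity; use your own project name and contact details
HEADERS = {"User-Agent": "my-course-scraper/0.1 (contact: me@example.com)"}


def fetch_politely(urls, delay=1.0, get=requests.get):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    responses = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # give the server a break between requests
        responses.append(get(url, headers=HEADERS, timeout=10))
    return responses
```

The `get` argument is only there so a stand-in function can be passed when trying the helper without touching the network.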

If you ever use data from web scraping commercially

- check for copyright
- e.g. you couldn't scrape videos from YouTube & repost them

## Fetching HTML

Web scraping is two steps
1. fetching a webpage's HTML
2. extracting data from the HTML

We will be scraping the Wikipedia page for [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) - one of the three recipients of the 2018 Turing Award for work in deep learning - the other two being [Geoffrey Hinton](https://en.wikipedia.org/wiki/Geoffrey_Hinton) and [Yoshua Bengio](https://en.wikipedia.org/wiki/Yoshua_Bengio).

Let's do an HTTP request to the Wikipedia URL:

In [None]:
response = requests.get('https://en.wikipedia.org/wiki/Yann_LeCun')

We can look at the HTML content we get back using `.text`.

This is the same HTML that your browser uses to render a page:

In [None]:
response.text[:250]
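
Before relying on `.text`, it is worth checking that the request succeeded. A minimal sketch: the `Response` objects here are built by hand (an assumption made so the example runs offline), but the `response` returned by `requests.get` supports the same attributes. `describe` is a hypothetical helper name:

```python
import requests


def describe(response):
    """Return a short summary, raising if the server reported an error."""
    response.raise_for_status()  # raises requests.HTTPError for 4xx / 5xx
    return f"OK ({response.status_code})"


# Hand-built responses standing in for real ones
good = requests.Response()
good.status_code = 200

bad = requests.Response()
bad.status_code = 404

print(describe(good))
try:
    describe(bad)
except requests.HTTPError as err:
    print("request failed:", err)
```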

## What is the web?

We think of the web as a collection of pages

- in fact, the web is a collection of users (usually web browsers, can also be servers) and servers

A better mental model for the web is a conversation between users & servers

## What is a server?

It's just a computer running a program

- e.g. a Flask app, which is a Python program

Servers can also run & be accessed locally

- this is how we use Jupyter Lab :)

## HTTP - what happens when you visit a website?

This is the kind of conversation that happens when you access a page on the internet:

*CLIENT* - request to https://www.reddit.com

*SERVER* - I'm the server hosting reddit.com - what page would you like?

*CLIENT* - Please give me https://www.reddit.com/r/MachineLearning/

*SERVER* - Sure -> sends text files

*CLIENT* - Thanks! -> renders text files in browser

This kind of conversation is had every time you access a webpage

- **it's also the same thing that happens when we do `requests.get`!**
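
`requests` can show the request it would send without contacting any server. A small sketch, using the URL from the conversation above:

```python
import requests

# Build the request but do not send it
request = requests.Request("GET", "https://www.reddit.com/r/MachineLearning/")
prepared = request.prepare()

print(prepared.method)    # the HTTP verb
print(prepared.url)       # the full URL
print(prepared.path_url)  # the page we ask the server for
```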

*Further reading*
[Interactive Data Visualization for the Web - Scott Murray](https://www.oreilly.com/library/view/interactive-data-visualization/9781449340223/) - in particular Chapter 3

## What text files are common on the internet?

What can you expect to get back when you send a request?
1. HTML (`.html`)
2. CSS (`.css`)
3. JavaScript (`.js`)

### HTML

HTML is a markup language used to format text.  

The fundamental primitive is an element.  Elements can have different tags, such as:
- `<p>` paragraph
- `<h1>` heading
- `<a>` link
- `<img>` image
- `<script>` JavaScript

These elements can be nested to create complex structures, where each element is the parent of the elements it contains.

Take a look at `example.html` to see a full HTML document.  You can also use HTML within notebooks (like this one).

HTML elements have optional attributes
- property `<a property="value">`
- class `<a class="myClass">`
- id `<a id="myID">`

Properties are usually things like color, and change how the HTML renders
- classes & IDs are used to identify elements
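
Once parsed, attributes can be read much like dictionary keys. A minimal sketch on a made-up element (Beautiful Soup is introduced properly later in this notebook, and the attribute values here are invented):

```python
from bs4 import BeautifulSoup

html = '<a id="myID" class="myClass other" href="https://example.com">a link</a>'
tag = BeautifulSoup(html, "html.parser").a

print(tag["id"])         # a single string
print(tag["class"])      # a list, because an element can have several classes
print(tag.get("title"))  # .get returns None when the attribute is missing
```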

### CSS

Used to style HTML
- you don't need to know this for web scraping

### JavaScript

Dynamic, weakly typed language

- executes in the browser
- does fancy stuff like calling APIs, dynamically rendering HTML, responding to user input

While you don't need to know JavaScript for web scraping, it is useful to look out for JSON strings
- these can hold useful information
- example - always check for a `<script type="application/ld+json">`
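
A sketch of pulling such a JSON string out of a page, assuming a made-up document with a single block (real pages may have several, or none). Beautiful Soup, used here for the lookup, is covered later in this notebook:

```python
import json

from bs4 import BeautifulSoup

# Made-up page for illustration
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Article", "headline": "A toy headline"}
</script>
</head><body></body></html>
"""

toy_soup = BeautifulSoup(html, "html.parser")
script = toy_soup.find("script", attrs={"type": "application/ld+json"})

# Guard against pages that have no such block
data = json.loads(script.string) if script is not None else {}
print(data.get("headline"))
```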

## HTML 101


Tags can have **attributes** - for example the `<a>` tag usually has an `href` attribute that holds the link:

`<a href="https://adgefficiency.com/">My personal blog</a>`

This is rendered as:

<a href="https://adgefficiency.com/">My personal blog</a>

A common attribute for HTML elements to have is a **class** - this links the element to a CSS class that defines its styling.

## Parsing HTML

We need some way to parse this HTML text - to do this we will use **Beautiful Soup**.

We can use Beautiful Soup to parse the HTML for specific tags.  First we create an instance of the `BeautifulSoup` class, taking the HTML text we got using `requests`:

In [None]:
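# name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(response.text, 'html.parser')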

The **title** tag is a special tag required in all HTML documents:

In [None]:
soup.title

We can use Beautiful Soup to find all the `p` tags:

In [None]:
p = soup.find_all('p')

p[0]

Or to find all the links (`a`) in a page:

In [None]:
a_tags = soup.find_all('a')
a_tags[-1]

## Developer tools

A useful tool in web development is the **Developer Tools** panel included in modern browsers:

![](assets/dev1.png)

The **Inspect elements** tool allows us to find the HTML block for the biography table:

![](assets/dev2.png)

Let's find the table:

In [None]:
table = soup.find('table', attrs={'class': 'infobox biography vcard'})
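
Matching the full class string is brittle: it only works while the page uses exactly those classes in that order, and `soup.find` returns `None` when nothing matches. A hedged alternative is a CSS selector that needs only one of the classes. The sketch below runs on a made-up table rather than the live page:

```python
from bs4 import BeautifulSoup

# Made-up table standing in for the real infobox
html = """
<table class="infobox biography vcard">
  <tr><th>Name</th><td>Toy Person</td></tr>
</table>
"""
toy_soup = BeautifulSoup(html, "html.parser")

# Matches any table whose classes include "infobox"
toy_table = toy_soup.select_one("table.infobox")
if toy_table is None:
    print("no infobox found - check the page with the developer tools")
else:
    print(toy_table["class"])
```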

## Tables in HTML

`tr` = row

`th` = header cell

`td` = data cell

Let's take a look at the third row (**Born**):

In [None]:
rows = table.find_all('tr')  # find_all already returns a list of rows
row = rows[2]
row

The header:

In [None]:
row.find('th')

The data:

In [None]:
row.find('td')

We can also get the text from these HTML elements:

In [None]:
row.find('td').text
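
`.text` keeps whatever whitespace the HTML contains and joins nested pieces with nothing between them. `get_text` takes a separator and a `strip` flag that usually give cleaner values. A sketch on a made-up cell:

```python
from bs4 import BeautifulSoup

# Made-up cell with a line break and stray whitespace
html = "<td>\n  1 January 1970<br/>Exampleville\n</td>"
cell = BeautifulSoup(html, "html.parser").td

raw = cell.text
clean = cell.get_text(" ", strip=True)  # join pieces with a space, trim each one

print(repr(raw))
print(repr(clean))
```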

We can store this data in a dictionary:

In [None]:
data = {}

data[row.find('th').text] = row.find('td').text
data

## Exercise - clean the biography table

Let's iterate over the rows in the biography table and store each row in a list of dictionaries:

In [None]:
#from answers import store_biography_table
#store_biography_table(rows)

## Finding links

Another common task when parsing HTML is to look for links - in HTML, links use the `a` tag.

Let's find all the links in the **References** section - which is a `div` element:

In [None]:
references = soup.find('div', 'mw-references-wrap mw-references-columns')

In [None]:
# collect the links with an explicit loop...
links = []
for link in references.find_all('a'):
    links.append(link)

# ...or build the same list in one line
links = list(references.find_all('a'))

li = links[1]

li

In [None]:
li['href']

In [None]:
li.text
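
Not every `a` tag has an `href`, and indexing with `['href']` raises a `KeyError` when it is missing. Passing `href=True` to `find_all` keeps only the anchors that have one. A sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up fragment: the middle anchor has no href
html = """
<div>
  <a href="https://example.com/paper">A paper</a>
  <a name="anchor-only">No link here</a>
  <a href="#cite_note-1">^</a>
</div>
"""
toy_soup = BeautifulSoup(html, "html.parser")

with_href = toy_soup.find_all("a", href=True)
hrefs = [a["href"] for a in with_href]
print(hrefs)
```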

## Exercise

Create a list of the links from the External Links section:

In [None]:
from answers import all_external_links

_ = all_external_links('https://en.wikipedia.org/wiki/Yann_LeCun')

## Downloading images

Now that we are familiar with Beautiful Soup, we can easily find all the images in a page:

In [None]:
soup.find_all('img')

Let's download the first one - note that we use the `src` attribute, and have to prepend `'https:'` to the URL:

In [None]:
img = soup.find_all('img')[0]
url = 'https:' + img['src']
url
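
Adding `'https:'` by hand works when `src` starts with `//`, but it would produce a broken URL for a `src` that is already absolute or that is a path on the same site. `urllib.parse.urljoin` from the standard library handles all three shapes. A sketch with made-up `src` values:

```python
from urllib.parse import urljoin

page = "https://en.wikipedia.org/wiki/Yann_LeCun"

# Made-up src values covering the common shapes
protocol_relative = urljoin(page, "//upload.example.org/img/photo.jpg")
site_relative = urljoin(page, "/static/logo.png")
already_absolute = urljoin(page, "https://example.com/a.png")

print(protocol_relative)
print(site_relative)
print(already_absolute)
```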

Now we can use `requests` again to get the bytes for this image:

In [None]:
res = requests.get(url)

In [None]:
res.content[:50]

Now let's download this image into a file.  

Note that we use a context manager (`with`) to automatically close the file, and the `iter_content` method to write the file in chunks:

In [None]:
with open('data/le-cun.png', 'wb') as fi:
    for chunk in res.iter_content(100000):
        fi.write(chunk)
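
The `open` call above fails if the `data/` folder does not exist. A small sketch that creates the folder first; `save_chunks` is a hypothetical helper name, and the chunks can come from `res.iter_content(...)`. Passing `stream=True` to `requests.get` is what makes `requests` hold off downloading the body until you iterate over it.

```python
from pathlib import Path


def save_chunks(chunks, path):
    """Write an iterable of bytes to `path`, creating parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as fi:
        for chunk in chunks:
            fi.write(chunk)
    return path


# In this notebook that could look like:
# save_chunks(res.iter_content(100000), 'data/le-cun.png')
```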

We can now see the image (you may need to run this cell again):

![](data/le-cun.png)

## Exercise (together) - scraping public-apis/public-apis

Scrape https://github.com/public-apis/public-apis
- I'd like to be able to filter APIs based on their `AUTH`
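
One way to approach this is to turn each table row into a dictionary keyed by the header cells, then filter the list. The sketch below runs on a made-up table; the rows are invented, and the column names are an assumption about the real page that should be checked with the developer tools:

```python
from bs4 import BeautifulSoup

# Made-up table; real column names may differ
html = """
<table>
  <tr><th>API</th><th>Description</th><th>Auth</th></tr>
  <tr><td>Toy Cats</td><td>Made-up cat facts</td><td>apiKey</td></tr>
  <tr><td>Toy Dogs</td><td>Made-up dog facts</td><td>No</td></tr>
</table>
"""
toy_soup = BeautifulSoup(html, "html.parser")
toy_rows = toy_soup.find("table").find_all("tr")

# The first row holds the column names
headers = [th.get_text(strip=True) for th in toy_rows[0].find_all("th")]

# Every other row becomes one dictionary
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in toy_rows[1:]
]

no_auth = [r for r in records if r["Auth"] == "No"]
print(no_auth)
```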

## Exercise (individual) - downloading XKCD comics

Now let's try another use of web scraping - downloading XKCD comics - this exercise is taken from the excellent [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/).

The basic workflow will be to:
1. download a page (start with https://xkcd.com/)
2. find the `img` tag
3. download the image
4. find the url of the previous comic & repeat
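
The loop in steps 1 to 4 can be sketched without touching the network. Everything below is a toy: the pages live in a dictionary, and the `#comic img` and `a[rel="prev"]` selectors are assumptions about the page layout that you should confirm with the developer tools. `crawl` is a hypothetical helper name, and it only collects image URLs, leaving the download step to the code shown earlier:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Toy pages standing in for a real site
PAGES = {
    "https://example.com/3/": '<div id="comic"><img src="/img/c3.png"></div>'
                              '<a rel="prev" href="/2/">Prev</a>',
    "https://example.com/2/": '<div id="comic"><img src="/img/c2.png"></div>'
                              '<a rel="prev" href="/1/">Prev</a>',
    "https://example.com/1/": '<div id="comic"><img src="/img/c1.png"></div>'
                              '<a rel="prev" href="#">Prev</a>',
}


def crawl(start_url, fetch, max_pages=10):
    """Follow 'prev' links from start_url, collecting comic image URLs."""
    image_urls = []
    url = start_url
    for _ in range(max_pages):  # hard cap so the loop always ends
        page = BeautifulSoup(fetch(url), "html.parser")
        img = page.select_one("#comic img")
        if img is not None:
            image_urls.append(urljoin(url, img["src"]))
        prev = page.select_one('a[rel="prev"]')
        if prev is None or prev.get("href") in (None, "#"):
            break  # no earlier page to visit
        url = urljoin(url, prev["href"])
    return image_urls


comic_urls = crawl("https://example.com/3/", PAGES.__getitem__)
print(comic_urls)
```

For a real site, `fetch` could be a small function that calls `requests.get(url).text`, ideally with a pause between requests.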

In [None]:
#from answers import main
#urls = main()
#main??

In [None]:
#from answers import xckd_simple
#xckd_simple()
#xckd_simple??