# 00 Introduction

## HTML

If our aim is to scrape websites, we first have to talk about HTML, because behind every web page is an HTML document. While we're not going to write any HTML in this course, we do have to know how to read it.

If you're coming from a web development background, or if you've written some HTML, this little introduction will be a breeze! And if you have no idea what HTML is or what it looks like, don't sweat! We'll start at the start.

Fire up your favourite web browser (I like Firefox), and bring up [Google](https://www.google.com):

<img src="images/google_home.png" width=500 />

Google is a great case study in HTML because it's famously minimal. To see the HTML that renders the Google home page inside the browser, right click anywhere and select `Inspect Element`:

<img src="images/google_inspect.png" width=500 />

This will bring up the "Inspector":

<img src="images/google_html.png" width=500 />

The Inspector connects each section of HTML code to each section of the displayed page. Hovering over a piece of code in the Inspector will highlight the linked element inside the browser.

## Boilerplate

There are a lot of `<angled>` brackets in HTML. The Google home page is no exception. The page is riddled with `<div>`, `<span>` and `<style>` tags, each helping, in their own way, to structure and render the result that we see inside the browser. Though Google is (relatively) simple in HTML terms, there's a lot of code in the Inspector that deserves unpacking. We won't. Instead, let's take a couple of gigantic steps back to look at, and appreciate, the minimum amount of boilerplate HTML code required to render a (blank) page:

```html
<!DOCTYPE html>
<html>
  <head>
    <title></title>
  </head>
  <body>
  </body>
</html>
```

A couple of things to note:

1. The document type is declared at the top
2. The entire page is wrapped in an `<html>` tag
3. Open tags (`<tag>`) are eventually followed by close tags (`</tag>`)
4. The page is divided into two parts (`head` and `body`)


Every HTML document is segmented into two parts:

- head: metadata and scripts and styling
- body: actual content
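
To see that open/close structure for ourselves, here is a minimal sketch that prints the nesting of the boilerplate page. It uses Python's built-in `html.parser` (not gazpacho), and the `Outline` class and `VOID_TAGS` set are names made up for this illustration:

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (a partial list, enough for this sketch)
VOID_TAGS = {"meta", "img", "br", "link", "input", "hr"}


class Outline(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # indent the tag name by how deeply it is nested
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        # a close tag takes us back up one level
        if tag not in VOID_TAGS:
            self.depth -= 1


boilerplate = """<!DOCTYPE html>
<html>
  <head>
    <title></title>
  </head>
  <body>
  </body>
</html>"""

outline = Outline()
outline.feed(boilerplate)
print("\n".join(outline.lines))
```

The printed outline shows `head` and `body` sitting side by side inside `html`, with `title` nested inside `head`.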

Here's a more complete page (still not very impressive):

In [1]:
with open('data/bad.html', 'r') as f:
    html = f.read()
    
from IPython.display import HTML; HTML(html)

The HTML document above is rendered from the following code:

In [2]:
print(html)

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>This is HTML</title>
  </head>
  <body>
    <h1>This is HTML</h1>
    <p>It's not the greatest...</p>
    <div class='foo'>...but it is <i>functional</i>.</div>
    <br />
    <div>For good measure, here's some more of it!</div>
    <p>And an image:</p>
    <img src='https://invisiblebread.com/comics-firstpanel/2015-03-03-scrape.png' height='200' />
    <p id='bar'>Isn't HTML great?!</p>
  </body>
</html>



### gazpacho

In [None]:
!pip install gazpacho

Notice the variety of tags. Now imagine we want to extract information from this page. To get all of the `<p>` tags, we first need to import `Soup` from gazpacho:

In [3]:
from gazpacho import Soup

Wrap the html string in a gazpacho Soup object:

In [4]:
soup = Soup(html)

And use the main "find" method on the target tag:

In [5]:
soup.find('p')

[<p>It's not the greatest...</p>,
 <p>And an image:</p>,
 <p id="bar">Isn't HTML great?!</p>]

By default, `find` returns a list if more than one element shares that tag, or a single `Soup` object if there's just one.
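
Because the return type depends on how many elements match, a loop over the result can break on a page that happens to have only one match. A small helper can smooth that over. This is only a sketch: `as_list` is a hypothetical name, not part of gazpacho, and it assumes that a search with no matches gives back `None`:

```python
def as_list(result):
    """Normalize a find() result to a list (hypothetical helper, not part of gazpacho)."""
    if result is None:  # assumption: nothing matched
        return []
    if isinstance(result, list):  # several matches
        return result
    return [result]  # exactly one match
```

With it, `for p in as_list(soup.find('p')):` behaves the same whether the page has zero, one, or many `<p>` tags.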

To isolate specific tags, we can pass tag attributes (`attrs`) as a Python dictionary. So, if we're interested in scraping this slice of HTML:

`<p id="bar">Isn't HTML great?!</p>` 

We can run:

In [6]:
soup.find('p', attrs={'id': 'bar'})

<p id="bar">Isn't HTML great?!</p>

To get the text inside the HTML, we can access the `.text` attribute:

In [7]:
soup.find('p', {'id': 'bar'}).text

"Isn't HTML great?!"

And to find all the `div`s on the page we can do the same thing but with `div` as the first argument:

In [8]:
soup.find('div')

[<div class="foo">...but it is <i>functional</i>.</div>,
 <div>For good measure, here's some more of it!</div>]

To get just the first `div`:

In [9]:
soup.find('div', mode='first')

<div class="foo">...but it is <i>functional</i>.</div>

And to isolate the `div` that has `class="foo"` and pull out its text (notice that the text inside the nested `<i>` tag is left out):

In [10]:
soup.find('div', {'class': 'foo'}).text

'...but it is'

You can literally isolate any tag:

In [11]:
soup.find('i').text

'functional'

But sometimes you just want to strip out the tags and keep all of the text. This is accomplished with:

In [12]:
soup.find('div', {'class': 'foo'}).remove_tags()

'...but it is functional.'

# 01 get

HTML is the stuff of websites. In reality we're not going to import documents from our computer! We're going to have to "get" HTML from a website.

To get, or download, the HTML from a specific page we'll use `get` from gazpacho:

In [13]:
from gazpacho import get

### Status Codes

Every response comes back with a status code. Here are some common ones.

Everyone is familiar with 404, and maybe 503. Importantly, 200 is the one you always want. As a rule of thumb, 4xx codes are your fault and 5xx codes are the website's fault:

- 1xx Informational
- 2xx Success
    - 200 - OK
- 3xx Redirection
- 4xx Client Error (a.k.a. **your fault**)
    - 400 - Bad Request
    - 401 - Unauthorized
    - 403 - Forbidden
    - 404 - Not Found
    - 418 - 🍵
    - 429 - Too Many Requests
- 5xx Server Error (a.k.a. **their fault**)
    - 500 - Internal Server Error
    - 501 - Not Implemented
    - 502 - Bad Gateway
    - 503 - Service Unavailable
    - 504 - Gateway Timeout
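
The first digit is what tells you whose fault it is. As a quick sketch, Python's built-in `http.HTTPStatus` can look up the official phrase for a code, and integer division by 100 gives the class. The `CLASSES` dictionary and `describe` function are names made up for this example:

```python
from http import HTTPStatus

CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe(code):
    """Return the standard phrase and class for an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    return f"{code} {phrase} ({CLASSES[code // 100]})"


for code in (200, 404, 429, 503):
    print(describe(code))
```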

Uncomment and run each cell to see what an error response looks like:

In [None]:
# get('https://httpstat.us/403')

In [None]:
# get('https://httpstat.us/404')

In [None]:
# get('https://httpstat.us/418')

### Structuring a `get` request

Sometimes we need to manipulate a URL, for example by adding query parameters, to return something different:

In [14]:
url = 'https://httpbin.org/anything?year=2020&colour=black'

get(url)

{'args': {'colour': 'black', 'year': '2020'},
 'data': '',
 'files': {},
 'form': {},
 'headers': {'Accept-Encoding': 'identity',
  'Host': 'httpbin.org',
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0',
  'X-Amzn-Trace-Id': 'Root=1-5f46cad4-249a11b862035dddbf7b59b6'},
 'json': None,
 'method': 'GET',
 'origin': '50.101.35.196',
 'url': 'https://httpbin.org/anything?year=2020&colour=black'}

To make this more Pythonic, we can pass the query parameters (and any headers) as dictionaries:

In [15]:
url = 'https://httpbin.org/anything'

r = get(
    url, 
    params={'year': 2020, 'colour': 'black'}, 
    headers={'User-Agent': 'gazpacho'}
)

r

{'args': {'colour': 'black', 'year': '2020'},
 'data': '',
 'files': {},
 'form': {},
 'headers': {'Accept-Encoding': 'identity',
  'Host': 'httpbin.org',
  'User-Agent': 'gazpacho',
  'X-Amzn-Trace-Id': 'Root=1-5f46cadf-3a9dcfdfc120cf29058da6c0'},
 'json': None,
 'method': 'GET',
 'origin': '50.101.35.196',
 'url': 'https://httpbin.org/anything?year=2020&colour=black'}
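
Notice that the `url` in both responses is identical: the dictionary simply gets encoded into the query string. To see that encoding on its own, here is a sketch using the standard library's `urllib.parse.urlencode`. It illustrates the idea and is not necessarily how gazpacho does it internally; `build_url` is a made-up name:

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append a dictionary of query parameters to a base URL."""
    return f"{base}?{urlencode(params)}"


url = build_url("https://httpbin.org/anything", {"year": 2020, "colour": "black"})
print(url)
```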

# 02 Scrape World

The problem in building a web scraping course is that the web is always changing. It could be that by the time this is published all of the examples are out of date.

So, I created a Sandbox for us at: www.scrape.world

If, for some reason, that site is down ($$), grab the repo here: https://github.com/maxhumber/scrape.world and change all the base URLs accordingly:

In [16]:
local = False

if local: 
    url = 'http://localhost:5000'
else:
    url = "https://scrape.world"
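
One way to keep that switch in a single place is a small helper that builds every page URL from the chosen base. This is just a sketch: `endpoint` and the two constants are hypothetical names, and the local address assumes the repo's app is served over plain HTTP on port 5000:

```python
BASE_REMOTE = "https://scrape.world"
BASE_LOCAL = "http://localhost:5000"  # assumption: where the local copy is served


def endpoint(path, local=False):
    """Build the full URL for a sandbox page such as 'soup' or 'spend'."""
    base = BASE_LOCAL if local else BASE_REMOTE
    return f"{base}/{path.lstrip('/')}"


print(endpoint("soup"))
print(endpoint("spend", local=True))
```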

First example: scraping all the link tags in the `section-speech` div:

In [17]:
from gazpacho import get, Soup

url = "https://scrape.world/soup"
html = get(url)
soup = Soup(html)

fos = soup.find("div", {"class": "section-speech"})

links = []
for a in fos.find("a"):
    try:
        link = a.attrs["href"]
        links.append(link)
    except (AttributeError, KeyError):
        # skip anchors that have no href attribute
        pass

# keep only the Wikipedia links
links = [link for link in links if "wikipedia.org" in link]

print(links)

['https://en.wikipedia.org/wiki/Alphabet_soup_(linguistics)', 'https://en.wikipedia.org/wiki/Alphabet', 'https://en.wikipedia.org/wiki/Abiogenesis', 'https://en.wikipedia.org/wiki/Soup_kitchen', 'https://en.wikipedia.org/wiki/Stone_soup', 'https://en.wikipedia.org/wiki/Souperism', 'https://en.wikipedia.org/wiki/Great_Famine_(Ireland)', 'https://en.wikipedia.org/wiki/Tag_soup', 'https://en.wikipedia.org/wiki/HTML']
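
The substring check is fine for this page, but it would also keep a link that merely mentions `wikipedia.org` somewhere in its path or query. A slightly stricter sketch checks the hostname instead, using the standard library's `urlparse`. The `is_wikipedia` name and the sample links are made up for illustration:

```python
from urllib.parse import urlparse


def is_wikipedia(link):
    """True when the link's host is wikipedia.org or one of its subdomains."""
    host = urlparse(link).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


sample = [
    "https://en.wikipedia.org/wiki/Stone_soup",
    "https://example.com/?ref=wikipedia.org",  # made-up link for illustration
    "/relative/path",
]
filtered = [link for link in sample if is_wikipedia(link)]
print(filtered)
```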


# 03 Tables

Scrape the total spend for each team:

In [19]:
from gazpacho import get, Soup

url = "https://scrape.world/spend"
html = get(url)
soup = Soup(html)

trs = soup.find("tr", {"class": "tmx"})


def parse_tr(tr):
    team = tr.find("td", {"data-label": "TEAM"}).text
    spend = float(
        tr.find("td", {"data-label": "TODAYS CAP HIT"}).text.replace(",", "")[1:]
    )
    return team, spend


spend = [parse_tr(tr) for tr in trs]

spend

[('Toronto Pine Needles', 95929643.0),
 ('Arizona Dingos', 87349818.0),
 ('Buffalo Knives', 86968691.0),
 ('Dallas Celebrities', 82349165.0),
 ('St. Louis Doldrums', 82862927.0),
 ('Vancouver Whales', 83580706.0),
 ('Philadelphia Travellers', 83494245.0),
 ('Boston Kodiaks', 81394166.0),
 ('Chicago Greyfalcons', 82984294.0),
 ('Vegas Shining Templars', 81833332.0),
 ('Florida Jaguars', 82432002.0),
 ('San Jose Charlatans', 81395750.0),
 ('Washington Investments', 80589294.0),
 ('Edmonton Workers', 80901164.0),
 ('Detroit Carmine Feathers', 82133668.0),
 ('Pittsburgh Puffins', 80657875.0),
 ('Carolina Cyclones', 80405665.0),
 ('Calgary Flares', 78848375.0),
 ('Nashville Carnivores', 79779643.0),
 ('Tampa Bay Thunder', 79103331.0),
 ('Minnesota Savage', 78420255.0),
 ('New York Officials', 78837300.0),
 ('Anaheim Mallards', 78173090.0),
 ('Montreal Quebecers', 79868809.0),
 ('Winnipeg Airplanes', 77652021.0),
 ('Los Angeles Monarchs', 76517727.0),
 ('New York Indwellers', 76554999.0),
 (
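
The `[1:]` slice in `parse_tr` assumes that the first character of the cell is always the currency symbol. If that feels fragile, the cleaning step can be pulled out into its own function. This is a sketch, assuming cells look like `$95,929,643` (inferred from the parsing code above); `parse_money` is a made-up name:

```python
def parse_money(text):
    """Convert a string such as '$95,929,643' to a float."""
    # drop surrounding whitespace, the dollar sign, and thousands separators
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


print(parse_money("$95,929,643"))
```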

# 04 Credentials

In [None]:
!pip install selenium
# requires some additional setup: https://stackoverflow.com/a/42231328/3731467

Logging in with credentials using Selenium:

In [20]:
%%writefile credentials.py

from gazpacho import Soup
import pandas as pd
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

url = "https://scrape.world/season"

options = Options()
options.headless = True
browser = Firefox(executable_path="/usr/local/bin/geckodriver", options=options)
browser.get(url)

# username

username = browser.find_element_by_id("username")
username.clear()
username.send_keys("admin")

# password

password = browser.find_element_by_name("password")
password.clear()
password.send_keys("admin")

# submit

browser.find_element_by_xpath("/html/body/div/div/form/div/input[3]").click()

# refetch page (just in case)

browser.get(url)

html = browser.page_source
soup = Soup(html)

tables = pd.read_html(browser.page_source)
east = tables[0]
west = tables[1]
df = pd.concat([east, west], axis=0)
df["W"] = df["W"].apply(pd.to_numeric, errors="coerce")
df = df.dropna(subset=["W"])
df = df[["Team", "W"]]
df = df.rename(columns={"Team": "team", "W": "wins"})
df = df.sort_values("wins", ascending=False)

print(df)

Writing credentials.py


In [21]:
!python credentials.py

                        team  wins
11    Washington Investments  51.9
2          Tampa Bay Thunder  50.2
12        Pittsburgh Puffins  49.1
1             Boston Kodiaks  48.5
2         Colorado Landslide  47.3
1         St. Louis Doldrums  46.6
13       New York Indwellers  46.5
15         Carolina Cyclones  45.9
3         Dallas Celebrities  44.8
3            Florida Jaguars  44.5
10          Vancouver Whales  44.1
17   Philadelphia Travellers  43.1
14       Columbus Navy Coats  43.0
5       Toronto Pine Needles  42.5
11          Edmonton Workers  41.9
12    Vegas Shining Templars  41.6
18        New York Officials  40.5
6         Winnipeg Airplanes  40.4
4       Nashville Carnivores  40.2
13            Arizona Dingos  40.1
15            Calgary Flares  39.9
8           Minnesota Savage  39.0
7        Chicago Greyfalcons  38.9
6         Montreal Quebecers  38.8
16       San Jose Charlatans  36.0
7             Buffalo Knives  36.0
17          Anaheim Mallards

# 05 Interactions 1

Interacting with dropdown elements:

In [22]:
%%writefile interactions1.py

import time
from gazpacho import Soup
import pandas as pd
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import Select

url = "https://scrape.world/results"

options = Options()
options.headless = True
browser = Firefox(executable_path="/usr/local/bin/geckodriver", options=options)
browser.get(url)

# username

username = browser.find_element_by_id("username")
username.clear()
username.send_keys("admin")
time.sleep(0.5)

# password

password = browser.find_element_by_name("password")
password.clear()
password.send_keys("admin")
time.sleep(0.5)

# submit

browser.find_element_by_xpath("/html/body/div/div/form/div/input[3]").click()
time.sleep(0.5)

# refetch page (just in case)

browser.get(url)

search = browser.find_element_by_xpath("/html/body/div/div/div[2]/div[2]/label/input")
search.clear()
search.send_keys("toronto")
time.sleep(0.5)

drop_down = Select(
    browser.find_element_by_xpath("/html/body/div/div/div[2]/div[1]/label/select")
)
drop_down.select_by_visible_text("100")
time.sleep(0.5)

html = browser.page_source
soup = Soup(html)
df = pd.read_html(str(soup.find("table")))[0]

print(df)

Writing interactions1.py


In [24]:
!python interactions1.py

    day                      away  ...  goals_home extra_time_loss
0     1        Ottawa Legislators  ...           5               0
1     3      Toronto Pine Needles  ...           1               0
2     4        Montreal Quebecers  ...           5               1
3     6        St. Louis Doldrums  ...           2               0
4     9         Tampa Bay Thunder  ...           3               0
5    11      Toronto Pine Needles  ...           2               0
6    14          Minnesota Savage  ...           4               0
7    15      Toronto Pine Needles  ...           4               0
8    18            Boston Kodiaks  ...           4               1
9    20       Columbus Navy Coats  ...           3               1
10   21      Toronto Pine Needles  ...           4               0
11   24       San Jose Charlatans  ...           4               0
12   25      Toronto Pine Needles  ...           5               0
13   28    Washington Investments  ...          
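
The fixed `time.sleep(0.5)` pauses work here, but they are a guess: too short on a slow connection, wasted time on a fast one. An alternative is to poll until the thing we're waiting for is actually there. Selenium ships its own `WebDriverWait` for this; the sketch below is a generic, standard-library version of the same idea, and `wait_until` is a made-up name:

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

In the script above, something like `wait_until(lambda: browser.find_elements_by_tag_name("table"))` could stand in for a sleep before reading `browser.page_source`.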

# 06 Interactions 2

Scrolling on the page to load more data:

In [25]:
%%writefile interactions2.py

from gazpacho import Soup
import pandas as pd
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys

base = "https://scrape.world/"
endpoint = "population"
url = base + endpoint

options = Options()
options.headless = True
browser = Firefox(executable_path="/usr/local/bin/geckodriver", options=options)
browser.get(url)

poplist = browser.find_element_by_id('infinite-list')

days = 365
n = 0
while n < days:
    browser.execute_script('arguments[0].scrollTop = arguments[0].scrollHeight', poplist)
    html = browser.page_source
    soup = Soup(html)
    n = len(soup.find('ul', {'id': 'infinite-list'}).find('li'))

lis = soup.find('ul', {'id': 'infinite-list'}).find('li')

def parse_li(li):
    day, population = li.text.split(' Population ')
    population = int(population)
    day = int(day.split('Day ')[-1])
    return {'day': day, 'population': population}

population = [parse_li(li) for li in lis][:days]
print(population[:20])

Writing interactions2.py


In [26]:
!python interactions2.py
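
One thing to watch with a loop like this: if the page never loads enough items, `while` will spin forever. Here is a sketch of the same scroll-and-count pattern with a safety limit. The function takes the counting and scrolling steps as arguments so it can be tried without a browser; `scroll_until` and `max_scrolls` are made-up names:

```python
def scroll_until(count_items, scroll, target, max_scrolls=100):
    """Call scroll() until count_items() reaches target or max_scrolls is used up."""
    n = count_items()
    scrolls = 0
    while n < target and scrolls < max_scrolls:
        scroll()
        scrolls += 1
        n = count_items()
    return n


# Stand-in for the page: each "scroll" loads 10 more items
items = list(range(10))
loaded = scroll_until(lambda: len(items), lambda: items.extend(range(10)), target=35)
print(loaded)
```

In the real script, `scroll` would wrap the `browser.execute_script(...)` call and `count_items` would re-parse `browser.page_source` and count the `li` tags.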

# 07 Downloading

Downloading multimedia elements:

In [27]:
from pathlib import Path
from shutil import rmtree as delete
from urllib.request import urlretrieve as download
from gazpacho import get, Soup

media_dir = "media"  # avoid shadowing the built-in dir()
Path(media_dir).mkdir(exist_ok=True)

base = "https://scrape.world"
url = base + "/books"
html = get(url)
soup = Soup(html)

# download images

imgs = soup.find("img")
srcs = [i.attrs["src"] for i in imgs]

for src in srcs:
    name = src.split("/")[-1]
    download(base + src, f"{media_dir}/{name}")

# download audio

audio = soup.find("audio").find("source").attrs["src"]
name = audio.split("/")[-1]
download(base + audio, f"{media_dir}/{name}")

# download video

video = soup.find("video").find("source").attrs["src"]
name = video.split("/")[-1]
download(base + video, f"{media_dir}/{name}")

# clean up (removes the downloaded files again)

delete(media_dir)
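
Gluing `base + src` together works as long as every `src` is a path that starts with `/`. On other sites a `src` might be a full URL or carry a query string. A more forgiving sketch uses `urljoin` and `urlparse` from the standard library; `resolve` is a made-up name and the `src` values below are hypothetical:

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse


def resolve(page_url, src):
    """Return (absolute URL, file name) for a src attribute found on page_url."""
    url = urljoin(page_url, src)
    # take the file name from the path only, ignoring any query string
    name = PurePosixPath(urlparse(url).path).name
    return url, name


# Hypothetical src values for illustration
image = resolve("https://scrape.world/books", "/static/cover.png")
clip = resolve("https://scrape.world/books", "https://cdn.example.com/a/clip.mp4?v=2")
print(image)
print(clip)
```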

# 08 Scheduling (Local)

Scrape book prices to start:

In [34]:
%%writefile books.py

from gazpacho import get, Soup
import pandas as pd

def parse(book):
    name = book.find("h4").text
    price = float(book.find("p").text[1:].split(" ")[0])
    return name, price

def fetch_books():
    url = "https://scrape.world/books"
    html = get(url)
    soup = Soup(html)
    books = soup.find("div", {"class": "book-"})
    return [parse(book) for book in books]

data = fetch_books()
books = pd.DataFrame(data, columns=["title", "price"])

string = f"Current Prices:\n```\n{books.to_markdown(index=False, tablefmt='grid')}\n```"

print(string)

Writing books.py


**Scheduling** 

To schedule this script to execute at some cadence, we'll use `hickory` (`pip install hickory`):

```
hickory schedule books.py --every=30seconds
```

To check the status:

```
hickory status
```

And to kill:

```
hickory kill books.py
```

**Moving to Slack**

To send results to Slack instead of printing to a log file, we'll use `slackclient`, the official Slack API client for Python:

```
pip install slackclient
```

### Generating a Slack API token 

In order to build a Slack Bot, we'll need a Slack API token, which will require us to do the following:

1. Create a new Slack App

Create a [Slack App](https://api.slack.com/apps) by following the link, and clicking **Create New App**.

2. Add permissions

In the menu on the left, find **OAuth and Permissions**. Click it, and scroll down to the **Scopes** section. Click **Add an OAuth Scope**.

Search for the **chat:write** and **chat:write.public** scopes, and add them. At this point, you can install the app to your workspace.

3. Copy the token to a `.env` file

On the same page you'll find your access token under the label **Bot User OAuth Access Token**. Copy this token, and save it to a `.env` file under the name `SLACK_API_TOKEN`, which is the variable the script below reads.
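
If the token is missing, `os.environ["SLACK_API_TOKEN"]` fails with a bare `KeyError`, which isn't much help when a scheduled job dies quietly. A small sketch of a friendlier lookup (`require_env` is a made-up name, not part of `python-dotenv` or `slackclient`):

```python
import os


def require_env(name):
    """Return an environment variable or fail with a readable message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

After `load_dotenv(find_dotenv())` has run, `require_env("SLACK_API_TOKEN")` would return the token or explain what is missing.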

In [35]:
%%writefile booksbot.py

import os
import sqlite3

from gazpacho import get, Soup
from dotenv import find_dotenv, load_dotenv # pip install python-dotenv
import pandas as pd
from slack import WebClient # pip install slackclient

load_dotenv(find_dotenv())

con = sqlite3.connect("data/books.db")
cur = con.cursor()

slack_token = os.environ["SLACK_API_TOKEN"]
client = WebClient(token=slack_token)

def parse(book):
    name = book.find("h4").text
    price = float(book.find("p").text[1:].split(" ")[0])
    return name, price

def fetch_books():
    url = "https://scrape.world/books"
    html = get(url)
    soup = Soup(html)
    books = soup.find("div", {"class": "book-"})
    return [parse(book) for book in books]

data = fetch_books()
books = pd.DataFrame(data, columns=["title", "price"])
books['date'] = pd.Timestamp("now")

books.to_sql('books', con, if_exists='append', index=False)
average = pd.read_sql("select title, round(avg(price),2) as average from books group by title", con)
df = pd.merge(books[['title', 'price']], average)

string = f"Current Prices:\n```\n{df.to_markdown(index=False, tablefmt='grid')}\n```"

response = client.chat_postMessage(
    channel="books",
    text=string
)

Overwriting booksbot.py
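
The interesting part of the bot is that every run appends a fresh set of prices, so the `group by` query averages over the whole history. To watch that happen without Slack or the live site, here is a sketch using an in-memory SQLite database and made-up titles and prices:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway database for the demo
con.execute("create table books (title text, price real, date text)")

# Two pretend scrapes of the same (made-up) titles at different prices
rows = [
    ("Book A", 10.0, "2020-08-01"),
    ("Book B", 20.0, "2020-08-01"),
    ("Book A", 14.0, "2020-08-02"),
    ("Book B", 25.0, "2020-08-02"),
]
con.executemany("insert into books values (?, ?, ?)", rows)

# Same aggregation as the bot, with an explicit ordering for stable output
query = (
    "select title, round(avg(price), 2) as average "
    "from books group by title order by title"
)
averages = con.execute(query).fetchall()
print(averages)
con.close()
```

Each title comes back once, paired with the mean of every price recorded for it so far.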


Schedule with `hickory schedule booksbot.py --every=30seconds`

# 09 Serverless (Lambda)

In [None]:
# insert here: demand + AWS Lambda (Chalice)