<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Web-scraping-with-requests-and-BeautifulSoup" data-toc-modified-id="Web-scraping-with-requests-and-BeautifulSoup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Web scraping with <code>requests</code> and <code>BeautifulSoup</code></a></span></li><li><span><a href="#Scraping-best-practices" data-toc-modified-id="Scraping-best-practices-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Scraping best practices</a></span></li><li><span><a href="#Walkthrough-of-the-scraping-process" data-toc-modified-id="Walkthrough-of-the-scraping-process-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Walkthrough of the scraping process</a></span><ul class="toc-item"><li><span><a href="#The-document-object-model-(DOM)" data-toc-modified-id="The-document-object-model-(DOM)-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>The document object model (DOM)</a></span></li><li><span><a href="#Accessing-the-DOM-in-Python:-the-bs4-package-and-BeautifulSoup" data-toc-modified-id="Accessing-the-DOM-in-Python:-the-bs4-package-and-BeautifulSoup-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Accessing the DOM in <code>Python</code>: the <code>bs4</code> package and <code>BeautifulSoup</code></a></span></li><li><span><a href="#Finding-the-<table>" data-toc-modified-id="Finding-the-<table>-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Finding the <code>&lt;table&gt;</code></a></span></li><li><span><a href="#Extracting-data-from-the-<table>-rows" data-toc-modified-id="Extracting-data-from-the-<table>-rows-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Extracting data from the <code>&lt;table&gt;</code> rows</a></span></li></ul></li><li><span><a href="#Putting-it-all-together..." 
data-toc-modified-id="Putting-it-all-together...-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Putting it all together...</a></span><ul class="toc-item"><li><span><a href="#Complete-code-for-script-file" data-toc-modified-id="Complete-code-for-script-file-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Complete code for script file</a></span></li></ul></li></ul></div>

# Web scraping with `requests` and `BeautifulSoup`

Sometimes the data you want or need exists in structured form on the web, but can't be accessed via an API or a database connection. A good example of this would be a table on a website. Can we obtain this data efficiently and automatically by code, rather than a laborious manual process? Yes, via a technique known as **web scraping**. We will examine an example using the `requests` and `BeautifulSoup` packages. 

We are going to look at an example in which we scrape data relating to avalanche fatalities in Utah. Let's have a look at the website in question. For reasons that will become clear later, we are going to split the URL of the site into `base` and `route` parts. We obtain the full URL simply by concatenating the two strings

In [1]:
base = 'https://utahavalanchecenter.org/'
route = 'avalanches/fatalities'
url =  base + route
url

'https://utahavalanchecenter.org/avalanches/fatalities'

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 5 mins</u>**

Explore this website. In particular, answer the following:

* What variables does the table contain? (Assume we don't wish to scrape the links to each event)
* Is all the data presented on one page, or split over multiple pages?

**Solution**

![Utah avalanche site first page](images/avalanches_site.png)

* Leaving aside the links, we have `Date`, `Region`, `Trigger` and `# Killed`
* The complete data is spread over multiple pages. We navigate through pages using the `< Previous` and `Next >` links near the bottom of the page

***

<hr style="border:8px solid black"> </hr>

# Scraping best practices

The robots exclusion protocol (REP) was developed in 1994 to govern the scraping and crawling behaviour of **robots** deployed by organisations that adhere to the protocol. In this context, a robot is an instance of automated software that attempts to map a website and/or obtain data from it. The REP suggests that a special file `robots.txt` be placed at the root (or base) of a site hierarchy. The file groups its rules by the robots they apply to (`User-agent`, where `*` means all robots), lists areas of the site to be avoided (paths, directories and files) using the `Disallow` keyword, and often specifies a `Crawl-delay`. Definitions of `Crawl-delay` vary as it is non-standard, but a reasonable interpretation is the number of seconds that should elapse between successive requests.

Best practices include:

* Obey the REP as far as possible: read the `robots.txt` for the site you are going to scrape (if it exists) and avoid disallowed areas of the site.
* Scraping can be rough on web servers, so try to scrape at off-peak hours and implement any `Crawl-delay` specified in `robots.txt`
* Above all, consider the data you are scraping. Is it ethical to scrape it? Once scraped, will it then be used for ethical purposes of which the originating organisation would approve?
* Obey any terms and conditions to which you agree in order to use a site. See **[this article](https://www.aima.org/journal/aima-journal---q3-2015-edition/article/data-scraping---everybody-else-was-doing-it--so-i-thought-it-was-ok-.html)** on the legal implications of web scraping

Here is the `robots.txt` file from the Utah Avalanches site. You can see it at https://utahavalanchecenter.org/robots.txt

![Robots.txt file](images/robots_txt.png)
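We can also read a `robots.txt` file in code rather than by eye. Here is a minimal sketch using the standard library's `urllib.robotparser`. The file content in `sample_robots_txt` and the `example.org` URLs are invented for illustration: they are **not** the real file from the Utah Avalanche Center site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example
sample_robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
"""

parser = RobotFileParser()
# parse() takes the lines of the file, so no network request is made here
parser.parse(sample_robots_txt.splitlines())

# Is a given user agent allowed to fetch a given URL?
allowed = parser.can_fetch('*', 'https://example.org/avalanches/fatalities')
blocked = parser.can_fetch('*', 'https://example.org/admin/settings')

# Crawl-delay for this user agent (None if the file does not set one)
delay = parser.crawl_delay('*')

print(allowed, blocked, delay)
```

For a live site you could instead call `parser.set_url(...)` followed by `parser.read()` to download the file before querying it.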

# Walkthrough of the scraping process

To start with we will walk through the scraping process. Later we will see how to assemble the code we write into a form that could be placed in a `Python` script to be run from the command line. 

First we need to create a `header` to be sent with our request. Typically, we set the `User-Agent` value in the header to mimic a web browser on a particular operating system. You can see the user agent for your browser by going to **httpbin.org/user-agent**

You can either copy the `User-Agent` for your browser, or just use the one below

In [2]:
user_agent_desktop = 'Mozilla/5.0 (X11; Linux x86_64) '\
    'AppleWebKit/537.36 (KHTML, like Gecko) '\
    'Chrome/88.0.4324.96'

headers = {"User-Agent": user_agent_desktop}
headers

{'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96'}

Let's send a request to the site, passing the headers we created above

In [3]:
import requests
response = requests.get(url, headers=headers)

If we look at the response text, we will see that the server has sent `HTML` back to us. What do we do with this `HTML`?!

In [4]:
response.text

'<!DOCTYPE html>\n<html lang="en">\n<head>\n<title>Avalanche Fatalities - Utah Avalanche Center</title>\n<meta name="viewport" content="width=device-width, initial-scale=1.0">             \n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="shortcut icon" href="https://utahavalanchecenter.org/sites/default/files/images/uac-site/uac-favicon.png" type="image/png" />\n<link type="text/css" rel="stylesheet" href="https://utahavalanchecenter.org/sites/default/files/css/css_xE-rWrJf-fncB6ztZfd2huxqgxu4WO-qwma6Xer30m4.css" media="all" />\n<link type="text/css" rel="stylesheet" href="https://utahavalanchecenter.org/sites/default/files/css/css_tcqXHDMHRqtAPwNCTJBf-bQZ7knqzs48NDfFsr31Pkg.css" media="all" />\n<link type="text/css" rel="stylesheet" href="https://utahavalanchecenter.org/sites/default/files/css/css_j_ng14rDsucP7t_nwmN6YQxzyuuXDjxPRwFsC1vg8q0.css" media="all" />\n<link type="text/css" rel="stylesheet" href="https://utahavalanchecenter.org/sites/default/

## The document object model (DOM)

When a browser receives an HTML document like the one above, it creates from it a hierarchical structure (a ‘tree-like’ structure) known as the document object model (DOM).

* Each item in the tree is an **element**
* Elements have **attributes** (data) and **methods** (functions or behaviour), corresponding to objects in object-oriented programming languages (the ‘O’ in DOM)
* The tree represents a hierarchy, within which elements can have:
    - **parents** (elements above them in the tree)
    - **siblings** (elements at the same level as themselves in the tree)
    - **children** (elements below them in the tree)


![The document object model](images/DOM.png)
*The DOM model by Birger Eriksson, distributed under [CC Attribution-Share Alike 3.0 Unported](https://creativecommons.org/licenses/by-sa/3.0/deed.en) licence*
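To get a feel for this hierarchy, here is a small sketch that prints the nesting of elements in a snippet of `HTML` as an indented outline. It uses only the standard library's `html.parser` module, and both the `TreePrinter` class and the `sample_html` snippet are made up for illustration. It only tracks nesting depth: it is not a full DOM implementation.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Records each opening tag, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # a new element is a child of whatever element is currently open
        self.lines.append('  ' * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # closing an element moves us back up to its parent
        self.depth -= 1

# Hypothetical, well-formed snippet invented for this example
sample_html = ('<html><body><div><table><tr>'
               '<th>Date</th><th>Region</th>'
               '</tr></table></div></body></html>')

printer = TreePrinter()
printer.feed(sample_html)
print('\n'.join(printer.lines))
```

In the outline, the two `th` lines share the same indentation (they are siblings), and the `tr` line directly above them is their parent.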


You can see the DOM for a page in most browsers by activating the `DOM Inspector`. In `Chrome` you can either right-click on a page and choose `Inspect` or press `Ctrl+Shift+i` or `Cmd+Shift+i` depending on operating system.

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 5 mins</u>**

Give this a try yourself on the Utah Avalanche Fatalities page.

* Open the `DOM Inspector`
* Move the pointer around over the `Elements` tab of the `DOM Inspector` and see which parts of the page are highlighted.
* Can you find your way down through `<div>` elements to find the `<table>` containing the data?

**Solution**

![The DOM inspector](images/DOM_inspector.png)

***

<hr style="border:8px solid black"> </hr>

## Accessing the DOM in `Python`: the `bs4` package and `BeautifulSoup`

The data we need is held in the `<table>` element of the DOM. So now we need a way to let `Python` access the DOM: this is where the `BeautifulSoup` class of the `bs4` package comes in.

We pass in the response text, and tell `BeautifulSoup()` to parse the text as `HTML`

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, features='html.parser')

Let's have a look at what is generated

In [6]:
soup

<!DOCTYPE html>

<html lang="en">
<head>
<title>Avalanche Fatalities - Utah Avalanche Center</title>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<link href="https://utahavalanchecenter.org/sites/default/files/images/uac-site/uac-favicon.png" rel="shortcut icon" type="image/png"/>
<link href="https://utahavalanchecenter.org/sites/default/files/css/css_xE-rWrJf-fncB6ztZfd2huxqgxu4WO-qwma6Xer30m4.css" media="all" rel="stylesheet" type="text/css"/>
<link href="https://utahavalanchecenter.org/sites/default/files/css/css_tcqXHDMHRqtAPwNCTJBf-bQZ7knqzs48NDfFsr31Pkg.css" media="all" rel="stylesheet" type="text/css"/>
<link href="https://utahavalanchecenter.org/sites/default/files/css/css_j_ng14rDsucP7t_nwmN6YQxzyuuXDjxPRwFsC1vg8q0.css" media="all" rel="stylesheet" type="text/css"/>
<link href="https://utahavalanchecenter.org/sites/default/files/css/css_kkbUDYNW1y5GNXAF9-P7R8g2-uaWYuMMAE-amvBaiU4.css

This doesn't look much better than the plain text to be honest. But the difference now is that `BeautifulSoup` equips all the elements in `soup` with the **attributes** and **methods** we discussed above. Let's see an example: we'll find all the `<div>` elements in `soup` using the `.find_all()` method

In [7]:
divs = soup.find_all(name='div')
len(divs)

114

We get back a list-like `ResultSet` of `<div>` elements, which we can index just like a `list`. Let's have a look at the attributes of the first `<div>` in the list

In [8]:
first_div = divs[0]
first_div.attrs

{'class': ['sm-max-width-4',
  'md-max-width-4',
  'lg-max-width-4',
  'x-lg-max-width-4',
  'mx-auto']}

and at the text it contains

In [9]:
first_div.text

'\n\n\n\n\n\n\n\n\n\n\n\n\n\nicon-add\n          Observation\n        \n\nMenu\n\n\n\n×\n\nMenu\n\nForecasts        \n\nLogan\nOgden\nSalt Lake\nProvo\nUintas\nSkyline\nMoab\nAbajos\nWeather\nArchives\nHow to read the forecast\n\n\n\n\nObservations & Avalanches        \n\nSubmit Observation\nAll Observations\nAvalanches\nFatalities\nPlace Names Map\nArchives\n\n\n\n\nEducation        \n\nUAC & KBYG Classes\nOther Classes\nOnline Avalanche Courses\nKnow Before You Go\nResources & Tutorials\nBackcountry Emergencies\nBeacon Practice\nFAQ\nEncyclopedia\nClass Cancellation Policy\n\n\n\n\nEvents\n\n\nStore\n\n\nAbout        \n\nContact\nWho we are\nStaff\nBoard of Directors\nPast Forecasters\nSponsors\nAnnual Reports\nSign In\n\n\n\n\nBlog\n\nMake a Donation\n\n\n\n\nSearch \n\n\n\n\n\n\n\n\n\n\nForecasts\n\nLogan\nOgden\nSalt Lake\nProvo\nUintas\nSkyline\nMoab\nAbajos\nWeather\nArchives\nHow to read the forecast\n\n\n\nObservations & Avalanches\n\nSubmit Observation\nAll Observations\nAval
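If you'd like to experiment with `.find_all()`, `.attrs` and text extraction without sending any requests, a tiny hand-written page is enough. The sketch below assumes nothing about the real site: `sample_html` and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, invented for this example
sample_html = """
<div class="outer wide">
  <p>Intro</p>
  <div class="inner"><span>Hello</span> world</div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, features='html.parser')

# find_all() searches the whole tree, so nested <div> elements are found too
sample_divs = sample_soup.find_all(name='div')
print(len(sample_divs))

# 'class' can hold several values, so BeautifulSoup returns it as a list
print(sample_divs[0].attrs)

# get_text() gathers the text of the element and all of its descendants
print(sample_divs[1].get_text())
```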

## Finding the `<table>`

Now, how do we find our way through the DOM to the `<table>` containing the data we want to scrape? The easiest way, if possible, is to search for some text that the table contains that occurs nowhere else on the page. Let's look for the text `Region`: 

In [10]:
header_text = soup.find(text='Region')
header_text

'Region'

What is the `.parent` of `header_text`?

In [11]:
header_th = header_text.parent
header_th

<th>Region</th>

It's a **table header** element `<th>`. Next, the parent of the table header is a **table row** `<tr>`

In [12]:
header_row = header_th.parent
header_row

<tr>
<th>Date</th>
<th>Region</th>
<th></th>
<th>Trigger</th>
<th># Killed</th>
</tr>

The parent of the header row is a `<thead>` element

In [13]:
head = header_row.parent
head

<thead>
<tr>
<th>Date</th>
<th>Region</th>
<th></th>
<th>Trigger</th>
<th># Killed</th>
</tr>
</thead>

Finally, the parent of the head is the whole `<table>` element

In [14]:
table = head.parent
table

<table>
<thead>
<tr>
<th>Date</th>
<th>Region</th>
<th></th>
<th>Trigger</th>
<th># Killed</th>
</tr>
</thead>
<tbody>
<tr>
<td class="views-field views-field-field-occurrence-date">
<span class="date-display-single">2/6/2021</span> </td>
<td class="views-field views-field-field-region-forecaster nowrap">
      Salt Lake    </td>
<td class="views-field views-field-title">
<a href="/avalanche/58959">Accident: Wilson Glade</a> </td>
<td class="views-field views-field-field-trigger">
      Skier    </td>
<td class="views-field views-field-field-killed views-align-right">
      4    </td>
</tr>
<tr>
<td class="views-field views-field-field-occurrence-date">
<span class="date-display-single">1/30/2021</span> </td>
<td class="views-field views-field-field-region-forecaster nowrap">
      Salt Lake    </td>
<td class="views-field views-field-title">
<a href="/avalanche/58594">Accident: Squaretop</a> </td>
<td class="views-field views-field-field-trigger">
      Skier    </td>
<td class="views

Hopefully this makes it clear how the DOM hierarchy works, but I think we can agree that it's a cumbersome way to work. Instead, let's use the `.find_parent()` method to look for the **nearest** ancestor of the 'Region' text that has the name 'table'

In [15]:
table = soup.find(text='Region').find_parent(name='table')
table

<table>
<thead>
<tr>
<th>Date</th>
<th>Region</th>
<th></th>
<th>Trigger</th>
<th># Killed</th>
</tr>
</thead>
<tbody>
<tr>
<td class="views-field views-field-field-occurrence-date">
<span class="date-display-single">2/6/2021</span> </td>
<td class="views-field views-field-field-region-forecaster nowrap">
      Salt Lake    </td>
<td class="views-field views-field-title">
<a href="/avalanche/58959">Accident: Wilson Glade</a> </td>
<td class="views-field views-field-field-trigger">
      Skier    </td>
<td class="views-field views-field-field-killed views-align-right">
      4    </td>
</tr>
<tr>
<td class="views-field views-field-field-occurrence-date">
<span class="date-display-single">1/30/2021</span> </td>
<td class="views-field views-field-field-region-forecaster nowrap">
      Salt Lake    </td>
<td class="views-field views-field-title">
<a href="/avalanche/58594">Accident: Squaretop</a> </td>
<td class="views-field views-field-field-trigger">
      Skier    </td>
<td class="views
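To convince ourselves that the two approaches land on the same element, here is a small offline check. The `sample_html` snippet is a cut-down, hand-written imitation of the table, not the real page. It also uses `string=` rather than `text=` when searching: recent versions of `BeautifulSoup` prefer `string=`, and both spellings search on the text of the document.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down imitation of the table, invented for this example
sample_html = """
<div class="view-content">
  <table>
    <thead><tr><th>Date</th><th>Region</th></tr></thead>
    <tbody><tr><td>2/6/2021</td><td>Salt Lake</td></tr></tbody>
  </table>
</div>
"""

sample_soup = BeautifulSoup(sample_html, features='html.parser')
header_text = sample_soup.find(string='Region')

# Climb one level at a time: text -> <th> -> <tr> -> <thead> -> <table>
step_by_step = header_text.parent.parent.parent.parent

# Or jump straight to the nearest enclosing <table>
direct = header_text.find_parent(name='table')

print(direct.name)
print(step_by_step is direct)
```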

This is much cleaner, but we might still run into problems if the site changes in the future and the text 'Region' ever occurs elsewhere on the page. It might be better to look for something within the structure of the elements themselves that identifies the table.

Does the table have any attributes we can use to locate it?

In [16]:
table.attrs

{}

Nope. But what about the parent of the table?

In [17]:
table.parent

<div class="view-content">
<table>
<thead>
<tr>
<th>Date</th>
<th>Region</th>
<th></th>
<th>Trigger</th>
<th># Killed</th>
</tr>
</thead>
<tbody>
<tr>
<td class="views-field views-field-field-occurrence-date">
<span class="date-display-single">2/6/2021</span> </td>
<td class="views-field views-field-field-region-forecaster nowrap">
      Salt Lake    </td>
<td class="views-field views-field-title">
<a href="/avalanche/58959">Accident: Wilson Glade</a> </td>
<td class="views-field views-field-field-trigger">
      Skier    </td>
<td class="views-field views-field-field-killed views-align-right">
      4    </td>
</tr>
<tr>
<td class="views-field views-field-field-occurrence-date">
<span class="date-display-single">1/30/2021</span> </td>
<td class="views-field views-field-field-region-forecaster nowrap">
      Salt Lake    </td>
<td class="views-field views-field-title">
<a href="/avalanche/58594">Accident: Squaretop</a> </td>
<td class="views-field views-field-field-trigger">
      Skie

Nice! We see that it's a `<div>` with `class='view-content'`. Is there more than one such `<div>`?

In [18]:
len(soup.find_all(name='div', attrs={'class': 'view-content'}))

1

Nope, it's unique! So we can find the table in this way

In [19]:
table = soup.find(name='div', attrs={'class': 'view-content'}).find(name='table')
table

<table>
<thead>
<tr>
<th>Date</th>
<th>Region</th>
<th></th>
<th>Trigger</th>
<th># Killed</th>
</tr>
</thead>
<tbody>
<tr>
<td class="views-field views-field-field-occurrence-date">
<span class="date-display-single">2/6/2021</span> </td>
<td class="views-field views-field-field-region-forecaster nowrap">
      Salt Lake    </td>
<td class="views-field views-field-title">
<a href="/avalanche/58959">Accident: Wilson Glade</a> </td>
<td class="views-field views-field-field-trigger">
      Skier    </td>
<td class="views-field views-field-field-killed views-align-right">
      4    </td>
</tr>
<tr>
<td class="views-field views-field-field-occurrence-date">
<span class="date-display-single">1/30/2021</span> </td>
<td class="views-field views-field-field-region-forecaster nowrap">
      Salt Lake    </td>
<td class="views-field views-field-title">
<a href="/avalanche/58594">Accident: Squaretop</a> </td>
<td class="views-field views-field-field-trigger">
      Skier    </td>
<td class="views
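As an aside, `BeautifulSoup` also supports CSS selectors through `.select()` and `.select_one()`, which can express the same 'table inside the `view-content` div' search in one string. This is an alternative sketch rather than the approach used in the rest of this walkthrough, and it runs on an invented `sample_html` snippet, not on the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down imitation of the page, invented for this example
sample_html = """
<div class="sidebar"><table><tr><th>Other</th></tr></table></div>
<div class="view-content">
  <table>
    <thead><tr><th>Date</th><th>Region</th></tr></thead>
  </table>
</div>
"""

sample_soup = BeautifulSoup(sample_html, features='html.parser')

# Chained find() calls, as in the walkthrough
via_find = sample_soup.find(name='div', attrs={'class': 'view-content'}).find(name='table')

# The same search written as a CSS selector
via_select = sample_soup.select_one('div.view-content table')

column_names = [th.get_text() for th in via_select.find_all(name='th')]
print(column_names)
print(via_find is via_select)
```

Both forms return `None` somewhere along the way if the `<div>` is missing, so in a script it is worth checking the result before using it.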

## Extracting data from the `<table>` rows

Now we've found the table, let's extract the rows and then look at how to pull the data from them.

In [20]:
rows = table.find_all(name='tr')
rows[0]

<tr>
<th>Date</th>
<th>Region</th>
<th></th>
<th>Trigger</th>
<th># Killed</th>
</tr>

The first row contains table header cells `<th>`. What about the second row?

In [21]:
rows[1]

<tr>
<td class="views-field views-field-field-occurrence-date">
<span class="date-display-single">2/6/2021</span> </td>
<td class="views-field views-field-field-region-forecaster nowrap">
      Salt Lake    </td>
<td class="views-field views-field-title">
<a href="/avalanche/58959">Accident: Wilson Glade</a> </td>
<td class="views-field views-field-field-trigger">
      Skier    </td>
<td class="views-field views-field-field-killed views-align-right">
      4    </td>
</tr>

Nice, this contains table data cells `<td>`. So, for every row in the table, we want to extract the `<td>` elements. Let's do this for the second row

In [22]:
cells = rows[1].find_all(name='td')
cells

[<td class="views-field views-field-field-occurrence-date">
 <span class="date-display-single">2/6/2021</span> </td>,
 <td class="views-field views-field-field-region-forecaster nowrap">
       Salt Lake    </td>,
 <td class="views-field views-field-title">
 <a href="/avalanche/58959">Accident: Wilson Glade</a> </td>,
 <td class="views-field views-field-field-trigger">
       Skier    </td>,
 <td class="views-field views-field-field-killed views-align-right">
       4    </td>]

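Now let's look at the first cell more closely. It looks like the `class` attribute will be useful in labelling what the cell contains. We'll store these class names in a variable called `tags` (strictly speaking, a 'tag' in `HTML` is the element markup itself, such as `<td>`, but we'll keep the shorter name here)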

In [23]:
first_cell = cells[0]
tags = first_cell.get('class')
tags

['views-field', 'views-field-field-occurrence-date']

Now for the actual data. Let's try accessing the `stripped_strings` property of the cell: this yields each piece of text held in the cell, stripped of leading and trailing whitespace. One wrinkle is that it returns a `generator` object rather than a string

In [24]:
first_cell.stripped_strings

<generator object Tag.stripped_strings at 0x7f1e46289b30>

But we can consume the generator and concatenate its pieces into a single string using `''.join()`:

In [25]:
''.join(first_cell.stripped_strings)

'2/6/2021'
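Here is a short offline sketch of what is going on, plus one caveat. The first cell below copies the markup of the date cell shown above; the second cell (`multi_html`) is invented to show what happens when a cell holds more than one piece of text. `get_text(strip=True)` is a `BeautifulSoup` shortcut that gives the same result as joining `stripped_strings` with an empty string.

```python
from bs4 import BeautifulSoup

cell_html = ('<td class="views-field views-field-field-occurrence-date">\n'
             '<span class="date-display-single">2/6/2021</span> </td>')
cell = BeautifulSoup(cell_html, features='html.parser').find(name='td')

# A generator can be consumed by list() or by str.join()
pieces = list(cell.stripped_strings)
joined = ''.join(cell.stripped_strings)
shortcut = cell.get_text(strip=True)
print(pieces, joined, shortcut)

# Hypothetical cell with two separate pieces of text
multi_html = '<td><b>Salt</b> <i>Lake</i></td>'
multi = BeautifulSoup(multi_html, features='html.parser').find(name='td')

# Joining with '' glues the pieces together; a space keeps them apart
glued = ''.join(multi.stripped_strings)
spaced = ' '.join(multi.stripped_strings)
print(glued, '|', spaced)
```

For the cells in our table each cell holds a single piece of text, so joining with `''` is fine, but it is worth bearing the difference in mind for other sites.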

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

Get the tags and stripped strings of the **second** cell. 

**Solution**

In [26]:
second_cell = cells[1]
tags = second_cell.get('class')
tags

['views-field', 'views-field-field-region-forecaster', 'nowrap']

In [27]:
''.join(second_cell.stripped_strings)

'Salt Lake'

***

<hr style="border:8px solid black"> </hr>


Alright! Now we're getting somewhere. For each cell, it looks like:

* the **label** of what the cell contains can be found in the tags; specifically, the tag starting with `views-field-field-`
* the **data** can be obtained by joining the `stripped_strings`

Let's write a function that, given the cells on a row, extracts the labels and data from those cells. We'll hold the row data in a `dictionary`, given it will take the form of a set of key-value pairs

In [28]:
def get_row_data(cells):
    row_data = {}
    tag_start = 'views-field-field-'
    for cell in cells:
        tags = cell.get('class')
        for tag in tags:
            if tag.startswith(tag_start):
                # label goes from just after 'views-field-field-' to end of string
                label = tag[len(tag_start):]
                data = ''.join(cell.stripped_strings)
                row_data[label] = data
    return row_data

Let's try it out on our collection of cells from earlier

In [29]:
get_row_data(cells)

{'occurrence-date': '2/6/2021',
 'region-forecaster': 'Salt Lake',
 'trigger': 'Skier',
 'killed': '4'}

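Great, this dictionary contains all the data from the first data row of the table (the second `<tr>`, just after the header row)! Notice that the cell holding the link is skipped, because its class `views-field-title` does not start with `views-field-field-`.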

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

Try running the `get_row_data()` function on the cells extracted from the **third row** of the table. 

[**Hint** - here's the code we used to get all the cells from the second row of the table `cells = rows[1].find_all(name='td')`. What do you need to change to get the cells from the third row?]

**Solution**

In [30]:
cells = rows[2].find_all(name='td')
get_row_data(cells)

{'occurrence-date': '1/30/2021',
 'region-forecaster': 'Salt Lake',
 'trigger': 'Skier',
 'killed': '1'}

***

<hr style="border:8px solid black"> </hr>

What next? Well, now we need to run the function we just wrote over **all rows in the table**. Let's write another function to do this. The function will take in a set of rows, and return the data in the form of a `list` of `dictionaries`: one `dictionary` for each row.

In [31]:
def get_table_data(rows):
    table_data = []
    for row in rows:
        cells = row.find_all(name='td')
        # row_data will be a dictionary
        row_data = get_row_data(cells)
        table_data.append(row_data)
    return table_data

Now the moment of truth! If we run this function on all rows, we expect to get a `list` of `dictionaries` back, one `dictionary` for each row

In [32]:
get_table_data(rows)

[{},
 {'occurrence-date': '2/6/2021',
  'region-forecaster': 'Salt Lake',
  'trigger': 'Skier',
  'killed': '4'},
 {'occurrence-date': '1/30/2021',
  'region-forecaster': 'Salt Lake',
  'trigger': 'Skier',
  'killed': '1'},
 {'occurrence-date': '1/8/2021',
  'region-forecaster': 'Salt Lake',
  'trigger': 'Skier',
  'killed': '1'},
 {'occurrence-date': '1/18/2020',
  'region-forecaster': 'Ogden',
  'trigger': 'Snowmobiler',
  'killed': '1'},
 {'occurrence-date': '12/15/2019',
  'region-forecaster': 'Salt Lake',
  'trigger': 'Snowboarder',
  'killed': '1'},
 {'occurrence-date': '2/9/2019',
  'region-forecaster': 'Uintas',
  'trigger': 'Snowmobiler',
  'killed': '1'},
 {'occurrence-date': '2/7/2019',
  'region-forecaster': 'Southwest',
  'trigger': 'Snowmobiler',
  'killed': '1'},
 {'occurrence-date': '1/25/2019',
  'region-forecaster': 'Moab',
  'trigger': 'Snowmobiler',
  'killed': '1'},
 {'occurrence-date': '1/18/2019',
  'region-forecaster': 'Skyline',
  'trigger': 'Skier',
  'killed'

Nearly, but notice there is an **empty `dictionary`** at the start of the list. This comes from the header row of the table (which contains `<th>` elements, but no `<td>` elements). Let's add a check to trim it out

In [33]:
def get_table_data(rows):
    table_data = []
    for row in rows:
        cells = row.find_all(name='td')
        if cells:
            row_data = get_row_data(cells)
            table_data.append(row_data)
    return table_data

In [34]:
results = get_table_data(rows)
results

[{'occurrence-date': '2/6/2021',
  'region-forecaster': 'Salt Lake',
  'trigger': 'Skier',
  'killed': '4'},
 {'occurrence-date': '1/30/2021',
  'region-forecaster': 'Salt Lake',
  'trigger': 'Skier',
  'killed': '1'},
 {'occurrence-date': '1/8/2021',
  'region-forecaster': 'Salt Lake',
  'trigger': 'Skier',
  'killed': '1'},
 {'occurrence-date': '1/18/2020',
  'region-forecaster': 'Ogden',
  'trigger': 'Snowmobiler',
  'killed': '1'},
 {'occurrence-date': '12/15/2019',
  'region-forecaster': 'Salt Lake',
  'trigger': 'Snowboarder',
  'killed': '1'},
 {'occurrence-date': '2/9/2019',
  'region-forecaster': 'Uintas',
  'trigger': 'Snowmobiler',
  'killed': '1'},
 {'occurrence-date': '2/7/2019',
  'region-forecaster': 'Southwest',
  'trigger': 'Snowmobiler',
  'killed': '1'},
 {'occurrence-date': '1/25/2019',
  'region-forecaster': 'Moab',
  'trigger': 'Snowmobiler',
  'killed': '1'},
 {'occurrence-date': '1/18/2019',
  'region-forecaster': 'Skyline',
  'trigger': 'Skier',
  'killed': '1'

Finally, let's convert this `list` of `dictionaries` to a `pandas` `DataFrame`. Happily, `pandas` straightforwardly accepts data in this form

In [35]:
import pandas as pd
results = pd.DataFrame(results)
results.head()

| | occurrence-date | region-forecaster | trigger | killed |
|---|---|---|---|---|
| 0 | 2/6/2021 | Salt Lake | Skier | 4 |
| 1 | 1/30/2021 | Salt Lake | Skier | 1 |
| 2 | 1/8/2021 | Salt Lake | Skier | 1 |
| 3 | 1/18/2020 | Ogden | Snowmobiler | 1 |
| 4 | 12/15/2019 | Salt Lake | Snowboarder | 1 |
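One thing to be aware of: everything we scraped is text, so the dates and counts in the `DataFrame` are strings. Here is a minimal sketch of one way to convert them, assuming the dates are always in month/day/year form as in the rows seen so far. The names `sample_rows` and `typed` are invented for this example, and the three rows are copied from the output above.

```python
import pandas as pd

# First three scraped rows, copied from the results shown earlier
sample_rows = [
    {'occurrence-date': '2/6/2021', 'region-forecaster': 'Salt Lake',
     'trigger': 'Skier', 'killed': '4'},
    {'occurrence-date': '1/30/2021', 'region-forecaster': 'Salt Lake',
     'trigger': 'Skier', 'killed': '1'},
    {'occurrence-date': '1/8/2021', 'region-forecaster': 'Salt Lake',
     'trigger': 'Skier', 'killed': '1'},
]

typed = pd.DataFrame(sample_rows)

# Parse the dates (assumed to be month/day/year) and the counts
typed['occurrence-date'] = pd.to_datetime(typed['occurrence-date'], format='%m/%d/%Y')
typed['killed'] = typed['killed'].astype(int)

print(typed.dtypes)
print(typed['killed'].sum())
```

With proper types in place, sorting by date or summing fatalities behaves as you would expect.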


If we wanted to, we could then save this to a file using

In [36]:
results.to_csv("avalanche_fatalities.csv")

# Putting it all together...

We've scraped all the data from one page of the site, but remember we saw earlier that there were more pages available, accessible via the `Next >` and `< Previous` links at the bottom of each page. How do we extend our code to scrape the data from all pages?

Well, we can also scrape the link for the next page from each page! If we follow that link, we should be able to scrape the data from each page in turn. Let's write a function to find the link to the next page on the current page 

In [37]:
def get_next_page_route(soup):
    # soup.find() returns None if there is no 'Next >' text, so check before using .parent
    next_page_text = soup.find(text='Next >')
    if next_page_text:
        next_page_route = next_page_text.parent.attrs.get('href', '')
        if 'page' in next_page_route:
            return next_page_route
    return None

In [38]:
get_next_page_route(soup)

'/avalanches/fatalities?page=1'

Hopefully you see now why we have split the URL for the site into `base` and `route`. The base will always stay the same, whereas the `route` changes for each subsequent page of results.
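One small wrinkle: the scraped route starts with `/` and our `base` ends with `/`, so plain string concatenation gives a doubled slash in the URL. Many servers tolerate this, but the standard library's `urllib.parse.urljoin` combines the two parts cleanly. This is an optional sketch, and the variable names `first_route`, `next_route`, `naive_url`, `joined_first` and `joined_next` are introduced just for this example.

```python
from urllib.parse import urljoin

base = 'https://utahavalanchecenter.org/'
first_route = 'avalanches/fatalities'          # no leading slash
next_route = '/avalanches/fatalities?page=1'   # leading slash, as scraped

# Plain concatenation doubles the slash for the scraped route
naive_url = base + next_route
print(naive_url)

# urljoin handles both forms of route
joined_first = urljoin(base, first_route)
joined_next = urljoin(base, next_route)
print(joined_first)
print(joined_next)
```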

## Complete code for script file

In [39]:
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup

################################################################################################
# User configurable

base = 'https://utahavalanchecenter.org/'
route = 'avalanches/fatalities'
# note the trailing spaces: adjacent string literals are joined with no separator
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "\
               "AppleWebKit/537.36 (KHTML, like Gecko) "\
               "Chrome/87.0.4280.88 Safari/537.36"}
crawl_delay = 10

################################################################################################
# Do not change below

def get_row_data(cells):
    row_data = {}
    tag_start = 'views-field-field-'
    for cell in cells:
        tags = cell.get('class')
        for tag in tags:
            if tag.startswith(tag_start):
                # label goes from just after 'views-field-field-' to end of string
                label = tag[len(tag_start):]
                data = ''.join(cell.stripped_strings)
                row_data[label] = data
    return row_data

def get_table_data(rows):
    table_data = []
    for row in rows:
        cells = row.find_all(name='td')
        if cells:
            row_data = get_row_data(cells)
            table_data.append(row_data)
    return table_data

def get_next_page_route(soup):
    # soup.find() returns None if there is no 'Next >' text, so check before using .parent
    next_page_text = soup.find(text='Next >')
    if next_page_text:
        next_page_route = next_page_text.parent.attrs.get('href', '')
        if 'page' in next_page_route:
            return next_page_route
    return None

results = []
while route:
    print("Scraping route: " + route)
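    # scraped routes start with '/', and base ends with one: avoid a doubled slash
    url = base.rstrip('/') + '/' + route.lstrip('/')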
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, features='html.parser')
    table = soup.find(name='div', attrs={'class': 'view-content'}).find(name='table')
    rows = table.find_all(name='tr')
    table_data = get_table_data(rows)
    if table_data:
        results.extend(table_data)
    print("Length of results = " + str(len(results)))
    route = get_next_page_route(soup)
    time.sleep(crawl_delay)
    
results = pd.DataFrame(results)
results.to_csv("avalanche_fatalities.csv")
print("Done.")

Scraping route: avalanches/fatalities
Length of results = 25
Scraping route: /avalanches/fatalities?page=1
Length of results = 50
Scraping route: /avalanches/fatalities?page=2
Length of results = 75
Scraping route: /avalanches/fatalities?page=3
Length of results = 100
Scraping route: /avalanches/fatalities?page=4
Length of results = 108
Done.


<hr style="border:8px solid black"> </hr>

***

**<u>Task - 5 mins</u>**

Examine the code above and try to follow the logic and flow of execution. In particular, try to answer the following questions:

* What does the `while route:` loop do?
* Why have we inserted a call to `time.sleep()`?
* In what form is the data output by the code?

**Solution**

* The `while route:` loop directs scraping over multiple pages of results. While there is another `route` to scrape data from, the code will keep running. It will stop when `route` is `None`.
* The `time.sleep()` call implements the `crawl_delay` requested in the `robots.txt` file.
* The data is saved to a CSV file.
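To see the control flow of the `while route:` loop in isolation, here is a sketch that swaps the network request and the parsing for a plain dictionary lookup. The dictionary `fake_next_route` is a made-up stand-in for `get_next_page_route()` on a pretend three-page site.

```python
# Hypothetical stand-in for the site: maps each route to the next one,
# with None marking the last page
fake_next_route = {
    'avalanches/fatalities': '/avalanches/fatalities?page=1',
    '/avalanches/fatalities?page=1': '/avalanches/fatalities?page=2',
    '/avalanches/fatalities?page=2': None,
}

route = 'avalanches/fatalities'
visited = []
while route:
    # in the real script this is where we request and parse the page
    visited.append(route)
    # None is falsy, so the loop ends after the last page
    route = fake_next_route[route]

print(visited)
```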

***

<hr style="border:8px solid black"> </hr>