# Scraping the web with Python

Adam Mansur

NMNH Department of Mineral Sciences

Smithsonian Carpentries Community Chat

23 Feb 2022 

## Department of Mineral Sciences Collections
<img src="img/collections.jpg" style="width: 80%;" alt="A mosaic of geologic specimens and related objects">
<br />
Consists of 600k objects, including rocks, minerals, and meteorites

## Scraping basics

Scraping is the process of **extracting data from web pages intended to be viewed in a browser**

### When to scrape

Consider scraping if:

+ Data is only available via a browser-based website
+ There is a lot of data, or the data is frequently updated
+ Scraping isn't prohibited by the website
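
One machine-readable hint about what a site allows is its `robots.txt` file, although it is not a substitute for reading the terms of use. Below is a minimal sketch using the standard library's `urllib.robotparser`. The rules, the bot name, and the example.org URLs are made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from a real website)
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "friendly-bot", "https://example.org/search?q=1"))
print(is_allowed(ROBOTS_TXT, "friendly-bot", "https://example.org/private/data"))
```

Against a live site, `set_url()` followed by `read()` on the parser would download the file instead of parsing a string.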

### How to scrape

Write a script to:

1. Retrieve a web page
2. Parse and extract data from HTML

Lots of options, but we'll focus on two libraries today:

+ [`requests`](https://docs.python-requests.org/en/latest/) to retrieve web pages
+ [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse and extract data from HTML

### Making a request

Two types of HTTP requests can be used to retrieve pages:

#### GET

+ All info needed for request encoded as parameters in the URL
+ Easily cacheable

#### POST

+ Parameters passed in the request body
+ May not be cacheable
+ Need a special tool to inspect the request (like a browser's inspect tool)
+ Often used with forms
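
The difference shows up in where the parameters end up. The sketch below uses the standard library's `urllib` (not `requests`) to build both kinds of request without sending anything. The example.org endpoint is hypothetical, and the two parameters are only sample values.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://example.org/search"  # hypothetical endpoint
params = {"sea": "Lafayette (stone)", "lrec": 50}

# GET: the parameters are encoded into the URL itself
get_req = Request(f"{BASE_URL}?{urlencode(params)}", method="GET")

# POST: the same parameters travel in the request body as bytes
post_req = Request(BASE_URL, data=urlencode(params).encode("ascii"), method="POST")

# Nothing is sent over the network; we only inspect the prepared requests
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Because the GET URL carries everything needed to repeat the request, it can be bookmarked, shared, and cached; the POST URL alone cannot.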

### Be nice

A well-behaved scraper should:

+ Identify itself when making a request
+ Cache requests as much as possible
+ Respect terms of use
+ Respect rate limits
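
Respecting rate limits usually comes down to pausing between requests. Here is a minimal sketch of a throttle. The `Throttle` class and the interval used are assumptions for illustration; use whatever limit the site documents.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep if the last request was too recent. Returns seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


# Call wait() immediately before each request to the server
throttle = Throttle(min_interval=0.05)
for page in range(3):
    paused = throttle.wait()
    print(f"page {page}: paused {paused:.2f}s before requesting")
```

Responses served from a local cache never reach the server, so only uncached requests need to be throttled.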


### Reading and parsing HTML

General structure of HTML:

<span style="color: green">&lt;parent&gt;</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: green">&lt;tag</span>
<span style="color: blue">attribute="value"</span><span style="color: green">&gt;</span>content<span style="color: green">&lt;/tag&gt;</span><br />
<span style="color: green">&lt;/parent&gt;</span>

#### Making a link

<!-- How to make a hyperlink -->
<a href="https://naturalhistory.si.edu">NMNH</a>
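
To see tags, attributes, and content from a program's point of view, the sketch below pulls the `href` attribute and the text content out of the link above. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs. `LinkParser` is an illustrative name.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Content between the opening and closing tags
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


parser = LinkParser()
parser.feed('<p><a href="https://naturalhistory.si.edu">NMNH</a></p>')
print(parser.links)
```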

#### Example page

[HTML file showing basic structure and common tags](example.htm)

## Example: Meteorite Bulletin Database

<img src="img/lafayette.jpg" style="width: 75%;" alt="A gray meteorite with flow lines on a red background">
<br />
Lafayette (<a href="https://n2t.net/ark:/65665/3b4f9f988-dac3-4e07-b29b-8fc4e731e5be">USNM 1505</a>).
Photo by Chip Clark, Smithsonian.

### Project description

#### Website

The [Meteorite Bulletin Database](https://www.lpi.usra.edu/meteor/metbull.php) (MBDB), which contains canonical data about all known meteorites, including classifications and coordinates

#### Goal

Align records in the NMNH collections database with MBDB

### Is scraping a viable approach?

Yes, because MBDB:

+ Is not available through an API
+ Contains tens of thousands of records
+ Is updated frequently with new meteorites, updated classifications, etc.
+ Does not provide a license (but does not state restrictions either)

### Workflow

1. **Evaluate the website**
    1. Verify that necessary info appears in the HTML
    2. Determine GET vs. POST
    3. Determine query parameters/payload
    4. Determine how to select element(s)
     
    
2. **Scrape data using a script**
    1. Request page
    2. Parse HTML
    3. Find and extract info from HTML

### Evaluate the website

Visit the [website](https://www.lpi.usra.edu/meteor/metbull.php) and:

+ Verify that necessary info appears in the HTML
+ Determine GET vs. POST
+ Determine query parameters/payload
+ Determine how to select element(s)



#### Results

+ Verify that necessary info appears in the HTML: **Yes**
+ Determine GET vs. POST: **Both supported, we'll use GET**
+ Determine query parameters/payload: **Get this from query string** 
+ Determine how to select element(s): **Use id="maintable"**

### Build the scraper

In [1]:
# Import libraries needed for the scraper
import re
import time

import html5lib
import pandas as pd
import requests
import requests_cache
from bs4 import BeautifulSoup


# Provide an email address for the user agent
EMAIL = ""

# Cache requests to minimize trips to the server
requests_cache.install_cache()

### Request a page

Here is the URL needed to retrieve the Lafayette meteorite:

<a href="https://www.lpi.usra.edu/meteor/metbull.php?sea=Lafayette+%28stone%29&sfor=names&ants=&nwas=&falls=&valids=&stype=contains&lrec=50&map=dm&browse=&country=All&srt=name&categ=All&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=0"><span style="color: blue">https://www.lpi.usra.edu/meteor/metbull.php</span>?<span style="color: green;">sea=Lafayette+%28stone%29&sfor=names&ants=&nwas=&falls=&valids=&stype=contains&lrec=50&map=dm&browse=&country=All&srt=name&categ=All&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=0</span></a>

The URL has two main parts:

+ <span style="color: blue">**Page address:**</span> https://www.lpi.usra.edu/meteor/metbull.php
+ <span style="color: green">**Query string:**</span> Everything after the `?`, consisting of `key=val` pairs joined by ampersands
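
The same split can be done programmatically, which is a quick way to read off the keys of a long query string. This is a sketch using the standard library's `urllib.parse` on the Lafayette URL; the variable names are illustrative.

```python
from urllib.parse import parse_qsl, urlsplit

url = (
    "https://www.lpi.usra.edu/meteor/metbull.php"
    "?sea=Lafayette+%28stone%29&sfor=names&ants=&nwas=&falls=&valids="
    "&stype=contains&lrec=50&map=dm&browse=&country=All&srt=name"
    "&categ=All&mblist=All&rect=&phot=&strewn=&snew=0"
    "&pnt=Normal%20table&dr=&page=0"
)

parts = urlsplit(url)

# Page address: everything before the ?
address = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Query string: decode key=val pairs, keeping keys with empty values
query = dict(parse_qsl(parts.query, keep_blank_values=True))

print(address)
for key, val in query.items():
    print(f"{key!r}: {val!r}")
```

Note that decoding turns `Lafayette+%28stone%29` back into plain text, and that every value comes back as a string.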

To make the request, first format the query string to work with `requests`:

In [2]:
# Map key=val params from query string to a dict
params = {
    "sea": "Lafayette (stone)",  # as plain text, requests handles encoding
    "sfor": "names",
    "ants": "",
    "falls": "",
    "valids": "",
    "stype": "contains",
    "lrec": 50,
    "map": "dm",
    "browse": "",
    "country": "All",
    "srt": "name",
    "categ": "All",
    "mblist": "All",
    "rect": "",
    "phot": "",
    "strewn": "",
    "snew": 0,
    "pnt": "Normal table",
    "dr": "",
    "page": 0,
}

Next, define a custom User-Agent to identify our application to the server:

In [3]:
# Add a user agent identifying the scraper
headers = {"User-Agent": f"a friendly nmnh bot // {EMAIL}".rstrip("/ ")}

Now make the GET request:

In [4]:
# Make GET request using requests
resp = requests.get(
    "https://www.lpi.usra.edu/meteor/metbull.php", headers=headers, params=params
)

# If the request fails, throw an error
resp.raise_for_status()

# Since we're using GET, we can also show the URL. This can be useful for
# debugging if the request does not go as planned.
resp.url

'https://www.lpi.usra.edu/meteor/metbull.php?sea=Lafayette+%28stone%29&sfor=names&ants=&falls=&valids=&stype=contains&lrec=50&map=dm&browse=&country=All&srt=name&categ=All&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal+table&dr=&page=0'

In [5]:
# The content of the website can be accessed using the text attribute on the
# Response object, which in this case is a standard blob of HTML
print(resp.text[:1000])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<LINK REL="SHORTCUT ICON" href="favicon.ico">
<meta name="robots" content="index,nofollow"/>
  <title>Meteoritical Bulletin: Search the Database</title>
  <link rel="stylesheet" href="metbull.css" type="text/css">
  <link rel="alternate" type="application/rss+xml" title="Meteoritical Bulletin RSS Feed" href="https://www.lpi.usra.edu/meteor/meteorite-rss.php">
  <link href="Suggest/css/style.css" rel="stylesheet" type="text/css">
  <!--[if gte IE 6]>
  <link rel="stylesheet" type="text/css" href="Suggest/i_hate_IE.css" />
  <![endif]-->
<script src="https://maps.googleapis.com/maps/api/js?key=AIzaSyDCiu15YF6OkcnQfXovYCsj5BzXSU2vKaY&libraries=drawing" type="text/javascript" asynch defer></script>
<script src="TileMaps2.js"></script>
<style> #map {height: 100%;}</style>
</head>
<body style="background-color: #eeeeee; margin-top: 0

Or alternatively a POST request:

In [6]:
# Make POST request using the requests library
resp = requests.post(
    "https://www.lpi.usra.edu/meteor/metbull.php", headers=headers, data=params
)

# If the request fails, throw an error
resp.raise_for_status()

# Response text is the same but the request URL is less useful for POST
resp.url

'https://www.lpi.usra.edu/meteor/metbull.php'

### Parse website HTML

BeautifulSoup can use the HTML parser included with Python (`html.parser`), but two optional parsers are available as well:

+ **html5lib** is slower but very forgiving with malformed HTML
+ **lxml** is faster but less forgiving

In [7]:
# Parse HTML using BeautifulSoup and html5lib
soup = BeautifulSoup(resp.text, "html5lib")

### Find and extract data

Use the **id="maintable"** attribute we identified earlier:

In [8]:
# Extract the table using the id attribute
table = soup.find(id="maintable")  # find returns the first matching tag

# Pretty print the first few hundred characters of the HTML
print(table.prettify()[:500])

<table border="1" cellpadding="4" cellspacing="0" class="tablefat sortable" id="maintable">
 <caption style="text-align: left">
 </caption>
 <tbody>
  <tr>
   <th align="left" class="insidehead" scope="col">
    <span style="white-space: nowrap;">
     Name
     <a href="https://www.lpi.usra.edu/meteor/notes.php?note=1">
      <img alt="help" src="qmark.gif"/>
     </a>
    </span>
   </th>
   <th align="left" class="insidehead" scope="col">
    <span style="white-space: nowrap;">
     Status
  


In [9]:
# Extract list of rows from the table. The top row contains column headers,
# every following row is a single meteorite.
rows = table.find_all("tr")  # find_all returns all matching tags

# Pretty print the first few hundred characters of the first row
print(rows[0].prettify()[:500])

<tr>
 <th align="left" class="insidehead" scope="col">
  <span style="white-space: nowrap;">
   Name
   <a href="https://www.lpi.usra.edu/meteor/notes.php?note=1">
    <img alt="help" src="qmark.gif"/>
   </a>
  </span>
 </th>
 <th align="left" class="insidehead" scope="col">
  <span style="white-space: nowrap;">
   Status
   <a href="https://www.lpi.usra.edu/meteor/notes.php?note=3">
    <img alt="help" src="qmark.gif"/>
   </a>
  </span>
 </th>
 <th align="left" class="insidehead" scope="col">


Now convert each row to a dict:

In [10]:
# Get columns from the table headers. Column names appear as text, not tags,
# and can be accessed using the text attribute.
cols = [c.text.strip() for c in rows[0].find_all("th")]

# Use the columns to create dicts for each row in the table
data = []
for row in rows[1:]:
    cells = [c.text.strip() for c in row.find_all("td")]
    data.append(dict(zip(cols, cells)))

data

[{'Name': 'Lafayette (stone)',
  'Status': 'Official',
  'Year': '1931',
  'Place': 'Indiana, USA',
  'Type': 'Martian (nakhlite)',
  'Mass': '800 g',
  '(Lat,Long)': "(40° 25'N, 86° 53'W)",
  'Notes': ''}]
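
One thing to watch with `dict(zip(cols, cells))`: values are paired with headers by position, and `zip` stops at the shorter input, so a row with a missing cell would be mislabeled without any error. Whether MBDB ever returns such rows is an assumption here. The helper `row_to_dict` is a hypothetical defensive variant that raises instead.

```python
def row_to_dict(cols, cells):
    """Pair column names with cell values, refusing mismatched lengths."""
    if len(cols) != len(cells):
        raise ValueError(f"expected {len(cols)} cells, got {len(cells)}: {cells!r}")
    return dict(zip(cols, cells))


cols = ["Name", "Status", "Year"]
print(row_to_dict(cols, ["Lafayette (stone)", "Official", "1931"]))

# With the Status cell missing, plain zip would file the year under "Status"
# and silently drop the "Year" key. Here the row is reported instead.
try:
    row_to_dict(cols, ["Lafayette (stone)", "1931"])
except ValueError as err:
    print("Skipped malformed row:", err)
```

On Python 3.10 or later, `zip(cols, cells, strict=True)` performs a similar length check.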

### Combining data from multiple pages

First wrap the code above inside a function so we can easily re-use it:

In [11]:
def parse_mbdb_page(headers=None, **kwargs):
    """Parses results table from an MBDB search

    Parameters
    ----------
    headers : dict
        header to pass to request
    kwargs :
        search parameters for the MBDB search

    Returns
    -------
    list
        list of meteorite data as dicts

    Raises
    ------
    requests.HTTPError if request fails
    """

    # List of available keys with defaults. Any of these keys can be
    # overwritten by passing a keyword argument with the same key.
    params = {
        "sea": "*",
        "sfor": "names",
        "ants": "",
        "falls": "",
        "valids": "",
        "stype": "contains",
        "lrec": 50,
        "map": "dm",
        "browse": "",
        "country": "All",
        "srt": "name",
        "categ": "All",
        "mblist": "All",
        "rect": "",
        "phot": "",
        "strewn": "",
        "snew": 0,
        "pnt": "Normal table",
        "dr": "",
        "page": 0,
    }

    # Update params based on kwargs provided by the user
    params.update(kwargs)

    # Make GET request using the requests library
    resp = requests.get(
        "https://www.lpi.usra.edu/meteor/metbull.php", headers=headers, params=params
    )

    # Raise an error if the request fails
    resp.raise_for_status()

    # Parse HTML using BeautifulSoup and the html5lib parser
    soup = BeautifulSoup(resp.text, "html5lib")

    # Get rows from the result table
    rows = soup.find(id="maintable").find_all("tr")

    # Get column names from the first row
    cols = [c.text.strip() for c in rows[0].find_all("th")]

    # Convert all other rows in the table to dicts
    mets = []
    for row in rows[1:]:

        # Get all cells in this row
        cells = row.find_all("td")

        # Convert row to a dict
        met = dict(zip(cols, [c.text.strip() for c in cells]))

        # Remove footnote indicators from meteorite name
        met["Name"] = met["Name"].rstrip("* ")

        # Break lat and long into separate columns
        try:
            lat, lng = met["(Lat,Long)"].strip("()").split(", ")
            met["Lat"] = lat.strip()
            met["Long"] = lng.strip()
        except ValueError:
            pass
        del met["(Lat,Long)"]

        # Add optional columns if missing from table. These columns only
        # appear on some pages and throw off the column mapping if missing.
        for col in ("Abbrev", "Antarctic"):
            met.setdefault(col, "")

        # If a publication is specified, pull citation from title text. Use
        # get() because the MetBull column does not appear on every page.
        if met.get("MetBull"):
            met["MetBull"] = cells[cols.index("MetBull")].find("a")["title"]

        mets.append(met)

    return mets

Now use this function to combine data from multiple pages:

In [12]:
# Compile data for meteorites found in the United States
us_mets = []

lrec = 500  # number of records per page (this is the key MBDB uses)
page = 1  # number of the page to start on (0 and 1 are the same for MBDB)

while True:

    # Pass the headers defined earlier so each request identifies the scraper
    parsed = parse_mbdb_page(
        headers=headers, country="United States", lrec=lrec, page=page
    )
    us_mets.extend(parsed)

    # Stop iterating when the number of records returned is not the same
    # as the number of records per page. This is needed because MBDB will
    # return the last set of results indefinitely when you go beyond the
    # last page of results. Note that this check will fail if the number
    # of records in the result set happens to be an exact multiple of
    # the records per page.
    if len(parsed) != lrec:
        break

    page += 1
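
As the comment in the loop notes, the length check never triggers when the total number of records is an exact multiple of `lrec`. One possible workaround is to also stop when a page repeats the previous one. In this sketch, `fetch_page` is a hypothetical callable standing in for `parse_mbdb_page`, and `collect_pages`, `fake_fetch`, and `max_pages` are illustrative names rather than part of the scraper.

```python
def collect_pages(fetch_page, lrec, start=1, max_pages=1000):
    """Collect records page by page until the results run out."""
    records = []
    previous = None
    for page in range(start, start + max_pages):
        parsed = fetch_page(page)

        # An empty page, or one identical to the previous page, means we
        # have gone past the last page of results
        if not parsed or parsed == previous:
            break
        records.extend(parsed)

        # A short page is the last page
        if len(parsed) < lrec:
            break
        previous = parsed
    return records


def fake_fetch(page):
    """Mimic a server that repeats the last page when asked for more."""
    pages = {
        1: [{"Name": "A"}, {"Name": "B"}],
        2: [{"Name": "C"}, {"Name": "D"}],
    }
    return pages.get(page, pages[2])


# Four records at two per page: an exact multiple of the page size
print(collect_pages(fake_fetch, lrec=2))
```

The `max_pages` cap is a second safety net so that the loop always terminates.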

#### Results

In [13]:
# Display results as a dataframe
pd.DataFrame(us_mets).head()

|   | Name | Abbrev | Status | Fall | Year | Place | Type | Mass | MetBull | Notes | Lat | Long | Antarctic |
|---|------|--------|--------|------|------|-------|------|------|---------|-------|-----|------|-----------|
| 0 | Abbott | | Official | | 1951 | New Mexico, USA | H3-6 | 21.1 kg | Meteoritical Bulletin, no. 37, Moscow (1966) | | 36° 18'N | 104° 17'W | |
| 1 | Abernathy | | Official | | 1941 | Texas, USA | L6 | 2.91 kg | | | 33° 51'N | 101° 48'W | |
| 2 | Achilles | | Official | | 1924 | Kansas, USA | H5 | 16 kg | | | 39° 46' 36"N | 100° 48' 48"W | |
| 3 | Ackerly | | Official | | 1995 | Texas, USA | L5 | 3.05 kg | Meteoritical Bulletin, no. 100, MAPS 49, E1-E1... | | 32° 35' 25"N | 101° 46' 20"W | |
| 4 | Acme | | Official | | 1947 | New Mexico, USA | H5 | 75 kg | | | 33° 38'N | 104° 16'W | |
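
The Lat and Long columns are still strings in degrees, minutes, and seconds. If decimal degrees are needed, for example to compare coordinates between databases, a conversion could look like the sketch below. It assumes every value follows the format seen in the output above, and `dms_to_decimal` is a hypothetical helper, not part of the scraper.

```python
import re

# Degrees, optional minutes ('), optional seconds ("), then a hemisphere letter
DMS = re.compile(
    r"""(\d+)°\s*(?:(\d+(?:\.\d+)?)'\s*)?(?:(\d+(?:\.\d+)?)"\s*)?([NSEW])"""
)


def dms_to_decimal(text):
    """Convert a string like 39° 46' 36"N to signed decimal degrees."""
    match = DMS.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    deg, minutes, seconds, hemisphere = match.groups()
    value = float(deg) + float(minutes or 0) / 60 + float(seconds or 0) / 3600

    # South and west are negative by convention
    return -value if hemisphere in "SW" else value


for coord in ["36° 18'N", "104° 17'W", "39° 46' 36\"N"]:
    print(coord, "->", round(dms_to_decimal(coord), 4))
```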


### Improving the scraper

+ Handle HTTP status codes
+ Retry failed requests due to connectivity issues, server timeouts, etc. using an [exponential backoff](https://cloud.google.com/iot/docs/how-tos/exponential-backoff)
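
A retry with exponential backoff can be sketched in a few lines. This is one possible shape, not a definitive implementation: `retry_with_backoff` and its parameters are illustrative names, and the failing function below only simulates a network problem.

```python
import time


def retry_with_backoff(
    func,
    retries=4,
    base_delay=1.0,
    factor=2.0,
    exceptions=(Exception,),
    sleep=time.sleep,
):
    """Call func, retrying on failure with exponentially growing delays."""
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            # Out of retries: let the last error propagate
            if attempt == retries:
                raise
            sleep(delay)
            delay *= factor  # wait longer before each new attempt


attempts = {"count": 0}


def flaky():
    """Fail twice, then succeed (stands in for an unreliable request)."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


# Record the delays instead of sleeping so the demo runs instantly
delays = []
result = retry_with_backoff(flaky, exceptions=(ConnectionError,), sleep=delays.append)
print(result, delays)
```

With `requests`, the wrapped function could be a call to `parse_mbdb_page` and the retried exceptions could be `requests.RequestException`. Many implementations also add a small random jitter to each delay so that multiple clients do not retry in lockstep.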

## Questions?

+ Email: mansura@si.edu
+ ORCID: [0000-0002-7512-4206](https://orcid.org/0000-0002-7512-4206)
+ GitHub: https://github.com/adamancer