# Web scraping

Sometimes the data you want is on a web page, instead of a machine-readable table. The process for extracting this information is called **web scraping**.

## General guidance

Web scraping has significant downsides:
* Error-prone
* Time-consuming
* Often forbidden by site terms of service

On the other hand, web scraping is often the *only* way to get access to a novel dataset. We therefore treat a web scraping project as acceptable only under the following constraints:

* The scrape-ee is not explicitly adversarial (e.g., a competitor or a government).
* You are courteous: limit your requests, connections, download speeds, and total download size.

As a result:

* Always look for another way to get the data first. A polite email does wonders.
* Web scraping is appropriate for datasets up to the hundreds of megabytes. It becomes untenable for datasets around 1 GB.
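
The courtesy constraint can be made concrete with a small throttling helper. The following is a minimal sketch, not a prescribed implementation: the function name `fetch_politely`, its parameters, and the default values are all assumptions made for illustration, and `fetch` stands for whatever function actually downloads a page.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0, max_pages=100):
    """Call fetch(url) for each URL, pausing between calls and capping the total.

    All names and defaults here are illustrative assumptions.
    """
    results = {}
    for i, url in enumerate(list(urls)[:max_pages]):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results

# Usage with a stand-in fetcher (str.upper), so nothing is downloaded here.
pages = fetch_politely(["page-a", "page-b", "page-c"], fetch=str.upper,
                       delay_seconds=0.1, max_pages=2)
print(pages)
```

Passing the downloader in as an argument keeps the pacing logic separate from the networking code, so the same helper works with any fetching function.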


## Our first web scrape

We all know about the web. Usually we use browsers to get web pages, but let's use Python instead.

In [1]:
from urllib.request import urlopen

In [2]:
url = "http://www.example.com"
fp = urlopen(url)
contents = fp.read()
contents

b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    

What just happened?

1. The `urlopen` function takes a URL and returns a file-like object.
2. The file-like object has a `read` method, which we can use to extract the contents of the web page.
3. Jupyter automatically printed the return value of the `read` method.

This is a very convenient way to fetch web pages using Python, but there's still work to be done.
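
One refinement worth sketching: the object returned by `urlopen` can be used in a `with` block so that it is closed when we are done, and `urlopen` accepts a `timeout` argument so that a slow server cannot hang the notebook. The helper name `fetch_bytes` and the 10-second default are assumptions for illustration.

```python
from urllib.request import urlopen

def fetch_bytes(url, timeout=10):
    """Return the raw bytes behind a URL, closing the response when done."""
    with urlopen(url, timeout=timeout) as fp:
        return fp.read()

# A data: URL carries its own content, so this call needs no network access.
print(fetch_bytes("data:text/plain,hello"))
```

Calling `fetch_bytes("http://www.example.com")` would return the same kind of byte string as the cell above.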

### Web pages aren't delivered as text

The `read` method returns a **byte string**, as indicated by the `b'` at the beginning of the string. A byte string is a sequence of integers between 0 and 255. Above, those numbers are represented with human-readable characters. For example, 60 is printed as "<", 97 is printed as "a", and 10 is printed as "\n".
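
A short illustration of this, using a made-up three-byte string rather than the full page:

```python
snippet = b"<a\n"

print(list(snippet))         # [60, 97, 10]
print(snippet[0])            # 60: indexing a byte string returns an integer
print(bytes([60, 97, 10]))   # b'<a\n': the same bytes built from integers
```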

The byte string may look like text, **but that is an unhelpful illusion**. In fact it would be more true and less confusing to print out the contents of the web page as numbers, for example in hexadecimal:

In [3]:
contents.hex()

'3c21646f63747970652068746d6c3e0a3c68746d6c3e0a3c686561643e0a202020203c7469746c653e4578616d706c6520446f6d61696e3c2f7469746c653e0a0a202020203c6d65746120636861727365743d227574662d3822202f3e0a202020203c6d65746120687474702d65717569763d22436f6e74656e742d747970652220636f6e74656e743d22746578742f68746d6c3b20636861727365743d7574662d3822202f3e0a202020203c6d657461206e616d653d2276696577706f72742220636f6e74656e743d2277696474683d6465766963652d77696474682c20696e697469616c2d7363616c653d3122202f3e0a202020203c7374796c6520747970653d22746578742f637373223e0a20202020626f6479207b0a20202020202020206261636b67726f756e642d636f6c6f723a20236630663066323b0a20202020202020206d617267696e3a20303b0a202020202020202070616464696e673a20303b0a2020202020202020666f6e742d66616d696c793a202d6170706c652d73797374656d2c2073797374656d2d75692c20426c696e6b4d616353797374656d466f6e742c20225365676f65205549222c20224f70656e2053616e73222c202248656c766574696361204e657565222c2048656c7665746963612c20417269616c2c2073616e732d73657269663b0a2020202

This byte string needs to be **decoded** into text, using its `decode` method. If you do not provide an encoding name, `decode` defaults to UTF-8 (`'utf-8'`).

In [4]:
decoded_contents = contents.decode()
decoded_contents

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

By default, Jupyter displays a string with its line breaks written as "\n". If we want to see the text as it was meant to be seen, we need to use the `print` function.

In [5]:
print(decoded_contents)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

In [6]:
print(type(decoded_contents))

<class 'str'>
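
The encoding matters: the same bytes decode to different text under different encodings. Here is a small sketch with a made-up string (not taken from the page above) that shows what happens when UTF-8 bytes are decoded with the wrong encoding:

```python
text = "café"
raw = text.encode("utf-8")    # the 'é' becomes two bytes in UTF-8

print(raw)                    # b'caf\xc3\xa9'
print(raw.decode("utf-8"))    # café
print(raw.decode("latin-1"))  # cafÃ©  (wrong encoding, garbled text)
```

This kind of garbling is a common sign that a page was decoded with an encoding other than the one it was written in.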


### Web pages aren't delivered formatted

The contents of a web page are **source code**, usually a combination of HTML, CSS, and JavaScript. It is the significant challenge of a web browser to convert this source code into a rendered page. We won't attempt that; instead, we'll extract data directly from the source code, by searching and manipulating the source code string.

For example, we could find the position of the `<title>` and `</title>` tags and extract everything between them.

In [7]:
start = decoded_contents.find("<title>")
start

34

In [8]:
end = decoded_contents.find("</title>")
end

55

In [9]:
decoded_contents[start+len("<title>"):end]

'Example Domain'

We did it! We scraped a piece of data from a web page. Next, we'll look at more sophisticated ways to search the source code of a page.
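
The find-and-slice steps can be wrapped into a reusable helper. This is a minimal sketch: the name `text_between` is an assumption, and the choice to return `None` when a marker is missing is a design decision made here, since `str.find` returns -1 in that case and slicing with -1 would silently give the wrong answer.

```python
def text_between(source, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns None if either marker is not found.
    """
    start = source.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)             # skip past the opening marker
    end = source.find(end_marker, start)   # search only after the opening marker
    if end == -1:
        return None
    return source[start:end]

page = "<html><head><title>Example Domain</title></head></html>"
print(text_between(page, "<title>", "</title>"))  # Example Domain
print(text_between(page, "<h1>", "</h1>"))        # None
```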

### Reading HTML: a beautiful soup

HTML isn't really a language; it's more like a **cloud of conventions**. Interpreting this **cloud** with a computer is a heavyweight task; it's a big reason why web browsers are such significant pieces of software.

In the world of Python, the tool of choice is a library called `beautifulsoup`. Beautiful Soup 4 (aka `bs4`) converts potentially poorly formatted HTML code into a form that's convenient to query. Let's take a look at a simple example, fetching the page with the `requests` library first:

In [10]:
import requests

In [11]:
url = 'http://www.example.com'

In [12]:
r = requests.get(url)

### Take a look at the methods & attributes of the response

In [13]:
r.status_code

200

In [14]:
r.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

In [15]:
response_html = r.text

### Parse the HTML with Beautiful Soup

In [16]:
from bs4 import BeautifulSoup

In [17]:
soup = BeautifulSoup(response_html, 'html.parser')

### Inspect methods/attributes

In [18]:
soup

<!DOCTYPE html>

<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative example

In [19]:
soup.title.text

'Example Domain'

In [20]:
soup.head.title.text

'Example Domain'

More detail: `beautifulsoup` converts the text string into a **parse tree** of elements, where the child elements are the subcomponents of the document:

In [21]:
for child in soup.children:
    print(child)
    print(type(child))
    print()

html
<class 'bs4.element.Doctype'>



<class 'bs4.element.NavigableString'>

<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example D

In practice, `beautifulsoup` is a big Swiss Army knife of tools for rummaging through an HTML document.

In [22]:
# find the first element with a given tag name
soup.title

<title>Example Domain</title>

In [23]:
# find the first element with a given tag name; it may contain its own children
soup.body

<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>

In [24]:
# strips away markup from all children of a tag
bodytext = soup.body.text

In [25]:
print(bodytext)



Example Domain
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...




In [26]:
# use attribute notation to get a child of a child
soup.body.h1

<h1>Example Domain</h1>

In [27]:
# find all descendants with a particular tag name
soup.find_all('p')

[<p>This domain is for use in illustrative examples in documents. You may use this
     domain in literature without prior coordination or asking for permission.</p>,
 <p><a href="https://www.iana.org/domains/example">More information...</a></p>]
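
Tags also carry attributes, such as the `href` of a link. The sketch below parses a small inline HTML string (a shortened stand-in for the page above, so it runs without a network connection) and pulls out each link's text and target. The variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<body>
<p>First paragraph.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</body>
"""
snippet_soup = BeautifulSoup(html, "html.parser")

# For every <a> tag, keep its visible text and the value of its href attribute.
links = [(a.get_text(strip=True), a.get("href"))
         for a in snippet_soup.find_all("a")]
print(links)
```

Using `a.get("href")` rather than `a["href"]` returns `None` instead of raising an error when an anchor has no `href` attribute.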

### Let's try getting some tabular data from Wikipedia using web scraping

In [28]:
url = "https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_Denver"

In [29]:
r = requests.get(url)
response_html_2 = r.text

In [30]:
print(response_html_2)


<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of tallest buildings in Denver - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XoRq-gpAICoAAGpWRAYAAAAC","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_tallest_buildings_in_Denver","wgTitle":"List of tallest buildings in Denver","wgCurRevisionId":946320583,"wgRevisionId":946320583,"wgArticleId":14887355,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Use American English from August 2019","All Wikipedia articles written in Ameri

In [31]:
soup2 = BeautifulSoup(response_html_2, 'html.parser')

In [32]:
tables = soup2.find_all('table', class_='wikitable sortable')

In [33]:
len(tables)

2
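
One caveat about searching with `class_='wikitable sortable'`: when the value contains a space, Beautiful Soup compares it against the exact string of the `class` attribute, so a table whose classes appear in a different order would be missed. A CSS selector matches on each class independently. The sketch below uses a made-up three-table snippet, not the Wikipedia page, to show the difference.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable"></table>
<table class="sortable wikitable"></table>
<table class="wikitable"></table>
"""
demo = BeautifulSoup(html, "html.parser")

# Exact match on the whole class string: only the first table qualifies.
exact = demo.find_all("table", class_="wikitable sortable")

# CSS selector: any table that has both classes, in any order.
both = demo.select("table.wikitable.sortable")

print(len(exact), len(both))
```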

In [34]:
tables[0]

<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Name
</th>
<th>Image
</th>
<th>Height<br/><small><a class="mw-redirect" href="/wiki/Foot_(length)" title="Foot (length)">ft</a> / <a href="/wiki/Metre" title="Metre">m</a></small>
</th>
<th>Floors
</th>
<th>Year
</th>
<th class="unsortable">Notes
</th></tr>
<tr>
<td>1
</td>
<td><a href="/wiki/Republic_Plaza_(Denver)" title="Republic Plaza (Denver)">Republic Plaza</a>
</td>
<td>
</td>
<td>717 / 219
</td>
<td>56
</td>
<td>1984
</td>
<td>Has been the tallest building in Denver and Colorado since 1984. Tallest building in the <a class="mw-redirect" href="/wiki/Mountain_States" title="Mountain States">Mountain States</a>. Tallest building constructed in Denver in the 1980s.<sup class="reference" id="cite_ref-republic_1-1"><a href="#cite_note-republic-1">[1]</a></sup><sup class="reference" id="cite_ref-republic_sky_9-0"><a href="#cite_note-republic_sky-9">[9]</a></sup>
</td></tr>
<tr>
<td>2
</td>
<td><a href="/wiki/1801_Califo

In [35]:
tables[1]

<table class="wikitable sortable">
<tbody><tr>
<th>Name
</th>
<th>Street address
</th>
<th>Years as tallest
</th>
<th>Height<br/><small><a class="mw-redirect" href="/wiki/Foot_(length)" title="Foot (length)">ft</a> / <a href="/wiki/Metre" title="Metre">m</a></small>
</th>
<th>Floors
</th>
<th class="unsortable">Reference
</th></tr>
<tr>
<td><a href="/wiki/Equitable_Building_(Denver)" title="Equitable Building (Denver)">Equitable Building</a></td>
<td>730 17th Street</td>
<td>1892–1910</td>
<td>148 / 45</td>
<td>9</td>
<td><sup class="reference" id="cite_ref-Equitable_Building_4-1"><a href="#cite_note-Equitable_Building-4">[4]</a></sup>
</td></tr>
<tr>
<td><a href="/wiki/Daniels_%26_Fisher_Tower" title="Daniels &amp; Fisher Tower">Daniels &amp; Fisher Tower</a></td>
<td>1601 Arapahoe Street</td>
<td>1910–1957</td>
<td>371 / 113</td>
<td>20</td>
<td><sup class="reference" id="cite_ref-Daniels_&amp;_Fisher_Tower_sky_56-1"><a href="#cite_note-Daniels_&amp;_Fisher_Tower_sky-56">[56]</a></su

In [36]:
# looks like we want tables[0]
tallest = tables[0]

In [37]:
anchors = tallest.find_all('a')

In [38]:
anchors

[<a class="mw-redirect" href="/wiki/Foot_(length)" title="Foot (length)">ft</a>,
 <a href="/wiki/Metre" title="Metre">m</a>,
 <a href="/wiki/Republic_Plaza_(Denver)" title="Republic Plaza (Denver)">Republic Plaza</a>,
 <a class="mw-redirect" href="/wiki/Mountain_States" title="Mountain States">Mountain States</a>,
 <a href="#cite_note-republic-1">[1]</a>,
 <a href="#cite_note-republic_sky-9">[9]</a>,
 <a href="/wiki/1801_California_Street" title="1801 California Street">1801 California</a>,
 <a href="#cite_note-quest-2">[2]</a>,
 <a href="#cite_note-1801_California_sky-10">[10]</a>,
 <a href="#cite_note-SKY-11">[11]</a>,
 <a class="mw-redirect" href="/wiki/Television_program" title="Television program">television series</a>,
 <a href="/wiki/Dynasty_(1981_TV_series)" title="Dynasty (1981 TV series)">Dynasty</a>,
 <a href="/wiki/Wells_Fargo_Center_(Denver)" title="Wells Fargo Center (Denver)">Wells Fargo Center</a>,
 <a href="#cite_note-12">[12]</a>,
 <a href="#cite_note-13">[13]</a>,
 <

In [39]:
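buildings = []
for anchor in anchors:
    anchor_str = str(anchor)
    # Crude filter: keep anchors whose markup starts with '<a h' (href is the
    # first attribute, so class="mw-redirect" links are skipped) and whose
    # markup contains the word 'title' somewhere.
    if anchor_str[3] == 'h' and 'title' in anchor_str:
        buildings.append(anchor.get('title'))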

In [40]:
for i, building in enumerate(buildings):
    print(f"{i}. {building}")

0. Metre
1. Republic Plaza (Denver)
2. 1801 California Street
3. Dynasty (1981 TV series)
4. Wells Fargo Center (Denver)
5. Four Seasons Hotel Denver
6. 1999 Broadway
7. 707 17th Street
8. 555 17th Street
9. Hyatt Regency Denver at the Colorado Convention Center
10. Spire (Denver)
11. 1670 Broadway
12. 17th Street Plaza
13. 633 17th Street
14. Dynasty (1981 TV series)
15. Brooks Tower
16. Denver Place
17. One Tabor Center
18. Johns Manville Plaza
19. Granite Tower (Denver)
20. Ritz-Carlton Denver
21. U.S. Bank Tower (Denver)
22. 621 17th Street
23. 1600 Glenarm Place
24. Dominion Plaza
25. One Lincoln Park, Denver
26. Confluence Park
27. Cherry Creek (Colorado)
28. South Platte River
29. Denver Financial Center
30. Daniels & Fisher Tower
31. Mississippi River
32. Lincoln Center (Denver)
33. 1125 17th Street
34. United Western Financial Center
35. Five Points, Denver
36. 1600 Broadway
37. The Curtis
38. Elitch Gardens Theme Park
39. Speer, Denver
40. Country Club, Denver
41. Speer, Denv

In [41]:
# clearly not perfect: links from the header and Notes column (e.g. Metre, Dynasty) slip through, so more clean-up is required
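
One possible clean-up is to walk the table row by row and read only the Name column, instead of collecting every link in the table. The following is a sketch under stated assumptions: it runs on a small inline table modeled on the output above (the first data row is copied from it, the second row "Example Tower" is made up), and it assumes the name is always the second `<td>` in a row, which may not hold for every Wikipedia table.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Rank</th><th>Name</th><th>Image</th><th>Height</th><th>Floors</th><th>Year</th><th>Notes</th></tr>
<tr>
<td>1</td>
<td><a href="/wiki/Republic_Plaza_(Denver)" title="Republic Plaza (Denver)">Republic Plaza</a>
</td>
<td></td><td>717 / 219</td><td>56</td><td>1984</td>
<td>Tallest building in the <a href="/wiki/Mountain_States" title="Mountain States">Mountain States</a>.</td></tr>
<tr>
<td>2</td><td>Example Tower</td><td></td><td>0 / 0</td><td>0</td><td>0</td><td>Made-up row with no link.</td></tr>
</tbody></table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

names = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # the header row uses <th>, so it has no <td> cells
    # Second cell is assumed to be the Name column; strip surrounding whitespace.
    names.append(cells[1].get_text(strip=True))

print(names)
```

Because only the second cell of each row is read, links in the header and in the Notes column no longer end up in the result.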