# Lecture 8

February 02, 2022


### Announcements

* HW2 due tomorrow

### Topics

* Undocumented APIs
* XML and HTML
* Web Scraping

### Datasets

* [Yolo County Health Inspections](https://yoloeco.envisionconnect.com/)
* [Wikipedia's List of Largest Cities](https://en.wikipedia.org/wiki/List_of_largest_cities)
* [CUESA's Vegetable Seasons Chart](https://cuesa.org/eat-seasonally/charts/vegetables)

### References

* [__requests__ documentation](http://docs.python-requests.org/en/master/)
* [__requests-html__ documentation](https://html.python-requests.org/)
* [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
* [XPath Diner](http://www.topswagcode.com/xpath/) -- an interactive XPath tutorial
* [CSS Diner](https://flukeout.github.io/) -- an interactive CSS Selector tutorial
* Python for Data Analysis, Ch. 6
* Python for Data Analysis, Ch. 7.3 (to review string processing)


## Getting Data from the Web

Here is a revised list of ways to get data from the web, ordered from most to least convenient:

1. Direct download or "data dump"
2. Python or R package (there are packages for many popular web APIs)
3. Documented web API
4. Undocumented web API
5. Scraping

## Undocumented Web APIs

Many websites use undocumented web APIs to get data. For example:

* [University of California Compensation](https://ucannualwage.ucop.edu/wage/)
* [Yolo County Health Inspections](https://yoloeco.envisionconnect.com/)

You can identify these websites by looking at the requests listed in the Network tab of your browser's developer tools. In Firefox or Chrome, you can open the developer tools with `Ctrl+Shift+I` (`Cmd+Option+I` on macOS).

Requests to web APIs almost always return JSON or XML data. By examining the browser requests, you can work out the endpoints and parameters, allowing you to use the API.

**CAUTION:** Web APIs that are undocumented are often undocumented for a reason. Using an undocumented API may make someone angry or get you into legal trouble! Government and quasi-government websites (like the examples above) are probably okay, as long as you cache and rate-limit your requests. For everything else, look for an alternative or get permission first.
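
As a rough illustration of rate limiting, the sketch below waits a fixed delay between calls. The helper name `fetch_politely`, the `fetch` callable, and the delay values are assumptions made for this example. In practice, `fetch` might wrap `requests.get`.

```python
import time

def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Stand-in for a real request function, so the sketch runs without a network.
def fake_fetch(url):
    return f"response from {url}"

pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay=0.1,
)
print(pages)
```

A fixed delay is the simplest policy. Caching complements it, since a cached response does not need to hit the server at all.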

Let's reverse engineer the Yolo County Health Inspections web API so that we can get data about local restaurants.

In [2]:
import numpy as np
import pandas as pd
import requests
import requests_cache

requests_cache.install_cache("mycache")

In [4]:
def get_health_info(q):
    """Search the Yolo County health inspections API for facilities matching q."""
    response = requests.post(
        "https://yoloeco.envisionconnect.com/api/pressAgentClient/searchFacilities",
        params = {"PressAgentOid": "c08cb189-894c-4c8c-b595-a5ef010226b4"},
        json = {
            # Other fields (not used here): "Addresses" and
            # "FacilityId" (for example, "FA0021354").
            "FacilityName": q,
        },
    )

    # Raise an exception if the server reports an error.
    response.raise_for_status()

    # Different ways to attach data to a POST request:
    # With data=, the body is form-encoded, like a query string:
    #   FacilityName=pluto's
    # With json=, the body is a JSON object.

    return response.json()


get_health_info("A")

[{'FacilityId': 'FA0021354',
  'FacilityName': 'A KE TACO',
  'Address': '1900 PARKWOOD DR ',
  'CityStateZip': 'YUBA CITY CA 95993 ',
  'LastScore': 100.0,
  'attachmentId': 'c5bf43d6-aa71-416f-b95a-ad4f00b3cda4'},
 {'FacilityId': 'FA0019474',
  'FacilityName': 'ACOUSTIC EVENTS',
  'Address': '4467 D ST ',
  'CityStateZip': 'SACRAMENTO CA 95819 ',
  'LastScore': 100.0,
  'attachmentId': '6ad8b394-103b-4f88-8e1c-ae2b01015500'},
 {'FacilityId': 'FA0014014',
  'FacilityName': 'AFC SUSHI / HOT WOK @ BEL AIR #526',
  'Address': '1885 E GIBSON RD ',
  'CityStateZip': 'WOODLAND CA 95776 ',
  'LastScore': 100.0,
  'attachmentId': None},
 {'FacilityId': 'FA0014013',
  'FacilityName': "AFC SUSHI / HOT WOK @ RALEY'S #206",
  'Address': '367 W MAIN ST ',
  'CityStateZip': 'WOODLAND CA 95695 ',
  'LastScore': 100.0,
  'attachmentId': None},
 {'FacilityId': 'FA0014015',
  'FacilityName': "AFC SUSHI / HOT WOK @ RALEY'S #448",
  'Address': '1601 W CAPITOL AVE ',
  'CityStateZip': 'WEST SACRAMENTO CA 

We could reverse engineer other parts of the API to get detailed data about health violations.
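
The comments in `get_health_info` mention two ways to attach data to a POST request. The sketch below uses only the standard library to show roughly what each request body looks like. It approximates what __requests__ sends rather than calling the package itself, and the `payload` value is only an example.

```python
import json
from urllib.parse import urlencode

payload = {"FacilityName": "pluto's"}

# Roughly what data=payload sends: a form-encoded body
# (Content-Type: application/x-www-form-urlencoded).
form_body = urlencode(payload)

# Roughly what json=payload sends: a JSON document
# (Content-Type: application/json).
json_body = json.dumps(payload)

print(form_body)  # FacilityName=pluto%27s
print(json_body)  # {"FacilityName": "pluto's"}
```

Which one a server expects depends on the API, so it is worth checking the request body shown in the browser's developer tools.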

### Private API

Tutorial: https://github.com/ping/instagram_private_api


This YouTube video covers undocumented APIs in more detail: https://www.youtube.com/watch?v=pAHxtiQVL60&list=WL&index=50&t=731s

## Web Scraping

### What makes a web page?

Web pages are written in _hypertext markup language_ (HTML). HTML files (`.htm` or `.html`) are plain text, just like JSON, Python scripts, and R scripts.

In HTML, we use _tags_ to create _elements_ of a web page. Elements add formatting and structure to the page.

* Tags usually come in pairs: an opening tag and a closing tag.
* Tags are written `<NAME>` for opening tags, `</NAME>` for closing tags, and `<NAME />` for singleton tags.
* Opening and singleton tags can have _attributes_ that contain additional information. Attributes are written `ATTRIBUTE=VALUE` after the tag name. 

See [here](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) for a more detailed explanation, and [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a list of valid HTML elements.
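
To make tags and attributes concrete, here is a minimal sketch that uses the standard library's `html.parser` module to list each opening tag and its attributes. The class name `TagLister` and the `snippet` string are made up for this illustration.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Record each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

snippet = '<p id="intro">Visit <a href="https://pudding.cool">The Pudding</a>.</p>'
lister = TagLister()
lister.feed(snippet)
print(lister.tags)
# [('p', {'id': 'intro'}), ('a', {'href': 'https://pudding.cool'})]
```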

#### Examples

As an example:

```html
<p>This page is famous and this <b>word</b> is emphasized.</p>
```

```html
<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous and this <strong>word</strong> is emphasized.</p>
```

```html
<li>1. Something</li>
```
The `p` tag marks a paragraph, the `a` tag marks a link (an _anchor_), the `b` tag marks bold text, the `strong` tag marks strongly emphasized text, and the `li` tag marks a list item.

Here's a string that contains HTML for a simple, complete website:

In [23]:
page = """
<html> 
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a hrep="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with an taco &#127790;</span>
</body>

<body>
    <p>This is a new paragraph!</p>
</body>

</html>
"""

In [24]:
page

'\n<html> \n<head>\n    <title>This is the Title!</title>\n</head>\n\n<body>\n    <p>This is a paragraph!</p>\n    <p id="best-paragraph">This is another paragraph! &#127790;</p>\n    <p>Visit <a hrep="https://pudding.cool">The Pudding</a>.</p>\n    <span>This is a span, it comes with an taco &#127790;</span>\n</body>\n\n<body>\n    <p>This is a new paragraph!</p>\n</body>\n\n</html>\n'

_Extensible markup language_ (XML) also uses tags to create elements. We say XML is _extensible_ because you can create your own XML elements (unlike HTML). People typically use XML to describe the structure and meaning of data, rather than for formatting.

We'll use the same process to extract data from both HTML and XML.
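
As a small illustration of custom XML elements, the sketch below parses a made-up document with the standard library's `xml.etree.ElementTree`. The element names `inspection`, `facility`, and `score` are invented for this example and do not come from any real API. The values are borrowed from the health inspection output earlier.

```python
import xml.etree.ElementTree as ET

doc = """
<inspection>
    <facility id="FA0021354">A KE TACO</facility>
    <score>100.0</score>
</inspection>
""".strip()

root = ET.fromstring(doc)
facility = root.find("facility")

print(root.tag)                       # inspection
print(facility.get("id"))             # FA0021354
print(facility.text)                  # A KE TACO
print(float(root.find("score").text)) # 100.0
```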

### Helper Packages

A _parser_ converts formatted data into familiar data structures. We've used __requests__' built-in JSON parser, but the package doesn't have a built-in HTML/XML parser. Fortunately, there are many other Python packages for parsing HTML/XML and web scraping.

HTML/XML Parsers:
* [lxml](https://lxml.de/)
* [html5lib](https://github.com/html5lib/html5lib-python)
* [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/)
* [requests-html](https://docs.python-requests.org/projects/requests-html/en/latest/)

Scraper Frameworks (_convenient after learning the basics with parsers_):
* [scrapy](https://scrapy.org/)
* [newspaper3k](https://github.com/codelucas/newspaper)

Even more [here](https://github.com/lorien/awesome-web-scraping/blob/master/python.md#web-scraping-frameworks).

We'll use __lxml__ here, but you're welcome to use other packages on assignments and the project. To install __lxml__ for Anaconda, run `conda install -c anaconda lxml` in a shell.

In [7]:
import lxml.html as lx

html = lx.fromstring(page)
html

<Element html at 0x7f7d41157b80>

### Finding Elements

Elements are nested, so an HTML document is like a tree:
```
html
├── head
│   └── title
└── body
    ├── p
    ├── p
    ├── p
    │   └── a
    └── span
```
This is similar to the file system on your computer. The key difference is that elements at the same level can have the same tag name.

#### XPath

The _XML Path Language_ (XPath) lets us write paths to elements. XPath paths look a lot like file paths. XPath is not Python-specific!

The `.xpath()` method gets all elements at an XPath path:

In [25]:
html.xpath("/html/head/title")

[<Element title at 0x7f7d41196770>]

In [26]:
html.xpath("/html/body")

[<Element body at 0x7f7d41194f90>]

Since there may be more than one element, the method always returns a list.

Absolute paths are not robust for scraping. An update to a web page that adds a single tag can break a scraper that uses absolute paths. In XPath, `//` means "anywhere below". We'll use `//` often because it's more robust:

In [27]:
html.xpath("/html/body//a")

[<Element a at 0x7f7d41194cc0>]
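
To see why `//` is more robust, the sketch below compares an exact path with an "anywhere below" path before and after a page gains a wrapper `div`. It uses the standard library's `xml.etree.ElementTree` as a stand-in for __lxml__. That module supports only a subset of XPath and writes paths relative to the root element, so `./body/p/a` plays the role of `/html/body/p/a` and `.//a` the role of `//a`. Both markup strings are made up for the example.

```python
import xml.etree.ElementTree as ET

before = (
    "<html><body>"
    "<p>Visit <a href='https://pudding.cool'>The Pudding</a>.</p>"
    "</body></html>"
)
# The same page after an update wraps the paragraph in a div.
after = (
    "<html><body><div>"
    "<p>Visit <a href='https://pudding.cool'>The Pudding</a>.</p>"
    "</div></body></html>"
)

results = {}
for label, markup in [("before", before), ("after", after)]:
    root = ET.fromstring(markup)
    exact = root.findall("./body/p/a")  # breaks when the nesting changes
    anywhere = root.findall(".//a")     # still finds the link
    results[label] = (len(exact), len(anywhere))

print(results)
# {'before': (1, 1), 'after': (0, 1)}
```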

What if we just want elements that satisfy a certain condition? In XPath, `[ ]` keeps only the elements that match a condition. For example:

In [13]:
html.xpath("//p[@id = 'best-paragraph']")

[<Element p at 0x7f7d41193ef0>]

[XPath Diner](http://www.topswagcode.com/xpath/) is an interactive tutorial that teaches most of the XPath syntax. It takes about 20-60 minutes. Work through it to become an XPath ninja!

#### CSS Selectors

_Cascading Style Sheets_ (CSS) is a language for styling elements in an HTML document. CSS also provides a way to select elements, called _CSS selectors_.

CSS selectors are more concise but less flexible than XPath paths. The `.cssselect()` method gets all elements that match a CSS selector (in __lxml__, this method depends on the separate `cssselect` package):

In [16]:
html.cssselect("body")

[<Element body at 0x7f7d41194f90>]

### Extracting Text and Attributes

There are two ways to get text from an element:

* `.text` gives the text inside the element that comes before its first child, but not the text of its children
* `.text_content()` gives text inside the element and its children, with all tags removed

In [28]:
p = html.xpath("//p[@id = 'best-paragraph']")[0]
p.text_content()

'This is another paragraph! 🌮'

In [29]:
p.text

'This is another paragraph! 🌮'

In [32]:
html.text_content()

' \n\n    This is the Title!\n\n\n\n    This is a paragraph!\n    This is another paragraph! 🌮\n    Visit The Pudding.\n\n\n\n'

In [33]:
html.text

' \n'
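
The difference is easiest to see on an element that has a child. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in. Its elements share `.text`, `.tail`, and `.itertext()` with __lxml__ elements but have no `.text_content()` method, so joining `.itertext()` is used here as an approximation of it.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>')

own_text = p.text                 # text before the first child
all_text = "".join(p.itertext())  # text of the element and its descendants
tail = p.find("a").tail           # text that follows the <a> element

print(repr(own_text))  # 'Visit '
print(repr(all_text))  # 'Visit The Pudding.'
print(repr(tail))      # '.'
```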

We can get values from attributes on an element with `.attrib`, which is a dictionary:

In [34]:
p.attrib["id"]

'best-paragraph'

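The sample page spells the link attribute `hrep` rather than the standard `href`, so that is the key to look up here:

In [36]:
[x.attrib["hrep"] for x in html.xpath("//a")]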

['https://pudding.cool']
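
Indexing `.attrib` raises a `KeyError` when an attribute is missing, which happens often on real pages. A more forgiving option is the element's `.get()` method, which returns `None` instead. The sketch below shows this with the standard library's `xml.etree.ElementTree` as a stand-in for __lxml__, on made-up markup.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring(
    "<body>"
    '<a href="https://pudding.cool">The Pudding</a>'
    '<a hrep="https://example.com">misspelled attribute</a>'
    "<a>no attributes</a>"
    "</body>"
)

# .get() returns None when the attribute is absent instead of raising.
links = [a.get("href") for a in body.findall(".//a")]
print(links)
# ['https://pudding.cool', None, None]
```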

### Example: Scraping Tables

For data in a `table` element, we can use __Pandas__ instead of writing a scraper.

Wikipedia provides lots of useful information in tables. Let's get the Wikipedia list of [US cities by area][wiki].

[wiki]: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area

In [37]:
import pandas as pd

In [46]:
tabs = pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area")
tbl = tabs[1]
tbl.head(n = 10)

|   | Rank | City | State | Land area (sq mi) | Land area (km2) | Water area (sq mi) | Water area (km2) | Total area (sq mi) | Total area (km2) | Population (2020)[2] |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Sitka | Alaska | 2870.1 | 7434 | 1945.1 | 5038.0 | 4815.1 | 12471 | 8458 |
| 1 | 2 | Juneau | Alaska | 2704.0 | 7003 | 550.7 | 1426.0 | 3254.7 | 8430 | 32255 |
| 2 | 3 | Wrangell | Alaska | 2556.0 | 6620 | 920.6 | 2384.0 | 3476.6 | 9004 | 2127 |
| 3 | 4 | Anchorage | Alaska | 1706.8 | 4421 | 239.9 | 621.0 | 1946.7 | 5042 | 291247 |
| 4 | 5 | Tribune [note 1]* | Kansas | 778.2 | 2016 | 0.0 | 0.0 | 778.2 | 2016 | 1182 |
| 5 | 6 | Jacksonville | Florida | 747.3 | 1935 | 127.2 | 329.0 | 874.5 | 2265 | 949611 |
| 6 | 7 | Anaconda | Montana | 736.7 | 1908 | 4.7 | 12.0 | 741.4 | 1920 | 9421 |
| 7 | 8 | Butte * | Montana | 715.8 | 1854 | 0.6 | 1.6 | 716.3 | 1855 | 34494 |
| 8 | 9 | Houston | Texas | 640.4 | 1659 | 31.2 | 81.0 | 671.7 | 1740 | 2304580 |
| 9 | 10 | Oklahoma City | Oklahoma | 606.2 | 1570 | 14.3 | 37.0 | 620.5 | 1607 | 681054 |


In [45]:
len(tabs)

2

In [51]:
tabs = pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area")
tbl = tabs[1]

def strip_footnote(x):
    """This function removes bracketed footnotes, such as '[1]'."""
    if pd.isna(x):
        return x
    
    return x.partition("[")[0]

In [52]:
# take the top level of the column headers (footnotes are stripped below)
cols = list(tbl.columns.get_level_values(0))
cols

['Rank',
 'City',
 'State',
 'Land area (sq\xa0mi)',
 'Land area (km2)',
 'Water area (sq\xa0mi)',
 'Water area (km2)',
 'Total area (sq\xa0mi)',
 'Total area (km2)',
 'Population (2020)[2]']

In [53]:
tbl.columns = [strip_footnote(c).title() for c in cols]

In [54]:
tbl.columns

Index(['Rank', 'City', 'State', 'Land Area (Sq Mi)', 'Land Area (Km2)',
       'Water Area (Sq Mi)', 'Water Area (Km2)', 'Total Area (Sq Mi)',
       'Total Area (Km2)', 'Population (2020)'],
      dtype='object')

In [55]:
tbl = tbl.applymap(strip_footnote)

AttributeError: 'int' object has no attribute 'partition'
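
The `AttributeError` arises because `applymap` calls `strip_footnote` on every cell, including integers, which have no `.partition()` method. One way to avoid it is to pass non-string values through unchanged. The function below is a hypothetical variation on `strip_footnote`, named `strip_footnote_safe` here to keep the two apart. It also strips surrounding whitespace, which the original does not do.

```python
def strip_footnote_safe(x):
    """Remove a bracketed footnote such as '[1]' from a string.

    Values that are not strings are returned unchanged.
    """
    if not isinstance(x, str):
        return x  # numbers, None, and NaN pass through
    return x.partition("[")[0].strip()

cells = ["Tribune [note 1]*", "Population (2020)[2]", 1182, 778.2, None]
cleaned = [strip_footnote_safe(c) for c in cells]
print(cleaned)
# ['Tribune', 'Population (2020)', 1182, 778.2, None]
```

With a function like this, `tbl.applymap(strip_footnote_safe)` should leave the numeric columns as they are. In pandas 2.1 and later, `DataFrame.map` is the preferred name for `applymap`.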

In [56]:
tbl.dtypes

Rank                    int64
City                   object
State                  object
Land Area (Sq Mi)     float64
Land Area (Km2)         int64
Water Area (Sq Mi)    float64
Water Area (Km2)      float64
Total Area (Sq Mi)    float64
Total Area (Km2)        int64
Population (2020)       int64
dtype: object

In [57]:
tbl_str = tbl.applymap(str)

In [60]:
tbl_str.dtypes

Rank                  object
City                  object
State                 object
Land Area (Sq Mi)     object
Land Area (Km2)       object
Water Area (Sq Mi)    object
Water Area (Km2)      object
Total Area (Sq Mi)    object
Total Area (Km2)      object
Population (2020)     object
dtype: object

In [61]:
tbl_str = tbl_str.applymap(strip_footnote)

In [62]:
tbl_str.head()

|   | Rank | City | State | Land Area (Sq Mi) | Land Area (Km2) | Water Area (Sq Mi) | Water Area (Km2) | Total Area (Sq Mi) | Total Area (Km2) | Population (2020) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Sitka | Alaska | 2870.1 | 7434 | 1945.1 | 5038.0 | 4815.1 | 12471 | 8458 |
| 1 | 2 | Juneau | Alaska | 2704.0 | 7003 | 550.7 | 1426.0 | 3254.7 | 8430 | 32255 |
| 2 | 3 | Wrangell | Alaska | 2556.0 | 6620 | 920.6 | 2384.0 | 3476.6 | 9004 | 2127 |
| 3 | 4 | Anchorage | Alaska | 1706.8 | 4421 | 239.9 | 621.0 | 1946.7 | 5042 | 291247 |
| 4 | 5 | Tribune | Kansas | 778.2 | 2016 | 0.0 | 0.0 | 778.2 | 2016 | 1182 |


In [63]:
tbl_str.dtypes

Rank                  object
City                  object
State                 object
Land Area (Sq Mi)     object
Land Area (Km2)       object
Water Area (Sq Mi)    object
Water Area (Km2)      object
Total Area (Sq Mi)    object
Total Area (Km2)      object
Population (2020)     object
dtype: object

In [64]:
tbl_str['Land Area (Km2)'] = tbl_str['Land Area (Km2)'].astype(float)

In [66]:
tbl_str.dtypes

Rank                   object
City                   object
State                  object
Land Area (Sq Mi)      object
Land Area (Km2)       float64
Water Area (Sq Mi)     object
Water Area (Km2)       object
Total Area (Sq Mi)     object
Total Area (Km2)       object
Population (2020)      object
dtype: object
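
Converting one column at a time with `.astype(float)` works, but every other column is still stored as strings. As a sketch of one way to convert all of them, the hypothetical helper `to_numeric_where_possible` below tries `pd.to_numeric` on each column and leaves a column alone if it contains text that cannot be parsed. The small data frame is a stand-in for `tbl_str`, built by hand so the example runs without a network.

```python
import pandas as pd

# Hand-built stand-in for tbl_str: every value is a string.
tbl_demo = pd.DataFrame({
    "City": ["Sitka", "Juneau"],
    "Land Area (Km2)": ["7434", "7003"],
    "Water Area (Sq Mi)": ["1945.1", "550.7"],
})

def to_numeric_where_possible(df):
    """Return a copy of df with each parseable column converted to numbers."""
    out = df.copy()
    for name in out.columns:
        try:
            out[name] = pd.to_numeric(out[name])
        except (ValueError, TypeError):
            pass  # column holds non-numeric text, so keep it as-is
    return out

converted = to_numeric_where_possible(tbl_demo)
print(converted.dtypes)
```

Unlike `.astype(float)`, `pd.to_numeric` picks an integer type for columns that contain only whole numbers.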