```
Title: Web-Services-APIs-Python
Author: John Fay for ENV872
Date: Spring 2023
```

# Web Services and APIs with Python

## Objectives for this lesson

- Address programmatic data acquisition
- Learn principles of web-services
- Recognize vast opportunities of APIs

## Specific achievements

- Programmatically acquire data embedded in a web page
- Request data through a REST API
- Use the [census](https://pypi.python.org/pypi/census) package to acquire data

## Why script data acquisition?

- Too time-intensive to acquire manually
- Update or reuse for new data
- Reproducibility
- Only available through an Application Programming Interface (API)

## Tiers of access to online data

- **Scraping:** download static data displayed on a webpage for people
- **REST API:** send HTTP requests for data using a URI that follows the provider's documentation
- **Specialized Package:** import a "wrapper" created by a data provider

## Requests

That "http" at the beginning of the URL for a possible data source is a protocol: an understanding between a client and a server about how to communicate. The client does not have to be a web browser, so long as it knows the protocol.
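
To make the idea of a protocol concrete, here is a minimal sketch of the text a client would send for an HTTP/1.1 `GET` request. The helper name `build_get_request` is hypothetical, nothing is sent over the network, and real clients such as `requests` add more headers and handle the encryption that `https` requires.

```python
from urllib.parse import urlsplit

def build_get_request(url):
    """Return the text of a bare-bones HTTP/1.1 GET request (illustrative only)."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"GET {target} HTTP/1.1",   # method, resource, protocol version
        f"Host: {parts.netloc}",    # which site on the server we want
        "Connection: close",
        "",                         # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)

request_text = build_get_request(
    "https://www.ncwater.org/WUDC/app/WWATR/report/view/0004-0001/2020"
)
print(request_text)
```

Any program that can produce and read messages in this form can act as a client, which is what `requests.get` does on our behalf.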

 

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re

#Scrape the web content of the url into a requests object called "response"
response = requests.get('https://www.ncwater.org/WUDC/app/WWATR/report/view/0004-0001/2020')
#Use "Beautiful Soup" to parse the raw response into a search-able object
doc = BeautifulSoup(response.text, 'lxml')

In [None]:
#Reveal the datatype of the "doc" object
type(doc)

In [None]:
the_registrant = doc.select('.table tr:nth-child(1) td:nth-child(2)')[0].text
print(the_registrant)

In [None]:
the_facility =  doc.select('tr:nth-child(2) .left~ .left+ td.left')[0].text
print(the_facility)

In [None]:
avg_withdrawals = doc.select('.table:nth-child(7) td:nth-child(7) , .table:nth-child(7) td:nth-child(3)')
[mgd.text for mgd in avg_withdrawals]
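
The selector `.table tr:nth-child(1) td:nth-child(2)` reads as "the second cell of the first row inside an element with class `table`". The sketch below reproduces that row-and-column lookup with the standard library's `html.parser` rather than Beautiful Soup, on a made-up HTML snippet. The class name `CellCollector` and the placeholder text are assumptions for illustration, not content from the ncwater.org report.

```python
from html.parser import HTMLParser

# Made-up snippet; the placeholder names are not from the real report page
SAMPLE_HTML = """
<table class="table">
  <tr><td>Registrant</td><td>Example Water Utility</td></tr>
  <tr><td>Facility</td><td>Example Treatment Plant</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")      # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text inside the open cell

collector = CellCollector()
collector.feed(SAMPLE_HTML)

# nth-child counts from 1, while Python indexes from 0
the_registrant = collector.rows[0][1].strip()
print(the_registrant)
```

Beautiful Soup's `select` saves us from writing this bookkeeping by hand, but the underlying idea is the same: locate an element by its position in the page structure.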

## Range of complexity

Pages designed for humans are increasingly hard to parse programmatically.

- Servers provide different responses based on client's "metadata"
- JavaScript often needs to be executed by the client
- The HTML `<table>` is drifting into obscurity (mostly for the better)
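
One piece of client "metadata" is the `User-Agent` header, which tells the server what kind of program is asking. As a rough sketch, the standard library's `urllib.request.Request` lets us inspect a request, including its headers, without sending it. The identifying string below is a hypothetical example.

```python
from urllib.request import Request

url = "https://www.ncwater.org/WUDC/app/WWATR/report/view/0004-0001/2020"
# Hypothetical identifying string; use one that describes your own script
headers = {"User-Agent": "env872-lesson-script/0.1"}

# Build the request object only; nothing is sent over the network
req = Request(url, headers=headers)

print(req.get_method())     # the HTTP method this request would use
print(req.host)             # the server the request is addressed to
print(req.header_items())   # the metadata that would accompany the request
```

With `requests`, the same dictionary can be passed through the `headers` argument of `requests.get`.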

### HTML Tables

Sites with easily accessible HTML tables nowadays are often geared toward non-human agents. Here we'll examine how to extract a table from a Wikipedia page:
<https://en.wikipedia.org/wiki/Global_Social_Mobility_Index>

In [None]:
#Here, the read_html function pulls into a list object any table in the URL we provide.
tableList = pd.read_html('https://en.wikipedia.org/wiki/Global_Social_Mobility_Index',header=0)
print ("{} tables were found".format(len(tableList)))

In [None]:
#Let's grab the first table and display its first five rows
df = tableList[0]
df.head()

In [None]:
#Here is a quick preview of pandas' plotting capability
%matplotlib inline
df.iloc[:20].plot(
    kind='bar',
    x='Country',
    figsize=(20,5),
    legend=False);

### REST API
The US Census Bureau provides access to its vast stores of demographic data via its API at <https://api.census.gov>.

The **I** in **API** is the entry point into an application: it's the steering wheel and dashboard for whatever more or less complicated vehicle you're driving. In the case of the Census, the main component of the application is a relational database management system. There are probably several **GUI**s designed for human readers; the Census API is meant for communication between your software and their application.

In a REST API, the already universal system for transferring data over the internet between applications (a web server and your browser), called `http`, is half of the interface. From there we just need documentation for how to construct the URL in a standards-compliant way.

<https://api.census.gov/data/2021/acs/acs5?get=NAME,B01001_001E&for=county&in=state:37>

| Section           | Description                                                  |
| ----------------- | ------------------------------------------------------------ |
| `https://`        | scheme                                                       |
| `api.census.gov`  | authority, or simply host if there's no user authentication  |
| `/data/2021/acs/acs5` | path to a resource within a hierarchy                        |
| `?`               | beginning of the "query" component of a URL                  |
| `get=NAME,B01001_001E` | first query parameter                                        |
| `&`               | query parameter separator                                    |
| `for=county`      | second query parameter                                       |
| `&`               | query parameter separator                                    |
| `in=state:37`     | third query parameter                                        |
| `#`               | beginning of the "fragment" component of a URL (not used in this example) |
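
The same breakdown can be reproduced in code. As a small sketch, the standard library's `urllib.parse` splits the example URL into the components listed in the table:

```python
from urllib.parse import urlsplit, parse_qs

url = "https://api.census.gov/data/2021/acs/acs5?get=NAME,B01001_001E&for=county&in=state:37"

parts = urlsplit(url)
query = parse_qs(parts.query)   # each parameter maps to a list of values

print(parts.scheme)             # scheme
print(parts.netloc)             # authority (host)
print(parts.path)               # path to the resource
print(query)                    # query parameters as a dictionary
print(repr(parts.fragment))     # empty string: this URL has no fragment
```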

In [None]:
#Extract the data
path = 'https://api.census.gov/data/2021/acs/acs5'
query = {'get':'NAME,B01001_001E,B01001_002E,B01001_026E', 'for':'county', 'in':'state:37'}
response = requests.get(path, params=query).json()

In [None]:
#Convert to a dataframe
df_acs = pd.DataFrame(
            columns=response[0],
            data=response[1:]).rename(columns={
                'B01001_001E':'Total_pop',
                'B01001_002E':'Male_pop',
                'B01001_026E':'Female_pop'
})
df_acs.head()
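
The JSON response is a list of lists whose first row holds the column names, which is why `response[0]` supplies the columns and `response[1:]` the data. The sketch below shows the same reshaping in plain Python on made-up rows. The county names and counts are placeholders, not real estimates, and it assumes the values arrive as strings.

```python
# Made-up rows shaped like the API's JSON; values are placeholders, not real estimates
sample_response = [
    ["NAME", "B01001_001E", "state", "county"],
    ["County A, North Carolina", "1000", "37", "001"],
    ["County B, North Carolina", "2500", "37", "003"],
]

# Separate the header row from the data rows
header, *rows = sample_response

# Pair each value with its column name
records = [dict(zip(header, row)) for row in rows]

# Values are assumed to arrive as strings, so convert counts before doing arithmetic
for record in records:
    record["B01001_001E"] = int(record["B01001_001E"])

total = sum(record["B01001_001E"] for record in records)
print(records[0])
print(total)
```

The same caution applies to `df_acs`: check `df_acs.dtypes` and convert columns with `pd.to_numeric` before computing with them.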

## API Keys & Limits
Most servers request good behavior; others enforce it.

- Size of single query
- Rate of queries (calls per second, or per day)
- User credentials specified by an API key
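
A script can stay within a rate limit by pausing between calls. Here is a minimal sketch of a client-side throttle; the class name `Throttle` and the 0.05-second interval are assumptions for illustration, not a limit published by any particular server.

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls."""

    def __init__(self, min_interval):
        self.min_interval = min_interval   # seconds between calls
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

throttle = Throttle(min_interval=0.05)     # assumed limit: at most 20 calls per second
start = time.monotonic()
for i in range(3):
    throttle.wait()
    # a real script would make its request here, e.g. with requests.get
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f} s")
```

Daily limits call for a different approach, such as counting calls or caching responses so that the same query is never sent twice.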

## From the Census Bureau

> [**What Are the Query Limits?**](https://www.census.gov/data/developers/guidance/api-user-guide.Query_Components.html)
>
> You can include up to 50 variables in a single API query and can make  up to 500 queries per IP address per day. More than 500 queries per IP  address per day requires that you [register for a Census key](https://www.census.gov/developers/). That key will be part of your data request URL string.
>
> Please keep in mind that all queries from a business or organization  having multiple employees might employ a proxy service or firewall. This  will make all of the users of that business or organization appear to  have the same IP address.  If multiple employees were making queries,  the 500-query limit would be for the proxy server/firewall, not the  individual user.
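
Since the key travels in the URL string, it can be treated as one more query parameter. This is a sketch under assumptions: the helper `build_census_url` is hypothetical, the parameter is assumed to be named `key`, and `YOUR_KEY_HERE` is a placeholder rather than a working key.

```python
from urllib.parse import urlencode

def build_census_url(path, query, key=None):
    """Assemble a request URL, appending the API key only when one is given."""
    params = dict(query)          # copy so the caller's dictionary is untouched
    if key is not None:
        params["key"] = key       # assumed parameter name
    # Keep commas, colons, and asterisks readable instead of percent-encoding them
    return path + "?" + urlencode(params, safe=",:*")

path = "https://api.census.gov/data/2021/acs/acs5"
query = {"get": "NAME,B01001_001E", "for": "county", "in": "state:37"}

url_without_key = build_census_url(path, query)
url_with_key = build_census_url(path, query, key="YOUR_KEY_HERE")
print(url_without_key)
print(url_with_key)
```

When using `requests`, adding the key to the `params` dictionary passed to `requests.get` achieves the same result.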

## Specialized Packages

The third tier of access to online data is the most convenient, if it exists. The data provider may also maintain a package in your programming language's repository ([PyPI](http://pypi.python.org) or [CRAN](http://cran.r-project.org)).

- Additional guidance on query parameters
- Returns data in native formats
- Handles all "encoding" problems

In [None]:
#See if the census package is installed
import census

If you get the error "No module named 'census'", we need to install it. From our container, we can use `pip` to install packages. This is a shell command, so we precede it with `!`.

In [None]:
#If you get an error above, install the census package using pip
!pip install census

In [None]:
from census import Census
key = None
c = Census(key, year=2021)

In [None]:
variables = ('NAME', 'B19001_001E')
params = {'for':'tract:*', 'in':'state:24'}
response = c.acs5.get(variables, params)
response = pd.DataFrame(response)
response.dtypes

#### Plot with Plotnine/ggplot

In [None]:
#Import plotnine (install it first if needed)
try:
    from plotnine import *
except ImportError:
    !pip install plotnine
    from plotnine import *

In [None]:
response[variables[1]] = pd.to_numeric(response[variables[1]])
(ggplot(response, aes(x='county', y=variables[1])) + geom_boxplot())