# STA 141B Lecture 8

The class website is <https://github.com/2019-winter-ucdavis-sta141b/notes>

### Announcements

* Links to sample projects posted in Project Description
* Submit project proposal as Markdown (`.md`) or Jupyter Notebook (`.ipynb`) in your project repo
* Assignment 1 grades posted soon
* Assignment 3 is posted

### Topics

* Query Strings
* API Keys
* Undocumented APIs

### Datasets

* [iTunes Search API](https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/)
* [The Guardian API](https://open-platform.theguardian.com/)
* [Yolo County Health Inspections](https://yoloeco.envisionconnect.com/)


### References

* [__requests__ documentation](http://docs.python-requests.org/en/master/)
* Python for Data Analysis, Ch. 6


## Example Questions

1. Approximately how many remixes are there of PSY's Gangnam Style?
2. Did Clinton or Trump get more newspaper coverage in the days leading up to the 2016 U.S. presidential election?

## Query Strings

Most of the functions we use have parameters, and you can pass arguments for those parameters when you call a function.

Endpoints in REST APIs work the same way, but the syntax is different. You pass arguments by appending `?` to the URL, followed by `PARAMETER=ARGUMENT` pairs separated by `&` (for example, `?term=Gangnam+Style&country=US`). The part of the URL after the `?` is called a _query string_.

For instance, Apple provides a web API for the iTunes store, with [documentation](https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/). We can use this to answer the example question about Gangnam Style.

The search endpoint is `https://itunes.apple.com/search`, and the documentation lists several parameters. We can use __requests__ to build the query string automatically.

In [52]:
import requests
import requests_cache
import pandas as pd
import time

requests_cache.install_cache("mycache")

In [10]:
response = requests.get("https://itunes.apple.com/search", params = {
    "term": "Gangnam Style",
    "country": "US",
    "limit": 200
})
response.raise_for_status()
response

<Response [200]>

Every response has a `.url` attribute that shows the URL used for the request.

In [11]:
response.url

'https://itunes.apple.com/search?country=US&limit=200&term=Gangnam+Style'
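To see what __requests__ is doing here, the same kind of query string can be built by hand with Python's standard `urllib.parse` module. This is only a sketch for illustration. In practice the `params` argument is more convenient. The order of the parameters in a query string generally does not matter to the server, so it may differ from the URL shown above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "https://itunes.apple.com/search"
params = {"term": "Gangnam Style", "country": "US", "limit": 200}

# urlencode joins PARAMETER=ARGUMENT pairs with "&" and escapes special
# characters; a space becomes "+".
query = urlencode(params)
url = endpoint + "?" + query
print(url)
# https://itunes.apple.com/search?term=Gangnam+Style&country=US&limit=200

# Going the other way: split a URL and recover the parameters.
# parse_qs returns every value as a list of strings.
parsed = parse_qs(urlsplit(url).query)
print(parsed)
# {'term': ['Gangnam Style'], 'country': ['US'], 'limit': ['200']}
```

Note that every argument becomes text in the URL: the integer `200` is sent as the string `"200"`.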

In [16]:
results = response.json()["results"]
results = pd.DataFrame(results)

is_gangnam = results["trackName"].str.contains("Gangnam Style")

results[is_gangnam][["trackName", "artistName"]].shape

(149, 2)

## Authentication

### API Keys

Many APIs use a _key_ or _token_ to identify the user.

For instance, The Guardian, a British newspaper, provides a [web API](https://open-platform.theguardian.com/) to access their news articles. You need an API key to use their web APIs. You can get one for free [here](https://bonobo.capi.gutools.co.uk/register/developer).

#### Storing API Keys

Your API key is private and your responsibility. Treat it like a password. Keep it secret! **Don't commit it** in your git repo.

In order to keep your API key separate from your code:
1. Save the API key in a text file.
2. Use Python to load the API key into a variable.

Python's built-in `open()` function opens a file, and the `.readline()` method reads a line from a file. Often you'll see these used with `with`, which automatically closes the file at the end of the block:

In [18]:
def read_key(keyfile):
    with open(keyfile) as f:
        return f.readline().strip("\n")

In [21]:
# Don't print out your actual API key
print(read_key("keys/example"))

key = read_key("keys/guardian")

This is my key


Now you can use the `key` variable anywhere you need the actual API key.
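Another common way to keep a key out of your code is an environment variable. The sketch below is one possible variant of `read_key`, not part of the lecture code: the function name `load_key` and the `<NAME>_API_KEY` naming convention are assumptions made for illustration. It checks the environment first and falls back to a key file.

```python
import os
from pathlib import Path


def load_key(name, keydir="keys"):
    """Return an API key from the environment, falling back to a key file."""
    # Hypothetical convention: the key "guardian" lives in GUARDIAN_API_KEY.
    env_name = name.upper() + "_API_KEY"
    value = os.environ.get(env_name)
    if value:
        return value.strip()

    # Fall back to the file-based approach: the first line of keys/<name>.
    with open(Path(keydir) / name) as f:
        return f.readline().strip()
```

With this in place, `key = load_key("guardian")` would work the same way as `read_key("keys/guardian")`. If the key files live in a `keys/` directory inside your repo, listing `keys/` in `.gitignore` (before the files are ever committed) should keep git from tracking them.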

#### Querying The Guardian

We've got our key, so let's use The Guardian API to answer our question about media coverage of Clinton and Trump.

Let's start by trying to get all of the articles about one of the candidates.

In [64]:
def get_articles(q, page = 1):
    response = requests.get("https://content.guardianapis.com/search", params = {
        "api-key": key,
        "q": q,
        "from-date": "2016-11-01",
        "to-date": "2016-11-08",
        "page-size": 50,
        "page": page
    })
    response.raise_for_status()
    return response.json()["response"]

In [66]:
def get_all_articles(q):
    # Get the first page, and find out how many pages there are.
    first_page = get_articles(q)
    pages = first_page["pages"]

    # Loop over remaining pages, pausing briefly between requests.
    results = first_page["results"]
    for p in range(2, pages + 1):
        results += get_articles(q, p)["results"]
        time.sleep(0.1)

    # Convert the articles to data frame, and the date column to a date.
    df = pd.DataFrame(results)
    df["webPublicationDate"] = pd.to_datetime(df["webPublicationDate"])
    
    # Get the day and day name, then count them.
    date = df["webPublicationDate"].dt
    dates = pd.DataFrame({"day": date.day, "day_name": date.day_name()})
    return dates.groupby(["day", "day_name"]).size()

In [67]:
print(get_all_articles("Clinton"))
print(get_all_articles("Trump"))

day  day_name 
1    Tuesday      28
2    Wednesday    25
3    Thursday     28
4    Friday       31
5    Saturday     19
6    Sunday       32
7    Monday       46
8    Tuesday      49
dtype: int64
day  day_name 
1    Tuesday      40
2    Wednesday    31
3    Thursday     38
4    Friday       45
5    Saturday     26
6    Sunday       44
7    Monday       53
8    Tuesday      55
dtype: int64
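The paging logic in `get_all_articles` can be separated from the counting step, which makes it easy to test without touching the network. The sketch below assumes only that each page is a dict with a `"pages"` total and a `"results"` list, as in the dicts returned by `get_articles`. The names `collect_results` and `fake_fetch`, and the fake data, are made up for illustration.

```python
def collect_results(fetch_page):
    """Gather "results" from every page of a paginated endpoint.

    fetch_page(page) must return a dict with "pages" (the total page count)
    and "results" (a list of items on that page).
    """
    first = fetch_page(1)
    results = list(first["results"])  # copy, so the first page is not mutated
    for page in range(2, first["pages"] + 1):
        results.extend(fetch_page(page)["results"])
    return results


# Stand-in for the real API: three fake pages of made-up article ids.
fake_pages = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fake_fetch(page):
    return {"pages": len(fake_pages), "results": fake_pages[page]}


print(collect_results(fake_fetch))  # ['a', 'b', 'c', 'd', 'e']
```

Against the real API, a call might look like `collect_results(lambda p: get_articles("Clinton", p))`. This sketch leaves out the `time.sleep` pause, which you would want to keep when making real requests.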


What are some ways this analysis could be improved?

* Check that articles matching "Trump" and "Clinton" are actually about the two candidates. Some may be about other things: the English word "trump", "Bill Clinton", etc.
* Check whether the API searches article text or just article titles.
* Use more sources, and use American newspapers (unless the goal was to analyze international news).
* Make visualizations.
* Use a larger time window.
* Use other kinds of data (e.g., poll results) to look for relationships.

Collecting and cleaning data takes a lot of very technical work, but it's only the first step in the analysis. When you finish data collection and cleaning, it can feel like you're finally done. Take a moment to congratulate yourself and step away from the data, so that when you come back you'll be ready to do a careful statistical analysis.

### OAuth

[OAuth](https://en.wikipedia.org/wiki/OAuth) is a way to give an application access to data on a website or web API.

You might run into OAuth if you use a web API where the data is private. For instance, Twitter provides a [web API](https://developer.twitter.com/en/docs.html) for managing your personal Twitter account. If you want to access the API from a Python script, first you have to use OAuth to tell Twitter that the script has permission to use your data.

OAuth can operate in several different ways. As always, check the documentation for the web API you want to use in order to find out what you need to do.

The simplest case of OAuth requires scripts to have a key or token from the web API provider. This is very similar to using an API key.
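In many OAuth 2.0 APIs, the token is sent in an `Authorization` header rather than in the query string. The sketch below builds such a request with the standard library without sending it. The endpoint and token are placeholders, not a real API, and the exact header format for any particular service should be checked against its documentation.

```python
from urllib.request import Request

# Hypothetical values: a real token comes from the API provider, and should be
# loaded from a file (as with API keys) rather than written in the script.
token = "EXAMPLE_TOKEN"
endpoint = "https://api.example.com/v1/me"

# Build (but do not send) a request that carries the token in a header.
request = Request(endpoint, headers={"Authorization": "Bearer " + token})
print(request.get_header("Authorization"))  # Bearer EXAMPLE_TOKEN
```

With __requests__, the equivalent would be to pass the same dictionary as the `headers` argument of `requests.get`.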

For more complicated cases, the **requests-oauthlib** package (documentation [here](https://requests-oauthlib.readthedocs.io/en/latest/)) may help.

## Undocumented Web APIs

Many websites use undocumented web APIs to get data. For example:

* [University of California Compensation](https://ucannualwage.ucop.edu/wage/)
* [Yolo County Health Inspections](https://yoloeco.envisionconnect.com/)

You can identify these websites by looking at the requests your browser makes, which are listed in the Network tab of the developer tools. In Firefox or Chrome, you can open the developer tools with `Ctrl-Shift-I` (`Cmd-Option-I` on macOS).

Requests to web APIs almost always return JSON or XML data. By examining the browser requests, you can work out the endpoints and parameters, allowing you to use the API.
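Once you have copied a request URL from the developer tools, the standard library can split it into an endpoint and its parameters. The URL below is a made-up placeholder, not one of the sites listed above.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL, standing in for one copied from the Network tab.
copied = "https://data.example.gov/api/inspections?city=Davis&page=2&format=json"

parts = urlsplit(copied)
endpoint = parts.scheme + "://" + parts.netloc + parts.path
params = dict(parse_qsl(parts.query))

print(endpoint)  # https://data.example.gov/api/inspections
print(params)    # {'city': 'Davis', 'page': '2', 'format': 'json'}
```

From there, `requests.get(endpoint, params=params)` would reproduce the browser's request, and you can change individual parameters to see how the response changes.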

**CAUTION:** Web APIs that are undocumented are often undocumented for a reason. Using an undocumented API may make someone angry or get you into legal trouble! Government and quasi-government websites (like the examples above) are probably okay, as long as you cache and rate-limit your requests. For everything else, look for an alternative or get permission first.
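Rate limiting can be as simple as the `time.sleep` call used earlier. The sketch below is one slightly more careful variant: it only sleeps for whatever part of the minimum interval has not already passed. The class name `Throttle` and the 0.2 second interval are arbitrary choices for illustration.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests
        self._last = None                 # time of the previous call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(0.2)
start = time.monotonic()
for page in range(3):
    throttle.wait()  # a real script would make its request right after this
elapsed = time.monotonic() - start
print(round(elapsed, 1))
```

The first call returns immediately and each later call waits, so three calls take roughly two intervals in total.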