# Lecture 7

January 23, 2022


### Announcements

* Submit project proposal on Canvas
* Assignment 2 due next week

### Topics

* Query Strings
* API Keys
* Undocumented APIs

### Datasets

* [iTunes Search API](https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/)
* [The Guardian API](https://open-platform.theguardian.com/)
* [Yolo County Health Inspections](https://yoloeco.envisionconnect.com/)


### References

* [__requests__ documentation](http://docs.python-requests.org/en/master/)
* Python for Data Analysis, Ch. 6


## Example Questions

Did Biden or Trump get more newspaper coverage in the days leading up to the 2020 U.S. presidential election?

## Authentication

### API Keys

Many APIs use a _key_ or _token_ to identify the user.

For instance, The Guardian, a British newspaper, provides a [web API](https://open-platform.theguardian.com/) for accessing its news articles. You need an API key to use it; you can [register for a free developer key](https://bonobo.capi.gutools.co.uk/register/developer).

#### Storing API Keys

Your API key is private and your responsibility. Treat it like a password. Keep it secret! 

In order to keep your API key separate from your code:
1. Save the API key in a text file.
2. Use Python to load the API key into a variable.

Python's built-in `open()` function opens a file, and the `.readline()` method reads a line from a file. Often you'll see these used with `with`, which automatically closes the file at the end of the block:

In [1]:
def read_key(keyfile):
    with open(keyfile) as f:
        return f.readline().strip("\n")

In [2]:
# Don't print out your actual API key
# print(read_key("api-keys/example"))

key = read_key("api-keys/guardian.txt")

In [3]:
type(key)

str

Now you can use the `key` variable anywhere you need the actual API key.
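
Another common way to keep a key out of your code is an environment variable. The sketch below is one possible way to support both approaches; the function name `load_key` and the variable name `GUARDIAN_API_KEY` are made up for illustration and are not required by The Guardian.

```python
import os

def load_key(env_var, keyfile):
    """Return the API key from an environment variable, or else from a file."""
    # Prefer the environment variable when it is set and non-empty.
    value = os.environ.get(env_var)
    if value:
        return value.strip()
    # Otherwise fall back to the first line of the key file.
    with open(keyfile) as f:
        return f.readline().strip()

# Hypothetical usage:
# key = load_key("GUARDIAN_API_KEY", "api-keys/guardian.txt")
```

Either way, make sure the key file (or any file that sets the variable) is not committed to a public repository.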

#### Querying The Guardian

We've got our key, so let's use The Guardian API to answer our question about media coverage of Biden and Trump.

Let's start by trying to get all of the articles about one of the candidates.

In [4]:
import requests

In [5]:
response = requests.get("https://content.guardianapis.com/search", params = {
        "api-key": key,
        "q": "Biden",
        "from-date": "2020-11-01",
        "to-date": "2020-11-10",
        "page-size": 50,
        "page": 1
    })

In [6]:
response.raise_for_status()
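
The `params` dictionary ends up as the _query string_ at the end of the URL. As a rough illustration, the standard library can build a similar query string by hand. This is only a sketch: **requests** does its own encoding, but for simple parameters like these the result should look the same. The key below is a placeholder.

```python
from urllib.parse import urlencode

base_url = "https://content.guardianapis.com/search"
params = {
    "api-key": "YOUR-KEY",  # placeholder, not a real key
    "q": "Biden",
    "from-date": "2020-11-01",
    "to-date": "2020-11-10",
    "page-size": 50,
    "page": 1,
}

# Join the name=value pairs with "&", then attach them to the URL after "?".
query_string = urlencode(params)
full_url = base_url + "?" + query_string
print(full_url)
```

Because the API key is part of the query string, it also appears in `response.url`, so avoid printing or sharing that URL.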

In [7]:
def get_articles(q, page = 1, from_date = "2020-11-01"):
    response = requests.get("https://content.guardianapis.com/search", params = {
        "api-key": key,
        "q": q,
        "from-date": from_date,
        "to-date": "2020-11-10",
        "page-size": 50,
        "page": page
    })
    response.raise_for_status()
    return response.json()["response"]

In [8]:
biden = get_articles("Biden")

In [9]:
biden

{'status': 'ok',
 'userTier': 'developer',
 'total': 474,
 'startIndex': 1,
 'pageSize': 50,
 'currentPage': 1,
 'pages': 10,
 'orderBy': 'relevance',
 'results': [{'id': 'australia-news/2020/nov/05/biden-edges-closer-to-victory',
   'type': 'article',
   'sectionId': 'australia-news',
   'sectionName': 'Australia news',
   'webPublicationDate': '2020-11-05T05:38:22Z',
   'webTitle': 'US election briefing for Australia: Biden edges closer to victory',
   'webUrl': 'https://www.theguardian.com/australia-news/2020/nov/05/biden-edges-closer-to-victory',
   'apiUrl': 'https://content.guardianapis.com/australia-news/2020/nov/05/biden-edges-closer-to-victory',
   'isHosted': False,
   'pillarId': 'pillar/news',
   'pillarName': 'News'},
  {'id': 'politics/2020/nov/10/johnsons-biden-win-tweet-contains-hidden-trump-congratulations',
   'type': 'article',
   'sectionId': 'politics',
   'sectionName': 'Politics',
   'webPublicationDate': '2020-11-10T13:19:16Z',
   'webTitle': "Johnson's Biden wi

In [10]:
pages = biden["pages"]
pages

10

In [11]:
pageSize = biden["pageSize"]
pageSize

50

In [12]:
# Loop over remaining pages.
import time

results = biden["results"]
for p in range(2, pages + 1):
    results += get_articles("Biden", p)["results"]
    time.sleep(0.1) # pause 0.1 seconds between requests
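
The same paging pattern can be separated from the network call, which makes it easier to test. This is a minimal sketch, assuming a hypothetical `fetch_page(page)` callable that returns a dictionary with `"pages"` and `"results"` keys, like the Guardian response shown above.

```python
import time

def collect_results(fetch_page, pause=0.1):
    """Collect the "results" lists from every page into one list.

    `fetch_page(page)` is assumed to return a dict with "pages" and
    "results" keys, like the Guardian response shown above.
    """
    first = fetch_page(1)
    results = list(first["results"])  # copy, so the first page is not modified
    for page in range(2, first["pages"] + 1):
        time.sleep(pause)  # be polite: wait between requests
        results.extend(fetch_page(page)["results"])
    return results

# Hypothetical usage with the function defined earlier:
# results = collect_results(lambda p: get_articles("Biden", p))
```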

In [19]:
results

[{'id': 'australia-news/2020/nov/05/biden-edges-closer-to-victory',
  'type': 'article',
  'sectionId': 'australia-news',
  'sectionName': 'Australia news',
  'webPublicationDate': '2020-11-05T05:38:22Z',
  'webTitle': 'US election briefing for Australia: Biden edges closer to victory',
  'webUrl': 'https://www.theguardian.com/australia-news/2020/nov/05/biden-edges-closer-to-victory',
  'apiUrl': 'https://content.guardianapis.com/australia-news/2020/nov/05/biden-edges-closer-to-victory',
  'isHosted': False,
  'pillarId': 'pillar/news',
  'pillarName': 'News'},
 {'id': 'politics/2020/nov/10/johnsons-biden-win-tweet-contains-hidden-trump-congratulations',
  'type': 'article',
  'sectionId': 'politics',
  'sectionName': 'Politics',
  'webPublicationDate': '2020-11-10T13:19:16Z',
  'webTitle': "Johnson's Biden win tweet contains hidden Trump congratulations",
  'webUrl': 'https://www.theguardian.com/politics/2020/nov/10/johnsons-biden-win-tweet-contains-hidden-trump-congratulations',
  'a

In [16]:
import pandas as pd

In [20]:
df = pd.DataFrame(results)

In [21]:
df.tail()

Unnamed: 0,id,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,isHosted,pillarId,pillarName
469,world/live/2020/nov/10/coronavirus-live-news-w...,liveblog,world,World news,2020-11-10T23:55:55Z,Greek football players attend 'coronavirus par...,https://www.theguardian.com/world/live/2020/no...,https://content.guardianapis.com/world/live/20...,False,pillar/news,News
470,world/live/2020/nov/06/coronavirus-live-news-u...,liveblog,world,World news,2020-11-07T00:21:25Z,"France reports record 60,486 new cases; Russia...",https://www.theguardian.com/world/live/2020/no...,https://content.guardianapis.com/world/live/20...,False,pillar/news,News
471,business/live/2020/nov/10/uk-unemployment-redu...,liveblog,business,Business,2020-11-10T17:14:20Z,FTSE 100 vaccine rally continues; UK unemploym...,https://www.theguardian.com/business/live/2020...,https://content.guardianapis.com/business/live...,False,pillar/news,News
472,australia-news/live/2020/nov/02/australia-coro...,liveblog,australia-news,Australia news,2020-11-02T08:27:07Z,Ministers pay tribute to Christine Holgate – a...,https://www.theguardian.com/australia-news/liv...,https://content.guardianapis.com/australia-new...,False,pillar/news,News
473,world/live/2020/nov/03/vienna-austria-synagogu...,liveblog,world,World news,2020-11-03T13:32:50Z,Vienna shooting: fourth victim dies as police ...,https://www.theguardian.com/world/live/2020/no...,https://content.guardianapis.com/world/live/20...,False,pillar/news,News


In [22]:
df["webPublicationDate"] = pd.to_datetime(df["webPublicationDate"])

In [23]:
df.head()

Unnamed: 0,id,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,isHosted,pillarId,pillarName
0,australia-news/2020/nov/05/biden-edges-closer-...,article,australia-news,Australia news,2020-11-05 05:38:22+00:00,US election briefing for Australia: Biden edge...,https://www.theguardian.com/australia-news/202...,https://content.guardianapis.com/australia-new...,False,pillar/news,News
1,politics/2020/nov/10/johnsons-biden-win-tweet-...,article,politics,Politics,2020-11-10 13:19:16+00:00,Johnson's Biden win tweet contains hidden Trum...,https://www.theguardian.com/politics/2020/nov/...,https://content.guardianapis.com/politics/2020...,False,pillar/news,News
2,us-news/live/2020/nov/07/us-election-joe-biden...,liveblog,us-news,US news,2020-11-08 10:04:24+00:00,Biden addresses Americans after victory – as i...,https://www.theguardian.com/us-news/live/2020/...,https://content.guardianapis.com/us-news/live/...,False,pillar/news,News
3,us-news/2020/nov/08/donald-trump-concede-legal...,article,us-news,US news,2020-11-08 19:39:23+00:00,Republicans back Trump challenge to Biden elec...,https://www.theguardian.com/us-news/2020/nov/0...,https://content.guardianapis.com/us-news/2020/...,False,pillar/news,News
4,us-news/2020/nov/07/joe-biden-disunited-states...,article,us-news,US news,2020-11-07 10:00:07+00:00,Joe Biden poised to inherit Disunited States o...,https://www.theguardian.com/us-news/2020/nov/0...,https://content.guardianapis.com/us-news/2020/...,False,pillar/news,News


In [None]:
# Use the .dt accessor to get date components such as the day and day name.
date = df["webPublicationDate"].dt
date

In [None]:
date.day_name()

In [None]:
dates = pd.DataFrame({"day": date.day, "day_name": date.day_name()})

In [None]:
dates

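Now let's write these steps as functions. Note that `get_articles` is redefined here with a shorter date range (November 1 to 5).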

In [13]:
def get_articles(q, page = 1):
    response = requests.get("https://content.guardianapis.com/search", params = {
        "api-key": key,
        "q": q,
        "from-date": "2020-11-01",
        "to-date": "2020-11-05",
        "page-size": 50,
        "page": page
    })
    response.raise_for_status()
    return response.json()["response"]

In [17]:
def get_all_articles(q, time_sleep = 0.1):
    # Get the first page, and find out how many pages there are.
    candidate = get_articles(q)
    pages = candidate["pages"]

    # Loop over remaining pages.
    results = candidate["results"]
    for p in range(2, pages + 1):
        results += get_articles(q, p)["results"]
        time.sleep(time_sleep)

    # Convert the articles to data frame, and the date column to a date.
    df = pd.DataFrame(results)
    df["webPublicationDate"] = pd.to_datetime(df["webPublicationDate"])
    
    # Get the day and day name, then count them.
    date = df["webPublicationDate"].dt
    dates = pd.DataFrame({"day": date.day, "day_name": date.day_name()})
    return dates.groupby(["day", "day_name"]).size()

In [24]:
print(get_all_articles("Biden"))

day  day_name 
1    Sunday       29
2    Monday       26
3    Tuesday      49
4    Wednesday    60
5    Thursday     40
dtype: int64


In [25]:
print(get_all_articles("Trump"))

day  day_name 
1    Sunday       47
2    Monday       37
3    Tuesday      56
4    Wednesday    64
5    Thursday     49
dtype: int64


What are some ways this analysis could be improved?

* Check that articles about "Trump" and "Biden" are actually about the two candidates. Some may be about other things, such as the English verb "trump" or Hunter Biden.
* Check whether the API searches article text or just article titles.
* Use more sources, and use American newspapers (unless the goal was to analyze international news).
* Make visualizations.
* Use a larger time window.
* Use other kinds of data (e.g., poll results) to look for relationships.
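
As a starting point for the first suggestion, here is a crude sketch that keeps only titles containing a candidate's capitalized surname as a whole word. The patterns and the function name `mentions_candidate` are illustrative assumptions, not a validated classifier, and the third title is made up to show the verb "trump".

```python
import re

# Assumed patterns: a capitalized surname as a whole word.
# These are illustrative guesses, not a validated classifier.
CANDIDATE_PATTERNS = {
    "Biden": re.compile(r"\bBiden\b"),
    "Trump": re.compile(r"\bTrump\b"),
}

def mentions_candidate(title, candidate):
    """Return True if the title contains the candidate's capitalized surname."""
    return bool(CANDIDATE_PATTERNS[candidate].search(title))

titles = [
    "US election briefing for Australia: Biden edges closer to victory",
    "Johnson's Biden win tweet contains hidden Trump congratulations",
    "Love will always trump hate",  # made-up title using the verb "trump"
]
biden_titles = [t for t in titles if mentions_candidate(t, "Biden")]
trump_titles = [t for t in titles if mentions_candidate(t, "Trump")]
print(len(biden_titles), len(trump_titles))
```

A filter like this cannot tell Joe Biden from Hunter Biden, and it misses articles that mention a candidate only in the body text, so it is a first pass at best.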

Collecting and cleaning data takes a lot of very technical work, but it's only the first step in the analysis. When you finish data collection and cleaning, it can feel like you're finally done. Take a moment to congratulate yourself and step away from the data, so that when you come back you'll be ready to do a careful statistical analysis.

### OAuth

[OAuth](https://en.wikipedia.org/wiki/OAuth) is a way to give an application access to data on a website or web API.

You might run into OAuth if you use a web API where the data is private. For instance, Twitter provides a [web API](https://developer.twitter.com/en/docs.html) for managing your personal Twitter account. If you want to access the API from a Python script, first you have to use OAuth to tell Twitter that the script has permission to use your data.

OAuth can operate in several different ways. As always, check the documentation for the web API you want to use in order to find out what you need to do.

The simplest case of OAuth requires scripts to have a key or token from the web API provider. This is very similar to using an API key.
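
In this simple case, the token is often sent in an `Authorization` header using the "Bearer" scheme rather than in the query string. The sketch below assumes that convention; the helper name `bearer_headers`, the token, and the URL are placeholders, and the API's documentation has the final say on where the token goes.

```python
def bearer_headers(token):
    """Build request headers that carry an OAuth 2.0 bearer token."""
    return {"Authorization": f"Bearer {token}"}

headers = bearer_headers("YOUR-TOKEN")  # placeholder token
print(headers)

# Hypothetical usage with requests (the URL is a made-up example):
# import requests
# response = requests.get("https://api.example.com/me", headers=headers)
# response.raise_for_status()
```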

For more complicated cases, the **requests-oauthlib** package (documentation [here](https://requests-oauthlib.readthedocs.io/en/latest/)) may help.

## Undocumented Web APIs

Many websites use undocumented web APIs to get data. For example:

* [University of California Compensation](https://ucannualwage.ucop.edu/wage/)
* [Yolo County Health Inspections](https://yoloeco.envisionconnect.com/)

You can identify these websites by looking at requests in your browser's developer tools. In Firefox or Chrome, you can open the developer tools with `ctrl-shift-i`.

Requests to web APIs almost always return JSON or XML data. By examining the browser requests, you can work out the endpoints and parameters, allowing you to use the API.
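
Once you have copied a request URL from the developer tools, it can help to split it into the endpoint and its parameters. A small sketch using the standard library, with the request URL from the Yolo County example in this lecture:

```python
from urllib.parse import urlsplit, parse_qs

# A request URL as it might be copied from the browser's network tab.
copied_url = (
    "https://yoloeco.envisionconnect.com/api/pressAgentClient/searchFacilities"
    "?PressAgentOid=c08cb189-894c-4c8c-b595-a5ef010226b4"
)

parts = urlsplit(copied_url)
# The endpoint is everything before the "?".
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
# parse_qs returns a list of values per name; keep the first value of each.
params = {name: values[0] for name, values in parse_qs(parts.query).items()}

print(endpoint)
print(params)
```

The `endpoint` and `params` values can then be passed to **requests** as the URL and the `params` argument.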

**CAUTION:** Web APIs that are undocumented are often undocumented for a reason. Using an undocumented API may make someone angry or get you into legal trouble! Government and quasi-government websites (like the examples above) are probably okay, as long as you cache and rate-limit your requests. For everything else, find an alternative or get permission first.
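
To make "cache and rate-limit" concrete, here is a minimal in-memory sketch. The class name `PoliteFetcher` and its parameters are invented for illustration; a package such as **requests-cache** handles the caching part more robustly, with a cache that persists between sessions.

```python
import time

class PoliteFetcher:
    """Wrap a fetch function with an in-memory cache and a minimum delay."""

    def __init__(self, fetch, min_interval=1.0):
        self.fetch = fetch  # any function that takes a key (e.g. a URL)
        self.min_interval = min_interval
        self.cache = {}
        self.last_call = None

    def get(self, key):
        # Serve repeated requests from the cache without touching the server.
        if key in self.cache:
            return self.cache[key]
        # Wait until `min_interval` seconds have passed since the last real call.
        if self.last_call is not None:
            wait = self.min_interval - (time.monotonic() - self.last_call)
            if wait > 0:
                time.sleep(wait)
        self.last_call = time.monotonic()
        self.cache[key] = self.fetch(key)
        return self.cache[key]

# Hypothetical usage:
# fetcher = PoliteFetcher(lambda url: requests.get(url).json(), min_interval=1.0)
```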

Let's reverse engineer the Yolo County Health Inspections web API so that we can get data about local restaurants.

In [26]:
# install requests-cache package 
!pip install requests-cache 



In [27]:
import numpy as np
import pandas as pd
import requests
import requests_cache

requests_cache.install_cache("mycache")

In [56]:
def get_health_info(q, field = "FacilityName"):
    # Different ways to attach data to a POST request:
    #   With data=, the body is form-encoded, like a query string:
    #     FacilityName=pluto%27s
    #   With json=, the body is a JSON object:
    #     {"FacilityName": "pluto's"}
    response = requests.post("https://yoloeco.envisionconnect.com/api/pressAgentClient/searchFacilities", params = {
        "PressAgentOid": "c08cb189-894c-4c8c-b595-a5ef010226b4",
    }, json = {
        # Other candidate fields: "FacilityId", "LastScore", "Address"
        field: q
    })
    response.raise_for_status()
    return response.json()


# Searching by facility ID with "FA" returned a server error.
result = get_health_info("FA", field = "FacilityId")

HTTPError: 500 Server Error: Internal Server Error for url: https://yoloeco.envisionconnect.com/api/pressAgentClient/searchFacilities?PressAgentOid=c08cb189-894c-4c8c-b595-a5ef010226b4

We could reverse engineer other parts of the API to get detailed data about health violations.
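
The comments in `get_health_info` mention two ways to attach data to a POST request. The sketch below approximates what each one puts in the request body, using only the standard library; **requests** builds the bodies itself, so treat this as an illustration rather than its exact output.

```python
import json
from urllib.parse import urlencode

payload = {"FacilityName": "pluto's"}

form_body = urlencode(payload)   # roughly what data=payload sends
json_body = json.dumps(payload)  # roughly what json=payload sends

print(form_body)
print(json_body)
```

With `data=`, **requests** labels the body as `application/x-www-form-urlencoded`; with `json=`, it labels it as `application/json`. The developer tools show which one the website itself uses.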

In [39]:
result = get_health_info("CREPE")
result

[{'FacilityId': 'FA0004831',
  'FacilityName': 'CREPEVILLE DAVIS',
  'Address': '330 3RD ST ',
  'CityStateZip': 'DAVIS CA 95616 ',
  'LastScore': 100.0,
  'attachmentId': '78dc0f32-9160-4820-b5c4-ad2b00fdb87a'}]

In [42]:
result = get_health_info("a")

In [43]:
result

[{'FacilityId': 'FA0021354',
  'FacilityName': 'A KE TACO',
  'Address': '1900 PARKWOOD DR ',
  'CityStateZip': 'YUBA CITY CA 95993 ',
  'LastScore': 100.0,
  'attachmentId': 'c5bf43d6-aa71-416f-b95a-ad4f00b3cda4'},
 {'FacilityId': 'FA0019474',
  'FacilityName': 'ACOUSTIC EVENTS',
  'Address': '4467 D ST ',
  'CityStateZip': 'SACRAMENTO CA 95819 ',
  'LastScore': 100.0,
  'attachmentId': '17afd99d-fe2b-4627-98a7-acb60091e364'},
 {'FacilityId': 'FA0014014',
  'FacilityName': 'AFC SUSHI / HOT WOK @ BEL AIR #526',
  'Address': '1885 E GIBSON RD ',
  'CityStateZip': 'WOODLAND CA 95776 ',
  'LastScore': 100.0,
  'attachmentId': None},
 {'FacilityId': 'FA0014013',
  'FacilityName': "AFC SUSHI / HOT WOK @ RALEY'S #206",
  'Address': '367 W MAIN ST ',
  'CityStateZip': 'WOODLAND CA 95695 ',
  'LastScore': 100.0,
  'attachmentId': None},
 {'FacilityId': 'FA0014015',
  'FacilityName': "AFC SUSHI / HOT WOK @ RALEY'S #448",
  'Address': '1601 W CAPITOL AVE ',
  'CityStateZip': 'WEST SACRAMENTO CA 

In [33]:
result = get_health_info("In-N-Out")

In [34]:
result

[{'FacilityId': 'FA0003293',
  'FacilityName': 'IN-N-OUT BURGER #127',
  'Address': '1020 OLIVE DR ',
  'CityStateZip': 'DAVIS CA 95616 ',
  'LastScore': 100.0,
  'attachmentId': '24bf15d9-8778-415d-a098-ae0700fb9057'},
 {'FacilityId': 'FA0011197',
  'FacilityName': 'IN-N-OUT BURGER #231',
  'Address': '2011 BRONZE STAR DR ',
  'CityStateZip': 'WOODLAND CA 95776 ',
  'LastScore': 100.0,
  'attachmentId': None},
 {'FacilityId': 'FA0003160',
  'FacilityName': 'IN-N-OUT BURGERS #225',
  'Address': '780 IKEA CT ',
  'CityStateZip': 'WEST SACRAMENTO CA 95691 ',
  'LastScore': 100.0,
  'attachmentId': '641e66ee-2819-4c1a-a6f1-adbf0083baca'}]

In [45]:
result = get_health_info("Mikuni")
result

[{'FacilityId': 'FA0020204',
  'FacilityName': 'MIKUNI JAPANESE RESTAURANT & SUSHI BAR',
  'Address': '500 1ST ST 19 STE ',
  'CityStateZip': 'DAVIS CA 95616 ',
  'LastScore': 100.0,
  'attachmentId': '02a6642a-08c4-483e-b200-ace501170842'}]

In [35]:
result_pd = pd.DataFrame(result)

In [36]:
result_pd

Unnamed: 0,FacilityId,FacilityName,Address,CityStateZip,LastScore,attachmentId
0,FA0003293,IN-N-OUT BURGER #127,1020 OLIVE DR,DAVIS CA 95616,100.0,24bf15d9-8778-415d-a098-ae0700fb9057
1,FA0011197,IN-N-OUT BURGER #231,2011 BRONZE STAR DR,WOODLAND CA 95776,100.0,
2,FA0003160,IN-N-OUT BURGERS #225,780 IKEA CT,WEST SACRAMENTO CA 95691,100.0,641e66ee-2819-4c1a-a6f1-adbf0083baca


In [44]:
index = ["In-N-out", "ali baba", "crepeville"]