# APIs with Keys

 <div class="alert alert-warning">
    <b>Note:</b> Run the code below first so you can install a needed package. Then restart the kernel.
 </div>

In [1]:
%pip install pyyaml

Note: you may need to restart the kernel to use updated packages.


In [1]:
from requests import get 
import pandas as pd 
import numpy as np
import yaml

# Quick Review

REST APIs allow us to send `get` requests to retrieve data from a website. For instance, the catfacts API will return a random cat-related fact when we send a request:

In [2]:
catfact = get('https://catfact.ninja/fact')

In [3]:
catfact.content

b'{"fact":"Blue-eyed, pure white cats are frequently deaf.","length":47}'

In most cases, API data will be returned in JSON format. JSON has a structure very similar to Python dictionaries, and we can turn a response into a Python dictionary with the `.json()` method:

In [4]:
catfact_dict = catfact.json()

In [5]:
catfact_dict

{'fact': 'Blue-eyed, pure white cats are frequently deaf.', 'length': 47}

We can then work with the result more or less like any other dictionary:

In [6]:
catfact_dict['fact']

'Blue-eyed, pure white cats are frequently deaf.'
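To make the link between the raw bytes and the dictionary concrete, here is a small sketch that parses the same payload with Python's built-in `json` module. The bytes are copied from the output above; `raw` and `parsed` are just illustrative names. This is roughly the conversion that `.json()` performs for us.

```python
import json

# Raw response body, copied from the catfact output shown earlier
raw = b'{"fact":"Blue-eyed, pure white cats are frequently deaf.","length":47}'

# json.loads accepts bytes or str and returns ordinary Python objects
parsed = json.loads(raw)

print(type(parsed))
print(parsed["fact"])
print(parsed["length"])
```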

In many cases, results will be more complex and may contain multiple layers of nesting. 

In [7]:
breeds = get('https://catfact.ninja/breeds')
breed_data = breeds.json()

Here, we've got several different keys. The `data` key contains a list which, in turn, contains a series of dictionaries, with each one describing a different breed of cat:

In [8]:
breed_data.keys()

dict_keys(['current_page', 'data', 'first_page_url', 'from', 'last_page', 'last_page_url', 'links', 'next_page_url', 'path', 'per_page', 'prev_page_url', 'to', 'total'])

In [9]:
# Viewing the first two breeds
breed_data['data'][:2]

[{'breed': 'Abyssinian',
  'country': 'Ethiopia',
  'origin': 'Natural/Standard',
  'coat': 'Short',
  'pattern': 'Ticked'},
 {'breed': 'Aegean',
  'country': 'Greece',
  'origin': 'Natural/Standard',
  'coat': 'Semi-long',
  'pattern': 'Bi- or tri-colored'}]

So making this into something useful for analysis will generally require us to do a little clean-up:

In [10]:
breedlist = [i['breed'] for i in breed_data['data']]
print(breedlist)

['Abyssinian', 'Aegean', 'American Curl', 'American Bobtail', 'American Shorthair', 'American Wirehair', 'Arabian Mau', 'Australian Mist', 'Asian', 'Asian Semi-longhair', 'Balinese', 'Bambino', 'Bengal', 'Birman', 'Bombay', 'Brazilian Shorthair', 'British Semi-longhair', 'British Shorthair', 'British Longhair', 'Burmese', 'Burmilla', 'California Spangled', 'Chantilly-Tiffany', 'Chartreux', 'Chausie']


In most cases, we'll try to reformat things as a pandas DataFrame, but this can be more complicated depending on the structure of our result.

In [11]:
breed_df = pd.DataFrame(breed_data['data'])
breed_df.head()

Unnamed: 0,breed,country,origin,coat,pattern
0,Abyssinian,Ethiopia,Natural/Standard,Short,Ticked
1,Aegean,Greece,Natural/Standard,Semi-long,Bi- or tri-colored
2,American Curl,United States,Mutation,Short/Long,All
3,American Bobtail,United States,Mutation,Short/Long,All
4,American Shorthair,United States,Natural,Short,All but colorpoint


## Query Parameters

Simple APIs like catfacts may only require a single query to get data. But often we'll need to add additional parameters in order to filter our results.

For instance, the [Nobel Prize API](https://www.nobelprize.org/organization/developer-zone-2/) allows us to set multiple parameters to get specific results for years or subjects.

The base URL for the Prizes data is:
http://api.nobelprize.org/2.1/nobelPrizes


But we could modify this URL to get only the data for the year 2024 for the economics category. (The parameters are the `key=value` pairs that come after the `?`, separated by `&`.)

In [12]:
nobel = get('http://api.nobelprize.org/2.1/nobelPrizes?nobelPrizeYear=2024&yearTo=2024&nobelPrizeCategory=eco')

In [13]:
print(nobel.json())

{'nobelPrizes': [{'awardYear': '2024', 'category': {'en': 'Economic Sciences', 'no': 'Økonomi', 'se': 'Ekonomi'}, 'categoryFullName': {'en': 'The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel', 'no': 'Sveriges Riksbanks pris i økonomisk vitenskap til minne om Alfred Nobel', 'se': 'Sveriges Riksbanks pris i ekonomisk vetenskap till Alfred Nobels minne'}, 'dateAwarded': '2024-10-14', 'prizeAmount': 11000000, 'prizeAmountAdjusted': 11000000, 'links': [{'rel': 'nobelPrize', 'href': 'https://api.nobelprize.org/2/nobelPrize/eco/2024', 'action': 'GET', 'types': 'application/json'}], 'laureates': [{'id': '1044', 'knownName': {'en': 'Daron Acemoglu'}, 'fullName': {'en': 'Daron Acemoglu'}, 'portion': '1/3', 'sortOrder': '1', 'motivation': {'en': 'for studies of how institutions are formed and affect prosperity', 'se': 'för studier av hur institutioner formas och påverkar välstånd'}, 'links': [{'rel': 'laureate', 'href': 'https://api.nobelprize.org/2/laureate/1044', 'acti

Instead of manually typing out query parameters, we'll typically specify them using a python dictionary. So here's how I would adjust the query to retrieve the winner of the 1901 prize for Chemistry. Note that, when we access the `url` attribute from the response, we can see the URL is structured very similarly to the one above:

In [14]:
base_url = 'http://api.nobelprize.org/2.1/nobelPrizes'
parameters = {"nobelPrizeYear":1901, 
              "yearTo":1901, 
              "nobelPrizeCategory":"che"}

nobel = get(base_url, params=parameters)

nobel.url

'https://api.nobelprize.org/2.1/nobelPrizes?nobelPrizeYear=1901&yearTo=1901&nobelPrizeCategory=che'
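To see how a parameter dictionary turns into the `key=value` pairs in the URL, here is a minimal sketch that uses the standard library's `urlencode` instead of `requests`. It only builds the string and sends nothing; `query_string` and `full_url` are illustrative names. `requests` does its own encoding internally, and the scheme shown in `nobel.url` may differ if the server redirects the request.

```python
from urllib.parse import urlencode

base_url = "http://api.nobelprize.org/2.1/nobelPrizes"
parameters = {"nobelPrizeYear": 1901,
              "yearTo": 1901,
              "nobelPrizeCategory": "che"}

# Each key/value pair becomes key=value, and the pairs are joined with "&"
query_string = urlencode(parameters)

# The query string is attached to the base URL after a "?"
full_url = f"{base_url}?{query_string}"
print(full_url)
```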

Also note that accessing specific elements of this json data is quite a bit more complex than the previous case because we have several layers of nesting to navigate:

In [15]:
nobel_data = nobel.json()
# Getting the English motivation for the first prize winner:
nobel_data['nobelPrizes'][0]['laureates'][0]['motivation']['en']


'in recognition of the extraordinary services he has rendered by the discovery of the laws of chemical dynamics and osmotic pressure in solutions'

And so converting this to a dataframe also gives a slightly less useable result: some of our cells contain nested data that we would probably need to manipulate further to really use.

In [16]:
pd.DataFrame(nobel_data['nobelPrizes'])

Unnamed: 0,awardYear,category,categoryFullName,dateAwarded,prizeAmount,prizeAmountAdjusted,links,laureates
0,1901,"{'en': 'Chemistry', 'no': 'Kjemi', 'se': 'Kemi'}","{'en': 'The Nobel Prize in Chemistry', 'no': '...",1901-11-12,150782,9704878,"[{'rel': 'nobelPrize', 'href': 'https://api.no...","[{'id': '160', 'knownName': {'en': 'Jacobus H...."
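One possible way to deal with the nested cells is `pd.json_normalize`, which can expand a nested list into rows and nested dictionaries into columns. The sketch below runs on a small hand-built sample that mirrors part of the 2024 economics response shown earlier, not on the full API result; `sample` and `flat` are illustrative names.

```python
import pandas as pd

# Hand-built sample mirroring part of the 2024 economics response shown earlier
sample = [
    {
        "awardYear": "2024",
        "category": {"en": "Economic Sciences"},
        "laureates": [
            {"id": "1044", "knownName": {"en": "Daron Acemoglu"}, "portion": "1/3"},
        ],
    }
]

# One row per laureate; nested dictionaries become dotted column names,
# and the prize-level fields listed in `meta` are repeated on every row
flat = pd.json_normalize(
    sample,
    record_path="laureates",
    meta=["awardYear", ["category", "en"]],
)

print(flat.columns.tolist())
print(flat.loc[0, "knownName.en"])
```

With the real data, the same call could be tried on `nobel_data['nobelPrizes']`, although prizes that lack a `laureates` entry may need to be handled separately.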


Here's a bit more documentation on the Nobel Prize API:
[API documentation](https://app.swaggerhub.com/apis/NobelMedia/NobelMasterData/2.1#/info)


# API Keys
Many times, data providers don't want to provide access to their APIs to just anybody. In order to make sure that they control access and track usage of the API, they might require the use of an API key. An API key is basically like a password that is uniquely associated with your account that you use every time you want to use that API.

# New York Times API
One example of an API that requires a key is the New York Times API, which we'll use to demonstrate an authenticated API call. We start by navigating to the NYT API site so that we can look up instructions on how to access their API.

We need to get an API key from the New York Times first before we can access the API. We can go to their Dev Portal to sign up and get access: https://developer.nytimes.com/apis. You'll need to make an account, then log in. After you have an account, you can access your Apps by clicking on your username at the top right and create an app. Enable the APIs that you want to have access to, and get the key.

After you get the key, store it in a separate file rather than in the notebook itself. Here we use a YAML file called `keys.yml`, with the key saved under the name `nyt_api_key`. <b>We want to avoid writing out the key in any documents we share with others</b>, so we keep the key in that file and read it into Python whenever we call the API.



<b style="color:red;"> Question 1: Do the steps described above and write your api key in the `keys.yml` file in the appropriate spot in the project directory</b>

In [18]:
with open('../../keys.yml', 'r') as file:
    keys = yaml.safe_load(file)


In [19]:
nyt_key = keys['nyt_api_key']
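If an entry is missing or left blank in `keys.yml`, the lookup above fails with a fairly terse error (or silently passes along an empty key). The sketch below shows one way to check for that up front. `require_keys` is a hypothetical helper, not part of `yaml`, and `example_keys` is a plain dictionary standing in for what `yaml.safe_load` would return.

```python
def require_keys(loaded, needed):
    """Return the needed entries, or raise a readable error if any are missing."""
    loaded = loaded or {}  # yaml.safe_load returns None for an empty file
    missing = [name for name in needed if not loaded.get(name)]
    if missing:
        raise KeyError(f"Missing from keys.yml: {', '.join(missing)}")
    return {name: loaded[name] for name in needed}


# Stand-in for the dictionary that yaml.safe_load would return
example_keys = {"nyt_api_key": "not-a-real-key", "census_api_key": ""}

print(require_keys(example_keys, ["nyt_api_key"]))

try:
    # The census entry is blank, so this call raises
    require_keys(example_keys, ["nyt_api_key", "census_api_key"])
except KeyError as err:
    print(err)
```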

# NYT Archives
After you do this, you can poke around on the API site a bit to get an idea of what data is available and how you might access that data. We'll start with the Archives API, for which the documentation can be found here: https://developer.nytimes.com/docs/archive-product/1/overview. The Archives API can be used to access article metadata (such as headline, byline, article URL, and so on) for a given month. Let's try getting the content for January 2019.

Following the instructions given on their site, we start with the base URL.

In [20]:
base_url = "https://api.nytimes.com/svc/archive/v1/2019/1.json"

In [21]:
r = get(base_url, params= {'api-key':nyt_key})

Now we can check the status code. Remember that code 200 means everything is fine. When we're sending authentication information, a code of 401 will indicate that our request is not authorized. 

In [22]:
r.status_code

200
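For a quick reminder of what a given status code means, the standard library's `http.HTTPStatus` can translate codes into their standard phrases. This is only a lookup sketch and sends no request; `describe_status` is a hypothetical helper name.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '401 Unauthorized' for a status code."""
    return f"{code} {HTTPStatus(code).phrase}"


for code in (200, 401, 404):
    print(describe_status(code))
```

With `requests`, calling `r.raise_for_status()` is another option: it raises an exception when the response has a 4xx or 5xx status code.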

We are good to go. Now let's get the content.

In [23]:
response = r.json()  # Convert response to JSON format

<b style="color:red;">Question 2: How many NYT articles were there in January 2019?</b>



In [24]:

response = r.json()  # Convert response to JSON format
len(response['response']['docs'])


4482

In [25]:
# OR:
response['response']['meta']['hits']


4482

<b style="color:red;">Question 3: What are the types of metadata that are available in the data from this API? Show the keys from one article to answer this question.</b>

In [26]:
response['response']['docs'][0].keys()

dict_keys(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline', 'type_of_material', '_id', 'word_count', 'uri'])

<b style="color:red;">Question 4: Create a list called `abstracts` that contains the article abstract for each article in `response`.

</b>

In [27]:
abstracts = [i['abstract'] for i in response['response']['docs']]

abstracts[1:5]

['Imagine what we could do with our money, and hours, if we set our phones aside for a year.',
 'Wells was a deep threat for the vaunted Oakland offenses of the late 1960s, but his playing days ended after he served a seasonlong prison sentence.',
 'Can the Constitution withstand the partisans?',
 'The Christian right doesn’t like the president only for his judges. They like his style.']

## Editing strings

If we wanted to get all of the metadata of articles published in a certain year, or over an extended time period, we would actually need to change the base URL that we were using. That's because the URL as we've defined it contains the year and month hard-coded into it. This might get tedious, so we can instead edit the strings to do this automatically. This way, we are able to, for example, loop through years and months and get the data we want.



In [28]:
month = 10
year = 2020

url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"

data = get(url, params= {'api-key':nyt_key})

In [29]:
data.url

'https://api.nytimes.com/svc/archive/v1/2020/10.json?api-key=<YOUR_API_KEY>'

In [30]:
month = 11
year = 2020

url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"

url

'https://api.nytimes.com/svc/archive/v1/2020/11.json'

The `f` in front of the string indicates that it is an f-string, and the pieces that we want to replace within the string are included with curly braces. We use the names of the objects we want to put into those places, and the values are then interpolated into the string.
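As a sketch of the looping idea, the function below builds the Archive URL for every month of a year with the same f-string pattern. It only creates the strings and sends no requests; `archive_urls` is a hypothetical helper name.

```python
def archive_urls(year):
    """Build the twelve monthly Archive URLs for one year (no requests are sent)."""
    return [
        f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
        for month in range(1, 13)
    ]


urls = archive_urls(2020)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each URL could then be passed to `get` along with the `api-key` parameter, ideally with a pause between requests so that any rate limits on the API are respected.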

<b style="color:red;">Question 5: Write a function called nyt_api that has two arguments, month and year, and outputs the response from pulling from the NYT Archive API for that month and year.</b>

In [31]:
def nyt_api(month, year):
    # Build the Archive URL for the requested month and send the request
    url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
    return get(url, params={'api-key': nyt_key})

res = nyt_api(1, 2020)

<b style="color:red;">Question 6: Write a function called nyt_headlines that has two arguments, month and year, and outputs a list of headlines from pulling from the NYT Archive API for that month and year.</b>

In [32]:
def nyt_headlines(month, year):
    res = nyt_api(month, year)
    out = res.json()
    # Each article stores its main headline under ['headline']['main']
    headlines = [i['headline']['main'] for i in out['response']['docs']]
    return headlines

headlines = nyt_headlines(12, 2024)


In [33]:

headlines[:10]

['Trump Says He Will Nominate Kash Patel to Run F.B.I.',
 'A College Volleyball Team’s Season in the Spotlight Comes to an End',
 'No Corrections: Dec. 1, 2024',
 'Trump Picks a Florida Sheriff as D.E.A. Administrator',
 'Five Things to Know About Kash Patel, Trump’s Pick to Lead the F.B.I.',
 'Lou Carnesecca, St. John’s Basketball Coach With 526 Wins, Is Dead at 99',
 'From Pong to Pokémon: A History of Holiday ‘It’ Toys',
 '80 Years After Killings, Senegal Wants the Facts From France',
 'Mexican Cartels Lure Chemistry Students to Make Fentanyl',
 'Quote of the Day: Bittersweet Homecoming for Displaced Lebanese']

## JSON to Pandas DataFrame

If we have nicely formatted JSON data we can often convert it into a more useable pandas data frame with minimal effort by using `pd.DataFrame()`, but keep in mind you may have to do a little indexing first in order to get to the accessible part of the data:

In [34]:
out = pd.DataFrame(res.json()['response']['docs'])
out.head()

Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,headline,keywords,pub_date,document_type,news_desk,section_name,byline,type_of_material,_id,word_count,uri,subsection_name
0,The gunman who shot two parishioners at the We...,https://www.nytimes.com/2019/12/31/us/texas-ch...,The gunman who shot two parishioners at the We...,"WHITE SETTLEMENT, Texas — Given West Freeway C...",A,16.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': '‘Battling a Demon’: Drifter Sought H...,"[{'name': 'subject', 'value': 'Churches (Build...",2020-01-01T00:14:41+0000,article,National,U.S.,"{'original': 'By Dave Montgomery, Anemona Hart...",News,nyt://article/22fc94cd-2e4a-5af1-89f8-7260bf27...,1295,nyt://article/22fc94cd-2e4a-5af1-89f8-7260bf27...,
1,Congress could do much more to protect America...,https://www.nytimes.com/2019/12/31/opinion/for...,Congress could do much more to protect America...,Congress invited predatory for-profit colleges...,A,18.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","{'main': 'Protect Veterans From Fraud', 'kicke...","[{'name': 'subject', 'value': 'Veterans', 'ran...",2020-01-01T00:18:54+0000,article,Editorial,Opinion,"{'original': 'By The Editorial Board', 'person...",Editorial,nyt://article/69a7090b-9f36-569e-b5ab-b0ba5bb3...,680,nyt://article/69a7090b-9f36-569e-b5ab-b0ba5bb3...,
2,The tobacco and vaping industries and conserva...,https://www.nytimes.com/2019/12/31/health/e-ci...,The tobacco and vaping industries and conserva...,The Trump administration is expected to announ...,A,1.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'F.D.A. Plans to Ban Most E-Cigarette...,"[{'name': 'subject', 'value': 'E-Cigarettes', ...",2020-01-01T01:22:27+0000,article,Science,Health,{'original': 'By Sheila Kaplan and Maggie Habe...,News,nyt://article/42d25485-0e48-50bf-8d16-948833b2...,1236,nyt://article/42d25485-0e48-50bf-8d16-948833b2...,
3,Christina Iverson and Jeff Chen ring in the Ne...,https://www.nytimes.com/2019/12/31/crosswords/...,Christina Iverson and Jeff Chen ring in the Ne...,WEDNESDAY PUZZLE — The weekend columnist Caitl...,,,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","{'main': '‘It’s Green and Slimy’', 'kicker': '...","[{'name': 'subject', 'value': 'Crossword Puzzl...",2020-01-01T03:00:10+0000,article,Games,Crosswords & Games,"{'original': 'By Deb Amlen', 'person': [{'firs...",News,nyt://article/9edddb54-0aa3-5835-a833-d311a76f...,931,nyt://article/9edddb54-0aa3-5835-a833-d311a76f...,
4,Corrections that appeared in print on Wednesda...,https://www.nytimes.com/2019/12/31/pageoneplus...,Corrections that appeared in print on Wednesda...,An “On This Day in History” item on Tuesday ab...,A,20.0,The New York Times,[],"{'main': 'Corrections: Jan. 1, 2020', 'kicker'...",[],2020-01-01T03:28:45+0000,article,Corrections,Corrections,"{'original': '', 'person': [], 'organization':...",Correction,nyt://article/16ebc00a-01f2-5f35-905f-15d299e5...,299,nyt://article/16ebc00a-01f2-5f35-905f-15d299e5...,
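Several columns in the table above (such as `headline` and `byline`) still hold nested dictionaries. One option is to pick out the fields we care about before building the data frame. The sketch below does this for a single hand-typed article whose values are copied from the last row of the table; `flatten_doc` is a hypothetical helper, and the choice of fields is only an example.

```python
# Hand-typed stand-in for one entry of res.json()['response']['docs']
sample_docs = [
    {
        "headline": {"main": "Corrections: Jan. 1, 2020"},
        "pub_date": "2020-01-01T03:28:45+0000",
        "section_name": "Corrections",
        "word_count": 299,
    }
]


def flatten_doc(doc):
    """Keep a few fields and pull the main headline out of its nested dictionary."""
    return {
        "headline": doc["headline"]["main"],
        "pub_date": doc["pub_date"],
        # .get() returns None instead of raising if a field is absent
        "section_name": doc.get("section_name"),
        "word_count": doc.get("word_count"),
    }


rows = [flatten_doc(doc) for doc in sample_docs]
print(rows)
```

A list of flat dictionaries like `rows` can be passed straight to `pd.DataFrame()`.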


### Article Search

When you look into the New York Times archives, you are usually trying to find articles about a certain topic rather than sifting through everything the NYT has published. You might be interested in how they are covering the election, for example. In that case, instead of grabbing every single article, you'd want to search on some keywords. To do this, you can use the Article Search API.

You can look at the documentation at https://developer.nytimes.com/docs/articlesearch-product/1/overview for more information on how this might work. It is very similar to the Archive API, except we use a slightly different base URL, as well as different parameters. 

In [35]:
article_base = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'

We can specify the keywords using `q` in our parameters. Let's look for articles with the keyword "election".

In [36]:
r = get(article_base, params= {'q':'election','api-key':nyt_key}) 

In [37]:
response_dict =  r.json()
response_dict.keys()

dict_keys(['status', 'copyright', 'response'])

In [38]:
election_articles = r.json()['response']['docs']
len(election_articles)

10


Note that the search only returns 10 articles at a time. We can get more using pagination. 
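As a rough sketch of how pagination could work, the function below builds one parameter dictionary per page of results. It assumes the Article Search API accepts a zero-based `page` parameter, which should be checked against the documentation; `page_params`, `RESULTS_PER_PAGE`, and `max_pages` are illustrative names, and no requests are sent here.

```python
import math

RESULTS_PER_PAGE = 10  # the search returns 10 articles per request


def page_params(query, hits, max_pages=5):
    """Build one parameter dictionary per page (assumes a zero-based `page` parameter)."""
    n_pages = min(math.ceil(hits / RESULTS_PER_PAGE), max_pages)
    return [{"q": query, "page": page} for page in range(n_pages)]


for params in page_params("election", hits=75, max_pages=3):
    print(params)
```

Each dictionary would still need the `api-key` entry before being passed to `get`, and it is a good idea to pause between requests.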

In [39]:
r = get(article_base, params= {'q':'mental health','api-key':nyt_key})
mhealth = r.json()

In [40]:
# pulling the abstracts:

[i['abstract'] for i in mhealth['response']['docs']]

['The sensation of being detached from your surroundings may point to a hard-to-diagnose condition.',
 'Azara Ballet in Florida is a place where performers can just be themselves.',
 'The rapper and designer formerly known as Kanye West revealed the diagnosis during a podcast interview where he also discussed his upcoming album.',
 'Two new Canadian studies are the largest to date looking at death rates and psychosis associated with cannabis use disorder.',
 'Lori Laird was defending a couple whose son shot 23 people at his school, while engaged in a desperate struggle with her own son’s mental illness. When are parents to blame?',
 'The Navy quietly started screening elite fighter pilots for signs of brain injuries caused by flying, a risk it officially denies exists.',
 'The fallout from the F.D.A.’s rejection of MDMA-assisted treatment for PTSD worries researchers and experts who fear other psychedelic drugs in the pipeline could be jeopardized.',
 'Yoni Barrios, an unauthorized imm

We can also take a look at the meta information to see how many hits we had. Since we are just searching on "election" without any other qualifiers, we would expect the number of hits to be pretty high.

In [41]:
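r = get(article_base, params= {'q':'election','api-key':nyt_key}) 
# The total number of matching articles is stored in the meta information
r.json()['response']['meta']['hits']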

To narrow our search, we can add filters. For example, you can adjust the begin and end dates of your search to look at specific time periods. Let's take a look at the month of January in 2020. Note that the dates use "YYYYMMDD" formatting. So, January 1, 2020 will be `20200101`. 

In [42]:
election_parameters = {'q':'election',
                       'begin_date':'20200101',
                       'end_date':'20200201',
                       'api-key':nyt_key}

response_2020 = get(article_base, params= election_parameters).json()
election_articles3 = response_2020['response']['docs']
election_articles3[0]['web_url']

'https://www.nytimes.com/2020/01/16/us/politics/fbi-notify-state-elections-breaches.html'
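Typing `YYYYMMDD` strings by hand gets error-prone when looping over many months. Here is a minimal sketch that generates them with the standard library; `month_bounds` is a hypothetical helper name. Note that it stops at the last day of the month, whereas the cell above uses `20200201` as its end date.

```python
import calendar
from datetime import date


def month_bounds(year, month):
    """Return the first and last day of a month as YYYYMMDD strings."""
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    first = date(year, month, 1)
    last = date(year, month, last_day)
    return first.strftime("%Y%m%d"), last.strftime("%Y%m%d")


print(month_bounds(2020, 1))
print(month_bounds(2020, 2))
```

The two strings could then be used as the `begin_date` and `end_date` values in a parameter dictionary.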

<font color = 'red'>**Question 7: Use the NYT Article Search to look for articles about mental health in January 2024. How many articles were there? How does this compare to January 2014?**</font>

In [43]:
params24 = {'q':'mental health',
                       'begin_date':'20240101',
                       'end_date':'20240201',
                       'api-key':nyt_key}

params14 = {'q':'mental health',
                       'begin_date':'20140101',
                       'end_date':'20140201',
                       'api-key':nyt_key}

response_01_2024 = get(article_base, params= params24).json()



response_01_2014 = get(article_base, params= params14).json()


In [44]:
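# Only the last expression in a cell is displayed automatically,
# so the output below is the January 2014 count.
response_01_2024['response']['meta']['hits']

response_01_2014['response']['meta']['hits']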

75

## Census API

One extremely useful API in social science research is the **Census API**. This API provides access to a wide variety of data sources on demographics and characteristics of people in the US. It contains data from the Decennial Census, but also from many other sources, such as the American Community Survey (ACS). Information about the Census API can be found at: https://www.census.gov/data/developers/data-sets.html.

As with the New York Times API, you will need to request an API key in order to access it. You can request an API key here: https://api.census.gov/data/key_signup.html. You will need to provide your email address and organization (you can just put University of Maryland), and you should get an email with your census key shortly after that. As with the previous case, you'll want to add it in the appropriate location in your `keys.yml` file, then run the code below to assign the census key to a Python variable.



In [45]:
with open('../../keys.yml', 'r') as file:
    keys = yaml.safe_load(file)
census_key = keys['census_api_key']

Even within just one data source like the ACS, there are lots of different variables and groupings that you can pull data about. We'll start with the 1-year ACS estimates. Information about this data can be found by navigating to the American Community Survey 1-Year Data page (https://www.census.gov/data/developers/data-sets/acs-1year.html). 

The webpage documentation shows how to access their data, as well as example code and a list of variables. For example, if you scroll down to the Detailed Tables section, you can find a link to the detailed tables variables (https://api.census.gov/data/2022/acs/acs1/variables.html). The Examples and Supported Geographies page (https://api.census.gov/data/2022/acs/acs1.html) can also be helpful in identifying the data that you want.

To start, let's find something basic: the total number of people in each state. Looking at the variables table, we can see that this is called `B01001_001E` (not very intuitive, I know). Since we want this for every state, we use `state:*` as our `for` parameter. We include `NAME` as a variable we want to get since we want to know what the state names are for each of the counts. Finally, we make sure to include our key.

In [None]:
census_base_url = 'https://api.census.gov/data/2022/acs/acs1'

census_params = {'get':'NAME,B01001_001E', 
                 'for':'state:*',
                 'key':census_key}

r = get(census_base_url, params = census_params)


In [None]:
people_by_state = r.json()


<font color = 'red'>**Question 8: Which states had more than 10,000,000 people in 2022? Create a list that contains the names of these states.**</font>

Note: the structure here is not quite the same as what we retrieved from the NYT API. We're getting a list of lists, and the variable names are just stored in the first list:

In [None]:
people_by_state[:5]

Also note that the population variable is being returned as a string object instead of an integer. So we need to convert this to an integer using `int()`. 

In [225]:
type(people_by_state[1][1])

str

In [227]:
type(int(people_by_state[1][1]))

int

So, here's one way to retrieve this list of states using a list comprehension:

In [228]:
# "More than 10,000,000" calls for a strict comparison
states_over_10mil = [i[0] for i in people_by_state[1:] if int(i[1]) > 10000000]

In [229]:
states_over_10mil

['California',
 'Florida',
 'Georgia',
 'Illinois',
 'Michigan',
 'New York',
 'North Carolina',
 'Ohio',
 'Pennsylvania',
 'Texas']

Finally, here's how we could make this into a list of dictionaries so we can then easily convert it into a pandas data frame:

In [None]:
popdata =[{'state': i[0], 
           'population':int(i[1]), 
           'statecode':i[2]} for i in people_by_state[1:]]

In [None]:
popdata[:5]
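An alternative sketch uses the header row itself as the dictionary keys, so the column names do not have to be typed by hand. The sample below is hand-typed in the same list-of-lists shape, with populations copied from the results in this section; the header row and the two-digit state codes are assumptions about what the API returns for this query.

```python
# Hand-typed sample in the same list-of-lists shape; the header row is an
# assumption based on the variables requested above
sample = [
    ["NAME", "B01001_001E", "state"],
    ["California", "39029342", "06"],
    ["Texas", "30029572", "48"],
]

# Split off the header row, then pair each value with its column name
header, *rows = sample
records = [dict(zip(header, row)) for row in rows]

# Populations arrive as strings, so convert them to integers
for record in records:
    record["B01001_001E"] = int(record["B01001_001E"])

print(records)
```

As with `popdata`, the resulting list of dictionaries can be passed to `pd.DataFrame()`.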

In [244]:
popdf = pd.DataFrame(popdata)

In [243]:
# sorting by population and then taking the first 10 rows: 
popdf.sort_values('population', ascending=False)[:10]

Unnamed: 0,state,population,statecode
4,California,39029342,6
43,Texas,30029572,48
9,Florida,22244823,12
32,New York,19677151,36
38,Pennsylvania,12972008,42
13,Illinois,12582032,17
35,Ohio,11756058,39
10,Georgia,10912876,13
33,North Carolina,10698973,37
22,Michigan,10034118,26
