# A Beginner's Guide to APIs | Using Python to Collect Air Quality Data
In this tutorial, I'm going to briefly explain the basics of APIs, how to use an API to gather data, and a few optional tips and tricks that I learned in the process. As an example, I'll be using the United States Environmental Protection Agency (EPA) *AirData* API. First, I'll show you how to get your very own API key so that you can access the data for yourself. Then, I'll explain how to use the requests package in Python, along with a URL and some parameters, to extract data from a database. And finally, I'll show you how to save the data as a CSV file. We will use the air particle data we pull in this example for an experiment later. Let's begin!

## A Brief Explanation of APIs
In short, an API is a way for our code to act like a web browser and request information from a database. Think about it. You use URLs and web browsers all the time to instantly display web pages. Those web pages are often made up of thousands of lines of code. And those thousands of lines of code come from a database that's giving you the information you asked for based on the URL you sent.

With an API, that's exactly what we do. We start with a base URL, then we add some parameters into the URL string to tell the database what information to send back to us. Because every database will have different values stored inside, every API will have different parameters to feed into the URL string. This is why it's important to always look at the documentation for the API you are using. Here is the link to the documentation for the Air Quality System (AQS) data portal: 
 - https://aqs.epa.gov/aqsweb/documents/data_api.html
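
To make the idea of "a base URL plus parameters" concrete, here is a minimal sketch using only Python's standard library. The helper name `build_query_url` is my own invention for illustration. The `list/cbsas` endpoint and the test credentials are the ones used later in this tutorial.

```python
from urllib.parse import urlencode

def build_query_url(base_url, params):
    """Join a base URL and a dictionary of parameters into one query URL."""
    # urlencode turns {"a": "1", "b": "2"} into "a=1&b=2" and
    # percent-encodes characters such as "@" (which becomes "%40").
    return base_url.rstrip("?") + "?" + urlencode(params)

example_url = build_query_url(
    "https://aqs.epa.gov/data/api/list/cbsas",
    {"email": "test@aqs.api", "key": "test"},
)
print(example_url)
```

The `?` separates the base URL from the parameters, and each `name=value` pair is joined with `&`. The server decodes `%40` back into `@` when it reads the request.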

## Obtaining an API Key
Most APIs will have some sort of authorization information you'll have to pass into the URL. Various combinations of usernames, IDs, or email addresses along with passwords, AUTH keys, and tokens are common. Again, every API is different. The AQS requires us to enter an email address to receive an API key. 
 - Signup Documentation - https://aqs.epa.gov/aqsweb/documents/data_api.html#signup  
  
Copy the link below into your address bar and replace the example email address with your email address.  
 - https://aqs.epa.gov/data/api/signup?email=myemail@example.com  

**You will receive an email with your API key. The authorization information is the first set of parameters we will pass into the URL string, and it is required to access the API.**

## Understanding API Query Limits
Most APIs have limits on how they can be used. There are size limits, frequency limits, and parameter limits. Size limits restrict how many rows of data you can pull at once. Frequency (or rate) limits restrict how often the API can be accessed. Parameter limits restrict how parameters can be used.  

The EPA AQS has the following limits:
 - size limit: 1,000,000 rows per query
 - rate limit: 10 queries per minute 
 - additional: no simultaneous queries 
 - recommended: pause of 5 seconds between queries
 - parameter limits: maximum of 5 "param" codes per query
 - parameter limits: "bdate" and "edate" must be in the same year
 
**Failure to stay within the limits could trigger the system to disable your account.**
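
The two parameter limits are easy to respect in code. Below is a rough sketch, not part of the EPA's own tooling, that splits a long date range into pieces that each stay inside one calendar year and groups a long list of "param" codes into batches of five. The function names `split_by_year` and `chunk_params` are hypothetical.

```python
from datetime import date

def split_by_year(start, end):
    """Split [start, end] into (bdate, edate) strings that never cross a year boundary."""
    if start > end:
        raise ValueError("start must not be after end")
    ranges = []
    for year in range(start.year, end.year + 1):
        # Clip each piece to the part of this year that falls inside the range.
        piece_start = max(start, date(year, 1, 1))
        piece_end = min(end, date(year, 12, 31))
        ranges.append((piece_start.strftime("%Y%m%d"), piece_end.strftime("%Y%m%d")))
    return ranges

def chunk_params(codes, size=5):
    """Group parameter codes into lists of at most `size` codes each."""
    return [codes[i:i + size] for i in range(0, len(codes), size)]

print(split_by_year(date(2018, 11, 15), date(2020, 2, 1)))
print(chunk_params(list(range(7))))
```

Each (bdate, edate) pair and each batch of codes can then be sent as its own query, with a pause between queries.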

## Understanding URL Strings & Parameters
Typically, APIs will have various databases that we can interact with, each with their own base URL and parameters. Let's look at the AQS documentation to find the base URL and which databases we can interact with.
 - Services Documentation: https://aqs.epa.gov/aqsweb/documents/data_api.html#services

 - **Base URL: https://aqs.epa.gov/data/api/**

As you can see in the services overview, there are many datasets available. Let's go through a few of them.
 - Lists: Provides the variable values you may need to create other service requests.
 - Sample Data: Returns fine-grain data reported to the EPA via monitors throughout the country.

Now let's explore the sample data service, making sure we look closely at the *endpoints* and *variables*.  
 - https://aqs.epa.gov/aqsweb/documents/data_api.html#sample
 
As you can see in the sample data overview, there are many *endpoints* available, such as by site, by county, by state, by box, and by CBSA. These endpoints are different geographical areas by which we can group (or filter) the monitor data.  
 - By State: https://aqs.epa.gov/data/api/sampleData/byState?
 - By CBSA: https://aqs.epa.gov/data/api/sampleData/byCBSA?  
 
**Notice the structure of the URL: base_url/service(database)/endpoint(filter)**

Each endpoint has different required variables (or parameters). These required variables are added to the URL to further specify what information we want the API to return to us. For example, if we were to query sample data *by state* we would need to specify from which state we want sample data. Because databases and APIs have unique structures, we must use the specific language of the database and API we are working with. 
 - Variables for this API: https://aqs.epa.gov/aqsweb/documents/data_api.html#variables  

 - By State required variables: email, key, param, bdate, edate, state
  - Example query URL: https://aqs.epa.gov/data/api/sampleData/byState?email=test@aqs.api&key=test&param=45201&bdate=19950515&edate=19950515&state=37
 - By CBSA required variables: email, key, param, bdate, edate, cbsa
  - Example query URL: https://aqs.epa.gov/data/api/sampleData/byCBSA?email=test@aqs.api&key=test&param=42602&bdate=20170101&edate=20170101&cbsa=16740  
  
**Notice that numerical codes are used to represent a state or CBSA. These codes can be found in the 'lists' service.**
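
If it helps to see the pieces of a query URL separately, the standard library can take one apart. This is only an illustrative sketch that dissects the "By State" example URL from above. Nothing is sent to the API.

```python
from urllib.parse import urlparse, parse_qsl

example = (
    "https://aqs.epa.gov/data/api/sampleData/byState"
    "?email=test@aqs.api&key=test&param=45201"
    "&bdate=19950515&edate=19950515&state=37"
)

parts = urlparse(example)

# The last two path segments are the service and the endpoint.
service, endpoint = parts.path.split("/")[-2:]

# Everything after the "?" is a series of variable=value pairs.
variables = dict(parse_qsl(parts.query))

print(service, endpoint)
print(variables)
```

Note that every value comes back as a string, including the numerical codes for "param" and "state".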


## Using an API to Collect Data
Now that you know the basics about APIs, URL strings, and parameters, let's explore how to actually use them. You may be wondering, if APIs work to distribute data simply by the URL they receive, can't we just use our web browser to collect data? Yes! You could just create your URL string with the parameters you desire, paste the URL into your web browser, and press enter. Typically, your web browser will then display a JSON document, which represents the data as a structure of nested lists and key-value pairs. You could then save this JSON file to your computer and access it with something like Excel.  

While the above method is fine for a one-off query, it doesn't scale well to multiple queries. What if you want to query every state? Or a list of states? Or a list of dates? Or some combination of several dates, areas, and pollution types? Changing all of this information by hand, one query at a time, can be tedious and slow. This is where programming comes in, to help us scale and even automate our collection of data.
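
Before we query anything, it may help to see what working with a JSON response looks like in Python. The payload below is made up for illustration. The "Data" key matches the code later in this tutorial, while the "Header" key and the field names inside each row are assumptions about the response shape.

```python
import json

# A tiny, made-up response body. Real responses are much larger.
raw_text = """
{
  "Header": [{"status": "Success", "rows": 2}],
  "Data": [
    {"code": "00001", "value_represented": "Example area A"},
    {"code": "00002", "value_represented": "Example area B"}
  ]
}
"""

# json.loads() turns JSON text into Python dictionaries and lists.
payload = json.loads(raw_text)

# "Data" holds a list with one dictionary per row.
rows = payload["Data"]
print(len(rows))
print(rows[0]["value_represented"])
```

Once the rows are in a list of dictionaries like this, they can be handed to pandas or written to a file.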

## Using Python & requests to query an API

### Hiding Your API Key (Optional)
If you plan on sharing your code with others, either by working with a partner or by uploading your code to an online repository like GitHub, then you'll want to know how you can hide your API key. There are a few methods to accomplish this task, but I think the best way is to save your key as an environment variable and then call that environment variable into your code. That way, when you share your code with others, all they will see is that you called an environment variable. They won't be able to see the value itself. Let me show you how!

First, you will need to save your authentication information as environment variables. There are many guides about how to do this depending on how your environment is set up. I use Anaconda to set up my Python environments, so the method I used can be found here:  
 - https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#setting-environment-variables
 
If you want a tutorial about how to set up your environment or environment variables, please let me know!

In [1]:
## Now that you have your environment variables set, you just need to call them into the script.
## To do this, we will need to use the getenv() function found in the package called 'os'
## My email and key are stored in my environment under the names 'airnow_email' & 'airnow_key'

## Load the os package
import os

## Call the variables into the script, or replace with your email and key.
email = os.getenv('airnow_email')
key = os.getenv('airnow_key')

## From now on, when I use the variable 'email' in my code, the program will be using my actual 
## email, without having to display it in the code. neat!
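
One thing to watch for: os.getenv() returns None when a variable is not set, so a typo in the variable name only shows up later as a confusing API error. Here is a small optional sketch that fails right away instead. The helper name `require_env` is hypothetical and is not part of the os package.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value

# Usage, assuming the variable names from the cell above:
# email = require_env('airnow_email')
# key = require_env('airnow_key')
```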

### Adding Parameters
Now that we have our email and key registered and ready to go, let's think about what parameters we want to add to our query string. For my project, I'm looking for sample data: 
 - of small particle air pollution levels (PM 2.5)
 - in the Seattle-Tacoma-Bellevue CBSA
 - from Jan 1 to May 31
 - for the years: 2018, 2019, & 2020

In [2]:
'''
Parameters: sampleData/byCBSA	email, key, param, bdate, edate, cbsa
'''
## First, let's find the codes we'll need for the param and cbsa variables. These can be found
## by accessing the 'lists' service of the API. 
## Note: We could just visit the links and ctrl+f for 'Seattle-Tacoma-Bellevue' or 'PM 2.5'
## However, if you plan to access the database often, it may be helpful to save the code lists.

## Dependencies
import time

import requests
import pandas as pd

## Use the requests get() function to retrieve the list of CBSAs.
## The .json() method parses the JSON response body into a Python dictionary.
## pandas DataFrame.from_dict() converts the "Data" portion of that dictionary to a DataFrame,
## and the to_csv() method saves the DataFrame as a .csv file.
cbsa_list_url = 'https://aqs.epa.gov/data/api/list/cbsas?email=test@aqs.api&key=test'
r = requests.get(cbsa_list_url)
cbsa_list_json = r.json()
cbsa_list = pd.DataFrame.from_dict(cbsa_list_json["Data"])
cbsa_list.to_csv("cbsa_list.csv")
time.sleep(5)  ## Pause between queries, as recommended in the API limits

## Similarly, we can create and save a list of Parameter Classes
parameter_classes_url ='https://aqs.epa.gov/data/api/list/classes?email=test@aqs.api&key=test'
r = requests.get(parameter_classes_url)
parameter_classes_json = r.json()
parameter_classes = pd.DataFrame.from_dict(parameter_classes_json["Data"])
parameter_classes.to_csv("parameter_classes.csv")
time.sleep(5)  ## Pause between queries, as recommended in the API limits

## Similarly, we can create and save a list of Parameters in ALL Parameter Classes.
parameters_url = 'https://aqs.epa.gov/data/api/list/parametersByClass?email=test@aqs.api&key=test&pc=ALL'
r = requests.get(parameters_url)
parameters_json = r.json()
parameters = pd.DataFrame.from_dict(parameters_json["Data"])
parameters.to_csv("parameters.csv")

## Now we can fill in the parameter details using the codes from the lists
param = 88101
cbsa = 42660
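
Instead of searching the saved lists by eye, you could also look the codes up in code. The sketch below works on a plain list of dictionaries like the "Data" portion of a list response. The field names "code" and "value_represented" are assumptions about the response shape, the function name `find_codes` is hypothetical, and the second sample row is made up.

```python
def find_codes(rows, search_text):
    """Return (code, description) pairs whose description contains search_text, ignoring case."""
    needle = search_text.lower()
    return [
        (row["code"], row["value_represented"])
        for row in rows
        if needle in row["value_represented"].lower()
    ]

# Sample rows for illustration only. In practice, pass in cbsa_list_json["Data"].
sample_rows = [
    {"code": "42660", "value_represented": "Seattle-Tacoma-Bellevue"},
    {"code": "00000", "value_represented": "Example Metro Area"},
]

print(find_codes(sample_rows, "seattle"))
```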

### Putting It All Together 
One of the benefits of using Python is the ability to execute multiple queries easily. Here, I'll show you how to iterate through a list of years to perform three separate queries, with a delay between each one, so that we stay within the API's limits. We will use the sleep() function in the time package to add the delay.

In [None]:
## Dependencies
import time

## Since I only want to query sample data by cbsa, I will start with this as my base url.
base_url = 'https://aqs.epa.gov/data/api/sampleData/byCBSA?'

## Using a list and for loop to run multiple queries and save multiple data files programmatically
years_list = [2018, 2019, 2020]
for year in years_list:
    bdate = str(year) + '0101'
    edate = str(year) + '0531'
    ## The get() function in requests can also take a dictionary of variable:value pairs, along
    ## with a base URL, and correctly feed the variables and values into the URL string.
    ## This allows you to easily change parameters.
    parameter_dictionary = {"email":email, "key":key, "param":param, 
                        "bdate":bdate, "edate":edate, "cbsa":cbsa}
    r = requests.get(base_url, params=parameter_dictionary)
    query_json = r.json()
    query = pd.DataFrame.from_dict(query_json["Data"])
    ## By using the for loop to change the beginning and end dates, we can also utilize the for
    ## loop to generate unique save files with descriptive names.
    query.to_csv(bdate + "_to_" + edate + "_Seattle-Tacoma-Bellevue_PM2.5.csv")
    time.sleep(5)
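
A fixed time.sleep(5) works well for a simple loop like this one. If your loop does slow work between queries, you may prefer to wait only for whatever part of the pause is still left. Here is one possible sketch of that idea. The `Throttle` class is hypothetical and is not part of the time or requests packages.

```python
import time

class Throttle:
    """Make sure at least min_interval seconds pass between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # Returns the current time in seconds
        self._sleep = sleep    # Pauses for a given number of seconds
        self._last = None      # Time of the previous call, None before the first call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

# Usage: create one throttle before the loop, then call wait() before each query.
# throttle = Throttle(5)
# for year in years_list:
#     throttle.wait()
#     r = requests.get(base_url, params=parameter_dictionary)
```

The first call to wait() returns immediately, and later calls sleep only if fewer than five seconds have passed since the previous one.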

## Conclusion
There you have it. Now you know the basics of APIs, URL strings, and parameters, and you know how to use Python and the requests package to get() the data using the API. Good luck collecting your own data!