# Data Access Notebook for OpenAQ
_by Michelle H Wilkerson, Adelmo Eloy_

## Purpose of this Notebook

This notebook was developed as part of NSF Grant 2445609 to support accessing and processing OpenAQ data for middle and high school classroom activities. It's written to be relatively accessible to beginners, but if you have not worked with computational notebooks or Python before, you may find navigating this tool difficult. (Check out the Show Your Work project for a gentle introduction to computational notebooks for educators!)

Our project is focused on supporting data analysis and mechanistic reasoning in science education. In other words, we want students to learn how data provides information about _how scientific mechanisms work_, and how understanding scientific mechanisms can help them to _explain and interpret patterns in data_. This builds on a long history of research on complex systems and agent-based modeling, and more closely connects that work to current expansions of data analysis across subjects.

Here, we are focused on Air Quality as a phenomenon. While most students understand that poor Air Quality can impact health, they may not know that there are many different kinds of air pollution, each caused by different processes and chemicals. These differences show up as different patterns over the course of a day or year.

This data tool allows users to connect to OpenAQ, search for air quality data streams in an area of interest, and identify the data streams that record observations for _both_ PM2.5 and O3, two key pollutants that impact air quality and tend to behave very differently over time. These kinds of datasets can serve as a launch point for examining what AQ is and what its underlying mechanistic and compositional complexities are.

You are welcome to modify and adapt this script. You may find the OpenAQ documentation [here](https://docs.openaq.org/) and [here](https://python.openaq.org/) helpful.

## Part 1: Connecting with OpenAQ

Before you get started, you will need an OpenAQ API key. To get one, register for a free account here. Once you're logged in, go to your account settings. At the bottom of the page you will find an API key. Copy it and set `API_KEY` in the cell below to your key. Then, run the cell to install the openaq package and set your key.

In [None]:
!pip install openaq

API_KEY = "paste your key here"
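
If you'd rather not paste the key into a notebook you might share, one option is to read it from an environment variable instead. This is only a minimal sketch: the `load_api_key` helper and the `OPENAQ_API_KEY` variable name are illustrative assumptions, not something this notebook requires.

```python
import os

def load_api_key(env_var="OPENAQ_API_KEY", fallback="paste your key here"):
    """Return the key stored in `env_var`, or `fallback` if it is not set."""
    return os.environ.get(env_var, fallback)

# Uses the environment variable when present, otherwise the pasted placeholder
API_KEY = load_api_key()
```

If the variable is not set, `API_KEY` simply falls back to the placeholder string, so the rest of the notebook behaves as before.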

Now we'll open your OpenAQ client. This is what talks to OpenAQ.

In [None]:
# set up the client to get data from openAQ

from openaq import OpenAQ

client = OpenAQ(api_key=API_KEY)

When we make a request through the client, OpenAQ sends information back over HTTP as JSON. JSON is plain text, but to save space it arrives without line breaks or indentation, which makes the structure of the response hard to read. So the cell below loads some helper tools to process and "pretty print" the information we get from OpenAQ.

In [None]:
# import some helpful tools for looking at data

import json # to handle JSON
from pprint import pprint # pretty print JSON and other structures
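
To see what "pretty printing" buys us, here is a small sketch that prints the same record in compact and indented form. The `record` dictionary is a hand-made stand-in, not real OpenAQ output.

```python
import json
from pprint import pprint

# Hand-made stand-in for an API record (not real OpenAQ output)
record = {"id": 2, "name": "pm25", "units": "µg/m³", "tags": ["particulate", "fine"]}

# Compact form: no extra whitespace, similar to how JSON travels over the network
compact = json.dumps(record, separators=(",", ":"))

# Indented form: the same data laid out over several lines
pretty = json.dumps(record, indent=2, ensure_ascii=False)

print(compact)
print(pretty)
pprint(record)  # pprint works directly on Python objects such as dicts and lists
```

Both strings decode back to the same dictionary; only the layout differs.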

We are going to focus on datasets that report levels of PM2.5 and O3 for the same location. But, you might be interested in accessing data about other pollutants. The cell below lists all the site-specific parameters that are available through the API. This doesn't mean that each site has each parameter; it just lists everything that can be requested (whether or not it exists for a given site or date range).

In [None]:
# let's see all the parameters we can access from openaq
pprint(client.parameters.list().results)

The list above shows that there are a few parameters that report PM2.5 and O3, some measured in different ways. The parameters with ids 2, 97, 130, and 19860 all report PM2.5 in different ways. The parameters with ids 3 and 10 report O3 using µg/m³ and ppm, respectively. 
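
Instead of scanning the printed list by eye, the ids can be picked out by name. The sketch below assumes each parameter is available as a plain dictionary with `id`, `name`, and `units` keys, and that the names are `pm25` and `o3`; the `sample_parameters` list is a hand-made stand-in covering only a few of the ids mentioned above.

```python
# Hand-made stand-in for a few entries of the parameter list (as plain dicts)
sample_parameters = [
    {"id": 1, "name": "pm10", "units": "µg/m³"},
    {"id": 2, "name": "pm25", "units": "µg/m³"},
    {"id": 3, "name": "o3", "units": "µg/m³"},
    {"id": 10, "name": "o3", "units": "ppm"},
]

def ids_by_name(parameters, wanted_names):
    """Map each wanted parameter name to the list of ids that report it."""
    found = {name: [] for name in wanted_names}
    for p in parameters:
        if p["name"] in found:
            found[p["name"]].append(p["id"])
    return found

print(ids_by_name(sample_parameters, ["pm25", "o3"]))
```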

After some tinkering, it seems that a lot of stations use parameter id 10 (reporting O3 in ppm). This is not shown on the web explorer, but it is still data we probably want to consider using in our work. Nearly all the stations include parameter id 2 (PM2.5 in µg/m³). So for now, I'm just going to go with parameter 2 and then _either_ 3 or 10, to catch as many instances of PM2.5 and O3 as possible.
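
Because O3 can arrive in either ppm or µg/m³, comparing stations may call for a unit conversion. The sketch below uses the standard ideal-gas relationship and assumes a reference state of 25 °C and 1 atm (molar volume 24.45 L/mol); reporting agencies may use other reference conditions, so treat the result as approximate. The function name is illustrative.

```python
O3_MOLAR_MASS = 48.00   # g/mol
MOLAR_VOLUME = 24.45    # L/mol for an ideal gas at 25 °C and 1 atm (assumed conditions)

def o3_ppm_to_ugm3(ppm):
    """Convert an ozone mixing ratio in ppm to a mass concentration in µg/m³."""
    # ppm -> ppb (x1000), then ppb x (molar mass / molar volume) gives µg/m³
    return ppm * 1000 * O3_MOLAR_MASS / MOLAR_VOLUME

print(round(o3_ppm_to_ugm3(0.05), 1))  # 98.2
```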

## Part 2: Finding a location that meets our sensor needs

Ok, so now let's narrow in on a location. Eventually, if we want, we can use geopy to look up the lat and long coordinates for whatever location you prefer. But that requires some signup, so for now I'm going to hard-code the coordinates; if we decide this is useful, we can switch to geopy or something similar. Then we'll do a search for OpenAQ data streams near the specified lat and long that have recent readings for _both_ PM2.5 and O3.

In [None]:
# IF WE WANT TO USE GEOPY WE WOULD USE THE CODE BELOW

# pip install geopy
# from geopy import geocoders
# gn = geocoders.GeoNames()

# gn.geocode("Cleveland, OH 44106") 
# then we'd take the returned object and feed lat and long to next step

# fetch the locations near the specified lat and long
fetch_locations = json.loads(client.locations.list(coordinates=[37.85, -122.25], radius=12000, limit=1000).json())
locations = fetch_locations["results"]

# report how many locations we found
len(locations)

We likely found more locations than we expected! But not all of these are likely to have the specific information we need. Let's filter our results so we're only looking at locations that report both PM2.5 and O3.

In [None]:
filtered_locations = []

for location in locations:
    #reset flags for each location
    pm = False
    o3 = False 

    #see if the pollutants are recorded
    for sensor in location['sensors']:
        if sensor['parameter']['id'] == 2:
            pm = True
        if sensor['parameter']['id'] == 3 or sensor['parameter']['id'] == 10:
            o3 = True

    #if both pollutants are recorded, add to the list
    if pm and o3:
        filtered_locations.append(location)

# kind of a hack, but let's go ahead and load pandas so we can display this nicely
import pandas as pd

pd.DataFrame(filtered_locations)
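
The same filter can also be written as a small reusable function using sets, which makes it easy to change which parameter ids count as PM2.5 or O3. This is a sketch under the assumption that each location is a dictionary with the same `sensors` / `parameter` / `id` nesting used above; `has_both` and the sample records are illustrative.

```python
PM25_IDS = {2}      # PM2.5 in µg/m³
O3_IDS = {3, 10}    # O3 in µg/m³ or ppm

def has_both(location):
    """True if the location has at least one PM2.5 sensor and one O3 sensor."""
    param_ids = {sensor["parameter"]["id"] for sensor in location["sensors"]}
    return bool(param_ids & PM25_IDS) and bool(param_ids & O3_IDS)

# Hand-made stand-ins with the same nesting as the location records
sample = [
    {"id": "A", "sensors": [{"parameter": {"id": 2}}, {"parameter": {"id": 10}}]},
    {"id": "B", "sensors": [{"parameter": {"id": 2}}]},
    {"id": "C", "sensors": [{"parameter": {"id": 3}}]},
]
print([loc["id"] for loc in sample if has_both(loc)])  # ['A']
```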

Good! Now you can clearly see the ids of stations that have the data you need. You can also see the interval of times for which these records are available by scrolling all the way to the right in the table above. If you're using the originally hard-coded Bay Area coordinates, these look like they both started in 2016 and are up to date.
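
If several stations qualify, you may want the one closest to your search point. A minimal sketch using the haversine (great-circle) formula follows. It assumes each location record carries a `coordinates` dictionary with `latitude` and `longitude` keys; the helper names and sample records are illustrative.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

def sort_by_distance(locations, lat, lon):
    """Return locations ordered from nearest to farthest from (lat, lon)."""
    return sorted(
        locations,
        key=lambda loc: haversine_km(
            lat, lon, loc["coordinates"]["latitude"], loc["coordinates"]["longitude"]
        ),
    )

# Hand-made stand-ins; real records are assumed to carry the same 'coordinates' keys
sample = [
    {"id": "far", "coordinates": {"latitude": 37.34, "longitude": -121.89}},
    {"id": "near", "coordinates": {"latitude": 37.86, "longitude": -122.26}},
]
print([loc["id"] for loc in sort_by_distance(sample, 37.85, -122.25)])
```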

## Part 3: Loading your data

Okay, now that you've found a place, you can load some data that might exhibit patterns that will be intriguing to students and that will highlight some of our learning goals around the complex nature of AQ. Let's look at recent data for the location with id 2135, identified above. The API has us pull measurements per sensor, so let's look for the sensor ids we need.

In [None]:
pprint(client.locations.sensors(locations_id=2135).results)

Ok, so our O3 sensor id is 3833, and our PM2.5 sensor id is 3835. Let's fetch the latest data for each of these sensors. Then we'll clean it up and merge it into one dataset.
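
If you switch to a different location, the sensor ids can be looked up in code rather than read off the printout. The sketch below assumes the sensors are available as plain dictionaries with `id` and `parameter` keys (the shape used by the `sensors` list inside each location record earlier). The sensor ids are the two named above, while the parameter ids attached to them here are assumed for illustration.

```python
def sensor_ids_for(sensors, parameter_ids):
    """Return the ids of sensors whose parameter id is in `parameter_ids`."""
    return [s["id"] for s in sensors if s["parameter"]["id"] in parameter_ids]

# Stand-in for the sensor list of one location; parameter ids here are assumed
sensors = [
    {"id": 3833, "parameter": {"id": 10, "name": "o3"}},
    {"id": 3835, "parameter": {"id": 2, "name": "pm25"}},
]

print(sensor_ids_for(sensors, {3, 10}))  # O3 sensors
print(sensor_ids_for(sensors, {2}))      # PM2.5 sensors
```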

In [None]:
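# Quick first look: fetch one page of measurements for each sensor.
# (These frames are replaced further down by the date-filtered fetch.)
o3 = client.measurements.list(sensors_id=3833)
pm = client.measurements.list(sensors_id=3835)

o3_df = pd.DataFrame(o3.results)
pm_df = pd.DataFrame(pm.results)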

def cleanup(messy_df, pollutant):
    df = pd.DataFrame(messy_df)

    # Extract 'local' datetime strings into new columns
    df['datetime_from'] = df['period'].str.get('datetime_from').str.get('local')
    df['datetime_to']   = df['period'].str.get('datetime_to').str.get('local')

    # Convert to pandas datetimes
    df['datetime_from'] = pd.to_datetime(df['datetime_from'])
    df['datetime_to']   = pd.to_datetime(df['datetime_to'])

    # Tag with pollutant label
    df['pollutant'] = pollutant

    # Drop unused columns if present
    unused = ['period', 'parameter', 'coordinates', 'summary', 'coverage']
    return df.drop(columns=unused, errors='ignore')

def fetch_1000_since(client, sensor_id, start=None):
    """
    Return up to 1000 measurements for a given sensor, starting at `start` (inclusive).
    Strategy:
      1) Try common server-side date params (APIs differ in naming).
      2) If none works, paginate client-side and filter by datetime_from.local.
    """
    base = {"sensors_id": str(sensor_id), "limit": 1000}

    # Normalize `start` to ISO string and timestamp
    start_ts = pd.to_datetime(start) if start else None
    start_iso = start_ts.strftime("%Y-%m-%dT%H:%M:%SZ") if start_ts is not None else None

    # --- (1) Try server-side date filtering with common param names ---
    if start_iso is not None:
        for key in ("datetime_from", "start_at", "start", "from_datetime", "from_date", "date_from"):
            try:
                resp = client.measurements.list(**{**base, key: start_iso})
                return pd.DataFrame(resp.results[:1000])
            except TypeError:
                # This API doesn't recognize that param name; try the next one.
                continue

    # If we reach here, either no start provided or server-side filtering not available.
    # We'll paginate and filter on the client side.

    collected = []
    page = 1
    while len(collected) < 1000:
        # --- (2) Try page-based pagination; fall back to offset if needed ---
        page_resp = None
        try:
            page_resp = client.measurements.list(**{**base, "page": page})
        except TypeError:
            try:
                page_resp = client.measurements.list(**{**base, "offset": (page - 1) * base["limit"]})
            except TypeError:
                # No pagination supported by params; use a single page once.
                if page == 1:
                    page_resp = client.measurements.list(**base)
                else:
                    page_resp = None

        if page_resp is None:
            break

        items = getattr(page_resp, "results", []) or []
        if not items:
            break

        # Filter items by start_ts using the 'period.datetime_from.local' field when available.
        # Records may be plain dicts or SDK objects, so read fields from either kind.
        def field(obj, name):
            return obj.get(name) if isinstance(obj, dict) else getattr(obj, name, None)

        for r in items:
            dt_local = field(field(field(r, "period"), "datetime_from"), "local")
            # Fallback for APIs that use 'date.local'
            if dt_local is None:
                dt_local = field(field(r, "date"), "local")

            # Keep if no start filter OR datetime exists and is >= start.
            # Compare wall-clock local times (timezone offset dropped) against start_ts.
            if (start_ts is None) or (
                dt_local and pd.to_datetime(dt_local).tz_localize(None) >= start_ts
            ):
                collected.append(r)
                if len(collected) >= 1000:
                    break

        # Stop if this looks like the last page
        if len(items) < base["limit"]:
            break
        page += 1

    return pd.DataFrame(collected[:1000])

# ===== Usage: set your desired start date (YYYY-MM-DD or ISO8601) =====
start = "2025-01-01"  # <-- change this to control the start of the collection

o3_raw = fetch_1000_since(client, 3833, start=start)
pm_raw = fetch_1000_since(client, 3835, start=start)

o3_df = cleanup(o3_raw, "O3")
pm_df = cleanup(pm_raw, "PM")

merged = pd.concat([o3_df, pm_df], ignore_index=True)
merged
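
The fallback strategy in `fetch_1000_since` boils down to "keep asking for the next page until a page comes back short." Here is that idea on its own, as a sketch with a fake data source standing in for the API; `collect_pages` and `fake_fetch` are illustrative names, not part of the openaq package.

```python
def collect_pages(fetch_page, limit, max_items):
    """Call fetch_page(page, limit) for page 1, 2, ... and gather up to max_items records."""
    collected = []
    page = 1
    while len(collected) < max_items:
        items = fetch_page(page, limit)
        collected.extend(items)
        if len(items) < limit:  # a short (or empty) page means there is nothing left
            break
        page += 1
    return collected[:max_items]

# Fake data source standing in for the API: 25 records served in pages
records = list(range(25))

def fake_fetch(page, limit):
    start = (page - 1) * limit
    return records[start:start + limit]

print(len(collect_pages(fake_fetch, limit=10, max_items=1000)))  # 25
print(len(collect_pages(fake_fetch, limit=10, max_items=15)))    # 15
```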

Now we're ready to plot...

In [None]:
import seaborn as sns

line_plot = sns.lineplot(
    data=o3_df,
    x='datetime_from',
    y='value',
    hue='pollutant',
    palette='viridis' # Use a nice color palette
)

In [None]:
line_plot = sns.lineplot(
    data=pm_df,
    x='datetime_from',
    y='value',
    hue='pollutant',
    palette='viridis' # Use a nice color palette
)
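
The line plots show the raw time series. To compare how each pollutant behaves over the course of a day, one option is to average readings by hour of day. The sketch below works on plain dictionaries whose `datetime_from` is an ISO 8601 string (as in the records before they are converted to pandas datetimes); the readings are hand-made, not real measurements, and `mean_by_hour` is an illustrative helper.

```python
from collections import defaultdict
from datetime import datetime

def mean_by_hour(rows):
    """Average 'value' for each local hour of day (0-23) found in `rows`."""
    buckets = defaultdict(list)
    for row in rows:
        hour = datetime.fromisoformat(row["datetime_from"]).hour
        buckets[hour].append(row["value"])
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}

# Hand-made readings, not real measurements
rows = [
    {"datetime_from": "2025-01-01T14:00:00-08:00", "value": 0.030},
    {"datetime_from": "2025-01-02T14:00:00-08:00", "value": 0.040},
    {"datetime_from": "2025-01-01T03:00:00-08:00", "value": 0.010},
]
print(mean_by_hour(rows))
```

Running this separately for the O3 and PM2.5 records gives two hour-of-day profiles that can be plotted side by side.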

Here's some code to export a CSV of the merged dataset to use in other tools. (It should be very easy to work with in CODAP from here.)

In [None]:
# index=False leaves out the pandas row numbers, which other tools don't need
merged.to_csv("filename.csv", index=False)
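
The merged dataset is in "long" form, with one row per reading. Some tools are easier to use with a "wide" table that has one row per timestamp and one column per pollutant. Here is a sketch of that reshaping using only the standard library. The rows are hand-made, the `O3` and `PM` labels match the ones used in `cleanup`, and `to_wide` and the file name in the comment are illustrative.

```python
import csv
import io

def to_wide(rows):
    """Pivot long rows (time, pollutant, value) into one row per time."""
    by_time = {}
    for row in rows:
        by_time.setdefault(row["datetime_from"], {})[row["pollutant"]] = row["value"]
    return [{"datetime_from": t, **vals} for t, vals in sorted(by_time.items())]

# Hand-made readings, not real measurements
long_rows = [
    {"datetime_from": "2025-01-01T01:00:00", "pollutant": "O3", "value": 0.021},
    {"datetime_from": "2025-01-01T01:00:00", "pollutant": "PM", "value": 7.5},
    {"datetime_from": "2025-01-01T02:00:00", "pollutant": "O3", "value": 0.019},
]

buffer = io.StringIO()  # swap for open("filename_wide.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=["datetime_from", "O3", "PM"], restval="")
writer.writeheader()
writer.writerows(to_wide(long_rows))  # missing pollutants are left blank
print(buffer.getvalue())
```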