# Functions to query PurpleAir

This notebook retrieves 10-minute average PM2.5 readings (using the ATM estimates) from PurpleAir sensors in Minneapolis, then cleans and explores the entries.

## [PurpleAir Documentation](https://api.purpleair.com)

### From PurpleAir: 

"The data from individual sensors will update no less than every 30 seconds." 
"limit the number of requests to no more than once every 1 to 10 minutes,"
"If retrieving data from multiple sensors at once, please send a single request rather than individual requests in succession."

A paper on this process: https://doi.org/10.5194/amt-14-4617-2021 ([download](https://www.researchgate.net/publication/352663348_Development_and_application_of_a_United_States-wide_correction_for_PM25_data_collected_with_the_PurpleAir_sensor))

Forum discussion on which PM estimate to use: https://community.purpleair.com/t/pm2-5-algorithms/3972/6

## Prep

### Import Packages

In [1]:
# File Manipulation

import os # For working with Operating System
from dotenv import load_dotenv # Loading .env info

# Web

import requests # Accessing the Web

# Time

import datetime as dt # Working with dates/times
import pytz # Timezones

# Database 

import psycopg2
from psycopg2 import sql

# Data Manipulation

import numpy as np
import geopandas as gpd
import pandas as pd

### Global Variables

In [2]:
load_dotenv() # Load .env file

## API Keys

purpleAir_api = os.getenv('PURPLEAIR_API_TOKEN') # PurpleAir API Read Key

## Database credentials

creds = [os.getenv('DB_NAME'),
         os.getenv('DB_USER'),
         os.getenv('DB_PASS'),
         os.getenv('DB_PORT'),
         os.getenv('DB_HOST')
        ]

pg_connection_dict = dict(zip(['dbname', 'user', 'password', 'port', 'host'], creds))  

# Other Constants - should be system arguments of some sort

spike_threshold = 35 # Value which defines an AQ_Spike (micrograms per cubic meter)

# When to stop the program?
days_to_run = 7 # How many days will we run this?
timestep = 10 # Sleep time in between updates (in Minutes)
stoptime = dt.datetime.now() + dt.timedelta(days=days_to_run) # When to stop the program (datetime)
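
The constants above imply a polling loop that wakes every `timestep` minutes until `stoptime` is reached. A minimal sketch of such a loop follows. `run_until` and `task` are hypothetical names that do not appear in this notebook, and the `now` and `sleep` parameters exist only so the loop can be exercised without actually waiting.

```python
import datetime as dt
import time


def run_until(stoptime, timestep_minutes, task, now=dt.datetime.now, sleep=time.sleep):
    """Call task() repeatedly until stoptime, sleeping timestep_minutes between calls.

    Hypothetical helper: task stands in for one query/clean/check cycle.
    Returns the number of times task was called.
    """
    calls = 0
    while now() < stoptime:
        task()
        calls += 1
        sleep(timestep_minutes * 60)  # time.sleep expects seconds
    return calls
```

In this notebook the call would look something like `run_until(stoptime, timestep, one_cycle)`, where `one_cycle` is whatever function performs a single update.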

## Initial Functions

### Get sensor_ids from database

In [3]:
def get_sensor_ids(pg_connection_dict):
    '''
    This function gets the sensor_ids of all sensors in our database
    Returns a pandas Series
    '''

    # Connect
    conn = psycopg2.connect(**pg_connection_dict) 
    # Create cursor
    cur = conn.cursor()

    cmd = sql.SQL('''SELECT sensor_index 
    FROM "PurpleAir Stations"
    ''')

    cur.execute(cmd) # Execute
    conn.commit() # Commit (ends the transaction; not strictly needed for a SELECT)

    # Unpack response into pandas series

    sensor_ids = pd.DataFrame(cur.fetchall(), columns = ['sensor_index']).sensor_index

    # Close cursor
    cur.close()
    # Close connection
    conn.close()

    return sensor_ids
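
The function above leaves the connection open if the query raises. Below is a minimal sketch of the same query with guaranteed cleanup, written against the generic Python DB-API so it can be tried without a PostgreSQL server. `fetch_sensor_ids` and its `connect` argument are assumptions for illustration; in this notebook `connect` would be something like `lambda: psycopg2.connect(**pg_connection_dict)`. Note that it returns a plain list rather than a pandas Series.

```python
import sqlite3
from contextlib import closing


def fetch_sensor_ids(connect):
    """Return every sensor_index in "PurpleAir Stations" as a list.

    connect is a zero-argument callable returning a DB-API connection
    (hypothetical parameter, not part of the notebook).
    """
    query = 'SELECT sensor_index FROM "PurpleAir Stations"'
    # closing() guarantees close() is called even if execute() raises
    with closing(connect()) as conn, closing(conn.cursor()) as cur:
        cur.execute(query)
        return [row[0] for row in cur.fetchall()]


# Demonstration against an in-memory SQLite database standing in for PostgreSQL
def _demo_connect():
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE "PurpleAir Stations" (sensor_index INTEGER)')
    conn.executemany('INSERT INTO "PurpleAir Stations" VALUES (?)',
                     [(142718,), (142720,)])
    return conn


demo_ids = fetch_sensor_ids(_demo_connect)
```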

### Get Sensors Data from PurpleAir

In [4]:
def getSensorsData(query='', api_read_key=''):
    '''
    This function sends a GET request to the PurpleAir sensors endpoint
    query is the already-encoded query string, api_read_key is the API read key
    Returns a requests.Response object
    '''

    # url is assigned the URL we are going to send our request to.
    url = 'https://api.purpleair.com/v1/sensors?' + query

    # my_headers is assigned the context of our request we want to make. In this case
    # we will pass through our API read key in the X-API-Key header.
    my_headers = {'X-API-Key':api_read_key}

    # This line creates and sends the request and then assigns its response to the
    # variable, response.
    response = requests.get(url, headers=my_headers)

    # We then return the response we received.
    return response

## PurpleAir API Experiments

### Query to DataFrame Pipeline

In [5]:
# Setting parameters for API
fields = ['pm2.5_10minute', 'channel_flags', 'last_seen']

fields_string = 'fields=' + '%2C'.join(fields)

In [6]:
# Query only for sensors in our database

sensor_ids = get_sensor_ids(pg_connection_dict) # Get the sensor ids as a pandas series

sensor_string = 'show_only=' + '%2C'.join(sensor_ids.astype(str))

In [7]:
# # Query only for sensors modified since <- This didn't seem to work

# prev_runtime = dt.datetime.now() # Dummy variable

# formatted_time = str(int(prev_runtime.timestamp()))

# modified_since_string = f'modified_since={formatted_time}'

# modified_since_string

In [8]:
# Final Query String 

query_string = '&'.join([fields_string, sensor_string]) # , modified_since_string
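
The query string above is assembled by hand, with `%2C` as the percent-encoded comma. As an alternative sketch, the standard library's `urllib.parse.urlencode` can do the escaping. `build_sensors_url` is a hypothetical helper, and the two sensor ids in the usage line are simply values that appear in this notebook's output.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.purpleair.com/v1/sensors'


def build_sensors_url(fields, sensor_ids):
    """Build the sensors URL from lists of field names and sensor ids."""
    params = {
        'fields': ','.join(fields),
        'show_only': ','.join(str(s) for s in sensor_ids),
    }
    # urlencode percent-encodes the commas as %2C, matching the manual join
    return BASE_URL + '?' + urlencode(params)


example_url = build_sensors_url(
    ['pm2.5_10minute', 'channel_flags', 'last_seen'],
    [142718, 142720],
)
```

Because all sensors go into one `show_only` parameter, this still sends a single request for the whole set, in line with the PurpleAir guidance quoted at the top.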

In [9]:
# Finalizing query for API function

# print('https://api.purpleair.com/v1/sensors?' + query_string)

In [10]:
# Call API

response = getSensorsData(query_string, purpleAir_api) # The response is a requests.Response object

runtime = dt.datetime.now(pytz.timezone('America/Chicago')) # When we call - datetime in our timezone

In [14]:
response.status_code

200
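
A 200 here means the request succeeded, but a loop that runs for days will eventually hit a failed call. A minimal sketch of a retry wrapper follows. `fetch_with_retry` is a hypothetical helper, the retry count and wait time are arbitrary assumptions, and it relies only on the `status_code` attribute that `requests.Response` provides.

```python
import time


def fetch_with_retry(fetch, retries=3, wait_seconds=60, sleep=time.sleep):
    """Call fetch() until it returns a response with status 200.

    fetch is a zero-argument callable, e.g.
    lambda: getSensorsData(query_string, purpleAir_api).
    Returns the successful response, or None after `retries` failed attempts.
    """
    for attempt in range(retries):
        response = fetch()
        if response.status_code == 200:
            return response
        if attempt < retries - 1:
            sleep(wait_seconds)  # pause before trying again
    return None
```

Returning `None` rather than raising lets the caller skip one update and carry on; raising an exception instead would be an equally reasonable choice.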

In [55]:
# Read response object into pd.DataFrame

response_dict = response.json() # Read response as a dictionary

col_names = response_dict['fields'] # Get field names from dictionary
data = np.array(response_dict['data']) # Get data from dictionary

sensors_df = pd.DataFrame(data, columns = col_names)

# Correct last_seen: Unix seconds -> timezone-aware datetime (keeping in case we want to go with the old route)

sensors_df['last_seen'] = pd.to_datetime(sensors_df['last_seen'],
                                         utc = True,
                                         unit='s').dt.tz_convert('America/Chicago')
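
The conversion above relies on the response carrying a `fields` list and a `data` list of rows in the same column order. Here is a small pure-Python sketch of that pairing, which can be handy for inspecting a payload without pandas. `rows_from_response` is a hypothetical helper, and the sample payload is hand-written from two rows of this notebook's output, with the `last_seen` column left out.

```python
def rows_from_response(response_dict):
    """Pair each data row with the field names, giving one dict per sensor."""
    fields = response_dict['fields']
    return [dict(zip(fields, row)) for row in response_dict['data']]


# Hand-written sample shaped like the API response (not real API output)
sample = {
    'fields': ['sensor_index', 'channel_flags', 'pm2.5_10minute'],
    'data': [[142718, 0, 15.4], [142720, 0, 15.2]],
}
rows = rows_from_response(sample)
```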

In [56]:
# Visualizing API response
sensors_df.head()

|   | sensor_index | last_seen | channel_flags | pm2.5_10minute |
|---|---|---|---|---|
| 0 | 142718 | 2023-11-06 16:29:47-06:00 | 0 | 15.4 |
| 1 | 142720 | 2023-11-06 16:31:03-06:00 | 0 | 15.2 |
| 2 | 142726 | 2023-11-06 16:30:22-06:00 | 0 | 23.6 |
| 3 | 142724 | 2023-11-06 16:30:47-06:00 | 0 | 14.4 |
| 4 | 142730 | 2023-09-25 09:18:30-05:00 | 0 | 0.0 |


In [57]:
sensors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype                          
---  ------          --------------  -----                          
 0   sensor_index    66 non-null     object                         
 1   last_seen       66 non-null     datetime64[ns, America/Chicago]
 2   channel_flags   66 non-null     object                         
 3   pm2.5_10minute  65 non-null     object                         
dtypes: datetime64[ns, America/Chicago](1), object(3)
memory usage: 2.2+ KB


### Cleaning PurpleAir Station Data

In [59]:
# Key
# Channel Flags: 0 = Normal, 1 = A Downgraded, 2 = B Downgraded, 3 = Both Downgraded

# Flag sensors with a downgraded channel or not seen in the last 60 minutes
flags = (sensors_df.channel_flags != 0) | (
    sensors_df.last_seen < dt.datetime.now(pytz.timezone('America/Chicago')) - dt.timedelta(minutes=60)
)

clean_df = sensors_df[~flags].copy()

# Rename column for ease of use

clean_df = clean_df.rename(columns = {'pm2.5_10minute':'pm25'})

# Remove obvious error values

clean_df = clean_df[clean_df.pm25 < 1000] 

# Remove NaNs

clean_df = clean_df.dropna()

clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38 entries, 0 to 63
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype                          
---  ------         --------------  -----                          
 0   sensor_index   38 non-null     object                         
 1   last_seen      38 non-null     datetime64[ns, America/Chicago]
 2   channel_flags  38 non-null     object                         
 3   pm25           38 non-null     object                         
dtypes: datetime64[ns, America/Chicago](1), object(3)
memory usage: 1.5+ KB
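
The cleaning rules above can be restated as a single per-reading check, which makes each rule easy to test in isolation. This is a sketch in plain Python rather than the pandas masks the notebook uses. `is_clean` and the constant names are hypothetical, while the 60-minute and 1000 cutoffs are the ones used above.

```python
import datetime as dt

MAX_AGE = dt.timedelta(minutes=60)  # same staleness cutoff as the notebook
MAX_PM25 = 1000                     # same "obvious error" cutoff as the notebook


def is_clean(channel_flags, last_seen, pm25, now):
    """Return True if a reading passes every cleaning rule."""
    if channel_flags != 0:          # a channel is downgraded
        return False
    if last_seen < now - MAX_AGE:   # sensor has not reported recently
        return False
    if pm25 is None:                # missing value
        return False
    return pm25 < MAX_PM25          # drop implausibly large values
```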


### Check for Spikes

In [60]:
# Check for spikes

spikes_df =  clean_df[clean_df.pm25 >= spike_threshold][['sensor_index', 'pm25']].reset_index(drop=True)

spikes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   sensor_index  0 non-null      object
 1   pm25          0 non-null      object
dtypes: object(2)
memory usage: 124.0+ bytes
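
The empty result above means no cleaned reading reached `spike_threshold`. The comparison is inclusive (`>=`), so a reading of exactly 35 would count as a spike. A plain-Python sketch of the same check follows; `find_spikes` and the `(sensor_index, pm25)` pair shape are assumptions for illustration.

```python
def find_spikes(readings, threshold=35):
    """Return {sensor_index: pm25} for readings at or above threshold.

    readings is an iterable of (sensor_index, pm25) pairs (assumed shape).
    """
    return {sensor: pm25 for sensor, pm25 in readings if pm25 >= threshold}


# Three readings taken from the sample output earlier; none reaches 35
no_spikes = find_spikes([(142718, 15.4), (142720, 15.2), (142726, 23.6)])
```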


In [61]:
# Get flagged sensors (every sensor that did not survive cleaning)

flagged_df = sensors_df[~sensors_df.sensor_index.isin(clean_df.sensor_index)]

flagged_sensor_ids = flagged_df.reset_index(drop=True).sensor_index