# Observed Air Quality (PurpleAir)

This notebook retrieves readings from PurpleAir sensors in Minneapolis, cleans the entries, and saves the results as a CSV file.

Documentation is available here: https://api.purpleair.com.
You can read this article for help getting started: https://community.purpleair.com/t/making-api-calls-with-the-purpleair-api/180.

From PurpleAir: 

"The data from individual sensors will update no less than every 30 seconds. As a courtesy, we ask that you limit the number of requests to no more than once every 1 to 10 minutes, assuming you are only using the API to obtain data from sensors. If retrieving data from multiple sensors at once, please send a single request rather than individual requests in succession.

The PurpleAir historical API is released as of July 18, 2022. For more information, view this post: https://community.purpleair.com/t/new-version-of-the-purpleair-api-on-july-18th/1251.

Please let us know if you have any questions or concerns, and have a great day!"

A paper on this process: https://doi.org/10.5194/amt-14-4617-2021 (Link for [Download](https://www.researchgate.net/publication/352663348_Development_and_application_of_a_United_States-wide_correction_for_PM25_data_collected_with_the_PurpleAir_sensor) )

Chat on which PM Estimate to use: https://community.purpleair.com/t/pm2-5-algorithms/3972/6

In [24]:
import os
import requests 
import datetime as dt
import pandas as pd
import arcpy
import numpy as np
import io

In [4]:
cwd = os.getcwd() # Global variable holding the notebook's directory (must change if running in ArcGIS Pro)

# Set the geodatabase as the arcpy workspace

arcpy.env.workspace = os.path.join(cwd, '..', '..', 'data', 'QAQC.gdb')

arcpy.env.overwriteOutput = True # Overwrite layers is okay

## Setting MPLS Bounds

In [5]:
#bound strings

mpls_8km = "mpls_8km"

bounds_strings = ['nwlng=-93.43083707299996',
                  'nwlat=45.12366876300007',
                  'selng=-93.09225748799997',
                  'selat=44.81791263300005']
bounds_string = '&'.join(bounds_strings)

print(bounds_string)

nwlng=-93.43083707299996&nwlat=45.12366876300007&selng=-93.09225748799997&selat=44.81791263300005
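The same query string can also be assembled with the standard library instead of manual joins. The sketch below is one possible approach, not the notebook's own method: it uses `urllib.parse.urlencode`, keeps the coordinates as strings so they are passed through unchanged, and introduces the names `bounds` and `bounds_query` purely for illustration.

```python
from urllib.parse import urlencode

# Bounding box values copied from the cell above, kept as strings so that
# no float formatting can alter them.
bounds = {
    'nwlng': '-93.43083707299996',
    'nwlat': '45.12366876300007',
    'selng': '-93.09225748799997',
    'selat': '44.81791263300005',
}

# urlencode joins key=value pairs with '&' and percent-encodes where needed.
bounds_query = urlencode(bounds)

# A comma-separated field list is encoded the same way ("," becomes "%2C").
fields_query = urlencode({'fields': ','.join(['location_type'])})

print(bounds_query)
print(fields_query)
```

A side benefit of this approach is that any reserved characters in later parameters are escaped automatically.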


## Get Station IDs

In [6]:
# This function will be used to collect data for multiple public PurpleAir sensors.
def getSensorsData(query='', api_read_key=''):

    # url is the full address we are going to send our request to.
    url = 'https://api.purpleair.com/v1/sensors?' + query
    
    print('Here is the full url for the API call:\n\n', url)

    # my_headers carries the context of the request. In this case
    # we pass our API read key, supplied through the api_read_key argument.
    my_headers = {'X-API-Key': api_read_key}

    # This line creates and sends the request and then assigns its response to the
    # variable, response.
    response = requests.get(url, headers=my_headers)

    # We then return the response we received.
    return response

In [7]:
# This is my personal API key... Please use responsibly!
# 51592903-B445-11ED-B6F4-42010A800007

api = input('Please enter your Purple Air api key')

Please enter your Purple Air api key 51592903-B445-11ED-B6F4-42010A800007
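Because `input` echoes the key into the saved notebook output, an alternative is to read it from an environment variable and fall back to a hidden prompt. This is a minimal sketch under assumptions: the helper name `load_api_key` and the variable name `PURPLEAIR_API_KEY` are hypothetical, not part of PurpleAir's tooling.

```python
import os
from getpass import getpass


def load_api_key(env_var='PURPLEAIR_API_KEY'):
    """Return the API key from the environment, or prompt for it without echo.

    env_var is a hypothetical variable name chosen for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides what is typed, so the key is not stored in cell output.
        key = getpass('Please enter your PurpleAir API key: ')
    return key.strip()
```

Calling `api = load_api_key()` would then replace the `input` call in the cell above.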


In [8]:
# Designating and formatting the fields to request

fields = ['location_type']

fields_string = 'fields=' + '%2C'.join(fields)

print(fields_string)

fields=location_type


In [9]:
# Put it all together

query_string = '&'.join([fields_string, bounds_string])

print(query_string)

fields=location_type&nwlng=-93.43083707299996&nwlat=45.12366876300007&selng=-93.09225748799997&selat=44.81791263300005


In [10]:
# Make the request

response = getSensorsData(query_string, api)

Here is the full url for the API call:

 https://api.purpleair.com/v1/sensors?fields=location_type&nwlng=-93.43083707299996&nwlat=45.12366876300007&selng=-93.09225748799997&selat=44.81791263300005


In [11]:
# Get response into Pandas DataFrame

response_dict = response.json() # Read response as a json (dictionary)

col_names = response_dict['fields']
data = np.array(response_dict['data'])

df = pd.DataFrame(data, columns = col_names)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88 entries, 0 to 87
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   sensor_index   88 non-null     int32
 1   location_type  88 non-null     int32
dtypes: int32(2)
memory usage: 832.0 bytes


In [12]:
df.head()

Unnamed: 0,sensor_index,location_type
0,3088,0
1,5582,0
2,137876,1
3,11134,0
4,142718,0


In [13]:
# Only want outside sensors

outside_sensors = df[df['location_type']==0] # 0 = outside

len(outside_sensors)

80
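The raw response also carries a `location_types` list (`['outside', 'inside']`), so the integer codes can be decoded from the payload rather than remembered. The sketch below assumes a payload shaped like the one printed in this notebook and uses only its first five sensors; the names `payload`, `labels`, and `location_label` are illustrative.

```python
import pandas as pd

# Trimmed copy of the response shown in this notebook (first five sensors only).
payload = {
    'fields': ['sensor_index', 'location_type'],
    'location_types': ['outside', 'inside'],
    'data': [[3088, 0], [5582, 0], [137876, 1], [11134, 0], [142718, 0]],
}

sensors = pd.DataFrame(payload['data'], columns=payload['fields'])

# Each integer code is the position of its label in the location_types list.
labels = dict(enumerate(payload['location_types']))
sensors['location_label'] = sensors['location_type'].map(labels)

# Filter on the readable label instead of the magic number 0.
outside = sensors[sensors['location_label'] == 'outside']
print(outside)
```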

In [14]:
response.json()

{'api_version': 'V1.0.11-0.0.42',
 'time_stamp': 1680562985,
 'data_time_stamp': 1680562925,
 'max_age': 604800,
 'firmware_default_version': '7.02',
 'fields': ['sensor_index', 'location_type'],
 'location_types': ['outside', 'inside'],
 'data': [[3088, 0],
  [5582, 0],
  [137876, 1],
  [11134, 0],
  [142718, 0],
  [142720, 0],
  [142726, 0],
  [142730, 0],
  [142728, 0],
  [142734, 0],
  [142732, 0],
  [142736, 0],
  [142744, 0],
  [142750, 0],
  [142748, 0],
  [142752, 0],
  [142756, 0],
  [142774, 0],
  [142772, 0],
  [142852, 0],
  [143214, 0],
  [143216, 0],
  [143222, 0],
  [143226, 0],
  [143224, 0],
  [143238, 0],
  [143242, 0],
  [143240, 0],
  [143246, 0],
  [143636, 0],
  [143648, 0],
  [143656, 0],
  [143666, 0],
  [143668, 0],
  [143916, 0],
  [145202, 0],
  [145204, 0],
  [145234, 0],
  [145242, 0],
  [145250, 0],
  [145454, 0],
  [145470, 0],
  [145498, 0],
  [145506, 0],
  [145604, 0],
  [145610, 0],
  [145616, 0],
  [147749, 1],
  [17189, 1],
  [21179, 0],
  [154751, 

In [15]:
#drop the location_type now that we have filtered for outdoor sensors only
df_historic = outside_sensors.drop('location_type', axis=1)

## Pulling Historic Sensor CSVs

### Setting time period

In [31]:
# Pulling from 9/1/22 - 4/2/23

# End time

end_datetime = dt.datetime(2023,4,2) # April 2, 2023
end_timestamp = int(dt.datetime.timestamp(end_datetime))

# Start time

start_datetime = dt.datetime(2022,9,1) # September 1, 2022
start_timestamp = int(dt.datetime.timestamp(start_datetime))

# Sensors

sensor_ids = outside_sensors.sensor_index.apply(lambda x: int(x))
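One caveat: `dt.datetime(2023, 4, 2)` is a naive datetime, and `timestamp()` interprets naive values in the machine's local time zone, so the resulting Unix timestamps shift with where the notebook runs. A minimal sketch that pins the boundaries to UTC follows; the helper name `utc_timestamp` is hypothetical.

```python
import datetime as dt


def utc_timestamp(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


start_timestamp_utc = utc_timestamp(2022, 9, 1)  # September 1, 2022
end_timestamp_utc = utc_timestamp(2023, 4, 2)    # April 2, 2023

print(start_timestamp_utc, end_timestamp_utc)
```

Whether UTC or Minneapolis local midnight is the right boundary depends on the analysis; the point is only to make the choice explicit.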

### Creating the Query for the API

In [32]:
# Sensor id

sensor_id = sensor_ids.iloc[0] # first outdoor sensor, selected by position rather than index label

# Timestamp String

time_string = 'start_timestamp=' + str(start_timestamp) + '&end_timestamp=' + str(end_timestamp)

# Averaging period in minutes; 1440 is a 1-day average

avg_string = 'average=1440'

# Environmental fields

env_fields = ['humidity', 'temperature', 'pressure', 'pm2.5_cf_1']

env_fields_string = 'fields=' + '%2C%20'.join(env_fields)

# Base URL

base_url = f'https://api.purpleair.com/v1/sensors/{sensor_id}/history/csv?'

# Put it all together

query_url = base_url + '&'.join([time_string, avg_string, env_fields_string])

my_headers = {'X-API-Key':api}

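# This line creates and sends the request and then assigns its response to the
# variable, r.
r = requests.get(query_url, headers=my_headers)
r.raise_for_status()  # stop here if the API returned an error status

# Read response as CSV data
csv_data = r.content.decode('utf-8')

# Parse CSV data into pandas DataFrame
df_historic = pd.read_csv(io.StringIO(csv_data), header=None)
df_historic = df_historic.iloc[1:]  # exclude the header row
# Column order is assumed to match the request: timestamp, sensor id, then env_fields
df_historic.columns = ['timestamp', 'sensor_index'] + env_fields
# header=None leaves the header text in each column, so values load as strings;
# convert them back to numbers before any range checks
df_historic = df_historic.apply(pd.to_numeric)
df_historic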

Unnamed: 0,timestamp,sensor_index,humidity,temperature,pressure,pm2.5_cf_1
0,1674864000,3088,40.201,17.581,991.743,0.6330
1,1671753600,3088,42.605,0.743,987.762,0.0785
2,1668729600,3088,48.795,29.080,990.701,54.4190
3,1672531200,3088,53.942,42.170,980.359,8.8820
4,1676764800,3088,50.660,40.259,978.572,7.8060
...,...,...,...,...,...,...
208,1677628800,3088,52.806,44.600,977.058,3.9945
209,1679356800,3088,41.454,42.326,987.113,2.3800
210,1676592000,3088,38.144,22.964,995.503,0.5600
211,1679529600,3088,41.673,41.035,988.664,3.2915


### Creating a 'for' loop to parse through all sensor_ids
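This section has no code yet. One possible shape for the loop is sketched below: it requests one CSV per sensor, pauses between calls in line with PurpleAir's rate-limit request, and stacks the results. Everything here is an assumption for illustration: `fetch_history` and `fetch_csv` are hypothetical names, the `time_stamp` header is a guess at the CSV layout, and the stand-in fetcher replaces the real `requests.get` call so the sketch runs offline.

```python
import io
import time

import pandas as pd


def fetch_history(sensor_ids, fetch_csv, pause_seconds=0.0):
    """Collect one DataFrame per sensor and stack them into a single frame.

    fetch_csv is a hypothetical callable mapping a sensor id to CSV text.
    """
    frames = []
    for sensor_id in sensor_ids:
        csv_text = fetch_csv(sensor_id)
        frame = pd.read_csv(io.StringIO(csv_text))
        if not frame.empty:
            frames.append(frame)
        time.sleep(pause_seconds)  # be polite to the API between requests
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def fake_fetch_csv(sensor_id):
    """Stand-in for the real request; returns a one-row CSV with assumed headers."""
    return ('time_stamp,sensor_index,humidity\n'
            f'1674864000,{sensor_id},40.201\n')


combined = fetch_history([3088, 5582], fake_fetch_csv)
print(combined)
```

In the notebook, `fetch_csv` would wrap the single-sensor request built in the previous cell, and `pause_seconds` would be set to a nonzero value.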

## Cleaning Historic Data for Analysis

In [None]:
#rename pm2.5 column to pm2_5 for SQL
df_historic = df_historic.rename(columns={'pm2.5_cf_1' : 'pm2_5'})

In [37]:
#changing UNIX date to pd date
df_historic['timestamp'] = pd.to_datetime(df_historic['timestamp'], unit='s')

Unnamed: 0,timestamp,sensor_index,humidity,temperature,pressure,pm2.5_cf_1
1,2023-01-28,3088,40.201,17.581,991.743,0.633
2,2022-12-23,3088,42.605,0.743,987.762,0.0785
3,2022-11-18,3088,48.795,29.08,990.701,54.419000000000004
4,2023-01-01,3088,53.942,42.17,980.359,8.882
5,2023-02-19,3088,50.66,40.259,978.572,7.806000000000001
...,...,...,...,...,...,...
209,2023-03-01,3088,52.806,44.6,977.058,3.9945
210,2023-03-21,3088,41.454,42.326,987.113,2.38
211,2023-02-17,3088,38.144,22.964,995.503,0.56
212,2023-03-23,3088,41.673,41.035,988.664,3.2915


## QAQC

In [None]:
#create a blank dataframe to hold the errors

purpleair_historic_errors = pd.DataFrame(columns = ['humidity_error', 'temperature_error', 'pressure_error', 'pm2_5_error'])
purpleair_historic_errors['sensor_index'] = df_historic['sensor_index']
purpleair_historic_errors['timestamp'] = df_historic['timestamp']

### Humidity Check

In [None]:
# Ranges pulled from https://www.currentresults.com/Weather/Minnesota/humidity-annual.php
# The reported range is 40-90%, but that flagged too many readings, so the lower bound was relaxed to 10%

def check_range(value):
    if pd.isna(value):  # catches both None and NaN (pandas stores missing numbers as NaN)
        return 'no value given'  # or any other value that indicates a missing value
    elif value >= 10 and value <= 90:
        return None  # in range, so no error is recorded
    else:
        return 'out of range (10%-90%)'
    
purpleair_historic_errors['humidity_error'] = df_historic['humidity'].apply(check_range)

print(purpleair_historic_errors)

### Temperature Check

In [None]:
#winter -4 - 28
#spring 22 - 57
#summer 48 - 81
#fall 29 - 59
#ref from https://www.dnr.state.mn.us/climate/summaries_and_publications/normalsportal.html

def check_range(value):
    if pd.isna(value):  # catches both None and NaN
        return 'no value given'  # or any other value that indicates a missing value
    elif value >= -20 and value <= 100:
        return None  # in range, so no error is recorded
    else:
        return 'out of range (-20 to 100F)'
'''
#if we can get time stamp we should use this with a date check too
#this is not correct - we can do seasonal if we can relate it to date range
def check_range(value):
    if value is None:
        return -1
    if value >= -20 and value <=35:
        return 'winter (-20-35F)'
    if value >10 and value <=70:
        return 'spring (10-70F)'
    if value >30 and value <=100:
        return 'summer (30-100F)'
    if value >15 and value <=70:
        return 'fall (15-70F)'
    else:
        return 'out of range'
'''

purpleair_historic_errors['temperature_error'] = df_historic['temperature'].apply(check_range)

print(purpleair_historic_errors)
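The commented-out draft above notes that a seasonal check needs the reading's date. One way to tie the two together is sketched here, reusing the widened ranges from that draft. The month-to-season mapping (meteorological seasons) and the names `SEASON_RANGES`, `season_of`, and `check_seasonal_temperature` are assumptions made for this example.

```python
import pandas as pd

# Widened seasonal ranges in degrees F, taken from the commented-out draft.
SEASON_RANGES = {
    'winter': (-20, 35),
    'spring': (10, 70),
    'summer': (30, 100),
    'fall': (15, 70),
}


def season_of(timestamp):
    """Meteorological season for a timestamp (an assumed month grouping)."""
    month = timestamp.month
    if month in (12, 1, 2):
        return 'winter'
    if month in (3, 4, 5):
        return 'spring'
    if month in (6, 7, 8):
        return 'summer'
    return 'fall'


def check_seasonal_temperature(temperature, timestamp):
    """Return None when the reading fits its season, else an error label."""
    if pd.isna(temperature):
        return 'no value given'
    season = season_of(timestamp)
    low, high = SEASON_RANGES[season]
    if low <= temperature <= high:
        return None
    return f'out of range for {season} ({low}-{high}F)'
```

Applied row by row, this could look like `df_historic.apply(lambda row: check_seasonal_temperature(row['temperature'], row['timestamp']), axis=1)`, once the timestamp column has been converted to datetimes.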

### Pressure Check

In [None]:
# Range is 25 - 35 inHg according to https://barometricpressure.app/minneapolis
# PurpleAir uses millibars, so I used https://www.weather.gov/epz/wxcalc_pressureconvert to convert
# Converted range is 846.6 - 1185.24 millibars

def check_range(value):
    if pd.isna(value):  # catches both None and NaN
        return 'no value given'  # or any other value that indicates a missing value
    elif value >= 830 and value <= 1200:
        return None  # in range, so no error is recorded
    else:
        return 'out of range (830 - 1200 Millibars)'
    
purpleair_historic_errors['pressure_error'] = df_historic['pressure'].apply(check_range)

print(purpleair_historic_errors)

### PM Check

In [None]:
#Average reading in MPLS is 30 ug/m3 per https://www.epa.gov/air-trends/air-quality-cities-and-counties

def check_range(value):
    if pd.isna(value):  # catches both None and NaN
        return 'no value given'
#    if value == 0:
#        return '0'
#    if value >0.1 and value <=10:
#        return 'PM2.5 0.1-10'
#    if value >10 and value <=20:
#        return 'PM2.5 10-20'
#    if value >20 and value <=30:
#        return 'PM2.5 20-30'
#    if value >30 and value <=40:
#        return 'PM2.5 30-40'
#    if value >40 and value <=50:
#        return 'PM2.5 40-50'
#    if value >50 and value <=60:
#        return 'PM2.5 50-60'
#    if value >60 and value <=70:
#        return 'PM2.5 60-70'
    if value >0.1 and value <70:
        return None  # in range, so no error is recorded
    else:
        # readings at or below 0.1 land here too, so the label names both bounds
        return 'out of range (0.1-70 ug/m3)'
    
purpleair_historic_errors['pm2_5_error'] = df_historic['pm2_5'].apply(check_range)

print(purpleair_historic_errors)
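The four checks above share one shape: a missing-value test, an inclusive range, and a label. A small factory can produce each checker from its bounds, which avoids redefining `check_range` in every cell. This is a sketch rather than the notebook's method; `make_range_check` and the variables below are hypothetical names, and `pd.isna` is used so that both `None` and `NaN` count as missing.

```python
import pandas as pd


def make_range_check(low, high, units):
    """Build a checker that returns None in range, else an error label."""
    message = f'out of range ({low}-{high} {units})'

    def check(value):
        if pd.isna(value):  # covers None and NaN
            return 'no value given'
        if low <= value <= high:
            return None
        return message

    return check


# Bounds copied from the humidity and pressure cells above.
check_humidity = make_range_check(10, 90, '%')
check_pressure = make_range_check(830, 1200, 'millibars')

readings = pd.Series([40.201, 95.0, None])
humidity_errors = readings.apply(check_humidity)
print(humidity_errors)
```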

In [None]:
# Removing rows from the error table that don't have any errors

purpleair_historic_errors = purpleair_historic_errors.dropna(subset=purpleair_historic_errors.columns.difference(['sensor_index', 'timestamp']), how='all')
purpleair_historic_errors

## Connecting to the Server

In [None]:
import psycopg2
from psycopg2 import sql

In [None]:
connection = psycopg2.connect(host = '34.132.44.118',
                              database = 'lab1-2',
                              user = 'postgres',
                              password = 'password',
                              port = '5432')
connection.closed

## Insert Data into SQL Table

In [None]:
# open a cursor on the connection
cur = connection.cursor()

# iterate over the dataframe and insert each row into the database using a SQL INSERT statement
# (six columns are listed, so six %s placeholders are required)
for index, row in df_historic.iterrows():
    cur.execute('''
    INSERT INTO PURPLEAIR_HISTORIC (sensor_index, timestamp, humidity, temperature, pressure, pm2_5) 
    VALUES (%s, %s, %s, %s, %s, %s) 
    ''', (row['sensor_index'], row['timestamp'], row['humidity'], row['temperature'], row['pressure'], row['pm2_5']))
connection.commit()  # commit once after all readings are inserted
    
for i, r in purpleair_historic_errors.iterrows():
    cur.execute('''
    INSERT INTO PURPLEAIR_HISTORIC_ERRORS (sensor_index, timestamp, humidity_error, temperature_error, pressure_error, pm2_5_error) 
    VALUES (%s, %s, %s, %s, %s, %s) 
    ''', (r['sensor_index'], r['timestamp'], r['humidity_error'], r['temperature_error'], r['pressure_error'], r['pm2_5_error']))
connection.commit()  # commit once after all error rows are inserted
# close the cursor and connection
cur.close()
connection.close()
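An INSERT statement needs exactly one `%s` placeholder per listed column, and that count is easy to get wrong by hand. Generating the statement from the column list keeps the two in step. The sketch below is a hypothetical helper (`build_insert` is not part of psycopg2) and assumes the table and column names are fixed, trusted strings as they are in this notebook.

```python
def build_insert(table, columns):
    """Return a parameterized INSERT with one %s placeholder per column.

    Only the values are parameterized; table and column names are
    interpolated directly, so they must come from trusted, fixed strings.
    """
    column_list = ', '.join(columns)
    placeholders = ', '.join(['%s'] * len(columns))
    return f'INSERT INTO {table} ({column_list}) VALUES ({placeholders})'


columns = ['sensor_index', 'timestamp', 'humidity',
           'temperature', 'pressure', 'pm2_5']
insert_sql = build_insert('PURPLEAIR_HISTORIC', columns)
print(insert_sql)

# With a live psycopg2 cursor, one possible use would be:
# cur.executemany(insert_sql,
#                 df_historic[columns].itertuples(index=False, name=None))
```

If identifiers ever came from user input, the `psycopg2.sql` module imported earlier (`sql.SQL`, `sql.Identifier`) would be the safer way to compose them.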