# Observed Air Quality (PurpleAir)

This notebook retrieves readings from PurpleAir sensors in Minneapolis, cleans the entries, and saves the results as a CSV file.

Documentation is available here: https://api.purpleair.com.
You can read this article for help getting started: https://community.purpleair.com/t/making-api-calls-with-the-purpleair-api/180.

From PurpleAir: 

"The data from individual sensors will update no less than every 30 seconds. As a courtesy, we ask that you limit the number of requests to no more than once every 1 to 10 minutes, assuming you are only using the API to obtain data from sensors. If retrieving data from multiple sensors at once, please send a single request rather than individual requests in succession.

The PurpleAir historical API is released as of July 18, 2022. For more information, view this post: https://community.purpleair.com/t/new-version-of-the-purpleair-api-on-july-18th/1251.

Please let us know if you have any questions or concerns, and have a great day!"

A paper on the US-wide correction for PurpleAir PM2.5 data: https://doi.org/10.5194/amt-14-4617-2021 ([Download](https://www.researchgate.net/publication/352663348_Development_and_application_of_a_United_States-wide_correction_for_PM25_data_collected_with_the_PurpleAir_sensor))

Forum discussion on which PM2.5 estimate to use: https://community.purpleair.com/t/pm2-5-algorithms/3972/6

In [86]:
### Import Packages

# File manipulation

import os # For working with Operating System
import requests # Accessing the Web
import datetime as dt # Working with dates/times
import io # Input/Output Bytes objects

# Analysis

import numpy as np
import pandas as pd

## Set Working Environment

In [59]:
# # Get CWD

# cwd = os.getcwd() # This is a global variable for where the notebook is (must change if running in arcpro)

# # Create GeoDataBase
# # This is the communal GeoDataBase

# if not os.path.exists(os.path.join(cwd, '..', '..', 'data', 'QAQC.gdb')): # If it doesn't exist, create it

#     arcpy.management.CreateFileGDB(os.path.join(cwd, '..', '..', 'data'), 'QAQC')

# # Make it workspace

# arcpy.env.workspace = os.path.join(cwd, '..', '..', 'data', 'QAQC.gdb')

# arcpy.env.overwriteOutput = True # Overwrite layers is okay

## Definitions

In [60]:
spike_threshold = 28 # Micrograms per cubic meter

### Summary Statistics Functions

In [111]:
%run Summary_Functions.py

print('Stat Names:\n\n', summary_stats_names, '\n')
print('Stat Types:\n\n',summary_stats_dtypes, '\n')

print('Function Names:\n\n', summary_stats_functions)

Stat Names:

 ['n_observations', 'pm25_fullDay_mean', 'pm25_fullDay_min', 'pm25_fullDay_minTime', 'pm25_fullDay_max', 'pm25_fullDay_maxTime', 'pm25_fullDay_std', 'pm25_fullDay_minutesAbove12ug', 'pm25_morningRush_mean', 'pm25_morningRush_min', 'pm25_morningRush_minTime', 'pm25_morningRush_max', 'pm25_morningRush_maxTime', 'pm25_morningRush_std', 'pm25_eveningRush_mean', 'pm25_eveningRush_min', 'pm25_eveningRush_minTime', 'pm25_eveningRush_max', 'pm25_eveningRush_maxTime', 'pm25_eveningRush_std', 'pm25_daytimeAmbient_mean', 'pm25_daytimeAmbient_min', 'pm25_daytimeAmbient_minTime', 'pm25_daytimeAmbient_max', 'pm25_daytimeAmbient_maxTime', 'pm25_daytimeAmbient_std', 'pm25_nighttimeAmbient_mean', 'pm25_nighttimeAmbient_min', 'pm25_nighttimeAmbient_minTime', 'pm25_nighttimeAmbient_max', 'pm25_nighttimeAmbient_maxTime', 'pm25_nighttimeAmbient_std'] 

Stat Types:

 [<class 'int'>, <class 'float'>, <class 'float'>, <class 'str'>, <class 'float'>, <class 'str'>, <class 'float'>, <class 'int'>, 

### Functions

In [112]:
# QAQC

def qaqc(df):
    '''Perform basic QAQC: convert timestamps to datetimes,
    drop obvious error values (pm25 >= 1000), and drop NaNs.
    '''
    
    clean_df = df.copy()
    
    # Convert timestamp to datetime
    
    clean_df['timestamp'] = pd.to_datetime(clean_df['timestamp'], unit='s')
    
    # Remove obvious error values
    
    clean_df = clean_df[clean_df.pm25 < 1000] 
    
    # Remove NaNs
    
    clean_df = clean_df.dropna()
    
    return clean_df

# Flag and record spikes

def get_spikes(df, spike_threshold):
    '''Flag readings above spike_threshold in a new is_spike column.
    Returns the flagged dataframe (spikes are kept, not removed)
    and a separate dataframe containing only the spike rows.
    '''
    
    df_w_spikes = df.copy()
    
    condition = (df.pm25 > spike_threshold)
    
    df_w_spikes['is_spike'] = condition
    
    spikes = df_w_spikes[condition].copy() # Copy so callers can add columns without a SettingWithCopyWarning
    
    return df_w_spikes, spikes

# Get Summary Stats

def get_summary_stats(df):
    '''Run every function in summary_stats_functions on df
    and return the combined results as one flat list.
    '''
    
    stats = []
    
    # Run the functions
    
    for f in summary_stats_functions:
        stats += f(df)
    
    return stats
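
To make the behavior of the two cleaning helpers concrete, here is a small sketch that runs them on a handful of made-up readings. The functions are restated from the cell above so the block runs on its own; the sample values are hypothetical and are not real sensor data.

```python
import pandas as pd

def qaqc(df):
    """Same steps as qaqc above: parse timestamps, drop error values and NaNs."""
    clean_df = df.copy()
    clean_df['timestamp'] = pd.to_datetime(clean_df['timestamp'], unit='s')
    clean_df = clean_df[clean_df.pm25 < 1000]
    return clean_df.dropna()

def get_spikes(df, spike_threshold):
    """Same logic as get_spikes above: flag readings above the threshold."""
    df_w_spikes = df.copy()
    condition = df.pm25 > spike_threshold
    df_w_spikes['is_spike'] = condition
    return df_w_spikes, df_w_spikes[condition]

# Hypothetical 10-minute readings (made-up values for illustration only)
raw = pd.DataFrame({
    'timestamp': [1655251200, 1655251800, 1655252400, 1655253000],
    'pm25': [3.2, 1500.0, float('nan'), 30.1],
})

clean = qaqc(raw)                        # drops the 1500.0 row and the NaN row
flagged, spikes = get_spikes(clean, 28)  # 28 matches spike_threshold above

print(flagged)
print(spikes)
```

Note that `get_spikes` keeps the spike rows in its first return value and only marks them, which is why the loop below filters on `is_spike` to build the no-spike summary.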

### Set Up Parameters for Query

In [113]:
# Enter your own PurpleAir API key (it is sent in the X-API-Key header below). Please use it responsibly!

api = input('Please enter your PurpleAir API key: ')

In [114]:
### Query Strings

# Averaging period in minutes (e.g. 1440 is a 1-day average); 10-minute averages are used here

avg_string = 'average=10'

# Environmental fields

env_fields = ['pm2.5_cf_1']

env_fields_string = 'fields=' + '%2C%20'.join(env_fields)

# My Header

my_headers = {'X-API-Key': api}
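
As an alternative to joining the query pieces by hand, the same kind of URL can be sketched with the standard library's `urlencode`, which takes care of percent-encoding. This is only an illustration under assumptions: the timestamps are example values for one day, and the fields are joined with a plain comma (encoded as `%2C`) instead of the `%2C%20` separator used above. With a single field the result is the same either way.

```python
from urllib.parse import urlencode

sensor_id = 3088                    # one of the sensor ids used in this notebook
env_fields = ['pm2.5_cf_1']

params = {
    'start_timestamp': 1655251200,  # example value: 2022-06-15 00:00:00 UTC
    'end_timestamp': 1655337600,    # example value: 2022-06-16 00:00:00 UTC
    'average': 10,                  # minutes
    'fields': ','.join(env_fields),
}

query_url = (
    f'https://api.purpleair.com/v1/sensors/{sensor_id}/history/csv?'
    + urlencode(params)
)
print(query_url)
```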

## The Loop

In [115]:
## Iterables

sensor_ids = [3088, 5582, 11134, 142718, 142720] # This should be an iterable of the sensor ids as integers

datelist = pd.date_range(start = dt.datetime(2022,6,15), # June 15, 2022
                         end = dt.datetime.today(),
                         normalize = True)

print('Last Run on ', dt.datetime.today())

Last Run on  2023-04-19 23:26:50.845933
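
The loop further down turns each pair of consecutive midnights in `datelist` into a one-day query window. A minimal sketch of that step, using a fixed three-day range (an assumption made here so the output does not depend on today's date):

```python
import datetime as dt
import pandas as pd

# Three consecutive midnights give two one-day windows
datelist = pd.date_range(start=dt.datetime(2022, 6, 15),
                         end=dt.datetime(2022, 6, 17),
                         normalize=True)

windows = []
for i in range(len(datelist) - 1):
    start_timestamp = int(datelist[i].timestamp())
    end_timestamp = int(datelist[i + 1].timestamp())
    windows.append((str(datelist[i].date()), start_timestamp, end_timestamp))

for day, start, end in windows:
    print(day, start, end, end - start)  # each window spans 86400 seconds
```

Each window covers 86400 seconds, so with a 10-minute average a complete day holds 86400 / 600 = 144 rows, which matches the `n_observations` values of 144 in the summary tables below.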


In [116]:
## Initialize Storage

# Daily Summary

cols = ['date'] + ['sensor_index'] + summary_stats_names

datatypes = [str, int] + summary_stats_dtypes

dtypes = np.dtype(list(zip(cols, datatypes)))

daily_summary_df = pd.DataFrame(np.empty(0, dtype = dtypes))

# Daily Summary (No Spikes)

cols = ['date'] + ['sensor_index'] + summary_stats_names

datatypes = [str, int] + summary_stats_dtypes

dtypes = np.dtype(list(zip(cols, datatypes)))

daily_summary_no_spikes_df = pd.DataFrame(np.empty(0, dtype = dtypes))

# Spikes

all_spikes_df = pd.DataFrame(np.empty(0, dtype = [('timestamp', pd._libs.tslibs.timestamps.Timestamp),
                                              ('pm25', float),
                                              ('sensor_index', int)]
                                 )
                        )

In [126]:
# Iterate through the days

for i in range(len(datelist)-1): 
    
    # Set up Timestamp for query    
    
    start_timestamp = int(datelist[i].timestamp())
    end_timestamp = int(datelist[i+1].timestamp())
    
    time_string = 'start_timestamp=' + str(start_timestamp) + '&end_timestamp=' + str(end_timestamp)
    
    # Iterate through the Sensors
    
    for sensor_id in sensor_ids:

        # Base URL
        base_url = f'https://api.purpleair.com/v1/sensors/{sensor_id}/history/csv?'

        # Put it all together
        query_url = base_url + '&'.join([time_string, avg_string, env_fields_string])

        response = requests.get(query_url, headers=my_headers)

        if response.status_code == 200:

            # Read response as CSV data
            csv_data = response.content.decode('utf-8')

            if csv_data.count('\n') == 1: # There is only one line (empty data)
                print(f"No data for sensor {sensor_id} on {datelist[i]}")
                
            else:
                # Parse CSV data into pandas DataFrame
                df_individual_sensor = pd.read_csv(io.StringIO(csv_data),
                                                   header=0
                                                  )[['time_stamp', 'pm2.5_cf_1']]
                
                df_individual_sensor.columns = ['timestamp', 'pm25']
                
                # Perform QAQC
                
                clean = qaqc(df_individual_sensor)
                
                # Flag spikes and append them to the main spike storage

                clean_w_spikes, spikes = get_spikes(clean, spike_threshold)
                
                spikes['sensor_index'] = int(sensor_id)
                
                all_spikes_df = pd.concat([all_spikes_df, spikes], ignore_index=True)
                
                # Get Stats (With Spikes)

                sum_stats = get_summary_stats(clean_w_spikes)
                
                # Add to the daily summary dataframe
                
                row = [str(datelist[i].date()), int(sensor_id)] + sum_stats
                
                daily_summary_df.loc[len(daily_summary_df.index)] = row
                
                # Get Stats (Without Spikes)
                
                no_spikes = clean_w_spikes[~clean_w_spikes.is_spike]
                
                sum_stats = get_summary_stats(no_spikes)
                
                # Add to the daily summary dataframe
                
                row = [str(datelist[i].date()), int(sensor_id)] + sum_stats
                
                daily_summary_no_spikes_df.loc[len(daily_summary_no_spikes_df.index)] = row
 
        else:
            print(f"Error fetching data for sensor {sensor_id}: {response.status_code} on {datelist[i]}")
            
            
    if i == 7: # Stop after the first 8 days (i = 0 through 7)
        break

Error fetching data for sensor 5582: 429 on 2022-06-21 00:00:00
Error fetching data for sensor 11134: 429 on 2022-06-21 00:00:00
No data for sensor 142718 on 2022-06-21 00:00:00
Error fetching data for sensor 142720: 429 on 2022-06-21 00:00:00
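
The 429 responses above are HTTP "Too Many Requests" errors, meaning the loop is sending requests faster than the API allows. One possible way to handle this, sketched below, is to wait and retry with a growing delay. The helper name `fetch_with_backoff`, its parameters, and the `FakeResponse` stand-in are all hypothetical and not part of this notebook or of `requests`; in the loop, `fetch` would be something like `lambda: requests.get(query_url, headers=my_headers)`.

```python
import time

def fetch_with_backoff(fetch, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 429 or max_tries is reached.

    fetch is any zero-argument callable returning an object with a
    status_code attribute (a requests.Response has one).
    """
    response = fetch()
    for attempt in range(1, max_tries):
        if response.status_code != 429:
            break
        sleep(base_delay * 2 ** (attempt - 1))  # wait 1x, 2x, 4x ... base_delay
        response = fetch()
    return response


# Stand-in for requests.get so the sketch runs without network access
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

replies = iter([FakeResponse(429), FakeResponse(429), FakeResponse(200)])
waits = []  # records the delays instead of actually sleeping

result = fetch_with_backoff(lambda: next(replies), sleep=waits.append)
print(result.status_code, waits)
```

Passing `sleep` as a parameter keeps the sketch testable. In real use the default `time.sleep` applies, and a larger `base_delay` would be more in keeping with PurpleAir's request to limit how often the API is called.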


In [118]:
daily_summary_df

Unnamed: 0,date,sensor_index,n_observations,pm25_fullDay_mean,pm25_fullDay_min,pm25_fullDay_minTime,pm25_fullDay_max,pm25_fullDay_maxTime,pm25_fullDay_std,pm25_fullDay_minutesAbove12ug,...,pm25_daytimeAmbient_minTime,pm25_daytimeAmbient_max,pm25_daytimeAmbient_maxTime,pm25_daytimeAmbient_std,pm25_nighttimeAmbient_mean,pm25_nighttimeAmbient_min,pm25_nighttimeAmbient_minTime,pm25_nighttimeAmbient_max,pm25_nighttimeAmbient_maxTime,pm25_nighttimeAmbient_std
0,2022-06-15,3088,144,1.710434,0.396,14:00:00,5.406,00:30:00,1.147087,0,...,14:00:00,1.804,13:00:00,0.480074,2.947263,0.837,02:20:00,5.406,00:30:00,1.839966
1,2022-06-16,11134,144,3.743021,2.089,07:10:00,6.217,18:00:00,1.282206,0,...,12:00:00,5.194,14:40:00,0.810727,2.837,2.57,01:50:00,3.205,03:00:00,0.194722
2,2022-06-17,3088,144,1.923563,0.214,22:40:00,3.633,03:50:00,0.988797,0,...,14:50:00,2.963,12:00:00,0.645621,2.864789,2.559,00:40:00,3.126,02:50:00,0.14459
3,2022-06-18,11134,144,4.240792,1.935,01:20:00,21.938,22:20:00,2.182115,10,...,12:00:00,4.662,14:30:00,0.363897,2.930421,1.935,01:20:00,5.833,00:10:00,1.006426
4,2022-06-19,3088,144,3.00884,0.898,11:10:00,6.432,21:10:00,1.820652,0,...,12:00:00,2.789,14:50:00,0.492058,2.239474,1.969,01:10:00,2.479,02:20:00,0.160423
5,2022-06-19,11134,144,5.278875,2.534,09:50:00,10.519,23:30:00,2.504007,0,...,12:00:00,4.461,14:40:00,0.456517,4.238211,3.366,00:30:00,5.012,01:40:00,0.443378
6,2022-06-19,142720,144,5.627215,2.245,10:40:00,12.454,21:50:00,2.957626,30,...,12:30:00,5.253,14:50:00,0.66074,4.941421,3.754,00:50:00,10.299,00:00:00,1.601641
7,2022-06-20,3088,144,8.103399,5.982,04:50:00,11.995,13:50:00,1.768963,0,...,12:10:00,11.995,13:50:00,0.678177,6.365526,6.03,00:20:00,6.603,01:10:00,0.174002
8,2022-06-20,11134,144,12.810403,9.399,03:00:00,17.952,14:20:00,2.57721,710,...,12:00:00,17.952,14:20:00,0.644735,10.022421,9.399,03:00:00,10.565,01:40:00,0.32951
9,2022-06-20,142720,144,13.824361,9.37,22:00:00,22.56,05:20:00,3.012498,890,...,12:00:00,19.406,14:00:00,0.808058,11.445789,10.226,00:20:00,14.006,03:00:00,1.128675


In [119]:
daily_summary_no_spikes_df

Unnamed: 0,date,sensor_index,n_observations,pm25_fullDay_mean,pm25_fullDay_min,pm25_fullDay_minTime,pm25_fullDay_max,pm25_fullDay_maxTime,pm25_fullDay_std,pm25_fullDay_minutesAbove12ug,...,pm25_daytimeAmbient_minTime,pm25_daytimeAmbient_max,pm25_daytimeAmbient_maxTime,pm25_daytimeAmbient_std,pm25_nighttimeAmbient_mean,pm25_nighttimeAmbient_min,pm25_nighttimeAmbient_minTime,pm25_nighttimeAmbient_max,pm25_nighttimeAmbient_maxTime,pm25_nighttimeAmbient_std
0,2022-06-15,3088,144,1.710434,0.396,14:00:00,5.406,00:30:00,1.147087,0,...,14:00:00,1.804,13:00:00,0.480074,2.947263,0.837,02:20:00,5.406,00:30:00,1.839966
1,2022-06-16,11134,144,3.743021,2.089,07:10:00,6.217,18:00:00,1.282206,0,...,12:00:00,5.194,14:40:00,0.810727,2.837,2.57,01:50:00,3.205,03:00:00,0.194722
2,2022-06-17,3088,144,1.923563,0.214,22:40:00,3.633,03:50:00,0.988797,0,...,14:50:00,2.963,12:00:00,0.645621,2.864789,2.559,00:40:00,3.126,02:50:00,0.14459
3,2022-06-18,11134,144,4.240792,1.935,01:20:00,21.938,22:20:00,2.182115,10,...,12:00:00,4.662,14:30:00,0.363897,2.930421,1.935,01:20:00,5.833,00:10:00,1.006426
4,2022-06-19,3088,144,3.00884,0.898,11:10:00,6.432,21:10:00,1.820652,0,...,12:00:00,2.789,14:50:00,0.492058,2.239474,1.969,01:10:00,2.479,02:20:00,0.160423
5,2022-06-19,11134,144,5.278875,2.534,09:50:00,10.519,23:30:00,2.504007,0,...,12:00:00,4.461,14:40:00,0.456517,4.238211,3.366,00:30:00,5.012,01:40:00,0.443378
6,2022-06-19,142720,144,5.627215,2.245,10:40:00,12.454,21:50:00,2.957626,30,...,12:30:00,5.253,14:50:00,0.66074,4.941421,3.754,00:50:00,10.299,00:00:00,1.601641
7,2022-06-20,3088,144,8.103399,5.982,04:50:00,11.995,13:50:00,1.768963,0,...,12:10:00,11.995,13:50:00,0.678177,6.365526,6.03,00:20:00,6.603,01:10:00,0.174002
8,2022-06-20,11134,144,12.810403,9.399,03:00:00,17.952,14:20:00,2.57721,710,...,12:00:00,17.952,14:20:00,0.644735,10.022421,9.399,03:00:00,10.565,01:40:00,0.32951
9,2022-06-20,142720,144,13.824361,9.37,22:00:00,22.56,05:20:00,3.012498,890,...,12:00:00,19.406,14:00:00,0.808058,11.445789,10.226,00:20:00,14.006,03:00:00,1.128675


In [120]:
all_spikes_df

Unnamed: 0,timestamp,pm25,sensor_index
0,2022-06-21 23:40:00,28.224,11134


In [123]:
daily_summary_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 0 to 15
Data columns (total 34 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   date                           16 non-null     object 
 1   sensor_index                   16 non-null     int64  
 2   n_observations                 16 non-null     int64  
 3   pm25_fullDay_mean              16 non-null     float64
 4   pm25_fullDay_min               16 non-null     float64
 5   pm25_fullDay_minTime           16 non-null     object 
 6   pm25_fullDay_max               16 non-null     float64
 7   pm25_fullDay_maxTime           16 non-null     object 
 8   pm25_fullDay_std               16 non-null     float64
 9   pm25_fullDay_minutesAbove12ug  16 non-null     int64  
 10  pm25_morningRush_mean          16 non-null     float64
 11  pm25_morningRush_min           16 non-null     float64
 12  pm25_morningRush_minTime       16 non-null     objec

In [52]:
# # Save a test dataframe

# clean_w_spikes.to_csv('example_df.csv', index = False)

# Save the daily summaries

# daily_summary_df.to_csv('daily_summaries.csv')

# daily_summary_no_spikes_df.to_csv('daily_summaries_no_spikes.csv')
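
A note on the save step: `to_csv` writes the row index as an unnamed first column by default, and that column comes back as `Unnamed: 0` when the file is read again. The small round trip below illustrates this with a made-up two-row frame standing in for `daily_summary_df`; passing `index = False`, as the commented test save above already does, avoids the extra column.

```python
import io
import pandas as pd

# Made-up two-row summary standing in for daily_summary_df
summary = pd.DataFrame({
    'date': ['2022-06-15', '2022-06-16'],
    'sensor_index': [3088, 11134],
    'n_observations': [144, 144],
})

with_index = io.StringIO()
summary.to_csv(with_index)                  # default: index written as an unnamed column
with_index.seek(0)

without_index = io.StringIO()
summary.to_csv(without_index, index=False)  # index left out
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```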