# Observed Air Quality (PurpleAir)

This notebook retrieves readings from PurpleAir sensors in Minneapolis, cleans them, and saves the results as a CSV file.

Documentation is available here: https://api.purpleair.com.
You can read this article for help getting started: https://community.purpleair.com/t/making-api-calls-with-the-purpleair-api/180.

From PurpleAir: 

"The data from individual sensors will update no less than every 30 seconds. As a courtesy, we ask that you limit the number of requests to no more than once every 1 to 10 minutes, assuming you are only using the API to obtain data from sensors. If retrieving data from multiple sensors at once, please send a single request rather than individual requests in succession.

The PurpleAir historical API is released as of July 18, 2022. For more information, view this post: https://community.purpleair.com/t/new-version-of-the-purpleair-api-on-july-18th/1251.

Please let us know if you have any questions or concerns, and have a great day!"
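PurpleAir's request above (space calls out, and batch sensors into one request where possible) can be respected with a small throttle. The sketch below is one way to do that; `Throttle` and `min_interval` are hypothetical names and are not part of the PurpleAir API.

```python
import time


class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None  # monotonic time of the previous call, if any

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Demo with a short interval; against the real API a much larger value
# (for example 60 seconds or more) would be appropriate.
throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # the API request would be sent right after this line
elapsed = time.monotonic() - start
print(f'Three throttled calls took {elapsed:.2f} s')
```

The first call goes through immediately; only later calls wait, so three calls take roughly two intervals.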

A paper on correcting PM2.5 data collected with PurpleAir sensors: https://doi.org/10.5194/amt-14-4617-2021 ([download](https://www.researchgate.net/publication/352663348_Development_and_application_of_a_United_States-wide_correction_for_PM25_data_collected_with_the_PurpleAir_sensor))

Forum discussion on which PM2.5 estimate to use: https://community.purpleair.com/t/pm2-5-algorithms/3972/6

In [1]:
### Import Packages

# File manipulation

import os # For working with Operating System
import requests # Accessing the Web
import datetime as dt # Working with dates/times
import io # Input/Output Bytes objects
import time # For sleep in for loop

# Analysis

import numpy as np
import pandas as pd

## Definitions

In [2]:
spike_threshold = 28 # Micrograms per cubic meter

In [3]:
# This is my personal API key... Please use responsibly!

api = input('Please enter your Purple Air api key')

Please enter your Purple Air api key 51592903-B445-11ED-B6F4-42010A800007


### Sensor Indices and Start Times

In [4]:
# Sensor Indices (from City of Minneapolis)

sensor_info = pd.read_excel('PA IDs and indexes.xlsx') # Load as DataFrame

sensor_ids = sensor_info['Sensor Index'].dropna().astype(int) # This should be an iterable of the sensor ids as integers

In [5]:
def getSensorsData(query='', api_read_key=''):

    # url is the endpoint we send the request to, with the query string appended.
    url = 'https://api.purpleair.com/v1/sensors?' + query
    
    # print('Here is the full url for the API call:\n\n', url)

    # my_headers carries the context of the request. Here it passes
    # the API read key supplied by the caller.
    my_headers = {'X-API-Key': api_read_key}

    # Send the request and assign the response to the variable response.
    response = requests.get(url, headers=my_headers)

    # Return the response we received.
    return response
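The query string passed to `getSensorsData` can also be built with the standard library, which percent-encodes commas and other special characters automatically. This is a minimal sketch, assuming the `fields` and `show_only` parameters used in this notebook; `build_sensors_query` is a hypothetical helper and the sensor indexes are only examples.

```python
from urllib.parse import urlencode


def build_sensors_query(fields, sensor_indexes):
    """Return an encoded query string for the /v1/sensors endpoint."""
    params = {
        'fields': ','.join(fields),
        'show_only': ','.join(str(s) for s in sensor_indexes),
    }
    # urlencode percent-encodes the commas as %2C
    return urlencode(params)


query = build_sensors_query(['date_created'], [142718, 142720])
print(query)
```

The result can be handed to `getSensorsData` as its `query` argument.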

In [6]:
# Get start Times

sensor_string = '%2C'.join(sensor_ids.astype(str))

query = 'fields=date_created&show_only=' + sensor_string

response = getSensorsData(query, api)

response_dict = response.json() # Read response as a json (dictionary)

col_names = response_dict['fields']
data = np.array(response_dict['data'])

sensors_df = pd.DataFrame(data, columns = col_names)

In [7]:
sensors_df.head()

Unnamed: 0,sensor_index,date_created
0,142718,1642013869
1,142720,1642013875
2,142726,1642013897
3,142724,1642013889
4,142730,1642013916
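The `/v1/sensors` response used above carries the column names in `fields` and one row per sensor in `data`. The sketch below reproduces that step on a hand-written dictionary (two rows copied from the output above, no network call) and converts `date_created` from UNIX seconds to datetimes. `mock_response` is a stand-in, not real API output.

```python
import pandas as pd

# Stand-in for response.json(); only the two keys used in this notebook
mock_response = {
    'fields': ['sensor_index', 'date_created'],
    'data': [[142718, 1642013869], [142720, 1642013875]],
}

mock_df = pd.DataFrame(mock_response['data'], columns=mock_response['fields'])

# date_created is in seconds since the UNIX epoch
mock_df['date_created'] = pd.to_datetime(mock_df['date_created'], unit='s')
print(mock_df)
```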


### Summary Statistics Functions

In [25]:
%run Summary_Functions.py

print('Stat Names:\n\n', summary_stats_names, '\n')
print('Stat Types:\n\n',summary_stats_dtypes, '\n')

print('Function Names:\n\n', summary_stats_functions)

Stat Names:

 ['n_observations', 'pm25_fullDay_mean', 'pm25_fullDay_min', 'pm25_fullDay_minTime', 'pm25_fullDay_max', 'pm25_fullDay_maxTime', 'pm25_fullDay_std', 'pm25_fullDay_minutesAbove12ug', 'pm25_morningRush_mean', 'pm25_morningRush_min', 'pm25_morningRush_minTime', 'pm25_morningRush_max', 'pm25_morningRush_maxTime', 'pm25_morningRush_std', 'pm25_eveningRush_mean', 'pm25_eveningRush_min', 'pm25_eveningRush_minTime', 'pm25_eveningRush_max', 'pm25_eveningRush_maxTime', 'pm25_eveningRush_std', 'pm25_daytimeAmbient_mean', 'pm25_daytimeAmbient_min', 'pm25_daytimeAmbient_minTime', 'pm25_daytimeAmbient_max', 'pm25_daytimeAmbient_maxTime', 'pm25_daytimeAmbient_std', 'pm25_nighttimeAmbient_mean', 'pm25_nighttimeAmbient_min', 'pm25_nighttimeAmbient_minTime', 'pm25_nighttimeAmbient_max', 'pm25_nighttimeAmbient_maxTime', 'pm25_nighttimeAmbient_std'] 

Stat Types:

 [<class 'int'>, <class 'float'>, <class 'float'>, <class 'str'>, <class 'float'>, <class 'str'>, <class 'float'>, <class 'int'>, 
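`Summary_Functions.py` is not shown in this notebook. Judging from how its names are used in `get_summary_stats`, each entry of `summary_stats_functions` takes a cleaned DataFrame and returns a list of values that lines up with `summary_stats_names`. The sketch below shows two hypothetical functions following that contract; the function names and the toy data are assumptions, not the author's code.

```python
import pandas as pd


def n_observations(df):
    # Hypothetical: a single value, the number of rows
    return [len(df)]


def full_day_mean_max(df):
    # Hypothetical: mean, maximum, and the time of day of the maximum
    max_row = df.loc[df['pm25'].idxmax()]
    return [float(df['pm25'].mean()),
            float(max_row['pm25']),
            str(max_row['timestamp'].time())]


example_functions = [n_observations, full_day_mean_max]

toy = pd.DataFrame({
    'timestamp': pd.to_datetime(['2022-06-28 00:00',
                                 '2022-06-28 00:10',
                                 '2022-06-28 00:20']),
    'pm25': [10.0, 30.0, 20.0],
})

# Same accumulation pattern as get_summary_stats: extend one flat list
stats = []
for f in example_functions:
    stats += f(toy)
print(stats)
```

Because every function returns a list, the results concatenate into one flat row per sensor and day.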

### Functions

In [26]:
# QAQC

def qaqc(df):
    '''This function will perform some basic QAQC
    '''
    
    clean_df = df.copy()
    
    # Convert timestamp to datetime
    
    clean_df['timestamp'] = pd.to_datetime(clean_df['timestamp'], unit='s')
    
    # Remove obvious error values
    
    clean_df = clean_df[clean_df.pm25 < 1000] 
    
    # Remove NaNs
    
    clean_df = clean_df.dropna()
    
    return clean_df

# Flag and record spikes

def get_spikes(df, spike_threshold):
    '''This function flags spikes in a dataframe (pm25 above spike_threshold)
    and returns both the flagged dataframe (with an is_spike column)
    and a separate dataframe containing only the spikes
    '''
    
    df_w_spikes = df.copy()
    
    condition = (df.pm25 > spike_threshold)
    
    df_w_spikes['is_spike'] = condition
    
    spikes = df_w_spikes[condition].copy()
    
    return df_w_spikes, spikes

# Get Summary Stats

def get_summary_stats(df):
    ''' This is the main function. It will run all of our functions that get summary stats
    and return as a list.
    '''
    
    stats = []
    
    # Run the functions
    
    for f in summary_stats_functions:
        stats += f(df)
    
    return stats
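To see what the two cleaning steps above do, the sketch below re-declares compact versions of `qaqc` and `get_spikes` so that it runs on its own, then applies them to a four-row toy frame. The values are made up for illustration.

```python
import numpy as np
import pandas as pd


def qaqc(df):
    clean_df = df.copy()
    clean_df['timestamp'] = pd.to_datetime(clean_df['timestamp'], unit='s')
    clean_df = clean_df[clean_df.pm25 < 1000]  # drops obvious errors (and NaNs)
    return clean_df.dropna()


def get_spikes(df, spike_threshold):
    flagged = df.copy()
    flagged['is_spike'] = flagged.pm25 > spike_threshold
    return flagged, flagged[flagged.is_spike].copy()


toy = pd.DataFrame({
    'timestamp': [1655251200, 1655251800, 1655252400, 1655253000],
    'pm25': [8.0, 35.5, 2500.0, np.nan],  # normal, spike, error value, missing
})

clean = qaqc(toy)
flagged, spikes = get_spikes(clean, spike_threshold=28)

print(flagged)
print(spikes)
```

The error value and the missing reading are dropped by `qaqc`; the 35.5 reading survives but is flagged, so it appears in both returned frames.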

### Set Up Parameters for Query

In [10]:
### Query Strings

# Averaging period in minutes (1440 would be a one-day average)

avg_string = 'average=10'

# Environmental fields

env_fields = ['pm2.5_cf_1']

env_fields_string = 'fields=' + '%2C%20'.join(env_fields)

# My Header

my_headers = {'X-API-Key': api}

## The Loop

In [11]:
## Iterables

# Dates

# first_date = pd.to_datetime(sensors_df.date_created, unit = 's').min() # This is just untrue...

first_date = dt.datetime(2022, 6, 15) # June 15th, 2022?

datelist = pd.date_range(start = first_date, 
                         end = dt.datetime.today(),
                        normalize = True)

print('Last Run on ', dt.datetime.today())

Last Run on  2023-04-20 14:16:20.199991
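A date list like this one can be turned into day-long query windows by pairing each date with the next one. The sketch below uses a fixed three-day range so the result is reproducible. To my understanding, pandas treats timezone-naive timestamps as UTC when `.timestamp()` is called, so these windows are UTC days rather than Minneapolis local days.

```python
import pandas as pd

days = pd.date_range(start='2022-06-15', periods=3, normalize=True)

windows = []
for i in range(len(days) - 1):
    start_ts = int(days[i].timestamp())
    end_ts = int(days[i + 1].timestamp())
    windows.append(f'start_timestamp={start_ts}&end_timestamp={end_ts}')

for w in windows:
    print(w)
```

Three dates give two windows, which is why a loop over such a list stops one short of its length.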


In [18]:
## Initialize Storage

# Daily Summary

cols = ['sensor_index', 'date'] + summary_stats_names

datatypes = [int, str] + summary_stats_dtypes

dtypes = np.dtype(list(zip(cols, datatypes)))

daily_summary_df = pd.DataFrame(np.empty(0, dtype = dtypes))

# Daily Summary (No Spikes)

cols = ['sensor_index', 'date'] + summary_stats_names

datatypes = [int, str] + summary_stats_dtypes

dtypes = np.dtype(list(zip(cols, datatypes)))

daily_summary_no_spikes_df = pd.DataFrame(np.empty(0, dtype = dtypes))

# Spikes

all_spikes_df = pd.DataFrame(np.empty(0, dtype = [('sensor_index', int),
                                                  ('timestamp', pd._libs.tslibs.timestamps.Timestamp),
                                                  ('pm25', float)]
                                 )
                        )

# No Data for sensor

no_data = pd.DataFrame(np.empty(0, dtype = [('sensor_index', int),
                                            ('date', str)
                                           ]))
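The empty, typed DataFrames above are filled one row at a time inside the loop. An alternative that is often simpler is to collect rows in a plain list and build the DataFrame once at the end. The sketch below shows that pattern with made-up rows and a shortened column list; it is not the approach this notebook takes.

```python
import pandas as pd

# Shortened, illustrative column list
cols = ['sensor_index', 'date', 'n_observations', 'pm25_fullDay_mean']

rows = []  # one list per sensor and day
rows.append([142720, '2022-06-29', 144, 7.97])  # made-up values
rows.append([142724, '2022-06-29', 144, 8.40])

summary = pd.DataFrame(rows, columns=cols).astype(
    {'sensor_index': int, 'n_observations': int}
)
print(summary.dtypes)
```

Building the frame once avoids repeated row-by-row enlargement, at the cost of not being able to query the partial results as a DataFrame while the loop runs.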

In [None]:
# Iterate through the days

for i in range(len(datelist)-1): 
    
    # Set up Timestamp for query    
    
    start_timestamp = int(datelist[i].timestamp())
    end_timestamp = int(datelist[i+1].timestamp())
    
    time_string = 'start_timestamp=' + str(start_timestamp) + '&end_timestamp=' + str(end_timestamp)
    
    # Select Sensors that had been created before this date
    
    select_sensors = sensors_df[sensors_df.date_created <= start_timestamp]
    
    sensor_ids = select_sensors.sensor_index
    
    # Iterate through the Sensors
    
    for sensor_id in sensor_ids:
        
        # For skipping to last spot
        
        is_done = (daily_summary_df.sensor_index == int(sensor_id)) & (daily_summary_df.date == str(datelist[i].date()))
        is_no_data = (no_data.sensor_index == int(sensor_id)) & (no_data.date == str(datelist[i].date()))
        
        # If either mask contains a True, this sensor/day has already been handled.
        # Process only when neither does.
        if (not is_done.any()) and (not is_no_data.any()):
            
            ### Actual Loop

            time.sleep(3)

            # Base URL
            base_url = f'https://api.purpleair.com/v1/sensors/{sensor_id}/history/csv?'

            # Put it all together
            query_url = base_url + '&'.join([time_string, avg_string, env_fields_string])

            response = requests.get(query_url, headers=my_headers)

            if response.status_code == 200:

                # Read response as CSV data
                csv_data = response.content.decode('utf-8')

                if csv_data.count('\n') == 1: # There is only one line (empty data)
                    # print(f"No data for sensor {sensor_id} on {datelist[i]}")

                    # Store the date as a string so the skip check above can match it
                    no_data.loc[len(no_data.index)] = [int(sensor_id), str(datelist[i].date())]

                else:
                    # Parse CSV data into pandas DataFrame
                    df_individual_sensor = pd.read_csv(io.StringIO(csv_data),
                                                       header=0
                                                      )[['time_stamp', 'pm2.5_cf_1']]

                    df_individual_sensor.columns = ['timestamp', 'pm25']

                    # Perform QAQC

                    clean = qaqc(df_individual_sensor)

                    # Flag spikes and append them to the main storage of spikes

                    clean_w_spikes, spikes = get_spikes(clean, spike_threshold)

                    spikes['sensor_index'] = int(sensor_id)

                    all_spikes_df = pd.concat([all_spikes_df, 
                                               spikes[['sensor_index',
                                                        'timestamp',
                                                        'pm25']]
                                              ],
                                               ignore_index=True)

                    # Get Stats (With Spikes)

                    sum_stats = get_summary_stats(clean_w_spikes)

                    # Add to the daily summary dataframe

                    row = [int(sensor_id), str(datelist[i].date())] + sum_stats

                    daily_summary_df.loc[len(daily_summary_df.index)] = row

                    # Get Stats (Without Spikes)

                    no_spikes = clean_w_spikes[clean_w_spikes.is_spike == False]

                    sum_stats = get_summary_stats(no_spikes)

                    # Add to the daily summary dataframe

                    row = [int(sensor_id), str(datelist[i].date())] + sum_stats

                    daily_summary_no_spikes_df.loc[len(daily_summary_no_spikes_df.index)] = row

            else:
                print(f"Error fetching data for sensor {sensor_id}: {response.status_code} on {datelist[i]}")
            
            
    # if i == 7:
    #     break
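The parsing step inside the loop can be tried without calling the API. The sketch below wraps it in a hypothetical helper, `parse_history_csv`, and feeds it two hand-written CSV strings. The column names `time_stamp` and `pm2.5_cf_1` come from this notebook; the exact layout of the mock CSV (including the `sensor_index` column) is an assumption.

```python
import io

import pandas as pd


def parse_history_csv(csv_text):
    """Return a (timestamp, pm25) DataFrame, or None when there are no data rows."""
    # A header-only response has a single newline; <= 1 also covers an empty body
    if csv_text.count('\n') <= 1:
        return None
    df = pd.read_csv(io.StringIO(csv_text), header=0)[['time_stamp', 'pm2.5_cf_1']]
    df.columns = ['timestamp', 'pm25']
    return df


# Mock responses (assumed layout, made-up readings)
empty_csv = 'time_stamp,sensor_index,pm2.5_cf_1\n'
full_csv = empty_csv + '1655251200,142720,7.5\n1655251800,142720,8.1\n'

print(parse_history_csv(empty_csv))
parsed = parse_history_csv(full_csv)
print(parsed)
```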

In [79]:
daily_summary_df.tail(10)

# Drop Duplicates
# daily_summary_df = daily_summary_df.drop_duplicates(ignore_index = True).copy()

Unnamed: 0,sensor_index,date,n_observations,pm25_fullDay_mean,pm25_fullDay_min,pm25_fullDay_minTime,pm25_fullDay_max,pm25_fullDay_maxTime,pm25_fullDay_std,pm25_fullDay_minutesAbove12ug,...,pm25_daytimeAmbient_minTime,pm25_daytimeAmbient_max,pm25_daytimeAmbient_maxTime,pm25_daytimeAmbient_std,pm25_nighttimeAmbient_mean,pm25_nighttimeAmbient_min,pm25_nighttimeAmbient_minTime,pm25_nighttimeAmbient_max,pm25_nighttimeAmbient_maxTime,pm25_nighttimeAmbient_std
151,143240,2022-06-28,144,13.101639,8.208,17:40:00,20.343,02:30:00,3.432173,900,...,14:00:00,14.438,12:30:00,0.785621,17.689316,15.817,00:30:00,20.343,02:30:00,1.359649
152,143648,2022-06-28,144,19.365271,7.63,23:10:00,148.698,03:20:00,21.115132,970,...,14:40:00,18.282,12:00:00,1.60395,31.590211,15.269,00:10:00,98.158,03:00:00,23.122607
153,145242,2022-06-28,144,17.677049,10.092,23:20:00,83.577,01:20:00,6.936505,1380,...,13:10:00,20.568,12:00:00,1.337133,25.262316,19.802,00:00:00,83.577,01:20:00,14.206316
154,145250,2022-06-28,144,15.610861,9.145,23:20:00,47.556,15:40:00,4.593105,1040,...,14:20:00,18.783,12:20:00,1.314626,19.345842,18.003,00:50:00,22.242,03:00:00,1.115699
155,145262,2022-06-28,144,8.25266,2.144,23:40:00,21.641,02:30:00,5.03206,280,...,14:30:00,16.818,12:40:00,3.102336,13.230789,9.174,00:30:00,21.641,02:30:00,3.827334
156,145470,2022-06-28,144,14.105069,7.931,23:40:00,24.9,04:20:00,4.126792,910,...,14:40:00,17.722,12:40:00,1.760253,18.705895,16.224,00:40:00,20.884,03:00:00,1.54106
157,145498,2022-06-28,144,14.140799,8.409,17:40:00,21.805,04:10:00,3.40089,970,...,13:30:00,18.452,12:30:00,1.796285,18.315895,16.218,00:00:00,20.347,02:30:00,1.31487
158,142720,2022-06-29,144,7.96766,3.968,05:00:00,14.156,00:30:00,2.242552,30,...,13:10:00,10.887,14:30:00,1.852923,10.964368,9.414,00:50:00,14.156,00:30:00,1.286276
159,142724,2022-06-29,144,8.403882,4.671,05:00:00,46.236,17:40:00,3.856035,30,...,12:50:00,10.913,14:00:00,1.706316,10.058421,9.17,00:00:00,10.776,00:40:00,0.388972
160,142734,2022-06-29,144,8.062417,3.7,07:00:00,136.922,22:50:00,11.019097,40,...,13:10:00,9.083,14:20:00,1.229121,9.170789,8.273,01:10:00,9.813,02:10:00,0.477519


In [77]:
daily_summary_no_spikes_df.tail(10)

# To Remove Duplicates
# daily_summary_no_spikes_df = daily_summary_no_spikes_df.drop_duplicates(ignore_index = True).copy()

Unnamed: 0,sensor_index,date,n_observations,pm25_fullDay_mean,pm25_fullDay_min,pm25_fullDay_minTime,pm25_fullDay_max,pm25_fullDay_maxTime,pm25_fullDay_std,pm25_fullDay_minutesAbove12ug,...,pm25_daytimeAmbient_minTime,pm25_daytimeAmbient_max,pm25_daytimeAmbient_maxTime,pm25_daytimeAmbient_std,pm25_nighttimeAmbient_mean,pm25_nighttimeAmbient_min,pm25_nighttimeAmbient_minTime,pm25_nighttimeAmbient_max,pm25_nighttimeAmbient_maxTime,pm25_nighttimeAmbient_std
151,143240,2022-06-28,144,13.101639,8.208,17:40:00,20.343,02:30:00,3.432173,900,...,14:00:00,14.438,12:30:00,0.785621,17.689316,15.817,00:30:00,20.343,02:30:00,1.359649
152,143648,2022-06-28,131,13.888351,7.63,23:10:00,26.362,04:10:00,3.691609,840,...,14:40:00,18.282,12:00:00,1.60395,16.81625,15.269,00:10:00,20.676,01:50:00,1.387947
153,145242,2022-06-28,142,17.134535,10.092,23:20:00,27.545,03:40:00,4.101402,1360,...,13:10:00,20.568,12:00:00,1.337133,22.022611,19.802,00:00:00,25.828,03:00:00,1.594559
154,145250,2022-06-28,143,15.387469,9.145,23:20:00,23.593,04:20:00,3.742774,1030,...,14:20:00,18.783,12:20:00,1.314626,19.345842,18.003,00:50:00,22.242,03:00:00,1.115699
155,145262,2022-06-28,144,8.25266,2.144,23:40:00,21.641,02:30:00,5.03206,280,...,14:30:00,16.818,12:40:00,3.102336,13.230789,9.174,00:30:00,21.641,02:30:00,3.827334
156,145470,2022-06-28,144,14.105069,7.931,23:40:00,24.9,04:20:00,4.126792,910,...,14:40:00,17.722,12:40:00,1.760253,18.705895,16.224,00:40:00,20.884,03:00:00,1.54106
157,145498,2022-06-28,144,14.140799,8.409,17:40:00,21.805,04:10:00,3.40089,970,...,13:30:00,18.452,12:30:00,1.796285,18.315895,16.218,00:00:00,20.347,02:30:00,1.31487
158,142720,2022-06-29,144,7.96766,3.968,05:00:00,14.156,00:30:00,2.242552,30,...,13:10:00,10.887,14:30:00,1.852923,10.964368,9.414,00:50:00,14.156,00:30:00,1.286276
159,142724,2022-06-29,143,8.139322,4.671,05:00:00,19.585,17:50:00,2.196332,20,...,12:50:00,10.913,14:00:00,1.706316,10.058421,9.17,00:00:00,10.776,00:40:00,0.388972
160,142734,2022-06-29,143,7.161301,3.7,07:00:00,18.178,23:00:00,2.126664,30,...,13:10:00,9.083,14:20:00,1.229121,9.170789,8.273,01:10:00,9.813,02:10:00,0.477519


In [75]:
all_spikes_df

Unnamed: 0,sensor_index,timestamp,pm25
0,143226,2022-06-17 22:10:00,58.910
1,143226,2022-06-17 22:20:00,35.495
2,143226,2022-06-17 22:00:00,319.660
3,142720,2022-06-18 03:20:00,33.595
4,142720,2022-06-18 03:30:00,36.026
...,...,...,...
95,145242,2022-06-28 01:20:00,83.577
96,145242,2022-06-28 03:30:00,28.814
97,145250,2022-06-28 15:40:00,47.556
98,142724,2022-06-29 17:40:00,46.236


In [71]:
no_data

Unnamed: 0,sensor_index,date
0,142718,2022-06-15
1,142720,2022-06-15
2,142726,2022-06-15
3,142730,2022-06-15
4,142728,2022-06-15
...,...,...
363,143944,2022-06-23
364,145202,2022-06-23
365,145204,2022-06-23
366,145234,2022-06-23


In [None]:
daily_summary_df.info()

In [52]:
# # Save a test dataframe

# clean_w_spikes.to_csv('example_df.csv', index = False)

# Save the daily summaries

# daily_summary_df.to_csv('daily_summaries.csv')

# daily_summary_no_spikes_df.to_csv('daily_summaries_no_spikes.csv')
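When the summaries are saved, passing `index=False` keeps the row index out of the file; otherwise it is written as an extra unnamed column that shows up as `Unnamed: 0` when the file is read back. The sketch below does a round trip in a temporary directory, using a one-row frame with values taken from the output above.

```python
import os
import tempfile

import pandas as pd

summary = pd.DataFrame({
    'sensor_index': [142720],
    'date': ['2022-06-29'],
    'pm25_fullDay_mean': [7.96766],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'daily_summaries.csv')
    summary.to_csv(path, index=False)  # no index column in the file
    reloaded = pd.read_csv(path)

print(reloaded)
```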