# Observed Air Quality (PurpleAir)

This notebook retrieves readings from PurpleAir sensors in Minneapolis, cleans them, computes daily summary statistics, and saves the results as CSV files.

Documentation is available here: https://api.purpleair.com.
You can read this article for help getting started: https://community.purpleair.com/t/making-api-calls-with-the-purpleair-api/180.

From PurpleAir: 

"The data from individual sensors will update no less than every 30 seconds. As a courtesy, we ask that you limit the number of requests to no more than once every 1 to 10 minutes, assuming you are only using the API to obtain data from sensors. If retrieving data from multiple sensors at once, please send a single request rather than individual requests in succession.

The PurpleAir historical API is released as of July 18, 2022. For more information, view this post: https://community.purpleair.com/t/new-version-of-the-purpleair-api-on-july-18th/1251.

Please let us know if you have any questions or concerns, and have a great day!"
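
One way to honor the request-rate courtesy quoted above is a small throttle that enforces a minimum gap between calls. This is a minimal sketch, not part of the original notebook: the class name `MinIntervalThrottle` and the 0.05-second demo interval are illustrative assumptions (the loop further down simply uses `time.sleep(3)`).

```python
import time


class MinIntervalThrottle:
    """Hypothetical helper: block until `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None  # monotonic time of the previous call, None before the first

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Demonstration with a very short interval so it finishes quickly
throttle = MinIntervalThrottle(min_interval=0.05)
started = time.monotonic()
for _ in range(3):
    throttle.wait()  # an API request would follow here
elapsed = time.monotonic() - started
```

The first call returns immediately; only the later calls wait, so three calls take roughly two intervals.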

A paper on this process: https://doi.org/10.5194/amt-14-4617-2021 ([download](https://www.researchgate.net/publication/352663348_Development_and_application_of_a_United_States-wide_correction_for_PM25_data_collected_with_the_PurpleAir_sensor))
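
For reference, the US-wide correction published in that paper is a linear function of the sensor's `cf_1` PM2.5 reading and relative humidity. The sketch below is only an illustration: this notebook does not apply the correction (it queries `pm2.5_atm`), and the function name `us_wide_correction` is an assumption introduced here.

```python
def us_wide_correction(pm25_cf1, relative_humidity):
    """Corrected PM2.5 (ug/m^3) from the cf_1 reading and relative humidity (%)."""
    return 0.524 * pm25_cf1 - 0.0862 * relative_humidity + 5.75


# Example inputs are made up for illustration
corrected = us_wide_correction(pm25_cf1=10.0, relative_humidity=50.0)
```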

Chat on which PM Estimate to use: https://community.purpleair.com/t/pm2-5-algorithms/3972/6

In [1]:
### Import Packages

# File manipulation

import os # For working with Operating System
import requests # Accessing the Web
import datetime as dt # Working with dates/times
import io # Input/Output Bytes objects
import time # For sleep in for loop

# Analysis

import numpy as np
import pandas as pd
import geopandas as gpd

# Current working directory

cwd = os.getcwd()

## Definitions

In [2]:
spike_threshold = 28 # Micrograms per cubic meter

In [3]:
# This is my personal API key... Please use responsibly! 51592903-B445-11ED-B6F4-42010A800007

api = input('Please enter your Purple Air api key')

Please enter your Purple Air api key 51592903-B445-11ED-B6F4-42010A800007


### Load Sensor Information

In [4]:
# Sensor Indices (from City of Minneapolis)

datapath = os.path.join(cwd, '..', '..', '..', 'Data')

sensor_info = gpd.read_file(os.path.join(datapath, 'PurpleAir_Stations.geojson'))

### Summary Statistics Functions

In [5]:
%run Summary_Functions.py

print('Stat Names:\n\n', summary_stats_names, '\n')
print('Stat Types:\n\n',summary_stats_dtypes, '\n')

print('Function Names:\n\n', summary_stats_functions)

Stat Names:

 ['n_observations', 'humidity_fullDay_mean', 'temperature_fullDay_mean', 'pressure_fullDay_mean', 'pm25_fullDay_mean', 'pm25_fullDay_min', 'pm25_fullDay_minTime', 'pm25_fullDay_max', 'pm25_fullDay_maxTime', 'pm25_fullDay_std', 'pm25_fullDay_minutesAbove12ug', 'pm25_morningRush_mean', 'pm25_morningRush_min', 'pm25_morningRush_minTime', 'pm25_morningRush_max', 'pm25_morningRush_maxTime', 'pm25_morningRush_std', 'pm25_eveningRush_mean', 'pm25_eveningRush_min', 'pm25_eveningRush_minTime', 'pm25_eveningRush_max', 'pm25_eveningRush_maxTime', 'pm25_eveningRush_std', 'pm25_daytimeAmbient_mean', 'pm25_daytimeAmbient_min', 'pm25_daytimeAmbient_minTime', 'pm25_daytimeAmbient_max', 'pm25_daytimeAmbient_maxTime', 'pm25_daytimeAmbient_std', 'pm25_nighttimeAmbient_mean', 'pm25_nighttimeAmbient_min', 'pm25_nighttimeAmbient_minTime', 'pm25_nighttimeAmbient_max', 'pm25_nighttimeAmbient_maxTime', 'pm25_nighttimeAmbient_std'] 

Stat Types:

 [<class 'int'>, <class 'float'>, <class 'float'>, <

### Functions

In [6]:
# QAQC

def qaqc(df):
    '''This function will perform some basic QAQC
    '''
    
    clean_df = df.copy()
    
    # Convert timestamp to datetime
    
    clean_df['timestamp'] = pd.to_datetime(clean_df['timestamp'], unit='s')
    
    # Remove obvious error values
    
    clean_df = clean_df[clean_df.pm25 < 1000] 
    
    # Remove NaNs
    
    clean_df = clean_df.dropna()
    
    return clean_df

# Flag and record spikes

def get_spikes(df, spike_threshold):
    '''This function flags readings above spike_threshold
    in a copy of the dataframe (boolean column is_spike)
    and returns both the flagged dataframe
    and a separate dataframe holding only the spikes
    '''
    
    df_w_spikes = df.copy()
    
    condition = (df.pm25 > spike_threshold)
    
    df_w_spikes['is_spike'] = condition
    
    spikes = df_w_spikes[condition].copy()
    
    return df_w_spikes, spikes

# Get Summary Stats

def get_summary_stats(df):
    ''' This is the main function. It runs every function in summary_stats_functions
    and returns their results as one flat list.
    '''
    
    stats = []
    
    # Run the functions
    
    for f in summary_stats_functions:
        stats += f(df)
    
    return stats
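
To see what the QAQC and spike steps above do, here is a minimal sketch on a made-up four-row dataframe. The values are invented for illustration, and the steps are inlined rather than calling `qaqc` and `get_spikes` so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Toy readings (made up): one normal value, one spike, one obvious error, one row with a NaN
toy = pd.DataFrame({
    'timestamp': [1655251200, 1655251800, 1655252400, 1655253000],
    'humidity': [50.0, 51.0, 52.0, np.nan],
    'temperature': [70.0, 70.5, 71.0, 71.5],
    'pressure': [983.0, 983.1, 983.2, 983.3],
    'pm25': [10.0, 35.0, 2500.0, 12.0],
})

# Same steps as qaqc: parse timestamps, drop obvious errors, drop NaNs
clean = toy.copy()
clean['timestamp'] = pd.to_datetime(clean['timestamp'], unit='s')
clean = clean[clean.pm25 < 1000]
clean = clean.dropna()

# Same steps as get_spikes: flag readings above the threshold
spike_threshold = 28  # micrograms per cubic meter
flagged = clean.copy()
flagged['is_spike'] = flagged.pm25 > spike_threshold
spikes = flagged[flagged.is_spike]
no_spikes = flagged[~flagged.is_spike]
```

Note that the spike stays in `flagged`; it is only excluded once the `is_spike` column is used as a filter.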

### Set Up Parameters for Query

In [7]:
### Query Strings

# Averaging period in minutes (1440 would be a one-day average)

avg_string = 'average=10'

# Environmental fields

env_fields = ['humidity', 'temperature', 'pressure', 'pm2.5_atm']

env_fields_string = 'fields=' + '%2C%20'.join(env_fields)

# My Header

my_headers = {'X-API-Key': api}
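
As an alternative to hand-encoding the query pieces, the standard library can assemble the same kind of URL. This is a sketch under assumptions: the helper name `build_history_url` is hypothetical, the timestamps are example values, and it joins the fields with a plain comma (the notebook joins them with an encoded comma plus space).

```python
from urllib.parse import urlencode


def build_history_url(sensor_id, start_timestamp, end_timestamp, fields, average=10):
    """Hypothetical helper: build a sensor-history CSV URL with encoded query parameters."""
    base_url = f'https://api.purpleair.com/v1/sensors/{sensor_id}/history/csv'
    params = {
        'start_timestamp': start_timestamp,
        'end_timestamp': end_timestamp,
        'average': average,             # averaging period in minutes
        'fields': ','.join(fields),     # urlencode escapes the commas as %2C
    }
    return base_url + '?' + urlencode(params)


example_url = build_history_url(
    143226, 1655251200, 1655337600,
    ['humidity', 'temperature', 'pressure', 'pm2.5_atm'],
)
```

The API key is not part of the URL; it still travels in the `X-API-Key` header.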

## The Loop

In [8]:
## Iterables

# Dates

# first_date = pd.to_datetime(sensors_df.date_created, unit = 's').min()
# ^ This is just untrue...


first_date = dt.datetime(2022, 6, 15) # June 15th, 2022?

datelist = pd.date_range(start = first_date,
                         end = dt.datetime.today(),
                         normalize = True)

print('Last Run on ', dt.datetime.today())

Last Run on  2023-08-07 12:43:38.681410
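
The date list is later consumed pairwise: each day and the following day become the start and end of one query window. A minimal sketch of that pairing is below. It uses an explicit UTC range over a few fixed days so the result is deterministic, which is an assumption; the notebook's `datelist` is timezone-naive.

```python
import pandas as pd

# Four midnights give three one-day windows
days = pd.date_range(start='2022-06-15', end='2022-06-18', freq='D', tz='UTC')

# Pair each day with the next one and convert both to Unix seconds
windows = [
    (int(start.timestamp()), int(end.timestamp()))
    for start, end in zip(days[:-1], days[1:])
]
```

Each window ends exactly where the next begins, so no reading falls between two queries.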


In [9]:
## Initialize Storage

# If this notebook has already been run, load the existing dataframes from their .csv files

has_run = True

if has_run:
    
    daily_summary_df = pd.read_csv(os.path.join(datapath, 'daily_summaries.csv'))
    
    daily_summary_no_spikes_df = pd.read_csv(os.path.join(datapath, 'daily_summaries_no_spikes.csv'))
    
    all_spikes_df = pd.read_csv(os.path.join(datapath, 'all_spikes.csv'))
    all_spikes_df['timestamp'] = pd.to_datetime(all_spikes_df.timestamp)
    
    no_data = pd.read_csv(os.path.join(datapath, 'no_data_sensors.csv'))
                                   
    
else: # Initialize the daily summary dataframes
    
    # Daily Summary

    cols = ['sensor_index', 'date'] + summary_stats_names

    datatypes = [int, str] + summary_stats_dtypes

    dtypes = np.dtype(list(zip(cols, datatypes)))

    daily_summary_df = pd.DataFrame(np.empty(0, dtype = dtypes))

    # Daily Summary (No Spikes)

    cols = ['sensor_index', 'date'] + summary_stats_names

    datatypes = [int, str] + summary_stats_dtypes

    dtypes = np.dtype(list(zip(cols, datatypes)))

    daily_summary_no_spikes_df = pd.DataFrame(np.empty(0, dtype = dtypes))

    # Spikes

    all_spikes_df = pd.DataFrame(np.empty(0, dtype = [('sensor_index', int),
                                                      ('timestamp', 'datetime64[ns]'),
                                                      ('pm25', float)]
                                     )
                            )

    # No Data for sensor

    no_data = pd.DataFrame(np.empty(0, dtype = [('sensor_index', int),
                                                ('date', str)
                                               ]))
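
These storage frames also act as a checkpoint: the loop below skips any sensor/date pair already present in `daily_summary_df` or `no_data`. One possible way to make that lookup cheap is to build a set of pairs once, sketched here on made-up rows. The names `processed` and `needs_processing` are assumptions, and the set would need updating as new rows are added.

```python
import pandas as pd

# Made-up stand-ins for the stored frames
daily_summary = pd.DataFrame({'sensor_index': [143226, 142724],
                              'date': ['2022-06-17', '2022-06-17']})
no_data = pd.DataFrame({'sensor_index': [142718],
                        'date': ['2022-06-15']})

# Collect every (sensor, date) pair that has already been handled
processed = set(zip(daily_summary.sensor_index.astype(int).tolist(),
                    daily_summary.date.astype(str).tolist()))
processed |= set(zip(no_data.sensor_index.astype(int).tolist(),
                     no_data.date.astype(str).tolist()))


def needs_processing(sensor_id, date_string):
    """True when this sensor/date pair appears in neither frame."""
    return (int(sensor_id), date_string) not in processed
```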

In [10]:
# Iterate through the days

for i in range(len(datelist)-1):
    
    # Set up Timestamp for query    
    
    start_timestamp = int(datelist[i].timestamp())
    end_timestamp = int(datelist[i+1].timestamp())
    
    time_string = 'start_timestamp=' + str(start_timestamp) + '&end_timestamp=' + str(end_timestamp)
    
    # Select Sensors that had been created before this date
    
    select_sensors = sensor_info[sensor_info.date_created.astype(int)//10**9 <= start_timestamp]
    
    sensor_ids = select_sensors.sensor_index
    
    # Iterate through the Sensors
    
    for sensor_id in sensor_ids:
        
        # For skipping to last spot
        
        is_done = (daily_summary_df.sensor_index == int(sensor_id)) & (daily_summary_df.date == str(datelist[i].date()))
        is_no_data = (no_data.sensor_index == int(sensor_id)) & (no_data.date == str(datelist[i].date()))
        
        # If either mask contains a True, this sensor/date pair has already been handled.
        # Only process when neither does.
        if not is_done.any() and not is_no_data.any():
            
            ### Actual Loop

            time.sleep(3)

            # Base URL
            base_url = f'https://api.purpleair.com/v1/sensors/{sensor_id}/history/csv?'

            # Put it all together
            query_url = base_url + '&'.join([time_string, avg_string, env_fields_string])

            response = requests.get(query_url, headers=my_headers)

            if response.status_code == 200:

                # Read response as CSV data
                csv_data = response.content.decode('utf-8')

                if csv_data.count('\n') == 1: # There is only one line (empty data)
                    # print(f"No data for sensor {sensor_id} on {datelist[i]}")

                    no_data.loc[len(no_data.index)] = [sensor_id, str(datelist[i].date())]

                else:
                    # Parse CSV data into pandas DataFrame
                    df_individual_sensor = pd.read_csv(io.StringIO(csv_data),
                                                       header=0
                                                      )[['time_stamp', 'humidity', 'temperature', 
                                                         'pressure', 'pm2.5_atm']]

                    df_individual_sensor.columns = ['timestamp','humidity', 'temperature', 
                                                    'pressure', 'pm25']

                    # Perform QAQC

                    clean = qaqc(df_individual_sensor)

                    # Flag spikes & concatenate them to the main storage of spikes

                    clean_w_spikes, spikes = get_spikes(clean, spike_threshold)

                    spikes['sensor_index'] = int(sensor_id)

                    all_spikes_df = pd.concat([all_spikes_df, 
                                               spikes[['sensor_index',
                                                        'timestamp',
                                                        'pm25']]
                                              ],
                                               ignore_index=True)

                    # Get Stats (With Spikes)

                    sum_stats = get_summary_stats(clean_w_spikes)

                    # Add to the daily summary dataframe

                    row = [int(sensor_id), str(datelist[i].date())] + sum_stats

                    daily_summary_df.loc[len(daily_summary_df.index)] = row

                    # Get Stats (Without Spikes)

                    no_spikes = clean_w_spikes[~clean_w_spikes.is_spike]

                    sum_stats = get_summary_stats(no_spikes)

                    # Add to the daily summary dataframe

                    row = [int(sensor_id), str(datelist[i].date())] + sum_stats

                    daily_summary_no_spikes_df.loc[len(daily_summary_no_spikes_df.index)] = row

            else:
                print(f"Error fetching data for sensor {sensor_id}: {response.status_code} on {datelist[i].date()}")
    
    
    # # Save it!?! After a day is processed, yes!

    # daily_summary_df.to_csv('daily_summaries.csv', index = False)

    # daily_summary_no_spikes_df.to_csv('daily_summaries_no_spikes.csv', index = False)
    
    # For testing a week
    
    # if i == 7:
    #     break
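
The response handling inside the loop can be tried without calling the API. The sketch below parses CSV text, treats a header-only response as "no data", and renames the columns the same way the loop does. The sample text is made up (including the extra `sensor_index` column), the helper name `parse_history_csv` is hypothetical, and it checks for an empty frame rather than counting newlines as the loop does.

```python
import io

import pandas as pd


def parse_history_csv(csv_text):
    """Return a DataFrame, or None when the text holds only a header row."""
    df = pd.read_csv(io.StringIO(csv_text))
    if df.empty:
        return None
    df = df[['time_stamp', 'humidity', 'temperature', 'pressure', 'pm2.5_atm']]
    return df.rename(columns={'time_stamp': 'timestamp', 'pm2.5_atm': 'pm25'})


# Made-up response bodies: header only, and header plus one reading
header = 'time_stamp,sensor_index,humidity,temperature,pressure,pm2.5_atm\n'
empty_result = parse_history_csv(header)
parsed = parse_history_csv(header + '1655251200,143226,50.0,70.0,983.0,10.5\n')
```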

In [11]:
daily_summary_df.tail(10)

# Drop Duplicates
# daily_summary_df = daily_summary_df.drop_duplicates(ignore_index = True).copy()

Unnamed: 0,sensor_index,date,n_observations,humidity_fullDay_mean,temperature_fullDay_mean,pressure_fullDay_mean,pm25_fullDay_mean,pm25_fullDay_min,pm25_fullDay_minTime,pm25_fullDay_max,...,pm25_daytimeAmbient_minTime,pm25_daytimeAmbient_max,pm25_daytimeAmbient_maxTime,pm25_daytimeAmbient_std,pm25_nighttimeAmbient_mean,pm25_nighttimeAmbient_min,pm25_nighttimeAmbient_minTime,pm25_nighttimeAmbient_max,pm25_nighttimeAmbient_maxTime,pm25_nighttimeAmbient_std
16359,157747,2023-08-06,144,41.986458,84.698729,982.861979,30.819424,24.494,19:00:00,38.412,...,12:10:00,31.873,14:50:00,0.710809,35.212974,31.547,00:00:00,38.412,01:50:00,2.021926
16360,157787,2023-08-06,144,53.938889,79.940278,984.448458,20.33441,17.472,19:00:00,24.7,...,12:30:00,21.0,13:00:00,0.501463,21.864842,20.715,00:30:00,22.902,00:50:00,0.657448
16361,157785,2023-08-06,137,54.286372,78.487591,983.521818,34.666693,27.8,20:40:00,39.904,...,12:10:00,35.977,14:40:00,0.605062,37.742071,36.652,00:20:00,39.005,02:10:00,0.721055
16362,157837,2023-08-06,144,50.704167,80.552778,983.315833,33.182722,26.403,19:40:00,38.771,...,12:00:00,34.969,14:50:00,0.82728,35.794895,34.703,00:20:00,37.007,02:50:00,0.705599
16363,157845,2023-08-06,0,,,,,,,,...,,,,,,,,,,
16364,157861,2023-08-06,144,53.958681,78.447222,983.631917,36.077833,30.285,19:20:00,40.034,...,12:00:00,37.095,14:40:00,0.485476,38.994211,37.707,00:00:00,40.034,02:20:00,0.714287
16365,157871,2023-08-06,31,61.991935,76.890323,983.885129,42.943758,14.365,21:10:00,852.3195,...,,,,,,,,,,
16366,157877,2023-08-06,144,58.984722,77.613889,984.421674,33.324104,27.222,20:40:00,37.412,...,12:00:00,34.284,14:40:00,0.508677,35.937789,34.86,00:30:00,36.905,02:30:00,0.664136
16367,157935,2023-08-06,144,52.0,72.0,636.27,32.029153,26.299,21:30:00,37.144,...,12:40:00,33.021,14:50:00,0.368422,35.011579,33.436,00:00:00,37.144,02:40:00,0.894431
16368,168327,2023-08-06,144,55.348264,78.455556,983.09475,33.332142,27.091,20:40:00,39.187,...,12:40:00,34.476,14:40:00,0.565609,36.949211,35.476,00:00:00,38.54,01:50:00,0.784944


In [12]:
daily_summary_no_spikes_df.tail(10)

# To Remove Duplicates
# daily_summary_no_spikes_df = daily_summary_no_spikes_df.drop_duplicates(ignore_index = True).copy()

Unnamed: 0,sensor_index,date,n_observations,humidity_fullDay_mean,temperature_fullDay_mean,pressure_fullDay_mean,pm25_fullDay_mean,pm25_fullDay_min,pm25_fullDay_minTime,pm25_fullDay_max,...,pm25_daytimeAmbient_minTime,pm25_daytimeAmbient_max,pm25_daytimeAmbient_maxTime,pm25_daytimeAmbient_std,pm25_nighttimeAmbient_mean,pm25_nighttimeAmbient_min,pm25_nighttimeAmbient_minTime,pm25_nighttimeAmbient_max,pm25_nighttimeAmbient_maxTime,pm25_nighttimeAmbient_std
16359,157747,2023-08-06,24,45.566667,83.116667,982.179083,26.811208,24.494,19:00:00,27.992,...,,,,,,,,,,
16360,157787,2023-08-06,144,53.938889,79.940278,984.448458,20.33441,17.472,19:00:00,24.7,...,12:30:00,21.0,13:00:00,0.501463,21.864842,20.715,00:30:00,22.902,00:50:00,0.657448
16361,157785,2023-08-06,4,59.4,77.2,983.1845,27.898,27.8,20:40:00,27.984,...,,,,,,,,,,
16362,157837,2023-08-06,8,58.1,77.55,982.8955,27.394875,26.403,19:40:00,27.944,...,,,,,,,,,,
16363,157845,2023-08-06,0,,,,,,,,...,,,,,,,,,,
16364,157861,2023-08-06,0,,,,,,,,...,,,,,,,,,,
16365,157871,2023-08-06,30,61.966667,76.886667,983.881733,15.964567,14.365,21:10:00,18.975,...,,,,,,,,,,
16366,157877,2023-08-06,2,69.3,74.1,983.418,27.2775,27.222,20:40:00,27.333,...,,,,,,,,,,
16367,157935,2023-08-06,12,52.0,72.0,636.27,27.388917,26.299,21:30:00,27.974,...,,,,,,,,,,
16368,168327,2023-08-06,11,60.527273,77.745455,982.608909,27.462273,27.091,20:40:00,27.89,...,,,,,,,,,,


In [13]:
all_spikes_df

# To Remove Duplicates
# all_spikes_df = all_spikes_df.drop_duplicates(ignore_index = True).copy()

Unnamed: 0,sensor_index,timestamp,pm25
0,143226,2022-06-17 22:00:00,213.712
1,143226,2022-06-17 22:10:00,42.698
2,143226,2022-06-17 22:20:00,28.325
3,143226,2022-06-18 00:20:00,64.563
4,142724,2022-06-18 02:10:00,54.165
...,...,...,...
311209,168327,2023-08-06 16:00:00,36.121
311210,168327,2023-08-06 18:30:00,35.596
311211,168327,2023-08-06 04:20:00,35.266
311212,168327,2023-08-06 19:50:00,28.523


In [14]:
# no_data['date'] = pd.to_datetime(no_data.date).dt.date.astype(str) # If need to correct the dates
no_data
# To Remove Duplicates
# no_data = no_data.drop_duplicates(ignore_index = True).copy()

Unnamed: 0,sensor_index,date
0,142718,2022-06-15
1,143636,2022-06-15
2,143648,2022-06-15
3,143656,2022-06-15
4,143660,2022-06-15
...,...,...
9688,145504,2023-08-06
9689,145610,2023-08-06
9690,157757,2023-08-06
9691,166459,2023-08-06


In [15]:
daily_summary_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16369 entries, 0 to 16368
Data columns (total 37 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   sensor_index                   16369 non-null  int64  
 1   date                           16369 non-null  object 
 2   n_observations                 16369 non-null  int64  
 3   humidity_fullDay_mean          15228 non-null  float64
 4   temperature_fullDay_mean       15228 non-null  float64
 5   pressure_fullDay_mean          15228 non-null  float64
 6   pm25_fullDay_mean              15228 non-null  float64
 7   pm25_fullDay_min               15228 non-null  float64
 8   pm25_fullDay_minTime           15228 non-null  object 
 9   pm25_fullDay_max               15228 non-null  float64
 10  pm25_fullDay_maxTime           15228 non-null  object 
 11  pm25_fullDay_std               15208 non-null  float64
 12  pm25_fullDay_minutesAbove12ug  16369 non-null  int6

In [20]:
# Sort them by date

daily_summary_df = daily_summary_df.sort_values(by='date')

daily_summary_no_spikes_df = daily_summary_no_spikes_df.sort_values(by='date')

all_spikes_df = all_spikes_df.sort_values(by='timestamp')

no_data = no_data.sort_values(by='date')

In [21]:
# # Save a test dataframe

# clean_w_spikes.to_csv('example_df.csv', index = False)

# Save it!?!

savepath = os.path.join('..', '..','..', 'Data')

daily_summary_df.to_csv(os.path.join(savepath, 'daily_summaries.csv'), index = False)

daily_summary_no_spikes_df.to_csv(os.path.join(savepath, 'daily_summaries_no_spikes.csv'), index = False)

all_spikes_df.to_csv(os.path.join(savepath, 'all_spikes.csv'), index = False)

no_data.to_csv(os.path.join(savepath, 'no_data_sensors.csv'), index = False)