# Observed Air Quality (PurpleAir) - REAL TIME

This notebook retrieves readings from PurpleAir sensors in Minneapolis, cleans the entries, and saves the results as a CSV file.

Documentation is available here: https://api.purpleair.com.
You can read this article for help getting started: https://community.purpleair.com/t/making-api-calls-with-the-purpleair-api/180.

From PurpleAir: 

"The data from individual sensors will update no less than every 30 seconds. As a courtesy, we ask that you limit the number of requests to no more than once every 1 to 10 minutes, assuming you are only using the API to obtain data from sensors. If retrieving data from multiple sensors at once, please send a single request rather than individual requests in succession.

The PurpleAir historical API is released as of July 18, 2022. For more information, view this post: https://community.purpleair.com/t/new-version-of-the-purpleair-api-on-july-18th/1251.

Please let us know if you have any questions or concerns, and have a great day!"

A paper on this process: https://doi.org/10.5194/amt-14-4617-2021 (Link for [Download](https://www.researchgate.net/publication/352663348_Development_and_application_of_a_United_States-wide_correction_for_PM25_data_collected_with_the_PurpleAir_sensor) )

Chat on which PM Estimate to use: https://community.purpleair.com/t/pm2-5-algorithms/3972/6
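
The paper linked above proposes a US-wide correction for PurpleAir PM2.5 based on the `cf_1` reading and relative humidity. As a rough sketch only, assuming the linear form PM2.5 = 0.524 * PA_cf_1 - 0.0862 * RH + 5.75 reported there, and assuming a humidity reading is available (this notebook requests only `pm2.5_cf_1`), the correction might look like the following. The function name `correct_pm25` is hypothetical and not part of this notebook.

```python
def correct_pm25(pm25_cf_1, relative_humidity):
    """Apply a linear humidity correction to a cf_1 PM2.5 reading.

    The coefficients are assumed from the linked paper; check them against
    the publication before using this for analysis.
    """
    return 0.524 * pm25_cf_1 - 0.0862 * relative_humidity + 5.75


# Example: a raw reading of 20 ug/m3 at 50 % relative humidity
print(round(correct_pm25(20.0, 50.0), 2))  # 11.92
```

None of the cells below apply this correction; they work with the raw `pm2.5_cf_1` values.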

In [19]:
### Import Packages

# File manipulation

import os # For working with Operating System
import requests # Accessing the Web
import datetime as dt # Working with dates/times
import io # Input/Output Bytes objects
import time # For sleep in for loop

# Analysis

import numpy as np
import pandas as pd

## Definitions

In [20]:
spike_threshold = 28 # Micrograms per cubic meter

In [21]:
# This is my personal API key... Please use responsibly!

api = input('Please enter your Purple Air api key')

Please enter your Purple Air api key 51592903-B445-11ED-B6F4-42010A800007


### Sensor Indices

In [22]:
# Sensor Indices (from City of Minneapolis)

indices_path = os.path.join('..', 'Historic_PurpleAir', 'PA IDs and indexes.xlsx')

sensor_info = pd.read_excel(indices_path) # Load as DataFrame

sensor_ids = sensor_info['Sensor Index'].dropna().astype(int) # This should be an iterable of the sensor ids as integers

In [23]:
def getSensorsData(query='', api_read_key=''):

    # url is the full address the request is sent to.
    url = 'https://api.purpleair.com/v1/sensors?' + query
    
    # print('Here is the full url for the API call:\n\n', url)

    # my_headers carries the request headers. In this case it passes the
    # API read key supplied through the api_read_key argument.
    my_headers = {'X-API-Key':api_read_key}

    # This line creates and sends the request and then assigns its response to the
    # variable, response.
    response = requests.get(url, headers=my_headers)

    # We then return the response we received.
    return response
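
The query string passed to `getSensorsData` is assembled by hand further down, with `%2C` standing in for commas. As a minimal sketch, the same string can be produced with the standard library's `urlencode`, which handles the escaping. The helper name `build_sensors_query` is hypothetical.

```python
from urllib.parse import urlencode


def build_sensors_query(fields, sensor_ids):
    """Return an encoded query string for the /v1/sensors endpoint."""
    params = {
        'fields': ','.join(fields),
        'show_only': ','.join(str(int(s)) for s in sensor_ids),
    }
    # urlencode percent-escapes the commas, giving the same '%2C' separators
    return urlencode(params)


query = build_sensors_query(['pm2.5_cf_1'], [142718, 142720])
print(query)  # fields=pm2.5_cf_1&show_only=142718%2C142720
```

The result can be passed as the `query` argument of `getSensorsData`.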

## Actual RealTime

In [26]:
# Actual RealTime

sensor_string = 'show_only=' + '%2C'.join(sensor_ids.astype(str))

query = 'fields=pm2.5_cf_1&' + sensor_string

response = getSensorsData(query, api)

response_dict = response.json() # Read response as a json (dictionary)

col_names = response_dict['fields']
data = np.array(response_dict['data'])

sensors_df = pd.DataFrame(data, columns = col_names)

In [27]:
sensors_df.head()

Unnamed: 0,sensor_index,pm2.5_cf_1
0,142718.0,1.6
1,142720.0,1.6
2,142726.0,2.0
3,142724.0,1.2
4,142730.0,1.4
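
In the output above, `sensor_index` appears as a float (`142718.0`). That is consistent with passing the rows through `np.array`, which stores mixed integers and floats under a single float dtype. The sketch below uses a hand-written payload as a stand-in for `response_dict` (an assumption for illustration, not real API output) to compare both ways of building the frame.

```python
import numpy as np
import pandas as pd

# Hypothetical payload shaped like the 'fields' and 'data' entries used above
payload = {
    'fields': ['sensor_index', 'pm2.5_cf_1'],
    'data': [[142718, 1.6], [142720, 1.6], [142726, 2.0]],
}

# Going through a NumPy array forces one dtype for the whole table
via_numpy = pd.DataFrame(np.array(payload['data']), columns=payload['fields'])

# Passing the list of rows directly lets pandas infer a dtype per column
direct = pd.DataFrame(payload['data'], columns=payload['fields'])

print(via_numpy['sensor_index'].dtype)  # float64
print(direct['sensor_index'].dtype)     # int64
```

Either way, the later `astype(int)` call on `sensor_index` restores integer ids.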


## For Summary of Previous Day

### Summary Statistics Functions

In [28]:
%run ../Historic_PurpleAir/Summary_Functions.py

print('Stat Names:\n\n', summary_stats_names, '\n')
print('Stat Types:\n\n',summary_stats_dtypes, '\n')

print('Function Names:\n\n', summary_stats_functions)

Stat Names:

 ['n_observations', 'pm25_fullDay_mean', 'pm25_fullDay_min', 'pm25_fullDay_minTime', 'pm25_fullDay_max', 'pm25_fullDay_maxTime', 'pm25_fullDay_std', 'pm25_fullDay_minutesAbove12ug', 'pm25_morningRush_mean', 'pm25_morningRush_min', 'pm25_morningRush_minTime', 'pm25_morningRush_max', 'pm25_morningRush_maxTime', 'pm25_morningRush_std', 'pm25_eveningRush_mean', 'pm25_eveningRush_min', 'pm25_eveningRush_minTime', 'pm25_eveningRush_max', 'pm25_eveningRush_maxTime', 'pm25_eveningRush_std', 'pm25_daytimeAmbient_mean', 'pm25_daytimeAmbient_min', 'pm25_daytimeAmbient_minTime', 'pm25_daytimeAmbient_max', 'pm25_daytimeAmbient_maxTime', 'pm25_daytimeAmbient_std', 'pm25_nighttimeAmbient_mean', 'pm25_nighttimeAmbient_min', 'pm25_nighttimeAmbient_minTime', 'pm25_nighttimeAmbient_max', 'pm25_nighttimeAmbient_maxTime', 'pm25_nighttimeAmbient_std'] 

Stat Types:

 [<class 'int'>, <class 'float'>, <class 'float'>, <class 'str'>, <class 'float'>, <class 'str'>, <class 'float'>, <class 'int'>, 

### Functions

In [29]:
# QAQC

def qaqc(df):
    '''This function will perform some basic QAQC
    '''
    
    clean_df = df.copy()
    
    # Convert timestamp to datetime
    
    clean_df['timestamp'] = pd.to_datetime(clean_df['timestamp'], unit='s')
    
    # Remove obvious error values
    
    clean_df = clean_df[clean_df.pm25 < 1000] 
    
    # Remove NaNs
    
    clean_df = clean_df.dropna()
    
    return clean_df

# Flag and record Spikes

def get_spikes(df, spike_threshold):
    '''This function flags spikes in a copy of the dataframe
    (adding an is_spike column) and returns both the flagged dataframe
    and a separate dataframe containing only the spikes
    '''
    
    df_w_spikes = df.copy()
    
    condition = (df.pm25 > spike_threshold)
    
    df_w_spikes['is_spike'] = condition
    
    spikes = df_w_spikes[condition].copy()
    
    return df_w_spikes, spikes

# Get Summary Stats

def get_summary_stats(df):
    ''' This is the main function. It will run all of our functions that get summary stats
    and return as a list.
    '''
    
    stats = []
    
    # Run the functions
    
    for f in summary_stats_functions:
        stats += f(df)
    
    return stats
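
`Summary_Functions.py` is not shown in this notebook, so the sketch below uses two hypothetical stand-in functions (`n_observations` and `full_day_mean`) to illustrate how the pieces fit together: each summary function returns a list, `get_summary_stats` concatenates those lists, and the statistics are computed once with spikes and once without. The readings are synthetic.

```python
import pandas as pd

spike_threshold = 28  # micrograms per cubic meter, as defined above


# Hypothetical stand-ins for the functions loaded from Summary_Functions.py;
# each takes a DataFrame and returns a list of values.
def n_observations(df):
    return [len(df)]


def full_day_mean(df):
    return [float(df.pm25.mean())]


summary_stats_functions = [n_observations, full_day_mean]


def get_summary_stats(df):
    stats = []
    for f in summary_stats_functions:
        stats += f(df)  # list concatenation keeps the stats in function order
    return stats


# Synthetic 10-minute readings (not real sensor data)
readings = pd.DataFrame({
    'timestamp': pd.to_datetime([1681966800, 1681967400, 1681968000], unit='s'),
    'pm25': [4.0, 40.0, 6.0],
})

# Flag spikes, then keep only the rows that are not flagged
readings['is_spike'] = readings.pm25 > spike_threshold
no_spikes = readings[~readings.is_spike]

print(get_summary_stats(readings))
print(get_summary_stats(no_spikes))  # [2, 5.0]
```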

### Set Up Parameters for Query

In [30]:
### Query Strings

# Averaging period in minutes (1440 would be a 1-day average; 10 is used here)

avg_string = 'average=10'

# Environmental fields

env_fields = ['pm2.5_cf_1']

env_fields_string = 'fields=' + '%2C%20'.join(env_fields)

# My Header

my_headers = {'X-API-Key': api}

## The Loop

In [60]:
# Dates

today = dt.datetime.combine(dt.date.today(), dt.datetime.min.time()) # set to today @ midnight 
yesterday = today - dt.timedelta(days=1)

# Set up Timestamp for query    

start_timestamp = int(yesterday.timestamp())
end_timestamp = int(today.timestamp())

time_string = 'start_timestamp=' + str(start_timestamp) + '&end_timestamp=' + str(end_timestamp)

print('Last Run on ', today)

Last Run on  2023-04-21 00:00:00
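
Calling `.timestamp()` on a naive `datetime` interprets it in the local time zone of the machine running the notebook, so the queried window depends on where the code runs. As a hedged sketch, the window can be computed with an explicit time zone instead. The helper `day_window` is hypothetical, and the fixed UTC-5 offset is used only for illustration.

```python
import datetime as dt


def day_window(run_date, tz):
    """Return (start, end) Unix timestamps for the day before run_date in tz."""
    end = dt.datetime.combine(run_date, dt.time.min, tzinfo=tz)  # midnight on run_date
    start = end - dt.timedelta(days=1)                           # midnight the day before
    return int(start.timestamp()), int(end.timestamp())


# Fixed UTC-5 offset for illustration only; zoneinfo.ZoneInfo('America/Chicago')
# would follow daylight saving time instead.
utc_minus_5 = dt.timezone(dt.timedelta(hours=-5))

start_ts, end_ts = day_window(dt.date(2023, 4, 21), utc_minus_5)
print(start_ts, end_ts)  # 1681966800 1682053200
```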


In [61]:
## Iterables

sensor_ids = sensors_df.sensor_index.astype(int)

In [62]:
## Initialize Storage

# Daily Summary

cols = ['sensor_index', 'date'] + summary_stats_names

datatypes = [int, str] + summary_stats_dtypes

dtypes = np.dtype(list(zip(cols, datatypes)))

daily_summary_df = pd.DataFrame(np.empty(0, dtype = dtypes))

# Daily Summary (No Spikes)

cols = ['sensor_index', 'date'] + summary_stats_names

datatypes = [int, str] + summary_stats_dtypes

dtypes = np.dtype(list(zip(cols, datatypes)))

daily_summary_no_spikes_df = pd.DataFrame(np.empty(0, dtype = dtypes))

# Spikes

all_spikes_df = pd.DataFrame(np.empty(0, dtype = [('sensor_index', int),
                                                  ('timestamp', pd.Timestamp),
                                                  ('pm25', float)]
                                 )
                        )

# No Data for sensor

no_data = pd.DataFrame(np.empty(0, dtype = [('sensor_index', int),
                                            ('date', str)
                                           ]))
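
The storage frames above are created empty from NumPy structured dtypes. An alternative sketch that yields a similar zero-row, typed frame using only pandas is shown below. The helper name `empty_frame` is hypothetical.

```python
import pandas as pd


def empty_frame(columns, dtypes):
    """Build a zero-row DataFrame whose columns carry the given dtypes."""
    return pd.DataFrame({name: pd.Series(dtype=dtype)
                         for name, dtype in zip(columns, dtypes)})


no_data_example = empty_frame(['sensor_index', 'date'], ['int64', 'object'])
print(no_data_example.dtypes)
```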

In [63]:
   
# Iterate through the Sensors

for sensor_id in sensor_ids:

    # For skipping to last spot

    is_done = (daily_summary_df.sensor_index == int(sensor_id))
    is_no_data = (no_data.sensor_index == int(sensor_id))

    # If either of these contains a True, the sensor has already been processed.
    # Only when True is in neither should we process it.
    if (True not in is_done.values) and (True not in is_no_data.values):

        ### Actual Loop

        time.sleep(3)

        # Base URL
        base_url = f'https://api.purpleair.com/v1/sensors/{sensor_id}/history/csv?'

        # Put it all together
        query_url = base_url + '&'.join([time_string, avg_string, env_fields_string])

        response = requests.get(query_url, headers=my_headers)

        if response.status_code == 200:

            # Read response as CSV data
            csv_data = response.content.decode('utf-8')

            if csv_data.count('\n') == 1: # There is only one line (empty data)
                # print(f"No data for sensor {sensor_id} on {yesterday.date()}")

                no_data.loc[len(no_data.index)] = [sensor_id, yesterday.date()]

            else:
                # Parse CSV data into pandas DataFrame
                df_individual_sensor = pd.read_csv(io.StringIO(csv_data),
                                                   header=0
                                                  )[['time_stamp', 'pm2.5_cf_1']]

                df_individual_sensor.columns = ['timestamp', 'pm25']

                # Perform QAQC

                clean = qaqc(df_individual_sensor)

                # Flag Spikes & Concatenate to main storage of spikes

                clean_w_spikes, spikes = get_spikes(clean, spike_threshold)

                spikes['sensor_index'] = int(sensor_id)

                all_spikes_df = pd.concat([all_spikes_df, 
                                           spikes[['sensor_index',
                                                    'timestamp',
                                                    'pm25']]
                                          ],
                                           ignore_index=True)

                # Get Stats (With Spikes)

                sum_stats = get_summary_stats(clean_w_spikes)

                # Add to the daily summary dataframe

                row = [int(sensor_id), str(yesterday.date())] + sum_stats

                daily_summary_df.loc[len(daily_summary_df.index)] = row

                # Get Stats (Without Spikes)

                no_spikes = clean_w_spikes[~clean_w_spikes.is_spike]

                sum_stats = get_summary_stats(no_spikes)

                # Add to the daily summary dataframe

                row = [int(sensor_id), str(yesterday.date())] + sum_stats

                daily_summary_no_spikes_df.loc[len(daily_summary_no_spikes_df.index)] = row

        else:
            print(f"Error fetching data for sensor {sensor_id}: {response.status_code} on {yesterday.date()}")
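
The loop above treats a response with exactly one newline as empty. A possible alternative, sketched here, is to parse the text first and check `DataFrame.empty`, which does not depend on how the response ends. The helper `parse_history_csv` is hypothetical, and the sample CSV layout is an assumption based on the column names selected in the loop.

```python
import io

import pandas as pd


def parse_history_csv(csv_text):
    """Return a (timestamp, pm25) DataFrame, or None when there are no data rows."""
    df = pd.read_csv(io.StringIO(csv_text), header=0)
    if df.empty:
        return None
    return df[['time_stamp', 'pm2.5_cf_1']].rename(
        columns={'time_stamp': 'timestamp', 'pm2.5_cf_1': 'pm25'})


# Synthetic CSV text (not real sensor data)
header_only = 'time_stamp,sensor_index,pm2.5_cf_1\n'
with_rows = header_only + '1681966800,142718,1.6\n1681967400,142718,2.1\n'

print(parse_history_csv(header_only))            # None
print(list(parse_history_csv(with_rows).columns))  # ['timestamp', 'pm25']
```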

In [64]:
daily_summary_df.tail(10)

# Drop Duplicates
# daily_summary_df = daily_summary_df.drop_duplicates(ignore_index = True).copy()

Unnamed: 0,sensor_index,date,n_observations,pm25_fullDay_mean,pm25_fullDay_min,pm25_fullDay_minTime,pm25_fullDay_max,pm25_fullDay_maxTime,pm25_fullDay_std,pm25_fullDay_minutesAbove12ug,...,pm25_daytimeAmbient_minTime,pm25_daytimeAmbient_max,pm25_daytimeAmbient_maxTime,pm25_daytimeAmbient_std,pm25_nighttimeAmbient_mean,pm25_nighttimeAmbient_min,pm25_nighttimeAmbient_minTime,pm25_nighttimeAmbient_max,pm25_nighttimeAmbient_maxTime,pm25_nighttimeAmbient_std
45,157787,2023-04-21,144,5.214417,0.0,01:10:00,39.754,00:40:00,4.571028,40,...,12:20:00,10.574,14:30:00,1.223515,2.324421,0.0,01:10:00,39.754,00:40:00,9.075059
46,157785,2023-04-21,44,9.545864,1.594,02:40:00,18.651,15:10:00,3.827041,110,...,13:30:00,15.513,14:50:00,2.489245,3.930333,1.594,02:40:00,6.388,03:00:00,2.399302
47,157837,2023-04-21,144,6.321972,0.0,00:10:00,16.165,10:00:00,4.480355,130,...,12:00:00,12.303,12:30:00,1.641386,0.537895,0.0,00:10:00,4.748,03:00:00,1.153437
48,157845,2023-04-21,144,8.718562,0.013,23:50:00,32.439,09:10:00,7.222246,390,...,12:30:00,15.278,14:50:00,1.954299,0.965053,0.025,00:20:00,7.29,03:00:00,1.938701
49,157861,2023-04-21,144,8.009451,0.014,23:10:00,40.768,13:50:00,7.78686,210,...,12:40:00,40.768,13:50:00,11.75206,0.925789,0.019,01:10:00,6.357,03:00:00,1.58936
50,157871,2023-04-21,144,6.357406,0.0,00:00:00,17.444,15:30:00,4.589354,150,...,12:00:00,15.355,14:40:00,2.197929,0.672921,0.0,00:00:00,6.307,03:00:00,1.595011
51,157877,2023-04-21,144,8.145618,0.0,00:00:00,55.777,10:20:00,8.442067,200,...,12:30:00,13.461,14:10:00,1.24712,1.018421,0.0,00:00:00,8.205,03:00:00,2.327841
52,157935,2023-04-21,144,5.550535,0.0,00:00:00,15.395,15:30:00,4.162187,120,...,12:30:00,13.484,14:50:00,1.851681,0.885737,0.0,00:00:00,7.747,03:00:00,2.074742
53,166459,2023-04-21,144,7.838781,0.684,00:30:00,18.798,11:00:00,4.526562,280,...,12:30:00,13.211,14:20:00,0.661761,2.590526,0.684,00:30:00,10.896,03:00:00,2.992554
54,168327,2023-04-21,144,6.365479,0.026,00:20:00,17.701,15:20:00,4.055151,120,...,12:00:00,13.996,14:50:00,2.279379,1.442263,0.026,00:20:00,6.926,03:00:00,1.689103


In [65]:
daily_summary_no_spikes_df.tail(10)

# To Remove Duplicates
# daily_summary_no_spikes_df = daily_summary_no_spikes_df.drop_duplicates(ignore_index = True).copy()

Unnamed: 0,sensor_index,date,n_observations,pm25_fullDay_mean,pm25_fullDay_min,pm25_fullDay_minTime,pm25_fullDay_max,pm25_fullDay_maxTime,pm25_fullDay_std,pm25_fullDay_minutesAbove12ug,...,pm25_daytimeAmbient_minTime,pm25_daytimeAmbient_max,pm25_daytimeAmbient_maxTime,pm25_daytimeAmbient_std,pm25_nighttimeAmbient_mean,pm25_nighttimeAmbient_min,pm25_nighttimeAmbient_minTime,pm25_nighttimeAmbient_max,pm25_nighttimeAmbient_maxTime,pm25_nighttimeAmbient_std
45,157787,2023-04-21,143,4.972881,0.0,01:10:00,14.894,15:30:00,3.547027,30,...,12:20:00,10.574,14:30:00,1.223515,0.245,0.0,01:10:00,1.794,03:00:00,0.461158
46,157785,2023-04-21,44,9.545864,1.594,02:40:00,18.651,15:10:00,3.827041,110,...,13:30:00,15.513,14:50:00,2.489245,3.930333,1.594,02:40:00,6.388,03:00:00,2.399302
47,157837,2023-04-21,144,6.321972,0.0,00:10:00,16.165,10:00:00,4.480355,130,...,12:00:00,12.303,12:30:00,1.641386,0.537895,0.0,00:10:00,4.748,03:00:00,1.153437
48,157845,2023-04-21,143,8.552685,0.013,23:50:00,27.036,06:10:00,6.966927,380,...,12:30:00,15.278,14:50:00,1.954299,0.965053,0.025,00:20:00,7.29,03:00:00,1.938701
49,157861,2023-04-21,137,6.733102,0.014,23:10:00,25.061,14:50:00,5.384299,140,...,12:40:00,25.061,14:50:00,6.827937,0.925789,0.019,01:10:00,6.357,03:00:00,1.58936
50,157871,2023-04-21,144,6.357406,0.0,00:00:00,17.444,15:30:00,4.589354,150,...,12:00:00,15.355,14:40:00,2.197929,0.672921,0.0,00:00:00,6.307,03:00:00,1.595011
51,157877,2023-04-21,137,6.601372,0.0,00:00:00,22.951,11:20:00,4.711728,130,...,12:30:00,13.461,14:10:00,1.24712,1.018421,0.0,00:00:00,8.205,03:00:00,2.327841
52,157935,2023-04-21,144,5.550535,0.0,00:00:00,15.395,15:30:00,4.162187,120,...,12:30:00,13.484,14:50:00,1.851681,0.885737,0.0,00:00:00,7.747,03:00:00,2.074742
53,166459,2023-04-21,144,7.838781,0.684,00:30:00,18.798,11:00:00,4.526562,280,...,12:30:00,13.211,14:20:00,0.661761,2.590526,0.684,00:30:00,10.896,03:00:00,2.992554
54,168327,2023-04-21,144,6.365479,0.026,00:20:00,17.701,15:20:00,4.055151,120,...,12:00:00,13.996,14:50:00,2.279379,1.442263,0.026,00:20:00,6.926,03:00:00,1.689103


In [66]:
all_spikes_df

# To Remove Duplicates
# all_spikes_df = all_spikes_df.drop_duplicates(ignore_index = True).copy()

Unnamed: 0,sensor_index,timestamp,pm25
0,142718,2023-04-20 19:20:00,28.566
1,142718,2023-04-20 17:30:00,34.71
2,142718,2023-04-20 21:00:00,54.477
3,143246,2023-04-20 14:00:00,80.602
4,143246,2023-04-20 14:10:00,114.955
5,143246,2023-04-20 13:50:00,36.07
6,143636,2023-04-20 12:50:00,73.4165
7,156605,2023-04-20 12:30:00,33.118
8,156605,2023-04-20 14:50:00,32.964
9,156605,2023-04-20 12:10:00,39.664


In [67]:
no_data

# To Remove Duplicates
# no_data = no_data.drop_duplicates(ignore_index = True).copy()

Unnamed: 0,sensor_index,date
0,142772,2023-04-21
1,142852,2023-04-21
2,143248,2023-04-21
3,145234,2023-04-21
4,145262,2023-04-21
5,145502,2023-04-21
6,145504,2023-04-21
7,145610,2023-04-21


In [68]:
daily_summary_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 55 entries, 0 to 54
Data columns (total 34 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   sensor_index                   55 non-null     int64  
 1   date                           55 non-null     object 
 2   n_observations                 55 non-null     int64  
 3   pm25_fullDay_mean              55 non-null     float64
 4   pm25_fullDay_min               55 non-null     float64
 5   pm25_fullDay_minTime           55 non-null     object 
 6   pm25_fullDay_max               55 non-null     float64
 7   pm25_fullDay_maxTime           55 non-null     object 
 8   pm25_fullDay_std               55 non-null     float64
 9   pm25_fullDay_minutesAbove12ug  55 non-null     int64  
 10  pm25_morningRush_mean          55 non-null     float64
 11  pm25_morningRush_min           55 non-null     float64
 12  pm25_morningRush_minTime       55 non-null     objec

In [85]:
# # Save it!?!

# daily_summary_df.to_csv('daily_summaries_today.csv', index = False)

# daily_summary_no_spikes_df.to_csv('daily_summaries_no_spikes_today.csv', index = False)

# all_spikes_df.to_csv('all_spikes_today.csv', index = False)

# no_data.to_csv('no_data_sensors_today.csv', index = False)
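
The commented-out calls above write to fixed file names, so each run would overwrite the previous day's files. One possible variation, sketched here with a hypothetical helper `save_outputs`, puts the date in each file name. The demonstration writes a tiny synthetic frame to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def save_outputs(frames, out_dir, date_label):
    """Write each DataFrame in frames to '<name>_<date_label>.csv' inside out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for name, frame in frames.items():
        path = os.path.join(out_dir, f'{name}_{date_label}.csv')
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths


# Demonstration with a synthetic frame and a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo = {'no_data_sensors': pd.DataFrame({'sensor_index': [142772],
                                             'date': ['2023-04-21']})}
    written = save_outputs(demo, tmp_dir, '2023-04-21')
    print(os.path.basename(written[0]))  # no_data_sensors_2023-04-21.csv
```

In this notebook the dictionary would hold `daily_summary_df`, `daily_summary_no_spikes_df`, `all_spikes_df`, and `no_data`, with `str(yesterday.date())` as the label.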