# Processing Pipeline for Archived Sensor.Community Data (AD4GD, Pilot 3)
This notebook outlines a full processing pipeline for archived air quality data from the Sensor.Community project. Unlike the near real-time (NRT) version, this workflow is tailored to monthly batches of data.  
**Note: The processing steps are computationally intensive and may require significant memory and CPU resources.**

### Step 1: Setup and Imports
Essential packages and project-specific modules are imported to handle data standardization, correction, and gridding. Make sure the *data_processing* module is available on the Python path before running the cells below.

In [1]:
import subprocess
from datetime import datetime
from pathlib import Path
from data_processing.standardize_sensor_community import StandardizeData
from data_processing.iot_qa_hour import Corrector
from data_processing.kriging_only import KrigingIoT

### Step 2: Download Monthly Archived Data
This step runs a script that downloads archived monthly data for a single sensor type (sds011) over a date range and saves the files to a local input directory.
> python3 data_processing/download_data_archive.py --sensor sds011 -s 20241201 -e 20241231 -i testfolder/ad4gd_pilot3_test

In [None]:
# download monthly archive data from sensor.community
# python3 data_processing/download_data_archive.py --sensor sds011 -s 20241201 -e 20241231 -i testfolder/ad4gd_pilot3_test
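
The download can also be launched from inside the notebook with `subprocess`, which Step 1 already imports. The following is a minimal sketch, not part of the original pipeline: the helper name `build_download_cmd` is hypothetical, the flags are copied from the command shown above, and `sys.executable` is used in place of `python3` so that the script runs with the notebook's own interpreter.

```python
import subprocess
import sys


def build_download_cmd(sensor, start, end, input_dir,
                       script="data_processing/download_data_archive.py"):
    """Return the argument list for the archive download script.

    Hypothetical helper: it only assembles the flags shown in Step 2
    (--sensor, -s, -e, -i) and does not start the download itself.
    """
    return [sys.executable, script,
            "--sensor", sensor,
            "-s", start,
            "-e", end,
            "-i", str(input_dir)]


cmd = build_download_cmd("sds011", "20241201", "20241231",
                         "testfolder/ad4gd_pilot3_test")
print(" ".join(cmd))

# Uncomment to start the download (needs network access and disk space).
# check=True raises CalledProcessError if the script exits with an error.
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the input directory contains spaces.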

### Step 3: Standardize Data to Hourly Averages
This code block transforms raw, unevenly spaced data into hourly averages using a custom standardization class. This ensures time-regular data, suitable for quality control and gridding.

In [None]:
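sensor = 'sds011'
date = datetime(2024, 12, 1)
# Assumption: testfolder is the directory passed to -i in Step 2
testfolder = Path('testfolder/ad4gd_pilot3_test')
inputfn = Path(testfolder, 'L1A', 'sds011', '2024-12_sds011.parquet')
outputfold = Path(testfolder, 'L2')
Standard = StandardizeData(date, sensor, inputfn, outputfold)
Standard.run()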
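
To illustrate what reducing unevenly spaced readings to hourly averages means, here is a minimal standard-library sketch. It is not the implementation of `StandardizeData`, whose internals are not shown in this notebook. The function name `hourly_means` and the sample values are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def hourly_means(samples):
    """Average (timestamp, value) pairs into hourly bins.

    Returns a dict mapping the start of each hour to the mean of all
    values recorded within that hour. Hours without samples are absent.
    """
    bins = defaultdict(list)
    for ts, value in samples:
        # Truncate the timestamp to the start of its hour
        hour = ts.replace(minute=0, second=0, microsecond=0)
        bins[hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(bins.items())}


# Made-up readings at irregular times within two consecutive hours
raw = [
    (datetime(2024, 12, 1, 0, 2, 31), 10.0),
    (datetime(2024, 12, 1, 0, 27, 5), 14.0),
    (datetime(2024, 12, 1, 0, 58, 40), 12.0),
    (datetime(2024, 12, 1, 1, 15, 0), 20.0),
]
hourly = hourly_means(raw)
for hour, mean in hourly.items():
    print(hour.isoformat(), mean)
```

A production version would also have to decide how to treat hours with very few samples, which this sketch ignores.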

### Step 4: Correct IoT Data Using Meteorological and Air Quality Model Data
This step sets up the input file paths and runs the correction of the hourly particulate matter (PM) data. External datasets from ERA5 (meteorology) and CAMS (air quality) are used to:
- Improve sensor data accuracy via ML-based correction,
- Flag and filter potential outliers.

In [None]:
# Correct hourly SDS011 data using an ML model with ERA5 data as input
# and apply outlier detection using CAMS data
month = datetime(2024, 12, 1)
ad4gd_fold = Path(testfolder)
scomfn = Path(ad4gd_fold, 'L2', f'SDS011_PM2.5_hourly_{month:%Y%m%d}.nc')
pollutant = 'pm25'
meteofn = Path(ad4gd_fold, f'ERA5_sfc_CAMS_domain_{month:%Y%m}.nc')
camsfn = Path(ad4gd_fold, f'CAMS_analysis_ensemble_{month:%Y%m}.nc')
outputfolder = Path(ad4gd_fold, 'L2B')
Corr = Corrector(scomfn,
                month,
                pollutant,
                meteofn,
                camsfn,
                outputfolder)
Corr.run()
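
Because the correction is expensive, it can help to confirm that all three input files exist before calling `Corr.run()`. The sketch below is an optional pre-flight check and not part of the `Corrector` class. The helper `missing_inputs` is hypothetical, and the base directory is assumed to be the one used for the download in Step 2. The file name patterns are those of the cell above.

```python
from datetime import datetime
from pathlib import Path


def missing_inputs(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


month = datetime(2024, 12, 1)
# Assumption: same base directory as the -i argument in Step 2
ad4gd_fold = Path("testfolder/ad4gd_pilot3_test")
required = [
    ad4gd_fold / "L2" / f"SDS011_PM2.5_hourly_{month:%Y%m%d}.nc",
    ad4gd_fold / f"ERA5_sfc_CAMS_domain_{month:%Y%m}.nc",
    ad4gd_fold / f"CAMS_analysis_ensemble_{month:%Y%m}.nc",
]

missing = missing_inputs(required)
if missing:
    print("Missing input files:")
    for p in missing:
        print("  ", p)
else:
    print("All input files found")
```

Note that the sensor file name carries a full date (`%Y%m%d`), while the ERA5 and CAMS files are named by month only (`%Y%m`).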

### Step 5: Spatial Regridding Using Ordinary Kriging
The final step interpolates the corrected IoT measurements onto a regular grid using kriging, enabling spatial mapping and integration with models or satellite data.

In [None]:
day = datetime(2024, 12, 1)
iotmeasfn = Path(testfolder, 'L2B', 'pm25', f'iot_hour_corr_pm25_{day:%Y%m}.nc')
outfn = Path(testfolder, 'L3', f'iotonly_hour_pm25_gridded_{day:%Y%m%d}.nc')
Kriging = KrigingIoT(timeliness='hourly',
                     date=day,
                     iotmeasfn=iotmeasfn,
                     outfn=outfn)
Kriging.run()

2025-07-07 13:51:58.096567 Reading IoT data
2025-07-07 13:52:23.309341 Processing 2024-12-01T00:00:00.000000000
2025-07-07 13:52:23.317219 Identifying clusters
2025-07-07 13:52:27.415335 Checking distance to nearest IoT station
2025-07-07 13:52:27.415466 Create and query KDTree
2025-07-07 13:52:39.716482 Calculate distances
2025-07-07 13:52:39.873968 Performing global kriging
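
For readers unfamiliar with ordinary kriging, the following pure-Python sketch shows the core computation for a single target location. It is not the `KrigingIoT` implementation: the exponential variogram, its sill and range, the station coordinates, and all function names are assumptions chosen for illustration, and the clustering and distance checks seen in the log above are omitted.

```python
import math


def variogram(h, sill=1.0, rng=10.0):
    """Exponential semivariogram without nugget (assumed model)."""
    return sill * (1.0 - math.exp(-h / rng))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pivoting is required: the kriging matrix has zeros on its diagonal
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ordinary_kriging(points, values, target):
    """Return (estimate, weights) at target from scattered observations."""
    n = len(points)
    # Semivariances between stations, bordered by the unbiasedness
    # constraint that forces the weights to sum to one
    a = [[variogram(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Semivariances between each station and the target location
    b = [variogram(math.dist(p, target)) for p in points] + [1.0]
    sol = solve(a, b)
    weights = sol[:n]  # the last entry of sol is the Lagrange multiplier
    estimate = sum(w * v for w, v in zip(weights, values))
    return estimate, weights


# Four made-up stations on the corners of a square, target at the centre
stations = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
pm25 = [10.0, 20.0, 30.0, 40.0]
estimate, weights = ordinary_kriging(stations, pm25, (5.0, 5.0))
print(estimate, weights)
```

Gridding repeats this calculation for every grid cell. With thousands of stations the linear system becomes large, which is one reason this step is computationally demanding.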
