This repository contains the code, data-processing pipeline, and analysis notebooks for a fairness audit of machine-learning heat–mortality models in Maricopa County, Arizona (2010–2019). Three model classes—Random Forest (RF), Gradient Boosting Machine (GBM), and Generalized Additive Model (GAM)—are trained and audited on a 51-PCA × 10-year panel under two feature configurations (environment-only baseline and a demographically augmented full model).
The accompanying paper, FinalReport.tex, reports on two research questions:
- RQ1: Does adding demographic and socioeconomic features improve aggregate predictive performance over environment-only baselines?
- RQ2: Across model classes, do the extended models distribute predictive errors equitably across demographic and neighborhood subgroups?
Dominic Woetzel, Jen Navarro, Jesse Gallegos — University of Southern California
DSCIProject/
├── README.md This file
│
├── scripts/ Data-processing pipeline (Python)
│ ├── build_tract_pca_crosswalk.py
│ ├── aggregate_pca_demographics_2010_2019.py
│ ├── build_pca_gridmet_weather.py
│ ├── convert_aqs_datamart_to_adviz.py
│ ├── standardize_monitor_pollutants.py
│ ├── build_master_no_ozone_pm25.py
│ ├── build_master_co_so2.py
│ ├── standardize_co_so2_stdlib.py
│ ├── build_monitor_site_pca_crosswalk.py
│ ├── aggregate_pca_pollutants_annual.py
│ ├── aggregate_pca_pollutants_proximity_weighted.py
│ ├── merge_pca_environmental_daily.py
│ ├── merge_environmental_panel.py
│ ├── daily_enrichment_features.py
│ ├── join_environment_demographics_mortality.py
│ ├── impute_pollutants_modeling_panel.py
│ ├── map_maricopa_pca_monitors.py
│ ├── qa_imputed_pollutants_panel.py
│ ├── qa_acs_pca_preaggregation.py
│ ├── run_proximity_qa_checks.py
│ └── test_gridmet_tmmx_download.py
│
├── CensusData/ Processed data + QA reports
│ ├── B01001/ B19013/ B03002/ Raw ACS downloads (age, income, race)
│ ├── *_qa_report.md Pipeline QA documentation
│ ├── *_data_dictionary.md Variable definitions
│ ├── tract_to_pca_crosswalk_*.csv Tract→PCA area-weighted crosswalks
│ ├── pca_demographics_ses_*.csv Aggregated PCA demographics
│ ├── pca_weather_gridmet_*.csv Weather features per PCA-year
│ ├── pca_pollutants_*.csv Air-quality features per PCA-year
│ ├── pca_modeling_panel_*.csv Final modeling panel (51 PCAs × 10 years)
│ └── pca_monitor_map_maricopa_interactive.html Interactive monitor map
│
├── ResultsAnalysis/
│ └── FinalDemographAndEnvironment/
│ ├── combined_fairness_audit_Demographicv2.ipynb Full-feature audit
│ └── combined_fairness_audit_EnvironmentOnlyv2.ipynb Env-only audit
│
├── outputs/ Full-feature run outputs
│ ├── tables/ CSV summaries, fairness tables, importances
│ ├── figures/ Figures referenced by FinalReport.tex
│ └── all_models_fairness_summary.md Cross-model summary
│
└── outputs_envonly/ Environment-only run outputs
├── tables/
└── all_models_fairness_summary.md
All four data streams used in this study (mortality, climate, air quality, demographics) and the two boundary shapefiles are publicly available, and the processed inputs are already committed to this repository; you do not need to download anything to reproduce the analysis.
| Domain | Source | Notes |
|---|---|---|
| Mortality | Arizona Department of Health Services — Population Health Indicator Hub | Indicator 609 ("Mortality by cause of death"), Measure 60410 = annual all-cause mortality rate per 100,000 persons, aggregated by Primary Care Area. Publicly published indicator data; no data-use agreement required. Committed as MortalityData/MortalityPCA_2010_2019.csv. |
| Climate (gridMET) | climatologylab.org/gridmet | Daily NetCDF rasters of tmmx, tmmn, rmax, rmin, vs for 2010–2019. Only needed if rebuilding pca_weather_gridmet_*.csv from scratch. |
| Air quality | US EPA AirData | Daily monitor-level CSVs for NO₂, O₃, PM₂.₅, CO, PM₁₀, SO₂ for Arizona. Only needed if rebuilding pollutant CSVs from scratch. |
| Demographics | US Census Bureau ACS 5-year | Tables B01001 (age × sex), B19013 (median household income), B03002 (race/ethnicity), 2010–2019 estimates. Raw downloads are committed under CensusData/B01001/, B19013/, B03002/. |
| PCA boundaries | Arizona Health Improvement Plan – Primary Care Areas | Shapefile published by AZ DHS. Only needed if rebuilding the tract→PCA crosswalk from scratch; place under PCAShapeFile/. |
| Census tract boundaries | US Census Bureau TIGER/Line | 2010 tract shapefile for Maricopa County (state 04, county 013). Only needed if rebuilding the crosswalk; place under CensusTractShapeFile/. |
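If you do rebuild the weather features, gridMET serves one NetCDF file per variable-year. A minimal download sketch, assuming the archive's usual URL pattern (verify against climatologylab.org/gridmet before relying on it):

```python
# Minimal sketch: fetch the gridMET variable-year NetCDF files used here.
# The URL pattern is an assumption based on the gridMET archive; verify it
# on climatologylab.org/gridmet before use. Only needed when rebuilding
# pca_weather_gridmet_*.csv from scratch.
import urllib.request
from pathlib import Path

Path("gridmet").mkdir(exist_ok=True)
for var in ["tmmx", "tmmn", "rmax", "rmin", "vs"]:
    for year in range(2010, 2020):
        url = f"https://www.northwestknowledge.net/metdata/data/{var}_{year}.nc"
        urllib.request.urlretrieve(url, f"gridmet/{var}_{year}.nc")
```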
MortalityData/MortalityPCA_2010_2019.csv is the AZDHS Population Health Indicator Hub export. Relevant columns:
| Column | Type | Description |
|---|---|---|
| GeogID | string | Primary Care Area identifier (1–83) |
| NAME | string | PCA name (e.g. "AHWATUKEE", "SOUTH PHOENIX") |
| Year | int | Calendar year (2010–2019) |
| Domain | string | Always MT (mortality) |
| Indicator | int | Always 609 (Mortality by cause of death) |
| Measure | int | Filter on 60410 (annual all-cause rate per 100k) |
| Month | string | Filter on ALL (annual rollup) |
| Type | string | Filter on AllDeaths |
| Value | float | Mortality rate per 100,000 persons; the modeling target |
The build script (scripts/join_environment_demographics_mortality.py) handles the filtering and renames Value to mortality_rate_per_100k in the modeling panel.
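A minimal sketch of that filtering step, assuming the column values documented in the table above (the script itself is the authoritative implementation):

```python
import pandas as pd

# Keep only the annual all-cause mortality rows and rename the target column,
# mirroring what join_environment_demographics_mortality.py does.
df = pd.read_csv("MortalityData/MortalityPCA_2010_2019.csv")
annual = df[
    (df["Measure"] == 60410)       # annual all-cause rate per 100k
    & (df["Month"] == "ALL")       # annual rollup
    & (df["Type"] == "AllDeaths")
].rename(columns={"Value": "mortality_rate_per_100k"})
```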
This project uses Python 3.10+ plus R 4.x (the GAM is fit via mgcv called through rpy2).
```bash
python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate
pip install -r requirements.txt
```

Key packages:
- Core: `pandas`, `numpy`, `scipy`, `scikit-learn`
- Modeling: `rpy2` (R bridge), `statsmodels`
- Geospatial: `geopandas`, `rioxarray`, `xarray`, `exactextract`, `pyshp`, `pyproj`, `shapely`
- Plotting: `matplotlib`, `seaborn`
- Notebooks: `jupyter`, `ipykernel`, `tqdm`
A pinned requirements.txt is provided. If you have trouble installing geopandas / exactextract on Windows, use Miniconda/Anaconda:
```bash
conda create -n heat-fairness python=3.10
conda activate heat-fairness
conda install -c conda-forge geopandas rioxarray exactextract xarray pyshp pyproj shapely rpy2
pip install -r requirements.txt
```

The GAM is fit with `mgcv` in R via `rpy2`. Install R 4.x, then in an R session:

```r
install.packages(c("mgcv"))
```

`rpy2` discovers R automatically when `R_HOME` is set or `R.exe` / `R` is on `PATH`.
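A quick smoke test that the Python↔R bridge works (assumes rpy2 and R 4.x with mgcv are installed):

```python
# Verify that rpy2 can locate R and load mgcv; this raises if either is missing.
import rpy2.robjects as ro
from rpy2.robjects.packages import importr

importr("mgcv")
print(ro.r("R.version.string")[0])
```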
Several scripts contain a hard-coded ROOT = Path(r"e:\Github\DSCIProject"). Replace this with your local clone path, or run the scripts from the repo root and use Path(__file__).resolve().parents[1].
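A minimal sketch of the portable replacement (the `CENSUS_DIR` line is just illustrative usage, not from the scripts):

```python
from pathlib import Path

# Resolve the repo root relative to the script itself: scripts/<name>.py sits
# one level below the root, so parents[1] is the clone directory.
ROOT = Path(__file__).resolve().parents[1]
CENSUS_DIR = ROOT / "CensusData"  # illustrative usage
```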
All intermediate CSVs are already committed, so you do not need to rebuild the pipeline. To reproduce every result, figure, and table in FinalReport.tex, just run the two analysis notebooks end-to-end.
```bash
jupyter notebook ResultsAnalysis/FinalDemographAndEnvironment/
```

Open and run each notebook top-to-bottom (Kernel → Restart & Run All):
- `combined_fairness_audit_Demographicv2.ipynb`: full model (28 features: 19 environmental + 9 demographic/SES). Writes to `outputs/`.
- `combined_fairness_audit_EnvironmentOnlyv2.ipynb`: environment-only baseline (19 features). Writes to `outputs_envonly/`.
Each notebook reads the prebuilt modeling panel at `CensusData/pca_modeling_panel_2010_2019_imputed_pollutants.csv` and produces:
- Hyperparameter search results (5-fold GroupKFold by year, RMSE-minimizing)
- 5-fold rolling temporal walk-forward evaluation (test years 2015–2019)
- Subgroup fairness audit across 11 demographic features × 5 quintile bins (sketched below)
- Paired cluster-bootstrap significance tests
- Spatial cross-validation (leave-one-PCA-out and 4-region cluster)
- Calibration diagnostics
- Every figure cited in FinalReport.tex
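A minimal sketch of the quintile-binned subgroup audit referenced above; the function and the `y_true` / `y_pred` column names are placeholders, not the notebooks' actual API:

```python
import pandas as pd

def subgroup_rmse(preds: pd.DataFrame, feature: str, n_bins: int = 5) -> pd.Series:
    """RMSE of held-out predictions within quantile bins of one demographic feature."""
    # Quintile-bin the subgroup feature; duplicates="drop" tolerates ties.
    bins = pd.qcut(preds[feature], q=n_bins, labels=False, duplicates="drop")
    sq_err = (preds["y_true"] - preds["y_pred"]) ** 2
    return sq_err.groupby(bins).mean().pow(0.5)  # per-bin RMSE
```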
Approximate runtime on a modern laptop (16 GB RAM):
| Section | Time |
|---|---|
| RF (hyperparameter search + audit) | ~10 min |
| GBM (hyperparameter search + audit) | ~15 min |
| GAM (rpy2 / mgcv) | ~25 min |
| Per notebook total | ~50–60 min |
Cell outputs in the committed notebooks are pre-populated, so the simplest check is to scroll through and inspect; rerunning reproduces the same numbers up to bootstrap-sampling noise (B = 500, seed-fixed where possible).
The two notebooks are independent — you can run them in parallel in two Jupyter kernels if you have the cores. Outputs go to disjoint folders (outputs/ vs. outputs_envonly/).
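One way to run both headlessly in parallel (standard nbconvert flags; the unlimited timeout matters because several cells run for minutes):

```bash
# Execute both notebooks in place, in parallel, without the Jupyter UI.
jupyter nbconvert --to notebook --execute --inplace --ExecutePreprocessor.timeout=-1 \
    ResultsAnalysis/FinalDemographAndEnvironment/combined_fairness_audit_Demographicv2.ipynb &
jupyter nbconvert --to notebook --execute --inplace --ExecutePreprocessor.timeout=-1 \
    ResultsAnalysis/FinalDemographAndEnvironment/combined_fairness_audit_EnvironmentOnlyv2.ipynb &
wait
```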
You only need this section if you are extending the panel (adding a new year, swapping in different pollutant sources, etc.). The committed CSVs are produced by the scripts below in this order.
Requires `CensusTractShapeFile/` and `PCAShapeFile/` (see Data Sources):

```bash
python scripts/build_tract_pca_crosswalk.py
```

```bash
python scripts/aggregate_pca_demographics_2010_2019.py
```

Reads `CensusData/B01001/`, `B19013/`, `B03002/`; writes `pca_demographics_ses_2010_2019.csv`.
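The core of the area-weighted crosswalk is a polygon intersection; a minimal sketch of the idea (column names and the projected CRS are assumptions, not the script's actual choices):

```python
import geopandas as gpd

# Intersect tracts with PCA polygons in a projected CRS so areas are meaningful
# (EPSG:26912, UTM zone 12N, covers Maricopa County).
tracts = gpd.read_file("CensusTractShapeFile/").to_crs(epsg=26912)
pcas = gpd.read_file("PCAShapeFile/").to_crs(epsg=26912)

tracts["tract_area"] = tracts.geometry.area
pieces = gpd.overlay(tracts, pcas, how="intersection")
# Weight = share of each tract's area falling inside each PCA.
pieces["area_weight"] = pieces.geometry.area / pieces["tract_area"]
```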
Requires gridMET NetCDF files (download separately):

```bash
python scripts/build_pca_gridmet_weather.py \
    --gridmet-dir path/to/gridmet/netcdf \
    --years 2010-2019
```

```bash
python scripts/convert_aqs_datamart_to_adviz.py
python scripts/build_master_no_ozone_pm25.py
python scripts/build_master_co_so2.py
python scripts/standardize_co_so2_stdlib.py
python scripts/standardize_monitor_pollutants.py
python scripts/build_monitor_site_pca_crosswalk.py
python scripts/aggregate_pca_pollutants_proximity_weighted.py
python scripts/aggregate_pca_pollutants_annual.py
```

```bash
python scripts/merge_pca_environmental_daily.py
python scripts/daily_enrichment_features.py
python scripts/merge_environmental_panel.py
```

```bash
python scripts/join_environment_demographics_mortality.py
python scripts/impute_pollutants_modeling_panel.py
```

Produces:
`CensusData/pca_modeling_panel_2010_2019_imputed_pollutants.csv`
This is the file the two analysis notebooks consume. PCAs 46, 53, and 82 (Surprise North & Wickenburg, Buckeye, Tohono O'odham Nation) are dropped by default because 100% of their pollutant-years are imputed; pass --include-all-pcas to the join script for sensitivity comparisons.
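For example, a sensitivity rebuild that keeps those three PCAs (re-running the imputation step afterward so the panel stays consistent):

```bash
python scripts/join_environment_demographics_mortality.py --include-all-pcas
python scripts/impute_pollutants_modeling_panel.py
```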
This project builds on the comparative-modeling framework of Boudreault et al. (2023):

Boudreault, J., Campagna, C., & Chebana, F. (2023). Machine and deep learning for modelling heat-health relationships. *Science of the Total Environment*, 892, 164660.