This repository contains the code, data-processing pipeline, and analysis notebooks for a fairness audit of machine-learning heat–mortality models in Maricopa County, Arizona (2010–2019). Three model classes—Random Forest (RF), Gradient Boosting Machine (GBM), and Generalized Additive Model (GAM)—are trained and audited on a 51-PCA × 10-year panel under two feature configurations (environment-only baseline and a demographically augmented full model).
The accompanying paper, FinalReport.tex, reports on two research questions:
- RQ1: Does adding demographic and socioeconomic features improve aggregate predictive performance over environment-only baselines?
- RQ2: Across model classes, do the extended models distribute predictive errors equitably across demographic and neighborhood subgroups?
Dominic Woetzel, Jen Navarro, Jesse Gallegos — University of Southern California
DSCIProject/
├── README.md This file
│
├── scripts/ Data-processing pipeline (Python)
│ ├── build_tract_pca_crosswalk.py
│ ├── aggregate_pca_demographics_2010_2019.py
│ ├── build_pca_gridmet_weather.py
│ ├── convert_aqs_datamart_to_adviz.py
│ ├── standardize_monitor_pollutants.py
│ ├── build_master_no_ozone_pm25.py
│ ├── build_master_co_so2.py
│ ├── standardize_co_so2_stdlib.py
│ ├── build_monitor_site_pca_crosswalk.py
│ ├── aggregate_pca_pollutants_annual.py
│ ├── aggregate_pca_pollutants_proximity_weighted.py
│ ├── merge_pca_environmental_daily.py
│ ├── merge_environmental_panel.py
│ ├── daily_enrichment_features.py
│ ├── join_environment_demographics_mortality.py
│ ├── impute_pollutants_modeling_panel.py
│ ├── map_maricopa_pca_monitors.py
│ ├── qa_imputed_pollutants_panel.py
│ ├── qa_acs_pca_preaggregation.py
│ ├── run_proximity_qa_checks.py
│ └── test_gridmet_tmmx_download.py
│
├── CensusData/ Processed data + QA reports
│ ├── B01001/ B19013/ B03002/ Raw ACS downloads (age, income, race)
│ ├── *_qa_report.md Pipeline QA documentation
│ ├── *_data_dictionary.md Variable definitions
│ ├── tract_to_pca_crosswalk_*.csv Tract→PCA area-weighted crosswalks
│ ├── pca_demographics_ses_*.csv Aggregated PCA demographics
│ ├── pca_weather_gridmet_*.csv Weather features per PCA-year
│ ├── pca_pollutants_*.csv Air-quality features per PCA-year
│ ├── pca_modeling_panel_*.csv Final modeling panel (51 PCAs × 10 years)
│ └── pca_monitor_map_maricopa_interactive.html Interactive monitor map
│
├── ResultsAnalysis/
│ └── FinalDemographAndEnvironment/
│ ├── combined_fairness_audit_Demographicv2.ipynb Full-feature audit
│ └── combined_fairness_audit_EnvironmentOnlyv2.ipynb Env-only audit
│
├── outputs/ Full-feature run outputs
│ ├── tables/ CSV summaries, fairness tables, importances
│ ├── figures/ Figures referenced by FinalReport.tex
│ └── all_models_fairness_summary.md Cross-model summary
│
└── outputs_envonly/ Environment-only run outputs
├── tables/
└── all_models_fairness_summary.md
All four data streams used in this study (mortality, climate, air quality, demographics) and the two boundary shapefiles are publicly available, and the processed inputs are already committed to this repository; you do not need to download anything to reproduce the analysis.
| Domain | Source | Notes |
|---|---|---|
| Mortality | Arizona Department of Health Services — Population Health Indicator Hub | Indicator 609 ("Mortality by cause of death"), Measure 60410 = annual all-cause mortality rate per 100,000 persons, aggregated by Primary Care Area. Publicly published indicator data; no data-use agreement required. Committed as MortalityData/MortalityPCA_2010_2019.csv. |
| Climate (gridMET) | climatologylab.org/gridmet | Daily NetCDF rasters of tmmx, tmmn, rmax, rmin, vs for 2010–2019. Only needed if rebuilding pca_weather_gridmet_*.csv from scratch. |
| Air quality | US EPA AirData | Daily monitor-level CSVs for NO₂, O₃, PM₂.₅, CO, PM₁₀, SO₂ for Arizona. Only needed if rebuilding pollutant CSVs from scratch. |
| Demographics | US Census Bureau ACS 5-year | Tables B01001 (age × sex), B19013 (median household income), B03002 (race/ethnicity), 2010–2019 estimates. Raw downloads are committed under CensusData/B01001/, B19013/, B03002/. |
| PCA boundaries | Arizona Health Improvement Plan – Primary Care Areas | Shapefile published by AZ DHS. Only needed if rebuilding the tract→PCA crosswalk from scratch; place under PCAShapeFile/. |
| Census tract boundaries | US Census Bureau TIGER/Line | 2010 tract shapefile for Maricopa County (state 04, county 013). Only needed if rebuilding the crosswalk; place under CensusTractShapeFile/. |
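If you do rebuild the weather features, gridMET serves one NetCDF file per variable-year. A minimal download sketch, assuming the archive's usual URL pattern (verify against climatologylab.org/gridmet before relying on it):

```python
# Minimal sketch: fetch the gridMET variable-year NetCDF files used here.
# The URL pattern is an assumption based on the gridMET archive; verify it
# on climatologylab.org/gridmet before use. Only needed when rebuilding
# pca_weather_gridmet_*.csv from scratch.
import urllib.request
from pathlib import Path

Path("gridmet").mkdir(exist_ok=True)
for var in ["tmmx", "tmmn", "rmax", "rmin", "vs"]:
    for year in range(2010, 2020):
        url = f"https://www.northwestknowledge.net/metdata/data/{var}_{year}.nc"
        urllib.request.urlretrieve(url, f"gridmet/{var}_{year}.nc")
```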
MortalityData/MortalityPCA_2010_2019.csv is the AZDHS Population Health Indicator Hub export. Relevant columns:
| Column | Type | Description |
|---|---|---|
| GeogID | string | Primary Care Area identifier (1–83) |
| NAME | string | PCA name (e.g. "AHWATUKEE", "SOUTH PHOENIX") |
| Year | int | Calendar year (2010–2019) |
| Domain | string | Always MT (mortality) |
| Indicator | int | Always 609 (Mortality by cause of death) |
| Measure | int | Filter on 60410 (annual all-cause rate per 100k) |
| Month | string | Filter on ALL (annual rollup) |
| Type | string | Filter on AllDeaths |
| Value | float | Mortality rate per 100,000 persons; the modeling target |
The build script (scripts/join_environment_demographics_mortality.py) handles the filtering and renames Value to mortality_rate_per_100k in the modeling panel.
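A minimal sketch of that filtering step, assuming the column values documented in the table above (the script itself is the authoritative implementation):

```python
import pandas as pd

# Keep only the annual all-cause mortality rows and rename the target column,
# mirroring what join_environment_demographics_mortality.py does.
df = pd.read_csv("MortalityData/MortalityPCA_2010_2019.csv")
annual = df[
    (df["Measure"] == 60410)       # annual all-cause rate per 100k
    & (df["Month"] == "ALL")       # annual rollup
    & (df["Type"] == "AllDeaths")
].rename(columns={"Value": "mortality_rate_per_100k"})
```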
This project uses Python 3.10+ plus R 4.x (the GAM is fit via mgcv called through rpy2).
```bash
python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate
pip install -r requirements.txt
```

Key packages:
- Core: `pandas`, `numpy`, `scipy`, `scikit-learn`
- Modeling: `rpy2` (R bridge), `statsmodels`
- Geospatial: `geopandas`, `rioxarray`, `xarray`, `exactextract`, `pyshp`, `pyproj`, `shapely`
- Plotting: `matplotlib`, `seaborn`
- Notebooks: `jupyter`, `ipykernel`, `tqdm`
A pinned requirements.txt is provided. If you have trouble installing geopandas / exactextract on Windows, use Miniconda/Anaconda:
```bash
conda create -n heat-fairness python=3.10
conda activate heat-fairness
conda install -c conda-forge geopandas rioxarray exactextract xarray pyshp pyproj shapely rpy2
pip install -r requirements.txt
```

The GAM is fit with `mgcv` in R via `rpy2`. Install R 4.x, then in an R session:

```r
install.packages(c("mgcv"))
```

`rpy2` discovers R automatically when `R_HOME` is set or `R.exe` / `R` is on `PATH`.
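A quick smoke test that the Python↔R bridge works (assumes rpy2 and R 4.x with mgcv are installed):

```python
# Verify that rpy2 can locate R and load mgcv; this raises if either is missing.
import rpy2.robjects as ro
from rpy2.robjects.packages import importr

importr("mgcv")
print(ro.r("R.version.string")[0])
```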
Several scripts contain a hard-coded ROOT = Path(r"e:\Github\DSCIProject"). Replace this with your local clone path, or run the scripts from the repo root and use Path(__file__).resolve().parents[1].
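A minimal sketch of the portable replacement (the `CENSUS_DIR` line is just illustrative usage, not from the scripts):

```python
from pathlib import Path

# Resolve the repo root relative to the script itself: scripts/<name>.py sits
# one level below the root, so parents[1] is the clone directory.
ROOT = Path(__file__).resolve().parents[1]
CENSUS_DIR = ROOT / "CensusData"  # illustrative usage
```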
All intermediate CSVs are already committed, so you do not need to rebuild the pipeline. To reproduce every result, figure, and table in FinalReport.tex, just run the two analysis notebooks end-to-end.
```bash
jupyter notebook ResultsAnalysis/FinalDemographAndEnvironment/
```

Open and run each notebook top-to-bottom (Kernel → Restart & Run All):
- `combined_fairness_audit_Demographicv2.ipynb`: full model (28 features: 19 environmental + 9 demographic/SES). Writes to `outputs/`.
- `combined_fairness_audit_EnvironmentOnlyv2.ipynb`: environment-only baseline (19 features). Writes to `outputs_envonly/`.
Each notebook reads the prebuilt modeling panel at `CensusData/pca_modeling_panel_2010_2019_imputed_pollutants.csv` and produces:
- Hyperparameter search results (5-fold GroupKFold by year, RMSE-minimizing)
- 5-fold rolling temporal walk-forward evaluation (test years 2015–2019)
- Subgroup fairness audit across 11 demographic features × 5 quintile bins (sketched below)
- Paired cluster-bootstrap significance tests
- Spatial cross-validation (leave-one-PCA-out and 4-region cluster)
- Calibration diagnostics
- Every figure cited in FinalReport.tex
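A minimal sketch of the quintile-binned subgroup audit referenced above; the function and the `y_true` / `y_pred` column names are placeholders, not the notebooks' actual API:

```python
import pandas as pd

def subgroup_rmse(preds: pd.DataFrame, feature: str, n_bins: int = 5) -> pd.Series:
    """RMSE of held-out predictions within quantile bins of one demographic feature."""
    # Quintile-bin the subgroup feature; duplicates="drop" tolerates ties.
    bins = pd.qcut(preds[feature], q=n_bins, labels=False, duplicates="drop")
    sq_err = (preds["y_true"] - preds["y_pred"]) ** 2
    return sq_err.groupby(bins).mean().pow(0.5)  # per-bin RMSE
```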
Approximate runtime on a modern laptop (16 GB RAM):
| Section | Time |
|---|---|
| RF (hyperparameter search + audit) | ~10 min |
| GBM (hyperparameter search + audit) | ~15 min |
| GAM (rpy2 / mgcv) | ~25 min |
| Per notebook total | ~50–60 min |
Cell outputs in the committed notebooks are pre-populated, so the simplest check is to scroll through and inspect; rerunning reproduces the same numbers up to bootstrap-sampling noise (B = 500, seed-fixed where possible).
The two notebooks are independent — you can run them in parallel in two Jupyter kernels if you have the cores. Outputs go to disjoint folders (outputs/ vs. outputs_envonly/).
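One way to run both headlessly in parallel (standard nbconvert flags; the unlimited timeout matters because several cells run for minutes):

```bash
# Execute both notebooks in place, in parallel, without the Jupyter UI.
jupyter nbconvert --to notebook --execute --inplace --ExecutePreprocessor.timeout=-1 \
    ResultsAnalysis/FinalDemographAndEnvironment/combined_fairness_audit_Demographicv2.ipynb &
jupyter nbconvert --to notebook --execute --inplace --ExecutePreprocessor.timeout=-1 \
    ResultsAnalysis/FinalDemographAndEnvironment/combined_fairness_audit_EnvironmentOnlyv2.ipynb &
wait
```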
You only need this section if you are extending the panel (adding a new year, swapping in different pollutant sources, etc.). The committed CSVs are produced by the scripts below in this order.
Requires `CensusTractShapeFile/` and `PCAShapeFile/` (see Data Sources):

```bash
python scripts/build_tract_pca_crosswalk.py
```

```bash
python scripts/aggregate_pca_demographics_2010_2019.py
```

Reads `CensusData/B01001/`, `B19013/`, `B03002/`; writes `pca_demographics_ses_2010_2019.csv`.
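The core of the area-weighted crosswalk is a polygon intersection; a minimal sketch of the idea (column names and the projected CRS are assumptions, not the script's actual choices):

```python
import geopandas as gpd

# Intersect tracts with PCA polygons in a projected CRS so areas are meaningful
# (EPSG:26912, UTM zone 12N, covers Maricopa County).
tracts = gpd.read_file("CensusTractShapeFile/").to_crs(epsg=26912)
pcas = gpd.read_file("PCAShapeFile/").to_crs(epsg=26912)

tracts["tract_area"] = tracts.geometry.area
pieces = gpd.overlay(tracts, pcas, how="intersection")
# Weight = share of each tract's area falling inside each PCA.
pieces["area_weight"] = pieces.geometry.area / pieces["tract_area"]
```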
Requires gridMET NetCDF files (download separately):

```bash
python scripts/build_pca_gridmet_weather.py \
    --gridmet-dir path/to/gridmet/netcdf \
    --years 2010-2019
```

```bash
python scripts/convert_aqs_datamart_to_adviz.py
python scripts/build_master_no_ozone_pm25.py
python scripts/build_master_co_so2.py
python scripts/standardize_co_so2_stdlib.py
python scripts/standardize_monitor_pollutants.py
python scripts/build_monitor_site_pca_crosswalk.py
python scripts/aggregate_pca_pollutants_proximity_weighted.py
python scripts/aggregate_pca_pollutants_annual.py
```

```bash
python scripts/merge_pca_environmental_daily.py
python scripts/daily_enrichment_features.py
python scripts/merge_environmental_panel.py
```

```bash
python scripts/join_environment_demographics_mortality.py
python scripts/impute_pollutants_modeling_panel.py
```

Produces:
`CensusData/pca_modeling_panel_2010_2019_imputed_pollutants.csv`
This is the file the two analysis notebooks consume. PCAs 46, 53, and 82 (Surprise North & Wickenburg, Buckeye, Tohono O'odham Nation) are dropped by default because 100% of their pollutant-years are imputed; pass --include-all-pcas to the join script for sensitivity comparisons.
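For example, a sensitivity rebuild that keeps those three PCAs (re-running the imputation step afterward so the panel stays consistent):

```bash
python scripts/join_environment_demographics_mortality.py --include-all-pcas
python scripts/impute_pollutants_modeling_panel.py
```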
This project builds on the comparative-modeling framework of Boudreault et al. (2023):

Boudreault, J., Campagna, C., & Chebana, F. (2023). Machine and deep learning for modelling heat-health relationships. *Science of the Total Environment*, 892, 164660.