
Evaluating Fairness in Heat–Mortality Models for Maricopa County, Arizona

This repository contains the code, data-processing pipeline, and analysis notebooks for a fairness audit of machine-learning heat–mortality models in Maricopa County, Arizona (2010–2019). Three model classes—Random Forest (RF), Gradient Boosting Machine (GBM), and Generalized Additive Model (GAM)—are trained and audited on a 51-PCA × 10-year panel under two feature configurations (environment-only baseline and a demographically augmented full model).

The accompanying paper, FinalReport.tex, reports on two research questions:

  • RQ1: Does adding demographic and socioeconomic features improve aggregate predictive performance over environment-only baselines?
  • RQ2: Across model classes, do the extended models distribute predictive errors equitably across demographic and neighborhood subgroups?

Authors

Dominic Woetzel, Jen Navarro, Jesse Gallegos — University of Southern California


Repository Structure

DSCIProject/
├── README.md                            This file
│
├── scripts/                             Data-processing pipeline (Python)
│   ├── build_tract_pca_crosswalk.py
│   ├── aggregate_pca_demographics_2010_2019.py
│   ├── build_pca_gridmet_weather.py
│   ├── convert_aqs_datamart_to_adviz.py
│   ├── standardize_monitor_pollutants.py
│   ├── build_master_no_ozone_pm25.py
│   ├── build_master_co_so2.py
│   ├── standardize_co_so2_stdlib.py
│   ├── build_monitor_site_pca_crosswalk.py
│   ├── aggregate_pca_pollutants_annual.py
│   ├── aggregate_pca_pollutants_proximity_weighted.py
│   ├── merge_pca_environmental_daily.py
│   ├── merge_environmental_panel.py
│   ├── daily_enrichment_features.py
│   ├── join_environment_demographics_mortality.py
│   ├── impute_pollutants_modeling_panel.py
│   ├── map_maricopa_pca_monitors.py
│   ├── qa_imputed_pollutants_panel.py
│   ├── qa_acs_pca_preaggregation.py
│   ├── run_proximity_qa_checks.py
│   └── test_gridmet_tmmx_download.py
│
├── CensusData/                          Processed data + QA reports
│   ├── B01001/ B19013/ B03002/          Raw ACS downloads (age, income, race)
│   ├── *_qa_report.md                   Pipeline QA documentation
│   ├── *_data_dictionary.md              Variable definitions
│   ├── tract_to_pca_crosswalk_*.csv     Tract→PCA area-weighted crosswalks
│   ├── pca_demographics_ses_*.csv        Aggregated PCA demographics
│   ├── pca_weather_gridmet_*.csv        Weather features per PCA-year
│   ├── pca_pollutants_*.csv             Air-quality features per PCA-year
│   ├── pca_modeling_panel_*.csv         Final modeling panel (51 PCAs × 10 years)
│   └── pca_monitor_map_maricopa_interactive.html   Interactive monitor map
│
├── ResultsAnalysis/
│   └── FinalDemographAndEnvironment/
│       ├── combined_fairness_audit_Demographicv2.ipynb     Full-feature audit
│       └── combined_fairness_audit_EnvironmentOnlyv2.ipynb Env-only audit
│
├── outputs/                             Full-feature run outputs
│   ├── tables/                          CSV summaries, fairness tables, importances
│   ├── figures/                         Figures referenced by FinalReport.tex
│   └── all_models_fairness_summary.md   Cross-model summary
│
└── outputs_envonly/                     Environment-only run outputs
    ├── tables/
    └── all_models_fairness_summary.md

Data Sources

All data streams used in this study are publicly available, and the processed inputs are already committed to this repository, so you do not need to download anything to reproduce the analysis.

| Domain | Source | Notes |
|---|---|---|
| Mortality | Arizona Department of Health Services — Population Health Indicator Hub | Indicator 609 ("Mortality by cause of death"), Measure 60410 = annual all-cause mortality rate per 100,000 persons, aggregated by Primary Care Area. Publicly published indicator data; no data-use agreement required. Committed as MortalityData/MortalityPCA_2010_2019.csv. |
| Climate (gridMET) | climatologylab.org/gridmet | Daily NetCDF rasters of tmmx, tmmn, rmax, rmin, vs for 2010–2019. Only needed if rebuilding pca_weather_gridmet_*.csv from scratch. |
| Air quality | US EPA AirData | Daily monitor-level CSVs for NO₂, O₃, PM₂.₅, CO, PM₁₀, SO₂ for Arizona. Only needed if rebuilding pollutant CSVs from scratch. |
| Demographics | US Census Bureau ACS 5-year | Tables B01001 (age × sex), B19013 (median household income), B03002 (race/ethnicity), 2010–2019 estimates. Raw downloads are committed under CensusData/B01001/, B19013/, B03002/. |
| PCA boundaries | Arizona Health Improvement Plan – Primary Care Areas | Shapefile published by AZ DHS. Only needed if rebuilding the tract→PCA crosswalk from scratch; place under PCAShapeFile/. |
| Census tract boundaries | US Census Bureau TIGER/Line | 2010 tract shapefile for Maricopa County (state 04, county 013). Only needed if rebuilding the crosswalk; place under CensusTractShapeFile/. |

Mortality data schema

MortalityData/MortalityPCA_2010_2019.csv is the AZDHS Population Health Indicator Hub export. Relevant columns:

| Column | Type | Description |
|---|---|---|
| GeogID | string | Primary Care Area identifier (1–83) |
| NAME | string | PCA name (e.g. "AHWATUKEE", "SOUTH PHOENIX") |
| Year | int | Calendar year (2010–2019) |
| Domain | string | Always MT (mortality) |
| Indicator | int | Always 609 (Mortality by cause of death) |
| Measure | int | Filter on 60410 (annual all-cause rate per 100k) |
| Month | string | Filter on ALL (annual rollup) |
| Type | string | Filter on AllDeaths |
| Value | float | Mortality rate per 100,000 persons — the modeling target |

The build script (scripts/join_environment_demographics_mortality.py) handles the filtering and renames Value to mortality_rate_per_100k in the modeling panel.
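
The filtering and renaming step amounts to a few lines of pandas. The sketch below uses the column names from the schema above; the function name is hypothetical, and the script itself is authoritative:

```python
import pandas as pd

def extract_mortality_target(df: pd.DataFrame) -> pd.DataFrame:
    """Filter the AZDHS indicator export down to the annual all-cause
    mortality rate and rename the target column (illustrative sketch)."""
    annual = df[
        (df["Measure"] == 60410)       # annual all-cause rate per 100k
        & (df["Month"] == "ALL")       # annual rollup, not monthly rows
        & (df["Type"] == "AllDeaths")
    ]
    return (
        annual[["GeogID", "Year", "Value"]]
        .rename(columns={"Value": "mortality_rate_per_100k"})
        .reset_index(drop=True)
    )
```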

Environment Setup

This project uses Python 3.10+ plus R 4.x (the GAM is fit via mgcv called through rpy2).

1. Python dependencies

python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate

pip install -r requirements.txt

Key packages:

  • Core: pandas, numpy, scipy, scikit-learn
  • Modeling: rpy2 (R bridge), statsmodels
  • Geospatial: geopandas, rioxarray, xarray, exactextract, pyshp, pyproj, shapely
  • Plotting: matplotlib, seaborn
  • Notebooks: jupyter, ipykernel, tqdm

A pinned requirements.txt is provided. If you have trouble installing geopandas / exactextract on Windows, use Miniconda/Anaconda:

conda create -n heat-fairness python=3.10
conda activate heat-fairness
conda install -c conda-forge geopandas rioxarray exactextract xarray pyshp pyproj shapely rpy2
pip install -r requirements.txt

2. R dependencies

The GAM is fit with mgcv in R via rpy2. Install R 4.x, then in an R session:

install.packages(c("mgcv"))

rpy2 discovers R automatically when R_HOME is set or R.exe / R is on PATH.

3. Repository path

Several scripts contain a hard-coded ROOT = Path(r"e:\Github\DSCIProject"). Replace this with your local clone path, or run the scripts from the repo root and use Path(__file__).resolve().parents[1].
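
The portable alternative can be packaged as a one-line helper (repo_root is a hypothetical name), since scripts/ sits one level below the repository root:

```python
from pathlib import Path

def repo_root(script_path: str) -> Path:
    """Resolve the repository root from a script inside scripts/.

    parents[0] is scripts/, so parents[1] is the repo root.
    """
    return Path(script_path).resolve().parents[1]

# Inside a pipeline script, instead of a hard-coded path:
# ROOT = repo_root(__file__)
```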


Running the Analysis

All intermediate CSVs are already committed, so you do not need to rebuild the pipeline. To reproduce every result, figure, and table in FinalReport.tex, just run the two analysis notebooks end-to-end.

Quick start

jupyter notebook ResultsAnalysis/FinalDemographAndEnvironment/

Open and run each notebook top-to-bottom (Kernel → Restart & Run All):

  1. combined_fairness_audit_Demographicv2.ipynb — full model (28 features: 19 environmental + 9 demographic/SES). Writes to outputs/.
  2. combined_fairness_audit_EnvironmentOnlyv2.ipynb — environment-only baseline (19 features). Writes to outputs_envonly/.

Each notebook reads the prebuilt modeling panel at CensusData/pca_modeling_panel_2010_2019_imputed_pollutants.csv and produces:

  • Hyperparameter search results (5-fold GroupKFold by year, RMSE-minimizing)
  • 5-fold rolling temporal walk-forward evaluation (test years 2015–2019)
  • Subgroup fairness audit across 11 demographic features × 5 quintile bins
  • Paired cluster-bootstrap significance tests
  • Spatial cross-validation (leave-one-PCA-out and 4-region cluster)
  • Calibration diagnostics
  • Every figure cited in FinalReport.tex
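
The rolling temporal walk-forward scheme above can be sketched in a few lines: each of the five test years (2015–2019) is predicted by a model trained only on strictly earlier years. This is an illustrative sketch; the notebooks define the authoritative splits:

```python
def walk_forward_splits(years, test_years):
    """Yield (train_years, test_years) pairs for a rolling temporal
    walk-forward evaluation: train on all years strictly before each
    test year."""
    for test_year in test_years:
        train_years = [y for y in years if y < test_year]
        yield train_years, [test_year]

# Five folds with test years 2015-2019, each trained on 2010..test_year-1:
splits = list(walk_forward_splits(range(2010, 2020), range(2015, 2020)))
```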

Runtime

Approximate runtime on a modern laptop (16 GB RAM):

| Section | Time |
|---|---|
| RF (hyperparameter search + audit) | ~10 min |
| GBM (hyperparameter search + audit) | ~15 min |
| GAM (rpy2 / mgcv) | ~25 min |
| Per notebook total | ~50–60 min |

Cell outputs in the committed notebooks are pre-populated, so the simplest check is to scroll through and inspect; rerunning reproduces the same numbers up to bootstrap-sampling noise (B = 500, seed-fixed where possible).
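
The paired cluster-bootstrap idea follows this general pattern: resample whole clusters (PCAs) with replacement, keeping each PCA's errors paired across the two models being compared. The helper below is a hypothetical sketch (function name, seed, and error representation are assumptions; the notebooks implement the authoritative version):

```python
import random

def paired_cluster_bootstrap(errors_a, errors_b, n_boot=500, seed=42):
    """Bootstrap distribution of the mean error difference between two
    models. Each dict maps PCA id -> mean error for that model; PCAs
    are resampled with replacement so per-PCA errors stay paired."""
    rng = random.Random(seed)
    pcas = list(errors_a)
    diffs = []
    for _ in range(n_boot):
        sample = [rng.choice(pcas) for _ in pcas]
        mean_a = sum(errors_a[p] for p in sample) / len(sample)
        mean_b = sum(errors_b[p] for p in sample) / len(sample)
        diffs.append(mean_a - mean_b)
    return diffs
```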

Order doesn't matter

The two notebooks are independent — you can run them in parallel in two Jupyter kernels if you have the cores. Outputs go to disjoint folders (outputs/ vs. outputs_envonly/).


Rebuilding the Modeling Panel From Raw Inputs (Optional)

You only need this section if you are extending the panel (adding a new year, swapping in different pollutant sources, etc.). The committed CSVs are produced by the scripts below in this order.

Stage 1 — Geospatial crosswalks

Requires CensusTractShapeFile/ and PCAShapeFile/ (see Data Sources).

python scripts/build_tract_pca_crosswalk.py

Stage 2 — Demographics

python scripts/aggregate_pca_demographics_2010_2019.py

Reads CensusData/B01001/, B19013/, B03002/; writes pca_demographics_ses_2010_2019.csv.
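
Area-weighted aggregation from tracts to PCAs follows the usual pattern: each tract contributes in proportion to the share of its area that falls inside the PCA. The sketch below uses hypothetical key names; the script defines the real columns:

```python
def area_weighted_mean(records, value_key, weight_key="area_weight"):
    """Collapse tract-level values to one PCA-level value using the
    crosswalk's area weights (tract area inside the PCA / tract area)."""
    total = sum(r[weight_key] for r in records)
    return sum(r[value_key] * r[weight_key] for r in records) / total

tracts = [
    {"median_income": 50_000, "area_weight": 0.75},
    {"median_income": 70_000, "area_weight": 0.25},
]
area_weighted_mean(tracts, "median_income")  # 55000.0
```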

Stage 3 — Climate (gridMET)

Requires gridMET NetCDF files (download separately):

python scripts/build_pca_gridmet_weather.py \
    --gridmet-dir path/to/gridmet/netcdf \
    --years 2010-2019

Stage 4 — Air-quality monitors

python scripts/convert_aqs_datamart_to_adviz.py
python scripts/build_master_no_ozone_pm25.py
python scripts/build_master_co_so2.py
python scripts/standardize_co_so2_stdlib.py
python scripts/standardize_monitor_pollutants.py
python scripts/build_monitor_site_pca_crosswalk.py
python scripts/aggregate_pca_pollutants_proximity_weighted.py
python scripts/aggregate_pca_pollutants_annual.py
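
One common form of proximity weighting is inverse-distance weighting of monitor readings toward a PCA centroid. The snippet below is only an assumption about the general idea, not the scheme the script actually uses; consult aggregate_pca_pollutants_proximity_weighted.py for the real definition:

```python
def proximity_weighted_mean(monitor_values, distances_km, power=2):
    """Inverse-distance-weighted average of monitor readings: closer
    monitors count more (illustrative sketch only)."""
    weights = [1.0 / max(d, 1e-6) ** power for d in distances_km]
    total = sum(weights)
    return sum(v * w for v, w in zip(monitor_values, weights)) / total
```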

Stage 5 — Daily environmental panel + enrichments

python scripts/merge_pca_environmental_daily.py
python scripts/daily_enrichment_features.py
python scripts/merge_environmental_panel.py

Stage 6 — Final modeling panel

python scripts/join_environment_demographics_mortality.py
python scripts/impute_pollutants_modeling_panel.py

Produces:

CensusData/pca_modeling_panel_2010_2019_imputed_pollutants.csv

This is the file the two analysis notebooks consume. PCAs 46, 53, and 82 (Surprise North & Wickenburg, Buckeye, Tohono O'odham Nation) are dropped by default because 100% of their pollutant-years are imputed; pass --include-all-pcas to the join script for sensitivity comparisons.


Citation

This project builds on the comparative-modeling framework of Boudreault et al. (2023):

Boudreault, J., Campagna, C., & Chebana, F. (2023). Machine and deep learning for modelling heat-health relationships. Science of the Total Environment, 892, 164660.

