# 🦷 NHANES Periodontitis Prediction: Temporal Validation & Gradient Boosting Benchmark

**Author:** Francisco Teixeira Barbosa (Cisco @ Periospot)  
**Date:** November 2025  
**Project:** Replicating & Improving Bashir et al. (2022)

---

## 📄 Reference Paper

**Bashir NZ, Rahman Z, Chen SL-S.**  
*Systematic comparison of machine learning algorithms to develop and validate predictive models for periodontitis.*  
**J Clin Periodontol.** 2022;49:958-969.

📁 **Paper Location:** `scientific_articles/J Clinic Periodontology - 2022 - Bashir...pdf`

---

## 🎯 Project Goals & Rationale

### The Problem
**Periodontitis** affects ~50% of US adults aged 30+, yet early prediction remains challenging.

**Bashir et al. (2022)** tested 10 ML algorithms and achieved:
- ✅ **Internal validation:** AUC > 0.95 (excellent)
- ❌ **External validation:** AUC ~0.50–0.60 (poor, little better than random)

**Why the failure?** Cross-population validation (Taiwan ↔ US) is a very demanding test: healthcare systems, diagnostic criteria, and populations all differ.

### Our Approach: Temporal Validation

Instead of validating across populations, we validate **across time within the same population**:

```
📅 TRAIN:      2011-2012 + 2013-2014  (~7,000 participants)
📅 VALIDATION: 2015-2016              (~3,500 participants)
📅 TEST:       2017-2018              (~3,500 participants)
```

**Why temporal?**
- ✅ Mimics real-world deployment: "Can a model trained on past data predict future patients?"
- ✅ Same population (US adults), same methodology (NHANES)
- ✅ More realistic than random shuffling
- ✅ Tests if model captures stable biological risk vs. temporal artifacts

### Methodological Improvements

1. **Modern Gradient Boosting:** XGBoost, CatBoost, LightGBM (NOT tested by Bashir)
2. **Hyperparameter Optimization:** Optuna Bayesian search
3. **Calibration:** Isotonic regression + decision curve analysis
4. **Interpretability:** SHAP feature importance
5. **Survey Weights:** Sensitivity analysis with NHANES complex sampling
6. **Reproducibility:** Versioned config, saved artifacts, git tracking

### Research Gap

From **Polizzi et al. (2024)** systematic review:  
> "None of the included articles used more powerful networks"

**Translation:** XGBoost, CatBoost, and LightGBM are **underutilized** in periodontitis prediction research.

**This study fills that gap.**

---

## 📊 Success Metrics

| Metric | Bashir Internal | Bashir External | **Our Target** |
|--------|----------------|-----------------|----------------|
| **AUC-ROC** | 0.95+ | 0.50–0.60 | **0.75–0.85** |
| **PR-AUC** | Not reported | Not reported | **0.60–0.75** |
| **Calibration** | Not reported | Not reported | **Brier < 0.20** |
| **Temporal Generalization** | N/A | Poor | **Better** |

**Realistic Expectation:** Even if we don't dramatically improve AUC, demonstrating that gradient boosting **doesn't magically solve external validation** is itself a **publishable finding** that advances the field.

---

## 🗺️ Notebook Roadmap

This notebook has **20 sections** organized into **5 phases**:

### Phase 1: Data Acquisition & Labeling (Sections 1–5)
1. Environment setup
2. Load configuration
3. Download NHANES data
4. Merge components
5. Apply CDC/AAP case definitions

### Phase 2: Feature Engineering & EDA (Sections 6–7)
6. Build 15 Bashir predictors
7. Exploratory analysis & drift detection

### Phase 3: Baseline Models (Sections 8–10)
8. Temporal train/val/test split
9. Preprocessing pipelines
10. Baseline models (LogReg, RandomForest)

### Phase 4: Gradient Boosting & Optimization (Sections 11–13)
11. XGBoost with Optuna
12. CatBoost with Optuna
13. LightGBM with Optuna

### Phase 5: Evaluation & Export (Sections 14–20)
14. Threshold selection on validation
15. Final test evaluation
16. Calibration & decision curves
17. SHAP interpretability
18. Survey weights sensitivity
19. Save artifacts & model card
20. Reproducibility log

---

## ⚠️ Important Notes Before Starting

1. **Read the Config First:** All parameters are in `configs/config.yaml`
2. **Implement TODOs Sequentially:** Each section builds on previous ones
3. **Test as You Go:** Run cells immediately to catch errors early
4. **Use Autocomplete:** Function signatures are provided, so let your IDE help
5. **Don't Skip Section 5:** CDC/AAP classification is the most critical and brittle step
6. **Survey Weights:** For ML training, we use unweighted data (documented), but report weighted prevalence
7. **Freeze Threshold on Val:** Never touch test set until final evaluation

---

Let's begin! 🚀

In [2]:
"""
Section 1: Environment Setup & Imports
========================================
Set up the computational environment with all required libraries,
apply reproducibility measures, and configure Periospot plotting style.
"""

import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.model_selection import cross_validate, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    roc_auc_score, average_precision_score, brier_score_loss,
    accuracy_score, recall_score, precision_score, f1_score,
    confusion_matrix, roc_curve, precision_recall_curve
)

import xgboost as xgb
import catboost as cb
import lightgbm as lgb

import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

import shap

import yaml
import json
from datetime import datetime

import sys
sys.path.insert(0, str(Path.cwd().parent / 'src'))
from ps_plot import set_style, get_palette, save_figure
from labels import label_periodontitis
from evaluation import compute_metrics, plot_roc_pr, select_threshold, plot_calibration_curve
from utils import set_seed, save_json, log_versions, save_model

RANDOM_SEED = 42
set_seed(RANDOM_SEED)

set_style()
palette = get_palette()
print("✅ Periospot color palette loaded:")
for name, hex_code in palette.items():
    print(f"   {name}: {hex_code}")

for dir_name in ['figures', 'models', 'results', 'artifacts', 'logs']:
    Path(dir_name).mkdir(exist_ok=True, parents=True)

print("\n📦 Package Versions:")
print(f"   pandas: {pd.__version__}")
print(f"   numpy: {np.__version__}")
print(f"   scikit-learn: {sklearn.__version__}")
print(f"   xgboost: {xgb.__version__}")
print(f"   catboost: {cb.__version__}")
print(f"   lightgbm: {lgb.__version__}")
print(f"   optuna: {optuna.__version__}")
print(f"   shap: {shap.__version__}")

print("✅ Section 1 Complete: Environment configured, seed set, Periospot style applied")

✅ Periospot color palette loaded:
   periospot_blue: #15365a
   mystic_blue: #003049
   periospot_red: #6c1410
   crimson_blaze: #a92a2a
   vanilla_cream: #f7f0da
   black: #000000
   white: #ffffff

📦 Package Versions:
   pandas: 2.3.2
   numpy: 2.3.5
   scikit-learn: 1.7.1
   xgboost: 3.1.1
   catboost: 1.2.8
   lightgbm: 4.6.0
   optuna: 4.6.0
   shap: 0.50.0
✅ Section 1 Complete: Environment configured, seed set, Periospot style applied


## 2️⃣ Load Configuration

**Load:** `configs/config.yaml`

**Contains:** NHANES cycles, temporal split, 15 predictors, CDC/AAP definitions, Optuna params, Periospot colors, survey weights

---

In [4]:
# Load the project configuration and pull out the temporal split.
with open("../configs/config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)
TRAIN_CYCLES = config["temporal_split"]["train"]
VAL_CYCLES = config["temporal_split"]["validation"]
TEST_CYCLES = config["temporal_split"]["test"]
print(f"Train: {TRAIN_CYCLES}, Val: {VAL_CYCLES}, Test: {TEST_CYCLES}")
print("✅ Section 2: Config loaded")

Train: ['2011_2012', '2013_2014'], Val: ['2015_2016'], Test: ['2017_2018']
✅ Section 2: Config loaded


## 3️⃣ Download NHANES Data (XPT Files)

**Download** 4 cycles × 9 components = 36 XPT files from CDC

**Method:** `pd.read_sas(url)` → save as parquet

---

In [None]:
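# TODO: Loop over cycles and download each XPT file with pd.read_sas(url).
# CYCLES, COMPONENTS, and BASE_URL are expected to be defined from the config;
# CYCLE_SUFFIX is a hypothetical cycle-to-letter mapping (e.g. "G" for 2011-2012).
# for cycle in CYCLES:
#     cycle_dir = Path(f"data/raw/{cycle}")
#     cycle_dir.mkdir(parents=True, exist_ok=True)
#     suffix = CYCLE_SUFFIX[cycle]
#     for component, file_prefix in COMPONENTS.items():
#         url = f"{BASE_URL}/{cycle}/{file_prefix}_{suffix}.XPT"
#         df = pd.read_sas(url, format="xport")
#         df.to_parquet(cycle_dir / f"{component}.parquet")
print("✅ Section 3: Data downloaded")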
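The download loop needs one file name per cycle and component. A minimal sketch of that naming step, assuming the per-cycle letter suffixes NHANES uses (G, H, I, J) and a placeholder `BASE_URL` and component table; `build_download_plan`, `CYCLE_SUFFIX`, and the two-entry `COMPONENTS` dict are hypothetical names for illustration. The URL layout, and whether every component (in particular the periodontal exam file) exists in every cycle, should be checked against the CDC site before any download is attempted.

```python
from pathlib import Path

# Assumption: placeholder base URL; confirm the current CDC layout before use.
BASE_URL = "https://wwwn.cdc.gov/Nchs/Nhanes"

# NHANES appends one letter per 2-year cycle to each data file name.
CYCLE_SUFFIX = {
    "2011_2012": "G",
    "2013_2014": "H",
    "2015_2016": "I",
    "2017_2018": "J",
}

# Hypothetical subset of the 9 components: friendly name -> NHANES file prefix.
COMPONENTS = {"demographics": "DEMO", "periodontal": "OHXPER"}


def build_download_plan(cycles, components, base_url=BASE_URL, raw_dir="data/raw"):
    """Return (url, parquet_path) pairs without touching the network."""
    plan = []
    for cycle in cycles:
        suffix = CYCLE_SUFFIX[cycle]
        # Assumption: the config writes 2011_2012 while the URL path uses 2011-2012.
        url_cycle = cycle.replace("_", "-")
        for name, prefix in components.items():
            url = f"{base_url}/{url_cycle}/{prefix}_{suffix}.XPT"
            plan.append((url, Path(raw_dir) / cycle / f"{name}.parquet"))
    return plan


plan = build_download_plan(["2011_2012", "2013_2014"], COMPONENTS)
for url, path in plan:
    print(url, "->", path)
```

Building the plan first keeps the naming logic testable offline; each URL can then be passed to `pd.read_sas` inside the loop.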

## 4️⃣ Merge Components on SEQN

**Join** all components by participant ID (SEQN)

**Filter:** Adults 30+

---

In [None]:
# TODO: Merge all components on SEQN, filter age >= 30
# Path("data/processed").mkdir(parents=True, exist_ok=True)
# for cycle in CYCLES:
#     # One DataFrame per component, all keyed by the participant ID SEQN
#     dfs = [pd.read_parquet(f"data/raw/{cycle}/{comp}.parquet") for comp in COMPONENTS]
#     merged = dfs[0]
#     for df in dfs[1:]:
#         merged = merged.merge(df, on="SEQN", how="outer")
#     # RIDAGEYR (age in years) comes from the demographics component
#     merged = merged[merged["RIDAGEYR"] >= 30]
#     merged.to_parquet(f"data/processed/{cycle}_merged.parquet")
print("✅ Section 4: Components merged")
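The merge step can be tried on toy data before touching real files. The sketch below is one possible shape for it, not the project's implementation: `merge_components` is a hypothetical helper, the values are made up, and it uses a left join onto the demographics table (instead of the outer join in the cell above) so that rows without an age are never created.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for three NHANES components (values are made up).
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [25, 34, 61, 47]})
bmx = pd.DataFrame({"SEQN": [1, 2, 3], "BMXBMI": [22.1, 27.4, 31.0]})
smq = pd.DataFrame({"SEQN": [2, 3, 4], "SMQ020": [1, 2, 2]})


def merge_components(frames, key="SEQN", min_age=30):
    """Left-join every component onto the first frame, then keep adults."""
    for frame in frames:
        # A repeated participant ID would silently multiply rows in the join.
        if frame[key].duplicated().any():
            raise ValueError(f"duplicate {key} values would multiply rows")
    merged = reduce(lambda left, right: left.merge(right, on=key, how="left"), frames)
    return merged[merged["RIDAGEYR"] >= min_age].reset_index(drop=True)


adults = merge_components([demo, bmx, smq])
print(adults)
```

Participant 1 is dropped by the age filter, and participant 4 keeps a missing `BMXBMI`, which is left for the imputation step to handle.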

## 5️⃣ Apply CDC/AAP Case Definitions

**Most Critical Section!**

**Implement** (CAL = clinical attachment loss, PD = probing depth, both measured at interproximal sites):
- Severe: CAL ≥6 mm (≥2 different teeth) AND PD ≥5 mm (≥1 site)
- Moderate: CAL ≥4 mm (≥2 different teeth) OR PD ≥5 mm (≥2 different teeth)
- Mild: (CAL ≥3 mm AND PD ≥4 mm, each on ≥2 different teeth) OR PD ≥5 mm (≥1 site)

**Use:** `src/labels.py` `label_periodontitis()`

---

In [None]:
# TODO: Apply CDC/AAP classification using src/labels.py
# from labels import label_periodontitis
# for cycle in CYCLES:
#     df = pd.read_parquet(f"data/processed/{cycle}_merged.parquet")
#     df_labeled = label_periodontitis(df)
#     df_labeled.to_parquet(f"data/processed/{cycle}_labeled.parquet")
#     print(f"{cycle} prevalence: {df_labeled['has_periodontitis'].mean():.2%}")
print("✅ Section 5: CDC/AAP labels applied")
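The three-tier rule listed in this section can be written as a small pure-Python function. This is a sketch that mirrors the bullet list above, not the contents of `src/labels.py`: `classify_periodontitis` and `teeth_meeting` are hypothetical names, the input is assumed to be already restricted to interproximal sites, and NHANES missing-value codes are assumed to have been removed. The grouping of the mild rule follows this notebook's wording and should be checked against the published CDC/AAP definition.

```python
def teeth_meeting(sites_by_tooth, index, threshold):
    """Count teeth with at least one interproximal site at or above threshold."""
    return sum(
        any(site[index] >= threshold for site in sites)
        for sites in sites_by_tooth.values()
    )


def classify_periodontitis(sites_by_tooth):
    """sites_by_tooth maps tooth id -> list of (CAL mm, PD mm) interproximal sites."""
    CAL, PD = 0, 1
    any_pd5 = teeth_meeting(sites_by_tooth, PD, 5) >= 1
    # Counting teeth (not sites) enforces the "different teeth" requirement.
    if teeth_meeting(sites_by_tooth, CAL, 6) >= 2 and any_pd5:
        return "severe"
    if teeth_meeting(sites_by_tooth, CAL, 4) >= 2 or teeth_meeting(sites_by_tooth, PD, 5) >= 2:
        return "moderate"
    cal3_and_pd4 = (
        teeth_meeting(sites_by_tooth, CAL, 3) >= 2
        and teeth_meeting(sites_by_tooth, PD, 4) >= 2
    )
    if cal3_and_pd4 or any_pd5:
        return "mild"
    return "none"


# Made-up mouths: tooth id -> [(CAL, PD), ...]
print(classify_periodontitis({2: [(6, 5), (2, 2)], 3: [(7, 3)]}))  # two teeth with CAL >= 6
print(classify_periodontitis({2: [(6, 5), (6, 5)]}))               # same tooth twice
```

Checking the tiers from most to least severe means each participant receives the highest category they qualify for; a binary `has_periodontitis` label can then be derived from whichever tiers the study counts as cases.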

## 6️⃣ Build 15 Predictors

Extract Bashir predictors from NHANES variables

---

In [None]:
# TODO: Build predictors
print("✅ Section 6: Predictors built")

## 7️⃣ Exploratory Analysis

Prevalence by cycle, missingness, drift

---

In [None]:
# TODO: EDA plots
print("✅ Section 7: EDA complete")
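The three checks named here (prevalence by cycle, missingness, drift) reduce to a few grouped summaries. A minimal sketch on made-up data, assuming a `cycle` column and hypothetical predictor names `age` and `bmi`; the standardized mean difference is only one simple way to quantify drift, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy labeled table; column names other than has_periodontitis are hypothetical.
df = pd.DataFrame({
    "cycle": ["2011_2012"] * 4 + ["2013_2014"] * 4,
    "has_periodontitis": [1, 0, 1, 1, 0, 0, 1, 0],
    "age": [45, 52, np.nan, 67, 38, 41, 59, np.nan],
    "bmi": [27.0, np.nan, 31.5, 24.2, 29.9, 26.1, 33.0, 22.8],
})

# Outcome prevalence per cycle: a large shift between cycles signals label drift.
prevalence = df.groupby("cycle")["has_periodontitis"].mean()

# Share of missing values per predictor and cycle.
missing = df[["age", "bmi"]].isna().groupby(df["cycle"]).mean()

# Simple covariate-drift check: difference of cycle means in pooled-SD units.
means = df.groupby("cycle")["bmi"].mean()
smd = (means["2013_2014"] - means["2011_2012"]) / df["bmi"].std()

print(prevalence, missing, f"bmi SMD: {smd:.2f}", sep="\n\n")
```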

## 8️⃣ Temporal Split

Train 2011-2014, Val 2015-2016, Test 2017-2018

---

In [None]:
# TODO: Split by cycle
print("✅ Section 8: Temporal split done")
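Splitting by cycle is a plain row filter with no shuffling. The sketch below assumes the merged table carries a `cycle` column (a hypothetical name) and reuses the cycle lists printed by the config cell; `temporal_split` is an illustrative helper.

```python
import pandas as pd

# Cycle lists as printed by the config cell.
TRAIN_CYCLES = ["2011_2012", "2013_2014"]
VAL_CYCLES = ["2015_2016"]
TEST_CYCLES = ["2017_2018"]

# Toy table; the "cycle" column name is an assumption.
df = pd.DataFrame({
    "SEQN": range(1, 9),
    "cycle": ["2011_2012", "2013_2014", "2011_2012", "2015_2016",
              "2015_2016", "2017_2018", "2013_2014", "2017_2018"],
})


def temporal_split(frame, cycle_col="cycle"):
    """Partition rows by survey cycle; no shuffling, no row in two splits."""
    parts = {
        "train": frame[frame[cycle_col].isin(TRAIN_CYCLES)],
        "val": frame[frame[cycle_col].isin(VAL_CYCLES)],
        "test": frame[frame[cycle_col].isin(TEST_CYCLES)],
    }
    # Every row must land in exactly one split; an unknown cycle fails loudly.
    assert sum(len(p) for p in parts.values()) == len(frame), "unassigned cycle"
    return parts


splits = temporal_split(df)
print({name: len(part) for name, part in splits.items()})
```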

## 9️⃣ Preprocessing Pipelines

Imputation + scaling (fit on train only)

---

In [None]:
# TODO: Build sklearn pipelines
print("✅ Section 9: Pipelines built")
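"Fit on train only" means the medians, means, and standard deviations are learned from the training cycles and then reused, unchanged, on later cycles. A minimal sketch with scikit-learn, assuming hypothetical predictor names (`age`, `bmi`, `smoker`) and made-up values; the imputation strategies are illustrative choices, not the project's settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor names; the real list lives in configs/config.yaml.
numeric_cols = ["age", "bmi"]
binary_cols = ["smoker"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("bin", SimpleImputer(strategy="most_frequent"), binary_cols),
])

X_train = pd.DataFrame({
    "age": [30, 40, 50, np.nan],
    "bmi": [22.0, 26.0, 30.0, 26.0],
    "smoker": [0, 1, 0, 0],
})
X_val = pd.DataFrame({
    "age": [np.nan, 60],
    "bmi": [26.0, np.nan],
    "smoker": [1, np.nan],
})

# Statistics (medians, means, standard deviations) come from the training rows only.
X_train_t = preprocessor.fit_transform(X_train)
X_val_t = preprocessor.transform(X_val)
print(X_val_t)
```

The missing validation age is filled with the training median (40) and therefore scales to zero, which shows that no validation statistic leaked into the transform.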

## 🔟 Baseline Models

LogReg, RandomForest with 5-fold CV

---

In [None]:
# TODO: Train baselines
print("✅ Section 10: Baselines trained")
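A baseline with 5-fold stratified cross-validation can be sketched as follows. The data are synthetic stand-ins generated with `make_classification`, so the printed scores say nothing about NHANES; only logistic regression is shown, and a random forest would slot into the same `cross_validate` call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training cycles: 15 predictors, imbalanced outcome.
X, y = make_classification(n_samples=600, n_features=15, n_informative=6,
                           weights=[0.6, 0.4], random_state=42)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scoring = {"auc": "roc_auc", "pr_auc": "average_precision", "brier": "neg_brier_score"}
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)

print(f"AUC    {scores['test_auc'].mean():.3f} ± {scores['test_auc'].std():.3f}")
print(f"PR-AUC {scores['test_pr_auc'].mean():.3f}")
print(f"Brier  {-scores['test_brier'].mean():.3f}")
```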

## 1️⃣1️⃣ XGBoost + Optuna

Hyperparameter search, early stopping

---

In [None]:
# TODO: Optuna tune XGBoost
print("✅ Section 11: XGBoost tuned")

## 1️⃣2️⃣ CatBoost + Optuna

Native categorical handling

---

In [None]:
# TODO: Optuna tune CatBoost
print("✅ Section 12: CatBoost tuned")

## 1️⃣3️⃣ LightGBM + Optuna

Fast gradient boosting

---

In [None]:
# TODO: Optuna tune LightGBM
print("✅ Section 13: LightGBM tuned")

## 1️⃣4️⃣ Threshold Selection

Choose policy (Youden, F1-max, Recall ≥ 0.80), freeze on Val

---

In [None]:
# TODO: Select threshold on Val
print("✅ Section 14: Threshold frozen")
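Of the policies listed, Youden's J (sensitivity + specificity − 1) is the simplest to illustrate. The sketch below is a NumPy-only stand-in, not the `select_threshold` helper from `src/evaluation.py`; `youden_threshold` is a hypothetical name and the scores are made up. It picks the cut-off on validation scores and then applies it, unchanged, to test scores.

```python
import numpy as np


def youden_threshold(y_true, scores):
    """Return the score cut-off that maximizes sensitivity + specificity - 1."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t  # predict positive at or above the cut-off
        sensitivity = (pred & y_true).sum() / y_true.sum()
        specificity = (~pred & ~y_true).sum() / (~y_true).sum()
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = float(t), float(j)
    return best_t, best_j


# Toy validation scores (made up); in the notebook these come from the 2015-2016 cycle.
y_val = [0, 0, 0, 0, 1, 1, 1, 1, 1]
p_val = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.8, 0.9, 0.95]
threshold, j_stat = youden_threshold(y_val, p_val)

# The frozen threshold is then applied unchanged to test-set scores.
p_test = np.array([0.65, 0.72, 0.40])
test_pred = (p_test >= threshold).astype(int)
print(threshold, round(j_stat, 2), test_pred)
```

On this toy data the cut-off 0.7 gives sensitivity 0.8 and specificity 1.0, so J = 0.8; no test label is consulted at any point.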

## 1️⃣5️⃣ Final Test Evaluation

Apply frozen threshold, compute all metrics

---

In [None]:
# TODO: Evaluate on Test
print("✅ Section 15: Test metrics computed")

## 1️⃣6️⃣ Calibration & Decision Curves

Isotonic/Platt scaling, net benefit

---

In [None]:
# TODO: Calibration plots
print("✅ Section 16: Calibration done")
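A decision curve plots net benefit against the threshold probability $p_t$, where net benefit is $\mathrm{TP}/N - (\mathrm{FP}/N)\cdot p_t/(1-p_t)$. A minimal sketch of that formula on made-up data, with "treat everyone" as the reference strategy; the function names are hypothetical and plotting is left out.

```python
import numpy as np


def net_benefit(y_true, probs, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold."""
    y_true = np.asarray(y_true)
    treat = np.asarray(probs) >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every participant regardless of the model."""
    prevalence = np.mean(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


y = [1, 1, 0, 0]
p = [0.9, 0.4, 0.6, 0.1]
for pt in (0.2, 0.5):
    print(pt, net_benefit(y, p, pt), net_benefit_treat_all(y, pt))
```

A model is useful at a given threshold only where its curve lies above both the treat-all line and zero (treat nobody).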

## 1️⃣7️⃣ SHAP Interpretability

Beeswarm + bar plots

---

In [None]:
# TODO: SHAP analysis
print("✅ Section 17: SHAP complete")

## 1️⃣8️⃣ Survey Weights Sensitivity

Weighted prevalence with WTMEC2YR

---

In [None]:
# TODO: Weighted stats
print("✅ Section 18: Survey weights applied")
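A weighted prevalence is a weighted mean of the binary label. The sketch below uses made-up labels and weights and a hypothetical helper `weighted_prevalence`; it gives a point estimate only, and design-based standard errors would additionally need the survey's strata and PSU variables, which are not handled here.

```python
import numpy as np


def weighted_prevalence(labels, weights, n_cycles=1):
    """Weighted share of positive labels using NHANES exam weights."""
    labels = np.asarray(labels, dtype=float)
    # When pooling 2-year cycles, each weight is divided by the number of cycles;
    # this rescaling matters for population totals and cancels out in a proportion.
    weights = np.asarray(weights, dtype=float) / n_cycles
    return float(np.sum(labels * weights) / np.sum(weights))


# Made-up labels and WTMEC2YR-style weights for four participants.
has_periodontitis = [1, 0, 0, 1]
wtmec2yr = [1000.0, 3000.0, 4000.0, 2000.0]

print("unweighted:", np.mean(has_periodontitis))
print("weighted:  ", weighted_prevalence(has_periodontitis, wtmec2yr))
```

Here the unweighted prevalence is 0.5 while the weighted one is 0.3, because the two positive participants represent fewer people in the population than the two negative ones.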

## 1️⃣9️⃣ Save Artifacts

Export model, metrics, HF model card

---

In [None]:
# TODO: Save all artifacts
print("✅ Section 19: Artifacts saved")

## 2️⃣0️⃣ Reproducibility Log

Package versions, git hash, system info

---

In [None]:
# TODO: Log system info
print("✅ Section 20: Reproducibility logged")
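The three items named here can be collected with the standard library alone. This is a sketch, not the `log_versions` helper from `src/utils.py`; `build_repro_log`, `git_commit_hash`, and `package_version` are hypothetical names, and the package list is only an example.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone
from importlib import metadata


def git_commit_hash():
    """Return the current commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_repro_log(packages, seed):
    """Collect run metadata into a JSON-serializable dict."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "git_commit": git_commit_hash(),
        "random_seed": seed,
        "packages": {name: package_version(name) for name in packages},
    }


log = build_repro_log(["pandas", "numpy", "scikit-learn", "xgboost"], seed=42)
print(json.dumps(log, indent=2))
```

Returning `None` for a missing package or an absent git checkout keeps the log writable in any environment while making the gap visible in the saved file.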