# Organ Aging Analysis with NHANES: Project Overview

## Context and Motivation

Aging is not a uniform process across the body. Different organs and systems age at different rates, and understanding these **differential aging patterns** can provide insights into:

- Individual health risks
- Biological vs. chronological age discrepancies
- Organ-specific interventions for healthy aging

This project uses **NHANES** (National Health and Nutrition Examination Survey) data to build **organ-specific clocks**: machine learning models that predict chronological age from the biomarkers of a single organ or system.

## Key Concepts

### Organ Clock
A supervised ML model trained to predict chronological age from organ-specific biomarkers:

```
f_organ(biomarkers_organ, covariates) → predicted_age
```
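
As a minimal sketch of the idea, the example below fits a linear clock on synthetic data. Everything in it is an assumption made for illustration: the three biomarkers are invented, NumPy least squares stands in for the project's actual regressors, and the error is measured in-sample only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 people and 3 invented biomarkers.
n = 500
age = rng.uniform(20, 80, size=n)
biomarkers = np.column_stack([
    0.05 * age + rng.normal(0, 0.5, n),   # rises with age
    -0.03 * age + rng.normal(0, 0.5, n),  # falls with age
    rng.normal(0, 1.0, n),                # unrelated to age
])

def fit_linear_clock(X, y):
    """Fit y ~ X @ w + b by ordinary least squares and return (w, b)."""
    X1 = np.column_stack([X, np.ones(len(X))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[:-1], coef[-1]

def predict_age(X, w, b):
    """Apply the fitted clock to a biomarker matrix."""
    return X @ w + b

w, b = fit_linear_clock(biomarkers, age)
predicted_age = predict_age(biomarkers, w, b)
mae = np.mean(np.abs(predicted_age - age))
print(f"In-sample MAE: {mae:.1f} years")
```

A real clock would be scored on held-out participants, which is what the train/val/test split in the pipeline is for.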

### Biological Age
For a given individual and organ, the age predicted by the organ clock is taken as the **biological age of that organ**.

### Age Gap
The difference between biological and chronological age:

```
AgeGap_organ = BiologicalAge_organ - ChronologicalAge
```

- **Positive gap**: the organ looks older than expected for the person's chronological age
- **Negative gap**: the organ looks younger than expected for the person's chronological age
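
A small worked example of the formula, using made-up numbers (the ages and liver clock outputs below are hypothetical, not NHANES results):

```python
import numpy as np

# Hypothetical values for three people
chronological_age = np.array([35.0, 52.0, 68.0])
biological_age_liver = np.array([41.5, 50.0, 68.0])  # assumed clock outputs

# AgeGap_organ = BiologicalAge_organ - ChronologicalAge
age_gap_liver = biological_age_liver - chronological_age

def describe_gap(gap):
    """Translate the sign of an age gap into words."""
    if gap > 0:
        return "older than expected"
    if gap < 0:
        return "younger than expected"
    return "as expected"

for age, gap in zip(chronological_age, age_gap_liver):
    print(f"age {age:.0f}: gap {gap:+.1f} years ({describe_gap(gap)})")
```

The three gaps are +6.5, -2.0 and 0.0 years.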

## NHANES Limitation

NHANES data is **cross-sectional** (not longitudinal). Each participant is examined once, so we observe different individuals at different ages but never follow the same individual over time.

→ Our trajectory analysis is therefore **pseudo-longitudinal**: age bins approximate aging trends across the population, not within individuals.
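
The binning step could look like the sketch below. The data frame, the column names `age` and `age_gap_liver`, and the 20-year bin width are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-person table: one row per participant
df = pd.DataFrame({
    "age": [22, 27, 34, 38, 45, 49, 56, 58, 63, 71],
    "age_gap_liver": [-1.0, 0.5, 0.0, 1.0, 2.0, 1.0, 3.0, 2.0, 4.0, 5.0],
})

# Left-closed bins: [20, 40), [40, 60), [60, 80)
bins = [20, 40, 60, 80]
labels = ["20-39", "40-59", "60-79"]
df["age_bin"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# Mean age gap per bin gives the pseudo-longitudinal "trajectory";
# the count shows how many different people back each point.
trajectory = (
    df.groupby("age_bin", observed=True)["age_gap_liver"]
      .agg(["mean", "count"])
)
print(trajectory)
```

On this toy table the mean gap rises from 0.125 to 2.0 to 4.5 years across the three bins. Each bin contains different people, so the curve describes a population trend and may also carry cohort effects.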

## Project Pipeline

```
1. Data Preparation
   ├─ Load NHANES files (XPT/CSV)
   ├─ Merge tables on SEQN
   └─ Clean & preprocess

2. Feature Engineering
   ├─ Define organ panels (biomarker groups)
   ├─ Build feature matrices per organ
   └─ Train/val/test split

3. Model Training (per organ)
   ├─ Baseline: Linear regression
   ├─ Non-linear: Gradient Boosting
   └─ Evaluation & comparison

4. Age Gap Analysis
   ├─ Compute biological ages
   ├─ Calculate age gaps
   └─ Identify patterns

5. Exploration
   ├─ Pseudo-longitudinal trajectories
   ├─ Co-occurrence analysis
   └─ Clustering (UMAP + KMeans)

6. Storytelling
   └─ Synthesize findings for presentation
```
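
Step 1 hinges on joining tables by `SEQN`, the NHANES respondent sequence number. The sketch below shows one way to do that with pandas. The three small frames are invented stand-ins for real files, and the non-key column names are only illustrative.

```python
from functools import reduce

import pandas as pd

# Invented stand-ins for three NHANES tables. Real XPT files could be read
# with pd.read_sas(path, format="xport") instead.
demo = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [34, 61, 47]})
biochem = pd.DataFrame({"SEQN": [1, 2, 4], "LBXSATSI": [22.0, 35.0, 18.0]})
cbc = pd.DataFrame({"SEQN": [1, 2, 3], "LBXWBCSI": [6.1, 7.4, 5.2]})

def merge_on_seqn(frames, how="inner"):
    """Join a list of tables on SEQN, one after another."""
    return reduce(lambda left, right: left.merge(right, on="SEQN", how=how), frames)

merged = merge_on_seqn([demo, biochem, cbc])
print(merged)
```

An inner join keeps only participants present in every table (here `SEQN` 1 and 2). Passing `how="left"` would keep every row of the first table and leave missing measurements as `NaN`, to be handled during cleaning.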

## Expected Organs/Systems

- **Liver**: ALT, AST, GGT, Albumin, Bilirubin
- **Kidney**: Creatinine, BUN, Uric Acid, Albumin/Creatinine ratio
- **Cardio-metabolic**: Blood pressure, cholesterol, triglycerides, glucose, HbA1c
- **Immune**: WBC, lymphocytes, neutrophils, monocytes
- **Hematologic**: RBC, hemoglobin, hematocrit, MCV, platelets
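
One way to turn such panels into per-organ feature matrices is sketched below. The lowercase column names, the `sex` covariate, and the choice to drop incomplete rows are all assumptions for illustration. The project's real panels live in `configs/organ_panels.yaml`.

```python
import pandas as pd

# Hypothetical column names mirroring two of the panels listed above
ORGAN_PANELS = {
    "liver": ["alt", "ast", "ggt", "albumin", "bilirubin"],
    "kidney": ["creatinine", "bun", "uric_acid", "acr"],
}
GLOBAL_COVARIATES = ["sex"]  # assumed covariate shared by all organs

def build_feature_matrix(df, organ, panels=ORGAN_PANELS, covariates=GLOBAL_COVARIATES):
    """Select one organ's biomarkers plus covariates, dropping incomplete rows."""
    cols = panels[organ] + covariates
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"Columns missing for {organ}: {missing}")
    return df[cols].dropna()

# Toy table with liver markers only; the last row lacks ALT
df = pd.DataFrame({
    "alt": [20.0, 31.0, None],
    "ast": [18.0, 25.0, 22.0],
    "ggt": [15.0, 40.0, 30.0],
    "albumin": [4.4, 4.1, 4.3],
    "bilirubin": [0.6, 0.9, 0.7],
    "sex": [0, 1, 1],
})

X_liver = build_feature_matrix(df, "liver")
print(X_liver.shape)
```

Asking for `"kidney"` on this toy table raises a `KeyError` listing the absent columns, which surfaces a misconfigured panel early.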

---

## Explainability Focus

This project emphasizes **interpretability**:
- Baseline vs non-linear model comparison
- Feature importance analysis
- SHAP values for biomarker contributions
- Clear visualizations at every step
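
To give a feel for feature importance analysis, here is a hand-rolled permutation importance on synthetic data. It is not SHAP and not necessarily what the project's `explainability` module does. The two markers are invented, the model is a plain least-squares fit, and the scores are computed in-sample for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: one marker tracks age, the other is pure noise
n = 400
age = rng.uniform(20, 80, n)
X = np.column_stack([
    0.1 * age + rng.normal(0, 1.0, n),
    rng.normal(0, 1.0, n),
])
feature_names = ["informative_marker", "noise_marker"]

# Fit a linear clock with an intercept
coef, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(n)]), age, rcond=None)

def mean_abs_error(X_eval):
    """MAE of the fitted clock on a (possibly permuted) feature matrix."""
    pred = np.column_stack([X_eval, np.ones(len(X_eval))]) @ coef
    return np.mean(np.abs(pred - age))

baseline = mean_abs_error(X)

# Importance = how much the error grows when one column is shuffled
importance = {}
for j, name in enumerate(feature_names):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importance[name] = mean_abs_error(X_perm) - baseline

for name, value in importance.items():
    print(f"{name}: +{value:.2f} years MAE when shuffled")
```

Shuffling a biomarker breaks its link to age while keeping its distribution, so a large increase in error marks a feature the clock relies on.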

---

## Setup and Installation

In [13]:
# Install required packages (if needed)
# !pip install -r ../requirements.txt

In [14]:
# Setup paths - works regardless of kernel working directory
import sys
from pathlib import Path

# __file__ is undefined in Jupyter, so start from the kernel's working directory.
# If that directory is notebooks/ (or any folder whose parent contains src/),
# the project root is its parent; otherwise assume we are already at the root.
notebook_path = Path.cwd().resolve()
if notebook_path.name == 'notebooks' or (notebook_path.parent / 'src').exists():
    project_root = notebook_path.parent
else:
    project_root = notebook_path

# Add src to path if not already there
src_path = project_root / 'src'
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

print(f"✓ Project root: {project_root}")
print(f"✓ Source path: {src_path}")

# Now import our modules
from organ_aging import config

# Import standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✓ Setup complete")

✓ Project root: C:\Users\bastien\Documents\TAF\Hackathon\Vitalist
✓ Source path: C:\Users\bastien\Documents\TAF\Hackathon\Vitalist\src
✓ Setup complete
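
If the notebook may be launched from deeper subfolders, a more general alternative is to walk up the directory tree until a marker directory appears. The helper below is a hypothetical sketch, not part of `organ_aging`. It assumes the project root is the nearest ancestor containing `src/`, and the demo runs on a temporary directory tree.

```python
import tempfile
from pathlib import Path

def find_project_root(start, marker="src"):
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")

# Demo on a throwaway tree: <tmp>/src and <tmp>/notebooks/drafts
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    (root / "notebooks" / "drafts").mkdir(parents=True)
    found_from_nested = find_project_root(root / "notebooks" / "drafts")
    found_from_root = find_project_root(root)

print(found_from_nested == root, found_from_root == root)
```

In the notebook this would be called as `find_project_root(Path.cwd())`.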


## Verify Configuration Files

In [15]:
# Check configuration files exist (resolved from project_root, not the kernel's cwd)
config_dir = project_root / 'configs'

paths_config_file = config_dir / 'paths.yaml'
organs_config_file = config_dir / 'organ_panels.yaml'

if paths_config_file.exists():
    print(f"✓ Found paths configuration: {paths_config_file}")
else:
    print(f"✗ Missing paths configuration: {paths_config_file}")

if organs_config_file.exists():
    print(f"✓ Found organ panels configuration: {organs_config_file}")
else:
    print(f"✗ Missing organ panels configuration: {organs_config_file}")

✓ Found paths configuration: c:\Users\bastien\Documents\TAF\Hackathon\Vitalist\configs\paths.yaml
✓ Found organ panels configuration: c:\Users\bastien\Documents\TAF\Hackathon\Vitalist\configs\organ_panels.yaml


In [16]:
# Load and display configurations
try:
    paths_config = config.load_paths_config(str(paths_config_file))
    print("\n=== Paths Configuration ===")
    print(f"Raw data directory: {paths_config['raw_data_dir']}")
    print(f"Number of NHANES files: {len(paths_config['nhanes_files'])}")
    
    organs_config = config.load_organ_panels_config(str(organs_config_file))
    print("\n=== Organ Panels Configuration ===")
    # Top-level keys that hold settings rather than organ panels
    non_organ_keys = {'global_covariates', 'target_variable', 'preprocessing'}
    organs = [k for k in organs_config if k not in non_organ_keys]
    print(f"Organs defined: {', '.join(organs)}")
    
    for organ in organs:
        n_biomarkers = len(organs_config[organ])
        print(f"  {organ}: {n_biomarkers} biomarkers")
    
    print("\n✓ Configurations loaded successfully")
    
except Exception as e:
    print(f"✗ Error loading configurations: {e}")


=== Paths Configuration ===
Raw data directory: data/raw
Number of NHANES files: 9

=== Organ Panels Configuration ===
Organs defined: liver, kidney, cardio_metabolic, immune, hematologic
  liver: 8 biomarkers
  kidney: 5 biomarkers
  cardio_metabolic: 10 biomarkers
  immune: 6 biomarkers
  hematologic: 8 biomarkers

✓ Configurations loaded successfully


## Project Structure

```
Vitalist/
├── data/
│   ├── raw/           # NHANES files (not in repo)
│   ├── interim/       # Cleaned data
│   └── processed/     # Feature matrices
├── configs/
│   ├── paths.yaml
│   └── organ_panels.yaml
├── notebooks/
│   ├── 00_overview_and_setup.ipynb
│   ├── 01_nhanes_data_preparation.ipynb
│   ├── 02_feature_engineering_organs.ipynb
│   ├── 03_train_organ_clocks.ipynb
│   ├── 04_analyze_agegaps.ipynb
│   ├── 05_trajectories_and_clustering.ipynb
│   └── 06_jury_storytelling_report.ipynb
├── src/organ_aging/
│   ├── config.py
│   ├── data_loading.py
│   ├── preprocessing.py
│   ├── features.py
│   ├── models.py
│   ├── evaluation.py
│   ├── explainability.py
│   ├── analysis.py
│   ├── visualization.py
│   └── clustering.py
├── models/            # Trained models
├── tests/             # Unit tests
├── README.md
└── requirements.txt
```

## Next Steps

1. **Notebook 01**: Load and prepare NHANES data
2. **Notebook 02**: Engineer features for each organ
3. **Notebook 03**: Train organ clock models
4. **Notebook 04**: Analyze age gaps
5. **Notebook 05**: Explore trajectories and clustering
6. **Notebook 06**: Create presentation-ready report

→ **Continue to `01_nhanes_data_preparation.ipynb`**