# Blood Pressure Prediction using NHANES Data
## EHR Inference System for Baseline Systolic BP Estimation

### Overview
This notebook builds a machine learning pipeline that predicts baseline systolic blood pressure (SBP) for patients who are not on antihypertensive medication, using NHANES data. An EHR logic gate decides whether to:
- **Skip inference** for patients already on blood pressure medications
- **Run inference** to predict baseline SBP and risk status for untreated patients

### Key Components
1. Data Integration: Combines demographic, examination, laboratory, and medication data
2. Preprocessing: Imputes missing values with the column median and standardizes features
3. Model Training: Fits a Random Forest regressor on the untreated population
4. EHR Logic: Decision workflow for clinical decision support
5. Testing: Runs the pipeline on one treated and one untreated sample case

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

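Note: this cell raised `ModuleNotFoundError: No module named 'sklearn'` in the environment where the notebook was last run. Install scikit-learn (for example, `pip install scikit-learn`) before running the remaining cells.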

## 2. Load and Explore NHANES Data

We load four datasets:
- **demographic.csv**: Patient demographics (SEQN, age, gender)
- **examination.csv**: Physical examination results (BP, BMI, waist circumference)
- **labs.csv**: Laboratory results (total cholesterol)
- **medications.csv**: Current medication information
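
As a hedged sketch of the loading pattern, `usecols` lets `pd.read_csv` parse only the columns the pipeline needs instead of reading the full file and subsetting afterwards. The inline CSV text, its values, and the `EXTRA` column are made-up stand-ins for `examination.csv`; only the column names `SEQN`, `BPXSY1`, `BMXBMI`, and `BMXWAIST` come from this notebook.

```python
import io

import pandas as pd

# Made-up stand-in for examination.csv (values are illustrative only).
fake_exam_csv = io.StringIO(
    "SEQN,BPXSY1,BMXBMI,BMXWAIST,EXTRA\n"
    "1,120,24.5,88.0,x\n"
    "2,,31.2,101.5,y\n"
)

exam_cols = ["SEQN", "BPXSY1", "BMXBMI", "BMXWAIST"]

# usecols restricts parsing to the listed columns; the EXTRA column is never loaded.
exam_sample = pd.read_csv(fake_exam_csv, usecols=exam_cols)

print(exam_sample.shape)
print(exam_sample.isna().sum())  # the empty BPXSY1 field is read as NaN
```

Because `read_csv` raises an error when a name in `usecols` is absent from the file, this pattern also acts as an early check that the expected columns exist.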

In [None]:
print("Loading NHANES datasets...")
demo = pd.read_csv('med_data/demographic.csv', encoding='latin-1')[['SEQN', 'RIDAGEYR', 'RIAGENDR']]
exam = pd.read_csv('med_data/examination.csv', encoding='latin-1')[['SEQN', 'BPXSY1', 'BMXBMI', 'BMXWAIST']]
labs = pd.read_csv('med_data/labs.csv', encoding='latin-1')[['SEQN', 'LBXTC']]
meds = pd.read_csv('med_data/medications.csv', encoding='latin-1')

print(f"\nDemographic data shape: {demo.shape}")
print(demo.head())
print(f"\nExamination data shape: {exam.shape}")
print(exam.head())
print(f"\nLabs data shape: {labs.shape}")
print(labs.head())
print(f"\nMedications data shape: {meds.shape}")
print(meds.head())

## 3. Identify Patients on Blood Pressure Medications

We identify patients taking antihypertensive medications by searching for specific drug keywords in the medications dataset.

**Methodology:**
- Create a list of common BP medication names (ACE inhibitors, ARBs, beta-blockers, diuretics, etc.)
- Match those names against RXDDRUG (drug name), and match the word "hypertension" against RXDRSD1 (reason for use)
- Extract unique patient IDs (SEQN) who match these criteria
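
The matching rules above can be sketched on a toy table to show what the mask does and does not catch. The rows below are invented for illustration, and the keyword list is shortened; only the column names follow the notebook.

```python
import pandas as pd

# Made-up medication rows; only the column names come from the notebook.
toy_meds = pd.DataFrame({
    "SEQN": [1, 1, 2, 3, 4],
    "RXDDRUG": ["LISINOPRIL", "ASPIRIN", "metoprolol tartrate", None, "IBUPROFEN"],
    "RXDRSD1": [None, None, None, "Essential (primary) hypertension", "Pain"],
})

keywords = ["LISINOPRIL", "METOPROLOL"]  # shortened list for this sketch

# case=False makes the match case-insensitive; na=False treats missing text as "no match".
name_hit = toy_meds["RXDDRUG"].str.contains("|".join(keywords), case=False, na=False)
reason_hit = toy_meds["RXDRSD1"].str.contains("hypertension", case=False, na=False)

# A patient counts as treated if any of their rows matches either rule.
toy_treated = sorted(toy_meds.loc[name_hit | reason_hit, "SEQN"].unique().tolist())
print(toy_treated)
```

Patient 3 is caught only through the reason-for-use field, and patient 4 matches neither rule. Substring matching is permissive, so any reason text containing "hypertension" (for example, a pulmonary hypertension entry) would also match.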

In [None]:
# Define common antihypertensive medications
bp_med_keywords = [
    'LISINOPRIL', 'AMLODIPINE', 'METOPROLOL', 'HYDROCHLOROTHIAZIDE', 
    'LOSARTAN', 'ATENOLOL', 'ENALAPRIL', 'FUROSEMIDE', 'PROPRANOLOL',
    'VALSARTAN', 'DILTIAZEM', 'CARVEDILOL', 'SPIRONOLACTONE'
]

print(f"Searching for {len(bp_med_keywords)} antihypertensive medications...")
print(f"Keywords: {', '.join(bp_med_keywords)}\n")

# Identify patients taking BP medications
# Search in drug name (RXDDRUG) and reason for use (RXDRSD1)
is_bp_med = meds['RXDDRUG'].str.contains('|'.join(bp_med_keywords), case=False, na=False) | \
            meds['RXDRSD1'].str.contains('hypertension', case=False, na=False)

treated_seqns = meds[is_bp_med]['SEQN'].unique()
print(f"Total patients in medications data: {meds['SEQN'].nunique()}")
print(f"Patients on BP medications identified: {len(treated_seqns)}")
print(f"Sample treated patient SEQNs: {treated_seqns[:10]}")

## 4. Integrate and Prepare Data

Merge all datasets on patient ID (SEQN) and create a binary medication flag.

**Key Points:**
- Merge on SEQN to align all patient measurements
- Create `on_bp_meds` flag: 1 if patient is on BP medications, 0 otherwise
- This flag determines whether to run inference or skip
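
One detail worth seeing on toy data: `merge` defaults to an inner join, so a patient missing from any one table drops out of the integrated dataset. The sketch below uses invented rows and a hypothetical `toy_treated_ids` list to compare inner and left joins and to build the flag.

```python
import pandas as pd

# Invented rows; patient 3 has demographics but no examination record.
toy_demo = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [45, 60, 30]})
toy_exam = pd.DataFrame({"SEQN": [1, 2], "BPXSY1": [118.0, 142.0]})
toy_treated_ids = [2]  # hypothetical list of treated patient IDs

# Inner join keeps only patients present in both tables.
inner = toy_demo.merge(toy_exam, on="SEQN", how="inner")

# Left join keeps every demographic row and fills missing exam values with NaN.
left = toy_demo.merge(toy_exam, on="SEQN", how="left")

# Binary flag: 1 if the patient ID appears in the treated list, else 0.
inner["on_bp_meds"] = inner["SEQN"].isin(toy_treated_ids).astype(int)

print(len(inner), len(left))
print(inner)
```

Comparing row counts before and after the merge is a quick way to see how many patients the inner join removed.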

In [None]:
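# Merge datasets on SEQN (patient ID).
# merge() defaults to an inner join, so only patients present in all three tables are kept.
df = demo.merge(exam, on='SEQN').merge(labs, on='SEQN')

# Create medication status flag.
# Patients who do not appear in treated_seqns (including those absent from meds) get 0.
df['on_bp_meds'] = df['SEQN'].isin(treated_seqns).astype(int)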

print(f"Integrated dataset shape: {df.shape}")
print(f"\nDataset overview:")
print(df.head())

print(f"\nMedication status distribution:")
print(df['on_bp_meds'].value_counts())
print(f"  - Untreated (0): {(df['on_bp_meds'] == 0).sum()} patients")
print(f"  - Treated (1): {(df['on_bp_meds'] == 1).sum()} patients")

print(f"\nMissing values:")
print(df.isnull().sum())

## 5. Train Random Forest Model on Untreated Population

**Rationale for Training on Untreated Only:**
- We want to predict *baseline* systolic BP (the value expected without medication)
- Treated patients' readings are lowered by medication, so training on them would confound the predictions
- Training on the untreated population lets the model learn BP relationships that medication has not altered

**Features Used:**
- `RIDAGEYR`: Age in years
- `RIAGENDR`: Gender (1 = male, 2 = female)
- `BMXBMI`: Body Mass Index (kg/m²)
- `BMXWAIST`: Waist circumference (cm)
- `LBXTC`: Total cholesterol (mg/dL)

**Target:**
- `BPXSY1`: Systolic Blood Pressure, first reading (mmHg)
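
Before these features reach the model, the pipeline fills missing values with the column median and standardizes each column. The NumPy sketch below shows what those two steps compute on an invented two-column matrix. It is an illustration of the arithmetic, not the notebook's own preprocessing code, and it computes statistics on the whole toy matrix, whereas the pipeline fits them on the training split only.

```python
import numpy as np

# Invented feature matrix: column 0 is complete, column 1 has one missing value.
X_toy = np.array([
    [40.0, 22.0],
    [50.0, np.nan],
    [60.0, 30.0],
    [70.0, 26.0],
])

# Median imputation: replace each NaN with its column's median (NaNs ignored).
col_medians = np.nanmedian(X_toy, axis=0)
X_imputed = np.where(np.isnan(X_toy), col_medians, X_toy)

# Standardization: subtract the column mean and divide by the population
# standard deviation (ddof=0), giving each column mean 0 and unit variance.
col_means = X_imputed.mean(axis=0)
col_stds = X_imputed.std(axis=0)
X_scaled = (X_imputed - col_means) / col_stds

print(X_imputed)
print(X_scaled.round(3))
```

Tree-based models such as Random Forest do not depend on feature scale, so the standardization step mainly keeps the preprocessing reusable if a scale-sensitive model is tried later.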

In [None]:
# Filter to untreated population only
train_df = df[df['on_bp_meds'] == 0].dropna(subset=['BPXSY1'])
print(f"Training set size (untreated patients with BP data): {len(train_df)}")

# Define features and target
features = ['RIDAGEYR', 'RIAGENDR', 'BMXBMI', 'BMXWAIST', 'LBXTC']
X = train_df[features]
y = train_df['BPXSY1']

print(f"\nFeatures: {features}")
print(f"Target: BPXSY1 (Systolic BP)")
print(f"\nFeature statistics (before preprocessing):")
print(X.describe())

# Initialize preprocessors
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"\nTrain set: {X_train.shape[0]}, Test set: {X_test.shape[0]}")

# Preprocessing
X_train_processed = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_processed = scaler.transform(imputer.transform(X_test))

# Train Random Forest model
print("\nTraining Random Forest Regressor...")
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_processed, y_train)

# Evaluate
mae = mean_absolute_error(y_test, model.predict(X_test_processed))
print(f"Model trained successfully!")
print(f"Test MAE: {mae:.2f} mmHg")

# Feature importance
print("\nFeature Importance:")
for feat, importance in zip(features, model.feature_importances_):
    print(f"  {feat}: {importance:.4f}")

## 6. Implement EHR Inference Logic

The logic gate implements two decision paths:

1. **If patient is on BP medication (on_bp_meds == 1):**
   - Action: SKIP_INFERENCE
   - Reason: Predictions would be confounded by medication effect
   - Recommendation: Monitor treatment efficacy via clinical readings

2. **If patient is not on BP medication (on_bp_meds == 0):**
   - Action: RUN_INFERENCE
   - Predict baseline systolic BP using the trained model
   - Risk stratification: "High" if predicted SBP > 130 mmHg, "Normal" otherwise
   - Recommendation: Screen for hypertension if not already diagnosed
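
The two decision paths can be sketched without a trained model by passing in a stub predictor. The `gate` function, its `predict_sbp` parameter, and the stub values below are hypothetical and exist only to make the branching and the cutoff behavior visible.

```python
def gate(patient, predict_sbp, threshold=130.0):
    """Hypothetical stand-alone version of the decision gate."""
    if patient["on_bp_meds"] == 1:
        # Path 1: medicated patients bypass the model entirely.
        return {"action": "SKIP_INFERENCE"}
    # Path 2: untreated patients get a prediction and a risk label.
    sbp = predict_sbp(patient)
    return {
        "action": "RUN_INFERENCE",
        "predicted_sbp": round(sbp, 1),
        "risk_status": "High" if sbp > threshold else "Normal",
    }


# Stub predictors stand in for the trained model (values are made up).
treated = gate({"on_bp_meds": 1}, lambda p: 150.0)
at_cutoff = gate({"on_bp_meds": 0}, lambda p: 130.0)
above = gate({"on_bp_meds": 0}, lambda p: 130.5)

print(treated)
print(at_cutoff)
print(above)
```

Because the comparison is strict (`>`), a prediction of exactly 130 mmHg is labeled "Normal"; only values above the cutoff are labeled "High".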

In [None]:
def ehr_inference_logic(patient_data):
    """
    Simulates the logic gate in an EHR app.
    
    Parameters:
    -----------
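    patient_data : dict
        Must contain 'SEQN', 'on_bp_meds', and a value for every name in `features`
        (feature values may be NaN; they are imputed before prediction)
        
    Returns:
    --------
    dict : 'action' and 'recommendation', plus 'reason' when inference is skipped
        or 'predicted_sbp' and 'risk_status' when inference runs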
    """
    # Rows converted with .to_dict() carry SEQN as a float; cast so the ID prints as an integer
    print(f"\n--- Processing Patient ID: {int(patient_data['SEQN'])} ---")
    
    if patient_data['on_bp_meds'] == 1:
        # SKIP INFERENCE FOR MEDICATED PATIENTS
        return {
            "action": "SKIP_INFERENCE",
            "reason": "Patient is already on antihypertensive medication.",
            "recommendation": "Monitor treatment efficacy via clinical readings."
        }
    else:
        # RUN INFERENCE FOR UNTREATED PATIENTS
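        # Prepare a one-row DataFrame with the training column names,
        # so the imputer sees the same feature names it was fitted on
        feat_vals = pd.DataFrame([[patient_data[f] for f in features]], columns=features)
        feat_prepped = scaler.transform(imputer.transform(feat_vals))
        prediction = float(model.predict(feat_prepped)[0])  # plain float, so the result dict prints cleanly
        
        return {
            "action": "RUN_INFERENCE",
            "predicted_sbp": round(prediction, 1),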
            "risk_status": "High" if prediction > 130 else "Normal",
            "recommendation": "Screen for hypertension if not already diagnosed."
        }

print("EHR inference logic function defined successfully!")

## 7. Test Inference Pipeline with Sample Cases

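Smoke-test the complete EHR inference pipeline on two rows from the integrated dataset. The untreated row may also have been part of the training data, so this checks the control flow, not model accuracy: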
1. A patient **on** antihypertensive medication (should skip inference)
2. A patient **not on** antihypertensive medication (should run inference)

In [None]:
# Extract sample patients
treated_available = df[df['on_bp_meds'] == 1]
untreated_available = df[df['on_bp_meds'] == 0]

if len(treated_available) > 0 and len(untreated_available) > 0:
    sample_treated = treated_available.iloc[0].to_dict()
    sample_untreated = untreated_available.iloc[0].to_dict()
    
    print("="*70)
    print("CASE 1: PATIENT ON BLOOD PRESSURE MEDICATION")
    print("="*70)
    result_treated = ehr_inference_logic(sample_treated)
    print(f"Result: {result_treated}\n")
    
    print("="*70)
    print("CASE 2: PATIENT NOT ON BLOOD PRESSURE MEDICATION")
    print("="*70)
    result_untreated = ehr_inference_logic(sample_untreated)
    print(f"Result: {result_untreated}\n")
    
    print("="*70)
    print("SUMMARY")
    print("="*70)
    print(f"Treated patient (on meds): Action = {result_treated['action']}")
    print(f"Untreated patient (no meds): Action = {result_untreated['action']}")
    if result_untreated['action'] == 'RUN_INFERENCE':
        print(f"  └─ Predicted SBP: {result_untreated['predicted_sbp']} mmHg")
        print(f"  └─ Risk Status: {result_untreated['risk_status']}")
else:
    print("Warning: Not enough sample data to test both cases.")