# Heart Disease Population Survey Comparative Utility Analysis (CDC BRFSS 2022)

## Executive Summary
- Dataset: CDC BRFSS 2022 (cleaned, no missing values)
- Objective: Compare the clinical utility of two subsets of population features for discriminating prior heart attack occurrence.
- Target variable: HadHeartAttack (Yes/No)
- Models evaluated: Logistic Regression
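- Key finding: Adding Life's Essential 8 proxy variables to an age-and-sex baseline improved cross-validated ROC-AUC from ≈ 0.75 to ≈ 0.79.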

This analysis evaluates whether self-reported health and lifestyle indicators provide incremental value beyond basic demographics for discriminating prior heart attack status using BRFSS survey data. Logistic regression models were evaluated using stratified cross-validation, with performance assessed primarily via ROC-AUC and secondarily via precision–recall AUC to account for class imbalance. A baseline model using age and sex achieved moderate discriminative performance (ROC-AUC ≈ 0.75). Incorporating Life’s Essential 8 proxy variables resulted in a consistent improvement in discrimination (ROC-AUC ≈ 0.79; Δ ≈ +0.04), with corresponding gains in precision–recall performance. Holdout evaluation produced comparable results, supporting the stability of the cross-validated estimates. Overall, findings indicate that self-reported survey data contain meaningful incremental signal for population-level discrimination of heart attack status, though predictive precision remains limited by outcome prevalence.
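
The comparison described above can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: the data are synthetic, and the column names (`age`, `sex`, `smoker`, `bmi`), effect sizes, and helper functions `make_model` and `cv_scores` are assumptions introduced only so the example runs end to end. It fits a baseline and an extended logistic regression under the same stratified folds and reports mean ROC-AUC and precision–recall AUC for each.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the survey data. Column names and effect sizes are
# hypothetical; they are NOT taken from BRFSS.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "sex": rng.choice(["Female", "Male"], n),
    "smoker": rng.choice(["No", "Yes"], n, p=[0.8, 0.2]),
    "bmi": rng.normal(28, 5, n),
})
logit = (-6.5 + 0.05 * df["age"] + 0.5 * (df["sex"] == "Male")
         + 1.2 * (df["smoker"] == "Yes") + 0.08 * (df["bmi"] - 28))
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)


def make_model(numeric, categorical):
    """Scale numeric columns, one-hot encode categorical ones, then fit LR."""
    pre = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    return Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])


def cv_scores(numeric, categorical):
    """Mean ROC-AUC and PR-AUC over 5 stratified folds for one feature subset."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    res = cross_validate(
        make_model(numeric, categorical),
        df[numeric + categorical],
        y,
        cv=cv,
        scoring=["roc_auc", "average_precision"],
    )
    return {
        "roc_auc": res["test_roc_auc"].mean(),
        "pr_auc": res["test_average_precision"].mean(),
    }


baseline = cv_scores(["age"], ["sex"])
extended = cv_scores(["age", "bmi"], ["sex", "smoker"])
delta_auc = extended["roc_auc"] - baseline["roc_auc"]
print(baseline, extended, round(delta_auc, 3))
```

Reusing the same `random_state` for both feature subsets keeps the folds identical, so the difference in ROC-AUC reflects the added features rather than a different split. A classifier with no skill has an expected average precision roughly equal to the outcome prevalence, which is why precision–recall AUC stays low in absolute terms for a rare outcome even when ROC-AUC is moderate.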


## Problem Statement

Cardiovascular disease risk has been extensively studied using clinical and physiological measurements. However, many large-scale population datasets rely primarily on self-reported health indicators and behaviors rather than direct clinical measurements. The utility of these self-reported indicators for predictive modeling of cardiovascular outcomes remains an important practical question for population health analytics.

## Objective
The objective of this study is to compare two feature subsets, a demographic baseline (age and sex) and an extended set that adds self-reported health indicators and behavioral factors, in how well each predicts a history of myocardial infarction in a population-based survey dataset.


## Scope and Limitations
Because the dataset relies on self-reported measures and does not include direct physiological measurements such as blood pressure, conclusions are limited to the utility of observable, non-invasive indicators for population-level risk modeling.

## Dataset Overview

### 1. Setup

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler