# Heart Disease Population Survey Comparative Utility Analysis (CDC BRFSS 2022)

## Executive Summary
- Dataset: CDC BRFSS 2022 (cleaned, no missing values)
- Objective: Compare the clinical utility of two subsets of population features for discriminating prior heart attack occurrence.
- Target variable: HadHeartAttack (Yes/No)
- Models evaluated: Logistic Regression

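- Key finding: Adding Life's Essential 8 proxy variables to an age-and-sex baseline improved ROC-AUC from ≈ 0.75 to ≈ 0.79 (Δ ≈ +0.04).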

This analysis evaluates whether self-reported health and lifestyle indicators provide incremental value beyond basic demographics for discriminating prior heart attack status using BRFSS survey data. Logistic regression models were evaluated using stratified cross-validation, with performance assessed primarily via ROC-AUC and secondarily via precision–recall AUC to account for class imbalance. A baseline model using age and sex achieved moderate discriminative performance (ROC-AUC ≈ 0.75). Incorporating Life’s Essential 8 proxy variables resulted in a consistent improvement in discrimination (ROC-AUC ≈ 0.79; Δ ≈ +0.04), with corresponding gains in precision–recall performance. Holdout evaluation produced comparable results, supporting the stability of the cross-validated estimates. Overall, findings indicate that self-reported survey data contain meaningful incremental signal for population-level discrimination of heart attack status, though predictive precision remains limited by outcome prevalence.


## Problem Statement

Cardiovascular disease risk has been extensively studied using clinical and physiological measurements. However, many large-scale population datasets rely primarily on self-reported health indicators and behaviors rather than direct clinical measurements. The utility of these self-reported indicators for predictive modeling of cardiovascular outcomes remains an important practical question for population health analytics.

## Objective
The objective of this study is to evaluate the extent to which self-reported health indicators, behavioral factors, and demographic characteristics from two different feature subsets can be used to predict a history of myocardial infarction in a population-based survey dataset.


## Scope and Limitations
Because the dataset relies on self-reported measures and does not include direct physiological measurements such as blood pressure, conclusions are limited to the utility of observable, non-invasive indicators for population-level risk modeling.

## Dataset Overview

### 1. Set up

In [9]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

### 2. Load Data

In [10]:
from pathlib import Path
DATA_PATH = Path("..") / "data" / "heart_2022_no_nans.csv"
df = pd.read_csv(DATA_PATH, delimiter=",")

### 3. Dataset features

In [11]:
df.shape

(246022, 40)

##### With 246,022 rows and 40 columns, the dataset is large enough for a baseline logistic regression analysis.

In [12]:
df.head()

| | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Female | Very good | 4.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | None of them | No | ... | 1.6 | 71.67 | 27.99 | No | No | Yes | Yes | Yes, received Tdap | No | No |
| 1 | Alabama | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 6.0 | None of them | No | ... | 1.78 | 95.25 | 30.13 | No | No | Yes | Yes | Yes, received tetanus shot but not sure what type | No | No |
| 2 | Alabama | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | 6 or more, but not all | No | ... | 1.85 | 108.86 | 31.66 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | Yes |
| 3 | Alabama | Female | Fair | 5.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | None of them | No | ... | 1.7 | 90.72 | 31.32 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | Yes |
| 4 | Alabama | Female | Good | 3.0 | 15.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 | 1 to 5 | No | ... | 1.55 | 79.38 | 33.07 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | No |


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246022 entries, 0 to 246021
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      246022 non-null  object 
 1   Sex                        246022 non-null  object 
 2   GeneralHealth              246022 non-null  object 
 3   PhysicalHealthDays         246022 non-null  float64
 4   MentalHealthDays           246022 non-null  float64
 5   LastCheckupTime            246022 non-null  object 
 6   PhysicalActivities         246022 non-null  object 
 7   SleepHours                 246022 non-null  float64
 8   RemovedTeeth               246022 non-null  object 
 9   HadHeartAttack             246022 non-null  object 
 10  HadAngina                  246022 non-null  object 
 11  HadStroke                  246022 non-null  object 
 12  HadAsthma                  246022 non-null  object 
 13  HadSkinCancer              24

## Target Variable Analysis

In [14]:
target_counts = df["HadHeartAttack"].value_counts()

In [15]:
(target_counts / target_counts.sum()).rename("proportion")

HadHeartAttack
No     0.945391
Yes    0.054609
Name: proportion, dtype: float64

##### The outcome is imbalanced: respondents reporting a prior heart attack make up 5.46% of the dataset. The outcome is therefore relatively rare, and class imbalance must be accounted for in downstream modeling and evaluation.
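
To make the consequence of this imbalance concrete, the following is a minimal sketch on synthetic labels (not BRFSS data) drawn at the prevalence reported above. It shows that a trivial predictor that always answers "No" scores an accuracy close to 0.945, while its ROC-AUC sits at 0.5 and its precision-recall AUC equals the observed prevalence. The sample size `n` and the random seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

# Prevalence of "Yes" taken from the proportions printed above.
prevalence = 0.054609

# Synthetic labels at that prevalence (illustrative only, not survey data).
rng = np.random.default_rng(0)
n = 100_000  # arbitrary sample size for the illustration
y_true = (rng.random(n) < prevalence).astype(int)

# A trivial "model": predicts "No" for everyone and gives everyone the same score.
y_pred_majority = np.zeros(n, dtype=int)
constant_scores = np.zeros(n)

majority_accuracy = accuracy_score(y_true, y_pred_majority)
chance_roc_auc = roc_auc_score(y_true, constant_scores)
chance_pr_auc = average_precision_score(y_true, constant_scores)

print(f"Accuracy of always-No predictor: {majority_accuracy:.4f}")
print(f"ROC-AUC of constant scores:      {chance_roc_auc:.4f}")
print(f"PR-AUC of constant scores:       {chance_pr_auc:.4f}")
print(f"Observed prevalence:             {y_true.mean():.4f}")
```

Accuracy is therefore uninformative for this outcome, and any precision-recall AUC should be read against the prevalence rather than against 0.5.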

## Data Validation

### 1. Missing Value Verification
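
The file is described as cleaned with no missing values, and a check along the following lines could confirm that. This is a minimal sketch: `missing_value_report` is a hypothetical helper, and `toy` is an invented frame standing in for `df` so that the example runs on its own.

```python
import numpy as np
import pandas as pd


def missing_value_report(frame: pd.DataFrame) -> pd.Series:
    """Return missing-value counts for columns that have at least one missing entry."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy frame standing in for df: column names mirror the dataset, values are invented.
toy = pd.DataFrame(
    {
        "Sex": ["Female", "Male", "Male"],
        "SleepHours": [9.0, np.nan, 8.0],
        "HadHeartAttack": ["No", "No", "Yes"],
    }
)

report = missing_value_report(toy)
print(report)

# On a fully cleaned frame the report is empty.
clean_report = missing_value_report(toy.dropna())
print("No missing values:", clean_report.empty)
```

Applied to the real data, `missing_value_report(df)` would be expected to return an empty Series if the "no missing values" description holds.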

### 2. Data Sanity and Consistency Checks

### 3. Feature Type Classification
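
The `df.info()` output above shows a mix of `object` and `float64` columns. One possible way to separate them for later encoding is sketched below; `split_feature_types` is a hypothetical helper, and the toy frame uses invented values with column names taken from the dataset.

```python
import pandas as pd


def split_feature_types(frame: pd.DataFrame, target: str):
    """Split feature columns into numeric and non-numeric (categorical) lists."""
    features = frame.drop(columns=[target])
    numeric_cols = features.select_dtypes(include="number").columns.tolist()
    categorical_cols = features.select_dtypes(exclude="number").columns.tolist()
    return numeric_cols, categorical_cols


# Toy frame standing in for df (invented values).
toy = pd.DataFrame(
    {
        "Sex": ["Female", "Male", "Male"],
        "GeneralHealth": ["Very good", "Fair", "Good"],
        "SleepHours": [9.0, 6.0, 8.0],
        "BMI": [27.99, 30.13, 31.66],
        "HadHeartAttack": ["No", "No", "Yes"],
    }
)

numeric_cols, categorical_cols = split_feature_types(toy, target="HadHeartAttack")
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Selecting by "number" versus "not number" avoids depending on how a particular pandas version labels string columns.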

### 4. Population Characteristics

### 5. Data Representation Commentary

## Exploratory Data Analysis

### 1. Feature filtering

### 2. Defining Feature Subsets of Interest

### 3. Commentary on Feature selection

## Methods

## Modeling and Results

### 1. Encoding and Setup
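
The comparison described in the Executive Summary (a demographic baseline versus an extended feature set, scored with stratified cross-validation on ROC-AUC and precision-recall AUC) could be set up roughly as follows. This is a sketch under stated assumptions, not the notebook's actual code: the data are synthetic, `AgeCategory` is an assumed column name that does not appear in the truncated output above, the extended feature list is illustrative rather than the study's Life's Essential 8 proxy set, and `build_pipeline` and `evaluate` are hypothetical helpers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols):
    """Scale numeric columns, one-hot encode categoricals, then fit logistic regression."""
    transformers = []
    if numeric_cols:
        transformers.append(("num", StandardScaler(), numeric_cols))
    if categorical_cols:
        transformers.append(
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)
        )
    return Pipeline(
        [
            ("preprocess", ColumnTransformer(transformers)),
            ("model", LogisticRegression(max_iter=1000)),
        ]
    )


def evaluate(frame, target, numeric_cols, categorical_cols, seed=0):
    """Return mean ROC-AUC and PR-AUC over 5 stratified folds."""
    X = frame[numeric_cols + categorical_cols]
    y = (frame[target] == "Yes").astype(int)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_validate(
        build_pipeline(numeric_cols, categorical_cols),
        X,
        y,
        cv=cv,
        scoring={"roc_auc": "roc_auc", "pr_auc": "average_precision"},
    )
    return {
        "roc_auc": scores["test_roc_auc"].mean(),
        "pr_auc": scores["test_pr_auc"].mean(),
    }


# Synthetic stand-in for df: the outcome depends on age, sex, and BMI by construction.
rng = np.random.default_rng(42)
n = 4000
age_idx = rng.integers(0, 3, n)
sex = rng.choice(["Female", "Male"], n)
bmi = rng.normal(28, 5, n)
logit = (
    -2.5
    + np.array([-1.0, 0.0, 1.0])[age_idx]
    + 0.6 * (sex == "Male")
    + 0.15 * (bmi - 28)
)
outcome = rng.random(n) < 1 / (1 + np.exp(-logit))
toy = pd.DataFrame(
    {
        "AgeCategory": np.array(["18-44", "45-64", "65+"])[age_idx],  # assumed name
        "Sex": sex,
        "BMI": bmi,
        "SleepHours": rng.normal(7, 1.5, n),
        "HadHeartAttack": np.where(outcome, "Yes", "No"),
    }
)

baseline = evaluate(toy, "HadHeartAttack", [], ["AgeCategory", "Sex"])
extended = evaluate(
    toy, "HadHeartAttack", ["BMI", "SleepHours"], ["AgeCategory", "Sex"]
)
print("Baseline:", baseline)
print("Extended:", extended)
print("Delta ROC-AUC:", extended["roc_auc"] - baseline["roc_auc"])
```

Keeping preprocessing inside the `Pipeline` means the scaler and encoder are refit on each training fold, so no information from the validation fold leaks into the transformation.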

### 2. Interpretation and Discussion

## Limitations

## Conclusion