## Feature Engineering

Feature engineering decisions were guided by insights obtained during exploratory data analysis and grounded in clinical reasoning.

Rather than introducing a large number of derived variables, the focus was placed on representing meaningful health concepts that could enhance early risk detection without increasing model complexity unnecessarily.

### BMI Capping

Extreme BMI values were observed in the dataset, with a small number of records exceeding clinically plausible ranges. To prevent these outliers from disproportionately influencing model splits, BMI values were capped at an upper threshold of 70.

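Because capping is a non-decreasing transformation, it keeps the ordering of all BMI values up to 70 (values above the cap become ties at 70). The monotonic relationship between BMI and diabetes risk is therefore preserved, while sensitivity to extreme anomalies is reduced.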
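The capping step can be sketched as follows. This is a minimal illustration, not the project's actual code: the body of `engineer_features` is not shown here, so the helper name `cap_bmi`, the constant `BMI_CAP`, and the toy data are assumptions. Only the `BMI` column name and the threshold of 70 come from this notebook.

```python
import pandas as pd

BMI_CAP = 70  # upper threshold stated in the text above


def cap_bmi(df: pd.DataFrame, cap: float = BMI_CAP) -> pd.DataFrame:
    """Return a copy of df with BMI values above `cap` replaced by `cap`.

    Hypothetical helper for illustration; the real logic lives in
    src.preprocessing.engineer_features and may differ.
    """
    out = df.copy()
    # Series.clip(upper=...) leaves values at or below the cap unchanged.
    out["BMI"] = out["BMI"].clip(upper=cap)
    return out


# Toy data (not taken from the dataset): one value sits above the cap.
example = pd.DataFrame({"BMI": [24, 40, 70, 98]})
capped = cap_bmi(example)
print(capped["BMI"].tolist())
```

Working on a copy leaves the input frame untouched, which makes it easy to compare the distribution before and after capping.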

### Composite Health Indicator: Difficult_Health

While the raw number of physically unhealthy days (`PhysHlth`) provides granular information, it may not fully capture sustained health deterioration. To represent prolonged physical impairment, a binary feature, `Difficult_Health`, was derived: it is 1 when an individual reported more than 15 days of poor physical health in the past month and 0 otherwise. The comparison is strict, so a record with exactly 15 days is coded 0.

This transformation emphasizes chronic health burden rather than short-term fluctuations, aligning with the model’s goal of early disease risk identification.
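A minimal sketch of this derivation is shown below. The helper name `add_difficult_health`, its `threshold` parameter, and the toy data are assumptions made for illustration, since the body of `engineer_features` is not shown here. The column names `PhysHlth` and `Difficult_Health` and the "more than 15 days" rule come from this notebook.

```python
import pandas as pd


def add_difficult_health(df: pd.DataFrame, threshold: int = 15) -> pd.DataFrame:
    """Return a copy of df with a binary Difficult_Health column.

    Hypothetical helper for illustration; the real logic lives in
    src.preprocessing.engineer_features and may differ.
    """
    out = df.copy()
    # Strict comparison: exactly `threshold` days does not count as prolonged.
    out["Difficult_Health"] = (out["PhysHlth"] > threshold).astype(int)
    return out


# Toy data (not taken from the dataset) covering the boundary case.
example = pd.DataFrame({"PhysHlth": [15, 0, 30, 16]})
flagged = add_difficult_health(example)
print(flagged["Difficult_Health"].tolist())
```

Keeping `PhysHlth` alongside the derived flag lets the model use either the granular count or the thresholded indicator.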

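```python
import sys
import os

# Add the parent directory (the project root) to sys.path so that
# the local `src` package can be imported from this notebook.
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)
```

```python
import pandas as pd

# Project-specific helpers defined in src/preprocessing.
from src.preprocessing import basic_cleaning, engineer_features
```

```python
# Load the raw data, clean it, then apply the feature engineering steps.
df = pd.read_csv("../data/diabetes_binary_health_indicators_BRFSS2015.csv")
df_clean = basic_cleaning(df)
df_fe = engineer_features(df_clean)
```

```python
# Preview the first five rows, including the new Difficult_Health column.
df_fe.head()
```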

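|   | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | Difficult_Health |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 40 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 5 | 18 | 15 | 1 | 0 | 9 | 4 | 3 | 0 |
| 1 | 0 | 0 | 0 | 0 | 25 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 3 | 0 | 0 | 0 | 0 | 7 | 6 | 1 | 0 |
| 2 | 0 | 1 | 1 | 1 | 28 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 5 | 30 | 30 | 1 | 0 | 9 | 4 | 8 | 1 |
| 3 | 0 | 1 | 0 | 1 | 27 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 11 | 3 | 6 | 0 |
| 4 | 0 | 1 | 1 | 1 | 24 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 2 | 3 | 0 | 0 | 0 | 11 | 5 | 4 | 0 |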
