# **Predicting 10-Year Coronary Heart Disease (CHD) Risk**  

### Cardiovascular diseases are the `leading` global `cause of death`, and `coronary heart disease (CHD)` is the most prevalent among them, accounting for `13% of global deaths` from 2000 to 2021 ([WHO](https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates/ghe-leading-causes-of-death)). In the U.S., nearly half of adults have at least one major CHD risk factor: high blood pressure, high cholesterol, or smoking ([NHLBI](https://www.nhlbi.nih.gov/health/coronary-heart-disease/risk-factors)). The `goal of the project` is to develop a `logistic regression model` that estimates an individual's `10-year CHD probability`, balancing predictive accuracy, interpretability, and classification effectiveness. The analysis relies on a [Kaggle dataset](https://www.kaggle.com/datasets/christofel04/cardiovascular-study-dataset-predict-heart-disea) that is reportedly linked to the [Framingham Heart Study](https://www.framinghamheartstudy.org/fhs-about/), a cornerstone of cardiovascular research.
  
### **Project Roadmap**  

| **Section** | **Objective** |
|------------|--------------|
| 1. Data Cleaning | Load the dataset, inspect structure, handle missing values, and encode categorical variables. |
| 2. Exploratory Data Analysis (EDA) | Examine feature distributions, assess correlations, detect & analyze outliers, and evaluate multicollinearity. |
| 3. Data Preprocessing | Split data into training and testing sets, apply appropriate scaling, handle outliers, and finalize preprocessing steps before modeling. |
| 4. Modeling & Evaluation | Train a logistic regression model, optimize classification threshold, and validate performance on the test set. |
| 5. Interpretation & Considerations | Analyze feature importance, assess generalizability, and discuss dataset limitations. |

<br>

---

## **1. Data Cleaning**  

##### The dataset is `loaded and inspected` to understand its composition and detect `missing values` or `duplicates`. Missing values are imputed where that limits information loss and dropped where imputation is not appropriate. `Categorical variables` are `encoded` into numeric format so that the model can use them.

### **1.1. Imports & Configurations**

Importing essential libraries for data manipulation, visualization, preprocessing, and modeling. Display settings are configured to ensure precision and readability when working with numerical outputs.


In [546]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report,
    precision_recall_curve
)

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

pd.set_option("display.float_format", "{:.2f}".format)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 50)

sns.set_theme(style="whitegrid", palette="muted")

### **1.2. Load & Inspect Data**

In [547]:
df = pd.read_csv("train.csv")
print(f"Dataset contains {df.shape[0]} observations and {df.shape[1]} features.")

Dataset contains 3390 observations and 17 features.


In [548]:
df.dtypes

id                   int64
age                  int64
education          float64
sex                 object
is_smoking          object
cigsPerDay         float64
BPMeds             float64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol            float64
sysBP              float64
diaBP              float64
BMI                float64
heartRate          float64
glucose            float64
TenYearCHD           int64
dtype: object

In [549]:
df.sample(5, random_state=42)

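| | id | age | education | sex | is_smoking | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 134 | 134 | 64 | 1.0 | F | NO | 0.0 | 0.0 | 0 | 1 | 1 | 262.0 | 147.0 | 90.0 | 26.51 | 85.0 | 173.0 | 0 |
| 1764 | 1764 | 36 | 2.0 | M | YES | 20.0 | 0.0 | 0 | 1 | 0 | 248.0 | 135.0 | 94.5 | 36.52 | 65.0 | 85.0 | 0 |
| 2465 | 2465 | 61 | 1.0 | M | YES | 13.0 | 0.0 | 0 | 0 | 0 | 312.0 | 110.0 | 66.0 | 26.28 | 68.0 | 96.0 | 0 |
| 1987 | 1987 | 51 | 2.0 | F | NO | 0.0 | 0.0 | 0 | 0 | 0 | 233.0 | 120.0 | 81.0 | 28.25 | 80.0 | 75.0 | 0 |
| 1295 | 1295 | 59 | 4.0 | M | YES | 20.0 | 0.0 | 0 | 1 | 0 | 232.0 | 151.5 | 110.0 | 26.89 | 68.0 | 69.0 | 0 |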


`education` is stored as `float64`, which is unusual for a variable that encodes ordered categories rather than continuous values. `BPMeds` is expected to be a binary categorical variable (0 = No, 1 = Yes), yet it is also stored as `float64`. A likely cause is missing values: pandas represents them as `NaN`, which forces an integer column to a float dtype. Both columns are inspected further below. `sex` and `is_smoking` are stored as `object` rather than as binary categorical features in `int64` format. `id` carries no predictive information and is dropped.

In [550]:
df.drop(columns=["id"], inplace=True)

Checking the number of `unique values per variable` helps classify each feature as nominal, ordinal, binary, or continuous. That classification guides the encoding, scaling, and transformation choices that affect model performance and interpretability.

In [551]:
display(df.nunique().to_frame("Unique Values"))

| | Unique Values |
|---|---|
| age | 39 |
| education | 4 |
| sex | 2 |
| is_smoking | 2 |
| cigsPerDay | 32 |
| BPMeds | 2 |
| prevalentStroke | 2 |
| prevalentHyp | 2 |
| diabetes | 2 |
| totChol | 240 |


In [552]:
for col in ["education", "sex", "is_smoking", "BPMeds"]:
    unique_vals = df[col].dropna().unique()
    print(f"{col} ({df[col].dtype}): {unique_vals}")

education (float64): [2. 4. 1. 3.]
sex (object): ['F' 'M']
is_smoking (object): ['YES' 'NO']
BPMeds (float64): [0. 1.]


**Feature Classification**

| Type              | Features  | Notes |
|--------------------------------|---------------|-----------|
| Nominal Categorical        | `sex` (Gender: M/F), `is_smoking` (Smoking status: YES/NO) | Needs encoding into `binary categorical` in `int64` |
| Ordinal Categorical        | `education` (Education level: 1-4, ordered but not continuous) | `education` needs casting to `int64`, treated as ordinal categorical |
| Binary Categorical         | `BPMeds` (Blood pressure medication), `prevalentStroke` (Stroke history), `prevalentHyp` (Hypertension history), `diabetes` (Diabetes history), `TenYearCHD` (10-year CHD risk, target variable) | `BPMeds` needs casting to `int64`, treated as binary categorical |
| Continuous Numeric                | `age` (Age in years, treated as continuous), `cigsPerDay` (Cigarettes per day), `totChol` (Total cholesterol level), `sysBP` (Systolic blood pressure), `diaBP` (Diastolic blood pressure), `BMI` (Body Mass Index), `heartRate` (Heart rate in beats per minute), `glucose` (Blood glucose level) | No changes needed: `age` stays `int64`, the others stay `float64` |

> ⚠️ **Disclaimer.** Feature value casting and encoding will be performed **after** addressing duplicates and missing values. 
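
The manual classification above can be cross-checked with a rough heuristic based on dtype and cardinality. The sketch below is only an illustration: the helper `rough_feature_type`, its `max_levels` cutoff, and the toy values are assumptions, not part of the original analysis, and a low-cardinality numeric column still needs a manual decision between ordinal and nominal.

```python
import pandas as pd


def rough_feature_type(series: pd.Series, max_levels: int = 10) -> str:
    """Heuristic guess at a feature type (hypothetical helper, not from the notebook)."""
    n_unique = series.dropna().nunique()
    if n_unique == 2:
        return "binary"
    if not pd.api.types.is_numeric_dtype(series):
        return "nominal"
    if n_unique <= max_levels:
        # Few distinct numeric levels: could be ordinal or nominal codes.
        return "ordinal-or-nominal (check manually)"
    return "continuous"


# Toy frame with illustrative values (not rows from the dataset)
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "education": [1.0, 2.0, 4.0, None],
    "BMI": [22.0, 25.5, 27.1, 31.4],
})

# A small cutoff is used here only because the toy frame has four rows.
toy_types = {col: rough_feature_type(toy[col], max_levels=3) for col in toy.columns}
print(toy_types)
```

On the real data, the same helper could be applied column by column with `df.apply(rough_feature_type)`, with the result compared against the table rather than trusted blindly.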


### **1.3. Handling Duplicate Observations**

In [553]:
print(f"\n✅ This dataset has {df.duplicated().sum()} duplicated observations.")


✅ This dataset has 0 duplicated observations.


### **1.4. Handling Missing Values**

In [554]:
missing = df.isnull().sum().pipe(lambda x: x[x > 0])
print(f"Total missing: {missing.sum()} ({(missing.sum() / df.size * 100):.2f}%)")
display(missing.to_frame("Missing Values").assign(Percentage=lambda x: (x / len(df) * 100).round(2)))

Total missing: 510 (0.94%)


| | Missing Values | Percentage |
|---|---|---|
| education | 87 | 2.57 |
| cigsPerDay | 22 | 0.65 |
| BPMeds | 44 | 1.3 |
| totChol | 38 | 1.12 |
| BMI | 14 | 0.41 |
| heartRate | 1 | 0.03 |
| glucose | 304 | 8.97 |
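
The two kinds of percentage reported above use different denominators: the overall figure divides by the number of cells (`df.size`, rows times columns), while the per-column figures divide by the number of rows. A short arithmetic check, using only the counts printed in the outputs and assuming 16 columns remain after `id` is dropped:

```python
# Counts copied from the notebook output above
n_rows, n_cols = 3390, 16  # 17 original columns minus the dropped `id`
missing_counts = {
    "education": 87, "cigsPerDay": 22, "BPMeds": 44, "totChol": 38,
    "BMI": 14, "heartRate": 1, "glucose": 304,
}

total_missing = sum(missing_counts.values())

# Overall share: missing cells divided by all cells (what df.size counts)
overall_pct = total_missing / (n_rows * n_cols) * 100

# Per-column share: missing cells divided by the number of rows
per_column_pct = {col: round(n / n_rows * 100, 2) for col, n in missing_counts.items()}

print(total_missing, round(overall_pct, 2))
print(per_column_pct)
```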


The dataset has missing values in **7 features**. `glucose` has the most **(8.97%)**, so its imputation strategy matters more than for the other columns. `education`, `cigsPerDay`, `BPMeds`, `totChol`, and `BMI` have **low missingness (<3%)**, which makes median or mode imputation suitable. `heartRate` has only **one missing value (0.03%)**, so dropping or imputing it has negligible impact.

#### **1.4.1. Categorical Variables (`education`, `BPMeds`)**

`Mode imputation` suits categorical variables because it replaces missing values with the most common category and therefore never creates a value outside the existing levels.

In [555]:
mode_imputer = SimpleImputer(strategy="most_frequent")

df[["education", "BPMeds"]] = mode_imputer.fit_transform(df[["education", "BPMeds"]])

df["education"] = df["education"].astype(int)
df["BPMeds"] = df["BPMeds"].astype(int)

print(f"✅ {df['education'].isnull().sum()} missing values in 'education'")
print(f"✅ {df['BPMeds'].isnull().sum()} missing values in 'BPMeds'" )

✅ 0 missing values in 'education'
✅ 0 missing values in 'BPMeds'
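
To make the behaviour of mode imputation concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the dataset; only `SimpleImputer(strategy="most_frequent")` mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one gap per column (hypothetical values)
toy = pd.DataFrame({
    "education": [1.0, 1.0, 2.0, np.nan, 3.0],
    "BPMeds": [0.0, 0.0, np.nan, 1.0, 0.0],
})

imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
).astype(int)

# The value learned for each column is its most frequent non-missing level
learned_modes = dict(zip(toy.columns, imputer.statistics_))

print(learned_modes)
print(filled)
```

Because every gap receives the same category, the majority level grows slightly; at the low missingness seen here that distortion is small.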


#### **1.4.2. Numerical Variables (`glucose`, `cigsPerDay`, `totChol`, `BMI`, `heartRate`)**

`glucose` and `cigsPerDay` are handled separately from the other numerical variables: their missing values are filled with `group-wise median imputation` rather than a single overall median. For `glucose`, separate medians are computed for `diabetic` and `non-diabetic` individuals, which preserves the difference in glucose levels between the two groups. For `cigsPerDay`, missing values are set to 0 for `non-smoker` individuals and to the smokers' median for `smoker` individuals.

In [556]:
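# Median glucose per diabetes status (0 = non-diabetic, 1 = diabetic)
glucose_median_map = df.groupby("diabetes")["glucose"].median().to_dict()
# Fill only the missing rows; the assignment aligns on the index
df.loc[df["glucose"].isnull(), "glucose"] = df["diabetes"].map(glucose_median_map)

# Median is computed from smokers only, so non-smokers' zeros do not pull it down
cigs_median_smokers = df.loc[df["is_smoking"] == "YES", "cigsPerDay"].median()
# Non-smokers with a missing count are set to 0
df.loc[df["is_smoking"] == "NO", "cigsPerDay"] = df.loc[df["is_smoking"] == "NO", "cigsPerDay"].fillna(0)
# Smokers with a missing count receive the smokers' median
df.loc[(df["is_smoking"] == "YES") & (df["cigsPerDay"].isnull()), "cigsPerDay"] = cigs_median_smokers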

print(f"Missing values in glucose: {df['glucose'].isnull().sum()}")
print(f"Missing values in cigsPerDay: {df['cigsPerDay'].isnull().sum()}")


Missing values in glucose: 0
Missing values in cigsPerDay: 0
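
The difference between a group-wise and a single overall median is easiest to see on a small example. The sketch below uses made-up glucose values and `groupby(...).transform("median")`, which is an equivalent alternative to the dictionary mapping used above, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy data: one missing glucose value in each diabetes group (hypothetical values)
toy = pd.DataFrame({
    "diabetes": [0, 0, 0, 1, 1, 1],
    "glucose": [80.0, 90.0, np.nan, 150.0, 170.0, np.nan],
})

# A single overall median ignores the group structure
overall_median = toy["glucose"].median()

# transform("median") returns one median per row, taken from that row's group
group_median = toy.groupby("diabetes")["glucose"].transform("median")
toy["glucose_filled"] = toy["glucose"].fillna(group_median)

print(overall_median)
print(toy)
```

In this toy case the overall median would place both missing rows at the same value, between the two groups, while the group-wise fill keeps each row close to its own group.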


In [557]:
median_imputer = SimpleImputer(strategy="median")

df[["totChol", "BMI", "heartRate"]] = median_imputer.fit_transform(df[["totChol", "BMI", "heartRate"]])

print(f"Missing values in totChol: {df['totChol'].isnull().sum()}")
print(f"Missing values in BMI: {df['BMI'].isnull().sum()}")
print(f"Missing values in heartRate: {df['heartRate'].isnull().sum()}")

Missing values in totChol: 0
Missing values in BMI: 0
Missing values in heartRate: 0
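
One common reason to prefer the median over the mean for variables such as `totChol` is robustness to extreme values. The following sketch uses hypothetical cholesterol readings with a single extreme value to show how differently the two fill values behave; it is an illustration, not a result from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one extreme value and one gap
chol = pd.Series([180.0, 200.0, 220.0, 240.0, 600.0, np.nan])

mean_fill = chol.mean()      # pulled upward by the extreme reading
median_fill = chol.median()  # middle of the observed values

filled_with_median = chol.fillna(median_fill)

print(mean_fill, median_fill)
print(filled_with_median.tolist())
```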


In [558]:
missing_summary = df.isnull().sum()
missing_summary = missing_summary[missing_summary > 0]

if missing_summary.empty:
    print("✅ No missing values remain in the dataset!")
else:
    print("⚠️ Missing values still exist in the following columns:")
    print(missing_summary)

✅ No missing values remain in the dataset!


### **1.5. Encoding Nominal Categorical Variables**

---