# **Predicting 10-Year Coronary Heart Disease (CHD) Risk**

Cardiovascular diseases are the `leading` global `cause of death`, and `coronary heart disease` (CHD) is the most prevalent among them. According to the [World Health Organization (WHO)](https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates/ghe-leading-causes-of-death), CHD accounted for 13% of global deaths from 2000 to 2021. In the U.S., the [National Heart, Lung, and Blood Institute (NHLBI)](https://www.nhlbi.nih.gov/health/coronary-heart-disease/risk-factors) states that nearly half of adults have at least one of three major CHD risk factors: high blood pressure, high cholesterol, or smoking.

This project builds a `logistic regression model` to estimate an individual’s `10-year CHD probability`. The goal is to balance predictive accuracy, interpretability, and classification effectiveness so that the model can support risk assessment. The project relies on a [Kaggle dataset](https://www.kaggle.com/datasets/christofel04/cardiovascular-study-dataset-predict-heart-disea) that is reportedly linked to the [Framingham Heart Study](https://www.framinghamheartstudy.org/fhs-about/), a cornerstone of cardiovascular research.

However, the dataset's true origin is not transparent, as `no metadata` was found on Kaggle. Key details remain unknown: the data collection timeframe, participant selection criteria, geographic representation, other possible predictors, and any preprocessing already applied. Without this information, the dataset’s reliability and generalizability cannot be fully assessed, so the findings should be interpreted with caution.

##### **Project Roadmap**

| Stage | Objective |
|-----------|--------------|
| 1. Data Cleaning | Handle missing values, encode categorical variables, and standardize numerical features for consistency. |
| 2. Exploratory Data Analysis (EDA) | Examine feature distributions, assess correlations, and identify multicollinearity. |
| 3. Modeling | Train a logistic regression model, perform feature selection, optimize classification thresholds, and validate performance. |
| 4. Interpretation and Considerations | Analyze model coefficients, assess predictive significance, evaluate generalizability, and discuss dataset limitations. |

<br>
<hr> 


### **1. Data Cleaning**  

This section loads and describes the dataset, then checks it for missing values, inconsistencies, duplicate records, and incorrect data types. Categorical variables are encoded, and numerical features are standardized. Outliers are also identified and handled so that extreme values do not distort model performance.
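The checks described above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the column names (`age`, `sex`, `sys_bp`), the helper names, the median imputation, and the 1.5 × IQR outlier rule are hypothetical choices for the sketch, not necessarily the ones applied to the project's dataset.

```python
import numpy as np
import pandas as pd


def audit_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and missing share per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Toy data with hypothetical column names (not the real dataset).
toy = pd.DataFrame({
    "age": [39, 46, 48, 61, 46, 46],
    "sex": ["M", "F", "M", "F", "F", "F"],
    "sys_bp": [106.0, 121.0, np.nan, 150.0, 130.0, 130.0],
})

audit = audit_frame(toy)                   # missing values and dtypes
n_duplicates = int(toy.duplicated().sum())  # fully identical rows

# Drop duplicates, then impute the numeric gap with the column median.
cleaned = toy.drop_duplicates().copy()
cleaned["sys_bp"] = cleaned["sys_bp"].fillna(cleaned["sys_bp"].median())

# One-hot encode the categorical column, dropping one level to avoid redundancy.
encoded = pd.get_dummies(cleaned, columns=["sex"], drop_first=True)

# Standardize numeric columns to zero mean and unit variance (z-scores).
numeric_cols = ["age", "sys_bp"]
encoded[numeric_cols] = (
    encoded[numeric_cols] - encoded[numeric_cols].mean()
) / encoded[numeric_cols].std(ddof=0)

print(audit)
print("Duplicate rows:", n_duplicates)
print(encoded)
```

In a real modeling workflow, the imputation and scaling statistics would normally be computed on the training split only and then applied to the test split, to avoid leaking information from held-out data.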

##### **1.1. Imports**


In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, 
    roc_auc_score, roc_curve, confusion_matrix, classification_report, 
    precision_recall_curve
)

# statsmodels is a separate dependency; install it first (e.g. `pip install statsmodels`)
from statsmodels.api import Logit, add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

import warnings
warnings.filterwarnings("ignore")

pd.set_option("display.float_format", "{:.5f}".format)


ModuleNotFoundError: No module named 'statsmodels'
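
The error above means `statsmodels` is not installed in the environment that ran the cell. A minimal sketch for checking, before running the imports, which top-level packages are unavailable; the helper name `missing_packages` and the `REQUIRED` list are hypothetical and simply mirror the imports in this cell.

```python
from importlib.util import find_spec

# Top-level import names used in the cell above.
# Note: "sklearn" is installed under the package name "scikit-learn".
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn", "statsmodels"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Any name reported as missing can then be installed with the environment's package manager before the imports cell is rerun.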