# Predicting Heart Disease Risk
#### By Markus Chu

---
#### Introduction
According to the World Health Organization, heart disease remains the leading cause of death worldwide, accounting for about 17.9 million deaths per year, or roughly 32% of all deaths (World Health Organization, 2021). Despite medical advances, including the use of AI, heart disease is still hard to detect in its early stages. Traditional diagnosis relies on clinical assessments, tests, and imaging, which can be time-consuming, expensive, and sometimes inaccessible.

Machine learning (ML) offers a way to predict heart disease risk efficiently and cost-effectively. Predictive models can weigh many health factors at once and pick up patterns associated with heart disease risk. This allows for earlier responses, such as lifestyle adjustments, medical treatment, or additional diagnostic testing.

---
#### Objective

**The goal of this project is to develop an accurate ML model that predicts a person's risk of heart disease from patient data.**

This model will:
1. Provide binary classification (presence or absence of heart disease)
2. Generate risk probabilities (0-100%)
3. Identify key contributing factors of heart disease
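
The three outputs listed above map naturally onto a scikit-learn classifier. The following is a minimal sketch, not the project's final model: it assumes a logistic regression and uses synthetic stand-in data with two made-up features rather than the Kaggle dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the Kaggle dataset): two made-up features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Synthetic label that depends almost entirely on the first feature.
y = (X[:, 0] + 0.25 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# 1. Binary classification: 0 (absence) or 1 (presence).
labels = model.predict(X[:5])
# 2. Risk probability for class 1, rescaled to 0-100%.
risk_pct = model.predict_proba(X[:5])[:, 1] * 100
# 3. Per-feature weights; with comparably scaled features, a larger
#    absolute weight suggests a stronger contribution.
coefs = model.coef_[0]

print(labels)
print(np.round(risk_pct, 1))
print(coefs)
```

In this toy setup the first weight dominates because the label was built from the first feature; on real data the weights are only comparable after the features are put on a common scale.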

Intended use:
1. Help doctors make early decisions
2. Let patients check their own health risks
3. Support research on how health risk factors work together


---
#### Data
The dataset is from Kaggle ("Heart Disease") and contains multiple health indicators and lifestyle factors used to assess heart disease risk.

<u> Variables (21): </u>
1. **Age**: Patient age in years (continuous)
2. **Gender**: Biological sex ("Male" or "Female")
3. **Blood Pressure**: Systolic blood pressure measurement in mmHg (continuous)
4. **Cholesterol Level**: Total cholesterol measurement in mg/dL (continuous)
5. **Exercise Habits**: Self-reported exercise frequency ("Low", "Medium", "High")
6. **Smoking**: Smoking status at the time ("Yes" or "No")
7. **Family Heart Disease**: Family history of heart disease ("Yes" or "No")
8. **Diabetes**: Diabetes diagnosis ("Yes" or "No")
9. **BMI**: Body Mass Index in kg/m^2 (continuous)
    - Underweight: < 18.5
    - Normal Weight: 18.5 - 24.9
    - Overweight: 25 - 29.9
    - Obese: >= 30
10. **High Blood Pressure**: Hypertension diagnosis ("Yes" or "No")
11. **Low HDL Cholesterol**: Low "good" cholesterol ("Yes" or "No")
12. **High LDL Cholesterol**: High "bad" cholesterol ("Yes" or "No")
13. **Alcohol Consumption**: Self-reported alcohol intake ("None", "Low", "Medium", "High")
14. **Stress Level**: Perceived stress ("Low", "Medium", "High")
15. **Sleep Hours**: Average hours of sleep at night (continuous)
16. **Sugar Consumption**: Self-reported sugar intake ("Low", "Medium", "High")
17. **Triglyceride Level**: Blood triglyceride measurement in mg/dL (continuous)
18. **Fasting Blood Sugar**: Sugar level after fasting in mg/dL (continuous)
19. **CRP Level**: C-reactive protein levels in mg/dL, showing presence and extent of inflammation in the body (continuous)
    - Normal: < 0.3
20. **Homocysteine Level**: Amino acid linked to heart disease in μmol/L (continuous)
21. **Heart Disease Status**: Presence of heart disease ("Yes" or "No")
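
The BMI bands listed under variable 9 can be written as a small helper. This is a sketch with a hypothetical function name, `bmi_category`. It assumes half-open intervals, so a value such as 24.95 that falls between the listed bands counts as "Normal Weight".

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value (kg/m^2) to the bands listed for variable 9."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal Weight"
    if bmi < 30:
        return "Overweight"
    return "Obese"


for value in (17.0, 22.4, 27.5, 31.2):
    print(value, bmi_category(value))
```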

<u> Data Features: </u>
- 9 continuous variables, 8 binary variables (including the target, Heart Disease Status), 4 ordinal categorical variables
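
One way to check a breakdown like this against the loaded data is to classify each column by its dtype and number of distinct values. The sketch below uses a toy frame whose column names are modeled on the variable list; the real CSV's headers and values may differ, and `variable_kind` is a hypothetical helper.

```python
import pandas as pd

# Toy rows with column names modeled on the variable list above;
# the actual file may use different headers.
toy = pd.DataFrame({
    "Age": [56, 69, 46, 32],
    "BMI": [24.9, 25.2, 29.8, 24.1],
    "Smoking": ["Yes", "No", "No", "Yes"],
    "Exercise Habits": ["High", "Low", "Medium", "Low"],
})


def variable_kind(column: pd.Series) -> str:
    """Label a column as continuous, binary, or categorical."""
    if pd.api.types.is_numeric_dtype(column):
        return "continuous"
    # Non-numeric columns: two distinct values means binary.
    return "binary" if column.nunique(dropna=True) == 2 else "categorical"


kinds = {name: variable_kind(toy[name]) for name in toy.columns}
print(kinds)
```

Counting the values of `kinds` on the full DataFrame gives the number of variables of each type.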

---
## 3. Data Description

Describe the dataset(s) you will use:
- Source of the data
- Number of records and features
- Description of key features (age, sex, cholesterol, etc.)

## 4. Exploratory Data Analysis (EDA)

- Summary statistics
- Data visualization (histograms, boxplots, correlations)
- Handling missing values and outliers
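
The non-graphical EDA steps above might look like the following sketch. It runs on a toy frame with made-up values, uses the common 1.5 × IQR rule for outliers, and fills missing values with the median; both are assumptions, since the report does not yet fix a method. Plots are left out.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value and one implausible reading.
toy = pd.DataFrame({
    "Age": [56, 69, 46, 32, 60, 25],
    "Cholesterol Level": [200.0, 286.0, np.nan, 293.0, 242.0, 900.0],
})

# Summary statistics (count, mean, std, quartiles) per numeric column.
summary = toy.describe()

# Number of missing values per column.
missing = toy.isna().sum()

# Outliers by the 1.5 * IQR rule (NaN values are ignored by quantile).
col = toy["Cholesterol Level"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
outliers = col[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]

# One simple option for missing values: fill with the column median.
filled = col.fillna(col.median())

print(summary)
print(missing)
print(outliers)
```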

## 5. Data Preprocessing

- Feature engineering
- Encoding categorical variables
- Scaling/normalization
- Splitting data into training and test sets
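
As a rough sketch of the encoding, splitting, and scaling steps, the example below works on a toy frame with made-up rows. The ordinal mapping (`Low` < `Medium` < `High`), the 75/25 split, and the choice of `StandardScaler` are assumptions for illustration. Feature engineering is not shown.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up rows; column names follow the variable list.
toy = pd.DataFrame({
    "Age": [56, 69, 46, 32, 60, 25, 38, 73],
    "Stress Level": ["Low", "High", "Medium", "Low",
                     "High", "Medium", "Low", "High"],
    "Smoking": ["Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
    "Heart Disease Status": ["No", "Yes", "No", "No",
                             "Yes", "No", "No", "Yes"],
})

# Encode ordered categories as integers and Yes/No columns as 1/0.
level_order = {"Low": 0, "Medium": 1, "High": 2}
yes_no = {"No": 0, "Yes": 1}
X = pd.DataFrame({
    "Age": toy["Age"].astype(float),
    "Stress Level": toy["Stress Level"].map(level_order),
    "Smoking": toy["Smoking"].map(yes_no),
})
y = toy["Heart Disease Status"].map(yes_no)

# Split first, keeping the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both parts,
# so no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```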

## 6. Model Selection and Training

- Algorithms chosen (e.g., Logistic Regression, Random Forest, etc.)
- Model training details
- Hyperparameter tuning
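
Training and tuning could be sketched as below. The data is synthetic (`make_classification`), and the small parameter grid is only an example, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data standing in for the real features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: logistic regression with default regularization.
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Random forest tuned with a small cross-validated grid search.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    cv=3,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)

log_reg_acc = log_reg.score(X_test, y_test)
forest_acc = grid.best_estimator_.score(X_test, y_test)
print(log_reg_acc)
print(grid.best_params_, forest_acc)
```

The grid search only sees the training split, so the held-out test accuracy stays an honest estimate for whichever model is selected.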

## 7. Model Evaluation

- Metrics used (accuracy, precision, recall, ROC-AUC)
- Confusion matrix
- Cross-validation results
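
The metrics named above are all available in scikit-learn. The sketch below computes them for a logistic regression on synthetic data; the model and data are placeholders, not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic placeholder data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard 0/1 predictions
y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),  # needs scores, not labels
}

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# 5-fold cross-validated ROC-AUC on the full synthetic data.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc"
)

print(metrics)
print(cm)
print(cv_scores.mean())
```

For a screening use case, recall is often the metric to watch, since a false negative means a missed case of heart disease.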

## 8. Results and Discussion

- Interpret the results
- Feature importance
- Limitations of the model
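
For the feature importance item, one option is the impurity-based importances that a fitted random forest exposes. This sketch uses synthetic data in which only the first two features carry signal, and hypothetical feature names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the informative features in the first columns,
# so features 0 and 1 carry the signal and the rest are noise.
X, y = make_classification(
    n_samples=400,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1 across features.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can overstate features with many distinct values, which is worth noting alongside the model's other limitations.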

## 9. Conclusion

Summarize the findings and potential next steps.

## 10. References

List any papers, articles, or resources you referenced.

---

In [1]:
# Libraries for data cleaning, manipulation, visualization, statistical analysis, and ML
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression  # classifier for the binary target (LinearRegression predicts continuous values)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [3]:
df = pd.read_csv("data/heart_disease.csv")