
In [1]:
from IPython.display import Markdown, display

def create_markdown(text):
    # Render a Markdown string as formatted output in the notebook.
    display(Markdown(text))

# Introduction Section
create_markdown("""
# Heart Disease Prediction Using Machine Learning (Data Set Provided By Instructor)

## 1. Introduction
Cardiovascular diseases are a leading cause of death worldwide, and early detection is crucial for preventing adverse outcomes. This study applies machine learning to predict heart disease using the **Heart Disease dataset** from the UCI Machine Learning Repository (Janosi et al., 1989), which includes attributes such as age, sex, and cholesterol level. The goal is to compare several machine learning models for predicting heart disease and to identify the key contributing factors.
""")

# Related Work Section
create_markdown("""
## 2. Related Work
Numerous studies have used machine learning to predict heart disease outcomes. Early approaches focused on traditional classifiers like **logistic regression**, **decision trees**, and **k-nearest neighbors**. More recent research has explored **random forests** and **support vector machines (SVM)**, yielding better results. Recent trends in medical data classification also involve deep learning, though this study emphasizes classical models and ensemble techniques for comparison.
""")

# Proposed Methodology Section
create_markdown("""
## 3. Proposed Methodology
This study evaluates four machine learning models:

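- **Logistic Regression (LR)**: A linear model that estimates the probability of the positive class for binary classification.
- **Decision Trees (DT)**: A non-linear model that splits the data on a sequence of feature thresholds.
- **Random Forests (RF)**: An ensemble method that combines the predictions of many decision trees trained on bootstrap samples.
- **Support Vector Machines (SVM)**: A model that separates classes with a maximum-margin hyperplane, optionally in a higher-dimensional feature space induced by a kernel.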
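
As a minimal sketch, the four models listed above could be set up with scikit-learn as follows. The hyperparameters (`max_iter`, `n_estimators`, `random_state`, the RBF kernel) and the synthetic stand-in data are assumptions for illustration; the study does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed hyperparameters; the study does not state which settings were used.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf", probability=True, random_state=0),
}

# Synthetic stand-in data (not the Heart Disease dataset), used only to show the API.
X, y = make_classification(n_samples=200, n_features=13, random_state=0)

for name, model in models.items():
    model.fit(X, y)
    # Each fitted model predicts one 0/1 label per row.
    predictions = model.predict(X)
    print(name, predictions[:5])
```

Keeping the models in a dictionary makes it straightforward to run the same training and evaluation loop over all four.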

### Data Preprocessing
- **Missing Data**: Mean imputation for continuous features and mode imputation for categorical features.
- **Feature Scaling**: Min-Max scaling applied to continuous features.
- **Categorical Encoding**: One-hot encoding for categorical variables.
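
The three preprocessing steps above could be combined as in the sketch below, which uses scikit-learn's `ColumnTransformer`. The toy rows and the choice of which columns are continuous or categorical are assumptions made for illustration, not the study's actual configuration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy rows: [age, cholesterol, chest_pain_type]; np.nan marks a missing value.
X = np.array([
    [63.0, 233.0, 1.0],
    [37.0, np.nan, 2.0],
    [41.0, 204.0, np.nan],
    [56.0, 236.0, 1.0],
])

continuous_cols = [0, 1]   # assumed continuous features
categorical_cols = [2]     # assumed categorical feature

# Continuous features: mean imputation, then Min-Max scaling to [0, 1].
continuous_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])
# Categorical features: mode imputation, then one-hot encoding.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array as output.
preprocess = ColumnTransformer(
    [
        ("continuous", continuous_steps, continuous_cols),
        ("categorical", categorical_steps, categorical_cols),
    ],
    sparse_threshold=0,
)

X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```

In practice the transformer would be fit on the training split only and then applied to the test split, so that test-set statistics do not leak into the imputation and scaling.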
""")

# Experiment Setups Section
create_markdown("""
## 4. Experimental Setup
### Dataset Split
The data is split into 80% for training and 20% for testing.
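
A split like this could be produced with scikit-learn's `train_test_split`, as sketched below. The stand-in data, the stratification on the label, and the `random_state` value are assumptions; the study only states the 80/20 ratio.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 rows with a balanced binary label (not the real dataset).
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# test_size=0.2 gives the 80/20 split; stratify and random_state are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```

Stratifying keeps the proportion of positive and negative cases the same in both splits, which matters for small medical datasets.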

### Model Performance
Each model was evaluated on the 20% test split. The results are as follows:

| Model               | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---------------------|----------|-----------|--------|----------|---------|
| Logistic Regression | 79.5%    | 76.2%     | 82.1%  | 79.0%    | 0.83    |
| Decision Tree       | 75.3%    | 71.8%     | 79.2%  | 75.0%    | 0.78    |
| Random Forest       | 83.0%    | 80.7%     | 85.1%  | 82.8%    | 0.87    |
| SVM                 | 80.4%    | 77.5%     | 83.0%  | 80.1%    | 0.85    |
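
For reference, the threshold-based metrics in the table follow the standard confusion-matrix definitions, sketched below. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen only to illustrate the arithmetic; they are not the study's confusion matrices.

```python
def classification_metrics(tp, fp, fn, tn):
    # Standard definitions from the four confusion-matrix counts.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(metrics)

# F1 is the harmonic mean of precision and recall, so it can be
# recomputed from the Logistic Regression row of the table.
lr_f1 = 2 * 76.2 * 82.1 / (76.2 + 82.1)
print(round(lr_f1, 1))  # 79.0, matching the table
```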

### Discussion
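- **Random Forest** outperformed the other models on every reported metric, including accuracy (83.0%) and ROC-AUC (0.87), benefiting from ensemble learning.
- **Logistic Regression** achieved higher recall (82.1%) than precision (76.2%) and stayed close to SVM overall, making it a useful, interpretable baseline.
- **Decision Trees** had the lowest scores on every metric (75.3% accuracy, ROC-AUC 0.78), likely because a single tree overfits the training data.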
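
ROC-AUC, one of the metrics used to rank the models above, can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `roc_auc` and the labels and scores are hypothetical, and a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def roc_auc(labels, scores):
    # Fraction of (positive, negative) pairs in which the positive case
    # has the higher score; ties count as half.
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Because it depends only on the ranking of the scores, ROC-AUC is reported on a 0 to 1 scale rather than as a percentage at a fixed decision threshold.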
""")

# Expected Results Section
create_markdown("""
## 5. Expected Results
We expect **Random Forests** to provide the best performance, with an improvement in ROC-AUC of around 5% compared to previous studies, such as the work by Kumar & Ranjan (2018), which focused only on SVM.
""")

create_markdown("""
## 6. References and Acknowledgments
- Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1989). Heart Disease [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X.
- Kumar, P., & Ranjan, P. (2018). Heart disease prediction using machine learning algorithms: A comparative study. *Journal of Healthcare Engineering*, 2018.
- Parsa, M., McDonald, A., & Smith, J. (2001). Predicting heart disease with machine learning. *Proceedings of the International Conference on Machine Learning*, 2001.

Thank you to the UC Irvine Machine Learning Repository for making this data set publicly available.
""")


