Cardiovascular disease (CVD) is a leading cause of mortality globally, and early identification and management of risk factors can significantly reduce its burden. The UCI heart dataset has been widely used in recent studies to develop predictive models for CVD. However, its limited size and the lack of diversity in its sources raise concerns about the generalizability of models built on it. Additionally, merging datasets from different repositories has led to the removal of a substantial amount of data as duplicates. To address these issues, I propose the use of a new dataset that includes objective medical information, results of medical examinations, and subjective information given by patients. The objective is to develop a predictive model that can accurately predict the risk of CVD, expressed as a percentage, using this new dataset.
The evaluation metric used is the F1 score, which ranges from 0 (total failure) to 1 (perfect score). Hence, the closer a model's score is to 1, the better the model.
- F1 Score: A performance score that combines precision and recall. It is the harmonic mean of these two values. The formula is: 2 × Precision × Recall / (Precision + Recall)
- Precision: The number of items correctly identified as positive out of all items identified as positive. The formula is: TP/(TP+FP)
- Recall / Sensitivity / True Positive Rate (TPR): The number of items correctly identified as positive out of all actual positives. The formula is: TP/(TP+FN)
Where:
- TP = True Positive
- FP = False Positive
- TN = True Negative
- FN = False Negative
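As a minimal sketch of the formulas above, the three scores can be computed directly from confusion-matrix counts. The function name `precision_recall_f1` and the counts used here are hypothetical and chosen only for illustration; they are not results from this project. Returning 0.0 when a denominator is zero is an assumption of this sketch, not something the definitions above prescribe.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from confusion-matrix counts."""
    # Precision = TP / (TP + FP); fall back to 0.0 if nothing was predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall = TP / (TP + FN); fall back to 0.0 if there are no actual positives.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 = harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.667 f1=0.727
```

Note that TN does not appear in any of the three formulas, so the F1 score is unaffected by how many negatives are correctly classified.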
The project has three folders:
- Data: Contains all the datasets used in the project.
- Models: Contains the final models that will be used in the web interface to make predictions on unseen data.
- Notebooks: Contains three sub-folders, each with a self-explanatory name.
The project was implemented in Python with the following libraries:
- Pandas and NumPy for data manipulation and analysis.
- Matplotlib and Seaborn for data visualization.
- PyCaret to automate the machine learning workflow and get a general overview of which models to use on the dataset.
- Scikit-learn for training and evaluating the machine learning models.
- XGBoost and LightGBM for training models on the dataset.
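To illustrate how these pieces might fit together, here is a rough sketch that trains a classifier, scores it with the F1 score, and expresses the predicted risk as a percentage. It is not the project's actual pipeline: the data is synthetic (generated with `make_classification`), and scikit-learn's `GradientBoostingClassifier` is swapped in as a stand-in for XGBoost or LightGBM so the example needs only scikit-learn. All sizes and parameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 samples, 10 features, binary target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% of the rows for evaluation, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stand-in model (assumption): a gradient-boosted tree classifier.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# F1 score on the held-out rows, using the positive class (label 1).
test_f1 = f1_score(y_test, model.predict(X_test))

# Probability of the positive class, expressed as a percentage.
risk_percent = model.predict_proba(X_test)[:, 1] * 100

print(f"F1 on held-out data: {test_f1:.3f}")
print(f"Predicted risk for first held-out row: {risk_percent[0]:.1f}%")
```

The percentage comes from `predict_proba` rather than `predict`, since `predict` returns only the class label.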
I hope that this project can contribute to the field of CVD risk prediction and encourage further research in this area.