CVD

Problem Statements

Cardiovascular disease (CVD) is a leading cause of mortality globally, and early identification and management of risk factors can significantly reduce its burden. The UCI heart dataset has been widely used in recent studies to develop predictive models for CVD. However, its limited size and the lack of diversity in its sources raise concerns about the generalizability of models built on it. Additionally, joining datasets from various repositories led to the removal of a substantial amount of data due to duplicates. To address these issues, I propose using a new dataset that includes objective medical information, results of medical examinations, and subjective information given by patients. The objective is to develop a predictive model that accurately predicts the risk of CVD, expressed as a percentage, using this new dataset.

Evaluation

The evaluation metric used is the F1-score, which ranges from 0 (total failure) to 1 (perfect score). Hence, the closer the score is to 1, the better the model.

F1 Score: A performance score that combines precision and recall as their harmonic mean. Formula: 2 × Precision × Recall / (Precision + Recall)
Precision: The proportion of items identified as positive that are actually positive. Formula: TP / (TP + FP)
Recall / Sensitivity / True Positive Rate (TPR): The proportion of actual positives that are correctly identified as positive. Formula: TP / (TP + FN)

Where:

TP = True Positive
FP = False Positive
TN = True Negative
FN = False Negative
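
The formulas above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual evaluation code: the function name `precision_recall_f1` and the confusion-matrix counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from confusion-matrix counts.

    Returns 0.0 for any metric whose denominator is zero (an assumed
    convention for this sketch).
    """
    # Precision = TP / (TP + FP)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall = TP / (TP + FN)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 = harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only for illustration.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

TN does not appear in any of the three formulas, so the F1-score ignores how many negatives are classified correctly.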

Folders

The project has three folders:

Data: Contains all the datasets used in the project.
Models: Contains the final models that will be used in the web interface to make predictions on unseen data.
Notebooks: Contains three sub-folders, each with a self-explanatory name.

Tools and Technologies

The project was implemented using Python and the following libraries:

Pandas and NumPy for data manipulation and analysis.
Matplotlib and Seaborn for data visualization.
PyCaret to automate the ML workflow and get a general overview of which models to use on the dataset.
Scikit-learn for training and evaluating the machine learning models.
XGBoost and LightGBM for training models on the dataset.

I hope that this project can contribute to the field of CVD risk prediction and encourage further research in this area.

JammalAdeyemi/CardiovascularHeartDisease
