Skip to content

End to end classification project predicting the risk of coronary heart disease based on several demographic, behavioral and medical variables

Notifications You must be signed in to change notification settings

Hpanuganty/Heart_Disease_Classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 

Repository files navigation

This repository will provide a detailed walk through of the 9 essential steps for an end-to-end project using the Framingham Heart Study dataset. The data is from an ongoing cardiovascular study on individuals from Framingham, Massachusetts where study participants are monitored for the risk of Coronary Heart Disease (CHD) based off 15 different variables. With this dataset we'll determine the most relevant variables to the outcome and predict the overall risk of being diagnosed with CHD.

Step 1: Defining the problem Understanding supervised vs unsupervised ml problems and classification vs regression. In this project the label is provided for us and we're looking to classify individuals into two (binary categories), thus this is a supervised classification problem

Step 2: Data Loading Brief explanation of common data loading libraries (NumPy, Pandas, Matplotlib, Seaborn), loading data from a CSV file and describing variables/features

Step 3: Data Cleaning Identifiying and handling null values with specific methodologies/techniques

Step 4: Exploratory Data Analysis

  • Boxplots: Recognizing and removing outliers (extreme data points)
  • Heatmaps: Showcasing the relationship between all variables with the target variable, and tips on reading heatmaps
  • Distplots: Showing frequency distribution and skew of each of the variables
  • Barplots: Plot relationship between two categorical variables
  • Countplots: Displaying the count of numerical observations in a categorical 'bin' with bars
  • Regplots: Plotting trendline to show relationship between a numerical and categorical variable

Step 5: Feature Selection An overview of choosing specific features based off EDA for the ml model and techniques on how to do so. Detailed explanation of SelectKBest and Mutual Information Classification methods.

Step 6: Data pre-processing Summarize pre-processing steps including train-test split, feature scaling and balancing via SMOTE

Step 7: Predictive Modeling Walk through of 4 common classification algorithms and reasoning as to why it was used

  • Logistic Regression: Understanding influence of one or more indepedent variables to single outcome variable
  • Random Forest: Combines multiple decision trees into one model, enhancing performance of each tree model into one strong tree model
  • K-Nearest Neighbors: Calculating 'closeness' between each data point to assign the label
  • Support Vector Machine: Linear separator to divide data points into classes

Step 8: Hyperparameter Tuning Discussed importance of hyperparameter tuning, overview of RandomizedSearch/GridSearch and provided an explanation as to how hyperparameters were adjusted for each model and how tuning affected overall model performance.

Step 9: Model Evaluation Explored the different ways of evaluating a model and why accuracy/confusion matrices were chosen in this instance. Provided reasoning as to why Hypertuned Random Forest was the model that represented the dataset and outcome the best.

model-ranking.jpg

About

End to end classification project predicting the risk of coronary heart disease based on several demographic, behavioral and medical variables

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published