# Heart disease prediction

## Introduction

<div>
<img src="attachment:Health-Heart-Disease.jpg" width="180" align="Right"/>
</div>
According to the <a href = "https://www.cdc.gov/heartdisease/risk_factors.htm">CDC</a>, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), insufficient physical activity, and excessive alcohol consumption. Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Advances in computing, in turn, make it possible to apply machine learning methods that detect patterns in the data and use them to predict a patient's condition.

### Dataset and its features

The dataset originally comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. As the CDC describes: "Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world." The most recent dataset (as of February 15, 2022) includes data from 2020. It consists of 401,958 rows and 279 columns. The vast majority of columns are questions asked of respondents about their health status, such as "Do you have serious difficulty walking or climbing stairs?" or "Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]". In this dataset, I noticed many different factors (questions) that directly or indirectly influence heart disease, so I selected the most relevant variables and cleaned them so that the data would be usable for machine learning projects. The selected variables are:

1. <strong>HeartDisease</strong>: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
2. <strong>BMI</strong>: Body Mass Index (BMI)
3. <strong>Smoking</strong>: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]
4. <strong>AlcoholDrinking</strong>: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
5. <strong>Stroke</strong>: (Ever told) (you had) a stroke?
6. <strong>PhysicalHealth</strong>: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
7. <strong>MentalHealth</strong>: Thinking about your mental health, for how many days during the past 30 days was your mental health not good? 
8. <strong>DiffWalking</strong>: Do you have serious difficulty walking or climbing stairs?
9. <strong>Sex</strong>: Are you male or female?
10. <strong>AgeCategory</strong>: Fourteen-level age category
11. <strong>Race</strong>: Imputed race/ethnicity value
12. <strong>Diabetic</strong>: (Ever told) (you had) diabetes?
13. <strong>PhysicalActivity</strong>: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
14. <strong>GenHealth</strong>: Would you say that in general your health is...
15. <strong>SleepTime</strong>: On average, how many hours of sleep do you get in a 24-hour period?
16. <strong>Asthma</strong>: (Ever told) (you had) asthma?
17. <strong>KidneyDisease</strong>: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? 
18. <strong>SkinCancer</strong>: (Ever told) (you had) skin cancer?
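
For orientation, the 18 variables can be grouped by the kind of values they hold. The sketch below is only an assumption inferred from the column names and sample rows displayed later in this notebook, not an official data dictionary (for example, `Diabetic` is placed with the categorical columns in case it holds more than two answers). The names `COLUMN_GROUPS` and `check_columns` are hypothetical helpers and are not part of the original analysis.

```python
# Assumed grouping of the 18 variables, inferred from the column names
# and sample rows in the cleaned file (not an official data dictionary).
COLUMN_GROUPS = {
    "target": ["HeartDisease"],
    "numeric": ["BMI", "PhysicalHealth", "MentalHealth", "SleepTime"],
    "yes_no": [
        "Smoking", "AlcoholDrinking", "Stroke", "DiffWalking",
        "PhysicalActivity", "Asthma", "KidneyDisease", "SkinCancer",
    ],
    "categorical": ["Sex", "AgeCategory", "Race", "Diabetic", "GenHealth"],
}


def check_columns(columns):
    """Return the expected column names that are missing from `columns`.

    `columns` can be any iterable of names, for example `df.columns`.
    """
    present = set(columns)
    expected = [name for group in COLUMN_GROUPS.values() for name in group]
    return [name for name in expected if name not in present]
```

Once the data is loaded, calling `check_columns(df.columns)` should return an empty list if every expected column is present.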

## Objective

<strong>Explore the dataset and develop a classification model to predict heart disease from a set of features</strong>

# Begin Analysis

## Import Libraries

```python
import pandas as pd
import numpy as np

import seaborn as sns
```

## Reading the dataset

```python
df = pd.read_csv("data/heart_2020_cleaned.csv")
df
```

| | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 16.60 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
| 1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
| 2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
| 3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
| 4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 319790 | Yes | 27.41 | Yes | No | No | 7.0 | 0.0 | Yes | Male | 60-64 | Hispanic | Yes | No | Fair | 6.0 | Yes | No | No |
| 319791 | No | 29.84 | Yes | No | No | 0.0 | 0.0 | No | Male | 35-39 | Hispanic | No | Yes | Very good | 5.0 | Yes | No | No |
| 319792 | No | 24.24 | No | No | No | 0.0 | 0.0 | No | Female | 45-49 | Hispanic | No | Yes | Good | 6.0 | No | No | No |
| 319793 | No | 32.81 | No | No | No | 0.0 | 0.0 | No | Female | 25-29 | Hispanic | No | No | Good | 12.0 | No | No | No |
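
After loading, a few quick checks can help confirm that the file was read as expected. The sketch below is illustrative only: it rebuilds a tiny frame from three of the rows shown above (with a subset of the columns) so that it runs without the CSV file, and the helper name `summarize` and the column `HeartDiseaseFlag` are hypothetical. On the full data, `df` would be passed instead of `sample`.

```python
import io

import pandas as pd

# Three rows copied from the output above (subset of columns), used only
# so this sketch runs without data/heart_2020_cleaned.csv.
sample_csv = io.StringIO(
    "HeartDisease,BMI,Smoking,Sex,AgeCategory,SleepTime\n"
    "No,16.60,Yes,Female,55-59,5.0\n"
    "No,20.34,No,Female,80 or older,7.0\n"
    "Yes,27.41,Yes,Male,60-64,6.0\n"
)
sample = pd.read_csv(sample_csv)


def summarize(frame: pd.DataFrame, target: str = "HeartDisease") -> dict:
    """Collect basic facts about a frame: shape, missing cells, target counts."""
    return {
        "shape": frame.shape,
        "missing": int(frame.isna().sum().sum()),
        "target_counts": frame[target].value_counts().to_dict(),
    }


summary = summarize(sample)
print(summary)

# Encode the Yes/No target as 1/0, a common step before fitting a classifier.
sample["HeartDiseaseFlag"] = sample["HeartDisease"].map({"Yes": 1, "No": 0})
```

Looking at the target counts early is useful for a classification task, because a strongly imbalanced target affects how a model should be evaluated.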
