# Business Understanding
Heart disease is the leading cause of death in the United States. The term "heart disease" refers to several types of heart conditions. The most common type in the United States is coronary artery disease (CAD), which can lead to a heart attack. This project applies machine learning to survey data to better understand how heart disease can be predicted.

# Data Understanding
### About the dataset 
Factors assessed by the BRFSS in 2020 included health status and healthy days, exercise, inadequate sleep, chronic health conditions, oral health, tobacco use, cancer screenings, and health-care access (core section). Optional Module topics for 2020 included prediabetes and diabetes, cognitive decline, electronic cigarettes, cancer survivorship (type, treatment, pain management) and sexual orientation/gender identity (SOGI).

# Data Collection
The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is a system of ongoing health-related telephone surveys designed to collect data on health-related risk behaviors, chronic health conditions, and the use of preventive services from the non-institutionalized adult population (≥ 18 years) residing in the United States. The BRFSS is administered and supported by CDC's Population Health Surveillance Branch, under the Division of Population Health at CDC's National Center for Chronic Disease Prevention and Health Promotion.

<br>

The dataset originally comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. As the CDC describes: "Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world." The most recent dataset (as of February 15, 2022) includes data from 2020. It consists of 401,958 rows and 279 columns. The vast majority of columns are questions asked of respondents about their health status, such as "Do you have serious difficulty walking or climbing stairs?" or "Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]". In this dataset, I noticed many different factors (questions) that directly or indirectly influence heart disease, so I decided to select the most relevant variables from it and do some cleaning so that it would be usable for machine learning projects ([1](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)).

<br>

### Dataset Description [source](https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf)

<br>

|Variable|Label|Question|Value|
|-|-|-|-|
|<b>HeartDisease</b>|Ever had CHD or MI|<i>Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)</i>|-Yes<br>-No|
|<b>BMI</b>|Computed body mass index|<i>Computed body mass index</i>|Integer [1-9999] in the codebook (two implied decimal places); decimal values such as 16.60 in the cleaned data|
|<b>Smoking</b>|Smoked at Least 100 Cigarettes|<i>Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]</i>|-Yes<br>-No|
|<b>AlcoholDrinking</b>|Heavy Alcohol Consumption Calculated Variable|<i>Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)</i>|-Yes<br>-No|
|<b>Stroke</b>|Ever Diagnosed with a Stroke|<i>(Ever told) (you had) a stroke.</i>|-Yes<br>-No|
|<b>PhysicalHealth</b>|Number of Days Physical Health Not Good|<i>Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?</i>|Number of days [0-30]|
|<b>MentalHealth</b>|Number of Days Mental Health Not Good|<i>Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?</i>|Number of days [0-30]|
|<b>DiffWalking</b>|Difficulty Walking or Climbing Stairs|<i>Do you have serious difficulty walking or climbing stairs?</i>|-Yes<br>-No|
|<b>Sex</b>|Are you male or female?|<i>Are you male or female?</i>|-Male<br>-Female|
|<b>AgeCategory</b>|Reported age in five-year age categories calculated variable|<i>Fourteen-level age category</i>|-Age [18-79]<br>-Age [80 or older]|
|<b>Race</b>|Imputed race/ethnicity value|<i>Imputed race/ethnicity value (This value is the reported race/ethnicity or an imputed race/ethnicity, if the respondent refused to give a race/ethnicity. The value of the imputed race/ethnicity will be the most common race/ethnicity response for that region of the state)</i>|-White<br>-Black<br>-Asian<br>-American Indian/Alaskan Native<br>-Hispanic<br>-Other|
|<b>Diabetic</b>|(Ever told) you had diabetes|<i>(Ever told) (you had) diabetes? (If 'Yes' and respondent is female, ask 'Was this only when you were pregnant?'. If respondent says pre-diabetes or borderline diabetes, use response code 4.)</i>|-Yes<br>-No<br>-No, borderline diabetes<br>-Yes (during pregnancy)|
|<b>PhysicalActivity</b>|Exercise in Past 30 Days|<i>During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?</i>|-Yes<br>-No|
|<b>GenHealth</b>|General Health|<i>Would you say that in general your health is:</i>|-Excellent<br>-Very good<br>-Good<br>-Fair<br>-Poor|
|<b>SleepTime</b>|How Much Time Do You Sleep|<i>On average, how many hours of sleep do you get in a 24-hour period?</i>|Number of hours [1-24]|
|<b>Asthma</b>|Ever Told Had Asthma|<i>(Ever told) (you had) asthma?</i>|-Yes<br>-No|
|<b>KidneyDisease</b>|Ever told you have kidney disease?|<i>Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?</i>|-Yes<br>-No|
|<b>SkinCancer</b>|(Ever told) you had skin cancer?|<i>(Ever told) (you had) skin cancer?</i>|-Yes<br>-No|
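The value sets in the table above suggest a natural encoding for modeling. The following is a minimal sketch, not the original author's preprocessing: it maps Yes/No columns to booleans and treats `GenHealth` as an ordered category. The names `BINARY_COLUMNS`, `GEN_HEALTH_ORDER`, and `encode` are hypothetical, and the three sample rows are copied from the notebook output (a subset of columns only).

```python
import pandas as pd

# Three rows copied from the notebook's displayed output (subset of columns).
sample = pd.DataFrame({
    "HeartDisease": ["No", "No", "Yes"],
    "Smoking": ["Yes", "No", "Yes"],
    "GenHealth": ["Very good", "Good", "Fair"],
    "BMI": [16.60, 24.21, 27.41],
})

# Hypothetical helper names; only the Yes/No columns present in the sample are listed.
BINARY_COLUMNS = ["HeartDisease", "Smoking"]
# Ordered from worst to best, following the five GenHealth answers in the table.
GEN_HEALTH_ORDER = ["Poor", "Fair", "Good", "Very good", "Excellent"]


def encode(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Yes/No columns as booleans and GenHealth as an ordered category."""
    out = df.copy()
    for col in BINARY_COLUMNS:
        out[col] = out[col].map({"Yes": True, "No": False})
    out["GenHealth"] = pd.Categorical(
        out["GenHealth"], categories=GEN_HEALTH_ORDER, ordered=True
    )
    return out


encoded = encode(sample)
print(encoded.dtypes)
```

An ordered categorical keeps the ranking between answers such as "Fair" and "Good", which a plain string column would lose. Whether that ordering helps depends on the model chosen later.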

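```python
import pandas as pd

# Load the cleaned 2020 BRFSS heart disease dataset from GitHub.
df = pd.read_csv('https://raw.githubusercontent.com/Machine-Learning-Projects1/2020-BRFSS-Codebook-CDC/main/dataset/heart_2020_cleaned.csv')
display(df)  # display() is available in Jupyter/IPython sessions
```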

|Unnamed: 0|HeartDisease|BMI|Smoking|AlcoholDrinking|Stroke|PhysicalHealth|MentalHealth|DiffWalking|Sex|AgeCategory|Race|Diabetic|PhysicalActivity|GenHealth|SleepTime|Asthma|KidneyDisease|SkinCancer|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|No|16.60|Yes|No|No|3|30|No|Female|55-59|White|Yes|Yes|Very good|5|Yes|No|Yes|
|1|No|20.34|No|No|Yes|0|0|No|Female|80 or older|White|No|Yes|Very good|7|No|No|No|
|2|No|26.58|Yes|No|No|20|30|No|Male|65-69|White|Yes|Yes|Fair|8|Yes|No|No|
|3|No|24.21|No|No|No|0|0|No|Female|75-79|White|No|No|Good|6|No|No|Yes|
|4|No|23.71|No|No|No|28|0|Yes|Female|40-44|White|No|Yes|Very good|8|No|No|No|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|319790|Yes|27.41|Yes|No|No|7|0|Yes|Male|60-64|Hispanic|Yes|No|Fair|6|Yes|No|No|
|319791|No|29.84|Yes|No|No|0|0|No|Male|35-39|Hispanic|No|Yes|Very good|5|Yes|No|No|
|319792|No|24.24|No|No|No|0|0|No|Female|45-49|Hispanic|No|Yes|Good|6|No|No|No|
|319793|No|32.81|No|No|No|0|0|No|Female|25-29|Hispanic|No|No|Good|12|No|No|No|
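
The `Unnamed: 0` column in the output is what pandas produces when a CSV's first column has an empty header, which usually means a row index was saved along with the data. Assuming that is the case for this file, a small sketch of how `index_col=0` would absorb it is shown below. The inline `csv_text` is a hypothetical three-row excerpt built from the rows displayed above, not the real file.

```python
import io

import pandas as pd

# Hypothetical excerpt: first header cell is empty, as a saved index would be.
csv_text = """,HeartDisease,BMI,Smoking
0,No,16.60,Yes
1,No,20.34,No
2,No,26.58,Yes
"""

# Default parsing keeps the saved index as an extra data column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(with_index.columns))
```

Dropping the column this way keeps it out of the feature set, where a plain row counter would otherwise be treated as a predictor.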
