# Heart Disease Dataset
________________________________________________________

##### Our Data Source: <br> https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset/data 

##### Data Cleaning Notebook <br> https://www.kaggle.com/code/alexteboul/heart-disease-health-indicators-dataset-notebook/notebook 

##### Codebook 2015: <br> https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

--------------------------------------------------------

Original data source (HUGE): 
https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system 


Relevant Research Paper using BRFSS for Diabetes ML: 
https://www.cdc.gov/pcd/issues/2019/19_0109.htm


#### ... Short summary: 
... how prevalent heart disease is <br>
... It being the #1 cause of death worldwide <br>
... Why preventative measures are crucial <br>
... How Big Data and ML can help in this domain <br>

##### ...A little background info: the difference between heart disease, heart attack, and stroke:
- Heart disease - any disease affecting the heart, such as heart arrhythmia, etc.
- Heart attack - a myocardial infarction: a blockage in a coronary artery, which supplies the heart itself with blood.
- Stroke - a blockage of an artery supplying the brain with blood, resulting in temporary or permanent nerve and brain damage or, at worst, brain death.


        #### Some background info from the Kaggle Notebook
        The Centers for Disease Control and Prevention has identified high blood pressure, high blood cholesterol, and smoking as three key risk factors for heart disease. Roughly half of Americans have at least one of these three risk factors. The National Heart, Lung, and Blood Institute highlights a wider array of factors such as Age, Environment and Occupation, Family History and Genetics, Lifestyle Habits, Other Medical Conditions, Race or Ethnicity, and Sex for clinicians to use in diagnosing coronary heart disease. Diagnosis tends to be driven by an initial survey of these common risk factors followed by bloodwork and other tests.

        Content
        The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that is collected annually by the CDC. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventative services. It has been conducted every year since 1984. For this project, I downloaded a csv of the dataset available on Kaggle for the year 2015. This original dataset contains responses from 441,455 individuals and has 330 features. These features are either questions directly asked of participants, or calculated variables based on individual participant responses.

        This dataset contains 253,680 survey responses from cleaned BRFSS 2015 to be used primarily for the binary classification of heart disease. Note that there is strong class imbalance in this dataset. 229,787 respondents do not have/have not had heart disease while 23,893 have had heart disease. The questions to be explored are:

        1. To what extent can survey responses from the BRFSS be used for predicting heart disease risk?
        2. Can a subset of questions from the BRFSS be used for preventative health screening for diseases like heart disease?
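
To make the class imbalance mentioned above concrete, here is a minimal sketch that uses only the two counts quoted in the description. The variable names are illustrative. The "majority baseline" is the accuracy a model would reach by always predicting "no heart disease", which is why accuracy alone can be misleading on this dataset.

```python
# Class counts quoted in the dataset description above
no_heart_disease = 229_787
heart_disease = 23_893
total = no_heart_disease + heart_disease

# Share of respondents in the positive class
positive_share = heart_disease / total

# Accuracy of a trivial model that always predicts the majority class (0)
majority_baseline_accuracy = no_heart_disease / total

# How many negative cases there are per positive case
imbalance_ratio = no_heart_disease / heart_disease

print(f"total respondents: {total}")
print(f"share with heart disease: {positive_share:.4f}")
print(f"accuracy of always predicting 'no': {majority_baseline_accuracy:.4f}")
print(f"imbalance ratio: {imbalance_ratio:.1f} : 1")
```

Any classifier we train later should be judged against this baseline, not against 50%.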

## Exploring the Data

In [2]:
#import the relevant libraries
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix



In [3]:
#load our data
df = pd.read_csv("../data/heart_disease_health_indicators_BRFSS2015.csv", delimiter=",")

In [5]:
#check type
print("type of data: ", type(df))

#check dimensions
print("dimensions are: ", df.shape)

type of data:  <class 'pandas.core.frame.DataFrame'>
dimensions are:  (253680, 22)


So we have a DataFrame with:
-  253 680 rows (an actual count, not a 0-based index) => 253 680 respondents
-  22      columns

In [6]:
# check first 10 rows
df.head(10)

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0
5,0.0,1.0,1.0,1.0,25.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,2.0,0.0,1.0,10.0,6.0,8.0
6,0.0,1.0,0.0,1.0,30.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,3.0,0.0,14.0,0.0,0.0,9.0,6.0,7.0
7,0.0,1.0,1.0,1.0,25.0,1.0,0.0,0.0,1.0,0.0,...,1.0,0.0,3.0,0.0,0.0,1.0,0.0,11.0,4.0,4.0
8,1.0,1.0,1.0,1.0,30.0,1.0,0.0,2.0,0.0,1.0,...,1.0,0.0,5.0,30.0,30.0,1.0,0.0,9.0,5.0,1.0
9,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,2.0,0.0,0.0,0.0,1.0,8.0,4.0,3.0


In [8]:
# check last 10 rows
df.tail(10)

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
253670,1.0,1.0,1.0,1.0,25.0,0.0,0.0,2.0,0.0,1.0,...,1.0,0.0,5.0,15.0,0.0,1.0,0.0,13.0,6.0,4.0
253671,1.0,1.0,1.0,1.0,23.0,0.0,1.0,0.0,0.0,0.0,...,1.0,1.0,4.0,0.0,5.0,0.0,1.0,8.0,3.0,2.0
253672,1.0,1.0,0.0,1.0,30.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,12.0,2.0,1.0
253673,0.0,1.0,0.0,1.0,42.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,14.0,4.0,0.0,1.0,3.0,6.0,8.0
253674,0.0,0.0,0.0,1.0,27.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,6.0,5.0
253675,0.0,1.0,1.0,1.0,45.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,5.0,0.0,1.0,5.0,6.0,7.0
253676,0.0,1.0,1.0,1.0,18.0,0.0,0.0,2.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,1.0,0.0,11.0,2.0,4.0
253677,0.0,0.0,0.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,5.0,2.0
253678,0.0,1.0,0.0,1.0,23.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,7.0,5.0,1.0
253679,1.0,1.0,1.0,1.0,25.0,0.0,0.0,2.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,9.0,6.0,2.0


#### REMARKS
NB: the total number of respondents is 253 680, but the index of the last row (respondent) is 253 679, one less, because of 0-indexing.

NB2: Already from exploring the head and tail of our DataFrame we notice that we have a homogeneous dataset: every value is numerical and of type float64, whether it is binary, ordinal, or a count or measurement (such as BMI). So the creator of the dataset has already cleaned the data and encoded it numerically. What this means for our project:
- Part of our job is done for us => we probably won't have to perform data cleaning and encoding ourselves. 
- But it also comes with a built-in bias that the creator of the dataset introduced through his cleaning and interpretation of the data. Therefore, we have some extra exploratory and investigative work: figuring out how our data was cleaned and encoded, and why.
- Thankfully, the author who cleaned the original dataset also published a detailed notebook on how he brought it to its current form, so we don't have to guess much. Link to the notebook cleaning the dataset: <br> https://www.kaggle.com/code/alexteboul/heart-disease-health-indicators-dataset-notebook/notebook 

Summary: Using a preprocessed dataset saves us some time and effort but introduces biases from the creator's interpretations. We'll likely skip data cleaning and encoding but need to explore how it was done and why.
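
The "homogeneous dataset" observation can be checked in code rather than by eye. The sketch below uses a small hand-made frame (`sample`, values copied from the first rows shown above) instead of the real file. The downcast to `int8` is our own assumption, not something the dataset author does; it is only safe because every value here is a small whole number.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the real DataFrame
sample = pd.DataFrame({
    "HighBP": [1.0, 0.0, 1.0],
    "BMI": [40.0, 25.0, 28.0],
    "GenHlth": [5.0, 3.0, 5.0],
})

# True when every column has dtype float64
all_float = sample.select_dtypes(include="float64").shape[1] == sample.shape[1]

# Per column: True when every value is a whole number (no fractional part)
whole_numbers = (sample % 1 == 0).all()

# Optional: store whole-number columns more compactly (assumes values fit in int8)
compact = sample.astype("int8") if whole_numbers.all() else sample

print(all_float)
print(whole_numbers)
print(compact.dtypes)
```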

In [9]:
# get column labels
df.columns

Index(['HeartDiseaseorAttack', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
       'Smoker', 'Stroke', 'Diabetes', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income'],
      dtype='object')

#### REMARKS
These are the 22 columns that our dataset contains.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   HeartDiseaseorAttack  253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   Diabetes              253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14  GenHlth               253680 non-null  float64
 15  

#### REMARKS

- number of rows (RangeIndex length): 253680
- index range: 0 to 253679
- columns: 22 named columns, none with null values, all of dtype float64

In [10]:
# get quick statistical summary of our dataset
df.describe()

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
count,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,...,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0
mean,0.094186,0.429001,0.424121,0.96267,28.382364,0.443169,0.040571,0.296921,0.756544,0.634256,...,0.951053,0.084177,2.511392,3.184772,4.242081,0.168224,0.440342,8.032119,5.050434,6.053875
std,0.292087,0.494934,0.49421,0.189571,6.608694,0.496761,0.197294,0.69816,0.429169,0.481639,...,0.215759,0.277654,1.068477,7.412847,8.717951,0.374066,0.496429,3.05422,0.985774,2.071148
min,0.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,6.0,4.0,5.0
50%,0.0,0.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,7.0
75%,0.0,1.0,1.0,1.0,31.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,2.0,3.0,0.0,1.0,10.0,6.0,8.0
max,1.0,1.0,1.0,1.0,98.0,1.0,1.0,2.0,1.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,1.0,13.0,6.0,8.0


#### REMARKS

NB1: Already here we get a quick statistical overview of each column:
- the count 
- the mean
- the standard deviation
- the min and max values
- the 25%, 50% and 75% values: the quartiles, i.e. the values below which 25%, 50% (the median) and 75% of the observations fall
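
As a small illustration of the 25%, 50% and 75% rows, the sketch below computes the same quantities directly with `Series.quantile`. The five BMI values are made up for the example; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical BMI values, already sorted for readability
bmi_sample = pd.Series([20.0, 24.0, 27.0, 31.0, 40.0], name="BMI")

# The three quartiles that describe() reports as 25%, 50% and 75%
quartiles = bmi_sample.quantile([0.25, 0.5, 0.75])

# The 50% value is the median
median = bmi_sample.median()

print(quartiles)
print(median)
```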

NB2: The min and max values give us a pretty good idea of which columns could serve as our target (since it has to be binary: 0 or 1)
- In our case those columns are: <br> 'Heart Disease or Attack', 'High Blood Pressure', 'High Cholesterol', 'Cholesterol Check', 'Smoker', 'Stroke', 'Physical Activity', 'Fruits', 'Veggies', 'Heavy Alcohol Consumption', 'Any Healthcare', 'No Doctor because of Cost', 'Difficulty Walking' and 'Sex' ('Veggies' and 'Heavy Alcohol Consumption' are hidden behind the ... in the output above, but they are also 0/1 columns)
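
A min of 0 and a max of 1 does not strictly prove that a column is binary, so it may be safer to look at the unique values themselves. The following is a minimal sketch on a hand-made frame (`sample` is hypothetical, with column names borrowed from the dataset); it also shows one way to separate the target from the features.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the real DataFrame
sample = pd.DataFrame({
    "HeartDiseaseorAttack": [0.0, 1.0, 0.0, 0.0],
    "HighBP": [1.0, 0.0, 1.0, 1.0],
    "Diabetes": [0.0, 2.0, 1.0, 0.0],
    "BMI": [40.0, 25.0, 28.0, 27.0],
})

# A column counts as binary when all of its values are 0 or 1
binary_columns = [
    col for col in sample.columns
    if set(sample[col].unique()) <= {0.0, 1.0}
]

# Split into target (what we predict) and features (what we predict from)
target_name = "HeartDiseaseorAttack"
y = sample[target_name]
X = sample.drop(columns=target_name)

print(binary_columns)
print(list(X.columns))
```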

In [12]:
# Get the mean of each column
df.mean()

HeartDiseaseorAttack     0.094186
HighBP                   0.429001
HighChol                 0.424121
CholCheck                0.962670
BMI                     28.382364
Smoker                   0.443169
Stroke                   0.040571
Diabetes                 0.296921
PhysActivity             0.756544
Fruits                   0.634256
Veggies                  0.811420
HvyAlcoholConsump        0.056197
AnyHealthcare            0.951053
NoDocbcCost              0.084177
GenHlth                  2.511392
MentHlth                 3.184772
PhysHlth                 4.242081
DiffWalk                 0.168224
Sex                      0.440342
Age                      8.032119
Education                5.050434
Income                   6.053875
dtype: float64

#### REMARKS

This was already shown by df.describe(), but here we get a more focused and cleaner look at just the mean of each column.
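
One useful way to read these means: for a column that only holds 0 and 1, the mean is simply the share of rows with a 1. The sketch below demonstrates this on ten made-up values; `target` is a hypothetical series, not the real column.

```python
import pandas as pd

# Hypothetical 0/1 column: 2 of the 10 respondents have the value 1
target = pd.Series(
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    name="HeartDiseaseorAttack",
)

# Mean of the column
mean_value = target.mean()

# Share of rows equal to 1, computed by counting
share_of_ones = (target == 1).sum() / len(target)

print(mean_value, share_of_ones)
```

Read this way, the mean of 0.094186 for HeartDiseaseorAttack says that roughly 9.4% of respondents reported heart disease or a heart attack.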

NB: Now is also a good time to look into the different numerical values and explore how and why they were encoded. Thankfully, the creator of our dataset has also provided a nice descriptive overview of what the different categories are and what their values represent.
Link to his Kaggle: 

### How the dataset got to its current state

The original data is part of the CDC's (Centers for Disease Control and Prevention) annual data collection for their Behavioral Risk Factor Surveillance System (BRFSS). The data is collected through phone interviews with randomly selected respondents in the USA. The data in this project is based on the BRFSS 2015 Codebook Report.
<br>Source: https://health.gov/healthypeople/objectives-and-data/data-sources-and-methods/data-sources/behavioral-risk-factor-surveillance-system-brfss


While the original dataset has 330 features and includes 441,455 respondents, the author of the cleaned version decided to narrow the focus by selecting the factors most relevant to heart disease. Based on medical research he created a new dataset with the following selected categories (not in exact order of importance):
- high blood pressure
- high cholesterol
- smoking
- diabetes
- obesity
- age
- sex
- race
- diet
- exercise
- alcohol consumption
- BMI
- household income
- marital status
- sleep
- time since last checkup
- education
- health care coverage
- mental health

Then he did an initial cleaning of the new dataset with the dropna() function, which removed over 100 000 rows containing missing values. Afterwards he did some encoding, which is described in detail in the following section.
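
For readers unfamiliar with `dropna()`, here is a minimal sketch of what that step does. The frame `toy` and its values are invented for illustration; only the column names are borrowed from the codebook.

```python
import numpy as np
import pandas as pd

# Hypothetical raw rows; NaN marks an answer that is not available
toy = pd.DataFrame({
    "_MICHD": [1.0, 2.0, np.nan, 2.0],
    "_BMI5": [2850.0, np.nan, 2475.0, 3100.0],
})

# Drop every row that has at least one missing value
complete = toy.dropna()

# Number of rows lost in the process
rows_removed = len(toy) - len(complete)

print(complete)
print(rows_removed)
```

Note that by default a single missing value is enough to drop the whole row, which is why this step can remove so many respondents.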

### Category exploration and encoding

Attributes description (from the author's cleaning notebook): https://www.kaggle.com/code/alexteboul/heart-disease-health-indicators-dataset-notebook/notebook
- **HeartDiseaseorAttack** - (0, 1)<br>
    Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
    -   0 - No heart disease or attack
    -   1 - Has heart disease or attack
    <br>
    <br>
        - Codes in the original codebook:
        - The original code in the codebook is _MICHD
        - 1 - people who "reported having coronary heart disease (CHD) or myocardial infarction (MI)"
        - 2 - "Did not report having MI or CHD"
        - 9 represents people who gave an answer of "Don’t know/Not Sure/Refused/Missing".
        - Every category includes a small percentage of people listed under the value 9 (in some places it is the value 7 or "BLANK" instead) - those respondents were removed from the dataset in all categories, and going forward this step is implied without being explicitly stated.
    <br>
    <br>
        - Encoding in the new database:
        - The author changed the code of those who did not report CHD or MI from 2 to 0.
    <br>
- **HighBP** - (0, 1)<br>
    Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional
    -   0 - DOESN'T HAVE high blood pressure
    -   1 - HAS high blood pressure
    <br>
    <br>
        - The original code in the codebook is _RFHYPE5
        - 1 - people who haven't been told they have high blood pressure (HBP) by a doctor, nurse, or other health professional
        - 2 - people who have been told they have HBP.
    <br>
    <br>
        - The author remaps 1 to 0 for those who don't have HBP and 2 to 1 for those who do have it.
    <br>
- **HighChol** - (0, 1)<br>
    Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high?
    -   0 - DOESN'T HAVE high cholesterol
    -   1 - HAS high cholesterol
    <br>
    <br>
        - The original code in the codebook is TOLDHI2
        - 1 - people, who have been told they have high cholesterol
        - 2 - people who haven't been told they have high cholesterol.
    <br>
    <br>
        - The author remaps 2 to 0 for those who don't have it.
    <br>
- **CholCheck** - (0, 1)<br>
    Cholesterol check within past five years
    -   0 - HASN'T HAD cholesterol check within past 5 years
    -   1 - HAD cholesterol check within past 5 years
    <br>
    <br>
        - The original code in the codebook is _CHOLCHK
        - 1 - people who had a cholesterol check in the past 5 years
        - 2 - people who did not have a cholesterol check in that period
        - 3 - people who never have had a check.
    <br>
    <br>    
        - The author remaps 2 and 3 to 0 for those who didn't have it.
    <br>
- **BMI** <br>
    Body Mass Index (BMI)
    - Values from 1.00 to 99.99 
    <br>
    <br>
        - The original code in the codebook is _BMI5
        - There the values are between 1 and 9999.
    <br>
    <br>    
        - The author divides the values by 100.
    <br>
- **Smoker** - (0, 1)<br>
    Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]
    -   0 - Non-smoker
    -   1 - Smoker
    <br>
    <br>
        - The original code in the codebook is SMOKE100
        - 1 - people who have smoked at least 100 cigarettes in their life
        - 2 - people who haven't smoked at least 100 cigarettes in their life.
    <br>
    <br>
        - The author remaps 2 to 0 for those who haven't.
    <br>
- **Stroke** - (0, 1) <br>
    (Ever told) you had a stroke.
    -   0 - HASN'T HAD a stroke
    -   1 - HAD a stroke
    <br>
    <br>
        - The original code in the codebook is CVDSTRK3
        - 1 - people, who have had a stroke
        - 2 - people who haven't had a stroke
    <br>
    <br>
        - The author remaps 2 to 0 for those who haven't had one.
    <br>
- **Diabetes**  (0, 1, 2)<br>
    (Ever told) you have diabetes
    -   0 - no diabetes
    -   1 - pre-diabetes or borderline diabetes
    -   2 - Diabetes
    <br>
    <br>
        - The original code in the codebook is DIABETE3
        - 1 - people who have diabetes
        - 2 - Females who had it only during pregnancy
        - 3 - People who don't have it
        - 4 - People who don't have it, but are pre-diabetes or borderline diabetes
    <br>
    <br>
        - The author encodes with 0 for no diabetes or only during pregnancy, 1 is for pre-diabetes or borderline diabetes, 2 is for yes diabetes
    <br>
- **PhysActivity**  (0, 1)<br>
    Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
    -   0 - No physical activity
    -   1 - Yes, physical activity
    <br>
    <br>
        - The original code in the codebook is _TOTINDA
        - 1 - Had physical activity or exercise
        - 2 - No physical activity or exercise in last 30 days
    <br>
    <br>
        - The author changes 2 to 0.
    <br>
- **Fruits**  (0, 1)<br>
    Consume Fruit 1 or more times per day
    -   0 - No, doesn't consume fruit
    -   1 - Yes, consumes fruit at least once a day
    <br>
    <br>
        - The original code in the codebook is _FRTLT1
        - 1 - Consumed fruit one or more times per day
        - 2 - Consumed fruit less than one time per day
    <br>
    <br>
        - The author changes 2 to 0.
    <br>
- **Veggies** (0, 1) <br>
    Consume Vegetables 1 or more times per day
    -   0 - No, doesn't consume vegetables
    -   1 - Yes, consumes vegetables at least once a day
    <br>
    <br>
        - The original code in the codebook is _VEGLT1
        - 1 - Consumed vegetables one or more times per day
        - 2 - Consumed vegetables less than one time per day
    <br>
    <br>
        - The author changes 2 to 0.
    <br>
- **HvyAlcoholConsump**  (0, 1)<br>
    Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
    -   0 - Not a heavy drinker
    -   1 - Heavy drinker
    <br>
    <br>
        - The original code in the codebook is _RFDRHV5
        - 1 - No
        - 2 - Yes
    <br>
    <br>
        - The author changes 2 to 1 for heavy drinkers and 1 to 0 for non-heavy drinkers.
    <br>
- **AnyHealthcare**  (0, 1)<br>
    Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?
    -   0 - Doesn't have any healthcare
    -   1 - Has healthcare
    <br>
    <br>
        - The original code in the codebook is HLTHPLN1
        - 1 - Yes
        - 2 - No
    <br>
    <br>
        - The author changes 2 to 0.
    <br>
- **NoDocbcCost**  (0, 1)<br>
    Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
    -   0 - No
    -   1 - Yes
    <br>
    <br>
        - The original code in the codebook is MEDCOST
        - 1 - Yes
        - 2 - No
    <br>
    <br>
        - The author changes 2 to 0.
    <br>
- **GenHlth** <br>
    Would you say that in general your health is?
    - 1 - Excellent
    - 2 - Very good
    - 3 - Good
    - 4 - Fair
    - 5 - Poor
    <br>
    <br>
        - The original code in the codebook is GENHLTH
        - 1 - Excellent
        - 2 - Very good
        - 3 - Good
        - 4 - Fair
        - 5 - Poor
        - 7 - Don't know / Not sure
        - 9 - Refused
        - BLANK - Not asked or Missing
    <br>
    <br>
        - The author keeps this ordinal variable as it is and only removes 7, 9 and BLANK
    <br>
- **MentHlth** <br>
    Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
    - 0 - no days of poor mental health (aka great mental health)
    - 1-30 - Number of days in poor mental health
    <br>
    <br>
        - The original code in the codebook is MENTHLTH
        - 1-30 - Number of days in poor mental health
        - 88 - None
        - 77 - Don’t know/Not sure
        - 99 - Refused
    <br>
    <br>
        - The author keeps the values giving the number of days of poor mental health and changes 88 to 0 for people who stated they had 0 days of bad mental health.
    <br>
- **PhysHlth** <br>
    Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
    - 0 - no days of poor physical health (aka great physical health)
    - 1-30 - Number of days in poor physical health
    <br>
    <br>
        - The original code in the codebook is PHYSHLTH
        - 1-30 - Number of days in poor physical health
        - 88 - None
    <br>
    <br>
        - The author keeps the values giving the number of days of poor physical health and changes 88 to 0 for people who stated they had 0 days of bad physical health.
    <br>
- **DiffWalk**  (0, 1)<br>
    Do you have serious difficulty walking or climbing stairs?
    -   0 - no difficulty walking
    -   1 - yes difficulty walking
    <br>
    <br>
        - The original code in the codebook is DIFFWALK
        - 1 - Yes
        - 2 - No
    <br>
    <br>
        - The author changes 2 to 0. 
    <br>
- **Sex**  (0, 1)<br>
    Indicate sex of respondent.
    -   0 - Female
    -   1 - Male  
    <br>
    <br>
        - The original code in the codebook is SEX
        - 1 - Male
        - 2 - Female
    <br>
    <br>
        - The author changes 2 to 0 for female, arguing that the choice is somewhat arbitrary, but since men are generally more prone to heart disease it makes sense to leave them at 1.
    <br>
- **Age** <br>
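    Age category in five-year groups (the codebook's fourteen-level variable; the 14th level, "Don’t know/Refused/Missing", is removed, leaving 13 levels).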
    - 1 - Age 18 to 24
    - 2 - Age 25 to 29
    - 3 - Age 30 to 34
    - 4 - Age 35 to 39
    - 5 - Age 40 to 44
    - 6 - Age 45 to 49
    - 7 - Age 50 to 54
    - 8 - Age 55 to 59
    - 9 - Age 60 to 64
    - 10 - Age 65 to 69
    - 11 - Age 70 to 74
    - 12 - Age 75 to 79
    - 13 - Age 80 or older
    <br>
    <br>
        - The original code in the codebook is _AGEG5YR
        - 1 - Age 18 to 24
        - 2 - Age 25 to 29
        - 3 - Age 30 to 34
        - 4 - Age 35 to 39
        - 5 - Age 40 to 44
        - 6 - Age 45 to 49
        - 7 - Age 50 to 54
        - 8 - Age 55 to 59
        - 9 - Age 60 to 64
        - 10 - Age 65 to 69
        - 11 - Age 70 to 74
        - 12 - Age 75 to 79
        - 13 - Age 80 or older
        - 14 - Don’t know/Refused/Missing
    <br>
    <br>
        - The author removes 14 and keeps the rest as it is.
    <br>
- **Education** <br>
    What is the highest grade or year of school you completed?
    - 1 - Never attended school or only kindergarten
    - 2 - Grades 1 through 8 (Elementary)
    - 3 - Grades 9 through 11 (Some high school)
    - 4 - Grade 12 or GED (High school graduate)
    - 5 - College 1 year to 3 years (Some college or technical school)
    - 6 - College 4 years or more (College graduate)
    <br>
    <br>
        - The original code in the codebook is EDUCA
        - 1 - Never attended school or only kindergarten
        - 2 - Grades 1 through 8 (Elementary)
        - 3 - Grades 9 through 11 (Some high school)
        - 4 - Grade 12 or GED (High school graduate)
        - 5 - College 1 year to 3 years (Some college or technical school)
        - 6 - College 4 years or more (College graduate)
    <br>
    <br>
        - The author keeps it as it is.
    <br>
- **Income** <br>
    Annual household income from all sources.
    - 1 - Less than $10,000
    - 2 - $10,000 to less than $15,000
    - 3 - $15,000 to less than $20,000
    - 4 - $20,000 to less than $25,000
    - 5 - $25,000 to less than $35,000
    - 6 - $35,000 to less than $50,000
    - 7 - $50,000 to less than $75,000
    - 8 - $75,000 or more
    <br>
    <br>
        - The original code in the codebook is INCOME2
        - 1 - Less than $10,000
        - 2 - $10,000 to less than $15,000
        - 3 - $15,000 to less than $20,000
        - 4 - $20,000 to less than $25,000
        - 5 - $25,000 to less than $35,000
        - 6 - $35,000 to less than $50,000
        - 7 - $50,000 to less than $75,000
        - 8 - $75,000 or more
    <br>
    <br>
        - The author keeps it as it is. 
    <br>
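
The recoding rules listed above can be sketched in a few lines of pandas. This is not the author's actual code (that lives in his cleaning notebook); it is a minimal illustration on made-up rows, covering four of the variables: `_MICHD` (2 becomes 0), `DIABETE3` (remapped to 0/1/2), `_BMI5` (divided by 100) and `MENTHLTH` (88 becomes 0). The function name `recode` and the frame `raw` are hypothetical.

```python
import pandas as pd

# Hypothetical raw rows using the codebook's original codes (not real BRFSS data)
raw = pd.DataFrame({
    "_MICHD":   [1, 2, 2, 2, 1, 2],
    "DIABETE3": [1, 2, 3, 4, 9, 3],
    "_BMI5":    [2850, 3100, 2475, 1999, 2600, 2200],
    "MENTHLTH": [88, 5, 30, 10, 88, 77],
})


def recode(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the recoding rules described in the list above to a raw frame."""
    # Remove "don't know" / "refused" answers before recoding
    keep = ~frame["DIABETE3"].isin([7, 9]) & ~frame["MENTHLTH"].isin([77, 99])
    out = frame.loc[keep].copy()

    # Binary variable: 2 ("no") becomes 0, 1 ("yes") stays 1
    out["_MICHD"] = out["_MICHD"].replace({2: 0})

    # Diabetes: 1 -> 2 (diabetes), 4 -> 1 (pre-diabetes), 2 and 3 -> 0 (none)
    out["DIABETE3"] = out["DIABETE3"].map({1: 2, 2: 0, 3: 0, 4: 1})

    # BMI is stored with two implied decimals, so divide by 100
    out["_BMI5"] = out["_BMI5"] / 100

    # 88 means "no days", so it becomes 0
    out["MENTHLTH"] = out["MENTHLTH"].replace({88: 0})

    return out.reset_index(drop=True)


clean = recode(raw)
print(clean)
```

The remaining binary variables follow the same pattern as `_MICHD`, with the direction of the mapping depending on which original code means "yes".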

    

After this cleaning the new dataset is left with:
- 253680 rows
- 22 columns

The last thing the author does is give the features more readable names. The features are renamed as follows: <br>
{'_MICHD':'HeartDiseaseorAttack', '_RFHYPE5':'HighBP', 'TOLDHI2':'HighChol', '_CHOLCHK':'CholCheck', '_BMI5':'BMI', 'SMOKE100':'Smoker', 'CVDSTRK3':'Stroke', 'DIABETE3':'Diabetes', '_TOTINDA':'PhysActivity', '_FRTLT1':'Fruits', '_VEGLT1':'Veggies', '_RFDRHV5':'HvyAlcoholConsump', 'HLTHPLN1':'AnyHealthcare', 'MEDCOST':'NoDocbcCost', 'GENHLTH':'GenHlth', 'MENTHLTH':'MentHlth', 'PHYSHLTH':'PhysHlth', 'DIFFWALK':'DiffWalk', 'SEX':'Sex', '_AGEG5YR':'Age', 'EDUCA':'Education', 'INCOME2':'Income'}
<br>
<br>
<br>
<br>
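
A renaming like this is typically done with `DataFrame.rename`. The sketch below applies a three-entry excerpt of the mapping above to a made-up frame; `raw` and its values are hypothetical.

```python
import pandas as pd

# Excerpt of the codebook-to-readable-name mapping shown above
rename_map = {
    "_MICHD": "HeartDiseaseorAttack",
    "_RFHYPE5": "HighBP",
    "_BMI5": "BMI",
}

# Hypothetical already-recoded rows that still carry the codebook names
raw = pd.DataFrame({
    "_MICHD": [0, 1],
    "_RFHYPE5": [1, 0],
    "_BMI5": [28.5, 31.0],
})

# rename returns a new DataFrame; columns missing from the map keep their names
renamed = raw.rename(columns=rename_map)

print(list(renamed.columns))
```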

----------------------------------------------------