## Data Cleaning and Preparation

<p>Data cleaning and preparation are crucial steps in the data analysis process. These steps involve transforming raw data into a format that is suitable for analysis, and ensuring that the data is accurate, complete, and consistent. </p>

The following are some key steps involved in our data cleaning and preparation process:

1. **Data Collection**: our raw dataset comes from the 2021 BRFSS survey conducted by the Centers for Disease Control and Prevention, with over 400,000 survey participants in the United States. The original data file can be accessed at: https://www.cdc.gov/brfss/annual_data/annual_2021.html 
<br>`The original file is in XPT format; the Python package "xport" (installed via pip) was used to convert it to CSV.`


2. **Data Extraction**: the survey dataset consists of 303 columns, which contain responses to the different questions asked in the survey. Extensive research was conducted to identify factors related to our research area, i.e., hypertension. The relevant variable columns were then identified and extracted from the survey dataset.


3. **Data Cleaning**: numerous steps were taken to clean the dataset, including tackling missing values from survey respondents, dropping irrelevant responses and removing outliers for numeric data.


4. **Data Transformation**: after cleaning, the data is transformed into a format that is suitable for analysis. This involves creating new variables by manipulating existing columns and decoding categorical data into corresponding levels.


5. **Data Documentation & Export**: the last step involves a detailed documentation of the data. Our codebook can be found in the data description file.



In [7]:
# Basic Libraries
import zipfile
import numpy as np
import pandas as pd
import seaborn as sb #for graphics
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

In [8]:
#Import raw survey dataset
# use of zipfile due to too big of a file

# If zipfile doesnt work, please unzip and use this instead of below commands
# rawData = pd.read_csv('./Data/2021raw.csv')

zf = zipfile.ZipFile("./Data/2021raw.zip")
rawData = pd.read_csv(zf.open('2021raw.csv'))

In [9]:
rawData.head()

Unnamed: 0.1,Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,...,_FRTRES1,_VEGRES1,_FRUTSU1,_VEGESU1,_FRTLT1A,_VEGLT1A,_FRT16A,_VEG23A,_FRUITE1,_VEGETE1
0,0,1.0,1.0,1192021,1,19,2021,1100.0,2021000001,2021000000.0,...,1.0,1.0,100.0,214.0,1.0,1.0,1.0,1.0,0.0,0.0
1,1,1.0,1.0,1212021,1,21,2021,1100.0,2021000002,2021000000.0,...,1.0,1.0,100.0,128.0,1.0,1.0,1.0,1.0,0.0,0.0
2,2,1.0,1.0,1212021,1,21,2021,1100.0,2021000003,2021000000.0,...,1.0,1.0,100.0,71.0,1.0,2.0,1.0,1.0,0.0,0.0
3,3,1.0,1.0,1172021,1,17,2021,1100.0,2021000004,2021000000.0,...,1.0,1.0,114.0,165.0,1.0,1.0,1.0,1.0,0.0,0.0
4,4,1.0,1.0,1152021,1,15,2021,1100.0,2021000005,2021000000.0,...,1.0,1.0,100.0,258.0,1.0,1.0,1.0,1.0,0.0,0.0


### Step 1: Extract columns that are relevant to our problem, i.e., hypertension analysis
    
1. `CVDINFR4 (Heart Attack)`: Hypertension, characterized by high blood pressure, puts additional strain on the heart and blood vessels, potentially leading to the development of cardiovascular diseases such as heart attacks. 

2. `_INCOMG1 (Income)`: Socioeconomic factors, including income levels, can impact lifestyle choices and access to healthcare, which in turn can influence the development and management of hypertension.

3.	`_RACEPRV (Race)`: Studies have found disparities in hypertension rates among different racial and ethnic populations. Factors such as genetic predisposition, cultural influences, socioeconomic factors, and healthcare disparities can contribute to these variations. 

4. `_SEX (Gender)`: Men and women may have different risks and patterns of hypertension. For instance, men are more likely to develop hypertension at a younger age, while women may experience an increase in blood pressure during menopause. Hormonal factors, lifestyle differences, and genetic predispositions may contribute to these gender disparities in hypertension.

5. `_RFHLTH (Overall Health)`: Poor overall health, characterized by chronic diseases, sedentary lifestyle, unhealthy diet, and high stress levels, can contribute to the development and worsening of hypertension. Conversely, maintaining good overall health through regular physical activity, a balanced diet, stress management, and proper healthcare can help prevent and manage hypertension.

6.	`_PHYS14D (Physical Health)`: Individuals with poor physical health, including low levels of physical activity, unhealthy weight, and sedentary lifestyle, are at a higher risk of developing hypertension. Engaging in regular physical activity, maintaining a healthy weight, and adopting a physically active lifestyle can help prevent and manage hypertension.

7.	`_MENT14D (Mental Health)`: There is evidence suggesting that individuals with poor mental health, such as chronic stress, anxiety, and depression, may have an increased risk of developing hypertension. The relationship between mental health and hypertension is bidirectional, as hypertension can also contribute to mental health problems.

8.	`_TOTINDA (Exercise)`: Regular physical activity and exercise have been found to have a significant impact on preventing and managing hypertension. Engaging in exercise can help lower blood pressure, improve cardiovascular health, enhance blood vessel function, and contribute to overall well-being.

9.	`_BMI5/_BMI5CAT (BMI)`: BMI is a measure of body weight in relation to height and is commonly used as an indicator of overall body fatness. Higher BMI levels, indicating overweight or obesity, have been associated with an increased risk of developing hypertension. Excess body weight puts additional strain on the cardiovascular system, leading to elevated blood pressure.	    

10. `_MICHD (coronary heart disease (CHD) or myocardial infarction (MI))`: Hypertension is a significant risk factor for the development of CHD and MI. High blood pressure places increased stress on the coronary arteries, which supply blood to the heart muscle. Over time, this can lead to the development of atherosclerosis (narrowing of the arteries) and increase the risk of CHD or MI.

11. `_RFCHOL3 (Cholesterol)`: High levels of cholesterol can contribute to the development of atherosclerosis, which narrows the arteries and increases blood pressure.

12. `_LTASTH1 (Asthma)`: Asthma has been found to be associated with an increased risk of developing hypertension. While the exact mechanisms are not fully understood, factors such as chronic inflammation, medication use, and impaired lung function in individuals with asthma may contribute to the development of hypertension.

13. `DIABETE4 (Diabetes)`: High blood sugar levels in diabetes can damage blood vessels and increase the risk of hypertension. Conversely, hypertension can impair insulin sensitivity and worsen diabetes control.

14. `_DRNKWK1 (Alcohol Consumption)`: Drinking alcohol in large quantities can raise blood pressure and also interfere with the effectiveness of hypertension medications.

15. `_FRUTSU1 (Fruit Consumption)`: Consuming an adequate amount of fruits per day is beneficial for overall health, including blood pressure management. Fruits are rich in potassium, fiber, and antioxidants, which can help lower blood pressure.

16. `_VEGESU1` (Vegetable Consumption): Similar to fruit consumption, a sufficient intake of vegetables per day is associated with lower blood pressure levels. Vegetables are nutrient-dense and high in potassium, magnesium, and fiber, all of which contribute to healthy blood pressure. 

17. `_SMOKER3` (Smoking): The chemicals in tobacco smoke can damage blood vessels, increase heart rate, and elevate blood pressure. Quitting smoking is crucial in managing hypertension and reducing the risk of associated complications.

18. `_CURECI1` (E-Cigarette Use): The use of electronic cigarettes (e-cigarettes) is a relatively new area of research, and its direct relationship with hypertension is not yet fully understood. However, some studies suggest that e-cigarette use may have negative cardiovascular effects and potentially contribute to hypertension.

19.	`_RFHYPE6` (High Blood Pressure): Hypertension, or high blood pressure, is a condition characterized by elevated arterial pressure and is a major risk factor for cardiovascular diseases. Lifestyle modifications, medication, and regular monitoring are key to managing hypertension and reducing associated health risks.

In [10]:
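# Columns to keep (named `selected_vars` to avoid shadowing the built-in `vars`)
selected_vars = [
'CVDINFR4',
'_INCOMG1',
'_RACEPRV',
'_SEX',
'_RFHLTH',
'_PHYS14D',
'_MENT14D',
'_TOTINDA',
'_BMI5',
'_BMI5CAT',
'_MICHD',
'_RFHYPE6',
'_RFCHOL3',
'_LTASTH1',
'DIABETE4',
'_DRNKWK1',
'_FRUTSU1',
'_VEGESU1',
'_SMOKER3',
'_CURECI1']
# .copy() gives an independent DataFrame, so later in-place edits do not touch rawData
dataset = rawData[selected_vars].copy()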
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 438693 entries, 0 to 438692
Data columns (total 20 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   CVDINFR4  438691 non-null  float64
 1   _INCOMG1  438693 non-null  float64
 2   _RACEPRV  438693 non-null  float64
 3   _SEX      438693 non-null  float64
 4   _RFHLTH   438693 non-null  float64
 5   _PHYS14D  438693 non-null  float64
 6   _MENT14D  438693 non-null  float64
 7   _TOTINDA  438693 non-null  float64
 8   _BMI5     391841 non-null  float64
 9   _BMI5CAT  391841 non-null  float64
 10  _MICHD    434058 non-null  float64
 11  _RFHYPE6  438693 non-null  float64
 12  _RFCHOL3  377542 non-null  float64
 13  _LTASTH1  438693 non-null  float64
 14  DIABETE4  438690 non-null  float64
 15  _DRNKWK1  438693 non-null  float64
 16  _FRUTSU1  387606 non-null  float64
 17  _VEGESU1  378566 non-null  float64
 18  _SMOKER3  438693 non-null  float64
 19  _CURECI1  438693 non-null  float64
dtypes: float64(20)

-----------------------------------------------

### Step 2: Handling the Missing Values

The survey data contains missing values in several variable columns because of non-responses. We chose a straightforward approach: rows with missing values are removed, while checking that a substantial amount of data remains for analysis.

In [11]:
#Check the number of NaN values in each variable column
dataset.isnull().sum()

CVDINFR4        2
_INCOMG1        0
_RACEPRV        0
_SEX            0
_RFHLTH         0
_PHYS14D        0
_MENT14D        0
_TOTINDA        0
_BMI5       46852
_BMI5CAT    46852
_MICHD       4635
_RFHYPE6        0
_RFCHOL3    61151
_LTASTH1        0
DIABETE4        3
_DRNKWK1        0
_FRUTSU1    51087
_VEGESU1    60127
_SMOKER3        0
_CURECI1        0
dtype: int64

In [12]:
#Drop and check the rows with missing values
dataset.dropna(axis=0,inplace=True)
dataset.isnull().sum()

CVDINFR4    0
_INCOMG1    0
_RACEPRV    0
_SEX        0
_RFHLTH     0
_PHYS14D    0
_MENT14D    0
_TOTINDA    0
_BMI5       0
_BMI5CAT    0
_MICHD      0
_RFHYPE6    0
_RFCHOL3    0
_LTASTH1    0
DIABETE4    0
_DRNKWK1    0
_FRUTSU1    0
_VEGESU1    0
_SMOKER3    0
_CURECI1    0
dtype: int64
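
To check that a substantial share of rows survives the row-wise drop, one option is to report the per-column missing share alongside the fraction of complete rows. The sketch below is only an illustration on toy data: the helper names `missing_report` and `retained_fraction` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column count and percentage of missing values, largest share first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean().mul(100).round(2),
    })
    return report.sort_values("pct_missing", ascending=False)


def retained_fraction(df: pd.DataFrame) -> float:
    """Fraction of rows that would remain after a row-wise dropna()."""
    return len(df.dropna()) / len(df)


# Toy stand-in for the survey columns (values are made up for illustration)
toy = pd.DataFrame({
    "_BMI5": [2500.0, np.nan, 3100.0, 2200.0],
    "_RFCHOL3": [1.0, 2.0, np.nan, 1.0],
    "_SEX": [1.0, 2.0, 2.0, 1.0],
})

print(missing_report(toy))
print(f"rows retained after dropna: {retained_fraction(toy):.0%}")
```

Running the same two helpers on `dataset` before calling `dropna` would show how many respondents are lost, and which columns are responsible.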

-----------------------------------------------

### Step 3: Tackling irrelevant data entries

In the survey, respondents can answer "Don't know" or refuse to answer if they are uncomfortable with or unwilling to answer certain questions. 

Since these responses do not contribute to our analysis of the relationship between hypertension and the selected factors, we decided to exclude rows containing such responses.

For example, for the variable `_TOTINDA`, `9` indicates "Don’t know/Refused/Missing". Thus, rows with the value 9 are dropped. 
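
As a compact illustration of this idea, the per-variable codes can be kept in one mapping and applied with a single boolean mask. This is a minimal sketch on toy data, not the notebook's own implementation: the names `SENTINEL_CODES` and `drop_sentinel_rows` are hypothetical, and only three of the variables are shown.

```python
import pandas as pd

# Hypothetical mapping: column -> codes that mean "Don't know / Refused / Missing".
# The codes for these three columns are the ones used in this step.
SENTINEL_CODES = {
    "CVDINFR4": [7, 9],
    "_TOTINDA": [9],
    "_DRNKWK1": [99900],
}


def drop_sentinel_rows(df: pd.DataFrame, sentinel_codes: dict) -> pd.DataFrame:
    """Return a copy of df without rows holding a sentinel code in any listed column."""
    has_sentinel = pd.Series(False, index=df.index)
    for column, codes in sentinel_codes.items():
        has_sentinel |= df[column].isin(codes)
    return df.loc[~has_sentinel].copy()


# Toy data: only the first row has a usable answer in every column
toy = pd.DataFrame({
    "CVDINFR4": [1.0, 2.0, 7.0, 2.0],
    "_TOTINDA": [1.0, 9.0, 2.0, 2.0],
    "_DRNKWK1": [0.0, 300.0, 100.0, 99900.0],
})

cleaned = drop_sentinel_rows(toy, SENTINEL_CODES)
print(cleaned)
```

Keeping the codes in one dictionary makes it easier to audit them against the codebook, since each variable uses its own encoding.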


In [13]:
#Dropping rows with irrelevant survey responses.
#Since each variable column has a different encoding for "Don’t know/Refused/Missing" responses, we have to drop these values individually for each variable

#Heart Attack
dataset.drop(dataset[dataset['CVDINFR4'] == 7].index, inplace=True) #Don’t know/Not sure
dataset.drop(dataset[dataset['CVDINFR4'] == 9].index, inplace=True) #Refused

#Income
dataset.drop(dataset[dataset['_INCOMG1'] == 9].index, inplace=True) #Don’t know/Refused/Missing

#Overall Health 
dataset.drop(dataset[dataset['_RFHLTH'] == 9].index, inplace=True) #Don’t know/Refused/Missing

#High Cholesterol
dataset.drop(dataset[dataset['_RFCHOL3'] == 9].index, inplace=True) #Don’t know/Refused/Missing

#Asthma
dataset.drop(dataset[dataset['_LTASTH1'] == 9].index, inplace=True) #Don’t know/Refused/Missing

#Alcohol Consumption
dataset.drop(dataset[dataset['_DRNKWK1'] == 99900].index, inplace=True) #Don’t know/Not sure/Refused/Missing 

#Smokes
dataset.drop(dataset[dataset['_SMOKER3'] == 9].index, inplace=True) #Don’t know/Refused/Missing 

#E-cigarette User
dataset.drop(dataset[dataset['_CURECI1'] == 9].index, inplace=True) #Don’t know/Refused/Missing 

#Exercise
dataset.drop(dataset[dataset['_TOTINDA'] == 9].index, inplace=True) #Don’t know/Refused/Missing

#Hypertension
dataset.drop(dataset[dataset['_RFHYPE6'] == 9].index, inplace=True) #Don’t know/Refused/Missing

#Diabetes
dataset.drop(dataset[dataset['DIABETE4'] == 9].index, inplace=True) #Refused
dataset.drop(dataset[dataset['DIABETE4'] == 7].index, inplace=True) #Don’t know/Not sure

#Physical Health
dataset.drop(dataset[dataset['_PHYS14D'] == 9].index, inplace=True) #Don’t know/Refused/Missing 

#Mental Health
dataset.drop(dataset[dataset['_MENT14D'] == 9].index, inplace=True) #Don’t know/Refused/Missing 

#Diabetes (responses that are not a plain yes/no)
dataset.drop(dataset[dataset['DIABETE4'] == 2].index, inplace=True) # Told only during pregnancy, not applicable
dataset.drop(dataset[dataset['DIABETE4'] == 4].index, inplace=True) # Pre-diabetes, not applicable

-----------------------------------------------

### Step 4: Decode the categorical variables 
This has to be done manually because the encoding is not consistent, i.e., 1.0 is "yes" for some variables but "no" for others.

In [14]:
# rename categorical values
for ind in dataset.index:
    # Smoker3
    if dataset.loc[ind, '_SMOKER3'] == 1.0:
        dataset.loc[ind, '_SMOKER3'] = "smoke before"
    elif dataset.loc[ind, '_SMOKER3'] == 2.0:
        dataset.loc[ind, '_SMOKER3'] = "smoke before"
    elif dataset.loc[ind, '_SMOKER3'] == 3.0:
        dataset.loc[ind, '_SMOKER3'] = "smoke before"
    elif dataset.loc[ind, '_SMOKER3'] == 4.0:
        dataset.loc[ind, '_SMOKER3'] = "Never smoked"
        
    # Decode _CURECI1 to be yes/no
    if dataset.loc[ind, '_CURECI1'] == 1.0:
        dataset.loc[ind, '_CURECI1'] = "No"
    elif dataset.loc[ind, '_CURECI1'] == 2.0:
        dataset.loc[ind, '_CURECI1'] = "Yes"
    # Physical health level
    if  dataset.loc[ind, '_PHYS14D'] == 1.0:
        dataset.loc[ind, '_PHYS14D'] = "Excellent"
    elif  dataset.loc[ind, '_PHYS14D'] == 2.0:
        dataset.loc[ind, '_PHYS14D'] = "Good"
    elif  dataset.loc[ind, '_PHYS14D'] == 3.0:
        dataset.loc[ind, '_PHYS14D'] = "Bad"

    # Mental health level
    if  dataset.loc[ind, '_MENT14D'] == 1.0:
        dataset.loc[ind, '_MENT14D'] = "Excellent"
    elif  dataset.loc[ind, '_MENT14D'] == 2.0:
        dataset.loc[ind, '_MENT14D'] = "Good"
    elif dataset.loc[ind, '_MENT14D'] == 3.0:
        dataset.loc[ind, '_MENT14D'] = "Bad"

    # Physical activity
    if  dataset.loc[ind, '_TOTINDA'] == 1.0:
        dataset.loc[ind, '_TOTINDA'] = "Yes"
    elif  dataset.loc[ind, '_TOTINDA'] == 2.0:
        dataset.loc[ind, '_TOTINDA'] = "No"

    # BMI
    if  dataset.loc[ind, '_BMI5CAT'] == 1.0:
        dataset.loc[ind, '_BMI5CAT'] = "Underweight"
    elif  dataset.loc[ind, '_BMI5CAT'] == 2.0:
        dataset.loc[ind, '_BMI5CAT'] = "Normal"
    elif  dataset.loc[ind, '_BMI5CAT'] == 3.0:
        dataset.loc[ind, '_BMI5CAT'] = "Overweight"
    elif  dataset.loc[ind, '_BMI5CAT'] == 4.0:
        dataset.loc[ind, '_BMI5CAT'] = "Obese"

        
    # Decode _RFHYPE6 to be yes/no
    if dataset.loc[ind, '_RFHYPE6'] == 1.0:
        dataset.loc[ind, '_RFHYPE6'] = "No"
    elif dataset.loc[ind, '_RFHYPE6'] == 2.0:
        dataset.loc[ind, '_RFHYPE6'] = "Yes"
        
    # Decode _RFCHOL3 to be yes/no
    if dataset.loc[ind, '_RFCHOL3'] == 1.0:
        dataset.loc[ind, '_RFCHOL3'] = "No"
    elif dataset.loc[ind, '_RFCHOL3'] == 2.0:
        dataset.loc[ind, '_RFCHOL3'] = "Yes"
        
    # Decode _LTASTH1 to be yes/no
    if dataset.loc[ind, '_LTASTH1'] == 1.0:
        dataset.loc[ind, '_LTASTH1'] = "No"
    elif dataset.loc[ind, '_LTASTH1'] == 2.0:
        dataset.loc[ind, '_LTASTH1'] = "Yes"
        
    # Decode DIABETE4 to be yes/no
    if dataset.loc[ind, 'DIABETE4'] == 3.0:
        dataset.loc[ind, 'DIABETE4'] = "No"
    elif dataset.loc[ind, 'DIABETE4'] == 1.0:
        dataset.loc[ind, 'DIABETE4'] = "Yes"

    # Ever diagnosed with heart attack
    if dataset.loc[ind, 'CVDINFR4'] == 1.0:
        dataset.loc[ind, 'CVDINFR4'] = "Diagnosed before"
    elif dataset.loc[ind, 'CVDINFR4'] == 2.0: 
        dataset.loc[ind, 'CVDINFR4'] = "Not diagnosed before"

    # Sex of person
    if dataset.loc[ind, '_SEX'] == 1.0:
        dataset.loc[ind, '_SEX'] = "Male"
    elif dataset.loc[ind, '_SEX'] == 2.0: 
        dataset.loc[ind, '_SEX'] = "Female"
        
    # Race groups
    if dataset.loc[ind, '_RACEPRV'] == 1.0:
        dataset.loc[ind, '_RACEPRV'] = "White"
    elif dataset.loc[ind, '_RACEPRV'] == 2.0: 
        dataset.loc[ind, '_RACEPRV'] = "Black"
    elif dataset.loc[ind, '_RACEPRV'] == 3.0: 
        dataset.loc[ind, '_RACEPRV'] = "American Indian"
    elif dataset.loc[ind, '_RACEPRV'] == 4.0: 
        dataset.loc[ind, '_RACEPRV'] = "Asian"
    elif dataset.loc[ind, '_RACEPRV'] == 5.0: 
        dataset.loc[ind, '_RACEPRV'] = "Native Hawaiian"
    elif dataset.loc[ind, '_RACEPRV'] == 6.0: 
        dataset.loc[ind, '_RACEPRV'] = "Other race"
    elif dataset.loc[ind, '_RACEPRV'] == 7.0: 
        dataset.loc[ind, '_RACEPRV'] = "Multiracial"
    elif dataset.loc[ind, '_RACEPRV'] == 8.0: 
        dataset.loc[ind, '_RACEPRV'] = "Hispanic"
        
    # Income
    if dataset.loc[ind, '_INCOMG1'] == 1.0:
        dataset.loc[ind, '_INCOMG1'] = "< 15000"
    elif dataset.loc[ind, '_INCOMG1'] == 2.0: 
        dataset.loc[ind, '_INCOMG1'] = "15000 to < 25000"
    elif dataset.loc[ind, '_INCOMG1'] == 3.0: 
        dataset.loc[ind, '_INCOMG1'] = "25000 to < 35000"
    elif dataset.loc[ind, '_INCOMG1'] == 4.0: 
        dataset.loc[ind, '_INCOMG1'] = "35000 to < 50000"
    elif dataset.loc[ind, '_INCOMG1'] == 5.0: 
        dataset.loc[ind, '_INCOMG1'] = "50000 to < 100000"
    elif dataset.loc[ind, '_INCOMG1'] == 6.0: 
        dataset.loc[ind, '_INCOMG1'] = "100000 to < 200000"
    elif dataset.loc[ind, '_INCOMG1'] == 7.0: 
        dataset.loc[ind, '_INCOMG1'] = "> 200000"

    # Health
    if dataset.loc[ind, '_RFHLTH'] == 1.0:
        dataset.loc[ind, '_RFHLTH'] = "Good or better health"
    elif dataset.loc[ind, '_RFHLTH'] == 2.0: 
        dataset.loc[ind, '_RFHLTH'] = "Fair or poor health"


    # _MICHD
    if dataset.loc[ind, '_MICHD'] == 1.0:
        dataset.loc[ind, '_MICHD'] = "Yes"
    elif dataset.loc[ind, '_MICHD'] == 2.0: 
        dataset.loc[ind, '_MICHD'] = "No"
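
The row-by-row loop above touches every cell individually, which can be slow on several hundred thousand rows, and writing strings into float columns may trigger dtype warnings in recent pandas versions. A possible vectorized alternative is to keep one code-to-label dictionary per column and apply it with `Series.map`. The sketch below is an assumption-laden illustration on toy data: `DECODERS` and `decode_columns` are hypothetical names, and only three of the mappings from the loop are reproduced.

```python
import pandas as pd

# Hypothetical per-column decoders, copied from the corresponding branches of the loop
DECODERS = {
    "_SEX": {1.0: "Male", 2.0: "Female"},
    "_TOTINDA": {1.0: "Yes", 2.0: "No"},
    "_RFHYPE6": {1.0: "No", 2.0: "Yes"},
}


def decode_columns(df: pd.DataFrame, decoders: dict) -> pd.DataFrame:
    """Return a copy of df with each listed column replaced by its decoded labels."""
    out = df.copy()
    for column, mapping in decoders.items():
        # Whole-column assignment: the column simply becomes an object column
        out[column] = out[column].map(mapping)
    return out


toy = pd.DataFrame({
    "_SEX": [1.0, 2.0],
    "_TOTINDA": [1.0, 2.0],
    "_RFHYPE6": [2.0, 1.0],
})

decoded = decode_columns(toy, DECODERS)
print(decoded)
```

One behavioural difference to keep in mind: `Series.map` with a dictionary turns any code missing from the mapping into `NaN`, whereas the loop leaves unmatched codes unchanged. That can be useful as a check, since a stray `NaN` after decoding points to a code that was not handled.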

-----------------------------------------------

### Step 5: Compute true values for numeric variables

For some numeric variables, the actual values are multiplied by `100` before being stored in the data file to prevent truncation and data loss. These variables therefore have `2 implied decimal places` according to the survey codebook, e.g., `9999` actually means `99.99`. To recover the actual values, we divide them by `100`. 

Numeric variables that need adjustment include:
1. _BMI5
2. _FRUTSU1
3. _VEGESU1

In [15]:
#divide "implied 2 dp" variables by 100
dataset['_FRUTSU1'] = dataset['_FRUTSU1'].div(100).round(2)
dataset['_VEGESU1'] = dataset['_VEGESU1'].div(100).round(2)
dataset['_BMI5'] = dataset['_BMI5'].div(100).round(2)


-----------------------------------------------

### Step 6: Identify and remove outliers for numeric variables

For numeric variables, it is important to check for outliers. Because the survey data is self-reported, some respondents may have provided inaccurate or false information, which can produce outliers. These outliers can distort visualizations such as scatter plots or histograms, making the data harder to interpret accurately. 

Additionally, outliers can disproportionately influence predictive models, leading to inaccurate predictions. By conducting an outlier analysis, we can identify and eliminate anomalies in the data, ensuring a more reliable foundation for further exploratory data analysis.

In [16]:
#Check number of outliers for each numeric variable
numeric_vars = ['_FRUTSU1','_BMI5','_VEGESU1','_DRNKWK1']
def outliers_counter(series):
    # A value is an outlier if it lies more than 1.5*IQR outside the quartiles
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    is_outlier = (series < (Q1 - 1.5 * IQR)) | (series > (Q3 + 1.5 * IQR))
    return int(is_outlier.sum())

for var in numeric_vars:
    print( var, " has ",outliers_counter(dataset[var]),"outliers.")


_FRUTSU1  has  6225 outliers.
_BMI5  has  7148 outliers.
_VEGESU1  has  13844 outliers.
_DRNKWK1  has  24722 outliers.


In [17]:
#Determine the positional index of rows with outliers
def outliers_index(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    is_outlier = (series < (Q1 - 1.5 * IQR)) | (series > (Q3 + 1.5 * IQR))
    return np.flatnonzero(is_outlier.to_numpy()).tolist()

#Remove rows with outliers, one variable at a time
for var in numeric_vars:
    if outliers_counter(dataset[var]) > 0:
        drop = outliers_index(dataset[var])
        dataset = dataset.drop(dataset.index[drop])

#Recount: the quartiles are recomputed on the reduced data, so some of the
#remaining values can fall outside the new, tighter bounds
for var in numeric_vars:
    print( var, " has ",outliers_counter(dataset[var]),"outliers.") 

_FRUTSU1  has  0 outliers.
_BMI5  has  1250 outliers.
_VEGESU1  has  1451 outliers.
_DRNKWK1  has  14349 outliers.
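
The non-zero counts above are expected with this procedure: each recount recomputes the quartiles on the already reduced data, so values that were inside the original bounds can fall outside the new ones. If a single, fixed set of bounds is preferred, one alternative is to compute the IQR limits once per column on the unfiltered data and drop all flagged rows in one pass. The sketch below shows that variant on toy data; it is a different technique from the cell above, and `iqr_bounds` and `drop_iqr_outliers` are hypothetical helper names.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Lower and upper fences at k * IQR beyond the first and third quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_iqr_outliers(df: pd.DataFrame, columns: list, k: float = 1.5) -> pd.DataFrame:
    """Drop rows that are outside the fences of any listed column.

    All fences are computed on the unfiltered df, so the result does not
    depend on the order of the columns.
    """
    keep = pd.Series(True, index=df.index)
    for column in columns:
        lower, upper = iqr_bounds(df[column], k)
        keep &= df[column].between(lower, upper)  # inclusive on both ends
    return df.loc[keep].copy()


# Toy data: the last BMI value is far outside the rest
toy = pd.DataFrame({
    "_BMI5": [22.0, 24.0, 25.0, 26.0, 27.0, 80.0],
    "_FRUTSU1": [1.0, 1.1, 1.2, 0.9, 1.0, 1.1],
})

cleaned = drop_iqr_outliers(toy, ["_BMI5", "_FRUTSU1"])
print(cleaned)
```

With fixed bounds, a recount against the original fences would report zero outliers by construction. Whether that or the iterative behaviour is preferable depends on how aggressively the tails should be trimmed.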


-----------------------------------------------

### Step 7: Rename all variables for better understanding, based on data description

In [18]:
new_column_names = {
'CVDINFR4': 'Heart Attack', 
'_INCOMG1': 'Income', 
'_RACEPRV': 'Race',
'_SEX': 'Gender',
'_RFHLTH': 'Overall Health',
'_PHYS14D': 'Physical Health',
'_MENT14D': 'Mental Health',
'_TOTINDA': 'Does Exercise',
'_BMI5': 'BMI',
'_BMI5CAT': 'BMI Category',
'_MICHD': 'Have CHD/MI',
'_RFHYPE6': 'Hypertension',
'_RFCHOL3': 'High Cholesterol',
'_LTASTH1': 'Asthma',
'DIABETE4': 'Diabetes',
'_DRNKWK1': 'Alcohol Consumption',
'_FRUTSU1': 'Fruits Consumption',
'_VEGESU1': 'Vegetables Consumption',
'_SMOKER3': 'Smokes',
'_CURECI1': 'E-cigarette User',}
dataset = dataset.rename(columns=new_column_names)
new_vars = dataset.columns.values.tolist()

-----------------------------------------------

In [19]:
print(new_vars)

['Heart Attack', 'Income', 'Race', 'Gender', 'Overall Health', 'Physical Health', 'Mental Health', 'Does Exercise', 'BMI', 'BMI Category', 'Have CHD/MI', 'Hypertension', 'High Cholesterol', 'Asthma', 'Diabetes', 'Alcohol Consumption', 'Fruits Consumption', 'Vegetables Consumption', 'Smokes', 'E-cigarette User']


In [20]:
dataset.head()

Unnamed: 0,Heart Attack,Income,Race,Gender,Overall Health,Physical Health,Mental Health,Does Exercise,BMI,BMI Category,Have CHD/MI,Hypertension,High Cholesterol,Asthma,Diabetes,Alcohol Consumption,Fruits Consumption,Vegetables Consumption,Smokes,E-cigarette User
0,Not diagnosed before,25000 to < 35000,White,Female,Fair or poor health,Bad,Good,No,14.54,Underweight,No,No,Yes,Yes,No,0.0,1.0,2.14,smoke before,No
2,Not diagnosed before,15000 to < 25000,Black,Female,Good or better health,Excellent,Excellent,No,28.29,Overweight,Yes,Yes,No,No,Yes,0.0,1.0,0.71,Never smoked,No
3,Not diagnosed before,50000 to < 100000,White,Female,Good or better health,Excellent,Good,Yes,33.47,Obese,No,Yes,Yes,No,Yes,300.0,1.14,1.65,Never smoked,No
4,Diagnosed before,15000 to < 25000,Multiracial,Male,Fair or poor health,Bad,Excellent,Yes,28.73,Overweight,Yes,No,Yes,No,Yes,0.0,1.0,2.58,Never smoked,No
5,Not diagnosed before,35000 to < 50000,White,Male,Good or better health,Excellent,Excellent,No,24.37,Normal,No,No,No,No,No,0.0,0.29,0.42,smoke before,No


In [21]:
#Output cleaned dataset
dataset.to_csv("./Data/Cleaned_Data.csv")