# 03. Pre-Processing
___


In [56]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from scipy.stats import norm 


import warnings
warnings.filterwarnings('ignore')


In [57]:
heart22 = pd.read_csv('~/Desktop/capstone-project-Tasnimacj/data/cleaned_data/heart_bool.csv', index_col=0)

In [58]:
# heart22.info()  #---DEBUG: Index: 246013 entries, 342 to 445130

In [59]:
heart22.tail(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
445124,Male,Good,0.0,15.0,Within past year (anytime less than 12 months ...,1,7.0,1 to 5,0,0,...,1.68,83.91,29.86,1,1,1,1,"Yes, received tetanus shot but not sure what type",0,Yes
445128,Female,Excellent,2.0,2.0,Within past year (anytime less than 12 months ...,1,7.0,None of them,0,0,...,1.7,83.01,28.66,0,1,1,0,"Yes, received tetanus shot but not sure what type",0,No
445130,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,0,5.0,None of them,1,0,...,1.83,108.86,32.55,0,1,1,1,"No, did not receive any tetanus shot in the pa...",0,Yes


After loading the data, we can see that the index labels do not match the number of rows: the last row has index 445130, but the data has only 246013 rows.
To make sure this mismatch does not cause indexing problems later, let's reset the index.
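
As a minimal sketch of how such a mismatch can arise, the toy frame below (made-up values, not the survey data) drops rows and keeps the old labels until the index is reset. That rows were dropped during earlier cleaning is an assumption about this dataset, not something shown above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw survey data
toy = pd.DataFrame({"SleepHours": [7.0, None, 5.0, None, 8.0]})

# Dropping rows keeps the original labels, so gaps appear in the index
cleaned = toy.dropna()
print(cleaned.index.tolist())      # [0, 2, 4]

# drop=True discards the old labels instead of keeping them as a new column
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```

Without `drop=True`, the old labels would be kept as an extra `index` column.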

In [60]:
heart22.reset_index(drop=True,inplace=True)

In [61]:
# heart22.info() #---DEBUG: RangeIndex: 246013 entries, 0 to 246012 and no extra index column

In [62]:
heart22.tail(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
246010,Male,Good,0.0,15.0,Within past year (anytime less than 12 months ...,1,7.0,1 to 5,0,0,...,1.68,83.91,29.86,1,1,1,1,"Yes, received tetanus shot but not sure what type",0,Yes
246011,Female,Excellent,2.0,2.0,Within past year (anytime less than 12 months ...,1,7.0,None of them,0,0,...,1.7,83.01,28.66,0,1,1,0,"Yes, received tetanus shot but not sure what type",0,No
246012,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,0,5.0,None of them,1,0,...,1.83,108.86,32.55,0,1,1,1,"No, did not receive any tetanus shot in the pa...",0,Yes


We have successfully reset the index. We can continue with encoding and preparing the data for modeling.
Most machine learning models cannot work with text labels directly, so these must be converted to numeric variables before we can begin looking at classification models.

In [63]:
categorical= list(heart22.select_dtypes(include=['object']))

In [64]:
len(categorical)

11

We have 11 object columns. Let's have a look at what they contain and how we can transform them to contain numeric values instead.

In [65]:
for col in categorical:
    print(heart22[col].value_counts(), '\n')

Sex
Female    127806
Male      118207
Name: count, dtype: int64 

GeneralHealth
Very good    86996
Good         77407
Excellent    41522
Fair         30658
Poor          9430
Name: count, dtype: int64 

LastCheckupTime
Within past year (anytime less than 12 months ago)         198144
Within past 2 years (1 year but less than 2 years ago)      23227
Within past 5 years (2 years but less than 5 years ago)     13744
5 or more years ago                                         10898
Name: count, dtype: int64 

RemovedTeeth
None of them              131585
1 to 5                     74701
6 or more, but not all     25949
All                        13778
Name: count, dtype: int64 

HadDiabetes
No                                         204827
Yes                                         33811
No, pre-diabetes or borderline diabetes      5392
Yes, but only during pregnancy (female)      1983
Name: count, dtype: int64 

SmokerStatus
Never smoked                             147731
Former smoker  

For these columns, we need to represent the values as numeric data. Columns whose values have a natural order can be mapped to integers. Columns without an order need dummy variables, which turn each categorical value into a new numeric column indicating whether or not a row has that value.

For some columns, we can aggregate values to reduce the granularity of the data. For example, we can collapse the ECigaretteUsage column into currently uses e-cigarettes vs. does not use e-cigarettes.

For the Sex column, I will assign female as 1 (as there are more females in the dataset), and male as 0. To make it clear that the column represents female or not, I can rename the column.

In [66]:
heart22 = heart22.rename(columns={"Sex": "Female"})

heart22['Female'] = heart22['Female'].map({'Female': 1, 'Male': 0})


In [67]:
heart22.head(3)

Unnamed: 0,Female,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
0,1,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,1,9.0,None of them,0,0,...,1.6,71.67,27.99,0,0,1,1,"Yes, received Tdap",0,No
1,0,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,1,6.0,None of them,0,0,...,1.78,95.25,30.13,0,0,1,1,"Yes, received tetanus shot but not sure what type",0,No
2,0,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,0,8.0,"6 or more, but not all",0,0,...,1.85,108.86,31.66,1,0,0,1,"No, did not receive any tetanus shot in the pa...",0,Yes


GeneralHealth

In [68]:
heart22['GeneralHealth'] = heart22['GeneralHealth'].map({'Excellent' : 1, 
                                                         'Very good' : 2, 
                                                         'Good' : 3, 
                                                         'Fair' : 4, 
                                                         'Poor': 5})
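
One caveat with `Series.map` and a dictionary: any value that is missing from the dictionary becomes `NaN` without raising an error. Below is a minimal sketch of a guard against that, using a made-up toy series and a hypothetical helper named `map_strict` that is not part of the original notebook.

```python
import pandas as pd

def map_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map values with a dict, raising if any non-null value is unmapped."""
    mapped = series.map(mapping)
    # Values that were present before mapping but are NaN afterwards
    unmapped = series[mapped.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped values: {list(unmapped)}")
    return mapped

health_order = {"Excellent": 1, "Very good": 2, "Good": 3, "Fair": 4, "Poor": 5}

# Toy values for illustration only
toy_health = pd.Series(["Good", "Poor", "Excellent"])
encoded = map_strict(toy_health, health_order)
print(encoded.tolist())
```

The same check could be applied to the other dictionary mappings in this notebook, where a misspelled label would otherwise turn into missing values.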


LastCheckupTime

In [69]:
heart22['LastCheckupTime'] = heart22['LastCheckupTime'].map({'Within past year (anytime less than 12 months ago)' : 1, 
                                                         'Within past 2 years (1 year but less than 2 years ago)' : 2, 
                                                         'Within past 5 years (2 years but less than 5 years ago)' : 3, 
                                                         '5 or more years ago' : 4})

RemovedTeeth

In [70]:
heart22['RemovedTeeth'] = heart22['RemovedTeeth'].map({'None of them' : 0, 
                                                         '1 to 5' : 1, 
                                                         '6 or more, but not all' : 2, 
                                                         'All' : 3})

HadDiabetes

In [71]:
heart22['HadDiabetes'].value_counts()

HadDiabetes
No                                         204827
Yes                                         33811
No, pre-diabetes or borderline diabetes      5392
Yes, but only during pregnancy (female)      1983
Name: count, dtype: int64

In [72]:
heart22['HadDiabetes'] = heart22['HadDiabetes'].map({'No': 0,
                                                     'No, pre-diabetes or borderline diabetes': 0,
                                                     'Yes':1,
                                                     'Yes, but only during pregnancy (female)':1})

In [73]:
heart22['HadDiabetes'].value_counts()

HadDiabetes
0    210219
1     35794
Name: count, dtype: int64

SmokerStatus

In [74]:

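# Collapse the column to current smoker ('Yes') vs. not currently smoking ('No');
# former smokers are grouped with those who never smoked
heart22['SmokerStatus'] = np.where(heart22['SmokerStatus'].str.startswith('Current'), 'Yes', heart22['SmokerStatus'])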

In [75]:
heart22['SmokerStatus'] = np.where(heart22['SmokerStatus'].str.startswith('Never'), 'No', heart22['SmokerStatus'])

heart22['SmokerStatus'] = np.where(heart22['SmokerStatus'].str.startswith('Former'), 'No', heart22['SmokerStatus'])

In [76]:
heart22['SmokerStatus'] = heart22['SmokerStatus'].map({'Yes': 1, 'No': 0})
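
The same grouping can also be written as a single boolean test. This is a sketch of that alternative on a toy series, not a replacement for the cells above. The two "Current smoker" labels are assumed for illustration, since the full labels are cut off in the output shown earlier.

```python
import pandas as pd

# Toy values; the two "Current smoker" labels are assumptions for illustration
toy_smoker = pd.Series([
    "Never smoked",
    "Former smoker",
    "Current smoker - now smokes every day",
    "Current smoker - now smokes some days",
])

# True only for labels that begin with "Current", then cast to 1/0
is_current = toy_smoker.str.startswith("Current").astype(int)
print(is_current.tolist())   # [0, 0, 1, 1]
```

This relies on the column having no missing values, because `str.startswith` returns `NaN` for missing entries and the integer cast would then fail.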


ECigaretteUsage

In [77]:
heart22['ECigaretteUsage'] = np.where(heart22['ECigaretteUsage'].str.startswith('Use'), 'Yes', heart22['ECigaretteUsage'])

In [78]:
heart22['ECigaretteUsage'] = heart22['ECigaretteUsage'].map({'Yes': 1, 'Never used e-cigarettes in my entire life': 0,'Not at all (right now)':0})

In [79]:
heart22['ECigaretteUsage'].value_counts()

ECigaretteUsage
0    233400
1     12613
Name: count, dtype: int64

RaceEthnicityCategory dummy variables

With the RaceEthnicityCategory column, I need to use dummy variables, as I cannot group the values and binarise them as I did before. I also cannot assign an order to these values, as the categories have no inherent ranking.

In [80]:
# renaming values to be easier to understand
heart22['RaceEthnicityCategory'] = heart22['RaceEthnicityCategory'].map({'White only, Non-Hispanic': "White only",
                                                            'Hispanic': "Hispanic",
                                                            'Black only, Non-Hispanic':'Black only',
                                                            'Other race only, Non-Hispanic':'Other race only',
                                                            'Multiracial, Non-Hispanic':'Multiracial'})

In [81]:
heart22['RaceEthnicityCategory'].value_counts()

RaceEthnicityCategory
White only         186327
Hispanic            22570
Black only          19330
Other race only     12205
Multiracial          5581
Name: count, dtype: int64

In [82]:
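# get_dummies orders the new columns alphabetically, so 'White only' comes last and .iloc[:, :-1] drops it
dummy_Race = pd.get_dummies(heart22['RaceEthnicityCategory'], prefix='RaceEthnicity', dtype=int).iloc[:, :-1]

# 'White only' is the baseline category, so its column is dropped
# if Black only, Hispanic, Multiracial and Other race only are all 0, then the ethnicity is White only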
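
Dropping the last column by position works here only because 'White only' happens to sort last alphabetically. As a hedged alternative, the sketch below drops the baseline by name on a toy series. The `toy_race` and `baseline` names are introduced for illustration only.

```python
import pandas as pd

# Toy column with the same five category labels as the cleaned data
toy_race = pd.Series(
    ["White only", "Black only", "Multiracial", "Hispanic", "Other race only"],
    name="RaceEthnicityCategory",
)

dummies = pd.get_dummies(toy_race, prefix="RaceEthnicity", dtype=int)

# Drop the baseline by its column name instead of by its position
baseline = "RaceEthnicity_White only"
dummies = dummies.drop(columns=baseline)
print(dummies.columns.tolist())
```

Naming the baseline keeps the result the same even if a new category were added that sorted after 'White only'.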

Let's check that the dummy variable DataFrame reflects the actual data.

In [83]:
dummy_Race.tail(5)

Unnamed: 0,RaceEthnicity_Black only,RaceEthnicity_Hispanic,RaceEthnicity_Multiracial,RaceEthnicity_Other race only
246008,0,0,0,0
246009,1,0,0,0
246010,0,0,1,0
246011,1,0,0,0
246012,1,0,0,0


In [84]:
heart22['RaceEthnicityCategory'].tail(5)

246008     White only
246009     Black only
246010    Multiracial
246011     Black only
246012     Black only
Name: RaceEthnicityCategory, dtype: object
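
The comparison above covers only the last five rows. A sketch of a check over every row, shown on toy data, is to rebuild the label from the dummy columns and compare it with the original column. All names below are illustrative.

```python
import pandas as pd

# Toy stand-in for the RaceEthnicityCategory column
toy = pd.DataFrame(
    {"RaceEthnicityCategory": ["White only", "Black only", "Multiracial", "Black only"]}
)
dummies = pd.get_dummies(toy["RaceEthnicityCategory"], prefix="RaceEthnicity", dtype=int)
dummies = dummies.drop(columns="RaceEthnicity_White only")

# Column holding the 1 in each row, with the prefix stripped off
rebuilt = dummies.idxmax(axis=1).str.replace("RaceEthnicity_", "", regex=False)
# Rows that are all zeros belong to the dropped baseline category
rebuilt = rebuilt.where(dummies.sum(axis=1) == 1, "White only")

matches = bool((rebuilt == toy["RaceEthnicityCategory"]).all())
print(matches)
```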

In [85]:
heart22.shape

(246013, 39)

In [86]:
heart22 = pd.concat([heart22.reset_index(drop=True), dummy_Race.reset_index(drop=True)], axis =1 )

In [87]:
heart22.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246013 entries, 0 to 246012
Data columns (total 43 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Female                         246013 non-null  int64  
 1   GeneralHealth                  246013 non-null  int64  
 2   PhysicalHealthDays             246013 non-null  float64
 3   MentalHealthDays               246013 non-null  float64
 4   LastCheckupTime                246013 non-null  int64  
 5   PhysicalActivities             246013 non-null  int64  
 6   SleepHours                     246013 non-null  float64
 7   RemovedTeeth                   246013 non-null  int64  
 8   HadHeartAttack                 246013 non-null  int64  
 9   HadAngina                      246013 non-null  int64  
 10  HadStroke                      246013 non-null  int64  
 11  HadAsthma                      246013 non-null  int64  
 12  HadSkinCancer                 

In [88]:
heart22.drop(columns='RaceEthnicityCategory', inplace=True)

In [89]:
heart22.shape

(246013, 42)

The dummy columns have been added to the original DataFrame, and the RaceEthnicityCategory column has been removed as it is no longer needed.


Age ranges

In [90]:
heart22['AgeCategory'] = heart22['AgeCategory'].map({
                                                     'Age 80 or older': 13,

                                                     'Age 75 to 79' : 12,
                                                     'Age 70 to 74' : 11,
                                                     'Age 65 to 69' : 10,
                                                     'Age 60 to 64' : 9,

                                                     'Age 55 to 59' : 8,
                                                     'Age 50 to 54' : 7,
                                                     'Age 45 to 49' : 6,
                                                     'Age 40 to 44' : 5,

                                                     'Age 35 to 39' : 4,
                                                     'Age 30 to 34' : 3,
                                                     'Age 25 to 29' : 2,
                                                     'Age 18 to 24' : 1 })
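
Because the age bands are already in a fixed order, the same dictionary can be generated from an ordered list rather than typed by hand. This is a sketch of that approach. The `age_labels` and `age_order` names are introduced here for illustration.

```python
# Age bands listed from youngest to oldest, using the labels from the mapping above
age_labels = [
    "Age 18 to 24", "Age 25 to 29", "Age 30 to 34", "Age 35 to 39",
    "Age 40 to 44", "Age 45 to 49", "Age 50 to 54", "Age 55 to 59",
    "Age 60 to 64", "Age 65 to 69", "Age 70 to 74", "Age 75 to 79",
    "Age 80 or older",
]

# Rank each band by its position in the list, starting from 1
age_order = {label: rank for rank, label in enumerate(age_labels, start=1)}
print(age_order["Age 18 to 24"], age_order["Age 80 or older"])   # 1 13
```

The resulting dictionary could be passed to `.map` in the same way as the hand-written one.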

TetanusLast10Tdap

In [91]:
heart22['TetanusLast10Tdap'] = np.where(heart22['TetanusLast10Tdap'].str.startswith('Yes'), 'Yes', heart22['TetanusLast10Tdap'])

In [92]:
heart22['TetanusLast10Tdap'] = heart22['TetanusLast10Tdap'].map({'Yes': 1, 'No, did not receive any tetanus shot in the past 10 years': 0})

In [93]:
heart22['TetanusLast10Tdap'].value_counts()

TetanusLast10Tdap
1    164270
0     81743
Name: count, dtype: int64

CovidPos

In [94]:
heart22['CovidPos'] = heart22['CovidPos'].map({'Yes': 1,
                                                'Tested positive using home test without a health professional':1,
                                                'No': 0})

In [95]:
heart22['CovidPos'].value_counts()

CovidPos
0    167297
1     78716
Name: count, dtype: int64

In [96]:
heart22.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246013 entries, 0 to 246012
Data columns (total 42 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Female                         246013 non-null  int64  
 1   GeneralHealth                  246013 non-null  int64  
 2   PhysicalHealthDays             246013 non-null  float64
 3   MentalHealthDays               246013 non-null  float64
 4   LastCheckupTime                246013 non-null  int64  
 5   PhysicalActivities             246013 non-null  int64  
 6   SleepHours                     246013 non-null  float64
 7   RemovedTeeth                   246013 non-null  int64  
 8   HadHeartAttack                 246013 non-null  int64  
 9   HadAngina                      246013 non-null  int64  
 10  HadStroke                      246013 non-null  int64  
 11  HadAsthma                      246013 non-null  int64  
 12  HadSkinCancer                 

All columns are now numeric, with no label variables remaining. We can save this as a new CSV and use it for modeling in future notebooks.
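
Rather than reading the `info()` output by eye, the same conclusion can be checked in code. Below is a minimal sketch on a toy frame. The `non_numeric` and `with_missing` names are illustrative.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame standing in for the encoded data
toy = pd.DataFrame({"Female": [1, 0], "BMI": [27.99, 30.13], "CovidPos": [0, 1]})

# Columns whose dtype is not numeric; an empty list means none are left
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]

# Columns with missing values, for example from a label missed by a mapping
with_missing = toy.columns[toy.isna().any()].tolist()

print(non_numeric, with_missing)
```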


In [97]:
# save as new cleaned .csv
heart22.to_csv('~/Desktop/capstone-project-Tasnimacj/data/cleaned_data/heart22_preprocessed.csv')
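
Note that `to_csv` writes the index as the first column by default, which is why the file is read with `index_col=0` at the top of this notebook. Below is a small sketch of that round trip, using a temporary file and a toy frame rather than the real path.

```python
import os
import tempfile

import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({"Female": [1, 0], "BMI": [27.99, 30.13]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_preprocessed.csv")
    toy.to_csv(path)                            # index written as an unnamed first column
    plain = pd.read_csv(path)                   # index comes back as 'Unnamed: 0'
    restored = pd.read_csv(path, index_col=0)   # index restored, no extra column

print(plain.columns.tolist())
print(restored.columns.tolist())
```

Passing `index=False` to `to_csv` would be another option, in which case later notebooks would read the file without `index_col=0`.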