# 03. Pre-Processing
___


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from scipy.stats import norm 


import warnings
warnings.filterwarnings('ignore')


In [18]:
heart22 = pd.read_csv('~/Desktop/capstone-project-Tasnimacj/data/cleaned_data/heart_bool.csv', index_col=0)

In [24]:
# heart22.info()  #---DEBUG: Index: 246013 entries, 342 to 445130



In [20]:
heart22.tail(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
445124,Male,Good,0.0,15.0,Within past year (anytime less than 12 months ...,1,7.0,1 to 5,0,0,...,1.68,83.91,29.86,1,1,1,1,"Yes, received tetanus shot but not sure what type",0,Yes
445128,Female,Excellent,2.0,2.0,Within past year (anytime less than 12 months ...,1,7.0,None of them,0,0,...,1.7,83.01,28.66,0,1,1,0,"Yes, received tetanus shot but not sure what type",0,No
445130,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,0,5.0,None of them,1,0,...,1.83,108.86,32.55,0,1,1,1,"No, did not receive any tetanus shot in the pa...",0,Yes


After loading the data, we can see that the index labels do not match the number of rows: the last row has index 445130, but the data has only 246013 rows.
To avoid indexing problems later on, let's reset the index.

In [21]:
heart22.reset_index(drop=True,inplace=True)
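To illustrate why the reset matters, here is a minimal sketch on a toy frame. The name `toy` and its values are invented for illustration and are not taken from `heart22`. It shows that index labels can exceed the row count, and that `reset_index(drop=True)` restores a clean 0 to n-1 range.

```python
import pandas as pd

# Toy frame standing in for heart22; the index labels and values are invented
toy = pd.DataFrame({"BMI": [27.99, 30.13, 31.66]}, index=[342, 343, 445130])

# The largest label is far bigger than the number of rows
print(toy.index.max(), len(toy))

# drop=True discards the old labels instead of keeping them as a new column
toy_reset = toy.reset_index(drop=True)

# After the reset, the index is a plain 0..n-1 range
print(toy_reset.index.equals(pd.RangeIndex(len(toy_reset))))
```

The same kind of check could be run on `heart22` after the reset to confirm the index is a `RangeIndex`.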

In [25]:
# heart22.info() #---DEBUG: RangeIndex: 246013 entries, 0 to 246012 and no extra index column

In [23]:
heart22.tail(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
246010,Male,Good,0.0,15.0,Within past year (anytime less than 12 months ...,1,7.0,1 to 5,0,0,...,1.68,83.91,29.86,1,1,1,1,"Yes, received tetanus shot but not sure what type",0,Yes
246011,Female,Excellent,2.0,2.0,Within past year (anytime less than 12 months ...,1,7.0,None of them,0,0,...,1.7,83.01,28.66,0,1,1,0,"Yes, received tetanus shot but not sure what type",0,No
246012,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,0,5.0,None of them,1,0,...,1.83,108.86,32.55,0,1,1,1,"No, did not receive any tetanus shot in the pa...",0,Yes


We have successfully reset the index. We can continue with encoding and preparing the data for modeling.


In [28]:
categorical = list(heart22.select_dtypes(include=['object']))

In [30]:
len(categorical)

11

In [29]:
for col in categorical:
    print(heart22[col].value_counts(), '\n')

Sex
Female    127806
Male      118207
Name: count, dtype: int64 

GeneralHealth
Very good    86996
Good         77407
Excellent    41522
Fair         30658
Poor          9430
Name: count, dtype: int64 

LastCheckupTime
Within past year (anytime less than 12 months ago)         198144
Within past 2 years (1 year but less than 2 years ago)      23227
Within past 5 years (2 years but less than 5 years ago)     13744
5 or more years ago                                         10898
Name: count, dtype: int64 

RemovedTeeth
None of them              131585
1 to 5                     74701
6 or more, but not all     25949
All                        13778
Name: count, dtype: int64 

HadDiabetes
No                                         204827
Yes                                         33811
No, pre-diabetes or borderline diabetes      5392
Yes, but only during pregnancy (female)      1983
Name: count, dtype: int64 

SmokerStatus
Never smoked                             147731
Former smoker  

These columns hold text categories, so we need to represent them as numeric data with dummy variables. Each categorical value becomes its own numeric column that indicates whether a row has that value (1) or not (0).
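As a rough sketch of what dummy encoding looks like, the example below applies `pd.get_dummies` to a toy frame. The frame `toy` is made up for illustration, and the choice of `dtype=int` is an assumption rather than necessarily the setting used later in this notebook.

```python
import pandas as pd

# Toy frame with a few of the GeneralHealth categories; rows are invented
toy = pd.DataFrame({"GeneralHealth": ["Good", "Poor", "Excellent", "Good"]})

# One new 0/1 column is created per category, prefixed with the original column name
dummies = pd.get_dummies(toy, columns=["GeneralHealth"], dtype=int)

print(dummies)
```

Passing `drop_first=True` would drop one category per column, which is one common way to avoid redundant columns in linear models. Whether that is appropriate depends on the model chosen.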

For some columns, we can aggregate categories to reduce granularity. For example, the ECigaretteUsage column can be collapsed into two groups: currently uses e-cigarettes vs. does not use e-cigarettes.
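One possible way to collapse a multi-category column into a binary flag is sketched below. The category labels and the new column name `CurrentECigUser` are hypothetical placeholders, since the actual ECigaretteUsage values are not shown in the output above; they would need to be replaced with the labels printed by `value_counts()`.

```python
import pandas as pd

# Hypothetical labels: the real ECigaretteUsage categories are not shown above
toy = pd.DataFrame({
    "ECigaretteUsage": ["never used", "use every day", "use some days", "former user"]
})

# Labels treated as "currently uses e-cigarettes" (an assumption for this sketch)
current_labels = {"use every day", "use some days"}

# 1 if the respondent currently uses e-cigarettes, 0 otherwise
toy["CurrentECigUser"] = toy["ECigaretteUsage"].isin(current_labels).astype(int)

print(toy)
```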

For the Sex column, I will encode Female as 1 (there are more females in the dataset) and Male as 0. To make it clear that the column indicates whether a respondent is female, I will rename it to Female.

In [36]:
heart22 = heart22.rename(columns={"Sex": "Female"})

heart22['Female'] = heart22['Female'].map({'Female': 1, 'Male': 0})
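Because `Series.map` with a dictionary returns NaN for any value missing from the dictionary, a quick check after mapping can catch unexpected labels. The sketch below uses a toy frame with invented rows rather than `heart22` itself.

```python
import pandas as pd

# Toy frame mirroring the Sex column; rows are invented
toy = pd.DataFrame({"Sex": ["Female", "Male", "Female"]})
toy = toy.rename(columns={"Sex": "Female"})

mapped = toy["Female"].map({"Female": 1, "Male": 0})

# Count values that were not covered by the mapping dictionary
n_unmapped = int(mapped.isna().sum())
print("unmapped values:", n_unmapped)

# Safe to cast to int only when nothing was left unmapped
if n_unmapped == 0:
    toy["Female"] = mapped.astype(int)
```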


In [40]:
heart22.head(3)

Unnamed: 0,Female,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
0,1,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,1,9.0,None of them,0,0,...,1.6,71.67,27.99,0,0,1,1,"Yes, received Tdap",0,No
1,0,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,1,6.0,None of them,0,0,...,1.78,95.25,30.13,0,0,1,1,"Yes, received tetanus shot but not sure what type",0,No
2,0,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,0,8.0,"6 or more, but not all",0,0,...,1.85,108.86,31.66,1,0,0,1,"No, did not receive any tetanus shot in the pa...",0,Yes
