This notebook performs exploratory data analysis on the Indicators of Heart
Disease dataset. Before anything else, the data must be split into training
and testing sets; only the training set is loaded here. The training set is
used to form ideas and hypotheses, while the testing set is held back to
validate them.
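
The split itself is not shown in this notebook. As a rough sketch of how such a split could be produced with pandas alone, assuming an 80/20 ratio and a fixed seed (both are illustrative choices, and the `split_train_test` helper and toy frame are hypothetical, not part of the project):

```python
import pandas as pd


def split_train_test(df, test_fraction=0.2, seed=42):
    """Randomly split df into (train, test); ratio and seed are illustrative."""
    # Sample the test rows first, then keep every other row for training.
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(index=test.index)
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "SleepHours": [7.0, 8.0, 9.0, 6.0, 5.0, 7.0, 8.0, 6.0, 7.0, 9.0],
    "HadHeartAttack": ["No", "No", "Yes", "No", "No",
                       "No", "Yes", "No", "No", "No"],
})
train_df, test_df = split_train_test(toy)
print(len(train_df), len(test_df))
```

Fixing the seed keeps the split reproducible, so hypotheses formed on the training rows are always checked against the same held-out rows.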

In [2]:
import pandas as pd
from ydata_profiling import ProfileReport


In [20]:
def load_data():
    """Load the training split of the Indicators of Heart Disease dataset."""
    df_train = pd.read_parquet("../data/interim/heart_train.parquet")
    return df_train

In [21]:
data = load_data()

In [22]:
data.head()

,State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
0,Utah,Female,Very good,0.0,0.0,Within past 5 years (2 years but less than 5 y...,Yes,7.0,None of them,No,...,1.6,51.71,20.19,Yes,Yes,Yes,No,"Yes, received Tdap",No,No
1,District of Columbia,Female,Fair,4.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,"6 or more, but not all",No,...,1.6,54.43,21.26,No,No,No,Yes,,No,No
2,Washington,Male,Good,1.0,1.0,Within past year (anytime less than 12 months ...,Yes,9.0,None of them,No,...,1.7,88.45,30.54,Yes,Yes,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,Tested positive using home test without a heal...
3,Wisconsin,Female,Good,0.0,0.0,Within past 5 years (2 years but less than 5 y...,No,9.0,"6 or more, but not all",No,...,1.68,69.4,24.69,No,No,No,No,"No, did not receive any tetanus shot in the pa...",No,No
4,Kansas,Male,Excellent,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,8.0,1 to 5,No,...,1.75,108.86,35.44,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes


In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356105 entries, 0 to 356104
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      356105 non-null  object 
 1   Sex                        356105 non-null  object 
 2   GeneralHealth              355168 non-null  object 
 3   PhysicalHealthDays         347414 non-null  float64
 4   MentalHealthDays           348848 non-null  float64
 5   LastCheckupTime            349508 non-null  object 
 6   PhysicalActivities         355234 non-null  object 
 7   SleepHours                 351756 non-null  float64
 8   RemovedTeeth               347063 non-null  object 
 9   HadHeartAttack             353632 non-null  object 
 10  HadAngina                  352578 non-null  object 
 11  HadStroke                  354855 non-null  object 
 12  HadAsthma                  354698 non-null  object 
 13  HadSkinCancer              35

In [24]:
data.describe()

,PhysicalHealthDays,MentalHealthDays,SleepHours,HeightInMeters,WeightInKilograms,BMI
count,347414.0,348848.0,351756.0,333363.0,322538.0,317239.0
mean,4.353046,4.381794,7.021117,1.702585,83.074472,28.532016
std,8.695588,8.38843,1.50317,0.107053,21.455636,6.554786
min,0.0,0.0,1.0,0.91,22.68,12.02
25%,0.0,0.0,6.0,1.63,68.04,24.13
50%,0.0,0.0,7.0,1.7,80.74,27.44
75%,3.0,5.0,8.0,1.78,95.25,31.75
max,30.0,30.0,24.0,2.41,292.57,99.64


In [25]:
data.isnull().sum()

State                            0
Sex                              0
GeneralHealth                  937
PhysicalHealthDays            8691
MentalHealthDays              7257
LastCheckupTime               6597
PhysicalActivities             871
SleepHours                    4349
RemovedTeeth                  9042
HadHeartAttack                2473
HadAngina                     3527
HadStroke                     1250
HadAsthma                     1407
HadSkinCancer                 2543
HadCOPD                       1743
HadDepressiveDisorder         2220
HadKidneyDisease              1519
HadArthritis                  2087
HadDiabetes                    865
DeafOrHardOfHearing          16402
BlindOrVisionDifficulty      17143
DifficultyConcentrating      19276
DifficultyWalking            19070
DifficultyDressingBathing    19012
DifficultyErrands            20405
SmokerStatus                 28274
ECigaretteUsage              28412
ChestScan                    44791
RaceEthnicityCategor

In [26]:
profile = ProfileReport(
    data, title="Indicators of Heart Disease", explorative=True)

In [27]:
import os
os.getcwd()

'/Users/daniel/mlops/heart-disease/notebooks'

In [28]:
profile.to_file("./heart_disease_report.html")


In [29]:
profile



Analyzing the report above (also saved at notebooks/heart_disease_report.html)
shows that the dataset has relatively few missing values and duplicate rows.
Rows with missing values and duplicated rows will therefore be dropped in the
preprocessing step.
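
A minimal sketch of that cleanup, assuming "removed" means dropping exact duplicate rows and then any row with at least one missing value (the `drop_incomplete_rows` helper and the toy frame are hypothetical):

```python
import numpy as np
import pandas as pd


def drop_incomplete_rows(df):
    """Remove exact duplicate rows, then rows with any missing value."""
    deduplicated = df.drop_duplicates()
    return deduplicated.dropna().reset_index(drop=True)


# Toy frame: one duplicated row and one row with a missing value.
toy = pd.DataFrame({
    "Sex": ["Female", "Female", "Male", "Male"],
    "SleepHours": [7.0, 7.0, np.nan, 8.0],
})
cleaned = drop_incomplete_rows(toy)
print(len(toy), "->", len(cleaned))
```

Because the missing values are spread across many columns, it is worth comparing the row count before and after on the real data to see how much is actually discarded.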

Many columns are also imbalanced, meaning a single category accounts for most
of the rows. This can be a problem for the model, but it will not be addressed
initially.
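
One way to quantify the imbalance for a single column is to look at the share of rows in each category. The sketch below uses a hypothetical `class_shares` helper on made-up data:

```python
import pandas as pd


def class_shares(df, column):
    """Return the fraction of rows in each category of one column."""
    # dropna=False keeps missing values visible as their own category.
    return df[column].value_counts(normalize=True, dropna=False)


# Toy column in which nine out of ten rows share the same label.
toy = pd.DataFrame({"HadHeartAttack": ["No"] * 9 + ["Yes"]})
shares = class_shares(toy, "HadHeartAttack")
print(shares)
```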

Many columns are categorical, so for models that require numerical input,
these columns will be one-hot encoded.
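
As an illustration, one-hot encoding can be done with `pd.get_dummies`. This is only a sketch on a toy frame; treating every non-numeric column as categorical is an assumption, and the encoder used in the actual preprocessing step may differ:

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female"],
    "GeneralHealth": ["Good", "Fair", "Good"],
    "SleepHours": [7.0, 8.0, 9.0],
})

# Assumption: every non-numeric column is categorical.
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

# Each categorical column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=categorical_columns)
print(encoded.columns.tolist())
```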