## **Pre-Processing**

## Imports

In [1]:
# Imported Libraries 
import pandas as pd # pandas library 
import numpy as np # numpy library
#import matplotlib.pyplot as plt # Import the pyplot part of the matplotlib library
#import seaborn as sns # seaborn library

from sklearn.preprocessing import OrdinalEncoder

## Data Loading

In [6]:
## Data Loading
# Data loading for github only

CLN_DATA_PATH='../data/heart_attack_clean.csv'

try:
    heart_attack_clean = pd.read_csv(CLN_DATA_PATH)
    print("Data loaded successfully.")
except FileNotFoundError:
    print("ERROR: The data file does not exist.")

Data loaded successfully.
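The `Unnamed: 0` column that shows up in the loaded data is what pandas typically produces when a CSV was written with its index and then read back without `index_col`. The following is a minimal sketch of that behaviour, using a made-up two-row in-memory CSV rather than the project's file; it is an illustration, not a change to the loading cell above.

```python
import io

import pandas as pd

# Hypothetical CSV text that was saved with its index (note the empty first header).
csv_text = ",State,Sex\n342,Alabama,Female\n343,Alabama,Male\n"

# Default read: the saved index comes back as a column named 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 restores it as the index instead of a data column.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(as_index.columns))
```

Either approach works here; reading with `index_col=0` simply avoids having to drop the column afterwards.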


## Encoding

Checking the Loaded Data

In [11]:
heart_attack_clean.head().T

Unnamed: 0,0,1,2,3,4
Unnamed: 0,342,343,345,346,347
State,Alabama,Alabama,Alabama,Alabama,Alabama
Sex,Female,Male,Male,Female,Female
GeneralHealth,Very good,Very good,Very good,Fair,Good
PhysicalHealthDays,4.0,0.0,0.0,5.0,3.0
MentalHealthDays,0.0,0.0,0.0,0.0,15.0
LastCheckupTime,Within past year (anytime less than 12 months ...,Within past year (anytime less than 12 months ...,Within past year (anytime less than 12 months ...,Within past year (anytime less than 12 months ...,Within past year (anytime less than 12 months ...
PhysicalActivities,Yes,Yes,No,Yes,Yes
SleepHours,9.0,6.0,8.0,9.0,5.0
RemovedTeeth,None of them,None of them,"6 or more, but not all",None of them,1 to 5


Dropping Columns that are not Needed

In [22]:
heart_attack_enc = heart_attack_clean.drop(columns=['Unnamed: 0', 'State', 'RemovedTeeth'])

heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,1.6,71.67,27.99,No,No,Yes,Yes,"Yes, received Tdap",No,No
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,1.78,95.25,30.13,No,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,1.85,108.86,31.66,Yes,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes
3,Female,Fair,5.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,1.7,90.72,31.32,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes
4,Female,Good,3.0,15.0,Within past year (anytime less than 12 months ...,Yes,5.0,No,No,No,...,1.55,79.38,33.07,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,No


**Dropping Columns Comments**

I dropped the 'Unnamed: 0', 'State' and 'RemovedTeeth' columns because the problem I am tackling is heart attack indicators. 'Unnamed: 0' appears to be a leftover row index, and I am treating 'State' and 'RemovedTeeth' as unrelated to the frequency of heart attacks.

--

In [12]:
heart_attack_enc.head(1).T

Unnamed: 0,0
Sex,Female
GeneralHealth,Very good
PhysicalHealthDays,4.0
MentalHealthDays,0.0
LastCheckupTime,Within past year (anytime less than 12 months ...
PhysicalActivities,Yes
SleepHours,9.0
HadHeartAttack,No
HadAngina,No
HadStroke,No


### **Encoding**

LastCheckupTime

In [13]:
print(f"Number of Responses per each LastCheckupTime:\n{heart_attack_clean['LastCheckupTime'].value_counts()}")

Number of Responses per each LastCheckupTime:
LastCheckupTime
Within past year (anytime less than 12 months ago)         198144
Within past 2 years (1 year but less than 2 years ago)      23227
Within past 5 years (2 years but less than 5 years ago)     13744
5 or more years ago                                         10898
Name: count, dtype: int64


In [51]:
# Define the order of the categories
LastCheckupTime_Order = ['Within past year (anytime less than 12 months ago)', 
                    'Within past 2 years (1 year but less than 2 years ago)', 
                    'Within past 5 years (2 years but less than 5 years ago)', 
                    '5 or more years ago']

# Create the OrdinalEncoder and specify the order
enc = OrdinalEncoder(categories=[LastCheckupTime_Order], dtype=float)

# Fit and transform the data
heart_attack_enc['LastCheckupTime_enc'] = enc.fit_transform(heart_attack_enc[['LastCheckupTime']])

# Checking the data to see if the data has been encoded correctly
heart_attack_enc.head(5)


Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,71.67,27.99,No,No,Yes,Yes,"Yes, received Tdap",No,No,0.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,95.25,30.13,No,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,108.86,31.66,Yes,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0
3,Female,Fair,5.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,90.72,31.32,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0
4,Female,Good,3.0,15.0,Within past year (anytime less than 12 months ...,Yes,5.0,No,No,No,...,79.38,33.07,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,No,0.0


In [49]:
print(f"Number of Responses per each LastCheckupTime:\n{heart_attack_enc['LastCheckupTime'].value_counts()}\n")
print(f"Number of Responses per each LastCheckupTime_enc:\n{heart_attack_enc['LastCheckupTime_enc'].value_counts()}")

Number of Responses per each LastCheckupTime:
LastCheckupTime
Within past year (anytime less than 12 months ago)         198144
Within past 2 years (1 year but less than 2 years ago)      23227
Within past 5 years (2 years but less than 5 years ago)     13744
5 or more years ago                                         10898
Name: count, dtype: int64

Number of Responses per each LastCheckupTime_enc:
LastCheckupTime_enc
0    198144
1     23227
2     13744
3     10898
Name: count, dtype: int64


**LastCheckupTime_enc Comments**

- The longer a person goes without a check-up, the greater the chance that something bad goes unnoticed. I have therefore introduced an order to the column that implies a scale of severity: 0 is within the past year, rising to 3 (5 or more years ago) as the most severe.
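As a cross-check on the ordinal encoding, the same 0 to 3 codes can be produced with an ordered pandas categorical. This is only a sketch on four made-up responses, not the notebook's method (which uses `OrdinalEncoder`); one difference to be aware of is that a label missing from the category list gets the code -1 here, whereas `OrdinalEncoder` raises an error by default.

```python
import pandas as pd

# Category order copied from the notebook: most recent check-up first.
order = [
    "Within past year (anytime less than 12 months ago)",
    "Within past 2 years (1 year but less than 2 years ago)",
    "Within past 5 years (2 years but less than 5 years ago)",
    "5 or more years ago",
]

# Made-up sample responses for illustration only.
sample = pd.Series([order[3], order[0], order[2], order[0]])

# The position of each label in `order` becomes its code.
codes = pd.Categorical(sample, categories=order, ordered=True).codes.astype(float)
print(codes.tolist())
```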

--

Sex

In [55]:
# One-Hot Encoding the 'Sex' column
df_encoded = pd.get_dummies(heart_attack_enc['Sex'], prefix='Gender', dtype=float)

# Concatenating the encoded columns with the original DataFrame
heart_attack_enc = pd.concat([heart_attack_enc, df_encoded], axis=1)

# Display the DataFrame with the original 'Sex' column and the new encoded columns
heart_attack_enc.head(3)


Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_Female,Gender_Male
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,No,No,Yes,Yes,"Yes, received Tdap",No,No,0.0,1.0,0.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,No,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0,0.0,1.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,Yes,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,0.0,1.0


In [61]:
print(f"Number of Responses in the cleaned dataset per each Gender (Sex):\n{heart_attack_clean['Sex'].value_counts()}\n")
print(f"Number of Responses per each Gender (Sex):\n{heart_attack_enc['Gender_Female'].value_counts()}\n")
print(f"Number of Responses per each Gender (Sex):\n{heart_attack_enc['Gender_Male'].value_counts()}")

Number of Responses in the cleaned dataset per each Gender (Sex):
Sex
Female    127806
Male      118207
Name: count, dtype: int64

Number of Responses per each Gender (Sex):
Gender_Female
1.0    127806
0.0    118207
Name: count, dtype: int64

Number of Responses per each Gender (Sex):
Gender_Male
0.0    127806
1.0    118207
Name: count, dtype: int64


**Sex_enc Comments**

- Sex has been changed to Gender_Female and Gender_Male.

- The encoding is one-hot, so no order is implied: Gender_Female is 1 for Female and 0 for Male, and Gender_Male is 1 for Male and 0 for Female.
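Because the two indicator columns always sum to 1, one of them carries no extra information. The sketch below, on three made-up rows, shows both the two-column output used in this notebook and the single-column variant that `drop_first=True` produces; whether to drop a column depends on the model used later, so treat this as an option rather than a recommendation.

```python
import pandas as pd

# Made-up sample rows for illustration only.
sample = pd.DataFrame({"Sex": ["Female", "Male", "Male"]})

# Two indicator columns, as in the notebook.
both = pd.get_dummies(sample["Sex"], prefix="Gender", dtype=float)

# drop_first=True removes the first level (Female), leaving only Gender_Male.
single = pd.get_dummies(sample["Sex"], prefix="Gender", dtype=float, drop_first=True)

print(both)
print(single)
```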

--

GeneralHealth

In [62]:
print(f"Number of Responses per each GeneralHealth:\n{heart_attack_enc['GeneralHealth'].value_counts()}")

Number of Responses per each GeneralHealth:
GeneralHealth
Very good    86996
Good         77407
Excellent    41522
Fair         30658
Poor          9430
Name: count, dtype: int64


In [63]:
# Define the order of the categories
GeneralHealth_Order = ['Excellent', 
                    'Very good', 
                    'Good', 
                    'Fair', 
                    'Poor']

# Create the OrdinalEncoder and specify the order
enc = OrdinalEncoder(categories=[GeneralHealth_Order], dtype=float)

# Fit and transform the data
heart_attack_enc['GeneralHealth_enc'] = enc.fit_transform(heart_attack_enc[['GeneralHealth']])

# Checking the data to see if the data has been encoded correctly
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_Female,Gender_Male,GeneralHealth_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,No,Yes,Yes,"Yes, received Tdap",No,No,0.0,1.0,0.0,1.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0,0.0,1.0,1.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,0.0,1.0,1.0


In [64]:
print(f"Number of Responses per each GeneralHealth:\n{heart_attack_clean['GeneralHealth'].value_counts()}\n")
print(f"Number of Responses per each GeneralHealth:\n{heart_attack_enc['GeneralHealth_enc'].value_counts()}")

Number of Responses per each GeneralHealth:
GeneralHealth
Very good    86996
Good         77407
Excellent    41522
Fair         30658
Poor          9430
Name: count, dtype: int64

Number of Responses per each GeneralHealth:
GeneralHealth_enc
1.0    86996
2.0    77407
0.0    41522
3.0    30658
4.0     9430
Name: count, dtype: int64


**GeneralHealth_enc Comments**

- The general health rating indicates how good the person's health is. I have therefore introduced an order to the column that implies a scale of severity: 0 is Excellent, rising to 4 for Poor.

--

PhysicalActivities

In [79]:
print(f"Number of Responses per each PhysicalActivities:\n{heart_attack_enc['PhysicalActivities'].value_counts()}\n")

Number of Responses per each PhysicalActivities:
PhysicalActivities
Yes    191310
No      54703
Name: count, dtype: int64



In [76]:
# Mapping 'yes' to 1 and 'no' to 0
heart_attack_enc['PhysicalActivities_enc'] = heart_attack_enc['PhysicalActivities'].map({'Yes': 1.0, 'No': 0.0})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)


Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_Female,Gender_Male,GeneralHealth_enc,PhysicalActivities_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,Yes,Yes,"Yes, received Tdap",No,No,0.0,1.0,0.0,1.0,1.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0,0.0,1.0,1.0,1.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,0.0,1.0,1.0,0.0


In [78]:
print(f"Number of Responses per each PhysicalActivities:\n{heart_attack_enc['PhysicalActivities'].value_counts()}\n")
print(f"Number of Responses per each PhysicalActivities_enc:\n{heart_attack_enc['PhysicalActivities_enc'].value_counts()}")

Number of Responses per each PhysicalActivities:
PhysicalActivities
Yes    191310
No      54703
Name: count, dtype: int64

Number of Responses per each PhysicalActivities_enc:
PhysicalActivities_enc
1.0    191310
0.0     54703
Name: count, dtype: int64


**PhysicalActivities_enc Comments**

- The Yes/No response has been encoded as binary.

- Yes is 1 and No is 0 
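`Series.map` silently turns any label that is not in the dictionary into NaN, so a small guard can be useful when the same Yes/No mapping is reused on other columns. The helper below is a hypothetical sketch (the name `encode_yes_no` is not part of the notebook) that raises an error instead of letting unexpected labels slip through.

```python
import pandas as pd


def encode_yes_no(series: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' to 1.0/0.0 and fail loudly on any other label."""
    encoded = series.map({"Yes": 1.0, "No": 0.0})
    # Labels that were present in the input but produced NaN after mapping.
    unmapped = series[encoded.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels: {list(unmapped)}")
    return encoded


# Made-up sample responses for illustration only.
sample = pd.Series(["Yes", "No", "Yes"])
print(encode_yes_no(sample).tolist())
```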

--

SmokerStatus

In [83]:
print(f"Number of Responses per each SmokerStatus:\n{heart_attack_enc['SmokerStatus'].value_counts()}\n")

Number of Responses per each SmokerStatus:
SmokerStatus
Never smoked                             147731
Former smoker                             68524
Current smoker - now smokes every day     21659
Current smoker - now smokes some days      8099
Name: count, dtype: int64



In [90]:
# Mapping responses to Never smoked and Has Smoked
heart_attack_enc['SmokerStatus_enc'] = heart_attack_enc['SmokerStatus'].map({'Never smoked': 'Never smoked', 'Former smoker': 'Has Smoked', 'Current smoker - now smokes every day': 'Has Smoked', 'Current smoker - now smokes some days': 'Has Smoked'})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_Female,Gender_Male,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,Yes,"Yes, received Tdap",No,No,0.0,1.0,0.0,1.0,1.0,Has Smoked
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0,0.0,1.0,1.0,1.0,Has Smoked
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,0.0,1.0,1.0,0.0,Has Smoked


In [88]:
print(f"Number of Responses per each SmokerStatus:\n{heart_attack_enc['SmokerStatus'].value_counts()}\n")
print(f"Former smoker + Current smoker - now smokes every day + Current smoker - now smokes some days = {68524 + 21659 + 8099}")
print(f"\nNumber of Responses per each SmokerStatus_enc:\n{heart_attack_enc['SmokerStatus_enc'].value_counts()}")

Number of Responses per each SmokerStatus:
SmokerStatus
Never smoked                             147731
Former smoker                             68524
Current smoker - now smokes every day     21659
Current smoker - now smokes some days      8099
Name: count, dtype: int64

Former smoker + Current smoker - now smokes every day + Current smoker - now smokes some days = 98282

Number of Responses per each SmokerStatus_enc:
SmokerStatus_enc
Never smoked    147731
Has Smoked       98282
Name: count, dtype: int64


In [91]:
# Mapping responses to Yes and No
heart_attack_enc['SmokerStatus_enc'] = heart_attack_enc['SmokerStatus_enc'].map({'Never smoked': 'No', 'Has Smoked': 'Yes'})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_Female,Gender_Male,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,Yes,"Yes, received Tdap",No,No,0.0,1.0,0.0,1.0,1.0,Yes
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0,0.0,1.0,1.0,1.0,Yes
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,0.0,1.0,1.0,0.0,Yes


**SmokerStatus_enc Comments**

The column has now been mapped to Yes and No, depending on whether the person has ever smoked, which removes any ambiguity in timing and any potential order.

- Now that the column holds a Yes or No response, it will be binary encoded later in the pre-processing section.
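The two mapping steps (to "Has Smoked"/"Never smoked", then to Yes/No) and the later binary encoding could also be collapsed into a single comparison. This is a sketch of that alternative on made-up rows, not the notebook's approach; the `.where(...)` call is there so that a missing response stays missing instead of being counted as a smoker.

```python
import pandas as pd

# Made-up sample responses for illustration only.
sample = pd.Series([
    "Never smoked",
    "Former smoker",
    "Current smoker - now smokes every day",
    "Current smoker - now smokes some days",
])

# 1.0 for anyone who has ever smoked, 0.0 otherwise; missing values remain NaN.
ever_smoked = sample.ne("Never smoked").astype(float).where(sample.notna())
print(ever_smoked.tolist())
```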


--

ECigaretteUsage

In [93]:
print(f"Number of Responses per each ECigaretteUsage:\n{heart_attack_enc['ECigaretteUsage'].value_counts()}\n")

Number of Responses per each ECigaretteUsage:
ECigaretteUsage
Never used e-cigarettes in my entire life    190119
Not at all (right now)                        43281
Use them some days                             6658
Use them every day                             5955
Name: count, dtype: int64



In [94]:
# Mapping responses to Never Used and Has Used
heart_attack_enc['ECigaretteUsage_enc'] = heart_attack_enc['ECigaretteUsage'].map({'Never used e-cigarettes in my entire life': 'Never Used', 'Not at all (right now)': 'Has Used', 'Use them some days': 'Has Used', 'Use them every day': 'Has Used'})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_Female,Gender_Male,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,"Yes, received Tdap",No,No,0.0,1.0,0.0,1.0,1.0,Yes,Never Used
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,"Yes, received tetanus shot but not sure what type",No,No,0.0,0.0,1.0,1.0,1.0,Yes,Never Used
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,0.0,1.0,1.0,0.0,Yes,Never Used


In [95]:
print(f"Number of Responses per each ECigaretteUsage:\n{heart_attack_enc['ECigaretteUsage'].value_counts()}\n")
print(f"Not at all (right now) + Use them some days + Use them every day = {43281 + 6658 + 5955}")
print(f"\nNumber of Responses per each ECigaretteUsage_enc:\n{heart_attack_enc['ECigaretteUsage_enc'].value_counts()}")

Number of Responses per each ECigaretteUsage:
ECigaretteUsage
Never used e-cigarettes in my entire life    190119
Not at all (right now)                        43281
Use them some days                             6658
Use them every day                             5955
Name: count, dtype: int64

Not at all (right now) + Use them some days + Use them every day = 55894

Number of Responses per each ECigaretteUsage_enc:
ECigaretteUsage_enc
Never Used    190119
Has Used       55894
Name: count, dtype: int64


In [96]:
# Mapping responses to Yes and No
heart_attack_enc['ECigaretteUsage_enc'] = heart_attack_enc['ECigaretteUsage_enc'].map({'Never Used': 'No', 'Has Used': 'Yes'})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_Female,Gender_Male,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,"Yes, received Tdap",No,No,0.0,1.0,0.0,1.0,1.0,Yes,No
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,"Yes, received tetanus shot but not sure what type",No,No,0.0,0.0,1.0,1.0,1.0,Yes,No
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,0.0,1.0,1.0,0.0,Yes,No


In [142]:
print(f"\nNumber of Responses per each ECigaretteUsage_enc:\n{heart_attack_enc['ECigaretteUsage_enc'].value_counts()}")


Number of Responses per each ECigaretteUsage_enc:
ECigaretteUsage_enc
No     190112
Yes     55892
Name: count, dtype: int64


**ECigaretteUsage_enc Comments**

The column has now been mapped to Yes and No, depending on whether the person has ever used an e-cigarette, which removes any ambiguity in usage.

- Now that the column holds a Yes or No response, it will be binary encoded later in the pre-processing section.
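The checks above add up the raw category counts by hand with hard-coded numbers. A sketch of the same reconciliation computed from the data is shown below, on five made-up rows; the `grouping` dictionary mirrors the notebook's mapping, collapsed straight to Yes/No for brevity.

```python
import pandas as pd

# Made-up sample responses for illustration only.
raw = pd.Series([
    "Never used e-cigarettes in my entire life",
    "Not at all (right now)",
    "Use them some days",
    "Use them every day",
    "Never used e-cigarettes in my entire life",
])

grouping = {
    "Never used e-cigarettes in my entire life": "No",
    "Not at all (right now)": "Yes",
    "Use them some days": "Yes",
    "Use them every day": "Yes",
}

encoded = raw.map(grouping)

# Sum the raw counts per target label, then compare with the encoded counts.
expected = raw.value_counts().groupby(grouping).sum()
actual = encoded.value_counts()

assert expected.to_dict() == actual.to_dict()
print(actual.to_dict())
```

Because the expected totals are derived from the column itself, the check stays valid if rows are removed earlier in the notebook.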

--

RaceEthnicityCategory

In [97]:
print(f"Number of Responses per each RaceEthnicityCategory:\n{heart_attack_enc['RaceEthnicityCategory'].value_counts()}\n")

Number of Responses per each RaceEthnicityCategory:
RaceEthnicityCategory
White only, Non-Hispanic         186327
Hispanic                          22570
Black only, Non-Hispanic          19330
Other race only, Non-Hispanic     12205
Multiracial, Non-Hispanic          5581
Name: count, dtype: int64



**RaceEthnicityCategory_enc Comments**



--

AgeCategory

In [99]:
print(f"Number of Responses per each AgeCategory:\n{heart_attack_enc['AgeCategory'].value_counts()}\n")

Number of Responses per each AgeCategory:
AgeCategory
Age 65 to 69       28555
Age 60 to 64       26719
Age 70 to 74       25737
Age 55 to 59       22224
Age 50 to 54       19912
Age 75 to 79       18133
Age 80 or older    17816
Age 40 to 44       16973
Age 45 to 49       16753
Age 35 to 39       15614
Age 30 to 34       13346
Age 18 to 24       13122
Age 25 to 29       11109
Name: count, dtype: int64



In [100]:
# Define the order of the age categories
age_categories_order = ['Age 18 to 24', 'Age 25 to 29', 'Age 30 to 34', 'Age 35 to 39', 
                        'Age 40 to 44', 'Age 45 to 49', 'Age 50 to 54', 'Age 55 to 59', 
                        'Age 60 to 64', 'Age 65 to 69', 'Age 70 to 74', 'Age 75 to 79', 'Age 80 or older']

# Create the OrdinalEncoder and specify the order
enc = OrdinalEncoder(categories=[age_categories_order])

# Fit and transform the data
heart_attack_enc['AgeCategory_enc'] = enc.fit_transform(heart_attack_enc[['AgeCategory']])

# Display the DataFrame with the original and encoded columns
heart_attack_enc.head(3)


Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_Female,Gender_Male,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc,AgeCategory_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,No,No,0.0,1.0,0.0,1.0,1.0,Yes,No,9.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,No,No,0.0,0.0,1.0,1.0,1.0,Yes,No,10.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,No,Yes,0.0,0.0,1.0,1.0,0.0,Yes,No,11.0


In [101]:
print(f"Number of Responses per each AgeCategory:\n{heart_attack_enc['AgeCategory'].value_counts()}\n")
print(f"Number of Responses per each AgeCategory_enc:\n{heart_attack_enc['AgeCategory_enc'].value_counts()}")

Number of Responses per each AgeCategory:
AgeCategory
Age 65 to 69       28555
Age 60 to 64       26719
Age 70 to 74       25737
Age 55 to 59       22224
Age 50 to 54       19912
Age 75 to 79       18133
Age 80 or older    17816
Age 40 to 44       16973
Age 45 to 49       16753
Age 35 to 39       15614
Age 30 to 34       13346
Age 18 to 24       13122
Age 25 to 29       11109
Name: count, dtype: int64

Number of Responses per each AgeCategory_enc:
AgeCategory_enc
9.0     28555
8.0     26719
10.0    25737
7.0     22224
6.0     19912
11.0    18133
12.0    17816
4.0     16973
5.0     16753
3.0     15614
2.0     13346
0.0     13122
1.0     11109
Name: count, dtype: int64


**AgeCategory_enc Comments**

- The risk of having a heart attack increases with age, which is why this column has been encoded with the ordinal encoder: the age bands have an inherent order that needs to be kept as a factor in having a heart attack.
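The age order was typed out by hand in the notebook. As a sanity check, the same order can be derived from the labels by sorting on the first number in each one. This is a small sketch on four of the labels, and the helper name `lower_bound` is hypothetical.

```python
import re

# A few of the AgeCategory labels, deliberately out of order.
labels = ["Age 65 to 69", "Age 18 to 24", "Age 80 or older", "Age 25 to 29"]


def lower_bound(label: str) -> int:
    """Return the first integer in the label, e.g. 65 for 'Age 65 to 69'."""
    return int(re.search(r"\d+", label).group())


ordered = sorted(labels, key=lower_bound)
print(ordered)
```

A list built this way could be compared against the hand-written order before it is passed to the encoder.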

--

TetanusLast10Tdap

In [102]:
print(f"Number of Responses per each TetanusLast10Tdap:\n{heart_attack_enc['TetanusLast10Tdap'].value_counts()}\n")

Number of Responses per each TetanusLast10Tdap:
TetanusLast10Tdap
No, did not receive any tetanus shot in the past 10 years    81743
Yes, received tetanus shot but not sure what type            74118
Yes, received Tdap                                           70282
Yes, received tetanus shot, but not Tdap                     19870
Name: count, dtype: int64



In [104]:
# Mapping responses to Yes and No (any tetanus shot in the last 10 years counts as Yes)
heart_attack_enc['TetShotLast10_enc'] = heart_attack_enc['TetanusLast10Tdap'].map({'No, did not receive any tetanus shot in the past 10 years': 'No', 'Yes, received tetanus shot but not sure what type': 'Yes', 'Yes, received Tdap': 'Yes', 'Yes, received tetanus shot, but not Tdap': 'Yes'})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,CovidPos,LastCheckupTime_enc,Gender_Female,Gender_Male,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc,AgeCategory_enc,TetShotLast10_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,No,0.0,1.0,0.0,1.0,1.0,Yes,No,9.0,Yes
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,No,0.0,0.0,1.0,1.0,1.0,Yes,No,10.0,Yes
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,Yes,0.0,0.0,1.0,1.0,0.0,Yes,No,11.0,No


In [109]:
print(f"Number of Responses per each TetanusShotLast10:\n{heart_attack_enc['TetanusLast10Tdap'].value_counts()}\n")
print(f"Yes, received tetanus shot but not sure what type + Yes, received Tdap + Yes, received tetanus shot, but not Tdap = {74118 + 70282 + 19870}\n")
print(f"Number of Responses per each TetShotLast10_enc:\n{heart_attack_enc['TetShotLast10_enc'].value_counts()}")

Number of Responses per each TetanusShotLast10:
TetanusLast10Tdap
No, did not receive any tetanus shot in the past 10 years    81743
Yes, received tetanus shot but not sure what type            74118
Yes, received Tdap                                           70282
Yes, received tetanus shot, but not Tdap                     19870
Name: count, dtype: int64

Yes, received tetanus shot but not sure what type + Yes, received Tdap + Yes, received tetanus shot, but not Tdap = 164270

Number of Responses per each TetShotLast10_enc:
TetShotLast10_enc
Yes    164270
No      81743
Name: count, dtype: int64


**TetanusLast10Tdap_enc Comments**

The column has now been mapped to Yes and No, depending on whether the person has had a tetanus shot in the last 10 years, which removes any ambiguity about the type of shot.

- Now that the column holds a Yes or No response, it will be binary encoded later in the pre-processing section.
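Since every positive TetanusLast10Tdap label begins with "Yes" and the negative one begins with "No", the grouping can also be expressed as a prefix test. The sketch below uses the four labels from the value counts as made-up rows; it assumes no other labels exist in the column, which is worth verifying before relying on it.

```python
import pandas as pd

# One row per label seen in the value counts (illustrative sample).
sample = pd.Series([
    "No, did not receive any tetanus shot in the past 10 years",
    "Yes, received tetanus shot but not sure what type",
    "Yes, received Tdap",
    "Yes, received tetanus shot, but not Tdap",
])

# Labels starting with 'Yes' become 'Yes'; everything else becomes 'No'.
had_shot = sample.str.startswith("Yes").map({True: "Yes", False: "No"})
print(had_shot.tolist())
```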

-- 

HadDiabetes

In [117]:
print(f"Number of Responses per each HadDiabetes:\n{heart_attack_enc['HadDiabetes'].value_counts()}\n")

Number of Responses per each HadDiabetes:
HadDiabetes
No                                         204827
Yes                                         33811
No, pre-diabetes or borderline diabetes      5392
Yes, but only during pregnancy (female)      1983
Name: count, dtype: int64



In [118]:
# Highlighting outlier responses
diabetes_by_sex = heart_attack_clean.groupby(['HadDiabetes', 'Sex']).size()
print(f"HadDiabetes Responses by Gender ('Sex'):\n{diabetes_by_sex}\n")
# Look the count up by label rather than by position
male_pregnancy = diabetes_by_sex.loc[('Yes, but only during pregnancy (female)', 'Male')]
print(f"Outlier Response - Yes, but only during pregnancy (female):\n{male_pregnancy}\nMale response to a female only response")

HadDiabetes Responses by Gender ('Sex'):
HadDiabetes                              Sex   
No                                       Female    106476
                                         Male       98351
No, pre-diabetes or borderline diabetes  Female      2942
                                         Male        2450
Yes                                      Female     16414
                                         Male       17397
Yes, but only during pregnancy (female)  Female      1974
                                         Male           9
dtype: int64

Outlier Response - Yes, but only during pregnancy (female):
9
Male response to a female only response


In [132]:
# Removing the 9 Male responses to "Yes, but only during pregnancy (female)"
# .copy() keeps later column assignments from acting on a view of the original frame
heart_attack_enc = heart_attack_enc[~((heart_attack_enc['HadDiabetes'] == 'Yes, but only during pregnancy (female)') & (heart_attack_enc['Sex'] == 'Male'))].copy()

In [133]:
# Checking the rows have been removed
print(f"HadDiabetes Responses by Gender ('Sex'):\n{heart_attack_enc.groupby(['HadDiabetes','Sex']).size()}\n")

HadDiabetes Responses by Gender ('Sex'):
HadDiabetes                              Sex   
No                                       Female    106476
                                         Male       98351
No, pre-diabetes or borderline diabetes  Female      2942
                                         Male        2450
Yes                                      Female     16414
                                         Male       17397
Yes, but only during pregnancy (female)  Female      1974
dtype: int64
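The row filter above combines two conditions with `&` and inverts the result with `~`. A minimal sketch of the same pattern on three made-up rows follows; naming the mask and taking an explicit `.copy()` are stylistic choices here, intended to make later column assignments unambiguous.

```python
import pandas as pd

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "HadDiabetes": [
        "Yes",
        "Yes, but only during pregnancy (female)",
        "Yes, but only during pregnancy (female)",
    ],
    "Sex": ["Male", "Female", "Male"],
})

# True only where a Male row carries the female-only response.
inconsistent = (
    (sample["HadDiabetes"] == "Yes, but only during pregnancy (female)")
    & (sample["Sex"] == "Male")
)

# Keep everything else; the copy is independent of `sample`.
filtered = sample.loc[~inconsistent].copy()
filtered["flag"] = 1.0

print(filtered)
```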



In [135]:
# Mapping responses to Yes and No ('Yes, but only during pregnancy (female)' is not in the mapping, so it becomes NaN)
heart_attack_enc['HadDiabetes_enc'] = heart_attack_enc['HadDiabetes'].map({'No': 'No', 'No, pre-diabetes or borderline diabetes': 'No', 'Yes': 'Yes'})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,LastCheckupTime_enc,Gender_Female,Gender_Male,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc,AgeCategory_enc,TetShotLast10_enc,HadDiabetes_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,0.0,1.0,0.0,1.0,1.0,Yes,No,9.0,Yes,No
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,0.0,0.0,1.0,1.0,1.0,Yes,No,10.0,Yes,Yes
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,0.0,0.0,1.0,1.0,0.0,Yes,No,11.0,No,No


In [137]:
print(f"Number of Responses per each HadDiabetes:\n{heart_attack_enc['HadDiabetes_enc'].value_counts()}\n")

Number of Responses per each HadDiabetes:
HadDiabetes_enc
No     210219
Yes     33811
Name: count, dtype: int64



**HadDiabetes Comments**

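I have resolved the issue of the 9 Male responses in the female-only category (Yes, but only during pregnancy (female)) by removing those rows. The mapping has no entry for that category, so the remaining 1974 Female responses are left unmapped (NaN) in HadDiabetes_enc, and pre-diabetes or borderline diabetes is counted as No.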
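The HadDiabetes mapping has no key for "Yes, but only during pregnancy (female)", and `Series.map` returns NaN for any label without a key. The sketch below, on four made-up rows, shows one way to list such labels before encoding so the gap is a deliberate choice rather than a surprise.

```python
import pandas as pd

# Mapping copied from the notebook cell.
mapping = {
    "No": "No",
    "No, pre-diabetes or borderline diabetes": "No",
    "Yes": "Yes",
}

# Made-up sample responses for illustration only.
sample = pd.Series([
    "No",
    "Yes",
    "Yes, but only during pregnancy (female)",
    "No, pre-diabetes or borderline diabetes",
])

encoded = sample.map(mapping)

# Labels present in the data but absent from the mapping keys.
unmapped = sorted(set(sample.dropna()) - set(mapping))

print(unmapped)
print(int(encoded.isna().sum()))
```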

--

CovidPos

In [138]:
print(f"Number of Responses per each CovidPos:\n{heart_attack_enc['CovidPos'].value_counts()}\n")

Number of Responses per each CovidPos:
CovidPos
No                                                               167292
Yes                                                               70320
Tested positive using home test without a health professional      8392
Name: count, dtype: int64



In [139]:
# Mapping responses to a binary Yes / No (a positive home test counts as Yes)
heart_attack_enc['CovidPos_enc'] = heart_attack_enc['CovidPos'].map({'No': 'No', 'Yes': 'Yes', 'Tested positive using home test without a health professional': 'Yes'})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,Gender_Female,Gender_Male,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc,AgeCategory_enc,TetShotLast10_enc,HadDiabetes_enc,CovidPos_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,1.0,0.0,1.0,1.0,Yes,No,9.0,Yes,No,No
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,0.0,1.0,1.0,1.0,Yes,No,10.0,Yes,Yes,No
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,0.0,1.0,1.0,0.0,Yes,No,11.0,No,No,Yes


In [140]:
print(f"Number of Responses per each CovidPos_enc:\n{heart_attack_enc['CovidPos_enc'].value_counts()}\n")

Number of Responses per each CovidPos_enc:
CovidPos_enc
No     167292
Yes     78712
Name: count, dtype: int64



**CovidPos_enc Comments**

CovidPos is a question that should have a Yes or No response. Regardless of where the person tested positive, the result is still COVID positive.

- "Tested positive using home test without a health professional" and "Yes" have been combined, since both mean the person tested positive.
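
A quick way to confirm that a merge like this sent every original response to the intended group is to cross-tabulate the original column against the encoded one. This is a minimal sketch on a hypothetical `toy` frame, not the survey data; `covid_map` and `check` are illustrative names:

```python
import pandas as pd

HOME_TEST = "Tested positive using home test without a health professional"

# Hypothetical toy data using the same response labels as CovidPos
toy = pd.DataFrame({"CovidPos": ["No", "Yes", HOME_TEST, "No"]})

covid_map = {"No": "No", "Yes": "Yes", HOME_TEST: "Yes"}
toy["CovidPos_enc"] = toy["CovidPos"].map(covid_map)

# Rows are the original responses, columns are the merged responses.
# Each row should have its whole count in exactly one column.
check = pd.crosstab(toy["CovidPos"], toy["CovidPos_enc"])
print(check)
```

On the real data, the Yes column of such a table should hold the 70320 "Yes" and 8392 home-test responses, which together give the 78712 reported for CovidPos_enc.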

Yes & No Responses

In [162]:
# Checking the Value Counts of each Column with a Yes or No response

print(f"Number of Responses per each PhysicalActivities:\n{heart_attack_enc['PhysicalActivities'].value_counts()}\n")
print(f"Number of Responses per each SmokerStatus_enc:\n{heart_attack_enc['SmokerStatus_enc'].value_counts()}\n")
print(f"Number of Responses per each ECigaretteUsage_enc:\n{heart_attack_enc['ECigaretteUsage_enc'].value_counts()}\n")
#print(f"Number of Responses per each RaceEthnicity_enc:\n{heart_attack_enc['RaceEthnicity_enc'].value_counts()}\n")
#print(f"Number of Responses per each TetShotLast10Tdap:\n{heart_attack_enc['TetShotLast10_enc'].value_counts()}\n")
print(f"Number of Responses per each HadHeartAttack:\n{heart_attack_enc['HadHeartAttack'].value_counts()}\n")
print(f"Number of Responses per each HadAngina:\n{heart_attack_enc['HadAngina'].value_counts()}\n")
print(f"Number of Responses per each HadStroke:\n{heart_attack_enc['HadStroke'].value_counts()}\n")
print(f"Number of Responses per each HadAsthma:\n{heart_attack_enc['HadAsthma'].value_counts()}\n")
print(f"Number of Responses per each HadSkinCancer:\n{heart_attack_enc['HadSkinCancer'].value_counts()}\n")
print(f"Number of Responses per each HadCOPD:\n{heart_attack_enc['HadCOPD'].value_counts()}\n")
print(f"Number of Responses per each HadDepressiveDisorder:\n{heart_attack_enc['HadDepressiveDisorder'].value_counts()}\n")
print(f"Number of Responses per each HadKidneyDisease:\n{heart_attack_enc['HadKidneyDisease'].value_counts()}\n")
print(f"Number of Responses per each HadArthritis:\n{heart_attack_enc['HadArthritis'].value_counts()}\n")
print(f"Number of Responses per each HadDiabetes:\n{heart_attack_enc['HadDiabetes_enc'].value_counts()}\n")
print(f"Number of Responses per each DeafOrHardOfHearing:\n{heart_attack_enc['DeafOrHardOfHearing'].value_counts()}\n")
print(f"Number of Responses per each BlindOrVisionDifficulty:\n{heart_attack_enc['BlindOrVisionDifficulty'].value_counts()}\n")
print(f"Number of Responses per each DifficultyWalking:\n{heart_attack_enc['DifficultyWalking'].value_counts()}\n")
print(f"Number of Responses per each DifficultyDressingBathing:\n{heart_attack_enc['DifficultyDressingBathing'].value_counts()}\n")
print(f"Number of Responses per each DifficultyErrands:\n{heart_attack_enc['DifficultyErrands'].value_counts()}\n")
print(f"Number of Responses per each ChestScan:\n{heart_attack_enc['ChestScan'].value_counts()}\n")
print(f"Number of Responses per each AlcoholDrinkers:\n{heart_attack_enc['AlcoholDrinkers'].value_counts()}\n")
print(f"Number of Responses per each HIVTesting:\n{heart_attack_enc['HIVTesting'].value_counts()}\n")
print(f"Number of Responses per each FluVaxLast12:\n{heart_attack_enc['FluVaxLast12'].value_counts()}\n")
print(f"Number of Responses per each PneumoVaxEver:\n{heart_attack_enc['PneumoVaxEver'].value_counts()}\n")
print(f"Number of Responses per each HighRiskLastYear:\n{heart_attack_enc['HighRiskLastYear'].value_counts()}\n")
print(f"Number of Responses per each CovidPos:\n{heart_attack_enc['CovidPos_enc'].value_counts()}\n")



Number of Responses per each PhysicalActivities:
PhysicalActivities
Yes    191305
No      54699
Name: count, dtype: int64

Number of Responses per each SmokerStatus_enc:
SmokerStatus_enc
No     147725
Yes     98279
Name: count, dtype: int64

Number of Responses per each ECigaretteUsage_enc:
ECigaretteUsage_enc
No     190112
Yes     55892
Name: count, dtype: int64

Number of Responses per each HadHeartAttack:
HadHeartAttack
No     232569
Yes     13435
Name: count, dtype: int64

Number of Responses per each HadAngina:
HadAngina
No     231051
Yes     14953
Name: count, dtype: int64

Number of Responses per each HadStroke:
HadStroke
No     235893
Yes     10111
Name: count, dtype: int64

Number of Responses per each HadAsthma:
HadAsthma
No     209479
Yes     36525
Name: count, dtype: int64

Number of Responses per each HadSkinCancer:
HadSkinCancer
No     224985
Yes     21019
Name: count, dtype: int64

Number of Responses per each HadCOPD:
HadCOPD
No     227010
Yes     18994
Name: count, dty

In [163]:
# Changing the dtype of the Yes/No columns to category

heart_attack_enc = heart_attack_enc.astype({"PhysicalActivities": "category",
                                                "SmokerStatus_enc": "category",
                                                "ECigaretteUsage_enc": "category", 
                                                "HadHeartAttack": "category",
                                                "HadAngina": "category", 
                                                "HadStroke": "category",
                                                "HadAsthma": "category", 
                                                "HadSkinCancer": "category",
                                                "HadCOPD": "category", 
                                                "HadDepressiveDisorder": "category",
                                                "HadKidneyDisease": "category",
                                                "HadArthritis": "category", 
                                                "HadDiabetes_enc": "category",
                                                "DeafOrHardOfHearing": "category", 
                                                "BlindOrVisionDifficulty": "category",
                                                "DifficultyWalking": "category", 
                                                "DifficultyDressingBathing": "category",
                                                "DifficultyErrands": "category",
                                                "ChestScan": "category", 
                                                "AlcoholDrinkers": "category",
                                                "HIVTesting": "category", 
                                                "FluVaxLast12": "category",
                                                "PneumoVaxEver": "category", 
                                                "HighRiskLastYear": "category", 
                                                "CovidPos_enc": "category"})

In [164]:
heart_attack_enc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246004 entries, 0 to 246012
Data columns (total 49 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   Sex                        246004 non-null  object  
 1   GeneralHealth              246004 non-null  object  
 2   PhysicalHealthDays         246004 non-null  float64 
 3   MentalHealthDays           246004 non-null  float64 
 4   LastCheckupTime            246004 non-null  object  
 5   PhysicalActivities         246004 non-null  category
 6   SleepHours                 246004 non-null  float64 
 7   HadHeartAttack             246004 non-null  category
 8   HadAngina                  246004 non-null  category
 9   HadStroke                  246004 non-null  category
 10  HadAsthma                  246004 non-null  category
 11  HadSkinCancer              246004 non-null  category
 12  HadCOPD                    246004 non-null  category
 13  HadDepressiveDisord

In [165]:
# Checking the Value Counts of each Column with a Yes or No response

print(f"Number of Responses per each PhysicalActivities:\n{heart_attack_enc['PhysicalActivities'].value_counts()}\n")
print(f"Number of Responses per each SmokerStatus_enc:\n{heart_attack_enc['SmokerStatus_enc'].value_counts()}\n")
print(f"Number of Responses per each ECigaretteUsage_enc:\n{heart_attack_enc['ECigaretteUsage_enc'].value_counts()}\n")
#print(f"Number of Responses per each RaceEthnicity_enc:\n{heart_attack_enc['RaceEthnicity_enc'].value_counts()}\n")
#print(f"Number of Responses per each TetShotLast10Tdap:\n{heart_attack_enc['TetShotLast10_enc'].value_counts()}\n")
print(f"Number of Responses per each HadHeartAttack:\n{heart_attack_enc['HadHeartAttack'].value_counts()}\n")
print(f"Number of Responses per each HadAngina:\n{heart_attack_enc['HadAngina'].value_counts()}\n")
print(f"Number of Responses per each HadStroke:\n{heart_attack_enc['HadStroke'].value_counts()}\n")
print(f"Number of Responses per each HadAsthma:\n{heart_attack_enc['HadAsthma'].value_counts()}\n")
print(f"Number of Responses per each HadSkinCancer:\n{heart_attack_enc['HadSkinCancer'].value_counts()}\n")
print(f"Number of Responses per each HadCOPD:\n{heart_attack_enc['HadCOPD'].value_counts()}\n")
print(f"Number of Responses per each HadDepressiveDisorder:\n{heart_attack_enc['HadDepressiveDisorder'].value_counts()}\n")
print(f"Number of Responses per each HadKidneyDisease:\n{heart_attack_enc['HadKidneyDisease'].value_counts()}\n")
print(f"Number of Responses per each HadArthritis:\n{heart_attack_enc['HadArthritis'].value_counts()}\n")
print(f"Number of Responses per each HadDiabetes:\n{heart_attack_enc['HadDiabetes_enc'].value_counts()}\n")
print(f"Number of Responses per each DeafOrHardOfHearing:\n{heart_attack_enc['DeafOrHardOfHearing'].value_counts()}\n")
print(f"Number of Responses per each BlindOrVisionDifficulty:\n{heart_attack_enc['BlindOrVisionDifficulty'].value_counts()}\n")
print(f"Number of Responses per each DifficultyWalking:\n{heart_attack_enc['DifficultyWalking'].value_counts()}\n")
print(f"Number of Responses per each DifficultyDressingBathing:\n{heart_attack_enc['DifficultyDressingBathing'].value_counts()}\n")
print(f"Number of Responses per each DifficultyErrands:\n{heart_attack_enc['DifficultyErrands'].value_counts()}\n")
print(f"Number of Responses per each ChestScan:\n{heart_attack_enc['ChestScan'].value_counts()}\n")
print(f"Number of Responses per each AlcoholDrinkers:\n{heart_attack_enc['AlcoholDrinkers'].value_counts()}\n")
print(f"Number of Responses per each HIVTesting:\n{heart_attack_enc['HIVTesting'].value_counts()}\n")
print(f"Number of Responses per each FluVaxLast12:\n{heart_attack_enc['FluVaxLast12'].value_counts()}\n")
print(f"Number of Responses per each PneumoVaxEver:\n{heart_attack_enc['PneumoVaxEver'].value_counts()}\n")
print(f"Number of Responses per each HighRiskLastYear:\n{heart_attack_enc['HighRiskLastYear'].value_counts()}\n")
print(f"Number of Responses per each CovidPos:\n{heart_attack_enc['CovidPos_enc'].value_counts()}\n")

Number of Responses per each PhysicalActivities:
PhysicalActivities
Yes    191305
No      54699
Name: count, dtype: int64

Number of Responses per each SmokerStatus_enc:
SmokerStatus_enc
No     147725
Yes     98279
Name: count, dtype: int64

Number of Responses per each ECigaretteUsage_enc:
ECigaretteUsage_enc
No     190112
Yes     55892
Name: count, dtype: int64

Number of Responses per each HadHeartAttack:
HadHeartAttack
No     232569
Yes     13435
Name: count, dtype: int64

Number of Responses per each HadAngina:
HadAngina
No     231051
Yes     14953
Name: count, dtype: int64

Number of Responses per each HadStroke:
HadStroke
No     235893
Yes     10111
Name: count, dtype: int64

Number of Responses per each HadAsthma:
HadAsthma
No     209479
Yes     36525
Name: count, dtype: int64

Number of Responses per each HadSkinCancer:
HadSkinCancer
No     224985
Yes     21019
Name: count, dtype: int64

Number of Responses per each HadCOPD:
HadCOPD
No     227010
Yes     18994
Name: count, dty

In [168]:
# Changing every category column with Yes and No responses to 1 and 0 (binary) for model interpretation

yes_no_binmap = {'Yes': 1.0, 'No': 0.0}

for col in heart_attack_enc.select_dtypes(include=['category']).columns:
    if col.endswith('_enc'):
        # Change the column in place if it ends with '_enc'
        heart_attack_enc[col] = heart_attack_enc[col].map(yes_no_binmap)
    else:
        # Create a new column with '_enc' suffix if it doesn't already end with '_enc'
        new_col_name = col + '_enc'
        heart_attack_enc[new_col_name] = heart_attack_enc[col].map(yes_no_binmap)

heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,BlindOrVisionDifficulty_enc,DifficultyWalking_enc,DifficultyDressingBathing_enc,DifficultyErrands_enc,ChestScan_enc,AlcoholDrinkers_enc,HIVTesting_enc,FluVaxLast12_enc,PneumoVaxEver_enc,HighRiskLastYear_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0


In [169]:
# Checking the Value Counts of each Column with a Yes or No response

print(f"Number of Responses per each PhysicalActivities:\n{heart_attack_enc['PhysicalActivities_enc'].value_counts()}\n")
print(f"Number of Responses per each SmokerStatus_enc:\n{heart_attack_enc['SmokerStatus_enc'].value_counts()}\n")
print(f"Number of Responses per each ECigaretteUsage_enc:\n{heart_attack_enc['ECigaretteUsage_enc'].value_counts()}\n")
#print(f"Number of Responses per each RaceEthnicity_enc:\n{heart_attack_enc['RaceEthnicity_enc'].value_counts()}\n")
#print(f"Number of Responses per each TetShotLast10Tdap:\n{heart_attack_enc['TetShotLast10_enc'].value_counts()}\n")
print(f"Number of Responses per each HadHeartAttack:\n{heart_attack_enc['HadHeartAttack_enc'].value_counts()}\n")
print(f"Number of Responses per each HadAngina:\n{heart_attack_enc['HadAngina_enc'].value_counts()}\n")
print(f"Number of Responses per each HadStroke:\n{heart_attack_enc['HadStroke_enc'].value_counts()}\n")
print(f"Number of Responses per each HadAsthma:\n{heart_attack_enc['HadAsthma_enc'].value_counts()}\n")
print(f"Number of Responses per each HadSkinCancer:\n{heart_attack_enc['HadSkinCancer_enc'].value_counts()}\n")
print(f"Number of Responses per each HadCOPD:\n{heart_attack_enc['HadCOPD_enc'].value_counts()}\n")
print(f"Number of Responses per each HadDepressiveDisorder:\n{heart_attack_enc['HadDepressiveDisorder_enc'].value_counts()}\n")
print(f"Number of Responses per each HadKidneyDisease:\n{heart_attack_enc['HadKidneyDisease_enc'].value_counts()}\n")
print(f"Number of Responses per each HadArthritis:\n{heart_attack_enc['HadArthritis_enc'].value_counts()}\n")
print(f"Number of Responses per each HadDiabetes:\n{heart_attack_enc['HadDiabetes_enc'].value_counts()}\n")
print(f"Number of Responses per each DeafOrHardOfHearing:\n{heart_attack_enc['DeafOrHardOfHearing_enc'].value_counts()}\n")
print(f"Number of Responses per each BlindOrVisionDifficulty:\n{heart_attack_enc['BlindOrVisionDifficulty_enc'].value_counts()}\n")
print(f"Number of Responses per each DifficultyWalking:\n{heart_attack_enc['DifficultyWalking_enc'].value_counts()}\n")
print(f"Number of Responses per each DifficultyDressingBathing:\n{heart_attack_enc['DifficultyDressingBathing_enc'].value_counts()}\n")
print(f"Number of Responses per each DifficultyErrands:\n{heart_attack_enc['DifficultyErrands_enc'].value_counts()}\n")
print(f"Number of Responses per each ChestScan:\n{heart_attack_enc['ChestScan_enc'].value_counts()}\n")
print(f"Number of Responses per each AlcoholDrinkers:\n{heart_attack_enc['AlcoholDrinkers_enc'].value_counts()}\n")
print(f"Number of Responses per each HIVTesting:\n{heart_attack_enc['HIVTesting_enc'].value_counts()}\n")
print(f"Number of Responses per each FluVaxLast12:\n{heart_attack_enc['FluVaxLast12_enc'].value_counts()}\n")
print(f"Number of Responses per each PneumoVaxEver:\n{heart_attack_enc['PneumoVaxEver_enc'].value_counts()}\n")
print(f"Number of Responses per each HighRiskLastYear:\n{heart_attack_enc['HighRiskLastYear_enc'].value_counts()}\n")
print(f"Number of Responses per each CovidPos:\n{heart_attack_enc['CovidPos_enc'].value_counts()}\n")

Number of Responses per each PhysicalActivities:
PhysicalActivities_enc
1.0    191305
0.0     54699
Name: count, dtype: int64

Number of Responses per each SmokerStatus_enc:
SmokerStatus_enc
0.0    147725
1.0     98279
Name: count, dtype: int64

Number of Responses per each ECigaretteUsage_enc:
ECigaretteUsage_enc
0.0    190112
1.0     55892
Name: count, dtype: int64

Number of Responses per each HadHeartAttack:
HadHeartAttack_enc
0.0    232569
1.0     13435
Name: count, dtype: int64

Number of Responses per each HadAngina:
HadAngina_enc
0.0    231051
1.0     14953
Name: count, dtype: int64

Number of Responses per each HadStroke:
HadStroke_enc
0.0    235893
1.0     10111
Name: count, dtype: int64

Number of Responses per each HadAsthma:
HadAsthma_enc
0.0    209479
1.0     36525
Name: count, dtype: int64

Number of Responses per each HadSkinCancer:
HadSkinCancer_enc
0.0    224985
1.0     21019
Name: count, dtype: int64

Number of Responses per each HadCOPD:
HadCOPD_enc
0.0    227010
1.

In [170]:
heart_attack_enc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246004 entries, 0 to 246012
Data columns (total 69 columns):
 #   Column                         Non-Null Count   Dtype   
---  ------                         --------------   -----   
 0   Sex                            246004 non-null  object  
 1   GeneralHealth                  246004 non-null  object  
 2   PhysicalHealthDays             246004 non-null  float64 
 3   MentalHealthDays               246004 non-null  float64 
 4   LastCheckupTime                246004 non-null  object  
 5   PhysicalActivities             246004 non-null  category
 6   SleepHours                     246004 non-null  float64 
 7   HadHeartAttack                 246004 non-null  category
 8   HadAngina                      246004 non-null  category
 9   HadStroke                      246004 non-null  category
 10  HadAsthma                      246004 non-null  category
 11  HadSkinCancer                  246004 non-null  category
 12  HadCOPD              

In [171]:
# Changing the dtype of the original columns back to object

heart_attack_enc = heart_attack_enc.astype({"PhysicalActivities": object,
                                                #"SmokerStatus_enc": object,
                                                #"ECigaretteUsage_enc": object, 
                                                "HadHeartAttack": object,
                                                "HadAngina": object, 
                                                "HadStroke": object,
                                                "HadAsthma": object, 
                                                "HadSkinCancer": object,
                                                "HadCOPD": object, 
                                                "HadDepressiveDisorder": object,
                                                "HadKidneyDisease": object,
                                                "HadArthritis": object, 
                                                #"HadDiabetes_enc": object,
                                                "DeafOrHardOfHearing": object, 
                                                "BlindOrVisionDifficulty": object,
                                                "DifficultyWalking": object, 
                                                "DifficultyDressingBathing": object,
                                                "DifficultyErrands": object,
                                                "ChestScan": object, 
                                                "AlcoholDrinkers": object,
                                                "HIVTesting": object, 
                                                "FluVaxLast12": object,
                                                "PneumoVaxEver": object, 
                                                "HighRiskLastYear": object,
                                                #"CovidPos_enc": object
                                                })

In [172]:
# Changing the dtype of the encoded columns to float64

heart_attack_enc = heart_attack_enc.astype({"PhysicalActivities_enc": "float64",
                                                "SmokerStatus_enc": "float64",
                                                "ECigaretteUsage_enc": "float64", 
                                                "HadHeartAttack_enc": "float64",
                                                "HadAngina_enc": "float64", 
                                                "HadStroke_enc": "float64",
                                                "HadAsthma_enc": "float64", 
                                                "HadSkinCancer_enc": "float64",
                                                "HadCOPD_enc": "float64", 
                                                "HadDepressiveDisorder_enc": "float64",
                                                "HadKidneyDisease_enc": "float64",
                                                "HadArthritis_enc": "float64", 
                                                "HadDiabetes_enc": "float64",
                                                "DeafOrHardOfHearing_enc": "float64", 
                                                "BlindOrVisionDifficulty_enc": "float64",
                                                "DifficultyWalking_enc": "float64", 
                                                "DifficultyDressingBathing_enc": "float64",
                                                "DifficultyErrands_enc": "float64",
                                                "ChestScan_enc": "float64", 
                                                "AlcoholDrinkers_enc": "float64",
                                                "HIVTesting_enc": "float64", 
                                                "FluVaxLast12_enc": "float64",
                                                "PneumoVaxEver_enc": "float64", 
                                                "HighRiskLastYear_enc": "float64", 
                                                "CovidPos_enc": "float64"})

In [173]:
# Checking that all data is of the correct type

heart_attack_enc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246004 entries, 0 to 246012
Data columns (total 69 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Sex                            246004 non-null  object 
 1   GeneralHealth                  246004 non-null  object 
 2   PhysicalHealthDays             246004 non-null  float64
 3   MentalHealthDays               246004 non-null  float64
 4   LastCheckupTime                246004 non-null  object 
 5   PhysicalActivities             246004 non-null  object 
 6   SleepHours                     246004 non-null  float64
 7   HadHeartAttack                 246004 non-null  object 
 8   HadAngina                      246004 non-null  object 
 9   HadStroke                      246004 non-null  object 
 10  HadAsthma                      246004 non-null  object 
 11  HadSkinCancer                  246004 non-null  object 
 12  HadCOPD                        2460

**Yes & No Encoding Comments**

- First, I changed the data type of every Yes/No column to category.

- This marked the columns that needed to change and separated them from the columns that needed to stay the same. It allowed the loop to cycle through all category data type columns and make a new column for each one, with _enc (for encoded) added to the name.
- If the column name already ended in _enc, the column was changed in place instead of making a new column.
- Next, the Yes and No responses were changed to 1 and 0 using a dictionary mapping (.map()).

As there are only 2 responses (Yes and No), binary encoding is the most suitable option: each response maps to exactly one number, so the columns become numeric for modelling without the loss of any data.
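
The same idea can be sketched without listing every column by hand. The version below is an alternative under stated assumptions, not the method used above: instead of marking columns with the category dtype, it detects Yes/No columns by their values. The `toy` frame and the helper `is_yes_no` are hypothetical; only `yes_no_binmap` and the `_enc` naming rule come from the notebook.

```python
import pandas as pd

# Hypothetical toy data: two Yes/No columns, one text column, one numeric column
toy = pd.DataFrame({
    "HadAngina": ["No", "Yes", "No"],
    "CovidPos_enc": ["Yes", "No", "No"],
    "Sex": ["Female", "Male", "Male"],
    "SleepHours": [9.0, 6.0, 8.0],
})

yes_no_binmap = {"Yes": 1.0, "No": 0.0}


def is_yes_no(series):
    """Return True if the non-missing values are only 'Yes' and/or 'No'."""
    values = set(series.dropna().unique())
    return bool(values) and values <= set(yes_no_binmap)


# Build the list first so that columns added in the loop are not revisited
yes_no_cols = [col for col in toy.columns if is_yes_no(toy[col])]

for col in yes_no_cols:
    # Same naming rule as above: overwrite *_enc columns, otherwise add a new *_enc column
    target = col if col.endswith("_enc") else col + "_enc"
    toy[target] = toy[col].map(yes_no_binmap).astype("float64")

print(toy.dtypes)
```

Detecting by value avoids the round trip through the category dtype, at the cost of also picking up any Yes/No column that was not meant to be encoded, so the resulting `yes_no_cols` list is worth reading before the loop runs.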

--