## **Pre-Processing**

## Introduction

Data pre-processing is a critical step in the data science pipeline, as it puts the data into the most appropriate form to maximize the performance of the machine learning models that follow. The primary objective of this notebook is to encode the cleaned dataset from the previous notebook into a format suitable for machine learning modeling.

Given the diverse nature of the variables in the heart_attack_clean.csv dataset, it is essential to apply suitable encoding techniques to convert categorical variables into numerical representations. This notebook will focus on the following key tasks:

1. **Categorical Encoding:** Converting categorical variables into numerical formats using appropriate encoding techniques.

    - **Ordinal Encoding:** Applied to variables with an inherent order or ranking, ensuring the numerical values reflect the relative importance or scale.
    
    - **Binary Encoding:** Applied to binary categorical variables, converting them into 0.0 and 1.0 for model compatibility.

The prepared dataset will then be ready for the subsequent modeling phase, where we will develop and evaluate various machine learning models to identify key indicators of heart attacks. Proper pre-processing ensures that the models can leverage the full potential of the data, leading to more reliable and actionable insights.
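
As a minimal sketch of the two techniques listed above, the snippet below encodes a small made-up DataFrame. The `toy` frame, its column names and its values are assumptions for illustration only and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Health": ["Poor", "Good", "Excellent", "Good"],
    "Active": ["Yes", "No", "Yes", "Yes"],
})

# Ordinal encoding: the category list fixes the order, so Poor < Good < Excellent
health_encoder = OrdinalEncoder(categories=[["Poor", "Good", "Excellent"]], dtype=float)
toy["Health_enc"] = health_encoder.fit_transform(toy[["Health"]]).ravel()

# Binary encoding: a plain dictionary lookup is enough for two categories
toy["Active_enc"] = toy["Active"].map({"No": 0.0, "Yes": 1.0})

print(toy)
```

The ordinal column keeps the ranking of the labels, while the binary column only distinguishes the two answers.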

___

## Imports

In [1]:
# Imported Libraries 
import pandas as pd # pandas library 
import numpy as np # numpy library

from sklearn.preprocessing import OrdinalEncoder # Order Encoder

from sklearn.model_selection import train_test_split # Train, Validation, Test Split
from sklearn.preprocessing import StandardScaler # Standard Scaling

## Data Loading

In [2]:
## Data Loading
# Data loading for github only

CLN_DATA_PATH='../data/heart_attack_clean.csv'

try:
    heart_attack_clean = pd.read_csv(CLN_DATA_PATH, index_col=0)
    print("Data loaded successfully.")
except FileNotFoundError:
    print("ERROR: The data file does not exist.")

Data loaded successfully.


___

## Encoding

Checking the Loaded Data

In [3]:
heart_attack_clean.head().T # Transposed to easily display the columns

Unnamed: 0,342,343,345,346,347
State,Alabama,Alabama,Alabama,Alabama,Alabama
Sex,Female,Male,Male,Female,Female
GeneralHealth,Very good,Very good,Very good,Fair,Good
PhysicalHealthDays,4.0,0.0,0.0,5.0,3.0
MentalHealthDays,0.0,0.0,0.0,0.0,15.0
LastCheckupTime,Within past year (anytime less than 12 months ...,Within past year (anytime less than 12 months ...,Within past year (anytime less than 12 months ...,Within past year (anytime less than 12 months ...,Within past year (anytime less than 12 months ...
PhysicalActivities,Yes,Yes,No,Yes,Yes
SleepHours,9.0,6.0,8.0,9.0,5.0
RemovedTeeth,None of them,None of them,"6 or more, but not all",None of them,1 to 5
HadHeartAttack,No,No,No,No,No


Dropping Columns that are not Needed:

In [4]:
heart_attack_enc = heart_attack_clean.drop(columns=['State', 'RemovedTeeth', 'RaceEthnicityCategory'])

heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
342,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,1.6,71.67,27.99,No,No,Yes,Yes,"Yes, received Tdap",No,No
343,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,1.78,95.25,30.13,No,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No
345,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,1.85,108.86,31.66,Yes,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes


**Dropping Columns Comments**

- I dropped the `State` and `RemovedTeeth` columns because the problem I am tackling is heart attack indicators. Based on domain knowledge, neither column is expected to have a relationship with the frequency of heart attacks.

- I have dropped `RaceEthnicityCategory` for ethical reasons and because of a data imbalance. I do not want the machine learning model to infer race as a determining factor in having a heart attack. Because there are multiple races, encoding the column would introduce a hierarchical order to race, which would bias the model and lead it to naively infer that one race is better than another on the basis of heart attack prevalence. This is against the ethics of data science and will NOT happen here.

--

Checking the columns have been dropped:

In [29]:
heart_attack_enc.head(1).T

Unnamed: 0,0
Sex,Female
GeneralHealth,Very good
PhysicalHealthDays,4.0
MentalHealthDays,0.0
LastCheckupTime,Within past year (anytime less than 12 months ...
PhysicalActivities,Yes
SleepHours,9.0
HadHeartAttack,No
HadAngina,No
HadStroke,No


Now that the columns have been dropped successfully, the feature encoding can begin.

___

### **Encoding**

All categorical columns need to be encoded for machine learning purposes in the next notebook. If the encoding is not done properly, the model will not work as intended. I start with the columns that have 3+ response categories, because they can be more complicated than columns with only 2 options, No and Yes. The columns with 2 responses can be binary encoded, where No = 0 and Yes = 1. The columns with 3+ responses need to be either collapsed into No and Yes and then binary encoded, or ordinal / hierarchical encoded for severity.

To Summarise:

- **2 Responses** = Binary Encoding

- **3+ Responses** = Ordinal / Hierarchical Encoding if there is an inherent order
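
The rule above can be applied by counting the distinct responses in each categorical column. The sketch below does this on a hypothetical stand-in frame called `toy`; on the real data the same calls would be made on `heart_attack_enc`.

```python
import pandas as pd

# Hypothetical stand-in for heart_attack_enc, for illustration only
toy = pd.DataFrame({
    "PhysicalActivities": ["Yes", "No", "Yes"],
    "GeneralHealth": ["Poor", "Good", "Excellent"],
    "SleepHours": [9.0, 6.0, 8.0],
})

# Count the distinct responses in every non-numeric column
response_counts = toy.select_dtypes(exclude="number").nunique()

# Two responses: binary encoding. Three or more: check for an inherent order.
binary_candidates = response_counts[response_counts == 2].index.tolist()
multi_candidates = response_counts[response_counts >= 3].index.tolist()

print("Binary candidates:", binary_candidates)
print("3+ response candidates:", multi_candidates)
```

A column landing in the second list still needs a manual decision, since the count alone does not say whether the responses have an order.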

**LastCheckupTime**

LastCheckupTime indicates when the survey respondent last went to the doctor for a checkup. The column has 4 categories. There is an order of severity here: when a person does not go for regular checkups, diseases can be self-diagnosed or misdiagnosed as something less serious than they actually are.

In [30]:
print(f"Number of Responses per each LastCheckupTime:\n{heart_attack_clean['LastCheckupTime'].value_counts()}")

Number of Responses per each LastCheckupTime:
LastCheckupTime
Within past year (anytime less than 12 months ago)         198144
Within past 2 years (1 year but less than 2 years ago)      23227
Within past 5 years (2 years but less than 5 years ago)     13744
5 or more years ago                                         10898
Name: count, dtype: int64


In [31]:
# Define the order of the categories
LastCheckupTime_Order = ['Within past year (anytime less than 12 months ago)', 
                    'Within past 2 years (1 year but less than 2 years ago)', 
                    'Within past 5 years (2 years but less than 5 years ago)', 
                    '5 or more years ago']

# Create the OrdinalEncoder and specify the order
enc = OrdinalEncoder(categories=[LastCheckupTime_Order], dtype=float)

# Fit and transform the data
heart_attack_enc['LastCheckupTime_enc'] = enc.fit_transform(heart_attack_enc[['LastCheckupTime']])

# Checking the data to see if the data has been encoded correctly
heart_attack_enc.head(5)


Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,71.67,27.99,No,No,Yes,Yes,"Yes, received Tdap",No,No,0.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,95.25,30.13,No,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,108.86,31.66,Yes,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0
3,Female,Fair,5.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,90.72,31.32,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0
4,Female,Good,3.0,15.0,Within past year (anytime less than 12 months ...,Yes,5.0,No,No,No,...,79.38,33.07,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,No,0.0


In [32]:
print(f"Number of Responses per each LastCheckupTime:\n{heart_attack_enc['LastCheckupTime'].value_counts()}\n")
print(f"Number of Responses per each LastCheckupTime_enc:\n{heart_attack_enc['LastCheckupTime_enc'].value_counts()}")

Number of Responses per each LastCheckupTime:
LastCheckupTime
Within past year (anytime less than 12 months ago)         198144
Within past 2 years (1 year but less than 2 years ago)      23227
Within past 5 years (2 years but less than 5 years ago)     13744
5 or more years ago                                         10898
Name: count, dtype: int64

Number of Responses per each LastCheckupTime_enc:
LastCheckupTime_enc
0.0    198144
1.0     23227
2.0     13744
3.0     10898
Name: count, dtype: int64


**LastCheckupTime_enc Comments**

- The longer a person goes without a checkup, the more the chance of something bad happening is compounded. Therefore I have introduced an order to the column that implies a scale of severity: 0 is within the past year, rising to 3 for the longest gap (5 or more years ago), the most severe.
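
To confirm which label received which code, the fitted encoder's `categories_` attribute can be inspected. This is a minimal sketch on made-up rows; the shortened labels in `checkup_order` are assumptions for readability and are not the dataset's full response strings.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Shortened hypothetical labels, for illustration only
checkup_order = ["Within past year", "Within past 2 years",
                 "Within past 5 years", "5 or more years ago"]
toy = pd.DataFrame({"LastCheckupTime": ["5 or more years ago",
                                        "Within past year",
                                        "Within past 2 years"]})

enc = OrdinalEncoder(categories=[checkup_order], dtype=float)
toy["LastCheckupTime_enc"] = enc.fit_transform(toy[["LastCheckupTime"]]).ravel()

# categories_ holds one array per encoded column;
# a label's position in that array is its code
code_lookup = {label: float(code) for code, label in enumerate(enc.categories_[0])}
print(code_lookup)
```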

--

**Sex (Gender)**

The Sex column records whether each participant is female or male. As there are only 2 answers in the column and no order between them, it is a prime candidate for binary encoding.

In [33]:
# Mapping 'Male' to 1 and 'Female' to 0
heart_attack_enc['Gender_enc'] = heart_attack_enc['Sex'].map({'Male': 1.0, 'Female': 0.0})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,27.99,No,No,Yes,Yes,"Yes, received Tdap",No,No,0.0,0.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,30.13,No,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0,1.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,31.66,Yes,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,1.0


In [34]:
print(f"Number of Responses in the cleaned dataset per each Gender (Sex):\n{heart_attack_clean['Sex'].value_counts()}\n")
print(f"Number of Responses per each Gender (Sex):\n{heart_attack_enc['Gender_enc'].value_counts()}")

Number of Responses in the cleaned dataset per each Gender (Sex):
Sex
Female    127806
Male      118207
Name: count, dtype: int64

Number of Responses per each Gender (Sex):
Gender_enc
0.0    127806
1.0    118207
Name: count, dtype: int64


**Gender_enc (Sex) Comments**

- Sex has been changed to Gender_enc (enc for encoded).

- Female is encoded as 0 and Male is encoded as 1 (binary encoding, no order implied).
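
One thing to watch with `Series.map` and a dictionary is that any label missing from the dictionary silently becomes `NaN`. The sketch below, on a hypothetical frame with a deliberately mis-cased label, shows one way to list such labels after mapping.

```python
import pandas as pd

# Hypothetical data: the last label has the wrong casing on purpose
toy = pd.DataFrame({"Sex": ["Female", "Male", "male"]})
toy["Gender_enc"] = toy["Sex"].map({"Male": 1.0, "Female": 0.0})

# Labels that were not in the mapping end up as NaN in the encoded column
unmapped = toy.loc[toy["Gender_enc"].isna(), "Sex"].unique().tolist()
print("Unmapped labels:", unmapped)
```

An empty list here would mean every response was covered by the mapping.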

--

**GeneralHealth**

GeneralHealth also has an inherent order of severity: someone who says they are in poor health is in a vastly different position from someone who says they are in excellent health. Ordinal encoding is therefore the best fit here.

In [35]:
print(f"Number of Responses per each GeneralHealth:\n{heart_attack_enc['GeneralHealth'].value_counts()}")

Number of Responses per each GeneralHealth:
GeneralHealth
Very good    86996
Good         77407
Excellent    41522
Fair         30658
Poor          9430
Name: count, dtype: int64


In [36]:
# Define the order of the categories (lowest to highest self-rated health)
GeneralHealth_Order = ['Poor', 
                    'Fair', 
                    'Good', 
                    'Very good', 
                    'Excellent']

# Create the OrdinalEncoder and specify the order
enc = OrdinalEncoder(categories=[GeneralHealth_Order], dtype=float)

# Fit and transform the data
heart_attack_enc['GeneralHealth_enc'] = enc.fit_transform(heart_attack_enc[['GeneralHealth']])

# Checking the data to see if the data has been encoded correctly
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_enc,GeneralHealth_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,No,No,Yes,Yes,"Yes, received Tdap",No,No,0.0,0.0,3.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,No,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0,1.0,3.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,Yes,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,1.0,3.0


In [37]:
print(f"Number of Responses per each GeneralHealth:\n{heart_attack_clean['GeneralHealth'].value_counts()}\n")
print(f"Number of Responses per each GeneralHealth:\n{heart_attack_enc['GeneralHealth_enc'].value_counts()}")

Number of Responses per each GeneralHealth:
GeneralHealth
Very good    86996
Good         77407
Excellent    41522
Fair         30658
Poor          9430
Name: count, dtype: int64

Number of Responses per each GeneralHealth:
GeneralHealth_enc
3.0    86996
2.0    77407
4.0    41522
1.0    30658
0.0     9430
Name: count, dtype: int64


**GeneralHealth_enc Comments**

- General indications of health are self-reported, so they reflect how good the respondents understand their own health to be. Therefore I have introduced an order to the column. 0 is Poor, rising to 4 for Excellent, so unlike the other ordinal columns a higher value here means better health rather than greater severity.
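
The direction of an ordinal scale is set entirely by the order of the category list passed to the encoder. As a sketch on made-up rows, reversing the list flips the codes; the column names `health_up` and `severity_up` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

health_order = ["Poor", "Fair", "Good", "Very good", "Excellent"]
toy = pd.DataFrame({"GeneralHealth": ["Poor", "Very good", "Excellent"]})

# Poor first: higher code means better health
health_up = OrdinalEncoder(categories=[health_order], dtype=float)
# Excellent first: higher code means greater severity
severity_up = OrdinalEncoder(categories=[health_order[::-1]], dtype=float)

toy["health_up"] = health_up.fit_transform(toy[["GeneralHealth"]]).ravel()
toy["severity_up"] = severity_up.fit_transform(toy[["GeneralHealth"]]).ravel()
print(toy)
```

Either direction carries the same information; what matters is that the direction is known when the model's results are interpreted.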

--

**PhysicalActivities**

The PhysicalActivities column has 2 answers, Yes or No, indicating whether the respondent participates in physical activities. The column will be binary encoded.

In [38]:
print(f"Number of Responses per each PhysicalActivities:\n{heart_attack_enc['PhysicalActivities'].value_counts()}\n")

Number of Responses per each PhysicalActivities:
PhysicalActivities
Yes    191310
No      54703
Name: count, dtype: int64



In [39]:
# Mapping 'yes' to 1 and 'no' to 0
heart_attack_enc['PhysicalActivities_enc'] = heart_attack_enc['PhysicalActivities'].map({'Yes': 1.0, 'No': 0.0})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)


Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_enc,GeneralHealth_enc,PhysicalActivities_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,No,Yes,Yes,"Yes, received Tdap",No,No,0.0,0.0,3.0,1.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0,1.0,3.0,1.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,1.0,3.0,0.0


In [40]:
print(f"Number of Responses per each PhysicalActivities:\n{heart_attack_enc['PhysicalActivities'].value_counts()}\n")
print(f"Number of Responses per each PhysicalActivities_enc:\n{heart_attack_enc['PhysicalActivities_enc'].value_counts()}")

Number of Responses per each PhysicalActivities:
PhysicalActivities
Yes    191310
No      54703
Name: count, dtype: int64

Number of Responses per each PhysicalActivities_enc:
PhysicalActivities_enc
1.0    191310
0.0     54703
Name: count, dtype: int64


**PhysicalActivities_enc Comments**

- The Yes / No response has been binary encoded.

- Yes is 1 and No is 0 

--

**SmokerStatus**

SmokerStatus also has an order of severity. Never having smoked is not the same as being a former smoker or a current smoker, as the more time spent smoking, the greater the risk of heart conditions. Studies show that smoking has a lasting impact even on people who have since stopped. Ordinal encoding is the best fit here.

In [41]:
print(f"Number of Responses per each SmokerStatus:\n{heart_attack_enc['SmokerStatus'].value_counts()}\n")

Number of Responses per each SmokerStatus:
SmokerStatus
Never smoked                             147731
Former smoker                             68524
Current smoker - now smokes every day     21659
Current smoker - now smokes some days      8099
Name: count, dtype: int64



In [42]:
# Define the order of the categories
SmokerStatus_Order = ['Never smoked', 
                    'Former smoker', 
                    'Current smoker - now smokes some days', 
                    'Current smoker - now smokes every day']

# Create the OrdinalEncoder and specify the order
enc = OrdinalEncoder(categories=[SmokerStatus_Order], dtype=float)

# Fit and transform the data
heart_attack_enc['SmokerStatus_enc'] = enc.fit_transform(heart_attack_enc[['SmokerStatus']])

# Checking the data to see if the data has been encoded correctly
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_enc,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,Yes,Yes,"Yes, received Tdap",No,No,0.0,0.0,3.0,1.0,1.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0,1.0,3.0,1.0,1.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,1.0,3.0,0.0,1.0


In [43]:
print(f"Number of Responses per each SmokerStatus:\n{heart_attack_enc['SmokerStatus'].value_counts()}\n")
print(f"Number of Responses per each SmokerStatus_enc:\n{heart_attack_enc['SmokerStatus_enc'].value_counts()}")

Number of Responses per each SmokerStatus:
SmokerStatus
Never smoked                             147731
Former smoker                             68524
Current smoker - now smokes every day     21659
Current smoker - now smokes some days      8099
Name: count, dtype: int64

Number of Responses per each SmokerStatus_enc:
SmokerStatus_enc
0.0    147731
1.0     68524
3.0     21659
2.0      8099
Name: count, dtype: int64


**SmokerStatus_enc Comments**

Now the column has been changed to reflect the order of severity.

- 0 being "Never smoked" and 3 being "Current smoker - now smokes every day".


--

**ECigaretteUsage**

Like SmokerStatus, ECigaretteUsage will be ordinal encoded for the same reasons. Although far fewer papers have been published about e-cigarette usage because the technology is so new, those that have been published suggest that, as with cigarette usage in the mid-1900s, continued e-cigarette usage will lead to a delayed rise in the prevalence of organ-specific diseases (such as heart disease) in 20-40 years.

In [44]:
print(f"Number of Responses per each ECigaretteUsage:\n{heart_attack_enc['ECigaretteUsage'].value_counts()}\n")

Number of Responses per each ECigaretteUsage:
ECigaretteUsage
Never used e-cigarettes in my entire life    190119
Not at all (right now)                        43281
Use them some days                             6658
Use them every day                             5955
Name: count, dtype: int64



In [45]:
# Define the order of the categories
ECigaretteUsage_Order = ['Never used e-cigarettes in my entire life', 
                    'Not at all (right now)', 
                    'Use them some days', 
                    'Use them every day']

# Create the OrdinalEncoder and specify the order
enc = OrdinalEncoder(categories=[ECigaretteUsage_Order], dtype=float)

# Fit and transform the data
heart_attack_enc['ECigaretteUsage_enc'] = enc.fit_transform(heart_attack_enc[['ECigaretteUsage']])

# Checking the data to see if the data has been encoded correctly
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_enc,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,Yes,"Yes, received Tdap",No,No,0.0,0.0,3.0,1.0,1.0,0.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,Yes,"Yes, received tetanus shot but not sure what type",No,No,0.0,1.0,3.0,1.0,1.0,0.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,1.0,3.0,0.0,1.0,0.0


In [46]:
print(f"Number of Responses per each ECigaretteUsage:\n{heart_attack_enc['ECigaretteUsage'].value_counts()}\n")
print(f"\nNumber of Responses per each ECigaretteUsage_enc:\n{heart_attack_enc['ECigaretteUsage_enc'].value_counts()}")

Number of Responses per each ECigaretteUsage:
ECigaretteUsage
Never used e-cigarettes in my entire life    190119
Not at all (right now)                        43281
Use them some days                             6658
Use them every day                             5955
Name: count, dtype: int64


Number of Responses per each ECigaretteUsage_enc:
ECigaretteUsage_enc
0.0    190119
1.0     43281
2.0      6658
3.0      5955
Name: count, dtype: int64


**ECigaretteUsage_enc Comments**

Now the Column has been changed to reflect order of severity. 

- 0 being "Never used e-cigarettes in my entire life" and 3 being "Use them every day". 
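
Since SmokerStatus and ECigaretteUsage are encoded the same way, both could also be handled by a single `OrdinalEncoder` that receives one category list per column. This is a sketch of that alternative on made-up rows, not the approach used in the cells above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

smoker_order = ["Never smoked", "Former smoker",
                "Current smoker - now smokes some days",
                "Current smoker - now smokes every day"]
ecig_order = ["Never used e-cigarettes in my entire life",
              "Not at all (right now)",
              "Use them some days", "Use them every day"]

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "SmokerStatus": ["Former smoker", "Current smoker - now smokes every day", "Never smoked"],
    "ECigaretteUsage": ["Use them some days", "Never used e-cigarettes in my entire life",
                        "Use them every day"],
})

ordinal_cols = ["SmokerStatus", "ECigaretteUsage"]
# The category lists must be given in the same order as the columns
enc = OrdinalEncoder(categories=[smoker_order, ecig_order], dtype=float)
encoded = enc.fit_transform(toy[ordinal_cols])

# Write each encoded column back next to its original
for position, col in enumerate(ordinal_cols):
    toy[f"{col}_enc"] = encoded[:, position]
print(toy)
```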

--

**AgeCategory**

The age category is slightly different, as exact ages are not specified but grouped into ranges of 5-year spans, with the only 2 exceptions being "Age 18 to 24" and "Age 80 or older". There is an order here, as the prevalence of heart attacks rises as we grow older, which may be partly due to a decrease in physical activity. This column will be ordinal encoded.

In [47]:
print(f"Number of Responses per each AgeCategory:\n{heart_attack_enc['AgeCategory'].value_counts()}\n")

Number of Responses per each AgeCategory:
AgeCategory
Age 65 to 69       28555
Age 60 to 64       26719
Age 70 to 74       25737
Age 55 to 59       22224
Age 50 to 54       19912
Age 75 to 79       18133
Age 80 or older    17816
Age 40 to 44       16973
Age 45 to 49       16753
Age 35 to 39       15614
Age 30 to 34       13346
Age 18 to 24       13122
Age 25 to 29       11109
Name: count, dtype: int64



In [48]:
# Define the order of the age categories
age_categories_order = ['Age 18 to 24', 'Age 25 to 29', 'Age 30 to 34', 'Age 35 to 39', 
                        'Age 40 to 44', 'Age 45 to 49', 'Age 50 to 54', 'Age 55 to 59', 
                        'Age 60 to 64', 'Age 65 to 69', 'Age 70 to 74', 'Age 75 to 79', 'Age 80 or older']

# Create the OrdinalEncoder and specify the order
enc = OrdinalEncoder(categories=[age_categories_order])

# Fit and transform the data
heart_attack_enc['AgeCategory_enc'] = enc.fit_transform(heart_attack_enc[['AgeCategory']])

# Display the DataFrame with the original and encoded columns
heart_attack_enc.head(3)


Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,TetanusLast10Tdap,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_enc,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc,AgeCategory_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,"Yes, received Tdap",No,No,0.0,0.0,3.0,1.0,1.0,0.0,9.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,"Yes, received tetanus shot but not sure what type",No,No,0.0,1.0,3.0,1.0,1.0,0.0,10.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,"No, did not receive any tetanus shot in the pa...",No,Yes,0.0,1.0,3.0,0.0,1.0,0.0,11.0


In [49]:
print(f"Number of Responses per each AgeCategory:\n{heart_attack_enc['AgeCategory'].value_counts()}\n")
print(f"Number of Responses per each AgeCategory_enc:\n{heart_attack_enc['AgeCategory_enc'].value_counts()}")

Number of Responses per each AgeCategory:
AgeCategory
Age 65 to 69       28555
Age 60 to 64       26719
Age 70 to 74       25737
Age 55 to 59       22224
Age 50 to 54       19912
Age 75 to 79       18133
Age 80 or older    17816
Age 40 to 44       16973
Age 45 to 49       16753
Age 35 to 39       15614
Age 30 to 34       13346
Age 18 to 24       13122
Age 25 to 29       11109
Name: count, dtype: int64

Number of Responses per each AgeCategory_enc:
AgeCategory_enc
9.0     28555
8.0     26719
10.0    25737
7.0     22224
6.0     19912
11.0    18133
12.0    17816
4.0     16973
5.0     16753
3.0     15614
2.0     13346
0.0     13122
1.0     11109
Name: count, dtype: int64


**AgeCategory_enc Comments**

- As age increases, the risk of having a heart attack is greater. That is why this column has been encoded with the ordinal encoder: the order of the age ranges needs to be inherently kept as a factor in having a heart attack.
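
The same ordering can also be expressed with pandas alone, using an ordered categorical dtype. This is an alternative sketch rather than the method used above, and `age_order` is a shortened hypothetical subset of the real ranges.

```python
import pandas as pd

# Shortened hypothetical subset of the age ranges
age_order = ["Age 18 to 24", "Age 25 to 29", "Age 30 to 34"]
toy = pd.DataFrame({"AgeCategory": ["Age 30 to 34", "Age 18 to 24", "Age 25 to 29"]})

# An ordered categorical stores each label as its position in age_order
age_dtype = pd.CategoricalDtype(categories=age_order, ordered=True)
toy["AgeCategory_enc"] = toy["AgeCategory"].astype(age_dtype).cat.codes.astype(float)
print(toy)
```

With this approach a label that is not in `age_order` becomes missing and receives the code -1, so it is worth checking for negative values afterwards.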

--

**TetanusLast10Tdap**

Having a tetanus shot should not normally affect the heart muscle, so this column would be a good candidate to drop from the dataframe. However, it would be interesting to see how much importance it has on the target, based on the questions asked in the CDC survey. So it will be kept, but encoded differently from the previous encodings above. I will combine the responses so that the category is either Yes or No for the question "have you had a tetanus shot?". No specifics about whether the shot is Tdap or not, just Yes or No.

In [50]:
print(f"Number of Responses per each TetanusLast10Tdap:\n{heart_attack_enc['TetanusLast10Tdap'].value_counts()}\n")

Number of Responses per each TetanusLast10Tdap:
TetanusLast10Tdap
No, did not receive any tetanus shot in the past 10 years    81743
Yes, received tetanus shot but not sure what type            74118
Yes, received Tdap                                           70282
Yes, received tetanus shot, but not Tdap                     19870
Name: count, dtype: int64



In [51]:
# Mapping the four responses to a single Yes or No
heart_attack_enc['TetShotLast10_enc'] = heart_attack_enc['TetanusLast10Tdap'].map({'No, did not receive any tetanus shot in the past 10 years': 'No', 'Yes, received tetanus shot but not sure what type': 'Yes', 'Yes, received Tdap': 'Yes', 'Yes, received tetanus shot, but not Tdap': 'Yes'})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,HighRiskLastYear,CovidPos,LastCheckupTime_enc,Gender_enc,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc,AgeCategory_enc,TetShotLast10_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,No,No,0.0,0.0,3.0,1.0,1.0,0.0,9.0,Yes
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,No,No,0.0,1.0,3.0,1.0,1.0,0.0,10.0,Yes
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,No,Yes,0.0,1.0,3.0,0.0,1.0,0.0,11.0,No


In [52]:
print(f"Number of Responses per each TetanusShotLast10:\n{heart_attack_enc['TetanusLast10Tdap'].value_counts()}\n")
print(f"Yes, received tetanus shot but not sure what type + Yes, received Tdap + Yes, received tetanus shot, but not Tdap = {74118 + 70282 + 19870}\n")
print(f"Number of Responses per each TetShotLast10_enc:\n{heart_attack_enc['TetShotLast10_enc'].value_counts()}")

Number of Responses per each TetanusShotLast10:
TetanusLast10Tdap
No, did not receive any tetanus shot in the past 10 years    81743
Yes, received tetanus shot but not sure what type            74118
Yes, received Tdap                                           70282
Yes, received tetanus shot, but not Tdap                     19870
Name: count, dtype: int64

Yes, received tetanus shot but not sure what type + Yes, received Tdap + Yes, received tetanus shot, but not Tdap = 164270

Number of Responses per each TetShotLast10_enc:
TetShotLast10_enc
Yes    164270
No      81743
Name: count, dtype: int64


**TetShotLast10_enc Comments**

Now the column has been mapped to Yes and No, depending on whether the person has or has not had a tetanus shot in the last 10 years, removing any ambiguity about the type of shot.

- Now that the column holds only Yes or No responses, it will be binary encoded later in the pre-processing section.
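
As a sketch of how the collapse and the later binary encoding could be done in one step, every response that starts with "Yes" can be turned into 1.0 and everything else into 0.0. The column name `TetShotLast10_bin` is hypothetical, and the snippet assumes there are no missing values in the column.

```python
import pandas as pd

# Hypothetical rows using three of the real response labels
toy = pd.DataFrame({"TetanusLast10Tdap": [
    "No, did not receive any tetanus shot in the past 10 years",
    "Yes, received Tdap",
    "Yes, received tetanus shot, but not Tdap",
]})

# Every "Yes, ..." variant collapses to 1.0, anything else to 0.0
toy["TetShotLast10_bin"] = toy["TetanusLast10Tdap"].str.startswith("Yes").astype(float)
print(toy)
```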

-- 

**HadDiabetes**

Diabetes has 4 responses catered to all scenarios: 2 for Yes and 2 for No. One of the No responses is "No, pre-diabetes or borderline diabetes". This answer allows the survey conductors to see how prevalent diabetes may become in the population. In this project I just want to know whether someone is diabetic or not.

Therefore I will combine the responses for No into a single No answer. The Yes category is more complicated: "Yes, but only during pregnancy (female)" is a response meant only for females, but because this is survey data it could have been entered incorrectly and contain male responses. If there are none, I will also combine the responses for Yes into a single Yes answer. If there are, the male responses to that answer will have to be dropped, as I do not know whether they were a mistake or intentional. As I cannot speculate, I would rather drop the rows and then proceed with the combination. These 2 responses will then be binary encoded.
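
One way to look for such inconsistent responses is a cross-tabulation of the two columns. This is a minimal sketch on made-up rows; `pd.crosstab` is used here as an alternative to a `groupby`, and the frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical rows: one male response to the female-only answer
toy = pd.DataFrame({
    "HadDiabetes": ["Yes", "Yes, but only during pregnancy (female)",
                    "Yes, but only during pregnancy (female)", "No"],
    "Sex": ["Male", "Female", "Male", "Female"],
})

# Rows are diabetes responses, columns are Female / Male counts
counts = pd.crosstab(toy["HadDiabetes"], toy["Sex"])
male_pregnancy_rows = counts.loc["Yes, but only during pregnancy (female)", "Male"]
print(counts)
print("Male responses to the female-only answer:", male_pregnancy_rows)
```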

In [53]:
print(f"Number of Responses per each HadDiabetes:\n{heart_attack_enc['HadDiabetes'].value_counts()}\n")

Number of Responses per each HadDiabetes:
HadDiabetes
No                                         204827
Yes                                         33811
No, pre-diabetes or borderline diabetes      5392
Yes, but only during pregnancy (female)      1983
Name: count, dtype: int64



In [54]:
# Highlighting outlier responses
diabetes_by_sex = heart_attack_clean.groupby(['HadDiabetes','Sex']).size()
print(f"HadDiabetes Responses by Gender ('Sex'):\n{diabetes_by_sex}\n")
# Look the count up by label rather than by position
male_pregnancy_count = diabetes_by_sex.loc[('Yes, but only during pregnancy (female)', 'Male')]
print(f"Outlier Response - Yes, but only during pregnancy (female):\n{male_pregnancy_count}\n\
Male response to a female only response")

HadDiabetes Responses by Gender ('Sex'):
HadDiabetes                              Sex   
No                                       Female    106476
                                         Male       98351
No, pre-diabetes or borderline diabetes  Female      2942
                                         Male        2450
Yes                                      Female     16414
                                         Male       17397
Yes, but only during pregnancy (female)  Female      1974
                                         Male           9
dtype: int64

Outlier Response - Yes, but only during pregnancy (female):
9
Male response to a female only response


In [55]:
# Removing the Male only responses (9 Responses) to the "Yes, but only during pregnancy (female)"
# .copy() makes the result an independent DataFrame, so later column assignments do not act on a slice
heart_attack_enc = heart_attack_enc[~((heart_attack_enc['HadDiabetes'] == 'Yes, but only during pregnancy (female)') & (heart_attack_enc['Sex'] == 'Male'))].copy()

In [56]:
# Checking the rows have been removed
print(f"HadDiabetes Responses by Gender ('Sex'):\n{heart_attack_enc.groupby(['HadDiabetes','Sex']).size()}\n")

HadDiabetes Responses by Gender ('Sex'):
HadDiabetes                              Sex   
No                                       Female    106476
                                         Male       98351
No, pre-diabetes or borderline diabetes  Female      2942
                                         Male        2450
Yes                                      Female     16414
                                         Male       17397
Yes, but only during pregnancy (female)  Female      1974
dtype: int64



In [57]:
# Mapping the four HadDiabetes responses to No and Yes
heart_attack_enc['HadDiabetes_enc'] = heart_attack_enc['HadDiabetes'].map({'No': 'No', 'No, pre-diabetes or borderline diabetes': 'No', 'Yes': 'Yes', 'Yes, but only during pregnancy (female)': 'Yes'})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,CovidPos,LastCheckupTime_enc,Gender_enc,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc,AgeCategory_enc,TetShotLast10_enc,HadDiabetes_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,No,0.0,0.0,3.0,1.0,1.0,0.0,9.0,Yes,No
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,No,0.0,1.0,3.0,1.0,1.0,0.0,10.0,Yes,Yes
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,Yes,0.0,1.0,3.0,0.0,1.0,0.0,11.0,No,No


In [58]:
print(f"Number of Responses per each HadDiabetes:\n{heart_attack_enc['HadDiabetes_enc'].value_counts()}\n")

Number of Responses per each HadDiabetes:
HadDiabetes_enc
No     210219
Yes     35785
Name: count, dtype: int64



**HadDiabetes Comments**

I resolved the issue of the 9 male responses in the female-only category (Yes, but only during pregnancy (female)) by dropping those rows.

- I have also combined the 2 No responses and the 2 Yes responses into a single No or Yes. The column now has only 2 options and can be binary encoded.
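
The HadDiabetes steps above can be condensed into one small helper. This is only a minimal sketch on a toy DataFrame: the function name `collapse_had_diabetes` and the toy rows are illustrative assumptions and are not part of the notebook. Taking a `.copy()` after filtering is a design choice that should avoid pandas' chained-assignment warning when the new column is added.

```python
import pandas as pd

PREGNANCY_ONLY = "Yes, but only during pregnancy (female)"

# Mapping taken from the notebook: four survey answers collapse to Yes / No.
DIABETES_MAP = {
    "No": "No",
    "No, pre-diabetes or borderline diabetes": "No",
    "Yes": "Yes",
    PREGNANCY_ONLY: "Yes",
}


def collapse_had_diabetes(df):
    """Hypothetical helper: drop inconsistent rows, then collapse to Yes / No."""
    # Rows where a male respondent chose the female-only answer.
    inconsistent = (df["HadDiabetes"] == PREGNANCY_ONLY) & (df["Sex"] == "Male")
    out = df.loc[~inconsistent].copy()
    out["HadDiabetes_enc"] = out["HadDiabetes"].map(DIABETES_MAP)
    return out


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Male", "Female"],
    "HadDiabetes": ["No", PREGNANCY_ONLY, "Yes", PREGNANCY_ONLY],
})
result = collapse_had_diabetes(toy)
print(result)
```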

---

**CovidPos**

CovidPos is another column where this project does not need the specifics: the question only asks for a Yes or No answer, but the column has 3 responses. Because "Tested positive using home test without a health professional" is still a positive answer to the question "Have you ever tested positive for COVID-19?", it can be combined with the Yes answer. This gives final answers of No and Yes, which can then be binary encoded.

In [59]:
print(f"Number of Responses per each CovidPos:\n{heart_attack_enc['CovidPos'].value_counts()}\n")

Number of Responses per each CovidPos:
CovidPos
No                                                               167292
Yes                                                               70320
Tested positive using home test without a health professional      8392
Name: count, dtype: int64



In [60]:
# Mapping the three CovidPos responses to No and Yes
heart_attack_enc['CovidPos_enc'] = heart_attack_enc['CovidPos'].map({'No': 'No', 'Yes': 'Yes', 'Tested positive using home test without a health professional': 'Yes'})

# Display the DataFrame with the new binary column
heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,LastCheckupTime_enc,Gender_enc,GeneralHealth_enc,PhysicalActivities_enc,SmokerStatus_enc,ECigaretteUsage_enc,AgeCategory_enc,TetShotLast10_enc,HadDiabetes_enc,CovidPos_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,0.0,0.0,3.0,1.0,1.0,0.0,9.0,Yes,No,No
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,0.0,1.0,3.0,1.0,1.0,0.0,10.0,Yes,Yes,No
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,0.0,1.0,3.0,0.0,1.0,0.0,11.0,No,No,Yes


In [61]:
print(f"Number of Responses per each CovidPos_enc:\n{heart_attack_enc['CovidPos_enc'].value_counts()}\n")

Number of Responses per each CovidPos_enc:
CovidPos_enc
No     167292
Yes     78712
Name: count, dtype: int64



**CovidPos_enc Comments**

CovidPos is a question that should have a Yes or No response: regardless of where the person tested positive, it is still a positive COVID test.

- "Tested positive using home test without a health professional" and "Yes" have been combined because both are positive results.
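
One thing worth checking after a dictionary mapping like this is that no response was left out, because `Series.map` with a dict returns NaN for any value that is not a key. The sketch below uses a toy Series, and the "Unknown" response is a made-up value added only to show what an uncovered answer would look like.

```python
import pandas as pd

# Mapping taken from the notebook: three survey answers collapse to Yes / No.
COVID_MAP = {
    "No": "No",
    "Yes": "Yes",
    "Tested positive using home test without a health professional": "Yes",
}

# Toy responses; the last value is deliberately missing from the mapping.
toy = pd.Series(
    [
        "No",
        "Yes",
        "Tested positive using home test without a health professional",
        "Unknown",
    ],
    name="CovidPos",
)

mapped = toy.map(COVID_MAP)

# Values missing from the dict come back as NaN, so selecting the NaN
# positions reveals every response the mapping did not cover.
unmapped = toy[mapped.isna()].unique().tolist()
print(unmapped)
```

If `unmapped` is empty, the mapped column should be safe to binary encode.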

---

**Yes & No Responses**

Here I am looking at all columns with the answers Yes and No. Because these columns contain only 2 responses, they are all prime candidates for binary encoding, and the encoding will not introduce an artificial order.
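
As a possible alternative to listing every column by hand, the Yes/No columns could be detected programmatically. The sketch below assumes a hypothetical helper, `find_yes_no_columns`, and a made-up toy DataFrame; it is not the approach used in the cells that follow.

```python
import pandas as pd


def find_yes_no_columns(df):
    """Hypothetical helper: names of columns whose non-null values are only Yes / No."""
    allowed = {"Yes", "No"}
    found = []
    for col in df.columns:
        values = set(df[col].dropna().unique())
        # Keep the column only if it has values and all of them are Yes or No.
        if values and values <= allowed:
            found.append(col)
    return found


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "HadAngina": ["Yes", "No", "No"],
    "HadStroke": ["No", "No", "No"],
    "SleepHours": [7.0, 8.0, 6.0],
    "CovidPos": [
        "No",
        "Yes",
        "Tested positive using home test without a health professional",
    ],
})
yes_no_cols = find_yes_no_columns(toy)
print(yes_no_cols)
```

Columns with a third response, such as the unmapped CovidPos here, are left out of the result, which makes them easy to spot before encoding.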

In [62]:
# Checking the Value Counts of each Column with a Yes or No response

print(f"Number of Responses per each PhysicalActivities:\n{heart_attack_enc['PhysicalActivities'].value_counts()}\n")
print(f"Number of Responses per each TetShotLast10Tdap:\n{heart_attack_enc['TetShotLast10_enc'].value_counts()}\n")
print(f"Number of Responses per each HadHeartAttack:\n{heart_attack_enc['HadHeartAttack'].value_counts()}\n")
print(f"Number of Responses per each HadAngina:\n{heart_attack_enc['HadAngina'].value_counts()}\n")
print(f"Number of Responses per each HadStroke:\n{heart_attack_enc['HadStroke'].value_counts()}\n")
print(f"Number of Responses per each HadAsthma:\n{heart_attack_enc['HadAsthma'].value_counts()}\n")
print(f"Number of Responses per each HadSkinCancer:\n{heart_attack_enc['HadSkinCancer'].value_counts()}\n")
print(f"Number of Responses per each HadCOPD:\n{heart_attack_enc['HadCOPD'].value_counts()}\n")
print(f"Number of Responses per each HadDepressiveDisorder:\n{heart_attack_enc['HadDepressiveDisorder'].value_counts()}\n")
print(f"Number of Responses per each HadKidneyDisease:\n{heart_attack_enc['HadKidneyDisease'].value_counts()}\n")
print(f"Number of Responses per each HadArthritis:\n{heart_attack_enc['HadArthritis'].value_counts()}\n")
print(f"Number of Responses per each HadDiabetes:\n{heart_attack_enc['HadDiabetes_enc'].value_counts()}\n")
print(f"Number of Responses per each DeafOrHardOfHearing:\n{heart_attack_enc['DeafOrHardOfHearing'].value_counts()}\n")
print(f"Number of Responses per each BlindOrVisionDifficulty:\n{heart_attack_enc['BlindOrVisionDifficulty'].value_counts()}\n")
print(f"Number of Responses per each DifficultyWalking:\n{heart_attack_enc['DifficultyWalking'].value_counts()}\n")
print(f"Number of Responses per each DifficultyDressingBathing:\n{heart_attack_enc['DifficultyDressingBathing'].value_counts()}\n")
print(f"Number of Responses per each DifficultyErrands:\n{heart_attack_enc['DifficultyErrands'].value_counts()}\n")
print(f"Number of Responses per each ChestScan:\n{heart_attack_enc['ChestScan'].value_counts()}\n")
print(f"Number of Responses per each AlcoholDrinkers:\n{heart_attack_enc['AlcoholDrinkers'].value_counts()}\n")
print(f"Number of Responses per each HIVTesting:\n{heart_attack_enc['HIVTesting'].value_counts()}\n")
print(f"Number of Responses per each FluVaxLast12:\n{heart_attack_enc['FluVaxLast12'].value_counts()}\n")
print(f"Number of Responses per each PneumoVaxEver:\n{heart_attack_enc['PneumoVaxEver'].value_counts()}\n")
print(f"Number of Responses per each HighRiskLastYear:\n{heart_attack_enc['HighRiskLastYear'].value_counts()}\n")
print(f"Number of Responses per each CovidPos:\n{heart_attack_enc['CovidPos_enc'].value_counts()}\n")



Number of Responses per each PhysicalActivities:
PhysicalActivities
Yes    191305
No      54699
Name: count, dtype: int64

Number of Responses per each TetShotLast10Tdap:
TetShotLast10_enc
Yes    164266
No      81738
Name: count, dtype: int64

Number of Responses per each HadHeartAttack:
HadHeartAttack
No     232569
Yes     13435
Name: count, dtype: int64

Number of Responses per each HadAngina:
HadAngina
No     231051
Yes     14953
Name: count, dtype: int64

Number of Responses per each HadStroke:
HadStroke
No     235893
Yes     10111
Name: count, dtype: int64

Number of Responses per each HadAsthma:
HadAsthma
No     209479
Yes     36525
Name: count, dtype: int64

Number of Responses per each HadSkinCancer:
HadSkinCancer
No     224985
Yes     21019
Name: count, dtype: int64

Number of Responses per each HadCOPD:
HadCOPD
No     227010
Yes     18994
Name: count, dtype: int64

Number of Responses per each HadDepressiveDisorder:
HadDepressiveDisorder
No     195384
Yes     50620
Name: coun

In [63]:
# Changing the Dtype

heart_attack_enc = heart_attack_enc.astype({"PhysicalActivities": "category", 
                                                "HadHeartAttack": "category",
                                                "HadAngina": "category", 
                                                "HadStroke": "category",
                                                "HadAsthma": "category", 
                                                "HadSkinCancer": "category",
                                                "HadCOPD": "category", 
                                                "HadDepressiveDisorder": "category",
                                                "HadKidneyDisease": "category",
                                                "HadArthritis": "category", 
                                                "HadDiabetes_enc": "category",
                                                "DeafOrHardOfHearing": "category", 
                                                "BlindOrVisionDifficulty": "category",
                                                "DifficultyWalking": "category", 
                                                "DifficultyDressingBathing": "category",
                                                "DifficultyErrands": "category",
                                                "ChestScan": "category", 
                                                "AlcoholDrinkers": "category",
                                                "HIVTesting": "category", 
                                                "FluVaxLast12": "category",
                                                "PneumoVaxEver": "category", 
                                                "HighRiskLastYear": "category", 
                                                "CovidPos_enc": "category",
                                                "TetShotLast10_enc": "category"})

In [64]:
# Checking the Datatypes have been changed
heart_attack_enc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246004 entries, 0 to 246012
Data columns (total 47 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   Sex                        246004 non-null  object  
 1   GeneralHealth              246004 non-null  object  
 2   PhysicalHealthDays         246004 non-null  float64 
 3   MentalHealthDays           246004 non-null  float64 
 4   LastCheckupTime            246004 non-null  object  
 5   PhysicalActivities         246004 non-null  category
 6   SleepHours                 246004 non-null  float64 
 7   HadHeartAttack             246004 non-null  category
 8   HadAngina                  246004 non-null  category
 9   HadStroke                  246004 non-null  category
 10  HadAsthma                  246004 non-null  category
 11  HadSkinCancer              246004 non-null  category
 12  HadCOPD                    246004 non-null  category
 13  HadDepressiveDisord

In [65]:
# Checking the Value Counts of each Column with a Yes or No response

print(f"Number of Responses per each PhysicalActivities:\n{heart_attack_enc['PhysicalActivities'].value_counts()}\n")
print(f"Number of Responses per each SmokerStatus_enc:\n{heart_attack_enc['SmokerStatus_enc'].value_counts()}\n")
print(f"Number of Responses per each ECigaretteUsage_enc:\n{heart_attack_enc['ECigaretteUsage_enc'].value_counts()}\n")
print(f"Number of Responses per each TetShotLast10Tdap:\n{heart_attack_enc['TetShotLast10_enc'].value_counts()}\n")
print(f"Number of Responses per each HadHeartAttack:\n{heart_attack_enc['HadHeartAttack'].value_counts()}\n")
print(f"Number of Responses per each HadAngina:\n{heart_attack_enc['HadAngina'].value_counts()}\n")
print(f"Number of Responses per each HadStroke:\n{heart_attack_enc['HadStroke'].value_counts()}\n")
print(f"Number of Responses per each HadAsthma:\n{heart_attack_enc['HadAsthma'].value_counts()}\n")
print(f"Number of Responses per each HadSkinCancer:\n{heart_attack_enc['HadSkinCancer'].value_counts()}\n")
print(f"Number of Responses per each HadCOPD:\n{heart_attack_enc['HadCOPD'].value_counts()}\n")
print(f"Number of Responses per each HadDepressiveDisorder:\n{heart_attack_enc['HadDepressiveDisorder'].value_counts()}\n")
print(f"Number of Responses per each HadKidneyDisease:\n{heart_attack_enc['HadKidneyDisease'].value_counts()}\n")
print(f"Number of Responses per each HadArthritis:\n{heart_attack_enc['HadArthritis'].value_counts()}\n")
print(f"Number of Responses per each HadDiabetes:\n{heart_attack_enc['HadDiabetes_enc'].value_counts()}\n")
print(f"Number of Responses per each DeafOrHardOfHearing:\n{heart_attack_enc['DeafOrHardOfHearing'].value_counts()}\n")
print(f"Number of Responses per each BlindOrVisionDifficulty:\n{heart_attack_enc['BlindOrVisionDifficulty'].value_counts()}\n")
print(f"Number of Responses per each DifficultyWalking:\n{heart_attack_enc['DifficultyWalking'].value_counts()}\n")
print(f"Number of Responses per each DifficultyDressingBathing:\n{heart_attack_enc['DifficultyDressingBathing'].value_counts()}\n")
print(f"Number of Responses per each DifficultyErrands:\n{heart_attack_enc['DifficultyErrands'].value_counts()}\n")
print(f"Number of Responses per each ChestScan:\n{heart_attack_enc['ChestScan'].value_counts()}\n")
print(f"Number of Responses per each AlcoholDrinkers:\n{heart_attack_enc['AlcoholDrinkers'].value_counts()}\n")
print(f"Number of Responses per each HIVTesting:\n{heart_attack_enc['HIVTesting'].value_counts()}\n")
print(f"Number of Responses per each FluVaxLast12:\n{heart_attack_enc['FluVaxLast12'].value_counts()}\n")
print(f"Number of Responses per each PneumoVaxEver:\n{heart_attack_enc['PneumoVaxEver'].value_counts()}\n")
print(f"Number of Responses per each HighRiskLastYear:\n{heart_attack_enc['HighRiskLastYear'].value_counts()}\n")
print(f"Number of Responses per each CovidPos:\n{heart_attack_enc['CovidPos_enc'].value_counts()}\n")

Number of Responses per each PhysicalActivities:
PhysicalActivities
Yes    191305
No      54699
Name: count, dtype: int64

Number of Responses per each SmokerStatus_enc:
SmokerStatus_enc
0.0    147725
1.0     68523
3.0     21657
2.0      8099
Name: count, dtype: int64

Number of Responses per each ECigaretteUsage_enc:
ECigaretteUsage_enc
0.0    190112
1.0     43279
2.0      6658
3.0      5955
Name: count, dtype: int64

Number of Responses per each HadHeartAttack:
HadHeartAttack
No     232569
Yes     13435
Name: count, dtype: int64

Number of Responses per each HadAngina:
HadAngina
No     231051
Yes     14953
Name: count, dtype: int64

Number of Responses per each HadStroke:
HadStroke
No     235893
Yes     10111
Name: count, dtype: int64

Number of Responses per each HadAsthma:
HadAsthma
No     209479
Yes     36525
Name: count, dtype: int64

Number of Responses per each HadSkinCancer:
HadSkinCancer
No     224985
Yes     21019
Name: count, dtype: int64

Number of Responses per each HadCO

Now that I have checked that all the Yes/No columns listed above contain only Yes and No answers, the responses in those columns can be binary encoded:

- Yes = 1.0
- No = 0.0

In [66]:
# Changing every category column with Yes and No responses to 1 and 0 (binary) for model interpretation

yes_no_binmap = {'Yes': 1.0, 'No': 0.0}

for col in heart_attack_enc.select_dtypes(include=['category']).columns:
    if col.endswith('_enc'):
        # Change the column in place if it ends with '_enc'
        heart_attack_enc[col] = heart_attack_enc[col].map(yes_no_binmap)
    else:
        # Create a new column with '_enc' suffix if it doesn't already end with '_enc'
        new_col_name = col + '_enc'
        heart_attack_enc[new_col_name] = heart_attack_enc[col].map(yes_no_binmap)

heart_attack_enc.head(3)

Unnamed: 0,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,HadHeartAttack,HadAngina,HadStroke,...,BlindOrVisionDifficulty_enc,DifficultyWalking_enc,DifficultyDressingBathing_enc,DifficultyErrands_enc,ChestScan_enc,AlcoholDrinkers_enc,HIVTesting_enc,FluVaxLast12_enc,PneumoVaxEver_enc,HighRiskLastYear_enc
0,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,No,No,No,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,No,No,No,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,No,No,No,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0


In [67]:
# Checking the value counts of each Yes/No column after binary mapping

print(f"Number of Responses per each PhysicalActivities:\n{heart_attack_enc['PhysicalActivities_enc'].value_counts()}\n")
print(f"Number of Responses per each SmokerStatus_enc:\n{heart_attack_enc['SmokerStatus_enc'].value_counts()}\n")
print(f"Number of Responses per each ECigaretteUsage_enc:\n{heart_attack_enc['ECigaretteUsage_enc'].value_counts()}\n")
print(f"Number of Responses per each TetShotLast10Tdap:\n{heart_attack_enc['TetShotLast10_enc'].value_counts()}\n")
print(f"Number of Responses per each HadHeartAttack:\n{heart_attack_enc['HadHeartAttack_enc'].value_counts()}\n")
print(f"Number of Responses per each HadAngina:\n{heart_attack_enc['HadAngina_enc'].value_counts()}\n")
print(f"Number of Responses per each HadStroke:\n{heart_attack_enc['HadStroke_enc'].value_counts()}\n")
print(f"Number of Responses per each HadAsthma:\n{heart_attack_enc['HadAsthma_enc'].value_counts()}\n")
print(f"Number of Responses per each HadSkinCancer:\n{heart_attack_enc['HadSkinCancer_enc'].value_counts()}\n")
print(f"Number of Responses per each HadCOPD:\n{heart_attack_enc['HadCOPD_enc'].value_counts()}\n")
print(f"Number of Responses per each HadDepressiveDisorder:\n{heart_attack_enc['HadDepressiveDisorder_enc'].value_counts()}\n")
print(f"Number of Responses per each HadKidneyDisease:\n{heart_attack_enc['HadKidneyDisease_enc'].value_counts()}\n")
print(f"Number of Responses per each HadArthritis:\n{heart_attack_enc['HadArthritis_enc'].value_counts()}\n")
print(f"Number of Responses per each HadDiabetes:\n{heart_attack_enc['HadDiabetes_enc'].value_counts()}\n")
print(f"Number of Responses per each DeafOrHardOfHearing:\n{heart_attack_enc['DeafOrHardOfHearing_enc'].value_counts()}\n")
print(f"Number of Responses per each BlindOrVisionDifficulty:\n{heart_attack_enc['BlindOrVisionDifficulty_enc'].value_counts()}\n")
print(f"Number of Responses per each DifficultyWalking:\n{heart_attack_enc['DifficultyWalking_enc'].value_counts()}\n")
print(f"Number of Responses per each DifficultyDressingBathing:\n{heart_attack_enc['DifficultyDressingBathing_enc'].value_counts()}\n")
print(f"Number of Responses per each DifficultyErrands:\n{heart_attack_enc['DifficultyErrands_enc'].value_counts()}\n")
print(f"Number of Responses per each ChestScan:\n{heart_attack_enc['ChestScan_enc'].value_counts()}\n")
print(f"Number of Responses per each AlcoholDrinkers:\n{heart_attack_enc['AlcoholDrinkers_enc'].value_counts()}\n")
print(f"Number of Responses per each HIVTesting:\n{heart_attack_enc['HIVTesting_enc'].value_counts()}\n")
print(f"Number of Responses per each FluVaxLast12:\n{heart_attack_enc['FluVaxLast12_enc'].value_counts()}\n")
print(f"Number of Responses per each PneumoVaxEver:\n{heart_attack_enc['PneumoVaxEver_enc'].value_counts()}\n")
print(f"Number of Responses per each HighRiskLastYear:\n{heart_attack_enc['HighRiskLastYear_enc'].value_counts()}\n")
print(f"Number of Responses per each CovidPos:\n{heart_attack_enc['CovidPos_enc'].value_counts()}\n")

Number of Responses per each PhysicalActivities:
PhysicalActivities_enc
1.0    191305
0.0     54699
Name: count, dtype: int64

Number of Responses per each SmokerStatus_enc:
SmokerStatus_enc
0.0    147725
1.0     68523
3.0     21657
2.0      8099
Name: count, dtype: int64

Number of Responses per each ECigaretteUsage_enc:
ECigaretteUsage_enc
0.0    190112
1.0     43279
2.0      6658
3.0      5955
Name: count, dtype: int64

Number of Responses per each TetShotLast10Tdap:
TetShotLast10_enc
1.0    164266
0.0     81738
Name: count, dtype: int64

Number of Responses per each HadHeartAttack:
HadHeartAttack_enc
0.0    232569
1.0     13435
Name: count, dtype: int64

Number of Responses per each HadAngina:
HadAngina_enc
0.0    231051
1.0     14953
Name: count, dtype: int64

Number of Responses per each HadStroke:
HadStroke_enc
0.0    235893
1.0     10111
Name: count, dtype: int64

Number of Responses per each HadAsthma:
HadAsthma_enc
0.0    209479
1.0     36525
Name: count, dtype: int64

Number

Checking the datatypes of the newly created encoded (enc) columns:

In [68]:
heart_attack_enc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246004 entries, 0 to 246012
Data columns (total 67 columns):
 #   Column                         Non-Null Count   Dtype   
---  ------                         --------------   -----   
 0   Sex                            246004 non-null  object  
 1   GeneralHealth                  246004 non-null  object  
 2   PhysicalHealthDays             246004 non-null  float64 
 3   MentalHealthDays               246004 non-null  float64 
 4   LastCheckupTime                246004 non-null  object  
 5   PhysicalActivities             246004 non-null  category
 6   SleepHours                     246004 non-null  float64 
 7   HadHeartAttack                 246004 non-null  category
 8   HadAngina                      246004 non-null  category
 9   HadStroke                      246004 non-null  category
 10  HadAsthma                      246004 non-null  category
 11  HadSkinCancer                  246004 non-null  category
 12  HadCOPD              

I will now change the original columns back to their original datatypes.

In [69]:
# Changing the original columns back to their original datatype, object.
# Only original columns are listed: the encoded (_enc) columns are cast to float64 in the next cell.

heart_attack_enc = heart_attack_enc.astype({"PhysicalActivities": object,
                                                "HadHeartAttack": object,
                                                "HadAngina": object,
                                                "HadStroke": object,
                                                "HadAsthma": object,
                                                "HadSkinCancer": object,
                                                "HadCOPD": object,
                                                "HadDepressiveDisorder": object,
                                                "HadKidneyDisease": object,
                                                "HadArthritis": object,
                                                "DeafOrHardOfHearing": object,
                                                "BlindOrVisionDifficulty": object,
                                                "DifficultyWalking": object,
                                                "DifficultyDressingBathing": object,
                                                "DifficultyErrands": object,
                                                "ChestScan": object,
                                                "AlcoholDrinkers": object,
                                                "HIVTesting": object,
                                                "FluVaxLast12": object,
                                                "PneumoVaxEver": object,
                                                "HighRiskLastYear": object})

Next I will make sure the encoded columns all share the same datatype. I chose float because the original numerical columns in the dataset are floats, so using float64 for every numeric column keeps the dataset consistent.
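
Because every encoded column in this notebook ends with `_enc`, the cast to float could also be built from the column names rather than typed out. This is a minimal sketch on a made-up toy DataFrame, relying on the fact that `DataFrame.astype` accepts a `{column: dtype}` dictionary.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "HadAngina": ["Yes", "No"],
    "HadAngina_enc": pd.Categorical([1.0, 0.0]),
    "AgeCategory_enc": [9, 10],
    "SleepHours": [7.0, 8.0],
})

# Select every column whose name ends with the '_enc' suffix.
enc_cols = [col for col in toy.columns if col.endswith("_enc")]

# One astype call with a {column: dtype} dict covers all encoded columns.
toy = toy.astype({col: "float64" for col in enc_cols})
print(toy.dtypes)
```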

In [70]:
# Changing the datatypes of the encoded columns to floats

heart_attack_enc = heart_attack_enc.astype({"PhysicalActivities_enc": "float64",
                                                "HadHeartAttack_enc": "float64",
                                                "HadAngina_enc": "float64", 
                                                "HadStroke_enc": "float64",
                                                "HadAsthma_enc": "float64", 
                                                "HadSkinCancer_enc": "float64",
                                                "HadCOPD_enc": "float64", 
                                                "HadDepressiveDisorder_enc": "float64",
                                                "HadKidneyDisease_enc": "float64",
                                                "HadArthritis_enc": "float64", 
                                                "HadDiabetes_enc": "float64",
                                                "DeafOrHardOfHearing_enc": "float64", 
                                                "BlindOrVisionDifficulty_enc": "float64",
                                                "DifficultyWalking_enc": "float64", 
                                                "DifficultyDressingBathing_enc": "float64",
                                                "DifficultyErrands_enc": "float64",
                                                "ChestScan_enc": "float64", 
                                                "AlcoholDrinkers_enc": "float64",
                                                "HIVTesting_enc": "float64", 
                                                "FluVaxLast12_enc": "float64",
                                                "PneumoVaxEver_enc": "float64", 
                                                "HighRiskLastYear_enc": "float64", 
                                                "CovidPos_enc": "float64",
                                                "TetShotLast10_enc": "float64"})

In [71]:
# Checking if all data is of the correct type

heart_attack_enc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246004 entries, 0 to 246012
Data columns (total 67 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Sex                            246004 non-null  object 
 1   GeneralHealth                  246004 non-null  object 
 2   PhysicalHealthDays             246004 non-null  float64
 3   MentalHealthDays               246004 non-null  float64
 4   LastCheckupTime                246004 non-null  object 
 5   PhysicalActivities             246004 non-null  object 
 6   SleepHours                     246004 non-null  float64
 7   HadHeartAttack                 246004 non-null  object 
 8   HadAngina                      246004 non-null  object 
 9   HadStroke                      246004 non-null  object 
 10  HadAsthma                      246004 non-null  object 
 11  HadSkinCancer                  246004 non-null  object 
 12  HadCOPD                        2460

**Yes & No Encoding Comments**

- First, I changed the datatype of the Yes/No columns to category.
- This was done to differentiate the columns I needed to change from the columns that needed to stay the same. It allowed my loop to cycle through all category columns and create a new column with an added _enc (encoded) suffix for each one.
- If a column already had _enc in its name, the mapping was applied to that column in place instead of creating a new column.
- Next, the columns with Yes and/or No as a response were changed to 1 and 0 using a binary mapping (.map()).

As there were only 2 responses (Yes and No), binary encoding is the best option. It lets the models interpret these columns without any loss of data.
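
The same result could probably be reached without switching datatypes back and forth, by passing an explicit list of columns to a helper. The sketch below is an alternative under stated assumptions, not the method used above: `encode_yes_no` is a hypothetical function and the toy DataFrame is made up. It keeps the notebook's naming rule (overwrite columns already ending in `_enc`, otherwise add a new `_enc` column) and raises an error if a column holds anything other than Yes or No.

```python
import pandas as pd

YES_NO_BINMAP = {"Yes": 1.0, "No": 0.0}


def encode_yes_no(df, columns):
    """Hypothetical helper: write float 1.0 / 0.0 versions of Yes / No columns."""
    out = df.copy()
    for col in columns:
        # Overwrite columns that already carry the suffix, else add a new one.
        target = col if col.endswith("_enc") else f"{col}_enc"
        encoded = out[col].map(YES_NO_BINMAP)
        # Unmapped values become NaN, so fail loudly instead of losing data.
        if encoded.isna().any():
            raise ValueError(f"{col} contains values other than Yes / No")
        out[target] = encoded.astype("float64")
    return out


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "HadStroke": ["No", "Yes", "No"],
    "CovidPos_enc": ["Yes", "No", "Yes"],
})
encoded = encode_yes_no(toy, ["HadStroke", "CovidPos_enc"])
print(encoded)
```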

---

**Making an Encoded (enc) Features DataFrame**

In [72]:
heart_attack_enc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246004 entries, 0 to 246012
Data columns (total 67 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Sex                            246004 non-null  object 
 1   GeneralHealth                  246004 non-null  object 
 2   PhysicalHealthDays             246004 non-null  float64
 3   MentalHealthDays               246004 non-null  float64
 4   LastCheckupTime                246004 non-null  object 
 5   PhysicalActivities             246004 non-null  object 
 6   SleepHours                     246004 non-null  float64
 7   HadHeartAttack                 246004 non-null  object 
 8   HadAngina                      246004 non-null  object 
 9   HadStroke                      246004 non-null  object 
 10  HadAsthma                      246004 non-null  object 
 11  HadSkinCancer                  246004 non-null  object 
 12  HadCOPD                        2460

In [73]:
# Making a dataframe of only numerical columns and the target variable

heart_attack_enc2 = heart_attack_enc[['HadHeartAttack', 
                                      'PhysicalHealthDays', 
                                      'MentalHealthDays', 
                                      'SleepHours', 
                                      'HeightInMeters', 
                                      'WeightInKilograms', 
                                      'BMI', 
                                      'LastCheckupTime_enc', 
                                      'Gender_enc', 
                                      'GeneralHealth_enc', 
                                      'PhysicalActivities_enc',  
                                      'SmokerStatus_enc', 
                                      'ECigaretteUsage_enc', 
                                      'AgeCategory_enc', 
                                      'TetShotLast10_enc', 
                                      'HadDiabetes_enc', 
                                      'CovidPos_enc', 
                                      'HadHeartAttack_enc', 
                                      'HadAngina_enc', 
                                      'HadStroke_enc', 
                                      'HadAsthma_enc', 
                                      'HadSkinCancer_enc', 
                                      'HadCOPD_enc', 
                                      'HadDepressiveDisorder_enc', 
                                      'HadKidneyDisease_enc', 
                                      'HadArthritis_enc', 
                                      'DeafOrHardOfHearing_enc', 
                                      'BlindOrVisionDifficulty_enc', 
                                      'DifficultyWalking_enc', 
                                      'DifficultyDressingBathing_enc', 
                                      'DifficultyErrands_enc', 
                                      'ChestScan_enc', 
                                      'AlcoholDrinkers_enc', 
                                      'HIVTesting_enc', 
                                      'FluVaxLast12_enc', 
                                      'PneumoVaxEver_enc', 
                                      'HighRiskLastYear_enc']].copy()

Checking the new DataFrame. It should have only 1 non-numerical column: the target, HadHeartAttack.
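
Besides reading the `info()` output, this expectation could be checked directly with `select_dtypes`. The sketch below runs on a made-up toy DataFrame; on the real data, the same check would be applied to `heart_attack_enc2`.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "HadHeartAttack": ["No", "Yes"],
    "BMI": [27.4, 31.0],
    "HadHeartAttack_enc": [0.0, 1.0],
})

# select_dtypes(exclude="number") keeps only the non-numeric columns.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# The only non-numeric column should be the text version of the target.
assert non_numeric == ["HadHeartAttack"]
```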

In [74]:
heart_attack_enc2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246004 entries, 0 to 246012
Data columns (total 37 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   HadHeartAttack                 246004 non-null  object 
 1   PhysicalHealthDays             246004 non-null  float64
 2   MentalHealthDays               246004 non-null  float64
 3   SleepHours                     246004 non-null  float64
 4   HeightInMeters                 246004 non-null  float64
 5   WeightInKilograms              246004 non-null  float64
 6   BMI                            246004 non-null  float64
 7   LastCheckupTime_enc            246004 non-null  float64
 8   Gender_enc                     246004 non-null  float64
 9   GeneralHealth_enc              246004 non-null  float64
 10  PhysicalActivities_enc         246004 non-null  float64
 11  SmokerStatus_enc               246004 non-null  float64
 12  ECigaretteUsage_enc            2460
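
The same check can also be done programmatically instead of reading the `info()` table by eye. The snippet below is a minimal sketch on a small stand-in DataFrame; the `demo` frame and its values are hypothetical, and only the column names are borrowed from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for heart_attack_enc2 (illustrative values only)
demo = pd.DataFrame({
    "HadHeartAttack": ["No", "Yes", "No"],   # original text target
    "BMI": [27.4, 31.2, 22.9],               # numeric feature
    "Gender_enc": [0.0, 1.0, 0.0],           # encoded feature
    "HadHeartAttack_enc": [0.0, 1.0, 0.0],   # encoded target
})

# List every column whose dtype is not numeric
non_numeric_cols = demo.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)

# Fail loudly if anything other than the raw target is still non-numeric
assert non_numeric_cols == ["HadHeartAttack"]
```

Running the same two lines against `heart_attack_enc2` would confirm the expectation stated above without relying on the printed summary.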

---

## Machine Learning Pre-Processing

Making a fully encoded DataFrame by dropping the original text target HadHeartAttack, so that only its encoded version, HadHeartAttack_enc, remains.

In [76]:
# Drop the original text target; `columns=` already names the axis, so `axis=1` is not needed
heart_attack_enc_m = heart_attack_enc2.drop(columns='HadHeartAttack')

# Checking the data is correct in the fully encoded dataframe
heart_attack_enc_m.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246004 entries, 0 to 246012
Data columns (total 36 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   PhysicalHealthDays             246004 non-null  float64
 1   MentalHealthDays               246004 non-null  float64
 2   SleepHours                     246004 non-null  float64
 3   HeightInMeters                 246004 non-null  float64
 4   WeightInKilograms              246004 non-null  float64
 5   BMI                            246004 non-null  float64
 6   LastCheckupTime_enc            246004 non-null  float64
 7   Gender_enc                     246004 non-null  float64
 8   GeneralHealth_enc              246004 non-null  float64
 9   PhysicalActivities_enc         246004 non-null  float64
 10  SmokerStatus_enc               246004 non-null  float64
 11  ECigaretteUsage_enc            246004 non-null  float64
 12  AgeCategory_enc                2460

Now that the DataFrames have been created, we can move on to modeling.

___

#### Saving the Dataframe

In [77]:
print("\033[1mSaving the Cleaned Dataframe:\033[0m\n")

try:
    heart_attack_enc_m.to_csv('heart_attack_enc_m.csv')
    print("Data saved successfully.")
except OSError as err:
    # Catch file-system errors only, so unrelated bugs are not silently hidden
    print(f"ERROR: The data has NOT been saved ({err}).")

Saving the Cleaned Dataframe:

Data saved successfully.
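
A quick way to confirm that a save like this worked is to read the file back and compare it with the DataFrame in memory. The snippet below is a minimal sketch of that round trip; the `demo` frame and the temporary file name `demo_enc.csv` are hypothetical, chosen so the example does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for heart_attack_enc_m, with a non-contiguous index
demo = pd.DataFrame(
    {"BMI": [27.4, 31.2], "HadHeartAttack_enc": [0.0, 1.0]},
    index=[0, 5],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_enc.csv")
    demo.to_csv(path)                           # the index is written as the first column
    reloaded = pd.read_csv(path, index_col=0)   # restore that column as the index

# True when values, dtypes, and index all survived the round trip
round_trip_ok = reloaded.equals(demo)
print(round_trip_ok)
```

Passing `index_col=0` on reload mirrors how this notebook loads its input file, and it keeps the original row labels rather than creating an extra unnamed column.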


---


## Conclusion

In this notebook, I pre-processed the heart_attack_clean.csv dataset and saved the result as heart_attack_enc_m.csv, a format suitable for machine learning modeling. The primary focus was on encoding categorical variables so that they can be used by a range of machine learning algorithms.

I applied binary encoding to transform binary categorical variables into numerical format. Specifically, I encoded responses of "No" as 0.0 and "Yes" as 1.0. This transformation lets models interpret and use these variables directly, without implying any hierarchy or order between the two categories.
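
As a reminder of what this mapping looks like, here is a minimal sketch using `Series.map` on a few made-up responses. It illustrates the idea only and is not necessarily the exact call used earlier in this notebook; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of a Yes/No column
responses = pd.Series(["No", "Yes", "Yes", "No"], name="HadAngina")

# Explicit mapping: "No" -> 0.0, "Yes" -> 1.0
binary_map = {"No": 0.0, "Yes": 1.0}
encoded = responses.map(binary_map).rename("HadAngina_enc")

# Any label missing from the mapping would become NaN, so check for that
assert encoded.notna().all()

print(encoded.tolist())  # [0.0, 1.0, 1.0, 0.0]
```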

For ordinal variables, where the categories have an inherent order or severity, I used ordinal encoding. This method assigned numerical values to categories based on their rank or importance, with 0 representing the best or least severe option. This encoding preserves the ordinal relationships within the data, which is crucial for accurately capturing the impact of these variables on heart attack outcomes.
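
The sketch below shows how an explicit category order can be passed to `OrdinalEncoder` so that 0 lands on the best option. The category labels and their order are an assumption made for illustration; the actual labels in GeneralHealth should be checked against the cleaned dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering, best to worst (illustrative labels, not verified against the data)
health_order = ["Excellent", "Very good", "Good", "Fair", "Poor"]

# Hypothetical sample rows
demo = pd.DataFrame({"GeneralHealth": ["Good", "Excellent", "Poor", "Very good"]})

# `categories` takes one list per column; position in the list becomes the code
encoder = OrdinalEncoder(categories=[health_order])
demo["GeneralHealth_enc"] = encoder.fit_transform(demo[["GeneralHealth"]]).ravel()

print(demo["GeneralHealth_enc"].tolist())  # [2.0, 0.0, 4.0, 1.0]
```

Without the `categories` argument, `OrdinalEncoder` sorts the labels itself, which for strings means alphabetical order and would not reflect severity.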

Through these encoding techniques, I have transformed the dataset into a fully numerical format that machine learning algorithms can consume directly. The pre-processed data is now ready for the next phase, where I will develop and evaluate models to identify key indicators of heart attacks.

The encoding steps taken in this notebook allow the upcoming models to use the information in both binary and ordinal features, which should support more accurate and interpretable predictions.