In [102]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import pandas as pd


In [103]:
df = pd.read_csv('heart.attack.csv')

Discretization 
We discretized two attributes, Age and Cholesterol. Grouping continuous values into ranges lets us focus on broader trends and patterns within each group rather than across a continuous scale, which can reveal differences in the behavior and needs of those groups. It also gives our machine learning algorithms fewer distinct values to work with, which makes the data easier to interpret and analyze and helps us make more informed decisions about each group.

In [104]:
bins = [0, 17, 34, 65, 100]
age_labels = ['Children','Young Adults', 'Older Adults' ,'Seniors']


df[' Age'] = pd.cut(df[' Age'], bins=bins, labels=age_labels)

print(df[' Age'])

0            Seniors
1       Young Adults
2       Young Adults
3            Seniors
4            Seniors
            ...     
3995         Seniors
3996    Older Adults
3997         Seniors
3998         Seniors
3999    Older Adults
Name:  Age, Length: 4000, dtype: category
Categories (4, object): ['Children' < 'Young Adults' < 'Older Adults' < 'Seniors']


To discretize, we first defined bin intervals suited to the values in our dataset, then defined a label for each interval. We converted each patient's age into an age group as follows:

0-17 years: Children

18-34 years: Young Adults

35-65 years: Older Adults

66-100 years: Seniors
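The boundary behavior of these bins can be checked with a small sketch. The ages below are hypothetical values chosen to sit on or next to the bin edges; they are not taken from the dataset. With the default settings of pd.cut each interval is closed on the right, so an age equal to an edge falls into the lower group.

```python
import pandas as pd

# Hypothetical ages placed on or next to the bin edges (not from the dataset).
ages = pd.Series([17, 18, 34, 35, 65, 66, 100])

bins = [0, 17, 34, 65, 100]
age_labels = ['Children', 'Young Adults', 'Older Adults', 'Seniors']

# Default right-closed intervals: (0, 17], (17, 34], (34, 65], (65, 100]
groups = pd.cut(ages, bins=bins, labels=age_labels)

print(pd.DataFrame({'age': ages, 'group': groups}))
```

In this sketch, 17 is labeled Children, 34 is labeled Young Adults, and 65 is labeled Older Adults.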

In [105]:
bins = [0,200,239,400]
Cholesterol_labels = ['Normal', 'Borderline High', 'High']


df[' Cholesterol'] = pd.cut(df[' Cholesterol'], bins=bins, labels=Cholesterol_labels)

print(df[' Cholesterol'])

0       Borderline High
1                  High
2                  High
3                  High
4                  High
             ...       
3995    Borderline High
3996             Normal
3997             Normal
3998               High
3999             Normal
Name:  Cholesterol, Length: 4000, dtype: category
Categories (3, object): ['Normal' < 'Borderline High' < 'High']


Lastly, following the same steps, we converted each patient's cholesterol level into a cholesterol group as follows:

0-200 mg/dl : Normal

201-239 mg/dl : Borderline High

240-400 mg/dl : High
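One thing worth checking after binning is whether any values fall outside the bin edges, since pd.cut returns NaN for them. The following is a minimal sketch with hypothetical readings (not from the dataset). It also shows the include_lowest parameter, which closes the first interval on the left so that a value equal to the lowest edge is kept.

```python
import pandas as pd

# Hypothetical cholesterol readings, including values outside the bin range.
chol = pd.Series([0, 150, 200, 201, 239, 240, 400, 450])

bins = [0, 200, 239, 400]
labels = ['Normal', 'Borderline High', 'High']

binned = pd.cut(chol, bins=bins, labels=labels)

# Values not covered by any interval (here 0 and 450) come back as NaN.
print(binned.isna().sum())

# include_lowest=True makes the first interval [0, 200], so 0 is kept.
binned_incl = pd.cut(chol, bins=bins, labels=labels, include_lowest=True)
print(binned_incl.isna().sum())
```

If the count of missing values is not zero on the real column, the bin edges may need to be widened before encoding.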

In [121]:
# Extract the columns we need to normalize
columns_to_normalize = ['Systolic BP', 'Diastolic BP', ' Heart Rate']
data_to_normalize = df[columns_to_normalize]

# Rescale each selected column to the range [0, 1]
minmax_scaler = MinMaxScaler()
normalized_data_minmax = minmax_scaler.fit_transform(data_to_normalize)

# Replace the original values in the DataFrame with the normalized ones
df[columns_to_normalize] = normalized_data_minmax

print("Min-Max Scaled data (only 5th , 6th and 7th columns)")
print(df)

Min-Max Scaled data (only 5th , 6th and 7th columns)
     Patient ID           Age     Sex      Cholesterol  Systolic BP  \
0       BMW7812       Seniors    Male  Borderline High     0.755556   
1       CZE1114  Young Adults    Male             High     0.833333   
2       BNI9906  Young Adults  Female             High     0.933333   
3       JLN3497       Seniors    Male             High     0.811111   
4       GFO8847       Seniors    Male             High     0.011111   
...         ...           ...     ...              ...          ...   
3995    UII9280       Seniors    Male  Borderline High     0.911111   
3996    SZU8764  Older Adults  Female           Normal     0.211111   
3997    CQJ6551       Seniors    Male           Normal     0.700000   
3998    DZQ4343       Seniors    Male             High     0.211111   
3999    WER4678  Older Adults  Female           Normal     0.111111   

      Diastolic BP   Heart Rate  Diabetes  Family History  Smoking       Diet  \
0            

Normalization

We normalized selected attributes so that their values fall within a common, smaller range and carry equal weight. In this case, we rescaled systolic and diastolic blood pressure to [0, 1] using Min-Max normalization, which makes the values easier to compare and keeps a feature with a larger numeric range from dominating the analysis and models. We normalized the heart rate feature as well, keeping its values continuous: it makes no sense to group a heart rate of 0 with a heart rate of 30, since 0 indicates death and 30 does not.
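Min-Max normalization maps each value x to (x - min) / (max - min), which is what MinMaxScaler computes per column with its default feature_range of (0, 1). The sketch below applies the formula by hand to a few hypothetical systolic readings (not taken from the dataset) to make the arithmetic visible.

```python
import pandas as pd

# Hypothetical systolic readings in mmHg (not from the dataset).
systolic = pd.Series([90, 120, 135, 180], name='Systolic BP')

# Min-max scaling: x' = (x - min) / (max - min)
scaled = (systolic - systolic.min()) / (systolic.max() - systolic.min())

print(scaled.tolist())
```

The smallest reading maps to 0, the largest to 1, and 135 (halfway between 90 and 180) maps to 0.5.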

In [124]:
# LabelEncoder assigns integer codes to the sorted unique values of a column.
# The same encoder object is refit for each column in turn.
le = LabelEncoder()
df['Patient ID'] = le.fit_transform(df['Patient ID'])
df[' Age'] = le.fit_transform(df[' Age'])
df[' Cholesterol'] = le.fit_transform(df[' Cholesterol'])
df['Sex'] = le.fit_transform(df['Sex'])
df['Diet'] = le.fit_transform(df['Diet'])
df['Continent'] = le.fit_transform(df['Continent'])

print(df)

      Patient ID   Age  Sex   Cholesterol  Systolic BP  Diastolic BP  \
0            241     1    1             0     0.755556          0.56   
1            455     2    1             1     0.833333          0.66   
2            247     2    0             1     0.933333          0.78   
3           1468     1    1             1     0.811111          0.80   
4            972     1    1             1     0.011111          0.56   
...          ...   ...  ...           ...          ...           ...   
3995        3099     1    1             0     0.911111          0.62   
3996        2881     0    0             2     0.211111          0.06   
3997         400     1    1             2     0.700000          1.00   
3998         623     1    1             1     0.211111          0.86   
3999        3387     0    0             2     0.111111          0.74   

       Heart Rate  Diabetes  Family History  Smoking  Diet  Continent  \
0        0.457143         0               0        1     0    

We encoded all categorical features as numbers to prepare the data for a machine learning model, since most models require numeric input.

We encoded the following features: Patient ID, Age, Sex, Cholesterol, Diet, and Continent.
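LabelEncoder numbers classes in sorted (alphabetical) order, which is why the printed output above shows Older Adults as 0, Seniors as 1, and Young Adults as 2. If the natural order of the groups matters for a model, one alternative is to take the codes from the ordered categorical that pd.cut produces. The following is a sketch under that assumption, using hypothetical values rather than the dataset; it is not the encoding used in this notebook.

```python
import pandas as pd

age_labels = ['Children', 'Young Adults', 'Older Adults', 'Seniors']

# Hypothetical binned ages, stored as an ordered categorical like pd.cut output.
age_groups = pd.Series(
    pd.Categorical(['Seniors', 'Young Adults', 'Older Adults', 'Children'],
                   categories=age_labels, ordered=True)
)

# .cat.codes numbers the categories in their declared order:
# Children=0, Young Adults=1, Older Adults=2, Seniors=3
ordinal_codes = age_groups.cat.codes

# Alphabetical numbering, which is how LabelEncoder orders its classes:
# Children=0, Older Adults=1, Seniors=2, Young Adults=3
alphabetical = {label: i for i, label in enumerate(sorted(age_labels))}
alphabetical_codes = age_groups.astype(str).map(alphabetical)

print(pd.DataFrame({'group': age_groups,
                    'ordinal': ordinal_codes,
                    'alphabetical': alphabetical_codes}))
```

With the ordinal codes, a larger number always means an older group; with the alphabetical codes it does not.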