<a href="https://colab.research.google.com/github/Madhisha/CVD-Model/blob/main/preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

df = pd.read_csv('heart_2020_cleaned[1].csv')
print(df.head())

  HeartDisease    BMI Smoking AlcoholDrinking Stroke  PhysicalHealth  \
0           No  16.60     Yes              No     No             3.0   
1           No  20.34      No              No    Yes             0.0   
2           No  26.58     Yes              No     No            20.0   
3           No  24.21      No              No     No             0.0   
4           No  23.71      No              No     No            28.0   

   MentalHealth DiffWalking     Sex  AgeCategory   Race Diabetic  \
0          30.0          No  Female        55-59  White      Yes   
1           0.0          No  Female  80 or older  White       No   
2          30.0          No    Male        65-69  White      Yes   
3           0.0          No  Female        75-79  White       No   
4           0.0         Yes  Female        40-44  White       No   

  PhysicalActivity  GenHealth  SleepTime Asthma KidneyDisease SkinCancer  
0              Yes  Very good        5.0    Yes            No        Yes  
1       

Convert binary Yes/No columns to 1/0

In [None]:
binary_columns = ['HeartDisease', 'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking',
                  'PhysicalActivity', 'Asthma', 'KidneyDisease', 'SkinCancer']

df[binary_columns] = df[binary_columns].apply(lambda x: x.map({'Yes': 1, 'No': 0}))

print(df.head())

   HeartDisease    BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  \
0             0  16.60        1                0       0             3.0   
1             0  20.34        0                0       1             0.0   
2             0  26.58        1                0       0            20.0   
3             0  24.21        0                0       0             0.0   
4             0  23.71        0                0       0            28.0   

   MentalHealth  DiffWalking     Sex  AgeCategory   Race Diabetic  \
0          30.0            0  Female        55-59  White      Yes   
1           0.0            0  Female  80 or older  White       No   
2          30.0            0    Male        65-69  White      Yes   
3           0.0            0  Female        75-79  White       No   
4           0.0            1  Female        40-44  White       No   

   PhysicalActivity  GenHealth  SleepTime  Asthma  KidneyDisease  SkinCancer  
0                 1  Very good        5.0       1
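
`Series.map` returns NaN for any value that is not a key in the mapping, so an inconsistent spelling would silently become a missing value. A minimal sketch of a check for that, using a small made-up frame (the `toy` data and the lowercase `'yes'` are illustrative assumptions, not values from the dataset):

```python
import pandas as pd

# Toy data for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    'Smoking': ['Yes', 'No', 'yes'],  # 'yes' is a deliberate inconsistency
    'Stroke': ['No', 'No', 'Yes'],
})
toy_binary_columns = ['Smoking', 'Stroke']

# Same mapping as in the cell above.
mapped = toy[toy_binary_columns].apply(lambda x: x.map({'Yes': 1, 'No': 0}))

# Count the values that matched neither key and therefore became NaN.
unmapped_counts = mapped.isna().sum()
print(unmapped_counts)
```

A non-zero count for any column would suggest cleaning the raw labels before mapping.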

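Label-encode the remaining categorical columns. LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels, so the codes do not necessarily follow an ordinal ranking.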

In [None]:
from sklearn.preprocessing import LabelEncoder

label_columns = ['Sex', 'Diabetic', 'GenHealth']

le = LabelEncoder()
for col in label_columns:
    df[col] = le.fit_transform(df[col])

print(df.head(12))  # Sex: Male -> 1, Female -> 0

    HeartDisease    BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  \
0              0  16.60        1                0       0             3.0   
1              0  20.34        0                0       1             0.0   
2              0  26.58        1                0       0            20.0   
3              0  24.21        0                0       0             0.0   
4              0  23.71        0                0       0            28.0   
5              1  28.87        1                0       0             6.0   
6              0  21.63        0                0       0            15.0   
7              0  31.64        1                0       0             5.0   
8              0  26.45        0                0       0             0.0   
9              0  40.69        0                0       0             0.0   
10             1  34.30        1                0       0            30.0   
11             0  28.71        1                0       0             0.0   
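
If a ranked encoding of GenHealth is wanted, an explicit mapping is one option. The sketch below compares it with LabelEncoder on toy data. The five category labels and their worst-to-best order are assumptions here (only 'Very good' appears in the output above), and `gen_health_order` is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed category labels, listed from worst to best health.
gen_health_order = ['Poor', 'Fair', 'Good', 'Very good', 'Excellent']

toy = pd.DataFrame({'GenHealth': ['Very good', 'Poor', 'Excellent', 'Fair', 'Good']})

# LabelEncoder sorts the labels before assigning codes.
alphabetical_codes = LabelEncoder().fit_transform(toy['GenHealth'])

# An explicit mapping keeps the intended ranking.
ordinal_map = {label: rank for rank, label in enumerate(gen_health_order)}
ordinal_codes = toy['GenHealth'].map(ordinal_map)

print(list(alphabetical_codes))
print(list(ordinal_codes))
```

With the sorted encoding, 'Poor' receives a higher code than 'Good', which a model may read as a ranking that does not exist.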

In [None]:
print(df['Race'].value_counts())
print(df['AgeCategory'].value_counts())

Race
White                             245212
Hispanic                           27446
Black                              22939
Other                              10928
Asian                               8068
American Indian/Alaskan Native      5202
Name: count, dtype: int64


One-hot encode the non-binary categorical columns (AgeCategory and Race), which are treated here as non-ordinal

In [None]:
df = pd.get_dummies(df, columns=['AgeCategory', 'Race'])
print(df.head(10))

   HeartDisease    BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  \
0             0  16.60        1                0       0             3.0   
1             0  20.34        0                0       1             0.0   
2             0  26.58        1                0       0            20.0   
3             0  24.21        0                0       0             0.0   
4             0  23.71        0                0       0            28.0   
5             1  28.87        1                0       0             6.0   
6             0  21.63        0                0       0            15.0   
7             0  31.64        1                0       0             5.0   
8             0  26.45        0                0       0             0.0   
9             0  40.69        0                0       0             0.0   

   MentalHealth  DiffWalking  Sex  Diabetic  ...  AgeCategory_65-69  \
0          30.0            0    0         2  ...              False   
1           0.0      

In [None]:
# get_dummies produced True/False columns; convert every bool column to 1/0
bool_columns = df.select_dtypes(include='bool').columns
df[bool_columns] = df[bool_columns].astype(int)
print(df.head())

   HeartDisease    BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  \
0             0  16.60        1                0       0             3.0   
1             0  20.34        0                0       1             0.0   
2             0  26.58        1                0       0            20.0   
3             0  24.21        0                0       0             0.0   
4             0  23.71        0                0       0            28.0   

   MentalHealth  DiffWalking  Sex  Diabetic  ...  AgeCategory_65-69  \
0          30.0            0    0         2  ...                  0   
1           0.0            0    0         0  ...                  0   
2          30.0            0    1         2  ...                  1   
3           0.0            0    0         0  ...                  0   
4           0.0            1    0         0  ...                  0   

   AgeCategory_70-74  AgeCategory_75-79  AgeCategory_80 or older  \
0                  0                  0         
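
The one-hot encoding and the conversion to 1/0 can also be done in a single call by passing `dtype=int` to `pd.get_dummies`. A small sketch on made-up rows (the `toy` frame is illustrative, not the real data):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'BMI': [16.60, 20.34, 26.58],
    'Race': ['White', 'Asian', 'White'],
})

# dtype=int makes get_dummies emit integer 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, columns=['Race'], dtype=int)
print(encoded)
```

This avoids the separate boolean-to-integer step, at the cost of being less explicit about what was converted.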

In [None]:
print(df.columns)

Index(['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke',
       'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'Diabetic',
       'PhysicalActivity', 'GenHealth', 'SleepTime', 'Asthma', 'KidneyDisease',
       'SkinCancer', 'AgeCategory_18-24', 'AgeCategory_25-29',
       'AgeCategory_30-34', 'AgeCategory_35-39', 'AgeCategory_40-44',
       'AgeCategory_45-49', 'AgeCategory_50-54', 'AgeCategory_55-59',
       'AgeCategory_60-64', 'AgeCategory_65-69', 'AgeCategory_70-74',
       'AgeCategory_75-79', 'AgeCategory_80 or older',
       'Race_American Indian/Alaskan Native', 'Race_Asian', 'Race_Black',
       'Race_Hispanic', 'Race_Other', 'Race_White'],
      dtype='object')


Handle missing values

In [None]:
# Fill missing values in each numerical column with that column's median
numerical_columns = ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
df[numerical_columns] = df[numerical_columns].apply(lambda x: x.fillna(x.median()))

print(df.head())

   HeartDisease    BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  \
0             0  16.60        1                0       0             3.0   
1             0  20.34        0                0       1             0.0   
2             0  26.58        1                0       0            20.0   
3             0  24.21        0                0       0             0.0   
4             0  23.71        0                0       0            28.0   

   MentalHealth  DiffWalking  Sex  Diabetic  ...  AgeCategory_65-69  \
0          30.0            0    0         2  ...                  0   
1           0.0            0    0         0  ...                  0   
2          30.0            0    1         2  ...                  1   
3           0.0            0    0         0  ...                  0   
4           0.0            1    0         0  ...                  0   

   AgeCategory_70-74  AgeCategory_75-79  AgeCategory_80 or older  \
0                  0                  0         
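
It can help to count missing values before and after imputing, to see whether the step changed anything. A minimal sketch on a toy frame with one injected NaN (the values and the NaN are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with one missing BMI value, for illustration only.
toy = pd.DataFrame({
    'BMI': [16.60, np.nan, 26.58, 24.21],
    'SleepTime': [5.0, 7.0, 8.0, 6.0],
})

missing_before = toy.isna().sum()

# Same median imputation as in the cell above.
filled = toy.apply(lambda x: x.fillna(x.median()))

missing_after = filled.isna().sum()
print(missing_before)
print(missing_after)
```

If `missing_before` is zero for every column, the imputation is a no-op and the data is unchanged.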

Scaling numerical data

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# List of continuous features to scale
continuous_features = ['BMI', 'PhysicalHealth', 'MentalHealth']

# Scale the selected continuous features
df[continuous_features] = scaler.fit_transform(df[continuous_features])

# Print the scaled DataFrame to verify
print(df.head())

   HeartDisease       BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  \
0             0 -1.844750        1                0       0       -0.046751   
1             0 -1.256338        0                0       1       -0.424070   
2             0 -0.274603        1                0       0        2.091388   
3             0 -0.647473        0                0       0       -0.424070   
4             0 -0.726138        0                0       0        3.097572   

   MentalHealth  DiffWalking  Sex  Diabetic  ...  AgeCategory_65-69  \
0      3.281069            0    0         2  ...                  0   
1     -0.490039            0    0         0  ...                  0   
2      3.281069            0    1         2  ...                  1   
3     -0.490039            0    0         0  ...                  0   
4     -0.490039            1    0         0  ...                  0   

   AgeCategory_70-74  AgeCategory_75-79  AgeCategory_80 or older  \
0                  0          
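
The cell above fits the scaler on the full DataFrame. If the data is later split into training and test sets (an assumption; no split is shown in this notebook), a common alternative is to fit the scaler on the training rows only and reuse it on the test rows. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'BMI': rng.normal(28, 6, size=100),
    'HeartDisease': rng.integers(0, 2, size=100),
})

train, test = train_test_split(toy, test_size=0.2, random_state=0)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only.
train[['BMI']] = scaler.fit_transform(train[['BMI']])
# Apply the same statistics to the test rows without refitting.
test[['BMI']] = scaler.transform(test[['BMI']])

print(train['BMI'].mean(), train['BMI'].std(ddof=0))
```

Fitting on the training rows keeps test-set statistics out of the preprocessing step.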

In [None]:
from google.colab import files

# Save the preprocessed DataFrame to a CSV file
df.to_csv('preprocessed_data.csv', index=False)

# Download the CSV file
files.download('preprocessed_data.csv')
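
The `google.colab` module is only importable inside Colab. A sketch of a save step that also runs elsewhere, assuming a fallback to a plain local file is acceptable (the toy frame and the filename `preprocessed_data_example.csv` are hypothetical):

```python
import pandas as pd

# Toy data and filename for illustration only.
toy = pd.DataFrame({'HeartDisease': [0, 1], 'BMI': [16.60, 28.87]})
output_path = 'preprocessed_data_example.csv'
toy.to_csv(output_path, index=False)

try:
    from google.colab import files  # available only inside Colab
except ImportError:
    files = None

if files is not None:
    # In Colab, trigger a browser download of the saved file.
    files.download(output_path)
else:
    # Outside Colab the file simply stays in the working directory.
    print(f'Saved to {output_path}')
```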



In [None]:
print(df['HeartDisease'].value_counts())

HeartDisease
0    292422
1     27373
Name: count, dtype: int64
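
The counts above show that the positive class is a small minority. One way to quantify this, and to derive weights that some scikit-learn estimators accept through a `class_weight` parameter, is sketched below. It rebuilds a label vector from the printed counts rather than reading the real data, and whether weighting is appropriate depends on the model chosen later.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the value_counts() output above.
counts = {0: 292422, 1: 27373}
total = sum(counts.values())
positive_share = counts[1] / total

# Rebuild a label vector with the same class counts.
y = np.repeat([0, 1], [counts[0], counts[1]])

# 'balanced' weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)

print(f'Positive share: {positive_share:.4f}')
print(dict(zip([0, 1], weights)))
```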
