# 🛠️ Heart Disease Feature Engineering

## 📌 Objective
This notebook applies **feature engineering** to improve predictive power for heart disease classification.  
The planned steps are:
1. **Encode categorical variables** (convert text labels and category codes to numeric columns)
2. **Transform numerical features** (standard scaling; age binning is not applied in this version)
3. **Create new features** (Risk Score)
4. **Remove redundant features** (drop highly correlated ones; not applied in this version)


## 📂 Load & Preview Dataset

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load dataset
file_path = "Cleaned_Data.csv"  # Update path if needed
df = pd.read_csv(file_path)
# Drop the first column by position (presumably an index column saved with the CSV)
df = df.drop(df.columns[0], axis=1)

# Display dataset info & first few rows
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270 entries, 0 to 269
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      270 non-null    int64  
 1   Sex                      270 non-null    int64  
 2   Chest pain type          270 non-null    int64  
 3   BP                       270 non-null    int64  
 4   Cholesterol              270 non-null    int64  
 5   FBS over 120             270 non-null    int64  
 6   EKG results              270 non-null    int64  
 7   Max HR                   270 non-null    int64  
 8   Exercise angina          270 non-null    int64  
 9   ST depression            270 non-null    float64
 10  Slope of ST              270 non-null    int64  
 11  Number of vessels fluro  270 non-null    int64  
 12  Thallium                 270 non-null    int64  
 13  Heart Disease            270 non-null    object 
dtypes: float64(1), int64(12), object(1)

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,70,1,4,130,322,0,2,109,0,2.4,2,3,3,Presence
1,67,0,3,115,564,0,2,160,0,1.6,2,0,7,Absence
2,57,1,2,124,261,0,0,141,0,0.3,1,0,7,Presence
3,64,1,4,128,263,0,0,105,1,0.2,2,1,7,Absence
4,74,0,2,120,269,0,2,121,1,0.2,1,1,3,Absence
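The load cell removes the first column by position, which silently discards a real feature if the CSV was ever saved without an index column. A minimal sketch of two safer alternatives follows; the inline CSV text and the helper name `drop_index_column` are hypothetical and only stand in for `Cleaned_Data.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV that was saved with its index column.
csv_text = ",Age,Sex,Heart Disease\n0,70,1,Presence\n1,67,0,Absence\n"

# Option 1: tell pandas that the first column is the index.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)


def drop_index_column(frame):
    """Drop the first column only if pandas auto-named it 'Unnamed: ...'."""
    first = str(frame.columns[0])
    if first.startswith("Unnamed"):
        return frame.drop(columns=[frame.columns[0]])
    return frame


# Option 2: read normally, then drop the column only when it looks like a saved index.
guarded = drop_index_column(pd.read_csv(io.StringIO(csv_text)))

print(list(as_index.columns))
print(list(guarded.columns))
```

Both variants leave `Age`, `Sex`, and `Heart Disease` in place, and the guarded version is a no-op when the file has no saved index column.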


## 🔄 Step 1: Encoding Categorical Variables

In [3]:
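# Encode the binary target: "Presence" -> 1, "Absence" -> 0
df["Heart Disease"] = df["Heart Disease"].map({"Presence": 1, "Absence": 0})

# One-hot encode chest pain type as 0/1 columns instead of True/False.
# Caution: the trailing .astype(int) casts every column, so the float column
# "ST depression" is truncated as well (2.4 becomes 2 in the output below).
# Passing dtype=int to pd.get_dummies would limit the cast to the new dummy columns.
df = pd.get_dummies(df, columns=["Chest pain type"], prefix="ChestPain", drop_first=False).astype(int)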

# Display dataset after encoding
df.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease,ChestPain_1,ChestPain_2,ChestPain_3,ChestPain_4
0,70,1,130,322,0,2,109,0,2,2,3,3,1,0,0,0,1
1,67,0,115,564,0,2,160,0,1,2,0,7,0,0,0,1,0
2,57,1,124,261,0,0,141,0,0,1,0,7,1,0,1,0,0
3,64,1,128,263,0,0,105,1,0,2,1,7,0,0,0,0,1
4,74,0,120,269,0,2,121,1,0,1,1,3,0,0,1,0,0
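In the output above, `ST depression` shows 2, 1, 0, 0, 0 where the loaded data had 2.4, 1.6, 0.3, 0.2, 0.2, because `.astype(int)` is applied to the whole frame. A minimal sketch of an encoding that keeps float columns intact, using the `dtype` argument of `pd.get_dummies`; the small `sample` frame is built from the first five rows shown earlier and is only illustrative.

```python
import pandas as pd

# Three columns from the first five rows of the loaded dataset.
sample = pd.DataFrame({
    "Chest pain type": [4, 3, 2, 4, 2],
    "ST depression": [2.4, 1.6, 0.3, 0.2, 0.2],
    "Heart Disease": ["Presence", "Absence", "Presence", "Absence", "Absence"],
})

# Map the target to 1/0; any label outside the mapping would become NaN.
sample["Heart Disease"] = sample["Heart Disease"].map({"Presence": 1, "Absence": 0})
assert sample["Heart Disease"].notna().all(), "unexpected target label"

# dtype=int applies only to the generated dummy columns, not to the rest of the frame.
encoded = pd.get_dummies(
    sample, columns=["Chest pain type"], prefix="ChestPain", dtype=int
)

print(encoded.dtypes)
```

With this variant the later scaling step would operate on the original `ST depression` values rather than their truncated integers.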


## 📏 Step 2: Feature Transformation (Scaling & Binning)

In [4]:
# Scaling continuous numerical features
scaler = StandardScaler()
num_features = ["Age", "BP", "Cholesterol", "Max HR", "ST depression"]
df[num_features] = scaler.fit_transform(df[num_features])

# Display dataset after scaling
df.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease,ChestPain_1,ChestPain_2,ChestPain_3,ChestPain_4
0,1.712094,1,-0.07541,1.402212,0,2,-1.759208,0,1.173372,2,3,3,1,0,0,0,1
1,1.38214,0,-0.916759,6.093004,0,2,0.446409,0,0.221989,2,0,7,0,0,0,1,0
2,0.282294,1,-0.41195,0.219823,0,0,-0.375291,0,-0.729393,1,0,7,1,0,1,0,0
3,1.052186,1,-0.18759,0.258589,0,0,-1.932198,1,-0.729393,2,1,7,0,0,0,0,1
4,2.152032,0,-0.63631,0.37489,0,2,-1.240239,1,-0.729393,1,1,3,0,0,1,0,0
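The cell above scales the numeric columns but does not bin age. One way the binning could be sketched is shown below, under the assumption that it runs on raw ages before scaling (after scaling, `Age` holds z-scores, so year-based bin edges no longer apply). The bin edges, the labels, and the column names `Age group` and `Age scaled` are hypothetical. The scaling is written out by hand as a z-score with the population standard deviation (`ddof=0`), which is the formula `StandardScaler` applies.

```python
import pandas as pd

# Raw ages from the first five rows of the dataset.
ages = pd.DataFrame({"Age": [70, 67, 57, 64, 74]})

# Hypothetical bin edges and labels; right=False makes each bin [low, high).
age_bins = [0, 45, 60, 120]
age_labels = ["under_45", "45_to_59", "60_plus"]
ages["Age group"] = pd.cut(
    ages["Age"], bins=age_bins, labels=age_labels, right=False
)

# Manual z-score: subtract the mean, divide by the population standard deviation.
mean_age = ages["Age"].mean()
std_age = ages["Age"].std(ddof=0)
ages["Age scaled"] = (ages["Age"] - mean_age) / std_age

print(ages)
```

Keeping the binned column alongside the scaled one lets a model use either the ordinal grouping or the continuous value.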


## 🚨 Step 3: Creating New Features (Risk Score)

In [5]:
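# Create a heuristic risk score as a weighted sum of BP, cholesterol, and exercise angina.
# BP and Cholesterol are already standardized at this point (z-scores), while
# Exercise angina is still 0/1; the weights 0.3 / 0.4 / 0.3 are fixed constants.
df["Risk Score"] = df["BP"] * 0.3 + df["Cholesterol"] * 0.4 + df["Exercise angina"] * 0.3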

# Display dataset after new feature creation
df.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease,ChestPain_1,ChestPain_2,ChestPain_3,ChestPain_4,Risk Score
0,1.712094,1,-0.07541,1.402212,0,2,-1.759208,0,1.173372,2,3,3,1,0,0,0,1,0.538262
1,1.38214,0,-0.916759,6.093004,0,2,0.446409,0,0.221989,2,0,7,0,0,0,1,0,2.162174
2,0.282294,1,-0.41195,0.219823,0,0,-0.375291,0,-0.729393,1,0,7,1,0,1,0,0,-0.035656
3,1.052186,1,-0.18759,0.258589,0,0,-1.932198,1,-0.729393,2,1,7,0,0,0,0,1,0.347159
4,2.152032,0,-0.63631,0.37489,0,2,-1.240239,1,-0.729393,1,1,3,0,0,1,0,0,0.259063
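As a quick arithmetic check, the weighted sum can be recomputed by hand from the scaled values printed above. The helper name `risk_score` is hypothetical; the inputs and expected values are copied from the five displayed rows.

```python
def risk_score(bp_z, chol_z, exercise_angina):
    """Weighted sum used in the notebook: 0.3 * BP + 0.4 * Cholesterol + 0.3 * angina."""
    return 0.3 * bp_z + 0.4 * chol_z + 0.3 * exercise_angina


# (scaled BP, scaled Cholesterol, Exercise angina, Risk Score shown in the output)
rows = [
    (-0.07541, 1.402212, 0, 0.538262),
    (-0.916759, 6.093004, 0, 2.162174),
    (-0.41195, 0.219823, 0, -0.035656),
    (-0.18759, 0.258589, 1, 0.347159),
    (-0.63631, 0.37489, 1, 0.259063),
]

for bp_z, chol_z, angina, shown in rows:
    computed = risk_score(bp_z, chol_z, angina)
    print(f"computed={computed:.6f}  shown={shown:.6f}")
```

Because two inputs are z-scores and one is a 0/1 flag, the score can be negative (row 2) and is dominated by extreme cholesterol values (row 1).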


In [6]:
# Save engineered dataset
df.to_csv('Engineered_Data.csv', index=False)
print("Feature Engineering Complete.")

Feature Engineering Complete.
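The objective lists removing highly correlated features, which the cells above do not perform. A minimal sketch of one common approach follows: compute absolute pairwise correlations, look only at the upper triangle, and drop any column whose correlation with an earlier column exceeds a threshold. The function name `drop_highly_correlated`, the 0.9 threshold, and the toy frame are all assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(frame, threshold=0.9, exclude=()):
    """Return (reduced_frame, dropped_columns) using absolute Pearson correlation."""
    features = frame.drop(columns=list(exclude))
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


# Toy data: "b" is almost exactly 2 * "a"; "c" is only weakly related to either.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12.5],
    "c": [5, 1, 4, 2, 6, 3],
    "target": [0, 1, 0, 1, 0, 1],
})

reduced, dropped = drop_highly_correlated(toy, threshold=0.9, exclude=["target"])
print(dropped)
print(list(reduced.columns))
```

Applied to the engineered frame, the target column `Heart Disease` would be passed in `exclude`. Since `Risk Score` is a weighted sum of existing columns, it is a natural candidate to check, though whether it crosses any given threshold depends on the data.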
