# Feature Engineering Class

## Dataset: Heart Disease UCI

**Link**: [https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)

---

## Tasks

### 1. Data Loading and Inspection

* Load the dataset using pandas.
* Display the first 5 rows.
* Check for missing values.
* Inspect data types and basic statistics.
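
The inspection steps above can be sketched as follows. This is a minimal illustration that builds a small frame from a few of the rows shown in the notebook output further down, so it runs without the CSV file; with the real data, `pd.read_csv` would be used instead. The variable names `missing` and `stats` are only illustrative.

```python
import pandas as pd

# A few rows copied from the notebook output; with the real file,
# this would be df = pd.read_csv("heart.csv") (path is an assumption).
df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "Sex": ["M", "F", "M", "F", "M"],
    "Cholesterol": [289, 180, 283, 214, 195],
    "HeartDisease": [0, 1, 0, 1, 0],
})

print(df.head())              # first 5 rows

missing = df.isnull().sum()   # number of missing values per column
print(missing)

print(df.dtypes)              # data type of each column

stats = df.describe()         # summary statistics for numeric columns
print(stats)
```

`describe()` only summarizes numeric columns by default, so `Sex` does not appear in `stats`.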

---

### 2. Creating New Features

* Create an **age group** feature by segmenting age into bins:

  * Young (0-35)
  * Middle-aged (36-50)
  * Senior (51-65)
  * Elderly (66+)

* Create a new feature that represents the **ratio of cholesterol to age** (`chol_per_age`).

* Create an **interaction feature** by multiplying `thal` and `slope`.
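
One possible way to build these three features is sketched below, using `pd.cut` for the bins. The toy values are made up for illustration, and the column names (`age`, `chol`, `thal`, `slope`) follow the task wording; they may need adjusting to the columns of the file actually loaded. The name `thal_x_slope` is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names follow the task description.
df = pd.DataFrame({
    "age": [29, 36, 50, 51, 65, 70],
    "chol": [204, 250, 233, 286, 229, 322],
    "thal": [2, 2, 3, 1, 3, 2],
    "slope": [2, 1, 0, 1, 2, 1],
})

# Right-inclusive bins: (0, 35], (35, 50], (50, 65], (65, inf)
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 35, 50, 65, np.inf],
    labels=["Young", "Middle-aged", "Senior", "Elderly"],
)

# Ratio of cholesterol to age
df["chol_per_age"] = df["chol"] / df["age"]

# Interaction feature: element-wise product of two numeric columns
df["thal_x_slope"] = df["thal"] * df["slope"]

print(df)
```

Because the bins are right-inclusive, ages 35, 50, and 65 fall into the lower group, which matches the ranges listed in the task.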

---

### 3. Encoding Categorical Features

* Identify all categorical features.
* Apply **One-Hot Encoding** to the categorical features.
* Concatenate the encoded features back into the dataset.
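
As a rough sketch of these steps, the example below treats every non-numeric column as categorical, encodes those columns with `pd.get_dummies`, and concatenates the result with the remaining columns. The three sample rows and the variable names are illustrative assumptions; integer-coded categories (such as a 0/1 flag) would have to be added to the list by hand.

```python
import pandas as pd

# Small illustrative frame with two categorical columns.
df = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA"],
})

# Assumption: every non-numeric column is categorical.
categorical = df.select_dtypes(exclude="number").columns.tolist()

# One-hot encode only the categorical columns, as 0/1 integers.
encoded = pd.get_dummies(df[categorical], columns=categorical, dtype=int)

# Drop the original categorical columns and attach the encoded ones.
df_encoded = pd.concat([df.drop(columns=categorical), encoded], axis=1)

print(df_encoded)
```

Passing `columns=...` directly to `pd.get_dummies(df, ...)` performs the encoding and the concatenation in one call; the explicit `pd.concat` is shown only to mirror the task steps.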

---

### 4. Scaling Numerical Features

* Identify all numerical features.
* Apply **StandardScaler** to standardize these features (zero mean, unit variance).
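
A minimal sketch of the scaling step, assuming the numerical columns are listed by hand and the target column is left untouched. The sample values come from the notebook output; `scaled`, `means`, and `stds` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "MaxHR": [172, 156, 98, 108, 122],
    "HeartDisease": [0, 1, 0, 1, 0],   # target, not scaled
})

numerical = ["Age", "MaxHR"]

scaler = StandardScaler()
scaled = df.copy()
scaled[numerical] = scaler.fit_transform(df[numerical])

# StandardScaler uses the population standard deviation (ddof=0),
# so the check below uses the same convention.
means = scaled[numerical].mean()
stds = scaled[numerical].std(ddof=0)

print(scaled)
print(means)
print(stds)
```

After scaling, each listed column has a mean of approximately 0 and a standard deviation of approximately 1.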

---

### 5. Final Verification

* Show the transformed dataset.
* Check that all features are now numeric.
* Confirm the dataset shape and readiness for modeling.
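
The verification can be sketched as below. One point worth noting: `select_dtypes(include="number")` does not count boolean columns, so dummy columns stored as `True`/`False` need to be handled explicitly. The tiny frame and the names `non_numeric` and `all_numeric` are illustrative assumptions.

```python
import pandas as pd

# Illustrative frame: a scaled feature, the target, and a boolean dummy column.
df_encoded = pd.DataFrame({
    "Age": [-1.43, -0.48],
    "HeartDisease": [0, 1],
    "Sex_M": [True, False],
})

print(df_encoded.head())    # transformed dataset
print(df_encoded.dtypes)    # dtype of every column

# Columns that are neither numeric nor boolean still need encoding.
non_numeric = df_encoded.select_dtypes(exclude=["number", "bool"]).columns.tolist()
all_numeric = len(non_numeric) == 0
print("All features numeric or boolean:", all_numeric)

print(df_encoded.shape)     # (rows, columns)
```

If `non_numeric` is not empty, it lists exactly the columns that still have to be encoded before modeling.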

---

## Objective

By completing these tasks, you will learn how to perform essential **feature engineering** steps:

* Creating new informative features
* Encoding categorical variables
* Scaling for machine learning models



In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [None]:
# 1. Loading and exploration
df = pd.read_csv("C:\\Users\\pevi\\OneDrive - GFT Technologies SE\\Documents\\heart.csv")

print(df.head())

# Check each column for missing values
print(df.isnull().sum())

   Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  MaxHR  \
0   40   M           ATA        140          289          0     Normal    172   
1   49   F           NAP        160          180          0     Normal    156   
2   37   M           ATA        130          283          0         ST     98   
3   48   F           ASY        138          214          0     Normal    108   
4   54   M           NAP        150          195          0     Normal    122   

  ExerciseAngina  Oldpeak ST_Slope  HeartDisease  
0              N      0.0       Up             0  
1              N      1.0     Flat             1  
2              N      0.0       Up             0  
3              Y      1.5     Flat             1  
4              N      0.0       Up             0  
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slop

In [None]:
# 2. Creating new features

# age_group maps an age to a category using the ranges in the if/elif chain below
def age_group(Age):
    if Age <= 35:
        return 'Young'
    elif Age <= 50:
        return 'Middle-aged'
    elif Age <= 65:
        return 'Senior'
    else:
        return 'Elderly'

# New column age_group: applies the function above to Age, producing a categorical variable
df['age_group'] = df['Age'].apply(age_group)
print(df[['Age', 'age_group']].head())

# New feature: ratio of cholesterol to age
df['chol_age'] = df['Cholesterol'] / df['Age']
print(df['chol_age'].head())

In [21]:
# 3. Encoding the categorical features with one-hot encoding
features_categoricas = ['ChestPainType', 'RestingECG', 'ST_Slope',  'Sex', 'FastingBS', 'ExerciseAngina', 'age_group']
df_encoded = pd.get_dummies(df, columns=features_categoricas, drop_first=True)
df_encoded.head()


Unnamed: 0,Age,RestingBP,Cholesterol,MaxHR,Oldpeak,HeartDisease,chol_age,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ST_Slope_Flat,ST_Slope_Up,Sex_M,FastingBS_1,ExerciseAngina_Y,age_group_Middle-aged,age_group_Senior,age_group_Young
0,40,140,289,172,0.0,0,7.225,True,False,False,True,False,False,True,True,False,False,True,False,False
1,49,160,180,156,1.0,1,3.673469,False,True,False,True,False,True,False,False,False,False,True,False,False
2,37,130,283,98,0.0,0,7.648649,True,False,False,False,True,False,True,True,False,False,True,False,False
3,48,138,214,108,1.5,1,4.458333,False,False,False,True,False,True,False,False,False,True,True,False,False
4,54,150,195,122,0.0,0,3.611111,False,True,False,True,False,False,True,True,False,False,False,True,False


In [22]:
# 4. Scaling the numerical features
numerical_features = ['Age', 'MaxHR', 'Cholesterol', 'RestingBP', 'Oldpeak', 'chol_age']
scaler = StandardScaler()
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
df_encoded.head()


Unnamed: 0,Age,RestingBP,Cholesterol,MaxHR,Oldpeak,HeartDisease,chol_age,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ST_Slope_Flat,ST_Slope_Up,Sex_M,FastingBS_1,ExerciseAngina_Y,age_group_Middle-aged,age_group_Senior,age_group_Young
0,-1.43314,0.410909,0.82507,1.382928,-0.832432,0,1.463847,True,False,False,True,False,False,True,True,False,False,True,False,False
1,-0.478484,1.491752,-0.171961,0.754157,0.105664,1,-0.091453,False,True,False,True,False,True,False,False,False,False,True,False,False
2,-1.751359,-0.129513,0.770188,-1.525138,-0.832432,0,1.649373,True,False,False,False,True,False,True,True,False,False,True,False,False
3,-0.584556,0.302825,0.13904,-1.132156,0.574711,1,0.252258,False,False,False,True,False,True,False,False,False,True,True,False,False
4,0.051881,0.951331,-0.034755,-0.581981,-0.832432,0,-0.118761,False,True,False,True,False,False,True,True,False,False,False,True,False


In [26]:
# 5. Final verification
print('Showing the transformed dataset')
print(df_encoded.head(15))

print('Names of the numeric features in the new dataframe:')
print(list(df_encoded.select_dtypes(include=['number']).columns))

print('Shape (rows x columns) of the new dataframe:')
print(df_encoded.shape)

Showing the transformed dataset
         Age  RestingBP  Cholesterol     MaxHR   Oldpeak  HeartDisease  \
0  -1.433140   0.410909     0.825070  1.382928 -0.832432             0   
1  -0.478484   1.491752    -0.171961  0.754157  0.105664             1   
2  -1.751359  -0.129513     0.770188 -1.525138 -0.832432             0   
3  -0.584556   0.302825     0.139040 -1.132156  0.574711             1   
4   0.051881   0.951331    -0.034755 -0.581981 -0.832432             0   
5  -1.539213  -0.669935     1.282424  1.304332 -0.832432             0   
6  -0.902775  -0.129513     0.349422  1.304332 -0.832432             0   
7   0.051881  -1.210356     0.084157  0.203982 -0.832432             0   
8  -1.751359   0.410909     0.075010 -0.267596  0.574711             1   
9  -0.584556  -0.669935     0.779335 -0.660578 -0.832432             0   
10 -1.751359  -0.129513     0.111598  0.203982 -0.832432             0   
11  0.476173   0.194740    -0.318314 -1.485840  1.043759             1   
12 -1.5