# **Feature Engineering**

## Objectives

* Apply numerical transformations (e.g., scaling)
* Encode ordinal categorical features
* Drop highly correlated features using Smart Correlated Selection

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv

## Outputs

* Dataset insights for Modeling 



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
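
As a small illustration of what `os.path.dirname()` returns, the sketch below uses a hypothetical path. The folder `/workspace/project/jupyter_notebooks` is an assumption made for the example, not the actual project location.

```python
import os

# Hypothetical notebook folder, used only for illustration
notebook_dir = "/workspace/project/jupyter_notebooks"

# dirname() strips the last path component, which gives the parent folder
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)
```

Passing that parent folder to `os.chdir()` is what makes relative paths such as `outputs/...` resolve from the project root.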

# Load Dataset


In [None]:
import pandas as pd
TrainSet = pd.read_csv('outputs/datasets/cleaned/TrainSetCleaned.csv')
TrainSet.head(10)


---

## Visualize Numerical Distribution

This will help us confirm skewness and outliers.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

numerical_cols = ['age', 'bmi', 'children', 'charges']

# Plot histogram and boxplot
def plot_histogram_and_boxplot(df, cols):
    for col in cols:
        fig, axes = plt.subplots(1, 2, figsize=(12, 4))
        
        sns.histplot(df[col], kde=True, ax=axes[0])
        axes[0].set_title(f'Histogram of {col}')
        
        sns.boxplot(x=df[col], ax=axes[1])
        axes[1].set_title(f'Boxplot of {col}')
        
        plt.tight_layout()
        plt.show()

# Run the visualization
plot_histogram_and_boxplot(TrainSet, numerical_cols)
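
The plots can be complemented with numbers. The following is a minimal sketch, not part of the original analysis: the helper `skew_and_outlier_summary` and the toy data are hypothetical. It reports the skewness of each column and counts values outside the 1.5 × IQR fences, the same rule that boxplot whiskers use by default.

```python
import pandas as pd

def skew_and_outlier_summary(df, cols):
    """Return skewness and the number of IQR-rule outliers per column."""
    rows = []
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        n_outliers = int(((df[col] < lower) | (df[col] > upper)).sum())
        rows.append({"feature": col, "skew": df[col].skew(), "n_outliers": n_outliers})
    return pd.DataFrame(rows).set_index("feature")

# Hypothetical toy data, only to show the output format
toy = pd.DataFrame({"x": [1, 2, 2, 3, 3, 3, 4, 50]})
summary = skew_and_outlier_summary(toy, ["x"])
print(summary)
```

On the notebook's data, this could be called as `skew_and_outlier_summary(TrainSet, numerical_cols)`.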

## Numerical Transformations

**Age and BMI transformation**

In [None]:
# Initialize StandardScaler
scaler = StandardScaler()

# Apply to age and bmi
TrainSet[['age_scaled', 'bmi_scaled']] = scaler.fit_transform(TrainSet[['age', 'bmi']])

# Optional: Check summary stats
TrainSet[['age', 'age_scaled', 'bmi', 'bmi_scaled']].describe()
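
To make the scaling step concrete, here is a minimal sketch on hypothetical values (the array `x` is made up for illustration). It checks that `StandardScaler` computes the z-score `(x - mean) / std` using the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values, only for illustration
x = np.array([[20.0], [30.0], [40.0], [50.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score with the population standard deviation (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))
```

Because pandas `describe()` reports the sample standard deviation (`ddof=1`), the scaled columns in the summary above will show a `std` slightly greater than 1 rather than exactly 1.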


We applied standard scaling to the `age` and `bmi` features with `StandardScaler`; both are numerical, not ordinal categorical, so no encoder is needed for them. Note that no Yeo-Johnson transformation is applied to `charges` in the cells above. As a demonstration, we will now wrap the scaling step into a clean pipeline.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Define columns
num_features_to_scale = ['age', 'bmi']
pass_through = ['children']

# Pipeline for numerical columns
num_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_pipeline, num_features_to_scale),
    ('passthrough', 'passthrough', pass_through)
])
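
The cell above only defines the preprocessor. As a hedged illustration of what it produces once fitted, the sketch below rebuilds the same structure under a different name (`toy_preprocessor`) and applies it to three hypothetical rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical rows, only for illustration
toy = pd.DataFrame({"age": [20, 40, 60], "bmi": [22.0, 28.0, 34.0], "children": [0, 2, 1]})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("scaler", StandardScaler())]), ["age", "bmi"]),
    ("passthrough", "passthrough", ["children"]),
])

# Output is a NumPy array: scaled age, scaled bmi, then children unchanged
toy_out = toy_preprocessor.fit_transform(toy)
print(toy_out)
```

The output columns follow the order in which the transformers are listed, not the column order of the input DataFrame.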

## Categorical Encoding

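Now we will apply categorical encoding to the remaining features. The dataset has the following categorical variables:
- sex (binary)
- smoker (binary)
- region (nominal, no natural order)

We will use `OrdinalEncoder` for all three, since it handles both binary and multi-class columns, and combine it with the numerical transformation in a single pipeline. Note that `OrdinalEncoder` assigns integer codes in sorted category order, which implies an order that `region` does not actually have; tree-based models are largely insensitive to this, but linear models may not be.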

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Categorical columns
cat_features = ['sex', 'smoker', 'region']

# Categorical encoder pipeline
cat_pipeline = Pipeline([
    ('ordinal', OrdinalEncoder())
])
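
To see which integer each category receives, the fitted encoder's `categories_` attribute can be inspected. The sketch below is illustrative only: the category values (`yes`/`no` and the region names) are assumptions, not values confirmed from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical category values, only for illustration
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southeast"],
})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy)

# categories_ lists each column's categories; the position is the integer code
for col, cats in zip(toy.columns, encoder.categories_):
    print(col, {cat: i for i, cat in enumerate(cats)})
```

With the default settings the categories are sorted, so the codes reflect alphabetical order rather than any meaningful ranking.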


In [None]:
# Full preprocessor
full_preprocessor = ColumnTransformer(transformers=[
    ('num', num_pipeline, num_features_to_scale),
    ('passthrough', 'passthrough', pass_through),
    ('cat', cat_pipeline, cat_features)
])


In [None]:
# Combine all features (numerical + categorical)
X_all = TrainSet[num_features_to_scale + pass_through + cat_features]

# Apply full pipeline
X_ready = full_preprocessor.fit_transform(X_all)

# fit_transform returns a plain array; its columns follow the transformer order above
final_columns = num_features_to_scale + pass_through + cat_features
X_ready_df = pd.DataFrame(X_ready, columns=final_columns, index=X_all.index)
X_ready_df.head()


Here we can see the DataFrame with all the encoded and scaled features. We used `OrdinalEncoder` from `sklearn.preprocessing` to encode the categorical features and `StandardScaler` to scale the numerical ones. `ColumnTransformer` from `sklearn.compose` routes each group of columns to its transformer, and `Pipeline` from `sklearn.pipeline` lets us chain transformation steps together.

In [None]:
df_modeling = X_ready_df.copy()
df_modeling['charges'] = TrainSet['charges']
df_modeling.head()

---

# Conclusions and Next Steps

## Conclusions
* We have successfully done the following:
  - Applied numerical transformations to the dataset.
  - Encoded the categorical features using `OrdinalEncoder`.
* Final feature set includes:
    - age, bmi (standardized)
    - children (passed through unchanged)
    - sex, smoker, region (ordinal encoded)
* Important note: 'charges' was not part of the feature set, as it is the target variable.

## Next Steps
* Next, we will build the modeling pipeline, reusing the feature engineering steps defined in this notebook.