# **Feature Engineering**

## Objectives

* Apply numerical transformations
* Encode categorical features with an ordinal encoder

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv

## Outputs

* Dataset with encoded features
* Dataset insights for Modeling 



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's current folder to its parent folder.
* `os.getcwd()` returns the current directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
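
The path handling used above can be tried on a fixed string without touching the real working directory. This is a small sketch; the path `/workspace/project/jupyter_notebooks` is a hypothetical example, not the actual project location.

```python
import os

# Hypothetical notebook folder, used only for illustration
notebook_dir = "/workspace/project/jupyter_notebooks"

# os.path.dirname() strips the last path component, giving the parent folder
project_root = os.path.dirname(notebook_dir)
print(project_root)
```

Passing that parent folder to `os.chdir()` is what makes relative paths such as `outputs/datasets/...` resolve from the project root.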

# Load Dataset


In [None]:
import pandas as pd
TrainSet = pd.read_csv('outputs/datasets/cleaned/TrainSetCleaned.csv')
TrainSet.head(10)


---

## Visualize Numerical Distribution

This will help us confirm skewness and outliers.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

numerical_cols = ['age', 'bmi', 'children', 'charges']

# Plot histogram and boxplot
def plot_histogram_and_boxplot(df, cols):
    for col in cols:
        fig, axes = plt.subplots(1, 2, figsize=(12, 4))
        
        sns.histplot(df[col], kde=True, ax=axes[0])
        axes[0].set_title(f'Histogram of {col}')
        
        sns.boxplot(x=df[col], ax=axes[1])
        axes[1].set_title(f'Boxplot of {col}')
        
        plt.tight_layout()
        plt.show()

# Run the visualization
plot_histogram_and_boxplot(TrainSet, numerical_cols)
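
The plots can be backed up with numbers. The sketch below shows, in plain Python, one way to quantify skewness and to list the points a boxplot would mark as outliers under the usual 1.5 × IQR rule. The helper names and the sample values are made up for illustration and are not taken from the insurance data.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

def iqr_outliers(values):
    """Values lying more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up, right-skewed sample (illustration only)
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]

sample_skew = skewness(sample)
sample_outliers = iqr_outliers(sample)
print(f"skewness: {sample_skew:.2f}, outliers: {sample_outliers}")
```

A positive skewness indicates a long right tail. On the real data, `TrainSet[numerical_cols].skew()` gives a comparable figure, although pandas applies a bias correction, so the values differ slightly from this formula.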

## Numerical Transformations

**Age and BMI transformation**

In [None]:
# Initialize StandardScaler
scaler = StandardScaler()
TrainSet_copy = TrainSet.copy()

# Apply to age and bmi
TrainSet_copy[['age', 'bmi']] = scaler.fit_transform(TrainSet_copy[['age', 'bmi']])

# Check summary stats
TrainSet_copy[['age', 'bmi']].describe()
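
To make the arithmetic behind standardization concrete, here is a minimal plain-Python sketch of the z-score formula, (x - mean) / std. The `standardize` helper and the ages are hypothetical and serve only as an illustration.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # divides by n, as StandardScaler does
    return [(v - mean) / std for v in values]

# Made-up ages (illustration only)
ages = [20, 30, 40, 50, 60]
scaled_ages = standardize(ages)
print([round(z, 3) for z in scaled_ages])
```

The scaled values have mean 0 and population standard deviation 1. Note that `describe()` reports the sample standard deviation (dividing by n - 1), so the `std` row in the summary above can come out slightly above 1.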


Above, we standardized the `age` and `bmi` features with the `StandardScaler` from `sklearn.preprocessing`, so each has zero mean and unit variance. As a demonstration, we will now wrap this transformation in a pipeline.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

# Define columns
num_features_to_scale = ['age', 'bmi']

# Pipeline for numerical columns

num_pipeline = Pipeline([('scaler', SklearnTransformerWrapper(transformer=StandardScaler(),
                                             variables=num_features_to_scale))
])

## Categorical Encoding

Now we will apply categorical encoding. The dataset has the following categorical variables:
- sex (binary)
- smoker (binary)
- region (nominal — no natural order)

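We will use one `OrdinalEncoder`, since it can handle both binary and nominal variables, and combine it with the numerical transformation in a single pipeline. With `encoding_method='arbitrary'` the integer codes carry no meaningful order, which matters most for `region`.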

In [None]:
from feature_engine.encoding import OrdinalEncoder

# Categorical columns
cat_features = ['sex', 'smoker', 'region']

# Categorical encoder pipeline
cat_pipeline = Pipeline([
    ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary',
                                           variables=cat_features))
])
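
The idea behind arbitrary ordinal encoding can be sketched in a few lines of plain Python. This is an illustration of the concept, assuming integers are assigned in order of first appearance; it is not feature_engine's implementation, and the `fit_arbitrary_mapping` helper and the sample values are hypothetical.

```python
def fit_arbitrary_mapping(values):
    """Map each distinct category to an integer in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping

# Made-up region values (illustration only)
regions = ["southwest", "southeast", "northwest", "southeast", "northeast"]

region_mapping = fit_arbitrary_mapping(regions)
encoded_regions = [region_mapping[r] for r in regions]
print(region_mapping)
print(encoded_regions)
```

The resulting numbers are labels, not magnitudes: a larger code does not mean a "larger" region.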


**Join the numerical and categorical pipelines**

In [None]:
# Full preprocessor
full_pipeline = Pipeline([
        ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary',
                                           variables=cat_features)),
        ('scaler', SklearnTransformerWrapper(transformer=StandardScaler(),
                                             variables=num_features_to_scale))
    ])


**Check if the pipeline works**

In [None]:
print(f"DataSet before preprocessing:\n{TrainSet.head(10)}\n")

Next, we fit the pipeline and display the DataFrame with all features encoded. We use the `OrdinalEncoder` from `feature_engine.encoding` to encode the categorical features and `StandardScaler` to scale the numerical features. The `Pipeline` from `sklearn.pipeline` chains these transformations together.

In [None]:
# Check if the pipeline works
preprocessed_data = full_pipeline.fit_transform(TrainSet.drop(columns=['charges']))
print(f"DataSet after processing:\n{preprocessed_data.head(10)}\n")

---

## Correlation Matrix of Encoded Features

In [None]:
# Check correlation matrix of the processed data
# Concatenate charges to preprocessed_data
preprocessed_with_target = preprocessed_data.copy()
preprocessed_with_target['charges'] = TrainSet['charges'].values

# Compute correlation matrix
correlation_matrix = preprocessed_with_target.corr()

print(correlation_matrix)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix of Preprocessed Data (with Target)")
plt.show()
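
Each cell of the matrix is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. The plain-Python sketch below shows the formula on two made-up columns; the `pearson` helper and the values are hypothetical and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up encoded smoker flag and charges (illustration only)
smoker = [0, 0, 1, 1, 0, 1]
charges = [2000, 3000, 20000, 25000, 2500, 22000]

r = pearson(smoker, charges)
print(f"correlation: {r:.3f}")
```

For an arbitrarily encoded nominal column such as `region`, the coefficient depends on which integer each category happened to receive, so it should be read with caution.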

---

## Push files to Repo

In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/feature_engineered', exist_ok=True)

# Save the dataframe to a CSV file in the outputs folder
preprocessed_with_target.to_csv('outputs/datasets/feature_engineered/insurance_fe.csv', index=False)

# Conclusions and Next Steps

## Conclusions
* We have successfully done the following:
  - Applied numerical transformations to the dataset.
  - Encoded categorical features using `OrdinalEncoder`.
* Final feature set includes:
  - age, bmi (standardized)
  - children (int, unchanged)
  - sex, smoker, region (ordinal encoded)
* Important note: `charges` was not part of the feature set passed to the pipeline, as it is the target variable. It is appended unchanged to the saved dataset.

## Next Steps
* We will now proceed to build the pipelines for modeling and feature engineering, based on the insights gained in this notebook.