# Feature Engineering with Open-Source

**Objective:**  
In this notebook, we reproduce the Feature Engineering Pipeline from the earlier notebook, replacing the manually written functions with open-source classes wherever possible, to understand the value those classes bring. The goal is to extract, transform, and encode features in a way that improves the model’s ability to predict `SalePrice`.

---

## Key Steps:

1. **Handle missing values**
   - Domain-specific imputation strategies
   - Use of indicators for missingness (if meaningful)

2. **Transform variables**
   - Log transformation for skewed features
   - Binning or discretization of continuous variables
   - Date or time-based feature extraction

3. **Encode categorical features**
   - Ordinal encoding for ranked categories
   - One-hot encoding for nominal variables
   - Group rare categories

4. **Feature scaling (if required)**
   - Standardization or normalization of numerical values for certain models

---

**Outcome:**  
We produce a cleaned and transformed dataset with meaningful features that are ready to be fed into machine learning models for accurate price prediction.


In [1]:
# Import libraries
import joblib
import numpy as np
import pandas as pd
import polars as pl
import holoviews as hv
from bokeh.models import NumeralTickFormatter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, Binarizer
from feature_engine.imputation import AddMissingIndicator, MeanMedianImputer, CategoricalImputer
from feature_engine.encoding import RareLabelEncoder, OrdinalEncoder
from feature_engine.transformation import LogTransformer, YeoJohnsonTransformer
from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper

hv.extension('bokeh')

In [2]:
# Load the dataset and print its basic shape information
data = pd.read_csv('data/train.csv')
print(f'The dataset contains {data.shape[0]} rows and {data.shape[1]} columns')
data.head()

The dataset contains 1460 rows and 81 columns


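| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |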


### Missing Values

In [3]:
# Start the missing-value analysis with the categorical columns
print('Missing Data - Categorical Columns')
categorical_features = [var for var in data.columns if data[var].dtype == 'O']
categorical_features_with_na = [var for var in categorical_features if data[var].isnull().sum() > 0]

# Split the columns by whether more than 10% of their values are missing
lot_missing_data = [var for var in categorical_features_with_na if data[var].isnull().mean() > 0.1]
few_missing_data = [var for var in categorical_features_with_na if data[var].isnull().mean() <= 0.1]
print(f'We have {len(lot_missing_data)} columns with more than 10% missing data and {len(few_missing_data)} columns with 10% or less missing data')

Missing Data - Categorical Columns
We have 6 columns with more than 10% missing data and 10 columns with 10% or less missing data
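
The same split can be read off a single Series of missing fractions, which also makes it easy to see how far each column sits from the threshold. The sketch below uses a toy frame with made-up values; the `threshold` name is an assumption for the example, and the 10% cut-off mirrors the one used above.

```python
import pandas as pd

# Toy frame with 20 rows; the missing-value patterns are made up.
toy = pd.DataFrame({
    "Alley": pd.Series([None] * 15 + ["Grvl"] * 5, dtype="object"),         # 75% missing
    "Electrical": pd.Series(["SBrkr"] * 19 + [None], dtype="object"),       # 5% missing
    "Street": pd.Series(["Pave"] * 20, dtype="object"),                     # complete
})

threshold = 0.1  # same 10% cut-off as in the cell above

# Fraction of missing values per column, keeping only columns that have any
missing_share = toy.isnull().mean()
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)

lot_missing = missing_share[missing_share > threshold].index.tolist()
few_missing = missing_share[missing_share <= threshold].index.tolist()

print(missing_share)
print(lot_missing, few_missing)
```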


In [4]:
# For the columns with more than 10% missing values, replace NA/null with the label 'Missing' using the feature-engine library
categorical_imputer_missing = CategoricalImputer(imputation_method='missing', variables=lot_missing_data) # type: ignore
categorical_imputer_missing.fit(data)

# For the columns with 10% or fewer missing values, replace NA/null with the most frequent category
categorical_imputer_frequent = CategoricalImputer(imputation_method='frequent', variables=few_missing_data) # type: ignore
categorical_imputer_frequent.fit(data)

# With both imputers fitted, transform the data to fill the NA/null values
data = categorical_imputer_missing.transform(data)
data = categorical_imputer_frequent.transform(data)

# Confirm that no null values remain in the categorical columns
for column in categorical_features:
    print(f'For the column {column} we have {data[column].isnull().sum()} null or NA items')

For the column MSZoning we have 0 null or NA items
For the column Street we have 0 null or NA items
For the column Alley we have 0 null or NA items
For the column LotShape we have 0 null or NA items
For the column LandContour we have 0 null or NA items
For the column Utilities we have 0 null or NA items
For the column LotConfig we have 0 null or NA items
For the column LandSlope we have 0 null or NA items
For the column Neighborhood we have 0 null or NA items
For the column Condition1 we have 0 null or NA items
For the column Condition2 we have 0 null or NA items
For the column BldgType we have 0 null or NA items
For the column HouseStyle we have 0 null or NA items
For the column RoofStyle we have 0 null or NA items
For the column RoofMatl we have 0 null or NA items
For the column Exterior1st we have 0 null or NA items
For the column Exterior2nd we have 0 null or NA items
For the column MasVnrType we have 0 null or NA items
For the column ExterQual we have 0 null or NA items
...
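
To see what the two imputers are doing, here is a rough plain-pandas equivalent of the fit/transform split. It is a sketch on made-up data: the `train`/`test` frames and the `fill_values` dict are assumptions for the example, and feature-engine's `CategoricalImputer` keeps its learned values in its `imputer_dict_` attribute rather than in a dict like this one.

```python
import pandas as pd

# Made-up train/test frames with one column per imputation strategy.
train = pd.DataFrame({
    "Alley": pd.Series([None, "Grvl", None, None], dtype="object"),
    "Electrical": pd.Series(["SBrkr", "SBrkr", None, "FuseA"], dtype="object"),
})
test = pd.DataFrame({
    "Alley": pd.Series([None, "Pave"], dtype="object"),
    "Electrical": pd.Series([None, "FuseA"], dtype="object"),
})

# "fit": decide the fill value for each column using the training data only
fill_values = {
    "Alley": "Missing",                           # constant label for heavily missing columns
    "Electrical": train["Electrical"].mode()[0],  # most frequent category otherwise
}

# "transform": apply the same learned values to any frame
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```

Keeping the learned values separate from the data they were learned on is what lets the same imputation be reapplied to a held-out set without recomputing the mode on it.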

### Numerical variables
