## 4. Feature Engineering

### 4.1 Objectives

The objectives of this phase are to:

* Convert categorical variables into **machine-readable numerical formats**
* Preserve **ordinal meaning** where rankings exist
* Scale numerical features to improve model convergence and interpretability

### 4.2 Using the Feature Engineering Module

We'll use the structured `FeatureEngineer` class from `src.feature_engineering` to ensure consistent and reproducible feature engineering.

---

## 4.3 Ordinal Feature Encoding

Ordinal variables represent ordered quality or condition levels. These are encoded using **domain-consistent mappings** derived from the data dictionary and stored in `src.config`.
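
As a minimal sketch of what such a mapping does, the snippet below encodes a quality column with a plain dictionary and `Series.map`. The dictionary `quality_mapping_example` and the helper `encode_ordinal` are hypothetical stand-ins: the actual contents of `QUALITY_MAPPING` in `src.config` are not shown in this section and may differ.

```python
import pandas as pd

# Hypothetical mapping for illustration only; the real QUALITY_MAPPING may differ.
quality_mapping_example = {"NA": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def encode_ordinal(series: pd.Series, mapping: dict, missing_value: int = 0) -> pd.Series:
    """Map ordered category labels to integers.

    Labels that are missing or absent from the mapping receive `missing_value`.
    """
    return series.map(mapping).fillna(missing_value).astype(int)


sample = pd.Series(["Gd", "TA", "Ex", None, "Fa"], name="ExterQual")
encoded = encode_ordinal(sample, quality_mapping_example)
print(encoded.tolist())  # [4, 3, 5, 0, 2]
```

Because the integers follow the quality ranking, a model can exploit the ordering (for example, `Ex` > `Gd` > `TA`), which one-hot encoding would discard.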

### 4.3.1 Initialize Feature Engineer and Load Data

In [1]:
import sys

import pandas as pd

# Make the project's src/ directory importable from the notebook
sys.path.append('../src')
from feature_engineering import FeatureEngineer
from config import (
    QUALITY_MAPPING,
    BASEMENT_EXPOSURE_MAPPING,
    BASEMENT_FINISH_MAPPING,
    DATA_PATHS,
)

# Initialize the feature engineer
engineer = FeatureEngineer()

# Load prepared data
df = pd.read_csv(DATA_PATHS['prepared_train'])
print(f"Data shape: {df.shape}")
print("Available ordinal mappings:")
print(f"- Quality mapping: {len(QUALITY_MAPPING)} levels")
print(f"- Basement exposure mapping: {len(BASEMENT_EXPOSURE_MAPPING)} levels")
print(f"- Basement finish mapping: {len(BASEMENT_FINISH_MAPPING)} levels")

Data shape: (1460, 81)
Available ordinal mappings:
- Quality mapping: 9 levels
- Basement exposure mapping: 5 levels
- Basement finish mapping: 7 levels


### 4.3.2 Apply Ordinal Encoding Using FeatureEngineer

In [2]:
# Apply ordinal encoding
df_ordinal = engineer.encode_ordinal_features(df)

# Show some examples of encoded features
ordinal_features = ['ExterQual', 'KitchenQual', 'BsmtQual', 'BsmtExposure', 'BsmtFinType1']
for feature in ordinal_features:
    if feature in df_ordinal.columns:
        print(f"{feature}: {df_ordinal[feature].unique()[:10]}...")  # Show first 10 unique values

ExterQual: [4 3 5 2]...
KitchenQual: [4 3 5 2]...
BsmtQual: [4 3 5 0 2]...
BsmtExposure: [1 4 2 3 0]...
BsmtFinType1: [6 5 1 3 4 0 2]...


## 4.4 Nominal Feature Encoding (One-Hot Encoding)

Nominal variables have **no intrinsic ordering** and are encoded using **one-hot encoding**.

### 4.4.1 Apply One-Hot Encoding Using FeatureEngineer

In [3]:
# Apply one-hot encoding to nominal features
df_encoded = engineer.encode_nominal_features(df_ordinal, drop_first=True)

print(f"Shape before one-hot encoding: {df_ordinal.shape}")
print(f"Shape after one-hot encoding: {df_encoded.shape}")
print(f"New features added: {df_encoded.shape[1] - df_ordinal.shape[1]}")

Shape before one-hot encoding: (1460, 81)
Shape after one-hot encoding: (1460, 216)
New features added: 135


**Rationale:**

* `drop_first=True` drops one dummy column per feature, which avoids perfect multicollinearity (the dummy variable trap)
* Ordinal features were already numerically encoded and are therefore excluded from one-hot encoding
* All remaining categorical features are nominal and suitable for one-hot encoding
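
The effect of `drop_first=True` can be illustrated on a toy column with `pandas.get_dummies`. This is only a sketch: whether `encode_nominal_features` uses `get_dummies` internally is an assumption, and the column values here are illustrative.

```python
import pandas as pd

# Toy nominal column; the labels are illustrative.
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# Keep every category as its own dummy column.
full = pd.get_dummies(toy, columns=["Street"], dtype=int)

# Drop the first category (alphabetically "Grvl"), which becomes the baseline.
reduced = pd.get_dummies(toy, columns=["Street"], drop_first=True, dtype=int)

print(list(full.columns))     # ['Street_Grvl', 'Street_Pave']
print(list(reduced.columns))  # ['Street_Pave']

# With all dummies kept, every row sums to 1, which duplicates the
# intercept column of a linear model (perfect multicollinearity).
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1]
```

The dropped category is still recoverable: a row with `Street_Pave == 0` is the baseline `Grvl`.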

## 4.5 Feature Scaling

Scaling is applied **only to numerical predictors**, not the target variable (`SalePrice`).

### 4.5.1 Apply Standard Scaling Using FeatureEngineer

In [4]:
# Apply standard scaling to numerical features
df_scaled = engineer.scale_numerical_features(df_encoded, target_column='SalePrice', fit_scaler=True)

print("Scaling completed")
print(f"Final data shape: {df_scaled.shape}")
print(f"Scaler fitted: {engineer.scaler is not None}")

Scaling completed
Final data shape: (1460, 216)
Scaler fitted: True


**Justification:**

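* Standardization (mean = 0, standard deviation = 1) benefits:

  * Linear regression: coefficients become comparable across features
  * Regularized models (Ridge, Lasso, Elastic Net): the penalty acts on all coefficients at the same scale
* Scaling is performed **after encoding**, so numerically encoded features are treated the same way as the original numerical features
* The fitted scaler is stored so that the same transformation can be applied to test data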
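
The last point is worth a small illustration. The sketch below standardizes a toy feature by hand with pandas, learning the statistics from the training data only and reusing them for the test data. It is a stand-in for whatever scaler `FeatureEngineer` stores (presumably a scikit-learn `StandardScaler`, which is an assumption); the data values are made up.

```python
import pandas as pd

train = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0],
    "SalePrice": [100000, 150000, 200000],
})
test = pd.DataFrame({"GrLivArea": [1500.0, 2500.0]})

# Scale predictors only; the target is left untouched.
feature_cols = [c for c in train.columns if c != "SalePrice"]

# "Fit": learn mean and standard deviation from the training data only.
# ddof=0 gives the population standard deviation, as StandardScaler uses.
means = train[feature_cols].mean()
stds = train[feature_cols].std(ddof=0)

train_scaled = train.copy()
train_scaled[feature_cols] = (train[feature_cols] - means) / stds

# "Transform": reuse the training statistics on the test data.
test_scaled = test.copy()
test_scaled[feature_cols] = (test[feature_cols] - means) / stds

print(train_scaled["GrLivArea"].round(3).tolist())  # [-1.225, 0.0, 1.225]
print(test_scaled["GrLivArea"].round(3).tolist())   # [0.0, 2.449]
```

Recomputing the mean and standard deviation on the test set would put the two datasets on different scales, so the stored statistics are what keep the transformation consistent.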

## 4.6 Complete Feature Engineering Pipeline

### 4.6.1 Automated Feature Engineering

The `engineer_features` method combines all steps into a single pipeline:

In [5]:
# Run complete feature engineering pipeline
df_engineered, y_transformed, was_log_transformed = engineer.engineer_features(
    df, 
    target_column='SalePrice',
    apply_log_target=True,
    scale_features=True
)

print("Complete pipeline results:")
print(f"- Final data shape: {df_engineered.shape}")
print(f"- Log transformation applied: {was_log_transformed}")
print(f"- Target skewness before: {df['SalePrice'].skew():.2f}")
if was_log_transformed:
    print(f"- Target skewness after: {y_transformed.skew():.2f}")

Complete pipeline results:
- Final data shape: (1460, 217)
- Log transformation applied: True
- Target skewness before: 1.88
- Target skewness after: 0.12
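
The drop in skewness comes from compressing the long right tail of the price distribution. The sketch below shows the idea on made-up prices using `numpy.log1p`, together with the inverse transform needed to report predictions in the original units. Whether `engineer_features` uses `log1p` or a plain `log` is not stated in this section, so treat the choice here as an assumption.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed prices: one expensive house stretches the tail.
prices = pd.Series([100_000, 120_000, 150_000, 200_000, 750_000], name="SalePrice")

# log1p(x) = log(1 + x); it compresses large values more than small ones.
log_prices = np.log1p(prices)

print(f"Skewness before: {prices.skew():.2f}")
print(f"Skewness after:  {log_prices.skew():.2f}")

# Predictions made on the log scale must be mapped back with the inverse.
recovered = np.expm1(log_prices)
print(np.allclose(recovered, prices))  # True
```

A model trained on the transformed target predicts log-prices, so `np.expm1` (or `np.exp` for a plain log) has to be applied before computing errors in currency units.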


## 4.7 Final Feature Matrix

At the end of this phase:

* All features are numeric
* Ordinal meaning is preserved
* Nominal variables are expanded
* Numerical features are standardized
* Target variable is log-transformed to reduce its right skew (skewness 1.88 before, 0.12 after)

In [6]:
# Save the engineered data using the prepare_modeling_data method
df_final_engineered, y_final, was_log_final = engineer.prepare_modeling_data()

print(f"Data saved to: {DATA_PATHS['prepared_scaled']}")
print(f"Final engineered data shape: {df_final_engineered.shape}")
print(f"Final log transformation: {was_log_final}")

Data saved to: e:\Projects_3\Data Science\Regression\Housing Prices_2\data\processed\train_prepared_scaled.csv
Final engineered data shape: (1460, 217)
Final log transformation: True
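
Before moving on to modeling, the claims listed at the start of this section can be checked programmatically. The helper `check_feature_matrix` below is a hypothetical sketch, not part of the `FeatureEngineer` module, and the toy frame stands in for the engineered data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_feature_matrix(X: pd.DataFrame) -> dict:
    """Return simple readiness indicators for a feature matrix."""
    non_numeric = [col for col in X.columns if not is_numeric_dtype(X[col])]
    return {
        "non_numeric_columns": non_numeric,
        "missing_values": int(X.isna().sum().sum()),
    }


# Toy frame with one column left unencoded and one missing value.
toy = pd.DataFrame({
    "OverallQual": [7, 6],
    "Street_Pave": [1, 0],
    "LotFrontage": [65.0, None],
    "MSZoning": ["RL", "RM"],
})

report = check_feature_matrix(toy)
print(report)  # {'non_numeric_columns': ['MSZoning'], 'missing_values': 1}
```

For a fully engineered matrix, both entries would be expected to come back empty or zero.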


## 4.8 Readiness for Modeling

The dataset is now suitable for:

* Linear Regression
* Regularized regression (Ridge, Lasso)
* Tree-based models (which do not require feature scaling)

## Summary

This feature engineering process using the FeatureEngineer module:

* Respects domain semantics through proper ordinal mappings
* Aligns with the data preparation phase of CRISP-DM
* Produces a robust, model-ready feature set
* Ensures reproducibility through centralized logic
* Provides automated log transformation for skewed targets

### Key Benefits of Using the FeatureEngineer Module:

1. **Consistency**: Same encoding applied to train and test data
2. **Reproducibility**: Scaler and mappings stored for reuse
3. **Automation**: Complete pipeline in single method call
4. **Domain Awareness**: Proper handling of ordinal vs nominal features
5. **Flexibility**: Configurable log transformation and scaling
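
The consistency point matters most for one-hot encoding, because test data may contain categories the training data lacks, and vice versa. The sketch below shows one common way to align the columns with `DataFrame.reindex`; how `FeatureEngineer` handles this internally is not shown in this section, so this is an assumption-based illustration with made-up categories.

```python
import pandas as pd

train = pd.DataFrame({"Roof": ["Gable", "Hip", "Flat"]})
# "Shed" never appears in training; "Hip" and "Flat" are absent from test.
test = pd.DataFrame({"Roof": ["Gable", "Shed"]})

train_enc = pd.get_dummies(train, columns=["Roof"], dtype=int)
test_enc = pd.get_dummies(test, columns=["Roof"], dtype=int)

# Align test columns to the training layout: columns missing from the
# test set are added as zeros, and unseen categories are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))    # ['Roof_Flat', 'Roof_Gable', 'Roof_Hip']
print(test_aligned.values.tolist())  # [[0, 1, 0], [0, 0, 0]]
```

After alignment the test matrix has exactly the columns the model was trained on, in the same order.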