# Feature Engineering

## Purpose
The purpose of this notebook is to **prepare modeling-ready features**
based on the conclusions from the EDA phase.

This notebook focuses on:
- explicitly defining input features (`X`) and the target (`y`)
- addressing feature scale differences
- deciding which transformations are required (and which are not)
- producing a clean feature matrix suitable for model training

No models are trained in this notebook.

---

## Context
The dataset has already been:
- materialized to `data/raw/california_housing.csv`
- inspected for data quality issues
- reviewed for feature meaning and target behavior

Key observations from EDA:
- All features are numeric
- Feature scales vary significantly
- The target variable (`MedHouseVal`) is continuous and capped
- Geographic and income features appear informative

This notebook operationalizes those observations.

---

## Output
By the end of this notebook, we will have:
- a clearly defined feature set (`X`)
- a clearly defined target (`y`)
- documented decisions around scaling and transformations
- feature data that is ready to be consumed by a modeling notebook

If feature definitions or preprocessing decisions are unclear at the end
of this notebook, the project should not proceed to model training.


In [9]:
# import libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import json
from pathlib import Path


In [2]:
df = pd.read_csv("../data/raw/california_housing.csv")
df.head()


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [3]:
TARGET = "MedHouseVal"

FEATURES = [
    "MedInc",
    "HouseAge",
    "AveRooms",
    "AveBedrms",
    "Population",
    "AveOccup",
    "Latitude",
    "Longitude"
]

X = df[FEATURES]
y = df[TARGET]

print(f'Features Head:\n{X.head()}')
print(f'Target Head:\n{y.head()}')


Features Head:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  
Target Head:
0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64


In [4]:
# sanity check: X and y should have the same number of rows

X.shape, y.shape


((20640, 8), (20640,))

In [5]:
# sanity check: count missing values per feature

X.isna().sum()


MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64

In [6]:
# sanity check: confirm every feature is numeric

X.dtypes


MedInc        float64
HouseAge      float64
AveRooms      float64
AveBedrms     float64
Population    float64
AveOccup      float64
Latitude      float64
Longitude     float64
dtype: object

### Notes

- Shape mismatches between `X` and `y` are a common real-world bug
- Validate inputs before transforming them
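
The individual checks above can be bundled into one guard that runs before any transformation. The sketch below is a minimal illustration and not part of the original notebook: `validate_inputs` is a hypothetical helper, and the three-row frame is made-up data used only to exercise it.

```python
from typing import Sequence

import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_inputs(X: pd.DataFrame, y: pd.Series, features: Sequence[str]) -> None:
    """Raise ValueError if X and y are not safe to hand to a transformer.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Columns must match the declared feature list, in the same order.
    if list(X.columns) != list(features):
        raise ValueError(f"Unexpected columns: {list(X.columns)}")
    # X and y must describe the same number of rows.
    if len(X) != len(y):
        raise ValueError(f"Row mismatch: X has {len(X)}, y has {len(y)}")
    # No missing values in either the features or the target.
    if X.isna().any().any() or y.isna().any():
        raise ValueError("Missing values found in X or y")
    # Every feature must be numeric.
    non_numeric = [c for c in X.columns if not is_numeric_dtype(X[c])]
    if non_numeric:
        raise ValueError(f"Non-numeric features: {non_numeric}")


# Made-up rows, used only to exercise the helper.
demo_X = pd.DataFrame({"MedInc": [8.3, 7.2, 5.6], "HouseAge": [41.0, 21.0, 52.0]})
demo_y = pd.Series([4.5, 3.6, 3.4], name="MedHouseVal")

validate_inputs(demo_X, demo_y, ["MedInc", "HouseAge"])  # passes silently
```

In this notebook the equivalent call would presumably be `validate_inputs(X, y, FEATURES)`, placed right after `X` and `y` are defined.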

In [7]:
X.describe()


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 |


### Data Values

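- `MedInc` (roughly 0.5 to 15) is on a very different scale than `Population` (3 to 35,682)
- `Latitude` (32.54 to 41.95) and `Longitude` (-124.35 to -114.31) are bounded
- Regularized linear models and distance-based models are sensitive to these scale differences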
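
To make the scale problem concrete, the sketch below computes the Euclidean distance between the first two rows of the feature matrix shown earlier and measures how much of it comes from `Population` alone. It is an illustration only; the variable names are not from the original notebook.

```python
import numpy as np

# First two rows of the feature matrix shown earlier, in FEATURES order:
# MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
row_0 = np.array([8.3252, 41.0, 6.984127, 1.023810, 322.0, 2.555556, 37.88, -122.23])
row_1 = np.array([8.3014, 21.0, 6.238137, 0.971880, 2401.0, 2.109842, 37.86, -122.22])

# Squared per-feature differences; their sum is the squared Euclidean distance.
squared_diff = (row_0 - row_1) ** 2
distance = np.sqrt(squared_diff.sum())

# Fraction of the squared distance contributed by Population (index 4).
population_share = squared_diff[4] / squared_diff.sum()

print(f"Euclidean distance: {distance:.1f}")
print(f"Share from Population: {population_share:.4f}")
```

On these two rows, `Population` accounts for more than 99.9% of the squared distance, so an unscaled distance-based model would effectively ignore the other seven features.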

### Scaling Decision

Based on EDA and feature inspection:

- Features vary significantly in scale
- No features require nonlinear transformations at this stage
- All features are numeric and continuous

Decision:
- Apply **standardization (z-score scaling)** to all input features
- Perform scaling **after train/test split** in the modeling notebook
- Fit scalers only on training data to avoid data leakage
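
The decision above can be sketched as follows. This is a minimal illustration under stated assumptions, not the modeling notebook's actual code: the data is synthetic, and `test_size=0.2` and `random_state=42` are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: two columns on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(loc=4.0, scale=2.0, size=200),        # income-like column
    rng.normal(loc=1400.0, scale=1100.0, size=200),  # population-like column
])
y_demo = rng.normal(size=200)

# 1. Split first, so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2. Fit on the training rows only, then reuse the fitted statistics on test.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training columns end up with mean ~0 and standard deviation ~1.
print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

The test set is transformed with the training mean and standard deviation, so its scaled columns will generally not have exactly zero mean or unit variance. That is expected and is what keeps test information out of the fitted scaler.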


In [8]:
X.values[:5]


array([[ 8.32520000e+00,  4.10000000e+01,  6.98412698e+00,
         1.02380952e+00,  3.22000000e+02,  2.55555556e+00,
         3.78800000e+01, -1.22230000e+02],
       [ 8.30140000e+00,  2.10000000e+01,  6.23813708e+00,
         9.71880492e-01,  2.40100000e+03,  2.10984183e+00,
         3.78600000e+01, -1.22220000e+02],
       [ 7.25740000e+00,  5.20000000e+01,  8.28813559e+00,
         1.07344633e+00,  4.96000000e+02,  2.80225989e+00,
         3.78500000e+01, -1.22240000e+02],
       [ 5.64310000e+00,  5.20000000e+01,  5.81735160e+00,
         1.07305936e+00,  5.58000000e+02,  2.54794521e+00,
         3.78500000e+01, -1.22250000e+02],
       [ 3.84620000e+00,  5.20000000e+01,  6.28185328e+00,
         1.08108108e+00,  5.65000000e+02,  2.18146718e+00,
         3.78500000e+01, -1.22250000e+02]])

### Note
Feature engineering in this notebook stops short of fitting transformers; scalers are fit later, in the modeling notebook.

In [10]:
# Feature name order preservation

FEATURE_ORDER = FEATURES.copy()
FEATURE_ORDER


['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

In [11]:
# Save feature metadata for later use

metadata = {
    "target": TARGET,
    "features": FEATURE_ORDER
}

Path("../artifacts").mkdir(exist_ok=True)

with open("../artifacts/feature_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

print("Metadata saved successfully")

Metadata saved successfully
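
A downstream notebook could use the saved metadata to select and order columns. The sketch below is a hedged illustration rather than the project's actual loading code: it writes a two-feature metadata file to a temporary directory so it runs on its own, and the `incoming` frame, including its `Extra` column, is made up.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Write a small metadata file to a temporary directory so the sketch is
# self-contained; the real file lives at ../artifacts/feature_metadata.json.
tmp_dir = Path(tempfile.mkdtemp())
metadata_path = tmp_dir / "feature_metadata.json"
metadata_path.write_text(
    json.dumps({"target": "MedHouseVal", "features": ["MedInc", "HouseAge"]})
)

# Downstream: load the metadata and treat it as the single source of truth.
with open(metadata_path) as f:
    loaded = json.load(f)

# Made-up incoming data with shuffled columns and one extra column.
incoming = pd.DataFrame(
    {"HouseAge": [41.0], "Extra": [1], "MedHouseVal": [4.526], "MedInc": [8.3252]}
)

# Fail loudly if an expected feature is absent.
missing = [c for c in loaded["features"] if c not in incoming.columns]
if missing:
    raise KeyError(f"Missing expected features: {missing}")

X_loaded = incoming[loaded["features"]]  # selects and orders columns as saved
y_loaded = incoming[loaded["target"]]

print(list(X_loaded.columns))
```

Selecting with the saved list both drops unexpected columns and restores the saved order, which matters for models that consume plain arrays.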


# Feature Engineering Summary & Decisions

## Objective
The goal of this notebook was to transform the raw dataset into a
**modeling-ready feature set** based on conclusions from the EDA phase.

No model training was performed in this notebook.

---

## Feature Definition
- **Target variable:** `MedHouseVal`
- **Input features:**
  - MedInc
  - HouseAge
  - AveRooms
  - AveBedrms
  - Population
  - AveOccup
  - Latitude
  - Longitude

Feature selection was explicit to avoid silent inclusion or exclusion
of columns during model training.

---

## Data Integrity
- All selected features are numeric
- No missing values detected in the features (the check in this notebook covers `X` only)
- Feature matrix and target vector have consistent dimensions

---

## Scaling Decisions
- Feature scales vary significantly across inputs
- No nonlinear transformations were required at this stage
- **Standardization (z-score scaling)** was selected as the scaling strategy

To prevent data leakage:
- Scalers will be fit on training data only
- Scaling will be applied after the train/test split in the modeling phase

---

## Feature Order & Metadata
- Feature order was explicitly defined and preserved
- Feature metadata (target name and feature list) was saved for downstream use
- This ensures consistency between training, inference, and future deployment

---

## Decisions Moving Forward
- Proceed to baseline model training using the defined feature set
- Apply scaling within the modeling workflow
- Evaluate models using RMSE
- Save trained model artifacts to `models/`
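
RMSE is named above but not defined in this notebook. The sketch below shows the usual definition, the square root of the mean squared difference between true and predicted values. The `rmse` helper and the predictions are hypothetical; the true values are the first three targets shown earlier.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# First three target values shown earlier, with made-up predictions.
y_true_demo = [4.526, 3.585, 3.521]
y_pred_demo = [4.026, 3.585, 4.021]

print(rmse(y_true_demo, y_pred_demo))
```

Because RMSE is expressed in the target's own units, it can be compared directly against the range of `MedHouseVal`.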

---

## Status
Feature engineering complete. Data is ready for model training.
