# Capstone Project — California Housing  
## Part 2: Data Splits & Preprocessing Pipeline

**Objective:**  
Prepare the dataset for model training by:
- Creating reproducible 60/20/20 splits (train/validation/test).
- Building preprocessing pipelines for scaling and optional log transformations.
- Capturing environment dependencies for reproducibility.

This stage keeps the splits reproducible and fits every preprocessing statistic on the training set only, so that no validation or test information leaks into training.
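To make the leakage point concrete, here is a minimal sketch on synthetic data (the array, its size, and the 60/40 cut are illustrative assumptions, not the housing data). It shows that a scaler fitted on the training rows stores training-only statistics, which are then reused unchanged on held-out rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (illustrative only, not the housing dataset).
rng = np.random.default_rng(0)
values = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
train, held_out = values[:60], values[60:]

# Fit on the training rows only; held-out rows reuse the same statistics.
scaler = StandardScaler().fit(train)
held_out_scaled = scaler.transform(held_out)

# The stored mean is the training mean, not the mean over all rows.
print("scaler mean  :", scaler.mean_[0])
print("train mean   :", train.mean())
print("all-rows mean:", values.mean())
```

Fitting the scaler on all rows instead would fold held-out values into the mean and variance, which is the leakage this stage is meant to avoid.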

## Imports & seed setup

In [1]:
# --- 0. Repro + Imports ---
import os, random, numpy as np, pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib

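SEED = 42
# Seed Python's and NumPy's global RNGs. Setting PYTHONHASHSEED here only reaches
# subprocesses; to affect this interpreter it must be set before Python starts.
random.seed(SEED)
np.random.seed(SEED)
os.environ["PYTHONHASHSEED"] = str(SEED)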

## Load dataset

In [2]:
# --- 1. Load Data ---
data = fetch_california_housing(as_frame=True)
df = data.frame.copy()
df.rename(columns={"MedHouseVal": "target"}, inplace=True)
df.head(3)

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |


## Split logic

In [3]:
# --- 2. Split Data (60/20/20) ---
X = df.drop("target", axis=1)
y = df["target"]

# 60/20/20 split: hold out 40% of the rows, then halve the holdout into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=SEED)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=SEED)

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

Train: 12384, Val: 4128, Test: 4128
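The two-stage split can be sanity-checked without the dataset itself. The sketch below splits a plain index array of 20,640 entries (the row count implied by the sizes printed above), then confirms the 60/20/20 proportions and that no index lands in more than one split. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
n_rows = 20640  # row count implied by the printed sizes (12384 + 4128 + 4128)
idx = np.arange(n_rows)

# Stage 1: hold out 40% of all rows. Stage 2: split that holdout in half.
train_idx, temp_idx = train_test_split(idx, test_size=0.4, random_state=SEED)
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=SEED)

sizes = {"train": len(train_idx), "val": len(val_idx), "test": len(test_idx)}
fractions = {name: size / n_rows for name, size in sizes.items()}

# No row may appear in more than one split.
disjoint = (
    set(train_idx).isdisjoint(val_idx)
    and set(train_idx).isdisjoint(test_idx)
    and set(val_idx).isdisjoint(test_idx)
)
print(sizes, fractions, disjoint)
```

Because the second call takes half of the 40% holdout, validation and test each end up with 0.4 × 0.5 = 0.2 of the rows.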


## Preprocessing Pipeline

In [4]:
# --- 3. Preprocessing Pipeline ---
log_features = ["MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup"]
log_transformer = Pipeline([("log", FunctionTransformer(np.log1p, validate=False)),
                            ("scaler", StandardScaler())])

num_features = [col for col in X.columns if col not in log_features]
num_transformer = Pipeline([("scaler", StandardScaler())])

preprocessor = ColumnTransformer([
    ("log_scaled", log_transformer, log_features),
    ("scaled", num_transformer, num_features)
])

# Fit on the training split only, then reuse the fitted statistics for validation and test
X_train_prep = preprocessor.fit_transform(X_train)
X_val_prep = preprocessor.transform(X_val)
X_test_prep = preprocessor.transform(X_test)

print("Shapes:", X_train_prep.shape, X_val_prep.shape, X_test_prep.shape)

Shapes: (12384, 8) (4128, 8) (4128, 8)


## Save preprocessor

In [5]:
# --- 4. Save Preprocessor ---
joblib.dump(preprocessor, "preprocessor.pkl")
print("Preprocessor saved as preprocessor.pkl")

Preprocessor saved as preprocessor.pkl
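A later stage would load the saved object and call `transform` on new data. As a minimal sketch of that round trip, the example below uses a plain `StandardScaler` on made-up data as a stand-in for the fitted preprocessor and a hypothetical file name inside a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted preprocessor (a plain scaler on made-up data).
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(50, 3))
X_new = rng.normal(size=(5, 3))
fitted = StandardScaler().fit(X_fit)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessor_demo.pkl")  # hypothetical file name
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored object should reproduce the original transform.
same_output = np.allclose(fitted.transform(X_new), restored.transform(X_new))
print("round trip OK:", same_output)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, which is one reason to record the environment alongside the file.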


## Environment snapshot

In [6]:
# --- 5. Environment Snapshot ---
!pip freeze | grep -E "torch|numpy|scikit-learn" > env_snapshot.txt
!cat env_snapshot.txt

numpy @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_8cclhhg0x3/croot/numpy_and_numpy_base_1755590859900/work/dist/numpy-2.2.5-cp312-cp312-macosx_11_0_arm64.whl#sha256=f2d995e91fd194392feaa446357eb199182af2d1686c264d5c4b75a0c272253c
scikit-learn @ file:///opt/miniconda3/conda-bld/scikit-learn_1758620846586/work
torch==2.8.0
torchaudio==2.8.0
torchvision==0.23.0
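In the output above, the conda-built packages appear as `name @ file://...` entries, and the scikit-learn line carries no version number. One possible complement, sketched below with the standard-library `importlib.metadata`, is to record installed versions directly. The package list and the `collect_versions` helper are illustrative assumptions, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages to record; adjust to whatever the project actually imports.
packages = ["numpy", "pandas", "scikit-learn", "joblib"]

def collect_versions(names):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = collect_versions(packages)
for name, ver in versions.items():
    print(f"{name}=={ver}" if ver else f"# {name} not installed")
```

The printed lines use the `name==version` form, so they could be redirected to a file and used as a pinned requirements list.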


### Summary
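- Data split into reproducible 60/20/20 train/validation/test sets (12,384 / 4,128 / 4,128 rows).
- Preprocessing pipeline (log + scaling), fitted on the training set only, serialized to `preprocessor.pkl`.
- Package snapshot for numpy, scikit-learn, and torch stored in `env_snapshot.txt`. The conda-built entries are recorded as file paths, and the scikit-learn entry carries no version number, so the snapshot alone does not pin every dependency.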