# Support Vector Regression: California Housing Notebook

This notebook mirrors the production SVR pipeline in `src/` and provides an interactive space for dataset inspection, training diagnostics, and hyperparameter experiments. Work through the sections sequentially to keep results aligned with the scripted workflow.

**Roadmap**

- Load the cached dataset and confirm feature ordering.
- Recreate the quantile-stratified train/validation split.
- Train the SVR pipeline, persist artefacts, and validate metrics.
- Visualise residuals, error distributions, and kernel behaviour.
- Capture experiment notes for future improvements.

In [None]:
"""Environment imports aligned with the production pipeline."""
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from IPython.display import display

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import PredictionErrorDisplay

from src.config import CONFIG as SVR_CONFIG, SVRConfig
from src.data import load_dataset, build_features, train_validation_split
from src.pipeline import CaliforniaHousingSVRPipeline, train_and_persist

In [None]:
sns.set_theme(style="whitegrid")
config: SVRConfig = SVR_CONFIG
raw_df = load_dataset(config)
display(raw_df.head())
print(f"Total rows: {len(raw_df):,}")
print("Missing values per column:")
display(raw_df.isna().sum().sort_values(ascending=False))

## 1. Dataset Overview

Columns are normalised to snake_case so they align with `SVRConfig.feature_columns`. The target `median_house_value` is expressed in units of 100,000 USD, so a value of 2.5 corresponds to 250,000 USD.

In [None]:
X, y = build_features(raw_df, config)
print(f"Features shape: {X.shape}")
print("Target summary:")
display(y.describe())

### Correlation snapshot

A quick sanity check for multicollinearity among the features before fitting the SVR.

In [None]:
corr = X.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=False, cmap="coolwarm", square=True)
plt.title("Feature correlation heatmap")
plt.tight_layout()
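
A heatmap is easy to scan but hard to read precisely, so it can help to list the strongly correlated pairs explicitly. The following is a minimal sketch, not part of the `src/` package: the helper name `top_correlated_pairs`, the 0.8 threshold, and the toy frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(features: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy frame (illustrative only): `b` is nearly a multiple of `a`, `c` is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.1, size=500),
        "c": rng.normal(size=500),
    }
)
high_pairs = top_correlated_pairs(toy)
print(high_pairs)
```

In this notebook the same helper could be called as `top_correlated_pairs(X)` on the feature frame returned by `build_features`.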

## 2. Train/Validation Split

Replicate the quantile-stratified 80/20 split used by `src/train.py` to keep notebook metrics in sync with the CLI workflow.

In [None]:
X_train, X_val, y_train, y_val = train_validation_split(config)
print(f"Train size: {X_train.shape[0]:,} | Validation size: {X_val.shape[0]:,}")
print("Target quantiles (train):")
display(y_train.quantile([0.1, 0.5, 0.9]))
print("Target quantiles (validation):")
display(y_val.quantile([0.1, 0.5, 0.9]))
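
The internals of `train_validation_split` are not shown here. One common way to build a quantile-stratified split for a continuous target is to bin the target with `pd.qcut` and pass the bin labels to `train_test_split` via `stratify`. The sketch below assumes 10 bins, a fixed seed of 42, and a hypothetical helper name; the production code may differ on all three.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def quantile_stratified_split(X, y, n_bins=10, test_size=0.2, random_state=42):
    """Split so each target quantile bin is represented proportionally in both folds."""
    # qcut assigns every target value to one of `n_bins` roughly equal-sized bins;
    # duplicates="drop" merges bins whose edges coincide because of repeated values.
    bins = pd.qcut(y, q=n_bins, labels=False, duplicates="drop")
    return train_test_split(
        X, y, test_size=test_size, stratify=bins, random_state=random_state
    )


# Synthetic, right-skewed target (illustrative only).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
y_demo = pd.Series(rng.lognormal(size=1000), name="target")

X_tr, X_va, y_tr, y_va = quantile_stratified_split(X_demo, y_demo)
print(f"Train: {len(X_tr)} | Validation: {len(X_va)}")
print(f"Median (train): {y_tr.median():.3f} | Median (validation): {y_va.median():.3f}")
```

Stratifying on bins rather than on the raw target is what keeps the 10%, 50%, and 90% quantiles of the two folds close, which is the property the cell above checks.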

## 3. Train the Production Pipeline

Instantiate `CaliforniaHousingSVRPipeline`, fit it on the training fold, and persist the artefacts. Rerun this cell after tweaking hyperparameters or preprocessing to regenerate weights and metrics.

In [None]:
pipeline = CaliforniaHousingSVRPipeline(config)
metrics = pipeline.train()
artifact_path = pipeline.save()
metrics_path = pipeline.write_metrics(metrics)
print("Training metrics:")
display(metrics)
print(f"Model artifact: {artifact_path}")
print(f"Metrics file: {metrics_path}")
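
The internals of `CaliforniaHousingSVRPipeline` live in `src/pipeline.py` and are not reproduced in this notebook. Later cells access `pipeline.pipeline.named_steps['regressor']`, which suggests a scikit-learn `Pipeline` with a step named `regressor`. The sketch below is a guess at a comparable layout, not the production code: the `scaler` step, the hyperparameter values, and the `joblib` persistence are assumptions. Feature scaling is included because kernel SVR is distance-based and sensitive to feature magnitudes.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Assumed layout: standardise features, then fit an RBF-kernel SVR
# in a step named "regressor" (the only step name confirmed by the notebook).
sketch = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("regressor", SVR(kernel="rbf", C=10.0, epsilon=0.1)),
    ]
)

# Synthetic data with deliberately mismatched feature scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = np.sin(X_demo[:, 0]) + 0.01 * X_demo[:, 1]

sketch.fit(X_demo, y_demo)

# Persist to a temporary directory and reload to confirm the round trip.
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact = os.path.join(tmp_dir, "svr_sketch.joblib")
    joblib.dump(sketch, artifact)
    restored = joblib.load(artifact)

same_predictions = np.allclose(sketch.predict(X_demo), restored.predict(X_demo))
print(f"Reloaded model reproduces predictions: {same_predictions}")
```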

In [None]:
y_val_pred = pipeline.pipeline.predict(X_val)
eval_metrics = {
    'r2': float(r2_score(y_val, y_val_pred)),
    'rmse': float(np.sqrt(mean_squared_error(y_val, y_val_pred))),
    'mae': float(mean_absolute_error(y_val, y_val_pred)),
}
display(eval_metrics)
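
Because the target is stored in units of 100,000 USD, RMSE and MAE are reported in those units too, while R² is unitless. A small helper can make the errors easier to read. This is a sketch with a hypothetical function name and made-up metric values, not figures produced by the model.

```python
def metrics_in_dollars(metrics: dict, unit: float = 100_000.0) -> dict:
    """Scale error metrics from target units to dollars; R^2 is left unchanged."""
    scaled = dict(metrics)
    for key in ("rmse", "mae"):
        if key in scaled:
            scaled[key] = scaled[key] * unit
    return scaled


# Made-up values, used only to show the conversion.
example = {"r2": 0.75, "rmse": 0.5, "mae": 0.25}
print(metrics_in_dollars(example))
# → {'r2': 0.75, 'rmse': 50000.0, 'mae': 25000.0}
```

In this notebook the call would be `metrics_in_dollars(eval_metrics)`.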

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(y_val - y_val_pred, bins=40, color="steelblue", alpha=0.8)
axes[0].set_title("Residual histogram")
axes[0].set_xlabel("Error (actual - predicted)")
axes[0].set_ylabel("Frequency")
PredictionErrorDisplay.from_predictions(y_val, y_val_pred, kind="actual_vs_predicted", ax=axes[1])
axes[1].plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], linestyle='--', color='grey', alpha=0.6)
axes[1].set_title("Actual vs. Predicted")
plt.tight_layout()

## 4. Support Vector Diagnostics

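Unlike a classifier such as SVC, SVR has no per-class support vector counts, but the total number of support vectors is still a useful gauge of model complexity: the more training points the model retains, the more complex the fitted function.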

In [None]:
svr = pipeline.pipeline.named_steps['regressor']
print(f"Support vectors: {svr.support_.shape[0]}")
print('Dual coefficients summary:')
display(pd.Series(svr.dual_coef_[0]).describe())
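
In epsilon-SVR, only training points that lie on or outside the epsilon-tube around the fitted function become support vectors, so a wider tube usually means fewer of them. The sketch below illustrates this on synthetic data; the data, the `C` value, and the epsilon grid are assumptions chosen for illustration and are unrelated to the production configuration.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve (illustrative only).
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=200)

support_fraction = {}
for epsilon in (0.01, 0.1, 0.5):
    model = SVR(kernel="rbf", C=1.0, epsilon=epsilon).fit(X_demo, y_demo)
    # support_ holds the indices of the training samples kept as support vectors.
    support_fraction[epsilon] = len(model.support_) / len(X_demo)

for epsilon, fraction in support_fraction.items():
    print(f"epsilon={epsilon}: {fraction:.0%} of training points are support vectors")
```

For the model trained above, the equivalent ratio is `len(svr.support_) / len(X_train)`. A value close to 1 means nearly every training point is retained, which also makes prediction slower.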

## 5. Experiment Log

- **Parameter sweeps**: log results for different combinations of `C`, `epsilon`, and kernel choices.
- **Feature engineering**: evaluate log-scaled population or derived ratios (rooms per bedroom) and note metric shifts.
- **Monitoring**: track rolling RMSE on fresh validation slices to detect drift.
- **Prediction intervals**: experiment with residual bootstrapping or conformal methods to provide uncertainty estimates.
- **Batch scoring**: document CLI or Airflow jobs that reuse `CaliforniaHousingService` for offline inference.
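
For the prediction-interval item, split conformal prediction is one of the simpler options to try first. The sketch below uses absolute residuals on a held-out calibration fold to set a constant interval half-width. It runs on synthetic data with assumed hyperparameters and fold sizes; it is a starting point under those assumptions, not the project's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Noisy sine curve (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 5, size=(1500, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.2, size=1500)

# Three disjoint folds: fit the model, calibrate the interval width, check coverage.
X_fit, X_rest, y_fit, y_rest = train_test_split(
    X_demo, y_demo, test_size=2 / 3, random_state=0
)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X_fit, y_fit)

alpha = 0.1  # target miscoverage, i.e. nominal 90% intervals
scores = np.abs(y_cal - model.predict(X_cal))  # absolute residuals on the calibration fold
n_cal = len(scores)
# Finite-sample corrected quantile level used by split conformal prediction.
level = min(1.0, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)
half_width = np.quantile(scores, level, method="higher")

pred = model.predict(X_test)
lower, upper = pred - half_width, pred + half_width
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"Half-width: {half_width:.3f} | empirical coverage: {coverage:.1%}")
```

The calibration fold must not overlap with the data used to fit the model; otherwise the residuals are optimistically small and the intervals undercover. The intervals here have the same width everywhere, so heteroscedastic targets may call for a normalised or quantile-based score instead.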