# Phase 1: Business Understanding

## 1) Overview
Build an Automated Valuation Model (AVM) that predicts **log(sale_price)** for California homes using pre-sale information (structure, location, nearby schools, listing text).  
**Data**: homes sold in **2020**; **test** homes occur **later in time** than **train**.  
**Official metric**: **RMSE between log(predicted price) and log(actual price)** (i.e., percent-style error).

---

## 2) Problem → Decision
**Who acts & how**
- **Investors / agents** (weekly): prioritize properties where predicted fair value exceeds asking by **≥ 5%**; trigger deeper due diligence.
- **Appraisal QA** (daily): **flag** listings with absolute residual **≥ 10%** for manual review.
- *(Optional) Planners* (quarterly): monitor areas with systematic over/undervaluation trends.

**Why it matters**
- Reduces overpay risk and missed-opportunity risk.
- Focuses expert time on high-value reviews.
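
The two decision rules above can be sketched as simple threshold checks. This is a minimal illustration, not the production logic: the column names (`asking_price`, `predicted_value`) and the three listings are hypothetical, and the QA residual is assumed to be measured relative to the asking price.

```python
import pandas as pd

# Hypothetical listings; column names are illustrative, not from the dataset.
listings = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "asking_price": [500_000.0, 800_000.0, 1_200_000.0],
    "predicted_value": [540_000.0, 810_000.0, 1_050_000.0],
})

INVESTOR_THRESHOLD = 0.05  # predicted fair value exceeds asking by >= 5%
QA_THRESHOLD = 0.10        # absolute gap >= 10% triggers manual review

# Relative gap between predicted value and asking price (assumed reference).
gap = (listings["predicted_value"] - listings["asking_price"]) / listings["asking_price"]
listings["investor_flag"] = gap >= INVESTOR_THRESHOLD
listings["qa_flag"] = gap.abs() >= QA_THRESHOLD
```

Both thresholds are kept as named constants so they can be tuned later without touching the flagging logic.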

---

## 3) Scope & Assumptions
- **Unit of analysis**: individual **listing / house** at **listing time**.  
- **Features used**: bedrooms/baths, living area, geolocation, nearby schools (as provided), **seller summary text** (pre-sale only).  
- **No external data** beyond the provided files.  
- Dataset is **static**; no streaming/real-time requirements in this phase.

---

## 4) Success Criteria
**Technical (primary)**
- **log-RMSE ≤ 0.25** on **out-of-time** test.
- **Median APE ≤ 15%** across price quintiles.
- **80% prediction-interval (PI) coverage within ±25%** of price.
- **Calibration**: Expected Calibration Error (ECE) ≤ **0.05** (on log/percent scale).

**Business (secondary)**
- **Top-K targeting** improves negotiation savings or review yield by **≥ 10%** vs baseline.
- **False-flag rate ≤ 20%** in QA workflow.

**Guardrails**
- **Parity**: gap in median APE across **coastal vs inland** and **price quintiles** ≤ **5 percentage points**.
- **Coverage**: predictions produced for **≥ 98%** of eligible listings.
- **Latency/SLA**: batch score **20k** listings in **≤ 5 minutes**.

> **Acceptance checklist:**  
> - [ ] log-RMSE met  
> - [ ] APE met  
> - [ ] PI coverage met  
> - [ ] Calibration met  
> - [ ] Business lift met  
> - [ ] Guardrails met
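
As a rough sketch of how two of the technical criteria could be checked, the snippet below computes median APE per price quintile and the share of actual prices that fall inside a ±25% band around the prediction. The data is synthetic, and reading the PI criterion as "at least 80% of actuals within ±25% of the prediction" is an assumption about the intended meaning.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
y_true = np.exp(rng.normal(13.5, 0.6, n))           # synthetic sale prices
y_pred = y_true * np.exp(rng.normal(0.0, 0.15, n))  # synthetic predictions

# Assumed reading of the PI criterion: a symmetric +/-25% band around y_pred.
half_width = 0.25
y_lo, y_hi = y_pred * (1 - half_width), y_pred * (1 + half_width)

# Absolute percentage error, summarized per actual-price quintile.
ape = np.abs(y_pred - y_true) / y_true
quintile = pd.qcut(y_true, q=5, labels=False)
median_ape_by_quintile = pd.Series(ape).groupby(quintile).median()

# Share of actual prices that land inside the band.
coverage = float(np.mean((y_true >= y_lo) & (y_true <= y_hi)))

checks = {
    "median_ape_le_15pct": bool((median_ape_by_quintile <= 0.15).all()),
    "pi_coverage_ge_80pct": bool(coverage >= 0.80),
}
```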

---

## 5) Generalization & Validation Plan
- **Primary**: **Temporal generalization** (future months).  
  - **Rolling-origin CV** (e.g., train Jan–Jun → validate Jul; then Jan–Jul → validate Aug …).  
  - Final evaluation on the **held-out later-in-time test**.
- **Secondary**: **Spatial robustness**.  
  - Within each fold, **leave-geo-cluster-out** (ZIP/tract or H3/S2 tiles) to reduce near-duplicate comp leakage.
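
The rolling-origin scheme can be sketched with an expanding monthly window. The helper name `rolling_origin_folds` and the weekly dates are hypothetical; the dataset shown later in this notebook does not obviously expose a listing-date column, so a real date field is assumed to exist.

```python
import numpy as np
import pandas as pd

def rolling_origin_folds(dates, first_val_month, last_val_month):
    """Yield (val_month, train_idx, val_idx) with an expanding training window.

    Every row dated strictly before the validation month is used for training.
    """
    months = pd.Series(pd.to_datetime(dates)).dt.to_period("M")
    for val_month in pd.period_range(first_val_month, last_val_month, freq="M"):
        train_idx = np.flatnonzero((months < val_month).to_numpy())
        val_idx = np.flatnonzero((months == val_month).to_numpy())
        if len(train_idx) and len(val_idx):
            yield val_month, train_idx, val_idx

# Hypothetical listing dates: one per week through 2020.
dates = pd.date_range("2020-01-01", "2020-12-31", freq="7D")
folds = list(rolling_origin_folds(dates, "2020-07", "2020-09"))
```

Each fold trains on all earlier months and validates on the next one, so no validation row is ever older than a training row.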

---

## 6) Risks & Mitigations

| Risk | Example | Mitigation |
|---|---|---|
| **Post-event leakage** | Days-on-market after contract; price changes near closing | Restrict to features **available at listing time**; freeze extraction timestamp |
| **Duplicates / near-dupes** | Same home relisted | Deduplicate by parcel/address; keep earliest listing per sale |
| **Market regime shift** | Pandemic-era price spikes | Rolling-origin CV; drift tests; **monthly/quarterly** refresh plan |
| **Outliers / heavy tails** | Extreme luxury sales | Train/evaluate on **log(price)**; winsorize features; robust losses where applicable |
| **Proxy & fairness risk** | Location/schools encode socioeconomic patterns | Track error parity across regions & price bands; document use-policy; human review of explanations |
| **Misuse** | Underwriting/eligibility decisions | **Not for credit eligibility** without compliance/legal approval |

---

## 7) Operational Constraints (Phase-appropriate)
- **Batch scoring**: API or notebook batch; ≤ **5 min** for **20k** rows.  
- **Refresh cadence**: **Monthly** retrain proposal; revisit based on drift.  
- **Ownership**: DS lead (model), Data Eng (pipeline), Product (consumer workflows).  
- **Monitoring**: input data quality (missing/validity), **log-RMSE**, APE by segment, PI coverage, calibration, parity gaps; alert & **rollback** if log-RMSE ↑ **>10%** or parity gap > **5pp** for **7 days**.
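
The alert rule in the monitoring bullet could look roughly like the check below. It is only a sketch: the function name `should_rollback`, the column names, and the reading of "for 7 days" as seven consecutive breaching days are assumptions.

```python
import pandas as pd

def should_rollback(daily, baseline_log_rmse, window=7,
                    rmse_rel_increase=0.10, parity_gap_max=0.05):
    """Return True if either threshold is breached on `window` consecutive days."""
    breach = (
        (daily["log_rmse"] > baseline_log_rmse * (1 + rmse_rel_increase))
        | (daily["parity_gap"] > parity_gap_max)
    )
    # The rolling sum equals `window` only when every day in the window breached.
    consecutive = breach.astype(int).rolling(window).sum()
    return bool((consecutive == window).any())

# Hypothetical daily metrics: three healthy days followed by seven degraded days.
degraded = pd.DataFrame({
    "log_rmse": [0.26] * 3 + [0.30] * 7,
    "parity_gap": [0.02] * 10,
})
# Hypothetical daily metrics: six degraded days, then recovery.
recovered = pd.DataFrame({
    "log_rmse": [0.30] * 6 + [0.26] * 4,
    "parity_gap": [0.02] * 10,
})

rollback_degraded = should_rollback(degraded, baseline_log_rmse=0.25)
rollback_recovered = should_rollback(recovered, baseline_log_rmse=0.25)
```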

---

## 8) Deliverables (for this phase)
- This **Business Understanding README** (decision, scope, success, validation, risks).  
- **Use-policy** & governance note (fair housing/ECOA awareness; not legal advice).  
- Initial KPI & monitoring spec (what will be tracked post-deployment).

---

## 9) Out of Scope
- External data integrations, real-time serving, and UI build-out.  
- Final model selection, hyperparameter tuning, or deployment architecture (covered in later CRISP-DM phases).

---

### Notes on Metric Interpretation
Because the objective uses **log-RMSE**, improvements translate to **proportional** error reductions. For intuition: a **10%** price miss on a $500k home ≈ **$50k** impact. Decision thresholds (5%, 10%) are **initial** and will be tuned via cost-benefit analysis in the **Evaluation** phase.
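
To make the proportional reading concrete, the sketch below computes log-RMSE for hypothetical predictions that are all 10% too high. The helper name `log_rmse` and the prices are illustrative only. By the same conversion, the 0.25 target corresponds to a typical price ratio of about exp(0.25) ≈ 1.28.

```python
import numpy as np

def log_rmse(y_true, y_pred):
    """RMSE between natural-log prices; inputs are prices in dollars (> 0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices: every prediction is 10% above the actual sale price.
actual = np.array([250_000.0, 500_000.0, 1_000_000.0])
predicted = actual * 1.10

score = log_rmse(actual, predicted)       # equals log(1.10) for every row
implied_pct = float(np.exp(score) - 1.0)  # convert the log gap back to a ratio
```

Because the same 10% miss yields the same log error at every price level, the metric does not let expensive homes dominate the score.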

---

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv('./housing_prices_dataset/train.csv')
print(train.head())

   Id            Address  Sold Price  \
0   0        540 Pine Ln   3825000.0   
1   1     1727 W 67th St    505000.0   
2   2     28093 Pine Ave    140000.0   
3   3  10750 Braddock Dr   1775000.0   
4   4  7415 O Donovan Rd   1175000.0   

                                             Summary          Type  \
0  540 Pine Ln, Los Altos, CA 94022 is a single f...  SingleFamily   
1  HURRY, HURRY.......Great house 3 bed and 2 bat...  SingleFamily   
2  'THE PERFECT CABIN TO FLIP!  Strawberry deligh...  SingleFamily   
3  Rare 2-story Gated 5 bedroom Modern Mediterran...  SingleFamily   
4  Beautiful 200 acre ranch land with several pas...    VacantLand   

   Year built                                       Heating  \
0      1969.0  Heating - 2+ Zones, Central Forced Air - Gas   
1      1926.0                                   Combination   
2      1958.0                                    Forced air   
3      1947.0                                       Central   
4         NaN          

## Phase 2: Data Understanding
Performing EDA to explore data distributions, missing values, and correlations.

In [5]:
train.describe()

| | Id | Sold Price | Year built | Lot | Bathrooms | Full bathrooms | Total interior livable area | Total spaces | Garage spaces | Elementary School Score | Elementary School Distance | Middle School Score | Middle School Distance | High School Score | High School Distance | Tax assessed value | Annual tax amount | Listed Price | Last Sold Price | Zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 47439.0 | 47439.0 | 46394.0 | 33258.0 | 43974.0 | 39574.0 | 44913.0 | 46523.0 | 46522.0 | 42543.0 | 42697.0 | 30734.0 | 30735.0 | 42220.0 | 42438.0 | 43787.0 | 43129.0 | 47439.0 | 29673.0 | 47439.0 |
| mean | 23719.0 | 1296050.0 | 1956.634888 | 235338.3 | 2.355642 | 2.094961 | 5774.587 | 1.567117 | 1.491746 | 5.720824 | 1.152411 | 5.317206 | 1.691593 | 6.134344 | 2.410366 | 786311.8 | 9956.843817 | 1315890.0 | 807853.7 | 93279.178587 |
| std | 13694.604047 | 1694452.0 | 145.802456 | 11925070.0 | 1.188805 | 0.96332 | 832436.3 | 9.011608 | 8.964319 | 2.10335 | 2.332367 | 2.002768 | 2.462879 | 1.984711 | 3.59612 | 1157796.0 | 13884.254976 | 2628695.0 | 1177903.0 | 2263.459104 |
| min | 0.0 | 100500.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | -15.0 | -15.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 85611.0 |
| 25% | 11859.5 | 565000.0 | 1946.0 | 4991.0 | 2.0 | 2.0 | 1187.0 | 0.0 | 0.0 | 4.0 | 0.3 | 4.0 | 0.6 | 5.0 | 0.8 | 254961.5 | 3467.0 | 574500.0 | 335000.0 | 90220.0 |
| 50% | 23719.0 | 960000.0 | 1967.0 | 6502.0 | 2.0 | 2.0 | 1566.0 | 1.0 | 1.0 | 6.0 | 0.5 | 5.0 | 1.0 | 6.0 | 1.3 | 547524.0 | 7129.0 | 949000.0 | 598000.0 | 94114.0 |
| 75% | 35578.5 | 1525000.0 | 1989.0 | 10454.0 | 3.0 | 2.0 | 2142.0 | 2.0 | 2.0 | 7.0 | 1.0 | 7.0 | 1.8 | 8.0 | 2.4 | 937162.5 | 12010.0 | 1498844.0 | 950000.0 | 95073.0 |
| max | 47438.0 | 90000000.0 | 9999.0 | 1897474000.0 | 24.0 | 17.0 | 176416400.0 | 1000.0 | 1000.0 | 10.0 | 57.2 | 9.0 | 57.2 | 10.0 | 73.9 | 45900000.0 | 552485.0 | 402532000.0 | 90000000.0 | 96155.0 |


In [4]:
train.isnull().sum()

| Column | Missing values |
|---|---|
| Id | 0 |
| Address | 0 |
| Sold Price | 0 |
| Summary | 354 |
| Type | 0 |
| Year built | 1045 |
| Heating | 6852 |
| Cooling | 20694 |
| Parking | 1374 |
| Lot | 14181 |


### Schema & Data Quality Inventory

In [23]:
import numpy as np, pandas as pd

def dq_inventory(df: pd.DataFrame):
    rows = []
    for c in df.columns:
        s = df[c]
        is_num = pd.api.types.is_numeric_dtype(s)
        nonnull = s.notna().sum()
        miss = s.isna().sum()
        miss_pct = miss / len(df)
        nunique = s.nunique(dropna=True)
        card_ratio = (nunique / nonnull) if nonnull else np.nan
        zeros = int((s == 0).sum()) if is_num else np.nan
        negs  = int((s < 0).sum()) if is_num else np.nan
        stats = {
            "feature": c, "dtype": str(s.dtype), "non_null": nonnull,
            "missing": miss, "missing_pct": round(miss_pct, 4),
            "unique": nunique, "card_ratio": round(card_ratio, 4) if pd.notna(card_ratio) else np.nan,
            "zeros": zeros, "negatives": negs
        }
        if is_num:
            q = s.quantile([.01,.05,.5,.95,.99])
            stats.update({
                "min": s.min(), "p01": q.loc[.01], "p05": q.loc[.05],
                "median": q.loc[.5], "p95": q.loc[.95], "p99": q.loc[.99], "max": s.max(),
                "mean": s.mean(), "std": s.std()
            })
        else:
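            if s.dtype == object:
                # NaN is stringified to "nan" here, so missing values count as length 3;
                # the .replace("nan", np.nan) acts on the integer lengths and is a no-op.
                stats["avg_len"] = s.astype(str).str.len().replace("nan", np.nan).astype(float).mean()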
        rows.append(stats)
    return pd.DataFrame(rows).sort_values(["missing_pct","feature"], ascending=[False, True])

dq_train = dq_inventory(train)
dq_train.head(20)


| | feature | dtype | non_null | missing | missing_pct | unique | card_ratio | zeros | negatives | min | p01 | p05 | median | p95 | p99 | max | mean | std | avg_len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | Cooling features | object | 25216 | 22223 | 0.4685 | 311 | 0.0123 | | | | | | | | | | | | 6.601868 |
| 7 | Cooling | object | 26745 | 20694 | 0.4362 | 540 | 0.0202 | | | | | | | | | | | | 8.568098 |
| 36 | Last Sold On | object | 29673 | 17766 | 0.3745 | 6113 | 0.206 | | | | | | | | | | | | 7.378486 |
| 37 | Last Sold Price | float64 | 29673 | 17766 | 0.3745 | 3979 | 0.1341 | 4.0 | 0.0 | 0.0 | 22500.0 | 113500.0 | 598000.0 | 2050000.0 | 4287940.0 | 90000000.0 | 807853.711152 | 1177903.0 | |
| 20 | Middle School | object | 30735 | 16704 | 0.3521 | 488 | 0.0159 | | | | | | | | | | | | 17.706697 |
| 22 | Middle School Distance | float64 | 30735 | 16704 | 0.3521 | 226 | 0.0074 | 18.0 | 0.0 | 0.0 | 0.1 | 0.3 | 1.0 | 5.5 | 10.8 | 57.2 | 1.691593 | 2.462879 | |
| 21 | Middle School Score | float64 | 30734 | 16705 | 0.3521 | 9 | 0.0003 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 | 5.0 | 9.0 | 9.0 | 9.0 | 5.317206 | 2.002768 | |
| 30 | Laundry features | object | 32828 | 14611 | 0.308 | 1975 | 0.0602 | | | | | | | | | | | | 14.644175 |
| 9 | Lot | float64 | 33258 | 14181 | 0.2989 | 8205 | 0.2467 | 2.0 | 0.0 | 0.0 | 612.71 | 1598.0 | 6502.0 | 151803.91 | 871200.0 | 1897474000.0 | 235338.259388 | 11925070.0 | |
| 29 | Appliances included | object | 33846 | 13593 | 0.2865 | 4583 | 0.1354 | | | | | | | | | | | | 37.543203 |


### Duplicates & Near-Duplicates

In [24]:
dupe_exact = train.duplicated().sum()

aliases = [
    ["parcel_id"], ["address","city","zip"], ["lat","lon"], ["latitude","longitude"]
]
dupe_keys = {}
for keys in aliases:
    if set(keys).issubset(train.columns):
        dupe_keys[tuple(keys)] = int(train.duplicated(subset=keys).sum())

# Near-dupe heuristic: same (lat,lon) rounded + same beds/baths/sqft
cands = {}
lat = [c for c in ["lat","latitude"] if c in train.columns]
lon = [c for c in ["lon","lng","longitude"] if c in train.columns]
beds = [c for c in ["bedrooms","beds"] if c in train.columns]
baths = [c for c in ["bathrooms","baths"] if c in train.columns]
sqft = [c for c in ["sqft","living_area","area"] if c in train.columns]

if lat and lon:
    df = train.copy()
    df["_latr"] = (df[lat[0]].astype(float).round(4))
    df["_lonr"] = (df[lon[0]].astype(float).round(4))
    grp_cols = ["_latr","_lonr"] + beds[:1] + baths[:1] + sqft[:1]
    dup_near = int(df.duplicated(subset=grp_cols).sum())
else:
    dup_near = None

dupe_exact, dupe_keys, dup_near


(np.int64(0), {}, None)

### Missingness Mechanism

In [26]:
# Missingness matrix + simple associations
miss_flags = pd.DataFrame({f"{c}_isna": train[c].isna().astype(int) for c in train.columns})
# Correlate missingness with numeric covariates (proxy for MAR)
num_cols = train.select_dtypes(include=[np.number]).columns.tolist()
miss_assoc = (miss_flags.join(train[num_cols])
              .corr().loc[[c for c in miss_flags.columns], num_cols].abs().max(axis=1).sort_values(ascending=False))
miss_assoc.head(20)

| Missingness flag | Max abs. correlation with a numeric column |
|---|---|
| High School Distance_isna | 0.258765 |
| High School_isna | 0.258765 |
| High School Score_isna | 0.252793 |
| Middle School_isna | 0.233853 |
| Middle School Score_isna | 0.233853 |
| Middle School Distance_isna | 0.233853 |
| Cooling features_isna | 0.232637 |
| Cooling_isna | 0.224214 |
| Lot_isna | 0.202012 |
| Bedrooms_isna | 0.187496 |


## Phase 3: Data Preparation
Impute missing values, encode categoricals, and scale numeric data.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

X = train.drop('Sold Price', axis=1)
y = train['Sold Price']

num_features = X.select_dtypes(include=['int64','float64']).columns
cat_features = X.select_dtypes(exclude=['int64','float64']).columns

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, num_features),
    ('cat', categorical_transformer, cat_features)])

In [9]:
# ==== 0) Imports ====
import numpy as np, pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import OneHotEncoder, RobustScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin

# ==== 1) Quarantine obvious leakage cols (adjust to your schema) ====
LEAKY_KEYS = [c for c in train.columns if any(k in c.lower() for k in
    ["sold_", "sale_", "closing", "days_on_market", "dom", "pending", "status", "price_change"])]
X = train.drop(columns=LEAKY_KEYS + ['Sold Price'])
y_log = np.log(train['Sold Price'])  # natural log, aligns with evaluation

# ==== 2) Column buckets ====
num_features = X.select_dtypes(include=[np.number]).columns.tolist()
cat_features = X.select_dtypes(exclude=[np.number]).columns.tolist()

# Optional: identify a text field and lat/lon if present
text_col = next((c for c in ["seller_summary","description","remarks","listing_text"] if c in X.columns), None)
lat_col  = next((c for c in ["lat","latitude"] if c in X.columns), None)
lon_col  = next((c for c in ["lon","lng","longitude"] if c in X.columns), None)

if text_col and text_col in cat_features:
    cat_features.remove(text_col)

# ==== 3) Robust clipping transformer for heavy-tailed numerics ====
class QuantileClipper(BaseEstimator, TransformerMixin):
    def __init__(self, q_low=0.01, q_high=0.99):
        self.q_low, self.q_high = q_low, q_high
    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.lows_  = X.quantile(self.q_low)
        self.highs_ = X.quantile(self.q_high)
        return self
    def transform(self, X):
        X = pd.DataFrame(X).clip(self.lows_, self.highs_, axis=1)
        return X.values

# ==== 4) Rare-bucket encoder for high-card cats (min_frequency needs sklearn >=1.1; sparse_output needs >=1.2) ====
ohe = OneHotEncoder(
    handle_unknown="infrequent_if_exist",  # unseen categories go to the infrequent bucket if one exists, else encode as all zeros
    min_frequency=0.01,                    # or an absolute int like 50
    sparse_output=True
)

# ==== 5) Optional: simple text + geo features ====
# Text: sparse TF-IDF (kept simple; PII redaction handled upstream in Data Understanding)
from sklearn.feature_extraction.text import TfidfVectorizer
text_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="")),
    ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1,2), min_df=5))
]) if text_col else 'drop'

# Geo: coarse tiles via rounding (stand-in for H3/S2)
def make_geo(df):
    out = pd.DataFrame(index=df.index)
    if lat_col and lon_col:
        out["lat_bin"] = df[lat_col].round(2).astype(str)
        out["lon_bin"] = df[lon_col].round(2).astype(str)
    return out

geo_builder = FunctionTransformer(lambda df: make_geo(df), feature_names_out="one-to-one")
geo_ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=True)

geo_ct = Pipeline(steps=[
    ("builder", geo_builder),
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", geo_ohe)
]) if (lat_col and lon_col) else 'drop'

# ==== 6) Numeric & categorical pipelines ====
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
    ("clip", QuantileClipper(0.01, 0.99)),
    ("scaler", RobustScaler(with_centering=True, with_scaling=True))
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", ohe)
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
        ("text", text_transformer, text_col) if text_col else ("text","drop",[]),
        ("geo", geo_ct, [lat_col, lon_col] if (lat_col and lon_col) else [])
    ],
    remainder="drop",
    sparse_threshold=0.3
)

# ==== 7) Temporal validation scaffold (replace with your real date col) ====
# If you have listing/sold date in X (remove it from features to avoid leakage, but use for splitting)
date_col = next((c for c in ["listing_date","list_date","sold_date","sale_date"] if c in train.columns), None)
if date_col:
    order = pd.to_datetime(train[date_col]).argsort().values
    # Example rolling-origin CV; later you’ll use this in model selection
    cv = TimeSeriesSplit(n_splits=5)
# Else: define folds externally (e.g., by month) and pass to CV later.

# preprocessor is now ready to .fit/.transform within a modeling pipeline
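
The scaffold in step 7 computes `order` and `cv` but does not yet combine them. One way to do so is sketched below on a toy frame; the `listing_date` column is hypothetical and stands in for whichever real date column is available.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical frame; "listing_date" is an assumed column name.
df = pd.DataFrame({
    "listing_date": pd.date_range("2020-01-01", periods=60, freq="D")[::-1],  # deliberately unsorted
    "x": np.arange(60),
})

# Row positions sorted from oldest to newest listing.
order = pd.to_datetime(df["listing_date"]).argsort().values
cv = TimeSeriesSplit(n_splits=5)

folds = []
for tr_pos, va_pos in cv.split(order):
    # cv.split returns positions within the sorted order; map them back to rows.
    tr_idx, va_idx = order[tr_pos], order[va_pos]
    folds.append((tr_idx, va_idx))
```

The resulting `folds` list can be passed as the `cv` argument of scikit-learn model-selection utilities, provided the feature frame keeps the same row order as `df`.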

## Phase 4: Modeling

Training a regression model using the preprocessed data.


The training data is reduced to a random sample of 5,000 rows, then split 80/20 into training and test subsets.

In [11]:
from sklearn.model_selection import train_test_split
# Reduce the size of the training data by sampling
train_sampled = train.sample(n=5000, random_state=42)

X_sampled = train_sampled.drop('Sold Price', axis=1)
y_sampled = train_sampled['Sold Price']

# Split the sampled data
X_train_sampled, X_test_sampled, y_train_sampled, y_test_sampled = train_test_split(
    X_sampled, y_sampled, test_size=0.2, random_state=42
)

# Replace the original training data with the sampled data
X_train = X_train_sampled
y_train = y_train_sampled

print(f"Original training data size: {len(X)}")
print(f"Sampled training data size: {len(X_train)}")

Original training data size: 47439
Sampled training data size: 4000


In [17]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

model = RandomForestRegressor(n_estimators=100, random_state=42)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', model)])

pipeline.fit(X_train, y_train)

In [15]:
import numpy as np, pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline

# --- 0) Target, features, simple geo groups ---
y = train['Sold Price'].astype(float).values
X = train.drop(columns=['Sold Price']).copy()

lat = next((c for c in ['lat','latitude'] if c in X.columns), None)
lon = next((c for c in ['lon','lng','longitude'] if c in X.columns), None)
zipc = next((c for c in ['zip','zipcode','postal_code', 'Zip'] if c in X.columns), None) # Added 'Zip' based on data inspection

if lat and lon:
    groups = (X[lat].round(2).astype(str) + "_" + X[lon].round(2).astype(str)).values  # ~1–2km tiles
elif zipc:
    groups = X[zipc].astype(str).values
else:
    groups = np.zeros(len(X), dtype=int)  # fallback (document this)

# --- 1) One geo holdout test (80/20) ---
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=42)
tr_idx, te_idx = next(gss.split(X, y, groups=groups))

X_trcal, y_trcal, groups_trcal = X.iloc[tr_idx], y[tr_idx], groups[tr_idx]
X_te,    y_te,    groups_te    = X.iloc[te_idx], y[te_idx], groups[te_idx]

# --- 2) Small calibration split (from the 80%) ---
# Check if there are enough samples for the second split
if len(X_trcal) > 1: # Ensure there is at least 2 samples to split
    gss_cal = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)  # 25% of 80% ≈ 20% total
    try:
        tr_idx2, cal_idx = next(gss_cal.split(X_trcal, y_trcal, groups=groups_trcal))
        X_tr,  y_tr  = X_trcal.iloc[tr_idx2], y_trcal[tr_idx2]
        X_cal, y_cal = X_trcal.iloc[cal_idx], y_trcal[cal_idx]

        # --- 3) Simple model: Ridge on log(price) via TTR ---
        ridge_pipe = Pipeline([('preprocessor', preprocessor),
                               ('reg', Ridge(alpha=3.0, random_state=42))])

        model = TransformedTargetRegressor(regressor=ridge_pipe, func=np.log, inverse_func=np.exp)
        model.fit(X_tr, y_tr)

        # --- 4) Split-conformal 80% intervals (log-space symmetric) ---
        y_cal_pred = model.predict(X_cal)
        eps = np.abs(np.log(y_cal) - np.log(y_cal_pred))      # absolute log-residuals
        q80 = float(np.quantile(eps, 0.80))                   # 80th percentile

        def predict_with_pi(est, X_new, q=q80):
            mu = est.predict(X_new)
            lo = np.exp(np.log(mu) - q)
            hi = np.exp(np.log(mu) + q)
            return mu, lo, hi

        # --- 5) Evaluate on the geo holdout test ---
        y_pred, y_lo, y_hi = predict_with_pi(model, X_te)

        log_rmse = float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_te))**2)))
        ape = np.abs(y_pred - y_te) / y_te
        median_ape = float(np.median(ape))
        coverage80 = float(np.mean((y_te >= y_lo) & (y_te <= y_hi)))
        pi_width_pct = float(np.median((y_hi - y_lo) / y_pred))

        # Optional quick parity slices (price bands; add coastal/inland if you have lon)
        price_q = pd.qcut(y_te, q=5, labels=False, duplicates='drop')
        median_ape_by_q = pd.Series(ape).groupby(price_q).median()
        coverage_by_q = pd.Series(((y_te >= y_lo) & (y_te <= y_hi)).astype(int)).groupby(price_q).mean()

    except ValueError as e:
        print(f"Could not perform the second split for calibration: {e}")
        print("Skipping prediction interval calculation and evaluation metrics that depend on it.")
        # Define variables to avoid NameError later if needed, e.g.,
        y_pred = model.predict(X_te) if 'model' in locals() else None
        # computed with NumPy because mean_squared_error is not imported in this cell
        log_rmse = float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_te))**2))) if y_pred is not None else None
        median_ape = float(np.median(np.abs(y_pred - y_te) / y_te)) if y_pred is not None else None
        coverage80 = None
        pi_width_pct = None
        median_ape_by_q = None
        coverage_by_q = None
        y_lo = None
        y_hi = None

else:
    print(f"Not enough samples ({len(X_trcal)}) to perform the second split for calibration.")
    print("Skipping prediction interval calculation and evaluation metrics that depend on it.")
    # Define variables to avoid NameError later if needed, e.g.,
    # Assuming model is already trained from previous steps or can be trained on the full X_trcal if needed
    # For now, let's assume we train the model on X_trcal if we skip the calibration split
    ridge_pipe = Pipeline([('preprocessor', preprocessor),
                           ('reg', Ridge(alpha=3.0, random_state=42))])
    model = TransformedTargetRegressor(regressor=ridge_pipe, func=np.log, inverse_func=np.exp)
    model.fit(X_trcal, y_trcal)
    y_pred = model.predict(X_te)
    log_rmse = float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_te))**2)))
    ape = np.abs(y_pred - y_te) / y_te
    median_ape = float(np.median(ape))
    coverage80 = None
    pi_width_pct = None
    median_ape_by_q = None
    coverage_by_q = None
    y_lo = None
    y_hi = None
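
The cell above takes the plain 80th percentile of the calibration residuals. Standard split-conformal prediction instead uses the slightly higher level ⌈(n+1)(1−α)⌉/n to obtain a finite-sample coverage guarantee. The sketch below shows that variant on synthetic data; the helper name `conformal_log_halfwidth` is hypothetical, and `method="higher"` requires NumPy 1.22 or newer.

```python
import numpy as np

def conformal_log_halfwidth(y_cal, y_cal_pred, alpha=0.20):
    """Half-width in log space for a (1 - alpha) split-conformal interval."""
    scores = np.abs(np.log(y_cal) - np.log(y_cal_pred))  # absolute log-residuals
    n = len(scores)
    # Finite-sample corrected quantile level, capped at 1.0.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(scores, level, method="higher"))

rng = np.random.default_rng(0)

# Synthetic calibration set: predictions are the truth times log-normal noise.
y_cal = np.exp(rng.normal(13.5, 0.5, 500))
y_cal_pred = y_cal * np.exp(rng.normal(0.0, 0.2, 500))
q = conformal_log_halfwidth(y_cal, y_cal_pred)

# Fresh synthetic data drawn the same way, to check empirical coverage.
y_new = np.exp(rng.normal(13.5, 0.5, 2000))
pred_new = y_new * np.exp(rng.normal(0.0, 0.2, 2000))
lo, hi = pred_new * np.exp(-q), pred_new * np.exp(q)
coverage = float(np.mean((y_new >= lo) & (y_new <= hi)))
```

With a few hundred calibration rows the corrected level is very close to 0.80, so the difference from the plain percentile is small, but it matters for small calibration sets.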

## Phase 5: Evaluation
Evaluating model performance with log-RMSE and median APE.

In [22]:
import numpy as np
from sklearn.metrics import mean_squared_error

pipeline = model  # alias so the evaluation code works unchanged

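# predictions (prices in $)
# NOTE: X_test_sampled is drawn from `train`, so it can overlap the rows `model` was fit on;
# prefer the geo holdout (X_te, y_te) for a cleaner estimate.
y_pred = pipeline.predict(X_test_sampled)  # use full holdout if possible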

# clip to positive to avoid log issues (prices should already be >0)
eps = 1e-9
y_true_pos = np.clip(y_test_sampled, eps, None)
y_pred_pos = np.clip(y_pred, eps, None)

# log-RMSE (natural log)
log_rmse = np.sqrt(mean_squared_error(np.log(y_true_pos), np.log(y_pred_pos)))

# median absolute percent error (MAPE-like, as per your success criteria)
ape = np.abs(y_pred_pos - y_true_pos) / y_true_pos
median_ape = float(np.median(ape))

print(f"log-RMSE: {log_rmse:.4f}")
print(f"Median APE: {median_ape:.3%}")

log-RMSE: 0.3052
Median APE: 15.914%


## Phase 6: Deployment
Saving the fitted model to disk so it can be reloaded for prediction.

In [20]:
import joblib
joblib.dump(model, 'california_housing_model.pkl')
print('Model saved!')

Model saved!
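
Reloading and predicting could look like the round trip below. To stay self-contained it uses a small stand-in `Ridge` model and a temporary file rather than `california_housing_model.pkl`; the toy data and file name are illustrative only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-in for the fitted pipeline (hypothetical data).
X_toy = np.arange(20, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0
toy_model = Ridge(alpha=1.0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(toy_model, path)   # same call as in the cell above
    reloaded = joblib.load(path)   # restore the estimator from disk

# The reloaded estimator should reproduce the original predictions.
same_predictions = bool(np.allclose(toy_model.predict(X_toy), reloaded.predict(X_toy)))
```

A pickled estimator should be loaded with the same scikit-learn version that saved it, and only from trusted sources, since unpickling can execute arbitrary code.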
