# Lab: Trees and Model Stability

Trees are notorious for being **unstable**: small changes in the data can lead to noticeable, or even large, changes in the fitted tree. We're going to explore this phenomenon and a common rebuttal to it.

In the folder for this lab, there are three datasets that we used in class: the divorce, heart failure, and AirBnB price datasets.

1. Pick one of the datasets and appropriately clean it.
2. Perform a train-test split for a specific seed (save the seed for reproducibility). Fit a classification/regression tree and a linear model on the training data and evaluate their performance on the test data. Set aside the predictions these models make.
3. Repeat step 2 for three to five different seeds (save the seeds for reproducibility). How different are the trees that you get? How different are your linear model coefficients? Set aside the predictions these models make.
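
As a minimal sketch of steps 2 and 3, the loop below refits a tree and a linear model for several seeds and prints a compact summary of each fit. The synthetic data, the seed list, and `max_depth=3` are assumptions made for illustration, not part of the lab; swap in your cleaned dataset and your own settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; replace with your cleaned dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

SEEDS = [1, 2, 3, 4]  # illustrative seeds; record whichever ones you use
fits = {}
for seed in SEEDS:
    # A different seed gives a different train/test partition of the same data.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    tree = DecisionTreeRegressor(max_depth=3, random_state=seed).fit(X_tr, y_tr)
    lin = LinearRegression().fit(X_tr, y_tr)
    fits[seed] = {"tree": tree, "lin": lin}

    # Summarize each fit: the tree's root split and the linear coefficients.
    root_feature = tree.tree_.feature[0]
    root_threshold = tree.tree_.threshold[0]
    print(
        f"seed {seed}: root split x{root_feature} <= {root_threshold:.3f}, "
        f"leaves={tree.get_n_leaves()}, coefs={np.round(lin.coef_, 3)}"
    )
```

Comparing the printed root splits and coefficients across seeds gives a first impression of how much each model's structure moves. For a fuller view of a tree, `sklearn.tree.export_text` prints every split.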

Typically, the trees change by what appears to be a non-trivial amount from seed to seed, while the linear model coefficients vary much less. Often, the changes in the trees appear substantial.

But are they?

4. Instead of focusing on the tree or model coefficients, do three things:
    1. Make scatterplots of the predicted values on the test set from step 2 against the predicted values from the alternative models in step 3, separately for your trees and linear models. Do they appear reasonably similar?
    2. Compute the correlation between the predictions of your model from step 2 and the predictions of your alternative models from step 3, separately for your trees and linear models. Are they highly correlated or not?
    3. Run a simple linear regression of the predicted values on the test set from the alternative models on the predicted values from step 2, separately for your trees and linear models. Is the intercept close to zero? Is the slope close to 1? Is the $R^2$ close to 1?

5. Do linear models appear to have similar coefficients and predictions across train/test splits? Do trees?
6. True or false, and explain: "Even if the models end up having a substantially different appearance, the predictions they generate are often very similar."
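
The comparisons in step 4 can be sketched as follows for a pair of trees; the same pattern applies to linear models. This assumes both models are scored on the same rows (here, the baseline split's test set) so that the predictions line up, which is one reasonable reading of the instructions rather than the only one. The synthetic data, the seeds, `max_depth=4`, and the helper `fit_tree` are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; replace with your cleaned dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

def fit_tree(seed):
    """Hypothetical helper: fit a tree on a seed-specific training split."""
    X_tr, X_te, y_tr, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = DecisionTreeRegressor(max_depth=4, random_state=seed).fit(X_tr, y_tr)
    return model, X_te

base_tree, X_eval = fit_tree(seed=1)  # baseline model and its test set
alt_tree, _ = fit_tree(seed=2)        # alternative model from another split

# Score both models on the same rows so the two prediction vectors align.
p_base = base_tree.predict(X_eval)
p_alt = alt_tree.predict(X_eval)

# Correlation between the two sets of predictions.
corr = np.corrcoef(p_base, p_alt)[0, 1]

# Simple linear regression of alternative predictions on baseline predictions.
reg = LinearRegression().fit(p_base.reshape(-1, 1), p_alt)
intercept, slope = reg.intercept_, reg.coef_[0]
r2 = reg.score(p_base.reshape(-1, 1), p_alt)

print(f"corr={corr:.3f}  intercept={intercept:.3f}  slope={slope:.3f}  R^2={r2:.3f}")
```

Two caveats on this sketch. Some rows of `X_eval` will fall in the alternative model's training set, so its predictions there are partly in-sample. Also, in a simple linear regression the $R^2$ equals the squared correlation, so parts 2 and 3 of step 4 report related quantities. For the scatterplot, plotting `p_base` against `p_alt` with a 45-degree reference line is usually enough.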

In [None]:
# PART 1 — Load & clean the AirBnB dataset (self-contained, idempotent)
import pandas as pd
import numpy as np
from pathlib import Path

#paths
RAW_PATHS = [Path("data/airbnb_hw.csv"), Path("/mnt/data/airbnb_hw.csv")]
RAW_PATH = next((p for p in RAW_PATHS if p.exists()), None)
assert RAW_PATH is not None, "Could not find airbnb_hw.csv in data/ or /mnt/data/."

CLEAN_PATH = Path("data/airbnb_clean.csv")

#load
df = pd.read_csv(RAW_PATH)

#normalize column names
df.columns = (
    df.columns
      .str.strip()
      .str.lower()
      .str.replace(r"[^0-9a-zA-Z]+", "_", regex=True)
      .str.strip("_")
)

#helpers
def to_num_currency(s):
    return pd.to_numeric(
        s.astype(str).str.replace(r"[\$,]", "", regex=True).str.strip(),
        errors="coerce"
    )

def to_num_percent(s):
    out = pd.to_numeric(s.astype(str).str.replace("%", "", regex=False), errors="coerce")
    return out / 100.0

#parse common fields if present
for c in [c for c in df.columns if c in ["price","weekly_price","monthly_price","cleaning_fee","security_deposit"]]:
    df[c] = to_num_currency(df[c])

for c in [c for c in df.columns if c in ["host_response_rate","host_acceptance_rate","occupancy_rate"]]:
    df[c] = to_num_percent(df[c])

for c in [c for c in df.columns if c.startswith("review_scores")]:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# deduplicate rows (had AI help me catch that I almost missed this step)
df = df.drop_duplicates()

# drop columns with too much missingness (more than 40% missing)
missing_frac = df.isna().mean()
to_drop = missing_frac[missing_frac > 0.40].index.tolist()
if to_drop:
    df = df.drop(columns=to_drop)

#target presence & obvious invalid rows
if "price" in df.columns:
    df = df[pd.notna(df["price"]) & (df["price"] > 0)]
    # robust outlier trimming on price
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    upper = q3 + 3.0 * iqr
    lower = max(q1 - 3.0 * iqr, 0)
    df = df[(df["price"] >= lower) & (df["price"] <= upper)]

#impute remaining missing values
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()

for c in num_cols:
    if df[c].isna().any():
        df[c] = df[c].fillna(df[c].median())
for c in cat_cols:
    if df[c].isna().any():
        mode = df[c].mode(dropna=True)
        df[c] = df[c].fillna(mode.iloc[0] if not mode.empty else "unknown")

#drop obvious ID-like columns if present
for id_like in ["id","listing_id","host_id","scrape_id"]:
    if id_like in df.columns:
        df = df.drop(columns=id_like)

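# -------- save & report --------
# Note: num_cols/cat_cols were computed before the ID-like columns were dropped,
# so the two counts below can sum to more than df.shape[1].
print("Shape after cleaning:", df.shape)
print("Numeric columns:", len(num_cols), "| Categorical columns:", len(cat_cols))
# data/ may not exist yet if the raw file was found under /mnt/data
CLEAN_PATH.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(CLEAN_PATH, index=False)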
print(f"Saved cleaned dataset to: {CLEAN_PATH.resolve()}")



Shape after cleaning: (29831, 12)
Numeric columns: 8 | Categorical columns: 5
Saved cleaned dataset to: /workspaces/lab_tree_stability/data/airbnb_clean.csv


In [None]:
# PART 2 — split + tree vs linear (robust to sklearn version, memory safe)
# My kernel crashed on an earlier version, so this one avoids building a giant dense matrix where possible

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

#settings
SEED = 2025
CLEAN = Path("data/airbnb_clean.csv")   # produced in Part 1
assert CLEAN.exists(), "Run Part 1 first to create data/airbnb_clean.csv"

#load & basic X/y
df = pd.read_csv(CLEAN)
TARGET = "price"
assert TARGET in df.columns
X = df.drop(columns=[TARGET], errors="ignore").copy()
y = df[TARGET].astype(float)

# Drop obvious non-features if present (ids/long text or dates that slipped through)
for c in ["id","listing_id","host_id","scrape_id","host_since","name","description","neighborhood_overview"]:
    if c in X.columns:
        X.drop(columns=c, inplace=True)

#limit categorical OHE to low-cardinality
raw_cat_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()
card = X[raw_cat_cols].nunique(dropna=True) if raw_cat_cols else pd.Series(dtype=int)
LOW_CARD_MAX = 50
low_card_cats = [c for c in raw_cat_cols if card[c] <= LOW_CARD_MAX]
dropped_cats  = sorted(set(raw_cat_cols) - set(low_card_cats))
if dropped_cats:
    print("Dropping high-cardinality/text columns:", dropped_cats)
    X.drop(columns=dropped_cats, inplace=True)

num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = low_card_cats  # may be []

#split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=SEED
)

# preprocessors (needed AI help on this section)
numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=True, with_std=True)),
])

# sklearn version compatibility: prefer sparse_output, fall back to sparse
try:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=True)
except TypeError:  # older sklearn that only knows the `sparse` kwarg
    ohe = OneHotEncoder(handle_unknown="ignore", sparse=True)

categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", ohe),
])

preprocess_sparse = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, num_cols),
        ("cat", categorical_pipe, cat_cols),
    ],
    remainder="drop",
    sparse_threshold=1.0  # keep sparse if any block is sparse
)

#models, inspired from class notes
lin = Pipeline([
    ("pre", preprocess_sparse),              # LinearRegression can take CSR sparse
    ("model", LinearRegression()),
])

to_dense = FunctionTransformer(
    lambda Z: Z.toarray() if hasattr(Z, "toarray") else Z, accept_sparse=True
)

tree = Pipeline([
    ("pre", preprocess_sparse),
    ("to_dense", to_dense),                  # densify only for tree
    ("model", DecisionTreeRegressor(random_state=SEED)),
])

#fit
lin.fit(X_train, y_train)
tree.fit(X_train, y_train)

#evaluate
def report(name, y_true, y_pred):
    # Take the square root of the MSE directly: this works on every sklearn
    # version, whereas the `squared=` kwarg is deprecated in newer releases.
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae  = mean_absolute_error(y_true, y_pred)
    r2   = r2_score(y_true, y_pred)
    print(f"{name:>8} | RMSE {rmse:,.2f}  MAE {mae:,.2f}  R² {r2: .3f}")
    return {"rmse": rmse, "mae": mae, "r2": r2}

yhat_lin  = lin.predict(X_test)
yhat_tree = tree.predict(X_test)

m_lin  = report("Linear", y_test, yhat_lin)
m_tree = report("Tree",   y_test, yhat_tree)

# -------------------- save predictions for next parts --------------------
preds = pd.DataFrame(
    {"y_true": y_test.values, "y_pred_linear": yhat_lin, "y_pred_tree": yhat_tree},
    index=y_test.index
)
preds.to_csv("data/preds_part2_airbnb.csv", index=True)
print("Saved: data/preds_part2_airbnb.csv")



  Linear | RMSE 61.08  MAE 42.90  R²  0.494
    Tree | RMSE 69.22  MAE 45.95  R²  0.350
Saved: data/preds_part2_airbnb.csv
