### Current Modeling Status (using engineered features)

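- Train/test split: 80/20 on NYC listings (11,548 train / 2,888 test rows). The metrics below are on the full feature table, without the price and outlier filters.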
- Features:
  - Numeric: capacity, host features, reviews, availability, neighbourhood stats, lat/long, etc. (no price-based leakage).
  - Binary: room-type flags, `host_is_superhost`, `instant_bookable`.
  - Categorical: `borough`, `neighbourhood_name`, `room_type`, `property_type_grouped`, `capacity_bucket`, `host_listings_bucket`, `rating_bucket`.
- Preprocessing:
  - Numeric + binary → median imputation + StandardScaler.
  - Categorical → most-frequent imputation + one-hot.

**Baselines (same split)**  
- Global mean baseline: RMSE ≈ 2021, MAE ≈ 280  
- Neighbourhood mean baseline: RMSE ≈ 2024, MAE ≈ 287  

**Models**

- **RandomForestRegressor (main model)**  
  - RMSE ≈ 1032  
  - MAE ≈ 108  
  - R² ≈ 0.74  

- **HistGradientBoostingRegressor (same features)**  
  - RMSE ≈ 1040  
  - MAE ≈ 145  
  - R² ≈ 0.74  

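Random Forest roughly halves the RMSE of both baselines (≈ 1032 vs ≈ 2021) and cuts MAE from ≈ 280 to ≈ 108. Against HGB it is nearly tied on RMSE and R² but clearly better on MAE (≈ 108 vs ≈ 145), so we treat it as our main model going forward. HGB serves as an additional comparison model.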
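
The Random Forest RMSE is roughly ten times its MAE. That pattern is what one would expect when a small number of listings carry very large errors, because RMSE squares each error before averaging. The toy example below uses made-up error values, not data from this notebook, and the helper names `mae` and `rmse` are hypothetical.

```python
from math import sqrt

def mae(errors):
    """Mean absolute error from a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error from a list of signed errors."""
    return sqrt(sum(e * e for e in errors) / len(errors))

# Made-up errors: nine listings off by $50, one luxury listing off by $5,000.
typical = [50.0] * 9
with_outlier = typical + [5000.0]

print(f"no outlier   - MAE: {mae(typical):.1f}, RMSE: {rmse(typical):.1f}")
print(f"with outlier - MAE: {mae(with_outlier):.1f}, RMSE: {rmse(with_outlier):.1f}")
```

With identical errors the two metrics coincide; a single large miss moves RMSE far more than MAE.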

Feature importance (from the RF) shows that:
- Host-related features (`host_years`, `calculated_host_listings_count`) and hotel-style room types (`is_hotel_room`, `room_type_Hotel room`) are the strongest predictors.
- Neighbourhood price stats (`neigh_avg_price`) and location (`latitude`, `longitude`) contribute a smaller but still meaningful share.
- Reviews and availability metrics provide additional predictive power.

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from sklearn.ensemble import RandomForestRegressor  # or from sklearn.linear_model import Ridge
from sklearn.impute import SimpleImputer

In [3]:
# load the feature table you saved from features.ipynb
df = pd.read_csv("../data/processed/listing_features_engineered.csv")

print(df.shape)
df.head()

(14436, 49)


| | host_since | host_is_superhost | room_type | property_type | accommodates | bedrooms | beds | bathrooms | bathrooms_text | latitude | ... | neigh_listing_count | price_minus_neigh_mean | price_over_neigh_mean | price_minus_neigh_median | price_over_neigh_median | is_entire_home | is_private_room | is_shared_room | is_hotel_room | property_type_grouped |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-09-09 | 0 | Entire home/apt | Entire rental unit | 1 | 0 | 1 | 1.0 | 1 bath | 40.75356 | ... | 669 | -816.077728 | 0.227256 | -56.0 | 0.810811 | 1 | 0 | 0 | 0 | Entire rental unit |
| 1 | 2009-05-06 | 1 | Entire home/apt | Entire rental unit | 3 | 2 | 1 | 1.0 | 1 bath | 40.70935 | ... | 570 | -111.408772 | 0.462854 | -69.0 | 0.581818 | 1 | 0 | 0 | 0 | Entire rental unit |
| 2 | 2009-05-07 | 0 | Private room | Private room in condo | 1 | 1 | 1 | 1.0 | 1 shared bath | 40.80107 | ... | 255 | -106.254902 | 0.357024 | -69.0 | 0.460938 | 0 | 1 | 0 | 0 | Other |
| 3 | 2009-05-12 | 0 | Private room | Private room in rental unit | 1 | 2 | 2 | 1.0 | 1 shared bath | 40.78778 | ... | 255 | -92.254902 | 0.441742 | -55.0 | 0.570312 | 0 | 1 | 0 | 0 | Private room in rental unit |
| 4 | 2009-05-17 | 1 | Private room | Private room in guest suite | 2 | 1 | 2 | 1.0 | 1 private bath | 40.69194 | ... | 107 | 5.17757 | 1.024559 | 36.0 | 1.2 | 0 | 1 | 0 | 0 | Other |


In [4]:
import numpy as np
import pandas as pd

from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# -------------------------------------------------
# 1) Start from engineered features and CLEAN rows
# -------------------------------------------------
df_clean = df.copy()

rows_before = len(df_clean)

# keep reasonable prices only
df_clean = df_clean[df_clean["price"].between(10, 1000)]

# additional outlier filters on key numeric columns
df_clean = df_clean[
    (df_clean["accommodates"].between(1, 10)) &
    (df_clean["bedrooms"].between(0, 8)) &
    (df_clean["beds"].between(0, 10)) &
    (df_clean["bathrooms"].between(0, 5)) &
    (df_clean["review_scores_rating"].between(1, 5)) &
    (df_clean["availability_365"].between(0, 365)) &
    (df_clean["host_years"].between(0, 20)) &
    (df_clean["reviews_per_month"].between(0, 20))
]

print(f"Rows before cleaning: {rows_before}")
print(f"Rows after cleaning:  {len(df_clean)}")

# -------------------------------------------------
# 2) Define target and feature groups (no leakage)
# -------------------------------------------------
target_col = "price"
alt_target_col = "log_price"

categorical_cols = [
    "borough",
    "neighbourhood_name",
    "room_type",
    "property_type_grouped",
    "capacity_bucket",
    "host_listings_bucket",
    "rating_bucket",
]

binary_cols = [
    "host_is_superhost",
    "instant_bookable",
    "is_entire_home",
    "is_private_room",
    "is_shared_room",
    "is_hotel_room",
]

date_string_cols = ["host_since", "first_review", "last_review"]

# price-based leakage features we DO NOT use
price_leak_cols = [
    "price_per_accommodate",
    "price_per_bed",
    "price_per_bedroom",
    "price_minus_neigh_mean",
    "price_over_neigh_mean",
    "price_minus_neigh_median",
    "price_over_neigh_median",
    "estimated_revenue",
]

# id-like / raw text columns we don't want as numeric features
id_like_cols = ["listing_id", "neighbourhood_id", "host_id"]
extra_raw_cols = [
    "property_type",
    "bathrooms_text",
    "host_since_dt",
    "host_name",
] + id_like_cols

exclude_for_numeric = (
    [target_col, alt_target_col]
    + categorical_cols
    + binary_cols
    + date_string_cols
    + price_leak_cols
    + extra_raw_cols
)

numeric_cols = [
    c for c in df_clean.columns
    if c not in exclude_for_numeric
    and pd.api.types.is_numeric_dtype(df_clean[c])
]

print("\nNumeric feature columns:")
print(numeric_cols)

# -------------------------------------------------
# 3) Train/test split
# -------------------------------------------------
X = df_clean[categorical_cols + binary_cols + numeric_cols]
y = df_clean[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -------------------------------------------------
# 4) Preprocessing + RandomForest pipeline
# -------------------------------------------------
num_features = numeric_cols + binary_cols
cat_features = categorical_cols

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ]
)

model = RandomForestRegressor(
    n_estimators=200,
    max_depth=None,
    random_state=42,
    n_jobs=-1,
)

clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", model),
])

clf.fit(X_train, y_train)

# -------------------------------------------------
# 5) Metrics on cleaned data
# -------------------------------------------------
y_pred = clf.predict(X_test)
rmse = sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"\nRandomForest AFTER cleaning")
print(f"RMSE: {rmse:,.2f}")
print(f"MAE:  {mae:,.2f}")
print(f"R^2:  {r2:.3f}")

Rows before cleaning: 14436
Rows after cleaning:  14122

Numeric feature columns:
['accommodates', 'bedrooms', 'beds', 'bathrooms', 'latitude', 'longitude', 'number_of_reviews', 'availability_365', 'review_scores_rating', 'calculated_host_listings_count', 'reviews_per_month', 'available_days_365', 'availability_rate_365', 'blocked_or_booked_days_365', 'blocked_or_booked_rate_365', 'log_number_of_reviews', 'log_reviews_per_month', 'host_years', 'neigh_avg_price', 'neigh_median_price', 'neigh_listing_count']

RandomForest AFTER cleaning
RMSE: 82.49
MAE:  51.86
R^2:  0.646
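
One thing worth checking in the row filter: `Series.between` evaluates to `False` for missing values. Any listing with a NaN in one of the filtered columns is therefore dropped at this step and never reaches the median imputer. Whether that matters depends on how many NaNs the engineered table contains, which the notebook does not show. A minimal sketch on a made-up frame, with a hypothetical helper `in_range_or_missing` that lets missing values through:

```python
import numpy as np
import pandas as pd

# Made-up toy data, not the listings table.
toy = pd.DataFrame({
    "price": [120.0, 80.0, 200.0, 95.0],
    # NaN = no rating available, 6.0 = out-of-range value
    "review_scores_rating": [4.8, np.nan, 6.0, 3.9],
})

def in_range_or_missing(s, low, high):
    """True where the value lies in [low, high] or is missing (hypothetical helper)."""
    return s.between(low, high) | s.isna()

# between() alone drops both the invalid 6.0 and the NaN row.
strict = toy[toy["review_scores_rating"].between(1, 5)]

# Allowing NaN through drops only the invalid 6.0.
lenient = toy[in_range_or_missing(toy["review_scores_rating"], 1, 5)]

print(len(strict), "rows kept by between() alone")
print(len(lenient), "rows kept when missing values are allowed through")
```

Comparing `df[col].isna().sum()` for each filtered column against the 314 rows removed would show how much of the reduction comes from missing values rather than true outliers.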


In [5]:
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error

# global mean baseline
global_mean = y_train.mean()
y_pred_global = np.full_like(y_test, global_mean, dtype=float)

rmse_global = sqrt(mean_squared_error(y_test, y_pred_global))
mae_global = mean_absolute_error(y_test, y_pred_global)

# neighbourhood mean baseline (train only)
train_neigh_means = (
    X_train.assign(price=y_train)
    .groupby("neighbourhood_name")["price"]
    .mean()
)

y_pred_neigh = X_test["neighbourhood_name"].map(train_neigh_means).fillna(global_mean)

rmse_neigh = sqrt(mean_squared_error(y_test, y_pred_neigh))
mae_neigh = mean_absolute_error(y_test, y_pred_neigh)

print("Global mean baseline  - RMSE: {:,.2f}, MAE: {:,.2f}".format(rmse_global, mae_global))
print("Neighbourhood mean    - RMSE: {:,.2f}, MAE: {:,.2f}".format(rmse_neigh, mae_neigh))
print("RandomForest (clean)  - RMSE: {:,.2f}, MAE: {:,.2f}, R^2: {:.3f}".format(rmse, mae, r2))

Global mean baseline  - RMSE: 138.85, MAE: 98.92
Neighbourhood mean    - RMSE: 122.68, MAE: 83.64
RandomForest (clean)  - RMSE: 82.49, MAE: 51.86, R^2: 0.646


In [6]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

rf_base = RandomForestRegressor(
    random_state=42,
    n_jobs=-1
)

param_dist = {
    "model__n_estimators": [100, 200, 400],
    "model__max_depth": [None, 10, 20, 40],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 2, 4],
    "model__max_features": ["sqrt", "log2", 0.5],
}

rf_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),   # same preprocessor you used for RF
    ("model", rf_base),
])

search = RandomizedSearchCV(
    rf_pipe,
    param_distributions=param_dist,
    n_iter=20,
    scoring="neg_root_mean_squared_error",
    cv=3,
    n_jobs=-1,
    random_state=42,
    verbose=1,
)

search.fit(X_train, y_train)

print("Best params:", search.best_params_)

best_clf = search.best_estimator_

y_pred_best = best_clf.predict(X_test)
rmse_best = sqrt(mean_squared_error(y_test, y_pred_best))
mae_best = mean_absolute_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

print(f"\nTuned RF  - RMSE: {rmse_best:,.2f}, MAE: {mae_best:,.2f}, R^2: {r2_best:.3f}")
print(f"Current RF - RMSE: {rmse:,.2f}, MAE: {mae:,.2f}, R^2: {r2:.3f}")

Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best params: {'model__n_estimators': 400, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 0.5, 'model__max_depth': 40}

Tuned RF  - RMSE: 81.53, MAE: 51.61, R^2: 0.655
Current RF - RMSE: 82.49, MAE: 51.86, R^2: 0.646


In [50]:
target_col = "price"
alt_target_col = "log_price"

categorical_cols = [
    "borough",
    "neighbourhood_name",
    "room_type",
    "property_type_grouped",
    "capacity_bucket",
    "host_listings_bucket",
    "rating_bucket",
]

binary_cols = [
    "host_is_superhost",
    "instant_bookable",
    "is_entire_home",
    "is_private_room",
    "is_shared_room",
    "is_hotel_room",
]

date_string_cols = ["host_since", "first_review", "last_review"]

# >>> price-based features that leak the target <<<
price_leak_cols = [
    "price_per_accommodate",
    "price_per_bed",
    "price_per_bedroom",
    "price_minus_neigh_mean",
    "price_over_neigh_mean",
    "price_minus_neigh_median",
    "price_over_neigh_median",
]

# original raw text / datetime columns we don't feed directly
extra_raw_cols = [
    "property_type",
    "bathrooms_text",
    "host_since_dt",
]

exclude_for_numeric = (
    [target_col, alt_target_col]
    + categorical_cols
    + binary_cols
    + date_string_cols
    + price_leak_cols
    + extra_raw_cols
)

numeric_cols = [
    c for c in df.columns
    if c not in exclude_for_numeric
    and pd.api.types.is_numeric_dtype(df[c])
]

print("Target:", target_col)
print("Alt target (optional):", alt_target_col)
print("\nCategorical cols:", categorical_cols)
print("\nBinary cols:", binary_cols)
print("\nNumeric cols:", numeric_cols)
print("\nRaw date string cols:", date_string_cols)
print("\nExcluded (price leakage + raw text/datetime):", price_leak_cols + extra_raw_cols)

Target: price
Alt target (optional): log_price

Categorical cols: ['borough', 'neighbourhood_name', 'room_type', 'property_type_grouped', 'capacity_bucket', 'host_listings_bucket', 'rating_bucket']

Binary cols: ['host_is_superhost', 'instant_bookable', 'is_entire_home', 'is_private_room', 'is_shared_room', 'is_hotel_room']

Numeric cols: ['accommodates', 'bedrooms', 'beds', 'bathrooms', 'latitude', 'longitude', 'number_of_reviews', 'availability_365', 'review_scores_rating', 'calculated_host_listings_count', 'reviews_per_month', 'available_days_365', 'availability_rate_365', 'blocked_or_booked_days_365', 'blocked_or_booked_rate_365', 'log_number_of_reviews', 'log_reviews_per_month', 'host_years', 'neigh_avg_price', 'neigh_median_price', 'neigh_listing_count']

Raw date string cols: ['host_since', 'first_review', 'last_review']

Excluded (price leakage + raw text/datetime): ['price_per_accommodate', 'price_per_bed', 'price_per_bedroom', 'price_minus_neigh_mean', 'price_over_neigh_mean'

In [51]:
# features & target
X = df[categorical_cols + binary_cols + numeric_cols]
y = df[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

X_train.shape, X_test.shape

((11548, 34), (2888, 34))

In [52]:
from sklearn.ensemble import RandomForestRegressor

# numeric + binary treated as numeric features
num_features = numeric_cols + binary_cols
cat_features = categorical_cols

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ]
)

# choose a model: start with RandomForestRegressor
model = RandomForestRegressor(
    n_estimators=200,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)

# full pipeline: preprocessing + model
clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", model),
])

# fit
clf.fit(X_train, y_train)

In [53]:
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = clf.predict(X_test)

rmse = sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:,.2f}")
print(f"MAE:  {mae:,.2f}")
print(f"R^2:  {r2:.3f}")

RMSE: 1,032.07
MAE:  107.75
R^2:  0.739


In [55]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# numeric + binary treated as numeric features
num_features = numeric_cols + binary_cols
cat_features = categorical_cols

# same numeric transformer as before
numeric_transformer_hgb = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# categorical transformer, but DENSE output
categorical_transformer_hgb = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # NOTE: sparse_output=False for sklearn >= 1.2
    # If your version complains, change to sparse=False instead.
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

preprocessor_hgb = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer_hgb, num_features),
        ("cat", categorical_transformer_hgb, cat_features),
    ]
)

In [56]:
# Optional: keep a copy of the RandomForest metrics for comparison
rmse_rf = rmse
mae_rf = mae
r2_rf = r2

from sklearn.ensemble import HistGradientBoostingRegressor
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# 1) Define HGB model
hgb_model = HistGradientBoostingRegressor(
    learning_rate=0.05,
    max_iter=300,
    max_depth=None,
    random_state=42
)

# 2) Full pipeline: use the *dense* preprocessor
clf_hgb = Pipeline(steps=[
    ("preprocess", preprocessor_hgb),
    ("model", hgb_model),
])

# 3) Fit on the same train split
clf_hgb.fit(X_train, y_train)

# 4) Predict and evaluate
y_pred_hgb = clf_hgb.predict(X_test)

rmse_hgb = sqrt(mean_squared_error(y_test, y_pred_hgb))
mae_hgb = mean_absolute_error(y_test, y_pred_hgb)
r2_hgb = r2_score(y_test, y_pred_hgb)

print(f"RandomForest  RMSE: {rmse_rf:,.2f}, MAE: {mae_rf:,.2f}, R^2: {r2_rf:.3f}")
print(f"HGB           RMSE: {rmse_hgb:,.2f}, MAE: {mae_hgb:,.2f}, R^2: {r2_hgb:.3f}")

RandomForest  RMSE: 1,032.07, MAE: 107.75, R^2: 0.739
HGB           RMSE: 1,040.23, MAE: 144.95, R^2: 0.735
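
`alt_target_col = "log_price"` is defined above, but both models are fit on raw `price`. One way to try a log target without changing the evaluation code is scikit-learn's `TransformedTargetRegressor`, which fits on `log1p(price)` and maps predictions back with `expm1`, so RMSE and MAE stay in dollars. The sketch below runs on synthetic, made-up data with a small forest purely for illustration. It is not a result from this notebook, and whether it helps on the real listings is untested.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix and a right-skewed, price-like target.
X_toy = rng.normal(size=(500, 3))
y_toy = np.exp(4.5 + 0.6 * X_toy[:, 0] + rng.normal(scale=0.4, size=500))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Fit on log1p(y); predictions are mapped back to the original scale with expm1.
log_rf = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_rf.fit(X_tr, y_tr)

y_pred_toy = log_rf.predict(X_te)
print(f"MAE on original scale: {mean_absolute_error(y_te, y_pred_toy):,.2f}")
```

In this notebook the `regressor` argument could be the existing `clf` or `clf_hgb` pipeline, fitted on the same `X_train`, `y_train` split.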


In [None]:
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Global mean baseline
global_mean = y_train.mean()
y_pred_global = np.full_like(y_test, fill_value=global_mean, dtype=float)

rmse_global = sqrt(mean_squared_error(y_test, y_pred_global))
mae_global = mean_absolute_error(y_test, y_pred_global)

# Neighbourhood mean baseline (train only)
train_neigh_means = (
    X_train.assign(price=y_train)
    .groupby("neighbourhood_name")["price"]
    .mean()
)

y_pred_neigh = X_test["neighbourhood_name"].map(train_neigh_means).fillna(global_mean)

rmse_neigh = sqrt(mean_squared_error(y_test, y_pred_neigh))
mae_neigh = mean_absolute_error(y_test, y_pred_neigh)

print("Global mean baseline  - RMSE: {:,.2f}, MAE: {:,.2f}".format(rmse_global, mae_global))
print("Neighbourhood mean    - RMSE: {:,.2f}, MAE: {:,.2f}".format(rmse_neigh, mae_neigh))
print("RandomForest + FE     - RMSE: {:,.2f}, MAE: {:,.2f}, R^2: {:.3f}".format(rmse, mae, r2))

Global mean baseline  - RMSE: 2,021.36, MAE: 280.47
Neighbourhood mean    - RMSE: 2,023.57, MAE: 286.83
RandomForest + FE     - RMSE: 1,032.07, MAE: 107.75, R^2: 0.739


In [None]:
# 1. Get feature names after preprocessing
preprocessor = clf.named_steps["preprocess"]

feature_names = preprocessor.get_feature_names_out()

# 2. Get importances from the RandomForest
rf_model = clf.named_steps["model"]
importances = rf_model.feature_importances_

# 3. Put into a DataFrame
fi = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
}).sort_values("importance", ascending=False)

fi.head(20)



| | feature | importance |
|---|---|---|
| 17 | num__host_years | 0.222049 |
| 26 | num__is_hotel_room | 0.213101 |
| 248 | cat__room_type_Hotel room | 0.193995 |
| 9 | num__calculated_host_listings_count | 0.092832 |
| 18 | num__neigh_avg_price | 0.074288 |
| 5 | num__longitude | 0.024674 |
| 8 | num__review_scores_rating | 0.023622 |
| 4 | num__latitude | 0.01841 |
| 259 | cat__property_type_grouped_Room in hotel | 0.015387 |
| 2 | num__beds | 0.014817 |
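
`feature_importances_` is impurity-based, and scikit-learn's documentation warns that such importances can be misleading for high-cardinality features. Here the hotel signal is also split between `num__is_hotel_room` and `cat__room_type_Hotel room`, which encode the same fact. Permutation importance on held-out data is a common cross-check. It is a different technique from the one used above, sketched here on made-up data with hypothetical column names `signal` and `noise`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Made-up data: "signal" drives the target, "noise" does not.
X_toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_toy = 100 + 50 * X_toy["signal"] + rng.normal(scale=5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
rf_toy = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out split and measure the score drop.
result = permutation_importance(
    rf_toy, X_te, y_te,
    scoring="neg_root_mean_squared_error",
    n_repeats=5,
    random_state=0,
)

perm = pd.Series(result.importances_mean, index=X_toy.columns)
print(perm.sort_values(ascending=False))
```

Called on the fitted pipeline, for example `permutation_importance(clf, X_test, y_test, ...)`, this scores the raw input columns rather than the individual one-hot columns.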


In [None]:
# Assign each transformed feature to a coarse family.
# Rules are checked in order; features that match none of them
# (room-type flags, availability, property type, ...) end up in "other".
fi["family"] = fi["feature"].apply(
    lambda s: (
        "location" if "neighbourhood" in s or "borough" in s or "latitude" in s or "longitude" in s
        else "size" if any(k in s for k in ["accommodates", "bedrooms", "beds", "bathrooms"])
        else "host" if any(k in s for k in ["host_", "superhost"])
        else "reviews" if "review" in s
        else "neigh_price" if "neigh_" in s or "price_over_neigh" in s
        else "other"
    )
)

fi.groupby("family")["importance"].sum().sort_values(ascending=False)

family
other          0.470953
host           0.319814
neigh_price    0.084098
reviews        0.055767
location       0.049443
size           0.019924
Name: importance, dtype: float64
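
Almost half of the total importance lands in `other`, mostly because `num__is_hotel_room` (0.213) and `cat__room_type_Hotel room` (0.194) match none of the rules; availability and property-type columns fall through as well. A possible refinement is sketched below as a standalone function. The family names and the helper `feature_family` are assumptions, not part of the notebook. It strips the `num__`/`cat__` prefix and matches on the start of the original column name, so that `bedrooms`, for example, cannot be mistaken for a room-type flag.

```python
# Binary room-type flags from binary_cols.
ROOM_FLAGS = {"is_entire_home", "is_private_room", "is_shared_room", "is_hotel_room"}

# (family, column-name prefixes), checked in order. Family names are assumptions.
FAMILY_PREFIXES = [
    ("room_type", ("room_type",)),
    ("property_type", ("property_type",)),
    ("location", ("borough", "neighbourhood_name", "latitude", "longitude")),
    ("neigh_stats", ("neigh_",)),
    ("size", ("accommodates", "bedrooms", "beds", "bathrooms", "capacity_bucket")),
    ("host", ("host_", "calculated_host_")),
    ("reviews", ("review", "number_of_reviews", "log_number_of_reviews",
                 "log_reviews", "rating_bucket")),
    ("availability", ("availab", "blocked_or_booked")),
]

def feature_family(feature):
    """Map a ColumnTransformer output name to a coarse feature family."""
    base = feature.split("__", 1)[-1]  # drop the "num__" / "cat__" prefix
    if base in ROOM_FLAGS:
        return "room_type"
    for family, prefixes in FAMILY_PREFIXES:
        if base.startswith(prefixes):
            return family
    return "other"

for name in ["num__is_hotel_room", "cat__room_type_Hotel room",
             "num__bedrooms", "num__availability_365"]:
    print(name, "->", feature_family(name))
```

Applied to the table above, this would be `fi["family"] = fi["feature"].map(feature_family)` followed by the same `groupby`; the resulting family totals would differ from the output shown, since the hotel features move out of `other`.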