### Current Modeling Status (using engineered features)

- Train/test split: 80/20 on NYC listings.
- Features:
  - Numeric: capacity, host features, reviews, availability, neighbourhood stats, lat/long, etc. (no price-based leakage).
  - Binary: room-type flags, `host_is_superhost`, `instant_bookable`.
  - Categorical: `borough`, `neighbourhood_name`, `room_type`, `property_type_grouped`, `capacity_bucket`, `host_listings_bucket`, `rating_bucket`.
- Preprocessing:
  - Numeric + binary → median imputation + StandardScaler.
  - Categorical → most-frequent imputation + one-hot.

**Baselines (same split)**  
- Global mean baseline: RMSE ≈ 2021, MAE ≈ 280  
- Neighbourhood mean baseline: RMSE ≈ 2024, MAE ≈ 287  

**Models**

- **RandomForestRegressor (main model)**  
  - RMSE ≈ 1032  
  - MAE ≈ 108  
  - R² ≈ 0.74  

- **HistGradientBoostingRegressor (same features)**  
  - RMSE ≈ 1040  
  - MAE ≈ 145  
  - R² ≈ 0.74  

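Random Forest roughly halves the error of both baselines (RMSE ≈ 1032 vs. ≈ 2021, MAE ≈ 108 vs. ≈ 280). Against HGB it is only marginally better on RMSE (1032 vs. 1040) at the same R² ≈ 0.74, but clearly better on MAE (108 vs. 145), so we treat it as our main model going forward. HGB serves as an additional comparison model.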

Feature importance (from the RF) shows that:
- Host-related features (`host_years`, `calculated_host_listings_count`) and hotel-style room types are strong predictors.
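- Neighbourhood price stats (`neigh_avg_price`) and location (`latitude`, `longitude`) contribute meaningful signal. Note that `neigh_avg_price` is derived from listing prices and is excluded as a leakage feature (`price_leak_cols`) in the pipeline code that follows.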
- Reviews and availability metrics provide additional predictive power.

In [22]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from sklearn.ensemble import RandomForestRegressor  # or from sklearn.linear_model import Ridge
from sklearn.impute import SimpleImputer

In [23]:
# load the feature table you saved from features.ipynb
df = pd.read_csv("../data/processed/listing_features_engineered.csv")

print(df.shape)
df.head()

(20631, 56)


Unnamed: 0,city,host_since,host_is_superhost,room_type,property_type,accommodates,bedrooms,beds,bathrooms,bathrooms_text,...,city_superhost_rate,city_avg_rating,city_avg_reviews_per_month,city_entire_home_share,log_city_listing_count,is_entire_home,is_private_room,is_shared_room,is_hotel_room,property_type_grouped
0,NYC,2008-09-09,0,Entire home/apt,Entire rental unit,1,0,1,1.0,1 bath,...,0.391092,4.735708,1.086862,0.557287,9.55556,1,0,0,0,Entire rental unit
1,Washington DC,2008-12-10,0,Entire home/apt,Entire condo,2,1,3,1.0,1 bath,...,0.52522,4.781591,2.241325,0.803262,8.290544,1,0,0,0,Entire condo
2,Washington DC,2008-11-26,0,Private room,Private room in home,1,1,2,1.0,1 shared bath,...,0.52522,4.781591,2.241325,0.803262,8.290544,0,1,0,0,Private room in home
3,Boston,2008-12-03,1,Entire home/apt,Entire rental unit,2,1,1,1.0,1 bath,...,0.433439,4.731446,1.910008,0.701664,7.833996,1,0,0,0,Entire rental unit
4,Washington DC,2008-12-12,1,Private room,Private room in townhouse,2,1,1,1.0,1 private bath,...,0.52522,4.781591,2.241325,0.803262,8.290544,0,1,0,0,Private room in townhouse


In [24]:
import numpy as np
import pandas as pd

from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# -------------------------------------------------
# 1) Start from engineered features and CLEAN rows
# -------------------------------------------------
df_clean = df.copy()

rows_before = len(df_clean)

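# keep reasonable prices only (bounds are inclusive)
df_clean = df_clean[df_clean["price"].between(10, 1000)]

# additional outlier filters on key numeric columns
# NOTE: Series.between() evaluates to False for NaN, so rows with a missing
# value in any of these columns are dropped here rather than imputed later.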
df_clean = df_clean[
    (df_clean["accommodates"].between(1, 10)) &
    (df_clean["bedrooms"].between(0, 8)) &
    (df_clean["beds"].between(0, 10)) &
    (df_clean["bathrooms"].between(0, 5)) &
    (df_clean["review_scores_rating"].between(1, 5)) &
    (df_clean["availability_365"].between(0, 365)) &
    (df_clean["host_years"].between(0, 20)) &
    (df_clean["reviews_per_month"].between(0, 20))
]

print(f"Rows before cleaning: {rows_before}")
print(f"Rows after cleaning:  {len(df_clean)}")

# -------------------------------------------------
# 2) Define target and feature groups (no leakage)
# -------------------------------------------------
target_col = "price"
alt_target_col = "log_price"

# Categorical features (include city)
categorical_cols = [
    "city",
    "borough",
    "neighbourhood_name",
    "room_type",
    "property_type_grouped",
    "capacity_bucket",
    "host_listings_bucket",
    "rating_bucket",
]

# Binary features
binary_cols = [
    "host_is_superhost",
    "instant_bookable",
    "is_entire_home",
    "is_private_room",
    "is_shared_room",
    "is_hotel_room",
]

# Date-like raw text columns
date_string_cols = ["host_since", "first_review", "last_review"]

# Price-derived leakage features (do NOT feed to model)
price_leak_cols = [
    "price_per_accommodate",
    "price_per_bed",
    "price_per_bedroom",
    "price_minus_neigh_mean",
    "price_over_neigh_mean",
    "price_minus_neigh_median",
    "price_over_neigh_median",
    "neigh_avg_price",
    "neigh_median_price",
    "estimated_revenue",      # not in df but safe to list
]

# City env features: we only want these in the model:
#   log_city_listing_count, city_superhost_rate,
#   city_entire_home_share, city_avg_reviews_per_month
# so exclude the raw count + avg rating from numeric_cols
city_not_for_model = [
    "city_listing_count",
    "city_avg_rating",
]

# Other raw / text-like columns we don't want as numeric
extra_raw_cols = [
    "property_type",
    "bathrooms_text",
    "host_since_dt",
    "host_name",
] + city_not_for_model

# Build numeric_cols = all numeric features minus excluded sets
exclude_for_numeric = (
    [target_col, alt_target_col]
    + categorical_cols
    + binary_cols
    + date_string_cols
    + price_leak_cols
    + extra_raw_cols
)

numeric_cols = [
    c for c in df_clean.columns
    if c not in exclude_for_numeric
    and pd.api.types.is_numeric_dtype(df_clean[c])
]

print("\nNumeric feature columns used in model:")
print(numeric_cols)

# -------------------------------------------------
# 3) Train/test split
# -------------------------------------------------
X = df_clean[categorical_cols + binary_cols + numeric_cols]
y = df_clean[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -------------------------------------------------
# 4) Preprocessing + RandomForest pipeline
# -------------------------------------------------
num_features = numeric_cols + binary_cols
cat_features = categorical_cols

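# Scaling is not needed by tree-based models such as RandomForest; it only
# matters if this preprocessor is reused with a scale-sensitive model (e.g. Ridge).
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])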

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ]
)

model = RandomForestRegressor(
    n_estimators=200,
    max_depth=None,
    random_state=42,
    n_jobs=-1,
)

clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", model),
])

clf.fit(X_train, y_train)

# -------------------------------------------------
# 5) Metrics on cleaned data
# -------------------------------------------------
y_pred = clf.predict(X_test)
rmse = sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nRandomForest AFTER cleaning")
print(f"RMSE: {rmse:,.2f}")
print(f"MAE:  {mae:,.2f}")
print(f"R^2:  {r2:.3f}")

Rows before cleaning: 20631
Rows after cleaning:  20631

Numeric feature columns used in model:
['accommodates', 'bedrooms', 'beds', 'bathrooms', 'latitude', 'longitude', 'number_of_reviews', 'availability_365', 'review_scores_rating', 'calculated_host_listings_count', 'reviews_per_month', 'available_days_365', 'availability_rate_365', 'blocked_or_booked_days_365', 'blocked_or_booked_rate_365', 'log_number_of_reviews', 'log_reviews_per_month', 'host_years', 'neigh_listing_count', 'city_superhost_rate', 'city_avg_reviews_per_month', 'city_entire_home_share', 'log_city_listing_count']

RandomForest AFTER cleaning
RMSE: 75.65
MAE:  46.56
R^2:  0.720
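
The identical row counts before and after cleaning are worth a second look. Since `Series.between` is inclusive by default and treats missing values as `False`, an unchanged row count suggests that every row already had in-range, non-missing values in all filtered columns. The toy series here (hypothetical values, not taken from the data) illustrates that behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical prices: below range, both bounds, in range, above range, missing.
prices = pd.Series([5.0, 10.0, 250.0, 1000.0, 1500.0, np.nan], name="price")

mask = prices.between(10, 1000)  # inclusive on both ends by default
print(mask.tolist())  # → [False, True, True, True, False, False]

kept = prices[mask]
print(len(prices), "->", len(kept))  # → 6 -> 3

# The NaN row is removed by the filter, so no missing value is left to impute.
print(int(kept.isna().sum()))  # → 0
```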


In [25]:
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error

# global mean baseline
global_mean = y_train.mean()
y_pred_global = np.full_like(y_test, global_mean, dtype=float)

rmse_global = sqrt(mean_squared_error(y_test, y_pred_global))
mae_global = mean_absolute_error(y_test, y_pred_global)

# neighbourhood mean baseline (train only)
train_neigh_means = (
    X_train.assign(price=y_train)
    .groupby("neighbourhood_name")["price"]
    .mean()
)

y_pred_neigh = X_test["neighbourhood_name"].map(train_neigh_means).fillna(global_mean)

rmse_neigh = sqrt(mean_squared_error(y_test, y_pred_neigh))
mae_neigh = mean_absolute_error(y_test, y_pred_neigh)

print("Global mean baseline  - RMSE: {:,.2f}, MAE: {:,.2f}".format(rmse_global, mae_global))
print("Neighbourhood mean    - RMSE: {:,.2f}, MAE: {:,.2f}".format(rmse_neigh, mae_neigh))
print("RandomForest (clean)  - RMSE: {:,.2f}, MAE: {:,.2f}, R^2: {:.3f}".format(rmse, mae, r2))

Global mean baseline  - RMSE: 143.09, MAE: 98.86
Neighbourhood mean    - RMSE: 124.57, MAE: 84.55
RandomForest (clean)  - RMSE: 75.65, MAE: 46.56, R^2: 0.720
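
To put the gap between the model and the baselines in relative terms, the printed metrics can be turned into percentage error reductions. This is plain arithmetic on the three output lines above; `pct_reduction` is a hypothetical helper name.

```python
# Metrics copied from the printed output above.
baselines = {
    "Global mean baseline": {"rmse": 143.09, "mae": 98.86},
    "Neighbourhood mean": {"rmse": 124.57, "mae": 84.55},
}
random_forest = {"rmse": 75.65, "mae": 46.56}


def pct_reduction(baseline_error, model_error):
    """Percentage by which the model error is lower than the baseline error."""
    return 100.0 * (1.0 - model_error / baseline_error)


reductions = {}
for name, base in baselines.items():
    reductions[name] = {
        metric: round(pct_reduction(base[metric], random_forest[metric]), 1)
        for metric in ("rmse", "mae")
    }
    print(f"{name}: RMSE -{reductions[name]['rmse']}%, MAE -{reductions[name]['mae']}%")

# → Global mean baseline: RMSE -47.1%, MAE -52.9%
# → Neighbourhood mean: RMSE -39.3%, MAE -44.9%
```

On this split the Random Forest cuts RMSE by roughly 39% relative to the stronger (neighbourhood mean) baseline and by roughly 47% relative to the global mean.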
