## Feature Engineering Summary

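We start from the cleaned `listing` + `neighbourhood` tables. Each row is one listing, with host info, location, reviews, availability, and price. The joined table is not NYC-only: the preview below also contains Washington DC and Boston listings, for which `borough` is empty.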

**Target**
- Main target: `price` (nightly USD).
- Also created: `log_price` = log(1 + `price`), to reduce skew for linear models.

**Size & Capacity**
- Raw: `accommodates`, `bedrooms`, `beds`, `bathrooms`.
- Ratios: `price_per_accommodate`, `price_per_bed`, `price_per_bedroom`.
- Bucket: `capacity_bucket` ∈ {`1-2`, `3-4`, `5-6`, `7-8`, `9-12`, `13+`} based on `accommodates`.

**Availability & Demand**
- From `availability_365`:
  - `available_days_365`, `availability_rate_365`.
  - `blocked_or_booked_days_365`, `blocked_or_booked_rate_365`.
- Reviews:
  - `log_number_of_reviews` = log(1 + `number_of_reviews`).
  - `log_reviews_per_month` = log(1 + `reviews_per_month`).
  - `rating_bucket` ∈ {`<4.0`, `4.0-4.5`, `4.5-4.8`, `4.8-5.0`} based on `review_scores_rating`.

**Host Features**
- `host_years` = years since `host_since` (using max `last_review` as reference).
- `host_listings_bucket` from `calculated_host_listings_count` ∈ {`1`, `2-3`, `4-10`, `11-50`, `50+`}.
- Keep binary: `host_is_superhost`, `instant_bookable`.

**Neighbourhood Pricing**
- Aggregates per `neighbourhood_name`:
  - `neigh_avg_price`, `neigh_median_price`, `neigh_listing_count`.
- Relative price:
  - `price_minus_neigh_mean`, `price_over_neigh_mean`.
  - `price_minus_neigh_median`, `price_over_neigh_median`.

**Room & Property Type**
- Room type flags:
  - `is_entire_home`, `is_private_room`, `is_shared_room`, `is_hotel_room`.
- Property type:
  - `property_type_grouped` keeps the 8 most common `property_type` values, others → `"Other"`.

**Location**
- Keep raw `borough`, `neighbourhood_name`, `latitude`, `longitude`.

The final engineered table is saved as  
`data/processed/listing_features_engineered.csv`  
along with lists of `categorical_cols`, `binary_cols`, and `numeric_cols` for modeling.
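
Since `log_price` is built with `log1p`, predictions made on the log scale need `expm1` to get back to nightly USD. A minimal sketch of the round trip, using three prices from the preview further down (the array names are only for illustration):

```python
import numpy as np

# Three nightly prices taken from the notebook's preview rows.
price = np.array([240.0, 150.0, 60.0])

# Forward transform used for the alternative target.
log_price = np.log1p(price)

# Inverse transform: apply this to log-scale predictions before
# computing RMSE/MAE in USD.
recovered = np.expm1(log_price)

print(log_price.round(6))
print(recovered)
```

Errors computed on `log_price` and on `price` are not comparable, so it helps to report both on the same (USD) scale.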

In [13]:
import sqlite3
import pandas as pd
import numpy as np

# connect to SQLite DB (adjust path if needed)
conn = sqlite3.connect("../data/airbnb.db")

# base table: listing + neighbourhood info
base_query = """
SELECT 
    l.*,
    n.borough,
    n.neighbourhood_name
FROM listing l
LEFT JOIN neighbourhood n
    ON l.neighbourhood_id = n.neighbourhood_id
"""

df = pd.read_sql_query(base_query, conn)

print(df.shape)
df.head()

(27797, 28)


Unnamed: 0,listing_id,neighbourhood_id,city,host_id,host_name,host_since,host_is_superhost,room_type,property_type,accommodates,...,availability_365,estimated_revenue,first_review,last_review,review_scores_rating,instant_bookable,calculated_host_listings_count,reviews_per_month,borough,neighbourhood_name
0,2595,115,NYC,2845,Jennifer,2008-09-09,0,Entire home/apt,Entire rental unit,1,...,289,0.0,2009-11-21,2022-06-21,4.68,0,3,0.24,Manhattan,Midtown
1,3344,268,Washington DC,4957,A.J.,2008-12-10,0,Entire home/apt,Entire condo,2,...,362,0.0,2009-05-09,2016-08-31,5.0,0,2,0.05,,"Downtown, Chinatown, Penn Quarters, Mount Vern..."
2,3686,276,Washington DC,4645,Vita,2008-11-26,0,Private room,Private room in home,1,...,298,0.0,2010-11-01,2023-08-30,4.64,0,1,0.47,,Historic Anacostia
3,3781,240,Boston,4804,Frank,2008-12-03,1,Entire home/apt,Entire rental unit,2,...,326,0.0,2015-07-10,2024-08-09,4.96,0,1,0.21,,East Boston
4,3943,271,Washington DC,5059,Vasa,2008-12-12,1,Private room,Private room in townhouse,2,...,331,19434.0,2009-05-10,2025-05-27,4.86,0,5,2.78,,"Edgewood, Bloomingdale, Truxton Circle, Eckington"
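
The preview above mixes NYC, Washington DC, and Boston rows, and `borough` is missing for the non-NYC ones. If the analysis is meant to cover NYC only (an assumption; the notebook itself does not filter), a check and filter along these lines could follow the query. `df_demo` is a toy stand-in for `df`, with values copied from the preview.

```python
import pandas as pd

# Toy stand-in for the joined `df`; values copied from the preview above.
df_demo = pd.DataFrame({
    "listing_id": [2595, 3344, 3686, 3781],
    "city": ["NYC", "Washington DC", "Washington DC", "Boston"],
    "borough": ["Manhattan", None, None, None],
})

# How many listings per city, and how many of them lack a borough?
city_counts = df_demo["city"].value_counts()
missing_borough = df_demo["borough"].isna().groupby(df_demo["city"]).sum()

# Assumption: keep NYC rows only. Skip this step if a multi-city model is intended.
df_nyc = df_demo[df_demo["city"] == "NYC"].reset_index(drop=True)

print(city_counts)
print(missing_borough)
print(df_nyc.shape)
```

If all cities are kept instead, `city` probably belongs in `categorical_cols`, and the missing `borough` values need an explicit fill value before one-hot encoding.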


In [14]:
# target
target_col = "price"

# columns we definitely don't want as features
id_cols = ["listing_id", "neighbourhood_id", "host_id"]
leaky_cols = ["estimated_revenue"]  # derived from price & availability
text_id_cols = ["host_name"]        # mostly just an identifier

drop_cols = id_cols + leaky_cols + text_id_cols

df_model = df.drop(columns=drop_cols)

df_model.head()

Unnamed: 0,city,host_since,host_is_superhost,room_type,property_type,accommodates,bedrooms,beds,bathrooms,bathrooms_text,...,number_of_reviews,availability_365,first_review,last_review,review_scores_rating,instant_bookable,calculated_host_listings_count,reviews_per_month,borough,neighbourhood_name
0,NYC,2008-09-09,0,Entire home/apt,Entire rental unit,1,0,1,1.0,1 bath,...,47,289,2009-11-21,2022-06-21,4.68,0,3,0.24,Manhattan,Midtown
1,Washington DC,2008-12-10,0,Entire home/apt,Entire condo,2,1,3,1.0,1 bath,...,10,362,2009-05-09,2016-08-31,5.0,0,2,0.05,,"Downtown, Chinatown, Penn Quarters, Mount Vern..."
2,Washington DC,2008-11-26,0,Private room,Private room in home,1,1,2,1.0,1 shared bath,...,84,298,2010-11-01,2023-08-30,4.64,0,1,0.47,,Historic Anacostia
3,Boston,2008-12-03,1,Entire home/apt,Entire rental unit,2,1,1,1.0,1 bath,...,26,326,2015-07-10,2024-08-09,4.96,0,1,0.21,,East Boston
4,Washington DC,2008-12-12,1,Private room,Private room in townhouse,2,1,1,1.0,1 private bath,...,546,331,2009-05-10,2025-05-27,4.86,0,5,2.78,,"Edgewood, Bloomingdale, Truxton Circle, Eckington"


In [15]:
import numpy as np

df_features = df_model.copy()

# --- 1) Target transform: log price (for later modeling) ---
df_features["log_price"] = np.log1p(df_features["price"])

# --- 2) Price per capacity features (ratios) ---
accom = df_features["accommodates"].replace(0, np.nan)
beds = df_features["beds"].replace(0, np.nan)
bedrooms = df_features["bedrooms"].replace(0, np.nan)

df_features["price_per_accommodate"] = df_features["price"] / accom
df_features["price_per_bed"] = df_features["price"] / beds
df_features["price_per_bedroom"] = df_features["price"] / bedrooms

# --- 3) Availability-based features (no fake "occupancy") ---

# how many days the host is *willing* to rent
df_features["available_days_365"] = df_features["availability_365"]

# proportion of the year they are open for business
df_features["availability_rate_365"] = df_features["available_days_365"] / 365.0

# days that are *not* available – could be booked OR manually blocked
df_features["blocked_or_booked_days_365"] = 365 - df_features["available_days_365"]
df_features["blocked_or_booked_rate_365"] = (
    df_features["blocked_or_booked_days_365"] / 365.0
)

# --- 4) Review-based transforms ---
df_features["log_number_of_reviews"] = np.log1p(df_features["number_of_reviews"])

# reviews_per_month can be 0 or very small; clip negatives just in case
rpm = df_features["reviews_per_month"].clip(lower=0)
df_features["log_reviews_per_month"] = np.log1p(rpm)

# Quick preview of the new columns
df_features[
    [
        "price",
        "log_price",
        "accommodates",
        "beds",
        "bedrooms",
        "price_per_accommodate",
        "price_per_bed",
        "price_per_bedroom",
        "availability_365",
        "available_days_365",
        "availability_rate_365",
        "blocked_or_booked_days_365",
        "blocked_or_booked_rate_365",
        "number_of_reviews",
        "log_number_of_reviews",
        "reviews_per_month",
        "log_reviews_per_month",
    ]
].head()

Unnamed: 0,price,log_price,accommodates,beds,bedrooms,price_per_accommodate,price_per_bed,price_per_bedroom,availability_365,available_days_365,availability_rate_365,blocked_or_booked_days_365,blocked_or_booked_rate_365,number_of_reviews,log_number_of_reviews,reviews_per_month,log_reviews_per_month
0,240.0,5.484797,1,1,0,240.0,240.0,,289,289,0.791781,76,0.208219,47,3.871201,0.24,0.215111
1,150.0,5.01728,2,3,1,75.0,50.0,150.0,362,362,0.991781,3,0.008219,10,2.397895,0.05,0.04879
2,60.0,4.110874,1,2,1,60.0,30.0,60.0,298,298,0.816438,67,0.183562,84,4.442651,0.47,0.385262
3,125.0,4.836282,2,1,1,62.5,125.0,125.0,326,326,0.893151,39,0.106849,26,3.295837,0.21,0.19062
4,79.0,4.382027,2,1,1,39.5,79.0,79.0,331,331,0.906849,34,0.093151,546,6.304449,2.78,1.329724
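
Row 0 above has `bedrooms = 0`, which is why its `price_per_bedroom` is `NaN`. The small sketch below shows what the `replace(0, np.nan)` step protects against; the `has_zero_bedrooms` flag is a hypothetical extra, not something the notebook creates.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"price": [240.0, 150.0], "bedrooms": [0, 1]})

# Plain division: a zero denominator yields inf, which most models reject.
naive = demo["price"] / demo["bedrooms"]

# Replacing 0 with NaN first (as in the cell above) yields NaN instead.
safe = demo["price"] / demo["bedrooms"].replace(0, np.nan)

# Hypothetical extra: a flag so the "no separate bedroom" signal is not lost
# when the NaN is later imputed.
demo["has_zero_bedrooms"] = (demo["bedrooms"] == 0).astype(int)

print(naive.tolist())
print(safe.tolist())
```

The resulting `NaN` values still have to be imputed or handled by the model downstream.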


In [16]:
import pandas as pd
import numpy as np

# -------------------------------
# 1) Capacity bucket (categorical)
# -------------------------------
cap_bins = [0, 2, 4, 6, 8, 12, np.inf]
cap_labels = ["1-2", "3-4", "5-6", "7-8", "9-12", "13+"]

df_features["capacity_bucket"] = pd.cut(
    df_features["accommodates"],
    bins=cap_bins,
    labels=cap_labels
)

# -----------------------------------------
# 2) Host experience: years on Airbnb
# -----------------------------------------
# make a datetime version of host_since
df_features["host_since_dt"] = pd.to_datetime(
    df_features["host_since"], errors="coerce"
)

# use the max last_review date as a rough "reference" date
# Convert to datetime first, then find max (handles NaN values properly)
last_review_dt = pd.to_datetime(df_features["last_review"], errors="coerce")
ref_date = last_review_dt.max()
# If all values are NaN, use today's date as fallback
if pd.isna(ref_date):
    ref_date = pd.Timestamp.now()

df_features["host_years"] = (
    (ref_date - df_features["host_since_dt"]).dt.days / 365.25
)

# -------------------------------------------
# 3) Host scale: how many listings they run
# -------------------------------------------
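# pd.cut intervals are right-closed: (10, 50] is "11-50",
# so the last bucket, labelled "50+", actually starts at 51.
df_features["host_listings_bucket"] = pd.cut(
    df_features["calculated_host_listings_count"],
    bins=[0, 1, 3, 10, 50, np.inf],
    labels=["1", "2-3", "4-10", "11-50", "50+"]
)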

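# ----------------------------------
# 4) Rating quality buckets
# ----------------------------------
# pd.cut intervals are right-closed, so "<4.0" includes a rating of exactly 4.0;
# a rating of exactly 0 (or a missing rating) falls outside every bin and becomes NaN.
df_features["rating_bucket"] = pd.cut(
    df_features["review_scores_rating"],
    bins=[0, 4.0, 4.5, 4.8, 5.1],
    labels=["<4.0", "4.0-4.5", "4.5-4.8", "4.8-5.0"]
)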

# -----------------------------------------------------
# 5) Neighbourhood-level price stats (aggregated)
# -----------------------------------------------------

# if we already have these from a previous run, drop them to avoid _x/_y columns
neigh_cols = [
    "neigh_avg_price", "neigh_median_price", "neigh_listing_count",
    "price_minus_neigh_mean", "price_over_neigh_mean",
    "price_minus_neigh_median", "price_over_neigh_median"
]
# errors="ignore" skips any column that does not exist yet
df_features = df_features.drop(columns=neigh_cols, errors="ignore")

neigh_stats = (
    df_features
    .groupby("neighbourhood_name")["price"]
    .agg(
        neigh_avg_price="mean",
        neigh_median_price="median",
        neigh_listing_count="size",
    )
    .reset_index()
)

# join back to each listing
df_features = df_features.merge(
    neigh_stats, on="neighbourhood_name", how="left"
)

# relative-to-neighbourhood features
df_features["price_minus_neigh_mean"] = (
    df_features["price"] - df_features["neigh_avg_price"]
)
df_features["price_over_neigh_mean"] = (
    df_features["price"] / df_features["neigh_avg_price"]
)

df_features["price_minus_neigh_median"] = (
    df_features["price"] - df_features["neigh_median_price"]
)
df_features["price_over_neigh_median"] = (
    df_features["price"] / df_features["neigh_median_price"]
)

# quick peek at the new features
df_features[
    [
        "accommodates",
        "capacity_bucket",
        "host_years",
        "host_listings_bucket",
        "review_scores_rating",
        "rating_bucket",
        "neighbourhood_name",
        "neigh_avg_price",
        "neigh_median_price",
        "neigh_listing_count",
        "price",
        "price_minus_neigh_mean",
        "price_over_neigh_mean",
        "price_minus_neigh_median",
        "price_over_neigh_median",
    ]
].head()

Unnamed: 0,accommodates,capacity_bucket,host_years,host_listings_bucket,review_scores_rating,rating_bucket,neighbourhood_name,neigh_avg_price,neigh_median_price,neigh_listing_count,price,price_minus_neigh_mean,price_over_neigh_mean,price_minus_neigh_median,price_over_neigh_median
0,1,1-2,17.059548,2-3,4.68,4.5-4.8,Midtown,2217.575379,348.0,1121,240.0,-1977.575379,0.108226,-108.0,0.689655
1,2,1-2,16.807666,2-3,5.0,4.8-5.0,"Downtown, Chinatown, Penn Quarters, Mount Vern...",264.215596,254.0,218,150.0,-114.215596,0.567718,-104.0,0.590551
2,1,1-2,16.845996,1,4.64,4.5-4.8,Historic Anacostia,138.34375,95.0,32,60.0,-78.34375,0.433702,-35.0,0.631579
3,2,1-2,16.826831,1,4.96,4.8-5.0,East Boston,231.695652,168.0,161,125.0,-106.695652,0.539501,-43.0,0.744048
4,2,1-2,16.80219,4-10,4.86,4.8-5.0,"Edgewood, Bloomingdale, Truxton Circle, Eckington",124.589443,116.0,341,79.0,-45.589443,0.634083,-37.0,0.681034
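
Two things stand out in the preview above. Midtown's mean price (about 2217) is far above its median (348), which suggests a few extreme listings dominate the mean. The aggregates are also computed over all rows, so each listing's own price feeds into its neighbourhood statistic. A minimal sketch of a train-only variant follows; the split and the fallback to the global training median are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

demo = pd.DataFrame({
    "neighbourhood_name": ["A", "A", "A", "B", "B", "C"],
    "price": [100.0, 120.0, 2000.0, 60.0, 80.0, 150.0],
})

# Hypothetical split: first four rows are training data, last two are test data.
train, test = demo.iloc[:4].copy(), demo.iloc[4:].copy()

# Fit the aggregates on training rows only; the median resists the 2000 outlier.
stats = (
    train.groupby("neighbourhood_name")["price"]
    .agg(neigh_median_price="median", neigh_listing_count="size")
    .reset_index()
)

# Apply to test rows; neighbourhoods unseen in training get the global
# training median and a listing count of 0.
test = test.merge(stats, on="neighbourhood_name", how="left")
global_median = train["price"].median()
test["neigh_median_price"] = test["neigh_median_price"].fillna(global_median)
test["neigh_listing_count"] = test["neigh_listing_count"].fillna(0).astype(int)

print(test)
```

With a real split, the same `stats` table would be merged onto both the training and the test frame.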


In [17]:
import numpy as np
import pandas as pd

# ---------------------------
# 1) Room type indicator flags
# ---------------------------
df_features["is_entire_home"] = (df_features["room_type"] == "Entire home/apt").astype(int)
df_features["is_private_room"] = (df_features["room_type"] == "Private room").astype(int)
df_features["is_shared_room"] = (df_features["room_type"] == "Shared room").astype(int)
df_features["is_hotel_room"]  = (df_features["room_type"] == "Hotel room").astype(int)

# ------------------------------
# 2) Property type grouping
# ------------------------------
# keep the 8 most common property types, group the rest as "Other"
top_props = df_features["property_type"].value_counts().nlargest(8).index

df_features["property_type_grouped"] = np.where(
    df_features["property_type"].isin(top_props),
    df_features["property_type"],
    "Other"
)

# ------------------------------
# 3) Quick sanity check
# ------------------------------
#print("Room type flags (sum = number of listings in each type):")
#print(
#    df_features[[
#        "is_entire_home",
#        "is_private_room",
#        "is_shared_room",
#        "is_hotel_room",
#    ]].sum()
#)

#print("\nGrouped property types (value counts):")
#print(df_features["property_type_grouped"].value_counts())

df_features[[
    "room_type",
    "is_entire_home",
    "is_private_room",
    "is_shared_room",
    "is_hotel_room",
    "property_type",
    "property_type_grouped",
]].head()

Unnamed: 0,room_type,is_entire_home,is_private_room,is_shared_room,is_hotel_room,property_type,property_type_grouped
0,Entire home/apt,1,0,0,0,Entire rental unit,Entire rental unit
1,Entire home/apt,1,0,0,0,Entire condo,Entire condo
2,Private room,0,1,0,0,Private room in home,Private room in home
3,Entire home/apt,1,0,0,0,Entire rental unit,Entire rental unit
4,Private room,0,1,0,0,Private room in townhouse,Private room in townhouse
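
The commented-out sanity check in the cell above can be tightened a little: if the four flags cover every `room_type` value, they should sum to exactly 1 on each row. A sketch on toy data, where the `None` row plays the role of a missing or unexpected room type:

```python
import pandas as pd

demo = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Hotel room", None]
})

# Same flag definitions as in the cell above, written as a mapping.
flag_map = {
    "is_entire_home": "Entire home/apt",
    "is_private_room": "Private room",
    "is_shared_room": "Shared room",
    "is_hotel_room": "Hotel room",
}
for col, value in flag_map.items():
    demo[col] = (demo["room_type"] == value).astype(int)

# Each row should have exactly one flag set; anything else needs a look.
flag_sum = demo[list(flag_map)].sum(axis=1)
unmatched = demo.loc[flag_sum != 1, "room_type"]

print(flag_sum.tolist())
print(len(unmatched), "row(s) with no matching flag")
```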


### Note for the Modeling Lead

We intentionally engineered **a lot** of features. That's a good thing: it gives you flexibility to try different models and ablations without having to come back to the data step.

You don’t have to use everything at once. A simple way to start:

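1. **Start small (numeric + binary only)**  
   - Use `numeric_cols + binary_cols` as `X` and `price` (or `log_price`) as `y`.  
   - First drop the columns computed from `price` itself (`price_per_*`, `price_minus_neigh_*`, `price_over_neigh_*`): they leak the target into `X`.  
   - Fit a baseline model (e.g., LinearRegression, RandomForestRegressor) and compare to our earlier baselines (global mean, neighbourhood mean).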

2. **Add categoricals with one-hot encoding**  
   - One-hot encode `categorical_cols` (`borough`, `neighbourhood_name`, `room_type`, `property_type_grouped`, `capacity_bucket`, `host_listings_bucket`, `rating_bucket`).  
   - Concatenate these encoded columns with the numeric + binary features.  
   - Refit the same models and compare RMSE/MAE to see the gain from location/room-type info.

3. **Use regularization / tree models to handle many features**  
   - Linear models: try Ridge/Lasso/ElasticNet on the full feature set (numeric + binary + one-hot).  
   - Tree-based models (Random Forest, Gradient Boosting, XGBoost, etc.) naturally handle lots of features and interactions.

4. **Feature importance / ablations**  
   - Once you have a good model, look at feature importances or coefficients.  
   - Optionally run small ablations (e.g., drop neighbourhood features, or drop host features) to see which groups matter most.

The goal of this notebook is to hand you a **rich, ready-to-use feature matrix** so you can focus on modeling choices and evaluation, not on going back and rebuilding data transformations.
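
As a starting point for steps 1 and 2, the sketch below filters out price-derived columns, computes the two reference baselines (global mean and neighbourhood mean), and one-hot encodes a categorical with `pd.get_dummies`. The toy frame, the `rmse` helper, and the name-based filter are all assumptions for illustration. The filter is deliberately conservative: it also drops `neigh_avg_price` and `neigh_median_price`, which are only safe when fitted on training rows.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "neighbourhood_name": ["A", "A", "B", "B"],
    "accommodates": [2, 4, 2, 6],
    "price": [100.0, 140.0, 60.0, 100.0],
    "price_per_accommodate": [50.0, 35.0, 30.0, 100.0 / 6],
})
numeric_cols_demo = ["accommodates", "price_per_accommodate"]

# Assumption: any column whose name mentions "price" is derived from the target.
safe_numeric = [c for c in numeric_cols_demo if "price" not in c]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


y = demo["price"]

# Baseline 1: predict the global mean price for every listing.
global_mean_pred = np.full(len(demo), y.mean())

# Baseline 2: predict each listing's neighbourhood mean price (in-sample here).
neigh_mean_pred = demo.groupby("neighbourhood_name")["price"].transform("mean")

# Feature matrix: safe numeric columns plus one-hot encoded neighbourhood.
X = pd.get_dummies(
    demo[safe_numeric + ["neighbourhood_name"]],
    columns=["neighbourhood_name"],
)

print("global mean RMSE:", rmse(y, global_mean_pred))
print("neighbourhood mean RMSE:", rmse(y, neigh_mean_pred))
print(list(X.columns))
```

On real data both baselines should be scored on a held-out split, so a fitted model is compared against them on rows it has not seen.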

In [18]:
# ===============================
# FINAL: save engineered features
# ===============================

# Where to save
output_path = "../data/processed/listing_features_engineered.csv"

# Save everything (features + targets + raw dates)
df_features.to_csv(output_path, index=False)
print("Saved engineered features to:", output_path)

# Quick shape + column check
print("Shape:", df_features.shape)
print("\nColumns:\n", list(df_features.columns))

Saved engineered features to: ../data/processed/listing_features_engineered.csv
Shape: (27797, 50)

Columns:
 ['city', 'host_since', 'host_is_superhost', 'room_type', 'property_type', 'accommodates', 'bedrooms', 'beds', 'bathrooms', 'bathrooms_text', 'latitude', 'longitude', 'price', 'number_of_reviews', 'availability_365', 'first_review', 'last_review', 'review_scores_rating', 'instant_bookable', 'calculated_host_listings_count', 'reviews_per_month', 'borough', 'neighbourhood_name', 'log_price', 'price_per_accommodate', 'price_per_bed', 'price_per_bedroom', 'available_days_365', 'availability_rate_365', 'blocked_or_booked_days_365', 'blocked_or_booked_rate_365', 'log_number_of_reviews', 'log_reviews_per_month', 'capacity_bucket', 'host_since_dt', 'host_years', 'host_listings_bucket', 'rating_bucket', 'neigh_avg_price', 'neigh_median_price', 'neigh_listing_count', 'price_minus_neigh_mean', 'price_over_neigh_mean', 'price_minus_neigh_median', 'price_over_neigh_median', 'is_entire_home',
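
One caveat about the CSV format: it does not store dtypes, so `host_since_dt` and the bucket columns come back as plain text when the file is reloaded. A sketch of restoring them on read, using an in-memory buffer instead of the real file and a shortened set of bucket labels:

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "host_since_dt": pd.to_datetime(["2008-09-09", "2008-12-10"]),
    "capacity_bucket": pd.cut([1, 5], bins=[0, 2, 4, 6], labels=["1-2", "3-4", "5-6"]),
})

# Round-trip through CSV in memory (stands in for the file written above).
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

buffer.seek(0)
raw = pd.read_csv(buffer)  # both columns come back as plain strings

buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["host_since_dt"])
# Rebuild the ordered categorical with the same labels used in pd.cut.
restored["capacity_bucket"] = pd.Categorical(
    restored["capacity_bucket"], categories=["1-2", "3-4", "5-6"], ordered=True
)

print(raw.dtypes)
print(restored.dtypes)
```

A format that preserves dtypes, such as Parquet, would avoid this step, at the cost of an extra dependency.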

In [19]:
import pandas as pd

# main targets
target_col = "price"
alt_target_col = "log_price"

# Categorical features (to one-hot encode later)
categorical_cols = [
    "borough",
    "neighbourhood_name",
    "room_type",
    "property_type_grouped",
    "capacity_bucket",
    "host_listings_bucket",
    "rating_bucket",
]

# Binary indicator features (already 0/1)
binary_cols = [
    "host_is_superhost",
    "instant_bookable",
    "is_entire_home",
    "is_private_room",
    "is_shared_room",
    "is_hotel_room",
]

# Raw date-like text columns we probably won't feed directly to the model
date_string_cols = ["host_since", "first_review", "last_review"]

# columns we know we *don't* want as numeric features
extra_raw_cols = [
    "property_type",   # original text
    "bathrooms_text",  # original text
    "host_since_dt",   # datetime, use host_years instead
    "lat_bin",         # if created earlier, don't use
    "lng_bin",         # if created earlier, don't use
]

exclude_for_numeric = (
    [target_col, alt_target_col]
    + categorical_cols
    + binary_cols
    + date_string_cols
    + extra_raw_cols
)

numeric_cols = [
    c for c in df_features.columns
    if c not in exclude_for_numeric
    and pd.api.types.is_numeric_dtype(df_features[c])
]

print("Target column:", target_col)
print("Alt target (optional):", alt_target_col)

print("\nCategorical columns (encode these):")
print(categorical_cols)

print("\nBinary columns (already 0/1):")
print(binary_cols)

print("\nNumeric columns:")
print(numeric_cols)

print("\nRaw date string columns (probably ignore or parse differently):")
print(date_string_cols)

Target column: price
Alt target (optional): log_price

Categorical columns (encode these):
['borough', 'neighbourhood_name', 'room_type', 'property_type_grouped', 'capacity_bucket', 'host_listings_bucket', 'rating_bucket']

Binary columns (already 0/1):
['host_is_superhost', 'instant_bookable', 'is_entire_home', 'is_private_room', 'is_shared_room', 'is_hotel_room']

Numeric columns:
['accommodates', 'bedrooms', 'beds', 'bathrooms', 'latitude', 'longitude', 'number_of_reviews', 'availability_365', 'review_scores_rating', 'calculated_host_listings_count', 'reviews_per_month', 'price_per_accommodate', 'price_per_bed', 'price_per_bedroom', 'available_days_365', 'availability_rate_365', 'blocked_or_booked_days_365', 'blocked_or_booked_rate_365', 'log_number_of_reviews', 'log_reviews_per_month', 'host_years', 'neigh_avg_price', 'neigh_median_price', 'neigh_listing_count', 'price_minus_neigh_mean', 'price_over_neigh_mean', 'price_minus_neigh_median', 'price_over_neigh_median']

Raw date strin
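
The summary mentions handing over the column lists together with the CSV, but the cells shown only print them. One possible way to persist them is a small JSON file next to the data. The file name `feature_groups.json` and the shortened lists are hypothetical, and the temporary directory only keeps the example free of side effects.

```python
import json
import tempfile
from pathlib import Path

# Shortened lists for the example; the real ones come from the cell above.
feature_groups = {
    "target_col": "price",
    "alt_target_col": "log_price",
    "categorical_cols": ["borough", "room_type"],
    "binary_cols": ["host_is_superhost", "instant_bookable"],
    "numeric_cols": ["accommodates", "bedrooms"],
}

# Hypothetical location: a temp directory instead of ../data/processed/.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "feature_groups.json"
out_path.write_text(json.dumps(feature_groups, indent=2))

# The modeling notebook can then load the same groups back.
loaded = json.loads(out_path.read_text())
print(loaded["numeric_cols"])
```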