## Feature Engineering Summary

We start from the cleaned `listing` + `neighbourhood` tables for NYC. Each row is one listing, with host info, location, reviews, availability, and price.

**Target**
- Main target: `price` (nightly USD).
- Also created: `log_price` = log(1 + `price`), to reduce skew for linear models.
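
Since the notebook builds `log_price` with `np.log1p`, predictions from a model trained on it are on the log(1 + price) scale and need `np.expm1` to get back to dollars. A minimal sketch of that round trip, using made-up prices (the array values and the `pred_log` name are illustrative, not taken from the dataset):

```python
import numpy as np

# Toy nightly prices in USD (illustrative values only)
price = np.array([59.0, 96.0, 240.0])

# Forward transform used for the alternative target
log_price = np.log1p(price)

# Stand-in for model predictions on the log scale
pred_log = log_price.copy()

# Invert with expm1 so errors can be reported in dollars
pred_price = np.expm1(pred_log)

print(np.allclose(pred_price, price))
```

Reporting RMSE/MAE on `pred_price` rather than `pred_log` keeps the numbers comparable with models trained directly on `price`.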

**Size & Capacity**
- Raw: `accommodates`, `bedrooms`, `beds`, `bathrooms`.
- Ratios: `price_per_accommodate`, `price_per_bed`, `price_per_bedroom`.
- Bucket: `capacity_bucket` ∈ {`1-2`, `3-4`, `5-6`, `7-8`, `9-12`, `13+`} based on `accommodates`.

**Availability & Demand**
- From `availability_365`:
  - `available_days_365`, `availability_rate_365`.
  - `blocked_or_booked_days_365`, `blocked_or_booked_rate_365`.
- Reviews:
  - `log_number_of_reviews` = log(1 + `number_of_reviews`).
  - `log_reviews_per_month` = log(1 + `reviews_per_month`).
  - `rating_bucket` from `review_scores_rating` ∈ {`<4.0`, `4.0-4.5`, `4.5-4.8`, `4.8-5.0`}.

**Host Features**
- `host_years` = years since `host_since` (using max `last_review` as reference).
- `host_listings_bucket` from `calculated_host_listings_count` ∈ {`1`, `2-3`, `4-10`, `11-50`, `50+`}.
- Keep binary: `host_is_superhost`, `instant_bookable`.

**Neighbourhood Pricing**
- Aggregates per `neighbourhood_name`:
  - `neigh_avg_price`, `neigh_median_price`, `neigh_listing_count`.
- Relative price:
  - `price_minus_neigh_mean`, `price_over_neigh_mean`.
  - `price_minus_neigh_median`, `price_over_neigh_median`.
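
One caveat: these aggregates are computed from `price` over all rows, so when they are used as model inputs, each listing's own price (and the price of any held-out row) feeds into its neighbourhood statistics. A hedged sketch of one way around that is to compute the statistics on a training split only and map them onto held-out rows. The toy frame, the split, and the global fallback for unseen neighbourhoods are assumptions for illustration:

```python
import pandas as pd

# Toy listings (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "neighbourhood_name": ["Midtown", "Midtown", "Midtown", "Harlem", "Harlem", "SoHo"],
    "price": [200.0, 300.0, 400.0, 100.0, 120.0, 500.0],
})

# Hypothetical split: rows 2 and 5 are held out
train = toy.iloc[[0, 1, 3, 4]]
test = toy.iloc[[2, 5]].copy()

# Aggregates computed from training rows only
stats = train.groupby("neighbourhood_name")["price"].agg(
    neigh_avg_price="mean",
    neigh_median_price="median",
    neigh_listing_count="size",
)

# Attach the training-based statistics to held-out rows
test = test.join(stats, on="neighbourhood_name")

# Neighbourhoods unseen in training get global training values (assumed fallback)
test["neigh_avg_price"] = test["neigh_avg_price"].fillna(train["price"].mean())
test["neigh_median_price"] = test["neigh_median_price"].fillna(train["price"].median())
test["neigh_listing_count"] = test["neigh_listing_count"].fillna(0)

print(test)
```

The same pattern extends to cross-validation by recomputing `stats` inside each fold.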

**Room & Property Type**
- Room type flags:
  - `is_entire_home`, `is_private_room`, `is_shared_room`, `is_hotel_room`.
- Property type:
  - `property_type_grouped` keeps the 8 most common `property_type` values, others → `"Other"`.

**Location**
- Keep raw `borough`, `neighbourhood_name`, `latitude`, `longitude`.

The final engineered table is saved as  
`data/processed/listing_features_engineered.csv`  
along with lists of `categorical_cols`, `binary_cols`, and `numeric_cols` for modeling.
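
CSV does not store pandas dtypes, so the `pd.cut` buckets come back as plain strings when the file is reloaded. If bucket order matters downstream, it can be restored roughly as sketched here. The in-memory CSV is a stand-in for the real file, and the label list is copied from the capacity bucket definition above:

```python
import io
import pandas as pd

# Stand-in for the saved feature file (toy rows, not real listings)
csv_text = "accommodates,capacity_bucket\n1,1-2\n3,3-4\n16,13+\n"
loaded = pd.read_csv(io.StringIO(csv_text))

# Same labels, in the same order, as the capacity bucket definition
cap_labels = ["1-2", "3-4", "5-6", "7-8", "9-12", "13+"]

# Rebuild the ordered categorical that the CSV round trip dropped
loaded["capacity_bucket"] = pd.Categorical(
    loaded["capacity_bucket"], categories=cap_labels, ordered=True
)

print(loaded["capacity_bucket"].dtype)
print(loaded["capacity_bucket"].max())
```

The same approach would apply to `host_listings_bucket` and `rating_bucket` with their own label lists.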

In [33]:
import sqlite3
import pandas as pd
import numpy as np

# connect to SQLite DB (adjust path if needed)
conn = sqlite3.connect("../data/airbnb.db")

# base table: listing + neighbourhood info
base_query = """
SELECT 
    l.*,
    n.borough,
    n.neighbourhood_name
FROM listing l
LEFT JOIN neighbourhood n
    ON l.neighbourhood_id = n.neighbourhood_id
"""

df = pd.read_sql_query(base_query, conn)

print(df.shape)
df.head()

(14436, 27)


Unnamed: 0,listing_id,neighbourhood_id,host_id,host_name,host_since,host_is_superhost,room_type,property_type,accommodates,bedrooms,...,availability_365,estimated_revenue,first_review,last_review,review_scores_rating,instant_bookable,calculated_host_listings_count,reviews_per_month,borough,neighbourhood_name
0,2595,115,2845,Jennifer,2008-09-09,0,Entire home/apt,Entire rental unit,1,0,...,289,0.0,2009-11-21,2022-06-21,4.68,0,3,0.24,Manhattan,Midtown
1,6848,96,15991,Allen,2009-05-06,1,Entire home/apt,Entire rental unit,3,2,...,285,17280.0,2009-05-25,2025-06-09,4.59,0,1,0.98,Brooklyn,Williamsburg
2,6872,102,16104,Kahshanna,2009-05-07,0,Private room,Private room in condo,1,1,...,83,0.0,2022-06-05,2022-06-05,5.0,0,2,0.02,Manhattan,East Harlem
3,6990,102,16800,Cynthia,2009-05-12,0,Private room,Private room in rental unit,1,2,...,186,17520.0,2009-10-28,2025-05-27,4.88,0,1,1.28,Manhattan,East Harlem
4,7097,76,17571,Jane,2009-05-17,1,Private room,Private room in guest suite,2,1,...,0,55080.0,2010-01-16,2025-09-23,4.89,1,2,2.21,Brooklyn,Fort Greene


In [34]:
# target
target_col = "price"

# columns we definitely don't want as features
id_cols = ["listing_id", "neighbourhood_id", "host_id"]
leaky_cols = ["estimated_revenue"]  # derived from price & availability
text_id_cols = ["host_name"]        # mostly just an identifier

drop_cols = id_cols + leaky_cols + text_id_cols

df_model = df.drop(columns=drop_cols)

df_model.head()

Unnamed: 0,host_since,host_is_superhost,room_type,property_type,accommodates,bedrooms,beds,bathrooms,bathrooms_text,latitude,...,number_of_reviews,availability_365,first_review,last_review,review_scores_rating,instant_bookable,calculated_host_listings_count,reviews_per_month,borough,neighbourhood_name
0,2008-09-09,0,Entire home/apt,Entire rental unit,1,0,1,1.0,1 bath,40.75356,...,47,289,2009-11-21,2022-06-21,4.68,0,3,0.24,Manhattan,Midtown
1,2009-05-06,1,Entire home/apt,Entire rental unit,3,2,1,1.0,1 bath,40.70935,...,195,285,2009-05-25,2025-06-09,4.59,0,1,0.98,Brooklyn,Williamsburg
2,2009-05-07,0,Private room,Private room in condo,1,1,1,1.0,1 shared bath,40.80107,...,1,83,2022-06-05,2022-06-05,5.0,0,2,0.02,Manhattan,East Harlem
3,2009-05-12,0,Private room,Private room in rental unit,1,2,2,1.0,1 shared bath,40.78778,...,249,186,2009-10-28,2025-05-27,4.88,0,1,1.28,Manhattan,East Harlem
4,2009-05-17,1,Private room,Private room in guest suite,2,1,2,1.0,1 private bath,40.69194,...,423,0,2010-01-16,2025-09-23,4.89,1,2,2.21,Brooklyn,Fort Greene


In [35]:
import numpy as np

df_features = df_model.copy()

# --- 1) Target transform: log price (for later modeling) ---
df_features["log_price"] = np.log1p(df_features["price"])

# --- 2) Price per capacity features (ratios) ---
accom = df_features["accommodates"].replace(0, np.nan)
beds = df_features["beds"].replace(0, np.nan)
bedrooms = df_features["bedrooms"].replace(0, np.nan)

df_features["price_per_accommodate"] = df_features["price"] / accom
df_features["price_per_bed"] = df_features["price"] / beds
df_features["price_per_bedroom"] = df_features["price"] / bedrooms

# --- 3) Availability-based features (no fake "occupancy") ---

# how many days the host is *willing* to rent
df_features["available_days_365"] = df_features["availability_365"]

# proportion of the year they are open for business
df_features["availability_rate_365"] = df_features["available_days_365"] / 365.0

# days that are *not* available: could be booked OR manually blocked
df_features["blocked_or_booked_days_365"] = 365 - df_features["available_days_365"]
df_features["blocked_or_booked_rate_365"] = (
    df_features["blocked_or_booked_days_365"] / 365.0
)

# --- 4) Review-based transforms ---
df_features["log_number_of_reviews"] = np.log1p(df_features["number_of_reviews"])

# reviews_per_month can be 0 or very small; clip negatives just in case
rpm = df_features["reviews_per_month"].clip(lower=0)
df_features["log_reviews_per_month"] = np.log1p(rpm)

# Quick preview of the new columns
df_features[
    [
        "price",
        "log_price",
        "accommodates",
        "beds",
        "bedrooms",
        "price_per_accommodate",
        "price_per_bed",
        "price_per_bedroom",
        "availability_365",
        "available_days_365",
        "availability_rate_365",
        "blocked_or_booked_days_365",
        "blocked_or_booked_rate_365",
        "number_of_reviews",
        "log_number_of_reviews",
        "reviews_per_month",
        "log_reviews_per_month",
    ]
].head()

Unnamed: 0,price,log_price,accommodates,beds,bedrooms,price_per_accommodate,price_per_bed,price_per_bedroom,availability_365,available_days_365,availability_rate_365,blocked_or_booked_days_365,blocked_or_booked_rate_365,number_of_reviews,log_number_of_reviews,reviews_per_month,log_reviews_per_month
0,240.0,5.484797,1,1,0,240.0,240.0,,289,289,0.791781,76,0.208219,47,3.871201,0.24,0.215111
1,96.0,4.574711,3,1,2,32.0,96.0,48.0,285,285,0.780822,80,0.219178,195,5.278115,0.98,0.683097
2,59.0,4.094345,1,1,1,59.0,59.0,59.0,83,83,0.227397,282,0.772603,1,0.693147,0.02,0.019803
3,73.0,4.304065,1,2,2,73.0,36.5,36.5,186,186,0.509589,179,0.490411,249,5.521461,1.28,0.824175
4,216.0,5.379897,2,2,1,108.0,108.0,216.0,0,0,0.0,365,1.0,423,6.049733,2.21,1.166271


In [36]:
import pandas as pd
import numpy as np

# -------------------------------
# 1) Capacity bucket (categorical)
# -------------------------------
cap_bins = [0, 2, 4, 6, 8, 12, np.inf]
cap_labels = ["1-2", "3-4", "5-6", "7-8", "9-12", "13+"]

df_features["capacity_bucket"] = pd.cut(
    df_features["accommodates"],
    bins=cap_bins,
    labels=cap_labels
)

# -----------------------------------------
# 2) Host experience: years on Airbnb
# -----------------------------------------
# make a datetime version of host_since
df_features["host_since_dt"] = pd.to_datetime(
    df_features["host_since"], errors="coerce"
)

# use the max last_review date as a rough "reference" date
# (parse first so the max is taken over dates, not raw strings)
ref_date = pd.to_datetime(df_features["last_review"], errors="coerce").max()

df_features["host_years"] = (
    (ref_date - df_features["host_since_dt"]).dt.days / 365.25
)

# -------------------------------------------
# 3) Host scale: how many listings they run
# -------------------------------------------
df_features["host_listings_bucket"] = pd.cut(
    df_features["calculated_host_listings_count"],
    bins=[0, 1, 3, 10, 50, np.inf],
    labels=["1", "2-3", "4-10", "11-50", "50+"]
)

# ----------------------------------
# 4) Rating quality buckets
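# ----------------------------------
# pd.cut bins are right-closed: a rating of exactly 4.0 lands in "<4.0",
# and a rating of 0 falls outside the first bin (NaN)
df_features["rating_bucket"] = pd.cut(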
    df_features["review_scores_rating"],
    bins=[0, 4.0, 4.5, 4.8, 5.1],
    labels=["<4.0", "4.0-4.5", "4.5-4.8", "4.8-5.0"]
)

# -----------------------------------------------------
# 5) Neighbourhood-level price stats (aggregated)
# -----------------------------------------------------

# if we already have these from a previous run, drop them to avoid _x/_y columns
neigh_cols = [
    "neigh_avg_price", "neigh_median_price", "neigh_listing_count",
    "price_minus_neigh_mean", "price_over_neigh_mean",
    "price_minus_neigh_median", "price_over_neigh_median",
]
df_features = df_features.drop(columns=neigh_cols, errors="ignore")

neigh_stats = (
    df_features
    .groupby("neighbourhood_name")["price"]
    .agg(
        neigh_avg_price="mean",
        neigh_median_price="median",
        neigh_listing_count="size",
    )
    .reset_index()
)

# join back to each listing
df_features = df_features.merge(
    neigh_stats, on="neighbourhood_name", how="left"
)

# relative-to-neighbourhood features
df_features["price_minus_neigh_mean"] = (
    df_features["price"] - df_features["neigh_avg_price"]
)
df_features["price_over_neigh_mean"] = (
    df_features["price"] / df_features["neigh_avg_price"]
)

df_features["price_minus_neigh_median"] = (
    df_features["price"] - df_features["neigh_median_price"]
)
df_features["price_over_neigh_median"] = (
    df_features["price"] / df_features["neigh_median_price"]
)

# quick peek at the new features
df_features[
    [
        "accommodates",
        "capacity_bucket",
        "host_years",
        "host_listings_bucket",
        "review_scores_rating",
        "rating_bucket",
        "neighbourhood_name",
        "neigh_avg_price",
        "neigh_median_price",
        "neigh_listing_count",
        "price",
        "price_minus_neigh_mean",
        "price_over_neigh_mean",
        "price_minus_neigh_median",
        "price_over_neigh_median",
    ]
].head()

Unnamed: 0,accommodates,capacity_bucket,host_years,host_listings_bucket,review_scores_rating,rating_bucket,neighbourhood_name,neigh_avg_price,neigh_median_price,neigh_listing_count,price,price_minus_neigh_mean,price_over_neigh_mean,price_minus_neigh_median,price_over_neigh_median
0,1,1-2,17.059548,2-3,4.68,4.5-4.8,Midtown,1056.077728,296.0,669,240.0,-816.077728,0.227256,-56.0,0.810811
1,3,3-4,16.405202,1,4.59,4.5-4.8,Williamsburg,207.408772,165.0,570,96.0,-111.408772,0.462854,-69.0,0.581818
2,1,1-2,16.402464,2-3,5.0,4.8-5.0,East Harlem,165.254902,128.0,255,59.0,-106.254902,0.357024,-69.0,0.460938
3,1,1-2,16.388775,1,4.88,4.8-5.0,East Harlem,165.254902,128.0,255,73.0,-92.254902,0.441742,-55.0,0.570312
4,2,1-2,16.375086,2-3,4.89,4.8-5.0,Fort Greene,210.82243,180.0,107,216.0,5.17757,1.024559,36.0,1.2


In [37]:
import numpy as np
import pandas as pd

# ---------------------------
# 1) Room type indicator flags
# ---------------------------
df_features["is_entire_home"] = (df_features["room_type"] == "Entire home/apt").astype(int)
df_features["is_private_room"] = (df_features["room_type"] == "Private room").astype(int)
df_features["is_shared_room"] = (df_features["room_type"] == "Shared room").astype(int)
df_features["is_hotel_room"]  = (df_features["room_type"] == "Hotel room").astype(int)

# ------------------------------
# 2) Property type grouping
# ------------------------------
# keep the 8 most common property types, group the rest as "Other"
top_props = df_features["property_type"].value_counts().nlargest(8).index

df_features["property_type_grouped"] = np.where(
    df_features["property_type"].isin(top_props),
    df_features["property_type"],
    "Other"
)

# ------------------------------
# 3) Quick sanity check
# ------------------------------
#print("Room type flags (sum = number of listings in each type):")
#print(
#    df_features[[
#        "is_entire_home",
#        "is_private_room",
#        "is_shared_room",
#        "is_hotel_room",
#    ]].sum()
#)

#print("\nGrouped property types (value counts):")
#print(df_features["property_type_grouped"].value_counts())

df_features[[
    "room_type",
    "is_entire_home",
    "is_private_room",
    "is_shared_room",
    "is_hotel_room",
    "property_type",
    "property_type_grouped",
]].head()

Unnamed: 0,room_type,is_entire_home,is_private_room,is_shared_room,is_hotel_room,property_type,property_type_grouped
0,Entire home/apt,1,0,0,0,Entire rental unit,Entire rental unit
1,Entire home/apt,1,0,0,0,Entire rental unit,Entire rental unit
2,Private room,0,1,0,0,Private room in condo,Other
3,Private room,0,1,0,0,Private room in rental unit,Private room in rental unit
4,Private room,0,1,0,0,Private room in guest suite,Other


### Note for the Modeling Lead

We intentionally engineered **a lot** of features. That's a good thing: it gives you flexibility to try different models and ablations without having to come back to the data step.

You don't have to use everything at once. A simple way to start:

1. **Start small (numeric + binary only)**  
   - Use `numeric_cols + binary_cols` as `X` and `price` (or `log_price`) as `y`.  
   - Fit a baseline model (e.g., LinearRegression, RandomForestRegressor) and compare to our earlier baselines (global mean, neighbourhood mean).

2. **Add categoricals with one-hot encoding**  
   - One-hot encode `categorical_cols` (`borough`, `neighbourhood_name`, `room_type`, `property_type_grouped`, `capacity_bucket`, `host_listings_bucket`, `rating_bucket`).  
   - Concatenate these encoded columns with the numeric + binary features.  
   - Refit the same models and compare RMSE/MAE to see the gain from location/room-type info.

3. **Use regularization / tree models to handle many features**  
   - Linear models: try Ridge/Lasso/ElasticNet on the full feature set (numeric + binary + one-hot).  
   - Tree-based models (Random Forest, Gradient Boosting, XGBoost, etc.) naturally handle lots of features and interactions.

4. **Feature importance / ablations**  
   - Once you have a good model, look at feature importances or coefficients.  
   - Optionally run small ablations (e.g., drop neighbourhood features, or drop host features) to see which groups matter most.
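
Steps 1 and 2 can be sketched on synthetic data as follows. To stay within numpy and pandas, this sketch swaps in plain least squares (`np.linalg.lstsq`) and `pd.get_dummies` instead of the scikit-learn estimators named above. The toy frame, the borough premiums, and the `fit_predict_rmse` helper are all assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the feature table (not real listings)
toy = pd.DataFrame({
    "accommodates": rng.integers(1, 7, size=n),
    "borough": rng.choice(["Manhattan", "Brooklyn", "Queens"], size=n),
})
premium = toy["borough"].map({"Manhattan": 120.0, "Brooklyn": 60.0, "Queens": 0.0})
toy["price"] = 40.0 * toy["accommodates"] + premium + rng.normal(0, 10, size=n)

def fit_predict_rmse(X_train, y_train, X_test, y_test):
    """Ordinary least squares with an intercept; returns test RMSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

# Simple positional split: first 300 rows train, last 100 test
y = toy["price"].to_numpy()
y_train, y_test = y[:300], y[300:]

# Step 1: numeric feature only
X_num = toy[["accommodates"]].to_numpy(float)
rmse_numeric = fit_predict_rmse(X_num[:300], y_train, X_num[300:], y_test)

# Step 2: add the one-hot encoded categorical (one level dropped as reference)
X_all = pd.get_dummies(
    toy[["accommodates", "borough"]], columns=["borough"],
    drop_first=True, dtype=float,
).to_numpy(float)
rmse_onehot = fit_predict_rmse(X_all[:300], y_train, X_all[300:], y_test)

print(f"numeric only RMSE: {rmse_numeric:.1f}")
print(f"with one-hot RMSE: {rmse_onehot:.1f}")
```

On the real table the same comparison would use `numeric_cols + binary_cols` for step 1 and add the encoded `categorical_cols` for step 2; the size of the gap is an empirical question for the real data.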

The goal of this notebook is to hand you a **rich, ready-to-use feature matrix** so you can focus on modeling choices and evaluation, not on going back and rebuilding data transformations.
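
One caution when choosing `X`: several columns in `numeric_cols` are computed directly from `price` (the per-capacity ratios and the price-relative-to-neighbourhood features). A model given, say, `price_per_accommodate` together with `accommodates` can reconstruct the target almost exactly. A minimal sketch of filtering those columns out before fitting follows; `price_derived_cols` and `safe_numeric_cols` are hypothetical names, and the short `numeric_cols` list is an excerpt for illustration:

```python
# Excerpt of the numeric column list (illustrative subset)
numeric_cols = [
    "accommodates", "bedrooms", "host_years",
    "price_per_accommodate", "price_per_bed", "price_per_bedroom",
    "neigh_avg_price",
    "price_minus_neigh_mean", "price_over_neigh_mean",
    "price_minus_neigh_median", "price_over_neigh_median",
]

# Columns whose formula includes the listing's own price (hypothetical name)
price_derived_cols = {
    "price_per_accommodate", "price_per_bed", "price_per_bedroom",
    "price_minus_neigh_mean", "price_over_neigh_mean",
    "price_minus_neigh_median", "price_over_neigh_median",
}

# Keep only features that do not encode the target directly
safe_numeric_cols = [c for c in numeric_cols if c not in price_derived_cols]

print(safe_numeric_cols)
```

The price-derived columns remain useful for exploratory analysis, just not as inputs when `price` or `log_price` is the target.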

In [38]:
# ===============================
# FINAL: save engineered features
# ===============================

# Where to save
output_path = "../data/processed/listing_features_engineered.csv"

# Save everything (features + targets + raw dates)
df_features.to_csv(output_path, index=False)
print("Saved engineered features to:", output_path)

# Quick shape + column check
print("Shape:", df_features.shape)
print("\nColumns:\n", list(df_features.columns))

Saved engineered features to: ../data/processed/listing_features_engineered.csv
Shape: (14436, 49)

Columns:
 ['host_since', 'host_is_superhost', 'room_type', 'property_type', 'accommodates', 'bedrooms', 'beds', 'bathrooms', 'bathrooms_text', 'latitude', 'longitude', 'price', 'number_of_reviews', 'availability_365', 'first_review', 'last_review', 'review_scores_rating', 'instant_bookable', 'calculated_host_listings_count', 'reviews_per_month', 'borough', 'neighbourhood_name', 'log_price', 'price_per_accommodate', 'price_per_bed', 'price_per_bedroom', 'available_days_365', 'availability_rate_365', 'blocked_or_booked_days_365', 'blocked_or_booked_rate_365', 'log_number_of_reviews', 'log_reviews_per_month', 'capacity_bucket', 'host_since_dt', 'host_years', 'host_listings_bucket', 'rating_bucket', 'neigh_avg_price', 'neigh_median_price', 'neigh_listing_count', 'price_minus_neigh_mean', 'price_over_neigh_mean', 'price_minus_neigh_median', 'price_over_neigh_median', 'is_entire_home', 'is_pri

In [39]:
import pandas as pd

# main targets
target_col = "price"
alt_target_col = "log_price"

# Categorical features (to one-hot encode later)
categorical_cols = [
    "borough",
    "neighbourhood_name",
    "room_type",
    "property_type_grouped",
    "capacity_bucket",
    "host_listings_bucket",
    "rating_bucket",
]

# Binary indicator features (already 0/1)
binary_cols = [
    "host_is_superhost",
    "instant_bookable",
    "is_entire_home",
    "is_private_room",
    "is_shared_room",
    "is_hotel_room",
]

# Raw date-like text columns that probably won't be fed directly to the model
date_string_cols = ["host_since", "first_review", "last_review"]

# columns we know we *don't* want as numeric features
extra_raw_cols = [
    "property_type",   # original text
    "bathrooms_text",  # original text
    "host_since_dt",   # datetime, use host_years instead
    "lat_bin",         # if created earlier, don't use
    "lng_bin",         # if created earlier, don't use
]

exclude_for_numeric = (
    [target_col, alt_target_col]
    + categorical_cols
    + binary_cols
    + date_string_cols
    + extra_raw_cols
)

numeric_cols = [
    c for c in df_features.columns
    if c not in exclude_for_numeric
    and pd.api.types.is_numeric_dtype(df_features[c])
]

print("Target column:", target_col)
print("Alt target (optional):", alt_target_col)

print("\nCategorical columns (encode these):")
print(categorical_cols)

print("\nBinary columns (already 0/1):")
print(binary_cols)

print("\nNumeric columns:")
print(numeric_cols)

print("\nRaw date string columns (probably ignore or parse differently):")
print(date_string_cols)

Target column: price
Alt target (optional): log_price

Categorical columns (encode these):
['borough', 'neighbourhood_name', 'room_type', 'property_type_grouped', 'capacity_bucket', 'host_listings_bucket', 'rating_bucket']

Binary columns (already 0/1):
['host_is_superhost', 'instant_bookable', 'is_entire_home', 'is_private_room', 'is_shared_room', 'is_hotel_room']

Numeric columns:
['accommodates', 'bedrooms', 'beds', 'bathrooms', 'latitude', 'longitude', 'number_of_reviews', 'availability_365', 'review_scores_rating', 'calculated_host_listings_count', 'reviews_per_month', 'price_per_accommodate', 'price_per_bed', 'price_per_bedroom', 'available_days_365', 'availability_rate_365', 'blocked_or_booked_days_365', 'blocked_or_booked_rate_365', 'log_number_of_reviews', 'log_reviews_per_month', 'host_years', 'neigh_avg_price', 'neigh_median_price', 'neigh_listing_count', 'price_minus_neigh_mean', 'price_over_neigh_mean', 'price_minus_neigh_median', 'price_over_neigh_median']

Raw date strin

In [40]:
import pandas as pd
import numpy as np
import os

# Read the cleaned listings CSV (output from clean_csvs.ipynb)
df_clean = pd.read_csv("../data/processed/listings_cleaned.csv")
print(f"Data shape: {df_clean.shape}")
df_clean.head()

Data shape: (21328, 79)


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,40824219,https://www.airbnb.com/rooms/40824219,20251001171547,2025-10-02,city scrape,Room close to Manhattan for FEMALE guests,This cozy spacious room includes a twin size b...,Sunnyside is a safe residental area. <br />The...,https://a0.muscache.com/pictures/hosting/Hosti...,317540555,...,4.88,4.94,4.69,,f,3,0,3,0,0.23
1,40839416,https://www.airbnb.com/rooms/40839416,20251001171547,2025-10-02,city scrape,🪴XL dojo 🌾 shared green yogi palace apt 🌿,"New York City living at its best. A massive, c...",Live like the Ramones > The East Village is st...,https://a0.muscache.com/pictures/hosting/Hosti...,4765305,...,5.0,5.0,4.95,,f,8,0,8,0,0.4
2,40843980,https://www.airbnb.com/rooms/40843980,20251001171547,2025-10-01,city scrape,Cozy 2 Bedroom Spacious Apartment near Manhattan,This 2 bed. furnished apt on the 2nd fl. in Oz...,The borough of Queens offers plenty of outdoor...,https://a0.muscache.com/pictures/c5ca4ce9-8cb5...,295370107,...,4.06,4.46,4.0,,f,2,2,0,0,1.46
3,40824301,https://www.airbnb.com/rooms/40824301,20251001171547,2025-10-02,city scrape,Cozy room in Williamsburg,"This place is located in Williamsburg, close t...",This is such a cool neighborhood with great st...,https://a0.muscache.com/pictures/hosting/Hosti...,14890430,...,4.88,4.88,4.77,,f,1,0,1,0,0.86
4,40825740,https://www.airbnb.com/rooms/40825740,20251001171547,2025-10-02,city scrape,House of Oyo - A Historic Brownstone Mansion,Located on the prestigious St. Marks Millionai...,"There are great coffee shops, bars, restaurant...",https://a0.muscache.com/pictures/55752387-150b...,7728754,...,5.0,5.0,5.0,,f,1,1,0,0,0.03


In [41]:
# Create new features from cleaned data
listing_features = pd.DataFrame()

listing_features["listing_id"] = df_clean["id"]

listing_features["price"] = df_clean["price"]

listing_features["borough"] = df_clean["neighbourhood_group_cleansed"]
listing_features["neighbourhood_name"] = df_clean["neighbourhood_cleansed"]

listing_features["accommodates"] = df_clean["accommodates"]
listing_features["bedrooms"] = df_clean["bedrooms"]
listing_features["beds"] = df_clean["beds"]

accom = listing_features["accommodates"].replace(0, np.nan)
listing_features["price_per_accommodate"] = listing_features["price"] / accom

listing_features["host_is_superhost"] = df_clean["host_is_superhost"]
listing_features["number_of_reviews"] = df_clean["number_of_reviews"]
listing_features["review_scores_rating"] = df_clean["review_scores_rating"]
listing_features["availability_365"] = df_clean["availability_365"]

listing_features.head()

Unnamed: 0,listing_id,price,borough,neighbourhood_name,accommodates,bedrooms,beds,price_per_accommodate,host_is_superhost,number_of_reviews,review_scores_rating,availability_365
0,40824219,66.0,Queens,Sunnyside,1,1.0,1.0,66.0,t,16,4.81,77
1,40839416,76.0,Manhattan,East Village,1,1.0,1.0,76.0,t,20,4.95,168
2,40843980,97.0,Queens,Ozone Park,6,2.0,3.0,16.166667,t,93,4.14,364
3,40824301,60.0,Brooklyn,Williamsburg,1,2.0,1.0,60.0,t,26,4.92,187
4,40825740,425.0,Brooklyn,Crown Heights,6,3.0,3.0,70.833333,f,1,5.0,224


In [42]:
# Save features to CSV
os.makedirs("../data/processed", exist_ok=True)

out_path = "../data/processed/listing_features.csv"
listing_features.to_csv(out_path, index=False)

print(f"Features saved to: {out_path}")
out_path

Features saved to: ../data/processed/listing_features.csv


'../data/processed/listing_features.csv'