### Building a Pricing Model for Airbnb Listings

**Business Case**

We operate a chain of Airbnb properties and need a data-driven pricing model.
The goal is to predict listing prices based on property characteristics,
availability, and host features in order to support pricing decisions.

---

## 1. Data Selection and Wrangling

We use Airbnb listings data from InsideAirbnb.

- **Core dataset:** Toronto, 2024 Q3 (earlier quarter, >10,000 listings)
- **Later period:** Toronto, 2025 Q4
- **Other city:** Vancouver, 2025 Q4

The Toronto 2024 Q3 dataset is used to train models. The other datasets
are used for validity checks.

Before modeling, we clean prices, extract amenities information,
handle missing values, and construct features suitable for prediction.




In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score
import time



In [2]:
import os
os.listdir("data/raw")


['toronto24Q3listings.csv',
 'vancouver25Q4_listings.csv',
 'toronto25Q4_listings.csv']

In [3]:
## Earlier quarter (core dataset)
toronto_q3 = pd.read_csv("data/raw/toronto24Q3listings.csv")

## Later date (for validity checks later)
toronto_q4 = pd.read_csv("data/raw/toronto25Q4_listings.csv")

## Other city (same country)
vancouver_q4 = pd.read_csv("data/raw/vancouver25Q4_listings.csv")

toronto_q3.shape, toronto_q4.shape, vancouver_q4.shape


((21825, 75), (21468, 79), (5685, 79))

### Data Cleaning and Feature Engineering

Key steps:

- Clean the `price` variable and remove invalid values
- Log-transform price to reduce skewness
- Extract the number of amenities from the amenities string
- Select economically meaningful variables
- Impute missing values using medians (numeric) and "Unknown" (categorical)

This process is wrapped in a single function to ensure reproducibility
across datasets.


In [4]:
def clean_airbnb(df):
    df = df.copy()

    # Clean price
    df["price"] = (
        df["price"]
        .replace(r"[\$,]", "", regex=True)
        .astype(float)
    )
    df = df[df["price"] > 0]

    # Log-transform target
    df["log_price"] = np.log(df["price"])

    # Amenities count
    df["n_amenities"] = df["amenities"].fillna("").apply(
        lambda x: len(x.split(",")) if x != "" else 0
    )

    numeric_features = [
        "accommodates",
        "bedrooms",
        "minimum_nights",
        "availability_365",
        "number_of_reviews",
        "reviews_per_month",
        "review_scores_rating",
        "n_amenities"
    ]

    categorical_features = [
        "room_type",
        "property_type"
    ]

    # Impute missing values
    df[numeric_features] = df[numeric_features].fillna(
        df[numeric_features].median()
    )
    df[categorical_features] = df[categorical_features].fillna("Unknown")

    return df, numeric_features, categorical_features
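
The comma-splitting rule in `clean_airbnb` is a simple proxy for the number of amenities. As a hedged alternative, the sketch below assumes the `amenities` column holds a JSON-style list string (for example `["Wifi", "Kitchen"]`); this format is an assumption about the raw files, and `count_amenities` is a hypothetical helper that is not part of the notebook. Under that assumption, parsing the string avoids counting an empty list `[]` as one amenity and avoids splitting names that contain commas.

```python
import json


def count_amenities(raw):
    """Count amenities in a JSON-style list string (assumed format)."""
    if not isinstance(raw, str) or raw.strip() == "":
        return 0
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        # Not valid JSON: fall back to counting non-empty comma-separated parts
        return len([part for part in raw.split(",") if part.strip()])
    return len(items) if isinstance(items, list) else 0


# Illustrative strings, not taken from the dataset
examples = ['["Wifi", "Kitchen"]', "[]", '["Shampoo, conditioner", "Wifi"]', ""]
counts = [count_amenities(s) for s in examples]
print(counts)
```

Swapping this in would change `n_amenities` and therefore the results reported later, so it is shown only as an option.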


In [5]:
## Apply cleaning consistently
toronto_core, numeric_features, categorical_features = clean_airbnb(toronto_q3)
toronto_q4_clean, _, _ = clean_airbnb(toronto_q4)
vancouver_q4_clean, _, _ = clean_airbnb(vancouver_q4)

toronto_core.shape



(16536, 77)
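
Because `clean_airbnb` computes medians inside each call, every dataset is imputed with its own medians. A stricter alternative is to learn the medians on the training data and reuse them elsewhere. The sketch below illustrates the idea on toy data; `fit_medians` and `impute_with` are hypothetical helpers, not functions from the notebook.

```python
import pandas as pd


def fit_medians(train_df, columns):
    """Compute per-column medians on the training data only."""
    return train_df[columns].median()


def impute_with(df, medians):
    """Fill missing values using medians learned elsewhere."""
    out = df.copy()
    cols = list(medians.index)
    out[cols] = out[cols].fillna(medians)
    return out


# Toy data (hypothetical values)
train = pd.DataFrame({"bedrooms": [1.0, 2.0, 3.0], "accommodates": [2.0, 4.0, 6.0]})
external = pd.DataFrame({"bedrooms": [float("nan"), 4.0],
                         "accommodates": [3.0, float("nan")]})

medians = fit_medians(train, ["bedrooms", "accommodates"])
imputed = impute_with(external, medians)
print(imputed)
```

With this design the external datasets never influence the values used to fill gaps, which keeps the validity checks closer to a real deployment setting.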

**Feature matrix and target**

We define the feature matrix **X** using numeric and categorical variables
and the target **y** as log price.


In [6]:
X = toronto_core[numeric_features + categorical_features]
y = toronto_core["log_price"]

X.shape, y.shape


((16536, 10), (16536,))

In [7]:
# Train–test split & encoding
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [8]:
# One-hot encode categorical variables
X_train_enc = pd.get_dummies(X_train, drop_first=True)
X_test_enc = pd.get_dummies(X_test, drop_first=True)

# Align columns
X_train_enc, X_test_enc = X_train_enc.align(
    X_test_enc, join="left", axis=1, fill_value=0
)

X_train_enc.shape, X_test_enc.shape


((13228, 60), (3308, 60))
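
One caveat with encoding the two splits separately: `drop_first=True` drops the first category present in each frame, so the dropped reference level can differ between train and test. The toy sketch below shows how this can zero out a real category after alignment, and one possible fix (encode all levels, then reindex to the training columns). `encode_like_train` is a hypothetical helper, and the data are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"room_type": ["Entire home/apt", "Private room", "Shared room"]})
test = pd.DataFrame({"room_type": ["Private room", "Shared room"]})

# Separate encoding: the test frame drops "Private room" as its first level
train_enc = pd.get_dummies(train, drop_first=True)
test_enc = pd.get_dummies(test, drop_first=True)
_, test_aligned = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

# The private-room row now looks like the training reference category
lost = int(test_aligned["room_type_Private room"].sum())


def encode_like_train(df, train_columns):
    """Encode every level, then keep exactly the training columns."""
    enc = pd.get_dummies(df)
    return enc.reindex(columns=train_columns, fill_value=0)


test_fixed = encode_like_train(test, train_enc.columns)
kept = int(test_fixed["room_type_Private room"].sum())
print(lost, kept)
```

Whether this affects the notebook's numbers depends on which categories appear in each split; with large splits the first level usually coincides, but the reindex approach removes the dependence.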

### Predictive Models

In [9]:
## Evaluation helper
import time
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_model(model, X_train, X_test, y_train, y_test):
    start = time.time()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    elapsed = time.time() - start

    rmse = np.sqrt(mean_squared_error(y_test, preds))
    r2 = r2_score(y_test, preds)

    return rmse, r2, elapsed


In [10]:
## OLS
from sklearn.linear_model import LinearRegression

ols = LinearRegression()
ols_rmse, ols_r2, ols_time = evaluate_model(
    ols, X_train_enc, X_test_enc, y_train, y_test
)




In [11]:
## LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.001, max_iter=5000)
lasso_rmse, lasso_r2, lasso_time = evaluate_model(
    lasso, X_train_enc, X_test_enc, y_train, y_test
)




In [12]:
## Random Forest
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

rf_rmse, rf_r2, rf_time = evaluate_model(
    rf, X_train_enc, X_test_enc, y_train, y_test
)


In [13]:
## Gradient Boosting
from sklearn.ensemble import GradientBoostingRegressor

gboost = GradientBoostingRegressor(random_state=42)
gb_rmse, gb_r2, gb_time = evaluate_model(
    gboost, X_train_enc, X_test_enc, y_train, y_test
)


In [14]:
## Extra Trees

from sklearn.ensemble import ExtraTreesRegressor

extra_trees = ExtraTreesRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

et_rmse, et_r2, et_time = evaluate_model(
    extra_trees, X_train_enc, X_test_enc, y_train, y_test
)


### Model Comparison

In [15]:
results = pd.DataFrame({
    "Model": ["OLS", "LASSO", "Random Forest", "Gradient Boosting", "Extra Trees"],
    "RMSE": [ols_rmse, lasso_rmse, rf_rmse, gb_rmse, et_rmse],
    "R2": [ols_r2, lasso_r2, rf_r2, gb_r2, et_r2],
    "Training Time (s)": [ols_time, lasso_time, rf_time, gb_time, et_time]
})

results


|   | Model | RMSE | R2 | Training Time (s) |
|---|-------|------|----|-------------------|
| 0 | OLS | 0.485347 | 0.578551 | 0.011246 |
| 1 | LASSO | 0.488060 | 0.573826 | 0.013218 |
| 2 | Random Forest | 0.457907 | 0.624858 | 1.413417 |
| 3 | Gradient Boosting | 0.453178 | 0.632567 | 1.289345 |
| 4 | Extra Trees | 0.472760 | 0.600126 | 0.965814 |


### Model Performance

Linear models (OLS and LASSO) provide reasonable baseline performance but
are clearly outperformed by tree-based models.

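Gradient Boosting achieves the lowest RMSE (0.453) and highest R² (0.633),
with Random Forest close behind (RMSE 0.458, R² 0.625). Both take roughly
1.3 to 1.4 seconds to fit and predict, about 100 times longer than the
linear models but still negligible in absolute terms.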

Overall, non-linear models capture pricing dynamics more effectively.
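
Because the target is log price, the RMSE values are in log units rather than dollars. A rough way to read them is to exponentiate: an error of one RMSE corresponds to the predicted price being off by a multiplicative factor. The sketch below applies this to the Gradient Boosting RMSE; `log_rmse_to_factor` is a hypothetical helper and the interpretation is approximate.

```python
import math


def log_rmse_to_factor(rmse_log):
    """Convert an RMSE in log-price units into a multiplicative price factor."""
    return math.exp(rmse_log)


gb_rmse_log = 0.453178  # Gradient Boosting test RMSE reported in the comparison
factor = log_rmse_to_factor(gb_rmse_log)

# Express the factor as a percentage over- and under-shoot
pct_above = 100 * (factor - 1)
pct_below = 100 * (1 - 1 / factor)
print(f"factor x{factor:.2f}: about {pct_above:.0f}% above or {pct_below:.0f}% below")
```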


### Feature Importance

In [16]:
## Random Forest
rf_importance = pd.Series(
    rf.feature_importances_,
    index=X_train_enc.columns
).sort_values(ascending=False)

rf_importance.head(10)


room_type_Private room        0.342789
accommodates                  0.140623
minimum_nights                0.096183
availability_365              0.090892
n_amenities                   0.084471
reviews_per_month             0.054923
number_of_reviews             0.044201
bedrooms                      0.037539
review_scores_rating          0.035910
property_type_Entire condo    0.016109
dtype: float64

In [17]:
## Gradient Boosting
gb_importance = pd.Series(
    gboost.feature_importances_,
    index=X_train_enc.columns
).sort_values(ascending=False)

gb_importance.head(10)


room_type_Private room                0.367265
accommodates                          0.285114
minimum_nights                        0.104202
bedrooms                              0.079863
property_type_Private room in home    0.028722
property_type_Entire condo            0.028143
availability_365                      0.018918
n_amenities                           0.017873
room_type_Shared room                 0.011095
property_type_Entire rental unit      0.008970
dtype: float64

### Feature Importance Comparison

Both models identify similar key drivers of price:

- Room type (entire vs private)
- Accommodation capacity
- Minimum nights
- Bedrooms
- Availability
- Number of amenities

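Some features appear in one top-10 list but not the other (for example,
`reviews_per_month` for Random Forest and `room_type_Shared room` for
Gradient Boosting). This is expected: the two algorithms build their trees
differently and therefore distribute importance differently. A feature absent
from a top-10 list simply ranks lower in that model; it does not indicate
missing data.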
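
To compare the two rankings directly, the importance series can be placed side by side. The sketch below uses a handful of values copied from the outputs above; in the notebook itself, `rf_importance` and `gb_importance` could be concatenated in full. A `NaN` in this toy table only means the feature was not among the values copied here.

```python
import pandas as pd

# Subset of the reported importances (copied from the top-10 outputs)
rf_top = pd.Series({
    "room_type_Private room": 0.342789,
    "accommodates": 0.140623,
    "reviews_per_month": 0.054923,
}, name="rf")
gb_top = pd.Series({
    "room_type_Private room": 0.367265,
    "accommodates": 0.285114,
    "room_type_Shared room": 0.011095,
}, name="gb")

# Outer join on feature names; features missing from one list become NaN
comparison = pd.concat([rf_top, gb_top], axis=1)
print(comparison)
```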

---

## Part II. Validity Checks

We test model stability across:
- A later time period (Toronto 2025 Q4)
- A different city (Vancouver 2025 Q4)


In [18]:
# External evaluation
def prepare_external_X(df):
    X_ext = df[numeric_features + categorical_features]
    X_ext_enc = pd.get_dummies(X_ext, drop_first=True)
    X_ext_enc = X_ext_enc.reindex(
        columns=X_train_enc.columns,
        fill_value=0
    )
    return X_ext_enc


In [19]:
X_toronto_q4 = prepare_external_X(toronto_q4_clean)
y_toronto_q4 = toronto_q4_clean["log_price"]

X_vancouver_q4 = prepare_external_X(vancouver_q4_clean)
y_vancouver_q4 = vancouver_q4_clean["log_price"]


In [20]:
def evaluate_external(model, X, y):
    preds = model.predict(X)
    rmse = np.sqrt(mean_squared_error(y, preds))
    r2 = r2_score(y, preds)
    return rmse, r2


In [21]:
external_results = pd.DataFrame({
    "Model": results["Model"],
    "Toronto 2025Q4 RMSE": [
        evaluate_external(ols, X_toronto_q4, y_toronto_q4)[0],
        evaluate_external(lasso, X_toronto_q4, y_toronto_q4)[0],
        evaluate_external(rf, X_toronto_q4, y_toronto_q4)[0],
        evaluate_external(gboost, X_toronto_q4, y_toronto_q4)[0],
        evaluate_external(extra_trees, X_toronto_q4, y_toronto_q4)[0]
    ]
})

external_results




|   | Model | Toronto 2025Q4 RMSE |
|---|-------|---------------------|
| 0 | OLS | 0.499052 |
| 1 | LASSO | 0.499975 |
| 2 | Random Forest | 0.462145 |
| 3 | Gradient Boosting | 0.469226 |
| 4 | Extra Trees | 0.463353 |


### Temporal Validity: Performance on Toronto 2025 Q4 Data

This analysis evaluates temporal stability by applying models trained on
Toronto 2024 Q3 data to a later dataset from the same city (Toronto 2025 Q4).

Performance changes only modestly relative to the 2024 Q3 test split. RMSE
rises by about 0.01 to 0.02 log points for OLS, LASSO, and Gradient Boosting
and by about 0.004 for Random Forest, while Extra Trees improves slightly
(0.473 to 0.463). Shifts of this size are plausible given changes in market
conditions, seasonality, and listing composition over time.

Gradient Boosting shows the largest increase in prediction error
(0.453 to 0.469), followed by OLS and LASSO. Random Forest and Extra Trees
are the most stable and achieve the lowest RMSE on the later data
(0.462 and 0.463).

Overall, the tree-based models retain lower error than the linear models
when deployed in a later period without retraining, although the ranking
among them changes.
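
The temporal shift can be made explicit by differencing the two RMSE columns. The sketch below copies the figures from the two result tables; the column names `test_2024Q3`, `toronto_2025Q4`, and `change` are chosen here for illustration.

```python
import pandas as pd

rmse = pd.DataFrame({
    "Model": ["OLS", "LASSO", "Random Forest", "Gradient Boosting", "Extra Trees"],
    # RMSE on the held-out 20% of Toronto 2024 Q3
    "test_2024Q3": [0.485347, 0.48806, 0.457907, 0.453178, 0.47276],
    # RMSE on the full Toronto 2025 Q4 dataset
    "toronto_2025Q4": [0.499052, 0.499975, 0.462145, 0.469226, 0.463353],
}).set_index("Model")

# Positive values mean the error grew on the later data
rmse["change"] = rmse["toronto_2025Q4"] - rmse["test_2024Q3"]
print(rmse.sort_values("change", ascending=False))
```

Note that the two columns are computed on samples of different size and composition, so small differences should not be over-interpreted.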



In [22]:
# Evaluate each model once and unpack both metrics
models = [ols, lasso, rf, gboost, extra_trees]
vancouver_scores = [
    evaluate_external(m, X_vancouver_q4, y_vancouver_q4) for m in models
]

results_vancouver = pd.DataFrame({
    "Model": results["Model"],
    "Vancouver 2025Q4 RMSE": [rmse for rmse, _ in vancouver_scores],
    "Vancouver 2025Q4 R2": [r2 for _, r2 in vancouver_scores]
})

results_vancouver




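|   | Model | Vancouver 2025Q4 RMSE | Vancouver 2025Q4 R2 |
|---|-------|-----------------------|---------------------|
| 0 | OLS | 0.511015 | 0.389054 |
| 1 | LASSO | 0.512935 | 0.384455 |
| 2 | Random Forest | 0.515541 | 0.378184 |
| 3 | Gradient Boosting | 0.492478 | 0.432574 |
| 4 | Extra Trees | 0.515794 | 0.377573 |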


### Cross-City Validity: Vancouver 2025 Q4

Cross-city validity is assessed by applying Toronto-trained models to
Vancouver listings from 2025 Q4, the same period as the later Toronto dataset.

Model performance deteriorates substantially across all approaches: R² falls
from between 0.57 and 0.63 on the Toronto test split to between 0.38 and 0.43
in Vancouver. This highlights structural differences between local housing
markets, demand patterns, and regulatory environments.

The advantage of tree-based models does not carry over uniformly. Gradient
Boosting achieves the strongest performance in Vancouver (RMSE 0.492,
R² 0.433), but Random Forest and Extra Trees (RMSE about 0.516) fall slightly
behind OLS and LASSO (RMSE 0.511 and 0.513).

These findings suggest that city-specific retraining or local calibration is
necessary for good pricing performance in practice, regardless of model class.



Models were not retrained on Toronto 2025 Q4 or Vancouver data in order to
preserve strict out-of-sample validity and ensure a fair evaluation of
temporal and geographic generalization.


## Extension: Business Implications

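The results suggest that a single global pricing model may be insufficient
for multi-city Airbnb operations. Even the best-transferring model, Gradient
Boosting, explains far less price variation in Vancouver (R² 0.43) than in
Toronto (R² 0.63), so pricing accuracy is likely to improve with local
calibration.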

In practice, a hybrid strategy could be adopted, combining a global model
with city-specific adjustments or fixed effects to balance scalability
and accuracy.
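
One simple form of city-specific adjustment is an intercept shift: estimate the average residual of the global model on a small local sample and add it to that city's predictions. The sketch below is a minimal illustration on made-up numbers, not something the notebook implements; `city_offset` and `apply_offset` are hypothetical helpers.

```python
import numpy as np


def city_offset(y_true_log, y_pred_log):
    """Mean residual (log-price units) of the global model in one city."""
    return float(np.mean(np.asarray(y_true_log) - np.asarray(y_pred_log)))


def apply_offset(y_pred_log, offset):
    """Shift global predictions by the city-level offset."""
    return np.asarray(y_pred_log) + offset


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Toy log prices (hypothetical, not from the datasets)
global_preds = np.array([4.8, 5.1, 5.4, 4.9])
local_truth = np.array([5.0, 5.3, 5.5, 5.2])

offset = city_offset(local_truth, global_preds)
calibrated = apply_offset(global_preds, offset)
rmse_before = rmse(local_truth, global_preds)
rmse_after = rmse(local_truth, calibrated)
print(offset, rmse_before, rmse_after)
```

In practice the offset would be estimated on one local sample and evaluated on another; here both steps use the same toy data purely to show the mechanics.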
