# GWPCA on NYC Airbnb Listings

**Dataset**: [NYC Airbnb Open Data](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data) — 48 895 listings across the five New York City boroughs, each described by price, review activity, availability, and host behaviour features.

**Question**: Do Airbnb pricing dynamics operate the same way across all NYC boroughs, or does the correlation structure between price, demand, and availability shift spatially?

Global PCA can only give a single, city-wide answer.  GWPCA computes a local covariance matrix for each listing, weighting the other listings by geographic proximity, and so reveals how the dominant axes of variation change from Manhattan to the outer boroughs.

In [4]:
!pip install -q "geolatent[umap]" kaggle

In [6]:
from google.colab import files
uploaded = files.upload()

Saving AB_NYC_2019.csv to AB_NYC_2019 (1).csv


---
## Load and Prepare

In [7]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio
from sklearn.preprocessing import StandardScaler
from geolatent import inspect_latent_space, DARK_SCIENTIFIC
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("AB_NYC_2019.csv")

print("Original Shape:", df.shape)
df.head()

Original Shape: (48895, 16)


| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | | | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


In [8]:
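# Drop rows missing coordinates, price, review rate, or availability.
# Note: reviews_per_month is empty for listings with no reviews (see row 2
# in the preview above), so never-reviewed listings are dropped here too.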
df = df.dropna(subset=["latitude", "longitude", "price",
                        "reviews_per_month", "availability_365"])

# Remove zero-price listings and extreme outliers (top 1%)
df = df[df["price"] > 0]
price_cap = df["price"].quantile(0.99)
df = df[df["price"] <= price_cap]

# Feature engineering
df["log_price"]          = np.log1p(df["price"])
df["log_min_nights"]     = np.log1p(df["minimum_nights"].clip(upper=90))
df["log_host_listings"]  = np.log1p(df["calculated_host_listings_count"])

features = [
    "log_price",
    "log_min_nights",
    "reviews_per_month",
    "availability_365",
    "log_host_listings",
]
df = df.dropna(subset=features)
print(f"Clean shape: {df.shape}")

borough_ids   = {b: i for i, b in enumerate(sorted(df["neighbourhood_group"].unique()))}
borough_names = {v: k for k, v in borough_ids.items()}

df["borough_id"] = df["neighbourhood_group"].map(borough_ids)
print("Boroughs:", borough_names)

Clean shape: (38444, 19)
Boroughs: {0: 'Bronx', 1: 'Brooklyn', 2: 'Manhattan', 3: 'Queens', 4: 'Staten Island'}
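
The `dropna` above also removes listings that have never been reviewed, because their `reviews_per_month` is empty. The sketch below illustrates that effect on a four-row toy frame (values copied from the preview rows) and shows one hypothetical alternative, which is to treat "no reviews yet" as a rate of zero. The notebook itself keeps the drop; the alternative is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with the three relevant columns (first four preview rows).
toy = pd.DataFrame({
    "price": [149, 225, 150, 89],
    "number_of_reviews": [9, 45, 0, 270],
    "reviews_per_month": [0.21, 0.38, np.nan, 4.64],
})

# Behaviour of the notebook: rows with a missing review rate are dropped.
dropped = toy.dropna(subset=["reviews_per_month"])

# Hypothetical alternative: a listing with zero reviews gets a rate of 0.
filled = toy.assign(
    reviews_per_month=toy["reviews_per_month"].where(
        toy["number_of_reviews"] > 0, 0.0
    )
)

print(len(dropped), len(filled))
```

Which option is appropriate depends on whether never-reviewed listings should count as part of the market being analysed.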


---
## Geographic Overview

All listings plotted by coordinates, coloured by borough.  This confirms
the spatial layout before running GWPCA.

In [9]:
palette = ["#58a6ff", "#3fb950", "#f78166", "#d2a8ff", "#ffa657"]

fig = go.Figure()
for bid, bname in borough_names.items():
    sub = df[df["borough_id"] == bid]
    fig.add_trace(go.Scatter(
        x=sub["longitude"], y=sub["latitude"],
        mode="markers",
        marker=dict(size=2, color=palette[bid], opacity=0.5),
        name=bname,
    ))
fig.update_layout(
    title="NYC Airbnb listings by borough",
    xaxis_title="Longitude", yaxis_title="Latitude",
    paper_bgcolor="#0d1117", plot_bgcolor="#161b22",
    font=dict(color="#e6edf3"),
    legend=dict(bgcolor="#161b22"),
    width=750, height=600,
)
fig.show()

---
## GWPCA

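GWPCA costs O(n) work per focal point (every focal point weights all n
listings), hence O(n²) overall.  We work with a stratified subsample of at most
600 listings per borough so runtimes stay within a few minutes on Colab's CPU.
The realised sample is 2 714 listings rather than 3 000, because not every
borough has 600 cleaned listings.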

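Bandwidth is set to **0.04 degrees**, about 4.4 km north-south and roughly
3.4 km east-west at NYC's latitude, because distances are computed directly on
latitude/longitude degrees.  This is roughly the scale of a NYC neighbourhood.
The Gaussian kernel has no hard cutoff: every listing contributes to each focal
point's local covariance matrix, but the weights decay quickly beyond the
bandwidth.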
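
The GWPCA cell in this notebook measures distance directly in degrees, so the kernel is somewhat narrower east-west than north-south. Below is a minimal sketch of one way to remove that distortion, assuming an equirectangular approximation is accurate enough at city scale. The helper `degrees_to_km`, the constant `KM_PER_DEG_LAT`, and the reference latitude are illustrative assumptions, not part of the notebook.

```python
import numpy as np

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude, in km

def degrees_to_km(coords_deg, ref_lat_deg):
    """Project (lat, lon) in degrees to local (y, x) in km (equirectangular)."""
    coords_deg = np.asarray(coords_deg, dtype=float)
    y_km = coords_deg[:, 0] * KM_PER_DEG_LAT
    # A degree of longitude shrinks by cos(latitude).
    x_km = coords_deg[:, 1] * KM_PER_DEG_LAT * np.cos(np.radians(ref_lat_deg))
    return np.column_stack([y_km, x_km])

ref_lat = 40.7  # assumed reference latitude, roughly that of NYC
pts = np.array([[40.70, -74.00],   # origin
                [40.74, -74.00],   # 0.04 degrees north
                [40.70, -73.96]])  # 0.04 degrees east
km = degrees_to_km(pts, ref_lat)
north_km = np.linalg.norm(km[1] - km[0])
east_km = np.linalg.norm(km[2] - km[0])
print(f"0.04 deg north = {north_km:.2f} km, 0.04 deg east = {east_km:.2f} km")
```

With projected coordinates the bandwidth would be given in kilometres (for example 4.4) instead of degrees.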

In [10]:
rng = np.random.default_rng(42)

per_borough = 600
sample_idx = np.concatenate([
    rng.choice(np.where(df["borough_id"].values == bid)[0],
               size=min(per_borough, (df["borough_id"] == bid).sum()),
               replace=False)
    for bid in borough_names
])
sub = df.iloc[sample_idx].reset_index(drop=True)

X_raw    = sub[features].values
coords   = sub[["latitude", "longitude"]].values
y_labels = sub["borough_id"].values
X_scaled = StandardScaler().fit_transform(X_raw)

print(f"Sample: {len(sub)} listings across {len(borough_names)} boroughs")

Sample: 2714 listings across 5 boroughs
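
The sample is smaller than 5 × 600 because `min(per_borough, ...)` caps each borough at the rows it actually has. The same capped stratified draw can be sketched with pandas on a toy frame; the group names and sizes below are made up for illustration.

```python
import pandas as pd

# Made-up stratum sizes: group "c" is smaller than the per-group cap.
toy = pd.DataFrame({"group": ["a"] * 1000 + ["b"] * 800 + ["c"] * 250})
cap = 600

# Draw at most `cap` rows from each group, without replacement.
parts = [
    g.sample(n=min(cap, len(g)), random_state=0)
    for _, g in toy.groupby("group")
]
sample = pd.concat(parts, ignore_index=True)

counts = sample["group"].value_counts().sort_index()
print(counts.to_dict())  # {'a': 600, 'b': 600, 'c': 250}
```

Small strata are kept whole, so they are over-represented relative to their share of the full data. That is intended here, since the goal is to compare boroughs.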


In [11]:
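def gwpca(X, coords, bandwidth, n_components=3):
    """Gaussian-kernel GWPCA: local PC scores and explained-variance ratios.

    X is (n, p) standardised features; coords is (n, 2) in the same units
    as bandwidth (degrees here).  Returns two (n, n_components) arrays.
    """
    n = len(X)
    scores     = np.zeros((n, n_components))
    var_ratios = np.zeros((n, n_components))

    for i in range(n):
        # Gaussian weights from the Euclidean distance to focal point i
        d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
        w = np.exp(-0.5 * (d / bandwidth) ** 2)

        # Weighted mean and weighted covariance around focal point i
        w_sum = w.sum()
        mu    = (X * w[:, None]).sum(0) / w_sum
        Xc    = X - mu
        C     = (Xc * w[:, None]).T @ Xc / w_sum

        # Eigendecomposition, sorted by decreasing eigenvalue
        eigvals, eigvecs = np.linalg.eigh(C)
        order            = eigvals.argsort()[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]

        # Score of listing i on its own local axes (eigenvector signs are arbitrary)
        scores[i]     = Xc[i] @ eigvecs[:, :n_components]
        total         = eigvals.clip(0).sum() + 1e-12
        var_ratios[i] = eigvals[:n_components].clip(0) / total

    return scores, var_ratios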


print("Running GWPCA (bandwidth=0.04°, ~4.4 km) ...")
local_scores, local_var = gwpca(X_scaled, coords, bandwidth=0.04)
print(f"Done.  Local scores shape: {local_scores.shape}")

Running GWPCA (bandwidth=0.04°, ~4.4 km) ...
Done.  Local scores shape: (2714, 3)
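
For reference, the loop in `gwpca` can be written out as follows. The symbols are introduced here for illustration: $u_i$ is the coordinate pair of listing $i$, $x_i$ its standardised feature vector, and $h$ the bandwidth.

$$
w_{ij} = \exp\!\left(-\tfrac{1}{2}\left(\frac{d_{ij}}{h}\right)^{2}\right),
\qquad d_{ij} = \lVert u_j - u_i \rVert_2
$$

$$
\mu_i = \frac{\sum_j w_{ij}\, x_j}{\sum_j w_{ij}},
\qquad
C_i = \frac{\sum_j w_{ij}\,(x_j - \mu_i)(x_j - \mu_i)^{\top}}{\sum_j w_{ij}}
$$

$$
C_i\, v_{ik} = \lambda_{ik}\, v_{ik},
\quad \lambda_{i1} \ge \lambda_{i2} \ge \dots,
\qquad
s_{ik} = (x_i - \mu_i)^{\top} v_{ik},
\qquad
r_{ik} = \frac{\max(\lambda_{ik}, 0)}{\sum_{l} \max(\lambda_{il}, 0)}
$$

Here $s_{ik}$ corresponds to `scores[i, k]` and $r_{ik}$ to `var_ratios[i, k]`; the code adds `1e-12` to the denominator of $r_{ik}$ to avoid division by zero.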


---
## Local PC1 Explained Variance — Spatial Map

How much of the local variance does the first principal component capture?  A high
value means one dominant axis explains most of the price/demand structure in
that neighbourhood.  Heterogeneity across the map indicates that the market
behaves differently in different parts of the city.

In [12]:
fig = go.Figure(go.Scatter(
    x=coords[:, 1], y=coords[:, 0],
    mode="markers",
    marker=dict(
        color=local_var[:, 0],
        colorscale="Plasma",
        size=6,
        showscale=True,
        colorbar=dict(title="Local PC1<br>var ratio"),
        opacity=0.85,
    ),
))
fig.update_layout(
    title="Local PC1 explained variance across NYC",
    xaxis_title="Longitude", yaxis_title="Latitude",
    paper_bgcolor="#0d1117", plot_bgcolor="#161b22",
    font=dict(color="#e6edf3"),
    width=750, height=600,
)
fig.show()

print("\nMean local explained-variance ratio (PC1-PC3) by borough:")
for bid, bname in borough_names.items():
    mask = y_labels == bid
    print(f"  {bname:<15}  PC1={local_var[mask, 0].mean():.3f}  "
          f"PC2={local_var[mask, 1].mean():.3f}  PC3={local_var[mask, 2].mean():.3f}")


Mean local explained-variance ratio (PC1-PC3) by borough:
  Bronx            PC1=0.295  PC2=0.262  PC3=0.179
  Brooklyn         PC1=0.313  PC2=0.255  PC3=0.201
  Manhattan        PC1=0.370  PC2=0.233  PC3=0.176
  Queens           PC1=0.355  PC2=0.250  PC3=0.177
  Staten Island    PC1=0.322  PC2=0.256  PC3=0.203
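
As a reading aid for these ratios: with five features, 0.20 is the smallest possible PC1 share, reached when every direction carries the same variance. The borough means of 0.30 to 0.37 therefore indicate moderate rather than dominant local structure. Here is a minimal sketch with two made-up reference covariance matrices; `pc_variance_ratios` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def pc_variance_ratios(cov):
    """Eigenvalue shares of a covariance matrix, largest first."""
    eigvals = np.linalg.eigvalsh(cov)[::-1].clip(0)
    return eigvals / eigvals.sum()

p = 5  # same number of features as in this notebook

# Reference 1: five uncorrelated unit-variance features.
uncorrelated = np.eye(p)

# Reference 2 (made-up): every pair of features correlated at 0.5.
equicorrelated = np.full((p, p), 0.5) + 0.5 * np.eye(p)

share_uncorrelated = pc_variance_ratios(uncorrelated)[0]
share_equicorrelated = pc_variance_ratios(equicorrelated)[0]
print(f"{share_uncorrelated:.2f} {share_equicorrelated:.2f}")  # 0.20 0.60
```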


---
## GeoLatent — GWPCA Local Score Space

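Each listing is represented by its 3 local PC scores, i.e. its position relative
to the dominant axes of variation in its own neighbourhood.  GeoLatent projects
this cloud and colours it by borough.  Clear borough separation suggests that the
local correlation structure is geographically consistent within each borough.
One caveat: every listing's scores are computed in its own local eigenbasis, and
eigenvector signs are arbitrary, so scores of distant listings are not measured
on a common axis and the plot should be read qualitatively.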

In [13]:
inspect_latent_space(
    local_scores, y_labels,
    config=DARK_SCIENTIFIC.with_method("pca"),
    show_ellipsoids=True,
    show_convex_hulls=True,
    ellipsoid_confidence=0.80,
    class_names=borough_names,
    title="GWPCA local scores — NYC Airbnb (coloured by borough)",
).show()

Global PCA baseline: the same 2 714 sampled listings projected onto the globally
computed axes.  All boroughs share the same projection frame; within-borough
spatial variation is invisible.

In [14]:
from sklearn.decomposition import PCA

global_scores = PCA(n_components=3).fit_transform(X_scaled)

inspect_latent_space(
    global_scores, y_labels,
    config=DARK_SCIENTIFIC.with_method("pca"),
    show_ellipsoids=True,
    show_convex_hulls=True,
    ellipsoid_confidence=0.80,
    class_names=borough_names,
    title="Global PCA scores — NYC Airbnb (coloured by borough)",
).show()

---
## t-SNE and UMAP on Local Scores

Non-linear projection methods reveal whether the borough clusters in local
score space are genuinely separated or form a continuous gradient.

In [15]:
for method in ("tsne", "umap"):
    inspect_latent_space(
        local_scores, y_labels,
        config=DARK_SCIENTIFIC.with_method(method),
        show_ellipsoids=True,
        class_names=borough_names,
        title=f"GWPCA local scores — {method.upper()} (NYC boroughs)",
    ).show()
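
Visual separation in t-SNE or UMAP plots can be misleading, so a numeric check in the original score space is a useful complement. One option is the silhouette score from scikit-learn, sketched below on synthetic 3-D scores. The two point clouds are made up for illustration; applying the same call to `local_scores` and `global_scores` with `y_labels` is a suggestion, not something this notebook ran.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 200)

# Synthetic 3-D "scores": the same three groups, far apart vs overlapping.
centers_far = np.array([[0, 0, 0], [8, 0, 0], [0, 8, 0]], dtype=float)
centers_near = 0.05 * centers_far
separated = centers_far[labels] + rng.normal(size=(600, 3))
overlapping = centers_near[labels] + rng.normal(size=(600, 3))

# Silhouette is close to 1 for well-separated groups and near 0 for overlap.
sil_separated = silhouette_score(separated, labels)
sil_overlapping = silhouette_score(overlapping, labels)
print(f"separated: {sil_separated:.2f}, overlapping: {sil_overlapping:.2f}")
```

A value near zero for the borough labels would point to a continuous gradient rather than distinct clusters.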

---
## Bandwidth Sensitivity

Narrow bandwidth (0.01° ≈ 1.1 km) captures hyper-local variation.
Wide bandwidth (0.10° ≈ 11 km) approximates borough-level averages.
We expect the borough separation in local score space to peak at an intermediate
scale; the three plots below let us check this visually.

In [16]:
for bw, label in [(0.01, "0.01° ~1 km"), (0.04, "0.04° ~4 km"), (0.10, "0.10° ~11 km")]:
    print(f"Bandwidth {label} ...")
    sc, _ = gwpca(X_scaled, coords, bandwidth=bw)
    inspect_latent_space(
        sc, y_labels,
        config=DARK_SCIENTIFIC.with_method("pca"),
        show_ellipsoids=True,
        class_names=borough_names,
        title=f"GWPCA — bandwidth {label}",
    ).show()

Bandwidth 0.01° ~1 km ...


Bandwidth 0.04° ~4 km ...


Bandwidth 0.10° ~11 km ...
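
A complementary diagnostic is how many listings effectively contribute to each local covariance matrix at a given bandwidth. The sketch below uses Kish's effective sample size, (Σw)² / Σw², on synthetic coordinates. The helper `effective_neighbours` and the uniform toy points are assumptions for illustration; on the real data one would pass `coords` instead.

```python
import numpy as np

def effective_neighbours(coords, bandwidth):
    """Kish effective sample size of the Gaussian weights at each focal point."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))       # (n, n) pairwise distances
    w = np.exp(-0.5 * (d / bandwidth) ** 2)     # same kernel as gwpca
    return w.sum(axis=1) ** 2 / (w ** 2).sum(axis=1)

rng = np.random.default_rng(0)
# Synthetic points spread uniformly over a 0.4 x 0.4 degree box.
toy_coords = rng.uniform(0.0, 0.4, size=(500, 2))

medians = {
    bw: float(np.median(effective_neighbours(toy_coords, bw)))
    for bw in (0.01, 0.04, 0.10)
}
for bw, m in medians.items():
    print(f"bandwidth {bw:.2f}: median n_eff = {m:.1f}")
```

If the effective count at the narrowest bandwidth is close to the number of features, the local covariance estimates are noisy and the corresponding plot should be interpreted with care.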


---
## What the Local PCs Represent

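To interpret the local PCs, we compute the mean absolute local loading of each
feature on PC1, per borough.  Absolute values are used because the sign of an
eigenvector is arbitrary, so signed loadings from different focal points could
cancel.  The table shows which features dominate the first PC in each geographic
region and hints at qualitatively different market dynamics across boroughs.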

In [17]:
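def gwpca_with_loadings(X, coords, bandwidth, n_components=3):
    """Same weighting as gwpca, but return the local eigenvectors.

    Result has shape (n, n_features, n_components); column k is local PC k+1.
    """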
    n = len(X)
    loadings = np.zeros((n, X.shape[1], n_components))

    for i in range(n):
        d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        w_sum = w.sum()
        mu = (X * w[:, None]).sum(0) / w_sum
        Xc = X - mu
        C  = (Xc * w[:, None]).T @ Xc / w_sum

        eigvals, eigvecs = np.linalg.eigh(C)
        order = eigvals.argsort()[::-1]
        loadings[i] = eigvecs[:, order[:n_components]]

    return loadings


print("Computing local loadings ...")
local_loadings = gwpca_with_loadings(X_scaled, coords, bandwidth=0.04)

print("\nMean |loading| on PC1 by borough (higher = feature drives local variation):")
print(f"{'Feature':<22}  " + "  ".join(f"{borough_names[i]:<14}" for i in range(5)))
print("-" * 90)
for fi, fname in enumerate(features):
    row = f"{fname:<22}  "
    for bid in range(5):
        mask = y_labels == bid
        mean_abs = np.abs(local_loadings[mask, fi, 0]).mean()
        row += f"{mean_abs:.3f}          "
    print(row)

Computing local loadings ...

Mean |loading| on PC1 by borough (higher = feature drives local variation):
Feature                 Bronx           Brooklyn        Manhattan       Queens          Staten Island 
------------------------------------------------------------------------------------------
log_price               0.211          0.121          0.114          0.097          0.548          
log_min_nights          0.520          0.628          0.540          0.521          0.241          
reviews_per_month       0.666          0.189          0.154          0.474          0.304          
availability_365        0.314          0.399          0.338          0.202          0.415          
log_host_listings       0.137          0.548          0.713          0.495          0.505          
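
Absolute loadings discard the direction of each feature's contribution. If signed averages are wanted, one possible approach is to first flip every local eigenvector so that it points the same way as a common reference vector. The sketch below uses a made-up axis; `align_signs` is an illustrative helper, and the choice of reference is an assumption (the first global PCA component, `PCA(...).fit(X_scaled).components_[0]`, would be one candidate).

```python
import numpy as np

def align_signs(vectors, reference):
    """Flip each row of `vectors` so its dot product with `reference` is >= 0."""
    signs = np.sign(vectors @ reference)
    signs[signs == 0] = 1.0  # leave vectors orthogonal to the reference unchanged
    return vectors * signs[:, None]

# Made-up example: the same axis returned with opposite signs by the solver.
v = np.array([0.6, 0.8, 0.0])
local_pc1 = np.vstack([v, -v, v, -v])

naive_mean = local_pc1.mean(axis=0)                          # signs cancel
aligned_mean = align_signs(local_pc1, reference=v).mean(axis=0)
print(naive_mean, aligned_mean)
```

In this notebook the rows would be `local_loadings[:, :, 0]`. Alignment is only meaningful where the local axis is not nearly orthogonal to the reference.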
