# Extra Credit: Temporal Airbnb Seasonality and Modeling (EAS 510)

This notebook:
- Builds night-level panel datasets for each city + snapshot
- Performs seasonality analysis (required plots)
- Builds temporal train/valid/test split (no leakage)
- Trains XGBoost + Neural Nets (price regression + booking classification)
- Logs Neural Net training with TensorBoard
- Summarizes results + provides write-up templates


### Install required libraries

In [None]:
%pip install -r requirements.txt


In [None]:


import sys, subprocess

def pip_install(pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

pip_install(["pandas", "numpy", "matplotlib", "scikit-learn", "xgboost", "tensorflow", "pyarrow"])


### Imports and config

In [None]:
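import datetime
import os
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, roc_auc_score
from tensorflow import keras
from tensorflow.keras import layers
from xgboost import XGBClassifier, XGBRegressor

# Assumed value: later cells use RANDOM_STATE without defining it.
RANDOM_STATE = 42

print("CWD:", os.getcwd())

DATA_ROOT = Path(".")
LOGS_DIR = DATA_ROOT / "logs"  # assumed location; matches `%tensorboard --logdir logs`
print("DATA_ROOT:", DATA_ROOT.resolve())
print("DATA_ROOT exists?", DATA_ROOT.exists())

if DATA_ROOT.exists():
    print("Top-level items inside DATA_ROOT:")
    print([p.name for p in DATA_ROOT.iterdir()])
else:
    print("❌ BONUS_ASSIGNMENT folder not found from this CWD.")
    print("Fix by either:")
    print("1) Moving notebook to the parent folder of BONUS_ASSIGNMENT, or")
    print("2) Setting DATA_ROOT = Path(r'FULL_PATH_TO/BONUS_ASSIGNMENT')")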


In [None]:
DATASETS = {
    ("Austin", "3625"): {
        "calendar": "http://data.insideairbnb.com/united-states/tx/austin/2025-03-06/data/calendar.csv.gz",
        "listings": "http://data.insideairbnb.com/united-states/tx/austin/2025-03-06/data/listings.csv.gz",
    },
    ("Austin", "121424"): {
        "calendar": "http://data.insideairbnb.com/united-states/tx/austin/2024-12-14/data/calendar.csv.gz",
        "listings": "http://data.insideairbnb.com/united-states/tx/austin/2024-12-14/data/listings.csv.gz",
    },
    ("Chicago", "31125"): {
        "calendar": "http://data.insideairbnb.com/united-states/il/chicago/2025-03-11/data/calendar.csv.gz",
        "listings": "http://data.insideairbnb.com/united-states/il/chicago/2025-03-11/data/listings.csv.gz",
    },
    ("Chicago", "121824"): {
        "calendar": "http://data.insideairbnb.com/united-states/il/chicago/2024-12-18/data/calendar.csv.gz",
        "listings": "http://data.insideairbnb.com/united-states/il/chicago/2024-12-18/data/listings.csv.gz",
    },
    ("Santa_Cruz", "32825"): {
        "calendar": "http://data.insideairbnb.com/united-states/ca/santa-cruz-county/2025-03-28/data/calendar.csv.gz",
        "listings": "http://data.insideairbnb.com/united-states/ca/santa-cruz-county/2025-03-28/data/listings.csv.gz",
    },
    ("Santa_Cruz", "123124"): {
        "calendar": "http://data.insideairbnb.com/united-states/ca/santa-cruz-county/2024-12-31/data/calendar.csv.gz",
        "listings": "http://data.insideairbnb.com/united-states/ca/santa-cruz-county/2024-12-31/data/listings.csv.gz",
    },
    ("WashingtonDC", "31325"): {
        "calendar": "http://data.insideairbnb.com/united-states/dc/washington-dc/2025-03-13/data/calendar.csv.gz",
        "listings": "http://data.insideairbnb.com/united-states/dc/washington-dc/2025-03-13/data/listings.csv.gz",
    },
    ("WashingtonDC", "121824"): {
        "calendar": "http://data.insideairbnb.com/united-states/dc/washington-dc/2024-12-18/data/calendar.csv.gz",
        "listings": "http://data.insideairbnb.com/united-states/dc/washington-dc/2024-12-18/data/listings.csv.gz",
    },
}

print("✅ Using InsideAirbnb URLs for data fetching.")
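
The snapshot labels in `DATASETS` appear to encode the scrape date as month, day, and two-digit year without zero padding (an inference from comparing the keys with the URLs, not something the data source defines). A minimal sketch of a consistency check under that assumption follows; `label_from_url` is a hypothetical helper, not part of the notebook.

```python
import re
from datetime import date


def label_from_url(url: str) -> str:
    """Build an unpadded month-day-year label (e.g. '3625') from the date in the URL.

    Hypothetical helper: the label format is inferred from the notebook's keys.
    """
    match = re.search(r"/(\d{4})-(\d{2})-(\d{2})/", url)
    if match is None:
        raise ValueError(f"no YYYY-MM-DD segment in {url!r}")
    d = date(int(match.group(1)), int(match.group(2)), int(match.group(3)))
    return f"{d.month}{d.day}{d.year % 100:02d}"


austin_url = "http://data.insideairbnb.com/united-states/tx/austin/2025-03-06/data/calendar.csv.gz"
austin_label = label_from_url(austin_url)
print(austin_label)  # 3625
```

Comparing each key's label with `label_from_url(paths["calendar"])` would flag any key whose year or day does not match its URL.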


### Cleaning and category-capping helper functions

In [None]:
import numpy as np
import pandas as pd

In [None]:
def clean_price_to_float(series: pd.Series) -> pd.Series:
    s = series.astype(str).replace("nan", np.nan)
    s = s.str.replace(r"[$,]", "", regex=True)
    s = pd.to_numeric(s, errors="coerce")
    return s

def tf_to_int(series: pd.Series) -> pd.Series:
    # Handles 't'/'f', True/False, 1/0
    if series.dtype == bool:
        return series.astype(int)
    s = series.astype(str).str.lower()
    return s.map({"t": 1, "f": 0, "true": 1, "false": 0, "1": 1, "0": 0}).fillna(0).astype(int)

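def cap_top_k_categories(df: pd.DataFrame, col: str, k: int = 25) -> pd.DataFrame:
    # Keeps the k most frequent categories; modifies df in place and returns it.
    if col not in df.columns:
        return df
    top = df[col].value_counts(dropna=True).head(k).index
    # Label missing values first; otherwise where() below would turn NaN into "Other".
    df[col] = df[col].fillna("Missing")
    df[col] = df[col].where(df[col].isin(top) | (df[col] == "Missing"), other="Other")
    return df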
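
As a quick illustration of what the two cleaning helpers do, the sketch below applies the same steps to a few made-up values. The input strings are invented for the example; only the regex and the mapping table come from the functions above.

```python
import pandas as pd

# Toy inputs in the style of the raw exports (values are made up).
raw_prices = pd.Series(["$1,250.00", "$85.00", None, "n/a"])
raw_flags = pd.Series(["t", "f", "T", None])

# Same idea as clean_price_to_float: strip "$" and ",", coerce anything else to NaN.
clean_prices = pd.to_numeric(
    raw_prices.astype(str).str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

# Same idea as tf_to_int: lower-case, map known tokens, treat everything else as 0.
flag_ints = (
    raw_flags.astype(str).str.lower()
    .map({"t": 1, "f": 0, "true": 1, "false": 0, "1": 1, "0": 0})
    .fillna(0)
    .astype(int)
)

print(clean_prices.tolist())
print(flag_ints.tolist())
```

Note that the flag mapping sends missing values to 0, so a missing flag is indistinguishable from an explicit `f` after cleaning.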


In [None]:
def temporal_split_by_month(df: pd.DataFrame, train_months=9, valid_months=2):
    df = df.dropna(subset=["date"]).copy()
    df["year_month"] = df["date"].dt.to_period("M")

    months_sorted = np.array(sorted(df["year_month"].unique()))
    if len(months_sorted) < (train_months + valid_months + 1):
        raise ValueError(f"Not enough months in snapshot range: only {len(months_sorted)}")

    train_set = set(months_sorted[:train_months])
    valid_set = set(months_sorted[train_months:train_months+valid_months])
    test_set  = set(months_sorted[train_months+valid_months:])

    train_df = df[df["year_month"].isin(train_set)]
    valid_df = df[df["year_month"].isin(valid_set)]
    test_df  = df[df["year_month"].isin(test_set)]

    return train_df, valid_df, test_df


### Load and build panel

In [None]:
def load_snapshot(listings_path: Path, calendar_path: Path):
    # Calendar: read required columns (if present)
    cal = pd.read_csv(calendar_path, compression="gzip", low_memory=False)
    if "price" not in cal.columns and "adjusted_price" in cal.columns:
        cal = cal.rename(columns={"adjusted_price": "price"})

    needed_cal = ["listing_id", "date", "available", "price", "minimum_nights", "maximum_nights"]
    keep_cal = [c for c in needed_cal if c in cal.columns]
    cal = cal[keep_cal].copy()

    # Listings: load then select a safe subset (varies by city)
    listings_all = pd.read_csv(listings_path, compression="gzip", low_memory=False)

    # Normalize key to listing_id
    if "listing_id" not in listings_all.columns and "id" in listings_all.columns:
        listings_all = listings_all.rename(columns={"id": "listing_id"})

    desired_listing_cols = [
        "listing_id",
        "accommodates", "bedrooms", "beds",
        "room_type", "property_type", "neighbourhood_cleansed",
        "number_of_reviews", "review_scores_rating",
        "host_is_superhost", "instant_bookable"
    ]
    keep_list = [c for c in desired_listing_cols if c in listings_all.columns]
    listings = listings_all[keep_list].copy()

    return listings, cal

def build_panel(listings: pd.DataFrame, cal: pd.DataFrame, city: str, snapshot: str,
                save_sample_parquet: bool = True, sample_rows: int = 100_000) -> pd.DataFrame:
    # Evidence (shapes/head/dtypes)
    print(f"\n===== {city} | snapshot {snapshot} =====")
    print("LISTINGS shape:", listings.shape)
    display(listings.head())
    print(listings.dtypes)

    print("\nCALENDAR shape:", cal.shape)
    display(cal.head())
    print(cal.dtypes)

    # Left merge on listing_id (1 row per listing/date)
    df = cal.merge(listings, on="listing_id", how="left")

    # Clean + transform
    df["price"] = clean_price_to_float(df["price"]) if "price" in df.columns else np.nan
    df["date"] = pd.to_datetime(df["date"], errors="coerce")

    if "available" in df.columns:
        df["is_booked"] = (df["available"].astype(str).str.lower() == "f").astype(int)
    else:
        df["is_booked"] = np.nan

    # Time features
    df["month"] = df["date"].dt.month
    df["day_of_week"] = df["date"].dt.dayofweek
    df["week_of_year"] = df["date"].dt.isocalendar().week.astype("Int64")
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
    df["day_of_year"] = df["date"].dt.dayofyear

    # Quick verification
    print("\nMERGED panel shape:", df.shape)
    show_cols = [c for c in ["listing_id","date","price","available","is_booked","month","day_of_week","week_of_year","is_weekend","day_of_year"] if c in df.columns]
    display(df[show_cols].head())

    # Optional: save a sample as CSV.GZ (despite the flag name, no Parquet/pyarrow is involved)
    if save_sample_parquet:
        out = DATA_ROOT / f"panel_{city}_{snapshot}_sample{sample_rows}.csv.gz"
        df.head(sample_rows).to_csv(out, index=False, compression="gzip")
        print("Saved sample CSV:", out)


    return df
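
To make the panel construction concrete, here is a minimal sketch of the same merge and feature steps on a three-row toy calendar. The data are invented, and `validate="many_to_one"` is an extra safety check that the notebook itself does not use.

```python
import pandas as pd

# Made-up miniature inputs; the real files have many more rows and columns.
toy_cal = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "date": ["2025-03-07", "2025-03-08", "2025-03-08"],
    "available": ["t", "f", "f"],
})
toy_listings = pd.DataFrame({"listing_id": [1], "room_type": ["Private room"]})

# Left merge keeps every calendar night, even when the listing row is absent.
toy_panel = toy_cal.merge(toy_listings, on="listing_id", how="left", validate="many_to_one")
toy_panel["date"] = pd.to_datetime(toy_panel["date"], errors="coerce")
toy_panel["is_booked"] = (toy_panel["available"].str.lower() == "f").astype(int)
toy_panel["is_weekend"] = (toy_panel["date"].dt.dayofweek >= 5).astype(int)

print(toy_panel)
```

Listing 2 has no match in the toy listings table, so its `room_type` is NaN after the merge while its calendar row is preserved.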


### Building Panels

In [None]:
PANELS = {}

for (city, snap), paths in DATASETS.items():
    listings, cal = load_snapshot(paths["listings"], paths["calendar"])
    panel = build_panel(listings, cal, city, snap, save_sample_parquet=False, sample_rows=100_000)
    PANELS[(city, snap)] = panel

print("\n✅ Built panels:", len(PANELS))


### Part 2: Seasonality Plots

In [None]:
def seasonality_plots(df: pd.DataFrame, title_prefix: str, listing_type_col: str = "room_type"):
    df2 = df.dropna(subset=["date", "price"]).copy()

    # 1) Avg price by month
    by_month_price = df2.groupby("month")["price"].mean().sort_index()

    plt.figure()
    plt.plot(by_month_price.index, by_month_price.values, marker="o")
    plt.title(f"{title_prefix} - Avg Price by Month")
    plt.xlabel("Month")
    plt.ylabel("Average Price")
    plt.grid(True)
    plt.show()

    # 2) Avg booking probability by month
    by_month_book = df2.groupby("month")["is_booked"].mean().sort_index()

    plt.figure()
    plt.plot(by_month_book.index, by_month_book.values, marker="o")
    plt.title(f"{title_prefix} - Booking Probability by Month")
    plt.xlabel("Month")
    plt.ylabel("P(Booked) = mean(is_booked)")
    plt.grid(True)
    plt.show()

    # 3) Weekend vs weekday bars (price + booking)
    wk = df2.groupby("is_weekend")[["price", "is_booked"]].mean().rename(index={0: "Weekday", 1: "Weekend"})

    plt.figure()
    plt.bar(wk.index.astype(str), wk["price"].values)
    plt.title(f"{title_prefix} - Weekend vs Weekday Avg Price")
    plt.xlabel("")
    plt.ylabel("Average Price")
    plt.show()

    plt.figure()
    plt.bar(wk.index.astype(str), wk["is_booked"].values)
    plt.title(f"{title_prefix} - Weekend vs Weekday Booking Probability")
    plt.xlabel("")
    plt.ylabel("P(Booked)")
    plt.show()

    # 4) Avg price by month grouped by listing type (room_type/property_type/...)
    if listing_type_col in df2.columns:
        sub = df2.copy()
        sub = cap_top_k_categories(sub, listing_type_col, k=6)
        g = sub.groupby(["month", listing_type_col])["price"].mean().reset_index()

        plt.figure()
        for cat in g[listing_type_col].unique():
            s = g[g[listing_type_col] == cat].sort_values("month")
            plt.plot(s["month"], s["price"], marker="o", label=str(cat))
        plt.title(f"{title_prefix} - Avg Price by Month by {listing_type_col}")
        plt.xlabel("Month")
        plt.ylabel("Average Price")
        plt.legend()
        plt.grid(True)
        plt.show()
    else:
        print(f"(Skipped grouped plot: '{listing_type_col}' not found)")


In [None]:
for (city, snap), df in PANELS.items():
    seasonality_plots(df, f"{city} | {snap}", listing_type_col="room_type")
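
The plots above are all built from simple group means. A minimal sketch on a four-row toy frame (invented numbers) shows the two aggregations without any plotting:

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "is_weekend": [0, 1, 0, 1],
    "price": [100.0, 120.0, 90.0, 110.0],
    "is_booked": [1, 0, 1, 1],
})

# Mean of a 0/1 column is the share of booked nights, i.e. the booking probability.
by_month = toy.groupby("month")[["price", "is_booked"]].mean()
by_day_type = (
    toy.groupby("is_weekend")[["price", "is_booked"]].mean()
    .rename(index={0: "Weekday", 1: "Weekend"})
)

print(by_month)
print(by_day_type)
```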


## Part 2 Interpretation

### Austin | 6 March 2025
Average price stays nearly constant throughout the year, with only small fluctuations: a slight bump around October and marginally lower values in the early and late months. Booking probability is far more seasonal: it is highest in Jan–Mar, drops sharply in Apr–May (bottoming out around May), and then recovers from late summer into fall and winter. Weekend and weekday prices are essentially the same (no real weekend markup), while weekends book slightly more often. Room type drives price much more than seasonality: hotel rooms are consistently the most expensive, followed by entire homes/apts, private rooms, and shared rooms, and these levels stay fairly flat month to month.

### Austin | 14 December 2024
Prices are similarly flat, with a gentle climb from winter into spring/summer and a small dip in December. Booking probability trends upward: it is lower in Jan–Feb, builds steadily through spring and summer, peaks around Oct–Nov, and drops a little in December. Weekend and weekday prices are almost identical, with at most a tiny weekend premium. Room types keep the same order (hotel highest, then entire home/apt, private, and shared) and are stable across months, so room type matters more than seasonal changes.

### Chicago | 11 March 2025
Prices jump noticeably in Apr–May relative to the rest of the year, then level out from summer through December. Booking probability starts high in Jan–Mar, falls sharply in Apr–May, reaches its lowest point around August, and recovers in fall/winter (with December above the mid-year low). Weekends show a small lift in both price and booking probability. Room types are clearly separated and mostly steady: entire homes/apts are the most expensive, hotels come next, shared rooms are surprisingly high relative to private rooms, and private rooms are lowest. The lines are flat, meaning there is little monthly variation within each type.

### Chicago | 18 December 2024
Prices rise from winter into spring/summer (peaking Mar–Jul), then drop starting in August and bottom out in December. Booking probability is strongly seasonal: lowest in winter (especially Feb), climbing through late spring/summer, and peaking in Oct–Dec (around 0.5). Weekend prices are slightly higher, and so are weekend bookings. Room types are mostly stable, with entire homes/apts and hotels the most expensive, private rooms lowest, and shared rooms in the middle. Hotels dip noticeably in December, possibly because of end-of-year discounts or changes in the set of listings.

### Santa_Cruz | 28 March 2025
Prices are almost perfectly flat, except for a clear dip around March before returning to the usual level. Booking probability is strongly seasonal: highest in Jan–Mar, falling sharply in April (lowest around May), then recovering somewhat in early summer and again toward year-end. Weekend and weekday prices are the same (no weekend premium), but weekends book slightly more often. Room types are consistent: hotels highest, entire homes/apts next, private rooms cheaper, and shared rooms lowest, with only a dip for private rooms around March.

### Santa_Cruz | 31 December 2024
Prices hold steady for most of the year, with a small drop in December. Booking probability bottoms out in late winter/early spring (Feb–Mar), rises slowly through spring, jumps around July, and stays high into fall/winter (peaking Oct–Dec). Weekend and weekday prices are nearly identical, and weekend bookings are slightly higher. Room types are flat overall (hotels highest, then entire homes/apts, private rooms, and shared rooms), but private rooms dip in December, echoing the general trend.

### WashingtonDC | 13 March 2025
Prices are essentially flat all year, with a tiny dip around March, so price seasonality is very weak. Booking probability peaks in March, declines steadily into summer (lowest in August), then rebounds in fall and ends strong in December. Weekend and weekday prices are the same, and weekends book slightly more often. Room type dominates: shared rooms are extremely expensive (a large outlier), entire homes/apts and hotels are mid-to-high, and private rooms are lowest; these levels are mostly steady month to month.

### WashingtonDC | 18 December 2024
Prices stay flat with a slight December drop. Booking probability is low in winter (especially Feb), then climbs through spring/summer and reaches its highs in Oct–Dec. Weekend and weekday prices are almost identical, and weekend bookings are slightly higher. Room types are separated and stable: shared rooms are outliers (very high), entire homes/apts and hotels come next, and private rooms are lowest. Category differences dwarf monthly changes.



### Choosing one dataset for modeling

In [None]:
# Modeling is required for both targets; the rubric does NOT require running models for all 8 datasets.
# Pick one dataset key here.
MODEL_KEY = ("Austin", "3625")  # change if you want

df_model = PANELS[MODEL_KEY].copy()  # look up the panel built earlier for this key
print("Using MODEL_KEY:", MODEL_KEY, "| rows:", len(df_model))
display(df_model.head())


### Temporal split

In [None]:
def temporal_split_by_month(df: pd.DataFrame, train_months=9, valid_months=2):
    df = df.dropna(subset=["date"]).copy()
    df["year_month"] = df["date"].dt.to_period("M")
    months_sorted = np.array(sorted(df["year_month"].unique()))

    if len(months_sorted) < (train_months + valid_months + 1):
        raise ValueError(f"Not enough months in snapshot range: only {len(months_sorted)}")

    train_set = set(months_sorted[:train_months])
    valid_set = set(months_sorted[train_months:train_months+valid_months])
    test_set  = set(months_sorted[train_months+valid_months:])

    train_df = df[df["year_month"].isin(train_set)].copy()
    valid_df = df[df["year_month"].isin(valid_set)].copy()
    test_df  = df[df["year_month"].isin(test_set)].copy()

    print("Train months:", sorted(train_set)[:3], "...", sorted(train_set)[-3:])
    print("Valid months:", sorted(valid_set))
    print("Test months:", sorted(test_set)[:3], "...", sorted(test_set)[-3:])
    print("Shapes:", train_df.shape, valid_df.shape, test_df.shape)

    return train_df, valid_df, test_df

train_df, valid_df, test_df = temporal_split_by_month(df_model, train_months=9, valid_months=2)
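
To see why a month-based split avoids temporal leakage, the sketch below repeats the same slicing logic on a toy frame with one row per day for a single calendar year. The dates are illustrative only; the 9/2 month counts are the defaults used above.

```python
import pandas as pd

# Toy panel: one row per day for 12 full months (dates are illustrative).
toy = pd.DataFrame({"date": pd.date_range("2025-01-01", "2025-12-31", freq="D")})
toy["year_month"] = toy["date"].dt.to_period("M")

months = sorted(toy["year_month"].unique())
n_train, n_valid = 9, 2  # same defaults as temporal_split_by_month

toy_train = toy[toy["year_month"].isin(months[:n_train])]
toy_valid = toy[toy["year_month"].isin(months[n_train:n_train + n_valid])]
toy_test = toy[toy["year_month"].isin(months[n_train + n_valid:])]

# Chronological order holds across splits: train ends before valid starts, and so on.
assert toy_train["date"].max() < toy_valid["date"].min()
assert toy_valid["date"].max() < toy_test["date"].min()
print(len(toy_train), len(toy_valid), len(toy_test))
```

Because whole months are assigned to exactly one split, no date can appear in more than one of them.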


### Feature Selection and Basic Cleaning

In [None]:
# Clean booleans / cap categories for modeling.
# cap_top_k_categories modifies the frame in place, so the splits are updated directly.
# Note: the top-k categories are computed separately within each split.
for d in [train_df, valid_df, test_df]:
    if "host_is_superhost" in d.columns:
        d["host_is_superhost"] = tf_to_int(d["host_is_superhost"])
    if "instant_bookable" in d.columns:
        d["instant_bookable"] = tf_to_int(d["instant_bookable"])
    for col, k in [("room_type", 6), ("property_type", 10), ("neighbourhood_cleansed", 25)]:
        cap_top_k_categories(d, col, k=k)  # no-op if the column is absent

# Choose candidate features (only those that exist)
candidate_numeric = [
    "accommodates", "bedrooms", "beds",
    "number_of_reviews", "review_scores_rating",
    "minimum_nights", "maximum_nights",
    "month", "day_of_week", "week_of_year", "is_weekend", "day_of_year",
    "host_is_superhost", "instant_bookable",
]
candidate_categ = ["room_type", "property_type", "neighbourhood_cleansed"]

numeric_features = [c for c in candidate_numeric if c in df_model.columns]
categorical_features = [c for c in candidate_categ if c in df_model.columns]

# Targets
target_price = "price"
target_book = "is_booked"

# Drop rows missing targets
train_df = train_df.dropna(subset=[target_price, target_book]).copy()
valid_df = valid_df.dropna(subset=[target_price, target_book]).copy()
test_df  = test_df.dropna(subset=[target_price, target_book]).copy()

print("Numeric features:", numeric_features)
print("Categorical features:", categorical_features)

print("\nTrain/Valid/Test target availability:")
print(train_df[[target_price, target_book]].isna().mean())
print(valid_df[[target_price, target_book]].isna().mean())
print(test_df[[target_price, target_book]].isna().mean())
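
The cell above caps categories within each split independently. One alternative, sketched here as an assumption rather than as what the notebook does, is to learn the top-k set on the training split only and reuse it for the other splits. `fit_top_k` and `apply_cap` are hypothetical helpers introduced for this example.

```python
import pandas as pd

toy_train = pd.DataFrame({"room_type": ["A", "A", "A", "B", "B", "C", None]})
toy_valid = pd.DataFrame({"room_type": ["A", "C", "D", None]})


def fit_top_k(series: pd.Series, k: int) -> set:
    """Return the k most frequent non-missing categories (hypothetical helper)."""
    return set(series.value_counts(dropna=True).head(k).index)


def apply_cap(series: pd.Series, keep: set) -> pd.Series:
    """Keep categories in `keep`; label missing as 'Missing' and the rest as 'Other'."""
    capped = series.where(series.isin(keep), other="Other")
    return capped.mask(series.isna(), "Missing")


keep = fit_top_k(toy_train["room_type"], k=2)  # learned from training data only
train_capped = apply_cap(toy_train["room_type"], keep)
valid_capped = apply_cap(toy_valid["room_type"], keep)

print(sorted(keep))
print(valid_capped.tolist())
```

With this approach a category that is common only in validation or test data (such as `"D"` here) cannot influence which categories are kept.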


In [None]:
def sample_df(df: pd.DataFrame, n: int, seed=RANDOM_STATE):
    if len(df) <= n:
        return df
    return df.sample(n=n, random_state=seed)

MAX_TRAIN = 250_000
MAX_VALID = 75_000
MAX_TEST  = 75_000

train_s = sample_df(train_df, MAX_TRAIN)
valid_s = sample_df(valid_df, MAX_VALID)
test_s  = sample_df(test_df,  MAX_TEST)

print("Sampled shapes:", train_s.shape, valid_s.shape, test_s.shape)


In [None]:
# Preprocessor + feature matrices (casts categorical columns to str to avoid mixed int/str errors)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# OneHotEncoder compatibility across sklearn versions
try:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
except TypeError:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse=False)

# ✅ FIX: force categorical cols to be strings (prevents int/str mix error)
for df_ in (train_s, valid_s, test_s):
    for col in categorical_features:
        if col in df_.columns:
            df_[col] = df_[col].astype("string").fillna("Missing").astype(str)

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", ohe),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="drop"
)

X_train = preprocessor.fit_transform(train_s)
X_valid = preprocessor.transform(valid_s)
X_test  = preprocessor.transform(test_s)

y_train_price = train_s[target_price].astype(float).values
y_valid_price = valid_s[target_price].astype(float).values
y_test_price  = test_s[target_price].astype(float).values

y_train_book = train_s[target_book].astype(int).values
y_valid_book = valid_s[target_book].astype(int).values
y_test_book  = test_s[target_book].astype(int).values

print("X_train shape:", X_train.shape, "dtype:", X_train.dtype)
print("X_valid shape:", X_valid.shape)
print("X_test  shape:", X_test.shape)
print("y_train_price:", y_train_price.shape, "y_train_book:", y_train_book.shape)


### XGBoost Regression on Price

In [None]:
xgb_reg = XGBRegressor(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)

xgb_reg.fit(
    X_train, y_train_price,
    eval_set=[(X_valid, y_valid_price)],
    verbose=False
)

pred_test_price = xgb_reg.predict(X_test)
rmse = (mean_squared_error(y_test_price, pred_test_price)) ** 0.5
mae = mean_absolute_error(y_test_price, pred_test_price)

print("XGB REG | Test RMSE:", rmse)
print("XGB REG | Test MAE :", mae)
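
For reference, the two regression metrics reported above can be computed by hand. This is a small worked example on three invented predictions, using only the standard library:

```python
import math

toy_true = [100.0, 150.0, 200.0]
toy_pred = [110.0, 140.0, 230.0]

errors = [p - t for p, t in zip(toy_pred, toy_true)]  # 10, -10, 30

# MAE averages absolute errors; RMSE squares first, so large misses weigh more.
toy_mae = sum(abs(e) for e in errors) / len(errors)
toy_rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(round(toy_mae, 3), round(toy_rmse, 3))
```

RMSE is never smaller than MAE, and a large gap between them suggests that a few big errors (for example, very expensive listings) dominate the squared loss.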


In [None]:
def get_feature_names(preprocessor, numeric_features, categorical_features):
    names = []
    names += list(numeric_features)

    if categorical_features:
        ohe_step = preprocessor.named_transformers_["cat"].named_steps["ohe"]
        names += ohe_step.get_feature_names_out(categorical_features).tolist()

    return names

feature_names = get_feature_names(preprocessor, numeric_features, categorical_features)
print("✅ feature_names ready. Count:", len(feature_names), "| X width:", X_train.shape[1])


In [None]:
importances = xgb_reg.feature_importances_
imp_df = pd.DataFrame({"feature": feature_names, "importance": importances}).sort_values("importance", ascending=False)

display(imp_df.head(20))

topk = 25
plt.figure(figsize=(8, 8))
plt.barh(imp_df.head(topk)["feature"][::-1], imp_df.head(topk)["importance"][::-1])
plt.title("XGBoost Regressor - Top Feature Importances")
plt.xlabel("Importance")
plt.ylabel("")
plt.show()


### XGBoost Classification (is_booked)

In [None]:
pos = y_train_book.sum()
neg = len(y_train_book) - pos
scale_pos_weight = (neg / pos) if pos > 0 else 1.0

xgb_clf = XGBClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    scale_pos_weight=scale_pos_weight,
    eval_metric="logloss",
)

xgb_clf.fit(
    X_train, y_train_book,
    eval_set=[(X_valid, y_valid_book)],
    verbose=False
)

proba_test = xgb_clf.predict_proba(X_test)[:, 1]
pred_test = (proba_test >= 0.5).astype(int)

auc = roc_auc_score(y_test_book, proba_test)
acc = accuracy_score(y_test_book, pred_test)

print("XGB CLF | Test AUC     :", auc)
print("XGB CLF | Test Accuracy:", acc)
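
The class-weight ratio and the two classification metrics can be illustrated without any ML library. The sketch below uses invented labels and scores; `pairwise_auc` is a hypothetical helper that computes AUC as the share of (positive, negative) pairs ranked correctly, which is fine for tiny examples but far too slow for real data.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Class weighting: ratio of negatives to positives, as in the cell above.
toy_train_labels = [0, 0, 0, 1]
n_pos = sum(toy_train_labels)
n_neg = len(toy_train_labels) - n_pos
toy_scale_pos_weight = n_neg / n_pos if n_pos > 0 else 1.0

# Metrics on a separate toy test set.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.6, 0.4, 0.9]
toy_auc = pairwise_auc(toy_labels, toy_scores)
toy_preds = [int(s >= 0.5) for s in toy_scores]
toy_acc = sum(int(p == l) for p, l in zip(toy_preds, toy_labels)) / len(toy_labels)

print(toy_scale_pos_weight, toy_auc, toy_acc)
```

AUC depends only on how the scores rank the examples, while accuracy depends on the 0.5 threshold, so the two can disagree as they do in this toy case.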


In [None]:
importances = xgb_clf.feature_importances_
imp_df2 = pd.DataFrame({"feature": feature_names, "importance": importances}).sort_values("importance", ascending=False)

display(imp_df2.head(20))

topk = 25
plt.figure(figsize=(8, 8))
plt.barh(imp_df2.head(topk)["feature"][::-1], imp_df2.head(topk)["importance"][::-1])
plt.title("XGBoost Classifier - Top Feature Importances")
plt.xlabel("Importance")
plt.ylabel("")
plt.show()


### Neural networks

In [None]:
def make_nn_reg(input_dim: int):
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1)
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(1e-3),
        loss="mse",
        metrics=["mae"]
    )
    return model

def make_nn_clf(input_dim: int):
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1, activation="sigmoid")
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(1e-3),
        loss="binary_crossentropy",
        metrics=[keras.metrics.AUC(name="auc"), "accuracy"]
    )
    return model

def timestamp_str():
    return datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

INPUT_DIM = X_train.shape[1]

# Separate log dirs (required)
logdir_price = LOGS_DIR / "nn_price" / f"{MODEL_KEY[0]}_{MODEL_KEY[1]}" / timestamp_str()
logdir_book  = LOGS_DIR / "nn_book"  / f"{MODEL_KEY[0]}_{MODEL_KEY[1]}" / timestamp_str()
logdir_price.mkdir(parents=True, exist_ok=True)
logdir_book.mkdir(parents=True, exist_ok=True)

cb_price = [
    keras.callbacks.TensorBoard(log_dir=str(logdir_price)),
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
]
cb_book = [
    keras.callbacks.TensorBoard(log_dir=str(logdir_book)),
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
]

nn_reg = make_nn_reg(INPUT_DIM)
hist_reg = nn_reg.fit(
    X_train, y_train_price,
    validation_data=(X_valid, y_valid_price),
    epochs=20,
    batch_size=256,
    callbacks=cb_price,
    verbose=1
)

nn_clf = make_nn_clf(INPUT_DIM)
hist_clf = nn_clf.fit(
    X_train, y_train_book,
    validation_data=(X_valid, y_valid_book),
    epochs=20,
    batch_size=256,
    callbacks=cb_book,
    verbose=1
)

print("✅ NN training done.")
print("NN PRICE logdir:", logdir_price)
print("NN BOOK  logdir:", logdir_book)
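
The log-directory layout used above (task, then dataset key, then timestamp) can be sketched with the standard library alone. `make_run_dir` is a hypothetical helper, and a temporary directory stands in for the real log root so the example leaves nothing behind.

```python
import datetime
import tempfile
from pathlib import Path


def make_run_dir(root: Path, task: str, dataset: str, stamp: str) -> Path:
    """Create and return root/task/dataset/stamp (hypothetical helper)."""
    run_dir = root / task / dataset / stamp
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


# Compute the timestamp once so both runs share the same suffix.
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "logs"
    price_rel = make_run_dir(root, "nn_price", "Austin_3625", stamp).relative_to(root)
    book_rel = make_run_dir(root, "nn_book", "Austin_3625", stamp).relative_to(root)
    dirs_created = (root / price_rel).is_dir() and (root / book_rel).is_dir()

print(price_rel)
print(book_rel)
```

Keeping the two tasks in separate top-level folders lets TensorBoard show them as distinct runs when it is pointed at the shared root.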


### Evaluate Neural Net on test

In [None]:
# Regression metrics
pred_nn_price = nn_reg.predict(X_test).ravel()
rmse_nn = mean_squared_error(y_test_price, pred_nn_price) ** 0.5
mae_nn = mean_absolute_error(y_test_price, pred_nn_price)

print("NN REG | Test RMSE:", rmse_nn)
print("NN REG | Test MAE :", mae_nn)

# Classification metrics
proba_nn = nn_clf.predict(X_test).ravel()
pred_nn = (proba_nn >= 0.5).astype(int)

auc_nn = roc_auc_score(y_test_book, proba_nn)
acc_nn = accuracy_score(y_test_book, pred_nn)

print("NN CLF | Test AUC     :", auc_nn)
print("NN CLF | Test Accuracy:", acc_nn)


In [None]:
%load_ext tensorboard
%tensorboard --logdir logs


## Part 4 TensorBoard Screenshots (Required)

### NN Price (Regression) — TensorBoard Scalars Screenshot

<img src="Tensorboard_SS/epoch_loss_R.png" width="1040">
<img src="Tensorboard_SS/epoch_mae_R.png" width="1040">
<img src="Tensorboard_SS/epoch_loss_itr_R.png" width="1040">
<img src="Tensorboard_SS/epoch_mae_itr_R.png" width="1040">


### NN Booking (Classification) — TensorBoard Scalars Screenshot

<img src="Tensorboard_SS/epoch_accuracy_C.png" width="1040">
<img src="Tensorboard_SS/epoch_auc_C.png" width="1040">
<img src="Tensorboard_SS/epoch_acc_itr_C.png" width="1040">
<img src="Tensorboard_SS/epoch_auc_itr_C.png" width="1040">

## Part 4 Discussion

For the price regression NN, both training and validation loss/MAE decrease smoothly and stay fairly close, which suggests stable training and only mild overfitting. The evaluation loss/MAE vs iterations curves also trend down steadily without big spikes, so optimization looks stable (no major divergence or instability). Overall, validation continues improving through the last epoch, so more epochs (with early stopping) could potentially help a bit more.  
For the booking classification NN, training accuracy/AUC increase noticeably faster than validation, and the train–validation gap grows by the final epoch, which indicates overfitting. Validation accuracy/AUC improve only modestly (and start to flatten), meaning generalization is the bottleneck rather than training performance. This pattern is consistent with booking being a noisier/harder target than price (often more imbalance and weaker signal), so it tends to need stronger regularization and careful early stopping based on validation AUC.
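
The discussion above mentions early stopping based on validation AUC. A simplified, pure-Python sketch of a patience rule is shown below; it illustrates the idea only and is not Keras's implementation (it ignores options such as a minimum improvement threshold). The AUC values are invented.

```python
def early_stop_epoch(val_scores, patience=3, mode="max"):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a patience-based rule."""
    if mode == "max":
        improved = lambda new, best: new > best
    else:
        improved = lambda new, best: new < best
    best_epoch, best = 0, val_scores[0]
    for epoch, score in enumerate(val_scores[1:], start=1):
        if improved(score, best):
            best_epoch, best = epoch, score
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # stop; weights from best_epoch would be restored
    return len(val_scores) - 1, best_epoch


toy_val_auc = [0.70, 0.74, 0.75, 0.75, 0.749, 0.748, 0.747]
stop_epoch, best_epoch = early_stop_epoch(toy_val_auc, patience=3, mode="max")
print(stop_epoch, best_epoch)  # 5 2
```

In the notebook's setup this would correspond to passing `monitor="val_auc"` and `mode="max"` to `keras.callbacks.EarlyStopping` instead of monitoring `val_loss`, since the AUC metric is registered under the name `auc`.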


In [None]:
results = pd.DataFrame([
    {"model": "XGB_REG", "target": "price", "RMSE": rmse, "MAE": mae, "AUC": np.nan, "ACC": np.nan},
    {"model": "XGB_CLF", "target": "is_booked", "RMSE": np.nan, "MAE": np.nan, "AUC": auc, "ACC": acc},
    {"model": "NN_REG",  "target": "price", "RMSE": rmse_nn, "MAE": mae_nn, "AUC": np.nan, "ACC": np.nan},
    {"model": "NN_CLF",  "target": "is_booked", "RMSE": np.nan, "MAE": np.nan, "AUC": auc_nn, "ACC": acc_nn},
])

print("MODEL_KEY:", MODEL_KEY)
display(results)


## Part 2 Interpretation

### Austin | 6 March 2025

Average price is almost flat across the year, with only tiny bumps (slight peak around Oct and a small dip around Mar). Booking probability is clearly seasonal: it’s highest in Jan–Mar, drops hard around Apr–May, then climbs again in the fall and ends stronger in Dec. Weekends vs weekdays show almost no price difference, and weekends have a slightly higher booking probability. By room type, **Hotel room** is the most expensive, then **Entire home/apt**, then **Private room**, and **Shared room** is the cheapest (these lines are very stable month-to-month).

### Austin | 14 December 2024

Price is very steady with a gentle rise from early months into spring/summer and a noticeable dip in December. Booking probability trends upward through the year, peaking around Oct–Nov and staying relatively high into Dec compared to early months. Weekend vs weekday pricing is nearly identical, with only a tiny weekend lift (if any). Room-type pricing shows the same consistent ranking (Hotel > Entire home/apt > Private > Shared) with minimal seasonality in price.

### Chicago | 11 March 2025

Prices show a small spring peak (around Apr–May) followed by a dip in early summer (Jun–Jul), then stabilize for the rest of the year. Booking probability is highest in late winter/early spring (peaking around Mar), drops to its lowest point around Aug, then recovers in the fall and rises into Dec. Weekend vs weekday average price is nearly the same, while weekends book slightly more often than weekdays. Room types are strongly separated in price (Entire home/apt highest; Private room lowest), and they stay mostly flat across months.

### Chicago | 18 December 2024

Average price jumps up around Mar and remains relatively stable through mid-year, then steps down around Aug–Nov and drops more noticeably in Dec. Booking probability starts low early in the year (lowest around Feb), then steadily rises from late spring onward and peaks around Oct–Dec. Weekday vs weekend prices are almost identical, and weekend booking probability is only slightly higher. Room-type prices are mostly stable, but **December shows a noticeable drop for Hotel room (and a small drop for Private room)** compared to the rest of the year.

### Santa_Cruz | 28 March 2025

Average price is essentially flat, except for a clear dip around March before returning to the normal level. Booking probability is highest in Jan–Mar, falls sharply around Apr–May, then partially rebounds in summer and increases again toward year-end. Weekend vs weekday price is almost unchanged, while weekends book a bit more often. Room-type differences are large and consistent (Hotel highest; Shared lowest), and Private room shows a visible dip around March similar to the overall pattern.

### Santa_Cruz | 31 December 2024

Prices are very stable through most of the year, with a noticeable drop in December. Booking probability is lowest around Feb–Mar, rises slowly through spring, then jumps strongly around July and peaks again in Oct–Dec. Weekday vs weekend price is basically the same, and weekend booking probability is slightly higher. Room-type ranking is stable, but **Private room drops in December**, matching the overall dip.

### WashingtonDC | 13 March 2025

Average price is almost perfectly flat, with a small dip around March. Booking probability peaks around March, then steadily declines to the lowest point in August before recovering into the fall and ending higher in December. Weekend vs weekday prices are essentially identical, while weekend booking probability is a bit higher. Room-type prices are separated, but **Shared room is an extreme outlier (very high)**, which likely comes from a small number of listings or noisy data.

### WashingtonDC | 18 December 2025

Price stays flat throughout the year and drops slightly in December. Booking probability is low in February, then rises steadily through the year and peaks around Oct–Dec. Weekday vs weekend prices are nearly identical and weekend booking probability is slightly higher. Room-type prices are consistent, and again **Shared room looks like a very large outlier**, suggesting potential data quality / small-sample effects.
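
The Shared room outlier noted for WashingtonDC can be checked by counting how many listings stand behind each room-type average and comparing the mean with the median. The sketch below is illustrative only: the frame `panel`, its columns `listing_id`, `room_type`, and `price`, the helper `room_type_support`, and the `MIN_LISTINGS` threshold are all assumed names, and the rows are made-up toy data rather than notebook output.

```python
import pandas as pd

# Toy night-level panel; column names and values are assumptions for illustration.
panel = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 3, 4, 5, 5],
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 2 + ["Shared room"] * 2,
    "price": [180.0, 190.0, 210.0, 200.0, 80.0, 90.0, 5000.0, 5000.0],
})

def room_type_support(df: pd.DataFrame) -> pd.DataFrame:
    """Per room type: number of distinct listings, mean price, and median price."""
    return df.groupby("room_type").agg(
        n_listings=("listing_id", "nunique"),
        mean_price=("price", "mean"),
        median_price=("price", "median"),
    )

support = room_type_support(panel)
print(support)

# Hypothetical threshold: flag room types backed by fewer than 2 distinct listings.
MIN_LISTINGS = 2
low_support = support.index[support["n_listings"] < MIN_LISTINGS].tolist()
print("Low-support room types:", low_support)
```

A room type that is both low-support and far above the others in mean price is a candidate for exclusion from the room-type plots, or for reporting with the median instead of the mean.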

---

## Part 4 TensorBoard Screenshots 

### NN Price (Regression): TensorBoard Scalars Screenshots

<p align="center">
  <img src="Tensorboard_SS/epoch_loss_R.png" width="360"/>
  <img src="Tensorboard_SS/epoch_loss_itr_R.png" width="360"/>
</p>
<p align="center">
  <img src="Tensorboard_SS/epoch_mae_R.png" width="360"/>
  <img src="Tensorboard_SS/epoch_mae_itr_R.png" width="360"/>
</p>

### NN Booking (Classification): TensorBoard Scalars Screenshots

<p align="center">
  <img src="Tensorboard_SS/epoch_accuracy_C.png" width="360"/>
  <img src="Tensorboard_SS/epoch_acc_itr_C.png" width="360"/>
</p>
<p align="center">
  <img src="Tensorboard_SS/epoch_auc_C.png" width="360"/>
  <img src="Tensorboard_SS/epoch_auc_itr_C.png" width="360"/>
</p>

## Part 4 Discussion

The regression TensorBoard curves show training and validation loss and MAE decreasing smoothly, which suggests stable optimization with no exploding or oscillating behavior. Validation stays close to training, so there is no severe overfitting, but the validation curve generally sits a bit higher, indicating a small generalization gap. For classification, accuracy and AUC increase over epochs, and validation remains consistently below training, which points to mild overfitting or simply to a task that is harder to generalize. The curves improve early and then flatten, suggesting the model is approaching its best performance under the current architecture and hyperparameters. These trends match the final test metrics: the NN performs reasonably but does not beat XGBoost, especially on regression. Overall, the booking model learns cleanly with AUC rising, while the price model still ends with relatively large error, which suggests that tree boosting captures the feature interactions and nonlinearities in this data better.
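
The train/validation comparison above can be made quantitative from the per-epoch metrics. Keras `model.fit` returns a `History` object whose `.history` dict holds per-epoch lists such as `loss` and `val_loss`. The sketch below uses a hand-written dict in that shape with made-up values, and the helper name `summarize_gap` is hypothetical.

```python
def summarize_gap(history: dict, metric: str, higher_is_better: bool = False) -> dict:
    """Summarize train vs validation curves for one metric.

    `history` maps metric names to per-epoch lists, in the shape of
    Keras `History.history` (e.g. "loss" and "val_loss").
    """
    train = history[metric]
    valid = history["val_" + metric]
    pick = max if higher_is_better else min
    best_epoch = valid.index(pick(valid))  # 0-based epoch with the best validation value
    # A positive gap means validation is worse than training at the last epoch.
    gap = (train[-1] - valid[-1]) if higher_is_better else (valid[-1] - train[-1])
    return {"best_epoch": best_epoch, "final_gap": gap}

# Made-up curves for illustration only; not the notebook's logged values.
toy_history = {
    "loss": [1.00, 0.80, 0.70, 0.65],
    "val_loss": [1.10, 0.90, 0.82, 0.80],
    "auc": [0.60, 0.66, 0.70, 0.72],
    "val_auc": [0.58, 0.64, 0.67, 0.68],
}

loss_summary = summarize_gap(toy_history, "loss")
auc_summary = summarize_gap(toy_history, "auc", higher_is_better=True)
print(loss_summary)
print(auc_summary)
```

A small positive `final_gap` with a `best_epoch` near the end of training would match the "mild gap, still flattening" reading given above.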

---

# Part 5 Final Write-Up (Required)

## Data summary + seasonality (prices + bookings)

Across all cities and snapshots, average prices are very stable month to month, with only minor fluctuations such as occasional December declines and March dips. Booking probability shows stronger seasonal patterns, and their shape depends on the snapshot date: March snapshots tend to start high in January–March, decline mid-year, and recover toward October–December, while December snapshots tend to start low around February and rise to a peak in October–December. Weekend versus weekday effects are small: prices are nearly identical, while weekend booking probability is slightly higher. Room type is a primary determinant of price, with hotel rooms and entire homes/apartments consistently the most expensive, followed by private rooms, with shared rooms the cheapest. WashingtonDC is an exception: its shared-room prices appear as extreme outliers, likely due to small sample sizes or noisy data, and need careful handling.
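
The monthly and weekend/weekday summaries behind these statements are simple group-wise means over the night-level panel. The following is a minimal sketch on toy data, assuming columns named `date`, `price`, and `booked` (1 if the night is booked). Treating Friday and Saturday nights as the weekend is also an assumption and may differ from the definition used earlier in the notebook.

```python
import pandas as pd

# Toy night-level panel; column names and values are assumptions for illustration.
panel = pd.DataFrame({
    "date": pd.to_datetime(["2025-03-07", "2025-03-08", "2025-03-10",
                            "2025-08-01", "2025-08-02", "2025-08-04"]),
    "price": [100.0, 100.0, 100.0, 100.0, 100.0, 100.0],
    "booked": [1, 1, 0, 0, 1, 0],
})

panel["month"] = panel["date"].dt.month
# dayofweek: Monday=0 ... Sunday=6, so 4 and 5 are Friday and Saturday nights.
panel["day_type"] = (
    panel["date"].dt.dayofweek.isin([4, 5]).map({True: "weekend", False: "weekday"})
)

# Mean price and mean of the 0/1 booked flag (the booking probability) per group.
monthly = panel.groupby("month").agg(
    avg_price=("price", "mean"), booking_prob=("booked", "mean")
)
by_day_type = panel.groupby("day_type").agg(
    avg_price=("price", "mean"), booking_prob=("booked", "mean")
)

print(monthly)
print(by_day_type)
```

In this toy panel the price is flat while the booking probability differs by month and by day type, which is the same qualitative pattern the summary describes.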

## Temporal modeling setup (no leakage)

A chronological split by month was employed to ensure the model learns exclusively from historical data and evaluates on future periods. Specifically, earlier months form the training set, the subsequent months the validation set (for hyperparameter tuning and early stopping), and the final months the test set (for unbiased assessment). This approach prevents leakage, as future calendar outcomes do not influence model fitting or preprocessing. It aligns with real-world deployment, where predictions rely on observed historical patterns. All preprocessing steps, including imputation, scaling, and one-hot encoding, are fitted solely on the training data and applied to validation and test sets.
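
The split and the train-only preprocessing can be sketched as follows. This is an illustration under stated assumptions, not the notebook's exact code: the helper `temporal_split`, the cut-off dates, and the toy `minimum_nights` values are hypothetical, and a hand-rolled standardization stands in for whatever imputer, scaler, and encoder the pipeline actually uses.

```python
import pandas as pd

def temporal_split(df, date_col, valid_start, test_start):
    """Split rows chronologically: train < valid_start <= valid < test_start <= test."""
    dates = df[date_col]
    train = df[dates < valid_start]
    valid = df[(dates >= valid_start) & (dates < test_start)]
    test = df[dates >= test_start]
    return train, valid, test

# Toy panel with one row per month; the cut-off dates below are hypothetical.
panel = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=12, freq="MS"),
    "minimum_nights": [1, 2, 2, 3, 1, 2, 30, 2, 1, 2, 3, 90],
})
train, valid, test = temporal_split(
    panel, "date",
    valid_start=pd.Timestamp("2025-09-01"),
    test_start=pd.Timestamp("2025-11-01"),
)

# Fit the preprocessing statistics on the training rows only ...
mu = train["minimum_nights"].mean()
sigma = train["minimum_nights"].std(ddof=0)

# ... then apply those same statistics to every split.
def scale(part):
    return (part["minimum_nights"] - mu) / sigma

train_z, valid_z, test_z = scale(train), scale(valid), scale(test)
print(len(train), len(valid), len(test))
```

Because `mu` and `sigma` never see validation or test rows, an extreme value that appears only in the final months (the `90` here) cannot shift the scaling used during training.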

## Model comparison (XGBoost vs Neural Nets)

For price prediction (regression), XGBoost clearly outperforms the neural network: XGBoost achieves RMSE = 356 and MAE = 129.9, whereas the neural network yields RMSE = 751 and MAE = 216, about 2.1 times the RMSE and 1.7 times the MAE. For booking prediction (classification), XGBoost is still ahead, though by a narrower margin: AUC = 0.734 and Accuracy = 0.677, compared to the neural network's AUC = 0.703 and Accuracy = 0.662. The neural network is therefore close to competitive on booking prediction but not on price, where tree boosting appears to capture nonlinear feature interactions more effectively. For both models RMSE is well above MAE, which suggests that a small number of large errors, for example on extreme-priced listings, dominates the squared-error metric. Overall, the results suggest that XGBoost is the stronger baseline for this tabular panel data across both targets.
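
To make the two regression metrics concrete, the sketch below computes RMSE and MAE by hand on made-up prices containing one extreme listing, then takes the ratios of the test metrics reported above. The toy values are assumptions chosen only to show how a single large error inflates RMSE much more than MAE.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors quadratically."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up prices: nine ordinary nights and one extreme listing.
y_true = [100.0] * 9 + [5000.0]
y_pred = [110.0] * 9 + [1000.0]

toy_mae = mae(y_true, y_pred)
toy_rmse = rmse(y_true, y_pred)
print("MAE :", toy_mae)
print("RMSE:", round(toy_rmse, 1))

# Ratios of the reported test metrics (neural network relative to XGBoost).
rmse_ratio = round(751 / 356, 2)
mae_ratio = round(216 / 129.9, 2)
print("RMSE ratio:", rmse_ratio)
print("MAE ratio :", mae_ratio)
```

In the toy data nine of the ten errors are only 10, yet the single error of 4000 pushes RMSE to roughly three times the MAE. A similar gap between the reported RMSE and MAE is consistent with a few very expensive listings driving the error.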

## TensorBoard insights

The regression loss and MAE curves decline steadily for both training and validation sets, indicating stable training and progressive improvement, though they plateau at levels consistent with the neural network's elevated test errors. The classification curves show accuracy and AUC increasing over epochs, with validation consistently below training, suggesting mild overfitting or difficulty generalizing. The curves remain smooth without instability, implying that the underperformance comes from what the model can fit with the current architecture and features rather than from unstable optimization. These TensorBoard trends correspond to the final metrics: the neural network captures useful signal but underperforms relative to XGBoost. Potential improvements include stronger regularization, architectures better suited to tabular data, or additional feature engineering.

## Business insights

Predicting booking probability has substantial operational value, as it estimates demand and informs dynamic pricing, minimum-night policies, staffing, and marketing. Price prediction remains useful for hosts as a market benchmark ("What price should this listing command?"), yet demand ultimately drives revenue. Because weekend booking probability is marginally higher while prices are essentially unchanged, hosts may have room to raise weekend prices during peak seasons. Room type largely determines the price tier, so hosts compete mainly within their category, where attributes such as instant booking or superhost status can differentiate a listing. Finally, the anomalous shared-room prices in WashingtonDC should be monitored, and outlier detection, robust data cleaning, or minimum-support thresholds should be applied before any automated pricing use.
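
The weekend-pricing suggestion can be framed as a break-even calculation: expected revenue for a night is the price times the booking probability, so a price increase pays off only if the booking probability stays above a break-even level. The sketch below uses hypothetical numbers that are not taken from the notebook's results, and the function names are illustrative.

```python
def expected_revenue(price: float, booking_prob: float) -> float:
    """Expected revenue for one night: price times the probability it is booked."""
    return price * booking_prob

def break_even_prob(price: float, booking_prob: float, new_price: float) -> float:
    """Booking probability at which `new_price` earns the same expected revenue."""
    return expected_revenue(price, booking_prob) / new_price

# Hypothetical numbers for illustration only.
base_price, base_prob = 200.0, 0.50
weekend_price = 220.0  # a 10% uplift over the base price

base_rev = expected_revenue(base_price, base_prob)
needed_prob = break_even_prob(base_price, base_prob, weekend_price)
print("Expected revenue at base price:", base_rev)
print("Break-even booking probability:", round(needed_prob, 4))
```

With these numbers the uplift is worthwhile only if the weekend booking probability stays above roughly 0.45. The booking model's predicted probabilities could supply the inputs for such a check.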
