---

# 0. Project Roadmap – Expedia Hotel Recommendation

---

This roadmap outlines the steps of our workflow:  
from loading and preparing the data, through feature engineering, to model building and evaluation.

---

## Table of Contents
1. Data Preparation
- Load datasets:
  - `onlybookings.csv` (main training data, only booking events)
  - `destinations.csv` (destination embeddings: `srch_destination_id` plus 149 latent features)
- Inspect data structure, missing values, distributions
- Handle missing values (e.g., distance)

---

2. Feature Engineering
- **Date & time features**
  - `season` (Spring, Summer, Autumn, Winter from `srch_ci`)
  - `search_month`, `search_weekday`, `search_hour`
  - `length_of_stay` = `(srch_co - srch_ci).days`
  - `advance_days` = `(srch_ci - date_time).days`
- **Geographic features**
  - `distance_log` (log-transformed, imputed)
  - `distance_was_missing` (binary flag)
  - `same_country`, `same_continent`
- **User & session features**
  - `cnt_log` (log of number of similar events)
  - `cnt_bin` (binned categories: single / few / many)
  - `is_mobile`, `is_package`, `channel`
- **Destination features**
  - Run PCA on `destinations.csv` (reduce the 149 latent features → 3 components)
  - Join to main dataset via `srch_destination_id`
  - Add `dest_pca1`, `dest_pca2`, `dest_pca3` to feature set

---

3. Baseline Model
- Goal: simple first benchmark
- Train a basic **Multinomial Logistic Regression** or **Random Forest**
- Evaluate with:
  - Accuracy (quick sanity check)
  - MAP@5 (Kaggle evaluation metric)
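
Because each booking has exactly one true `hotel_cluster`, average precision at 5 reduces to the reciprocal rank of the true label within the five predictions, or 0 if it is absent. The following is a minimal sketch of the metric under that single-label assumption; the function name `map_at_5` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np

def map_at_5(y_true, top5_labels):
    """MAP@5 for single-label targets: mean of 1/rank of the true label, 0 if missed."""
    y_true = np.asarray(y_true).reshape(-1, 1)
    top5_labels = np.asarray(top5_labels)[:, :5]
    hits = top5_labels == y_true                   # (n_samples, 5) boolean matrix
    ranks = hits.argmax(axis=1) + 1                # 1-based rank of the first hit
    scores = np.where(hits.any(axis=1), 1.0 / ranks, 0.0)
    return float(scores.mean())

# Toy example: true label ranked 1st, ranked 3rd, and missing from the top 5
y_true = [7, 2, 9]
top5 = [[7, 1, 2, 3, 4],
        [5, 6, 2, 8, 0],
        [1, 2, 3, 4, 5]]
score = map_at_5(y_true, top5)   # (1 + 1/3 + 0) / 3
```

Accuracy only rewards a correct first guess, whereas this score also gives partial credit for a correct label at ranks 2 to 5, which is why the two numbers can differ noticeably.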

---

4. Intermediate Models
- Improve preprocessing:
  - Scale numeric features
  - One-Hot Encode categorical features
- Try tree-based models:
  - Random Forest
  - Gradient Boosting (e.g., HistGradientBoostingClassifier)
- Feature importance analysis

---

5. Advanced Model
- Use more powerful models:
  - XGBoost, LightGBM, or CatBoost (if available)
- Apply **Target Encoding** for high-cardinality features:
  - `srch_destination_id`
  - `hotel_market`, `hotel_country`
- Hyperparameter Tuning:
  - RandomizedSearchCV or GridSearchCV
- Evaluate again with MAP@5
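
Target encoding of an ID such as `srch_destination_id` can leak the label if the statistics are computed on the same rows they are applied to. One common remedy is out-of-fold encoding with smoothing towards the global mean. The sketch below is only an illustration under stated assumptions: `oof_target_encode` is a hypothetical helper, the toy frame is made up, and for a multiclass target one such column would be needed per class (here: the share of bookings in cluster 5).

```python
import numpy as np
import pandas as pd

def oof_target_encode(cat, y, n_splits=5, smoothing=20.0, seed=42):
    """Out-of-fold, smoothed mean of y per category (hypothetical helper)."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y).astype(float).reset_index(drop=True)
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(cat)) % n_splits      # balanced random fold ids
    encoded = np.empty(len(cat), dtype=float)
    for f in range(n_splits):
        fit, hold = fold != f, fold == f
        prior = y[fit].mean()                        # global mean on the fitting folds
        stats = y[fit].groupby(cat[fit]).agg(["sum", "count"])
        smooth = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior
        encoded[hold] = cat[hold].map(smooth).fillna(prior).to_numpy()
    return encoded

toy = pd.DataFrame({
    "srch_destination_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    "hotel_cluster":       [5, 5, 9, 9, 9, 5, 5, 5, 9, 5],
})
# Encode "share of bookings in cluster 5" for each destination
toy["dest_te_cluster5"] = oof_target_encode(
    toy["srch_destination_id"], toy["hotel_cluster"] == 5
)
```

The `smoothing` value controls how strongly rare destinations are pulled towards the global rate; it is a tuning parameter, not a recommendation.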

---

6. Model Comparison & Results
- Summarize baseline vs intermediate vs advanced models
- Compare MAP@5 scores
- Document most important features (interpretability)

---

7. Conclusion & Next Steps
- Insights about which features matter most
- Ideas for further improvement (e.g., embeddings, deep learning, session-level modeling)
- Reflection on challenges and learnings


### Feature Derivation (at a glance)

| Feature | Source column(s) | Type | Transformation / Derivation | Encoding recommendation | Notes |
|---|---|---|---|---|---|
| `season` | `srch_ci` | categorical (4) | Map month → {Spring, Summer, Autumn, Winter} | One-Hot | Robust, low cardinality |
| `search_month` | `date_time` | cyclical | Extract month (1–12), optional sin/cos | One-Hot or sin/cos | Use sin/cos to keep periodicity |
| `search_weekday` | `date_time` | categorical (7) | Extract weekday (0–6) | One-Hot | Useful for business vs leisure patterns |
| `search_hour` | `date_time` | cyclical | Extract hour (0–23), optional sin/cos | One-Hot or sin/cos | Hourly patterns can matter |
| `length_of_stay` | `srch_ci`, `srch_co` | numeric | `(srch_co - srch_ci).days`, clip to [0, 30] | Scale (StandardScaler) or raw for trees | Outlier clipping stabilizes training |
| `advance_days` | `srch_ci`, `date_time` | numeric | `(srch_ci - date_time).days`, clip to [0, 365] | Scale (StandardScaler) or raw for trees | Captures early vs last-minute booking |
| `orig_destination_distance` | `orig_destination_distance` | numeric | Coerce to numeric; impute (group median by `season`,`hotel_market` → fallback global); optional log | Use `distance_log = log1p(imputed)` and scale | Add `distance_was_missing` flag |
| `distance_was_missing` | `orig_destination_distance` | binary | `1` if original value missing | As is | Helps model “missingness” signal |
| `distance_log` | `orig_destination_distance` | numeric | `log1p(imputed distance)` | Scale | Reduces skew / outliers |
| `same_country` | `user_location_country`, `hotel_country` | binary | 1 if equal, else 0 | As is | Cheap but often predictive |
| `same_continent` | `posa_continent`, `hotel_continent` | binary | 1 if equal, else 0 | As is | Adds geographic proximity signal |
| `cnt_log` | `cnt` | numeric | `log1p(cnt)` | Scale | Smoother than raw `cnt` |
| `cnt_bin` | `cnt` | categorical (3–4) | Bin: 1=single, 2–3=few, ≥4=many | One-Hot | Optional, complements `cnt_log` |
| `is_mobile` | `is_mobile` | binary | — | As is / One-Hot | Device signal |
| `is_package` | `is_package` | binary | — | As is / One-Hot | Package vs stand-alone |
| `channel` | `channel` | categorical (low/med) | — | One-Hot | Marketing/acquisition signal |
| `posa_continent` | `posa_continent` | categorical (low) | — | One-Hot | Origin continent |
| `hotel_continent` | `hotel_continent` | categorical (low) | — | One-Hot | Destination continent |
| `hotel_country` | `hotel_country` | categorical (med/high) | — | Target Encoding or OHE (if feasible) | Higher cardinality |
| `hotel_market` | `hotel_market` | categorical (high) | — | Target Encoding / CatBoost / LightGBM | High cardinality; powerful |
| `srch_destination_id` | `srch_destination_id` | categorical (very high) | — | Target Encoding / Embeddings | Very predictive of `hotel_cluster` but risky for overfitting; never One-Hot |
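
The sin/cos option listed for `search_month` and `search_hour` maps a periodic value onto a circle, so that hour 23 ends up next to hour 0. A minimal sketch of that encoding follows; `add_cyclical` is a hypothetical helper and the demo frame is invented for illustration.

```python
import numpy as np
import pandas as pd

def add_cyclical(df, col, period, start=0):
    """Return a copy of df with `<col>_sin` / `<col>_cos` columns added."""
    angle = 2 * np.pi * (df[col] - start) / period
    out = df.copy()
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out

demo = pd.DataFrame({"search_hour": [0, 6, 12, 23], "search_month": [1, 4, 7, 12]})
demo = add_cyclical(demo, "search_hour", period=24)            # hours run 0..23
demo = add_cyclical(demo, "search_month", period=12, start=1)  # months run 1..12
```

Tree-based models can usually work with the raw integer, so this encoding mainly matters for the linear baseline.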

---

### Excluded / Handle-with-care (to avoid leakage or overfitting)

| Column | Why exclude or transform |
|---|---|
| `hotel_cluster` | **Target label** — never use as feature |
| `user_id` | High cardinality, user-specific leakage; don’t use for generalizable models |
| Raw timestamps: `date_time`, `srch_ci`, `srch_co` | Use only via derived features (season, month, weekday, hour, length_of_stay, advance_days) |
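
A small guard can make the exclusions above harder to violate by accident, for example when a feature list is built from `df.columns`. This is only a sketch; `EXCLUDED` and `check_no_excluded` are hypothetical names, and the set simply mirrors the table.

```python
EXCLUDED = {"hotel_cluster", "user_id", "date_time", "srch_ci", "srch_co"}

def check_no_excluded(feature_cols, excluded=EXCLUDED):
    """Raise if any excluded column slipped into the feature list."""
    leaked = sorted(set(feature_cols) & set(excluded))
    if leaked:
        raise ValueError(f"Excluded columns used as features: {leaked}")
    return list(feature_cols)

ok = check_no_excluded(["season", "length_of_stay", "dest_pca1"])
```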


In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from collections import Counter
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import OrdinalEncoder

# Note: HistGradientBoostingClassifier is stable since scikit-learn 1.0, so the
# `sklearn.experimental.enable_hist_gradient_boosting` import is no longer needed.


# 1. Data Preparation

In [3]:
# Step 1: Load Data
# ================================
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load main training data (only bookings)
df = pd.read_csv("data/onlybookings.csv", low_memory=False)

# Load destination embeddings (srch_destination_id + 149 latent features)
dest = pd.read_csv("data/destinations.csv", low_memory=False)

print("Shape onlybookings:", df.shape)
print("Shape destinations:", dest.shape)

Shape onlybookings: (1000001, 24)
Shape destinations: (62106, 150)


In [4]:
# Step 2: PCA on destinations.csv
# ================================

# Remove id column (srch_destination_id) from PCA input
dest_id = dest["srch_destination_id"]
dest_features = dest.drop("srch_destination_id", axis=1)

# Standardize features before PCA
scaler = StandardScaler()
dest_scaled = scaler.fit_transform(dest_features)

# Run PCA (reduce the 149 latent features → 3 components)
pca = PCA(n_components=3, random_state=42)
dest_pca = pca.fit_transform(dest_scaled)

# Put into DataFrame with srch_destination_id
dest_pca_df = pd.DataFrame(dest_pca, columns=["dest_pca1","dest_pca2","dest_pca3"])
dest_pca_df["srch_destination_id"] = dest_id.values

print("Explained variance by 3 components:", pca.explained_variance_ratio_.sum())
dest_pca_df.head()

Explained variance by 3 components: 0.5986751942782461


|  | dest_pca1 | dest_pca2 | dest_pca3 | srch_destination_id |
|---|---|---|---|---|
| 0 | 1.179554 | -1.41871 | -0.734699 | 0 |
| 1 | 6.605425 | 0.249541 | 0.599817 | 1 |
| 2 | -1.153878 | -0.303059 | 0.92731 | 2 |
| 3 | 7.358539 | 0.394187 | -0.056713 | 3 |
| 4 | 3.727968 | 0.584871 | 0.403375 | 4 |
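
Three components retain about 60% of the variance (0.5987), so it can be worth checking how many components a given variance target would need. The sketch below does this on synthetic data, which stands in for the destination matrix purely for illustration; `X_demo`, `cum_var`, and `n_for_80` are hypothetical names and the 80% target is an arbitrary example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 500 rows, 20 correlated columns driven by 4 latent factors
latent = rng.normal(size=(500, 4))
X_demo = latent @ rng.normal(size=(4, 20)) + 0.1 * rng.normal(size=(500, 20))

X_scaled = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_scaled)                       # keep all components
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 80%
n_for_80 = int(np.searchsorted(cum_var, 0.80) + 1)
```

Applied to the real `dest_scaled` matrix, the same cumulative curve would show whether going beyond three components buys a meaningful amount of extra variance.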


In [5]:
# Step 3: Left Join
# ================================
# Merge PCA features into main bookings dataset
df = df.merge(dest_pca_df, on="srch_destination_id", how="left")

print("Shape after join:", df.shape)
df[["srch_destination_id","dest_pca1","dest_pca2","dest_pca3"]].head()

Shape after join: (1000001, 27)


|  | srch_destination_id | dest_pca1 | dest_pca2 | dest_pca3 |
|---|---|---|---|---|
| 0 | 8250 | -45.864357 | -4.72233 | -10.897558 |
| 1 | 8250 | -45.864357 | -4.72233 | -10.897558 |
| 2 | 8291 | -19.182445 | 7.503546 | 4.190429 |
| 3 | 1385 | -9.925175 | -7.212029 | 9.635184 |
| 4 | 8803 | -11.895228 | 3.632621 | 2.20695 |
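
A left join keeps every booking, but rows whose `srch_destination_id` has no entry in the destination table receive NaN for the PCA columns. The following sketch, on made-up toy frames, shows one way to measure that using pandas' `validate` and `indicator` options of `merge`; the frame names `bookings` and `dest_pca_demo` are hypothetical.

```python
import pandas as pd

bookings = pd.DataFrame({"srch_destination_id": [10, 10, 11, 99]})
dest_pca_demo = pd.DataFrame({
    "srch_destination_id": [10, 11, 12],
    "dest_pca1": [0.5, -1.2, 3.0],
})

merged = bookings.merge(
    dest_pca_demo,
    on="srch_destination_id",
    how="left",
    validate="many_to_one",   # fail loudly if destination ids are duplicated
    indicator=True,           # adds a `_merge` column: "both" or "left_only"
)
n_unmatched = int((merged["_merge"] == "left_only").sum())
share_unmatched = n_unmatched / len(merged)
```

In the notebook, unmatched rows are later filled by the median imputer in the modelling pipelines, so knowing their share helps judge how much that imputation matters.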


# 2. Feature Engineering

In [6]:
# ================================
# Step 2: Feature Engineering
# ================================

# --- Date & Time Features ---
df["date_time"] = pd.to_datetime(df["date_time"], errors="coerce")
df["srch_ci"]   = pd.to_datetime(df["srch_ci"], errors="coerce")
df["srch_co"]   = pd.to_datetime(df["srch_co"], errors="coerce")

# Season feature
def month_to_season(m):
    if pd.isna(m): return np.nan
    if m in (12, 1, 2):  return "Winter"
    if m in (3, 4, 5):   return "Spring"
    if m in (6, 7, 8):   return "Summer"
    return "Autumn"

df["season"] = df["srch_ci"].dt.month.map(month_to_season)

# Month, weekday, hour
df["search_month"]   = df["date_time"].dt.month
df["search_weekday"] = df["date_time"].dt.weekday
df["search_hour"]    = df["date_time"].dt.hour

# Length of stay (clip to avoid extreme outliers)
df["length_of_stay"] = (df["srch_co"] - df["srch_ci"]).dt.days
df["length_of_stay"] = df["length_of_stay"].clip(lower=0, upper=30)

# Advance booking days (clip as well)
df["advance_days"] = (df["srch_ci"] - df["date_time"]).dt.days
df["advance_days"] = df["advance_days"].clip(lower=0, upper=365)

# --- Distance Features ---
df["orig_destination_distance"] = pd.to_numeric(df["orig_destination_distance"], errors="coerce")

# Median imputation (global median only; the group-wise median from the roadmap table is not applied here)
global_med = df["orig_destination_distance"].median()
df["distance_imputed"] = df["orig_destination_distance"].fillna(global_med)

# Log transform
df["distance_log"] = np.log1p(df["distance_imputed"])

# Missingness flag
df["distance_was_missing"] = df["orig_destination_distance"].isna().astype(int)

# --- User & Session Features ---
df["cnt"] = pd.to_numeric(df["cnt"], errors="coerce")
df["cnt_log"] = np.log1p(df["cnt"])

# Bin categories for cnt
df["cnt_bin"] = pd.cut(
    df["cnt"],
    bins=[0, 1, 3, np.inf],  # open-ended top bin so very large counts are not mapped to NaN
    labels=["single","few","many"],
    include_lowest=True
)

# --- Geographic Flags ---
df["same_country"] = (df["user_location_country"] == df["hotel_country"]).astype(int)
df["same_continent"] = (df["posa_continent"] == df["hotel_continent"]).astype(int)

# --- Keep PCA features ---
# Already joined: dest_pca1, dest_pca2, dest_pca3

# --- Final check ---
feature_cols = [
    "season","search_month","search_weekday","search_hour",
    "length_of_stay","advance_days",
    "distance_log","distance_was_missing",
    "cnt_log","cnt_bin",
    "is_mobile","is_package","channel",
    "posa_continent","hotel_continent","hotel_country","hotel_market",
    "same_country","same_continent",
    "dest_pca1","dest_pca2","dest_pca3"
]

print("Number of features prepared:", len(feature_cols))
df[feature_cols].head()

Number of features prepared: 22


|  | season | search_month | search_weekday | search_hour | length_of_stay | advance_days | distance_log | distance_was_missing | cnt_log | cnt_bin | ... | channel | posa_continent | hotel_continent | hotel_country | hotel_market | same_country | same_continent | dest_pca1 | dest_pca2 | dest_pca3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Summer | 8 | 0 | 7 | 4 | 15 | 7.712115 | 0 | 1.386294 | few | ... | 9 | 3 | 2 | 50 | 628 | 0 | 0 | -45.864357 | -4.72233 | -10.897558 |
| 1 | Summer | 8 | 0 | 8 | 4 | 17 | 7.712115 | 0 | 0.693147 | single | ... | 9 | 3 | 2 | 50 | 628 | 0 | 0 | -45.864357 | -4.72233 | -10.897558 |
| 2 | Spring | 2 | 3 | 18 | 2 | 49 | 6.689968 | 1 | 0.693147 | single | ... | 4 | 3 | 2 | 50 | 191 | 0 | 0 | -19.182445 | 7.503546 | 4.190429 |
| 3 | Autumn | 6 | 5 | 15 | 8 | 82 | 6.689968 | 1 | 0.693147 | single | ... | 9 | 4 | 0 | 185 | 185 | 0 | 0 | -9.925175 | -7.212029 | 9.635184 |
| 4 | Summer | 11 | 6 | 18 | 2 | 214 | 6.689968 | 1 | 0.693147 | single | ... | 9 | 4 | 3 | 151 | 69 | 0 | 0 | -11.895228 | 3.632621 | 2.20695 |
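
The roadmap table describes a group-wise median imputation for `orig_destination_distance` (by `season` and `hotel_market`, with a global fallback), whereas the cell above fills with the global median only. A minimal sketch of the group-wise variant on a made-up toy frame could look like this; in a real pipeline the medians should arguably be computed on the training split only.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "season":       ["Summer", "Summer", "Summer", "Winter", "Winter", "Spring"],
    "hotel_market": [628, 628, 628, 191, 191, 69],
    "orig_destination_distance": [100.0, np.nan, 300.0, 50.0, np.nan, np.nan],
})

dist = demo["orig_destination_distance"]
# Median per (season, hotel_market) group, broadcast back to every row
group_med = (
    demo.groupby(["season", "hotel_market"])["orig_destination_distance"]
        .transform("median")
)
global_med = dist.median()

demo["distance_was_missing"] = dist.isna().astype(int)
# Group median first; groups without any observed distance fall back to the global median
demo["distance_imputed"] = dist.fillna(group_med).fillna(global_med)
demo["distance_log"] = np.log1p(demo["distance_imputed"])
```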


# 3. Baseline Model

In [11]:
# Step 3: Baseline Model
# ================================

target = "hotel_cluster"

numeric_features = [
    "length_of_stay","advance_days","distance_log","cnt_log",
    "distance_was_missing","same_country","same_continent",
    "dest_pca1","dest_pca2","dest_pca3"
]

categorical_features = [
    "season","search_month","search_weekday","search_hour",
    "cnt_bin","is_mobile","is_package","channel",
    "posa_continent","hotel_continent","hotel_country","hotel_market"
]

# Drop rows with missing target just in case
df = df.dropna(subset=[target])

X = df[numeric_features + categorical_features]
y = df[target].astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ---- Preprocessing: impute, then scale / one-hot encode ----
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler",  StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Sparse output is the default; the `sparse` keyword was renamed to
    # `sparse_output` in scikit-learn 1.2, so it is omitted here.
    ("ohe",     OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ],
    sparse_threshold=0.3
)

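# With more than two classes, the "saga" solver fits a multinomial model by default;
# the explicit `multi_class` argument is deprecated in recent scikit-learn releases.
clf = LogisticRegression(
    solver="saga",
    max_iter=200,
    n_jobs=-1,
    verbose=0
)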

pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("clf", clf)
])

pipe.fit(X_train, y_train)

# --- Evaluation ---
y_pred = pipe.predict(X_val)
acc = accuracy_score(y_val, y_pred)

def map_at_k(y_true, proba, classes, k=5):
    topk_idx = np.argsort(-proba, axis=1)[:, :k]
    topk_labels = classes[topk_idx]
    rr = []
    for t, row in zip(y_true, topk_labels):
        if t in row:
            rank = np.where(row == t)[0][0] + 1
            rr.append(1.0 / rank)
        else:
            rr.append(0.0)
    return float(np.mean(rr))

proba_val = pipe.predict_proba(X_val)
classes_  = pipe.named_steps["clf"].classes_
map5 = map_at_k(y_val.values, proba_val, classes_, k=5)

print(f"Validation Accuracy: {acc:.4f}")
print(f"Validation MAP@5:   {map5:.4f}")


Validation Accuracy: 0.1640
Validation MAP@5:   0.2761
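
One way to put a validation MAP@5 of 0.2761 in context is a popularity baseline that predicts the same five most frequent training clusters for every row. The sketch below shows the idea on toy labels; `popularity_map5` is a hypothetical helper and the label lists are invented, so the resulting number says nothing about the real data.

```python
import numpy as np
import pandas as pd

def popularity_map5(y_train, y_val, k=5):
    """MAP@k when every row is assigned the k most frequent training classes."""
    top = pd.Series(y_train).value_counts().index[:k].to_numpy()
    y_val = np.asarray(y_val).reshape(-1, 1)
    hits = top.reshape(1, -1) == y_val               # (n_val, <=k) boolean matrix
    ranks = hits.argmax(axis=1) + 1                  # 1-based rank of the true label
    return float(np.where(hits.any(axis=1), 1.0 / ranks, 0.0).mean())

y_tr = [91, 91, 91, 41, 41, 48]
y_va = [91, 41, 7]
baseline = popularity_map5(y_tr, y_va)   # (1 + 1/2 + 0) / 3
```

Calling it with the notebook's `y_train` and `y_val` would give the reference value that any fitted model should beat.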


# 4. Intermediate Model

In [23]:
# --- Features ---
numeric_features = [
    "length_of_stay","advance_days","distance_log","cnt_log",
    "distance_was_missing","same_country","same_continent",
    "dest_pca1","dest_pca2","dest_pca3"
]

categorical_features = [
    "season","search_month","search_weekday","search_hour",
    "cnt_bin","is_mobile","is_package","channel",
    "posa_continent","hotel_continent","hotel_country","hotel_market"
]

target = "hotel_cluster"

# --- Split ---
X = df[numeric_features + categorical_features]
y = df[target].astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Preprocessing ---
# Numerics: impute missing with median
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

# Categoricals: impute missing + encode as integers
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ord", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", num_pipe, numeric_features),
        ("cat", cat_pipe, categorical_features)
    ]
)

# --- Model ---
clf = HistGradientBoostingClassifier(
    learning_rate=0.1,
    max_depth=10,
    max_iter=200,
    random_state=42
)

pipe = Pipeline([
    ("preprocess", preprocess),
    ("clf", clf)
])

# --- Train ---
pipe.fit(X_train, y_train)

# --- Evaluate ---
y_pred = pipe.predict(X_val)
acc = accuracy_score(y_val, y_pred)

# Probabilities for MAP@5
proba_val = pipe.predict_proba(X_val)
classes_  = pipe.named_steps["clf"].classes_

def map_at_k(y_true, proba, classes, k=5):
    topk_idx = np.argsort(-proba, axis=1)[:, :k]
    topk_labels = classes[topk_idx]
    rr = []
    for t, row in zip(y_true, topk_labels):
        if t in row:
            rank = np.where(row == t)[0][0] + 1
            rr.append(1.0 / rank)
        else:
            rr.append(0.0)
    return float(np.mean(rr))

map5 = map_at_k(y_val.values, proba_val, classes_, k=5)

print(f"Validation Accuracy (HGBT): {acc:.4f}")
print(f"Validation MAP@5 (HGBT):   {map5:.4f}")


Validation Accuracy (HGBT): 0.1578
Validation MAP@5 (HGBT):   0.2619


In [None]:
import numpy as np
from joblib import parallel_backend
from sklearn.inspection import permutation_importance

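# Hit rate@5 (Recall@5) scorer: despite its name, it returns the share of rows whose true
# label is among the top-5 predictions, not the rank-weighted MAP@5.
# Defined at top level so it is picklable.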
def map5_scorer(estimator, X, y):
    proba = estimator.predict_proba(X)
    top5_idx = np.argsort(proba, axis=1)[:, -5:][:, ::-1]
    preds = estimator.classes_[top5_idx]
    y = np.asarray(y)
    hits = (preds == y.reshape(-1,1)).any(axis=1)
    return hits.mean()

with parallel_backend('threading'):  # threads instead of processes
    result = permutation_importance(
        pipe, X_val, y_val,
        n_repeats=5,
        random_state=42,
        n_jobs=-1,
        scoring=map5_scorer
    )
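
The scorer used above counts a hit anywhere in the top five equally, so the resulting importances refer to hit rate@5. If rank-weighted MAP@5 is wanted instead, a scorer with the same `(estimator, X, y)` signature could be sketched as follows. `true_map5_scorer` is a hypothetical name, and `_FixedProba` is a stand-in estimator used only to exercise the function.

```python
import numpy as np

def true_map5_scorer(estimator, X, y, k=5):
    """Scorer with the (estimator, X, y) signature that returns MAP@k."""
    proba = estimator.predict_proba(X)
    topk_idx = np.argsort(-proba, axis=1)[:, :k]
    topk_labels = np.asarray(estimator.classes_)[topk_idx]
    y = np.asarray(y).reshape(-1, 1)
    hits = topk_labels == y
    ranks = hits.argmax(axis=1) + 1                  # 1-based rank of the true label
    return float(np.where(hits.any(axis=1), 1.0 / ranks, 0.0).mean())

class _FixedProba:
    """Tiny stand-in estimator that always returns the same class probabilities."""
    classes_ = np.array([10, 20, 30])

    def predict_proba(self, X):
        return np.tile([0.2, 0.5, 0.3], (len(X), 1))

# Predicted order is always 20, 30, 10, so the true labels sit at ranks 1, 2, 3
score = true_map5_scorer(_FixedProba(), X=[[0], [0], [0]], y=[20, 30, 10])
```

Passing such a function as `scoring=` to `permutation_importance` would make the importances directly comparable to the reported MAP@5 values.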


In [None]:
# 1) Feature names: permutation_importance shuffles the input columns of X_val,
#    so the importances line up with X_val's columns (not the encoded features).
feature_names = list(X_val.columns)

# 2) Collect the results in a DataFrame
imp_df = pd.DataFrame({
    "feature": feature_names,
    "importance_mean": result.importances_mean,
    "importance_std": result.importances_std
}).sort_values("importance_mean", ascending=False).reset_index(drop=True)

# 3) Overview
display(imp_df.head(20))
print(f"Number of features: {len(imp_df)}")

# 4) Optional: save
# imp_df.to_csv("permutation_importance_map5.csv", index=False)


|  | feature | importance_mean | importance_std |
|---|---|---|---|
| 0 | dest_pca1 | 0.180181 | 0.000664 |
| 1 | hotel_continent | 0.142167 | 0.000775 |
| 2 | hotel_market | 0.116786 | 0.000381 |
| 3 | dest_pca2 | 0.110017 | 0.000546 |
| 4 | dest_pca3 | 0.10616 | 0.000626 |
| 5 | hotel_country | 0.067914 | 0.000387 |
| 6 | length_of_stay | 0.005074 | 0.000146 |
| 7 | advance_days | 0.003972 | 0.000227 |
| 8 | distance_log | 0.00244 | 9.1e-05 |
| 9 | is_package | 0.00176 | 9.4e-05 |


Number of features: 22


In [None]:
# CatBoost Safe-Mode Patch 
# =========================
from catboost import CatBoostClassifier, Pool

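# --- 0) Target + Features ---
target = "hotel_cluster"
# NOTE: this takes every non-target column, including `user_id` and the raw timestamps
# (`date_time`, `srch_ci`, `srch_co`) that the exclusion table above says to avoid;
# the next cell uses a curated feature list instead.
features = [c for c in df.columns if c != target]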

# --- 1) Downcast numerics to save RAM ---
num_cols = df.select_dtypes(include=["int64","float64"]).columns.tolist()
if num_cols:
    df[num_cols] = df[num_cols].apply(pd.to_numeric, downcast="integer")
    df[num_cols] = df[num_cols].apply(pd.to_numeric, downcast="float")

# --- 2) Cap high-cardinality categoricals (to avoid memory blow-up) ---
def cap_top_k(series, k=200):
    top = series.value_counts(dropna=False).nlargest(k).index
    return series.where(series.isin(top), other="other")

high_card_cols = [c for c in ["hotel_market","srch_destination_id","hotel_country"] if c in df.columns]
for c in high_card_cols:
    df[c] = df[c].astype("object").fillna("missing")
    df[c] = cap_top_k(df[c], k=200)

# --- 3) Detect categorical features ---
auto_cats = [c for c in features if str(df[c].dtype) in ("object","category")]
known_cats = {
    "hotel_continent","hotel_market","hotel_country",
    "search_month","search_weekday","search_hour","season",
    "is_package","is_mobile","same_country","same_continent","distance_was_missing",
    "srch_destination_id","posa_continent","channel","cnt_bin"
}
cat_features = sorted(set(auto_cats).union(known_cats.intersection(features)))
num_features = [c for c in features if c not in cat_features]

# Ensure categoricals are strings with explicit "missing"
for c in cat_features:
    df[c] = df[c].astype("object").fillna("missing")

print("Categorical features:", cat_features)
print("Numeric features:", num_features)

# --- 4) Subsample for stability (scale up later) ---
X_full = df[features]
y_full = df[target].astype(int)

sample_size = 250_000

if len(X_full) > sample_size:
    X_sample, _, y_sample, _ = train_test_split(
        X_full, y_full, train_size=sample_size, stratify=y_full, random_state=42
    )
else:
    X_sample, y_sample = X_full, y_full

X_train, X_val, y_train, y_val = train_test_split(
    X_sample, y_sample, test_size=0.1, stratify=y_sample, random_state=42
)

# --- Make sure categorical columns are strings (no floats/NaNs) ---
X_train = X_train.copy()
X_val   = X_val.copy()

def coerce_cats_to_str(df, cat_cols):
    for c in cat_cols:
        # If numeric-like, convert to pandas nullable int first to avoid float artifacts,
        # then to string. Finally, ensure missing values become a clear token.
        if pd.api.types.is_numeric_dtype(df[c]):
            df[c] = df[c].astype('Int64').astype('string')
        else:
            df[c] = df[c].astype('string')
        df[c] = df[c].fillna('missing')
    return df

X_train = coerce_cats_to_str(X_train, cat_features)
X_val   = coerce_cats_to_str(X_val,   cat_features)

# Rebuild cat_indices after any column changes (names unchanged, but safe to recompute)
cat_indices = [X_train.columns.get_loc(c) for c in cat_features]

# --- 5) Now build CatBoost Pools safely ---
train_pool = Pool(X_train, label=y_train, cat_features=cat_indices)
val_pool   = Pool(X_val,   label=y_val,   cat_features=cat_indices)


# --- 6) Train CatBoost (safe params) ---
params = dict(
    loss_function="MultiClass",
    eval_metric="MultiClass",
    learning_rate=0.1,
    iterations=400,          # increase stepwise when stable
    depth=6,
    max_bin=128,
    l2_leaf_reg=3.0,
    random_seed=42,
    od_type="Iter",
    od_wait=40,
    subsample=0.8,
    rsm=0.8,
    one_hot_max_size=16,
    max_ctr_complexity=1,
    bootstrap_type="Bernoulli",
    task_type="CPU",
    thread_count=2,
    allow_writing_files=False,
)

model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=val_pool, verbose=100, use_best_model=True)

# --- 7) Evaluate Accuracy + MAP@5 ---
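val_proba = model.predict_proba(X_val)
classes   = np.asarray(model.classes_)             # labels in the model's column order
val_pred  = classes[np.argmax(val_proba, axis=1)]  # map column indices back to labels
acc = accuracy_score(y_val, val_pred)

def mapk(y_true, y_proba, classes, k=5):
    y_true = np.asarray(y_true)
    topk = classes[np.argsort(-y_proba, axis=1)[:, :k]]   # top-k labels, not column indices
    hits = (topk == y_true[:, None])
    first_hit_rank = (hits * (np.arange(1, k+1)[None, :])).max(axis=1)
    # np.maximum avoids a division-by-zero warning for rows without a hit
    scores = np.where(first_hit_rank > 0, 1.0 / np.maximum(first_hit_rank, 1), 0.0)
    return scores.mean()

map5 = mapk(y_val.values, val_proba, classes, k=5)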

print(f"\nValidation Accuracy (CatBoost, sample={len(X_train)+len(X_val):,}): {acc:.4f}")
print(f"Validation MAP@5   (CatBoost, sample={len(X_train)+len(X_val):,}): {map5:.4f}")

Categorical features: ['channel', 'cnt_bin', 'distance_was_missing', 'hotel_continent', 'hotel_country', 'hotel_market', 'is_mobile', 'is_package', 'posa_continent', 'same_continent', 'same_country', 'search_hour', 'search_month', 'search_weekday', 'season', 'srch_destination_id']
Numeric features: ['date_time', 'site_name', 'user_location_country', 'user_location_region', 'user_location_city', 'orig_destination_distance', 'user_id', 'srch_ci', 'srch_co', 'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_type_id', 'is_booking', 'cnt', 'dest_pca1', 'dest_pca2', 'dest_pca3', 'length_of_stay', 'advance_days', 'distance_imputed', 'distance_log', 'cnt_log']
0:	learn: 4.3548180	test: 4.3555767	best: 4.3555767 (0)	total: 2m 3s	remaining: 13h 43m 25s


In [10]:
# =========================
# Fast CatBoost (Multiclass) on `df` with class-capped downsampling
# =========================

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier, Pool

# ---------- 0) Columns ----------
target_col = "hotel_cluster"

# Candidate features (will filter to those that actually exist in df)
candidate_features = [
    # Top signal features (from the permutation importance above)
    "dest_pca1","dest_pca2","dest_pca3",
    "hotel_continent","hotel_market","hotel_country",

    # Useful numerics
    "length_of_stay","advance_days","distance_log","orig_destination_distance",
    "cnt","cnt_bin",

    # Time & context
    "search_month","search_weekday","search_hour","season",

    # Binary / flags
    "is_package","is_mobile","same_country","same_continent","distance_was_missing",

    # Other IDs that often help CatBoost
    "srch_destination_id","posa_continent","channel"
]

features = [c for c in candidate_features if c in df.columns]
print(f"Using {len(features)} features:", features)

# ---------- Categorical vs. numeric split (curated list + auto-detection) ----------

# 1) start from the curated categorical candidates
categorical_candidates = [
    "hotel_continent","hotel_market","hotel_country",
    "search_month","search_weekday","search_hour","season",
    "is_package","is_mobile","same_country","same_continent","distance_was_missing",
    "srch_destination_id","posa_continent","channel",
    "cnt_bin",  # <- add this explicitly since it's often categorical
]

# keep only those that are present
cat_features = [c for c in categorical_candidates if c in features]

# 2) auto-add any column that has pandas dtype 'category' or 'object'
auto_cats = [c for c in features if str(df[c].dtype) in ("category", "object")]
cat_features = sorted(set(cat_features).union(auto_cats))

# 3) define numeric features as the rest
num_features = [c for c in features if c not in cat_features]

print("Categorical features:", cat_features)
print("Numeric features:", num_features)

# Use the curated list order (the auto-detected set is identical here)
cat_features = [c for c in categorical_candidates if c in features]
num_features = [c for c in features if c not in cat_features]

# Categorical columns are converted to strings after X is built in step 2 below;
# assigning into X at this point would modify the stale X left over from an earlier cell.

print("Categorical features:", cat_features)
print("Numeric features:", num_features)

# ---------- 2) Basic cleaning / NA handling ----------
# CatBoost can handle NaNs, but it's often safer to standardize missing values.
X = df[features].copy()
y = df[target_col].copy()

# For numeric columns, keep NaN (CatBoost handles), or fill with sentinel if you prefer:
# X[num_features] = X[num_features].astype(float).fillna(np.nan)

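# For categorical columns: set an explicit "missing" token, then convert to string
# (CatBoost expects categorical values to be integers or strings, not floats)
for c in cat_features:
    X[c] = X[c].astype("object").fillna("missing").astype(str)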

# ---------- 3) Class-capped downsampling to keep training manageable ----------
# Keep up to `cap_per_class` rows per class (keeps all minority, trims heavy majority)
cap_per_class = 8000  # adjust based on your RAM/time budget
np.random.seed(42)

tmp = X.copy()
tmp[target_col] = y.values

# Shuffle within each class and take head(cap)
tmp["_rand"] = np.random.rand(len(tmp))
tmp = tmp.sort_values([target_col, "_rand"])

# groupby().head() is efficient and keeps all classes <= cap
tmp_small = tmp.groupby(target_col, group_keys=False).head(cap_per_class).drop(columns="_rand")

X_small = tmp_small[features].reset_index(drop=True)
y_small = tmp_small[target_col].reset_index(drop=True)

print(f"Downsampled from {len(df):,} to {len(X_small):,} rows "
      f"({y_small.nunique()} classes).")

# ---------- 4) Train/Valid split (stratified) ----------
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
train_idx, val_idx = next(sss.split(X_small, y_small))

X_train, X_val = X_small.iloc[train_idx], X_small.iloc[val_idx]
y_train, y_val = y_small.iloc[train_idx], y_small.iloc[val_idx]

# CatBoost needs indices (not names) for categorical features
cat_indices = [X_train.columns.get_loc(c) for c in cat_features]

train_pool = Pool(X_train, label=y_train, cat_features=cat_indices)
val_pool   = Pool(X_val,   label=y_val,   cat_features=cat_indices)

# ---------- 5) Train CatBoost ----------
# Tip: If you have a CUDA GPU, set task_type='GPU' and maybe increase iterations.
params = dict(
    loss_function="MultiClass",
    eval_metric="MultiClass",
    learning_rate=0.1,
    depth=8,
    l2_leaf_reg=3.0,
    iterations=1500,
    random_seed=42,
    early_stopping_rounds=100,
    task_type="CPU",              # change to "GPU" if available
    verbose=200,
    allow_writing_files=False,    # keep workspace clean
)
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=val_pool, use_best_model=True)

# ---------- 6) Metrics: Accuracy + MAP@5 ----------
def apk(actual, predicted, k=5):
    """Average precision at k for a single sample."""
    if len(predicted) > k:
        predicted = predicted[:k]
    score = 0.0
    num_hits = 0.0
    # If actual is a scalar (multiclass), wrap as list
    actual_set = {actual} if not isinstance(actual, (list, set, tuple)) else set(actual)
    for i, p in enumerate(predicted):
        if p in actual_set and p is not None:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
            # For single-label multiclass, we can break after first hit
            break
    return score

def mapk(actuals, preds_topk, k=5):
    return np.mean([apk(a, p, k) for a, p in zip(actuals, preds_topk)])

# Predict probabilities then take top-5 classes
proba = model.predict_proba(val_pool)
top5 = np.argsort(-proba, axis=1)[:, :5]  # top-5 class indices according to model's internal class order

# CatBoost's class order matches sorted(unique(y_train)) internally; we need to map to original labels:
classes = model.classes_  # array of labels in the model order
top5_labels = classes[top5]

# Accuracy (top-1)
pred1_labels = top5_labels[:, 0]
acc = accuracy_score(y_val, pred1_labels)

# MAP@5
map5 = mapk(y_val.values, top5_labels, k=5)

print(f"\nValidation Accuracy (CatBoost): {acc:.4f}")
print(f"Validation MAP@5 (CatBoost):   {map5:.4f}")

# ---------- 7) Feature importance ----------
imp_values = model.get_feature_importance(type="PredictionValuesChange")
imp_df = (
    pd.DataFrame({"feature": X_train.columns, "importance": imp_values})
      .sort_values("importance", ascending=False)
      .reset_index(drop=True)
)
print("\nTop 20 feature importances:")
display(imp_df.head(20))

# (Optional) Save model
# model.save_model("catboost_multiclass.cbm")


Using 24 features: ['dest_pca1', 'dest_pca2', 'dest_pca3', 'hotel_continent', 'hotel_market', 'hotel_country', 'length_of_stay', 'advance_days', 'distance_log', 'orig_destination_distance', 'cnt', 'cnt_bin', 'search_month', 'search_weekday', 'search_hour', 'season', 'is_package', 'is_mobile', 'same_country', 'same_continent', 'distance_was_missing', 'srch_destination_id', 'posa_continent', 'channel']
Categorical features: ['channel', 'cnt_bin', 'distance_was_missing', 'hotel_continent', 'hotel_country', 'hotel_market', 'is_mobile', 'is_package', 'posa_continent', 'same_continent', 'same_country', 'search_hour', 'search_month', 'search_weekday', 'season', 'srch_destination_id']
Numeric features: ['dest_pca1', 'dest_pca2', 'dest_pca3', 'length_of_stay', 'advance_days', 'distance_log', 'orig_destination_distance', 'cnt']
Categorical features: ['hotel_continent', 'hotel_market', 'hotel_country', 'search_month', 'search_weekday', 'search_hour', 'season', 'is_package', 'is_mobile', 'same_cou

: 