# Berlin Airbnb Price Prediction Model

This notebook implements machine learning models to predict Airbnb listing prices in Berlin. The analysis includes feature engineering, model comparison, and performance evaluation using multiple algorithms.

## Objectives:
- Build predictive models for Airbnb pricing in Berlin
- Engineer relevant features from the cleaned dataset
- Compare multiple ML algorithms (Linear Regression, Random Forest, Gradient Boosting)
- Evaluate models using cross-validation and holdout testing
- Provide insights for pricing optimization and market understanding

## Key Features:
- **Geographical clustering** for location-based features
- **Review recency analysis** for activity patterns
- **Comprehensive model evaluation** with multiple metrics
- **Robust preprocessing pipeline** with proper scaling and encoding

## 1. Setup and Data Loading

Setting up the environment, importing libraries, and loading the cleaned dataset.

In [40]:
# Working directory setup
%cd ~/Projects/AirBnB-Berlin/notebooks

# Core data science libraries
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime

# Scikit-learn for machine learning
from sklearn.model_selection import train_test_split, cross_validate, cross_val_predict, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# ML algorithms
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.cluster import KMeans

# Setup paths and load data
PROJECT_ROOT = Path("..").resolve()
DATA_DIR     = PROJECT_ROOT / "data"
OUT_DIR      = PROJECT_ROOT / "output"
CLEAN_CSV = DATA_DIR / "listings_cleaned.csv"

# Create output directory if it doesn't exist
OUT_DIR.mkdir(exist_ok=True)

# Load cleaned dataset
df = pd.read_csv(CLEAN_CSV)
print(f"✅ Loaded dataset: {df.shape[0]:,} listings with {df.shape[1]} features")
print(f"📁 Source: {CLEAN_CSV}")


C:\Users\seewi\Projects\AirBnB-Berlin\notebooks
✅ Loaded dataset: 9,003 listings with 18 features
📁 Source: C:\Users\seewi\Projects\AirBnB-Berlin\data\listings_cleaned.csv


## 2. Feature Engineering

Creating and transforming features to improve model performance and capture important patterns in the data.

In [41]:
# Target variable filtering: remove extreme outliers for stable modeling
PRICE_MAX = 400
initial_count = len(df)
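# The callable builds the mask on the frame returned by dropna, so mask and rows share one index
df = df.dropna(subset=["price"]).loc[lambda d: d["price"] <= PRICE_MAX].copy()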
filtered_count = len(df)

print(f"💰 Price filtering:")
print(f"   - Maximum price threshold: €{PRICE_MAX}")
print(f"   - Listings removed: {initial_count - filtered_count:,} ({((initial_count - filtered_count)/initial_count*100):.1f}%)")
print(f"   - Final dataset: {filtered_count:,} listings")
print(f"   - Price range: €{df['price'].min():.0f} - €{df['price'].max():.0f}")

💰 Price filtering:
   - Maximum price threshold: €400
   - Listings removed: 145 (1.6%)
   - Final dataset: 8,858 listings
   - Price range: €28 - €400
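The €400 cap is a judgment call, so it can help to see how many listings different caps would remove before settling on one. The sketch below is illustrative only: `removal_report` is a hypothetical helper and the prices are made-up toy values, not the Berlin data.

```python
import pandas as pd

def removal_report(prices, caps):
    """For each candidate cap, count the listings priced above it."""
    valid = prices.dropna()
    rows = []
    for cap in caps:
        removed = int((valid > cap).sum())
        rows.append({
            "cap": cap,
            "removed": removed,
            "share_pct": round(100 * removed / len(valid), 1),
        })
    return pd.DataFrame(rows).set_index("cap")

# Toy prices (hypothetical), including two extreme values
toy_prices = pd.Series([30, 45, 60, 80, 95, 120, 150, 220, 410, 900], dtype=float)
report = removal_report(toy_prices, caps=[200, 400, 800])
print(report)
```

On the real dataset the same report could be run on `df["price"]` before the filter cell to check that the chosen cap trims only a small tail.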


In [42]:
# Review recency feature: days since last review (activity indicator)
print("📝 Creating review recency feature...")
df["last_review"] = pd.to_datetime(df["last_review"], errors="coerce")
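# Note: the run date is used as the reference, so these values shift every time the notebook is re-run
today = pd.to_datetime("today")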
df["days_since_last_review"] = (today - df["last_review"]).dt.days

# Fill missing values with maximum (indicating no recent activity)
max_days = df["days_since_last_review"].max()
df["days_since_last_review"] = df["days_since_last_review"].fillna(max_days)

print(f"   - Range: {df['days_since_last_review'].min():.0f} - {df['days_since_last_review'].max():.0f} days")
print(f"   - Missing values filled: {df['last_review'].isna().sum():,} listings")

📝 Creating review recency feature...
   - Range: 101 - 4831 days
   - Missing values filled: 1,961 listings
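Because the cell above measures recency against the run date, the feature values change with every run. A minimal sketch of the same calculation with a pinned reference date follows; the helper name `days_since` and the date `2024-01-11` are assumptions chosen for illustration, and the input is toy data.

```python
import pandas as pd

def days_since(dates, reference):
    """Days between each date and a fixed reference date.

    Missing dates are filled with the largest observed gap,
    mirroring the fill rule used in the notebook cell.
    """
    ref = pd.Timestamp(reference)
    parsed = pd.to_datetime(dates, errors="coerce")
    days = (ref - parsed).dt.days
    return days.fillna(days.max())

toy_dates = pd.Series(["2024-01-01", None, "2023-12-22"])
result = days_since(toy_dates, reference="2024-01-11")
print(result.tolist())
```

Filling with the maximum treats never-reviewed listings the same as the stalest reviewed one; a separate boolean flag for "no review" would keep the two cases distinguishable if that matters for the model.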


In [43]:
# Geographical clustering: create location-based categorical features
print("🌍 Creating geographical clusters...")
has_geo = {"latitude","longitude"}.issubset(df.columns)

if has_geo:
    coords = df[["latitude","longitude"]].dropna()
    # Choose the number of clusters with a size-based heuristic, clamped to the range 5-20
    k = min(20, max(5, len(coords)//3000))  # ~1 cluster per 3,000 listings; the floor of 5 applies here
    
    print(f"   - Valid coordinates: {len(coords):,} listings")
    print(f"   - Number of geo clusters: {k}")
    
    # Fit K-means clustering
    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    
    # Assign clusters (rows with NaN coords get "missing")
    df["geo_cluster"] = "missing"
    mask_geo = df[["latitude","longitude"]].notna().all(axis=1)
    df.loc[mask_geo, "geo_cluster"] = km.fit_predict(df.loc[mask_geo, ["latitude","longitude"]]).astype(str)
    
    print(f"   - Cluster distribution:")
    cluster_counts = df["geo_cluster"].value_counts().sort_index()
    for cluster, count in cluster_counts.head(10).items():  # Show first 10
        print(f"     • Cluster {cluster}: {count:,} listings")
    if len(cluster_counts) > 10:
        print(f"     • ... and {len(cluster_counts)-10} more clusters")
else:
    df["geo_cluster"] = "missing"
    print("   ⚠️ No geographical coordinates found - using 'missing' cluster")

🌍 Creating geographical clusters...
   - Valid coordinates: 8,858 listings
   - Number of geo clusters: 5
   - Cluster distribution:
     • Cluster 0: 2,874 listings
     • Cluster 1: 2,649 listings
     • Cluster 2: 531 listings
     • Cluster 3: 1,971 listings
     • Cluster 4: 833 listings
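The clustering above is fitted on all listings before the train/test split, so the cluster boundaries are shaped partly by rows that later land in the test set. One way to avoid that, sketched here on synthetic coordinates, is to fit K-means on the training rows only and assign test rows with `predict`. This is an alternative to what the notebook does, and the two blobs of points are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two synthetic groups of (latitude, longitude) points, roughly west and east Berlin
west = rng.normal(loc=[52.50, 13.30], scale=0.01, size=(100, 2))
east = rng.normal(loc=[52.52, 13.45], scale=0.01, size=(100, 2))
coords = np.vstack([west, east])

coords_train, coords_test = train_test_split(coords, test_size=0.2, random_state=42)

# Fit on training coordinates only, then assign test rows to the learned centroids
km = KMeans(n_clusters=2, random_state=42, n_init=10)
train_labels = km.fit_predict(coords_train)
test_labels = km.predict(coords_test)

print(len(train_labels), len(test_labels))
```

In the notebook this would mean moving the clustering step after `train_test_split`, or wrapping it in a transformer inside the pipeline so that cross-validation refits it per fold.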


## 🎯 Feature Set Definition & Data Preparation

With all features engineered, we now define our final feature set and prepare the data for machine learning models. This includes selecting relevant features, encoding categorical variables, and creating train/test splits.

In [44]:
# Define feature set for machine learning
print("🎯 Defining feature set...")

features = [
    "room_type",                        # Room type (entire home/private/shared)
    "neighbourhood_group",              # Borough/district
    "minimum_nights",                   # Booking requirements
    "number_of_reviews",                # Review volume (popularity indicator)
    "reviews_per_month",                # Review frequency (activity level)
    "calculated_host_listings_count",   # Host portfolio size
    "availability_365",                 # Annual availability
    "days_since_last_review",           # Recency of activity
    "geo_cluster",                      # Location cluster
]
target = "price"

print(f"   - Selected features: {len(features)}")
for i, feature in enumerate(features, 1):
    print(f"     {i:2d}. {feature}")
print(f"   - Target variable: {target}")

# Prepare modeling dataset by handling missing values
print("\n📊 Preparing modeling dataset...")

# Drop rows with missing values (except reviews_per_month which we'll impute)
df_model = df.dropna(subset=[c for c in features if c not in ["reviews_per_month"]]).copy()

# Simple imputation for reviews_per_month (0 = no recent activity)
if "reviews_per_month" in df_model.columns:
    df_model["reviews_per_month"] = df_model["reviews_per_month"].fillna(0)

# Create feature matrix and target vector
X = df_model[features].copy()
y = df_model[target].copy()

print(f"   - Final dataset shape: {X.shape}")
print(f"   - Target vector shape: {y.shape}")
print(f"   - Missing values per feature:")
for col in X.columns:
    missing = X[col].isnull().sum()
    print(f"     • {col}: {missing:,} ({missing/len(X)*100:.1f}%)")

print(f"\n   ✅ Dataset ready for modeling!")

🎯 Defining feature set...
   - Selected features: 9
      1. room_type
      2. neighbourhood_group
      3. minimum_nights
      4. number_of_reviews
      5. reviews_per_month
      6. calculated_host_listings_count
      7. availability_365
      8. days_since_last_review
      9. geo_cluster
   - Target variable: price

📊 Preparing modeling dataset...
   - Final dataset shape: (8858, 9)
   - Target vector shape: (8858,)
   - Missing values per feature:
     • room_type: 0 (0.0%)
     • neighbourhood_group: 0 (0.0%)
     • minimum_nights: 0 (0.0%)
     • number_of_reviews: 0 (0.0%)
     • reviews_per_month: 0 (0.0%)
     • calculated_host_listings_count: 0 (0.0%)
     • availability_365: 0 (0.0%)
     • days_since_last_review: 0 (0.0%)
     • geo_cluster: 0 (0.0%)

   ✅ Dataset ready for modeling!


## 🔀 Train-Test Split & Data Preprocessing

Before training models, we need to split our data and set up preprocessing pipelines for different feature types.

In [45]:
# Split data into training and testing sets
print("🔀 Creating train-test split...")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"   - Training set: {X_train.shape[0]:,} samples")
print(f"   - Test set: {X_test.shape[0]:,} samples")
print(f"   - Split ratio: {(1-0.2)*100:.0f}% train / {0.2*100:.0f}% test")

# Define feature types for preprocessing
print("\n🔧 Setting up preprocessing pipelines...")

numerical_features = [
    "minimum_nights",
    "number_of_reviews", 
    "reviews_per_month",
    "calculated_host_listings_count",
    "availability_365",
    "days_since_last_review"
]

categorical_features = [
    "room_type",
    "neighbourhood_group", 
    "geo_cluster"
]

print(f"   - Numerical features ({len(numerical_features)}):")
for feature in numerical_features:
    print(f"     • {feature}")

print(f"   - Categorical features ({len(categorical_features)}):")
for feature in categorical_features:
    print(f"     • {feature}")

# Create preprocessing pipeline
preproc = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numerical_features),                    # Scale numerical features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),  # One-hot encode categorical
])

print(f"\n   ✅ Preprocessing pipeline configured!")
print("      📊 Numerical: StandardScaler (z-score normalization)")
print("      🏷️  Categorical: OneHotEncoder (with unknown category handling)")

🔀 Creating train-test split...
   - Training set: 7,086 samples
   - Test set: 1,772 samples
   - Split ratio: 80% train / 20% test

🔧 Setting up preprocessing pipelines...
   - Numerical features (6):
     • minimum_nights
     • number_of_reviews
     • reviews_per_month
     • calculated_host_listings_count
     • availability_365
     • days_since_last_review
   - Categorical features (3):
     • room_type
     • neighbourhood_group
     • geo_cluster

   ✅ Preprocessing pipeline configured!
      📊 Numerical: StandardScaler (z-score normalization)
      🏷️  Categorical: OneHotEncoder (with unknown category handling)
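To make the effect of the preprocessing step concrete, here is a small sketch of a `ColumnTransformer` with the same two transformer types applied to a toy frame. The data is invented, and `sparse_threshold=0` is added only so the example always returns a dense array that is easy to inspect; the notebook's `preproc` does not set it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 30],
    "room_type": ["Entire home/apt", "Private room", "Private room", "Shared room"],
})

pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["minimum_nights"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    ],
    sparse_threshold=0,  # assumption for readability: force dense output
)

# One scaled numeric column plus one indicator column per observed room type
encoded = pre.fit_transform(toy)

# A category not seen during fit is encoded as all zeros instead of raising an error
new = pd.DataFrame({"minimum_nights": [2], "room_type": ["Hotel room"]})
new_encoded = pre.transform(new)

print(encoded.shape, new_encoded.shape)
```

The all-zero encoding is what `handle_unknown="ignore"` buys: a test-set category that never appeared in training does not break prediction, although the model then has no category signal for that row.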


## 🤖 Model Configuration & Evaluation Setup

We'll compare three different machine learning algorithms to find the best approach for price prediction.

In [46]:
# Configure machine learning models for comparison
print("🤖 Configuring machine learning models...")

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(
        n_estimators=600,           # More trees for better performance
        max_depth=None,             # Allow deep trees
        min_samples_leaf=2,         # Prevent overfitting
        random_state=42,
        n_jobs=-1                   # Use all CPU cores
    ),
    "HistGradientBoosting": HistGradientBoostingRegressor(
        learning_rate=0.08,         # Conservative learning rate
        max_depth=8,                # Moderate tree depth
        max_iter=400,               # Number of boosting iterations
        l2_regularization=0.0,      # No L2 regularization
        random_state=42
    ),
}

print(f"   - Models configured: {len(models)}")
for name, model in models.items():
    print(f"     • {name}: {type(model).__name__}")

# Define evaluation functions
print("\n📊 Setting up evaluation framework...")

def eval_holdout(model, use_log=False):
    """Evaluate model on holdout test set."""
    # Apply log transformation if requested
    y_tr = np.log1p(y_train) if use_log else y_train
    
    # Create and fit pipeline
    pipe = Pipeline([("preproc", preproc), ("model", model)])
    pipe.fit(X_train, y_tr)
    
    # Make predictions
    pred = pipe.predict(X_test)
    if use_log:  # Back-transform from log space
        pred = np.expm1(pred)
    
    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    mae = mean_absolute_error(y_test, pred)
    r2 = r2_score(y_test, pred)
    
    return rmse, mae, r2

def eval_cv(model, use_log=False, cv=5):
    """Cross-validated evaluation (more robust estimate)."""
    # Apply log transformation if requested
    y_all = np.log1p(y) if use_log else y
    
    # Create pipeline
    pipe = Pipeline([("preproc", preproc), ("model", model)])
    cv_split = KFold(n_splits=cv, shuffle=True, random_state=42)

    # Get out-of-fold predictions
    preds_log = cross_val_predict(pipe, X, y_all, cv=cv_split, n_jobs=-1)
    preds = np.expm1(preds_log) if use_log else preds_log

    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(y, preds))
    mae = mean_absolute_error(y, preds)
    r2 = r2_score(y, preds)
    
    return rmse, mae, r2

print("   ✅ Evaluation functions ready!")
print("      🎯 Holdout evaluation: Train on 80%, test on 20%")
print("      🔄 Cross-validation: 5-fold CV for robust estimates")
print("      📈 Metrics: RMSE (error), MAE (absolute error), R² (variance explained)")

🤖 Configuring machine learning models...
   - Models configured: 3
     • LinearRegression: LinearRegression
     • RandomForest: RandomForestRegressor
     • HistGradientBoosting: HistGradientBoostingRegressor

📊 Setting up evaluation framework...
   ✅ Evaluation functions ready!
      🎯 Holdout evaluation: Train on 80%, test on 20%
      🔄 Cross-validation: 5-fold CV for robust estimates
      📈 Metrics: RMSE (error), MAE (absolute error), R² (variance explained)
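For the log-target runs, both evaluation functions back-transform predictions with `expm1` before scoring, so RMSE and MAE stay in euros and are comparable with the raw-target runs. The sketch below spells out the three metrics by hand on a toy example; `regression_metrics` is a hypothetical helper and the numbers are invented.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))                        # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

prices = np.array([50.0, 80.0, 120.0, 200.0])

# Pretend a model trained on log1p(price) produced these log-space predictions
log_pred = np.log1p(np.array([60.0, 80.0, 100.0, 180.0]))

# Back-transform to euros before scoring, as eval_holdout and eval_cv do
rmse, mae, r2 = regression_metrics(prices, np.expm1(log_pred))
print(round(rmse, 2), round(mae, 2), round(r2, 3))
```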


## 🏃‍♀️ Model Training & Performance Evaluation

Time to train our models and compare their performance using multiple evaluation strategies.

In [None]:
# Train and evaluate all models
print("🏃‍♀️ Training and evaluating models...")
print("=" * 80)

# Store results for comparison
rows = []

for name, mdl in models.items():
    print(f"\n🔥 Training {name}...")
    
    # Evaluate with different strategies
    rmse_h, mae_h, r2_h = eval_holdout(mdl, use_log=False)           # Holdout - raw prices
    rmse_hl, mae_hl, r2_hl = eval_holdout(mdl, use_log=True)        # Holdout - log prices  
    rmse_cv, mae_cv, r2_cv = eval_cv(mdl, use_log=True, cv=5)       # Cross-validation - log prices

    # Display results
    print(f"   📊 Holdout (Raw):  RMSE={rmse_h:6.2f}€  MAE={mae_h:6.2f}€  R²={r2_h:6.3f}")
    print(f"   📊 Holdout (Log):  RMSE={rmse_hl:6.2f}€  MAE={mae_hl:6.2f}€  R²={r2_hl:6.3f}")
    print(f"   📊 5-Fold CV:      RMSE={rmse_cv:6.2f}€  MAE={mae_cv:6.2f}€  R²={r2_cv:6.3f}")

    # Store for summary table
    rows.append({
        "model": name,
        "holdout_rmse_raw": rmse_h, "holdout_mae_raw": mae_h, "holdout_r2_raw": r2_h,
        "holdout_rmse_log": rmse_hl, "holdout_mae_log": mae_hl, "holdout_r2_log": r2_hl,
        "cv5_rmse_log": rmse_cv, "cv5_mae_log": mae_cv, "cv5_r2_log": r2_cv,
    })

# Create results summary
print("\n" + "=" * 80)
print("📋 FINAL RESULTS SUMMARY")
print("=" * 80)

res_df = pd.DataFrame(rows).set_index("model").sort_values("cv5_rmse_log")
display(res_df.round(4))

# Identify best model
best_model = res_df.index[0]
best_rmse = res_df.loc[best_model, "cv5_rmse_log"]
best_r2 = res_df.loc[best_model, "cv5_r2_log"]

print(f"\n🏆 WINNER: {best_model}")
print(f"   📈 Cross-validation RMSE: {best_rmse:.2f}€")
print(f"   📈 Cross-validation R²: {best_r2:.3f}")
print(f"   💡 Explains {best_r2*100:.1f}% of price variance")

# Save results
out_path = OUT_DIR / "model_results_manual_v2.csv"
res_df.to_csv(out_path)
print(f"\n💾 Results saved to: {out_path}")

Training and evaluating models...
Created 3 fresh model instances

Training LinearRegression...
   Holdout (Raw):  RMSE= 63.23 EUR  MAE= 46.71 EUR  R2= 0.223
   Holdout (Log):  RMSE= 62.11 EUR  MAE= 43.75 EUR  R2= 0.251
   5-Fold CV:      RMSE= 63.42 EUR  MAE= 44.40 EUR  R2= 0.261

Training RandomForest...
   Holdout (Raw):  RMSE= 55.93 EUR  MAE= 40.59 EUR  R2= 0.392
   Holdout (Log):  RMSE= 55.96 EUR  MAE= 38.79 EUR  R2= 0.392
   5-Fold CV:      RMSE= 56.72 EUR  MAE= 38.85 EUR  R2= 0.409

Training HistGradientBoosting...
   Holdout (Raw):  RMSE= 56.48 EUR  MAE= 41.55 EUR  R2= 0.380
   Holdout (Log):  RMSE= 56.60 EUR  MAE= 39.37 EUR  R2= 0.378
   5-Fold CV:      RMSE= 57.58 EUR  MAE= 39.70 EUR  R2= 0.391

Completed training 3 models

FINAL RESULTS SUMMARY

| model | holdout_rmse_raw | holdout_mae_raw | holdout_r2_raw | holdout_rmse_log | holdout_mae_log | holdout_r2_log | cv5_rmse_log | cv5_mae_log | cv5_r2_log |
|---|---|---|---|---|---|---|---|---|---|
| RandomForest | 55.9298 | 40.5935 | 0.3925 | 55.9588 | 38.7936 | 0.3918 | 56.7172 | 38.8462 | 0.409 |
| HistGradientBoosting | 56.484 | 41.547 | 0.3804 | 56.6036 | 39.3674 | 0.3778 | 57.5775 | 39.6988 | 0.3909 |
| LinearRegression | 63.2346 | 46.7058 | 0.2234 | 62.1074 | 43.7476 | 0.2509 | 63.4181 | 44.4003 | 0.2611 |



WINNER: RandomForest
   Cross-validation RMSE: 56.72 EUR
   Cross-validation R2: 0.409
   Explains 40.9% of price variance

Results saved to: C:\Users\seewi\Projects\AirBnB-Berlin\output\model_results_manual_v2.csv


## 🎯 Conclusions & Key Insights

### Model Performance Summary
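The pipeline produces working price models for Berlin Airbnb listings, but their accuracy is modest: the best model explains about 41% of the variance in prices (capped at €400), with a typical error of roughly €39 (MAE) to €57 (RMSE). Key findings:

### 📊 Best Performing Model
Random Forest ranked first on 5-fold cross-validation (RMSE €56.72, MAE €38.85, R² 0.409), narrowly ahead of HistGradientBoosting (RMSE €57.58, R² 0.391) and well ahead of Linear Regression (RMSE €63.42, R² 0.261). The holdout results show the same ordering.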

### 🔍 Feature Importance Insights
This notebook does not compute feature importances, so the points below are hypotheses about what the selected features capture rather than measured effects:
- **Room Type**: Entire homes command higher prices than private/shared rooms
- **Location**: Geographical clustering reveals significant neighborhood effects
- **Host Activity**: Professional hosts with multiple listings show different pricing patterns
- **Booking Requirements**: Minimum nights and availability impact pricing strategies
- **Review Patterns**: Review volume and recency indicate listing popularity and activity
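One way to put numbers behind claims like these is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below uses scikit-learn's `permutation_importance` on synthetic data with one informative and one noise feature. The data and column names are invented, and nothing here was run on the Berlin dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic data: "signal" drives the target, "noise" does not
X_toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_toy = 50 + 20 * X_toy["signal"] + rng.normal(scale=1.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Mean drop in R^2 on the held-out rows when each column is shuffled
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)
importance = pd.Series(result.importances_mean, index=X_toy.columns).sort_values(ascending=False)
print(importance)
```

Applied to the notebook, the fitted `Pipeline` and `X_test`, `y_test` could be passed in directly, which would give one importance value per original column (for example `room_type` as a whole rather than per one-hot indicator).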

### 🛠️ Technical Approach
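- **Feature Engineering**: Created location clusters (K-means on latitude/longitude) and a review recency metric (`days_since_last_review`); host portfolio size comes from the existing `calculated_host_listings_count` column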
- **Data Preprocessing**: Handled missing values, scaled numerical features, and encoded categorical variables
- **Model Comparison**: Evaluated Linear Regression, Random Forest, and Gradient Boosting algorithms
- **Robust Validation**: Used both holdout testing and cross-validation with log-transformed targets
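The notebook applies `log1p` and `expm1` by hand inside its evaluation functions. As an alternative, scikit-learn's `TransformedTargetRegressor` can wrap a regressor so the target transform and its inverse are applied automatically. The sketch below shows the idea on synthetic data; it is not the approach used above, and the toy data is invented.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, right-skewed target: linear in log space
X_toy = rng.uniform(0, 1, size=(200, 1))
y_toy = np.exp(3.0 + 1.5 * X_toy[:, 0] + rng.normal(scale=0.1, size=200))

# The wrapper fits the regressor on log1p(y) and returns expm1(prediction)
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_toy, y_toy)
pred = log_model.predict(X_toy)  # already back on the original scale

print(pred[:3].round(1))
```

With this wrapper as the final pipeline step, scoring utilities receive predictions in euros without a separate `use_log` branch.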

### 💡 Business Applications
This model can be used for:
1. **Dynamic Pricing**: Help hosts optimize their listing prices
2. **Market Analysis**: Understand pricing trends across Berlin neighborhoods  
3. **Investment Decisions**: Evaluate potential returns for new properties
4. **Platform Optimization**: Improve Airbnb's pricing recommendations

### 🔄 Future Improvements
Potential enhancements could include:
- Seasonal price variations and time-series analysis
- Additional amenities and property features
- External data sources (transport links, attractions, events)
- Deep learning models for complex feature interactions