# Data Modeling

**Goal:** Train, evaluate, tune, and select the best machine learning model to predict the log-transformed price (`log_price`) of Prague Airbnb listings as accurately as possible, based on the prepared dataset.

## 1. Setup and Baselines
*   Import necessary libraries for modeling and evaluation.
*   Load the processed and scaled training/testing data splits.
*   Load the saved scaler and feature list.
*   Define evaluation metrics and a helper function for reporting on the original price scale.
*   Establish baseline performance metrics (mean prediction and simple Linear Regression).

### Import Libraries
Import libraries for data handling, modeling algorithms, evaluation metrics, cross-validation, and loading saved objects.

In [2]:
# Data Handling
import pandas as pd
import numpy as np
import joblib # For loading scaler/feature list

# Modeling Algorithms
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.dummy import DummyRegressor # For mean baseline

# Evaluation Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Cross-validation and Tuning
from sklearn.model_selection import KFold, cross_val_score, RandomizedSearchCV 
# from sklearn.model_selection import GridSearchCV # Alternative tuner

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x) # More precision for metrics

print("Libraries imported for modeling.")

Libraries imported for modeling.


### Load Processed Data and Preprocessors
Load the scaled training and testing datasets (`X_train_scaled`, `X_test_scaled`, `y_train`, `y_test`) saved at the end of Data Preparation. Also load the fitted `StandardScaler` and the final feature list.

In [3]:
# Define paths (relative to notebooks/ directory)
data_path = '../data/processed/'
model_path = '../model/'

try:
    # Load features and target
    X_train_scaled = pd.read_parquet(data_path + 'X_train_scaled.parquet')
    X_test_scaled = pd.read_parquet(data_path + 'X_test_scaled.parquet')
    y_train = pd.read_parquet(data_path + 'y_train.parquet')['log_price'] # Extract Series
    y_test = pd.read_parquet(data_path + 'y_test.parquet')['log_price'] # Extract Series
    
    # Load preprocessors
    scaler = joblib.load(model_path + 'standard_scaler.joblib')
    final_features = joblib.load(model_path + 'final_feature_list.joblib')

    print("Processed data and preprocessors loaded successfully.")
    
    # Verify shapes and columns
    print(f"X_train_scaled shape: {X_train_scaled.shape}")
    print(f"X_test_scaled shape : {X_test_scaled.shape}")
    print(f"y_train shape: {y_train.shape}")
    print(f"y_test shape : {y_test.shape}")
    
    # Check if columns match saved list
    if list(X_train_scaled.columns) == final_features:
        print("\nTrain data columns match the saved feature list.")
    else:
        print("\nWarning: Train data columns mismatch saved feature list!")
        
    # Display head of loaded data
    print("\nHead of X_train_scaled:")
    display(X_train_scaled.head(3))
    print("\nHead of y_train:")
    display(y_train.head(3))
    
except FileNotFoundError as e:
    print(f"Error loading data: {e}. Make sure preprocessing steps were run and files saved correctly.")
    # Set variables to None to prevent errors in subsequent cells
    X_train_scaled, X_test_scaled, y_train, y_test, scaler, final_features = [None]*6
except Exception as e:
    print(f"An error occurred during loading: {e}")
    X_train_scaled, X_test_scaled, y_train, y_test, scaler, final_features = [None]*6

Processed data and preprocessors loaded successfully.
X_train_scaled shape: (7014, 66)
X_test_scaled shape : (1754, 66)
y_train shape: (7014,)
y_test shape : (1754,)

Train data columns match the saved feature list.

Head of X_train_scaled:


Unnamed: 0,host_acceptance_rate,host_identity_verified,latitude,longitude,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed,num_host_verifications,host_duration_days,days_since_last_review,has_reviews,calculated_host_listings_count_shared_rooms_log,bathrooms_log,calculated_host_listings_count_private_rooms_log,host_acceptance_rate_log,beds_log,days_since_last_review_log,number_of_reviews_log,bedrooms_log,reviews_per_month_log,accommodates_log,calculated_host_listings_count_entire_homes_log,number_of_reviews_ltm_log,host_response_time_Unknown,host_response_time_days_or_more,host_response_time_within_day,host_response_time_within_hour,host_response_time_within_hours,room_type_Entire_home/apt,room_type_Hotel_room,room_type_Private_room,room_type_Shared_room,neighbourhood_group_Near_Center_East,neighbourhood_group_Near_Center_West_South,neighbourhood_group_New_Town_Vinohrady,neighbourhood_group_North_West_Districts,neighbourhood_group_Old_Town_Center,neighbourhood_group_Outer_Districts,property_type_freq
7040,0.31773,0.12796,-1.00211,-0.14625,-0.43309,-0.39264,-0.39836,-0.23443,0.11789,-0.34667,-0.92162,-0.38376,-0.39253,0.4102,0.70862,0.67324,0.62997,-1.27978,-0.34365,-0.28196,-0.0938,0.1047,0.76782,0.11142,0.27675,0.49447,0.27406,0.47563,0.33138,0.57922,0.83073,-0.15181,-0.63003,0.84602,2.14201,1.30126,-0.34586,0.30193,-0.11009,-0.47959,-0.45942,0.30163,-0.07051,-0.68492,0.1638,-0.29696,0.42263,-0.33622,0.17524,0.79439,-0.25524,-0.12387,-0.21395,0.46501,-0.24755,0.42475,-0.0921,-0.39484,-0.09821,-0.50049,-0.31222,2.14633,-0.28823,-0.76678,-0.30585,0.80065
7863,0.37241,0.12796,0.22597,-0.05502,0.73766,-0.39264,-0.39836,0.21767,-0.34522,-1.18489,-1.24191,-1.46549,-0.22562,1.27716,0.57742,0.79933,0.62997,0.78139,-0.60287,-0.28196,-0.0938,1.62976,0.84291,0.11142,0.27675,-2.02237,0.27406,0.47563,0.33138,0.57922,0.83073,-0.15181,-0.63003,0.84602,-0.13436,-1.61776,-0.35023,0.30193,-0.11009,-0.47959,-0.45942,0.34406,0.5413,-1.455,0.41736,-0.29696,1.46903,0.96261,-0.6508,1.13384,-0.25524,-0.12387,-0.21395,0.46501,-0.24755,0.42475,-0.0921,-0.39484,-0.09821,-0.50049,-0.31222,-0.46591,-0.28823,1.30415,-0.30585,-1.17634
7652,0.31773,0.12796,-0.37276,-0.7,1.90841,1.84332,1.78027,1.12185,0.11789,-0.34667,-0.60133,-0.11105,-0.67626,-0.85084,0.80701,-3.40372,-2.29712,-1.27978,0.00199,-0.28196,-0.0938,-0.96671,0.39238,0.11142,0.27675,0.49447,0.27406,0.47563,0.33138,0.57922,0.83073,-0.15181,1.58724,0.84602,-0.13436,1.1199,-0.23744,0.30193,-0.11009,2.25335,-0.45942,0.30163,1.4036,0.76876,-1.49494,1.67274,-1.30581,1.79043,0.65344,-1.13301,-0.25524,-0.12387,-0.21395,0.46501,-0.24755,0.42475,-0.0921,-0.39484,-0.09821,-0.50049,3.20288,-0.46591,-0.28823,-0.76678,-0.30585,0.80065



Head of y_train:


7040   7.51098
7863   7.09755
7652   8.41627
Name: log_price, dtype: float64
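
The column check in the loading cell only reports that a mismatch exists, not where it is. A minimal sketch of how such a mismatch could be diagnosed follows; `diff_columns` is a hypothetical helper that is not part of this notebook, and the column lists are shortened, illustrative examples.

```python
def diff_columns(actual, expected):
    """Summarize how two column lists differ (hypothetical helper)."""
    actual, expected = list(actual), list(expected)
    return {
        # Columns the saved feature list expects but the data lacks
        "missing": [c for c in expected if c not in actual],
        # Columns present in the data but absent from the saved list
        "unexpected": [c for c in actual if c not in expected],
        # Same names in a different order still fail the equality check
        "order_differs": sorted(actual) == sorted(expected) and actual != expected,
    }

# Illustrative column names borrowed from the feature list above
report = diff_columns(
    actual=["latitude", "longitude", "beds"],
    expected=["latitude", "beds", "longitude", "bathrooms"],
)
print(report)
```

In the notebook this could be called as `diff_columns(X_train_scaled.columns, final_features)` inside the `else` branch of the check.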

### Define Evaluation Metrics & Function
Define the primary metric for cross-validation (`neg_root_mean_squared_error` on log scale) and a helper function to evaluate performance on the original price scale (RMSE, MAE, R²).

In [12]:
# M1.3 Define Evaluation Metrics & Function

# Primary CV metric: RMSE on the log scale (lower is better).
# scikit-learn maximizes scores, so this scorer returns the negated RMSE.
CV_SCORING = 'neg_root_mean_squared_error'

# Helper function to evaluate on original price scale
def evaluate_on_original_scale(y_true_log, y_pred_log, model_name="Model"):
    """Calculates RMSE, MAE, R2 on the original price scale."""
    
    # Inverse transform from log scale
    y_true_orig = np.expm1(y_true_log)
    y_pred_orig = np.expm1(y_pred_log)
    
    # Handle potential negative predictions after expm1
    y_pred_orig[y_pred_orig < 0] = 0 
    
    # Calculate metrics using MSE + sqrt
    mse = mean_squared_error(y_true_orig, y_pred_orig) 
    rmse = np.sqrt(mse) # Calculate RMSE manually
    mae = mean_absolute_error(y_true_orig, y_pred_orig)
    r2 = r2_score(y_true_orig, y_pred_orig)
    
    print(f"--- {model_name} Performance (Original Price Scale) ---")
    print(f"RMSE: {rmse:.2f}") 
    print(f"MAE:  {mae:.2f}")  
    print(f"R^2:  {r2:.3f}")  
    print("----------------------------------------------------")
    
    return {'RMSE': rmse, 'MAE': mae, 'R2': r2}

print("Evaluation metric and helper function defined.")

Evaluation metric and helper function defined.


*Observation:* The primary scoring metric for cross-validation (`neg_root_mean_squared_error`) is set. A helper function `evaluate_on_original_scale` was created to easily calculate and report RMSE, MAE, and R² on the original price scale by applying `np.expm1` to the log-scale predictions and true values.
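
The round trip that `evaluate_on_original_scale` relies on can be checked in isolation. The sketch below assumes the target was created with `np.log1p(price)` during Data Preparation; that step is not shown here, but the use of `np.expm1` implies it. The prices are made-up values, not rows from the dataset.

```python
import numpy as np

# Hypothetical nightly prices (illustrative values only)
prices = np.array([800.0, 1500.0, 4500.0])

# Forward transform assumed to have been applied in Data Preparation
log_prices = np.log1p(prices)

# Inverse transform used by evaluate_on_original_scale
recovered = np.expm1(log_prices)
assert np.allclose(recovered, prices)

# A constant error of 0.1 on the log scale is a constant *relative* error
# in (price + 1), regardless of how expensive the listing is
pred_prices = np.expm1(log_prices + 0.1)
relative_error = (pred_prices + 1) / (prices + 1) - 1
print(relative_error)  # every entry is approximately exp(0.1) - 1
```

This is why a log-scale RMSE reads roughly as a typical relative error, while the original-scale RMSE is dominated by the most expensive listings.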

### Establish Baseline Performance
Calculate baseline metrics using two simple strategies: predicting the mean and using basic Linear Regression with cross-validation.

In [13]:
# M1.4 Establish Baseline Performance 
if 'y_train' in locals() and 'y_test' in locals() and y_train is not None and y_test is not None:

    # --- Baseline 1: Mean Prediction ---
    print("--- Baseline 1: Predicting Mean ---")
    mean_log_price_train = y_train.mean()
    y_pred_mean_log = np.full_like(y_test, fill_value=mean_log_price_train)
    
    print(f"Predicting constant mean log_price: {mean_log_price_train:.3f}")
    # Report the constant prediction on the original price scale
    baseline_mean_results = evaluate_on_original_scale(y_test, y_pred_mean_log, model_name="Mean Baseline") 

    # --- Baseline 2: Linear Regression (Cross-Validated) ---
    print("\n--- Baseline 2: Linear Regression (5-Fold CV) ---")
    cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)
    lr_baseline = LinearRegression()
    
    # With CV_SCORING='neg_root_mean_squared_error', cross_val_score returns negated log-scale RMSE per fold
    lr_cv_scores = cross_val_score(
        lr_baseline, 
        X_train_scaled, 
        y_train, 
        cv=cv_strategy, 
        scoring=CV_SCORING,
        n_jobs=-1 
    )
    
    mean_cv_rmse_log = -lr_cv_scores.mean() 
    std_cv_rmse_log = lr_cv_scores.std()
    
    print(f"Mean CV RMSE (log scale): {mean_cv_rmse_log:.4f} (+/- {std_cv_rmse_log:.4f})")
    print("(Lower RMSE is better)")

else:
    print("Error: Training/testing data not available for baseline calculation.")

--- Baseline 1: Predicting Mean ---
Predicting constant mean log_price: 7.539
--- Mean Baseline Performance (Original Price Scale) ---
RMSE: 4833.50
MAE:  1386.85
R^2:  -0.022
----------------------------------------------------

--- Baseline 2: Linear Regression (5-Fold CV) ---
Mean CV RMSE (log scale): 0.4466 (+/- 0.0202)
(Lower RMSE is better)
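
`DummyRegressor` is imported in the setup cell, while the mean baseline above is built by hand with `np.full_like`. As a sketch under stated assumptions, the same constant prediction can be expressed with the estimator. The example uses synthetic log-normal "prices" rather than the Airbnb data. It also illustrates one reason a mean-of-logs baseline can score a slightly negative R² on the original scale: back-transforming the mean log price gives a value below the arithmetic mean price.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)

# Synthetic right-skewed "prices" (illustrative only, not the Airbnb data)
prices = rng.lognormal(mean=7.5, sigma=0.6, size=1000)
y_log = np.log1p(prices)
X = np.zeros((len(y_log), 1))  # DummyRegressor ignores the features

# Constant prediction: the mean of the log-scale target
dummy = DummyRegressor(strategy="mean").fit(X, y_log)
pred_log = dummy.predict(X)
assert np.allclose(pred_log, y_log.mean())

# On the log scale, the mean predictor scores R^2 = 0 on the data it was fit on
r2_log = r2_score(y_log, pred_log)

# Back-transformed, the constant lies below the arithmetic mean price,
# so R^2 on the original scale falls below zero
r2_orig = r2_score(prices, np.expm1(pred_log))
print(f"R^2 log scale: {r2_log:.3f}, original scale: {r2_orig:.3f}")
```

Because `DummyRegressor` follows the estimator API, it could also be passed to `cross_val_score` alongside the other models.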


## 2. Candidate Model Training & Evaluation (Initial CV)

In this phase, we train several regression model families with their default settings and evaluate them with cross-validation on the training data. The goal is an initial estimate of how well each family performs on this dataset, so that we can identify promising candidates for further tuning. We will:
*   Select a diverse set of candidate models (Linear, Tree Ensembles).
*   Use K-Fold cross-validation to evaluate each model's performance based on Root Mean Squared Error (RMSE) on the log-price scale.
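
The scoring convention used throughout this section can be sketched on a small synthetic problem. The data and coefficients below are made up for illustration; only the `KFold` settings and the scorer name match the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression problem (illustrative only)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=cv, scoring="neg_root_mean_squared_error",
)

# scikit-learn scorers follow "higher is better", so RMSE comes back negated
assert (scores <= 0).all()

rmse_per_fold = -scores  # flip the sign to recover the RMSE of each fold
print(f"Mean CV RMSE: {rmse_per_fold.mean():.4f} (+/- {rmse_per_fold.std():.4f})")
```

The standard deviation is unaffected by the sign flip, which is why it can be taken from either the raw scores or the negated ones.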

### Select Candidate Models
Define a dictionary containing instances of the regression models we want to evaluate initially. We include regularized linear models and popular tree-based ensembles.

In [14]:
# M2.1 Select Candidate Models
if ('LinearRegression' not in locals() or # Check if imports might have been cleared
    'Ridge' not in locals() or 
    'Lasso' not in locals() or 
    'RandomForestRegressor' not in locals() or 
    'XGBRegressor' not in locals() or 
    'LGBMRegressor' not in locals()):
    # Re-import if necessary (e.g., if running notebook non-linearly)
    print("Re-importing model classes...")
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    from sklearn.ensemble import RandomForestRegressor
    from xgboost import XGBRegressor
    from lightgbm import LGBMRegressor

# Define models to evaluate with default parameters
# Use random_state for models that have it for reproducibility
models_to_evaluate = {
    # "LinearRegression": LinearRegression(), # Already used as baseline M1.4
    "Ridge": Ridge(random_state=42),
    "Lasso": Lasso(random_state=42),
    "RandomForest": RandomForestRegressor(random_state=42, n_jobs=-1), # Use n_jobs=-1 for speed
    "XGBoost": XGBRegressor(random_state=42, n_jobs=-1),
    "LightGBM": LGBMRegressor(random_state=42, n_jobs=-1, verbosity=-1) # Suppress LightGBM verbosity
}

print(f"Selected {len(models_to_evaluate)} candidate models for initial evaluation:")
print(list(models_to_evaluate.keys()))

Selected 5 candidate models for initial evaluation:
['Ridge', 'Lasso', 'RandomForest', 'XGBoost', 'LightGBM']


*Observation:* A dictionary `models_to_evaluate` was created containing instances of Ridge, Lasso, RandomForestRegressor, XGBRegressor, and LGBMRegressor, using default hyperparameters and fixed random states where applicable. This covers two regularized linear models and three tree-based ensembles (one bagging method and two gradient boosting methods).

### Initial Cross-Validation
Perform 5-Fold Cross-Validation for each candidate model on the scaled training data (`X_train_scaled`, `y_train`) using negative RMSE as the scoring metric. Record and display the mean and standard deviation of the scores.

In [15]:
# M2.2 Initial Cross-Validation
if ('models_to_evaluate' in locals() and 'X_train_scaled' in locals() and 'y_train' in locals() and
    X_train_scaled is not None and y_train is not None and models_to_evaluate):
    
    # Define CV strategy (can reuse from baseline)
    if 'cv_strategy' not in locals(): # Define if not already present from M1.4
        cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)
        print("Defined KFold CV strategy (5 splits, shuffled).")
    else:
        print("Using existing KFold CV strategy (5 splits, shuffled).")

    # Store results
    cv_results = {}
    
    print("\nPerforming 5-Fold Cross-Validation for each model...")

    for model_name, model in models_to_evaluate.items():
        print(f"  Evaluating {model_name}...")
        try:
            # Perform cross-validation
            # Scores are negative RMSE on log scale
            cv_scores = cross_val_score(
                model, 
                X_train_scaled, 
                y_train, 
                cv=cv_strategy, 
                scoring=CV_SCORING, # Defined in M1.3 as 'neg_root_mean_squared_error'
                n_jobs=-1 # Use all available CPU cores
            )
            
            # Store mean and std dev (convert neg_rmse back to positive rmse)
            mean_rmse = -cv_scores.mean()
            std_rmse = cv_scores.std()
            cv_results[model_name] = {'mean_rmse': mean_rmse, 'std_rmse': std_rmse, 'scores': -cv_scores}
            
            print(f"    Mean CV RMSE (log scale): {mean_rmse:.4f} (+/- {std_rmse:.4f})")
            
        except Exception as e:
            print(f"    Error evaluating {model_name}: {e}")
            cv_results[model_name] = {'mean_rmse': np.inf, 'std_rmse': np.nan, 'scores': []} # Record error

    print("\nInitial Cross-Validation complete.")

    # Display results sorted by mean RMSE (lower is better)
    if cv_results:
        print("\n--- Initial Model Performance Comparison (Mean CV RMSE on Log Scale) ---")
        results_df = pd.DataFrame(cv_results).T[['mean_rmse', 'std_rmse']].sort_values('mean_rmse')
        display(results_df)
        print("-----------------------------------------------------------------------")

else:
    print("Error: Prerequisite variables (models, data) not found or empty.")

Using existing KFold CV strategy (5 splits, shuffled).

Performing 5-Fold Cross-Validation for each model...
  Evaluating Ridge...
    Mean CV RMSE (log scale): 0.4466 (+/- 0.0202)
  Evaluating Lasso...
    Mean CV RMSE (log scale): 0.6663 (+/- 0.0152)
  Evaluating RandomForest...
    Mean CV RMSE (log scale): 0.3996 (+/- 0.0083)
  Evaluating XGBoost...
    Mean CV RMSE (log scale): 0.3883 (+/- 0.0155)
  Evaluating LightGBM...
    Mean CV RMSE (log scale): 0.3769 (+/- 0.0143)

Initial Cross-Validation complete.

--- Initial Model Performance Comparison (Mean CV RMSE on Log Scale) ---


| Model | mean_rmse | std_rmse |
|---|---|---|
| LightGBM | 0.3769 | 0.01435 |
| XGBoost | 0.38833 | 0.01546 |
| RandomForest | 0.3996 | 0.00832 |
| Ridge | 0.44656 | 0.02021 |
| Lasso | 0.66625 | 0.01515 |


-----------------------------------------------------------------------
