**Table of contents**<a id='toc0_'></a>    
- [NYC Taxi Duration Prediction Pipeline](#toc1_)    
  - [Project Overview](#toc1_1_)    
  - [Environment Setup & Ingestion](#toc1_2_)    
    - [Imports and Setup](#toc1_2_1_)    
    - [Global Configuration (The SSoT Layer)](#toc1_2_2_)    
    - [Experiment Tracking Setup (MLflow)](#toc1_2_3_)    
    - [Data Ingestion Engine](#toc1_2_4_)    
  - [Data Engineering & Preprocessing](#toc1_3_)    
    - [Memory Management (SSoT for Data Types)](#toc1_3_1_)    
  - [Data Transformation (Vectorization)](#toc1_4_)    
  - [Baseline Modeling & Persistence](#toc1_5_)    
    - [Model Persistence (Universal Serialization)](#toc1_5_1_)    
  - [Model Validation (Out-of-Sample)](#toc1_6_)    
  - [Experimentation with Advanced Models (XGBoost)](#toc1_7_)    
    - [Exporting the Advanced Model (XGBoost)](#toc1_7_1_)    
    - [Evaluation Analysis (November Validation Set)](#toc1_7_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[NYC Taxi Duration Prediction Pipeline](#toc0_)
**Author:** Ali Ahmed  
**Role:** Associate ML/MLOps Engineer  
**Contact:** [📧 Email](mailto:ali.ahmed.nour14@gmail.com) | [📱 Phone](tel:+201007871314) | [🔗 LinkedIn](https://www.linkedin.com/in/ali-ahmed-nour/)

**Status:** Development / Production-Ready Simulation

---

## <a id='toc1_1_'></a>[Project Overview](#toc0_)
This project implements a professional data pipeline designed with MLOps best practices:
* **Automation Ready:** Modular code structure prepared for orchestration.
* **Data Versioning Support:** Tiered storage (raw/processed) for better data lineage.
* **Portability:** Environment-agnostic path management for seamless deployment.

## <a id='toc1_2_'></a>[Environment Setup & Ingestion](#toc0_)
*(Phase 1: Preparing tools, configurations, and fetching raw data)*

### <a id='toc1_2_1_'></a>[Imports and Setup](#toc0_)
Library Initialization: Imports the required dependencies, including MLflow, XGBoost, and scikit-learn, up front so that missing packages surface before any pipeline step runs.

In [6]:
# Standard library imports
from pathlib import Path
import pickle
from datetime import datetime
from typing import TypedDict, List, Dict, Any, cast

# Third-party library imports
import mlflow
import pandas as pd
import xgboost as xgb
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

# Setting pandas display options for professional logging
pd.options.display.max_columns = None



### <a id='toc1_2_2_'></a>[Global Configuration (The SSoT Layer)](#toc0_)
Global Configuration Layer: Establishment of a centralized Single Source of Truth (SSoT) to manage environment-specific variables, directory structures, and feature schemas.

* **Environment Agnostic:** My code is designed to work seamlessly across Linux, macOS, and Windows.
* **Scalable & Portable:** All paths (raw, processed, and model artifacts) as well as feature schemas are managed through a central configuration object. This allows for easy updates for different taxi types or time periods.
* **Type-Safe:** I use `TypedDict` to provide explicit type hinting, which improves code maintainability and reduces runtime errors.

In [4]:
# Updated Configuration to handle Dynamic Versioning
class ProjectConfig(TypedDict):
    taxi_type: str
    year: int
    month: int
    data_url: str
    raw_path: str
    processed_path: str
    model_path: str
    categorical_features: list[str]
    numerical_features: list[str]
    all_features: list[str]
    model_type: str  # Mandatory for multi-model SSoT
    target: str


def get_config(
    taxi_type: str = "yellow", year: int = 2025, month: int = 10, model_type: str = "lr"
) -> ProjectConfig:
    """
    Generate a centralized configuration object for data and model paths.
    Follows the Single Source of Truth (SSoT) principle.
    """
    # 1. Base Directory Resolution (The Foundation)
    base_dir = Path.cwd().parent
    raw_dir = base_dir / "data" / "raw"
    proc_dir = base_dir / "data" / "processed"
    model_dir = base_dir / "models"

    # 2. Versioning & Time Logic (The Context)
    today = datetime.now().strftime("%Y-%m-%d")

    # 3. Dynamic Name Construction (The Artifacts)
    # Data filename (Input)
    data_file = f"{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"

    # Model filename (Output) - Combines type, data period, and training date
    model_name = f"model_{model_type}_{taxi_type}_{year:04d}-{month:02d}_v_{today}.bin"

    # 4. Physical Directory Creation (The Execution)
    for d in [raw_dir, proc_dir, model_dir]:
        d.mkdir(parents=True, exist_ok=True)

    # 5. Feature Sets & Mapping (The Metadata)
    cat_features = ["PULocationID", "DOLocationID"]
    num_features = ["trip_distance"]

    # 6. Final Object Assembly (The Result)
    config: ProjectConfig = {
        "taxi_type": taxi_type,
        "year": year,
        "month": month,
        "model_type": model_type,
        "data_url": f"https://d37ci6vzurychx.cloudfront.net/trip-data/{data_file}",
        "raw_path": str(raw_dir / data_file),
        "processed_path": str(proc_dir / data_file),
        "model_path": str(model_dir / model_name),
        "categorical_features": cat_features,
        "numerical_features": num_features,
        "all_features": cat_features + num_features,
        "target": "duration",
    }

    return config


# Universal saver function shared by every model type in this notebook
def save_artifact(cfg: ProjectConfig, model_obj, dv):
    """
    Standardizes model saving across the notebook.
    Ensures the (dv, model) bundle is always preserved.
    """
    model_path = Path(cfg["model_path"])

    with model_path.open("wb") as f_out:
        pickle.dump((dv, model_obj), f_out)

    print(f"✅ SUCCESS: {cfg['model_type'].upper()} model saved to: {model_path.name}")
    print(f"📦 Size: {model_path.stat().st_size / 1024**2:.2f} MB")


# Initialize Training Config
cfg = get_config(year=2025, month=10)

# Ruff fix: f-strings now contain variables, removing F541 warning
print(
    f"LOG: SSoT Config initialized for {cfg['taxi_type']} taxi (Period: {cfg['year']}-{cfg['month']:02d})."
)
print(f"LOG: Model will be saved as: {Path(cfg['model_path']).name}")

LOG: SSoT Config initialized for yellow taxi (Period: 2025-10).
LOG: Model will be saved as: model_lr_yellow_2025-10_v_2025-12-31.bin
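
To illustrate how the naming scheme composes, here is a minimal standalone sketch of the data file and URL construction. The helper name `build_data_file` is hypothetical and not part of the notebook; the format string and base URL are the ones used inside `get_config`. The example period is arbitrary and the sketch does not check that the file exists remotely.

```python
# Hypothetical helper mirroring the f-string used inside get_config.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def build_data_file(taxi_type: str, year: int, month: int) -> str:
    """Return the parquet file name for one taxi type and month."""
    return f"{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"


# Zero-padding the month keeps file names uniform and sortable.
data_file = build_data_file("yellow", 2025, 3)
data_url = f"{BASE_URL}/{data_file}"

print(data_file)
print(data_url)
```

Because every path is derived from the same three inputs, switching the period only requires a different call to the configuration function.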


### <a id='toc1_2_3_'></a>[Experiment Tracking Setup (MLflow)](#toc0_)
Configuration of the MLflow tracking infrastructure for metadata management. This setup ensures all model parameters, metrics, and artifacts are systematically logged to the local server.

In [3]:
# 1.3.1 Tracking Configuration
# Setting the tracking URI to the local server
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# 1.3.2 Experiment Initialization
# Grouping all runs under the project name for systematic tracking
mlflow.set_experiment("nyc-taxi-duration-prediction")

# 1.3.3 Pipeline Autologging
# Enabling automatic logging for all supported libraries (XGBoost, Sklearn)
mlflow.autolog()

print(f"LOG: MLflow Tracking URI: {mlflow.get_tracking_uri()}")

2025/12/31 09:39:26 INFO mlflow.tracking.fluent: Experiment with name 'nyc-taxi-duration-prediction' does not exist. Creating a new experiment.
2025/12/31 09:39:48 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2025/12/31 09:39:48 INFO mlflow.tracking.fluent: Autologging successfully enabled for xgboost.


LOG: MLflow Tracking URI: http://127.0.0.1:5000


### <a id='toc1_2_4_'></a>[Data Ingestion Engine](#toc0_)
In this stage, I implement an **Idempotent Ingestion Engine** to handle data loading. My goal is to fetch the NYC Taxi dataset from its public CloudFront-hosted repository and cache it in the `data/raw` directory.

**Professional Standards Applied:**
* **Idempotency & Caching:** The engine checks if the data already exists in the `raw_path` defined in our SSoT to avoid redundant downloads, saving bandwidth and execution time.
* **Separation of Concerns:** The logic for "where" the data is and "how" to get it is encapsulated in a modular function, separated from the main execution flow.
* **Automated Logging:** I included status updates to track the ingestion progress and file paths based on the centralized configuration.

In [None]:
def ingest_data(config: ProjectConfig) -> pd.DataFrame:
    """
    Download raw parquet files if not present locally.
    Implements idempotent data ingestion.
    """
    # Use Path object for modern path manipulation
    raw_path = Path(config["raw_path"])

    # Check if data exists in local storage to prevent redundant downloads
    if not raw_path.exists():
        # Attempt cloud data retrieval
        try:
            print(f"LOG: Downloading data from {config['data_url']}...")
            df = pd.read_parquet(config["data_url"])

            # Ensure the directory structure is ready before saving
            raw_path.parent.mkdir(parents=True, exist_ok=True)
            df.to_parquet(raw_path)
            print(f"LOG: Data successfully cached at: {raw_path.name}")

        except Exception as e:
            print(f"ERROR: Failed to fetch data from cloud. Exception: {e}")
            raise
    else:
        # Load directly from cache if available
        print(f"LOG: Data already exists at {raw_path.name}. Loading locally...")
        df = pd.read_parquet(raw_path)

    return df


# Executing ingestion using the October 2025 SSoT config
df = ingest_data(cfg)
print(f"LOG: Raw data shape: {df.shape}")

LOG: Data already exists at yellow_tripdata_2025-10.parquet. Loading locally...
LOG: Raw data shape: (4428699, 20)
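
The idempotency property can be checked in isolation. The sketch below is a simplified, standard-library illustration of the same cache-or-fetch pattern, not the notebook's `ingest_data`: the names `fetch_once` and `fake_download` are hypothetical, and a counter stands in for the network request.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_once(path: Path, fetch: Callable[[], bytes]) -> bytes:
    """Return cached bytes if present; otherwise fetch, cache, and return."""
    if path.exists():
        return path.read_bytes()
    payload = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return payload


calls = []


def fake_download() -> bytes:
    # Stand-in for the network request; records each invocation.
    calls.append(1)
    return b"trip-data"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw" / "example.parquet"
    first = fetch_once(target, fake_download)   # cache miss: downloads
    second = fetch_once(target, fake_download)  # cache hit: reads the file

print(f"downloads performed: {len(calls)}")
```

Calling the function twice yields the same result while the download runs only once, which is the behavior the ingestion step relies on.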


## <a id='toc1_3_'></a>[Data Engineering & Preprocessing](#toc0_)
*(Phase 2: Cleaning, transforming, and optimizing the dataset)*

Data Engineering Phase: Transformation of raw taxi records into machine-learning-ready features through duration derivation and outlier mitigation.

* **SSoT Feature Mapping:** Instead of hard-coding column names, I dynamically pull them from the `ProjectConfig`. This ensures that any schema changes in the source data only need to be updated in one place.
* **Target Engineering:** I derive the `duration` variable from pickup and dropoff timestamps and mitigate outliers by filtering for trips between 1 and 60 minutes.
* **Type Integrity:** I cast the categorical location IDs to strings so that the `DictVectorizer` one-hot encodes them as discrete categories rather than treating them as continuous numbers. Memory optimization of the numerical columns is handled separately in the next subsection.

In [None]:
def preprocess_data(df: pd.DataFrame, config: ProjectConfig) -> pd.DataFrame:
    """
    Perform data cleaning, feature engineering, and outlier filtering.
    Enforcement of Single Source of Truth (SSoT) feature types is applied.
    """
    # 1. Target Derivation: Conversion of pickup/dropoff timestamps to duration in minutes
    # The .dt accessor computes total seconds for the whole column at once (no row-wise apply)
    trip_time = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df["duration"] = trip_time.dt.total_seconds() / 60

    # 2. Outlier Mitigation: Filtration to maintain operational range (1-60 minutes)
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # 3. Categorical Integrity: Casting feature identifiers to string for consistent vectorization
    # Feature names are dynamically retrieved from the centralized ProjectConfig
    cat_features = config["categorical_features"]
    df[cat_features] = df[cat_features].astype(str)

    return df


# Execution of Data Engineering Phase
print("LOG: Starting Data Engineering and Preprocessing...")
df_processed = preprocess_data(df, cfg)

print("LOG: Preprocessing complete.")
print(f"LOG: Final Shape: {df_processed.shape}")
print(f"LOG: Average Duration: {df_processed.duration.mean():.2f} minutes")

LOG: Starting Data Engineering and Preprocessing...
LOG: Preprocessing complete.
LOG: Final Shape: (4198802, 21)
LOG: Average Duration: 17.32 minutes
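
As a small worked example of the target derivation and the 1 to 60 minute filter, the following standard-library sketch applies the same arithmetic to three made-up trips. The timestamps and the helper name `duration_minutes` are illustrative assumptions; the notebook performs the equivalent computation on pandas columns.

```python
from datetime import datetime


def duration_minutes(pickup: datetime, dropoff: datetime) -> float:
    """Trip duration in minutes from two timestamps."""
    return (dropoff - pickup).total_seconds() / 60


# Made-up trips: one typical, one too short, one too long.
trips = [
    (datetime(2025, 10, 1, 8, 0, 0), datetime(2025, 10, 1, 8, 12, 30)),
    (datetime(2025, 10, 1, 9, 0, 0), datetime(2025, 10, 1, 9, 0, 20)),
    (datetime(2025, 10, 1, 10, 0, 0), datetime(2025, 10, 1, 11, 30, 0)),
]

durations = [duration_minutes(p, d) for p, d in trips]

# Same bounds as the preprocessing step: keep 1 <= duration <= 60.
kept = [d for d in durations if 1 <= d <= 60]

print(durations)
print(kept)
```

Only the 12.5 minute trip survives; the 20 second and 90 minute trips fall outside the operational range and are dropped.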


### <a id='toc1_3_1_'></a>[Memory Management (SSoT for Data Types)](#toc0_)
To handle millions of rows efficiently, I enforce a strict schema for data types. This optimization phase ensures the pipeline remains scalable:

* **Vectorized Optimization:** I downcast the numerical columns with a single vectorized `astype` call instead of a row-by-row loop, which keeps the CPU cost of the conversion low.
* **Consistency & Footprint:** Training and validation data go through the same downcast (`float64` to `float32`), which halves the size of the affected columns. Because only the numerical feature and the target are converted, the overall saving is small, but a lean numeric dtype matters more as features are added or when training in memory-constrained cloud environments.
* **SSoT Integration:** The function dynamically identifies numerical columns and the target from the centralized `ProjectConfig`, maintaining a Single Source of Truth for the entire data schema.

In [None]:
def optimize_memory_vectorized(df: pd.DataFrame, config: ProjectConfig) -> pd.DataFrame:
    """
    Optimizes memory usage by downcasting numerical types based on SSoT config.
    """
    # 1. Calculate memory BEFORE optimization
    mem_before = df.memory_usage(deep=True).sum() / 1024**2

    # 2. Identify numerical columns from SSoT (Features + Target)
    # This prevents hard-coding and respects the schema defined in Section 1.2
    num_cols = config["numerical_features"] + [config["target"]]

    # 3. Downcast to float32 (Industry standard for ML precision/memory balance)
    df[num_cols] = df[num_cols].astype("float32")

    # 4. Calculate memory AFTER optimization
    mem_after = df.memory_usage(deep=True).sum() / 1024**2
    improvement = ((mem_before - mem_after) / mem_before) * 100

    print(f"LOG: Memory Optimization Report for {config['taxi_type']} dataset:")
    print(f"   - BEFORE: {mem_before:.2f} MB")
    print(f"   - AFTER: {mem_after:.2f} MB")
    print(f"   - Reduction: {improvement:.2f}%")

    return df


# Execute Memory Optimization
df_processed = optimize_memory_vectorized(df_processed, cfg)

LOG: Memory Optimization Report for yellow dataset:
   - BEFORE: 1183.29 MB
   - AFTER: 1151.25 MB
   - Reduction: 2.71%
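
The size of the reduction can be sanity-checked with simple arithmetic. The sketch below assumes that exactly two columns (`trip_distance` and `duration`) were stored as `float64` before the downcast, and takes the row count and the BEFORE figure from the logs above.

```python
import struct

rows = 4_198_802          # rows after preprocessing (from the log)
downcast_columns = 2      # trip_distance and duration (assumed float64 before)
mem_before_mb = 1183.29   # BEFORE figure reported by the notebook

bytes_float64 = struct.calcsize("<d")  # 8 bytes per value
bytes_float32 = struct.calcsize("<f")  # 4 bytes per value

# Each downcast value saves 4 bytes.
saved_bytes = rows * downcast_columns * (bytes_float64 - bytes_float32)
saved_mb = saved_bytes / 1024**2
saved_pct = saved_mb / mem_before_mb * 100

print(f"expected saving: {saved_mb:.2f} MB ({saved_pct:.2f}%)")
```

The estimate of roughly 32 MB agrees with the gap between the reported BEFORE and AFTER figures, which suggests the remaining footprint is dominated by columns the downcast does not touch.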


## <a id='toc1_4_'></a>[Data Transformation (Vectorization)](#toc0_)
*(Phase 3: Converting dataframes into numerical matrices)*

In this stage, I convert the processed categorical and numerical features into a format that the machine learning model can understand. This transformation is crucial for several reasons:

* **One-Hot Encoding & Efficiency:** I use `DictVectorizer` to handle categorical location IDs, which creates a **sparse matrix**. This optimizes memory usage and ensures the model can interpret discrete categories correctly.
* **SSoT Alignment:** Instead of manually selecting columns, I dynamically utilize the `all_features` list from the `ProjectConfig`. This guarantees a consistent and reproducible feature order.
* **Consistency for Production:** The `DictVectorizer` (dv) fitted here becomes the **"Source of Truth"** for all future data (Validation/Production), ensuring that feature schemas remain synchronized.

In [None]:
# 1. Feature & Target Selection using SSoT
# We pull definitions directly from the config object
features = cfg["all_features"]
target = cfg["target"]

# 2. Convert DataFrame to List of Dictionaries
# DictVectorizer expects one dict per row; note that this materializes a Python dict for every record
train_dicts = df_processed[features].to_dict(orient="records")

# 3. Fit and Transform the Vectorizer
# 'dv' will be saved later to be used in the inference pipeline
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

# 4. Extract Target Vector
y_train = df_processed[target].to_numpy()

print("LOG: Feature Transformation complete.")
print(f"LOG: Training Matrix Shape: {X_train.shape}")
print(f"LOG: DictVectorizer successfully mapped {len(dv.feature_names_)} features.")

LOG: Feature Transformation complete.
LOG: Training Matrix Shape: (4198802, 523)
LOG: DictVectorizer successfully mapped 523 features.
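
To make the encoding concrete, here is a deliberately simplified pure-Python imitation of what a dict-based one-hot encoder does. It is an illustration under stated assumptions, not scikit-learn's implementation: string values become `name=value` indicator columns, numbers keep their own column, and names unseen during fitting are skipped at transform time. The function names and the zone ID `"999"` are hypothetical.

```python
def fit_vocabulary(records):
    """Collect sorted column names: 'key=value' for strings, 'key' for numbers."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return {name: i for i, name in enumerate(sorted(names))}


def transform(records, vocab):
    """Encode records as sparse rows ({column_index: value}); skip unseen names."""
    rows = []
    for rec in records:
        row = {}
        for key, value in rec.items():
            if isinstance(value, str):
                name, cell = f"{key}={value}", 1.0
            else:
                name, cell = key, float(value)
            if name in vocab:
                row[vocab[name]] = cell
        rows.append(row)
    return rows


train = [
    {"PULocationID": "161", "DOLocationID": "236", "trip_distance": 1.2},
    {"PULocationID": "132", "DOLocationID": "236", "trip_distance": 17.5},
]
vocab = fit_vocabulary(train)

# A later record whose pickup zone was never seen during fitting.
val_rows = transform(
    [{"PULocationID": "999", "DOLocationID": "236", "trip_distance": 3.0}], vocab
)

print(list(vocab))
print(val_rows)
```

The unseen pickup zone contributes no column, so the matrix width stays fixed at whatever was learned during fitting. This is why the fitted vectorizer, rather than a fresh one, must be reused for validation and inference.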


## <a id='toc1_5_'></a>[Baseline Modeling & Persistence](#toc0_)
*(Phase 4: Establishing a performance benchmark and managing artifacts)*

In this stage, I establish a **Baseline** for the project and ensure the persistence of the training results following our **Single Source of Truth (SSoT)** architecture:

* **Baseline Establishment & RMSE:** I train a **Linear Regression** model to serve as our performance benchmark. Using Root Mean Squared Error (RMSE), I establish a "score to beat" to verify the predictive power of features dynamically pulled from the `ProjectConfig`.
* **Artifact Synchronization:** I serialize both the model and the `DictVectorizer` into a single binary file. This is a critical MLOps practice that prevents **"Schema Skew"** during inference by ensuring the preprocessor and model remain perfectly synchronized.
* **SSoT Persistence:** I utilize the `model_path` defined in our centralized configuration to ensure artifacts are stored automatically in the correct project directory (`models/`), ensuring reproducibility and pipeline alignment.

In [None]:
# 1. Model Configuration
# Ensuring the Baseline model uses 'lr' type for proper SSoT pathing
cfg = get_config(taxi_type="yellow", year=2025, month=10, model_type="lr")

# 2. Model Initialization
# Establishing a simple Linear Regression as the performance benchmark
lr = LinearRegression()

# 3. Model Training
# SSoT: Training using the October feature matrix (X_train)
lr.fit(X_train, y_train)

# 4. Training Evaluation
# Checking how well the model fits the training data
y_pred_train = lr.predict(X_train)
rmse_lr_train = root_mean_squared_error(y_train, y_pred_train)

print(f"LOG: Baseline model ({lr.__class__.__name__}) training complete.")
print(f"LOG: Training RMSE: {rmse_lr_train:.2f} minutes")

LOG: Baseline model (LinearRegression) training complete.
LOG: Training RMSE: 9.55 minutes
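
For readers less familiar with the metric, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same unit as the target (minutes here). The following sketch computes it by hand on three made-up trips; the values and the function name `rmse` are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


y_true = [10.0, 20.0, 30.0]  # actual durations (minutes), made up
y_pred = [12.0, 18.0, 33.0]  # predicted durations (minutes), made up

# Errors are -2, 2, -3, so the squared errors are 4, 4, 9.
score = rmse(y_true, y_pred)

print(f"RMSE: {score:.2f} minutes")
```

Because errors are squared before averaging, a few badly mispredicted trips raise the score more than many small misses do.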


### <a id='toc1_5_1_'></a>[Model Persistence (Universal Serialization)](#toc0_)

In a production-grade pipeline, saving the model is not just about writing a file; it's about ensuring the **reproducibility** of the entire inference logic.

* **Unified Saver:** We use a centralized `save_artifact` function to handle the serialization of any model type (Baseline or XGBoost) through a unified interface.
* **The "Inference Bundle":** To prevent **Training-Serving Skew**, we bundle the `DictVectorizer` (the feature schema) with the model object. This ensures that the 523 features generated during training are mapped identically during future predictions.
* **Dynamic Versioning:** The file name is automatically derived from the `ProjectConfig` (SSoT), incorporating the model type, data period, and training date for full traceability.
* **Pathlib Integration:** Using `pathlib` ensures cross-platform compatibility and clean directory management.

In [None]:
# 6. Artifact Persistence
# Save the model and dict vectorizer using the centralized function
save_artifact(cfg, lr, dv)

✅ SUCCESS: LR model saved to: model_lr_yellow_2025-10_v_2025-12-30.bin
📦 Size: 0.02 MB
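
The counterpart of saving is loading the bundle at inference time. The sketch below shows one possible loader for a file that holds a pickled `(dv, model)` tuple; `load_artifact` is a hypothetical name, and plain dictionaries stand in for the fitted vectorizer and model so that the example runs without any third-party dependency.

```python
import pickle
import tempfile
from pathlib import Path


def load_artifact(model_path: Path):
    """Return the (dv, model) tuple stored by a saver that pickles both together."""
    with model_path.open("rb") as f_in:
        dv, model = pickle.load(f_in)
    return dv, model


# Stand-ins for the fitted DictVectorizer and the trained model.
fake_dv = {"feature_names": ["DOLocationID=236", "trip_distance"]}
fake_model = {"coef": [0.5, 2.0]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_example.bin"
    with path.open("wb") as f_out:
        pickle.dump((fake_dv, fake_model), f_out)
    loaded_dv, loaded_model = load_artifact(path)

print(loaded_dv == fake_dv, loaded_model == fake_model)
```

Since unpickling can execute arbitrary code, such files should only be loaded from trusted locations.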


## <a id='toc1_6_'></a>[Model Validation (Out-of-Sample)](#toc0_)
In this phase, I evaluate the model's performance on unseen data (November 2025). 
* **The Goal:** To ensure the model generalizes well to new data and maintains a similar RMSE to the training phase.
* **The Pipeline:** I reuse the `preprocess_data` and `optimize_memory_vectorized` functions, together with the already fitted `DictVectorizer`, to keep the validation data consistent with training.

In [None]:
# 1. Initialize Validation Configuration for November 2025
# Centralized configuration ensures we point to the correct validation file
val_cfg = get_config(year=2025, month=11)

# 2. Ingest and Prepare Validation Data
print(f"LOG: Starting validation pipeline for {val_cfg['month']}/{val_cfg['year']}...")
df_val_raw = ingest_data(val_cfg)

# 3. Preprocessing and Memory Optimization
df_val = preprocess_data(df_val_raw, val_cfg)
df_val = optimize_memory_vectorized(df_val, val_cfg)

# 4. Feature Transformation (Transform only using fitted DV)
# cast() only informs the type checker; dv.transform accepts any iterable of dicts
val_dicts = df_val[val_cfg["all_features"]].to_dict(orient="records")
X_val = dv.transform(iter(cast(List[Dict[str, Any]], val_dicts)))
y_val = df_val[val_cfg["target"]].to_numpy()

# 5. Baseline (Linear Regression) Validation Evaluation
# Using the model 'lr' trained in the previous cell
y_pred_lr_val = lr.predict(X_val)
rmse_lr_val = root_mean_squared_error(y_val, y_pred_lr_val)

print(f"LOG: Validation complete for {val_cfg['taxi_type']} taxi.")
print(f"LOG: Linear Regression Validation RMSE: {rmse_lr_val:.2f} minutes")

LOG: Starting validation pipeline for 11/2025...
LOG: Data already exists at yellow_tripdata_2025-11.parquet. Loading locally...
LOG: Memory Optimization Report for yellow dataset:
   - BEFORE: 1118.86 MB
   - AFTER: 1088.52 MB
   - Reduction: 2.71%
LOG: Validation complete for yellow taxi.
LOG: Linear Regression Validation RMSE: 9.48 minutes


## <a id='toc1_7_'></a>[Experimentation with Advanced Models (XGBoost)](#toc0_)
This phase moves beyond Linear Regression to capture non-linear patterns in the NYC taxi dataset. The primary goal is to evaluate whether gradient boosting can improve on the baseline validation RMSE (9.48 minutes).

* **Model:** XGBoost (Extreme Gradient Boosting).
* **Evaluation:** Comparison of validation RMSE against the linear baseline.

In [None]:
# 1. Specialized data structure for XGBoost (No new transformations)
# These DMatrix objects are built from the existing X_train and X_val matrices
train_xgb = xgb.DMatrix(X_train, label=y_train)
val_xgb = xgb.DMatrix(X_val, label=y_val)

# 2. Define hyperparameters
params = {"max_depth": 6, "objective": "reg:squarederror", "nthread": 8, "seed": 42}

# 3. Model Training Logic
# Booster training using the established validation set
booster = xgb.train(
    params=params,
    dtrain=train_xgb,
    num_boost_round=100,
    evals=[(val_xgb, "validation")],
    early_stopping_rounds=10,
    verbose_eval=10,
)

# 4. Evaluation (Consistency check)
# Updated variable names to match the dynamic summary table
y_pred_xgb_val = booster.predict(val_xgb)
rmse_xgb_val = root_mean_squared_error(y_val, y_pred_xgb_val)

# Also calculating training RMSE for the full comparison table
y_pred_xgb_train = booster.predict(train_xgb)
rmse_xgb_train = root_mean_squared_error(y_train, y_pred_xgb_train)

print(f"LOG: Linear Regression Validation RMSE: {rmse_lr_val:.2f}")
print(f"LOG: XGBoost Validation RMSE: {rmse_xgb_val:.2f} minutes")

[0]	validation-rmse:9.10641
[10]	validation-rmse:6.65067
[20]	validation-rmse:6.56195
[30]	validation-rmse:6.51450
[40]	validation-rmse:6.47538
[50]	validation-rmse:6.44804
[60]	validation-rmse:6.42301
[70]	validation-rmse:6.40096
[80]	validation-rmse:6.38448
[90]	validation-rmse:6.36894
[99]	validation-rmse:6.35575
LOG: Linear Regression Validation RMSE: 9.48
LOG: XGBoost Validation RMSE: 6.36 minutes
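
The early-stopping rule used above can be sketched in a few lines. This is a simplified illustration of patience-based stopping for a lower-is-better metric, not XGBoost's internal code: training halts once the metric has failed to improve for `patience` consecutive rounds. The function name and the first score list are hypothetical; the second list reuses the values printed in the log, which are sampled every tenth round.

```python
def run_with_patience(scores, patience):
    """Return (best_round, last_round_run) for a lower-is-better metric."""
    best_round, best_score = 0, float("inf")
    for round_idx, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, round_idx  # stop: no improvement for `patience` rounds
    return best_round, len(scores) - 1


# Hypothetical curve that plateaus after round 3.
plateau = [9.1, 7.0, 6.6, 6.5, 6.55, 6.52, 6.51, 6.9]
best_round, stop_round = run_with_patience(plateau, patience=3)

# Logged validation RMSE values (every tenth round): still improving at the end.
logged = [9.10641, 6.65067, 6.56195, 6.51450, 6.47538, 6.44804,
          6.42301, 6.40096, 6.38448, 6.36894, 6.35575]
logged_best, logged_stop = run_with_patience(logged, patience=3)

print(best_round, stop_round)
print(logged_best, logged_stop)
```

In the logged run the metric is still decreasing at the last round, so the stopping rule never triggers and all 100 boosting rounds are used. That also hints that more rounds might lower the error further.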


### <a id='toc1_7_1_'></a>[Exporting the Advanced Model (XGBoost)](#toc0_)

After the XGBoost booster is trained, it is persisted through the same universal `save_artifact` interface.
Regenerating the configuration with `model_type="xgb"` changes the artifact file name, so the booster is written to its own path under the established naming convention.

In [None]:
cfg = get_config(model_type="xgb", year=2025, month=10)
# ... after training ...
save_artifact(cfg, booster, dv)

✅ SUCCESS: XGB model saved to: model_xgb_yellow_2025-10_v_2025-12-30.bin
📦 Size: 0.42 MB


### <a id='toc1_7_2_'></a>[Evaluation Analysis (November Validation Set)](#toc0_)
The model performance was evaluated using the November 2025 dataset, ensuring an out-of-sample validation consistent with the previous linear baseline. 


In [None]:
from IPython.display import Markdown, display

# Calculate improvement percentage
improvement = (rmse_lr_val - rmse_xgb_val) / rmse_lr_val * 100

table_content = f"""
| Model | Training RMSE | Validation RMSE | Improvement (Val) |
| :--- | :--- | :--- | :--- |
| **Linear Regression** | {rmse_lr_train:.2f} | {rmse_lr_val:.2f} | Reference |
| **XGBoost** | {rmse_xgb_train:.2f} | {rmse_xgb_val:.2f} | ~{improvement:.1f}% |
"""

display(Markdown(table_content))


| Model | Training RMSE | Validation RMSE | Improvement (Val) |
| :--- | :--- | :--- | :--- |
| **Linear Regression** | 9.55 | 9.48 | Reference |
| **XGBoost** | 6.36 | 6.36 | ~32.9% |



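**Key Finding:** XGBoost reduced the validation RMSE from 9.48 to 6.36 minutes, roughly 33% below the linear baseline. A plausible explanation is that tree ensembles can model interactions between pickup and drop-off locations and non-linear effects of trip distance, which the additive linear model cannot represent.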
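
The improvement column can be reproduced from the two validation scores. The sketch below uses the rounded values from the summary table, so the last digits may differ slightly from a calculation on the unrounded metrics.

```python
rmse_lr_val = 9.48   # Linear Regression validation RMSE (rounded, from the table)
rmse_xgb_val = 6.36  # XGBoost validation RMSE (rounded, from the table)

# Relative improvement is measured against the baseline score.
absolute_gain = rmse_lr_val - rmse_xgb_val
relative_gain = absolute_gain / rmse_lr_val * 100

print(f"absolute gain: {absolute_gain:.2f} minutes")
print(f"relative gain: {relative_gain:.1f}%")
```

In absolute terms, the typical prediction error drops by about three minutes per trip.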