# Predicting Deutsche Bahn Train Delays  
## A Reproducible Baseline for Supervised Regression

**Reason.**  We are restarting the project because v1 became ... In this fresh notebook we will keep the code simple, attach every algorithm to its statistical-learning formula, and follow the machine-learning workflow outlined by the lecturer.

**Objective.**  Build a supervised regression model that predicts the **planned-vs-actual delay** for Deutsche Bahn trains, using the public *“Deutsche Bahn (DB) Delays”* dataset (Kaggle, nokkyu, 2024-07). At each major step we will present combined visualisations (several related plots in one figure) followed by a single descriptive caption, in the style used by *An Introduction to Statistical Learning*.
<!-- typically a set of related plots accompanied by a single descriptive caption below, summarizing the figure and explaining each subplot (e.g., “Left: … Right: …”) in a research-oriented format. -->

---

Let's first load the data:

In [None]:
import pandas as pd
from kagglehub import load_dataset, KaggleDatasetAdapter

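# Load the Deutsche Bahn delays dataset and parse its timestamp columns
def load_db_delays() -> pd.DataFrame:
    df = load_dataset(
        KaggleDatasetAdapter.PANDAS,
        "nokkyu/deutsche-bahn-db-delays",
        "DBtrainrides.csv",
    )
    # Missing or unparseable timestamps become NaT instead of raising
    for col in ("arrival_plan", "departure_plan", "arrival_change", "departure_change"):
        df[col] = pd.to_datetime(df[col], errors="coerce")
    return df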

df = load_db_delays()

<!-- ```
print(df.head())

                                  ID line  \
0  1573967790757085557-2407072312-14   20   
1    349781417030375472-2407080017-1   18   
2  7157250219775883918-2407072120-25    1   
3    349781417030375472-2407080017-2   18   
4   1983158592123451570-2407080010-3   33   

                                                path   eva_nr  category  \
0  Stolberg(Rheinl)Hbf Gl.44|Eschweiler-St.Jöris|...  8000001         2   
1                                                NaN  8000001         2   
2  Hamm(Westf)Hbf|Kamen|Kamen-Methler|Dortmund-Ku...  8000406         4   
3                                         Aachen Hbf  8000404         5   
4                            Herzogenrath|Kohlscheid  8000404         5   

             station                state    city    zip      long        lat  \
0         Aachen Hbf  Nordrhein-Westfalen  Aachen  52064  6.091499  50.767800   
1         Aachen Hbf  Nordrhein-Westfalen  Aachen  52064  6.091499  50.767800   
2  Aachen-Rothe Erde  Nordrhein-Westfalen  Aachen  52066  6.116475  50.770202   
3        Aachen West  Nordrhein-Westfalen  Aachen  52072  6.070715  50.780360   
4        Aachen West  Nordrhein-Westfalen  Aachen  52072  6.070715  50.780360   

          arrival_plan       departure_plan       arrival_change  \
0  2024-07-08 00:00:00  2024-07-08 00:01:00  2024-07-08 00:03:00   
1                  NaN  2024-07-08 00:17:00                  NaN   
2  2024-07-08 00:03:00  2024-07-08 00:04:00  2024-07-08 00:03:00   
3  2024-07-08 00:20:00  2024-07-08 00:21:00                  NaN   
4  2024-07-08 00:20:00  2024-07-08 00:21:00  2024-07-08 00:20:00   

      departure_change  arrival_delay_m  departure_delay_m info  \
0  2024-07-08 00:04:00                3                  3  NaN   
1                  NaN                0                  0  NaN   
2  2024-07-08 00:04:00                0                  0  NaN   
3                  NaN                0                  0  NaN   
4  2024-07-08 00:21:00                0                  0  NaN   

  arrival_delay_check departure_delay_check  
0             on_time               on_time  
1             on_time               on_time  
2             on_time               on_time  
3             on_time               on_time  
4             on_time               on_time  
``` -->


### Study Design - Working Draft

**Problem Type.**  Supervised regression.

**Target Variable.**  `departure_delay_m` (measured in minutes; positive values indicate late departures, negative values indicate early departures).

**Feature Categories:**
1. **Timetable-Based Features:**
   * `arrival_plan`, `departure_plan`: Scheduled arrival/departure times. Useful for extracting day-of-week and seasonal patterns.
   * `arrival_change`, `departure_change`: Updated arrival/departure times reported by DB. Potentially strong predictors, but they are likely to leak the label: in the `df.head()` output, `departure_change` minus `departure_plan` equals `departure_delay_m` wherever both are present. We will evaluate performance with and without these features.
2. **Categorical Attributes:**
   * Includes `line`, `category`, `station`, `state`, `city`, and `eva_nr`. These may reflect systemic delays or regional issues.
   * Initial encoding will be one-hot. If dimensionality becomes unmanageable (e.g., for `station`), we fall back to target mean encoding, fitted on the training fold only so that validation targets do not leak into the features.
3. **Calendar Features:**
   * `day_of_week`, `hour_of_day`: To capture temporal variations (e.g., peak vs. off-peak hours).
   * (Optional) Public holiday flag for Germany, using external calendar data.
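
Two items from the list above can be illustrated with a minimal sketch: deriving the calendar features, and checking how closely `departure_change` reproduces the target. The function names `add_calendar_features` and `implied_delay_minutes` are hypothetical helpers introduced here for illustration, and the three demo rows are copied from the `df.head()` output rather than from the full dataset, so the leakage check is only indicative.

```python
import pandas as pd


def add_calendar_features(df: pd.DataFrame, time_col: str = "departure_plan") -> pd.DataFrame:
    """Return a copy of df with day_of_week and hour_of_day derived from time_col."""
    out = df.copy()
    ts = pd.to_datetime(out[time_col], errors="coerce")
    out["day_of_week"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    out["hour_of_day"] = ts.dt.hour
    return out


def implied_delay_minutes(df: pd.DataFrame) -> pd.Series:
    """Minutes between departure_change and departure_plan (NaN where either is missing)."""
    plan = pd.to_datetime(df["departure_plan"], errors="coerce")
    change = pd.to_datetime(df["departure_change"], errors="coerce")
    return (change - plan).dt.total_seconds() / 60


# Three rows copied from the df.head() output (only the columns needed here)
sample = pd.DataFrame({
    "departure_plan": ["2024-07-08 00:01:00", "2024-07-08 00:17:00", "2024-07-08 00:04:00"],
    "departure_change": ["2024-07-08 00:04:00", None, "2024-07-08 00:04:00"],
    "departure_delay_m": [3, 0, 0],
})

features = add_calendar_features(sample)
implied = implied_delay_minutes(sample)

print(features[["day_of_week", "hour_of_day"]])
print(pd.DataFrame({"implied": implied, "target": sample["departure_delay_m"]}))
```

If the implied delay matches the target on most rows of the real data, the `*_change` columns should probably be treated as label proxies rather than as ordinary features.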

**Validation Approach.**  Group-based cross-validation using `ID` (trip hash), so that rows from the same trip never appear in both the training and validation sets.

Primary evaluation metric: **Mean Absolute Error (MAE)**, which is directly interpretable (“average error in minutes”) and less sensitive to outliers than MSE.
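
The validation scheme can be sketched as follows, with $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$ computed per fold. This is a sketch under assumptions, not the final pipeline: the toy frame is made up for illustration, and the `trip_id` column is a hypothetical grouping key. The IDs in the `df.head()` output look like `<hash>-<timestamp>-<n>`, with two rows differing only in the final number, which suggests (but does not prove) that the last component is a per-stop index. If that holds, grouping on the raw `ID` would put every row in its own group, so the sketch strips the suffix first.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupKFold

# Toy stand-in for the real data; values are illustrative only
toy = pd.DataFrame({
    "ID": [
        "a-2407080017-1", "a-2407080017-2",
        "b-2407080010-1", "b-2407080010-2",
        "c-2407072312-1", "c-2407072312-2",
    ],
    "hour_of_day": [0, 0, 1, 1, 2, 2],
    "departure_delay_m": [0, 3, 0, 0, 5, 1],
})

# ASSUMPTION: the trailing "-<n>" is a stop index, so dropping it yields a trip key
toy["trip_id"] = toy["ID"].str.rsplit("-", n=1).str[0]

X = toy[["hour_of_day"]]
y = toy["departure_delay_m"]
groups = toy["trip_id"]

fold_mae = []
for train_idx, val_idx in GroupKFold(n_splits=3).split(X, y, groups):
    # No trip may appear on both sides of the split
    assert set(groups.iloc[train_idx]).isdisjoint(groups.iloc[val_idx])
    model = DummyRegressor(strategy="median").fit(X.iloc[train_idx], y.iloc[train_idx])
    fold_mae.append(mean_absolute_error(y.iloc[val_idx], model.predict(X.iloc[val_idx])))

print(f"MAE per fold: {fold_mae}")
```

Whether the suffix really is a stop index should be checked on the full dataset, for example by comparing `df["ID"].nunique()` with `len(df)`.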

**Baseline Models:**
1. Constant predictor using the mean delay: serves as the performance floor.
2. Linear regression with one-hot features: for basic interpretability.
3. Light Random Forest: to introduce non-linear decision surfaces early.
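
A minimal sketch of these baselines with scikit-learn is shown below. The helper `make_baselines`, the forest settings, and the toy data are assumptions made for illustration. A median predictor is included next to the mean because the median is the constant that minimises MAE, so it may be the fairer floor for the chosen metric. The scores printed here are in-sample and only demonstrate the mechanics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

SEED = 0  # fixed seed so the forest is reproducible


def make_baselines(categorical_cols):
    """Build the baseline models; categorical columns are one-hot encoded."""
    def encoder():
        return ColumnTransformer(
            [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
            remainder="passthrough",  # numeric columns pass through unchanged
        )

    return {
        "mean": DummyRegressor(strategy="mean"),
        "median": DummyRegressor(strategy="median"),
        "linear": make_pipeline(encoder(), LinearRegression()),
        "forest": make_pipeline(
            encoder(),
            RandomForestRegressor(n_estimators=50, max_depth=8, random_state=SEED),
        ),
    }


# Toy data, illustrative only
toy = pd.DataFrame({
    "station": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "hour_of_day": [0, 8, 0, 8, 0, 8, 17, 17],
    "departure_delay_m": [0, 3, 0, 1, 0, 6, 4, 2],
})
X = toy[["station", "hour_of_day"]]
y = toy["departure_delay_m"]

scores = {}
for name, model in make_baselines(["station"]).items():
    model.fit(X, y)
    scores[name] = mean_absolute_error(y, model.predict(X))

for name, mae in scores.items():
    print(f"{name:>7}: in-sample MAE = {mae:.2f} min")
```

On the real data the same dictionary of models would be scored under group-based cross-validation rather than in-sample.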

**Road-map:**

| Phase | Deliverable / reasoning |
|-------|-------------------------|
| 1 | **Data ingestion & schema validation** – read CSV, parse datetimes, assert expected columns/types |
| 2 | **Exploratory data analysis (EDA) & feature engineering** – histograms, missingness, correlation heat-map → decide which engineered features survive |
| 3 | **Baseline models** – mean predictor ➜ linear regression ➜ small random forest (quick non-linear check) |
| 4 | **Hyper-parameter search & model selection** – grid/random search under `GroupKFold`; keep track of uplift vs. baseline |
| 5 | **Interpretation & error analysis** – feature importance, partial-dependence plots, error slices by line/station |

All code will stay **stateless and functional** (no hidden globals, random seeds fixed at top) so that each cell can run in isolation or be dropped into a production script without surprises.
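
As one example of that style, Phase 4 could be wrapped in a single function with the seed fixed at the top. This is a sketch under assumptions: `run_search`, the parameter grid, and the toy data (including the `trip_id` grouping column) are hypothetical, and a random forest is used only because it already appears among the baselines.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold

SEED = 0  # single place to fix randomness


def run_search(X: pd.DataFrame, y: pd.Series, groups: pd.Series, n_splits: int = 4) -> GridSearchCV:
    """Grid search scored by MAE, with folds that keep each group together."""
    search = GridSearchCV(
        RandomForestRegressor(n_estimators=20, random_state=SEED),
        param_grid={"max_depth": [2, 4], "min_samples_leaf": [1, 2]},
        scoring="neg_mean_absolute_error",  # scikit-learn maximises, hence the sign
        cv=GroupKFold(n_splits=n_splits),
    )
    search.fit(X, y, groups=groups)
    return search


# Toy data, illustrative only
toy = pd.DataFrame({
    "trip_id": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "hour_of_day": [0, 1, 7, 8, 12, 13, 17, 18],
    "departure_delay_m": [0, 0, 3, 4, 1, 0, 6, 5],
})

search = run_search(toy[["hour_of_day"]], toy["departure_delay_m"], toy["trip_id"])
best_mae = -search.best_score_

print(f"Best parameters: {search.best_params_}")
print(f"Cross-validated MAE: {best_mae:.2f} min")
```

Comparing `best_mae` with the constant-predictor MAE under the same folds gives the uplift that the road-map asks to track.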



## Data Exploration

With the dataset loaded above, we now examine its structure, check for missing values, and look at basic statistics to understand the data better.

In [None]:
# Display the first few rows of the dataset
print(df.head())
# Display the shape of the dataset
print(f"Dataset shape: {df.shape}")
# Display the columns of the dataset
print(f"Dataset columns: {df.columns.tolist()}")
# Display the data types of the columns
print(f"Dataset dtypes:\n{df.dtypes}")
# Display basic statistics of the dataset
print(f"Dataset statistics:\n{df.describe(include='all')}")
# Display the number of missing values in each column
print(f"Missing values:\n{df.isnull().sum()}")