# Predicting Deutsche Bahn Train Delays  
## A Reproducible Baseline for Supervised Regression

**Objective:** Build a supervised regression model that predicts arrival delays (in minutes) of Deutsche Bahn trains using statistical learning methods.

**Target Variable:** `arrival_delay_m`, the arrival delay in minutes (recorded in whole minutes, treated as a continuous response)

In [None]:
import pandas as pd
from kagglehub import load_dataset, KaggleDatasetAdapter

# Load the Deutsche Bahn delays dataset
def load_db_delays() -> pd.DataFrame:
    df = load_dataset(
        KaggleDatasetAdapter.PANDAS,
        "nokkyu/deutsche-bahn-db-delays",
        "DBtrainrides.csv"
    )
    df["departure_plan"] = pd.to_datetime(df["departure_plan"], errors="coerce")
    return df

df = load_db_delays()

<!-- ```
print(df.head())

                                  ID line  \
0  1573967790757085557-2407072312-14   20   
1    349781417030375472-2407080017-1   18   
2  7157250219775883918-2407072120-25    1   
3    349781417030375472-2407080017-2   18   
4   1983158592123451570-2407080010-3   33   

                                                path   eva_nr  category  \
0  Stolberg(Rheinl)Hbf Gl.44|Eschweiler-St.Jöris|...  8000001         2   
1                                                NaN  8000001         2   
2  Hamm(Westf)Hbf|Kamen|Kamen-Methler|Dortmund-Ku...  8000406         4   
3                                         Aachen Hbf  8000404         5   
4                            Herzogenrath|Kohlscheid  8000404         5   

             station                state    city    zip      long        lat  \
0         Aachen Hbf  Nordrhein-Westfalen  Aachen  52064  6.091499  50.767800   
1         Aachen Hbf  Nordrhein-Westfalen  Aachen  52064  6.091499  50.767800   
2  Aachen-Rothe Erde  Nordrhein-Westfalen  Aachen  52066  6.116475  50.770202   
3        Aachen West  Nordrhein-Westfalen  Aachen  52072  6.070715  50.780360   
4        Aachen West  Nordrhein-Westfalen  Aachen  52072  6.070715  50.780360   

          arrival_plan       departure_plan       arrival_change  \
0  2024-07-08 00:00:00  2024-07-08 00:01:00  2024-07-08 00:03:00   
1                  NaN  2024-07-08 00:17:00                  NaN   
2  2024-07-08 00:03:00  2024-07-08 00:04:00  2024-07-08 00:03:00   
3  2024-07-08 00:20:00  2024-07-08 00:21:00                  NaN   
4  2024-07-08 00:20:00  2024-07-08 00:21:00  2024-07-08 00:20:00   

      departure_change  arrival_delay_m  departure_delay_m info  \
0  2024-07-08 00:04:00                3                  3  NaN   
1                  NaN                0                  0  NaN   
2  2024-07-08 00:04:00                0                  0  NaN   
3                  NaN                0                  0  NaN   
4  2024-07-08 00:21:00                0                  0  NaN   

  arrival_delay_check departure_delay_check  
0             on_time               on_time  
1             on_time               on_time  
2             on_time               on_time  
3             on_time               on_time  
4             on_time               on_time  
``` -->


## Data Exploration

### Loading and Initial Inspection

**Reference:** *ISLP Chapter 2 - Statistical Learning*, *Slides 03a - The ML Project (p. 5-6)*

In any ML project, we begin with understanding our data structure and distribution. As outlined in the ML project steps (Slide 3), data exploration is the crucial second step after data acquisition.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Parse datetime columns
datetime_cols = ["departure_plan", "arrival_plan", "departure_change", "arrival_change"]
for col in datetime_cols:
    df[col] = pd.to_datetime(df[col], errors="coerce")

print(f"Dataset shape: {df.shape}")
print("Target variable (arrival_delay_m) statistics:")
print(df['arrival_delay_m'].describe())

### Understanding the Target Distribution

**Mathematical Foundation:** For regression problems, we assume:
$$Y = f(X) + \epsilon$$

where:
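- $Y$ is the response variable (`arrival_delay_m`)
- $X = (X_1, \dots, X_p)$ represents our predictors
- $f$ is the fixed but unknown function describing the systematic information that $X$ provides about $Y$
- $\epsilon$ is a random error term, independent of $X$, with $E(\epsilon) = 0$ and $Var(\epsilon) = \sigma^2$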

**Reference:** *ISLP Equation 2.1, Slides 02 - Machine Learning Overview*
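
To make the decomposition concrete, the sketch below simulates data from a hypothetical linear $f$ with a chosen $\sigma$. Both are illustrative assumptions, not estimates from the Deutsche Bahn data. It shows that even predictions from the true $f$ leave a mean squared error of roughly $\sigma^2$, the irreducible error that no model can remove.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    """Hypothetical systematic component, chosen only for illustration."""
    return 2.0 + 0.5 * x

sigma = 3.0   # assumed standard deviation of the noise term
n = 100_000   # number of simulated observations

x = rng.uniform(0, 24, size=n)        # a single made-up predictor
eps = rng.normal(0.0, sigma, size=n)  # E(eps) = 0, Var(eps) = sigma^2
y = f(x) + eps                        # Y = f(X) + eps

# Even when predicting with the true f, the error variance remains.
mse_true_f = np.mean((y - f(x)) ** 2)
print(f"Var(eps) = {sigma**2:.2f}, MSE of the true f = {mse_true_f:.2f}")
```

Any fitted model can at best approach this floor on unseen data. A test error far above it points to reducible error (bias or variance), while a test error at or below it on real data would be a reason to check for leakage.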

In [None]:
# Create figure with subplots for comprehensive target analysis
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram of arrival delays
axes[0, 0].hist(df['arrival_delay_m'].dropna(), bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Arrival Delay (minutes)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Arrival Delays')
axes[0, 0].axvline(df['arrival_delay_m'].mean(), color='red', linestyle='--', 
                    label=f'Mean: {df["arrival_delay_m"].mean():.1f} min')
axes[0, 0].legend()

# Box plot to identify outliers
axes[0, 1].boxplot(df['arrival_delay_m'].dropna())
axes[0, 1].set_ylabel('Arrival Delay (minutes)')
axes[0, 1].set_title('Box Plot of Arrival Delays')

# Q-Q plot for normality check
from scipy import stats
stats.probplot(df['arrival_delay_m'].dropna(), dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot: Checking Normality')

# Log-transformed delays (handling negative values)
delay_shifted = df['arrival_delay_m'] + abs(df['arrival_delay_m'].min()) + 1
axes[1, 1].hist(np.log(delay_shifted.dropna()), bins=50, edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Log(Arrival Delay + offset)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Log-Transformed Distribution')

plt.tight_layout()
plt.show()

# Statistical summary
print("\nTarget Variable Analysis:")
print(f"Mean delay: {df['arrival_delay_m'].mean():.2f} minutes")
print(f"Median delay: {df['arrival_delay_m'].median():.2f} minutes")
print(f"Standard deviation: {df['arrival_delay_m'].std():.2f} minutes")
print(f"Skewness: {df['arrival_delay_m'].skew():.2f}")
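# Share of rows with exactly zero delay; this differs from the dataset's own
# on_time flag (arrival_delay_check), which also covers small delays such as 3 min
print(f"Percentage of zero-delay arrivals: {(df['arrival_delay_m'] == 0).sum() / len(df) * 100:.1f}%")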

### Missing Data Analysis

**Reference:** *ISLP Section 4.6.6 - Missing Data*

Missing data can introduce bias if not handled properly. We need to understand the pattern of missingness before deciding on an imputation strategy.
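
One simple way to probe the pattern of missingness is to compare the target between rows where a column is missing and rows where it is present. The sketch below does this on a small made-up frame. The helper name `missingness_vs_target` and the toy values are assumptions for illustration; the same call could be pointed at `df` and any of its columns.

```python
import pandas as pd

# Toy frame standing in for the real data; all values are made up.
toy = pd.DataFrame({
    "arrival_plan": pd.to_datetime([
        "2024-07-08 00:00", None, "2024-07-08 00:03",
        "2024-07-08 00:20", None, "2024-07-08 01:00",
    ]),
    "arrival_delay_m": [3, 0, 0, 5, 0, 12],
})

def missingness_vs_target(df, column, target):
    """Summarise `target` separately for rows where `column` is missing or present."""
    flag = df[column].isna().rename(f"{column}_missing")
    return df.groupby(flag)[target].agg(["count", "mean", "median"])

summary = missingness_vs_target(toy, "arrival_plan", "arrival_delay_m")
print(summary)
```

If the two groups differ clearly, the values are probably not missing completely at random, and dropping or mean-imputing those rows could bias the model. In this dataset, for example, a missing `arrival_plan` may simply mark the first stop of a ride, which would be a structural rather than a random gap.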


In [None]:
# Missing data visualization
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df)) * 100
}).sort_values('Missing_Percentage', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(missing_data['Column'][:15], missing_data['Missing_Percentage'][:15])
plt.xlabel('Missing Percentage (%)')
plt.title('Missing Data by Column')
plt.tight_layout()
plt.show()

print("Missing Data Summary:")
print(missing_data[missing_data['Missing_Count'] > 0])

### Feature Type Analysis

In [None]:
# Identify feature types for preprocessing
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
datetime_features = df.select_dtypes(include=['datetime64']).columns.tolist()

print(f"Categorical features ({len(categorical_features)}): {categorical_features[:5]}...")
print(f"Numerical features ({len(numerical_features)}): {numerical_features}")
print(f"Datetime features ({len(datetime_features)}): {datetime_features}")

# Sample data inspection
print("\nSample of the data:")
print(df[['zip', 'category', 'arrival_plan', 'departure_plan', 'arrival_delay_m']].head())

---

## Data Preparation

### Feature Engineering

**Mathematical Foundation:** Feature transformation can be represented as:
$$\phi: \mathcal{X} \rightarrow \mathcal{F}$$

where $\phi$ maps from the original feature space $\mathcal{X}$ to a new feature space $\mathcal{F}$.

**Reference:** *Slides 03a - First Classifiers (Feature Extraction)*
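
As one concrete instance of such a map $\phi$, the sketch below encodes the hour of day on the unit circle, $\phi(h) = (\sin(2\pi h/24), \cos(2\pi h/24))$. This cyclical encoding is not part of the notebook's pipeline. It is shown here only to illustrate how a transformation can change the geometry of a feature, and the function name `encode_hour_cyclical` is hypothetical.

```python
import numpy as np
import pandas as pd

def encode_hour_cyclical(hours: pd.Series) -> pd.DataFrame:
    """Map hour-of-day (0-23) onto the unit circle: phi(h) = (sin, cos)."""
    angle = 2 * np.pi * hours / 24
    return pd.DataFrame(
        {"hour_sin": np.sin(angle), "hour_cos": np.cos(angle)},
        index=hours.index,
    )

hours = pd.Series([0, 6, 12, 23], name="arr_hour")
encoded = encode_hour_cyclical(hours)
print(encoded.round(3))

# Euclidean distances in the transformed space F
vec = encoded.to_numpy()
dist_23_to_0 = np.linalg.norm(vec[3] - vec[0])  # 23:00 vs 00:00
dist_12_to_0 = np.linalg.norm(vec[2] - vec[0])  # 12:00 vs 00:00
print(f"23h vs 0h: {dist_23_to_0:.3f}, 12h vs 0h: {dist_12_to_0:.3f}")
```

On the raw scale, 23:00 and 00:00 are 23 units apart even though they are adjacent in time. In the transformed space they are close, while noon and midnight are maximally far apart, which is the structure a distance-based or linear model would otherwise miss.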

In [None]:
# Create working copy and select relevant columns
df_work = df.copy()

# Define columns to keep based on domain knowledge
keep_cols = [
    "ID",  # For group-based CV
    "zip",
    "category",
    "arrival_plan",
    "departure_plan", 
    "arrival_change",
    "departure_change",
    "arrival_delay_m"
]

df_work = df_work[keep_cols]
print(f"Shape after column selection: {df_work.shape}")

# Remove duplicates
df_work = df_work.drop_duplicates()
print(f"Shape after removing duplicates: {df_work.shape}")

# Feature engineering function
def engineer_temporal_features(df):
    """
    Extract temporal features from datetime columns.
    Reference: Domain knowledge suggests time-of-day and day-of-week 
    patterns in train delays.
    """
    df_feat = df.copy()
    
    # Time-based features
    df_feat['arr_hour'] = df_feat['arrival_plan'].dt.hour
    df_feat['arr_minute'] = df_feat['arrival_plan'].dt.minute
    df_feat['arr_weekday'] = df_feat['arrival_plan'].dt.weekday
    df_feat['arr_month'] = df_feat['arrival_plan'].dt.month
    df_feat['arr_day'] = df_feat['arrival_plan'].dt.day
    
    df_feat['dep_hour'] = df_feat['departure_plan'].dt.hour
    df_feat['dep_minute'] = df_feat['departure_plan'].dt.minute
    df_feat['dep_weekday'] = df_feat['departure_plan'].dt.weekday
    
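    # Calculate deltas (in minutes) between changed and planned times.
    # Caution: arrival_change - arrival_plan appears to equal the realized
    # arrival delay (compare arrival_delay_m in the sample rows), so
    # arr_change_delta likely leaks the target and should be excluded from the
    # predictors unless it is known at prediction time.
    # fillna(0) also makes a missing change indistinguishable from zero delay.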
    df_feat['arr_change_delta'] = (
        (df_feat['arrival_change'] - df_feat['arrival_plan'])
        .dt.total_seconds() / 60
    ).fillna(0)
    
    df_feat['dep_change_delta'] = (
        (df_feat['departure_change'] - df_feat['departure_plan'])
        .dt.total_seconds() / 60
    ).fillna(0)
    
    # Peak hour indicators
    df_feat['is_morning_peak'] = df_feat['arr_hour'].isin([7, 8, 9]).astype(int)
    df_feat['is_evening_peak'] = df_feat['arr_hour'].isin([17, 18, 19]).astype(int)
    df_feat['is_weekend'] = (df_feat['arr_weekday'] >= 5).astype(int)
    
    return df_feat

# Apply feature engineering
df_work = engineer_temporal_features(df_work)
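
Because `arr_change_delta` is computed as `arrival_change - arrival_plan`, it may reproduce `arrival_delay_m` almost exactly, which would be target leakage. A minimal check, sketched here on made-up rows, flags engineered features whose correlation with the target is suspiciously close to 1. The helper name `leakage_report` and the 0.95 threshold are assumptions for illustration.

```python
import pandas as pd

# Toy rows standing in for df_work; all values are made up.
toy = pd.DataFrame({
    "arrival_plan": pd.to_datetime([
        "2024-07-08 00:00", "2024-07-08 06:03",
        "2024-07-08 12:20", "2024-07-08 18:40",
    ]),
    "arrival_change": pd.to_datetime([
        "2024-07-08 00:03", "2024-07-08 06:03",
        "2024-07-08 12:27", "2024-07-08 18:41",
    ]),
    "arrival_delay_m": [3, 0, 7, 1],
})

# Same construction as in engineer_temporal_features
toy["arr_change_delta"] = (
    (toy["arrival_change"] - toy["arrival_plan"]).dt.total_seconds() / 60
)
toy["arr_hour"] = toy["arrival_plan"].dt.hour

def leakage_report(df, features, target, threshold=0.95):
    """Return features whose absolute Pearson correlation with the target exceeds `threshold`."""
    corr = df[features].corrwith(df[target]).abs()
    return corr[corr > threshold].sort_values(ascending=False)

suspects = leakage_report(toy, ["arr_change_delta", "arr_hour"], "arrival_delay_m")
print(suspects)
```

Running the same report on `df_work` before modelling would show whether the delta features need to be dropped. A high correlation alone does not prove leakage, but a feature that is only known after the train has arrived cannot be used for prediction regardless of its score.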