# Predicting Deutsche Bahn Train Delays  
## A Reproducible Baseline for Supervised Regression

**Objective:** Build a supervised regression model to predict train arrival delays (in minutes) for Deutsche Bahn trains using statistical learning methods.

**Target Variable:** `arrival_delay_m` - continuous variable representing delay in minutes

In [None]:
import pandas as pd
from kagglehub import load_dataset, KaggleDatasetAdapter

# Load the Deutsche Bahn delays dataset
def load_db_delays() -> pd.DataFrame:
    df = load_dataset(
        KaggleDatasetAdapter.PANDAS,
        "nokkyu/deutsche-bahn-db-delays",
        "DBtrainrides.csv"
    )
    df["departure_plan"] = pd.to_datetime(df["departure_plan"], errors="coerce")
    return df

df = load_db_delays()

<!-- ```
print(df.head())

                                  ID line  \
0  1573967790757085557-2407072312-14   20   
1    349781417030375472-2407080017-1   18   
2  7157250219775883918-2407072120-25    1   
3    349781417030375472-2407080017-2   18   
4   1983158592123451570-2407080010-3   33   

                                                path   eva_nr  category  \
0  Stolberg(Rheinl)Hbf Gl.44|Eschweiler-St.Jöris|...  8000001         2   
1                                                NaN  8000001         2   
2  Hamm(Westf)Hbf|Kamen|Kamen-Methler|Dortmund-Ku...  8000406         4   
3                                         Aachen Hbf  8000404         5   
4                            Herzogenrath|Kohlscheid  8000404         5   

             station                state    city    zip      long        lat  \
0         Aachen Hbf  Nordrhein-Westfalen  Aachen  52064  6.091499  50.767800   
1         Aachen Hbf  Nordrhein-Westfalen  Aachen  52064  6.091499  50.767800   
2  Aachen-Rothe Erde  Nordrhein-Westfalen  Aachen  52066  6.116475  50.770202   
3        Aachen West  Nordrhein-Westfalen  Aachen  52072  6.070715  50.780360   
4        Aachen West  Nordrhein-Westfalen  Aachen  52072  6.070715  50.780360   

          arrival_plan       departure_plan       arrival_change  \
0  2024-07-08 00:00:00  2024-07-08 00:01:00  2024-07-08 00:03:00   
1                  NaN  2024-07-08 00:17:00                  NaN   
2  2024-07-08 00:03:00  2024-07-08 00:04:00  2024-07-08 00:03:00   
3  2024-07-08 00:20:00  2024-07-08 00:21:00                  NaN   
4  2024-07-08 00:20:00  2024-07-08 00:21:00  2024-07-08 00:20:00   

      departure_change  arrival_delay_m  departure_delay_m info  \
0  2024-07-08 00:04:00                3                  3  NaN   
1                  NaN                0                  0  NaN   
2  2024-07-08 00:04:00                0                  0  NaN   
3                  NaN                0                  0  NaN   
4  2024-07-08 00:21:00                0                  0  NaN   

  arrival_delay_check departure_delay_check  
0             on_time               on_time  
1             on_time               on_time  
2             on_time               on_time  
3             on_time               on_time  
4             on_time               on_time  
``` -->


## Data Exploration

### Loading and Initial Inspection

**Reference:** *ISLP Chapter 2 - Statistical Learning*, *Slides 03a - The ML Project (p. 5-6)*

In any ML project, we begin with understanding our data structure and distribution. As outlined in the ML project steps (Slide 3), data exploration is the crucial second step after data acquisition.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Parse datetime columns
datetime_cols = ["departure_plan", "arrival_plan", "departure_change", "arrival_change"]
for col in datetime_cols:
    df[col] = pd.to_datetime(df[col], errors="coerce")

print(f"Dataset shape: {df.shape}")
print(f"Target variable (arrival_delay_m) statistics:")
print(df['arrival_delay_m'].describe())

### Understanding the Target Distribution

**Mathematical Foundation:** For regression problems, we assume:
$$Y = f(X) + \epsilon$$

where:
- $Y$ is the response variable (arrival_delay_m)
- $X$ represents our predictors
- $f$ is the systematic information
- $\epsilon$ is random error with $E(\epsilon) = 0$ and $Var(\epsilon) = \sigma^2$

**Reference:** *ISLP Equation 2.1, Slides 02 - Machine Learning Overview*

In [None]:
# Create figure with subplots for comprehensive target analysis
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram of arrival delays
axes[0, 0].hist(df['arrival_delay_m'].dropna(), bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Arrival Delay (minutes)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Arrival Delays')
axes[0, 0].axvline(df['arrival_delay_m'].mean(), color='red', linestyle='--', 
                    label=f'Mean: {df["arrival_delay_m"].mean():.1f} min')
axes[0, 0].legend()

# Box plot to identify outliers
axes[0, 1].boxplot(df['arrival_delay_m'].dropna())
axes[0, 1].set_ylabel('Arrival Delay (minutes)')
axes[0, 1].set_title('Box Plot of Arrival Delays')

# Q-Q plot for normality check
from scipy import stats
stats.probplot(df['arrival_delay_m'].dropna(), dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot: Checking Normality')

# Log-transformed delays (handling negative values)
delay_shifted = df['arrival_delay_m'] + abs(df['arrival_delay_m'].min()) + 1
axes[1, 1].hist(np.log(delay_shifted.dropna()), bins=50, edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Log(Arrival Delay + offset)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Log-Transformed Distribution')

plt.tight_layout()
plt.show()

# Statistical summary
print("\nTarget Variable Analysis:")
print(f"Mean delay: {df['arrival_delay_m'].mean():.2f} minutes")
print(f"Median delay: {df['arrival_delay_m'].median():.2f} minutes")
print(f"Standard deviation: {df['arrival_delay_m'].std():.2f} minutes")
print(f"Skewness: {df['arrival_delay_m'].skew():.2f}")
print(f"Percentage of on-time arrivals: {(df['arrival_delay_m'] == 0).sum() / len(df) * 100:.1f}%")

### Missing Data Analysis

**Reference:** *ISLP Section 4.6.6 - Missing Data*

Missing data can introduce bias if not handled properly. We need to understand the pattern of missingness before deciding on an imputation strategy.


In [None]:
# Missing data visualization
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df)) * 100
}).sort_values('Missing_Percentage', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(missing_data['Column'][:15], missing_data['Missing_Percentage'][:15])
plt.xlabel('Missing Percentage (%)')
plt.title('Missing Data by Column')
plt.tight_layout()
plt.show()

print("Missing Data Summary:")
print(missing_data[missing_data['Missing_Count'] > 0])

### Feature Type Analysis

In [None]:
# Identify feature types for preprocessing
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
datetime_features = df.select_dtypes(include=['datetime64']).columns.tolist()

print(f"Categorical features ({len(categorical_features)}): {categorical_features[:5]}...")
print(f"Numerical features ({len(numerical_features)}): {numerical_features}")
print(f"Datetime features ({len(datetime_features)}): {datetime_features}")

# Sample data inspection
print("\nSample of the data:")
print(df[['zip', 'category', 'arrival_plan', 'departure_plan', 'arrival_delay_m']].head())

---

## Data Preparation

### Feature Engineering

**Mathematical Foundation:** Feature transformation can be represented as:
$$\phi: \mathcal{X} \rightarrow \mathcal{F}$$

where $\phi$ maps from the original feature space $\mathcal{X}$ to a new feature space $\mathcal{F}$.

**Reference:** *Slides 03a - First Classifiers (Feature Extraction)*

In [None]:
# Create working copy and select relevant columns
df_work = df.copy()

# Define columns to keep based on domain knowledge
keep_cols = [
    "ID",  # For group-based CV
    "zip",
    "category",
    "arrival_plan",
    "departure_plan", 
    "arrival_change",
    "departure_change",
    "arrival_delay_m"
]

df_work = df_work[keep_cols]
print(f"Shape after column selection: {df_work.shape}")

# Remove duplicates
df_work = df_work.drop_duplicates()
print(f"Shape after removing duplicates: {df_work.shape}")

# Feature engineering function
def engineer_temporal_features(df):
    """
    Extract temporal features from datetime columns.
    Reference: Domain knowledge suggests time-of-day and day-of-week 
    patterns in train delays.
    """
    df_feat = df.copy()
    
    # Time-based features
    df_feat['arr_hour'] = df_feat['arrival_plan'].dt.hour
    df_feat['arr_minute'] = df_feat['arrival_plan'].dt.minute
    df_feat['arr_weekday'] = df_feat['arrival_plan'].dt.weekday
    df_feat['arr_month'] = df_feat['arrival_plan'].dt.month
    df_feat['arr_day'] = df_feat['arrival_plan'].dt.day
    
    df_feat['dep_hour'] = df_feat['departure_plan'].dt.hour
    df_feat['dep_minute'] = df_feat['departure_plan'].dt.minute
    df_feat['dep_weekday'] = df_feat['departure_plan'].dt.weekday
    
    # Calculate deltas (in minutes) - these could be strong predictors
    df_feat['arr_change_delta'] = (
        (df_feat['arrival_change'] - df_feat['arrival_plan'])
        .dt.total_seconds() / 60
    ).fillna(0)
    
    df_feat['dep_change_delta'] = (
        (df_feat['departure_change'] - df_feat['departure_plan'])
        .dt.total_seconds() / 60
    ).fillna(0)
    
    # Peak hour indicators
    df_feat['is_morning_peak'] = df_feat['arr_hour'].isin([7, 8, 9]).astype(int)
    df_feat['is_evening_peak'] = df_feat['arr_hour'].isin([17, 18, 19]).astype(int)
    df_feat['is_weekend'] = (df_feat['arr_weekday'] >= 5).astype(int)
    
    return df_feat

# Apply feature engineering
df_work = engineer_temporal_features(df_work)

### Train-Validation-Test Split

**Mathematical Foundation:** To estimate the test error, we use:
$$\text{Test MSE} = E[(Y - \hat{f}(X))^2]$$

We need independent test data to get an unbiased estimate.

**Reference:** *ISLP Section 5.1 - Cross-Validation, Slides 03a (p. 9)*

In [None]:
# Separate features and target
feature_cols = [col for col in df_work.columns if col not in ['arrival_delay_m', 'ID']]
X = df_work[feature_cols + ['ID']]  # Keep ID for group-based CV
y = df_work['arrival_delay_m']

# Remove rows with missing target
mask = ~y.isna()
X = X[mask]
y = y[mask]

# First split: 80% temp, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=pd.qcut(y, q=10, duplicates='drop')
)

# Second split: From temp, create 64% train, 16% validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.20, random_state=42
)

print(f"Training set size: {X_train.shape}")
print(f"Validation set size: {X_val.shape}")
print(f"Test set size: {X_test.shape}")

# Extract IDs for group-based CV
train_ids = X_train['ID']
X_train = X_train.drop('ID', axis=1)
X_val = X_val.drop('ID', axis=1)
X_test = X_test.drop('ID', axis=1)

### Preprocessing Pipeline

**Mathematical Foundation:** Standardization transforms features to have zero mean and unit variance:
$$z_i = \frac{x_i - \bar{x}}{s}$$

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.

**Reference:** *ISLP Section 6.2 - Ridge Regression and Standardization*

In [None]:
# Define feature groups
datetime_cols_to_drop = ['arrival_plan', 'departure_plan', 'arrival_change', 'departure_change']
X_train = X_train.drop(columns=datetime_cols_to_drop, errors='ignore')
X_val = X_val.drop(columns=datetime_cols_to_drop, errors='ignore')
X_test = X_test.drop(columns=datetime_cols_to_drop, errors='ignore')

# Identify numerical and categorical columns
numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()

print(f"Numerical features: {numerical_features}")
print(f"Categorical features: {categorical_features}")

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), 
         categorical_features)
    ])

# Fit preprocessor on training data
preprocessor.fit(X_train)

# Feature correlation analysis (on a sample for efficiency)
X_train_transformed = preprocessor.transform(X_train[:1000])
feature_names = (numerical_features + 
                [f"{cat}_{val}" for cat, vals in 
                 zip(categorical_features, preprocessor.named_transformers_['cat'].categories_) 
                 for val in vals[1:]])  # drop='first' removes first category

# Correlation heatmap
plt.figure(figsize=(12, 10))
corr_matrix = pd.DataFrame(X_train_transformed, columns=feature_names).corr()
mask = np.triu(np.ones_like(corr_matrix), k=1)
sns.heatmap(corr_matrix, mask=mask, cmap='RdBu_r', center=0, 
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix (Sample)')
plt.tight_layout()
plt.show()

---

## ML Algorithm and Parameter Exploration

### Baseline Models

**Mathematical Foundation:** The simplest model predicts the mean:
$$\hat{y} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

This gives us a performance floor.

**Reference:** *ISLP Section 3.1 - Simple Linear Regression*

In [None]:
# Baseline: Mean predictor
mean_delay = y_train.mean()
baseline_train_mae = mean_absolute_error(y_train, [mean_delay] * len(y_train))
baseline_val_mae = mean_absolute_error(y_val, [mean_delay] * len(y_val))

print(f"Baseline Model (Mean Predictor):")
print(f"Mean delay prediction: {mean_delay:.2f} minutes")
print(f"Training MAE: {baseline_train_mae:.2f}")
print(f"Validation MAE: {baseline_val_mae:.2f}")

### Linear Regression

**Mathematical Foundation:** Linear regression assumes:
$$Y = \beta_0 + \beta_1X_1 + ... + \beta_pX_p + \epsilon$$

The coefficients are estimated by minimizing RSS:
$$\text{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

**Reference:** *ISLP Chapter 3 - Linear Regression, Slides 02*

In [None]:
# Linear Regression Pipeline
lr_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Fit the model
lr_pipeline.fit(X_train, y_train)

# Predictions
y_train_pred_lr = lr_pipeline.predict(X_train)
y_val_pred_lr = lr_pipeline.predict(X_val)

# Evaluation metrics
def evaluate_model(y_true, y_pred, model_name):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    
    print(f"\n{model_name} Performance:")
    print(f"MAE: {mae:.2f} minutes")
    print(f"RMSE: {rmse:.2f} minutes")
    print(f"R²: {r2:.3f}")
    
    return {'mae': mae, 'mse': mse, 'rmse': rmse, 'r2': r2}

lr_train_metrics = evaluate_model(y_train, y_train_pred_lr, "Linear Regression - Training")
lr_val_metrics = evaluate_model(y_val, y_val_pred_lr, "Linear Regression - Validation")

# Residual analysis
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residual plot
residuals = y_val - y_val_pred_lr
axes[0].scatter(y_val_pred_lr, residuals, alpha=0.5)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residual Plot - Linear Regression')

# Actual vs Predicted
axes[1].scatter(y_val, y_val_pred_lr, alpha=0.5)
axes[1].plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'r--')
axes[1].set_xlabel('Actual Delays')
axes[1].set_ylabel('Predicted Delays')
axes[1].set_title('Actual vs Predicted - Linear Regression')

plt.tight_layout()
plt.show()

### Ridge Regression

**Mathematical Foundation:** Ridge regression minimizes:
$$\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

where $\lambda \geq 0$ is the tuning parameter.

**Reference:** *ISLP Section 6.2.1 - Ridge Regression*

In [None]:
# Ridge Regression with cross-validation for lambda selection
from sklearn.linear_model import RidgeCV

# Define lambda range (alpha in sklearn)
alphas = np.logspace(-3, 3, 100)

ridge_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RidgeCV(alphas=alphas, cv=5))
])

ridge_pipeline.fit(X_train, y_train)

print(f"Optimal lambda (alpha): {ridge_pipeline.named_steps['regressor'].alpha_:.4f}")

# Predictions and evaluation
y_train_pred_ridge = ridge_pipeline.predict(X_train)
y_val_pred_ridge = ridge_pipeline.predict(X_val)

ridge_train_metrics = evaluate_model(y_train, y_train_pred_ridge, "Ridge Regression - Training")
ridge_val_metrics = evaluate_model(y_val, y_val_pred_ridge, "Ridge Regression - Validation")

### Lasso Regression

**Mathematical Foundation:** Lasso minimizes:
$$\text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$

The L1 penalty can force coefficients to exactly zero, performing variable selection.

**Reference:** *ISLP Section 6.2.2 - The Lasso*

In [None]:
from sklearn.linear_model import LassoCV

# Lasso with cross-validation
lasso_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LassoCV(cv=5, random_state=42, max_iter=2000))
])

lasso_pipeline.fit(X_train, y_train)

print(f"Optimal lambda (alpha): {lasso_pipeline.named_steps['regressor'].alpha_:.4f}")

# Get feature importance (non-zero coefficients)
lasso_coef = lasso_pipeline.named_steps['regressor'].coef_
n_selected = np.sum(lasso_coef != 0)
print(f"Number of selected features: {n_selected} out of {len(lasso_coef)}")

# Predictions and evaluation
y_train_pred_lasso = lasso_pipeline.predict(X_train)
y_val_pred_lasso = lasso_pipeline.predict(X_val)

lasso_train_metrics = evaluate_model(y_train, y_train_pred_lasso, "Lasso Regression - Training")
lasso_val_metrics = evaluate_model(y_val, y_val_pred_lasso, "Lasso Regression - Validation")

### Random Forest

**Mathematical Foundation:** Random Forest combines multiple decision trees:
$$\hat{f}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$$

where $T_b$ is the $b$-th tree trained on a bootstrap sample.

**Reference:** *ISLP Section 8.2.1 - Random Forests, Slides 08*

In [None]:
# Random Forest with limited hyperparameters for initial exploration
rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, max_depth=10, 
                                       min_samples_split=20, random_state=42, n_jobs=-1))
])

rf_pipeline.fit(X_train, y_train)

# Predictions and evaluation
y_train_pred_rf = rf_pipeline.predict(X_train)
y_val_pred_rf = rf_pipeline.predict(X_val)

rf_train_metrics = evaluate_model(y_train, y_train_pred_rf, "Random Forest - Training")
rf_val_metrics = evaluate_model(y_val, y_val_pred_rf, "Random Forest - Validation")

# Feature importance analysis
feature_importance = rf_pipeline.named_steps['regressor'].feature_importances_
feature_names_all = (numerical_features + 
                    [f"{cat}_{val}" for cat, vals in 
                     zip(categorical_features, preprocessor.named_transformers_['cat'].categories_) 
                     for val in vals[1:]])

# Top 15 features
importance_df = pd.DataFrame({
    'feature': feature_names_all,
    'importance': feature_importance
}).sort_values('importance', ascending=False).head(15)

plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'])
plt.xlabel('Feature Importance')
plt.title('Top 15 Most Important Features - Random Forest')
plt.tight_layout()
plt.show()

### Cross-Validation Comparison

**Mathematical Foundation:** K-fold CV estimate:
$$\text{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \text{MSE}_i$$

**Reference:** *ISLP Section 5.1.3 - k-Fold Cross-Validation*

In [None]:
# Compare models using cross-validation
models = {
    'Linear Regression': LinearRegression(),
    'Ridge': RidgeCV(alphas=alphas, cv=5),
    'Lasso': LassoCV(cv=5, random_state=42, max_iter=2000),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, 
                                          min_samples_split=20, random_state=42, n_jobs=-1)
}

cv_results = {}
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])
    
    # Use negative MAE for scoring (sklearn convention)
    cv_scores = cross_val_score(pipeline, X_train, y_train, 
                               cv=kfold, scoring='neg_mean_absolute_error', n_jobs=-1)
    cv_results[name] = -cv_scores  # Convert back to positive MAE
    
    print(f"{name} - CV MAE: {-cv_scores.mean():.2f} (+/- {cv_scores.std() * 2:.2f})")

# Visualization of CV results
plt.figure(figsize=(10, 6))
plt.boxplot(cv_results.values(), labels=cv_results.keys())
plt.ylabel('MAE (minutes)')
plt.title('Cross-Validation Performance Comparison')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### Bias-Variance Trade-off Analysis

**Mathematical Foundation:** The expected test MSE can be decomposed as:
$$E[(Y - \hat{f}(X))^2] = \text{Var}(\hat{f}(X)) + [\text{Bias}(\hat{f}(X))]^2 + \text{Var}(\epsilon)$$

**Reference:** *ISLP Section 2.2.2 - The Bias-Variance Trade-Off*

In [None]:
# Create learning curves to visualize bias-variance trade-off
from sklearn.model_selection import learning_curve

def plot_learning_curves(estimator, title, X, y, cv=5):
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=-1, 
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_absolute_error'
    )
    
    train_scores_mean = -train_scores.mean(axis=1)
    train_scores_std = train_scores.std(axis=1)
    val_scores_mean = -val_scores.mean(axis=1)
    val_scores_std = val_scores.std(axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, val_scores_mean - val_scores_std,
                     val_scores_mean + val_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, val_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.xlabel("Training Set Size")
    plt.ylabel("MAE")
    plt.title(f"Learning Curves - {title}")
    plt.legend(loc="best")
    plt.grid(True)
    plt.show()

# Plot learning curves for best models
plot_learning_curves(lr_pipeline, "Linear Regression", X_train, y_train)
plot_learning_curves(rf_pipeline, "Random Forest", X_train, y_train)

### Final Model Selection and Test Set Evaluation

In [None]:
# Based on validation performance, select best model
val_performances = {
    'Baseline': baseline_val_mae,
    'Linear Regression': lr_val_metrics['mae'],
    'Ridge': ridge_val_metrics['mae'],
    'Lasso': lasso_val_metrics['mae'],
    'Random Forest': rf_val_metrics['mae']
}

best_model_name = min(val_performances, key=val_performances.get)
print(f"\nBest model based on validation MAE: {best_model_name}")
print(f"Validation MAE: {val_performances[best_model_name]:.2f} minutes")

# Train best model on combined train+validation set
X_train_full = pd.concat([X_train, X_val])
y_train_full = pd.concat([y_train, y_val])

if best_model_name == 'Random Forest':
    final_model = rf_pipeline
elif best_model_name == 'Linear Regression':
    final_model = lr_pipeline
elif best_model_name == 'Ridge':
    final_model = ridge_pipeline
else:
    final_model = lasso_pipeline

# Refit on full training data
final_model.fit(X_train_full, y_train_full)

# Final test set evaluation
y_test_pred = final_model.predict(X_test)
test_metrics = evaluate_model(y_test, y_test_pred, f"{best_model_name} - Test Set")

# Summary visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Model comparison bar chart
models_list = list(val_performances.keys())
mae_values = list(val_performances.values())
axes[0, 0].bar(models_list, mae_values)
axes[0, 0].set_ylabel('MAE (minutes)')
axes[0, 0].set_title('Model Performance Comparison (Validation Set)')
axes[0, 0].tick_params(axis='x', rotation=45)

# Test set predictions
axes[0, 1].scatter(y_test, y_test_pred, alpha=0.5)
axes[0, 1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[0, 1].set_xlabel('Actual Delays')
axes[0, 1].set_ylabel('Predicted Delays')
axes[0, 1].set_title(f'Test Set: Actual vs Predicted - {best_model_name}')

# Error distribution
errors = y_test - y_test_pred
axes[1, 0].hist(errors, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Prediction Error (minutes)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Prediction Errors')
axes[1, 0].axvline(x=0, color='red', linestyle='--')

# Performance metrics summary
metrics_text = f"""Final Model: {best_model_name}

Test Set Performance:
MAE: {test_metrics['mae']:.2f} minutes
RMSE: {test_metrics['rmse']:.2f} minutes
R²: {test_metrics['r2']:.3f}

Baseline MAE: {baseline_val_mae:.2f} minutes
Improvement: {(baseline_val_mae - test_metrics['mae'])/baseline_val_mae*100:.1f}%
"""
axes[1, 1].text(0.1, 0.5, metrics_text, transform=axes[1, 1].transAxes,
                fontsize=12, verticalalignment='center')
axes[1, 1].axis('off')

plt.tight_layout()
plt.show()

print("\nProject Summary:")
print(f"We successfully built a {best_model_name} model to predict train delays.")
print(f"The model achieves a test MAE of {test_metrics['mae']:.2f} minutes,")
print(f"which is a {(baseline_val_mae - test_metrics['mae'])/baseline_val_mae*100:.1f}% improvement over the baseline.")

<!-- 
## Conclusions and Next Steps

This analysis demonstrates the complete machine learning workflow for predicting train delays:

1. **Data Exploration**: We identified that train delays follow a right-skewed distribution with many on-time arrivals.

2. **Data Preparation**: We engineered temporal features and handled missing data appropriately.

3. **Model Selection**: Through systematic evaluation using cross-validation, we compared multiple algorithms from simple (linear regression) to complex (random forest).

**Key Findings**:
- Temporal features (hour of day, day of week) are important predictors
- The change/update times provided by DB are strong predictors but may not always be available
- Non-linear models (Random Forest) generally outperform linear models for this problem

**Future Improvements**:
1. Feature engineering: weather data, holiday indicators, route-specific patterns
2. Advanced models: Gradient Boosting, Neural Networks
3. Time series approaches: considering sequential nature of delays
4. Ensemble methods: combining multiple models

**References**:
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). *An Introduction to Statistical Learning*
- Mayer, M. (2025). *Machine Learning Course Slides*, TH Deggendorf -->