# Predicting Deutsche Bahn Train Delays  
## A Reproducible Baseline for Supervised Regression

**Objective:** Build a supervised regression model to predict train arrival delays (in minutes) for Deutsche Bahn trains using statistical learning methods.

**Target Variable:** `arrival_delay_m`, the arrival delay in minutes, treated as a continuous response

## 1. Environment Setup and Imports

### Package Installation (for Google Colab)
Run this cell only if using Google Colab

In [None]:
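# Uncomment the following lines if running in Google Colab
# !pip install pandas numpy matplotlib seaborn scikit-learn scipy -q
# !pip install --upgrade scikit-learn -q
# kagglehub is needed by the data-loading cell in Section 2
# !pip install "kagglehub[pandas-datasets]" -q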

### Package Installation (for Local Jupyter Setup)
```bash
# 1. Install Anaconda from https://www.anaconda.com/download

# 2. Create a new conda environment
conda create -n ml-project python=3.9 -y

# 3. Activate the environment
conda activate ml-project

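# 4. Install required packages
conda install -c conda-forge pandas numpy matplotlib seaborn scikit-learn scipy jupyter notebook -y
# kagglehub (used by the data-loading cell in Section 2) is installed with pip inside the environment
pip install "kagglehub[pandas-datasets]"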

# 5. Launch Jupyter Notebook
jupyter notebook

# 6. Navigate to your notebook file and open it
```

### Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn imports
import sklearn  # imported as a module so sklearn.__version__ can be printed below
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV, 
    RandomizedSearchCV, learning_curve, validation_curve
)
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    make_scorer
)

# Models
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Additional imports
from scipy import stats
import os

# Set random seed for reproducibility
np.random.seed(42)

# Plotting settings
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

print("All packages imported successfully!")
print(f"Scikit-learn version: {sklearn.__version__}")

---

## 2. Data Loading and Initial Exploration

### Load the Dataset

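This project uses the Deutsche Bahn delays dataset from Kaggle (`nokkyu/deutsche-bahn-db-delays`, file `DBtrainrides.csv`). Each row records planned and changed arrival and departure times (`arrival_plan`, `arrival_change`, `departure_plan`, `departure_change`), the line and path, the station with its city, state, and coordinates, and the arrival and departure delays in minutes.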


In [None]:
import pandas as pd
from kagglehub import load_dataset, KaggleDatasetAdapter

# Load the Deutsche Bahn delays dataset
def load_db_delays() -> pd.DataFrame:
    df = load_dataset(
        KaggleDatasetAdapter.PANDAS,
        "nokkyu/deutsche-bahn-db-delays",
        "DBtrainrides.csv"
    )
    df["departure_plan"] = pd.to_datetime(df["departure_plan"], errors="coerce")
    return df

df = load_db_delays()

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")

<!-- ```
print(df.head())

                                  ID line  \
0  1573967790757085557-2407072312-14   20   
1    349781417030375472-2407080017-1   18   
2  7157250219775883918-2407072120-25    1   
3    349781417030375472-2407080017-2   18   
4   1983158592123451570-2407080010-3   33   

                                                path   eva_nr  category  \
0  Stolberg(Rheinl)Hbf Gl.44|Eschweiler-St.Jöris|...  8000001         2   
1                                                NaN  8000001         2   
2  Hamm(Westf)Hbf|Kamen|Kamen-Methler|Dortmund-Ku...  8000406         4   
3                                         Aachen Hbf  8000404         5   
4                            Herzogenrath|Kohlscheid  8000404         5   

             station                state    city    zip      long        lat  \
0         Aachen Hbf  Nordrhein-Westfalen  Aachen  52064  6.091499  50.767800   
1         Aachen Hbf  Nordrhein-Westfalen  Aachen  52064  6.091499  50.767800   
2  Aachen-Rothe Erde  Nordrhein-Westfalen  Aachen  52066  6.116475  50.770202   
3        Aachen West  Nordrhein-Westfalen  Aachen  52072  6.070715  50.780360   
4        Aachen West  Nordrhein-Westfalen  Aachen  52072  6.070715  50.780360   

          arrival_plan       departure_plan       arrival_change  \
0  2024-07-08 00:00:00  2024-07-08 00:01:00  2024-07-08 00:03:00   
1                  NaN  2024-07-08 00:17:00                  NaN   
2  2024-07-08 00:03:00  2024-07-08 00:04:00  2024-07-08 00:03:00   
3  2024-07-08 00:20:00  2024-07-08 00:21:00                  NaN   
4  2024-07-08 00:20:00  2024-07-08 00:21:00  2024-07-08 00:20:00   

      departure_change  arrival_delay_m  departure_delay_m info  \
0  2024-07-08 00:04:00                3                  3  NaN   
1                  NaN                0                  0  NaN   
2  2024-07-08 00:04:00                0                  0  NaN   
3                  NaN                0                  0  NaN   
4  2024-07-08 00:21:00                0                  0  NaN   

  arrival_delay_check departure_delay_check  
0             on_time               on_time  
1             on_time               on_time  
2             on_time               on_time  
3             on_time               on_time  
4             on_time               on_time  
``` -->
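
The sample rows suggest that `arrival_delay_m` is the difference between `arrival_change` and `arrival_plan` in minutes, with a missing `arrival_change` recorded as a delay of 0. The sketch below checks that reading on three toy rows copied from the sample output. How the dataset author actually computed the column is an assumption, and `delay_minutes` and `delay_check_m` are hypothetical names used only here.

```python
import pandas as pd

# Toy rows mirroring the arrival_plan / arrival_change columns of the sample output.
toy = pd.DataFrame({
    "arrival_plan": ["2024-07-08 00:00:00", "2024-07-08 00:03:00", "2024-07-08 00:20:00"],
    "arrival_change": ["2024-07-08 00:03:00", "2024-07-08 00:03:00", None],
})

def delay_minutes(plan: pd.Series, change: pd.Series) -> pd.Series:
    """Minutes between planned and changed time.

    Assumption: a missing timestamp means no recorded change, so the delay is 0.
    """
    plan_ts = pd.to_datetime(plan, errors="coerce")
    change_ts = pd.to_datetime(change, errors="coerce")
    minutes = (change_ts - plan_ts).dt.total_seconds() / 60
    return minutes.fillna(0).astype(int)

toy["delay_check_m"] = delay_minutes(toy["arrival_plan"], toy["arrival_change"])
print(toy)
```

Running the same function on the full `df` and comparing against `arrival_delay_m` would be one way to confirm or reject this reading before relying on the target.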


### Initial Data Exploration

Following ISLP Section 2.3.9 - Additional Graphical and Numerical Summaries

In [None]:
# Basic dataset information
print("Dataset Info:")
print("="*50)
df.info()

print("\n\nFirst 5 rows:")
print("="*50)
print(df.head())  # a bare df.head() is only displayed when it is the last line of a cell

print("\n\nBasic Statistics:")
print("="*50)
print(df.describe())

print("\n\nMissing Values:")
print("="*50)
print(df.isnull().sum())

---

## 3. Exploratory Data Analysis (EDA)

### Target Variable Analysis

Understanding the distribution of our target variable is crucial for model selection and evaluation metrics.

**Mathematical Framework (ISLP Equation 2.1):**
$$Y = f(X) + \epsilon$$

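where:
- $Y$ is the response variable (`arrival_delay_m`)
- $X = (X_1, \dots, X_p)$ represents our predictors
- $f$ is the fixed but unknown function describing the systematic information that $X$ provides about $Y$
- $\epsilon$ is a random error term, independent of $X$, with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$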
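
A small simulation can make the role of $\epsilon$ concrete. The sketch below assumes a hypothetical linear $f$ and Gaussian noise with $\sigma = 3$; neither choice comes from the Deutsche Bahn data. It shows that even a model that knows the true $f$ cannot achieve a mean squared error below roughly $\sigma^2$, which is the irreducible error.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    """Hypothetical systematic component, chosen only for illustration."""
    return 2.0 + 0.5 * x

sigma = 3.0   # assumed standard deviation of the error term
n = 100_000

x = rng.uniform(0, 10, size=n)
eps = rng.normal(0.0, sigma, size=n)   # E(eps) = 0, Var(eps) = sigma^2
y = f(x) + eps                         # Y = f(X) + eps

# Predicting with the true f leaves only the noise as error.
mse_true_f = np.mean((y - f(x)) ** 2)

print(f"Sample mean of eps: {eps.mean():.3f}")
print(f"MSE using the true f: {mse_true_f:.3f} (sigma^2 = {sigma**2:.1f})")
```

Any fitted model adds reducible error on top of this floor, so a test MSE near the noise variance is the best a regression on these predictors could do.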

In [None]:
# Create comprehensive target analysis
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Histogram of arrival delays
axes[0, 0].hist(df['arrival_delay_m'].dropna(), bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Arrival Delay (minutes)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Arrival Delays')
axes[0, 0].axvline(df['arrival_delay_m'].mean(), color='red', linestyle='--', 
                    label=f'Mean: {df["arrival_delay_m"].mean():.1f} min')
axes[0, 0].axvline(df['arrival_delay_m'].median(), color='green', linestyle='--', 
                    label=f'Median: {df["arrival_delay_m"].median():.1f} min')
axes[0, 0].legend()

# 2. Box plot to identify outliers
axes[0, 1].boxplot(df['arrival_delay_m'].dropna())
axes[0, 1].set_ylabel('Arrival Delay (minutes)')
axes[0, 1].set_title('Box Plot of Arrival Delays')

# 3. Q-Q plot for normality check
stats.probplot(df['arrival_delay_m'].dropna(), dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot: Checking Normality')

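# 4. Log-transformed delays
# log1p(x) = log(x + 1) handles zeros; negative values (early arrivals), if any,
# are clipped to 0 so the logarithm stays defined
delay_nonneg = df['arrival_delay_m'].dropna().clip(lower=0)
axes[1, 1].hist(np.log1p(delay_nonneg), bins=50, edgecolor='black', alpha=0.7)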
axes[1, 1].set_xlabel('Log(Arrival Delay + 1)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Log-Transformed Distribution')

plt.tight_layout()
plt.show()

# Statistical summary
print("\nTarget Variable Statistical Analysis:")
print("="*50)
print(f"Mean delay: {df['arrival_delay_m'].mean():.2f} minutes")
print(f"Median delay: {df['arrival_delay_m'].median():.2f} minutes")
print(f"Standard deviation: {df['arrival_delay_m'].std():.2f} minutes")
print(f"Skewness: {df['arrival_delay_m'].skew():.2f}")
print(f"Kurtosis: {df['arrival_delay_m'].kurtosis():.2f}")
print(f"Percentage of on-time arrivals: {(df['arrival_delay_m'] == 0).sum() / len(df) * 100:.1f}%")
print(f"95th percentile: {df['arrival_delay_m'].quantile(0.95):.2f} minutes")
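
The shape of the target distribution affects which evaluation metric is informative. As a minimal sketch on made-up, right-skewed delays (not values from the dataset), the example below compares two constant predictions, the median and the mean, under MAE and RMSE. The helper names `mae` and `rmse` are defined here for illustration only.

```python
import numpy as np

# Made-up right-skewed delays in minutes: mostly small, one large outlier.
delays = np.array([0, 0, 0, 0, 1, 1, 2, 3, 5, 60], dtype=float)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

median_pred = np.median(delays)
mean_pred = np.mean(delays)

for name, pred in [("median", median_pred), ("mean", mean_pred)]:
    print(f"Constant prediction = {name} ({pred:.1f} min): "
          f"MAE = {mae(delays, pred):.2f}, RMSE = {rmse(delays, pred):.2f}")
```

The median gives the lower MAE and the mean gives the lower RMSE, because squaring the errors weights the single long delay far more heavily. With a skewed target it is therefore worth reporting both metrics rather than one.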

### Feature Analysis and Relationships

In [None]:
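# Analyze relationships between features and target.
# NOTE: the column listing from df.head() does not include 'hour', 'day_of_week',
# 'train_type', 'distance_km' or 'weather_condition'. The two time features are
# derived from 'departure_plan' below; the other three panels are drawn only if
# the column has been added beforehand (for example by joining external data).
df['hour'] = df['departure_plan'].dt.hour
df['day_of_week'] = df['departure_plan'].dt.dayofweek  # 0 = Monday

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

def note_missing(column, ax):
    """Write a placeholder message into an empty panel."""
    ax.text(0.5, 0.5, f"Column '{column}' not available",
            ha='center', va='center', transform=ax.transAxes)

def boxplot_by(column, ax, title):
    """Box plot of arrival delay grouped by `column`, or a note if it is absent."""
    if column in df.columns:
        df.boxplot(column='arrival_delay_m', by=column, ax=ax)
    else:
        note_missing(column, ax)
    ax.set_title(title)

# 1. Delay by train type
boxplot_by('train_type', axes[0, 0], 'Delays by Train Type')

# 2. Delay by hour of day
hourly_delays = df.groupby('hour')['arrival_delay_m'].mean()
axes[0, 1].plot(hourly_delays.index, hourly_delays.values, marker='o')
axes[0, 1].set_xlabel('Hour of Day')
axes[0, 1].set_ylabel('Average Delay (min)')
axes[0, 1].set_title('Average Delays by Hour')
axes[0, 1].grid(True)

# 3. Delay by day of week
daily_delays = df.groupby('day_of_week')['arrival_delay_m'].mean()
axes[0, 2].bar(daily_delays.index, daily_delays.values)
axes[0, 2].set_xlabel('Day of Week (0=Monday)')
axes[0, 2].set_ylabel('Average Delay (min)')
axes[0, 2].set_title('Average Delays by Day of Week')

# 4. Delay vs distance
if 'distance_km' in df.columns:
    axes[1, 0].scatter(df['distance_km'], df['arrival_delay_m'], alpha=0.5)
else:
    note_missing('distance_km', axes[1, 0])
axes[1, 0].set_xlabel('Distance (km)')
axes[1, 0].set_ylabel('Delay (min)')
axes[1, 0].set_title('Delay vs Distance')

# 5. Delay by weather condition
boxplot_by('weather_condition', axes[1, 1], 'Delays by Weather Condition')

# 6. Correlation heatmap for numerical features
numerical_features = df.select_dtypes(include=[np.number]).columns
corr_matrix = df[numerical_features].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=axes[1, 2])
axes[1, 2].set_title('Feature Correlation Matrix')

fig.suptitle('')  # remove the automatic "Boxplot grouped by ..." figure title
plt.tight_layout()
plt.show()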