# Project Group - 

Members: Jarrik Algera, Lorenzo Bouman, Jasper IJsselstein, Joep Neelissen, Tjerk Rijkens


Student numbers: test

# Research Objective

*Requires data modeling and quantitative research in Transport, Infrastructure & Logistics*

The objective of this project is to examine the relationship between weather conditions and train disruptions in the Netherlands. Railway disruption records are combined with meteorological data to quantify the extent to which factors such as precipitation, wind, and temperature are associated with delays and service interruptions. The analysis aims to establish whether specific weather conditions systematically affect railway performance and to what degree. The geographical scale is limited to the Netherlands; after initial data cleaning and exploratory analysis, a specific corridor will be chosen. The temporal scale is the year 2024.

## Main question

#### *To what extent do weather conditions influence train disruptions in the Netherlands?*

### Sub-questions

*Are disruptions more frequent on days with adverse weather compared to days with mild weather?*


*Do disruption durations increase under heavy precipitation, strong wind, or temperature extremes?*


*Which weather variable shows the strongest association with disruption frequency?*


*Are there seasonal differences in how weather affects train disruptions?*


*Are there regional variations in the relationship between weather and disruptions?*

# Contribution Statement

*Be specific. Some of the tasks can be coding (expect everyone to do this), background research, conceptualisation, visualisation, data analysis, data modelling*

**Author 1**: N/A

**Author 2**: N/A

**Author 3**: N/A

# Data Used

## Datasets

Two primary datasets are available for this study. The first dataset contains records of train disruptions in the Netherlands between 2011 and 2024. Each record provides details on the type of disruption (for example staffing problems or a signal failure), its duration, the time of occurrence, and the affected trajectory (for example Eindhoven - Venlo, Nijmegen - Venlo). This dataset represents the dependent variable, as it directly reflects railway system performance. The dataset is obtained from Rijden de Treinen (https://www.rijdendetreinen.nl/open-data/treinstoringen#downloads). It is important to realize that not every delayed or cancelled train is communicated by NS as a disruption; as a rule of thumb, NS communicates a disruption when multiple trains are delayed or cancelled (i.e. when there is a major impact on the train service). The dataset is provided as a .csv file.

The second dataset is obtained from the Royal Netherlands Meteorological Institute (KNMI) (https://www.daggegevens.knmi.nl/klimatologie/uurgegevens) and provides hourly weather observations. It includes variables such as precipitation intensity, temperature, and wind speed and direction. Not every weather station measures all of these variables, so a selection of usable stations has to be made. These variables serve as the independent variables used to explain variation in railway disruptions.
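
The station selection described above could be sketched as follows. This is a minimal illustration, not the project's actual loader: the frame is a toy table with invented values, the column names `T`, `RH` and `FH` are assumed to follow KNMI's abbreviations for temperature, hourly precipitation and wind speed, and `min_coverage` is a hypothetical threshold.

```python
import numpy as np
import pandas as pd

# Toy hourly observations (invented values); NaN marks a variable
# that a station does not report. Column names are assumptions.
weather = pd.DataFrame({
    "STN": [215, 215, 240, 240, 251, 251],
    "T":   [152, 148, 155, 150, 141, 139],
    "RH":  [0, 3, 0, 5, np.nan, np.nan],
    "FH":  [40, 50, 30, 40, 80, 90],
})

required = ["T", "RH", "FH"]
min_coverage = 0.9  # hypothetical: share of hours that must be present

# Share of non-missing values per station and variable
coverage = weather[required].notna().groupby(weather["STN"]).mean()

# Keep only stations that report every required variable often enough
usable = coverage.index[(coverage >= min_coverage).all(axis=1)].tolist()
print(usable)
```

A coverage threshold rather than a strict "no missing values" rule leaves room for stations with occasional sensor gaps; the right cut-off would depend on the actual data.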

# Data Pipeline

## Intended Data Analysis Pipeline

### The planned analysis will follow several stages:

##### 1. Data cleaning and preparation
- Align disruption data with the corresponding weather observations based on time and location. This should be done on an hourly basis to allow a proper comparison.
- Filter the weather data to observations from 2024.
- Select weather stations based on the chosen train trajectory.
- Exclude weather stations without rain and temperature measurements.
- Choose appropriate weather parameters to use in the analysis.
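
The hourly alignment in the first step could look roughly like the sketch below. All column names (`start_time`, `duration_minutes`, `hour`, `temperature_c`, `wind_speed_ms`) and values are hypothetical, and each disruption is assumed to already carry the code of its nearest weather station.

```python
import pandas as pd

# Toy disruption records (invented); column names are assumptions
disruptions = pd.DataFrame({
    "start_time": pd.to_datetime(["2024-01-02 07:42", "2024-01-02 09:05"]),
    "duration_minutes": [35, 120],
    "STN": [240, 240],  # nearest weather station, assumed already assigned
})

# Toy hourly weather for the same station (invented)
weather = pd.DataFrame({
    "STN": [240, 240, 240],
    "hour": pd.to_datetime(
        ["2024-01-02 07:00", "2024-01-02 08:00", "2024-01-02 09:00"]
    ),
    "temperature_c": [1.5, 1.8, 2.4],
    "wind_speed_ms": [9.0, 11.0, 12.5],
})

# Round each disruption start down to the full hour
disruptions["hour"] = disruptions["start_time"].dt.floor("h")

# Left join keeps every disruption, even if no weather row matches;
# validate guards against duplicate weather rows per station and hour
merged = disruptions.merge(
    weather, on=["STN", "hour"], how="left", validate="many_to_one"
)
print(merged[["start_time", "hour", "temperature_c", "wind_speed_ms"]])
```

In the real data the two sources may use different time zones and hour-labelling conventions, which would need to be reconciled before the merge.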


##### 2. Exploratory analysis
- Compute descriptive statistics for disruption frequency and duration.
- Visualize weather variables alongside disruptions to detect initial patterns.
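
As a rough sketch of the descriptive statistics, the snippet below counts disruptions per day and summarises their duration. The records are invented and the column names are hypothetical.

```python
import pandas as pd

# Toy disruption records (invented values, hypothetical column names)
disruptions = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2024-01-02 07:42", "2024-01-02 09:05", "2024-01-03 17:30"]
    ),
    "duration_minutes": [35, 120, 25],
})

# Overall distribution of disruption duration
summary = disruptions["duration_minutes"].describe()

# Frequency and mean duration per calendar day
daily = (
    disruptions
    .groupby(disruptions["start_time"].dt.date)["duration_minutes"]
    .agg(n_disruptions="count", mean_duration="mean")
)
print(summary)
print(daily)
```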

##### 3. Correlation analysis
- Assess associations between weather conditions and disruption frequency or duration.
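
One possible way to assess these associations is a rank correlation, which does not assume a linear relationship. The sketch below computes Spearman's correlation as the Pearson correlation of the ranks; the daily values are invented and the column names are hypothetical. The choice of Spearman is an assumption, since the plan does not name a specific measure.

```python
import pandas as pd

# Toy daily table (invented): weather summaries and disruption counts
daily = pd.DataFrame({
    "wind_speed_ms": [3, 5, 8, 12, 15, 20],
    "precip_mm":     [0, 2, 0, 10, 1, 4],
    "n_disruptions": [1, 1, 2, 4, 3, 6],
})

# Spearman correlation = Pearson correlation computed on ranks
corr = daily.rank().corr(method="pearson")

# Association of each weather variable with disruption frequency
print(corr["n_disruptions"].drop("n_disruptions"))
```

Sorting this column by absolute value would give a first indication of which weather variable is most strongly associated with disruption frequency.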

##### 4. Modeling
- Apply regression models to estimate the effect of precipitation, wind, and temperature on disruption likelihood and duration.
- Investigate seasonal and regional variation in the relationship between weather and disruptions.
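
To illustrate the regression step, the sketch below fits an ordinary least squares model of disruption duration on precipitation and wind using NumPy. The data are synthetic and noise-free, generated from known coefficients purely to show the mechanics; OLS is a stand-in here, and a model for disruption likelihood (a binary outcome) would more plausibly be a logistic regression.

```python
import numpy as np

# Synthetic predictors (invented values)
precip = np.array([0.0, 1.0, 4.0, 0.5, 6.0, 2.0])   # mm per hour
wind = np.array([3.0, 5.0, 12.0, 8.0, 15.0, 4.0])   # m/s

# Synthetic response built from known coefficients, without noise
duration = 10.0 + 2.0 * precip + 3.0 * wind          # minutes

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(precip), precip, wind])

# Least squares estimate of [intercept, beta_precip, beta_wind]
coef, *_ = np.linalg.lstsq(X, duration, rcond=None)
intercept, beta_precip, beta_wind = coef
print(intercept, beta_precip, beta_wind)
```

Because the response is constructed without noise, the fit recovers the generating coefficients; on real disruption data the estimates would come with uncertainty that needs to be reported alongside them.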

##### 5. Evaluation and visualization
- Present results using time series plots, scatterplots, and heatmaps, alongside statistical measures.


In [5]:
%run load_weather_data


In [16]:
import pandas as pd
import plotly.express as px
import os

# Build the path from its parts so it also works outside Windows
df_stations = pd.read_parquet(os.path.join(os.getcwd(), "data", "df_stations.parquet"))
print(df_stations.head())

fig = px.scatter_map(
    df_stations,
    lat="LAT(north)",
    lon="LON(east)",
    hover_name="NAME",        # Column to appear in bold in the hover tooltip
    hover_data=["STN"],       # Additional data to appear in the tooltip
    color_discrete_sequence=["blue"], # Set marker color
    zoom=7,                   # Initial map zoom level
    title="Weather Stations in the Netherlands"
)

fig.update_layout(
    map_style="open-street-map",  # px.scatter_map draws on layout.map, so the key is map_style, not mapbox_style
    margin={"r":0, "t":40, "l":0, "b":0} # Adjust margins
)

# Show the figure
fig.show()


   STN  LON(east)  LAT(north)  ALT(m)                NAME
2  215      4.437      52.141    -1.1         Voorschoten
4  235      4.781      52.928     1.2             De Kooy
5  240      4.790      52.318    -3.3            Schiphol
8  249      4.979      52.644    -2.4            Berkhout
9  251      5.346      53.392     0.7  Hoorn Terschelling


In [None]:
import pandas as pd
from geopy.distance import geodesic
import numpy as np
import os

# Sample DataFrame with station locations and weather data
#    (Replace with your actual DataFrame; it must contain the
#    'Temperature_C' and 'Wind_Speed_kmh' columns used below)
data = {
    'STN': [215, 235, 240, 249, 251, 260, 267],
    'LON(east)': [4.437, 4.781, 4.790, 4.979, 5.346, 5.180, 5.384],
    'LAT(north)': [52.141, 52.928, 52.318, 52.644, 53.392, 52.100, 52.898],
    'NAME': ['Voorschoten', 'De Kooy', 'Schiphol', 'Berkhout', 'Hoorn Terschelling', 'De Bilt', 'Stavoren'],
    'Temperature_C': [15.2, 14.8, 15.5, 14.9, 14.1, 15.6, 14.5],
    'Wind_Speed_kmh': [22, 28, 25, 26, 35, 19, 30]
}
df_stations = pd.DataFrame(data)



def find_closest_stations(input_lat, input_lon, stations_df):
    """Finds the 3 closest weather stations to a given lat/lon point."""
    input_location = (input_lat, input_lon)
    
    distances = stations_df.apply(
        lambda row: geodesic(input_location, (row['LAT(north)'], row['LON(east)'])).km,
        axis=1
    )
    
    df_with_dist = stations_df.copy()
    df_with_dist['distance_km'] = distances
    
    return df_with_dist.sort_values('distance_km').head(3)


def get_weighted_average_weather(closest_stations_df,
                                 weather_cols=('Temperature_C', 'Wind_Speed_kmh')):
    """
    Calculates the weighted average of weather data from the closest stations
    using Inverse Distance Weighting (IDW).

    Expects closest_stations_df to be sorted by 'distance_km' (nearest first),
    as returned by find_closest_stations.

    Returns:
        A dictionary with the interpolated weather data.
    """
    weather_cols = list(weather_cols)

    # Handle the edge case where the location is exactly at a station
    if closest_stations_df['distance_km'].iloc[0] < 0.01:  # less than 10 meters
        # Return the data from the nearest station directly
        return closest_stations_df.iloc[0][weather_cols].to_dict()

    # Inverse distance weights: nearer stations count more
    weights = 1 / closest_stations_df['distance_km']
    sum_of_weights = np.sum(weights)

    # Weighted average for each weather column
    return {
        col: np.sum(closest_stations_df[col] * weights) / sum_of_weights
        for col in weather_cols
    }


# --- EXAMPLE USAGE ---

# 1. Manually enter a location (e.g., Rotterdam Centraal)
my_lat = 51.9225
my_lon = 4.47917

# 2. Find the 3 closest stations
closest_stations = find_closest_stations(my_lat, my_lon, df_stations)

print("The 3 closest stations to your location are:")
print(closest_stations[['NAME', 'STN', 'distance_km', 'Temperature_C', 'Wind_Speed_kmh']])
print("-" * 50)

# 3. Calculate and display the interpolated weather data for that location
interpolated_weather = get_weighted_average_weather(closest_stations)

print("Interpolated weather data for your location:")
print(f"Estimated Temperature: {interpolated_weather['Temperature_C']:.2f}°C")
print(f"Estimated Wind Speed: {interpolated_weather['Wind_Speed_kmh']:.2f} km/h")

   STN  LON(east)  LAT(north)  ALT(m)                NAME
2  215      4.437      52.141    -1.1         Voorschoten
4  235      4.781      52.928     1.2             De Kooy
5  240      4.790      52.318    -3.3            Schiphol
8  249      4.979      52.644    -2.4            Berkhout
9  251      5.346      53.392     0.7  Hoorn Terschelling
The 3 closest stations to your location are:


KeyError: "['Temperature_C', 'Wind_Speed_kmh'] not in index"
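
The `KeyError` above is consistent with the printed frame: the stations table in use holds only `STN`, `LON(east)`, `LAT(north)`, `ALT(m)` and `NAME`, so the weather columns requested by the interpolation code are not there. A minimal sketch of one way to handle this is to join per-station observations onto the station locations first and check the columns explicitly. The `observations` frame, its values, and the helper `require_columns` are hypothetical.

```python
import pandas as pd

WEATHER_COLS = ["Temperature_C", "Wind_Speed_kmh"]

def require_columns(df, columns):
    """Raise a readable error if any expected column is missing from df."""
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}; found: {list(df.columns)}")

# Station metadata only, mirroring the printed frame (no weather columns)
stations_only = pd.DataFrame({
    "STN": [215, 240],
    "LON(east)": [4.437, 4.790],
    "LAT(north)": [52.141, 52.318],
    "NAME": ["Voorschoten", "Schiphol"],
})

# Hypothetical per-station observations for one hour (invented values)
observations = pd.DataFrame({
    "STN": [215, 240],
    "Temperature_C": [15.2, 15.5],
    "Wind_Speed_kmh": [22, 25],
})

# Attach the weather columns to the station locations before interpolating
stations_with_weather = stations_only.merge(observations, on="STN", how="inner")
require_columns(stations_with_weather, WEATHER_COLS)
print(stations_with_weather)
```

Calling `require_columns(stations_only, WEATHER_COLS)` instead would raise a `ValueError` that names the missing columns, which is easier to act on than the bare `KeyError`.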