In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print("pandas version:", pd.__version__)
#print("matplotlib.pyplot version:", plt.__version__)
print("seaborn version:", sns.__version__)

pandas version: 2.3.2
seaborn version: 0.13.2


The data comes from the Rijden de Treinen train archive: https://www.rijdendetreinen.nl/en/open-data/train-archive

### Data Dictionary

Below is the data dictionary provided by Rijden de Treinen.

Each row in these files represents a stop at a station. Each service departs from at least one station and arrives at at least one station (i.e. at least two rows). For each stop, you can find the name of the station, the arrival and departure time, delays and cancellations. The exact meaning of each column is explained below.

The source for this data is the real-time data from NS with live departure times, live arrival times and service updates. This data is also used in the app and website of Rijden de Treinen.

#### Columns
- `Service:RDT-ID` *Unique identifier (service):* This is the ID that Rijden de Treinen uses for this service. It has no useful meaning beyond uniquely identifying a single service on a single date. This ID will occur more than once in the CSV files, because each stop of a service is a separate row carrying the same service ID. There is also a column `Stop:RDT-ID` with a unique ID for each stop.
- `Service:Date` *Service date (schedule date):* The scheduled service date for this service. The service date is not always the same as the actual date. For example, a service that departs at 23:59 on 31 July and arrives at 02:00 on 1 August has a service date of 31 July. Delays do not affect the service date.
- `Service:Type` *Service type:* The service type, for example: *Intercity*, *Sprinter* or *ICE International*.
- `Service:Company` *Operator:* Company operating this service, like NS or Arriva.
- `Service:Train number` *Train number:* The train number (service number) for this service uniquely identifies this service on this date. This number is sometimes also communicated to passengers (especially for international trains). A single service may sometimes have multiple train numbers, for example when a train is split into two parts, or when a train changes its train number at a major station partway along the route.
- `Service:Completely cancelled` *Service is fully cancelled:* This column is true when all stops of this service have been cancelled. Or in other words: when the train does not run at all.
- `Service:Partly cancelled` *Partially cancelled:* This column is true when one or more stops of this service have been cancelled. Or in other words: when the train does not run on a part of the route.
- `Service:Maximum delay` *Highest delay for this service:* The highest delay (**in minutes**) of all stops of this service.
- `Stop:RDT-ID` *Unique identifier (stop):* Unique identifier for this stop. This ID is unique for each stop in the dataset. It has no further useful meaning.
- `Stop:Station code` *Station code:* Code (abbreviation) of the station name. See also the [dataset with railway stations.](https://www.rijdendetreinen.nl/en/open-data/stations)
- `Stop:Station name` *Station name:* The name of the station.
- `Stop:Arrival time` *Arrival time:* Scheduled arrival time in **RFC 3339 format.** This column is empty when no arrival was scheduled.
- `Stop:Arrival delay` *Arrival delay:* Arrival delay **in minutes.** This column is empty when no arrival was scheduled.
- `Stop:Arrival cancelled` *Cancelled arrival:* This column is true when the arrival at this stop has been cancelled. This column is empty when no arrival was scheduled.
- `Stop:Departure time` *Departure time:* Scheduled departure time in **RFC 3339 format.** This column is empty when no departure was scheduled.
- `Stop:Departure delay` *Departure delay:* Departure delay **in minutes.** This column is empty when no departure was scheduled.
- `Stop:Departure cancelled` *Cancelled departure:* This column is true when the departure at this stop has been cancelled. This column is empty when no departure was scheduled.
- `Stop:Platform change` *Platform change:* This column is true when the platform of this stop has changed from the planned platform.
- `Stop:Planned platform` *Scheduled platform:* The originally scheduled platform for this service.
- `Stop:Actual platform` *Actual platform:* The platform that was actually used for this service.
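
The time columns above are described as RFC 3339 strings that are empty when nothing was scheduled. As a minimal sketch of how such values could be parsed with pandas, the snippet below uses a made-up two-stop service modelled on the 23:59 / 02:00 example from the `Service:Date` entry. The `sample` frame and its timestamps are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical two-stop service (values are made up for illustration):
# departs 23:59 on 31 July, arrives 02:00 on 1 August, local offset +02:00.
sample = pd.DataFrame({
    "Service:Date": ["2024-07-31", "2024-07-31"],
    "Stop:Arrival time": [None, "2024-08-01T02:00:00+02:00"],
    "Stop:Departure time": ["2024-07-31T23:59:00+02:00", None],
})

# RFC 3339 strings carry a UTC offset; utc=True converts them all to UTC.
# Empty values become NaT (pandas' missing-timestamp marker).
for col in ["Stop:Arrival time", "Stop:Departure time"]:
    sample[col] = pd.to_datetime(sample[col], utc=True)

print(sample.dtypes)
print(sample)
```

After parsing, both time columns are timezone-aware, which matches the `datetime64[ns, UTC]` dtype that `df.info()` reports for the parquet file used in this notebook.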

In [5]:
df = pd.read_parquet('Data/version_3_traject_Eindhoven_sittard_2019_2025.parquet')
df.head()

Unnamed: 0,Service:RDT-ID,Service:Date,Service:Type,Service:Company,Service:Train number,Service:Completely cancelled,Service:Partly cancelled,Service:Maximum delay,Stop:RDT-ID,Stop:Station code,...,Stop:Arrival time,Stop:Arrival delay,Stop:Arrival cancelled,Stop:Departure time,Stop:Departure delay,Stop:Departure cancelled,Stop:Platform change,Stop:Planned platform,Stop:Actual platform,sort_time
0,738846,2019-01-01,Intercity,NS,2925,False,False,5,6220367,EKZ,...,NaT,,,2019-01-01 05:39:00+00:00,0.0,False,False,1,1,2019-01-01 05:39:00+00:00
1,738846,2019-01-01,Intercity,NS,2925,False,False,0,6220388,BKF,...,2019-01-01 05:43:00+00:00,0.0,False,2019-01-01 05:43:00+00:00,0.0,False,False,1,1,2019-01-01 05:43:00+00:00
2,738846,2019-01-01,Intercity,NS,2925,False,False,0,6220405,BKG,...,2019-01-01 05:45:00+00:00,0.0,False,2019-01-01 05:47:00+00:00,0.0,False,False,2,2,2019-01-01 05:45:00+00:00
3,738846,2019-01-01,Intercity,NS,2925,False,False,0,6220435,HKS,...,2019-01-01 05:52:00+00:00,0.0,False,2019-01-01 05:52:00+00:00,0.0,False,False,2,2,2019-01-01 05:52:00+00:00
4,738846,2019-01-01,Intercity,NS,2925,False,False,0,6220479,HNK,...,2019-01-01 05:58:00+00:00,1.0,False,2019-01-01 05:59:00+00:00,0.0,False,False,1,1,2019-01-01 05:58:00+00:00


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2855495 entries, 0 to 2855494
Data columns (total 21 columns):
 #   Column                        Dtype              
---  ------                        -----              
 0   Service:RDT-ID                int64              
 1   Service:Date                  object             
 2   Service:Type                  object             
 3   Service:Company               object             
 4   Service:Train number          int64              
 5   Service:Completely cancelled  bool               
 6   Service:Partly cancelled      bool               
 7   Service:Maximum delay         int64              
 8   Stop:RDT-ID                   int64              
 9   Stop:Station code             object             
 10  Stop:Station name             object             
 11  Stop:Arrival time             datetime64[ns, UTC]
 12  Stop:Arrival delay            float64            
 13  Stop:Arrival cancelled        object             
 14  St

In [13]:
df.isnull().sum()

Service:RDT-ID                       0
Service:Date                         0
Service:Type                         0
Service:Company                      0
Service:Train number                 0
Service:Completely cancelled         0
Service:Partly cancelled             0
Service:Maximum delay                0
Stop:RDT-ID                          0
Stop:Station code                    0
Stop:Station name                    0
Stop:Arrival time               190474
Stop:Arrival delay              190474
Stop:Arrival cancelled          190474
Stop:Departure time             190506
Stop:Departure delay            190506
Stop:Departure cancelled        190506
Stop:Platform change                 0
Stop:Planned platform           333433
Stop:Actual platform            333433
sort_time                            0
dtype: int64

We know from previous analysis that the arrival time is not filled in when the station is the first stop of a service, and the departure time is not filled in when it is the last stop.
This means we cannot drop the rows without an arrival time, because they still contain valuable information about the departure.
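
The relationship between missing times and the position of a stop within its service could be checked with a sketch like the one below. It runs on a hypothetical toy frame (`toy`), and the helper columns `stop_position` and `stops_in_service` are names introduced here for illustration. It also assumes the rows are already ordered by time within each service.

```python
import pandas as pd

# Hypothetical miniature of the dataset: two services with three stops each.
toy = pd.DataFrame({
    "Service:RDT-ID": [1, 1, 1, 2, 2, 2],
    "Stop:Arrival time": pd.to_datetime(
        [None, "2019-01-01 05:43", "2019-01-01 05:52",
         None, "2019-01-01 06:43", "2019-01-01 06:52"], utc=True),
    "Stop:Departure time": pd.to_datetime(
        ["2019-01-01 05:39", "2019-01-01 05:43", None,
         "2019-01-01 06:39", "2019-01-01 06:43", None], utc=True),
})

# Position of each stop within its service (0 = first stop) and the number
# of stops per service. Assumes rows are sorted by time within a service.
groups = toy.groupby("Service:RDT-ID")
toy["stop_position"] = groups.cumcount()
toy["stops_in_service"] = groups["Service:RDT-ID"].transform("size")

is_first = toy["stop_position"] == 0
is_last = toy["stop_position"] == toy["stops_in_service"] - 1

# True when the missing values line up exactly with first / last stops.
arrival_missing_only_at_first = (toy["Stop:Arrival time"].isna() == is_first).all()
departure_missing_only_at_last = (toy["Stop:Departure time"].isna() == is_last).all()

print(arrival_missing_only_at_first, departure_missing_only_at_last)
```

On the real data the two checks may not hold for every row, for instance if the file only covers part of each route, so any mismatching rows would be worth inspecting before imputing.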

We also cannot set the missing times to 0, because I plan to do calculations with these values.

One option is to set the arrival time equal to the departure time whenever the arrival time is not filled in, and to set the arrival delay to 0 minutes. The same would apply in reverse when the departure time is not filled in.
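
A minimal sketch of this fill strategy, shown on a hypothetical three-stop toy frame rather than the full dataset. The flag columns `arrival_imputed` and `departure_imputed` are an assumption added here so that the filled-in values can still be told apart from real observations later.

```python
import pandas as pd

# Hypothetical three-stop service: no arrival at the first stop,
# no departure at the last stop.
toy = pd.DataFrame({
    "Stop:Arrival time": pd.to_datetime(
        [None, "2019-01-01 05:43", "2019-01-01 05:52"], utc=True),
    "Stop:Arrival delay": [float("nan"), 0.0, 1.0],
    "Stop:Departure time": pd.to_datetime(
        ["2019-01-01 05:39", "2019-01-01 05:43", None], utc=True),
    "Stop:Departure delay": [0.0, 2.0, float("nan")],
})

filled = toy.copy()

# Remember which values were missing before filling (illustrative column names).
filled["arrival_imputed"] = toy["Stop:Arrival time"].isna()
filled["departure_imputed"] = toy["Stop:Departure time"].isna()

# Copy the other timestamp into the gap; both fills read from the original
# frame, so a row missing both times would simply stay missing.
filled["Stop:Arrival time"] = toy["Stop:Arrival time"].fillna(toy["Stop:Departure time"])
filled["Stop:Departure time"] = toy["Stop:Departure time"].fillna(toy["Stop:Arrival time"])

# Set the delay to 0 minutes only where the time itself was filled in.
filled.loc[filled["arrival_imputed"], "Stop:Arrival delay"] = 0.0
filled.loc[filled["departure_imputed"], "Stop:Departure delay"] = 0.0

print(filled)
```

Keeping the flags means that statistics such as the mean arrival delay can still exclude the filled-in zeros if they turn out to bias the result.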