🚆 Analysis of delays of the Deutsche Bahn (DB)

Splines/deutsche-bahn-analysis

Deutsche Bahn* Analysis

Analyzing delays of the Deutsche Bahn (DB) in the course of 2023.

⚠️ At the moment, this repo is primarily a playground for a fun little data science project. I usually make an effort to write clean, well-documented code, but those standards will suffer here because the aim is to "get things done" rather than to write the best code. Since this is a playground, things may also change frequently.

(*) This project is in no way endorsed by Deutsche Bahn.

What is this?

In my experience, DB trains arrive on schedule most of the time. Still, there are always one or two situations with a huge delay, or with a small delay that is just big enough to make you miss your connection. David Kriesel conducted an extensive study of DB delays over the course of 2019 and presented his results in a very engaging way at the Chaos Computer Club (CCC) (see this video).

I wanted to do my own analysis of the DB, especially for the region I live in, including all regional trains. First, I collect the data from the Bahn API and simply dump the XML files on the filesystem. Data analysis is a separate, later step, in which I will probably read in all XML files, organize them in a database and draw conclusions (literally "draw" some nice plots 😅).
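The two-step approach described above (dump raw XML first, parse later) could look roughly like the following sketch. The directory layout, the function names and the station identifier are assumptions made for illustration, not the layout used in `src/main.py`; the actual API request is left out on purpose.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from pathlib import Path


def dump_path(root: Path, station_id: str, fetched_at: datetime) -> Path:
    """Build root/<station_id>/<UTC timestamp>.xml (this layout is an assumption)."""
    stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return root / station_id / f"{stamp}.xml"


def dump_xml(payload: str, root: Path, station_id: str) -> Path:
    """Step 1: store the raw API response untouched, one file per request."""
    target = dump_path(root, station_id, datetime.now(timezone.utc))
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(payload, encoding="utf-8")
    return target


def load_all(root: Path) -> list[ET.Element]:
    """Step 2 (later): parse every dumped file, skipping malformed ones."""
    roots = []
    for path in sorted(root.rglob("*.xml")):
        try:
            roots.append(ET.parse(path).getroot())
        except ET.ParseError:
            continue  # truncated or malformed dump; ignore it during analysis
    return roots
```

Keeping the raw files untouched means the parsing and database schema can change later without having to collect the data again.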

Some questions I hope to answer:

  • At which time of the day are there the fewest delays per train connection?
  • Given any train connection: what is the probability that it is delayed by 1, 2, ..., 30 minutes?
  • At which time of the year are there the fewest delays? -> Histogram
  • How late are trains on average & what delay is most common? (e.g. result could be: 1-minute delays are most common)
  • Which stations have the most delays overall? (weighted, of course, by the total number of trains passing through that station)
  • At which station do trains "collect" most delays? -> These are the stations you have to keep your fingers crossed.
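As a rough illustration of how some of these questions might be answered once the delays are available as plain numbers, here is a minimal sketch. It assumes the delays of one connection have already been extracted as whole minutes, and it reads "delayed by n minutes" as "delayed by at least n minutes"; both are assumptions, and `delay_stats` is a hypothetical helper, not part of this repository.

```python
from __future__ import annotations

from collections import Counter
from statistics import mean


def delay_stats(delays_min: list[int], max_threshold: int = 30) -> dict:
    """Summarise delays (in minutes) of a single train connection."""
    if not delays_min:
        raise ValueError("need at least one observation")
    n = len(delays_min)
    # Empirical probability of being at least t minutes late, for t = 1..max_threshold.
    p_at_least = {
        t: sum(d >= t for d in delays_min) / n
        for t in range(1, max_threshold + 1)
    }
    # Most frequent delay value (ties resolve to the value seen first).
    most_common_delay, _ = Counter(delays_min).most_common(1)[0]
    return {
        "mean": mean(delays_min),
        "mode": most_common_delay,
        "p_at_least": p_at_least,
    }


# Made-up sample data, for illustration only.
stats = delay_stats([0, 0, 1, 1, 1, 2, 5, 12])
print(stats["mean"], stats["mode"], stats["p_at_least"][5])
```

With real data, the same function could be applied per connection, per hour of the day or per station to compare groups.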

Additional resources

Setup

Create a .env file with the following entries in the root of the cloned repository. You get the credentials by registering (for free) at the DB API Marketplace.

CLIENT_ID=...
CLIENT_KEY=...
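For illustration, the two entries could be read with nothing but the standard library, as in the sketch below. The repository may well use a dedicated package for this instead; `parse_env` and `load_credentials` are hypothetical names.

```python
from __future__ import annotations

from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are ignored."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_credentials(path: str = ".env") -> tuple[str, str]:
    """Return (CLIENT_ID, CLIENT_KEY) or fail loudly if one is missing."""
    env = parse_env(Path(path).read_text(encoding="utf-8"))
    missing = [k for k in ("CLIENT_ID", "CLIENT_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing entries in {path}: {', '.join(missing)}")
    return env["CLIENT_ID"], env["CLIENT_KEY"]
```

Failing early on a missing entry is useful here because the collector is meant to run unattended for a long time.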

Run

The script is intended to be run indefinitely in the background on a server.

python3 ./src/main.py

Useful scripts

python3 ./src/util/decompress.py
python3 ./src/util/date_converter.py

Monitoring (Prometheus)

Monitoring directory sizes with Textfile Collector

Adapted from this great article using the Textfile Collector from the node_exporter.

"To use it, set the --collector.textfile.directory flag on the node_exporter commandline. The collector will parse all files in that directory matching the glob *.prom using the text format."

./node_exporter --collector.textfile.directory /root/monitoring/node_exporter/textfile_collector/

The cron job I used, placed in /etc/cron.d/: directory_size
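The idea behind such a job can be sketched in Python as follows: measure the directories, render the result in the Prometheus text format, and write it to a `*.prom` file inside the textfile directory. The metric name `node_directory_size_bytes`, the label name and the function names are assumptions for this sketch and need not match the linked cron job.

```python
from __future__ import annotations

import os
from pathlib import Path


def directory_size_bytes(path: Path) -> int:
    """Sum the sizes of all regular files below path."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def render_metrics(sizes: dict[str, int]) -> str:
    """Render one gauge sample per directory in the Prometheus text format."""
    lines = [
        "# HELP node_directory_size_bytes Disk space used by a directory.",
        "# TYPE node_directory_size_bytes gauge",
    ]
    for directory, size in sorted(sizes.items()):
        # Backslashes and double quotes must be escaped inside label values.
        label = directory.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'node_directory_size_bytes{{directory="{label}"}} {size}')
    return "\n".join(lines) + "\n"


def write_prom_file(text: str, target: Path) -> None:
    """Write via a temp file and rename so a half-written file is never scraped."""
    tmp = target.with_name(target.name + ".tmp")  # does not match the *.prom glob
    tmp.write_text(text, encoding="utf-8")
    os.replace(tmp, target)
```

A scheduled run would call `directory_size_bytes` for each directory of interest and pass the rendered text to `write_prom_file` with a target inside the directory given to `--collector.textfile.directory`.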

Prometheus useful commands

Start Prometheus

./prometheus --config.file=./prometheus.yml --web.config.file=./web.yml --web.enable-admin-api

Create snapshot
For this, the Admin API has to be enabled. To do so, pass --web.enable-admin-api as a command-line flag when starting Prometheus.

POST /api/v1/admin/tsdb/snapshot

Snapshots are located in the Prometheus folder under data/snapshots/.
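The snapshot call can also be made from Python with the standard library. This is a sketch under assumptions: Prometheus is taken to listen on `http://localhost:9090` without authentication, which may not hold if `web.yml` configures TLS or basic auth, and the function names are hypothetical.

```python
from __future__ import annotations

import json
import urllib.request


def snapshot_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for the TSDB snapshot endpoint."""
    url = base_url.rstrip("/") + "/api/v1/admin/tsdb/snapshot"
    return urllib.request.Request(url, method="POST")


def snapshot_name(response_body: bytes) -> str:
    """Extract the snapshot directory name from the JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {payload}")
    return payload["data"]["name"]


def create_snapshot(base_url: str = "http://localhost:9090") -> str:
    """Trigger a snapshot and return its name (a folder under data/snapshots/)."""
    with urllib.request.urlopen(snapshot_request(base_url)) as response:
        return snapshot_name(response.read())
```

The returned name is the subfolder of data/snapshots/ that holds the new snapshot.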

Services
Put the Linux service files (see the monitoring folder) in /etc/systemd/system/. Do NOT use ~ in service files!
