This system polls the Pubtrans database for updates (departures and arrivals) and converts them to GTFS Realtime messages using a pipeline built with Apache Pulsar. The final step outputs the messages via an MQTT broker.
The general usage pattern is to build the Docker images and then run them with docker-compose. Services are separated into subfolders, each containing the source code and the Dockerfile.
The /bin folder contains scripts to launch Docker images for Pulsar and Redis, which are requirements for some of the services.
The overall requirements for running the system are:
- Connection to a Pubtrans SQL Server database
- Connection to an MQTT broker
System Architecture & Components
Components are stored in their own GitHub repositories:
- transitdata-common contains generic components and shared constants.
- transitdata-cache-bootstrapper fills the Redis cache with journey metadata for the next step.
- transitdata-pubtrans-source polls the Pubtrans database for changes and publishes the events to Pulsar as "raw-data".
- transitdata-stop-estimates creates higher-level data (StopEstimates) from the raw-data, abstracting away the data source (bus, metro, train).
- transitdata-omm-cancellation-source reads the OMM database and generates TripUpdate trip cancellations.
- transitdata-hslalert-source reads trip cancellations from the HSL public HTML API and generates TripUpdate cancellations. It exists to support a transition period and will be removed in the near future.
- transitdata-tripupdate-processor reads the estimates and cancellations, generates GTFS-RT messages, and publishes them to Pulsar.
- transitdata-omm-alert-source reads the OMM database and generates internal service alert messages.
- transitdata-alert-processor reads internal service alert messages and generates GTFS-RT service alerts.
- pulsar-mqtt-gateway routes Pulsar messages to an MQTT broker.
- transitdata-gtfsrt-full-publisher publishes the full GTFS-RT dataset based on the GTFS-RT TripUpdate topic.
- transitdata-pulsar-monitoring produces information about the current state of the pipeline for monitoring purposes.
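The data flow described by the list above can be sketched as a small directed graph. This is only an illustration: the edges are read off the component descriptions, the link from transitdata-alert-processor to pulsar-mqtt-gateway is an assumption, and the `PIPELINE` and `downstream` names are hypothetical and do not exist in any of the repositories.

```python
from collections import deque

# Edges inferred from the component descriptions; treat them as an approximation.
PIPELINE = {
    "transitdata-pubtrans-source": ["transitdata-stop-estimates"],
    "transitdata-stop-estimates": ["transitdata-tripupdate-processor"],
    "transitdata-omm-cancellation-source": ["transitdata-tripupdate-processor"],
    "transitdata-hslalert-source": ["transitdata-tripupdate-processor"],
    "transitdata-tripupdate-processor": [
        "pulsar-mqtt-gateway",
        "transitdata-gtfsrt-full-publisher",
    ],
    "transitdata-omm-alert-source": ["transitdata-alert-processor"],
    # Assumption: alerts are also routed to MQTT through the gateway.
    "transitdata-alert-processor": ["pulsar-mqtt-gateway"],
}


def downstream(component):
    """Return every component reachable from `component`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for nxt in PIPELINE.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    for name in downstream("transitdata-pubtrans-source"):
        print(name)
```

Calling `downstream` on a source component lists the services that would be affected if that source stopped producing data.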
Transitlog HFP components
- transitlog-alert-sink inserts service alerts into PostgreSQL.
- transitlog-hfp-sink inserts HFP data from Pulsar into TimescaleDB.
- transitdata-hfp-parser parses the raw MQTT topic and payload into Protobuf with the HFP schema.
- transitdata-hfp-deduplicator deduplicates data read from Pulsar topic(s).
- mqtt-pulsar-gateway reads data from an MQTT topic and feeds it into a Pulsar topic. This application does not inspect the payload; it just transfers the bytes.
All components in this project use semver, and the output always conforms to the GTFS Realtime standard. Vendor-specific extensions might be added, and these require incrementing the major version. Otherwise, new features should increment only the minor version, although exceptions might arise. The TripUpdate and ServiceAlert APIs are versioned independently.
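The versioning rule above can be sketched as a small helper. This is a minimal illustration, not project tooling: the `next_version` function and the change labels are hypothetical, and the "fix" case follows general semver convention rather than anything stated here.

```python
def next_version(version, change):
    """Return the next semver string for a given kind of change.

    `change` is a hypothetical label: "vendor-extension", "feature", or "fix".
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "vendor-extension":
        # Vendor-specific extensions require a major version bump.
        return f"{major + 1}.0.0"
    if change == "feature":
        # New features normally bump only the minor version.
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        # Assumption: bug fixes bump the patch version, as in standard semver.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


if __name__ == "__main__":
    print(next_version("1.4.2", "vendor-extension"))
    print(next_version("1.4.2", "feature"))
```

Because the TripUpdate and ServiceAlert APIs are versioned independently, each would carry its own version string.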
Pulsar appears to add approximately 5 ms of latency to each message, which is consistent with what the Pulsar project promises. The latency is not a problem in itself and is well within acceptable bounds. However, because a single-threaded consumer-producer loop handles one message at a time, it can process at most about 200 messages per second (1000 ms / 5 ms).
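The arithmetic behind the 200 messages per second ceiling can be checked with a short sketch. The `max_throughput` function and its `in_flight` parameter are hypothetical; the second call assumes that several messages could be in flight at once, which is a general queueing estimate and not a measured result for this pipeline.

```python
def max_throughput(latency_ms, in_flight=1):
    """Upper bound on messages per second.

    latency_ms: time each message spends waiting on Pulsar, in milliseconds.
    in_flight:  hypothetical number of messages processed concurrently.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    if in_flight < 1:
        raise ValueError("in_flight must be at least 1")
    return 1000 * in_flight / latency_ms


if __name__ == "__main__":
    # Single-threaded loop: one message at a time, 5 ms each.
    print(max_throughput(5))
    # Assumption: ten messages in flight at once raises the ceiling tenfold.
    print(max_throughput(5, in_flight=10))
```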