DR: Operational Dashboard Structure for Watchdog Metrics
Watchdog exposes metrics related to:
- API availability
- static GTFS integrity
- realtime feed health
- realtime/static matching quality
- vehicle anomalies
- debugging diagnostics
As the number of metrics grows, organizing them by source package or implementation area makes the dashboard difficult to navigate operationally.
This DR defines how the Grafana dashboard should be structured and why metrics are grouped the way they are.
The dashboard will be designed around the operational observer's workflow, with the aim of making the observer's life easier.
The dashboard will be a single, vertically scrollable Grafana dashboard composed of multiple sections and panels. It will NOT use separate tabs/pages for each category.
The goal is to allow operators to investigate incidents in a natural top-to-bottom flow without navigating between dashboards.
Dashboard
│
├── System Availability
│ ├── API Status
│ └── Dependency Latency
│
├── Static GTFS Health
│ ├── Feed Expiration
│ └── Agency Consistency
│
├── Realtime Feed Health
│ ├── Vehicle Volume
│ ├── Feed Freshness
│ ├── Update Throughput
│ └── API ↔ GTFS-RT Consistency
│
├── Matching Quality
│ ├── Trip Matching
│ ├── Stop Matching
│ └── Coverage Summary
│
├── Vehicle Anomalies
│ ├── Speed Validation
│ └── Spatial Validation
│
└── Deep Diagnostics
├── Unmatched Stop Clusters
└── Unmatched Stop Locations
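As a rough illustration of the single-page layout, the sketch below turns the tree above into a flat Grafana panel list with one row panel per section. It assumes Grafana's dashboard JSON model (a 24-column grid, `gridPos`, and panels of type `row`); the dashboard title, panel sizes, and the `timeseries` panel type are placeholders chosen for the example, not decisions made by this DR.

```python
# Section and panel titles copied from the tree above.
LAYOUT = {
    "System Availability": ["API Status", "Dependency Latency"],
    "Static GTFS Health": ["Feed Expiration", "Agency Consistency"],
    "Realtime Feed Health": [
        "Vehicle Volume",
        "Feed Freshness",
        "Update Throughput",
        "API ↔ GTFS-RT Consistency",
    ],
    "Matching Quality": ["Trip Matching", "Stop Matching", "Coverage Summary"],
    "Vehicle Anomalies": ["Speed Validation", "Spatial Validation"],
    "Deep Diagnostics": ["Unmatched Stop Clusters", "Unmatched Stop Locations"],
}

GRID_WIDTH = 24   # Grafana's dashboard grid is 24 columns wide
PANEL_WIDTH = 12  # assumption: two panels side by side
PANEL_HEIGHT = 8  # assumption: arbitrary height in grid units


def build_panels(layout):
    """Return a flat panel list: one row panel per section, then its panels."""
    panels = []
    y = 0
    next_id = 1
    per_line = GRID_WIDTH // PANEL_WIDTH
    for section, titles in layout.items():
        panels.append({
            "id": next_id, "type": "row", "title": section, "collapsed": False,
            "gridPos": {"h": 1, "w": GRID_WIDTH, "x": 0, "y": y},
        })
        next_id += 1
        y += 1
        for i, title in enumerate(titles):
            col = i % per_line
            if i > 0 and col == 0:
                y += PANEL_HEIGHT  # wrap to the next line of panels
            panels.append({
                "id": next_id, "type": "timeseries", "title": title,
                "gridPos": {"h": PANEL_HEIGHT, "w": PANEL_WIDTH,
                            "x": col * PANEL_WIDTH, "y": y},
            })
            next_id += 1
        y += PANEL_HEIGHT
    return panels


# "Watchdog Operations" is a hypothetical dashboard title.
dashboard = {"title": "Watchdog Operations", "panels": build_panels(LAYOUT)}
```

Serializing `dashboard` with `json.dumps` would give a skeleton that still needs data sources and queries before it is usable in Grafana.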
Dashboard: The entire Grafana page.
Section: A logical grouping of related panels inside the dashboard.
Panel: An individual visualization such as a graph, stat, table, heatmap, or geomap. Each panel should answer one focused operational question.
Section: System Availability
Answers:
“Is the system reachable and operational?”
Panel: API Status
Metrics:
oba_api_status
Question answered:
“Is the API up?”
Panel: Dependency Latency
Metrics:
http_outgoing_request_duration_seconds
Question answered:
“Are external dependencies becoming slow?”
Availability and dependency latency are the first things operators check during incidents. This section is intentionally placed at the top of the dashboard for fast visibility.
Section: Static GTFS Health
Answers:
“Is the static GTFS dataset valid and up to date?”
Panel: Feed Expiration
Metrics:
gtfs_bundle_days_until_earliest_expiration
gtfs_bundle_days_until_latest_expiration
Question answered:
“Are GTFS bundles approaching expiration?”
Panel: Agency Consistency
Metrics:
oba_agencies_in_static_gtfs
oba_agencies_in_coverage_endpoint
oba_agencies_match
Question answered:
“Does the coverage endpoint match the static GTFS dataset?”
These metrics describe static dataset integrity rather than runtime behavior. Keeping them separate from realtime metrics helps operators distinguish configuration/data problems from realtime ingestion problems.
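One way the Feed Expiration question could be shown is a stat panel whose color changes as expiration approaches. The sketch below builds a threshold block in the shape Grafana uses under `fieldConfig.defaults.thresholds`. The 7-day and 30-day cutoffs are hypothetical values chosen for the example, and `color_for` is only a local helper that mimics how absolute thresholds resolve.

```python
def expiration_thresholds(critical_days=7, warning_days=30):
    """Threshold steps for a days-until-expiration value (cutoffs are assumptions)."""
    return {
        "mode": "absolute",
        "steps": [
            {"color": "red", "value": None},              # base step: below critical_days
            {"color": "orange", "value": critical_days},  # critical_days and above
            {"color": "green", "value": warning_days},    # warning_days and above
        ],
    }


def color_for(value, thresholds):
    """Return the color of the highest step whose value is <= the given value."""
    color = thresholds["steps"][0]["color"]
    for step in thresholds["steps"][1:]:
        if value >= step["value"]:
            color = step["color"]
    return color


thresholds = expiration_thresholds()
```

With these assumed cutoffs, a bundle with 3 days left resolves to red, 10 days to orange, and 45 days to green.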
Section: Realtime Feed Health
Answers:
“Are realtime feeds alive, updating, and populated?”
Panel: Vehicle Volume
Metrics:
realtime_vehicle_positions_count_gtfs_rt
vehicle_count_api
gtfs_rt_tracked_vehicles_count
Question answered:
“Is realtime coverage dropping?”
Panel: Feed Freshness
Metrics:
vehicle_position_report_interval_seconds
oba_time_since_last_update_seconds
Question answered:
“How stale is realtime data?”
Panel: Update Throughput
Metrics:
vehicle_report_total
Question answered:
“Are realtime updates continuously flowing?”
Panel: API ↔ GTFS-RT Consistency
Metrics:
vehicle_count_match
Question answered:
“Does the API reflect GTFS-RT correctly?”
Realtime health is the primary operational concern of the system. The metrics are separated into coverage, freshness, throughput, and consistency because these represent different failure modes.
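To make the four failure modes concrete, the sketch below maps each one to a candidate query. It assumes the metrics are scraped by Prometheus and queried with PromQL, that `vehicle_report_total` is a counter, and that aggregating by `server_id` is appropriate. The aggregation functions and the 5-minute rate window are illustrative choices, not part of this DR.

```python
RATE_WINDOW = "5m"  # assumption: illustrative rate window

# One candidate PromQL expression per failure mode (aggregations are assumptions).
FAILURE_MODE_QUERIES = {
    "coverage": "sum by (server_id) (realtime_vehicle_positions_count_gtfs_rt)",
    "freshness": "max by (server_id) (oba_time_since_last_update_seconds)",
    "throughput": f"sum by (server_id) (rate(vehicle_report_total[{RATE_WINDOW}]))",
    "consistency": "vehicle_count_match",
}


def panel_target(failure_mode, ref_id="A"):
    """Build a Grafana Prometheus query target for one failure mode."""
    return {"expr": FAILURE_MODE_QUERIES[failure_mode], "refId": ref_id}


targets = {mode: panel_target(mode) for mode in FAILURE_MODE_QUERIES}
```

Keeping one expression per failure mode matches the panel split: a drop in the coverage query with a healthy throughput query points to a different problem than the reverse.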
Section: Matching Quality
Answers:
“Is realtime data correctly matching scheduled/static data?”
Panel: Trip Matching
Metrics:
oba_realtime_records_total
oba_realtime_trips_matched_count
oba_realtime_trips_unmatched_count
oba_realtime_trip_match_ratio
Question answered:
“How much realtime trip data is usable?”
Panel: Stop Matching
Metrics:
oba_stops_matched_count
oba_stops_unmatched_count
oba_stop_match_ratio
Question answered:
“Are stop mappings degrading?”
Panel: Coverage Summary
Metrics:
oba_agencies_with_coverage_count
oba_scheduled_trips_count
Question answered:
“What is the overall realtime coverage footprint?”
A realtime feed may still be reachable, fresh, and populated while being operationally unusable because matching is failing. This section isolates semantic correctness from feed health.
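A small worked example shows why a populated feed can still be unusable. The sketch assumes a match ratio is computed as matched / (matched + unmatched); how Watchdog actually derives `oba_realtime_trip_match_ratio` is not stated here, and the 0.9 usability cutoff is a hypothetical value.

```python
def match_ratio(matched, unmatched):
    """Fraction of records that matched, or None when there is nothing to match."""
    total = matched + unmatched
    if total == 0:
        return None
    return matched / total


def feed_is_usable(matched, unmatched, min_ratio=0.9):
    """True when the match ratio meets the (assumed) minimum."""
    ratio = match_ratio(matched, unmatched)
    return ratio is not None and ratio >= min_ratio


# A feed reporting 500 trips looks healthy by volume,
# but only 20 of them match the static schedule.
ratio = match_ratio(matched=20, unmatched=480)
usable = feed_is_usable(matched=20, unmatched=480)
```

Here the feed would pass a volume check with 500 records, yet only 4% of them match, so it fails the usability check.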
Section: Vehicle Anomalies
Answers:
“Are vehicles behaving realistically?”
Panel: Speed Validation
Metrics:
gtfs_rt_vehicle_computed_speed
gtfs_rt_vehicle_speed_discrepancy_ratio
Question answered:
“Are vehicles reporting unrealistic speeds?”
Panel: Spatial Validation
Metrics:
gtfs_rt_invalid_vehicle_coordinates
gtfs_rt_stopped_out_of_bounds_vehicles
Question answered:
“Are vehicle coordinates geographically valid?”
These metrics are diagnostic and anomaly-oriented. They are intentionally separated from core health indicators to reduce dashboard noise and cognitive load.
Section: Deep Diagnostics
Answers:
“Where exactly are matching failures occurring?”
Panel: Unmatched Stop Clusters
Metrics:
oba_unmatched_stop_cluster_count
Question answered:
“Where are failures concentrated?”
Panel: Unmatched Stop Locations
Metrics:
oba_unmatched_stop_location
Question answered:
“Which exact stops are failing?”
These metrics are intended for investigation after a higher-level issue has already been detected. They are placed at the bottom of the dashboard because they are highly detailed and debugging-oriented.
Sections such as Vehicle Anomalies and Deep Diagnostics contain metrics that are often:
- high-cardinality (e.g., per vehicle, per stop, per location)
- noisy when visualized at aggregate level
- more useful for slicing and investigation than for continuous monitoring
These metrics are primarily introduced for alerting and anomaly detection, where thresholds or sudden deviations can trigger operational alerts.
In the context of this dashboard, they are included only when:
- a meaningful aggregation or grouping exists (e.g., max, ratio, clustered counts), or
- they support investigation after an alert has already been triggered
They are not intended to be primary “at-a-glance” health indicators, but rather:
- alerting signals for automated detection
- investigative tools when drilling into system issues
If a suitable visualization is available (e.g., clustering, aggregation, or bounded summary view), these metrics can be surfaced in panels to support debugging workflows. Otherwise, they should remain primarily alert-driven rather than dashboard-driven.
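The "meaningful aggregation" condition can be illustrated with a minimal sketch that collapses per-vehicle samples into one bounded value per server. The sample shape and the `vehicle_id` label are assumptions made for the example; only `server_id` is named in this document.

```python
def max_by_server(samples):
    """Collapse high-cardinality samples into a single maximum per server_id."""
    result = {}
    for sample in samples:
        server = sample["server_id"]
        value = sample["value"]
        if server not in result or value > result[server]:
            result[server] = value
    return result


# Hypothetical per-vehicle speed samples (one series per vehicle).
samples = [
    {"server_id": "1", "vehicle_id": "a", "value": 12.5},
    {"server_id": "1", "vehicle_id": "b", "value": 61.0},
    {"server_id": "2", "vehicle_id": "c", "value": 9.0},
]

summary = max_by_server(samples)
```

A panel built on the summary has one series per server rather than one per vehicle, which keeps it readable while still surfacing the outlier on server `1`.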
Most metrics are grouped by server_id.
To avoid noisy visualizations:
- the dashboard should use Grafana variables/dropdowns
- operators should be able to filter by server and agency
Example:
Server: [ All ▼ ]
Agency: [ All ▼ ]
This allows one reusable dashboard focused on per-server investigation, while also supporting aggregated global monitoring when needed.
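The two dropdowns could be declared as Grafana query variables, sketched below. This assumes a Prometheus data source (whose variable queries support `label_values(metric, label)`) and uses the string form of the variable query. The `agency_id` label name and the choice of `oba_api_status` as the lookup metric are assumptions; this document only names `server_id`.

```python
def query_variable(name, display_label, metric, label_name):
    """Build a multi-select Grafana query variable with an 'All' option."""
    return {
        "name": name,
        "label": display_label,
        "type": "query",
        "query": f"label_values({metric}, {label_name})",
        "includeAll": True,
        "multi": True,
    }


def with_filters(metric):
    """Append label matchers that follow the Server and Agency dropdowns."""
    return f'{metric}{{server_id=~"$server", agency_id=~"$agency"}}'


templating = {
    "list": [
        query_variable("server", "Server", "oba_api_status", "server_id"),
        # "agency_id" is a hypothetical label name.
        query_variable("agency", "Agency", "oba_api_status", "agency_id"),
    ]
}

example_expr = with_filters("oba_api_status")
```

Using the regex matcher `=~` lets the same expression serve both a single selected server and the "All" option, which is what makes one reusable dashboard possible.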
- Metrics are grouped by observer workflow, incident investigation flow, and operational semantics rather than by implementation structure.
- Each panel should answer one question or a very small set of tightly related questions to improve readability and reduce cognitive load.
- Top sections focus on availability, freshness, and consistency, while lower sections focus on anomalies, debugging, and forensic investigation to create a natural incident investigation flow.
- Dashboard Flow Mirrors Incident Investigation. The dashboard is intentionally ordered to guide operators through:
Is the system up?
↓
Is static data healthy?
↓
Are realtime feeds updating?
↓
Is realtime usable?
↓
Are anomalies occurring?
↓
Where exactly is the failure?
This ordering minimizes investigation time and improves operational clarity.
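The top-to-bottom flow can be read as a short-circuiting checklist. The sketch below walks the sections in dashboard order and reports the first one that looks unhealthy. The boolean health input is a simplification for illustration; in practice an operator makes this judgment by reading the panels.

```python
# Sections paired with the question each one answers, in dashboard order.
INVESTIGATION_ORDER = [
    ("System Availability", "Is the system up?"),
    ("Static GTFS Health", "Is static data healthy?"),
    ("Realtime Feed Health", "Are realtime feeds updating?"),
    ("Matching Quality", "Is realtime usable?"),
    ("Vehicle Anomalies", "Are anomalies occurring?"),
]

# The last step is a drill-down destination, not a pass/fail check.
DRILL_DOWN = ("Deep Diagnostics", "Where exactly is the failure?")


def first_failing_section(health):
    """Return the first (section, question) marked unhealthy, or None.

    `health` maps section name to True (healthy) or False (unhealthy);
    sections missing from the mapping are treated as healthy.
    """
    for section, question in INVESTIGATION_ORDER:
        if not health.get(section, True):
            return section, question
    return None


result = first_failing_section({
    "Matching Quality": False,
    "Vehicle Anomalies": False,
})
```

In this example the function stops at Matching Quality even though Vehicle Anomalies is also flagged, mirroring how an operator scrolling downward meets the higher-level problem first and only then moves on to Deep Diagnostics.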