DR: Operational Dashboard Structure for Watchdog Metrics

Mohamed Ahmed Abdelhamid Abo-Omar edited this page May 26, 2026 · 1 revision

Context

Watchdog exposes metrics related to:

  • API availability
  • static GTFS integrity
  • realtime feed health
  • realtime/static matching quality
  • vehicle anomalies
  • debugging diagnostics

As the number of metrics grows, organizing them by source package or implementation area makes the dashboard difficult to navigate operationally.

This DR defines how the Grafana dashboard should be structured and why metrics are grouped the way they are.

Decision

The dashboard will be designed around the operational observer's workflow: its structure should make the observer's job easier.

The dashboard will be a single, vertically scrollable Grafana dashboard composed of multiple sections and panels. It will NOT use separate tabs/pages for each category.

The goal is to allow operators to investigate incidents in a natural top-to-bottom flow without navigating between dashboards.

Dashboard Structure

Dashboard
│
├── System Availability
│   ├── API Status
│   └── Dependency Latency
│
├── Static GTFS Health
│   ├── Feed Expiration
│   └── Agency Consistency
│
├── Realtime Feed Health
│   ├── Vehicle Volume
│   ├── Feed Freshness
│   ├── Update Throughput
│   └── API ↔ GTFS-RT Consistency
│
├── Matching Quality
│   ├── Trip Matching
│   ├── Stop Matching
│   └── Coverage Summary
│
├── Vehicle Anomalies
│   ├── Speed Validation
│   └── Spatial Validation
│
└── Deep Diagnostics
    ├── Unmatched Stop Clusters
    └── Unmatched Stop Locations
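
To illustrate how this tree could map onto a single scrollable dashboard, the sketch below encodes the sections and panels as data and turns them into a flat list of Grafana row and panel objects. It is a minimal sketch, not the project's actual provisioning code: the dashboard title, panel sizes, two-per-line arrangement, and the `timeseries` placeholder type are all assumptions made for illustration.

```python
import json
from typing import Dict, List

# Section title -> panel titles, in the top-to-bottom order of the tree above.
LAYOUT: Dict[str, List[str]] = {
    "System Availability": ["API Status", "Dependency Latency"],
    "Static GTFS Health": ["Feed Expiration", "Agency Consistency"],
    "Realtime Feed Health": [
        "Vehicle Volume",
        "Feed Freshness",
        "Update Throughput",
        "API ↔ GTFS-RT Consistency",
    ],
    "Matching Quality": ["Trip Matching", "Stop Matching", "Coverage Summary"],
    "Vehicle Anomalies": ["Speed Validation", "Spatial Validation"],
    "Deep Diagnostics": ["Unmatched Stop Clusters", "Unmatched Stop Locations"],
}

GRID_WIDTH = 24   # Grafana lays panels out on a 24-column grid
PANEL_WIDTH = 12  # assumption: two panels side by side
PANEL_HEIGHT = 8  # assumption: arbitrary height in grid units


def build_panels(layout: Dict[str, List[str]]) -> List[dict]:
    """Return a flat panel list: one row panel per section, then its panels."""
    panels: List[dict] = []
    y = 0
    for section, titles in layout.items():
        panels.append({
            "id": len(panels) + 1,
            "type": "row",
            "title": section,
            "collapsed": False,
            "gridPos": {"h": 1, "w": GRID_WIDTH, "x": 0, "y": y},
        })
        y += 1
        for i, title in enumerate(titles):
            if i > 0 and i % 2 == 0:
                y += PANEL_HEIGHT  # start a new line after every two panels
            panels.append({
                "id": len(panels) + 1,
                "type": "timeseries",  # placeholder; choose stat/table/geomap per panel
                "title": title,
                "gridPos": {"h": PANEL_HEIGHT, "w": PANEL_WIDTH,
                            "x": (i % 2) * PANEL_WIDTH, "y": y},
            })
        y += PANEL_HEIGHT
    return panels


# "Watchdog Operations" is a hypothetical title.
dashboard = {"title": "Watchdog Operations", "panels": build_panels(LAYOUT)}
print(json.dumps([p["title"] for p in dashboard["panels"] if p["type"] == "row"],
                 ensure_ascii=False))
```

Keeping the layout as plain data means the section order, which carries the investigation flow, lives in one place and can be reviewed without reading panel JSON.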

Terminology

What is a Dashboard?

The entire Grafana page.

What is a Section?

A logical grouping of related panels inside the dashboard.

What is a Panel?

An individual visualization, such as a graph, stat, table, heatmap, or geomap. Each panel should answer one focused operational question.

Section Design

1. System Availability

Purpose

Answers:

“Is the system reachable and operational?”

Panels

1.1 API Status

Metrics:

  • oba_api_status

Question answered:

“Is the API up?”

1.2 Dependency Latency

Metrics:

  • http_outgoing_request_duration_seconds

Question answered:

“Are external dependencies becoming slow?”

Why This Section Exists

Availability and dependency latency are the first things operators check during incidents. This section is intentionally placed at the top of the dashboard for fast visibility.

2. Static GTFS Health

Purpose

Answers:

“Is the static GTFS dataset valid and up to date?”

Panels

2.1 Feed Expiration

Metrics:

  • gtfs_bundle_days_until_earliest_expiration
  • gtfs_bundle_days_until_latest_expiration

Question answered:

“Are GTFS bundles approaching expiration?”
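
One way a stat panel could surface approaching expiration is with color thresholds. The sketch below shows a threshold definition in the shape Grafana uses for a panel's field config, plus a small helper that mirrors how a value resolves to a color. The 7-day and 30-day cutoffs are assumptions chosen for illustration; this document does not specify thresholds.

```python
from typing import List, Optional, TypedDict


class Step(TypedDict):
    color: str
    value: Optional[float]  # None marks the base step


# Hypothetical cutoffs: red below 7 days, orange below 30 days, green otherwise.
EXPIRATION_THRESHOLDS = {
    "mode": "absolute",
    "steps": [
        {"color": "red", "value": None},
        {"color": "orange", "value": 7},
        {"color": "green", "value": 30},
    ],
}


def color_for(days: float, steps: List[Step]) -> str:
    """Pick the color of the highest step whose value is <= days.

    Assumes steps are sorted ascending with the base step first.
    """
    color = steps[0]["color"]
    for step in steps[1:]:
        if days >= step["value"]:
            color = step["color"]
    return color


for days in (3, 7, 45):
    print(days, color_for(days, EXPIRATION_THRESHOLDS["steps"]))
```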

2.2 Agency Consistency

Metrics:

  • oba_agencies_in_static_gtfs
  • oba_agencies_in_coverage_endpoint
  • oba_agencies_match

Question answered:

“Does the coverage endpoint match the static GTFS dataset?”

Why This Section Exists

These metrics describe static dataset integrity rather than runtime behavior. Keeping them separate from realtime metrics helps operators distinguish configuration/data problems from realtime ingestion problems.

3. Realtime Feed Health

Purpose

Answers:

“Are realtime feeds alive, updating, and populated?”

Panels

3.1 Vehicle Volume

Metrics:

  • realtime_vehicle_positions_count_gtfs_rt
  • vehicle_count_api
  • gtfs_rt_tracked_vehicles_count

Question answered:

“Is realtime coverage dropping?”

3.2 Feed Freshness

Metrics:

  • vehicle_position_report_interval_seconds
  • oba_time_since_last_update_seconds

Question answered:

“How stale is realtime data?”

3.3 Update Throughput

Metrics:

  • vehicle_report_total

Question answered:

“Are realtime updates continuously flowing?”

3.4 API ↔ GTFS-RT Consistency

Metrics:

  • vehicle_count_match

Question answered:

“Does the API reflect GTFS-RT correctly?”

Why This Section Exists

Realtime health is the primary operational concern of the system. The metrics are split into coverage, freshness, throughput, and consistency because each represents a different failure mode.

4. Matching Quality

Purpose

Answers:

“Is realtime data correctly matching scheduled/static data?”

Panels

4.1 Trip Matching

Metrics:

  • oba_realtime_records_total
  • oba_realtime_trips_matched_count
  • oba_realtime_trips_unmatched_count
  • oba_realtime_trip_match_ratio

Question answered:

“How much realtime trip data is usable?”
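
As an illustration of how the matched and unmatched counts relate to a ratio, the sketch below computes one from the two counts. This is an assumption about the relationship, not a statement of how Watchdog derives oba_realtime_trip_match_ratio; the function name and the choice to return None when there is no data are hypothetical.

```python
from typing import Optional


def match_ratio(matched: int, unmatched: int) -> Optional[float]:
    """Share of records that matched, or None when there is nothing to judge.

    Assumption: ratio = matched / (matched + unmatched).
    """
    total = matched + unmatched
    if total == 0:
        return None  # no data is not the same as a 0% match rate
    return matched / total


print(match_ratio(3, 1))
print(match_ratio(0, 0))
```

Distinguishing "no data" from "0% matched" matters on a dashboard: the first points back to Realtime Feed Health, the second to a matching problem.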

4.2 Stop Matching

Metrics:

  • oba_stops_matched_count
  • oba_stops_unmatched_count
  • oba_stop_match_ratio

Question answered:

“Are stop mappings degrading?”

4.3 Coverage Summary

Metrics:

  • oba_agencies_with_coverage_count
  • oba_scheduled_trips_count

Question answered:

“What is the overall realtime coverage footprint?”

Why This Section Exists

A realtime feed may be reachable, fresh, and populated yet still be operationally unusable because matching is failing. This section isolates semantic correctness from feed health.

5. Vehicle Anomalies

Purpose

Answers:

“Are vehicles behaving realistically?”

Panels

5.1 Speed Validation

Metrics:

  • gtfs_rt_vehicle_computed_speed
  • gtfs_rt_vehicle_speed_discrepancy_ratio

Question answered:

“Are vehicles reporting unrealistic speeds?”

5.2 Spatial Validation

Metrics:

  • gtfs_rt_invalid_vehicle_coordinates
  • gtfs_rt_stopped_out_of_bounds_vehicles

Question answered:

“Are vehicle coordinates geographically valid?”

Why This Section Exists

These metrics are diagnostic and anomaly-oriented. They are intentionally separated from core health indicators to reduce dashboard noise and cognitive load.

6. Deep Diagnostics

Purpose

Answers:

“Where exactly are matching failures occurring?”

Panels

6.1 Unmatched Stop Clusters

Metrics:

  • oba_unmatched_stop_cluster_count

Question answered:

“Where are failures concentrated?”

6.2 Unmatched Stop Locations

Metrics:

  • oba_unmatched_stop_location

Question answered:

“Which exact stops are failing?”

Why This Section Exists

These metrics are intended for investigation after a higher-level issue has already been detected. They are placed at the bottom of the dashboard because they are highly detailed and debugging-oriented.

Note on High Cardinality & Diagnostic Metrics

Sections such as Vehicle Anomalies and Deep Diagnostics contain metrics that are often:

  • high-cardinality (e.g., per vehicle, per stop, per location)
  • noisy when visualized at aggregate level
  • more useful for slicing and investigation than for continuous monitoring

These metrics are primarily introduced for alerting and anomaly detection, where thresholds or sudden deviations can trigger operational alerts.

In the context of this dashboard, they are included only when:

  • a meaningful aggregation or grouping exists (e.g., max, ratio, clustered counts), or
  • they support investigation after an alert has already been triggered

They are not intended to be primary “at-a-glance” health indicators, but rather:

  • alerting signals for automated detection
  • investigative tools when drilling into system issues

If a suitable visualization is available (e.g., clustering, aggregation, or bounded summary view), these metrics can be surfaced in panels to support debugging workflows. Otherwise, they should remain primarily alert-driven rather than dashboard-driven.
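
To make "meaningful aggregation" concrete, the sketch below collapses per-vehicle samples into a bounded per-server summary (a maximum and an over-limit ratio), which stays readable no matter how many vehicles report. The `Sample` type, the speed unit, and the limit value are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Sample(NamedTuple):
    server_id: str
    vehicle_id: str
    speed: float  # unit is an assumption; use whatever the metric reports


def summarize(samples: Iterable[Sample], limit: float) -> Dict[str, Dict[str, float]]:
    """Reduce per-vehicle speeds to two numbers per server."""
    by_server: Dict[str, List[float]] = defaultdict(list)
    for sample in samples:
        by_server[sample.server_id].append(sample.speed)
    return {
        server: {
            "max": max(speeds),
            "over_limit_ratio": sum(v > limit for v in speeds) / len(speeds),
        }
        for server, speeds in by_server.items()
    }


samples = [
    Sample("a", "v1", 10.0),
    Sample("a", "v2", 50.0),
    Sample("b", "v3", 20.0),
]
print(summarize(samples, limit=40.0))
```

The output has one entry per server rather than one per vehicle, which is the property that makes it suitable for a panel instead of an alert-only signal.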

Dashboard Interaction Model

Most metrics are grouped by server_id.

To avoid noisy visualizations:

  • the dashboard should use Grafana variables/dropdowns
  • operators should be able to filter by server and agency

Example:

Server: [ All ▼ ]
Agency: [ All ▼ ]

This allows one reusable dashboard focused on per-server investigation, while also supporting aggregated global monitoring when needed.
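
The dropdowns above could be expressed as Grafana query variables. The sketch below assumes a Prometheus data source and that `server_id` is a label on the metrics. The `agency_id` label name, the variable names, and the choice of metrics to read label values from are all assumptions. The `scoped` helper shows how a panel query might reference the variables.

```python
from typing import Dict


def query_variable(name: str, title: str, metric: str, label: str) -> dict:
    """Build one dropdown whose options come from a label's values."""
    return {
        "name": name,
        "label": title,
        "type": "query",
        "query": f"label_values({metric}, {label})",
        "includeAll": True,  # adds the "All" option shown in the mock-up
        "multi": True,
    }


def scoped(metric: str, filters: Dict[str, str]) -> str:
    """Return a PromQL selector filtered by dashboard variables.

    `filters` maps a label name to a variable name. Regex matching (=~)
    is used so that "All" and multi-select values work.
    """
    matchers = ", ".join(f'{label}=~"${var}"' for label, var in filters.items())
    return f"{metric}{{{matchers}}}"


templating = {
    "list": [
        query_variable("server", "Server", "oba_api_status", "server_id"),
        # Assumption: some metric carries an agency label named agency_id.
        query_variable("agency", "Agency", "oba_realtime_trip_match_ratio", "agency_id"),
    ]
}

print(scoped("oba_api_status", {"server_id": "server"}))
```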

Design Principles

  1. Metrics are grouped by observer workflow, incident investigation flow, and operational semantics rather than by implementation structure.
  2. Each panel should answer one question or a very small set of tightly related questions to improve readability and reduce cognitive load.
  3. Top sections focus on availability, freshness, and consistency, while lower sections focus on anomalies, debugging, and forensic investigation to create a natural incident investigation flow.
  4. The dashboard flow mirrors incident investigation. It is intentionally ordered to guide operators through these questions:
Is the system up?
    ↓
Is static data healthy?
    ↓
Are realtime feeds updating?
    ↓
Is realtime usable?
    ↓
Are anomalies occurring?
    ↓
Where exactly is the failure?

This ordering minimizes investigation time and improves operational clarity.
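
The same top-to-bottom order can be read as a triage procedure: stop at the first section that looks unhealthy and investigate there before scrolling further. The sketch below illustrates that idea only. The per-section health booleans are hypothetical inputs, and treating a missing section as healthy is an assumption.

```python
from typing import Dict, List, Optional

# Sections in the order they appear on the dashboard.
INVESTIGATION_ORDER: List[str] = [
    "System Availability",
    "Static GTFS Health",
    "Realtime Feed Health",
    "Matching Quality",
    "Vehicle Anomalies",
    "Deep Diagnostics",
]


def first_unhealthy(status: Dict[str, bool]) -> Optional[str]:
    """Return the first section, top to bottom, that is reported unhealthy."""
    for section in INVESTIGATION_ORDER:
        if not status.get(section, True):  # assumption: missing means healthy
            return section
    return None


status = {
    "System Availability": True,
    "Static GTFS Health": True,
    "Realtime Feed Health": False,
    "Matching Quality": False,
}
print(first_unhealthy(status))
```

In this example the matching failure is likely a symptom of the feed problem above it, which is why the earlier section is reported first.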