DR: Operational Dashboard Structure for Watchdog Metrics

Context

Watchdog exposes metrics related to:

API availability
static GTFS integrity
realtime feed health
realtime/static matching quality
vehicle anomalies
debugging diagnostics

As the number of metrics grows, organizing them by source package or implementation area makes the dashboard difficult to navigate operationally.

This DR defines how the Grafana dashboard should be structured and why metrics are grouped the way they are.

Decision

The dashboard will be designed around the operational observer's workflow, so that its layout follows how an operator investigates an incident rather than how the code is organized.

The dashboard will be a single, vertically scrollable Grafana dashboard composed of multiple sections and panels. It will NOT use separate tabs/pages for each category.

The goal is to allow operators to investigate incidents in a natural top-to-bottom flow without navigating between dashboards.

Dashboard Structure

Dashboard
│
├── System Availability
│   ├── API Status
│   └── Dependency Latency
│
├── Static GTFS Health
│   ├── Feed Expiration
│   └── Agency Consistency
│
├── Realtime Feed Health
│   ├── Vehicle Volume
│   ├── Feed Freshness
│   ├── Update Throughput
│   └── API ↔ GTFS-RT Consistency
│
├── Matching Quality
│   ├── Trip Matching
│   ├── Stop Matching
│   └── Coverage Summary
│
├── Vehicle Anomalies
│   ├── Speed Validation
│   └── Spatial Validation
│
└── Deep Diagnostics
    ├── Unmatched Stop Clusters
    └── Unmatched Stop Locations
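
The tree above maps fairly directly onto Grafana's dashboard JSON model, where a section can be expressed as a panel of type `row` followed by its child panels on a 24-column grid. The following is a minimal sketch of that mapping, not the actual Watchdog dashboard definition. The panel type (`timeseries`), the two-panels-per-row layout, the panel height, and the dashboard title are all assumptions chosen for illustration.

```python
# Section -> panel titles, copied from the tree above.
STRUCTURE = {
    "System Availability": ["API Status", "Dependency Latency"],
    "Static GTFS Health": ["Feed Expiration", "Agency Consistency"],
    "Realtime Feed Health": [
        "Vehicle Volume",
        "Feed Freshness",
        "Update Throughput",
        "API ↔ GTFS-RT Consistency",
    ],
    "Matching Quality": ["Trip Matching", "Stop Matching", "Coverage Summary"],
    "Vehicle Anomalies": ["Speed Validation", "Spatial Validation"],
    "Deep Diagnostics": ["Unmatched Stop Clusters", "Unmatched Stop Locations"],
}

GRID_WIDTH = 24   # Grafana's dashboard grid is 24 columns wide
PANEL_WIDTH = 12  # assumption: two panels side by side
PANEL_HEIGHT = 8  # assumption: arbitrary default height


def build_panels(structure):
    """Flatten sections into row panels, each followed by its child panels."""
    panels = []
    next_id = 1
    y = 0
    per_line = GRID_WIDTH // PANEL_WIDTH
    for section, titles in structure.items():
        # One full-width row panel per section keeps the page vertically scrollable.
        panels.append({
            "id": next_id,
            "type": "row",
            "title": section,
            "collapsed": False,
            "gridPos": {"h": 1, "w": GRID_WIDTH, "x": 0, "y": y},
        })
        next_id += 1
        y += 1
        for i, title in enumerate(titles):
            col = i % per_line
            if i > 0 and col == 0:
                y += PANEL_HEIGHT  # wrap to the next line of panels
            panels.append({
                "id": next_id,
                "type": "timeseries",  # placeholder; real panels may be stat, table, geomap
                "title": title,
                "gridPos": {"h": PANEL_HEIGHT, "w": PANEL_WIDTH,
                            "x": col * PANEL_WIDTH, "y": y},
            })
            next_id += 1
        y += PANEL_HEIGHT
    return panels


dashboard = {"title": "Watchdog", "panels": build_panels(STRUCTURE)}
```

Serializing `dashboard` with `json.dumps` would give a skeleton that still needs data sources and queries before it is useful in Grafana.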

Terminology

What is a Dashboard?

The entire Grafana page.

What is a Section?

A logical grouping of related panels inside the dashboard.

What is a Panel?

An individual visualization, such as a graph, stat, table, heatmap, or geomap. Each panel should answer one focused operational question.

Section Design

1. System Availability

Purpose

Answers:

“Is the system reachable and operational?”

Panels

1.1 API Status

Metrics:

oba_api_status

Question answered:

“Is the API up?”

1.2 Dependency Latency

Metrics:

http_outgoing_request_duration_seconds

Question answered:

“Are external dependencies becoming slow?”

Why This Section Exists

Availability and dependency latency are the first things operators check during incidents. This section is intentionally placed at the top of the dashboard for fast visibility.

2. Static GTFS Health

Purpose

Answers:

“Is the static GTFS dataset valid and up to date?”

Panels

2.1 Feed Expiration

Metrics:

gtfs_bundle_days_until_earliest_expiration
gtfs_bundle_days_until_latest_expiration

Question answered:

“Are GTFS bundles approaching expiration?”
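
One way to make this question answerable at a glance is to map the days-until-expiration value onto a small set of statuses, much as Grafana panel thresholds do. The sketch below assumes hypothetical thresholds of 14 and 3 days. These values do not come from this DR and should be replaced with whatever the operators agree on.

```python
def expiration_status(days_until_expiration, warn_days=14, critical_days=3):
    """Map a days-until-expiration value to a coarse status.

    warn_days and critical_days are illustrative defaults, not project policy.
    A value of zero or below means the bundle has already expired.
    """
    if days_until_expiration <= critical_days:
        return "critical"
    if days_until_expiration <= warn_days:
        return "warning"
    return "ok"
```

Applied to `gtfs_bundle_days_until_earliest_expiration`, this would flag the bundle that runs out first, which is presumably the one operators need to act on soonest.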

2.2 Agency Consistency

Metrics:

oba_agencies_in_static_gtfs
oba_agencies_in_coverage_endpoint
oba_agencies_match

Question answered:

“Does the coverage endpoint match the static GTFS dataset?”

Why This Section Exists

These metrics describe static dataset integrity rather than runtime behavior. Keeping them separate from realtime metrics helps operators distinguish configuration/data problems from realtime ingestion problems.

3. Realtime Feed Health

Purpose

Answers:

“Are realtime feeds alive, updating, and populated?”

Panels

3.1 Vehicle Volume

Metrics:

realtime_vehicle_positions_count_gtfs_rt
vehicle_count_api
gtfs_rt_tracked_vehicles_count

Question answered:

“Is realtime coverage dropping?”

3.2 Feed Freshness

Metrics:

vehicle_position_report_interval_seconds
oba_time_since_last_update_seconds

Question answered:

“How stale is realtime data?”

3.3 Update Throughput

Metrics:

vehicle_report_total

Question answered:

“Are realtime updates continuously flowing?”

3.4 API ↔ GTFS-RT Consistency

Metrics:

vehicle_count_match

Question answered:

“Does the API reflect GTFS-RT correctly?”

Why This Section Exists

Realtime health is the primary operational concern of the system. The metrics are separated into coverage, freshness, throughput, and consistency because each represents a different failure mode.

4. Matching Quality

Purpose

Answers:

“Is realtime data correctly matching scheduled/static data?”

Panels

4.1 Trip Matching

Metrics:

oba_realtime_records_total
oba_realtime_trips_matched_count
oba_realtime_trips_unmatched_count
oba_realtime_trip_match_ratio

Question answered:

“How much realtime trip data is usable?”
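
To show how these metrics might relate, the sketch below derives a match ratio from matched and unmatched counts. It assumes the ratio is matched / (matched + unmatched). This DR does not define how `oba_realtime_trip_match_ratio` is actually computed, so treat the formula as an illustration only.

```python
def match_ratio(matched, unmatched):
    """Return the share of matched items, or None when nothing was observed.

    Assumption: the ratio is matched / (matched + unmatched).
    """
    total = matched + unmatched
    if total == 0:
        # No realtime trips seen; reporting 0.0 or 1.0 here would be misleading.
        return None
    return matched / total
```

Returning `None` for an empty feed keeps "no data" distinguishable from "nothing matched", which matters when the panel is read alongside Realtime Feed Health.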

4.2 Stop Matching

Metrics:

oba_stops_matched_count
oba_stops_unmatched_count
oba_stop_match_ratio

Question answered:

“Are stop mappings degrading?”

4.3 Coverage Summary

Metrics:

oba_agencies_with_coverage_count
oba_scheduled_trips_count

Question answered:

“What is the overall realtime coverage footprint?”

Why This Section Exists

A realtime feed may be reachable, fresh, and populated while still being operationally unusable because matching is failing. This section isolates semantic correctness from feed health.

5. Vehicle Anomalies

Purpose

Answers:

“Are vehicles behaving realistically?”

Panels

5.1 Speed Validation

Metrics:

gtfs_rt_vehicle_computed_speed
gtfs_rt_vehicle_speed_discrepancy_ratio

Question answered:

“Are vehicles reporting unrealistic speeds?”

5.2 Spatial Validation

Metrics:

gtfs_rt_invalid_vehicle_coordinates
gtfs_rt_stopped_out_of_bounds_vehicles

Question answered:

“Are vehicle coordinates geographically valid?”

Why This Section Exists

These metrics are diagnostic and anomaly-oriented. They are intentionally separated from core health indicators to reduce dashboard noise and cognitive load.

6. Deep Diagnostics

Purpose

Answers:

“Where exactly are matching failures occurring?”

Panels

6.1 Unmatched Stop Clusters

Metrics:

oba_unmatched_stop_cluster_count

Question answered:

“Where are failures concentrated?”

6.2 Unmatched Stop Locations

Metrics:

oba_unmatched_stop_location

Question answered:

“Which exact stops are failing?”

Why This Section Exists

These metrics are intended for investigation after a higher-level issue has already been detected. They are placed at the bottom of the dashboard because they are highly detailed and debugging-oriented.

Note on High Cardinality & Diagnostic Metrics

Sections such as Vehicle Anomalies and Deep Diagnostics contain metrics that are often:

high-cardinality (e.g., per vehicle, per stop, per location)
noisy when visualized at aggregate level
more useful for slicing and investigation than for continuous monitoring

These metrics are primarily introduced for alerting and anomaly detection, where thresholds or sudden deviations can trigger operational alerts.

In the context of this dashboard, they are included only when:

a meaningful aggregation or grouping exists (e.g., max, ratio, clustered counts), or
they support investigation after an alert has already been triggered

They are not intended to be primary “at-a-glance” health indicators, but rather:

alerting signals for automated detection
investigative tools when drilling into system issues

If a suitable visualization is available (e.g., clustering, aggregation, or bounded summary view), these metrics can be surfaced in panels to support debugging workflows. Otherwise, they should remain primarily alert-driven rather than dashboard-driven.
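
As an illustration of a bounded summary view, the sketch below reduces per-vehicle speed samples to a handful of aggregate values that fit in a single panel regardless of fleet size. The function name, the input shape (a mapping from vehicle ID to speed), and the plausibility limit are hypothetical. In Prometheus the same idea would typically be expressed with an aggregation such as `max by (server_id) (...)` rather than in application code.

```python
def summarize_speeds(speeds_by_vehicle, max_plausible_speed):
    """Reduce per-vehicle speed samples to a fixed-size summary.

    speeds_by_vehicle: mapping of vehicle ID -> computed speed (hypothetical shape).
    max_plausible_speed: limit above which a speed counts as unrealistic,
    in the same unit as the samples.
    """
    values = list(speeds_by_vehicle.values())
    if not values:
        return {"vehicles": 0, "max_speed": None,
                "over_limit": 0, "over_limit_ratio": None}
    over = sum(1 for v in values if v > max_plausible_speed)
    return {
        "vehicles": len(values),
        "max_speed": max(values),
        "over_limit": over,
        "over_limit_ratio": over / len(values),
    }


# Illustrative samples only; not real Watchdog data.
summary = summarize_speeds({"v1": 12.0, "v2": 48.0, "v3": 9.5}, max_plausible_speed=40.0)
```

The output has four fields however many vehicles are reported, so it suits a stat panel, while the per-vehicle detail stays available for drill-down.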

Dashboard Interaction Model

Most metrics are grouped by server_id.

To avoid noisy visualizations:

the dashboard should use Grafana variables/dropdowns
operators should be able to filter by server and agency

Example:

Server: [ All ▼ ]
Agency: [ All ▼ ]

This allows one reusable dashboard focused on per-server investigation, while also supporting aggregated global monitoring when needed.
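
The filtering described above can be sketched as two Grafana template variables plus a helper that adds the matching label filters to each panel query. The `server_id` label comes from this DR. The agency label name (`agency_id`), the variable names, and the choice of `oba_api_status` as the source of label values are assumptions. The regex matcher `=~` is used because Grafana typically interpolates multi-value and "All" selections as a regex alternation for Prometheus data sources.

```python
def query_variable(name, metric, label):
    """Build a Grafana query variable that lists the values of one label."""
    return {
        "name": name,
        "label": name.capitalize(),
        "type": "query",
        "query": f"label_values({metric}, {label})",
        "includeAll": True,  # renders the "All" option shown in the dropdown
        "multi": True,
    }


TEMPLATING = {
    "list": [
        query_variable("server", "oba_api_status", "server_id"),
        # "agency_id" is a hypothetical label name used for illustration.
        query_variable("agency", "oba_agencies_match", "agency_id"),
    ]
}


def filtered(metric, with_agency=False):
    """Return a PromQL selector restricted to the selected server (and agency)."""
    matchers = ['server_id=~"$server"']
    if with_agency:
        matchers.append('agency_id=~"$agency"')
    return metric + "{" + ", ".join(matchers) + "}"


api_status_query = filtered("oba_api_status")
stop_ratio_query = filtered("oba_stop_match_ratio", with_agency=True)
```

Only metrics that actually carry an agency label should be queried with `with_agency=True`, since a matcher on a missing label would return no series when a specific agency is selected.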

Design Principles

Metrics are grouped by observer workflow, incident investigation flow, and operational semantics rather than by implementation structure.
Each panel should answer one question or a very small set of tightly related questions to improve readability and reduce cognitive load.
Top sections focus on availability, freshness, and consistency; lower sections focus on anomalies, debugging, and forensic investigation.
The dashboard flow mirrors incident investigation. It is intentionally ordered to guide operators through:

Is the system up?
    ↓
Is static data healthy?
    ↓
Are realtime feeds updating?
    ↓
Is realtime usable?
    ↓
Are anomalies occurring?
    ↓
Where exactly is the failure?

This ordering is intended to reduce investigation time and improve operational clarity.
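
The top-to-bottom flow can also be read as a triage procedure: walk the sections in order and stop at the first one that looks unhealthy. The sketch below pairs each question with its section as described in this DR. The function and the boolean health input are hypothetical and only illustrate the ordering.

```python
# Sections paired with the question each one answers, in dashboard order.
INVESTIGATION_ORDER = [
    ("System Availability", "Is the system up?"),
    ("Static GTFS Health", "Is static data healthy?"),
    ("Realtime Feed Health", "Are realtime feeds updating?"),
    ("Matching Quality", "Is realtime usable?"),
    ("Vehicle Anomalies", "Are anomalies occurring?"),
    ("Deep Diagnostics", "Where exactly is the failure?"),
]


def first_failing_section(healthy_by_section):
    """Return the first (section, question) marked unhealthy, or None.

    healthy_by_section: mapping of section name -> bool (hypothetical input).
    Sections missing from the mapping are treated as unknown and skipped.
    """
    for section, question in INVESTIGATION_ORDER:
        if healthy_by_section.get(section) is False:
            return section, question
    return None


start_here = first_failing_section({
    "System Availability": True,
    "Static GTFS Health": False,
    "Matching Quality": False,
})
```

Here `start_here` points at Static GTFS Health even though Matching Quality is also unhealthy, because a problem higher in the flow may explain the ones below it.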
