
Add per-query DB query count/duration metrics with gtfsdb instrumenta… #718

Merged
aaronbrethorst merged 3 commits into OneBusAway:main from Adityatorgal17:feat/db-query-prometheus-metrics on Mar 25, 2026

Conversation

@Adityatorgal17
Contributor

Summary

This PR adds per-query database observability to Maglev by exporting query count and latency metrics with query_name, op, and status labels.

Fixes #717

What changed

  • Added:
    • maglev_db_query_total
    • maglev_db_query_duration_seconds
  • Implemented RecordDBQuery(...) in internal/metrics.
  • Introduced a small DBQueryMetricsRecorder interface in gtfsdb/config.go.
  • Updated gtfsdb query wrapper to emit metrics for:
    • ExecContext (op=exec)
    • QueryContext (op=query)
    • QueryRowContext (op=query_row)
  • Added query name extraction from sqlc SQL headers.
  • Wired recorder from static GTFS manager into all gtfsdb.NewClient(...) creation paths.
  • Updated/added unit tests for metric initialization and recording behavior.
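
The query-name extraction described above can be sketched as a small standalone function. This is a hypothetical version written from the PR's description of the behavior (parse sqlc `-- name:` headers, skip blanks and non-sqlc comments, fall back to "unknown"); the actual helper in gtfsdb may differ in name and details:

```go
package main

import (
	"fmt"
	"strings"
)

// extractQueryName pulls the query name from a sqlc-style SQL header
// such as "-- name: GetStop :one". Illustrative sketch, not Maglev code.
func extractQueryName(query string) string {
	for _, line := range strings.Split(query, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue // skip blank lines before the header
		}
		if !strings.HasPrefix(line, "--") {
			break // reached SQL without finding a sqlc header
		}
		const marker = "-- name:"
		if strings.HasPrefix(line, marker) {
			fields := strings.Fields(strings.TrimPrefix(line, marker))
			if len(fields) > 0 {
				return fields[0]
			}
		}
		// any other "--" comment is skipped and the scan continues
	}
	return "unknown" // graceful fallback for non-sqlc queries
}

func main() {
	fmt.Println(extractQueryName("-- name: GetStop :one\nSELECT * FROM stops WHERE id = ?"))
	fmt.Println(extractQueryName("SELECT 1"))
}
```

With input like the above, the first call yields a usable `query_name` label value ("GetStop") and the second falls back to "unknown", keeping label cardinality bounded.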

Validation Passed

  • make lint
  • make test

Known limitation (tracked for follow-up)

query_row status is currently based on QueryRowContext dispatch, not final Scan() result.
I'll open a follow-up PR to record query_row status at scan-time.

Member

@aaronbrethorst aaronbrethorst left a comment


Hey Aditya, this is a well-architected observability addition — the DBQueryMetricsRecorder interface creates a clean dependency boundary between gtfsdb and the metrics package, the sqlc query name extraction is clever, and the wiring across all three NewClient call sites is thorough. The overall design is solid.

There are a couple of issues to address before merging.

Critical Issues (1 found)

1. [silent-failure-hunter]: Typed-nil interface causes unnecessary wrapper creation when metrics are disabled — internal/gtfs/static.go:89,312,328

All three call sites unconditionally assign config.Metrics to the interface field:

dbConfig.QueryMetricsRecorder = config.Metrics

When config.Metrics is nil (which happens in all tests and any startup path that doesn't configure metrics), Go wraps the nil *metrics.Metrics pointer inside a non-nil DBQueryMetricsRecorder interface. This means config.QueryMetricsRecorder != nil evaluates to true in client.go:34, causing the slowQueryDB wrapper to be created for every DB client — even when no metrics are being collected.

The nil receiver guard in RecordDBQuery prevents a crash, but every database operation now pays the overhead of going through the wrapper, calling recordQueryMetrics, calling RecordDBQuery, and hitting the nil check — all for nothing.

Fix: Guard the assignment at all three sites:

if config.Metrics != nil {
    dbConfig.QueryMetricsRecorder = config.Metrics
}

This ensures the interface remains an untyped nil when metrics aren't configured, and the wrapper is only created when it's actually needed.
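
The typed-nil behavior at issue can be reproduced in isolation. A minimal standalone sketch (none of these names are Maglev code) showing why the guarded assignment matters:

```go
package main

import "fmt"

// recorder mimics the DBQueryMetricsRecorder interface.
type recorder interface{ record() }

// metrics mimics *metrics.Metrics; a nil *metrics still satisfies recorder.
type metrics struct{}

func (m *metrics) record() {}

func main() {
	var m *metrics     // typed nil pointer, like an unconfigured config.Metrics
	var r recorder = m // unconditional assignment wraps the nil pointer
	fmt.Println(r == nil) // false: interface holds (type=*metrics, value=nil)

	var r2 recorder // never assigned: untyped nil interface
	fmt.Println(r2 == nil) // true: a != nil check correctly skips the wrapper
}
```

This is exactly why `config.QueryMetricsRecorder != nil` passes even when metrics are disabled: a Go interface is nil only when both its type and value are nil.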

Important Issues (1 found)

2. [silent-failure-hunter]: query_row metrics always report status="ok" — needs a code comment at the call site — gtfsdb/helpers.go:97

The PR description honestly discloses that QueryRowContext passes nil for the error (since *sql.Row defers errors to Scan()). However, this limitation isn't documented at the call site itself. A future developer looking at helpers.go:97 would see recordQueryMetrics("query_row", query, elapsed, nil) and might not understand why the error is hardcoded to nil.

Add a brief comment at the call site:

// Note: QueryRowContext defers errors to row.Scan(), so err is always nil here.
// query_row metrics always report status="ok". See PR description for follow-up plan.
s.recordQueryMetrics("query_row", query, elapsed, nil)

This also serves as a breadcrumb for the follow-up PR that will fix scan-time status tracking.

Suggestions (2 found)

3. [pr-test-analyzer]: No integration test for NewClient wiring when only metrics are enabled (threshold=0) — gtfsdb/client.go:34

The condition slowQueryThreshold > 0 || config.QueryMetricsRecorder != nil is the key wiring logic, but TestSlowQueryDB_RecordsQueryMetrics manually constructs the wrapper — it doesn't go through NewClient. If someone changes || to &&, no test would catch it. An integration test that creates a NewClient with QueryMetricsRecorder set and verifies metrics are recorded would close this gap.

4. [code-reviewer]: Consider whether prometheus.DefBuckets are appropriate for DB query latency — internal/metrics/metrics.go:108

DefBuckets are {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10} seconds. For SQLite queries, most will complete in microseconds (< 1ms), meaning they'll all land in the first bucket (< 5ms). You won't get useful latency distribution from the histogram. Consider tighter buckets like {.0001, .0005, .001, .005, .01, .05, .1, .5, 1} to capture the sub-millisecond range where SQLite actually operates. This is non-blocking — DefBuckets still works, it just won't give great resolution.
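
The resolution difference can be seen with a small standalone sketch that mimics how cumulative histogram buckets capture an observation (illustrative only, not client_golang code; the bucket values are copied from this review):

```go
package main

import "fmt"

// bucketFor returns the upper bound of the first bucket that would capture
// the observation, mimicking Prometheus cumulative histogram buckets.
func bucketFor(seconds float64, buckets []float64) float64 {
	for _, b := range buckets {
		if seconds <= b {
			return b
		}
	}
	return -1 // would fall into the implicit +Inf bucket
}

func main() {
	defBuckets := []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
	fineBuckets := []float64{.0001, .0005, .001, .005, .01, .05, .1, .5, 1}

	// A typical 200µs SQLite query:
	fmt.Println(bucketFor(0.0002, defBuckets))  // lumped into the first (<= 5ms) bucket
	fmt.Println(bucketFor(0.0002, fineBuckets)) // resolved at sub-millisecond granularity
}
```

With DefBuckets, nearly every SQLite observation collapses into the first bucket, so quantile estimates from the histogram carry almost no information; the finer buckets spread the same observations across the range where they actually occur.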

Strengths

  • The DBQueryMetricsRecorder interface is a textbook example of dependency inversion — gtfsdb emits metrics without importing the metrics package
  • extractQueryName is well-implemented — it handles sqlc headers, skips blank lines and non-sqlc comments, and falls back to "unknown" gracefully
  • Label cardinality is safe: ~105 sqlc query names x 3 ops x 2 statuses = ~630 combinations, well within Prometheus best practices
  • Nil safety is handled at multiple levels: recordQueryMetrics checks s.queryMetrics == nil, RecordDBQuery checks m == nil
  • All three NewClient call sites in static.go are wired, including the error recovery path — no paths were missed
  • The test mock (testQueryMetricsRecorder) is clean and minimal

Recommended Action

Verdict: Request Changes

  1. Fix the typed-nil interface issue at all three static.go assignment sites
  2. Add a comment at helpers.go:97 documenting the QueryRowContext nil-error limitation
  3. Consider the histogram bucket and integration test suggestions

@Adityatorgal17 force-pushed the feat/db-query-prometheus-metrics branch from be89edb to 0a9af53 on March 20, 2026 at 06:26
@Adityatorgal17 force-pushed the feat/db-query-prometheus-metrics branch from 32dd759 to e4de84a on March 21, 2026 at 04:14
Member

@aaronbrethorst aaronbrethorst left a comment


Hey Aditya, excellent follow-through on all the feedback. Every item has been cleanly addressed:

Previous Feedback — All Resolved

  1. Typed-nil interface causing unnecessary wrapper creation — Fixed: newGTFSDBConfig now guards with if config.Metrics != nil
  2. query_row nil-error needs comment at call site — Fixed: clear comment added at helpers.go:95-96
  3. Integration test for NewClient wiring — Addressed: TestNewClient_RecordsQueryMetricsWhenOnlyMetricsEnabled added with proper DDL override
  4. Histogram buckets too coarse for SQLite — Addressed: changed to {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1}

Additional Improvements

  • The newGTFSDBConfig helper in static.go is a nice refactor — it consolidates the config construction and nil-guard into one place, eliminating the duplication across all three call sites
  • Bonus fix: added MockClearServiceIDsCache() call in arrivals_and_departures_for_stop_handler_test.go to fix a pre-existing test issue

Verification

  • make lint: 0 issues
  • make test: All packages pass

@aaronbrethorst aaronbrethorst merged commit a16c475 into OneBusAway:main Mar 25, 2026
6 checks passed


Development

Successfully merging this pull request may close these issues.

Add per-query DB Prometheus metrics for query count and latency
