Add per-query DB query count/duration metrics with gtfsdb instrumenta… by Adityatorgal17 · Pull Request #718 · OneBusAway/maglev

Adityatorgal17 · 2026-03-15T14:08:04Z

Summary

This PR adds per-query database observability to Maglev by exporting query count and latency metrics with query_name, op, and status labels.

Fixes #717

What changed

Added:
- maglev_db_query_total
- maglev_db_query_duration_seconds
Implemented RecordDBQuery(...) in internal/metrics.
Introduced a small DBQueryMetricsRecorder interface in gtfsdb/config.go.
Updated gtfsdb query wrapper to emit metrics for:
- ExecContext (op=exec)
- QueryContext (op=query)
- QueryRowContext (op=query_row)
Added query name extraction from sqlc SQL headers.
Wired recorder from static GTFS manager into all gtfsdb.NewClient(...) creation paths.
Updated/added unit tests for metric initialization and recording behavior.

Validation Passed

make lint
make test

Known limitation (tracked for follow-up)

query_row status is currently based on QueryRowContext dispatch, not final Scan() result.
I'll open a follow-up PR to record query_row status at scan-time.

aaronbrethorst

Hey Aditya, this is a well-architected observability addition — the DBQueryMetricsRecorder interface creates a clean dependency boundary between gtfsdb and the metrics package, the sqlc query name extraction is clever, and the wiring across all three NewClient call sites is thorough. The overall design is solid.

There are a couple of issues to address before merging.

Critical Issues (1 found)

1. [silent-failure-hunter]: Typed-nil interface causes unnecessary wrapper creation when metrics are disabled — `internal/gtfs/static.go:89,312,328`

All three call sites unconditionally assign config.Metrics to the interface field:

dbConfig.QueryMetricsRecorder = config.Metrics

When config.Metrics is nil (which happens in all tests and any startup path that doesn't configure metrics), Go wraps the nil *metrics.Metrics pointer inside a non-nil DBQueryMetricsRecorder interface. This means config.QueryMetricsRecorder != nil evaluates to true in client.go:34, causing the slowQueryDB wrapper to be created for every DB client — even when no metrics are being collected.

The nil receiver guard in RecordDBQuery prevents a crash, but every database operation now pays the overhead of going through the wrapper, calling recordQueryMetrics, calling RecordDBQuery, and hitting the nil check — all for nothing.

Fix: Guard the assignment at all three sites:

if config.Metrics != nil {
    dbConfig.QueryMetricsRecorder = config.Metrics
}

This ensures the interface remains an untyped nil when metrics aren't configured, and the wrapper is only created when it's actually needed.

Important Issues (1 found)

2. [silent-failure-hunter]: `query_row` metrics always report `status="ok"` — needs a code comment at the call site — `gtfsdb/helpers.go:97`

The PR description honestly discloses that QueryRowContext passes nil for the error (since *sql.Row defers errors to Scan()). However, this limitation isn't documented at the call site itself. A future developer looking at helpers.go:97 would see recordQueryMetrics("query_row", query, elapsed, nil) and might not understand why the error is hardcoded to nil.

Add a brief comment at the call site:

// Note: QueryRowContext defers errors to row.Scan(), so err is always nil here.
// query_row metrics always report status="ok". See PR description for follow-up plan.
s.recordQueryMetrics("query_row", query, elapsed, nil)

This also serves as a breadcrumb for the follow-up PR that will fix scan-time status tracking.

Suggestions (2 found)

3. [pr-test-analyzer]: No integration test for `NewClient` wiring when only metrics are enabled (threshold=0) — `gtfsdb/client.go:34`

The condition slowQueryThreshold > 0 || config.QueryMetricsRecorder != nil is the key wiring logic, but TestSlowQueryDB_RecordsQueryMetrics manually constructs the wrapper — it doesn't go through NewClient. If someone changes || to &&, no test would catch it. An integration test that creates a NewClient with QueryMetricsRecorder set and verifies metrics are recorded would close this gap.

4. [code-reviewer]: Consider whether `prometheus.DefBuckets` are appropriate for DB query latency — `internal/metrics/metrics.go:108`

DefBuckets are {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10} seconds. For SQLite queries, most will complete in microseconds (< 1ms), meaning they'll all land in the first bucket (< 5ms). You won't get useful latency distribution from the histogram. Consider tighter buckets like {.0001, .0005, .001, .005, .01, .05, .1, .5, 1} to capture the sub-millisecond range where SQLite actually operates. This is non-blocking — DefBuckets still works, it just won't give great resolution.

Strengths

The DBQueryMetricsRecorder interface is a textbook example of dependency inversion — gtfsdb emits metrics without importing the metrics package
extractQueryName is well-implemented — it handles sqlc headers, skips blank lines and non-sqlc comments, and falls back to "unknown" gracefully
Label cardinality is safe: ~105 sqlc query names x 3 ops x 2 statuses = ~630 combinations, well within Prometheus best practices
Nil safety is handled at multiple levels: recordQueryMetrics checks s.queryMetrics == nil, RecordDBQuery checks m == nil
All three NewClient call sites in static.go are wired, including the error recovery path — no paths were missed
The test mock (testQueryMetricsRecorder) is clean and minimal

Recommended Action

Verdict: Request Changes

Fix the typed-nil interface issue at all three static.go assignment sites
Add a comment at helpers.go:97 documenting the QueryRowContext nil-error limitation
Consider the histogram bucket and integration test suggestions

…tion

…y client path

…ment for query_row status behavior

aaronbrethorst

Hey Aditya, excellent follow-through on all the feedback. Every item has been cleanly addressed:

Previous Feedback — All Resolved

#	Issue	Status
1	Typed-nil interface causing unnecessary wrapper creation	Fixed — `newGTFSDBConfig` now guards with `if config.Metrics != nil`
2	`query_row` nil-error needs comment at call site	Fixed — Clear comment added at `helpers.go:95-96`
3	Integration test for `NewClient` wiring	Addressed — `TestNewClient_RecordsQueryMetricsWhenOnlyMetricsEnabled` added with proper DDL override
4	Histogram buckets too coarse for SQLite	Addressed — Changed to `{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1}`

Additional Improvements

The newGTFSDBConfig helper in static.go is a nice refactor — it consolidates the config construction and nil-guard into one place, eliminating the duplication across all three call sites
Bonus fix: added MockClearServiceIDsCache() call in arrivals_and_departures_for_stop_handler_test.go to fix a pre-existing test issue

Verification

make lint: 0 issues
make test: All packages pass

aaronbrethorst requested changes Mar 19, 2026

View reviewed changes

Adityatorgal17 added 2 commits March 20, 2026 11:31

Add per-query DB query count/duration metrics with gtfsdb instrumenta…

70b65c5

…tion

fix(metrics): avoid typed-nil DB metrics wiring and cover metrics-onl…

0a9af53

…y client path

Adityatorgal17 force-pushed the feat/db-query-prometheus-metrics branch from be89edb to 0a9af53 Compare March 20, 2026 06:26

Adityatorgal17 requested a review from aaronbrethorst March 20, 2026 06:32

fix(gtfsdb): prevent typed-nil metrics interface wrapping and add com…

e4de84a

…ment for query_row status behavior

Adityatorgal17 force-pushed the feat/db-query-prometheus-metrics branch from 32dd759 to e4de84a Compare March 21, 2026 04:14

aaronbrethorst approved these changes Mar 25, 2026

View reviewed changes

aaronbrethorst merged commit a16c475 into OneBusAway:main Mar 25, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add per-query DB query count/duration metrics with gtfsdb instrumenta…#718

Add per-query DB query count/duration metrics with gtfsdb instrumenta…#718
aaronbrethorst merged 3 commits intoOneBusAway:mainfrom
Adityatorgal17:feat/db-query-prometheus-metrics

Adityatorgal17 commented Mar 15, 2026

Uh oh!

aaronbrethorst left a comment

Uh oh!

aaronbrethorst left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Adityatorgal17 commented Mar 15, 2026

Summary

Fixes #717

What changed

Validation Passed

Known limitation (tracked for follow-up)

Uh oh!

aaronbrethorst left a comment

Choose a reason for hiding this comment

Critical Issues (1 found)

1. [silent-failure-hunter]: Typed-nil interface causes unnecessary wrapper creation when metrics are disabled — internal/gtfs/static.go:89,312,328

Important Issues (1 found)

2. [silent-failure-hunter]: query_row metrics always report status="ok" — needs a code comment at the call site — gtfsdb/helpers.go:97

Suggestions (2 found)

3. [pr-test-analyzer]: No integration test for NewClient wiring when only metrics are enabled (threshold=0) — gtfsdb/client.go:34

4. [code-reviewer]: Consider whether prometheus.DefBuckets are appropriate for DB query latency — internal/metrics/metrics.go:108

Strengths

Recommended Action

Uh oh!

aaronbrethorst left a comment

Choose a reason for hiding this comment

Previous Feedback — All Resolved

Additional Improvements

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. [silent-failure-hunter]: Typed-nil interface causes unnecessary wrapper creation when metrics are disabled — `internal/gtfs/static.go:89,312,328`

2. [silent-failure-hunter]: `query_row` metrics always report `status="ok"` — needs a code comment at the call site — `gtfsdb/helpers.go:97`

3. [pr-test-analyzer]: No integration test for `NewClient` wiring when only metrics are enabled (threshold=0) — `gtfsdb/client.go:34`

4. [code-reviewer]: Consider whether `prometheus.DefBuckets` are appropriate for DB query latency — `internal/metrics/metrics.go:108`