Add per-query DB query count/duration metrics with gtfsdb instrumenta…#718
Conversation
aaronbrethorst
left a comment
There was a problem hiding this comment.
Hey Aditya, this is a well-architected observability addition — the DBQueryMetricsRecorder interface creates a clean dependency boundary between gtfsdb and the metrics package, the sqlc query name extraction is clever, and the wiring across all three NewClient call sites is thorough. The overall design is solid.
There are a couple of issues to address before merging.
Critical Issues (1 found)
1. [silent-failure-hunter]: Typed-nil interface causes unnecessary wrapper creation when metrics are disabled — internal/gtfs/static.go:89,312,328
All three call sites unconditionally assign config.Metrics to the interface field:
dbConfig.QueryMetricsRecorder = config.MetricsWhen config.Metrics is nil (which happens in all tests and any startup path that doesn't configure metrics), Go wraps the nil *metrics.Metrics pointer inside a non-nil DBQueryMetricsRecorder interface. This means config.QueryMetricsRecorder != nil evaluates to true in client.go:34, causing the slowQueryDB wrapper to be created for every DB client — even when no metrics are being collected.
The nil receiver guard in RecordDBQuery prevents a crash, but every database operation now pays the overhead of going through the wrapper, calling recordQueryMetrics, calling RecordDBQuery, and hitting the nil check — all for nothing.
Fix: Guard the assignment at all three sites:
if config.Metrics != nil {
dbConfig.QueryMetricsRecorder = config.Metrics
}This ensures the interface remains an untyped nil when metrics aren't configured, and the wrapper is only created when it's actually needed.
Important Issues (1 found)
2. [silent-failure-hunter]: query_row metrics always report status="ok" — needs a code comment at the call site — gtfsdb/helpers.go:97
The PR description honestly discloses that QueryRowContext passes nil for the error (since *sql.Row defers errors to Scan()). However, this limitation isn't documented at the call site itself. A future developer looking at helpers.go:97 would see recordQueryMetrics("query_row", query, elapsed, nil) and might not understand why the error is hardcoded to nil.
Add a brief comment at the call site:
// Note: QueryRowContext defers errors to row.Scan(), so err is always nil here.
// query_row metrics always report status="ok". See PR description for follow-up plan.
s.recordQueryMetrics("query_row", query, elapsed, nil)This also serves as a breadcrumb for the follow-up PR that will fix scan-time status tracking.
Suggestions (2 found)
3. [pr-test-analyzer]: No integration test for NewClient wiring when only metrics are enabled (threshold=0) — gtfsdb/client.go:34
The condition slowQueryThreshold > 0 || config.QueryMetricsRecorder != nil is the key wiring logic, but TestSlowQueryDB_RecordsQueryMetrics manually constructs the wrapper — it doesn't go through NewClient. If someone changes || to &&, no test would catch it. An integration test that creates a NewClient with QueryMetricsRecorder set and verifies metrics are recorded would close this gap.
4. [code-reviewer]: Consider whether prometheus.DefBuckets are appropriate for DB query latency — internal/metrics/metrics.go:108
DefBuckets are {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10} seconds. For SQLite queries, most will complete in microseconds (< 1ms), meaning they'll all land in the first bucket (< 5ms). You won't get useful latency distribution from the histogram. Consider tighter buckets like {.0001, .0005, .001, .005, .01, .05, .1, .5, 1} to capture the sub-millisecond range where SQLite actually operates. This is non-blocking — DefBuckets still works, it just won't give great resolution.
Strengths
- The
DBQueryMetricsRecorderinterface is a textbook example of dependency inversion —gtfsdbemits metrics without importing themetricspackage extractQueryNameis well-implemented — it handles sqlc headers, skips blank lines and non-sqlc comments, and falls back to "unknown" gracefully- Label cardinality is safe: ~105 sqlc query names x 3 ops x 2 statuses = ~630 combinations, well within Prometheus best practices
- Nil safety is handled at multiple levels:
recordQueryMetricscheckss.queryMetrics == nil,RecordDBQuerychecksm == nil - All three
NewClientcall sites instatic.goare wired, including the error recovery path — no paths were missed - The test mock (
testQueryMetricsRecorder) is clean and minimal
Recommended Action
Verdict: Request Changes
- Fix the typed-nil interface issue at all three
static.goassignment sites - Add a comment at
helpers.go:97documenting theQueryRowContextnil-error limitation - Consider the histogram bucket and integration test suggestions
be89edb to
0a9af53
Compare
…ment for query_row status behavior
32dd759 to
e4de84a
Compare
aaronbrethorst
left a comment
There was a problem hiding this comment.
Hey Aditya, excellent follow-through on all the feedback. Every item has been cleanly addressed:
Previous Feedback — All Resolved
| # | Issue | Status |
|---|---|---|
| 1 | Typed-nil interface causing unnecessary wrapper creation | Fixed — newGTFSDBConfig now guards with if config.Metrics != nil |
| 2 | query_row nil-error needs comment at call site |
Fixed — Clear comment added at helpers.go:95-96 |
| 3 | Integration test for NewClient wiring |
Addressed — TestNewClient_RecordsQueryMetricsWhenOnlyMetricsEnabled added with proper DDL override |
| 4 | Histogram buckets too coarse for SQLite | Addressed — Changed to {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} |
Additional Improvements
- The
newGTFSDBConfighelper instatic.gois a nice refactor — it consolidates the config construction and nil-guard into one place, eliminating the duplication across all three call sites - Bonus fix: added
MockClearServiceIDsCache()call inarrivals_and_departures_for_stop_handler_test.goto fix a pre-existing test issue
Verification
make lint: 0 issuesmake test: All packages pass
Summary
This PR adds per-query database observability to Maglev by exporting query count and latency metrics with
query_name,op, andstatuslabels.Fixes #717
What changed
maglev_db_query_totalmaglev_db_query_duration_secondsRecordDBQuery(...)ininternal/metrics.DBQueryMetricsRecorderinterface ingtfsdb/config.go.gtfsdbquery wrapper to emit metrics for:ExecContext(op=exec)QueryContext(op=query)QueryRowContext(op=query_row)gtfsdb.NewClient(...)creation paths.Validation Passed
make lintmake testKnown limitation (tracked for follow-up)
query_rowstatus is currently based onQueryRowContextdispatch, not finalScan()result.I'll open a follow-up PR to record
query_rowstatus at scan-time.