Merged
35 changes: 28 additions & 7 deletions infra/newrelic/README.md
@@ -18,12 +18,23 @@ infra/newrelic/
p95-latency-high.json # p95 > 500ms over 5m
worker-stalled.json # no jobs processed in 10m
nats-down.json # >=3 NATS error logs in 5m
+ policies/
+ instant-api.json # umbrella policy alerts attach to via policyName
+ tests/
+ bake_test.sh # schema-shape regression tests (run pre-merge)
```

Each JSON file is a stand-alone dashboard or alert condition payload in the shape
- NR's NerdGraph schema expects. The `accountIds: [0]` placeholders in dashboard
- queries are rewritten by the apply tooling (see below) to the real account ID
- before the API call.
+ NR's NerdGraph schema expects. The `accountIds: ["${NEW_RELIC_ACCOUNT_ID}"]`
+ substitution token in dashboard queries is rewritten by the apply tooling (see
+ below) to the real account ID before the API call. Both `apply.sh` and the
+ Terraform path read the same source files — neither needs a special pre-flight
+ adapter step.
+
+ Alert JSON files include a `policyName` field that links them to the umbrella
+ policy declared in `policies/instant-api.json`. There is no `type` discriminator
+ on alert JSON — the mutation name (`alertsNrqlConditionStaticCreate`) encodes
+ that, and including `"type": "NRQL"` causes NerdGraph to reject the payload.

## Required env

@@ -59,8 +70,8 @@ provider "newrelic" {
resource "newrelic_one_dashboard_json" "api_overview" {
json = replace(
file("${path.module}/dashboards/api-overview.json"),
- "\"accountIds\": [0]",
- "\"accountIds\": [${var.account_id}]",
+ "\"${NEW_RELIC_ACCOUNT_ID}\"",
+ tostring(var.account_id),
)
}

@@ -163,13 +174,23 @@ see `infra/k8s/README.md` (owned by track 6) for that procedure.
## Validation

```bash
- # Every JSON file must parse
- find infra/newrelic -name '*.json' -exec jq empty {} \;
+ # Every JSON file must parse + the schema-shape bake assertions must hold
+ bash infra/newrelic/tests/bake_test.sh

# NRQL queries are not lintable offline — copy a query into the NR UI's
# "Query your data" tool to sanity-check syntax after any edit.
```

+ `bake_test.sh` enforces:
+
+ - Every dashboard/alert/policy file parses as JSON.
+ - Dashboards use the `"${NEW_RELIC_ACCOUNT_ID}"` substitution token, never
+ the legacy `[0]` placeholder.
+ - Alerts have no top-level `"type"` field — NerdGraph rejects it on the
+ `NrqlConditionStaticInput` mutation.
+ - Every alert has `policyName: "instant-api alerts"` linking to
+ `policies/instant-api.json`.
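
The checks listed above could be sketched roughly as follows. This is an illustrative Python version only: the contents of `bake_test.sh` are not part of this diff, and the helper names `account_ids`, `check_dashboard`, and `check_alert` are hypothetical.

```python
import json

TOKEN = "${NEW_RELIC_ACCOUNT_ID}"   # substitution token named in the README
POLICY_NAME = "instant-api alerts"  # name declared in policies/instant-api.json


def account_ids(node):
    """Yield every accountIds value found anywhere in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "accountIds":
                yield value
            else:
                yield from account_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from account_ids(item)


def check_dashboard(raw):
    """Return a list of problems; json.loads raises if the file does not parse."""
    return [
        "accountIds should be [%r], got %r" % (TOKEN, ids)
        for ids in account_ids(json.loads(raw))
        if ids != [TOKEN]
    ]


def check_alert(raw):
    """Return a list of problems for one alert condition file."""
    alert = json.loads(raw)
    problems = []
    if "type" in alert:
        problems.append("top-level 'type' field is not allowed")
    if alert.get("policyName") != POLICY_NAME:
        problems.append("policyName must be %r" % POLICY_NAME)
    return problems


# A dashboard still using the legacy placeholder and an alert still carrying
# "type" both produce non-empty problem lists.
print(check_dashboard('{"nrqlQueries": [{"accountIds": [0]}]}'))
print(check_alert('{"name": "example", "type": "NRQL"}'))
```

Walking the parsed JSON rather than grepping the raw text keeps the dashboard check independent of whitespace inside the files.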

## Dependencies

These payloads assume the metrics/log fields wired up by:
2 changes: 1 addition & 1 deletion infra/newrelic/alerts/error-rate-high.json
@@ -1,6 +1,6 @@
{
"name": "instant-api — error rate > 1% (5m)",
- "type": "NRQL",
+ "policyName": "instant-api alerts",
"description": "Page when instant-api error rate exceeds 1% sustained for 5 minutes. Sourced from Transaction events emitted by track 3's Fiber NR middleware.",
"enabled": true,
"nrql": {
2 changes: 1 addition & 1 deletion infra/newrelic/alerts/nats-down.json
@@ -1,6 +1,6 @@
{
"name": "instant-api — NATS connection failures",
- "type": "NRQL",
+ "policyName": "instant-api alerts",
"description": "Page when the api logs NATS connection failures. Triggers on any error log mentioning NATS — covers JetStream unreachable, auth failure, or stream not found. Threshold deliberately low (>=3 in 5min) to catch real outages without paging on transient blips.",
"enabled": true,
"nrql": {
2 changes: 1 addition & 1 deletion infra/newrelic/alerts/p95-latency-high.json
@@ -1,6 +1,6 @@
{
"name": "instant-api — p95 latency > 500ms (5m)",
- "type": "NRQL",
+ "policyName": "instant-api alerts",
"description": "Page when instant-api p95 latency exceeds 500ms sustained for 5 minutes. Tracks user-visible slowness on provisioning + dashboard read paths.",
"enabled": true,
"nrql": {
2 changes: 1 addition & 1 deletion infra/newrelic/alerts/worker-stalled.json
@@ -1,6 +1,6 @@
{
"name": "instant-worker — no jobs processed in 10m",
- "type": "NRQL",
+ "policyName": "instant-api alerts",
"description": "Page when the worker has processed zero jobs for 10 minutes. Catches stalled River pollers, deadlocks against the platform DB, and pod crash loops missed by k8s readiness probes. Source: Log records emitted by River job middleware in track 4.",
"enabled": true,
"nrql": {
14 changes: 7 additions & 7 deletions infra/newrelic/dashboards/api-overview.json
@@ -14,7 +14,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT rate(count(*), 1 minute) FROM Transaction WHERE appName LIKE 'instant-api%' TIMESERIES AUTO"
}
],
@@ -28,7 +28,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT percentage(count(*), WHERE error IS true) AS 'error rate' FROM Transaction WHERE appName LIKE 'instant-api%' TIMESERIES AUTO"
}
],
@@ -42,7 +42,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT percentile(duration, 95) * 1000 AS 'p95 ms', percentile(duration, 99) * 1000 AS 'p99 ms' FROM Transaction WHERE appName LIKE 'instant-api%' TIMESERIES AUTO"
}
],
@@ -56,7 +56,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) AS 'requests', percentile(duration, 95) * 1000 AS 'p95 ms', percentage(count(*), WHERE error IS true) AS 'err %' FROM Transaction WHERE appName LIKE 'instant-api%' FACET name LIMIT 25"
}
],
@@ -70,7 +70,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Transaction WHERE appName LIKE 'instant-api%' AND httpResponseCode >= '500' FACET name LIMIT 20"
}
],
@@ -84,7 +84,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT apdex(duration, t: 0.5) FROM Transaction WHERE appName LIKE 'instant-api%'"
}
],
@@ -102,7 +102,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Log WHERE service = 'api' AND level = 'ERROR' FACET commit_id TIMESERIES AUTO LIMIT 5"
}
],
14 changes: 7 additions & 7 deletions infra/newrelic/dashboards/deploy.json
@@ -14,7 +14,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT filter(count(*), WHERE deploy_status = 'success') AS 'success', filter(count(*), WHERE deploy_status = 'fail') AS 'fail' FROM Log WHERE service = 'api' AND deploy_status IS NOT NULL SINCE 24 hours ago"
}
],
@@ -28,7 +28,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT percentile(build_duration_seconds, 50) AS 'p50 s', percentile(build_duration_seconds, 95) AS 'p95 s' FROM Log WHERE service = 'api' AND build_duration_seconds IS NOT NULL TIMESERIES AUTO"
}
],
@@ -42,7 +42,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT percentile(duration, 95) * 1000 AS 'p95 ms' FROM Transaction WHERE appName LIKE 'instant-api%' AND name LIKE '%/deploy%' FACET name TIMESERIES AUTO"
}
],
@@ -56,7 +56,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Log WHERE service = 'api' AND deploy_status = 'fail' FACET fail_reason LIMIT 10"
}
],
@@ -70,7 +70,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT uniqueCount(deploy_id) FROM Log WHERE service = 'api' AND deploy_status = 'running' SINCE 5 minutes ago"
}
],
@@ -84,7 +84,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Log WHERE service = 'api' AND message LIKE '%deploy.created%' FACET tier SINCE 7 days ago"
}
],
@@ -98,7 +98,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Transaction WHERE appName LIKE 'instant-api%' AND name LIKE '%/redeploy' TIMESERIES 1 day"
}
],
12 changes: 6 additions & 6 deletions infra/newrelic/dashboards/provisioning.json
@@ -14,7 +14,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT (sum(newrelic.timeslice.value) / (sum(newrelic.timeslice.value) + filter(sum(newrelic.timeslice.value), WHERE metricTimesliceName = 'Custom/Provision/Fail'))) * 100 AS 'success %' FROM Metric WHERE metricTimesliceName = 'Custom/Provision/Success' SINCE 1 hour ago"
}
],
@@ -32,7 +32,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT sum(newrelic.timeslice.value) AS 'count' FROM Metric WHERE metricTimesliceName IN ('Custom/Provision/Success', 'Custom/Provision/Fail') FACET metricTimesliceName TIMESERIES AUTO"
}
],
@@ -46,7 +46,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT percentile(duration, 50) * 1000 AS 'median ms', percentile(duration, 95) * 1000 AS 'p95 ms' FROM Transaction WHERE appName LIKE 'instant-api%' AND name IN ('POST /db/new', 'POST /cache/new', 'POST /nosql/new', 'POST /queue/new', 'POST /storage/new', 'POST /webhook/new') FACET name TIMESERIES AUTO"
}
],
@@ -60,7 +60,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Log WHERE service = 'api' AND level = 'ERROR' AND message LIKE '%provision%' FACET resource_type LIMIT 10"
}
],
@@ -74,7 +74,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT sum(newrelic.timeslice.value) FROM Metric WHERE metricTimesliceName = 'Custom/Resource/Expired' TIMESERIES AUTO"
}
],
@@ -88,7 +88,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Log WHERE service = 'api' AND message LIKE '%provisioned%' FACET tier"
}
],
16 changes: 8 additions & 8 deletions infra/newrelic/dashboards/worker.json
@@ -14,7 +14,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT rate(count(*), 1 minute) FROM Log WHERE service = 'worker' AND message = 'job.completed' TIMESERIES AUTO"
}
],
@@ -28,7 +28,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Log WHERE service = 'worker' AND message IN ('job.completed', 'job.failed') FACET message TIMESERIES AUTO"
}
],
@@ -42,7 +42,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT rate(count(*), 1 minute) FROM Log WHERE service = 'worker' AND message = 'job.retried' TIMESERIES AUTO"
}
],
@@ -56,7 +56,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Log WHERE service = 'worker' AND message = 'job.completed' FACET job_kind TIMESERIES AUTO"
}
],
@@ -70,7 +70,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT percentile(duration_ms, 95) FROM Log WHERE service = 'worker' AND message = 'job.completed' AND duration_ms IS NOT NULL FACET job_kind TIMESERIES AUTO"
}
],
@@ -84,7 +84,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT (max(timestamp) - min(timestamp)) / 1000 AS 'seconds since expire run' FROM Log WHERE service = 'worker' AND job_kind = 'expire' AND message = 'job.completed' SINCE 1 hour ago"
}
],
@@ -102,7 +102,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Log WHERE service = 'worker' AND message = 'job.failed' FACET job_kind SINCE 24 hours ago LIMIT 10"
}
],
@@ -116,7 +116,7 @@
"rawConfiguration": {
"nrqlQueries": [
{
- "accountIds": [0],
+ "accountIds": ["${NEW_RELIC_ACCOUNT_ID}"],
"query": "SELECT count(*) FROM Log WHERE service = 'worker' AND level = 'ERROR' TIMESERIES AUTO"
}
],
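
Every dashboard hunk above makes the same change: the numeric `[0]` placeholder becomes a quoted `"${NEW_RELIC_ACCOUNT_ID}"` token. The source files remain valid JSON because the token is an ordinary string, and the README states that the apply tooling rewrites it to the real account ID before the API call. A minimal Python sketch of that rewrite follows. It assumes the whole quoted token is replaced by a bare integer; `render_dashboard` is a hypothetical helper, and the real `apply.sh` is not shown in this diff.

```python
import json

# The token including its surrounding JSON quotes, so that the replacement
# yields a bare number rather than a numeric string.
QUOTED_TOKEN = '"${NEW_RELIC_ACCOUNT_ID}"'


def render_dashboard(raw, account_id):
    """Substitute the account ID token and return the parsed dashboard."""
    rendered = raw.replace(QUOTED_TOKEN, str(int(account_id)))
    if "${NEW_RELIC_ACCOUNT_ID}" in rendered:
        # An unquoted or otherwise malformed token survived the replacement.
        raise ValueError("unreplaced account ID token")
    return json.loads(rendered)


widget = '{"accountIds": ["${NEW_RELIC_ACCOUNT_ID}"], "query": "SELECT count(*) FROM Transaction"}'
print(render_dashboard(widget, 1234567))
# {'accountIds': [1234567], 'query': 'SELECT count(*) FROM Transaction'}
```

One detail worth checking on the Terraform side: HCL treats `${...}` inside a quoted string as interpolation, so matching the literal token in `replace()` normally requires the `$${...}` escape in the search string.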
4 changes: 4 additions & 0 deletions infra/newrelic/policies/instant-api.json
@@ -0,0 +1,4 @@
+ {
+ "name": "instant-api alerts",
+ "incidentPreference": "PER_CONDITION_AND_TARGET"
+ }
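
The alert files reference this policy by name through `policyName`. The sketch below illustrates how tooling might pair an alert with its policy. It assumes that the tooling resolves the name to a policy and sends the remaining fields as the condition body; neither step is shown in this diff, and `split_alert` is a hypothetical helper.

```python
import json


def split_alert(alert_raw, policy_raw):
    """Return (policy_name, condition_body) for one alert/policy pair."""
    alert = json.loads(alert_raw)
    policy = json.loads(policy_raw)
    if "type" in alert:
        raise ValueError("alert must not carry a top-level 'type' field")
    if alert.get("policyName") != policy["name"]:
        raise ValueError("alert policyName does not match the policy file")
    # Assumption: policyName is a local linking field, not part of the condition.
    condition = {k: v for k, v in alert.items() if k != "policyName"}
    return policy["name"], condition


policy_file = '{"name": "instant-api alerts", "incidentPreference": "PER_CONDITION_AND_TARGET"}'
alert_file = '{"name": "example condition", "policyName": "instant-api alerts", "enabled": true}'
name, condition = split_alert(alert_file, policy_file)
print(name, sorted(condition))
```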