Commit 9d1146d
feat: port metrics-reference docs page
1 parent 38bf67e commit 9d1146d

3 files changed

Lines changed: 354 additions & 1 deletion

File tree

content/docs/reference/meta.json

Lines changed: 2 additions & 1 deletion
```diff
@@ -6,6 +6,7 @@
     "agent-architecture",
     "agent-troubleshooting",
     "database",
-    "pipeline-yaml"
+    "pipeline-yaml",
+    "metrics"
   ]
 }
```

content/docs/reference/metrics.mdx

Lines changed: 351 additions & 0 deletions

---
title: "VectorFlow Metrics Reference"
---

VectorFlow exposes a Prometheus-compatible metrics endpoint at `GET /api/metrics`.

## Authentication

The endpoint requires a service account Bearer token with the `metrics.read` permission:

```
Authorization: Bearer vf_<your-service-account-key>
```

Generate a service account key in **Settings → Service Accounts**.

---

## Prometheus Scrape Configuration

Add this job to your `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: vectorflow
    scrape_interval: 30s
    scrape_timeout: 10s
    scheme: https              # use http for local dev
    metrics_path: /api/metrics
    authorization:
      credentials: vf_<your-key>   # or use credentials_file
    static_configs:
      - targets:
          - your-vectorflow-host:443
        labels:
          env: production
```

For Docker Compose environments, replace the target with the service name and port (e.g. `vectorflow:3000`).
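
A Compose-based job might look like the following sketch (the service name, port, and `env` label are illustrative, not taken from the commit):

```yaml
scrape_configs:
  - job_name: vectorflow
    scheme: http               # plain HTTP inside the Compose network
    metrics_path: /api/metrics
    authorization:
      credentials: vf_<your-key>
    static_configs:
      - targets:
          - vectorflow:3000
        labels:
          env: development
```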

---

## Metrics

All VectorFlow metric names are prefixed with `vectorflow_`. Metrics are exposed in **Prometheus text format 0.0.4**.

> **Implementation note:** Throughput counters (`events_in_total`, `events_out_total`, etc.) are registered as Gauge types in prom-client but store cumulative totals sourced from the database. They are monotonically increasing across the lifetime of a pipeline run and behave correctly with `rate()` and `increase()` in PromQL.

---

### Node Metrics

#### `vectorflow_node_status`

Node health status.

| Field | Value |
|-------|-------|
| **Type** | Gauge |
| **Labels** | `node_id`, `node_name`, `environment_id` |

**Value mapping:**

| Value | Status | Meaning |
|-------|--------|---------|
| `1` | `HEALTHY` | Node is reachable and operating normally |
| `2` | `DEGRADED` | Node is reachable but reporting issues |
| `3` | `UNREACHABLE` | Node cannot be contacted |
| `0` | `UNKNOWN` | Status has not been determined yet |

**Example queries:**

```text
# All unhealthy nodes
vectorflow_node_status != 1

# Fraction of healthy nodes
(count(vectorflow_node_status == 1) or vector(0)) / count(vectorflow_node_status)

# Any node currently unreachable (pair with `for: 2m` in an alerting rule
# to require the condition to persist for 2 minutes)
vectorflow_node_status == 3
```
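
The unreachable-node query can be wired into a Prometheus alerting rule; a minimal sketch (the alert name, `for` window, and labels below are illustrative, not taken from the bundled rules file):

```yaml
groups:
  - name: vectorflow-node-alerts
    rules:
      - alert: VectorFlowNodeUnreachable
        expr: vectorflow_node_status == 3
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "VectorFlow node {{ $labels.node_name }} is unreachable"
```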

---

### Pipeline Metrics

All pipeline metrics carry the labels `node_id` and `pipeline_id`.

#### `vectorflow_pipeline_status`

Pipeline process status.

| Field | Value |
|-------|-------|
| **Type** | Gauge |
| **Labels** | `node_id`, `pipeline_id` |

**Value mapping:**

| Value | Status | Meaning |
|-------|--------|---------|
| `1` | `RUNNING` | Pipeline is actively processing events |
| `2` | `STARTING` | Pipeline process is initialising |
| `3` | `STOPPED` | Pipeline was stopped gracefully |
| `4` | `CRASHED` | Pipeline process exited unexpectedly |
| `0` | `PENDING` | Pipeline has not started yet |
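
As with node status, the mapped values can be queried directly; for example:

```text
# Pipelines that crashed
vectorflow_pipeline_status == 4

# Number of running pipelines per node
count by (node_id) (vectorflow_pipeline_status == 1)
```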

---

#### `vectorflow_pipeline_events_in_total`

Cumulative count of events received by the pipeline since it started.

| Field | Value |
|-------|-------|
| **Type** | Gauge (cumulative total) |
| **Unit** | Events |
| **Labels** | `node_id`, `pipeline_id` |

**Example queries:**

```text
# Current ingest rate (events/sec)
rate(vectorflow_pipeline_events_in_total[2m])

# Total events ingested across all pipelines
sum(vectorflow_pipeline_events_in_total)
```

---

#### `vectorflow_pipeline_events_out_total`

Cumulative count of events emitted by the pipeline since it started.

| Field | Value |
|-------|-------|
| **Type** | Gauge (cumulative total) |
| **Unit** | Events |
| **Labels** | `node_id`, `pipeline_id` |

**Example queries:**

```text
# Outbound throughput rate
rate(vectorflow_pipeline_events_out_total[2m])

# Drop rate: events consumed but not forwarded
rate(vectorflow_pipeline_events_in_total[2m])
  - rate(vectorflow_pipeline_events_out_total[2m])
```

---

#### `vectorflow_pipeline_errors_total`

Cumulative count of errors encountered by the pipeline.

| Field | Value |
|-------|-------|
| **Type** | Gauge (cumulative total) |
| **Unit** | Errors |
| **Labels** | `node_id`, `pipeline_id` |

**Example queries:**

```text
# Error rate
rate(vectorflow_pipeline_errors_total[2m])

# Error ratio (errors per inbound event)
rate(vectorflow_pipeline_errors_total[5m])
  / (rate(vectorflow_pipeline_events_in_total[5m]) > 0)
```

---

#### `vectorflow_pipeline_events_discarded_total`

Cumulative count of events intentionally discarded (e.g. by a `filter` or `drop` transform).

| Field | Value |
|-------|-------|
| **Type** | Gauge (cumulative total) |
| **Unit** | Events |
| **Labels** | `node_id`, `pipeline_id` |
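
A rate over this counter shows how aggressively transforms are dropping traffic; for example:

```text
# Discard rate (events/sec)
rate(vectorflow_pipeline_events_discarded_total[2m])

# Share of inbound events that are discarded
rate(vectorflow_pipeline_events_discarded_total[5m])
  / (rate(vectorflow_pipeline_events_in_total[5m]) > 0)
```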

---

#### `vectorflow_pipeline_bytes_in_total`

Cumulative byte volume received by the pipeline since it started.

| Field | Value |
|-------|-------|
| **Type** | Gauge (cumulative total) |
| **Unit** | Bytes |
| **Labels** | `node_id`, `pipeline_id` |

**Example queries:**

```text
# Inbound throughput in bytes/sec
rate(vectorflow_pipeline_bytes_in_total[2m])
```

---

#### `vectorflow_pipeline_bytes_out_total`

Cumulative byte volume emitted by the pipeline since it started.

| Field | Value |
|-------|-------|
| **Type** | Gauge (cumulative total) |
| **Unit** | Bytes |
| **Labels** | `node_id`, `pipeline_id` |
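
Paired with `bytes_in`, this makes size amplification or reduction through the pipeline visible; for example:

```text
# Outbound throughput in bytes/sec
rate(vectorflow_pipeline_bytes_out_total[2m])

# Bytes out per byte in (values below 1 mean the pipeline shrinks traffic)
rate(vectorflow_pipeline_bytes_out_total[5m])
  / (rate(vectorflow_pipeline_bytes_in_total[5m]) > 0)
```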

---

#### `vectorflow_pipeline_utilization`

Fractional CPU/processing utilisation of the pipeline, as reported by the Vector process. Range: `0.0` (idle) to `1.0` (fully saturated).

| Field | Value |
|-------|-------|
| **Type** | Gauge |
| **Unit** | Ratio (0–1) |
| **Labels** | `node_id`, `pipeline_id` |

**Example queries:**

```text
# Pipelines over 80% utilisation
vectorflow_pipeline_utilization > 0.8

# Average utilisation across pipelines reporting non-zero utilisation
avg(vectorflow_pipeline_utilization > 0)
```

---

#### `vectorflow_pipeline_latency_mean_ms`

Mean end-to-end pipeline latency in milliseconds, sourced from the latest `PipelineMetric` snapshot stored in the database. This metric only appears once latency data has been reported.

| Field | Value |
|-------|-------|
| **Type** | Gauge |
| **Unit** | Milliseconds |
| **Labels** | `node_id`, `pipeline_id` |

**Example queries:**

```text
# Pipelines with mean latency > 1 second
vectorflow_pipeline_latency_mean_ms > 1000

# Cross-pipeline 95th percentile of mean latency
quantile(0.95, vectorflow_pipeline_latency_mean_ms)
```

---

### Internal Metrics

#### `vectorflow_metric_store_streams`

Number of active metric streams held in the in-process `MetricStore`. Each stream corresponds to a live metric time series being accumulated in memory before persistence.

| Field | Value |
|-------|-------|
| **Type** | Gauge |
| **Unit** | Count |
| **Labels** | None |
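
A steadily growing stream count can indicate series churn or a leak; for example:

```text
# Change in active metric streams over the last hour
delta(vectorflow_metric_store_streams[1h])
```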

---

#### `vectorflow_metric_store_memory_bytes`

Estimated memory consumed by the in-process `MetricStore`, in bytes.

| Field | Value |
|-------|-------|
| **Type** | Gauge |
| **Unit** | Bytes |
| **Labels** | None |

**Example queries:**

```text
# Alert if MetricStore exceeds 100 MiB
vectorflow_metric_store_memory_bytes > 104857600
```

---

## Summary Table

| Metric | Type | Labels | Unit |
|--------|------|--------|------|
| `vectorflow_node_status` | Gauge | `node_id`, `node_name`, `environment_id` | Enum (0–3) |
| `vectorflow_pipeline_status` | Gauge | `node_id`, `pipeline_id` | Enum (0–4) |
| `vectorflow_pipeline_events_in_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Events |
| `vectorflow_pipeline_events_out_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Events |
| `vectorflow_pipeline_errors_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Errors |
| `vectorflow_pipeline_events_discarded_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Events |
| `vectorflow_pipeline_bytes_in_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Bytes |
| `vectorflow_pipeline_bytes_out_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Bytes |
| `vectorflow_pipeline_utilization` | Gauge | `node_id`, `pipeline_id` | Ratio (0–1) |
| `vectorflow_pipeline_latency_mean_ms` | Gauge | `node_id`, `pipeline_id` | Milliseconds |
| `vectorflow_metric_store_streams` | Gauge | — | Count |
| `vectorflow_metric_store_memory_bytes` | Gauge | — | Bytes |

---

## Pre-built Dashboards and Rules

| File | Description |
|------|-------------|
| `monitoring/grafana/vectorflow-overview.json` | Grafana 10+ dashboard — import via **Dashboards → Import** |
| `monitoring/prometheus/vectorflow.rules.yml` | Recording rules and alerting rules — reference from `prometheus.yml` |
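
The bundled rules file is the source of truth; for orientation, a recording rule in that style might look like this sketch (the rule name below is illustrative, not taken from the file):

```yaml
groups:
  - name: vectorflow-recording
    rules:
      - record: vectorflow:pipeline_events_in:rate2m
        expr: rate(vectorflow_pipeline_events_in_total[2m])
```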

### Loading the Grafana dashboard

1. Open Grafana → **Dashboards → Import**.
2. Upload `monitoring/grafana/vectorflow-overview.json` or paste its contents.
3. Select your Prometheus data source when prompted.
4. Click **Import**.

### Loading the Prometheus rules

Add a reference in `prometheus.yml`:

```yaml
rule_files:
  - /etc/prometheus/rules/vectorflow.rules.yml
```

Then copy `monitoring/prometheus/vectorflow.rules.yml` to that path and reload Prometheus (the reload endpoint requires Prometheus to be started with `--web.enable-lifecycle`):

```bash
curl -X POST http://localhost:9090/-/reload
```

Verify the rules loaded successfully:

```bash
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name | startswith("vectorflow"))'
```

scripts/port-gitbook.mjs

Lines changed: 1 addition & 0 deletions
```diff
@@ -75,6 +75,7 @@ function transformFileEmbeds(source) {
 // edit to the ported file that would regress on the next port run.
 const SHIKI_LANG_REMAP = {
   caddy: 'text',
+  promql: 'text',
 };

 function transformFenceLanguages(source) {
```
