| 1 | +--- |
| 2 | +title: "VectorFlow Metrics Reference" |
| 3 | +--- |
| 4 | + |
| 5 | +VectorFlow exposes a Prometheus-compatible metrics endpoint at `GET /api/metrics`. |
| 6 | + |
| 7 | +## Authentication |
| 8 | + |
| 9 | +The endpoint requires a service account Bearer token with the `metrics.read` permission: |
| 10 | + |
| 11 | +``` |
| 12 | +Authorization: Bearer vf_<your-service-account-key> |
| 13 | +``` |
| 14 | + |
| 15 | +Generate a service account key in **Settings → Service Accounts**. |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## Prometheus Scrape Configuration |
| 20 | + |
| 21 | +Add this job to your `prometheus.yml`: |
| 22 | + |
| 23 | +```yaml |
| 24 | +scrape_configs: |
| 25 | +  - job_name: vectorflow |
| 26 | +    scrape_interval: 30s |
| 27 | +    scrape_timeout: 10s |
| 28 | +    scheme: https            # use http for local dev |
| 29 | +    metrics_path: /api/metrics |
| 30 | +    authorization: |
| 31 | +      credentials: vf_<your-key>   # or use credentials_file |
| 32 | +    static_configs: |
| 33 | +      - targets: |
| 34 | +          - your-vectorflow-host:443 |
| 35 | +        labels: |
| 36 | +          env: production |
| 37 | +``` |
| 38 | + |
| 39 | +For Docker Compose environments, replace the target with the service name and port (e.g. `vectorflow:3000`). |
| 40 | + |
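For the Docker Compose case, the job might look like the sketch below. It assumes the service is named `vectorflow` and serves plain HTTP on port 3000 (as in the example target above), and that the key is mounted into the Prometheus container at a hypothetical path, `/etc/prometheus/secrets/vectorflow_token`. Adjust all three to your setup.

```yaml
scrape_configs:
  - job_name: vectorflow
    scrape_interval: 30s
    scrape_timeout: 10s
    scheme: http                 # assumption: no TLS inside the Compose network
    metrics_path: /api/metrics
    authorization:
      type: Bearer               # Prometheus default, shown for clarity
      # Hypothetical path; the file contains only the vf_... key.
      credentials_file: /etc/prometheus/secrets/vectorflow_token
    static_configs:
      - targets:
          - vectorflow:3000      # Compose service name and port
```

Reading the key from `credentials_file` keeps it out of `prometheus.yml`, which is often committed to version control.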
| 41 | +--- |
| 42 | + |
| 43 | +## Metrics |
| 44 | + |
| 45 | +All VectorFlow metric names are prefixed with `vectorflow_`. Metrics are exposed in **Prometheus text format 0.0.4**. |
| 46 | + |
| 47 | +> **Implementation note:** The cumulative metrics (`vectorflow_pipeline_events_in_total`, `vectorflow_pipeline_events_out_total`, and the other `_total` series) are registered as Gauge types in prom-client but store cumulative totals sourced from the database. They increase monotonically for the lifetime of a pipeline run, so `rate()` and `increase()` work on them in PromQL; both functions treat any decrease in value as a counter reset. |
| 48 | + |
| 49 | +--- |
| 50 | + |
| 51 | +### Node Metrics |
| 52 | + |
| 53 | +#### `vectorflow_node_status` |
| 54 | + |
| 55 | +Node health status. |
| 56 | + |
| 57 | +| Field | Value | |
| 58 | +|-------|-------| |
| 59 | +| **Type** | Gauge | |
| 60 | +| **Labels** | `node_id`, `node_name`, `environment_id` | |
| 61 | + |
| 62 | +**Value mapping:** |
| 63 | + |
| 64 | +| Value | Status | Meaning | |
| 65 | +|-------|--------|---------| |
| 66 | +| `1` | `HEALTHY` | Node is reachable and operating normally | |
| 67 | +| `2` | `DEGRADED` | Node is reachable but reporting issues | |
| 68 | +| `3` | `UNREACHABLE` | Node cannot be contacted | |
| 69 | +| `0` | `UNKNOWN` | Status has not been determined yet | |
| 70 | + |
| 71 | +**Example queries:** |
| 72 | + |
| 73 | +```text |
| 74 | +# All unhealthy nodes |
| 75 | +vectorflow_node_status != 1 |
| 76 | + |
| 77 | +# Fraction of healthy nodes |
| 78 | +(count(vectorflow_node_status == 1) or vector(0)) / count(vectorflow_node_status) |
| 79 | + |
| 80 | +# Alert: any node unreachable (set for: 2m in the alerting rule to require >2 min) |
| 81 | +vectorflow_node_status == 3 |
| 82 | +``` |
| 83 | + |
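The "unreachable for >2 min" condition is expressed by the rule's `for` clause rather than by the query itself. A minimal sketch of such an alerting rule follows; the group name, alert name, and `severity` label are illustrative assumptions and are not taken from the shipped `vectorflow.rules.yml`.

```yaml
groups:
  - name: vectorflow-example-node-alerts        # hypothetical group name
    rules:
      - alert: VectorFlowNodeUnreachable        # hypothetical alert name
        expr: vectorflow_node_status == 3       # 3 = UNREACHABLE
        for: 2m                                 # must hold for 2 minutes before firing
        labels:
          severity: critical                    # assumed severity scheme
        annotations:
          summary: "VectorFlow node {{ $labels.node_name }} is unreachable"
```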
| 84 | +--- |
| 85 | + |
| 86 | +### Pipeline Metrics |
| 87 | + |
| 88 | +All pipeline metrics carry the labels `node_id` and `pipeline_id`. |
| 89 | + |
| 90 | +#### `vectorflow_pipeline_status` |
| 91 | + |
| 92 | +Pipeline process status. |
| 93 | + |
| 94 | +| Field | Value | |
| 95 | +|-------|-------| |
| 96 | +| **Type** | Gauge | |
| 97 | +| **Labels** | `node_id`, `pipeline_id` | |
| 98 | + |
| 99 | +**Value mapping:** |
| 100 | + |
| 101 | +| Value | Status | Meaning | |
| 102 | +|-------|--------|---------| |
| 103 | +| `1` | `RUNNING` | Pipeline is actively processing events | |
| 104 | +| `2` | `STARTING` | Pipeline process is initialising | |
| 105 | +| `3` | `STOPPED` | Pipeline was stopped gracefully | |
| 106 | +| `4` | `CRASHED` | Pipeline process exited unexpectedly | |
| 107 | +| `0` | `PENDING` | Pipeline has not started yet | |
| 108 | + |
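The value mapping above lends itself to a crash alert. The following is a sketch only: the names, the one-minute hold, and the severity are assumptions to adapt.

```yaml
groups:
  - name: vectorflow-example-pipeline-alerts    # hypothetical group name
    rules:
      - alert: VectorFlowPipelineCrashed        # hypothetical alert name
        expr: vectorflow_pipeline_status == 4   # 4 = CRASHED
        for: 1m                                 # assumed hold time to skip brief restarts
        labels:
          severity: critical                    # assumed severity scheme
        annotations:
          summary: "Pipeline {{ $labels.pipeline_id }} crashed on node {{ $labels.node_id }}"
```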
| 109 | +--- |
| 110 | + |
| 111 | +#### `vectorflow_pipeline_events_in_total` |
| 112 | + |
| 113 | +Cumulative count of events received by the pipeline since it started. |
| 114 | + |
| 115 | +| Field | Value | |
| 116 | +|-------|-------| |
| 117 | +| **Type** | Gauge (cumulative total) | |
| 118 | +| **Unit** | Events | |
| 119 | +| **Labels** | `node_id`, `pipeline_id` | |
| 120 | + |
| 121 | +**Example queries:** |
| 122 | + |
| 123 | +```text |
| 124 | +# Current ingest rate (events/sec) |
| 125 | +rate(vectorflow_pipeline_events_in_total[2m]) |
| 126 | + |
| 127 | +# Total events ingested across all pipelines |
| 128 | +sum(vectorflow_pipeline_events_in_total) |
| 129 | +``` |
| 130 | + |
| 131 | +--- |
| 132 | + |
| 133 | +#### `vectorflow_pipeline_events_out_total` |
| 134 | + |
| 135 | +Cumulative count of events emitted by the pipeline since it started. |
| 136 | + |
| 137 | +| Field | Value | |
| 138 | +|-------|-------| |
| 139 | +| **Type** | Gauge (cumulative total) | |
| 140 | +| **Unit** | Events | |
| 141 | +| **Labels** | `node_id`, `pipeline_id` | |
| 142 | + |
| 143 | +**Example queries:** |
| 144 | + |
| 145 | +```text |
| 146 | +# Outbound throughput rate |
| 147 | +rate(vectorflow_pipeline_events_out_total[2m]) |
| 148 | + |
| 149 | +# Drop rate: events consumed but not forwarded |
| 150 | +rate(vectorflow_pipeline_events_in_total[2m]) |
| 151 | + - rate(vectorflow_pipeline_events_out_total[2m]) |
| 152 | +``` |
| 153 | + |
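If dashboards evaluate these rate expressions often, they can be precomputed as recording rules. The sketch below uses hypothetical rule names in the `level:metric:operation` style; the shipped `vectorflow.rules.yml` may already define equivalents under different names.

```yaml
groups:
  - name: vectorflow-example-throughput         # hypothetical group name
    rules:
      - record: vectorflow:pipeline_events_in:rate2m    # hypothetical rule name
        expr: rate(vectorflow_pipeline_events_in_total[2m])
      - record: vectorflow:pipeline_events_out:rate2m   # hypothetical rule name
        expr: rate(vectorflow_pipeline_events_out_total[2m])
      - record: vectorflow:pipeline_events_gap:rate2m   # hypothetical rule name
        expr: |
          rate(vectorflow_pipeline_events_in_total[2m])
            - rate(vectorflow_pipeline_events_out_total[2m])
```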
| 154 | +--- |
| 155 | + |
| 156 | +#### `vectorflow_pipeline_errors_total` |
| 157 | + |
| 158 | +Cumulative count of errors encountered by the pipeline. |
| 159 | + |
| 160 | +| Field | Value | |
| 161 | +|-------|-------| |
| 162 | +| **Type** | Gauge (cumulative total) | |
| 163 | +| **Unit** | Errors | |
| 164 | +| **Labels** | `node_id`, `pipeline_id` | |
| 165 | + |
| 166 | +**Example queries:** |
| 167 | + |
| 168 | +```text |
| 169 | +# Error rate |
| 170 | +rate(vectorflow_pipeline_errors_total[2m]) |
| 171 | + |
| 172 | +# Error ratio (errors per inbound event) |
| 173 | +rate(vectorflow_pipeline_errors_total[5m]) |
| 174 | + / (rate(vectorflow_pipeline_events_in_total[5m]) > 0) |
| 175 | +``` |
| 176 | + |
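The error-ratio query above can be turned into an alert by adding a threshold. In this sketch the 5% threshold, the ten-minute hold, and all names are assumptions chosen for illustration.

```yaml
groups:
  - name: vectorflow-example-error-alerts       # hypothetical group name
    rules:
      - alert: VectorFlowPipelineHighErrorRatio # hypothetical alert name
        # The "> 0" filter drops pipelines with no inbound traffic,
        # which avoids division by zero.
        expr: |
          rate(vectorflow_pipeline_errors_total[5m])
            / (rate(vectorflow_pipeline_events_in_total[5m]) > 0)
          > 0.05
        for: 10m                                # assumed hold time
        labels:
          severity: warning                     # assumed severity scheme
        annotations:
          summary: "Pipeline {{ $labels.pipeline_id }} error ratio is {{ $value | humanizePercentage }}"
```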
| 177 | +--- |
| 178 | + |
| 179 | +#### `vectorflow_pipeline_events_discarded_total` |
| 180 | + |
| 181 | +Cumulative count of events intentionally discarded (e.g. by a `filter` or `drop` transform). |
| 182 | + |
| 183 | +| Field | Value | |
| 184 | +|-------|-------| |
| 185 | +| **Type** | Gauge (cumulative total) | |
| 186 | +| **Unit** | Events | |
| 187 | +| **Labels** | `node_id`, `pipeline_id` | |
| 188 | + |
| 189 | +--- |
| 190 | + |
| 191 | +#### `vectorflow_pipeline_bytes_in_total` |
| 192 | + |
| 193 | +Cumulative byte volume received by the pipeline since it started. |
| 194 | + |
| 195 | +| Field | Value | |
| 196 | +|-------|-------| |
| 197 | +| **Type** | Gauge (cumulative total) | |
| 198 | +| **Unit** | Bytes | |
| 199 | +| **Labels** | `node_id`, `pipeline_id` | |
| 200 | + |
| 201 | +**Example queries:** |
| 202 | + |
| 203 | +```text |
| 204 | +# Inbound throughput in bytes/sec |
| 205 | +rate(vectorflow_pipeline_bytes_in_total[2m]) |
| 206 | +``` |
| 207 | + |
| 208 | +--- |
| 209 | + |
| 210 | +#### `vectorflow_pipeline_bytes_out_total` |
| 211 | + |
| 212 | +Cumulative byte volume emitted by the pipeline since it started. |
| 213 | + |
| 214 | +| Field | Value | |
| 215 | +|-------|-------| |
| 216 | +| **Type** | Gauge (cumulative total) | |
| 217 | +| **Unit** | Bytes | |
| 218 | +| **Labels** | `node_id`, `pipeline_id` | |
| 219 | + |
| 220 | +--- |
| 221 | + |
| 222 | +#### `vectorflow_pipeline_utilization` |
| 223 | + |
| 224 | +Fractional CPU/processing utilisation of the pipeline, as reported by the Vector process. Range: `0.0` (idle) to `1.0` (fully saturated). |
| 225 | + |
| 226 | +| Field | Value | |
| 227 | +|-------|-------| |
| 228 | +| **Type** | Gauge | |
| 229 | +| **Unit** | Ratio (0–1) | |
| 230 | +| **Labels** | `node_id`, `pipeline_id` | |
| 231 | + |
| 232 | +**Example queries:** |
| 233 | + |
| 234 | +```text |
| 235 | +# Pipelines over 80% utilisation |
| 236 | +vectorflow_pipeline_utilization > 0.8 |
| 237 | + |
| 238 | +# Average utilisation across pipelines reporting non-zero utilisation |
| 239 | +avg(vectorflow_pipeline_utilization > 0) |
| 240 | +``` |
| 241 | + |
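A sustained-saturation alert could be built on the 80% query above. This is a sketch under assumed values: the hold time, severity, and names are not prescribed by VectorFlow.

```yaml
groups:
  - name: vectorflow-example-utilization-alerts # hypothetical group name
    rules:
      - alert: VectorFlowPipelineSaturated      # hypothetical alert name
        expr: vectorflow_pipeline_utilization > 0.8
        for: 10m                                # assumed: ignore short spikes
        labels:
          severity: warning                     # assumed severity scheme
        annotations:
          summary: "Pipeline {{ $labels.pipeline_id }} utilisation is {{ $value | humanizePercentage }}"
```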
| 242 | +--- |
| 243 | + |
| 244 | +#### `vectorflow_pipeline_latency_mean_ms` |
| 245 | + |
| 246 | +Mean end-to-end pipeline latency in milliseconds, sourced from the latest `PipelineMetric` snapshot stored in the database. This metric only appears when latency data has been reported. |
| 247 | + |
| 248 | +| Field | Value | |
| 249 | +|-------|-------| |
| 250 | +| **Type** | Gauge | |
| 251 | +| **Unit** | Milliseconds | |
| 252 | +| **Labels** | `node_id`, `pipeline_id` | |
| 253 | + |
| 254 | +**Example queries:** |
| 255 | + |
| 256 | +```text |
| 257 | +# Pipelines with mean latency > 1 second |
| 258 | +vectorflow_pipeline_latency_mean_ms > 1000 |
| 259 | + |
| 260 | +# 95th percentile of per-pipeline mean latency across all pipelines |
| 261 | +quantile(0.95, vectorflow_pipeline_latency_mean_ms) |
| 262 | +``` |
| 263 | + |
| 264 | +--- |
| 265 | + |
| 266 | +### Internal Metrics |
| 267 | + |
| 268 | +#### `vectorflow_metric_store_streams` |
| 269 | + |
| 270 | +Number of active metric streams held in the in-process `MetricStore`. Each stream corresponds to a live metric time series being accumulated in memory before persistence. |
| 271 | + |
| 272 | +| Field | Value | |
| 273 | +|-------|-------| |
| 274 | +| **Type** | Gauge | |
| 275 | +| **Unit** | Count | |
| 276 | +| **Labels** | None | |
| 277 | + |
| 278 | +--- |
| 279 | + |
| 280 | +#### `vectorflow_metric_store_memory_bytes` |
| 281 | + |
| 282 | +Estimated memory consumed by the in-process `MetricStore`, in bytes. |
| 283 | + |
| 284 | +| Field | Value | |
| 285 | +|-------|-------| |
| 286 | +| **Type** | Gauge | |
| 287 | +| **Unit** | Bytes | |
| 288 | +| **Labels** | None | |
| 289 | + |
| 290 | +**Example queries:** |
| 291 | + |
| 292 | +```text |
| 293 | +# Alert if MetricStore exceeds 100 MiB |
| 294 | +vectorflow_metric_store_memory_bytes > 104857600 |
| 295 | +``` |
| 296 | + |
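The same threshold could be wrapped in an alerting rule, for example as below. The names, the five-minute hold, and the severity are assumptions; the 100 MiB limit is the one used in the query above.

```yaml
groups:
  - name: vectorflow-example-internal-alerts    # hypothetical group name
    rules:
      - alert: VectorFlowMetricStoreMemoryHigh  # hypothetical alert name
        expr: vectorflow_metric_store_memory_bytes > 104857600   # 100 MiB
        for: 5m                                 # assumed hold time
        labels:
          severity: warning                     # assumed severity scheme
        annotations:
          summary: "MetricStore is using {{ $value | humanize1024 }}B of memory"
```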
| 297 | +--- |
| 298 | + |
| 299 | +## Summary Table |
| 300 | + |
| 301 | +| Metric | Type | Labels | Unit | |
| 302 | +|--------|------|--------|------| |
| 303 | +| `vectorflow_node_status` | Gauge | `node_id`, `node_name`, `environment_id` | Enum (0–3) | |
| 304 | +| `vectorflow_pipeline_status` | Gauge | `node_id`, `pipeline_id` | Enum (0–4) | |
| 305 | +| `vectorflow_pipeline_events_in_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Events | |
| 306 | +| `vectorflow_pipeline_events_out_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Events | |
| 307 | +| `vectorflow_pipeline_errors_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Errors | |
| 308 | +| `vectorflow_pipeline_events_discarded_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Events | |
| 309 | +| `vectorflow_pipeline_bytes_in_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Bytes | |
| 310 | +| `vectorflow_pipeline_bytes_out_total` | Gauge (cumulative) | `node_id`, `pipeline_id` | Bytes | |
| 311 | +| `vectorflow_pipeline_utilization` | Gauge | `node_id`, `pipeline_id` | Ratio (0–1) | |
| 312 | +| `vectorflow_pipeline_latency_mean_ms` | Gauge | `node_id`, `pipeline_id` | Milliseconds | |
| 313 | +| `vectorflow_metric_store_streams` | Gauge | — | Count | |
| 314 | +| `vectorflow_metric_store_memory_bytes` | Gauge | — | Bytes | |
| 315 | + |
| 316 | +--- |
| 317 | + |
| 318 | +## Pre-built Dashboards and Rules |
| 319 | + |
| 320 | +| File | Description | |
| 321 | +|------|-------------| |
| 322 | +| `monitoring/grafana/vectorflow-overview.json` | Grafana 10+ dashboard — import via **Dashboards → Import** | |
| 323 | +| `monitoring/prometheus/vectorflow.rules.yml` | Recording rules and alerting rules — reference from `prometheus.yml` | |
| 324 | + |
| 325 | +### Loading the Grafana dashboard |
| 326 | + |
| 327 | +1. Open Grafana → **Dashboards → Import**. |
| 328 | +2. Upload `monitoring/grafana/vectorflow-overview.json` or paste its contents. |
| 329 | +3. Select your Prometheus data source when prompted. |
| 330 | +4. Click **Import**. |
| 331 | + |
| 332 | +### Loading the Prometheus rules |
| 333 | + |
| 334 | +Add a reference in `prometheus.yml`: |
| 335 | + |
| 336 | +```yaml |
| 337 | +rule_files: |
| 338 | +  - /etc/prometheus/rules/vectorflow.rules.yml |
| 339 | +``` |
| 340 | + |
| 341 | +Then copy `monitoring/prometheus/vectorflow.rules.yml` to that path and reload Prometheus (the `/-/reload` endpoint requires Prometheus to be started with `--web.enable-lifecycle`): |
| 342 | + |
| 343 | +```bash |
| 344 | +curl -X POST http://localhost:9090/-/reload |
| 345 | +``` |
| 346 | + |
| 347 | +Verify rules loaded successfully: |
| 348 | + |
| 349 | +```bash |
| 350 | +curl http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name | startswith("vectorflow"))' |
| 351 | +``` |