Mistake in the documentation of the Prometheus setup #1

@dmrub

IMHO there is a bug in the documentation for setting up Prometheus (kubernetes/kube-prometheus.rst).

The following scrape config discovers all Kubernetes endpoints in the gpu-operator namespace and scrapes each of them for metrics:

additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node

However, not all endpoints in the gpu-operator namespace provide Prometheus metrics. In particular, node-feature-discovery-master has only one gRPC endpoint on port 8080, which cannot be scraped. I have changed this rule as follows to fix the problem:

additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  # Drop the node-feature-discovery master endpoint: it serves gRPC only, not /metrics
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: drop
    regex: .*-node-feature-discovery-master
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
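
To sanity-check which endpoints the drop rule would filter out, here is a minimal Python sketch of how I understand the matching to work. It assumes that Prometheus joins the source label values with `;` and anchors the relabel regex at both ends. The endpoint names and the helper functions `is_dropped` and `filter_targets` are hypothetical and only serve as illustration; this is not Prometheus code.

```python
import re

# Regex copied from the drop rule in the config above.
DROP_REGEX = r".*-node-feature-discovery-master"
SOURCE_LABELS = ["__meta_kubernetes_endpoints_name"]


def is_dropped(labels, source_labels, regex, separator=";"):
    """Return True if a Prometheus-style `drop` rule would discard the target."""
    # Source label values are concatenated with the separator before matching.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, so fullmatch is used here.
    return re.fullmatch(regex, value) is not None


def filter_targets(targets):
    """Keep only the targets that the drop rule does not match."""
    return [t for t in targets if not is_dropped(t, SOURCE_LABELS, DROP_REGEX)]


# Hypothetical endpoint names, for illustration only.
targets = [
    {"__meta_kubernetes_endpoints_name": "gpu-operator-node-feature-discovery-master"},
    {"__meta_kubernetes_endpoints_name": "nvidia-dcgm-exporter"},
]

kept = filter_targets(targets)
print([t["__meta_kubernetes_endpoints_name"] for t in kept])
# prints ['nvidia-dcgm-exporter']
```

Note that the pattern requires a leading hyphen before `node-feature-discovery-master`, so an endpoint named exactly `node-feature-discovery-master` (without a release-name prefix) would not be dropped by this rule.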

Labels: documentation (Issue/PR focused on fixing/editing/adding documentation bits)
