[Bug]: Pods metrics missing from API #16839

Open
Garahk opened this issue Jan 24, 2024 · 7 comments
Labels
bug, cannot reproduce, need feedback

Comments

@Garahk

Garahk commented Jan 24, 2024

Bug description

Per-pod CPU/MEM metrics are missing from the API.

They also cannot be seen in the V1 dashboard:

image

When filtering the Allmetrics API output for netdata's child pods, only the k8s_kubelet.kubelet_pods_log_filesystem_used_bytes chart contains the keyword child:

image

I had version 1.42.1 installed on this server and it had the same issue. I upgraded to v1.44.1 to see whether that cleared it, but the issue persists.

Expected behavior

Pod metrics are available in the API.

Steps to reproduce

  1. Install netdata with the Helm chart
  2. Query the API's metrics
  3. Check whether the individual pods' CPU/MEM metrics are present
    ...

Installation method

helmchart (kubernetes)

System info

$ uname -a; grep -HvE "^#|URL" /etc/*release
Linux eni-ucs-tlf-peru 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
/etc/os-release:NAME="Red Hat Enterprise Linux"
/etc/os-release:VERSION="8.8 (Ootpa)"
/etc/os-release:ID="rhel"
/etc/os-release:ID_LIKE="fedora"
/etc/os-release:VERSION_ID="8.8"
/etc/os-release:PLATFORM_ID="platform:el8"
/etc/os-release:PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
/etc/os-release:ANSI_COLOR="0;31"
/etc/os-release:CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
/etc/os-release:
/etc/os-release:REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
/etc/os-release:REDHAT_BUGZILLA_PRODUCT_VERSION=8.8
/etc/os-release:REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
/etc/os-release:REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
/etc/redhat-release:Red Hat Enterprise Linux release 8.8 (Ootpa)
/etc/system-release:Red Hat Enterprise Linux release 8.8 (Ootpa)

Netdata build info

$(ps aux | grep -m1 -E -o "[a-zA-Z/]+netdata ") -W buildinfo
Packaging:
    Netdata Version ____________________________________________ : v1.44.1
    Installation Type __________________________________________ : oci
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ : unknown
    Configure Options __________________________________________ :  '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-math' '--with-user=netdata' '--without-bundled-protobuf' '--disable-ebpf' '--disable-dependency-tracking' '--enable-lto' 'CFLAGS=-ffunction-sections -fdata-sections -O2 -funroll-loops -pipe -DFLB_HAVE_INOTIFY' 'LDFLAGS=-Wl,--gc-sections'
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /usr/share/netdata/web
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 4.18.0-477.27.1.el8_8.x86_64
    Operating System ___________________________________________ : Red Hat Enterprise Linux
    Operating System ID ________________________________________ : rhel
    Operating System ID Like ___________________________________ : fedora
    Operating System Version ___________________________________ : 8.8 (Ootpa)
    Operating System Version ID ________________________________ : 12
    Detection __________________________________________________ : /host/etc/os-release
Hardware:
    CPU Cores __________________________________________________ : 96
    CPU Frequency ______________________________________________ : 3800000000
    RAM Bytes __________________________________________________ : 134105305088
    Disk Capacity ______________________________________________ : 11518907777024
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : none
    Virtualization Detection ___________________________________ : none
Container:
    Container __________________________________________________ : container
    Container Detection ________________________________________ : kubernetes
    Container Orchestrator _____________________________________ : kubernetes
    Container Operating System _________________________________ : Debian GNU/Linux
    Container Operating System ID ______________________________ : debian
    Container Operating System ID Like _________________________ : unknown
    Container Operating System Version _________________________ : 12 (bookworm)
    Container Operating System Version ID ______________________ : 12
    Container Operating System Detection _______________________ : /etc/os-release
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine ___________________________________________________ : YES
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    map ________________________________________________________ : YES
    save _______________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Judy (high-performance dynamic arrays and hashtables) ______ : YES (bundled)
    dlib (robust machine learning toolkit) _____________________ : YES (bundled)
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : NO
    libcrypto (cryptographic functions) ________________________ : YES
    libm (mathematical functions) ______________________________ : YES
    jemalloc ___________________________________________________ : NO
    TCMalloc ___________________________________________________ : NO
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : NO
    ebpf (monitor system calls) ________________________________ : NO
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    nfacct (gather netfilter accounting) _______________________ : NO
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : NO
    Xen VBD Error Tracking _____________________________________ : NO
    Logs Management ____________________________________________ : YES
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : YES
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO

Additional info

Cluster is k3s

Garahk added the bug and needs triage labels Jan 24, 2024
ilyam8 added the need feedback and cannot reproduce labels and removed the needs triage label Jan 25, 2024
@ilyam8
Member

ilyam8 commented Jan 25, 2024

K8s container metrics are in the "Kubernetes containers" section.

Screenshot 2024-01-25 at 11 40 52

I am not sure why you are exporting in JSON format, but anyway:

$ curl -sq "http://10.10.11.102:19999/api/v1/allmetrics?format=json" | grep cgroup_k8s_cntr | more
	"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_eth0": {
		"name":"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_eth0",
	"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_operstate_eth0": {
		"name":"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_operstate_eth0",
	"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_carrier_eth0": {
		"name":"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_carrier_eth0",
	"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_mtu_eth0": {
		"name":"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_mtu_eth0",
	"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_packets_eth0": {
		"name":"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_packets_eth0",
	"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_drops_eth0": {
		"name":"cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_drops_eth0",
	"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_eth0": {
		"name":"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_eth0",
	"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_operstate_eth0": {
		"name":"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_operstate_eth0",
	"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_carrier_eth0": {
		"name":"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_carrier_eth0",
	"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_mtu_eth0": {
		"name":"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_mtu_eth0",
	"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_packets_eth0": {
		"name":"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_packets_eth0",
	"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_drops_eth0": {
		"name":"cgroup_k8s_cntr_lab-busybox-load_busybox-load-5b5b854dbb-4plpp_busybox-load.net_drops_eth0",
	"cgroup_k8s_cntr_lab-redis_redis-66cf65f9b7-rnqsh_redis.net_eth0": {
...
...
$ curl -sq "http://10.10.11.102:19999/api/v1/allmetrics?format=json" | grep cgroup_k8s_cntr | wc -l
6922
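
The same check can be scripted instead of piped through grep. Below is a minimal Python sketch, not part of Netdata. It assumes the `allmetrics?format=json` response is a JSON object keyed by chart ID, as the output above suggests, and that the text after the last `.` in a chart ID names the chart. The function name `group_charts` and the sample data are hypothetical.

```python
import json
from collections import defaultdict


def group_charts(metrics, keyword):
    """Group chart IDs containing `keyword` by the text before the last dot.

    Assumes `metrics` is the parsed allmetrics JSON: a dict keyed by chart ID.
    Returns {instance: sorted list of chart suffixes}.
    """
    groups = defaultdict(list)
    for chart_id in metrics:
        if keyword not in chart_id:
            continue
        # "cgroup_..._httpd.net_eth0" -> ("cgroup_..._httpd", ".", "net_eth0")
        instance, _, suffix = chart_id.rpartition(".")
        groups[instance].append(suffix)
    return {instance: sorted(suffixes) for instance, suffixes in groups.items()}


# Tiny hand-made sample in the shape of the output above (not a real dump).
sample = json.loads("""
{
  "cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_eth0": {},
  "cgroup_k8s_cntr_lab-httpd_httpd-78d959c55c-9tsrq_httpd.net_mtu_eth0": {},
  "system.cpu": {}
}
""")

for instance, suffixes in group_charts(sample, "cgroup_k8s_cntr").items():
    print(instance, len(suffixes), suffixes)
```

To run it against a live agent, load the dict with `json.load(urllib.request.urlopen(url))` using the same URL as the curl command. An empty result for a container that is running would point at the cgroups plugin rather than at the API.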

@Garahk
Author

Garahk commented Jan 29, 2024

@ilyam8 Perhaps I can clarify through comparison. The metrics I am currently not receiving pertain to CPU and memory by pod on the control-plane. In the provided screenshot, you'll notice the server I am encountering issues with. I am specifically searching for netdata's child pod, and there are only two matches. In contrast, in the second screenshot featuring another server with the same setup, I find 62 matches, encompassing the metrics I am seeking, including CPU, as visible in the screenshot.

image

image
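
The comparison between the two servers can also be made explicit by diffing their allmetrics dumps. This is a rough Python sketch under stated assumptions: each dump is the parsed `allmetrics?format=json` object keyed by chart ID, and a chart "matches" if the keyword appears anywhere in its ID or body, which mirrors a text search in the browser. The function names and all sample data are made up for illustration.

```python
import json


def matching_suffixes(metrics, keyword):
    """Return chart suffixes (text after the last dot of the chart ID) for
    every chart whose ID or JSON body mentions `keyword`."""
    found = set()
    for chart_id, body in metrics.items():
        if keyword in chart_id or keyword in json.dumps(body):
            found.add(chart_id.rpartition(".")[2])
    return found


def missing_on_broken(working, broken, keyword):
    """Chart suffixes that match on the working server but not on the broken one."""
    return sorted(matching_suffixes(working, keyword) - matching_suffixes(broken, keyword))


# Made-up sample data; pod and chart names are placeholders, not real output.
broken = {
    "k8s_kubelet.kubelet_pods_log_filesystem_used_bytes": {
        "dimensions": {"netdata_netdata-child-aaaaa": {}}
    },
}
working = {
    "k8s_kubelet.kubelet_pods_log_filesystem_used_bytes": {
        "dimensions": {"netdata_netdata-child-bbbbb": {}}
    },
    "cgroup_k8s_cntr_netdata_netdata-child-bbbbb_netdata.cpu": {},
}

print(missing_on_broken(working, broken, "child"))  # prints ['cpu']
```

Comparing by suffix rather than by full chart ID sidesteps the fact that pod names differ between the two servers.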

@ilyam8
Member

ilyam8 commented Jan 30, 2024

@Garahk Unfortunately, I can't reproduce the problem, so with the information provided I don't see how I can help other than suggesting that you debug it yourself. These metrics are collected by the cgroups plugin, so consider checking its logs. If you want, you can share your logs and we will check them too.

@Garahk
Author

Garahk commented Jan 30, 2024

@ilyam8 Apologies, I should have started by sharing some logs. I've sent them to your inbox, and here is the same reply I wrote there.

Let me explain the logs. I am attaching two child-pod logs. The one called "last week" was taken right after I installed netdata, so it includes the top of the logs, when netdata is instantiated. The other is from today and is smaller, though I am not sure why.

Could the problem be related to these messages about the maximum number of cgroups, which keep appearing in the logs (these are just 2 of several)?

2024-01-24 13:47:13: netdata INFO  : P[cgroups] : CGROUP: maximum number of cgroups reached (1000). Not adding cgroup '/system.slice/var-lib-kubelet-pods-7f311e8c\x2d139c\x2d4c8c\x2d8b95\x2df11f03e3645a-volumes-kubernetes.io\x7esecret-nia\x2ddms\x2drabbitmq\x2dpass\x2dsecret.mount'
2024-01-24 13:47:13: netdata INFO  : P[cgroups] : CGROUP: maximum number of cgroups reached (1000). Not adding cgroup '/system.slice/run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-0f8c8bd12b97d16dca533e8cadc85dad8782a5eabb760dbeeebbc4c5db516edb-rootfs.mount'

And if so, is it something related to my setup?
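
Both rejected cgroups in the excerpt above are systemd `.mount` units under `/system.slice`, not pod containers. A quick tally of what is being rejected may show whether such units fill the 1000 slots before container cgroups are added. The following is a minimal Python sketch that assumes the log line format shown above. `summarize_rejected` is a hypothetical helper, not a Netdata tool.

```python
import re
from collections import Counter

# Matches the "maximum number of cgroups reached" lines shown above.
LIMIT_RE = re.compile(
    r"maximum number of cgroups reached \((?P<limit>\d+)\)\. "
    r"Not adding cgroup '(?P<path>[^']+)'"
)


def summarize_rejected(lines):
    """Return (limit, Counter of rejected cgroups by unit type).

    The unit type is the text after the last dot of the final path
    component, e.g. "mount" for "...-rootfs.mount".
    """
    limit = None
    kinds = Counter()
    for line in lines:
        m = LIMIT_RE.search(line)
        if not m:
            continue
        limit = int(m.group("limit"))
        name = m.group("path").rsplit("/", 1)[-1]
        kind = name.rpartition(".")[2] if "." in name else "(none)"
        kinds[kind] += 1
    return limit, kinds


# The two lines quoted above, as raw strings so the \x2d escapes stay literal.
log_lines = [
    r"2024-01-24 13:47:13: netdata INFO  : P[cgroups] : CGROUP: maximum number of cgroups reached (1000). Not adding cgroup '/system.slice/var-lib-kubelet-pods-7f311e8c\x2d139c\x2d4c8c\x2d8b95\x2df11f03e3645a-volumes-kubernetes.io\x7esecret-nia\x2ddms\x2drabbitmq\x2dpass\x2dsecret.mount'",
    r"2024-01-24 13:47:13: netdata INFO  : P[cgroups] : CGROUP: maximum number of cgroups reached (1000). Not adding cgroup '/system.slice/run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-0f8c8bd12b97d16dca533e8cadc85dad8782a5eabb760dbeeebbc4c5db516edb-rootfs.mount'",
]

limit, kinds = summarize_rejected(log_lines)
print(limit, dict(kinds))
```

Run over the full child-pod log, the counter shows how many cgroups were turned away and of which kind, which is more informative than the two sample lines.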

@Garahk
Author

Garahk commented Feb 9, 2024

@ilyam8 Correct me if I'm wrong, but this section of the logs indicates that the collector cannot reach the metrics endpoint:

time=2024-02-09T17:24:19.450Z level=error msg="Get \"http://127.0.0.1:10255/metrics\": dial tcp 127.0.0.1:10255: connect: connection refused" plugin=go.d collector=k8s_kubelet job=k8s_kubelet
time=2024-02-09T17:24:19.450Z level=error msg="check failed" plugin=go.d collector=k8s_kubelet job=k8s_kubelet
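
One way to confirm what these lines say is to extract the address the collector dialed and probe it directly. This is a small Python sketch, assuming the `key=value` log format shown above. The helper names are hypothetical, and the probe only tests TCP reachability, not whether the endpoint serves metrics.

```python
import re
import shlex
import socket


def parse_logfmt(line):
    """Parse a key=value log line (values may be double-quoted) into a dict."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def dial_target(msg):
    """Extract (host, port) from a 'dial tcp host:port' error message, or None."""
    m = re.search(r"dial tcp ([0-9.]+):(\d+)", msg)
    return (m.group(1), int(m.group(2))) if m else None


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# First log line from above, as a raw string so the \" escapes stay literal.
line = r'time=2024-02-09T17:24:19.450Z level=error msg="Get \"http://127.0.0.1:10255/metrics\": dial tcp 127.0.0.1:10255: connect: connection refused" plugin=go.d collector=k8s_kubelet job=k8s_kubelet'

fields = parse_logfmt(line)
target = dial_target(fields["msg"])
print(fields["collector"], target)
if target:
    print("reachable:", can_connect(*target))
```

Because the address is 127.0.0.1, the probe is only meaningful when run in the same network namespace as the collector, for example inside the child pod.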

@didlawowo

I have the same problem, but your port is incorrect.

@Garahk
Author

Garahk commented May 14, 2024

Hello @ilyam8

I've upgraded my setup to v1.45.4, and those metrics are still missing on one of my servers (RH 8.8):

image

I would expect to see the Kubernetes Containers section, but it is not there, unlike on this other server (RH 7.9):

image

I don't think it's related to the OS version. However, I would much appreciate it if you could check the child pods' logs for any errors so I can narrow it down. Let me know if you need the parent's logs.

I've shared the pod's logs via mail.


3 participants