
Monitoring addon / OMS agent - memory consumption, oom restarts #2457

Closed

kaplik opened this issue Jul 26, 2021 · 13 comments

kaplik commented Jul 26, 2021

What happened:
I'm facing an issue with the AKS monitoring addon (OMS agent), which is restarting quite often. The likely cause is OOM.
The issue probably started after an AKS upgrade (from 1.19.x to 1.20.7). The OMS agent was updated automatically during the AKS cluster upgrade to: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod06112021

I did this upgrade on 3 AKS clusters, and all of them are facing the same issue with the OMS agent.

See the number of restarts in the last 3 days:

NAME                                  READY   STATUS    RESTARTS   AGE
omsagent-lz67n                        1/1     Running   15         3d11h
omsagent-rs-6f9dfb967f-f8sn4          1/1     Running   0          3d11h
omsagent-vvprs                        1/1     Running   4          3d11h
omsagent-wj8mj                        1/1     Running   60         3d11h

Kubernetes events for pod:

  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Warning  Unhealthy  40m   kubelet  Liveness probe failed: root        97     1  0 05:59 ?        00:00:04 mdsd -e /var/opt/microsoft/linuxmonagent/log/mdsd.err -w /var/opt/microsoft/linuxmonagent/log/mdsd.warn -o /var/opt/microsoft/linu
xmonagent/log/mdsd.info -q /var/opt/microsoft/linuxmonagent/log/mdsd.qos
/opt/livenessprobe.sh: line 21:  1995 Done                    ps -ef
      1996 Broken pipe             | grep "fluentd"
      1997 Killed                  | grep -v "grep"
  Warning  Unhealthy  39m  kubelet  Liveness probe failed: root        97     1  0 05:59 ?        00:00:05 mdsd -e /var/opt/microsoft/linuxmonagent/log/mdsd.err -w /var/opt/microsoft/linuxmonagent/log/mdsd.warn -o /var/opt/microsoft/linux
monagent/log/mdsd.info -q /var/opt/microsoft/linuxmonagent/log/mdsd.qos
/opt/livenessprobe.sh: line 21:  2013 Done                    ps -ef
      2014 Broken pipe             | grep "fluentd"
      2015 Killed                  | grep -v "grep"
  Warning  Unhealthy  38m  kubelet  Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "2a288770d2f943e520ecc68e795cedb6af2f4608c44cc545e6eca140e9186721": OCI runtime exec failed: ex
ec failed: container_linux.go:380: starting container process caused: process_linux.go:107: possibly OOM-killed caused: read init-p: connection reset by peer: unknown
  Warning  Unhealthy  33m  kubelet  Liveness probe failed: root        97     1  0 05:59 ?        00:00:05 mdsd -e /var/opt/microsoft/linuxmonagent/log/mdsd.err -w /var/opt/microsoft/linuxmonagent/log/mdsd.warn -o /var/opt/microsoft/linux
monagent/log/mdsd.info -q /var/opt/microsoft/linuxmonagent/log/mdsd.qos
/opt/livenessprobe.sh: line 21:  2149 Done                    ps -ef
      2150 Killed                  | grep "fluentd"
      2151 Killed                  | grep -v "grep"
  Warning  Unhealthy  30m  kubelet  Liveness probe failed: /opt/livenessprobe.sh: line 4:  2251 Done                    ps -ef
      2252 Broken pipe             | grep "mdsd"
      2253 Killed                  | grep -v "grep"
  Warning  Unhealthy  29m  kubelet  Liveness probe failed: /opt/livenessprobe.sh: line 4:  2263 Done                    ps -ef
      2264 Killed                  | grep "mdsd"
      2265 Killed                  | grep -v "grep"
  Normal   Pulled     28m (x82 over 3d12h)  kubelet  Container image "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod06112021" already present on machine
  Normal   Created    28m (x82 over 3d12h)  kubelet  Created container omsagent
  Normal   Killing    28m (x81 over 3d11h)  kubelet  Container omsagent failed liveness probe, will be restarted
  Warning  Unhealthy  28m                   kubelet  Liveness probe failed: root        97     1  0 05:59 ?        00:00:05 mdsd -e /var/opt/microsoft/linuxmonagent/log/mdsd.err -w /var/opt/microsoft/linuxmonagent/log/mdsd.warn -o /var/op
t/microsoft/linuxmonagent/log/mdsd.info -q /var/opt/microsoft/linuxmonagent/log/mdsd.qos
/opt/livenessprobe.sh: line 21:  2280 Done                    ps -ef
      2281 Broken pipe             | grep "fluentd"
      2282 Killed                  | grep -v "grep"
  Normal  Started  28m (x82 over 3d12h)  kubelet  Started container omsagent

I have tried to disable and reenable the monitoring addon, but that didn't help:

az aks disable-addons -a monitoring -n xx --resource-group xx
az aks enable-addons -a monitoring -n xx --resource-group xx --workspace-resource-id "/subscriptions/xx/resourceGroups/xx/providers/Microsoft.OperationalInsights/workspaces/xx-la"

I tried to investigate a bit and figured out that "td-agent-bit" is causing the memory consumption:

root@omsagent-lz67n:/opt# ps aux --sort -rss
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       155  0.5  2.7 1831780 450288 ?      SLl  06:10   0:33 /opt/td-agent-bit/bin/td-agent-bit -c /etc/opt/microsoft/docker-cimprov/td-agent-bit.conf -e /opt/td-agent-bit/bin/out_oms.so
root       193  0.6  0.2 313524 41456 ?        Sl   06:10   0:38 /usr/bin/ruby2.6 -Eascii-8bit:ascii-8bit /usr/local/bin/fluentd -c /etc/fluent/container.conf -o /var/opt/microsoft/docker-cimprov/log/fluentd.log --log-rotate-age 5 --log-r
root        97  0.1  0.2 565244 34908 ?        Sl   06:10   0:11 mdsd -e /var/opt/microsoft/linuxmonagent/log/mdsd.err -w /var/opt/microsoft/linuxmonagent/log/mdsd.warn -o /var/opt/microsoft/linuxmonagent/log/mdsd.info -q /var/opt/microso
root        98  0.0  0.1 231676 30052 ?        Sl   06:10   0:03 /usr/bin/ruby2.6 /usr/local/bin/fluentd -c /etc/fluent/container.conf -o /var/opt/microsoft/docker-cimprov/log/fluentd.log --log-rotate-age 5 --log-rotate-size 20971520
root       245  0.2  0.1 5145300 26296 ?       Sl   06:10   0:16 /opt/telegraf --config /etc/opt/microsoft/docker-cimprov/telegraf.conf
root      3581  0.0  0.0  34412  2760 pts/0    R+   07:50   0:00 ps aux --sort -rss
root      3569  0.1  0.0  18512  1904 pts/0    Ss   07:50   0:00 bash
root         1  0.0  0.0  18592   540 ?        Ss   06:10   0:00 /bin/bash /opt/main.sh
root        83  0.0  0.0  28360   240 ?        Ss   06:10   0:00 /usr/sbin/cron
root        10  0.0  0.0   6708   120 ?        S    06:10   0:00 inotifywait /etc/config/settings --daemon --recursive --outfile /opt/inotifyoutput.txt --event create,delete --format %e : %T --timefmt +%s
root       277  0.0  0.0   4540    68 ?        S    06:10   0:00 sleep inf

The memory consumed by this process grows constantly until the pod consumes 100% of its memory limit.
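A minimal sketch of how this growth can be observed over time (the pod name and the kube-system namespace are taken from the outputs above; adjust as needed):

# Watch the agent pods' memory usage (requires metrics-server, which AKS ships by default)
watch -n 60 'kubectl -n kube-system top pods | grep ^omsagent'

# Or rank the processes inside one agent pod by RSS, as in the ps output above
kubectl -n kube-system exec omsagent-lz67n -- ps aux --sort -rss | head -5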

Another interesting thing is that fluent-bit has a problem flushing chunks to OMS:

==> fluent-bit.log <==
[2021/07/26 07:53:58] [ warn] [engine] failed to flush chunk '155-1627283150.71408622.flb', retry in 195 seconds: task_id=1548, input=tail.0 > output=oms.0
[2021/07/26 07:53:58] [ warn] [engine] failed to flush chunk '155-1627281380.61586511.flb', retry in 1425 seconds: task_id=722, input=tail.0 > output=oms.0
[2021/07/26 07:53:58] [ warn] [engine] failed to flush chunk '155-1627280959.374993596.flb', retry in 1328 seconds: task_id=526, input=tail.0 > output=oms.0
[2021/07/26 07:53:58] [ warn] [engine] failed to flush chunk '155-1627280383.977423469.flb', retry in 1195 seconds: task_id=256, input=tail.0 > output=oms.0
[2021/07/26 07:53:58] [ warn] [engine] failed to flush chunk '155-1627283031.90606381.flb', retry in 912 seconds: task_id=1490, input=tail.0 > output=oms.0

This could be caused by:

==> fluent-bit-out-oms-runtime.log <==
2021/07/26 07:53:58 oms.go:1362: RequestId b7cfcaa1-c430-47bc-9b2d-294bc9a280ad Status 403 Forbidden Status Code 403
2021/07/26 07:53:58 oms.go:1362: RequestId bcf0f2a1-aabb-47b5-bcb3-3fadb3005c90 Status 403 Forbidden Status Code 403
2021/07/26 07:53:58 oms.go:1362: RequestId d77085a2-fc63-4995-9326-a878da810e39 Status 403 Forbidden Status Code 403
2021/07/26 07:53:58 oms.go:1362: RequestId 53aa6d92-81dc-497f-92d4-870758627c44 Status 403 Forbidden Status Code 403
2021/07/26 07:53:58 oms.go:1362: RequestId 5c8433dd-897e-45f2-ab81-6686d38b365e Status 403 Forbidden Status Code 403

Head of fluent-bit-out-oms-runtime.log

root@omsagent-lz67n:/var/opt/microsoft/docker-cimprov/log# head -100 fluent-bit-out-oms-runtime.log
2021/07/26 06:10:34 out_oms.go:23: Initializing out_oms go plugin for fluentbit
2021/07/26 06:10:34 out_oms.go:36: Using /etc/opt/microsoft/docker-cimprov/out_oms.conf for plugin config
2021/07/26 06:10:34 oms.go:1470: ContainerLogEnrichment=false
2021/07/26 06:10:34 oms.go:1483: Container Type
2021/07/26 06:10:34 oms.go:1489: Reading configuration for Linux from /etc/opt/microsoft/docker-cimprov/out_oms.conf
2021/07/26 06:10:34 oms.go:1543: OMSEndpoint https://xx.ods.opinsights.azure.com/OperationalData.svc/PostJsonDataItems
2021/07/26 06:10:34 oms.go:1551: ResourceCentric: True
2021/07/26 06:10:34 oms.go:1552: ResourceID=/subscriptions/xx/resourceGroups/xx/providers/Microsoft.ContainerService/managedClusters/xx
2021/07/26 06:10:34 oms.go:1553: ResourceName=/subscriptions/xx/resourceGroups/xx/providers/Microsoft.ContainerService/managedClusters/xx
2021/07/26 06:10:34 oms.go:1566: Container Runtime engine containerd
2021/07/26 06:10:34 oms.go:1576: Usage-Agent = ContainerAgent/16.0.0.0
2021/07/26 06:10:34 oms.go:1587: containerInventoryRefreshInterval = 60
2021/07/26 06:10:34 oms.go:1590: kubeMonAgentConfigEventFlushInterval = 60
2021/07/26 06:10:34 oms.go:1593: Computer == aks-agentpool-17637801-vmss000002
2021/07/26 06:10:34 oms.go:1620: AZMON_CONTAINER_LOGS_ROUTE:v2
2021/07/26 06:10:34 oms.go:1661: Routing container logs thru v2 route...
2021/07/26 06:10:34 utils.go:141: Successfully created MDSD msgp socket connection for ContainerLogV2: /var/run/mdsd/default_fluent.socket
2021/07/26 06:10:34 oms.go:1675: Creating MDSD clients for KubeMonAgentEvents & InsightsMetrics
2021/07/26 06:10:34 utils.go:155: Successfully created MDSD msgp socket connection for KubeMon events:/var/run/mdsd/default_fluent.socket
2021/07/26 06:10:34 utils.go:169: Successfully created MDSD msgp socket connection for Insights metrics /var/run/mdsd/default_fluent.socket
2021/07/26 06:10:34 oms.go:1681: AZMON_CONTAINER_LOG_SCHEMA_VERSION:
2021/07/26 06:10:34 oms.go:449: Excluding namespace kube-system for stdout log collection
2021/07/26 06:10:34 oms.go:449: Excluding namespace xxx for stdout log collection
2021/07/26 06:10:34 oms.go:449: Excluding namespace xxxt for stdout log collection
2021/07/26 06:10:34 oms.go:449: Excluding namespace xxx for stdout log collection
2021/07/26 06:10:34 oms.go:462: Excluding namespace kube-system for stderr log collection
2021/07/26 06:10:34 oms.go:1699: ContainerLogEnrichment=false
2021/07/26 06:10:34 telemetry.go:212: Sending Event : ContainerLogPluginInitialized
2021/07/26 06:10:34 telemetry.go:212: Sending Event : ContainerLogDaemonSetHeartbeatEvent
2021/07/26 06:10:34 telemetry.go:173: Log Size Rate: NaN
2021/07/26 06:10:49 oms.go:1362: RequestId c58f47a0-bee5-4c2b-8887-4772a2a1128d Status 403 Forbidden Status Code 403
2021/07/26 06:10:49 oms.go:1362: RequestId af1bad2c-0d71-4434-a117-9e33e7bb079c Status 403 Forbidden Status Code 403

Target LA workspace settings:
[screenshot of the Log Analytics workspace settings]
The OMS agent is using (as far as I can tell) the correct workspace ID and key.
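A minimal sketch of how the workspace ID the agent uses can be compared with the target workspace (the omsagent-secret name and WSID key are assumptions based on the monitoring addon's layout, and the resource group/workspace names are placeholders):

# Workspace ID the agent was provisioned with (stored base64-encoded in kube-system)
kubectl -n kube-system get secret omsagent-secret -o jsonpath='{.data.WSID}' | base64 -d; echo

# Customer (workspace) ID of the target Log Analytics workspace
az monitor log-analytics workspace show --resource-group xx --workspace-name xx-la --query customerId -o tsv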

What you expected to happen:
0 restarts.

How to reproduce it (as minimally and precisely as possible):
I haven't been able to reproduce this issue on a newly deployed AKS cluster.

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:52:14Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"4a4913e790a033f1101af2f0fb7be189b106e3ae", GitTreeState:"clean", BuildDate:"2021-05-25T17:37:03Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

Size of cluster (how many worker nodes are in the cluster?)
3x Standard_D4s_v3
The other clusters with the same issue have a different number of nodes, so the node count seems to be irrelevant.

@ghost ghost added the triage label Jul 26, 2021

ghost commented Jul 26, 2021

Hi kaplik, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!


kaplik commented Jul 26, 2021

Support request ID | 2107260050000609

@ghost ghost added the action-required label Jul 28, 2021

ghost commented Jul 28, 2021

Triage required from @Azure/aks-pm

@ganga1980

Hi, @kaplik, I am investigating this from the Monitoring addon team and will provide an update and a workaround for this issue soon.


ghost commented Jul 28, 2021

@vishiy, @saaror would you be able to assist?


@ganga1980

/assign @ganga1980


ganga1980 commented Jul 29, 2021

Hi, @kaplik, I investigated but couldn't find the exact root cause, and I have discussed with the support team handling your ticket about scheduling a meeting at your earliest convenience so that we can investigate and understand what's causing this. Based on our telemetry, it looks like you upgraded the cluster at 2021-07-14T20:14:00Z and our updated agent ciprod06112021 got deployed on 2021-07-15, but the OOMs only started around 2021-07-22. If my analysis of the timing is correct, I suspect your application containers' logs or some timing issue is causing the agent to get into this state. We need to understand your application container workload to see what is causing this; for example, the following calls are not supposed to be made:

2021/07/26 06:10:49 oms.go:1362: RequestId c58f47a0-bee5-4c2b-8887-4772a2a1128d Status 403 Forbidden Status Code 403
2021/07/26 06:10:49 oms.go:1362: RequestId af1bad2c-0d71-4434-a117-9e33e7bb079c Status 403 Forbidden Status Code 403


kaplik commented Jul 29, 2021

Hi @ganga1980,
thank you for your investigation. We have scheduled a call with support engineer Punnet Dhir on Friday 30.7. 12:00-13:00 CEST (Europe/Prague); I would be happy if you could join us.

We have 3 AKS clusters affected by this issue, and all of them were upgraded using the same procedure. Here is the timeline of the upgrades. All clusters are in the same subscription (151c879a-7d27-4fc1-9662-46db11216bb1).

12.7.2021 around 22:00 CEST - [XXX]-test-aks upgrade to 1.20.7
14.7.2021 around 22:00 CEST- [XXX]-prod-aks + [XXX]-sandbox-aks upgrade to 1.20.7
21.7.2021 around 18:00 CEST - reimage nodepool to latest version on AKS cluster [XXX]-test-aks
22.7.2021 around 22:00 CEST - reimage nodepool to latest version on AKS clusters [XXX]-prod-aks + [XXX]-sandbox-aks

It's true that I noticed the OMS agent restarts only after 22.7., so it is definitely possible that the problem started after the node pool reimage to the latest version (on 21.7. and 22.7.) and that everything was OK after the AKS upgrades (12.7. and 14.7.).

The node pool upgrade was performed with this command on all clusters:

az aks nodepool upgrade --node-image-only --cluster-name xxxx --name agentpool --resource-group xxx

I was able to find in Container Insights that the OMS agent was probably upgraded automatically on 16.7. ([XXX]-prod-aks cluster):

(times in UTC)

TimeGenerated Computer Repository Image ImageTag
16/07/2021, 08:18:14.000 aks-agentpool-17637801-vmss000001 mcr.microsoft.com azuremonitor/containerinsights/ciprod ciprod05202021
16/07/2021, 08:20:06.000 aks-agentpool-17637801-vmss000002 mcr.microsoft.com azuremonitor/containerinsights/ciprod ciprod06112021
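A sketch of the kind of Log Analytics query that produces a table like the one above (the ContainerInventory table and column names follow the Container Insights schema as I understand it; the workspace GUID is a placeholder):

az monitor log-analytics query --workspace "<workspace-guid>" --timespan P14D \
  --analytics-query 'ContainerInventory
    | where Image contains "containerinsights/ciprod"
    | summarize arg_max(TimeGenerated, ImageTag) by Computer, Repository, Image'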

We need to understand your application container workload to see what is causing this; for example, the following calls are not supposed to be made

Workload in all clusters:

  • system/default workload in kube-system
  • nginx ingress controller
  • cert-manager
  • our custom java (spring boot) application microservices -> logging to stdout

@ganga1980

Thanks, @kaplik, for sharing the logs and also confirming that workaround #1 works without any issues.

Here are the details of this issue:

If a namespace is excluded for both the stdout & stderr streams, there is no issue, since we don't tail the logs of the containers in that namespace at all.

But if a namespace is excluded for only one of stdout or stderr, we drop the messages of the excluded stream after receiving the log lines.

If a batch received from the fluent-bit tail plugin contains only log lines of the excluded stream, then, because of a bug in ciprod06112021, instead of dropping those lines the agent tries to ingest them through an incorrect path, which fails with 403 (which is expected), and those log lines keep being retried.

A small number of dropped messages is not a problem, but if the number of dropped messages is very high, the number of log lines being retried is also very high. That drives up the memory usage of td-agent-bit (aka fluent-bit), and the agent eventually OOMs as the memory usage keeps growing because of the retries.

In your case, the namespace "baapi2-test" is excluded for stdout but not stderr, and the containers in this namespace generate a very high number of stdout logs, which keep being retried against the incorrect path instead of being dropped.

Possible workarounds with the released Monitoring addon agent (ciprod06112021):

  1. Exclude or include the namespace for both the stdout & stderr streams in the container-azm-ms-agentconfig configmap (see the sketch below).

  Or

  2. In the application container code, turn off stdout logging if you only want stderr log lines to be collected, and vice versa.
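A minimal sketch of workaround 1, based on the documented container-azm-ms-agentconfig template (the namespace baapi2-test is the one called out above; the exact keys should be verified against the current Container Insights agent configuration schema):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    # Exclude the same namespaces for BOTH streams so the excluded log lines are never tailed at all
    [log_collection_settings]
       [log_collection_settings.stdout]
          enabled = true
          exclude_namespaces = ["kube-system", "baapi2-test"]
       [log_collection_settings.stderr]
          enabled = true
          exclude_namespaces = ["kube-system", "baapi2-test"]
EOF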


ghost commented Aug 7, 2021

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Aug 7, 2021

ghost commented Aug 22, 2021

Issue needing attention of @Azure/aks-leads

@ganga1980

Thanks, @kaplik, for reporting this and also helping to validate the fix. The fix has been rolled out to all regions.

@ganga1980

Hi, @Azure/aks-leads, can you please help close this issue, as the fix has been rolled out to all AKS regions?

@vishiy vishiy closed this as completed Sep 1, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Oct 1, 2021