[BUG] system-probe failed to create containerd task #13436

Closed
ThangEthan opened this issue Sep 9, 2022 · 74 comments

@ThangEthan

During Helm installation, the system-probe pod failed to start.

Agent Environment
gcr.io/datadoghq/cluster-agent:1.22.0
gcr.io/datadoghq/agent:7.38.2

Describe what happened:
During Helm installation, the system-probe pod failed to start with this message: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec /opt/datadog-agent/embedded/bin/system-probe: operation not permitted: unknown

Describe what you expected:
All pods running.

Steps to reproduce the issue:
helm install -f resources/datadog-values.yaml datadog-monitoring --namespace datadog-system datadog/datadog

Additional environment details (Operating System, Cloud provider, etc):
Kubernetes version 1.24.4
Container runtime: containerd
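
(The contents of resources/datadog-values.yaml aren't included above. As a rough, hypothetical sketch, assuming network monitoring is the feature being enabled, the kind of values that cause the system-probe container to be deployed would look something like this:)

datadog:
  apiKeyExistingSecret: datadog-secret   # hypothetical secret name, not from the report
  networkMonitoring:
    enabled: true   # NPM; this is what adds the system-probe container to the agent DaemonSet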

@ViniciusBastosTR

ViniciusBastosTR commented Sep 14, 2022

Same issue here.
datadoghq/agent:7.37.1
K8S Version: v1.21.7

@bencouture

Also seeing this exact same error, running these versions:
datadoghq/agent:7.38.2
kube version: v1.19.7

@bencouture

Why is this issue closed? Was a resolution identified?

ThangEthan reopened this Sep 20, 2022
@MarcioCruzTR

Same issue here.
datadoghq/agent:7.37.1
K8S Version: v1.21.7

@ViniciusBastosTR

The system-probe container is returning the message below:
Message: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec /opt/datadog-agent/embedded/bin/system-probe: operation not permitted: unknown

@brycekahle
Member

Is anyone able to get logs from a failed system-probe container? If you use the -p argument to kubectl logs, you can get logs from a previous container. If you are not comfortable putting them here, please open a support case at https://help.datadoghq.com/hc/en-us/requests/new

@froth

froth commented Sep 28, 2022

Running kubectl logs datadog-XXXX -c system-probe succeeds but yields empty output. I believe the probe is not even able to write logs.

@brycekahle
Member

brycekahle commented Sep 28, 2022

@froth Can you try with -p, kubectl logs -p datadog-XXXX -c system-probe? If the container is crashing, I wouldn't expect getting logs for the current container to work. Getting the previous logs should.

@Go2Engle

Go2Engle commented Sep 28, 2022

@brycekahle I too am having the same issue. When running the command above it just returns blank.
k8s: v1.24.6
Container runtime: containerd://1.6.8
OS: Ubuntu 20.04.5 LTS

@froth

froth commented Sep 28, 2022

@froth Can you try with -p, kubectl logs -p datadog-XXXX -c system-probe? If the container is crashing, I wouldn't expect getting logs for the current container to work. Getting the previous logs should.

Same result

@brycekahle
Member

@froth Got it. We can try to reproduce. Any other relevant details about your setup? Are you using EKS, GKE, or another cloud k8s setup?

@froth

froth commented Sep 28, 2022

We are experiencing this on EKS

@Go2Engle

We are having the issue on-prem with Ubuntu 20.04.5 LTS running in vSphere.

@brycekahle
Member

@froth do you know what Helm chart version you have?

@Go2Engle

We have tried 3.1.3 and even downgraded back to 2.37.7, which we were running before, but still have the issue on either version.

@brycekahle
Member

@Go2Engle what version of the agent are you trying?

@Go2Engle

@brycekahle tried 7.38.2 and even 7.39.0

@brycekahle
Member

Is anyone running SELinux?

@Go2Engle

We are not.

@brycekahle
Member

@Go2Engle since it says "operation not permitted", which is usually EPERM, can you take a look at the host kernel logs to see if anything jumps out? I'm having trouble reproducing this error.

@brycekahle
Member

Also check the output of cat /sys/kernel/security/lockdown

@Go2Engle

In /var/log/kern.log there are many instances of the below:
Sep 28 15:31:45 dev-k8s-wrk-02 kernel: [17025663.143951] audit: type=1400 audit(1664393505.656:77529857): apparmor="DENIED" operation="ptrace" profile="cri-containerd.apparmor.d" pid=2822187 comm="agent" requested_mask="read" denied_mask="read" peer="unconfined"

Running cat /sys/kernel/security/lockdown outputs the below:
[none] integrity confidentiality

@Go2Engle

When viewing the system-probe container in Lens, I get this as the last status, if that helps.

Last Status: Terminated
Reason: StartError - exit code: 128

@brycekahle
Member

Turning off conntrack can have a pretty significant effect on the data quality for NPM, if NAT is used at all. NAT is quite common in containerized/k8s environments. If you start to see your NPM data not correctly resolving the source or destination, that would probably be why.

@brycekahle
Member

@AlvaroCostaAbreu what error message were you getting before turning off conntrack?

@brycekahle
Member

@froth can you try with the newest helm chart version 3.1.7?

@AlvaroCostaAbreu

Turning off conntrack can have a pretty significant effect on the data quality for NPM, if NAT is used at all. NAT is quite common in containerized/k8s environments. If you start to see your NPM data not correctly resolving the source or destination, that would probably be why.

@brycekahle - I turned 'conntrack' back on to answer you and, voilà, it's working.
During my debugging I adjusted the seccomp: parameter, removing 'localhost' from the profile path.

@brycekahle
Member

@AlvaroCostaAbreu yeah, we had a bug in helm chart version 3.1.6.

@Go2Engle

Go2Engle commented Oct 4, 2022

Just tested chart 3.1.7 and I'm still having the same error.

@dlorent

dlorent commented Oct 5, 2022

I also just tested with 3.1.7 (7.39.1) and, with networkMonitoring enabled, I got:

Error: failed to start container "system-probe": Error response from daemon: OCI runtime create failed: runc create failed: unable to start container process: exec /opt/datadog-agent/embedded/bin/system-probe: operation not permitted: unknown

@bencouture

I upgraded to 3.1.8, verified that rseq and clone3 were actually added to /var/lib/kubelet/seccomp/system-probe (basically just verified that the changes in the PR were actually applied to the host), then completely rebooted the host and recreated the Pod for good measure. Same exact error message. I uploaded a flare from the Pod on that host, attached to case 944682.
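
(For reference, the file at /var/lib/kubelet/seccomp/system-probe is a standard OCI seccomp profile in JSON. A heavily trimmed sketch of what the allow-list looks like once rseq and clone3 are present; most syscall names are omitted here for brevity and the exact list is an assumption, not copied from the chart:)

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": [
        "bpf",
        "clone",
        "clone3",
        "perf_event_open",
        "rseq"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}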

@brycekahle
Member

@dlorent @bencouture Can you both detail versions of your setup (OS/distro, k8s, containerd, runc, agent, helm chart)? We are trying to find a common change that might help us identify where the problem is.

@bencouture

package        version
-------        -------
OS             Ubuntu 18.04.5 LTS
kernel         5.4.0-1055-aws
Helm itself    v2.16.8-rancher1
Helm chart     3.1.8
Kubernetes     v1.19.7
containerd.io  1.6.8-1
runc           1.1.4
agent          7.32.4

More background: we know for a fact that this issue happened when we upgraded from containerd.io 1.5.11-1 (which uses runc 1.0.3) to containerd.io 1.6.8-1 (which uses runc 1.1.4). If we downgrade the containerd.io version, then system-probe starts up with no issue. When we upgrade again, then it goes back to a CrashLoopBackOff.

@Go2Engle

Go2Engle commented Oct 5, 2022

@bencouture I would say that's when the issue started for us as well, after the containerd upgrade. I have not tried downgrading, but that seems like it may be the common denominator!

@brycekahle
Member

@bencouture Thanks! That is very helpful information

@brycekahle
Member

Alright folks, helm chart 3.1.9 should solve your woes.

@bencouture @Go2Engle @froth @dlorent @AlvaroCostaAbreu @ViniciusBastosTR @MarcioCruzTR @ThangEthan

@Go2Engle

Go2Engle commented Oct 5, 2022

@brycekahle YAY! Got a healthy system-probe container running! Looks like that did the trick!

@bencouture

Yep, that did it! Healthy on all 30+ nodes in the cluster.

@dlorent

dlorent commented Oct 6, 2022

Alright folks, helm chart 3.1.9 should solve your woes.

@bencouture @Go2Engle @froth @dlorent @AlvaroCostaAbreu @ViniciusBastosTR @MarcioCruzTR @ThangEthan

Fantastic! Tested on 60+ nodes, and it's working! :) Thanks!

@froth

froth commented Oct 6, 2022

Same here, thanks a lot!

@smg-serkly

Thanks, brycekahle! Had the same issue!

@nashmrd

nashmrd commented Nov 21, 2022

Getting this exception with the Kubernetes Operator on AWS EKS. Chart version 0.9.1.

Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec /opt/datadog-agent/embedded/bin/system-probe: operation not permitted: unknown

I enabled the operator's system-probe via the environment variable DD_SYSTEM_PROBE_SERVICE_MONITORING_ENABLED.
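
(As a plain Kubernetes container-spec fragment, setting that variable looks like the snippet below; exactly where it gets wired in depends on the operator version and its CRD, so treat this as illustrative rather than operator configuration:)

env:
  - name: DD_SYSTEM_PROBE_SERVICE_MONITORING_ENABLED
    value: "true"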

@blongman-snapdocs

@nashmrd Getting the same thing in a cluster where I rebuilt the node groups. The difference is that we're on an old version of the chart - 2.27.0. It looks like we didn't pin the Datadog version, just the chart version. I'm guessing that the chart pulls the latest version of Datadog, and we've just drifted too far behind.

I'm currently looking for a solution that doesn't require a major version upgrade of the chart.

@Go2Engle

@nashmrd Getting the same thing in a cluster where I rebuilt the node groups. The difference is that we're on an old version of the chart - 2.27.0. It looks like we didn't pin the Datadog version, just the chart version. I'm guessing that the chart pulls the latest version of Datadog, and we've just drifted too far behind.

I'm currently looking for a solution that doesn't require a major version upgrade of the chart.

I safely updated from 2 to 3 during this troubleshooting with no issues. I know all environments are different, but it seems like the agent is pretty safe to update. I used my same values files and everything.

@blongman-snapdocs

I safely updated from 2 to 3 during this troubleshooting with no issues. I know all environments are different, but it seems like the agent is pretty safe to update. I used my same values files and everything.

That worked. I expected that sort of thing not to work. Thanks. They've started. I'm updating my Terraform to make the chart version a variable, and will put together a project to update across the board in the next couple of weeks.

@mhhplumber

@nashmrd
Try adding this annotation:
container.seccomp.security.alpha.kubernetes.io/system-probe: runtime/default
What 3.1.9 fixes is allowing an additional syscall in the created seccomp profile; you can, however, just use a different profile.
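
(To make the placement concrete: the annotation is set on the agent pod's metadata and is keyed by the container name. A minimal pod-template fragment using the legacy annotation form quoted above would be:)

metadata:
  annotations:
    container.seccomp.security.alpha.kubernetes.io/system-probe: runtime/default

(On newer Kubernetes releases the annotation form is deprecated in favour of the container's securityContext.seccompProfile field, e.g. type: RuntimeDefault.)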

@igaskin

igaskin commented Nov 30, 2022

@nashmrd I too am using the datadog-operator, and was able to confirm that the configmap generated by the operator in chart version 0.9.1 does not include faccessat in the seccomp profile. Curiously, the operator code has included these changes, but it appears they haven't made it to a release.
https://github.com/DataDog/datadog-operator/blob/main/controllers/datadogagent/component/agent/default.go#LL451C15-L451C15

I also tested the operator image tag 0.8.3, 1.0.0-rc3, and latest, none of which corrected the configmap.
https://registry.hub.docker.com/r/datadog/operator/tags

At least for me, I'll be using the datadog agent helm chart instead of the datadog operator helm chart.

@gotchipete

gotchipete commented Jan 20, 2023

@nashmrd try adding this annotation container.seccomp.security.alpha.kubernetes.io/system-probe: runtime/default what 3.1.9 fixes is allowing an additional syscall in the created seccomp profile, you can however just use a different profile

This worked for me. In my datadog-agent.yaml I had:
container.seccomp.security.alpha.kubernetes.io/system-probe: localhost/system-probe
... once I replaced with
container.seccomp.security.alpha.kubernetes.io/system-probe: runtime/default
... smooth sailing!

@ishworg

ishworg commented Feb 16, 2023

Helm chart 3.1.10 fixes the regression.

But on kernel 5.4.228-131.415.amzn2.x86_64, we get a GP fault:

Feb 16 09:25:05 ip-10-250-19-90 kernel: ------------[ cut here ]------------
Feb 16 09:25:05 ip-10-250-19-90 kernel: General protection fault in user access. Non-canonical address?
Feb 16 09:25:05 ip-10-250-19-90 kernel: WARNING: CPU: 4 PID: 24530 at arch/x86/mm/extable.c:77 ex_handler_uaccess+0x4d/0x60
Feb 16 09:25:05 ip-10-250-19-90 kernel: Modules linked in: binfmt_misc xt_owner xt_REDIRECT xt_multiport veth xt_state xt_connmark nf_conntrack_netlink nfnetlink xt_addrtype xt_statistic xt_nat ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_mangle ip6table_mangle ip6table_filter ip6table_nat xt_MASQUERADE xt_conntrack xt_comment xt_mark iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter sunrpc crc32_pclmul ghash_clmulni_intel br_netfilter bridge aesni_intel stp ena llc mousedev crypto_simd overlay psmouse cryptd button evdev glue_helper crc32c_intel autofs4
Feb 16 09:25:05 ip-10-250-19-90 kernel: CPU: 4 PID: 24530 Comm: system-probe Not tainted 5.4.228-131.415.amzn2.x86_64 #1
Feb 16 09:25:05 ip-10-250-19-90 kernel: Hardware name: Amazon EC2 c5a.2xlarge/, BIOS 1.0 10/16/2017
Feb 16 09:25:05 ip-10-250-19-90 kernel: RIP: 0010:ex_handler_uaccess+0x4d/0x60
Feb 16 09:25:05 ip-10-250-19-90 kernel: Code: 83 c4 08 b8 01 00 00 00 5b c3 80 3d ad 92 75 01 00 75 dc 48 c7 c7 80 21 07 82 48 89 34 24 c6 05 99 92 75 01 01 e8 b3 ff 01 00 <0f> 0b 48 8b 34 24 eb bd 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
Feb 16 09:25:05 ip-10-250-19-90 kernel: RSP: 0018:ffffc900022cfa08 EFLAGS: 00010282
Feb 16 09:25:05 ip-10-250-19-90 kernel: RAX: 0000000000000000 RBX: ffffffff81c045a0 RCX: 0000000000000000
Feb 16 09:25:05 ip-10-250-19-90 kernel: RDX: 0000000000000007 RSI: ffffffff82dd759f RDI: ffffffff82dd512c
Feb 16 09:25:05 ip-10-250-19-90 kernel: RBP: 000000000000000d R08: ffffffff82dd7560 R09: 000000000000003f
Feb 16 09:25:05 ip-10-250-19-90 kernel: R10: 0000000000000000 R11: 0000000000005fd2 R12: 0000000000000000
Feb 16 09:25:05 ip-10-250-19-90 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Feb 16 09:25:05 ip-10-250-19-90 kernel: FS:  00007f79f2f3e640(0000) GS:ffff888424100000(0000) knlGS:0000000000000000
Feb 16 09:25:05 ip-10-250-19-90 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 16 09:25:05 ip-10-250-19-90 kernel: CR2: 000000c000d6b000 CR3: 00000003a4404000 CR4: 00000000003406e0
Feb 16 09:25:05 ip-10-250-19-90 kernel: Call Trace:
Feb 16 09:25:05 ip-10-250-19-90 kernel: fixup_exception+0x43/0x60
Feb 16 09:25:05 ip-10-250-19-90 kernel: do_general_protection+0x46/0x140
Feb 16 09:25:05 ip-10-250-19-90 kernel: general_protection+0x28/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? copy_user_generic_string+0x31/0x40
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? __probe_kernel_read+0x54/0x80
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x5/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? bpf_probe_read+0x98/0xa0
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x5/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? trace_call_bpf+0x62/0xd0
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x1/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x1/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x5/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? kprobe_perf_func+0x201/0x280
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x1/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x5/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? kprobe_ftrace_handler+0x92/0xf0
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? ftrace_ops_assist_func+0x98/0x110
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? map_update_elem+0x1eb/0x390
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? 0xffffffffa00e50bf
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? do_tcp_getsockopt.isra.44+0xdd0/0xdd0
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x1/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x5/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? __sys_getsockopt+0xb0/0x120
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? __x64_sys_getsockopt+0x20/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? do_syscall_64+0x48/0xf0
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Feb 16 09:25:05 ip-10-250-19-90 kernel: ---[ end trace 40123dadeae3f453 ]---

@ishworg

ishworg commented Feb 16, 2023

Perhaps there should be guidance from the Datadog team, in the Helm chart, on kernel compatibility (versions, changelogs, etc.)? I have seen nasty GP faults and panics in the past (the old days of eBPF instrumentation code in Datadog agents), fixed only by manually bisecting kernel releases against the then up-to-date agents... room to improve.

@brycekahle
Member

@ishworg That is a warning and logged once per boot. It is the result of the current kernel struct offset guessing logic, which walks addresses to find the correct offsets at runtime. This will no longer be a problem once all our eBPF-based products have transitioned to CO-RE or runtime compilation (currently in-progress, so hopefully within a couple versions).
