[BUG] system-probe failed to create containerd task #13436

Closed
ThangEthan opened this issue Sep 9, 2022 · 74 comments

@ThangEthan

During Helm installation, the system-probe pod failed to start.

Agent Environment
gcr.io/datadoghq/cluster-agent:1.22.0
gcr.io/datadoghq/agent:7.38.2

Describe what happened:
During Helm installation, the system-probe pod failed to start with this message: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec /opt/datadog-agent/embedded/bin/system-probe: operation not permitted: unknown

Describe what you expected:
All pods running.

Steps to reproduce the issue:
helm install -f resources/datadog-values.yaml datadog-monitoring --namespace datadog-system datadog/datadog

Additional environment details (Operating System, Cloud provider, etc):
Kubernetes version 1.24.4
Container runtime: containerd
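
(The contents of resources/datadog-values.yaml aren't included above. As a rough, hypothetical sketch, assuming network monitoring is the feature being enabled, the kind of values that cause the system-probe container to be deployed would look something like this:)

datadog:
  apiKeyExistingSecret: datadog-secret   # hypothetical secret name, not from the report
  networkMonitoring:
    enabled: true   # NPM; this is what adds the system-probe container to the agent DaemonSet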

@ViniciusBastosTR

ViniciusBastosTR commented Sep 14, 2022

Same issue here.
datadoghq/agent:7.37.1
K8S Version: v1.21.7

@bencouture

Also seeing this exact same error, running these versions:
datadoghq/agent:7.38.2
kube version: v1.19.7

@bencouture

Why is this issue closed? Was a resolution identified?

ThangEthan reopened this Sep 20, 2022
@MarcioCruzTR

Same issue here.
datadoghq/agent:7.37.1
K8S Version: v1.21.7

@ViniciusBastosTR

The system-probe container is returning the message below:
Message: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec /opt/datadog-agent/embedded/bin/system-probe: operation not permitted: unknown

@brycekahle
Member

Is anyone able to get logs from a failed system-probe container? If you use the -p argument to kubectl logs, you can get logs from a previous container. If you are not comfortable putting them here, please open a support case at https://help.datadoghq.com/hc/en-us/requests/new

@froth

froth commented Sep 28, 2022

Running kubectl logs datadog-XXXX -c system-probe succeeds but yields empty output. I believe the probe is not even able to write logs.

@brycekahle
Member

brycekahle commented Sep 28, 2022

@froth Can you try with -p, kubectl logs -p datadog-XXXX -c system-probe? If the container is crashing, I wouldn't expect getting logs for the current container to work. Getting the previous logs should.

@Go2Engle

Go2Engle commented Sep 28, 2022

@brycekahle I too am having the same issue. When running the command above it just returns blank.
k8s: v1.24.6
Container runtime: containerd://1.6.8
OS: Ubuntu 20.04.5 LTS

@froth

froth commented Sep 28, 2022

@froth Can you try with -p, kubectl logs -p datadog-XXXX -c system-probe? If the container is crashing, I wouldn't expect getting logs for the current container to work. Getting the previous logs should.

Same result

@brycekahle
Member

@froth Got it. We can try to reproduce. Any other relevant details about your setup? Are you using EKS, GKE, or another cloud k8s setup?

@froth

froth commented Sep 28, 2022

We are experiencing this on EKS

@Go2Engle

We are having the issue on-prem with Ubuntu 20.04.5 LTS running in vSphere.

@brycekahle
Member

@froth do you know what Helm chart version you have?

@Go2Engle

We have tried 3.1.3 and even downgraded back to 2.37.7, which we were running before, but still have the issue on either version.

@brycekahle
Member

@Go2Engle what version of the agent are you trying?

@Go2Engle

@brycekahle tried 7.38.2 and even 7.39.0

@brycekahle
Member

Is anyone running SELinux?

@Go2Engle

We are not.

@brycekahle
Member

@Go2Engle since it says "operation not permitted", which is usually EPERM, can you take a look at the host kernel logs to see if anything jumps out? I'm having trouble reproducing this error.

@brycekahle
Member

Also check the output of cat /sys/kernel/security/lockdown

@Go2Engle

In /var/log/kern.log there are many instances of the below:
Sep 28 15:31:45 dev-k8s-wrk-02 kernel: [17025663.143951] audit: type=1400 audit(1664393505.656:77529857): apparmor="DENIED" operation="ptrace" profile="cri-containerd.apparmor.d" pid=2822187 comm="agent" requested_mask="read" denied_mask="read" peer="unconfined"

Running cat /sys/kernel/security/lockdown outputs the below:
[none] integrity confidentiality

@Go2Engle

When viewing the system-probe container in Lens, I get this as the last status, if that helps.

Last Status: Terminated
Reason: StartError - exit code: 128

@brycekahle
Member

Turning off conntrack can have a pretty significant effect on the data quality for NPM, if NAT is used at all. NAT is quite common in containerized/k8s environments. If you start to see your NPM data not correctly resolving the source or destination, that would probably be why.

@brycekahle
Member

@AlvaroCostaAbreu what error message were you getting before turning off conntrack?

@brycekahle
Member

@froth can you try with the newest helm chart version 3.1.7?

@AlvaroCostaAbreu

Turning off conntrack can have a pretty significant effect on the data quality for NPM, if NAT is used at all. NAT is quite common in containerized/k8s environments. If you start to see your NPM data not correctly resolving the source or destination, that would probably be why.

@brycekahle - I turned 'conntrack' back on to answer you and, voilà, it's working.
During my debugging I adjusted the seccomp: parameter, removing 'localhost' from the profile path.

@brycekahle
Member

@AlvaroCostaAbreu yeah, we had a bug in helm chart version 3.1.6.

@Go2Engle

Go2Engle commented Oct 4, 2022

Just tested chart 3.1.7 and I'm still having the same error.

@dlorent

dlorent commented Oct 5, 2022

I also just tested with 3.1.7 (7.39.1) and, with networkMonitoring enabled, I got:

Error: failed to start container "system-probe": Error response from daemon: OCI runtime create failed: runc create failed: unable to start container process: exec /opt/datadog-agent/embedded/bin/system-probe: operation not permitted: unknown

@bencouture

I upgraded to 3.1.8, verified that rseq and clone3 were actually added to /var/lib/kubelet/seccomp/system-probe (basically just verified that the changes in the PR were actually applied to the host), then completely rebooted the host and recreated the Pod for good measure. Same exact error message. I uploaded a flare from the Pod on that host, attached to case 944682.
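
(For reference, the file at /var/lib/kubelet/seccomp/system-probe is a standard OCI seccomp profile in JSON. A heavily trimmed sketch of what the allow-list looks like once rseq and clone3 are present; most syscall names are omitted here for brevity and the exact list is an assumption, not copied from the chart:)

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": [
        "bpf",
        "clone",
        "clone3",
        "perf_event_open",
        "rseq"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}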

@brycekahle
Member

@dlorent @bencouture Can you both detail versions of your setup (OS/distro, k8s, containerd, runc, agent, helm chart)? We are trying to find a common change that might help us identify where the problem is.

@bencouture

package        version
-------        -------
OS             Ubuntu 18.04.5 LTS
kernel         5.4.0-1055-aws
Helm itself    v2.16.8-rancher1
Helm chart     3.1.8
Kubernetes     v1.19.7
containerd.io  1.6.8-1
runc           1.1.4
agent          7.32.4

More background: we know for a fact that this issue happened when we upgraded from containerd.io 1.5.11-1 (which uses runc 1.0.3) to containerd.io 1.6.8-1 (which uses runc 1.1.4). If we downgrade the containerd.io version, then system-probe starts up with no issue. When we upgrade again, then it goes back to a CrashLoopBackOff.

@Go2Engle

Go2Engle commented Oct 5, 2022

@bencouture I would say that's when the issue started for us as well, after the containerd upgrade. I have not tried downgrading, but that seems like it may be the common denominator!

@brycekahle
Member

@bencouture Thanks! That is very helpful information

@brycekahle
Member

Alright folks, helm chart 3.1.9 should solve your woes.

@bencouture @Go2Engle @froth @dlorent @AlvaroCostaAbreu @ViniciusBastosTR @MarcioCruzTR @ThangEthan

@Go2Engle

Go2Engle commented Oct 5, 2022

@brycekahle YAY! Got a healthy system-probe container running! Looks like that did the trick!

@bencouture

Yep, that did it! Healthy on all 30+ nodes in the cluster.

@dlorent

dlorent commented Oct 6, 2022

Alright folks, helm chart 3.1.9 should solve your woes.

@bencouture @Go2Engle @froth @dlorent @AlvaroCostaAbreu @ViniciusBastosTR @MarcioCruzTR @ThangEthan

Fantastic! Tested on 60+ nodes, and it's working! :) Thanks!

@froth

froth commented Oct 6, 2022

Same here, thanks a lot!

@smg-serkly

Thanks, brycekahle! Had the same issue!

@nashmrd

nashmrd commented Nov 21, 2022

Getting this exception with the Kubernetes Operator on AWS EKS. Chart version 0.9.1.

Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec /opt/datadog-agent/embedded/bin/system-probe: operation not permitted: unknown

I enabled the operator's system-probe via the environment variable DD_SYSTEM_PROBE_SERVICE_MONITORING_ENABLED.
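
(As a plain Kubernetes container-spec fragment, setting that variable looks like the snippet below; exactly where it gets wired in depends on the operator version and its CRD, so treat this as illustrative rather than operator configuration:)

env:
  - name: DD_SYSTEM_PROBE_SERVICE_MONITORING_ENABLED
    value: "true"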

@blongman-snapdocs

@nashmrd Getting the same thing in a cluster where I rebuilt the node groups. The difference is that we're on an old version of the chart - 2.27.0. It looks like we didn't pin the Datadog version, just the chart version. I'm guessing that the chart pulls the latest version of Datadog, and we've just drifted too far behind.

I'm currently looking for a solution that doesn't require a major version upgrade of the chart.

@Go2Engle

@nashmrd Getting the same thing in a cluster where I rebuilt the node groups. The difference is that we're on an old version of the chart - 2.27.0. It looks like we didn't pin the Datadog version, just the chart version. I'm guessing that the chart pulls the latest version of Datadog, and we've just drifted too far behind.

I'm currently looking for a solution that doesn't require a major version upgrade of the chart.

I safely updated from 2 to 3 during this troubleshooting with no issues. I know all environments are different, but it seems like the agent is pretty safe to update. I used my same values files and everything.

@blongman-snapdocs

I safely updated from 2 to 3 during this troubleshooting with no issues. I know all environments are different, but it seems like the agent is pretty safe to update. I used my same values files and everything.

That worked. I expected that sort of thing not to work. Thanks. They've started. I'm updating my Terraform to make the chart version a variable, and will put together a project to update across the board in the next couple of weeks.

@mhhplumber

@nashmrd
Try adding this annotation:
container.seccomp.security.alpha.kubernetes.io/system-probe: runtime/default
What 3.1.9 fixes is allowing an additional syscall in the created seccomp profile; you can, however, just use a different profile.
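
(To make the placement concrete: the annotation is set on the agent pod's metadata and is keyed by the container name. A minimal pod-template fragment using the legacy annotation form quoted above would be:)

metadata:
  annotations:
    container.seccomp.security.alpha.kubernetes.io/system-probe: runtime/default

(On newer Kubernetes releases the annotation form is deprecated in favour of the container's securityContext.seccompProfile field, e.g. type: RuntimeDefault.)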

@igaskin

igaskin commented Nov 30, 2022

@nashmrd I too am using the datadog-operator, and was able to confirm that the configmap generated by the operator in chart version 0.9.1 does not include faccessat in the seccomp profile. Curiously, the operator code has included these changes, but it appears they haven't made it to a release.
https://github.com/DataDog/datadog-operator/blob/main/controllers/datadogagent/component/agent/default.go#LL451C15-L451C15

I also tested the operator image tag 0.8.3, 1.0.0-rc3, and latest, none of which corrected the configmap.
https://registry.hub.docker.com/r/datadog/operator/tags

At least for me, I'll be using the datadog agent helm chart instead of the datadog operator helm chart.

@gotchipete

gotchipete commented Jan 20, 2023

@nashmrd try adding this annotation container.seccomp.security.alpha.kubernetes.io/system-probe: runtime/default what 3.1.9 fixes is allowing an additional syscall in the created seccomp profile, you can however just use a different profile

This worked for me. In my datadog-agent.yaml I had:
container.seccomp.security.alpha.kubernetes.io/system-probe: localhost/system-probe
... once I replaced with
container.seccomp.security.alpha.kubernetes.io/system-probe: runtime/default
... smooth sailing!

@ishworg

ishworg commented Feb 16, 2023

Helm chart 3.1.10 fixes the regression.

But on kernel 5.4.228-131.415.amzn2.x86_64, we get a GP fault:

Feb 16 09:25:05 ip-10-250-19-90 kernel: ------------[ cut here ]------------
Feb 16 09:25:05 ip-10-250-19-90 kernel: General protection fault in user access. Non-canonical address?
Feb 16 09:25:05 ip-10-250-19-90 kernel: WARNING: CPU: 4 PID: 24530 at arch/x86/mm/extable.c:77 ex_handler_uaccess+0x4d/0x60
Feb 16 09:25:05 ip-10-250-19-90 kernel: Modules linked in: binfmt_misc xt_owner xt_REDIRECT xt_multiport veth xt_state xt_connmark nf_conntrack_netlink nfnetlink xt_addrtype xt_statistic xt_nat ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_mangle ip6table_mangle ip6table_filter ip6table_nat xt_MASQUERADE xt_conntrack xt_comment xt_mark iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter sunrpc crc32_pclmul ghash_clmulni_intel br_netfilter bridge aesni_intel stp ena llc mousedev crypto_simd overlay psmouse cryptd button evdev glue_helper crc32c_intel autofs4
Feb 16 09:25:05 ip-10-250-19-90 kernel: CPU: 4 PID: 24530 Comm: system-probe Not tainted 5.4.228-131.415.amzn2.x86_64 #1
Feb 16 09:25:05 ip-10-250-19-90 kernel: Hardware name: Amazon EC2 c5a.2xlarge/, BIOS 1.0 10/16/2017
Feb 16 09:25:05 ip-10-250-19-90 kernel: RIP: 0010:ex_handler_uaccess+0x4d/0x60
Feb 16 09:25:05 ip-10-250-19-90 kernel: Code: 83 c4 08 b8 01 00 00 00 5b c3 80 3d ad 92 75 01 00 75 dc 48 c7 c7 80 21 07 82 48 89 34 24 c6 05 99 92 75 01 01 e8 b3 ff 01 00 <0f> 0b 48 8b 34 24 eb bd 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
Feb 16 09:25:05 ip-10-250-19-90 kernel: RSP: 0018:ffffc900022cfa08 EFLAGS: 00010282
Feb 16 09:25:05 ip-10-250-19-90 kernel: RAX: 0000000000000000 RBX: ffffffff81c045a0 RCX: 0000000000000000
Feb 16 09:25:05 ip-10-250-19-90 kernel: RDX: 0000000000000007 RSI: ffffffff82dd759f RDI: ffffffff82dd512c
Feb 16 09:25:05 ip-10-250-19-90 kernel: RBP: 000000000000000d R08: ffffffff82dd7560 R09: 000000000000003f
Feb 16 09:25:05 ip-10-250-19-90 kernel: R10: 0000000000000000 R11: 0000000000005fd2 R12: 0000000000000000
Feb 16 09:25:05 ip-10-250-19-90 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Feb 16 09:25:05 ip-10-250-19-90 kernel: FS:  00007f79f2f3e640(0000) GS:ffff888424100000(0000) knlGS:0000000000000000
Feb 16 09:25:05 ip-10-250-19-90 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 16 09:25:05 ip-10-250-19-90 kernel: CR2: 000000c000d6b000 CR3: 00000003a4404000 CR4: 00000000003406e0
Feb 16 09:25:05 ip-10-250-19-90 kernel: Call Trace:
Feb 16 09:25:05 ip-10-250-19-90 kernel: fixup_exception+0x43/0x60
Feb 16 09:25:05 ip-10-250-19-90 kernel: do_general_protection+0x46/0x140
Feb 16 09:25:05 ip-10-250-19-90 kernel: general_protection+0x28/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? copy_user_generic_string+0x31/0x40
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? __probe_kernel_read+0x54/0x80
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x5/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? bpf_probe_read+0x98/0xa0
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x5/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? trace_call_bpf+0x62/0xd0
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x1/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x1/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x5/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? kprobe_perf_func+0x201/0x280
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x1/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x5/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? kprobe_ftrace_handler+0x92/0xf0
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? ftrace_ops_assist_func+0x98/0x110
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? map_update_elem+0x1eb/0x390
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? 0xffffffffa00e50bf
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? do_tcp_getsockopt.isra.44+0xdd0/0xdd0
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x1/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? tcp_getsockopt+0x5/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? __sys_getsockopt+0xb0/0x120
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? __x64_sys_getsockopt+0x20/0x30
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? do_syscall_64+0x48/0xf0
Feb 16 09:25:05 ip-10-250-19-90 kernel: ? entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Feb 16 09:25:05 ip-10-250-19-90 kernel: ---[ end trace 40123dadeae3f453 ]---

@ishworg

ishworg commented Feb 16, 2023

Perhaps there should be guidance from the Datadog team, in the Helm chart, on kernel compatibility (versions, changelogs, etc.)? I have seen nasty GP faults and panics in the past (the old days of eBPF instrumentation code in Datadog agents), fixed only by manually bisecting kernel releases against the then up-to-date agents... room to improve.

@brycekahle
Member

@ishworg That is a warning and logged once per boot. It is the result of the current kernel struct offset guessing logic, which walks addresses to find the correct offsets at runtime. This will no longer be a problem once all our eBPF-based products have transitioned to CO-RE or runtime compilation (currently in-progress, so hopefully within a couple versions).
