[BUG] system-probe failed to create containerd task #13436
Comments
Same issue here. |
Also seeing this exact same error, running these versions: |
Why is this issue closed, was a resolution identified? |
Same issue here. |
System probe container is returning message below: |
Is anyone able to get logs from a failed |
Fetching logs for |
@froth Can you try with |
@brycekahle I too am having the same issue. When running the command above it just returns blank. |
Same result |
@froth Got it. We can try to reproduce. Any other relevant details about your setup? Are you using EKS, GKE, or another cloud k8s setup? |
We are experiencing this on EKS |
We are having the issue on-prem, on Ubuntu 20.04.5 LTS running in vSphere. |
@froth do you know what Helm chart version you have? |
I know we have tried 3.1.3 and even downgraded back to 2.37.7 that we were running before but still have issues on either version. |
@Go2Engle what version of the agent are you trying? |
@brycekahle tried 7.38.2 and even 7.39.0 |
Is anyone running SELinux? |
we are not |
@Go2Engle since it says "operation not permitted", which is usually |
Also check the output of |
in when running |
When viewing the system-probe container in Lens, I get this as the last status, if that helps.
|
Turning off conntrack can have a pretty significant effect on the data quality for NPM, if NAT is used at all. NAT is quite common in containerized/k8s environments. If you start to see your NPM data not correctly resolving the source or destination, that would probably be why. |
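To make the NAT point concrete, here is a minimal sketch of why a conntrack lookup matters for attribution. It is not how system-probe implements it; the `Flow` type, the `CONNTRACK` table, and all addresses are made up for illustration. The idea is only that a client-side flow addressed to a service VIP can be mapped to the real backend when a translation entry is available, and stays unresolved when it is not.

```python
from typing import Dict, NamedTuple


class Flow(NamedTuple):
    src: str
    dst: str


# Hypothetical conntrack-style table: flow as the client sees it (service VIP)
# mapped to the flow after DNAT (actual backend pod). Addresses are invented.
CONNTRACK: Dict[Flow, Flow] = {
    Flow("10.0.1.5:43210", "172.20.0.10:80"): Flow("10.0.1.5:43210", "10.0.2.9:8080"),
}


def resolve(flow: Flow, conntrack_enabled: bool = True) -> Flow:
    """Return the translated flow if a conntrack entry exists, else the flow as observed."""
    if not conntrack_enabled:
        # Without conntrack there is nothing to translate with,
        # so the destination remains the VIP.
        return flow
    return CONNTRACK.get(flow, flow)


observed = Flow("10.0.1.5:43210", "172.20.0.10:80")
print(resolve(observed))                           # destination resolved to the backend
print(resolve(observed, conntrack_enabled=False))  # destination still the service VIP
```

With the lookup disabled, the destination can only be reported as the VIP, which is the kind of unresolved source or destination described above.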
@AlvaroCostaAbreu what error message were you getting before turning off |
@froth can you try with the newest helm chart version |
@brycekahle - I just turned 'conntrack' back on to answer you and, voilà, it's working. |
@AlvaroCostaAbreu yeah, we had a bug in helm chart version |
Just tested chart 3.1.7 and I'm still having the same error. |
I also just tested with 3.1.7 (7.39.1), and while enabling networkMonitoring, I got
|
I upgraded to 3.1.8, verified that |
@dlorent @bencouture Can you both detail versions of your setup (OS/distro, k8s, containerd, runc, agent, helm chart)? We are trying to find a common change that might help us identify where the problem is. |
More background: we know for a fact that this issue happened when we upgraded from containerd.io |
@bencouture I would say that's when the issue started for us as well: after the containerd upgrade. We have not tried downgrading, but it seems like that may be the common denominator! |
@bencouture Thanks! That is very helpful information |
Alright folks, helm chart @bencouture @Go2Engle @froth @dlorent @AlvaroCostaAbreu @ViniciusBastosTR @MarcioCruzTR @ThangEthan |
@brycekahle YAY! Got a healthy |
Yep, that did it! Healthy on all 30+ nodes in the cluster. |
Fantastic! Tested on 60+ nodes, and it's working! :) Thanks! |
Same here, thanks a lot! |
Thanks, brycekahle! Had the same issue! |
Getting this exception with Kubernetes Operator on AWS EKS. Chart version 0.9.1
I enabled the operator systemprobe via the environment variable: |
@nashmrd Getting the same thing in a cluster that I rebuilt the node groups on. The difference is that we're on an old version of the chart, 2.27.0. It looks like we didn't pin the version of datadog, just the chart version. I'm guessing that the chart pulls the latest version of datadog, and we've just drifted too far behind. I'm currently looking for a solution that doesn't require a major version upgrade of the chart. |
I safely updated from 2 to 3 during this troubleshooting with no issues. I know all environments are different, but it seems like the agent is pretty safe to update. I used the same values files and everything. |
That worked. I expect that sort of thing to not work. Thanks. They've started. I'm updating my terraform to make it so the chart version is a variable, and will put together a project to update across the board in the next couple of weeks. |
@nashmrd |
@nashmrd I too am using the datadog-operator, and was able to confirm that the configmap generated by the operator in chart version I also tested the operator image tag At least for me, I'll be using the datadog agent helm chart instead of the datadog operator helm chart. |
This worked for me, I had (in my |
Helm chart 3.1.10 fixes the regression. But on kernel
|
Perhaps there should be guidance from the ddog team in the helm chart on kernel compatibility (versions, changelogs, etc.)? I have seen dirty GP faults and panics in the past (the old days of eBPF instrumentation code in ddog agents) and fixed them by manually bisecting kernel releases against the then up-to-date agents... Room to improve. |
@ishworg That is a warning and is logged once per boot. It is the result of the current kernel struct offset guessing logic, which walks addresses to find the correct offsets at runtime. This will no longer be a problem once all our eBPF-based products have transitioned to CO-RE or runtime compilation (currently in progress, so hopefully within a couple of versions). |
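For readers unfamiliar with offset guessing, the general idea can be sketched in a few lines. This is only an illustration, not the agent's actual code, which runs as eBPF against live kernel memory. The `guess_offset` helper and the fake struct layout are hypothetical: the probe knows a value it planted (here a port number) and walks candidate offsets until a read at that offset returns it.

```python
import struct
from typing import Optional


def guess_offset(blob: bytes, expected: int, fmt: str = "<H") -> Optional[int]:
    """Walk byte offsets in blob until the field decoded with fmt equals expected."""
    size = struct.calcsize(fmt)
    for offset in range(len(blob) - size + 1):
        (value,) = struct.unpack_from(fmt, blob, offset)
        if value == expected:
            return offset
    return None  # layout did not contain the planted value


# Fake "kernel struct" with an unknown layout: 12 bytes of other fields,
# then a 16-bit little-endian port, then 6 more bytes.
fake_struct = bytes(12) + struct.pack("<H", 8080) + bytes(6)

print(guess_offset(fake_struct, 8080))  # 12
print(guess_offset(fake_struct, 9999))  # None
```

Reading at addresses that are not yet known to be valid fields is what makes this approach noisy. CO-RE avoids the walk by resolving field offsets from kernel type information instead.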
During helm installation, the system-probe pod failed to start
Agent Environment
gcr.io/datadoghq/cluster-agent:1.22.0
gcr.io/datadoghq/agent:7.38.2
Describe what happened:
During helm installation, the system-probe pod failed to start with this message: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec /opt/datadog-agent/embedded/bin/system-probe: operation not permitted: unknown
Describe what you expected:
All pods running
Steps to reproduce the issue:
helm install -f resources/datadog-values.yaml datadog-monitoring --namespace datadog-system datadog/datadog
Additional environment details (Operating System, Cloud provider, etc):
Kubernetes version 1.24.4
Container runtime: containerd
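The error in the report is a chain of wrapped errors, each layer (kubelet, containerd, shim, runc) prefixing its own context. A small sketch for pulling the chain apart when triaging similar reports follows. The helper names `split_error_chain` and `root_cause` are made up for this example, and splitting on ": " is an assumption that holds for this message but not necessarily for every runtime error.

```python
from typing import List, Optional

MESSAGE = (
    "Error: failed to create containerd task: failed to create shim task: "
    "OCI runtime create failed: runc create failed: "
    "unable to start container process: "
    "exec /opt/datadog-agent/embedded/bin/system-probe: "
    "operation not permitted: unknown"
)


def split_error_chain(message: str) -> List[str]:
    """Split a wrapped error into its layers, outermost first."""
    return [part.strip() for part in message.split(": ") if part.strip()]


def root_cause(message: str) -> Optional[str]:
    """Return the innermost layer, ignoring the trailing 'unknown' error class."""
    layers = [layer for layer in split_error_chain(message) if layer != "unknown"]
    return layers[-1] if layers else None


for depth, layer in enumerate(split_error_chain(MESSAGE)):
    print("  " * depth + layer)

print(root_cause(MESSAGE))  # operation not permitted
```

Read this way, the failure is the exec of the system-probe binary itself being denied, not containerd or the shim failing on their own.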