dcgm-exporter running on "g4dn.metal" in AWS EKS fails with "fatal: morestack on gsignal" #208 #35

Closed
sidewinder12s opened this issue Dec 15, 2021 · 3 comments

Comments

@sidewinder12s

We are running the "dcgm-exporter" Kubernetes DaemonSet on AWS EKS, and whenever we use a "g4dn.metal" EC2 instance, the "dcgm-exporter" pod gets stuck in a crash loop with the following log output:

time="2021-08-13T20:07:08Z" level=info msg="Starting dcgm-exporter"
time="2021-08-13T20:07:09Z" level=info msg="DCGM successfully initialized!"
time="2021-08-13T20:07:27Z" level=info msg="Collecting DCP Metrics"
fatal: morestack on gsignal

This does not happen on any other G4DN instance type, only on the "metal" variant. The NVIDIA drivers are installed, and user code utilizing the GPUs runs fine. Running "nvidia-smi" shows all 8 GPUs as expected. I have searched but cannot find any information on this error.

Copied from here: NVIDIA/gpu-monitoring-tools#208

@nikkon-dev
Collaborator

Hello,

Could you reproduce the issue with the GOTRACEBACK=system environment variable set and provide the crash logs?

WBR,
Nik
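
For reference, one way to set that variable on a Kubernetes DaemonSet is through the container's env list. The fragment below is a minimal sketch, not the project's official manifest: the container name "dcgm-exporter" and the surrounding layout are assumptions and should be matched to the manifest actually deployed. With GOTRACEBACK=system, the Go runtime includes runtime-internal frames and runtime-created goroutines in the crash traceback, which is the extra detail requested above.

```yaml
# Hypothetical excerpt of a DaemonSet spec; only the env entry matters here.
spec:
  template:
    spec:
      containers:
        - name: dcgm-exporter        # assumed container name
          env:
            - name: GOTRACEBACK      # read by the Go runtime at startup
              value: "system"        # include runtime frames and system goroutines
```

After the pods restart with this setting, the output of the crashing container (including the previous, terminated instance) should contain the fuller traceback.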

@sidewinder12s
Author

Sure. I'm also going to upgrade to see if it's fixed, since we're a few releases behind.

@sidewinder12s
Author

Well, after upgrading and increasing some resources for the exporter, this doesn't appear to be happening anymore. I'll re-open if we see it again with that env var set.
