dcgm-exporter running on "g4dn.metal" in AWS EKS fails with "fatal: morestack on gsignal" #208 #35

sidewinder12s · 2021-12-15T21:27:43Z

We are running the "dcgm-exporter" Kubernetes DaemonSet on AWS EKS, and whenever we use a "g4dn.metal" EC2 instance, the "dcgm-exporter" pod gets stuck in a crash loop with the following log output:

time="2021-08-13T20:07:08Z" level=info msg="Starting dcgm-exporter"
time="2021-08-13T20:07:09Z" level=info msg="DCGM successfully initialized!"
time="2021-08-13T20:07:27Z" level=info msg="Collecting DCP Metrics"
fatal: morestack on gsignal

This does not happen on any other G4DN instance type, only on the "metal" variant. The NVIDIA drivers are installed and user code using the GPUs runs fine. Running "nvidia-smi" shows all 8 GPUs as expected. I have searched and cannot find any information on this error.

Copied from here: NVIDIA/gpu-monitoring-tools#208


nikkon-dev · 2021-12-16T06:45:38Z

Hello,

Could you reproduce the issue with the GOTRACEBACK=system environment variable set and provide the crash logs?

WBR,
Nik
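
For anyone following along, one way to set that variable is through the container's environment in the DaemonSet manifest. The fragment below is only a sketch: the container name and the surrounding layout are assumptions about a typical deployment, not something taken from this issue, so adapt it to the manifest or Helm values actually in use.

```yaml
# Hypothetical excerpt of a dcgm-exporter DaemonSet spec.
# The container name below is an assumption; match it to your manifest.
spec:
  template:
    spec:
      containers:
        - name: dcgm-exporter        # assumed container name
          env:
            # Go runtime setting: "system" extends crash tracebacks with
            # runtime-internal frames and runtime-created goroutines.
            - name: GOTRACEBACK
              value: "system"
```

With this in place, the next crash should print a fuller traceback to the container log, which can be collected from the previous container instance after a restart.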

sidewinder12s · 2021-12-16T15:36:45Z

Sure. I'm also going to upgrade to see if it's fixed, since we're a few releases behind.

sidewinder12s · 2021-12-17T19:29:46Z

Well, after upgrading and increasing some resources for the exporter, this doesn't appear to be happening anymore. I'll re-open if we see it again with that env var set.

sidewinder12s closed this as completed Dec 17, 2021
