Hi, I'm using DCGM on a server with multiple A100 GPUs. When another profiling tool is running, such as a CUPTI-based profiler, dcgmi raises the error "the third-party profiling module returned an unrecoverable error". That makes sense. But if the other profiler only profiles one GPU and I point dcgmi at a different GPU, for example with dcgmi -i 1, the error doesn't make sense.
Would it be possible to make DCGM keep working when the specified GPUs are not being used by other processes and will not be used by them in the future?