Description
Running the GPUStress Tool on an A100 card reports the error below. However, the card appears to be healthy and working correctly according to the hardware tests performed by our hardware vendor.
Command Executed: ./gst -T=1
Output:
./gst capturing GPU information...
WATCHDOG starting, TIMEOUT: 600 seconds
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A100 80GB PCIe"
Device 1: "NVIDIA A100 80GB PCIe"
Initilizing A100 80 GB based test suite
TYPE=2
GPU Memory: 79, memgb: 80
Device 0: "NVIDIA A100 80GB PCIe", PCIe: 17
***** STARTING TEST 0: INT8 On Device 0 NVIDIA A100 80GB PCIe
math_type 10
args: matrixSizeA 34878833064 matrixSizeB 16672535724 matrixSizeC 28662757344
args: ta=N tb=T m=244872 n=117052 k=142437 lda=7835904 ldb=3745792 ldc=7835904
loop=1
***** TEST INT8 On Device 0 NVIDIA A100 80GB PCIe
***** TEST PASSED ****
TEST TIME: 24 seconds
***** STARTING TEST 1: FP16 On Device 0 NVIDIA A100 80GB PCIe
math_type 0
args: matrixSizeA 13000629632 matrixSizeB 13114567936 matrixSizeC 13557084032
args: ta=N tb=N m=115928 n=116944 k=112144 lda=115928 ldb=112144 ldc=115928
loop=1
***** TEST FP16 On Device 0 NVIDIA A100 80GB PCIe
***** TEST PASSED ****
TEST TIME: 17 seconds
***** STARTING TEST 2: TF32 On Device 0 NVIDIA A100 80GB PCIe
math_type 0
args: matrixSizeA 18964675584 matrixSizeB 6192059904 matrixSizeC 14359400064
std::exception: out of memory
testing cublasLt fail
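
A rough check of the sizes in the log, offered as a sketch rather than a diagnosis. For the FP16 test, matrixSizeA equals m × k (115928 × 112144 = 13000629632), so the matrixSize values look like element counts rather than bytes. The bytes-per-element figures below are assumptions (1 for INT8, 2 for FP16, 4 for TF32 stored as 32-bit floats), as is treating the card as having 80 GiB usable; none of this is taken from the gst source.

```python
# Element counts copied from the gst log above (matrixSizeA, matrixSizeB, matrixSizeC).
TESTS = {
    "INT8": (34878833064, 16672535724, 28662757344),
    "FP16": (13000629632, 13114567936, 13557084032),
    "TF32": (18964675584, 6192059904, 14359400064),
}

# ASSUMPTION: storage size per element for each test type. These values are
# illustrative guesses, not read from the tool's source code.
ASSUMED_BYTES_PER_ELEMENT = {"INT8": 1, "FP16": 2, "TF32": 4}

# ASSUMPTION: 80 GiB of device memory (the log reports "memgb: 80").
DEVICE_BYTES = 80 * 2**30


def footprint_bytes(sizes, bytes_per_element):
    """Total bytes needed to hold matrices A, B and C at the given element size."""
    return sum(sizes) * bytes_per_element


def fits_on_device(sizes, bytes_per_element, device_bytes=DEVICE_BYTES):
    """True if the three matrices would fit in device memory under the assumptions."""
    return footprint_bytes(sizes, bytes_per_element) <= device_bytes


if __name__ == "__main__":
    for name, sizes in TESTS.items():
        bpe = ASSUMED_BYTES_PER_ELEMENT[name]
        total = footprint_bytes(sizes, bpe)
        verdict = "fits" if fits_on_device(sizes, bpe) else "exceeds device memory"
        print(f"{name}: {total / 2**30:.1f} GiB at {bpe} byte(s)/element ({verdict})")
```

Under these assumptions the INT8 and FP16 allocations come in just under 80 GiB, while the TF32 matrices would need roughly twice the card's memory. That would be consistent with the out of memory exception pointing at how the TF32 test is sized rather than at a hardware fault, but it needs to be confirmed against the tool's actual allocation logic.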