Failure on A100 Card #5

Running the GPUStress tool on an A100 card reports the error below. However, the card appears to be healthy and working correctly according to the hardware tests performed by our hardware vendor.

Command Executed: ./gst -T=1
Output:
./gst capturing GPU information...
WATCHDOG starting, TIMEOUT: 600 seconds
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A100 80GB PCIe"
Device 1: "NVIDIA A100 80GB PCIe"
Initilizing A100 80 GB based test suite
TYPE=2
GPU Memory: 79, memgb: 80
Device 0: "NVIDIA A100 80GB PCIe", PCIe: 17
***** STARTING TEST 0: INT8 On Device 0 NVIDIA A100 80GB PCIe
#### math_type 10
#### args: matrixSizeA 34878833064 matrixSizeB 16672535724 matrixSizeC 28662757344
#### args: ta=N tb=T m=244872 n=117052 k=142437 lda=7835904 ldb=3745792 ldc=7835904
loop=1
***** TEST INT8 On Device 0 NVIDIA A100 80GB PCIe
***** TEST PASSED ****
TEST TIME: 24 seconds
***** STARTING TEST 1: FP16 On Device 0 NVIDIA A100 80GB PCIe
#### math_type 0
#### args: matrixSizeA 13000629632 matrixSizeB 13114567936 matrixSizeC 13557084032
#### args: ta=N tb=N m=115928 n=116944 k=112144 lda=115928 ldb=112144 ldc=115928
loop=1
***** TEST FP16 On Device 0 NVIDIA A100 80GB PCIe
***** TEST PASSED ****
TEST TIME: 17 seconds
***** STARTING TEST 2: TF32 On Device 0 NVIDIA A100 80GB PCIe
#### math_type 0
#### args: matrixSizeA 18964675584 matrixSizeB 6192059904 matrixSizeC 14359400064
std::exception: out of memory
testing cublasLt fail 
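
To help triage, here is a rough sketch that adds up the three logged matrix sizes for each test and converts them to GiB. It rests on two assumptions that are not confirmed by the tool: that the `matrixSizeA/B/C` values are element counts, and that elements occupy 1, 2, and 4 bytes for INT8, FP16, and TF32 respectively (TF32 inputs are normally stored as 32-bit floats). The names `LOGGED_SIZES`, `ASSUMED_BYTES_PER_ELEMENT`, and `footprint_gib` are made up for this illustration.

```python
# Matrix sizes (A, B, C) copied from the gst log above.
LOGGED_SIZES = {
    "INT8": (34878833064, 16672535724, 28662757344),
    "FP16": (13000629632, 13114567936, 13557084032),
    "TF32": (18964675584, 6192059904, 14359400064),
}

# ASSUMPTION: bytes per element for each test; not taken from the gst source.
ASSUMED_BYTES_PER_ELEMENT = {"INT8": 1, "FP16": 2, "TF32": 4}


def footprint_gib(test_name: str) -> float:
    """Estimate the A+B+C buffer footprint in GiB.

    Treats the logged sizes as element counts (an assumption).
    """
    total_elements = sum(LOGGED_SIZES[test_name])
    total_bytes = total_elements * ASSUMED_BYTES_PER_ELEMENT[test_name]
    return total_bytes / 2**30


if __name__ == "__main__":
    for name in LOGGED_SIZES:
        print(f"{name}: ~{footprint_gib(name):.1f} GiB for A+B+C")
```

If those assumptions hold, the INT8 and FP16 requests stay below the 80 GB on the card while the TF32 request is well above it, which would be consistent with the `out of memory` exception occurring only in TEST 2. If the tool sizes TF32 buffers differently, this estimate does not apply.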
