
CUDNN cache locking prevents finalizers resulting in OOMs #1461

@mashu

Description


Describe the bug

I have been working on a model using 5 CNN layers and a Dense layer. The same model that worked before started giving the following error after updating CUDA:

Out of GPU memory trying to allocate 653.794 MiB
Effective GPU memory usage: 99.32% (7.734 GiB/7.787 GiB)
Memory pool usage: 5.630 GiB (6.344 GiB reserved)

Reverting to CUDA@3.8.0, which I luckily still had in my git log, fixed the problem.

To reproduce

I used https://github.com/FluxML/model-zoo/tree/master/vision/conv_mnist, updated CUDA and Flux to their latest versions, and made the model deliberately larger so the number of parameters is closer to my own model. See the attached diff:
LaNet.txt

Manifest.toml
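
For context, here is a minimal sketch of the kind of widened LeNet-style model I mean. The exact layer sizes are in the attached LaNet.txt diff; the channel counts and batch size below are only illustrative.

    using Flux, CUDA

    # Illustrative widened LeNet-style model; the real layer sizes are in LaNet.txt.
    model = Chain(
        Conv((3, 3), 1 => 64, relu, pad=1),
        Conv((3, 3), 64 => 128, relu, pad=1),
        MaxPool((2, 2)),
        Conv((3, 3), 128 => 256, relu, pad=1),
        Conv((3, 3), 256 => 256, relu, pad=1),
        MaxPool((2, 2)),
        Conv((3, 3), 256 => 256, relu, pad=1),
        Flux.flatten,
        Dense(256 * 7 * 7, 512, relu),
        Dense(512, 10),
    ) |> gpu

    # One dummy MNIST-sized batch to exercise the CUDNN convolution path.
    x = CUDA.rand(Float32, 28, 28, 1, 128)
    y = model(x)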

I've tested in the following order

 1. CUDA@3.8.5: out of memory
 2. CUDA@3.8.4: out of memory
 3. CUDA@3.8.0: works
 4. CUDA@3.8.2: works
 5. CUDA@3.8.3: works
 6. CUDA@3.8.4 (retested): out of memory again

So something bad happens between 3.8.3 and 3.8.4.
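
To repeat the bisection, each release can be pinned with Pkg before rerunning the training script (a sketch; not necessarily the exact commands I used, the full project is in the attached Manifest.toml):

    using Pkg

    # Pin one CUDA.jl release at a time, then rerun the training script.
    Pkg.add(name="CUDA", version="3.8.4")
    Pkg.status("CUDA")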

Expected behavior

I expected the same code that worked before to still work with CUDA@3.8.4+

Version info

Details on Julia:

Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 6
  JULIA_EDITOR = atom  -a

Details on CUDA:

CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.4

Libraries: 
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
  Downloaded artifact: CUTENSOR
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce RTX 2070 with Max-Q Design (sm_75, 4.775 GiB / 7.787 GiB available)

Additional context

With the broken CUDA@3.8.4, calling

    GC.gc(true);
    CUDA.reclaim();

at the end of each mini-batch helps keep memory usage down to about 21%. When this is not done, nvtop shows 100% memory usage in all cases, but CUDA 3.8.0 through 3.8.3 still do not crash with an out-of-memory error, while 3.8.4 and 3.8.5 do.
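
Concretely, the workaround looks like this in the training loop (a sketch; `model`, `opt`, and `train_loader` stand in for whatever the conv_mnist script defines):

    using Flux, CUDA

    for (x, y) in train_loader
        x, y = gpu(x), gpu(y)
        ps = Flux.params(model)
        gs = gradient(() -> Flux.logitcrossentropy(model(x), y), ps)
        Flux.Optimise.update!(opt, ps, gs)

        # Workaround: force a full GC and hand freed blocks back to the driver
        # after every mini-batch; without this, nvtop shows 100% memory usage.
        GC.gc(true)
        CUDA.reclaim()
    end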
