RuntimeError: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure #93

@serg06

System:

  • Windows 10

  • RTX 3080

  • Python 3.6 with PyTorch CUDA 11.0

Problem:

Whenever I try to run train.py, it runs for a few epochs, then I run into the issue stated in the title:

Traceback (most recent call last):
  File "C:/repo/flowtron-custom/train.py", line 425, in <module>
    train(n_gpus, rank, **train_config)
  File "C:/repo/flowtron-custom/train.py", line 336, in train
    loss.backward()
  File "C:\Users\Serguei\anaconda3\envs\flowtron-nightly-36\lib\site-packages\torch\tensor.py", line 233, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\Serguei\anaconda3\envs\flowtron-nightly-36\lib\site-packages\torch\autograd\__init__.py", line 146, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure

Funnily enough, when I simply remove the loss.backward() line, the error goes away and the script runs without problems.

What I've tried:

Nvidia drivers:

  • 457.51 (latest as of Dec 8 2020)
  • 465.12 (beta drivers for WSL 2)

Input data:

  • My own data
  • LJSpeech

Batch size:

  • 1
  • 4

Commits

PyTorch/CUDA configs:

  • os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
  • torch.backends.cudnn.enabled = False
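
For reference, a minimal sketch of how these two settings are usually applied at the top of a training script. The environment variable is only read when CUDA is initialized, so it is set before torch is imported here, and the cuDNN switch lives at torch.backends.cudnn.enabled. The helper name configure_debug_env is hypothetical and not part of train.py.

```python
import os


def configure_debug_env(blocking=True, use_cudnn=False):
    """Hypothetical helper: apply the two debug settings from the list above."""
    # Has an effect only if set before the CUDA context is created,
    # so call this before any CUDA work (ideally before importing torch).
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1" if blocking else "0"

    try:
        import torch
    except ImportError:
        # PyTorch is not installed; only the environment variable was set.
        return None

    # Toggle cuDNN globally for this process.
    torch.backends.cudnn.enabled = use_cudnn
    return torch.backends.cudnn.enabled


if __name__ == "__main__":
    print("cuDNN enabled:", configure_debug_env())
```

Setting the variable in the shell before launching Python has the same effect and avoids any ordering concerns inside the script.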

Python versions:

  • 3.8
  • 3.7
  • 3.6

PyTorch versions:

  • 1.7.0 with CUDA 11.0, cuDNN 8.0.4
  • 1.8.0 nightly build (12/07) with CUDA 11.0, cuDNN 8.0.4
  • 1.7.1 with CUDA 11.0, cuDNN 8.0.5

FP16:

  • Enabled
  • Disabled

What worked?

The only thing that resolved the issue was setting os.environ['CUDA_LAUNCH_BLOCKING'] = '1', but it also slowed training down considerably, so it's not a practical solution.
