RuntimeError: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure

## System:

- Windows 10

- RTX 3080

- Python 3.6, PyTorch built with CUDA 11.0

## Problem:

Whenever I run `train.py`, it trains for a few epochs and then fails with the error stated in the title:

```
Traceback (most recent call last):
  File "C:/repo/flowtron-custom/train.py", line 425, in <module>
    train(n_gpus, rank, **train_config)
  File "C:/repo/flowtron-custom/train.py", line 336, in train
    loss.backward()
  File "C:\Users\Serguei\anaconda3\envs\flowtron-nightly-36\lib\site-packages\torch\tensor.py", line 233, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\Serguei\anaconda3\envs\flowtron-nightly-36\lib\site-packages\torch\autograd\__init__.py", line 146, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
```

Funnily enough, when I simply remove the `loss.backward()` line, it stops breaking and runs perfectly.

## What I've tried:

### Nvidia drivers:
- 457.51 (latest as of Dec 8 2020)
- 465.12 (beta drivers for WSL 2)

### Input data:
- My own data
- LJSpeech

### Batch size:
- 1
- 4

### Commits:
- a5a8ef39def9ddf5916ae1603e76fdd113cfa6c7 (Dec 8 2020)
- fe14d3a725b68a22e8e431c6674b2a9cb78b87d6 (Sept 24 2020)

### PyTorch/CUDA configs:
- `os.environ['CUDA_LAUNCH_BLOCKING'] = '1'`
- `torch.backends.cudnn.enabled = False`
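
For anyone trying to reproduce this, here is a minimal sketch of how these two settings are usually applied. It assumes they go at the very top of `train.py`, before `torch` is imported; the exact placement is an assumption, not something taken from the repo. The cuDNN switch lives under `torch.backends.cudnn`. The `try`/`except` around the import is only there so the snippet also runs on a machine without PyTorch.

```python
import os

# Set this before any CUDA work happens (safest: before importing torch).
# The CUDA runtime reads the variable when it initializes, so setting it
# later in the script may have no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch
except ImportError:  # allows the sketch to run without PyTorch installed
    torch = None

if torch is not None:
    # Turn off cuDNN so PyTorch falls back to its non-cuDNN kernels.
    torch.backends.cudnn.enabled = False
```

With `CUDA_LAUNCH_BLOCKING=1` each kernel launch is synchronous, so a traceback should point at the operation that actually failed rather than at a later synchronization point.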

### Python versions:
- 3.8
- 3.7
- 3.6

### PyTorch versions:
- 1.7.0 with CUDA 11.0, cuDNN 8.0.4
- 1.8.0 nightly build (12/07) with CUDA 11.0, cuDNN 8.0.4
- 1.7.1 with CUDA 11.0, cuDNN 8.0.5

### FP16:
- Enabled
- Disabled

## What worked?

The only thing that made the error go away was setting `os.environ['CUDA_LAUNCH_BLOCKING'] = '1'`, but it also slowed training down by a lot, so it's a pretty awful solution.
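
To put a number on the slowdown, something like the following could be used to compare the average time per training step with and without the flag. This is only a sketch: `mean_step_seconds`, `step_fn`, and `sync` are hypothetical names, not part of `train.py`. When timing GPU work without `CUDA_LAUNCH_BLOCKING`, `torch.cuda.synchronize` would be passed as `sync` so that asynchronously launched kernels are included in the measurement.

```python
import time


def mean_step_seconds(step_fn, n_steps=20, sync=None):
    """Return the average wall-clock seconds per call of step_fn.

    sync is an optional callable (for example torch.cuda.synchronize)
    that is run before each clock reading, so pending GPU work is
    finished before the timer starts and before it stops.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / n_steps


# Usage with a stand-in workload instead of a real training step.
calls = []
seconds = mean_step_seconds(lambda: calls.append(1), n_steps=5)
```

Running it once per configuration, with a function that performs a single training iteration as `step_fn`, gives two figures that can be compared directly.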

