Training is interrupted without error with multi-GPU #5604

@skull3r7

Description

🐛 Bug

Training is interrupted randomly in the middle of an epoch without any error; the console only prints: Terminated.
The problem does not occur on every run, but when it does, it is usually between epochs 2 and 4. Notably, processes keep running after the termination and the graphics cards are still occupied by python processes.
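
For illustration, a small helper like the following (just an inspection sketch, not part of the training code; the nvidia-smi query flags are standard, the helper itself is our own) lists the python processes that still occupy the GPUs after the Terminated message:

import subprocess


def list_gpu_compute_processes():
    """Print pid, process name and GPU memory of every compute process still on the GPUs."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.strip().splitlines():
        print(line)


if __name__ == "__main__":
    list_gpu_compute_processes()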

We are training the PyTorch version of the ImageGPT model with huggingface transformers, so this could also be a problem on the huggingface side; we are not sure.

Epoch 1: 29%|█▍ | 9413/32393 [3:28:18<8:28:33, 1.33s/it, loss=3.23, v_num=9]Terminated

Please reproduce using the BoringModel

We can't reproduce this with the BoringModel.

Code

import torch.nn as nn
import pytorch_lightning as pl

# ImageGPT2LMHeadModel is our ImageGPT port based on huggingface transformers.


class ImageGPT(pl.LightningModule):

    def __init__(self, learning_rate=learning_rate):  # default comes from a module-level constant
        super().__init__()
        self.gpt2 = ImageGPT2LMHeadModel(config=...)
        self.criterion = nn.CrossEntropyLoss(reduction='none')
        self.learning_rate = learning_rate

    def forward(self, x):
        return self.gpt2(x, past_key_values=None)

....
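
The remaining hooks are omitted above. As a simplified, illustrative sketch of them (the label handling, logging and optimizer hyperparameters here are assumptions, not the exact code; the AdamW is the huggingface implementation mentioned under Additional context):

    # Illustrative sketch of the omitted hooks, not the exact code.
    def training_step(self, batch, batch_idx):
        x, _ = batch
        logits = self(x)[0]  # assuming the model returns a tuple with logits first
        # next-token prediction: shift logits and targets by one position
        loss = self.criterion(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            x[:, 1:].reshape(-1),
        ).mean()
        self.log('train_loss', loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        # huggingface AdamW, as mentioned under "Additional context"
        from transformers import AdamW
        return AdamW(self.parameters(), lr=self.learning_rate)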


from pytorch_lightning import Trainer, loggers as pl_loggers
from pytorch_lightning.callbacks import ModelCheckpoint

# `name`, `model` and `datamodule` are defined elsewhere in our script.
logger = pl_loggers.TensorBoardLogger(save_dir="logs", name=name)

checkpoint_callback = ModelCheckpoint(
    save_top_k=1,
    verbose=True,
    monitor='val_loss',
    mode='min',
    filepath='../models',
    prefix='ImageGPT',
)

trainer = Trainer(
    accelerator='ddp',
    max_epochs=10,
    max_steps=None,
    precision=32,
    accumulate_grad_batches=1,
    gpus=[0, 1, 2],
    callbacks=[checkpoint_callback],
    logger=logger,
    gradient_clip_val=0.6,
)

trainer.fit(model=model, datamodule=datamodule)

Expected behavior

Training runs to completion across all epochs.

Environment

  • CUDA:
    • GPU:
      • TITAN RTX
      • TITAN RTX
      • TITAN RTX
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.4
    • pyTorch_debug: False
    • pyTorch_version: 1.7.1
    • pytorch-lightning: 1.1.2
    • transformers: 3.5.1
    • tqdm: 4.55.0
  • System:
    • OS: Linux, 64bit
    • processor: x86_64
    • python: 3.7.4
    • version: #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020

Additional context

We have already tried the following to solve the problem:

  • set the num_workers of the dataloaders to 0 or 1 (instead of 32-64); see the sketch after this list for where this is configured
  • go back to 32 bit precision
  • different learning rates
  • added gradient clipping
  • used AdamW implementation from huggingface
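
For reference, a minimal sketch of how the dataloaders are wired up in the LightningDataModule (the dataset arguments and batch size here are placeholders; the point is only where num_workers is set):

import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset


class ImageGPTDataModule(pl.LightningDataModule):
    """Illustrative DataModule; only shows where num_workers is configured."""

    def __init__(self, train_set: Dataset, val_set: Dataset, batch_size=8, num_workers=0):
        super().__init__()
        self.train_set = train_set
        self.val_set = val_set
        self.batch_size = batch_size
        self.num_workers = num_workers  # 0 or 1 instead of 32-64 while debugging

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size,
                          shuffle=True, num_workers=self.num_workers)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size,
                          num_workers=self.num_workers)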

Labels

bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 0 (High priority task), waiting on author (Waiting on user action, correction, or update)
