Training is interrupted without error with multi-GPU #5604

@skull3r7

Description

🐛 Bug

Training is interrupted randomly in the middle of an epoch without any error; the console only prints: Terminated.
The problem does not occur on every run, but when it does, it is usually between epochs 2 and 4. Notably, processes keep running after the termination and the graphics cards are still occupied by python processes.
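
For illustration, a small helper like the following (just an inspection sketch, not part of the training code; the nvidia-smi query flags are standard, the helper itself is our own) lists the python processes that still occupy the GPUs after the Terminated message:

import subprocess


def list_gpu_compute_processes():
    """Print pid, process name and GPU memory of every compute process still on the GPUs."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.strip().splitlines():
        print(line)


if __name__ == "__main__":
    list_gpu_compute_processes()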

We are training the PyTorch version of the ImageGPT model with huggingface transformers, so this could also be a problem on the huggingface side; we are not sure.

Epoch 1: 29%|█▍ | 9413/32393 [3:28:18<8:28:33, 1.33s/it, loss=3.23, v_num=9]Terminated

Please reproduce using the BoringModel

We can't reproduce this with the BoringModel.

Code

import torch.nn as nn
import pytorch_lightning as pl

# ImageGPT2LMHeadModel is our ImageGPT port based on huggingface transformers.


class ImageGPT(pl.LightningModule):

    def __init__(self, learning_rate=learning_rate):  # default comes from a module-level constant
        super().__init__()
        self.gpt2 = ImageGPT2LMHeadModel(config=...)
        self.criterion = nn.CrossEntropyLoss(reduction='none')
        self.learning_rate = learning_rate

    def forward(self, x):
        return self.gpt2(x, past_key_values=None)

....
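
The remaining hooks are omitted above. As a simplified, illustrative sketch of them (the label handling, logging and optimizer hyperparameters here are assumptions, not the exact code; the AdamW is the huggingface implementation mentioned under Additional context):

    # Illustrative sketch of the omitted hooks, not the exact code.
    def training_step(self, batch, batch_idx):
        x, _ = batch
        logits = self(x)[0]  # assuming the model returns a tuple with logits first
        # next-token prediction: shift logits and targets by one position
        loss = self.criterion(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            x[:, 1:].reshape(-1),
        ).mean()
        self.log('train_loss', loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        # huggingface AdamW, as mentioned under "Additional context"
        from transformers import AdamW
        return AdamW(self.parameters(), lr=self.learning_rate)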


from pytorch_lightning import Trainer, loggers as pl_loggers
from pytorch_lightning.callbacks import ModelCheckpoint

# `name`, `model` and `datamodule` are defined elsewhere in our script.
logger = pl_loggers.TensorBoardLogger(save_dir="logs", name=name)

checkpoint_callback = ModelCheckpoint(
    save_top_k=1,
    verbose=True,
    monitor='val_loss',
    mode='min',
    filepath='../models',
    prefix='ImageGPT',
)

trainer = Trainer(
    accelerator='ddp',
    max_epochs=10,
    max_steps=None,
    precision=32,
    accumulate_grad_batches=1,
    gpus=[0, 1, 2],
    callbacks=[checkpoint_callback],
    logger=logger,
    gradient_clip_val=0.6,
)

trainer.fit(model=model, datamodule=datamodule)

Expected behavior

Training runs to completion across all epochs.

Environment

  • CUDA:
    • GPU:
      • TITAN RTX
      • TITAN RTX
      • TITAN RTX
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.4
    • pyTorch_debug: False
    • pyTorch_version: 1.7.1
    • pytorch-lightning: 1.1.2
    • transformers: 3.5.1
    • tqdm: 4.55.0
  • System:
    • OS: Linux, 64bit
    • processor: x86_64
    • python: 3.7.4
    • version: #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020

Additional context

We have already tried the following to solve the problem:

  • set the num_workers of the dataloaders to 0 or 1 (instead of 32-64); see the sketch after this list for where this is configured
  • go back to 32 bit precision
  • different learning rates
  • added gradient clipping
  • used AdamW implementation from huggingface
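
For reference, a minimal sketch of how the dataloaders are wired up in the LightningDataModule (the dataset arguments and batch size here are placeholders; the point is only where num_workers is set):

import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset


class ImageGPTDataModule(pl.LightningDataModule):
    """Illustrative DataModule; only shows where num_workers is configured."""

    def __init__(self, train_set: Dataset, val_set: Dataset, batch_size=8, num_workers=0):
        super().__init__()
        self.train_set = train_set
        self.val_set = val_set
        self.batch_size = batch_size
        self.num_workers = num_workers  # 0 or 1 instead of 32-64 while debugging

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size,
                          shuffle=True, num_workers=self.num_workers)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size,
                          num_workers=self.num_workers)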

Labels

bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 0 (High priority task), waiting on author (Waiting on user action, correction, or update)
