Closed
Labels: bug, distributed, help wanted, priority: 0, waiting on author
Description
🐛 Bug
Training is interrupted randomly in the middle of an epoch without any error message; the console only prints Terminated.
The error does not always occur; when it does, it is mostly between epochs 2 and 4. Notably, processes keep running after the termination, and the graphics cards are still occupied by Python processes.
We train the PyTorch version of the ImageGPT model with huggingface transformers. The problem could also be on the huggingface side; we are not sure.
Epoch 1: 29%|█▍ | 9413/32393 [3:28:18<8:28:33, 1.33s/it, loss=3.23, v_num=9]Terminated
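Since the bare Terminated gives no traceback, a minimal debugging sketch (our own addition, not code from the failing run) that could be placed at the top of the training script: it prints a stack trace on whichever rank receives SIGTERM and turns on verbose NCCL logging. The environment variable value and handler behaviour are assumptions about how one might gather more context, not part of our actual setup.

import os
import signal
import sys
import traceback

os.environ.setdefault("NCCL_DEBUG", "INFO")  # verbose NCCL logs on every rank (assumed debugging aid)

def _log_sigterm(signum, frame):
    # Print where this rank was when it was told to terminate, then exit.
    traceback.print_stack(frame, file=sys.stderr)
    sys.exit(1)

signal.signal(signal.SIGTERM, _log_sigterm)
# ... then build the model / datamodule / trainer as in the snippet below.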
Please reproduce using the BoringModel
We can't reproduce the issue with the BoringModel; a sketch of the setup we tried is below.
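For reference, this is roughly the BoringModel-style setup used for that check (a simplified sketch of our own, with random data standing in for the real image tokens); it runs with the same ddp/gpus settings without being terminated.

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl

class RandomDataset(Dataset):
    # Random tensors stand in for the real image-token data (sketch only).
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        # Dummy loss; only meant to exercise the DDP training loop.
        return self(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

if __name__ == "__main__":
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(32, 6400), batch_size=2)
    trainer = pl.Trainer(accelerator="ddp", gpus=[0, 1, 2], max_epochs=2)
    trainer.fit(model, train_loader)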
Code
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning import Trainer, loggers as pl_loggers
from pytorch_lightning.callbacks import ModelCheckpoint
# ImageGPT2LMHeadModel is our GPT-2-based ImageGPT port (definition not shown)

class ImageGPT(pl.LightningModule):
    def __init__(self, learning_rate=learning_rate):  # learning_rate is defined elsewhere in the script
        super().__init__()
        self.gpt2 = ImageGPT2LMHeadModel(config=...)
        self.criterion = nn.CrossEntropyLoss(reduction='none')
        self.learning_rate = learning_rate

    def forward(self, x):
        return self.gpt2(x, past_key_values=None)

    ....

logger = pl_loggers.TensorBoardLogger(save_dir="logs", name=name)
checkpoint_callback = ModelCheckpoint(
    save_top_k=1,
    verbose=True,
    monitor='val_loss',
    mode='min',
    filepath='../models',
    prefix='ImageGPT'
)
trainer = Trainer(
    accelerator='ddp',
    max_epochs=10,
    max_steps=None,
    precision=32,
    accumulate_grad_batches=1,
    gpus=[0, 1, 2],
    callbacks=[checkpoint_callback],
    logger=logger,
    gradient_clip_val=0.6
)
trainer.fit(model=model, datamodule=datamodule)
Expected behavior
The training is fully completed across all epochs.
Environment
- CUDA:
  - GPU:
    - TITAN RTX
    - TITAN RTX
    - TITAN RTX
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.19.4
  - pyTorch_debug: False
  - pyTorch_version: 1.7.1
  - pytorch-lightning: 1.1.2
  - transformers: 3.5.1
  - tqdm: 4.55.0
- System:
  - OS: Linux, 64bit
  - processor: x86_64
  - python: 3.7.4
  - version: 86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
Additional context
We have already tried the following to work around the problem:
- set num_workers of the dataloaders to 0 or 1 (instead of 32-64)
- went back to 32-bit precision
- tried different learning rates
- added gradient clipping
- used the AdamW implementation from huggingface (see the sketch below)
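For the last point, the optimizer swap looked roughly like this (a sketch only; the hyperparameter values are placeholders, not the ones from our runs):

from transformers import AdamW  # Hugging Face AdamW (transformers 3.5.1)

class ImageGPT(pl.LightningModule):
    # ... __init__ / forward as in the snippet above ...

    def configure_optimizers(self):
        # Placeholder hyperparameters; only the optimizer class was swapped.
        return AdamW(self.parameters(), lr=self.learning_rate, weight_decay=0.01)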