Trainer test cannot load from checkpoint when training on multiple GPUs #5144

wjaskowski · 2020-12-15T10:30:46Z

🐛 Bug

The Trainer.test() looks for epoch=X-v0.ckpt when only epoch=X.ckpt exists, thus the result is:

Traceback (most recent call last):
  File "/home/wojciech/tmp/pytorch-lightining/main.py", line 16, in <module>
    result = trainer.test()
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in test
    results = self.__test_using_best_weights(ckpt_path, test_dataloaders)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in __test_using_best_weights
    ckpt = pl_load(ckpt_path, map_location=lambda storage, loc: storage)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/pytorch_lightning/utilities/cloud_io.py", line 31, in load
    with fs.open(path_or_url, "rb") as f:
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/fsspec/spec.py", line 897, in open
    f = self._open(
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/fsspec/implementations/local.py", line 115, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/fsspec/implementations/local.py", line 197, in __init__
    self._open()
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/fsspec/implementations/local.py", line 202, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/wojciech/tmp/pytorch-lightining/lightning_logs/version_10/checkpoints/epoch=0-v0.ckpt'

To Reproduce

Execute several times on >1 gpu machine:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import NeptuneLogger

import os
from typing import Any, Optional

import torch
from pytorch_lightning import LightningDataModule
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import transforms

import torch
from torch.nn import functional as F
from pytorch_lightning.core.lightning import LightningModule
from torch.optim import Adam


class MNISTModule(LightningModule):

    def __init__(self):
        super().__init__()

        # mnist images are (1, 28, 28) (channels, width, height)
        self.layer_1 = torch.nn.Linear(28 * 28, 128)
        self.layer_2 = torch.nn.Linear(128, 256)
        self.layer_3 = torch.nn.Linear(256, 10)

    def forward(self, x):
        batch_size, channels, width, height = x.size()

        # (b, 1, 28, 28) -> (b, 1*28*28)
        x = x.view(batch_size, -1)
        x = self.layer_1(x)
        x = F.relu(x)
        x = self.layer_2(x)
        x = F.relu(x)
        x = self.layer_3(x)

        x = F.log_softmax(x, dim=1)
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log('train_loss', loss, on_step=True, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log('val_loss', loss, on_step=False, on_epoch=True)
        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log('test_loss', loss, on_step=False, on_epoch=True)
        return loss

    def configure_optimizers(self):
        opt = Adam(self.parameters(), lr=1e-3)
        return opt



# noinspection PyAttributeOutsideInit
class MNISTDataModule(LightningDataModule):

    def __init__(self):
        super().__init__()
        self.train_dims = None
        self.vocab_size = 0

    def prepare_data(self):
        # called only on 1 GPU
        MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
        MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor())

    def setup(self, stage: Optional[str] = None):
        # called on every GPU
        transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
        self.train = MNIST(os.getcwd(), train=True, download=False, transform=transform)
        self.test = MNIST(os.getcwd(), train=False, download=False, transform=transform)

        self.train, self.val = torch.utils.data.random_split(self.train, (50000, 10000))

    def train_dataloader(self):
        return DataLoader(self.train, batch_size=64, shuffle=True, drop_last=True, num_workers=2)

    def val_dataloader(self):
        return DataLoader(self.val, batch_size=512, drop_last=False)

    def test_dataloader(self):
        return DataLoader(self.test, batch_size=512, drop_last=False)


if __name__ == '__main__':
    dm = MNISTDataModule()
    model = MNISTModule()

    params = dict(param1='a', param2=1)
    trainer = Trainer(gpus=2, max_epochs=1, accelerator='ddp')
    trainer.fit(model, datamodule=dm)

    result = trainer.test()
    print(result)

Expected behavior

No exception.

Environment

CUDA:
- GPU:
- GeForce GTX TITAN X
- GeForce GTX TITAN X
- GeForce GTX TITAN X
- GeForce GTX TITAN X
- available: True
- version: 10.2
Packages:
- numpy: 1.19.4
- pyTorch_debug: False
- pyTorch_version: 1.7.1
- pytorch-lightning: 1.1.0 [Also 1.0.8]
- tqdm: 4.54.1
System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.5
- version: Support of different batch types #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020

### Additional context

This happens only when `gpus=2` and acceleration=`ddp`. There must be some race condition since this problem occurs every now and then only.

The text was updated successfully, but these errors were encountered:

awaelchli · 2020-12-16T07:16:22Z

@wjaskowski I believe I found the fix for it. Thanks for including a reproducible script, it helped alot.
If you find the time, would you mind checking if the fix on my branch works for you? (see linked PR)

wjaskowski · 2020-12-16T08:43:44Z

I tried the branch but the problem seems to be still there:

Traceback (most recent call last):
  File "/home/wojciech/tmp/pytorch-lightining/bug3.py", line 113, in <module>
    result = trainer.test()
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 754, in test
    results = self.__test_using_best_weights(ckpt_path, test_dataloaders)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 784, in __test_using_best_weights
    ckpt = pl_load(ckpt_path, map_location=lambda storage, loc: storage)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/pytorch_lightning/utilities/cloud_io.py", line 31, in load
    with fs.open(path_or_url, "rb") as f:
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/fsspec/spec.py", line 897,
in open
    f = self._open(
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/fsspec/implementations/local.py", line 115, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/fsspec/implementations/local.py", line 197, in __init__
    self._open()
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.8/site-packages/fsspec/implementations/local.py", line 202, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/wojciech/tmp/pytorch-lightining/lightning_logs/version_45/checkpoints/epoch=0-step=259-v0.ckpt'

awaelchli · 2020-12-17T06:26:40Z

this is surprising. with your exact script I can get the error on master within 2-3 trials, but on the bugfix branch (bugfix/ddp-ckpt) I ran it probably 20+ times and it never occurs.

wjaskowski · 2020-12-17T08:49:05Z

I will carefully give it a try once again when I will regain access to a machine with >1 GPUs.

edenlightning · 2021-01-08T21:17:22Z

@wjaskowski any update?

edenlightning · 2021-01-25T22:00:19Z

Feel free to reopen if needed!

sustcsonglin · 2021-03-12T02:04:17Z

I have the same issues...

blacksnail789521 · 2023-03-23T14:27:15Z

Same issue in 2023

wjaskowski added bug Something isn't working help wanted Open to be worked on labels Dec 15, 2020

awaelchli self-assigned this Dec 15, 2020

awaelchli added the with code label Dec 15, 2020

awaelchli mentioned this issue Dec 16, 2020

Fix ModelCheckpoint race condition in file existence check #5155

Merged

edenlightning added the waiting on author Waiting on user action, correction, or update label Jan 8, 2021

edenlightning closed this as completed Jan 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trainer test cannot load from checkpoint when training on multiple GPUs #5144

Trainer test cannot load from checkpoint when training on multiple GPUs #5144

wjaskowski commented Dec 15, 2020

awaelchli commented Dec 16, 2020

wjaskowski commented Dec 16, 2020

awaelchli commented Dec 17, 2020

wjaskowski commented Dec 17, 2020

edenlightning commented Jan 8, 2021

edenlightning commented Jan 25, 2021

sustcsonglin commented Mar 12, 2021

blacksnail789521 commented Mar 23, 2023

Trainer test cannot load from checkpoint when training on multiple GPUs #5144

Trainer test cannot load from checkpoint when training on multiple GPUs #5144

Comments

wjaskowski commented Dec 15, 2020

🐛 Bug

To Reproduce

Expected behavior

Environment

awaelchli commented Dec 16, 2020

wjaskowski commented Dec 16, 2020

awaelchli commented Dec 17, 2020

wjaskowski commented Dec 17, 2020

edenlightning commented Jan 8, 2021

edenlightning commented Jan 25, 2021

sustcsonglin commented Mar 12, 2021

blacksnail789521 commented Mar 23, 2023