
ValueError: bad value(s) in fds_to_keep, when attempting DDP #538

Closed
sdac opened this issue Nov 21, 2019 · 34 comments
Labels
bug Something isn't working

Comments

@sdac

sdac commented Nov 21, 2019

I can't get DDP working; I always hit the following error:

Traceback (most recent call last):
  File "train.py", line 86, in <module>
    main(config)
  File "train.py", line 41, in main
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 59, in _launch
    cmd, self._fds)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/util.py", line 417, in spawnv_passfds
    False, False, None)
ValueError: bad value(s) in fds_to_keep

What I have tried that didn't work:

  • Python 3.7
  • Python 3.6
  • pytorch 1.1.0
  • pytorch 1.2.0
  • Downgrading "scikit-learn", which had reportedly helped in unrelated projects according to Google results
  • Lightning 0.5.3.2
  • Lightning master version
  • Lightning 0.5.2.1
  • CUDA 9.0
  • CUDA 9.2
  • CUDA 10.0
  • Removing visdom from the project

The error occurs on both servers I tried: one with 4 Titan X cards and one with 8 Tesla V100s, both running Ubuntu 18.04.3 LTS.

I suspect that something in my model is triggering it and would appreciate ideas, though I cannot share the source code. The model works in dp and single-GPU mode.

@jeffling
Contributor

I suspect you're right; it probably is your model, and mp.spawn is most likely the culprit. One thing you can try is changing how the worker processes are started via set_start_method (sketched below): https://docs.python.org/3.5/library/multiprocessing.html#contexts-and-start-methods
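
A minimal sketch of that approach (whether it helps depends on how your Lightning version invokes mp.spawn; the main() body is a placeholder for your existing training script):

import multiprocessing as mp

def main():
    ...  # build the model, create the Trainer and call trainer.fit(model) here

if __name__ == "__main__":
    # Must be called before any processes, pools or CUDA contexts are created.
    # 'fork', 'spawn' and 'forkserver' are the start methods described in the linked docs.
    mp.set_start_method("forkserver", force=True)
    main()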

If that doesn't work, I have a change that lets you bypass mp.spawn and use torch.distributed.launch for DDP, which I'd like to push upstream once I have enough time. It works by running a separate process for each GPU, which sidesteps some limitations of the Python interpreter.

@williamFalcon
Contributor

@jeffling should we add a hook for the mp.spawn call so ppl can do whatever they want?

@jeffling
Contributor

jeffling commented Nov 26, 2019

@williamFalcon Absolutely, I think that would be the right thing to do.

This is the relevant part of the patched fit function version that we are running, in case it helps anyone else reading this.

        elif self.use_ddp:
            if self.is_slurm_managing_tasks:
                task = int(os.environ["SLURM_LOCALID"])
                self.ddp_train(task, model)
            else:
                # This is the custom part - we disabled mp.spawn.
                assert "LOCAL_RANK" in os.environ, (
                    "LOCAL_RANK not found in os.environ. "
                    "Make sure you're running with the torch.distributed.launch utility."
                )
                rank = int(os.environ["LOCAL_RANK"])
                if rank > 0:
                    self.show_progress_bar = False
                self.ddp_train(rank, model)

We could have a callback that allows people to override everything under use_ddp.

I also think that using a process per GPU has some real benefits, and it would be good for the framework to support it out of the box, so we can allow for callback override and also support this use case in the default callback :)

@jeffling
Contributor

Just in case people want to use the snippet before we support this officially:

You need to set Trainer.gpus to your world_size, and Trainer.distributed_backend to ddp.

In your module, you need the following overrides as well:

    def configure_ddp(self, model, device_ids):
        """
        Configure to use a single GPU set on local rank.

        Must return model.
        :param model:
        :param device_ids:
        :return: DDP wrapped model
        """
        device_id = f"cuda:{os.environ['LOCAL_RANK']}"

        model = LightningDistributedDataParallel(
            model,
            device_ids=[device_id],
            output_device=device_id,
            find_unused_parameters=True,
        )

        return model

    def init_ddp_connection(self, proc_rank, world_size):
        """
        Connect all procs in the world using the env:// init
        Use the first node as the root address
        """

        import torch.distributed as dist

        dist.init_process_group("nccl", init_method="env://")

        # Explicitly setting seed to make sure that models created in two processes
        # start from same random weights and biases.
        # TODO(jeffling): I'm pretty sure we need to set other seeds as well?
        print(f"Setting torch manual seed to {FIXED_SEED} for DDP.")
        torch.manual_seed(FIXED_SEED)
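
As a usage sketch to go with these overrides (MyLightningModule and hparams are placeholders, not from this thread): torch.distributed.launch starts one process per GPU and sets LOCAL_RANK for each of them, e.g. python -m torch.distributed.launch --nproc_per_node=4 train.py, and the entry point stays a plain Trainer call:

from pytorch_lightning import Trainer

if __name__ == "__main__":
    model = MyLightningModule(hparams)  # placeholder module carrying the overrides above
    # gpus must equal the world size, and the backend stays "ddp" (see the note above)
    trainer = Trainer(gpus=4, distributed_backend="ddp")
    trainer.fit(model)  # with the patched fit, each process calls ddp_train(LOCAL_RANK, model)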

@armancohan

@jeffling any tips on bypassing mp.spawn and using PyTorch's torch.distributed.launch?

@jeffling
Contributor

jeffling commented Dec 2, 2019

@armancohan You can use the code snippets and instructions in my two comments above, and then just use the launch utility.

@armancohan

Thanks @jeffling, this was very helpful.

@kwanUm

kwanUm commented Dec 24, 2019

Just to understand, @jeffling - currently PyTorch Lightning does NOT work with 'ddp' out of the box, because the framework doesn't propagate the seed value set in the initial process to the spawned processes (so each process in the distributed training may end up with a different seed).
The only way to overcome this at the moment is to touch the library code directly and set the seed value manually.
Is that correct? I'm asking because I'm seeing very different results when training my GAN network with ddp versus on a single GPU, and this might be the reason.

@jeffling
Contributor

@kwanUm You can set the seed value inside your lightning module as well.

There are a few things that Lightning can't or doesn't currently do, and a lot of these are dependent on your model and data (a sketch of the first two follows this list):

  • make sure seed values are correct (this is something Lightning could help with but doesn't yet)
  • distributed sampler
  • depending on your model, sync batch norm
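
A minimal sketch of the first two items (not from Lightning itself; SEED and self._train_dataset are placeholders, and the sampler assumes the process group is already initialised when the dataloader is built):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

SEED = 1234  # placeholder: any fixed value shared by every DDP process

def seed_everything(seed=SEED):
    # Seed all the common RNGs so every process starts from identical weights.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# inside your LightningModule:
def train_dataloader(self):
    # DistributedSampler shards the dataset across ranks instead of duplicating it.
    sampler = DistributedSampler(self._train_dataset)
    return DataLoader(self._train_dataset, batch_size=32, sampler=sampler)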

@astariul
Contributor

I have the same issue; I can't make ddp work.

The solution proposed by @jeffling has no effect on my machine.

Any ideas would be highly appreciated.

@kyoungrok0517

kyoungrok0517 commented Mar 23, 2020

Same here. I can't even work out the cause. It would be much appreciated if someone could explain what the error means.

@Borda added the "bug" (Something isn't working) and "information needed" labels on Mar 26, 2020
@Borda
Member

Borda commented Mar 26, 2020

@colanim @kyoungrok0517 could you give details about your machines and environments?

@kyoungrok0517

@Borda Here's my environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V

Nvidia driver version: 440.64
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4

Versions of relevant libraries:
[pip] numpy==1.17.2
[pip] numpydoc==0.9.1
[pip] pytorch-lightning==0.7.1
[pip] torch==1.4.0
[pip] torchtext==0.5.0
[pip] torchvision==0.5.0
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      243
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.14           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] pytorch-lightning         0.7.1                    pypi_0    pypi
[conda] torchtext                 0.5.0                    pypi_0    pypi
[conda] torchvision               0.5.0                py37_cu101    pytorch

@Borda
Member

Borda commented Mar 26, 2020

@jeffling could you check?

@kyoungrok0517

I'm experiencing the same problem in any multi-gpu environment, not only with two Titan Vs.

@jeffling
Contributor

jeffling commented Mar 29, 2020

@kyoungrok0517 @colanim Just as a sanity check, do you have the same issues with running any model with lightning? Also, exactly which arguments are you passing in?

@kyoungrok0517

No. The model was running well on multi-GPU at an early stage, but at some point the problem appeared. Since the model structure stayed the same, I suspect the error has something to do with the data loading step, but I can't be certain. I attach some code snippets that might cause the problem:

# data loading (pandas, parquet)
class News20Dataset(Dataset):
    def __init__(self, data_path, tokenizer):
        super().__init__()
        self.data = pd.read_parquet(data_path)
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        row = self.data.iloc[index]
        text_ids = self.tokenizer.encode(row.text)
        return {"text": text_ids, "target": row.label}
# data loading (numpy, zarr)
class TripleDataset(Dataset):
    def __init__(self, data_path, tokenizer):
        super().__init__()
        self.data = zarr.open(data_path, "r")
        self.tokenizer = tokenizer
# vector loading (torchtext.vocab)
def _get_bow_vocab(self):
    VOCAB_PATH = Path(root_dir) / "../../vocab/vocab.json"
    VECTORS = "fasttext.en.300d"
    MIN_FREQ = 10
    MAX_SIZE = 100000
    with open(VOCAB_PATH, "r", encoding="utf-8") as f:
        vocab_counts = Counter(json.load(f))
    
    return Vocab(
        vocab_counts,
        vectors=VECTORS,
        min_freq=MIN_FREQ,
        max_size=MAX_SIZE
    )

@kyoungrok0517

Here are my dataloaders:

def _get_dataloader(self, dataset, test=False):
    batch_size = self.hparams.batch_size if not test else 10000
    num_workers = int(cpu_count() / 4) or 1
    return DataLoader(
        dataset, batch_size=batch_size, num_workers=num_workers, pin_memory=True,
    )

def train_dataloader(self):
    return self._get_dataloader(self._train_dataset)

def val_dataloader(self):
    return self._get_dataloader(self._val_dataset)

def test_dataloader(self):
    return self._get_dataloader(self._test_dataset, test=True)

@kyoungrok0517

@kyoungrok0517 @colanim Just as a sanity check, do you have the same issues with running any model with lightning? Also, exactly which arguments are you passing in?

This is my Trainer

    if hparams.profile:
        profiler = AdvancedProfiler()
    else:
        profiler = None

    # train
    # trainer = Trainer.from_argparse_args(hparams)
    trainer = Trainer(
        logger=tt_logger,
        default_save_path=root_dir,
        max_nb_epochs=hparams.max_nb_epochs,
        gpus=hparams.gpus,
        distributed_backend=hparams.distributed_backend,
        fast_dev_run=hparams.fast_dev_run,
        amp_level=hparams.amp_level,
        precision=hparams.precision,
        early_stop_callback=early_stop_callback,
        benchmark=True,
        profiler=profiler,
    )
    trainer.fit(model)

@sneiman
Contributor

sneiman commented Mar 31, 2020

FWIW: Have you recently updated Ubuntu? I just started experiencing this in the last hour - and I am using a local fork that has not changed in a few weeks - so it doesn't seem likely that it's lightning. Will add more if I learn more.

Update: I do not believe this is pytorch-lightning. I have recently created models that are virtually identical and do NOT show this problem. It is not clear what is causing it ... almost certainly file related, as the specific error is a multiprocessing/POSIX complaint about file descriptors that do not have appropriate values.

@kyoungrok0517
Copy link

kyoungrok0517 commented Mar 31, 2020 via email

@sneiman
Contributor

sneiman commented Apr 1, 2020

I have resolved this in my environment. Writing it here in case it helps. During construction of the model, and before creating the trainer etc., I store the .data property of a Parameter:

self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)
self.class_p_t      = self.class_p.data

I do NOT actually do anything with self.class_p_t - I assigned it in anticipation of logging these values to monitor the progress of self-learned curriculum-learning class weights. I speculate there is some issue relating to autograd - but that's just a guess. It generated ~100 file descriptors, which is what appeared to choke the fds_to_keep check in the multiprocessing module. Eliminating this assignment ended the problem - finally!

For the record: Ubuntu 19.10, Python 3.7.5, PyTorch 1.4, venv.

@kyoungrok0517

kyoungrok0517 commented Apr 2, 2020 via email

@sneiman
Contributor

sneiman commented Apr 2, 2020

Sorry - I do not know if it's a bug in pytorch or not. I shared my results in case it helps.

I am not sure under what circumstances the .data field is supposed to be accessed - if ever. I don't even know if it's just revealing a bug in the latest Ubuntu release. I just know that removing that line made the problem go away. As I tried to say, I was merely speculating that this was due to some behind-the-scenes memory-sharing approach in python/pytorch multiprocessing - for which memory files could be a useful mechanism for sharing large tensors. Again, this is ONLY speculation.

I did post a question on pytorch github.

@sneiman
Contributor

sneiman commented Apr 8, 2020

The people at pytorch did look at a fragment that reproduces this problem - and they had some insight into what is happening and why. I need to look at some lightning internals to make sure their diagnosis applies to what we are seeing.

@sneiman
Contributor

sneiman commented Apr 9, 2020

I have verified what causes the problem in my model and what fixes it. The problem was that I naively assigned a parameter to another variable. This new reference to the parameter does not get moved to the correct GPU when pytorch-lightning copies the model with model.cuda(gpu_idx) in ddp_train(). When ddp is used, the reference points into another process's space, and so it creates the multiprocessing fault noted at the head of this issue.

This is NOT a ptl bug. This is the result of the naive assignment of the parameter to another variable:

        # nn.Parameter() ensures pytorch knows about this - and will move it new gpu when required
        self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)

        # causes a crash if using ddp: self.class_p_t refers to the original process space, not the one it has been moved to by ddp
        self.class_p_t      = self.class_p.data

self.class_p is known to pytorch as a parameter, and so is copied to the correct GPU. But the reference to it in self.class_p_t is not known to pytorch as a parameter, so this reference is not updated when the model is copied. The simple fix is to make a deep copy instead of the naive assignment. self.class_p_t is still not moved to the GPU, but it is now within the process space of each ddp replica:

        # this now works
        self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)
        self.class_p_t      = copy.deepcopy(self.class_p.data)

Hope this helps ...

@williamFalcon
Contributor

thanks!

@sneiman
Contributor

sneiman commented Apr 9, 2020

@williamFalcon mrshenli's comments in pytorch do raise a question for me - he points out that something similar could happen when the model is passed as an arg to a ddp process. I think ptl is probably okay because the model.cuda(gpu_idx) must effectively make a deep copy - but just food for thought as I have not been able to confirm exactly what model.cuda(gpu_idx) does.

Also, he inadvertently partially demonstrates something I have been meaning to try for bringing a model back to the spawning process from ddp - that is, using the special way in which pytorch handles tensors/models on queues. I suspect that if we used a queue() to pass the model back to the process on gpus[0], the model's parameters might be automatically resolved back to the cpu - and thus the trained model would be available without any special effort. I will try to get to this in the next week ...

@LukasHedegaard

Not sure if I'm late to the game, but this might be a hint to the origin of the problem.
Using pytorch-lightning 0.8.4 on Ubuntu 18.04.4 LTS, I get the error when specifying specific gpus (e.g. --gpus 0,1,2,3), while I do not get the error if I specify the number of gpus to use (e.g. --gpus 4).

@Borda
Member

Borda commented Aug 26, 2020

@LukasHedegaard pls update to 0.9 :]

@LukasHedegaard

Unfortunately, I get the same results after updating to pytorch-lightning 0.9

@duaisha

duaisha commented Dec 8, 2020

thanks!

@williamFalcon @sneiman The copy.deepcopy solution works, but it uses a lot of extra memory, as also noted in the documentation: https://docs.python.org/3/library/copy.html
Is there a better solution to this problem?

@bryanmooremd

For anyone still struggling with this, the issue was fixed for me by switching strategy from ddp_spawn to DDPStrategy():
https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html

I was only seeing the issue when including a validation step in trainer.fit(). DDPStrategy() resolved the issue.

@charlesxu90

charlesxu90 commented Jul 10, 2022

For anyone still struggling with this, the issue was fixed for me by switching strategy from ddp_spawn to DDPStrategy(): https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html

I was only seeing the issue when including a validation step in trainer.fit(). DDPStrategy() resolved the issue.

Changing the strategy worked for me.

Here's the code from the docs for anyone struggling with this like me.

# Training with the DistributedDataParallel strategy on 4 GPUs
trainer = Trainer(strategy="ddp", accelerator="gpu", devices=4)
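
And the equivalent with an explicit strategy object, as a sketch against the newer Trainer API (find_unused_parameters is shown only as an example of the extra control the object form gives; adjust it to your model):

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Passing a DDPStrategy instance instead of the "ddp" string lets you tweak the DDP kwargs.
trainer = Trainer(
    strategy=DDPStrategy(find_unused_parameters=False),
    accelerator="gpu",
    devices=4,
)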
