
ValueError: bad value(s) in fds_to_keep, when attempting DDP #538

Closed
sdac opened this issue Nov 21, 2019 · 34 comments
Labels
bug Something isn't working

Comments

@sdac

sdac commented Nov 21, 2019

I can't get DDP working; I always hit the following error:

Traceback (most recent call last):
  File "train.py", line 86, in <module>
    main(config)
  File "train.py", line 41, in main
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 59, in _launch
    cmd, self._fds)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/util.py", line 417, in spawnv_passfds
    False, False, None)
ValueError: bad value(s) in fds_to_keep

What I have tried that didn't work:

  • Python 3.7
  • Python 3.6
  • pytorch 1.1.0
  • pytorch 1.2.0
  • Downgrading "scikit-learn", which had reportedly helped in unrelated projects according to Google results
  • Lightning 0.5.3.2
  • Lightning master version
  • Lightning 0.5.2.1
  • CUDA 9.0
  • CUDA 9.2
  • CUDA 10.0
  • Removing visdom from the project

The error occurs on both servers I tried: one with 4 Titan X cards and one with 8 Tesla V100s, both running Ubuntu 18.04.3 LTS.

I suspect that something in my model is triggering it and would appreciate ideas, though I cannot share the source code. The model works in dp and single-GPU mode.

@jeffling
Contributor

I suspect you're right; it probably is your model, and mp.spawn is most likely the culprit. One thing you can try is changing how the worker processes are started via set_start_method (sketched below): https://docs.python.org/3.5/library/multiprocessing.html#contexts-and-start-methods
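
A minimal sketch of that approach (whether it helps depends on how your Lightning version invokes mp.spawn; the main() body is a placeholder for your existing training script):

import multiprocessing as mp

def main():
    ...  # build the model, create the Trainer and call trainer.fit(model) here

if __name__ == "__main__":
    # Must be called before any processes, pools or CUDA contexts are created.
    # 'fork', 'spawn' and 'forkserver' are the start methods described in the linked docs.
    mp.set_start_method("forkserver", force=True)
    main()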

If that doesn't work, I have a change that lets you bypass mp.spawn and use torch.distributed.launch for DDP, which I'd like to push upstream once I have enough time. It works by running a separate process for each GPU, which sidesteps some limitations of the Python interpreter.

@williamFalcon
Contributor

@jeffling should we add a hook for the mp.spawn call so ppl can do whatever they want?

@jeffling
Contributor

jeffling commented Nov 26, 2019

@williamFalcon Absolutely, I think that would be the right thing to do.

This is the relevant part of the patched fit function version that we are running, in case it helps anyone else reading this.

        elif self.use_ddp:
            if self.is_slurm_managing_tasks:
                task = int(os.environ["SLURM_LOCALID"])
                self.ddp_train(task, model)
            else:
                # This is the custom part - we disabled mp.spawn.
                assert "LOCAL_RANK" in os.environ, (
                    "LOCAL_RANK not found in os.environ. "
                    "Make sure you're running with the torch.distributed.launch utility."
                )
                rank = int(os.environ["LOCAL_RANK"])
                if rank > 0:
                    self.show_progress_bar = False
                self.ddp_train(rank, model)

We could have a callback that allows people to override everything under use_ddp.

I also think that using a process per GPU has some real benefits, and it would be good for the framework to support it out of the box, so we can allow for callback override and also support this use case in the default callback :)

@jeffling
Contributor

Just in case people want to use the snippet before we support this officially:

You need to set Trainer.gpus to your world_size, and Trainer.distributed_backend to ddp.

In your module, you need the following overrides as well:

    def configure_ddp(self, model, device_ids):
        """
        Configure to use a single GPU set on local rank.

        Must return model.
        :param model:
        :param device_ids:
        :return: DDP wrapped model
        """
        device_id = f"cuda:{os.environ['LOCAL_RANK']}"

        model = LightningDistributedDataParallel(
            model,
            device_ids=[device_id],
            output_device=device_id,
            find_unused_parameters=True,
        )

        return model

    def init_ddp_connection(self, proc_rank, world_size):
        """
        Connect all procs in the world using the env:// init
        Use the first node as the root address
        """

        import torch.distributed as dist

        dist.init_process_group("nccl", init_method="env://")

        # Explicitly setting seed to make sure that models created in two processes
        # start from same random weights and biases.
        # TODO(jeffling): I'm pretty sure we need to set other seeds as well?
        print(f"Setting torch manual seed to {FIXED_SEED} for DDP.")
        torch.manual_seed(FIXED_SEED)
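
As a usage sketch to go with these overrides (MyLightningModule and hparams are placeholders, not from this thread): torch.distributed.launch starts one process per GPU and sets LOCAL_RANK for each of them, e.g. python -m torch.distributed.launch --nproc_per_node=4 train.py, and the entry point stays a plain Trainer call:

from pytorch_lightning import Trainer

if __name__ == "__main__":
    model = MyLightningModule(hparams)  # placeholder module carrying the overrides above
    # gpus must equal the world size, and the backend stays "ddp" (see the note above)
    trainer = Trainer(gpus=4, distributed_backend="ddp")
    trainer.fit(model)  # with the patched fit, each process calls ddp_train(LOCAL_RANK, model)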

@armancohan

@jeffling any tips on bypassing mp.spawn and using PyTorch's torch.distributed.launch?

@jeffling
Contributor

jeffling commented Dec 2, 2019

@armancohan You can use the code snippets and instructions in my two comments above, and then just use the launch utility.

@armancohan

Thanks @jeffling, this was very helpful.

@kwanUm

kwanUm commented Dec 24, 2019

Just to understand, @jeffling - currently PyTorch Lightning does NOT work with 'ddp' out of the box, because the framework doesn't propagate the seed value set in the initial process to the spawned processes (so each process in the distributed training may end up with a different seed).
The only way to overcome this at the moment is to touch the library code directly and set the seed value manually.
Is that correct? I'm asking because I'm seeing very different results when training my GAN network with ddp versus on a single GPU, and this might be the reason.

@jeffling
Contributor

@kwanUm You can set the seed value inside your lightning module as well.

There are a few things that Lightning can't or doesn't currently do, and a lot of these are dependent on your model and data (a sketch of the first two follows this list):

  • make sure seed values are correct (this is something Lightning could help with but doesn't yet)
  • distributed sampler
  • depending on your model, sync batch norm
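
A minimal sketch of the first two items (not from Lightning itself; SEED and self._train_dataset are placeholders, and the sampler assumes the process group is already initialised when the dataloader is built):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

SEED = 1234  # placeholder: any fixed value shared by every DDP process

def seed_everything(seed=SEED):
    # Seed all the common RNGs so every process starts from identical weights.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# inside your LightningModule:
def train_dataloader(self):
    # DistributedSampler shards the dataset across ranks instead of duplicating it.
    sampler = DistributedSampler(self._train_dataset)
    return DataLoader(self._train_dataset, batch_size=32, sampler=sampler)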

@astariul
Contributor

I have the same issue; I can't make ddp work.

The solution proposed by @jeffling has no effect on my machine.

Any ideas would be highly appreciated.

@kyoungrok0517

kyoungrok0517 commented Mar 23, 2020

Same here. I can't even work out the cause. It would be much appreciated if someone could explain what the error means.

@Borda added the "bug" (Something isn't working) and "information needed" labels on Mar 26, 2020
@Borda
Member

Borda commented Mar 26, 2020

@colanim @kyoungrok0517 could you give details about your machines and environments?

@kyoungrok0517

@Borda Here's my environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V

Nvidia driver version: 440.64
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4

Versions of relevant libraries:
[pip] numpy==1.17.2
[pip] numpydoc==0.9.1
[pip] pytorch-lightning==0.7.1
[pip] torch==1.4.0
[pip] torchtext==0.5.0
[pip] torchvision==0.5.0
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      243
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.14           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] pytorch-lightning         0.7.1                    pypi_0    pypi
[conda] torchtext                 0.5.0                    pypi_0    pypi
[conda] torchvision               0.5.0                py37_cu101    pytorch

@Borda
Member

Borda commented Mar 26, 2020

@jeffling could you check?

@kyoungrok0517

I'm experiencing the same problem in any multi-gpu environment, not only with two Titan Vs.

@jeffling
Contributor

jeffling commented Mar 29, 2020

@kyoungrok0517 @colanim Just as a sanity check, do you have the same issues with running any model with lightning? Also, exactly which arguments are you passing in?

@kyoungrok0517

No. The model was running well on multi-GPU at an early stage, but at some point the problem appeared. Since the model structure stayed the same, I suspect the error has something to do with the data loading step, but I can't be certain. I attach some code snippets that might cause the problem:

# data loading (pandas, parquet)
class News20Dataset(Dataset):
    def __init__(self, data_path, tokenizer):
        super().__init__()
        self.data = pd.read_parquet(data_path)
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        row = self.data.iloc[index]
        text_ids = self.tokenizer.encode(row.text)
        return {"text": text_ids, "target": row.label}
# data loading (numpy, zarr)
class TripleDataset(Dataset):
    def __init__(self, data_path, tokenizer):
        super().__init__()
        self.data = zarr.open(data_path, "r")
        self.tokenizer = tokenizer
# vector loading (torchtext.vocab)
def _get_bow_vocab(self):
    VOCAB_PATH = Path(root_dir) / "../../vocab/vocab.json"
    VECTORS = "fasttext.en.300d"
    MIN_FREQ = 10
    MAX_SIZE = 100000
    with open(VOCAB_PATH, "r", encoding="utf-8") as f:
        vocab_counts = Counter(json.load(f))
    
    return Vocab(
        vocab_counts,
        vectors=VECTORS,
        min_freq=MIN_FREQ,
        max_size=MAX_SIZE
    )

@kyoungrok0517

Here are my dataloaders:

def _get_dataloader(self, dataset, test=False):
    batch_size = self.hparams.batch_size if not test else 10000
    num_workers = int(cpu_count() / 4) or 1
    return DataLoader(
        dataset, batch_size=batch_size, num_workers=num_workers, pin_memory=True,
    )

def train_dataloader(self):
    return self._get_dataloader(self._train_dataset)

def val_dataloader(self):
    return self._get_dataloader(self._val_dataset)

def test_dataloader(self):
    return self._get_dataloader(self._test_dataset, test=True)

@kyoungrok0517

@kyoungrok0517 @colanim Just as a sanity check, do you have the same issues with running any model with lightning? Also, exactly which arguments are you passing in?

This is my Trainer

    if hparams.profile:
        profiler = AdvancedProfiler()
    else:
        profiler = None

    # train
    # trainer = Trainer.from_argparse_args(hparams)
    trainer = Trainer(
        logger=tt_logger,
        default_save_path=root_dir,
        max_nb_epochs=hparams.max_nb_epochs,
        gpus=hparams.gpus,
        distributed_backend=hparams.distributed_backend,
        fast_dev_run=hparams.fast_dev_run,
        amp_level=hparams.amp_level,
        precision=hparams.precision,
        early_stop_callback=early_stop_callback,
        benchmark=True,
        profiler=profiler,
    )
    trainer.fit(model)

@sneiman
Contributor

sneiman commented Mar 31, 2020

FWIW: Have you recently updated Ubuntu? I just started experiencing this in the last hour - and I am using a local fork that has not changed in a few weeks - so it doesn't seem likely that it's lightning. Will add more if I learn more.

Update: I do not believe this is pytorch-lightning. I have recently created models that are virtually identical and do NOT show this problem. It is not clear what is causing it ... almost certainly file related, as the specific error is a multiprocessing/POSIX complaint about file descriptors that do not have appropriate values.

@kyoungrok0517
Copy link

kyoungrok0517 commented Mar 31, 2020 via email

@sneiman
Contributor

sneiman commented Apr 1, 2020

I have resolved this in my environment. Writing it here in case it helps. During construction of the model, and before creating the trainer etc., I store the .data property of a Parameter:

self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)
self.class_p_t      = self.class_p.data

I do NOT actually do anything with self.class_p_t - I assigned it in anticipation of logging these values to monitor the progress of self-learned curriculum-learning class weights. I speculate there is some issue relating to autograd - but that's just a guess. It generated ~100 file descriptors, which is what appeared to choke the fds_to_keep check in the multiprocessing module. Eliminating this assignment ended the problem - finally!

For the record: Ubuntu 19.10, Python 3.7.5, PyTorch 1.4, venv.

@kyoungrok0517

kyoungrok0517 commented Apr 2, 2020 via email

@sneiman
Contributor

sneiman commented Apr 2, 2020

Sorry - I do not know if it's a bug in pytorch or not. I shared my results in case it helps.

I am not sure under what circumstances the .data field is supposed to be accessed - if ever. I don't even know if it's just revealing a bug in the latest Ubuntu release. I just know that removing that line made the problem go away. As I tried to say, I was merely speculating that this was due to some behind-the-scenes memory-sharing approach in python/pytorch multiprocessing - for which memory files could be a useful mechanism for sharing large tensors. Again, this is ONLY speculation.

I did post a question on pytorch github.

@sneiman
Contributor

sneiman commented Apr 8, 2020

The people at pytorch did look at a fragment that reproduces this problem - and they had some insight into what is happening and why. I need to look at some lightning internals to make sure their diagnosis applies to what we are seeing.

@sneiman
Contributor

sneiman commented Apr 9, 2020

I have verified what causes the problem in my model and what fixes it. The problem was that I naively assigned a parameter to another variable. This new reference to the parameter does not get moved to the correct GPU when pytorch-lightning copies the model with model.cuda(gpu_idx) in ddp_train(). When ddp is used, the reference points into another process's space, and so it creates the multiprocessing fault noted at the head of this issue.

This is NOT a ptl bug. This is the result of the naive assignment of the parameter to another variable:

        # nn.Parameter() ensures pytorch knows about this - and will move it new gpu when required
        self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)

        # causes a crash if using ddp: self.class_p_t refers to the original process space, not the one it has been moved to by ddp
        self.class_p_t      = self.class_p.data

self.class_p is known to pytorch as a parameter, and so is copied to the correct GPU. But the reference to it in self.class_p_t is not known to pytorch as a parameter, so this reference is not updated when the model is copied. The simple fix is to make a deep copy instead of the naive assignment. self.class_p_t is still not moved to the GPU, but it is now within the process space of each ddp replica:

        # this now works
        self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)
        self.class_p_t      = copy.deepcopy(self.class_p.data)

Hope this helps ...

@williamFalcon
Contributor

thanks!

@sneiman
Contributor

sneiman commented Apr 9, 2020

@williamFalcon mrshenli's comments in pytorch do raise a question for me - he points out that something similar could happen when the model is passed as an arg to a ddp process. I think ptl is probably okay because the model.cuda(gpu_idx) must effectively make a deep copy - but just food for thought as I have not been able to confirm exactly what model.cuda(gpu_idx) does.

Also, he inadvertently partially demonstrates something I have been meaning to try for bringing a model back to the spawning process from ddp - that is, using the special way in which pytorch handles tensors/models on queues. I suspect that if we used a queue() to pass the model back to the process on gpus[0], the model's parameters might be automatically resolved back to the cpu - and thus the trained model would be available without any special effort. I will try to get to this in the next week ...

@LukasHedegaard

Not sure if I'm late to the game, but this might be a hint to the origin of the problem.
Using pytorch-lightning 0.8.4 on Ubuntu 18.04.4 LTS, I get the error when specifying specific gpus (e.g. --gpus 0,1,2,3), while I do not get the error if I specify the number of gpus to use (e.g. --gpus 4).

@Borda
Member

Borda commented Aug 26, 2020

@LukasHedegaard pls update to 0.9 :]

@LukasHedegaard

Unfortunately, I get the same results after updating to pytorch-lightning 0.9

@duaisha

duaisha commented Dec 8, 2020

thanks!

@williamFalcon @sneiman The copy.deepcopy solution works, but it uses a lot of extra memory, as also noted in the documentation: https://docs.python.org/3/library/copy.html
Is there a better solution to this problem?

@bryanmooremd

For anyone still struggling with this, the issue was fixed for me by switching strategy from ddp_spawn to DDPStrategy():
https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html

I was only seeing the issue when including a validation step in trainer.fit(). DDPStrategy() resolved the issue.

@charlesxu90

charlesxu90 commented Jul 10, 2022

For anyone still struggling with this, the issue was fixed for me by switching strategy from ddp_spawn to DDPStrategy(): https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html

I was only seeing the issue when including a validation step in trainer.fit(). DDPStrategy() resolved the issue.

Changing the strategy worked for me.

Here's the code from the docs for anyone struggling with this like me.

# Training with the DistributedDataParallel strategy on 4 GPUs
trainer = Trainer(strategy="ddp", accelerator="gpu", devices=4)
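
And the equivalent with an explicit strategy object, as a sketch against the newer Trainer API (find_unused_parameters is shown only as an example of the extra control the object form gives; adjust it to your model):

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Passing a DDPStrategy instance instead of the "ddp" string lets you tweak the DDP kwargs.
trainer = Trainer(
    strategy=DDPStrategy(find_unused_parameters=False),
    accelerator="gpu",
    devices=4,
)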
