
Multi GPU training (ddp) gets very slow when using list of tensors in Dataset #1925

Closed
mpaepper opened this issue May 22, 2020 · 18 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@mpaepper

🐛 Bug

We are migrating to PyTorch Lightning from a custom implementation using Torchbearer before.

Our dataset keeps a list of PyTorch tensors in memory, because the tensors all have different dimensions. After the migration, training in the multi-GPU setup slows down very significantly (it takes twice as long as before!).

To Reproduce

I built a repository which has just random data and a straightforward architecture to reproduce this both with (minimal.py) and without PyTorch Lightning (custom.py).

The repository and further details are located here: https://github.com/mpaepper/pytorch_lightning_multi_gpu_speed_analysis

When training with PyTorch Lightning for only 10 epochs, it takes 105 seconds when using a big PyTorch tensor (without a list), but it increases to 310 seconds (3x slower) when using a list of tensors.
The data size and model are exactly the same; the only difference is how the data is stored.
When using my custom implementation, no such effect is observed (97-98 seconds, with or without lists).

To run the PyTorch Lightning version use:

python minimal.py --gpus 4  # Baseline
python minimal.py --gpus 4 --use_list  # Extremely slow

One important thing to note: when using the list approach, it seems that every tensor in that list is backed by a separate shared-memory file handle, so you might need to increase your file limit: ulimit -n 99999.
It seems that this is the issue: the DataLoader gets very slow because it needs to read so many files.
Is there a way around this?
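
For reference, here is a rough sketch of two things one might try; neither is taken from the linked repo, and both are only assumptions about what helps: raising the open-file limit from Python instead of the shell, and switching torch.multiprocessing away from the default 'file_descriptor' sharing strategy, which keeps one descriptor open per shared tensor.

import resource
import torch.multiprocessing as mp

# Raise this process's soft open-file limit up to the hard limit
# (equivalent in spirit to `ulimit -n` in the shell).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Alternatively, use shared-memory files looked up by name instead of
# keeping one file descriptor open per shared tensor.
mp.set_sharing_strategy("file_system")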

Code sample

See https://github.com/mpaepper/pytorch_lightning_multi_gpu_speed_analysis

Expected behavior

I would expect the same dataset stored as a list of tensors to also train quickly.

Environment

  • CUDA:
    • GPU:
      • GeForce RTX 2080 Ti
      • GeForce RTX 2080 Ti
      • GeForce RTX 2080 Ti
      • GeForce RTX 2080 Ti
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.16.4
    • pyTorch_debug: False
    • pyTorch_version: 1.4.0
    • pytorch-lightning: 0.7.7-dev
    • tensorboard: 1.14.0
    • tqdm: 4.46.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.3
    • version: #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020
@mpaepper added the help wanted (Open to be worked on) label on May 22, 2020
@awaelchli
Member

Did you try to compare the dataloaders from each version? Do they have the same settings?
I didn't try running it yet, but it seems it is not a problem with data transfer.

@mpaepper
Author

mpaepper commented May 24, 2020

Yes, the dataloaders are the same. They are plain PyTorch dataloaders and use the same number of workers etc.

The issue seems to be that the distributed multiprocessing opens a file descriptor for each tensor in the list, and then accessing all those ~20,000 file descriptors degrades the performance.

This is also the reason why you need to increase ulimit -n when running the list version.

You can see that in the list version, starting each epoch takes a long time, so it is the data loading that takes much longer.
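
For what it's worth, this can be checked on Linux by counting the process's open file descriptors before and after the tensors are moved to shared memory; the snippet below is a standalone sketch (not from the linked repo) and assumes the default 'file_descriptor' sharing strategy:

import os
import torch

def open_fds():
    # every entry in /proc/self/fd is one open file descriptor (Linux only)
    return len(os.listdir("/proc/self/fd"))

tensors = [torch.rand(3, 224, 224) for _ in range(1000)]
print("fds before sharing:", open_fds())
for t in tensors:
    t.share_memory_()  # roughly what the spawn machinery does when handing tensors to child processes
print("fds after sharing:", open_fds())  # grows by roughly one descriptor per tensor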

@awaelchli added the bug (Something isn't working) label on May 24, 2020
@williamFalcon
Contributor

williamFalcon commented May 24, 2020

Thanks for bringing this up.
We also have parity tests to make sure we match pytorch accuracy and speed.

https://github.com/PyTorchLightning/pytorch-lightning/tree/619f984c362a2e5fd8f04ae24ea0eb8ce9d9e57a/benchmarks

We'll take a look at that example!

Just at a glance: you have not disabled the logger in Lightning, but you are not using a logger in the PT version. Take a look at our parity tests in the meantime.

@mpaepper
Author

Thanks for taking a look.

You are right that I did not disable the logging, and PyTorch Lightning is also saving a checkpoint, which my other code does not.

The important thing for me, however, is not that my custom implementation is marginally faster, but rather that the PyTorch Lightning version gets very slow when I change nothing except storing the dataset as a list of tensors rather than as one big tensor.

The biggest notable difference is that in my custom implementation I start 4 separate processes (one per GPU) manually, while PyTorch Lightning uses PyTorch multiprocessing spawn here.

This in turn uses shared memory for the tensors, which in the case of a list of tensors creates many shared-memory file descriptors (one for each tensor); these seem to be the reason for the big slowdown.

Thanks for helping me out!

I thus suspect it could be general fallout of PyTorch multiprocessing, unrelated to Lightning, in which case I can also create an issue over there.
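
For context, the spawn-based launch described above corresponds roughly to the generic pattern below; this is a sketch of the PyTorch API, not Lightning's actual internals, and the address/port are placeholders:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, world_size):
    # each child process joins the same process group; tensors handed to the
    # children go through torch.multiprocessing's shared-memory reductions
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:8899",
                            rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # ... build the model and dataloader, wrap in DistributedDataParallel, train ...

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)  # one process per GPU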

@williamFalcon
Contributor

So, you start 1 process per GPU individually and then run the script in each?
We also do that when using SLURM.

What are you using to train?

Yeah, probably better to ask at PyTorch about this then.

@mpaepper
Author

I am using ddp, but instead of starting all the processes with spawn, I start them manually:

python custom.py --world_size 4 --rank 0
python custom.py --world_size 4 --rank 1
python custom.py --world_size 4 --rank 2
python custom.py --world_size 4 --rank 3
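
A per-rank launch like that typically boils down to initialization code along these lines inside custom.py; this is only a sketch, with the argument names taken from the commands above and a placeholder address/port:

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--world_size", type=int, required=True)
parser.add_argument("--rank", type=int, required=True)
args = parser.parse_args()

# one externally started process per GPU; the rank doubles as the CUDA device index here
dist.init_process_group("nccl", init_method="tcp://127.0.0.1:8899",
                        rank=args.rank, world_size=args.world_size)
torch.cuda.set_device(args.rank)
# ... wrap the model in torch.nn.parallel.DistributedDataParallel(device_ids=[args.rank]) and train ...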

@williamFalcon
Contributor

I observe the same when using multiple workers.

But this is a common problem with PyTorch when you use ddp + dataloader workers. Try your script again?

@mpaepper
Author

mpaepper commented May 25, 2020

Hi William,

in all scenarios, I am using ddp + multiple dataloader workers.

The only change is in the Dataset:

import torch
from torch.utils.data import Dataset

class FakeDataset(Dataset):
    def __init__(self, use_lists=False):
        self.length = 20480
        if use_lists:
            # slow path: a Python list holding 20480 individual tensors
            self.list = []
            for i in range(self.length):
                self.list.append(torch.rand((3, 224, 224)))
        else:
            # fast path: one contiguous (20480, 3, 224, 224) tensor
            self.list = torch.rand((self.length, 3, 224, 224))

With self.list = torch.rand((self.length, 3, 224, 224)) 10 epochs take around 100 seconds.
With self.list.append(torch.rand((3, 224, 224))) 10 epochs take 3 times as long (>300 seconds).
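
For completeness, a DataLoader would also need the length and indexing hooks that the snippet above omits; a minimal sketch (the actual repo may differ):

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # indexing returns a (3, 224, 224) tensor in both the list and the big-tensor case
        return self.list[index]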

@williamFalcon
Contributor

williamFalcon commented May 25, 2020

I see, I need to run your script.
What happens with num_workers = 0 for both dataloaders?

But yeah, I think the problem is that spawn slows everything down.

However, you can still use Lightning the way you are doing it by running each script individually.
Just set these environment variables:

MASTER_ADDR
MASTER_PORT
LOCAL_RANK
NODE_RANK

Is this on SLURM? GCP? AWS?

@mpaepper
Author

This is on custom hardware, an Ubuntu server with 4 x GeForce RTX 2080 Ti.

I tried it with num_workers = 0 and 2 GPUs.

Results:

  • PyTorch Lightning (minimal.py) for 10 epochs without lists: 218 seconds
  • PyTorch Lightning (minimal.py) for 10 epochs with lists: 222 seconds -> so in this case it's rather similar
  • custom.py for 10 epochs without lists: 181 seconds
  • custom.py for 10 epochs with lists: 181 seconds

So in this case, the difference between lists and no lists is non-existent. However, in real life my dataloaders do a lot more work, and having no workers slows things down considerably.

So by using these environment variables, I can start 4 independent processes, like in my custom.py implementation?

@williamFalcon
Contributor

williamFalcon commented May 25, 2020

Yeah, that's what I thought. The lists aren't the problem. The problem seems to be that PyTorch spawn + num_workers does not play well...

Set those vars and start the processes yourself.

I ask because on SLURM clusters, jobs submitted with Lightning are set up properly: each script runs in its own process, which is allocated by the SLURM scheduler.

Resolution for this issue is to add a note about .spawn + num_workers.

(PS: PL adds about 600-1000 ms per epoch of overhead)

@mpaepper
Author

mpaepper commented May 26, 2020

Can you give me some pointers regarding the environment variables? I couldn't find much in the docs.

I used this: MASTER_ADDR=127.0.0.1 MASTER_PORT=8899 LOCAL_RANK=0 NODE_RANK=0 WORLD_SIZE=2 python minimal.py

But it starts training without waiting for me to start a second process, and as far as I understand, training on multiple GPUs with one process will be much slower than having one process per GPU, right?

I would expect that passing WORLD_SIZE=2 (while only passing a single GPU) would indicate that I will start another process which trains on the second GPU?

@williamFalcon
Contributor

williamFalcon commented May 26, 2020

Local rank is the process index (1 per GPU).
Make sure you set distributed_backend=ddp.

In your case (1 machine, 4 GPUs):
you have local ranks 0, 1, 2, 3 and node rank 0.

world size = 4

@williamFalcon
Contributor

williamFalcon commented May 26, 2020

OK, I would start the scripts like this:

MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 python my_trainer.py --gpus '0,1,2,3' --distributed_backend 'ddp'
MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=1 python my_trainer.py --gpus '0,1,2,3' --distributed_backend 'ddp'
MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=2 python my_trainer.py --gpus '0,1,2,3' --distributed_backend 'ddp'
MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=3 python my_trainer.py --gpus '0,1,2,3' --distributed_backend 'ddp'

@mpaepper
Author

Thanks, that's what I thought regarding the ranks and world size (it's how I use them in my custom setup without Lightning), but unfortunately it doesn't work like this in the Lightning context.

When I enter the first command like this (note that I don't need to specify distributed_backend, as in my script it's already set to ddp):
MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 NODE_RANK=0 python minimal.py --gpus '0,1,2,3'

it already starts training with 4 GPUs and outputs:
WARNING:lightning:WORLD_SIZE environment variable is not equal to the computed world size. Ignored.

So I cannot even start the next command (with LOCAL_RANK=1) as the first one is not waiting for another process.

The same happens if I limit the GPUs:
MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 NODE_RANK=0 python minimal.py --gpus '0' -> then it just starts training on a single GPU.

Any further ideas on this?

@williamFalcon
Contributor

Fixed on master.
@mpaepper try running the PL script again without all the flag stuff.

python minimal.py --gpus 4 --use_list  # Extremely slow

Will re-open if still a problem

@mpaepper
Author

mpaepper commented Jun 3, 2020

This is working well now. The only thing is that it's not using shared memory anymore, so the memory usage (not GPU memory, but system memory) is 4 times higher (the data is duplicated in each process).
However, it was like this before in my custom implementation as well, so I think this is expected.
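
As an aside, a common pattern for avoiding a full copy of the data in every process is to back the dataset with a memory-mapped file, so all ranks share the same pages through the OS page cache. This is a generic sketch using numpy, not something from the repo above; data.npy is a placeholder that would be created once beforehand, e.g. with np.save.

import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    def __init__(self, path="data.npy"):
        # read-only memory map: each process maps the same file, and the OS
        # keeps only one physical copy of the data in RAM
        self.data = np.load(path, mmap_mode="r")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # copy() pulls the single item out of the memmap before converting to a tensor
        return torch.from_numpy(self.data[index].copy())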

Thank you for taking the time to fix this!

@PetrochukM

Relevant PR: #2029
