
Multi GPU training (ddp) gets very slow when using list of tensors in Dataset #1925

Closed
mpaepper opened this issue May 22, 2020 · 18 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@mpaepper

🐛 Bug

We are migrating to PyTorch Lightning from a custom implementation using Torchbearer before.

Our dataset keeps a list of PyTorch tensors in memory, because the tensors all have different dimensions. After the migration, training in the multi-GPU setup slows down very significantly (it takes twice as long as before!).

To Reproduce

I built a repository which has just random data and a straightforward architecture to reproduce this both with (minimal.py) and without PyTorch Lightning (custom.py).

The repository and further details are located here: https://github.com/mpaepper/pytorch_lightning_multi_gpu_speed_analysis

When training with PyTorch Lightning for only 10 epochs, it takes 105 seconds when using a big PyTorch tensor (without a list), but it increases to 310 seconds (3x slower) when using a list of tensors.
The data size and model are exactly the same; the only difference is how the data is stored.
When using my custom implementation, no such effect is observed (97-98 seconds, with or without lists).

To run the PyTorch Lightning version use:

python minimal.py --gpus 4  # Baseline
python minimal.py --gpus 4 --use_list  # Extremely slow

One important thing to note: when using the list approach, it seems that every tensor in that list is backed by a separate shared-memory file handle, so you might need to increase your file limit: ulimit -n 99999.
It seems that this is the issue: the DataLoader gets very slow because it needs to read so many files.
Is there a way around this?
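
For reference, here is a rough sketch of two things one might try; neither is taken from the linked repo, and both are only assumptions about what helps: raising the open-file limit from Python instead of the shell, and switching torch.multiprocessing away from the default 'file_descriptor' sharing strategy, which keeps one descriptor open per shared tensor.

import resource
import torch.multiprocessing as mp

# Raise this process's soft open-file limit up to the hard limit
# (equivalent in spirit to `ulimit -n` in the shell).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Alternatively, use shared-memory files looked up by name instead of
# keeping one file descriptor open per shared tensor.
mp.set_sharing_strategy("file_system")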

Code sample

See https://github.com/mpaepper/pytorch_lightning_multi_gpu_speed_analysis

Expected behavior

I would expect the same dataset stored as a list of tensors to also train quickly.

Environment

  • CUDA:
    • GPU:
      • GeForce RTX 2080 Ti
      • GeForce RTX 2080 Ti
      • GeForce RTX 2080 Ti
      • GeForce RTX 2080 Ti
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.16.4
    • pyTorch_debug: False
    • pyTorch_version: 1.4.0
    • pytorch-lightning: 0.7.7-dev
    • tensorboard: 1.14.0
    • tqdm: 4.46.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.3
    • version: #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020
@mpaepper added the help wanted (Open to be worked on) label on May 22, 2020
@awaelchli
Member

Did you try to compare the dataloaders from each version? Do they have the same settings?
I didn't try running it yet, but it seems it is not a problem with data transfer.

@mpaepper
Author

mpaepper commented May 24, 2020

Yes, the dataloaders are the same. They are plain PyTorch dataloaders and use the same number of workers etc.

The issue seems to be that the distributed multiprocessing opens a file descriptor for each tensor in the list, and then accessing all those ~20,000 file descriptors degrades the performance.

This is also the reason why you need to increase ulimit -n when running the list version.

You can see that in the list version, starting each epoch takes a long time, so it is the data loading that takes much longer.
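
For what it's worth, this can be checked on Linux by counting the process's open file descriptors before and after the tensors are moved to shared memory; the snippet below is a standalone sketch (not from the linked repo) and assumes the default 'file_descriptor' sharing strategy:

import os
import torch

def open_fds():
    # every entry in /proc/self/fd is one open file descriptor (Linux only)
    return len(os.listdir("/proc/self/fd"))

tensors = [torch.rand(3, 224, 224) for _ in range(1000)]
print("fds before sharing:", open_fds())
for t in tensors:
    t.share_memory_()  # roughly what the spawn machinery does when handing tensors to child processes
print("fds after sharing:", open_fds())  # grows by roughly one descriptor per tensor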

@awaelchli added the bug (Something isn't working) label on May 24, 2020
@williamFalcon
Contributor

williamFalcon commented May 24, 2020

Thanks for bringing this up.
We also have parity tests to make sure we match pytorch accuracy and speed.

https://github.com/PyTorchLightning/pytorch-lightning/tree/619f984c362a2e5fd8f04ae24ea0eb8ce9d9e57a/benchmarks

We'll take a look at that example!

Just at a glance: you have not disabled the logger in Lightning, but you are not using a logger in the PT version. Take a look at our parity tests in the meantime.

@mpaepper
Author

Thanks for taking a look.

You are right that I did not disable the logging, and PyTorch Lightning is also saving a checkpoint, which my other code does not.

The important thing for me, however, is not that my custom implementation is marginally faster, but rather that the PyTorch Lightning version gets very slow when I change nothing except storing the dataset as a list of tensors rather than as one big tensor.

The biggest notable difference is that in my custom implementation I start 4 separate processes (one per GPU) manually, while PyTorch Lightning uses PyTorch multiprocessing spawn here.

This in turn uses shared memory for the tensors, which in the case of a list of tensors creates many shared-memory file descriptors (one for each tensor); these seem to be the reason for the big slowdown.

Thanks for helping me out!

I thus suspect it could be general fallout of PyTorch multiprocessing, unrelated to Lightning, in which case I can also create an issue over there.
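
For context, the spawn-based launch described above corresponds roughly to the generic pattern below; this is a sketch of the PyTorch API, not Lightning's actual internals, and the address/port are placeholders:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, world_size):
    # each child process joins the same process group; tensors handed to the
    # children go through torch.multiprocessing's shared-memory reductions
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:8899",
                            rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # ... build the model and dataloader, wrap in DistributedDataParallel, train ...

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)  # one process per GPU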

@williamFalcon
Contributor

So, you start 1 process per GPU individually and then run the script in each?
We also do that when using SLURM.

What are you using to train?

Yeah, probably better to ask at PyTorch about this then.

@mpaepper
Author

I am using ddp, but instead of starting all the processes with spawn, I start them manually:

python custom.py --world_size 4 --rank 0
python custom.py --world_size 4 --rank 1
python custom.py --world_size 4 --rank 2
python custom.py --world_size 4 --rank 3
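
A per-rank launch like that typically boils down to initialization code along these lines inside custom.py; this is only a sketch, with the argument names taken from the commands above and a placeholder address/port:

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--world_size", type=int, required=True)
parser.add_argument("--rank", type=int, required=True)
args = parser.parse_args()

# one externally started process per GPU; the rank doubles as the CUDA device index here
dist.init_process_group("nccl", init_method="tcp://127.0.0.1:8899",
                        rank=args.rank, world_size=args.world_size)
torch.cuda.set_device(args.rank)
# ... wrap the model in torch.nn.parallel.DistributedDataParallel(device_ids=[args.rank]) and train ...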

@williamFalcon
Contributor

I observe the same when using multiple workers.

But this is a common problem with PyTorch when you use ddp + dataloader workers. Try your script again?

@mpaepper
Author

mpaepper commented May 25, 2020

Hi William,

in all scenarios, I am using ddp + multiple dataloader workers.

The only change is in the Dataset:

import torch
from torch.utils.data import Dataset

class FakeDataset(Dataset):
    def __init__(self, use_lists=False):
        self.length = 20480
        if use_lists:
            # slow path: a Python list holding 20480 individual tensors
            self.list = []
            for i in range(self.length):
                self.list.append(torch.rand((3, 224, 224)))
        else:
            # fast path: one contiguous (20480, 3, 224, 224) tensor
            self.list = torch.rand((self.length, 3, 224, 224))

With self.list = torch.rand((self.length, 3, 224, 224)) 10 epochs take around 100 seconds.
With self.list.append(torch.rand((3, 224, 224))) 10 epochs take 3 times as long (>300 seconds).
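
For completeness, a DataLoader would also need the length and indexing hooks that the snippet above omits; a minimal sketch (the actual repo may differ):

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # indexing returns a (3, 224, 224) tensor in both the list and the big-tensor case
        return self.list[index]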

@williamFalcon
Contributor

williamFalcon commented May 25, 2020

I see, I need to run your script.
What happens with num_workers = 0 for both dataloaders?

But yeah, I think the problem is that spawn slows everything down.

However, you can still use Lightning the way you are doing it by running each script individually.
Just set these environment variables:

MASTER_ADDR
MASTER_PORT
LOCAL_RANK
NODE_RANK

Is this on SLURM? GCP? AWS?

@mpaepper
Author

This is on custom hardware, an Ubuntu server with 4 x GeForce RTX 2080 Ti.

I tried it with num_workers = 0 and 2 GPUs.

Results:

  • PyTorch Lightning (minimal.py) for 10 epochs without lists: 218 seconds
  • PyTorch Lightning (minimal.py) for 10 epochs with lists: 222 seconds -> so in this case it's rather similar
  • custom.py for 10 epochs without lists: 181 seconds
  • custom.py for 10 epochs with lists: 181 seconds

So in this case, the difference between lists and no lists is non-existent. However, in real life my dataloaders do a lot more work, and having no workers slows things down considerably.

So by using these environment variables, I can start 4 independent processes, like in my custom.py implementation?

@williamFalcon
Contributor

williamFalcon commented May 25, 2020

Yeah, that's what I thought. The lists aren't the problem. The problem seems to be that PyTorch spawn + num_workers does not play well...

Set those vars and start the processes yourself.

I ask because on SLURM clusters, jobs submitted with Lightning are set up properly: each script runs in its own process, which is allocated by the SLURM scheduler.

Resolution for this issue is to add a note about .spawn + num_workers.

(PS: PL adds about 600-1000 ms per epoch of overhead)

@mpaepper
Author

mpaepper commented May 26, 2020

Can you give me some pointers regarding the environment variables? I couldn't find much in the docs.

I used this: MASTER_ADDR=127.0.0.1 MASTER_PORT=8899 LOCAL_RANK=0 NODE_RANK=0 WORLD_SIZE=2 python minimal.py

But it starts training without waiting for me to start a second process, and as far as I understand, training on multiple GPUs with one process will be much slower than having one process per GPU, right?

I would expect that passing WORLD_SIZE=2 (while only passing a single GPU) would indicate that I will start another process which trains on the second GPU?

@williamFalcon
Contributor

williamFalcon commented May 26, 2020

Local rank is the process index (1 per GPU).
Make sure you set distributed_backend=ddp.

In your case (1 machine, 4 GPUs):
you have local ranks 0, 1, 2, 3 and node rank 0.

world size = 4

@williamFalcon
Contributor

williamFalcon commented May 26, 2020

OK, I would start the scripts like this:

MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 python my_trainer.py --gpus '0,1,2,3' --distributed_backend 'ddp'
MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=1 python my_trainer.py --gpus '0,1,2,3' --distributed_backend 'ddp'
MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=2 python my_trainer.py --gpus '0,1,2,3' --distributed_backend 'ddp'
MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=3 python my_trainer.py --gpus '0,1,2,3' --distributed_backend 'ddp'

@mpaepper
Author

Thanks, that's what I thought regarding the ranks and world size (it's how I use them in my custom setup without Lightning), but unfortunately it doesn't work like this in the Lightning context.

When I enter the first command like this (note that I don't need to specify distributed_backend, as in my script it's already set to ddp):
MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 NODE_RANK=0 python minimal.py --gpus '0,1,2,3'

it already starts training with 4 GPUs and outputs:
WARNING:lightning:WORLD_SIZE environment variable is not equal to the computed world size. Ignored.

So I cannot even start the next command (with LOCAL_RANK=1) as the first one is not waiting for another process.

The same happens if I limit the GPUs:
MASTER_PORT=15000 GROUP_RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 NODE_RANK=0 python minimal.py --gpus '0' -> then it just starts training on a single GPU.

Any further ideas on this?

@williamFalcon
Contributor

Fixed on master.
@mpaepper try running the PL script again without all the flag stuff.

python minimal.py --gpus 4 --use_list  # Extremely slow

Will re-open if still a problem

@mpaepper
Author

mpaepper commented Jun 3, 2020

This is working well now. The only thing is that it's not using shared memory anymore, so the memory usage (not GPU memory, but system memory) is 4 times higher (the data is duplicated in each process).
However, it was like this before in my custom implementation as well, so I think this is expected.
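
As an aside, a common pattern for avoiding a full copy of the data in every process is to back the dataset with a memory-mapped file, so all ranks share the same pages through the OS page cache. This is a generic sketch using numpy, not something from the repo above; data.npy is a placeholder that would be created once beforehand, e.g. with np.save.

import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    def __init__(self, path="data.npy"):
        # read-only memory map: each process maps the same file, and the OS
        # keeps only one physical copy of the data in RAM
        self.data = np.load(path, mmap_mode="r")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # copy() pulls the single item out of the memmap before converting to a tensor
        return torch.from_numpy(self.data[index].copy())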

Thank you for taking the time to fix this!

@PetrochukM

Relevant PR: #2029
