Memory usage #278
Comments
Hi,
So it could take some memory.
Regarding RAM usage, I ran a longer test of 70 epochs with dali-gpu mode; using dali-cpu the effect is much stronger. Thanks for clearing up the GPU memory usage. I had thought that workers was just a CPU option, so I had set it to 12. By reducing it to 8 I was able to keep the GPU batch size the same as with dali-cpu and torchvision. Would it make sense to have separate worker options for CPU and GPU? It seems to me that the data augmentation operators are quite lightweight (for a GPU), so they shouldn't require as many threads.
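As a rough sketch of what separate worker options could look like (the flag names here are hypothetical, not part of the DALI example script):

```python
# Hypothetical sketch only: expose separate thread counts for the CPU and GPU
# pipelines instead of a single "workers" option. Flag names are invented for
# illustration.
import argparse

parser = argparse.ArgumentParser(description='DALI ResNet50 example (sketch)')
parser.add_argument('--cpu-workers', type=int, default=12,
                    help='threads for the dali-cpu pipeline')
parser.add_argument('--gpu-workers', type=int, default=4,
                    help='threads for the dali-gpu pipeline (per-op work is lighter)')
args = parser.parse_args()

# The chosen value would then be passed as num_threads when building the
# pipeline, e.g. num_threads=args.cpu_workers when running in dali-cpu mode.
```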
Hi,
@yaysummeriscoming - could you provide the exact command line you used to run this PyTorch example?
The options I passed were: /mnt/disks/ssd/ImageNet100 --print-freq 100. I'm using the first 100 classes of ImageNet - I've modified resnet to suit.
I started fixing things reported by Valgrind - you can check #308.
Great, I'm running the script here. The only change I've made is on line 197, to change the model to have 100 classes (a sketch of that change is shown below). I just retested it; the leak is present with dali-gpu but is worse in dali-cpu mode. I'm running the first 100 ImageNet classes off a Google Cloud VM together with a local SSD.
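The referenced snippet isn't reproduced above; a minimal sketch of that kind of change, assuming the torchvision model constructors used by the example script, might be:

```python
# Sketch of the described one-line change: build the model with a 100-class
# head instead of the default 1000 ImageNet classes.
import torchvision.models as models

model = models.resnet18(num_classes=100)  # or resnet50, per the chosen arch
```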
Hi,
Today I rebuilt PyTorch from the dev head, together with CUDA 10 & cuDNN 7.4.1. I retested using a smaller batch size; unfortunately the problem is still present. I'm using DALI 0.4.1, as issue #308 doesn't seem to have been merged in yet?
@yaysummeriscoming - no, #308 is still under review, but the leaks it fixes are not that significant. So how many epochs are you able to train, and what is the memory occupancy after reducing the batch size?
@yaysummeriscoming - regarding #328, we are able to test with a batch size of 128 per GPU with 16GB of memory.
Ok thanks, I'm looking forward to it. I'm very eager to deploy DALI into my training setup given how much faster it is!
Tracked as DALI-452
So I retested this issue with DALI 0.6, CUDA 10 & cuDNN 7.4.1, but unfortunately still no luck. I have, however, been able to develop a workaround by recreating all DALI objects & re-importing DALI at the end of every epoch. I inserted the following lines of code at line 282 of https://github.com/NVIDIA/DALI/blob/master/docs/examples/pytorch/resnet50/main.py (just after the DALI iterators are reset):
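The snippet that was inserted there didn't survive here; below is a minimal sketch of such a teardown-and-rebuild, assuming the HybridTrainPipe / HybridValPipe classes and the variable names from the DALI ResNet50 example script (variable names are illustrative and exact constructor arguments vary between DALI versions):

```python
# Sketch of the described workaround: at the end of every epoch, drop the old
# DALI objects, re-import DALI, and rebuild the pipelines and iterators.
import importlib
import nvidia.dali
importlib.reload(nvidia.dali)  # stand-in for the "re-importing DALI" step
from nvidia.dali.plugin.pytorch import DALIClassificationIterator

del train_loader, val_loader, pipe  # release the previous epoch's objects

pipe = HybridTrainPipe(batch_size=args.batch_size, num_threads=args.workers,
                       device_id=args.local_rank, data_dir=traindir,
                       crop=crop_size, dali_cpu=args.dali_cpu)
pipe.build()
train_loader = DALIClassificationIterator(pipe, size=pipe.epoch_size("Reader"))

pipe_val = HybridValPipe(batch_size=args.batch_size, num_threads=args.workers,
                         device_id=args.local_rank, data_dir=valdir,
                         crop=crop_size, size=val_size)
pipe_val.build()
val_loader = DALIClassificationIterator(pipe_val, size=pipe_val.epoch_size("Reader"))
```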
Together with the above workaround, I noticed that sometimes at the beginning of an epoch the first training batch is all zeros. I could get around this with the following code at line 309, where the first training batch is generated:
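That snippet is likewise missing; a minimal sketch of such a guard, assuming the batch layout used by the DALI PyTorch example (each batch is a list of dicts carrying "data" and "label" tensors), might look like this:

```python
# Sketch only: skip the first batch of an epoch if it comes back all zeros
# after the iterator reset, instead of training on it.
for i, data in enumerate(train_loader):
    input = data[0]["data"]
    target = data[0]["label"].squeeze().cuda().long()
    if i == 0 and input.abs().sum().item() == 0:
        # First batch after the reset is empty; drop it and continue.
        continue
    # ... the regular forward/backward pass on (input, target) goes here ...
```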
@yaysummeriscoming - does it happen only when you are using this workaround, or does it happen without it as well?
@JanuszL yes, the null-input problem only happened with my workaround. I was also getting reduced accuracy with the workaround, so I've removed it again - I imagine this might have had something to do with shuffling. I retested again today, using the DALI example script with a smaller batch size and DALI CPU mode (I found that the problem seemed to be worse in this mode). Unfortunately I'm not seeing an upper bound on memory usage. The options I used were: I'm also seeing some weird behaviour regarding processing speed: it seems to drop at much lower levels of RAM utilization. On a machine with 32GB of RAM, processing speed drops after about 12k iterations; 40GB of RAM gets me to 70k iterations. I'm no expert on RAM usage, but I see a lot of cached memory. Could my dataset (~15GB) be getting cached to RAM?
I see the same issue with the pip-installed package.
Currently, we are reworking how CPU memory is used. One of the enablers is moving from per-sample to per-batch processing on the CPU - #936. In the future, this will allow better host memory utilization, as we will be able to allocate memory for the whole batch at once rather than per sample (we only enlarge buffers and never free them, to avoid expensive reallocations).
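As a rough illustration of the grow-only strategy described above (not DALI's actual implementation), a buffer that only ever enlarges and settles at its high-water mark could look like this:

```python
# Illustration only: a grow-only buffer. It is enlarged when a larger request
# arrives but never shrunk, so the allocation settles at the high-water mark
# and repeated reallocations are avoided.
import numpy as np

class GrowOnlyBuffer:
    def __init__(self):
        self._buf = np.empty(0, dtype=np.uint8)

    def request(self, nbytes: int) -> np.ndarray:
        if nbytes > self._buf.size:           # enlarge only when needed
            self._buf = np.empty(nbytes, dtype=np.uint8)
        return self._buf[:nbytes]             # view into the retained storage

buf = GrowOnlyBuffer()
buf.request(1 << 20)   # grows to 1 MiB
buf.request(512)       # reuses the existing 1 MiB allocation
```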
We're pleased to say that we've changed the allocation strategy for (non-pinned) CPU buffers. It reduces memory consumption in RN50 training in PyTorch by almost 50%. Please check the latest master (or the next successful nightly) and see if your issue is resolved.
@mzient great, thanks - hopefully I'll get some time next week to retest. Nice that there's an environment variable to control the behaviour. If I follow correctly, setting the threshold to 0 will retain the old behaviour?
0.19 is out and should address this. Please reopen if it doesn't work.
I tried the latest version, but the situation seems the same for me. I'm running an ImageNet experiment on a 1080 GPU, which has only 8GB of memory, so for this to work I need to make sure the GPU memory usage stays the same for the entire training process, or an OOM error comes in.
@forjiuzhou - 0.19 addresses CPU memory usage. GPU memory doesn't grow that much and should stabilize after a couple of epochs. If you have 8GB of GPU memory, how about using a CPU pipeline?
@JanuszL A CPU pipeline does have smaller GPU memory usage, but it still grows; after dozens of epochs an OOM error can still happen. In my experience, a validation phase can add hundreds of MB of memory, which is weird because the training phase actually grows very slowly.
@forjiuzhou - the training pipeline's memory consumption grows very slowly (if at all) after a few epochs, as it has traversed many samples and hits the high-water mark pretty quickly. The validation pipeline, however, sees far fewer samples per epoch, so it takes more iterations for its memory consumption to stabilize. But if you consider how many samples need to be processed to reach stable memory consumption, I assume it would be about the same in both cases.
Finally got a chance to test DALI 0.19. Memory usage is no longer rising - thanks for implementing this fix! I couldn't see any difference in speed; if there is one, it's <1%.
Great to hear that!
I’ve been able to get some great speeds out of DALI with PyTorch - far beyond what torchvision can do. The problem is that I seem to be getting a memory leak that causes training to slow down and eventually crash with an OOM error. Below I’ve plotted average images/sec over each epoch, along with memory usage. This is training ResNet-18 on a 1000-class, 100k-image subset of ImageNet, using dali-cpu mode together with Apex, on the example script here: https://github.com/NVIDIA/DALI/blob/master/docs/examples/pytorch/main.py
The problem is most prominent with dali-cpu; dali-gpu takes 90 epochs to reach the same memory usage. Note that as I'm only using 100k training examples, this corresponds to ~8 epochs on the full ImageNet dataset.
Additionally, the GPU version seems to use a massive amount of GPU memory: I’ve found I always need to halve the batch size compared to dali-cpu or torchvision. I see that this was touched on in issues 21 and 51, but is this much memory usage normal? For ImageNet, I calculate that one 256-example float32 data batch should take 256 * 3 * 224 * 224 * 4 ≈ 154 MB of memory, so it seems to me that the memory usage shouldn’t be that high.
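A quick check of that estimate (plain arithmetic, nothing DALI-specific):

```python
# Per-batch size of 256 float32 images at 3x224x224, as quoted above.
batch, channels, height, width, bytes_per_float = 256, 3, 224, 224, 4
batch_bytes = batch * channels * height * width * bytes_per_float
print(batch_bytes / 1e6)  # ~154.1 MB
```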
I’m using Ubuntu 16.04, CUDA 9.2, cuDNN 7.3.0, together with a pre-1.0 version of PyTorch built from the dev head. I’m running on Google Cloud on a machine with 12 vCPUs (6 real cores), 32GB of RAM and a V100.
Edit: I'm using DALI 0.4.