
Memory (CPU and GPU) leaks during the 1st epoch #1510

Closed
alexeykarnachev opened this issue Apr 16, 2020 · 20 comments · Fixed by #1528
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@alexeykarnachev (Contributor) commented Apr 16, 2020

🐛 Bug

Hello.
This memory leak occurs during the first epoch. If the epoch is long (mine was > 10 days), an OOM error eventually occurs. Interestingly, in precision=16 mode it leaks on both the GPU and the CPU; with AMP switched off (precision=32), the leak is only on the CPU.
I also tracked the number of tensors known to the garbage collector. It increases linearly during the first epoch, then drops back to the initial value when the 2nd epoch starts and begins increasing again.
Here are the plots:


Experiment 1: amp_level='O2', precision=16

[image: the number of tensors tracked by the garbage collector]
[image: GPU (the 2nd in my case) memory usage, tracked by pytorch-lightning]
[image: CPU memory usage by the process (bytes)]

Experiment 2: amp_level=None, precision=None

[image: the number of tensors tracked by the garbage collector]
[image: GPU (the 2nd in my case) memory usage, tracked by pytorch-lightning]
[image: CPU memory usage by the process (bytes)]


As you can see, both cases have a CPU leak; the AMP case also has a GPU leak.
It's also clear that the leaky behavior stops when the 2nd epoch starts: on these plots, the 2nd epoch begins at the second "saw tooth" of the "Num-of-tensors" plot.
One more observation: the number of tensors grows by exactly 1001 per step. This is my training_step method:

    def training_step(self, batch, batch_idx):
        losses = self.forward(batch)
        num_of_tensors = get_num_of_tensors()
        log = {'Num-of-tensors': num_of_tensors, 'Cpu-mem-usg': get_cpu_mem()}

        for i, loss in enumerate(losses):
            log[f'loss{i}'] = loss

        print(num_of_tensors)
        return {'loss': losses[0], 'log': log}

Here I return exactly 1001 tensors: one for the loss and 1000 in the log.
In my real experiments I had only 3 tensors, and it took ~2-3 days to hit the OOM, but the example below (see To Reproduce) will crash much faster.
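
For reference, helpers like get_num_of_tensors() and get_cpu_mem() can be implemented roughly as follows (a minimal sketch assuming psutil is installed; the exact implementations live in the gist linked below and may differ):

    import gc

    import psutil
    import torch


    def get_num_of_tensors() -> int:
        # count objects currently tracked by the garbage collector that are tensors
        return sum(1 for obj in gc.get_objects() if torch.is_tensor(obj))


    def get_cpu_mem() -> int:
        # resident set size (RSS) of the current process, in bytes
        return psutil.Process().memory_info().rss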

To Reproduce

Steps to reproduce the behavior:

  1. Run the Code sample (the script takes no arguments, so change any needed values manually in the script).
  2. Open TensorBoard to check the plots.

Code sample

https://gist.github.com/alexeykarnachev/47de06b93a717ab0664eded42ed2826a

Expected behavior

The number of tensors and the GPU and CPU memory usage do not increase during training.

Environment

PyTorch version: 1.4.0
OS: Ubuntu 16.04.6 LTS
Python version: 3.7

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] pytorch-lightning==0.7.3
[pip] torch==1.4.0
[pip] torchvision==0.5.0

Additional context

Sorry for the messy flow of information; I don't know how to structure it more clearly.

@alexeykarnachev added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 16, 2020
@williamFalcon (Contributor)

By "leak" do you mean tensors build up during epoch 1, but after that the memory stays constant? I.e., there is no more "leak" for epochs >= 2?

@alexeykarnachev (Contributor Author) commented Apr 16, 2020

Yes, the memory stays constant after the 1st epoch ends (although the number of tensors begins increasing again).

@BartekRoszak commented Apr 16, 2020

The whole output of a training step is stored, and in your code every training step creates new tensors.
With log[f'loss{i}'] = loss.item() there is no leak.

I think there is a mistake in optimizer_closure() in the training loop, which returns the whole batch output dict. It should be enough to return only callback_metrics instead of the whole batch output.
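
Concretely, the suggested change applied to the training step above would look roughly like this (a sketch, not tested against the gist):

    def training_step(self, batch, batch_idx):
        losses = self.forward(batch)
        log = {'Num-of-tensors': get_num_of_tensors(), 'Cpu-mem-usg': get_cpu_mem()}

        for i, loss in enumerate(losses):
            # .item() turns the scalar tensor into a plain Python float, so the
            # logged dict keeps no tensors (and no autograd graphs) alive
            log[f'loss{i}'] = loss.item()

        return {'loss': losses[0], 'log': log}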

@alexeykarnachev (Contributor Author)

Yes, I agree that with .item() there is no leak, because all tensors "disappear in place" (I did not check it, but I believe so). However, I suppose .item() will slow my code down.
On the other hand, .item() is performed by the Trainer itself anyway (before logging), so maybe it's not a big deal to call .item() beforehand, at least as a hotfix.

@alexeykarnachev (Contributor Author) commented Apr 16, 2020

Oh no, sorry, I just checked: there will still be a leak even if we perform log[f'loss{i}'] = loss.item(), because we still have the 'loss': losses[0] part (the actual loss tensor, which needs to be minimized).
So the leak proceeds at 1 tensor per step. That is very slow, but the OOM will still occur in 6-9 days.

@williamFalcon (Contributor)

Can you submit a PR? I thought we took care of all the metrics.
We should also use detach instead of item, no? So we don't slow the code down.
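
Roughly, the difference between the two (a small self-contained illustration, not Lightning code):

    import torch

    x = torch.randn(8, 8, requires_grad=True)
    loss = (x ** 2).mean()

    detached = loss.detach()  # still a tensor on the same device, but cut off from the autograd graph
    scalar = loss.item()      # a plain Python float; if the tensor lives on a GPU, this forces a host sync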

@BartekRoszak commented Apr 17, 2020

We take care of it in process_output(), but then in optimizer_closure() we return the original output_dict again.
We then pass a list of the original outputs to training_epoch_end().
I think we should not do that, because loss, log and progress_bar are already handled properly by us, so we should pass only the other keys from output_dict to training_epoch_end and let the user manage them.

@alexeykarnachev (Contributor Author)

What about fp32 mode? There is no GPU leak in that case. What could be the reason?

@alexeykarnachev (Contributor Author)

@AratorField, do you mean this?
https://github.com/PyTorchLightning/pytorch-lightning/blob/9b31272cf0f3079a244944096b4a81eec20fe555/pytorch_lightning/trainer/training_loop.py#L427-L428

Here is a list that stores all train step outputs during the epoch.
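
A self-contained illustration of why such a list grows memory: appending the raw step output keeps every per-step tensor, and its autograd graph, alive until the epoch ends (toy example, not Lightning code):

    import torch

    outputs = []
    for step in range(1000):
        x = torch.randn(64, 64, requires_grad=True)
        loss = (x ** 2).mean()
        # storing the live loss keeps its whole graph (including x) alive for the epoch
        outputs.append({'loss': loss})
        # storing a detached copy (or loss.item()) would let the graph be freed each step:
        # outputs.append({'loss': loss.detach()})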

@williamFalcon (Contributor)

@alexeykarnachev (Contributor Author) commented Apr 17, 2020

Yes, but the tensors are still on the GPU after detach. So with long epochs or large training-step outputs, GPU memory will blow up after some time.

@BartekRoszak

We could create something like _recursive_item(), or remove the keys loss, log, and progress_bar from batch_output before appending it to outputs.
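
A rough sketch of what such a recursive helper could look like (illustrative only, not the actual Lightning implementation):

    import torch


    def recursive_detach(value):
        # walk dicts, lists and tuples, detaching any tensor found and leaving other values as-is
        if isinstance(value, torch.Tensor):
            return value.detach()
        if isinstance(value, dict):
            return {k: recursive_detach(v) for k, v in value.items()}
        if isinstance(value, (list, tuple)):
            return type(value)(recursive_detach(v) for v in value)
        return value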

@alexeykarnachev (Contributor Author)

Is it good practice in general to store values during the epoch? The size of such a bookkeeping list is unbounded in the general case. One could have an almost infinite epoch and sooner or later hit an OOM (GPU or CPU, it doesn't matter).

@williamFalcon (Contributor)

The thing is that .item() slows things down, so we want to detach, but not .item().

The tradeoff is that we plug the memory leak but slow things down.

@BartekRoszak

There is no reason to store loss, log and progress_bar for the whole epoch.
Any other key in output_dict could be valuable and has to be stored, e.g. for metric calculation.

@alexeykarnachev (Contributor Author)

Maybe it's possible to introduce a flag that controls whether we store tensors in this list during an epoch.
Or maybe you can advise some hotfix that I can apply locally, because right now I can't train even 1 epoch :)

@alexeykarnachev (Contributor Author)

I don't even have a training_epoch_end method. Maybe we could check whether this method is defined by the user and, if it isn't, skip the batch-results bookkeeping?
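
A generic way to check for an override (a sketch of the idea, not necessarily how Lightning implements it):

    import pytorch_lightning as pl


    def is_overridden(model: pl.LightningModule, name: str) -> bool:
        # the hook counts as overridden only if the subclass attribute exists
        # and differs from the base-class implementation
        sub_attr = getattr(type(model), name, None)
        base_attr = getattr(pl.LightningModule, name, None)
        return sub_attr is not None and sub_attr is not base_attr

    # e.g. the trainer could skip accumulating per-step outputs when
    # is_overridden(model, 'training_epoch_end') is False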

@BartekRoszak

@alexeykarnachev (Contributor Author)

Thank you, I'll patch it locally for now.

@johngrabner

I am using pytorch-lightning 1.5.10 and have deleted all training logs (i.e. commented them out in the code); the only remaining logs are for validation, and I set limit_val_batches to a small value of 3000. During training with limit_train_batches=15000 (about 1 hr), I can barely fit in 256 GB of CPU memory.
With limit_train_batches=30000 (about 2 hrs), the machine freezes because it runs out of CPU memory; there is zero chance of completing a full epoch.

Is there a patch for 1.5.10?
