
Memory leak problem with new API fluid #8621

Closed
TomorrowIsAnOtherDay opened this issue Feb 28, 2018 · 21 comments
Assignees
Labels
User (used to tag user questions)

Comments

@TomorrowIsAnOtherDay
Contributor

TomorrowIsAnOtherDay commented Feb 28, 2018

With the new fluid API, I tried to reproduce some models written in TensorFlow, but the memory usage of the fluid program kept rising the whole time, while the TensorFlow program was fine.

Here is my code

@QiJune QiJune added the User 用于标记用户问题 label Feb 28, 2018
@QiJune
Member

QiJune commented Feb 28, 2018

You can run the fluid program with the following command to track memory usage:

FLAGS_benchmark=true GLOG_vmodule=executor=2 GLOG_logtostderr=1 python train.py

It will print the exact memory usage after each operator runs.

@TomorrowIsAnOtherDay
Contributor Author

Thanks for your comment.
With the debug command I got the memory usage of the program:
[screenshot of per-operator memory usage]

@TomorrowIsAnOtherDay
Contributor Author

Thanks for @QiJune's advice.
After removing prune, the memory usage is OK.
But why would using prune cause a memory leak?

@dzhwinter
Contributor

Our design is brand new and totally different from TensorFlow's. We do not have the concept of a graph of operators; we only have the program, so the dependency analysis is more complicated. For the bug you encountered: prune returns a different program (some part of the main program), so there may be referenced variables that cannot be deleted in time.
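
If the goal of pruning was to get a forward-only program for evaluation, a possible prune-free alternative is to clone the program instead. A minimal sketch, assuming Program.clone(for_test=True) is available in your fluid version (a common way the models repo derives a test program):

    # Hedged sketch: derive an evaluation program without prune, assuming
    # Program.clone(for_test=True) exists in the installed fluid version
    # (the import path was paddle.v2.fluid in the 0.11 releases).
    import paddle.fluid as fluid

    main_program = fluid.default_main_program()

    # Clone for evaluation *before* appending backward/optimizer ops, so the
    # clone contains only the forward pass.
    test_program = main_program.clone(for_test=True)

    # ... then append the optimizer to main_program and run both programs
    # with the same executor and scope.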

@TomorrowIsAnOtherDay
Contributor Author

TomorrowIsAnOtherDay commented Mar 2, 2018

The memory leak problem wasn't solved by removing prune, but it did slow down the rate of memory growth.

@TomorrowIsAnOtherDay
Contributor Author

TomorrowIsAnOtherDay commented Mar 5, 2018

It can be reproduced by running the official model code :)
https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient
@QiJune @dzhwinter

@dzhwinter
Contributor

Thanks for reporting. We are looking into the issue.

@QiJune
Member

QiJune commented Mar 5, 2018

@TomorrowIsAnOtherDay Yes, I have reproduced the problem.
Prune greatly increases memory usage; we should not use prune.
After removing prune, memory still increases, but very slowly.

@TomorrowIsAnOtherDay
Contributor Author

@dzhwinter @QiJune
Any fresh developments?

@dzhwinter
Contributor

I have double-checked the operators involved; there are no memory issues inside them. Currently, I suspect that some variable referenced in pybind is not released, but the experiments have not confirmed this yet.

@dzhwinter
Contributor

Here are some clues I found today. I ran the reinforcement learning demo in models and tracked the memory used by the buddy allocator, and found that the GPU memory cost did not increase any further after 2000 epochs. In detail, the memory cost increases in a wave-like way during epochs 0-2000; from epochs 2000-10000 it stays at a stable number and does not change anymore.

At the same time, I tracked the memory held by the Python process and found that it always increases by small amounts (a sketch for logging this from inside the script follows the listing below).

  714 root      20   0 38.454g 1.355g 345428 R 100.0  0.4   0:28.36 python
  714 root      20   0 38.461g 1.362g 345428 R 100.3  0.4   0:31.41 python
  714 root      20   0 38.480g 1.380g 345428 R 100.0  0.4   0:34.46 python
  714 root      20   0 38.483g 1.384g 345428 R 100.0  0.4   0:37.51 python
  714 root      20   0 38.489g 1.390g 345428 R 100.0  0.4   0:40.55 python
  714 root      20   0 38.489g 1.391g 345428 R 100.0  0.4   0:43.60 python
  714 root      20   0 38.491g 1.392g 345428 R 100.3  0.4   0:46.65 python
  714 root      20   0 38.495g 1.396g 345428 R 100.0  0.4   0:49.70 python
  714 root      20   0 38.498g 1.399g 345428 R 100.0  0.4   0:52.74 python
  714 root      20   0 38.505g 1.406g 345428 R 100.0  0.4   0:55.79 python
  714 root      20   0 38.507g 1.408g 345428 R 100.3  0.4   0:58.84 python
  714 root      20   0 38.510g 1.411g 345428 R  99.7  0.4   1:01.88 python
  714 root      20   0 38.510g 1.412g 345428 R 100.0  0.4   1:04.93 python
  714 root      20   0 38.513g 1.414g 345428 R 100.3  0.4   1:07.98 python
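
For logging the same host-side numbers from inside the training script rather than from top, a minimal sketch (assuming psutil is installed; where and how often it is called is arbitrary):

    # Hedged sketch: print the resident set size (RSS) of this process.
    import os

    import psutil

    _proc = psutil.Process(os.getpid())

    def log_rss(tag):
        rss_mb = _proc.memory_info().rss / (1024.0 * 1024.0)
        print("%s rss=%.1f MB" % (tag, rss_mb))

    # Example: log_rss("epoch %d" % epoch) at the end of each epoch.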

@dzhwinter
Contributor

In the very beginning, I also suspected that there might be some circularly referenced Python objects that cannot be deleted in time.
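
A quick, hedged way to check for such cycles from the training script (plain Python, nothing fluid-specific):

    # Hedged sketch: look for uncollectable reference cycles. gc.collect()
    # returns the number of objects it freed; anything left in gc.garbage is
    # part of a cycle the collector cannot break (e.g. cycles containing
    # __del__ methods on Python 2).
    import gc

    gc.set_debug(gc.DEBUG_UNCOLLECTABLE)
    collected = gc.collect()
    print("collected %d, uncollectable %d" % (collected, len(gc.garbage)))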

@TomorrowIsAnOtherDay
Contributor Author

Thanks for your time:)

@dzhwinter
Contributor

To draw a more concrete conclusion, I added some new features and ran some experiments based on them, as below.
If the user sets the flag FLAGS_fraction_of_gpu_memory_to_use=0.0 (its default is defined as DEFINE_double(fraction_of_gpu_memory_to_use, 0.92, ...)), then the buddy_allocator is disabled, and we can track the GPU memory cost with nvidia-smi -l 1 when running the model on a GPU node. Note that all fluid memory is managed by Tensor. Therefore, if the GPU memory cost does not increase during training, it convinces us that every Tensor is deleted in time, and vice versa.
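
If it is more convenient to log the GPU number from the training script itself, a minimal sketch that polls nvidia-smi (assuming it is on PATH and the driver supports the --query-gpu interface):

    # Hedged sketch: query the used GPU memory (MiB) so it can be printed
    # alongside the training loop.
    import subprocess

    def gpu_memory_used_mib(gpu_index=0):
        out = subprocess.check_output(
            ["nvidia-smi", "-i", str(gpu_index),
             "--query-gpu=memory.used", "--format=csv,noheader,nounits"])
        return int(out.strip())

    # Example: print(gpu_memory_used_mib()) once per epoch.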

Experiment: https://github.com/PaddlePaddle/models/blob/develop/fluid/policy_gradient/README.md
This script adds reference counting of the fluid objects/variables. You can try it by replacing the default brain.py:
https://gist.github.com/dzhwinter/f7bfb5ad693fdc277b85b339cf719a8d

Configuration:

FLAGS_fraction_of_gpu_memory_to_use=0.0 FLAGS_benchmark=true python policy_gradient/run.py --device=GPU

Then we track CPU memory with top, and GPU memory with nvidia-smi. The GPU memory stays at 297M and fluctuates between batches (Python garbage collection). The CPU memory increases slowly, as the previous comment shows.
Interestingly, the reported variable reference counts do not change either (a generic counting sketch follows the listing):

{}
{'fc_0.b_0@GRAD': 1, 'fc_0.tmp_0@GRAD': 1, 'cross_entropy_0.tmp_0': 1, 'cross_entropy_0.tmp_0@GRAD': 1, 'vt': 1, 'fc_1.b_0': 1, 'reduce_mean_0.tmp_0': 1, 'fc_1.b_0@GRAD': 1, 'elementwise_mul_0.tmp_0@GRAD': 1, 'acts': 1, 'fc_0.b_0': 1, 'reduce_mean_0.tmp_0@GRAD': 1, 'fc_1.w_0@GRAD': 1, 'fc_0.tmp_2': 1, 'fc_0.tmp_0': 1, 'fc_0.tmp_1': 1, 'fc_1.w_0': 1, 'fc_0.w_0': 1, 'elementwise_mul_0.tmp_0': 1, 'fc_1.tmp_0@GRAD': 1, 'fc_1.tmp_2@GRAD': 1, 'fc_0.w_0@GRAD': 1, 'fc_1.tmp_2': 1, 'fc_1.tmp_1': 1, 'fc_1.tmp_0': 1, 'learning_rate_0': 1, 'fc_1.tmp_1@GRAD': 1, 'fc_0.tmp_2@GRAD': 1, 'fc_0.tmp_1@GRAD': 1, 'obs': 1}
{'fc_0.b_0@GRAD': 1, 'fc_0.tmp_0@GRAD': 1, 'cross_entropy_0.tmp_0': 1, 'cross_entropy_0.tmp_0@GRAD': 1, 'vt': 1, 'fc_1.b_0': 1, 'reduce_mean_0.tmp_0': 1, 'fc_1.b_0@GRAD': 1, 'elementwise_mul_0.tmp_0@GRAD': 1, 'acts': 1, 'fc_0.b_0': 1, 'reduce_mean_0.tmp_0@GRAD': 1, 'fc_1.w_0@GRAD': 1, 'fc_0.tmp_2': 1, 'fc_0.tmp_0': 1, 'fc_0.tmp_1': 1, 'learning_rate_0': 1, 'fc_0.w_0': 1, 'elementwise_mul_0.tmp_0': 1, 'fc_1.tmp_0@GRAD': 1, 'fc_1.tmp_2@GRAD': 1, 'fc_0.w_0@GRAD': 1, 'fc_1.tmp_2': 1, 'fc_1.tmp_1': 1, 'fc_1.tmp_0': 1, 'fc_1.w_0': 1, 'fc_1.tmp_1@GRAD': 1, 'fc_0.tmp_2@GRAD': 1, 'fc_0.tmp_1@GRAD': 1, 'obs': 1}
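
For reference, one generic way to produce this kind of per-variable dump (a hedged sketch, not the gist's exact code, assuming the Program/Block API of that fluid version):

    # Hedged sketch: dump an interpreter-level reference count for every
    # variable in the default main program. The absolute values include
    # temporary references; what matters is whether any of them grow
    # across epochs.
    import sys

    import paddle.fluid as fluid  # paddle.v2.fluid in the 0.11 releases

    def variable_refcounts(program=None):
        program = program or fluid.default_main_program()
        return {name: sys.getrefcount(var)
                for name, var in program.global_block().vars.items()}

    # Example: print(variable_refcounts()) once per epoch and diff the output.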

@dzhwinter
Contributor

For your task:

  1. Could you add FLAGS_fraction_of_gpu_memory_to_use=0.0 to see whether any Tensor is leaked during training?
  2. Is there any immutable object created and referenced in the train loop? (A generic way to check for accumulating Python objects is sketched after this list.)

Thanks
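
A minimal sketch of such a check, counting live Python objects per type with gc (plain Python, nothing fluid-specific):

    # Hedged sketch: count live Python objects by type. Calling this every
    # few epochs and diffing the results shows which object types keep
    # accumulating inside the train loop.
    import collections
    import gc

    def top_object_counts(n=10):
        counts = collections.Counter(type(o).__name__ for o in gc.get_objects())
        return counts.most_common(n)

    # Example: print(top_object_counts()) every 100 epochs.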

@TomorrowIsAnOtherDay
Contributor Author

I suppose we should track the problem using the official models code:
https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient
You don't have the same environment as my task, and it's difficult for me to locate the bug there. It would be easier to locate the bug using the official model.

@TomorrowIsAnOtherDay
Contributor Author

@dzhwinter Any fresh development this week?

@dzhwinter
Contributor

Sorry, you didn't get my point. The model I tested is the same as https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient, and I found that there is no GPU memory leak during training; the gist code only adds variable reference counting to give us a double check.

The top tool can only track the host memory used, and the policy_gradient reference table above only suggests that there may be some circular reference problem in pybind/Python.

In a word, https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient cannot reproduce the memory leak problem.
Could you add FLAGS_fraction_of_gpu_memory_to_use=0.0 to see whether any Tensor is leaked during training in your task?

@dzhwinter
Contributor

Please do not use the prune interface, that one is buggy.

@TomorrowIsAnOtherDay
Contributor Author

Without using the prune API, the memory leak problem was solved.

@panyx0718 panyx0718 self-assigned this Apr 11, 2018
@dzhwinter
Contributor

This issue has been fixed in the latest branch, please check it out.
