
Memory leak problem with new API fluid #8621

Closed
TomorrowIsAnOtherDay opened this issue Feb 28, 2018 · 21 comments
Assignees
Labels
User (used to tag user questions)

Comments

@TomorrowIsAnOtherDay
Contributor

TomorrowIsAnOtherDay commented Feb 28, 2018

With the new fluid API, I tried to reproduce some models written in TensorFlow, but the memory usage of the fluid program kept rising the whole time, while the TensorFlow program was fine.

Here is my code

@QiJune QiJune added the User 用于标记用户问题 label Feb 28, 2018
@QiJune
Member

QiJune commented Feb 28, 2018

You can run the fluid program with the following command to track memory usage:

FLAGS_benchmark=true GLOG_vmodule=executor=2 GLOG_logtostderr=1 python train.py

It will print the exact memory usage after each operator runs.

@TomorrowIsAnOtherDay
Contributor Author

Thanks for your comment.
With the debug command I got the memory usage of the program:
[screenshot of per-operator memory usage]

@TomorrowIsAnOtherDay
Contributor Author

Thanks for @QiJune's advice.
After removing prune, the memory usage is OK.
But why would using prune cause a memory leak?

@dzhwinter
Contributor

Our design is brand new and totally different from TensorFlow's. We do not have the concept of a graph of operators; we only have the program, so the dependency analysis is more complicated. For the bug you encountered: prune returns a different program (some part of the main program), so there may be referenced variables that cannot be deleted in time.
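
If the goal of pruning was to get a forward-only program for evaluation, a possible prune-free alternative is to clone the program instead. A minimal sketch, assuming Program.clone(for_test=True) is available in your fluid version (a common way the models repo derives a test program):

    # Hedged sketch: derive an evaluation program without prune, assuming
    # Program.clone(for_test=True) exists in the installed fluid version
    # (the import path was paddle.v2.fluid in the 0.11 releases).
    import paddle.fluid as fluid

    main_program = fluid.default_main_program()

    # Clone for evaluation *before* appending backward/optimizer ops, so the
    # clone contains only the forward pass.
    test_program = main_program.clone(for_test=True)

    # ... then append the optimizer to main_program and run both programs
    # with the same executor and scope.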

@TomorrowIsAnOtherDay
Contributor Author

TomorrowIsAnOtherDay commented Mar 2, 2018

The memory leak problem wasn't solved by removing prune, but it did slow down the rate of memory growth.

@TomorrowIsAnOtherDay
Contributor Author

TomorrowIsAnOtherDay commented Mar 5, 2018

It can be reproduced by running the official model code :)
https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient
@QiJune @dzhwinter

@dzhwinter
Contributor

Thanks for reporting. We are looking into the issue.

@QiJune
Member

QiJune commented Mar 5, 2018

@TomorrowIsAnOtherDay Yes, I have reproduced the problem.
Prune greatly increases memory usage; we should not use prune.
After removing prune, memory still increases, but very slowly.

@TomorrowIsAnOtherDay
Contributor Author

@dzhwinter @QiJune
Any fresh developments?

@dzhwinter
Contributor

I have double-checked the operators involved; there are no memory issues inside them. Currently, I suspect that some variable referenced in pybind is not released, but the experiments have not confirmed this yet.

@dzhwinter
Contributor

Here are some clues I found today. I ran the reinforcement learning demo in models and tracked the memory used by the buddy allocator, and found that the GPU memory cost did not increase any further after 2000 epochs. In detail, the memory cost increases in a wave-like way during epochs 0-2000; from epochs 2000-10000 it stays at a stable number and does not change anymore.

At the same time, I tracked the memory held by the Python process and found that it always increases by small amounts (a sketch for logging this from inside the script follows the listing below).

  714 root      20   0 38.454g 1.355g 345428 R 100.0  0.4   0:28.36 python
  714 root      20   0 38.461g 1.362g 345428 R 100.3  0.4   0:31.41 python
  714 root      20   0 38.480g 1.380g 345428 R 100.0  0.4   0:34.46 python
  714 root      20   0 38.483g 1.384g 345428 R 100.0  0.4   0:37.51 python
  714 root      20   0 38.489g 1.390g 345428 R 100.0  0.4   0:40.55 python
  714 root      20   0 38.489g 1.391g 345428 R 100.0  0.4   0:43.60 python
  714 root      20   0 38.491g 1.392g 345428 R 100.3  0.4   0:46.65 python
  714 root      20   0 38.495g 1.396g 345428 R 100.0  0.4   0:49.70 python
  714 root      20   0 38.498g 1.399g 345428 R 100.0  0.4   0:52.74 python
  714 root      20   0 38.505g 1.406g 345428 R 100.0  0.4   0:55.79 python
  714 root      20   0 38.507g 1.408g 345428 R 100.3  0.4   0:58.84 python
  714 root      20   0 38.510g 1.411g 345428 R  99.7  0.4   1:01.88 python
  714 root      20   0 38.510g 1.412g 345428 R 100.0  0.4   1:04.93 python
  714 root      20   0 38.513g 1.414g 345428 R 100.3  0.4   1:07.98 python
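
For logging the same host-side numbers from inside the training script rather than from top, a minimal sketch (assuming psutil is installed; where and how often it is called is arbitrary):

    # Hedged sketch: print the resident set size (RSS) of this process.
    import os

    import psutil

    _proc = psutil.Process(os.getpid())

    def log_rss(tag):
        rss_mb = _proc.memory_info().rss / (1024.0 * 1024.0)
        print("%s rss=%.1f MB" % (tag, rss_mb))

    # Example: log_rss("epoch %d" % epoch) at the end of each epoch.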

@dzhwinter
Contributor

In the very beginning, I also suspected that there might be some circularly referenced Python objects that cannot be deleted in time.
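
A quick, hedged way to check for such cycles from the training script (plain Python, nothing fluid-specific):

    # Hedged sketch: look for uncollectable reference cycles. gc.collect()
    # returns the number of objects it freed; anything left in gc.garbage is
    # part of a cycle the collector cannot break (e.g. cycles containing
    # __del__ methods on Python 2).
    import gc

    gc.set_debug(gc.DEBUG_UNCOLLECTABLE)
    collected = gc.collect()
    print("collected %d, uncollectable %d" % (collected, len(gc.garbage)))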

@TomorrowIsAnOtherDay
Contributor Author

Thanks for your time:)

@dzhwinter
Contributor

To draw a more concrete conclusion, I added some new features and ran some experiments based on them, as below.
If the user sets the flag FLAGS_fraction_of_gpu_memory_to_use=0.0 (its default is defined as DEFINE_double(fraction_of_gpu_memory_to_use, 0.92, ...)), then the buddy_allocator is disabled, and we can track the GPU memory cost with nvidia-smi -l 1 when running the model on a GPU node. Note that all fluid memory is managed by Tensor. Therefore, if the GPU memory cost does not increase during training, it convinces us that every Tensor is deleted in time, and vice versa.
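
If it is more convenient to log the GPU number from the training script itself, a minimal sketch that polls nvidia-smi (assuming it is on PATH and the driver supports the --query-gpu interface):

    # Hedged sketch: query the used GPU memory (MiB) so it can be printed
    # alongside the training loop.
    import subprocess

    def gpu_memory_used_mib(gpu_index=0):
        out = subprocess.check_output(
            ["nvidia-smi", "-i", str(gpu_index),
             "--query-gpu=memory.used", "--format=csv,noheader,nounits"])
        return int(out.strip())

    # Example: print(gpu_memory_used_mib()) once per epoch.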

Experiment: https://github.com/PaddlePaddle/models/blob/develop/fluid/policy_gradient/README.md
This script adds reference counting of the fluid objects/variables. You can try it by replacing the default brain.py:
https://gist.github.com/dzhwinter/f7bfb5ad693fdc277b85b339cf719a8d

Configuration:

FLAGS_fraction_of_gpu_memory_to_use=0.0 FLAGS_benchmark=true python policy_gradient/run.py --device=GPU

Then we track CPU memory with top, and GPU memory with nvidia-smi. The GPU memory stays at 297M and fluctuates between batches (Python garbage collection). The CPU memory increases slowly, as the previous comment shows.
Interestingly, the reported variable reference counts do not change either (a generic counting sketch follows the listing):

{}
{'fc_0.b_0@GRAD': 1, 'fc_0.tmp_0@GRAD': 1, 'cross_entropy_0.tmp_0': 1, 'cross_entropy_0.tmp_0@GRAD': 1, 'vt': 1, 'fc_1.b_0': 1, 'reduce_mean_0.tmp_0': 1, 'fc_1.b_0@GRAD': 1, 'elementwise_mul_0.tmp_0@GRAD': 1, 'acts': 1, 'fc_0.b_0': 1, 'reduce_mean_0.tmp_0@GRAD': 1, 'fc_1.w_0@GRAD': 1, 'fc_0.tmp_2': 1, 'fc_0.tmp_0': 1, 'fc_0.tmp_1': 1, 'fc_1.w_0': 1, 'fc_0.w_0': 1, 'elementwise_mul_0.tmp_0': 1, 'fc_1.tmp_0@GRAD': 1, 'fc_1.tmp_2@GRAD': 1, 'fc_0.w_0@GRAD': 1, 'fc_1.tmp_2': 1, 'fc_1.tmp_1': 1, 'fc_1.tmp_0': 1, 'learning_rate_0': 1, 'fc_1.tmp_1@GRAD': 1, 'fc_0.tmp_2@GRAD': 1, 'fc_0.tmp_1@GRAD': 1, 'obs': 1}
{'fc_0.b_0@GRAD': 1, 'fc_0.tmp_0@GRAD': 1, 'cross_entropy_0.tmp_0': 1, 'cross_entropy_0.tmp_0@GRAD': 1, 'vt': 1, 'fc_1.b_0': 1, 'reduce_mean_0.tmp_0': 1, 'fc_1.b_0@GRAD': 1, 'elementwise_mul_0.tmp_0@GRAD': 1, 'acts': 1, 'fc_0.b_0': 1, 'reduce_mean_0.tmp_0@GRAD': 1, 'fc_1.w_0@GRAD': 1, 'fc_0.tmp_2': 1, 'fc_0.tmp_0': 1, 'fc_0.tmp_1': 1, 'learning_rate_0': 1, 'fc_0.w_0': 1, 'elementwise_mul_0.tmp_0': 1, 'fc_1.tmp_0@GRAD': 1, 'fc_1.tmp_2@GRAD': 1, 'fc_0.w_0@GRAD': 1, 'fc_1.tmp_2': 1, 'fc_1.tmp_1': 1, 'fc_1.tmp_0': 1, 'fc_1.w_0': 1, 'fc_1.tmp_1@GRAD': 1, 'fc_0.tmp_2@GRAD': 1, 'fc_0.tmp_1@GRAD': 1, 'obs': 1}
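
For reference, one generic way to produce this kind of per-variable dump (a hedged sketch, not the gist's exact code, assuming the Program/Block API of that fluid version):

    # Hedged sketch: dump an interpreter-level reference count for every
    # variable in the default main program. The absolute values include
    # temporary references; what matters is whether any of them grow
    # across epochs.
    import sys

    import paddle.fluid as fluid  # paddle.v2.fluid in the 0.11 releases

    def variable_refcounts(program=None):
        program = program or fluid.default_main_program()
        return {name: sys.getrefcount(var)
                for name, var in program.global_block().vars.items()}

    # Example: print(variable_refcounts()) once per epoch and diff the output.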

@dzhwinter
Contributor

For your task:

  1. Could you add FLAGS_fraction_of_gpu_memory_to_use=0.0 to see whether any Tensor is leaked during training?
  2. Is there any immutable object created and referenced in the train loop? (A generic way to check for accumulating Python objects is sketched after this list.)

Thanks
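
A minimal sketch of such a check, counting live Python objects per type with gc (plain Python, nothing fluid-specific):

    # Hedged sketch: count live Python objects by type. Calling this every
    # few epochs and diffing the results shows which object types keep
    # accumulating inside the train loop.
    import collections
    import gc

    def top_object_counts(n=10):
        counts = collections.Counter(type(o).__name__ for o in gc.get_objects())
        return counts.most_common(n)

    # Example: print(top_object_counts()) every 100 epochs.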

@TomorrowIsAnOtherDay
Contributor Author

I suppose we should track the problem using the official models code:
https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient
You don't have the same environment as my task, and it's difficult for me to locate the bug there. It would be easier to locate the bug using the official model.

@TomorrowIsAnOtherDay
Contributor Author

@dzhwinter Any fresh development this week?

@dzhwinter
Contributor

Sorry, you didn't get my point. The model I tested is the same as https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient, and I found that there is no GPU memory leak during training; the gist code only adds variable reference counting to give us a double check.

The top tool can only track the host memory used, and the policy_gradient reference table above only suggests that there may be some circular reference problem in pybind/Python.

In a word, https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient cannot reproduce the memory leak problem.
Could you add FLAGS_fraction_of_gpu_memory_to_use=0.0 to see whether any Tensor is leaked during training in your task?

@dzhwinter
Contributor

Please do not use the prune interface, that one is buggy.

@TomorrowIsAnOtherDay
Contributor Author

Without using the prune API, the memory leak problem was solved.

@panyx0718 panyx0718 self-assigned this Apr 11, 2018
@dzhwinter
Contributor

This issue has been fixed in the latest branch, please check it out.
