Memory leak problem with new API fluid #8621
You can run the fluid program with the following commands to track the memory usage; it will print accurate memory usage after each operator runs.
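The flags that enable this appear verbatim in the experiment configuration further down in this thread (FLAGS_benchmark=true, FLAGS_fraction_of_gpu_memory_to_use=0.0). A minimal sketch of setting them from Python instead of the shell, assuming they can be supplied as environment variables and are parsed when paddle.fluid is first imported:

```python
# Sketch only: FLAGS_benchmark and FLAGS_fraction_of_gpu_memory_to_use are the
# flags quoted in the experiment configuration below; the assumption here is
# that they can be supplied as environment variables and are read when
# paddle.fluid is first imported.
import os

os.environ["FLAGS_benchmark"] = "true"                     # print memory usage after each operator runs
os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.0"  # per this thread, disables the buddy allocator pool

import paddle.fluid as fluid  # must come after the flags are set (assumption)
```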
Thanks to @QiJune's advice.
Our design is brand new and totally different from TensorFlow. We do not have a concept of
The memory leak problem wasn't solved by removing prune, but removing it did decrease the speed.
It can be reproduced by running the official model code :)
Thanks for the report. We are looking into the issue.
@TomorrowIsAnOtherDay Yes, I have reproduced the problem.
@dzhwinter @QiJune
I have double-checked the operators involved; there are no memory issues inside them. Currently, I suspect that some variable referenced in pybind is not released, but the experiments have not convinced me of that yet.
Here are some clues I found today. I ran the reinforcement learning demo in models and tracked the memory used by the buddy allocator, and found that the GPU memory cost did not increase any further after 2000 epochs. In detail, the memory cost increases in a wave-like way during epochs 0-2000; from epoch 2000 to 10000 it stays at a stable number and does not change anymore. At the same time, I tracked the memory held by the Python process and found that it keeps increasing by small amounts (the RES column below; a small tracking sketch follows the output):

714 root 20 0 38.454g 1.355g 345428 R 100.0 0.4 0:28.36 python
714 root 20 0 38.461g 1.362g 345428 R 100.3 0.4 0:31.41 python
714 root 20 0 38.480g 1.380g 345428 R 100.0 0.4 0:34.46 python
714 root 20 0 38.483g 1.384g 345428 R 100.0 0.4 0:37.51 python
714 root 20 0 38.489g 1.390g 345428 R 100.0 0.4 0:40.55 python
714 root 20 0 38.489g 1.391g 345428 R 100.0 0.4 0:43.60 python
714 root 20 0 38.491g 1.392g 345428 R 100.3 0.4 0:46.65 python
714 root 20 0 38.495g 1.396g 345428 R 100.0 0.4 0:49.70 python
714 root 20 0 38.498g 1.399g 345428 R 100.0 0.4 0:52.74 python
714 root 20 0 38.505g 1.406g 345428 R 100.0 0.4 0:55.79 python
714 root 20 0 38.507g 1.408g 345428 R 100.3 0.4 0:58.84 python
714 root 20 0 38.510g 1.411g 345428 R 99.7 0.4 1:01.88 python
714 root 20 0 38.510g 1.412g 345428 R 100.0 0.4 1:04.93 python
714 root 20 0 38.513g 1.414g 345428 R 100.3 0.4 1:07.98 python
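For reference, a small, framework-agnostic sketch of logging the same number (the RES column from top) from inside the training process itself; it assumes Linux, and the training step here is only a placeholder:

```python
# Sketch: log the process's resident-set size (the RES column shown by top)
# every N epochs, to see whether host memory keeps creeping up.
# Assumes Linux, so /proc/self/status is available; train_one_epoch() is a
# placeholder for the real fluid training step.
def rss_kib():
    """Return the current resident-set size of this process in KiB."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # VmRSS is reported in kB
    return -1

def train_one_epoch():
    return [0.0] * 1000  # stand-in workload so the sketch runs on its own

for epoch in range(1000):
    train_one_epoch()
    if epoch % 100 == 0:
        print("epoch %d, RSS %d KiB" % (epoch, rss_kib()))
```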
In the very beginning, I also suspected that there might be some circularly referenced Python objects that cannot be deleted in time.
Thanks for your time :)
To draw a more concrete conclusion, I added some new features and have done some experiments based on them, as described below. (See Paddle/paddle/fluid/platform/gpu_info.cc, line 21, at commit 8f87286.) With FLAGS_fraction_of_gpu_memory_to_use set to 0.0, the buddy_allocator will be disabled. Then we can track the GPU memory cost with nvidia-smi -l 1 if the model runs on a GPU node. Note that all fluid memory is managed by Tensor; therefore, if the GPU memory cost does not increase during training, it convinces us that every Tensor is deleted in time, and vice versa.
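A small sketch of polling that counter from a script instead of watching nvidia-smi -l 1 by hand; it only assumes nvidia-smi is on PATH and supports the standard --query-gpu interface:

```python
# Sketch: print the GPU memory that nvidia-smi reports, once per second.
# With the buddy allocator disabled as described above, this number should
# stay flat if every Tensor is freed in time.
import subprocess
import time

def gpu_memory_used_mib(gpu_index=0):
    out = subprocess.check_output([
        "nvidia-smi", "-i", str(gpu_index),
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

while True:
    print("GPU memory used: %d MiB" % gpu_memory_used_mib())
    time.sleep(1)
```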
Experiment: https://github.com/PaddlePaddle/models/blob/develop/fluid/policy_gradient/README.md
Configuration: FLAGS_fraction_of_gpu_memory_to_use=0.0 FLAGS_benchmark=true python policy_gradient/run.py --device=GPU
Then we track the CPU memory with:
{}
{'fc_0.b_0@GRAD': 1, 'fc_0.tmp_0@GRAD': 1, 'cross_entropy_0.tmp_0': 1, 'cross_entropy_0.tmp_0@GRAD': 1, 'vt': 1, 'fc_1.b_0': 1, 'reduce_mean_0.tmp_0': 1, 'fc_1.b_0@GRAD': 1, 'elementwise_mul_0.tmp_0@GRAD': 1, 'acts': 1, 'fc_0.b_0': 1, 'reduce_mean_0.tmp_0@GRAD': 1, 'fc_1.w_0@GRAD': 1, 'fc_0.tmp_2': 1, 'fc_0.tmp_0': 1, 'fc_0.tmp_1': 1, 'fc_1.w_0': 1, 'fc_0.w_0': 1, 'elementwise_mul_0.tmp_0': 1, 'fc_1.tmp_0@GRAD': 1, 'fc_1.tmp_2@GRAD': 1, 'fc_0.w_0@GRAD': 1, 'fc_1.tmp_2': 1, 'fc_1.tmp_1': 1, 'fc_1.tmp_0': 1, 'learning_rate_0': 1, 'fc_1.tmp_1@GRAD': 1, 'fc_0.tmp_2@GRAD': 1, 'fc_0.tmp_1@GRAD': 1, 'obs': 1}
{'fc_0.b_0@GRAD': 1, 'fc_0.tmp_0@GRAD': 1, 'cross_entropy_0.tmp_0': 1, 'cross_entropy_0.tmp_0@GRAD': 1, 'vt': 1, 'fc_1.b_0': 1, 'reduce_mean_0.tmp_0': 1, 'fc_1.b_0@GRAD': 1, 'elementwise_mul_0.tmp_0@GRAD': 1, 'acts': 1, 'fc_0.b_0': 1, 'reduce_mean_0.tmp_0@GRAD': 1, 'fc_1.w_0@GRAD': 1, 'fc_0.tmp_2': 1, 'fc_0.tmp_0': 1, 'fc_0.tmp_1': 1, 'learning_rate_0': 1, 'fc_0.w_0': 1, 'elementwise_mul_0.tmp_0': 1, 'fc_1.tmp_0@GRAD': 1, 'fc_1.tmp_2@GRAD': 1, 'fc_0.w_0@GRAD': 1, 'fc_1.tmp_2': 1, 'fc_1.tmp_1': 1, 'fc_1.tmp_0': 1, 'fc_1.w_0': 1, 'fc_1.tmp_1@GRAD': 1, 'fc_0.tmp_2@GRAD': 1, 'fc_0.tmp_1@GRAD': 1, 'obs': 1}
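Assuming each line above is a {variable_name: reference_count} snapshot of the scope taken at a different iteration, a tiny helper makes it easy to confirm that nothing grows between snapshots (here every count stays at 1):

```python
# Sketch: diff two {variable_name: ref_count} snapshots; any variable whose
# count grows between iterations is a leak candidate.  The snapshots below
# are truncated copies of the ones printed above.
def diff_snapshots(before, after):
    grew = {name: (before.get(name, 0), count)
            for name, count in after.items()
            if count > before.get(name, 0)}
    vanished = sorted(set(before) - set(after))
    return grew, vanished

snap_1 = {'fc_0.b_0@GRAD': 1, 'fc_0.tmp_0@GRAD': 1, 'obs': 1}
snap_2 = {'fc_0.b_0@GRAD': 1, 'fc_0.tmp_0@GRAD': 1, 'obs': 1}

grew, vanished = diff_snapshots(snap_1, snap_2)
print("growing:", grew)        # {} -> nothing grows, so the scope looks clean
print("vanished:", vanished)   # []
```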
For your task, I suppose we should track the problem by using https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient
@dzhwinter Any fresh developments this week?
Sorry, you didn't get my point. The model I tested is the same as https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient. The top tool can only track the host memory used; as for the policy_gradient table above, it only suggests that maybe there is some circular-reference problem in pybind/Python. In a word, https://github.com/PaddlePaddle/models/tree/develop/fluid/policy_gradient **cannot** reproduce the memory leak problem.
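One way to probe the circular-reference suspicion on the Python/pybind side is the standard gc module; a generic sketch, not tied to fluid:

```python
# Sketch: ask the cycle collector which objects are only reclaimable via
# garbage collection, i.e. candidates for the suspected circular references.
import gc

gc.set_debug(gc.DEBUG_SAVEALL)   # keep unreachable objects in gc.garbage instead of freeing them

# ... run a few training iterations here ...

n = gc.collect()                 # number of unreachable objects found
print("objects found in reference cycles:", n)
for obj in gc.garbage[:20]:      # inspect a sample of them
    print(type(obj), repr(obj)[:80])
```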
Please do not use the prune interface; that one is buggy.
Without using the prune API, the memory leak problem was solved.
This issue has been fixed in the latest branch; please check it out.
With the new fluid API, I tried to reproduce some models written in TensorFlow, but the memory usage of the fluid program kept rising all the time, while the TF program was OK.
Here is my code