Changes to avoid memory leak in Rollout worker #161
Conversation
Currently in the rollout worker, we call restore_checkpoint repeatedly to load the latest model into memory. The restore_checkpoint function calls checkpoint_saver, and the checkpoint saver uses GlobalVariableSaver, which does not release the references to the previous model variables. This leads to a situation where memory keeps growing until the rollout worker crashes. This change avoids using the checkpoint saver in the rollout worker, as I believe it is not needed in this code path. I also added a test that easily reproduces the issue using the CartPole example. We were also seeing this issue with the AWS DeepRacer implementation, and the current change avoids the memory leak there as well.
…tpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop
@x77a1 I don't quite understand how the memory leak happens. I can't see
@safrooze I was able to narrow it down to the following statement in GlobalVariableSaver. Changing it to only fetch the values, as in the second form below, does not make the memory grow.
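(The exact snippets were not preserved in this thread; the following is a self-contained sketch of the assumed before/after forms, using dummy variables.)

```python
import numpy as np
import tensorflow as tf

variables = [tf.Variable(np.zeros(10, dtype=np.float32))]
new_values = [np.ones(10, dtype=np.float32)]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Assumed shape of the original statement: builds a brand-new assign op
    # (plus a const node for the value) on every call, so the graph grows
    # with each restore.
    sess.run([v.assign(val) for v, val in zip(variables, new_values)])

    # Assumed shape of the changed statement: only fetches the current
    # values, so no nodes are added; the memory stops growing, but note
    # that nothing is actually restored either.
    sess.run(variables)
```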
You're only fetching the current value in that call, as opposed to setting the value. I wonder if this is a TF issue with setting variables, but I wouldn't bet on it. Can you create a self-contained test that just sets some dummy TF variables? I don't have access to a machine until the end of December to try it out myself.
@safrooze @zach-nervana @galleibo-intel @Ajay191191 I researched this issue a little more. It seems that we add a new node to the graph with every assign operation, so each restore adds new nodes and the graph grows in size with every restore. I tried to freeze the graph before restore, and the test then crashed during restore. This TF issue seems similar to ours: tensorflow/tensorflow#4151. Profiling results hint at the same, with new objects added on every restore. What is the best way to handle this issue?
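(A self-contained reproduction along the lines suggested above; a sketch with a dummy variable, not the actual Coach test.)

```python
import tensorflow as tf

v = tf.Variable(0.0)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(3):
        sess.run(v.assign(float(i)))             # adds new nodes every time
        print(len(sess.graph.get_operations()))  # op count keeps increasing

    # Freezing the graph turns the silent growth into a hard error, which
    # matches the crash seen when restoring after a freeze:
    sess.graph.finalize()
    # sess.run(v.assign(99.0))  # would now raise: graph is finalized
```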
ISSUE: When we restore checkpoints, we create new nodes in the TensorFlow graph. This happens when we assign a new value (an op node) to a RefVariable in GlobalVariableSaver. With every restore the TF graph grows in size, because new nodes are created and the old unused nodes are never removed from the graph. This causes the memory leak in the restore_checkpoint code path.
FIX: We reset the TensorFlow graph and recreate the Global, Online and Target networks on every restore. This ensures that the old unused nodes in the TF graph are dropped.
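(In outline, this first fix looked something like the sketch below; helper names are hypothetical, not the actual commit.)

```python
import tensorflow as tf

# Discard the whole graph on each restore and rebuild the networks, so
# stale assign nodes are dropped along with everything else.
def restore_with_fresh_graph(checkpoint_path, build_networks, old_sess=None):
    if old_sess is not None:
        old_sess.close()              # release the previous session first
    tf.reset_default_graph()          # drop all accumulated nodes
    networks = build_networks()       # recreate Global/Online/Target networks
    saver = tf.train.Saver()
    sess = tf.Session()
    saver.restore(sess, checkpoint_path)
    return sess, networks
```

This avoids the growth, but it rebuilds the entire graph on every restore, which is heavyweight and, as discussed below, turned out to be flaky.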
@x77a1 nice find! Yes, it looks like the continuous assign calls keep adding nodes to the graph. The TF issue you linked has a great suggestion on how to deal with this situation, and I think that's what we have to implement. Do you feel comfortable modifying the GlobalVariableSaver accordingly?
This reverts commit 740f793.
This reverts commit c694766.
…with_cartpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop" This reverts commit b8d21c7.
This reverts commit 801aed5.
ISSUE: When we restore checkpoints, we create new nodes in the TensorFlow graph. This happens when we assign a new value (an op node) to a RefVariable in GlobalVariableSaver. With every restore the TF graph grows in size, because new nodes are created and the old unused nodes are never removed from the graph. This causes the memory leak in the restore_checkpoint code path.
FIX: We use a TF placeholder to update the variables, which avoids the memory leak.
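(A minimal sketch of the placeholder pattern; class and method names are illustrative, not the exact GlobalVariableSaver code.)

```python
import tensorflow as tf

# The placeholders and assign ops are built exactly once, so repeated
# restores reuse the same graph nodes instead of creating new ones.
class VariableRestorer(object):
    def __init__(self, variables):
        self._variables = variables
        self._placeholders = [tf.placeholder(v.dtype.base_dtype,
                                             shape=v.get_shape())
                              for v in variables]
        self._assign_ops = [v.assign(p)
                            for v, p in zip(variables, self._placeholders)]

    def restore(self, sess, values):
        # Feed the checkpoint values through the fixed placeholders; the
        # graph does not grow no matter how many times this is called.
        sess.run(self._assign_ops,
                 feed_dict=dict(zip(self._placeholders, values)))
```

Because the assign ops exist from construction time, the graph could even be finalized afterwards without breaking restores, which would have caught this leak immediately.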
@safrooze I agree with you. I was also not satisfied with my previous implementation; as you rightly pointed out, it was flaky. I read a little more about the placeholder approach, and it is indeed very clean. I have updated the code with this approach and tested it on my local box.
Looks great. Thanks @x77a1! A few minor suggestions.
LGTM. Great find and solution @x77a1!
LGTM
@Ajay191191 just for the future, I think it's better to squash merge the changes rather than merge all the changesets into the main repo.