This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Changes to avoid memory leak in Rollout worker #161

Merged
merged 13 commits into IntelLabs:master on Jan 4, 2019

Conversation

@x77a1 (Contributor) commented Dec 15, 2018

Currently in the rollout worker, we call restore_checkpoint repeatedly to load the latest model into memory. The restore_checkpoint function calls the checkpoint saver, which uses GlobalVariablesSaver. GlobalVariablesSaver does not release the references to the previous model's variables, so memory keeps growing until the rollout worker crashes.

This change avoids using the checkpoint saver in the rollout worker, as I believe it is not needed in this code path.

Also added a test that easily reproduces the issue using the CartPole example. We were also seeing this issue with the AWS DeepRacer implementation, and this change avoids the memory leak there as well.

x77a1 and others added 3 commits December 15, 2018 12:26
…tpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop
@safrooze (Contributor) commented:

@x77a1 I don't quite understand how the memory leak happens. I can't see GlobalVariableSaver.restore() holding onto anything. Here is the code:

    def restore(self, sess: Any, restore_path: str):
        """
        Restore from restore_path
        :param sess: active session for session-based frameworks (e.g. TF)
        :param restore_path: full path to load checkpoint from.
        """
        # We don't use saver.restore() because checkpoint is loaded to online network, but if the checkpoint
        # is from the global network, a namespace mismatch exists and variable name must be modified before loading.
        variables = dict()
        reader = tf.contrib.framework.load_checkpoint(restore_path)
        for var_name, _ in reader.get_variable_to_shape_map().items():
            # if variable was saved using global network, re-map it to online network
            # TODO: Can this be more generic so that `global/` and `online/` are not hardcoded here?
            new_name = var_name.replace('global/', 'online/')
            variables[new_name] = reader.get_tensor(var_name)
        # Assign all variables
        sess.run([v.assign(variables[v.name.split(':')[0]]) for v in self._variables])

@x77a1 (Contributor, Author) commented Dec 21, 2018

@safrooze I was able to narrow it down to the following statement in GlobalVariableSaver:

    sess.run([v.assign(variables[v.name.split(':')[0]]) for v in self._variables])

Changing it to the following does not make the memory grow:

    # Fetch all variables instead of assigning to them
    sess.run([v for v in self._variables])

@safrooze (Contributor) commented:

You're only fetching the current value in that call, as opposed to setting the value. I wonder if this is a TF issue with setting variables, but I wouldn't bet on it. Can you create a self-contained test that just sets some dummy TF variables? I don't have access to a machine until the end of December to try it out myself.
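
A minimal sketch of such a self-contained test, assuming TF 1.x graph mode (the dummy variable and loop below are illustrative, not from the PR). Each call to v.assign() inside the loop adds new ops to the default graph, so the op count and memory grow on every iteration:

    import numpy as np
    import tensorflow as tf

    # Illustrative dummy variable standing in for a model weight.
    v = tf.Variable(np.zeros((1000, 1000), dtype=np.float32), name='dummy')

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(100):
            new_value = np.random.rand(1000, 1000).astype(np.float32)
            # Each v.assign() call adds a new assign op (plus a Const holding
            # new_value) to the default graph, so memory grows every iteration.
            sess.run(v.assign(new_value))
            print(i, 'ops in graph:', len(tf.get_default_graph().get_operations()))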

@x77a1 (Contributor, Author) commented Dec 26, 2018

@safrooze @zach-nervana @galleibo-intel @Ajay191191 I researched this issue a little more. It seems that the assign operation adds a new node to the graph, so each restore adds new nodes and the graph grows in size with every restore. I tried to freeze the graph before restoring, and the test crashed during restore. This TF issue seems similar to ours: tensorflow/tensorflow#4151

Profiling results hint at the same. These are the new objects added on every restore:

                                                                types |   # objects |   total size
===================================================================== | =========== | ============
                                                        <class 'tuple |         696 |     65.25 KB
                                                         <class 'dict |         168 |     29.25 KB
                                                         <class 'list |         264 |     24.96 KB
                                                          <class 'int |         729 |     19.93 KB
         <class 'tensorflow.python.framework.ops.Operation._InputList |          48 |      2.62 KB
                       <class 'tensorflow.python.framework.ops.Tensor |          48 |      2.62 KB
                    <class 'tensorflow.python.framework.ops.Operation |          48 |      2.62 KB
                                                 <class 'SwigPyObject |          48 |      2.25 KB
  <class 'tensorflow.python.framework.traceable_stack.TraceableObject |          24 |      1.31 KB

What is the best way to handle this issue?
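
The table above looks like a per-iteration object summary from a memory profiler. The PR does not name the tool, but a diff like that can be collected with something like pympler; restore_checkpoint() below is a hypothetical stand-in for the rollout worker's restore call:

    from pympler import tracker

    tr = tracker.SummaryTracker()
    for _ in range(10):
        restore_checkpoint()  # hypothetical stand-in for the rollout worker's restore
        tr.print_diff()       # prints new objects (type, count, total size) since the last call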

x77a1 and others added 3 commits December 25, 2018 21:04
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (an op node) to
a RefVariable in GlobalVariableSaver. With every restore the TF
graph grows in size, because new nodes are created and the old unused
nodes are not removed from the graph. This causes the memory leak in the
restore_checkpoint code path.

FIX: We reset the TensorFlow graph and recreate the Global, Online and
Target networks on every restore. This ensures that the old unused nodes
in the TF graph are dropped.
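
Roughly, that fix amounts to the following hedged sketch (in Coach the networks are rebuilt through the graph manager; build_networks() here is a hypothetical stand-in):

    import tensorflow as tf

    def restore_with_graph_reset(checkpoint_path, build_networks):
        # Throw away the old graph, including all previously accumulated assign ops.
        tf.reset_default_graph()
        # Recreate the Global, Online and Target networks from scratch ...
        build_networks()
        # ... and restore the latest weights into the fresh graph.
        sess = tf.Session()
        saver = tf.train.Saver()
        saver.restore(sess, checkpoint_path)
        return sess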
@safrooze (Contributor) commented Jan 3, 2019

@x77a1 nice find! Yes, it looks like the repeated tf.assign() calls do indeed result in a memory leak. However, your solution, which creates a completely new graph for every restore, seems hacky and fragile. For one, it creates a dependency on the order of operations between separate modules (i.e. restore must be called on graph_manager before calling restore on the savers), which can create a silent regression.

The TF issue you linked has a great suggestion on how to deal with such a situation, and I think that's what we have to implement. Do you feel comfortable modifying GlobalVariableSaver to use placeholders so that new assign ops are not created on every restore?
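
For reference, a minimal sketch of the placeholder pattern being suggested (illustrative only, assuming TF 1.x; the class and names below are not the actual patch): the placeholders and assign ops are built once, and each restore only feeds new values through feed_dict, so no new graph nodes are created.

    import tensorflow as tf

    class PlaceholderAssigner(object):
        """Builds one placeholder and one assign op per variable, exactly once."""
        def __init__(self, variables):
            self._variables = variables
            self._placeholders = [tf.placeholder(v.dtype.base_dtype, shape=v.get_shape())
                                  for v in variables]
            self._assign_ops = [v.assign(ph)
                                for v, ph in zip(variables, self._placeholders)]

        def assign_all(self, sess, values_by_name):
            # Reuses the pre-built ops; only the fed values change between restores.
            feed = {ph: values_by_name[v.name.split(':')[0]]
                    for v, ph in zip(self._variables, self._placeholders)}
            sess.run(self._assign_ops, feed_dict=feed)

In the restore path, values_by_name would come from the checkpoint reader (with the global/ to online/ renaming shown earlier), and only the feed_dict changes between restores.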

This reverts commit 740f793.
…with_cartpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop"

This reverts commit b8d21c7.
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (an op node) to
a RefVariable in GlobalVariableSaver. With every restore the TF
graph grows in size, because new nodes are created and the old unused
nodes are not removed from the graph. This causes the memory leak in the
restore_checkpoint code path.

FIX: We use a TF placeholder to update the variables, which avoids the
memory leak.
@x77a1 (Contributor, Author) commented Jan 3, 2019

@safrooze I agree with you. I was also not satisfied with my previous implementation; as you rightly pointed out, it was fragile. I read a little more about the placeholder approach, and it's indeed very clean. I have updated the code with this approach and tested it on my local box.

@safrooze (Contributor) left a review:

Looks great. Thanks @x77a1! A few minor suggestions.

@safrooze (Contributor) left a review:

LGTM. Great find and solution @x77a1!

@Ajay191191 (Contributor) left a review:

LGTM

@Ajay191191 merged commit 8a1ea3d into IntelLabs:master on Jan 4, 2019
@safrooze (Contributor) commented Jan 4, 2019

@Ajay191191 just for the future, I think it's better to squash-merge changes rather than merging all of the individual changesets into the main repo.
