This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Changes to avoid memory leak in Rollout worker #161

Merged
merged 13 commits into IntelLabs:master on Jan 4, 2019

Conversation

@x77a1 (Contributor) commented Dec 15, 2018

Currently in the rollout worker, we call restore_checkpoint repeatedly to load the latest model into memory. The restore_checkpoint function calls the checkpoint saver, which uses GlobalVariablesSaver. GlobalVariablesSaver does not release the references to the previous model's variables, so memory keeps growing until the rollout worker crashes.

This change avoids using the checkpoint saver in the rollout worker, as I believe it is not needed in this code path.

Also added a test that easily reproduces the issue using the CartPole example. We were also seeing this issue with the AWS DeepRacer implementation, and this change avoids the memory leak there as well.

x77a1 and others added 3 commits December 15, 2018 12:26
…tpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop
@safrooze (Contributor) commented:

@x77a1 I don't quite understand how the memory leak happens. I can't see GlobalVariableSaver.restore() holding onto anything. Here is the code:

    def restore(self, sess: Any, restore_path: str):
        """
        Restore from restore_path
        :param sess: active session for session-based frameworks (e.g. TF)
        :param restore_path: full path to load checkpoint from.
        """
        # We don't use saver.restore() because checkpoint is loaded to online network, but if the checkpoint
        # is from the global network, a namespace mismatch exists and variable name must be modified before loading.
        variables = dict()
        reader = tf.contrib.framework.load_checkpoint(restore_path)
        for var_name, _ in reader.get_variable_to_shape_map().items():
            # if variable was saved using global network, re-map it to online network
            # TODO: Can this be more generic so that `global/` and `online/` are not hardcoded here?
            new_name = var_name.replace('global/', 'online/')
            variables[new_name] = reader.get_tensor(var_name)
        # Assign all variables
        sess.run([v.assign(variables[v.name.split(':')[0]]) for v in self._variables])

@x77a1 (Contributor, Author) commented Dec 21, 2018

@safrooze I was able to narrow it down to the following statement in GlobalVariableSaver:

    sess.run([v.assign(variables[v.name.split(':')[0]]) for v in self._variables])

Changing it to the following does not make the memory grow:

    # Fetch all variables instead of assigning to them
    sess.run([v for v in self._variables])

@safrooze (Contributor) commented:

You're only fetching the current value in that call, as opposed to setting the value. I wonder if this is a TF issue with setting variables, but I wouldn't bet on it. Can you create a self-contained test that just sets some dummy TF variables? I don't have access to a machine until the end of December to try it out myself.
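
A minimal sketch of such a self-contained test, assuming TF 1.x graph mode (the dummy variable and loop below are illustrative, not from the PR). Each call to v.assign() inside the loop adds new ops to the default graph, so the op count and memory grow on every iteration:

    import numpy as np
    import tensorflow as tf

    # Illustrative dummy variable standing in for a model weight.
    v = tf.Variable(np.zeros((1000, 1000), dtype=np.float32), name='dummy')

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(100):
            new_value = np.random.rand(1000, 1000).astype(np.float32)
            # Each v.assign() call adds a new assign op (plus a Const holding
            # new_value) to the default graph, so memory grows every iteration.
            sess.run(v.assign(new_value))
            print(i, 'ops in graph:', len(tf.get_default_graph().get_operations()))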

@x77a1 (Contributor, Author) commented Dec 26, 2018

@safrooze @zach-nervana @galleibo-intel @Ajay191191 I researched this issue a little more. It seems that the assign operation adds a new node to the graph, so each restore adds new nodes and the graph grows in size with every restore. I tried to freeze the graph before restoring, and the test crashed during restore. This TF issue seems similar to ours: tensorflow/tensorflow#4151

Profiling results hint at the same. These are the new objects added on every restore:

                                                                types |   # objects |   total size
===================================================================== | =========== | ============
                                                        <class 'tuple |         696 |     65.25 KB
                                                         <class 'dict |         168 |     29.25 KB
                                                         <class 'list |         264 |     24.96 KB
                                                          <class 'int |         729 |     19.93 KB
         <class 'tensorflow.python.framework.ops.Operation._InputList |          48 |      2.62 KB
                       <class 'tensorflow.python.framework.ops.Tensor |          48 |      2.62 KB
                    <class 'tensorflow.python.framework.ops.Operation |          48 |      2.62 KB
                                                 <class 'SwigPyObject |          48 |      2.25 KB
  <class 'tensorflow.python.framework.traceable_stack.TraceableObject |          24 |      1.31 KB

What is the best way to handle this issue?
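
The table above looks like a per-iteration object summary from a memory profiler. The PR does not name the tool, but a diff like that can be collected with something like pympler; restore_checkpoint() below is a hypothetical stand-in for the rollout worker's restore call:

    from pympler import tracker

    tr = tracker.SummaryTracker()
    for _ in range(10):
        restore_checkpoint()  # hypothetical stand-in for the rollout worker's restore
        tr.print_diff()       # prints new objects (type, count, total size) since the last call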

x77a1 and others added 3 commits December 25, 2018 21:04
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (an op node) to
a RefVariable in GlobalVariableSaver. With every restore the TF
graph grows in size, because new nodes are created and the old unused
nodes are not removed from the graph. This causes the memory leak in the
restore_checkpoint code path.

FIX: We reset the TensorFlow graph and recreate the Global, Online and
Target networks on every restore. This ensures that the old unused nodes
in the TF graph are dropped.
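
Roughly, that fix amounts to the following hedged sketch (in Coach the networks are rebuilt through the graph manager; build_networks() here is a hypothetical stand-in):

    import tensorflow as tf

    def restore_with_graph_reset(checkpoint_path, build_networks):
        # Throw away the old graph, including all previously accumulated assign ops.
        tf.reset_default_graph()
        # Recreate the Global, Online and Target networks from scratch ...
        build_networks()
        # ... and restore the latest weights into the fresh graph.
        sess = tf.Session()
        saver = tf.train.Saver()
        saver.restore(sess, checkpoint_path)
        return sess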
@safrooze (Contributor) commented Jan 3, 2019

@x77a1 nice find! Yes, it looks like the repeated tf.assign() calls do indeed result in a memory leak. However, your solution, which creates a completely new graph for every restore, seems hacky and fragile. For one, it creates a dependency on the order of operations between separate modules (i.e. restore must be called on graph_manager before calling restore on the savers), which can create a silent regression.

The TF issue you linked has a great suggestion on how to deal with such a situation, and I think that's what we have to implement. Do you feel comfortable modifying GlobalVariableSaver to use placeholders so that new assign ops are not created on every restore?
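
For reference, a minimal sketch of the placeholder pattern being suggested (illustrative only, assuming TF 1.x; the class and names below are not the actual patch): the placeholders and assign ops are built once, and each restore only feeds new values through feed_dict, so no new graph nodes are created.

    import tensorflow as tf

    class PlaceholderAssigner(object):
        """Builds one placeholder and one assign op per variable, exactly once."""
        def __init__(self, variables):
            self._variables = variables
            self._placeholders = [tf.placeholder(v.dtype.base_dtype, shape=v.get_shape())
                                  for v in variables]
            self._assign_ops = [v.assign(ph)
                                for v, ph in zip(variables, self._placeholders)]

        def assign_all(self, sess, values_by_name):
            # Reuses the pre-built ops; only the fed values change between restores.
            feed = {ph: values_by_name[v.name.split(':')[0]]
                    for v, ph in zip(self._variables, self._placeholders)}
            sess.run(self._assign_ops, feed_dict=feed)

In the restore path, values_by_name would come from the checkpoint reader (with the global/ to online/ renaming shown earlier), and only the feed_dict changes between restores.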

This reverts commit 740f793.
…with_cartpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop"

This reverts commit b8d21c7.
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (an op node) to
a RefVariable in GlobalVariableSaver. With every restore the TF
graph grows in size, because new nodes are created and the old unused
nodes are not removed from the graph. This causes the memory leak in the
restore_checkpoint code path.

FIX: We use a TF placeholder to update the variables, which avoids the
memory leak.
@x77a1 (Contributor, Author) commented Jan 3, 2019

@safrooze I agree with you. I was also not satisfied with my previous implementation; as you rightly pointed out, it was fragile. I read a little more about the placeholder approach, and it's indeed very clean. I have updated the code with this approach and tested it on my local box.

@safrooze (Contributor) left a review:

Looks great. Thanks @x77a1! A few minor suggestions.

@safrooze (Contributor) left a review:

LGTM. Great find and solution @x77a1!

@Ajay191191 (Contributor) left a review:

LGTM

@Ajay191191 merged commit 8a1ea3d into IntelLabs:master on Jan 4, 2019
@safrooze (Contributor) commented Jan 4, 2019

@Ajay191191 just for the future, I think it's better to squash-merge changes rather than merging all of the individual changesets into the main repo.
