This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Changes to avoid memory leak in Rollout worker #161

Merged
13 commits merged on Jan 4, 2019

Commits on Dec 15, 2018

  1. Changes to avoid memory leak in rollout worker

    Currently, the rollout worker calls restore_checkpoint repeatedly to load the latest model into memory. restore_checkpoint calls the checkpoint saver, which uses GlobalVariablesSaver; GlobalVariablesSaver does not release references to the previous model's variables, so memory keeps growing until the rollout worker crashes.

    This change avoids using the checkpoint saver in the rollout worker, as I believe it is not needed in this code path.

    It also adds a test that reproduces the issue with the CartPole example. We were seeing the same issue in the AWS DeepRacer implementation, and this change avoids the memory leak there as well.
    x77a1 committed Dec 15, 2018
    801aed5
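
For context, here is a minimal, self-contained sketch (not from this PR) of the failure mode described above, assuming TensorFlow 1.x (current at the time of this PR): assigning a fresh value to a variable on every restore keeps adding nodes to the default graph, so repeated restores grow memory without bound.

    import tensorflow as tf  # assumes TensorFlow 1.x

    var = tf.get_variable("model_weight", shape=[1])  # stand-in for the model
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(3):  # each iteration stands in for one checkpoint restore
            # Every tf.assign call adds a new Const node and a new Assign node
            # to the default graph, so graph size (and memory) keeps growing.
            sess.run(tf.assign(var, [float(step)]))
            print(len(tf.get_default_graph().get_operations()))  # increases each time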

Commits on Dec 16, 2018

  1. 1f0980c
  2. Comment out the part of the test in 'test_basic_rl_graph_manager_with_cartpole_dqn_and_repeated_checkpoint_restore' that runs in an infinite loop
    x77a1 committed Dec 16, 2018
    b8d21c7

Commits on Dec 17, 2018

  1. 02f2db1

Commits on Dec 26, 2018

  1. Avoid Memory Leak in Rollout worker

    ISSUE: When we restore a checkpoint, we create new nodes in the
    TensorFlow graph: GlobalVariableSaver assigns a new value to each
    RefVariable, and every assignment adds an op node. With every restore
    the TF graph grows, because new nodes are created while the old,
    unused nodes are never removed. This causes the memory leak in the
    restore_checkpoint code path.

    FIX: Reset the TensorFlow graph and recreate the global, online and
    target networks on every restore, ensuring the old, unused nodes are
    dropped from the TF graph. (A sketch of this approach follows the
    commit list below.)
    x77a1 committed Dec 26, 2018
    c694766
  2. 73c4c85
  3. Updated comments

    x77a1 committed Dec 26, 2018
    740f793
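
A hypothetical sketch of the reset-and-rebuild fix from the first commit above, assuming TensorFlow 1.x; restore_latest and the single stand-in variable are illustrative, not the PR's actual code:

    import tensorflow as tf  # assumes TensorFlow 1.x

    def restore_latest(checkpoint_dir):
        # Drop the old graph first, so nodes accumulated by previous
        # restores are discarded rather than piling up.
        tf.reset_default_graph()
        # Recreate the networks; one variable stands in here for the
        # global, online and target networks.
        var = tf.get_variable("model_weight", shape=[1])
        saver = tf.train.Saver()
        sess = tf.Session()
        saver.restore(sess, tf.train.latest_checkpoint(checkpoint_dir))
        return sess

As the Jan 3 commits show, this approach was later reverted in favor of the placeholder-based fix.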

Commits on Jan 3, 2019

  1. Revert "Updated comments"

    This reverts commit 740f793.
    x77a1 committed Jan 3, 2019
    2461892
  2. Revert "Avoid Memory Leak in Rollout worker"

    This reverts commit c694766.
    x77a1 committed Jan 3, 2019
    6dd7ae2
  3. Revert "comment out the part of test in 'test_basic_rl_graph_manager_…

    …with_cartpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop"
    
    This reverts commit b8d21c7.
    x77a1 committed Jan 3, 2019
    779d369
  4. Revert "Changes to avoid memory leak in rollout worker"

    This reverts commit 801aed5.
    x77a1 committed Jan 3, 2019
    c377363
  5. Avoid Memory Leak in Rollout worker

    ISSUE: When we restore a checkpoint, we create new nodes in the
    TensorFlow graph: GlobalVariableSaver assigns a new value to each
    RefVariable, and every assignment adds an op node. With every restore
    the TF graph grows, because new nodes are created while the old,
    unused nodes are never removed. This causes the memory leak in the
    restore_checkpoint code path.

    FIX: Update the variables through a TF placeholder and a single,
    reused assign op, which avoids the memory leak. (A sketch follows the
    commit list below.)
    x77a1 committed Jan 3, 2019
    619ea09
  6. Refactored GlobalVariableSaver

    x77a1 committed Jan 3, 2019
    b1e9ea4
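
A minimal, self-contained sketch (not the PR's actual code) of the placeholder-based fix described in commit 5 above, assuming TensorFlow 1.x: the assign op is built once, and each restore feeds the new values in through a placeholder, so the graph stops growing.

    import numpy as np
    import tensorflow as tf  # assumes TensorFlow 1.x

    var = tf.get_variable("model_weight", shape=[2])  # stand-in for the model
    # Build the assign op once; base_dtype strips the '_ref' suffix from the
    # variable's dtype so the placeholder gets a plain float32.
    new_value = tf.placeholder(var.dtype.base_dtype, shape=var.get_shape())
    assign_op = var.assign(new_value)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        ops_before = len(tf.get_default_graph().get_operations())
        for step in range(3):  # stands in for repeated checkpoint restores
            # The same assign op is reused; only the fed value changes,
            # so no new nodes are added to the graph per restore.
            sess.run(assign_op, feed_dict={new_value: np.full([2], step, np.float32)})
        assert len(tf.get_default_graph().get_operations()) == ops_before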