This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Changes to avoid memory leak in Rollout worker #161

Merged
13 commits merged on Jan 4, 2019

Commits on Dec 15, 2018

  1. Changes to avoid memory leak in rollout worker

    Currently, the rollout worker calls restore_checkpoint repeatedly to load the latest model into memory. restore_checkpoint calls the checkpoint saver, which uses GlobalVariablesSaver; GlobalVariablesSaver does not release references to the previous model's variables, so memory keeps growing until the rollout worker crashes.

    This change avoids using the checkpoint saver in the rollout worker, as I believe it is not needed in this code path.

    It also adds a test that reproduces the issue with the CartPole example. We were seeing the same issue in the AWS DeepRacer implementation, and this change avoids the memory leak there as well.
    x77a1 committed Dec 15, 2018
    801aed5
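
For context, here is a minimal, self-contained sketch (not from this PR) of the failure mode described above, assuming TensorFlow 1.x (current at the time of this PR): assigning a fresh value to a variable on every restore keeps adding nodes to the default graph, so repeated restores grow memory without bound.

    import tensorflow as tf  # assumes TensorFlow 1.x

    var = tf.get_variable("model_weight", shape=[1])  # stand-in for the model
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(3):  # each iteration stands in for one checkpoint restore
            # Every tf.assign call adds a new Const node and a new Assign node
            # to the default graph, so graph size (and memory) keeps growing.
            sess.run(tf.assign(var, [float(step)]))
            print(len(tf.get_default_graph().get_operations()))  # increases each time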

Commits on Dec 16, 2018

  1. 1f0980c
  2. Comment out the part of the test in 'test_basic_rl_graph_manager_with_cartpole_dqn_and_repeated_checkpoint_restore' that runs in an infinite loop
    x77a1 committed Dec 16, 2018
    b8d21c7

Commits on Dec 17, 2018

  1. 02f2db1

Commits on Dec 26, 2018

  1. Avoid Memory Leak in Rollout worker

    ISSUE: When we restore a checkpoint, we create new nodes in the
    TensorFlow graph: GlobalVariableSaver assigns a new value to each
    RefVariable, and every assignment adds an op node. With every restore
    the TF graph grows, because new nodes are created while the old,
    unused nodes are never removed. This causes the memory leak in the
    restore_checkpoint code path.

    FIX: Reset the TensorFlow graph and recreate the global, online and
    target networks on every restore, ensuring the old, unused nodes are
    dropped from the TF graph. (A sketch of this approach follows the
    commit list below.)
    x77a1 committed Dec 26, 2018
    c694766
  2. 73c4c85
  3. Updated comments

    x77a1 committed Dec 26, 2018
    740f793
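
A hypothetical sketch of the reset-and-rebuild fix from the first commit above, assuming TensorFlow 1.x; restore_latest and the single stand-in variable are illustrative, not the PR's actual code:

    import tensorflow as tf  # assumes TensorFlow 1.x

    def restore_latest(checkpoint_dir):
        # Drop the old graph first, so nodes accumulated by previous
        # restores are discarded rather than piling up.
        tf.reset_default_graph()
        # Recreate the networks; one variable stands in here for the
        # global, online and target networks.
        var = tf.get_variable("model_weight", shape=[1])
        saver = tf.train.Saver()
        sess = tf.Session()
        saver.restore(sess, tf.train.latest_checkpoint(checkpoint_dir))
        return sess

As the Jan 3 commits show, this approach was later reverted in favor of the placeholder-based fix.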

Commits on Jan 3, 2019

  1. Revert "Updated comments"

    This reverts commit 740f793.
    x77a1 committed Jan 3, 2019
    2461892
  2. Revert "Avoid Memory Leak in Rollout worker"

    This reverts commit c694766.
    x77a1 committed Jan 3, 2019
    6dd7ae2
  3. Revert "comment out the part of test in 'test_basic_rl_graph_manager_…

    …with_cartpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop"
    
    This reverts commit b8d21c7.
    x77a1 committed Jan 3, 2019
    779d369
  4. Revert "Changes to avoid memory leak in rollout worker"

    This reverts commit 801aed5.
    x77a1 committed Jan 3, 2019
    c377363
  5. Avoid Memory Leak in Rollout worker

    ISSUE: When we restore a checkpoint, we create new nodes in the
    TensorFlow graph: GlobalVariableSaver assigns a new value to each
    RefVariable, and every assignment adds an op node. With every restore
    the TF graph grows, because new nodes are created while the old,
    unused nodes are never removed. This causes the memory leak in the
    restore_checkpoint code path.

    FIX: Update the variables through a TF placeholder and a single,
    reused assign op, which avoids the memory leak. (A sketch follows the
    commit list below.)
    x77a1 committed Jan 3, 2019
    619ea09
  6. Refactored GlobalVariableSaver

    x77a1 committed Jan 3, 2019
    b1e9ea4
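
A minimal, self-contained sketch (not the PR's actual code) of the placeholder-based fix described in commit 5 above, assuming TensorFlow 1.x: the assign op is built once, and each restore feeds the new values in through a placeholder, so the graph stops growing.

    import numpy as np
    import tensorflow as tf  # assumes TensorFlow 1.x

    var = tf.get_variable("model_weight", shape=[2])  # stand-in for the model
    # Build the assign op once; base_dtype strips the '_ref' suffix from the
    # variable's dtype so the placeholder gets a plain float32.
    new_value = tf.placeholder(var.dtype.base_dtype, shape=var.get_shape())
    assign_op = var.assign(new_value)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        ops_before = len(tf.get_default_graph().get_operations())
        for step in range(3):  # stands in for repeated checkpoint restores
            # The same assign op is reused; only the fed value changes,
            # so no new nodes are added to the graph per restore.
            sess.run(assign_op, feed_dict={new_value: np.full([2], step, np.float32)})
        assert len(tf.get_default_graph().get_operations()) == ops_before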