Another Agent57

A TensorFlow implementation of the deep reinforcement learning agent Agent57.

Agent57: Outperforming the Atari Human Benchmark. Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Charles Blundell.

Background on Agent57 and the Problem of Sparse Rewards

Deep reinforcement learning leverages deep learning to generalize the state-action value function of a Markov decision process in environments where the number of possible state-action pairs exceeds practical computing capabilities. However, if the rewards from the environment are few and far between, the agent may never learn what to do. To address this, the team at DeepMind introduced an intrinsic reward, generated by the agent itself, for exploring its environment. This generator increases the frequency of rewards so that the agent keeps exploring until it discovers the extrinsic rewards required for the desired behavior.

Intrinsic Reward Generator

The intrinsic reward is computed using a life-long novelty module to scale the reward generated by the episodic novelty module. The intrinsic and extrinsic rewards are then combined following a policy chosen by a multi-armed bandit at the start of each episode. This bandit allows the agent to determine the efficacy of the intrinsic reward for the environment and to choose the most appropriate trade-off at any point during training.
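
As a rough sketch of how the two signals are typically combined (this follows the Never Give Up and Agent57 papers rather than this repository's code; the function names and the max_scale value are illustrative):

```python
import numpy as np

def intrinsic_reward(episodic_novelty, lifelong_novelty, max_scale=5.0):
    # The life-long (RND-based) novelty multiplicatively scales the episodic
    # novelty, clipped to [1, max_scale] so it can only amplify it.
    return episodic_novelty * np.clip(lifelong_novelty, 1.0, max_scale)

def combined_reward(extrinsic, intrinsic, beta):
    # beta is the intrinsic-reward weight of the policy picked by the bandit.
    return extrinsic + beta * intrinsic
```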

State-Value Function

For Agent57, the state-value function is decomposed into two identical models: one for the extrinsic reward and one for the intrinsic reward. The outputs of the two are combined following the chosen policy.
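
A minimal sketch of that decomposition, assuming a TensorFlow/Keras model; the layer sizes and names below are illustrative and not taken from params.yaml:

```python
import tensorflow as tf

def build_value_net(num_actions):
    # Two copies of this network are created: one per reward stream.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_actions),
    ])

num_actions = 18
q_extrinsic = build_value_net(num_actions)
q_intrinsic = build_value_net(num_actions)

def q_values(observations, beta):
    # Combine the two streams according to the chosen policy's beta:
    # Q(x, a) = Q_e(x, a) + beta * Q_i(x, a)
    return q_extrinsic(observations) + beta * q_intrinsic(observations)
```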

Software Requirements

  • Python 3
  • Packages listed in requirements.txt
  • MySQL or MariaDB Server

Script Execution Order

The program is divided into six parts. With the current configuration, the learner and evaluator must be run on the same machine. The components must be started in the following order (a minimal launch sketch follows the list).

  1. MySQL/MariaDB
  2. replay_server.py
  3. bandit_server.py
  4. learner.py
  5. evaluator.py
  6. environment.py
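
A hypothetical convenience launcher (not part of the repository) that starts the Python components in the documented order; it assumes MySQL/MariaDB is already running and params.yaml has been configured:

```python
import subprocess
import time

SCRIPTS = ["replay_server.py", "bandit_server.py", "learner.py",
           "evaluator.py", "environment.py"]

processes = []
for script in SCRIPTS:
    processes.append(subprocess.Popen(["python", script]))
    time.sleep(5)  # give each service a moment to come up before the next
```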

Parameters

The parameters file params.yaml is divided into five sections, summarized below; a short loading sketch follows the table.

Section Description
Agent57 Defines the structure of the state-value function.
RND Defines the structure of the random network distillation (RND) model.
EmbeddingNetwork Defines the structure of the embedding network.
EpisodicMemory Size of episodic memory and reward parameters.
Misc Various global parameters.
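
A minimal sketch for loading and inspecting the file, assuming PyYAML is among the packages in requirements.txt and that the top-level keys match the section names above:

```python
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

for section in ("Agent57", "RND", "EmbeddingNetwork", "EpisodicMemory", "Misc"):
    print(section, type(params.get(section)))
```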

Agent57, RND, and Embedding Network

Layers can be added to and removed from the model, excluding the LSTM, so long as the order of layer types does not change.

Episodic Memory

Parameter Data Type Description
k int The number of nearest neighboring states used for the episodic reward.
max_size int The maximum number of states stored in memory each episode.
depth int The depth of the states stored in memory.
maximum_similarity float The largest similarity allowed when computing the episodic reward.
c float The pseudo-counts constant.
epsilon float A small constant to avoid dividing by zero.
cluster_distace float The distance below which two states are considered the same.
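
These parameters map onto the episodic pseudo-count reward described in the Never Give Up paper that Agent57 builds on. The sketch below follows that formulation rather than this repository's code, and omits the cluster_distace clamping for brevity:

```python
import numpy as np

def episodic_reward(embedding, memory, k, c, epsilon, maximum_similarity,
                    mean_sq_dist):
    # memory: embeddings of states already stored this episode, shape (n, d).
    if len(memory) == 0:
        return 1.0 / np.sqrt(c)
    # Squared distances to the k nearest stored states.
    dists = np.sum((memory - embedding) ** 2, axis=1)
    knn = np.sort(dists)[:k]
    # Inverse-kernel similarity: closer neighbours contribute more.
    similarity = np.sum(epsilon / (knn / max(mean_sq_dist, 1e-8) + epsilon))
    if similarity > maximum_similarity:
        return 0.0  # state is too familiar to be rewarded
    return 1.0 / np.sqrt(similarity + c)
```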

Miscellaneous

Parameter Data Type Description
dtype str The data type used by the model. <float16, float32, float64>
render_actor bool If true, the observations of the first and last actor on the environment server will be rendered by a separate process.
environment str The name of the gym environment being used.
obs_type str The observation type used by the agent.
obs_shape int[] The shape of the observation, including a dimension for the batch.
frameskip bool/int If the type of frameskip is an int, then the environment skips that number of frames each step.
reward_scale float Optional value to scale the extrinsic rewards.
zero_discount_on_life_loss bool If true and the environment returns 'lives' info, the recorded future discounted rewards are zero when the agent loses a life.
N int The range of policies explored by the agent.
L int The number of agents in a batch. Each agent has a different probability of taking a random action.
greed_e float The maximum probability of taking a random action.
greed_a float Controls the rate of decay of the random-action probability across the batch. The larger the value, the faster it decays.
evaluate_modified_reward bool If true, the modified reward is used to train the bandit, else the extrinsic reward is used.
bandit_window_size int The window of past rewards the bandit considers when choosing a policy.
bandit_beta float Scales the bandit's exploration factor.
bandit_e float Probability of returning a random policy.
bandit_save_period int The number of policy updates between bandit saves.
evaluation_iterations int The number of evaluation episodes before updating the model.
trace_length int Length of traces the model is trained on.
replay_period int Length of the trace used for the burn-in period before training.
min_required_sequences int The minimum number of traces required in the database before the learner begins training.
database_size_limit int Maximum size allowed for the database represented in GB.
target_free_space int The amount of free space the replay server checks for before beginning to clear space.
max_episode_length int The maximum number of steps the agent can take before the environment forcefully ends the episode.
bandit_ip str The IP address of the server hosting the bandit.
bandit_port int The port used by the server hosting the bandit.
bandit_workers int The number of workers the bandit server spawns.
weights_ip str The IP address of the server distributing the model weights. Usually the learning server.
weights_port int The port used by the server distributing the model weights.
weights_workers int The number of workers spawned by the weights server.
replay_ip str The IP address of the server managing the replay buffer.
replay_port_range int[] The range of ports used by the server managing the replay buffer.
replay_workers int The number of workers spawned by the replay server.
training_batch_size int The size of batches used by the learning server.
consecutive_training_batches int The number of batches the learning server is working on at any given time.
actor_weight_update int The number of steps taken by the actor between weight updates.
target_weight_update int The number of optimization steps between weight updates.
retrace_lambda int Weighs the importance of future temporal differences in the retrace target.
consecutive_batches int The number of environment batches being processed at one time.
batch_splits int The number of splits made in an environment batch. If this number is greater than one, then the batch is split into mini batches.
download_period int The number of seconds between weight downloads for the environment server.
split_stream bool If true, the batches are divided among the devices used by the agents. Otherwise, only the first two devices are used and the model's processes are split in half.
checkpoint_period int The interval at which a checkpoint is kept permanently rather than being automatically deleted by the learner.
break_training_loop_early bool If true and there are training splits, the learner will stop training on the trace after the first split.
training_splits int Number of splits in the trace during the update period. Used to reduce memory usage.
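
One plausible reading of greed_e, greed_a, and L is the Ape-X convention for giving each actor in a batch its own random-action probability; check the actor code for the exact formula used here:

```python
import numpy as np

def actor_epsilons(L, greed_e, greed_a):
    # Actor 0 explores with probability greed_e; higher-indexed actors decay
    # towards greedier behaviour, faster for larger greed_a.
    i = np.arange(L)
    return greed_e ** (1.0 + greed_a * i / max(L - 1, 1))

print(actor_epsilons(L=8, greed_e=0.4, greed_a=7.0))
```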

Database Connection

The program is configured to use the DEFAULT_CONFIG found in the database.py file.
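
For reference, a MySQL/MariaDB connection config typically has the shape sketched below; the keys and values are purely illustrative, and the actual DEFAULT_CONFIG inside database.py should be edited to match your server:

```python
# Illustrative only; see database.py for the real DEFAULT_CONFIG.
EXAMPLE_CONFIG = {
    "host": "127.0.0.1",
    "user": "agent57",
    "password": "change-me",
    "database": "replay",
}
```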

Additional Notes

The program was developed and tested in the PyCharm environment, so all imports are relative to the main project directory.

License

BSD 3-Clause