Another Agent57

A TensorFlow Implementation of the Deep Reinforcement Learning Agent Agent57

Agent57: Outperforming the Atari Human Benchmark, by Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, and Charles Blundell

Background on Agent57 and the Problem of Sparse Rewards

Deep reinforcement learning uses deep neural networks to approximate the state-action value function of a Markov decision process when the number of possible state-action pairs is too large to enumerate. However, when rewards from the environment are few and far between, the agent may never learn what to do. To address this, the team at DeepMind introduced an intrinsic reward that the agent generates for itself as it explores its environment. The intrinsic reward increases how often the agent receives a reward signal, so it keeps exploring until it discovers the extrinsic rewards that define the desired behavior.

The intrinsic reward is computed using a life-long novelty module to scale the reward generated by the episodic novelty module. The intrinsic and extrinsic rewards are then combined following a policy chosen by a multi-armed bandit at the start of the episode. This bandit allows the agent to determine the efficacy of the intrinsic reward for the environment and choose the most appropriate trade-off at any point during training.
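As a rough illustration of the paragraph above, the sketch below scales an episodic novelty reward by a life-long novelty modulator and then mixes the result with the extrinsic reward. The function names, the `max_scale` clip of 5.0, and the mixing weight `beta` are assumptions made for this example; they are not taken from this repository's code.

```python
def intrinsic_reward(episodic_reward, lifelong_modulator, max_scale=5.0):
    """Scale the episodic novelty reward by the life-long novelty modulator.

    The modulator is clipped to [1, max_scale] (an assumed range), so
    life-long novelty can amplify the episodic reward but never shrink it.
    """
    scale = min(max(lifelong_modulator, 1.0), max_scale)
    return episodic_reward * scale


def combined_reward(extrinsic_reward, intrinsic, beta):
    """Mix the two rewards; beta is the weight set by the chosen policy."""
    return extrinsic_reward + beta * intrinsic
```

With `beta` set to zero the agent sees only the extrinsic reward, which is the purely exploitative end of the trade-off the bandit chooses from.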

For Agent57, the state-action value function is decomposed into two models with identical architectures: one for the extrinsic reward and one for the intrinsic reward. The outputs of the two models are combined according to the chosen policy.
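One way to picture the combination is a weighted sum of the two models' outputs, where the weight comes from the chosen policy. The sketch below is a simplified assumption: the names `combine_q_values`, `greedy_action`, and `beta` are hypothetical, and it ignores any value rescaling an implementation may apply before or after the sum.

```python
def combine_q_values(q_extrinsic, q_intrinsic, beta):
    """Return per-action values Q_e(a) + beta * Q_i(a)."""
    return [qe + beta * qi for qe, qi in zip(q_extrinsic, q_intrinsic)]


def greedy_action(q_values):
    """Index of the action with the largest combined value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical outputs of the two models for three actions.
q_e = [1.0, 0.5, 0.0]
q_i = [0.0, 1.0, 4.0]
exploit_action = greedy_action(combine_q_values(q_e, q_i, beta=0.0))
explore_action = greedy_action(combine_q_values(q_e, q_i, beta=0.3))
```

In this toy example a larger `beta` shifts the greedy choice toward the action the intrinsic model rates as most novel.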

Software Requirements

Python 3
Packages in the requirements.txt
MySQL or MariaDB Server

Script Execution Order

The program is divided into six parts. With the current configuration, the learner and evaluator must be run on the same machine. The program must be executed in this order.

Parameters

The parameters file params.yml is divided into five sections.

Section	Description
Agent57	Defines the structure of the state-action value function.
RND	Defines the structure of the random network distillation (RND) network.
EmbeddingNetwork	Defines the structure of the embedding network.
EpisodicMemory	Size of episodic memory and reward parameters.
Misc	Various global parameters.

Agent57, RND, and Embedding Network

Layers can be added to and removed from the model, with the exception of the LSTM, so long as the order of layer types does not change.

Episodic Memory

Parameter	Data Type	Description
k	int	The number of nearest neighboring states used for the episodic reward.
max_size	int	The maximum number of states stored in memory each episode.
depth	int	The depth of the states stored in memory.
maximum_similarity	float	The largest similarity allowed to compute the episodic reward.
c	float	The pseudo-counts constant.
epsilon	float	A small constant to avoid dividing by zero.
cluster_distace	float	The distance between states for them to be considered the same.
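To show how the parameters in this table could fit together, here is a minimal sketch of a k-nearest-neighbor episodic reward in the style of the Never Give Up agent that Agent57 builds on. It is an illustration under stated assumptions, not this repository's implementation: the function name, the `mean_sq_dist` argument (standing in for a running average of neighbor distances), and every default value are hypothetical and not read from params.yml.

```python
import math


def episodic_reward(embedding, memory, k=10, mean_sq_dist=1.0,
                    cluster_distance=0.008, epsilon=1e-4, c=1e-3,
                    maximum_similarity=8.0):
    """Reward states whose embedding is far from those seen this episode."""
    # Squared Euclidean distance to each stored embedding, k smallest kept.
    sq_dists = sorted(
        sum((a - b) ** 2 for a, b in zip(embedding, stored))
        for stored in memory
    )[:k]

    kernel_sum = 0.0
    for d in sq_dists:
        d_norm = d / mean_sq_dist                      # normalize the distance
        d_norm = max(d_norm - cluster_distance, 0.0)   # treat near states as equal
        kernel_sum += epsilon / (d_norm + epsilon)     # inverse kernel

    similarity = math.sqrt(kernel_sum) + c             # c keeps this positive
    if similarity > maximum_similarity:
        return 0.0                                     # too familiar: no reward
    return 1.0 / similarity
```

A state that matches memory closely yields a similarity near 1 and a reward near 1, while a state far from everything in memory yields a much larger reward.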

Miscellaneous

Parameter	Data Type	Description
dtype	str	The data type used by the model. One of float16, float32, or float64.
render_actor	bool	If true, the observations of the first and last actor on the environment server will be rendered by a separate process.
environment	str	The name of the gym environment being used.
obs_type	str	The observation type used by the agent.
obs_shape	int[]	The shape of the observation, including a dimension for the batch.
frameskip	bool/int	If the type of frameskip is an int, then the environment skips that number of frames each step.
reward_scale	float	Optional value to scale the extrinsic rewards.
zero_discount_on_life_loss	bool	If true and the environment returns 'lives' info, the future discounted rewards recorded are zero when the agent loses a life.
N	int	The range of policies explored by the agent.
L	int	The number of agents in a batch. Each agent has a different probability of taking a random action.
greed_e	float	The maximum probability of taking a random action.
greed_a	float	Controls the rate of decay for selecting the random action probability across the batch. The larger the value, the faster it decays.
evaluate_modified_reward	bool	If true, the modified reward is used to train the bandit, else the extrinsic reward is used.
bandit_window_size	int	The window of past rewards the bandit considers when choosing a policy.
bandit_beta	float	Scales the bandit's exploration factor.
bandit_e	float	Probability of returning a random policy.
bandit_save_period	int	The number of policy updates between bandit saves.
evaluation_iterations	int	The number of evaluation episodes before updating the model.
trace_length	int	Length of traces the model is trained on.
replay_period	int	Length of the trace used for the burn-in period before training.
min_required_sequences	int	The minimum number of traces required in the database before the learner begins training.
database_size_limit	int	The maximum size allowed for the database, in GB.
target_free_space	int	The amount of free space the replay server checks for before beginning to clear space.
max_episode_length	int	The maximum number of steps the agent can take before the environment forcefully ends the episode.
bandit_ip	str	The IP address of the server hosting the bandit.
bandit_port	int	The port used by the server hosting the bandit.
bandit_workers	int	The number of workers the bandit server spawns.
weights_ip	str	The IP address of the server distributing the model weights. Usually the learning server.
weights_port	int	The port used by the server distributing the model weights.
weights_workers	int	The number of workers spawned by the weights server.
replay_ip	str	The IP address of the server managing the replay buffer.
replay_port_range	int[]	The range of ports used by the server managing the replay buffer.
replay_workers	int	The number of workers spawned by the replay server.
training_batch_size	int	The size of batches used by the learning server.
consecutive_training_batches	int	The number of batches the learning server is working on at any given time.
actor_weight_update	int	The number of steps taken by the actor between weight updates.
target_weight_update	int	The number of optimization steps between weight updates.
retrace_lambda	int	Weighs the importance of future temporal differences in the retrace target.
consecutive_batches	int	The number of environment batches being processed at one time.
batch_splits	int	The number of splits made in an environment batch. If this number is greater than one, then the batch is split into mini batches.
download_period	int	The number of seconds between weight downloads for the environment server.
split_stream	bool	If true, the batches are divided among the devices used by the agents. Else, only the first two devices are used and the model's processes are split in half.
checkpoint_period	int	The period at which the learner keeps a checkpoint instead of automatically deleting it.
break_training_loop_early	bool	If true and there are training splits, the learner will stop training on the trace after the first split.
training_splits	int	Number of splits in the trace during the update period. Used to reduce memory usage.
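The bandit_window_size, bandit_beta, and bandit_e parameters suggest a sliding-window upper confidence bound (UCB) bandit with occasional random picks, which is the kind of bandit the Agent57 paper describes. The sketch below is one possible reading of those three parameters; the class name, method names, and constructor arguments are hypothetical and are not this repository's API.

```python
import math
import random
from collections import deque


class SlidingWindowUCB:
    """Choose among num_arms policies using only recent rewards."""

    def __init__(self, num_arms, window_size, beta, epsilon, rng=None):
        self.num_arms = num_arms
        self.beta = beta            # scales the exploration bonus
        self.epsilon = epsilon      # probability of a random policy
        self.history = deque(maxlen=window_size)  # recent (arm, reward) pairs
        self.rng = rng or random.Random()
        self.steps = 0

    def select(self):
        # Try every arm once before trusting the statistics.
        if self.steps < self.num_arms:
            return self.steps
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_arms)

        counts = [0] * self.num_arms
        totals = [0.0] * self.num_arms
        for arm, reward in self.history:
            counts[arm] += 1
            totals[arm] += reward

        def score(arm):
            if counts[arm] == 0:
                return math.inf     # arm fell out of the window: retry it
            mean = totals[arm] / counts[arm]
            return mean + self.beta * math.sqrt(1.0 / counts[arm])

        return max(range(self.num_arms), key=score)

    def update(self, arm, reward):
        # Appending past maxlen silently drops the oldest entry.
        self.history.append((arm, reward))
        self.steps += 1
```

Each call to `select` is expected to be followed by one call to `update` with the reward observed for that episode. Because old entries drop out of the window, the bandit can change its preferred policy as the agent's performance shifts during training.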

Database Connection

The program is configured to use the DEFAULT_CONFIG found in the database.py file.

Additional Notes

The program was developed and tested in the PyCharm environment. Therefore, the imports are all relative to the main project directory.

License

BSD 3-Clause
