Deep reinforcement learning uses deep neural networks to approximate the state-action value function of a Markov decision process in environments where the number of possible state-action pairs exceeds practical computing capabilities. However, if rewards from the agent's environment are sparse, the agent may never learn what to do. To address this problem, the team at DeepMind introduced an intrinsic reward that the agent generates for exploring its environment. The intrinsic reward increases the frequency of rewards so that the agent keeps exploring until it discovers the extrinsic rewards required for the desired behavior.
The intrinsic reward is computed using a life-long novelty module to scale the reward generated by the episodic novelty module. The intrinsic and extrinsic rewards are then combined following a policy chosen by a multi-armed bandit at the start of the episode. This bandit allows the agent to determine the efficacy of the intrinsic reward for the environment and choose the most appropriate trade-off at any point during training.
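
The reward path described above can be sketched in a few lines. This is a minimal illustration that follows the formulation in the Never Give Up paper, which Agent57 builds on: the life-long modifier is clipped to a fixed range before it scales the episodic reward, and a coefficient `beta` taken from the selected policy weighs the result against the extrinsic reward. The function names, the `max_modifier` default, and the sample values are assumptions for illustration and are not taken from this repository.

```python
def intrinsic_reward(episodic_reward, lifelong_modifier, max_modifier=5.0):
    """Scale the episodic novelty reward by a clipped life-long modifier."""
    # Clip to [1, max_modifier] so the life-long module can only amplify,
    # never suppress, the episodic reward (max_modifier is an assumed default).
    clipped = min(max(lifelong_modifier, 1.0), max_modifier)
    return episodic_reward * clipped


def combined_reward(extrinsic_reward, intrinsic, beta):
    """Mix the two rewards; beta comes from the policy picked by the bandit."""
    return extrinsic_reward + beta * intrinsic


# Hypothetical values for a single step.
r_i = intrinsic_reward(episodic_reward=0.5, lifelong_modifier=2.0)
r = combined_reward(extrinsic_reward=1.0, intrinsic=r_i, beta=0.3)
```

With `beta` set to zero the agent trains on the extrinsic reward alone, which is the trade-off the bandit is free to select when the intrinsic reward does not help.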
For Agent57, the state-action value function is decomposed into two models with identical structure: one for the extrinsic reward and one for the intrinsic reward. The outputs of the two models are combined according to the chosen policy.
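
One way to picture how the two outputs are combined is the weighted sum used in the Agent57 paper, where the intrinsic values are scaled by the `beta` of the chosen policy before being added to the extrinsic values. The sketch below assumes that form and omits any value rescaling; `combine_q_values` and `greedy_action` are hypothetical helper names, and the numbers are made up.

```python
def combine_q_values(q_extrinsic, q_intrinsic, beta):
    """Return per-action values q_e + beta * q_i for one state."""
    if len(q_extrinsic) != len(q_intrinsic):
        raise ValueError("both models must output one value per action")
    return [qe + beta * qi for qe, qi in zip(q_extrinsic, q_intrinsic)]


def greedy_action(q_values):
    """Index of the action with the largest combined value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


# Hypothetical outputs of the two models for a three-action environment.
q_e = [1.0, 0.5, 0.2]
q_i = [0.0, 2.0, 1.0]

exploit = greedy_action(combine_q_values(q_e, q_i, beta=0.0))
explore = greedy_action(combine_q_values(q_e, q_i, beta=0.5))
```

In this example the purely extrinsic policy prefers action 0, while the policy with `beta=0.5` prefers action 1 because its intrinsic value is high.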
- Python 3
- Packages listed in requirements.txt
- MySQL or MariaDB Server
The program is divided into six parts. With the current configuration, the learner and evaluator must be run on the same machine. The parts must be started in this order:
- MySQL/MariaDB
- replay_server.py
- bandit_server.py
- learner.py
- evaluator.py
- environment.py
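
The start-up order can be captured in a small helper. This is only a sketch: it assumes each script is launched from the main project directory with the current Python interpreter and no command-line arguments, and that the MySQL/MariaDB server is already running. `LAUNCH_ORDER` and `build_commands` are hypothetical names, not part of the project.

```python
import sys

# Scripts in the order given above; the database server is started separately.
LAUNCH_ORDER = [
    "replay_server.py",
    "bandit_server.py",
    "learner.py",
    "evaluator.py",
    "environment.py",
]


def build_commands(python=sys.executable):
    """Return one argument list per script, in start-up order."""
    return [[python, script] for script in LAUNCH_ORDER]


if __name__ == "__main__":
    # Print the commands instead of running them, since each one is long-lived.
    for command in build_commands():
        print(" ".join(command))
```

Each argument list could be passed to `subprocess.Popen` in turn, or the printed commands can be pasted into separate terminals.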
The parameters file params.yaml is divided into five sections.
Section | Description |
---|---|
Agent57 | Defines the structure of the state-action value function. |
RND | Defines the structure of the random network distillation network. |
EmbeddingNetwork | Defines the structure of the embedding network. |
EpisodicMemory | Size of episodic memory and reward parameters. |
Misc | Various global parameters. |
Agent57, RND, and Embedding Network
Layers can be added to or removed from the model, excluding the LSTM, so long as the order of layer types does not change.
Episodic Memory
Parameter | Data Type | Description |
---|---|---|
k | int | The number of nearest neighboring states used for the episodic reward. |
max_size | int | The maximum number of states stored in memory each episode. |
depth | int | The depth of the states stored in memory. |
maximum_similarity | float | The largest similarity allowed to compute the episodic reward. |
c | float | The pseudo-counts constant. |
epsilon | float | A small constant to avoid dividing by zero. |
cluster_distace | float | The distance between states for them to be considered the same. |
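
To show how these parameters interact, the sketch below follows the episodic reward procedure described in the Never Give Up paper. It is not this repository's implementation: the function name and default values are placeholders, embeddings are plain lists of floats, and the running average of neighbor distances is replaced by the mean over the current neighbors to keep the example short.

```python
import math


def episodic_reward(embedding, memory, k=10, cluster_distance=0.008,
                    epsilon=1e-3, c=1e-3, maximum_similarity=8.0):
    """Reward that shrinks as the embedding gets closer to stored embeddings."""
    if not memory:
        raise ValueError("memory must hold at least one embedding")

    # Squared Euclidean distance to every stored embedding; keep the k smallest.
    sq_distances = sorted(
        sum((a - b) ** 2 for a, b in zip(embedding, stored)) for stored in memory
    )[:k]

    # Simplification: normalize by the mean of these distances instead of a
    # running average maintained across steps.
    mean_sq = sum(sq_distances) / len(sq_distances)
    if mean_sq > 0.0:
        normalized = [d / mean_sq for d in sq_distances]
    else:
        normalized = sq_distances  # every neighbor is identical to the query

    # Neighbors closer than cluster_distance are treated as the same state.
    normalized = [max(d - cluster_distance, 0.0) for d in normalized]

    # Inverse kernel: close to 1 for near-identical states, close to 0 for far ones.
    kernel = [epsilon / (d + epsilon) for d in normalized]
    similarity = math.sqrt(sum(kernel)) + c

    # Too similar to what is already in memory: no reward.
    if similarity > maximum_similarity:
        return 0.0
    return 1.0 / similarity


# Hypothetical two-dimensional embeddings.
familiar = episodic_reward([0.0, 0.0], [[0.0, 0.0]] * 3, k=3)
novel = episodic_reward([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=3)
```

A state that matches everything in memory earns a small reward, while a state far from its neighbors earns a much larger one. Lowering `maximum_similarity` cuts the reward for familiar states to zero.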
Miscellaneous
Parameter | Data Type | Description |
---|---|---|
dtype | str | The data type used by the model: float16, float32, or float64. |
render_actor | bool | If true, the observations of the first and last actor on the environment server will be rendered by a separate process. |
environment | str | The name of the gym environment being used. |
obs_type | str | The observation type used by the agent. |
obs_shape | int[] | The shape of the observation, including a dimension for the batch. |
frameskip | bool/int | If the type of frameskip is an int, then the environment skips that number of frames each step. |
reward_scale | float | Optional value to scale the extrinsic rewards. |
zero_discount_on_life_loss | bool | If true and the environment returns 'lives' info, the future discounted rewards recorded are zero when the agent loses a life. |
N | int | The range of policies explored by the agent. |
L | int | The number of agents in a batch. Each agent has a different probability of taking a random action. |
greed_e | float | The maximum probability of taking a random action. |
greed_a | float | Controls the rate of decay for selecting the random action probability across the batch. The larger the value, the faster it decays. |
evaluate_modified_reward | bool | If true, the modified reward is used to train the bandit, else the extrinsic reward is used. |
bandit_window_size | int | The window of past rewards the bandit considers when choosing a policy. |
bandit_beta | float | Scales the bandit's exploration factor. |
bandit_e | float | Probability of returning a random policy. |
bandit_save_period | int | The number of policy updates between bandit saves. |
evaluation_iterations | int | The number of evaluation episodes before updating the model. |
trace_length | int | Length of traces the model is trained on. |
replay_period | int | Length of the trace used for the burn-in period before training. |
min_required_sequences | int | The minimum number of traces required in the database before the learner begins training. |
database_size_limit | int | Maximum size allowed for the database, in GB. |
target_free_space | int | The amount of free space the replay server checks for before beginning to clear space. |
max_episode_length | int | The maximum number of steps the agent can take before the environment forcefully ends the episode. |
bandit_ip | str | The IP address of the server hosting the bandit. |
bandit_port | int | The port used by the server hosting the bandit. |
bandit_workers | int | The number of workers the bandit server spawns. |
weights_ip | str | The IP address of the server distributing the model weights. Usually the learning server. |
weights_port | int | The port used by the server distributing the model weights. |
weights_workers | int | The number of workers spawned by the weights server. |
replay_ip | str | The IP address of the server managing the replay buffer. |
replay_port_range | int[] | The range of ports used by the server managing the replay buffer. |
replay_workers | int | The number of workers spawned by the replay server. |
training_batch_size | int | The size of batches used by the learning server. |
consecutive_training_batches | int | The number of batches the learning server is working on at any given time. |
actor_weight_update | int | The number of steps taken by the actor between weight updates. |
target_weight_update | int | The number of optimization steps between weight updates. |
retrace_lambda | int | Weighs the importance of future temporal differences in the retrace target. |
consecutive_batches | int | The number of environment batches being processed at one time. |
batch_splits | int | The number of splits made in an environment batch. If this number is greater than one, then the batch is split into mini batches. |
download_period | int | The number of seconds between weight downloads for the environment server. |
split_stream | bool | If true, the batches are divided among the devices used by the agents. Else, only the first two devices are used and the model's processes are split in half. |
checkpoint_period | int | The period at which the learner keeps a checkpoint instead of automatically deleting it. |
break_training_loop_early | bool | If true and there are training splits, the learner will stop training on the trace after the first split. |
training_splits | int | Number of splits in the trace during the update period. Used to reduce memory usage. |
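
The bandit parameters above are easier to read next to a concrete selection rule. The sketch below is one common form of a sliding-window upper confidence bound bandit with a small random-choice probability, similar to the one described in the Agent57 paper. It is an illustration under assumptions, not the code in bandit_server.py: the class name is hypothetical, and the mapping of `window_size`, `beta`, `epsilon`, and `n_arms` to `bandit_window_size`, `bandit_beta`, `bandit_e`, and `N` is assumed.

```python
import math
import random
from collections import deque


class SlidingWindowUCB:
    """Pick a policy index using only the most recent rewards."""

    def __init__(self, n_arms, window_size, beta, epsilon, seed=None):
        if window_size < n_arms:
            raise ValueError("window_size must be at least n_arms")
        self.n_arms = n_arms
        self.beta = beta
        self.epsilon = epsilon
        # Old (arm, reward) pairs fall out automatically once the window is full.
        self.history = deque(maxlen=window_size)
        self.rng = random.Random(seed)

    def select(self):
        counts = [0] * self.n_arms
        totals = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            totals[arm] += reward

        # Any arm with no sample inside the window is tried first.
        for arm, count in enumerate(counts):
            if count == 0:
                return arm

        # Occasionally return a random policy.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)

        # Windowed mean reward plus an exploration bonus scaled by beta.
        scores = [
            totals[a] / counts[a] + self.beta * math.sqrt(1.0 / counts[a])
            for a in range(self.n_arms)
        ]
        return max(range(self.n_arms), key=scores.__getitem__)

    def update(self, arm, reward):
        self.history.append((arm, reward))


# Hypothetical run: only policy 1 ever yields a reward.
bandit = SlidingWindowUCB(n_arms=3, window_size=30, beta=0.1, epsilon=0.0, seed=0)
choices = []
for _ in range(10):
    arm = bandit.select()
    choices.append(arm)
    bandit.update(arm, 1.0 if arm == 1 else 0.0)
```

Because the window is finite, arms that have not been tried recently eventually drop out of the history and are sampled again, which lets the bandit track a best policy that changes during training.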
The program is configured to use the DEFAULT_CONFIG found in the database.py file.
The program was developed and tested in the PyCharm environment. Therefore, the imports are all relative to the main project directory.
BSD 3-Clause