# Agent for epidemic control model  
This notebook trains an agent in an epidemic control environment using DDPG, with either recurrent (RNN) or feed-forward (ANN) networks.  
  
Before running, set the PATH variable below to a folder where training outputs can be stored.  
Also create a folder named 'policy' inside the PATH directory.  
The default environment is EE0; for other environments, see the "Environment" section below.
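
The folder setup described above can also be scripted. This is a minimal sketch, assuming the `train`, `eval` and `policy` subfolder names that the training function uses later in this notebook. `ensure_output_dirs` is a hypothetical helper, not part of the original code, and the demo runs on a throwaway temporary directory rather than on PATH.

```python
import os
import tempfile


def ensure_output_dirs(path, subfolders=('train', 'eval', 'policy')):
    """Create `path` and the given subfolders if they do not exist yet."""
    path = os.path.expanduser(path)
    created = []
    for name in subfolders:
        folder = os.path.join(path, name)
        os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
        created.append(folder)
    return created


# Demo on a temporary directory (in the notebook, pass PATH instead)
demo_root = tempfile.mkdtemp()
demo_dirs = ensure_output_dirs(demo_root)
```

Calling `ensure_output_dirs(PATH)` once after PATH is set would replace the manual folder creation.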

In [1]:
PATH = '/home/jovyan/Masterarbeit/Agent/Run_33'
# Decide whether to use RNN DDPG or ANN DDPG
use_rnns = True

## Imports

First, import all dependencies.  
The comments indicate what each group of imports is used for.

In [2]:
import sys

import tensorflow as tf 
import numpy as np

# Environment 
from tf_agents.environments import tf_py_environment
from tf_agents.environments import py_environment
from tf_agents.policies import scripted_py_policy
from tf_agents.policies import random_tf_policy
# Neural Networks
from tf_agents.agents.ddpg import actor_rnn_network
from tf_agents.agents.ddpg import critic_rnn_network
from tf_agents.networks import sequential
from tf_agents.networks import nest_map
from tf_agents.keras_layers import inner_reshape
import functools
# Agent 
from tf_agents.agents.ddpg import ddpg_agent
# Experience Replay
from tf_agents.drivers import dynamic_episode_driver
from tf_agents.drivers import dynamic_step_driver
from tf_agents.replay_buffers import tf_uniform_replay_buffer
#Training
from tf_agents.utils import common
#Evaluation
from tf_agents.policies import policy_saver
from tf_agents.trajectories import time_step
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
import os
import matplotlib
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

## Environment

Next, import and initialize an environment.  
To train on a different environment, change the parameters and the environment class used to construct `py_env` in the cell below.

In [3]:
sys.path.insert(1, '/home/jovyan/Masterarbeit/reinforce-one/Environments')
sys.path.insert(1, '/home/jovyan/Masterarbeit/reinforce-one/Environments/Variations')
from EE0 import EE0
from EE0_A import EE0_A
from EE0_NT import EE0_NT
from EE1 import EE1
from EE1_A import EE1_A

num_herds = 2
total_population = 300
average_episode_length=200
fix_episode_length = True
py_env = EE0(num_herds = num_herds, total_population = total_population, fix_episode_length = fix_episode_length, 
               average_episode_length = average_episode_length)

# Transforms py environment into tensorflow environment (i/o are now tensors)
train_env = tf_py_environment.TFPyEnvironment(py_env)
eval_env = tf_py_environment.TFPyEnvironment(py_env)

## Training
This section defines a function for agent training and evaluation.  
First, create the neural networks for the two training variants (RNN and ANN).

### RNN DDPG

Set up actor and critic recurrent neural networks for training with DDPG using RNNs.  
Edit hyperparams for different layer sizes.

In [4]:
# RNN hyperparams
actor_fc_layers = (200, 150)
actor_output_fc_layers = (50,)
actor_lstm_size = (40,)
critic_obs_fc_layers = (200,)
critic_action_fc_layers = None
critic_joint_fc_layers = (150,)
critic_output_fc_layers = (50,)
critic_lstm_size = (40,)

# RNN actor critic
actor_rnn = actor_rnn_network.ActorRnnNetwork(train_env.time_step_spec().observation, 
                                              train_env.action_spec(), 
                                              input_fc_layer_params=actor_fc_layers, 
                                              lstm_size = actor_lstm_size, 
                                              output_fc_layer_params=actor_output_fc_layers)

critic_net_input_specs = (train_env.time_step_spec().observation, 
                          train_env.action_spec())

critic_rnn = critic_rnn_network.CriticRnnNetwork(critic_net_input_specs, 
                                                 observation_fc_layer_params=critic_obs_fc_layers, 
                                                 action_fc_layer_params=critic_action_fc_layers, 
                                                 joint_fc_layer_params=critic_joint_fc_layers, 
                                                 lstm_size=critic_lstm_size, 
                                                 output_fc_layer_params=critic_output_fc_layers)

### ANN DDPG  
Create actor and critic artificial neural networks for DDPG.  
Again, edit hyperparams for different layer sizes.

In [5]:
# Set ann hyperparameters
actor_fc_layers=(400, 300)
critic_obs_fc_layers=(400,)
critic_action_fc_layers=None
critic_joint_fc_layers=(300,)


# Define creation functions 

dense = functools.partial(tf.keras.layers.Dense,
                          activation=tf.keras.activations.relu,
                          kernel_initializer=tf.compat.v1.variance_scaling_initializer(
                              scale=1./ 3.0, mode='fan_in', distribution='uniform')
                         )


def create_identity_layer():
    return tf.keras.layers.Lambda(lambda x: x)


def create_fc_network(layer_units):
    return sequential.Sequential([dense(num_units) for num_units in layer_units])


def create_actor_network(fc_layer_units, action_spec):
    flat_action_spec = tf.nest.flatten(action_spec)
    if len(flat_action_spec) > 1:
        raise ValueError('Only a single action tensor is supported by this network')
    flat_action_spec = flat_action_spec[0]

    fc_layers = [dense(num_units) for num_units in fc_layer_units]

    num_actions = flat_action_spec.shape.num_elements()
    
    action_fc_layer = tf.keras.layers.Dense(num_actions,
                                            activation=tf.keras.activations.tanh,
                                            kernel_initializer=tf.keras.initializers.RandomUniform(
                                                minval=-0.003, maxval=0.003)
                                           )

    scaling_layer = tf.keras.layers.Lambda(
        lambda x: common.scale_to_spec(x, flat_action_spec))
    return sequential.Sequential(fc_layers + [action_fc_layer, scaling_layer])


def create_critic_network(obs_fc_layer_units,
                          action_fc_layer_units,
                          joint_fc_layer_units):
    def split_inputs(inputs):
        return {'observation': inputs[0], 'action': inputs[1]}
    
    if obs_fc_layer_units:
        obs_network = create_fc_network(obs_fc_layer_units)  
    else:
        obs_network = create_identity_layer()
    if action_fc_layer_units:    
        action_network = create_fc_network(action_fc_layer_units)
    else:
        action_network = create_identity_layer()
    if joint_fc_layer_units:    
        joint_network = create_fc_network(joint_fc_layer_units) 
    else: 
        joint_network = create_identity_layer()
    value_fc_layer = tf.keras.layers.Dense(1,
                                           activation=None,
                                           kernel_initializer=tf.keras.initializers.RandomUniform(minval=-0.003, maxval=0.003)
                                          )

    return sequential.Sequential([tf.keras.layers.Lambda(split_inputs),
                                  nest_map.NestMap({'observation': obs_network,
                                                    'action': action_network}),
                                  nest_map.NestFlatten(),
                                  tf.keras.layers.Concatenate(),
                                  joint_network,
                                  value_fc_layer,
                                  inner_reshape.InnerReshape([1], [])
                                 ])


# Create neural networks

actor_ann = create_actor_network(actor_fc_layers, 
                                 train_env.action_spec())
critic_ann = create_critic_network(critic_obs_fc_layers,
                                   critic_action_fc_layers,
                                   critic_joint_fc_layers)
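
The actor built above squashes its output with `tanh` and then rescales it to the action bounds through `common.scale_to_spec`. The following is a minimal pure-Python sketch of that rescaling, assuming a linear map from [-1, 1] onto [minimum, maximum]. `scale_to_bounds` and the bounds 0 and 1 are illustrative assumptions, not values taken from the environment.

```python
import math


def scale_to_bounds(x, minimum, maximum):
    """Map x in [-1, 1] linearly onto [minimum, maximum]."""
    mean = (maximum + minimum) / 2.0
    magnitude = (maximum - minimum) / 2.0
    return mean + magnitude * x


# A raw pre-activation of 0.0 gives tanh(0.0) = 0.0, the middle of the range
raw_output = 0.0
action = scale_to_bounds(math.tanh(raw_output), minimum=0.0, maximum=1.0)
```

Because `tanh` is bounded, the scaled action can never leave the interval defined by the action spec, whatever the preceding layers produce.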

### Hyperparameters  
Set hyperparameters for DDPG training.

In [6]:
num_iterations = 1000000

# Agent hyperparameters
actor_learning_rate = 1e-4
critic_learning_rate = 1e-3
ou_stddev = 0.2
ou_damping = 0.15
target_update_tau = 0.05
target_update_period = 100
gamma = 0.995
# Training hyperparameters
train_steps_per_iteration = 1

# Experience replay hyperparameters
rb_capacity = 500000
batch_size = 64
train_sequence_length = 200    # Automatically set to 1 for ANN DDPG
# For ANN DDPG
collect_steps_per_iteration = 200
initial_collect_steps = 25000
# For RNN DDPG
initial_collect_episodes = 10
collect_episodes_per_iteration = 1

# Summary params
summary_interval = 1000
# Evaluation hyperparameters
eval_interval = 1000
eval_episodes = 200
threshhold_return = -30
threshhold_reset_interval = 50000
plots = False  # Only works if num_herds = 2
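
`ou_stddev` and `ou_damping` parameterize the Ornstein-Uhlenbeck exploration noise that DDPG adds to the actor's actions during data collection. The sketch below shows how such noise evolves, assuming the update x <- (1 - damping) * x + N(0, stddev), which we assume matches the TF-Agents process. `ou_noise_sequence` is a hypothetical helper written only for illustration.

```python
import random


def ou_noise_sequence(num_steps, stddev=0.2, damping=0.15, seed=0):
    """Generate temporally correlated noise samples (assumed OU update rule)."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(num_steps):
        # Pull the previous value towards zero, then add fresh Gaussian noise
        x = (1.0 - damping) * x + rng.gauss(0.0, stddev)
        samples.append(x)
    return samples


noise = ou_noise_sequence(200)  # one value per step of a 200-step episode
```

A larger `damping` makes consecutive samples less correlated, and a larger `stddev` widens the exploration around the actor's output.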

### DDPG  
Finally, define the training function using the TF-Agents DDPG agent.
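
DDPG keeps slowly moving target copies of the actor and critic, governed here by `target_update_tau` and `target_update_period`. This is a minimal sketch of one soft update on plain lists of weights, assuming the usual rule target <- (1 - tau) * target + tau * source, applied every `target_update_period` train steps. `soft_update` is a hypothetical helper, and the weight values are made up.

```python
def soft_update(target_weights, source_weights, tau=0.05):
    """Blend source weights into target weights with mixing factor tau."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_weights, source_weights)]


target = [0.0, 1.0]
source = [1.0, 1.0]
target = soft_update(target, source, tau=0.05)  # first entry moves 5% towards 1.0
```

With `tau = 1.0` the update becomes a hard copy of the source weights. Smaller values trade slower tracking for more stable critic targets.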

In [7]:
def DDPG(num_iterations = num_iterations,
         actor_net = None,
         critic_net = None,
         directory = PATH,
         plots = plots,
         eval_interval = eval_interval,
         summary_interval = summary_interval,
         best_return = threshhold_return,
         threshhold_reset_interval = threshhold_reset_interval,
         # Agent hyperparameters
         actor_learning_rate = actor_learning_rate,
         critic_learning_rate = critic_learning_rate,
         ou_stddev = ou_stddev,
         ou_damping = ou_damping,
         target_update_tau = target_update_tau,
         target_update_period = target_update_period,
         gamma = gamma,
         # Training hyperparameters
         train_steps_per_iteration = train_steps_per_iteration,
         # Experience replay hyperparameters
         initial_collect_episodes = initial_collect_episodes,
         collect_episodes_per_iteration = collect_episodes_per_iteration,
         rb_capacity = rb_capacity,
         batch_size = batch_size,
         train_sequence_length = train_sequence_length):
    
    if actor_net is None or critic_net is None:
        raise ValueError('Please input an actor network and critic network.')
    
    # Create directories for summary output
    directory = os.path.expanduser(directory)
    train_dir = os.path.join(directory, 'train')
    eval_dir = os.path.join(directory, 'eval')
    policy_dir = os.path.join(directory, 'policy')
    
    # Global step tracks number of train steps
    global_step = tf.compat.v1.train.get_or_create_global_step()
    
    # Initialize summary writers 
    train_summary_writer = tf.compat.v2.summary.create_file_writer(
                               train_dir, flush_millis=10000)
    train_summary_writer.set_as_default()

    eval_summary_writer = tf.compat.v2.summary.create_file_writer(
                              eval_dir, flush_millis=10000)
    
    with tf.compat.v2.summary.record_if(lambda: tf.math.equal(global_step % summary_interval, 0)):
    
        # DDPG Agent
        agent = ddpg_agent.DdpgAgent(train_env.time_step_spec(), 
                                     train_env.action_spec(), 
                                     actor_network = actor_net, 
                                     critic_network = critic_net, 
                                     actor_optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=actor_learning_rate), 
                                     critic_optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=critic_learning_rate), 
                                     ou_stddev = ou_stddev, 
                                     ou_damping = ou_damping, 
                                     target_update_tau = target_update_tau, 
                                     target_update_period = target_update_period,  
                                     gamma = gamma, 
                                     train_step_counter = global_step)
        agent.initialize()
        
        # Metrics to be tracked in the summary 
        train_metrics = [tf_metrics.NumberOfEpisodes(),
                         tf_metrics.EnvironmentSteps(),
                         tf_metrics.AverageReturnMetric(),
                         tf_metrics.AverageEpisodeLengthMetric()]
    
        eval_metrics = [tf_metrics.AverageReturnMetric(buffer_size=eval_episodes), 
                        tf_metrics.AverageEpisodeLengthMetric(buffer_size=eval_episodes)]
        
        # Tools for evaluation
        eval_policy = agent.policy
        saver = policy_saver.PolicySaver(eval_policy)

        # Experience replay and sample collection tools
        collect_policy = agent.collect_policy
        replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(agent.collect_data_spec,
                                                                       batch_size=train_env.batch_size,
                                                                       max_length=rb_capacity)
    
        # Assign step drivers to fill replay buffer 
        if isinstance(actor_net, actor_rnn_network.ActorRnnNetwork):
            initial_collect_driver = dynamic_episode_driver.DynamicEpisodeDriver(train_env,
                                                                                 collect_policy,
                                                                                 observers=[replay_buffer.add_batch],
                                                                                 num_episodes=initial_collect_episodes)

            collect_driver = dynamic_episode_driver.DynamicEpisodeDriver(train_env,
                                                                         collect_policy,
                                                                         observers=[replay_buffer.add_batch] + train_metrics,
                                                                         num_episodes=collect_episodes_per_iteration)
        else:
            initial_collect_driver = dynamic_step_driver.DynamicStepDriver(train_env,
                                                                           collect_policy,
                                                                           observers=[replay_buffer.add_batch],
                                                                           num_steps=initial_collect_steps)

            collect_driver = dynamic_step_driver.DynamicStepDriver(train_env,
                                                                   collect_policy,
                                                                   observers=[replay_buffer.add_batch] + train_metrics,
                                                                   num_steps=collect_steps_per_iteration)
        
        # TF functions speed up training process
        initial_collect_driver.run = common.function(initial_collect_driver.run)
        collect_driver.run = common.function(collect_driver.run)
        agent.train = common.function(agent.train)
    
        # Collect initial experience for the replay buffer using the collect policy (actor output plus OU noise)
        initial_collect_driver.run()
    
        # Training starts
        time_step = None
        policy_state = collect_policy.get_initial_state(train_env.batch_size)
    
        # An ANN trains on single transitions, so sampled trajectories are two steps long (train_sequence_length + 1)
        if not isinstance(actor_net, actor_rnn_network.ActorRnnNetwork):
            train_sequence_length = 1
        
        dataset = replay_buffer.as_dataset(num_parallel_calls=3,
                                           sample_batch_size=batch_size,
                                           num_steps=train_sequence_length + 1).prefetch(3)
        iterator = iter(dataset)
    
    
        def train_step():
            experience, other_info = next(iterator)
            return agent.train(experience)
        train_step = common.function(train_step)

    
        for _ in range(num_iterations):
            time_step, policy_state = collect_driver.run(time_step=time_step,
                                                         policy_state=policy_state)    
            for _ in range(train_steps_per_iteration):
                train_loss = train_step()
            for train_metric in train_metrics:
                train_metric.tf_summaries(train_step=global_step, step_metrics=train_metrics[:2])
            # Evaluation
            if global_step.numpy() % eval_interval == 0:
                results = metric_utils.eager_compute(eval_metrics, 
                                                     eval_env,
                                                     eval_policy,
                                                     num_episodes=eval_episodes,
                                                     train_step=global_step,
                                                     summary_writer=eval_summary_writer,
                                                     summary_prefix='Metrics')
                metric_utils.log_metrics(eval_metrics)
                if results['AverageReturn'].numpy() >= -10:
                    eval_interval = 1000
                    plots = False
                if results['AverageReturn'].numpy() < -15:
                    eval_interval = 1000
                    plots = False
                print('Global Step = {0}, Average Return = {1}.'.format(global_step.numpy(), results['AverageReturn'].numpy())) 
                if results['AverageReturn'].numpy() > best_return:
                    best_return = results['AverageReturn'].numpy()
                    print('New best return: ', best_return)
                    #average_return, culls, tests = eval_agent(eval_env, 
                                                              #eval_policy, 
                                                              #num_episodes=eval_episodes, 
                                                              #create_plot = plots)
                    #print('Re-Tested new best return: ', average_return)
                    #print('Average Culls = {0}, Average Tests = {1}.'.format(culls, tests))  
                    dir_name = str(global_step.numpy()) + '_' + str(best_return)
                    saver.save(os.path.join(policy_dir, dir_name))
            if global_step.numpy() % threshhold_reset_interval == 0:
                best_return = threshhold_return
                    
    return train_loss
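
The save logic inside the training function keeps a running best return, which is reset to the threshold every `threshhold_reset_interval` steps. That is why a policy can be saved as a "new best" even though an earlier policy scored higher. The pure-Python sketch below mirrors that logic. `track_saved_steps` is a hypothetical helper and the evaluation values are made up for illustration.

```python
def track_saved_steps(evaluations, threshold=-30.0, reset_interval=50000):
    """Return the steps at which a policy would be saved.

    `evaluations` is a list of (global_step, average_return) pairs.
    """
    best = threshold
    saved = []
    for step, avg_return in evaluations:
        if avg_return > best:
            best = avg_return
            saved.append(step)
        # Periodic reset: later policies only need to beat the threshold again
        if step % reset_interval == 0:
            best = threshold
    return saved


# Illustrative values only: the return at step 51000 is worse than at 49000,
# but it is saved because the best return was reset at step 50000.
saved_steps = track_saved_steps([(49000, -10.0), (50000, -40.0), (51000, -25.0)])
```

Without the reset, only the policy at step 49000 would be saved in this example.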

## Run Training  
Now run DDPG with either the recurrent or the feed-forward networks, depending on `use_rnns`.

In [None]:
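# Select the network pair according to the use_rnns flag set at the top
if use_rnns:
    anet = actor_rnn
    cnet = critic_rnn
else:
    anet = actor_ann
    cnet = critic_ann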

loss = DDPG(num_iterations = num_iterations,
            actor_net = anet,
            critic_net = cnet,
            directory = PATH,
            plots = plots,
            eval_interval = eval_interval,
            # Agent hyperparameters
            actor_learning_rate = actor_learning_rate,
            critic_learning_rate = critic_learning_rate,
            ou_stddev = ou_stddev,
            ou_damping = ou_damping,
            target_update_tau = target_update_tau,
            target_update_period = target_update_period,
            gamma = gamma,
            # Experience replay hyperparameters
            initial_collect_episodes = initial_collect_episodes,
            collect_episodes_per_iteration = collect_episodes_per_iteration,
            rb_capacity = rb_capacity,
            batch_size = batch_size,
            train_sequence_length = train_sequence_length)

Global Step = 1000, Average Return = -55.74764633178711.
Global Step = 2000, Average Return = -56.17839431762695.
Global Step = 3000, Average Return = -63.310890197753906.
Global Step = 4000, Average Return = -65.41107177734375.
Global Step = 5000, Average Return = -60.11036682128906.
Global Step = 6000, Average Return = -59.4613037109375.
Global Step = 7000, Average Return = -60.17747116088867.
Global Step = 8000, Average Return = -60.7878532409668.
Global Step = 9000, Average Return = -61.314613342285156.
Global Step = 10000, Average Return = -57.118038177490234.
Global Step = 11000, Average Return = -56.49696350097656.
Global Step = 12000, Average Return = -59.658447265625.
Global Step = 13000, Average Return = -57.65753936767578.
Global Step = 14000, Average Return = -55.72941589355469.
Global Step = 15000, Average Return = -53.



Global Step = 92000, Average Return = -28.483787536621094.
New best return:  -28.483788
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/92000_-28.483788/assets
Global Step = 93000, Average Return = -24.183122634887695.
New best return:  -24.183123
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/93000_-24.183123/assets
Global Step = 94000, Average Return = -23.04608154296875.
New best return:  -23.046082
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/94000_-23.046082/assets
Global Step = 95000, Average Return = -27.609071731567383.
Global Step = 96000, Average Return = -23.260826110839844.
Global Step = 97000, Average Return = -23.108469009399414.




Global Step = 98000, Average Return = -22.514820098876953.
New best return:  -22.51482
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/98000_-22.51482/assets
Global Step = 99000, Average Return = -22.964221954345703.
Global Step = 100000, Average Return = -23.6401424407959.




Global Step = 101000, Average Return = -20.479145050048828.
New best return:  -20.479145
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/101000_-20.479145/assets
Global Step = 102000, Average Return = -22.85436248779297.
Global Step = 103000, Average Return = -22.766407012939453.




Global Step = 104000, Average Return = -19.568605422973633.
New best return:  -19.568605
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/104000_-19.568605/assets
Global Step = 105000, Average Return = -21.363300323486328.
Global Step = 106000, Average Return = -21.515092849731445.
Global Step = 107000, Average Return = -20.64554214477539.
Global Step = 108000, Average Return = -20.635011672973633.
Global Step = 109000, Average Return = -23.81412124633789.
Global Step = 110000, Average Return = -20.563316345214844.




Global Step = 111000, Average Return = -7.114769458770752.
New best return:  -7.1147695
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/111000_-7.1147695/assets
Global Step = 112000, Average Return = -12.366663932800293.
Global Step = 113000, Average Return = -43.2176513671875.
Global Step = 114000, Average Return = -55.27099609375.
Global Step = 115000, Average Return = -9.112056732177734.
Global Step = 116000, Average Return = -13.162184715270996.
Global Step = 117000, Average Return = -56.13828659057617.
Global Step = 118000, Average Return = -56.56254577636719.
Global Step = 119000, Average Return = -56.231971740722656.
Global Step = 120000, Average Return = -57.17207336425781.
Global Step = 121000, Average Return = -52.374755859375.
Global Step = 122000, Average Return = -49.806522369384766.
Global Step = 123000, Average Return = -57.3543586730957.
Global Step = 124000, Average Return = -57.09904861450195.
Global Step = 125000, Average Return = -60.47807693481445.




Global Step = 126000, Average Return = -7.086678981781006.
New best return:  -7.086679
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/126000_-7.086679/assets
Global Step = 127000, Average Return = -6.64865779876709.
New best return:  -6.648658
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/127000_-6.648658/assets
Global Step = 128000, Average Return = -6.453168869018555.
New best return:  -6.453169
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/128000_-6.453169/assets
Global Step = 129000, Average Return = -10.120670318603516.
Global Step = 130000, Average Return = -27.70952606201172.
Global Step = 131000, Average Return = -33.61614990234375.
Global Step = 132000, Average Return = -32.4262809753418.
Global Step = 133000, Average Return = -27.629117965698242.
Global Step = 134000, Average Return = -35.659400939941406.
Global Step = 135000, Average Return = -67.71983337402344.
Global Step = 136000, Average Return = -23.841899871826172.
Global Step = 137000, Average Return = -50.06089782714844.
Global Step = 138000, Average Return = -19.14397621154785.
Global Step = 139000, Average Return = -19.34408950805664.
Global Step = 140000, Average Return = -25.50499725341797.
Global Step = 141000, Average Return = -33.86811828613281.
Global Step = 142000, Average Return = -59.677486419677734.
Global Step = 143000, Average Return = -59.0114631652832.
Global Step = 144000, Average Return = -56.393836975097656.
Global Step = 145000, Average Return = -57.484809875



Global Step = 155000, Average Return = -26.701589584350586.
New best return:  -26.70159
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/155000_-26.70159/assets
Global Step = 156000, Average Return = -46.214447021484375.
Global Step = 157000, Average Return = -37.055423736572266.




Global Step = 158000, Average Return = -10.476666450500488.
New best return:  -10.476666
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/158000_-10.476666/assets
Global Step = 159000, Average Return = -10.371546745300293.
New best return:  -10.371547
INFO:tensorflow:Assets written to: /home/jovyan/Masterarbeit/Agent/Run_33/policy/159000_-10.371547/assets
Global Step = 160000, Average Return = -29.857912063598633.
Global Step = 161000, Average Return = -56.525535583496094.
Global Step = 162000, Average Return = -56.163455963134766.
Global Step = 163000, Average Return = -41.52546310424805.
Global Step = 164000, Average Return = -22.8272705078125.
Global Step = 165000, Average Return = -34.8350944519043.
Global Step = 166000, Average Return = -51.558929443359375.
Global Step = 167000, Average Return = -47.943138122558594.
Global Step = 168000, Average Return = -50.412513732910156.
Global Step = 169000, Average Return = -59.62147521972656.
Global Step = 170000, Average Return = -54.191444396972656.
Global Step = 171000, Average Return = -46.10911178588867.
Global Step = 172000, Average Return = -48.58021545410156.
Global Step = 173000, Average Return = -48.090030670166016.
Global Step = 174000, Average Return = -49.34231948852539.
Global Step = 175000, Average Return = -52.254913330078125.
Global Step = 176000, Average Return = -55.953475
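
Each saved policy lands in a folder named `<global step>_<average return>`, as in the log above. The sketch below picks the best-scoring folder from such names. `best_policy_dir` is a hypothetical helper, and the three names are copied from the log. The chosen folder could then presumably be loaded with `tf.saved_model.load`, which is an assumption about later use rather than something this notebook does.

```python
def best_policy_dir(dir_names):
    """Return the folder name with the highest average return.

    Names are expected to look like '<global_step>_<average_return>'.
    """
    def parse(name):
        step, avg_return = name.split('_', 1)
        # Sort by return first, then by step so later policies win ties
        return float(avg_return), int(step)

    return max(dir_names, key=parse)


names = ['111000_-7.1147695', '128000_-6.453169', '159000_-10.371547']
best = best_policy_dir(names)
```

Here `best` is `'128000_-6.453169'`, the highest average return among the three folders.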