<a href="https://colab.research.google.com/github/NC25/Gym_Fishing-v1/blob/master/Fishing_Env_v1_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Soft Actor-Critic (SAC) is a popular agent for training in large, continuous domains. The actor plays the same role as in policy gradient methods: it learns a policy directly, without needing to learn a value for **every** state-action pair in the environment. The critic plays the role of the Q-network in a DQN: it evaluates the state-action pairs that the agent encounters while carrying out the policy. This combination tends to speed up training, and SAC additionally rewards entropy, which encourages the agent to explore more in order to find the best policy.
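
For reference, the entropy bonus described above is usually written as the following objective in the standard SAC formulation. The symbols are not defined elsewhere in this notebook and are introduced here only for illustration: $\pi$ is the policy, $r$ the reward, $\mathcal{H}$ the entropy of the policy's action distribution, and $\alpha$ the temperature that trades off reward against exploration.

$$
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
$$

A larger $\alpha$ favors more random (higher-entropy) behavior, while $\alpha = 0$ recovers the usual expected-return objective.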

If you haven't installed the following dependencies, run:

In [None]:
!sudo apt-get install -y xvfb ffmpeg
!pip install 'gym==0.10.11'
!pip install 'imageio==2.4.0'
!pip install matplotlib
!pip install PILLOW
!pip install tf-agents
!pip install 'pybullet==2.4.2'
!pip install 'pyglet==1.3.2'
!pip install pyvirtualdisplay
!pip install --upgrade setuptools





Packaging

In [4]:
 !git clone https://github.com/boettiger-lab/gym_fishing.git

fatal: destination path 'gym_fishing' already exists and is not an empty directory.


In [None]:
!python gym_fishing/setup.py sdist bdist_wheel 

In [None]:
!pip install -e ./gym_fishing/

In [7]:
import gym_fishing

In [6]:
!ls


build  dist  gym_fishing  gym_fishing.egg-info	sample_data


In [7]:
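# `!cd` runs in a subshell, so it does not change the notebook's working directory.
# Use `%cd gym_fishing` instead if later cells need to run from inside the repository.
!cd gym_fishing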

Setup

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay

import tensorflow as tf
tf.compat.v1.enable_v2_behavior()

from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_pybullet
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import actor_distribution_network
from tf_agents.networks import normal_projection_network
from tf_agents.policies import greedy_policy
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common
from tf_agents.environments import suite_gym
import gym

In [None]:

tf.compat.v1.enable_v2_behavior()

# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
tf.version.VERSION

'2.2.0'

Hyperparameters

In [8]:
env_name = "fishing-v1"

#number of training iterations (agent parameter updates)
num_iterations = 100000 

#number of steps collected up front to fill the replay buffer before training
initial_collect_steps = 10000
#number of environment steps collected per training iteration
collect_steps_per_iteration = 1
#max time steps the replay buffer can hold
replay_buffer_capacity = 1000000

#number of examples sampled from the replay buffer for each
#training batch
batch_size = 256

#Learning rates are scalars that set the step size of each
#optimizer update
critic_learning_rate = 3e-4
actor_learning_rate = 3e-4
alpha_learning_rate = 3e-4 #learning rate for the entropy coefficient alpha
target_update_tau = 0.005 #soft-update factor: fraction of the critic weights blended into the target networks
target_update_period = 1 #number of training steps between target network updates
gamma = 0.99 #discount factor for future rewards
reward_scale_factor = 1.0 #multiplier applied to rewards
gradient_clipping = None  #optional norm for gradient clipping, which mitigates steep gradients

actor_fc_layer_params = (256,256) #units in each hidden layer of the actor
critic_joint_fc_layer_params = (256, 256)

log_interval = 5000

num_eval_episodes = 30
eval_interval = 10000
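
To make `target_update_tau` concrete, here is a minimal sketch of a soft (Polyak) target update in plain Python. The helper name `soft_update` and the toy weight lists are assumptions for illustration; this is not the TF-Agents implementation, only the arithmetic that a soft update performs.

```python
def soft_update(target_weights, source_weights, tau):
    """Blend source weights into target weights: (1 - tau) * target + tau * source."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_weights, source_weights)]


# Toy weights (hypothetical values, not taken from a real network).
target = [0.0, 1.0]
source = [1.0, 3.0]

# With tau = 0.005 the target moves only 0.5% of the way toward the source.
updated = soft_update(target, source, tau=0.005)
print(updated)

# With tau = 1.0 the update becomes a hard copy of the source weights.
print(soft_update(target, source, tau=1.0))
```

A small `tau` keeps the target networks changing slowly, which is the usual reason for using a soft update with `target_update_period = 1`.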

Environment

In [9]:
#load environment
env = suite_gym.load(env_name)
env.reset()
#PIL.Image.fromarray(env.render())

TimeStep(step_type=array(0, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([0.75]))

In [10]:
#read action and observation specs
print("Observation spec")
print(env.time_step_spec().observation)

print("Action spec")
print(env.action_spec())

Observation spec
BoundedArraySpec(shape=(1,), dtype=dtype('float64'), name='observation', minimum=0.0, maximum=2.0)
Action spec
BoundedArraySpec(shape=(1,), dtype=dtype('float64'), name='action', minimum=0.0, maximum=2.0)


Environment Wrappers

Create one environment for training and one for evaluation, then wrap each Python environment in a `TFPyEnvironment` so that it works with TensorFlow tensors.

In [20]:
train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)

train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

Agent

We need to create our actor and critic networks.

The critic gives value estimates for Q(s,a). It receives a state observation and an action as input and estimates how good that action is in the given state.
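
As a rough illustration of what the critic is trained toward, the sketch below computes a one-step target for Q(s,a) in the standard SAC formulation, which uses two critics and an entropy term. The function name `sac_td_target` and the numbers are hypothetical, and the code is a plain-Python sketch rather than what TF-Agents runs internally.

```python
def sac_td_target(reward, discount, gamma, next_q1, next_q2, alpha, next_log_prob):
    """One-step soft Bellman target for a single transition.

    discount is 0.0 at the end of an episode and 1.0 otherwise.
    """
    # Take the smaller of the two critic estimates to limit overestimation,
    # then add the entropy bonus (-alpha * log-probability of the next action).
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * discount * soft_value


# Hypothetical numbers for a single transition.
target = sac_td_target(reward=1.0, discount=1.0, gamma=0.99,
                       next_q1=10.0, next_q2=12.0,
                       alpha=0.2, next_log_prob=-1.5)
print(target)

# At a terminal step the target is just the reward.
terminal_target = sac_td_target(reward=1.0, discount=0.0, gamma=0.99,
                                next_q1=10.0, next_q2=12.0,
                                alpha=0.2, next_log_prob=-1.5)
print(terminal_target)
```

The critic's loss (here `td_errors_loss_fn`) then measures how far its current estimate of Q(s,a) is from this target.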



In [33]:
#Critic Network
observation_spec = train_env.observation_spec()
action_spec = train_env.action_spec()
critic_net = critic_network.CriticNetwork(
    (observation_spec, action_spec), 
    observation_fc_layer_params=None, 
    action_fc_layer_params=None, 
    #joint layers are the fully connected layers applied after the observation
    #and action layers
    joint_fc_layer_params=critic_joint_fc_layer_params)


We will then use the critic to train the actor network so it can generate actions given an observation.



In [40]:
#Normal distribution - creates an action distribution based on observations
def normal_projection_net(action_spec, init_means_output_factor=0.1):
  return normal_projection_network.NormalProjectionNetwork(
      action_spec,
      mean_transform=None, #no extra transform is applied to the mean
      state_dependent_std=True, #std is predicted from the observation
      init_means_output_factor=init_means_output_factor, #output factor for initializing the action-mean weights
      std_transform=sac_agent.std_clip_transform, #keeps the std within a bounded range
      scale_distribution=True) #scales the distribution so actions fit the action spec

actor_net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec,
    action_spec,
    fc_layer_params=actor_fc_layer_params, dtype=tf.float64,
    continuous_projection_net=normal_projection_net)
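
The effect of `scale_distribution=True` can be pictured with a small sketch: an unbounded sample from the normal distribution is squashed into the action bounds. Tanh-based squashing is the common approach in SAC implementations; the helper `squash_to_bounds` below is hypothetical, and the exact TF-Agents internals may differ. The bounds 0.0 and 2.0 are the ones printed by the action spec earlier.

```python
import math


def squash_to_bounds(raw_action, minimum, maximum):
    """Map an unbounded value into [minimum, maximum] using tanh."""
    midpoint = (maximum + minimum) / 2.0
    half_range = (maximum - minimum) / 2.0
    return midpoint + half_range * math.tanh(raw_action)


# A raw sample of 0.0 lands in the middle of the action range.
print(squash_to_bounds(0.0, 0.0, 2.0))

# Large positive or negative samples approach, but stay within, the bounds.
for raw in (-5.0, -1.0, 1.0, 5.0):
    print(raw, squash_to_bounds(raw, 0.0, 2.0))
```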


Agent

In [43]:
#Instantiate Agent
tf.keras.backend.set_floatx(
   'float64'
)
global_step = tf.compat.v1.train.get_or_create_global_step()
tf_agent = sac_agent.SacAgent(
    train_env.time_step_spec(),
    action_spec,
    actor_network=actor_net,
    critic_network=critic_net,
    actor_optimizer=tf.compat.v1.train.AdamOptimizer(
        learning_rate=actor_learning_rate),
    critic_optimizer=tf.compat.v1.train.AdamOptimizer(
        learning_rate=critic_learning_rate),
    alpha_optimizer=tf.compat.v1.train.AdamOptimizer(
        learning_rate=alpha_learning_rate),
    target_update_tau=target_update_tau, #soft-update factor for the target networks
    target_update_period=target_update_period,
    td_errors_loss_fn=tf.compat.v1.losses.mean_squared_error,
    gamma=gamma,
    reward_scale_factor=reward_scale_factor,
    gradient_clipping=gradient_clipping,
    train_step_counter=global_step)
tf_agent.initialize()




To change all layers to have dtype float32 by default, call `tf.keras.backend.set_floatx('float32')`. To change just this layer, pass dtype='float32' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.




Policies

A policy maps a state observation (a time step) to an action or to a distribution over actions.

The main method is:
```
policy_step = policy.action(time_step)
```
which returns a named tuple:
```
PolicyStep(action, state, info)
```
Agents contain two policies: the main policy (`tf_agent.policy`), used for evaluation, and the behavioral policy (`tf_agent.collect_policy`), used for data collection.





In [36]:
eval_policy = greedy_policy.GreedyPolicy(tf_agent.policy)
collect_policy = tf_agent.collect_policy
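
The difference between the greedy evaluation policy and the collect policy can be sketched without TF-Agents. The class `GaussianPolicy` below is hypothetical and uses a fixed mean and standard deviation instead of a neural network; it only illustrates that a greedy policy returns the most likely action while a collect policy samples.

```python
import random


class GaussianPolicy:
    """Toy stochastic policy with a fixed Gaussian action distribution."""

    def __init__(self, mean, std, seed=0):
        self.mean = mean
        self.std = std
        self._rng = random.Random(seed)

    def sample(self):
        # Behavioral (collect) policy: draw a random action to explore.
        return self._rng.gauss(self.mean, self.std)

    def mode(self):
        # Greedy (evaluation) policy: the most likely action, which for a
        # Gaussian is its mean.
        return self.mean


policy = GaussianPolicy(mean=0.8, std=0.3)
collect_actions = [policy.sample() for _ in range(5)]
greedy_actions = [policy.mode() for _ in range(5)]

print(collect_actions)  # varies from call to call
print(greedy_actions)   # always the same action
```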

Metrics and Evaluation

Here we use the average return metric: the sum of rewards received while running a policy for one episode, averaged over several episodes.


In [37]:
def compute_avg_return(environment, policy, num_episodes=5):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0]


compute_avg_return(eval_env, eval_policy, num_eval_episodes)

# Please also see the metrics module for standard implementations of different
# metrics.

InvalidArgumentError: ignored

Replay Buffer

To keep track of the data collected from the environment, we use a `TFUniformReplayBuffer`. It is constructed from tensor specs that describe the data to be stored, which the agent provides as `tf_agent.collect_data_spec`.


In [29]:
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=tf_agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_capacity
)

#collect_data_spec describes the Trajectory tensors that the buffer stores
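
Conceptually, a uniform replay buffer is a fixed-capacity store that drops its oldest items when full and returns items chosen uniformly at random. The sketch below shows that idea in plain Python; the class `UniformReplayBuffer` and the dictionary items are assumptions for illustration and do not mirror the `TFUniformReplayBuffer` API.

```python
import random
from collections import deque


class UniformReplayBuffer:
    """Fixed-capacity buffer with uniform random sampling (with replacement)."""

    def __init__(self, max_length, seed=0):
        # A deque with maxlen discards the oldest item once it is full.
        self._storage = deque(maxlen=max_length)
        self._rng = random.Random(seed)

    def add(self, item):
        self._storage.append(item)

    def sample(self, batch_size):
        # Every stored item is equally likely to be picked on each draw.
        return [self._rng.choice(self._storage) for _ in range(batch_size)]

    def __len__(self):
        return len(self._storage)


buffer = UniformReplayBuffer(max_length=3)
for step in range(5):
    buffer.add({"observation": step, "reward": float(step)})

# Only the three most recent items (steps 2, 3 and 4) are still stored.
batch = buffer.sample(batch_size=4)
print(len(buffer), batch)
```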

Data Collection

We create a driver that collects experience using our collect policy and writes it to the replay buffer. Before training, we run it for `initial_collect_steps` steps to seed the buffer with initial experience.

In [None]:
initial_collect_driver = dynamic_step_driver.DynamicStepDriver(
    train_env,
    collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=initial_collect_steps)

initial_collect_driver.run()

Data Pipeline

This is how we feed data from the replay buffer to the agent. We specify the sample batch size, which is the number of items sampled from the replay buffer for each batch.




In [None]:
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3, sample_batch_size=batch_size, num_steps=2).prefetch(3)
#num_steps=2 because the buffer can sample two adjacent rows:
#(the current observation and the next observation)

iterator = iter(dataset)
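
To picture what `num_steps=2` means, the sketch below slices a short sequence of observations into adjacent pairs, each holding a current and a next observation. The helper `adjacent_pairs` is hypothetical, and apart from the initial 0.75 the observation values are made up for illustration.

```python
def adjacent_pairs(sequence, num_steps=2):
    """Return every window of num_steps consecutive items."""
    return [sequence[i:i + num_steps]
            for i in range(len(sequence) - num_steps + 1)]


# Hypothetical observations from four consecutive time steps.
observations = [0.75, 0.70, 0.66, 0.61]

pairs = adjacent_pairs(observations)
for current_obs, next_obs in pairs:
    print(current_obs, "->", next_obs)
```

Each pair supplies one transition for training, which is why the agent needs two adjacent rows rather than one.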

Training the agent

The training will collect data from the environment and optimize the policy. Periodically, the policy will be evaluated to see progress.

In [None]:
collect_driver = dynamic_step_driver.DynamicStepDriver(
    train_env,
    collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=collect_steps_per_iteration
)

#Reset train step
tf_agent.train_step_counter.assign(0)

#Evaluate metric of policy before training
avg_return = compute_avg_return(eval_env, eval_policy,
                                num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):

  #Collect a few steps using the collect policy and save them to the buffer
  for _ in range(collect_steps_per_iteration):
    collect_driver.run()

  #Sample a batch of data from the buffer and update the agent
  experience, unused_info = next(iterator)
  train_loss = tf_agent.train(experience)

  step = tf_agent.train_step_counter.numpy()

  if step % log_interval == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss.loss))

  if step % eval_interval == 0:
    avg_return = compute_avg_return(eval_env, eval_policy, 
                                    num_eval_episodes)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)


Visualizations

In [None]:
steps = range(0, num_iterations + 1, eval_interval)
plt.plot(steps, returns)
plt.ylabel('Average Return')
plt.xlabel("Step")
plt.ylim()
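
The plot above pairs `steps` with `returns`, so the two must have the same length. The check below reproduces that bookkeeping with the hyperparameter values used in this notebook, assuming the train step counter advances by exactly one per training iteration.

```python
num_iterations = 100000
eval_interval = 10000

# x-axis values used for the plot.
steps = range(0, num_iterations + 1, eval_interval)

# Training steps at which the loop runs an evaluation.
eval_steps = [step for step in range(1, num_iterations + 1)
              if step % eval_interval == 0]

# One evaluation happens before training, plus one per evaluation step.
num_returns = 1 + len(eval_steps)

print(len(steps), num_returns)
```

If `num_iterations` is not a multiple of `eval_interval`, the two lengths still agree, but the final training steps after the last evaluation are not reflected in the plot.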

Videos



In [None]:
def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  # Read the file inside a context manager so it is closed afterwards.
  with open(filename, 'rb') as f:
    video = f.read()
  b64 = base64.b64encode(video)
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)

In [None]:
num_episodes = 3
video_filename = 'sac_fishing.mp4'
with imageio.get_writer(video_filename, fps=60) as video:
  for _ in range(num_episodes):
    time_step = eval_env.reset()
    video.append_data(eval_py_env.render())
    while not time_step.is_last():
      action_step = tf_agent.policy.action(time_step)
      time_step = eval_env.step(action_step.action)
      video.append_data(eval_py_env.render())

embed_mp4(video_filename)