<a href="https://colab.research.google.com/github/NC25/gym_fishing/blob/master/sac_fishing-v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Soft Actor Critic

Soft Actor-Critic (SAC) is a popular agent for training in large, continuous domains.

*   Off-policy: learns from transitions stored in a replay buffer, including those collected by earlier versions of the policy.
*   Maximizes both expected reward and policy entropy.
*   The agent gets a bonus reward at each time step proportional to the instantaneous entropy of the policy.
*   Optimizes the policy while approximating the Q-functions.
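
The entropy bonus described in the list above corresponds to the maximum-entropy objective, $J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right]$, where $\alpha$ is the entropy coefficient. As a minimal sketch, assuming a one-dimensional, unsquashed Gaussian policy, the per-step bonus could be computed as follows. The helper names and the values of `alpha` and `sigma` are illustrative and do not come from this notebook.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def entropy_augmented_reward(reward, sigma, alpha):
    """Reward plus the entropy bonus alpha * H(pi(.|s)) (hypothetical helper)."""
    return reward + alpha * gaussian_entropy(sigma)

# Illustrative values only: a wider policy (larger sigma) earns a larger bonus.
narrow = entropy_augmented_reward(reward=1.0, sigma=0.1, alpha=0.2)
wide = entropy_augmented_reward(reward=1.0, sigma=1.0, alpha=0.2)
print(narrow, wide)
```

Differential entropy can be negative for a narrow distribution, so the bonus acts as a penalty when the policy becomes nearly deterministic.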



---

In [None]:
!sudo apt-get install -y xvfb ffmpeg
!pip install 'gym==0.10.11'
!pip install 'imageio==2.4.0'
!pip install matplotlib
!pip install PILLOW
!pip install tf-agents
!pip install 'pybullet==2.4.2'
!pip install 'pyglet==1.3.2'
!pip install pyvirtualdisplay
!pip install --upgrade setuptools

# Hyperparameter Tuning (optional)

Uncomment and run the cell below to install the dependencies for Baselines Zoo, which is used only for hyperparameter tuning.

In [None]:
#!apt-get install swig cmake libopenmpi-dev zlib1g-dev ffmpeg
#!pip install stable-baselines box2d box2d-kengz pyyaml pybullet optuna pytablewriter

# Dependencies
Run the following cells to install the remaining dependencies.

In [None]:
!pip install stable-baselines

In [None]:
#!pip install pyglet==1.5.0

Collecting pyglet==1.5.0
  Downloading https://files.pythonhosted.org/packages/70/ca/20aee170afe6011e295e34b27ad7d7ccd795faba581dd3c6f7cec237f561/pyglet-1.5.0-py2.py3-none-any.whl (1.0MB)
     |████████████████████████████████| 1.0MB 2.8MB/s 
Installing collected packages: pyglet
  Found existing installation: pyglet 1.5.7
    Uninstalling pyglet-1.5.7:
      Successfully uninstalled pyglet-1.5.7
Successfully installed pyglet-1.5.0


In [None]:
%tensorflow_version 1.x
!apt-get install ffmpeg freeglut3-dev xvfb

TensorFlow is already loaded. Please restart the runtime to change versions.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
freeglut3-dev is already the newest version (2.8.1-3).
ffmpeg is already the newest version (7:3.4.6-0ubuntu0.18.04.1).
xvfb is already the newest version (2:1.19.6-1ubuntu4.4).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


Downgrade TensorFlow to version 1.4.

In [None]:
!pip install tensorflow==1.4

# Clone Repository

Run the cells below in the order shown: clone the repository, build it, then install it.

In [None]:
!git clone https://github.com/boettiger-lab/gym_fishing.git

fatal: destination path 'gym_fishing' already exists and is not an empty directory.


In [None]:
!python gym_fishing/setup.py sdist bdist_wheel 

In [None]:
!pip install -e ./gym_fishing/

Obtaining file:///content/gym_fishing
Installing collected packages: gym-fishing
  Found existing installation: gym-fishing 0.0.2
    Can't uninstall 'gym-fishing'. No files were found to uninstall.
  Running setup.py develop for gym-fishing
Successfully installed gym-fishing


In [None]:
!ls


build  dist  gym_fishing  gym_fishing.egg-info	sample_data


In [None]:
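# Note: `!cd` runs in a subshell, so it does not change the notebook's working directory.
# Use `%cd gym_fishing` instead if the working directory needs to change.
!cd gym_fishing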

In [None]:
import gym_fishing

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from IPython import display

In [None]:
import gym
import numpy as np

from stable_baselines.sac.policies import MlpPolicy
from stable_baselines import SAC

# Load Environment

In [None]:
env = gym.make('fishing-v1')

model = SAC(MlpPolicy, env, verbose=1)  # verbose=1 prints training information


# Helper Function

This helper function evaluates our untrained agent. It takes the environment from the model and, for each episode, resets it and steps through it: at every time step the model predicts an action from the current observation, and the environment returns the next observation and a reward.

The rewards are summed per episode, and the function prints and returns the mean of those episode totals.

In [None]:
def evaluate(model, num_episodes=100):
    """Run the model for num_episodes episodes and return the mean episode reward."""
    # This function only works for a single environment
    env = model.get_env()
    all_episode_rewards = []
    for _ in range(num_episodes):
        episode_rewards = []
        done = False
        obs = env.reset()
        while not done:
            # Choose an action from the current observation, then step the environment
            action, _states = model.predict(obs)
            obs, reward, done, info = env.step(action)
            episode_rewards.append(reward)

        # Total reward collected over this episode
        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)

    return mean_episode_reward  # mean reward across all episodes

mean_reward_before_train = evaluate(model, num_episodes=250)

Mean reward: 0.75 Num episodes: 250
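
The loop in `evaluate` can be tried without installing anything by swapping in stand-ins. The sketch below is illustrative only: `ToyEnv`, `ConstantModel`, and `mean_episode_return` are hypothetical names that mimic the `reset`/`step` and `predict` calls used above, and are not part of `gym_fishing` or Stable Baselines.

```python
class ToyEnv:
    """Hypothetical fixed-horizon environment; the reward equals the action taken."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # initial observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return float(self.t), float(action), done, {}


class ConstantModel:
    """Hypothetical stand-in for an agent; always returns the same action."""

    def __init__(self, action):
        self.action = action

    def predict(self, obs):
        return self.action, None


def mean_episode_return(model, env, num_episodes=10):
    """Average the summed reward over num_episodes episodes."""
    returns = []
    for _ in range(num_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs)
            obs, reward, done, _info = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)


print(mean_episode_return(ConstantModel(0.5), ToyEnv(), num_episodes=4))
```

Each toy episode lasts three steps with a reward of 0.5 per step, so every episode total is 1.5 and so is the mean.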


The output above shows the mean episode reward of our untrained agent.

So let's start training!

In [None]:

model = SAC(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=50000, log_interval=10)
model.save("fishing-v1")

Streaming output truncated to the last 5000 lines.
| policy_loss             | -0.5982082    |
| qf1_loss                | 5.4603765e-07 |
| qf2_loss                | 2.4431975e-07 |
| time_elapsed            | 118           |
| total timesteps         | 23143         |
| value_loss              | 8.019773e-05  |
-------------------------------------------
-------------------------------------------
| current_lr              | 0.0003        |
| ent_coef                | 0.0071541476  |
| ent_coef_loss           | 1.2977657     |
| entropy                 | 0.9696589     |
| episodes                | 9060          |
| fps                     | 194           |
| mean 100 episode reward | 0.9           |
| n_updates               | 23115         |
| policy_loss             | -0.64237183   |
| qf1_loss                | 3.9034663e-07 |
| qf2_loss                | 3.3640067e-06 |
| time_elapsed            | 119           |
| total timesteps         | 23214         |
| value_los

In [None]:
from stable_baselines.common.evaluation import evaluate_policy

# Evaluation

We can evaluate the trained agent and see how it compares to our untrained agent.

In [None]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=50000)

print("mean_reward: " + str(mean_reward) + "+/- " + str(std_reward))

mean_reward: 1.4349191705089428+/- 2.220446049250313e-16
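
`evaluate_policy` reports the mean and standard deviation of the per-episode rewards. The sketch below reproduces that kind of summary with the standard library only, assuming a population standard deviation (NumPy's default for `std`). The `summarize` helper and the reward values are made up for illustration.

```python
import statistics

def summarize(episode_returns):
    """Return (mean, population standard deviation) of a list of episode returns."""
    mean = statistics.mean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std

# Made-up returns: one run with spread, one where every episode is identical.
varied = summarize([1.0, 2.0, 3.0])
identical = summarize([1.5, 1.5, 1.5])
print(varied)
print(identical)
```

When every episode yields the same return, the spread collapses to zero. A reported value on the order of 1e-16, as in the output above, is consistent with floating-point rounding rather than real variation between episodes.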


In [3]:
#!python -m train.py --algo sac2 --env fishing-v1 -n 50000 -optimize --n-trials 1000 --n-jobs 2 --sampler tpe --pruner median#

/usr/bin/python3: Error while finding module specification for 'train.py' (ModuleNotFoundError: No module named 'train')


In [None]:
del model

model = SAC.load("fishing-v1")

obs = env.reset()


#render = lambda : plt.imshow(env.render(mode='rgb_array'))
 
#env.reset()
#while True: 
  #action, _states = model.predict(obs)
  #obs, rewards, dones, info = env.step(action)
  #env.render()