## 1. Installation of stable-baseline3
- Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch

In [1]:
!pip install stable-baselines3[extra]



## 2. Import Dependecies

In [2]:
import os  # Provides functions for interacting with the operating system (os)
import gym # Gym is an open source Python library for developing and comparing reinforcement learning algorithms
           # Go to github.com/openai/gym/blob/master/gym/core.py to see the methods a gym env can peform
from stable_baselines3 import PPO # Proximal Policy Optimization (PPO) Algorithm
from stable_baselines3.common.vec_env import DummyVecEnv # Creates a simple vectorized wrapper for multiple environments 
from stable_baselines3.common.evaluation import evaluate_policy # Runs policy for n_eval_episodes episodes and returns average reward. This is made to work only with one env.

## 3. Create Environment

In [3]:
environment_name = 'CartPole-v0' # Go to gym.openai.com to discover other availabel environments
env = gym.make(environment_name)

## 4. Test Environment

In [4]:
episodes = 5 
for episode in range(1, episodes+1): 
    env.reset() # Resets the environment to an initial state and returns an initial observation                        
    done = True 
    score = 0
    
    while not done:
        env.render() # Gym environment rendering                        
        action = env.action_space.sample() # The action_space used in the gym environment is used to define characteristics of the action space of the environment. 
        n_state, reward, done, info = env.step(action) # Run one timestep of the environment's dynamics, returns: observation, reward, done, information
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

Episode:1 Score:0
Episode:2 Score:0
Episode:3 Score:0
Episode:4 Score:0
Episode:5 Score:0


In [10]:
env.action_space.sample()

1

## 5. Train Model

In [6]:
log_path = os.path.join('Training', 'Logs') # Similar to a logbook, Logs contain all the important records about the course of the event.

In [7]:
env = gym.make(environment_name)
env = DummyVecEnv([lambda:env])
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log = log_path)
model.learn(total_timesteps = 20000)

Using cuda device
Logging to Training\Logs\PPO_14
-----------------------------
| time/              |      |
|    fps             | 344  |
|    iterations      | 1    |
|    time_elapsed    | 5    |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 377          |
|    iterations           | 2            |
|    time_elapsed         | 10           |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0060490808 |
|    clip_fraction        | 0.0662       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.688       |
|    explained_variance   | 0.00712      |
|    learning_rate        | 0.0003       |
|    loss                 | 8.11         |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.00967     |
|    value_loss           | 64.3         |
---------

<stable_baselines3.ppo.ppo.PPO at 0x25dc9e09820>

## 6. Save Model

In [8]:
PPO_Path = os.path.join('Training', 'Saved Models', 'PPO_Model')
model.save(PPO_Path)

## 7. Evaluate and Test Model

In [9]:
# Frist possibiliy to test the model
evaluate_policy(model, env, n_eval_episodes = 10, render = True)
env.close()



(200.0, 0.0)

In [11]:
# Second possibilty to test the model
episodes = 5                             
for episode in range(1, episodes+1): 
    obs = env.reset()
    score = 0
    
    while not done:
        env.render()                        
        action, _ = model.predict(obs) # Now the model is used here
        obs , reward, done, info = env.step(action) 
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

Episode:1 Score:[200.]
Episode:2 Score:[200.]
Episode:3 Score:[200.]
Episode:4 Score:[200.]
Episode:5 Score:[200.]


## 8. Viewing Logs in Tensorboard
- TensorBoard provides the visualization and tooling needed for machine learning experimentation. For example:
  - Tracking and visualizing metrics such as loss and accuracy
  - Visualizing the model graph (ops and layers)
  - Viewing histograms of weights, biases, or other tensors as they change over time

In [13]:
training_log_path = os.path.join(log_path,'PPO_1')

In [1]:
!tensorboard --logdir = {training_log_path} --host localhost

TensorFlow installation not found - running with reduced feature set.
usage: tensorboard [-h] [--helpfull] [--logdir PATH] [--logdir_spec PATH_SPEC]
                   [--host ADDR] [--bind_all] [--port PORT]
                   [--reuse_port BOOL] [--load_fast {false,auto,true}]
                   [--extra_data_server_flags EXTRA_DATA_SERVER_FLAGS]
                   [--grpc_creds_type {local,ssl,ssl_dev}]
                   [--grpc_data_provider PORT] [--purge_orphaned_data BOOL]
                   [--db URI] [--db_import] [--inspect] [--version_tb]
                   [--tag TAG] [--event_file PATH] [--path_prefix PATH]
                   [--window_title TEXT] [--max_reload_threads COUNT]
                   [--reload_interval SECONDS] [--reload_task TYPE]
                   [--reload_multifile BOOL]
                   [--reload_multifile_inactive_secs SECONDS]
                   [--generic_data TYPE]
                   [--samples_per_plugin SAMPLES_PER_PLUGIN]
                   [--wh

## 9. The Callback
- A callback is a set of functions that will be called at given stages of the training procedure. You can use callbacks to access internal state of the RL model during training. It allows one to do monitoring, auto saving, model manipulation, progress bars, …

In [15]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

In [16]:
save_path = os.path.join('Training', 'Saved Models')

In [17]:
stop_callback = StopTrainingOnRewardThreshold(reward_threshold = 200, verbose =1) # Stops the training if the mean reward achieved by the RL model is above a threshold
eval_callback = EvalCallback(env,
                            callback_on_new_best = stop_callback, 
                            best_model_save_path = save_path, 
                            eval_freq = 10000,
                            verbose = 1 ) # Evaluate periodically the performance of an agent, using a separate test environment. It will save the best model.

In [18]:
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log = log_path)

Using cuda device


In [19]:
model.learn(total_timesteps = 20000, callback = eval_callback)

Logging to Training\Logs\PPO_15
-----------------------------
| time/              |      |
|    fps             | 525  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 408         |
|    iterations           | 2           |
|    time_elapsed         | 10          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008508554 |
|    clip_fraction        | 0.103       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | -0.00189    |
|    learning_rate        | 0.0003      |
|    loss                 | 7.41        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0172     |
|    value_loss           | 55.8        |
-----------------------------------------
--

<stable_baselines3.ppo.ppo.PPO at 0x25dac3859a0>

## 10. Changing Policies 

In [20]:
net_arch = [dict(pi = [128, 128, 128, 128], vf = [128, 128, 128, 128])]
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log = log_path, policy_kwargs = {'net_arch': net_arch})
model.learn(total_timesteps = 20000, callback = eval_callback)

Using cuda device
Logging to Training\Logs\PPO_16
-----------------------------
| time/              |      |
|    fps             | 558  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 345         |
|    iterations           | 2           |
|    time_elapsed         | 11          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.015575488 |
|    clip_fraction        | 0.242       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.681      |
|    explained_variance   | -0.00366    |
|    learning_rate        | 0.0003      |
|    loss                 | 3.56        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0274     |
|    value_loss           | 21.9        |
--------------------------

<stable_baselines3.ppo.ppo.PPO at 0x25dac38c730>

## 11. Changing Algorithms

In [21]:
from stable_baselines3 import DQN # Deep Q Learning (DQN)
model = DQN('MlpPolicy', env, verbose = 1, tensorboard_log = log_path)

Using cuda device


In [22]:
model.learn(totaltimesteps = 10000, callback = eval_callBack)

NameError: name 'eval_callBack' is not defined