# 1. Import dependencies

In [1]:
import os  # operating system library
import gym # allow us to build environment
from stable_baselines3 import PPO # import first algorithm
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

Stable-Baselines3 lets you vectorize environments, meaning an RL agent can be trained on several copies of the environment at the same time, which can speed up training considerably.
For this part, we can treat DummyVecEnv as a simple wrapper around a single environment that makes it easier to train.

evaluate_policy makes it easier to test how a model is actually performing: it returns the average reward over a given number of episodes and the standard deviation for the agent you are training.

This case: Classic Control -> CartPole -> train an RL model that balances the pole on the cart

OpenAI Gym environment spaces:

- Box: a range of values (used where you want continuous values), defined by low, high and shape
- Discrete: a set of items (mostly used for actions)
- Tuple: used to combine spaces
- Dict: a dictionary of spaces
- MultiBinary: a vector of binary (0/1) values
- MultiDiscrete: several Discrete values combined
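
To make the two most common space types concrete, here is a minimal standard-library sketch of what sampling from a Discrete and a Box space amounts to. `ToyDiscrete` and `ToyBox` are hypothetical stand-ins written for illustration; they are not the gym classes, which offer more (dtypes, multi-dimensional shapes, seeding). The range used below is an arbitrary example.

```python
import random


class ToyDiscrete:
    """Hypothetical stand-in for a Discrete(n) space: the integers 0..n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class ToyBox:
    """Hypothetical stand-in for a Box space: one [low, high] range per dimension."""

    def __init__(self, low, high):
        assert len(low) == len(high)
        self.low, self.high = list(low), list(high)
        self.shape = (len(self.low),)

    def sample(self):
        return [random.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]

    def contains(self, x):
        return len(x) == len(self.low) and all(
            lo <= v <= hi for lo, v, hi in zip(self.low, x, self.high)
        )


action_space = ToyDiscrete(2)            # two possible actions: 0 or 1
position_space = ToyBox([-4.8], [4.8])   # one continuous value in [-4.8, 4.8]

action = action_space.sample()
position = position_space.sample()
print(action, action_space.contains(action))
print(position, position_space.contains(position))
```

A Discrete space suits a fixed menu of actions, while a Box suits measurements such as positions and velocities.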

# 2. Load environment

In [5]:
environment_name = "CartPole-v0" # preinstalled openai gym env
env = gym.make(environment_name) # making our env

- Right now we are just going to take random actions in our environment
- Later we will train the agent to make the right moves in the environment

In [6]:
episodes = 5
# test out the CartPole environment 5 times
for episode in range(1, episodes + 1):
    # reset our environment and get an initial observation
    state = env.reset()
    done = False
    score = 0
    while not done:
        # view the graphical representation of the environment
        env.render()
        # generate a random action
        action = env.action_space.sample()
        # pass the action to the environment
        n_state, reward, done, info = env.step(action)
        # accumulate the reward
        score += reward
    # print the current episode number (not the total number of episodes)
    print("Episode:{} Score:{}".format(episode, score))
# close our environment
env.close()

Episode:1 Score:13.0
Episode:2 Score:16.0
Episode:3 Score:25.0
Episode:4 Score:20.0
Episode:5 Score:14.0


### Understanding the environment

In [7]:
env.reset() 
# the observation we get for our pole
# we will pass these observations to our RL agent
# to determine the best action
# to maximize our reward

array([ 3.9578103e-03, -3.3120640e-02,  1.7329494e-02,  6.1306520e-05],
      dtype=float32)

In [8]:
env.step(1)
# passing an action to the environment returns
# 1. the next observation
# 2. the reward (CartPole gives +1 for every step the pole stays up)
# 3. whether or not the episode is done
# 4. an info dictionary (empty here)

(array([ 0.0032954 ,  0.16174854,  0.01733072, -0.28710398], dtype=float32),
 1.0,
 False,
 {})

1) There are two parts to an environment: an action space and an observation space
    action space: the actions you can take in that environment
    observation space: what your observations actually look like in that environment

In [9]:
env.action_space # Discrete(2) means there are only two actions: 0 and 1

Discrete(2)

In [10]:
env.observation_space # lower bounds, upper bounds, shape, dtype

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)

In [11]:
env.action_space.sample()
# 0: push cart to the left, 1: push cart to the right

0

In [12]:
env.observation_space.sample()
# cart position, cart velocity, pole angle,             pole angular velocity
# (-4.8, 4.8),   (-inf, inf),   (-0.418 rad, 0.418 rad), (-inf, inf)

array([-2.0636480e+00, -2.6776991e+38, -3.9457178e-01,  7.8232093e+37],
      dtype=float32)

# 3. Train an RL model

Model-free RL: learns directly from the states, actions and rewards it experiences, without building a model of how the environment behaves
    A2C, PPO, DQN
Model-based RL: tries to predict future states of the environment and uses those predictions to choose the best possible action

Certain algorithms only work with certain types of action space

Which type of algorithm are we going to use?
What are the training metrics?

In [13]:
log_path = os.path.join('Training','Logs')
# make your directories first

In [14]:
log_path

'Training\\Logs'
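
The comment above says to make the directories first. A minimal sketch of one way to do that with the standard library. It builds the path under a temporary base directory, which is an assumption made here so the example leaves nothing behind in the working folder; `demo_log_path` is a hypothetical name, and the notebook itself uses the relative path directly.

```python
import os
import tempfile

# Assumption for illustration: work inside a throwaway base directory.
base_dir = tempfile.mkdtemp()
demo_log_path = os.path.join(base_dir, 'Training', 'Logs')

# exist_ok=True makes the call safe to repeat if the folders already exist.
os.makedirs(demo_log_path, exist_ok=True)
os.makedirs(demo_log_path, exist_ok=True)

print(os.path.isdir(demo_log_path))
```

Using os.path.join rather than a hand-written separator keeps the path valid on both Windows and Unix-like systems.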

In [15]:
env = gym.make(environment_name)

env = DummyVecEnv([lambda:env])
# wrap our environment in a DummyVecEnv,
# a wrapper for a non-vectorized environment
# the lambda is an environment creation function
model = PPO('MlpPolicy',env,verbose=1,tensorboard_log = log_path)
# defining our model (our agent)
# model: PPO (run PPO?? in the notebook to inspect it)
# 1. the policy we are going to use: MlpPolicy, a multilayer perceptron policy
# 2. env: our environment, the DummyVecEnv
# 3. tensorboard_log: where to log the results for this model

Using cuda device


In [16]:
model.learn(total_timesteps = 20000)

Logging to Training\Logs\PPO_4
-----------------------------
| time/              |      |
|    fps             | 571  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 421         |
|    iterations           | 2           |
|    time_elapsed         | 9           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009081107 |
|    clip_fraction        | 0.112       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | -0.000354   |
|    learning_rate        | 0.0003      |
|    loss                 | 8.53        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0163     |
|    value_loss           | 60.7        |
-----------------------------------------
---

<stable_baselines3.ppo.ppo.PPO at 0x265d75bdf30>

# 4. Save and reload the model

In [17]:
PPO_Path = os.path.join('Training','Saved Models','PPO_Model_Cartpole')

In [18]:
model.save(PPO_Path)

In [19]:
del model 
# the model is now deleted
# we need to retrain it or load the saved model

In [20]:
PPO_Path  # saved path

'Training\\Saved Models\\PPO_Model_Cartpole'

In [21]:
model = PPO.load(PPO_Path,env = env)

# 5. Evaluation
See how our trained model is actually performing

The training metrics (rollout metrics) depend on the algorithm you are using
ep_len_mean: on average, how long an episode lasted before it was done
ep_rew_mean: the average reward the agent accumulated per episode

These can be monitored in TensorBoard through the log path we passed in

In [22]:
evaluate_policy(model,env,n_eval_episodes = 10, render = True)
# (model, environment, number of evaluation episodes, whether to render in real time)



(200.0, 0.0)

In [23]:
env.close()

200: average reward,  0.0: the standard deviation of the reward
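
The pair returned above is a mean and a standard deviation over the per-episode rewards. A minimal sketch of that arithmetic with the standard library. The reward lists are hypothetical (the first is chosen so that it reproduces the `(200.0, 0.0)` result), and the population standard deviation is assumed here.

```python
from statistics import mean, pstdev

# Hypothetical per-episode rewards: ten episodes that all reach the 200-step cap.
episode_rewards = [200.0] * 10

mean_reward = mean(episode_rewards)
std_reward = pstdev(episode_rewards)   # population standard deviation
print((mean_reward, std_reward))

# A less consistent agent gives a lower mean and a non-zero spread.
mixed_rewards = [200.0, 150.0, 200.0, 170.0]
print((mean(mixed_rewards), pstdev(mixed_rewards)))
```

A standard deviation of 0.0 means every evaluation episode ended with the same reward, which is what you would expect once the agent reliably keeps the pole up for the full episode.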

# 6. Test model
In section 5 we tested our model through evaluate_policy, which wraps the whole loop for us
Now we want to run the same loop as in section 2, with the model choosing the actions

This shows the full workflow: how to define an environment, train a model, and evaluate and test it

In [25]:
action,_ = model.predict(obs)

NameError: name 'obs' is not defined

In [26]:
action

0

In [27]:
episodes = 5 
for episode in range(1, episodes + 1):
    obs = env.reset() 
    done = False
    score = 0
    while not done:
        env.render()
        action,_ = model.predict(obs) # now using model here
        obs,reward,done,info = env.step(action)
        score += reward
    # print the current episode number (not the total number of episodes)
    print("Episode:{} Score:{}".format(episode,score))
env.close()

Episode:1 Score:[200.]
Episode:2 Score:[200.]
Episode:3 Score:[200.]
Episode:4 Score:[200.]
Episode:5 Score:[200.]


In [28]:
obs = env.reset() 
# we get an observation from our observation space
# we are going to take this observation and pass it to our model

In [29]:
model.predict(obs)
# we are not taking a random action here
# we are calling predict on our model with the observation
# it means: based on the current observation, we should take action 0 to get the best reward

(array([0], dtype=int64), None)

In [30]:
env.step(action)
# 1. the state after we take our action
# 2. our reward (1: the pole has not fallen, so we collect a reward of 1)


(array([[ 0.02791913,  0.21281123, -0.02722932, -0.2593118 ]],
       dtype=float32),
 array([1.], dtype=float32),
 array([False]),
 [{}])

# 7. Viewing Logs in Tensorboard
For more sophisticated environments we should check the logs in TensorBoard

Ideally run this from a command prompt

In [31]:
Training_log_path = os.path.join(log_path,'PPO_2')

In [32]:
Training_log_path

'Training\\Logs\\PPO_2'

In [33]:
# !tensorboard --logdir={Training_log_path}
# tensorboard --logdir='Training\Logs\PPO_2'

The core metrics you should be looking at:

1. Average reward: an indication of how well your model is going to perform in that particular environment using that particular reward function

2. Average episode length: how long, on average, your agent lasts in that particular environment

Training Strategies

1. Train 
2. Hyperparameter tuning (Optuna)
3. Try different algorithms

# 8. Adding a callback to the training stage

In this part we specify a reward threshold, meaning training stops once the agent reaches a certain reward

We can also try defining a different neural network architecture (so far we used the default MLP)

Or use a different algorithm

In [34]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

In [35]:
save_path = os.path.join('Training','Save Models')
# define the save path of our best model

In [36]:
stop_callback = StopTrainingOnRewardThreshold(reward_threshold = 200, verbose = 1)
# set up a StopTrainingOnRewardThreshold callback
# a callback that stops our training once we pass a certain reward threshold
eval_callback = EvalCallback(env,
                            callback_on_new_best=stop_callback,
                            eval_freq = 10000,
                            best_model_save_path = save_path,
                            verbose = 1)
# EvalCallback is triggered periodically during training
# 1. env: the environment used for evaluation
# 2. callback_on_new_best: every time there is a new best model, run stop_callback; if the mean reward has reached 200, stop the training
# 3. eval_freq: how frequently to run the evaluation (every 10000 timesteps)
# 4. best_model_save_path: where to save the best model

# so every 10000 timesteps we check whether we have passed the 200 reward threshold
# if we have, training stops, and the best model is saved
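
# The interplay of the two callbacks can be sketched in plain Python.
# This is a simplified model of the logic described in the comments above,
# not the library's implementation: first_stop is a hypothetical helper and
# the evaluation results passed to it are made up for illustration.

```python
def first_stop(eval_mean_rewards, eval_freq, reward_threshold):
    """Return (timestep, reward) at which training would stop, or None.

    eval_mean_rewards[i] is the mean reward measured at the (i+1)-th evaluation.
    """
    best = float("-inf")
    for i, mean_reward in enumerate(eval_mean_rewards, start=1):
        if mean_reward > best:              # a new best model was found
            best = mean_reward
            if best >= reward_threshold:    # the stop check runs only on a new best
                return i * eval_freq, best
    return None


# Hypothetical evaluation results, one every 10000 timesteps.
print(first_stop([188.2, 200.0], eval_freq=10000, reward_threshold=200))
print(first_stop([150.0, 188.2], eval_freq=10000, reward_threshold=200))
```

# In the second call the threshold is never reached, so training would simply
# run until total_timesteps is used up.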

In [37]:
model = PPO('MlpPolicy',env,verbose = 1,tensorboard_log = log_path)
# create a new model, as we did before

Using cuda device


In [38]:
model.learn(total_timesteps=20000,callback = eval_callback)
# when we run training, we pass in our callback (eval_callback)


Logging to Training\Logs\PPO_5
-----------------------------
| time/              |      |
|    fps             | 513  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 397          |
|    iterations           | 2            |
|    time_elapsed         | 10           |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0088384375 |
|    clip_fraction        | 0.118        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.686       |
|    explained_variance   | 0.00352      |
|    learning_rate        | 0.0003       |
|    loss                 | 6.44         |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.0183      |
|    value_loss           | 52.4         |
----------------------------



Eval num_timesteps=10000, episode_reward=188.20 +/- 23.60
Episode length: 188.20 +/- 23.60
-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 188         |
|    mean_reward          | 188         |
| time/                   |             |
|    total_timesteps      | 10000       |
| train/                  |             |
|    approx_kl            | 0.007684355 |
|    clip_fraction        | 0.0619      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.607      |
|    explained_variance   | 0.27        |
|    learning_rate        | 0.0003      |
|    loss                 | 18.3        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0161     |
|    value_loss           | 64          |
-----------------------------------------
New best mean reward!
------------------------------
| time/              |       |
|    fps             | 330   |
|    iterations      | 5     |
|    ti

<stable_baselines3.ppo.ppo.PPO at 0x265a5e1e770>

# 9. Changing Policies

Changing the number of layers and units of our neural network

In [39]:
net_arch = [dict(pi=[128,128,128,128],vf=[128,128,128,128])]
# four layers with 128 units each, for both the policy (pi) and the value function (vf)
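
# To get a feel for the size of this architecture, a rough sketch that counts
# the weights and biases of a plain fully connected network. It assumes
# 4 inputs (the CartPole observation), 2 policy outputs (one per action) and
# 1 value output; count_params is a hypothetical helper, and the real policy
# class may organize its layers differently.

```python
def count_params(n_in, hidden, n_out):
    """Weights plus biases of a dense network n_in -> hidden... -> n_out."""
    sizes = [n_in] + list(hidden) + [n_out]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


hidden_layers = [128, 128, 128, 128]
pi_params = count_params(4, hidden_layers, 2)   # policy (actor) branch
vf_params = count_params(4, hidden_layers, 1)   # value function (critic) branch
print(pi_params, vf_params)
```

# Most of the parameters sit in the three 128-to-128 layers, which is one
# reason a deeper network tends to train more slowly per step.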

In [40]:
model = PPO('MlpPolicy',env,verbose = 1,tensorboard_log = log_path,policy_kwargs={'net_arch':net_arch})

Using cuda device




In [41]:
model.learn(total_timesteps=20000,callback = eval_callback)

Logging to Training\Logs\PPO_6
-----------------------------
| time/              |      |
|    fps             | 423  |
|    iterations      | 1    |
|    time_elapsed    | 4    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 325         |
|    iterations           | 2           |
|    time_elapsed         | 12          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.014751053 |
|    clip_fraction        | 0.21        |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.681      |
|    explained_variance   | -0.00288    |
|    learning_rate        | 0.0003      |
|    loss                 | 3.25        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.026      |
|    value_loss           | 19.8        |
-----------------------------------------
---



Eval num_timesteps=10000, episode_reward=200.00 +/- 0.00
Episode length: 200.00 +/- 0.00
-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 200         |
|    mean_reward          | 200         |
| time/                   |             |
|    total_timesteps      | 10000       |
| train/                  |             |
|    approx_kl            | 0.012308881 |
|    clip_fraction        | 0.153       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.586      |
|    explained_variance   | 0.575       |
|    learning_rate        | 0.0003      |
|    loss                 | 14.7        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0216     |
|    value_loss           | 41          |
-----------------------------------------
------------------------------
| time/              |       |
|    fps             | 282   |
|    iterations      | 5     |
|    time_elapsed    | 36    |


<stable_baselines3.ppo.ppo.PPO at 0x265a5dc0940>

# 10. Using an Alternate Algorithm

We are going to use the DQN algorithm

In [47]:
from stable_baselines3 import DQN

In [49]:
model = DQN('MlpPolicy',env,verbose = 1,tensorboard_log = log_path)

Using cuda device


In [50]:
model.learn(total_timesteps=20000)

Logging to Training\Logs\DQN_1
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.951    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 10278    |
|    time_elapsed     | 0        |
|    total_timesteps  | 103      |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.894    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 12550    |
|    time_elapsed     | 0        |
|    total_timesteps  | 223      |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.837    |
| time/               |          |
|    episodes         | 12       |
|    fps              | 13089    |
|    time_elapsed     | 0        |
|    total_timesteps  | 344      |
----------------------------------
------------------------
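
The `exploration_rate` values in the log above fall as training progresses. A minimal sketch of a linear schedule that reproduces the logged numbers, assuming exploration starts at 1.0, ends at 0.05, and decays over the first 10% of `total_timesteps`. These three values are assumptions about the defaults in use, not something shown in the notebook, and `exploration_rate` here is a hypothetical helper.

```python
def exploration_rate(step, total_timesteps=20000,
                     initial_eps=1.0, final_eps=0.05, fraction=0.1):
    """Linearly interpolate epsilon over the first `fraction` of training."""
    progress = min(step / (fraction * total_timesteps), 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


# The first three steps are the total_timesteps values from the log above.
for step in (103, 223, 344, 5000):
    print(step, round(exploration_rate(step), 3))
```

With probability equal to this rate the agent takes a random action instead of the one its network currently prefers, so early training is mostly exploration.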

<stable_baselines3.dqn.dqn.DQN at 0x265a61c1960>

In [52]:
model.save

<bound method BaseAlgorithm.save of <stable_baselines3.dqn.dqn.DQN object at 0x00000265A61C1960>>

In [54]:
DQN.load

<bound method BaseAlgorithm.load of <class 'stable_baselines3.dqn.dqn.DQN'>>