RL Notes:

Reinforcement learning (RL) focuses on teaching an agent through trial and error.

Fundamental elements:

1. Agent - Something that operates in the environment; it could be a player or a model

2. Environment - The world the agent operates in and observes

3. Action - Something the agent does in the environment

4. Reward - The feedback the environment returns after an action

RL assumes our environment follows the Markov property: given the present state, the probability of the future is independent of the past (this property is also called the “memoryless property”).
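
As a minimal sketch of the Markov property, assume a hypothetical two-state weather chain (the states, the probabilities and the `next_state_distribution` helper are made up for illustration). The distribution over the next state is looked up from the current state alone, so two different histories that end in the same state give the same prediction:

```python
# Hypothetical transition table: P(next state | current state).
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state_distribution(history):
    """Return P(next state) given a full history of states.

    Under the Markov property only the last element of the history matters.
    """
    current = history[-1]
    return TRANSITIONS[current]

history_a = ["rainy", "rainy", "sunny"]
history_b = ["sunny", "sunny", "sunny"]

# Different pasts, same present, same prediction for the future.
print(next_state_distribution(history_a) == next_state_distribution(history_b))
```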

I will be working on model-free RL in this notebook: PPO and DQN.

Core metrics to look at:
1. Average reward

2. Average episode length

In [8]:
!pip install stable-baselines3[extra] 
#pretty cool RL library



In [9]:
#Dependencies
import os
import gym
from stable_baselines3 import PPO
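from stable_baselines3.common.vec_env import DummyVecEnv #wraps one or more environments in the vectorized interface the library works with (they are stepped one after another in the same process)
from stable_baselines3.common.evaluation import evaluate_policy #returns the mean and standard deviation of the episode reward over the evaluation episodes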
import tensorflow as tf
import datetime

In [10]:
#Using openAI gym for environments
env_name = "CartPole-v0"
env = gym.make(env_name)

In [11]:
#Interacting and getting familiar with the environment
eps = 5
for i in range(1,eps+1):
    state = env.reset() #returns the initial observation, an array of values that describe the state of the environment
    done = False
    score = 0
    
    while not done:
        env.render() #graphically renders the environment
        action = env.action_space.sample() #picks a random action to be taken
        n_state, reward, done, info = env.step(action) #takes the action using the .step method
        #the new observation is returned in n_state, and the same goes for reward. done becomes True when the episode ends:
        #the pole tilts too far, the cart leaves the track, or the step limit is reached.
        score += reward 
    print('Episode: {} score: {}'.format(i,score))
env.close()

Episode: 1 score: 16.0
Episode: 2 score: 37.0
Episode: 3 score: 28.0
Episode: 4 score: 16.0
Episode: 5 score: 14.0
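
The two core metrics can be computed directly from the scores printed above. This is a small standard-library sketch; because CartPole gives a reward of 1 per step, each score here is also the episode length. Using the population standard deviation is an assumption, chosen to match what a NumPy-style `std` would report.

```python
from statistics import mean, pstdev

# Scores of the five random-action episodes printed above.
scores = [16.0, 37.0, 28.0, 16.0, 14.0]

avg_reward = mean(scores)
std_reward = pstdev(scores)  # population standard deviation

# One reward point per step, so the average score is also the average length.
avg_episode_length = avg_reward

print(f"average reward: {avg_reward:.1f}, std: {std_reward:.2f}")
print(f"average episode length: {avg_episode_length:.1f}")
```

A random policy keeps the pole up for only a couple of dozen steps on average, which gives a baseline to compare the trained model against.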


In [12]:
env.action_space
# This says that we have two discrete actions, indexed 0 and 1 (push the cart left or right)

Discrete(2)

In [13]:
env.observation_space
# This is a Box space with 4 continuous values: the first two numbers in the output are the lower and upper bounds, (4,) is the shape and float32 is the data type
# The 4 values are cart position, cart velocity, pole angle and pole angular velocity.

Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
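
The bounds printed for the `Box` are not physical limits of the cart; they are the largest finite value a `float32` can hold, which in practice means "no useful bound". A quick standard-library check, decoding the bit pattern `0x7f7fffff` (the largest finite single-precision float):

```python
import struct

# 0x7f7fffff is the bit pattern of the largest finite IEEE 754 float32.
float32_max = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

print(float32_max)  # compare with the Box bounds printed above
print(float32_max == 3.4028234663852886e+38)
```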

In [14]:
#Making log path
log_path = os.path.join("Training","logs")
log_path

'Training\\logs'
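
The doubled backslash in the output is just how Python displays a single backslash inside a string; `os.path.join` uses the separator of the operating system it runs on. A small sketch comparing the Windows and POSIX behaviour through the standard `ntpath` and `posixpath` modules:

```python
import ntpath      # Windows path rules
import posixpath   # Linux/macOS path rules

windows_path = ntpath.join("Training", "logs")
posix_path = posixpath.join("Training", "logs")

print(repr(windows_path))  # repr shows the backslash escaped
print(windows_path)        # the string itself contains one backslash
print(posix_path)
```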

In [15]:
#Instantiating a PPO model
env = gym.make(env_name)
env = DummyVecEnv([lambda: env]) #wrapping env in DummyVecEnv 
model = PPO('MlpPolicy',env,verbose=1,tensorboard_log = log_path) #this is like defining the agent
#'MlpPolicy' selects a multilayer perceptron (fully connected network) policy

Using cpu device


In [16]:
for i in range(5):
    model.learn(total_timesteps = 20000) #Training the same model 5 times in a row, 20000 timesteps per call (these are repeated runs, not epochs)

Logging to Training\logs\PPO_6
-----------------------------
| time/              |      |
|    fps             | 1549 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1103        |
|    iterations           | 2           |
|    time_elapsed         | 3           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.010169906 |
|    clip_fraction        | 0.0912      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | -0.0102     |
|    learning_rate        | 0.0003      |
|    loss                 | 5.73        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0127     |
|    value_loss           | 48.7        |
-----------------------------------------
---
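
The log above advances by 2048 timesteps per iteration. Assuming PPO gathers a full rollout of that size before each update and keeps collecting until the requested total is reached (the `rollout_size` name is illustrative, the value is read off the log), a request for 20000 timesteps rounds up to a whole number of rollouts. A quick arithmetic sketch:

```python
import math

requested_timesteps = 20_000  # total_timesteps passed to each learn() call
rollout_size = 2_048          # timesteps per iteration, read off the log
learn_calls = 5               # iterations of the for loop

rollouts_per_call = math.ceil(requested_timesteps / rollout_size)
actual_per_call = rollouts_per_call * rollout_size
total_collected = actual_per_call * learn_calls

print(rollouts_per_call, actual_per_call, total_collected)
```

Under these assumptions each call slightly overshoots the requested budget, and the five calls together collect a little over 100000 timesteps.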

In [17]:
PPO_path = os.path.join('Training','models','PPO_Model_CartPole')
model.save(PPO_path)

In [18]:
del model #just deleting so that we can load the model again from the saved path

In [19]:
model = PPO.load(PPO_path)

In [20]:
#Evaluating the model
evaluate_policy(model, env, n_eval_episodes=10, render = True)



(200.0, 0.0)

In [21]:
env.close()

Testing summary:

We first get the initial observation from the environment using env.reset(), then set done = False and score = 0.

While the episode isn't done:

1. We render the env graphically

2. Pick an action using our trained model

3. Update the score and the observation

At the end of each episode we output the score.

In [22]:
#Testing
eps = 5
for i in range(1,eps+1):
    obs = env.reset() #returns the initial observation (one row per wrapped environment)
    done = False
    score = 0
    
    while not done:
        env.render() #graphically renders the environment
        action, _ = model.predict(obs) #picks an action using the trained model (returns the action and, for recurrent policies, the next hidden state)
        obs, reward, done, info = env.step(action) #takes the action using the .step method
        #the new observation is returned in obs, and the same goes for reward. done becomes True when the episode ends.
        score += reward #reward is 1 for every step the pole stays up; it is an array because env is vectorized, hence the brackets in the printed score
    print('Episode: {} score: {}'.format(i,score))


Episode: 1 score: [200.]
Episode: 2 score: [200.]
Episode: 3 score: [200.]
Episode: 4 score: [200.]
Episode: 5 score: [200.]


In [23]:
env.close()

In [24]:
#Viewing logs from tensorboard
training_log_path = os.path.join('Training','logs','PPO_1')
training_log_path
#note: no spaces around '=' in --logdir, otherwise tensorboard rejects the arguments and prints its usage text
!tensorboard --logdir={training_log_path}

2022-06-27 18:36:17.429993: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-06-27 18:36:17.430019: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

In [25]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold


In [26]:
save_path = os.path.join('Training', 'Saved Models')
log_path = os.path.join('Training', 'Logs')

In [27]:
#stops training once the mean evaluation reward reaches 190
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=190, verbose=1) 

#evaluates the model every 10000 steps (eval_freq) and saves the best model so far
eval_callback = EvalCallback(env, 
                             callback_on_new_best=stop_callback, 
                             eval_freq=10000, 
                             best_model_save_path=save_path, 
                             verbose=1)
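
Conceptually, the two callbacks combine into "evaluate periodically, keep the best model, stop once the best score clears a threshold". The following is a simplified pure-Python sketch of that control flow, not the stable-baselines3 implementation; `train_with_early_stop` and the simulated evaluation scores are hypothetical.

```python
def train_with_early_stop(eval_scores, reward_threshold):
    """Walk through periodic evaluation results and mimic early stopping.

    eval_scores: mean reward measured at each evaluation point (simulated).
    Returns (best_score, evaluations_run).
    """
    best_score = float("-inf")
    evaluations_run = 0
    for score in eval_scores:
        evaluations_run += 1
        if score > best_score:
            best_score = score  # this is where the best model would be saved
            if best_score >= reward_threshold:
                break  # threshold reached: stop training early
    return best_score, evaluations_run

# Simulated evaluation results; real values depend on the training run.
best, n_evals = train_with_early_stop([35.0, 120.0, 196.0, 200.0], reward_threshold=190)
print(best, n_evals)
```

The threshold check sits inside the "new best" branch, mirroring how the stop callback is passed as `callback_on_new_best` in the cell above.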

In [28]:
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log=log_path)

Using cpu device


In [29]:
model.learn(total_timesteps=20000, callback=eval_callback) #training model with our eval_callback

Logging to Training\Logs\PPO_11
-----------------------------
| time/              |      |
|    fps             | 2026 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 1279         |
|    iterations           | 2            |
|    time_elapsed         | 3            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0092116315 |
|    clip_fraction        | 0.107        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.686       |
|    explained_variance   | -0.0245      |
|    learning_rate        | 0.0003       |
|    loss                 | 5.13         |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.0182      |
|    value_loss           | 51           |
---------------------------

<stable_baselines3.ppo.ppo.PPO at 0x24f36096390>

In [39]:
#Customized parameters for model 
net_arch=[dict(pi=[64,64,64,64], vf=[128, 128, 128, 128])] # pi: 4 layers of 64 units for the actor, vf: 4 layers of 128 units for the value function
model = PPO('MlpPolicy', env, verbose = 1, policy_kwargs={'net_arch': net_arch})
#policy_kwargs passes our dictionary of parameters to the policy
model.learn(total_timesteps=20000, callback=eval_callback)
evaluate_policy(model, env, n_eval_episodes=10, render=True)

Using cpu device
-----------------------------
| time/              |      |
|    fps             | 1358 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 894         |
|    iterations           | 2           |
|    time_elapsed         | 4           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.014510571 |
|    clip_fraction        | 0.199       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.681      |
|    explained_variance   | 0.000594    |
|    learning_rate        | 0.0003      |
|    loss                 | 2.56        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.022      |
|    value_loss           | 17.7        |
-----------------------------------------
-----------------

(200.0, 0.0)
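
To get a feel for how different the two branches in `net_arch` are, the sketch below counts weights and biases in the hidden layers of a fully connected network. It assumes the 4-value CartPole observation as input and ignores the final output heads; `count_mlp_params` is a helper written for this illustration, not a library function.

```python
def count_mlp_params(input_dim, hidden_sizes):
    """Count weights + biases of the hidden layers of a dense network."""
    total = 0
    previous = input_dim
    for size in hidden_sizes:
        total += previous * size + size  # weight matrix + bias vector
        previous = size
    return total

obs_dim = 4  # a CartPole observation has 4 values
pi_params = count_mlp_params(obs_dim, [64, 64, 64, 64])      # actor branch
vf_params = count_mlp_params(obs_dim, [128, 128, 128, 128])  # value branch

print(pi_params, vf_params)
```

Under these assumptions the value branch has roughly four times as many hidden-layer parameters as the actor branch, even though both have four layers.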

In [35]:
#Using a different algorithm apart from PPO, say DQN
from stable_baselines3 import DQN
model = DQN('MlpPolicy', env, verbose = 1, tensorboard_log=log_path)
model.learn(total_timesteps=20000, callback=eval_callback)

dqn_path = os.path.join('Training', 'Saved Models', 'DQN_model')
model.save(dqn_path)


Using cpu device
Logging to Training\Logs\DQN_2
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.956    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 7153     |
|    time_elapsed     | 0        |
|    total_timesteps  | 93       |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.887    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 7892     |
|    time_elapsed     | 0        |
|    total_timesteps  | 237      |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.839    |
| time/               |          |
|    episodes         | 12       |
|    fps              | 8071     |
|    time_elapsed     | 0        |
|    total_timesteps  | 339      |
----------------------------------
-------
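
The `exploration_rate` values in the log are consistent with a linear epsilon decay. The sketch below assumes epsilon falls linearly from 1.0 to 0.05 over the first 10% of the 20000 training timesteps; these three numbers are assumptions (they are not set anywhere in this notebook), chosen because they reproduce the logged values. `linear_epsilon` is an illustrative helper, not a library function.

```python
def linear_epsilon(step, total_timesteps=20_000, start=1.0, end=0.05, fraction=0.1):
    """Linearly interpolate epsilon from start to end over the first
    `fraction` of training, then hold it at `end`."""
    decay_steps = fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)
    return start + progress * (end - start)

# Timesteps taken from the log above.
for step in (93, 237, 339):
    print(step, round(linear_epsilon(step), 3))
```

With these assumed settings the agent would be acting almost entirely at random during the first few hundred steps shown in the log.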

In [40]:
#Getting the saved model 
model = DQN.load(dqn_path, env=env)
evaluate_policy(model, env, n_eval_episodes=10, render=True)


(9.5, 0.9219544457292888)

In [37]:
env.close()

Summary of notebook

1. We first learnt a bit about RL and its concepts. 

2. stable-baselines3 is a great RL library with lots of models 

3. We loaded the CartPole environment. Every environment has an action_space and an observation_space. The observation_space describes the set of numbers that define the environment's state. The action_space is the set of all actions that an agent can take in the environment.

4. Interacting with cartpole env. We took some steps using the env.step() by sampling actions from the action_space (using env.action_space.sample()). Then we rendered the environment graphically using env.render()

5. At this point, we were introduced to the PPO model and instantiated it. We then trained it with 5 calls of 20000 timesteps each and tested it; more details are in the testing summary. Later we checked our log files using TensorBoard (which has informative graphs showing how the loss, average episode length, etc. varied throughout the training).

6. At this point we were introduced to callbacks: the model is evaluated periodically, the best model so far is saved, and training stops once the evaluation reward reaches a set threshold.

7. Then we made a customized PPO model by changing the number of layers and units in its networks. We went with 4 layers of 64 units for the actor (pi) and 4 layers of 128 units for the value function (vf). We didn't see much of a difference in the final episode reward. The default models work pretty well.

8. Finally we used the DQN model to compare results. With 20000 timesteps of training and default settings it didn't perform as well as the PPO model: the average episode reward was only around 9, whereas it is around 200 for the PPO model.