# Installing and importing dependencies

In [1]:
!pip install stable-baselines3[extra]



In [43]:
pip install pyglet==1.5.27

Note: you may need to restart the kernel to use updated packages.


Pinning pyglet to version 1.5.27, as in the cell above, is what lets the environment render without errors.

In [31]:
import os
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
from pyglet.gl import *

In [3]:
environment_name = 'CartPole-v1'
env = gym.make(environment_name)

The main environment functions are:
1. env.reset() - resets the environment and returns the initial observation
2. env.render() - visualizes the environment
3. env.step() - applies an action to the environment
4. env.close() - closes the render window
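To make the contract between these four functions concrete, here is a minimal sketch of a toy environment that follows the same pattern. `CoinFlipEnv` is a hypothetical stand-in written for illustration and is not part of gym; it only mirrors the shape of the calls used in this notebook (reset returns an observation, step returns observation, reward, done flag and info).

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment, for illustration only (not part of gym)."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # Reward 1.0 when the action matches the coin that was shown.
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randint(0, 1)
        done = self.steps >= self.max_steps
        return self.coin, reward, done, {}

    def render(self):
        # A real environment would draw a frame; here we just print the state.
        print(f"step={self.steps} coin={self.coin}")

    def close(self):
        # Nothing to release in this toy example.
        pass


toy_env = CoinFlipEnv()
obs = toy_env.reset()
done = False
total = 0.0
while not done:
    # Always answer with the coin currently shown, so every step earns 1.0.
    obs, reward, done, info = toy_env.step(obs)
    total += reward
toy_env.close()
print(total)
```

The reset, step-until-done, close sequence is the same one used with CartPole in the cells that follow.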

The block of code below simply runs the environment for 10 episodes using random actions. It does not train the RL model.

In [8]:
episodes = 10
for episode in range(1, episodes + 1):
    state = env.reset()  # reset to an initial observation
    done = False
    score = 0

    while not done:
        env.render()  # draw the current frame
        action = env.action_space.sample()  # pick a random action from the action space
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{}Score:{}'.format(episode, score))
# env.close()
        

Episode:1Score:26.0
Episode:2Score:34.0
Episode:3Score:23.0
Episode:4Score:44.0
Episode:5Score:22.0
Episode:6Score:28.0
Episode:7Score:11.0
Episode:8Score:20.0
Episode:9Score:17.0
Episode:10Score:20.0
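The loop above unpacks four values from env.step(), which matches the older gym API. In gym 0.26 and later (and in gymnasium), env.step() returns five values (observation, reward, terminated, truncated, info) and env.reset() returns an (observation, info) pair. The sketch below shows one way to normalize both shapes; `unpack_step` is a hypothetical helper name and the tuples are made-up sample data, not real CartPole output.

```python
def unpack_step(result):
    """Normalize an env.step() result to (obs, reward, done, info).

    Hypothetical helper: accepts either the older 4-tuple or the newer 5-tuple.
    """
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        # An episode is over when it terminated or was cut off by a time limit.
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result
    return obs, reward, done, info


# Made-up sample results in both shapes.
old_style = ([0.0, 0.1], 1.0, False, {})
new_style = ([0.0, 0.1], 1.0, False, True, {})

print(unpack_step(old_style))
print(unpack_step(new_style))
```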


In [9]:
env.close()  # closes the gym environment

In [10]:
env.action_space

Discrete(2)

The output above means that there are only two possible actions: 0 or 1. In CartPole, 0 pushes the cart to the left and 1 pushes it to the right.
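As a rough illustration of what a discrete space with two actions means, the sketch below models it with the standard library only. `DiscreteSketch` is a hypothetical class written for this example; it is not gym's `Discrete` implementation.

```python
import random


class DiscreteSketch:
    """Hypothetical stand-in for a discrete action space with n actions."""

    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self):
        # Draw one of the integers 0, 1, ..., n - 1 uniformly at random.
        return self._rng.randrange(self.n)

    def contains(self, x):
        # Valid actions are the integers from 0 up to n - 1.
        return isinstance(x, int) and 0 <= x < self.n


space = DiscreteSketch(2, seed=0)
samples = [space.sample() for _ in range(100)]
print(sorted(set(samples)))
```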

In [11]:
log_path = os.path.join('Training','Logs')

In [12]:
log_path

'Training\\Logs'
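The backslashes in the output above come from running on Windows: os.path.join uses the separator of the current platform. A small sketch using the standard library's platform-specific path modules shows both results side by side.

```python
import ntpath      # Windows flavour of os.path
import posixpath   # Linux/macOS flavour of os.path

windows_path = ntpath.join('Training', 'Logs')
posix_path = posixpath.join('Training', 'Logs')

print(repr(windows_path))  # 'Training\\Logs'
print(repr(posix_path))    # 'Training/Logs'
```

Building paths with os.path.join, as this notebook does, keeps the same code working on either platform.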

In [13]:
env = gym.make(environment_name)  # creates the gym environment
env = DummyVecEnv([lambda: env])
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cpu device


env = DummyVecEnv([lambda: env]) - it wraps the environment in a 'DummyVecEnv', a vectorized environment wrapper. Stable-Baselines3 algorithms work on vectorized environments, so even a single environment is wrapped this way. 'DummyVecEnv' steps its environments one after another in the current process; it does not run them in parallel. For multiprocessing, Stable-Baselines3 provides 'SubprocVecEnv'.
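The idea of a sequential vectorized wrapper can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: `CountdownEnv` and `SequentialVecEnvSketch` are hypothetical names, and the real wrapper returns NumPy arrays rather than lists.

```python
class CountdownEnv:
    """Hypothetical toy env: the episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, 1.0, done, {}


class SequentialVecEnvSketch:
    """Steps several environments one after another in the same process."""

    def __init__(self, env_fns):
        # Each entry is a zero-argument function that builds one environment.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs_batch, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            if done:
                # A finished environment is reset so the batch keeps running.
                obs = env.reset()
            obs_batch.append(obs)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return obs_batch, rewards, dones, infos


vec = SequentialVecEnvSketch([lambda: CountdownEnv(2), lambda: CountdownEnv(3)])
first_obs = vec.reset()
obs1, rewards1, dones1, _ = vec.step([0, 0])
obs2, rewards2, dones2, _ = vec.step([0, 0])
print(obs2, rewards2, dones2)
```

Because every value comes back as a batch with one entry per environment, rewards from a wrapped single environment are one-element sequences rather than plain numbers.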

In [14]:
model.learn(total_timesteps = 20000)

Logging to Training\Logs\PPO_3
-----------------------------
| time/              |      |
|    fps             | 504  |
|    iterations      | 1    |
|    time_elapsed    | 4    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 316         |
|    iterations           | 2           |
|    time_elapsed         | 12          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008486919 |
|    clip_fraction        | 0.0967      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | 0.00277     |
|    learning_rate        | 0.0003      |
|    loss                 | 6.72        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0148     |
|    value_loss           | 50.7        |
-----------------------------------------
---

<stable_baselines3.ppo.ppo.PPO at 0x1ac63806430>

The learn() method starts the training process for the specified number of timesteps. During training, the model interacts with the environment, collects experience, and updates its policy to improve performance.
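The log above advances in blocks of 2048 timesteps per iteration. As a back-of-the-envelope sketch, assuming a rollout length of 2048 steps (read off the log, and believed to be the PPO default `n_steps`) and a single environment, the number of iterations needed to reach 20000 timesteps can be estimated like this:

```python
import math

total_timesteps = 20000
n_steps = 2048  # assumption: rollout length per environment, as seen in the log
n_envs = 1      # one environment wrapped in DummyVecEnv

steps_per_iteration = n_steps * n_envs
# Training collects whole rollouts, so the count is rounded up.
iterations = math.ceil(total_timesteps / steps_per_iteration)
timesteps_collected = iterations * steps_per_iteration

print(iterations, timesteps_collected)
```

Under these assumptions the run slightly overshoots the requested 20000 timesteps, because the last rollout is collected in full.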

In [15]:
PPO_Path = os.path.join('Training','Saved Models','PPO Model Cartpole')

In [16]:
model.save(PPO_Path)

In [17]:
evaluate_policy(model,env,n_eval_episodes=10,render=True)



(500.0, 0.0)
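The tuple returned by evaluate_policy is the mean and the standard deviation of the total reward per evaluation episode. The sketch below reproduces that summary with the standard library; `summarize` is a hypothetical helper, and the use of the population standard deviation is an assumption about how the library aggregates the episodes. A mean of 500 with zero spread implies every one of the 10 episodes scored 500.

```python
import statistics


def summarize(episode_rewards):
    """Return (mean, standard deviation) of a list of episode rewards."""
    mean_reward = statistics.fmean(episode_rewards)
    std_reward = statistics.pstdev(episode_rewards)  # population std (assumed)
    return mean_reward, std_reward


# Ten episodes that all reach 500, matching the output above.
perfect = summarize([500.0] * 10)

# The five scores printed in the "Testing the model" section, for comparison.
mixed = summarize([426.0, 500.0, 500.0, 500.0, 417.0])

print(perfect)
print(mixed)
```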

In [18]:
env.close()

# Testing the model


In [32]:
episodes = 5
for episode in range(1,episodes+1):
    obs = env.reset()
    done = False
    score = 0
    
    while not done:
        env.render()
        action, _ = model.predict(obs)  # use the trained model to choose the next action
        obs, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode,score))


Episode:1 Score:[426.]
Episode:2 Score:[500.]
Episode:3 Score:[500.]
Episode:4 Score:[500.]
Episode:5 Score:[417.]


In [33]:
env.close()

# Tensorboard


In [21]:
training_log_path = os.path.join(log_path,'PPO_1')

In [22]:
training_log_path

'Training\\Logs\\PPO_1'

In [23]:
!tensorboard --logdir={training_log_path}

^C


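TensorBoard is a web-based tool provided by TensorFlow for visualizing training and evaluation metrics, such as training curves and model graphs. Each call to learn() logs to its own numbered subfolder (PPO_3 and PPO_4 in the runs shown in this notebook), so point --logdir at the run you want to inspect.

While the command above is running, you can access TensorBoard at http://localhost:6006/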

# Adding a callback

Importing the required dependencies

In [24]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold


In [25]:
save_path = os.path.join('Training','Saved Models')

In [27]:
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=500, verbose=1)
eval_callback = EvalCallback(env,
                             callback_on_new_best=stop_callback,
                             eval_freq=10000,
                             best_model_save_path=save_path,
                             verbose=1)

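StopTrainingOnRewardThreshold - this callback stops the training process once the mean evaluation reward reaches the specified threshold (500 here). It is passed to EvalCallback as callback_on_new_best, so it is checked whenever an evaluation produces a new best mean reward.

verbose - a parameter that determines whether information about training progress is printed. verbose=1 displays the information.

EvalCallback - this callback evaluates the model during training, here every 10000 steps (eval_freq), and saves the best model found so far to save_path.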
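How the two callbacks interact can be sketched as a small simulation. This is a simplified illustration of the control flow only, not the library's code: `run_training_sketch` is a hypothetical function, and the evaluation results passed to it are made-up numbers.

```python
def run_training_sketch(eval_results, total_timesteps=20000, eval_freq=10000,
                        reward_threshold=500):
    """Simulate periodic evaluation with early stopping.

    eval_results: hypothetical mean rewards, one per evaluation.
    Returns (timestep at which training ended, best mean reward seen).
    """
    best_mean_reward = float('-inf')
    results = iter(eval_results)
    for timestep in range(1, total_timesteps + 1):
        if timestep % eval_freq != 0:
            continue  # no evaluation at this step
        mean_reward = next(results)
        if mean_reward > best_mean_reward:
            # New best: this is where the best model would be saved.
            best_mean_reward = mean_reward
            # The stop check runs only when a new best is found.
            if best_mean_reward >= reward_threshold:
                return timestep, best_mean_reward
    return total_timesteps, best_mean_reward


stopped_early = run_training_sketch([500.0, 500.0])
ran_to_end = run_training_sketch([320.0, 480.0])
print(stopped_early, ran_to_end)
```

With eval_freq=10000 and 20000 total timesteps, there are at most two evaluations, so the threshold can only end training at the first or second checkpoint.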

In [28]:
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cpu device


PPO - Proximal Policy Optimization, the reinforcement learning algorithm used to train the model.

MlpPolicy - the policy architecture passed to the PPO model. 
Here it is a multi-layer perceptron (MLP) policy, which is a feedforward 
neural network.
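To give a feel for what a feedforward policy network computes, here is a minimal pure-Python sketch of a forward pass from a 4-value CartPole observation to probabilities over the 2 actions. The layer sizes (two hidden layers of 64 units) and tanh activations are assumptions chosen to resemble the library defaults; the weights are random and untrained, and all function names are hypothetical.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    # Random weights and zero biases for one fully connected layer.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    # One fully connected layer: a weighted sum plus bias per output unit.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    # Convert raw scores into probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Assumed architecture: 4 inputs -> 64 -> 64 -> 2 action scores.
layers = [init_layer(rng, 4, 64), init_layer(rng, 64, 64), init_layer(rng, 64, 2)]


def action_probabilities(observation):
    hidden = observation
    for weights, biases in layers[:-1]:
        hidden = [math.tanh(v) for v in dense(hidden, weights, biases)]
    out_weights, out_biases = layers[-1]
    return softmax(dense(hidden, out_weights, out_biases))


# Made-up observation: cart position, cart velocity, pole angle, pole angular velocity.
probs = action_probabilities([0.01, -0.02, 0.03, 0.04])
print(probs)
```

Training adjusts the weights so that actions leading to higher reward receive higher probability.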

tensorboard_log - the directory where training metrics are written so that 
TensorBoard can be used to monitor and analyze the training progress.

In [29]:
model.learn(total_timesteps=20000, callback = eval_callback)

Logging to Training\Logs\PPO_4
-----------------------------
| time/              |      |
|    fps             | 491  |
|    iterations      | 1    |
|    time_elapsed    | 4    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 344         |
|    iterations           | 2           |
|    time_elapsed         | 11          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008331481 |
|    clip_fraction        | 0.102       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.687      |
|    explained_variance   | 0.00595     |
|    learning_rate        | 0.0003      |
|    loss                 | 7.04        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0152     |
|    value_loss           | 53.6        |
-----------------------------------------
---

<stable_baselines3.ppo.ppo.PPO at 0x1ac784123a0>

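As before, the model interacts with the environment, collects experience, and updates its policy during training. With eval_callback attached, the model is also evaluated every 10000 steps, the best model is saved to save_path, and training stops early if the mean evaluation reward reaches 500.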