## A simple guide to reinforcement learning (RL) in Python. Copy this notebook to your local PC and run all cells to see how it works!

In [21]:
%pip install stable-baselines3[extra] -q
%pip install gym[all] -q
%pip install tensorboard -q

In [1]:
import gym, os
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

In [5]:
environment_name = "CartPole-v0"
env = gym.make(environment_name)


In [6]:

episodes = 5  # Run the environment 5 times (one episode is one full game)
for episode in range(1, episodes + 1):
    state = env.reset()  # Reset the environment and get the initial observation
    done = False  # Becomes True when the episode is over
    score = 0

    while not done:
        env.render()  # Draw the environment in a graphics window
        action = env.action_space.sample()  # Pick a random action
        n_state, reward, done, info = env.step(action)  # Apply the action to the environment
        score += reward  # Accumulate the reward collected in this episode
    print('Episode: {} Score{}'.format(episode, score))
env.close()  # Close the render window


# Inspecting the environment

In [17]:
env.action_space

Discrete(2)

In [18]:
env.observation_space.sample() # Returns 4 values: cart position, cart velocity, pole angle, pole angular velocity

array([ 2.3256021e+00, -1.9713159e+38,  2.9394504e-01, -8.2985874e+37],
      dtype=float32)
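The sampled velocities are enormous because the observation space puts no practical bound on them, so a random sample from the space is not a realistic state. As a rough sketch, assuming the limits given in the CartPole documentation (the episode ends once the cart is more than 2.4 units from the center or the pole tilts more than 12 degrees), a hypothetical helper `is_terminal` shows how the four values are read:

```python
import math

# Assumed limits, taken from the CartPole documentation (not read from the env here).
X_LIMIT = 2.4                    # cart position limit, in track units
THETA_LIMIT = math.radians(12)   # pole angle limit, about 0.209 rad


def is_terminal(observation):
    """Return True if this observation would end a CartPole episode."""
    cart_pos, cart_vel, pole_angle, pole_ang_vel = observation
    return abs(cart_pos) > X_LIMIT or abs(pole_angle) > THETA_LIMIT


# The random sample printed above: the cart is still on the track,
# but the pole angle (0.29 rad) is already past the limit.
sample = [2.3256021e+00, -1.9713159e+38, 2.9394504e-01, -8.2985874e+37]
print(is_terminal(sample))
print(is_terminal([0.0, 0.0, 0.0, 0.0]))
```

The velocities play no part in the check; only the position and the angle decide whether the episode is over.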

# Train agent

In [6]:
log_path = os.path.join('Training','Logs')

In [7]:
environment_name = "CartPole-v0"
env = gym.make(environment_name) # Create the environment
env = DummyVecEnv([lambda: env]) # Wrap it in a DummyVecEnv, the vectorized interface Stable-Baselines3 works with
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path) # Define the agent

Using cpu device


The policy is the rule the agent follows to choose an action in each state of the environment. 'MlpPolicy' represents that rule with a multilayer perceptron (a fully connected neural network).
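To make the idea of a policy concrete, here is a hand-written one that needs no training. It is only an illustration: `lean_policy` is a hypothetical name, and the meaning of the actions and of the angle's sign are assumptions stated in the comments.

```python
def lean_policy(observation):
    """Hypothetical hand-written policy: push the cart toward the side the pole leans to."""
    cart_pos, cart_vel, pole_angle, pole_ang_vel = observation
    # Assumption: action 0 pushes the cart left, action 1 pushes it right,
    # and a positive pole angle means the pole leans to the right.
    return 1 if pole_angle > 0 else 0


print(lean_policy([0.0, 0.0, 0.05, 0.0]))   # pole leans right
print(lean_policy([0.0, 0.0, -0.05, 0.0]))  # pole leans left
```

PPO learns a rule of the same shape (observation in, action out), except that the rule is a neural network whose weights are adjusted from experience.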

In [27]:
model.learn(total_timesteps=20_000) # Total number of environment steps to train for

Logging to Training\Logs\PPO_1
-----------------------------
| time/              |      |
|    fps             | 1172 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 824         |
|    iterations           | 2           |
|    time_elapsed         | 4           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009386983 |
|    clip_fraction        | 0.122       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | -0.00359    |
|    learning_rate        | 0.0003      |
|    loss                 | 10.8        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0184     |
|    value_loss           | 60.4        |
-----------------------------------------

<stable_baselines3.ppo.ppo.PPO at 0x1dfbee88850>
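The log reports `total_timesteps` in steps of 2048, which matches PPO's default rollout length (`n_steps=2048`) with a single environment. Assuming training only stops at the end of a rollout, the requested 20,000 steps round up to a whole number of iterations:

```python
import math

n_steps = 2048            # PPO rollout length per environment (default, assumed unchanged)
n_envs = 1                # one environment inside the DummyVecEnv
total_timesteps = 20_000  # value passed to model.learn

steps_per_iteration = n_steps * n_envs
iterations = math.ceil(total_timesteps / steps_per_iteration)
steps_collected = iterations * steps_per_iteration

print(iterations)        # 10
print(steps_collected)   # 20480
```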

In [8]:
PPO_Path = os.path.join('Training', 'Saved Models', 'PPO_model')

In [None]:
model.save(PPO_Path)

In [9]:
del model

In [10]:
model = PPO.load(PPO_Path, env=env)
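Stable-Baselines3 stores a model as a zip archive. Assuming it appends the `.zip` extension when the path has none, the file written by `model.save(PPO_Path)` can be located like this (`saved_file` is a name introduced here for illustration):

```python
import os

PPO_Path = os.path.join('Training', 'Saved Models', 'PPO_model')
saved_file = PPO_Path + '.zip'  # assumed file name on disk after model.save(PPO_Path)

print(saved_file)
print(os.path.exists(saved_file))  # True only after the save cell has been run
```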

# Evaluation

In [None]:
evaluate_policy(model, env, n_eval_episodes=10, render=True)
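`evaluate_policy` returns two numbers: the mean episode reward and its standard deviation over the evaluation episodes. The sketch below shows that summary on made-up rewards; `summarize` and the reward list are hypothetical and only illustrate the arithmetic.

```python
import statistics


def summarize(episode_rewards):
    """Return (mean, standard deviation) of a list of episode rewards."""
    mean_reward = statistics.fmean(episode_rewards)
    std_reward = statistics.pstdev(episode_rewards)  # population standard deviation
    return mean_reward, std_reward


# Hypothetical rewards from 10 evaluation episodes, for illustration only.
rewards = [200.0, 200.0, 187.0, 200.0, 200.0, 200.0, 193.0, 200.0, 200.0, 200.0]
mean_reward, std_reward = summarize(rewards)
print(mean_reward, std_reward)
```

A mean close to 200 with a small standard deviation means the agent balances the pole for nearly the whole episode every time, since CartPole-v0 episodes are capped at 200 steps.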

In [16]:
env.close()

# Testing

In [18]:

episodes = 5  # Run the trained agent for 5 episodes (one episode is one full game)
for episode in range(1, episodes + 1):
    obs = env.reset()  # Reset the environment and get the initial observation
    done = False  # Becomes True when the episode is over
    score = 0

    while not done:
        env.render()  # Draw the environment in a graphics window
        action, _states = model.predict(obs)  # Use the model's prediction instead of a random sample
        obs, reward, done, info = env.step(action)  # Apply the action to the environment
        score += reward
    print('Episode: {} Score{}'.format(episode, score))
env.close()  # Close the render window

Episode: 1 Score200.0

Episode: 2 Score200.0

Episode: 3 Score200.0

Episode: 4 Score200.0

Episode: 5 Score200.0


In [11]:
obs = env.reset()

In [12]:
obs # The four values are: cart position, cart velocity, pole angle, pole angular velocity

array([[-0.00671012,  0.01133617,  0.03237586,  0.00465344]],
      dtype=float32)

In [13]:
model.predict(obs)

(array([0], dtype=int64), None)

In [14]:
action , _ = model.predict(obs)

In [15]:
env.step(action) # CartPole gives a reward of 1 for every step the pole stays up; the episode ends when the pole falls or the cart leaves the track

(array([[-0.0064834 ,  0.2059792 ,  0.03246893, -0.27764127]],
       dtype=float32),
 array([1.], dtype=float32),
 array([False]),
 [{}])
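Because the environment is wrapped in a `DummyVecEnv`, every element of the step result carries a leading dimension with one entry per environment. The sketch below rewrites the result printed above as plain Python lists (an illustration, not the NumPy arrays the environment actually returns) and unpacks the values for the single environment:

```python
# The step result printed above, written as plain Python lists for illustration.
step_result = (
    [[-0.0064834, 0.2059792, 0.03246893, -0.27764127]],  # observations, one row per env
    [1.0],                                               # rewards, one per env
    [False],                                             # done flags, one per env
    [{}],                                                # info dicts, one per env
)

obs_batch, rewards, dones, infos = step_result

# With a single wrapped environment, index 0 selects its values.
cart_pos, cart_vel, pole_angle, pole_ang_vel = obs_batch[0]
reward, done, info = rewards[0], dones[0], infos[0]

print(pole_angle, reward, done)
```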

# Using Tensorboard

In [19]:
training_log_path = os.path.join(log_path, 'PPO_1')

In [20]:
!tensorboard --logdir={training_log_path}


^C


# Adding a callback to training

In [11]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold
import os

In [12]:
save_path = os.path.join('Training', 'Saved Models')
log_path = os.path.join('Training', 'Logs')

In [14]:
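# Stop training once the mean evaluation reward reaches 200
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=200, verbose=1)
# Evaluate every 10,000 steps, save the best model, and check the stop condition on each new best
eval_callback = EvalCallback(env, callback_on_new_best=stop_callback,
                             eval_freq=10000,
                             best_model_save_path=save_path,
                             verbose=1)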
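The interplay of the two callbacks can be sketched in plain Python. This is a simplified model of the behaviour, not Stable-Baselines3's code; `run_training` and the evaluation scores passed to it are hypothetical.

```python
def run_training(eval_scores, total_timesteps=20_000, eval_freq=10_000, reward_threshold=200):
    """Simulate when training would stop, given the mean reward of each evaluation."""
    best_mean_reward = float('-inf')
    timestep = 0
    for mean_reward in eval_scores:
        timestep += eval_freq
        if timestep > total_timesteps:
            break  # the training budget ran out before this evaluation
        if mean_reward > best_mean_reward:
            # New best: this is the point where the best model would be saved.
            best_mean_reward = mean_reward
            if best_mean_reward >= reward_threshold:
                # The stop callback only runs on a new best and ends training here.
                return timestep, best_mean_reward
    return min(timestep, total_timesteps), best_mean_reward


print(run_training([200.0, 200.0]))  # threshold reached at the first evaluation
print(run_training([120.0, 110.0]))  # threshold never reached, full budget used
```

With `eval_freq=10000` and 20,000 training steps there are at most two evaluations, so training can end early only if the first evaluation already reaches the threshold.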

In [None]:
model = PPO('MlpPolicy', env,verbose=1,tensorboard_log=log_path)

In [None]:
model.learn(total_timesteps=20_000,callback=eval_callback)

# Changing Policy

In [16]:
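new_arch = [dict(pi=[128, 128, 128, 128], vf=[128, 128, 128, 128])] # New network architecture: four 128-unit layers each for the policy (pi) and the value function (vf). Newer Stable-Baselines3 releases may expect the dict directly, without the enclosing list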

In [None]:
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path, policy_kwargs={'net_arch': new_arch}) # policy_kwargs passes the custom network architecture to the policy

In [None]:
model.learn(total_timesteps=20_000,callback=eval_callback)
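To get a feel for the size of the new architecture, the sketch below counts the weights and biases of each branch. It assumes plain fully connected layers with biases and one final linear output layer per branch; `mlp_param_count` is a helper written for this illustration.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


obs_dim, n_actions = 4, 2       # CartPole: 4 observation values, 2 discrete actions
hidden = [128, 128, 128, 128]   # the hidden layers from new_arch

policy_params = mlp_param_count([obs_dim] + hidden + [n_actions])  # pi branch plus action output
value_params = mlp_param_count([obs_dim] + hidden + [1])           # vf branch plus scalar value output

print(policy_params)  # 50434
print(value_params)   # 50305
```

Around 100,000 parameters in total is a lot for a task as small as CartPole, so a larger network will not necessarily train faster or score higher here.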

# Using an alternative algorithm

In [20]:
from stable_baselines3 import DQN


In [21]:
model = DQN('MlpPolicy', env,verbose=1,tensorboard_log=log_path) 

Using cpu device


In [None]:
model.learn(total_timesteps=20_000,callback=eval_callback)