# Driving a Car with Reinforcement Learning
---
Can a computer learn to drive a virtual racecar down a racetrack?

In [1]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
import os

In [2]:
# code from Gymnasium webpage: https://gymnasium.farama.org/
# env = gym.make("CarRacing-v2", render_mode="human")
# observation, info = env.reset(seed=42)
# for _ in range(1000):
#    action = env.action_space.sample()  # this is where you would insert your policy
#    observation, reward, terminated, truncated, info = env.step(action)

#    if terminated or truncated:
#       observation, info = env.reset()

# env.close()

## Problem setup
---
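The agent is the racecar, and the environment is a randomly generated racetrack viewed from above (a 'top down' view). At each step the agent observes a 96 x 96 RGB image of the track and chooses steering, gas and brake values. It is rewarded for every new track tile it visits and penalised slightly for every frame that passes, so faster laps score higher.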

In [3]:
environment_name = "CarRacing-v2"
env = gym.make(environment_name, render_mode='human')

In [4]:
# track generation
env.reset()

(array([[[0, 0, 0],
         [0, 0, 0],
         [0, 0, 0],
         ...,
         [0, 0, 0],
         [0, 0, 0],
         [0, 0, 0]],
 
        [[0, 0, 0],
         [0, 0, 0],
         [0, 0, 0],
         ...,
         [0, 0, 0],
         [0, 0, 0],
         [0, 0, 0]],
 
        [[0, 0, 0],
         [0, 0, 0],
         [0, 0, 0],
         ...,
         [0, 0, 0],
         [0, 0, 0],
         [0, 0, 0]],
 
        ...,
 
        [[0, 0, 0],
         [0, 0, 0],
         [0, 0, 0],
         ...,
         [0, 0, 0],
         [0, 0, 0],
         [0, 0, 0]],
 
        [[0, 0, 0],
         [0, 0, 0],
         [0, 0, 0],
         ...,
         [0, 0, 0],
         [0, 0, 0],
         [0, 0, 0]],
 
        [[0, 0, 0],
         [0, 0, 0],
         [0, 0, 0],
         ...,
         [0, 0, 0],
         [0, 0, 0],
         [0, 0, 0]]], dtype=uint8),
 {})

In [5]:
env.action_space

Box([-1.  0.  0.], 1.0, (3,), float32)
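
The printed `Box` gives a lower bound of `[-1, 0, 0]` and an upper bound of `1.0` for all three components. Reading these as steering, gas and brake (the ordering is taken from the Gymnasium documentation, not from this notebook), a minimal sketch of clamping an arbitrary action into those bounds could look like this. The helper names `clip_action` and `describe` are hypothetical and not part of Gymnasium.

```python
# Bounds read off the printed Box: low = [-1, 0, 0], high = 1.0 everywhere.
LOW = (-1.0, 0.0, 0.0)
HIGH = (1.0, 1.0, 1.0)
# Assumed component order, per the Gymnasium CarRacing documentation.
NAMES = ("steering", "gas", "brake")


def clip_action(action):
    """Clamp a 3-component action into the bounds of the action space."""
    if len(action) != 3:
        raise ValueError("expected exactly three components")
    return tuple(min(max(a, lo), hi) for a, lo, hi in zip(action, LOW, HIGH))


def describe(action):
    """Label each clamped component with its assumed meaning."""
    return dict(zip(NAMES, clip_action(action)))


if __name__ == "__main__":
    # Steering and brake are out of range here and get clamped.
    print(describe((-1.5, 0.3, 2.0)))
```

Steering is the only component that can go negative (left versus right); gas and brake are non-negative magnitudes.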

In [6]:
# racing track observation: 96 x 96 image with 3 colour channels (RGB)
env.observation_space

Box(0, 255, (96, 96, 3), uint8)

In [7]:
# produces new window with racetrack environment
env.render()

In [8]:
# close opened track environment
env.close()

You can observe the car's path under random actions by running the code below. An episode terminates when the car has visited all the track tiles, or when it drives far off the track and leaves the playfield (in which case it receives a reward of -100).

In [9]:
# episodes = 5
# for episode in range(1, episodes+1):
#     # Gymnasium's reset returns (observation, info)
#     state, info = env.reset()
#     done = False
#     score = 0
#
#     while not done:
#         env.render()
#         # random action
#         action = env.action_space.sample()
#         # Gymnasium's step returns five values
#         n_state, reward, terminated, truncated, info = env.step(action)
#         done = terminated or truncated
#         score += reward
#     print('Episode:{} Score:{}'.format(episode, score))
# env.close()
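
To make the scoring concrete, here is a rough sketch of how an episode's return adds up. It assumes the reward scheme given in the Gymnasium documentation (-0.1 per frame and +1000/N per visited tile, where N is the number of tiles on the track) together with the -100 penalty mentioned above. The function `episode_return` and the tile count of 300 are hypothetical, chosen only for illustration.

```python
def episode_return(frames, tiles_visited, total_tiles, left_playfield=False):
    """Approximate CarRacing return under the assumed reward scheme."""
    if not 0 <= tiles_visited <= total_tiles:
        raise ValueError("tiles_visited must lie between 0 and total_tiles")
    # Small time penalty for every frame, plus a share of 1000 per tile.
    reward = -0.1 * frames + 1000.0 * tiles_visited / total_tiles
    if left_playfield:
        # Driving far off the track ends the episode with a -100 penalty.
        reward -= 100.0
    return reward


if __name__ == "__main__":
    # A full lap finished in 732 frames on a hypothetical 300-tile track.
    print(round(episode_return(732, 300, 300), 1))
    # Half the track visited, then the car leaves the playfield.
    print(round(episode_return(400, 150, 300, left_playfield=True), 1))
```

Under these assumptions a complete lap in 732 frames scores about 926.8, so the theoretical ceiling is just under 1000.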

## Model training
---
We will train our racecar using the Proximal Policy Optimisation (PPO) algorithm.
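
For intuition, the core of PPO is a clipped surrogate objective: the ratio between the new and old action probabilities is clipped so that a single update cannot move the policy too far. The sketch below evaluates that objective for one sample in plain Python. It illustrates the published formula and is not Stable-Baselines3's implementation; `clipped_surrogate` is a hypothetical name, and the default `clip_range` of 0.2 matches the value reported in the training log.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    # Taking the minimum makes the objective a pessimistic bound.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Good action (A > 0): gains are capped once the ratio exceeds 1.2.
    print(clipped_surrogate(1.5, 1.0))
    # Bad action (A < 0): the unclipped, more negative term is kept.
    print(clipped_surrogate(1.5, -1.0))
```

The optimiser maximises the mean of this quantity over a batch, which is why the log reports a `clip_fraction`: the share of samples whose ratio fell outside the clip range.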

In [10]:
log_path = os.path.join('Training', 'Logs')

In [11]:
log_path

'Training\\Logs'

The `CnnPolicy` uses a convolutional neural network to extract features from image observations, which is how our agent perceives the track.

In [12]:
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
# convolutional (CNN) policy, suited to image observations
model = PPO("CnnPolicy", env, verbose=1, tensorboard_log=log_path)

Using cpu device
Wrapping the env in a VecTransposeImage.


The length of training can be changed through the `total_timesteps` argument. The trained model can then be saved and later reloaded, evaluated or viewed.

In [13]:
# train the model
model.learn(total_timesteps=100000)

Logging to Training\Logs\PPO_4
-----------------------------
| time/              |      |
|    fps             | 26   |
|    iterations      | 1    |
|    time_elapsed    | 76   |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 21           |
|    iterations           | 2            |
|    time_elapsed         | 190          |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0053407513 |
|    clip_fraction        | 0.0317       |
|    clip_range           | 0.2          |
|    entropy_loss         | -4.26        |
|    explained_variance   | 9.83e-06     |
|    learning_rate        | 0.0003       |
|    loss                 | 0.371        |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.00314     |
|    std                  | 1            |
|    value_loss           | 

-----------------------------------------
| time/                   |             |
|    fps                  | 18          |
|    iterations           | 12          |
|    time_elapsed         | 1310        |
|    total_timesteps      | 24576       |
| train/                  |             |
|    approx_kl            | 0.012771003 |
|    clip_fraction        | 0.133       |
|    clip_range           | 0.2         |
|    entropy_loss         | -3.9        |
|    explained_variance   | 0.348       |
|    learning_rate        | 0.0003      |
|    loss                 | 0.0321      |
|    n_updates            | 110         |
|    policy_gradient_loss | -0.0221     |
|    std                  | 0.885       |
|    value_loss           | 0.256       |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 18          |
|    iterations           | 13          |
|    time_elapsed         | 1426  

-----------------------------------------
| time/                   |             |
|    fps                  | 18          |
|    iterations           | 23          |
|    time_elapsed         | 2615        |
|    total_timesteps      | 47104       |
| train/                  |             |
|    approx_kl            | 0.030289447 |
|    clip_fraction        | 0.264       |
|    clip_range           | 0.2         |
|    entropy_loss         | -3.45       |
|    explained_variance   | 0.937       |
|    learning_rate        | 0.0003      |
|    loss                 | 0.127       |
|    n_updates            | 220         |
|    policy_gradient_loss | -0.0257     |
|    std                  | 0.758       |
|    value_loss           | 0.499       |
-----------------------------------------
----------------------------------------
| time/                   |            |
|    fps                  | 17         |
|    iterations           | 24         |
|    time_elapsed         | 2732      

----------------------------------------
| time/                   |            |
|    fps                  | 17         |
|    iterations           | 34         |
|    time_elapsed         | 3960       |
|    total_timesteps      | 69632      |
| train/                  |            |
|    approx_kl            | 0.03932415 |
|    clip_fraction        | 0.339      |
|    clip_range           | 0.2        |
|    entropy_loss         | -3.23      |
|    explained_variance   | 0.946      |
|    learning_rate        | 0.0003     |
|    loss                 | 0.586      |
|    n_updates            | 330        |
|    policy_gradient_loss | -0.00035   |
|    std                  | 0.712      |
|    value_loss           | 4.94       |
----------------------------------------
----------------------------------------
| time/                   |            |
|    fps                  | 17         |
|    iterations           | 35         |
|    time_elapsed         | 4085       |
|    total_times

-----------------------------------------
| time/                   |             |
|    fps                  | 17          |
|    iterations           | 45          |
|    time_elapsed         | 5339        |
|    total_timesteps      | 92160       |
| train/                  |             |
|    approx_kl            | 0.042081233 |
|    clip_fraction        | 0.363       |
|    clip_range           | 0.2         |
|    entropy_loss         | -3.14       |
|    explained_variance   | 0.982       |
|    learning_rate        | 0.0003      |
|    loss                 | 0.413       |
|    n_updates            | 440         |
|    policy_gradient_loss | 0.00579     |
|    std                  | 0.692       |
|    value_loss           | 2.29        |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 17          |
|    iterations           | 46          |
|    time_elapsed         | 5462  

<stable_baselines3.ppo.ppo.PPO at 0x23f6e9ec0d0>
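
The log advances by 2048 timesteps per iteration, and `n_updates` grows by 10 each time. Assuming these correspond to PPO's `n_steps=2048` and `n_epochs=10` with a single environment (which matches the Stable-Baselines3 defaults as far as I know), the number of iterations needed for a given `total_timesteps` can be estimated as follows. `ppo_schedule` is a hypothetical helper, not a library function.

```python
import math


def ppo_schedule(total_timesteps, n_steps=2048, n_envs=1, n_epochs=10):
    """Estimate how many rollout/update iterations a training run takes."""
    rollout_size = n_steps * n_envs
    # Training stops after the first rollout that reaches the target.
    iterations = math.ceil(total_timesteps / rollout_size)
    return {
        "iterations": iterations,
        "timesteps_collected": iterations * rollout_size,
        "n_updates": iterations * n_epochs,
    }


if __name__ == "__main__":
    print(ppo_schedule(100_000))
```

Under these assumptions, a budget of 100,000 timesteps rounds up to 49 iterations and 100,352 collected timesteps, because rollouts are always collected in whole blocks.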

## Saving the model

In [16]:
PPO_Path = os.path.join('Training', 'Saved Models', 'PPO_racecar_100,000')

In [15]:
# save model to specified location
model.save(PPO_Path)

In [21]:
# delete and reload model
# del model
# model = PPO.load(PPO_Path, env=env)

## Model evaluation

In [14]:
# from stable_baselines3.common.evaluation import evaluate_policy

In [37]:
environment_name = "CarRacing-v2"
env = gym.make(environment_name, render_mode='human')

In [38]:
evaluate_policy(model, env, n_eval_episodes=5)

(517.8984082348645, 268.5036796214647)
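
The tuple returned by `evaluate_policy` is the mean and standard deviation of the per-episode returns. The sketch below reproduces that summary for a list of made-up episode returns (the numbers are hypothetical, not results from this model). It uses the population standard deviation, on the assumption that this matches NumPy's default `np.std`.

```python
from statistics import fmean, pstdev


def summarise_returns(episode_returns):
    """Return (mean, population std) of a list of episode returns."""
    if not episode_returns:
        raise ValueError("need at least one episode")
    return fmean(episode_returns), pstdev(episode_returns)


if __name__ == "__main__":
    # Hypothetical returns from three evaluation episodes.
    mean_reward, std_reward = summarise_returns([100.0, 200.0, 300.0])
    print(mean_reward, round(std_reward, 2))
```

With only five evaluation episodes, a standard deviation of roughly half the mean (as in the output above) suggests the policy's performance still varies a lot from track to track.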

In [39]:
env.close()

## Model testing

In [29]:
from stable_baselines3.common.vec_env.base_vec_env import VecEnv, VecEnvStepReturn, VecEnvWrapper

In [36]:
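episodes = 5
# Use one raw env (with a window) for both reset and step. Mixing
# vec_env.reset() with env.step() passed a batched action of shape (1, 3)
# to the raw env and raised an IndexError.
env = gym.make(environment_name, render_mode='human')
for episode in range(1, episodes+1):
    obs, info = env.reset()
    done = False
    score = 0

    while not done:
        # actions dictated by the trained model
        action, _ = model.predict(obs)
        # render_mode='human' draws each frame during step()
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()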

In [30]:
training_log_path = os.path.join(log_path, 'PPO_2')

In [32]:
!tensorboard --logdir={training_log_path}

^C
