# COMP47590 - Advanced Machine Learning 

## Workshop: Baby You Can Drive My Car
Train an agent to drive on highways. Uses the highway-env environment (https://github.com/eleurent/highway-env).

![Highway](highway.gif)

There are five **actions** in this environment:
- change lane left (0)
- none (1)
- change lane right (2)
- faster (3)
- slower (4)

**Reward** is awarded after each frame as a combination of velocity and colisions:

$$R(s,a) = a\frac{v - v_\min}{v_\max - v_\min} - b\,\text{collision}$$
 
where $v,\,v_\min,\,v_\max$ are the current, minimum and maximum speed of the ego-vehicle respectively, and $a,\,b$ are coefficients (https://github.com/Farama-Foundation/HighwayEnv?tab=readme-ov-file).

And the **state** represnetation gives kinematic infromation on the agents car and neighbouring cars.  

![](highway_obs.png)


### Initialisation

If using Google colab you need to install packages  - comment out lines below.

In [21]:
#!apt install swig cmake ffmpeg
#!apt-get install -y xvfb x11-utils
#!pip install stable-baselines3[extra] pyglet
#!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate
#!pip install highway_env

For Google colab comment out this cell to make a virtual rendering canvas so render calls work (we still wont; see display!)

In [22]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

Install the highway environment.

In [23]:
# !pip install highway_env

Import required packages. 

In [24]:
import gymnasium as gym
import stable_baselines3 as sb3
import highway_env

Create the **highway-fast-v0** environment.

In [25]:
env_eval = gym.make("highway-fast-v0")

  logger.warn(


Explore the environment action sapce and observation space

In [26]:
env_eval.action_space

Discrete(5)

In [27]:
env_eval.observation_space

Box(-inf, inf, (5, 5), float32)

View the environment configuration.

In [28]:
env_eval.unwrapped.config

{'observation': {'type': 'Kinematics'},
 'action': {'type': 'DiscreteMetaAction'},
 'simulation_frequency': 5,
 'policy_frequency': 1,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'screen_width': 600,
 'screen_height': 150,
 'centering_position': [0.3, 0.5],
 'scaling': 5.5,
 'show_trajectories': False,
 'render_agent': True,
 'offscreen_rendering': False,
 'manual_control': False,
 'real_time_rendering': False,
 'lanes_count': 3,
 'vehicles_count': 20,
 'controlled_vehicles': 1,
 'initial_lane_id': None,
 'duration': 30,
 'ego_spacing': 1.5,
 'vehicles_density': 1,
 'collision_reward': -1,
 'right_lane_reward': 0.1,
 'high_speed_reward': 0.4,
 'lane_change_reward': 0,
 'reward_speed_range': [20, 30],
 'normalize_reward': True,
 'offroad_terminal': False}

Play an episode of the environment using random actions

In [29]:
obs, info = env_eval.reset()
terminate = False
truncate = False
while not (terminate or truncate):
    action = env_eval.action_space.sample()
    obs, reward, terminate, truncate, info = env_eval.step(action)
    env_eval.render()
env_eval.close()

  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(


Complete an episode of the environment using random actions recording actions and reward.

In [30]:
cumulative_reward = 0
actions = []
action_map = {0: 'left', 
              1: 'none',
              2: 'right',
              3: 'faster',
              4: 'slower'}

terminate = False
truncate = False
while not (terminate or truncate):
    action = env_eval.action_space.sample()
    actions.append(action)
    obs, reward, terminate, truncate, info = env_eval.step(action)
    cumulative_reward += reward
env_eval.close()

In [31]:

    
print("Actions: ", ', '.join([action_map[a] for a in actions]))
print("Cumulative Reward: {}".format(cumulative_reward))

Actions:  right
Cumulative Reward: 0.06666666666666665


### Create and Train an Agent

Create a DQN agent using stable-baselines3. In the highway environment episodes are typically note very long (typically < 30 timestpes). Therefore it makes sense to change some hyperparameters to reflect these shorter episodes. We suggest:

- learning_rate = 0.0005
- buffer_size = 15000
- learning_starts = 200
- gamma = 0.8
- train_freq = 1
- target_update_interval = 50
- exploration_fraction=0.7

Also the observation vector is reaonably large so a bigger value function network might work well:
- 'net_arch':[256, 128]

Creating an environment without rendering. 

In [32]:
from stable_baselines3.common.monitor import Monitor

env_train = gym.make("highway-fast-v0", render_mode="human")
env_train = Monitor(env_train)

  logger.warn(


In [33]:
tb_log = './log_tb_highway_DQN/'
agent = sb3.DQN(
    "MlpPolicy",
    env_train,
    learning_rate=0.0005,
    buffer_size=15000,
    learning_starts=200,
    gamma=0.8,
    train_freq=1,
    target_update_interval=50,
    exploration_fraction=0.7,
    policy_kwargs=dict(net_arch=[256, 128]),
    verbose = 1,
    tensorboard_log = tb_log
)


Using cpu device
Wrapping the env in a DummyVecEnv.




Train the agent for a large number of steps.

In [34]:
agent.learn(total_timesteps=1000)



Logging to ./log_tb_highway_DQN/DQN_5
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 17.8     |
|    ep_rew_mean      | 13       |
|    exploration_rate | 0.904    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 43       |
|    time_elapsed     | 1        |
|    total_timesteps  | 71       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 13       |
|    ep_rew_mean      | 9.43     |
|    exploration_rate | 0.859    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 45       |
|    time_elapsed     | 2        |
|    total_timesteps  | 104      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 12.9     |
|    ep_rew_mean      | 9.42     |
|    exploration_rate | 0.79     |
| time/          

<stable_baselines3.dqn.dqn.DQN at 0x3158b67e0>

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./log_tb_highway_DQN/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

###Â Evaluation

Evaluate the agent in the environment for 10 stepes including rendering.

In [35]:
from stable_baselines3.common.evaluation import evaluate_policy

mean_reward, std_reward = evaluate_policy(agent, env_eval, n_eval_episodes=10, render=False)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))



Mean Reward: 16.45774464905262 +/- 6.1674904953879155


### Deployment

We can save an agent easily in SB3.

In [36]:
agent.save("dqn_highway_agent")



We can easily load an agent. 

In [37]:
agent = sb3.dqn.DQN.load("dqn_highway_agent")

Deploy the agent into the environment

In [38]:
obs, _ = env_eval.reset()

terminate = False
truncate = False
while not (terminate or truncate):

    action, _ = agent.predict(obs)
    obs, reward, terminate, truncate, info = env_eval.step(action)

    env_eval.render()

  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(
  gym.logger.warn(


Now continue training the agent.

In [39]:
agent.set_env(env_eval)
agent.learn(total_timesteps=1000)



Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ./log_tb_highway_DQN/DQN_6




----------------------------------
| rollout/            |          |
|    ep_len_mean      | 7.25     |
|    ep_rew_mean      | 5.73     |
|    exploration_rate | 0.961    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 75       |
|    time_elapsed     | 0        |
|    total_timesteps  | 29       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 7.12     |
|    ep_rew_mean      | 5.55     |
|    exploration_rate | 0.923    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 77       |
|    time_elapsed     | 0        |
|    total_timesteps  | 57       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 9.25     |
|    ep_rew_mean      | 6.94     |
|    exploration_rate | 0.849    |
| time/               |          |
|    episodes       

<stable_baselines3.dqn.dqn.DQN at 0x3229e3ce0>

Evaluate the agent after retraining.

In [40]:
mean_reward, std_reward = evaluate_policy(agent, env_eval, n_eval_episodes=10, render=False)
print("Mean Reward after retraining: {} +/- {}".format(mean_reward, std_reward))






Mean Reward after retraining: 20.596696168929338 +/- 3.2557385203257265
