<a href="https://colab.research.google.com/github/Ethan830/RL-Autonomous-Vehicles/blob/main/LunarLander.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_12_4_atari.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Deep RL for 2D environments: Q-Learning, DQN, and PPO
* [Eugene Agichtein](https://www.cs.emory.edu/~eugene/) for CS325: Artificial Intelligence
* Adapted from [Jeff Heaton](https://sites.wustl.edu/jeffheaton/)

This is the starting code for training agents in the Box2D environments of Gymnasium:
https://gymnasium.farama.org/environments/box2d/


The starter code uses Lunar Lander as its example. You will extend it to Car Racing and Bipedal Walker yourself.



# Google CoLab Setup

The following code sets up Gymnasium in Google Colab. Do not modify these lines, but add further dependencies if you need them.

In [None]:
from google.colab import drive
!pip install stable-baselines3[extra] gymnasium
!pip install gymnasium[accept-rom-license,atari]
!pip install pyvirtualdisplay
!sudo apt-get install -y python3-opengl ffmpeg  # the package is python3-opengl on current Ubuntu; python-opengl is not found
!sudo apt-get install -y xvfb
!pip install swig
!pip install gymnasium[box2d]

Collecting pyvirtualdisplay
  Downloading PyVirtualDisplay-3.0-py3-none-any.whl.metadata (943 bytes)
Downloading PyVirtualDisplay-3.0-py3-none-any.whl (15 kB)
Installing collected packages: pyvirtualdisplay
Successfully installed pyvirtualdisplay-3.0
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package python-opengl
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
xvfb is already the newest version (2:21.1.4-2ubuntu1.7~22.04.14).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


# Table-based Q-Learning for Box2D
Gymnasium (https://gymnasium.farama.org/) provides a broad collection of simulated environments, including robotic control, video games, and 3-D physics.

Out of the box, Q-Learning does not deal with continuous inputs. It also works primarily with discrete actions, such as pressing a joystick up or down. The first step is to adapt the example code from the provided Mountain Car notebook to the Lunar Lander environment in Box2D.
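Because a Q-table needs a finite set of states, each continuous observation component has to be mapped to a bin index first. The following is a minimal sketch of uniform binning in plain Python; the helper name `to_bin` and the example range are illustrative assumptions, not part of the starter code.

```python
def to_bin(value, low, high, num_bins):
    """Map a continuous value in [low, high] to an integer bin in [0, num_bins - 1]."""
    # Clip first so out-of-range observations land in the edge bins
    clipped = min(high, max(low, value))
    width = (high - low) / num_bins
    # value == high would otherwise give num_bins, so cap the index
    return min(num_bins - 1, int((clipped - low) / width))

# Example: a position in [-1.5, 1.5] split into 4 bins
print([to_bin(x, -1.5, 1.5, 4) for x in (-2.0, -1.0, 0.0, 0.8, 1.5)])
# -> [0, 0, 2, 3, 3]
```

More bins give finer control at the cost of a larger table, which is the main trade-off to experiment with.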

## Introducing Box2D/Lunar Lander

This section demonstrates how Q-Learning can solve the Lunar Lander Gymnasium environment. The goal is to land a simple spaceship with 3 engines between 2 flags (the landing area).

There are two versions of the environment: one without wind (easy and predictable) and one with wind enabled (turbulent, so control is difficult). Let's suspend disbelief that there is wind on the moon. Our lander should be able to land on Mars too, where winds can be very powerful.

First, it helps to visualize the Lunar Lander environment. The following code creates the environment with the wind disabled; set `enable_wind=True` to turn it on.

In [None]:
import base64
import glob
import io
import math
from pathlib import Path

import gymnasium as gym
import numpy as np
from gymnasium.wrappers import RecordVideo
from IPython import display as ipythondisplay
from IPython.display import HTML
from pyvirtualdisplay import Display


env = gym.make(
    "LunarLander-v3",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False,
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode="rgb_array")

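The LunarLander observation is always a continuous 8-dimensional vector (position, velocity, angle, angular velocity, and two leg-contact flags). The actions are discrete (4 actions) when `continuous=False`, the simpler version used here, or continuous when `continuous=True`. See details here: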
https://gymnasium.farama.org/environments/box2d/lunar_lander/

The goal is to learn which engines to fire to safely land the spacecraft.
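To make the observation and action encodings concrete, here is a small sketch that attaches names to the 8 observation components and the 4 discrete actions, following the Gymnasium Lunar Lander documentation. The constant and function names (`OBSERVATION_FIELDS`, `ACTION_NAMES`, `describe`) are illustrative and not part of Gymnasium.

```python
OBSERVATION_FIELDS = ("x", "y", "vx", "vy", "angle", "angular_velocity",
                      "left_leg_contact", "right_leg_contact")
ACTION_NAMES = {0: "do nothing", 1: "fire left orientation engine",
                2: "fire main engine", 3: "fire right orientation engine"}

def describe(observation):
    """Pair each of the 8 observation components with a readable name."""
    return dict(zip(OBSERVATION_FIELDS, observation))

# Observation values taken from a random rollout of this environment
sample = [0.01157179, 1.3876135, 0.5852294, -0.5307642,
          -0.01325702, -0.1312025, 0.0, 0.0]
print(describe(sample)["vy"], "|", ACTION_NAMES[2])
```

A negative `vy` means the lander is descending, which is the situation where firing the main engine (action 2) slows the fall.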


Let's see how the lander behaves without training.

In [None]:
import shutil
env.metadata['render_fps'] = 30
# Remove videos left over from a previous run
shutil.rmtree("videos_lander_random", ignore_errors=True)

# Setup the wrapper to record the video
video_callable=lambda episode_id: True
env = RecordVideo(env, video_folder='./videos_lander_random', episode_trigger=video_callable)

# Reset after wrapping so that RecordVideo starts recording this episode
state, info = env.reset()

# Run the environment with random actions until the episode ends
truncated = False
terminated = False
i=0
while not (terminated or truncated):
  i+=1
  # Sample uniformly from all 4 discrete actions
  # (np.random.randint(0, 3) would never pick action 3, the right engine)
  action = env.action_space.sample()
  state, reward, terminated, truncated, info = env.step(action)
  # comment out the line below to hide the observations
  print(f"Step {i}: State={state}, Reward={reward}, term={terminated}, trunc={truncated}, info={info}")

# Closing the wrapper writes the recorded video to disk
env.close()




Step 1: State=[ 0.01157179  1.3876135   0.5852294  -0.5307642  -0.01325702 -0.1312025
  0.          0.        ], Reward=-1.104001393681557, term=False, trunc=False, info={}
Step 2: State=[ 0.01735792  1.3750721   0.5852496  -0.55747294 -0.01981243 -0.13112031
  0.          0.        ], Reward=-1.2274943671163214, term=False, trunc=False, info={}
Step 3: State=[ 0.02307339  1.3619356   0.5763651  -0.5839148  -0.0245799  -0.09535811
  0.          0.        ], Reward=-0.4210477371996706, term=False, trunc=False, info={}
Step 4: State=[ 0.0288106   1.3497384   0.5785604  -0.54218835 -0.02936376 -0.09568652
  0.          0.        ], Reward=3.185432062627922, term=False, trunc=False, info={}
Step 5: State=[ 0.03454323  1.3385125   0.5782696  -0.49902764 -0.03431544 -0.09904275
  0.          0.        ], Reward=3.221996861426658, term=False, trunc=False, info={}
Step 6: State=[ 0.0401803   1.3267012   0.56627715 -0.5250148  -0.03685128 -0.05072148
  0.          0.        ], Reward=0.04237890

In [None]:
# Display the video
import glob
# Find all video files in the specified directory
video_files = glob.glob("./videos_lander_random/*.mp4")
if not video_files:
    print("No video files found in the specified directory.")
else:
  video = io.open(video_files[0], 'r+b').read()
  encoded = base64.b64encode(video)
  ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
    </video>
  '''.format(encoded.decode('ascii'))))


No video files found in the specified directory.


# Table-based Q-Learning parameters
Several hyperparameters are very important for Q-Learning. These parameters will likely need adjustment as you apply Q-Learning to other problems. Because of this, it is crucial to understand the role of each parameter.

* **LEARNING_RATE** The rate at which previous Q-values are updated based on new episodes run during training.
* **DISCOUNT** The amount of significance to give estimates of future rewards when added to the reward for the current action taken. A value of 0.95 would indicate a discount of 5% on the future reward estimates.
* **EPISODES** The number of episodes to train over. Increase this for more complex problems; however, training time also increases.
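To see how the learning rate and discount interact, the sketch below applies one tabular Q-learning update to hypothetical numbers. The function name `q_update` and the input values are made up for illustration; the constants match the values used in this notebook.

```python
LEARNING_RATE = 0.1
DISCOUNT = 0.95

def q_update(current_q, reward, max_future_q, lr=LEARNING_RATE, gamma=DISCOUNT):
    """One tabular Q-learning step: move current_q toward the TD target."""
    td_target = reward + gamma * max_future_q
    return (1 - lr) * current_q + lr * td_target

# Hypothetical numbers: Q(s, a) = 2.0, reward = 1.0, best next-state value = 4.0
new_q = q_update(current_q=2.0, reward=1.0, max_future_q=4.0)
print(round(new_q, 4))  # -> 2.28
```

The target here is 1.0 + 0.95 * 4.0 = 4.8, and with a learning rate of 0.1 the estimate moves one tenth of the way from 2.0 toward it.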

In [None]:
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = int(3e4) # set to >= 3e4 to ensure training works for this problem

OBSERVATION_DIM = 8
NUM_ACTIONS = 4
NUM_BINS = 4 # bins per observation dimension; use a power of 2 (4 bins = 2 bits, 8 bins = 3 bits)

###updated 4/21/2025 for clarity ####
epsilon = 0.5
env.reset()



(array([ 1.3444901e-03,  1.4085737e+00,  1.3616368e-01, -1.0427843e-01,
        -1.5511111e-03, -3.0843105e-02,  0.0000000e+00,  0.0000000e+00],
       dtype=float32),
 {})

Next, let's create the discrete buckets for the state and build the Q-table.
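One common way to turn several per-dimension bucket indices into a single Q-table row index is to pack them into the bits of one integer. The sketch below illustrates the idea; `pack_state` is a hypothetical helper, not the function used in this notebook.

```python
def pack_state(bin_indices, bits_per_dim):
    """Pack per-dimension bin indices into one non-negative integer key."""
    key = 0
    for index in bin_indices:
        # Each index must fit in bits_per_dim bits, otherwise keys would collide
        assert 0 <= index < (1 << bits_per_dim)
        key = (key << bits_per_dim) | index
    return key

# Example: three dimensions with 4 bins each (2 bits per dimension)
print(pack_state([3, 0, 2], bits_per_dim=2))  # binary 11 00 10 -> 50
```

With this scheme, `d` dimensions of `b` bits each need a table with `2 ** (d * b)` rows.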



In [None]:
# This function converts the floating point state values into
# discrete values. This is often called binning.  We divide
# the range that the state values might occupy and assign
# each region to a bucket.
# Then we pack the bucket indices into a single integer key that indexes a row of the Q-table.

# Updated 4/21/2025 to correct for out-of-bounds observation values
def discretizeLunarState(s, obs_space, numBins=4):
  highs = np.array([1.5, 1.5, 5.0, 5.0, math.pi, 5.0, 1.0, 1.0]) #broken environment? should not be necessary
  lows = np.array([-1.5, -0.5, -5.0, -5.0, -math.pi, -5.0, 0.0, 0.0])
  s = np.clip(s, lows, highs)

  discrete_state = []

  normalized = (min(5, max(-5, int((s[0]) / 0.05))), \
            min(5, max(-1, int((s[1]) / 0.1))), \
            min(3, max(-3, int((s[2]) / 0.1))), \
            min(3, max(-3, int((s[3]) / 0.1))), \
            min(3, max(-3, int((s[4]) / 0.1))), \
            min(3, max(-3, int((s[5]) / 0.1))), \
            int(s[6]), \
            int(s[7]))

  for i in [0,1,2,3,4,5]:
    bin = ( highs[i]-lows[i] ) / numBins
    val = int ( ( normalized[i] -  lows[i] ) / bin )
    discrete_state.append( val )

  discrete_state.append(int(s[6])) #boolean leg
  discrete_state.append(int(s[7])) #boolean leg

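  shift = int(math.log2(numBins))  # bits used by each binned dimension

  state_key = 0
  for i in [0,1,2,3,4,5]:
    state_key = state_key << shift
    state_key += discrete_state[i]
  # Each leg-contact flag takes one more bit. Use "<<=" here:
  # a bare "state_key<<1" computes the shift but discards the result.
  state_key <<= 1
  state_key += discrete_state[6]
  state_key <<= 1
  state_key += discrete_state[7]

  return state_key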


obs = env.reset()
state = discretizeLunarState(obs[0], env.observation_space, NUM_BINS)
print(obs)
# the state is now a single integer key, used as the row index in the Q(s,a) table.
print(state)


#set up qtable
#(num_states, num_actions)
q_table = np.zeros((NUM_BINS**8, NUM_ACTIONS)) #number of possible discrete states x number of actions
print(q_table.shape)



(array([ 0.00343676,  1.4011996 ,  0.34809506, -0.43203324, -0.00397558,
       -0.07884869,  0.        ,  0.        ], dtype=float32), {})
5066
(65536, 4)


Now let's set up Q-learning!

Q-Learning Implementation: Discretizing input and actions
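The action choice in Q-learning is usually epsilon-greedy: explore with a random action with probability epsilon, otherwise exploit the best known action. A minimal sketch in plain Python follows; the helper name `epsilon_greedy` and the Q-values are illustrative assumptions.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # Greedy choice: index of the largest Q-value (first one on ties)
    return max(range(len(q_row)), key=lambda a: q_row[a])

q_row = [0.1, 0.7, -0.2, 0.3]  # hypothetical Q-values for the 4 actions
print(epsilon_greedy(q_row, epsilon=0.0))  # epsilon = 0 is always greedy -> 1
```

Setting epsilon to 0 corresponds to pure exploitation, which is what evaluation after training should use.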

In [None]:
import numpy as np



render=0

# Run one game.  The q_table to use is provided.  We also
# provide a flag to indicate if the game should be
# rendered/animated.  Finally, we also provide
# a flag to indicate if the q_table should be updated.
def run_game(env, q_table, render, should_update, exploit=False):
    done = False
    discrete_state = discretizeLunarState(env.reset()[0], env.observation_space, NUM_BINS)
    success = False
    total_reward = 0
    while not done:
        # TODO HERE: Implement Q-Learning steps of epsilon-greedy action selection/Exploit or explore
        # #note: if exploit==True, do not explore, exploit only - used for prediction after learning
        # Hint: to select max q from a row of Qtable, can use code like this:
        # np.argmax(q_table[discrete_state,:]), which selects argmax of a row
        if exploit:
            action = np.argmax(q_table[discrete_state, :])
        else:
            if np.random.random() < epsilon:
                action = np.random.randint(NUM_ACTIONS)
            else:
                action = np.argmax(q_table[discrete_state, :])
        #
        #given an action selected,
        # Run simulation step, observe new state and reward
        new_state, reward, done, truncated, info = env.step(action)
        total_reward+=reward
        # Convert continuous state to discrete
        new_state_disc = discretizeLunarState(new_state, env.observation_space, NUM_BINS)


        #TODO: critical step here: Update q-table
        #implement the q-learning update using the observed value, discounted q-values from destination state, etc.
        #numpy array q_table is references by state_id, action_id.

        if should_update:
            # No future reward is available once the episode has terminated
            max_future_q = 0.0 if done else np.max(q_table[new_state_disc, :])
            current_q = q_table[discrete_state, action]
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            q_table[discrete_state, action] = new_q

        discrete_state = new_state_disc

        if render:
            env.render()

        if truncated:
          break

    return total_reward


Run the training! Note: this can take a *long* time. Q-learning is slow because it learns each Q(S,A) value separately, and the state space for this problem is fairly large.
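Over a long training run, the exploration rate is typically reduced gradually so that early episodes explore and later ones exploit. One simple option is multiplicative decay with a floor, sketched below. The function name and all three constants are illustrative assumptions that would need tuning, not the schedule this notebook uses.

```python
def decayed_epsilon(start, decay, floor, episode):
    """Exploration rate after `episode` episodes of multiplicative decay."""
    return max(floor, start * decay ** episode)

START, DECAY, FLOOR = 0.5, 0.9998, 0.01  # illustrative values only
for episode in (0, 1000, 10000, 30000):
    print(episode, round(decayed_epsilon(START, DECAY, FLOOR, episode), 4))
```

A decay factor close to 1 keeps exploration alive for thousands of episodes, while the floor prevents it from vanishing entirely.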


In [None]:
episode = 0
success_count = 0

#make silent train environment, no graphics
train_env = gym.make(
    "LunarLander-v3",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False, #set to False for simpler /calm environment
    wind_power= 15.0,
    turbulence_power= 1.5)


# Loop through the required number of episodes
while episode < EPISODES:
    episode += 1
    done = False

    # Run the game.
    reward = run_game(train_env, q_table, False, True)
    print ("episode ", episode, " finished. reward: ", reward)

    # Count successes
    if reward>0:
        success_count += 1

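    # Reduce epsilon as training progresses
    ### updated 4/21/2025 ####
    # NOTE: this divides epsilon by log(EPISODES/100) (about 5.7 for 3e4 episodes) after every
    # episode, so exploration is effectively switched off within the first few episodes
    epsilon = epsilon/math.log(EPISODES/100)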

print(success_count)

Streaming output truncated to the last 5000 lines.
episode  25002  finished. reward:  -87.16499340571373
episode  25003  finished. reward:  -58.18385061470513
episode  25004  finished. reward:  -16.353832579647985
episode  25005  finished. reward:  220.39053993048512
episode  25006  finished. reward:  -73.40576731342281
episode  25007  finished. reward:  -76.55965253292653
episode  25008  finished. reward:  -74.05457295371066
episode  25009  finished. reward:  -17.41749752332845
episode  25010  finished. reward:  188.5011879289147
episode  25011  finished. reward:  -86.31617296859734
episode  25012  finished. reward:  -3.54686315305446
episode  25013  finished. reward:  170.6674557782552
episode  25014  finished. reward:  208.587331027623
episode  25015  finished. reward:  -35.60570382664014
episode  25016  finished. reward:  -50.49983183734646
episode  25017  finished. reward:  198.31408557650786
episode  25018  finished. reward:  230.22566195076382
episode  25019  finis

Now let's test the trained agent. After about 10000 episodes with `enable_wind=False`, the lander should land successfully about half of the time. However, no reasonable amount of training with a discrete Q-table prepares the lander to behave well in a windy, turbulent environment.

In [None]:
# HIDE OUTPUT

# Setup the wrapper to record the video
#eval environment, with graphics
eval_env = gym.make(
    "LunarLander-v3",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False, #must be same as train environment
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode="rgb_array")


eval_env.reset()
video_callable=lambda episode_id: True
eval_env = RecordVideo(eval_env, video_folder='./videos_lander_qlearn', episode_trigger=video_callable)
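mean_reward = 0
###Updated 4/21/2025####
success_count = 0
num_test = 10
for i in range(num_test):
  reward = run_game(eval_env, q_table, True, False, exploit=True)
  if reward > 0:
    success_count += 1
  mean_reward += reward

# Close the wrapper so the last recorded video is written to disk
eval_env.close()

print("Q-Learning success rate: ", success_count/num_test)
# Average over all test episodes (the accumulated sum, not only the last episode's reward)
print("Q-Learning mean reward: ", mean_reward/num_test)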


# Display the video for the first 3 test episodes
for episode in range(3):
  video = io.open(f'videos_lander_qlearn/rl-video-episode-{episode}.mp4', 'rb').read()
  encoded = base64.b64encode(video)
  # The data URI must use the MIME type video/mp4
  ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
    </video>
  '''.format(encoded.decode('ascii'))))


Q-Learning success rate:  0.8
Q-Learning mean reward:  22.79642555061995


## Inspecting the Q-Table

We can also display the Q-table. The following code shows the Q-value of each action (columns) for each discretized environment state (rows). Like the weights of a neural network, this table is not straightforward to interpret. Some patterns do emerge, as seen by calculating the means of rows and columns. The actions seem consistent across the upper and lower halves of both velocity and position.
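In a large Q-table, many rows are likely never visited during training and stay at zero, so a compact summary can be easier to read than the raw table. The sketch below counts the rows that were updated and records the greedy action for each. The function name `summarize_q_table` and the toy table are illustrative; iterating over rows this way also works for a 2-D NumPy array.

```python
def summarize_q_table(q_table):
    """Return (number of updated rows, greedy action per updated row)."""
    greedy = {}
    for state, row in enumerate(q_table):
        if any(value != 0.0 for value in row):  # row was touched by training
            greedy[state] = max(range(len(row)), key=lambda a: row[a])
    return len(greedy), greedy

# Toy 4-state, 4-action table standing in for the real q_table
toy = [[0.0, 0.0, 0.0, 0.0],
       [0.2, -0.1, 0.0, 0.5],
       [0.0, 0.0, 0.0, 0.0],
       [-1.0, -0.3, -0.2, -0.9]]
print(summarize_q_table(toy))  # -> (2, {1: 3, 3: 2})
```

The fraction of updated rows gives a rough sense of how much of the discretized state space the agent actually explored.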

In [None]:
import pandas as pd

df = pd.DataFrame(q_table)

#df.columns = [f'v-{x}' for x in range(DISCRETE_GRID_SIZE[0])]
#df.index = [f'p-{x}' for x in range(DISCRETE_GRID_SIZE[1])]
df

Unnamed: 0,0,1,2,3
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0
...,...,...,...,...
65531,0.0,0.0,0.0,0.0
65532,0.0,0.0,0.0,0.0
65533,0.0,0.0,0.0,0.0
65534,0.0,0.0,0.0,0.0


## Training the DQN Agent for Lunar Lander

**Todo:** implement the DQN code for the vectorized Lunar Lander environment above.
Follow the DQN example in the provided notebook.



https://colab.research.google.com/drive/1f3cwSAvpDe23Xfkn_tXNj7dGkWlusJYN#scrollTo=mJb8fU8wIenZ



To implement DQN and other algorithms, we will use the Stable Baselines library. It is designed for ease of use, offering a straightforward API to implement, experiment with, and extend upon cutting-edge RL methods.

https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html
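DQN replaces the Q-table with a neural network and trains it on minibatches sampled from a replay buffer of past transitions, which is what the `buffer_size` and `batch_size` arguments of the Stable Baselines3 `DQN` class control. The sketch below shows the idea with the standard library only. `TinyReplayBuffer` is a hypothetical class for illustration and is not how Stable Baselines3 implements its buffer.

```python
import random
from collections import deque

class TinyReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

buffer = TinyReplayBuffer(capacity=5)
for step in range(8):  # dummy transitions
    buffer.add((step, 0, -1.0, step + 1, False))
print(len(buffer), sorted(t[0] for t in buffer.sample(3)))
```

Only the 5 most recent transitions survive in this example, which mirrors how a small `buffer_size` makes the agent forget older experience.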


In [None]:
!pip install stable_baselines3

Collecting stable_baselines3
  Downloading stable_baselines3-2.6.0-py3-none-any.whl.metadata (4.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<3.0,>=2.3->stable_baselines3)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<3.0,>=2.3->stable_baselines3)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<3.0,>=2.3->stable_baselines3)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<3.0,>=2.3->stable_baselines3)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch<3.0,>=2.3->stable_baselines3)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (

In [None]:
!pip install swig
!pip install "gymnasium[box2d]"

Collecting swig
  Using cached swig-4.3.1-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (3.5 kB)
Using cached swig-4.3.1-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.9 MB)
Installing collected packages: swig
Successfully installed swig-4.3.1
Collecting box2d-py==2.3.5 (from gymnasium[box2d])
  Using cached box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: box2d-py
  Building wheel for box2d-py (setup.py) ... done
  Created wheel for box2d-py: filename=box2d_py-2.3.5-cp311-cp311-linux_x86_64.whl size=2379369 sha256=2a3de57fcb46839964fe83f52ac5ec23af0d4dc0d7b6425f8fb8300811cec872
  Stored in directory: /root/.cache/pip/wheels/ab/f1/0c/d56f4a2bdd12bae0a0693ec33f2f0daadb5eb9753c78fa5308
Successfully built box2d-py
Installing collected packages: box2d-py
Successfully installed box2d-py-2.3.5


In [None]:
import gymnasium as gym
from stable_baselines3 import DQN
import torch as th
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.evaluation import evaluate_policy

# Create and initialize fresh Lunar Lander environment
train_env = gym.make(
    "LunarLander-v3",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False, #Should also learn even with wind enabled
    wind_power= 15.0,
    turbulence_power= 1.5)

time_step = train_env.reset()

# Instantiate the agent
#specify network architecture for policy and value networks
#*** YOUR CODE HERE *****
#***Todo: define dqn model. Provide policy network architecture as shown in
#MountainCar example.
policy_kwargs = None
#We can specify the network architecture for fully connected networks (MLPs)
#policy_net = dict() #see example in mountainCar
policy_net = dict(
    activation_fn=th.nn.ReLU,
    net_arch=[256, 256]
)
#dqn = DQN()
#dqn = DQN() #***YOUR CODE HERE *** provide appropriate parameters, net arch requires experimentation
dqn = DQN(
    policy="MlpPolicy",
    env=train_env,
    learning_rate=0.0005,
    buffer_size=50000,
    learning_starts=1000,
    batch_size=64,
    gamma=0.99,
    target_update_interval=500,
    policy_kwargs=policy_net,
    verbose=1
)


  from pkg_resources import resource_stream, resource_exists
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [None]:
# Train the agent
#TODO: experiment with appropriate time steps for this problem
Timesteps = int(2e5) # set to >= 100000 to converge
dqn.learn(total_timesteps=Timesteps)

# Save the agent
dqn.save("dqn_lander")

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 85       |
|    ep_rew_mean      | -167     |
|    exploration_rate | 0.984    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 5375     |
|    time_elapsed     | 0        |
|    total_timesteps  | 340      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 85.4     |
|    ep_rew_mean      | -205     |
|    exploration_rate | 0.968    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 4428     |
|    time_elapsed     | 0        |
|    total_timesteps  | 683      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 91.1     |
|    ep_rew_mean      | -175     |
|    exploration_rate | 0.948    |
| time/               |          |
|    episodes       

In [None]:
# Create a fresh environment for evaluation
eval_env = gym.make(
    "LunarLander-v3",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False, #must be same as train environment
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode="rgb_array")

###Updated 4/21/2025####
# Evaluate the agent
mean_reward =0
success_count = 0
num_test = 10
for i in range (num_test):
  reward, _ = evaluate_policy(dqn, eval_env, n_eval_episodes=1)
  if reward>0:
    success_count+=1
  mean_reward+=reward


print("DQN Mean reward: ", mean_reward/num_test)
print("DQN Success rate: ", success_count/num_test)




DQN Mean reward:  186.00012869788202
DQN Success rate:  0.9


## Visualize actions

Visualize the lander for 3 episodes and save in a video.
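Showing a recorded MP4 inline comes down to base64-encoding the file and embedding it in an HTML `<video>` element as a `data:video/mp4` URI. The sketch below builds that HTML string; `video_tag` is a hypothetical helper, and the byte string is a stand-in for real file contents.

```python
import base64

def video_tag(mp4_bytes, width=640, height=480):
    """Return an HTML <video> element with the MP4 embedded as a data URI."""
    encoded = base64.b64encode(mp4_bytes).decode('ascii')
    return (f'<video width="{width}" height="{height}" controls>'
            f'<source src="data:video/mp4;base64,{encoded}" type="video/mp4" />'
            f'</video>')

# Stand-in bytes; in the notebook these would come from open(path, 'rb').read()
html = video_tag(b'not a real video')
print(html[:60])
```

In a notebook, the returned string can be passed to `IPython.display.HTML` and displayed, once per recorded episode.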

In [None]:
# Setup the wrapper to record the video
from gymnasium.wrappers import RecordVideo
video_callable=lambda episode_id: True
eval_env = RecordVideo(eval_env, video_folder='./videos_lander_dqn', episode_trigger=video_callable)

mean_reward, std_reward = evaluate_policy(dqn, eval_env, n_eval_episodes=3)


# Display the videos
for episode in range(3):
  video = io.open(f'videos_lander_dqn/rl-video-episode-{episode}.mp4', 'rb').read()
  encoded = base64.b64encode(video)
  # The data URI must use the MIME type video/mp4
  ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
    </video>
  '''.format(encoded.decode('ascii'))))

Acknowledgements: adapted from Official Example:

https://stable-baselines3.readthedocs.io/en/master/guide/examples.html

## PPO Policy

Now let's use PPO.
This is the starting code you have to complete.

https://gymnasium.farama.org/environments/box2d/lunar_lander/
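PPO limits how far a single update can move the policy by clipping the probability ratio between the new and old policies, which is what the `clip_range` hyperparameter controls. The sketch below evaluates the clipped surrogate objective for one sample. The function name and the input numbers are illustrative assumptions, and the real implementation averages this quantity over a minibatch.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for a single sample (a quantity to maximize)."""
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # The minimum removes any incentive to push the ratio outside the clip range
    return min(ratio * advantage, clipped_ratio * advantage)

print(clipped_surrogate(ratio=1.5, advantage=2.0))   # gain capped at 1.2 * 2.0
print(clipped_surrogate(ratio=0.5, advantage=-1.0))  # pessimistic value 0.8 * -1.0 is used
```

A larger `clip_range` allows bigger policy changes per update, which can speed up learning but makes training less stable.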

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecFrameStack
import torch as th


In [None]:
# Train the agent
TIMESTEPS = int(3e5)
#experiment with number of steps
#setup training environment without video for speed

env_train = gym.make(
    "LunarLander-v3",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False,
    wind_power= 15.0,
    turbulence_power= 1.5)

env_train.reset()

# Initialize the agent, use Proximal Policy Optimization (PPO)
#*** TODO: define the network architecture for PPO (policy network) ****
policy_net = dict(
    activation_fn=th.nn.Tanh,
    # SB3 >= 1.8 expects a dict here; the list-wrapped form [dict(pi=..., vf=...)] is deprecated
    net_arch=dict(pi=[256, 256], vf=[256, 256])
)

lander_ppo = PPO(
    policy="MlpPolicy",
    env=env_train,
    learning_rate=0.0003,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    policy_kwargs=policy_net,
    verbose=1
)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




Train your PPO agent
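When choosing the number of timesteps, it helps to know how many rollouts and gradient steps a run implies. The sketch below does that bookkeeping for the hyperparameters used in this notebook, assuming a single environment and that every rollout of `n_steps` is collected in full. `ppo_update_counts` is a hypothetical helper, not a Stable Baselines3 function.

```python
import math

def ppo_update_counts(total_timesteps, n_steps, batch_size, n_epochs, n_envs=1):
    """Rough bookkeeping for a PPO run; assumes full rollouts are always collected."""
    rollout_size = n_steps * n_envs
    iterations = math.ceil(total_timesteps / rollout_size)
    minibatches_per_epoch = rollout_size // batch_size
    gradient_steps = iterations * n_epochs * minibatches_per_epoch
    return iterations, minibatches_per_epoch, gradient_steps

print(ppo_update_counts(total_timesteps=300_000, n_steps=2048,
                        batch_size=64, n_epochs=10))
# -> (147, 32, 47040)
```

Doubling `n_steps` halves the number of iterations but leaves the total number of gradient steps roughly unchanged.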

In [None]:
#todo: experiment with number of steps, other hyperparams as necessary
lander_ppo.learn(total_timesteps=TIMESTEPS)

# Save the model
lander_ppo.save(f"lander_ppo_model")
env.close()


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 422      |
|    ep_rew_mean     | 39.6     |
| time/              |          |
|    fps             | 1002     |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 566         |
|    ep_rew_mean          | 25.7        |
| time/                   |             |
|    fps                  | 679         |
|    iterations           | 2           |
|    time_elapsed         | 6           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.014003067 |
|    clip_fraction        | 0.121       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.934      |
|    explained_variance   | 0.937       |
|    learning_rate        | 0.

In [None]:
# Evaluate the trained agent
env_train.reset()
###Updated 4/21/2025####
# Evaluate the agent
mean_reward =0
success_count = 0
num_test = 10
for i in range (num_test):
  reward, _ = evaluate_policy(lander_ppo, eval_env, n_eval_episodes=1)
  if reward>0:
    success_count+=1
  mean_reward+=reward


print("PPO Mean reward: ", mean_reward/num_test)
print("PPO Success rate: ", success_count/num_test)

# Don't forget to close the environment you evaluated on when you are done
eval_env.close()

PPO Mean reward:  233.82630952303924
PPO Success rate:  1.0


Now let's see how it lands!


In [None]:
# Setup the wrapper to record the video
import base64
import glob
import io
from pathlib import Path

import gymnasium as gym
import numpy as np
from gymnasium.wrappers import RecordVideo
from IPython import display as ipythondisplay
from IPython.display import HTML
from pyvirtualdisplay import Display

video_callable=lambda episode_id: True


eval_env = gym.make(
    "LunarLander-v3",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False,
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode="rgb_array")

obs = eval_env.reset()
video_folder = './videos_lander_ppo'
# Record the environment
eval_env = RecordVideo(eval_env, video_folder=video_folder, episode_trigger=video_callable)

# Load the trained agent
# NOTE: if you have loading issue, you can pass `print_system_info=True`
# to compare the system on which the model was trained vs the current one
# lander_ppo = PPO.load("lander_ppo_model", env=eval_env, print_system_info=True)
lander_ppo = PPO.load("lander_ppo_model", env=eval_env)

# Evaluate agent
mean_reward, std_reward = evaluate_policy(lander_ppo, eval_env, n_eval_episodes=3)
print("average reward: ", mean_reward)

eval_env.close()




# Display the videos (eval_env.close() above has already finalized the recordings)
for episode in range(3):
  video = io.open(f'videos_lander_ppo/rl-video-episode-{episode}.mp4', 'rb').read()
  encoded = base64.b64encode(video)
  # The data URI must use the MIME type video/mp4
  ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
    </video>
  '''.format(encoded.decode('ascii'))))

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


  logger.warn("Unable to save last video! Did you call close()?")


average reward:  160.68370271953322


In [None]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

# Build the environment inside the factory so the lambda does not depend on the name `env` being rebound
env = DummyVecEnv([lambda: gym.make("CarRacing-v3", render_mode="rgb_array", continuous=False)])

model = PPO(
    policy="CnnPolicy",
    env=env,
    learning_rate=0.0003,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1
)

model.learn(total_timesteps=2e5)

model.save("ppo_car_racing")

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5)
print(f"Mean reward: {mean_reward} +/- {std_reward}")


Using cpu device
Wrapping the env in a VecTransposeImage.
-----------------------------
| time/              |      |
|    fps             | 61   |
|    iterations      | 1    |
|    time_elapsed    | 33   |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 37          |
|    iterations           | 2           |
|    time_elapsed         | 107         |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.013591573 |
|    clip_fraction        | 0.167       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.6        |
|    explained_variance   | -0.00843    |
|    learning_rate        | 0.0003      |
|    loss                 | 0.19        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.014      |
|    value_loss           | 0.638       |
------------------



Mean reward: 384.5046522259712 +/- 147.33298022542377


In [None]:
from gymnasium.wrappers import RecordVideo
import io
import glob
import base64
from IPython.display import HTML
import IPython.display as ipythondisplay

record_env = gym.make("CarRacing-v3", render_mode="rgb_array", continuous=False)

record_env = RecordVideo(record_env, video_folder='./videos_car_racing', episode_trigger=lambda e: True)

obs, _ = record_env.reset()

done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = record_env.step(action)
    if truncated:
        break

record_env.close()

In [None]:
video_files = sorted(glob.glob('./videos_car_racing/*.mp4'))

for idx, video_path in enumerate(video_files):
    video_data = io.open(video_path, 'r+b').read()
    encoded = base64.b64encode(video_data)
    display_text = f'''
        <h4>CarRacing PPO - Episode {idx}</h4>
        <video width="640" height="480" controls>
            <source src="data:video/mp4;base64,{encoded.decode('ascii')}" type="video/mp4" />
        </video>
    '''
    ipythondisplay.display(HTML(display_text))
