## Setup

```shell
sudo apt-get update
sudo apt-get install g++
sudo apt install swig cmake
sudo apt-get install x11-utils
sudo apt-get install -y python3-opengl
sudo apt-get install -y ffmpeg
sudo apt-get install -y xvfb
pip3 install pyvirtualdisplay
# finally run
pip install -r requirements.txt
# to use Jupyter notebooks in VS Code
pip install ipykernel
```

In [4]:
# Virtual display
import os
os.environ['PYVIRTUALDISPLAY_DISPLAYFD'] = '0'

from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7f797ee24970>

In [5]:
# for environment rendering
import gymnasium

from huggingface_sb3 import load_from_hub, package_to_hub
from huggingface_hub import notebook_login

# for deep RL Library
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor



## Understand Gymnasium and how it works 🤖

🏋 The library containing our environment is called Gymnasium.
**We'll use Gymnasium a lot in Deep Reinforcement Learning.**

Gymnasium is the **new version of the Gym library**, [maintained by the Farama Foundation](https://farama.org/).

The Gymnasium library provides two things:

- An interface that allows you to **create RL environments**.
- A **collection of environments** (Classic Control, Atari, Box2D...).

Let's look at an example, but first let's recall the RL loop.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">

At each step:
- Our Agent receives a **state (S0)** from the **Environment** — we receive the first frame of our game (Environment).
- Based on that **state (S0),** the Agent takes an **action (A0)** — our Agent will move to the right.
- The environment transitions to a **new** **state (S1)** — new frame.
- The environment gives some **reward (R1)** to the Agent — we’re not dead *(Positive Reward +1)*.
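
The loop above can be sketched without any RL library. `CorridorEnv` below is a hypothetical toy environment (it is not part of Gymnasium); its `reset` and `step` methods only imitate the shape of the values an environment hands back, so that each state, action, reward, and next state is visible.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to the goal."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0

    def reset(self):
        # Return the initial state S0 and an (empty) info dictionary.
        self.position = 0
        return self.position, {}

    def step(self, action):
        # Action 1 moves right, any other action moves left (never below 0).
        if action == 1:
            self.position += 1
        else:
            self.position = max(0, self.position - 1)
        reward = 1.0  # still alive, as in the "+1" example above
        terminated = self.position >= self.goal
        truncated = False  # this toy has no time limit
        return self.position, reward, terminated, truncated, {}


env = CorridorEnv(goal=3)
state, info = env.reset()
trajectory = []
done = False
while not done:
    action = 1  # the agent always moves to the right
    next_state, reward, terminated, truncated, info = env.step(action)
    trajectory.append((state, action, reward, next_state))
    state = next_state
    done = terminated or truncated

for s, a, r, s_next in trajectory:
    print(f"S={s} A={a} -> R={r} S'={s_next}")
```

Each tuple in `trajectory` is one pass through the four bullet points: the state the agent saw, the action it took, the reward it received, and the state it ended up in.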


With Gymnasium:

1️⃣ We create our environment using `gymnasium.make()`

2️⃣ We reset the environment to its initial state with `observation, info = env.reset()`

At each step:

3️⃣ Get an action using our model (in our example we take a random action)

4️⃣ Using `env.step(action)`, we perform this action in the environment and get:
- `observation`: The new state (S<sub>t+1</sub>).
- `reward`: The reward we get after executing the action.
- `terminated`: Indicates whether the episode terminated (the agent reached a terminal state).
- `truncated`: Introduced with this new version, it indicates that the episode was cut short, for instance by a time limit or because the agent went out of the bounds of the environment.
- `info`: A dictionary that provides additional information (depends on the environment).

For more explanations check this 👉 https://gymnasium.farama.org/api/env/#gymnasium.Env.step
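
The difference between `terminated` and `truncated` is easier to see on a toy example. The sketch below is plain Python, and both `CountdownEnv` and `StepLimit` are hypothetical names invented for illustration: the task itself decides when an episode terminates, while an outer step limit decides when it is truncated.

```python
class CountdownEnv:
    """Hypothetical toy task: terminates when the counter reaches zero."""

    def __init__(self, start):
        self.start = start
        self.counter = start

    def reset(self):
        self.counter = self.start
        return self.counter, {}

    def step(self, action):
        self.counter -= 1
        terminated = self.counter == 0  # the task reached its terminal state
        return self.counter, 0.0, terminated, False, {}


class StepLimit:
    """Hypothetical wrapper that truncates an episode after max_steps steps."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        observation, reward, terminated, truncated, info = self.env.step(action)
        self.elapsed += 1
        # Truncation comes from the step limit, not from the task itself.
        if self.elapsed >= self.max_steps:
            truncated = True
        return observation, reward, terminated, truncated, info


def run_episode(env):
    """Step until the episode ends and report which flag ended it."""
    env.reset()
    while True:
        _, _, terminated, truncated, _ = env.step(0)
        if terminated or truncated:
            return terminated, truncated


finished = run_episode(StepLimit(CountdownEnv(start=3), max_steps=10))
timed_out = run_episode(StepLimit(CountdownEnv(start=10), max_steps=3))
print("short task:", finished)
print("long task:", timed_out)
```

The short task ends on its own before the limit is hit, while the long task is cut off by the limit. In both cases the caller resets the environment afterwards.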

If the episode is terminated or truncated:
- We reset the environment to its initial state with `observation, info = env.reset()`

**Let's look at an example!** Make sure to read the code.


In [12]:
import gymnasium as gym

# First, we create our environment called LunarLander-v2
env = gym.make("LunarLander-v2")
# Then we reset this environment to its initial state
observation, info = env.reset()
print(observation)
for _ in range(20):
    # Take a random action
    action = env.action_space.sample()
    print("Action taken:", action)
    # Do this action in the environment and get
    # next_state, reward, terminated, truncated, info
    observation, reward, terminated, truncated, info = env.step(action)
    # If the episode is terminated (in our case the lander landed or crashed) or truncated (timeout)
    if terminated or truncated:
        # Reset the environment
        print("Environment is reset")
        observation, info = env.reset()

env.close()
    

[ 0.00473328  1.4141573   0.47941065  0.14387114 -0.00547786 -0.10859375
  0.          0.        ]
Action taken: 1
Action taken: 2
Action taken: 0
Action taken: 1
Action taken: 0
Action taken: 1
Action taken: 0
Action taken: 0
Action taken: 1
Action taken: 2
Action taken: 1
Action taken: 2
Action taken: 3
Action taken: 3
Action taken: 0
Action taken: 0
Action taken: 1
Action taken: 1
Action taken: 2
Action taken: 3


Let's see what the [Environment](https://gymnasium.farama.org/environments/box2d/lunar_lander/) looks like:

In [8]:
env = gym.make("LunarLander-v2")
env.reset()
print("__Observation Space__")
print(env.observation_space.shape)
print("Sample observation:", env.observation_space.sample()) # Get a random observation

__Observation Space__
(8,)
Sample observation: [49.690548   34.674305    2.274109    3.27103     1.1788406  -2.7548833
  0.05388867  0.35002184]


The printed shape `(8,)` tells us that the observation is a vector of size 8, where each value contains different information about the lander:
- Horizontal coordinate of the lander (x)
- Vertical coordinate of the lander (y)
- Horizontal speed (x)
- Vertical speed (y)
- Angle
- Angular speed
- If the left leg contact point has touched the land (boolean)
- If the right leg contact point has touched the land (boolean)
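
A raw vector of 8 numbers is hard to read, so it can help to label the values. The sketch below uses only the standard library; `LanderObservation` and `unpack_observation` are hypothetical helper names, and the field order is assumed to follow the list above.

```python
from typing import NamedTuple


class LanderObservation(NamedTuple):
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def unpack_observation(observation):
    """Label the 8 raw values in the order listed above."""
    values = [float(v) for v in observation]
    if len(values) != 8:
        raise ValueError(f"expected 8 values, got {len(values)}")
    *continuous, left, right = values
    return LanderObservation(*continuous, bool(left), bool(right))


# The first observation printed by the random-action example earlier.
raw = [0.00473328, 1.4141573, 0.47941065, 0.14387114,
       -0.00547786, -0.10859375, 0.0, 0.0]
obs = unpack_observation(raw)
print(obs)
```

At the start of an episode the lander is in the air, so both leg-contact flags are `False`.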


In [9]:
print("__Action Space__")
print("Action Space Shape",env.action_space.n)
print("Action Space Sample:", env.action_space.sample()) # Get a random action

__Action Space__
Action Space Shape 4
Action Space Sample: 2


The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:

- Action 0: Do nothing,
- Action 1: Fire left orientation engine,
- Action 2: Fire the main engine,
- Action 3: Fire right orientation engine.
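
When reading logs of integer actions, naming them makes the output easier to follow. This is a small sketch with a hypothetical `LanderAction` enum; the names are our own, only the numbers 0 to 3 come from the list above.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


# The first five random actions printed by the example earlier.
sampled_actions = [1, 2, 0, 1, 0]
names = [LanderAction(a).name for a in sampled_actions]
print(names)
```

Because `IntEnum` members compare equal to plain integers, they can be used wherever an integer action is expected.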

Reward function (the function that will give a reward at each timestep) 💰:

After every step a reward is granted. The total reward of an episode is the **sum of the rewards for all the steps within that episode**.

For each step, the reward:

- Is increased/decreased the closer/further the lander is to the landing pad.
- Is increased/decreased the slower/faster the lander is moving.
- Is decreased the more the lander is tilted (angle not horizontal).
- Is increased by 10 points for each leg that is in contact with the ground.
- Is decreased by 0.03 points each frame a side engine is firing.
- Is decreased by 0.3 points each frame the main engine is firing.

The episode receives an **additional reward of -100 or +100 points for crashing or landing safely respectively.**

An episode is **considered a solution if it scores at least 200 points.**
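
Part of this reward function is simple arithmetic, which we can check by hand. The sketch below only models the engine costs and the 200-point threshold stated above; the shaping terms (distance, speed, tilt, leg contact) are left out, and the helper names are hypothetical.

```python
SIDE_ENGINE_COST = 0.03  # points lost per frame a side engine fires
MAIN_ENGINE_COST = 0.3   # points lost per frame the main engine fires
SOLVED_THRESHOLD = 200   # an episode counts as a solution at or above this score


def fuel_penalty(actions):
    """Total points lost to engine use (actions 1 and 3 are side engines, 2 is main)."""
    side_frames = sum(1 for a in actions if a in (1, 3))
    main_frames = sum(1 for a in actions if a == 2)
    return side_frames * SIDE_ENGINE_COST + main_frames * MAIN_ENGINE_COST


def is_solved(step_rewards):
    """The episode score is the sum of the per-step rewards."""
    return sum(step_rewards) >= SOLVED_THRESHOLD


# The 20 random actions printed by the example earlier.
random_actions = [1, 2, 0, 1, 0, 1, 0, 0, 1, 2, 1, 2, 3, 3, 0, 0, 1, 1, 2, 3]
print(f"fuel penalty: {fuel_penalty(random_actions):.2f}")
```

Those 20 random steps fire a side engine 10 times and the main engine 4 times, which costs about 1.5 points in fuel alone.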

### Vectorized Environment

- We create a vectorized environment (a method for stacking multiple independent environments into a single environment) of 16 environments. This way, **we'll have more diverse experiences during training.**
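
To make the idea of stacking environments concrete, here is a minimal sketch in plain Python. `CounterEnv` and `ToyVecEnv` are hypothetical classes written for this example and are much simpler than SB3's real vectorized environments; the automatic reset of finished copies is meant to mirror, in simplified form, how a batch of environments keeps running without stalling.

```python
class CounterEnv:
    """Hypothetical toy environment whose episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, 1.0, done


class ToyVecEnv:
    """Hypothetical stand-in for a vectorized environment: steps every copy in lockstep."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                # Finished copies restart immediately so the batch never stalls.
                obs = env.reset()
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


vec_env = ToyVecEnv([CounterEnv(length=n) for n in (2, 3, 4)])
first_observations = vec_env.reset()
history = [vec_env.step([0, 0, 0]) for _ in range(3)]
for observations, rewards, dones in history:
    print(observations, dones)
```

One call to `step` returns one transition per copy, so the agent sees several independent situations at every step instead of a single one.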

In [13]:
# Create the environment
env = make_vec_env('LunarLander-v2', n_envs=16)

## Create the Model 🤖
- We have studied our environment and understood the problem: **landing the Lunar Lander on the landing pad correctly by controlling the left orientation, right orientation, and main engines**. Now let's build the algorithm we're going to use to solve this problem 🚀.

- To do so, we're going to use our first Deep RL library, [Stable Baselines3 (SB3)](https://stable-baselines3.readthedocs.io/en/master/).

- SB3 is a set of **reliable implementations of reinforcement learning algorithms in PyTorch**.

---

💡 A good habit when using a new library is to dive into the documentation first: https://stable-baselines3.readthedocs.io/en/master/ and then try some tutorials.

----

To solve this problem, we're going to use SB3 **PPO**. [PPO (aka Proximal Policy Optimization)](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example) is one of the SOTA (state of the art) Deep Reinforcement Learning algorithms that you'll study during this course.

PPO is a combination of:
- *Value-based reinforcement learning method*: learning a value function that estimates **how good it is to be in a given state** (the critic).
- *Policy-based reinforcement learning method*: learning a policy that will **give us a probability distribution over actions** (the actor).

**Thus, PPO is an actor-critic method.**
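
The "proximal" part of PPO comes from clipping how far a single update can move the policy. Below is a minimal sketch of the clipped surrogate objective for one sample, written in plain Python rather than taken from SB3's code. Here `ratio` stands for the probability of the action under the new policy divided by its probability under the old one, `advantage` says how much better the action was than expected, and `clip_range=0.2` is assumed as the default.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO's clipped objective for a single sample (a quantity to maximize)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum keeps the more pessimistic of the two estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage): the gain stops growing once ratio > 1.2.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# A bad action (negative advantage): no extra credit for pushing ratio below 0.8.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
# A bad action that became more likely is penalized in full.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))
```

Once the ratio leaves the interval from 0.8 to 1.2 in the direction that would improve the objective, the gradient for that sample vanishes, which keeps each update close to the previous policy.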

Stable-Baselines3 is easy to set up:

1️⃣ You **create your environment** (in our case it was done above)

2️⃣ You define the **model you want to use and instantiate this model** `model = PPO("MlpPolicy", env)`

3️⃣ You **train the agent** with `model.learn` and define the number of training timesteps

In [18]:
model = PPO(
    policy = 'MlpPolicy',
    env = env,
    n_steps = 1024,
    batch_size = 64,
    n_epochs = 4,
    gamma = 0.999,
    gae_lambda = 0.98,
    ent_coef = 0.01,
    verbose=1
)

Using cuda device


### Train the PPO agent

In [19]:
# Train it for 500,000 timesteps
model.learn(total_timesteps=500_000)
# Save the model
model_name = "ppo-LunarLander-v2"
model.save(model_name)

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 92.2     |
|    ep_rew_mean     | -188     |
| time/              |          |
|    fps             | 1389     |
|    iterations      | 1        |
|    time_elapsed    | 11       |
|    total_timesteps | 16384    |
---------------------------------
--------------------------------------------
| rollout/                |                |
|    ep_len_mean          | 87             |
|    ep_rew_mean          | -156           |
| time/                   |                |
|    fps                  | 853            |
|    iterations           | 2              |
|    time_elapsed         | 38             |
|    total_timesteps      | 32768          |
| train/                  |                |
|    approx_kl            | 0.0060808985   |
|    clip_fraction        | 0.0307         |
|    clip_range           | 0.2            |
|    entropy_loss         | -1.38          |
|    explained_variance   | -0
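
The numbers in these tables follow from the hyperparameters passed to `PPO(...)` and from the 16 parallel environments. The arithmetic below is a sketch of that relationship; the final iteration count assumes training keeps collecting full rollouts until the requested number of timesteps is reached.

```python
import math

n_envs = 16                # from make_vec_env(..., n_envs=16)
n_steps = 1024             # steps collected per environment per rollout
batch_size = 64            # minibatch size
n_epochs = 4               # passes over each rollout
total_timesteps = 500_000  # requested in model.learn

rollout_size = n_envs * n_steps  # transitions gathered per iteration
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_iteration = minibatches_per_epoch * n_epochs
iterations = math.ceil(total_timesteps / rollout_size)

print("rollout size:", rollout_size)
print("gradient steps per iteration:", gradient_steps_per_iteration)
print("iterations needed:", iterations)
```

The rollout size matches the `total_timesteps` value of 16384 reported after the first iteration in the log, and 32768 after the second.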

## Evaluate the Agent

In [20]:
eval_env = Monitor(gym.make("LunarLander-v2", render_mode='rgb_array'))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=197.64 +/- 17.94574970642794
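
The two numbers returned by `evaluate_policy` summarize the per-episode rewards: their mean and their spread. The sketch below reproduces that summary with the standard library on made-up episode returns (the four values are hypothetical, not results from our agent), and it assumes the spread is a population standard deviation.

```python
import statistics

# Hypothetical per-episode returns; real values come from running the agent.
episode_returns = [210.0, 185.0, 190.0, 215.0]

mean_reward = statistics.fmean(episode_returns)
std_reward = statistics.pstdev(episode_returns)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```

Formatting both values with `:.2f` keeps the printed line short. With only a handful of episodes the spread can be large, so it is worth reading the mean together with it rather than alone.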


## Publish our trained model on the Hub 🔥

By using `package_to_hub`, **we evaluate the agent, record a replay, generate a model card, and push everything to the Hub**.

This way:
- We can **showcase our work** 🔥
- We can **visualize our agent playing** 👀
- We can **share with the community an agent that others can use** 💾

In [31]:
import os
from dotenv import load_dotenv

load_dotenv()
access_token = os.environ.get('HF_TOKEN')
USER_NAME = os.environ.get('USER_NAME')
!huggingface-cli login --token $access_token

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
The token `RL_token` has been saved to /home/naveen/.cache/huggingface/stored_tokens
Your token has been saved to /home/naveen/.cache/huggingface/token
Login successful.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
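
Since `os.environ.get` returns `None` when a variable is missing, a typo in the `.env` file only shows up later as a confusing login or upload error. A small guard can fail early instead. This is a sketch with a hypothetical helper name, `require_env`, and it takes the mapping as a parameter so it can be tried without touching real credentials.

```python
import os


def require_env(name, environ=os.environ):
    """Return a non-empty environment variable or raise a clear error."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a stand-in mapping instead of the real environment.
fake_environ = {"HF_TOKEN": "dummy-token", "USER_NAME": "your-username"}
print(require_env("USER_NAME", fake_environ))
```

In the notebook, the same helper could replace the two `os.environ.get(...)` calls once `load_dotenv()` has run.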


Let's fill the `package_to_hub` function:
- `model`: our trained model.
- `model_name`: the name of the trained model that we passed to `model.save`
- `model_architecture`: the model architecture we used, in our case PPO
- `env_id`: the name of the environment, in our case `LunarLander-v2`
- `eval_env`: the evaluation environment defined in `eval_env`
- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `(repo_id = {username}/{repo_name})`

💡 **A good name is {username}/{model_architecture}-{env_id}**

- `commit_message`: message of the commit
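
The naming tip above can be turned into a tiny helper. `build_repo_id` is a hypothetical function, not part of `huggingface_sb3`; lowercasing the architecture is our own choice, made so that the result looks like `ppo-LunarLander-v2`.

```python
def build_repo_id(username, model_architecture, env_id):
    """Build a repo id of the form {username}/{model_architecture}-{env_id}."""
    if not username:
        raise ValueError("username is required to build a repo id")
    return f"{username}/{model_architecture.lower()}-{env_id}"


print(build_repo_id("your-username", "PPO", "LunarLander-v2"))
```

Rejecting an empty username avoids accidentally building an id that starts with `None/` when the `USER_NAME` variable was not loaded.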

In [36]:
import gymnasium as gym

from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env
from huggingface_sb3 import package_to_hub

# PLACE the variables you've just defined two cells above
# Define the name of the environment
env_id = "LunarLander-v2"

model_architecture = "PPO"

## Define a repo_id
## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization or UserName}/{repo_name})
repo_id = f"{USER_NAME}/ppo-LunarLander-v2-firstagent" # Change with your repo id, you can't push with mine 😄

## Define the commit message
commit_message = "Upload our first PPO LunarLander-v2 trained agent"

# Create the evaluation env and set the render_mode="rgb_array"
eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])

# PLACE the package_to_hub function you've just filled here
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation Environment
               repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}), for instance ThomasSimonini/ppo-LunarLander-v2
               commit_message=commit_message)

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.
Saving video to /tmp/tmpjdm0ayzg/-step-0-to-step-1000.mp4
MoviePy - Building video /tmp/tmpjdm0ayzg/-step-0-to-step-1000.mp4.
MoviePy - Writing video /tmp/tmpjdm0ayzg/-step-0-to-step-1000.mp4
MoviePy - Done !
MoviePy - video ready /tmp/tmpjdm0ayzg/-step-0-to-step-1000.mp4

ℹ Pushing repo Naveen20o1/ppo-LunarLander-v2-firstagent to the Hugging
Face Hub


policy.pth: 100%|██████████| 43.8k/43.8k [00:00<00:00, 62.0kB/s]
policy.optimizer.pth: 100%|██████████| 88.4k/88.4k [00:00<00:00, 96.9kB/s]
replay.mp4: 100%|██████████| 177k/177k [00:01<00:00, 168kB/s]
pytorch_variables.pth: 100%|██████████| 864/864 [00:01<00:00, 498B/s]
ppo-LunarLander-v2.zip: 100%|██████████| 148k/148k [00:01<00:00, 86.9kB/s]
Upload 5 LFS files: 100%|██████████| 5/5 [00:02<00:00,  2.00it/s]

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/Naveen20o1/ppo-LunarLander-v2-firstagent/tree/main/


CommitInfo(commit_url='https://huggingface.co/Naveen20o1/ppo-LunarLander-v2-firstagent/commit/abf5333f57e2414ee9bb9e34eeb22a6e9e1450d3', commit_message='Upload our first PPO LunarLander-v2 trained agent', commit_description='', oid='abf5333f57e2414ee9bb9e34eeb22a6e9e1450d3', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Naveen20o1/ppo-LunarLander-v2-firstagent', endpoint='https://huggingface.co', repo_type='model', repo_id='Naveen20o1/ppo-LunarLander-v2-firstagent'), pr_revision=None, pr_num=None)

## Load the model

In [38]:
from huggingface_sb3 import load_from_hub
checkpoint = load_from_hub(
    repo_id="Naveen20o1/ppo-LunarLander-v2-firstagent",
    filename="ppo-LunarLander-v2.zip",
)

model = PPO.load(checkpoint, print_system_info=True)

== CURRENT SYSTEM INFO ==
- OS: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35 # 1 SMP Tue Nov 5 00:21:55 UTC 2024
- Python: 3.9.21
- Stable-Baselines3: 2.0.0a5
- PyTorch: 2.6.0+cu124
- GPU Enabled: True
- Numpy: 2.0.2
- Cloudpickle: 3.1.1
- Gymnasium: 0.28.1

== SAVED MODEL SYSTEM INFO ==
- OS: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35 # 1 SMP Tue Nov 5 00:21:55 UTC 2024
- Python: 3.9.21
- Stable-Baselines3: 2.0.0a5
- PyTorch: 2.6.0+cu124
- GPU Enabled: True
- Numpy: 2.0.2
- Cloudpickle: 3.1.1
- Gymnasium: 0.28.1



In [39]:
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=159.22 +/- 86.77216089024408
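
The reloaded model scores lower here than in the earlier evaluation, but both runs use only 10 episodes. One rough way to read the gap is to compare it with the standard error of the two means. This is a back-of-the-envelope sketch, not a formal statistical test, and it assumes the episodes are independent.

```python
import math

n_episodes = 10
first = (197.64, 17.94574970642794)   # mean, std from the evaluation after training
second = (159.22, 86.77216089024408)  # mean, std from the evaluation after reloading


def standard_error(std, n):
    """Standard error of a mean estimated from n independent episodes."""
    return std / math.sqrt(n)


difference = first[0] - second[0]
combined_se = math.hypot(
    standard_error(first[1], n_episodes),
    standard_error(second[1], n_episodes),
)
print(f"difference={difference:.2f}, combined standard error={combined_se:.2f}")
print(f"difference in standard errors: {difference / combined_se:.2f}")
```

The gap is well under two combined standard errors, so it is compatible with evaluation noise; running more evaluation episodes would give a steadier estimate.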
