<a href="https://colab.research.google.com/github/Mohamed-ux-beep/LunarLander/blob/main/Reinforcement_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Hugging Face Deep Reinforcement Learning course: creating and training my first agent
# Install all dependencies
# Everything we need: the Gymnasium environment library and Stable-Baselines3

In [2]:
!apt install swig cmake

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cmake is already the newest version (3.22.1-1ubuntu1.22.04.1).
Suggested packages:
  swig-doc swig-examples swig4.0-examples swig4.0-doc
The following NEW packages will be installed:
  swig swig4.0
0 upgraded, 2 newly installed, 0 to remove and 24 not upgraded.
Need to get 1,116 kB of archives.
After this operation, 5,542 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig4.0 amd64 4.0.2-1ubuntu1 [1,110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig all 4.0.2-1ubuntu1 [5,632 B]
Fetched 1,116 kB in 0s (2,536 kB/s)
Selecting previously unselected package swig4.0.
(Reading database ... 121654 files and directories currently installed.)
Preparing to unpack .../swig4.0_4.0.2-1ubuntu1_amd64.deb ...
Unpacking swig4.0 (4.0.2-1ubuntu1) ...
Selecting previously unselected package swig.
Preparing to unpack .../swig_4.0.2-1ubu

In [None]:
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt

In [None]:
# virtual screen library to render the environment and thus record the frames
!sudo apt-get update
!sudo apt-get install -y python3-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

In [None]:
# To make sure the newly installed libraries are used, restart the notebook runtime
import os
os.kill(os.getpid(), 9)

In [1]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7ff1fdc017b0>

In [2]:
# One additional library we import is huggingface_hub to be able to upload and download trained models from the hub.
import gymnasium
from huggingface_sb3 import load_from_hub, package_to_hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

🏋 The library containing our environment is called Gymnasium. You'll use Gymnasium a lot in Deep Reinforcement Learning.

Gymnasium is the new version of the Gym library, maintained by the Farama Foundation.

The Gymnasium library provides two things:

- An interface that allows you to create RL environments.
- A collection of environments (gym-control, atari, box2D...).

In [3]:
import gymnasium as gym

# first we create our environment called LunarLander-v2
env = gym.make('LunarLander-v2')

# then we reset this environment
observation, info = env.reset()

for _ in range(20):
  # take a random action
  action = env.action_space.sample()
  print('Action taken: ', action)

  # apply this action in the environment and get the next state, reward, terminated, truncated and info
  observation, reward, terminated, truncated, info = env.step(action)

  # if the episode is terminated (the lander landed or crashed) or truncated (timed out), reset
  if terminated or truncated:
    print('Environment is reset')
    observation, info = env.reset()

# close the environment
env.close()

Action taken:  3
Action taken:  0
Action taken:  3
Action taken:  3
Action taken:  3
Action taken:  0
Action taken:  2
Action taken:  1
Action taken:  0
Action taken:  1
Action taken:  0
Action taken:  2
Action taken:  3
Action taken:  1
Action taken:  3
Action taken:  1
Action taken:  1
Action taken:  2
Action taken:  2
Action taken:  0
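
The loop above follows the usual Gymnasium interaction pattern: reset once, then step repeatedly and reset again when an episode ends. As a minimal sketch, the same pattern can be wrapped in a helper that sums the reward of one episode. `CountdownEnv`, `episode_return` and `random_action` are hypothetical names introduced here; the stub only mimics the `reset`/`step` signatures so the snippet runs without Box2D.

```python
import random


class CountdownEnv:
    """Hypothetical stub with Gymnasium-style reset/step (not a real Gymnasium env)."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.steps_left = horizon

    def reset(self, seed=None):
        self.steps_left = self.horizon
        return 0.0, {}  # (observation, info)

    def step(self, action):
        self.steps_left -= 1
        terminated = self.steps_left == 0
        # (observation, reward, terminated, truncated, info)
        return 0.0, 1.0, terminated, False, {}


def episode_return(env, choose_action, max_steps=1000):
    """Run one episode and return the sum of its rewards."""
    observation, info = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = choose_action(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return total


def random_action(observation):
    # Random policy over the four discrete actions (0 to 3)
    return random.randrange(4)


print(episode_return(CountdownEnv(horizon=5), random_action))  # 5.0
```

With the real environment, the same helper should work when called as `episode_return(gym.make('LunarLander-v2'), random_action)`.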


In [4]:
# Our agent is the lunar lander; we need to train it to adapt its speed and position (horizontal, vertical and angular) so that it lands correctly on the moon
env = gym.make('LunarLander-v2')
print("__ OBSERVATION SPACE __ \n")
print("observation space shape ", env.observation_space.shape)
print("Sample observation ", env.observation_space.sample())

__ OBSERVATION SPACE __ 

observation space shape  (8,)
Sample observation  [4.2832294e+01 7.9984886e+01 1.2503858e+00 4.8145494e+00 1.8510625e+00
 2.9413311e+00 1.3339916e-01 5.8134116e-02]


In [5]:
# The observation space has 8 dimensions:
# 1: (x, y) coordinates                              -- 2
# 2: (x, y) linear velocities                        -- 2
# 3: angle                                           -- 1
# 4: angular velocity                                -- 1
# 5: left leg in contact with the ground? Boolean    -- 1
# 6: right leg in contact with the ground? Boolean   -- 1
# Total ------------------------------------> (2+2+1+1+1+1) = 8 dimensions
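
To make the layout of the observation vector easier to work with, the eight entries can be given names. This is only a sketch: `LanderObservation` and `unpack_observation` are hypothetical helpers, the field order follows the list above, and the sample values are made up rather than taken from a real episode.

```python
from typing import NamedTuple


class LanderObservation(NamedTuple):
    """Hypothetical named view of the 8-dimensional observation vector."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def unpack_observation(values):
    """Convert a length-8 sequence into a LanderObservation."""
    if len(values) != 8:
        raise ValueError(f"expected 8 values, got {len(values)}")
    x, y, vx, vy, angle, angular_velocity, left, right = values
    # The leg flags arrive as floats (0.0 or 1.0); treat values above 0.5 as contact
    return LanderObservation(
        float(x), float(y), float(vx), float(vy),
        float(angle), float(angular_velocity),
        left > 0.5, right > 0.5,
    )


# Made-up values for illustration only
obs = unpack_observation([0.1, 1.4, -0.2, -0.5, 0.05, 0.01, 0.0, 1.0])
print(obs.angle, obs.right_leg_contact)  # 0.05 True
```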

In [6]:
print("__ ACTION SPACE __ \n")
print("action space shape ", env.action_space.n)
print("action space sample ", env.action_space.sample())

__ ACTION SPACE __ 

action space shape  4
action space sample  2


In [7]:
# action space
# 0: Do nothing
# 1: fire left orientation engine
# 2: fire main engine
# 3: fire right orientation engine
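
The four action indices can also be wrapped in an enumeration so that logs are readable. `LanderAction` and `describe` are hypothetical names used for illustration; the index-to-meaning mapping is the one listed above.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Hypothetical names for the four discrete action indices."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


def describe(action_index):
    """Return a readable label for an integer action."""
    return LanderAction(action_index).name.replace("_", " ").lower()


print(describe(2))                    # fire main engine
print([describe(a) for a in (3, 0)])  # ['fire right engine', 'do nothing']
```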

**Vectorized environment**

We will use a vectorized environment, which stacks 16 independent environments into a single one so that the agent gathers more diverse experience during training.

In [8]:
env = make_vec_env('LunarLander-v2', n_envs=16)


**We will use Proximal Policy Optimization (PPO), a state-of-the-art algorithm in Deep Reinforcement Learning.**

In [10]:
# Stable-Baselines3 is easy to set up:
# 1. create an environment
# 2. define the model
# 3. train the agent with model.learn()

In [9]:
# MlpPolicy --> multilayer perceptron

In [11]:
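# Create the environment
# Note: this rebinds `env` to a single environment, so the 16-environment vectorized env created above is not used for training
env = gym.make('LunarLander-v2')

# Instantiate the agent
model = PPO('MlpPolicy', env, verbose=1)

# Train the agent
model.learn(total_timesteps=int(2e5))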

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 89.7     |
|    ep_rew_mean     | -186     |
| time/              |          |
|    fps             | 456      |
|    iterations      | 1        |
|    time_elapsed    | 4        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 94.3         |
|    ep_rew_mean          | -199         |
| time/                   |              |
|    fps                  | 435          |
|    iterations           | 2            |
|    time_elapsed         | 9            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0069944286 |
|    clip_fraction        | 0.0236       |
|    clip_range           | 0.2          |
|    e

<stable_baselines3.ppo.ppo.PPO at 0x7ff0daf240a0>
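
The log above shows `total_timesteps` growing by 2048 per iteration (2048 after iteration 1, 4096 after iteration 2), which is consistent with one environment collecting 2048 steps per rollout. As a rough sketch, the number of PPO iterations for a given budget can be estimated as below. `ppo_iterations` is a hypothetical helper, `n_steps=2048` is inferred from the log, and the 16-environment case is a comparison that this notebook did not actually run.

```python
import math


def ppo_iterations(total_timesteps, n_steps=2048, n_envs=1):
    """Estimate rollout/update iterations: each one collects n_steps * n_envs transitions."""
    steps_per_iteration = n_steps * n_envs
    return math.ceil(total_timesteps / steps_per_iteration)


# Single environment, as in the training run above
print(ppo_iterations(int(2e5)))             # 98
# Hypothetical: the same budget spread over 16 parallel environments
print(ppo_iterations(int(2e5), n_envs=16))  # 7
```

More environments per rollout means fewer, larger updates for the same number of timesteps.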

In [12]:
model.learn(total_timesteps=1000000)
model_name = 'ppo-LunarLander-v2'
model.save(model_name)



The last 5000 lines of the streaming output were truncated.
|    value_loss           | 180          |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 309         |
|    ep_rew_mean          | 200         |
| time/                   |             |
|    fps                  | 476         |
|    iterations           | 252         |
|    time_elapsed         | 1082        |
|    total_timesteps      | 516096      |
| train/                  |             |
|    approx_kl            | 0.005059529 |
|    clip_fraction        | 0.0428      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.526      |
|    explained_variance   | 0.661       |
|    learning_rate        | 0.0003      |
|    loss                 | 48          |
|    n_updates            | 3490        |
|    policy_gradient_loss | -0.0018     |
|    value_loss           | 115    

In [13]:
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")



mean_reward=251.91 +/- 69.73277229144914
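
`evaluate_policy` reports the mean and standard deviation of the per-episode returns. The sketch below reproduces that kind of summary with the standard library, using the population standard deviation (which is what NumPy's `std` computes by default). `summarize_returns` is a hypothetical helper, the episode returns are made up, and the 200-point threshold is the one this notebook uses to judge the agent.

```python
from statistics import fmean, pstdev


def summarize_returns(episode_returns, threshold=200.0):
    """Return (mean, population std, mean >= threshold) for a list of episode returns."""
    mean = fmean(episode_returns)
    std = pstdev(episode_returns)
    return mean, std, mean >= threshold


# Made-up returns for illustration only, not results from this notebook
returns = [250.0, 270.0, 230.0, 250.0]
mean, std, solved = summarize_returns(returns)
print(f"mean_reward={mean:.2f} +/- {std:.2f}, solved={solved}")
# mean_reward=250.00 +/- 14.14, solved=True
```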


In [14]:
# The mean reward is above 200, so our lander is ready to land on the moon
# Push my trained model to the Hub
notebook_login()
!git config --global credential.helper store


In [16]:
import gymnasium as gym

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env

from huggingface_sb3 import package_to_hub

# Variables used by package_to_hub
# Define the name of the environment
env_id = "LunarLander-v2"

# Define the model architecture we used
model_architecture = "PPO"

## Define a repo_id
## repo_id is the id of the model repository on the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
## CHANGE WITH YOUR REPO ID
repo_id = "Mohamedabokahf/LunarLander" # Change to your own repo id; you can't push to mine 😄

## Define the commit message
commit_message = "Upload PPO LunarLander-v2 trained agent"

# Create the evaluation env and set the render_mode="rgb_array"
eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])

# Package the model and push it to the Hub
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation environment
               repo_id=repo_id, # id of the model repository on the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
               commit_message=commit_message)

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.




Saving video to /tmp/tmpznnwkf7p/-step-0-to-step-1000.mp4
Moviepy - Building video /tmp/tmpznnwkf7p/-step-0-to-step-1000.mp4.
Moviepy - Writing video /tmp/tmpznnwkf7p/-step-0-to-step-1000.mp4





Moviepy - Done !
Moviepy - video ready /tmp/tmpznnwkf7p/-step-0-to-step-1000.mp4
ℹ Pushing repo Mohamedabokahf/LunarLander to the Hugging Face Hub


policy.optimizer.pth:   0%|          | 0.00/88.4k [00:00<?, ?B/s]

policy.pth:   0%|          | 0.00/43.8k [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

ppo-LunarLander-v2.zip:   0%|          | 0.00/147k [00:00<?, ?B/s]

pytorch_variables.pth:   0%|          | 0.00/864 [00:00<?, ?B/s]

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/Mohamedabokahf/LunarLander/tree/main/


CommitInfo(commit_url='https://huggingface.co/Mohamedabokahf/LunarLander/commit/7e5fccc9402096cdd1106e7e17b8c1011ec80841', commit_message='Upload PPO LunarLander-v2 trained agent', commit_description='', oid='7e5fccc9402096cdd1106e7e17b8c1011ec80841', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# Install shimmy before loading my model from the Hub
!pip install shimmy

In [18]:
from huggingface_sb3 import load_from_hub
repo_id = "Mohamedabokahf/LunarLander"
filename = 'ppo-LunarLander-v2.zip'
custom_objects = {
    "learning_rate": 0.0,
    "lr_schedule": lambda _: 0.0,
    "clip_range": lambda _: 0.0,
}
checkpoint = load_from_hub(repo_id, filename)
model = PPO.load(checkpoint, custom_objects=custom_objects, print_system_info=True)



ppo-LunarLander-v2.zip:   0%|          | 0.00/147k [00:00<?, ?B/s]

== CURRENT SYSTEM INFO ==
- OS: Linux-6.1.58+-x86_64-with-glibc2.35 # 1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023
- Python: 3.10.12
- Stable-Baselines3: 2.0.0a5
- PyTorch: 2.1.0+cu121
- GPU Enabled: True
- Numpy: 1.23.5
- Cloudpickle: 2.2.1
- Gymnasium: 0.28.1
- OpenAI Gym: 0.25.2

== SAVED MODEL SYSTEM INFO ==
- OS: Linux-6.1.58+-x86_64-with-glibc2.35 # 1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023
- Python: 3.10.12
- Stable-Baselines3: 2.0.0a5
- PyTorch: 2.1.0+cu121
- GPU Enabled: True
- Numpy: 1.23.5
- Cloudpickle: 2.2.1
- Gymnasium: 0.28.1
- OpenAI Gym: 0.25.2
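
The `custom_objects` dictionary passed to `PPO.load` replaces the stored learning-rate and clip-range schedules with constants, which should not matter for evaluation because no further training happens. In Stable-Baselines3 a schedule is a callable that receives the remaining training progress (1.0 at the start, 0.0 at the end). The helpers below are a hypothetical sketch of such callables, not functions from the library.

```python
def constant_schedule(value):
    """Schedule that ignores progress and always returns `value`."""
    def schedule(progress_remaining):
        return value
    return schedule


def linear_schedule(initial_value):
    """Schedule that decays linearly from `initial_value` to 0."""
    def schedule(progress_remaining):
        return progress_remaining * initial_value
    return schedule


frozen = constant_schedule(0.0)     # same behavior as `lambda _: 0.0`
decaying = linear_schedule(0.0003)  # 0.0003 is the learning rate shown in the training log

print(frozen(1.0), frozen(0.25))     # 0.0 0.0
print(decaying(1.0), decaying(0.0))  # 0.0003 0.0
```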



In [19]:
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=269.14 +/- 16.560008768458562
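
The two evaluations in this notebook reported 251.91 +/- 69.73 and 269.14 +/- 16.56 over 10 episodes each. With so few episodes the mean itself is noisy; one rough way to gauge that is the standard error of the mean, std / sqrt(n). This is a back-of-the-envelope sketch that assumes independent episodes, and `standard_error` is a hypothetical helper.

```python
import math


def standard_error(std, n_episodes):
    """Standard error of the mean for n independent episodes."""
    return std / math.sqrt(n_episodes)


# (mean, std) pairs copied from the two evaluation outputs in this notebook
evaluations = {
    "after training": (251.91, 69.73),
    "after reloading": (269.14, 16.56),
}

for label, (mean, std) in evaluations.items():
    se = standard_error(std, n_episodes=10)
    print(f"{label}: {mean:.2f} +/- {se:.2f} (standard error)")
```

Increasing `n_eval_episodes` shrinks this uncertainty and makes two evaluation runs easier to compare.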
