# Hugging Face: Landing on the Moon

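* Install dependencies. Several packages are installed:
    * gymnasium[box2d]: contains the LunarLander-v2 environment
    * stable-baselines3[extra]: the deep reinforcement learning library
    * huggingface_sb3: additional code for Stable-Baselines3 to load and upload models from and to the Hugging Face Hub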

In [None]:
!apt install swig cmake

In [None]:
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt

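* To generate a replay video in Colab, a virtual screen is needed to render the environment and record frames
* The following cells install the virtual screen libraries, then create and start a virtual screen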

In [None]:
!sudo apt-get update
!sudo apt-get install -y python3-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

* Import the huggingface_hub package so that trained models can be uploaded and downloaded

In [None]:
import gymnasium

from huggingface_sb3 import load_from_hub, package_to_hub
from huggingface_hub import notebook_login # Log in to the Hugging Face account to be able to upload models to the Hub

from stable_baselines3 import PPO # The PPO algorithm, used as-is
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

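* Gymnasium
    * Use gymnasium.make() to create the environment
    * Use observation, info = env.reset() to reset the environment to its initial state
* At each step
    * Get an action from our model; in this example we take a random action
    * Use env.step(action) to perform that action in the environment and get
        * observation: the new state ($S_{t+1}$)
        * reward: the reward obtained after performing the action
        * terminated: whether the episode has ended (the agent reached a terminal state)
        * truncated: introduced in newer versions of the API; it indicates that the episode was cut short by a time limit or because the agent went out of the environment's bounds
        * info: a dictionary of additional information (depends on the environment)
* If the episode is over
    * Use observation, info = env.reset() to reset the environment to its initial state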
In [None]:
# This is just an example of the steps above
import gymnasium as gym

# Create an environment named LunarLander-v2
env = gym.make("LunarLander-v2")

# Reset the environment
observation, info = env.reset()

for _ in range(20):
    # Take a random action
    action = env.action_space.sample()
    print("Action taken: ", action)

    # Perform this action in the environment and get next_state, reward, terminated, truncated, info
    observation, reward, terminated, truncated, info = env.step(action)

    # If the game is over (here: landed or crashed) or time has run out
    if terminated or truncated:
        # Reset the environment
        print("Environment is reset")
        observation, info = env.reset()

# Clean up the environment
env.close()
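
To make the difference between `terminated` and `truncated` concrete, here is a minimal sketch of a toy environment that follows the same `reset()`/`step()` return convention. `CountdownEnv`, its `max_steps` limit and its reward values are hypothetical and not part of Gymnasium; the sketch uses only the standard library so it runs without Box2D.

```python
class CountdownEnv:
    """Hypothetical toy environment mimicking the Gymnasium reset/step convention."""

    def __init__(self, start=3, max_steps=5):
        self.start = start          # initial height above the ground
        self.max_steps = max_steps  # time limit, in steps
        self.height = start
        self.steps = 0

    def reset(self):
        self.height = self.start
        self.steps = 0
        return self.height, {}      # observation, info

    def step(self, action):
        # action 1 moves one unit down, action 0 hovers in place
        self.height -= action
        self.steps += 1
        # terminated: the agent reached a terminal state (the ground)
        terminated = self.height <= 0
        # truncated: the episode was cut short by the time limit
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.height, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode; return how it ended and the total reward."""
    observation, info = env.reset()
    total_reward = 0.0
    while True:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            return ("terminated" if terminated else "truncated"), total_reward


print(run_episode(CountdownEnv(), lambda obs: 1))  # always descend
print(run_episode(CountdownEnv(), lambda obs: 0))  # always hover
```

Always descending ends the episode through `terminated`, while always hovering ends it through `truncated`; in both cases the loop resets in the same way.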

# Training the agent (the lunar lander) to land correctly on the Moon

In [None]:
# Create the environment
env = gym.make("LunarLander-v2")
env.reset()

print("_____OBSERVATION SPACE_____ \n")
print("Observation Space Shape", env.observation_space.shape)
print("Sample observation", env.observation_space.sample()) # Get a random observation

* The observation space is a vector of size 8; each value carries different information
    - Horizontal pad coordinate (x)
    - Vertical pad coordinate (y)
    - Horizontal speed (x)
    - Vertical speed (y)
    - Angle
    - Angular speed
    - If the left leg contact point has touched the land (boolean)
    - If the right leg contact point has touched the land (boolean)
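
As a reading aid, the eight values can be given names. The sketch below assumes the vector is ordered exactly as in the list above; `LanderState`, its field names and the sample values are hypothetical and not part of Gymnasium.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Hypothetical named view of the 8-value observation vector."""
    x: float                  # horizontal pad coordinate
    y: float                  # vertical pad coordinate
    vx: float                 # horizontal speed
    vy: float                 # vertical speed
    angle: float
    angular_speed: float
    left_leg_contact: bool    # left leg has touched the land
    right_leg_contact: bool   # right leg has touched the land


def unpack_observation(observation: Sequence[float]) -> LanderState:
    """Convert a length-8 observation into a LanderState."""
    if len(observation) != 8:
        raise ValueError(f"expected 8 values, got {len(observation)}")
    x, y, vx, vy, angle, angular_speed, left, right = (float(v) for v in observation)
    return LanderState(x, y, vx, vy, angle, angular_speed, bool(left), bool(right))


# Made-up observation, for illustration only
state = unpack_observation([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0])
print(state.vy, state.left_leg_contact, state.right_leg_contact)
```

Because the function only needs a sequence of eight numbers, it should also accept the NumPy array returned by `env.reset()` or `env.step()`.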

In [None]:
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Get a random action

* The action space (the set of actions the agent can take) is discrete, with four available actions
    - Action 0: Do nothing,
    - Action 1: Fire left orientation engine,
    - Action 2: Fire the main engine,
    - Action 3: Fire right orientation engine.
* Reward at each step
    - Is increased/decreased the closer/further the lander is to the landing pad.
    - Is increased/decreased the slower/faster the lander is moving.
    - Is decreased the more the lander is tilted (angle not horizontal).
    - Is increased by 10 points for each leg that is in contact with the ground.
    - Is decreased by 0.03 points each frame a side engine is firing.
    - Is decreased by 0.3 points each frame the main engine is firing.
* For each episode, the lander receives an additional reward of -100 for crashing or +100 for landing safely
* An episode is considered a solution if it scores at least 200 points
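
The engine costs and the 200-point threshold are simple enough to work through by hand. The sketch below covers only the per-frame engine penalties and the threshold check; the distance, speed, tilt and leg-contact terms depend on the simulator state and are left out. The function names and the action sequence are made up for illustration.

```python
SIDE_ENGINE_COST = 0.03   # points lost per frame a side engine fires (actions 1 and 3)
MAIN_ENGINE_COST = 0.3    # points lost per frame the main engine fires (action 2)
SOLVED_THRESHOLD = 200    # an episode scoring at least this counts as a solution


def engine_penalty(actions):
    """Total reward lost to engine use over a sequence of discrete actions."""
    side_frames = sum(1 for a in actions if a in (1, 3))
    main_frames = sum(1 for a in actions if a == 2)
    return -(side_frames * SIDE_ENGINE_COST + main_frames * MAIN_ENGINE_COST)


def is_solution(episode_score):
    """Whether an episode score meets the threshold."""
    return episode_score >= SOLVED_THRESHOLD


# Made-up action sequence: 2 side-engine frames and 3 main-engine frames
actions = [0, 2, 2, 1, 3, 2, 0]
print(round(engine_penalty(actions), 2))   # 2 * 0.03 + 3 * 0.3 = 0.96 points lost
print(is_solution(215.0), is_solution(199.9))
```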

In [None]:
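# Vectorized environment
# Create a vectorized environment made of 16 environments (a way of stacking multiple independent environments into a single environment)
# so that training sees more diverse experiences
# Create the environment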
env = make_vec_env('LunarLander-v2', n_envs=16)
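
The idea of stacking independent environments can be sketched without any library. The classes below are hypothetical stand-ins, not SB3's implementation, and they use a simplified `(observation, reward, done)` return instead of the full Gymnasium tuple. The sketch assumes that a finished sub-environment is reset immediately so the batch always keeps the same size.

```python
class CounterEnv:
    """Hypothetical toy environment: an episode lasts `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, 1.0, done


class TinyVecEnv:
    """Steps several independent environments in lockstep (illustrative only)."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                obs = env.reset()  # restart the finished sub-environment right away
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


vec_env = TinyVecEnv([CounterEnv(length=n) for n in (2, 3, 4)])
print(vec_env.reset())
for _ in range(3):
    print(vec_env.step([0, 0, 0]))
```

Each call to `step` takes one action per sub-environment and returns one observation, reward and done flag per sub-environment, so a single call yields several transitions at once.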

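* Use our first deep reinforcement learning library, Stable Baselines3 (SB3)
* SB3 is a set of reinforcement learning algorithm implementations built on PyTorch
* The code here uses SB3's PPO algorithm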

In [None]:
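# Create the environment (a single environment; this replaces the 16-environment vectorized env created above)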
env = gym.make("LunarLander-v2")

'''
# Define the agent to use and instantiate the model
model = PPO('MlpPolicy', env, verbose=1)
# Train the model for the given number of timesteps
model.learn(total_timesteps=int(2e5))
'''

# Set a few hyperparameters to speed up training
model = PPO(
    policy = 'MlpPolicy',
    env = env,
    n_steps = 1024,
    batch_size = 64,
    n_epochs = 4,
    gamma = 0.999,
    gae_lambda = 0.98,
    ent_coef = 0.01,
    verbose=1)

# Train the agent for 1,000,000 timesteps
model.learn(total_timesteps=1000000)
# Save the model
model_name = "ppo-LunarLander-v2"
model.save(model_name)
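
The hyperparameters above determine how much data each PPO update sees. The arithmetic below is a rough sketch, assuming the usual SB3 behaviour: each update collects `n_steps` transitions per environment, then makes `n_epochs` passes over that rollout in minibatches of `batch_size`, and training continues until at least `total_timesteps` transitions have been collected. `ppo_update_budget` is a hypothetical helper, not an SB3 function.

```python
import math


def ppo_update_budget(n_envs, n_steps, batch_size, n_epochs, total_timesteps):
    """Back-of-the-envelope sizes for a PPO run (illustrative, not an SB3 API)."""
    rollout_size = n_envs * n_steps                      # transitions collected per update
    minibatches_per_epoch = rollout_size // batch_size   # minibatches in one pass
    gradient_steps_per_update = minibatches_per_epoch * n_epochs
    num_updates = math.ceil(total_timesteps / rollout_size)
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches_per_epoch,
        "gradient_steps_per_update": gradient_steps_per_update,
        "num_updates": num_updates,
    }


# Values from the training cell, which uses a single gym.make environment
single = ppo_update_budget(n_envs=1, n_steps=1024, batch_size=64,
                           n_epochs=4, total_timesteps=1_000_000)
print(single)

# Same hyperparameters with the 16-environment vectorized env, for comparison
vectorized = ppo_update_budget(n_envs=16, n_steps=1024, batch_size=64,
                               n_epochs=4, total_timesteps=1_000_000)
print(vectorized)
```

Under these assumptions a single environment gives 1024 transitions and 64 gradient steps per update, over 977 updates; 16 environments give 16384 transitions per update and reach the same timestep budget in 62 updates.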

* Evaluate the trained agent
* Stable-Baselines3 provides a method for this: evaluate_policy

In [None]:
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")
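
The two numbers printed above summarise the per-episode rewards. The sketch below shows that summary on made-up rewards, assuming the standard deviation is the population standard deviation; `summarize_rewards` is a hypothetical helper, not part of SB3.

```python
from statistics import fmean, pstdev


def summarize_rewards(episode_rewards):
    """Mean and population standard deviation of per-episode total rewards."""
    return fmean(episode_rewards), pstdev(episode_rewards)


# Made-up episode rewards, for illustration only
episode_rewards = [250.0, 180.0, 265.0, 240.0, 210.0]
mean_reward, std_reward = summarize_rewards(episode_rewards)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

# Compare the mean against the 200-point threshold mentioned earlier
print("mean meets the 200-point threshold:", mean_reward >= 200)
```

A high mean with a large standard deviation still means some episodes fall well short, so both numbers are worth reading together.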

* Publish to the Hub

In [None]:
notebook_login()
!git config --global credential.helper store

Let's fill the `package_to_hub` function:
- `model`: our trained model.
- `model_name`: the name of the trained model, as passed to `model.save`
- `model_architecture`: the model architecture we used, in our case PPO
- `env_id`: the name of the environment, in our case `LunarLander-v2`
- `eval_env`: the evaluation environment defined in eval_env
- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `(repo_id = {username}/{repo_name})`

💡 **A good name is {username}/{model_architecture}-{env_id}**

- `commit_message`: message of the commit

In [None]:
import gymnasium as gym

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env

from huggingface_sb3 import package_to_hub

# Variables passed to package_to_hub
# Define the name of the environment
env_id = "LunarLander-v2"

# Define the model architecture we used
model_architecture = "PPO"

## Define a repo_id
## repo_id is the id of the model repository on the Hugging Face Hub (repo_id = {organization}/{repo_name}), for instance ThomasSimonini/ppo-LunarLander-v2
## CHANGE WITH YOUR REPO ID
repo_id = "Solitary12138/Load_Moon" # Change with your repo id, you can't push with mine 😄

## Define the commit message
commit_message = "Upload PPO LunarLander-v2 trained agent"

# Create the evaluation env and set the render_mode="rgb_array"
eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])

# Package the trained model and push it to the Hub
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation Environment
               repo_id=repo_id, # id of the model repository on the Hugging Face Hub (repo_id = {organization}/{repo_name}), for instance ThomasSimonini/ppo-LunarLander-v2
               commit_message=commit_message)