<a href="https://www.kaggle.com/code/chhelp/unit-2-q-learning-with-frozenlake-v1-and-taxi-v?scriptVersionId=143490188" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Training a simple model with Q-learning

* gymnasium: provides the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments
* pygame: used for the FrozenLake-v1 and Taxi-v3 UI
* numpy: used to handle our Q-table

In [None]:
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt

In [None]:
!sudo apt-get update
!sudo apt-get install -y python3-opengl
!apt install ffmpeg xvfb
!pip3 install pyvirtualdisplay

In [None]:
# Create a virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

In [None]:
import numpy as np
import gymnasium as gym
import random # random number generation
import imageio # writes the replay video

import pickle5 as pickle
from tqdm.notebook import tqdm

# Training on Frozen Lake ⛄

We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H)**.

We can have two sizes of environment:

- `map_name="4x4"`: a 4x4 grid version
- `map_name="8x8"`: an 8x8 grid version


The environment has two modes:

- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).
- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic).
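To make the stochastic mode concrete, here is a minimal sketch of how slipping could be simulated. It assumes Gymnasium's documented default behavior, where the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3. `slippery_action` is a hypothetical helper written for this illustration, not part of the Gymnasium API.

```python
import random

# Action codes used by FrozenLake: 0=left, 1=down, 2=right, 3=up.
def slippery_action(action, rng):
    """Return the action actually executed on slippery ice.

    Assumption: equal 1/3 chance for the intended action and for each
    of the two perpendicular actions (never the opposite direction).
    """
    return rng.choice([(action - 1) % 4, action, (action + 1) % 4])


rng = random.Random(0)
executed = [slippery_action(2, rng) for _ in range(1000)]  # intended: right
# Only down (1), right (2) and up (3) can occur; left (0) never does.
print(sorted(set(executed)))
```

With `is_slippery=False`, as used in this notebook, the executed action is always the intended one.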

In [None]:
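# Use the 4x4 map
# Make the lake non-slippery so the agent always moves in the intended direction
# render_mode sets how the environment is visualized; "rgb_array" returns a single frame of the current state as an np.ndarray of shape (x, y, 3) holding the RGB values of an x*y pixel image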
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="rgb_array")

* You can also define a custom grid


```python
desc=["SFFF", "FHFH", "FFFH", "HFFG"]
gym.make('FrozenLake-v1', desc=desc, is_slippery=True)
```

* Here I use the default environment
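When writing a custom `desc`, it is easy to wall off the goal with holes by accident. The sketch below is one way to check a grid before training: a breadth-first search from `S` to `G` over non-hole tiles. `goal_reachable` is a hypothetical helper for this notebook and works on the plain list of strings, without touching Gymnasium.

```python
from collections import deque


def goal_reachable(desc):
    """Breadth-first search from S to G over non-hole tiles."""
    n_rows, n_cols = len(desc), len(desc[0])
    start = next((r, c) for r in range(n_rows) for c in range(n_cols)
                 if desc[r][c] == "S")
    queue, seen = deque([start]), {start}
    while queue:
        row, col = queue.popleft()
        if desc[row][col] == "G":
            return True
        for d_row, d_col in ((0, -1), (1, 0), (0, 1), (-1, 0)):
            nxt = (row + d_row, col + d_col)
            in_bounds = 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols
            if in_bounds and nxt not in seen and desc[nxt[0]][nxt[1]] != "H":
                seen.add(nxt)
                queue.append(nxt)
    return False


print(goal_reachable(["SFFF", "FHFH", "FFFH", "HFFG"]))  # True
print(goal_reachable(["SH", "HG"]))                      # False
```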

In [None]:
# Inspect the environment we created
print("_____OBSERVATION SPACE_____")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample()) # sample a random state

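* In the output, the observation space is `Discrete(16)`: each observation is an integer giving the agent's current position, computed as `current_row * ncols + current_col` (rows and columns both start at 0).
* In the 4x4 map, the goal position is $3 * 4 + 3 = 15$. The number of possible observations depends on the map size; the 4x4 map has 16 possible observations.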
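The index formula above can be written as a pair of small helpers. This is only an illustrative sketch; `to_state` and `to_row_col` are hypothetical names, and `ncols=4` assumes the 4x4 map.

```python
def to_state(row, col, ncols=4):
    """Flatten a (row, col) grid position into a single state index."""
    return row * ncols + col


def to_row_col(state, ncols=4):
    """Recover the (row, col) grid position from a state index."""
    return divmod(state, ncols)


print(to_state(3, 3))   # 15, the goal tile
print(to_row_col(9))    # (2, 1)
```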

In [None]:
# Action space
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # sample a random action

* Discrete action space with 4 actions available 🎮:

    - 0: GO LEFT
    - 1: GO DOWN
    - 2: GO RIGHT
    - 3: GO UP

* Reward function 💰:

    - Reach goal: +1
    - Reach hole: 0
    - Reach frozen: 0
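To see how the actions and rewards fit together, here is a rough pure-Python model of one non-slippery step on the default 4x4 map. It is a sketch for building intuition, not Gymnasium's implementation: `step` is a hypothetical function, and it assumes that moving into a wall leaves the agent in place and that both holes and the goal end the episode.

```python
# Default 4x4 FrozenLake layout (S=start, F=frozen, H=hole, G=goal).
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N = 4
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up


def step(state, action):
    """Deterministic transition: returns (new_state, reward, terminated)."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N - 1)  # clamp: walls keep the agent in place
    col = min(max(col + d_col, 0), N - 1)
    tile = GRID[row][col]
    reward = 1.0 if tile == "G" else 0.0   # only reaching the goal pays
    terminated = tile in "HG"              # holes and the goal end the episode
    return row * N + col, reward, terminated


print(step(0, 2))    # (1, 0.0, False): moved right onto a frozen tile
print(step(4, 2))    # (5, 0.0, True): fell into a hole
print(step(14, 2))   # (15, 1.0, True): reached the goal
```

Because a hole and a frozen tile both give 0, the only learning signal is the +1 at the goal, which Q-learning has to propagate backwards through the table.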

In [None]:
state_space = env.observation_space.n # number of states
print("There are ", state_space, "possible states")

action_space = env.action_space.n
print("There are ", action_space, "possible actions")

In [None]:
# Create and initialize the Q-table
def initialize_q_table(state_space, action_space):
    Qtable = np.zeros((state_space, action_space)) # zero-initialized array of shape (state_space, action_space)
    return Qtable

In [None]:
Qtable_frozenlake = initialize_q_table(state_space, action_space)

In [None]:
# Greedy policy: always take the best-known action
def greedy_policy(Qtable, state):
    action = np.argmax(Qtable[state][:]) # action with the highest Q-value in the current state
    return action

The idea with epsilon-greedy:

- With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).

- With *probability ɛ*: we do **exploration** (trying a random action).
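One detail worth spelling out: a random exploratory action can coincide with the greedy one, so with `n` actions the greedy action is actually picked with probability `(1 - ɛ) + ɛ / n`. The simulation below is a small sketch of that arithmetic; the Q-values are made up and `epsilon_greedy` is a standalone stand-in using only the standard library.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Best-known action with probability 1 - epsilon, else a uniform random one."""
    if rng.random() > epsilon:
        return max(range(len(q_values)), key=q_values.__getitem__)
    return rng.randrange(len(q_values))


rng = random.Random(42)
q_values = [0.1, 0.9, 0.3, 0.2]   # made-up Q-values for one state; action 1 is greedy
epsilon = 0.3
n_trials = 100_000

picks = [epsilon_greedy(q_values, epsilon, rng) for _ in range(n_trials)]
greedy_share = picks.count(1) / n_trials
expected = (1 - epsilon) + epsilon / len(q_values)   # 0.7 + 0.075 = 0.775
print(f"observed {greedy_share:.3f}, expected {expected:.3f}")
```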

In [None]:
# Epsilon-greedy policy
# Balances exploration and exploitation
def epsilon_greedy_policy(Qtable, state, epsilon):
    # Draw a random number between 0 and 1
    random_num = random.uniform(0, 1)
    
    if random_num > epsilon: # exploitation: use what the Q-table already knows
        # Pick the action with the highest value in the current state
        action = greedy_policy(Qtable, state)
    else: # exploration: try a random action in case it turns out better
        action = env.action_space.sample()
        
    return action

In [None]:
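# Hyperparameters
# The agent must explore enough of the state space to learn a good approximation, so epsilon has to decay gradually
# If epsilon decays too fast (decay rate too high), the agent can get stuck (presumably in a local optimum)

# Training parameters
n_training_episodes = 10000 # total number of training episodes
learning_rate = 0.7 # learning rate

# Evaluation parameters
n_eval_episodes = 100 # total number of test episodes

# Environment parameters
env_id = "FrozenLake-v1" # name of the environment
max_steps = 99 # max steps per episode
gamma = 0.95 # discount rate
eval_seed = [] # evaluation seeds (left empty for FrozenLake; used for Taxi-v3)

# Exploration parameters
max_epsilon = 1.0 # exploration probability at start
min_epsilon = 0.05 # minimum exploration probability
decay_rate = 0.005 # exponential decay rate of the exploration probability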
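To get a feel for how quickly exploration fades with these values, the sketch below evaluates the exponential schedule `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)` at a few episodes. `epsilon_at` is a hypothetical helper; its defaults copy the hyperparameters above.

```python
import math


def epsilon_at(episode, min_epsilon=0.05, max_epsilon=1.0, decay_rate=0.005):
    """Exploration probability at a given episode under exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 100, 500, 1000, 5000):
    print(episode, round(epsilon_at(episode), 4))
```

With `decay_rate = 0.005`, epsilon is already within about 0.01 of its floor by episode 1000, so most of the 10000 training episodes are spent almost purely exploiting.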

In [None]:
# Training loop
# Note: learning_rate and gamma are read from the global hyperparameters
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
    for episode in tqdm(range(n_training_episodes)): # tqdm shows a progress bar for the loop
        # Decay epsilon, since we need less and less exploration
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        # Reset the environment
        state, info = env.reset()
        step = 0
        terminated = False
        truncated = False
        
        # Repeat for at most max_steps steps
        for step in range(max_steps):
            # Choose an action with the epsilon-greedy policy
            action = epsilon_greedy_policy(Qtable, state, epsilon)
            
            # Take the action
            new_state, reward, terminated, truncated, info = env.step(action)
            # Update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            Qtable[state][action] = Qtable[state][action] + learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action])
            
            # Stop if the episode terminated or was truncated
            if terminated or truncated:
                break
            
            # Move to the new state
            state = new_state
    return Qtable
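The update rule inside the loop is easier to follow with numbers. The sketch below applies it twice by hand, using the notebook's `learning_rate = 0.7` and `gamma = 0.95`; `q_update` is a hypothetical scalar version of the table update, and the two transitions are made-up illustrations.

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.7, gamma=0.95):
    """One Q-learning step: move Q(s, a) toward the TD target."""
    td_target = reward + gamma * max_q_next
    return q_sa + learning_rate * (td_target - q_sa)


# Stepping onto the goal for the first time: reward 1, and the goal
# state's own row in the table is still all zeros.
q_goal_step = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0)

# A later visit to the tile before it: no reward, but the best value
# of the next state is now q_goal_step.
q_previous_step = q_update(q_sa=0.0, reward=0.0, max_q_next=q_goal_step)

print(round(q_goal_step, 4), round(q_previous_step, 4))  # 0.7 0.4655
```

This is how the single +1 at the goal spreads backwards through the table, one tile per visit, discounted by `gamma` at each step.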

# Train the Q-learning agent on FrozenLake

In [None]:
Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)

In [None]:
# Print the learned Q-table
Qtable_frozenlake
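A raw Q-table is hard to read, so it can help to follow the greedy policy from the start tile and look at the route it takes. The sketch below does this end to end without Gymnasium, as a sanity check of the algorithm. Everything in it is an assumption made for illustration: the hand-written `step` model of the non-slippery default map, plain lists instead of NumPy, and a slower `decay_rate` of 0.0005 over 5000 episodes so that this small run explores long enough to find the goal.

```python
import math
import random

GRID = "SFFFFHFHFFFHHFFG"                   # default 4x4 map, flattened row by row
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # 0=left, 1=down, 2=right, 3=up


def step(state, action):
    """Non-slippery transition: returns (new_state, reward, terminated)."""
    row, col = divmod(state, 4)
    row = min(max(row + MOVES[action][0], 0), 3)
    col = min(max(col + MOVES[action][1], 0), 3)
    new_state = row * 4 + col
    return new_state, float(GRID[new_state] == "G"), GRID[new_state] in "HG"


def argmax(values):
    """Index of the first maximum, matching np.argmax's tie-breaking."""
    return max(range(len(values)), key=values.__getitem__)


def train(n_episodes=5000, lr=0.7, gamma=0.95, decay_rate=0.0005, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * 4 for _ in range(16)]
    for episode in range(n_episodes):
        epsilon = 0.05 + 0.95 * math.exp(-decay_rate * episode)
        state = 0
        for _ in range(99):
            explore = rng.random() <= epsilon
            action = rng.randrange(4) if explore else argmax(q[state])
            new_state, reward, terminated = step(state, action)
            q[state][action] += lr * (reward + gamma * max(q[new_state]) - q[state][action])
            if terminated:
                break
            state = new_state
    return q


def greedy_path(q, max_steps=99):
    """Follow the greedy policy from the start tile and list the visited states."""
    state, path = 0, [0]
    for _ in range(max_steps):
        state, _reward, terminated = step(state, argmax(q[state]))
        path.append(state)
        if terminated:
            break
    return path


q_table = train()
print(greedy_path(q_table))   # the route should end on the goal tile, state 15
```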

# Evaluation function

In [None]:
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
  """
  Evaluate the agent for ``n_eval_episodes`` episodes and returns average reward and std of reward.
  :param env: The evaluation environment
  :param max_steps: Maximum number of steps per episode
  :param n_eval_episodes: Number of episode to evaluate the agent
  :param Q: The Q-table
  :param seed: The evaluation seed array (for taxi-v3)
  """
  episode_rewards = []
  for episode in tqdm(range(n_eval_episodes)):
    if seed:
      state, info = env.reset(seed=seed[episode])
    else:
      state, info = env.reset()
    step = 0
    truncated = False
    terminated = False
    total_rewards_ep = 0

    for step in range(max_steps):
      # Take the action (index) that has the maximum expected future reward given that state
      action = greedy_policy(Q, state)
      new_state, reward, terminated, truncated, info = env.step(action)
      total_rewards_ep += reward

      if terminated or truncated:
        break
      state = new_state
    episode_rewards.append(total_rewards_ep)
  mean_reward = np.mean(episode_rewards)
  std_reward = np.std(episode_rewards)

  return mean_reward, std_reward
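As a small illustration of the two statistics the function returns, the sketch below computes them for a made-up list of episode rewards using only the standard library. `statistics.pstdev` is used because it is the population standard deviation, which is what `np.std` computes by default.

```python
import statistics

episode_rewards = [1.0, 1.0, 0.0, 1.0]   # made-up outcomes of four evaluation episodes

mean_reward = statistics.fmean(episode_rewards)
std_reward = statistics.pstdev(episode_rewards)   # population std, like np.std with ddof=0
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")   # Mean_reward=0.75 +/- 0.43
```

On the non-slippery FrozenLake, both the environment and the greedy policy are deterministic, so every evaluation episode plays out identically and the standard deviation is 0.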

In [None]:
# Evaluate our Agent
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

# Do not modify this code

In [None]:
from huggingface_hub import HfApi, snapshot_download
from huggingface_hub.repocard import metadata_eval_result, metadata_save

from pathlib import Path
import datetime
import json

In [None]:
def record_video(env, Qtable, out_directory, fps=1):
  """
  Generate a replay video of the agent
  :param env
  :param Qtable: Qtable of our agent
  :param out_directory
  :param fps: how many frame per seconds (with taxi-v3 and frozenlake-v1 we use 1)
  """
  images = []
  terminated = False
  truncated = False
  state, info = env.reset(seed=random.randint(0,500))
  img = env.render()
  images.append(img)
  while not (terminated or truncated):
    # Take the action (index) that has the maximum expected future reward given that state
    action = np.argmax(Qtable[state][:])
    state, reward, terminated, truncated, info = env.step(action) # We directly put next_state = state for recording logic
    img = env.render()
    images.append(img)
  imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)

In [None]:
def push_to_hub(
    repo_id, model, env, video_fps=1, local_repo_path="hub"
):
    """
    Evaluate, Generate a video and Upload a model to Hugging Face Hub.
    This method does the complete pipeline:
    - It evaluates the model
    - It generates the model card
    - It generates a replay video of the agent
    - It pushes everything to the Hub

    :param repo_id: id of the model repository from the Hugging Face Hub
    :param env
    :param video_fps: how many frame per seconds to record our video replay
    (with taxi-v3 and frozenlake-v1 we use 1)
    :param local_repo_path: where the local repository is
    """
    _, repo_name = repo_id.split("/")

    eval_env = env
    api = HfApi()

    # Step 1: Create the repo
    repo_url = api.create_repo(
        repo_id=repo_id,
        exist_ok=True,
    )

    # Step 2: Download files
    repo_local_path = Path(snapshot_download(repo_id=repo_id))

    # Step 3: Save the model
    if env.spec.kwargs.get("map_name"):
        model["map_name"] = env.spec.kwargs.get("map_name")
        if env.spec.kwargs.get("is_slippery", "") == False:
            model["slippery"] = False

    # Pickle the model
    with open((repo_local_path) / "q-learning.pkl", "wb") as f:
        pickle.dump(model, f)

    # Step 4: Evaluate the model and build JSON with evaluation metrics
    mean_reward, std_reward = evaluate_agent(
        eval_env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"]
    )

    evaluate_data = {
        "env_id": model["env_id"],
        "mean_reward": mean_reward,
        "n_eval_episodes": model["n_eval_episodes"],
        "eval_datetime": datetime.datetime.now().isoformat()
    }

    # Write a JSON file called "results.json" that will contain the
    # evaluation results
    with open(repo_local_path / "results.json", "w") as outfile:
        json.dump(evaluate_data, outfile)

    # Step 5: Create the model card
    env_name = model["env_id"]
    if env.spec.kwargs.get("map_name"):
        env_name += "-" + env.spec.kwargs.get("map_name")

    if env.spec.kwargs.get("is_slippery", "") == False:
        env_name += "-" + "no_slippery"

    metadata = {}
    metadata["tags"] = [env_name, "q-learning", "reinforcement-learning", "custom-implementation"]

    # Add metrics
    eval = metadata_eval_result(
        model_pretty_name=repo_name,
        task_pretty_name="reinforcement-learning",
        task_id="reinforcement-learning",
        metrics_pretty_name="mean_reward",
        metrics_id="mean_reward",
        metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
        dataset_pretty_name=env_name,
        dataset_id=env_name,
    )

    # Merges both dictionaries
    metadata = {**metadata, **eval}

    model_card = f"""
  # **Q-Learning** Agent playing **{env_id}**
  This is a trained model of a **Q-Learning** agent playing **{env_id}** .

  ## Usage

  ```python

  model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")

  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
  env = gym.make(model["env_id"])
  ```
  """

    evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])

    readme_path = repo_local_path / "README.md"
    readme = ""
    print(readme_path.exists())
    if readme_path.exists():
        with readme_path.open("r", encoding="utf8") as f:
            readme = f.read()
    else:
        readme = model_card

    with readme_path.open("w", encoding="utf-8") as f:
        f.write(readme)

    # Save our metrics to Readme metadata
    metadata_save(readme_path, metadata)

    # Step 6: Record a video
    video_path = repo_local_path / "replay.mp4"
    record_video(env, model["qtable"], video_path, video_fps)

    # Step 7. Push everything to the Hub
    api.upload_folder(
        repo_id=repo_id,
        folder_path=repo_local_path,
        path_in_repo=".",
    )

    print("Your model is pushed to the Hub. You can view your model here: ", repo_url)

# Upload to the Hugging Face Hub (change username and repo_name first)

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
model = {
    "env_id": env_id,
    "max_steps": max_steps,
    "n_training_episodes": n_training_episodes,
    "n_eval_episodes": n_eval_episodes,
    "eval_seed": eval_seed,

    "learning_rate": learning_rate,
    "gamma": gamma,

    "max_epsilon": max_epsilon,
    "min_epsilon": min_epsilon,
    "decay_rate": decay_rate,

    "qtable": Qtable_frozenlake
}

In [None]:
username = "Solitary12138" # FILL THIS
repo_name = "Frozen-Lake"
push_to_hub(
    repo_id=f"{username}/{repo_name}",
    model=model,
    env=env)

# Training on Taxi-v3 🚕

In [None]:
env = gym.make("Taxi-v3", render_mode="rgb_array")

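* The taxi has 25 possible positions, the passenger has 5 possible locations (including being inside the taxi), and there are 4 possible destinations, so there are 25 × 5 × 4 = 500 discrete states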
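The count of 500 follows from packing the four components into one integer. The sketch below mirrors the encoding described in the Gymnasium Taxi documentation, `((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination`; the standalone `encode` and `decode` functions here are written for illustration rather than taken from the environment object.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into one integer in [0, 500)."""
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + passenger_location   # 0-3: at a landmark, 4: in the taxi
    state = state * 4 + destination
    return state


def decode(state):
    """Unpack a state integer back into its four components."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


print(encode(3, 1, 2, 0))   # 328
print(decode(328))          # (3, 1, 2, 0)
print(encode(4, 4, 4, 3))   # 499, the largest state index
```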

In [None]:
state_space = env.observation_space.n
print("There are ", state_space, " possible states")

In [None]:
action_space = env.action_space.n
print("There are ", action_space, " possible actions")

The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:

- 0: move south
- 1: move north
- 2: move east
- 3: move west
- 4: pickup passenger
- 5: drop off passenger

Reward function 💰:

- -1 per step unless other reward is triggered.
- +20 delivering passenger.
- -10 executing “pickup” and “drop-off” actions illegally.
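To see what these rewards add up to, the sketch below scores a hypothetical episode of nine ordinary steps followed by a successful drop-off, both as a plain sum and as a discounted return with the notebook's `gamma = 0.95`. The reward sequence is made up and `discounted_return` is a helper introduced only for this example.

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of rewards where the reward at step k is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: 9 ordinary steps at -1 each, then +20 for the drop-off.
rewards = [-1] * 9 + [20]
print(sum(rewards))                           # 11
print(round(discounted_return(rewards), 2))
```

The per-step penalty is what pushes the agent toward short routes: every extra move lowers the total.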

In [None]:
# Create the Q-table (500 x 6)
Qtable_taxi = initialize_q_table(state_space, action_space)
print(Qtable_taxi)
print("Q-table shape: ", Qtable_taxi.shape)

In [None]:
# Hyperparameters
# Training parameters
n_training_episodes = 25000   # Total training episodes
learning_rate = 0.7           # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# DO NOT MODIFY EVAL_SEED
eval_seed = [16,54,165,177,191,191,120,80,149,178,48,38,6,125,174,73,50,172,100,148,146,6,25,40,68,148,49,167,9,97,164,176,61,7,54,55,
 161,131,184,51,170,12,120,113,95,126,51,98,36,135,54,82,45,95,89,59,95,124,9,113,58,85,51,134,121,169,105,21,30,11,50,65,12,43,82,145,152,97,106,55,31,85,38,
 112,102,168,123,97,21,83,158,26,80,63,5,81,32,11,28,148] # Evaluation seeds: these ensure that all classmates' agents are evaluated on the same taxi starting positions
                                                          # Each seed has a specific starting state

# Environment parameters
env_id = "Taxi-v3"           # Name of the environment
max_steps = 99               # Max steps per episode
gamma = 0.95                 # Discounting rate

# Exploration parameters
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.05           # Minimum exploration probability
decay_rate = 0.005            # Exponential decay rate for exploration prob

# Train the Q-learning agent on Taxi-v3

In [None]:
Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
Qtable_taxi

# Save the model to the Hugging Face Hub

In [None]:
model = {
    "env_id": env_id,
    "max_steps": max_steps,
    "n_training_episodes": n_training_episodes,
    "n_eval_episodes": n_eval_episodes,
    "eval_seed": eval_seed,

    "learning_rate": learning_rate,
    "gamma": gamma,

    "max_epsilon": max_epsilon,
    "min_epsilon": min_epsilon,
    "decay_rate": decay_rate,

    "qtable": Qtable_taxi
}

In [None]:
username = "Solitary12138" # FILL THIS
repo_name = "Taxi-V3" # FILL THIS
push_to_hub(
    repo_id=f"{username}/{repo_name}",
    model=model,
    env=env)