# Value functions

Input: the current state, or a state-action pair.

Output: the value of that state or state-action pair.

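## Agent policy

Once we have a value function, we know the value of each action in each situation. The agent's goal is to maximize the total value over the whole episode, and several algorithms derive a policy directly from the value function.

This differs from policy-based methods, which learn the policy itself: a mapping from state to action, often a neural network. In value-based methods the policy is not learned; it is a fixed rule applied to the value function.

1. Greedy policy: always pick the action with the highest current value. This does not guarantee a globally optimal result.
2. ε-greedy policy: a purely greedy agent may fail to learn, because it never explores new possibilities. So we introduce ε: before each action, draw a random number; if it is less than ε, take a random action, otherwise take the currently best action.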

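## Types of value function

There are two:

1. State-value function: takes only the environment state as input.
2. Action-value function: takes the environment state and an action as input.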

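## Computing the value function

The Bellman equation simplifies value estimation.

Computing a value directly means accounting for everything that happens until the end of the episode, which is expensive. The Bellman equation avoids this. Take a maze game in which every step costs 1 (a reward of -1).

Suppose every step that does not reach the exit gives a reward of -1, and we use the state-value function, whose input is the current state. Say the agent stands at a junction where it can go down or right. Evaluating the route to the right: right (-1), right (-1), down (-1), down (-1), then the exit, so the value of the current state is -4.

Evaluating the state below: up (-1), right, right, right, down, down, then the exit, so its value is -6.

The Bellman equation works like dynamic programming: the value of moving right is -1 plus the value of the state reached by moving right.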
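
In symbols, the idea can be sketched as follows. The notation is standard but is an assumption here, since the notes do not define it: $\gamma$ is the discount factor (taken as 1 in the maze example) and $s_{\text{right}}$ is a hypothetical name for the state reached by moving right, assumed to be three steps from the exit as in the right-hand route.

$$
V(s) = \mathbb{E}\left[ R_{t+1} + \gamma \, V(S_{t+1}) \mid S_t = s \right]
$$

$$
V(s_{\text{junction}}) = -1 + V(s_{\text{right}}) = -1 + (-3) = -4
$$

The second line reuses an already known value for the neighbouring state instead of summing every reward along the route again.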

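## How value functions are learned

### Monte Carlo: learning at the end of the episode

Example: a mouse looking for food can take many different routes. At the start of the episode the mouse is in state St, and it plays until the episode ends.
We then compute the mouse's total return Gt for the episode and use it to update the value of state St: `V(St) ← V(St) + alpha * [Gt - V(St)]`. V(St) is updated gradually in this way; alpha is the learning rate.

Initially the value function V(St) returns 0 for every state, but as more episodes are played, the learned values become increasingly accurate.

The maze game works the same way: start from the initial state, move randomly, and obtain a return; the more episodes are played, the more accurate the values become.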
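
The Monte Carlo update can be sketched in a few lines. This is a minimal illustration, not the notebook's training code: the function names `discounted_return` and `mc_update`, the reward list, and `alpha=0.5` are all assumptions chosen for the example.

```python
def discounted_return(rewards, gamma=1.0):
    """Return Gt for the first step, given the rewards Rt+1, Rt+2, ... of one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mc_update(value, g, alpha):
    """One Monte Carlo update: move V(St) a fraction alpha towards the observed return."""
    return value + alpha * (g - value)


# Hypothetical episode: four maze steps, each with reward -1.
rewards = [-1, -1, -1, -1]
g = discounted_return(rewards)

v = 0.0  # V(St) starts at 0
for _ in range(3):  # replay the same episode three times
    v = mc_update(v, g, alpha=0.5)
    print(v)  # -2.0, then -3.0, then -3.5
```

Each pass closes half of the remaining gap to the return of -4, which is the gradual update described above.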

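### Temporal-difference learning: learning at every step

`V(St) ← V(St) + alpha * [Rt+1 + gamma * V(St+1) - V(St)]`

Moving from state St to state St+1 yields the reward Rt+1. gamma is the discount factor, for example 0.99.

Initially V(St) and V(St+1) are both 0; after many steps they approach the true values.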
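
A single temporal-difference update might look like this. It is a sketch under assumptions: the function name `td0_update`, the two-state dictionary, and the values `alpha=0.1` and `reward=1.0` are made up for illustration.

```python
def td0_update(v_s, reward, v_next, alpha=0.1, gamma=0.99):
    """One TD(0) update of V(St) from a single transition St -> St+1."""
    td_target = reward + gamma * v_next  # Rt+1 + gamma * V(St+1)
    return v_s + alpha * (td_target - v_s)


# Hypothetical value table with two states, both initialized to 0.
V = {"s0": 0.0, "s1": 0.0}

# The agent moves from s0 to s1 and receives a reward of 1.
V["s0"] = td0_update(V["s0"], reward=1.0, v_next=V["s1"])
print(V["s0"])  # 0.1
```

Unlike the Monte Carlo update, this needs only one transition, so learning can happen before the episode ends.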



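# Q-learning

Q-learning trains a value function with the temporal-difference method, and it is off-policy.

We use Q-learning to train a Q-function, which takes a state and an action as input and outputs a value. Q stands for Quality: the quality of the action in that state.

Under the hood, a Q-table stores the value of every state-action pair, with one column per action and one row per state. Q-learning trains this Q-table towards the optimal table.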


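Pseudocode:

    1. Fix the number of episodes, say 100, meaning the game is played 100 times
    2. In each episode, loop:
        1. Given the current state and the available actions, look up the Q-table (initially all zeros) and choose an action with the ε-greedy policy
        2. After the action, the environment returns the reward Rt+1 and moves to the new state St+1
        3. Update the Q-table: Q(St,At) = Q(St,At) + alpha * ( Rt+1 + gamma * max_a Q(St+1,a) - Q(St,At) )   # max_a Q(St+1,a): highest value over the possible actions

    Note: ε changes over time. A random action is tried when the random draw falls below ε. ε starts at 1.0 and shrinks as the Q-table improves, so random actions become rarer.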

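### Off-policy

When choosing the next action, we use the ε-greedy policy, which decides between a random action and a Q-table lookup.

After acting and observing St+1 and Rt+1, we update the Q-table with the greedy policy, that is, `max_a Q(St+1,a)`, the highest value over the possible actions. We also use the greedy policy at inference time.

Off-policy algorithm: the policy used to act during training (ε-greedy) differs from the policy used in the update and at inference (greedy).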
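
The split between the two policies can be made concrete with one update step. This is a minimal sketch, assuming a made-up table with 2 states and 2 actions; the names `behavior_action` and `q_learning_target` and the values of `alpha` and `gamma` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def behavior_action(Q, state, epsilon):
    """Behavior policy (ε-greedy): picks the action that is actually taken."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore
    return int(np.argmax(Q[state]))           # exploit


def q_learning_target(Q, reward, next_state, gamma):
    """Target policy (greedy): the update assumes the best next action."""
    return reward + gamma * np.max(Q[next_state])


Q = np.zeros((2, 2))   # hypothetical table: 2 states, 2 actions
Q[1] = [0.2, 0.8]      # pretend state 1 already has learned values
alpha, gamma = 0.5, 0.9

state, next_state, reward = 0, 1, 0.0
action = behavior_action(Q, state, epsilon=1.0)  # fully random at the start
target = q_learning_target(Q, reward, next_state, gamma)
Q[state, action] += alpha * (target - Q[state, action])
print(action, Q[state, action])  # the updated entry is about 0.36
```

Whichever action the random draw selects, the target uses 0.8, the best value in the next state, rather than the value of the action the ε-greedy policy would take next.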

## Setting up the environment

In [55]:
# Start a virtual display, because we will record a replay of the game
# Note: on macOS you need to install XQuartz, which runs X11-based programs and libraries
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x110be7a90>

In [56]:
import numpy as np
import gymnasium as gym
import random
import imageio
import os
import pickle  # the standard-library module replaces the pickle5 backport on Python 3.8+
from tqdm.notebook import tqdm

In [57]:
# Create the environment with render_mode='rgb_array': render() returns a NumPy array of shape (x, y, 3)

env = gym.make("FrozenLake-v1",desc=None,map_name="4x4",is_slippery=False,render_mode='rgb_array')

In [58]:
# Inspect the observation space
# The environment is created with gym.make("<name_of_the_environment>").
# `is_slippery=False`: the agent always moves in the intended direction (deterministic).
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample())  # Get a random observation
# Inspect the action space
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample())  # Take a random action

_____OBSERVATION SPACE_____ 

Observation Space Discrete(16)
Sample observation 0

 _____ACTION SPACE_____ 

Action Space Shape 4
Action Space Sample 1


In [59]:
# To initialize the Q-table we need to know how many states and actions there are
state_space = env.observation_space.n
print("There are ", state_space, " possible states")

action_space = env.action_space.n
print("There are ", action_space, " possible actions")

# Initialize the Q-table
# Create a Q-table of shape (state_space, action_space) with every value set to 0. np.zeros takes the shape as a tuple (a, b)
def initialize_q_table(state_space, action_space):
  # It is simply a 2-D array
  Qtable = np.zeros((state_space,action_space))
  return Qtable

There are  16  possible states
There are  4  possible actions


In [60]:
Qtable_frozenlake = initialize_q_table(state_space, action_space)

In [61]:
# Define the two policies

# Plain greedy policy
def greedy_policy(Qtable, state):
  # Exploitation: take the action with the highest state-action value
  action = np.argmax(Qtable[state]) # index of the action with the highest value
  return action

# ε-greedy policy
def epsilon_greedy_policy(Qtable, state, epsilon):
  # Randomly generate a number between 0 and 1
  random_num = random.uniform(0,1)
  # if random_num is greater than epsilon --> exploitation
  if random_num > epsilon:
    # Take the action with the highest value given a state
    action = greedy_policy(Qtable,state)
  # else --> exploration
  else:
    # Take a random action (uses the global `env` created above)
    action = env.action_space.sample()

  return action

### Hyperparameters

In [62]:
# Training parameters
n_training_episodes = 10000  # Total training episodes
learning_rate = 0.7  # Learning rate

# Evaluation parameters
n_eval_episodes = 100  # Total number of test episodes

# Environment parameters
env_id = "FrozenLake-v1"  # Name of the environment
max_steps = 99  # Max steps per episode
gamma = 0.95  # Discounting rate
eval_seed = []  # The evaluation seed of the environment

# Exploration parameters for the ε-greedy policy
max_epsilon = 1.0  # Exploration probability at start
min_epsilon = 0.05  # Minimum exploration probability
decay_rate = 0.0005  # Exponential decay rate for exploration prob
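
To get a feel for these exploration parameters, the sketch below evaluates the same exponential decay formula that the training loop uses for epsilon. The helper name `epsilon_at` is hypothetical; the default values are the ones set in the cell above.

```python
import numpy as np


def epsilon_at(episode, min_epsilon=0.05, max_epsilon=1.0, decay_rate=0.0005):
    """Exploration probability for a given episode index (exponential decay)."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)


for episode in (0, 1000, 5000, 9999):
    print(episode, round(float(epsilon_at(episode)), 3))
```

With these settings the agent acts almost entirely at random in the first episodes and is close to the 0.05 floor by the end of the 10000 training episodes.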

### Training

In [63]:
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
  for episode in tqdm(range(n_training_episodes)):
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    # Reset the environment
    state, info = env.reset()
    step = 0
    terminated = False
    truncated = False

    # repeat
    for step in range(max_steps):   # max_steps caps how many actions the agent takes per episode
      # Choose the action At using epsilon greedy policy
      action =  epsilon_greedy_policy(Qtable,state,epsilon)

      # Take action At and observe Rt+1 and St+1
      # Take the action (a) and observe the outcome state(s') and reward (r)
      new_state, reward, terminated, truncated, info = env.step(action)

      # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
      Qtable[state, action] = Qtable[state, action] + learning_rate * (
        reward + gamma * np.max(Qtable[new_state, :]) - Qtable[state, action]
        )

      # If terminated or truncated finish the episode
      if terminated or truncated:
        break

      # Our next state is the new state
      state = new_state
  return Qtable

In [64]:
# Run the training to obtain the final Q-table
Qtable_frozenlake = train(n_training_episodes,min_epsilon,max_epsilon,decay_rate,env,max_steps,Qtable_frozenlake)

  0%|          | 0/10000 [00:00<?, ?it/s]

In [65]:
Qtable_frozenlake

array([[0.73509189, 0.77378094, 0.77378094, 0.73509189],
       [0.73509189, 0.        , 0.81450625, 0.77378094],
       [0.77378094, 0.857375  , 0.77378094, 0.81450625],
       [0.81450625, 0.        , 0.77378094, 0.77378094],
       [0.77378094, 0.81450625, 0.        , 0.73509189],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.        , 0.81450625],
       [0.        , 0.        , 0.        , 0.        ],
       [0.81450625, 0.        , 0.857375  , 0.77378094],
       [0.81450625, 0.9025    , 0.9025    , 0.        ],
       [0.857375  , 0.95      , 0.        , 0.857375  ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.95      , 0.857375  ],
       [0.9025    , 0.95      , 1.        , 0.9025    ],
       [0.        , 0.        , 0.        , 0.        ]])
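
One way to read this table is to follow the greedy action from the start state. The sketch below copies the values above rounded to four decimals and uses a hand-written `step` helper (hypothetical, standing in for the deterministic 4x4 environment) with FrozenLake's action indices 0 = left, 1 = down, 2 = right, 3 = up.

```python
import numpy as np

# Q-table from the output above, rounded to 4 decimals.
Q = np.array([
    [0.7351, 0.7738, 0.7738, 0.7351],
    [0.7351, 0.0,    0.8145, 0.7738],
    [0.7738, 0.8574, 0.7738, 0.8145],
    [0.8145, 0.0,    0.7738, 0.7738],
    [0.7738, 0.8145, 0.0,    0.7351],
    [0.0,    0.0,    0.0,    0.0],
    [0.0,    0.9025, 0.0,    0.8145],
    [0.0,    0.0,    0.0,    0.0],
    [0.8145, 0.0,    0.8574, 0.7738],
    [0.8145, 0.9025, 0.9025, 0.0],
    [0.8574, 0.95,   0.0,    0.8574],
    [0.0,    0.0,    0.0,    0.0],
    [0.0,    0.0,    0.0,    0.0],
    [0.0,    0.9025, 0.95,   0.8574],
    [0.9025, 0.95,   1.0,    0.9025],
    [0.0,    0.0,    0.0,    0.0],
])


def step(state, action, n=4):
    """Deterministic move on an n x n grid; moves into a wall leave the state unchanged."""
    row, col = divmod(state, n)
    if action == 0:
        col = max(col - 1, 0)      # left
    elif action == 1:
        row = min(row + 1, n - 1)  # down
    elif action == 2:
        col = min(col + 1, n - 1)  # right
    else:
        row = max(row - 1, 0)      # up
    return row * n + col


state, path = 0, [0]
while state != 15 and len(path) < 20:  # state 15 is the goal
    state = step(state, int(np.argmax(Q[state])))
    path.append(state)
print(path)  # [0, 4, 8, 9, 13, 14, 15]
```

The best value in state 0 is about 0.95 ** 5: the reward of 1 arrives on the sixth step and is discounted once for each of the five steps before it. Rows that stay at zero belong to the holes and the goal, where the episode ends before any update is made.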

### Evaluating the agent

In [66]:
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
    """
    Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and standard deviation of the episode reward.
    :param env: The evaluation environment
    :param max_steps: Maximum number of steps per episode
    :param n_eval_episodes: Number of episodes to evaluate the agent
    :param Q: The Q-table
    :param seed: The evaluation seed array (for taxi-v3)
    """
    episode_rewards = []
    for episode in tqdm(range(n_eval_episodes)):
        if seed:
            state, info = env.reset(seed=seed[episode])
        else:
            state, info = env.reset()
        step = 0
        truncated = False
        terminated = False
        total_rewards_ep = 0

        for step in range(max_steps):
            # Take the action (index) that have the maximum expected future reward given that state
            action = greedy_policy(Q, state)
            new_state, reward, terminated, truncated, info = env.step(action)
            total_rewards_ep += reward

            if terminated or truncated:
                break
            state = new_state
        episode_rewards.append(total_rewards_ep)
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward

In [67]:
# Evaluate our Agent
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

  0%|          | 0/100 [00:00<?, ?it/s]

Mean_reward=1.00 +/- 0.00


## Inspecting and pushing the model

In [68]:
from huggingface_hub import HfApi, snapshot_download
from huggingface_hub.repocard import metadata_eval_result, metadata_save

from pathlib import Path
import datetime
import json

In [69]:
def record_video(env, Qtable, out_directory, fps=1):
    """
    Generate a replay video of the agent
    :param env
    :param Qtable: Qtable of our agent
    :param out_directory
    :param fps: how many frame per seconds (with taxi-v3 and frozenlake-v1 we use 1)
    """
    images = []
    terminated = False
    truncated = False
    state, info = env.reset(seed=random.randint(0, 500))
    img = env.render()
    images.append(img)
    while not (terminated or truncated):
        # Take the action (index) that have the maximum expected future reward given that state
        action = np.argmax(Qtable[state][:])
        state, reward, terminated, truncated, info = env.step(
            action
        )  # We directly put next_state = state for recording logic
        img = env.render()
        images.append(img)
    imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)

In [70]:
def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
    """
    Evaluate, Generate a video and Upload a model to Hugging Face Hub.
    This method does the complete pipeline:
    - It evaluates the model
    - It generates the model card
    - It generates a replay video of the agent
    - It pushes everything to the Hub

    :param repo_id: id of the model repository on the Hugging Face Hub
    :param model: dictionary holding the Q-table and the hyperparameters
    :param env: the environment used for evaluation and recording
    :param video_fps: frames per second of the recorded replay
    (with taxi-v3 and frozenlake-v1 we use 1)
    :param local_repo_path: where the local repository is
    """
    _, repo_name = repo_id.split("/")

    eval_env = env
    api = HfApi()

    # Step 1: Create the repo
    repo_url = api.create_repo(
        repo_id=repo_id,
        exist_ok=True,
    )

    # Step 2: Download files
    repo_local_path = Path(snapshot_download(repo_id=repo_id))

    # Step 3: Save the model
    if env.spec.kwargs.get("map_name"):
        model["map_name"] = env.spec.kwargs.get("map_name")
        if env.spec.kwargs.get("is_slippery", "") == False:
            model["slippery"] = False

    # Pickle the model
    with open((repo_local_path) / "q-learning.pkl", "wb") as f:
        pickle.dump(model, f)

    # Step 4: Evaluate the model and build JSON with evaluation metrics
    mean_reward, std_reward = evaluate_agent(
        eval_env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"]
    )

    evaluate_data = {
        "env_id": model["env_id"],
        "mean_reward": mean_reward,
        "n_eval_episodes": model["n_eval_episodes"],
        "eval_datetime": datetime.datetime.now().isoformat(),
    }

    # Write a JSON file called "results.json" that will contain the
    # evaluation results
    with open(repo_local_path / "results.json", "w") as outfile:
        json.dump(evaluate_data, outfile)

    # Step 5: Create the model card
    env_name = model["env_id"]
    if env.spec.kwargs.get("map_name"):
        env_name += "-" + env.spec.kwargs.get("map_name")

    if env.spec.kwargs.get("is_slippery", "") == False:
        env_name += "-" + "no_slippery"

    metadata = {}
    metadata["tags"] = [env_name, "q-learning", "reinforcement-learning", "custom-implementation"]

    # Add metrics
    eval = metadata_eval_result(
        model_pretty_name=repo_name,
        task_pretty_name="reinforcement-learning",
        task_id="reinforcement-learning",
        metrics_pretty_name="mean_reward",
        metrics_id="mean_reward",
        metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
        dataset_pretty_name=env_name,
        dataset_id=env_name,
    )

    # Merges both dictionaries
    metadata = {**metadata, **eval}

    model_card = f"""
  # **Q-Learning** Agent playing **{env_name}**
  This is a trained model of a **Q-Learning** agent playing **{env_name}**.

  ## Usage

  model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")

  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
  env = gym.make(model["env_id"])
  """

    evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])

    readme_path = repo_local_path / "README.md"
    readme = ""
    print(readme_path.exists())
    if readme_path.exists():
        with readme_path.open("r", encoding="utf8") as f:
            readme = f.read()
    else:
        readme = model_card

    with readme_path.open("w", encoding="utf-8") as f:
        f.write(readme)

    # Save our metrics to Readme metadata
    metadata_save(readme_path, metadata)

    # Step 6: Record a video
    video_path = repo_local_path / "replay.mp4"
    record_video(env, model["qtable"], video_path, video_fps)

    # Step 7. Push everything to the Hub
    api.upload_folder(
        repo_id=repo_id,
        folder_path=repo_local_path,
        path_in_repo=".",
    )

    print("Your model is pushed to the Hub. You can view your model here: ", repo_url)

In [71]:
from huggingface_hub import notebook_login

notebook_login()


In [72]:
model = {
    "env_id": env_id,
    "max_steps": max_steps,
    "n_training_episodes": n_training_episodes,
    "n_eval_episodes": n_eval_episodes,
    "eval_seed": eval_seed,
    "learning_rate": learning_rate,
    "gamma": gamma,
    "max_epsilon": max_epsilon,
    "min_epsilon": min_epsilon,
    "decay_rate": decay_rate,
    "qtable": Qtable_frozenlake,
}

username = "Segment139"  # FILL THIS
repo_name = "q-FrozenLake-v1-4x4-noSlippery"
push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

q-learning.pkl:   0%|          | 0.00/914 [00:00<?, ?B/s]
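
The generated model card calls `load_from_hub`, which this notebook never defines. One possible definition is sketched below, assuming the file was pushed as `q-learning.pkl` as in the code above. `load_q_model` is a hypothetical helper name; `hf_hub_download` is the `huggingface_hub` function that fetches a single file from a repository.

```python
import pickle


def load_q_model(path):
    """Load a pickled model dictionary from a local file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_from_hub(repo_id, filename="q-learning.pkl"):
    """Download `filename` from a Hub repository and unpickle it."""
    # Imported here so the local helper also works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return load_q_model(local_path)
```

Usage would be `model = load_from_hub(repo_id=f"{username}/{repo_name}")`, after which `model["qtable"]` can be passed to `evaluate_agent`. Unpickling runs arbitrary code, so only load files from repositories you trust.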