# Temporal-Difference Learning and Q-Learning

## **Temporal-Difference (TD) Learning**

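Temporal-difference (TD) methods update the value function from sampled transitions when the environment's dynamics are unknown. At time $t$, the agent interacts with the unknown environment, transitions to the next state $S_{t+1}$, and receives the reward $R_{t+1}$. Define the TD error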

$$
\delta_{t} \doteq R_{t+1}+\gamma V\left(S_{t+1}\right)-V\left(S_{t}\right)
$$

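By the definition of the value function, $R_{t+1}+\gamma V\left(S_{t+1}\right)$ is a sample-based estimate of $V(S_t)$, usually called the TD target. Subtracting the current estimate from this target gives the TD error, which is then used to update $V(S_t)$: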

$$ \begin{aligned}
V\left(S_{t}\right) & \leftarrow V\left(S_{t}\right)+\alpha\delta_{t} \\
& = V\left(S_{t}\right)+\alpha\left[R_{t+1}+\gamma V\left(S_{t+1}\right)-V\left(S_{t}\right)\right]
\end{aligned}
$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
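
To make the update concrete, here is a minimal sketch of tabular TD(0) policy evaluation. The environment is an assumption introduced only for illustration: a five-state random walk in which the agent moves left or right with equal probability, episodes end when it steps off either end, and the only nonzero reward is 1 for exiting on the right. The function name `td0_random_walk` and its default hyperparameters are likewise hypothetical.

```python
import random


def td0_random_walk(num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Estimate V for a uniformly random policy on a 5-state random walk."""
    rng = random.Random(seed)
    n_states = 5
    # Indices 0 and n_states + 1 are terminal states; their value stays 0.
    V = [0.0] * (n_states + 2)
    for _ in range(num_episodes):
        s = (n_states + 1) // 2  # every episode starts in the middle state
        while 0 < s < n_states + 1:
            s_next = s + rng.choice((-1, 1))
            # Reward is 1 only when the walk exits on the right-hand side.
            r = 1.0 if s_next == n_states + 1 else 0.0
            delta = r + gamma * V[s_next] - V[s]  # TD error
            V[s] += alpha * delta                 # TD(0) update
            s = s_next
    return V[1:-1]  # values of the non-terminal states, left to right


values = td0_random_walk()
print([round(v, 2) for v in values])
```

Under these assumptions the true values are $1/6, 2/6, \dots, 5/6$ from left to right, so the printed estimates should land near those numbers, with some residual noise because $\alpha$ is held constant.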

## **Q-learning**

Q-learning applies the following temporal-difference update to the Q-values, iteratively solving finite, discrete MDPs.

$$
Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \max _{a} Q\left(S_{t+1}, a\right)-Q\left(S_{t}, A_{t}\right)\right]
$$

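where $\alpha$ is the learning rate and $\gamma$ is the discount factor. Q-learning repeatedly updates the state-action pairs it visits. For finite, discrete MDPs it can be shown that, provided every state-action pair keeps being visited and the learning rate is decayed appropriately, the Q-value of every state-action pair converges to its optimal value, which yields the optimal value function and an optimal policy.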
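
As a quick sanity check of the update rule, the sketch below applies a single Q-learning step to a hand-written table. The helper `q_learning_update`, the state names `"s0"` and `"s1"`, and all numbers are made up for illustration. The `terminal` flag reflects a common convention, assumed here, of not bootstrapping from terminal states.

```python
def q_learning_update(q, state, action, reward, next_state,
                      alpha, gamma, terminal=False):
    """Apply one Q-learning update in place and return the TD error."""
    # Bootstrap from the best next action unless the episode has ended.
    bootstrap = 0.0 if terminal else max(q[next_state])
    td_target = reward + gamma * bootstrap
    td_error = td_target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Hypothetical two-state, two-action Q-table.
q = {"s0": [0.0, 0.5], "s1": [1.0, 2.0]}
td_error = q_learning_update(q, "s0", 1, reward=1.0, next_state="s1",
                             alpha=0.1, gamma=0.9)
print(td_error, q["s0"][1])
```

Here the TD target is $1 + 0.9 \times 2 = 2.8$, the TD error is $2.8 - 0.5 = 2.3$, and the entry $Q(s_0, 1)$ moves from $0.5$ to $0.5 + 0.1 \times 2.3 = 0.73$ (up to floating-point rounding).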

From the update rule above, a simple class-based implementation of Q-learning follows:

```python
from collections import defaultdict

import numpy as np
import torch


class QLearningAgent:
    """Tabular Q-learning agent with an epsilon-greedy behavior policy."""

    def __init__(self, env, learning_rate=0.1, discount_factor=0.9, epsilon=0.1):
        self.env = env
        self.learning_rate = learning_rate      # alpha
        self.discount_factor = discount_factor  # gamma
        self.epsilon = epsilon                  # exploration probability
        self.state_dim = env.observation_space.shape
        self.action_dim = env.action_space.n

        # Unseen states start with all-zero Q-values; states must be hashable.
        self.Qtable = defaultdict(lambda: np.zeros(self.action_dim))

    def choose_action(self, state):
        # Explore with probability epsilon, otherwise act greedily.
        if np.random.uniform(0, 1) < self.epsilon:
            action = self.env.action_space.sample()
        else:
            action = int(np.argmax(self.Qtable[state]))
        return action

    def update(self, state, action, reward, done, next_state):
        q_value = self.Qtable[state][action]
        max_q_value = np.max(self.Qtable[next_state])
        # Do not bootstrap from a terminal state: (1 - done) zeroes that term.
        td_target = reward + self.discount_factor * max_q_value * (1 - done)
        td_error = td_target - q_value
        self.Qtable[state][action] += self.learning_rate * td_error

    def train(self, num_episodes):
        for episode in range(num_episodes):
            state, _ = self.env.reset()
            done = False
            while not done:
                action = self.choose_action(state)
                next_state, reward, terminated, truncated, _ = self.env.step(action)
                # Only true termination, not truncation, disables bootstrapping.
                self.update(state, action, reward, terminated, next_state)
                done = terminated or truncated
                state = next_state

    def save(self, file_path):
        # Convert to a plain dict: the defaultdict's lambda cannot be pickled.
        checkpoint = {
            'Qtable': dict(self.Qtable),
            'state_dim': self.state_dim,
            'action_dim': self.action_dim
        }
        torch.save(checkpoint, file_path)

    def load(self, file_path):
        # The checkpoint holds NumPy arrays in a plain dict, so it needs full
        # unpickling (weights_only=False); only load files you trust.
        checkpoint = torch.load(file_path, weights_only=False)
        self.state_dim = checkpoint['state_dim']
        self.action_dim = checkpoint['action_dim']
        # Restore under the same attribute and key used by save().
        self.Qtable = defaultdict(lambda: np.zeros(self.action_dim), checkpoint['Qtable'])
```

Validate the agent on the Blackjack game from Gymnasium and save a checkpoint:

```python
import gymnasium as gym

env = gym.make('Blackjack-v1')
agent = QLearningAgent(env)
agent.train(num_episodes=1000)

save_path = './blackjack_ckpt.pt'
agent.save(save_path)
```
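
One way to check what was learned is to run the greedy policy read off the Q-table and average the episode returns. The sketch below is an illustration under stated assumptions, not part of the implementation above: `evaluate_greedy` and the stand-in environment `OneStepEnv` are hypothetical names, and the environment is assumed to follow the same Gymnasium-style `reset`/`step` return signatures used in the training loop.

```python
def evaluate_greedy(env, q_table, n_actions, num_episodes=100):
    """Average undiscounted return of the greedy policy read from q_table."""
    total = 0.0
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # States never seen in training fall back to all-zero Q-values.
            q_values = q_table.get(state, [0.0] * n_actions)
            # Greedy action; ties resolve to the lowest action index.
            action = max(range(n_actions), key=lambda a: q_values[a])
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
    return total / num_episodes


class OneStepEnv:
    """Hypothetical stand-in env: one decision, reward 1 for action 1."""

    def reset(self):
        return 0, {}

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        return 0, reward, True, False, {}


toy_q_table = {0: [0.0, 1.0]}  # made-up table that prefers action 1 in state 0
print(evaluate_greedy(OneStepEnv(), toy_q_table, n_actions=2, num_episodes=10))
```

With the trained agent, the same function could be called as `evaluate_greedy(env, dict(agent.Qtable), agent.action_dim)`. Evaluating without exploration separates the quality of the learned policy from the noise introduced by $\epsilon$-greedy action selection during training.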