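# Temporal Difference and Q-Learning

## **Temporal Difference (TD)**

Temporal-difference (TD) methods update the value function from sampled transitions when the parameters of the environment (its transition probabilities and rewards) are unknown. At time $t$, the agent interacts with the unknown environment, moves to the next state $S_{t+1}$, and receives the reward $R_{t+1}$. Define the TD error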

$$
\delta_{t} \doteq R_{t+1}+\gamma V\left(S_{t+1}\right)-V\left(S_{t}\right)
$$

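By the definition of the value function, $V(S_t)=\mathbb{E}\left[R_{t+1}+\gamma V\left(S_{t+1}\right) \mid S_t\right]$, so $R_{t+1}+\gamma V\left(S_{t+1}\right)$ is a one-sample estimate of $V(S_t)$ (the TD target). Subtracting the current value from this sampled target gives the TD error, which is then used to update $V(S_t)$ as follows: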

$$ \begin{aligned}
V\left(S_{t}\right) & \leftarrow V\left(S_{t}\right)+\alpha\delta_{t} \\
& = V\left(S_{t}\right)+\alpha\left[R_{t+1}+\gamma V\left(S_{t+1}\right)-V\left(S_{t}\right)\right]
\end{aligned}
$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
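The TD update above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper `td0_update`, its `terminal` flag, and the 3-state chain (`0 -> 1 -> 2`, rewards `-1` then `+10`) are all hypothetical and chosen only so that the true values are easy to verify by hand.

```python
import numpy as np


def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD update to V[s] in place and return the TD error."""
    # A terminal state has value 0 by definition, so do not bootstrap from it.
    target = r if terminal else r + gamma * V[s_next]
    delta = target - V[s]
    V[s] += alpha * delta
    return delta


# Hypothetical deterministic chain: state -> (next_state, reward); state 2 is terminal.
chain = {0: (1, -1.0), 1: (2, 10.0)}

V = np.zeros(3)
for _ in range(200):  # 200 episodes, each starting in state 0
    s = 0
    while s != 2:
        s_next, r = chain[s]
        td0_update(V, s, r, s_next, alpha=0.5, gamma=0.9, terminal=(s_next == 2))
        s = s_next

print(V)
```

With $\gamma = 0.9$, the exact values for this chain are $V(1) = 10$ and $V(0) = -1 + 0.9 \times 10 = 8$, and the estimates approach them as the episodes accumulate.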

## **Q-learning**

Q-learning applies the following temporal-difference update to the Q-values, iteratively solving MDPs with finite, discrete state and action spaces.

$$
Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \max _{a} Q\left(S_{t+1}, a\right)-Q\left(S_{t}, A_{t}\right)\right]
$$

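where $\alpha$ is the learning rate and $\gamma$ is the discount factor. Q-learning repeatedly updates the state-action pairs it visits. It can be shown that, for a finite discrete MDP, the Q-values of all state-action pairs converge to their optimal values, provided that every pair continues to be visited and the learning rate is decayed appropriately. The optimal value function and the optimal policy then follow as $V^{*}(s)=\max_{a} Q^{*}(s, a)$ and $\pi^{*}(s)=\arg\max_{a} Q^{*}(s, a)$.

From the update rule above, a simple class implementation of Q-learning can be written: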
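As a worked instance of the update rule, take some illustrative numbers (assumed here purely for the arithmetic): $\alpha = 0.5$, $\gamma = 0.9$, a current estimate $Q(S_t, A_t) = 0$, an observed reward $R_{t+1} = -1$, and $\max_a Q(S_{t+1}, a) = 0$. One update gives

$$
Q\left(S_{t}, A_{t}\right) \leftarrow 0+0.5\left[-1+0.9 \times 0-0\right]=-0.5
$$

If the same pair is updated again after the successor's best Q-value has grown to $7.19$, then

$$
Q\left(S_{t}, A_{t}\right) \leftarrow -0.5+0.5\left[-1+0.9 \times 7.19-(-0.5)\right]=-0.5+0.5 \times 5.971=2.4855
$$

Each update moves the estimate a fraction $\alpha$ of the way toward the current target $R_{t+1}+\gamma \max_a Q(S_{t+1}, a)$.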

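```python
import numpy as np


class QLearning:
    """Tabular Q-learning with an epsilon-greedy behavior policy."""

    def __init__(self, num_states, num_actions, learning_rate, discount_factor, epsilon=0.1):
        self.num_states = num_states  # total number of states
        self.num_actions = num_actions  # total number of actions
        self.lr = learning_rate  # learning rate (alpha)
        self.discount = discount_factor  # discount factor (gamma)
        self.epsilon = epsilon  # exploration probability

        # Q-table initialized to zero: one row per state, one column per action
        self.q_table = np.zeros((num_states, num_actions))

    def choose_action(self, state):
        """Select an action with the epsilon-greedy policy."""
        if np.random.uniform() < self.epsilon:
            action = np.random.randint(self.num_actions)  # explore
        else:
            action = np.argmax(self.q_table[state])  # exploit (ties go to the lowest index)
        return action

    def update_q_table(self, state, action, reward, next_state, done=False):
        """Apply one Q-learning update to the Q-table."""
        q_value = self.q_table[state, action]
        # A terminal state has no future return, so do not bootstrap from it
        max_q_value = 0.0 if done else np.max(self.q_table[next_state])
        td_error = reward + self.discount * max_q_value - q_value
        self.q_table[state, action] += self.lr * td_error

    def train(self, num_episodes, env):
        for episode in range(num_episodes):
            state = 0  # every episode starts in state 0
            done = False
            while not done:
                action = self.choose_action(state)
                # Take the action; the environment returns the next state, the reward, and the done flag
                next_state, reward, done = env.step(state, action)
                self.update_q_table(state, action, reward, next_state, done)
                state = next_state
```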

Test the code with a simple deterministic environment:

```python
class Environment:
    """Deterministic chain of 6 states; state 5 is terminal."""

    def __init__(self):
        self.num_states = 6
        self.num_actions = 2
        # Each row is [state, next_state, reward]
        self.transitions = [[0, 1, -1], [1, 3, -1], [2, 3, -1], [3, 4, 10], [4, 5, -1], [5, 5, 0]]

    def step(self, state, action):
        # The transition depends only on the state; the action is ignored
        _, next_state, reward = self.transitions[state]
        done = next_state == 5
        return next_state, reward, done


if __name__ == '__main__':
    env = Environment()
    q_learning = QLearning(env.num_states, env.num_actions, learning_rate=0.5, discount_factor=0.9)
    q_learning.train(num_episodes=100, env=env)

    print("Learned Q-table:")
    print(q_learning.q_table)
```

```text
Learned Q-table:
[[ 5.471       5.3780294 ]
 [ 6.94792973  7.19      ]
 [ 0.          0.        ]
 [ 9.1         8.83673254]
 [-1.         -1.        ]
 [ 0.          0.        ]]
```
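The learned table can be sanity-checked against values computed directly from the environment. The sketch below is an illustrative check rather than part of the original notebook: it copies the transition table from `Environment`, assumes the same discount factor of 0.9, and sweeps the Bellman backup $v(s) = r + \gamma\, v(s')$ over the chain. Because the environment ignores the action, both actions in a state share the same optimal value.

```python
import numpy as np

# Transition table copied from the Environment above: [state, next_state, reward].
transitions = [[0, 1, -1], [1, 3, -1], [2, 3, -1], [3, 4, 10], [4, 5, -1], [5, 5, 0]]
gamma = 0.9  # same discount factor as in the training run
terminal = 5

# Repeated Bellman backups on the deterministic chain.
# State 5 is terminal, so its value stays 0.
v = np.zeros(len(transitions))
for _ in range(len(transitions)):  # more sweeps than the longest path (4 steps)
    for s, s_next, r in transitions:
        if s != terminal:
            v[s] = r + gamma * v[s_next]

print(np.round(v, 3))
```

This gives 5.471, 7.19, 7.19, 9.1, -1 and 0 for states 0 through 5. The entries 5.471, 7.19, 9.1 and -1 in the learned table match these values. The remaining entries in rows 0, 1 and 3 belong to actions that were selected less often and have not yet converged in 100 episodes, and row 2 stays at zero because state 2 is never reached from the start state 0.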
