# TD Learning 

### TD Prediction  
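TD differs from MC in when the value function is updated: MC updates only after each episode ends, whereas TD updates after every time step, without waiting for the episode to finish. In the update below, $\alpha$ is the learning rate and $\gamma$ is the discount factor.  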
$$
\begin{align}
V(s_t)\leftarrow V(s_t)+\alpha[r_t+\gamma V(s_{t+1})-V(s_t)]
\end{align}
$$  
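
To make the update rule concrete, here is a minimal sketch that applies a single TD(0) step by hand. The helper name `td0_update` and the two-state table `V_demo` are illustrative assumptions and are not part of the original notebook.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# Hypothetical two-state value table, chosen only for illustration.
V_demo = {0: 0.0, 1: 0.5}
delta = td0_update(V_demo, s=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(delta, V_demo[0])
```

The TD error is the gap between the bootstrapped target $r_t+\gamma V(s_{t+1})$ and the current estimate, and $\alpha$ controls how much of that gap is absorbed in one step.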

In [1]:
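# Define the environment used in this chapter
# (the code below uses the classic Gym API: reset() returns the state, step() returns four values)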
import gym
import random
import pandas as pd
import numpy as np

env = gym.make('FrozenLake-v1')

### TD Prediction algorithm 

In [6]:
# Define a random policy
def policy():
    return env.action_space.sample()

# Initialize the value function
V = {}
for s in range(env.observation_space.n):
    V[s] = 0.0

In [8]:
alpha = 0.85  # learning rate
gamma = 0.9  # discount factor

num_episode = 1000
num_step = 100

# Estimate the value function
for i in range(num_episode):
    s = env.reset()
    for t in range(num_step):  # update the value function at every step
        a = policy()
        s_, r, done, _ = env.step(a)
        V[s] += alpha * (r + gamma * V[s_] - V[s])
        if done:
            break
        s = s_

As the algorithm above shows, V is updated at every time step within an episode.
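
To inspect the learned values, it can help to lay them out in the shape of the map. The sketch below assumes the default 4x4 FrozenLake layout with states numbered row by row. The helper `values_to_grid` and the placeholder table `V_demo` are hypothetical, and in the notebook you would pass the learned `V` instead.

```python
def values_to_grid(V, n_rows=4, n_cols=4):
    """Arrange a {state: value} dict into a row-major list of rows."""
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    for s, v in V.items():
        grid[s // n_cols][s % n_cols] = v
    return grid

# Placeholder values for illustration only (assumption: 16 states, 4x4 grid).
V_demo = {s: float(s) for s in range(16)}
for row in values_to_grid(V_demo):
    print(row)
```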

### TD Control

- SARSA: on-policy TD control

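As we saw earlier, the policy is extracted from the Q function, so finding the optimal policy amounts to solving for the Q function. Replacing the update of the V function with an update of the Q function therefore gives the SARSA algorithm.  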
$$
\begin{align}
Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha[r_t+\gamma Q(s_{t+1},a_{t+1})-Q(s_t,a_t)]
\end{align}
$$  

In [15]:
# Initialize the Q function
Q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        Q[(s, a)] = 0.0

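# Define the policy
def epsilon_greedy_policy(state, epsilon):
    # Explore with probability epsilon; otherwise act greedily with respect to Q
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        # Index Q by the `state` argument, not by the global variable `s`
        return max(list(range(env.action_space.n)), key = lambda x : Q[(state, x)])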

In [16]:
alpha = 0.8
gamma = 0.9
epsilon = 0.8

num_episode = 1000
num_step = 100

# Learn the Q function, from which the optimal policy is extracted
for i in range(num_episode):
    s = env.reset()
    a = epsilon_greedy_policy(s, epsilon)
    for t in range(num_step):
        s_, r, done, _ = env.step(a)
        a_ = epsilon_greedy_policy(s_, epsilon)
        Q[(s, a)] += alpha * (r + gamma * Q[(s_, a_)] - Q[(s, a)])
        if done:
            break
        s = s_
        a = a_
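
Once Q has been learned, the policy is read off by picking the highest-valued action in each state. The following is a minimal sketch of that extraction step. The helper `extract_greedy_policy` and the small table `Q_demo` are illustrative assumptions, and with the notebook's `Q` you would pass `env.observation_space.n` and `env.action_space.n` as the sizes.

```python
def extract_greedy_policy(Q, n_states, n_actions):
    """Return {state: action}, choosing the highest-valued action per state."""
    return {
        s: max(range(n_actions), key=lambda a: Q[(s, a)])
        for s in range(n_states)
    }

# Hypothetical Q table with two states and two actions, for illustration only.
Q_demo = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}
policy_demo = extract_greedy_policy(Q_demo, n_states=2, n_actions=2)
print(policy_demo)
```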

- Q-learning: off-policy TD control

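It differs from SARSA in that the policy used to interact with the environment and the policy used to compute the Q update are not the same policy.  
Here, the epsilon-greedy policy is used to interact with the environment, and the greedy policy is used when computing Q.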
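
The difference comes down to the bootstrap target. SARSA uses $Q(s_{t+1},a_{t+1})$ for the action the behaviour policy actually selected, while Q-learning uses $\max_a Q(s_{t+1},a)$ regardless of which action is taken next. A minimal sketch of the two targets follows. The function names and the table `Q_demo` are illustrative assumptions, not part of the original notebook.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually chosen next."""
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, n_actions, gamma):
    """Off-policy target: bootstrap from the greedy (max-valued) next action."""
    return r + gamma * max(Q[(s_next, a)] for a in range(n_actions))

# Hypothetical next-state values: action 1 is greedy, action 0 is exploratory.
Q_demo = {(1, 0): 0.2, (1, 1): 0.6}
on_policy = sarsa_target(Q_demo, r=0.0, s_next=1, a_next=0, gamma=0.9)
off_policy = q_learning_target(Q_demo, r=0.0, s_next=1, n_actions=2, gamma=0.9)
print(on_policy, off_policy)
```

When the behaviour policy happens to pick the greedy action, the two targets coincide. They differ only on exploratory steps.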

In [2]:
# Initialize Q
Q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        Q[(s, a)] = 0.0

In [3]:
# Define the policies

# epsilon-greedy policy
def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key = lambda x : Q[(state, x)])
    
# greedy policy
def greedy_policy(state):
    # Index Q by the `state` argument, not by the global variable `s_`
    return np.argmax([Q[(state, a)] for a in range(env.action_space.n)])

In [6]:
# Learn the optimal Q function
alpha = 0.85
gamma = 0.90
epsilon = 0.8

num_episode = 1000
num_step = 100

for i in range(num_episode):
    s = env.reset()
    for t in range(num_step):
        a = epsilon_greedy_policy(s, epsilon)  # behaviour policy: interacts with the environment
        s_ , r, done, _ = env.step(a)
        a_ = greedy_policy(s_)  # target policy: used to update Q
        Q[(s, a)] += alpha * (r + gamma * Q[(s_, a_)] - Q[(s, a)])
        
        if done:
            break
        
        s = s_

In [7]:
df = pd.DataFrame(Q.items(), columns = ['state-action', 'value'])
df.head(11)

| | state-action | value |
|---|---|---|
| 0 | (0, 0) | 0.210561 |
| 1 | (0, 1) | 0.237352 |
| 2 | (0, 2) | 0.30101 |
| 3 | (0, 3) | 0.334825 |
| 4 | (1, 0) | 0.035253 |
| 5 | (1, 1) | 0.181317 |
| 6 | (1, 2) | 0.000964 |
| 7 | (1, 3) | 0.168928 |
| 8 | (2, 0) | 0.320543 |
| 9 | (2, 1) | 0.297513 |
