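# Reinforcement Learning 2: Iterative Methods
* 0. How are iterative methods used in reinforcement learning?
* 1. Iterative methods
* 2. Policy iteration and value iteration
* 3. Review of dynamic programming (not closely related, so it is omitted from these lecture notes; it is covered in Lecture Video 48.13-15)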

<img src='pic/RL_1.png'>



* A simple solution to Flappy Bird
    * How to measure total value
    * How to choose actions and an overall policy

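---------------THIS CHAPTER FOCUSES ON --------------------
* **When the total value is hard to compute, but the distribution over environment states is known explicitly**
    * How to compute the total value iteratively
    * How to improve the policy repeatedly by iteration
    * Convergence of policy iteration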

----------------END-------------------------------------------    

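* When the total value is hard to compute and the state distribution is not known explicitly: learning from sequences of samples and experience
    * Monte Carlo methods
    * Computing the total value
    * Updating the policy
* When the total value is hard to compute and the state distribution is not known explicitly: learning from every single interaction with the environment
    * Temporal Differences
    * Temporal Differences compared with Monte Carlo methods
    * SARSA
    * Q-learning
* When there are too many states, how can a policy learned from a limited sample be generalized to a larger state space as an approximate solution?
    * Combining with supervised learning: function approximation
    * Linear methods, etc.
* Q-learning + Deep Learning
    * DQN
    * Advantages and characteristics of DQN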


[Reference videos](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)

[Reference book](https://www.amazon.com/Reinforcement-Learning-Introduction-Adaptive-Computation/dp/0262193981)

[Reference in Chinese (Zhihu)](https://www.zhihu.com/people/flood-sung/activities)

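# 0. Where are iterative methods used?

### 0.1 When computing the total value of a policy
* When the dynamics of the environment are fully known but direct computation is inconvenient
<img src='pic/v_pi.png'>

* Use an approximation algorithm to compute the total value under policy $\pi$
<img src='pic/v_k.png'>
* The fixed point that the computation finally reaches should be $v_\pi(s)$
* This step is called **iterative policy evaluation**

### 0.2 When updating the policy to find the optimal policy
* Find the optimal policy iteratively
* Evaluate the policy's total value -> update the policy -> evaluate the policy's total value -> update the policy
* GPI (Generalized Policy Iteration)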

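# 1. Iterative methods
* 1. A basic method of numerical computing: solving linear systems iteratively
    * Gaussian elimination costs $O(n^3)$; an iterative method costs $O(n^2)$ per iteration
* 2. Jacobi method
* 3. Gauss-Seidel method
* 4. Convergence

### 2. Jacobi method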

\begin{equation}
9x_1 + x_2 + x_3 = b_1\\
2x_1 + 10x_2 + 3x_3 = b_2\\
3x_1+4x_2+11x_3 = b_3\\
\end{equation}

* Rearranged:
\begin{equation}
x_1 = 1/9 [b_1 - x_2-x_3]\\
x_2 = 1/10 [b_2 -2x_1-3x_3]\\
x_3 = 1/11[b_3-3x_1-4x_2]\\
\end{equation}

* Initialize $x^{(0)} = [x_1^{(0)},x_2^{(0)},x_3^{(0)}]^T$

* Iterate repeatedly:
\begin{equation}
x_1^{(k+1)} = 1/9 [b_1 - x_2^{(k)}-x_3^{(k)}]\\
x_2^{(k+1)} = 1/10 [b_2 -2x_1^{(k)}-3x_3^{(k)}]\\
x_3^{(k+1)} = 1/11[b_3-3x_1^{(k)}-4x_2^{(k)}]\\
\end{equation}

* For example, with $b=[10,19,0]^T$ the exact solution is $x=[1,2,-1]^T$
* Applying the Jacobi method:

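```python
def jacob_iter(x, b):
    """One Jacobi sweep for the 3x3 system above.

    Every component of x_new is computed from the old iterate x only.
    """
    x_new = [0, 0, 0]
    x_new[0] = 1/9 * (b[0] - x[1] - x[2])
    x_new[1] = 1/10 * (b[1] - 2*x[0] - 3*x[2])
    x_new[2] = 1/11 * (b[2] - 3*x[0] - 4*x[1])
    return x_new
```

```python
x = [0, 0, 0]       # initial guess x^(0)
b = [10, 19, 0]     # right-hand side; the exact solution is [1, 2, -1]
for i in range(20):
    x = jacob_iter(x, b)
    print("%f,%f,%f" % (x[0], x[1], x[2]))
```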

```
1.111111,1.900000,0.000000
0.900000,1.677778,-0.993939
1.035129,2.018182,-0.855556
0.981930,1.949641,-1.016192
1.007395,2.008472,-0.976760
0.996476,1.991549,-1.005097
1.001505,2.002234,-0.995966
0.999304,1.998489,-1.001223
1.000304,2.000506,-0.999260
0.999862,1.999717,-1.000267
1.000061,2.000108,-0.999859
0.999972,1.999946,-1.000056
1.000012,2.000022,-0.999973
0.999994,1.999989,-1.000011
1.000002,2.000005,-0.999995
0.999999,1.999998,-1.000002
1.000000,2.000001,-0.999999
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
```
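
The run above uses a fixed 20 sweeps on a hard-coded 3x3 system. As a minimal sketch, the same Jacobi update can be written for any square system and stopped once successive iterates agree to a tolerance. The name `jacobi_solve` and the parameters `tol` and `max_iter` are illustrative assumptions, not part of the original notes.

```python
def jacobi_solve(A, b, tol=1e-8, max_iter=500):
    """Jacobi iteration for A x = b, starting from the zero vector.

    Returns (x, sweeps). Stops when no component changes by more than tol.
    """
    n = len(b)
    x = [0.0] * n
    for sweep in range(1, max_iter + 1):
        # Every new component is computed from the previous iterate only.
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, sweep
        x = x_new
    raise RuntimeError("Jacobi iteration did not converge within max_iter sweeps")


A = [[9, 1, 1], [2, 10, 3], [3, 4, 11]]
b = [10, 19, 0]
x, sweeps = jacobi_solve(A, b)
print(x, sweeps)
```

Stopping on the change between sweeps avoids guessing the number of iterations in advance; `max_iter` guards against systems for which the iteration does not converge.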


### 3. Gauss-Seidel iteration

\begin{equation}
9x_1 + x_2 + x_3 = b_1\\
2x_1 + 10x_2 + 3x_3 = b_2\\
3x_1+4x_2+11x_3 = b_3\\
\end{equation}

* Rearranged:
\begin{equation}
x_1 = 1/9 [b_1 - x_2-x_3]\\
x_2 = 1/10 [b_2 -2x_1-3x_3]\\
x_3 = 1/11[b_3-3x_1-4x_2]\\
\end{equation}

* Iterate repeatedly, using each newly computed component immediately:
\begin{equation}
x_1^{(k+1)} = 1/9 [b_1 - x_2^{(k)}-x_3^{(k)}]\\
x_2^{(k+1)} = 1/10 [b_2 -2x_1^{(k+1)}-3x_3^{(k)}]\\
x_3^{(k+1)} = 1/11[b_3-3x_1^{(k+1)}-4x_2^{(k+1)}]\\
\end{equation}

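```python
def gs_iter(x, b):
    """One Gauss-Seidel sweep for the 3x3 system above.

    Unlike Jacobi, components already updated in this sweep (x_new) are
    used immediately when computing the remaining components.
    """
    x_new = [0, 0, 0]
    x_new[0] = 1/9 * (b[0] - x[1] - x[2])
    x_new[1] = 1/10 * (b[1] - 2*x_new[0] - 3*x[2])
    x_new[2] = 1/11 * (b[2] - 3*x_new[0] - 4*x_new[1])
    return x_new
```

```python
x = [0, 0, 0]       # initial guess x^(0)
b = [10, 19, 0]     # right-hand side; the exact solution is [1, 2, -1]
for i in range(20):
    x = gs_iter(x, b)
    print("%f,%f,%f" % (x[0], x[1], x[2]))
```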

```
1.111111,1.677778,-0.913131
1.026150,1.968709,-0.995753
1.003005,1.998125,-1.000138
1.000224,1.999997,-1.000060
1.000007,2.000017,-1.000008
0.999999,2.000003,-1.000001
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
1.000000,2.000000,-1.000000
```
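
In this run the printed values settle to six decimals after about 7 sweeps, compared with about 18 for the Jacobi method. A general version of the Gauss-Seidel sweep can be sketched by updating the vector in place, so that each component sees the values already refreshed in the current sweep. The name `gauss_seidel_solve` and the parameters `tol` and `max_iter` are illustrative assumptions.

```python
def gauss_seidel_solve(A, b, tol=1e-8, max_iter=500):
    """Gauss-Seidel iteration for A x = b, starting from the zero vector.

    Returns (x, sweeps). Stops when no component changes by more than tol.
    """
    n = len(b)
    x = [0.0] * n
    for sweep in range(1, max_iter + 1):
        largest_change = 0.0
        for i in range(n):
            # x[j] for j < i was already updated in this sweep;
            # x[j] for j > i still holds the value from the previous sweep.
            others = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_value = (b[i] - others) / A[i][i]
            largest_change = max(largest_change, abs(new_value - x[i]))
            x[i] = new_value
        if largest_change < tol:
            return x, sweep
    raise RuntimeError("Gauss-Seidel iteration did not converge within max_iter sweeps")


A = [[9, 1, 1], [2, 10, 3], [3, 4, 11]]
b = [10, 19, 0]
x, sweeps = gauss_seidel_solve(A, b)
print(x, sweeps)
```

Updating in place also means only one copy of the vector is stored, which is the same idea as the in-place variant of iterative policy evaluation.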


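### 4. Convergence
* For the linear system $Ax=b$, split $A = N - P$
* The iteration can then be written as $Nx^{(k+1)}=b+Px^{(k)}$
* Jacobi method:
    * $N$ is the diagonal part of $A$, and $P = N-A$
* Gauss-Seidel method:
    * $N$ is the lower-triangular part of $A$ (including the diagonal), and $P = N-A$

To tell whether the iteration converges, subtract the iteration from $Nx=b+Px$, which the exact solution $x$ satisfies:
* $N(x-x^{(k+1)})=P(x-x^{(k)})$
* With the error $e^{(k)} = x-x^{(k)}$: $e^{(k+1)}=N^{-1}Pe^{(k)}$

* Convergence condition: the iteration converges from every starting point if and only if the spectral radius of $N^{-1}P$ (the largest absolute value among its eigenvalues) is less than 1. The diagonal entries alone do not decide this; for the Jacobi method the diagonal of $N^{-1}P$ is always zero. A sufficient condition that is easy to check is that $A$ is strictly diagonally dominant, which holds for the 3x3 example above.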
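
As a rough numerical check of the convergence condition on the 3x3 example, the sketch below builds the iteration matrix $N^{-1}P$ for both methods and estimates its spectral radius with Gelfand's formula $\rho(M)=\lim_k \|M^k\|^{1/k}$. All function names and the choice `k=60` are illustrative assumptions, and the estimate is only approximate (it is slightly above the true spectral radius).

```python
def jacobi_matrix(A):
    """N^{-1} P with N = diag(A) and P = N - A."""
    n = len(A)
    return [[0.0 if i == j else -A[i][j] / A[i][i] for j in range(n)]
            for i in range(n)]


def gauss_seidel_matrix(A):
    """N^{-1} P with N = lower triangle of A (diagonal included), P = N - A."""
    n = len(A)
    # P keeps only the negated strictly upper part of A.
    P = [[-A[i][j] if j > i else 0.0 for j in range(n)] for i in range(n)]
    M = [[0.0] * n for _ in range(n)]
    # Solve N M = P column by column with forward substitution.
    for col in range(n):
        for i in range(n):
            known = sum(A[i][j] * M[j][col] for j in range(i))
            M[i][col] = (P[i][col] - known) / A[i][i]
    return M


def spectral_radius_estimate(M, k=60):
    """Approximate rho(M) as the k-th root of the infinity norm of M^k."""
    n = len(M)
    power = M
    for _ in range(k - 1):
        power = [[sum(power[i][t] * M[t][j] for t in range(n)) for j in range(n)]
                 for i in range(n)]
    norm = max(sum(abs(value) for value in row) for row in power)
    return norm ** (1.0 / k)


def strictly_diagonally_dominant(A):
    """True if each |a_ii| exceeds the sum of the other |a_ij| in its row."""
    n = len(A)
    return all(abs(A[i][i]) > sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))


A = [[9, 1, 1], [2, 10, 3], [3, 4, 11]]
rho_jacobi = spectral_radius_estimate(jacobi_matrix(A))
rho_gs = spectral_radius_estimate(gauss_seidel_matrix(A))
print(strictly_diagonally_dominant(A), rho_jacobi, rho_gs)
```

Both estimates come out below 1, and the Gauss-Seidel value is the smaller of the two, which is consistent with the faster convergence seen in the printed iterations.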

# 2. Iterative policy evaluation and iterative policy improvement

### 1. Iterative Policy Evaluation (an iterative process for computing the state values under the current policy)

<img src='pic/v_pi.png'>

<img src='pic/v_k.png'>

$p$ is the transition probability (see Chapter 47): $p(s',r|s,a)$ is the probability of moving from $s$ to $s'$ and receiving reward $r$ after taking action $a$.

$k$ and $k+1$ index the current iteration and the next iteration.
- Iterative method:
    - In principle $v_\pi$ can be obtained directly, because the Bellman equation is a linear system with one unknown per state and can be solved by matrix inversion (see Chapter 47). The cost of solving it directly becomes very high when the number of states, and therefore the matrix dimension, is large.
    - Therefore we use an iterative method whose numerical solution approximates the exact solution of that system.
    - In each iteration, $v_k(s')$ is the current estimate of the value of the successor state $s'$. Substituting the $v_k$ values into the right-hand side of the formula above gives the new estimate $v_{k+1}(s)$. At the fixed point, $v_{k+1}(s)=v_k(s)$ for every state, and that fixed point is $v_\pi$.
    - See below for an example.

<img src='pic/policy_ev_alg.png'>

* Example
<img src='pic/policy_ev.png'>

note: Current policy $\pi(a|s)$: in every state $s$, each of the 4 actions (see the figure above) is taken with probability 1/4.

note: $p(s',r|s,a)$ is 1 for the single successor state of each action.
> For example: if the current state s is (2,2) and the action is move right, the next state s' must be (2,3), so the probability is 1.
> In other words, each action leads to exactly one state at t+1; it cannot lead to several possible states.

note: predefined: v(s) is fixed at 0 at position (1,1) and at position (4,4)

note: predefined: $\gamma$ = 1

- k=0: initialization: the value v(s) of every state is 0

<img src='pic/policy_ev_table.png'>

- k=n: once $\max_s |v_n(s) - v_{n-1}(s)| < \theta$ (e.g. 0.0001), we stop iterating and take the result as the steady-state value function under the current policy $\pi$, i.e. $v_\pi(s)$
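
The evaluation loop for this gridworld can be sketched as follows. The reward of -1 per move and the rule that a move off the grid leaves the state unchanged are assumptions (the notes do not state them in text), chosen because they reproduce the values -1.7 and -2.0 quoted for k=2 in Example 1. Positions are 0-indexed in the code, so (1,1) in the notes is `(0, 0)` here; all names are illustrative.

```python
SIZE = 4
TERMINALS = {(0, 0), (3, 3)}                   # (1,1) and (4,4) in the notes
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
REWARD = -1.0                                  # assumed reward for every move
GAMMA = 1.0


def step(state, action):
    """Deterministic transition; a move off the grid keeps the state (assumption)."""
    row, col = state[0] + action[0], state[1] + action[1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col)
    return state


def sweep(v):
    """One synchronous update v_k -> v_{k+1} under the uniform random policy."""
    new_v = [[0.0] * SIZE for _ in range(SIZE)]
    for row in range(SIZE):
        for col in range(SIZE):
            if (row, col) in TERMINALS:
                continue  # terminal values stay fixed at 0
            total = 0.0
            for action in ACTIONS:
                next_row, next_col = step((row, col), action)
                # pi(a|s) = 1/4 and p(s', r | s, a) = 1 for the single successor
                total += 0.25 * (REWARD + GAMMA * v[next_row][next_col])
            new_v[row][col] = total
    return new_v


def evaluate_random_policy(theta=1e-4, max_sweeps=10000):
    """Repeat sweeps until the largest change in any state is below theta."""
    v = [[0.0] * SIZE for _ in range(SIZE)]
    for k in range(1, max_sweeps + 1):
        new_v = sweep(v)
        delta = max(abs(new_v[r][c] - v[r][c])
                    for r in range(SIZE) for c in range(SIZE))
        v = new_v
        if delta < theta:
            return v, k
    raise RuntimeError("policy evaluation did not converge within max_sweeps")


values, sweeps_used = evaluate_random_policy()
for row in values:
    print(" ".join("%7.2f" % value for value in row))
```

This sketch keeps two tables (`v` and `new_v`), so every update in a sweep reads only values from the previous sweep, as in the Jacobi method; updating a single table in place corresponds to Gauss-Seidel.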

**note: The example above only updates V(s) under the current policy (a random policy that moves in each direction with probability 1/4), but we also want to find the optimal policy.
To do that, we can update the policy $\pi$ as well as V(s) in each iteration, and continue until both have converged (i.e. the policy no longer changes and the change in V(s) between the last two iterations is below the threshold).**

**See the section below.**

### 2. Iterative policy improvement (an iterative process for obtaining the optimal policy and its state values): Iterative Policy Evaluation and Greedy Policy Improvement

<img src='pic/policy_iter.png'>
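
The loop of evaluation followed by greedy improvement can be sketched on the same 4x4 gridworld used in the evaluation example. This is a minimal illustration rather than the exact algorithm in the figure: the reward of -1 per move, the wall behaviour, and all names are assumptions, and ties between equally good actions are broken by taking the first one.

```python
SIZE, GAMMA, REWARD = 4, 1.0, -1.0             # reward of -1 per move is assumed
TERMINALS = {(0, 0), (3, 3)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]
UNIFORM = [0.25, 0.25, 0.25, 0.25]


def step(state, action):
    """Deterministic transition; a move off the grid keeps the state (assumption)."""
    row, col = state[0] + action[0], state[1] + action[1]
    return (row, col) if 0 <= row < SIZE and 0 <= col < SIZE else state


def evaluate(policy, theta=1e-6, max_sweeps=100000):
    """Iterative policy evaluation with in-place updates."""
    v = {s: 0.0 for s in STATES}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue
            new_value = sum(prob * (REWARD + GAMMA * v[step(s, a)])
                            for prob, a in zip(policy[s], ACTIONS))
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < theta:
            return v
    raise RuntimeError("policy evaluation did not converge")


def greedy(v):
    """Policy improvement: put all probability on one action maximising q(s, a)."""
    policy = {}
    for s in STATES:
        if s in TERMINALS:
            policy[s] = list(UNIFORM)  # irrelevant: no decisions in terminal states
            continue
        q = [REWARD + GAMMA * v[step(s, a)] for a in ACTIONS]
        best = q.index(max(q))
        policy[s] = [1.0 if i == best else 0.0 for i in range(len(ACTIONS))]
    return policy


def policy_iteration(max_rounds=100):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: list(UNIFORM) for s in STATES}
    for _ in range(max_rounds):
        v = evaluate(policy)
        new_policy = greedy(v)
        if new_policy == policy:
            return policy, v
        policy = new_policy
    raise RuntimeError("policy iteration did not stabilise")


policy, v = policy_iteration()
for r in range(SIZE):
    print(" ".join("%5.1f" % v[(r, c)] for c in range(SIZE)))
```

Under these assumptions the final values equal minus the number of moves to the nearest terminal corner, which is what an optimal policy should achieve when every move costs 1.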

### Example 1
- In this example, how is the policy $\pi$ updated?
    - Move in the direction(s) of the neighboring state with the maximum state value
    - For example (see the figure below): at k=2 in position (2,2), moving left or up leads to a state with value -1.7, while moving right or down leads to a state with value -2.0. The updated policy for position (2,2), which is used to estimate the state values at k=3, therefore moves only left (50%) or up (50%), and never right or down, because those moves lead to states with lower value.

<img src='pic/policy_improve_1.png'>
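
The tie between left and up at position (2,2) can be reproduced with a short sketch: run two evaluation sweeps of the random policy, then list every action whose successor value is maximal. The reward of -1 per move, the wall behaviour, and all names are assumptions; positions are 0-indexed, so (2,2) in the notes is `(1, 1)` here.

```python
SIZE, GAMMA, REWARD = 4, 1.0, -1.0   # reward of -1 per move is assumed
TERMINALS = {(0, 0), (3, 3)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, move):
    """Deterministic transition; a move off the grid keeps the state (assumption)."""
    row, col = state[0] + move[0], state[1] + move[1]
    return (row, col) if 0 <= row < SIZE and 0 <= col < SIZE else state


def sweep(v):
    """One synchronous evaluation sweep under the uniform random policy."""
    new_v = {}
    for state in v:
        if state in TERMINALS:
            new_v[state] = 0.0
            continue
        new_v[state] = sum(0.25 * (REWARD + GAMMA * v[step(state, move)])
                           for move in ACTIONS.values())
    return new_v


def greedy_actions(v, state, tol=1e-9):
    """All actions whose one-step lookahead value is within tol of the best."""
    q = {name: REWARD + GAMMA * v[step(state, move)]
         for name, move in ACTIONS.items()}
    best = max(q.values())
    return [name for name, value in q.items() if best - value <= tol]


v0 = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
v2 = sweep(sweep(v0))                 # state values at k=2
best = greedy_actions(v2, (1, 1))     # position (2,2) in the notes
improved = {name: 1.0 / len(best) for name in best}
print(best, improved)
```

Splitting the probability evenly over the tied actions gives the 50%/50% policy described in the example; any other split over the tied actions would be equally greedy.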

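### Example 2
* Lao Jia manages two car rental locations, A and B. Whenever a location has a car available and a customer requests one, he earns 10 yuan. Each night he can move cars between the two locations to adjust the number available the next day; each car moved costs 2 yuan. The numbers of customers renting and returning cars each day are Poisson distributed: $\frac{e^{-\lambda}\lambda^{n}}{n!}$. The Poisson parameters are 3 and 4 for rentals at A and B, and 3 and 2 for returns at A and B. Each location can hold at most 20 cars. With $\gamma$=0.9, find Lao Jia's policy for moving cars between A and B each night.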

<img src='pic/policy_improve_2.png'>
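
The Poisson probabilities are the building block of the expected rewards in this example. The sketch below computes them and the expected one-day rental income at a single location. It assumes that requests beyond the available cars are simply lost and that cars returned during the day are not rented again the same day; those assumptions, the truncation point `max_n`, and the function names are illustrative and not stated in the notes.

```python
import math


def poisson_pmf(n, lam):
    """P(N = n) for a Poisson distribution with parameter lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


def expected_rental_income(cars, lam, price=10.0, max_n=60):
    """Expected income for one day at a location that starts with `cars` cars.

    With n requests only min(n, cars) can be served (assumption: the rest
    are lost). The infinite sum over n is truncated at max_n.
    """
    return sum(price * min(n, cars) * poisson_pmf(n, lam)
               for n in range(max_n + 1))


# Location A has rental parameter 3, location B has rental parameter 4.
for cars in (0, 1, 5, 20):
    print(cars, expected_rental_income(cars, 3), expected_rental_income(cars, 4))
```

Comparing the gain in expected income from one extra car with the 2 yuan moving cost gives some intuition for why the policy moves cars toward the location with the higher rental rate.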

* Why does repeatedly improving the old policy yield a better policy?
* Proposition: if $q_{\pi}(s,\pi'(s)) \ge v_{\pi}(s)$ for every state $s$, then $v_{\pi'}(s) \ge v_{\pi}(s)$ for every state $s$
    - Proof:

<img src='pic/policy_improve_3.png'>
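
For readers who cannot view the figure, the usual argument can be written out as follows. This is a sketch of the standard proof, assuming a deterministic improved policy $\pi'$; the figure may present the steps differently. The assumption $v_\pi(s) \le q_\pi(s,\pi'(s))$ is applied repeatedly, each time expanding $q_\pi$ by one more step under $\pi'$.

\begin{equation}
\begin{aligned}
v_{\pi}(s) &\le q_{\pi}(s,\pi'(s)) \\
&= \mathbb{E}\left[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\right] \\
&= \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s\right] \\
&\le \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma q_{\pi}(S_{t+1},\pi'(S_{t+1})) \mid S_t = s\right] \\
&= \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2}) \mid S_t = s\right] \\
&\le \dots \\
&\le \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\right] \\
&= v_{\pi'}(s)
\end{aligned}
\end{equation}

The greedy policy satisfies the assumption by construction, because $q_\pi(s,\pi'(s)) = \max_a q_\pi(s,a) \ge v_\pi(s)$, so each improvement step can only keep or raise the value of every state.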