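# Reinforcement Learning 1: Basic Concepts and Simple Examples

* 1. Review of supervised learning
* 2. Overview of the basic concepts and methods covered in this reinforcement learning series
* 3. Markov decision processes

* 4. A simple solution to Flappy Bird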

[Reference lectures: David Silver](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)

[Reference book: Sutton](https://www.amazon.com/Reinforcement-Learning-Introduction-Adaptive-Computation/dp/0262193981)

[Reference (in Chinese): Zhihu](https://www.zhihu.com/people/flood-sung/activities)

## 1. Review of Supervised Learning

<img src="./pic/Supervised.png">

The predictions (pred) are compared with the labels (label) through a cost function, which measures how well the model fits.

The choice of cost function depends on the task: for example, mean squared error (MSE) for regression and cross-entropy for classification.
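
As a minimal sketch of the two cost functions named above, the NumPy snippet below computes MSE for a regression example and cross-entropy for a classification example. The function names and the toy numbers are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def mse(pred, label):
    """Mean squared error between predictions and real-valued labels."""
    pred = np.asarray(pred, dtype=float)
    label = np.asarray(label, dtype=float)
    return float(np.mean((pred - label) ** 2))

def cross_entropy(prob, label, eps=1e-12):
    """Mean cross-entropy.

    prob:  one row of predicted class probabilities per sample
    label: integer index of the true class for each sample
    """
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0)  # avoid log(0)
    label = np.asarray(label, dtype=int)
    picked = prob[np.arange(len(label)), label]  # probability given to the true class
    return float(-np.mean(np.log(picked)))

# Toy data, made up for illustration
regression_loss = mse([2.0, 4.0], [1.0, 5.0])
classification_loss = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Both functions return a single number that shrinks as the predictions move closer to the labels, which is what makes them usable as training objectives.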

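## 2. Overview of Basic Concepts and Methods in This Reinforcement Learning Series

### 2.1 Basic concepts of reinforcement learning

Characteristics:

1. There are no predefined labels, but there is an overall goal that can be quantified. This is the biggest difference between RL and supervised learning.
2. Feedback is often delayed: the reward caused by the **current** action (t = now) may only show up at later time steps.
3. The goal is to maximize the cumulative reward:
    - Value = Reward (t=0) + Reward (t=1) + ...
    - Goal: make the overall Value as large as possible.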

<img src='./pic/RL_1.png'>

Concept:
> Policy:

<img src='./pic/RL_policy.png'>

> Value: (overall)

<img src='./pic/RL_value.png'>

> Environment model:
> - A model is a description of the environment that predicts how it will respond.
> - Given the current state and action, the environment determines the reward (and the next state).
> - Most of the time we do not know the environment, so we do not have its model.
> - RL therefore relies on trial and error: the agent learns from interaction, which can include estimating the environment model.

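* Agent
    * Action
    * Overall policy
    * Example: the choices made after graduation, i.e. a life strategy
* Reward
    * Example: the yearly salary earned as a result of the choice made after graduation
* Total value
    * $v_\pi(s) = E_\pi[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+...|S_t=s]$
    * Example: graduation
* Overall policy
* Environment model
    * Predicts the next state and its probability
        * $P_{ss'}^{a}=P[S_{t+1}=s'|S_t=s,A_t=a]$
    * Predicts the next reward
* Environment state
* Reinforcement learning keeps balancing "getting to know the environment" against "maximizing the total return in the environment as currently known"
    * learning vs. planning
    * exploration (trying the unknown) vs. exploitation (using what is known, i.e. selecting the best currently known action from the options)
    * Choosing a restaurant
    * Choosing where to drill a well
    * Choosing a job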
<img src='./pic/Agent_Env.png'>
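
The exploration vs. exploitation trade-off can be sketched with an epsilon-greedy rule on the restaurant example: most of the time go to the restaurant with the best average so far, and occasionally try a random one. The notes do not prescribe this rule; the function names, the Gaussian reward noise, and the numbers are assumptions chosen for illustration.

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """Return an option index: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                        # explore
    return max(range(len(estimates)), key=lambda i: estimates[i])   # exploit

def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Repeatedly pick a restaurant and keep a running average score for each."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        i = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[i], 1.0)   # noisy satisfaction score (assumed)
        counts[i] += 1
        estimates[i] += (reward - estimates[i]) / counts[i]   # incremental mean
    return estimates, counts

# Three hypothetical restaurants with different average quality
estimates, counts = run_bandit([1.0, 2.0, 3.0])
```

With `epsilon=0` the agent never explores and can get stuck on the first restaurant it happens to like; with `epsilon=1` it never uses what it has learned.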

### 2.2 Policy and total value: a maze example (from David Silver's lecture notes)

<img src='pic/policy.png'>

<img src='pic/value.png'>

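### 2.3 Overview of reinforcement learning methods
* A simple solution to Flappy Bird
    * How to measure the total value
    * How to choose actions and an overall policy
* When the total value is hard to compute directly, but the environment's state distribution is **known explicitly**:
    * Use **iterative methods** to compute the total value (overall Value)
    * Use **iterative methods** to improve the overall policy repeatedly
    * Convergence of policy iteration
* When the total value is hard to compute directly and the environment's state distribution is **not known explicitly**:
    - Method 1: learn from sequences of samples and experience
        * **Monte Carlo** methods (trial and error over the available options)
        * Compute the total value
        * Update the overall policy
    - Method 2: learn from every single interaction with the environment
        * Temporal Differences
            * SARSA (on-policy)
            * Q-learning (off-policy):
                - Like Monte Carlo, Q-learning gathers experience by trial and error, but instead of waiting for a whole sequence to finish it iteratively updates its estimate of the overall Value after every step.
        * Temporal Differences compared with Monte Carlo methods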
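
To make the on-policy vs. off-policy distinction concrete, here is a minimal sketch of the two standard one-step update rules on a dictionary-based Q-table. The state names, the step size `alpha`, and the toy transition are assumptions for illustration, and terminal-state handling is left out.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next the policy actually takes next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best action in s_next, whatever is taken next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = ["left", "right"]
Q_sarsa = defaultdict(float)
Q_qlearn = defaultdict(float)
for Q in (Q_sarsa, Q_qlearn):
    Q[("s1", "left")] = 1.0
    Q[("s1", "right")] = 2.0

# Same transition for both: s0 --right, reward 0--> s1.
# The behaviour policy then happens to pick "left" in s1.
sarsa_update(Q_sarsa, "s0", "right", 0.0, "s1", "left")
q_learning_update(Q_qlearn, "s0", "right", 0.0, "s1", actions)
```

After this single step SARSA moves `Q[("s0", "right")]` toward the value of the action that was really taken, while Q-learning moves it toward the value of the best action, so the two tables already differ.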

* When there are **too many environment states**, how can a policy learned from a limited sample be generalized to the larger state space as an approximate solution?
    * Method 1: combine RL with supervised learning:
        - When there are too many environment states, it is not feasible to work out the action for every state. Instead, we can sample a limited number of states and run RL on those to obtain their actions.
        - For the remaining states, use supervised learning (e.g. decision trees, SVM, logistic regression, ...) to predict the action from the state, trained on the sampled state-action pairs obtained above.
        <img src='pic/SupervisedLearning+RL.png'>
    * Method 2: linear methods, etc.


* Q-learning + Deep Learning
    * DQN:
        - Example: a CNN extracts features from the environment, the extracted features serve as the environment state, and RL is then applied on top of them.
    * Advantages and characteristics of DQN
        <img src='./pic/DQN_Advantages.png'>

## 3. Markov Decision Processes


### 3.1 Markov Property: the present determines the future
- Definition: the Markov property means that the future evolution of a Markov process ($S_{t+1}$) depends only on the present state ($S_t$) and not on the past history ($S_1, S_2, ..., S_{t-1}$).
- Also known as the **memoryless property**
* $P[S_{t+1}|S_t] = P[S_{t+1}|S_1,S_2,S_3,...,S_t]$


### 3.2 Markov State Transition Matrix
- State transition probability:
    - the probability of moving from one state to another in a single time step
    - $P_{ss'}=P[S_{t+1}=s'|S_t=s]$ is the probability that state $s$ moves to state $s'$
* State transition matrix:


$$P_{n,n} =
 \begin{pmatrix}
  p_{1,1} & p_{1,2} & \cdots & p_{1,n} \\
  p_{2,1} & p_{2,2} & \cdots & p_{2,n} \\
  \vdots  & \vdots  & \ddots & \vdots  \\
  p_{n,1} & p_{n,2} & \cdots & p_{n,n}
 \end{pmatrix}$$

<img src='pic/Markov State Transition Matrix.png'>
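
As a small sketch of how a transition matrix is used, the snippet below builds one for a hypothetical three-state chain, checks that every row is a probability distribution, and computes the state distribution after one and two steps. The states and probabilities are invented for illustration and are unrelated to the figures in these notes.

```python
import numpy as np

# Hypothetical 3-state chain; row i holds the probabilities of leaving state i.
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

# Each row is a distribution over next states, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

start = np.array([1.0, 0.0, 0.0])                         # start in "sunny" for sure
after_one_step = start @ P                                # distribution at t+1
after_two_steps = start @ np.linalg.matrix_power(P, 2)    # distribution at t+2

# Sample one trajectory of 5 states; the next state depends only on the current one.
rng = np.random.default_rng(0)
state = 0
trajectory = [states[state]]
for _ in range(4):
    state = rng.choice(len(states), p=P[state])
    trajectory.append(states[state])
```

Multiplying a row vector of state probabilities by `P` advances the chain by one step, which is why powers of `P` give multi-step transition probabilities.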

* Compute the state transition matrix for the diagram below:
<img src='pic/mc_matrix.png'>

> Answer:


<img src='pic/CalculateMarkovStateTransitionMatrix.png'>

### 3.3 Markov Reward Process
* Definition: the current reward $R_s$ is the expected immediate reward in state $s$:
    > $R_s=E[R_{t+1}|S_t=s]$
* Future rewards: $R_{t+1}, R_{t+2}, ...$
* The return $G_t$ is the total discounted reward from time step $t$:
    > $G_t = R_{t+1} + \gamma R_{t+2}+... = \sum_{k=0}^{\infty}\gamma^kR_{t+k+1}$
    - **$\gamma$ is called the discount factor; it reflects how the time at which a reward arrives affects its value.**
    - For example, 100 CAD received today is not the same as 100 CAD received 10 years from now.
    - Usually, 100 CAD today is worth more than 100 CAD in 10 years, for instance because of inflation.
    - Therefore $\gamma \in [0,1]$: the later a reward arrives, the less it is worth.
* The value function $v(s)$ gives the long-term value of state $s$:
    > The state value function $v(s)$ of an MRP is the **expected return $G_t$** ***starting from state s***
    
    > $v(s) = E[G_t|S_t=s]$, where $E$ denotes expectation and $v$ stands for value


> Here we manually assign a reward to each state; the reward represents our mood. The overall value $v$ then measures our mood in the long run, which is what we want to be as high as possible.

<img src='pic/Rewards.png'>

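### 3.4 State value
* The expected total value when starting from state $S=s$ (as defined above):
> $v(s) = E[G_t|S_t=s]$
* Compute the total value (return) of each of the following sample paths of the Markov process starting from Class 1 (C1), with $\gamma = 1/2$:
    * C1, C2, C3, Pass, Sleep
    * C1, FB, C1, C2, Pub
    * C1, FB, FB, FB, FB
* Compute the total value of each path above when $\gamma = 0$
* Compute the total value of each path above when $\gamma = 1$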
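
The exercise above can be checked with a few lines of Python. This is only a sketch: the per-state rewards are assumed to be those of the Student MRP in David Silver's lecture (the figure in these notes may use different numbers), and each state's reward is counted at the step on which the state is visited. The second and third paths stop before reaching Sleep, so their results are returns of truncated paths.

```python
# Assumed rewards (David Silver's Student MRP); adjust to match the figure if needed.
REWARD = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0}

PATHS = [
    ["C1", "C2", "C3", "Pass", "Sleep"],
    ["C1", "FB", "C1", "C2", "Pub"],
    ["C1", "FB", "FB", "FB", "FB"],
]

def path_return(path, gamma):
    """Discounted return: sum over k of gamma^k times the reward of the k-th state."""
    return sum((gamma ** k) * REWARD[state] for k, state in enumerate(path))

# One list of three returns per discount factor
returns = {gamma: [path_return(p, gamma) for p in PATHS] for gamma in (0.5, 0.0, 1.0)}
```

With $\gamma = 0$ only the first reward counts, so all three paths are worth the same; with $\gamma = 1$ every reward counts equally, and the path that reaches Pass comes out ahead.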

* The value of the current state can be split into two parts: the immediate reward, and the discounted total value of the next state

$$
\begin{aligned}
v(s) &= E[G_t|S_t=s]\\
&=E[R_{t+1}+\gamma R_{t+2} + \gamma^2 R_{t+3} + ... |S_t=s]\\
&=E[R_{t+1}|S_t=s]+\gamma E[R_{t+2}+ \gamma R_{t+3}+ ...|S_t=s]\\
&=E[R_{t+1}|S_t=s]+\gamma E[G_{t+1}|S_t=s]
\end{aligned}
$$

Please note: given $S_t=s$, the next state $S_{t+1}$ is random. By the Markov property, the expected return from time $t+1$ onward depends only on $S_{t+1}$, so $E[G_{t+1}|S_t=s] = E[v(S_{t+1})|S_t=s]$.

$$
v(s)=E[R_{t+1}|S_t=s]+\gamma E[v(S_{t+1})|S_t=s]
$$

Using the earlier definition $R_s=E[R_{t+1}|S_t=s]$, we have:

$v(s)=R_s+\gamma E[v(S_{t+1})|S_t=s]$

The term $E[v(S_{t+1})|S_t=s]$ is a weighted sum over all possible next states: [next state's value] \* [weight], where the weight is the probability of moving to that next state:
> $\sum_{s'}P_{ss'}v(s')$

<img src='./pic/NextState_MarkovRewardProcess.png'>

* $v(s) = R_s + \gamma \sum_{s'}P_{ss'}v(s')$

- In matrix form (this is called the **Bellman Equation**), with $v$ and $R$ as column vectors holding one entry per state:
> $v = R + \gamma P v$

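### 3.5 Bellman Equation
* $v(s) = E[G_t|S_t=s]$ -> definition of the state value
* $v = R + \gamma P v$ -> Bellman Equation for the state value, in matrix form
* $P$ is the state transition matrix, i.e. the transition probabilities (see 3.2)
* Solving for the fixed point:
    - The true value function is a fixed point of the Bellman Equation: the same vector $v$ appears on both sides, so applying $R + \gamma P v$ to it leaves it unchanged.
    
    <img src='pic/Bellman_Equation_SteadyState.png'>
    - Rearranging $v = R + \gamma P v$ into $(I-\gamma P)v = R$ gives the steady-state $v$:
    * $v = (I-\gamma P)^{-1}R$
    
    
* This direct solution is only practical for small Markov chains, because solving the linear system costs on the order of $O(n^3)$ for $n$ states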
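
The direct solution can be sketched in NumPy and compared against simply iterating the Bellman Equation. The three-state MRP below is hypothetical (its numbers are not taken from the figures), and `np.linalg.solve` is used in place of an explicit matrix inverse, which is the usual numerical practice.

```python
import numpy as np

# Hypothetical 3-state MRP; state 2 is absorbing and gives no reward.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
    [0.0, 0.0, 1.0],
])
R = np.array([-1.0, 2.0, 0.0])
gamma = 0.9

# Direct solution: solve (I - gamma P) v = R.
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Iterative alternative: apply v <- R + gamma P v until it stops changing.
v_iter = np.zeros(3)
for _ in range(1000):
    v_new = R + gamma * P @ v_iter
    if np.max(np.abs(v_new - v_iter)) < 1e-12:
        break
    v_iter = v_new
```

Both routes reach the same fixed point here. The iterative route avoids the cubic cost of the linear solve, which is the reason the methods listed in 2.3 rely on iteration for larger problems.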


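* Example 1: a piece on a board that can only:
    * move up, down, left, or right, each with probability 1/4
    * stay on the board: a move that would leave the board keeps the piece where it is and costs 1 point
    * on reaching A or B, earn 10 or 5 points respectively and move to $A'$ or $B'$ on the next step
    * $\gamma=0.9$
    - Question: what is the steady-state $v(s)$? ($v(s)$ is a 5x5 matrix)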
<img src='pic/chess.png'>

<img src='pic/Bellman_Equation_Ex1.png'>
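
Example 1 can be reproduced numerically by building $P$ and $R$ for the 25 cells and solving the Bellman Equation directly. The sketch below assumes the cell positions used for this gridworld in Sutton and Barto's book (A in row 0, column 1, jumping to row 4, column 1; B in row 0, column 3, jumping to row 2, column 3); if the figure above places them differently, the four constants need to change.

```python
import numpy as np

N = 5
GAMMA = 0.9
A, A_PRIME = (0, 1), (4, 1)      # assumed positions (row, column)
B, B_PRIME = (0, 3), (2, 3)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, move):
    """Return (next_state, reward) for one move from `state`."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0           # bumped into the border: stay put, lose 1 point

def solve_gridworld():
    """Build P and R under the uniform random policy, then solve (I - gamma P) v = R."""
    P = np.zeros((N * N, N * N))
    R = np.zeros(N * N)
    for r in range(N):
        for c in range(N):
            s = r * N + c
            for move in MOVES:
                (nr, nc), reward = step((r, c), move)
                P[s, nr * N + nc] += 0.25
                R[s] += 0.25 * reward
    v = np.linalg.solve(np.eye(N * N) - GAMMA * P, R)
    return v.reshape(N, N)

v = solve_gridworld()
```

Under these assumptions, `np.round(v, 1)` gives a 5x5 table with the largest value at A and negative values along the bottom row, where the piece keeps bumping into the border.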

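* Example 2: five cells, where the rightmost cell has a reward of 1 and all the others have 0. The left and right ends are both terminal states. Use the Bellman equation to solve for $v$ (at steady state)
- Note: the action is either move left or move right, each with probability 1/2, or stay (i.e. in a terminal state)
- Assume $\gamma=1$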
<img src='pic/boxes.png'>

<img src='pic/Bellman_Equation_Ex2.png'>

### 3.6 Policy $\pi$
* $\pi (a|s) = P[A_t=a|S_t=s]$, where $a$ is an action (e.g. move left, move right, stay still, ...)
* The policy does not depend on time; it depends only on the state

A resource that helps with 3.7 and 3.8:
https://towardsdatascience.com/reinforcement-learning-markov-decision-process-part-2-96837c936ec3

### 3.7 State-value function vs. action-value function (state-action value)
State-value function (Bellman Expectation Equation for the state value):
* $v_{\pi}(s) = E_\pi [G_t|S_t=s]$
- Here we look for the value of a particular state under a given policy ($\pi$). This is the difference between the Bellman Equation defined above and the Bellman Expectation Equation here.

Action-value function (Bellman Expectation Equation for the state-action value):
* The action-value function has one extra argument, the action $a$: it is the expected total value after taking action $a$ in state $s$
* $q_{\pi}(s,a) = E_\pi[G_t|S_t=s,A_t=a]$
* The action-value function can be decomposed in the same way:
    * $q_{\pi}(s,a) = E_{\pi}[R_{t+1}+\gamma q_{\pi}(S_{t+1},A_{t+1})|S_t=s,A_t=a]$

* Relationship between the current $q_{\pi}(s,a)$ and the current $v_{\pi}(s)$
    - In backup diagrams, a hollow dot usually denotes a state and a solid dot denotes an action.
    - This backup diagram describes the value of being in a particular state. From state $s$, each available action $a$ is taken with probability $\pi(a|s)$, and each action has its own Q-value (state-action value). Averaging the Q-values, weighted by these probabilities, tells us how good it is to be in that state; this defines $v_\pi(s)$.
<img src='pic/v_q.png'>

* Relationship between the current $q_{\pi}(s,a)$ and the **next-step** $v_{\pi}(s')$
    - Tip: when moving from t to t+1, the reward $R_s^a$ has to be added (see the equation below); when staying at the same time step (t to t), it is not needed (see the equation above).
<img src='pic/q_v.png'>

* Relationship between the current $v_{\pi}(s)$ and the **next-step** $v_{\pi}(s')$
<img src='pic/v_v.png'>

* Relationship between the current $q_{\pi}(s,a)$ and the **next-step** $q_{\pi}(s',a')$
<img src='pic/q_q.png'>
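
The four relationships above can be checked numerically on a tiny MDP. The sketch below uses a made-up MDP with two states and two actions and a fixed stochastic policy; the array layout (`P[a, s, s2]`, `R[s, a]`, `pi[s, a]`) is an assumption of this example, not notation from the notes.

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP. P[a, s, s2] = probability of s -> s2 under action a.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],    # action 0
    [[0.1, 0.9], [0.7, 0.3]],    # action 1
])
R = np.array([[1.0, 0.0],        # R[s, a] = expected immediate reward
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],       # pi[s, a] = probability of action a in state s
               [0.2, 0.8]])

# Averaging over the policy turns the MDP into an MRP with P_pi and R_pi.
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = (pi * R).sum(axis=1)
v = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)      # v_pi

# q_pi(s, a) = R_s^a + gamma * sum over s' of P_ss'^a * v_pi(s')
q = R + gamma * np.einsum("ast,t->sa", P, v)

# v_pi(s) = sum over a of pi(a|s) * q_pi(s, a)
v_from_q = (pi * q).sum(axis=1)
```

Here `v_from_q` matches `v`, which is the first relationship; substituting it back into the line that computes `q` gives the q-to-q relationship.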

So, this is how the Bellman Expectation Equation can be formulated for a given MDP to find its state-value function and state-action value function.

However, it does not tell us the best way to behave in an MDP. For that, we need the notions of the optimal value function and the optimal policy.

### 3.8 Optimal State-Value Function and Optimal State-Action Value Function (Q-Function)
* Optimal State-Value Function:
    - It is the maximum value function over all policies.
> $v_{*}(s)=\max_{\pi}v_{\pi}(s)$
* Optimal State-Action Value Function (Q-Function):
    - It is the maximum action-value function over all policies.
> $q_{*}(s,a)=\max_{\pi}q_{\pi}(s,a)$
* For any MDP, an optimal total value and an optimal action value always exist

* Relationship between the current max-v and the current max-q
    - Note: the optimal policy picks only the single action $a$ that maximizes $q$ (giving $q_*$), so every other action has probability 0 under $\pi$
<img src='pic/max_v.png'>

> This is called the **Bellman Optimality Equation for the State-Value Function**

* Relationship between the current max-q and the next-step max-v
<img src='pic/max_q.png'>

> This is called the **Bellman Optimality Equation for the State-Action Value Function**

* Relationship between the current max-v and the next-step max-v
<img src='pic/max_v_v.png'>

> This is called the **Bellman Optimality Equation for the State-Value Function from the Backup Diagram**

* Relationship between the current max-q and the next-step max-q
<img src='pic/max_q_q.png'>

> This is called the **Bellman Optimality Equation for the State-Action Value Function from the Backup Diagram**

* In general, neither the optimal state value nor the optimal action value can be solved for directly, because doing so requires:
    * knowing the dynamics of the environment
    * having enough computing power
    * the Markov property to hold
* When they cannot be solved directly, approximate methods are needed
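
Because of the max, the Bellman Optimality Equation is not linear and has no closed-form solution like $(I-\gamma P)^{-1}R$. One common iterative approach is value iteration, sketched below on a hypothetical MDP with two states and two actions whose dynamics are fully known. The function name and array layout are assumptions of this example.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Iterate v(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) v(s')] until it settles.

    P[a, s, s2] holds transition probabilities, R[s, a] expected rewards.
    Returns the value estimate, the matching q-table, and the greedy policy.
    """
    v = np.zeros(R.shape[0])
    for _ in range(max_iter):
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return v, q, q.argmax(axis=1)

# Hypothetical MDP, made up for illustration
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],    # action 0
    [[0.1, 0.9], [0.7, 0.3]],    # action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

v_star, q_star, greedy_policy = value_iteration(P, R, gamma=0.9)
```

The greedy policy picks, in each state, the single action with the largest $q_*$ and gives every other action probability 0, as in the note on max-v and max-q above.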

## 4. A Simple Solution to Flappy Bird

- Instead of using DQN (Image -> CNN -> Feature -> RL), we can simplify the problem by defining the features manually and feeding them to RL directly (i.e. Feature -> RL).
- Advantage:
    - Hand-crafted features save the large amount of training a CNN needs to discover useful features on its own (simplified method: a few hundred rounds of training; DQN: millions).
- Disadvantage:
    - Hand-crafted features do not carry over to other problems (e.g. other games). DQN can be applied to other games because the CNN works out the features on its own and we only need to provide the images.
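
A minimal sketch of the "Feature -> RL" idea is shown below: two hand-picked features are bucketed into a discrete state, which then indexes a Q-table. The notes do not say which features are used, so the choice here (horizontal and vertical distance from the bird to the next pipe gap), the bucket width, and the action encoding are all hypothetical.

```python
from collections import defaultdict

ACTIONS = (0, 1)     # assumed encoding: 0 = do nothing, 1 = flap

def featurize(dx, dy, cell=10):
    """Bucket two hand-picked features into a discrete state.

    dx, dy: horizontal and vertical pixel distance from the bird to the next
            pipe gap (hypothetical features)
    cell:   bucket width in pixels
    """
    return (int(dx // cell), int(dy // cell))

def greedy_action(Q, state):
    """Pick the action with the highest Q-value for this state."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])

Q = defaultdict(float)                  # Q[(state, action)] -> estimated value
state = featurize(dx=123, dy=-37)       # example reading from a hypothetical game
Q[(state, 1)] = 0.5                     # pretend flapping has looked better so far
action = greedy_action(Q, state)
```

Bucketing keeps the table small, which fits the short training time mentioned above, but the features only make sense for this one game.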