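## Reinforcement learning vs. supervised learning

* Supervised learning is like an adult telling a child directly which actions are right and which are wrong. The child then knows right from wrong from those labels. There is no reward mechanism in this process, and nothing is fed back from the environment.
* Reinforcement learning is like an adult who explains nothing in advance: when the child swears, the child gets spanked (negative reward), and when the child does the right thing, the child gets a piece of candy (positive reward). In the end the child still builds a sense of right and wrong, and to earn more candy the child tries to avoid doing the wrong thing.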



## Categories:

Q-learning

SARSA

DQN
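
As a rough orientation, the first two methods differ only in the target they bootstrap from, while DQN keeps the Q-learning target but replaces the table with a neural network. In standard notation (the symbols $\alpha$, $\gamma$, $r$, $s'$, $a'$ are assumed here, the notes do not define them):

$$
\text{Q-learning:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]
$$

$$
\text{SARSA:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\, Q(s',a') - Q(s,a)\right]
$$

In SARSA, $a'$ is the action the agent actually takes in $s'$; in Q-learning it is whichever action currently has the highest value.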

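## Components:

### Policy:

Takes the current state as input and outputs the decision for the next action.<br>

### Value Function:

Measures how good the current state or action is, in terms of the rewards and penalties expected to follow. It is usually a single scalar value.<br>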
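
For reference, the usual textbook definitions can be written as follows. This is a sketch in assumed notation: $\pi$ is the policy, $\gamma \in [0, 1]$ the discount factor, and $r_{t+1}$ the reward received after step $t$.

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a\right]
$$

$V^{\pi}$ scores a state and $Q^{\pi}$ scores a state-action pair.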


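### Model:

The agent's internal representation of how the surrounding environment changes.<br>

A model is not required. Q-Learning, for example, is model-free: it learns from sampled transitions without a model of the environment.<br>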

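## This reinforcement learning course concentrates on temporal-difference learning, with emphasis on three of the most basic methods: Q-learning, Sarsa, and Policy Gradient.

## !!Basics only!!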


# Q-Learning:

![](https://cdn.aibydoing.com/aibydoing/images/document-uid214893labid6102timestamp1531891479974.png)

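* We don't want the lion to step on the mine; we want it to reach the ham.
* We want the path to be as short as possible.

It is easy to see that 5 steps are needed, and the number of possibilities is finite.<br>
With conventional programming, we could remove the mine's cell from the grid and solve a shortest-path problem.<br>
But we want the game to be replayable: for example, each round changes the relative positions of the mine and the ham while the lion's position stays fixed. The goal is always to eat the ham and avoid the mine, which is what makes it interesting. I don't know whether this is achievable, since it seems quite hard with conventional programming, where the programmer has to hand-code the solution. We want the model to learn the pattern by itself.<br>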
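
The conventional shortest-path baseline mentioned above can be sketched with a breadth-first search. This is only an illustration: the function name `shortest_path_length` is hypothetical, and the coordinates are the ones used in the notes' code (lion at `(0, 0)`, ham at `(3, 2)`, mine at `(2, 1)`).

```python
from collections import deque


def shortest_path_length(start, goal, blocked, size=4):
    """Breadth-first search on a size x size grid; returns the step count or None."""
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (row, col), steps = queue.popleft()
        if (row, col) == goal:
            return steps
        for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (row + d_row, col + d_col)
            inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            # Skip cells outside the grid, the mine, and cells already queued
            if inside and nxt not in blocked and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, steps + 1))
    return None


print(shortest_path_length((0, 0), (3, 2), blocked={(2, 1)}))  # 5
```

The search has to be rerun by hand-written code whenever the layout changes, which is the limitation the paragraph above points at.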

In [2]:
import pandas as pd
import numpy as np
import time
from IPython import display  # imported to make it easier to show the program's progress


def init_env():
    start = (0, 0)
    terminal = (3, 2)
    hole = (2, 1)
    env = np.array([["_ "] * 4] * 4)  # build a 4*4 environment
    env[terminal] = "$ "  # destination
    env[hole] = "# "  # trap
    env[start] = "L "  # the little lion
    interaction = ""
    for i in env:
        interaction += "".join(i) + "\n"
    print(interaction)


init_env()

L _ _ _ 
_ _ _ _ 
_ # _ _ 
_ _ $ _ 



In [3]:
def init_q_table():
    # Initialize the Q-Table
    actions = np.array(["up", "down", "left", "right"])
    q_table = pd.DataFrame(
        np.zeros((16, len(actions))), columns=actions
    )  # initialize the Q-Table to all zeros
    return q_table


init_q_table()

state,up,down,left,right
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0


In [4]:
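def act_choose(state, q_table, epsilon):
    """
    Parameters:
    state -- state (row index into the Q-Table)
    q_table -- Q-Table
    epsilon -- probability of choosing the greedy (highest-valued) action

    Returns:
    action -- the next action
    """
    state_act = q_table.iloc[state, :]
    actions = np.array(["up", "down", "left", "right"])

    # Explore with probability 1 - epsilon, or when this state's row is still all zeros
    if np.random.uniform() > epsilon or (state_act == 0).all():
        action = np.random.choice(actions)
    else:
        action = state_act.idxmax()
    return action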
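
When testing whether a state's row of Q-values is still untouched, two similar-looking pandas expressions behave differently. A small sketch with a made-up row (the values are illustrative, not taken from a run):

```python
import pandas as pd

row = pd.Series([0.0, -1.0, 2.5, 0.0], index=["up", "down", "left", "right"])

# Series.all() is False as soon as one entry is 0, so this is True here
any_entry_zero = row.all() == 0
# This is True only when every entry is 0, so it is False here
every_entry_zero = (row == 0).all()

print(any_entry_zero, every_entry_zero)  # True False
```

With the first form, a row that has learned some values but still contains a single zero keeps triggering the random branch; the second form only does so for a row that is entirely zero.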

In [5]:
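def env_feedback(state, action, hole, terminal):
    """
    Parameters:
    state -- state
    action -- action
    hole -- trap position
    terminal -- terminal position

    Returns:
    next_state -- next state
    reward -- reward
    end -- end flag (0: ongoing, 1: fell into the trap, 2: reached the terminal)
    """
    # Feedback for the action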
    reward = 0.0
    end = 0
    a, b = state
    if action == "up":
        a -= 1
        if a < 0:
            a = 0
        next_state = (a, b)
    elif action == "down":
        a += 1
        if a >= 4:
            a = 3
        next_state = (a, b)
    elif action == "left":
        b -= 1
        if b < 0:
            b = 0
        next_state = (a, b)
    elif action == "right":
        b += 1
        if b >= 4:
            b = 3
        next_state = (a, b)

    if next_state == terminal:
        reward = 10.0
        end = 2
    elif next_state == hole:
        reward = -10.0
        end = 1
    else:
        reward = -1.0
    return next_state, reward, end
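
The same boundary and reward rules can be sketched more compactly with a table of move offsets and clamping. This is an alternative illustration, not the code used in these notes; the names `MOVES` and `step` are hypothetical.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action, hole=(2, 1), terminal=(3, 2), size=4):
    """Return (next_state, reward, end) using the same rules as env_feedback."""
    d_row, d_col = MOVES[action]
    # Clamp to the grid so that a move into a wall leaves the lion in place
    row = min(max(state[0] + d_row, 0), size - 1)
    col = min(max(state[1] + d_col, 0), size - 1)
    next_state = (row, col)
    if next_state == terminal:
        return next_state, 10.0, 2
    if next_state == hole:
        return next_state, -10.0, 1
    return next_state, -1.0, 0


print(step((0, 0), "up"))     # ((0, 0), -1.0, 0)
print(step((3, 1), "right"))  # ((3, 2), 10.0, 2)
print(step((1, 1), "down"))   # ((2, 1), -10.0, 1)
```

Every ordinary move costs -1, which is what pushes the learned route toward the shortest path.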

In [6]:
def update_q_table(q_table, state, action, next_state, terminal, gamma, alpha, reward):
    """
    Parameters:
    q_table -- Q-Table
    state -- state
    action -- action
    next_state -- next state
    terminal -- terminal position
    gamma -- discount factor
    alpha -- learning rate
    reward -- reward

    Returns:
    q_table -- the updated Q-Table
    """
    # Update function for the Q-Table
    x, y = state
    next_x, next_y = next_state
    q_original = q_table.loc[x * 4 + y, action]
    if next_state != terminal:
        q_predict = reward + gamma * q_table.iloc[next_x * 4 + next_y].max()
    else:
        q_predict = reward
    q_table.loc[x * 4 + y, action] = (1 - alpha) * q_original + alpha * q_predict
    return q_table
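
Written as a formula, the update performed by `update_q_table` is the tabular Q-learning rule. The notation is assumed here: $s'$ is `next_state`, $r$ is `reward`, and $\alpha$, $\gamma$ are the `alpha` and `gamma` arguments.

$$
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]
$$

When $s'$ is the terminal state, the bracket reduces to $r$, which corresponds to the `else` branch. The rule is algebraically the same as $Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$.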

In [7]:
def show_state(end, state, episode, step, q_table):
    """
    Parameters:
    end -- end flag
    state -- state
    episode -- episode number
    step -- step number within the episode
    q_table -- Q-Table
    """
    # Helper function for visualizing the state
    terminal = (3, 2)
    hole = (2, 1)
    env = np.array([["_ "] * 4] * 4)
    env[terminal] = "$ "
    env[hole] = "# "
    env[state] = "L "
    interaction = ""
    for i in env:
        interaction += "".join(i) + "\n"

    if state == terminal:
        message = "EPISODE: {}, STEP: {}".format(episode, step)
        interaction += message
        display.clear_output(wait=True)  # clear the previous output
        print(interaction)
        print("\n" + "q_table:")
        print(q_table)
        time.sleep(3)  # wait 3 seconds after successfully reaching the terminal
    else:
        display.clear_output(wait=True)
        print(interaction)
        print(q_table)
        time.sleep(0.3)  # controls how long each step takes

In [8]:
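def q_learning(max_episodes, alpha, gamma, epsilon):
    """
    Parameters:
    max_episodes -- maximum number of episodes
    alpha -- learning rate
    gamma -- discount factor
    epsilon -- probability of choosing the greedy action

    Returns:
    q_table -- the updated Q-Table
    """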
    q_table = init_q_table()
    terminal = (3, 2)
    hole = (2, 1)
    episodes = 0
    while episodes <= max_episodes:
        step = 0
        state = (0, 0)
        end = 0
        show_state(end, state, episodes, step, q_table)
        while end == 0:
            x, y = state
            act = act_choose(x * 4 + y, q_table, epsilon)  # choose an action
            next_state, reward, end = env_feedback(state, act, hole, terminal)  # environment feedback
            q_table = update_q_table(
                q_table, state, act, next_state, terminal, gamma, alpha, reward
            )  # update the Q-Table
            state = next_state
            step += 1
            show_state(end, state, episodes, step, q_table)
        if end == 2:  # only episodes that reach the terminal are counted
            episodes += 1

    return q_table

In [23]:
q_learning(max_episodes=60, alpha=0.8, gamma=0.9, epsilon=0.9)

_ _ _ _ 
_ _ _ _ 
_ # _ _ 
_ _ L _ 
EPISODE: 60, STEP: 5

q_table:
           up      down      left      right
0   -3.368121  3.122000  1.605217  -3.349864
1   -3.112465 -2.656223 -3.146730   0.894333
2   -2.213120  5.069215 -2.916351  -1.857024
3   -1.718528 -0.999936 -0.999680  -1.536000
4    0.907711  4.580000  2.878692   0.300440
5   -2.475887 -8.000000 -2.271360   5.100961
6   -0.999936  7.856577 -1.651200  -0.998400
7   -1.891973  2.464000 -0.960000   0.000000
8   -2.780328  6.200000  4.525414  -9.999872
9    0.000000  0.000000  0.000000   0.000000
10  -1.651200  9.984000 -9.920000   2.771200
11   0.000000  7.072000  4.960000   2.611200
12   4.531180  6.187589 -1.683200   8.000000
13 -10.000000  7.999205  5.951967  10.000000
14   0.000000  0.000000  0.000000   0.000000
15   0.000000  5.952000  9.920000   0.000000
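
To read a route off a Q-Table like the one printed above, one can follow the highest-valued action from each state. The sketch below is an illustration only: `greedy_path` and the toy table are hypothetical, and the toy values merely mark one preferred action per state rather than reproducing the learned numbers.

```python
import numpy as np
import pandas as pd

ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def greedy_path(q_table, start=(0, 0), terminal=(3, 2), hole=(2, 1), max_steps=16):
    """Follow the highest-valued action from each state until the episode ends."""
    state, path = start, []
    for _ in range(max_steps):
        if state in (terminal, hole):
            break
        action = q_table.iloc[state[0] * 4 + state[1]].idxmax()
        d_row, d_col = MOVES[action]
        # Clamp to the 4x4 grid, as the environment does
        state = (min(max(state[0] + d_row, 0), 3), min(max(state[1] + d_col, 0), 3))
        path.append(action)
    return path


# Toy table (hypothetical values): one preferred action for each state on the route
toy = pd.DataFrame(np.zeros((16, 4)), columns=ACTIONS)
for idx, best in {0: "down", 4: "down", 8: "down", 12: "right", 13: "right"}.items():
    toy.loc[idx, best] = 1.0

print(greedy_path(toy))  # ['down', 'down', 'down', 'right', 'right']
```

In the printed table the largest entries of rows 0, 4, 8, 12 and 13 are down, down, down, right and right, so the same walk applies there.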


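By the end, my lion always goes down, down, down, right, right. A Q-Table like this is somewhat rigid in how it finds the optimum.<br>

It lists every possible action at every position, so the storage cost is still large. If we played Gomoku this way, we would have to record every possible board position.<br>

And then solve each of those positions for its best move, one by one.<br>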
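
A quick back-of-the-envelope comparison of table sizes makes the storage concern concrete. The Gomoku figure is a loose upper bound under two assumptions that are not from these notes: a 15 x 15 board, and every cell freely taking one of three values (which also counts unreachable positions).

```python
grid_entries = 16 * 4             # 16 states x 4 actions in the grid-world Q-Table
gomoku_boards = 3 ** (15 * 15)    # assumed 15 x 15 board; each cell empty, black, or white

print(grid_entries)               # 64
print(len(str(gomoku_boards)))    # 108 (number of decimal digits)
```

A table with one row per board position is therefore far beyond what can be stored explicitly.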

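In mathematical terms, the process keeps correcting a function so that it fits an optimal solution.<br>

There are drawbacks: every time the target changes, the table has to be retrained, and a run does not find all optimal solutions, only one of the shortest paths. [Admittedly, asking for all of them is a bit much.]<br>

That said, I expect this approach could still do reasonably well at Gomoku; the number of parameters would just be enormous.<br>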
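
The "keep correcting the function" idea can be seen in a small worked example: repeatedly applying the update to a single entry whose move leads straight into the terminal, so the target is just the reward. The settings `alpha=0.8` and reward `10.0` are the ones used in the run above; the loop itself is only a sketch.

```python
alpha, reward = 0.8, 10.0
q = 0.0
history = []
for _ in range(4):
    # Terminal transition: the target is the reward alone
    q = (1 - alpha) * q + alpha * reward
    history.append(round(q, 3))

print(history)  # [8.0, 9.6, 9.92, 9.984]
```

Each update closes 80% of the remaining gap to 10. The values 9.92 and 9.984 are consistent with the entries for the moves into the terminal in the first printed table (row 15 `left` and row 10 `down`).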

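## Making the game more interesting.

I randomized the lion's spawn point (excluding the hole and the terminal).<br>

The core of the algorithm is unchanged. If the ham and the trap were to move as well, the current Q-Table could no longer be static.<br>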

In [9]:
import random

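def random_start_q_learning(max_episodes, alpha, gamma, epsilon):
    """
    Parameters:
    max_episodes -- maximum number of episodes
    alpha -- learning rate
    gamma -- discount factor
    epsilon -- probability of choosing the greedy action (exploration happens with probability 1 - epsilon)

    Returns:
    q_table -- the updated Q-Table
    """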
    q_table = init_q_table()
    terminal = (3, 2)
    hole = (2, 1)
    
    # All valid starting cells of the grid, excluding hole and terminal
    valid_start_positions = [(i, j) for i in range(4) for j in range(4) if (i, j) != hole and (i, j) != terminal]

    episodes = 0
    while episodes <= max_episodes:
        step = 0

        # Pick a random starting cell, excluding hole and terminal
        state = random.choice(valid_start_positions)

        end = 0
        show_state(end, state, episodes, step, q_table)
        while end == 0:
            x, y = state
            act = act_choose(x * 4 + y, q_table, epsilon)  # choose an action
            next_state, reward, end = env_feedback(state, act, hole, terminal)  # environment feedback
            q_table = update_q_table(
                q_table, state, act, next_state, terminal, gamma, alpha, reward
            )  # update the Q-Table
            state = next_state
            step += 1
            show_state(end, state, episodes, step, q_table)
        if end == 2:
            episodes += 1

    return q_table


In [10]:
random_start_q_learning(max_episodes=60, alpha=0.8, gamma=0.9, epsilon=0.9)

_ _ _ _ 
_ _ _ _ 
_ # _ _ 
_ _ L _ 
EPISODE: 60, STEP: 5

q_table:
          up       down      left     right
0  -1.683200  -1.798400 -1.683200  3.120820
1   2.153950   4.579999 -1.804800  3.465572
2  -0.960000   6.199998 -1.683200 -1.821440
3  -1.536000  -0.998400  4.569193 -1.683200
4  -1.376000  -1.536000 -1.536000  4.580000
5   3.064217  -9.600000  2.265684  6.200000
6   3.886458   8.000000  3.504000  2.468968
7   2.742232   4.656794  5.912491  2.628968
8   3.114336  -1.711770 -2.213120 -8.000000
9   0.000000   0.000000  0.000000  0.000000
10  6.143953  10.000000 -8.000000  2.771200
11  2.628968   7.982234  7.532800  5.581517
12 -2.213120  -1.821440 -1.712640  4.760013
13 -9.920000  -0.999680 -1.717248  9.999360
14  0.000000   0.000000  0.000000  0.000000
15  3.485440   4.960000  9.999872 -0.800000


