## 强化学习与监督学习的区别

* 如果按照监督学习的方式，类似于大人会给小孩说，这样做是对的，那样做是错误的。然后，小孩就能知道怎样做是对的，怎样是错误的了。这个过程并不存在奖励机制，无法从环境中获得回报。
* 如果按照强化学习的方式，类似于大人什么都不告诉小孩，当小孩说脏话时就会挨打（负奖励），小孩做正确的事情时就给一颗糖吃（正奖励）。最后，小孩也就建立起是非观念了。为了得到更多的糖吃，小孩就会尽量避免做错误的事情。



## 分类:

Q-learning

SARSA

DQN

## 组件:

### Policy 策略函数:

当前状态作为输入，下一步行动决策作为输出。<br>

### Value Function:

衡量当前状态或者行为的好坏，奖励和惩罚的函数。通常是一个value。<br>


### Model:

智能体(Agent)感知周围环境变化的模式。<br>

Agent并不是必须，比如Q-Learning就是不需要智能体的。<br>

## 此次强化学习的全部课程集中关注于时间差分学习的部分，并着重于 Q-learning、Sarsa 以及 Policy Gradient 三种最基础的方法。

## !!只是基础!!


# Q-Learning:

![](https://cdn.aibydoing.com/aibydoing/images/document-uid214893labid6102timestamp1531891479974.png)

* 我们不希望狮子碰到地雷，希望他吃到火腿。
* 我们希望路径最短。

很容易可以看出来，需要5步。并且所有可能性都是有限的。<br>
如果是编程，可以挖掉地雷的位置，然后做最短路径问题。<br>
但是我们希望这种游戏是可以重复性游玩的，比如每次游玩的时候，变换地雷和火腿的相对位置，保持狮子的位置不变。目的始终是吃火腿，躲地雷，那么这样才有意思。当然，我不知道能不能做到，因为这种情况，似乎传统编程的难度相当大，而且是编程者来硬解。我们希望是由model自己来学习这个模式。<br>

ps:后来发现，Q-Table的本质并不允许更改火腿和地雷的位置。<br>

In [1]:
import pandas as pd
import numpy as np
import time
from IPython import display  # 引入 display 模块目的方便程序运行展示


def init_env(width = 4,height = 4):
    start = (0, 0)
    terminal = (int(width/3*2), int(height/2))
    hole = (int(width/3),int(height/3))
    env = np.array([["_ "] * width] * height)  # 建立一个 4*4 的环境
    env[terminal] = "$ "  # 目的地
    env[hole] = "# "  # 陷阱
    env[start] = "L "  # 小狮子
    interaction = ""
    for i in env:
        interaction += "".join(i) + "\n"
    print(interaction)


init_env()

L _ _ _ 
_ # _ _ 
_ _ $ _ 
_ _ _ _ 



In [2]:
import numpy as np
import pandas as pd

def init_q_table(width=4, height=4):
    # 定义动作
    actions = np.array(["up", "down", "left", "right"])
    
    # 初始化 Q-Table 全为 0
    q_table = pd.DataFrame(np.zeros((width * height, len(actions))), columns=actions)
    
    # 遍历所有的格子，检查边缘情况
    for x in range(width):
        for y in range(height):
            state = x * height + y  # 映射关系
            
    
    return q_table

# 调用函数初始化 Q-Table
q_table = init_q_table()
q_table


Unnamed: 0,up,down,left,right
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0


In [3]:
def act_choose(state, q_table, epsilon):
    """
    参数:
    state -- 状态
    q_table -- Q-Table
    epsilon -- 概率值

    返回:
    action --下一步动作
    """
    state_act = q_table.iloc[state, :]
    actions = np.array(["up", "down", "left", "right"])

    if np.random.uniform() > epsilon or state_act.all() == 0:
        action = np.random.choice(actions)
    else:
        action = state_act.idxmax()
    return action

In [4]:
def env_feedback(state, action, hole, terminal,width=4,height=4):
    """
    参数:
    state -- 状态
    action -- 动作
    hole -- 陷阱位置
    terminal -- 终点位置

    返回:
    next_state -- 下一状态
    reward -- 奖励
    end --结束标签
    """
    # 越过边缘惩罚-10.0
    # 陷阱惩罚-10.0
    # 火腿奖励10.0
    
    # 行为反馈
    reward = 0.0
    end = 0
    a, b = state
    if action == "up":
        a -= 1
        if a < 0:
            a = 0
            reward = -10.0
        next_state = (a, b)
        
    elif action == "down":
        a += 1
        if a >= height:
            a = height-1
            reward = -10.0
        next_state = (a, b)
    elif action == "left":
        b -= 1
        if b < 0:
            b = 0
            reward = -10.0
        
        next_state = (a, b)
    elif action == "right":
        b += 1
        if b >= width:
            b = width-1
            reward = -10.0
        next_state = (a, b)
        

    if next_state == terminal:
        reward = 10.0
        end = 2
    elif next_state == hole:
        reward = -10.0
        end = 1
    else:
        if reward != -10.0:
          reward = -1.0
    return next_state, reward, end

In [5]:
def update_q_table(q_table, state, action, next_state, terminal, gamma, alpha, reward,width=4,height=4):
    """
    参数:
    q_table -- Q-Table
    state -- 状态
    action -- 动作
    next_state -- 下一状态
    terminal -- 终点位置
    gamma -- 折损因子
    alpha -- 学习率
    reward -- 奖励

    返回:
    q_table -- 更新后的Q-Table
    """
    # Q-Table 的更新函数
    x, y = state
    next_x, next_y = next_state
    q_original = q_table.loc[x * height + y, action]
    if next_state != terminal:
        q_predict = reward + gamma * q_table.iloc[next_x * height + next_y].max()
    else:
        q_predict = reward
    q_table.loc[x * 4 + y, action] = (1 - alpha) * q_original + alpha * q_predict
    return q_table

In [6]:
def show_state(end, state, episode, step, q_table,width=4,height=4):
    """
    参数:
    end -- 结束标签
    state -- 状态
    episode -- 迭代次数
    step --迭代步数
    q_table-- Q-Table
    """
    # 状态可视化辅助函数
    terminal = (int(width/3*2), int(height/2))
    hole = (int(width/3),int(height/3))
    env = np.array([["_ "] * width] * height)
    env[terminal] = "$ "
    env[hole] = "# "
    env[state] = "L "
    interaction = ""
    for i in env:
        interaction += "".join(i) + "\n"

    if state == terminal:
        message = "EPISODE: {}, STEP: {}".format(episode, step)
        interaction += message
        display.clear_output(wait=True)  # 清除输出内容
        print(interaction)
        print("\n" + "q_table:")
        print(q_table)
        time.sleep(3)  # 在成功到终点时，等待 3 秒
    else:
        display.clear_output(wait=True)
        print(interaction)
        print(q_table)
        time.sleep(0.3)  # 在这里控制每走一步所需要时间

In [7]:
def q_learning(max_episodes, alpha, gamma, epsilon,width=4,height=4):
    """
    参数:
    max_episodes -- 最大迭代次数
    alpha -- 学习率
    gamma -- 折损因子
    epsilon -- 概率值

    返回:
    q_table -- 更新后的Q-Table
    """
    q_table = init_q_table()
    terminal = (int(width/3*2), int(height/2))
    hole = (int(width/3),int(height/3))
    episodes = 0
    while episodes <= max_episodes:
        step = 0
        state = (0, 0)
        end = 0
        show_state(end, state, episodes, step, q_table)
        while end == 0:
            x, y = state
            act = act_choose(x * height + y, q_table, epsilon)  # 动作选择
            next_state, reward, end = env_feedback(state, act, hole, terminal)  # 环境反馈
            q_table = update_q_table(
                q_table, state, act, next_state, terminal, gamma, alpha, reward
            )  # q-table 更新
            state = next_state
            step += 1
            show_state(end, state, episodes, step, q_table)
        if end == 2:
            episodes += 1

In [8]:
q_learning(max_episodes=15, alpha=0.8, gamma=0.9, epsilon=0.9)

_ _ _ _ 
_ # _ _ 
_ _ L _ 
_ _ _ _ 
EPISODE: 15, STEP: 6

q_table:
           up      down      left     right
0  -11.112960  4.570999 -7.075289  0.966886
1   -8.000000 -9.984000 -1.894645  5.274991
2   -9.920000  7.913718 -1.853696 -0.960000
3   -4.454900  1.122685  4.923750  0.000000
4   -1.651200  6.198483 -9.999360 -9.600000
5    0.000000  0.000000  0.000000  0.000000
6    4.856744  9.984000 -8.000000 -0.460937
7    2.892618  4.800000  6.112000  0.000000
8   -0.992000 -0.998400 -4.346880  7.999866
9   -8.000000  2.468352  5.835090  9.999974
10   0.000000  0.000000  0.000000  0.000000
11   2.656000  4.007680  9.600000 -2.240000
12   4.798258  0.000000 -8.000000 -0.960000
13  -0.800000  0.000000  0.000000  5.912320
14   9.600000 -8.000000 -0.960000  3.468288
15   7.342080 -8.000000 -0.800000 -9.600000


到了后面，我的狮子路线是固定的。<br>

它把所有位置的所有可能都列出来，存储量还是很大的，如果下五子棋也这么搞，也就是要记录所有的残局情况。<br>

然后一次次解出残局的最优解。<br>

用数学函数来说，实际上就是不断修正函数，来让函数来拟合某一个最优解。<br>

有缺点，这里，每次更换目标都需要重新训练，而且每一次都不能找出所有最优解，只是找出一个局部最优解。【实际上这点也有点为难人了。】<br>

但是，实际上，这个应该在五子棋上面表现不错。只是参数量估计相当大。<br>

## 让游戏更有意思一些。

我把狮子的出生点随机化了(陷阱和hole除外）,然后把地图变成6x6<br>

实际上没有改动算法根基，如果要动火腿和陷阱，目前的Q-Table就不能是静态的。<br>

In [9]:
import random

def random_start_q_learning(max_episodes, alpha, gamma, epsilon,width=4,height=4):
    """
    参数:
    max_episodes -- 最大迭代次数
    alpha -- 学习率
    gamma -- 折损因子
    epsilon -- 探索概率值

    返回:
    q_table -- 更新后的Q-Table
    """
    q_table = init_q_table(width=width,height=height)
    terminal = (int(width/3*2), int(height/2))
    hole = (int(width/3),int(height/3))
    
    # 定义网格的所有合法起点，排除 hole 和 terminal
    valid_start_positions = [(i, j) for i in range(width) for j in range(height) if (i, j) != hole and (i, j) != terminal]

    episodes = 0
    while episodes <= max_episodes:
        step = 0

        # 随机选择起始点，排除 hole 和 terminal
        state = random.choice(valid_start_positions)

        end = 0
        show_state(end, state, episodes, step, q_table,width=height,height=height)
        while end == 0:
            x, y = state
            act = act_choose(x * width + y, q_table, epsilon)  # 动作选择
            next_state, reward, end = env_feedback(state, act, hole, terminal,width=width,height=height)  # 环境反馈
            q_table = update_q_table(
                q_table, state, act, next_state, terminal, gamma, alpha, reward,width=width,height=height
            )  # q-table 更新
            state = next_state
            step += 1
            show_state(end, state, episodes, step, q_table,width=width,height=height)
        if end == 2:
            episodes += 1

    return q_table


In [None]:
random_start_q_learning(max_episodes=100, alpha=0.8, gamma=0.9, epsilon=0.9,width=6,height=6)

_ _ _ _ _ L 
_ _ _ _ _ _ 
_ _ # _ _ _ 
_ _ _ _ _ _ 
_ _ _ $ _ _ 
_ _ _ _ _ _ 

          up      down      left     right
0  -6.588574  1.189814 -6.588574  2.050207
1  -6.938713  3.389120  0.516046  0.750078
2  -9.920000 -0.960000  2.050208 -0.960000
3  -8.192375  0.151714  0.845182 -0.519958
4   0.851645 -3.031040 -5.253015  3.041588
5  -8.561375  1.431761  1.291495 -8.698250
6   0.880065 -8.275200  2.985853  1.651458
7  -0.034564  4.876800 -0.291594 -2.086400
8   1.019574  2.656000 -0.235093  1.145409
9   0.771260 -0.069478 -2.284800 -5.815538
10  0.000000  0.000000  0.000000  0.000000
11  2.701957  0.192000 -7.424000  1.195264
12 -1.651200 -1.280000 -8.000000  0.852122
13  1.337665  4.800000 -0.800000  4.960000
14 -7.584000 -2.400000  0.800000  0.000000
15  5.836800  7.840000  0.000000  0.000000
16  1.195264 -2.400000  2.771200  4.800000
17  0.961280 -0.800000 -0.800000 -2.240000
18 -1.376000  0.000000  0.000000  0.000000
19  0.000000  0.000000  0.000000  0.000000
20 -0.800000 -8.00

缺少徘徊惩罚，如果陷入徘徊会导致训练卡死。<br>
至少效率是很低的<br>

但是目前的架构似乎是加不进来徘徊惩罚的，因为徘徊是二步，四步等等的循环。<br>

它不是单独一步的惩罚，任何单独一步可能都是正常的，但是连起来就变成了循环，就不正常。<br>

待解决。<br>