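As one of the earliest and most classic reinforcement learning algorithms, Q-Learning is a value-based method. It takes its name from the Q function (the action-value function), which it learns in order to find the best action to take in each state. I won't go into the details and theory of Q-Learning here; interested readers can go straight to the original paper.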
> Watkins C J C H, Dayan P. Technical Note: Q-Learning[J]. Machine Learning, 1992, 8(3-4):279-292.

**Q-Learning algorithm description:**
![Q-Learning algorithm description](https://img-blog.csdnimg.cn/20200601103840585.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0lfYW1fYV9idWdlcg==,size_16,color_FFFFFF,t_70)
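
For readers who cannot load the image, here is the update rule at the core of Q-Learning in its standard textbook form. The symbols are matched to the code in this post as an assumption on my part: $\alpha$ is `ALPHA`, $\gamma$ is `GAMMA`, $S'$ is `S_`, and $R$ is the reward returned by the environment.

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right]
$$

When $S'$ is the terminal state there is no future value to bootstrap from, so the target inside the brackets reduces to $R$.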

### Import the required modules and set the parameters:

In [2]:
import time
import numpy as np 
import pandas as pd 
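# Number of states (length of the 1-D world)
N_STATES = 6
# Actions available to the explorer
ACTIONS = ['left', 'right']
# Greedy rate: probability of taking the greedy action
EPSILON = 0.9
# Learning rate
ALPHA = 0.1
# Discount factor for future rewards
GAMMA = 0.9
# Maximum number of episodes
MAX_EPISODES = 13
# Pause between moves, in seconds
FRESH_TIME = 0.3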

### Initialize the Q-Table:

In [3]:
def build_table(n_states,actions):
    q_table=pd.DataFrame(np.zeros((n_states,len(actions))),columns=actions)
    return q_table
q_table=build_table(N_STATES,ACTIONS)
q_table

   left  right
0   0.0    0.0
1   0.0    0.0
2   0.0    0.0
3   0.0    0.0
4   0.0    0.0
5   0.0    0.0


### Choose an action in a given state:

In [4]:
def choose_action(S,q_table):
    # Select all action values for this state
    action_choice=q_table.iloc[S,:]
    # Explore: the non-greedy case, or this state has not been explored yet (all values are 0)
    if np.random.rand()>EPSILON or (action_choice==0).all():
        action=np.random.choice(ACTIONS)
    else:
        # Greedy policy: take the action with the highest value
        action=action_choice.idxmax()
    return action
choose_action(3,q_table)

'right'
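
To get a feel for what `EPSILON = 0.9` means in practice, here is a small sketch that calls a restated `choose_action` many times on a Q-table where `'right'` already looks better in state 3. The table `demo_table`, the sample size, and the fixed seed are my own assumptions for illustration, and the "unexplored" test is written as `(action_choice == 0).all()`.

```python
import numpy as np
import pandas as pd

ACTIONS = ['left', 'right']
EPSILON = 0.9

def choose_action(S, q_table):
    action_choice = q_table.iloc[S, :]
    # Explore with probability 1 - EPSILON, or when every value in this state is still 0
    if np.random.rand() > EPSILON or (action_choice == 0).all():
        return np.random.choice(ACTIONS)
    # Otherwise act greedily
    return action_choice.idxmax()

np.random.seed(0)  # fixed seed, chosen only so the sketch is repeatable

# Hypothetical Q-table: in state 3, 'right' has a higher value than 'left'
demo_table = pd.DataFrame(np.zeros((6, len(ACTIONS))), columns=ACTIONS)
demo_table.loc[3, 'right'] = 1.0

picks = [choose_action(3, demo_table) for _ in range(2000)]
right_share = picks.count('right') / len(picks)
print(f"share of 'right': {right_share:.3f}")
```

The share should land near 0.95 rather than 0.9: the greedy branch picks `'right'` 90% of the time, and the remaining 10% of random picks still choose `'right'` about half the time.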

### Define the environment feedback, given as a reward R:

In [5]:
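def get_env_feedback(S,A):
    # How the agent interacts with the environment
    if A=='right':
        # Reaching the right end is the best outcome, so it earns the largest reward
        if S==N_STATES-2:
            S_='terminal'
            R=5
        else:
            # Moving right gets closer to the goal, so give a small reward
            S_=S+1
            R=1
    else:
        # The goal is at the right end, so moving left earns no reward
        S_=S if S==0 else S-1
        R=0
    return S_,R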
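
As a quick check of the reward scheme, the sketch below restates `get_env_feedback` and walks an agent that always moves right from state 0. The loop and the names `total_reward` and `steps` are mine, added only for illustration.

```python
N_STATES = 6

def get_env_feedback(S, A):
    if A == 'right':
        if S == N_STATES - 2:
            return 'terminal', 5   # stepping right from state 4 ends the episode
        return S + 1, 1            # any other step to the right earns 1
    return (S if S == 0 else S - 1), 0  # moving left earns nothing

S, total_reward, steps = 0, 0, 0
while S != 'terminal':
    S, R = get_env_feedback(S, 'right')
    total_reward += R
    steps += 1

print(steps, total_reward)  # 5 9
```

Five steps is therefore the shortest possible episode, which is a useful baseline when reading the `total_step` values printed during training.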

### Update the environment:

In [6]:
def update_env(S,episode,step_counter):
    env_list=['-']*(N_STATES-1)+['T']
    if S=='terminal':
        print(f"\rEpisode:{episode},total_step:{step_counter}")
        time.sleep(2)
    else:
        env_list[S]='o'
        print(f"\r{''.join(env_list)}",end="")
        time.sleep(FRESH_TIME)

### Define the Q-Learning training process:


In [7]:
def q_train():
    q_table=build_table(N_STATES,ACTIONS)
    for episode in range(1,MAX_EPISODES+1):
        # S: starting position of the episode
        # step_counter: number of steps taken in this episode
        # is_terminal: whether the episode has ended
        S,step_counter,is_terminal=0,0,False
        update_env(S,episode,step_counter)
        while not is_terminal:
            A=choose_action(S,q_table)  # choose an action
            S_,R=get_env_feedback(S,A)  # take the action and get the environment's feedback
            q_predict=q_table.loc[S,A]  # estimated (state, action) value
            if S_=='terminal':
                # target (state, action) value when the episode ends
                q_target=R
                is_terminal=True
            else:
                # target (state, action) value when the episode continues
                q_target=R+GAMMA*q_table.iloc[S_,:].max()
            # update q_table
            q_table.loc[S,A]+=ALPHA*(q_target-q_predict)
            S=S_
            step_counter+=1
            # redraw the environment
            update_env(S,episode,step_counter)
    return q_table
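
To make the update inside the loop concrete, here are two updates worked by hand. The helper `td_update` is a hypothetical name of mine that isolates the two lines computing `q_target` and the new Q value; the starting values assume a fresh, all-zero Q-table.

```python
ALPHA = 0.1
GAMMA = 0.9

def td_update(q_sa, reward, max_q_next, terminal):
    """Return the new Q(S, A) after one Q-Learning update (illustrative helper)."""
    q_target = reward if terminal else reward + GAMMA * max_q_next
    return q_sa + ALPHA * (q_target - q_sa)

# First time the agent steps right from state 4 into the terminal state:
# Q(4, right) = 0 and R = 5, so the new value is 0 + 0.1 * (5 - 0) = 0.5
q_terminal = td_update(0.0, 5, 0.0, terminal=True)

# Later, stepping right from state 3 with Q(3, right) = 0 and R = 1, while the
# best value in state 4 is now 0.5: 0 + 0.1 * (1 + 0.9 * 0.5 - 0) = 0.145
q_step = td_update(0.0, 1, q_terminal, terminal=False)

print(round(q_terminal, 3), round(q_step, 3))
```

This is how the large terminal reward gradually propagates back toward state 0, one state per visit.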
            
                

In [8]:
q_table=q_train()
print("\r\nQ_table:")
print(q_table)

Episode:1,total_step:32
Episode:2,total_step:5
Episode:3,total_step:7
Episode:4,total_step:11
Episode:5,total_step:5
Episode:6,total_step:5
Episode:7,total_step:9
Episode:8,total_step:5
Episode:9,total_step:8
Episode:10,total_step:7
Episode:11,total_step:5
Episode:12,total_step:5
Episode:13,total_step:5

Q_table:
       left     right
0  0.112040  1.686950
1  0.234920  1.737197
2  0.180106  1.991320
3  0.303024  2.577075
4  0.111257  3.729067
5  0.000000  0.000000
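
One way to read the result is to extract the greedy policy from the learned table. The sketch below copies the values for states 0 to 4 from the output above; row 5 is left out because the agent never acts from that position (stepping right from state 4 already ends the episode), so its values stay at zero.

```python
import pandas as pd

# Values copied from the printed Q_table above (states 0-4)
learned = pd.DataFrame({
    'left':  [0.112040, 0.234920, 0.180106, 0.303024, 0.111257],
    'right': [1.686950, 1.737197, 1.991320, 2.577075, 3.729067],
})

# For each state, pick the action with the highest Q value
greedy_policy = learned.idxmax(axis=1)
print(greedy_policy.tolist())  # ['right', 'right', 'right', 'right', 'right']
```

After 13 episodes the table already prefers `right` in every state, which matches the episodes that finish in the minimum of 5 steps.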
