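**Reinforcement learning differs fundamentally from supervised learning:**
> Supervised learning **trains a model to map features to labels**, whereas reinforcement learning **starts from nothing**. Put simply, reinforcement learning lets a computer begin with no knowledge at all and, **through repeated trial and error, discover the regularities that let it reach its goal**. The agent and the environment interact at discrete time steps. At each time step t, the agent receives an observation O_t, which usually includes the reward R_t. It then chooses an action A_t from the set of allowed actions and sends it to the environment. The environment moves to a new state S_{t+1} and determines the reward R_{t+1} associated with the transition (S_t, A_t, S_{t+1}). **The agent's goal is to collect as much reward as possible.** **The action the agent chooses is a function of its history, and it may also choose an action at random.** **State, Action, and Reward** are therefore the three core concepts of reinforcement learning.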
![Figure](https://img-blog.csdnimg.cn/20200601102849744.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0lfYW1fYV9idWdlcg==,size_16,color_FFFFFF,t_70)
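
The interaction loop described above can be sketched in a few lines. This is only an illustration, not code from the original post: `ToyEnv`, `random_agent`, `run_episode`, and the reward of 1.0 at the goal are hypothetical names and values chosen to show the order of events (observe, act, receive S_{t+1} and R_{t+1}).

```python
import random


class ToyEnv:
    """Hypothetical 1-D corridor: states 0..4, reaching state 4 ends the episode."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def step(self, action):
        # The environment moves to S_{t+1} and decides the reward R_{t+1}
        # from the transition (S_t, A_t, S_{t+1}).
        if action == "right":
            self.state = min(self.state + 1, self.n_states - 1)
        else:
            self.state = max(self.state - 1, 0)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_agent(observation, allowed_actions, rng):
    # This agent ignores its history and picks an allowed action at random.
    return rng.choice(allowed_actions)


def run_episode(seed=0, max_steps=1000):
    rng = random.Random(seed)
    env = ToyEnv()
    observation, total_reward = env.state, 0.0
    for t in range(max_steps):
        action = random_agent(observation, ["left", "right"], rng)
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            return t + 1, total_reward
    return max_steps, total_reward
```

A learning agent would replace `random_agent` with a rule that uses the rewards it has seen so far, which is what the Q-table code later in this post does.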

**Classical reinforcement learning algorithms**
> Q-Learning, Sarsa, Policy Gradients, Monte Carlo tree search, and others.
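
For reference, the tabular update rules of the first two algorithms differ only in the bootstrap term. The forms below are the standard textbook ones; the learning rate α and discount factor γ are symbols the text above has not introduced, so treat them as assumptions of this sketch.

```latex
\begin{aligned}
\text{Q-Learning:}\quad Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right] \\
\text{Sarsa:}\quad Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]
\end{aligned}
```

Q-Learning bootstraps from the best action in the next state regardless of what the agent does next (off-policy), while Sarsa bootstraps from the action the agent actually takes next (on-policy).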

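**Reinforcement learning algorithms combined with deep learning**
> Deep Q-Network (DQN), along with the whole field of deep reinforcement learning that grew out of combining reinforcement learning with neural networks.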

**The core framework of deep reinforcement learning:**
![Core framework of deep reinforcement learning](https://img-blog.csdnimg.cn/2020060110320239.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0lfYW1fYV9idWdlcg==,size_16,color_FFFFFF,t_70)

In [79]:
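import time
import numpy as np
import pandas as pd

N_STATES = 6                  # number of cells in the 1-D world (the last one is the goal 'T')
ACTIONS = ["left", "right"]   # actions available in every state
EPSILON = .9                  # probability of acting greedily (explore with probability 1 - EPSILON)
ALPHA = .1                    # learning rate
GAMMA = .9                    # discount factor
MAX_EPISODES = 13             # number of training episodes
FRESH_TIME = .3               # pause between rendered frames, in seconds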

In [80]:
def build_table(n_states, actions):
    # One row per state, one column per action, all Q-values initialised to zero.
    q_table = pd.DataFrame(np.zeros((n_states, len(actions))), columns=actions)
    return q_table

q_table = build_table(N_STATES, ACTIONS)
q_table

| | left | right |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 |


In [81]:
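def choose_action(S, q_table):
    # Epsilon-greedy: explore with probability 1 - EPSILON, or whenever the
    # state has no learned values yet (all of its Q-values are still zero).
    action_choice = q_table.iloc[S, :]
    if np.random.rand() > EPSILON or (action_choice == 0).all():
        action = np.random.choice(ACTIONS)
    else:
        # idxmax returns the column label (the action name) of the largest Q-value
        action = action_choice.idxmax()
    return action

choose_action(3, q_table)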

'right'
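
To see what the epsilon-greedy rule does in aggregate, the sketch below samples it many times for a single state. It follows this post's convention that `EPSILON` is the probability of acting greedily. The function `epsilon_greedy` and the two Q-values are hypothetical, chosen only for illustration, and the standard-library `random` module stands in for NumPy to keep the example dependency-free.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Return the greedy action index with probability `epsilon`, else a random index."""
    if rng.random() > epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit


rng = random.Random(0)
q_values = [0.1, 0.5]   # hypothetical Q-values for ["left", "right"]
picks = [epsilon_greedy(q_values, 0.9, rng) for _ in range(10_000)]
right_fraction = sum(picks) / len(picks)

# 'right' is chosen on every greedy draw (0.9) plus half of the random draws (0.1 * 0.5),
# so the fraction should land close to 0.95.
print(f"fraction of 'right': {right_fraction:.3f}")
```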

In [82]:
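def get_env_feedback(S, A):
    # Return the next state S_ and the reward R for taking action A in state S.
    if A == 'right':
        if S == N_STATES - 2:
            # stepping right from the last non-terminal cell reaches the goal
            S_ = 'terminal'
            R = 5
        else:
            S_ = S + 1
            R = 1
    else:
        # moving left pays nothing; the left wall keeps the agent at state 0
        S_ = S if S == 0 else S - 1
        R = 0
    return S_, R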
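
As a quick check of these rules, the sketch below walks straight to the right from state 0 and adds up the rewards. The function is restated in compact form so the snippet runs on its own; the transitions and rewards are the same as in the cell above.

```python
N_STATES = 6


def get_env_feedback(S, A):
    # Same transition and reward rules as the cell above, written compactly.
    if A == 'right':
        if S == N_STATES - 2:
            return 'terminal', 5
        return S + 1, 1
    return (S if S == 0 else S - 1), 0


S, steps, total_reward = 0, 0, 0
while S != 'terminal':
    S, R = get_env_feedback(S, 'right')
    steps += 1
    total_reward += R

print(steps, total_reward)   # 5 9
```

Five steps is therefore the shortest possible episode, which is the floor that `total_step` reaches in the training log further down.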

In [83]:
def update_env(S, episode, step_counter):
    # Render the 1-D world: '-' is an empty cell, 'T' the goal, 'o' the agent.
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if S == 'terminal':
        print(f"\rEpisode:{episode},total_step:{step_counter}")
        time.sleep(2)
    else:
        env_list[S] = 'o'
        print(f"\r{''.join(env_list)}", end="")
        time.sleep(FRESH_TIME)

In [84]:
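def q_train():
    q_table = build_table(N_STATES, ACTIONS)
    for episode in range(1, MAX_EPISODES + 1):
        # every episode starts at the left end of the world
        S, step_counter, is_terminal = 0, 0, False
        while not is_terminal:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)
            q_predict = q_table.loc[S, A]   # current estimate of Q(S, A)
            if S_ == 'terminal':
                q_target = R                # no future value after the terminal state
                is_terminal = True
            else:
                # bootstrap from the best action value in the next state
                q_target = R + GAMMA * q_table.iloc[S_, :].max()
            # move Q(S, A) a fraction ALPHA of the way toward the target
            q_table.loc[S, A] += ALPHA * (q_target - q_predict)
            S = S_
            step_counter += 1
            update_env(S, episode, step_counter)
    return q_table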
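
The arithmetic of a single update may be easier to follow with concrete numbers. The sketch below applies the same rule to three hand-picked situations, using the `ALPHA` and `GAMMA` values from this post; `q_update` is a hypothetical helper, and each case assumes the next state's Q-values are still zero.

```python
ALPHA, GAMMA = 0.1, 0.9


def q_update(q_sa, reward, max_q_next, terminal=False):
    """One tabular Q-learning update for a single (state, action) pair."""
    q_target = reward if terminal else reward + GAMMA * max_q_next
    return q_sa + ALPHA * (q_target - q_sa)


# First visit: state 0, action 'right', reward 1, next state still all zeros.
q1 = q_update(0.0, reward=1, max_q_next=0.0)
# Second visit to the same pair, again assuming the next state is still all zeros.
q2 = q_update(q1, reward=1, max_q_next=0.0)
# Last step of an episode: state 4, action 'right', reward 5, terminal.
q_end = q_update(0.0, reward=5, max_q_next=0.0, terminal=True)

print(round(q1, 4), round(q2, 4), round(q_end, 4))   # 0.1 0.19 0.5
```

Each visit closes only a tenth of the gap to the target, which is why the values in the final table are still well below their long-run limits after 13 episodes.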
            
                

In [85]:
q_table=q_train()
print("\r\nQ_table:")
print(q_table)

Episode:1,total_step:19
Episode:2,total_step:5
Episode:3,total_step:5
Episode:4,total_step:5
Episode:5,total_step:5
Episode:6,total_step:5
Episode:7,total_step:5
Episode:8,total_step:5
Episode:9,total_step:5
Episode:10,total_step:5
Episode:11,total_step:5
Episode:12,total_step:7
Episode:13,total_step:5

Q_table:
       left     right
0  0.009000  1.348803
1  0.025273  1.420445
2  0.009000  1.833672
3  0.148685  2.498905
4  0.025273  3.729067
5  0.000000  0.000000
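
The learned table can be turned into a policy by picking, for each state, the action with the largest Q-value. The sketch below does this with the numbers printed above, using plain Python lists so that it runs without pandas; `q_rows` and `greedy_policy` are names introduced here for illustration.

```python
ACTIONS = ["left", "right"]

# (left, right) Q-values copied from the printed table above, one tuple per state.
q_rows = [
    (0.009000, 1.348803),
    (0.025273, 1.420445),
    (0.009000, 1.833672),
    (0.148685, 2.498905),
    (0.025273, 3.729067),
    (0.000000, 0.000000),
]

# State 5 is the goal cell and is never updated, so it is left out of the policy.
greedy_policy = [
    ACTIONS[max(range(len(ACTIONS)), key=lambda i: row[i])]
    for row in q_rows[:-1]
]
print(greedy_policy)   # ['right', 'right', 'right', 'right', 'right']
```

With the DataFrame itself, `q_table.iloc[:-1].idxmax(axis=1)` should give the same result. The `right` values also grow from state 0 to state 4, reflecting that the large terminal reward is discounted less the closer the agent is to the goal.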
