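## Reinforcement Learning

### MDP: Markov Decision Process

```
An agent takes an action, which changes its state and earns it a reward; this repeated interaction between the agent and the environment forms a loop.
In an MDP, the policy depends only on the current state.

In short, reinforcement learning means observing the environment and the agent's own state, then predicting the action that yields the largest reward.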
```
<img src='强化学习模型1.jpg' style='zoom:80%'>

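### Basic Concepts

```
1. S: a finite set of states
2. A: a finite set of actions
3. T(s,a,s'): the transition model, i.e. the probability of reaching the next state s' after taking action a in the current state s
4. R(s,a): the reward obtained by taking action a in the current state s
5. Policy π: produces an action from the current state, either deterministically, a = π(s), or stochastically, π(a|s) = p(a|s), the probability of choosing action a in state s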
```


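### Solving an MDP

```
We want the optimal policy, the one that maximizes future return. Solving for it takes roughly two steps:
1. Prediction: given a policy, evaluate the value of each state, i.e. compute the value function
2. Action (control): use the value function to find the best action for the current state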
```
<img src='MDP环境建模.png' style='zoom:80%'>


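### Computing the Return

```
U(s0,s1,s2,...) and the discount factor γ: U is the sum of the rewards accumulated over all states visited after a sequence of actions. Adding raw rewards over an infinite time horizon can grow without bound, so different sequences cannot be ranked and the agent may cycle through states forever. A discount factor is therefore introduced: each later reward is multiplied by one more factor of γ, which makes the immediate reward count more than future ones.

The ultimate goal of reinforcement learning is to find the maximum U.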
```
<img src='Bellman回报计算与状态值函数.png' style='zoom:80%'>

<img src='行为值函数定义.png' style='zoom:80%'>
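
As a small illustration of the discounted return described above, the sketch below sums a reward sequence with a discount factor. The function name `discounted_return` and the sample rewards are assumptions made for this example, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t].

    The sum is accumulated from the last reward backwards, so each
    earlier step adds its own reward to the discounted tail.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With `gamma` below 1 the contribution of far-future rewards shrinks geometrically, which is what keeps the sum finite on long sequences.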

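### Bellman Expectation Equation
```
When the action space and the state space are finite sets, the formula for the current state value can be rewritten recursively, i.e. the value of the current state is expressed through the value of the next state: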
```

<img src='Bellman状态值函数变换.png' style='zoom:80%'>


```
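The definitions give the following relations between the state-value function and the action-value function:

1. State value = sum over actions of (probability of the action under the policy × action value)   [state value in terms of action values]
2. Action value = immediate reward + discounted, probability-weighted sum of next-state values   [action value in terms of state values]
3. State value = policy-weighted sum over actions of (immediate reward + discounted expected value of the next state)   [state value in terms of state values]
4. Action value = immediate reward + discounted expected action value in the next state under the policy   [action value in terms of action values]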
```

<img src='状态值和行为值之间的关系.png' style='zoom:80%'>

<img src='状态值和行为值之间的关系1.png' style='zoom:80%'>

<img src='状态值表示状态值.png' style='zoom:80%'>

<img src='行为值表示行为.png' style='zoom:80%'>
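
For reference, the four relations listed above can be written out as follows. This is a sketch in the notation of the Basic Concepts list (policy π, transition model T, reward R, discount factor γ); the symbols used in the figures may differ.

$$
\begin{aligned}
v_\pi(s) &= \sum_{a} \pi(a \mid s)\, q_\pi(s,a) \\
q_\pi(s,a) &= R(s,a) + \gamma \sum_{s'} T(s,a,s')\, v_\pi(s') \\
v_\pi(s) &= \sum_{a} \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, v_\pi(s') \Big] \\
q_\pi(s,a) &= R(s,a) + \gamma \sum_{s'} T(s,a,s') \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a')
\end{aligned}
$$

The third line is obtained by substituting the second into the first, and the fourth by substituting the first into the second.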

```
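In practice, once the state-value estimates have stabilized, the agent first determines its current state and then chooses the action that leads to the largest value, i.e. it maximizes the state-value function.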
```

<img src='状态值更新实例.png' style='zoom:80%'>
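
The idea of iterating until the values stabilize and then acting greedily can be sketched on a toy problem. Everything below is an assumption made for illustration: a deterministic three-state chain in which only the move into the last state pays a reward of 1, and the helper names `step_model`, `action_value`, `value_iteration` and `greedy_policy`. It is not the example shown in the figure.

```python
GAMMA = 0.9
N_STATES = 3
N_ACTIONS = 2      # 0 = stay, 1 = move right
TERMINAL = 2       # hypothetical goal state


def step_model(state, action):
    """Deterministic toy model: return (next_state, reward)."""
    if state == TERMINAL:
        return state, 0.0
    next_state = state + 1 if action == 1 else state
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def action_value(values, state, action):
    """One-step lookahead: immediate reward plus discounted next-state value."""
    next_state, reward = step_model(state, action)
    return reward + GAMMA * values[next_state]


def value_iteration(tol=1e-9):
    """Sweep the states until the largest change in any value is below tol."""
    values = [0.0] * N_STATES
    while True:
        delta = 0.0
        for state in range(N_STATES):
            if state == TERMINAL:
                continue
            best = max(action_value(values, state, a) for a in range(N_ACTIONS))
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


def greedy_policy(values):
    """In every state, pick the action with the largest lookahead value."""
    policy = []
    for state in range(N_STATES):
        scores = [action_value(values, state, a) for a in range(N_ACTIONS)]
        policy.append(scores.index(max(scores)))
    return policy


values = value_iteration()
print(values)                 # state values for states 0, 1, 2
print(greedy_policy(values))  # greedy action for each state
```

In this toy chain the greedy policy moves right from both non-terminal states, because waiting only delays the reward and therefore discounts it further.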


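### Monte Carlo Methods

```
Solving the Bellman equations requires the state-transition probabilities to be known. On a chess board, for example, with a god's-eye view the probability of moving from one state to the next can be determined. In many problems, however, these probabilities cannot be determined, and the question becomes how to obtain them. Monte Carlo methods answer this by sampling: in each state the agent tries a large number of randomly chosen candidate next states, keeps the transitions that succeed and discards the ones that fail. After enough trials this yields an empirical set of next states together with estimated transition probabilities.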
```
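
A minimal sketch of the sampling idea, under assumptions made only for this example: a single non-terminal state from which each step reaches the goal with probability 0.5 (reward 1) or stays put (reward 0). The estimator never reads that probability; it only averages the discounted returns of complete sampled episodes. The names `sample_episode` and `mc_value_estimate` are hypothetical.

```python
import random

GAMMA = 0.9


def sample_episode(rng, max_steps=100):
    """Hypothetical environment, unknown to the learner: every step
    reaches the goal with probability 0.5 (reward 1, episode ends)
    or stays in place (reward 0)."""
    rewards = []
    for _ in range(max_steps):
        if rng.random() < 0.5:
            rewards.append(1.0)
            break
        rewards.append(0.0)
    return rewards


def mc_value_estimate(n_episodes, seed=0):
    """Average the discounted return of many complete episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        episode_return = 0.0
        for reward in reversed(sample_episode(rng)):
            episode_return = reward + GAMMA * episode_return
        total += episode_return
    return total / n_episodes


print(mc_value_estimate(20000))
```

For this toy process the exact value satisfies v = 0.5 + 0.5 * 0.9 * v, so the estimate should settle near 0.5 / 0.55 as the number of episodes grows.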

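### Temporal-Difference (TD) Learning

```
Monte Carlo methods work, but when the state set is large, the random sampling needed for every state consumes an enormous amount of compute. Monte Carlo estimates a value from the long-term return of a complete episode, whereas TD uses only the immediate reward and the value estimate at the next time step.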
```
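
The difference from Monte Carlo can be sketched with a TD(0) update, which adjusts the estimate after every single step instead of waiting for the episode to finish. The environment is an assumed toy process (one non-terminal state, goal reached with probability 0.5 per step), and `ALPHA` and `td0_value_estimate` are names chosen for this illustration.

```python
import random

GAMMA = 0.9
ALPHA = 0.01   # step size, an assumed value for this sketch


def td0_value_estimate(n_steps, seed=0):
    """TD(0): move the estimate toward reward + GAMMA * V(next_state)."""
    rng = random.Random(seed)
    value = 0.0  # estimate for the single non-terminal state
    for _ in range(n_steps):
        if rng.random() < 0.5:
            reward, next_value = 1.0, 0.0    # goal reached; terminal value is 0
        else:
            reward, next_value = 0.0, value  # stayed in the same state
        td_error = reward + GAMMA * next_value - value
        value += ALPHA * td_error
    return value


print(td0_value_estimate(20000))
```

Each update needs only one observed transition, which is why TD is cheaper per step than averaging full episode returns; the price is that the target itself depends on the current, still imperfect, estimate.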

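### The Q-learning Algorithm


Q-learning workflow:

1. Define and initialize the state machine for the task, and assign initial scores to the state transitions
2. Initialize the reward matrix R from the state machine and the scores
3. From R, compute the expected score of every state-action pair, which gives the Q-Table
4. Use the current Q-Table values, together with R, to update the Q-Table again
5. Repeat steps 3 and 4 until convergence; the final Q-Table plays the role of the policy matrix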


#### Q-learning Example

```
Define the state machine for the task
```
<img src='Q-learning状态机.png' style='zoom:80%'>

```
Assign scores to the state transitions: going out through 5 scores 100, every other transition scores 0
```

<img src='Q-Learning得分值赋值.png' style='zoom:80%'>

```
Initialize the R matrix from the state-machine scores
```

<img src='根据Q-Learning状态机确定Q-Table.png' style='zoom:80%'>

```
Update the Q-Table from the R matrix
```
<img src='Q-Tabel矩阵值更新.png' style='zoom:80%'>

```
Q-Table update example
```
<img src='Q-Table更新实例.png' style='zoom:80%'>
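
The workflow can be sketched in a few lines. The room layout in `R` below is an assumption (the widely used six-room tutorial layout, with -1 marking a missing door and 100 marking a move into goal 5); the state machine in the figures may differ. The discount factor 0.8 and the helper names are also assumed.

```python
GAMMA = 0.8  # assumed discount factor

# R[s][a]: reward for moving from room s to room a; -1 means no door.
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]


def train_q_table(sweeps=200):
    """Repeat Q(s, a) = R(s, a) + GAMMA * max(Q(a, .)) over all valid moves."""
    n = len(R)
    q = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for state in range(n):
            for action in range(n):
                if R[state][action] >= 0:
                    q[state][action] = R[state][action] + GAMMA * max(q[action])
    return q


def greedy_path(q, start, goal=5, max_moves=10):
    """Follow the highest-scoring move from each room until the goal."""
    path = [start]
    while path[-1] != goal and len(path) <= max_moves:
        row = q[path[-1]]
        path.append(row.index(max(row)))
    return path


q_table = train_q_table()
print([round(v) for v in q_table[3]])  # learned scores for room 3
print(greedy_path(q_table, start=2))   # route from room 2 to the goal
```

This version sweeps every valid move in a fixed order instead of sampling random episodes, which is enough for a deterministic toy problem; R stays fixed throughout and only the Q-Table changes.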


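### The DQN Algorithm

```
Q-learning builds the Q-Table to determine the reward obtained by taking a given action in a given state, and from that the action with the largest reward in the current state. When the state set is large, the Q-Table becomes too big to be practical, and DQN addresses this. DQN pursues the same goal but works differently: it uses a neural network to learn features of states and actions, and outputs the actions together with a score for each action.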
```
<img src='DQN.png' style='zoom:80%'>

```
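As the figure shows, DQN takes the state, the action taken in that state, and the reward as training inputs for the neural network. These values are usually collected from the game environment by repeatedly trying different actions and computing the reward from the observed result.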
```
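
How the collected experience turns into training targets can be sketched without any deep-learning library. In the sketch below, `ReplayBuffer`, `td_targets` and `q_values_fn` are hypothetical names; `q_values_fn` stands in for the network and returns one score per action for a given state.

```python
import random
from collections import deque

GAMMA = 0.99


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, terminal)."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest transition automatically
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, terminal):
        self.transitions.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.transitions), batch_size)


def td_targets(batch, q_values_fn):
    """Target is the reward alone at a terminal step, otherwise
    reward + GAMMA * max over actions of Q(next_state, action)."""
    targets = []
    for _state, _action, reward, next_state, terminal in batch:
        if terminal:
            targets.append(reward)
        else:
            targets.append(reward + GAMMA * max(q_values_fn(next_state)))
    return targets


buffer = ReplayBuffer(capacity=2)
buffer.add("s0", 0, 0.1, "s1", False)
buffer.add("s1", 1, -1.0, "s2", True)
print(td_targets(list(buffer.transitions), lambda state: [0.5, 2.0]))
```

The network is then trained so that its score for the action actually taken moves toward the corresponding target.
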
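### Summary

```
The Bellman equations, Monte Carlo, temporal difference, and Q-learning are all ways of updating state values: given the current state values and the immediate reward of each action, they compute the state values for the next iteration. Repeating this update makes the distribution of values across states progressively more reasonable.
These updates are purely logical. Their only link to the physical problem is the immediate reward, which is usually given by an explicit formula or rule.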
```

```
Install packages from a mirror in mainland China:
Alibaba Cloud https://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China (USTC) https://pypi.mirrors.ustc.edu.cn/simple/
Douban http://pypi.douban.com/simple/
Tsinghua University https://pypi.tuna.tsinghua.edu.cn/simple/

pip install <package-name> -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
```

## 1. Implementing Flappy Bird with TensorFlow

In [5]:
import warnings
warnings.filterwarnings('ignore')
import tensorflow as tf
import cv2
import sys
sys.path.append("game/")
import wrapped_flappy_bird as game
import random
import numpy as np
from collections import deque

In [6]:
GAME = 'bird' # the name of the game being played for log files
ACTIONS = 2 # number of valid actions
GAMMA = 0.99 # discount factor applied to future rewards
OBSERVE = 1000. # timesteps to observe before training
EXPLORE = 2000000. # frames over which to anneal epsilon
FINAL_EPSILON = 0.0001 # final value of epsilon
INITIAL_EPSILON = 0.0001 # starting value of epsilon
REPLAY_MEMORY = 50000 # number of previous transitions to remember
BATCH = 32 # size of minibatch
FRAME_PER_ACTION = 1
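
The constants `OBSERVE`, `EXPLORE`, `INITIAL_EPSILON` and `FINAL_EPSILON` describe a linear annealing schedule for epsilon. The sketch below shows that schedule in closed form; the function name `linear_epsilon` and the starting value 0.1 used in the demonstration are assumptions for illustration. With the values set above, where the initial and final epsilon are both 0.0001, the schedule is simply constant.

```python
def linear_epsilon(step, initial, final, observe, explore):
    """Hold epsilon at `initial` for `observe` steps, then move it
    linearly to `final` over the following `explore` steps."""
    if step <= observe:
        return initial
    fraction = min(1.0, (step - observe) / explore)
    return initial + fraction * (final - initial)


# Hypothetical starting value of 0.1, annealed down to 0.0001
for step in (0, 1000, 1001000, 5000000):
    print(step, linear_epsilon(step, 0.1, 0.0001, 1000.0, 2000000.0))
```
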

In [7]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev = 0.01)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.01, shape = shape)
    return tf.Variable(initial)

def conv2d(x, W, stride):
    return tf.nn.conv2d(x, W, strides = [1, stride, stride, 1], padding = "SAME")

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize = [1, 2, 2, 1], strides = [1, 2, 2, 1], padding = "SAME")

def createNetwork():
    # network weights
    W_conv1 = weight_variable([8, 8, 4, 32])
    b_conv1 = bias_variable([32])

    W_conv2 = weight_variable([4, 4, 32, 64])
    b_conv2 = bias_variable([64])

    W_conv3 = weight_variable([3, 3, 64, 64])
    b_conv3 = bias_variable([64])

    W_fc1 = weight_variable([1600, 512])
    b_fc1 = bias_variable([512])

    W_fc2 = weight_variable([512, ACTIONS])
    b_fc2 = bias_variable([ACTIONS])

    # input layer
    s = tf.placeholder("float", [None, 80, 80, 4])

    # hidden layers
    h_conv1 = tf.nn.relu(conv2d(s, W_conv1, 4) + b_conv1)
    h_pool1 = max_pool_2x2(h_conv1)

    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2, 2) + b_conv2)
    #h_pool2 = max_pool_2x2(h_conv2)

    h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 1) + b_conv3)
    #h_pool3 = max_pool_2x2(h_conv3)

    #h_pool3_flat = tf.reshape(h_pool3, [-1, 256])
    h_conv3_flat = tf.reshape(h_conv3, [-1, 1600])

    h_fc1 = tf.nn.relu(tf.matmul(h_conv3_flat, W_fc1) + b_fc1)

    # readout layer
    readout = tf.matmul(h_fc1, W_fc2) + b_fc2

    return s, readout, h_fc1

def trainNetwork(s, readout, h_fc1, sess):
    # define the cost function
    a = tf.placeholder("float", [None, ACTIONS])
    y = tf.placeholder("float", [None])
    readout_action = tf.reduce_sum(tf.multiply(readout, a), reduction_indices=1)
    cost = tf.reduce_mean(tf.square(y - readout_action))
    train_step = tf.train.AdamOptimizer(1e-6).minimize(cost)

    # open up a game state to communicate with emulator
    game_state = game.GameState()

    # store the previous observations in replay memory
    D = deque()

    # printing
    a_file = open("logs_" + GAME + "/readout.txt", 'w')
    h_file = open("logs_" + GAME + "/hidden.txt", 'w')

    # get the first state by doing nothing and preprocess the image to 80x80x4
    do_nothing = np.zeros(ACTIONS)
    do_nothing[0] = 1
    x_t, r_0, terminal = game_state.frame_step(do_nothing)
    x_t = cv2.cvtColor(cv2.resize(x_t, (80, 80)), cv2.COLOR_BGR2GRAY)
    ret, x_t = cv2.threshold(x_t,1,255,cv2.THRESH_BINARY)
    s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)

    # saving and loading networks
    saver = tf.train.Saver()
    sess.run(tf.global_variables_initializer())  # replaces the deprecated tf.initialize_all_variables()
    checkpoint = tf.train.get_checkpoint_state("saved_networks")
    
    
   #if checkpoint and checkpoint.model_checkpoint_path:
   #    saver.restore(sess, checkpoint.model_checkpoint_path)
   #    print("Successfully loaded:", checkpoint.model_checkpoint_path)
   #else:
   #    print("Could not find old network weights")
   #
    # start training
    epsilon = INITIAL_EPSILON
    t = 0
    while "flappy bird" != "angry bird":
        # choose an action epsilon greedily
        readout_t = readout.eval(feed_dict={s : [s_t]})[0]
        a_t = np.zeros([ACTIONS])
        action_index = 0
        if t % FRAME_PER_ACTION == 0:
            if random.random() <= epsilon:
                print("----------Random Action----------")
                action_index = random.randrange(ACTIONS)
                a_t[action_index] = 1  # use the same index that is logged as the action
            else:
                action_index = np.argmax(readout_t)
                a_t[action_index] = 1
        else:
            a_t[0] = 1 # do nothing

        # scale down epsilon
        if epsilon > FINAL_EPSILON and t > OBSERVE:
            epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE

        # run the selected action and observe next state and reward
        x_t1_colored, r_t, terminal = game_state.frame_step(a_t)
        x_t1 = cv2.cvtColor(cv2.resize(x_t1_colored, (80, 80)), cv2.COLOR_BGR2GRAY)
        ret, x_t1 = cv2.threshold(x_t1, 1, 255, cv2.THRESH_BINARY)
        x_t1 = np.reshape(x_t1, (80, 80, 1))
        #s_t1 = np.append(x_t1, s_t[:,:,1:], axis = 2)
        s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)

        # store the transition in D
        D.append((s_t, a_t, r_t, s_t1, terminal))
        if len(D) > REPLAY_MEMORY:
            D.popleft()

        # only train if done observing
        if t > OBSERVE:
            # sample a minibatch to train on
            minibatch = random.sample(D, BATCH)

            # get the batch variables
            s_j_batch = [d[0] for d in minibatch]
            a_batch = [d[1] for d in minibatch]
            r_batch = [d[2] for d in minibatch]
            s_j1_batch = [d[3] for d in minibatch]

            y_batch = []
            readout_j1_batch = readout.eval(feed_dict = {s : s_j1_batch})
            
            if t == OBSERVE+1:
                print(readout_j1_batch.shape)
                print(readout_j1_batch)
            #
            
            for i in range(0, len(minibatch)):
                terminal = minibatch[i][4]
                # if terminal, only equals reward
                if terminal:
                    y_batch.append(r_batch[i])
                else:
                    y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))

            # perform gradient step
            train_step.run(feed_dict = {
                y : y_batch,
                a : a_batch,
                s : s_j_batch}
            )

        # update the old values
        s_t = s_t1
        t += 1

        # save progress every 10000 iterations
        if t % 10000 == 0:
            saver.save(sess, 'saved_networks/' + GAME + '-dqn', global_step = t)

        # print info
        state = ""
        if t <= OBSERVE:
            state = "observe"
        elif t > OBSERVE and t <= OBSERVE + EXPLORE:
            state = "explore"
        else:
            state = "train"

        #print("TIMESTEP", t, "/ STATE", state, \
        #    "/ EPSILON", epsilon, "/ ACTION", action_index, "/ REWARD", r_t, \
        #    "/ Q_MAX %e" % np.max(readout_t))
        ## write info to files
        '''
        if t % 10000 <= 100:
            a_file.write(",".join([str(x) for x in readout_t]) + '\n')
            h_file.write(",".join([str(x) for x in h_fc1.eval(feed_dict={s:[s_t]})[0]]) + '\n')
            cv2.imwrite("logs_tetris/frame" + str(t) + ".png", x_t1)
        '''

def playGame():
    sess = tf.InteractiveSession()
    s, readout, h_fc1 = createNetwork()
    trainNetwork(s, readout, h_fc1, sess)

def main():
    playGame()

In [None]:
if __name__ == "__main__":
    main()

(32, 2)
[[ 1.63239557e-02  7.58314645e-03]
 [ 1.08270040e-02  1.00139771e-02]
 [ 1.28428917e-02  8.52609985e-03]
 [ 9.97734349e-03  2.08778959e-03]
 [ 2.25342810e-02  8.65358859e-03]
 [ 1.70554854e-02  7.23292772e-03]
 [ 1.57846808e-02  5.85000962e-05]
 [ 1.37990844e-02  9.29090288e-03]
 [ 1.42347123e-02  1.24834701e-02]
 [ 1.20413061e-02  5.57029899e-03]
 [ 2.44197380e-02  9.29792132e-03]
 [ 2.21522246e-02  8.88486486e-03]
 [ 1.38307884e-02  1.04778986e-02]
 [ 8.89533199e-03  8.34231824e-03]
 [ 1.74962208e-02  7.65069667e-03]
 [ 1.56601369e-02  1.40965283e-02]
 [ 1.59182176e-02  1.12611335e-02]
 [ 2.00182982e-02  8.53320397e-03]
 [ 2.41243020e-02  8.77001137e-03]
 [ 1.14605408e-02  7.53760617e-03]
 [ 1.50481444e-02  1.43945422e-02]
 [ 1.72681697e-02  1.14596048e-02]
 [ 1.07064564e-02  6.06974494e-03]
 [ 2.22815014e-02  7.89135415e-03]
 [-4.25567664e-03  1.21664386e-02]
 [ 1.28520271e-02  3.98911722e-03]
 [ 1.54135376e-03 -2.36980896e-03]
 [ 1.26641225e-02  7.57872872e-03]
 [ 4.8799086

## 2. Implementing FlappyBird with PyTorch

In [1]:
import torch 
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets,transforms
import torch.optim as optim
import cv2
import sys
sys.path.append("game/")
import wrapped_flappy_bird as game
import random
import numpy as np
from collections import deque

pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html


In [3]:
class DQN_NET(nn.Module):
    def __init__(self,input_size,out_put_size):
        super(DQN_NET,self).__init__()
        
        self.layer1 = nn.Conv2d(input_size,32,kernel_size=6,stride=4,padding=1)
        self.pool1  = nn.MaxPool2d(kernel_size=2,stride=2,padding=0)
        
        self.layer2 = nn.Conv2d(32,64,kernel_size=4,stride=2,padding=1)
        self.pool2  = nn.MaxPool2d(kernel_size=2,stride=2,padding=1)
        
        self.layer3 = nn.Conv2d(64,64,kernel_size=3,stride=1,padding=1)
        self.pool3  = nn.MaxPool2d(kernel_size=2,stride=2,padding=1)
        
        self.fc_shape_size  = 256
        
        self.fc     = nn.Linear(256,out_put_size)
    
    def forward(self,x):
        cur_state = x       # keep the input state so it can be returned unchanged
        x = F.relu(self.layer1(x))
        x = self.pool1(x)
        x = F.relu(self.layer2(x))
        x = self.pool2(x)
        x = F.relu(self.layer3(x))
        x = self.pool3(x)
        x = x.view(-1, self.fc_shape_size)   # flatten to (batch, 256)
        x = self.fc(x)
        action_reward = x  # score of each action in the current state
        return cur_state,action_reward
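
The value 256 used for `fc_shape_size` can be checked by hand with the usual output-size formula for convolution and pooling layers, floor((size + 2 * padding - kernel) / stride) + 1. The sketch below applies it to an 80x80 input, the frame size produced by the preprocessing in this notebook. The helper names `conv_out` and `dqn_net_flat_features` are hypothetical, and default dilation and floor rounding are assumed.

```python
def conv_out(size, kernel, stride, padding):
    """Output width of a conv or pooling layer with floor rounding."""
    return (size + 2 * padding - kernel) // stride + 1


def dqn_net_flat_features(size=80, channels=64):
    """Trace one spatial dimension through the layers of DQN_NET."""
    size = conv_out(size, kernel=6, stride=4, padding=1)   # layer1
    size = conv_out(size, kernel=2, stride=2, padding=0)   # pool1
    size = conv_out(size, kernel=4, stride=2, padding=1)   # layer2
    size = conv_out(size, kernel=2, stride=2, padding=1)   # pool2
    size = conv_out(size, kernel=3, stride=1, padding=1)   # layer3
    size = conv_out(size, kernel=2, stride=2, padding=1)   # pool3
    return channels * size * size


print(dqn_net_flat_features())  # 256
```

If the input resolution or any kernel, stride or padding changes, the size of the fully connected layer has to be recomputed the same way.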

In [4]:
GAME = 'bird'          # the name of the game being played for log files
ACTIONS = 2            # number of valid actions
GAMMA = 0.99           # discount factor applied to future rewards
OBSERVE = 1000.        # timesteps to observe before training
EXPLORE = 2000000.     # frames over which to anneal epsilon
FINAL_EPSILON = 0.0001 # final value of epsilon
INITIAL_EPSILON = 0.0001 # starting value of epsilon
REPLAY_MEMORY = 50000    # number of previous transitions to remember
BATCH = 32               # size of minibatch
FRAME_PER_ACTION = 1

In [5]:
# initialize the game
game_state = game.GameState()

# create the model
input_size  =  4
output_size =  2
DQN_Model = DQN_NET(input_size,output_size)

# define the loss function
loss_func = torch.nn.MSELoss()
    
# choose the optimizer
opt       = optim.Adam(DQN_Model.parameters(),lr=1e-6) 

In [6]:
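DATA = deque()
epsilon = INITIAL_EPSILON       # threshold that switches between random and greedy actions

# takes the model being trained and the game handle
def DATA_INIT(model,game_state):
    i = 0
    ACTION = output_size 
    action = np.zeros(ACTION)       # the bird has two actions: fly up and fly down
    action[0] = 1                   # make the bird fly down
    
    # get the first frame; returns the frame image, the reward, and the terminal flag
    cur_state,action_reward,terminal = game_state.frame_step(action)
    
    # convert the frame to an 80x80 grayscale image and binarize it
    image = cv2.cvtColor(cv2.resize(cur_state, (80, 80)), cv2.COLOR_BGR2GRAY)
    ret, image = cv2.threshold(image,1,255,cv2.THRESH_BINARY)
    # stack four copies of the first frame along the channel axis to build the first state;
    # stacking on axis 0 gives (4, 80, 80), which matches the (1, 4, 80, 80) view below
    cur_state = np.stack((image, image, image, image), axis=0)
    cur_state = torch.from_numpy(cur_state).view(1,4,80,80).float()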
    
    # observe and record the input state cur_state, the action, the reward, and next_state
    while i < OBSERVE:
        
        action_tmp = np.zeros(ACTIONS)
        # epsilon-greedy switch: explore with a random action,
        # otherwise execute the action predicted by the network
        if i%FRAME_PER_ACTION == 0:
            if random.random() <= epsilon:
                print("==========Random Action================")
                action_index             = random.randrange(ACTIONS)
                action_tmp[action_index] = 1
            else:
                cur_state,action_reward  = model(cur_state)
                action_index             = np.argmax(action_reward.detach())
                action_tmp[action_index] = 1
        else:
            action_tmp[0] = 1   # down
        
        # execute action_tmp in the current state; get the next frame, the reward, and the terminal flag
        next_image,reward,terminal = game_state.frame_step(action_tmp)
        

        next_image = cv2.cvtColor(cv2.resize(next_image, (80, 80)), cv2.COLOR_BGR2GRAY)
        ret, next_image = cv2.threshold(next_image,1,255,cv2.THRESH_BINARY)
        next_image = torch.tensor(next_image).view(1,1,80,80).float()
        cur_state_tmp = cur_state[:,1:4,:,:]
        # next_state = the last 3 frames of cur_state + the newly generated next_image
        next_state = torch.cat((cur_state_tmp,next_image),dim=1)
        
        DATA.append((cur_state,action_tmp,reward,next_state,terminal))
        
        # once DATA exceeds the size limit, pop the oldest transition
        if len(DATA) > REPLAY_MEMORY:
            DATA.popleft()
        
        # the newest state becomes the current state
        cur_state = next_state
        
        i += 1
        
        
def DATA_UpDate(model,DATA,step=0):
    # epsilon is reassigned below, so it must be declared global;
    # step replaces the undefined loop variable i (default 0: act on every call)
    global epsilon
    
    # take next_state from the last transition in the deque and extend the deque using the latest model
    cur_state = DATA[len(DATA)-1][3]
    
    action_tmp = np.zeros(ACTIONS)
    # epsilon-greedy switch: explore with a random action,
    # otherwise execute the action predicted by the network
    if step%FRAME_PER_ACTION == 0:
        if random.random() <= epsilon:
            print("==========Random Action================")
            action_index             = random.randrange(ACTIONS)
            action_tmp[action_index] = 1
        else:
            cur_state,action_reward  = model(cur_state)
            action_index             = np.argmax(action_reward.detach())
            action_tmp[action_index] = 1
    else:
        action_tmp[0] = 1   # down

    # execute action_tmp in the current state; get the next frame, the reward, and the terminal flag
    next_image,reward,terminal = game_state.frame_step(action_tmp)

    next_image = cv2.cvtColor(cv2.resize(next_image, (80, 80)), cv2.COLOR_BGR2GRAY)
    ret, next_image = cv2.threshold(next_image,1,255,cv2.THRESH_BINARY)
    next_image = torch.tensor(next_image).view(1,1,80,80).float()
    cur_state_tmp = cur_state[:,1:4,:,:]
    # next_state = the last 3 frames of cur_state + the newly generated next_image
    next_state = torch.cat((cur_state_tmp,next_image),dim=1)

    DATA.append((cur_state,action_tmp,reward,next_state,terminal))

    # once DATA exceeds the size limit, pop the oldest transition
    if len(DATA) > REPLAY_MEMORY:
        DATA.popleft()
    
    if epsilon > FINAL_EPSILON:
        epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE
    
    return next_state

In [7]:
DATA_INIT(DQN_Model,game_state)

In [None]:
## when porting from TensorFlow to PyTorch, note that PyTorch outputs carry autograd (grad) information
epochs = 10000
BATCH = 64
for epoch in range(epochs):
    minibatch = random.sample(DATA, BATCH)
    cur_state_batch  = [d[0] for d in minibatch]
    action_batch     = [d[1] for d in minibatch]
    reward_batch     = [d[2] for d in minibatch]
    next_state_batch = [d[3] for d in minibatch]
    
    state_reward_batch = []
    next_reward_bach   = []
    action_predict     = []
    
    DQN_Model.train()
    
    # scores of the next states; they only serve as targets, so no gradient is tracked
    with torch.no_grad():
        for i in range(0,len(minibatch)):
            next_state_bach_tmp,next_reward_bach_tmp = DQN_Model(next_state_batch[i])
            next_reward_bach.append(next_reward_bach_tmp)
    
    # compute the target: reward plus the discounted maximum expected reward
    for i in range(0,len(minibatch)):
        terminal = minibatch[i][4]
        if terminal:
            state_reward_batch.append(reward_batch[i])
        else:
            state_reward_batch.append(reward_batch[i] + GAMMA*next_reward_bach[i].max().item())
    
    # score of the action actually taken; kept as tensors (no detach) so gradients can flow
    for i in range(0,len(minibatch)):
        cur_state_batch_tmp,action_predict_tmp = DQN_Model(cur_state_batch[i])
        action_mask = torch.tensor(action_batch[i]).float()
        action_predict.append((action_predict_tmp.view(-1) * action_mask).sum())
    
    action_predict = torch.stack(action_predict)
    state_reward_batch = torch.tensor(state_reward_batch).float()
    
    loss = loss_func(action_predict,state_reward_batch)
    
    opt.zero_grad()
    loss.backward()
    opt.step()
    
    # play one more step with the updated model and store the new transition
    DATA_UpDate(DQN_Model,DATA)
    
    if epoch % 10 == 0:
        print('epoch ',epoch,'over')