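# 3. Double DQN
### The goal is to reduce the bias introduced by taking the max over Q-values, known as the overestimation problem: the current Q-network selects the action, and the target Q-network evaluates that action to compute the target Q-value.     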
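
The overestimation effect can be seen in a small numerical experiment. The sketch below is not part of the original notebook. It assumes every action has a true value of 0 and that the two estimates carry independent Gaussian noise, which is an idealization, since in practice the target network is a lagged copy of the online network. The names `q_online`, `q_target`, `n_actions`, and `n_trials` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4        # hypothetical number of actions
n_trials = 10_000    # number of independent noisy estimates
true_q = np.zeros(n_actions)  # assumption: every action has true value 0

# Two independent noisy estimates of the same true Q-values,
# standing in for the online network and the target network.
q_online = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))
q_target = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# Single-estimator target: select and evaluate with the same estimate.
single = q_target.max(axis=1)

# Double-estimator target: select with one estimate, evaluate with the other.
best = q_online.argmax(axis=1)
double = q_target[np.arange(n_trials), best]

single_bias = single.mean()
double_bias = double.mean()
print(f"single estimator bias: {single_bias:.3f}")
print(f"double estimator bias: {double_bias:.3f}")
```

Under these assumptions the single-estimator average comes out clearly positive even though the true maximum is 0, while the double-estimator average stays close to 0.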
### The import section is the same as in Nips-DQN
First, import the required libraries and packages.    

In [1]:
import torch
import torch.nn as nn
from collections import deque
import numpy as np
import gym
import random
from net import AtariNet
from util import preprocess

### This part is the same as in Nature-DQN
Set the hyperparameters.

In [2]:
BATCH_SIZE = 32
LR = 0.001
START_EPSILON = 1.0
FINAL_EPSILON = 0.1
EPSILON = START_EPSILON
EXPLORE = 1000000
GAMMA = 0.99
TOTAL_EPISODES = 10000000
MEMORY_SIZE = 1000000
MEMORY_THRESHOLD = 100000
UPDATE_TIME = 10000
TEST_FREQUENCY = 1000
env = gym.make('Pong-v0')
env = env.unwrapped
ACTIONS_SIZE = env.action_space.n

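Define the Agent class.    
- `__init__` initializes the agent: it builds the networks from net.py and sets the optimizer to Adam and the loss function to MSE loss.    
There are two networks here, `network` and `target_network`.    
`learning_count` counts the learning steps; whenever `learning_count % UPDATE_TIME == 0`, `load_state_dict` copies the current weights of `network` into `target_network`.    
- `action` chooses the move to play. A random number is compared with the threshold `EPSILON` to decide whether the action is random or chosen greedily from the network's Q-values. `torch.unsqueeze()` adds a dimension to the tensor (here the batch dimension). `torch.max(actions_value, 1)` returns the maximum of each row along dimension 1 together with its index; element `[1]` of the result is that index, the greedy action.    
- `learn` stores the transition and performs one training step. `memory.append` saves the tuple (state, action, reward, next state, flag) to the `memory` deque, where the flag is 0 if the episode ended and 1 otherwise. When the deque exceeds `MEMORY_SIZE` the oldest entry is dropped, and while it holds fewer than `MEMORY_THRESHOLD` entries the function returns without training.    
Once the replay memory holds enough data, a random batch of `BATCH_SIZE` transitions is sampled from it. `eval_q` is the Q-value that `network` assigns to the action taken in `state`, and `next_q` is the Q-value that `target_network` assigns to the selected action in `next_state`. From these, `target_q` and the loss are computed and used to update the network.    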
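
The replay-memory behaviour described above can be sketched in isolation. This is an illustrative variant, not the notebook's code: it uses `deque(maxlen=...)` from the standard library in place of the manual length check and `popleft()`, and the helper `store` and the tiny sizes are assumptions chosen to keep the example readable.

```python
import random
from collections import deque

MEMORY_SIZE = 5   # tiny capacity for illustration (the notebook uses 1000000)
BATCH_SIZE = 2    # tiny batch for illustration (the notebook uses 32)

# deque(maxlen=...) drops the oldest entry automatically when full,
# which replaces the manual len() check followed by popleft().
memory = deque(maxlen=MEMORY_SIZE)

def store(state, action, reward, next_state, done):
    # Hypothetical helper: store a "not done" mask,
    # 0.0 for terminal transitions and 1.0 otherwise.
    memory.append((state, action, reward, next_state, 0.0 if done else 1.0))

# Push 8 dummy transitions; only the 5 most recent ones survive.
for t in range(8):
    store(state=t, action=0, reward=0.0, next_state=t + 1, done=(t == 7))

# Uniformly sample a batch without replacement, as learn() does.
batch = random.sample(list(memory), BATCH_SIZE)
print(len(memory), [transition[0] for transition in memory])
```

Storing the flag as a 0/1 mask is what lets the target be written as `reward + GAMMA * next_q * done` with no branch for terminal states.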
    
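### The key point is that DDQN no longer takes the maximum Q-value over actions directly in the target Q-network. Instead, it first finds the action with the largest Q-value in the current Q-network, $a^{\max}(S'_j, w) = \arg\max_{a'} Q(\phi(S'_j), a', w)$, and then uses this action to compute the target Q-value $y_j = R_j + \gamma Q'(\phi(S'_j), a^{\max}(S'_j, w), w')$.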
>         actions_value = self.network.forward(next_state)
        next_action = torch.unsqueeze(torch.max(actions_value, 1)[1], 1)
        eval_q = self.network.forward(state).gather(1, action)
        next_q = self.target_network.forward(next_state).gather(1, next_action)
        target_q = reward + GAMMA * next_q * done
        loss = self.loss_func(eval_q, target_q)

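`gather`: along the given dimension `dim`, collects the values of the input tensor at the positions specified by the index tensor. Here `gather(1, action)` picks, for each row of the batch, the Q-value in the column given by `action`.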
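
To make the shapes concrete, here is a toy version of the target computation on hand-written tensors. All numbers are made up, and `q_online_next`, `q_target_next`, and `not_done` are illustrative stand-ins for `self.network.forward(next_state)`, `self.target_network.forward(next_state)`, and the stored `done` mask. It is a sketch of the idea, not the notebook's training code.

```python
import torch

GAMMA = 0.99

# Toy batch of 2 transitions with 3 actions; all numbers are made up.
q_online_next = torch.tensor([[1.0, 5.0, 2.0],
                              [4.0, 0.0, 3.0]])   # stands in for network(next_state)
q_target_next = torch.tensor([[0.5, 1.5, 9.0],
                              [2.0, 7.0, 1.0]])   # stands in for target_network(next_state)
reward = torch.tensor([[1.0], [0.0]])
not_done = torch.tensor([[1.0], [0.0]])           # second transition is terminal

# torch.max along dim=1 returns (values, indices); [1] is the argmax per row.
next_action = torch.max(q_online_next, 1)[1]      # shape (2,)
next_action = torch.unsqueeze(next_action, 1)     # shape (2, 1), the index shape gather needs

# gather(1, index) picks, for each row, the column named by index.
next_q = q_target_next.gather(1, next_action)     # Double DQN: target net evaluates
double_target = reward + GAMMA * next_q * not_done

# For comparison, the plain DQN target takes the max of the target network itself.
dqn_target = reward + GAMMA * q_target_next.max(1, keepdim=True)[0] * not_done

print(next_action.tolist())
print(double_target.tolist(), dqn_target.tolist())
```

In the first row the online network prefers action 1, so the Double DQN target uses the target network's value 1.5 for that action, not its own maximum of 9.0. In the second row the mask removes the bootstrap term entirely.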

In [3]:
class Agent(object):
    def __init__(self):
        self.network, self.target_network = AtariNet(ACTIONS_SIZE), AtariNet(ACTIONS_SIZE)
        self.memory = deque()
        self.learning_count = 0
        self.optimizer = torch.optim.Adam(self.network.parameters(), lr=LR)
        self.loss_func = nn.MSELoss()

    def action(self, state, israndom):
        if israndom and random.random() < EPSILON:
            return np.random.randint(0, ACTIONS_SIZE)
        state = torch.unsqueeze(torch.FloatTensor(state), 0)
        actions_value = self.network.forward(state)
        return torch.max(actions_value, 1)[1].data.numpy()[0]

    def learn(self, state, action, reward, next_state, done):
        if done:
            self.memory.append((state, action, reward, next_state, 0))
        else:
            self.memory.append((state, action, reward, next_state, 1))
        if len(self.memory) > MEMORY_SIZE:
            self.memory.popleft()
        if len(self.memory) < MEMORY_THRESHOLD:
            return

        if self.learning_count % UPDATE_TIME == 0:
            self.target_network.load_state_dict(self.network.state_dict())
        self.learning_count += 1

        batch = random.sample(self.memory, BATCH_SIZE)
        state = torch.FloatTensor([x[0] for x in batch])
        action = torch.LongTensor([[x[1]] for x in batch])
        reward = torch.FloatTensor([[x[2]] for x in batch])
        next_state = torch.FloatTensor([x[3] for x in batch])
        done = torch.FloatTensor([[x[4]] for x in batch])

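        # Double DQN: the online network selects the next action ...
        actions_value = self.network.forward(next_state)
        next_action = torch.unsqueeze(torch.max(actions_value, 1)[1], 1)
        eval_q = self.network.forward(state).gather(1, action)
        # ... and the target network evaluates that action.
        next_q = self.target_network.forward(next_state).gather(1, next_action)
        # `done` is 0 for terminal transitions and 1 otherwise, so the bootstrap term vanishes at episode end.
        target_q = reward + GAMMA * next_q * done
        loss = self.loss_func(eval_q, target_q)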

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

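### This part is the same as in Nips-DQN
- The outermost for loop:    
`TOTAL_EPISODES` is ten million, and a test is run every one thousand episodes.    
`env.reset()`: resets the environment.    
`preprocess` from util: converts each frame to a 47×47 grayscale image of 32-bit floating-point values between 0 and 1. The reward signal is limited to -1, 0, and 1. No image cropping is performed.
- The first while loop:   
`agent.action(state, True)`: returns an action, either random or the greedy one according to the network's predicted `actions_value`.    
`env.step(action)`: performs the action and returns the next state, the reward, whether the game is over, and info.    
`preprocess(next_state)`: preprocesses the next state.   
`agent.learn(state, action, reward, next_state, done)`: passes the transition to the `learn` function above for training.    
The loop breaks when the game ends.    
After each episode `EPSILON` is decreased, down to its minimum of 0.1.
- Testing every one thousand episodes:    
The `israndom` argument of `action` is set to False, so actions are no longer chosen at random, and the total reward of one full game is computed.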
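
The exploration schedule described above is a linear decay applied once per episode. As a rough sketch, the value of `EPSILON` after a given number of finished episodes can be written in closed form. The function `epsilon_after` is a hypothetical helper that does not appear in the notebook, and it matches the incremental update only up to floating-point rounding.

```python
START_EPSILON = 1.0
FINAL_EPSILON = 0.1
EXPLORE = 1000000

def epsilon_after(episodes):
    """Approximate value of EPSILON after a number of finished episodes."""
    step = (START_EPSILON - FINAL_EPSILON) / EXPLORE
    # Linear decay, clamped at the final value.
    return max(FINAL_EPSILON, START_EPSILON - step * episodes)

for n in (0, 500000, 1000000, 2000000):
    print(n, round(epsilon_after(n), 4))
```

Because the decrement happens per episode and not per environment step, `EPSILON` only reaches 0.1 after `EXPLORE` episodes, that is, one million episodes with the settings above.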

In [None]:
agent = Agent()

for i_episode in range(TOTAL_EPISODES):
    state = env.reset()
    state = preprocess(state)
    while True:
        # env.render()
        action = agent.action(state, True)
        next_state, reward, done, info = env.step(action)
        next_state = preprocess(next_state)
        agent.learn(state, action, reward, next_state, done)

        state = next_state
        if done:
            break
    if EPSILON > FINAL_EPSILON:
        EPSILON -= (START_EPSILON - FINAL_EPSILON) / EXPLORE

    # TEST
    if i_episode % TEST_FREQUENCY == 0:
        state = env.reset()
        state = preprocess(state)
        total_reward = 0
        while True:
            # env.render()
            action = agent.action(state, israndom=False)
            next_state, reward, done, info = env.step(action)
            next_state = preprocess(next_state)

            total_reward += reward

            state = next_state
            if done:
                break
        print('episode: {} , total_reward: {}'.format(i_episode, round(total_reward, 3)))

env.close()