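# DDPG (Deep Deterministic Policy Gradient)
DDPG is a model-free, off-policy algorithm that uses deep neural networks.    
* DQN only handles discrete, low-dimensional action spaces. It is a value-based method with a single value-function network.
* DDPG handles continuous action spaces. It is an actor-critic method: it has both a value-function network (the critic) and a policy network (the actor).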
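
The difference between the two action spaces can be sketched in plain Python. The function names below are hypothetical and scalars stand in for network outputs: a DQN-style agent picks the index of the largest Q value, while a DDPG-style actor outputs a real-valued action squashed into the allowed range with `tanh`, as the Actor in this notebook does.

```python
import math


def greedy_discrete_action(q_values):
    """DQN-style choice: index of the largest Q value among a finite set of actions."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def continuous_action(pre_activation, max_action):
    """DDPG-style choice: squash a raw network output into [-max_action, max_action]."""
    return max_action * math.tanh(pre_activation)


discrete = greedy_discrete_action([0.1, 0.7, 0.3])         # picks one of three actions
continuous = continuous_action(0.5, max_action=2.0)         # any value in [-2.0, 2.0]
```

An argmax over actions is only possible when the actions can be enumerated, which is why a separate actor network is needed in the continuous case.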

## Step1: import
Import PyTorch and the other required libraries. The environment is OpenAI Gym's Pendulum-v0.

In [2]:
import numpy as np
import torch, gym, argparse
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F

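## Step2: Build the replay buffer class
As supervised learning models, deep neural networks assume that the training data are independent and identically distributed. Experience replay breaks the correlation between consecutive samples by storing transitions and sampling them at random.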

In [3]:
class ReplayBuffer(object):
    def __init__(self, max_size=int(1e6)):
        self.storage = []
        self.max_size = int(max_size)  # keep the capacity (and so the pointer) an integer
        self.ptr = 0

    # Add a transition to the replay buffer
    def add(self, data):
        # Once the buffer is full, overwrite existing entries starting from the front
        if len(self.storage) == self.max_size:
            self.storage[self.ptr] = data
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(data)

    # Sample a random batch of transitions (with replacement)
    def sample(self, batch_size):
        ind = np.random.randint(0, len(self.storage), size=batch_size)
        x, y, u, r, d = [], [], [], [], []
        for i in ind:
            # Each entry is (state, next_state, action, reward, done)
            X, Y, U, R, D = self.storage[i]
            # np.asarray avoids a copy when possible and also accepts plain Python scalars
            x.append(np.asarray(X))
            y.append(np.asarray(Y))
            u.append(np.asarray(U))
            r.append(np.asarray(R))
            d.append(np.asarray(D))
        return np.array(x), np.array(y), np.array(u), np.array(r).reshape(-1, 1), np.array(d).reshape(-1, 1)
    
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
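
To illustrate the store-then-sample behaviour, here is a minimal, self-contained sketch of the same ring-buffer idea. `TinyReplayBuffer` is a hypothetical simplified class that stores plain integers instead of transitions and uses only the standard library.

```python
import random


class TinyReplayBuffer:
    """Hypothetical, simplified ring buffer used only for illustration."""

    def __init__(self, max_size):
        self.storage = []
        self.max_size = int(max_size)
        self.ptr = 0

    def add(self, transition):
        if len(self.storage) == self.max_size:
            # Buffer is full: overwrite the slot at ptr, then advance ptr cyclically.
            self.storage[self.ptr] = transition
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling with replacement, so samples are not tied to time order.
        return [random.choice(self.storage) for _ in range(batch_size)]


buffer = TinyReplayBuffer(max_size=3)
for t in range(5):
    buffer.add(t)

print(buffer.storage)  # the two oldest items (0 and 1) have been overwritten
batch = buffer.sample(4)
```

Because the batch is drawn uniformly from everything stored, consecutive and therefore highly correlated transitions rarely end up in the same batch.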

## Step3: Build the Actor and Critic classes
### Key idea: given a policy, estimate its [value function], then [update the policy] according to that value function.

## 1.Actor:
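A neural network approximates the [policy function]. Its input is the observation (state) and its output is an action.    
<font color=blue>This involves a very important concept in reinforcement learning: the policy gradient.</font>    
> Policy gradient methods look for network parameters θ such that the policy followed when interacting with the environment maximizes the expected reward.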
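
For a deterministic actor, the gradient that DDPG follows can be written as below. The symbols $\mu_\theta$ (the actor), $Q$ (the critic) and $J$ (the expected return) are notation introduced here for illustration. The expression is the standard deterministic policy gradient, given as a sketch rather than a derivation.

$$
\nabla_\theta J \approx \mathbb{E}_{s}\left[ \nabla_a Q(s, a)\big|_{a=\mu_\theta(s)} \, \nabla_\theta \mu_\theta(s) \right]
$$

In words: the critic indicates how Q would change if the action changed, and the chain rule carries that signal back into the actor's parameters. Minimizing `-self.critic(state, self.actor(state)).mean()` with respect to the actor's parameters, as the DDPG class in this notebook does, computes this gradient through automatic differentiation.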

In [4]:
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim, 400) # input: state
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim) # output: predicted action
        self.max_action = max_action

    def forward(self, x):
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = self.max_action * torch.tanh(self.l3(x))
        return x

## 2.Critic: 
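A neural network approximates the [value function]. Its inputs are the action and the observation (state), and its output is Q(s, a).    
It is updated in a way similar to DQN, using gradient descent.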
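
The DQN-style update regresses Q(s, a) toward a bootstrapped target. A minimal sketch in plain Python follows, with scalars instead of tensors. The function names `td_target` and `squared_error` are hypothetical and chosen only for illustration.

```python
def td_target(reward, done, next_q, discount=0.99):
    """Bootstrapped target: r + discount * (1 - done) * Q_target(s', a')."""
    return reward + discount * (1.0 - done) * next_q


def squared_error(current_q, target_q):
    """Per-sample critic loss; averaging it over a batch gives the MSE."""
    return (current_q - target_q) ** 2


# Non-terminal transition: the target includes the discounted next-state value.
y_continue = td_target(reward=-1.0, done=0.0, next_q=10.0)
# Terminal transition: the target is just the reward.
y_terminal = td_target(reward=-1.0, done=1.0, next_q=10.0)
```

In DDPG the next-state value comes from the target critic evaluated at the target actor's action, which keeps the regression target from moving as fast as the networks being trained.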

In [5]:
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim, 400) # input: state
        self.l2 = nn.Linear(400 + action_dim, 300) # the action is concatenated in at the second layer
        self.l3 = nn.Linear(300, 1) # output: Q value

    def forward(self, x, u):
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(torch.cat([x, u], 1)))
        x = self.l3(x)
        return x

## DDPG architecture diagram
***
![title](structure.jpg)
Source: https://www.jianshu.com/p/22cdc0d9fa13

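## Step4: Build the DDPG class
This class is the core of DDPG. It holds four networks in total (an online and a target version of both the actor and the critic) and uses the replay buffer passed in from the main function.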

In [6]:
class DDPG(object):
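    #####----------------Create the actor and critic objects----------------#####
    # Each has two networks: a target network and an online network
    # Both use the Adam optimizer
    # For Pendulum, state_dim = 3 and the single action is continuous (from -2.0 to 2.0)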
    def __init__(self, state_dim, action_dim, max_action):
        self.actor = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=1e-4)
        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = Critic(state_dim, action_dim).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), weight_decay=1e-2)

    # Predict an action from the online actor network for the given state
    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()

    def train(self, replay_buffer, iterations, batch_size=64, discount=0.99, tau=0.001):

        # iterations = episode_timesteps, usually 200
        for _ in range(iterations):
            x, y, u, r, d = replay_buffer.sample(batch_size) # sample from the replay buffer
            # Convert the NumPy arrays to tensors
            state = torch.FloatTensor(x).to(device)
            action = torch.FloatTensor(u).to(device)
            next_state = torch.FloatTensor(y).to(device)
            not_done = torch.FloatTensor(1 - d).to(device) # 1 for non-terminal transitions, 0 for terminal ones
            reward = torch.FloatTensor(r).to(device)

            # Using the sampled batch, compute the target Q value from the target networks and the current Q value from the online critic
            target_Q = self.critic_target(next_state, self.actor_target(next_state))
            target_Q = reward + (not_done * discount * target_Q).detach()
            current_Q = self.critic(state, action)

            # Compute the critic loss from the two Q values, then update the critic and the actor with their optimizers
            critic_loss = F.mse_loss(current_Q, target_Q)
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            actor_loss = -self.critic(state, self.actor(state)).mean()
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Soft-update the target networks toward the online networks
            for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
            for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
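
The target-network update at the end of `train` is a soft (Polyak) update: each target parameter moves a fraction `tau` toward its online counterpart. A minimal sketch of that arithmetic follows, using plain Python lists in place of tensors. The function name `soft_update` and the example values are hypothetical.

```python
def soft_update(online_params, target_params, tau):
    """Return target parameters moved a fraction tau toward the online parameters."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(online_params, target_params)]


online = [1.0, -2.0]   # stand-in for the online network's weights (held fixed here)
target = [0.0, 0.0]    # stand-in for the target network's weights

for _ in range(1000):
    target = soft_update(online, target, tau=0.005)

# After n updates the remaining gap is (1 - tau) ** n of the original gap.
remaining_gap = (1.0 - 0.005) ** 1000
```

With `tau = 0.005` the target needs on the order of a thousand updates to close about 99% of the gap to a fixed online network, so the targets used in the critic loss change slowly.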

## Pendulum environment settings
***
![title](Pendulum_details.png)
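
The log in this notebook shows every Pendulum episode lasting 200 steps, which is a time limit rather than a true terminal state. The training loop therefore stores a done flag of 0 for the step that hits the limit, so the critic still bootstraps from the next state. A minimal sketch of that rule follows. The function name `done_flag_for_buffer` is hypothetical.

```python
def done_flag_for_buffer(done, episode_timesteps, max_episode_steps):
    """Return 0.0 when the episode ended only because the step limit was reached."""
    if episode_timesteps + 1 == max_episode_steps:
        return 0.0           # time-limit cutoff: not a real terminal state
    return float(done)       # otherwise pass the environment's done flag through


cutoff = done_flag_for_buffer(True, episode_timesteps=199, max_episode_steps=200)
early_end = done_flag_for_buffer(True, episode_timesteps=50, max_episode_steps=200)
running = done_flag_for_buffer(False, episode_timesteps=50, max_episode_steps=200)
```

Treating the cutoff as terminal would teach the critic that the value of the state at the step limit is zero, even though the pendulum could have kept running.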

## Step5: Variable setup and main loop

In [None]:
#####----------------Set parameters----------------#####
env_name = "Pendulum-v0"
seed = 0 # Sets Gym, PyTorch and Numpy seeds
start_timesteps = 1e4 # number of initial timesteps that use a random policy
max_timesteps = 1e6
expl_noise = 0.1 # Gaussian exploration
batch_size = 100
GAMMA = 0.99 # Discount
tau = 0.005 # DDPG update rate
policy_noise = 0.2  # Noise to target policy during critic update
noise_clip = 0.5 # Range to clip target policy noise
policy_freq = 2 # Frequency of delayed policy updates


#####----------------Main Function----------------#####
env = gym.make(env_name) # use OpenAI Gym's Pendulum-v0 (inverted pendulum) environment
env.seed(seed) # all random seeds are set to 0
torch.manual_seed(seed)
np.random.seed(seed)
state_dim = env.observation_space.shape[0] # dimension of the state (observation)
action_dim = env.action_space.shape[0]     # dimension of the action
max_action = float(env.action_space.high[0]) # upper bound of the action
policy = DDPG(state_dim, action_dim, max_action) ##### create the DDPG object #####
replay_buffer = ReplayBuffer() ##### create the ReplayBuffer object #####

# Initialize variables
total_timesteps = 0
timesteps_since_eval = 0
episode_num = 0
episode_reward = 0
episode_timesteps = 0
done = True


#####----------------Training loop for the agent----------------#####
while total_timesteps < max_timesteps: # stop once 1e6 (one million) timesteps have been run
    # ------After 200 steps the episode ends and done becomes True, so this branch runs------
    if done: 
        if total_timesteps != 0:
            # Print the episode number and the reward collected in that episode
            print(("Total T: %d Episode Num: %d Episode T: %d Reward: %f") % (total_timesteps, episode_num, episode_timesteps, episode_reward))
            ##### train the DDPG object #####
            policy.train(replay_buffer, episode_timesteps, batch_size, GAMMA, tau)

        # Reset the per-episode variables and increment episode_num
        obs = env.reset() # reset the environment
        done = False
        episode_reward = 0
        episode_timesteps = 0
        episode_num += 1

    # ------random policy------
    if total_timesteps < start_timesteps: 
        action = env.action_space.sample() # sample a random action from the action space
    # ------the actor chooses the action from the state------
    else:
        action = policy.select_action(np.array(obs))  ##### the DDPG object selects the action #####
        if expl_noise != 0: # add Gaussian exploration noise, then clip to the valid action range
            action = (action + np.random.normal(0, expl_noise, size=env.action_space.shape[0])).clip(
                env.action_space.low, env.action_space.high)

    # ------Apply the action to the environment------
    new_obs, reward, done, _ = env.step(action) # returns the new state, the reward, whether the episode ended, and info
    done_bool = 0 if episode_timesteps + 1 == env._max_episode_steps else float(done) # done_bool equals done, except when the episode ends only because it hit the step limit; then it is 0
    episode_reward += reward # accumulate the reward over the episode

    replay_buffer.add((obs, new_obs, action, reward, done_bool)) # add the transition to the replay buffer
    obs = new_obs # the new state replaces the old one
    # Increment the counters to record that one action was taken
    episode_timesteps += 1
    total_timesteps += 1
    timesteps_since_eval += 1

Total T: 200 Episode Num: 1 Episode T: 200 Reward: -1771.013012
Total T: 400 Episode Num: 2 Episode T: 200 Reward: -1222.142191
Total T: 600 Episode Num: 3 Episode T: 200 Reward: -1710.377831
Total T: 800 Episode Num: 4 Episode T: 200 Reward: -1031.276423
Total T: 1000 Episode Num: 5 Episode T: 200 Reward: -1547.753475
Total T: 1200 Episode Num: 6 Episode T: 200 Reward: -1651.807438
Total T: 1400 Episode Num: 7 Episode T: 200 Reward: -852.814602
Total T: 1600 Episode Num: 8 Episode T: 200 Reward: -1871.344309
Total T: 1800 Episode Num: 9 Episode T: 200 Reward: -790.870957
Total T: 2000 Episode Num: 10 Episode T: 200 Reward: -1541.124861
Total T: 2200 Episode Num: 11 Episode T: 200 Reward: -1283.404255
Total T: 2400 Episode Num: 12 Episode T: 200 Reward: -1100.613597
Total T: 2600 Episode Num: 13 Episode T: 200 Reward: -1357.381504
Total T: 2800 Episode Num: 14 Episode T: 200 Reward: -973.311868
Total T: 3000 Episode Num: 15 Episode T: 200 Reward: -876.232186
Total T: 3200 Episode Num: 