1. Initialize the online network Q(s, a) and the target network Q'(s, a) with random weights, set epsilon to its starting value, and create an empty replay buffer.
2. With probability epsilon choose a random action a; otherwise choose a = argmax_a Q(s, a).
3. Execute action a in the environment and observe the reward r and the next state s'.
4. Store the transition (s, a, r, s') in the replay buffer.
5. Sample a random mini-batch of transitions from the replay buffer.
6. For each transition in the batch, compute the target y: if the episode ended at this step, y = r; otherwise y = r + GAMMA * max_a' Q'(s', a').
7. Calculate the loss L = (Q(s, a) - y)^2, where Q(s, a) comes from the online network and y is computed with the target network.
8. Update Q(s, a) by minimizing L with stochastic gradient descent.
9. Every N steps, copy the weights from Q to Q'.
10. Repeat from step 2 until convergence.
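
The target and loss in steps 6 and 7 can be checked by hand on a tiny batch. The following is a minimal sketch in plain NumPy, not the notebook's training code: the helper name `td_targets` and all the numbers are made up for illustration.

```python
import numpy as np

GAMMA = 0.99

def td_targets(rewards, dones, next_q_target, gamma=GAMMA):
    """Step 6: y = r for terminal transitions, else r + gamma * max_a' Q'(s', a')."""
    max_next_q = next_q_target.max(axis=1)
    # (~dones) is 0 where the episode ended, which removes the bootstrap term.
    return rewards + gamma * max_next_q * (~dones)

# Toy batch: two transitions, three actions each (illustrative values only).
rewards = np.array([1.0, -1.0])
dones = np.array([False, True])
next_q_target = np.array([[0.5, 2.0, 1.0],    # Q'(s', .) for transition 0
                          [3.0, 0.0, 4.0]])   # ignored, transition 1 is terminal
q_taken = np.array([2.5, 0.0])                # Q(s, a) for the actions actually taken

y = td_targets(rewards, dones, next_q_target)
loss = np.mean((q_taken - y) ** 2)            # step 7, averaged over the batch
print(y, loss)
```

The first transition bootstraps from the best next-state value (1 + 0.99 * 2.0), while the terminal one uses the reward alone.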

In [1]:
from lib import wrappers
from lib import dqn_model

In [2]:
import argparse
import time
import numpy as np
import collections

import torch
import torch.nn as nn
import torch.optim as optim

from tensorboardX import SummaryWriter

In [3]:
DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
MEAN_REWARD_BOUND = 19.5

# Discount factor
GAMMA = 0.99
# Number of transitions sampled from the replay buffer per training step
BATCH_SIZE = 32
# Maximum capacity of the replay buffer
REPLAY_SIZE = 10000
# Number of frames to collect before training starts
REPLAY_START_SIZE = 10000
# Adam learning rate
LEARNING_RATE = 1e-4
# How often (in frames) the online network's weights are copied to the target network
SYNC_TARGET_FRAMES = 1000

In [4]:
# Epsilon starts at 1.0 and decays linearly over roughly the first 100000 frames. It is clamped at 0.02
# (2% random actions), a floor that this schedule reaches at frame 98000.
EPSILON_DECAY_LAST_FRAME = 10**5
EPSILON_START = 1.0
EPSILON_FINAL = 0.02
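
As a quick check of the schedule these constants produce, here is a small sketch. The helper `epsilon_at` is a hypothetical name introduced only for this example; its formula is the one used later in the training loop.

```python
EPSILON_DECAY_LAST_FRAME = 10**5
EPSILON_START = 1.0
EPSILON_FINAL = 0.02

def epsilon_at(frame_idx):
    """Linear decay from EPSILON_START, clamped from below at EPSILON_FINAL."""
    return max(EPSILON_FINAL, EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)

# Sample the schedule at a few frame counts.
for frame in (0, 50_000, 98_000, 200_000):
    print(frame, round(epsilon_at(frame), 4))
```

Because the line is clamped rather than rescaled, the floor of 0.02 is hit slightly before frame 100000, which matches the `eps 0.02` values in the later part of the training log.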

In [5]:
# The experience buffer stores the result of every step and keeps only the most recent transitions
# (10000 here). Training batches are sampled at random so that the samples in a batch are not
# consecutive, strongly correlated steps.
Experience = collections.namedtuple('Experience', field_names=['state', 'action', 'reward', 'done', 'new_state'])

class ExperienceBuffer:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)
        
    def __len__(self):
        return len(self.buffer)
    
    def append(self, experience):
        self.buffer.append(experience)
    
    # Repack the sampled transitions into NumPy arrays for convenient batch computation
    def sample(self, batch_size):
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        states, actions, rewards, dones, next_states = zip(*[self.buffer[idx] for idx in indices])
        return np.array(states), np.array(actions), np.array(rewards, dtype=np.float32), \
               np.array(dones, dtype=np.uint8), np.array(next_states)

# The agent interacts with the environment and stores the results in the replay buffer
class Agent:
    def __init__(self, env, exp_buffer):
        self.env = env
        self.exp_buffer = exp_buffer
        self._reset()
        
    def _reset(self):
        self.state = self.env.reset()
        self.total_reward = 0.0
    
    # Play one step in the environment and store the transition in the buffer. With probability
    # epsilon the action is random; otherwise it is the action with the highest Q-value.
    def play_step(self, net, epsilon=0.0, device="cpu"):
        done_reward = None
        
        if np.random.random() < epsilon:
            action = self.env.action_space.sample()
        else:
            # Add a batch dimension of size 1 before passing the state to the network
            state_a = np.asarray([self.state])
            state_v = torch.tensor(state_a).to(device)
            q_vals_v = net(state_v)
            _, act_v = torch.max(q_vals_v, dim=1)
            action = int(act_v.item())
            
        # Take the step in the environment
        new_state, reward, is_done, _ = self.env.step(action)
        self.total_reward += reward
        
        exp = Experience(self.state, action, reward, is_done, new_state)
        self.exp_buffer.append(exp)
        self.state = new_state
        if is_done:
            done_reward = self.total_reward
            self._reset()
        return done_reward
    
# The loss is computed for the whole batch at once with tensor operations on the GPU, which is at
# least twice as fast as the non-batched version.
# For non-terminal steps: L = (Q(s, a) - (r + GAMMA * max_a' Q'(s', a')))^2
# For terminal steps:     L = (Q(s, a) - r)^2

# The batch is the tuple of arrays returned by ExperienceBuffer.sample(). Both the online and the
# target network are passed in; detach() keeps gradients from flowing into the target network.
def calc_loss(batch, net, tgt_net, device="cpu"):
    states, actions, rewards, dones, next_states = batch
    # The batch arrives as NumPy arrays; convert each one to a tensor and move it to the selected
    # device (the GPU if CUDA is enabled).
    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions, dtype=torch.long).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.tensor(dones, dtype=torch.bool).to(device)

    # Pass the observations through the online network and use gather() to pick out the Q-value of the
    # action actually taken. The first argument of gather() is the dimension to index along (1 is the
    # action dimension); unsqueeze(-1) adds a trailing dimension so the index tensor has shape
    # (batch, 1), and squeeze(-1) removes that dimension from the result.
    # gather() is differentiable, so the gradient of the loss flows back through the selected Q-values.
    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)

    # Apply the target network to the next-state observations and take the maximum Q-value along the
    # action dimension (1). max() returns both the maximum values and their indices (max and argmax);
    # only the values are needed here, so we take element [0].
    next_state_values = tgt_net(next_states_v).max(1)[0]

    # For transitions that ended an episode there is no next state to collect reward from, so the
    # bootstrapped next-state value is set to 0.0. Without this, training will not converge.
    next_state_values[done_mask] = 0.0

    # Detach the values from the computation graph so that backpropagation does not flow into the
    # target network through the next-state values.
    next_state_values = next_state_values.detach()

    # Compute the Bellman target and the mean squared error loss
    expected_state_action_values = next_state_values * GAMMA + rewards_v
    return nn.MSELoss()(state_action_values, expected_state_action_values)
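
The `gather(...)` and `max(1)[0]` calls in `calc_loss` are easier to follow on a small tensor. This is an illustrative sketch with made-up Q-values, not output of the real network:

```python
import torch

# Hypothetical Q-values for a batch of 3 states with 4 actions each.
q_values = torch.tensor([[0.1, 0.2, 0.3, 0.4],
                         [1.0, 2.0, 3.0, 4.0],
                         [10.0, 20.0, 30.0, 40.0]])
actions = torch.tensor([2, 0, 3])   # index of the action taken in each state

# unsqueeze(-1) turns shape (3,) into (3, 1) so it can index along dim 1;
# squeeze(-1) drops the extra dimension from the gathered result.
chosen = q_values.gather(1, actions.unsqueeze(-1)).squeeze(-1)

# Equivalent advanced-indexing form, shown only for comparison.
same = q_values[torch.arange(len(actions)), actions]

# max(1) returns (values, indices); [0] keeps the per-row maximum values.
row_max = q_values.max(1)[0]

print(chosen, row_max)
```

Here `chosen` holds one Q-value per row (the one for the action taken), which is the quantity compared against the Bellman target in the loss.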
    

In [None]:
if __name__ == "__main__":
    # Create the argument parser and select the device (CUDA if requested)
    parser = argparse.ArgumentParser()
    parser.add_argument("--cuda", default=False, action="store_true", help="Enable cuda")
    parser.add_argument("--env", default=DEFAULT_ENV_NAME, help="Name of the environment, default=" + DEFAULT_ENV_NAME)
    parser.add_argument("--reward", type=float, default=MEAN_REWARD_BOUND,
                        help="Mean reward boundary to stop training, default=%.2f" % MEAN_REWARD_BOUND)
    args, unknown = parser.parse_known_args()
    device = torch.device("cuda" if args.cuda else "cpu")
    
    # Build the environment with its wrappers and construct the online and target networks.
    # The two networks start with different random weights, but the target network is overwritten with
    # the online weights every 1000 frames (roughly one early Pong episode), so this matters little.
    env = wrappers.make_env(args.env)
    net = dqn_model.DQN(env.observation_space.shape, env.action_space.n).to(device)
    tgt_net = dqn_model.DQN(env.observation_space.shape, env.action_space.n).to(device)
    
    # Create the replay buffer and pass it to the agent. Epsilon starts at 1.0 and is decreased on
    # every iteration of the training loop.
    writer = SummaryWriter(comment="-" + args.env)
    print(net)
    
    buffer = ExperienceBuffer(REPLAY_SIZE)
    agent = Agent(env, buffer)
    epsilon = EPSILON_START
    
    # Create the optimizer, the list of finished-episode rewards, the frame counter, the variables used
    # to measure speed, and the best mean reward seen so far (updated whenever a new record is set).
    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
    total_rewards = []
    frame_idx = 0
    ts_frame = 0
    ts = time.time()
    best_mean_reward = None
    
    while True:
        # Count the frame and decay epsilon linearly; EPSILON_DECAY_LAST_FRAME (100000 frames) sets the
        # slope. Once epsilon reaches EPSILON_FINAL (0.02) it stays at that value.
        frame_idx += 1
        epsilon = max(EPSILON_FINAL, EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)
        
        # The agent plays one step. When that step ends an episode, play_step() returns the episode's
        # total reward (otherwise None), and we record:
        #   - the current epsilon value
        #   - the speed, in frames processed per second
        #   - the mean reward over the last 100 episodes
        #   - the number of episodes played
        reward = agent.play_step(net, epsilon, device=device)
        if reward is not None:
            total_rewards.append(reward)
            speed = (frame_idx - ts_frame) / (time.time() - ts)
            ts_frame = frame_idx
            ts = time.time()
            mean_reward = np.mean(total_rewards[-100:])
            print("%d: done %d games, mean reward %.3f, eps %.2f, speed %.2f f/s" 
                  %(frame_idx, len(total_rewards), mean_reward, epsilon, speed))
            writer.add_scalar("epsilon", epsilon, frame_idx)
            writer.add_scalar("speed", speed, frame_idx)
            writer.add_scalar("reward_100", mean_reward, frame_idx)
            writer.add_scalar("reward", reward, frame_idx)
        
            # Whenever the mean reward over the last 100 episodes sets a new record, report it and save
            # the model. Training stops once the mean reward exceeds the reward boundary.
            # For Pong the boundary is 19.5; the maximum possible episode reward is 21.
            if best_mean_reward is None or best_mean_reward < mean_reward:
                torch.save(net.state_dict(), args.env + "-best.dat")
                if best_mean_reward is not None:
                    print("Best mean reward updated %.3f -> %.3f, model saved" %(best_mean_reward, mean_reward))
                best_mean_reward = mean_reward
            if mean_reward > args.reward:
                print("Solved in %d frames!" % frame_idx)
                break
        
        # Wait until the buffer has accumulated enough transitions before training starts
        if len(buffer) < REPLAY_START_SIZE:
            continue
        
        # Every SYNC_TARGET_FRAMES (1000) frames, copy the online network's weights to the target network
        if frame_idx % SYNC_TARGET_FRAMES == 0:
            tgt_net.load_state_dict(net.state_dict())
        
        # Zero the gradients, sample a batch from the replay buffer, compute the loss and take one
        # optimizer step to minimize it. This part takes most of the running time.
        optimizer.zero_grad()
        batch = buffer.sample(BATCH_SIZE)
        loss_t = calc_loss(batch, net, tgt_net, device=device)
        loss_t.backward()
        optimizer.step()
    writer.close()

DQN(
  (conv): Sequential(
    (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
    (1): ReLU()
    (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
    (3): ReLU()
    (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
  )
  (fc): Sequential(
    (0): Linear(in_features=3136, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=6, bias=True)
  )
)
762: done 1 games, mean reward -21.000, eps 0.99, speed 313.00 f/s
1602: done 2 games, mean reward -20.500, eps 0.98, speed 303.73 f/s
Best mean reward updated -21.000 -> -20.500, model saved
2578: done 3 games, mean reward -20.000, eps 0.97, speed 299.36 f/s
Best mean reward updated -20.500 -> -20.000, model saved
3466: done 4 games, mean reward -20.250, eps 0.97, speed 263.24 f/s
4486: done 5 games, mean reward -20.000, eps 0.96, speed 285.28 f/s
5368: done 6 games, mean reward -20.167, eps 0.95, speed 295.68 f/s
6158: done 7 games, mean reward -20.286, eps 0.94, speed 27

144743: done 103 games, mean reward -17.800, eps 0.02, speed 7.87 f/s
Best mean reward updated -18.100 -> -17.800, model saved
146838: done 104 games, mean reward -17.460, eps 0.02, speed 8.04 f/s
Best mean reward updated -17.800 -> -17.460, model saved
148659: done 105 games, mean reward -17.090, eps 0.02, speed 8.10 f/s
Best mean reward updated -17.460 -> -17.090, model saved
150794: done 106 games, mean reward -16.720, eps 0.02, speed 8.24 f/s
Best mean reward updated -17.090 -> -16.720, model saved
153079: done 107 games, mean reward -16.380, eps 0.02, speed 8.21 f/s
Best mean reward updated -16.720 -> -16.380, model saved
154802: done 108 games, mean reward -15.980, eps 0.02, speed 8.19 f/s
Best mean reward updated -16.380 -> -15.980, model saved
156497: done 109 games, mean reward -15.580, eps 0.02, speed 8.23 f/s
Best mean reward updated -15.980 -> -15.580, model saved
159949: done 110 games, mean reward -15.340, eps 0.02, speed 8.23 f/s
Best mean reward updated -15.580 -> -15.3

282876: done 169 games, mean reward 5.860, eps 0.02, speed 7.97 f/s
Best mean reward updated 5.490 -> 5.860, model saved
284724: done 170 games, mean reward 6.250, eps 0.02, speed 7.96 f/s
Best mean reward updated 5.860 -> 6.250, model saved
286636: done 171 games, mean reward 6.600, eps 0.02, speed 7.95 f/s
Best mean reward updated 6.250 -> 6.600, model saved
288901: done 172 games, mean reward 6.930, eps 0.02, speed 8.35 f/s
Best mean reward updated 6.600 -> 6.930, model saved
290809: done 173 games, mean reward 7.280, eps 0.02, speed 8.05 f/s
Best mean reward updated 6.930 -> 7.280, model saved
292820: done 174 games, mean reward 7.650, eps 0.02, speed 7.95 f/s
Best mean reward updated 7.280 -> 7.650, model saved
295107: done 175 games, mean reward 7.970, eps 0.02, speed 7.97 f/s
Best mean reward updated 7.650 -> 7.970, model saved
296917: done 176 games, mean reward 8.330, eps 0.02, speed 7.98 f/s
Best mean reward updated 7.970 -> 8.330, model saved
298613: done 177 games, mean rew

426338: done 243 games, mean reward 17.270, eps 0.02, speed 8.11 f/s
Best mean reward updated 17.240 -> 17.270, model saved
428177: done 244 games, mean reward 17.290, eps 0.02, speed 8.35 f/s
Best mean reward updated 17.270 -> 17.290, model saved
430120: done 245 games, mean reward 17.310, eps 0.02, speed 8.51 f/s
Best mean reward updated 17.290 -> 17.310, model saved
432052: done 246 games, mean reward 17.320, eps 0.02, speed 8.16 f/s
Best mean reward updated 17.310 -> 17.320, model saved
433857: done 247 games, mean reward 17.310, eps 0.02, speed 8.09 f/s
435513: done 248 games, mean reward 17.320, eps 0.02, speed 8.07 f/s
437290: done 249 games, mean reward 17.350, eps 0.02, speed 8.19 f/s
Best mean reward updated 17.320 -> 17.350, model saved
438944: done 250 games, mean reward 17.370, eps 0.02, speed 8.12 f/s
Best mean reward updated 17.350 -> 17.370, model saved
441085: done 251 games, mean reward 17.350, eps 0.02, speed 8.07 f/s
442939: done 252 games, mean reward 17.340, eps 0