In this notebook, we will try to implement the DQN algorithm, originally introduced for **Atari** games, on the simpler CartPole-v1 environment. We use the **gym** library, which makes the environment easier to manage. To implement the algorithm, we need to define a few classes that store the transitions and handle the training.

In [1]:
# we import the necessary packages for our algorithm
import gym
import collections
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

With this class, we save the transitions observed in the environment: each action together with the state before and after it, the reward, and the done mask. The class contains three methods: put, sample and size.
The **put** method appends the latest transition and enforces the memory limit, so that only the 50,000 most recent transitions are used for learning. The **sample** method draws n transitions at random from the memory and returns their components (states, actions, rewards, next states and done masks) as tensors. The **size** method returns the current size of the memory.

In [2]:
class Replay():
    def __init__(self):
        self.Replay = collections.deque()
        self.batch_size = 32
        self.size_limit = 50000
    
    def put(self, transition):
        self.Replay.append(transition)
        if len(self.Replay) > self.size_limit:
            self.Replay.popleft() # drop the oldest transition to stay within the size limit
    
    def sample(self, n):
        mini_batch = random.sample(self.Replay, n)
        s_lst, a_lst, r_lst, s_prime_lst, done_mask_lst = [], [], [], [], []
        
        for transition in mini_batch:
            s, a, r, s_prime, done_mask = transition
            s_lst.append(s)
            a_lst.append([a])
            r_lst.append([r])
            s_prime_lst.append(s_prime)
            done_mask_lst.append([done_mask])

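        # Stack the states first: building a tensor from a list of NumPy arrays is slow
        s_batch = torch.stack([torch.as_tensor(s, dtype=torch.float) for s in s_lst])
        s_prime_batch = torch.stack([torch.as_tensor(s, dtype=torch.float) for s in s_prime_lst])
        return s_batch, torch.tensor(a_lst), torch.tensor(r_lst), \
               s_prime_batch, torch.tensor(done_mask_lst)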
        
    def size(self):
        return len(self.Replay)
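
As an aside, `collections.deque` can enforce the size limit by itself through its `maxlen` argument, which would make the manual `popleft()` unnecessary. The following is a minimal sketch of that variant, not the class used in this notebook; the name `BoundedReplay` and the tiny limit of 3 are assumptions chosen for illustration.

```python
import collections
import random


class BoundedReplay:
    """Hypothetical variant of the replay memory that relies on deque(maxlen=...)."""

    def __init__(self, size_limit=50000):
        # When the deque is full, appending drops the oldest item automatically.
        self.buffer = collections.deque(maxlen=size_limit)

    def put(self, transition):
        self.buffer.append(transition)

    def sample(self, n):
        # Uniform sampling without replacement
        return random.sample(list(self.buffer), n)

    def size(self):
        return len(self.buffer)


memory = BoundedReplay(size_limit=3)
for step in range(5):
    # Dummy transition: (state, action, reward, next_state, done_mask)
    memory.put((step, 0, 1.0, step + 1, 1.0))

oldest_state = memory.buffer[0][0]  # state of the oldest transition still stored
batch = memory.sample(2)
```

After five insertions with a limit of 3, only the transitions starting from states 2, 3 and 4 remain in the buffer.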

In this part, we implement the model used for learning: a neural network with 3 linear layers, with a ReLU activation after each of the first two.

In [3]:
class Qnetwork(nn.Module): # our Qnetwork class inherits from nn.Module in PyTorch
    def __init__(self):
        super(Qnetwork, self).__init__()
        self.fc1 = nn.Linear(4, 64) # input layer: the CartPole observation has 4 values
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 2) # output layer: one Q-value per action (0: left, 1: right)
        
    def forward(self, x):
        x = F.relu(self.fc1(x)) #fully connected layer followed by a relu
        x = F.relu(self.fc2(x)) # fully connected layer followed by a relu
        x = self.fc3(x) # no relu since Q-value can be negative
        return x
    
    def sample_action(self, obs, epsilon): # epsilon-greedy 
        out = self.forward(obs)
        coin = random.random()
        if coin < epsilon:
            return random.randint(0,1)
        else:
            return out.argmax().item() # return the one with largest Q-val
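
The epsilon-greedy rule in `sample_action` can be illustrated without a network. The sketch below assumes the Q-values are already available as a plain list; the helper name `epsilon_greedy` and the example values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.2, 0.7]  # made-up Q-values for actions 0 and 1

greedy_action = epsilon_greedy(q_values, epsilon=0.0)  # epsilon = 0 never explores

rng = random.Random(0)  # seeded generator so the run is repeatable
actions = [epsilon_greedy(q_values, epsilon=0.08, rng=rng) for _ in range(10000)]
greedy_fraction = actions.count(1) / len(actions)
```

With two actions and epsilon = 0.08, roughly 96% of the choices should be greedy, because half of the random draws also land on the greedy action.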

In [4]:
# The function that trains the model with a given batch size and optimizer
# (we will use the RMSprop algorithm); each call performs 10 gradient steps
def train(q, q_target, memory, gamma, optimizer, batch_size):
    for i in range(10):
        s,a,r,s_prime,done_mask = memory.sample(batch_size)
        
        q_out = q(s) # s: (batch_size, 4) -> q_out: (batch_size, 2)
        q_a = q_out.gather(1,a) # keep only the Q-value of the action taken: (batch_size, 1)
        with torch.no_grad(): # the target is treated as a constant, so no gradient reaches q_target
            max_q_prime = q_target(s_prime).max(1)[0].unsqueeze(1)
            target = r + gamma * max_q_prime * done_mask # done_mask is 0 at terminal states
        loss = F.smooth_l1_loss(q_a, target) # Huber loss; arguments are (input, target)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
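
To make the target computation concrete, here is a small worked example in plain Python. The numbers are made up; only the formula `r + gamma * max_q_prime * done_mask`, the value `gamma = 0.98` and the 1/200 reward scaling come from the notebook. The helper name `td_target` is hypothetical.

```python
def td_target(r, gamma, max_q_prime, done_mask):
    """One-step TD target; done_mask is 0.0 for terminal transitions, else 1.0."""
    return r + gamma * max_q_prime * done_mask


gamma = 0.98
r = 1 / 200.0  # rewards are scaled by 1/200 before being stored

# Non-terminal transition: bootstrap from the target network's best Q-value
non_terminal = td_target(r, gamma, max_q_prime=0.5, done_mask=1.0)

# Terminal transition: the mask removes the bootstrap term, leaving only the reward
terminal = td_target(r, gamma, max_q_prime=0.5, done_mask=0.0)
```

For the non-terminal case the target is 0.005 + 0.98 * 0.5 = 0.495, while for the terminal case it reduces to the scaled reward 0.005.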

In [5]:
def main():
    env = gym.make('CartPole-v1')
    q = Qnetwork()
    q_target = Qnetwork() # create instance
    q_target.load_state_dict(q.state_dict()) # duplicate
    memory = Replay()
    
    print_interval = 20
    score = 0.0
    render = False
    
    gamma = 0.98
    batch_size = 32
    optimizer = optim.RMSprop(q.parameters(), lr=0.0005) # RMSprop, as in the article
    # only q is optimized; q_target is refreshed by copying q's weights
    
    for n_epi in range(10000):
        epsilon = max(0.01, 0.08 - 0.01*(n_epi/200))# epsilon decreases from 8% to 1% linearly throughout episodes
        s = env.reset()
        done = False
        if render:
            env.render()
        
        for t in range(600):
            a = q.sample_action(torch.from_numpy(s).float(), epsilon)
            s_prime, r, done, info = env.step(a)
            done_mask = 0.0 if done else 1.0 # zeroes the bootstrap term of the TD target at terminal states
            memory.put((s,a,r/200.0, s_prime, done_mask))
            s = s_prime
            
            score += r
            if score > 1000:
                render = True
            if done:
                break
                
        if memory.size()>2000: # start training only once the buffer holds more than 2000 transitions
            train(q, q_target, memory, gamma, optimizer, batch_size)
            
            
        if n_epi%20==0 and n_epi!=0:
            q_target.load_state_dict(q.state_dict()) # update target Q every 20 episode
            print(f"Episode {n_epi}:  Buffer size: {memory.size()}, Score: {score}, EPS: {epsilon*100:.1f}")
            
            score = 0.0
    env.close()

if __name__ == '__main__':
    main()

Episode 20:  Buffer size: 666, Score: 666.0, EPS: 7.9
Episode 40:  Buffer size: 1323, Score: 657.0, EPS: 7.8
Episode 60:  Buffer size: 2000, Score: 677.0, EPS: 7.7


Episode 80:  Buffer size: 2249, Score: 249.0, EPS: 7.6
Episode 100:  Buffer size: 2452, Score: 203.0, EPS: 7.5
Episode 120:  Buffer size: 2646, Score: 194.0, EPS: 7.4
Episode 140:  Buffer size: 2839, Score: 193.0, EPS: 7.3
Episode 160:  Buffer size: 3076, Score: 237.0, EPS: 7.2
Episode 180:  Buffer size: 3372, Score: 296.0, EPS: 7.1
Episode 200:  Buffer size: 3892, Score: 520.0, EPS: 7.0
Episode 220:  Buffer size: 4773, Score: 881.0, EPS: 6.9
Episode 240:  Buffer size: 6042, Score: 1269.0, EPS: 6.8
Episode 260:  Buffer size: 6911, Score: 869.0, EPS: 6.7
Episode 280:  Buffer size: 9369, Score: 2458.0, EPS: 6.6
Episode 300:  Buffer size: 14481, Score: 5112.0, EPS: 6.5
Episode 320:  Buffer size: 19947, Score: 5466.0, EPS: 6.4
Episode 340:  Buffer size: 28031, Score: 8084.0, EPS: 6.3
Episode 360:  Buffer size: 35632, Score: 7601.0, EPS: 6.2
Episode 380:  Buffer size: 44753, Score: 9121.0, EPS: 6.1
Episode 400:  Buffer size: 50000, Score: 8653.0, EPS: 6.0
Episode 420:  Buffer size: 50000, S

We see that learning improves over the episodes, although not steadily, which was expected since the problem is stochastic. Overall, the algorithm does well on this game, and a more complex model might give better results.
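
Note that the logged score is a sum over the episodes since the previous print, not a per-episode return. A quick way to read the log is to divide by 20. The sketch below does this for a few of the values printed above; the variable names are illustrative, and the first print (episode 20) is left out because it covers episodes 0 to 20.

```python
# Scores copied from the log above: episode -> summed score of the last 20 episodes
logged_scores = {100: 203.0, 200: 520.0, 300: 5112.0, 400: 8653.0}
episodes_per_print = 20

# Average return per episode for each logged interval
average_return = {
    ep: total / episodes_per_print for ep, total in logged_scores.items()
}

for ep, avg in average_return.items():
    print(f"Episode {ep}: average return per episode = {avg:.2f}")
```

Read this way, the average episode lasts about 10 steps around episode 100 and more than 400 steps around episode 400.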