# Deep Q Learning With PyTorch

This notebook is a guide to the implementation of the Deep Q-Learning (DQL) algorithm in the Agent class. It started as a basic extraction of code comments. We focus on the agent's 'learn' function (reproduced below as `_train()`), which calculates the loss from a batch of transitions.

<b>Note for developers:</b> Any approved change to the agent's code or train function should come with a matching change to this notebook. PyTorch is a great framework, but its more complex code fragments are hard to understand without a wider explanation from the author.

## Setup

We first declare the building blocks, the Net and Memory classes:

In [2]:
import torch
import numpy as np
import random
from collections import deque
from torch import autograd, optim, nn
from torch.autograd import Variable
import torch.nn.functional as F

class Net(torch.nn.Module):
    """Neural Network with variable layer sizes and 2 hidden layers."""

    def __init__(self, input_size, hidden1_size, hidden2_size, output_size):
        super().__init__()
        self.fc1 = torch.nn.Linear(input_size, hidden1_size)
        self.fc2 = torch.nn.Linear(hidden1_size, hidden2_size)
        self.fc3 = torch.nn.Linear(hidden2_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x
    
class Memory(deque):
    """Subclass of deque, storing transitions batches for Agent"""

    def __init__(self, maxlen):
        super().__init__(maxlen=maxlen)
    

Now initialize the networks and fill the memory with random transitions:

In [3]:
initial_state = np.random.rand(10)
actions = ['action1', 'action2', 'action3', 'action4']

input_size = len(initial_state)
hidden1_size = 20
hidden2_size = 10
output_size = len(actions)

q_network = Net(input_size, hidden1_size, hidden2_size, output_size)
target_network = Net(input_size, hidden1_size, hidden2_size, output_size)

memory = Memory(100)
for _ in range(100):
    memory.append((np.random.rand(10), 1, -5, np.random.rand(10), False))

gamma = 0.9
epsilon = 0.1
epsilon_decay = 0.999
epsilon_min = 0.001
batch_size = 4
l_rate = 0.01
optimizer = optim.Adagrad(q_network.parameters(), lr=l_rate)

<b>NOTE: We register q_network.parameters() with the optimizer because these are the only parameters we want to update.</b>
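
To make the note above concrete, here is a minimal sketch showing that an optimizer step only changes the parameters registered with it. The two `nn.Linear` networks, their sizes, and the names `online_net` and `frozen_net` are hypothetical stand-ins, not the notebook's `Net` instances.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Two small stand-in networks (hypothetical, not the notebook's Net class).
online_net = nn.Linear(3, 2)
frozen_net = nn.Linear(3, 2)

# Only online_net's parameters are registered with the optimizer.
opt = optim.Adagrad(online_net.parameters(), lr=0.01)

# Snapshot all parameters before the update.
online_before = [p.detach().clone() for p in online_net.parameters()]
frozen_before = [p.detach().clone() for p in frozen_net.parameters()]

# A loss that depends on both networks, so both receive gradients.
x = torch.ones(4, 3)
loss = (online_net(x) + frozen_net(x)).pow(2).mean()

opt.zero_grad()
loss.backward()
opt.step()

# Compare the parameters with their snapshots.
online_changed = any(
    not torch.equal(before, after.detach())
    for before, after in zip(online_before, online_net.parameters())
)
frozen_changed = any(
    not torch.equal(before, after.detach())
    for before, after in zip(frozen_before, frozen_net.parameters())
)
print("online_net updated:", online_changed)
print("frozen_net updated:", frozen_changed)
```

Both networks end up with gradients after `backward()`, but only the registered one is modified by `step()`.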

Define the function that the agent uses to generate batches from memory:

In [4]:
def get_experience_batch():
    """
    Retrieves a random batch of transitions from memory and transforms it
    to separate PyTorch Variables.

    Transition is a tuple in form of:
    (state, action, reward, next_state, terminal_state)
    Returns:
        exp_batch - list of Variables in given order:
            [0] - input_state_batch
            [1] - action_batch
            [2] - reward_batch
            [3] - next_state_batch
            [4] - terminal_mask_batch
    """

    exp_batch = [0, 0, 0, 0, 0]
    transition_batch = random.sample(memory, batch_size)

    # Float Tensors
    for i in [0, 2, 3, 4]:
        exp_batch[i] = Variable(torch.Tensor([x[i] for x in transition_batch]))

    # Long Tensor for actions
    exp_batch[1] = Variable(torch.LongTensor([int(x[1]) for x in transition_batch]))

    return exp_batch    
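
The reshaping done by get_experience_batch can be illustrated without PyTorch. The sketch below uses a hypothetical toy memory (the name `toy_memory` and its values are made up) and plain Python to turn a list of transitions into one batch per field.

```python
import random

random.seed(0)

# Hypothetical toy memory: (state, action, reward, next_state, terminal_state)
toy_memory = [
    ([0.1, 0.2], 0, 1.0, [0.2, 0.3], False),
    ([0.2, 0.3], 1, -5.0, [0.3, 0.4], False),
    ([0.3, 0.4], 1, 0.0, [0.4, 0.5], True),
]

sampled = random.sample(toy_memory, 2)

# zip(*...) turns a list of transitions into one tuple per field.
states, actions, rewards, next_states, terminals = zip(*sampled)

# Transition i is recovered by taking element i from every batch.
first_transition = (
    states[0], actions[0], rewards[0], next_states[0], terminals[0]
)
print(first_transition == sampled[0])
```

The function above does the same thing field by field, and additionally wraps each batch in a tensor of the right type.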

## Training iteration

Now let's break the *_train()* function into smaller chunks and walk through what happens, with commentary.

First, here is the whole function:

In [5]:
def _train():
    # Train only when memory holds more than batch_size transitions
    if len(memory) > batch_size:

        exp_batch = get_experience_batch()

        input_state_batch = exp_batch[0]
        action_batch = exp_batch[1]
        reward_batch = exp_batch[2]
        next_state_batch = exp_batch[3]
        terminal_mask_batch = exp_batch[4]

        # Q(s, .) for every action, then Q(s, a) for the actions taken
        all_q_values = q_network(input_state_batch)

        q_values = all_q_values.gather(1, action_batch.unsqueeze(1)).squeeze()

        # max{a}{Q(s_next, a)}, cut off from the computational graph
        q_next_max = q_network(next_state_batch)
        q_next_max = Variable(q_next_max.data)
        q_next_max, _ = q_next_max.max(dim=1)

        # Zero out the next-state value for terminal transitions
        q_t1_max_with_terminal = q_next_max.mul(1 - terminal_mask_batch)

        targets = reward_batch + gamma * q_t1_max_with_terminal

        # This is a plain function, so it uses the notebook-level
        # q_network, gamma and optimizer instead of self.* attributes
        optimizer.zero_grad()
        loss = nn.modules.SmoothL1Loss()(q_values, targets)
        loss.backward()
        optimizer.step()

Now, let's go step by step. Note that the sizes of the printed Variables matter.

- We sample a random batch of transitions and transform it into separate batches (of autograd.Variable type). This is just basic data preparation, and the logic is handled by the <b>get_experience_batch</b> function.
- Make sure you understand what the data looks like after this. We had transitions, but now we have separate batches. So, to retrieve the first transition from the batches, take the first element of each of them.

In [6]:
exp_batch = get_experience_batch()

input_state_batch = exp_batch[0]
action_batch = exp_batch[1]
reward_batch = exp_batch[2]
next_state_batch = exp_batch[3]
terminal_mask_batch = exp_batch[4]

print(input_state_batch)
print("\n", action_batch)
print("\n", reward_batch)
print("\n", next_state_batch)
print("\n", terminal_mask_batch)


Variable containing:
 0.4316  0.9630  0.1087  0.1640  0.0551  0.8047  0.5656  0.9746  0.1767  0.1709
 0.8658  0.8056  0.5478  0.4555  0.3722  0.9552  0.4410  0.7902  0.4389  0.2040
 0.6501  0.3299  0.4768  0.7880  0.5652  0.0653  0.1315  0.0798  0.2819  0.0754
 0.0508  0.3060  0.7159  0.0703  0.8606  0.7534  0.4870  0.4760  0.6707  0.8573
[torch.FloatTensor of size 4x10]


 Variable containing:
 1
 1
 1
 1
[torch.LongTensor of size 4]


 Variable containing:
-5
-5
-5
-5
[torch.FloatTensor of size 4]


 Variable containing:
 0.7205  0.2970  0.6573  0.9439  0.5758  0.5123  0.6997  0.4771  0.9502  0.2461
 0.4616  0.3396  0.1968  0.9294  0.2858  0.1251  0.2446  0.8107  0.9409  0.5322
 0.7426  0.4276  0.0018  0.5493  0.1445  0.6420  0.3377  0.5860  0.2442  0.4903
 0.1016  0.4350  0.0266  0.5533  0.4194  0.7809  0.3471  0.2322  0.0195  0.3680
[torch.FloatTensor of size 4x10]


 Variable containing:
 0
 0
 0
 0
[torch.FloatTensor of size 4]



### 1. We start by calculating the values needed to compute the error used to train the network.

- As Deep Q Learning states, we want to calculate the error like this: <b>Q(s,a) - (r + gamma * max{a}{Q(s_next,a)})</b>

- (This differs a bit if we use Double Deep Q Learning, so if you work on it, make sure to change it!)

- So let's start by calculating the Q(s) values for each input state in input_state_batch:


In [7]:
all_q_values = q_network(input_state_batch)
print(all_q_values, all_q_values.shape)

Variable containing:
-0.1493 -0.1617  0.3746  0.0611
-0.1438 -0.1179  0.3555  0.0596
-0.1284 -0.0708  0.3416  0.0462
-0.1365 -0.0591  0.3261  0.0664
[torch.FloatTensor of size 4x4]
 torch.Size([4, 4])


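### 2. Now, retrieve q_values only for the actions that were taken - we want Q(s,a), not Q(s)!

- gather expects an index tensor with the same number of dimensions as its input, so unsqueeze(1) turns action_batch from size 4 into size 4x1, and squeeze() turns the 4x1 result back into size 4. See http://pytorch.org/docs/stable/torch.html#torch.squeeze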

In [8]:
q_values = all_q_values. \
                gather(1, action_batch.unsqueeze(1)).squeeze()
print(q_values)

Variable containing:
-0.1617
-0.1179
-0.0708
-0.0591
[torch.FloatTensor of size 4]



- This use of the gather function (http://pytorch.org/docs/stable/torch.html#torch.gather) selects the same values as a basic loop:

In [9]:
q_values_test = []
for i in range(len(all_q_values)): 
    q_values_test.append(all_q_values[i][action_batch[i]])
    
print(q_values_test)

[Variable containing:
-0.1617
[torch.FloatTensor of size 1]
, Variable containing:
-0.1179
[torch.FloatTensor of size 1]
, Variable containing:
1.00000e-02 *
 -7.0821
[torch.FloatTensor of size 1]
, Variable containing:
1.00000e-02 *
 -5.9140
[torch.FloatTensor of size 1]
]


- <b>but we do not use such a loop: it produces a Python list of separate one-element Variables instead of a single Variable of size 4, and it is slower than one vectorized gather call.</b>
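
The equivalence between gather and the loop can be checked directly. The following is a small sketch in current PyTorch tensor syntax (no `Variable` wrapper), reusing the Q values and actions printed above. The names `all_q`, `actions_taken`, `picked_gather` and `picked_loop` are illustrative only.

```python
import torch

# Q values and actions copied from the printed outputs above.
all_q = torch.tensor([
    [-0.1493, -0.1617, 0.3746, 0.0611],
    [-0.1438, -0.1179, 0.3555, 0.0596],
    [-0.1284, -0.0708, 0.3416, 0.0462],
    [-0.1365, -0.0591, 0.3261, 0.0664],
])
actions_taken = torch.tensor([1, 1, 1, 1])  # integer (long) tensor

# Vectorized selection: one Q value per row, chosen by the action index.
picked_gather = all_q.gather(1, actions_taken.unsqueeze(1)).squeeze(1)

# Loop version: pick row by row, then stack into a single tensor.
picked_loop = torch.stack(
    [all_q[i, a] for i, a in enumerate(actions_taken)]
)

print(picked_gather)
print(torch.equal(picked_gather, picked_loop))
```

This sketch uses `squeeze(1)` rather than `squeeze()` so that a batch of size 1 would keep its batch dimension. That is a design choice of the sketch, not something the notebook code does.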

### 3. Calculate q_next_max = max{a}{Q(s_next,a)}

- The max function returns both the values <b>and</b> the indices (http://pytorch.org/docs/stable/torch.html#torch.max), but we only want the values.

In [10]:
q_next_max = q_network(next_state_batch)
q_next_max = Variable(q_next_max.data)
print(q_next_max)
q_next_max, _ = q_next_max.max(dim=1)
print(q_next_max)

Variable containing:
-0.1348 -0.0608  0.3340  0.0606
-0.1423 -0.0875  0.3658  0.0379
-0.1449 -0.1205  0.3683  0.0491
-0.1437 -0.0852  0.3606  0.0420
[torch.FloatTensor of size 4x4]

Variable containing:
 0.3340
 0.3658
 0.3683
 0.3606
[torch.FloatTensor of size 4]



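- As we can see, we calculate the outputs for the next states and take the largest value in each row.
- Note: we create a new Variable after the first line. Why?
- We used the q_network parameters to calculate q_next_max, but the target should act as a constant: we don't want backward() to propagate through this second forward pass into the parameters. Creating a new Variable from .data 'cuts' this part of the computational graph and prevents that. (In PyTorch 0.4 and later, q_next_max.detach() serves the same purpose.)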
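
The effect of cutting the graph can be observed through `requires_grad`. Below is a minimal sketch in current PyTorch syntax, assuming `.detach()` in place of `Variable(q_next_max.data)`. The small `nn.Linear` network and the random inputs are hypothetical stand-ins for q_network and the state batches.

```python
import torch
from torch import nn

torch.manual_seed(0)

net = nn.Linear(3, 2)            # hypothetical stand-in for q_network
states = torch.rand(4, 3)
next_states = torch.rand(4, 3)

# Forward pass on the next states, then cut the result off the graph.
q_next = net(next_states)
q_next_const = q_next.detach()

print(q_next.requires_grad)        # still attached to the graph
print(q_next_const.requires_grad)  # treated as a constant

# Build a target from the detached values and a prediction from the states.
target = -5.0 + 0.9 * q_next_const.max(dim=1)[0]
prediction = net(states).max(dim=1)[0]

# Gradients now flow only through the prediction path.
loss = (prediction - target).pow(2).mean()
loss.backward()
print(target.requires_grad, net.weight.grad is not None)
```

Without the detach, the same parameters would also receive gradients through the target, so the network would be pushed to move the target toward the prediction as well as the other way around.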

### 4. If the next state is terminal, we don't use its Q value - the target should be just r

In [11]:
q_t1_max_with_terminal = q_next_max.mul(1 - terminal_mask_batch)
print(q_t1_max_with_terminal)

Variable containing:
 0.3340
 0.3658
 0.3683
 0.3606
[torch.FloatTensor of size 4]



- So if a next state were terminal, there would be a zero in the corresponding place!
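
Since the sample batch contains no terminal transitions, the mask has no visible effect above. The plain-Python sketch below reuses the printed q_next_max values with a hypothetical mask in which the second transition is terminal, and shows that its target collapses to the reward alone.

```python
gamma = 0.9
rewards = [-5.0, -5.0, -5.0, -5.0]
q_next_max = [0.3340, 0.3658, 0.3683, 0.3606]  # printed values from above

# Hypothetical mask: 1.0 marks a terminal next state (second transition).
terminal_mask = [0.0, 1.0, 0.0, 0.0]

# Same arithmetic as q_next_max.mul(1 - terminal_mask_batch)
masked = [q * (1 - t) for q, t in zip(q_next_max, terminal_mask)]

# target = r + gamma * masked next-state value
targets_with_terminal = [r + gamma * q for r, q in zip(rewards, masked)]

print(masked)
print(targets_with_terminal)
```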

### 5. Calculate the target = r + gamma * max{a}{Q(s_next,a)}

- We have the rewards and the max Q values for the next states, so we can calculate the targets by adding them up.
- The max Q values are multiplied by gamma first, as stated in the Q-Learning algorithm.

In [12]:
targets = reward_batch + gamma * q_t1_max_with_terminal

print(targets)
print(q_values)

Variable containing:
-4.6994
-4.6708
-4.6685
-4.6755
[torch.FloatTensor of size 4]

Variable containing:
-0.1617
-0.1179
-0.0708
-0.0591
[torch.FloatTensor of size 4]



- So, what do we see? We have the network's predictions and the target values. A well-trained agent (with reasonable environment transitions) makes predictions close to the targets. Our untrained agent does a bad job.

### Loss function and backward()

- To push it in a better direction, we now calculate the loss (we usually don't use the raw error directly, but pass it through a loss function):

In [13]:
optimizer.zero_grad()
loss = nn.modules.SmoothL1Loss()(q_values, targets)
print(loss)
loss.backward()
optimizer.step()

Variable containing:
 4.0761
[torch.FloatTensor of size 1]



- We make use of PyTorch's optimizer. We have to zero the gradients before calculating new ones, because PyTorch accumulates gradients across backward() calls.

- We calculate the loss using the built-in SmoothL1Loss, a form of the Huber loss, which is a commonly recommended choice for Q-Learning.

- We call the <b>backward</b> function, which traces back through all values that were used to calculate the loss. The most important thing here is that it calculates the gradients for all q_network weights and biases.

- We have calculated the gradients, but the update happens only with the <b>optimizer.step()</b> call. It takes the gradients and updates the parameters (just the parameters that were registered with the optimizer!), finalizing the training iteration.
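
As a check on the printed loss, the Smooth L1 computation can be redone by hand. The sketch below is plain Python and assumes the default SmoothL1Loss settings (threshold of 1 and mean reduction). The helper `smooth_l1` is written for this illustration and is not a PyTorch function.

```python
def smooth_l1(prediction, target, beta=1.0):
    """Smooth L1 for one element: quadratic near zero, linear beyond beta."""
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


gamma = 0.9
rewards = [-5.0, -5.0, -5.0, -5.0]
q_next_max = [0.3340, 0.3658, 0.3683, 0.3606]   # printed in step 3
q_values = [-0.1617, -0.1179, -0.0708, -0.0591]  # printed in step 2

# targets = r + gamma * max{a}{Q(s_next, a)}
targets = [r + gamma * q for r, q in zip(rewards, q_next_max)]

# Mean of the element-wise losses, as with the default reduction.
loss = sum(smooth_l1(q, t) for q, t in zip(q_values, targets)) / len(q_values)
print(loss)
```

The result lands very close to the 4.0761 printed above. A small difference is expected because the inputs here are the rounded values shown in the outputs. Every error in this batch is larger than 1, so each element falls in the linear branch of the loss.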