Solution to deeprlbootcamp lab 3 by Dennis Briner

# DQN

## Code

In [None]:
# aka: (s, a, r, s', d)
def compute_q_learning_loss(self, l_obs, l_act, l_rew, l_next_obs, l_done):
    """
    :param l_obs: A chainer variable holding a list of observations. Should be of shape N * |S|.
    :param l_act: A chainer variable holding a list of actions. Should be of shape N.
    :param l_rew: A chainer variable holding a list of rewards. Should be of shape N.
    :param l_next_obs: A chainer variable holding a list of observations at the next time step. Should be of
    shape N * |S|.
    :param l_done: A chainer variable holding a list of binary values (indicating whether episode ended after this
    time step). Should be of shape N.
    :return: A chainer variable holding a scalar loss.
    """

    # compute Q(s,a)
    q_value = F.select_item(self._q.forward(l_obs), l_act)

    # compute y = r + gamma * max_a' Q_target(s', a'), or y = r for terminal transitions
    # (1 - l_done) is 0 when the episode ended after this step => y = r
    y = l_rew + (1 - l_done) * self._discount * F.max(self._qt.forward(l_next_obs), axis=1)

    # compute loss using MSE (y - Q(s,a))^2
    loss = F.mean_squared_error(y, q_value)

    return loss
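
To make the target computation concrete, here is a minimal sketch in plain Python that mirrors the `y = ...` line above for a toy batch of two transitions. All numbers and the helper name `dqn_targets` are made up for illustration; this is not part of the lab code.

```python
# Hypothetical toy batch of N = 2 transitions with 3 actions each (made-up numbers).
discount = 0.9
rewards = [1.0, -1.0]
dones = [0.0, 1.0]            # the second transition ends the episode
next_q_target = [             # target-network output Q_target(s', .) per transition
    [0.5, 2.0, 1.0],
    [3.0, 0.0, 4.0],
]


def dqn_targets(rewards, dones, next_q_target, discount):
    """Return y = r + (1 - d) * discount * max_a' Q_target(s', a') per transition."""
    return [
        r + (1.0 - d) * discount * max(q_row)
        for r, d, q_row in zip(rewards, dones, next_q_target)
    ]


targets = dqn_targets(rewards, dones, next_q_target, discount)
print(targets)
```

The first target comes out at roughly 2.8 (reward plus the discounted best next value), while the second is just the reward of -1.0, because the `(1 - d)` factor removes the bootstrap term for terminal transitions.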

## Results

The agent improves quickly within the first ten iterations. After that there is no drastic improvement, since it has already reached an optimal solution.

<img src="dqn-grid.png" height="400">

# DDQN

## Code

In [None]:
# aka: (s, a, r, s', d)
def compute_double_q_learning_loss(self, l_obs, l_act, l_rew, l_next_obs, l_done):
    """
    :param l_obs: A chainer variable holding a list of observations. Should be of shape N * |S|.
    :param l_act: A chainer variable holding a list of actions. Should be of shape N.
    :param l_rew: A chainer variable holding a list of rewards. Should be of shape N.
    :param l_next_obs: A chainer variable holding a list of observations at the next time step. Should be of
    shape N * |S|.
    :param l_done: A chainer variable holding a list of binary values (indicating whether episode ended after this
    time step). Should be of shape N.
    :return: A chainer variable holding a scalar loss.
    """

    # compute Q(s,a)
    q_value = F.select_item(self._q.forward(l_obs), l_act)

    # compute arg max(Q(s'))
    a_prime = F.argmax(self._q.forward(l_next_obs), axis=1)

    # compute Q(s', arg max(Q(s')) )
    double_q = F.select_item(self._qt.forward(l_next_obs), a_prime)

    # compute y = r + gamma * Q_target(s', arg max(Q(s')) ), or y = r for terminal transitions
    # (1 - l_done) will be 0 if considering a terminal state => y = r
    y = l_rew + (1 - l_done) * self._discount * double_q

    # compute loss using MSE (y - Q(s,a))^2
    loss = F.mean_squared_error(y, q_value)

    return loss
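
The only difference to the DQN loss is which network picks the next action. A small sketch in plain Python, using made-up Q-values for a single next state, shows how the two targets can diverge. The names `q_online_next`, `q_target_next` and the helper `argmax` are hypothetical and only serve this example.

```python
def argmax(values):
    """Index of the largest entry (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical Q-values for one next state s' with 3 actions (made-up numbers).
q_online_next = [1.0, 3.0, 2.0]   # online network Q(s', .), selects the action
q_target_next = [5.0, 0.5, 1.5]   # target network Q_target(s', .), evaluates it

reward, done, discount = 1.0, 0.0, 0.9

# DQN: the target network both selects and evaluates the next action.
y_dqn = reward + (1.0 - done) * discount * max(q_target_next)

# Double DQN: the online network selects, the target network evaluates.
a_prime = argmax(q_online_next)
y_ddqn = reward + (1.0 - done) * discount * q_target_next[a_prime]

print(a_prime, y_dqn, y_ddqn)
```

Here the online network prefers action 1, so the Double DQN target ends up much lower than the DQN target, which always takes the largest target-network value. The Double DQN target can never exceed the DQN target, which is how it counteracts overestimation.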

## Results Grid World

We don't see much of a difference compared to the DQN version. At least it's not getting worse. The main benefit of double Q-learning is that it reduces the overestimation of Q-values, and that effect hardly matters in the simple grid world environment.

<img src="ddqn-grid.png" height="400">

## Results Pong

This is the result I got after training for 200+ iterations. The reward seems to improve, but very slowly. When playing the game, the agent still loses.

<img src="ddqn-pong.png" height="400">

### Modifying hyperparameters

Next I tried to modify the hyperparameters. My first hypothesis was that the exploration rate was too high, since the reward always dropped sharply after a few iterations. My second hypothesis was that I should decrease the discount factor: in Pong only the last few actions are relevant for losing a point, and I wanted to weight those more heavily.
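
To get a feeling for what a lower discount factor does, this small sketch prints the rule-of-thumb horizon `1 / (1 - discount)` and the weight of a reward 50 steps in the future. The comparison value 0.99 is an assumption for illustration; 0.9 is the value from the configuration below. The helper names are hypothetical.

```python
def discount_weight(discount, steps):
    """Weight that a reward `steps` time steps in the future receives."""
    return discount ** steps


def effective_horizon(discount):
    """Rule-of-thumb planning horizon 1 / (1 - discount)."""
    return 1.0 / (1.0 - discount)


# 0.99 is an assumed comparison value, 0.9 is the value used in my configuration.
for gamma in (0.99, 0.9):
    horizon = round(effective_horizon(gamma))
    weight = round(discount_weight(gamma, 50), 4)
    print(gamma, horizon, weight)
```

With 0.9 the horizon shrinks to about 10 steps and a reward 50 steps away is weighted with roughly 0.005, so it is practically invisible to the agent. With 0.99 the same reward still counts with a weight of about 0.6.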

This is how I've adjusted the configuration:

In [2]:
# DQN gamma parameter
discount=0.9,

# Training procedure length
initial_step=initial_step,
max_steps=max_steps,
learning_start_itr=max_steps // 100,
# Frequency of copying the actual Q to the target Q
target_q_update_freq=target_q_update_freq,
# Frequency of updating the Q-value function
train_q_freq=4,
# Double Q
double_q=double,

# Exploration parameters
initial_eps=1.0,
final_eps=0.01,
fraction_eps=0.01,

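
The exploration parameters are easier to judge with numbers. The sketch below assumes that epsilon is annealed linearly from `initial_eps` to `final_eps` over the first `fraction_eps * max_steps` steps and then stays constant. That schedule is my assumption about the training loop and is not shown in this notebook; `epsilon_at` and the value of `max_steps` are hypothetical.

```python
def epsilon_at(step, max_steps, initial_eps=1.0, final_eps=0.01, fraction_eps=0.01):
    """Linearly anneal epsilon over the first fraction_eps * max_steps steps.

    Assumed schedule, not taken from the lab code.
    """
    decay_steps = max(1, round(fraction_eps * max_steps))
    progress = min(1.0, step / decay_steps)
    return initial_eps + progress * (final_eps - initial_eps)


max_steps = 1_000_000  # hypothetical value; the real one is set elsewhere
for step in (0, 5_000, 10_000, 500_000):
    print(step, epsilon_at(step, max_steps))
```

Under these assumptions, exploration is already down to its final value of about 0.01 after 1% of the training steps, so almost all of the training runs nearly greedy.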

Sadly, there wasn't much visible improvement. I kept playing with the hyperparameters, but I couldn't find any combination that made the agent learn quickly and well. The biggest issue was training time on my machine (2.6 GHz 6-Core Intel Core i7): training just 100 iterations took over an hour, which kept me from experimenting more.

<img src="ddqn-pong-hyper.png" height="400">

When observing the agent play, I could see that it has figured out that it needs to move the paddle towards the ball, but it doesn't move close enough. This might be due to my lowered discount factor:

<img src="pong-error.gif" height="200">