# Report of Project 1: Navigation

## Learning algorithm

The agent used in this project is a __DuelingDDQN__, which is described in the following sections.

The DuelingDDQN algorithm belongs to the class of RL algorithms known as temporal-difference (TD) learning. It is a _model-free_ algorithm, because it neither plans ahead nor uses a (stochastic) model of the environment. It is an _off-policy_ algorithm, because the policy it improves differs from the policy used to generate the data.

While the definition of model-free is fairly straightforward, the off-policy part needs some further explanation, starting with the question of what a policy is. Simply put, a policy is a function that maps states to actions. It can act randomly to explore the environment, or (near-)deterministically to exploit what it has already learned, aiming to achieve the highest possible reward.

__The policy__ used here is called $\epsilon$-greedy: with probability $\epsilon$ it selects a random action, and otherwise the action with the highest value according to the learned Q-network. $\epsilon$ is decayed by a fixed factor every episode, which favors exploration at the beginning of the learning process and exploitation of the learned policy towards the end. The term off-policy refers to the fact that the policy generating the data (the $\epsilon$-greedy behavior policy) differs from the policy being learned (the greedy policy with respect to the Q-values). The use of two separate Q-networks for a) generating the next step's action value and b) receiving the updates is a related but distinct design choice, which stabilizes the learning targets.
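The $\epsilon$-greedy rule and its per-episode decay can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's code: the function names, the decay factor `0.995` and the floor `0.01` are assumptions chosen for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=q_values.__getitem__)


def decay_epsilon(epsilon, decay=0.995, eps_min=0.01):
    """Multiplicative decay, floored at eps_min (both values are assumptions)."""
    return max(eps_min, epsilon * decay)


epsilon = 1.0
for episode in range(3):
    action = epsilon_greedy([0.1, 0.9, 0.3], epsilon)
    epsilon = decay_epsilon(epsilon)
```

With `epsilon = 0` the function always returns the greedy action, which is the target policy the agent is actually learning.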

Every single step taken by the agent in the environment is called an _experience_. It consists of a _state_ vector (e.g. representing a ray-based perception of the environment), the _action_ taken (based on the policy described above), the _reward_ returned by the environment, the _next state_ of the agent and an indicator of whether the episode has finished. Each experience is stored in a fixed-size __experience replay buffer__ (a double-ended queue), from which the algorithm samples batches _uniformly at random_ to fit the neural network(s). Uniform sampling breaks the temporal correlation between consecutive steps, so the batches are approximately i.i.d., which is what gradient-based optimizers such as Adam and RMSProp expect.
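A fixed-size buffer with uniform sampling can be sketched with the standard library alone. The class below reuses the name `ReplayBuffer` for readability, but the field names, method names and the `seed` parameter are assumptions for this example and may differ from the project's implementation.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped when it is full."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored experiences.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(2)
```

Because the deque is created with `maxlen`, adding a fourth experience silently evicts the first one, so the buffer always holds the most recent `capacity` steps.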

The model-fitting process (i.e. the update of the NN weights) runs every n-th step of the agent on a sampled batch of experiences. It follows the __double DQN__ algorithm, which uses two neural networks: the online network, which is updated continuously, and the target network, which is used to generate the next step's action value. The term _double_ refers to the fact that the online network chooses the action (the one with the highest value) in the next state, while the target network only evaluates that action. In the "classic" DQN algorithm, the target network does both.
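The difference between the two targets is easiest to see for a single experience. The sketch below works on plain lists of Q-values instead of network outputs. The function names are hypothetical, and dropping the bootstrap term for terminal states is the usual convention, assumed here because the experience carries a `done` flag.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Online network selects the action, target network evaluates it."""
    if done:
        return reward  # no bootstrapping from a terminal state (assumption)
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]


def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Classic DQN: target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


q_online_next = [1.0, 2.0]  # online network prefers action 1
q_target_next = [5.0, 3.0]  # target network overrates action 0
print(double_dqn_target(1.0, False, q_online_next, q_target_next, gamma=0.5))  # 2.5
print(dqn_target(1.0, False, q_target_next, gamma=0.5))  # 3.5
```

In this toy example the classic target picks up the target network's largest (possibly overestimated) value, while the double DQN target uses the value of the action the online network would actually choose.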

An update step looks as follows: for every experience in a given batch, we compute the target action value of the next state with the target network and the current estimate of the action value $Q(s,a)$ with the online network. Since the experiences also contain the rewards, we can compute the gradient update, which is defined as follows:

$$\nabla_{\theta_i}L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')\sim U(D)} \Big[\big(r+\gamma Q(s',\underset{a'}{\operatorname{argmax}} Q(s',a';\theta_i);\theta^-) - Q(s,a;\theta_i)\big)\nabla_{\theta_i}Q(s,a;\theta_i)\Big]$$

The target Q-value estimate $r+\gamma Q(s',\underset{a'}{\operatorname{argmax}} Q(s',a';\theta_i);\theta^-)$ and the local Q-value estimate $Q(s,a;\theta_i)$ are passed to the loss function. `SmoothL1Loss`, a.k.a. Huber loss, was chosen because it is more robust to extreme values: it does not produce very large gradients, which could otherwise slow down convergence, especially at the beginning of the learning process, when the network is chasing a moving target.
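The robustness to extreme values follows directly from the shape of the loss: it is quadratic for small errors and linear for large ones, so its gradient is bounded. The sketch below is a scalar version in plain Python with hypothetical function names. For `delta = 1` it should coincide with what `SmoothL1Loss` computes per element with its default settings.

```python
def huber_loss(prediction, target, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond that."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)


def huber_grad(prediction, target, delta=1.0):
    """Derivative w.r.t. the prediction: the error clipped to [-delta, delta]."""
    error = prediction - target
    return max(-delta, min(delta, error))


small = huber_loss(0.5, 0.0)  # 0.125, same as half the squared error
large = huber_loss(3.0, 0.0)  # 2.5, whereas half the squared error would be 4.5
```

A squared-error loss would pass the raw error on as the gradient, so one badly estimated target could dominate a batch. Here the gradient magnitude never exceeds `delta`.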

The final step in the model-fitting process is to "soft-update" the target network's weights. Instead of overwriting the target-network weights with the online-network weights every N time steps, we use __Polyak averaging__, which updates the target weights more often, each time blending in only a small fraction $\tau$ of the online weights:

$$\theta^- \leftarrow \tau \theta_i + (1-\tau)\theta^-$$
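Applied to a flat list of weights, the soft update is a one-liner. The function name and the default `tau = 0.001` are assumptions for illustration; the report does not state the value that was used.

```python
def soft_update(online_weights, target_weights, tau=0.001):
    """Return new target weights: tau * online + (1 - tau) * target."""
    return [
        tau * w_online + (1.0 - tau) * w_target
        for w_online, w_target in zip(online_weights, target_weights)
    ]


online = [1.0, 2.0]
target = [0.0, 0.0]
target = soft_update(online, target, tau=0.5)  # [0.5, 1.0]
```

Setting `tau = 1` recovers the hard update (a full copy of the online weights), while a small `tau` lets the target network trail the online network slowly.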

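The neural network architecture is a __Dueling DQN__. The main difference to the vanilla DQN is that the layer before the output is split into two streams: the first stream represents the state-value function $V(s)$ and the second stream represents the action-advantage function $A(s,a)$. In the dueling architecture as originally proposed, the two streams are recombined into Q-values with the mean advantage subtracted, so that $V$ and $A$ remain identifiable:

$$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$$

The purpose of this split is __sample efficiency__: $V(s)$ is shared by all actions, so every experience improves the value estimate of a state for all actions at once, not only for the action that was actually taken.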
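How the two streams are merged back into Q-values can be illustrated without a neural network. The sketch assumes the mean-subtracted aggregation from the original dueling architecture paper. Whether `DuelingDenseQNetwork` uses exactly this variant is not stated in this report, and the function name is hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a), subtracting the mean advantage."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


q_values = dueling_q_values(2.0, [1.0, 0.0, -1.0])  # [3.0, 2.0, 1.0]
```

Subtracting the mean does not change which action has the highest Q-value, but it pins the average of the Q-values to $V(s)$, so a change in the value stream shifts all actions together.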

## Results

While the environment could already be solved within ~600 episodes using a simple DQN, extending the algorithm with a dueling architecture __solved it in 477 episodes__.

![Scores](scores.png "Scores")

## Ideas for future work

From an architectural perspective, loose coupling between the `DuelingDDQN` agent, the `DuelingDenseQNetwork`s, `ReplayBuffer`, `Adam` optimizer, `SmoothL1Loss`, etc. should be preferred to enable composing and testing different implementations without changing the code. 

Moreover, the action-selection code should be moved into its own class, decoupling the code even further.

From a modeling point of view, there are also many ways to improve the algorithms and models even further. An (incomplete) list of possible future improvements/extensions:

* Implement Prioritized Experience Replay (PER)
* Distributional DQN
* Noisy DQN
* Rainbow DQN

Also, it would be interesting to see whether Bayesian variants (BNNs) of the neural network architectures could improve decision making, based on their posterior-predictive distributions.

Instead of manually testing different hyper-parameter settings, hyper-parameter optimization tools should be used (e.g. based on Bayesian optimization).

## Additional References

* [Miguel Morales, "Grokking Deep Reinforcement Learning" Manning Publications.](https://www.manning.com/books/grokking-deep-reinforcement-learning)
* [Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529.](http://www.davidqiu.com:8888/research/nature14236.pdf)