# Reinforcement learning

> <span style="color:gray">
Created by Jonas Busk ([jbusk@dtu.dk](mailto:jbusk@dtu.dk)).
</span>

Before getting started with the exercises, it is useful to get acquainted with some core concepts of reinforcement learning. The content of this notebook is heavily inspired by
[Matiisen](http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/),
[Karpathy](http://karpathy.github.io/2016/05/31/rl/)
and [Scholarpedia](http://www.scholarpedia.org/article/Reinforcement_learning#.28Temporal.29_Credit_Assignment_Problem).

If you want to get started with the exercises right away, you can skip this read for now and refer back to it as needed.


## What is reinforcement learning?

Reinforcement learning (RL) is learning by interacting with an environment. This is different from other types of machine learning, where most often you learn from a fixed collection of training examples.
Instead, an RL-agent gathers experience by choosing actions and learning from the outcomes (essentially by trial and error), so the data it learns from keeps changing as it acts.

After each action, the RL-agent receives a reward signal (positive or negative) and the learning task is then to increase the probability of selecting "good" actions and decrease the probability of "bad" actions in order to maximize the overall expected amount of reward. 



## Markov decision process

So how do we formalize a reinforcement learning problem, so that we can reason about it?

Imagine an **agent** navigating an **environment**. The environment is in a certain **state**, $s \in S$, and the agent can perform **actions**, $a \in A$, which transform the environment, resulting in a new state and a **reward**, $r$, for performing that action. In this new state, the agent can perform another action, and so on. This system is illustrated in the figure below. 

<img src="images/reinforcement-learning.svg" style="width: 300px;"/> 

The rule for how the agent chooses its actions is called a **policy**, and can be viewed as a function, $\pi(s)$, that given a state selects an action. If we consider a conditional probability distribution over possible actions given a state, $p(a|s)$, the policy can either be *stochastic*, by sampling an action from the probability distribution:

$$\pi(s) = a \sim p(a|s)$$

or *deterministic*, by simply selecting the action with the highest probability:

$$\pi(s) = \arg\max_a \, p(a|s)$$
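
To make the two kinds of policy concrete, here is a minimal sketch using only the Python standard library. The action names and the probabilities in `ACTION_PROBS` are made up for illustration and stand in for $p(a|s)$ in one fixed state.

```python
import random

# Hypothetical distribution p(a|s) over three actions for one fixed state s.
ACTION_PROBS = {"left": 0.1, "stay": 0.7, "right": 0.2}


def stochastic_policy(action_probs, rng):
    """Sample an action a ~ p(a|s)."""
    actions = list(action_probs)
    weights = list(action_probs.values())
    return rng.choices(actions, weights=weights, k=1)[0]


def deterministic_policy(action_probs):
    """Select the action with the highest probability under p(a|s)."""
    return max(action_probs, key=action_probs.get)


rng = random.Random(0)  # seeded so that the run is repeatable
samples = [stochastic_policy(ACTION_PROBS, rng) for _ in range(1000)]

print("deterministic choice:", deterministic_policy(ACTION_PROBS))
print("sampled frequencies:", {a: samples.count(a) / 1000 for a in ACTION_PROBS})
```

The deterministic policy always returns the same action for this state, while the sampled frequencies of the stochastic policy should roughly follow the probabilities in `ACTION_PROBS`.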

Similarly, the environment can be stochastic, meaning that a given action leads to a given state only with some probability and may instead lead to a different state. If an action always leads to the same state, the environment is deterministic.

The set of states, set of actions and rules for transitioning between states and assigning rewards make up a [Markov decision process](https://en.wikipedia.org/wiki/Markov_decision_process).
One such transition can be denoted $\langle s,a,r,s' \rangle$.
In this framework, it is assumed that the probability of the next state, $s'$, depends only on the current state and action, $(s,a)$, and not on the sequence of states and actions that preceded them. This is also known as the [Markov property](https://en.wikipedia.org/wiki/Markov_property).
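
As a rough illustration, a small Markov decision process can be written down as a table of transitions. The two states, two actions, probabilities and rewards below are invented for this sketch; the point is only that `step` looks at nothing but the current state and action, which is the Markov property.

```python
import random

# Hypothetical two-state MDP.
# TRANSITIONS[(s, a)] lists the possible outcomes as (probability, next_state, reward).
TRANSITIONS = {
    ("cold", "heat"): [(0.8, "warm", 1.0), (0.2, "cold", 0.0)],
    ("cold", "wait"): [(1.0, "cold", 0.0)],
    ("warm", "heat"): [(1.0, "warm", -1.0)],
    ("warm", "wait"): [(0.5, "warm", 1.0), (0.5, "cold", 0.0)],
}


def step(state, action, rng):
    """Sample one transition <s, a, r, s'>; the outcome depends only on (s, a)."""
    outcomes = TRANSITIONS[(state, action)]
    weights = [prob for prob, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return state, action, reward, next_state


rng = random.Random(0)
state = "cold"
for _ in range(5):
    action = rng.choice(["heat", "wait"])  # uniformly random policy, for illustration only
    s, a, r, state = step(state, action, rng)
    print((s, a, r, state))
```

Entries with a single outcome of probability 1.0 behave deterministically, while the others make this toy environment stochastic.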
 

## Credit assignment problem

When an agent repeatedly interacts with the environment and after some time receives a positive reward, it can be difficult to know exactly which actions helped achieve that reward. In many cases, the reward is not a direct result of the most recent action, but instead comes from an earlier action, i.e. the reward is delayed. This is known as the *credit assignment problem*.

To address this issue, we need to take into account not only the immediate rewards, but also the future rewards, when evaluating the goodness of an action. Given a series of transitions consisting of $T$ timesteps, it is straightforward to compute the **total reward**:

$$
R = r_1 + r_2 + r_3 + \dots + r_T \ ,
$$

and the **total future reward** at timestep $t$ then is:

$$
R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_T \ .
$$

However, in a stochastic environment, we cannot be so sure of rewards happening far into the future -- even if we repeat the same series of actions, it might lead to a different result -- in fact, the uncertainty in the rewards grows the farther into the future we go. Therefore it is common to use **discounted future reward** (also known as **return**) instead, by weighting rewards by (un)certainty:

$$
R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T \ ,
$$

where $\gamma$ is the **discount factor**, chosen as a value between 0 and 1. The discounted future reward at time $t$ can also be expressed in terms of the discounted future reward at time $t+1$:

$$
R_t = r_t + \gamma R_{t+1} \ .
$$

If $\gamma = 1$ there is no discount, which is suitable for deterministic environments where repeating a series of actions is certain to lead to the same rewards (no uncertainty). If we set $\gamma = 0$, only immediate rewards are considered, resulting in a short-sighted, greedy strategy. A value such as $\gamma = 0.9$ is often a reasonable starting point.
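
The recursive form $R_t = r_t + \gamma R_{t+1}$ suggests computing all returns in a single backward pass over an episode. The following is a minimal sketch; the function name `discounted_returns` and the example rewards are assumptions made for illustration, and the return after the final timestep is taken to be zero.

```python
def discounted_returns(rewards, gamma):
    """Compute [R_1, ..., R_T] via R_t = r_t + gamma * R_{t+1}, with R_{T+1} = 0."""
    returns = [0.0] * len(rewards)
    future = 0.0  # discounted return of everything after the current timestep
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns


rewards = [0.0, 0.0, 1.0]  # hypothetical episode: reward only at the last timestep

print(discounted_returns(rewards, gamma=1.0))  # [1.0, 1.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [0.25, 0.5, 1.0]
print(discounted_returns(rewards, gamma=0.0))  # [0.0, 0.0, 1.0]
```

With $\gamma = 1$ every earlier timestep receives full credit for the final reward, with $\gamma = 0$ only the last timestep does, and intermediate values spread the credit backwards with decreasing weight.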


## Exploration-exploitation dilemma

Reinforcement learning agents need to explore their environment in order to assess its reward structure. After some exploration, the agent might have learned a rewarding set of actions, but it cannot know if there exists an even better strategy without further exploration. So when should the agent stop exploring and start exploiting what it has learned so far, in order to maximize its total reward? This is known as the *exploration-exploitation dilemma*. Reinforcement learning strategies approach this differently, but most solutions depend on heuristics, generally starting with exploration and gradually turning to more exploitation.
The [multi-armed bandit problem](https://en.wikipedia.org/wiki/Multi-armed_bandit) is an example of this dilemma. 
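
One widely used heuristic of this kind is an epsilon-greedy strategy with a decaying exploration rate. It is not prescribed by the text above and is shown here only as one possible sketch on a multi-armed bandit. The arm probabilities, the linear decay schedule and the names `run_bandit`, `eps_start` and `eps_end` are all assumptions chosen for illustration.

```python
import random


def run_bandit(arm_probs, steps, rng, eps_start=1.0, eps_end=0.05):
    """Epsilon-greedy on a Bernoulli bandit with linearly decaying epsilon."""
    n_arms = len(arm_probs)
    counts = [0] * n_arms    # number of pulls per arm
    values = [0.0] * n_arms  # running mean reward per arm
    total_reward = 0.0
    for t in range(steps):
        # Move epsilon linearly from eps_start (explore) towards eps_end (exploit).
        epsilon = eps_start + (eps_end - eps_start) * t / max(steps - 1, 1)
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda i: values[i])  # exploit: best estimate
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update
        total_reward += reward
    return counts, values, total_reward


# Hypothetical bandit with three arms that pay out with different probabilities.
counts, values, total_reward = run_bandit([0.2, 0.5, 0.8], steps=5000, rng=random.Random(0))
print("pulls per arm:", counts)
print("estimated values:", [round(v, 2) for v in values])
print("total reward:", total_reward)
```

Early in the run almost every pull is random, so all arms get sampled. As epsilon shrinks, the agent increasingly pulls the arm with the highest estimated value.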