# Value-based RL: Classic approaches

Before we turn to deep reinforcement learning, we spend some time with the classic approaches. These methods have stronger convergence guarantees than neural-network-based approaches. If you can formalize and structure your problem so that it fits the classic methods and meets their requirements, they can be very successful.

So far we have not covered it, but it is usual to distinguish between prediction and control problems. In a prediction problem we assume a fixed, known policy and want to know the corresponding value functions. In a control problem we want to find the optimal policy (agent) for the environment.

The methods below are based on temporal-difference (TD) learning, which merges ideas from Monte Carlo and dynamic programming methods. The following algorithms share the same underlying principles:
* TD-learning (prediction)
* Sarsa-learning (control)
* Q-learning (control)

## TD-learning

This is for prediction problems. The Robbins-Monro theorem is an old but useful result that lets us use Monte Carlo sampling to solve the Bellman equation.

**Robbins-Monro theorem (1951):**

Assume $x_t$ is a Markov chain defined by $x_{t+1} - x_t = a_t \cdot (\alpha - y_t)$, where $y_t$ is drawn from the distribution $H(y | x_t)$. The corresponding random variable is $Y(x_t)$ and its expected value is $M(x_t)$. We suppose that the equation $M(x) = \alpha$ has a unique solution $x = \Theta$. Define $b = \lim_{t \rightarrow \infty} b_t$, where $b_t = E\left[ \left( x_t - \Theta \right)^2 \right]$. Assume the following conditions hold:

* for the $\left\{ a_t \right\}$ series: $\lim_{t \rightarrow \infty} a_t = 0$, $\sum_{t=1}^\infty {a_t} = \infty$ and $0 < \sum_{t=1}^\infty {a_t^2} < \infty$
* there exists a $C$ such that $P\left[ |Y(x)| \le C \right] = 1$ for all $x$
* there exists a $\delta > 0$ such that $M(x) \le \alpha - \delta$ if $x < \Theta$, and $M(x) \ge \alpha + \delta$ if $x > \Theta$.

Then $b = 0$.

Now, we can apply this theorem to our case in the following way:

* $V_t(s)$ is $x_t$, we examine each $s$ separately
* $Y(V_t) = V_t(s) - G_t(s)$
* $G_t(s)$ is the sampled return (utility) of the state $s$, obtained by starting trajectories from $s$
* $\alpha = 0$
* Then $\Theta=E[G_t(s)] = V(s)$

By the theorem, $\lim_{t \rightarrow \infty} V_t(s) = V(s)$ for each $s$ (in the mean-square sense). The Markov chain for $V_t(s)$ is generated by the following equation:

$$V_{t+1}(s) - V_t(s) = a_t \cdot \left( 0 - V_t(s) + G_t(s) \right)$$

then

$$V_{t+1}(s) = V_t(s) + a_t \cdot \left( G_t(s) - V_t(s) \right)$$

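Instead of waiting for the full sampled return, $G_t(s)$ can be estimated for each state by bootstrapping from the current value estimate of the next state $s'$: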

$$G_t(s) = r_t + \gamma \cdot V_t(s')$$

A suitable choice of $a_t$ is, for example:

$$a_t = \frac{1}{t+1}$$
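With this step size, the recursion $x_{t+1} = x_t + a_t (g_t - x_t)$ reduces to the running average of the samples. The following is a minimal sketch of that observation; the Gaussian samples (mean 5, standard deviation 2) and the function name are illustrative assumptions, not part of the original notes.

```python
import random


def robbins_monro_mean(samples):
    """Iterate x_{t+1} = x_t + a_t * (g_t - x_t) with a_t = 1 / (t + 1)."""
    x = 0.0  # arbitrary start; the first step (a_0 = 1) overwrites it
    for t, g in enumerate(samples):
        a_t = 1.0 / (t + 1)
        x = x + a_t * (g - x)
    return x


random.seed(0)
# Hypothetical noisy observations whose expected value we want to estimate.
samples = [random.gauss(5.0, 2.0) for _ in range(10_000)]

estimate = robbins_monro_mean(samples)
sample_mean = sum(samples) / len(samples)
print(estimate, sample_mean)  # the two numbers agree up to rounding error
```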

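After putting everything together, and writing the step size $a_t$ as the learning rate $\alpha(t)$ (not to be confused with the target level $\alpha$ of the theorem), the update rule becomes: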

$$V_{t+1}^\pi(s) = V_t^\pi(s) + \alpha(t)\left(r_t + \gamma V_t^\pi(s') - V_t^\pi(s) \right)$$

The term in the parentheses is called the TD-error:

$$\delta = r_t + \gamma V_t^\pi(s') - V_t^\pi(s)$$ 

The algorithm follows the steps below:
1. Initialization of the V table with random values
2. Start iteration
3. Execute action $a$ in state $s$. Observe $r_t$ and the next state $s'$.
4. Update the V function according to the update rule.
5. Choose the next action according to policy $\pi$
6. Check the terminal condition and finish or repeat from step 2
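The steps above can be sketched on a small example. The environment here is an assumption chosen for illustration: a seven-state random walk where states 0 and 6 are terminal, the fixed policy moves left or right with equal probability, and the only reward is 1 for reaching state 6. For simplicity the sketch uses a small constant step size instead of a decaying $a_t$, so the estimates fluctuate slightly around the true values $1/6, \dots, 5/6$.

```python
import random


def td0_random_walk(num_episodes=10_000, alpha=0.01, gamma=1.0, seed=0):
    """Tabular TD(0) prediction for a uniform random policy (illustrative setup)."""
    rng = random.Random(seed)
    V = [0.0] * 7                      # V[0] and V[6] are terminal and stay 0
    for s in range(1, 6):
        V[s] = rng.random()            # step 1: random initialization
    for _ in range(num_episodes):
        s = 3                          # every episode starts in the middle
        while s not in (0, 6):
            a = rng.choice((-1, 1))    # fixed policy pi: left or right
            s_next = s + a
            r = 1.0 if s_next == 6 else 0.0
            td_error = r + gamma * V[s_next] - V[s]
            V[s] += alpha * td_error   # TD update rule
            s = s_next
    return V[1:6]


values = td0_random_walk()
true_values = [i / 6 for i in range(1, 6)]
print([round(v, 2) for v in values])
print([round(v, 2) for v in true_values])
```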

### Exploration-exploitation dilemma

We have a policy $\pi$ and we have to act in the current state. If we follow the current policy, we exploit our current knowledge. However, at the beginning of learning we do not know much about the environment, therefore we also have to keep exploring it.

### $\varepsilon$-greedy

One simple way to achieve this is the so-called $\varepsilon$-greedy strategy. With probability $\varepsilon$ we take a random action, and with probability $1-\varepsilon$ we take the action recommended by the policy. This method is mostly used for **deterministic policies**. The value of $\varepsilon$ usually changes over time.

Pseudo-code for $\varepsilon$-greedy, where the set of actions is $A$:

1. With probability $\varepsilon$, sample an action uniformly from $A$
2. With probability $1-\varepsilon$, use the action $\pi(s)$

From the implementation standpoint:

If we have $n$ actions and the $i^{th}$ action is recommended by the policy $\pi$, then the probability distribution over the actions is:

$$p_1 = \frac{\varepsilon}{n},\ p_2 = \frac{\varepsilon}{n},\ ...,\ p_i = 1 - \varepsilon + \frac{\varepsilon}{n},\ ...,\ p_n = \frac{\varepsilon}{n}$$

Typically $\varepsilon$ is close to 1 at the very beginning and decreases until it reaches zero (or a small minimum value). The decay over the iterations is usually linear or exponential.
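A minimal sketch of $\varepsilon$-greedy action selection follows. The helper names and the linear schedule parameters are hypothetical choices for illustration; the greedy action is taken from a list of Q-values, which is one common way to represent a deterministic policy.

```python
import random


def epsilon_greedy_probs(n_actions, greedy_action, epsilon):
    """Action distribution: epsilon/n everywhere, plus 1 - epsilon on the greedy action."""
    probs = [epsilon / n_actions] * n_actions
    probs[greedy_action] += 1.0 - epsilon
    return probs


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.0):
    """Linearly decay epsilon from eps_start to eps_end over total_steps."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


probs = epsilon_greedy_probs(n_actions=4, greedy_action=2, epsilon=0.2)
print(probs)
print(epsilon_greedy_action([0.1, 0.9, 0.3], epsilon=linear_epsilon(50, 100)))
```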

### Softmax-based

Let us take the Q function and use its values over the actions. Passing them through a softmax function turns them into a probability distribution over the actions. The softmax has a parameter $\beta$ that we can tune. The probabilities over the actions:

$$p(s, a_i) = \frac{e^{\beta Q(s, a_i)}}{\sum_{a\in A}{e^{\beta Q(s, a)}}}$$

You can think of this as acting with a policy that differs from the current greedy policy.

The effect of $\beta$:

* if $\beta \rightarrow \infty$: the action $a_i$ with the highest Q value gets a probability close to 1 (exploitation, greedy)
* if $\beta \rightarrow 0$: all of the actions have the same probability, the distribution is uniform (exploration)

$\beta$ starts near zero and is increased to a high value during the iterations. It is also usual to write $\beta$ as $1/T$; then $T$ starts from a large number and goes toward zero. If we interpret $T$ as a temperature, this means the system cools down. This schedule resembles simulated annealing, and the method is often called Boltzmann exploration.
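The effect of $\beta$ can be checked numerically. This is a small sketch assuming $\beta \ge 0$; subtracting the maximum Q-value before exponentiating is a standard numerical-stability trick that does not change the resulting probabilities.

```python
import math


def softmax_policy(q_values, beta):
    """Boltzmann action probabilities p_i proportional to exp(beta * Q_i)."""
    m = max(q_values)                      # shift for numerical stability
    exps = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


q = [1.0, 2.0, 3.0]
uniform_like = softmax_policy(q, beta=0.0)   # beta -> 0: uniform distribution
greedy_like = softmax_policy(q, beta=50.0)   # large beta: nearly greedy
print(uniform_like)
print(greedy_like)
```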

### Entropy-based approach

For the sake of completeness, I mention it here; we will revisit it later on. This method can be used in policy-based RL, where the policy is parametrized. We therefore need an objective function that measures how good the policy learned so far is. Let us denote it by $L(\pi_\theta)$, where $\pi_\theta$ is the parametrized policy. Then it is possible to train the policy by gradient ascent:

$$\theta_{t+1} = \theta_t + \eta \cdot \frac{\partial L}{\partial \theta}$$

$\eta$ is the learning rate. We can add a regularization-like term to the objective in order to encourage exploration for a longer time. This is the entropy of the policy:

$$H(\pi) = -\sum_i {\pi_i \cdot \log(\pi_i)}$$

Putting together:

$$L_{reg}(\pi) = L(\pi) + H(\pi)$$

Why does this work?

Exploration means the policy $\pi$ is almost a random policy, meaning it chooses each action with nearly the same probability. In other words, $\pi$ is close to uniform.

The goal is to maximize the objective function:

$$\pi^* = \arg\max_\pi{L_{reg}(\pi)}$$

**The entropy term has a maximum when the policy is a uniform distribution.** The regularization works because the entropy term pulls the policy toward the uniform distribution, while the original objective pulls it toward the optimal policy.
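The claim about the maximum can be illustrated with a few hand-picked distributions. The sketch uses the standard Shannon entropy $H(\pi) = -\sum_i \pi_i \log \pi_i$ with the convention $0 \cdot \log 0 = 0$; the three example policies are arbitrary.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
deterministic = [1.0, 0.0, 0.0, 0.0]

for name, pi in [("uniform", uniform), ("peaked", peaked), ("deterministic", deterministic)]:
    print(name, entropy(pi))
```

The uniform policy reaches the upper bound $\log n$ (here $\log 4$), while a deterministic policy has zero entropy.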

### Model-free learning

From now on, we discuss learning algorithms based on the action-value function. The same logic can be applied to utilize the Robbins-Monro theorem. If we look closer at the update rules, it is clear that during learning we do not need the transition matrix, that is, the model of the environment.

Recall, how the policy can be calculated!

With Q-function:

$$\pi(s) = \arg \max_a{ Q(s, a) }$$

When the V-function is given:

$$\pi(s) = \arg\max_a \sum_{s'} T(s, a, s') \cdot \left[ r(s, a) + \gamma V(s') \right]$$

This motivates the use of the $Q$ function: it can be trained, and the policy can be extracted from it, without knowing the model of the environment. Let us see how these algorithms work.

## Sarsa-learning

Trajectory: state, action, reward, next state, next action

Abbreviation: s, a, r, s, a; in short, Sarsa

The algorithm follows the steps below:
1. Initialization of the Q table with random values
2. Start iteration
3. Execute action $a$ in state $s$. Observe $r_t$ and the next state $s'$.
4. Choose the next action $a'$ in state $s'$ according to $\varepsilon$-greedy
5. Update the Q function using $a'$ (see the update rule below).
6. Check the terminal condition and finish or repeat from step 2

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha(t) \left( r_t + \gamma Q_t(s', a') - Q_t(s, a)  \right)$$
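A single Sarsa update can be written as a small function. This is a sketch with a hypothetical Q-table layout (a dictionary mapping each state to a list of action values); the numbers in the example are arbitrary.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one Sarsa update in place and return the new Q(s, a).

    The target uses the action a_next that the agent will actually execute.
    """
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]


# Two states with two actions each (illustrative values).
Q = {0: [0.0, 0.0], 1: [1.0, 3.0]}
new_value = sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
print(new_value)
```

Here the target is $1 + 0.9 \cdot 1.0 = 1.9$, so $Q(0, 1)$ moves halfway from 0 toward 1.9.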

## Q-learning

The algorithm follows the steps below:
1. Initialization of the Q table with random values
2. Start iteration
3. Execute action $a$ in state $s$. Observe $r_t$ and the next state $s'$.
4. Update the Q function.
5. Choose the next action according to $\varepsilon$-greedy
6. Check the terminal condition and finish or repeat from step 2

Update rule:

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha(t) \left( r_t + \gamma \max_{a'} Q_t(s', a') - Q_t(s, a)  \right)$$
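The following sketch runs tabular Q-learning on a toy problem. The environment is an assumption made for illustration: a deterministic corridor with states 0 to 4, where state 4 is terminal, action 0 moves left (staying in place at state 0), action 1 moves right, and reaching state 4 gives reward 1. The Q table is initialized with zeros rather than random values to keep the example short, and ties in the greedy choice are broken randomly.

```python
import random


def q_learning_corridor(num_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.5, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy (toy corridor)."""
    rng = random.Random(seed)
    n_states, goal = 5, 4
    Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[s][a]; a = 0 left, a = 1 right
    for _ in range(num_episodes):
        s = 0
        while s != goal:
            # Behavior policy: epsilon-greedy with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Target policy: greedy, hence the max over next actions.
            # The terminal row is never updated, so its values stay 0.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q


Q = q_learning_corridor()
greedy_policy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(4)]
print(greedy_policy)
print([round(Q[s][1], 3) for s in range(4)])
```

In this corridor the optimal value of moving right from state $s$ is $\gamma^{3-s}$, which gives a simple reference to compare the learned table against.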

### On-policy vs. off-policy learning

Let us look at the update rules of Sarsa and Q-learning again.

Sarsa-learning: 

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha(t) \left( r_t + \gamma Q_t(s', a') - Q_t(s, a)  \right)$$

Q-learning:

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha(t) \left( r_t + \gamma \max_{a'} Q_t(s', a') - Q_t(s, a)  \right)$$

The only difference is the innocent-looking $\max_{a'}$ operator, which makes a big difference. Because of the $\max$ operator in the second case, the next action $a'$ used in the formula and the next action actually executed by the agent are not necessarily the same. In RL, we distinguish between two types of policy: the behavioral and the target policy.

**Behavioral policy:** the policy that is responsible for choosing the next action during sampling. ($\pi_B$)

**Target policy:** the policy that we want to learn. The policy that provides the highest possible expected return. ($\pi_T$)

**On-policy:** the target and the behavioral policies are exactly the same. $\pi_B = \pi_T$

**Off-policy:** the target and the behavioral policies are not the same. $\pi_B \neq \pi_T$

The Sarsa-learning algorithm is an on-policy algorithm while the Q-learning is an off-policy learning algorithm.

In the case of off-policy learning, we can decouple exploration from exploitation. The convergence of Q-learning means that we can explore and sample the environment with an arbitrary policy and the $Q$ function still converges to the optimal one. The target policy is then given by:

$$\pi_T(s) = \arg \max_a{ Q(s, a) }$$

(Side note: this assumes that $\pi_B$ ensures that all state-action pairs are visited repeatedly.)

### Convergence of Q-learning

The Q-learning algorithm is guaranteed to converge to the optimal action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. See details in the paper: [Watkins & Dayan (1992)](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf)

## Curses

### Curse of dimensionality

The state space tends to grow large very easily. Take Atari games as an example. The input is an $84\times 84 \times 4$ image with integer intensity values between 0 and 255. How many states are possible?

$$|S| = 256^{84\cdot 84 \cdot 4} \approx 10^{67970}$$

This is huge, and it is not possible to store it in a table. This is the curse of dimensionality.

### Curse of modeling

The transition matrix $T(s, a, s')$ is far bigger than the state space itself. For instance, in the case of the Atari games, the size of the transition matrix is:

$$|T| = |S| \times |A| \times |S| \approx 10^{136000}$$

This is also huge and intractable to store in memory. It is also difficult to learn or identify such a huge model by sampling the environment. It is harder than exploring the state space itself. This is the curse of modeling. Model-free methods can eliminate this problem.
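The orders of magnitude can be checked with a short calculation in log space. This sketch assumes 8-bit pixels (256 intensity levels) and an action set of size 18, which is the size of the full Atari action set; both numbers are assumptions for the estimate, not values fixed by the text above.

```python
import math

pixels = 84 * 84 * 4          # number of intensity values in one state
levels = 256                  # assumed: 8-bit intensities 0..255
n_actions = 18                # assumed: size of the full Atari action set

# Work with base-10 logarithms, since the numbers themselves are astronomically large.
log10_states = pixels * math.log10(levels)
log10_transitions = 2 * log10_states + math.log10(n_actions)

print(f"|S| is about 10^{log10_states:.0f}")
print(f"|T| is about 10^{log10_transitions:.0f}")
```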

### Curse of credit assignment

When a longer episode is played or executed, it is not evident which actions were the most relevant to achieve the final result. This is the credit assignment problem.

## Linear-approximation

In order to cope with the curse of dimensionality, we can use linear approximation. This means the Q-function is represented as a linear function of the parameters:

$$Q_\theta(s, a) = \sum_i{\theta_i \cdot f_i(s, a)} = \theta \cdot f$$

Then the goal is to set the parameters $\theta$ to get the optimal Q-function. However, we should keep in mind that the optimal Q-function should satisfy the Bellman equation. Because we only have an approximation, we cannot be sure that the Bellman equation holds exactly. Instead, we minimize the mean squared TD-error:

$$L_{\theta} = E\left[ \left( r + \gamma \cdot Q_{\tilde{\theta}}(s', a') - Q_\theta(s, a) \right)^2 \right]$$

where $\tilde{\theta}$ means that we do not take the derivative with respect to it, only with respect to $\theta$. Therefore the update rule (SGD, with a learning rate $\alpha$ that absorbs the constant factor) is:

$$\theta_{t+1} = \theta_t + \alpha \left[ r + \gamma Q_{\theta_t}(s', a') - Q_{\theta_t}(s, a) \right]\frac{\partial Q_\theta(s, a)}{\partial \theta}$$

Now, we substitute the linear approximation:

$$\frac{\partial Q_\theta(s, a)}{\partial \theta} = f(s, a)$$

Then:

$$\theta_{t+1} = \theta_t + \alpha \left[ r + \gamma \theta_t \cdot f(s', a') - \theta_t \cdot f(s, a) \right] f(s, a)$$

Depending on how the action $a'$ is chosen, on-policy or off-policy, we get slightly different update rules:

**Sarsa-learning:**

$a'$ is chosen according to the current policy $\pi$, then: $\theta_{t+1} = \theta_t + \alpha \left[ r + \gamma \theta_t \cdot f(s', a') - \theta_t \cdot f(s, a) \right] f(s, a)$

**Q-learning:**

This is an off-policy method, so the behavioral and the target policies are decoupled: $\theta_{t+1} = \theta_t + \alpha \left[ r + \gamma \max_{a'} \left( \theta_t \cdot f(s', a') \right) - \theta_t \cdot f(s, a) \right] f(s, a)$
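Both variants can be sketched with one helper that differs only in how the bootstrap value is chosen. The function name, the explicit learning rate `alpha`, and the feature vectors below are illustrative assumptions; the sketch also assumes that $s'$ is not terminal, so at least one next action exists.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(x * y for x, y in zip(u, v))


def linear_q_update(theta, f_sa, r, next_features, alpha, gamma, a_next=None):
    """One semi-gradient step for Q(s, a) = theta . f(s, a).

    next_features[i] is f(s', a_i). With a_next=None the Q-learning target
    (max over next actions) is used; otherwise the Sarsa target for a_next.
    """
    q_next = [dot(theta, f) for f in next_features]
    bootstrap = max(q_next) if a_next is None else q_next[a_next]
    td_error = r + gamma * bootstrap - dot(theta, f_sa)
    return [th + alpha * td_error * fi for th, fi in zip(theta, f_sa)]


theta = [0.5, -0.5]
f_sa = [1.0, 0.0]
next_features = [[0.0, 1.0], [1.0, 1.0]]   # Q(s', a_0) = -0.5, Q(s', a_1) = 0.0

theta_q = linear_q_update(theta, f_sa, 1.0, next_features, alpha=0.1, gamma=0.9)
theta_sarsa = linear_q_update(theta, f_sa, 1.0, next_features, alpha=0.1, gamma=0.9, a_next=0)
print(theta_q, theta_sarsa)
```

Note that the maximum is taken over the values $\theta \cdot f(s', a')$, not over the feature vectors themselves.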

### Feature extraction for linear methods

The question is how we can create a compressed representation of a large state space. One option is to choose specific states in the state space so that every state close to one of them is described in the same way. This is similar to discretization. Can we do better?

**Polynomials**

Suppose each state $s$ corresponds to $k$ numbers, $s_1, s_2, ..., s_k$, with each $s_i \in \mathbb{R}$. For this $k$-dimensional state space, each order-$n$ polynomial-basis feature $x_i$ can be written as

$$x_i(s) = \prod_{j=1}^k{s_j^{c_{i, j}}}$$

where each $c_{i, j}$ is an integer in the set $\{ 0, 1, ..., n \}$ for an integer $n \ge 0$. 
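As a sketch, the full set of order-$n$ polynomial features can be generated by enumerating every exponent vector $(c_{i,1}, ..., c_{i,k})$ with entries in $\{0, ..., n\}$, which yields $(n+1)^k$ features. The function name is a hypothetical choice.

```python
from itertools import product


def polynomial_features(state, n):
    """Return all order-n polynomial-basis features of a k-dimensional state."""
    k = len(state)
    feats = []
    for exponents in product(range(n + 1), repeat=k):
        x = 1.0
        for s_j, c in zip(state, exponents):
            x *= s_j ** c
        feats.append(x)
    return feats


# For k = 2 and n = 1 the features are 1, s_2, s_1 and s_1 * s_2.
features = polynomial_features([2.0, 3.0], n=1)
print(features)
```

The number of features grows exponentially with $k$, so in practice one usually selects a subset of them.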

**Radial Basis Functions**

They are useful for continuous-valued features. The RBF feature $x_i$ depends on the distance between the state $s$ and the corresponding center state $c_i$, and on the feature's width $\sigma_i$:

$$x_i(s) = e^{\left( -\frac{||s-c_i||^2}{2\sigma_i^2} \right)}$$

<img src="http://drive.google.com/uc?export=view&id=16eO17_rp7JpaGE6jdlOxz7GH-8dpsgY8" width=75%>
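A minimal sketch of RBF feature extraction, directly following the formula above. The centers and widths are arbitrary illustrative values; how to place them is a design decision that depends on the problem.

```python
import math


def rbf_features(state, centers, widths):
    """Gaussian RBF features: exp(-||s - c_i||^2 / (2 * sigma_i^2))."""
    feats = []
    for c, sigma in zip(centers, widths):
        sq_dist = sum((s_j - c_j) ** 2 for s_j, c_j in zip(state, c))
        feats.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return feats


centers = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical center states
widths = [0.5, 0.5, 0.5]                          # hypothetical feature widths
features = rbf_features([0.0, 0.0], centers, widths)
print(features)
```

A state that coincides with a center gets feature value 1, and the value decays smoothly with the distance from the center.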

## n-step return, $\lambda$-return, eligibility traces

TD, Sarsa and Q-learning can be combined with different strategies to make the update rule more robust to noise. Let us go through them.

### n-step return

As the name suggests, it is possible to look ahead $n$ steps and calculate a more accurate estimate of the value of the current state.

$$G_{t:t+n} = r_{t+1} + \gamma \cdot r_{t+2} + ... + \gamma^{n-1} \cdot r_{t+n} + \gamma^n \cdot V^\pi_{t + n -1}(s_{t+n})$$

Then the update rule:

$$V^\pi_{t + n}(s_t) = V^\pi_{t + n - 1}(s_t) + \alpha \left[ G_{t:t+n} - V^\pi_{t + n - 1}(s_t) \right],\ 0 \le t < T$$

Similar formulas can be defined for Sarsa. For Q-learning more care is needed, because the off-policy setting requires the application of importance sampling.
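The n-step return for state values can be computed from a recorded episode as sketched below. The indexing convention is an assumption of this sketch: `rewards[k]` stores $r_{k+1}$ (the reward received after leaving $s_k$), `values[k]` stores the current estimate $V(s_k)$, and the episode terminates at $T = $ `len(rewards)`. If the look-ahead reaches the end of the episode, the return is simply the full discounted return.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute G_{t:t+n} for one recorded episode."""
    T = len(rewards)
    end = min(t + n, T)
    G = 0.0
    for k in range(t, end):
        G += gamma ** (k - t) * rewards[k]
    if t + n < T:
        # Bootstrap from the value estimate only if s_{t+n} is not terminal.
        G += gamma ** n * values[t + n]
    return G


rewards = [1.0, 0.0, 2.0]            # r_1, r_2, r_3 (illustrative)
values = [0.5, 2.0, 1.5, 0.0]        # V(s_0), ..., V(s_3); s_3 is terminal
one_step = n_step_return(rewards, values, t=0, n=1, gamma=0.5)
two_step = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
full = n_step_return(rewards, values, t=0, n=3, gamma=0.5)
print(one_step, two_step, full)
```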

### $\lambda$-return

If we have n-step returns, then we can combine them. The definition of the $\lambda$-return:

$$G^\lambda_t = (1 - \lambda) \cdot \sum_{n=1}^\infty{\lambda^{n-1}G_{t:t+n}}$$

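Because the sum runs to infinity, in an episodic task we can truncate it at the termination time $T$: every n-step return that reaches the end of the episode equals the full return $G_t$, so these terms can be collected into one: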

$$G^\lambda_t = (1 - \lambda) \cdot \sum_{n=1}^{T-t-1}{\lambda^{n-1}G_{t:t+n}} + \lambda^{T-t-1}G_t$$


Then the update rule for the approximation case (at the end of the episode):

$$\theta_{t+1} = \theta_t + \alpha \cdot \left[ G^\lambda_t - \tilde{V}(s_t, \theta_t) \right]\frac{\partial \tilde{V}(s_t, \theta_t)}{\partial \theta_t}$$
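The truncated $\lambda$-return can be sketched on top of the n-step return. The indexing convention is an assumption of this sketch: `rewards[k]` stores $r_{k+1}$, `values[k]` stores the current estimate of $V(s_k)$, and $T = $ `len(rewards)`.

```python
def lambda_return(rewards, values, t, lam, gamma):
    """Truncated lambda-return G^lambda_t for one recorded episode."""
    T = len(rewards)

    def n_step(n):
        # n-step return G_{t:t+n}; bootstraps only from non-terminal states.
        end = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if t + n < T:
            G += gamma ** n * values[t + n]
        return G

    total = 0.0
    for n in range(1, T - t):
        total += (1.0 - lam) * lam ** (n - 1) * n_step(n)
    total += lam ** (T - t - 1) * n_step(T - t)   # remaining weight on the full return
    return total


rewards = [1.0, 0.0, 2.0]
values = [0.5, 2.0, 1.5, 0.0]
print(lambda_return(rewards, values, t=0, lam=0.0, gamma=0.5))   # one-step TD target
print(lambda_return(rewards, values, t=0, lam=1.0, gamma=0.5))   # Monte Carlo return
print(lambda_return(rewards, values, t=0, lam=0.5, gamma=0.5))
```

The two extreme cases show the role of $\lambda$: $\lambda = 0$ recovers the one-step TD target and $\lambda = 1$ recovers the Monte Carlo return.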

### Eligibility traces

The previous two methods can be viewed as a forward view. The disadvantage of that approach is the high computational complexity. This can be improved by a backward view, with the so-called eligibility traces. For the approximation-based TD algorithm, the eligibility trace is a vector with the same length as the weight vector ($\theta$). The weight vector parametrizes the value approximator function.

The eligibility trace at the very beginning is initialized as zero:

$$\textbf{z}_{-1} = 0$$

Then, the eligibility trace is updated according to the following equation ($0\le t \le T$):

$$\textbf{z}_t = \gamma\lambda\textbf{z}_{t-1} + \nabla \tilde{V}(s_t, \theta_t)$$

The one-step TD-error is given as:

$$\delta_t = r_{t+1} + \gamma \tilde{V}(s_{t+1}, \theta_t) - \tilde{V}(s_t, \theta_t)$$

Then the update rule for the weight vector:

$$\theta_{t+1} = \theta_t + \alpha \delta_t \textbf{z}_t$$

The eligibility trace indicates how much each component of the weight vector should be updated. Components that recently had large gradients are updated more strongly, and this effect decays with the factor $\gamma\lambda$. This is similar to momentum in gradient-based optimizers.

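**The forward and backward views of linear TD($\lambda$) are equivalent** when the updates are applied offline, at the end of the episode; with online updates they agree only approximately.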
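The backward view can be sketched as follows. The setup is an assumption chosen for illustration: a seven-state random walk (states 0 and 6 terminal, reward 1 for reaching state 6, uniform random policy) with one-hot features, so the linear approximator is effectively a table and the true values are $1/6, \dots, 5/6$. A small constant step size is used, so the estimates fluctuate slightly around these values.

```python
import random


def td_lambda_random_walk(num_episodes=20_000, alpha=0.002, gamma=1.0, lam=0.8, seed=0):
    """Semi-gradient TD(lambda) with accumulating eligibility traces."""
    rng = random.Random(seed)
    n_features = 5

    def features(s):
        # One-hot features for states 1..5; terminal states map to all zeros.
        x = [0.0] * n_features
        if 1 <= s <= 5:
            x[s - 1] = 1.0
        return x

    def value(theta, s):
        return sum(th * xi for th, xi in zip(theta, features(s)))

    theta = [0.0] * n_features
    for _ in range(num_episodes):
        z = [0.0] * n_features             # trace is reset at the start of each episode
        s = 3
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            grad = features(s)             # gradient of a linear V is the feature vector
            z = [gamma * lam * zi + gi for zi, gi in zip(z, grad)]
            delta = r + gamma * value(theta, s_next) - value(theta, s)
            theta = [th + alpha * delta * zi for th, zi in zip(theta, z)]
            s = s_next
    return theta


theta = td_lambda_random_walk()
true_values = [i / 6 for i in range(1, 6)]
print([round(th, 2) for th in theta])
print([round(v, 2) for v in true_values])
```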

### Convergence properties of the algorithms

**Prediction algorithms:**

| On/Off-policy | Algorithm | Tabular | Linear | Non-linear|
|------|------|------|------|------|
| On-policy | MC | YES | YES | YES |
| On-policy | TD($\lambda$) | YES | YES | NO |
| Off-policy | MC | YES | YES | YES |
| Off-policy | TD($\lambda$) | YES | NO | NO |

**Control algorithms:**

| Algorithm | Tabular | Linear | Non-linear|
|------|------|------|------|
| MC | YES | YES | NO |
| Sarsa | YES | YES | NO |
| Q-learning | YES | NO | NO |