# Introduction to value-based RL

## Outline of the day

* Bellman-equation again
* Taxonomy of RL algorithms
* Model-based vs model-free learning

* Value-based RL with classic methods
* Value-based RL with DNNs
* Examples

* Practice session

## Recap: Bellman-equation

<img src="http://drive.google.com/uc?export=view&id=1VqFRFYo8Tv2YDG__6ntVliUNMb-RlEmU" width=75%>

**Key concepts we covered last time:**

* state
* action
* reward (and return)
* policy
* transition matrix
* dynamic programming

**Objective of reinforcement learning:**

To find the optimal behavior of the agent which maximizes its expected return. 

<img src="http://drive.google.com/uc?export=view&id=1J-jj-LQLnKIMFy9KV6cdKONtXQlMHhWI" width=65%>

Value-function, fixed stochastic policy
$$V^\pi(s) = \sum_{s', a}{\pi(s, a) \cdot T(s, a, s') \cdot \left[ r(s, a) + \gamma \cdot V^\pi(s') \right]}$$

Value-function, fixed deterministic policy
$$V^\pi(s) = \sum_{s'}{T(s, \pi(s), s') \cdot \left[ r(s, \pi(s)) + \gamma \cdot V^\pi(s') \right]}$$

Action value-function, fixed stochastic policy
$$Q^\pi(s, a) = \sum_{s', a'}{\pi(s', a') \cdot T(s, a, s') \cdot \left[ r(s, a) + \gamma \cdot Q^\pi(s', a') \right]}$$

Action value-function, fixed deterministic policy
$$Q^\pi(s, a) = \sum_{s'}{T(s, a, s') \cdot \left[ r(s, a) + \gamma \cdot Q^\pi(s', \pi(s')) \right]}$$
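The fixed-policy equations above can be turned into an iterative update. The following is a minimal sketch of iterative policy evaluation on a made-up MDP with two states and two actions; the arrays `T`, `r`, `pi` and the value of `gamma` are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

# Hypothetical toy MDP (all numbers are made up for illustration).
# T[s, a, s2] = probability of reaching s2 after taking action a in state s.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
# r[s, a] = immediate reward.
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
# pi[s, a] = probability that the policy chooses action a in state s.
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])
gamma = 0.9


def evaluate_policy(T, r, pi, gamma, tol=1e-10):
    """Repeat V(s) <- sum_a pi(s,a) * sum_s' T(s,a,s') * [r(s,a) + gamma * V(s')]."""
    V = np.zeros(T.shape[0])
    while True:
        # Since T sums to 1 over s', the bracket reduces to r + gamma * E[V(s')].
        Q = r + gamma * T @ V
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, r + gamma * T @ V_new
        V = V_new


V_pi, Q_pi = evaluate_policy(T, r, pi, gamma)
print("V^pi:", V_pi)
print("Q^pi:", Q_pi)
```

The returned `Q_pi` satisfies $V^\pi(s) = \sum_a \pi(s, a) \cdot Q^\pi(s, a)$ up to the tolerance, which is a quick sanity check for the two formulas.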

Optimal value-function
$$\tilde{V}(s) = \max_a \sum_{s'}{T(s, a, s') \cdot \left[ r(s, a) + \gamma \cdot \tilde{V}(s') \right]}$$

Optimal action value-function
$$\tilde{Q}(s, a) = \sum_{s'}{T(s, a, s') \cdot \left[ r(s, a) + \gamma \cdot \max_{a'} \tilde{Q}(s', a') \right]}$$
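When the model is known, the optimality equations can be solved by dynamic programming. Below is a minimal value-iteration sketch, assuming the reward depends only on the state and action (stored as `r[s, a]`). The two-state deterministic MDP is a made-up example chosen so that the result can be checked by hand.

```python
import numpy as np


def value_iteration(T, r, gamma, tol=1e-10):
    """Apply the Bellman optimality update until V stops changing."""
    V = np.zeros(T.shape[0])
    while True:
        Q = r + gamma * T @ V        # Q[s, a]: expectation over next states
        V_new = Q.max(axis=1)        # max over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new


# Hypothetical deterministic MDP: action 0 leads to state 0, action 1 to state 1.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [0.0, 2.0]])
gamma = 0.5

V_opt, Q_opt = value_iteration(T, r, gamma)
print("optimal values:", V_opt)
print("greedy policy:", Q_opt.argmax(axis=1))
```

By hand: staying in state 1 forever yields $2 / (1 - 0.5) = 4$, and moving from state 0 to state 1 yields $1 + 0.5 \cdot 4 = 3$, so the iteration should approach $\tilde{V} = (3, 4)$.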

## Taxonomy of RL algorithms

<img src="http://drive.google.com/uc?export=view&id=1Gz0WBOtTxYrZ91uAidFE_Tuw0RBSfl9O" width=65%>

<img src="http://drive.google.com/uc?export=view&id=1WaYwpMZ5O0pUpT41TlfSQh2ov42xIHQb" width=65%>

**Value-based methods:**

* DQN and its variants (DQN with prioritized experience replay, Double DQN)
* Deep Sarsa

**Policy-based method:**

* REINFORCE

**Actor-Critic methods:**

* A3C, A2C
* TRPO
* PPO
* SAC

and lots of others.

## Model-based or model-free learning

Consider how the policy can be derived once the optimal action-value function or the optimal value function is known. When the action-value function is given:

$$\pi(s) = \arg \max_a{ \tilde{Q}(s, a) }$$

When the value-function is given:

$$\pi(s) = \arg\max_a \sum_{s'}{T(s, a, s') \cdot \left[ r(s, a) + \gamma \cdot \tilde{V}(s') \right]}$$

The main difference between the two is that the second one requires $T$ and $r$, the transition probabilities and the reward function.
These functions describe the dynamics of the environment, that is, the model.
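The difference shows up directly in code. In the sketch below (function names and the toy arrays are illustrative assumptions), the greedy policy from $\tilde{Q}$ is a plain `argmax`, while the greedy policy from $\tilde{V}$ needs a one-step lookahead through `T` and `r`.

```python
import numpy as np


def greedy_from_q(Q):
    """Model-free: only the table Q[s, a] is needed."""
    return Q.argmax(axis=1)


def greedy_from_v(V, T, r, gamma):
    """Model-based: needs T[s, a, s2] and r[s, a] for a one-step lookahead."""
    lookahead = r + gamma * T @ V    # expected return of each action in each state
    return lookahead.argmax(axis=1)


# Hypothetical deterministic MDP: action 0 leads to state 0, action 1 to state 1.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [0.0, 2.0]])
gamma = 0.5

# Optimal values of this toy MDP (they can be checked by hand).
V_opt = np.array([3.0, 4.0])
Q_opt = np.array([[1.5, 3.0],
                  [1.5, 4.0]])

print(greedy_from_q(Q_opt))
print(greedy_from_v(V_opt, T, r, gamma))
```

Both calls select the same actions, but only the second one had to be given the model.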

Algorithms that use a model of the environment are called model-based; those that do not are model-free. If the $Q$ function is learned directly, the algorithm can be model-free, because acting greedily on $Q$ requires neither $T$ nor $r$. This is one of the reasons why Q-learning is so popular and well-known (see later).

Model-based algorithms require knowledge of the model. Unfortunately, it is rarely known in advance, so it has to be learned through so-called model identification. There are two types of model representation:

1. distribution models
2. sampling models

**Distribution models vs sampling models**

Given a state and an already chosen action, we can define a **distribution over the possible next states**. Formally:

$$p(s') = T(s, a, s')$$

A distribution model gives the probabilities of all possible next states. In contrast, a **sampling model only returns a next state drawn with probability $p(s')$**; it tells us nothing about the probabilities of the other states.

Therefore the distribution model is more general: it can always be used to generate samples, whereas a sampling model does not directly reveal the distribution.
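The two kinds of model differ mainly in the interface they expose. The sketch below uses hypothetical class and method names and a made-up transition table to illustrate this: the distribution model returns the whole probability vector, while the sampling model hides the table and only returns one drawn next state.

```python
import numpy as np


class DistributionModel:
    """Exposes the full distribution over next states."""

    def __init__(self, T):
        self.T = np.asarray(T, dtype=float)   # T[s, a, s2]

    def next_state_probs(self, s, a):
        return self.T[s, a]


class SampleModel:
    """Only returns one sampled next state; the probabilities stay hidden."""

    def __init__(self, T, seed=0):
        self._T = np.asarray(T, dtype=float)  # internal, not visible to the learner
        self._rng = np.random.default_rng(seed)

    def sample_next_state(self, s, a):
        n_states = self._T.shape[2]
        return int(self._rng.choice(n_states, p=self._T[s, a]))


# Made-up transition table with 2 states and 2 actions.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]

dist_model = DistributionModel(T)
sample_model = SampleModel(T)

print(dist_model.next_state_probs(0, 1))    # whole probability vector
print(sample_model.sample_next_state(0, 1)) # a single next state
```

A simulator typically behaves like `SampleModel`: it can be stepped, but it does not report transition probabilities.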

From the MDP's point of view, learning a distribution model is the same as learning $T(s, a, s')$. The simplest approach is to use a "policy" or sampling method that ensures every transition is visited often enough, so that the empirical probabilities are close to the true transition probabilities. The empirical probability:

$$T(s, a, s') = \frac{N_{s, a, s'}}{N_{s, a}}$$

$N_{s, a, s'}$ is the number of times the transition $s, a \rightarrow s'$ has occurred so far, and $N_{s, a}$ is the number of times the pair $s, a$ has appeared.
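The count-based estimate can be sketched as follows. The function name and the list of observed transitions are made up for illustration, and state-action pairs that were never visited are left as all zeros here, which is just one possible convention.

```python
import numpy as np


def estimate_transitions(transitions, n_states, n_actions):
    """Estimate T(s, a, s') as N_{s,a,s'} / N_{s,a} from observed (s, a, s') triples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    n_sa = counts.sum(axis=2, keepdims=True)          # N_{s,a}
    # Divide only where the pair (s, a) was actually visited.
    return np.divide(counts, n_sa, out=np.zeros_like(counts), where=n_sa > 0)


# Hypothetical experience: (state, action, next state).
observed = [(0, 1, 1), (0, 1, 1), (0, 1, 0), (1, 0, 0)]
T_hat = estimate_transitions(observed, n_states=2, n_actions=2)

print(T_hat[0, 1])   # estimated from three visits of (s=0, a=1)
print(T_hat[0, 0])   # never visited, so no estimate is available
```

With only three visits, the estimate for `(0, 1)` is $(1/3, 2/3)$, which shows why many samples are needed before the empirical probabilities become reliable.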

A large and complex environment requires a large amount of sampling, which can be infeasible in real life. However, there are some RL methods that try to learn the model and find the optimal policy concurrently, for instance [Dyna-Q](http://papers.nips.cc/paper/388-integrated-modeling-and-control-based-on-reinforcement-learning-and-dynamic-programming.pdf).
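To give a feeling for learning the model and the policy at the same time, here is a rough tabular sketch in the spirit of Dyna-Q. It is a simplified illustration, not the algorithm exactly as published: the chain environment is made up, the behaviour policy is uniformly random to keep the code short, and the value update is the Q-learning rule that is discussed later.

```python
import numpy as np

N_STATES, N_ACTIONS, GOAL = 5, 2, 4   # hypothetical chain with states 0..4


def step(s, a):
    """Made-up deterministic environment: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL


def dyna_q(episodes=50, planning_steps=10, alpha=0.5, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    model = {}   # learned model: (s, a) -> (reward, s_next, done)

    def update(s, a, reward, s_next, done):
        target = reward + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = int(rng.integers(N_ACTIONS))          # uniformly random exploration
            s_next, reward, done = step(s, a)
            update(s, a, reward, s_next, done)        # learn from real experience
            model[(s, a)] = (reward, s_next, done)    # update the learned model
            seen = list(model)
            for _ in range(planning_steps):           # plan with simulated experience
                ps, pa = seen[rng.integers(len(seen))]
                update(ps, pa, *model[(ps, pa)])
            s = s_next
    return Q


Q = dyna_q()
print("greedy actions in states 0..3:", Q[:GOAL].argmax(axis=1))
```

Each real step is reused many times through the learned model, which is why such methods can get by with fewer interactions with the real environment.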

In practice, it is often more feasible to build a simulator, use it as a sampling model, and then apply a **model-free method** in the simulator. Of course, transferring from the simulator to the real world still requires effort. From now on, we focus on model-free methods. Many of the most successful RL algorithms are model-free.