# Model-Based Reinforcement Learning

----

Model-based methods rely on a model of the environment, which includes the reward function and/or the transition model. With the model, we can learn or infer how the environment will respond to the agent's actions and what feedback it will provide.

# RL Algorithm Components

## Model

Transition: The transition function $P$ gives the probability of moving from state $s$ to state $s'$ after taking action $a$.
$$
P(s' \vert s, a)  = \mathbb{P} [S_{t+1} = s' \vert S_t = s, A_t = a]
$$

Reward: The reward function $R$ predicts the expected immediate reward triggered by an action. **Note:** The reward is sometimes defined as a function of the current state, $R(s)$, or of the (state, action, next state) tuple, $R(s, a, s')$. In these notes we mostly assume the reward is a function of the (state, action) pair, $R(s, a)$.

$$
R(s, a) = \mathbb{E} [R_{t+1} \vert S_t = s, A_t = a]
$$
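
As a minimal sketch, a tabular model can be stored as two nested lists. The 2-state, 2-action numbers and the `step` helper below are hypothetical, chosen only to illustrate the definitions of $P$ and $R$.

```python
import random

# Hypothetical 2-state, 2-action MDP, used only for illustration.
# P[s][a][s2] = P(s2 | s, a); every P[s][a] sums to 1.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.7, 0.3]],
]
# R[s][a] = expected immediate reward R(s, a).
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]

def step(s, a, rng):
    """Sample s' ~ P(. | s, a) and return (s', R(s, a))."""
    candidates = list(range(len(P[s][a])))
    s_next = rng.choices(candidates, weights=P[s][a])[0]
    return s_next, R[s][a]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
s_next, r = step(0, 1, rng)
```

Because the model is explicit, the agent can query `P` and `R` directly for planning instead of having to sample the real environment.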


## Policy
The policy $\pi$ is the agent's behavior function: it tells us which action to take in state $s$. It is a mapping from state $s$ to action $a$ and can be either deterministic or stochastic.

  1. Deterministic policy: $$\pi(s) = a$$
  1. Stochastic policy: $$\pi(a \vert s) = \mathbb{P}[A_t = a \vert S_t = s]$$
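
The two kinds of policy can be sketched as lookup tables. The tables `pi_det` and `pi_sto` and the helper names are assumptions made for this example.

```python
import random

# Deterministic policy: a lookup table from state to action, pi(s) = a.
pi_det = {0: 1, 1: 0}

# Stochastic policy: pi_sto[s][a] = pi(a | s); every row sums to 1.
pi_sto = {0: [0.5, 0.5], 1: [0.1, 0.9]}

def act_deterministic(s):
    """Return the single action prescribed in state s."""
    return pi_det[s]

def act_stochastic(s, rng):
    """Sample an action a ~ pi(. | s)."""
    probs = pi_sto[s]
    return rng.choices(list(range(len(probs))), weights=probs)[0]

rng = random.Random(0)
a_det = act_deterministic(0)    # always 1
a_sto = act_stochastic(1, rng)  # 0 with probability 0.1, 1 with probability 0.9
```

A deterministic policy is the special case of a stochastic one that puts probability 1 on a single action.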

## Value
The value function measures how good a state or an action is, that is, how rewarding it is expected to be, by predicting the future reward.

There are many ways to define the value function. Here we use the $\gamma$-discounted sum of rewards: the future reward, known as the **return**, is the total sum of discounted rewards going forward. The return $G_t$ starting from time $t$ is:
$$
G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

Properties of the discount factor:
- $\gamma \in [0, 1]$;
- Discounting provides mathematical convenience;
- With $\gamma < 1$ and bounded rewards the return is finite, so infinite loops in the state transition graph are not a problem.
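
For a finite reward sequence the return can be computed with a backward pass, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. This is a small sketch; the function name and the sample rewards are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t for a finite sequence R_{t+1}, R_{t+2}, ..., R_T."""
    g = 0.0
    # Walk backwards so each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]
g = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5

# With a constant reward of 1 and gamma = 0.9 the truncated return stays
# below the geometric-series limit 1 / (1 - gamma) = 10.
g_long = discounted_return([1.0] * 1000, gamma=0.9)
```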


The state value of a state $s$ is the expected return when starting from $S_t = s$ at time $t$ and following policy $\pi$:
$$
V^{\pi}(s) = \mathbb{E}^{\pi}[G_t \vert S_t = s]
$$

Similarly, the action value (Q-value) of a state-action pair is:
$$
Q^{\pi}(s, a) = \mathbb{E}^{\pi}[G_t \vert S_t = s, A_t = a]
$$


Additionally, under a particular policy $\pi$, we can recover the value function from the Q-values and the policy's probability distribution over actions:
$$
V^{\pi}(s) = \sum_{a \in \mathcal{A}} Q^{\pi}(s, a) \pi(a \vert s)
$$
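
As a quick numeric illustration of this identity, with made-up Q-values and action probabilities for a single state:

```python
# Hypothetical numbers for one state s with two actions.
q_s = [1.0, 3.0]     # Q(s, a) for a = 0, 1
pi_s = [0.25, 0.75]  # pi(a | s) for a = 0, 1

# V(s) is the policy-weighted average of the Q-values.
v_s = sum(p * q for p, q in zip(pi_s, q_s))  # 0.25 * 1 + 0.75 * 3 = 2.5
```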







# Markov Decision Processes

Markov assumption: 

- The future and the past are conditionally independent given the present.

A Markov decision process consists of five elements:
$$
\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle
$$

- $\mathcal{S}$ - state space;
- $\mathcal{A}$ - action space;
- $P$ - transition function;
- $R$ - reward function;
- $\gamma$ - discount factor.
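
One possible way to bundle the five elements of a finite MDP in code is sketched below. The `MDP` container, the `is_valid` helper and the toy numbers are hypothetical, and states and actions are assumed to be indexed from 0.

```python
from typing import List, NamedTuple

class MDP(NamedTuple):
    """Finite MDP with states and actions indexed from 0."""
    n_states: int               # size of the state space
    n_actions: int              # size of the action space
    P: List[List[List[float]]]  # P[s][a][s2] = P(s2 | s, a)
    R: List[List[float]]        # R[s][a] = R(s, a)
    gamma: float                # discount factor in [0, 1]

def is_valid(mdp: MDP, tol: float = 1e-9) -> bool:
    """Check table shapes, that each P[s][a] is a distribution, and gamma."""
    if not 0.0 <= mdp.gamma <= 1.0:
        return False
    if len(mdp.P) != mdp.n_states or len(mdp.R) != mdp.n_states:
        return False
    for s in range(mdp.n_states):
        if len(mdp.P[s]) != mdp.n_actions or len(mdp.R[s]) != mdp.n_actions:
            return False
        for a in range(mdp.n_actions):
            row = mdp.P[s][a]
            if len(row) != mdp.n_states or min(row) < 0.0:
                return False
            if abs(sum(row) - 1.0) > tol:
                return False
    return True

# Hypothetical toy MDP with 2 states and 2 actions.
toy = MDP(
    n_states=2,
    n_actions=2,
    P=[[[0.9, 0.1], [0.2, 0.8]], [[0.0, 1.0], [0.7, 0.3]]],
    R=[[0.0, 1.0], [2.0, 0.0]],
    gamma=0.9,
)
ok = is_valid(toy)
```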


# Bellman Equations

Bellman equations refer to a set of equations that decompose the value function into the immediate reward plus the discounted future values.

- Following a deterministic policy:

$$
\begin{aligned}
V^{\pi}(s) &= R(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} P(s' \vert s, \pi(s)) V^{\pi} (s') \\
Q^{\pi}(s, a) &= R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \vert s, a) V^{\pi} (s')
\end{aligned}
$$


- Following a stochastic policy:

$$
\begin{aligned}
V^{\pi}(s) &= \sum_{a \in \mathcal{A}} \pi(a \vert s) R(s, a) + \gamma \sum_{s' \in \mathcal{S}}\sum_{a \in \mathcal{A}} \pi(a \vert s) P(s' \vert s, a) V^{\pi} (s') \\
Q^{\pi}(s, a) &= R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \vert s, a) \sum_{a' \in \mathcal{A}} \pi(a' \vert s') Q^{\pi} (s', a')
\end{aligned}
$$
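
The decomposition follows from the recursive form of the return, $G_t = R_{t+1} + \gamma G_{t+1}$. As a short worked derivation for the stochastic-policy case, using only the definitions of $V^{\pi}$, $R$ and $P$ given earlier:

$$
\begin{aligned}
V^{\pi}(s) &= \mathbb{E}^{\pi}[G_t \vert S_t = s] \\
&= \mathbb{E}^{\pi}[R_{t+1} + \gamma G_{t+1} \vert S_t = s] \\
&= \mathbb{E}^{\pi}[R_{t+1} + \gamma V^{\pi}(S_{t+1}) \vert S_t = s] \\
&= \sum_{a \in \mathcal{A}} \pi(a \vert s) \Big( R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \vert s, a) V^{\pi}(s') \Big)
\end{aligned}
$$

The third line uses the law of total expectation together with the Markov assumption: given $S_{t+1} = s'$, the expected value of $G_{t+1}$ is $V^{\pi}(s')$ regardless of the earlier history. The deterministic case is obtained by putting all probability on $a = \pi(s)$.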

# Model-Based RL

When the model is known, the MDP can be solved with dynamic programming. The following assumes a deterministic policy.

## Policy Evaluation
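Policy evaluation computes the value $V^\pi$ of a given policy $\pi$ by applying the Bellman equation repeatedly, where $k$ indexes the iteration:
$$
V_{k+1}(s)
= \mathbb{E}_\pi [R_{t+1} + \gamma V_k(S_{t+1}) \vert S_t = s]
= \sum_{s', r} P(s', r \vert s, \pi (s)) (r + \gamma V_k(s'))
$$

Here $P(s', r \vert s, a)$ denotes the joint probability of the next state and the reward. In terms of the expected reward $R(s, a)$ and the transition function $P(s' \vert s, a)$ defined above, the right-hand side equals $R(s, \pi(s)) + \gamma \sum_{s'} P(s' \vert s, \pi(s)) V_k(s')$.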
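
The update above can be sketched as iterative policy evaluation on a small tabular model. The toy MDP, the policy `pi`, the tolerance and the function name are assumptions for illustration, and the loop assumes $\gamma < 1$ so that it converges.

```python
# Hypothetical 2-state, 2-action MDP: P[s][a][s2] and R[s][a].
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.7, 0.3]],
]
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]
gamma = 0.9
pi = [1, 0]  # deterministic policy: pi[s] is the action taken in state s

def policy_evaluation(P, R, pi, gamma, tol=1e-10):
    """Iterate V_{k+1}(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V_k(s')."""
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        V_new = [
            R[s][pi[s]]
            + gamma * sum(P[s][pi[s]][s2] * V[s2] for s2 in range(n_states))
            for s in range(n_states)
        ]
        # Stop once the largest change over all states is below the tolerance.
        delta = max(abs(V_new[s] - V[s]) for s in range(n_states))
        V = V_new
        if delta < tol:
            return V

V = policy_evaluation(P, R, pi, gamma)
```

For this toy policy, state 1 loops on itself with reward 2, so $V(1) = 2 / (1 - 0.9) = 20$, which gives a simple hand check of the result.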

## Policy Improvement
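Based on the value function, policy improvement generates a better policy $\pi' \geq \pi$ by acting greedily with respect to the action values, $\pi'(s) = \arg\max_{a \in \mathcal{A}} Q^\pi(s, a)$, where: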
$$
Q^\pi(s, a) 
= \mathbb{E} [R_{t+1} + \gamma V^\pi(S_{t+1}) \vert S_t=s, A_t=a]
= \sum_{s', r} P(s', r \vert s, a) (r + \gamma V^\pi(s'))
$$
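
A minimal sketch of one greedy improvement step: compute $Q^\pi$ from a value function and pick the best action in each state. The toy MDP and the value estimates `V` are hypothetical inputs, assumed to come from an earlier evaluation of some policy.

```python
# Hypothetical 2-state, 2-action MDP: P[s][a][s2] and R[s][a].
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.7, 0.3]],
]
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]
gamma = 0.9
V = [18.78, 20.0]  # assumed value estimates for the current policy

def q_from_v(P, R, V, gamma):
    """Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')."""
    n_states = len(P)
    return [
        [
            R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
            for a in range(len(P[s]))
        ]
        for s in range(n_states)
    ]

def greedy_policy(Q):
    """Pick argmax_a Q(s, a) in every state; ties go to the lowest index."""
    return [max(range(len(q_s)), key=lambda a: q_s[a]) for q_s in Q]

Q = q_from_v(P, R, V, gamma)
pi_new = greedy_policy(Q)  # [1, 0] for these numbers
```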

## Policy Iteration

Policy Iteration = Policy Evaluation + Policy Improvement, alternated until the policy no longer changes:

$$
\pi_0 \xrightarrow[]{\text{evaluation}} V^{\pi_0} \xrightarrow[]{\text{improve}}
\pi_1 \xrightarrow[]{\text{evaluation}} V^{\pi_1} \xrightarrow[]{\text{improve}}
\pi_2 \xrightarrow[]{\text{evaluation}} \dots \xrightarrow[]{\text{improve}}
\pi_* \xrightarrow[]{\text{evaluation}} V^*
$$

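Each improvement step yields a policy that is at least as good as the previous one, because the greedy policy $\pi'(s) = \arg\max_{a \in \mathcal{A}} Q^\pi(s, a)$ satisfies:

$$
\begin{aligned}
Q^\pi(s, \pi'(s))
&= Q^\pi(s, \arg\max_{a \in \mathcal{A}} Q^\pi(s, a)) \\
&= \max_{a \in \mathcal{A}} Q^\pi(s, a) \geq Q^\pi(s, \pi(s)) = V^\pi(s)
\end{aligned}
$$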
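
Putting the two steps together, the loop can be sketched as follows. This is an illustrative implementation on a hypothetical toy MDP, not a reference one: it assumes $\gamma < 1$, starts from the all-zeros policy, and stops when the greedy policy no longer changes.

```python
# Hypothetical 2-state, 2-action MDP: P[s][a][s2] and R[s][a].
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.7, 0.3]],
]
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]
gamma = 0.9

def backup(P, R, V, gamma, s, a):
    """One-step lookahead: R(s, a) + gamma * sum_s' P(s'|s, a) V(s')."""
    return R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(len(P)))

def evaluate(P, R, pi, gamma, tol=1e-10):
    """Iterative policy evaluation for a deterministic policy pi."""
    V = [0.0] * len(P)
    while True:
        V_new = [backup(P, R, V, gamma, s, pi[s]) for s in range(len(P))]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            return V

def improve(P, R, V, gamma):
    """Greedy policy with respect to Q computed from V."""
    return [
        max(range(len(P[s])), key=lambda a: backup(P, R, V, gamma, s, a))
        for s in range(len(P))
    ]

def policy_iteration(P, R, gamma):
    pi = [0] * len(P)  # arbitrary initial policy
    while True:
        V = evaluate(P, R, pi, gamma)
        pi_new = improve(P, R, V, gamma)
        if pi_new == pi:  # policy is stable, so it is greedy w.r.t. its own value
            return pi, V
        pi = pi_new

pi_star, V_star = policy_iteration(P, R, gamma)  # pi_star == [1, 0]
```

On this toy problem the loop stabilizes after two improvement steps. With approximate evaluation and exactly tied actions, a more careful stopping rule may be needed.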