<a href="https://colab.research.google.com/github/RLWH/reinforcement-learning-notebook/blob/master/7.%20Integrating%20Learning%20and%20Planning/Integrating_Learning_and_Planning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Integrating Learning and Planning

In previous lectures, we learned a policy directly from experience, as well as a value function directly from experience. Those methods are called model-free algorithms because they learn a value function or policy from real-world experience alone, without a model of the environment.

In contrast, a model-based RL algorithm takes a different approach: it learns a **model** directly from experience and uses **planning** to construct a value function or policy. The learning and planning activities can be integrated into a single architecture.

Advantages of Model-based RL
- Can efficiently learn a model by supervised learning methods
- Can reason about model uncertainty

Disadvantages
- The agent first learns a model, then constructs a value function from it
    - This introduces two sources of approximation error: one from the model and one from the value function

In particular, we will cover:
- Dyna-Q Algorithm
- Forward search
- Monte Carlo Tree Search

## What is a model?

A model is a representation of the environment that an agent can use to predict how the environment will respond to its actions.
Formally, a model $M$ is a representation of an MDP $\langle S, A, P, R \rangle$ that is parameterized by $\eta$.

Assuming that the state space $S$ and action space $A$ are known, a model $M=\langle P_{\eta}, R_{\eta} \rangle$ represents the state transitions $P_{\eta} \approx P$ and rewards $R_{\eta} \approx R$.

With a given $S_t, A_t$, a model produces
\begin{equation}
\begin{split}
S_{t+1} &\sim P_{\eta}(S_{t+1} | S_t, A_t) \\
R_{t+1} &\sim R_{\eta}(R_{t+1} | S_t, A_t)
\end{split}
\end{equation}

It is typically assumed that state transitions and rewards are conditionally independent given the current state and action:

\begin{equation}
\mathop{\mathbb{P}}[S_{t+1}, R_{t+1} | S_t, A_t] = \mathop{\mathbb{P}}[S_{t+1} | S_t, A_t] \, \mathop{\mathbb{P}}[R_{t+1} | S_t, A_t]
\end{equation}
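
As an illustration, here is a minimal sketch of one simple way to represent such a model: a count-based lookup table that stores the outcomes observed for each state-action pair and samples $S_{t+1}$ and $R_{t+1}$ independently. The class name `TableLookupModel`, its method names and the toy transitions are hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


class TableLookupModel:
    """Hypothetical count-based model: remembers outcomes per (state, action)."""

    def __init__(self, seed=0):
        self.next_states = defaultdict(list)  # (s, a) -> observed next states
        self.rewards = defaultdict(list)      # (s, a) -> observed rewards
        self.rng = random.Random(seed)

    def update(self, s, a, r, s_next):
        """Record one real transition (s, a, r, s_next)."""
        self.next_states[(s, a)].append(s_next)
        self.rewards[(s, a)].append(r)

    def sample(self, s, a):
        """Draw S' and R independently, following the factorised model."""
        s_next = self.rng.choice(self.next_states[(s, a)])
        r = self.rng.choice(self.rewards[(s, a)])
        return r, s_next

    def transition_prob(self, s, a, s_next):
        """Empirical estimate of P_eta(s_next | s, a)."""
        seen = self.next_states[(s, a)]
        return seen.count(s_next) / len(seen)

    def expected_reward(self, s, a):
        """Empirical mean reward for (s, a)."""
        seen = self.rewards[(s, a)]
        return sum(seen) / len(seen)


# Toy data (made up for illustration): four transitions from state "A".
model = TableLookupModel()
for r, s_next in [(0, "B"), (1, "B"), (1, "end"), (1, "B")]:
    model.update("A", "go", r, s_next)

print(model.transition_prob("A", "go", "B"))  # 0.75
print(model.expected_reward("A", "go"))       # 0.75
```

Calling `sample` on a state-action pair that was never observed raises an error in this sketch; a fuller implementation would need a policy for unseen pairs.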

## Planning with a Model
For a given model $M_{\eta} = \langle P_{\eta}, R_{\eta} \rangle$, solve the MDP $\langle S, A, P_{\eta}, R_{\eta} \rangle$ using a planning algorithm such as
- Value iteration
- Policy iteration
- Tree search
- ...
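
To make the first option concrete, the following is a minimal sketch of value iteration run on a model MDP. It assumes the model is available as plain dictionaries: `P[s][a]` maps next states to probabilities and `R[s][a]` is the expected reward. The function name, the data layout and the two-state toy model are assumptions made for this example, and the loop updates values in place rather than keeping a separate copy per sweep.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Plan in the model MDP <S, A, P_eta, R_eta>.

    P[s][a] is a dict {next_state: probability}; R[s][a] is the expected reward.
    """
    V = {s: 0.0 for s in P}

    def q_value(s, a):
        # One-step lookahead using the model.
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

    for _ in range(max_iters):
        delta = 0.0
        for s in P:
            best = max(q_value(s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update
        if delta < tol:
            break

    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(s, a)) for s in P}
    return V, policy


# Hypothetical two-state model: "go" from state 0 pays 1 and leads to an
# absorbing state 1 with zero reward.
P = {
    0: {"stay": {0: 1.0}, "go": {1: 1.0}},
    1: {"stay": {1: 1.0}, "go": {1: 1.0}},
}
R = {
    0: {"stay": 0.0, "go": 1.0},
    1: {"stay": 0.0, "go": 0.0},
}
V, policy = value_iteration(P, R)
print(V[0], policy[0])  # 1.0 go
```

The resulting policy is optimal for the model, which is only as good as $P_{\eta}$ and $R_{\eta}$ are close to the true $P$ and $R$.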

## A more concrete example - Sample-Based Planning

Sample-based planning uses the model only to generate samples.

1. Sample experience from the model
\begin{equation}
\begin{split}
S_{t+1} &\sim P_{\eta}(S_{t+1} | S_t, A_t) \\
R_{t+1} &\sim R_{\eta}(R_{t+1} | S_t, A_t)
\end{split}
\end{equation}

2. Apply model-free RL to the samples, for example
    - Monte-Carlo Control
    - SARSA
    - Q-learning

![Model Based RL](https://github.com/RLWH/reinforcement-learning-notebook/blob/master/images/model_based_rl.PNG?raw=true)
 
#### AB MDP Example
  ![AB MDP](https://github.com/RLWH/reinforcement-learning-notebook/blob/master/images/model_based_rl_example_ab.PNG?raw=true)
  
#### Drawbacks - Planning with an Inaccurate Model
Given an imperfect model $\langle P_{\eta}, R_{\eta} \rangle \neq \langle P, R \rangle$, the performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle S, A, P_{\eta}, R_{\eta} \rangle$. When the model is inaccurate, planning therefore computes a suboptimal policy.

What can we do?
1. When the model is wrong, use model-free RL
2. Reason explicitly about model uncertainty

#### Algorithm - One-step Q-planning
---
```
Loop forever:
        1. Select a state and an action at random
        2. Send S, A to a sample model, and obtain a sample next reward R, and a sample next state S'
        3. Apply one-step tabular Q-learning to S, A, R, S'
              Q(S, A) += alpha * (R + gamma * max_a Q(S', a) - Q(S, A))
```
---
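
The loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function `q_planning`, the callable `sample_model(s, a)` returning `(reward, next_state)`, the finite update budget `n_updates` (in place of "loop forever") and the two-state `toy_model` are all assumptions introduced for this example.

```python
import random
from collections import defaultdict


def q_planning(sample_model, states, actions, n_updates=5000,
               alpha=0.1, gamma=0.9, seed=0):
    """Random-sample one-step tabular Q-planning on a sample model."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(n_updates):
        # 1. Select a state and an action at random.
        s = rng.choice(states)
        a = rng.choice(actions)
        # 2. Query the sample model for a reward and next state.
        r, s_next = sample_model(s, a)
        # 3. One-step tabular Q-learning update on the simulated transition.
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q


def toy_model(s, a):
    """Hypothetical deterministic model: "go" from state 0 pays 1 and moves
    to state 1, which is absorbing with zero reward."""
    if s == 0 and a == "go":
        return 1.0, 1
    return 0.0, s


Q = q_planning(toy_model, states=[0, 1], actions=["stay", "go"])
print(Q[(0, "go")] > Q[(0, "stay")])  # True
```

No real environment is touched here: every update uses a transition generated by the model, so the quality of the learned values depends entirely on the quality of the model.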

## Integrating Learning and Planning
An agent can learn from two sources of experience:

1. **Real experience** - Sampled from the environment (true MDP)
\begin{equation}
\begin{split}
S' &\sim P^{a}_{s, s'}\\
R &= R^a_{s}
\end{split}
\end{equation}

2. **Simulated experience** - Sampled from model (approximate MDP)
\begin{equation}
\begin{split}
S' &\sim P_{\eta}(S' | S, A)\\
R &\sim R_{\eta}(R | S, A)
\end{split}
\end{equation}

## DYNA (Integrated Planning, Acting and Learning)
- Learn a model from real experience
- Learn and plan a value function (and/or policy) from real and simulated experience

#### DYNA cycle
![DYNA cycle](https://raw.githubusercontent.com/RLWH/reinforcement-learning-notebook/master/images/dyna_cycle.PNG)

#### Algorithm (DYNA-Q)
---
```
Initialise Q(S,A) and Model(S,A) for all S and A
Do forever:
        1. S = current state
        2. A = epsilon-greedy(S, Q)
        3. Execute A, observe R, S'
        4. Q(S,A) += alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]   <--- [one-step TD (Q-learning) update]
        5. Update model
              Model(S,A) <- R, S'
        6. Repeat n times (imagination):
               S <- random previously observed state
               A <- random action previously taken in S
               R, S' <- Model(S,A)
               Q(S,A) += alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]
```
---
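
A minimal Python sketch of the Dyna-Q loop above is given below. Several details are assumptions made for illustration and are not part of the pseudocode: the environment is passed in as a hypothetical callable `env_step(s, a)` returning `(reward, next_state, done)`, the `done` flag is used to stop bootstrapping at terminal states and to reset the episode, the loop runs for a fixed `n_steps`, and `corridor_step` is a made-up deterministic test environment. The model stores only the last observed outcome per state-action pair, which is adequate only for deterministic environments.

```python
import random
from collections import defaultdict


def dyna_q(env_step, start_state, actions, n_steps=2000, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q with a deterministic last-outcome model."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = defaultdict(dict)  # model[s][a] = (r, s_next, done)

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        # Break ties at random so the untrained agent still explores.
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def q_update(s, a, r, s_next, done):
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    s = start_state
    for _ in range(n_steps):
        # Steps 1-3: act epsilon-greedily in the real environment.
        a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
        r, s_next, done = env_step(s, a)
        # Step 4: direct RL update from real experience.
        q_update(s, a, r, s_next, done)
        # Step 5: model learning.
        model[s][a] = (r, s_next, done)
        # Step 6: planning from simulated experience.
        for _ in range(n_planning):
            ps = rng.choice(list(model))      # previously observed state
            pa = rng.choice(list(model[ps]))  # action previously taken there
            pr, ps_next, pdone = model[ps][pa]
            q_update(ps, pa, pr, ps_next, pdone)
        s = start_state if done else s_next
    return Q


def corridor_step(s, a, length=5):
    """Hypothetical 1-D corridor: reward 1 for reaching the right end."""
    s_next = min(max(s + a, 0), length - 1)
    done = s_next == length - 1
    return (1.0 if done else 0.0), s_next, done


Q = dyna_q(corridor_step, start_state=0, actions=[-1, 1])
print([max([-1, 1], key=lambda a: Q[(s, a)]) for s in range(4)])  # [1, 1, 1, 1]
```

Setting `n_planning=0` reduces this sketch to plain one-step Q-learning, which makes it easy to compare how much the simulated updates speed up learning.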