<div style="text-align: left">
    <img src='https://github.com/CLAIR-LAB-TECHNION/CLAI/blob/main/tutorials/assets/logo.png?raw=true' width=800/>  
</div>

Author: Itay Segev

E-mail: [itaysegev@campus.technion.ac.il](mailto:itaysegev@campus.technion.ac.il)



# <img src="https://img.icons8.com/?size=50&id=ZCDMsVkk9six&format=png&color=000000" style="height:50px;display:inline"> Model-based RL



<img src='https://news.ubc.ca/wp-content/uploads/2023/08/AdobeStock_559145847.jpeg' width=900/>


<a id="section:intro"></a>

# <img src="https://img.icons8.com/?size=50&id=55412&format=png&color=000000" style="height:50px;display:inline"> Introduction
---

The algorithms introduced in the previous tutorials are all model-free: they learn a policy or value function directly from experience, without a model of the environment. In this section, we will study a different class of algorithms called model-based. In contrast to model-free RL, **model-based methods use a model to build a policy**.

We will cover several key concepts, including the importance of learning a model, addressing the challenges of distributional shift, and the benefits of using short model-based rollouts. You will also explore the Dyna algorithm and its modern variants, and understand how model-based acceleration can enhance learning efficiency. Throughout this tutorial, you'll implement the main components of **Q-Dyna**, incorporating synthetic data generation, model training, and policy updates. These steps will help you build a more efficient and robust RL agent by leveraging both real and simulated experiences.





# <img src="https://img.icons8.com/?size=50&id=43171&format=png&color=000000" style="height:30px;display:inline"> Setup


You will need to make a copy of this notebook in your Google Drive before you can edit the notebook. You can do so with **File &rarr; Save a copy in Drive**.

In [None]:
#@title mount your Google Drive
import os
connect_drive = False #@param {type: "boolean"}
if connect_drive:
  from google.colab import drive
  drive.mount('/content/gdrive', force_remount=True)

  # set up mount symlink
  DRIVE_PATH = r'/content/gdrive/My\ Drive/cs236203_s24'  # raw string keeps the shell-escaped space
  DRIVE_PYTHON_PATH = DRIVE_PATH.replace('\\', '')
  if not os.path.exists(DRIVE_PYTHON_PATH):
    %mkdir $DRIVE_PATH

## the space in `My Drive` causes some issues,
## make a symlink to avoid this
SYM_PATH = '/content/cs236203_s24'
if not os.path.exists(SYM_PATH) and connect_drive:
  !ln -s $DRIVE_PATH $SYM_PATH




In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import convolve as conv




In [None]:
# @title Figure Settings
import logging
logging.getLogger('matplotlib.font_manager').disabled = True
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle")

In [None]:
#@title Plotting Functions

def plot_state_action_values(env, value, ax=None):
  """
  Generate plot showing value of each action at each state.
  """
  if ax is None:
    fig, ax = plt.subplots()

  for a in range(env.n_actions):
    ax.plot(range(env.n_states), value[:, a], marker='o', linestyle='--')
  ax.set(xlabel='States', ylabel='Values')
  ax.legend(['R','U','L','D'], loc='lower right')


def plot_quiver_max_action(env, value, ax=None):
  """
  Generate plot showing action of maximum value or maximum probability at
    each state (not for n-armed bandit or cheese_world).
  """
  if ax is None:
    fig, ax = plt.subplots()

  X = np.tile(np.arange(env.dim_x), [env.dim_y,1]) + 0.5
  Y = np.tile(np.arange(env.dim_y)[::-1][:,np.newaxis], [1,env.dim_x]) + 0.5
  which_max = np.reshape(value.argmax(axis=1), (env.dim_y,env.dim_x))
  which_max = which_max[::-1,:]
  U = np.zeros(X.shape)
  V = np.zeros(X.shape)
  U[which_max == 0] = 1
  V[which_max == 1] = 1
  U[which_max == 2] = -1
  V[which_max == 3] = -1

  ax.quiver(X, Y, U, V)
  ax.set(
      title='Maximum value/probability actions',
      xlim=[-0.5, env.dim_x+0.5],
      ylim=[-0.5, env.dim_y+0.5],
  )
  ax.set_xticks(np.linspace(0.5, env.dim_x-0.5, num=env.dim_x))
  ax.set_xticklabels(["%d" % x for x in np.arange(env.dim_x)])
  ax.set_xticks(np.arange(env.dim_x+1), minor=True)
  ax.set_yticks(np.linspace(0.5, env.dim_y-0.5, num=env.dim_y))
  ax.set_yticklabels(["%d" % y for y in np.arange(0, env.dim_y*env.dim_x, env.dim_x)])
  ax.set_yticks(np.arange(env.dim_y+1), minor=True)
  ax.grid(which='minor',linestyle='-')


def plot_heatmap_max_val(env, value, ax=None):
  """
  Generate heatmap showing maximum value at each state
  """
  if ax is None:
    fig, ax = plt.subplots()

  if value.ndim == 1:
      value_max = np.reshape(value, (env.dim_y,env.dim_x))
  else:
      value_max = np.reshape(value.max(axis=1), (env.dim_y,env.dim_x))
  value_max = value_max[::-1,:]

  im = ax.imshow(value_max, aspect='auto', interpolation='none', cmap='afmhot')
  ax.set(title='Maximum value per state')
  ax.set_xticks(np.linspace(0, env.dim_x-1, num=env.dim_x))
  ax.set_xticklabels(["%d" % x for x in np.arange(env.dim_x)])
  ax.set_yticks(np.linspace(0, env.dim_y-1, num=env.dim_y))
  if env.name != 'windy_cliff_grid':
      ax.set_yticklabels(
          ["%d" % y for y in np.arange(
              0, env.dim_y*env.dim_x, env.dim_x)][::-1])
  return im


def plot_rewards(n_episodes, rewards, average_range=10, ax=None):
  """
  Generate plot showing total reward accumulated in each episode.
  """
  if ax is None:
    fig, ax = plt.subplots()

  smoothed_rewards = (conv(rewards, np.ones(average_range), mode='same')
                      / average_range)

  ax.plot(range(0, n_episodes, average_range),
          smoothed_rewards[0:n_episodes:average_range],
          marker='o', linestyle='--')
  ax.set(xlabel='Episodes', ylabel='Total reward')


def plot_performance(env, value, reward_sums):
  fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(16, 12))
  plot_state_action_values(env, value, ax=axes[0,0])
  plot_quiver_max_action(env, value, ax=axes[0,1])
  plot_rewards(len(reward_sums), reward_sums, ax=axes[1,0])  # avoid relying on a global n_episodes
  im = plot_heatmap_max_val(env, value, ax=axes[1,1])
  fig.colorbar(im)

# <img src="https://img.icons8.com/?size=50&id=104313&format=png&color=000000" style="height:50px;display:inline"> Quentin's World Environment




In this tutorial, our RL agent will act in Quentin's World, a 10x10 grid world.

<img alt="QuentinsWorld" width="560" height="560" src="https://github.com/NeuromatchAcademy/course-content/blob/main/tutorials/static/W3D4_Tutorial4_QuentinsWorld.png?raw=true">

In this environment, there are 100 states and 4 possible actions: right, up, left, and down. The goal of the agent is to move, via a series of steps, from the start (green) location to the goal (yellow) region, while avoiding the red walls. More specifically:
* The agent starts in the green state,
* Moving into one of the red states incurs a reward of -1,
* Moving into a world border leaves the agent in the same place,
* Moving into the goal state (yellow square in the upper right corner) gives you a reward of 1, and
* Moving anywhere from the goal state ends the episode.

Now that we have our environment and task defined, how can we solve this using a model-based RL agent?

In [None]:
class world(object):
    def __init__(self):
        return

    def get_outcome(self, state, action):
        # Abstract method: subclasses must return (next_state, reward).
        raise NotImplementedError("Abstract method, not implemented")

    def get_all_outcomes(self):
        outcomes = {}
        for state in range(self.n_states):
            for action in range(self.n_actions):
                next_state, reward = self.get_outcome(state, action)
                outcomes[state, action] = [(1, next_state, reward)]
        return outcomes

In [None]:
class QuentinsWorld(world):
    """
    World: Quentin's world.
    100 states (10-by-10 grid world).
    The mapping from state to the grid is as follows:
    90 ...       99
    ...
    40 ...       49
    30 ...       39
    20 21 22 ... 29
    10 11 12 ... 19
    0  1  2  ...  9
    54 is the start state.
    Actions 0, 1, 2, 3 correspond to right, up, left, down.
    Moving anywhere from state 99 (goal state) will end the session.
    Landing in red states incurs a reward of -1.
    Landing in the goal state (99) gets a reward of 1.
    Going towards the border when already at the border will stay in the same
        place.
    """
    def __init__(self):
        self.name = "QuentinsWorld"
        self.n_states = 100
        self.n_actions = 4
        self.dim_x = 10
        self.dim_y = 10
        self.init_state = 54
        self.shortcut_state = 64

    def toggle_shortcut(self):
      if self.shortcut_state == 64:
        self.shortcut_state = 2
      else:
        self.shortcut_state = 64

    def get_outcome(self, state, action):
        if state == 99:  # goal state
            reward = 0
            next_state = None
            return next_state, reward
        reward = 0  # default reward value
        if action == 0:  # move right
            next_state = state + 1
            if state == 98:  # next state is goal state
                reward = 1
            elif state % 10 == 9:  # right border
                next_state = state
            elif state in [11, 21, 31, 41, 51, 61, 71,
                           12, 72,
                           73,
                           14, 74,
                           15, 25, 35, 45, 55, 65, 75]:  # next state is red
                reward = -1
        elif action == 1:  # move up
            next_state = state + 10
            if state == 89:  # next state is goal state
                reward = 1
            if state >= 90:  # top border
                next_state = state
            elif state in [2, 12, 22, 32, 42, 52, 62,
                           3, 63,
                           self.shortcut_state,
                           5, 65,
                           6, 16, 26, 36, 46, 56, 66]:  # next state is red
                reward = -1
        elif action == 2:  # move left
            next_state = state - 1
            if state % 10 == 0:  # left border
                next_state = state
            elif state in [17, 27, 37, 47, 57, 67, 77,
                           16, 76,
                           75,
                           14, 74,
                           13, 23, 33, 43, 53, 63, 73]:  # next state is red
                reward = -1
        elif action == 3:  # move down
            next_state = state - 10
            if state <= 9:  # bottom border
                next_state = state
            elif state in [22, 32, 42, 52, 62, 72, 82,
                           23, 83,
                           84,
                           25, 85,
                           26, 36, 46, 56, 66, 76, 86]:  # next state is red
                reward = -1
        else:
            print("Action must be between 0 and 3.")
            next_state = None
            reward = None
        return int(next_state) if next_state is not None else None, reward

# <img src="https://img.icons8.com/?size=50&id=55011&format=png&color=000000" style="height:50px;display:inline"> Why Learn a Model in Model-Based Reinforcement Learning?

Model-based reinforcement learning (RL) is a powerful approach that involves learning a model of the environment's dynamics and using this model to make decisions. This contrasts with model-free RL, where the agent learns to make decisions directly from interactions with the environment without an explicit model. Here, we will explore why learning a model can be advantageous and discuss a basic approach to implementing model-based RL.

**But what is a model?** A model (sometimes called a world model or internal model) is a representation of how the world will respond to the agent's actions. You can think of it as a representation of how the world works. With such a representation, the agent can simulate new experiences and learn from these simulations. This is advantageous for two reasons. First, acting in the real world can be costly and sometimes even dangerous. Learning from simulated experience can avoid some of these costs or risks. Second, simulations make fuller use of one's limited experience. To see why, imagine an agent interacting with the real world. The information acquired with each individual action can only be assimilated at the moment of the interaction. In contrast, the experiences simulated from a model can be simulated multiple times -- and whenever desired -- allowing for the information to be more fully assimilated.

### Advantages of Learning a Model

1. **Efficiency**: Model-based RL can be more sample-efficient than model-free RL. By learning a model of the environment, the agent can simulate interactions internally, reducing the need for extensive real-world interaction, which is often costly or time-consuming.

2. **Control and Planning**: With a model, the agent can plan its actions by predicting future states and rewards. This capability allows for more informed decision-making and the ability to foresee and avoid potential pitfalls.

3. **Flexibility**: Models enable the agent to adapt to changes in the environment by reusing the learned model for different tasks or objectives. This flexibility is particularly useful in dynamic or evolving environments.

### Basic Approach to Model-Based RL

The process of model-based RL involves two main steps: learning the model and using the model for control.



#### Learning the Model

- The model represents the dynamics of the environment. In the deterministic case, it can be expressed as a function $ f(s_t, a_t) = s_{t+1} $, where $ s_t $ and $ s_{t+1} $ are the states at time $ t $ and $ t+1 $, respectively, and $ a_t $ is the action taken at time $ t $.
- In a stochastic environment, the model might represent a probability distribution over the next state, $ P(s_{t+1} \mid s_t, a_t) $.
- The model is typically learned using supervised learning techniques from a dataset of transitions collected by the agent.
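In a small discrete environment, "supervised learning" of the model can be as simple as counting. The following is a minimal sketch, assuming transitions are stored as `(s, a, r, s_next)` tuples; the function name `fit_tabular_model` and the sample data are hypothetical and are not part of this tutorial's code.

```python
from collections import Counter, defaultdict

def fit_tabular_model(transitions):
    """Estimate P(s' | s, a) and the mean reward from (s, a, r, s') tuples by counting."""
    next_counts = defaultdict(Counter)  # (s, a) -> Counter of observed next states
    reward_sums = defaultdict(float)    # (s, a) -> sum of observed rewards
    for s, a, r, s_next in transitions:
        next_counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r

    model = {}
    for sa, counts in next_counts.items():
        n = sum(counts.values())
        probs = {s_next: c / n for s_next, c in counts.items()}
        model[sa] = (probs, reward_sums[sa] / n)
    return model

# Hypothetical transitions: action 1 from state 0 reaches state 1 three times out of four.
transitions = [
    (0, 1, 0.0, 1),
    (0, 1, 0.0, 1),
    (0, 1, 0.0, 0),
    (0, 1, 0.0, 1),
    (1, 1, 1.0, 2),
]
model = fit_tabular_model(transitions)
probs, mean_reward = model[(0, 1)]
print(probs)  # {1: 0.75, 0: 0.25}
```

With continuous states, the same idea is usually implemented by regressing $s_{t+1}$ on $(s_t, a_t)$ with a function approximator instead of a count table.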




#### Using the Model for Control

- Once the model is learned, it can be used to plan actions. The agent uses the model to simulate future states and evaluate the potential outcomes of different actions, allowing it to select the action with the best predicted outcome.
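One way to plan with a learned model is value iteration over the model's predictions. The sketch below assumes a deterministic model stored as `model[(s, a)] = (next_state, reward)`, with `next_state = None` marking the end of an episode (the same convention `QuentinsWorld.get_outcome` uses at the goal). The 4-state chain and the name `plan_with_model` are hypothetical.

```python
import numpy as np

def plan_with_model(model, n_states, n_actions, gamma=0.9, n_sweeps=50):
    """Value iteration on a deterministic model: model[(s, a)] = (next_state, reward)."""
    q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        for (s, a), (s_next, r) in model.items():
            # No future value after a terminal transition.
            future = 0.0 if s_next is None else gamma * q[s_next].max()
            q[s, a] = r + future
    return q

# Hypothetical 4-state chain: action 0 moves left, action 1 moves right,
# entering state 3 gives reward 1, and any action from state 3 ends the episode.
model = {}
for s in range(3):
    model[(s, 0)] = (max(s - 1, 0), 0.0)
    model[(s, 1)] = (s + 1, 1.0 if s + 1 == 3 else 0.0)
model[(3, 0)] = (None, 0.0)
model[(3, 1)] = (None, 0.0)

q = plan_with_model(model, n_states=4, n_actions=2)
greedy = q.argmax(axis=1)
print(greedy[:3])  # [1 1 1]
```

The greedy policy is only as good as the model: wherever the model is wrong, the planned action can be wrong too.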

### Example: A Simple Model-Based RL Algorithm

Here’s a prototype of a basic model-based RL algorithm, referred to as version 0.5:

<img src='https://github.com/CLAIR-LAB-TECHNION/CLAI/blob/main/tutorials/assets/MB_model-based-0.5.png?raw=true' width=900/>

### Challenges and Considerations

1. **Distributional Shift**: One of the main challenges in model-based RL is distributional shift: the model is trained on data from one distribution of states and actions but is later used to predict outcomes under a potentially different distribution. This can lead to inaccurate predictions.

2. **Uncertainty**: Accurately estimating uncertainty in the model’s predictions can significantly improve the performance of model-based RL algorithms. By considering the uncertainty, the agent can avoid over-reliance on potentially inaccurate predictions.



# <img src="https://img.icons8.com/?size=50&id=yg0Xl3Bazd07&format=png&color=000000" style="height:50px;display:inline"> Distributional Shift in Model-Based RL



One of the significant challenges in model-based reinforcement learning (RL) is handling distributional shift. This problem arises when the distribution of states encountered during policy execution differs from the distribution of states used to train the model. Let's delve into why this occurs and how it affects the performance of model-based RL algorithms.


## Understanding Distributional Shift

To understand distributional shift, consider a simple example of an agent trying to reach the top of a mountain. The agent starts by running a base policy ($\pi_0$), such as a random policy, to collect data. This initial policy explores the environment and generates a dataset of state transitions. The agent then uses this dataset to learn a dynamics model $ f(s_t, a_t) = s_{t+1} $.

With this learned model, the agent plans its actions to maximize its altitude, expecting that moving to the right will lead it to higher ground, based on the initial random data. However, the agent might end up falling off the mountain because it hasn't seen all possible states and transitions during its random walk. This scenario illustrates the core issue of distributional shift.




### Why Distributional Shift Occurs

1. **Training Data Distribution**: The model is trained on data collected under the initial policy $\pi_0$. The state distribution under this policy is denoted as $ P_{\pi_0}(s_t) $. This distribution represents the states the agent encountered while following $\pi_0$.

2. **Planning with the Model**: When the agent uses the learned model to plan its actions, it effectively follows a new policy, denoted as $\pi_f$, which is induced by the model. The state distribution under this new policy is $ P_{\pi_f}(s_t) $.

The key issue is that $ P_{\pi_f}(s_t) $ is generally not the same as $ P_{\pi_0}(s_t) $. This discrepancy means that the model, which is accurate for the states seen under $\pi_0$, may make poor predictions for the states encountered under $\pi_f$. This leads to erroneous predictions and suboptimal actions, exacerbating the problem as the agent continues to plan and execute actions based on these flawed predictions.

## Example of Distributional Shift

Consider the agent's goal of climbing a mountain:

1. **Data Collection**: The agent performs a random walk ($\pi_0$) and collects data, learning that moving to the right generally increases altitude.
2. **Model Learning**: The agent uses this data to train a model $ f $ that predicts higher altitudes to the right.
3. **Planning**: Using $ f $, the agent plans to keep moving right to reach higher altitudes.
4. **Execution**: The agent follows $\pi_f$ (the policy induced by $ f $), but this policy takes the agent to regions not covered by the initial random walk, leading to a fall off the mountain.

This example highlights how the model's predictions can fail outside the distribution of the training data, causing significant performance issues.
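The mountain example can be made concrete with a one-dimensional toy problem. This is a sketch under stated assumptions: the altitude profile, the cliff at `x = 6`, and the linear model are all hypothetical choices made for illustration.

```python
import numpy as np

def true_altitude(x):
    """Hypothetical mountain: altitude rises until x = 6, then drops off a cliff."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 6.0, x, -10.0)

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 5.0, size=200)  # the random walk only covers x in [0, 5]
y_train = true_altitude(x_train)

# Fit a linear model: altitude ~ w * x + b.
w, b = np.polyfit(x_train, y_train, deg=1)

def predicted_altitude(x):
    return w * x + b

for x in [2.0, 8.0]:
    print(x, float(predicted_altitude(x)), float(true_altitude(x)))
```

Inside the training range the prediction matches the true altitude, while at `x = 8.0` the model still predicts "higher is to the right" even though the agent has already fallen off the cliff.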

## Why It Becomes a Problem with High-Capacity Models

1. **Expressive Models**: High-capacity models like deep neural networks can fit the training data very tightly. While this can improve performance on the training distribution, it makes the models more sensitive to distributional shifts.
2. **Overfitting**: These models can overfit to the specific distribution of states seen during training, leading to poor generalization to new states encountered during planning.

In contrast, simpler models with fewer parameters (e.g., fitting a few coefficients in a known physics model) are less prone to overfitting because they have less flexibility. This is why system identification often works well in robotics: the model structure is fixed by known physics and only a small number of parameters are estimated, so predictions tend to remain reasonable even in states that the training data did not cover.


## Addressing Distributional Shift

To mitigate the effects of distributional shift, several strategies can be employed:

1. **Regularization**: Apply regularization techniques to prevent the model from overfitting to the training data.
2. **Data Augmentation**: Collect more diverse data to better cover the state space.
3. **Model Uncertainty**: Incorporate model uncertainty into planning, using methods such as ensemble models or Bayesian approaches to account for the model's confidence in its predictions.
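As a rough illustration of the ensemble idea, the sketch below fits several polynomial models on bootstrap resamples of the same data and uses their disagreement as an uncertainty estimate. The data, the cubic model class, and the function names are assumptions chosen for brevity, not a recommended design.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 5.0, size=100)                # data only covers [0, 5]
y_train = np.sin(x_train) + 0.1 * rng.normal(size=100)   # hypothetical noisy targets

def fit_ensemble(x, y, n_members=10, degree=3):
    """Fit one polynomial per bootstrap resample of the data."""
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
        members.append(np.polyfit(x[idx], y[idx], deg=degree))
    return members

def predict_with_uncertainty(members, x):
    """Return the ensemble mean and the standard deviation across members."""
    preds = np.array([np.polyval(coeffs, x) for coeffs in members])
    return preds.mean(axis=0), preds.std(axis=0)

members = fit_ensemble(x_train, y_train)
_, std_in = predict_with_uncertainty(members, 2.5)    # inside the training range
_, std_out = predict_with_uncertainty(members, 10.0)  # far outside it
print(std_in, std_out)
```

The members agree where they have seen data and disagree where they extrapolate, so a planner can penalize or avoid actions that lead to high-disagreement states.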


## <img src="https://img.icons8.com/?size=50&id=55162&format=png&color=000000" style="height:50px;display:inline"> Improving Model-Based Reinforcement Learning



The distributional shift problem discussed earlier is a significant challenge for model-based reinforcement learning (RL). However, there are strategies to mitigate this issue and improve the performance of model-based RL algorithms. Here, we will discuss how to enhance the basic model-based RL algorithm to address distributional shift and introduce more robust methods like Model Predictive Control (MPC).



### Improving the Basic Model-Based RL Algorithm

To address distributional shift, we can enhance the basic algorithm by iteratively collecting data, training the model, and planning actions. This approach is conceptually similar to the DAgger (Dataset Aggregation) method used in imitation learning, where we iteratively collect new data to match the state distribution of a learned policy.

This iterative loop helps the model adapt to new states encountered during execution, reducing the impact of distributional shift. However, while this method conceptually mitigates distributional shift, it still has limitations in practice, such as requiring extensive data collection and retraining.

To further improve, we can adopt **Model Predictive Control (MPC)**, a more advanced approach that involves frequent re-planning to correct mistakes as soon as they occur. This method, referred to as Model-Based RL Version 1.5, enhances the robustness of the algorithm by planning actions at each time step based on the latest state information.

<img src='https://github.com/CLAIR-LAB-TECHNION/CLAI/blob/main/tutorials/assets/MB_model-based-1.5.png?raw=true' width=900/>

### Key Advantages of MPC:

1. **Immediate Correction**: By re-planning at each time step, the agent can immediately correct any mistakes, making the system more robust to model errors.
2. **Reduced Dependency on Perfect Models**: Since the agent frequently re-plans, it can handle less accurate models better than the naive approach.
3. **Shorter Planning Horizons**: Frequent re-planning allows the use of shorter planning horizons, reducing computational complexity and enabling faster adaptation.

#### Example: Driving a Car

Consider a scenario where the agent is driving a car. The initial model predicts that steering slightly to the left will keep the car going straight. However, in reality, this causes the car to veer left. With MPC, as soon as the agent observes the deviation, it re-plans to correct the steering, ensuring the car stays on course.
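A simplified analog of the driving example can be sketched in a few lines. Here the learned model is assumed to miss a constant sideways drift; the dynamics, the action set, and the exhaustive planner are hypothetical and only meant to contrast executing one plan with re-planning at every step.

```python
import itertools

ACTIONS = [-0.1, 0.0, 0.1]  # hypothetical steering corrections
DRIFT = 0.1                 # real-world bias that the learned model does not know about

def model_step(x, a):
    """Learned (imperfect) model: believes steering acts without drift."""
    return x + a

def real_step(x, a):
    """True dynamics: the car also drifts sideways by DRIFT every step."""
    return x + a + DRIFT

def plan(x, horizon):
    """Exhaustive search for the action sequence with the lowest predicted cost."""
    def predicted_cost(seq):
        cost, x_pred = 0.0, x
        for a in seq:
            x_pred = model_step(x_pred, a)
            cost += x_pred ** 2  # penalize distance from the lane center (x = 0)
        return cost
    return min(itertools.product(ACTIONS, repeat=horizon), key=predicted_cost)

def run_open_loop(x0, n_steps):
    """Plan once, then execute the whole sequence without looking again."""
    x = x0
    for a in plan(x0, n_steps):
        x = real_step(x, a)
    return x

def run_mpc(x0, n_steps, horizon=3):
    """Re-plan from the observed state at every step and execute only the first action."""
    x = x0
    for _ in range(n_steps):
        a = plan(x, horizon)[0]
        x = real_step(x, a)
    return x

x_open = run_open_loop(0.0, n_steps=6)
x_mpc = run_mpc(0.0, n_steps=6)
print(x_open, x_mpc)
```

The open-loop run accumulates the full drift, while the MPC run stays within one drift step of the lane center because each re-plan starts from the state that was actually observed.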


## Limitations of Model Predictive Control (MPC) and Open-Loop Control

### Open-Loop Control

Open-loop control refers to planning and committing to a fixed sequence of actions without adjusting based on new observations. The control strategy is predetermined and does not adapt to the changing state of the environment.

#### Suboptimality of Open-Loop Control

1. **Lack of Adaptability**: Since the actions are predetermined, the strategy cannot respond to new information or changes in the environment. This can lead to suboptimal decisions when unexpected situations arise.
2. **Illustrative Example**: Consider an agent faced with a math test. If the agent must decide whether to take the test and answer it without seeing the actual test questions, it has to commit to an action without sufficient information. The optimal strategy would involve first observing the test questions and then deciding how to proceed. An open-loop strategy would fail in such scenarios.




### Specific Open-Loop Planning Methods


#### Random Shooting

- **Method**: Generate a large number of random action sequences, simulate the outcomes using the model, and select the sequence that yields the highest expected reward.
- **Limitations**: While simple, random shooting can be computationally expensive and inefficient, especially in high-dimensional action spaces. It also does not account for future state observations, making it inherently open-loop.
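A minimal sketch of random shooting, assuming a known one-dimensional model $x_{t+1} = x_t + a_t$ and a reward that penalizes squared distance to a goal. The model, the reward, and all parameter values are hypothetical.

```python
import numpy as np

def model_step(x, a):
    """Hypothetical model: the action is added to the state."""
    return x + a

def predicted_return(x0, actions, goal=3.0):
    """Sum of rewards predicted by the model for one action sequence."""
    x, total = x0, 0.0
    for a in actions:
        x = model_step(x, a)
        total += -(x - goal) ** 2
    return total

def random_shooting(x0, horizon=5, n_samples=1000, seed=0):
    """Sample action sequences uniformly and keep the one with the best predicted return."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    returns = np.array([predicted_return(x0, seq) for seq in candidates])
    best = returns.argmax()
    return candidates[best], returns[best]

best_seq, best_return = random_shooting(x0=0.0)
baseline = predicted_return(0.0, np.zeros(5))  # predicted return of doing nothing
print(best_seq, best_return, baseline)
```

The number of samples needed to find a good sequence grows quickly with the horizon and the action dimension, which is the inefficiency noted above.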



#### Cross-Entropy Method

- **Method**: This is an optimization technique where multiple action sequences are sampled, and the top-performing sequences are used to update the distribution from which new sequences are sampled. The process is repeated until convergence.
- **Limitations**: Although more efficient than random shooting, the cross-entropy method still plans in an open-loop manner, optimizing a fixed sequence of actions without adapting to new observations during execution.
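The cross-entropy method can be sketched on the same kind of toy problem. This version assumes a diagonal Gaussian sampling distribution, actions clipped to $[-1, 1]$, and a fixed number of iterations instead of a convergence check; the model, reward, and hyperparameters are hypothetical.

```python
import numpy as np

def predicted_return(x0, actions, goal=3.0):
    """Predicted return under the hypothetical model x' = x + a."""
    x, total = x0, 0.0
    for a in actions:
        x = x + a
        total += -(x - goal) ** 2
    return total

def cross_entropy_method(x0, horizon=5, n_samples=200, n_elites=20,
                         n_iters=10, seed=0):
    """Refit a Gaussian over action sequences to the top-performing samples."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(horizon)
    std = np.ones(horizon)
    for _ in range(n_iters):
        samples = rng.normal(mean, std, size=(n_samples, horizon))
        samples = np.clip(samples, -1.0, 1.0)            # respect action limits
        returns = np.array([predicted_return(x0, seq) for seq in samples])
        elites = samples[np.argsort(returns)[-n_elites:]]  # best n_elites sequences
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6                  # keep a little exploration
    return mean

plan = cross_entropy_method(x0=0.0)
print(plan, predicted_return(0.0, plan))
```

Compared with random shooting, sampling concentrates around sequences that already scored well, so fewer samples are wasted on clearly poor plans.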

Both of these methods optimize for the expected reward given a sequence of actions but do not adjust based on real-time state changes. This lack of adaptability is a fundamental limitation, particularly in dynamic environments where the optimal action depends on the latest state information.

## <img src="https://img.icons8.com/?size=50&id=oNOWJS4XHflp&format=png&color=000000" style="height:50px;display:inline"> Transitioning to Closed-Loop Control


To overcome the limitations of open-loop control, we need to shift to closed-loop control, where the agent continuously observes the state and chooses its actions accordingly. Closed-loop control strategies, unlike open-loop strategies, adapt to new information in real time, leading to better decision-making.

### Closed-Loop Control

In closed-loop control, the agent’s actions are determined by a policy that takes the current state as input and outputs the next action. This allows the agent to adjust its actions based on the latest observations, making it more responsive and adaptable to changes in the environment.




By transitioning from open-loop to closed-loop control, model-based RL can achieve more adaptive and effective policies. This approach aligns closely with the original reinforcement learning problem, where the objective is to develop a policy that optimizes actions for any given state. Leveraging learned models provides a powerful tool for enhancing policy learning and achieving better performance in complex environments.

### Advantages of Closed-Loop Control

1. **Real-Time Adaptation**: The agent can adapt its actions based on the current state, allowing it to handle unexpected situations more effectively.
2. **Improved Performance**: By continuously updating the policy, the agent can make more informed decisions, leading to better overall performance compared to open-loop control.



### Key Concepts in Closed-Loop Control

1. **Policy-Based Methods**: Instead of planning a fixed sequence of actions, the agent develops a policy that dictates actions based on the current state.
2. **Model Integration**: The agent models the environment’s dynamics explicitly, which aids in policy learning and decision-making.
3. **Global Policies**: Using highly expressive function approximators like neural networks, the agent can learn global policies that perform well across various states.



### Example: Math Test Problem Resolved

In a closed-loop control scenario, the agent would first observe the test before deciding to answer. This adaptive strategy ensures the agent can choose the optimal action based on the current state, leading to better overall performance.
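The gap between the two strategies can be quantified with a toy version of the test. The two-question test below is hypothetical, and each question is assumed to be equally likely.

```python
# Hypothetical test: one of these questions is drawn with equal probability.
TEST = {"1 + 1": 2, "2 + 2": 4}

def expected_reward_open_loop(answer):
    """Open loop: commit to one answer before seeing the question."""
    return sum(answer == correct for correct in TEST.values()) / len(TEST)

def expected_reward_closed_loop(policy):
    """Closed loop: the policy maps the observed question to an answer."""
    return sum(policy(q) == correct for q, correct in TEST.items()) / len(TEST)

def add_policy(question):
    """A policy that reads the question and adds the two numbers."""
    return sum(int(token) for token in question.split(" + "))

best_open_loop = max(expected_reward_open_loop(a) for a in TEST.values())
closed_loop = expected_reward_closed_loop(add_policy)
print(best_open_loop, closed_loop)  # 0.5 1.0
```

No fixed answer can do better than chance on which question appears, while a policy that conditions on the observation answers every question correctly.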

## Model-Based RL with Backpropagation

In model-based reinforcement learning (RL), we aim to learn policies that maximize reward by leveraging models of the environment's dynamics. A natural idea is to apply the tools of deep learning, such as backpropagation and gradient descent, to optimize policies. This involves setting up a computation graph that allows us to compute the total reward for a given policy and use gradient-based methods to optimize this policy.

## Setting Up the Computation Graph

To maximize the total reward, we set up a computation graph with the following components:

1. **Policies**: Functions that take states as input and produce actions.
2. **Dynamics**: Functions that take states and actions as input and produce the next state.
3. **Rewards**: Functions that take states and actions as input and produce scalar reward values.


<img src='https://github.com/CLAIR-LAB-TECHNION/CLAI/blob/main/tutorials/assets/MB_backprop.png?raw=true' width=900/>

This computation graph allows us to compute the sum of rewards for a given policy. By leveraging automatic differentiation software, we can compute gradients and perform gradient ascent to optimize the policy parameters.
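To show what differentiating through the graph means, here is a hand-written backward pass for a deliberately tiny case. It assumes a one-parameter linear policy $a_t = -k\,s_t$, dynamics $s_{t+1} = s_t + a_t$, and reward $-s_{t+1}^2$; in practice an automatic differentiation library would compute the same gradient.

```python
import numpy as np

def rollout(k, s0, horizon):
    """Roll out the policy a = -k * s through s' = s + a with reward -s'^2."""
    states = [s0]
    for _ in range(horizon):
        a = -k * states[-1]
        states.append(states[-1] + a)
    total_reward = -sum(s ** 2 for s in states[1:])
    return states, total_reward

def return_gradient(k, s0, horizon):
    """Backpropagate d(total reward)/dk through the rollout by hand."""
    states, _ = rollout(k, s0, horizon)
    grad_k, grad_s = 0.0, 0.0
    for t in reversed(range(horizon)):
        # Total derivative of the return with respect to states[t + 1]:
        # its own reward term plus its effect on all later states.
        grad_s = -2.0 * states[t + 1] + (1.0 - k) * grad_s
        # states[t + 1] = (1 - k) * states[t], so d states[t + 1] / dk = -states[t].
        grad_k += grad_s * (-states[t])
    return grad_k

k_init = 0.2
k = k_init
for _ in range(100):
    k += 0.01 * return_gradient(k, s0=1.0, horizon=5)  # gradient ascent on the return

print(rollout(k_init, 1.0, 5)[1], rollout(k, 1.0, 5)[1])
```

Note the factor `(1.0 - k)` that multiplies `grad_s` at every step of the backward pass: it is the one-dimensional version of the Jacobian products that cause the conditioning problems with long horizons.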


### Why This Approach Might Not Work

While setting up the computation graph and applying backpropagation seems straightforward, this approach often fails in practice due to several reasons:



#### Temporal Structure and Compounding Effects

- In trajectory optimization, actions taken earlier in a trajectory have compounding effects on later states and rewards. This results in large gradients for early actions and smaller gradients for later actions.
- This situation leads to an ill-conditioned optimization problem, where some parameters receive very large gradient updates while others receive very small updates.



#### Parameter Sensitivity

- Small changes in actions at the beginning of a trajectory can lead to significant changes in the trajectory's outcome. This sensitivity makes it challenging to optimize policies effectively.



#### Vanishing and Exploding Gradients

- The problems faced when optimizing policies through backpropagation are similar to those encountered in training Recurrent Neural Networks (RNNs) naively. The derivatives of later rewards with respect to earlier policy parameters involve the product of many Jacobians.
- If these Jacobians have eigenvalues whose magnitudes differ significantly from one, the gradients can either explode (magnitudes larger than one) or vanish (magnitudes smaller than one).
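The effect of repeated Jacobian products is easy to check numerically. The two diagonal matrices below are hypothetical stand-ins for the per-step Jacobians, assumed identical at every step for simplicity.

```python
import numpy as np

def gradient_scale(jacobian, horizon):
    """Spectral norm of the product of `horizon` identical Jacobians."""
    return np.linalg.norm(np.linalg.matrix_power(jacobian, horizon), ord=2)

contracting = np.diag([0.9, 0.8])  # all eigenvalue magnitudes below one
expanding = np.diag([1.1, 0.8])    # one eigenvalue magnitude above one

for horizon in [1, 10, 50]:
    print(horizon,
          gradient_scale(contracting, horizon),
          gradient_scale(expanding, horizon))
```

After 50 steps the contracting product has shrunk by more than two orders of magnitude and the expanding one has grown by more than two, even though each individual Jacobian is within 10% of having unit scale.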

## Solution: Leveraging Model-Based Acceleration for Policy Learning

Given the challenges associated with backpropagation in model-based RL, a practical solution involves using the learned model to generate synthetic samples to accelerate model-free RL algorithms. This approach may initially seem counterintuitive, as it essentially treats the learned model as a simulator rather than directly exploiting its known derivatives. However, this method, known as model-based acceleration, can significantly enhance the efficiency and effectiveness of model-free RL training.


# <img src="https://img.icons8.com/?size=50&id=46554&format=png&color=000000" style="height:50px;display:inline"> Model-Free Learning with a Model


Building on the previous discussion, we now introduce a refined approach to model-based reinforcement learning (RL), referred to as Model-Based RL Version 2.5. This method combines elements of model-free learning with the benefits of having a learned model, addressing some of the challenges encountered with direct backpropagation through the model.

<img src='https://github.com/CLAIR-LAB-TECHNION/CLAI/blob/main/tutorials/assets/MB_model-based-2.5.png?raw=true' width=900/>

Model-Based RL Version 2.5 aims to improve the policy by using a learned dynamics model to generate synthetic experiences. Instead of relying on backpropagation through the model, this approach uses policy gradient methods, thereby mitigating some of the difficulties associated with optimizing policies directly via backpropagation.

## Steps Involved

### Data Collection

- Run an initial policy to collect a dataset of state transitions.

### Model Learning

- Train a dynamics model using the collected data.

### Trajectory Sampling

- Use the learned dynamics model to generate a large number of synthetic trajectories with the current policy.

### Policy Improvement

- Apply policy gradient methods to the sampled trajectories to improve the policy. Techniques such as actor-critic methods can be used to enhance this step.

### Iteration

- Repeat the trajectory sampling and policy improvement steps several times, using the model to simulate new trajectories without generating additional real data.

### Data Augmentation

- Once the policy has improved sufficiently, run it in the real environment to collect more data. Append this new data to the dataset and retrain the dynamics model with the expanded dataset.

This iterative process helps refine the policy using both real and synthetic data, enhancing the agent’s performance over time.
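The overall loop can be sketched on a tiny chain environment. Two simplifications are assumed here and differ from the steps above: the policy improvement step uses tabular Q-learning instead of a policy gradient method (closer in spirit to Dyna), and the "learned model" simply memorizes observed transitions. The environment, names, and hyperparameters are hypothetical.

```python
import random

N_STATES, GOAL = 5, 4  # hypothetical chain: states 0..4, goal at state 4
ACTIONS = [0, 1]       # 0 = left, 1 = right

def real_step(s, a):
    """Real environment: deterministic chain with reward 1 for reaching GOAL."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == GOAL else 0.0)

def q_update(q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One Q-learning update; no future value after reaching the goal."""
    future = 0.0 if s_next == GOAL else gamma * max(q[s_next])
    q[s][a] += alpha * (r + future - q[s][a])

def epsilon_greedy(q, s, rng, epsilon=0.2):
    if rng.random() < epsilon or q[s][0] == q[s][1]:
        return rng.choice(ACTIONS)  # explore, or break ties at random
    return 0 if q[s][0] > q[s][1] else 1

def train(n_episodes=30, n_planning=20, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    model = {}  # learned model: (s, a) -> (s_next, r)
    for _ in range(n_episodes):
        s = 0
        for _ in range(max_steps):
            a = epsilon_greedy(q, s, rng)
            s_next, r = real_step(s, a)      # collect real data
            q_update(q, s, a, r, s_next)
            model[(s, a)] = (s_next, r)      # update the model with the new data
            for _ in range(n_planning):      # improve the policy on synthetic data
                (ps, pa), (ps_next, pr) = rng.choice(list(model.items()))
                q_update(q, ps, pa, pr, ps_next)
            s = s_next
            if s == GOAL:
                break
    return q

q = train()
greedy = [0 if q[s][0] > q[s][1] else 1 for s in range(GOAL)]
print(greedy)
```

Each real step is followed by many updates on model-generated transitions, which is where the data efficiency comes from: the agent reuses its limited real experience many times.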




## Benefits and Challenges

### Benefits

1. **Avoiding Backpropagation Issues**: By using policy gradient methods instead of direct backpropagation through the model, this approach avoids problems like vanishing and exploding gradients, making the optimization process more stable.
2. **Data Efficiency**: The use of synthetic trajectories generated by the learned model allows for more efficient use of data, reducing the need for extensive real-world interactions.



### Challenges

1. **Model Accuracy**: The effectiveness of this approach heavily depends on the accuracy of the learned dynamics model. If the model's predictions are inaccurate, the synthetic trajectories might not represent the real environment accurately, leading to suboptimal policy updates.
2. **Exploration**: Since the model generates synthetic data based on the current policy, it might not explore the state space as thoroughly as needed, potentially missing critical areas that require real-world data collection.

## <img src="https://img.icons8.com/?size=50&id=FXi8HsuaMBHf&format=png&color=000000" style="height:50px;display:inline"> The Curse of Long Model-Based Rollouts


One of the significant challenges in model-based reinforcement learning (RL) is dealing with the inaccuracies that arise from using learned models over long horizons. This issue is commonly referred to as the "curse of long model-based rollouts." To understand this problem, let's revisit some key concepts and explore how errors accumulate during extended model-based rollouts.




### Imitation Learning and Distributional Shift

In imitation learning, a policy trained via supervised learning may make small mistakes when executed, leading to deviations from the states observed during training. These deviations place the policy in unfamiliar situations, causing compounding errors. This phenomenon is known as distributional shift.

### Similar Issues in Model-Based RL

In model-based RL, the learned dynamics model is used to simulate the environment. However, like the learned policy in imitation learning, the learned model can make mistakes. These mistakes cause the simulated states to diverge from the real states, and the errors compound over time.

When the policy is optimized using the learned model, any inaccuracies in the model can lead to suboptimal policy updates. This is particularly problematic when long rollouts are used, as the cumulative errors can significantly distort the policy's performance.
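
To make the compounding concrete, here is a small sketch with made-up scalar dynamics. Both `true_step` and `model_step` are hypothetical; the only point is that the model differs from the real system by a small amount in a single coefficient.

```python
import numpy as np

def rollout(step_fn, s0, horizon):
  """Apply step_fn repeatedly and return the visited states, including s0."""
  states = [s0]
  for _ in range(horizon):
    states.append(step_fn(states[-1]))
  return np.array(states)

def true_step(s):
  return 0.99 * s + 0.1    # hypothetical real dynamics

def model_step(s):
  return 1.00 * s + 0.1    # hypothetical learned model, slightly wrong

real = rollout(true_step, 1.0, 1000)
sim = rollout(model_step, 1.0, 1000)
error = np.abs(real - sim)

for h in (1, 10, 50, 1000):
  print(f"horizon {h:4d}: |model - real| = {error[h]:.3f}")
```

In this toy case the one-step error is 0.01, and the gap between simulated and real states keeps growing with the horizon.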


### Exacerbation by Policy Updates

The issue is further exacerbated when the policy is improved based on the learned model:

1. **Changing Policy**: In Model-Based RL Version 2.5, the policy is iteratively improved using synthetic data generated by the learned model. Each update to the policy can introduce new states that the model hasn't accurately learned, leading to greater distributional shift.
2. **Distributional Shift**: The distribution of states encountered by the updated policy differs even more from those seen during the initial data collection, worsening the model's accuracy and further compounding errors in long rollouts.


## <img src="https://img.icons8.com/?size=50&id=cGIDLkSsAuf3&format=png&color=000000" style="height:50px;display:inline"> Model-Based RL with Short Rollouts



Given the challenges of long model-based rollouts due to accumulating errors, one effective strategy is to use short rollouts. This approach can mitigate the error accumulation problem while still benefiting from model-based learning.

**By limiting the length of model-based rollouts, we can reduce the accumulated error**. For example, if the task has a horizon of 1000 steps, using rollouts of only 50 steps can significantly lower the error. However, simply reducing the rollout length changes the nature of the problem: tasks with long horizons might have critical events occurring later in the trajectory that short rollouts would miss. For instance, a robot cooking a meal may only reach the decisive steps of the task well after the first five minutes, so rollouts truncated at that point would never see them.

### Hybrid Approach

A hybrid approach involves collecting full-length trajectories from the real environment infrequently and using the real-world states as starting points for short model-based rollouts. This ensures that the model still encounters later stages of the task without relying on long rollouts. The trade-offs include having much lower error due to the short rollouts while ensuring comprehensive state coverage by sampling states uniformly from real-world trajectories. This way, the agent can still see states from all time steps.

### State Distribution Mismatch

One issue with this approach is the state distribution mismatch. **When the policy is updated, the new policy will encounter different states than the ones seen during data collection**. The short rollouts start from real-world states that were visited by the data-collecting policy, but the actions inside each rollout are chosen by the new policy. The resulting state distribution is therefore a mix of the distribution induced by the data-collecting policy and that of the new policy, which can be problematic. If the policy changes are small, the impact of the mismatch is minimal, and advanced policy gradient methods can still be effective. However, when significant policy changes are made between data collection rounds, the mismatch can degrade the performance of on-policy methods like policy gradient algorithms.

### Practical Implementation: Model-Based RL Version 3.0

<img src='https://github.com/CLAIR-LAB-TECHNION/CLAI/blob/main/tutorials/assets/MB_model-based_3.0.png?raw=true' width=900/>

Model-Based RL Version 3.0, which uses short rollouts, aligns more closely with practical methods used in the field. The process typically involves:

#### Data Collection
- Collect real-world data and use it to train a dynamics model.

#### State Sampling
- Sample states from the real-world data, possibly uniformly at random across the trajectory.

#### Short Model-Based Rollouts
- Perform short rollouts from these sampled states using the learned model. These rollouts can be as short as one time step but are typically around 10 time steps.

#### Policy Improvement
- Use both real data and synthetic data from the model to improve the policy. Off-policy algorithms like Q-learning or actor-critic methods are often employed.

#### Iteration
- Generate more data from the model, then run the improved policy in the real environment to collect additional real data. Append this new data to the dataset and retrain the model.
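
The state sampling and short rollout steps can be sketched as follows. This is an illustrative sketch only: `short_rollouts` is a hypothetical helper, and the `policy`, `model`, and `real_states` stand-ins below are invented so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

def short_rollouts(real_states, policy, model, n_starts, k):
  """Branch k-step model rollouts from uniformly sampled real states."""
  start_idx = rng.integers(len(real_states), size=n_starts)
  synthetic = []
  for s in real_states[start_idx]:
    for _ in range(k):
      a = policy(s)
      s_next, r = model(s, a)  # learned model, never the real environment
      synthetic.append((s, a, r, s_next))
      s = s_next
  return synthetic

# Hypothetical stand-ins so the sketch runs
real_states = np.linspace(-1.0, 1.0, 1000)  # states from long real trajectories

def policy(s):
  return -np.sign(s)                        # push towards the origin

def model(s, a):
  return s + 0.1 * a, -abs(s)               # (next state, reward)

synthetic = short_rollouts(real_states, policy, model, n_starts=32, k=10)
print(len(synthetic), "synthetic transitions")
```

Because the start states are drawn from the whole real trajectory, late stages of the task are still covered even though each rollout is only `k` steps long. The resulting tuples would typically be added to a buffer used by an off-policy learner.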




#### Design Considerations

1. **Balance of Real and Synthetic Data**: The method involves delicate decisions regarding the proportion of real versus synthetic data used for training.
2. **Frequency of Policy Updates**: The frequency and extent of policy updates between data collection episodes need careful tuning to maintain performance.

# <img src="https://img.icons8.com/?size=50&id=lWvRA05d4v6N&format=png&color=000000" style="height:50px;display:inline"> Dyna: Practical Model-Based RL Algorithms


In this section, we explore practical model-based reinforcement learning (RL) algorithms that build upon the framework of Model-Based RL Version 3.0. Specifically, we focus on the Dyna algorithm, a classic approach that effectively integrates model-based and model-free techniques to improve learning efficiency.

Dyna, introduced by Richard Sutton in the 1990s, exemplifies an approach that enhances online Q-learning with model-based rollouts. The algorithm uses very short rollouts, often just one time step, to leverage the learned model and improve data efficiency. Despite its simplicity, Dyna can provide significant benefits if a good model is learned.

## Key Steps in Dyna

<img src='https://github.com/CLAIR-LAB-TECHNION/CLAI/blob/main/tutorials/assets/MB_dyna.png?raw=true' width=900/>

### Action Selection

- In the current state, pick an action $a$ using an exploration policy. This step is identical to standard online Q-learning.

### Transition Observation

- Observe the resulting next state $s'$ and the reward $r$, forming a transition tuple $(s,a,s',r)$.

### Model and Reward Function Update

- Update the dynamics model and reward function using the observed transition. In the original Dyna, this was done online with a single step of gradient descent or by mixing old values in a tabular model with new observations.

### Q-Learning Update

- Perform a standard Q-learning update using the observed transition.

### Model-Based Rollouts

- Repeat a model-based procedure $K$ times (where $K$ is a hyperparameter):
  1. Sample a state-action pair from the buffer of previous experiences.
  2. Use the learned model to simulate the next state and reward for this pair.
  3. Perform Q-learning updates using these simulated transitions.

This procedure allows Dyna to incorporate simulated experiences, generated by the learned model, into the Q-learning process, thereby improving the policy without relying solely on real-world interactions.


### Design Choices and Variations

Dyna makes several design choices that can be adjusted for different applications. The original Dyna samples state-action pairs from a buffer of previous experiences, but an alternative is to sample actions according to the latest policy, such as the argmax policy for the Q function. While Dyna performs a single update step for the model, multiple steps could be used to improve model accuracy, especially in deterministic systems where the same state-action pair should always produce the same next state. Dyna's design is optimized for highly stochastic systems. By using the state-action pairs directly from the buffer, it avoids distributional shift issues, making it statistically safer. These design choices are not rigid and can be adapted based on the specific characteristics of the environment and the goals of the learning task.
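
For stochastic systems, one way to realize "mixing old values with new observations" in a tabular model is to keep a running reward estimate and empirical transition counts. The class below is a hypothetical sketch of that idea, not the original Dyna implementation; the name `TabularStochasticModel` and the `mix` parameter are assumptions made for illustration.

```python
import numpy as np

class TabularStochasticModel:
  """Count-based tabular model (hypothetical helper for illustration)."""

  def __init__(self, n_states, n_actions, mix=0.1, seed=0):
    self.counts = np.zeros((n_states, n_actions, n_states))  # visits to (s, a, s')
    self.reward = np.zeros((n_states, n_actions))            # running reward estimate
    self.mix = mix                                           # weight on the newest reward
    self.rng = np.random.default_rng(seed)

  def update(self, s, a, r, s_next):
    first_visit = self.counts[s, a].sum() == 0
    self.counts[s, a, s_next] += 1
    if first_visit:
      self.reward[s, a] = r
    else:
      # mix the old estimate with the new observation
      self.reward[s, a] += self.mix * (r - self.reward[s, a])

  def sample(self, s, a):
    """Draw (reward, next_state) for a pair visited at least once."""
    probs = self.counts[s, a] / self.counts[s, a].sum()
    s_next = int(self.rng.choice(len(probs), p=probs))
    return self.reward[s, a], s_next


model = TabularStochasticModel(n_states=5, n_actions=2)
for r, s_next in [(1.0, 2), (1.0, 2), (1.0, 2), (0.0, 3)]:
  model.update(0, 1, r, s_next)

reward, s_next = model.sample(0, 1)
print(reward, s_next)
```

Sampling `(s, a)` from the buffer and only the next state from the model means the model is never queried on pairs it has no data for.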

## Benefits of Dyna

### Data Efficiency

- By incorporating model-based rollouts, Dyna significantly reduces the number of real-world interactions needed to improve the policy, enhancing data efficiency.

### Flexibility

- Dyna can be implemented with various off-policy RL methods, such as Q-learning and Q-function actor-critic methods, providing flexibility in its application.

### Robustness

- The use of short rollouts minimizes error accumulation, making the algorithm more robust to inaccuracies in the learned model.


## <img src="https://img.icons8.com/?size=50&id=_lFKH2HByz22&format=png&color=000000" style="height:50px;display:inline"> Dyna-Q

In this section, we will implement Dyna-Q, one of the simplest model-based reinforcement learning algorithms. A Dyna-Q agent combines acting, learning, and planning. The first two components, acting and learning, are just like what we have studied previously. Q-learning, for example, learns by acting in the world, and therefore combines acting and learning. But a Dyna-Q agent also implements planning: it simulates experiences from a model and learns from them.

The most common way in which the Dyna-Q agent is implemented is by adding a planning routine to a Q-learning agent: after the agent acts in the real world and learns from the observed experience, the agent is allowed a series of $k$ *planning steps*. At each one of those $k$ planning steps, the model generates a simulated experience by randomly sampling from the history of all previously experienced state-action pairs. The agent then learns from this simulated experience, again using the same Q-learning rule that you implemented for learning from real experience. This simulated experience is simply a one-step transition, i.e., a state, an action, and the resulting state and reward. So, in practice, a Dyna-Q agent learns (via Q-learning) from one step of **real** experience during acting, and then from k steps of **simulated** experience during planning.

---
**TABULAR DYNA-Q**

Initialize $Q(s,a)$ and $Model(s,a)$ for all $s \in S$ and $a \in A$.

Loop forever:

> (a) $S$ &larr; current (nonterminal) state <br>
> (b) $A$ &larr; $\epsilon$-greedy$(S,Q)$ <br>
> (c) Take action $A$; observe resultant reward, $R$, and state, $S'$ <br>
> (d) $Q(S,A)$ &larr; $Q(S,A) + \alpha \left[R + \gamma \max_{a} Q(S',a) - Q(S,A)\right]$ <br>
> (e) $Model(S,A)$ &larr; $R,S'$ (assuming deterministic environment) <br>
> (f) Repeat $k$ times: <br>
>> $S$ &larr; random previously observed state <br>
>> $A$ &larr; random action previously taken in $S$ <br>
>> $R,S'$ &larr; $Model(S,A)$ <br>
>> $Q(S,A)$ &larr; $Q(S,A) + \alpha \left[R + \gamma \max_{a} Q(S',a) - Q(S,A)\right]$ <br>


---
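
As a worked instance of update (d), take the learning rate $\alpha = 0.5$ and discount $\gamma = 0.8$ used for `params` later in this tutorial, together with the all-ones initial value table from `learn_environment`. Assuming, purely for illustration, a first transition with reward $R = 0$ into a nonterminal state:

$$
Q(S,A) \leftarrow 1 + 0.5\left[0 + 0.8 \cdot 1 - 1\right] = 1 + 0.5 \cdot (-0.2) = 0.9
$$

The same arithmetic applies in the planning loop (f). The only difference is that $R$ and $S'$ are read from $Model(S,A)$ instead of from the environment.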

There's one final detail about this algorithm: where do the simulated experiences come from or, in other words, what is the "model"? In Dyna-Q, as the agent interacts with the environment, the agent also learns the model. For simplicity, Dyna-Q implements model-learning in an almost trivial way, by simply caching the results of each transition. Thus, after each one-step transition in the environment, the agent saves the results of this transition in a big matrix, and consults that matrix during each of the planning steps. Obviously, this model-learning strategy only makes sense if the world is deterministic (so that each state-action pair always leads to the same state and reward), and this is the setting of the exercise below. However, even this simple setting can already highlight one of Dyna-Q's major strengths: planning is done at the same time as the agent interacts with the environment, which means that new information gained from the interaction may change the model and thereby interact with planning in potentially interesting ways.

Since you already implemented Q-learning in the previous tutorial, we will focus here on the extensions new to Dyna-Q: the model update step and the planning step.

In [None]:
def epsilon_greedy(q, epsilon):
  """Epsilon-greedy policy: selects the maximum value action with probability
  (1-epsilon) and selects randomly with epsilon probability.

  Args:
    q (ndarray): an array of action values
    epsilon (float): probability of selecting an action randomly

  Returns:
    int: the chosen action
  """
  be_greedy = np.random.random() > epsilon
  if be_greedy:
    action = np.argmax(q)
  else:
    action = np.random.choice(len(q))

  return action


def q_learning(state, action, reward, next_state, value, params):
  """Q-learning: updates the value function and returns it.

  Args:
    state (int): the current state identifier
    action (int): the action taken
    reward (float): the reward received
    next_state (int): the transitioned to state identifier
    value (ndarray): current value function of shape (n_states, n_actions)
    params (dict): a dictionary containing the default parameters

  Returns:
    ndarray: the updated value function of shape (n_states, n_actions)
  """
  # value of previous state-action pair
  prev_value = value[int(state), int(action)]

  # maximum Q-value at current state
  if next_state is None or np.isnan(next_state):
      max_value = 0
  else:
      max_value = np.max(value[int(next_state)])

  # reward prediction error
  delta = reward + params['gamma'] * max_value - prev_value

  # update value of previous state-action pair
  value[int(state), int(action)] = prev_value + params['alpha'] * delta

  return value

<img src='https://github.com/CLAIR-LAB-TECHNION/CLAI/blob/main/tutorials/assets/task_sign.png?raw=true' width=800/>

### <img src="https://img.icons8.com/?size=50&id=46589&format=png&color=000000" style="height:30px;display:inline"> Task: Dyna-Q Model Update

In this exercise you will implement the model update portion of the Dyna-Q algorithm. More specifically, after each action that the agent executes in the world, we need to update our model to remember what reward and next state we last experienced for the given state-action pair.

In [None]:
def dyna_q_model_update(model, state, action, reward, next_state):
  """ Dyna-Q model update

  Args:
    model (ndarray): An array of shape (n_states, n_actions, 2) that represents
                     the model of the world i.e. what reward and next state do
                     we expect from taking an action in a state.
    state (int): the current state identifier
    action (int): the action taken
    reward (float): the reward received
    next_state (int): the transitioned to state identifier

  Returns:
    ndarray: the updated model
  """
  ###############################################################
  ## TODO for students: implement the model update step of Dyna-Q
  # Fill out function and remove
  raise NotImplementedError("Student exercise: implement the model update step of Dyna-Q")
  ###############################################################

  # Update our model with the observed reward and next state
  model[...] = ...

  return model

#### <img src="https://img.icons8.com/?size=50&id=42816&format=png&color=000000" style="height:30px;display:inline">  Solution



In [None]:
def dyna_q_model_update(model, state, action, reward, next_state):
  """ Dyna-Q model update

  Args:
    model (ndarray): An array of shape (n_states, n_actions, 2) that represents
                     the model of the world i.e. what reward and next state do
                     we expect from taking an action in a state.
    state (int): the current state identifier
    action (int): the action taken
    reward (float): the reward received
    next_state (int): the transitioned to state identifier

  Returns:
    ndarray: the updated model
  """
  # Update our model with the observed reward and next state
  model[state, action] = reward, next_state

  return model

Now that we have a way to update our model, we can use it in the planning phase of Dyna-Q to simulate past experiences.
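
Before planning, it helps to see what the `(n_states, n_actions, 2)` model array looks like and how visited state-action pairs are recovered from it. The sizes and transitions below are hypothetical toy values; the NaN initialization and the `np.where` expression are the ones used elsewhere in this tutorial.

```python
import numpy as np

n_states, n_actions = 4, 2  # hypothetical toy sizes

# Same initialization as in learn_environment: every entry starts as NaN
model = np.nan * np.zeros((n_states, n_actions, 2))

# Store two observed transitions as (reward, next_state)
model[1, 0] = 0.0, 2
model[3, 1] = 1.0, 0

# Visited pairs are exactly those whose stored reward is not NaN
candidates = np.array(np.where(~np.isnan(model[:, :, 0]))).T
print(candidates)  # one (state, action) row per visited pair

# Reading a cached transition back out of the model
reward, next_state = model[3, 1]
print(reward, int(next_state))
```

Because the array is floating point, the stored next state comes back as a float, which is why `q_learning` casts its indices with `int(...)`.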

In [None]:
def learn_environment(env, model_updater, planner, params, max_steps,
                      n_episodes, shortcut_episode=None):
  # Start with a uniform value function
  value = np.ones((env.n_states, env.n_actions))

  # Run learning
  reward_sums = np.zeros(n_episodes)
  episode_steps = np.zeros(n_episodes)

  # Dyna-Q state
  model = np.nan*np.zeros((env.n_states, env.n_actions, 2))

  # Loop over episodes
  for episode in range(n_episodes):
    if shortcut_episode is not None and episode == shortcut_episode:
      env.toggle_shortcut()
      state = 64
      action = 1
      next_state, reward = env.get_outcome(state, action)
      model[state, action] = reward, next_state
      value = q_learning(state, action, reward, next_state, value, params)


    state = env.init_state  # initialize state
    reward_sum = 0

    for t in range(max_steps):
      # choose next action
      action = epsilon_greedy(value[state], params['epsilon'])

      # observe outcome of action on environment
      next_state, reward = env.get_outcome(state, action)

      # sum rewards obtained
      reward_sum += reward

      # update value function
      value = q_learning(state, action, reward, next_state, value, params)

      # update model
      model = model_updater(model, state, action, reward, next_state)

      # execute planner
      value = planner(model, value, params)

      if next_state is None:
        break  # episode ends
      state = next_state

    reward_sums[episode] = reward_sum
    episode_steps[episode] = t+1

  return value, reward_sums, episode_steps

### <img src="https://img.icons8.com/?size=50&id=46589&format=png&color=000000" style="height:30px;display:inline"> Task: Dyna-Q Planning

In this exercise you will implement the other key part of Dyna-Q: planning. We will sample a random state-action pair from those we've experienced, use our model to simulate the experience of taking that action in that state, and update our value function using Q-learning with these simulated state, action, reward, and next state outcomes. Furthermore, we want to run this planning step $k$ times, which can be obtained from `params['k']`.

For this exercise, you may use the `q_learning` function to handle the Q-learning value function update. Recall that the method signature is `q_learning(state, action, reward, next_state, value, params)` and it returns the updated `value` table.

After completing this function, we have a way to update our model and a means to use it in planning so we will see it in action. The code sets up our agent parameters and learning environment, then passes your model update and planning methods to the agent to try and solve Quentin's World. Notice that we set the number of planning steps $k=10$.

In [None]:
def dyna_q_planning(model, value, params):
  """ Dyna-Q planning

  Args:
    model (ndarray): An array of shape (n_states, n_actions, 2) that represents
                     the model of the world i.e. what reward and next state do
                     we expect from taking an action in a state.
    value (ndarray): current value function of shape (n_states, n_actions)
    params (dict): a dictionary containing learning parameters

  Returns:
    ndarray: the updated value function of shape (n_states, n_actions)
  """
  ############################################################
  ## TODO for students: implement the planning step of Dyna-Q
  # Fill out function and remove
  raise NotImplementedError("Student exercise: implement the planning step of Dyna-Q")
  #############################################################
  # Perform k additional updates at random (planning)
  for _ in range(...):
    # Find state-action combinations for which we've experienced a reward i.e.
    # the reward value is not NaN. The outcome of this expression is an Nx2
    # matrix, where each row is a state and action value, respectively.
    candidates = np.array(np.where(~np.isnan(model[:,:,0]))).T

    # Write an expression for selecting a random row index from our candidates
    idx = ...

    # Obtain the randomly selected state and action values from the candidates
    state, action = ...

    # Obtain the expected reward and next state from the model
    reward, next_state = ...

    # Update the value function using Q-learning
    value = ...

  return value


# set for reproducibility, comment out / change seed value for different results
np.random.seed(1)

# parameters needed by our policy and learning rule
params = {
  'epsilon': 0.05,  # epsilon-greedy policy
  'alpha': 0.5,  # learning rate
  'gamma': 0.8,  # temporal discount factor
  'k': 10,  # number of Dyna-Q planning steps
}

# episodes/trials
n_episodes = 500
max_steps = 1000

# environment initialization
env = QuentinsWorld()

# solve Quentin's World using Dyna-Q
results = learn_environment(env, dyna_q_model_update, dyna_q_planning,
                            params, max_steps, n_episodes)
value, reward_sums, episode_steps = results

# Plot the results
plot_performance(env, value, reward_sums)

After an initial warm-up phase of roughly the first 20 episodes, the agent with $k=10$ planning steps should solve the environment rapidly. The number of planning steps matters: beyond a certain value of $k$ the relative benefit of each additional step goes down, so it's important to pick a value of $k$ that is large enough to help us learn quickly without wasting too much time in planning.

#### <img src="https://img.icons8.com/?size=50&id=42816&format=png&color=000000" style="height:30px;display:inline">  Solution



In [None]:

def dyna_q_planning(model, value, params):
  """ Dyna-Q planning

  Args:
    model (ndarray): An array of shape (n_states, n_actions, 2) that represents
                     the model of the world i.e. what reward and next state do
                     we expect from taking an action in a state.
    value (ndarray): current value function of shape (n_states, n_actions)
    params (dict): a dictionary containing learning parameters

  Returns:
    ndarray: the updated value function of shape (n_states, n_actions)
  """
  # Perform k additional updates at random (planning)
  for _ in range(params['k']):
    # Find state-action combinations for which we've experienced a reward i.e.
    # the reward value is not NaN. The outcome of this expression is an Nx2
    # matrix, where each row is a state and action value, respectively.
    candidates = np.array(np.where(~np.isnan(model[:,:,0]))).T

    # Write an expression for selecting a random row index from our candidates
    idx = np.random.choice(len(candidates))

    # Obtain the randomly selected state and action values from the candidates
    state, action = candidates[idx]

    # Obtain the expected reward and next state from the model
    reward, next_state = model[state, action]

    # Update the value function using Q-learning
    value = q_learning(state, action, reward, next_state, value, params)

  return value


# set for reproducibility, comment out / change seed value for different results
np.random.seed(1)

# parameters needed by our policy and learning rule
params = {
  'epsilon': 0.05,  # epsilon-greedy policy
  'alpha': 0.5,  # learning rate
  'gamma': 0.8,  # temporal discount factor
  'k': 10,  # number of Dyna-Q planning steps
}

# episodes/trials
n_episodes = 500
max_steps = 1000

# environment initialization
env = QuentinsWorld()

# solve Quentin's World using Dyna-Q
results = learn_environment(env, dyna_q_model_update, dyna_q_planning,
                            params, max_steps, n_episodes)
value, reward_sums, episode_steps = results

# Plot the results
with plt.xkcd():
  plot_performance(env, value, reward_sums)

### When the world changes...

In addition to speeding up learning about a new environment, planning can also help the agent to quickly incorporate new information about the environment into its policy. Thus, if the environment changes (e.g. the rules governing the transitions between states, or the rewards associated with each state/action), the agent doesn't need to experience that change *repeatedly* (as would be required in a Q-learning agent) in real experience. Instead, planning allows that change to be incorporated quickly into the agent's policy, without the need to experience the change more than once.
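
The effect can be seen in isolation on a tiny deterministic chain. This is a simplified sketch under stated assumptions, not the Quentin's World experiment: the four-state chain, its rewards, and the dictionary-based `model` are hypothetical, and each state has a single action so the max over actions is trivial.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma = 0.5, 0.8
n_states = 4  # chain 0 -> 1 -> 2 -> 3 -> terminal, one action per state

# Cached deterministic model: model[s] = (reward, next_state); -1 marks terminal
model = {0: (0.0, 1), 1: (0.0, 2), 2: (0.0, 3), 3: (1.0, -1)}

def q_update(q, s, r, s_next):
  """One Q-learning update for a single-action chain."""
  target = r if s_next < 0 else r + gamma * q[s_next]
  q[s] += alpha * (target - q[s])

# Values already converged for the old final reward of 1.0
q = np.array([gamma ** 3, gamma ** 2, gamma, 1.0])

# The world changes: the last transition now pays 10 and is experienced once
model[3] = (10.0, -1)
q_update(q, 3, 10.0, -1)
q_after_one_real_step = q.copy()

# Planning: replay transitions from the model, with no further real experience
for _ in range(2000):
  s = int(rng.integers(n_states))
  r, s_next = model[s]
  q_update(q, s, r, s_next)

print("after one real step:", q_after_one_real_step)
print("after planning:     ", q)
```

After the single real update only the value of the last state has moved. The planning updates then carry the change back to the start of the chain without any additional interaction.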

In this final section, we will again have our agents attempt to solve Quentin's World. However, after 200 episodes, a shortcut will appear in the environment.  We will test how a model-free agent using Q-learning and a Dyna-Q agent adapt to this change in the environment.

<img alt="QuentinsWorldShortcut" width="560" height="560" src="https://github.com/NeuromatchAcademy/course-content/blob/main/tutorials/static/W3D4_Tutorial4_QuentinsWorldShortcut.png?raw=true">



The following code again looks similar to what we've run previously. This time we run two values of $k$, with $k=0$ representing our Q-learning agent and $k=10$ our Dyna-Q agent with 10 planning steps. The main difference is that we now add an indicator of when the shortcut appears. In particular, we will run the agents for 400 episodes, with the shortcut appearing in the middle, after episode #200.

When this shortcut appears we will also let each agent experience this change once i.e. we will evaluate the act of moving upwards when in the state that is below the now-open shortcut. After this single demonstration, the agents will continue on interacting in the environment.


In [None]:
# set for reproducibility, comment out / change seed value for different results
np.random.seed(1)

# parameters needed by our policy and learning rule
params = {
  'epsilon': 0.05,  # epsilon-greedy policy
  'alpha': 0.5,  # learning rate
  'gamma': 0.8,  # temporal discount factor
}

# episodes/trials
n_episodes = 400
max_steps = 1000
shortcut_episode = 200  # when we introduce the shortcut

# number of planning steps
planning_steps = np.array([0, 10]) # Q-learning, Dyna-Q (k=10)

# storage for the number of steps taken in each episode, one row per agent
steps_per_episode = np.zeros((len(planning_steps), n_episodes))

# Solve Quentin's World using Q-learning and Dyna-Q
for i, k in enumerate(planning_steps):
  env = QuentinsWorld()
  params['k'] = k
  results = learn_environment(env, dyna_q_model_update, dyna_q_planning,
                              params, max_steps, n_episodes,
                              shortcut_episode=shortcut_episode)
  steps_per_episode[i] = results[2]


# Plot results
fig, ax = plt.subplots()
ax.plot(steps_per_episode.T)
ax.set(xlabel='Episode', ylabel='Steps per Episode',
       xlim=[20,None], ylim=[0, 160])
ax.axvline(shortcut_episode, linestyle="--", color='gray', label="Shortcut appears")
ax.legend(('Q-learning', 'Dyna-Q', 'Shortcut appears'),
          loc='upper right');

If all went well, we should see the Dyna-Q agent having already achieved near optimal performance before the appearance of the shortcut and then immediately incorporating this new information to further improve. In this case, the Q-learning agent takes much longer to fully incorporate the new shortcut.

# <img src="https://img.icons8.com/?size=50&id=55422&format=png&color=000000" style="height:50px;display:inline"> Variants of Generalized Dyna Algorithms

Several algorithms in the literature leverage the generalized Dyna approach, each with unique design decisions regarding the use of data for Q-learning and the integration of model-based rollouts. For instance, **Model-Based Policy Optimization (MBPO)** closely follows the procedure described, while **Model-Based Value Expansion (MBVE)** uses model rollouts to improve target value estimates without directly training the Q function. These methods share a fundamental recipe: collect transitions, update the model, perform model-based rollouts, and use the transitions to update the Q function. The primary advantage of these approaches is their sample efficiency, as they generate additional synthetic data to augment the real-world dataset, leading to faster learning. However, they also introduce potential biases, especially if the model is inaccurate or the state distribution becomes skewed. To mitigate these issues, techniques like model ensembles and frequent real-world data collection are used. Despite these challenges, model-based approaches typically achieve faster initial learning, though they may eventually plateau at a lower performance level due to model inaccuracies.
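
To give a feel for the value-expansion idea, the sketch below computes a multi-step target by unrolling a model for a few steps and bootstrapping from the Q table at the end. It is a simplified tabular reading that reuses this tutorial's `(n_states, n_actions, 2)` model layout, and it is not the algorithm from the MBVE paper. The function name `value_expansion_target` and the toy chain are hypothetical, and the starting pair is assumed to be present in the model.

```python
import numpy as np

def value_expansion_target(model, value, state, action, gamma, horizon):
  """Sum of discounted model rewards over `horizon` steps plus a bootstrap."""
  target, discount = 0.0, 1.0
  for _ in range(horizon):
    reward, next_state = model[state, action]
    target += discount * reward
    discount *= gamma
    if np.isnan(next_state):      # terminal transition: nothing to bootstrap
      return target
    state = int(next_state)
    action = int(np.argmax(value[state]))   # greedy action for the next step
    if np.isnan(model[state, action, 0]):   # model has no data here: stop early
      break
  return target + discount * np.max(value[state])

# Hypothetical 3-state chain with one action; the last transition is terminal
model = np.nan * np.zeros((3, 1, 2))
model[0, 0] = 0.0, 1
model[1, 0] = 0.0, 2
model[2, 0] = 1.0, np.nan
value = np.zeros((3, 1))

one_step = value_expansion_target(model, value, 0, 0, gamma=0.5, horizon=1)
three_step = value_expansion_target(model, value, 0, 0, gamma=0.5, horizon=3)
print(one_step, three_step)
```

With an untrained Q table the one-step target carries no information about the final reward, while the three-step target already does. This shows why longer expansions can help when the model is accurate, and why they inherit the model's errors when it is not.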

# <img src="https://img.icons8.com/?size=100&id=46509&format=png&color=000000" style="height:50px;display:inline"> Conclusion
---

In this tutorial, we have learned about model-based reinforcement learning and implemented one of the simplest architectures of this type, Dyna-Q. Dyna-Q is very much like Q-learning, but instead of learning only from real experience, you also learn from **simulated** experience. This small difference, however, can have huge benefits! Planning *frees* the agent from the limitation of learning only from direct interaction with its environment, and this in turn allows the agent to speed up learning, for instance by quickly incorporating environmental changes into its policy.

Not surprisingly, model-based RL is an active area of research. Some of the exciting topics in the frontier of the field involve (i) learning and representing a complex world model (i.e., beyond the tabular and deterministic case above), and (ii) what to simulate -- also known as search control -- (i.e., beyond the random selection of experiences implemented above).





# <img src="https://img.icons8.com/dusk/64/000000/plus-2-math.png" style="height:50px;display:inline"> Further Reading
---


For those interested in diving deeper into model-based reinforcement learning, the following papers provide a comprehensive exploration of various approaches and innovations in the field:

### Core Papers

- **Deisenroth et al.**: [PILCO: A Model-Based and Data-Efficient Approach to Policy Search](https://link.springer.com/article/10.1007/s10994-013-9332-0). This paper presents a pioneering approach to policy search using probabilistic models to achieve data efficiency.
- **Nagabandi et al.**: [Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning](https://arxiv.org/abs/1708.02596). This work explores the combination of neural network dynamics models with model-free fine-tuning to enhance performance.
- **Chua et al.**: [Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models](https://arxiv.org/abs/1805.12114). This paper discusses the use of probabilistic dynamics models to achieve sample-efficient learning in complex environments.
- **Feinberg et al.**: [Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning](https://arxiv.org/abs/1803.00101). This study introduces Model-Based Value Expansion (MBVE) to improve the efficiency of model-free RL.
- **Buckman et al.**: [Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion](https://arxiv.org/abs/1807.01675). This paper explores the use of ensemble methods to enhance the efficiency and robustness of model-based RL.

### Additional Seminal Works

- **Gu et al.**: [Continuous Deep Q-Learning with Model-Based Acceleration (2016)](https://arxiv.org/abs/1603.00748). This paper presents a method for accelerating deep Q-learning using model-based techniques.
- **Feinberg et al.**: [Model-Based Value Expansion (2018)](https://arxiv.org/abs/1803.00101). This study further elaborates on the MBVE approach for efficient RL.
- **Janner et al.**: [When to Trust Your Model: Model-Based Policy Optimization (2019)](https://arxiv.org/abs/1906.08253). This paper discusses strategies for determining the reliability of models in model-based policy optimization.

These readings provide a solid foundation and advanced insights into the development and application of model-based reinforcement learning algorithms.







# <img src="https://img.icons8.com/?size=100&id=46756&format=png&color=000000" style="height:50px;display:inline"> Credits
---
* Examples and code snippets were taken from <a href="https://neuromatch.io/neuroscience/">Neuromatch Academy</a>
* Examples and explanations were taken from <a href="https://rail.eecs.berkeley.edu/deeprlcourse/">CS285 - Deep Reinforcement Learning Course at UC Berkeley</a>
* Icons from <a href="https://icons8.com/">Icons8.com</a>