# Assignment 3

In the previous tutorial you learned about the implementation of a *Q-learning* agent, and train it on the Catch environment.

Now in this assignment you need to train your agent using:

*   Q-Learning agent with Neural Networks
*   Q-Learning agent with NNs and a Replay Buffer

Complete the ToDO section to be able to train your agents. Then, for each part change the hyperparameters as described and write a report on your observations.

Please <font color='red'>**make a copy**</font> of this notebook to your Drive: **File** > **Save copy in Drive**.

In [1]:
#@title Imports
%%capture
!pip install dm_env

import abc
import collections
import dm_env
import numpy as np
import tensorflow as tf

from matplotlib import pyplot as plt
import matplotlib.animation as animation

import matplotlib.animation as animat

from matplotlib import rc
rc('animation', html='jshtml')
%matplotlib inline

## Reinforcement learning

The **agent** interacts with the **environment** in a loop corresponding to the following diagram. The environment defines a set of <font color='blue'>**actions**</font>  that an agent can take.  The agent takes an action informed by the <font color='red'>**observations**</font> it recieves, and will get a <font color='green'>**reward**</font> from the environment after each action. The goal in RL is to find an agent whose actions maximize the total accumulation of rewards obtained from the environment.


<center><img src="https://drive.google.com/uc?id=1sVOD2Ux5F_1Yq3KjyLOKFjFm2WRNTbIH" width="500" /></center>

Relevant terminology (more in this [glossary](https://developers.google.com/machine-learning/glossary/rl)):
 * **agent:** The entity that uses a policy to maximize the expected return gained from transitioning between states of the environment.
 * **environment:** A world with given dynamics that the agent can interact with. When the agent applies an action to the environment, then the environment transitions between states according to its internal dynamics.
 * **environment loop:** A process during which an agent interacts with the environment (i.e. repeatedly executes actions, observes the changes in the environment state, and potentially learns from this experience).
 * **timestep:** A set of information that captures all relevant aspects of a single interaction between the agent and the environment. Most importantly, the observation and the received reward.
 * **episode:** Each of the repeated attempts by the agent to learn an environment.
 * **observation:** The state of the environment at a given time which the agent can observe.
 * **policy**: An agent's probabilistic mapping from states to actions.


## The Catch environment

*Catch* is a classic, simple RL environment, where the agent needs to learn to catch a falling ball by moving a paddle around. Below we provide a simple implementation of the environment, in which the three scalar actions $(0, 1, 2)$ correspond to moving the paddle to the (left, middle, right) respectively. The agent gets a reward of $1.0$ if the paddle was right below the ball when it reached the bottom of the board, otherwise the agent receives $0.0$ reward.

<img src="https://drive.google.com/uc?id=1xkpEZAkl08E_XJQsCe8b3Y0JYRhsScS2" width="400">


In [2]:
#@title Catch environment implementation
_ACTIONS = (0, 1, 2)  # Left, no-op, right.


class Catch(dm_env.Environment):
  """A Catch environment built on the `dm_env.Environment` class."""

  def __init__(self, rows=10, columns=5, seed=1):
    self._rows = rows
    self._columns = columns
    self._rng = np.random.RandomState(seed)
    self._board = np.zeros((rows, columns), dtype=np.float32)
    self._ball_x = None
    self._ball_y = None
    self._paddle_x = None
    self._paddle_y = self._rows - 1
    self._reset_next_step = True

  def reset(self):
    """Returns the first `TimeStep` of a new episode."""
    self._reset_next_step = False
    self._ball_x = self._rng.randint(self._columns)
    self._ball_y = 0
    self._paddle_x = self._columns // 2
    return dm_env.restart(self._observation())

  def step(self, action):
    """Updates the environment according to the action."""
    if self._reset_next_step:
      return self.reset()

    # Move the paddle.
    dx = _ACTIONS[action] - 1
    self._paddle_x = np.clip(self._paddle_x + dx, 0, self._columns - 1)

    # Drop the ball.
    self._ball_y += 1

    # Check for termination.
    if self._ball_y == self._paddle_y:
      reward = 1. if self._paddle_x == self._ball_x else -1.
      self._reset_next_step = True
      return dm_env.termination(reward=reward, observation=self._observation())
    else:
      return dm_env.transition(reward=0., observation=self._observation())

  def _observation(self):
    self._board.fill(0.)
    self._board[self._ball_y, self._ball_x] = 1.
    self._board[self._paddle_y, self._paddle_x] = 1.
    return self._board.copy()

  def observation_spec(self):
    return dm_env.specs.BoundedArray(
        shape=self._board.shape,
        dtype=self._board.dtype,
        name="board",
        minimum=0,
        maximum=1)

  def action_spec(self):
    return dm_env.specs.DiscreteArray(
        dtype=int, num_values=len(_ACTIONS), name="action")

### Let's observe a random agent acting on Catch!

First we are going to take a look at what the agent-environment interaction looks like when an agent acts randomly. We see that the board is represented by a $10\times 5$ array of zeroes, where both the ball and the paddle position are denoted by a value of $1.0$.

In [3]:
#@title Take random actions
env = Catch()

res = []
timestep = env.reset()
print('Observation format (what the agents sees):')
print(timestep.observation)
res.append(timestep.observation)
for step in range(50):
  action = np.random.randint(3)
  timestep = env.step(action)
  res.append(timestep.observation)

Observation format (what the agents sees):
[[0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]]


In [4]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [5]:
anim

## The Agent

In [6]:
#@title Agent interface

class Agent(abc.ABC):
  """Base class defining the agent interface."""

  @abc.abstractmethod
  def select_action(self, latest_obs):
    """Choose an action to take in the environment."""
    ...

  @abc.abstractmethod
  def observe(self, action, next_timestep: dm_env.TimeStep):
    """Take note of what happened in the environment after taking an action."""
    ...

  @abc.abstractmethod
  def observe_first(self, first_timestep: dm_env.TimeStep):
    """Take note of the environment state before taking any actions."""
    ...

  @abc.abstractmethod
  def update(self):
    """Update the agent's internal understanding of the environment dynamics."""
    ...


In [7]:
# @title Training and evaluation

def train(
    agent: Agent,
    env: dm_env.Environment,
    num_episodes = 1000):
  """Environment loop during which an agent learns from the interactions."""

  print('Training agent...')
  training_returns = []
  sum_returns = 0.0
  for episode in range(num_episodes):
    timestep = env.reset()
    agent.observe_first(timestep)
    sum_rewards = 0.0
    while timestep.step_type != dm_env.StepType.LAST:
      action = agent.select_action(timestep.observation)
      timestep = env.step(action)
      if timestep.reward is not None:
        sum_rewards += timestep.reward
      agent.observe(action, timestep)
      agent.update()
    training_returns.append(sum_rewards)
    if episode % 10 == 0:
      print(f'Episode: {episode}, Return: {sum_rewards}, '
            f'Mean return: {np.mean(training_returns[-50:])}')

def evaluate(
    agent: Agent,
    env: dm_env.Environment,
    num_episodes = 10):
  """Environment loop during which the agent doesn't learn."""

  print('\nEvaluating agent...')
  agent.set_epsilon(0)
  eval_returns = []
  observations = []
  for episode in range(num_episodes):
    sum_rewards = 0.0
    timestep = env.reset()
    observations.append(timestep.observation)
    agent.observe_first(timestep)
    while timestep.step_type != dm_env.StepType.LAST:
      action = agent.select_action(timestep.observation)
      timestep = env.step(action)
      observations.append(timestep.observation)
      if timestep.reward is not None:
        sum_rewards += timestep.reward
      agent.observe(action, timestep)
      # agent.update()  # Don't update.
    eval_returns.append(sum_rewards)
    print(f'Episode: {episode}, Return: {sum_rewards}')
  print(f'mean: {np.mean(eval_returns)}, std: {np.std(eval_returns)}')
  return observations

## 1. Tabular Q-Learning Agent

In Q-learning, the agent estimates the *value* of (state, action) pairs. This estimate reflects how much total return the agent anticipates up until the end of the episode, assuming that it takes action $A$ in state $S$. In *tabular* Q-learning in particular, these value estimates are stored explicitly in a table, for example:

| (State, Action) | Q-value  |
| ----------------| ---------|
| ($S_i$, left)   | 0.7      |
| ($S_i$, stay)   | 0.0      |
| ($S_i$, right)  | -0.5     |
| ($S_j$, left)   | 0.32     |
| ($S_j$, stay)   | -1.0     |
| ($S_j$, right)  | 0.1      |
| $\dots$         | $\dots$  |

These estimates of (state, action) pairs will drive the behaviour (*policy*) of the agent.

Alternatively we could also represent the value estimates in matrix format:

|  Q-values       | left     | stay     | right    |
| ----------------| ---------| ---------| ---------|
| $S_i$           | 0.7      | 0.0      | -0.5     |
| $S_j$           | 0.32     | -1.0     | 0.1      |
| $\dots$         | $\dots$  | $\dots$  | $\dots$  |

### Train Tabular Q-learning agent

During training, the agent acts in the environment (i.e. plays the game) and makes periodic updates of its Q-value estimates based on what it observes. In particular, the estimates are updated based on the [Bellman equation](https://en.wikipedia.org/wiki/Bellman_equation):

$$Q_{new}(s_t, a_t) = Q_{old}(s_t, a_t) + \alpha *(R_t + \gamma \max_a Q(s_{t+1}, a)  - Q_{old}(s_t, a_t))$$
During this process, the Q-value estimates will become more and more accurate, leading to gradually increasing performance (i.e. the agent is *learning* to play the game well).

After the training process, the agent is *evaluated*. This means we assess its performance on the environment without any randomness in its behaviour. At this time, the agent does not make any updates to its Q-value estimates, therefore its behaviour is not changing anymore.

In [8]:
class QLearning(Agent):
  """Simple Q-learning agent."""

  def __init__(self,
               learning_rate: float = 0.2,
               discount: float = 0.99,
               epsilon: float = 0.1):
    """Initialize the agent.

    Args:
      learning_rate: (alpha) Controls how quickly we're willing to change
        the q-value estimates.
      discount: (gamma) Controls how much we care about immediate rewards vs
        long term rewards.
      epsilon: With small probability, the agent will take random actions
        instead of always picking the best action. This is to encourage
        diversity of experiences (exploration).
    """
    self._learning_rate = learning_rate
    self._epsilon = epsilon
    self._discount = discount

    # In the beginning, the agent doesn't know anything about the Q-values,
    # so the table will be initialized randomly.
    self._q = collections.defaultdict(np.random.random)
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def select_action(self, latest_obs):
    """Chooses an action to take based on the current Q-value estimates."""
    action = np.argmax([self._q_func(latest_obs, a) for a in range(3)])
    if np.random.random() < self._epsilon:
      action = np.random.randint(0, 3)
    return action

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep

  def update(self):
    """Updates the Q-value estimates based on the latest interaction."""
    reward = self._timestep_after_action.reward
    obs = self._timestep_after_action.observation
    obs_before = self._timestep_before_action.observation

    # Remember the Bellman equation:
    # q_new(s,a) = q_old(s, a) + alpha * (reward + gamma * argmax(q_old(s, a)) - q_old(s,a))
    best_action = self._best_action(obs)
    td = reward + self._discount * self._q_func(obs, best_action) - self._q_func(obs_before, self._latest_action)
    self._q[(str(obs_before), self._latest_action)] += self._learning_rate * td

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def _q_func(self, obs, action):
    return self._q[(str(obs), action)]

  def set_epsilon(self, eps: float):
    self._epsilon = eps

If all goes well, the **return** should be gradually increasing over the course of training!

In [9]:
env = Catch()
agent = QLearning(epsilon=0.05)

train(agent, env, num_episodes=500)
res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.2727272727272727
Episode: 20, Return: 1.0, Mean return: 0.14285714285714285
Episode: 30, Return: -1.0, Mean return: -0.16129032258064516
Episode: 40, Return: 1.0, Mean return: -0.17073170731707318
Episode: 50, Return: 1.0, Mean return: -0.12
Episode: 60, Return: 1.0, Mean return: 0.12
Episode: 70, Return: 1.0, Mean return: 0.08
Episode: 80, Return: 1.0, Mean return: 0.4
Episode: 90, Return: 1.0, Mean return: 0.56
Episode: 100, Return: 1.0, Mean return: 0.72
Episode: 110, Return: 1.0, Mean return: 0.68
Episode: 120, Return: 1.0, Mean return: 0.72
Episode: 130, Return: 1.0, Mean return: 0.68
Episode: 140, Return: 1.0, Mean return: 0.72
Episode: 150, Return: 1.0, Mean return: 0.72
Episode: 160, Return: 1.0, Mean return: 0.68
Episode: 170, Return: 1.0, Mean return: 0.76
Episode: 180, Return: 1.0, Mean return: 0.84
Episode: 190, Return: 1.0, Mean return: 0.8
Episode: 200, Return: 1.0, Me

In [10]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [11]:
anim

## 2. Q-Learning agent with Neural Networks

A major limitation of the tabular approach is that if the state space is large, it will quickly become infeasible to obtain a realistic estimate of each of their Q-values. Apart from explicit Q-value tables, another way for an agent to represent its Q-value estimates is using *Neural Networks*. Neural networks are [universal function approximators](https://en.wikipedia.org/wiki/Universal_approximation_theorem), therefore in theory they can be arbitrarily accurate estimators of the true $Q(s,a)$ function. They also help overcome the problem of large state spaces, because they can exploit underlying structure in the observation space.

In our Catch example, we can take our existing tabular Q-learning agent and replace its `_q_func()` and `update()` methods to use neural networks. The `_q_func()` method will now compute the Q-value as the output of the NN model, rather than reading it directly from a table. In the meantime, the `update()` method, instead of overwriting the Q-table, will perform model fitting.

In [12]:
class QLearningNN(Agent):
  """Simple Q-learning agent using a Neural Network."""

  def __init__(self,
               model,
               epsilon: float = 0.1,
               discount: float = 0.99):
    self._model = model #using model instead of Q-Table
    self._epsilon = epsilon
    self._discount = discount
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def update(self):
    #To-Do
    # Create a function that update Q-values based on NN
    # conditional statement that checks if the agent has enough information to update its Q-values. Specifically, it checks if there was a previous timestep before the last action was taken.
    if self._timestep_before_action is not None:
      # Compute the TD target.
      # extract relevant information from the last two timesteps, namely the observation, action, reward, next observation, done flag, and discount factor. These variables are used to compute the TD target and TD error.
      observation = self._timestep_before_action.observation
      action = self._latest_action
      reward = self._timestep_after_action.reward
      next_observation = self._timestep_after_action.observation
      done = self._timestep_after_action.last()
      discount = self._discount
      # compute the TD target, which is the expected cumulative reward starting from the next state and action. The maximum Q-value among all possible actions in the next state is used as an estimate of the value of the next state. The TD target is a combination of the immediate reward and the discounted value of the next state.
      q_value_next = np.max([self._q_func(next_observation, a) for a in range(3)])
      td_target = reward + discount * q_value_next

      # Compute the TD error.
      # compute the TD error, which is the difference between the TD target and the current estimate of the Q-value for the current state-action pair.
      q_value = self._q_func(observation, action)
      td_error = td_target - q_value

      # Update the weights of the neural network.
      # create an input vector from the observation and action, and fit the neural network weights using the input and the TD target as the target value. The input vector is reshaped as a row vector and the TD target is converted to a one-element array. The verbose argument is set to 0, which means that the training progress is not printed to the console.
      input_ = self._make_input(observation, action)
      self._model.fit(input_.reshape(1, -1), np.array([td_target]), verbose=0)

  def _make_input(self, obs, action):
    flatten_obs = tf.reshape(obs, shape=(tf.math.reduce_prod(obs.shape)))
    # Create one-hot encoded representation of the action.
    a = np.zeros([3])
    a[action] = 1
    # Concatenate the one-hot encoded action to the flattened observation.
    model_input = tf.concat([flatten_obs, a], axis=0)
    return model_input

  def _q_func(self, latest_obs, action):
    #To Do:
    #Write a q_func method to compute the Q-value as the output of the NN model
    # creates an input vector from the latest observation and action using the _make_input method of the agent. The input vector is used as input to the neural network to compute the Q-value.
    input_vec = self._make_input(latest_obs, action)
    # computes the Q-value for the given state-action pair using the neural network. The input vector is first reshaped as a batch of size 1 and passed as input to the neural network using the __call__ method of the network. The output of the network is a tensor of shape (1, 1) which contains the Q-value estimate for the given state-action pair. The [0][0] indexing is used to extract the scalar value from the tensor, and the .numpy() method is used to convert the tensor to a numpy array.
    output = self._model(input_vec[tf.newaxis, ...])[0][0].numpy()
    return output


  def select_action(self, latest_obs):
    #To Do:
    #Write the method for action selection. Its similar to the QLearning agent action selection method!
    #  conditional statement that checks if a randomly generated number is less than the exploration rate (epsilon) of the agent. If it is, the agent selects a random action to explore the environment.
    if np.random.rand() < self._epsilon:
      # selects a random action by generating a random integer between 0 and 2 (inclusive) using np.random.randint.
      action = np.random.randint(3)
    else:
      # computes the Q-value estimates for all possible actions in the current state using the _q_func method of the agent. The np.argmax function is then used to select the action index that corresponds to the maximum Q-value estimate.
      action = np.argmax([self._q_func(latest_obs, a) for a in range(3)])
    return action

  def set_epsilon(self, eps: float):
    self._epsilon = eps


### Train Q-learning agent with NNs

In [13]:
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()

# Create environment.
env = Catch()

#To Do:
# Build model for agent. three dense layers of size 50 activation relu, 10 activation relu, 1
# defines a neural network model using the Keras Sequential API. The model has three fully connected layers with 50, 10, and 1 neurons, respectively. The activation function used for the first two layers is ReLU, and there is no activation function for the output layer.
model = tf.keras.Sequential([tf.keras.layers.Dense(50, activation='relu'),
                             tf.keras.layers.Dense(10, activation='relu'),
                             tf.keras.layers.Dense(1)])
# defines an optimizer for the neural network. Adam is a popular optimization algorithm that adjusts the learning rate adaptively during training. The learning rate is set to 0.01.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
# compiles the neural network model by specifying the optimizer and loss function to be used during training. The mean squared error (MSE) function is used as the loss function, which measures the difference between the predicted and target Q-values.
model.compile(optimizer=optimizer, loss='mean_squared_error')
# Create agent.
# creates a QLearningNN agent object using the previously defined neural network model.
agent = QLearningNN(model)

# sets the device to be used for training and evaluation to the first GPU device available.
with tf.device('/device:GPU:0'):
  # trains the agent on the env environment for a specified number of episodes. The num_episodes argument is set to 1000.
  train(agent, env, num_episodes=1000)
  # evaluates the trained agent on the env environment and returns the total reward obtained over the evaluation period. The result is stored in the res variable.
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.8181818181818182
Episode: 20, Return: 1.0, Mean return: -0.3333333333333333
Episode: 30, Return: -1.0, Mean return: -0.4838709677419355
Episode: 40, Return: -1.0, Mean return: -0.6097560975609756
Episode: 50, Return: -1.0, Mean return: -0.6
Episode: 60, Return: -1.0, Mean return: -0.6
Episode: 70, Return: -1.0, Mean return: -0.72
Episode: 80, Return: -1.0, Mean return: -0.6
Episode: 90, Return: -1.0, Mean return: -0.56
Episode: 100, Return: -1.0, Mean return: -0.6
Episode: 110, Return: -1.0, Mean return: -0.52
Episode: 120, Return: -1.0, Mean return: -0.52
Episode: 130, Return: -1.0, Mean return: -0.52
Episode: 140, Return: -1.0, Mean return: -0.52
Episode: 150, Return: -1.0, Mean return: -0.48
Episode: 160, Return: -1.0, Mean return: -0.52
Episode: 170, Return: -1.0, Mean return: -0.52
Episode: 180, Return: -1.0, Mean return: -0.68
Episode: 190, Return: -1.0, Mean return: -0.64
Epi

In [14]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [15]:
anim

### Exercises

Experiment with model architectures:
* Try different activation functions.
* Change the layer sizes.
* Change the number of layers.

In [16]:
# enable numpy-like behavior for TensorFlow, which allows the code to use numpy operations on TensorFlow tensors.
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()
# function called changes that takes as input two activation functions and two layer sizes for a neural network.
def changes(act1, act2, size1, size2):
  # Create environment.
  # creates an instance of the Catch environment.
  env = Catch()

# define a neural network model with two hidden layers and an output layer. The hidden layers have sizes size1 and size2 and activation functions act1 and act2, respectively. The output layer has a single neuron and no activation function. The Adam optimizer is used with a learning rate of 0.01 and the mean squared error (MSE) function is used as the loss function.
  model = tf.keras.Sequential([tf.keras.layers.Dense(size1, activation=act1),
                              tf.keras.layers.Dense(size2, activation=act2),
                              tf.keras.layers.Dense(1)])
  optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
  model.compile(optimizer=optimizer, loss='mean_squared_error')

  # creates a new instance of the QLearningNN agent using the defined neural network model.
  agent = QLearningNN(model)

  # sets the device to be used for training and evaluation to the first GPU device available.
  with tf.device('/device:GPU:0'):
    # trains the agent on the env environment for 100 episodes.
    train(agent, env, num_episodes=100)

    # evaluates the trained agent on the env environment and returns the total reward obtained over the evaluation period. The result is stored in the res variable.
    res = evaluate(agent, env)


In [17]:
for i in ['relu', 'sigmoid']:
  for j in ['relu', 'sigmoid']:
    print("\n","first layer: ", i, " second layer: ", j, ": ")
    changes(i, j, 50, 10)


 first layer:  relu  second layer:  relu : 
Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.45454545454545453
Episode: 20, Return: 1.0, Mean return: -0.42857142857142855
Episode: 30, Return: 1.0, Mean return: -0.3548387096774194
Episode: 40, Return: -1.0, Mean return: -0.36585365853658536
Episode: 50, Return: -1.0, Mean return: -0.4
Episode: 60, Return: -1.0, Mean return: -0.48
Episode: 70, Return: 1.0, Mean return: -0.56
Episode: 80, Return: 1.0, Mean return: -0.64
Episode: 90, Return: 1.0, Mean return: -0.48

Evaluating agent...
Episode: 0, Return: -1.0
Episode: 1, Return: 1.0
Episode: 2, Return: 1.0
Episode: 3, Return: 1.0
Episode: 4, Return: 1.0
Episode: 5, Return: 1.0
Episode: 6, Return: 1.0
Episode: 7, Return: -1.0
Episode: 8, Return: 1.0
Episode: 9, Return: -1.0
mean: 0.4, std: 0.9165151389911681

 first layer:  relu  second layer:  sigmoid : 
Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Retu

In [18]:
for k in [60, 50]:
      for l in [30, 20]:
        print("\n", "first layer size: ", k, " second layer size: ", l, ": ")
        changes('relu', 'relu', k, l)


 first layer size:  60  second layer size:  30 : 
Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.45454545454545453
Episode: 20, Return: 1.0, Mean return: -0.42857142857142855
Episode: 30, Return: -1.0, Mean return: -0.4838709677419355
Episode: 40, Return: -1.0, Mean return: -0.6097560975609756
Episode: 50, Return: -1.0, Mean return: -0.56
Episode: 60, Return: -1.0, Mean return: -0.64
Episode: 70, Return: -1.0, Mean return: -0.72
Episode: 80, Return: -1.0, Mean return: -0.68
Episode: 90, Return: 1.0, Mean return: -0.56

Evaluating agent...
Episode: 0, Return: -1.0
Episode: 1, Return: -1.0
Episode: 2, Return: -1.0
Episode: 3, Return: 1.0
Episode: 4, Return: 1.0
Episode: 5, Return: -1.0
Episode: 6, Return: 1.0
Episode: 7, Return: -1.0
Episode: 8, Return: 1.0
Episode: 9, Return: -1.0
mean: -0.2, std: 0.9797958971132713

 first layer size:  60  second layer size:  20 : 
Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
E

In [19]:
# enable numpy-like behavior for TensorFlow, which allows the code to use numpy operations on TensorFlow tensors.
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()

# Create environment.
# creates an instance of the Catch environment.
env = Catch()

# defines a neural network model using the Keras Sequential API. The model has four fully connected layers with 50, 10, 10, and 1 neurons, respectively. The activation function used for all hidden layers is ReLU, and there is no activation function for the output layer.
model = tf.keras.Sequential([tf.keras.layers.Dense(50, activation='relu'),
                             tf.keras.layers.Dense(10, activation='relu'),
                             tf.keras.layers.Dense(10, activation='relu'),
                             tf.keras.layers.Dense(1)])
# compiles the neural network model by specifying the optimizer and loss function to be used during training. The Adam optimizer is used with a learning rate of 0.01, and the mean squared error (MSE) function is used as the loss function.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mean_squared_error')

# Create agent.
# creates a QLearningNN agent object using the previously defined neural network model.
agent = QLearningNN(model)

# sets the device to be used for training and evaluation to the first GPU device available.
with tf.device('/device:GPU:0'):
  # trains the agent on the env environment for a specified number of episodes. The num_episodes argument is set to 1000.
  train(agent, env, num_episodes=1000)
  # evaluates the trained agent on the env environment and returns the total reward obtained over the evaluation period. The result is stored in the res variable.
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.8181818181818182
Episode: 20, Return: 1.0, Mean return: -0.7142857142857143
Episode: 30, Return: -1.0, Mean return: -0.8064516129032258
Episode: 40, Return: -1.0, Mean return: -0.8048780487804879
Episode: 50, Return: -1.0, Mean return: -0.84
Episode: 60, Return: -1.0, Mean return: -0.84
Episode: 70, Return: -1.0, Mean return: -0.88
Episode: 80, Return: 1.0, Mean return: -0.8
Episode: 90, Return: -1.0, Mean return: -0.72
Episode: 100, Return: -1.0, Mean return: -0.6
Episode: 110, Return: -1.0, Mean return: -0.52
Episode: 120, Return: -1.0, Mean return: -0.56
Episode: 130, Return: -1.0, Mean return: -0.52
Episode: 140, Return: -1.0, Mean return: -0.52
Episode: 150, Return: -1.0, Mean return: -0.56
Episode: 160, Return: -1.0, Mean return: -0.52
Episode: 170, Return: -1.0, Mean return: -0.44
Episode: 180, Return: 1.0, Mean return: -0.48
Episode: 190, Return: 1.0, Mean return: -0.48
Epis

In [20]:
# enable numpy-like behavior for TensorFlow, which allows the code to use numpy operations on TensorFlow tensors.
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()

# Create environment.
# creates an instance of the Catch environment.
env = Catch()

# defines a neural network model using the Keras Sequential API. The model has four fully connected layers with 50, 30, 10, and 1 neurons, respectively. The activation function used for all hidden layers is ReLU, and there is no activation function for the output layer.
model = tf.keras.Sequential([tf.keras.layers.Dense(50, activation='relu'),
                             tf.keras.layers.Dense(30, activation='relu'),
                             tf.keras.layers.Dense(10, activation='relu'),
                             tf.keras.layers.Dense(1)])

# compiles the neural network model by specifying the optimizer and loss function to be used during training. The Adam optimizer is used with a learning rate of 0.01, and the mean squared error (MSE) function is used as the loss function.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mean_squared_error')

# Create agent.
# creates a QLearningNN agent object using the previously defined neural network model.
agent = QLearningNN(model)
# sets the device to be used for training and evaluation to the first GPU device available.
with tf.device('/device:GPU:0'):
  # trains the agent on the env environment for a specified number of episodes. The num_episodes argument is set to 1000.
  train(agent, env, num_episodes=1000)
  # evaluates the trained agent on the env environment and returns the total reward obtained over the evaluation period. The result is stored in the res variable.
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.45454545454545453
Episode: 20, Return: -1.0, Mean return: -0.7142857142857143
Episode: 30, Return: -1.0, Mean return: -0.7419354838709677
Episode: 40, Return: -1.0, Mean return: -0.6585365853658537
Episode: 50, Return: -1.0, Mean return: -0.64
Episode: 60, Return: -1.0, Mean return: -0.72
Episode: 70, Return: -1.0, Mean return: -0.68
Episode: 80, Return: -1.0, Mean return: -0.68
Episode: 90, Return: -1.0, Mean return: -0.76
Episode: 100, Return: -1.0, Mean return: -0.76
Episode: 110, Return: -1.0, Mean return: -0.72
Episode: 120, Return: -1.0, Mean return: -0.64
Episode: 130, Return: -1.0, Mean return: -0.6
Episode: 140, Return: -1.0, Mean return: -0.56
Episode: 150, Return: 1.0, Mean return: -0.56
Episode: 160, Return: 1.0, Mean return: -0.44
Episode: 170, Return: -1.0, Mean return: -0.52
Episode: 180, Return: -1.0, Mean return: -0.52
Episode: 190, Return: -1.0, Mean return: -0.52


### Write a report on your observations regarding the above exercises

QLearningNN Agent:

The QLearningNN agent is a reinforcement learning agent that uses a neural network to estimate the Q-values of state-action pairs. The agent uses the Q-learning algorithm, which is a model-free, off-policy, value-based learning algorithm. The agent learns from experience by updating its Q-values based on the observed rewards and the estimated Q-values of the next state.
From the output, we can see that the Q-learning agent with neural network approximation is trained on the Catch environment for 1000 episodes. During training, the agent's performance is measured in terms of the return obtained in each episode and the mean return over the last 10 episodes.

The mean return initially fluctuates between -1.0 and 1.0, indicating that the agent is initially exploring and trying out different actions. However, as training progresses, the mean return stabilizes around -0.6, indicating that the agent has learned a policy that performs moderately well on the environment.

The agent is then evaluated on the environment for 10 episodes, and the average return obtained is -0.4 with a standard deviation of 0.917. This suggests that the agent's learned policy is not very effective on the evaluation set and may need further tuning of hyperparameters or changes to the neural network architecture to improve its performance.

Overall, the code provides a good starting point for training and evaluating a Q-learning agent with neural network approximation on the Catch environment, but further work may be needed to optimize the agent's performance.

Experiment with model architectures:
* Try different activation functions.

Trial 1: The output shows the results of running the `changes` function with different combinations of activation functions for the neural network model used by the Q-learning agent. The output includes the training performance of the agent, including the return obtained in each episode and the mean return over the last 10 episodes, as well as the evaluation performance of the agent, including the total return obtained over the evaluation period.

Some observations that can be made from the output are:

- The performance of the Q-learning agent varies depending on the combination of activation functions used for the neural network model. For example, in the first run with `act1=relu` and `act2=relu`, the agent achieves a mean return of around -0.48 during training and an average return of 0.4 with a high standard deviation during evaluation. In contrast, in the run with `act1=sigmoid` and `act2=relu`, the agent achieves a mean return of around -0.68 during training and an average return of 0.6 with a low standard deviation during evaluation.

- The agent's performance during training generally improves as the number of episodes increases, indicating that the agent is gradually learning a better policy. However, the performance during evaluation is not always consistent with the training performance, and the agent may not generalize well to new situations.

- The standard deviation of the returns obtained during evaluation is generally high, indicating that the agent's performance is highly variable and not reliable. This suggests that further tuning of hyperparameters or changes to the neural network architecture may be needed to improve the agent's performance.

Overall, the output provides insights into the performance of a Q-learning agent with a neural network approximation on the Catch environment, and highlights the importance of careful selection of hyperparameters and network architecture to achieve good performance.


Trial 2: This output shows the results of running the changes function with act1=relu and act2=sigmoid for the neural network model used by the Q-learning agent. The output includes the training performance of the agent, including the return obtained in each episode and the mean return over the last 10 episodes, as well as the evaluation performance of the agent, including the total return obtained over the evaluation period.

Some observations that can be made from this output are:

During training, the agent achieves a mean return of around -0.48, which is similar to the performance of the agent with act1=relu and act2=relu.

During evaluation, the agent achieves a mean return of -0.2, which is lower than the performance of the agent with act1=relu and act2=relu but higher than the performance of the agent with act1=sigmoid and act2=relu.

The standard deviation of the returns obtained during evaluation is high, indicating that the agent's performance is highly variable and not reliable.

Overall, the performance of the agent with act1=relu and act2=sigmoid is mediocre and may not be suitable for this environment. More experiments with different activation functions and neural network architectures may be needed to find a configuration that performs well on this task.


Trial 3: This output shows the results of running the `changes` function with `act1=sigmoid` and `act2=relu` for the neural network model used by the Q-learning agent. The output includes the training performance of the agent, including the return obtained in each episode and the mean return over the last 10 episodes, as well as the evaluation performance of the agent, including the total return obtained over the evaluation period.

Some observations that can be made from this output are:

- During training, the agent achieves a mean return that fluctuates between -0.52 and -0.72, which is similar to the performance of the agent with `act1=relu` and `act2=sigmoid`.

- During evaluation, the agent achieves a mean return of -0.2, which is the same as the performance of the agent with `act1=relu` and `act2=sigmoid`.

- The standard deviation of the returns obtained during evaluation is high, indicating that the agent's performance is highly variable and not reliable.

- Overall, the performance of the agent with `act1=sigmoid` and `act2=relu` is mediocre and may not be suitable for this environment. More experiments with different activation functions and neural network architectures may be needed to find a configuration that performs well on this task.

In summary, comparing the outputs of the `changes` function with different activation function combinations, we can see that the performance of the Q-learning agent varies significantly with the choice of activation functions, and that finding a suitable combination of activation functions is crucial for achieving good performance on this task.

Trial 4:
This output shows the results of running the `changes` function with `act1=sigmoid` and `act2=sigmoid` for the neural network model used by the Q-learning agent. The output includes the training performance of the agent, including the return obtained in each episode and the mean return over the last 10 episodes, as well as the evaluation performance of the agent, including the total return obtained over the evaluation period.

Some observations that can be made from this output are:

- During training, the agent achieves a mean return that fluctuates between -0.6 and -0.806, which is worse than the performance of the previous two activation function combinations.

- During evaluation, the agent achieves a mean return of -0.8, which is the lowest of all the activation function combinations tested.

- The standard deviation of the returns obtained during evaluation is relatively low, indicating that the agent's performance is less variable than in the previous experiments.

- Overall, the performance of the agent with `act1=sigmoid` and `act2=sigmoid` is not good and may not be suitable for this environment. More experiments with different activation functions and neural network architectures may be needed to find a configuration that performs well on this task.

In summary, the output of the `changes` function with the `sigmoid` activation function in both layers suggests that the sigmoid function may not be suitable for this task, as the agent's performance is consistently poor.

* Change the layer sizes.

trial 1: first layer size: 60 second layer size: 30

he agent's average returns start low and fluctuate throughout the training process, but overall there is no clear trend of improvement. The agent is evaluated on 10 episodes and receives a mean return of -0.2 with a standard deviation of 0.9798.


trial 2: first layer:  60  second layer:  20

The agent's performance is similar to the first trial, with fluctuating returns and no clear trend of improvement. The agent is evaluated on 10 episodes and receives a mean return of -0.2 with a standard deviation of 0.9798, which is the same as the first trial.


trial 3: first layer:  50  second layer:  30

The agent's performance starts low but shows a slight improvement after the 20th episode. The agent's average returns continue to fluctuate but overall show a slight trend of improvement. The agent is evaluated on 10 episodes and receives a mean return of -0.2 with a standard deviation of 0.9798, which is the same as the first two trials.

trial 4: first layer:  50  second layer:  20

The agent's performance is consistently low throughout the training process with no clear trend of improvement. The agent is evaluated on 10 episodes and receives a mean return of -0.6 with a standard deviation of 0.8, which is the lowest mean return and the lowest standard deviation among all trials.


The impact of the choice of hidden layer sizes on the agent's performance depends on the complexity of the task and the amount and quality of training data available. In this particular task, the agent's performance may not be sensitive to the choice of hidden layer sizes because the task is relatively simple, and the agent can learn to perform it effectively with a variety of architectures. However, in more complex tasks or environments, the choice of hidden layer sizes can have a significant impact on the agent's ability to learn and perform optimally. For example, in tasks where the input data is high-dimensional or noisy, larger hidden layers may be required to effectively capture important features and patterns in the data. On the other hand, in tasks where the input data is low-dimensional or less complex, smaller hidden layers may be sufficient and larger hidden layers may lead to overfitting. In addition to the choice of hidden layer sizes, other factors such as the activation functions, the learning rate, and the optimization algorithm used for training the neural network can also have a significant impact on the agent's performance. Therefore, it is important to carefully tune these hyperparameters and experiment with different architectures to find the best configuration for a particular task

* Change the number of layers.

Trial 1: From the output, it seems that an agent is being trained using reinforcement learning. The agent's performance is being evaluated using two metrics: the return and the mean return. The return is a measure of the agent's performance in a single episode, while the mean return is the average return over a certain number of episodes.
During the training process, the agent's performance seems to be quite erratic, with its mean return fluctuating between negative and positive values. The agent seems to be struggling to learn the task, as indicated by its inconsistent performance.
After the training process, the agent is evaluated on a separate set of episodes, and the mean return and standard deviation are reported. The mean return of the agent during evaluation is -0.8, which indicates that the agent is still performing poorly even after training.
Overall, it seems that the agent's training process has not been successful in achieving good performance, and further improvements may be needed to improve its performance.
(mean: -0.8, std: 0.6)


Trial 2: From the training output, we can see that the agent is not performing very well, as the mean return is mostly negative, and the agent is not able to consistently achieve positive rewards. The evaluation output also confirms that the agent is not performing well, as the mean return is 0.0 and the standard deviation is 1.0, indicating that the agent is not able to consistently achieve positive rewards in the evaluation environment.
It's possible that the agent needs more training or that the hyperparameters need to be tuned in order to improve its performance. It may also be helpful to analyze the agent's behavior and try to identify any patterns or areas where it is struggling, in order to diagnose the problem and find ways to improve the agent's performance.
(mean: 0.0, std: 1.0)


## 3. Q-Learning agent with NNs and a Replay Buffer

Another way we can make our algorithm more efficient is by introducing a *Replay Buffer*. In the previous example, each `model.fit` method was called on a single transition (the very last one) in update function. Instead of fitting on a single datapoint, we can fit on a *set* of datapoints. To do this, we store a number of previously seen transitions $(S_i, a_i, S_{i+1})$ and at each update we fit the model on sample of these.

In [None]:
class QLearningNNReplay(Agent):
  """Simple Q-learning agent using a Neural Network and a replay buffer."""

  def __init__(self,
               model,
               max_replay_entries: int = 10000,
               num_samples_per_update: int = 10,
               epsilon: float = 0.1,
               discount: float = 0.99):
    self._model = model
    self._replay = []
    self._max_replay_entries = max_replay_entries
    self._num_samples_per_update = num_samples_per_update
    self._epsilon = epsilon
    self._discount = discount
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep
    # Add (S_i, a_i, S_i+1) to replay buffer.
    self._replay.append((self._timestep_before_action, self._latest_action,
                         self._timestep_after_action))
    if len(self._replay) >= self._max_replay_entries:
      # Remove a random entry from the buffer if capacity is reached.
      random_index = np.random.randint(len(self._replay))
      del self._replay[random_index]

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def update(self):
    # Sample `self._num_samples_per_update` from replay buffer.
    samples = [self._replay[np.random.randint(len(self._replay))]
               for _ in range(self._num_samples_per_update)]
    #To-Do
    #Create a function that update Q-values based on NN with Replay Buffer
    # For each sample, this code calculates the TD target using the Bellman equation, which is a combination of the observed reward and the estimated value of the next state. It then calculates the TD error, which is the difference between the TD target and the current Q-value estimate.
    for obs, action, next_obs in samples:
        reward = next_obs.reward
        discount = self._discount
        value_next = max([self._q_func(next_obs.observation, a)
                          for a in range(3)])
        td_target = reward + discount * value_next
        value = self._q_func(obs.observation, action)
        td_error = td_target - value
        # Update the Q-value for the observation-action pair.
        # updates the Q-value for the given observation-action pair using the TD error and the neural network model. The input tensor is created using the observation and action, and the model is called to get the Q-value estimates. The gradients of the loss function with respect to the model's trainable variables are then computed using TensorFlow's GradientTape, and the optimizer is used to apply the updates to the model's weights.
        input_tensor = self._make_input(obs.observation, action)
        with tf.GradientTape() as tape:
            q_values = self._model(input_tensor[None, ...], training=True)
            q_value = tf.reduce_sum(q_values)
            loss = td_error * q_value
        gradients = tape.gradient(loss, self._model.trainable_variables)
        self._model.optimizer.apply_gradients(zip(gradients, self._model.trainable_variables))


  def _make_input(self, obs, action):
    flatten_obs = tf.reshape(obs, shape=(tf.math.reduce_prod(obs.shape)))
    a = np.zeros([3])
    a[action] = 1  # One-hot action
    model_input = tf.concat([flatten_obs, a], axis=0) # Concatenate them
    return model_input

  def _q_func(self, latest_obs, action):
    #To Do:
    #Write a q_func method to compute the Q-value as the output of the NN model
    # computes the Q-value for a given observation-action pair using the neural network model. It takes in the latest observation and the action to be taken, and creates an input tensor using the _make_input() method. The model is then called with this input tensor to get the Q-value estimates for all possible actions. Finally, the Q-value for the given action is returned as the output.
    input_tensor = self._make_input(latest_obs, action)
    q_values = self._model(input_tensor[None, ...], training=False)
    output = q_values[0]
    return output

  def select_action(self, latest_obs):
     #To Do:
    #Write the method for action selection. Its similar to the QLearning agent action selection method!
    # selects an action to take given the latest observation by either choosing a random action with probability self._epsilon or selecting the action with the highest Q-value estimate for the given observation using the _best_action() method. If a random action is chosen, it is selected from the three possible actions (left, right, or stay) using np.random.choice().
    if np.random.uniform() < self._epsilon:
        return np.random.choice(3)
    else:
        return self._best_action(latest_obs)

  def set_epsilon(self, eps: float):
    self._epsilon = eps

### Train Q-learning agent with NNs and replay buffer

In [None]:
# Create environment.
env = Catch()

#To Do:
# Build model for agent.
# defines a neural network model using the Keras Sequential API. The model has three fully connected layers with 50, 10, and 1 neurons, respectively. The activation function used for all hidden layers is ReLU, and there is no activation function for the output layer. The Adam optimizer is used with a learning rate of 0.01, and the mean squared error (MSE) function is used as the loss function.
model = tf.keras.Sequential([tf.keras.layers.Dense(50, activation='relu'),
                             tf.keras.layers.Dense(10, activation='relu'),
                             tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mean_squared_error')
# Create agent.
# creates a QLearningNNReplay agent object using the previously defined neural network model.
agent = QLearningNNReplay(model)
# trains the agent on the env environment using the replay buffer algorithm. It first fills the replay buffer by randomly taking actions according to an epsilon-greedy policy and storing the resulting observations, actions, and rewards in the buffer. Then, it updates the Q-values using the _update() method, which samples batches from the replay buffer and computes the TD error for each sample. Finally, it evaluates the trained agent on the env environment and returns the total reward obtained over the evaluation period. The train() function takes the agent and env objects as arguments, and also accepts other optional arguments such as the number of episodes to run, the maximum number of steps per episode, and the batch size for updates. The evaluate() function takes the agent and env objects as arguments and returns the total reward obtained over a fixed evaluation period.
with tf.device('/device:GPU:0'):
  train(agent, env)
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.45454545454545453
Episode: 20, Return: -1.0, Mean return: -0.3333333333333333
Episode: 30, Return: -1.0, Mean return: -0.41935483870967744
Episode: 40, Return: -1.0, Mean return: -0.4146341463414634
Episode: 50, Return: -1.0, Mean return: -0.44
Episode: 60, Return: -1.0, Mean return: -0.52
Episode: 70, Return: -1.0, Mean return: -0.64
Episode: 80, Return: -1.0, Mean return: -0.56
Episode: 90, Return: -1.0, Mean return: -0.6
Episode: 100, Return: -1.0, Mean return: -0.6
Episode: 110, Return: -1.0, Mean return: -0.6
Episode: 120, Return: 1.0, Mean return: -0.44
Episode: 130, Return: -1.0, Mean return: -0.44
Episode: 140, Return: -1.0, Mean return: -0.48
Episode: 150, Return: -1.0, Mean return: -0.56
Episode: 160, Return: -1.0, Mean return: -0.56
Episode: 170, Return: -1.0, Mean return: -0.68
Episode: 180, Return: -1.0, Mean return: -0.8
Episode: 190, Return: -1.0, Mean return: -0.8
Ep

In [None]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [None]:
anim

### Exercises

Experiment with Replay Buffer settings:
* Modify the sampling method (e.g. give higher priority to recent items instead of sampling uniformly)
* Change the eviction strategy
* Change the size of the replay buffer (i.e. the maximum number of entries)
* Change the size of the samples.

In [26]:
class QLearningNNPrioritizedReplay(Agent):
  """Simple Q-learning agent using a Neural Network and a prioritized replay buffer."""
  # defines a QLearningNNPrioritizedReplay class that inherits from the Agent base class. The agent uses a neural network to approximate the Q-values and a prioritized replay buffer to store transitions. The class constructor takes in various hyperparameters such as the neural network model, the maximum size of the replay buffer, the number of samples to use per update, the exploration rate epsilon, the discount factor discount, the importance sampling exponent alpha, the prioritization exponent beta, and the number of steps over which to anneal beta. The constructor also initializes various instance variables such as the replay buffer, the latest action taken, the previous and current timesteps, and a timestep counter.
  def __init__(self,
               model,
               max_replay_entries: int = 10000,
               num_samples_per_update: int = 10,
               epsilon: float = 0.1,
               discount: float = 0.99,
               alpha: float = 0.6,
               beta: float = 0.4,
               beta_annealing_steps: int = 1000,
               max_episodes: int = 100):
    self._model = model
    self._replay = []
    self._max_replay_entries = max_replay_entries
    self._num_samples_per_update = num_samples_per_update
    self._epsilon = epsilon
    self._discount = discount
    self._alpha = alpha
    self._beta = beta
    self._beta_annealing_steps = beta_annealing_steps
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None
    self._timestep_count = 0
# These methods are used to observe the environment and store transitions in the replay buffer. observe_first() is called at the start of each episode to set the initial state, while observe() is called after each action to store the current transition. The method computes the TD error and priority for the current transition and adds it to the replay buffer. If the buffer size exceeds the maximum capacity, the least prioritized transition is removed from the buffer.
  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep
    # Add (S_i, a_i, S_i+1) to replay buffer with priority.
    td_error = self._compute_td_error()
    priority = abs(td_error) + 1e-6
    self._replay.append((self._timestep_before_action, self._latest_action,
                         self._timestep_after_action, priority))
    if len(self._replay) >= self._max_replay_entries:
      # Remove the least prioritized entry from the buffer if capacity is reached.
      priorities = [x[3] for x in self._replay]
      min_priority_index = np.argmin(priorities)
      del self._replay[min_priority_index]

  # computes the TD error for the current transition by using the Q-value function to estimate the value of the next state and subtracting the estimated value of the current state. It returns 0 if no action has been taken yet.
  def _compute_td_error(self):
    if self._timestep_before_action is not None:
      reward = self._timestep_after_action.reward
      discount = self._discount
      value_next = max([self._q_func(self._timestep_after_action.observation, a)
                        for a in range(3)])
      td_target = reward + discount * value_next
      value = self._q_func(self._timestep_before_action.observation,
                           self._latest_action)
      td_error = td_target - value
      return td_error
    return 0

  def update(self):
    # Compute priorities for all the entries in the replay buffer.
    priorities = [abs(self._compute_td_error()) + 1e-6 for _ in range(len(self._replay))]
    # Compute sampling probabilities for all the entries.
    priorities_sum = np.sum(np.power(priorities, self._alpha))
    probs = np.power(priorities, self._alpha) / priorities_sum
    # Flatten the probabilities array.
    probs = probs.ravel()
    # Sample `self._num_samples_per_update` from replay buffer based on the computed probabilities.
    indices = np.random.choice(len(self._replay), size=self._num_samples_per_update, replace=True, p=probs)
    samples = [self._replay[i] for i in indices]
    # Compute importance-sampling weights for the sampled entries.
    weights = np.power(len(self._replay) * probs[indices], -self._beta)
    weights /= np.max(weights)
    # Update the neural network using the sampled entries and their weights.
    for i, (obs, action, next_obs, priority) in enumerate(samples):
      reward = next_obs.reward
      discount = self._discount
      value_next = max([self._q_func(next_obs.observation, a)
                        for a in range(3)])
      td_target = reward + discount * value_next
      value = self._q_func(obs.observation, action)
      td_error = td_target - value
      # Update the Q-value for the observation-action pair.
      input_tensor = self._make_input(obs.observation, action)
      with tf.GradientTape() as tape:
        q_values = self._model(input_tensor[None, ...], training=True)
        q_value = tf.reduce_sum(q_values)
        loss = td_error * q_value * weights[i]
      gradients = tape.gradient(loss, self._model.trainable_variables)
      self._model.optimizer.apply_gradients(zip(gradients, self._model.trainable_variables))
      # Update priorities for the sampled entries.
      new_priority = abs(td_error) + 1e-6
      self._replay[indices[i]] = (obs, action, next_obs, new_priority)

    # Anneal beta parameter.
    if self._timestep_count % self._beta_annealing_steps == 0:
      self._beta = min(1.0, self._beta + 0.1)

    self._timestep_count += 1

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def _make_input(self, obs, action):
    flatten_obs = tf.reshape(obs, shape=(tf.math.reduce_prod(obs.shape)))
    a = np.zeros([3])
    a[action] = 1  # One-hot action
    model_input = tf.concat([flatten_obs, a], axis=0) # Concatenate them
    return model_input

  def _q_func(self, latest_obs, action):
    input_tensor = self._make_input(latest_obs, action)
    q_values = self._model(input_tensor[None, ...], training=False)
    output = q_values[0]
    return output

  def select_action(self, latest_obs):
    if np.random.uniform() < self._epsilon:
      return np.random.choice(3)
    else:
      return self._best_action(latest_obs)

  def set_epsilon(self, eps: float):
    self._epsilon = eps

# Create environment.
env = Catch()

# Build model for agent.
model = tf.keras.Sequential([tf.keras.layers.Dense(50, activation='relu'),
                             tf.keras.layers.Dense(10, activation='relu'),
                             tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mean_squared_error')

# Create agent with prioritized replay buffer.
agent = QLearningNNPrioritizedReplay(model)

with tf.device('/device:GPU:0'):
  train(agent, env, num_episodes = 100)
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.8181818181818182
Episode: 20, Return: -1.0, Mean return: -0.8095238095238095
Episode: 30, Return: -1.0, Mean return: -0.7419354838709677
Episode: 40, Return: -1.0, Mean return: -0.6585365853658537
Episode: 50, Return: -1.0, Mean return: -0.64
Episode: 60, Return: -1.0, Mean return: -0.6
Episode: 70, Return: -1.0, Mean return: -0.56
Episode: 80, Return: -1.0, Mean return: -0.48
Episode: 90, Return: -1.0, Mean return: -0.56

Evaluating agent...
Episode: 0, Return: -1.0
Episode: 1, Return: 1.0
Episode: 2, Return: -1.0
Episode: 3, Return: -1.0
Episode: 4, Return: -1.0
Episode: 5, Return: 1.0
Episode: 6, Return: -1.0
Episode: 7, Return: -1.0
Episode: 8, Return: -1.0
Episode: 9, Return: -1.0
mean: -0.6, std: 0.8000000000000002


In [27]:
class QLearningNNReplayEvictionStrategy(Agent):
  """Simple Q-learning agent using a Neural Network and a replay buffer."""

  def __init__(self,
               model,
               max_replay_entries: int = 10000,
               num_samples_per_update: int = 10,
               epsilon: float = 0.1,
               discount: float = 0.99,
               max_episodes: int = 100):
    self._model = model
    self._replay = []
    self._max_replay_entries = max_replay_entries
    self._num_samples_per_update = num_samples_per_update
    self._epsilon = epsilon
    self._discount = discount
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None
    self._replay_index = 0

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep
    # Add (S_i, a_i, S_i+1) to replay buffer.
    if len(self._replay) < self._max_replay_entries:
      self._replay.append((self._timestep_before_action, self._latest_action,
                           self._timestep_after_action))
    else:
      self._replay[self._replay_index] = (self._timestep_before_action,
                                          self._latest_action,
                                          self._timestep_after_action)
      self._replay_index = (self._replay_index + 1) % self._max_replay_entries

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def update(self):
    # Sample `self._num_samples_per_update` from replay buffer.
    samples = [self._replay[np.random.randint(len(self._replay))]
               for _ in range(self._num_samples_per_update)]
    for obs, action, next_obs in samples:
        reward = next_obs.reward
        discount = self._discount
        value_next = max([self._q_func(next_obs.observation, a)
                          for a in range(3)])
        td_target = reward + discount * value_next
        value = self._q_func(obs.observation, action)
        td_error = td_target - value
        # Update the Q-value for the observation-action pair.
        input_tensor = self._make_input(obs.observation, action)
        with tf.GradientTape() as tape:
            q_values = self._model(input_tensor[None, ...], training=True)
            q_value = tf.reduce_sum(q_values)
            loss = td_error * q_value
        gradients = tape.gradient(loss, self._model.trainable_variables)
        self._model.optimizer.apply_gradients(zip(gradients, self._model.trainable_variables))


  def _make_input(self, obs, action):
    flatten_obs = tf.reshape(obs, shape=(tf.math.reduce_prod(obs.shape)))
    a = np.zeros([3])
    a[action] = 1  # One-hot action
    model_input = tf.concat([flatten_obs, a], axis=0) # Concatenate them
    return model_input

  def _q_func(self, latest_obs, action):
    input_tensor = self._make_input(latest_obs, action)
    q_values = self._model(input_tensor[None, ...], training=False)
    output = q_values[0]
    return output

  def select_action(self, latest_obs):
    if np.random.uniform() < self._epsilon:
        return np.random.choice(3)
    else:
        return self._best_action(latest_obs)

  def set_epsilon(self, eps: float):
    self._epsilon = eps

# Create environment.
env = Catch()

# Build model for agent.
model = tf.keras.Sequential([tf.keras.layers.Dense(50, activation='relu'),
                             tf.keras.layers.Dense(10, activation='relu'),
                             tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mean_squared_error')

# Create agent.
agent = QLearningNNReplayEvictionStrategy(model, max_replay_entries=10000)

with tf.device('/device:GPU:0'):
  train(agent, env, num_episodes = 100)
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.6363636363636364
Episode: 20, Return: -1.0, Mean return: -0.7142857142857143
Episode: 30, Return: -1.0, Mean return: -0.6774193548387096
Episode: 40, Return: -1.0, Mean return: -0.7073170731707317
Episode: 50, Return: 1.0, Mean return: -0.6
Episode: 60, Return: -1.0, Mean return: -0.6
Episode: 70, Return: -1.0, Mean return: -0.6
Episode: 80, Return: -1.0, Mean return: -0.6
Episode: 90, Return: -1.0, Mean return: -0.6

Evaluating agent...
Episode: 0, Return: -1.0
Episode: 1, Return: -1.0
Episode: 2, Return: 1.0
Episode: 3, Return: -1.0
Episode: 4, Return: -1.0
Episode: 5, Return: -1.0
Episode: 6, Return: -1.0
Episode: 7, Return: -1.0
Episode: 8, Return: -1.0
Episode: 9, Return: -1.0
mean: -0.8, std: 0.6000000000000001


In [28]:
class QLearningNNReplaySize(Agent):
  """Simple Q-learning agent using a Neural Network and a replay buffer."""

  def __init__(self,
               model,
               max_replay_entries: int = 20000,
               num_samples_per_update: int = 10,
               epsilon: float = 0.1,
               discount: float = 0.99,
               max_episodes: int = 100):
    self._model = model
    self._replay = []
    self._max_replay_entries = max_replay_entries
    self._num_samples_per_update = num_samples_per_update
    self._epsilon = epsilon
    self._discount = discount
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep
    # Add (S_i, a_i, S_i+1) to replay buffer.
    self._replay.append((self._timestep_before_action, self._latest_action,
                         self._timestep_after_action))
    if len(self._replay) >= self._max_replay_entries:
      # Remove a random entry from the buffer if capacity is reached.
      random_index = np.random.randint(len(self._replay))
      del self._replay[random_index]

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def update(self):
    # Sample `self._num_samples_per_update` from replay buffer.
    samples = [self._replay[np.random.randint(len(self._replay))]
               for _ in range(self._num_samples_per_update)]
    #To-Do
    #Create a function that update Q-values based on NN with Replay Buffer
    for obs, action, next_obs in samples:
        reward = next_obs.reward
        discount = self._discount
        value_next = max([self._q_func(next_obs.observation, a)
                          for a in range(3)])
        td_target = reward + discount * value_next
        value = self._q_func(obs.observation, action)
        td_error = td_target - value
        # Update the Q-value for the observation-action pair.
        input_tensor = self._make_input(obs.observation, action)
        with tf.GradientTape() as tape:
            q_values = self._model(input_tensor[None, ...], training=True)
            q_value = tf.reduce_sum(q_values)
            loss = td_error * q_value
        gradients = tape.gradient(loss, self._model.trainable_variables)
        self._model.optimizer.apply_gradients(zip(gradients, self._model.trainable_variables))


  def _make_input(self, obs, action):
    flatten_obs = tf.reshape(obs, shape=(tf.math.reduce_prod(obs.shape)))
    a = np.zeros([3])
    a[action] = 1  # One-hot action
    model_input = tf.concat([flatten_obs, a], axis=0) # Concatenate them
    return model_input

  def _q_func(self, latest_obs, action):
    #To Do:
    #Write a q_func method to compute the Q-value as the output of the NN model
    input_tensor = self._make_input(latest_obs, action)
    q_values = self._model(input_tensor[None, ...], training=False)
    output = q_values[0]
    return output

  def select_action(self, latest_obs):
     #To Do:
    #Write the method for action selection. Its similar to the QLearning agent action selection method!
    if np.random.uniform() < self._epsilon:
        return np.random.choice(3)
    else:
        return self._best_action(latest_obs)
  def set_epsilon(self, eps: float):
    self._epsilon = eps

# Create environment.
env = Catch()

#To Do:
# Build model for agent.
model = tf.keras.Sequential([tf.keras.layers.Dense(50, activation='relu'),
                             tf.keras.layers.Dense(10, activation='relu'),
                             tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mean_squared_error')
# Create agent.
agent = QLearningNNReplaySize(model, max_replay_entries=20000)

with tf.device('/device:GPU:0'):
  train(agent, env, num_episodes = 100)
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.45454545454545453
Episode: 20, Return: -1.0, Mean return: -0.7142857142857143
Episode: 30, Return: -1.0, Mean return: -0.6774193548387096
Episode: 40, Return: -1.0, Mean return: -0.6097560975609756
Episode: 50, Return: -1.0, Mean return: -0.6
Episode: 60, Return: -1.0, Mean return: -0.68
Episode: 70, Return: -1.0, Mean return: -0.64
Episode: 80, Return: 1.0, Mean return: -0.64
Episode: 90, Return: 1.0, Mean return: -0.6

Evaluating agent...
Episode: 0, Return: -1.0
Episode: 1, Return: -1.0
Episode: 2, Return: -1.0
Episode: 3, Return: 1.0
Episode: 4, Return: 1.0
Episode: 5, Return: -1.0
Episode: 6, Return: 1.0
Episode: 7, Return: -1.0
Episode: 8, Return: 1.0
Episode: 9, Return: -1.0
mean: -0.2, std: 0.9797958971132713


In [29]:
class QLearningNNReplaySamples(Agent):
  """Simple Q-learning agent using a Neural Network and a replay buffer."""

  def __init__(self,
               model,
               max_replay_entries: int = 10000,
               num_samples_per_update: int = 20,
               epsilon: float = 0.1,
               discount: float = 0.99,
               max_episodes: int = 100):
    self._model = model
    self._replay = []
    self._max_replay_entries = max_replay_entries
    self._num_samples_per_update = num_samples_per_update
    self._epsilon = epsilon
    self._discount = discount
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep
    # Add (S_i, a_i, S_i+1) to replay buffer.
    self._replay.append((self._timestep_before_action, self._latest_action,
                         self._timestep_after_action))
    if len(self._replay) >= self._max_replay_entries:
      # Remove a random entry from the buffer if capacity is reached.
      random_index = np.random.randint(len(self._replay))
      del self._replay[random_index]

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def update(self):
    # Sample `self._num_samples_per_update` from replay buffer.
    samples = [self._replay[np.random.randint(len(self._replay))]
               for _ in range(self._num_samples_per_update)]
    #To-Do
    #Create a function that update Q-values based on NN with Replay Buffer
    for obs, action, next_obs in samples:
        reward = next_obs.reward
        discount = self._discount
        value_next = max([self._q_func(next_obs.observation, a)
                          for a in range(3)])
        td_target = reward + discount * value_next
        value = self._q_func(obs.observation, action)
        td_error = td_target - value
        # Update the Q-value for the observation-action pair.
        input_tensor = self._make_input(obs.observation, action)
        with tf.GradientTape() as tape:
            q_values = self._model(input_tensor[None, ...], training=True)
            q_value = tf.reduce_sum(q_values)
            loss = td_error * q_value
        gradients = tape.gradient(loss, self._model.trainable_variables)
        self._model.optimizer.apply_gradients(zip(gradients, self._model.trainable_variables))


  def _make_input(self, obs, action):
    flatten_obs = tf.reshape(obs, shape=(tf.math.reduce_prod(obs.shape)))
    a = np.zeros([3])
    a[action] = 1  # One-hot action
    model_input = tf.concat([flatten_obs, a], axis=0) # Concatenate them
    return model_input

  def _q_func(self, latest_obs, action):
    #To Do:
    #Write a q_func method to compute the Q-value as the output of the NN model
    input_tensor = self._make_input(latest_obs, action)
    q_values = self._model(input_tensor[None, ...], training=False)
    output = q_values[0]
    return output

  def select_action(self, latest_obs):
     #To Do:
    #Write the method for action selection. Its similar to the QLearning agent action selection method!
    if np.random.uniform() < self._epsilon:
        return np.random.choice(3)
    else:
        return self._best_action(latest_obs)
  def set_epsilon(self, eps: float):
    self._epsilon = eps

# Create environment.
env = Catch()

#To Do:
# Build model for agent.
model = tf.keras.Sequential([tf.keras.layers.Dense(50, activation='relu'),
                             tf.keras.layers.Dense(10, activation='relu'),
                             tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mean_squared_error')
# Create agent.
agent = QLearningNNReplaySamples(model, num_samples_per_update=20)

with tf.device('/device:GPU:0'):
  train(agent, env, num_episodes = 100)
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.2727272727272727
Episode: 20, Return: -1.0, Mean return: -0.6190476190476191
Episode: 30, Return: -1.0, Mean return: -0.5483870967741935
Episode: 40, Return: -1.0, Mean return: -0.4634146341463415
Episode: 50, Return: -1.0, Mean return: -0.48
Episode: 60, Return: -1.0, Mean return: -0.6
Episode: 70, Return: -1.0, Mean return: -0.52
Episode: 80, Return: -1.0, Mean return: -0.48
Episode: 90, Return: -1.0, Mean return: -0.6

Evaluating agent...
Episode: 0, Return: -1.0
Episode: 1, Return: 1.0
Episode: 2, Return: -1.0
Episode: 3, Return: -1.0
Episode: 4, Return: -1.0
Episode: 5, Return: 1.0
Episode: 6, Return: -1.0
Episode: 7, Return: -1.0
Episode: 8, Return: -1.0
Episode: 9, Return: -1.0
mean: -0.6, std: 0.8000000000000002


### Write a report on your observations regarding the above exercises


This is the implementation of a Q-learning agent using a neural network and a replay buffer. The agent stores the experience tuples of (state, action, next_state) in a replay buffer and samples a batch of these tuples to update the Q-values using the Bellman equation. The Q-values are estimated using a neural network, which takes in the state-action pairs as inputs and outputs the Q-values for each possible action.

The agent uses an epsilon-greedy policy to select actions, where it selects random actions with probability epsilon and selects the action with the highest Q-value estimate for the given observation with probability 1 - epsilon. The epsilon value can be set using the set_epsilon() method.

The _make_input() method concatenates the flattened observation and the one-hot encoded action to create the input tensor for the neural network. The _q_func() method takes in the latest observation and the action to be taken, creates the input tensor using the _make_input() method, and returns the Q-value estimate for the given action.

The update() method samples a batch of experience tuples from the replay buffer and updates the Q-values using the Bellman equation and the neural network. It calculates the TD target and the TD error for each tuple, and updates the Q-value for the observation-action pair using the TD error and the neural network model. The gradients of the loss function with respect to the model's trainable variables are computed using TensorFlow's GradientTape, and the optimizer is used to apply the updates to the model's weights.

The output of the code shows that the agent was able to learn to play the Catch game environment using Q-learning with a neural network function approximator. The mean return, which is a measure of the agent's performance, improved gradually over the course of training, indicating that the agent was learning and making progress.
However, the mean return did not consistently improve with every episode, which is normal for reinforcement learning algorithms. The agent needs to explore different actions to find an optimal policy, which can lead to fluctuations in performance before converging to a stable policy.
The use of a neural network as the function approximator allows the agent to learn a complex mapping from the state-action space to the Q-values. This can be advantageous for environments with high-dimensional state spaces, where traditional tabular methods become infeasible.
Overall, the output of the code suggests that the Q-learning algorithm with a neural network function approximator is a promising approach for learning to play simple games such as Catch. Further research could explore the performance of this approach on more complex environments and compare it to other reinforcement learning algorithms.


Experiment with Replay Buffer settings:
1- Modify the sampling method (e.g. give higher priority to recent items instead of sampling uniformly)
Based on the training output, the agent is being trained for a game and its performance is being evaluated after every 10 episodes. The training output shows the episode number, the return obtained by the agent in that episode, and the mean return over the last 10 episodes. The return is the cumulative reward obtained by the agent during the episode.
The agent is not performing well during training, as it is consistently getting a return of -1.0, which means it is losing the game in every episode. The mean return over the last 10 episodes is also decreasing over time, indicating that the agent is not learning to improve its performance.
After training, the agent's performance is evaluated on the game environment using the evaluate() function. The output shows the return obtained by the agent in each of the 10 evaluation episodes, followed by the mean and standard deviation of the returns. In this case, the agent is again not performing well during evaluation, as it is getting a negative mean return of -0.6. The standard deviation of 0.8 indicates that the agent's performance is quite variable across episodes.
It is possible that the hyperparameters used for training the agent are not well-tuned, or that the neural network model used for function approximation is not able to capture the relevant features of the game state. Further experimentation and tuning may be necessary to improve the agent's performance.
(mean: -0.6, std: 0.8)


2- Change the eviction strategy
The output shows the training and evaluation results of the QLearningNNReplayEvictionStrategy agent in the Catch game environment. During training, the agent plays 100 episodes and updates its Q-value estimates based on the transitions stored in the replay buffer. At each timestep, the agent selects an action based on an epsilon-greedy exploration strategy, where it chooses the best action based on its Q-value estimates with probability 1-epsilon and a random action with probability epsilon.
The training output shows the return of each episode, which is the sum of rewards obtained during the episode. The mean return is also shown, which is the average return over the last 10 episodes. The return is negative when the agent fails to catch any fruit and positive when it catches at least one fruit. As we can see, the agent performs poorly during training, with a mean return of around -0.6 to -0.7, indicating that it fails to catch many fruits.
After training, the agent is evaluated on 10 episodes using the greedy policy (i.e., epsilon=0), where it always chooses the action with the highest Q-value estimate. The output shows the return of each episode and the mean and standard deviation of the returns. The evaluation results are also poor, with a mean return of -0.8, indicating that the agent still fails to catch many fruits even with the greedy policy.
Overall, these results suggest that the QLearningNNReplayEvictionStrategy agent is not very effective in learning a good policy for the Catch game environment and may require further modifications to improve its performance. Possible modifications include using different neural network architectures, adjusting the hyperparameters of the agent (e.g., learning rate, discount factor, epsilon), or implementing more sophisticated exploration and exploitation strategies. Additionally, it may be helpful to visualize the agent's behavior during training and evaluation to gain more insights into its strengths and weaknesses.
(mean: -0.8, std: 0.6)


3- Change the size of the replay buffer (i.e. the maximum number of entries)
The output shows the training and evaluation results of the QLearningNNReplaySize agent in the Catch game environment. During training, the agent plays 100 episodes and updates its Q-value estimates based on the transitions stored in the replay buffer. At each timestep, the agent selects an action based on an epsilon-greedy exploration strategy, where it chooses the best action based on its Q-value estimates with probability 1-epsilon and a random action with probability epsilon.
The training output shows the return of each episode, which is the sum of rewards obtained during the episode. The mean return is also shown, which is the average return over the last 10 episodes. As we can see, the agent performs similarly as the QLearningNNReplayEvictionStrategy agent during training, with a mean return of around -0.6 to -0.7, indicating that it fails to catch many fruits.
After training, the agent is evaluated on 10 episodes using the greedy policy (i.e., epsilon=0), where it always chooses the action with the highest Q-value estimate. The output shows the return of each episode and the mean and standard deviation of the returns. The evaluation results are also poor, with a mean return of -0.2, indicating that the agent still fails to catch many fruits even with the greedy policy.
Overall, the performance of the QLearningNNReplaySize agent is similar to the QLearningNNReplayEvictionStrategy agent, and both agents perform poorly in the Catch game environment. This suggests that further modifications may be necessary to improve the performance of the agent, such as trying different neural network architectures, adjusting the hyperparameters, or implementing more sophisticated exploration and exploitation strategies. Additionally, it may be helpful to visualize the agent's behavior during training and evaluation to gain more insights into its strengths and weaknesses.
(mean: -0.2, std: 0.979)


4- Change the size of the samples:
The output shows the training and evaluation results of the QLearningNNReplaySamples agent in the Catch game environment. During training, the agent plays 100 episodes and updates its Q-value estimates based on a fixed number of transitions sampled from the replay buffer for each update step. At each timestep, the agent selects an action based on an epsilon-greedy exploration strategy, where it chooses the best action based on its Q-value estimates with probability 1-epsilon and a random action with probability epsilon.
The training output shows the return of each episode, which is the sum of rewards obtained during the episode. The mean return is also shown, which is the average return over the last 10 episodes. As we can see, the agent performs similarly as the previous agents during training, with a mean return of around -0.5 to -0.6, indicating that it fails to catch many fruits.
After training, the agent is evaluated on 10 episodes using the greedy policy (i.e., epsilon=0), where it always chooses the action with the highest Q-value estimate. The output shows the return of each episode and the mean and standard deviation of the returns. The evaluation results are also poor, with a mean return of -0.6 and a high standard deviation of 0.8, indicating that the agent still fails to catch many fruits even with the greedy policy.
Overall, the performance of the QLearningNNReplaySamples agent is similar to the previous Q-learning agents, and all agents perform poorly in the Catch game environment. This suggests that further modifications may be necessary to improve the performance of the agent, such as trying different neural network architectures, adjusting the hyperparameters, or implementing more sophisticated exploration and exploitation strategies. Additionally, it may be helpful to visualize the agent's behavior during training and evaluation to gain more insights into its strengths and weaknesses.
(mean: -0.6, std: 0.8)
