# Assignment 3

In the previous tutorial you learned how to implement a *Q-learning* agent and train it on the Catch environment.

In this assignment you will train two further agents:

*   Q-Learning agent with Neural Networks
*   Q-Learning agent with NNs and a Replay Buffer

Complete the To-Do sections so that the agents can be trained. Then, for each part, change the hyperparameters as described and write a report on your observations.



In [1]:
#@title Imports
%%capture
!pip install dm_env

import abc
import collections
import dm_env
import numpy as np
import tensorflow as tf

from matplotlib import pyplot as plt
import matplotlib.animation as animation
from matplotlib import rc
rc('animation', html='jshtml')
%matplotlib inline

## Reinforcement learning

The **agent** interacts with the **environment** in a loop corresponding to the following diagram. The environment defines a set of <font color='blue'>**actions**</font> that an agent can take. The agent takes an action informed by the <font color='red'>**observations**</font> it receives, and gets a <font color='green'>**reward**</font> from the environment after each action. The goal in RL is to find an agent whose actions maximize the total accumulation of rewards obtained from the environment.


<center><img src="https://drive.google.com/uc?id=1sVOD2Ux5F_1Yq3KjyLOKFjFm2WRNTbIH" width="500" /></center>

Relevant terminology (more in this [glossary](https://developers.google.com/machine-learning/glossary/rl)):
 * **agent:** The entity that uses a policy to maximize the expected return gained from transitioning between states of the environment.
 * **environment:** A world with given dynamics that the agent can interact with. When the agent applies an action to the environment, then the environment transitions between states according to its internal dynamics.
 * **environment loop:** A process during which an agent interacts with the environment (i.e. repeatedly executes actions, observes the changes in the environment state, and potentially learns from this experience).
 * **timestep:** A set of information that captures all relevant aspects of a single interaction between the agent and the environment. Most importantly, the observation and the received reward.
 * **episode:** Each of the repeated attempts by the agent to learn an environment.
 * **observation:** The state of the environment at a given time which the agent can observe.
 * **policy:** An agent's probabilistic mapping from states to actions.
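
As a rough illustration of these terms, the sketch below runs one episode of a made-up countdown environment with a random policy. `CountdownEnv`, `random_policy` and `run_episode` are hypothetical names invented for this example; they are not part of `dm_env` or of the Catch code used later.

```python
import random


class CountdownEnv:
  """Toy environment: an episode lasts `length` steps, reward 1.0 for action 1."""

  def __init__(self, length=5):
    self._length = length
    self._t = 0

  def reset(self):
    self._t = 0
    return self._t  # The observation is simply the step counter.

  def step(self, action):
    self._t += 1
    reward = 1.0 if action == 1 else 0.0
    done = self._t >= self._length
    return self._t, reward, done


def random_policy(observation, rng):
  """Ignores the observation and picks action 0 or 1 uniformly at random."""
  return rng.choice([0, 1])


def run_episode(env, policy, rng):
  """Environment loop: act, observe, accumulate rewards until the episode ends."""
  observation = env.reset()
  episode_return, done, num_steps = 0.0, False, 0
  while not done:
    action = policy(observation, rng)
    observation, reward, done = env.step(action)
    episode_return += reward
    num_steps += 1
  return episode_return, num_steps


rng = random.Random(0)
episode_return, num_steps = run_episode(CountdownEnv(length=5), random_policy, rng)
print(f'steps: {num_steps}, return: {episode_return}')
```

Each pass through the `while` loop is one timestep, the whole call to `run_episode` is one episode, and the accumulated sum of rewards is that episode's return.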


## The Catch environment

*Catch* is a classic, simple RL environment, where the agent needs to learn to catch a falling ball by moving a paddle around. Below we provide a simple implementation of the environment, in which the three scalar actions $(0, 1, 2)$ correspond to moving the paddle left, keeping it in place, and moving it right, respectively. The agent gets a reward of $1.0$ if the paddle is directly below the ball when it reaches the bottom row of the board, and $-1.0$ otherwise. All earlier steps yield a reward of $0.0$.

<img src="https://drive.google.com/uc?id=1xkpEZAkl08E_XJQsCe8b3Y0JYRhsScS2" width="400">


In [2]:
#@title Catch environment implementation
_ACTIONS = (0, 1, 2)  # Left, no-op, right.


class Catch(dm_env.Environment):
  """A Catch environment built on the `dm_env.Environment` class."""

  def __init__(self, rows=10, columns=5, seed=1):
    self._rows = rows
    self._columns = columns
    self._rng = np.random.RandomState(seed)
    self._board = np.zeros((rows, columns), dtype=np.float32)
    self._ball_x = None
    self._ball_y = None
    self._paddle_x = None
    self._paddle_y = self._rows - 1
    self._reset_next_step = True

  def reset(self):
    """Returns the first `TimeStep` of a new episode."""
    self._reset_next_step = False
    self._ball_x = self._rng.randint(self._columns)
    self._ball_y = 0
    self._paddle_x = self._columns // 2
    return dm_env.restart(self._observation())

  def step(self, action):
    """Updates the environment according to the action."""
    if self._reset_next_step:
      return self.reset()

    # Move the paddle.
    dx = _ACTIONS[action] - 1
    self._paddle_x = np.clip(self._paddle_x + dx, 0, self._columns - 1)

    # Drop the ball.
    self._ball_y += 1

    # Check for termination.
    if self._ball_y == self._paddle_y:
      reward = 1. if self._paddle_x == self._ball_x else -1.
      self._reset_next_step = True
      return dm_env.termination(reward=reward, observation=self._observation())
    else:
      return dm_env.transition(reward=0., observation=self._observation())

  def _observation(self):
    self._board.fill(0.)
    self._board[self._ball_y, self._ball_x] = 1.
    self._board[self._paddle_y, self._paddle_x] = 1.
    return self._board.copy()

  def observation_spec(self):
    return dm_env.specs.BoundedArray(
        shape=self._board.shape,
        dtype=self._board.dtype,
        name="board",
        minimum=0,
        maximum=1)

  def action_spec(self):
    return dm_env.specs.DiscreteArray(
        dtype=int, num_values=len(_ACTIONS), name="action")

### Let's observe a random agent acting on Catch!

First we are going to take a look at what the agent-environment interaction looks like when an agent acts randomly. We see that the board is represented by a $10\times 5$ array of zeroes, where both the ball and the paddle position are denoted by a value of $1.0$.

In [3]:
#@title Take random actions
env = Catch()

res = []
timestep = env.reset()
print('Observation format (what the agents sees):')
print(timestep.observation)
res.append(timestep.observation)
for step in range(50):
  action = np.random.randint(3)
  timestep = env.step(action)
  res.append(timestep.observation)

Observation format (what the agents sees):
[[0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]]


In [4]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [5]:
anim

## The Agent

In [6]:
#@title Agent interface

class Agent(abc.ABC):
  """Base class defining the agent interface."""

  @abc.abstractmethod
  def select_action(self, latest_obs):
    """Choose an action to take in the environment."""
    ...

  @abc.abstractmethod
  def observe(self, action, next_timestep: dm_env.TimeStep):
    """Take note of what happened in the environment after taking an action."""
    ...

  @abc.abstractmethod
  def observe_first(self, first_timestep: dm_env.TimeStep):
    """Take note of the environment state before taking any actions."""
    ...

  @abc.abstractmethod
  def update(self):
    """Update the agent's internal understanding of the environment dynamics."""
    ...


In [7]:
# @title Training and evaluation

def train(
    agent: Agent,
    env: dm_env.Environment,
    num_episodes = 100):
  """Environment loop during which an agent learns from the interactions."""

  print('Training agent...')
  training_returns = []
  sum_returns = 0.0
  for episode in range(num_episodes):
    timestep = env.reset()
    agent.observe_first(timestep)
    sum_rewards = 0.0
    while timestep.step_type != dm_env.StepType.LAST:
      action = agent.select_action(timestep.observation)
      timestep = env.step(action)
      if timestep.reward is not None:
        sum_rewards += timestep.reward
      agent.observe(action, timestep)
      agent.update()
    training_returns.append(sum_rewards)
    if episode % 10 == 0:
      print(f'Episode: {episode}, Return: {sum_rewards}, '
            f'Mean return: {np.mean(training_returns[-50:])}')

def evaluate(
    agent: Agent,
    env: dm_env.Environment,
    num_episodes = 10):
  """Environment loop during which the agent doesn't learn."""

  print('\nEvaluating agent...')
  agent.set_epsilon(0)
  eval_returns = []
  observations = []
  for episode in range(num_episodes):
    sum_rewards = 0.0
    timestep = env.reset()
    observations.append(timestep.observation)
    agent.observe_first(timestep)
    while timestep.step_type != dm_env.StepType.LAST:
      action = agent.select_action(timestep.observation)
      timestep = env.step(action)
      observations.append(timestep.observation)
      if timestep.reward is not None:
        sum_rewards += timestep.reward
      agent.observe(action, timestep)
      # agent.update()  # Don't update.
    eval_returns.append(sum_rewards)
    print(f'Episode: {episode}, Return: {sum_rewards}')
  print(f'mean: {np.mean(eval_returns)}, std: {np.std(eval_returns)}')
  return observations

## 1. Tabular Q-Learning Agent

In Q-learning, the agent estimates the *value* of (state, action) pairs. This estimate reflects how much total return the agent anticipates up until the end of the episode, assuming that it takes action $A$ in state $S$. In *tabular* Q-learning in particular, these value estimates are stored explicitly in a table, for example:

| (State, Action) | Q-value  |
| ----------------| ---------|
| ($S_i$, left)   | 0.7      |
| ($S_i$, stay)   | 0.0      |
| ($S_i$, right)  | -0.5     |
| ($S_j$, left)   | 0.32     |
| ($S_j$, stay)   | -1.0     |
| ($S_j$, right)  | 0.1      |
| $\dots$         | $\dots$  |

These estimates of (state, action) pairs will drive the behaviour (*policy*) of the agent.

Alternatively we could also represent the value estimates in matrix format:

|  Q-values       | left     | stay     | right    |
| ----------------| ---------| ---------| ---------|
| $S_i$           | 0.7      | 0.0      | -0.5     |
| $S_j$           | 0.32     | -1.0     | 0.1      |
| $\dots$         | $\dots$  | $\dots$  | $\dots$  |
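
A minimal sketch of how such a table can drive a policy, using the example values from the tables above. The function names are made up for this illustration, and the epsilon-greedy variant anticipates the `epsilon` parameter of the agent defined later.

```python
import random

# Q-values copied from the example table above.
q_table = {
    ('S_i', 'left'): 0.7, ('S_i', 'stay'): 0.0, ('S_i', 'right'): -0.5,
    ('S_j', 'left'): 0.32, ('S_j', 'stay'): -1.0, ('S_j', 'right'): 0.1,
}
ACTIONS = ('left', 'stay', 'right')


def greedy_action(q, state):
  """Returns the action with the highest Q-value in `state`."""
  return max(ACTIONS, key=lambda a: q[(state, a)])


def epsilon_greedy_action(q, state, epsilon, rng):
  """With probability `epsilon` acts randomly, otherwise greedily."""
  if rng.random() < epsilon:
    return rng.choice(ACTIONS)
  return greedy_action(q, state)


rng = random.Random(0)
print(greedy_action(q_table, 'S_i'))  # left (0.7 is the largest value)
print(greedy_action(q_table, 'S_j'))  # left (0.32 is the largest value)
print(epsilon_greedy_action(q_table, 'S_j', epsilon=0.1, rng=rng))
```

With `epsilon=0` the second function reduces to the greedy policy, which is the setting used during evaluation.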

### Train Tabular Q-learning agent

During training, the agent acts in the environment (i.e. plays the game) and makes periodic updates of its Q-value estimates based on what it observes. In particular, the estimates are updated based on the [Bellman equation](https://en.wikipedia.org/wiki/Bellman_equation):

$$Q_{new}(s_t, a_t) = Q_{old}(s_t, a_t) + \alpha \left(R_t + \gamma \max_a Q_{old}(s_{t+1}, a) - Q_{old}(s_t, a_t)\right)$$

Here $\alpha$ is the learning rate and $\gamma$ is the discount factor. During this process, the Q-value estimates will become more and more accurate, leading to gradually increasing performance (i.e. the agent is *learning* to play the game well).

After the training process, the agent is *evaluated*. This means we assess its performance on the environment without any randomness in its behaviour. At this time, the agent does not make any updates to its Q-value estimates, therefore its behaviour is not changing anymore.
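
To make the update rule concrete, here is a small worked example in plain Python. The numbers are invented for illustration; `alpha=0.2` and `gamma=0.99` match the default arguments of the `QLearning` agent below. The `terminal` flag is an addition made for this sketch: the agent below does not distinguish terminal steps and bootstraps from the next observation in every update.

```python
def q_update(q_old, reward, max_next_q, alpha=0.2, gamma=0.99, terminal=False):
  """Applies one Q-learning update and returns the new estimate."""
  # No future return can follow a terminal step, so bootstrapping is skipped.
  bootstrap = 0.0 if terminal else gamma * max_next_q
  td_error = reward + bootstrap - q_old
  return q_old + alpha * td_error


# Non-terminal step: no reward yet, the best next action is worth 0.7.
print(q_update(q_old=0.5, reward=0.0, max_next_q=0.7))
# Terminal step: the ball was caught, so the target is just the reward.
print(q_update(q_old=0.5, reward=1.0, max_next_q=0.7, terminal=True))
```

In the first call the estimate moves from 0.5 a fifth of the way towards the target $0 + 0.99 \cdot 0.7 = 0.693$; in the second it moves a fifth of the way towards the reward of 1.0.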

In [8]:
class QLearning(Agent):
  """Simple Q-learning agent."""

  def __init__(self,
               learning_rate: float = 0.2,
               discount: float = 0.99,
               epsilon: float = 0.1):
    """Initialize the agent.

    Args:
      learning_rate: (alpha) Controls how quickly we're willing to change
        the q-value estimates.
      discount: (gamma) Controls how much we care about immediate rewards vs
        long term rewards.
      epsilon: With small probability, the agent will take random actions
        instead of always picking the best action. This is to encourage
        diversity of experiences (exploration).
    """
    self._learning_rate = learning_rate
    self._epsilon = epsilon
    self._discount = discount

    # In the beginning, the agent doesn't know anything about the Q-values,
    # so the table will be initialized randomly.
    self._q = collections.defaultdict(np.random.random)
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def select_action(self, latest_obs):
    """Chooses an action to take based on the current Q-value estimates."""
    action = np.argmax([self._q_func(latest_obs, a) for a in range(3)])
    if np.random.random() < self._epsilon:
      action = np.random.randint(0, 3)
    return action

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep

  def update(self):
    """Updates the Q-value estimates based on the latest interaction."""
    reward = self._timestep_after_action.reward
    obs = self._timestep_after_action.observation
    obs_before = self._timestep_before_action.observation

    # Remember the Q-learning update (derived from the Bellman equation):
    # q_new(s, a) = q_old(s, a) + alpha * (reward + gamma * max_a' q_old(s', a') - q_old(s, a))
    best_action = self._best_action(obs)
    td = reward + self._discount * self._q_func(obs, best_action) - self._q_func(
        obs_before, self._latest_action)
    self._q[(str(obs_before), self._latest_action)] += self._learning_rate * td

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def _q_func(self, obs, action):
    return self._q[(str(obs), action)]

  def set_epsilon(self, eps: float):
    self._epsilon = eps

If all goes well, the **return** should be gradually increasing over the course of training!
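
A note on reading the logged mean returns: the `Catch` implementation above ends every episode with a reward of +1 (catch) or -1 (miss) and gives 0 on all other steps, so a mean return translates directly into a catch rate. The helper below is a hypothetical convenience function written for this note, not part of the assignment code.

```python
def catch_rate_from_mean_return(mean_return):
  """Converts a mean return in [-1, 1] into the fraction of balls caught."""
  # With returns of +1 (catch) and -1 (miss): mean = p * 1 + (1 - p) * (-1) = 2p - 1.
  return (mean_return + 1.0) / 2.0


for mean_return in (-1.0, 0.0, 0.48, 1.0):
  print(mean_return, '->', catch_rate_from_mean_return(mean_return))
```

For example, a mean return of 0.48 means that roughly 74% of the recent episodes ended with a catch.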

In [9]:
env = Catch()
agent = QLearning(epsilon=0.05)

train(agent, env, num_episodes=500)
res = evaluate(agent, env)

Training agent...
Episode: 0, Return: 1.0, Mean return: 1.0
Episode: 10, Return: -1.0, Mean return: -0.6363636363636364
Episode: 20, Return: -1.0, Mean return: -0.3333333333333333
Episode: 30, Return: -1.0, Mean return: -0.22580645161290322
Episode: 40, Return: 1.0, Mean return: -0.12195121951219512
Episode: 50, Return: 1.0, Mean return: -0.04
Episode: 60, Return: 1.0, Mean return: 0.2
Episode: 70, Return: 1.0, Mean return: 0.32
Episode: 80, Return: -1.0, Mean return: 0.4
Episode: 90, Return: -1.0, Mean return: 0.36
Episode: 100, Return: 1.0, Mean return: 0.28
Episode: 110, Return: 1.0, Mean return: 0.2
Episode: 120, Return: 1.0, Mean return: 0.16
Episode: 130, Return: 1.0, Mean return: 0.16
Episode: 140, Return: 1.0, Mean return: 0.2
Episode: 150, Return: -1.0, Mean return: 0.2
Episode: 160, Return: 1.0, Mean return: 0.4
Episode: 170, Return: 1.0, Mean return: 0.44
Episode: 180, Return: -1.0, Mean return: 0.36
Episode: 190, Return: 1.0, Mean return: 0.48
Episode: 200, Return: 1.0, Mea

In [10]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [11]:
anim

## 2. Q-Learning agent with Neural Networks

A major limitation of the tabular approach is that when the state space is large, it quickly becomes infeasible to visit every (state, action) pair often enough to estimate its Q-value reliably. Apart from explicit Q-value tables, another way for an agent to represent its Q-value estimates is using *Neural Networks*. Neural networks are [universal function approximators](https://en.wikipedia.org/wiki/Universal_approximation_theorem), so in theory they can approximate the true $Q(s,a)$ function to arbitrary accuracy. They also help overcome the problem of large state spaces, because they can exploit underlying structure in the observation space.

In our Catch example, we can take our existing tabular Q-learning agent and replace its `_q_func()` and `update()` methods to use neural networks. The `_q_func()` method will now compute the Q-value as the output of the NN model, rather than reading it directly from a table. The `update()` method, instead of overwriting an entry of the Q-table, will perform a gradient step on the model.

*   The `update()` method performs the Q-value update. It first computes the current estimate for the action just taken (`old_q`) and the highest Q-value available from the next observation (`next_q`). It then forms the target Q-value from the reward obtained at the current timestep plus the discounted `next_q`. Finally, it updates the neural network by gradient descent on the squared difference between the target and the model's prediction.

*   The `_q_func()` method computes the Q-value as the output of the neural network model for a given observation and action.

*   The `select_action()` method draws a random number. If it is less than epsilon, a random action is chosen; otherwise the action with the highest Q-value is chosen.
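
The sketch below illustrates the two ideas from the list above, the input encoding and the gradient step, in plain Python. It deliberately swaps the neural network for a linear model so that the gradient can be written by hand; it is an assumption-laden illustration, not the TensorFlow implementation used in the agent. All function names and the target value are made up for this example.

```python
def make_input(board, action, num_actions=3):
  """Flattens the board and appends a one-hot encoding of the action."""
  flat = [cell for row in board for cell in row]
  one_hot = [1.0 if a == action else 0.0 for a in range(num_actions)]
  return flat + one_hot


def predict(weights, x):
  """Linear stand-in for the network: Q(s, a) = w . x."""
  return sum(w * xi for w, xi in zip(weights, x))


def gradient_step(weights, x, target, learning_rate=0.01):
  """One gradient-descent step on the squared error (target - Q(s, a))**2."""
  error = target - predict(weights, x)
  # d/dw (target - w . x)**2 = -2 * error * x, with the target held fixed.
  return [w + learning_rate * 2.0 * error * xi for w, xi in zip(weights, x)]


# A 10x5 board with the ball at the top and the paddle at the bottom.
board = [[0.0] * 5 for _ in range(10)]
board[0][3] = 1.0
board[9][2] = 1.0

x = make_input(board, action=2)
weights = [0.0] * len(x)
target = 1.0  # Hypothetical target: reward + discount * max_a Q(next_obs, a).

before = abs(target - predict(weights, x))
weights = gradient_step(weights, x, target)
after = abs(target - predict(weights, x))
print(len(x), before, after)
```

The input has 53 entries (50 board cells plus 3 action slots), and a single step moves the prediction slightly towards the target. The target is treated as a constant when differentiating, which is also what happens in the agent because the target is computed outside the gradient tape.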



In [12]:
class QLearningNN(Agent):
  """Simple Q-learning agent using a Neural Network."""

  def __init__(self,
               model,
               epsilon: float = 0.2,
               discount: float = 0.95):
    self._model = model #using model instead of Q-Table
    self._epsilon = epsilon
    self._discount = discount
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def update(self):
    # To-Do: update the Q-value estimate using the neural network.

    # Current estimate for the action that was just taken (kept for reference;
    # the loss below recomputes it inside the gradient tape).
    obs = self._timestep_before_action.observation
    action = self._latest_action
    old_q = self._q_func(obs, action)

    # Highest Q-value available from the next observation.
    new_obs = self._timestep_after_action.observation
    best_action = self._best_action(new_obs)
    next_q = self._q_func(new_obs, best_action)

    # Compute the target Q-value. Note that this bootstraps from `next_q` on
    # every step, including the last step of an episode.
    reward = self._timestep_after_action.reward or 0.0
    target_q = reward + self._discount * next_q
    # Update the Q-value estimate using gradient descent.
    input_tensor = tf.convert_to_tensor(self._make_input(obs, action)[tf.newaxis, :], dtype=tf.float32)
    with tf.GradientTape() as tape:
      q_values = self._model(input_tensor)
      loss = tf.math.reduce_mean(tf.square(target_q - q_values))
    grads = tape.gradient(loss, self._model.trainable_variables)
    self._model.optimizer.apply_gradients(zip(grads, self._model.trainable_variables))

  def _make_input(self, obs, action):
    # Flatten the board into a 1-D tensor.
    flatten_obs = tf.reshape(obs, [-1])
    # Create one-hot encoded representation of the action.
    a = np.zeros(3, dtype=np.float32)
    a[action] = 1
    # Concatenate the one-hot encoded action to the flattened observation.
    model_input = tf.concat([flatten_obs, a], axis=0)
    return model_input

  def _q_func(self, latest_obs, action):
    # To-Do: compute the Q-value as the output of the NN model.
    input_tensor = tf.convert_to_tensor(self._make_input(latest_obs, action)[tf.newaxis, :], dtype=tf.float32)
    output = self._model(input_tensor)[0][0]
    return output

  def select_action(self, latest_obs):
    # To-Do: select an action. It's similar to the action selection of the
    # QLearning agent!
    if np.random.rand() < self._epsilon:
      action = np.random.randint(3)
    else:
      action = self._best_action(latest_obs)
    return action

  def set_epsilon(self, eps: float):
    self._epsilon = eps

### Train Q-learning agent with NNs

#### First Trial

In [13]:
# Create environment.
env = Catch()

#To Do:
# Build model for agent.

model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mse', optimizer=optimizer)

# Create agent.
agent = QLearningNN(model)

with tf.device('/device:GPU:0'):
  train(agent, env, num_episodes=1000)
  res = evaluate(agent, env)


Training agent...




Episode: 0, Return: 1.0, Mean return: 1.0
Episode: 10, Return: -1.0, Mean return: -0.6363636363636364
Episode: 20, Return: -1.0, Mean return: -0.5238095238095238
Episode: 30, Return: 1.0, Mean return: -0.41935483870967744
Episode: 40, Return: 1.0, Mean return: -0.36585365853658536
Episode: 50, Return: 1.0, Mean return: -0.36
Episode: 60, Return: -1.0, Mean return: -0.28
Episode: 70, Return: -1.0, Mean return: -0.32
Episode: 80, Return: -1.0, Mean return: -0.44
Episode: 90, Return: -1.0, Mean return: -0.6
Episode: 100, Return: 1.0, Mean return: -0.56
Episode: 110, Return: -1.0, Mean return: -0.56
Episode: 120, Return: -1.0, Mean return: -0.52
Episode: 130, Return: 1.0, Mean return: -0.4
Episode: 140, Return: -1.0, Mean return: -0.2
Episode: 150, Return: -1.0, Mean return: -0.24
Episode: 160, Return: -1.0, Mean return: -0.12
Episode: 170, Return: -1.0, Mean return: -0.08
Episode: 180, Return: -1.0, Mean return: -0.12
Episode: 190, Return: -1.0, Mean return: -0.08
Episode: 200, Return: -1

* Based on training, the mean return initially fluctuates between positive and negative values, indicating that the agent is still exploring the environment. However, as the training progresses, the mean return gradually increases and eventually stabilizes at around 0.72, indicating that the agent has learned an effective policy for the task.


* Based on the evaluation, the agent consistently obtains a return of 1.0 for each episode, indicating that it is able to successfully complete the task. The mean return is also 1.0, which indicates that the agent is able to achieve the maximum possible reward for the task.

* The standard deviation of the returns is 0.0, which indicates that the agent is able to perform the task with a high degree of consistency and reliability.

* Overall, the agent has learned to perform well in the Catch environment.

In [14]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [15]:
anim

### Exercises

Experiment with model architectures:
* Try different activation functions.
* Change the layer sizes.
* Change the number of layers.

#### Second Trial

In [16]:
# Build model for agent.
model2 = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation='sigmoid'),
    tf.keras.layers.Dense(25, activation='tanh'),
    tf.keras.layers.Dense(10, activation='sigmoid'),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model2.compile(loss='mse', optimizer=optimizer)

# Create agent.
agent = QLearningNN(model2)

with tf.device('/device:GPU:0'):
  train(agent, env, num_episodes=1000)
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -1.0
Episode: 20, Return: -1.0, Mean return: -1.0
Episode: 30, Return: -1.0, Mean return: -0.9354838709677419
Episode: 40, Return: -1.0, Mean return: -0.8536585365853658
Episode: 50, Return: 1.0, Mean return: -0.76
Episode: 60, Return: -1.0, Mean return: -0.76
Episode: 70, Return: 1.0, Mean return: -0.68
Episode: 80, Return: 1.0, Mean return: -0.64
Episode: 90, Return: -1.0, Mean return: -0.72
Episode: 100, Return: -1.0, Mean return: -0.8
Episode: 110, Return: -1.0, Mean return: -0.76
Episode: 120, Return: -1.0, Mean return: -0.76
Episode: 130, Return: -1.0, Mean return: -0.84
Episode: 140, Return: -1.0, Mean return: -0.8
Episode: 150, Return: -1.0, Mean return: -0.84
Episode: 160, Return: 1.0, Mean return: -0.84
Episode: 170, Return: -1.0, Mean return: -0.92
Episode: 180, Return: -1.0, Mean return: -0.8
Episode: 190, Return: -1.0, Mean return: -0.76
Episode: 200, Return: -1.0, Mean re

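* The agent does not learn a good policy: over the logged episodes the mean return stays between -1.0 and -0.64.

* One possible explanation is that this network is harder to train than the first one: it uses saturating `sigmoid`/`tanh` activations and a ten-times larger learning rate (0.01 instead of 0.001). Limited capacity seems less likely, since it has more layers than the first model.
* Let's try another model.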


#### Third Trial

In [17]:
# Build model for agent.
model3 = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='elu'),
    tf.keras.layers.Dense(20, activation='elu'),
    tf.keras.layers.Dense(1)
])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model3.compile(loss='mse', optimizer=optimizer)

# Create agent.
agent = QLearningNN(model3)

with tf.device('/device:GPU:0'):
  train(agent, env, num_episodes=1000)
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: 1.0, Mean return: 1.0
Episode: 10, Return: -1.0, Mean return: -0.09090909090909091
Episode: 20, Return: 1.0, Mean return: 0.047619047619047616
Episode: 30, Return: -1.0, Mean return: -0.22580645161290322
Episode: 40, Return: -1.0, Mean return: -0.17073170731707318
Episode: 50, Return: 1.0, Mean return: -0.2
Episode: 60, Return: -1.0, Mean return: -0.16
Episode: 70, Return: -1.0, Mean return: -0.24
Episode: 80, Return: -1.0, Mean return: -0.24
Episode: 90, Return: -1.0, Mean return: -0.32
Episode: 100, Return: -1.0, Mean return: -0.36
Episode: 110, Return: 1.0, Mean return: -0.44
Episode: 120, Return: 1.0, Mean return: -0.36
Episode: 130, Return: 1.0, Mean return: -0.2
Episode: 140, Return: -1.0, Mean return: -0.12
Episode: 150, Return: 1.0, Mean return: -0.04
Episode: 160, Return: -1.0, Mean return: 0.0
Episode: 170, Return: 1.0, Mean return: -0.2
Episode: 180, Return: -1.0, Mean return: -0.24
Episode: 190, Return: -1.0, Mean return: -0.28
Episode:

* Based on training, the agent was able to learn and improve its performance over time. At the beginning of training, the agent's mean return was negative, but as training progressed, it improved and became positive.

* Based on the evaluation, the agent's performance is quite good. The agent achieved a return of 1.0 in every evaluated episode. The standard deviation of the returns is 0.0, indicating that the agent's performance was highly consistent during evaluation.


* Overall, the agent performs very well in the environment.



In [None]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [None]:
anim

### Write a report on your observations regarding the above exercises

* **First trial:** The agent was able to learn and improve over time, as indicated by the increase in mean return. In the evaluation, the agent obtains a return of 1.0 in every episode, so the mean return is also 1.0, the maximum possible for the task. The standard deviation of the returns is 0.0, which indicates that the agent performs the task consistently and reliably. Overall, the agent has learned to perform well in the Catch environment.

* **Second trial (the worst):** The agent struggles to learn a good policy. Over the logged episodes the mean return stays between -1.0 and -0.64, so the agent misses the ball in most episodes.

* **Third trial (the best):** The agent performs very well in the environment. During training, the agent learned and improved its performance over time: the mean return was negative at the beginning, then improved and became positive. In the evaluation, the agent achieved a return of 1.0 in every episode, with a standard deviation of 0.0, indicating highly consistent performance.


## 3. Q-Learning agent with NNs and a Replay Buffer

Another way to make the algorithm more data-efficient is to introduce a *Replay Buffer*. In the previous example, each update was computed on a single transition (the most recent one). Instead of fitting on a single datapoint, we can fit on a *set* of datapoints. To do this, we store a number of previously seen transitions $(S_i, a_i, S_{i+1})$, where the reward is carried by the timestep for $S_{i+1}$, and at each update we fit the model on a random sample of these.

* The `_sample_replay_buffer()` method samples transitions from the replay buffer for training.

* The `update()` method updates the Q-values of the neural network based on the sampled transitions.

* The `_q_func()` method computes the Q-value as the output of the neural network model.

* The `select_action()` method selects an action based on the current observation and the exploration probability.
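
A minimal, standalone sketch of the replay-buffer idea in plain Python. `ReplayBuffer` is a hypothetical class written for this illustration. It differs from the agent below in two ways that are assumptions of the sketch: it evicts the *oldest* transition when full (via `collections.deque`), whereas the agent removes a random entry, and it stores the reward explicitly in each tuple.

```python
import collections
import random


class ReplayBuffer:
  """Fixed-capacity store of transitions with uniform random sampling."""

  def __init__(self, capacity, seed=0):
    # A deque with `maxlen` silently drops the oldest entry once it is full.
    self._storage = collections.deque(maxlen=capacity)
    self._rng = random.Random(seed)

  def add(self, transition):
    self._storage.append(transition)

  def sample(self, num_samples):
    """Samples with replacement, so it also works on a nearly empty buffer."""
    return [self._rng.choice(self._storage) for _ in range(num_samples)]

  def __len__(self):
    return len(self._storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
  # Each transition is (observation, action, reward, next_observation).
  buffer.add((f's{step}', step % 3, 0.0, f's{step + 1}'))

batch = buffer.sample(num_samples=10)
print(len(buffer), len(batch))
```

Sampling with replacement mirrors `_sample_replay_buffer()` in the agent, which draws each index independently and can therefore return the same transition more than once.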

In [None]:
class QLearningNNReplay(Agent):
  """Simple Q-learning agent using a Neural Network and a replay buffer."""

  def __init__(self,
               model,
               max_replay_entries: int = 10000,
               num_samples_per_update: int = 10,
               epsilon: float = 0.1,
               discount: float = 0.99):
    self._model = model
    self._replay = []
    self._max_replay_entries = max_replay_entries
    self._num_samples_per_update = num_samples_per_update
    self._epsilon = epsilon
    self._discount = discount
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep
    # Add (S_i, a_i, S_i+1) to replay buffer.
    self._replay.append((self._timestep_before_action, self._latest_action,
                         self._timestep_after_action))
    if len(self._replay) >= self._max_replay_entries:
      # Remove a random entry from the buffer if capacity is reached.
      random_index = np.random.randint(len(self._replay))
      del self._replay[random_index]

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def _sample_replay_buffer(self):
    # Sample `self._num_samples_per_update` from replay buffer.
    samples = [self._replay[np.random.randint(len(self._replay))]
               for _ in range(self._num_samples_per_update)]
    return samples

  def update(self):
    samples = self._sample_replay_buffer()
    for (obs, action, next_obs) in samples:
      # Compute the old Q-value for the current observation and action.
      old_q = self._q_func(obs.observation, action)

      # Compute the new Q-value for the next observation and best action.
      best_action = self._best_action(next_obs.observation)
      next_q = self._q_func(next_obs.observation, best_action)

      # Compute the target Q-value using the reward and discount factor.
      reward = next_obs.reward or 0.0
      target_q = reward + self._discount * next_q

      # Update the Q-value estimate using gradient descent.
      input_tensor = tf.convert_to_tensor(self._make_input(obs.observation, action)[tf.newaxis, :], dtype=tf.float32)
      with tf.GradientTape() as tape:
        q_values = self._model(input_tensor)
        loss = tf.math.reduce_mean(tf.square(target_q - q_values))
      grads = tape.gradient(loss, self._model.trainable_variables)
      self._model.optimizer.apply_gradients(zip(grads, self._model.trainable_variables))

  def _make_input(self, obs, action):
    # Flatten the board into a 1-D tensor.
    flatten_obs = tf.reshape(obs, [-1])
    # Create one-hot encoded representation of the action.
    a = np.zeros(3, dtype=np.float32)
    a[action] = 1
    # Concatenate the one-hot encoded action to the flattened observation.
    model_input = tf.concat([flatten_obs, a], axis=0)
    return model_input

  def _q_func(self, latest_obs, action):
    # To-Do: compute the Q-value as the output of the NN model.
    input_tensor = tf.convert_to_tensor(self._make_input(latest_obs, action)[tf.newaxis, :], dtype=tf.float32)
    output = self._model(input_tensor)[0][0]
    return output

  def select_action(self, latest_obs):
    # To-Do: select an action. It's similar to the action selection of the
    # QLearning agent!
    if np.random.rand() < self._epsilon:
      # Choose a random action with probability epsilon.
      action = np.random.randint(3)
    else:
      # Otherwise choose the best action according to the Q-values.
      action = self._best_action(latest_obs)
    return action

  def set_epsilon(self, eps: float):
    self._epsilon = eps
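
The target used in `update` above is the one-step Q-learning target, reward plus discounted best next Q-value. The sketch below works it through on toy numbers. The Q-values are made up for illustration, and the `terminal` argument is an assumption that is not part of the agent above: `update` always bootstraps from the next observation, whereas a common variant drops the bootstrap term on the last step of an episode.

```python
def td_target(reward, next_q_values, discount=0.95, terminal=False):
    """One-step Q-learning target: r + discount * max_a Q(s', a).

    `terminal` is a hypothetical flag added for illustration: when the
    episode has ended there is no next state, so only the reward is used.
    """
    if terminal:
        return reward
    return reward + discount * max(next_q_values)


# Hypothetical Q-values for the three Catch actions in the next state.
next_q = [0.2, 0.5, -0.1]

# Mid-episode step: no reward yet, so the target is the discounted best next Q.
mid_episode = td_target(reward=0.0, next_q_values=next_q)

# Final step: the reward arrives and the bootstrap term is dropped.
last_step = td_target(reward=1.0, next_q_values=next_q, terminal=True)

print(mid_episode, last_step)
```

Whether the terminal variant helps on Catch is something to check experimentally; it is shown here only to make the role of each term in the target explicit.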

### Train Q-learning agent with NNs and replay buffer

# First Trial

In [None]:
# Create environment.
env = Catch()

#To Do:
# Build model for agent.
model3 = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model3.compile(optimizer=optimizer, loss='mse')

# Create agent
agent = QLearningNNReplay(model3)

with tf.device('/device:GPU:0'):
  train(agent, env)
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.45454545454545453
Episode: 20, Return: -1.0, Mean return: -0.5238095238095238
Episode: 30, Return: -1.0, Mean return: -0.6129032258064516
Episode: 40, Return: 1.0, Mean return: -0.5609756097560976
Episode: 50, Return: -1.0, Mean return: -0.48
Episode: 60, Return: 1.0, Mean return: -0.44
Episode: 70, Return: 1.0, Mean return: -0.4
Episode: 80, Return: -1.0, Mean return: -0.36
Episode: 90, Return: 1.0, Mean return: -0.4

Evaluating agent...
Episode: 0, Return: -1.0
Episode: 1, Return: -1.0
Episode: 2, Return: 1.0
Episode: 3, Return: 1.0
Episode: 4, Return: 1.0
Episode: 5, Return: -1.0
Episode: 6, Return: 1.0
Episode: 7, Return: -1.0
Episode: 8, Return: 1.0
Episode: 9, Return: -1.0
mean: 0.0, std: 1.0


* Based on training, the running mean return started at -1.0 (a single episode), was about -0.45 after 10 episodes, dipped to about -0.61 around episode 30, and ended at -0.4 at episode 90. The agent therefore caught some of the falling objects, but the mean stayed negative, so it still missed more often than it caught.

* Based on evaluation, the agent's performance was mixed: 5 of the 10 episodes returned 1.0 and 5 returned -1.0. The mean return was 0.0 with a standard deviation of 1.0, indicating that the agent's performance was highly variable.

* Overall, the agent learned the Catch task only to some extent: it caught the object in half of the evaluation episodes, so the learned policy is not yet reliable.
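
The summary statistics in the evaluation log can be reproduced by hand. The sketch below copies the ten returns from the log above and assumes that the reported `std` is the population standard deviation (dividing by N), which matches the logged value of 1.0. The catch rate is an extra quantity computed here for illustration and is not printed by `evaluate`.

```python
import statistics

# Returns copied from the evaluation log above (episodes 0 to 9).
returns = [-1.0, -1.0, 1.0, 1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

mean_return = statistics.fmean(returns)

# Population standard deviation (divides by N, not N - 1).
std_return = statistics.pstdev(returns)

# Fraction of episodes in which the object was caught (return of +1).
catch_rate = sum(r > 0 for r in returns) / len(returns)

print(f"mean: {mean_return}, std: {std_return}, catch rate: {catch_rate}")
```

Because every return is either +1 or -1, the mean return and the catch rate carry the same information: mean = 2 × catch rate - 1.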


In [None]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [None]:
anim

### Exercises

Experiment with Replay Buffer settings:
* Modify the sampling method (e.g. give higher priority to recent items instead of sampling uniformly)
* Change the eviction strategy
* Change the size of the replay buffer (i.e. the maximum number of entries)
* Change the size of the samples.

* When prioritized sampling is enabled, the agent draws recent transitions more often than old ones: the sampling weight of an entry grows linearly with its position in the buffer. This can help to improve learning speed and performance.

* The `QLearningNNReplay2` agent defined below exposes this through a `prioritized_sampling` flag, so the same class supports both recency-weighted and uniform sampling, which can improve learning efficiency and performance in certain scenarios.
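
To make the recency weighting concrete, here is a minimal standard-library sketch of linear weights 1..N normalised into probabilities. The helper names `recency_probabilities` and `sample_indices` are hypothetical and exist only in this example. Note that this scheme prioritises by age only; it is not the TD-error-based scheme that the term "prioritized experience replay" usually refers to.

```python
import random


def recency_probabilities(buffer_size):
    """Linear weights 1..N (oldest to newest), normalised to sum to 1."""
    total = buffer_size * (buffer_size + 1) / 2
    return [(i + 1) / total for i in range(buffer_size)]


def sample_indices(buffer_size, num_samples, rng):
    """Draw indices with replacement, favouring newer (higher-index) entries."""
    probs = recency_probabilities(buffer_size)
    return rng.choices(range(buffer_size), weights=probs, k=num_samples)


# With four entries, the newest is four times as likely as the oldest.
probs = recency_probabilities(4)
print(probs)

# Seeded generator so the draw is repeatable.
rng = random.Random(0)
print(sample_indices(4, num_samples=10, rng=rng))
```

With a large buffer the newest entry is only about twice as likely to be drawn as an entry in the middle of the buffer, so the bias towards recent data is fairly mild.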

# Second Trial

In [None]:
class QLearningNNReplay2(Agent):
  """Simple Q-learning agent using a Neural Network and a replay buffer."""

  def __init__(self,
               model,
               max_replay_entries: int = 100000,
               num_samples_per_update: int = 10,
               epsilon: float = 0.2,
               discount: float = 0.95,
               prioritized_sampling: bool = True):  # Add a flag for prioritized sampling
    self._model = model
    self._replay = []
    self._max_replay_entries = max_replay_entries
    self._num_samples_per_update = num_samples_per_update
    self._epsilon = epsilon
    self._discount = discount
    self._prioritized_sampling = prioritized_sampling  # Store the flag in the instance
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep
    # Add (S_i, a_i, S_i+1) to replay buffer.
    self._replay.append((self._timestep_before_action, self._latest_action,
                         self._timestep_after_action))
    if len(self._replay) >= self._max_replay_entries:
      # Remove a random entry from the buffer if capacity is reached.
      random_index = np.random.randint(len(self._replay))
      del self._replay[random_index]

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def _sample_replay_buffer(self):
    # Sample `self._num_samples_per_update` from replay buffer.
    if self._prioritized_sampling:
      # Use prioritized sampling: higher priority to more recent items
      weights = np.arange(1, len(self._replay) + 1)
      weights = weights / np.sum(weights)
      indices = np.random.choice(len(self._replay), size=self._num_samples_per_update, p=weights)
    else:
      # Use uniform sampling (original)
      indices = np.random.randint(len(self._replay), size=self._num_samples_per_update)

    samples = [self._replay[i] for i in indices]
    return samples

  def update(self):
    samples = self._sample_replay_buffer()
    for (obs, action, next_obs) in samples:
      # Compute the old Q-value for the current observation and action.
      old_q = self._q_func(obs.observation, action)

      # Compute the new Q-value for the next observation and best action.
      best_action = self._best_action(next_obs.observation)
      next_q = self._q_func(next_obs.observation, best_action)

      # Compute the target Q-value using the reward and discount factor.
      reward = next_obs.reward or 0.0
      target_q = reward + self._discount * next_q

      # Update the Q-value estimate using gradient descent.
      input_tensor = tf.convert_to_tensor(self._make_input(obs.observation, action)[tf.newaxis, :], dtype=tf.float32)
      with tf.GradientTape() as tape:
        q_values = self._model(input_tensor)
        loss = tf.math.reduce_mean(tf.square(target_q - q_values))
      grads = tape.gradient(loss, self._model.trainable_variables)
      self._model.optimizer.apply_gradients(zip(grads, self._model.trainable_variables))

  def _make_input(self, obs, action):
    flatten_obs = tf.reshape(obs, shape=(tf.math.reduce_prod(obs.shape)))
    a = np.zeros([3])
    a[action] = 1  # One-hot action
    model_input = tf.concat([flatten_obs, a], axis=0) # Concatenate them
    return model_input

  def _q_func(self, latest_obs, action):
    # To Do:
    # Write a q_func method to compute the Q-value as the output of the NN model.
    input_tensor = tf.convert_to_tensor(self._make_input(latest_obs, action)[tf.newaxis, :], dtype=tf.float32)
    output = self._model(input_tensor)[0][0]
    return output

  def select_action(self, latest_obs):
    # To Do:
    # Write the method for action selection. It's similar to the QLearning agent's action selection method.
    if np.random.rand() < self._epsilon:
      # Choose a random action with probability epsilon.
      action = np.random.randint(3)
    else:
      # Otherwise choose the best action based on the Q-values (probability 1 - epsilon).
      action = self._best_action(latest_obs)
    return action

  def set_epsilon(self, eps: float):
    self._epsilon = eps
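
The `observe` method above evicts a random entry once the buffer is full. For the "change the eviction strategy" exercise, the sketch below contrasts that with first-in-first-out eviction using `collections.deque`. The helper `add_with_random_eviction` is hypothetical, and it uses a strict `>` check so that the buffer holds exactly `max_entries` items, which differs slightly from the `>=` check in the class.

```python
import collections
import random


def add_with_random_eviction(buffer, item, max_entries, rng):
    """Append, then drop one uniformly chosen entry if over capacity."""
    buffer.append(item)
    if len(buffer) > max_entries:
        del buffer[rng.randrange(len(buffer))]


# FIFO alternative: a deque with maxlen discards the oldest entry automatically.
fifo = collections.deque(maxlen=3)
rand_buf = []
rng = random.Random(0)

for t in range(6):
    fifo.append(t)
    add_with_random_eviction(rand_buf, t, max_entries=3, rng=rng)

print(list(fifo))  # always the three newest items
print(rand_buf)    # three items, not necessarily the newest
```

FIFO keeps the buffer focused on recent behaviour, while random eviction retains a mix of old and new transitions. Which works better on Catch is an empirical question.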

In [None]:
# Create environment.
env = Catch()

#To Do:
# Build model for agent.
model4 = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model4.compile(optimizer=optimizer, loss='mse')

# Create agent
agent = QLearningNNReplay2(model4)

with tf.device('/device:GPU:0'):
  train(agent, env)
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -0.45454545454545453
Episode: 20, Return: -1.0, Mean return: -0.3333333333333333
Episode: 30, Return: 1.0, Mean return: -0.3548387096774194
Episode: 40, Return: -1.0, Mean return: -0.4146341463414634
Episode: 50, Return: -1.0, Mean return: -0.4
Episode: 60, Return: 1.0, Mean return: -0.32
Episode: 70, Return: -1.0, Mean return: -0.36
Episode: 80, Return: 1.0, Mean return: -0.2
Episode: 90, Return: -1.0, Mean return: -0.24

Evaluating agent...
Episode: 0, Return: 1.0
Episode: 1, Return: 1.0
Episode: 2, Return: 1.0
Episode: 3, Return: 1.0
Episode: 4, Return: 1.0
Episode: 5, Return: 1.0
Episode: 6, Return: 1.0
Episode: 7, Return: 1.0
Episode: 8, Return: 1.0
Episode: 9, Return: 1.0
mean: 1.0, std: 0.0


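* Based on training, the mean return increased over time, from -1.0 at episode 0 to -0.24 at episode 90 (with a best value of -0.2 at episode 80), although it was still negative at the end of training.

* Based on the evaluation, the agent achieved the optimal return of 1.0 in all 10 episodes (mean 1.0, standard deviation 0.0), indicating that it has learned a good policy.

* Overall, the agent learned a policy that solves the Catch environment in evaluation, although 10 episodes from a single training run is a small sample.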


In [None]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [None]:
anim

### Write a report on your observations regarding the above exercises

* `QLearningNNReplay2` adds a flag for prioritized sampling of transitions from the replay buffer, which gives recent transitions a higher sampling probability. The improved performance of the agent in the second trial may be due to this change, although each trial is a single run, so part of the difference could be random variation.

* The second trial is the best. Based on training, the mean return of the agent increased over time (ending at -0.24 compared with -0.4 in the first trial), indicating that the agent was improving its policy. Based on the evaluation, the agent consistently achieved the optimal return of 1.0, compared with a mean of 0.0 in the first trial, indicating that it has learned a good policy.

* The use of prioritized sampling in the updated version of the agent can improve learning speed and performance, especially in scenarios where recent transitions are more informative than older ones.