<a href="https://colab.research.google.com/github/MarShao0124/deep_learning_practice/blob/main/08_RL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Reinforcement Learning: An Introduction

In this tutorial, we enter the world of Deep Reinforcement Learning (DRL). We first review some basic concepts of Reinforcement Learning (RL), then implement tabular Q-learning for the classic [Frozen Lake](https://gym.openai.com/envs/FrozenLake-v0/) puzzle, and finally implement a Deep Q-learning approach for the [CartPole](https://gym.openai.com/envs/CartPole-v1/) problem.





![alt text](https://media2.giphy.com/media/46ib09ZL1SdWuREnj3/giphy.gif?cid=3640f6095c6e92762f3446634d90bc65) ![alt text](https://media0.giphy.com/media/d9QiBcfzg64Io/200w.webp?cid=3640f6095c6e93e92f30655873731752)![alt text](https://i.gifer.com/GpAY.gif)

The GIFs above show results obtained by [DeepMind](https://arxiv.org/pdf/1312.5602v1.pdf) in their 2013 paper, *Playing Atari with Deep Reinforcement Learning*. They successfully trained an RL agent with deep Q-learning to play classic Atari arcade games. Let's now see how they did it.








# Q-Learning

This family of RL methods tries to learn an approximation of the action-value function $Q(s,a)$ based on the [Bellman equation](https://en.wikipedia.org/wiki/Bellman_equation). The estimate is updated incrementally, in a form that resembles a [gradient descent ](https://en.wikipedia.org/wiki/Gradient_descent) step:
$$Q\left(s,a\right)=Q\left(s,a\right)+ \alpha \left(r+\gamma \max _{a} Q\left(s_{t+1},a\right)-Q\left(s,a\right)\right).$$
Here $\alpha$ is the step size (learning rate), $\gamma$ is the discount factor, $r$ is the reward observed after taking action $a$ in state $s$, and $s_{t+1}$ is the resulting state.
Q-Learning updates its estimate at every time step and bootstraps: it uses the current estimate $ \max _{a}Q\left(s_{t+1},a\right)$ of the next state's value to update $Q\left(s,a\right)$. In a more algorithmic form, the Q-Learning process is the following:


1.   Initialize the Q-values $Q\left(s,a\right)$ arbitrarily (e.g., at random or to zero).
2. Forever or until learning is stopped do:
> 1.  Observe state $s$.
> 2.   Take action $a$ according to your policy, e.g., $\epsilon$-greedy.
> 3.   Observe reward $r$ and new state $s_{t+1}$.
> 4. Based on your current estimates, compute $\max _{a}Q\left(s_{t+1},a\right)$.
> 5. Update your current estimate for  $Q\left(s,a\right)$:
$$Q\left(s,a\right)=Q\left(s,a\right)+ \alpha \left(r+\gamma \max _{a} Q\left(s_{t+1},a\right)-Q\left(s,a\right)\right).$$
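
To make the update rule concrete, here is a minimal sketch of a single update step in plain Python. The table layout (a nested list indexed as `q[state][action]`), the helper name `q_update` and all numeric values are illustrative assumptions, not values taken from the Frozen Lake run below.

```python
# One tabular Q-learning update, written out by hand.
# All numbers here are illustrative assumptions.
alpha = 0.8   # step size
gamma = 0.95  # discount factor

# Hypothetical table with 2 states and 2 actions: q[state][action]
q = [[0.0, 0.0],
     [0.5, 0.2]]

def q_update(q, s, a, r, s_next, alpha, gamma):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * max(q[s_next])   # bootstrapped estimate of the return
    td_error = td_target - q[s][a]           # how far the old estimate was off
    q[s][a] += alpha * td_error
    return q[s][a]

new_value = q_update(q, s=0, a=1, r=1.0, s_next=1, alpha=alpha, gamma=gamma)
print(round(new_value, 4))
```

In this example the target is $1 + 0.95 \times 0.5 = 1.475$, so the entry $Q(0,1)$ moves from 0 to $0.8 \times 1.475 = 1.18$, while every other entry is left untouched.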

Okay, now that we are familiar with Q-Learning, let's jump to a real implementation of it.







## Tabular Q-Learning with Frozen Lake
In this section we will teach an agent how to play the [Frozen lake](https://gym.openai.com/envs/FrozenLake-v0/) game using classical tabular Q-learning. Brace yourselves, winter is coming!

![alt text](https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/1ee37cfc3130057f828f19b3cee6066d41c1eeb4/Q%20learning/FrozenLake/frozenlake.png)

Winter has arrived and you and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so you must navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.
The goal of this game is to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H). Because the ice is slippery, the environment is stochastic: you move in the selected direction only with probability $p$, and with probability $(1-p)$ you slide in one of the two perpendicular directions instead. Specifically, if you select the action UP, you have a probability of 1/3 of actually going UP, 1/3 of going RIGHT and 1/3 of going LEFT. Similarly, if you select LEFT, you have a probability of 1/3 of actually going LEFT, 1/3 of going UP and 1/3 of going DOWN.
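
The slipping rule can be sketched in a few lines of plain Python. This is only an illustration of the 1/3 probabilities described above: the helper names `slippery_outcomes` and `sample_direction` are hypothetical, and we assume the action encoding 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP used by Gymnasium's FrozenLake.

```python
import random
from collections import Counter

# Assumed action encoding: 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP
ACTION_NAMES = ["LEFT", "DOWN", "RIGHT", "UP"]

def slippery_outcomes(action):
    """Return the three equally likely directions for an intended action:
    the action itself and its two perpendicular neighbours."""
    return [(action - 1) % 4, action, (action + 1) % 4]

def sample_direction(action, rng):
    """Sample the direction actually taken on slippery ice."""
    return rng.choice(slippery_outcomes(action))

rng = random.Random(0)
UP = 3
counts = Counter(ACTION_NAMES[sample_direction(UP, rng)] for _ in range(3000))

print([ACTION_NAMES[a] for a in slippery_outcomes(UP)])  # directions reachable when choosing UP
print(counts)  # each direction should appear roughly 1000 times
```

Because the opposite direction is never sampled, choosing an action that points away from a hole is a safe way to avoid it, which is something the learned policy tends to exploit.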

The lake is represented by a 4x4 grid, and the location where the frisbee has landed (G) as well as the holes (H) are always the same for every new game. The game is restarted every time you have successfully recovered the frisbee or you have fallen into the cold waters. A reward of +1 is given every time you recover the frisbee and 0 otherwise.


**Step 0: Import the needed libraries:**

We will be using 3 libraries:

* NumPy for our Q-table.
* Gymnasium (the maintained successor of OpenAI Gym) for our FrozenLake environment.
* Random to generate random numbers.



In [None]:
!pip install gymnasium --upgrade

import base64
import collections
import glob
import io
import os
import random
import time

from IPython import display as ipythondisplay
from IPython.display import HTML
import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras import layers, models, optimizers

**Environment creation:**

Gymnasium is a library composed of many environments that we can use to train our agents. In our case, we use Frozen Lake.

In [None]:
env = gym.make("FrozenLake-v1", render_mode='rgb_array')

**Q-table**

Now, we'll create our Q-table. The goal of the Q-table is to store the estimates $Q\left(s,a\right)$ and retrieve them when necessary. In this game the states are the 16 grid positions, with 0 being the starting position and 15 the goal position, and there are 4 actions: left, down, right and up (encoded as 0 to 3, in that order). Our Q-table will then have $16 \times 4$ entries, where the value in the first column of the first row represents the expected return of being in position 0 and taking the action left.

The number of rows (states) and columns (actions) of the table can also be obtained from the environment: `env.observation_space.n` and `env.action_space.n`.
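
As a small illustration of how grid positions map to table rows, the sketch below converts between a `(row, col)` position and the state index. The helper names are hypothetical, and we assume the row-major numbering (`row * n_cols + col`) used by the 4x4 map.

```python
N_COLS = 4  # the default map is a 4x4 grid

def to_state(row, col, n_cols=N_COLS):
    """Convert a (row, col) grid position into a Q-table row index."""
    return row * n_cols + col

def to_row_col(state, n_cols=N_COLS):
    """Convert a Q-table row index back into a (row, col) grid position."""
    return divmod(state, n_cols)

print(to_state(0, 0))   # start tile (top-left corner)
print(to_state(3, 3))   # goal tile (bottom-right corner)
print(to_row_col(5))    # grid position of state 5
```

With this numbering the last valid state index is 15, so the Q-table rows run from 0 to 15.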

We initialize the table to 0.

In [None]:
action_size = env.action_space.n
state_size = env.observation_space.n
qtable = np.zeros((state_size, action_size))
print(qtable)

**Hyperparameters**

Next, we specify the hyperparameters:


In [None]:
total_episodes = 25000        # Total episodes
learning_rate = 0.8           # Learning rate (alpha in the previous formulation)
max_steps = 99                # Max steps per episode
gamma = 0.95                  # Discounting rate

At first, we don't know how to interact with the environment (all Q-table values are 0), so we start by exploring: we take a random action with probability $\epsilon=1$, record the rewards obtained and update the Q-values of the table accordingly. As training progresses and we learn more about the environment, we reduce the probability of taking a random action (at a speed controlled by `decay_rate`) and start exploiting our knowledge by choosing the action with the highest Q-value, i.e., the one expected to lead to the highest return.

In [None]:
# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability
decay_rate = 0.005             # Exponential decay rate for exploration prob
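
To get a feeling for how quickly exploration fades, the sketch below evaluates the exponential decay formula used in the training loop further down, with the same parameter values. The helper name `epsilon_at` is hypothetical.

```python
import math

max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005

def epsilon_at(episode):
    """Exploration probability after a given number of episodes."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

for episode in (0, 100, 500, 1000, 5000):
    print(episode, round(epsilon_at(episode), 4))
```

With these values $\epsilon$ drops to roughly 0.017 by episode 1000, so most of the 25000 training episodes are spent almost purely exploiting the current Q-table.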

**Q-Learning**

Now we implement the Q-Learning algorithm:
> 1. Observe state $s$.
> 2. Draw a random value $v$ between 0 and 1.
> 3. If $v<\epsilon$, choose a random action; otherwise select the action with maximum $Q(s,a)$.
> 4. Take the action and observe reward $r$ and new state $s_{t+1}$.
> 5. Based on your current estimates, compute $\max _{a}Q\left(s_{t+1},a\right)$.
> 6. Update your current estimate for  $Q\left(s,a\right)$:
$$Q\left(s,a\right)=Q\left(s,a\right)+ \alpha \left(r+\gamma \max _{a} Q\left(s_{t+1},a\right)-Q\left(s,a\right)\right).$$


In [None]:
# List of rewards
rewards = []

for episode in range(total_episodes):
    # Reset the environment
    state, _ = env.reset()
    step = 0
    done = False
    total_rewards = 0

    for step in range(max_steps):
        # Choose an action a in the current world state (s)
        ## First we draw a random number in [0, 1]
        exp_exp_tradeoff = random.uniform(0, 1)

        ## If this number is greater than epsilon --> exploitation (take the action with the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])

        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])

        total_rewards += reward

        # The new state becomes the current state
        state = new_state

        # If done (we fell into a hole, reached the goal or hit the step limit): finish episode
        if done:
            break

    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    rewards.append(total_rewards)

print("Score over time: " + str(sum(rewards) / total_episodes))
print(qtable)

**Use our Q-table to play FrozenLake!**

After 25000 episodes, our Q-table can be used as a "cheatsheet" to play FrozenLake!
  
By running this cell, you can see our agent playing FrozenLake:

In [None]:
env = gym.make("FrozenLake-v1", render_mode='rgb_array')
state, _ = env.reset()
step = 0

plt.imshow(env.render())
plt.show()

for step in range(max_steps):

    # Take the action (index) that has the maximum expected future reward given that state
    action = np.argmax(qtable[state,:])

    new_state, reward, terminated, truncated, info = env.step(action)
    plt.imshow(env.render())
    plt.show()

    # We print the current step.
    print(f"Number of steps: {step}")
    if terminated or truncated:
      break
    state = new_state

env.close()

Let’s see how many times our agent finds the frisbee 🎉

To this end, we print the number of steps each game took before it ended.

In [None]:
games=5
for game in range(games):
    env = gym.make("FrozenLake-v1")
    state, _ = env.reset()
    step = 0
    for step in range(max_steps):

        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        new_state, reward, terminated, truncated, info = env.step(action)

        if terminated or truncated:
            # Here, we only report the end of the game (the agent reached the goal, fell into a hole or ran out of steps)
            # We print the number of steps it took.
            print(f"Number of steps: {step}")
            break
        state = new_state
    env.close()

In [None]:
games = 5
total_rewards = 0

for game in range(games):
    env = gym.make("FrozenLake-v1")
    state, _ = env.reset()
    step = 0
    for step in range(max_steps):
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        new_state, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            total_rewards += reward
            break
        state = new_state
    env.close()
success = total_rewards / games
print("Ratio of successfully finished episodes is {:f}".format(success))
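
To inspect what the agent has learned, it can help to print the greedy action for every tile. The sketch below does this for a small hypothetical Q-table; the helper names, the toy values and the letter encoding (assuming 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP) are illustrative assumptions. The same function should also accept the NumPy `qtable` trained above.

```python
LETTERS = "LDRU"  # index = action id, assuming 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP

def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    best = 0
    for i in range(1, len(values)):
        if values[i] > values[best]:
            best = i
    return best

def greedy_policy_grid(qtable, n_cols=4):
    """Return one string per grid row with the greedy action of each tile."""
    rows = []
    for start in range(0, len(qtable), n_cols):
        tiles = [qtable[s] for s in range(start, min(start + n_cols, len(qtable)))]
        rows.append(" ".join(LETTERS[argmax(q)] for q in tiles))
    return rows

# Hypothetical Q-table for a 2x2 grid (4 states, 4 actions)
toy_qtable = [
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.7, 0.1],
    [0.0, 0.0, 0.0, 0.0],
]
grid = greedy_policy_grid(toy_qtable, n_cols=2)
print("\n".join(grid))
```

Rows of the table that were never updated (for example terminal tiles) contain only zeros, so the letter shown for them is just the tie-breaking default and carries no meaning.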

## CartPole

That wasn't so hard! How about trying to balance a pole so it does not fall? In this section we will address the [CartPole](https://gym.openai.com/envs/CartPole-v1/) problem. Let's get to it!

![texto alternativo](https://keon.github.io/images/deep-q-learning/animation.gif)

As before, we will use Q-learning to train our agent, so let's start by constructing our Q-table. We first need to find out how many rows and columns it has. By checking the environment specification in [OpenAI Gym](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py), we see that the actions are left and right, so we need two columns for the actions. The state information, on the other hand, is given by:

        Num   Observation              Min        Max
        0     Cart Position            -4.8       4.8
        1     Cart Velocity            -Inf       Inf
        2     Pole Angle               -24 deg    24 deg
        3     Pole Velocity At Tip     -Inf       Inf
      
The cart position goes from -4.8 to 4.8. Even if we discretized it with a resolution of 0.01, we would get $\frac{4.8 \times 2}{0.01}=960$ possible cart positions, while the cart velocity goes from $-\infty$ to $\infty$! How are we going to construct a table with $\infty$ rows?
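
A quick back-of-the-envelope calculation shows that a table does not scale even if we make it finite. The sketch below assumes, purely for illustration, that the unbounded velocities are clipped and that every one of the 4 state variables is split into the same number of bins; neither assumption is part of the environment.

```python
# Number of discrete cart positions at a resolution of 0.01
positions = round((4.8 * 2) / 0.01)

def table_rows(bins_per_dim, n_dims=4):
    """Rows of a Q-table when each of the n_dims state variables gets bins_per_dim bins."""
    return bins_per_dim ** n_dims

n_actions = 2
for bins in (10, 100, positions):
    rows = table_rows(bins)
    print(bins, rows, rows * n_actions)  # bins per dimension, rows, total table entries
```

With 960 bins per variable the table would already need about $8.5 \times 10^{11}$ rows, and most of them would never be visited during training.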

Do not panic! This is where deep learning steps up and takes over the stage. As you have already seen, Deep Neural Networks (DNNs) used as general function approximators have proven to work very well in a wide range of areas, and reinforcement learning is no exception. In this case we use a neural network to approximate the action-value function: for every input state, we want the network to output an approximation of $Q\left(s,a\right)$ for each action.

![alt text](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1318%2F1*Gh5PS4R_A5drl5ebd_gNrg%402x.png&f=1)

In this particular scenario, the input layer will have the same number of inputs as environment parameters, 4, and the output layer will have the same number of outputs as actions, in this case 2.
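
To see how compact this representation is compared with a table, we can count the trainable parameters of such a network. The sketch assumes two fully connected hidden layers of 24 units each, as in the agent defined later in this section; the helper name `dense_params` is hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# 4 state variables -> 24 -> 24 -> 2 Q-values (one per action)
layer_sizes = [4, 24, 24, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)
print(total)
```

A few hundred parameters are enough to produce a Q-value estimate for any continuous state, at the price of the estimate being approximate rather than stored exactly.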

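**Reward:** A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole tilts more than 12 degrees from vertical or the cart moves more than 2.4 units from the center (CartPole-v1 also truncates episodes after 500 steps).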



**Step 0: Import the needed libraries**

We start by importing the needed libraries. We will be using 4 libraries:
* Keras: for our DNNs.
* Gymnasium: for our CartPole environment.
* Random: to generate random numbers.
* Collections: to create a memory buffer that stores the tuples $\left(S_t, A_t, R_t,S_{t+1}\right)$ of transitions.

The idea behind the memory buffer is that most optimization algorithms, including gradient descent, assume that the samples used in an update step are independent and identically distributed. In our environment that is clearly not the case, since consecutive transitions are strongly correlated. However, by sampling uniformly from a memory buffer that holds many transitions, we break the correlation between contiguous samples, so the network's weights are updated with less correlated samples, which tends to make the optimization more stable.
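
The behaviour of such a buffer can be sketched with the standard library alone. The transitions stored here are hypothetical placeholders, and the capacity is tiny on purpose so that the eviction of old samples is visible.

```python
import collections
import random

capacity = 5  # tiny on purpose; the agent below uses a much larger buffer
memory = collections.deque(maxlen=capacity)

# Store hypothetical transitions (state, action, reward, next_state, done)
for t in range(8):
    memory.append((t, t % 2, 1.0, t + 1, False))

# Only the most recent `capacity` transitions remain in the buffer
stored_states = [transition[0] for transition in memory]
print(stored_states)

# Uniform sampling without replacement breaks the temporal ordering
random.seed(0)
minibatch = random.sample(list(memory), 3)
print([transition[0] for transition in minibatch])
```

A `deque` with `maxlen` silently discards the oldest element once it is full, so the buffer always holds the most recent experience without any extra bookkeeping.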


**The Agent**

Let's start by coding a general Deep Q-Learning (DQN) agent. The state and action sizes are passed as parameters, and we configure a replay buffer with capacity to store 2000 experienced transitions.

In [None]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = collections.deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    # Now we address the DNNs; we are going to use two fully connected layers of 24 neurons each and as an optimizer we select Adam.
    def _build_model(self):
        # Neural Net for Deep-Q learning Model
        model = models.Sequential([
            layers.Input(shape=(self.state_size,)),
            layers.Dense(24, activation='relu'),
            layers.Dense(24, activation='relu'),
            layers.Dense(self.action_size, activation='linear')
        ])
        model.compile(loss='mse', optimizer=optimizers.Adam(learning_rate=self.learning_rate))
        return model

    # Now define the method to store the transitions into the memory buffer.
    # The parameter done is a boolean returned true when the pole has fallen.
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    # Now we implement an 𝜖-greedy policy.
    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state, verbose=0)
        return np.argmax(act_values[0]) # returns action

    def exploit(self, state): # When we test the agent we don't want it to explore anymore, but to exploit what it has learnt
        act_values = self.model.predict(state, verbose=0)
        return np.argmax(act_values[0])

    # Then comes the implementation of the Q-Learning method:
    # 1. We obtain the samples to train the DNN from the replay buffer.
    # 2. We compute target = r + gamma * max_a Q(s_{t+1}, a) by doing a forward pass using the next_state value.
    # 3. We do a forward pass through the network to obtain Q(s, a) for all the possible actions.
    # 4. In order to update only the output of the action taken, we copy target into the Q(s, a) entry of the action a actually taken.
    # 5. We update the parameters of the network using MSE as loss function.
    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        ### This code below generates batches of states, actions, rewards
        ### next states out of the sampled minibatch
        state_b = np.squeeze(np.array(list(map(lambda x: x[0], minibatch))))
        action_b = np.squeeze(np.array(list(map(lambda x: x[1], minibatch))))
        reward_b = np.squeeze(np.array(list(map(lambda x: x[2], minibatch))))
        next_state_b = np.squeeze(np.array(list(map(lambda x: x[3], minibatch))))
        done_b = np.squeeze(np.array(list(map(lambda x: x[4], minibatch))))

        target = (reward_b + self.gamma *
                        np.amax(self.model.predict(next_state_b, verbose=0), 1))
        target[done_b==1] = reward_b[done_b==1]
        target_f = self.model.predict(state_b, verbose=0)
        for k in range(target_f.shape[0]):
            target_f[k][action_b[k]] = target[k]
        self.model.train_on_batch(state_b, target_f)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    # Load, save models
    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)
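
The way `replay` builds its training targets can be traced by hand on a single transition. The sketch below is a plain-Python illustration with hypothetical numbers and a hypothetical helper name (`build_target_row`); it mirrors the idea of the method rather than its batched implementation.

```python
gamma = 0.95

def build_target_row(q_state, action, reward, q_next, done, gamma=gamma):
    """Return the regression target for one transition.

    q_state: network outputs for the current state (one value per action)
    q_next:  network outputs for the next state
    """
    # Terminal transitions have no future return to bootstrap from
    target = reward if done else reward + gamma * max(q_next)
    target_row = list(q_state)   # copy, so the other actions keep their prediction
    target_row[action] = target  # only the action actually taken gets a new target
    return target_row

row = build_target_row([0.5, 0.2], action=1, reward=1.0, q_next=[0.3, 0.7], done=False)
terminal_row = build_target_row([0.5, 0.2], action=1, reward=1.0, q_next=[0.3, 0.7], done=True)
print(row)
print(terminal_row)
```

Since the entries of the untouched actions equal the network's own prediction, their contribution to the MSE loss is zero, and the gradient step only moves the output of the action that was taken.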

**Main**

Next, we implement the training of the agent. (Warning: it takes a while...)

In [None]:
EPISODES = 200
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)
batch_size = 32

for e in range(EPISODES):
    state, _ = env.reset()
    state = np.reshape(state, [1, state_size])
    for t in range(200):  # `t` avoids shadowing the imported `time` module
        action = agent.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, EPISODES, t, agent.epsilon))
            break
            break
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)

Let's now visualize how the agent is performing:

In [None]:
"""
Utility functions to enable video recording of a gym environment and to display it.
To enable video, just do "env = wrap_env(env)".
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else:
    print("Could not find video")


def wrap_env(env):
  env = gym.wrappers.RecordVideo(env, './video')
  return env

In [None]:
env = wrap_env(gym.make('CartPole-v1', render_mode='rgb_array'))
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
state, _ = env.reset()
state = np.reshape(state, [1, state_size])
for t in range(200):  # `t` avoids shadowing the imported `time` module
    screen = env.render()
    action = agent.exploit(state)
    state, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
      break
    state = np.reshape(state, [1, state_size])

env.close()
show_video()

You can have a look at the tutorials and code prepared by [OpenAI](https://spinningup.openai.com/en/latest/user/introduction.html) for further details on RL.

# Coursework

## Task 1: On-policy vs. Off-policy
Use the code given below to run the training loop, where the agent is trained for 200 episodes. The agent we give follows a Q-learning approach, which is off-policy. You will now change the approach to SARSA, which is on-policy. Also, for both Q-learning and SARSA, test two different policies: $\epsilon$-greedy and Softmax. $\epsilon$-greedy is already defined in the tutorial and implemented in the given agent. The Softmax policy samples the next action from the probability distribution given by $Softmax(Q(s, a))$. We provide a NumPy softmax function that normalizes the Q-values into a probability distribution before sampling. As with RNNs, the softmax function involves a temperature value; we set a default that works, but you can tweak it if you find a value with better performance. Report the new value if you decide to do so.
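
To build some intuition for the temperature, the sketch below computes a temperature-scaled softmax in plain Python for a pair of hypothetical Q-values. It is a simplified stand-in for the provided NumPy function (it omits the small constant that function adds to the denominator), and the name `softmax_probs` is an assumption.

```python
import math

def softmax_probs(q_values, temperature):
    """Temperature-scaled softmax over a list of Q-values."""
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

q_values = [1.00, 1.05]  # hypothetical Q-values for the two CartPole actions
for temperature in (0.025, 0.1, 1.0):
    print(temperature, [round(p, 3) for p in softmax_probs(q_values, temperature)])
```

A low temperature makes the policy nearly greedy even for small differences between Q-values, while a temperature close to 1 keeps the action probabilities close to uniform, so the temperature plays a role similar to $\epsilon$ in controlling exploration.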

You will need to modify `act` and `replay` from the `DQNAgent` to implement the different approaches we ask for. Results may differ from run to run due to different initialization states.
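
The difference between the two approaches comes down to which next-state value enters the target. The following sketch contrasts both targets on one hypothetical transition; the function names and numbers are illustrative assumptions.

```python
gamma = 0.95

def q_learning_target(reward, q_next, gamma=gamma):
    """Off-policy: bootstrap from the best next action, whatever the policy would do."""
    return reward + gamma * max(q_next)

def sarsa_target(reward, q_next, next_action, gamma=gamma):
    """On-policy: bootstrap from the next action actually chosen by the policy."""
    return reward + gamma * q_next[next_action]

q_next = [0.4, 0.9]   # hypothetical Q-values of the next state
next_action = 0       # suppose the behaviour policy explored and picked action 0

print(q_learning_target(1.0, q_next))
print(sarsa_target(1.0, q_next, next_action))
```

Whenever the policy happens to pick the greedy action, both targets coincide; they only differ on exploratory steps, which is why the choice of policy ($\epsilon$-greedy or Softmax) matters more for SARSA.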

**Report**
* Plot the average reward for the last 50 episodes vs. number of training episodes (train for 200 episodes) for the four agents trained: Q-learning and SARSA with both $\epsilon$-greedy policy and Softmax policy. Attach in the Appendix the modifications done to `DQNAgent` to implement the different agents. Do not include your code, a simple explanation with the key modifications is enough.
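
The average over the last 50 episodes can be kept with a bounded `deque`, which is the approach the training loop below follows. Here is a minimal standalone sketch with hypothetical rewards and a hypothetical helper name (`running_average`); a small window is used so the effect is easy to check by hand.

```python
import collections

def running_average(rewards, window=50):
    """Average of the most recent `window` rewards after each episode."""
    recent = collections.deque(maxlen=window)
    averages = []
    for r in rewards:
        recent.append(r)  # the oldest reward drops out once the window is full
        averages.append(sum(recent) / len(recent))
    return averages

averages = running_average([10, 20, 30, 40], window=2)
print(averages)
```

Note that during the first episodes the window is not yet full, so the early points of the curve average fewer than 50 episodes.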

In [None]:
def softmax(x, temperature=0.025):
    """Compute softmax values for each set of scores in x (one row per set)."""
    x = (x - np.expand_dims(np.max(x, 1), 1))
    x = x/temperature
    e_x = np.exp(x)
    return e_x / (np.expand_dims(e_x.sum(1), -1) + 1e-5)

class DQNAgent:
  def __init__(self, state_size, action_size):
    self.state_size = state_size
    self.action_size = action_size
    self.memory = collections.deque(maxlen=20000)
    self.gamma = 0.95    # discount rate
    self.epsilon = 1.0  # exploration rate
    self.epsilon_min = 0.01
    self.epsilon_decay = 0.995
    self.learning_rate = 0.001
    self.model = self._build_model()


  def _build_model(self):
    # Neural Net for Deep-Q learning Model
    model = models.Sequential()
    model.add(layers.Input(shape=(self.state_size,)))
    model.add(layers.Dense(24, activation='relu'))
    model.add(layers.Dense(48, activation='relu'))
    model.add(layers.Dense(self.action_size, activation='linear'))
    model.compile(loss='mse',
                  optimizer=optimizers.Adam(learning_rate=self.learning_rate))
    return model

  def remember(self, state, action, reward, next_state, done):
    self.memory.append((state, action, reward, next_state, done))

  def act(self, state, policy): # Select an action with the epsilon-greedy or the softmax policy
    if policy == "epsilon-greedy":
      if np.random.rand() <= self.epsilon:
          return random.randrange(self.action_size)
      act_values = self.model.predict(state,verbose=0)
      return np.argmax(act_values[0])

    elif policy == "softmax":
      act_values = self.model.predict(state, verbose=0)
      act_values_2d = act_values.reshape(1, -1)  # shape (1, action_size), as expected by softmax
      probabilities = softmax(act_values_2d)[0]  # probabilities over actions for this state
      return np.random.choice(self.action_size, p=probabilities)

    else:
      raise ValueError("Invalid policy. Choose either 'epsilon-greedy' or 'softmax'.")

  def exploit(self, state): # When we test the agent we dont want it to explore anymore, but to exploit what it has learnt
    act_values = self.model.predict(state,verbose=0)
    return np.argmax(act_values[0])

  def replay(self, batch_size, mode, policy):
    minibatch = random.sample(self.memory, batch_size)

    state_b = np.squeeze(np.array(list(map(lambda x: x[0], minibatch))))
    action_b = np.squeeze(np.array(list(map(lambda x: x[1], minibatch))))
    reward_b = np.squeeze(np.array(list(map(lambda x: x[2], minibatch))))
    next_state_b = np.squeeze(np.array(list(map(lambda x: x[3], minibatch))))
    done_b = np.squeeze(np.array(list(map(lambda x: x[4], minibatch))))

    ### Q-learning
    if mode == "Q-learning":
      target = (reward_b + self.gamma *
                        np.amax(self.model.predict(next_state_b,verbose=0), 1))

    ### SARSA
    elif mode == "SARSA":
      next_actions = np.array([self.act(np.reshape(next_state, [1, self.state_size]), policy)
                             for next_state in next_state_b])  # Select next action with policy
      next_q_values = self.model.predict(next_state_b, verbose=0)
      target = reward_b + self.gamma * next_q_values[np.arange(len(next_state_b)), next_actions]

    # other mode input
    else:
      raise ValueError("Invalid mode. Choose either 'Q-learning' or 'SARSA'.")


    target[done_b == 1] = reward_b[done_b == 1] # Update targets for terminal states
    target_f = self.model.predict(state_b, verbose=0) # Predict Q-values for current states
    for k in range(target_f.shape[0]):
      target_f[k][action_b[k]] = target[k]
    self.model.train_on_batch(state_b, target_f)
    if self.epsilon > self.epsilon_min:
      self.epsilon *= self.epsilon_decay

  def load(self, name):
    self.model.load_weights(name)
  def save(self, name):
    self.model.save_weights(name)

In [None]:
Q_epsilon_greedy_avg_reward = []
Q_softmax_avg_reward = []
SARSA_epsilon_greedy_avg_reward = []
SARSA_softmax_avg_reward =[]

In [None]:
SARSA_softmax_avg_reward = []  # reset the list that this run (mode="SARSA", policy="softmax") appends to

EPISODES = 200
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
policy = "softmax"
mode = "SARSA"
agent = DQNAgent(state_size, action_size)
batch_size = 32
episode_reward_list = collections.deque(maxlen=50)


for e in range(EPISODES):
    state, _ = env.reset()
    state = np.reshape(state, [1, state_size])
    total_reward = 0
    for t in range(200):  # `t` avoids shadowing the imported `time` module
      action = agent.act(state,policy)
      next_state, reward, terminated, truncated, _ = env.step(action)
      done = terminated or truncated
      total_reward += reward
      next_state = np.reshape(next_state, [1, state_size])
      agent.remember(state, action, reward, next_state, done)
      state = next_state
      if done:
          break
      if len(agent.memory) > batch_size:
          agent.replay(batch_size,mode,policy)
    episode_reward_list.append(total_reward)
    episode_reward_avg = np.array(episode_reward_list).mean()
    SARSA_softmax_avg_reward.append(episode_reward_avg)
    print("episode: {}/{}, score: {}, e: {:.2}, last 50 ep. avg. rew.: {:.2f}"
                .format(e, EPISODES, total_reward, agent.epsilon, episode_reward_avg))

In [None]:
print(SARSA_softmax_avg_reward)


In [None]:
SARSA_epsilon_greedy_avg_reward = [np.float64(11.0), np.float64(19.0), np.float64(16.333333333333332), np.float64(16.5), np.float64(18.6), np.float64(18.833333333333332), np.float64(21.0), np.float64(20.5), np.float64(19.555555555555557), np.float64(19.3), np.float64(19.272727272727273), np.float64(18.583333333333332), np.float64(17.923076923076923), np.float64(17.357142857142858), np.float64(16.8), np.float64(16.4375), np.float64(16.058823529411764), np.float64(15.777777777777779), np.float64(15.473684210526315), np.float64(15.2), np.float64(14.952380952380953), np.float64(14.636363636363637), np.float64(14.73913043478261), np.float64(14.5), np.float64(14.56), np.float64(14.346153846153847), np.float64(14.148148148148149), np.float64(14.035714285714286), np.float64(13.89655172413793), np.float64(13.866666666666667), np.float64(14.193548387096774), np.float64(14.5), np.float64(14.787878787878787), np.float64(15.294117647058824), np.float64(15.428571428571429), np.float64(15.444444444444445), np.float64(15.621621621621621), np.float64(15.789473684210526), np.float64(16.05128205128205), np.float64(16.525), np.float64(16.975609756097562), np.float64(18.166666666666668), np.float64(18.46511627906977), np.float64(18.636363636363637), np.float64(19.4), np.float64(20.434782608695652), np.float64(20.659574468085108), np.float64(20.895833333333332), np.float64(21.591836734693878), np.float64(22.04), np.float64(23.1), np.float64(23.72), np.float64(24.52), np.float64(25.24), np.float64(25.86), np.float64(27.56), np.float64(29.94), np.float64(31.2), np.float64(33.36), np.float64(34.94), np.float64(37.7), np.float64(41.34), np.float64(42.54), np.float64(46.34), np.float64(48.34), np.float64(50.7), np.float64(52.92), np.float64(55.96), np.float64(58.54), np.float64(62.34), np.float64(64.58), np.float64(66.98), np.float64(70.64), np.float64(73.34), np.float64(76.6), np.float64(80.42), np.float64(84.24), np.float64(87.46), np.float64(91.06), np.float64(94.36), np.float64(97.74), 
np.float64(100.78), np.float64(104.3), np.float64(107.66), np.float64(111.26), np.float64(114.94), np.float64(117.92), np.float64(121.48), np.float64(124.88), np.float64(128.18), np.float64(131.42), np.float64(133.18), np.float64(136.56), np.float64(139.58), np.float64(142.14), np.float64(144.22), np.float64(147.6), np.float64(150.22), np.float64(152.78), np.float64(155.16), np.float64(157.26), np.float64(159.42), np.float64(162.16), np.float64(164.78), np.float64(166.8), np.float64(167.36), np.float64(167.66), np.float64(169.58), np.float64(170.16), np.float64(171.72), np.float64(172.58), np.float64(172.6), np.float64(174.08), np.float64(172.92), np.float64(173.52), np.float64(173.96), np.float64(174.42), np.float64(174.28), np.float64(174.86), np.float64(173.34), np.float64(174.02), np.float64(175.46), np.float64(174.44), np.float64(174.34), np.float64(173.7), np.float64(172.16), np.float64(170.98), np.float64(170.08), np.float64(169.34), np.float64(168.62), np.float64(167.4), np.float64(166.86), np.float64(166.18), np.float64(165.02), np.float64(164.08), np.float64(162.56), np.float64(161.74), np.float64(160.4), np.float64(159.74), np.float64(158.42), np.float64(157.42), np.float64(156.84), np.float64(156.06), np.float64(156.52), np.float64(155.22), np.float64(154.26), np.float64(153.04), np.float64(152.1), np.float64(151.82), np.float64(151.34), np.float64(151.14), np.float64(150.26), np.float64(148.92), np.float64(147.72), np.float64(146.86), np.float64(146.54), np.float64(146.02), np.float64(146.5), np.float64(145.72), np.float64(145.26), np.float64(144.62), np.float64(143.44), np.float64(144.3), np.float64(143.68), np.float64(144.22), np.float64(143.62), np.float64(143.92), np.float64(143.76), np.float64(142.78), np.float64(143.26), np.float64(143.28), np.float64(141.78), np.float64(141.34), np.float64(140.9), np.float64(140.9), np.float64(141.42), np.float64(141.76), np.float64(141.8), np.float64(140.88), np.float64(141.6), np.float64(141.96), 
np.float64(141.84), np.float64(141.14), np.float64(141.46), np.float64(141.06), np.float64(140.84), np.float64(140.46), np.float64(140.64), np.float64(140.08), np.float64(140.06), np.float64(140.16), np.float64(141.08), np.float64(140.78), np.float64(139.28), np.float64(139.86), np.float64(140.9), np.float64(140.64), np.float64(141.12), np.float64(139.98), np.float64(140.9)]
SARSA_softmax_avg_reward =
Q_epsilon_greedy_avg_reward = [np.float64(59.0), np.float64(45.5), np.float64(34.0), np.float64(31.75), np.float64(27.2), np.float64(26.333333333333332), np.float64(24.571428571428573), np.float64(23.375), np.float64(22.666666666666668), np.float64(21.5), np.float64(20.90909090909091), np.float64(19.916666666666668), np.float64(19.153846153846153), np.float64(18.5), np.float64(17.933333333333334), np.float64(17.375), np.float64(17.11764705882353), np.float64(17.444444444444443), np.float64(17.0), np.float64(16.65), np.float64(16.285714285714285), np.float64(16.045454545454547), np.float64(15.782608695652174), np.float64(15.5), np.float64(15.32), np.float64(15.153846153846153), np.float64(15.0), np.float64(15.035714285714286), np.float64(14.931034482758621), np.float64(14.766666666666667), np.float64(14.580645161290322), np.float64(14.5), np.float64(14.393939393939394), np.float64(14.235294117647058), np.float64(14.228571428571428), np.float64(14.083333333333334), np.float64(13.972972972972974), np.float64(14.131578947368421), np.float64(14.076923076923077), np.float64(13.925), np.float64(13.78048780487805), np.float64(13.80952380952381), np.float64(17.0), np.float64(17.727272727272727), np.float64(18.244444444444444), np.float64(19.5), np.float64(20.46808510638298), np.float64(21.604166666666668), np.float64(22.632653061224488), np.float64(23.94), np.float64(25.1), np.float64(26.26), np.float64(27.62), np.float64(28.48), np.float64(29.28), np.float64(32.36), np.float64(33.48), np.float64(34.2), np.float64(35.72), np.float64(36.64), np.float64(38.4), np.float64(39.84), np.float64(40.84), np.float64(42.76), np.float64(45.68), np.float64(47.92), np.float64(51.54), np.float64(54.32), np.float64(56.84), np.float64(60.64), np.float64(63.34), np.float64(66.14), np.float64(69.72), np.float64(73.54), np.float64(75.36), np.float64(78.92), np.float64(81.5), np.float64(84.8), np.float64(88.1), np.float64(91.44), np.float64(94.08), np.float64(97.84), np.float64(101.62), 
np.float64(105.44), np.float64(109.16), np.float64(112.98), np.float64(116.78), np.float64(120.38), np.float64(124.14), np.float64(127.98), np.float64(131.82), np.float64(135.52), np.float64(136.5), np.float64(139.52), np.float64(142.7), np.float64(145.18), np.float64(147.88), np.float64(150.38), np.float64(152.94), np.float64(155.1), np.float64(156.76), np.float64(158.96), np.float64(160.6), np.float64(163.24), np.float64(166.26), np.float64(166.74), np.float64(169.34), np.float64(172.3), np.float64(174.44), np.float64(177.22), np.float64(179.16), np.float64(181.54), np.float64(184.34), np.float64(186.22), np.float64(187.1), np.float64(188.68), np.float64(188.8), np.float64(189.56), np.float64(190.86), np.float64(190.86), np.float64(191.82), np.float64(192.8), np.float64(193.02), np.float64(192.94), np.float64(194.5), np.float64(194.72), np.float64(195.92), np.float64(196.3), np.float64(196.76), np.float64(196.68), np.float64(197.04), np.float64(196.32), np.float64(196.32), np.float64(196.32), np.float64(196.32), np.float64(196.32), np.float64(195.58), np.float64(195.58), np.float64(195.58), np.float64(195.58), np.float64(195.42), np.float64(195.42), np.float64(195.42), np.float64(194.68), np.float64(194.68), np.float64(194.68), np.float64(194.1), np.float64(194.1), np.float64(193.24), np.float64(192.5), np.float64(192.5), np.float64(192.5), np.float64(193.28), np.float64(193.28), np.float64(192.68), np.float64(192.0), np.float64(191.86), np.float64(191.88), np.float64(191.88), np.float64(191.92), np.float64(191.72), np.float64(191.72), np.float64(191.4), np.float64(191.4), np.float64(191.4), np.float64(191.4), np.float64(191.4), np.float64(191.4), np.float64(191.4), np.float64(191.26), np.float64(191.38), np.float64(190.78), np.float64(190.74), np.float64(190.82), np.float64(191.22), np.float64(191.22), np.float64(191.22), np.float64(191.14), np.float64(190.3), np.float64(190.84), np.float64(191.42), np.float64(191.76), np.float64(191.64), np.float64(191.64), 
np.float64(191.64), np.float64(191.04), np.float64(191.78), np.float64(191.78), np.float64(191.3), np.float64(191.3), np.float64(191.46), np.float64(190.18), np.float64(190.18), np.float64(190.8), np.float64(190.8), np.float64(190.8), np.float64(191.38), np.float64(190.56), np.float64(190.82), np.float64(191.32)]
Q_softmax_avg_reward = [np.float64(10.0), np.float64(9.5), np.float64(9.666666666666666), np.float64(9.5), np.float64(9.6), np.float64(9.5), np.float64(9.571428571428571), np.float64(9.625), np.float64(9.555555555555555), np.float64(9.5), np.float64(9.545454545454545), np.float64(9.583333333333334), np.float64(9.615384615384615), np.float64(9.571428571428571), np.float64(9.466666666666667), np.float64(9.5), np.float64(9.529411764705882), np.float64(9.555555555555555), np.float64(9.526315789473685), np.float64(9.55), np.float64(9.571428571428571), np.float64(9.590909090909092), np.float64(9.608695652173912), np.float64(9.583333333333334), np.float64(9.6), np.float64(9.576923076923077), np.float64(9.62962962962963), np.float64(9.642857142857142), np.float64(9.655172413793103), np.float64(9.6), np.float64(9.580645161290322), np.float64(9.5625), np.float64(9.545454545454545), np.float64(9.558823529411764), np.float64(9.571428571428571), np.float64(9.583333333333334), np.float64(9.594594594594595), np.float64(9.605263157894736), np.float64(9.58974358974359), np.float64(9.575), np.float64(9.585365853658537), np.float64(9.547619047619047), np.float64(9.55813953488372), np.float64(9.522727272727273), np.float64(9.555555555555555), np.float64(9.58695652173913), np.float64(9.595744680851064), np.float64(9.583333333333334), np.float64(9.571428571428571), np.float64(9.58), np.float64(9.56), np.float64(9.58), np.float64(9.56), np.float64(9.58), np.float64(9.56), np.float64(9.56), np.float64(9.56), np.float64(9.54), np.float64(9.54), np.float64(9.56), np.float64(9.56), np.float64(9.56), np.float64(9.58), np.float64(9.62), np.float64(9.62), np.float64(9.6), np.float64(9.56), np.float64(9.58), np.float64(9.58), np.float64(9.6), np.float64(9.58), np.float64(9.58), np.float64(9.56), np.float64(9.58), np.float64(9.56), np.float64(9.6), np.float64(9.6), np.float64(9.6), np.float64(9.58), np.float64(9.58), np.float64(9.58), np.float64(9.6), np.float64(9.9), np.float64(10.12), 
np.float64(10.34), np.float64(10.36), np.float64(10.62), np.float64(10.62), np.float64(10.66), np.float64(10.92), np.float64(11.82), np.float64(11.94), np.float64(12.06), np.float64(12.32), np.float64(12.36), np.float64(12.42), np.float64(12.48), np.float64(12.54), np.float64(12.54), np.float64(12.6), np.float64(12.7), np.float64(12.76), np.float64(12.82), np.float64(13.06), np.float64(13.22), np.float64(13.3), np.float64(13.46), np.float64(13.64), np.float64(13.68), np.float64(13.8), np.float64(14.02), np.float64(14.2), np.float64(14.48), np.float64(14.68), np.float64(14.92), np.float64(15.08), np.float64(15.28), np.float64(15.5), np.float64(15.96), np.float64(16.34), np.float64(16.86), np.float64(17.1), np.float64(17.36), np.float64(17.68), np.float64(18.34), np.float64(18.64), np.float64(18.86), np.float64(19.3), np.float64(19.54), np.float64(19.94), np.float64(20.42), np.float64(20.68), np.float64(20.84), np.float64(21.0), np.float64(21.42), np.float64(22.22), np.float64(22.66), np.float64(23.76), np.float64(24.6), np.float64(25.12), np.float64(25.24), np.float64(26.3), np.float64(28.04), np.float64(29.06), np.float64(30.3), np.float64(31.58), np.float64(33.04), np.float64(34.8), np.float64(36.36), np.float64(37.8), np.float64(39.28), np.float64(40.64), np.float64(42.26), np.float64(43.44), np.float64(44.8), np.float64(46.08), np.float64(47.32), np.float64(48.54), np.float64(50.16), np.float64(51.6), np.float64(53.12), np.float64(54.16), np.float64(55.28), np.float64(56.6), np.float64(58.04), np.float64(59.38), np.float64(60.58), np.float64(61.66), np.float64(62.8), np.float64(63.68), np.float64(64.68), np.float64(66.14), np.float64(67.48), np.float64(68.62), np.float64(69.54), np.float64(70.68), np.float64(71.7), np.float64(73.0), np.float64(74.4), np.float64(75.5), np.float64(76.36), np.float64(77.68), np.float64(78.74), np.float64(79.88), np.float64(80.82), np.float64(81.66), np.float64(82.44), np.float64(82.88), np.float64(83.86), np.float64(84.44), 
np.float64(84.84), np.float64(85.32), np.float64(85.1), np.float64(85.4), np.float64(85.7), np.float64(85.82), np.float64(85.98), np.float64(85.64), np.float64(85.84), np.float64(85.84)]

In [None]:
import matplotlib.pyplot as plt

# Map each legend label to its list of per-episode average rewards
series = {
    "Q-learning (epsilon-greedy)": Q_epsilon_greedy_avg_reward,
    "Q-learning (softmax)": Q_softmax_avg_reward,
    "SARSA (epsilon-greedy)": SARSA_epsilon_greedy_avg_reward,
    "SARSA (softmax)": SARSA_softmax_avg_reward,
}

# Plot each series against its own episode index, so the lists need not share a length
for label, avg_reward in series.items():
    plt.plot(range(len(avg_reward)), avg_reward, label=label)

# Add labels and title
plt.xlabel("Episode")
plt.ylabel("Average Reward (Last 50 Episodes)")
plt.title("Average Reward vs. Episode for Different Agents")
plt.grid(True)

# Add legend
plt.legend()

# Show the plot
plt.show()
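
The y-axis label above refers to an average over the last 50 episodes. The cell that produced the lists is not shown in this section, so the following is only a minimal sketch of how such a trailing average could be computed, assuming a plain mean over at most the 50 most recent episode rewards (fewer during the first episodes). The function name `trailing_average`, its `window` parameter, and the sample rewards are hypothetical and chosen for illustration.

```python
import numpy as np


def trailing_average(rewards, window=50):
    """Return, for each episode, the mean reward over the most recent
    `window` episodes (or over all episodes so far, if fewer exist)."""
    rewards = np.asarray(rewards, dtype=float)
    averages = np.empty_like(rewards)
    for i in range(len(rewards)):
        start = max(0, i - window + 1)  # first episode inside the window
        averages[i] = rewards[start:i + 1].mean()
    return averages


# Illustrative (made-up) per-episode rewards
example_rewards = [59, 32, 11]
print(trailing_average(example_rewards).tolist())  # [59.0, 45.5, 34.0]
```

Under this assumption, the early part of each curve averages over very few episodes, which would explain why the first values of a list can swing sharply before the window fills up.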