In [1]:
import gym
import matplotlib.pyplot as plt
import numpy as np
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import random
from IPython.display import clear_output
from collections import deque
from tqdm.notebook import tqdm

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import keras.backend as K

tf.compat.v1.disable_eager_execution()

## Prioritized Experience Replay

### Main Ideas:

1. **Enhancing Experience Replay**:
   - **Experience Replay** is a fundamental component in deep reinforcement learning that stores past experiences $ (s, a, r, s') $ in a replay buffer.
   - **Prioritized Experience Replay (PER)** improves upon standard experience replay by prioritizing experiences that are more informative, allowing the agent to learn more efficiently.

2. **Prioritization Based on TD Error**:
   - **Temporal-Difference (TD) Error**: PER assigns priorities to experiences based on the magnitude of their TD errors.
   - Experiences with higher TD errors are considered more surprising or informative and are sampled more frequently for training.

3. **Sampling Mechanism**:
   - **Proportional Prioritization**: The probability of sampling an experience is proportional to its priority raised to a power $ \alpha $.
   - **Rank-Based Prioritization**: Alternatively, experiences can be ranked, and sampling probabilities can decrease monotonically with rank.

4. **Bias Correction with Importance Sampling**:
   - Non-uniform sampling introduces bias in the updates. PER compensates for this by applying **Importance Sampling (IS) Weights** to the updates.
   - The IS weights rescale each update to offset the non-uniform sampling probabilities; the correction is complete only when $ \beta = 1 $.

5. **Adaptive Learning**:
   - PER dynamically adjusts the priorities of experiences based on their TD errors, ensuring that the replay buffer remains focused on the most relevant experiences as learning progresses.
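
To make the two sampling mechanisms in item 3 concrete, here is a minimal sketch that turns the same TD errors into proportional and rank-based sampling probabilities. The function names, the $ \alpha = 0.6 $ default, and the choice $ p_i = 1 / \text{rank}(i) $ for the rank-based variant are illustrative assumptions, not part of the implementation later in this notebook.

```python
import numpy as np

def proportional_probs(td_errors, alpha=0.6, eps=1e-6):
    """P(i) proportional to (|delta_i| + eps) ** alpha."""
    priorities = np.abs(np.asarray(td_errors, dtype=np.float64)) + eps
    scaled = priorities ** alpha
    return scaled / scaled.sum()

def rank_based_probs(td_errors, alpha=0.6):
    """P(i) proportional to (1 / rank(i)) ** alpha, where rank 1 is the largest |delta|."""
    abs_errors = np.abs(np.asarray(td_errors, dtype=np.float64))
    order = np.argsort(-abs_errors)                 # indices sorted by descending |delta|
    ranks = np.empty(len(abs_errors), dtype=np.int64)
    ranks[order] = np.arange(1, len(abs_errors) + 1)
    scaled = (1.0 / ranks) ** alpha
    return scaled / scaled.sum()

td_errors = [0.1, -2.0, 0.5, 50.0]   # hypothetical errors with one outlier
print("proportional:", proportional_probs(td_errors))
print("rank-based:  ", rank_based_probs(td_errors))
```

In this toy example the outlier dominates the proportional distribution far more than the rank-based one, because ranks ignore how large the gap between errors is.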

### Structure of Prioritized Experience Replay:

- **Replay Buffer with Priorities**:
  - The replay buffer not only stores experiences but also maintains a priority value for each experience.
  
- **Priority Assignment**:
  - Each experience $ i $ is assigned a priority $ p_i $ based on its TD error:
    $$
    p_i = |\delta_i| + \epsilon
    $$
    where $ \delta_i = r + \gamma \max_{a'} Q(s', a') - Q(s, a) $ is the TD error, and $ \epsilon $ is a small positive constant to ensure all experiences have a non-zero probability of being sampled.
  
- **Sampling Process**:
  - **Proportional Prioritization**:
    $$
    P(i) = \frac{p_i^\alpha}{\sum_{k} p_k^\alpha}
    $$
    where $ \alpha $ determines the degree of prioritization ($ \alpha = 0 $ corresponds to uniform sampling).
  
  - **Rank-Based Prioritization**:
    - Experiences are sorted by $ |\delta_i| $ and assigned $ p_i = 1 / \text{rank}(i) $, so sampling probabilities depend only on the ordering, which makes this variant less sensitive to outlier TD errors.
  
- **Importance Sampling Weights**:
  - To correct for the bias introduced by prioritized sampling, IS weights $ w_i $ are computed as:
    $$
    w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\beta
    $$
    where $ N $ is the size of the replay buffer, and $ \beta $ controls the amount of correction ($ \beta = 0 $ means no correction, and $ \beta = 1 $ fully compensates for the non-uniform probabilities).
  - These weights are typically normalized by the maximum weight in the batch to ensure stability.
  
- **Updating Priorities**:
  - After each training step, the priorities of the sampled experiences are updated based on their new TD errors to reflect their current importance.
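
The steps above (priority assignment, proportional sampling, importance-sampling weights, priority update) can be sketched on a toy buffer of four transitions. This is a minimal illustration under assumed values: the helper name `importance_weights`, the TD errors, and the choice $ \alpha = \beta = 1 $ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

def importance_weights(probs, indices, beta):
    """w_i = (1 / (N * P(i))) ** beta, divided by the largest weight in the batch."""
    probs = np.asarray(probs, dtype=np.float64)
    weights = (1.0 / (len(probs) * probs[indices])) ** beta
    return weights / weights.max()

# Priorities p_i = |delta_i| + eps for a toy buffer of four transitions.
eps = 1e-6
priorities = np.abs(np.array([0.1, 0.2, 0.3, 0.4])) + eps
alpha, beta = 1.0, 1.0
probs = priorities ** alpha / np.sum(priorities ** alpha)

# Sample a batch according to P(i), then weight each sampled transition.
batch = rng.choice(len(probs), size=2, p=probs)
weights = importance_weights(probs, batch, beta)

# After the training step, overwrite the sampled priorities with |new TD error| + eps.
new_td_errors = np.array([0.05, -0.7])   # hypothetical post-update errors
priorities[batch] = np.abs(new_td_errors) + eps
```

With $ \beta = 1 $, a transition sampled four times as often as another receives one quarter of its weight, so frequently replayed transitions take proportionally smaller gradient steps.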

### Why Prioritized Experience Replay?

- **Improved Sample Efficiency**:
  - By focusing on more informative experiences, PER enables the agent to learn effectively from fewer samples compared to uniform sampling.
  
- **Faster Convergence**:
  - Prioritizing high-TD-error experiences accelerates the learning process, leading to quicker policy and value function updates.
  
- **Enhanced Learning Stability**:
  - Emphasizing significant transitions reduces the variance introduced by irrelevant or redundant experiences, contributing to more stable learning.
  
- **Better Performance in Complex Environments**:
  - In environments with sparse or delayed rewards, PER ensures that crucial experiences are revisited more frequently, aiding in the discovery of optimal strategies.

### Advantages of Prioritized Experience Replay

1. **Focused Learning**:
   - Directs the learning process towards experiences that have a higher impact on the agent's policy and value estimates.
   
2. **Reduced Redundancy**:
   - Minimizes the repetition of less informative experiences, making more efficient use of the replay buffer.
   
3. **Adaptive Sampling**:
   - Dynamically adjusts sampling probabilities based on the agent's current learning progress, ensuring the replay buffer remains relevant throughout training.

### Implementation Details

- **Data Structures for Efficient Sampling**:
  - **Sum Tree**:
    - A binary tree data structure where each parent node is the sum of its child nodes' priorities.
    - Enables efficient $ \mathcal{O}(\log N) $ time complexity for updating priorities and sampling based on the cumulative distribution.
  
  - **Heap Structures**:
    - Alternative approaches use heap-based structures to maintain sorted priorities, though they may be less efficient than sum trees for certain operations.
  
- **Hyperparameters**:
  - **$\alpha$**: Controls the degree of prioritization. Common values range between 0.4 and 0.6.
  - **$\beta$**: Adjusts the amount of importance-sampling correction. It is often annealed from a lower value to 1 over the course of training.
  - **$\epsilon$**: A small constant (e.g., $1 \times 10^{-6}$) to ensure all experiences have a non-zero probability of being sampled.
  
- **Integration with Deep Q-Networks (DQN)**:
  - PER is typically integrated into the DQN framework by modifying the experience replay buffer to handle priorities and implementing the sampling and weighting mechanisms during training.
  
- **Handling Edge Cases**:
  - Ensuring that the replay buffer does not become overly skewed by a few high-priority experiences.
  - Balancing the strength of prioritization ($ \alpha $) against the amount of bias correction ($ \beta $).
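
As a rough illustration of the sum tree described above, the sketch below stores the tree in a flat array, with leaves holding priorities and each parent holding the sum of its two children. The class and method names are hypothetical, and the capacity is assumed to be a power of two so that leaves keep their insertion order; the classes later in this notebook use a plain `deque` instead.

```python
import numpy as np

class SumTree:
    """Binary tree in a flat array: leaves hold priorities, parents hold sums."""

    def __init__(self, capacity):
        # Assumption: capacity is a power of two.
        self.capacity = capacity
        # Node 1 is the root, nodes capacity..2*capacity-1 are the leaves.
        self.tree = np.zeros(2 * capacity, dtype=np.float64)

    def total(self):
        """Sum of all priorities (stored at the root)."""
        return self.tree[1]

    def update(self, index, priority):
        """Set the priority of leaf `index` and refresh the sums above it: O(log N)."""
        node = index + self.capacity
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def find(self, value):
        """Return the leaf whose cumulative-priority interval contains `value`: O(log N)."""
        node = 1
        while node < self.capacity:
            left = 2 * node
            if value < self.tree[left]:
                node = left
            else:
                value -= self.tree[left]
                node = left + 1
        return node - self.capacity

tree = SumTree(capacity=4)
for i, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, p)

# Drawing a uniform value in [0, total) samples leaf i with probability p_i / total.
rng = np.random.default_rng(0)
sampled_index = tree.find(rng.uniform(0.0, tree.total()))
```

Both `update` and `find` walk a single root-to-leaf path, which is where the $ \mathcal{O}(\log N) $ cost comes from.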

### Potential Drawbacks and Considerations

1. **Increased Computational Overhead**:
   - Managing priorities and maintaining data structures like sum trees can introduce additional computational complexity compared to uniform sampling.
   
2. **Bias Introduction**:
   - Non-uniform sampling can bias the learning process. Although importance-sampling weights mitigate this bias, careful tuning of hyperparameters is necessary.
   
3. **Hyperparameter Sensitivity**:
   - The performance of PER is sensitive to the choice of hyperparameters ($ \alpha $, $ \beta $, $ \epsilon $), requiring thorough experimentation and tuning.
   
4. **Implementation Complexity**:
   - Implementing efficient PER requires a more sophisticated replay buffer compared to standard uniform experience replay, increasing the complexity of the RL pipeline.
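
One way to reduce the tuning burden for $ \beta $ is a fixed annealing schedule that reaches 1 by the end of training. The sketch below is a linear schedule; the function name, the starting value of 0.4, and the step counts are assumptions chosen for illustration.

```python
def linear_beta(step, total_steps, beta_start=0.4, beta_end=1.0):
    """Linearly interpolate beta from beta_start to beta_end, then hold at beta_end."""
    fraction = min(1.0, max(0.0, step / total_steps))
    return beta_start + fraction * (beta_end - beta_start)

# Beta at the start, midpoint, and end of a hypothetical 100,000-step run.
for step in (0, 50_000, 100_000):
    print(step, round(linear_beta(step, total_steps=100_000), 3))
```

Because the fraction is clamped to $[0, 1]$, the schedule never pushes $ \beta $ above 1, which a multiplicative growth factor can do if left unchecked.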

### Variants and Extensions

1. **Stochastic Prioritized Experience Replay**:
   - Introduces randomness in the prioritization process to balance exploration and exploitation.
   
2. **Max-Prioritized Experience Replay**:
   - Focuses exclusively on experiences with the maximum priority, though it can lead to overfitting if not carefully managed.
   
3. **Multi-Step Prioritized Experience Replay**:
   - Extends PER to multi-step returns, allowing the agent to learn from longer sequences of experiences with prioritized sampling.
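
For the multi-step variant, the priority can be derived from an n-step TD error instead of the one-step error. The sketch below computes an n-step target and the resulting priority; the function names and the numbers are hypothetical, and `bootstrap_value` stands in for $ \max_{a'} Q(s_{t+n}, a') $.

```python
def n_step_target(rewards, gamma, bootstrap_value, done):
    """sum_k gamma^k * r_k + gamma^n * bootstrap_value (bootstrap dropped if the episode ended)."""
    target = 0.0 if done else bootstrap_value
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target

def n_step_priority(rewards, gamma, bootstrap_value, done, q_value, eps=1e-6):
    """Priority |n-step TD error| + eps for a stored n-step transition."""
    return abs(n_step_target(rewards, gamma, bootstrap_value, done) - q_value) + eps

# Three rewards of 1 with gamma = 0.5 and a bootstrap value of 8:
# 1 + 0.5 + 0.25 + 0.125 * 8 = 2.75
print(n_step_target([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=8.0, done=False))
```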

### Integration with Other RL Techniques

1. **Dueling Networks**:
   - Combining PER with dueling DQN architectures can further enhance learning by focusing on both state-value and action advantages.
   
2. **Double DQN**:
   - When used alongside Double DQN, the Double DQN target mitigates overestimation bias while PER concentrates sampling on important experiences.
   
3. **Rainbow DQN**:
   - PER is one of the key components in the Rainbow DQN framework, which integrates multiple improvements to create a more robust and high-performing RL agent.
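
To show how PER and Double DQN fit together, the sketch below computes Double DQN targets for a sampled batch: the online network picks the next action and the target network evaluates it. The resulting TD errors can then feed the priority update. The function name and the toy arrays are hypothetical, and the Q-values are passed in as plain NumPy arrays instead of being produced by a network.

```python
import numpy as np

def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    """r + gamma * Q_target(s', argmax_a Q_online(s', a)), with no bootstrap on terminal steps."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)
    next_q_online = np.asarray(next_q_online, dtype=np.float64)
    next_q_target = np.asarray(next_q_target, dtype=np.float64)
    best_actions = np.argmax(next_q_online, axis=1)                   # action selection: online net
    next_values = next_q_target[np.arange(len(rewards)), best_actions]  # evaluation: target net
    return rewards + gamma * next_values * (~dones)

targets = double_dqn_targets(
    rewards=[1.0, 1.0],
    dones=[False, True],
    next_q_online=[[1.0, 2.0], [3.0, 0.0]],
    next_q_target=[[10.0, 20.0], [30.0, 40.0]],
    gamma=0.5,
)
print(targets)
```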

### Empirical Results

- **Benchmark Performance**:
  - Studies have demonstrated that PER significantly improves the performance of DQN agents on various benchmark tasks, including Atari games and continuous control environments.
  
- **Sample Efficiency**:
  - PER agents often achieve higher rewards with fewer training steps compared to agents using uniform experience replay.
  
- **Stability and Convergence**:
  - PER contributes to more stable learning curves and faster convergence rates, particularly in environments with high-dimensional state spaces or sparse rewards.

### Conclusion

**Prioritized Experience Replay** is a powerful enhancement to the standard experience replay mechanism in reinforcement learning. By intelligently prioritizing more informative experiences, PER boosts sample efficiency, accelerates learning, and improves overall agent performance. While it introduces additional complexity and requires careful tuning, the benefits it offers make it a valuable component in modern deep reinforcement learning architectures. When combined with other advancements like dueling networks and Double DQN, PER plays a crucial role in pushing the boundaries of what RL agents can achieve in complex environments.


In [2]:
class PER():
    def __init__(self, 
                 observation_space, 
                 action_space, 
                 gamma=0.99, 
                 lr=0.001,
                 buffer_size=20000,
                 batch_size=32,
                 epsilon_decay=0.99,
                 epsilon= 0.7):
        self.observation_space = observation_space
        self.action_space = action_space
        self.gamma = gamma
        self.lr = lr
        self.buffer_size = buffer_size
        self.batch_size = batch_size
        self.epsilon_decay = epsilon_decay
        self.epsilon = epsilon
        self.buffer = deque(maxlen=self.buffer_size)
        self.model = self.build_model(name='model')
        self.target = self.build_model(name='target')
        
    def build_model(self, name):
        model = keras.Sequential(name=name)
        model.add(keras.Input(shape=self.observation_space))
        model.add(keras.layers.Dense(128, activation='relu'))
        model.add(keras.layers.Dense(128, activation='relu'))
        model.add(keras.layers.Dense(128, activation='relu'))
        model.add(keras.layers.Dense(self.action_space, activation='linear'))
        
        model.compile(
            optimizer=keras.optimizers.legacy.Adam(learning_rate=self.lr),
            loss='mse'
        )
        return model
    
    def predict(self, observation):
        return self.model.predict(np.array([observation]), verbose=False)[0]
    
    def predict_action(self, observation):
        return np.argmax(self.predict(observation))
    
    def e_greedy(self, observation):
        if len(self.buffer)%5==0 and self.epsilon > 0.01:
            self.epsilon = self.epsilon*self.epsilon_decay
        e = self.epsilon
        if random.random() >= e:
            return self.predict_action(observation)
        return random.randint(0, self.action_space-1)
    
    def remember(self, experience):
        initial_priority = max(self.priorities, default=1)
        self.buffer.append((experience, initial_priority))
    
    def get_probabilities(self, priority_scale):
        priorities = np.array([priority for _,priority in self.buffer])
        scaled_priorities = priorities ** priority_scale
        sample_probabilities = scaled_priorities / sum(scaled_priorities)
        return sample_probabilities
    
    def get_importance(self, probabilities, beta = 0.8):
        importance = ((1 / len(self.buffer))*(1 / probabilities))**beta
        importance_normalized = importance / max(importance)
        return importance_normalized
    

    def sample(self, batch_size = None, priority_scale = 0.1):
        if batch_size is None:
            batch_size = self.batch_size

        sample_probs = self.get_probabilities(priority_scale)
        sample_indices = random.choices(range(len(self.buffer)), k=batch_size, weights=sample_probs)
        
        # Extract experiences separately to avoid shape issues
        experiences = [self.buffer[i][0] for i in sample_indices]  # Assuming experience is the first element in tuple
        importance = self.get_importance(sample_probs[sample_indices])
        
        # Structure data by type for training
        states = np.array([exp[0] for exp in experiences])
        actions = np.array([exp[1] for exp in experiences])
        rewards = np.array([exp[2] for exp in experiences])
        next_states = np.array([exp[3] for exp in experiences])
        dones = np.array([exp[4] for exp in experiences])
        
        return (states, actions, rewards, next_states, dones), importance, sample_indices

    
    def set_priorities(self, indices, errors, offset=0.1):
        # Replace each sampled transition's priority with |TD error| + offset
        for idx, error in zip(indices, errors):
            experience, _ = self.buffer[idx]
            self.buffer[idx] = (experience, abs(error) + offset)

    def target_update(self):
        self.target.set_weights(self.model.get_weights())
    
    def train(self):
        if len(self.buffer) < self.batch_size:
            return

        (states, actions, rewards, next_states, dones), importance_weights, indices = self.sample(self.batch_size)

        current_q_values = self.model.predict(np.array(states), verbose=0)

        next_q_values = self.target.predict(np.array(next_states), verbose=0)

        targets = np.array(current_q_values)

        max_next_q_values = np.max(next_q_values, axis=1)
        
        for i in range(self.batch_size):
            if dones[i]:
                targets[i, actions[i]] = rewards[i]
            else:
                targets[i, actions[i]] = rewards[i] + self.gamma * max_next_q_values[i]

        td_errors = targets[np.arange(self.batch_size), actions] - current_q_values[np.arange(self.batch_size), actions]

        history = self.model.fit(np.array(states), targets, sample_weight=importance_weights, verbose=0)

        self.set_priorities(indices, td_errors)

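        # NOTE: this syncs the target network after every training step, so the
        # target offers little stabilization; DQAgent_PER below syncs only every
        # target_update_freq steps.
        self.target_update()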

        # Extract loss from history object and return it along with other metrics if necessary
        loss = history.history['loss'][0]  
        return loss, targets, current_q_values

    @property
    def priorities(self):
        return [priority for _, priority in self.buffer]

In [None]:
import random
from collections import deque
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from dataclasses import dataclass

@dataclass
class Experience:
    state: np.ndarray
    action: int
    reward: float
    next_state: np.ndarray
    done: bool
    priority: float = 1.0  # Default priority

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.5):
        """
        Args:
            capacity (int): Maximum number of experiences to store.
            alpha (float): How much prioritization is used (0 - no prioritization, 1 - full prioritization).
        """
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)
        self.alpha = alpha
    
    def add(self, experience: Experience):
        self.buffer.append(experience)
    
    def sample(self, batch_size, beta=0.5, alpha=None):
        if len(self.buffer) == 0:
            return None, None, None
        if alpha is None:
            alpha = self.alpha  # fall back to the exponent set at construction

        # Extract priorities (float64 reduces rounding error when normalizing)
        priorities = np.array([exp.priority for exp in self.buffer], dtype=np.float64)
        scaled_priorities = priorities ** alpha
        sample_probabilities = scaled_priorities / scaled_priorities.sum()

        batch_size = min(batch_size, len(self.buffer))
        indices = np.random.choice(len(self.buffer), size=batch_size, replace=False, p=sample_probabilities)
        sampled_experiences = [self.buffer[idx] for idx in indices]

        N = len(self.buffer)
        weights = (1.0 / (N * sample_probabilities[indices])) ** beta
        weights /= weights.max()  # Normalize for stability

        states = np.array([exp.state for exp in sampled_experiences])
        actions = np.array([exp.action for exp in sampled_experiences])
        rewards = np.array([exp.reward for exp in sampled_experiences])
        next_states = np.array([exp.next_state for exp in sampled_experiences])
        dones = np.array([exp.done for exp in sampled_experiences])

        return (states, actions, rewards, next_states, dones), indices, weights

    
    def update_priorities(self, indices, td_errors, offset=1e-6):
        """
        Update the priorities of sampled experiences.

        Args:
            indices (list or array): Indices of sampled experiences.
            td_errors (list or array): Corresponding TD errors.
            offset (float): Small constant to ensure no experience has zero priority.
        """
        for idx, error in zip(indices, td_errors):
            self.buffer[idx].priority = abs(error) + offset
    
    def __len__(self):
        return len(self.buffer)

class DQAgent_PER:
    def __init__(self, 
                 observation_space, 
                 action_space,
                 gamma=0.99, 
                 lr=5e-4,
                 buffer_size=20000,
                 batch_size=64,
                 epsilon_decay=0.999,
                 epsilon=0.7,
                 alpha=0.5,  
                 beta=0.5,
                 param_freq=3000,
                 target_update_freq=1000):
        """
        Args:
            observation_space (int): Dimensionality of the state space.
            action_space (int): Number of possible actions.
            gamma (float): Discount factor.
            lr (float): Learning rate.
            buffer_size (int): Replay buffer capacity.
            batch_size (int): Training batch size.
            epsilon_decay (float): Decay rate for epsilon.
            epsilon (float): Initial epsilon for epsilon-greedy.
            alpha (float): How much prioritization is used (0 - no prioritization, 1 - full prioritization).
            beta (float): Initial value of beta for importance-sampling.
            param_freq (int): Number of training steps between alpha/beta adjustments.
            target_update_freq (int): Number of training steps between target network updates.
        """
        self.observation_space = observation_space
        self.action_space = action_space
        self.gamma = gamma
        self.lr = lr
        self.batch_size = batch_size
        self.epsilon_decay = epsilon_decay
        self.epsilon = epsilon
        self.alpha = alpha
        self.beta = beta
        self.alpha_decay = 0.99
        self.beta_growth = 1.01
        self.param_freq = param_freq
        self.target_update_freq = target_update_freq
        self.train_step = 0
        
        self.buffer = PrioritizedReplayBuffer(capacity=buffer_size, alpha=self.alpha)
        self.model = self.build_model(name='model')
        self.target_model = self.build_model(name='target')
        self.update_target_network()
    
    def build_model(self, name):
        model = keras.Sequential(name=name)
        model.add(keras.Input(shape=(self.observation_space,)))
        model.add(keras.layers.Dense(128, activation='relu'))
        model.add(keras.layers.Dense(256, activation='relu'))
        model.add(keras.layers.Dense(128, activation='relu'))
        model.add(keras.layers.Dense(self.action_space, activation='linear'))
        
        model.compile(
            optimizer=keras.optimizers.Adam(learning_rate=self.lr),
            loss='mse'
        )
        return model
    
    def update_target_network(self):
        self.target_model.set_weights(self.model.get_weights())
    
    def remember(self, state, action, reward, next_state, done):
        """
        Store experience in replay buffer with maximum priority for new experiences.

        Args:
            state (array-like): Current state.
            action (int): Action taken.
            reward (float): Reward received.
            next_state (array-like): Next state.
            done (bool): Whether the episode ended.
        """
        max_priority = max(self.buffer.buffer, key=lambda exp: exp.priority).priority if len(self.buffer) > 0 else 1.0
        experience = Experience(state, action, reward, next_state, done, priority=max_priority)
        self.buffer.add(experience)
    
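    def get_param(self):
        # Every param_freq training steps, decay alpha and grow beta.
        # Beta is capped at 1.0, where the importance-sampling correction is complete.
        if self.train_step % self.param_freq == 0:
            self.alpha *= self.alpha_decay
            self.beta = min(1.0, self.beta * self.beta_growth)
        return self.alpha, self.beta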
    
    def e_greedy(self, state):
        """
        Select action using epsilon-greedy policy.

        Args:
            state (array-like): Current state.

        Returns:
            int: Selected action.
        """
        if random.random() >= self.epsilon:
            return self.predict_action(state)
        return random.randint(0, self.action_space - 1)
    
    def predict_action(self, observation, epsilon=0.05):
        """
        Predict the action with the highest Q-value, with a small chance of a random action.

        Args:
            observation (array-like): Current state.
            epsilon (float): Probability of returning a random action instead.

        Returns:
            int: Selected action.
        """
        if np.random.random() < epsilon:
            return np.random.randint(self.action_space)
        return np.argmax(self.model.predict(np.array([observation]), verbose=False)[0])

    def sample_batch(self):
        """
        Sample a batch of experiences from the buffer.

        Returns:
            tuple: (states, actions, rewards, next_states, dones), indices, IS_weights
        """
        alpha, beta = self.get_param()
        batch, indices, is_weights = self.buffer.sample(self.batch_size, beta=beta, alpha=alpha)
        return batch, indices, is_weights
    
    def train(self):
        """
        Train the model using a batch of experiences from the buffer.
        """
        batch, indices, is_weights = self.sample_batch()
        if batch is None:
            return
        
        states, actions, rewards, next_states, dones = batch
        
        # Predict Q-values for current states
        current_q = self.model.predict(states, verbose=0)

        # Predict Q-values for next states using target network
        target_q = self.target_model.predict(next_states, verbose=0)
        max_target_q = np.max(target_q, axis=1)

        # Compute target Q-values: r for terminal transitions,
        # r + gamma * max_a' Q_target(s', a') otherwise
        batch_idx = np.arange(len(states))
        not_done = 1.0 - dones.astype(np.float32)
        targets = current_q.copy()
        targets[batch_idx, actions] = rewards + self.gamma * max_target_q * not_done

        # Compute TD errors against the pre-update predictions
        # (copying current_q above avoids a second forward pass)
        td_errors = targets[batch_idx, actions] - current_q[batch_idx, actions]

        # Fit the model with importance-sampling weights
        self.model.fit(states, targets, sample_weight=is_weights, epochs=1, verbose=0)
        
        # Update priorities in the buffer
        self.buffer.update_priorities(indices, td_errors)
        
        # Increment training step
        self.train_step += 1
        
        # Update target network periodically
        if self.train_step % self.target_update_freq == 0:
            self.update_target_network()
    
    def decay_epsilon(self):
        """
        Decay the exploration rate epsilon.
        """
        self.epsilon = max(0.01, self.epsilon * self.epsilon_decay)
    
    def save_model(self, filepath):
        self.model.save(filepath)
    
    def load_model(self, filepath):
        self.model = keras.models.load_model(filepath)
        self.update_target_network()

def plot_rewards(rewards, moving_avg, window=100):
    """
    Plot total rewards and moving average rewards.

    Args:
        rewards (list): Total rewards per episode.
        moving_avg (list): Moving average rewards.
        window (int): Window size for moving average.
    """
    plt.figure(figsize=(12, 6))
    plt.plot(range(1, len(rewards) + 1), rewards, label='Total Reward per Episode')
    plt.plot(range(1, len(moving_avg) + 1), moving_avg, label=f'Moving Average (Last {window} Episodes)')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title('Reward Progression Over Episodes')
    plt.legend()
    plt.grid(True)
    plt.show()

# Example Training Loop
if __name__ == "__main__":
    import gym

    # Initialize the environment
    env = gym.make('LunarLander-v2')
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  

    # Initialize the PER DQN agent
    agent = DQAgent_PER(
        observation_space=observation_space,
        action_space=action_space,
        gamma=0.99,
        lr=5e-4,
        buffer_size=20000,
        batch_size=64,
        epsilon_decay=0.995,
        epsilon=1.0,  # Start with full exploration
        alpha=0.5,
        beta=0.5,
        param_freq=1000,
        target_update_freq=1000
    )

    num_episodes = 1000
    rewards_per_episode = []
    moving_average_window = 100
    moving_averages = []
    recent_rewards = deque(maxlen=moving_average_window)

    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False
        total_reward = 0
        step_count = 0

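        while not done:
            action = agent.e_greedy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Store only true termination so time-limit truncation still bootstraps
            agent.remember(state, action, reward, next_state, terminated)
            # End the episode on either termination or truncation
            done = terminated or truncated
            state = next_state
            total_reward += reward
            step_count += 1
            agent.train()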

        # Decay epsilon after each episode
        agent.decay_epsilon()

        rewards_per_episode.append(total_reward)
        recent_rewards.append(total_reward)
        moving_avg = np.mean(recent_rewards)
        moving_averages.append(moving_avg)

        print(f"Episode {episode + 1} - Total Reward: {total_reward}, Moving Average Reward: {moving_avg:.2f}, Epsilon: {agent.epsilon:.4f}")

        # Plotting periodically
        # if (episode + 1) % 100 == 0:
        #     plot_rewards(rewards_per_episode, moving_averages, window=moving_average_window)

        if moving_avg >= 200 and episode >= moving_average_window:
            print(f"\nEnvironment solved in {episode + 1} episodes with moving average reward {moving_avg:.2f}!")
            agent.save_model("dqn_lunarlander_per.h5")
            break

    # Final Plot
    plot_rewards(rewards_per_episode, moving_averages, window=moving_average_window)
    agent.save_model("dqn_lunarlander_per_final.h5")
    env.close()




Episode 1 - Total Reward: -219.7254431623344, Moving Average Reward: -219.73, Epsilon: 0.9950
Episode 2 - Total Reward: -75.5704220783338, Moving Average Reward: -147.65, Epsilon: 0.9900
Episode 3 - Total Reward: -84.63763906175664, Moving Average Reward: -126.64, Epsilon: 0.9851
Episode 4 - Total Reward: -81.32546350512915, Moving Average Reward: -115.31, Epsilon: 0.9801
Episode 5 - Total Reward: -282.9260970304337, Moving Average Reward: -148.84, Epsilon: 0.9752
Episode 6 - Total Reward: -66.88187757918828, Moving Average Reward: -135.18, Epsilon: 0.9704
Episode 7 - Total Reward: -110.25073635901069, Moving Average Reward: -131.62, Epsilon: 0.9655
Episode 8 - Total Reward: -201.99259012271247, Moving Average Reward: -140.41, Epsilon: 0.9607
Episode 9 - Total Reward: -140.997095912345, Moving Average Reward: -140.48, Epsilon: 0.9559
Episode 10 - Total Reward: -57.77904856929878, Moving Average Reward: -132.21, Epsilon: 0.9511
Episode 11 - Total Reward: -243.29453039355727, Moving Aver