# DOOM Reinforcement Learning Project Using Vizdoom and Double Dueling Deep Q Learning

### Authors: Bartosz Dybowski, Piotr Sieczka


### Description

DDDQN model prepared for scenario 'Defend the center'.

### "Defend the center" scenario
The goal of this scenario is to train the agent to **survive and defend a specific area** while neutralizing threats.

- **Map Layout**: The player is placed in the center of a circular arena, surrounded by monsters attacking from all directions.
- **Goal**: The agent must shoot monsters to survive as long as possible.
- **Reward System**:
  - Positive reward (+1) for killing a monster.
  - Negative reward (-1) for dying.
- **Challenges**: The agent must balance shooting accurately, managing health, and reacting to multiple threats simultaneously.

### Actions
The agent has **3 possible actions**: turning left, turning right, and shooting (it cannot move):
`[turn left, turn right, shoot]`.
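
As a minimal sketch of how such an action list can be encoded: each action is a one-hot list with one entry per button, so that exactly one button is pressed per action. The helper name `one_hot_actions` is hypothetical (it is not part of ViZDoom), and the button order is assumed to match the list above.

```python
def one_hot_actions(n_buttons):
    """Return one one-hot list per button: action i presses only button i."""
    return [[1 if j == i else 0 for j in range(n_buttons)] for i in range(n_buttons)]

# Assumed button order: turn left, turn right, shoot
actions = one_hot_actions(3)
for name, action in zip(["turn left", "turn right", "shoot"], actions):
    print(f"{name:>10}: {action}")
```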

### Reference
Solution based on https://gist.github.com/simoninithomas/d6adc6edb0a7f37d6323a5e3d2ab72ec

## 1. Install the libraries

In [None]:
!pip install tensorflow
!pip install vizdoom
!pip install scikit-image==0.19.0
!pip install matplotlib

## 2. Import libraries

In [1]:
import tensorflow as tf                         # Deep Learning library
import numpy as np                              # Handle matrices
from vizdoom import *                           # Doom Environment

import random                                   # Handling random number generation
import time                                     # Handling time calculation
from skimage import transform                   # Help us to preprocess the frames

from collections import deque                   # Ordered collection with ends
import matplotlib.pyplot as plt                 # Display graphs

import warnings                                 # Ignore the warning messages printed during training because of skimage
warnings.filterwarnings('ignore')

from tensorflow.keras.layers import Lambda

## 3. Set up paths

Scenario Configuration and Resources

- **`SCENARIO_CONFIG_PATH`**: A `.cfg` file defining scenario rules, such as player settings, victory conditions, and enemy behavior.
- **`SCENARIO_WAD_PATH`**: A `.wad` file containing the map layout, textures, and game assets.
- **`MODEL_TO_SAVE_PATH`**: Path prefix under which the trained model checkpoints are saved.


In [2]:
SCENARIO_CONFIG_PATH = "C:\\Users\\barto\\OneDrive\\Pulpit\\DoomRL\\scenarios\\defend_the_center.cfg"
SCENARIO_WAD_PATH = "C:\\Users\\barto\\OneDrive\\Pulpit\\DoomRL\\scenarios\\defend_the_center.wad"
MODEL_TO_SAVE_PATH = "C:\\Users\\barto\\OneDrive\\Pulpit\\DoomRL\\models\\dddqn\\dddqn_model_dtc.ckpt"

## 4. Create Environment

The Doom environment requires:
  - A **configuration file** (`.cfg`): This handles all gameplay options, such as frame size, available actions, and difficulty level.
  - A **scenario file** (`.wad`): This generates the map and assets for the chosen scenario (in this case, **Defend the Center**, but feel free to experiment with others).


In [3]:
def create_environment():
    game = DoomGame()

    # Load the correct configuration
    game.load_config(SCENARIO_CONFIG_PATH)

    # Load the correct scenario
    game.set_doom_scenario_path(SCENARIO_WAD_PATH)

    game.set_window_visible(False) # no pop out window
    game.init()

    # Define three possible actions: turn left, turn right, and shoot
    possible_actions = [
        [1, 0, 0],  # Turn left
        [0, 1, 0],  # Turn right
        [0, 0, 1]   # Shoot
    ]

    return game, possible_actions

In [4]:
game, possible_actions = create_environment()

## 5. Preprocessing functions ⚙️
### a) Preprocess frame
Preprocessing reduces the complexity of the states and therefore the computation time needed for training.
<br><br>
Steps:
- Grayscale each frame, because color does not add important information (this is already done by the config file).
- Crop the screen (in our case we remove the roof because it contains no information)
- Normalize pixel values
- Finally resize the preprocessed frame
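
The crop and normalization steps can be sketched with plain Python lists. This is only an illustration: the 120 x 160 frame size is an assumption about the scenario's screen resolution, the `crop` helper is hypothetical, and the margins mirror the slice `frame[15:-5, 20:-20]` used in the code cell below.

```python
def crop(frame, top=15, bottom=5, left=20, right=20):
    """Remove rows from the top/bottom and columns from the left/right."""
    return [row[left:-right] for row in frame[top:-bottom]]

# Assumption: a 120-row x 160-column grayscale frame with pixel values 0..255
frame = [[(r + c) % 256 for c in range(160)] for r in range(120)]

cropped = crop(frame)
normalized = [[pixel / 255.0 for pixel in row] for row in cropped]

shape = (len(normalized), len(normalized[0]))
print(shape)
```

Under these assumed dimensions the crop already yields 100 x 120, so the final resize would leave the frame size unchanged.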

In [5]:
"""
    preprocess_frame:

    1. Take a frame.
    2. Resize it:

        from
        __________________
        |                 |
        |                 |
        |                 |
        |                 |
        |_________________|

        to
        _____________
        |            |
        |            |
        |            |
        |____________|

    3. Normalize it.

    4. return preprocessed_frame

    """
def preprocess_frame(frame):
    # Crop the screen (remove part that contains no information)
    # [Up: Down, Left: right]
    cropped_frame = frame[15:-5, 20:-20]

    # Check if the cropped frame has non-zero dimensions
    if cropped_frame.size == 0:
        # If the cropped frame has zero dimensions, return a default frame with zeros
        return np.zeros((100, 120), dtype=np.float32)

    # Normalize pixel values to [0, 1]
    normalized_frame = cropped_frame / 255.0

    # Resize the normalized frame
    preprocessed_frame = transform.resize(normalized_frame, [100, 120])

    return preprocessed_frame # 100 x 120 frame

### b) Stack frames

Frames are stacked because this gives the neural network a sense of motion.

- First preprocess the frame
- Then append the frame to the deque, which automatically removes the oldest frame
- Finally build the stacked state

This is how the stack works:
- For the first frame, feed 4 copies of that frame.
- At each timestep, add the new frame to the deque and then stack the frames to form a new stacked state.
- And so on ...
- If done, create a new stack with 4 new frames (because we are in a new episode).
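
The stacking rules above can be sketched with a bounded `deque`. Strings stand in for frames here, and `push_frame` is a hypothetical helper used only for illustration; it is not the `stack_frames` function of this notebook.

```python
from collections import deque

STACK_SIZE = 4  # same stack size as the notebook

def push_frame(frames, frame, is_new_episode):
    """Append a frame; on a new episode, fill the whole stack with that frame."""
    if is_new_episode:
        frames = deque([frame] * STACK_SIZE, maxlen=STACK_SIZE)
    else:
        frames.append(frame)  # maxlen drops the oldest frame automatically
    return frames

frames = push_frame(deque(maxlen=STACK_SIZE), "f0", True)
print(list(frames))   # ['f0', 'f0', 'f0', 'f0']
frames = push_frame(frames, "f1", False)
frames = push_frame(frames, "f2", False)
print(list(frames))   # ['f0', 'f0', 'f1', 'f2']
```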

Reference:
- <a href="https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/">article</a>

In [6]:
stack_size = 4 # We stack 4 frames

# Initialize deque with zero-images one array for each image
stacked_frames  =  deque([np.zeros((100,120), dtype=int) for i in range(stack_size)], maxlen=4)

def stack_frames(stacked_frames, state, is_new_episode):
    if state.size == 0:
        # Return the existing stacked frames without modification
        return np.stack(stacked_frames, axis=2), stacked_frames

    # Preprocess frame
    frame = preprocess_frame(state)

    if is_new_episode:
        # Clear our stacked_frames
        stacked_frames = deque([np.zeros((100,120), dtype=int) for i in range(stack_size)], maxlen=4)

        # Because we're in a new episode, copy the same frame 4x
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)

        # Stack the frames
        stacked_state = np.stack(stacked_frames, axis=2)

    else:
        # Append frame to deque, automatically removes the oldest frame
        stacked_frames.append(frame)

        # Build the stacked state (first dimension specifies different frames)
        stacked_state = np.stack(stacked_frames, axis=2)

    return stacked_state, stacked_frames

## 6. Set up hyperparameters
In this part different hyperparameters are set up.

In [7]:
### MODEL HYPERPARAMETERS
state_size = [100, 120, 4]                                      # Our input is a stack of 4 frames hence 100x120x4 (height, width, channels)
action_size = game.get_available_buttons_size()                 # 3 possible actions
learning_rate =  0.00025                                        # Alpha (aka learning rate)

### TRAINING HYPERPARAMETERS
total_episodes = 20                                            # Total episodes for training
max_steps = 10000                                               # Max possible steps in an episode
batch_size = 64

# FIXED Q TARGETS HYPERPARAMETERS
max_tau = 10000                                                 # Tau is the C step where target network is updated

# EXPLORATION HYPERPARAMETERS for epsilon greedy strategy
explore_start = 1.0                                             # exploration probability at start
explore_stop = 0.01                                             # minimum exploration probability
decay_rate = 0.00005                                            # exponential decay rate for exploration prob

# Q LEARNING hyperparameters
gamma = 0.95                                                    # Discounting rate

### MEMORY HYPERPARAMETERS
pretrain_length = 10                                            # Number of experiences stored in the Memory when initialized for the first time
memory_size = 100000                                            # Number of experiences the Memory can keep (CPU: 10, GPU: 1000000)

### MODIFY THIS TO FALSE IF YOU JUST WANT TO SEE THE TRAINED AGENT
training = True

## TURN THIS TO TRUE IF YOU WANT TO RENDER THE ENVIRONMENT
episode_render = False

## 7. Create Dueling Double Deep Q-learning Neural Network model
<img src="https://cdn-images-1.medium.com/max/1500/1*FkHqwA2eSGixdS-3dvVoMA.png" alt="Dueling Double Deep Q Learning Model" />
Dueling Double Deep Q-learning model:

- It takes a stack of 4 frames as input
- The input passes through 3 convolutional layers
- Then it is flattened
- Then it is passed through 2 streams
    - One that calculates V(s)
    - The other that calculates A(s,a)
- Finally an aggregating layer combines the two streams
- It outputs a Q value for each action
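
The aggregating layer can be sketched in plain Python as Q(s,a) = V(s) + (A(s,a) - mean(A(s,·))), which is the same combination the network code below uses. The numbers are made up, and `dueling_q_values` is a hypothetical helper for illustration only.

```python
def dueling_q_values(value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V + (A - mean(A))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]

# Made-up numbers for one state with three actions
q = dueling_q_values(value=2.0, advantages=[1.0, 2.0, 3.0])
print(q)  # [1.0, 2.0, 3.0]
```

Subtracting the mean advantage keeps the average of the Q values equal to V(s), so the two streams cannot drift apart by an arbitrary constant.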

In [8]:
#tf.compat.v1.disable_eager_execution()
tf.compat.v1.enable_eager_execution()




In [14]:
class DDDQNNet(tf.keras.Model):
    def __init__(self, state_size, action_size, learning_rate, name="DDDQNNet"):
        super(DDDQNNet, self).__init__(name=name)
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate

        # Define layers
        self.conv1 = tf.keras.layers.Conv2D(
            filters=32, kernel_size=(8, 8), strides=(4, 4), activation='elu', name="conv1"
        )
        self.conv2 = tf.keras.layers.Conv2D(
            filters=64, kernel_size=(4, 4), strides=(2, 2), activation='elu', name="conv2"
        )
        self.conv3 = tf.keras.layers.Conv2D(
            filters=128, kernel_size=(4, 4), strides=(2, 2), activation='elu', name="conv3"
        )
        self.flatten = tf.keras.layers.Flatten(name="flatten")
        self.value_fc = tf.keras.layers.Dense(512, activation='elu', name="value_fc")
        self.value = tf.keras.layers.Dense(1, activation=None, name="value")
        self.advantage_fc = tf.keras.layers.Dense(512, activation='elu', name="advantage_fc")
        self.advantage = tf.keras.layers.Dense(self.action_size, activation=None, name="advantages")

        # Optimizer
        self.optimizer = tf.keras.optimizers.RMSprop(learning_rate=self.learning_rate)

    def call(self, inputs):
        x = self.conv1(inputs)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.flatten(x)

        # Value and Advantage streams
        value = self.value(self.value_fc(x))
        advantage = self.advantage(self.advantage_fc(x))

        # Combine value and advantage streams
        q_values = value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
        return q_values

    
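    def compute_loss(self, target_Q, predicted_Q, actions_mb, ISWeights_):
        # Cast the NumPy inputs to float32 so they mix safely with the network output
        target_Q = tf.cast(target_Q, tf.float32)
        actions_mb = tf.cast(actions_mb, tf.float32)
        # ISWeights_ has shape (batch, 1); flatten it so it weights each sample
        # element-wise instead of broadcasting to a (batch, batch) matrix
        weights = tf.reshape(tf.cast(ISWeights_, tf.float32), [-1])

        # Q value of the action actually taken (actions_mb is one-hot)
        predicted_Q_for_actions = tf.reduce_sum(predicted_Q * actions_mb, axis=1)
        td_error = tf.math.squared_difference(target_Q, predicted_Q_for_actions)
        return tf.reduce_mean(weights * td_error)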



In [16]:
# Reset the graph
#tf.reset_default_graph()

# Instantiate the DQNetwork
DQNetwork = DDDQNNet(state_size, action_size, learning_rate, name="DQNetwork")

# Instantiate the target network
TargetNetwork = DDDQNNet(state_size, action_size, learning_rate, name="TargetNetwork")
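
As a rough check of the feature-map sizes inside this network, the sketch below applies the standard output-size formula for unpadded convolutions to the kernel sizes and strides listed in the class. It assumes the Keras default `padding="valid"` (no padding argument is passed above); `conv_output` is a hypothetical helper.

```python
def conv_output(size, kernel, stride):
    """Output length of a 'valid' (unpadded) convolution along one dimension."""
    return (size - kernel) // stride + 1

height, width = 100, 120                        # one preprocessed frame
layers = [(8, 4, 32), (4, 2, 64), (4, 2, 128)]  # (kernel, stride, filters)

for kernel, stride, filters in layers:
    height = conv_output(height, kernel, stride)
    width = conv_output(width, kernel, stride)
    print(f"{height} x {width} x {filters}")

flattened = height * width * layers[-1][2]
print("flattened features:", flattened)
```

Under that assumption the three layers produce 24 x 29, 11 x 13 and 4 x 5 feature maps, so the flatten layer would feed 2560 features into each stream.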

## 8. Prioritized Experience Replay

Prioritized experience replay samples experiences in proportion to their priority. A plain array makes that sampling inefficient, so a binary tree (a SumTree) is used instead.

To summarize:
- **Step 1**: We construct a SumTree, a binary sum tree whose leaves contain the priorities, together with a data array whose indices correspond to the leaves.
    <img src="https://cdn-images-1.medium.com/max/1200/1*Go9DNr7YY-wMGdIQ7HQduQ.png" alt="SumTree"/>
    <br><br>
    - **def __init__**: initialize the SumTree with all nodes = 0 and the data array with all values = 0.
    - **def add**: add a priority score to a sumtree leaf and the experience (S, A, R, S', Done) to data.
    - **def update**: update a leaf's priority score and propagate the change through the tree.
    - **def get_leaf**: retrieve the priority score, index and experience associated with a leaf.
    - **def total_priority**: get the root node value, which is the total priority score of the replay buffer.
<br><br>
- **Step 2**: We create a Memory object that contains the sumtree and data.
    - **def __init__**: generates the sumtree and data by instantiating the SumTree object.
    - **def store**: stores a new experience in the tree. Each new experience will **have priority = max_priority** (this priority is then corrected during training, when the TD error, and hence the priority score, is calculated).
    - **def sample**:
         - First, to sample a minibatch of size k, the range [0, priority_total] is divided into k ranges.
         - Then a value is uniformly sampled from each range.
         - The sumtree is then searched for the experiences whose priority scores correspond to the sampled values.
         - Then, we calculate IS weights for each minibatch element.
    - **def batch_update**: update the priorities on the tree.

### Reference:

-  https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/5.2_Prioritized_Replay_DQN/RL_brain.py
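
The idea behind the sum tree can be sketched with a plain list before reading the full class. This is a simplified illustration, not the notebook's `SumTree`: the helper names are hypothetical, and the number of leaves is assumed to be a power of two so that all leaves sit on the last level.

```python
def build_sum_tree(priorities):
    """Array-backed binary sum tree; len(priorities) must be a power of two."""
    capacity = len(priorities)
    tree = [0.0] * (capacity - 1) + list(priorities)
    for i in range(capacity - 2, -1, -1):        # fill parents bottom-up
        tree[i] = tree[2 * i + 1] + tree[2 * i + 2]
    return tree

def get_leaf(tree, v):
    """Walk down from the root to the leaf whose priority interval contains v."""
    index = 0
    while 2 * index + 1 < len(tree):             # stop when index is a leaf
        left = 2 * index + 1
        if v <= tree[left]:
            index = left
        else:
            v -= tree[left]
            index = left + 1
    return index

tree = build_sum_tree([3.0, 1.0, 4.0, 2.0])
print(tree)                 # the root (index 0) holds the total priority
print(get_leaf(tree, 3.5))  # index of the leaf covering the value 3.5
```

Each lookup visits one node per tree level, so sampling costs O(log n) instead of a linear scan over all priorities.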

In [9]:
class SumTree(object):
    """
    This SumTree code is modified version of Morvan Zhou:
    https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/5.2_Prioritized_Replay_DQN/RL_brain.py
    """
    data_pointer = 0

    """
    Here initialize the tree with all nodes = 0, and initialize the data with all values = 0
    """
    def __init__(self, capacity):
        self.capacity = capacity # Number of leaf nodes (final nodes) that contains experiences

        # Generate the tree with all nodes values = 0
        # To understand this calculation (2 * capacity - 1) look at the schema above
        # Remember we are in a binary node (each node has max 2 children) so 2x size of leaf (capacity) - 1 (root node)
        # Parent nodes = capacity - 1
        # Leaf nodes = capacity
        self.tree = np.zeros(2 * capacity - 1)

        """ tree:
            0
           / \
          0   0
         / \ / \
        0  0 0  0  [Size: capacity] it's at this line that there is the priorities score (aka pi)
        """

        # Contains the experiences (so the size of data is capacity)
        self.data = np.zeros(capacity, dtype=object)


    """
    Here we add our priority score in the sumtree leaf and add the experience in data
    """
    def add(self, priority, data):
        # Look at what index we want to put the experience
        tree_index = self.data_pointer + self.capacity - 1

        """ tree:
            0
           / \
          0   0
         / \ / \
tree_index  0 0  0  We fill the leaves from left to right
        """

        # Update data frame
        self.data[self.data_pointer] = data

        # Update the leaf
        self.update (tree_index, priority)

        # Add 1 to data_pointer
        self.data_pointer += 1

        if self.data_pointer >= self.capacity:  # If we're above the capacity, you go back to first index (we overwrite)
            self.data_pointer = 0


    """
    Update the leaf priority score and propagate the change through tree
    """
    def update(self, tree_index, priority):
        # Change = new priority score - former priority score
        change = priority - self.tree[tree_index]
        self.tree[tree_index] = priority

        # then propagate the change through tree
        while tree_index != 0:    # this method is faster than the recursive loop in the reference code

            """
            Here we want to access the line above
            THE NUMBERS IN THIS TREE ARE THE INDEXES NOT THE PRIORITY VALUES

                0
               / \
              1   2
             / \ / \
            3  4 5  [6]

            If we are in leaf at index 6, we updated the priority score
            We need then to update index 2 node
            So tree_index = (tree_index - 1) // 2
            tree_index = (6-1)//2
            tree_index = 2 (because // floors the result)
            """
            tree_index = (tree_index - 1) // 2
            self.tree[tree_index] += change


    """
    Here we get the leaf_index, priority value of that leaf and experience associated with that index
    """
    def get_leaf(self, v):
        """
        Tree structure and array storage:
        Tree index:
             0         -> storing priority sum
            / \
          1     2
         / \   / \
        3   4 5   6    -> storing priority for experiences
        Array type for storing:
        [0,1,2,3,4,5,6]
        """
        parent_index = 0

        while True: # the while loop is faster than the method in the reference code
            left_child_index = 2 * parent_index + 1
            right_child_index = left_child_index + 1

            # If we reach bottom, end the search
            if left_child_index >= len(self.tree):
                leaf_index = parent_index
                break

            else: # downward search, always search for a higher priority node

                if v <= self.tree[left_child_index]:
                    parent_index = left_child_index

                else:
                    v -= self.tree[left_child_index]
                    parent_index = right_child_index

        data_index = leaf_index - self.capacity + 1

        return leaf_index, self.tree[leaf_index], self.data[data_index]

    @property
    def total_priority(self):
        return self.tree[0] # Returns the root node

The replay memory below is built on the SumTree, so a deque is no longer used to store experiences.

In [10]:
class Memory(object):  # stored as ( s, a, r, s_ ) in SumTree
    """
    This Memory code is a modified version of the original code from:
    https://github.com/jaara/AI-blog/blob/master/Seaquest-DDQN-PER.py
    """
    PER_e = 0.01    # Hyperparameter that we use to avoid some experiences to have 0 probability of being taken
    PER_a = 0.6     # Hyperparameter that we use to make a tradeoff between taking only exp with high priority and sampling randomly
    PER_b = 0.4     # importance-sampling, from initial value increasing to 1

    PER_b_increment_per_sampling = 0.001

    absolute_error_upper = 1.  # clipped abs error

    def __init__(self, capacity):
        # Making the tree
        """
        Remember that our tree is composed of a sum tree that contains the priority scores at his leaf
        And also a data array
        We don't use deque because it means that at each timestep our experiences change index by one.
        We prefer to use a simple array and to overwrite when the memory is full.
        """
        self.tree = SumTree(capacity)

    """
    Store a new experience in our tree
    Each new experience has a score of max_priority (it is then refined when we use this exp to train our DDQN)
    """
    def store(self, experience):
        # Find the max priority
        max_priority = np.max(self.tree.tree[-self.tree.capacity:])

        # If the max priority = 0 we can't put priority = 0 since this exp will never have a chance to be selected
        # So we use a minimum priority
        if max_priority == 0:
            max_priority = self.absolute_error_upper

        self.tree.add(max_priority, experience)   # set the max p for new p


    """
    - First, to sample a minibatch of size k, the range [0, priority_total] is divided into k ranges.
    - Then a value is uniformly sampled from each range
    - The sumtree is searched for the experiences whose priority scores correspond to the sampled values.
    - Then, we calculate IS weights for each minibatch element
    """
    def sample(self, n):
        # Create a sample array that will contains the minibatch
        memory_b = []

        b_idx, b_ISWeights = np.empty((n,), dtype=np.int32), np.empty((n, 1), dtype=np.float32)

        # Calculate the priority segment
        # Here, as explained in the paper, we divide the Range[0, ptotal] into n ranges
        priority_segment = self.tree.total_priority / n       # priority segment

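        # Increase PER_b each time we sample a new minibatch
        self.PER_b = np.min([1., self.PER_b + self.PER_b_increment_per_sampling])  # max = 1

        # Calculating the max_weight from the smallest non-zero leaf priority
        # (leaves that were never filled hold 0 and would make max_weight infinite)
        leaf_priorities = self.tree.tree[-self.tree.capacity:]
        p_min = np.min(leaf_priorities[leaf_priorities > 0]) / self.tree.total_priority
        max_weight = (p_min * n) ** (-self.PER_b)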

        for i in range(n):
            """
            A value is uniformly sample from each range
            """
            a, b = priority_segment * i, priority_segment * (i + 1)
            value = np.random.uniform(a, b)

            """
            Experience that correspond to each value is retrieved
            """
            index, priority, data = self.tree.get_leaf(value)

            #P(j)
            sampling_probabilities = priority / self.tree.total_priority

            #  IS = (1/N * 1/P(i))**b /max wi == (N*P(i))**-b  /max wi
            b_ISWeights[i, 0] = np.power(n * sampling_probabilities, -self.PER_b)/ max_weight

            b_idx[i]= index

            experience = [data]

            memory_b.append(experience)

        return b_idx, memory_b, b_ISWeights

    """
    Update the priorities on the tree
    """
    def batch_update(self, tree_idx, abs_errors):
        abs_errors += self.PER_e  # convert to abs and avoid 0
        clipped_errors = np.minimum(abs_errors, self.absolute_error_upper)
        ps = np.power(clipped_errors, self.PER_a)

        for ti, p in zip(tree_idx, ps):
            self.tree.update(ti, p)
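
The two formulas used by this class, the priority computed from a TD error and the normalized importance-sampling weight, can be sketched in plain Python. The constants copy the values of `PER_e`, `PER_a`, `PER_b` and `absolute_error_upper` above; the function names and the TD errors are made up for illustration.

```python
PER_E, PER_A, PER_B = 0.01, 0.6, 0.4   # same values as the Memory class
ERROR_CLIP = 1.0

def priority_from_error(abs_td_error):
    """p = min(|delta| + e, clip) ** a"""
    return min(abs_td_error + PER_E, ERROR_CLIP) ** PER_A

def is_weights(sampled, all_priorities, batch_size):
    """Normalized importance-sampling weights for the sampled priorities."""
    total = sum(all_priorities)
    max_weight = (batch_size * min(all_priorities) / total) ** (-PER_B)
    return [(batch_size * p / total) ** (-PER_B) / max_weight for p in sampled]

# Made-up absolute TD errors for four stored experiences
priorities = [priority_from_error(e) for e in (0.0, 0.2, 0.5, 3.0)]
weights = is_weights(priorities, priorities, batch_size=4)
print([round(p, 3) for p in priorities])
print([round(w, 3) for w in weights])
```

The experience with the lowest priority gets weight 1, and frequently sampled high-priority experiences get smaller weights, which offsets the bias introduced by non-uniform sampling.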

To deal with the empty-memory problem, we pre-populate the memory by taking random actions and storing the resulting experiences.

In [11]:
# Instantiate memory
memory = Memory(memory_size)

# Start a new episode
game.new_episode()

for i in range(pretrain_length):
    # If it's the first step
    if i == 0:
        # First we need a state
        state = game.get_state().screen_buffer
        state, stacked_frames = stack_frames(stacked_frames, state, True)

    # Random action
    action = random.choice(possible_actions)

    # Get the rewards
    reward = game.make_action(action)

    # Look if the episode is finished
    done = game.is_episode_finished()

    # If we're dead
    if done:
        # We finished the episode
        next_state = np.zeros(state.shape)

        # Add experience to memory
        #experience = np.hstack((state, [action, reward], next_state, done))

        experience = state, action, reward, next_state, done
        memory.store(experience)

        # Start a new episode
        game.new_episode()

        # First we need a state
        state = game.get_state().screen_buffer

        # Stack the frames
        state, stacked_frames = stack_frames(stacked_frames, state, True)

    else:
        # Get the next state
        next_state = game.get_state().screen_buffer
        next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

        # Add experience to memory
        experience = state, action, reward, next_state, done
        memory.store(experience)

        # Our state is now the next_state
        state = next_state

## 9. Train Agent

Algorithm:
<br>
* Initialize the weights $w$ of the DQN
* Initialize the target network weights $w^- \leftarrow w$
* Initialize the environment
* Initialize the decay step (used to reduce $\epsilon$)
<br><br>
* **For** episode = 1 to max_episode **do**
    * Make new episode
    * Set step to 0
    * Observe the first state $s_0$
    <br><br>
    * **While** step < max_steps **do**:
        * Increase decay_step
        * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
        * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
        * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$

        * Sample a mini-batch from $D$: $<s, a, r, s'>$
        * Set target $\hat{Q} = r$ if the episode ends at step $t+1$, otherwise set $\hat{Q} = r + \gamma Q(s', \mathrm{argmax}_{a'} Q(s', a', w), w^-)$
        * Make a gradient descent step with loss $(\hat{Q} - Q(s, a))^2$
        * Every C steps, reset: $w^- \leftarrow w$
    * **endwhile**
    <br><br>
* **endfor**
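
The target in the algorithm above can be sketched for a single transition: the online network chooses the next action and the target network evaluates it. The Q values are made up, and `double_dqn_target` is a hypothetical helper rather than code from the training cell.

```python
GAMMA = 0.95  # discount rate used in the notebook

def double_dqn_target(reward, done, q_online_next, q_target_next):
    """r if terminal, else r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + GAMMA * q_target_next[best_action]

# Made-up Q values for one transition with three actions
q_online_next = [0.2, 0.9, 0.4]   # the online network picks action 1
q_target_next = [0.5, 0.3, 0.8]   # the target network evaluates that action

print(double_dqn_target(1.0, False, q_online_next, q_target_next))
print(double_dqn_target(-1.0, True, q_online_next, q_target_next))
```

A plain DQN target would instead take the maximum of the target network's values (0.8 here), which is the over-estimation that the double estimator is meant to reduce.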

    

In [12]:
"""
This function implements the step:
With probability ϵ select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a)
"""
def predict_action(explore_start, explore_stop, decay_rate, decay_step, state, possible_actions, DQNetwork):
    exp_exp_tradeoff = np.random.rand()

    # Epsilon-greedy strategy
    explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)

    if explore_probability > exp_exp_tradeoff:
        action = random.choice(possible_actions)  # Exploration
    else:
        # Exploitation
        state = np.expand_dims(state, axis=0)
        Qs = DQNetwork(state)
        choice = np.argmax(Qs.numpy())
        action = possible_actions[int(choice)]

    return action, explore_probability
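
To get a feel for the exploration schedule, the sketch below evaluates the same exponential decay formula at a few step counts, using the hyperparameter values set earlier in the notebook. The standalone function `explore_probability` is for illustration only.

```python
import math

EXPLORE_START, EXPLORE_STOP, DECAY_RATE = 1.0, 0.01, 0.00005  # notebook values

def explore_probability(decay_step):
    """Exponentially decayed epsilon, bounded below by EXPLORE_STOP."""
    return EXPLORE_STOP + (EXPLORE_START - EXPLORE_STOP) * math.exp(-DECAY_RATE * decay_step)

for step in (0, 10_000, 50_000, 100_000):
    print(f"step {step:>7}: epsilon = {explore_probability(step):.3f}")
```

With this decay rate the agent still explores about 61% of the time after 10,000 steps, so the schedule only becomes mostly greedy after many tens of thousands of steps.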

In [15]:
def update_target_graph(DQNetwork, TargetNetwork):
    for t, e in zip(TargetNetwork.trainable_variables, DQNetwork.trainable_variables):
        t.assign(e)
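
One common way to schedule this hard update is to count training steps and copy the weights whenever the counter exceeds the update period. The sketch below simulates that schedule with dictionaries standing in for the two networks; `train_steps` and the `tau > MAX_TAU` condition are assumptions for illustration, not code taken from this notebook.

```python
MAX_TAU = 10_000  # update period C, as in the hyperparameters

def train_steps(n_steps, online, target):
    """Count steps and copy online weights into target every MAX_TAU steps."""
    tau, syncs = 0, 0
    for _ in range(n_steps):
        tau += 1
        online["w"] += 1           # stand-in for a gradient update
        if tau > MAX_TAU:
            target.update(online)  # hard update: w^- <- w
            tau = 0
            syncs += 1
    return syncs

online, target = {"w": 0}, {"w": 0}
print(train_steps(25_000, online, target), target["w"], online["w"])
```

Between two synchronizations the target weights stay frozen, which keeps the regression targets stable while the online network changes.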

In [17]:
# Checkpoint for saving the model
checkpoint = tf.train.Checkpoint(optimizer=DQNetwork.optimizer, model=DQNetwork)

if training:
    # Initialize the game
    game.init()

    # List to store rewards from each episode
    episode_rewards_list = []

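    # Counters that persist across episodes: resetting them per episode would
    # restart the epsilon decay and the target-update schedule every episode
    tau, decay_step = 0, 0

    for episode in range(total_episodes):
        step = 0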
        game.new_episode()
        state = game.get_state().screen_buffer
        state, stacked_frames = stack_frames(stacked_frames, state, True)
        episode_rewards = []

        while step < max_steps:
            step += 1
            tau += 1
            decay_step += 1

            # Predict action
            action, explore_probability = predict_action(
                explore_start, explore_stop, decay_rate, decay_step, state, possible_actions, DQNetwork
            )

            # Perform the action in the environment
            reward = game.make_action(action)
            done = game.is_episode_finished()
            episode_rewards.append(reward)

            if done:
                # If the episode is finished, build a terminal state from a blank frame
                next_state = np.zeros((120, 140), dtype=np.int64)
                next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
                total_reward = np.sum(episode_rewards)
                episode_rewards_list.append(total_reward)

                # Store the terminal transition in the replay memory
                memory.store((state, action, reward, next_state, done))

                print(f"Episode: {episode}, Total Reward: {total_reward}, Explore P: {explore_probability}")
                break
            else:
                # Get the next state
                next_state = game.get_state().screen_buffer
                next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

                # Store the transition in the replay memory before moving on
                memory.store((state, action, reward, next_state, done))
                state = next_state

            # Learning
            tree_idx, batch, ISWeights_mb = memory.sample(batch_size)
            states_mb = np.array([each[0][0] for each in batch], ndmin=3)
            actions_mb = np.array([each[0][1] for each in batch])
            rewards_mb = np.array([each[0][2] for each in batch])
            next_states_mb = np.array([each[0][3] for each in batch], ndmin=3)
            dones_mb = np.array([each[0][4] for each in batch])

            # Compute Q-values using Double DQN
            q_next_state = DQNetwork(next_states_mb)
            q_target_next_state = TargetNetwork(next_states_mb)

            target_Qs_batch = []
            for i in range(len(batch)):
                terminal = dones_mb[i]
                action = np.argmax(q_next_state[i].numpy())

                if terminal:
                    target_Qs_batch.append(rewards_mb[i])
                else:
                    target = rewards_mb[i] + gamma * q_target_next_state[i][action]
                    target_Qs_batch.append(target)

            targets_mb = np.array(target_Qs_batch)

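            # Update the model
            with tf.GradientTape() as tape:
                predictions = DQNetwork(states_mb)
                loss = DQNetwork.compute_loss(targets_mb, predictions, actions_mb, ISWeights_mb)

            gradients = tape.gradient(loss, DQNetwork.trainable_variables)
            DQNetwork.optimizer.apply_gradients(zip(gradients, DQNetwork.trainable_variables))

            # Update the priorities of the sampled experiences with their absolute TD errors
            predicted_Q_for_actions = np.sum(predictions.numpy() * actions_mb, axis=1)
            absolute_errors = np.abs(np.asarray(targets_mb, dtype=np.float32) - predicted_Q_for_actions)
            memory.batch_update(tree_idx, absolute_errors)

            # Every max_tau steps, copy the online weights into the target network
            if tau > max_tau:
                update_target_graph(DQNetwork, TargetNetwork)
                tau = 0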

        # Save model every 10 episodes
        if episode % 10 == 0:
            checkpoint.save(MODEL_TO_SAVE_PATH)
            print("Model Saved")

    # Plot episode rewards
    if episode_rewards_list:
        plt.figure(figsize=(10, 5))
        plt.plot(episode_rewards_list, label="Episode Rewards")
        plt.xlabel('Episode')
        plt.ylabel('Total Reward')
        plt.title('Episode Rewards Over Time')
        plt.legend()
        plt.grid()
        plt.show()
    else:
        print("No rewards to plot. Ensure training has been run.")



## 10. Watch agent play

In [None]:
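# Set model path. The numeric suffix is the checkpoint save counter, which
# tf.train.Checkpoint.save() increments on every call; it is not the episode number
MODEL_PATH="C:\\Users\\barto\\OneDrive\\Pulpit\\DoomRL\\models\\dddqn\\dddqn_model_dtc.ckpt-10"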

# Initialize the game
game = DoomGame()

# Load the correct configuration and scenario
game.load_config(SCENARIO_CONFIG_PATH)
game.set_doom_scenario_path(SCENARIO_WAD_PATH)

# Set visibility and initialize game
game.set_window_visible(True)
game.init()

# Restore the trained model
checkpoint = tf.train.Checkpoint(model=DQNetwork)
checkpoint.restore(MODEL_PATH).expect_partial()

# Play the game for a few episodes
for i in range(10):
    game.new_episode()
    state = game.get_state().screen_buffer
    state, stacked_frames = stack_frames(stacked_frames, state, True)

    while not game.is_episode_finished():
        # Epsilon-greedy strategy
        exp_exp_tradeoff = np.random.rand()
        explore_probability = 0.01  # Fixed low epsilon for testing

        if explore_probability > exp_exp_tradeoff:
            # Random action (exploration)
            action_index = np.random.choice(range(len(possible_actions)))
        else:
            # Exploit the trained model
            Qs = DQNetwork(tf.convert_to_tensor(state.reshape((1, *state.shape)), dtype=tf.float32))
            action_index = np.argmax(Qs.numpy())

        # Perform the action
        action = possible_actions[action_index]
        game.make_action(action)

        if game.is_episode_finished():
            break
        else:
            # Get the next state
            next_state = game.get_state().screen_buffer
            next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
            state = next_state

    # Get the final score for this episode
    score = game.get_total_reward()
    print(f"Episode {i + 1} Score: {score}")

# Close the game
game.close()