<a href="https://colab.research.google.com/github/Strojove-uceni/2024-final-letadylka-prochazka-belohlavek/blob/main/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**TABLE OF CONTENTS**

>[Introduction](#scrollTo=Mey2sVO3w06m)

>>[Abstract](#scrollTo=Mey2sVO3w06m)

>>[How to Run the Project](#scrollTo=Mey2sVO3w06m)

>[Overview of the Used Machine Learning Techniques](#scrollTo=ZDfbtUw36j_1)

>>[Architectures](#scrollTo=ZDfbtUw36j_1)

>>>[Disclaimer](#scrollTo=ZDfbtUw36j_1)

>>[Q-values prediction](#scrollTo=ZDfbtUw36j_1)

>>[Classical MLP](#scrollTo=ZDfbtUw36j_1)

>>[Attention Model](#scrollTo=HU4Z78Kv9hiW)

>>[DGN](#scrollTo=_s_7VNFm91w8)

>>[DQN](#scrollTo=G5-kiKLj98yv)

>>[CommNet](#scrollTo=VK8i3VvG-Bg-)

>>[State Aggregation](#scrollTo=AGSMkNTI-KIH)

>>>[SUM](#scrollTo=AGSMkNTI-KIH)

>>>[GCN](#scrollTo=7S-b66SX-TyK)

>[Selecting Parameters](#scrollTo=1Ei12632gg3Z)

>>[Common Parameters in the Sweep](#scrollTo=1Ei12632gg3Z)

>>[CommNet specific](#scrollTo=1Ei12632gg3Z)

>>[DQN, DGN specific](#scrollTo=1Ei12632gg3Z)

>[Advice for Parameter Selection](#scrollTo=OhkNKLaYlGRj)

>>[CommNet settings](#scrollTo=OhkNKLaYlGRj)

>>[DQN, DGN settings](#scrollTo=OhkNKLaYlGRj)

>>[Aggregation Type](#scrollTo=OhkNKLaYlGRj)

>[Our results](#scrollTo=S8nU4WQLwJI7)



Authors: Michal Bělohlávek, Tomáš Procházka

# Introduction
Welcome to the demo notebook, where you can run the project with zero effort and see the results and visualisation for yourself. While this is an easy and pleasant way to try out the neural network, we strongly urge anyone who visits this demo to run the project as it was intended, expand upon it, and improve it.

## Abstract
This project was created and submitted as the final semester project for the Machine Learning 2 class at FNSPE CTU. It concerns reinforcement learning for multiple agents controlled by a single neural network in a graph environment. The core aim is to provide a neural network solution that efficiently navigates multiple planes along a fully connected graph with the goal of estimating the shortest path while avoiding collisions between planes. We implemented an enhanced version of a classical replay buffer that samples experiences based on the predicted future reward to assist the learning process. We also added regularization techniques, since the dimensionality of our problem is much larger than the one in the repository we cite.

## How to Run the Project
For those who decide to download the project and run the training on their own PC, please be careful with the configuration. A basic setup is present in /data as demo_config.yaml.

Setting capacity, minibatch_size or sequence_length too high may freeze the computer.

Most hyperparameters can be changed in the config.yaml file. If you intend to run your own sweeps on Weights & Biases, we have also uploaded a version of the main file, wandb_main.py, that supports sweep configuration.

If, however, you decide to only run the project in this demo notebook, note that the pre-trained models are too large to upload to the GitHub repo directly, so the training is done from scratch here. The training uses demo_config with a small number of steps and generally "low" settings, so that it can be completed in a reasonable amount of time. Therefore, one should expect very poor results compared to the results we present at the end of this notebook.

**To see the training and results, simply run the following code boxes.**


# Overview of the Used Machine Learning Techniques
## Architectures

<span><font color="green;">
### Disclaimer

**The following code is for illustration only. To see the full functionality of the code, visit the /src directory in our GitHub repo. This approach was taken because it is not feasible to copy the whole code into Colab.**
</font></span>

(Or maybe it would be feasible but that would be an extreme violation of our hard work.)

This file contains the description of the main architectures used within this project. Below we provide a detailed description for all of them.

First we describe models that were used for Q-values predictions:

    - DGN
    - DQN
    - Comm_net,
    
then we go over the methods that were used to aggregate hidden graph representations:

    - SUM
    - GCN

and lastly we describe the **NetMon** class, that was originally provided by the authors.

## Reinforcement Learning
First, let's introduce the concept of reinforcement learning with multiple agents. Reinforcement learning (RL) is a machine learning approach in which agents learn to make sequential decisions by interacting with an environment. Each agent holds a local and/or global observation of its environment, so it typically sees only a partial view of the environment and its state. Through trial and error, the agent receives rewards or penalties for its actions, enabling it to discover an optimal policy for achieving specific goals. In our case, we used an epsilon-greedy policy: at the beginning the agent makes random decisions to explore the environment and learn its interactions, then it gradually shifts to the learned behavior.
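The epsilon-greedy behaviour described above can be sketched in a few lines. This is a minimal illustration rather than the project's implementation; the function name `epsilon_greedy` and the toy Q-values are assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

q_values = [0.1, 0.7, 0.2]                               # hypothetical Q-values for three edges
greedy_action = epsilon_greedy(q_values, epsilon=0.0)    # never explores
random_action = epsilon_greedy(q_values, epsilon=1.0)    # always explores
```

With `epsilon=0.0` the agent always picks the action with the highest Q-value; as epsilon decays from a high value during training, behaviour moves from the second case towards the first.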

In the context of multiple agents, Recurrent Neural Networks (RNNs) play a critical role in handling temporal dependencies. For example, in multi-agent RL, RNNs can be used to model and predict the behavior of agents based on sequences of past experiences, enabling better coordination and communication between agents. By maintaining hidden states that encapsulate the history of interactions, RNNs empower agents to adapt to dynamic environments and collaborate effectively.

Among other things, our work takes pride in its replay_buffer, which significantly improved the prediction of paths that lead to future reward. A replay buffer is a key component in reinforcement learning that stores past experiences, typically as (state, action, reward, next state) tuples. For RNNs, which rely on sequential dependencies, replay buffers are particularly important because they allow the agent to learn from diverse trajectories while maintaining temporal coherence. By sampling batches of sequences instead of independent transitions, the replay buffer ensures that the RNN captures meaningful patterns over time, improving its ability to model long-term dependencies. Additionally, we implemented a version of the replay buffer that samples the batch sequences so as to maximize the td_error, that is, the mean squared error between the predicted future reward and the immediate reward. This allows the agents to learn and prioritize paths that lead to targets, since that is where most of the future reward lies.
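One way to picture sequence sampling driven by the td_error is the sketch below. It assumes a priority proportional to the summed absolute TD error of each window; the actual sampling scheme in /src may differ, and the function name `sample_sequence_starts` and the example values are hypothetical.

```python
import random

def sample_sequence_starts(td_errors, batch_size, sequence_length, rng):
    """Sample start indices of sequences, favouring windows with a high TD error."""
    n_starts = len(td_errors) - sequence_length + 1
    # Priority of a window = summed absolute TD error of its transitions.
    # The small constant keeps every window reachable (an assumption of this sketch).
    priorities = [
        sum(abs(e) for e in td_errors[s:s + sequence_length]) + 1e-6
        for s in range(n_starts)
    ]
    return rng.choices(range(n_starts), weights=priorities, k=batch_size)

rng = random.Random(0)
td_errors = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 0.0, 0.0]   # hypothetical stored TD errors
starts = sample_sequence_starts(td_errors, batch_size=4, sequence_length=2, rng=rng)
```

In this toy buffer only the windows starting at indices 3, 4 and 5 overlap the transitions with a large error, so they dominate the sampled batch.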

## Q-values Prediction
We use a Q-Net to predict Q-values, a reinforcement learning technique that assigns a value to each action available one step into the future, based on the agent's observations (state). In this particular setting, the Q-Net predicts the reward for each edge the agent could take at any given step. We implemented a node mask and generalized the setting to fit graphs with a variable edge count per node. The algorithm in Q-Net implements dynamic programming weighted by the learning rate hyperparameter.

$$Q_{target} = (1-lr) \cdot Q_{now} + lr \cdot E\left[R_{t+1}(a_{t+1}, s_{t+1}) + \gamma \cdot \max_{a_{t+1}} Q_{next}(a_{t+1}, s_{t+1}) \mid s_t\right],$$

where $R_{t+1}(a_{t+1}, s_{t+1})$ is sampled from a batch of experiences. From this formulation, we can see that by sampling the batch indices that maximize the td_error, we essentially grow the $Q_{now}$ values for future steps.
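As a worked instance of the update above, the sketch below replaces the expectation with a single sampled experience and restricts the maximum to edges allowed by the mask. The function name `q_target` and all numbers are illustrative assumptions.

```python
def q_target(q_now, reward, q_next, valid_next, gamma, lr):
    """Blend the current estimate with the one-step bootstrapped target."""
    # Only actions (edges) that the mask allows take part in the max.
    best_next = max(q for q, ok in zip(q_next, valid_next) if ok)
    return (1 - lr) * q_now + lr * (reward + gamma * best_next)

updated = q_target(
    q_now=1.0,
    reward=0.5,
    q_next=[2.0, 9.0, 1.0],
    valid_next=[True, False, True],   # the second edge does not exist for this node
    gamma=0.9,
    lr=0.1,
)
```

Here the masked value 9.0 is ignored, the bootstrapped target is 0.5 + 0.9 * 2.0 = 2.3, and the blended result is 0.9 * 1.0 + 0.1 * 2.3 = 1.13.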

The goal of each agent is to maximize the expected future reward weighted by the gamma (discount) factor

$$\max E\left[\sum_{t=t_0}^T \gamma^{t-t_0}R_{t}(a_t, s_t)\right].$$
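For a single trajectory, the discounted sum inside the expectation can be computed as in the short sketch below; the helper name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^(t - t0) * R_t, accumulated backwards through the episode."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

ret = discounted_return([1.0, 0.0, 2.0], gamma=0.5)   # 1.0 + 0.5 * 0.0 + 0.25 * 2.0
```

With these numbers the return is 1.5: a reward two steps away counts only a quarter as much as an immediate one.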

## Classical MLP
An MLP is a feed-forward network that passes the input through several linear layers with activation functions. In our case, we used leaky ReLU as the activation function. It is also possible to change this in the config.yaml file, for example to GELU, but we should point out that this may lead to the agent learning to take forbidden edges, which are subsequently masked, leading to insufficient gradient flow. This alternative has not been explored in this project. Dropout is included for regularization.

In [None]:
import torch
from torch import nn

class MLP(nn.Module):
    """
    This is the underlying module for all used models within this work.
    """

    def __init__(self, in_features, mlp_units, activation_fn, activation_on_output = True):
        super(MLP, self).__init__()

        self.activation = activation_fn
        self.dropout = nn.Dropout(0.3)


        self.linear_layers = nn.ModuleList() # Storage for L layers
        previous_units = in_features

        # Transform units into a list
        if isinstance(mlp_units, int):
            mlp_units = [mlp_units]

        # Create a chain of layers
        for units in mlp_units:
            self.linear_layers.append(nn.Linear(previous_units, units))
            previous_units = units

        self.out_features = previous_units
        self.activation_on_output = activation_on_output

    # Forward pass
    def forward(self, x):

        # Inter layers
        for module in self.linear_layers[:-1]:
            x = module(x)
            if self.activation is not None:
                x = self.activation(x)
            x = self.dropout(x)

        # Pass through the last layer
        x = self.linear_layers[-1](x)
        if self.activation_on_output:
            if self.activation is not None:  # Guard against a missing activation, as in the inner layers
                x = self.activation(x)
            x = self.dropout(x)

        return x

## Attention Model

In [None]:
import torch.nn.functional as F  # Needed for F.softmax in forward()

class AttModel(nn.Module):
    """
        Basic attention model with masking and scaling.
    """

    def __init__(self, in_features, k_features, v_features, out_features, num_heads, activation_fn, vkq_activation_fn):
        super(AttModel, self).__init__()


        self.k_features = k_features
        self.v_features = v_features
        self.num_heads = num_heads      # Number of attention heads

        self.fc_v = nn.Linear(in_features, v_features * num_heads)  # Transforming input features into Values for attention
        self.fc_k = nn.Linear(in_features, k_features * num_heads)  # Transforming input features into Keys for attention
        self.fc_q = nn.Linear(in_features, k_features * num_heads)  # Transforming input features into Queries for attention

        self.fc_out = nn.Linear(v_features * num_heads, out_features)   # Transforms the outputs from all attention heads into output dimension

        self.activation = activation_fn
        self.vkq_activation = vkq_activation_fn     # Activation function that can be applied into Values, Keys, Queries


        """
        Defining the scaling factor for attention as 1 / sqrt(d_k), the same as in the paper "Attention Is All You Need".
        Without this scaling, the dot products of queries and keys would grow large with the feature dimension
        and push the softmax into regions with very small gradients.
        """
        self.attention_scale = 1 / (k_features **0.5)

        self.dropout = nn.Dropout(0.1)

    # Forward pass
    def forward(self, x, mask):
        batch_size, num_agents = x.shape[0], x.shape[1]

        """
        The code below does the following:
            - a linear mapping is applied on the inputs to obtain Values, Keys, Queries
            - the Values, Keys, Queries are then reshaped to separate the different attention heads of the model
            :reshape: will result in (batch_size, num_agents, num_heads, features_per_head)

        Visual representation:
            Input x
            |
            [Linear Layers] -> V, Q, K
            |
            [Optional Activation] (vkq_activation_fn)
            |
            [Reshape for Multi-Head]
            |
            [Transpose for Heads]
            |
            [Compute Attention Weights (Dot Product, Scale, Mask, Softmax)]
            |
            [Apply Attention to Values]
            |
            [Skip Connection]
            |
            [Transpose and Concatenate Heads]
            |
            [Final Linear Layer and Activation]
            |
            Output
        """

        v = self.fc_v(x).view(batch_size, num_agents, self.num_heads, self.v_features)
        q = self.fc_q(x).view(batch_size, num_agents, self.num_heads, self.k_features)
        k = self.fc_k(x).view(batch_size, num_agents, self.num_heads, self.k_features)

        if self.vkq_activation is not None:
            v = self.vkq_activation(v)
            q = self.vkq_activation(q)
            k = self.vkq_activation(k)

        # We rearrange the tensors to shape (batch_size, num_heads, num_agents, features_per_head)
        # This is done so we can perform batch multiplication over the batch size and heads
        q, k, v = q.transpose(1,2), k.transpose(1,2), v.transpose(1,2)

        # Add head axis (we are keeping the same mask for all attention heads)
        mask = mask.unsqueeze(1)    # (batch_size, 1, num_agents, num_agents) (1,1,20,20)

        """
        The attention is calculated as a dot product of all queries with all keys,
            while scaling it with the attention scale so it does not explode.
            - q is of shape             (batch_size, num_heads, num_agents, features_per_head)
            - k transposed is of shape  (batch_size, num_heads, features_per_head, num_agents)
            - the multiplication result is of shape (batch_size, num_heads, num_agents, num_agents)
        :masked_fill sets positions where mask == 0 to a large negative value - removes them from the attention computation practically
        """

        att_weights = torch.matmul(q, k.transpose(2, 3)) * self.attention_scale
        att = att_weights.masked_fill(mask==0, -1e9)
        att = F.softmax(att, dim=-1)    # Softmax is applied along the last dimension to obtain normalized attention probabilities
        att = self.dropout(att)

        # Now we combine the Values with respect to the attention we just computed
        """
            - att is of shape (batch_size, num_heads, num_agents, num_agents)
            - v is of shape (batch_size, num_heads, num_agents, v_features)
            - the multiplication result is of shape (batch_size, num_heads, num_agents, v_features)
        """
        out = torch.matmul(att, v)

        # We add a skip connection
        out  = torch.add(out, v)    # This additionally promotes gradient flow and mitigates vanishing gradient

        # Now "remove" the transpose and concatenate all heads together
        """
            - out is of shape (batch_size, num_heads, num_agents, v_features)
            - out after transpose is of shape (batch_size, num_agents, num_heads, v_features)
            - contiguous() ensures that the tensor is stored in a contiguous chunk of memory so that the reshape for view can happen
            - view is used to reshape the tensor to (batch_size, num_agents, num_heads * v_features), i.e. we flatten the last two
                dimensions into a single one
            - final out is of shape  (batch_size, num_agents, num_heads * v_features)
        """

        out = out.transpose(1,2).contiguous().view(batch_size, num_agents, -1)
        out = self.activation(self.fc_out(out)) # Linear map into a desired feature dimension
        out = self.dropout(out)

        return out, att_weights
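
The core of the forward pass above (dot product, scale, mask, softmax, weighted sum) can be tried in isolation. The following is a stripped-down sketch without the linear projections, skip connection or dropout; the helper name `masked_attention`, the tensor sizes and the neighbourhood mask are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where mask == 0 blocks a (query, key) pair."""
    scale = 1 / (q.shape[-1] ** 0.5)
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    scores = scores.masked_fill(mask == 0, -1e9)   # blocked pairs get ~zero weight
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights

torch.manual_seed(0)
batch_size, num_heads, num_agents, d_k = 1, 2, 3, 4
q = torch.randn(batch_size, num_heads, num_agents, d_k)
k = torch.randn(batch_size, num_heads, num_agents, d_k)
v = torch.randn(batch_size, num_heads, num_agents, d_k)
# Hypothetical neighbourhood: agents 0 and 2 cannot see each other.
mask = torch.tensor([[1, 1, 0], [1, 1, 1], [0, 1, 1]]).view(1, 1, 3, 3)
out, weights = masked_attention(q, k, v, mask)
```

Each row of `weights` sums to one, and the entries for the blocked pair (agent 0 attending to agent 2, and vice versa) are zero in every head.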

## DGN


In [None]:
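class DGN(nn.Module):
    """
    Graph-based Q-network: an MLP encoder followed by stacked attention layers.
    The encoder output and the output of every attention layer are concatenated and passed to Q_Net.
    """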

    def __init__(self, in_features, mlp_units, num_actions, num_heads, num_attention_layers, activation_fn, kv_values):
        super(DGN, self).__init__()

        self.encoder = MLP(in_features, mlp_units, activation_fn)
        self.att_layers = nn.ModuleList()
        hidden_features = self.encoder.out_features

        print("In features of DGN: ", in_features)
        print("MLP units are: ", mlp_units)

        for _ in range(num_attention_layers):
            self.att_layers.append(
                AttModel(hidden_features, kv_values, kv_values, hidden_features, num_heads, activation_fn, activation_fn)
                                   )

        self.q_net = Q_Net(hidden_features * (num_attention_layers + 1), num_actions)

        self.att_weights = []

    def forward(self, x, mask):
        """
        Additional comment to the function:
            - each attention layer refines the representation h by focusing on relevant parts of the input
            - by concatenating the representations the feature set for the Q-network is enhanced, consequently making more informed decisions

        """

        h = self.encoder(x)     # Encodes the input features, has a shape of (batch_size, num_agents, hidden_features)
        q_input = h     # Initialize the q_input with encoded features
        self.att_weights.clear()    # Ensuring that attention weights from previous forward passes do not accumulate

        for attention_layer in self.att_layers:
            h, att_weight = attention_layer(h, mask)
            self.att_weights.append(att_weight)

            # Concatenation of outputs
            q_input = torch.cat((q_input, h), dim=-1)

        # Final q_input is of shape (batch_size, num_agents, hidden_features * (num_attention_layers +1))
        q = self.q_net(q_input)

        return q    # is of shape (batch_size, num_agents, num_actions)


## DQN

Deep Q-Learning Network. The encoder MLP transforms the input features for generalization purposes, and the result is passed to the Q_Net to predict the reward for each possible action; the forward pass does nothing more than that. Although it is arguably the simplest model we have, DQN had the best and most consistent performance. This has also been noted by the authors of the graph MARL paper we reference.
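
Q_Net is used by every model in this notebook but is not reproduced here; its real definition is in /src. A minimal stand-in, assuming it is a single linear layer that maps hidden features to one Q-value per action, could look like the sketch below. The layer layout is an assumption and not the project's code.

```python
import torch
from torch import nn

class Q_Net(nn.Module):
    """Hypothetical stand-in: one linear layer producing a Q-value per action."""

    def __init__(self, in_features, num_actions):
        super().__init__()
        self.fc = nn.Linear(in_features, num_actions)

    def forward(self, x):
        # x: (batch_size, num_agents, in_features) -> (batch_size, num_agents, num_actions)
        return self.fc(x)

q_net = Q_Net(in_features=8, num_actions=5)
q = q_net(torch.zeros(2, 3, 8))   # 2 batches, 3 agents, 8 hidden features
```

With a stand-in like this in scope, the class definitions in the code cells of this notebook can be instantiated for quick shape checks.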

In [None]:
class DQN(nn.Module):
    """
    Uses a simple deep feed-forward neural network (= MLP) as the encoder.
    """

    def __init__(self, in_features, mlp_units, num_actions, activation_fn):
        super(DQN, self).__init__()

        self.encoder = MLP(in_features, mlp_units, activation_fn)   # Encodes incoming features
        self.q_net = Q_Net(self.encoder.out_features, num_actions)  # Outputs Q-values
        self.activation = activation_fn

    def forward(self, x, mask):
        batch, agent, features = x.shape
        h = self.encoder(x)
        q = self.q_net(h)
        return q


## CommNet

In [None]:
class DQNR(nn.Module):
    """
    Recurrent DQN with an lstm cell.
    """

    def __init__(self, in_features, mlp_units, num_actions, activation_fn):
        super(DQNR, self).__init__()
        self.encoder = MLP(in_features, mlp_units, activation_fn)
        self.lstm = nn.LSTMCell(
            input_size=self.encoder.out_features, hidden_size=self.encoder.out_features
        )
        self.state = None
        self.q_net = Q_Net(self.encoder.out_features, num_actions)

    def get_state_len(self):
        return 2 * self.lstm.hidden_size

    def _state_reshape_in(self, batch_size, n_agents):
        """
        Reshapes the state of shape
            (batch_size, n_agents, self.get_state_len())
        to shape
            (2, batch_size * n_agents, hidden_size).

        :param batch_size: the batch size
        :param n_agents: the number of agents
        """
        self.state = (
            self.state.reshape(
                batch_size * n_agents,
                2,
                self.lstm.hidden_size,
            )
            .transpose(0, 1)
            .contiguous()
        )

    def _state_reshape_out(self, batch_size, n_agents):
        """
        Reshapes the state of shape
            (2, batch_size * n_agents, hidden_size)
        to shape
            (batch_size, n_agents, self.get_state_len()).

        :param batch_size: the batch size
        :param n_agents: the number of agents
        """
        self.state = self.state.transpose(0, 1).reshape(batch_size, n_agents, -1)

    def _lstm_forward(self, x, reshape_state=True):
        """
        A single lstm forward pass

        :param x: Cell input
        :param reshape_state: reshape the state to and from (batch_size, n_agents, -1)
        """
        batch_size, n_agents, feature_dim = x.shape
        # combine agent and batch dimension
        x = x.view(batch_size * n_agents, -1)

        if self.state is None:
            lstm_hidden_state, lstm_cell_state = self.lstm(x)
        else:
            if reshape_state:
                self._state_reshape_in(batch_size, n_agents)
            lstm_hidden_state, lstm_cell_state = self.lstm(
                x, (self.state[0], self.state[1])
            )

        self.state = torch.stack((lstm_hidden_state, lstm_cell_state))
        x = lstm_hidden_state

        # undo combine
        x = x.view(batch_size, n_agents, -1)
        if reshape_state:
            self._state_reshape_out(batch_size, n_agents)

        return x

    def forward(self, x, mask):
        h = self.encoder(x)
        h = self._lstm_forward(h)
        return self.q_net(h)


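class CommNet(DQNR):
    """
    Recurrent DQN with communication: for comm_rounds rounds, each agent's hidden state is combined
    with the mean hidden state of its neighbors and passed through the LSTM cell again.
    """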

    def __init__(
        self,
        in_features,
        mlp_units,
        num_actions,
        comm_rounds,
        activation_fn,
    ):
        super().__init__(in_features, mlp_units, num_actions, activation_fn)
        assert comm_rounds >= 0
        self.comm_rounds = comm_rounds

    def forward(self, x, mask):
        batch_size, n_agents, feature_dim = x.shape
        h = self.encoder(x)

        # manually reshape state
        if self.state is not None:
            self._state_reshape_in(batch_size, n_agents)

        h = self._lstm_forward(h, reshape_state=False)

        # explicitly exclude self-communication from mask
        mask = mask * ~torch.eye(n_agents, dtype=bool, device=x.device).unsqueeze(0)

        for _ in range(self.comm_rounds):
            # combine hidden state h according to mask
            # first add up hidden states according to mask
            #    h has dimensions (batch, agents, features)
            #    and mask has dimensions (batch, agents, neighbors)
            #    => we have to transpose the mask to aggregate over all neighbors
            c = torch.bmm(h.transpose(1, 2), mask.transpose(1, 2)).transpose(1, 2)
            # then normalize according to number of neighbors per agent
            c = c / torch.clamp(mask.sum(dim=-1).unsqueeze(-1), min=1)

            # skip connection for hidden state and communication
            h = h + c
            # use new hidden state
            self.state[0] = h.view(batch_size * n_agents, -1)

            # pass through forward module
            h = self._lstm_forward(h, reshape_state=False)

        # manually reshape state in the end
        self._state_reshape_out(batch_size, n_agents)
        return self.q_net(h)
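
A single communication step from the loop above, without the LSTM, can be reproduced on a toy example. The sketch assumes three agents with two hidden features each; the helper name `comm_round` and all values are illustrative.

```python
import torch

def comm_round(h, mask):
    """Add the mean hidden state of each agent's neighbours to its own hidden state."""
    n_agents = h.shape[1]
    mask = mask * (1 - torch.eye(n_agents))                  # exclude self-communication
    c = torch.bmm(mask, h)                                   # sum over neighbours
    c = c / torch.clamp(mask.sum(dim=-1, keepdim=True), min=1)
    return h + c                                             # skip connection

h = torch.tensor([[[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]])     # (batch, agents, features)
mask = torch.tensor([[[1.0, 1.0, 1.0],                       # agent 0 hears agents 1 and 2
                      [1.0, 1.0, 0.0],                       # agent 1 hears agent 0
                      [0.0, 0.0, 1.0]]])                     # agent 2 hears nobody
h_new = comm_round(h, mask)
```

Agent 0 receives the mean of its two neighbours, agent 1 receives agent 0's state, and agent 2 has no neighbours, so its state is unchanged because the neighbour count is clamped to one.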




## State Aggregation

### SUM

In [None]:
class SimpleAggregation(nn.Module):
    def __init__(self, agg: str, mask_eye: bool) -> None:
        super().__init__()
        self.agg = agg
        assert self.agg == "mean" or self.agg == "sum"
        self.mask_eye = mask_eye

    def forward(self, node_features, node_adjacency):
        if self.mask_eye:
            node_adjacency = node_adjacency * ~(
                torch.eye(
                    node_adjacency.shape[1],
                    node_adjacency.shape[1],
                    device=node_adjacency.device,
                )
                .repeat(node_adjacency.shape[0], 1, 1)
                .bool()
            )
        feature_sum = torch.bmm(node_adjacency, node_features)
        if self.agg == "sum":
            return feature_sum
        if self.agg == "mean":
            num_neighbors = torch.clamp(node_adjacency.sum(dim=-1), min=1).unsqueeze(-1)
            return feature_sum / num_neighbors


### GCN
GCN is a graph convolutional operator that handles the message passing phase within the GNN. An implementation is available as [GCN](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.GCNConv.html#torch_geometric.nn.conv.GCNConv) within the pytorch_geometric library, which specializes in GNNs.
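
To make the message passing step concrete, here is a sketch of one GCN propagation step following the rule from the GCN paper by Kipf and Welling, $D^{-1/2}(A + I)D^{-1/2} X W$, written in plain PyTorch so that it does not depend on pytorch_geometric. The helper name `gcn_layer` and the two-node graph are assumptions made for the example; the project itself uses the library operator.

```python
import torch

def gcn_layer(node_features, adjacency, weight):
    """One GCN step: symmetric-normalized adjacency with self-loops, then a linear map."""
    a_hat = adjacency + torch.eye(adjacency.shape[0])        # add self-loops
    deg_inv_sqrt = a_hat.sum(dim=-1).pow(-0.5)               # D^-1/2 as a vector
    norm_adj = deg_inv_sqrt.unsqueeze(-1) * a_hat * deg_inv_sqrt.unsqueeze(0)
    return norm_adj @ node_features @ weight

adjacency = torch.tensor([[0.0, 1.0], [1.0, 0.0]])           # two connected nodes
features = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
out = gcn_layer(features, adjacency, torch.eye(2))           # identity weight for clarity
```

With an identity weight matrix, each node ends up with the average of its own and its neighbour's features, so every entry of `out` is 0.5.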

GCN is based on the spectral approximations of convolutional

NetMon

In [None]:
!git clone https://github.com/Strojove-uceni/2024-final-letadylka-prochazka-belohlavek/

Cloning into '2024-final-letadylka-prochazka-belohlavek'...
remote: Enumerating objects: 528, done.
remote: Counting objects: 100% (302/302), done.
remote: Compressing objects: 100% (195/195), done.
remote: Total 528 (delta 156), reused 208 (delta 101), pack-reused 226 (from 1)
Receiving objects: 100% (528/528), 14.32 MiB | 22.67 MiB/s, done.
Resolving deltas: 100% (258/258), done.


In [None]:
# Run the main.py script with the modified config
!python /content/2024-final-letadylka-prochazka-belohlavek/src/main.py --config data/config_comm.yaml

2024-12-12 10:11:59.841513: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-12 10:11:59.878132: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-12 10:11:59.887487: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-12 10:11:59.924750: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

A module that was compiled using NumPy 1.x cannot be

Download files from our Git repository




# Selecting Parameters
We ran multiple sweeps on Weights & Biases to provide an overview of parameter importance and correlations. Because of the time constraints of the project, we were not able to collect all the data we would have liked, but the results are nevertheless reasonably clear. We selected the best models based on the gathered reward. Other interesting metrics would be spr (the mean ratio of the length of the path taken to the true shortest path) or throughput (the mean ratio of planes that reached the target during the episode).

We ran two sweeps: one for CommNet settings only (first picture) and one comparing DQN and DGN (second picture). Both sweeps ran for 75k steps. CommNet was significantly faster but the inferior architecture, and SUM was both the superior and the faster aggregation method.

## Common Parameters in the Sweep
 - mini_batch_size : number of sampled experiences from the replay buffer
 - epsilon_decay : decay factor of epsilon in the EpsilonGreedy policy
 - agg_type : method used during GNN message aggregation phase
 - gamma : discount factor (the multiplicative weight on future rewards) in Q-learning

## CommNet specific
 - comm_round: number of information-passing rounds

## DQN, DGN specific
 - num_heads: number of attention heads
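
The parameters listed above could be expressed as a Weights & Biases sweep configuration along the following lines. This is only a sketch: the search method, the metric name `reward`, and every value or range are placeholders chosen for illustration, not the settings used in the sweeps reported here.

```python
# Hypothetical sweep configuration; all values below are illustrative placeholders.
sweep_config = {
    "method": "random",
    "metric": {"name": "reward", "goal": "maximize"},
    "parameters": {
        # Common parameters
        "mini_batch_size": {"values": [32, 64, 128]},
        "epsilon_decay": {"min": 0.99, "max": 0.999},
        "agg_type": {"values": ["sum", "gcn"]},
        "gamma": {"min": 0.90, "max": 0.99},
        # CommNet specific
        "comm_round": {"values": [1, 2, 3]},
        # DQN, DGN specific
        "num_heads": {"values": [2, 4, 8]},
    },
}

parameter_names = sorted(sweep_config["parameters"])
```

In practice the CommNet-specific and the DQN/DGN-specific entries would go into two separate configurations, matching the two sweeps described in this section.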




# Advice for Parameter Selection
Below we give advice on how to fine-tune the hyperparameters, and then we show results from the sweeps on wandb. 35 sweeps were performed for the two settings.
## CommNet settings
It is apparent that selecting a lower number of comm_rounds and increasing the mini_batch_size improves the model greatly. Although epsilon_update_frequency does not appear important, this could be due to the narrow range we selected for the sweep. We advise setting this parameter to about 70 for 75k steps and gradually increasing it as the number of steps grows. A good rule of thumb is that the epsilon value should be around 0.01 after training; there is a hard floor, so if epsilon dips below 0.01 it is reset to 0.01.
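
The rule of thumb above can be turned into a quick calculation. The sketch assumes that epsilon starts at 1.0 and is multiplied by a constant decay factor once every `update_frequency` steps; both the starting value and the helper names are assumptions, so check them against the configuration before relying on the numbers.

```python
def decay_factor(total_steps, update_frequency, eps_start=1.0, eps_end=0.01):
    """Decay factor that brings epsilon from eps_start to eps_end by the end of training."""
    n_updates = total_steps // update_frequency
    return (eps_end / eps_start) ** (1 / n_updates)

def epsilon_after(n_updates, factor, eps_start=1.0, eps_min=0.01):
    """Epsilon after n_updates multiplicative decays, clipped at the hard floor."""
    return max(eps_min, eps_start * factor ** n_updates)

factor = decay_factor(total_steps=75_000, update_frequency=70)
final_eps = epsilon_after(75_000 // 70, factor)
```

Under these assumptions, 75k steps with an update every 70 steps give 1071 updates, which calls for a decay factor of roughly 0.9957 to land at 0.01.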
## DQN, DGN settings
Gamma seemed to be the most important hyperparameter in this setting, which is not all that surprising given that the architecture feeds almost directly into the Q_Net. We saw consistent improvement in the gathered reward when gamma was set lower (to around 0.92-0.95) for 75k steps. Epsilon update frequency shows up here as a very important hyperparameter with a strong negative correlation. This means that epsilon should be updated more often, which suggests that the models are able to learn to navigate the environment quickly. The number of attention heads and the number of attention layers both show a positive correlation with the reward.

## Aggregation Type
In both sweeps the aggregation type did not hold much importance. In the case of CommNet, the suggestion would be to use GCN, on the basis that GCN had a positive correlation and SUM a negative correlation with the generated reward.
On the other hand, in the case of DQN and DGN, SUM seemed to perform better. From our own experience training the models for more steps, we would recommend using SUM, as it is faster and the results are comparable.

WANDB sweep for CommNet
![](https://github.com/Strojove-uceni/2024-final-letadylka-prochazka-belohlavek/blob/main/pictures/commnet.png?raw=true)

Hyperparameter importance for CommNet
![](https://github.com/Strojove-uceni/2024-final-letadylka-prochazka-belohlavek/blob/main/pictures/commnet_parameters.png?raw=true)

WANDB sweep for DQN vs. DGN
![](https://github.com/Strojove-uceni/2024-final-letadylka-prochazka-belohlavek/blob/main/pictures/dgn_dqn.png?raw=true)

Hyperparameter importance for DGN and DQN
![](https://github.com/Strojove-uceni/2024-final-letadylka-prochazka-belohlavek/blob/main/pictures/dgn_dqn_parameters.png?raw=true)


# Our results