# Preliminaries

In this section, we briefly introduce the basic concepts of graph representation as well as the general vertex and edge updating mechanism of Graph Neural Networks (GNNs).

## Graph Representation

A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ consists of a set of vertices $\mathcal{V} \subseteq \{\mathbf{v}_i \in \mathbb{R}^{1 \times K} \}$ and a set of edges $\mathcal{E} \subseteq \{ \mathbf{e}_{i,j} = \mathbf{e}(\mathbf{v}_i, \mathbf{v}_j) \mid \mathbf{v}_i, \mathbf{v}_j \in \mathcal{V},  i \neq j \}$, where $\mathbf{v}_i$ represents the $K$ attributes of the $i_{th}$ object/component in the pre-defined graph or non-graph data sample, and $\mathbf{e}_{i,j}$ represents the edge feature that defines the relationship between the vertices $\mathbf{v}_i$ and $\mathbf{v}_j$. Each pair of vertices can be connected by at most one undirected edge or two directed edges. A standard way to describe such edges is the adjacency matrix $\mathcal{A} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$, where all vertices in a graph are ordered so that each vertex indexes a specific row and column. The presence of each edge $\mathbf{e}_{i,j}$ can then be described by a binary value: $\mathcal{A}_{i,j} = 1$ if $\mathbf{v}_i$ and $\mathbf{v}_j$ are connected, and $\mathcal{A}_{i,j} = 0$ otherwise. The adjacency matrix is always symmetric if all edges are undirected, but can be non-symmetric if one or more directed edges exist. Instead of binary values, some studies (Dwivedi et al. 2020; Song et al. 2021a; Isufi et al. 2021) build adjacency matrices with continuous real values that describe the strength of association between each pair of vertices.
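As a small illustration of the adjacency-matrix description above, the following sketch builds a binary adjacency matrix from an edge list and checks its symmetry. The helper names `build_adjacency` and `is_symmetric`, and the convention that an edge listed as `(i, j)` sets row `i`, column `j`, are assumptions made for this example only.

```python
def build_adjacency(num_vertices, edges, directed=False):
    """Return a binary adjacency matrix as a list of rows.

    `edges` is an iterable of (i, j) index pairs with i != j.
    Assumed convention: an edge (i, j) sets entry [i][j] to 1.
    """
    adj = [[0] * num_vertices for _ in range(num_vertices)]
    for i, j in edges:
        if i == j:
            raise ValueError("edges between a vertex and itself are excluded (i != j)")
        adj[i][j] = 1
        if not directed:
            adj[j][i] = 1  # an undirected edge is visible from both endpoints
    return adj


def is_symmetric(adj):
    """Check whether adj[i][j] == adj[j][i] for every pair of indices."""
    n = len(adj)
    return all(adj[i][j] == adj[j][i] for i in range(n) for j in range(n))


edge_list = [(0, 1), (1, 2)]
undirected = build_adjacency(3, edge_list)
directed = build_adjacency(3, edge_list, directed=True)

print(undirected, is_symmetric(undirected))
print(directed, is_symmetric(directed))
```

With only undirected edges the matrix is symmetric, while the same edge list interpreted as directed edges yields a non-symmetric matrix.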

## Vertex/Edge Updating of Message-Passing GNNs

Recently, message-passing Graph Neural Networks (GNNs) (Kipf and Welling 2016; Xu et al. 2018; Bresson and Laurent 2017; Veličković et al. 2017), including Graph Convolution Networks (GCNs), have become the dominant models for a wide variety of graph analysis tasks. Given a GNN $G$, its $l_{th}$ layer $G^l$ takes the graph $\mathcal{G}^{l-1} = (\mathcal{V}^{l-1}, \mathcal{E}^{l-1})$ produced by the $(l-1)_{th}$ layer $G^{l-1}$ as the input, and generates a new graph $\mathcal{G}^{l} = (\mathcal{V}^l, \mathcal{E}^l)$, which can be formulated as:
\begin{equation}
    \mathcal{G}^{l} = G^l(\mathcal{G}^{l-1})
\end{equation}
Specifically, the vertex feature $\mathbf{v}^l_i$ in $\mathcal{G}^{l}$ is computed based on: (i) its previous status $\mathbf{v}_i^{l-1}$ in $\mathcal{G}^{l-1}$; (ii) the set of vertices adjacent to $\mathbf{v}_i^{l-1}$ in $\mathcal{G}^{l-1}$ (denoted as $\mathbf{v}_j^{l-1} \in \mathcal{N}(\mathbf{v}_i^{l-1})$, i.e., those with $\mathcal{A}^{l-1}_{i,j} = 1$, where $\mathcal{A}^{l-1}$ is the adjacency matrix of $\mathcal{G}^{l-1}$); and (iii) the set of edge features $\mathbf{e}_{j,i}^{l-1}$ that represent the relationship between $\mathbf{v}_i^{l-1}$ and every $\mathbf{v}_j^{l-1}$ in $\mathcal{N}(\mathbf{v}_i^{l-1})$. Here, the message $ \mathbf{m}_{\mathcal{N}(\mathbf{v}^{l-1}_i)}$ is produced by aggregating all adjacent vertices of $\mathbf{v}_i^{l-1}$ through their related edges $\mathbf{e}_{j,i}^{l-1}$, which can be formulated as:
\begin{equation}
\begin{split}
& \mathbf{m}_{\mathcal{N}(\mathbf{v}^{l-1}_i)} = M(\mathbin\Vert ^{N}_{j=1} f(\mathbf{v}_j^{l-1},\mathbf{e}_{j,i}^{l-1})) \\
& f(\mathbf{v}_j^{l-1},\mathbf{e}_{j,i}^{l-1}) = 0 \quad \text{subject to} \quad \mathcal{A}^{l-1}_{i,j} = 0
\end{split}
\label{eq:message-passing}
\end{equation}
where $M$ is a differentiable function that aggregates messages produced from all adjacent vertices; $N$ denotes the number of vertices in the graph $\mathcal{G}^{l-1}$; $f(\mathbf{v}_j^{l-1},\mathbf{e}_{j,i}^{l-1})$ is a differentiable function defining the influence of an adjacent vertex $\mathbf{v}_j^{l-1}$ on the vertex $\mathbf{v}_i^{l-1}$ through their edge $\mathbf{e}_{j,i}^{l-1}$; and $\mathbin\Vert $ is the aggregation operator to combine messages of all adjacent vertices of the $\mathbf{v}_i^{l-1}$. As a result, the vertex feature $\mathbf{v}^l_i$ can be updated as:
\begin{equation}
\begin{split}
   \mathbf{v}^l_i = G_v^l(\mathbf{v}^{l-1}_i,\mathbf{m}_{\mathcal{N}(\mathbf{v}^{l-1}_i)})
\end{split}
\label{eq:vertex}
\end{equation}
where $G_v^l$ denotes a differentiable function of the $l_{th}$ GNN layer, which updates all vertex features for producing the graph $\mathcal{G}^{l}$. 
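The message and vertex-update equations above can be sketched as follows. This is only one possible instantiation: it assumes scalar edge features, $f(\mathbf{v}_j, \mathbf{e}_{j,i}) = \mathbf{e}_{j,i}\,\mathbf{v}_j$, a sum as the aggregation $M$, and a residual update $G_v(\mathbf{v}_i, \mathbf{m}) = \mathbf{v}_i + \mathbf{m}$. None of these choices is prescribed by the text, and the function name `message_passing_layer` is hypothetical.

```python
def message_passing_layer(vertices, edge_weights, adj):
    """One vertex-update step of a toy message-passing layer.

    vertices:      list of N feature vectors (lists of floats)
    edge_weights:  N x N matrix, edge_weights[j][i] is the scalar e_{j,i}
    adj:           N x N binary adjacency matrix
    """
    n = len(vertices)
    dim = len(vertices[0])
    updated = []
    for i in range(n):
        message = [0.0] * dim
        for j in range(n):
            if adj[i][j] == 0:
                continue  # f(.) contributes zero for non-adjacent pairs
            w = edge_weights[j][i]
            for k in range(dim):
                message[k] += w * vertices[j][k]  # assumed f: e_{j,i} * v_j, summed by M
        # assumed G_v: residual update v_i + m
        updated.append([vertices[i][k] + message[k] for k in range(dim)])
    return updated


vertices = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
edge_weights = [[0.0, 0.5, 0.0], [0.5, 0.0, 2.0], [0.0, 2.0, 0.0]]

new_vertices = message_passing_layer(vertices, edge_weights, adj)
print(new_vertices)  # [[1.0, 0.5], [2.5, 3.0], [1.0, 3.0]]
```

Vertex 1 has two neighbours, so its message is the weighted sum of both neighbouring features, whereas vertices 0 and 2 each receive a message from vertex 1 only.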



Meanwhile, each edge feature $\mathbf{e}_{i,j}^l$ in the graph $\mathcal{G}^{l}$ can either be kept the same as its previous status $\mathbf{e}_{i,j}^{l-1}$ in the graph $\mathcal{G}^{l-1}$ (Kipf and Welling 2016; Hamilton et al. 2017) (denoted as GNN type 1) or be updated (Gong and Cheng 2019; Jo et al. 2021; Isufi et al. 2021) (denoted as GNN type 2) during the GNN's propagation. In the latter case, each edge feature $\mathbf{e}_{i,j}^l \in \mathcal{G}^{l}$ is computed based on: (i) its previous status $\mathbf{e}^{l-1}_{i,j} \in \mathcal{G}^{l-1}$; and (ii) the corresponding vertex features $\mathbf{v}^{l-1}_i$ and $\mathbf{v}^{l-1}_j$ in $\mathcal{G}^{l-1}$. Mathematically, $\mathbf{e}^{l}_{i,j}$ can be computed as:
\begin{equation}
\begin{split}
    \mathbf{e}_{i,j}^l = 
    \begin{cases}
    \mathbf{e}^{l-1}_{i,j} &  \text{GNN type 1} \\
    G_e^l(\mathbf{e}^{l-1}_{i,j},g(\mathbf{v}^{l-1}_i, \mathbf{v}^{l-1}_j)) & \text{GNN type 2}
    \end{cases}
\end{split}
\end{equation}
where $G_e^l$ is a differentiable function of the $l_{th}$ GNN layer, which updates edge features to produce the graph $\mathcal{G}^{l}$, and $g$ is a differentiable function that models the relationship between $\mathbf{v}^{l-1}_i$ and $\mathbf{v}^{l-1}_j$. In summary, the updates of vertex and edge features influence each other during the propagation of message-passing GNNs. Please refer to Hamilton (2020) and Dwivedi et al. (2020) for more details.
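The two edge-updating cases can be sketched as below. The concrete choices are assumptions made for illustration: edge features are scalars, $g$ is a dot product of the two vertex features, and $G_e(\mathbf{e}, s) = \mathbf{e} + s$. The function name `update_edges` is hypothetical.

```python
def dot(u, v):
    """Dot product of two equally sized feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def update_edges(vertices, edges, update=True):
    """Return the edge features of the next layer.

    edges maps an index pair (i, j) to the scalar feature e_{i,j}
    of the previous layer.
    """
    if not update:
        return dict(edges)  # GNN type 1: edge features are carried over unchanged
    # GNN type 2 (assumed form): e_{i,j} + g(v_i, v_j), with g a dot product
    return {
        (i, j): e + dot(vertices[i], vertices[j])
        for (i, j), e in edges.items()
    }


vertices = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = {(0, 1): 0.5, (1, 2): 2.0}

type1_edges = update_edges(vertices, edges, update=False)
type2_edges = update_edges(vertices, edges, update=True)
print(type1_edges)  # {(0, 1): 0.5, (1, 2): 2.0}
print(type2_edges)  # {(0, 1): 0.5, (1, 2): 3.0}
```

In the type 2 case, the edge between vertices 1 and 2 changes because their features overlap, while the edge between the orthogonal vertices 0 and 1 stays at its previous value.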

# Methodology

## Algorithmic Definitions

**Task-specific:** The term refers to features or topology specifically extracted for a particular task. These are produced from the original input data using networks trained with the target labels. In simpler terms, the networks define the features or topology based on the relationship between the input and the label.

**Plug-and-play:** The term refers to the ability to easily integrate the model into a machine learning framework (i.e., Backbone-GNN frameworks in this paper) without requiring significant modifications or customization.

**Edge feature:** The term refers to a vector or a single value attached to a graph edge, which describes the characteristics of the edge based on the relationship between the corresponding vertex pair. In this paper, the term 'multi-dimensional edge feature' represents a vector consisting of multiple attributes to describe the characteristics of the corresponding edge.

## Visualization of GD and TTP Modules

Figure 1 illustrates the GD and TTP modules of the graph-based GRATIS. The GCN in the backbone first produces a set of vertex features from the input graph $\mathcal{D}^{\text{in}}$, which are concatenated as a matrix $X^{\text{GCN}}$. Then, the CNN yields the global contextual representation $X$ from the $X^{\text{GCN}}$. The GD module simply treats the original input graph as the basic graph. After that, the TTP module produces an adjacency probability matrix $\mathcal{\hat{A}}^{\text{prob}}$ from $X$, which is then combined with the basic graph topology to generate the final task-specific adjacency matrix $\mathcal{\hat{A}}$. 

<div align="center">
<img src="figures/submodules_predefined_graphs.png" alt="GCN-CNN Backbone, Graph Definition (GD), and Task-specific Topology Prediction (TTP) modules for processing pre-defined graphs" width="60%" />
</div>

*Figure 1: Illustration of the GCN-CNN Backbone, Graph Definition (GD), and Task-specific Topology Prediction (TTP) modules for processing **pre-defined graphs**.*

Figure 2 illustrates the GD and TTP modules of the non-graph based GRATIS, which can take various types of non-graph data (e.g., image, audio, video, text and time-series). During training, in the GD module, the outputs of the VFE are fed to a standard MLP-based predictor, so that the VFE receives intermediate supervision. This allows the trained VFE to provide a set of task-related vertex features, based on which the basic graph is defined. Then, the TTP module jointly trains the VFE with a GCN, allowing a $\hat{\text{VFE}}$ to be further optimized to produce task-specific graph vertex features and a vertex-decided graph topology, which are combined with the basic graph topology to generate the graph $\mathcal{G}^{\text{V}}(\mathcal{\hat{V}},\mathcal{E}^{V})$ that has an optimal task-specific topology.

![GD and TTP modules for **non-graph data**](figures/submodules_non_graphs.png)
*Figure 2: Illustration of the GD and TTP modules for **non-graph data**, which are jointly optimized during training.*

## Framework differentiability analysis

In this section, we show that the three modules (GD, TTP and MEFG) of the proposed framework can be jointly trained with any differentiable backbone and GNN predictor in an end-to-end manner. This allows the weights learned for all modules to be task-specific. 

Firstly, the GD module for processing graph data only contains an identity mapping, while the GD module for processing non-graph data is made up of a set of differentiable FC and GAP layers (i.e., the VFE). Therefore, the GD module is fully differentiable for both graph and non-graph data. Let $\mathcal{L}$ be a differentiable objective function that measures the loss of the final prediction generated by the GNN predictor. Its gradient w.r.t. the graph $\mathcal{G}^{B}(\mathcal{V}, \mathcal{E})$ generated by the GD module can be computed as:

\begin{equation}
\begin{split}
\frac{\partial \mathcal{L}}{\partial \mathcal{G}^{B}} &= \frac{\partial \mathcal{L}}{\partial p_{\mathcal{\hat{G}}}} \frac{\partial p_{\mathcal{\hat{G}}}}{\partial \mathcal{\hat{G}}} \frac{\partial \mathcal{\hat{G}}}{\partial \mathcal{G}^{B}} \\
& = \frac{\partial \mathcal{L}}{\partial p_{\mathcal{\hat{G}}}} \frac{\partial p_{\mathcal{\hat{G}}}}{\partial \mathcal{\hat{G}}}(\frac{\partial \mathcal{\hat{V}}}{\partial \mathcal{V}}+\frac{\partial \mathcal{\hat{V}}}{\partial X}, \frac{\partial \mathcal{\hat{E}}}{\partial \mathcal{V}}+\frac{\partial \mathcal{\hat{E}}}{\partial X}) \\
& = \frac{\partial \mathcal{L}}{\partial p_{\mathcal{\hat{G}}}} \frac{\partial \text{GNN}(\mathcal{\hat{G}})}{\partial \mathcal{\hat{G}}} (\frac{\partial 
\text{TTP}(X, \mathcal{A}, \mathcal{V})}{\partial \mathcal{V}} + \frac{\partial 
\text{TTP}(X, \mathcal{A}, \mathcal{V})}{\partial X},   \\ 
&\frac{\partial  \text{MEFG}(X, \mathcal{V})}{\partial \mathcal{V}}+\frac{\partial  \text{MEFG}(X, \mathcal{V})}{\partial X})
\end{split}
\label{eq:back-propagation}
\end{equation}

where $p_{\mathcal{\hat{G}}}$ denotes the graph analysis prediction obtained from the differentiable GNN predictor; $\mathcal{\hat{G}}(\mathcal{\hat{V}}, \mathcal{\hat{E}})$ is the final produced task-specific graph representation ($\mathcal{\hat{E}}$ is produced from the MEFG module and $\mathcal{\hat{V}}$ is produced from the TTP module). Here, both GNN and MEFG are made up of fully differentiable layers. In other words, the terms $\frac{\partial \mathcal{L}}{\partial p_{\mathcal{\hat{G}}}}$, $ \frac{\partial \text{GNN}(\mathcal{\hat{G}})}{\partial \mathcal{\hat{G}}}$, $ \frac{\partial  \text{MEFG}(X, \mathcal{V})}{\partial \mathcal{V}}$ and $\frac{\partial  \text{MEFG}(X, \mathcal{V})}{\partial X}$ are differentiable. 

In addition, the generation of $\mathcal{\hat{A}}^{\text{prob}}$ for graph data and $\mathcal{\hat{A}}^{\text{V}}$ for non-graph data is fully determined by the global contextual representation $X$ and the task-specific vertex features $\mathcal{\hat{V}}$, both of which are produced via fully differentiable operations, i.e., $X$ is generated by a CNN/GCN backbone, and $\mathcal{\hat{V}}$ is produced by either an identity mapping for pre-defined graphs or the VFE for non-graph data. As a result, the gradients for updating $\mathcal{A}$ to $\mathcal{\hat{A}}^{\text{prob}}$ or $\mathcal{\hat{A}}^{\text{V}}$ can be formulated as:

\begin{equation}
\begin{split}
\frac{\partial \mathcal{L}}{\partial \mathcal{A}} &= \frac{\partial \mathcal{L}}{\partial \mathcal{V}}+\frac{\partial \mathcal{L}}{\partial X} \\
&= \frac{\partial \mathcal{L}}{\partial p_{\mathcal{\hat{G}}}} 
\frac{\partial p_{\mathcal{\hat{G}}}}{\partial \mathcal{\hat{G}}} (\frac{\partial \mathcal{\hat{G}}}{\partial \mathcal{V}} + \frac{\partial \mathcal{\hat{G}}}{\partial X} )\\
&= \frac{\partial \mathcal{L}}{\partial p_{\mathcal{\hat{G}}}} 
\frac{\partial p_{\mathcal{\hat{G}}}}{\partial \mathcal{\hat{G}}} (\frac{\partial \mathcal{\hat{V}}}{\partial \mathcal{V}} + \frac{\partial \mathcal{\hat{E}}}{\partial \mathcal{V}} + \frac{\partial \mathcal{\hat{V}}}{\partial X} + \frac{\partial \mathcal{\hat{E}}}{\partial X})
\end{split}
\label{eq:back-propagation_A}
\end{equation}

As analyzed for Eq. \ref{eq:back-propagation}, all terms in Eq. \ref{eq:back-propagation_A} are differentiable, which means that the back-propagated gradients allow the adjacency matrices $\mathcal{\hat{A}}^{\text{prob}}$ or $\mathcal{\hat{A}}^{\text{V}}$ to be learned in an end-to-end manner as well. Although the process that updates the adjacency matrix from $\mathcal{A}$ to $\mathcal{\hat{A}}$ (Eq. (19) of the main document) in the TTP module is not fully differentiable, to the best of our knowledge, existing GNNs do not forward the adjacency matrix within the network, i.e., Eq. \ref{eq:back-propagation} does not need to compute $\frac{\partial \text{TTP}(X, \mathcal{A}, \mathcal{V})}{\partial \mathcal{A}}$. Consequently, the proposed approach is a fully differentiable framework.

# Additional experimental results

## Detailed results of AU recognition task (non-graph vertex classification)

We report the detailed F1 and AUC results of the AU recognition task (i.e., non-graph-based vertex classification), which are achieved by our approach with different settings on BP4D and DISFA datasets. These are provided in Table 1, Table 2, Table 3, and Table 4. We also compare them in Figure 3 and Figure 4.

![f1_auc_BP4D](figures/f1_auc_BP4D.png)
*Figure 3: F1 scores and AUC results (\%) achieved for 12 AUs' recognition on BP4D dataset, where three competitors (SRERL, UGN-B and HMP-PS) are also graph-based methods.*

![f1_auc_DISFA](figures/f1_auc_DISFA.png)
*Figure 4: F1 scores and AUC results (\%) achieved for 8 AUs' recognition on DISFA dataset, where three competitors (SRERL, UGN-B and HMP-PS) are also graph-based methods.*

**Table 1: F1 scores (in %) achieved for 12 AUs on BP4D dataset. The best, second best, and third best results of each column are indicated with brackets and bold font, brackets alone, and underline, respectively.**

| Method |  AU1  |  AU2  |  AU4  |  AU6  |  AU7  |  AU10  |  AU12  |  AU14  |  AU15  |  AU17  |  AU23  |  AU24  |  Avg.  |
|--------|-------|-------|-------|-------|-------|--------|--------|--------|--------|--------|--------|--------|--------|
| DRML  | 36.4  | 41.8  | 43.0  | 55.0  | 67.0  | 66.3  | 65.8  | 54.1  | 33.2  | 48.0  | 31.7  | 30.0  | 48.3  |
| EAC-Net | 39.0  | 35.2  | 48.6  | 76.1  | 72.9  | 81.9  | 86.2  | 58.8  | 37.5  | 59.1  | 35.9  | 35.8  | 55.9  |
| JAA-Net  | 47.2  | 44.0  | 54.9  | 77.5  | 74.6  | 84.0  | 86.9  | 61.9  | 43.6  | 60.3  | 42.7  | 41.9  | 60.0  |
| LP-Net  | 43.4  | 38.0  | 54.2  | 77.1  | 76.7  | 83.8  | 87.2  | 63.3  | 45.3  | 60.5  | 48.1  | 54.2  | 61.0  |
| ARL  | 45.8  | 39.8  | 55.1  | 75.7  | 77.2  | 82.3  | 86.6  | 58.8  | 47.6  | 62.1  | 47.4  | 55.4  | 61.1  |
| SEV-Net | **[58.2]** | **[50.4]** | 58.3  | **[81.9]** | 73.9  | **[87.8]** | 87.5  | 61.6  | <u>52.6</u> | 62.2  | 44.6  | 47.6  | 63.9  |
| FAUDT | 51.7  | [49.3] | [61.0] | 77.8  | <u>79.5</u> | 82.9  | 86.3  | [67.6] | 51.9  | 63.0  | 43.7  | **[56.3]** | <u>64.2</u>  |
| SRERL | 46.9  | 45.3  | 55.6  | 77.1  | 78.4  | 83.5  | <u>87.6</u> | 63.9  | 52.2  | <u>63.9</u> | 47.1  | 53.3  | 62.9  |
| UGN-B | [54.2] | <u>46.4</u> | 56.8  | 76.2  | 76.7  | 82.4  | 86.1  | 64.7  | 51.2  | 63.1  | 48.5  | 53.6  | 63.3  |
| HMP-PS | 53.1  | 46.1  | 56.0  | 76.5  | 76.9  | 82.1  | 86.4  | 64.8  | 51.5  | 63.0  | [49.9] | 54.5  | 63.4  |
| Ours (ResNet) | 47.1  | 45.4  | <u>59.5</u> | <u>79.5</u> | [79.6] | <u>84.9</u> | [87.8] | <u>67.1</u> | [53.7] | [64.2] | **[52.4]** | [55.9]  | [64.7]  |
| Ours (SwinB) | <u>53.2</u> | 44.0  | **[61.6]** | [79.6] | **[80.0]** | [85.2] | **[89.4]** | **[69.6]** | **[54.6]** | **[64.4]** | <u>49.7</u> | <u>55.7</u> | **[65.6]**  |


**Table 2: AUC results achieved for 12 AUs on the BP4D dataset.**

|  AU  |  DRML  |  SRERL  |  Ours (Res-GatedGCN)  |  Ours (Swin-GatedGCN)  |  Ours (Res-GAT)  |  Ours (Swin-GAT)  |
|:----:|:------:|:------:|:---------------------:|:----------------------:|:----------------:|:-----------------:|
|  1   |  55.7  |  67.6  |         75.0          |        **77.7**        |       74.7       |       77.2        |
|  2   |  54.5  |  70.0  |       **78.0**        |         76.5           |       74.9       |       76.2        |
|  4   |  58.8  |  73.4  |         85.4          |       **86.5**         |       83.2       |       86.3        |
|  6   |  56.6  |  78.4  |         88.9          |       **89.2**         |       88.8       |       88.9        |
|  7   |  61.0  |  76.1  |       **84.0**        |         83.3           |       82.4       |       82.6        |
|  10  |  53.6  |  80.0  |       **87.4**        |         86.5           |       86.9       |       87.1        |
|  12  |  60.8  |  85.9  |         93.2          |       **94.0**         |       92.6       |       93.3        |
|  14  |  57.0  |  64.4  |         69.9          |       **73.1**         |       72.4       |       72.6        |
|  15  |  56.2  |  75.1  |         82.7          |       **84.6**         |       83.9       |       84.5        |
|  17  |  50.0  |  71.7  |       **79.1**        |         78.7           |       78.3       |       78.6        |
|  23  |  53.9  |  71.6  |         79.5          |       **80.8**         |       79.7       |       79.7        |
|  24  |  53.9  |  74.6  |       **87.8**        |         86.3           |       84.3       |       84.7        |
| **Avg.** |  56.0  |  74.1  |         82.6          |       **83.1**         |       81.8       |       82.6        |


**Table 3: F1 scores (in %) achieved for 8 AUs on DISFA dataset. The best, second best, and third best results of each column are indicated with brackets and bold font, brackets alone, and underline, respectively.**

| Method      | AU1         | AU2         | AU4        | AU6        | AU9         | AU12        | AU25       | AU26        | **Avg.**   |
|-------------|-------------|-------------|------------|------------|-------------|-------------|------------|-------------|------------|
| DRML        | 17.3        | 17.7        | 37.4       | 29.0       | 10.7        | 37.7        | 38.5       | 20.1        | 26.7       |
| EAC-Net     | 41.5        | 26.4        | 66.4       | 50.7       | **[80.5]**  | **[89.3]**  | 88.9       | 15.6        | 48.5       |
| JAA-Net     | 43.7        | 46.2        | 56.0       | 41.4       | 44.7        | 69.6        | 88.3       | 58.4        | 56.0       |
| LP-Net      | 29.9        | 24.7        | 72.7       | 46.8       | 49.6        | 72.9        |<u>93.8</u> | 65.0        | 56.9       |
| ARL         | 43.9        | 42.1        | 63.6       | 41.8       | 40.0        | 76.2        |   [95.2]   |   [66.8]    | 58.7       |
| SEV-Net     | **[55.3]**  | **[53.1]**  | 61.5       | 53.6       | 38.2        | 71.6        | **[95.7]** | 41.5        | 58.8       |
| FAUDT       | 46.1        |   [48.6]    | <u>72.8</u>| **[56.7]** | 50.0        | 72.1        | 90.8       | 55.4        |<u>61.5</u> |
| SRERL       | 45.7        | 47.8        | 59.6       | 47.1       | 45.6        | 73.5        | 84.3       | 43.6        | 55.9       |
| UGN-B       | 43.3        | <u>48.1</u> | 63.4       | 49.5       | 48.2        | 72.9        | 90.8       | 59.0        | 60.0       |
| HMP-PS      | 38.0        | 45.9        | 65.2       | 50.9       | <u>50.8</u> | <u>76.0</u> | 93.3       | **[67.6]**  | 61.0       |
| Ours (ResNet) |  [54.6]   | 47.1        |   [72.9]   |<u>54.0</u> |  [55.7]     |   [76.7]    | 91.1       | 53.0        |   [63.1]   |
| Ours (SwinB) | <u>54.1</u>| 46.7        | **[75.2]** |   [56.0]   | 49.4        |    75.9     | 93.5       |<u>65.0</u>  | **[64.5]** |


**Table 4: AUC results achieved for 8 AUs on the DISFA dataset.**

|  AU  |  DRML  |  SRERL  |  Ours (Res-GatedGCN)  |  Ours (Swin-GatedGCN)  |  Ours (Res-GAT)  |  Ours (Swin-GAT)  |
|:----:|:------:|:-------:|:--------------------:|:---------------------:|:---------------:|:----------------:|
|  1   |  53.3  |  76.2   |        90.0          |         87.1          |       89.5      |    **90.7**      |
|  2   |  53.2  |  80.9   |        88.5          |         88.8          |       88.8      |    **89.7**      |
|  4   |  60.0  |  79.1   |        94.2          |         92.4          |       94.3      |    **94.4**      |
|  6   |  54.9  |  80.4   |    **92.5**          |         91.7          |       92.1      |       90.1       |
|  9   |  51.5  |  76.5   |    **91.5**          |         91.4          |       91.2      |       90.0       |
|  12  |  54.6  |  87.9   |        95.9          |    **96.7**           |       96.0      |       95.0       |
|  25  |  45.6  |  90.9   |        99.1          |         99.1          |       99.0      |    **99.2**      |
|  26  |  45.3  |  73.4   |        91.2          |    **91.7**           |       90.7      |       90.6       |
|**Avg.**|  52.3  |  80.7   |    **92.9**        |         92.4          |       92.7      |       92.5       |



## Confusion matrices

Figure 5 provides the confusion matrices for the facial expression recognition task achieved on both FER2013 and RAF-DB datasets (i.e., non-graph data-based graph classification task). Meanwhile, Figure 6 and Figure 7 provide the confusion matrices for AU co-occurrence pattern recognition (i.e., non-graph data-based link/edge prediction task). In particular, we report the confusion matrices for the best fold of each of our best approaches on each dataset, namely the best ResNet-GAT system, the best ResNet-GatedGCN system, the best Swin-GAT system and the best Swin-GatedGCN system on both BP4D and DISFA. Regarding the classes, _both_ means that both AUs in the pair are active and _neither_ means that neither AU in the pair is active. Furthermore, each pair is considered only once, with its two AUs ordered by ascending index. For example, the AU pair consisting of AU1 (inner brow raiser) and AU24 (lip pressor) is considered once, as AU1\&AU24. Therefore, the class _first_ in the confusion matrix means that only the first element of the pair is active, in this example AU1 (inner brow raiser), whereas the class _second_ means that only the second element of the pair is active, in this case AU24 (lip pressor). Finally, the confusion matrices were normalized row-wise (each row through L1 normalization) for a better visualization of the percentage of correct and mistaken predictions for each class.
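The pair labelling and the row-wise L1 normalization described above can be sketched as follows. The class order in `CLASSES`, the helper names and the example counts are illustrative assumptions and are not taken from the reported experiments.

```python
# Assumed class order for the rows/columns of the confusion matrix.
CLASSES = ("both", "first", "second", "neither")


def pair_class(first_active, second_active):
    """Map the activation of an ordered AU pair (lower index first) to a class."""
    if first_active and second_active:
        return "both"
    if first_active:
        return "first"
    if second_active:
        return "second"
    return "neither"


def row_normalize(matrix):
    """L1-normalize each row so that it sums to 1 (rows of zeros stay zero)."""
    normalized = []
    for row in matrix:
        total = sum(abs(x) for x in row)
        normalized.append([x / total if total else 0.0 for x in row])
    return normalized


# For the pair AU1&AU24: only AU1 active -> "first", only AU24 active -> "second".
print(pair_class(True, False), pair_class(False, True))

# Made-up raw counts: rows are true classes, columns are predicted classes.
counts = [
    [8, 1, 1, 0],
    [2, 6, 0, 2],
    [0, 0, 5, 5],
    [1, 1, 2, 16],
]
normalized = row_normalize(counts)
for name, row in zip(CLASSES, normalized):
    print(name, row)
```

After normalization, each diagonal entry can be read directly as the fraction of samples of that class that were predicted correctly.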

![Conf_Matrix_fer](figures/combined_conf_fer.png)
*Figure 5: Confusion matrices of our best systems achieved for facial expression recognition tasks on FER2013 and RAF-DB datasets: (a) the best ResNet-GatedGCN model on FER2013 dataset; (b) the best Swin-GatedGCN model on FER2013 dataset; (c) the best ResNet-GatedGCN model on RAF-DB dataset; and (d) the best Swin-GatedGCN model on RAF-DB dataset.*

![Conf_Matrix_BP4D](figures/bp4d_combined_conf_mat.jpg)
*Figure 6: Row-wise normalized confusion matrices of our best systems achieved for BP4D dataset: (a) the best ResNet-GAT vertex-fed model; (b) the best ResNet-GatedGCN vertex-fed model; (c) the best Swin-GAT vertex-fed model; and (d) the best Swin-GatedGCN edge-fed model.*

![Conf_Matrix_DISFA](figures/disfa_combined_conf_mat.jpg)
*Figure 7: Row-wise normalized confusion matrices of our best systems achieved for DISFA dataset: (a) the best ResNet-GAT edge-fed model; (b) the best ResNet-GatedGCN edge-fed model; (c) the best Swin-GAT edge-fed model; and (d) the best Swin-GatedGCN vertex\&edge-fed model.*

# References

- Dwivedi, V. P., Joshi, C. K., Luu, A. T., Laurent, T., Bengio, Y., & Bresson, X. (2023). Benchmarking graph neural networks. *Journal of Machine Learning Research, 24*(43), 1-48.

- Song, T., Chen, L., Zheng, W., & Ji, Q. (2021, May). Uncertain graph neural networks for facial action unit detection. In *Proceedings of the AAAI Conference on Artificial Intelligence* (Vol. 35, No. 7, pp. 5993-6001).

- Isufi, E., Gama, F., & Ribeiro, A. (2021). Edgenets: Edge varying graph neural networks. *IEEE Transactions on Pattern Analysis and Machine Intelligence, 44*(11), 7457-7473.

- Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. *arXiv preprint* arXiv:1609.02907.

- Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2018). How powerful are graph neural networks?. *arXiv preprint* arXiv:1810.00826.

- Bresson, X., & Laurent, T. (2017). Residual gated graph convnets. *arXiv preprint* arXiv:1711.07553.

- Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. *arXiv preprint* arXiv:1710.10903.

- Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. *Advances in neural information processing systems, 30*.

- Gong, L., & Cheng, Q. (2019). Exploiting edge features for graph neural networks. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition* (pp. 9211-9219).

- Jo, J., Baek, J., Lee, S., Kim, D., Kang, M., & Hwang, S. J. (2021). Edge representation learning with hypergraphs. *Advances in Neural Information Processing Systems, 34*, 7534-7546.
