# Group Equivariant Self-Attention in Transformers:
## Attention, and Convolution, are not all you need...

Group self-attention is a generalization of the traditional self-attention mechanism that aims to exploit the underlying group structure in the input data. In many practical scenarios, the data possesses inherent symmetries and structures that can be modeled using group actions. The group self-attention mechanism leverages these symmetries to build more expressive and efficient attention mechanisms.

Lifting is a key concept in the development of group self-attention. It refers to extending functions defined on a quotient space (the homogeneous space $\mathcal{X}$) to functions defined on the original space (the group $G$). In the context of group self-attention, we lift functions defined on the homogeneous space, on which the group acts transitively, to functions defined on the group of transformations itself. By incorporating the group structure into the self-attention mechanism, we can better exploit the symmetries in the data.

The lifting process involves the following steps:

1. Identify a suitable homogeneous space $\mathcal{X} = G/H$, where $G$ is the group of interest and $H$ is a subgroup that stabilizes a chosen point.
2. Define a section $s: \mathcal{X} \to G$, a mapping that assigns a representative group element to each point in the homogeneous space $\mathcal{X}$.
3. Use the section to lift the function $f_{\mathcal{X}}: \mathcal{X} \to \mathbb{R}^d$ to a function $f_G: G \to \mathbb{R}^d$.

Once the functions are lifted to the group, the group self-attention mechanism can be constructed using the lifted functions and their interactions within the group structure. The key advantage of group self-attention is its ability to model and exploit symmetries and invariances in the data, leading to more robust and accurate representations. By incorporating the group structure in the self-attention mechanism, we can capture complex patterns in the data and build more expressive attention mechanisms that are well-suited for various applications in computer vision, natural language processing, and beyond.

A section is a mathematical construct used in the context of lifting functions defined on a quotient space (in our case, the homogeneous space $\mathcal{X}$) to functions defined on the original space (the group $G$). Specifically, a section is a mapping from the homogeneous space $\mathcal{X} = G/H$ to the group $G$ that assigns a representative group element to each point in $\mathcal{X}$.

1. **Definition of a section**: Formally, a section $s: \mathcal{X} \to G$ is a mapping such that for each $z \in \mathcal{X}$, the coset $s(z)H$ is equal to $z$. In other words, for every $z \in \mathcal{X}$, we have:

   $$
   s(z)H = z
   $$

   Note that sections are not unique, as there can be multiple group elements in the same coset that satisfy the condition.

2. **Properties of a section**: A section $s$ provides a way to lift elements from the homogeneous space $\mathcal{X}$ to the group $G$. As mentioned before, sections are not unique, and different choices of sections pick different representatives. However, any two sections $s_1, s_2: \mathcal{X} \to G$ agree up to an element of $H$ at every point:

   $$
   s_1(z) = s_2(z)\,h_z
   $$

   for some $h_z \in H$, which may depend on $z \in \mathcal{X}$, because $s_1(z)$ and $s_2(z)$ lie in the same coset $z$. This property ensures that a lifted function that is constant on each coset does not depend on the choice of section.

3. **Lifting using a section**: Once a section $s$ is defined, we can use it to lift the function $f_{\mathcal{X}}: \mathcal{X} \to \mathbb{R}^d$ to a function $f_G: G \to \mathbb{R}^d$. Every group element in the coset $x(i)$ can be written as $s(x(i))h'$ with $h' \in H$, which we abbreviate as the pair $(i, h')$. To define the lifted function $f_G$, we evaluate $f_{\mathcal{X}}$ on the coset of that group element:

   $$
   f_G(i, h') = f_{\mathcal{X}}(s(x(i))h'H) = f_{\mathcal{X}}(x(i))
   $$

   for all $(i, h') \in G$. The lifted function $f_G$ takes as input a pair $(i, h')$ with $i \in S$ and $h' \in H$ (also written $\mathscr{H}$ below), and it is constant along each coset, which makes it suitable as input to group self-attention mechanisms.

In summary, a section is a mapping from the homogeneous space $\mathcal{X}$ to the group $G$ that assigns a representative group element to each point in $\mathcal{X}$. A section is used to lift functions from the homogeneous space to the group, allowing the lifted functions to be used in group self-attention mechanisms. While sections are not unique, their properties ensure that the choice of section does not affect the overall structure of the lifted function.
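As a concrete illustration, the sketch below takes $G = \mathbb{Z}^2 \rtimes C_4$ (integer translations combined with 90-degree rotations), $H = C_4$ and $\mathcal{X} = G/H \cong \mathbb{Z}^2$. This choice of group and the helper names `rot`, `compose`, `act`, `section` and `twisted_section` are assumptions made for the example, not part of the text above. It checks that $s(z)H = z$ and that two sections differ by an element of $H$ that can vary from point to point.

```python
def rot(v, r):
    """Rotate the integer vector v by r * 90 degrees counter-clockwise."""
    x, y = v
    for _ in range(r % 4):
        x, y = -y, x
    return (x, y)

def compose(g1, g2):
    """Group product in Z^2 x| C4: (t1, r1)(t2, r2) = (t1 + R^r1 t2, r1 + r2)."""
    (t1, r1), (t2, r2) = g1, g2
    rt = rot(t2, r1)
    return ((t1[0] + rt[0], t1[1] + rt[1]), (r1 + r2) % 4)

def act(g, p):
    """Action of g = (t, r) on a point p of X: rotate, then translate."""
    t, r = g
    rp = rot(p, r)
    return (rp[0] + t[0], rp[1] + t[1])

X0 = (0, 0)                          # reference point; its stabilizer is H
H = [((0, 0), r) for r in range(4)]  # H = C4, the rotations about X0

def section(z):
    """Simplest section: the pure translation that moves X0 to z."""
    return (z, 0)

def twisted_section(z):
    """Another valid section: a different coset representative at each point."""
    return (z, (z[0] + z[1]) % 4)

points = [(a, b) for a in range(-2, 3) for b in range(-2, 3)]

# s(z)H = z: every element of the coset s(z)H sends the reference point to z.
coset_ok = all(act(compose(section(z), h), X0) == z for z in points for h in H)

# The two sections agree up to some h_z in H, and h_z changes with z.
related = all(
    any(compose(section(z), h) == twisted_section(z) for h in H) for z in points
)

print(coset_ok, related)
```

Both checks pass for this toy group; for continuous groups such as $\mathbb{R}^2 \rtimes SO(2)$ the same identities hold, but they would have to be checked up to floating-point tolerance.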

To lift a function $f \in L_{\mathbb{R}^d}(S)$ to a function of the form $f_G(i, h') \in L_{\mathbb{R}^d}(G)$, follow these steps:

1. **Define the coordinate function**: First, define a coordinate function $x: S \to \mathcal{X}$, which maps each element in $S$ to a point in the homogeneous space $\mathcal{X} = G/H$.

2. **Create the function $f_{\mathcal{X}}$**: Use the coordinate function $x$ to define a new function $f_{\mathcal{X}}: \mathcal{X} \to \mathbb{R}^d$ such that $f_{\mathcal{X}}(x(i)) = f(i)$ for all $i \in S$. This function is an intermediate step that links the original function $f$ to the lifted function $f_G$.

3. **Identify the stabilizer**: Determine the stabilizer subgroup $H \subseteq G$ for the homogeneous space $\mathcal{X} = G/H$. The stabilizer consists of all group elements $h \in G$ that fix a chosen reference point $x_0 \in \mathcal{X}$ under the group action.

4. **Define the section**: Define a section $s: \mathcal{X} \to G$, which is a mapping that assigns to each point $z \in \mathcal{X}$ a representative group element $s(z) \in G$ such that $s(z)H = z$ for all $z \in \mathcal{X}$. The section $s$ is not unique, but it provides a way to lift elements from $\mathcal{X}$ to $G$.

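5. **Lift the function $f_{\mathcal{X}}$ to $f_G$**: Define the lifted function $f_G: G \to \mathbb{R}^d$ by evaluating $f_{\mathcal{X}}$ on the coset of each group element, $f_G(i, h') = f_{\mathcal{X}}(s(x(i))h'H) = f_{\mathcal{X}}(x(i)) = f(i)$ for all $(i, h') \in G$. Here the pair $(i, h')$, with $i \in S$ and $h' \in \mathscr{H}$ (the stabilizer subgroup $H$), stands for the group element $s(x(i))h'$.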

To summarize, the lifting process involves defining a coordinate function $x$, creating an intermediate function $f_{\mathcal{X}}$, identifying the stabilizer subgroup $H$, defining a section $s$, and finally defining the lifted function $f_G$. The lifted function $f_G$ incorporates the group structure and is defined over the entire group $G$, allowing it to be used in group self-attention mechanisms.
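The five steps can be traced on a toy example. The sketch below again assumes $G = \mathbb{Z}^2 \rtimes C_4$ with $\mathcal{X} \cong \mathbb{Z}^2$; the index set, coordinates, feature values and the helper `lifted` are hypothetical and chosen only for illustration.

```python
def rot(v, r):
    """Rotate the integer vector v by r * 90 degrees counter-clockwise."""
    x, y = v
    for _ in range(r % 4):
        x, y = -y, x
    return (x, y)

def act(g, p):
    """Action of g = (t, r) on a point p of X: rotate, then translate."""
    t, r = g
    rp = rot(p, r)
    return (rp[0] + t[0], rp[1] + t[1])

X0 = (0, 0)  # reference point x_0

# Step 1: coordinate function x: S -> X, three indices placed on the grid Z^2.
x = {0: (0, 0), 1: (1, 0), 2: (0, 2)}
# Original features f: S -> R^d with d = 2.
f = {0: [1.0, 0.0], 1: [0.5, 0.5], 2: [0.0, 2.0]}

# Step 2: intermediate function f_X with f_X(x(i)) = f(i).
f_X = {x[i]: f[i] for i in x}

# Steps 3-4: H = C4 stabilizes X0, and s(z) = (z, 0) is a section.
def section(z):
    return (z, 0)

# Step 5: f_G(g) = f_X(g . X0), i.e. f_X evaluated on the coset gH.
def f_G(g):
    return f_X[act(g, X0)]

def lifted(i, h_prime):
    """f_G(i, h'), where (i, h') stands for the group element s(x(i)) h'."""
    t, r = section(x[i])
    return f_G((t, (r + h_prime) % 4))

# The lifted function is constant along each fibre: f_G(i, h') = f(i).
constant_on_fibres = all(lifted(i, h) == f[i] for i in x for h in range(4))
print(constant_on_fibres)
```

The check confirms that, in this construction, lifting copies each feature along the subgroup axis; any dependence on $h'$ has to come from elsewhere, for example from the positional encoding.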

Given a function $f \in L_{\mathbb{R}^d}(S)$, where $S$ is a finite set of indices, we define three functions:

1. Key function $\varphi_{key}: L_{\mathbb{R}^d}(S) \to L_{\mathbb{R}^{d_k}}(S)$
2. Query function $\varphi_{qry}: L_{\mathbb{R}^d}(S) \to L_{\mathbb{R}^{d_k}}(S)$
3. Value function $\varphi_{val}: L_{\mathbb{R}^d}(S) \to L_{\mathbb{R}^{d_v}}(S)$

These functions are applied pointwise to the features $f(i)$ to compute the self-attention for each element in $S$. The self-attention mechanism also incorporates a relative positional encoding $\rho: S \times S \to \mathbb{R}^{d_k}$ that encodes the relative positions between elements; it takes values in $\mathbb{R}^{d_k}$ so that it can be added to the keys.

The self-attention function $\alpha[f]: S \times S \to \mathbb{R}$ is defined as follows:

$$
\alpha[f](i, j) = \frac{\exp\left(\langle \varphi_{qry}(f(i)), \varphi_{key}(f(j)) + \rho(i, j) \rangle\right)}{\sum_{k \in S}\exp\left(\langle \varphi_{qry}(f(i)), \varphi_{key}(f(k)) + \rho(i, k) \rangle\right)}
$$
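A minimal sketch of this formula in plain Python, assuming that $\varphi_{qry}$ and $\varphi_{key}$ are linear maps given by the matrices `W_qry` and `W_key`. Those names, the toy features and the particular `rho` are assumptions for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def attention_weights(f, W_qry, W_key, rho):
    """alpha[i][j] = softmax over j of <W_qry f(i), W_key f(j) + rho(i, j)>."""
    n = len(f)
    q = [matvec(W_qry, fi) for fi in f]
    k = [matvec(W_key, fj) for fj in f]
    alpha = []
    for i in range(n):
        scores = [
            dot(q[i], [a + b for a, b in zip(k[j], rho(i, j))]) for j in range(n)
        ]
        m = max(scores)                      # subtract the max for stability
        e = [math.exp(s - m) for s in scores]
        z = sum(e)
        alpha.append([x / z for x in e])
    return alpha

# Toy input: |S| = 3 elements with d = d_k = 2.
f = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_qry = [[1.0, 0.0], [0.0, 1.0]]
W_key = [[0.5, 0.0], [0.0, 0.5]]

def rho(i, j):
    """Hypothetical relative positional encoding with values in R^{d_k}."""
    return [0.1 * (j - i), 0.0]

alpha = attention_weights(f, W_qry, W_key, rho)
for row in alpha:
    print([round(a, 3) for a in row])
```

Each row of `alpha` is a probability distribution over $j \in S$, which is what the normalization in the denominator guarantees.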

Now, we want to extend the domain of these functions from $S$ to the homogeneous space $\mathcal{X} = G/\mathscr{H}$. We have the coordinate function $x: S \to \mathcal{X}$ that maps elements of $S$ to $\mathcal{X}$, and we can define $f_{\mathcal{X}}: \mathcal{X} \to \mathbb{R}^d$ such that $f_{\mathcal{X}}(x(i)) = f(i)$. To extend the domain of the key, query, and value functions, we have:

1. Key function $\varphi_{key}: L_{\mathbb{R}^d}(\mathcal{X}) \to L_{\mathbb{R}^{d_k}}(\mathcal{X})$, where $\varphi_{key}(f_{\mathcal{X}}(x(i))) = \varphi_{key}(f(i))$
2. Query function $\varphi_{qry}: L_{\mathbb{R}^d}(\mathcal{X}) \to L_{\mathbb{R}^{d_k}}(\mathcal{X})$, where $\varphi_{qry}(f_{\mathcal{X}}(x(i))) = \varphi_{qry}(f(i))$
3. Value function $\varphi_{val}: L_{\mathbb{R}^d}(\mathcal{X}) \to L_{\mathbb{R}^{d_v}}(\mathcal{X})$, where $\varphi_{val}(f_{\mathcal{X}}(x(i))) = \varphi_{val}(f(i))$


In the context of self-attention, the query, key, and value functions play essential roles in transforming the input features. Each of these functions has a specific domain and codomain, as described below:

1. Query function ($\varphi_{qry}$):

- Domain: The input space of the query function is the feature space associated with the elements of the input set, denoted as $L_{\mathbb{R}^d}(S)$ for the original self-attention, and $L_{\mathbb{R}^d}(\mathcal{X})$ when extended to the domain $\mathcal{X}$. In the case of group self-attention, the domain is $L_{\mathbb{R}^d}(G)$.

- Codomain: The output space of the query function is a transformed feature space, typically of dimension $d_k$. It is denoted as $L_{\mathbb{R}^{d_k}}(S)$ for the original self-attention, $L_{\mathbb{R}^{d_k}}(\mathcal{X})$ when extended to the domain $\mathcal{X}$, and $L_{\mathbb{R}^{d_k}}(G)$ for group self-attention.

2. Key function ($\varphi_{key}$):

- Domain: Similar to the query function, the input space of the key function is the feature space associated with the elements of the input set, denoted as $L_{\mathbb{R}^d}(S)$ for the original self-attention, $L_{\mathbb{R}^d}(\mathcal{X})$ when extended to the domain $\mathcal{X}$, and $L_{\mathbb{R}^d}(G)$ for group self-attention.

- Codomain: The output space of the key function is also a transformed feature space, typically of dimension $d_k$. It is denoted as $L_{\mathbb{R}^{d_k}}(S)$ for the original self-attention, $L_{\mathbb{R}^{d_k}}(\mathcal{X})$ when extended to the domain $\mathcal{X}$, and $L_{\mathbb{R}^{d_k}}(G)$ for group self-attention.

3. Value function ($\varphi_{val}$):

- Domain: The input space of the value function is the same as that of the query and key functions: $L_{\mathbb{R}^d}(S)$ for the original self-attention, $L_{\mathbb{R}^d}(\mathcal{X})$ when extended to the domain $\mathcal{X}$, and $L_{\mathbb{R}^d}(G)$ for group self-attention.

- Codomain: The output space of the value function is a transformed feature space, typically of dimension $d_v$. It is denoted as $L_{\mathbb{R}^{d_v}}(S)$ for the original self-attention, $L_{\mathbb{R}^{d_v}}(\mathcal{X})$ when extended to the domain $\mathcal{X}$, and $L_{\mathbb{R}^{d_v}}(G)$ for group self-attention.

These functions act on the input features and transform them into a suitable representation for computing self-attention, which allows the model to capture and utilize dependencies between different input elements.

The notation $L_{\mathbb{R}^d}(S)$, $L_{\mathbb{R}^d}(\mathcal{X})$, and $L_{\mathbb{R}^d}(G)$ represent function spaces of functions mapping from the respective domain to the feature space $\mathbb{R}^d$. Here, $d$ is the dimension of the feature space.

1. $L_{\mathbb{R}^d}(S)$: This represents the space of functions that map from the input set $S$ to the $d$-dimensional feature space $\mathbb{R}^d$. In the context of self-attention, $S$ typically represents a sequence or a set of elements (e.g., words in a sentence, pixels in an image, or nodes in a graph), and the function maps each element in the set to a $d$-dimensional feature vector.

2. $L_{\mathbb{R}^d}(\mathcal{X})$: This represents the space of functions that map from the homogeneous space $\mathcal{X}$ to the $d$-dimensional feature space $\mathbb{R}^d$. The homogeneous space $\mathcal{X}$ is formed by the quotient space $G/\mathscr{H}$, where $G$ is a group (e.g., $\mathbb{R}^2 \rtimes \mathscr{H}$) and $\mathscr{H}$ is a subgroup (e.g., $SO(2)$; for three-dimensional data one would instead take $\mathbb{R}^3 \rtimes SO(3)$). The functions in this space map each element in the homogeneous space $\mathcal{X}$ to a $d$-dimensional feature vector.

3. $L_{\mathbb{R}^d}(G)$: This represents the space of functions that map from the group $G$ to the $d$-dimensional feature space $\mathbb{R}^d$. In the context of group self-attention, $G$ is a group acting on the input set (e.g., translation or rotation group), and the functions in this space map each element in the group $G$ to a $d$-dimensional feature vector.

For the codomains, the notations $L_{\mathbb{R}^{d_k}}(S)$, $L_{\mathbb{R}^{d_k}}(\mathcal{X})$, $L_{\mathbb{R}^{d_k}}(G)$, $L_{\mathbb{R}^{d_v}}(S)$, $L_{\mathbb{R}^{d_v}}(\mathcal{X})$, and $L_{\mathbb{R}^{d_v}}(G)$ are similar but refer to the output spaces of the respective query, key, and value functions. The output spaces are also feature spaces but with different dimensions, such as $d_k$ for the query and key functions and $d_v$ for the value function.


Next, we need to extend the relative positional encoding $\rho: S \times S \to \mathbb{R}^{d_k}$ to the domain $\mathcal{X} \times \mathcal{X}$. When $\mathcal{X}$ is a vector space such as $\mathbb{R}^2$, differences of coordinates are well defined, and we define a new function $\rho^P: \mathcal{X} \to \mathbb{R}^{d_k}$ such that $\rho^P(x(j) - x(i)) = \rho(i, j)$, so the encoding depends only on the relative position of the two points.

Now, we can define the extended self-attention function $\alpha[f_{\mathcal{X}}]: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ as follows:

$$
\alpha[f_{\mathcal{X}}](x(i), x(j)) = \frac{\exp\left(\langle \varphi_{qry}(f_{\mathcal{X}}(x(i))), \varphi_{key}(f_{\mathcal{X}}(x(j))) + \rho^P(x(j)-x(i)) \rangle\right)}{\sum_{x(k) \in \mathcal{X}}\exp\left(\langle \varphi_{qry}(f_{\mathcal{X}}(x(i))), \varphi_{key}(f_{\mathcal{X}}(x(k))) + \rho^P(x(k)-x(i)) \rangle\right)}
$$

By extending the domain of the key, query, and value functions and the relative positional encoding to $\mathcal{X}$, we have successfully extended the self-attention mechanism from a finite set $S$ to a homogeneous space $\mathcal{X}$. This lays the groundwork for further generalizing the self-attention mechanism to incorporate the structure of the group $G$ and develop group self-attention using liftings.
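One consequence is worth spelling out. Because $\rho^P$ only sees differences of coordinates, a common shift of all positions leaves the positional term unchanged. For any translation $y$ of the coordinates (a symbol introduced here for illustration):

$$
\rho^P\big((x(j) + y) - (x(i) + y)\big) = \rho^P\big(x(j) - x(i)\big) = \rho(i, j)
$$

The attention weights $\alpha[f_{\mathcal{X}}]$ therefore respond to a translated input only through the features, not through the positional term. A rotation, by contrast, does change $x(j) - x(i)$, which is the case the lifting to $G$ is meant to handle.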

To lift self-attention to group self-attention, we first need to lift the function $f_{\mathcal{X}}: G/H \to \mathbb{R}^d$ to a function $f_G: G \to \mathbb{R}^d$. We will achieve this by defining a function $f(i, h') \in L_{\mathbb{R}^d}(G)$, where $i \in S$ and $h' \in \mathscr{H}$. 

The idea behind the lifting is to exploit the structure of the group $G$ and its action on the homogeneous space $\mathcal{X}$. To do this, we introduce the concept of a section, which is a continuous map $s: \mathcal{X} \to G$ that maps each point $x(i)$ in $\mathcal{X}$ to a representative group element $s(x(i)) \in G$. For each $x(i) \in \mathcal{X}$, we have a unique coset $s(x(i))\mathscr{H}$. 

With this section, we can define the lifted function $f_G: G \to \mathbb{R}^d$ as follows:

$$
f_G(i, h') = f(i) \quad \text{if} \quad (i, h') \in s(x(i))\mathscr{H}
$$

Now that we have lifted $f_{\mathcal{X}}$ to $f_G$, we can proceed to define group self-attention. We will employ the following notation:

$$
\mathscr{L}_h [\rho]((i, h_1), (j, h_2)) = \rho^P(h^{-1}(x(j) - x(i)), h^{-1}(h_1^{-1} h_2))
$$

Here, $\mathscr{L}_h [\rho]$ is the lifted version of the relative positional encoding function $\rho$, obtained by letting $h \in \mathscr{H}$ act on it (the proof below writes it as $L_h[\rho]$), and $\rho^P$ now takes a relative position and a relative group element as its two arguments. With the lifted functions $f_G$ and $\mathscr{L}_h [\rho]$, we can define the group self-attention function $\alpha[f_G]: G \times G \to \mathbb{R}$ as:

$$
\alpha[f_G]((i, h_1), (j, h_2)) = \frac{\exp\left(\langle \varphi_{qry}(f_G(i, h_1)), \varphi_{key}(f_G(j, h_2)) + \mathscr{L}_h [\rho]((i, h_1), (j, h_2)) \rangle\right)}{\sum_{(k, h_3) \in N(i, h_1)\subset G}\exp\left(\langle \varphi_{qry}(f_G(i, h_1)), \varphi_{key}(f_G(k, h_3)) + \mathscr{L}_h [\rho]((i, h_1), (k, h_3)) \rangle\right)}
$$

This group self-attention function incorporates the structure of the group $G$ and its action on the homogeneous space $\mathcal{X}$, thereby providing a more general and powerful attention mechanism that can handle data with underlying geometric and topological structures.
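The lifted positional encoding that enters the attention above can be sketched for the discrete rotation group $C_4$, with rotations stored as integers modulo 4. The choice of $C_4$, the helper names `rot` and `lifted_rho`, the toy `rho_p`, and the reading of $h^{-1}(h_1^{-1}h_2)$ as left multiplication by $h^{-1}$ are assumptions of this example.

```python
def rot(v, r):
    """Rotate the integer vector v by r * 90 degrees counter-clockwise."""
    x, y = v
    for _ in range(r % 4):
        x, y = -y, x
    return (x, y)

def lifted_rho(rho_p, h, x_i, h1, x_j, h2):
    """L_h[rho]((i, h1), (j, h2)) = rho^P(h^{-1}(x(j) - x(i)), h^{-1}(h1^{-1} h2))."""
    diff = (x_j[0] - x_i[0], x_j[1] - x_i[1])
    rel_pos = rot(diff, -h)       # h^{-1} applied to the relative position
    rel_rot = (h2 - h1 - h) % 4   # h^{-1} h1^{-1} h2, written additively for C4
    return rho_p(rel_pos, rel_rot)

def rho_p(rel_pos, rel_rot):
    """Hypothetical encoding with d_k = 3 that simply exposes its arguments."""
    return [float(rel_pos[0]), float(rel_pos[1]), float(rel_rot)]

print(lifted_rho(rho_p, 0, (0, 0), 0, (1, 0), 2))  # [1.0, 0.0, 2.0]
print(lifted_rho(rho_p, 1, (0, 0), 0, (1, 0), 2))  # [0.0, -1.0, 1.0]
```

With $h = 0$ (the identity) the encoding sees the raw relative position and relative rotation; with $h = 1$ both arguments are rotated back by 90 degrees before $\rho^P$ is evaluated.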

To generalize the group self-attention to multihead self-attention, we will incorporate multiple attention heads, each with its own key, query, and value functions. This allows the model to capture different aspects of the input data by combining the attention weights from each head.

Let $H$ be the total number of attention heads, with $[H] = \{1, \dots, H\}$ indexing them (not to be confused with the subgroup, which is written $\mathscr{H}$ in the formulas below). We denote the key, query, and value functions for each head as $\varphi_{key}^{head}, \varphi_{qry}^{head},$ and $\varphi_{val}^{head}$, respectively. Now, we can define the group self-attention function $\alpha^{head}[f_G]: G \times G \to \mathbb{R}$ for each head as:

$$
\alpha^{head}[f_G]((i, h_1), (j, h_2)) = \frac{\exp\left(\langle \varphi_{qry}^{head}(f_G(i, h_1)), \varphi_{key}^{head}(f_G(j, h_2)) + \mathscr{L}_h [\rho]((i, h_1), (j, h_2)) \rangle\right)}{\sum_{(k, h_3) \in N(i, h_1) \subset G}\exp\left(\langle \varphi_{qry}^{head}(f_G(i, h_1)), \varphi_{key}^{head}(f_G(k, h_3)) + \mathscr{L}_h [\rho]((i, h_1), (k, h_3)) \rangle\right)}
$$

For each head, we compute the corresponding attention weights and apply them to the value function, yielding a weighted sum of the value vectors. We then concatenate the results from all heads and apply an output function $\varphi_{out}$:

$$
m_G^r[f, \rho](i, h) = \varphi_{out}\left( \bigcup_{head \in [H]} \sum_{h_1 \in \mathscr{H}} \sum_{(j, h_2) \in N(i,h_1)} \alpha^{head}[f_G]((i, h_1), (j, h_2)) \varphi_{val}^{head}(f_G(j, h_2)) \right)
$$

This multihead group self-attention mechanism captures multiple relationships in the input data, allowing the model to better understand and represent the underlying structure. By incorporating the structure of the group $G$ and its action on the homogeneous space $\mathcal{X}$, multihead group self-attention provides a powerful and flexible way to model data with geometric and topological structures.

---

The following is a step-by-step proof of the $G$-equivariance of group self-attention:

1. Recall the $G$-equivariance property:

$$
m_G^r[L_g[f], \rho](i, h) = L_g[m_G^r[f, \rho]](i, h).
$$

2. Consider a $g$-transformed input signal:

$$
L_g[f](i, \tilde{h}) = L_y L_{\bar{h}}[f](i, \tilde{h}) = f(x^{-1}(\bar{h}^{-1}(x(i) - y)), \bar{h}^{-1} \tilde{h}), \quad g = (y, \bar{h}), y \in \mathbb{R}^d, \bar{h} \in \mathscr{H}.
$$

3. Compute the group self-attention operation on $L_g[f]$, writing $\sigma_{j,\hat{h}}$ for the softmax over $(j, \hat{h}) \in N(i, \tilde{h})$:

$$
m_G^r[L_g[f], \rho](i,h) = \varphi_{out}\Big(\bigcup_{head \in [H]}\sum_{\tilde{h}\in \mathscr{H}}\sum_{(j,\hat{h})\in N(i, \tilde{h})}\sigma_{j,\hat{h}}\big(\langle\varphi^{head}_{qry}(f(x^{-1}(\bar{h}^{-1}(x(i)-y)), \bar{h}^{-1}\tilde{h})),\ \varphi^{head}_{key}(f(x^{-1}(\bar{h}^{-1}(x(j)-y)), \bar{h}^{-1}\hat{h})) + L_h[\rho]((i, \tilde{h}),(j, \hat{h}))\rangle\big)\,\varphi^{head}_{val}(f(x^{-1}(\bar{h}^{-1}(x(j)-y)), \bar{h}^{-1}\hat{h}))\Big).
$$

4. Substitute $\bar{i} = x^{-1}(\bar{h}^{-1}(x(i) - y))$, $\bar{j} = x^{-1}(\bar{h}^{-1}(x(j) - y))$, $\tilde{h}' = \bar{h}^{-1}\tilde{h}$ and $\hat{h}' = \bar{h}^{-1}\hat{h}$, so that $i = x^{-1}(\bar{h}x(\bar{i}) + y)$ and $j = x^{-1}(\bar{h}x(\bar{j}) + y)$:

$$
m_G^r[L_g[f], \rho](i,h) = \varphi_{out}\Big(\bigcup_{head \in [H]}\sum_{\tilde{h}\in \mathscr{H}}\sum_{(j,\hat{h})\in N(i, \tilde{h})}\sigma_{j,\hat{h}}\big(\langle\varphi^{head}_{qry}(f(\bar{i}, \tilde{h}')),\ \varphi^{head}_{key}(f(\bar{j}, \hat{h}')) + L_h[\rho]((x^{-1}(\bar{h}x(\bar{i}) + y), \bar{h}\tilde{h}'),(x^{-1}(\bar{h}x(\bar{j}) + y), \bar{h}\hat{h}'))\rangle\big)\,\varphi^{head}_{val}(f(\bar{j}, \hat{h}'))\Big).
$$

5. By the definition of the lifted positional encoding and the properties of the group action, the positional term equals $L_{\bar{h}^{-1}h}[\rho]((\bar{i}, \tilde{h}'),(\bar{j}, \hat{h}'))$, which gives:

$$
m_G^r[L_g[f], \rho](i,h) = \varphi_{out}\Big(\bigcup_{head \in [H]}\sum_{\tilde{h}\in \mathscr{H}}\sum_{(j,\hat{h})\in N(i, \tilde{h})}\sigma_{j,\hat{h}}\big(\langle\varphi^{head}_{qry}(f(\bar{i}, \tilde{h}')),\ \varphi^{head}_{key}(f(\bar{j}, \hat{h}')) + L_{\bar{h}^{-1}h}[\rho]((\bar{i}, \tilde{h}'),(\bar{j}, \hat{h}'))\rangle\big)\,\varphi^{head}_{val}(f(\bar{j}, \hat{h}'))\Big).
$$

6. For unimodular groups, the area of summation remains equal under any transformation $g \in G$:

$$
\sum_{(j, \hat{h}) \in N(i, \tilde{h})} [\cdot] = \sum_{(\bar{j}, \hat{h}') \in N(\bar{i}, \tilde{h}')} [\cdot].
$$

7. Consequently, the sums can be taken over the transformed indices, with $\tilde{h}' = \bar{h}^{-1}\tilde{h}$ ranging over $\mathscr{H}$:

$$
m_G^r[L_g[f], \rho](i,h) = \varphi_{out}\Big(\bigcup_{head \in [H]}\sum_{\tilde{h}'\in \mathscr{H}}\sum_{(\bar{j},\hat{h}')\in N(\bar{i}, \tilde{h}')}\sigma_{\bar{j},\hat{h}'}\big(\langle\varphi^{head}_{qry}(f(\bar{i}, \tilde{h}')),\ \varphi^{head}_{key}(f(\bar{j}, \hat{h}')) + L_{\bar{h}^{-1}h}[\rho]((\bar{i}, \tilde{h}'),(\bar{j}, \hat{h}'))\rangle\big)\,\varphi^{head}_{val}(f(\bar{j}, \hat{h}'))\Big).
$$

8. We can observe that:

$$
m_G^r[L_g[f], \rho](i,h) = m_G^r[f, \rho](\bar{i}, \bar{h}^{-1}h) = m_G^r[f, \rho](x^{-1}(\bar{h}^{-1}(x(i) - y)), \bar{h}^{-1}h) = L_g[m_G^r[f, \rho]](i,h).
$$

This demonstrates that $m_G^r[L_g[f], \rho](i,h) = L_g[m_G^r[f, \rho]](i,h)$, which shows that the group self-attention operation is indeed $G$-equivariant. The equivariance property arises from the fact that $L_g[\rho]((i, \tilde{h}),(j, \hat{h})) = \rho((i, \tilde{h}),(j, \hat{h}))$ for all $g \in G$. In other words, it comes from the fact that the positional encoding used is invariant to the action of elements $g \in G$.
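The equivariance property can also be checked numerically on a simplified case. The sketch below covers only a single-head lifting layer (input on $S$, output on $S \times C_4$) with scalar features, a global neighborhood and rotations by multiples of 90 degrees. The grid, the weights `q`, `k`, `w` and the encoding `rho_p` are hypothetical choices, so this is a sanity check under those assumptions rather than a verification of the general proof.

```python
import math

def rot(v, r):
    """Rotate the integer vector v by r * 90 degrees counter-clockwise."""
    x, y = v
    for _ in range(r % 4):
        x, y = -y, x
    return (x, y)

# Index set S: a 3x3 grid centred at the origin, closed under 90-degree rotations.
POINTS = [(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)]

def rho_p(v):
    """Hypothetical scalar positional encoding; deliberately not rotation invariant."""
    return 0.3 * v[0] - 0.2 * v[1] + 0.1 * v[0] * v[1]

def lifting_attention(f, q=0.7, k=-0.4, w=1.3):
    """Return out[(x_i, h)] for scalar features f: point -> float."""
    out = {}
    for h in range(4):
        for xi in POINTS:
            scores = []
            for xj in POINTS:
                diff = (xj[0] - xi[0], xj[1] - xi[1])
                pos = rho_p(rot(diff, -h))  # L_h[rho](i, j) = rho^P(h^{-1}(x(j) - x(i)))
                scores.append(q * f[xi] * (k * f[xj] + pos))
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            out[(xi, h)] = sum(wt / z * w * f[xj] for wt, xj in zip(weights, POINTS))
    return out

# Arbitrary input signal and its rotated copy L_{h_bar}[f](x) = f(h_bar^{-1} x).
f = {p: 0.5 * p[0] - 0.25 * p[1] + 0.1 * n for n, p in enumerate(POINTS)}
h_bar = 1
f_rot = {p: f[rot(p, -h_bar)] for p in POINTS}

out = lifting_attention(f)
out_rot = lifting_attention(f_rot)

# Equivariance: m[L_g f](x, h) = m[f](h_bar^{-1} x, h_bar^{-1} h).
max_err = max(
    abs(out_rot[(p, h)] - out[(rot(p, -h_bar), (h - h_bar) % 4)])
    for p in POINTS
    for h in range(4)
)
print(max_err < 1e-9)
```

Rotating the input permutes the spatial positions and cyclically shifts the $h$ axis of the output, which is the behaviour expressed by $L_g[m_G^r[f, \rho]](i, h)$ in step 8.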

---

If the multihead group self-attention formulation above is $G$-equivariant, then
it must hold that $m_G^r[L_g[f], \rho](i,h) = L_g[m_G^r[f, \rho]](i,h)$. Consider a $g$-transformed input signal
$L_g[f](i, \tilde{h}) = L_y L_{\bar{h}}[f](i, \tilde{h}) = f(x^{-1}(\bar{h}^{-1}(x(i) - y)), \bar{h}^{-1} \tilde{h})$, $g = (y, \bar{h})$, $y \in \mathbb{R}^d$, $\bar{h} \in \mathscr{H}$. The group
self-attention operation on $L_g[f]$ is given by:

$$
\begin{aligned}
m_G^r[L_y L_{\bar{h}}[f], \rho](i,h)
&= \varphi_{out}\Bigg( \bigcup_{head \in [H]} \sum_{\tilde{h} \in \mathscr{H}} \sum_{(j,\hat{h}) \in N(i, \tilde{h})} \\
&\quad \frac{\exp\left(\langle \varphi^{head}_{qry} (L_y L_{\bar{h}}[f](i, \tilde{h})),\ \varphi^{head}_{key} (L_y L_{\bar{h}}[f](j, \hat{h})) + L_h [\rho]((i, \tilde{h}),(j, \hat{h}))\rangle\right)}{\sum_{(k, \check{h}) \in N(i, \tilde{h})} \exp\left(\langle \varphi^{head}_{qry} (L_y L_{\bar{h}}[f](i, \tilde{h})),\ \varphi^{head}_{key} (L_y L_{\bar{h}}[f](k, \check{h})) + L_h [\rho]((i, \tilde{h}),(k, \check{h}))\rangle\right)} \\
&\quad \times \varphi^{head}_{val} (L_y L_{\bar{h}}[f](j, \hat{h})) \Bigg) \\
&= \varphi_{out}\Bigg( \bigcup_{head \in [H]} \sum_{\tilde{h} \in \mathscr{H}} \sum_{(j,\hat{h}) \in N(i, \tilde{h})} \\
&\quad \frac{\exp\left(\langle \varphi^{head}_{qry} (f(x^{-1}(\bar{h}^{-1}(x(i) - y)), \bar{h}^{-1} \tilde{h})),\ \varphi^{head}_{key} (f(x^{-1}(\bar{h}^{-1}(x(j) - y)), \bar{h}^{-1} \hat{h})) + L_h [\rho]((i, \tilde{h}),(j, \hat{h}))\rangle\right)}{\sum_{(k, \check{h}) \in N(i, \tilde{h})} \exp\left(\langle \varphi^{head}_{qry} (f(x^{-1}(\bar{h}^{-1}(x(i) - y)), \bar{h}^{-1} \tilde{h})),\ \varphi^{head}_{key} (f(x^{-1}(\bar{h}^{-1}(x(k) - y)), \bar{h}^{-1} \check{h})) + L_h [\rho]((i, \tilde{h}),(k, \check{h}))\rangle\right)} \\
&\quad \times \varphi^{head}_{val} (f(x^{-1}(\bar{h}^{-1}(x(j) - y)), \bar{h}^{-1} \hat{h})) \Bigg) \\
&= \varphi_{out}\Bigg( \bigcup_{head \in [H]} \sum_{\tilde{h} \in \mathscr{H}} \sum_{(x^{-1}(\bar{h}x(\bar{j})+y),\bar{h}\hat{h}') \in N(x^{-1}(\bar{h}x(\bar{i})+y),\bar{h}\tilde{h}')} \\
&\quad \frac{\exp\left(\langle \varphi^{head}_{qry} (f(\bar{i}, \tilde{h}')),\ \varphi^{head}_{key} (f(\bar{j}, \hat{h}')) + L_h [\rho]((x^{-1}(\bar{h}x(\bar{i}) + y), \bar{h} \tilde{h}'),(x^{-1}(\bar{h}x(\bar{j}) + y), \bar{h} \hat{h}'))\rangle\right)}{\sum_{(x^{-1}(\bar{h}x(\bar{k})+y),\bar{h}\check{h}') \in N(x^{-1}(\bar{h}x(\bar{i})+y),\bar{h}\tilde{h}')} \exp\left(\langle \varphi^{head}_{qry} (f(\bar{i}, \tilde{h}')),\ \varphi^{head}_{key} (f(\bar{k}, \check{h}')) + L_h [\rho]((x^{-1}(\bar{h}x(\bar{i}) + y), \bar{h} \tilde{h}'),(x^{-1}(\bar{h}x(\bar{k}) + y), \bar{h} \check{h}'))\rangle\right)} \\
&\quad \times \varphi^{head}_{val} (f(\bar{j}, \hat{h}')) \Bigg)
\end{aligned}
$$

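Here we have used the substitutions $\bar{i} = x^{-1}(\bar{h}^{-1}(x(i) - y)) \Rightarrow i = x^{-1}(\bar{h}x(\bar{i}) + y)$, $\tilde{h}' = \bar{h}^{-1} \tilde{h}$,
and $\bar{j} = x^{-1}(\bar{h}^{-1}(x(j) - y)) \Rightarrow j = x^{-1}(\bar{h}x(\bar{j}) + y)$, $\hat{h}' = \bar{h}^{-1} \hat{h}$, and analogously for the summation index of the softmax normalizer. Under these substitutions, $x(j) - x(i) = \bar{h}(x(\bar{j}) - x(\bar{i}))$ and $\tilde{h}^{-1}\hat{h} = \tilde{h}'^{-1}\hat{h}'$. By using the definition of
$L_h[\rho]((i, \tilde{h}),(j, \hat{h}))$ we can further reduce the expression above as: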

$$
\begin{aligned}
m_G^r[L_y L_{\bar{h}}[f], \rho](i,h)
&= \varphi_{out}\Bigg( \bigcup_{head \in [H]} \sum_{\bar{h}\tilde{h}' \in \mathscr{H}} \sum_{(x^{-1}(\bar{h}x(\bar{j})+y),\bar{h}\hat{h}') \in N(x^{-1}(\bar{h}x(\bar{i})+y),\bar{h}\tilde{h}')} \\
&\quad \frac{\exp\left(\langle \varphi^{head}_{qry} (f(\bar{i}, \tilde{h}')),\ \varphi^{head}_{key} (f(\bar{j}, \hat{h}')) + \rho^P(h^{-1}\bar{h}(x(\bar{j}) - x(\bar{i}), \tilde{h}'^{-1} \hat{h}'))\rangle\right)}{\sum_{(x^{-1}(\bar{h}x(\bar{k})+y),\bar{h}\check{h}') \in N(x^{-1}(\bar{h}x(\bar{i})+y),\bar{h}\tilde{h}')} \exp\left(\langle \varphi^{head}_{qry} (f(\bar{i}, \tilde{h}')),\ \varphi^{head}_{key} (f(\bar{k}, \check{h}')) + \rho^P(h^{-1}\bar{h}(x(\bar{k}) - x(\bar{i}), \tilde{h}'^{-1} \check{h}'))\rangle\right)} \\
&\quad \times \varphi^{head}_{val} (f(\bar{j}, \hat{h}')) \Bigg) \\
&= \varphi_{out}\Bigg( \bigcup_{head \in [H]} \sum_{\bar{h}\tilde{h}' \in \mathscr{H}} \sum_{(x^{-1}(\bar{h}x(\bar{j})+y),\bar{h}\hat{h}') \in N(x^{-1}(\bar{h}x(\bar{i})+y),\bar{h}\tilde{h}')} \\
&\quad \frac{\exp\left(\langle \varphi^{head}_{qry} (f(\bar{i}, \tilde{h}')),\ \varphi^{head}_{key} (f(\bar{j}, \hat{h}')) + L_{\bar{h}^{-1}h}[\rho]((\bar{i}, \tilde{h}'),(\bar{j}, \hat{h}'))\rangle\right)}{\sum_{(x^{-1}(\bar{h}x(\bar{k})+y),\bar{h}\check{h}') \in N(x^{-1}(\bar{h}x(\bar{i})+y),\bar{h}\tilde{h}')} \exp\left(\langle \varphi^{head}_{qry} (f(\bar{i}, \tilde{h}')),\ \varphi^{head}_{key} (f(\bar{k}, \check{h}')) + L_{\bar{h}^{-1}h}[\rho]((\bar{i}, \tilde{h}'),(\bar{k}, \check{h}'))\rangle\right)} \\
&\quad \times \varphi^{head}_{val} (f(\bar{j}, \hat{h}')) \Bigg) \\
&= m_G^r[f, \rho](\bar{i}, \bar{h}^{-1}h) = m_G^r[f, \rho](x^{-1}(\bar{h}^{-1}(x(i) - y)), \bar{h}^{-1}h) \\
&= L_y L_{\bar{h}}[m_G^r[f, \rho]](i,h).
\end{aligned}
$$

We see that indeed $m_G^r[L_y L_{\bar{h}}[f], \rho](i,h) = L_y L_{\bar{h}}[m_G^r[f, \rho]](i,h)$, which shows that group self-attention is $G$-equivariant.