# Attention Permutation Invariances

## Key-Value Permutation Invariance

If $\pi_r(M)$ denotes an arbitrary permutation over the rows of a matrix $M$, then
$$\text{Attention}(Q,\pi_r(K),\pi_r(V))=\text{Attention}(Q,K,V) $$

Proof:

Notice that $\pi_r(M)$ can be written as $\pi_r(M) = R_{\pi}M$ for some permutation matrix $R_{\pi}$. A permutation matrix is a square binary matrix that has exactly one entry of 1 in each row and each column, with all other entries 0. In particular, it is orthogonal: $R_{\pi}^{T}R_{\pi} = I$.

$$
\begin{align*}
\text{Attention}(Q,\pi_r(K),\pi_r(V)) & = \text{Attention}(Q,R_{\pi}K,R_{\pi}V),\\
& = \text{softmax}\left(\frac{Q(R_{\pi}K)^T}{\sqrt{d_k}}\right)R_{\pi}V,\\
& = \text{softmax}\left(\frac{QK^TR_{\pi}^{T}}{\sqrt{d_k}}\right)R_{\pi}V,\\
& = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}R_{\pi}^{T}\right)R_{\pi}V,\\
& = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)R_{\pi}^{T}R_{\pi}V,\\
& = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V,\\
& = \text{Attention}(Q,K,V),
\end{align*}
$$
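where in the fifth equality we used that postmultiplying by a permutation matrix permutes the columns and that softmax is permutation equivariant, and in the sixth equality we used $R_{\pi}^{T}R_{\pi} = I$. To see the equivariance, notice that for any permutation $\pi$ of the entries of a vector $x$ we have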
$$
\begin{align*}
\text{softmax}\left(\pi(x)\right)& = \text{softmax}\left(x_{\pi(1)},x_{\pi(2)},\dots,x_{\pi(n)}\right),\\
&=\frac{\left(e^{x_{\pi(1)}},e^{x_{\pi(2)}},\dots,e^{x_{\pi(n)}}\right)}{\sum_{i=1}^n e^{x_{\pi(i)}}},\\
&=\frac{\left(e^{x_{\pi(1)}},e^{x_{\pi(2)}},\dots,e^{x_{\pi(n)}}\right)}{\sum_{i=1}^n e^{x_{i}}},\\
&=\frac{\pi\left(e^{x_{1}},e^{x_{2}},\dots,e^{x_{n}}\right)}{\sum_{i=1}^n e^{x_{i}}},\\
&=\pi\left(\frac{\left(e^{x_{1}},e^{x_{2}},\dots,e^{x_{n}}\right)}{\sum_{i=1}^n e^{x_{i}}}\right),\\
&=\pi\left(\text{softmax}(x)\right),
\end{align*}
$$
and $\text{softmax}$ is applied to each row. So for $A = [a_1|a_2|\dots|a_n]^T = \frac{QK^T}{\sqrt{d_k}}$, $f=\text{softmax}$, and a column permutation $\pi_c(A) = [\pi(a_1)|\pi(a_2)|\dots|\pi(a_n)]^T = AC_{\pi}$, we have
$$
\begin{align*}
f(AC_{\pi}) & = f([\pi(a_1)|\pi(a_2)|\dots|\pi(a_n)]^T),\\
 & = ([f(\pi(a_1))|f(\pi(a_2))|\dots|f(\pi(a_n))]^T),\\
& = ([\pi(f(a_1))|\pi(f(a_2))|\dots|\pi(f(a_n))]^T),\\
& = \pi_c\left([f(a_1)|f(a_2)|\dots|f(a_n)]^T\right),\\
& = \pi_c\left(f(A)\right),\\
& =f(A)C_{\pi}.
\end{align*}
$$
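The two facts used in this proof can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the helper names `softmax_rows` and `attention`, the shapes, and the random seed are assumptions chosen for illustration.

```python
import numpy as np

def softmax_rows(A):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    E = np.exp(A - A.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    return softmax_rows(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
n_q, n_kv, d_k, d_v = 4, 6, 8, 5  # illustrative sizes (assumption)
Q = rng.normal(size=(n_q, d_k))
K = rng.normal(size=(n_kv, d_k))
V = rng.normal(size=(n_kv, d_v))

# Permutation matrix R: row i of R @ M is row perm[i] of M.
perm = rng.permutation(n_kv)
R = np.eye(n_kv)[perm]

A = Q @ K.T / np.sqrt(d_k)
# Row-wise softmax commutes with the column permutation A -> A R^T.
col_equivariant = np.allclose(softmax_rows(A @ R.T), softmax_rows(A) @ R.T)
# Permuting keys and values by the same R leaves the output unchanged.
kv_invariant = np.allclose(attention(Q, R @ K, R @ V), attention(Q, K, V))
print(col_equivariant, kv_invariant)
```

Both checks should hold up to floating-point tolerance. Note that keys and values must be permuted by the same $R_{\pi}$, since the cancellation relies on $R_{\pi}^{T}R_{\pi} = I$.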

## Permutation Equivariance of Attention

If $\pi_r$ and $\sigma_r$ denote arbitrary permutations over the rows of a matrix, then
$$\text{Attention}(\pi_r(Q),\sigma_r(K),\sigma_r(V))=\pi_r\left(\text{Attention}(Q,K,V)\right). $$

Proof:
Consider $\pi_r(M) = R_{\pi}M$ for some permutation matrix $R_{\pi}$. From the previous result (key-value permutation invariance) we have
$$
\begin{align*}
\text{Attention}(\pi_r(Q),\sigma_r(K),\sigma_r(V)) & = \text{Attention}(\pi_r(Q),K,V),\\
&= \text{Attention}(R_{\pi}Q,K,V),\\
& = \text{softmax}\left(\frac{(R_{\pi}Q)K^T}{\sqrt{d_k}}\right)V,\\
& = \text{softmax}\left(R_{\pi}\frac{QK^T}{\sqrt{d_k}}\right)V,\\
& = R_{\pi}\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V,\\
& = \pi_r\left(\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\right),\\
& = \pi_r\left(\text{Attention}(Q,K,V) \right),
\end{align*}
$$
where in the fifth equality we have used that the softmax function is applied to each row individually, so it commutes with a permutation of the rows.
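As a quick numerical check of this statement, the sketch below permutes the queries by one permutation and the keys and values by an independent one. The helper names, shapes, and seed are illustrative assumptions; indexing `M[p]` is used as shorthand for the row permutation $R_p M$.

```python
import numpy as np

def softmax_rows(A):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    E = np.exp(A - A.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(1)
n_q, n_kv, d_k, d_v = 5, 7, 4, 3  # illustrative sizes (assumption)
Q = rng.normal(size=(n_q, d_k))
K = rng.normal(size=(n_kv, d_k))
V = rng.normal(size=(n_kv, d_v))

pi = rng.permutation(n_q)      # permutation of the query rows
sigma = rng.permutation(n_kv)  # independent permutation of the key/value rows

# Left-hand side: permute inputs first. Right-hand side: permute the output rows.
lhs = attention(Q[pi], K[sigma], V[sigma])
rhs = attention(Q, K, V)[pi]
query_equivariant = np.allclose(lhs, rhs)
print(query_equivariant)
```

The output follows the query permutation only; the key-value permutation `sigma` has no effect on the result.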

## Permutation Equivariance of Attention with weights

If Attention with weights has no bias (linear projections instead of affine projections), the Attention is permutation equivariant. If $\pi_r$ and $\sigma_r$ denote arbitrary permutations over the rows of a matrix, then
$$\text{Attention}_{\mathcal{W}}(\pi_r(Q),\sigma_r(K),\sigma_r(V))=\pi_r\left(\text{Attention}_{\mathcal{W}}(Q,K,V)\right) $$
where 
$$
\begin{align*}
\mathcal{W} &= \{W_q,b_q,W_k,b_k,W_v,b_v\},\\
b_q &= 0_{\mathbb{R}^{d_k}},\\
b_k &= 0_{\mathbb{R}^{d_k}},\\
b_v &= 0_{\mathbb{R}^{d_v}}.
\end{align*}
$$

Proof: Consider $\pi_r(M) = R_{\pi}M$ and $\sigma_r(M) = R_{\sigma}M$ for some permutation matrices $R_{\pi}$ and $R_{\sigma}$, respectively. From the previous result and the associativity of matrix multiplication we have
$$
\begin{align*}
\text{Attention}_{\mathcal{W}}(\pi_r(Q),\sigma_r(K),\sigma_r(V))
&=\text{Attention}(\pi_r(Q)W_q,\sigma_r(K)W_k,\sigma_r(V)W_v),\\
&=\text{Attention}((R_{\pi}Q)W_q,(R_{\sigma}K)W_k,(R_{\sigma}V)W_v),\\
&=\text{Attention}(R_{\pi}(QW_q),R_{\sigma}(KW_k),R_{\sigma}(VW_v)),\\
&=R_{\pi}\text{Attention}(QW_q,KW_k,VW_v),\\
&=R_{\pi}\text{Attention}_{\mathcal{W}}(Q,K,V),\\
&=\pi_r\left(\text{Attention}_{\mathcal{W}}(Q,K,V)\right).
\end{align*}
$$
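The key step is that a row permutation acts on the left while the weights act on the right, so the two commute. A minimal sketch of the bias-free case follows; the function name `attention_w`, the input width `d_model`, and all shapes are assumptions for illustration.

```python
import numpy as np

def softmax_rows(A):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    E = np.exp(A - A.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def attention_w(Q, K, V, W_q, W_k, W_v):
    # Bias-free (purely linear) projections, matching b_q = b_k = b_v = 0.
    return attention(Q @ W_q, K @ W_k, V @ W_v)

rng = np.random.default_rng(2)
n_q, n_kv, d_model, d_k, d_v = 5, 7, 6, 4, 3  # illustrative sizes (assumption)
Q = rng.normal(size=(n_q, d_model))
K = rng.normal(size=(n_kv, d_model))
V = rng.normal(size=(n_kv, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

R_pi = np.eye(n_q)[rng.permutation(n_q)]
R_sigma = np.eye(n_kv)[rng.permutation(n_kv)]

# (R Q) W = R (Q W): the permutation passes through the projection.
commutes = np.allclose((R_pi @ Q) @ W_q, R_pi @ (Q @ W_q))
lhs = attention_w(R_pi @ Q, R_sigma @ K, R_sigma @ V, W_q, W_k, W_v)
rhs = R_pi @ attention_w(Q, K, V, W_q, W_k, W_v)
weights_equivariant = np.allclose(lhs, rhs)
print(commutes, weights_equivariant)
```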

## Permutation Equivariance of Self-Attention (with weights)

If Self-Attention with weights has no bias (linear projections instead of affine projections), the SelfAttention is permutation equivariant. If $\pi_r(M)$ denotes an arbitrary permutation over the rows of a matrix $M$, then
$$\text{SelfAttention}_{\mathcal{W}}(\pi_r(X))=\pi_r\left(\text{SelfAttention}_{\mathcal{W}}(X)\right) $$
where 
$$
\begin{align*}
\mathcal{W} &= \{W_q,b_q,W_k,b_k,W_v,b_v\},\\
b_q &= 0_{\mathbb{R}^{d_k}},\\
b_k &= 0_{\mathbb{R}^{d_k}},\\
b_v &= 0_{\mathbb{R}^{d_v}}.
\end{align*}
$$

Proof: From the previous result with $\sigma_r = \pi_r$ and $Q = K = V = X$ we have
$$
\begin{align*}
\text{SelfAttention}_{\mathcal{W}}(\pi_r(X))
&=\text{Attention}_{\mathcal{W}}(\pi_r(X),\pi_r(X),\pi_r(X)),\\
&=\pi_r\left(\text{Attention}_{\mathcal{W}}(X,X,X)\right),\\
&=\pi_r\left(\text{SelfAttention}_{\mathcal{W}}(X)\right).
\end{align*}
$$
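The self-attention case can be sketched in the same way: shuffling the rows of $X$ shuffles the output rows identically, and applying the inverse permutation recovers the original output. The name `self_attention`, the shapes, and the seed below are illustrative assumptions.

```python
import numpy as np

def softmax_rows(A):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    E = np.exp(A - A.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Bias-free self-attention: queries, keys and values all come from X.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(3)
n, d_model, d_k, d_v = 6, 5, 4, 3  # illustrative sizes (assumption)
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

perm = rng.permutation(n)
out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# Equivariance: permuting the input rows permutes the output rows the same way.
self_equivariant = np.allclose(out_perm, out[perm])
# Undoing the permutation on the output recovers the original result.
inverse = np.argsort(perm)
recovered = np.allclose(out_perm[inverse], out)
print(self_equivariant, recovered)
```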

## Permutation Equivariance of Multi-Head Attention

If Multi-Head Attention has no bias (linear projections instead of affine projections), the Multi-Head Attention is permutation equivariant. If $\pi_r$ and $\sigma_r$ denote arbitrary permutations over the rows of a matrix, then
$$\text{MultiHeadAttention}_{_{\{\mathcal{W}_{i}\}_{i=1}^h\cup \{W_o,b_o\}}}(\pi_r(Q),\sigma_r(K),\sigma_r(V))=\pi_r\left(\text{MultiHeadAttention}_{_{\{\mathcal{W}_{i}\}_{i=1}^h\cup \{W_o,b_o\}}}(Q,K,V)\right) $$
where 
$$
\begin{align*}
\mathcal{W}_{i} &= \{W_{q,i},b_{q,i},W_{k,i},b_{k,i},W_{v,i},b_{v,i}\},\quad i = 1,2,\dots,h,\\
b_{q,i} &= 0_{\mathbb{R}^{d_k}},\quad i = 1,2,\dots,h,\\
b_{k,i} &= 0_{\mathbb{R}^{d_k}},\quad  i = 1,2,\dots,h,\\
b_{v,i} &= 0_{\mathbb{R}^{d_v}},\quad i = 1,2,\dots,h,\\
b_{o} &= 0_{\mathbb{R}^{d_o}}.
\end{align*}
$$

Proof: Consider $\pi_r(M) = R_{\pi}M$ and $\sigma_r(M) = R_{\sigma}M$ for permutation matrices $R_{\pi}$ and $R_{\sigma}$, respectively. From the equivariance of Attention with weights, applied to each head, we have
$$
\begin{align*}
&\text{MultiHeadAttention}_{_{\{\mathcal{W}_{i}\}_{i=1}^h\cup \{W_o,b_o\}}}(\pi_r(Q),\sigma_r(K),\sigma_r(V))\\
&=\begin{pmatrix}
\text{Attention}_{\mathcal{W}_{1}}(\pi_r(Q),\sigma_r(K),\sigma_r(V))|
\text{Attention}_{\mathcal{W}_{2}}(\pi_r(Q),\sigma_r(K),\sigma_r(V))|
\dots|
\text{Attention}_{\mathcal{W}_{h}}(\pi_r(Q),\sigma_r(K),\sigma_r(V))
\end{pmatrix}W_o,\\
&=\begin{pmatrix}
\pi_r\left(\text{Attention}_{\mathcal{W}_{1}}(Q,K,V)\right)|
\pi_r\left(\text{Attention}_{\mathcal{W}_{2}}(Q,K,V)\right)|
\dots|
\pi_r\left(\text{Attention}_{\mathcal{W}_{h}}(Q,K,V)\right)
\end{pmatrix}W_o,\\
&=\begin{pmatrix}
R_{\pi}\text{Attention}_{\mathcal{W}_{1}}(Q,K,V)|
R_{\pi}\text{Attention}_{\mathcal{W}_{2}}(Q,K,V)|
\dots|
R_{\pi}\text{Attention}_{\mathcal{W}_{h}}(Q,K,V)
\end{pmatrix}W_o,\\
&=R_{\pi}\begin{pmatrix}
\text{Attention}_{\mathcal{W}_{1}}(Q,K,V)|
\text{Attention}_{\mathcal{W}_{2}}(Q,K,V)|
\dots|
\text{Attention}_{\mathcal{W}_{h}}(Q,K,V)
\end{pmatrix}W_o,\\
&=R_{\pi}\text{MultiHeadAttention}_{_{\{\mathcal{W}_{i}\}_{i=1}^h\cup \{W_o,b_o\}}}(Q,K,V),\\
&=\pi_r\left(\text{MultiHeadAttention}_{_{\{\mathcal{W}_{i}\}_{i=1}^h\cup \{W_o,b_o\}}}(Q,K,V)\right).
\end{align*}
$$
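The multi-head argument can also be checked numerically. The sketch below concatenates $h$ bias-free heads along the feature axis and applies $W_o$; the function name `multi_head_attention`, the representation of each head as a `(W_q, W_k, W_v)` tuple, and all sizes are assumptions made for illustration.

```python
import numpy as np

def softmax_rows(A):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    E = np.exp(A - A.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(Q, K, V, heads, W_o):
    # heads: list of (W_q, W_k, W_v) triples; all biases (including b_o) are zero.
    outs = [attention(Q @ W_q, K @ W_k, V @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outs, axis=-1) @ W_o

rng = np.random.default_rng(4)
n_q, n_kv, d_model, d_k, d_v, d_o, h = 5, 7, 6, 4, 3, 6, 3  # assumed sizes
Q = rng.normal(size=(n_q, d_model))
K = rng.normal(size=(n_kv, d_model))
V = rng.normal(size=(n_kv, d_model))
heads = [
    (rng.normal(size=(d_model, d_k)),
     rng.normal(size=(d_model, d_k)),
     rng.normal(size=(d_model, d_v)))
    for _ in range(h)
]
W_o = rng.normal(size=(h * d_v, d_o))

R_pi = np.eye(n_q)[rng.permutation(n_q)]
R_sigma = np.eye(n_kv)[rng.permutation(n_kv)]

# R_pi factors out of every block of the concatenation, then out of the product with W_o.
lhs = multi_head_attention(R_pi @ Q, R_sigma @ K, R_sigma @ V, heads, W_o)
rhs = R_pi @ multi_head_attention(Q, K, V, heads, W_o)
multi_head_equivariant = np.allclose(lhs, rhs)
print(multi_head_equivariant)
```

Concatenation happens along columns while $R_{\pi}$ acts on rows, which is why the same $R_{\pi}$ can be pulled out of all $h$ blocks at once.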