
# Invariances


In many applications of pattern recognition, it is known that predictions should be unchanged, or *invariant*, under one or more transformations of the input variables.

For example, in the classification of objects in two-dimensional images, such as handwritten digits, a particular object should be assigned the same classification irrespective of its position within the image (*translation invariance*) or of its size (*scale invariance*).


<font color='red'>If sufficiently large numbers of training patterns are available, then an adaptive model such as a neural network can learn the invariance, at least approximately. This involves including within the training set a sufficiently large number of examples of the effects of the various transformations.</font>

<font color='blue'>This approach may be impractical, however, if the number of training examples is limited. </font>We therefore seek alternative approaches for encouraging an adaptive model to exhibit the required invariances. These can broadly be divided into four categories:

> $\fbox{1}$. The training set is augmented using replicas of the training patterns, transformed according to the desired invariances. For instance, in our digit recognition example, we could make multiple copies of each example in which the digit is shifted to a different position in each image.
> <div style="background-color:#f0f0f0">$\fbox{2}$. A regularization term is added to the error function that penalizes changes in the model output when the input is transformed. This leads to the technique of tangent propagation, discussed in Section 5.5.4.</div>
> $\fbox{3}$. Invariance is built into the pre-processing by extracting features that are invariant under the required transformations. Any subsequent regression or classification system that uses such features as inputs will necessarily also respect these invariances.
> <div style="background-color:#f0f0f0">$\fbox{4}$. The final option is to build the invariance properties into the structure of a neural network (or into the definition of a kernel function in the case of techniques such as the relevance vector machine). One way to achieve this is through the use of local receptive fields and shared weights, as discussed in the context of convolutional networks in Section 5.5.6.</div>

$\fbox{Approach 1}$ is often relatively easy to implement and can be used to encourage complex invariances. But it can also be computationally costly.
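As a minimal sketch of $\fbox{Approach 1}$ for the digit example, the code below replicates each training image at several horizontal offsets while keeping its label unchanged. The helper names `shift_image` and `augment_with_shifts`, the zero fill value, and the 4x4 toy image are assumptions made for illustration only. The cost mentioned above is visible directly: the training set grows by a factor equal to the number of offsets.

```python
def shift_image(image, dx, dy, fill=0):
    """Return a copy of `image` (a list of rows) shifted dx pixels right and dy pixels down.

    Pixels shifted outside the frame are dropped; vacated pixels are set to `fill`.
    """
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            nr, nc = r + dy, c + dx
            if 0 <= nr < height and 0 <= nc < width:
                shifted[nr][nc] = image[r][c]
    return shifted


def augment_with_shifts(images, labels, offsets):
    """Replicate every (image, label) pair once per (dx, dy) offset; labels are unchanged."""
    aug_images, aug_labels = [], []
    for image, label in zip(images, labels):
        for dx, dy in offsets:
            aug_images.append(shift_image(image, dx, dy))
            aug_labels.append(label)
    return aug_images, aug_labels


# A 4x4 toy "digit": a vertical stroke in column 1, labelled as class 1.
stroke = [[0, 1, 0, 0],
          [0, 1, 0, 0],
          [0, 1, 0, 0],
          [0, 1, 0, 0]]
offsets = [(0, 0), (1, 0), (2, 0), (-1, 0)]
aug_images, aug_labels = augment_with_shifts([stroke], [1], offsets)
```

Including the offset `(0, 0)` keeps the original pattern in the augmented set.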

$\fbox{Approach 2}$ leaves the data set unchanged but modifies the error function through the addition of a regularizer. In Section 5.5.5, we shall show that this approach is closely related to $\fbox{Approach 1}$.

$\fbox{Approach 3}$ has the advantage that it can correctly extrapolate well beyond the range of transformations included in the training set. However, it can be difficult to find hand-crafted features with the required invariances that do not also discard information that can be useful for discrimination.


----------------
# Tangent propagation

As mentioned under $\fbox{Approach 2}$, <font color='red'>we can use regularization to encourage models to be invariant to transformations of the input through the technique of *tangent propagation*.</font>

Provided the transformation is continuous (such as translation or rotation, but not mirror reflection, for instance), the transformed pattern will sweep out a manifold $\mathcal{M}$ within the $D$-dimensional input space. Suppose the transformation is governed by a single parameter $\xi$. Then the subspace $\mathcal{M}$ swept out by $\mathbf{x}_n$ will be one-dimensional, and will be parameterized by $\xi$. Let the vector that results from acting on $\mathbf{x}_n$ by this transformation be denoted by $\mathbf{s}(\mathbf{x}_n, \xi)$, which is defined so that $\mathbf{s}(\mathbf{x}, 0) = \mathbf{x}$. Then the tangent to the curve $\mathcal{M}$ is given by the directional derivative $\mathbf{\tau}=\partial \mathbf{s}/\partial \xi$, and the tangent vector at the point $\mathbf{x}_n$ is given by

$$\mathbf{\tau}_n = \left.\frac{\partial \mathbf{s}(\mathbf{x}_n, \xi)}{\partial \xi}\right|_{\xi=0} \tag{5.125}$$

<font color='red'> Under a transformation of the input vector, the network output vector will, in general, change.</font> The derivative of output $k$ with respect to $\xi$ is given by

$$\left.\frac{\partial y_k}{\partial \xi}\right|_{\xi=0}=\sum_{i=1}^D\frac{\partial y_k}{\partial x_i}\left.\frac{\partial x_i}{\partial \xi}\right|_{\xi=0} = \sum_{i=1}^D J_{ki}\tau_i \tag{5.126}$$

where $J_{ki}$ is the $(k,i)$ element of the Jacobian matrix. <font color='red'>We want the output to be invariant under the transformation of the inputs. Thus the result (5.126) can be used to build a penalty that is added to the original error function $E$,</font> giving a total error function of the form

$$\tilde{E} = E + \lambda\Omega \tag{5.127}$$

where $\lambda$ is a regularization coefficient and 

$$\Omega = \frac{1}{2}\sum_n\sum_k\left(\left.\frac{\partial y_{nk}}{\partial \xi}\right|_{\xi=0}\right)^2 
= \frac{1}{2}\sum_n\sum_k\left(\sum_{i=1}^D J_{nki}\tau_{ni}\right)^2 \tag{5.128}$$ 

<font color='red'>The regularization function will be zero when the network mapping function is invariant under the transformation in the neighbourhood of each pattern vector, and the value of the parameter $\lambda$ determines the balance between fitting the training data and learning the invariance property.</font>
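To make the behaviour of (5.128) concrete, here is a small sketch that evaluates $\Omega$ for rotations of two-dimensional inputs about the origin, for which the tangent vector is $\mathbf{\tau} = (-x_2, x_1)^T$. Both single-output models (a radial function and a linear function with weights `W`) and the three patterns are hypothetical, chosen only because their Jacobians are easy to write down by hand.

```python
def rotation_tangent(x):
    """Tangent vector (5.125) for rotation about the origin: d/dxi [R(xi) x] at xi = 0."""
    return [-x[1], x[0]]


def tangent_penalty(jacobian, tangent, inputs):
    """Evaluate Omega of (5.128) for a single-output model.

    `jacobian(x)` returns [dy/dx_1, ..., dy/dx_D]; `tangent(x)` returns tau for pattern x.
    """
    omega = 0.0
    for x in inputs:
        J = jacobian(x)
        tau = tangent(x)
        dy_dxi = sum(J_i * tau_i for J_i, tau_i in zip(J, tau))  # chain rule (5.126)
        omega += 0.5 * dy_dxi ** 2
    return omega


# Hypothetical model 1: y = x1^2 + x2^2, which is unchanged by rotation.
def radial_jacobian(x):
    return [2.0 * x[0], 2.0 * x[1]]


# Hypothetical model 2: y = w . x with fixed weights, which is not rotation invariant.
W = [1.0, -2.0]


def linear_jacobian(x):
    return list(W)


patterns = [[1.0, 0.0], [0.5, 2.0], [-1.5, 1.0]]
omega_radial = tangent_penalty(radial_jacobian, rotation_tangent, patterns)
omega_linear = tangent_penalty(linear_jacobian, rotation_tangent, patterns)
```

The rotation-invariant model incurs no penalty, while the linear model is penalized at every pattern, so minimizing $\tilde{E}$ would push its output towards being locally constant along the rotation direction.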

<font color='blue'>In a practical implementation, the tangent vector $\mathbf{\tau}_n$ can be approximated using finite differences, by subtracting the original vector $\mathbf{x}_n$ from the corresponding vector after transformation using a small value of $\xi$, and then dividing by $\xi$.</font>
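The finite-difference recipe can be sketched as follows. Rotation of a two-dimensional point is used as a stand-in transformation $\mathbf{s}(\mathbf{x}, \xi)$ because its exact tangent vector $(-x_2, x_1)^T$ is available for comparison. The function names and the step size `xi=1e-6` are assumptions for illustration, not prescribed values.

```python
import math


def rotate(x, xi):
    """Hypothetical transformation s(x, xi): rotate a 2-D point by angle xi (radians)."""
    c, s = math.cos(xi), math.sin(xi)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]


def finite_difference_tangent(transform, x, xi=1e-6):
    """Approximate tau = ds/dxi at xi = 0 by (s(x, xi) - x) / xi."""
    moved = transform(x, xi)
    return [(m_i - x_i) / xi for m_i, x_i in zip(moved, x)]


x_n = [0.5, 2.0]
tau_fd = finite_difference_tangent(rotate, x_n)
tau_exact = [-x_n[1], x_n[0]]  # analytic derivative of the rotation at xi = 0
max_abs_error = max(abs(a - b) for a, b in zip(tau_fd, tau_exact))
```

The forward difference has a truncation error proportional to $\xi$, so a smaller step improves accuracy until floating-point cancellation in the subtraction starts to dominate.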


-------------
# Training with transformed data

Here we show that $\fbox{Approach 1}$ (training on transformed versions of the original input patterns) is closely related to $\fbox{Approach 2}$ (tangent propagation).

We shall consider a sum-of-squares error function. The error function for untransformed inputs can be written in the form

$$E = \frac{1}{2}\iint \{y(\mathbf{x})-t\}^2 p(t|\mathbf{x}) p(\mathbf{x})d\mathbf{x}dt \tag{5.129}$$

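If each data point is now replicated with perturbations $\mathbf{s}(\mathbf{x}, \xi)$, where $\xi$ is drawn from a distribution $p(\xi)$, the error function over the expanded data set becomes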

$$\begin{align*}
\tilde{E} &= \frac{1}{2}\iiint \{y(\mathbf{s}(\mathbf{x}, \xi)) - t\}^2 p(t|\mathbf{x}) p(\mathbf{x}) p(\xi) d\mathbf{x}dt d\xi \tag{5.130}\\
&= \frac{1}{2}\iiint \left\{
y(\mathbf{s}(\mathbf{x}, 0))
+ (\xi-0)\left.\frac{\partial y}{\partial \xi}\right|_{\xi=0}
+ \frac{(\xi-0)^2}{2!}\left.\frac{\partial^2 y}{\partial \xi^2}\right|_{\xi=0} + O(\xi^3)
- t\right\}^2 p(t|\mathbf{x}) p(\mathbf{x}) p(\xi) d\mathbf{x}dt d\xi\qquad \text{Taylor expansion}\\
&= \frac{1}{2}\iiint \left\{
y(\mathbf{x})
+ \xi\frac{\partial \mathbf{s}(\mathbf{x},0)}{\partial \xi}\frac{\partial y}{\partial \mathbf{s}(\mathbf{x}, 0)}
+ \frac{\xi^2}{2}\left.\frac{\partial}{\partial \xi} \frac{\partial y}{\partial \xi}\right|_{\xi=0} + O(\xi^3)
- t\right\}^2 p(t|\mathbf{x}) p(\mathbf{x}) p(\xi) d\mathbf{x}dt d\xi \\
&= \frac{1}{2}\iiint \left\{
y(\mathbf{x})
+ \xi\frac{\partial \mathbf{s}(\mathbf{x},0)}{\partial \xi}\frac{\partial y}{\partial \mathbf{x}}
+ \frac{\xi^2}{2}\left.\frac{\partial}{\partial \xi} \frac{\partial y}{\partial \xi}\right|_{\xi=0} + O(\xi^3)
- t\right\}^2 p(t|\mathbf{x}) p(\mathbf{x}) p(\xi) d\mathbf{x}dt d\xi\\
&= \frac{1}{2}\iiint \left\{
y(\mathbf{x})
+ \xi\mathbf{\tau}^T\nabla y
+ \frac{\xi^2}{2}\left.\frac{\partial}{\partial \xi} \frac{\partial y}{\partial \xi}\right|_{\xi=0} + O(\xi^3)
- t\right\}^2 p(t|\mathbf{x}) p(\mathbf{x}) p(\xi) d\mathbf{x}dt d\xi\\
&= \frac{1}{2}\iiint \left\{
y(\mathbf{x})
+ \xi\mathbf{\tau}^T\nabla y
+ \frac{\xi^2}{2}\frac{\partial}{\partial \xi} \mathbf{\tau}^T\nabla y + O(\xi^3)
- t\right\}^2 p(t|\mathbf{x}) p(\mathbf{x}) p(\xi) d\mathbf{x}dt d\xi\\
&= \frac{1}{2}\iiint \left\{
y(\mathbf{x})
+ \xi\mathbf{\tau}^T\nabla y
+ \frac{\xi^2}{2}\left(\mathbf{\tau}^T\frac{\partial}{\partial \xi} \frac{\partial y}{\partial \mathbf{s}(\mathbf{x},0)}
+ \frac{\partial \mathbf{\tau}^T}{\partial \xi}\nabla y \right) + O(\xi^3)
- t\right\}^2 p(t|\mathbf{x}) p(\mathbf{x}) p(\xi) d\mathbf{x}dt d\xi\\
&= \frac{1}{2}\iiint \left\{
y(\mathbf{x})
+ \xi\mathbf{\tau}^T\nabla y
+ \frac{\xi^2}{2}\left(\mathbf{\tau}^T \frac{\partial \mathbf{s}(\mathbf{x},0)}{\partial \xi}\frac{\partial^2 y}{\partial \mathbf{s}(\mathbf{x},0)^2}
+ \frac{\partial \mathbf{\tau}^T}{\partial \xi}\nabla y\right) + O(\xi^3)
- t\right\}^2 p(t|\mathbf{x}) p(\mathbf{x}) p(\xi) d\mathbf{x}dt d\xi\\
&= \frac{1}{2}\iiint \left\{
y(\mathbf{x})
+ \xi\mathbf{\tau}^T\nabla y
+ \frac{\xi^2}{2}\left(\mathbf{\tau}^T \nabla\nabla y\mathbf{\tau}
+ (\mathbf{\tau}')^T\nabla y \right) + O(\xi^3)
- t\right\}^2 p(t|\mathbf{x}) p(\mathbf{x}) p(\xi) d\mathbf{x}dt d\xi\qquad \text{let } \mathbf{\tau}'=\left.\frac{\partial^2 \mathbf{s}(\mathbf{x},\xi)}{\partial \xi^2}\right|_{\xi=0}\\
&= \frac{1}{2}\iint\{y(\mathbf{x})-t\}^2 p(t|\mathbf{x}) p(\mathbf{x}) d\mathbf{x} dt\\
&\quad+ \mathbb{E}[\xi]\iint \{y(\mathbf{x})-t\}\mathbf{\tau}^T\nabla y(\mathbf{x}) p(t|\mathbf{x})p(\mathbf{x})d\mathbf{x}dt\\
&\quad+ \mathbb{E}[\xi^2]\,\frac{1}{2}\iint \left[\{y(\mathbf{x})-t\}\left\{(\mathbf{\tau}')^T\nabla y(\mathbf{x})+\mathbf{\tau}^T\nabla\nabla y(\mathbf{x})\mathbf{\tau}\right\} + (\mathbf{\tau}^T\nabla y(\mathbf{x}))^2\right] p(t|\mathbf{x})p(\mathbf{x})d\mathbf{x}dt
+O(\xi^3)\\
&= \frac{1}{2}\iint\{y(\mathbf{x})-t\}^2 p(t|\mathbf{x}) p(\mathbf{x}) d\mathbf{x} dt\\
&\quad+ 0 \qquad \color{red}{\text{assume }p(\xi)\text{ has zero mean}}\\
&\quad+ \mathbb{E}[\xi^2]\,\frac{1}{2}\iint \left[\{y(\mathbf{x})-t\}\left\{(\mathbf{\tau}')^T\nabla y(\mathbf{x})+\mathbf{\tau}^T\nabla\nabla y(\mathbf{x})\mathbf{\tau}\right\} + (\mathbf{\tau}^T\nabla y(\mathbf{x}))^2\right] p(t|\mathbf{x})p(\mathbf{x})d\mathbf{x}dt\\
&\quad+ 0 \qquad \color{red}{\text{assume } \xi \text{ near the point } 0}\\
&= E+\lambda\Omega \tag{5.131}
\end{align*}$$

where

$$\begin{align*}
\lambda &= \mathbb{E}[\xi^2]\\
\Omega &= \frac{1}{2}\int\left[\{y(\mathbf{x})-\mathbb{E}[t|\mathbf{x}]\}\left\{(\mathbf{\tau}')^T\nabla y(\mathbf{x})+\mathbf{\tau}^T\nabla\nabla y(\mathbf{x})\mathbf{\tau}\right\}+(\mathbf{\tau}^T\nabla y(\mathbf{x}))^2\right]p(\mathbf{x})d\mathbf{x} \tag{5.132}
\end{align*}$$

in which we have performed the integration over $t$.

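We can further simplify this regularizer by <font color='red'>assuming that the model fits the observations</font>. More precisely, the function $y(\mathbf{x})$ that minimizes the unregularized sum-of-squares error is the conditional average $\mathbb{E}[t|\mathbf{x}]$ of the target values $t$. Since the regularized error differs from the unregularized one only by higher-order terms in $\xi$, the function that minimizes the total error has the form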

$$y(\mathbf{x}) = \mathbb{E}[t|\mathbf{x}] + O(\xi) \tag{5.133}$$

To leading order in $\xi$, the first term in the regularizer therefore vanishes, and we are left with

$$\Omega = \frac{1}{2}\int \big(\mathbf{\tau}^T\nabla y(\mathbf{x})\big)^2 p(\mathbf{x}) d\mathbf{x} \tag{5.134}$$

**which is equivalent to the tangent propagation regularizer (5.128).**
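The relationship can be checked numerically under simplifying assumptions. The sketch below uses a single pattern, a hypothetical smooth model `model`, rotation as the transformation, and a two-point distribution $\xi \in \{-\epsilon, +\epsilon\}$ with equal probabilities, so that $\mathbb{E}[\xi] = 0$ and $\mathbb{E}[\xi^2] = \epsilon^2$. The target is set equal to the model output, which mimics the assumption that the model fits the observations, so $E = 0$ and only the $(\mathbf{\tau}^T\nabla y)^2$ term of the regularizer remains.

```python
import math


def model(x):
    """Hypothetical smooth single-output model y(x) on 2-D inputs."""
    return math.sin(x[0]) + x[0] * x[1]


def model_gradient(x):
    """Analytic gradient of `model` with respect to x."""
    return [math.cos(x[0]) + x[1], x[0]]


def rotate(x, xi):
    """Transformation s(x, xi): rotate a 2-D point by angle xi (radians)."""
    c, s = math.cos(xi), math.sin(xi)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]


x = [0.5, 2.0]
t = model(x)            # the model fits this observation exactly, so E = 0
eps = 1e-3
xi_values = (-eps, eps)  # zero mean, E[xi^2] = eps^2

# Sum-of-squares error averaged over the transformed copies of the pattern (5.130).
augmented_error = sum(
    0.5 * (model(rotate(x, xi)) - t) ** 2 for xi in xi_values
) / len(xi_values)

# Tangent propagation prediction: lambda * Omega with Omega from (5.134).
tau = [-x[1], x[0]]
grad = model_gradient(x)
directional = sum(tau_i * g_i for tau_i, g_i in zip(tau, grad))
lam = eps ** 2
omega = 0.5 * directional ** 2
ratio = augmented_error / (lam * omega)
```

For this setup `ratio` is close to 1, and the remaining discrepancy shrinks as `eps` is reduced, consistent with the neglected higher-order terms.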

<font color='orange'>If we consider the special case in which the transformation of the inputs simply consists of the addition of random noise</font>, so that $\mathbf{x} \rightarrow \mathbf{x} + \mathbf{\xi}$, where $\mathbf{\xi}$ is the noise, then the regularizer takes the form

$$\Omega = \frac{1}{2}\int\|\nabla y(\mathbf{x})\|^2 p(\mathbf{x}) d\mathbf{x} \tag{5.135}$$

which is known as *Tikhonov regularization*.
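For a linear model the link between input noise and (5.135) holds exactly, which makes it easy to verify. The sketch below assumes a hypothetical linear model with weights `w` and bias `b`, a single pattern, and noise whose components are independently $\pm\sigma$ with equal probability (zero mean, variance $\sigma^2$ per component). The expectation over the noise is computed by enumerating all sign combinations rather than by sampling.

```python
from itertools import product

# Hypothetical linear model y(x) = w . x + b, whose gradient with respect to x is w.
w = [1.0, -2.0, 0.5]
b = 0.25


def model(x):
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


x = [0.3, -1.0, 2.0]
t = 1.5
sigma = 0.1

# Sum-of-squares error on the clean pattern.
plain_error = 0.5 * (model(x) - t) ** 2

# Exact expectation of the error over the noise: every component is +sigma or -sigma.
noisy_errors = []
for signs in product((-1.0, 1.0), repeat=len(x)):
    noisy_x = [x_i + sigma * s for x_i, s in zip(x, signs)]
    noisy_errors.append(0.5 * (model(noisy_x) - t) ** 2)
augmented_error = sum(noisy_errors) / len(noisy_errors)

# Tikhonov regularizer (5.135) for this single pattern: 0.5 * ||grad y||^2.
tikhonov = 0.5 * sum(w_i ** 2 for w_i in w)
gap = augmented_error - plain_error
```

Here `gap` equals `sigma ** 2 * tikhonov` up to rounding. For a nonlinear model the same identity would hold only approximately, for small noise amplitudes.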