In [1]:
import numpy as np

# 9.1 Inductive Bias

### 9.1.1 Inverse Problems

Most machine learning problems are *inverse problems*:\
Given a conditional distribution, it is easy to generate samples of observations from it. In ML, we almost always want to solve the *inverse* of this problem, i.e. given a sample of observations, determine the distribution that generated them. The fundamental difficulty is that infinitely many distributions could have generated any finite sample of observations.
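
A minimal sketch of the forward and inverse directions, using NumPy only. The Gaussian ground truth, its parameter values, and the choice of Gaussian and Laplace as the two candidate families are assumptions made for illustration; they are not from the text. Both candidates are fitted by maximum likelihood to the same sample, and both explain it plausibly, which is why some preference between them is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward problem: sampling from a known distribution is easy.
true_mean, true_std = 2.0, 1.5  # illustrative values (assumption)
sample = rng.normal(true_mean, true_std, size=1000)

# Inverse problem: recover a distribution from the sample alone.
# Candidate 1: Gaussian, fitted by maximum likelihood.
gauss_mean, gauss_std = sample.mean(), sample.std()

# Candidate 2: Laplace, fitted by maximum likelihood
# (location = median, scale = mean absolute deviation from the median).
laplace_loc = np.median(sample)
laplace_scale = np.abs(sample - laplace_loc).mean()


def gauss_avg_loglik(x, mean, std):
    """Average Gaussian log-likelihood of the sample."""
    return np.mean(-0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2))


def laplace_avg_loglik(x, loc, scale):
    """Average Laplace log-likelihood of the sample."""
    return np.mean(-np.log(2 * scale) - np.abs(x - loc) / scale)


gauss_ll = gauss_avg_loglik(sample, gauss_mean, gauss_std)
laplace_ll = laplace_avg_loglik(sample, laplace_loc, laplace_scale)
print(f"Gaussian fit: mean={gauss_mean:.3f}, std={gauss_std:.3f}, avg log-lik={gauss_ll:.3f}")
print(f"Laplace fit:  loc={laplace_loc:.3f}, scale={laplace_scale:.3f}, avg log-lik={laplace_ll:.3f}")
```

Restricting the search to one parametric family in the first place is itself a preference that the data alone cannot justify.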

The preference for one distribution over others is called *inductive bias*. This umbrella term covers many kinds of useful bias, such as domain knowledge or knowledge of the medium (for example, knowing that images possess translation invariance).

### 9.1.2 No Free Lunch Theorem

This theorem states that every learning algorithm is as good as any other when averaged over all possible problems...

When choosing inductive biases, we want ones that suit the broad class of problems our models will be applied to in practice. For a given application, better results are obtained by incorporating stronger inductive biases tailored to that application.

### 9.1.3 Symmetry and Invariance

In many applications, good predictions are *invariant* under one or more transformations of the input variables. For instance, computer vision models should be able to identify objects regardless of where they appear in an image (*translation invariance*).

Transformations that leave crucial properties unchanged represent *symmetries* in an ML context. For instance, a cat should be identifiable as a cat regardless of changes to its scale or position in an image. Thus, the dominant features that identify it as a cat are *symmetric* under these transformations. The set of all transformations that correspond to a particular symmetry forms a *group*.

A *group* $\mathcal{G}$ consists of a set of elements $A, B, C, \ldots$ together with a *binary operation* for composing pairs of elements, denoted $A \circ B$.\
A group satisfies the following axioms:
1. **Closed** - Groups are closed under their binary operation: $$A\circ B \in \mathcal{G}, \ \forall A, B \in \mathcal{G}$$
2. **Associative**: $$(A\circ B) \circ C = A\circ (B \circ C), \ \forall A, B, C \in \mathcal{G}$$
3. **Identity**: $$\exists I \in \mathcal{G} : A \circ I = I \circ A = A, \ \forall A \in \mathcal{G}$$
4. **Inverse**: $$\forall A \in \mathcal{G}, \ \exists A^{-1} \in \mathcal{G} : A\circ A^{-1} = A^{-1} \circ A = I$$
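
As a concrete check of the four axioms, here is a small sketch using the rotations of a square image by multiples of 90 degrees. This example group and the helper names `compose` and `act` are assumptions chosen for illustration; `np.rot90` is the only library call involved.

```python
import itertools

import numpy as np

# Element k means "rotate the image by k * 90 degrees" (illustrative group).
elements = [0, 1, 2, 3]
identity = 0


def compose(a, b):
    """Binary operation: rotate by b quarter turns, then by a quarter turns."""
    return (a + b) % 4


def act(k, image):
    """Apply group element k to an image."""
    return np.rot90(image, k)


pairs = list(itertools.product(elements, repeat=2))
triples = list(itertools.product(elements, repeat=3))

closed = all(compose(a, b) in elements for a, b in pairs)
associative = all(
    compose(compose(a, b), c) == compose(a, compose(b, c)) for a, b, c in triples
)
has_identity = all(compose(a, identity) == a == compose(identity, a) for a in elements)
has_inverses = all(
    any(compose(a, b) == identity == compose(b, a) for b in elements) for a in elements
)

# The abstract composition rule agrees with actually rotating an array.
image = np.arange(9).reshape(3, 3)
consistent = all(
    np.array_equal(act(a, act(b, image)), act(compose(a, b), image)) for a, b in pairs
)

print(closed, associative, has_identity, has_inverses, consistent)
```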

### 9.1.4 Equivariance

This is a generalization of invariance in which the output of the network is itself transformed when the inputs are transformed.\
E.g. a network that classifies pixels in an image as either being in the foreground or the background should translate the segmentations in the same way that the input is translated. 

Letting $\mathbf I$ be an input, $S$ be the network operation, and $T$ be the transformation operation, then equivariance holds if:
$$S(T(\mathbf I)) = T(S(\mathbf I))$$

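More generally, equivariance may still apply if a different operation $\tilde{T}$ is applied to the output such that $S(T(\mathbf I)) = \tilde{T}(S(\mathbf I))$.\
(Note: $\tilde{T}$ does not need to be isomorphic to $T$. In the group-theoretic view, the map $T \mapsto \tilde{T}$ only needs to respect composition, i.e. be a group homomorphism, so several distinct $T$ may share the same $\tilde{T}$.)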

Invariance is a special case of equivariance in which $\tilde{T}$ is the identity operator, i.e. : $$S(T(\mathbf{I})) = S(\mathbf{I})$$
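
Both properties can be checked numerically. The sketch below takes $T$ to be a circular shift of a 1-D signal and $S$ to be a circular cross-correlation with a fixed filter; the signal, the filter weights, and the sum-pooling step are assumptions made for illustration and stand in for a real network.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)               # a 1-D stand-in for an image
kernel = np.array([0.25, 0.5, 0.25])  # illustrative filter weights (assumption)


def T(signal, shift=3):
    """Transformation: circular translation by `shift` positions."""
    return np.roll(signal, shift)


def S(signal):
    """'Network' operation: circular cross-correlation with `kernel`."""
    return sum(w * np.roll(signal, -k) for k, w in enumerate(kernel))


def pooled(signal):
    """S followed by global sum pooling, which discards position."""
    return S(signal).sum()


# Equivariance: shifting the input shifts the output in the same way.
equivariant = np.allclose(S(T(x)), T(S(x)))

# Invariance: after pooling, shifting the input leaves the output unchanged.
invariant = np.isclose(pooled(T(x)), pooled(x))

print(equivariant, invariant)
```

The circular boundary handling is what makes the equality exact here; with zero padding, the relation would hold only away from the edges.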

# 9.2 Weight Decay

For a weight vector $w$ and error function $E(w)$, the $\ell_2$-regularized error is given by: $$\tilde{E}(w) = E(w) + \frac \lambda 2 w^\intercal w$$
Here the coefficient $1/2$ is included for convenience in differentiation.\
This type of regularization is called *weight decay* because it encourages weight values to decay towards zero. The same principle applies to $\ell_1$ regularization, which also shrinks weights towards zero and tends to set some of them exactly to zero. The gradient of the $\ell_2$-regularized error is then:
$$\nabla \tilde E (w) = \nabla E(w) + \lambda w$$
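
A short sketch that checks this gradient against finite differences and shows where the name comes from. The squared-error choice for $E(w)$, the random data, and the values of `lam` ($\lambda$) and `eta` (a learning rate) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # illustrative inputs
t = rng.normal(size=50)       # illustrative targets
lam = 0.1                     # regularization coefficient lambda (assumption)
eta = 0.05                    # learning rate (assumption)


def E(w):
    """Unregularized error: half the mean squared error (assumed form)."""
    return 0.5 * np.mean((X @ w - t) ** 2)


def grad_E(w):
    return X.T @ (X @ w - t) / len(t)


def E_tilde(w):
    """Regularized error E(w) + (lambda / 2) w^T w."""
    return E(w) + 0.5 * lam * (w @ w)


def grad_E_tilde(w):
    """Gradient from the text: grad E(w) + lambda * w."""
    return grad_E(w) + lam * w


w = rng.normal(size=3)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array(
    [(E_tilde(w + eps * e) - E_tilde(w - eps * e)) / (2 * eps) for e in np.eye(3)]
)
gradient_matches = np.allclose(numeric, grad_E_tilde(w), atol=1e-6)

# Considering the penalty term alone, one gradient step multiplies w by
# (1 - eta * lam), so the weights shrink geometrically towards zero.
w_decayed = w - eta * lam * w

print(gradient_matches, np.linalg.norm(w), np.linalg.norm(w_decayed))
```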


### 9.2.1 Consistent Regularizers