# Efficient Backprop

#### Learning and Generalization

Function $M(Z^p, W)$
- $Z^p$ is the $p$-th input pattern
- $W$ is the collection of adjustable parameters in the system

Cost function $E^p = C(D^p, M(Z^p, W))$ measures discrepancy
- $D^p$ is the correct (desired) output for $Z^p$

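The goal is to find the $W$ that minimizes the training error $E_{train}(W)$, the average of the per-pattern errors $E^p$ over the training set. A common cost function is the mean squared error (MSE):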
$E^p = \frac{1}{2}(D^p - M(Z^p, W))^2$
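
As a rough illustration of the cost above, the following sketch computes the per-pattern MSE and its average over a training set. The function names and the toy values are hypothetical, and scalar outputs are assumed for simplicity.

```python
def mse_cost(desired, output):
    """Per-pattern cost E^p = 1/2 * (D^p - M(Z^p, W))^2 for scalar outputs."""
    return 0.5 * (desired - output) ** 2


def train_error(desired_list, output_list):
    """Average of the per-pattern costs over the training set."""
    costs = [mse_cost(d, y) for d, y in zip(desired_list, output_list)]
    return sum(costs) / len(costs)


targets = [1.0, -1.0, 1.0]   # hypothetical desired outputs D^p
outputs = [0.5, -1.0, 0.0]   # hypothetical model outputs M(Z^p, W)
e_train = train_error(targets, outputs)
```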

Bias measures how much the network output, averaged over all possible datasets, differs from the desired function. Variance measures how much the network output varies between datasets.

#### Standard Backpropagation

Traditional multi-layer neural networks are a special case in which the modules are alternating layers of matrix multiplications (the weights) and component-wise sigmoid functions (the units):
- $Y_n = W_n\ X_{n-1}$
- $X_n = F(Y_n)$

where $W_n$ is a matrix whose number of columns is the dimension of $X_{n - 1}$, and number of rows is the dimension of $X_n$. $F$ is a vector function that applies a sigmoid function to each component of its input. $Y_n$ is the vector of weighted sums, or total inputs, to layer n.
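
The two layer equations can be sketched as a forward pass. This is a minimal illustration only: the helper names, the choice of `tanh` as the sigmoid, and the weight values are assumptions, not part of the notes.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def sigmoid_layer(y):
    """F: apply a sigmoid (here tanh, an assumption) to each component."""
    return [math.tanh(y_i) for y_i in y]


def forward(weights, x0):
    """Apply Y_n = W_n X_{n-1}, then X_n = F(Y_n), for each layer in turn."""
    x = x0
    for W in weights:
        y = matvec(W, x)       # weighted sums (total inputs) to the layer
        x = sigmoid_layer(y)   # component-wise sigmoid
    return x


# Hypothetical 2-3-1 network: W_1 has 3 rows and 2 columns, W_2 has 1 row and 3 columns.
W1 = [[0.5, -0.5], [0.1, 0.2], [-0.3, 0.4]]
W2 = [[0.2, -0.1, 0.3]]
output = forward([W1, W2], [1.0, 2.0])
```

Each matrix has as many columns as the previous layer has components and as many rows as the layer it feeds, matching the description above.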

#### Stochastic versus Batch Learning

Gradient descent is the simplest minimization procedure, in which $W$ is iteratively adjusted:
$W(t) = W(t - 1) - \eta \frac{\partial E}{\partial W}$

where $\eta$ is a scalar constant (the learning rate). The issue is that this equation requires a complete pass through the entire dataset in order to compute the average, or true, gradient (batch learning). In stochastic (online) learning, a single example $\{Z^t, D^t\}$ is chosen randomly from the training set at each iteration $t$. An estimate of the true gradient is then computed from the error $E^t$ of that example, and the weights are updated:
$W(t + 1) = W(t) - \eta \frac{\partial E^t}{\partial W}$

The noise in this gradient estimate is advantageous. Stochastic learning:
- is usually much faster, particularly on redundant datasets
- often results in better solutions. The noise can help the weights jump into an adjacent basin that may contain a deeper minimum, possibly the global minimum. Batch learning will find the minimum of the basin it starts in, which may be only a local minimum.
- can be used to track changes in the data over time
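
The difference between the two update rules can be sketched on a toy problem. Everything here is an illustrative assumption: a one-weight linear model `w * z`, noise-free data, and a fixed learning rate.

```python
import random


def grad(w, z, d):
    """Gradient of E = 1/2 * (d - w*z)^2 with respect to w."""
    return -(d - w * z) * z


def batch_step(w, data, eta):
    """One update using the average gradient over the whole training set."""
    g = sum(grad(w, z, d) for z, d in data) / len(data)
    return w - eta * g


def stochastic_step(w, data, eta, rng):
    """One update using the gradient of a single randomly chosen example."""
    z, d = rng.choice(data)
    return w - eta * grad(w, z, d)


# Hypothetical noise-free data generated by d = 2 * z.
data = [(z, 2.0 * z) for z in (-1.0, -0.5, 0.5, 1.0)]
rng = random.Random(0)
w_batch = w_sgd = 0.0
for _ in range(200):
    w_batch = batch_step(w_batch, data, eta=0.5)
    w_sgd = stochastic_step(w_sgd, data, eta=0.5, rng=rng)
```

Note that each batch step touches all four examples while each stochastic step touches only one, which is where the speed advantage on redundant data comes from.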

#### Shuffling Examples

Networks learn the fastest from the most unexpected sample, so at each iteration choose the sample that is most unfamiliar to the system. This only applies to stochastic learning. A simple way to do this is to choose successive examples from different classes, since training examples belonging to the same class will contain similar information. Another method is to examine the error between the network output and the target value. A large error means the input has not been learned and therefore contains new information, so it makes sense to present that input more frequently while its error is large. The process of modifying the probability of appearance of each pattern is called an emphasizing scheme.

Applying the above technique to data that contains outliers can be detrimental: outliers produce large errors, but they should not be presented repeatedly.
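
One possible emphasizing scheme is sketched below. Sampling each pattern with probability proportional to its current error is an assumption made for illustration; the notes do not prescribe a specific rule, and the function names are hypothetical. As noted above, such a scheme would need care on data with outliers.

```python
import random


def emphasis_probabilities(errors):
    """Turn per-pattern errors into sampling probabilities (assumed rule)."""
    total = sum(errors)
    if total == 0:
        # No pattern stands out: fall back to uniform sampling.
        return [1.0 / len(errors)] * len(errors)
    return [e / total for e in errors]


def pick_example(errors, rng):
    """Pick the index of the next training pattern to present."""
    probs = emphasis_probabilities(errors)
    return rng.choices(range(len(errors)), weights=probs, k=1)[0]


rng = random.Random(0)
errors = [0.1, 0.1, 0.8]          # hypothetical current errors E^p
probs = emphasis_probabilities(errors)
index = pick_example(errors, rng)
```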

#### Normalize Inputs

Convergence is faster if the average of each input variable over the training set is close to zero and the inputs are scaled so they all have the same covariance $C_i$ where:
$C_i = \frac{1}{P}\ \sum_{p = 1}^P\ (z^p_i)^2$

where $P$ is the number of training examples, $C_i$ is the covariance of the $i$th input variable, and $z^p_i$ is the $i$th component of the $p$th training example. This balances how the weights connected to input nodes learn and thus speeds the process up. 

Another opportunity is decorrelating the inputs. It is harder to solve for two correlated inputs simultaneously than for independent ones. PCA can remove linear correlations. The transformation steps are therefore:
- shift inputs to mean zero
- decorrelate inputs
- equalize covariances
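
The first and last steps can be sketched in a few lines. This is a minimal illustration with hypothetical data; the decorrelation step (for example via PCA) is deliberately left out to keep the sketch dependency-free.

```python
def center_and_scale(patterns):
    """Shift each input variable to mean zero, then scale so that C_i = 1."""
    P = len(patterns)
    n = len(patterns[0])
    means = [sum(z[i] for z in patterns) / P for i in range(n)]
    centered = [[z[i] - means[i] for i in range(n)] for z in patterns]
    # C_i = (1/P) * sum_p (z_i^p)^2, computed on the centered inputs
    cov = [sum(z[i] ** 2 for z in centered) / P for i in range(n)]
    # Leave constant variables (C_i = 0) unscaled to avoid dividing by zero.
    scale = [c ** -0.5 if c > 0 else 1.0 for c in cov]
    return [[z[i] * scale[i] for i in range(n)] for z in centered]


# Hypothetical inputs whose two variables live on very different scales.
raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
normalized = center_and_scale(raw)
```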

#### Sigmoid Functions

Sigmoid functions:
- Logistic $f(x) = \frac{1}{1 + e^{-x}}$
- Hyperbolic tangent $f(x) = \tanh(x)$

Symmetric sigmoids such as the hyperbolic tangent tend to converge faster than the standard logistic function. This is because the outputs of $\tanh(x)$ are on average close to zero. A recommended sigmoid is $f(x) = 1.7159 \tanh(\frac{2}{3}x)$. It is also sometimes beneficial to add a small linear term, $f(x) = \tanh(x) + ax$, to avoid flat spots.
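
The recommended sigmoid can be written down directly. The constants come from the text above; the function names and the value of `a` in the linear-term variant are illustrative assumptions.

```python
import math


def recommended_sigmoid(x):
    """f(x) = 1.7159 * tanh(2x / 3)."""
    return 1.7159 * math.tanh(2.0 * x / 3.0)


def with_linear_term(x, a=0.01):
    """tanh(x) + a*x; the default a = 0.01 is an illustrative assumption."""
    return math.tanh(x) + a * x


at_plus_one = recommended_sigmoid(1.0)    # very close to +1
at_minus_one = recommended_sigmoid(-1.0)  # very close to -1
```

With these constants the function passes almost exactly through $(\pm 1, \pm 1)$, which is convenient when the inputs are normalized and the targets are $\pm 1$.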

One issue with symmetric sigmoids is that the error surface can be very flat near the origin. For this reason it is wise to avoid initializing the weights with very small values.

#### Target Values

Target values set to the sigmoid's asymptotes have several drawbacks:
- The weights will be driven to larger and larger values, where the sigmoid derivative is close to zero. Large weights increase the gradient, but when it is multiplied by an exponentially small sigmoid derivative the resulting weight update is close to zero, causing the weights to become stuck.
- When an input pattern falls near a decision boundary, the output class is uncertain. Ideally the network should then output a value in between the two possible target values, not near either asymptote. Large weights tend to force outputs to the tails of the sigmoid, which produces a confident class prediction, possibly wrong, without any indication of uncertainty.

To avoid these problems, set the target values within the range of the sigmoid rather than at its asymptotic values. The best way to do this is to set the target values to the point of maximum second derivative of the sigmoid, which for the recommended sigmoid $f(x) = 1.7159 \tanh(\frac{2}{3}x)$ is around $\pm 1$.
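
The location of the maximum second derivative can be checked numerically. This sketch assumes the sigmoid $f(x) = 1.7159 \tanh(\frac{2}{3}x)$ and uses a simple grid search; the grid range and resolution are arbitrary choices.

```python
import math

A, B = 1.7159, 2.0 / 3.0


def second_derivative(x):
    """f''(x) for f(x) = A * tanh(B * x)."""
    t = math.tanh(B * x)
    return -2.0 * A * B * B * t * (1.0 - t * t)


# Grid search over x in [0, 3] for the largest |f''(x)|.
grid = [i / 1000.0 for i in range(3001)]
x_star = max(grid, key=lambda x: abs(second_derivative(x)))
```

The maximizer lands just below 1, and by symmetry there is a matching point just above -1, consistent with targets of around $\pm 1$.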

#### Initializing Weights

Weights should be initialized randomly, but in such a way that the sigmoid is primarily activated in its linear region. Very large weights saturate the sigmoid, which gives small gradients and slow learning; very small weights also give small gradients and slow learning. The approach is to ensure that the outputs of each node have a standard deviation of around 1, which requires coordinating the training set normalization, the choice of sigmoid, and the weight initialization.

To keep a standard deviation close to 1 at the output of each layer, use the recommended sigmoid and require that the input to the sigmoid also has a standard deviation of 1. Assuming the inputs to a unit are uncorrelated with variance 1, the standard deviation of the unit's weighted sum is:
$\sigma_{y_i} = (\sum_j\ w^2_{ij})^{\frac{1}{2}}$

So to ensure a standard deviation of approximately 1, the weights should be randomly drawn from a distribution with mean 0 and standard deviation given by:
$\sigma_w = m^{-\frac{1}{2}}$

where $m$ is the number of inputs to the unit (its fan-in). The distribution can be, for example, uniform.
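
A minimal sketch of this initialization, assuming a uniform distribution. The interval half-width follows from the fact that a uniform distribution on $[-a, a]$ has standard deviation $a / \sqrt{3}$; the function name and sample sizes are illustrative.

```python
import math
import random


def init_weights(m, rng):
    """Draw m weights uniformly with mean 0 and standard deviation m**-0.5."""
    # Uniform on [-a, a] has std a / sqrt(3), so a = sqrt(3 / m).
    a = math.sqrt(3.0 / m)
    return [rng.uniform(-a, a) for _ in range(m)]


rng = random.Random(0)
m = 100  # hypothetical fan-in of the unit
samples = [w for _ in range(200) for w in init_weights(m, rng)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((w - mean) ** 2 for w in samples) / len(samples))
```

For `m = 100` the empirical standard deviation should come out close to $100^{-1/2} = 0.1$.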

#### Choosing Learning Rates

For stochastic gradient descent it is advisable to pick a separate learning rate $\eta_i$ for each weight, as this can improve convergence. To ensure that all weights converge at roughly the same speed, use larger learning rates in the lower layers and smaller learning rates in the higher layers.

Other tricks to improve convergence:
- Momentum: $\Delta w(t + 1) = \eta \frac{\partial E_{t + 1}}{\partial w} + \mu \Delta w(t)$ can increase the speed of convergence when the cost surface is highly non-spherical, because it damps the size of the steps along directions of high curvature and thus yields a larger effective learning rate along directions of low curvature. It is generally claimed to help more in batch mode than in stochastic mode.
- Adaptive learning rates: increase or decrease the learning rate automatically based on the error.
    - If the smallest eigenvalue of the Hessian is sufficiently smaller than the second smallest eigenvalue, then after a large number of iterations the parameter vector $w(t)$ will approach the minimum from the direction of the minimum eigenvector of the Hessian.
    - Put simply, far from the minimum the method proceeds in big steps, and close to the minimum it anneals the learning rate.
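
The effect of momentum on a non-spherical cost surface can be sketched on a toy quadratic. The cost $E(w) = \frac{1}{2}\sum_i h_i w_i^2$, the curvatures, the learning rate, and the momentum value are all illustrative assumptions, as is the sign convention that the step $\Delta w$ is subtracted from the weights.

```python
import math


def descend(h, w0, eta, mu, steps):
    """Minimise E(w) = 1/2 * sum_i h_i * w_i^2 using momentum mu."""
    w = list(w0)
    delta = [0.0] * len(w)
    for _ in range(steps):
        grad = [h_i * w_i for h_i, w_i in zip(h, w)]
        # Delta w(t+1) = eta * dE/dw + mu * Delta w(t)
        delta = [eta * g + mu * d for g, d in zip(grad, delta)]
        # Assumed sign convention: the step is subtracted from the weights.
        w = [w_i - d for w_i, d in zip(w, delta)]
    return w


def norm(w):
    return math.sqrt(sum(w_i * w_i for w_i in w))


h = [1.0, 10.0]  # low and high curvature directions
w_plain = descend(h, [1.0, 1.0], eta=0.18, mu=0.0, steps=50)
w_momentum = descend(h, [1.0, 1.0], eta=0.18, mu=0.5, steps=50)
```

In this setup the run with momentum ends closer to the minimum at the origin than the plain run after the same number of steps.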

#### Radial Basis Functions vs Sigmoid Units

In radial basis function (RBF) networks, the dot product of the weight and input vectors is replaced by the Euclidean distance between the input and the weights, and the sigmoid is replaced by an exponential. The output is:  
$g(x) = \sum_{i = 1}^N\ w_i \exp( -\frac{1}{2\sigma_i^2} \| x - v_i \|^2 )$

where $v_i$ is the mean and $\sigma_i$ is the standard deviation of the $i$-th Gaussian.
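
A minimal sketch of the RBF output, assuming the usual Gaussian form with a negative exponent. The function and argument names are hypothetical.

```python
import math


def rbf_output(x, centers, sigmas, weights):
    """g(x) = sum_i w_i * exp(-||x - v_i||^2 / (2 * sigma_i^2))."""
    total = 0.0
    for v, sigma, w in zip(centers, sigmas, weights):
        sq_dist = sum((x_j - v_j) ** 2 for x_j, v_j in zip(x, v))
        total += w * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total


# One Gaussian centred at the origin with unit width and weight 2.
at_center = rbf_output([0.0, 0.0], [[0.0, 0.0]], [1.0], [2.0])
off_center = rbf_output([1.0, 0.0], [[0.0, 0.0]], [1.0], [2.0])
```

Unlike a sigmoid unit, which responds over a whole half-space, each Gaussian unit responds only in a local region around its mean.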

#### Convergence of Gradient Descent

Update equation for gradient descent is:  
$W(t + 1) = W(t) - \eta\ \frac{dE(W)}{dW}$

For a one-dimensional quadratic cost, the optimal learning rate is the inverse of the second derivative, $\eta_{opt} = (\frac{d^2E}{dW^2})^{-1}$:

- if $\eta < \eta_{opt}$ then the weight moves toward the minimum in several steps without oscillating
- if $\eta = \eta_{opt}$ then the weight reaches the minimum in one step
- if $\eta_{opt} < \eta < 2\eta_{opt}$ then the weight oscillates around the minimum but eventually converges to it
- if $\eta > 2\eta_{opt}$ then divergence occurs and the weight never reaches the minimum
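
The four regimes can be reproduced on a one-dimensional quadratic. This sketch assumes $E(w) = \frac{1}{2} h w^2$, for which the optimal rate is $1/h$; the curvature, starting point, and step count are arbitrary.

```python
def run(eta, h=2.0, w0=1.0, steps=20):
    """Gradient descent on E(w) = 1/2 * h * w^2; returns the weight history."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - eta * h * w   # dE/dw = h * w
        history.append(w)
    return history


eta_opt = 1.0 / 2.0  # inverse of the second derivative h = 2
slow = run(0.5 * eta_opt)          # steady approach, no oscillation
one_step = run(eta_opt)            # reaches the minimum in one step
oscillating = run(1.5 * eta_opt)   # overshoots but still converges
diverging = run(2.5 * eta_opt)     # overshoots more each step
```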

#### Input Transformations Revisited

###### Subtract the Means from the Input Variables
A nonzero mean in the input variables creates a very large eigenvalue of the Hessian and therefore a large condition number, i.e. the cost surface is steep in some directions and shallow in others, so convergence is very slow. For a single linear neuron, the eigenvectors of the Hessian (with the means subtracted) point along the principal axes of the cloud of training vectors.

###### Normalize the Variances of the Input Variables
Inputs that have a large variation in spread along different directions of the input space give a large condition number and slow learning, so the variances of the input variables should be normalized. If the inputs are correlated, this will not make the error surface spherical, but it may reduce its eccentricity. Correlated input variables cause the eigenvectors of the Hessian matrix to be rotated away from the coordinate axes, so the weight updates are not decoupled. Decoupled weights are what make the one-learning-rate-per-weight method optimal.

###### Decorrelate the Input Variables 
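Once the inputs of a neuron are decorrelated, its Hessian is diagonal and the eigenvectors point along the coordinate axes. The plain gradient still does not point toward the minimum in general, but if each weight is assigned its own learning rate (equal to the inverse of the corresponding eigenvalue), the descent direction points directly toward the minimum.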

###### Use a Separate Learning Rate for Each Weight

#### Diagonal Levenberg Marquardt Method

- Uses the square Jacobian approximation
- Is designed mainly for batch learning
- Has a complexity of $O(N^3)$
- Most importantly, works only for mean squared error loss functions

This method has a regularization parameter $\mu$ that prevents the update from blowing up if some eigenvalues are small:
$$\Delta w = \left(\sum_p \frac{\partial f(w, x_p)}{\partial w}^T\ \frac{\partial f(w, x_p)}{\partial w} + \mu I\right)^{-1} \nabla E(w)$$

where $I$ denotes the identity matrix.
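
The role of $\mu$ can be sketched for a model with a single weight, where the matrix inverse reduces to a division. The model $f(w, x) = w x$, the data, and the convention that $\Delta w$ is subtracted from the weight are illustrative assumptions.

```python
def lm_step(w, xs, ds, mu):
    """One damped step for f(w, x) = w * x with E = 1/2 * sum (d - w*x)^2."""
    grad_e = -sum((d - w * x) * x for x, d in zip(xs, ds))   # gradient of E
    jtj = sum(x * x for x in xs)   # sum_p (df/dw)^T (df/dw); here df/dw = x
    delta = grad_e / (jtj + mu)    # scalar version of (J^T J + mu I)^-1 grad E
    return w - delta               # assumed sign convention


xs = [1.0, 2.0, 3.0]
ds = [2.0, 4.0, 6.0]               # hypothetical targets, d = 2 * x
undamped = lm_step(0.0, xs, ds, mu=0.0)
damped = lm_step(0.0, xs, ds, mu=14.0)
```

With $\mu = 0$ the step lands on the least-squares solution in one move for this linear model; a larger $\mu$ shortens the step, which is what keeps the update bounded when the curvature term is small.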

#### Conjugate Gradient

###### Properties
- $O(N)$ method
- Does not use Hessian explicitly
- Attempts to find descent directions that try to minimally spoil the result achieved in the previous iterations
- It uses a line search
- Most importantly: works only for batch learning

Because of the last property, this method is used when the training set is not too large or when the task is regression.

- $p_k$ is the descent direction
- $k$ is the iteration index

The evolution of the descent direction is given by:  
$p_k = -\nabla E(w_k) + \beta_kp_{k - 1}$

where the choice of $\beta_k$ can be done according to Fletcher and Reeves or Polak and Ribiere.

Polak and Ribiere generally works better in practice:  
$$\beta_{k} = \frac{(\nabla E(w_{k}) - \nabla E(w_{k - 1}))^T\ \nabla E(w_{k})}{\nabla E(w_{k - 1})^T\ \nabla E(w_{k - 1})}$$
$$p_{k + 1} = \max(\beta_{k + 1}, 0)\ p_k - \nabla E(w_{k + 1})$$

The $\max()$ in the update of $p_k$ replaces a negative $\beta$ with zero, which restarts the search along the steepest descent direction. Two directions $p_k$ and $p_{k - 1}$ are defined as conjugate if:
$$p^T_k\ Hp_{k - 1} = 0$$

i.e. conjugate directions are orthogonal directions in the space of an identity Hessian matrix. Polak and Ribiere's choice seems more robust for non-quadratic functions. Conjugate gradient can also be viewed as a smart choice for the momentum term.
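
A minimal sketch of conjugate gradient with the Polak and Ribiere choice of $\beta$, assuming a quadratic cost $E(w) = \frac{1}{2} w^T A w - b^T w$ so that the line search has a closed form. The matrix, vector, and starting point are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, w, iterations):
    """Minimise E(w) = 1/2 w^T A w - b^T w, A symmetric positive definite."""
    g = [aw - b_i for aw, b_i in zip(matvec(A, w), b)]   # gradient of E
    p = [-g_i for g_i in g]                              # first direction
    for _ in range(iterations):
        if dot(g, g) < 1e-20:
            break                                        # already at the minimum
        Ap = matvec(A, p)
        alpha = -dot(g, p) / dot(p, Ap)                  # exact line search
        w = [w_i + alpha * p_i for w_i, p_i in zip(w, p)]
        g_new = [g_i + alpha * ap for g_i, ap in zip(g, Ap)]
        # Polak-Ribiere: (g_new - g)^T g_new / (g^T g)
        diff = [n - o for n, o in zip(g_new, g)]
        beta = dot(diff, g_new) / dot(g, g)
        p = [max(beta, 0.0) * p_i - n for p_i, n in zip(p, g_new)]
        g = g_new
    return w


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
w_min = conjugate_gradient(A, b, [2.0, 1.0], iterations=2)
```

On an $N$-dimensional quadratic with exact line searches, the method reaches the minimum in at most $N$ iterations, so two iterations suffice here.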

On large, redundant classification problems, stochastic back-propagation is faster, which is why the alternatives for such problems are stochastic gradient descent or the stochastic diagonal Levenberg Marquardt method.

#### Discussion and Conclusion

- Shuffle examples
- Center inputs by subtracting the mean
- Normalize input to standard deviation of 1
- Decorrelate input variables
- Pick a network with the sigmoid function $f(x) = 1.7159 \tanh(\frac{2}{3}x)$
- Set target values; typically -1 and +1
- Initialize the weights to random values; uniform random distribution with mean 0 and standard deviation $\sigma_w = m^{-\frac{1}{2}}$

- If training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient with careful tuning, or use stochastic diagonal Levenberg Marquardt method
- If training set is not too large, or if the task is regression, use Conjugate Gradient