# LeCun Convolutional Neural Networks

#### Nomenclature
- GT: Graph Transformer
- GTN: Graph Transformer Network; allows training all modules to optimize a global performance criterion.
- HMM: Hidden Markov Model
- HOS: Heuristic Over-Segmentation; separating individual characters from their neighbors within a word or sentence.
- KNN: K-Nearest Neighbors
- NN: Neural Network
- OCR: Optical Character Recognition
- PCA: Principal Component Analysis
- RBF: Radial Basis Function
- RS-SVM: Reduced-set Support Vector Method
- SDNN: Space Displacement Neural Network
- SVM: Support Vector Machine
- TDNN: Time Delay Neural Network
- V-SVM: Virtual Support Vector Machine

#### Introduction

The recognition process is split into two modules:
- Feature Extractor: transforms input patterns so they can be represented by low-dimensional vectors or short strings of symbols that can easily be matched or compared and are relatively invariant with respect to transformations and distortions of the input patterns that do not change their nature.
- Classifier: general purpose and trainable.

The feature extractor is specific to the task, so every new problem requires identifying a suitable set of features; this is why feature extraction is a step of its own.

Convolutional Neural Networks incorporate knowledge about invariances of 2D shapes by using local connection patterns, and by imposing constraints on the weights. 

Gradient-based learning: the learning machine computes a function:
$$Y^p = F(Z^p, W)$$

where $Z^p$ is the $p$-th input pattern, and $W$ represents the collection of adjustable parameters in the system. $Y^p$ is the recognized class label of the pattern, or the scores/probabilities associated with each class. The loss function:
$$E^p = D(D^p, F(Z^p, W))$$

measures the discrepancy between the desired output $D^p$ for pattern $Z^p$ and the output produced by the system. The gap between the test error and the training error decreases with the number of training samples:
$$E_{test} - E_{train} = k\left(\frac{h}{P}\right)^{\alpha}$$

where:
- $P$ is the number of training samples
- $h$ is a measure of effective capacity
- $\alpha$ is a number between 0.5 and 1.0
- $k$ is a constant
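
To make the scaling concrete, here is a minimal sketch that evaluates the gap formula for growing training-set sizes. The values chosen for $h$, $\alpha$, and $k$ are purely illustrative assumptions, not figures from the paper.

```python
def generalization_gap(P, h, alpha=0.5, k=1.0):
    """Return k * (h / P) ** alpha, the expected gap E_test - E_train."""
    return k * (h / P) ** alpha

# Hypothetical values: h = 100, alpha = 0.5, k = 1.0.
for P in (1_000, 10_000, 100_000):
    print(P, generalization_gap(P, h=100))
```

With $h$ fixed, the gap shrinks as $P$ grows; increasing $h$ at fixed $P$ widens it.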

#### Gradient-Based Learning

Assume $E(W)$ is continuous and differentiable almost everywhere. The simplest gradient descent algorithm is:
$$W_k = W_{k - 1} - \epsilon \frac{\partial E(W)}{\partial W}$$

where $\epsilon$ is a scalar constant (the step size). A popular minimization procedure is the stochastic gradient algorithm, which updates the parameter vector using a noisy, or approximated, version of the average gradient. In its most common form, $W$ is updated on the basis of a single sample $p_k$:
$$W_k = W_{k - 1} - \epsilon \frac{\partial E^{p_k}(W)}{\partial W}$$

which usually converges considerably faster than regular gradient descent on large training sets with redundant samples. 
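
The two update rules can be compared on a toy problem. The sketch below assumes a hypothetical one-weight model $y = w x$ with per-sample loss $E^p = \frac{1}{2}(w x - y)^2$; the data, step size, and function names are illustrative and not from the paper.

```python
import random

# Toy data generated from y = 2 * x (hypothetical example).
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

def grad(w, x, y):
    """dE^p/dW for the per-sample loss E^p = 0.5 * (w * x - y) ** 2."""
    return (w * x - y) * x

def batch_gd(w, eps=0.1, steps=100):
    """Plain gradient descent: each step uses the average gradient."""
    for _ in range(steps):
        avg = sum(grad(w, x, y) for x, y in data) / len(data)
        w -= eps * avg
    return w

def sgd(w, eps=0.1, steps=100, seed=0):
    """Stochastic gradient: each step uses one randomly chosen sample p_k."""
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)
        w -= eps * grad(w, x, y)  # noisy estimate of the average gradient
    return w

print(batch_gd(0.0), sgd(0.0))
```

Both runs approach $w = 2$. The stochastic version touches a single sample per update, which is where its advantage on large, redundant training sets comes from.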

#### Backpropagation Explained

###### Initialization
$W$ can be initialized randomly from a uniform distribution with mean = 0 and standard deviation of 1.0.

###### Forward Propagation
Start with the input, pass it through each layer of the network in turn, and calculate the actual output of the model.

###### Loss Function
At this point we have the output of the randomly initialized network and the target values the network is trying to match. The loss is the square of the difference between them, so that values are always positive and large errors are penalized more than small ones. 

###### Differentiation
Differentiation underlies the optimization technique that modifies the internal weights of the network to minimize the total loss function defined previously. It calculates the derivative of the loss function (i.e. the rate at which the function changes its value at a given point).

A question to ask is: how much will the total error change if we change an internal weight of the neural network by a small amount $\partial W$?

What matters is the rate of change of the total error $E$ relative to the change in weight. This is computed as:
$$\frac{\partial E}{\partial W}$$

Once we receive an output, we adjust the weight in the following way:
- Check derivative
- If positive, meaning error increases with the weight increase, then decrease the weight.
- If negative, meaning error decreases if we increase the weights, then increase the weight.
- If 0, we do nothing as we have reached a stable point.
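
The sign check above can be sketched with a finite-difference estimate of the derivative. The one-weight model, its target, and the tolerance below are hypothetical choices made only for illustration.

```python
def total_error(w, x=2.0, target=4.0):
    """Squared error of a toy one-weight model y = w * x."""
    return (w * x - target) ** 2

def numerical_derivative(f, w, dw=1e-6):
    """Central finite-difference estimate of dE/dW at w."""
    return (f(w + dw) - f(w - dw)) / (2 * dw)

def direction(w, tol=1e-6):
    """Decide how to move the weight from the sign of the derivative."""
    d = numerical_derivative(total_error, w)
    if d > tol:
        return "decrease"  # error grows with the weight
    if d < -tol:
        return "increase"  # error shrinks as the weight grows
    return "keep"          # derivative is (numerically) zero

for w in (1.0, 2.0, 3.0):
    print(w, direction(w))
```

In practice the derivative is computed analytically by back-propagation rather than by finite differences; the numerical version is mainly useful for checking gradients.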

###### Back-Propagation
The derivative is decomposable, and thus can be back-propagated. We start from the error at the output and differentiate it layer by layer toward the input. Example:  

Input -> Layer 1 (3 * x) -> Layer 2 (2 * x) -> Output

Consider a change in the input of $\Delta x = 0.001$. After Layer 1 the change is $0.003$, and after Layer 2 it is $0.006$. Back-propagation traces the final change of $0.006$ back to the input change of $0.001$ by applying the same factors in reverse order.

Forward-propagation applies a function to the input, and back-propagation requires knowing the derivative of that function. So we need to keep a stack of the function calls made during the forward pass, with their parameters, in order to back-propagate the errors using the derivatives of these functions. 
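
The two-layer example and the stack of function calls can be sketched as follows. The `layers`, `forward`, and `backward` names are illustrative assumptions, not part of any library.

```python
# Each layer is a pair: (forward function, derivative of that function).
layers = [
    (lambda x: 3 * x, lambda x: 3.0),  # Layer 1
    (lambda x: 2 * x, lambda x: 2.0),  # Layer 2
]

def forward(x):
    """Run the forward pass, pushing each derivative and its input on a stack."""
    stack = []
    for f, df in layers:
        stack.append((df, x))
        x = f(x)
    return x, stack

def backward(stack, grad_out=1.0):
    """Pop the stack in reverse order, multiplying by each local derivative."""
    grad = grad_out
    while stack:
        df, x_in = stack.pop()
        grad *= df(x_in)
    return grad

y, stack = forward(1.0)
print(y, backward(stack))  # output and d(output)/d(input)
```

The backward pass recovers the overall factor of 6 that turns an input change of $0.001$ into an output change of $0.006$.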

<br />
<img src=".\images\back_propogation.PNG" />
<br />

- Derivative of the loss function with respect to the output ->
- Derivative of the output with respect to the input variables 

So the final step of the back-propagation:
$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x_i}$$

<br />
<img src=".\images\back_propogation_2.PNG" />
<br />

where $\frac{\partial L}{\partial y}$ is the derivative of the loss function with respect to the output. This is the chain rule: the derivative of the outer function, evaluated at the inner function, multiplied by the derivative of the inner function. 
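
As a small worked instance of the chain rule, assume a hypothetical one-weight layer $y = w x$ with squared loss $L = (y - t)^2$ and the illustrative values $w = 3$, $x = 1$, $t = 2$ (none of these come from the original notes):
$$y = w x = 3, \qquad \frac{\partial L}{\partial y} = 2(y - t) = 2, \qquad \frac{\partial y}{\partial x} = w = 3$$

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x} = 2 \cdot 3 = 6$$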

###### Weight Update
Delta Rule: 
$$W_t = W_{t - 1} - \lambda \nabla_W E$$

where $\nabla_W E$ is the derivative of the error with respect to the weights and $\lambda$ is the learning rate.

- If the derivative is positive, an increase in weight will increase the error, thus the new weight should be smaller.
- If the derivative is negative, an increase in weight will decrease the error, thus we need to increase the weight.
- If the derivative is 0, we are at a stable point (ideally a minimum) and no update is needed.

Stochastic Gradient Descent: updates the weights after each single input is observed, rather than after accumulating the gradient over the whole training set. It usually converges faster and is better suited to most situations.
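
A minimal sketch of the weight update repeated until convergence, assuming a hypothetical one-dimensional error $E(w) = (w - 3)^2$; the learning rate and stopping tolerance are arbitrary illustrative choices.

```python
def grad_E(w):
    """Gradient of the toy error E(w) = (w - 3) ** 2."""
    return 2.0 * (w - 3.0)

def train(w, lam=0.1, tol=1e-8, max_iters=10_000):
    """Apply W_t = W_{t-1} - lam * gradient until the gradient is ~0."""
    for t in range(max_iters):
        g = grad_E(w)
        if abs(g) < tol:  # derivative ~ 0: stable point reached
            return w, t
        w = w - lam * g   # positive gradient lowers w, negative raises it
    return w, max_iters

w_final, iterations = train(0.0)
print(w_final, iterations)
```

The loop stops once the derivative is close to zero, which matches the third case in the list above.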

###### Iterate Until Convergence

#### Example

Assume $X$ is $3 \times 3$ and $W$ is $2 \times 2$, with no padding and a stride of $1$, generating an output $H$ of size $2 \times 2$. While performing the forward pass, cache the variables $X$ and $W$ (this is helpful for performing backpropagation).

The first iteration places a 2x2 window in the top-left of the 3x3 matrix. This maps to position (1, 1) of the next layer, represented as a 2x2 matrix. The value at (1, 1) is the linear combination of the inputs $X$ in the current window, each weighted by the corresponding value of the 2x2 filter $W$.

<br />
<img src=".\images\back_propogation_3.PNG" />
<br />

###### Notation
- $\partial h_{ij} = \frac{\partial L}{\partial h_{ij}}$
- $\partial w_{ij} = \frac{\partial L}{\partial w_{ij}}$

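It is important to note that every weight of the filter is applied at every window position, so any change in a filter weight affects all of the output pixels, and its gradient $\partial w_{ij}$ sums the contributions from all of them.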
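
The forward and backward passes for this 3x3 input and 2x2 filter can be sketched in plain Python. The numeric values of `X`, `W`, and the upstream gradient `dH` are made up for illustration, and the operation is implemented as cross-correlation (no filter flip), which is an assumption about the convention used here.

```python
def conv_forward(X, W):
    """Valid convolution (stride 1, no padding) of square X with square W."""
    n, k = len(X), len(W)
    out = n - k + 1
    return [[sum(W[a][b] * X[i + a][j + b]
                 for a in range(k) for b in range(k))
             for j in range(out)] for i in range(out)]

def conv_backward(X, W, dH):
    """Given dH = dL/dH, return (dL/dW, dL/dX) using the cached X and W."""
    n, k = len(X), len(W)
    out = n - k + 1
    dW = [[0.0] * k for _ in range(k)]
    dX = [[0.0] * n for _ in range(n)]
    for i in range(out):
        for j in range(out):
            for a in range(k):
                for b in range(k):
                    # Each weight collects a term from every output pixel.
                    dW[a][b] += dH[i][j] * X[i + a][j + b]
                    dX[i + a][j + b] += dH[i][j] * W[a][b]
    return dW, dX

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
W = [[1, 0], [0, -1]]
H = conv_forward(X, W)
dW, dX = conv_backward(X, W, [[1.0, 1.0], [1.0, 1.0]])
print(H, dW, dX)
```

Each entry of `dW` accumulates one term per output position, which reflects the fact that a single filter weight influences every output pixel.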

<br />
<img src=".\images\back_propogation_4.PNG" />
<br />

#### Globally Trainable Systems
Consider a system built as a cascade of modules, where module $n$ computes $X_n = F_n(W_n, X_{n - 1})$. If the partial derivative of $E^p$ with respect to $X_n$ is known, then the partial derivatives of $E^p$ with respect to $W_n$ and $X_{n - 1}$ can be computed using the backward recurrence:
$$\frac{\partial E^p}{\partial W_n} = \frac{\partial F}{\partial W}(W_n, X_{n - 1})\ \frac{\partial E^p}{\partial X_n}$$

$$\frac{\partial E^p}{\partial X_{n - 1}} = \frac{\partial F}{\partial X}(W_n, X_{n - 1})\ \frac{\partial E^p}{\partial X_n}$$

where $\frac{\partial F}{\partial W}(W_n, X_{n - 1})$ is the Jacobian of $F$ with respect to $W$ evaluated at the point $(W_n, X_{n - 1})$, and $\frac{\partial F}{\partial X}(W_n, X_{n - 1})$ is the Jacobian of $F$ with respect to $X$. The first equation computes some terms of the gradient of $E^p(W)$, while the second generates a backward recurrence: the back-propagation procedure for neural networks. 

The Jacobian of a vector function is a matrix containing the partial derivatives of all outputs with respect to all the inputs. 
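
For a concrete case, the sketch below assumes a purely linear module $X_n = W_n X_{n - 1}$ (a hypothetical choice of $F$, with no bias or squashing function) and applies the two recurrence equations to it. For this module the Jacobian with respect to $X$ is $W_n$ itself, and it is applied in transposed form to the incoming gradient.

```python
def forward(W, x_prev):
    """X_n = W_n X_{n-1} for a linear module."""
    return [sum(W[i][j] * x_prev[j] for j in range(len(x_prev)))
            for i in range(len(W))]

def backward(W, x_prev, dE_dx):
    """Given dE/dX_n, return (dE/dW_n, dE/dX_{n-1})."""
    rows, cols = len(W), len(x_prev)
    # dE/dW[i][j] = dE/dX_n[i] * X_{n-1}[j]
    dE_dW = [[dE_dx[i] * x_prev[j] for j in range(cols)] for i in range(rows)]
    # dE/dX_{n-1}[j] = sum_i W[i][j] * dE/dX_n[i]
    dE_dx_prev = [sum(W[i][j] * dE_dx[i] for i in range(rows))
                  for j in range(cols)]
    return dE_dW, dE_dx_prev

W_n = [[1.0, 2.0], [3.0, 4.0]]
x_prev = [1.0, -1.0]
x_n = forward(W_n, x_prev)
dE_dW, dE_dx_prev = backward(W_n, x_prev, [1.0, 0.5])
print(x_n, dE_dW, dE_dx_prev)
```

Chaining `backward` calls from the last module to the first passes `dE_dx_prev` of one module in as `dE_dx` of the module before it.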

#### Convolutional Neural Networks

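Convolutional networks combine three architectural ideas to obtain some degree of shift, scale, and distortion invariance: local receptive fields, shared weights, and spatial or temporal sub-sampling.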

###### Local Receptive Fields
Neurons can extract elementary visual features like edges, end-points, corners, etc. These are combined by subsequent layers to determine higher-order features. A complete convolutional layer is composed of several feature maps (each feature map focusing on a different feature i.e. using a different weight vector to allow different features to be derived from each location). Once a feature is detected, its exact location is irrelevant; only its position relative to other features is important.

Sub-sampling layers decrease the resolution of the feature maps by taking local averages. This reduces the sensitivity of the output to shifts and distortions, so that a new observation that differs slightly from the training examples can still be given the correct label. 
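
A minimal sketch of sub-sampling by local averaging over non-overlapping 2x2 windows. The 4x4 feature map is made up for illustration, and the trainable coefficient, bias, and sigmoid that LeNet applies after the average are deliberately left out.

```python
def subsample(fmap, size=2):
    """Average each non-overlapping size x size window of a 2D feature map."""
    rows, cols = len(fmap) // size, len(fmap[0]) // size
    return [[sum(fmap[r * size + a][c * size + b]
                 for a in range(size) for b in range(size)) / size ** 2
             for c in range(cols)] for r in range(rows)]

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
pooled = subsample(fmap)
print(pooled)
```

The output has half as many rows and columns as the input, and a one-pixel shift of a feature inside a window changes the averaged value only slightly.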

LeNet

<br />
<img src=".\images\back_propogation_5.PNG" />
<br />

- $n$: the layer index
- $C_n$: convolutional layer $n$
- $S_n$: sub-sampling layer $n$
- $F_n$: fully connected layer $n$

- Comprises 7 layers (excluding the 32x32 input layer)
- Input is normalized so the mean is near 0
- $C_1$ has 6 feature maps of 28x28 (this size prevents connections from the input from falling off the boundary), where each unit in each map is connected to a 5x5 neighborhood in the input
- $S_2$ has 6 feature maps of 14x14. Each unit in each feature map is connected to a 2x2 neighborhood in the corresponding feature map of $C_1$. The four inputs to a unit in $S_2$ are added, multiplied by a trainable coefficient, and added to a trainable bias. The result is passed through a sigmoid function. The 2x2 receptive fields are non-overlapping, therefore feature maps in this layer have half the number of rows and columns of the feature maps in $C_1$.
- Layer $C_3$ has 16 feature maps of size 10x10. Each unit in each feature map is connected to several 5x5 neighborhoods at identical locations in a subset of $S_2$'s feature maps. Not every $S_2$ feature map is connected to every $C_3$ feature map. This incomplete connection scheme keeps the number of connections within reasonable bounds and breaks symmetry: different feature maps are forced to extract different features because they get different sets of inputs. 
- Layer $S_4$ has 16 feature maps of size 5x5. Each unit in each feature map is connected to a 2x2 neighborhood in the corresponding feature map of $C_3$. 
- Layer $C_5$ has 120 feature maps. Each unit is connected to a 5x5 neighborhood on all of $S_4$'s feature maps. Because $S_4$'s feature maps are also 5x5, each $C_5$ feature map is 1x1, which amounts to a full connection between $S_4$ and $C_5$. It is still labeled a convolutional layer because a larger input would make its feature maps larger than 1x1.
- Layer $F_6$ comes next: it contains 84 units and is fully connected to $C_5$. 
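
The feature-map sizes in the list follow from simple arithmetic: a 5x5 convolution with stride 1 and no padding removes 4 rows and columns, and non-overlapping 2x2 sub-sampling halves them. A small sketch, with hypothetical helper names:

```python
def conv_out(n, k=5):
    """Output width of a k x k convolution with stride 1 and no padding."""
    return n - k + 1

def subsample_out(n, k=2):
    """Output width of non-overlapping k x k sub-sampling."""
    return n // k

size = 32  # the input is 32x32
sizes = {}
for name, op in [("C1", conv_out), ("S2", subsample_out), ("C3", conv_out),
                 ("S4", subsample_out), ("C5", conv_out)]:
    size = op(size)
    sizes[name] = size

print(sizes)
```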

Units in layers up to $F_6$ compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum $a_i$ for unit $i$ is then passed through a sigmoid squashing function to produce the state of unit $i$, denoted by $x_i$.

The squashing function is a scaled hyperbolic tangent:
$$f(a) = A \tanh(Sa)$$

where $A$ is the amplitude and $S$ determines the slope at the origin. The constant $A$ is chosen to be $1.7159$.
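
A short sketch of the squashing function. The notes give only $A$; the value $S = 2/3$ used below is taken from the appendix of the LeCun et al. paper and should be treated as an assumption here.

```python
import math

A = 1.7159     # amplitude, as stated above
S = 2.0 / 3.0  # assumed value of S (from the paper's appendix)

def squash(a):
    """Scaled hyperbolic tangent f(a) = A * tanh(S * a)."""
    return A * math.tanh(S * a)

for a in (-1.0, 0.0, 1.0):
    print(a, squash(a))
```

With these constants, $f(1)$ is very close to $1$ and $f(-1)$ is very close to $-1$, which is the reason for the otherwise odd-looking value of $A$.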

The output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each class, with inputs from $F_6$. The output $y_i$ of each RBF unit is computed as:
$$y_i = \sum_j (x_j - w_{ij})^2$$

This is the squared Euclidean distance between the input vector and the unit's parameter vector: the further the input is from the parameter vector, the larger the RBF output. An RBF output can be interpreted as the unnormalized negative log-likelihood of a Gaussian distribution in the space of configurations of layer $F_6$. 
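
The RBF output layer can be sketched as below. The 3-dimensional input and the three parameter vectors are toy values chosen for illustration; in LeNet-5 the input would be the state of $F_6$.

```python
def rbf_outputs(x, W):
    """y_i = sum_j (x_j - w_ij) ** 2 for each parameter vector w_i in W."""
    return [sum((xj - wij) ** 2 for xj, wij in zip(x, w_i)) for w_i in W]

x = [1.0, -1.0, 1.0]      # hypothetical F6 state
W = [[1.0, -1.0, 1.0],    # parameter vector of class 0
     [-1.0, -1.0, 1.0],   # parameter vector of class 1
     [1.0, 1.0, -1.0]]    # parameter vector of class 2

y = rbf_outputs(x, W)
predicted = min(range(len(y)), key=lambda i: y[i])
print(y, predicted)
```

Because each output is a distance-like penalty, the predicted class is the unit with the smallest output rather than the largest.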

###### Shared Weights


###### Spatial or Temporal Sub-Sampling

#### Loss Function

The simplest output loss function is the Maximum Likelihood Estimation criterion, which in this case is equivalent to the Minimum Mean Squared Error:
$$E(W) = \frac{1}{P} \sum_{p = 1}^ P\ y_{D_p}(Z^p, W)$$

where $y_{D_p}$ is the output of the $D_p$-th RBF unit (the one corresponding to the correct class of input pattern $Z^p$). This cost function is appropriate for most cases, but it allows a collapsing effect: if the RBF parameters are allowed to adapt, a trivial solution exists in which the input is ignored and all RBF outputs are equal to zero. A better criterion that avoids this is:
$$E(W) = \frac{1}{P} \sum_{p = 1}^ P \left( y_{D_p}(Z^p, W) + \log\left(e^{-j} + \sum_i e^{-y_i(Z^p, W)}\right) \right)$$

where $j$ is a positive constant. The second term prevents the collapsing effect when the RBF parameters are learned because it keeps the RBF centers apart from each other. 
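
To see why the extra term discourages collapse, the sketch below evaluates both criteria for a single pattern. The RBF outputs, the number of classes, and the value $j = 1$ are hypothetical; averaging over the $P$ training patterns is omitted.

```python
import math

def mse_loss(y, correct):
    """Per-pattern term of the first criterion: the correct unit's output."""
    return y[correct]

def map_loss(y, correct, j=1.0):
    """Per-pattern term of the second criterion, with positive constant j."""
    return y[correct] + math.log(math.exp(-j) + sum(math.exp(-yi) for yi in y))

collapsed = [0.0] * 10           # all RBF outputs zero: input is ignored
separated = [0.0] + [20.0] * 9   # correct class near, the others far away

print(mse_loss(collapsed, 0), mse_loss(separated, 0))
print(map_loss(collapsed, 0), map_loss(separated, 0))
```

The first criterion cannot tell the two situations apart, while the second assigns a clearly higher loss to the collapsed outputs.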

#### Invariance and Noise Resistance

LeNet-5 is robust to scale variations of up to about a factor of 2, vertical shift variations of plus or minus about half the height of the character, and rotations of up to plus or minus 30 degrees. 