# Neural Network

_Kevin Siswandi_  
**Fundamentals of Machine Learning**  
June 2020

We work with a binary classification problem, with data $\{x_i, y_i\}_{i=1}^n$ and labels $y_i \in \{-1, +1\}$.

## Perceptron

Let's work in homogeneous coordinates. That is, we absorb the intercept into an extra dimension of the parameter vector. As a concrete example, if our linear decision boundary is given by

$$ \tilde{w}^T \tilde{x} = -b $$

we can simply write $w^T x = 0$, with

$$ w = \begin{bmatrix} \tilde{w}_1 \\ \tilde{w}_2 \\ b \end{bmatrix}; \quad
x = \begin{bmatrix} \tilde{x}_1 \\ \tilde{x}_2 \\ 1 \end{bmatrix} $$
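As a quick check of this bookkeeping, the sketch below appends a constant 1 to a 2-D point and confirms that $w^T x$ equals $\tilde{w}^T \tilde{x} + b$. The helper names (`to_homogeneous`, `dot`) and the numeric values are made up for illustration.

```python
def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def to_homogeneous(x_tilde):
    """Append a constant 1 so the intercept becomes an ordinary weight."""
    return list(x_tilde) + [1.0]

# Hypothetical example values (not from the notes).
w_tilde = [2.0, -1.0]
b = 0.5
x_tilde = [1.0, 3.0]

w = w_tilde + [b]            # intercept absorbed as the last weight
x = to_homogeneous(x_tilde)  # data point with a trailing 1

affine = dot(w_tilde, x_tilde) + b  # original form with explicit intercept
homogeneous = dot(w, x)             # same quantity in homogeneous coordinates
print(affine, homogeneous)
```

Both expressions give the same number, so the boundary $\tilde{w}^T \tilde{x} = -b$ and the boundary $w^T x = 0$ describe the same set of points.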

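Geometrically, evaluating $\tilde{w}^T \tilde{x}$ projects each data point onto the normal of the decision boundary (scaled by $\|\tilde{w}\|$); the sign of $w^T x$ then tells us on which side of the boundary the point lies.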

The misclassification error (the number of misclassified samples) can be computed using an indicator function:

$$ \epsilon(w) = \sum_{i=1}^n I(y_i x_i^T w < 0) $$
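A minimal sketch of this count in plain Python, assuming the samples are already in homogeneous coordinates. The toy data and the function name `misclassification_count` are illustrative assumptions.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def misclassification_count(X, y, w):
    """Number of samples with y_i * x_i^T w < 0 (the indicator sum)."""
    return sum(1 for x_i, y_i in zip(X, y) if y_i * dot(x_i, w) < 0)

# Hypothetical toy data; the last entry of each sample is the constant 1.
X = [[1.0, 2.0, 1.0], [2.0, 0.5, 1.0], [-1.0, -1.0, 1.0], [0.5, -2.0, 1.0]]
y = [+1, +1, -1, -1]

w_a = [1.0, 0.0, 0.0]  # splits on the first feature: one sample is wrong
w_b = [0.0, 1.0, 0.0]  # splits on the second feature: all samples are right

print(misclassification_count(X, y, w_a), misclassification_count(X, y, w_b))
```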

This is equivalent to using a left-continuous Heaviside step function as the loss. More generally, we have three choices for the per-sample function that measures misclassification:
1. Heaviside step function
2. Sigmoid (from logistic regression)
3. Hinge function

If we use the hinge function:
* gradient descent is easier, because the loss has an informative slope wherever a sample violates the margin
* the solution is less ambiguous: the hinge function still penalizes a little when a sample is on the correct side but too close to the boundary, so every sample needs to be correct by at least some margin.
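To make the comparison concrete, the three candidates can be written as functions of the signed score $z = y_i x_i^T w$. This is a sketch under two assumptions: the sigmoid surrogate is taken to be $\sigma(-z)$, a smoothed step (the log-loss actually minimised in logistic regression is a different function), and the hinge uses a margin of 1.

```python
import math

def step_loss(z):
    """Heaviside-style count: 1 for a misclassified sample, 0 otherwise."""
    return 1.0 if z < 0 else 0.0

def sigmoid_loss(z):
    """Assumed smooth surrogate sigma(-z) = 1 / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(z))

def hinge_loss(z, margin=1.0):
    """Zero only when the sample is correct by at least `margin`."""
    return max(margin - z, 0.0)

for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(z, step_loss(z), round(sigmoid_loss(z), 3), hinge_loss(z))
```

At $z = 0.5$ the sample is on the correct side, so the step loss is zero, yet the hinge loss is still positive because the margin has not been reached.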

**Note that if the decision boundary is nonlinear, the loss surface generally has local minima**, no matter which loss function we use.

In the case of the Heaviside step function, the loss surface in parameter space, $\epsilon(w)$, is a sum of displaced Heaviside step functions. As such, it:
* is piecewise constant
* has no informative gradient
* has ambiguous local optima
* is non-convex (and so is the sigmoid)
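The "piecewise constant, no informative gradient" point can be seen numerically. In the sketch below (toy data and step size are arbitrary assumptions), nudging each weight slightly leaves the error count unchanged, so a finite-difference gradient is exactly zero.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def error_count(X, y, w):
    """Sum of step losses: number of samples with y_i * x_i^T w < 0."""
    return sum(1 for x_i, y_i in zip(X, y) if y_i * dot(x_i, w) < 0)

# Hypothetical toy data in homogeneous coordinates.
X = [[1.0, 2.0, 1.0], [2.0, 0.5, 1.0], [-1.0, -1.0, 1.0], [0.5, -2.0, 1.0]]
y = [+1, +1, -1, -1]
w = [1.0, 0.0, 0.0]

h = 1e-3  # small step along each coordinate of w
finite_differences = []
for k in range(len(w)):
    w_shifted = list(w)
    w_shifted[k] += h
    change = error_count(X, y, w_shifted) - error_count(X, y, w)
    finite_differences.append(change / h)

print(error_count(X, y, w), finite_differences)
```

The count only changes when a sample crosses the boundary, which is why gradient descent gets no direction to move in on this surface.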

On the other hand, the **hinge function is convex**. A sum of convex functions is also convex, so the overall loss surface is convex as long as the score $x_i^T w$ is linear in $w$. A convex function has no local minima other than global ones: an observer placed anywhere on the surface can 'see' the global optimum by following the slope downhill. Therefore, if we choose a convex error measure such as the hinge function, the global optimum can be found efficiently, although it may still be ambiguous (not unique).
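A small numerical illustration of the convexity claims: a convex function never lies above the chord between two of its points, so its value at a midpoint is at most the average of the endpoint values. The check below searches a few hand-picked pairs for violations. The pairs, the sigmoid form $\sigma(-z)$ and the margin of 1 are assumptions, and finding no violation on a handful of pairs does not prove convexity.

```python
import math

def step_loss(z):
    return 1.0 if z < 0 else 0.0

def sigmoid_loss(z):
    return 1.0 / (1.0 + math.exp(z))  # assumed surrogate sigma(-z)

def hinge_loss(z, margin=1.0):
    return max(margin - z, 0.0)

def violates_midpoint_convexity(loss, a, b):
    """True if loss at the midpoint exceeds the average of the endpoint losses."""
    return loss((a + b) / 2.0) > (loss(a) + loss(b)) / 2.0

# Hand-picked pairs of scores z (illustrative only).
pairs = [(-4.0, 0.0), (-3.0, 1.0), (-1.0, 3.0), (0.0, 2.0)]

step_violated = any(violates_midpoint_convexity(step_loss, a, b) for a, b in pairs)
sigmoid_violated = any(violates_midpoint_convexity(sigmoid_loss, a, b) for a, b in pairs)
hinge_violated = any(violates_midpoint_convexity(hinge_loss, a, b) for a, b in pairs)

print(step_violated, sigmoid_violated, hinge_violated)
```

The step and sigmoid losses each fail the midpoint test on at least one pair, which is enough to show they are non-convex; the hinge passes on every pair tried.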

The hinge function has long been popular for perceptron training. In 1960 the central hinge function (without margin) was used; a finite margin was introduced later. The training loss is

$$ \epsilon(w) = \sum_{i=1}^n \left| -(y_i x_i^T w - \text{margin}) \right|_+ $$

where $|\cdot|_+$ denotes the positive part, or equivalently

$$ \epsilon(w) = \sum_{i=1}^n \max( -y_i x_i^T w + \text{margin}, 0), $$

and training solves $\arg \min_w \epsilon(w)$.
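One way to minimise this loss is full-batch subgradient descent: every sample that is inside the margin contributes $-y_i x_i$ to the subgradient. The sketch below is one possible implementation under assumed settings (toy separable data, learning rate 0.01, 20 steps, margin 1), not a reconstruction of the historical perceptron rule.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def hinge_total(X, y, w, margin=1.0):
    """epsilon(w) = sum_i max(margin - y_i * x_i^T w, 0)."""
    return sum(max(margin - y_i * dot(x_i, w), 0.0) for x_i, y_i in zip(X, y))

def hinge_subgradient(X, y, w, margin=1.0):
    """Subgradient of epsilon(w): -y_i * x_i for each sample inside the margin."""
    g = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        if margin - y_i * dot(x_i, w) > 0:
            for k in range(len(w)):
                g[k] -= y_i * x_i[k]
    return g

def train(X, y, lr=0.01, steps=20, margin=1.0):
    """Full-batch subgradient descent starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = hinge_subgradient(X, y, w, margin)
        w = [w_k - lr * g_k for w_k, g_k in zip(w, g)]
    return w

# Hypothetical, linearly separable toy data in homogeneous coordinates.
X = [[2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [-2.0, -2.0, 1.0], [-1.0, -3.0, 1.0]]
y = [+1, +1, -1, -1]

initial_loss = hinge_total(X, y, [0.0, 0.0, 0.0])
w_hat = train(X, y)
final_loss = hinge_total(X, y, w_hat)
print(initial_loss, final_loss, w_hat)
```

Once every sample clears the margin, the subgradient is zero and the weights stop changing, which is how the hinge loss turns "correct by at least some margin" into a stopping condition.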

The problem, however, is that we still cannot obtain a non-linear decision boundary. One powerful way to get one is to stack multiple perceptrons.

**Summary for perceptron**

During training, we minimise the loss to get the best set of parameters. In practice, one can also use a quadratic loss function as the surrogate loss (this recovers linear regression), but it is not well suited to classification, because a sample is penalized even when it lies on the correct side of the boundary, whenever its score differs from the label.
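The drawback of the quadratic loss shows up with a single confidently correct sample. In the sketch below the score $f$ is assumed to be the raw linear output $x^T w$, and the hinge margin is assumed to be 1; the values of $f$ are made up.

```python
def quadratic_loss(y, f):
    """Squared difference between label and raw score."""
    return (y - f) ** 2

def hinge_loss(y, f, margin=1.0):
    """Zero once the sample is correct by at least `margin`."""
    return max(margin - y * f, 0.0)

y = +1
for f in (1.0, 3.0, 10.0):  # all on the correct side, increasingly confident
    print(f, quadratic_loss(y, f), hinge_loss(y, f))
```

The quadratic loss grows as the score moves further onto the correct side, while the hinge loss stays at zero.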

Let $f(x, w)$ be the output of the perceptron. Define the loss

$$ \sum_{i=1}^{n} l(y_i, f(x_i, w)) = \epsilon(w) $$

The perceptron is trained to find the optimal parameters (usually by gradient descent)

$$ \arg \min_w \epsilon(w) $$

## Multi-Layer Perceptron

Also known as an artificial neural network, the multi-layer perceptron (MLP) stacks perceptrons to obtain a non-linear classifier. The catch is that this works well only when training data are abundant; when data are limited, logistic regression is usually the safer choice. The general strategy is:
1. Learn a perfect decision boundary if given enough data
2. Regularize to make sure it is not overfitting

Generally speaking, given enough training data and enough GPU compute, a neural network tends to outperform simpler models. It is, however, susceptible to overfitting when training data are limited.

Assume we have a dataset that is not linearly separable. Let's examine the specific case of two perceptrons in the first layer (with sigmoid activation functions) and one perceptron in the second layer:
* neither perceptron in the first layer can perfectly separate the data on its own.
* however, in the new feature space spanned by the outputs of the first layer, the data are now linearly separable.
* consequently, the perceptron in the second layer can now perfectly fit the data.
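The three points above can be illustrated on the XOR problem, a standard example of data that are not linearly separable. In this sketch the weights are hand-picked rather than learned, and the second layer is assumed to apply a plain sign threshold; all numeric values are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sign(a):
    return 1 if a >= 0 else -1

# XOR data with labels in {-1, +1}.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, +1, +1, -1]

def hidden_layer(x):
    """Two sigmoid perceptrons with hand-picked (hypothetical) weights."""
    s = x[0] + x[1]
    h1 = sigmoid(10.0 * s - 5.0)   # roughly "x1 OR x2"
    h2 = sigmoid(10.0 * s - 15.0)  # roughly "x1 AND x2"
    return h1, h2

def output_layer(h):
    """Second-layer perceptron: a linear boundary in the (h1, h2) space."""
    return sign(h[0] - h[1] - 0.5)

def count_errors(predict):
    return sum(1 for x_i, y_i in zip(X, y) if predict(x_i) != y_i)

errors_h1 = count_errors(lambda x: sign(hidden_layer(x)[0] - 0.5))
errors_h2 = count_errors(lambda x: sign(hidden_layer(x)[1] - 0.5))
errors_mlp = count_errors(lambda x: output_layer(hidden_layer(x)))
print(errors_h1, errors_h2, errors_mlp)
```

Each first-layer unit misclassifies at least one point when used alone, but the line $h_1 - h_2 = 0.5$ in the hidden feature space separates all four points.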

The key is to use a nonlinear activation function. With a linear activation function, the first layer would only rotate, stretch, or shear the original data, and the stacked layers would collapse into a single linear map. Making the problem linearly separable via a nonlinear mapping is called the 'projection trick'.
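The reason a linear activation does not help can be checked directly: two stacked linear layers $W_2 (W_1 x)$ give the same output as the single layer $(W_2 W_1) x$. The matrices below are arbitrary illustrative values, and the helpers `matvec` and `matmul` are written out by hand for this sketch.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w_rk * x_k for w_rk, x_k in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A @ B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights for two layers with identity (linear) activation.
W1 = [[1, 2], [3, 4]]
W2 = [[0, 1], [1, 1]]
x = [1, -1]

two_layers = matvec(W2, matvec(W1, x))  # layer by layer
one_layer = matvec(matmul(W2, W1), x)   # single collapsed linear map
print(two_layers, one_layer)
```

Since the two results coincide for every input, the stacked network can only represent the linear boundaries a single perceptron already could.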

Summary of multilayer perceptron:
* combines several perceptrons to create a nonlinear decision boundary
* "internal perceptrons" map the data to a new feature space (important: this should be a nonlinear mapping!)