# Artificial Neural Network

Artificial neural networks (ANNs) are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems "learn" tasks by considering examples, generally without task-specific programming. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images. They do this without any a priori knowledge about cats, e.g., that they have fur, tails, whiskers and cat-like faces. Instead, they evolve their own set of relevant characteristics from the learning material that they process.

An ANN is based on a collection of connected units or nodes called artificial neurons. Each connection (a simplified version of a synapse) between artificial neurons can transmit a signal from one to another. The artificial neuron that receives the signal can process it and then signal artificial neurons connected to it.


In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is calculated by a non-linear function of the sum of its inputs. Artificial neurons and connections typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that only if the aggregate signal crosses that threshold is the signal sent. Typically, artificial neurons are organized in layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first (input), to the last (output) layer, possibly after traversing the layers multiple times.<br/><br/>
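
The neuron computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the sigmoid is assumed as the non-linear function (the text does not name one), and `neuron_output` and `layer_output` are hypothetical helper names.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through a non-linear function.

    The sigmoid used here is an assumption; any non-linear function would do.
    """
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def layer_output(inputs, weight_rows, biases):
    """A layer is a list of neurons that all read the same inputs."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Two inputs feeding a layer of three neurons (weights chosen arbitrarily).
hidden = layer_output([0.5, -1.0],
                      [[0.1, 0.2], [0.0, 0.0], [-0.3, 0.4]],
                      [0.0, 0.0, 0.1])
```

Each weight scales the signal on one connection, so adjusting the weights during learning changes how strongly each input influences the output.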

CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html <br/>
10 classes (tags), 50,000 training images, 10,000 test images, each 32×32 pixels<br/>
Tags: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck

### Linear loss function

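Linear classification function:   $$f(x,W) = Wx+b$$  <br/>
Loss function for the linear classifier (multiclass SVM hinge loss):
$$L = \frac{1}{N}\sum_{i=1}^{N}\sum_{j\neq y_{i}} \max(0,f(x_{i};W)_{j} -f(x_{i};W)_{y_{i}} + 1 )$$
where $N$ is the number of training samples and $y_{i}$ is the correct class of sample $x_{i}$. The loss is the average over all training samples.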
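
As a rough sketch of the formula above, the loss can be computed in plain Python as follows. The function names are hypothetical, and the margin of 1 is taken from the `+ 1` term in the formula.

```python
def linear_scores(W, x, b):
    """f(x, W) = Wx + b, with W given as a list of rows (one row per class)."""
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]

def hinge_loss(scores, y, margin=1.0):
    """Sum over the wrong classes j != y of max(0, s_j - s_y + margin)."""
    correct = scores[y]
    return sum(max(0.0, s - correct + margin)
               for j, s in enumerate(scores) if j != y)

def mean_hinge_loss(W, b, X, Y):
    """Average the per-sample loss over all N training samples."""
    return sum(hinge_loss(linear_scores(W, x, b), y) for x, y in zip(X, Y)) / len(X)

# Three class scores with class 0 as the correct one: only class 1 violates the margin.
example = hinge_loss([3.2, 5.1, -1.7], 0)
```

A wrong class contributes to the loss only when its score comes within the margin of the correct class's score, which is why the second wrong class in the example contributes nothing.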

### Penalty function
Penalty methods are a certain class of algorithms for solving constrained optimization problems.

A penalty method replaces a constrained optimization problem by a series of unconstrained problems whose solutions ideally converge to the solution of the original constrained problem. The unconstrained problems are formed by adding a term, called a penalty function, to the objective function that consists of a penalty parameter multiplied by a measure of violation of the constraints. The measure of violation is nonzero when the constraints are violated and is zero in the region where constraints are not violated.

$$L = \frac{1}{N}\sum_{i=1}^{N}\sum_{j\neq y_{i}} \max(0,f(x_{i};W)_{j} -f(x_{i};W)_{y_{i}} + 1 ) + \lambda R(W)$$
where <br/>
$\lambda$ is the penalty parameter and $R(W)$ is the regularization penalty (L2):
$$R(W) = \sum_{k}^{ }\sum_{l}^{ }w_{k,l}^{2}$$
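
A minimal sketch of the penalty term, assuming `W` is stored as a list of rows. `l2_penalty` and `regularized_loss` are hypothetical names, and the value of `lam` below is arbitrary.

```python
def l2_penalty(W):
    """R(W): the sum of the squares of every entry w_{k,l} of W."""
    return sum(w * w for row in W for w in row)

def regularized_loss(data_loss, W, lam):
    """Total loss = data loss + lambda * R(W)."""
    return data_loss + lam * l2_penalty(W)

W = [[1.0, 2.0], [3.0, 4.0]]
total = regularized_loss(2.9, W, lam=0.1)  # 2.9 + 0.1 * 30
```

A larger `lam` pushes the optimum toward smaller weights, at the cost of a higher data loss.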

### Softmax
* SVM: outputs a score for each class
* Softmax: outputs a probability for each class

In mathematics, the softmax function, or normalized exponential function, is a generalization of the logistic function that "squashes" a K-dimensional vector $\mathbf{z}$ of arbitrary real values to a K-dimensional vector $\sigma(\mathbf{z})$ of real values in the range (0, 1) that add up to 1. The function is given by
$$\sigma :\mathbb {R} ^{K}\to (0,1)^{K}$$
$$\sigma (\mathbf {z} )_{j}={\frac {e^{z_{j}}}{\sum _{k=1}^{K}e^{z_{k}}}}$$
where $j = 1, \ldots, K$.
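
The definition can be sketched directly in plain Python. Subtracting the largest entry before exponentiating is an addition not found in the formula: it leaves the result unchanged, because the common factor cancels in the ratio, and it avoids overflow for large inputs.

```python
import math

def softmax(z):
    """Map K real values to K values in (0, 1) that sum to 1."""
    m = max(z)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in z]   # e^{z_j - m}
    total = sum(exps)                     # sum over k of e^{z_k - m}
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```

Larger inputs receive larger probabilities, and the ordering of the inputs is preserved in the output.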
<br/><br/>

In probability theory, the output of the softmax function can be used to represent a categorical distribution – that is, a probability distribution over K different possible outcomes. In fact, it is the gradient-log-normalizer of the categorical probability distribution.

<br/>
The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression), multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.

<br/>
Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the j'th class given a sample vector x and a weighting vector w is:

$${\displaystyle P(y=j\mid \mathbf {x} )={\frac {e^{\mathbf {x} ^{\mathsf {T}}\mathbf {w} _{j}}}{\sum _{k=1}^{K}e^{\mathbf {x} ^{\mathsf {T}}\mathbf {w} _{k}}}}}$$

<br/>
This can be seen as the composition of K linear functions $\mathbf {x} \mapsto \mathbf {x} ^{\mathsf {T}}\mathbf {w} _{1},\ldots ,\mathbf {x} \mapsto \mathbf {x} ^{\mathsf {T}}\mathbf {w} _{K}$ and the softmax function (where $\mathbf {x} ^{\mathsf {T}}\mathbf {w}$ denotes the inner product of $\mathbf {x}$ and $\mathbf {w}$). The operation is equivalent to applying a linear operator defined by $\mathbf {w}$ to vectors $\mathbf {x}$, thus transforming the original, probably high-dimensional, input to vectors in a K-dimensional space ${\displaystyle \mathbb {R} ^{K}}$.
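
The composition of K linear functions with the softmax can be sketched as below. This assumes one weight vector per class and no bias term, matching the formula for $P(y=j\mid \mathbf{x})$. `predict_proba` is a hypothetical name, and the weights are illustrative rather than learned.

```python
import math

def predict_proba(x, weight_vectors):
    """P(y = j | x) for each class j, given one weight vector w_j per class."""
    # K linear functions: the inner products x^T w_j
    logits = [sum(xi * wi for xi, wi in zip(x, w)) for w in weight_vectors]
    # Softmax over the K logits (shifted by the max for numerical stability)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A 2-dimensional input and K = 3 classes.
p = predict_proba([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
```

Here the logits are 1, 2 and 0, so the second class receives the highest probability.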