<img src="images/thro.png" align="right"> 
# A2I2 - Artificial Neural Networks (ANN)

## Homework - Part 1: History of ANN, Artificial Neurons, Activation Functions

Please work through this notebook and solve the exercises **before** the lecture.

***Expected time to complete the notebook: 45min***

*-- Markus Breunig, Kai Höfig*

# Introduction
Artificial neural networks (ANN) are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules. ANNs are primarily used for classification and regression problems. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images (supervised learning / classification). They do this without any prior knowledge of cats, for example, that they have fur, tails, whiskers and cat-like faces. Instead, they automatically generate identifying characteristics from the examples that they process.

An ANN is based on a collection of connected nodes called artificial neurons. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal processes it and can then signal the neurons connected to it.

In ANN implementations, the "signal" at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called edges. Neurons and edges have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times.

The original goal of the ANN approach was to solve problems in the same way that a human brain would. But over time, attention moved to performing specific tasks, leading to deviations from biology. ANNs have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games, medical diagnosis, and even in activities that have traditionally been considered as reserved to humans, like painting.
[source: https://en.wikipedia.org/wiki/Artificial_neural_network]

# History of ANN
Warren McCulloch and Walter Pitts (1943) opened the subject by creating a computational model for neural networks. Rosenblatt (1958) created the perceptron. The first functional networks with many layers were published by Ivakhnenko and Lapa in 1965. The basics of continuous backpropagation (the primary means of training an ANN) were derived in the context of control theory by Kelley in 1960 and by Bryson in 1961.

In 1973, Dreyfus used backpropagation to adapt parameters of controllers in proportion to error gradients. Werbos's (1975) backpropagation algorithm enabled practical training of multi-layer networks. In 1982, he applied Linnainmaa's AD method to neural networks in the way that became widely used. Thereafter research stagnated following Minsky and Papert (1969), who discovered that basic perceptrons were incapable of processing the exclusive-or circuit and that computers lacked sufficient power to process useful neural networks. In 1992, max-pooling was introduced to help with least-shift invariance and tolerance to deformation to aid 3D object recognition. Schmidhuber adopted a multi-level hierarchy of networks (1992) pre-trained one level at a time by unsupervised learning and fine-tuned by backpropagation.

The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), in the form of complementary MOS (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s. In 2012, Ng and Dean created a network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images. Unsupervised pre-training and increased computing power from GPUs and distributed computing allowed the use of larger networks, particularly in image and visual recognition problems, which became known as "deep learning".

Ciresan and colleagues (2010) showed that despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks. Between 2009 and 2012, ANNs began winning prizes in ANN contests, approaching human level performance on various tasks, initially in pattern recognition and machine learning. For example, the bi-directional and multi-dimensional long short-term memory (LSTM) won three competitions in connected handwriting recognition in 2009 without any prior knowledge about the three languages to be learned.

Ciresan and colleagues built the first pattern recognizers to achieve human-competitive/superhuman performance on benchmarks such as traffic sign recognition (IJCNN 2012).

[source: https://en.wikipedia.org/wiki/Artificial_neural_network]

# The Artificial Neuron

An artificial neuron is a mathematical function conceived as a model of the biological neurons in a neural network. Artificial neurons are the elementary units of an artificial neural network. The artificial neuron receives one or more inputs and sums them to produce an output (or activation). Usually each input is separately weighted, and the sum is passed through a non-linear function known as an activation function (also called a transfer function). Activation functions usually have a sigmoid shape, but they may also take the form of other non-linear functions, piecewise linear functions, or step functions. They are also often monotonically increasing, continuous, differentiable and bounded.
[source: https://en.wikipedia.org/wiki/Artificial_neuron]

## Basic structure
For a given artificial neuron, let there be $m + 1$ inputs with signals $x_0$ through $x_m$ and weights $w_0$ through $w_m$. Usually, the $x_0$ input is assigned the value $-1$ (or $+1$), which makes it a bias input with $-w_0$ (or $w_0$). This leaves only $m$ actual inputs to the neuron: from $x_1$ to $x_m$.

The output of the neuron is:
\begin{equation*}
y=g \left( \sum _{i=0}^{m}w_{i}x_{i} \right)
\end{equation*}

Where $g$ is the activation function (commonly a threshold function).

<img src="images/artificial_neuron.png" width="400">

The output is analogous to the axon of a biological neuron, and its value propagates to the input of the next layer, through a synapse. It may also exit the system, possibly as part of an output vector.

The neuron has no learning process as such. Its weights are calculated and its threshold value is predetermined.
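The output formula above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `neuron_output` and `threshold` and the example weights are assumptions made here, not part of the original material, and the threshold activation is assumed to return 1 for sums greater than or equal to zero.

```python
def neuron_output(inputs, weights, activation, bias_input=-1.0):
    """Compute y = g(sum_i w_i * x_i) with x_0 fixed to bias_input.

    inputs  : the m actual inputs x_1..x_m
    weights : the m + 1 weights w_0..w_m (w_0 belongs to the bias input)
    """
    x = [bias_input] + list(inputs)  # prepend the bias input x_0
    if len(x) != len(weights):
        raise ValueError("expected one weight per input plus w_0")
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return activation(weighted_sum)


def threshold(u):
    """Threshold activation (assumed convention: 1 if u >= 0, else 0)."""
    return 1 if u >= 0 else 0


# Hypothetical example values, chosen only for illustration:
# weighted sum = 1.0 * (-1) + 2.0 * 0.5 + 0.5 * 2.0 = 1.0
y = neuron_output(inputs=[0.5, 2.0], weights=[1.0, 2.0, 0.5], activation=threshold)
print(y)  # -> 1
```

Passing the activation as a function argument keeps the weighted sum separate from the choice of $g$, so the same sketch can be reused with other activation functions.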

## Biological models
<img src="images/neuron.png" align="right" width="300">

Artificial neurons are designed to mimic aspects of their biological counterparts.
- Dendrites: In a biological neuron, the dendrites act as the input vector. These dendrites allow the cell to receive signals from a large (>1000) number of neighboring neurons. As in the above mathematical treatment, each dendrite is able to perform "multiplication" by that dendrite's "weight value." The multiplication is accomplished by increasing or decreasing the ratio of synaptic neurotransmitters to signal chemicals introduced into the dendrite in response to the synaptic neurotransmitter. A negative multiplication effect can be achieved by transmitting signal inhibitors (i.e. oppositely charged ions) along the dendrite in response to the reception of synaptic neurotransmitters.
- Soma: In a biological neuron, the soma acts as the summation function, seen in the above mathematical description. As positive and negative signals (exciting and inhibiting, respectively) arrive in the soma from the dendrites, the positive and negative ions are effectively added in summation, by simple virtue of being mixed together in the solution inside the cell's body.
- Axon: The axon gets its signal from the summation behavior which occurs inside the soma. The opening to the axon essentially samples the electrical potential of the solution inside the soma. Once the soma reaches a certain potential, the axon will transmit an all-in signal pulse down its length. In this regard, the axon behaves as the ability for us to connect our artificial neuron to other artificial neurons.

Unlike most artificial neurons, however, biological neurons fire in discrete pulses. Each time the electrical potential inside the soma reaches a certain threshold, a pulse is transmitted down the axon. This pulsing can be translated into continuous values. The rate (activations per second, etc.) at which an axon fires converts directly into the rate at which neighboring cells get signal ions introduced into them. The faster a biological neuron fires, the faster nearby neurons accumulate electrical potential (or lose electrical potential, depending on the "weighting" of the dendrite that connects to the neuron that fired). It is this conversion that allows computer scientists and mathematicians to simulate biological neural networks using artificial neurons which can output distinct values (often from −1 to 1).

## Artificial Neuron for the AND-Function

With a single artificial neuron, we can already implement some simple functions. Let us use this model of an artificial neuron to implement the Boolean AND function. Let the activation function $g$ be a threshold function that returns $0$ if the weighted sum of the input values is less than zero and $1$ otherwise.

For the Boolean AND function, two input variables $x_1$ and $x_2$ are required, each of which can have the values $0$ or $1$. The result is $1$ if both inputs are $1$, otherwise $0$. Now the weights must be set so that the desired result is achieved: $w_0=1.5$, $w_1=1$, $w_2=1$.

The weight $w_0$ acts here as a threshold value (bias). The activation function itself switches at zero. Since the fixed input $x_0 = -1$ causes $w_0$ to be subtracted from the weighted sum, a nonzero $w_0$ shifts the effective threshold to another position, here to $1.5$. Any other threshold value between 1 and 2 can also be selected for the AND function.
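As a quick check, the AND neuron can be evaluated for all four input combinations. The sketch below uses the weights from the text. The function name `and_neuron` is an assumption made for illustration, and because the weighted sum is never exactly zero for these inputs, the behaviour of the threshold function at zero does not matter here.

```python
def and_neuron(x1, x2, w0=1.5, w1=1.0, w2=1.0):
    """Single neuron with fixed bias input x0 = -1 and a threshold activation."""
    x0 = -1.0
    u = w0 * x0 + w1 * x1 + w2 * x2  # weighted sum of all inputs
    return 1 if u >= 0 else 0


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_neuron(x1, x2))
# 0 0 0
# 0 1 0
# 1 0 0
# 1 1 1
```

Calling `and_neuron(x1, x2, w0=1.2)` or `w0=1.9` produces the same truth table, which illustrates that any threshold between 1 and 2 works.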

## Other Functions
<img src="images/and_or_xor.png" align="right" width = "500">
Other functions can be implemented similarly as long as they are linearly separable. The weights describe nothing more than a hyperplane (in 2D, as in the example, a straight line) that separates the desired output values zero and one. So the Boolean OR can easily be implemented, but not the Boolean XOR, as shown in the figure.
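The difference between a linearly separable function and XOR can also be illustrated by brute force. The following sketch tries every weight combination on a coarse grid and counts how many reproduce a given truth table. The grid, the helper names and the convention that the threshold function returns 1 for sums greater than or equal to zero are assumptions made here. A grid search is of course no proof, only an illustration.

```python
from itertools import product


def neuron(x1, x2, w0, w1, w2):
    """Single neuron with bias input x0 = -1 and a threshold activation."""
    u = -w0 + w1 * x1 + w2 * x2
    return 1 if u >= 0 else 0


def count_solutions(truth_table, grid):
    """Count the weight triples (w0, w1, w2) on the grid that reproduce the table."""
    count = 0
    for w0, w1, w2 in product(grid, repeat=3):
        if all(neuron(x1, x2, w0, w1, w2) == y
               for (x1, x2), y in truth_table.items()):
            count += 1
    return count


AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

grid = [i * 0.5 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0

print(count_solutions(AND, grid) > 0)  # -> True
print(count_solutions(XOR, grid))      # -> 0
```

For XOR the count stays at zero no matter how fine the grid is: the input $(0,0)$ requires $w_0 > 0$, the inputs $(0,1)$ and $(1,0)$ require $w_1 \geq w_0$ and $w_2 \geq w_0$, and then $w_1 + w_2 - w_0 \geq w_0 > 0$ forces the output for $(1,1)$ to be $1$.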

Learning or training now means that the weights are automatically determined from an annotated sample.


## <span style="color:red">*Exercise ANN:1:1*</span>

<span style="color:blue">a) What does an artificial neuron look like to implement the Boolean OR function?</span>

<span style="color:blue">b) What does an artificial neuron look like to implement the Boolean NOT function?</span>

In [None]:
# Solution for ANN:1:1

# Activation Functions

The activation function of a neuron is chosen to have a number of properties which either enhance or simplify the network containing the neuron. Crucially, a non-linear function has to be used if we want to be able to implement non-linearly-separable functions by combining multiple neurons. Let us look at the most widely used activation functions:
<img src="images/activation_functions.png" size="200">

### Step function (Sprungfunktion)
The output y of this transfer function is binary, depending on whether the input meets a specified threshold $\theta$. The "signal" is sent, i.e. the output is set to one, if the activation meets the threshold.

\begin{equation*}
y=\begin{cases}1&{\text{if }}u\geq \theta \\0&{\text{if }}u<\theta \end{cases}
\end{equation*}

This function is used in perceptrons and often shows up in many other models. It performs a division of the space of inputs by a hyperplane. It is especially useful in the last layer of a network intended to perform binary classification of the inputs. It can be approximated by sigmoidal functions by assigning large values to the weights. The step function has one big drawback, though: it is discontinuous and therefore non-differentiable at the threshold value. We will see later on that the first derivative of the activation function is very important when training a network of neurons. Therefore, continuous and differentiable approximations of the step function are used instead.
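The claim that a sigmoidal function approaches the step function for large weights can be made concrete with the logistic sigmoid $1/(1+e^{-x})$. In the sketch below, the scaling factor `k` plays the role of a large weight. The names and sample points are assumptions chosen for illustration, and the threshold is taken to be $\theta = 0$.

```python
import math


def step(u, theta=0.0):
    """Step function: 1 if the input meets the threshold, else 0."""
    return 1.0 if u >= theta else 0.0


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


sample_points = (-0.5, -0.1, 0.1, 0.5)
for k in (1, 10, 100):
    # Scaling the input by k corresponds to using a weight of size k.
    values = [round(sigmoid(k * u), 3) for u in sample_points]
    print(k, values)

print([step(u) for u in sample_points])
```

With growing `k` the sigmoid values move towards 0 on the left of the threshold and towards 1 on the right. Exactly at the threshold the sigmoid stays at 0.5 for every `k`, so the approximation never matches the step function there.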

### Sigmoid function and hyperbolic tangent
For both the sigmoid function $g_s(x)$ and the hyperbolic tangent $g_t(x)$, the first derivative can be calculated easily:

\begin{equation*}
g_s(x) = \frac{1}{1+e^{-x}}
\end{equation*}

\begin{equation*}
g'_s(x) = g_s(x)(1-g_s(x))
\end{equation*}

and

\begin{equation*}
g_t(x) = \tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}
\end{equation*}

\begin{equation*}
g'_t(x) = 1-g_t^2(x)
\end{equation*}
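The two derivative formulas can be checked numerically against a central finite difference. This is only a sanity check under assumed names (`numeric_derivative` and the step size `h` are choices made here), not a derivation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Closed form g_s'(x) = g_s(x) * (1 - g_s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_prime(x):
    """Closed form g_t'(x) = 1 - g_t(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def numeric_derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 1.5):
    print(x,
          sigmoid_prime(x), numeric_derivative(sigmoid, x),
          tanh_prime(x), numeric_derivative(math.tanh, x))
```

In both cases the derivative is expressed through the function value itself, so a value that has already been computed can be reused when the derivative is needed.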

### ReLU, leaky ReLU and Softplus
These are currently the most widely used activation functions. They have mostly replaced Sigmoid and tanh, as they mitigate the *vanishing gradient* problem, which is especially severe in deep learning models. The vanishing gradient problem refers to the property that the first derivative (gradient) of Sigmoid and tanh is large around the threshold (0) but gets closer and closer to 0 further away, as the functions become flatter and flatter: the gradient vanishes. ReLU $g_r$, leaky ReLU $g_l$ and Softplus $g_p$ address this:

\begin{equation*}
g_r(x) = \max(0,x) = \begin{cases}0&{ x \leq 0} \\x&{x>0} \end{cases}
\end{equation*}

\begin{equation*}
g'_r(x) = \begin{cases}0&{ x \leq 0} \\1&{x>0} \end{cases}
\end{equation*}

and

\begin{equation*}
g_l(x) = \begin{cases}0.01x&{ x \leq 0} \\x&{x>0} \end{cases}
\end{equation*}

\begin{equation*}
g'_l(x) = \begin{cases}0.01&{ x \leq 0} \\1&{x>0} \end{cases}
\end{equation*}

and
\begin{equation*}
g_p(x)=\ln(1+e^x)
\end{equation*}

\begin{equation*}
g'_p(x)=\frac{1}{1+e^{-x}}
\end{equation*}

ReLU is continuous and differentiable everywhere except at the threshold (there, simply choose 0 or 1 as the derivative). Its gradient still vanishes for values less than zero; leaky ReLU introduces a small gradient there as well. Softplus is a smooth approximation of ReLU and is differentiable everywhere.
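The three functions and their derivatives translate directly into code. The sketch below also prints the sigmoid gradient for comparison, to show how it shrinks away from zero while the ReLU gradient stays at 1 for positive inputs. The function names and sample points are assumptions made here. The rewritten form of Softplus, $\max(x,0) + \ln(1+e^{-|x|})$, is mathematically equal to $\ln(1+e^x)$ and is used only to avoid overflow for large inputs.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_prime(x):
    return 1.0 if x > 0 else 0.0  # derivative at 0 chosen as 0


def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x


def leaky_relu_prime(x, alpha=0.01):
    return 1.0 if x > 0 else alpha


def softplus(x):
    # Equal to ln(1 + e^x), written so that exp() never sees a large argument.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_prime(x):
    # The logistic sigmoid; adequate for the moderate inputs used here.
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


for x in (-10.0, -1.0, 1.0, 10.0):
    print(x, sigmoid_prime(x), relu_prime(x),
          leaky_relu_prime(x), softplus_prime(x))
```

At $x = 10$ the sigmoid gradient is already below $10^{-4}$, whereas the gradients of ReLU and leaky ReLU are exactly 1 and the Softplus gradient is close to 1.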

### Identity function
Another widely used activation function is the identity $g_i(x)=x$ - one could say we simply output the sum without using an activation function. It is widely used in the final layer of networks for regression problems.

### Softmax
Softmax is often used in the final layer of networks for classification problems. It cannot be shown graphically, as the output depends on all neurons in the layer, not on a single value $x$ as for the other functions. Let us collect the outputs $x_i$ of all these neurons in a vector $\vec{x}$.
\begin{equation*}
g_{m j}(\vec{x})=\frac{e^{x_j}}{\sum_{i}{e^{x_i}}}
\end{equation*}

\begin{equation*}
\frac{\partial g_{mj}(\vec{x})}{\partial x_i}=g_{m j}(\vec{x})(\delta_{ji}-g_{m i}(\vec{x}))
\end{equation*}

where $\delta_{ji}=1$ if $i=j$ and zero otherwise.

Softmax is a smooth (and differentiable) approximation of the maximum function, i.e. it increases large values and decreases small values. Its main advantage is that the sum of all values is always equal to $1$, so the values can be interpreted as probabilities and thus compared to each other.
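A small sketch of Softmax and its partial derivatives, following the two formulas above. Subtracting the largest input before exponentiating is a common numerical-stability trick that is not part of the formula in the text. It leaves the result unchanged because the common factor cancels in the fraction. The function names and the example vector are assumptions made for illustration.

```python
import math


def softmax(x):
    """Softmax of a list of values; returns a list of the same length."""
    shift = max(x)  # stability trick: avoids overflow in exp()
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(x):
    """Entry [j][i] is d g_j / d x_i = g_j * (delta_ji - g_i)."""
    g = softmax(x)
    n = len(g)
    return [[g[j] * ((1.0 if i == j else 0.0) - g[i]) for i in range(n)]
            for j in range(n)]


x = [1.0, 2.0, 3.0]  # hypothetical outputs of three neurons
p = softmax(x)
print(p)
print(sum(p))

for row in softmax_jacobian(x):
    print(row)
```

The printed values sum to 1 (up to rounding), and the largest input receives the largest share. Each row of the Jacobian sums to zero.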

The threshold used in all examples above was set to zero. In practice, it should be variable and learned by the network together with the weights. This is achieved by using the bias (weight $w_0$) as shown above - it moves the threshold to the left or the right. This trick is convenient as we avoid introducing a special bias parameter - the bias behaves just like all other weights during training.


## <span style="color:red">*Exercise ANN:1:2*</span>

<span style="color:blue">Calculate the output of the following neuron for each of the following activation functions:</span>
<img src="images/exercise_neuron.png" width="200">

<span style="color:blue">a) step function</span>

<span style="color:blue">b) Sigmoid</span>

<span style="color:blue">c) ReLU</span>

In [None]:
# Solution for ANN:1:2
