<a id="perceptron"></a>
# Perceptron model 

## "The Psychologists"

- Original motivation of earliest neural model, perceptrons, from "electronic" modeling of perception
- Influence of psychology still visible in AI: visual processing, acoustic processing, natural language processing

### The Hero: 

The psychologist Frank Rosenblatt and the Mark I perceptron:

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRozuUtQVt1EyFVVfovXp5tC9iP3f5mM7tMy3jAlVaarA7gf_zE" width=600 heigth=600>


### The hardware:
<img src="https://s3.amazonaws.com/s3.timetoast.com/public/uploads/photos/7146113/Mark-I.jpeg?1477813660" width=600 heigth=600>

- Original perceptron models and their update mechanisms were not aimed at digital computers but at specialized analog hardware!

(Hardware innovation - though not analog, but specialized - is kicking in again, see [here](https://www.engineering.com/Hardware/ArticleID/16753/The-Great-Debate-of-AI-Architecture.aspx).)


### The learning algorithm:
<img src="http://www.rutherfordjournal.org/images/TAHC_perceptron.jpg" width=600 heigth=600>

Inspiration of "learning rules" from ["Hebbian learning"](https://en.wikipedia.org/wiki/Hebbian_theory):
- Neural learning relying on local information only
- Correlation patterns of neuron firing strengthen the synaptic connections
- Colloquially: "What fires together, wires together"
- Rather vague learning rule not applicable in practice
- Rather limited solutions until advent of backpropagation

(Backprop is also extreme, since it relies too heavily on a distant supervision signal - so there may be a way to have a "semi-Hebbian" learning procedure - see "synthetic gradients" in a later lecture.)


## Biological motivation (recapitulation)

### Representation
<img src="http://drive.google.com/uc?export=view&id=1tedMjIowYM8Y68C8fRrZJ2JRsFRQx1P5">

### Thresholding

- Neurons do not "fire" continuously
- Their activation has to be modelled with some kind of a step function

<img src="http://www.saedsayad.com/images/ANN_Unit_step.png">

Simplest "nonlinearity" - later we will encounter a plethora of others

<img src="http://slideplayer.com/slide/5214241/16/images/5/Perceptron:+Linear+threshold+unit.jpg">

**Attention**

- $x_{1..n}$ are the input values, "input activations"
- $x_{0}$ is also present! -- This is the  "bias unit", or "bias term"
<img src="https://raw.githubusercontent.com/qingkaikong/blog/master/39_ANN_part2_step_by_step/figures/figure1_perceptron_structure.jpg">
- Shouldn't be confused with the concept of "bias" met when discussing overfitting, although the semantics is similar: a general "prejudice" which determines the behaviour of the system

### "Biological inspiration"

<img src="https://scontent-vie1-1.xx.fbcdn.net/v/t1.0-9/40956246_2063218163722756_3056192324313808896_n.jpg?_nc_cat=103&_nc_ht=scontent-vie1-1.xx&oh=26b310abeac119852a92902b0b7ebd2f&oe=5CE0B0B9" width=55%>

[source](https://www.facebook.com/photo.php?fbid=2063218160389423&set=gm.2075284019202050&type=3&theater)

## Capable of modeling logical operations
- Logic considered pinnacle of cognitive activities
- "It can learn, it models logic, what else would be needed?"
- Perceptron's problems with modeling certain logical functions had huge effect in history of AI. 

##  Artificial neuron -- mathematical model

(the mathematical discussion follows mainly that of [Hal Daumé III](http://ciml.info/dl/v0_99/ciml-v0_99-ch04.pdf))

### Activation function

with 
- $\mathbf x = \langle x_1,...,x_D \rangle $ incoming activations, 
- $\mathbf w = \langle w_1,...,w_D \rangle$ weights, and
- $b$ bias

the outgoing activation is

$a(x_1,...,x_D) = \sum_{d=1}^D w_d x_d + b$

If $a(\mathbf{x}) > 0$ then the input is classified as a positive instance; if $a(\mathbf{x}) \leq 0$ then as a negative instance.
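The activation and the classification rule can be sketched in a few lines of plain Python. The weights, bias and inputs below are hypothetical values chosen only for illustration, and the function names `activation` and `classify` are not taken from any library.

```python
def activation(w, x, b):
    """Weighted sum of the inputs plus the bias: a(x) = sum_d w_d * x_d + b."""
    return sum(w_d * x_d for w_d, x_d in zip(w, x)) + b


def classify(w, x, b):
    """Return +1 if the activation is strictly positive, otherwise -1."""
    return 1 if activation(w, x, b) > 0 else -1


# Hypothetical weights and bias, chosen only for illustration
w = [0.5, -1.0]
b = 0.25

print(activation(w, [1.0, 0.5], b))  # 0.5 - 0.5 + 0.25
print(classify(w, [1.0, 0.5], b))    # positive activation -> +1
print(classify(w, [0.0, 1.0], b))    # negative activation -> -1
```

Note that an activation of exactly 0 is treated as a negative instance, in line with the rule above.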




<a id="learning"></a>
# 3 Learning

## Inspiration: Hebbian learning

[Hebbian learning rule](https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Hebbian_Learning)
- One of the oldest learning rules
- If there is a high correlation between the outputs of two neurons at the two ends of a synapse ("they fire together") then the strength of the synapse should be increased ("what fires together wires together"). 

$$w_{ij}[n+1]=w_{ij}[n]+\eta x_{i}[n]x_{j}[n]$$

Where:

- $n$ is the current time step.
- $x_{i},x_{j}$ are the activations of the two neurons
- $w$ is the strength of the synapse (later: "weight")
- $\eta$ is the learning rate
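A minimal sketch of the rule above for a single synapse, assuming two hypothetical binary activation traces (1 = the neuron fired at that time step). The helper name `hebbian_update` and the value of the learning rate are illustrative choices.

```python
def hebbian_update(w_ij, x_i, x_j, eta=0.1):
    """One Hebbian step: w_ij[n+1] = w_ij[n] + eta * x_i[n] * x_j[n]."""
    return w_ij + eta * x_i * x_j


# Hypothetical activation traces of the two neurons over five time steps
x_i_trace = [1, 0, 1, 1, 0]
x_j_trace = [1, 0, 0, 1, 1]

w = 0.0
for x_i, x_j in zip(x_i_trace, x_j_trace):
    w = hebbian_update(w, x_i, x_j, eta=0.5)

# The weight grew only at the two steps where both neurons fired together
print(w)
```

With non-negative activations the weight can only grow under this rule, which is one reason why the plain Hebbian rule is hard to use in practice as it stands.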

## The perceptron algorithm

1. for all $d\in 1..D$: $w_d \leftarrow 0$ (initialize weights)
2. $b \leftarrow 0$ (initialize bias) 
3. $\mathit{EpochCount}$ times: for all $(\mathbf{x}, y)$ training examples
   - Calculate the $a= \mathbf w \mathbf x + b$ activation
   - If $ya \leq 0$:
       - $\mathbf{w} \leftarrow \mathbf{w} + y\mathbf{x}$
       - $b \leftarrow b + y$
       
(In somewhat more complex formulations there is a learning rate parameter as well (frequently referred to as $\eta$). With this the learning step is along the lines of:

- $\mathbf{w} \leftarrow \mathbf{w} + \eta y\mathbf{x}$
- $b \leftarrow b + \eta y$

accordingly, the simple (nonetheless fully functional) version above uses $\eta=1$ learning rate.)
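The algorithm can be sketched directly in plain Python. This is an illustrative implementation under the assumption that labels are encoded as $y \in \{-1, +1\}$; the toy data set (logical AND) and the function name `train_perceptron` are hypothetical choices made for the example.

```python
def train_perceptron(data, epoch_count, eta=1.0):
    """Perceptron algorithm; data is a list of (x, y) pairs with y in {-1, +1}."""
    dim = len(data[0][0])
    w = [0.0] * dim          # step 1: initialize weights
    b = 0.0                  # step 2: initialize bias
    for _ in range(epoch_count):             # step 3: EpochCount passes
        for x, y in data:
            a = sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
            if y * a <= 0:                    # wrong side of (or on) the boundary
                w = [w_d + eta * y * x_d for w_d, x_d in zip(w, x)]
                b = b + eta * y
    return w, b


# Toy data set: logical AND, with labels encoded as -1 / +1
and_data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]

w, b = train_perceptron(and_data, epoch_count=20)
predictions = [
    1 if sum(w_d * x_d for w_d, x_d in zip(w, x)) + b > 0 else -1
    for x, _ in and_data
]
print(w, b, predictions)
```

The default `eta=1.0` corresponds to the simple version of the algorithm; passing another value gives the variant with a learning rate.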

### Why is the update rule useful? 

We will see later that the algorithm is guaranteed to find a separator if the data set is separable. Disregarding this for the moment, it's simple to see that the algorithm modifies the output for incorrectly classified examples in the right direction. Calculating the difference between the outputs before and after the update:

$$[(\mathbf w + y\mathbf x)\mathbf x + b + y] - [\mathbf w \mathbf x + b] = y\|\mathbf x\|^2 + y = y (\|\mathbf x\|^2 + 1)$$

that is, we increase the output at least by 1 if the incorrectly classified example was positive and decrease it by at least one if it was negative.

   
##  Complexity contrast with other learning algorithms

### Perceptron algorithm

Update was possible simply by rotating potentiometers (in the case of binary vectors by a single step forward or backward).

#### vs, for instance, Newton's method

Other methods for reducing the error rate can be way more complex. One example is Newton's method, which looks for a minimum of the error function by applying Newton-Raphson root finding to its derivative, i.e., by searching for a critical point:

<a href="https://www.researchgate.net/profile/Daniel_Marcsa2/publication/266091369/figure/fig5/AS:476476194725892@1490612185738/The-geometrical-construction-of-Newton-Raphson-method.png"><img src="https://drive.google.com/uc?export=view&id=1CRdZS-0tQuE3SEo7Yn5oeeX6IJluShYx" width=40%></a>

In the (typical) multidimensional case,

- this requires the computation of the Hessian matrix, which consists of all second order partial derivatives 
- moreover, the Hessian needs to be inverted
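To make the contrast concrete, here is a sketch of a single Newton step for a function of only two parameters. The error function $F(\mathbf p) = (p_1 - 1)^2 + 2(p_2 + 3)^2$ is a hypothetical example chosen because its gradient and Hessian can be written down by hand. With $D$ parameters the Hessian has $D^2$ entries, and inverting it with standard methods costs on the order of $D^3$ operations.

```python
def newton_step_2d(p, grad, hess):
    """One Newton step p - H^{-1} g for a function of two parameters."""
    g1, g2 = grad(p)
    (h11, h12), (h21, h22) = hess(p)
    det = h11 * h22 - h12 * h21
    # Apply the explicit inverse of the 2x2 Hessian to the gradient
    d1 = (h22 * g1 - h12 * g2) / det
    d2 = (-h21 * g1 + h11 * g2) / det
    return (p[0] - d1, p[1] - d2)


# Hypothetical error function F(p) = (p1 - 1)^2 + 2 * (p2 + 3)^2
def grad(p):
    return (2 * (p[0] - 1), 4 * (p[1] + 3))


def hess(p):
    return ((2.0, 0.0), (0.0, 4.0))


p_next = newton_step_2d((0.0, 0.0), grad, hess)
print(p_next)  # for a quadratic function one step lands on the minimum
```

Even in this tiny case the step needs all second derivatives and a matrix inversion, whereas the perceptron update only adds or subtracts the current input.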

### EpochCount and order of processing

- If the training examples are processed in an unfortunate order (e.g., all positive examples first, followed by all negative ones), the perceptron effectively learns only from a few examples
- The order is so significant that a random shuffle of the original training data typically results in a 20% faster convergence.

## Geometrical interpretation

The **decision boundary** of a perceptron with $\mathbf w$ weights and $b$ bias is the set of possible inputs for which the activation is 0, that is, the set

$$\left\{\mathbf x : \sum_{d=1}^D w_d x_d + b = 0\right\} = \{\mathbf x: \mathbf w \mathbf x + b = 0 \}$$ 

If $b = 0$ then the decision boundary is $\{\mathbf x: \mathbf w \mathbf x = 0 \}$, which is the set of vectors that are perpendicular to $\mathbf w$; therefore the boundary is a *hyperplane* which is perpendicular to $\mathbf w$ and passes through the origin.

<img src="https://ds055uzetaobb.cloudfront.net/image_optimizer/947723b3ba09371025dac3dab038f6b79a9ea2d3.png"  height="400" width="400">

In addition, if $\mathbf w$ is a unit vector (we can assume that, since the decision boundary is determined solely by its direction), then the activation $\mathbf w \mathbf x$ is simply the _signed projection_ of $\mathbf x$ onto $\mathbf w$. On one side of the hyperplane the projection is positive while on the other side it is negative, so the hyperplane _separates_ the inputs which are predicted to be positive from those predicted to be negative.

<img src="http://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/images/perceptron/perceptron_img1.png"  height="400" width="400">

The role of the bias $b$ is to *move* the separator hyperplane along the direction of $\mathbf w$ by exactly $-b$ units (for a unit vector $\mathbf w$).
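The geometric picture can be checked numerically. The sketch below computes the signed distance of a point from the hyperplane $\{\mathbf x: \mathbf w \mathbf x + b = 0\}$; the weight vector and bias are hypothetical values, with $\mathbf w$ chosen to be (up to rounding) a unit vector.

```python
import math


def signed_distance(w, b, x):
    """Signed distance of x from the hyperplane {x : w.x + b = 0}."""
    norm = math.sqrt(sum(w_d * w_d for w_d in w))
    return (sum(w_d * x_d for w_d, x_d in zip(w, x)) + b) / norm


w = [0.6, 0.8]   # hypothetical unit vector: 0.36 + 0.64 = 1
b = -2.0

# Moving from the origin by -b units along w lands on the boundary
boundary_point = [-b * w_d for w_d in w]

print(signed_distance(w, b, [0.0, 0.0]))       # origin: approximately b
print(signed_distance(w, b, boundary_point))   # approximately 0
print(signed_distance(w, b, [3.0, 4.0]))       # positive side of the hyperplane
```

Dividing by the norm of $\mathbf w$ makes the function usable for non-unit weight vectors as well; for a unit vector the result is just the activation.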

## The hypercone of the good solutions

<img src="http://drive.google.com/uc?export=view&id=1KIZ9QUaLL2SisNrzwegoa9WA5B2e31IL">

(Source: Hinton - Neural networks for machine learning)

## Margin

If the positive and negative examples of a $\mathbf D$ data set are separated by the  hyperplane determined by a $\mathbf w$ unit weight vector and $b$ bias then the minimum of the activations on $\mathbf D$ is the separator's _margin_:

$$\mathrm{Margin}(\mathbf D, \mathbf w, b) = \min_{(\mathbf x, y) \in \mathbf D} y(\mathbf w \mathbf x + b)$$

This is simply the minimum of the distances from the separator hyperplane. The margin of a data set can also be defined as the largest margin attainable by any separator (with $\mathbf w$ ranging over unit vectors):

$$\mathrm{Margin}(\mathbf D) = \sup_{\mathbf w, b}\mathrm{Margin}(\mathbf D, \mathbf w, b)$$
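The margin of a given separator is easy to compute. The sketch below assumes labels in $\{-1, +1\}$ and a unit weight vector; the data set is hypothetical. Computing the margin of the data set itself would additionally require a search over all separators, which is not attempted here.

```python
def margin(data, w, b):
    """Smallest value of y * (w.x + b) over the data set.

    A positive result means that (w, b) separates the data; a non-positive
    result means that at least one example is on the wrong side or on the
    boundary.
    """
    return min(
        y * (sum(w_d * x_d for w_d, x_d in zip(w, x)) + b)
        for x, y in data
    )


# Hypothetical data set, separable by the vertical axis
data = [([2.0, 1.0], 1), ([1.0, -1.0], 1), ([-3.0, 0.0], -1), ([-1.5, 2.0], -1)]

print(margin(data, [1.0, 0.0], 0.0))   # separator: positive margin
print(margin(data, [0.0, 1.0], 0.0))   # not a separator: negative value
```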


##  Convergence theorem (Rosenblatt)
Let's assume that for a data set $\mathbf D$ in which $\forall \mathbf x_i: \|\mathbf x_i\|\leq 1$ there exists an optimal separator $\mathbf{w^*}$ with the maximal margin $\gamma$, and that the algorithm produces the weight vectors $\mathbf w_0,...,\mathbf w_i,\dots$ in its successive update steps (for simplicity we assume that the bias is $0$ and that $\mathbf w^*$ is chosen to be a unit vector -- neither of these assumptions is essential for the result). In that case the algorithm finds a separator in a finite number $k$ of update steps, and, moreover, $k$ is guaranteed to satisfy

$$ k \leq \frac{1}{\gamma^2}$$

The key idea is to prove that the angle between  $\mathbf {w}^*$ and $\mathbf {w}_i$ decreases to the degree needed for the linear separation in a finite number of update steps.
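The bound can be checked empirically on a small example. The sketch below runs the bias-free perceptron until an epoch passes without mistakes and counts the updates. The data set and the reference separator `w_star` are hypothetical; `w_star` is merely *a* unit separator, not necessarily the optimal one, so its margin gives a valid but possibly looser bound than the one in the theorem.

```python
def dot(u, v):
    return sum(u_d * v_d for u_d, v_d in zip(u, v))


def perceptron_updates(data, max_epochs=1000):
    """Bias-free perceptron; returns the final weights and the update count."""
    w = [0.0] * len(data[0][0])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * dot(w, x) <= 0:
                w = [w_d + y * x_d for w_d, x_d in zip(w, x)]
                updates += 1
                mistakes += 1
        if mistakes == 0:      # a full pass without errors: separator found
            break
    return w, updates


# Hypothetical data set with ||x|| <= 1, separable by the horizontal axis
data = [([0.9, 0.2], 1), ([-0.9, 0.2], 1), ([0.9, -0.2], -1), ([-0.9, -0.2], -1)]
w_star = [0.0, 1.0]                       # a unit separator (assumed, not learned)
gamma = min(y * dot(w_star, x) for x, y in data)

w, k = perceptron_updates(data)
print(k, "updates; bound:", 1 / gamma ** 2)
```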


## Advantages and disadvantages of perceptron algorithm
__Advantages__:
- online --  processes one example at a time and possibly improves the model on the basis of this example. $\Rightarrow$ Capable of continuously processing new incoming examples.
- fast and simple
- convergence theorem

__Disadvantages__
- in contrast to SVM, there is no guarantee that the resulting separator is optimal.
- error-driven: it can change a 99.99% precision model because of a single error
- convergence is guaranteed only when there does exist a separator! And that -- as we will see -- is not always the case...

<a id="advancedperceptrons"></a>
# 4 Advanced perceptrons

##  The problem

Later examples have too large an influence on the learned weights -- a last update can change a weight vector that worked well for all other examples (!).

## Solutions

### Voting

The weights and bias are stored at every update, together with the number of correct predictions since the last update. The learning process is unchanged, but in the prediction stage the system generates a prediction with every stored weight vector and bias, and the result is computed as the weighted sum of all of these predictions, where the weights are the stored "survival times". Problem: Requires a huge amount of memory.

### Averaged perceptron

Similar, but more practical alternative: prediction is performed with the weighted average of the weights and biases that were generated during the learning process -- weights are, again, the "survival times".  In contrast to voting, here it is enough to maintain a rolling weighted average during the learning phase, so the additional memory consumption is negligible.
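A naive sketch of the averaged perceptron, assuming labels in $\{-1, +1\}$. The current weights are added to a running sum after every processed example, so each weight vector is counted once per example it survived. This is one simple formulation chosen for readability; more efficient variants avoid touching the whole sum at every step. The toy data set (logical AND) and the function name are hypothetical.

```python
def train_averaged_perceptron(data, epoch_count):
    """Perceptron training that returns survival-time-weighted average weights."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    w_sum, b_sum = [0.0] * dim, 0.0    # running sums of the weights and bias
    steps = 0
    for _ in range(epoch_count):
        for x, y in data:
            a = sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
            if y * a <= 0:             # ordinary perceptron update
                w = [w_d + y * x_d for w_d, x_d in zip(w, x)]
                b = b + y
            # Long-lived weight vectors are added many times to the sums
            w_sum = [s + w_d for s, w_d in zip(w_sum, w)]
            b_sum += b
            steps += 1
    return [s / steps for s in w_sum], b_sum / steps


# Toy data set: logical AND, with labels encoded as -1 / +1
and_data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]

w_avg, b_avg = train_averaged_perceptron(and_data, epoch_count=20)
avg_predictions = [
    1 if sum(w_d * x_d for w_d, x_d in zip(w_avg, x)) + b_avg > 0 else -1
    for x, _ in and_data
]
print(w_avg, b_avg, avg_predictions)
```

Only one extra weight vector and one extra bias are stored, which is the memory advantage over voting.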


# 5 Limitations

### Minsky & Papert (1969): "Perceptrons"

<img src ="http://slideplayer.com/slide/775779/3/images/41/Minsky+&+Papert+(1969).jpg" width=600 heigth=600>

#### A very general perceptron definition
multiple layers and nonlinearity are possible

<img src="http://drive.google.com/uc?export=view&id=1ipkPyUFpS8bfTrXNIy6CqMOOAzm14Z9j" width=600 heigth=600>

#### Criticism

> Perceptrons have been widely publicized as "pattern recognition" or "learning" machines and as such have been discussed in a large number of books, journal articles, and voluminous "reports." Most of this writing (some exceptions are mentioned in our bibliography) is without scientific value and we will not usually refer by name to the works we criticize. The sciences of computation and cybernetics began, and it seems quite rightly so, with a certain flourish of romanticism. They were laden with attractive and exciting new ideas which have already
borne rich fruit. Heavy demands of rigor and caution could have held this development to a much slower pace; only the future could tell which directions were to be the best.

##### "The Seductive Powers of Perceptrons"

> Thus "programming" takes on a pleasingly homogeneous form. Moreover since "programs" are representable in a
[multi]-dimensional space, they inherit a metric which makes it easy to imagine a kind of automatic programming which people have been tempted to call learning', by attaching feedback devices to the parameter controls they propose to "program" the machine by providing it with a sequence of input patterns and an "error signal" which will cause the coefficients to change in the right direction when the machine makes an inappropriate decision.

#### Goal: precise theory + what can they be used for? 
- Perceptrons are "massively parallel" machines -- these architectures were not so well understood as the classic sequential ones.
- Although they write about multilayer perceptrons too, their most important _theorems_ concern single layer linear perceptrons.
- Focus: perceptrons as visual pattern recognizers (this was their main application area at the time)
- Negative results: some predicates (e.g., parity, connectedness) cannot be represented by certain types of perceptrons. Their best-known result of this type concerns the XOR operation:
 
## XOR Problem
- It was important to model logical operators
- Given 0/1 inputs, the output should be the result of the corresponding operation
- A perceptron can model a logical operation only if it is linearly separable

### The truth table of XOR

<img src="https://www.dyclassroom.com/image/topic/logic-gate/xor-xnor/xor-table.png" width=300 heigth=300>

### Problem
$x_1$ and $x_2$ represent $A$ and $B$
<img src="http://drive.google.com/uc?export=view&id=1m59mOWDu7yShgMpSgw2yC8YDVzi1-fzs" style="width: 80%;">
 
### The XOR proof
Let us assume, toward a contradiction, that a $w_1,w_2, b$ perceptron computes XOR. Then

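1. $w_1 + b > 0$ (input $(1,0)$, expected output 1)
2. $w_2 + b > 0$ (input $(0,1)$, expected output 1)
3. $w_1 + w_2 + b \leq 0$ (input $(1,1)$, expected output 0)
4. $b \leq 0$ (input $(0,0)$, expected output 0)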

But adding (1) and (2) we have: $w_1 + w_2 + 2b > 0$ 

And adding (3) and (4) we get: $w_1 + w_2 + 2b \leq 0$, which is a contradiction.
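The result can be illustrated (though of course not proved) with a brute-force search over a grid of candidate parameters. The grid and the helper name `computes` are hypothetical choices: the search finds many parameter settings for AND, but none for XOR.

```python
from itertools import product


def computes(w1, w2, b, truth_table):
    """True if the perceptron (w1, w2, b) reproduces the whole truth table."""
    return all(
        (w1 * x1 + w2 * x2 + b > 0) == (out == 1)
        for (x1, x2), out in truth_table.items()
    )


XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

# Candidate values -5.0, -4.5, ..., 5.0 for each of w1, w2 and b
grid = [i / 2 for i in range(-10, 11)]

xor_solutions = [p for p in product(grid, repeat=3) if computes(*p, XOR)]
and_solutions = [p for p in product(grid, repeat=3) if computes(*p, AND)]

print(len(xor_solutions), len(and_solutions))
```

A finite grid can only fail to find a solution; it is the argument above that rules out every choice of $w_1, w_2, b$.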



<a id="nnbasics"></a>
# Neural network basics 



## "The resistance" - People who worked during winter

During the AI winters the mainstream of the scientific community - in image recognition, speech, language and in other AI fields - was strongly against the usage of neural networks, since it regarded them as a "dead paradigm". But the history of AI as a scientific field teaches us not to disregard old ideas, since they can easily arise in new forms again (like genetic algorithms entering the mainstream again in [this](https://www.technologyreview.com/s/611568/evolutionary-algorithm-outperforms-deep-learning-machines-at-video-games/) case).

The "resistance" during the neural network winter is well represented by the work of [**Geoffrey Hinton**](https://en.wikipedia.org/wiki/Geoffrey_Hinton), who was working on new learning algorithms for neural models in the mid 80s, especially backpropagation (see next lecture), thus securing his name as ["the godfather of deep learning"](https://www.youtube.com/watch?v=uAu3jQWaN6E) - and becoming a media celebrity and "face" for the movement.

<img src="https://images.thestar.com/C_Dnyhg8tb3tVXiGtq93vee9oJM=/1200x799/smart/filters:cb(1524397170509)/https://www.thestar.com/content/dam/thestar/news/world/2015/04/17/how-a-toronto-professors-research-revolutionized-artificial-intelligence/geoffrey-hinton-3.jpg" width=600 heigth=600>

Others also followed. (Except Schmidhuber, who claims: he started the whole thing on his own. :-)

<img src="https://i.imgur.com/lq1LDVO.png" height="400" width="400" align="left">


### Jürgen Schmidhuber,
### Yoshua Bengio,
### Geoffrey Hinton,
### Yann LeCun,
### Andrew Ng

## "Multilinearity"

### We were here
<img src="https://www.mathworks.com/help/nnet/ug/percept_percla.gif">

### Our problem was

<img src="http://drive.google.com/uc?export=view&id=1m59mOWDu7yShgMpSgw2yC8YDVzi1-fzs" style="width: 80%;">

## Solution: We need more layers! 

Even Minsky and Papert suggested this solution to the XOR problem

<img src="http://slideplayer.com/slide/778829/3/images/5/Minsky+&+Papert+(1969)+offered+solution+to+XOR+problem+by.jpg" width=600 heigth=600>

### How does it look in practice?

<img src="http://scikit-learn.org/stable/_images/multilayerperceptron_network.png" width=400 heigth=400>



## Smarter learning method

### Problems with the perceptron learning algorithm

#### 1. How should we adapt the algorithm for more than one layer?

It is unclear how we should compute the error and update the weights on the basis of the error -- we need an analytical method for computing the individual weights' contribution "backward" from the error.

#### 2. We have already seen the dangerous order sensitivity + it considers only one example at a time 
In practice the perceptron update rule is quite scary:

<img src="https://jeremykun.files.wordpress.com/2011/08/perceptron-iterations.gif">

#### 3. What if the data is not linearly separable?
We have no guarantee that the resulting weights will be optimal, i.e., that the algorithm minimizes the number of errors. 

## Learning as a minimization task

Basic intuition: 
- Find model parameters that minimize how "bad" the model is, measured by some objective function
- "Badness" not exclusively error on the data set, a model can be worse than another because it's too complex, its weights are too large etc. 
- The function to be minimized is typically called "loss" or "objective" function


### Error rate as loss function (0/1 loss)

- The error rate or error number on the training data might seem to be a good loss function 
- but, there is no efficient optimization algorithm for this type of loss, even for the linear case:

> - The problem of finding the linear separator with the minimal error rate/number of errors is NP-hard, i.e., there is a polynomial time algorithm only if P=NP.  
> - Even more surprisingly, the problem of approximating the minimum is also NP-hard $\Rightarrow$ it is advisable to experiment with different loss functions.

- A main problem of the error rate functions is that in certain cases small changes in the parameters can lead to large jumps in the function value, while in others even large changes in the parameters do not change the function value at all (discontinuous, non-differentiable)
- Solution: we work with smoothed, continuous loss functions that correlate with the error rate (e.g., being an upper bound on it) but which are easier to minimize.
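A few commonly used smooth (or at least continuous) replacements can be sketched as functions of the signed activation $m = y \cdot a$, which is positive exactly when the example is classified correctly. The selection below and the base-2 scaling of the logistic loss (which makes it an upper bound on the 0/1 loss) follow common textbook conventions and are assumptions of this example rather than part of the lecture.

```python
import math


def zero_one_loss(m):
    """1 for a misclassified example (m <= 0), 0 otherwise."""
    return 1.0 if m <= 0 else 0.0


def hinge_loss(m):
    return max(0.0, 1.0 - m)


def logistic_loss(m):
    # Base-2 logarithm, so that the loss equals 1 at m = 0
    return math.log2(1.0 + math.exp(-m))


def exponential_loss(m):
    return math.exp(-m)


def squared_loss(m):
    return (1.0 - m) ** 2


for m in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(m, zero_one_loss(m), hinge_loss(m),
          round(logistic_loss(m), 3), round(exponential_loss(m), 3),
          squared_loss(m))
```

All four replacements stay at or above the 0/1 loss for every value of $m$, so pushing them down also pushes down an upper bound on the number of errors.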

<img src="http://fa.bianp.net/blog/images/2014/loss_01.png">

<img src="http://fa.bianp.net/blog/images/2014/loss_log.png">

### Surrogate loss functions

<img src="http://i.imgur.com/r37lX2P.jpg">


## What is the solution to optimizing continuous loss functions? -- Gradient descent
<img src="https://sebastianraschka.com/images/blog/2015/singlelayer_neural_networks_files/perceptron_animation.gif">

The Gradient Descent algorithm takes a small step in the right direction, taking into consideration all examples of the training data, so it moves toward the minimum in the error space.

## Gradient Descent algorithm
If the function to be minimized is the differentiable $F:\mathbb R^n\rightarrow \mathbb R$, $\eta_k$ is a sequence of learning rates and $K$ is the number of steps then the algorithm is

1. $\mathbf p_0 := \mathbf p_{\mathrm{init}}$ (initial parameters [for neural nets these are the weights and biases])
2. for $k \in [1..K]$:
    - $\mathbf g_k := \nabla F(\mathbf p_{k-1})$ (compute the gradient at the current parameters)
    - $\mathbf p_k := \mathbf p_{k-1} - \eta_k \mathbf g_k$ (take a step in the direction opposite to the gradient)
3. The result is $\mathbf p_K$.
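The steps above translate almost line by line into code. The sketch below is a minimal illustration; the function to be minimized, $F(\mathbf p) = (p_1 - 3)^2 + (p_2 + 1)^2$, its hand-written gradient, and the constant learning rate are hypothetical choices for the example.

```python
def gradient_descent(grad_F, p_init, learning_rates):
    """Gradient descent; learning_rates holds one eta_k for every step k."""
    p = list(p_init)                       # step 1: initial parameters
    for eta_k in learning_rates:           # step 2: K steps
        g = grad_F(p)                      # gradient at the current parameters
        p = [p_i - eta_k * g_i for p_i, g_i in zip(p, g)]
    return p                               # step 3: the result is p_K


# Hypothetical example: F(p) = (p1 - 3)^2 + (p2 + 1)^2, minimum at (3, -1)
def grad_F(p):
    return [2 * (p[0] - 3), 2 * (p[1] + 1)]


p_K = gradient_descent(grad_F, [0.0, 0.0], [0.1] * 100)
print(p_K)   # close to the minimum at (3, -1)
```

Passing the learning rates as a sequence keeps the sketch close to the formulation with $\eta_k$; a constant rate is simply a list with repeated values.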

<img src="https://sebastianraschka.com/images/blog/2015/singlelayer_neural_networks_files/perceptron_gradient_descent_1.png">

### Convergence

Under certain conditions (among others, $F$ has to be convex) the sequence generated by the algorithm is guaranteed to converge to the minimum: the learning rates $\eta_k$ can be chosen to be constant such that the convergence rate is $\mathcal O(1/k)$, that is, if $F$ reaches its minimum at $\mathbf p^*$ then there is a constant $\alpha$ such that for any $k$, $F(\mathbf{p}_k)-F(\mathbf{p}^*)\leq \frac{\alpha}{k}$.

### Too small and too large learning rates
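The effect of the learning rate can be reproduced on the simplest possible example. The sketch below minimizes the hypothetical function $F(p) = p^2$ (gradient $2p$) for 20 steps with three constant learning rates chosen for illustration: one too small, one reasonable, and one too large.

```python
def run_gradient_descent(eta, steps=20, p_init=1.0):
    """Minimize F(p) = p^2 (gradient 2p) with a constant learning rate."""
    p = p_init
    for _ in range(steps):
        p = p - eta * 2 * p
    return p


for eta in (0.01, 0.4, 1.1):
    print(eta, run_gradient_descent(eta))
```

With the smallest rate the parameter has barely moved toward the minimum at 0 after 20 steps, with the middle one it is practically there, and with the largest one every step overshoots so that the distance from the minimum keeps growing.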

<img src="https://cdn-images-1.medium.com/max/1600/1*EP8stDFdu_OxZFGimCZRtQ.jpeg" width="600">



## Moving from the parameter space to the "error space"

The change of the decision boundary corresponds to decreasing the error.
<img src="https://iamtrask.github.io/img/sgd_optimal.png">
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/23104835/Qc281.jpg">

**Notice that if the separation is nonlinear then curvature of the decision boundary/surface increases. Somewhat simplifying the matter we can assume that the weight vector's values are increasing and the decision surface can be described by a higher order polynomial.**


## "Nonlinearity"

- An early idea
- At first this role was played by the **sigmoid (logistic) function**:

$$S(x)=\frac{1}{1+e^{-x}}=\frac{e^{x}}{e^{x}+1}$$

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/600px-Logistic-curve.svg.png" width=400 heighth=400>

### Its advantages

- Differentiable (unlike the step function used in the perceptron)
- It is the smoothed version of the unit step function
  <img src="http://drive.google.com/uc?export=view&id=1fTTksMeUU9UTZLUDq4NF5avktkQ_p5IU">
- Its range is the  $(0,1)$ interval, so its values can be interpreted as probabilities
- Can produce complex decision surfaces when used in multiple layers
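A small sketch of the sigmoid and its derivative in plain Python. Switching between the two algebraically equivalent forms of $S(x)$ depending on the sign of $x$ is a common trick for avoiding overflow in the exponential; it is an implementation choice of this example, not something the definition requires.

```python
import math


def sigmoid(x):
    """Logistic function, using the form that keeps exp() from overflowing."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (e + 1.0)


def sigmoid_derivative(x):
    """S'(x) = S(x) * (1 - S(x)), defined everywhere, unlike for the step function."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, round(sigmoid(x), 4), round(sigmoid_derivative(x), 4))
```

The derivative is largest at $x = 0$ and shrinks toward 0 for large positive or negative inputs, where the function flattens out like the unit step it smooths.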

See also:

<img src="https://sebastianraschka.com/images/faq/diff-perceptron-adaline-neuralnet/8.png">

(Adaline is a perceptron variant trained with a "smarter learning rule", see [here](https://sebastianraschka.com/faq/docs/diff-perceptron-adaline-neuralnet.html).)

<img src="https://i.stack.imgur.com/xcdwn.png" width=60%>
<img src="https://i.stack.imgur.com/blIBz.png" width=60%>
<img src="https://ars.els-cdn.com/content/image/1-s2.0-S089360809700097X-gr3.gif" width=55%>

[Source](https://stats.stackexchange.com/questions/291492/how-can-logistic-regression-produce-curves-that-arent-traditional-functions)


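Another important advantage is that networks of sigmoid units are **universal approximators**: with a single hidden layer of sufficiently many units they can approximate any continuous function on a compact domain arbitrarily well.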

