In [None]:
%matplotlib inline  
import numpy as np
import matplotlib.pyplot as plt
import sys
import time
from IPython import display

# 8 Optimization - Neural Networks - The Perceptron

## Introduction

Neural networks are excellent examples of (often non-linear) function optimization. 

Even though real neurons look like this:

![Real Neuron](https://static.turbosquid.com/Preview/2014/12/03__09_02_06/all.jpg5254a144-6d9d-4a4f-b9b0-6fcbba4f64bcOriginal.jpg)

and have highly complicated biochemical processes that control their firing given input in the many dendrites, people have long thought about how to approximate their function mathematically.



### The artificial neuron

We will briefly review some "proper" Nobel-prize-winning equations in the later part of the course, when we talk about partial differential equations. As you may imagine, however, a real neuron's complexity is rather daunting, so for simulations, simpler models are a good start.

Here's an extremely simplified model - a so-called artificial neuron which simply 
1. takes several inputs $x_i$
2. sums them up as $\sum x_i$
3. pushes them through an activation function $f(\sum x_i)$ and 
4. delivers one output $y=f(\sum x_i)$ to downstream neurons.
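
The four steps above can be sketched in a few lines. This is only a minimal illustration: the function name `artificial_neuron`, the default choice of `np.tanh` for $f$, and the example inputs are assumptions made for this sketch.

```python
import numpy as np

def artificial_neuron(inputs, f=np.tanh):
    """Sum the inputs and pass the sum through the activation function f."""
    total = np.sum(inputs)      # step 2: sum of all inputs x_i
    return f(total)             # steps 3 and 4: output y = f(sum x_i)

# hypothetical example inputs (step 1)
x_example = np.array([0.5, -1.0, 2.0])
y_example = artificial_neuron(x_example)
print(y_example)
```

Any other function of one variable can be passed as `f`, which is how the different activation functions discussed in this notebook would plug in.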

![Simplified neuron and model](https://cdn-images-1.medium.com/max/1200/1*SJPacPhP4KDEB1AdhOFy_Q.png)

This model is based on the ground-breaking modeling work outlined in the paper by McCulloch and Pitts from 1943. In this work, they used a step-function as $f$. 

McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133.

### Activation functions 

The purpose of the activation function is to introduce more complicated processing of the inputs - we will talk about this more later. The most common choices are plotted below:

In [None]:
def activation(x, type_='tanh'):
    # apply the chosen activation function element-wise to x
    if type_ == 'tanh':
        out = np.tanh(x)
    elif type_ == 'logistic':
        out = 1.0 / (1 + np.exp(-x))
    elif type_ == 'relu':
        out = np.maximum(np.zeros_like(x), x)
    elif type_ == 'perceptron':
        out = ...
    elif type_ == 'linear':
        out = ...
    else:
        raise ValueError('do not know type {}'.format(type_))

    return out


In [None]:
# x-values
x=np.arange(-2,2,0.02)

# activation functions
plt.figure(figsize=(10,8))
plt.plot(x,activation(x,'linear'),label='linear');
plt.plot(x,activation(x,'tanh'),label='tanh');
plt.plot(x,activation(x,'logistic'),label='logistic');
plt.plot(x,activation(x,'perceptron'),label='perceptron');
plt.plot(x,activation(x,'relu'),label='relu');
plt.grid()
plt.legend()
plt.show()
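
For comparison, here is one possible completed variant of the activation function. It is a sketch under assumptions, not the notebook's solution: the name `activation_sketch` is made up, the "perceptron" branch is assumed to be a step function with outputs $-1$ and $+1$ (matching the labels used later on), and the "linear" branch is assumed to be the identity.

```python
import numpy as np

def activation_sketch(x, type_='tanh'):
    """Hypothetical completed variant; the 'perceptron' and 'linear'
    branches are assumptions made for this sketch."""
    if type_ == 'tanh':
        return np.tanh(x)
    if type_ == 'logistic':
        return 1.0 / (1.0 + np.exp(-x))
    if type_ == 'relu':
        return np.maximum(0.0, x)
    if type_ == 'perceptron':
        # assumed: step function with outputs -1 and +1 (threshold at 0)
        return np.where(x >= 0, 1.0, -1.0)
    if type_ == 'linear':
        # assumed: identity, the input is passed through unchanged
        return np.asarray(x, dtype=float)
    raise ValueError('do not know type {}'.format(type_))

x_demo = np.array([-2.0, 0.0, 2.0])
for name in ['linear', 'tanh', 'logistic', 'perceptron', 'relu']:
    print(name, activation_sketch(x_demo, name))
```

A step function with outputs 0 and 1 would be an equally reasonable reading of the "perceptron" case; the $\pm 1$ convention is chosen here only because it lines up with the class labels.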

### Learning networks - the perceptron

Let's try to do something simple now. Let's assume that the data we would like to learn consists of pairs $\vec{x}_i,y_i$, and that the $y_i$ can only take on **two values** ($y_i\in\{-1,1\}$). 

Furthermore, we assume that the data has a **linear structure**. We can quantify this like so: 

There exists some $\vec{w}^* \in \mathbb{R}^d$ such that $\|\vec{w}^*\|=1$, and for
some $\alpha > 0$ and all $i \in \{1, 2, \dots , n\}$
it holds that:

$$
y_i(\vec{w}^*\vec{x}_i)>\alpha
$$

Which, when you remember that $y_i=-1,1$, simply says that the sign of $(\vec{w}^*\vec{x}_i)$ is the same as the corresponding $y_i$.

The constant $\alpha$ is introduced as a lower bound on the value of $y_i(\vec{w}^*\vec{x}_i)$. 

Note that if you find such a vector $\vec{w}^*$, then all points that satisfy

$$
(\vec{w}^*\vec{x})=0
$$

define a line (through the origin - if we augment our x-s with a leading "1" and add another leading $w_0$ to $\vec{w}$, then we can also model the intercept or bias of that line)!
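
As a small illustration of the augmented form, the line $w_0 + w_1x_1 + w_2x_2 = 0$ can be solved for $x_2$. The helper name `decision_line` and the example weights are assumptions made for this sketch, and it only works when $w_2 \neq 0$.

```python
import numpy as np

def decision_line(w, x1):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2 (assumes w2 is not zero)."""
    w0, w1, w2 = w
    return -(w0 + w1 * x1) / w2

w_line = np.array([-1.0, 1.0, 2.0])     # hypothetical weights (w0, w1, w2)
x1_vals = np.array([-1.0, 0.0, 1.0])
x2_vals = decision_line(w_line, x1_vals)
print(x2_vals)
```

Every pair `(x1_vals[j], x2_vals[j])` then lies exactly on the decision line defined by `w_line`.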


Finally, we will assume that all input values $\vec{x}_i$ are bounded in norm, so that the maximum distance from the origin for these points is below some number $D$: for all $i$, $\|\vec{x}_i\|<D$.
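
To make the two assumptions concrete, the following sketch computes the quantities involved for a tiny data set. The data, the candidate vector `w_star`, and all variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

# hypothetical toy data: two points per class, leading 1 for the bias
X_toy = np.array([[1.0, 2.0, 2.0],
                  [1.0, 3.0, 1.0],
                  [1.0, -2.0, -1.0],
                  [1.0, -1.0, -3.0]])
y_toy = np.array([1.0, 1.0, -1.0, -1.0])

# hypothetical candidate separator, normalized to unit length
w_star = np.array([0.0, 1.0, 1.0])
w_star = w_star / np.linalg.norm(w_star)

margins = y_toy * (X_toy @ w_star)           # y_i (w* . x_i) for every point
min_margin = margins.min()                   # any alpha below this value works
max_norm = np.linalg.norm(X_toy, axis=1).max()  # any D above this value works

print('margins:', margins)
print('separated by w_star:', min_margin > 0)
print('smallest margin:', min_margin, ' largest norm:', max_norm)
```

If the smallest margin is positive, the candidate vector separates the data, and any $\alpha$ slightly below it together with any $D$ slightly above the largest norm satisfies the strict inequalities used in the text.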

How do we update the weights now? Let's write the algorithm created by Frank Rosenblatt in pseudo-code:

<div class="alert alert-warning">
set weights $\vec{w}= ...$

while any $y_i(\vec{w}\cdot\vec{x}_i)\leq 0$:

&nbsp;&nbsp;&nbsp;choose one index $k$, for which $y_k(\vec{w}\cdot\vec{x}_k)...$

&nbsp;&nbsp;&nbsp;set weights to $\vec{w} = \vec{w} + ...\vec{x}_k$

</div>

Let's write down the algorithm in Python - this is much longer than the pseudocode above, but only because of the plotting of the updated decision hyperplane that happens during the execution:

In [None]:
def myPerceptron(x,y,maxIter,doPlot):
    
    # init weights
    w = np.zeros((x.shape[1],1))+0.01
    print(w)

    # check wrong outputs
    outputs = ...
    print(outputs)
    
    ite=1

    # plot data and initial guess
    xs=np.arange(np.min(x[:,1]),np.max(x[:,1]))

    # as long as there are any misclassified points
    # and we are within iteration limits, do:
    while np.sum(outputs)>0 and ite<=maxIter:
        # get all misclassified points
        ind=np.where(outputs>0)[0]

        # update the weight with one misclassified point
        update = y[ind[0]]*x[ind[0],:]
        # necessary to switch row vector back to column vector
        w = w + update[:,None]
        print(w)
        ite=ite+1
        # and determine the new, wrong classification outputs
        outputs = ...
        
        if (doPlot):
            display.clear_output(wait=True)
            indPos = y==1
            indNeg = y==-1
            plt.figure(figsize=(10,8))
            plt.scatter(x[indPos.ravel(),1],x[indPos.ravel(),2])
            plt.scatter(x[indNeg.ravel(),1],x[indNeg.ravel(),2])
            # plotting the line from the weights
            ys=...
            plt.plot(xs,ys,'b-')
            plt.xlim((-10,10))
            plt.ylim((-10,10))
            plt.title('{}: {} wrong\n'.format(ite,np.sum(outputs)))
            plt.grid()
            plt.show()
            time.sleep(1)
       
    return(w)
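
For reference, a plot-free variant of the same loop might look as follows. This is a sketch under assumptions rather than the notebook's solution: the name `perceptron_sketch`, the all-zero initial weights, the choice of the first misclassified point, and the update $\vec{w} \leftarrow \vec{w} + y_k\vec{x}_k$ are all choices made here, and the data is a hypothetical toy set.

```python
import numpy as np

def perceptron_sketch(x, y, max_iter=1000):
    """Plot-free perceptron loop; returns the weights and the number of updates."""
    w = np.zeros(x.shape[1])                   # assumed start: all-zero weights
    y = np.ravel(y)                            # accept column vectors of labels
    updates = 0
    while updates < max_iter:
        wrong = np.where(y * (x @ w) <= 0)[0]  # indices with y_i (w . x_i) <= 0
        if wrong.size == 0:
            break                              # every point is on the correct side
        k = wrong[0]                           # pick the first misclassified point
        w = w + y[k] * x[k]                    # assumed update: w <- w + y_k x_k
        updates += 1
    return w, updates

# hypothetical separable data, with a leading column of ones for the bias
x_demo = np.array([[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0],
                   [1.0, -2.0, -1.0], [1.0, -1.0, -3.0], [1.0, -3.0, -2.0]])
y_demo = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w_demo, n_updates = perceptron_sketch(x_demo, y_demo)
print(w_demo, n_updates)
print('all correct:', np.all(y_demo * (x_demo @ w_demo) > 0))
```

Returning the number of updates alongside the weights makes it easy to compare different data sets or initializations.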

Let's test this algorithm with two reasonably well-separated point clouds in two dimensions:

In [None]:
rng = np.random.default_rng(seed=42)
x = np.vstack((rng.standard_normal((20,2)),rng.standard_normal((20,2))+4))
y = np.vstack((-1*np.ones((20,1)),np.ones((20,1))))
x = np.hstack((np.ones((x.shape[0],1)),x))

w = myPerceptron(x,y,50,True)

And we can see how the algorithm adjusts the line so that it tries to better capture the distribution of the data.

In this form, the perceptron algorithm has two important properties:

<div class="alert alert-warning">
<p> 1. It immediately stops ...

<p>2. It is ...
</div>

In [None]:
x = np.vstack((rng.standard_normal((20,2)),rng.standard_normal((20,2))+1))
y = np.vstack((-1*np.ones((20,1)),np.ones((20,1))))
x = np.hstack((np.ones((x.shape[0],1)),x))

w = myPerceptron(x,y,50,True)

Hence, we can see that the algorithm (by definition) will not stop on its own, as it is not possible to find a line that splits the data. It only ends when the iteration limit `maxIter` is reached.

In this case, you will have to settle for the last decision plane that the algorithm finds!

### Proof of convergence for perceptrons

How can we prove that this algorithm does its job? 

We can see that if it finishes, the line will separate the classes. So, what we need to prove is that the algorithm converges in a finite number of steps $k$.

In detail: assume that there exists some parameter vector $\vec{w}^*$ such that $\|\vec{w}^*\|=1$, and some $\alpha > 0$ such that for all $i = 1, \dots, n$, $y_i(\vec{x}_i\cdot\vec{w}^*) > \alpha$. 

This is exactly the condition for correct classification that we formulated for the perceptron above!

#### Lower bound

Let's first find a lower bound for the norm of the weights after $k$ updates. Let's take a misclassified point $i$ and update the weights with it:

$$
\vec{w}_{k+1}=\vec{w}_k+y_i\vec{x}_i
$$

we can multiply this by $\vec{w}^*$ to get

$$
\vec{w}_{k+1}\vec{w}^*=\vec{w}_k\vec{w}^*+y_i\vec{x}_i\vec{w}^*
$$

but we required that the second term on the right - if we find a solution - is $>\alpha$, so:

$$
\vec{w}_{k+1}\vec{w}^*>\vec{w}_k\vec{w}^*+\alpha
$$

Now, let's start the process with $\vec{w}_0=\vec{0}$ as the initialization of the weights (the code above starts from small constant weights of 0.01 instead, but zero keeps the argument simple). The first update then gives:

$$
\vec{w}_{1}\vec{w}^*>\alpha
$$

So, if we just did this $k$ times we therefore get:

$$
\vec{w}_{k+1}\vec{w}^*>k\alpha
$$

And since, by the Cauchy-Schwarz inequality, $\|\vec{w}_{k+1}\|\|\vec{w}^*\|\geq\vec{w}_{k+1}\vec{w}^*$, and $\|\vec{w}^*\|=1$, we get:

$$
\|\vec{w}_{k+1}\|>k\alpha
$$


#### Upper bound

For the upper bound, let's write the norm of the update step:

$$
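\|\vec{w}_{k+1}\|^2=\|\vec{w}_k+y_i\vec{x}_i\|^2=\|\vec{w}_k\|^2+2y_i\vec{w}_k\vec{x}_i+y_i^2\|\vec{x}_i\|^2\leq\|\vec{w}_k\|^2+\|\vec{x}_i\|^2
$$

where we've used the fact that $|y_i|=1$, and that the middle term satisfies $2y_i\vec{w}_k\vec{x}_i\leq 0$ because point $i$ was misclassified by $\vec{w}_k$. But we required bounded points: $\|\vec{x}_i\|<D$, so: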

$$
\|\vec{w}_{k+1}\|^2<\|\vec{w}_k\|^2+D^2
$$

Again, we start with $\vec{w}_0=\vec{0}$, so we get:

$$
\|\vec{w}_{1}\|^2<D^2
$$

and doing that $k$ times, we get:

$$
\|\vec{w}_{k+1}\|^2<kD^2
$$


#### Putting it together
Now, we've got two results - an upper and a lower bound:

$$
k^2{\alpha}^2<\|\vec{w}_{k+1}\|^2<kD^2
$$

so we "ignore" the middle term, divide by $k\alpha^2$, and get:

$$
k<\frac{D^2}{{\alpha}^2}
$$

What that means is that the algorithm is **guaranteed** to converge after fewer than $D^2/\alpha^2$ updates, provided the data is linearly separable (which gives the constant $\alpha$) and bounded (which gives the constant $D$).
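
The bound can also be checked numerically on a small example. The sketch below is only an illustration under assumptions: the function name `count_updates`, the toy data, and the choice of the normalized final weights as one admissible $\vec{w}^*$ are all made up here. Because it plugs in the smallest margin and the largest norm themselves, it checks the non-strict version $k \leq D^2/\alpha^2$.

```python
import numpy as np

def count_updates(x, y, max_iter=10000):
    """Run the perceptron from zero weights and count the updates."""
    w = np.zeros(x.shape[1])
    k = 0
    while k < max_iter:
        wrong = np.where(y * (x @ w) <= 0)[0]   # misclassified points
        if wrong.size == 0:
            break
        w = w + y[wrong[0]] * x[wrong[0]]       # update with the first one
        k += 1
    return w, k

# hypothetical separable data with a leading 1 for the bias
x_b = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0],
                [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
y_b = np.array([-1.0, -1.0, -1.0, 1.0])

w_b, k_b = count_updates(x_b, y_b)

# use the normalized solution as one admissible unit vector w*
w_star_b = w_b / np.linalg.norm(w_b)
alpha_b = np.min(y_b * (x_b @ w_star_b))        # smallest margin for this w*
D_b = np.max(np.linalg.norm(x_b, axis=1))       # largest distance from the origin
bound_b = D_b**2 / alpha_b**2

print('updates:', k_b, ' bound D^2/alpha^2:', bound_b)
```

The bound depends on which separating vector is used for $\vec{w}^*$; a vector with a larger margin would give a tighter bound.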

### Logical functions with a perceptron

The original paper by McCulloch and Pitts talked about using the neuronal model as a substitute for **logical** operations. So, let's try to follow their reasoning.

Let's say I want to build a neuron that receives two inputs $x_1,x_2$ - these inputs are logical values, true or false. I want the neuron to run a simple logical operation, so that, for example, its output $y$ will be equal to a target function, such as $t = x_1 \text{ AND } x_2$. 

We can see that $t$ is a function with exactly two possible values as well. Given that the original perceptron was formulated as a two-class learner, this means we can try to apply this here as well!

In [None]:
inputA=np.array([[0],[0],[1],[1]])
inputB=np.array([[0],[1],[0],[1]])

targetLogical=inputA&inputB
targetLogical[targetLogical==0]=-1

print(targetLogical)

In [None]:
x = np.hstack((np.ones((4,1)),inputA,inputB))

myPerceptron(x,targetLogical,20,True)

So AND works.

In [None]:
targetLogical=inputA|inputB
targetLogical[targetLogical==0]=-1
print(targetLogical)

In [None]:
x = np.hstack((np.ones((4,1)),inputA,inputB))

myPerceptron(x,targetLogical,20,True)

So OR works.

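In [None]:
# XOR target, recoded to -1/1 like the targets above
targetLogical=inputA^inputB
targetLogical[targetLogical==0]=-1
print(targetLogical)

x = np.hstack((np.ones((4,1)),inputA,inputB))

myPerceptron(x,targetLogical,20,True)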

But XOR does not work. We can see why: its two classes cannot be separated by a single line. 

The fact that such a simple logical function could not be processed by the perceptron was known for a long time. 

However, given that you can get all possible values of two logical inputs with a suitable **chain** of AND and NOT (or with OR and NOT) operations, you can see that implementing this function would be possible if you simply string enough perceptrons together!
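
As an illustration of stringing units together, the sketch below builds XOR from three threshold units. The weights are picked by hand rather than learned, outputs are coded as 0 and 1 instead of $-1$ and $1$, and the names `step` and `xor_network` are assumptions made for this example.

```python
import numpy as np

def step(z):
    """Threshold unit: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

def xor_network(a, b):
    """XOR built from three threshold units with hand-picked (assumed) weights."""
    h_or = step(a + b - 0.5)          # fires if at least one input is 1 (OR)
    h_and = step(a + b - 1.5)         # fires only if both inputs are 1 (AND)
    return step(h_or - h_and - 0.5)   # OR and (NOT AND), which is XOR

a_in = np.array([0, 0, 1, 1])
b_in = np.array([0, 1, 0, 1])
print(xor_network(a_in, b_in))        # → [0 1 1 0]
```

The negative weight on `h_and` in the output unit plays the role of the NOT operation.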

1969 saw the publication of the book "Perceptrons" by Marvin Minsky and Seymour Papert. It is often said that this book popularized the XOR problem and thereby contributed to a dramatic decline in the popularity of neural networks - the first so-called AI winter. This is, however, only partly true, since the XOR result only holds for ONE SINGLE perceptron (again, a network of perceptrons would be fully capable of doing an XOR operation). What Minsky and Papert showed instead were limitations of such networks of perceptrons that related mostly to their EFFICIENCY, and not to what they can compute in principle. 

Regardless, the first AI winter did happen and funding for neural network based research did decline dramatically.

### Multi-layer perceptrons

It was already mentioned above that it should be possible to put multiple perceptron units together to create an actual neural **network**.

We will talk about this in the next lecture, when we derive a more general way to train neurons and neural networks.

