# Perceptron

A Perceptron is composed of a single layer of **TLUs** (threshold logic units).

## TLU

A TLU is a type of artificial neuron defined by
$$
z = x^T w
$$
In other words, $z$ is the sum of the inputs multiplied by their weights (the weights are learned during training).
A step function is then applied to $z$ to produce the TLU's output.
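
The definition above can be sketched in a few lines of NumPy. The function names `heaviside` and `tlu_output`, the example inputs and weights, and the convention of outputting 1 when $z \ge 0$ are assumptions made for illustration; this is not how scikit-learn implements its `Perceptron`.

```python
import numpy as np

def heaviside(z):
    """Step function: 1 where z >= 0, else 0 (threshold convention is an assumption)."""
    return np.where(z >= 0, 1, 0)

def tlu_output(x, w):
    """Weighted sum z = x^T w, followed by the step function."""
    z = x @ w
    return heaviside(z)

# Hypothetical inputs and weights, chosen only for illustration
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.25, 1.0])

z = x @ w                # 0.5 - 0.5 - 1.0 = -1.0
out = tlu_output(x, w)   # z is negative, so the TLU outputs 0
```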

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

In [9]:
iris = load_iris()
X = iris.data[:, [2, 3]]  # petal length and petal width
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0
y

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [11]:
per_clf = Perceptron()
per_clf.fit(X, y)
y_pred = per_clf.predict([[2, 0.5]])  # one new flower: petal length 2, petal width 0.5
y_pred

array([0])

## Multilayer Perceptron

![Linear model diagram](https://media.datacamp.com/legacy/v1725638284/image_bd3b978959.png)

The input layer uses __passthrough__ neurons, which simply output whatever they are fed.
There can be multiple middle (or __hidden__) layers. Every layer except the output layer has a single bias neuron, and all neurons in a layer are connected to all neurons in the next layer.
The final __output__ layer provides the final result.
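
A forward pass through such a network can be sketched as below. The layer sizes (2 inputs, 3 hidden TLUs, 1 output TLU), the random weights, and the names `step` and `forward` are assumptions made for illustration. The bias neuron of each layer is represented by a bias vector added to the next layer's weighted sums.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(z):
    """Step activation: 1.0 where z >= 0, else 0.0."""
    return (z >= 0).astype(float)

# Hypothetical architecture: 2 inputs -> 3 hidden TLUs -> 1 output TLU
W_hidden = rng.normal(size=(2, 3))  # every input connects to every hidden neuron
b_hidden = rng.normal(size=3)       # connections from the input layer's bias neuron
W_out = rng.normal(size=(3, 1))     # every hidden neuron connects to the output neuron
b_out = rng.normal(size=1)          # connections from the hidden layer's bias neuron

def forward(X):
    inputs = X                                   # input layer: passthrough
    hidden = step(inputs @ W_hidden + b_hidden)  # hidden layer
    return step(hidden @ W_out + b_out)          # output layer

# Two made-up instances with two features each
X_batch = np.array([[2.0, 0.5], [5.0, 1.8]])
outputs = forward(X_batch)  # shape (2, 1), one 0/1 output per instance
```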

### Backpropagation

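Backpropagation is the training algorithm for multilayer perceptrons. For each small batch of training instances, it:
1. Makes a prediction for each training instance in the batch (the forward pass).
2. Calculates the error of those predictions using a loss function.
3. Goes back through each layer in reverse order to measure the contribution of each connection to the error (the reverse pass, which applies the chain rule).
4. Applies **gradient descent** to tweak the connection weights and reduce the error.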
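
The four steps can be sketched for a tiny network in plain NumPy. Everything specific here is an assumption chosen for illustration: the XOR toy data, the 2-4-1 architecture, the sigmoid activation, the mean squared error loss, the learning rate, and the use of the whole toy dataset as a single batch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Toy data (XOR), used here as one small batch
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Hypothetical 2 -> 4 -> 1 network
W1 = rng.normal(size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))
b2 = np.zeros(1)

learning_rate = 0.5
losses = []

for epoch in range(2000):
    # 1. Forward pass: make a prediction for every instance in the batch
    a1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)

    # 2. Calculate the error (mean squared error)
    losses.append(np.mean((y_hat - y) ** 2))

    # 3. Reverse pass: chain rule, from the output layer back to the input
    d_z2 = 2 * (y_hat - y) / len(X) * y_hat * (1 - y_hat)
    grad_W2 = a1.T @ d_z2
    grad_b2 = d_z2.sum(axis=0)
    d_z1 = (d_z2 @ W2.T) * a1 * (1 - a1)
    grad_W1 = X.T @ d_z1
    grad_b1 = d_z1.sum(axis=0)

    # 4. Gradient descent step on every connection weight
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
```

Comparing `losses[0]` with `losses[-1]` shows whether the updates reduced the error on this toy problem.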

### Activation functions

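Backpropagation does not work with the step function. The step function is flat everywhere except at $z = 0$, so its gradient is 0 wherever it is defined. Gradient descent updates each weight in proportion to the gradient, so with a step activation no weight would ever change.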
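
A quick numerical check of the step function's slope makes the problem concrete. The helper `numerical_derivative` and the sample points are assumptions for illustration; the slope is estimated with a central difference at points away from $z = 0$.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def numerical_derivative(f, z, eps=1e-6):
    """Central-difference estimate of the slope of f at z."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Sample points on both sides of the jump at z = 0
z = np.array([-2.0, -0.5, 0.5, 2.0])
step_slopes = numerical_derivative(step, z)  # all zeros: nothing for gradient descent to follow
```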

The logistic (sigmoid) function was the first alternative used:
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
Its derivative is well defined and nonzero everywhere, so gradient descent can make some progress at every step.
Its output ranges from 0 to 1.
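
These properties can be checked numerically. The sketch below uses the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$; the function names and the sampled range of $z$ are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10, 10, 201)
sigma = sigmoid(z)               # stays strictly between 0 and 1
d_sigma = sigmoid_derivative(z)  # positive everywhere, largest (0.25) at z = 0
```

The derivative is never exactly zero, but it becomes very small for large $|z|$.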

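Hyperbolic tangent:
Ranges from -1 to 1 and, like the logistic function, is continuous and differentiable.
Because its output is centered around 0, each layer tends to pass roughly zero-mean values to the next layer at the start of training, which often helps speed up convergence.
$$
\tanh(z) = 2\sigma(2z) - 1
$$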
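
The identity above, and the difference in centering between the two functions, can be verified with a short NumPy check. The symmetric sample of $z$ values is an assumption chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs sampled symmetrically around 0
z = np.linspace(-3, 3, 13)

tanh_direct = np.tanh(z)
tanh_from_sigmoid = 2 * sigmoid(2 * z) - 1  # the identity from the text

identity_holds = np.allclose(tanh_direct, tanh_from_sigmoid)
tanh_mean = tanh_direct.mean()    # close to 0 for symmetric inputs
sigmoid_mean = sigmoid(z).mean()  # close to 0.5 for the same inputs
```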

Rectified Linear Unit (ReLU) function:
Not differentiable at $z = 0$, and its gradient is 0 for $z < 0$, but it is very fast to compute.
$$
R(z) = \max(0, z)
$$
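
A minimal NumPy sketch of ReLU and its slope follows. The names `relu` and `relu_grad` are assumptions, and so is the convention of using 0 as the slope at exactly $z = 0$, where the true derivative is undefined.

```python
import numpy as np

def relu(z):
    """R(z) = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Slope of ReLU: 1 for z > 0, 0 for z < 0 (0 chosen by convention at z = 0)."""
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
relu_values = relu(z)       # negative inputs are clipped to 0
relu_slopes = relu_grad(z)  # zero slope for z <= 0, unit slope for z > 0
```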
