# Intro to ANNs

Artificial Neural Networks (ANNs) have been around for decades, with interest rising and falling over time. NNs regularly outperform other ML techniques on very large and complex problems. Experts expect NNs to stay popular this time due to the vast amount of available data and compute power. Despite these advantages, their lack of explainability (they are black-box models) could become a growing concern in AI ethics.

## 1. Perceptron

Invented in 1957, the Perceptron is one of the simplest ANN architectures. It consists of just one **threshold logic unit (TLU)**, which computes the weighted sum of its inputs and applies a step function (as its activation function) to that sum to generate the output.
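
To make the TLU computation concrete, here is a minimal NumPy sketch. The function name `tlu`, the weights, and the bias are hypothetical values chosen for illustration (they make the unit act like a logical AND), and the threshold is expressed as a bias term, which is an assumption about notation rather than something the text above prescribes.

```python
import numpy as np

def tlu(x, w, b):
    """Threshold logic unit: weighted sum of the inputs followed by a step function."""
    z = np.dot(w, x) + b   # weighted sum plus bias (the bias plays the role of a negative threshold)
    return int(z >= 0)     # Heaviside step: 1 if z >= 0, else 0

# Hypothetical weights and bias that make the unit behave like a logical AND
w = np.array([1.0, 1.0])
b = -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", tlu(np.array(x), w, b))   # only (1, 1) yields 1
```

Only the input `(1, 1)` pushes the weighted sum above the threshold, so it is the only instance classified as positive.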

Thus, the Perceptron with a single TLU can be used for simple linear binary classification, similar to Logistic Regression. If the weighted sum exceeds the threshold, the instance is classified as positive, otherwise as negative. Extending the Perceptron with multiple TLUs allows for multiclass predictions.

A big downside of Perceptrons is that they are incapable of learning complex patterns, since the decision boundary is linear (same as in Logistic Regression). A more complex ANN is necessary for datasets that are not linearly separable. Furthermore, unlike Logistic Regression, the Perceptron does not output a class probability but a class assignment based on a hard threshold, which is why Logistic Regression should generally be preferred.
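
The linear decision boundary can be illustrated with the XOR problem, a standard example of data that is not linearly separable. The sketch below is an assumption-laden illustration: it uses the classic perceptron learning rule written by hand in NumPy (not scikit-learn's `Perceptron`), and the helper names `train_perceptron` and `accuracy` are hypothetical.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Classic perceptron rule: adjust the weights after every misclassified sample."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = int(np.dot(w, xi) + b >= 0)   # step activation
            w += lr * (yi - pred) * xi           # no change when the prediction is correct
            b += lr * (yi - pred)
    return w, b

def accuracy(X, y, w, b):
    preds = (X @ w + b >= 0).astype(int)
    return float(np.mean(preds == y))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

acc_and = accuracy(X, y_and, *train_perceptron(X, y_and))
acc_xor = accuracy(X, y_xor, *train_perceptron(X, y_xor))
print("AND accuracy:", acc_and)
print("XOR accuracy:", acc_xor)
```

The AND labels can be separated by a straight line, so the rule converges to a perfect classifier. No single line separates the XOR labels, so the accuracy stays below 1.0 regardless of how long training runs.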

**Let's apply the Perceptron with a single TLU to the iris dataset**

In [4]:
# load data
from sklearn import datasets
import numpy as np
from sklearn.linear_model import Perceptron

iris = datasets.load_iris()

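X = iris["data"][:, 2:3]  # petal length only (column 2); use [:, 2:4] for petal length and width
y = (iris["target"] == 0)  # Setosa?
y = y.astype(int)  # np.int was removed in NumPy 1.24; the built-in int is equivalent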

per_clf = Perceptron()
per_clf.fit(X, y)

y_pred = per_clf.predict(X)
y_pred

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [2]:
y

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## 2. Multilayer Perceptron
A single TLU Perceptron is very limited. Some of its limitations can be eliminated by stacking multiple Perceptrons together as multiple layers, resulting in a network architecture called **Multilayer Perceptron (MLP)**.

An MLP is composed of one input layer followed by one or more layers of TLUs (hidden layers) and a final layer of TLUs (output layer). Every layer except the output layer has a bias unit. Each layer is fully connected to the next one.

One key difference is that instead of a step function, MLPs use a **sigmoid function** as the activation function. A step function is flat everywhere except at the threshold, so its gradient is zero and Gradient Descent would have nothing to work with. Other activation functions with useful gradients, such as ReLU, work too.
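
The difference between the two activation functions can be checked numerically. The following is a small sketch, not part of the original notebook: the helper names and the sample points in `z` are hypothetical, and the step gradient is estimated with a central finite difference.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # analytic derivative of the sigmoid

def numerical_grad(f, z, eps=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])   # sample points away from the threshold at 0
print("step gradient:   ", numerical_grad(step, z))
print("sigmoid gradient:", sigmoid_grad(z))
```

Away from the threshold the step function has a gradient of exactly zero, while the sigmoid has a strictly positive gradient everywhere, which gives Gradient Descent a direction to move in.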

In [19]:
# Perceptron (single TLU)
from sklearn.linear_model import Perceptron
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

perceptron = Perceptron(random_state=1, max_iter=300).fit(X_train, y_train)

print("actual classes: " + str(y_test))
print("predicted classes: " + str(perceptron.predict(X_test)))
print("accuracy score: " + str(perceptron.score(X_test, y_test)))

actual classes: [1 0 1 1 1 0 0 1 1 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 1]
predicted classes: [1 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 1]
accuracy score: 0.92


In [21]:
# Multilayer Perceptron Classifier
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

mlp_clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)

print("actual classes: " + str(y_test))
print("predicted classes: " + str(mlp_clf.predict(X_test)))
#print(mlp_clf.predict_proba(X_test).ravel())
print("accuracy score: " + str(mlp_clf.score(X_test, y_test)))
print("number of layers: " + str(mlp_clf.n_layers_))

actual classes: [1 0 1 1 1 0 0 1 1 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 1]
predicted classes: [1 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 0 0 1]
accuracy score: 0.88
number of layers: 3


The **Backpropagation** algorithm computes the gradient of the network error with regard to every single model parameter, using one forward and one backward pass through the network. Knowing how each weight and bias contributes to the error lets us tweak them in order to reduce it. This is repeated until the network converges to a solution.

Note: Automatically computing gradients is called **automatic differentiation (autodiff)**. Backpropagation is a type of autodiff, called reverse-mode autodiff.
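
As a rough illustration of reverse-mode differentiation, the sketch below works through a single sigmoid neuron by hand and compares the result with a finite-difference estimate. The squared-error loss, the function name `loss_and_grads`, and all numeric values are assumptions made for this example; real autodiff libraries automate the same chain-rule bookkeeping.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w, b, x, y):
    # Forward pass: compute and keep every intermediate value
    z = np.dot(w, x) + b
    a = sigmoid(z)
    loss = (a - y) ** 2
    # Reverse pass: apply the chain rule from the loss back to the parameters
    dloss_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)
    dloss_dz = dloss_da * da_dz
    grad_w = dloss_dz * x   # dz/dw = x
    grad_b = dloss_dz       # dz/db = 1
    return loss, grad_w, grad_b

# Hypothetical parameters and a single training instance
w = np.array([0.5, -0.3])
b = 0.1
x = np.array([1.0, 2.0])
y = 1.0

loss, grad_w, grad_b = loss_and_grads(w, b, x, y)

# Sanity check: estimate the same gradient with central finite differences
eps = 1e-6
num_grad_w = np.zeros_like(w)
for i in range(len(w)):
    delta = np.zeros_like(w)
    delta[i] = eps
    loss_plus = loss_and_grads(w + delta, b, x, y)[0]
    loss_minus = loss_and_grads(w - delta, b, x, y)[0]
    num_grad_w[i] = (loss_plus - loss_minus) / (2 * eps)

print("reverse-mode gradient:     ", grad_w)
print("finite-difference estimate:", num_grad_w)
```

The reverse pass needs one sweep to obtain the gradient for all parameters at once, whereas the finite-difference check needs two extra loss evaluations per parameter. That cost difference is the reason reverse-mode autodiff suits networks with many weights.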

### Learning process:

We start with random weights (initialization).

Using mini-batches, the algorithm goes through the full training set multiple times. Each pass through the training set is called an **epoch**.

The result of one layer becomes the input of the next layer until we reach the final layer (forward pass). 

Then the output is compared to the desired output and a measure of error is returned. 

Next, it calculates how much each output connection contributed to the error, and how much of the error came from the connections in the layer before, and so on (reverse pass). 

Finally, Gradient Descent is performed to tweak all the connection weights in the network using the error gradients just computed.

All this together is called Backpropagation.
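
The steps above can be sketched as a small training loop in NumPy. This is a minimal illustration under several assumptions that the text does not specify: a 2-4-1 network with sigmoid activations, a mean squared error loss, the XOR dataset, and hypothetical values for the learning rate, batch size, and number of epochs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR dataset (assumed here only because it is small and not linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Random initialization: 2 inputs -> 4 hidden units -> 1 output
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

def mean_loss():
    out = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    return float(np.mean((out - y) ** 2))

lr, batch_size, n_epochs = 0.5, 2, 5000   # hypothetical hyperparameters
losses = [mean_loss()]

for epoch in range(n_epochs):              # one epoch = one pass over the training set
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]

        # Forward pass: each layer's output feeds the next layer
        h = sigmoid(xb @ W1 + b1)
        out = sigmoid(h @ W2 + b2)

        # Error between the network output and the desired output
        err = out - yb

        # Reverse pass: error gradients for the output layer, then the hidden layer
        d_out = 2 * err * out * (1 - out) / len(xb)
        d_h = (d_out @ W2.T) * h * (1 - h)

        # Gradient Descent step on all weights and biases
        W2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (xb.T @ d_h)
        b1 -= lr * d_h.sum(axis=0)
    losses.append(mean_loss())

print("initial loss:", losses[0])
print("final loss:  ", losses[-1])
```

Tracking the loss after each epoch is a simple way to see whether training is making progress. The exact values depend on the random initialization and the chosen hyperparameters.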