## Artificial Neural Networks (ANNs)

Artificial neural networks are inspired by the biological neural networks found in the brains of humans and animals.
Biological networks are made up of neurons, which transmit electrical signals.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/36/Components_of_neuron.jpg/1280px-Components_of_neuron.jpg)

While the method was inspired by nature, there is a lot we still don't know about how the brain works.
ANNs are useful mathematical tools but it's not their purpose to precisely imitate nature.

A single artificial neuron (or perceptron) uses a model similar to the one we used for logistic regression: a weighted sum of the inputs, passed through an activation function.

![](https://www.researchgate.net/profile/Antonio-Parmezan/publication/330742498/figure/fig2/AS:721823524200448@1549107547175/Structure-of-Perceptron_W640.jpg)
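
As a minimal sketch of the single neuron described above, the snippet below computes a weighted sum and applies the sigmoid activation, as in logistic regression. The names `neuron`, `x`, `w`, `b` and all numeric values are illustrative assumptions, not part of any library.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, squashes any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))


def neuron(x, w, b):
    """Output of one artificial neuron: activation of the weighted sum."""
    z = np.dot(w, x) + b  # weighted sum of the inputs plus the bias
    return sigmoid(z)


# Hypothetical values, chosen only for illustration
x = np.array([1.0, 2.0, 3.0])    # input features
w = np.array([0.5, -0.25, 0.0])  # one weight per feature
b = 0.0                          # bias

# The weighted sum is 0.5 - 0.5 + 0 = 0, so the output is sigmoid(0) = 0.5
print(neuron(x, w, b))
```

The third weight is 0, so this neuron ignores the third feature entirely, no matter what value it takes.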

With multiple neurons, each neuron can be trained simultaneously on a different combination of the input features.
Instead of combining the features manually ourselves, we let the ANN learn the relations between the features by itself.
Changing the weights makes a neuron more (or less) sensitive to certain features.

The neurons of an ANN are grouped into layers:
an input layer, one or more hidden (or middle) layers, and an output layer.
In traditional feed-forward (or sequential) networks, the output of one layer is the input of the next layer.
(There are other possible structures; we will look at them later.)

![](https://miro.medium.com/max/750/1*Uhr-4VDJD0-gnteUNFzZTw.jpeg)

By using multiple layers, the ANN can build more and more complex features by combining simple ones.
This is shown above on a special type of ANN, a Convolutional Neural Network (CNN).

### Forward pass (inference)

In most ANNs, every node of a layer is connected to every node of the adjacent layers.
Layers connected this way are called dense (or fully connected) layers.
So every neuron in a layer receives the same input vector; only the weights differ.

By calculating the weighted sums and the activation functions for each neuron, layer by layer, we can calculate the output of an (already trained) ANN.
This is called the forward pass or inference process.

In regression models, we've seen how the predicted output can be calculated as a dot product of two vectors.
Here, the input is also a 1D vector, but we have multiple weight vectors: one for each neuron of a layer.
By combining the weight vectors into a matrix, the dot products for all neurons of one layer can be calculated with a single matrix multiplication.
GPUs are very good at this, since 3D graphics transformations are also done by matrix multiplications.
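
The forward pass described above can be sketched in a few lines of NumPy. This is an illustration under assumptions: the layer sizes, the random weights, the bias vectors and the sigmoid activation are all chosen arbitrarily here, and `forward` is a hypothetical helper name.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, layers):
    """Forward pass through dense layers given as (W, b) pairs."""
    a = x
    for W, b in layers:
        # Each row of W holds the weights of one neuron, so W @ a
        # computes the weighted sums of the whole layer at once.
        a = sigmoid(W @ a + b)
    return a


rng = np.random.default_rng(0)
# Hypothetical layer sizes: 4 inputs -> 3 hidden neurons -> 2 outputs
layers = [
    (rng.normal(size=(3, 4)), np.zeros(3)),
    (rng.normal(size=(2, 3)), np.zeros(2)),
]
x = rng.normal(size=4)

output = forward(x, layers)
print(output.shape)

# The matrix product gives the same result as one dot product per neuron
W1, b1 = layers[0]
per_neuron = np.array([np.dot(W1[i], x) + b1[i] for i in range(3)])
print(np.allclose(W1 @ x + b1, per_neuron))
```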

But how do we train the network?

How do we calculate the error?
How do we know which layer causes the error?
If we have multiple layers, how do we update their weights?


### Backward pass (backpropagation)

One possible training algorithm is backpropagation.

Essentially, the forward pass calculation is a series of matrix multiplications and function compositions:
$$g(x) = f^L(W^L f^{L-1}(W^{L-1} \dots f^2(W^2 f^1(W^1 x))\dots))$$

The error (also called cost or loss) measures the difference between the actual $y$ values in the training set and the predicted $p=g(x)$ values.
This is not new: we can use MSE, the negative log-likelihood used in MLE (cross-entropy), or other loss functions, depending on the output type.

To find how a particular weight $w$ should be changed, we need the partial derivative of the loss $C$ with respect to that weight, $\frac{\partial C}{\partial w}$.

In short, we can calculate the derivatives for each layer by going backwards, and using the derivative of a layer to calculate the derivative of the preceding layer.
This is also a series of matrix multiplications and elementwise multiplications.
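
To make the backward pass more concrete, here is a small sketch for a network with two sigmoid layers and a squared-error loss. The architecture, the loss, the omission of biases and the helper name `loss_and_gradients` are assumptions made for this illustration. The result is checked against a finite-difference estimate of the derivative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_gradients(x, y, W1, W2):
    """Squared-error loss of a 2-layer sigmoid network and its weight gradients."""
    # Forward pass, keeping the intermediate activations
    a1 = sigmoid(W1 @ x)
    a2 = sigmoid(W2 @ a1)
    loss = 0.5 * np.sum((a2 - y) ** 2)

    # Backward pass: start from the output layer
    delta2 = (a2 - y) * a2 * (1 - a2)         # d(loss)/d(z2), elementwise products
    grad_W2 = np.outer(delta2, a1)
    # Reuse delta2 to get the error signal of the preceding layer
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # d(loss)/d(z1)
    grad_W1 = np.outer(delta1, x)
    return loss, grad_W1, grad_W2


rng = np.random.default_rng(1)
x, y = rng.normal(size=3), np.array([0.0, 1.0])
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

loss, grad_W1, grad_W2 = loss_and_gradients(x, y, W1, W2)

# Compare one gradient entry with a central finite-difference estimate
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (
    loss_and_gradients(x, y, W1_plus, W2)[0]
    - loss_and_gradients(x, y, W1_minus, W2)[0]
) / (2 * eps)
print(grad_W1[0, 0], numeric)  # the two values should agree closely
```

Note how `delta1` is computed from `delta2`: the derivative of one layer is reused for the preceding layer, so each layer costs only one matrix multiplication and a few elementwise products.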

For an in-depth explanation of this calculation, read [Chapter 2 of Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap2.html) by Michael Nielsen.

[Playtime!](http://playground.tensorflow.org/)

## Using TensorFlow and Keras

[TensorFlow](https://www.tensorflow.org/) is an open-source ML library created by Google.

*A tensor is a generalization of vectors and matrices. A scalar (a number) is a rank-0 tensor, a 1D vector is a rank-1 tensor, a 2D matrix is a rank-2 tensor, etc.*
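
The ranks listed above can be illustrated with NumPy arrays as a stand-in for tensors; the `ndim` attribute of an array corresponds to the rank. The shapes are arbitrary examples.

```python
import numpy as np

scalar = np.array(5.0)              # rank 0
vector = np.array([1.0, 2.0, 3.0])  # rank 1
matrix = np.ones((28, 28))          # rank 2, e.g. one grayscale image
batch = np.ones((10, 28, 28))       # rank 3, e.g. ten grayscale images

for t in (scalar, vector, matrix, batch):
    print(t.ndim, t.shape)
```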

[Keras](https://keras.io/) is a high-level Python API for working with ANNs. It became part of the TensorFlow Python package.

In [None]:
# Using TensorFlow and Keras
import tensorflow as tf

# load dataset containing labeled grayscale images of handwritten digits
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# scale pixel values to 0-1 range
train_images = train_images / 255.0
test_images = test_images / 255.0
print(train_images.shape)  # 60000 28x28 pixel images

In [None]:
import plotly.express as px
import numpy as np

count = 10
print("Index:", np.array(range(count)))
print("Label:", train_labels[:count])
px.imshow(train_images[:count], animation_frame=0, color_continuous_scale="Greys")


In [None]:
dir(tf.keras.layers)

In [None]:
# build model
model = tf.keras.Sequential()
# the Flatten layer converts the pixel matrix into a 1D vector
model.add(tf.keras.layers.Flatten(input_shape=train_images[0].shape))
# hidden layer with 128 nodes
model.add(tf.keras.layers.Dense(128, "sigmoid"))
# output layer for the 10 classes
model.add(tf.keras.layers.Dense(10, "sigmoid"))

# shorter version
model = tf.keras.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=train_images[0].shape),
        tf.keras.layers.Dense(128, "sigmoid"),
        tf.keras.layers.Dense(10, "sigmoid"),
    ]
)

# compile model
model.compile(
    # loss function for classification with numerical labels
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    # display accuracy (correct/all) during training
    metrics=["accuracy"],
)

# train the model
model.fit(train_images, train_labels, epochs=5)  # epochs is the number of passes over the training set


In [None]:
# make predictions with the trained model
print(model.predict(test_images[:3]))
print("Label:", test_labels[:10])
px.bar(model.predict(test_images[:10]))


In [None]:
# The output of the model is a vector with 10 numbers. The predicted class is the one with the highest value.
count = 30
print("Predicted:", np.argmax(model.predict(test_images[:count]), axis=1))
print("Actual:   ", test_labels[:count])

In [None]:
# look at the wrong predictions
count = 10
predictions = np.argmax(model.predict(test_images), axis=1)
failures = []
fail_images = []
for i, pred in enumerate(predictions):
    if pred != test_labels[i]:
        failures.append([i, pred, test_labels[i]])
        fail_images.append(test_images[i])
print("Accuracy on test set:", 1 - (len(failures) / len(test_labels)))
for i, pred, act in failures[:count]:
    print(f"{i}: predicted {pred} instead of {act}")
fail_images = np.array(fail_images[:count])
px.imshow(fail_images, animation_frame=0, color_continuous_scale="Greys")


Nowadays, a different activation function is usually preferred over the sigmoid function for hidden layers.

The ReLU (Rectified Linear Unit) performs better in most cases and is cheaper to compute:
$$f(x) = \max(0,x)$$

It is not differentiable at 0, but the derivative there can be chosen to be either 0 or 1 (as it is $0$ for $x<0$, and $1$ for $x>0$).
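
A minimal NumPy sketch of ReLU and its derivative follows. The function names are illustrative, and the derivative at exactly 0 is set to 0 here, which is one of the two possible conventions.

```python
import numpy as np


def relu(x):
    return np.maximum(0, x)


def relu_derivative(x):
    # Convention assumed here: the derivative at exactly 0 is taken to be 0
    return (x > 0).astype(float)


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # values: 0, 0, 0, 0.5, 2
print(relu_derivative(x))  # values: 0, 0, 0, 1, 1
```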

There are also some modified variants of it:

![Source: https://en.wikipedia.org/wiki/ReLU](https://upload.wikimedia.org/wikipedia/commons/4/42/ReLU_and_GELU.svg)

![Source: https://www.researchgate.net/publication/341310767_Machine_Learning_for_Materials_Developments_in_Metals_Additive_Manufacturing](https://i.imgur.com/fVxhXQC.jpg)

In [None]:
model = tf.keras.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=train_images[0].shape),
        tf.keras.layers.Dense(128, "relu"),
        tf.keras.layers.Dense(
            10, "sigmoid"
        ),  # we want 0-1 values for the output (for now)
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=["accuracy"]
)
model.fit(train_images, train_labels, epochs=5)


The model outputs 10 numbers between 0 and 1. But these are not real probabilities, as they don't add up to 1. (If this were a multi-label problem, where one image could belong to several classes at once, sigmoid outputs would be appropriate.)

We could divide each value by the sum, then the sum would be 1. This is a linear normalization of the output.
$$y_i \rightarrow \frac{y_i}{\sum_{j=1}^n y_j}$$

Instead of this, a non-linear normalization approach is used in most cases, called softmax:
$$y_i \rightarrow \frac{e^{y_i}}{\sum_{j=1}^n e^{y_j}}$$

This also makes the values sum to 1, but its main advantage is that it works on any range of values, not just values between 0 and 1.

So, we can use the softmax function as the activation function on the output layer instead of the sigmoid function.
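
The two normalizations above can be compared with a short NumPy sketch. The input values are made up for illustration, and subtracting the maximum inside `softmax` is a common stability trick that does not change the result.

```python
import numpy as np


def linear_normalize(y):
    return y / np.sum(y)


def softmax(y):
    # Subtracting the maximum does not change the result
    # but avoids overflow in exp() for large inputs
    e = np.exp(y - np.max(y))
    return e / np.sum(e)


sigmoid_outputs = np.array([0.9, 0.2, 0.1])  # hypothetical values in the 0-1 range
raw_outputs = np.array([4.0, -1.0, -3.0])    # hypothetical unbounded values

print(linear_normalize(sigmoid_outputs))  # sums to 1
print(softmax(sigmoid_outputs))           # sums to 1, but the values stay close together
print(softmax(raw_outputs))               # sums to 1, the largest value clearly dominates
# linear_normalize(raw_outputs) would be meaningless: the sum of raw_outputs is 0
```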

In [None]:
model = tf.keras.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=train_images[0].shape),
        tf.keras.layers.Dense(128, "relu"),
        tf.keras.layers.Dense(10, "softmax"),
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=["accuracy"]
)
model.fit(train_images, train_labels, epochs=5)


In [None]:
# make predictions with the trained model
px.bar(model.predict(test_images[:10]))

While the above model works well, the same can be achieved with a different method, which is more efficient and numerically more stable.

Instead of taking the softmax of the values in the output layer, tell the loss function to use logits instead of probabilities.

*Logits are the raw, unnormalized values $z$ that are passed to the sigmoid function $\frac{1}{1+e^{-z}}$ or to softmax.*
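
The numerical benefit can be illustrated with a NumPy sketch of the idea. This is not the internal Keras implementation; the helper names and the input values are assumptions made for this example. The loss for the true class is computed once from probabilities and once directly from logits with the log-sum-exp trick.

```python
import numpy as np


def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / np.sum(e)


def cross_entropy_from_probs(probs, label):
    # Loss computed from probabilities, as when softmax is a separate layer
    return -np.log(probs[label])


def cross_entropy_from_logits(logits, label):
    # Same loss computed directly from the logits with the log-sum-exp trick
    shifted = logits - np.max(logits)
    return np.log(np.sum(np.exp(shifted))) - shifted[label]


logits = np.array([2.0, 1.0, 0.1])
print(cross_entropy_from_probs(softmax(logits), 0))
print(cross_entropy_from_logits(logits, 0))  # same value

# With extreme logits the probability of the true class underflows to 0,
# so the probability-based loss becomes infinite
extreme = np.array([1000.0, 0.0])
with np.errstate(divide="ignore"):
    loss_probs = cross_entropy_from_probs(softmax(extreme), 1)
loss_logits = cross_entropy_from_logits(extreme, 1)
print(loss_probs, loss_logits)  # inf versus 1000.0
```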

In [None]:
model = tf.keras.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=train_images[0].shape),
        tf.keras.layers.Dense(128, "relu"),
        tf.keras.layers.Dense(
            10, "linear"
        ),  # "linear" can be omitted, it's the default
    ]
)
model.compile(
    # tell the loss function to use logits
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_images, train_labels, epochs=5)


In [None]:
count = 30
print("Predicted:", np.argmax(model.predict(test_images[:count]), axis=1))
print("Actual:   ", test_labels[:count])
# make predictions with the trained model
pred = model.predict(test_images[:10])
print(pred)
probs = tf.nn.softmax(pred)
px.bar(probs)

In [None]:
# when compiling a model, we can choose from several built-in optimizers
# (or provide our own by defining a custom subclass)
# they can also be configured through parameters (e.g. learning_rate)
tf.keras.optimizers.Optimizer.__subclasses__()
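
To give a feel for what a parameter like `learning_rate` controls, here is a plain-Python sketch of gradient descent on a toy function. The function $f(w) = (w - 3)^2$, the helper name `gradient_descent` and the learning rates are assumptions for illustration; the Keras optimizers use more elaborate update rules.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimize f(w) = (w - 3)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * (w - 3)  # derivative of f at the current w
        w = w - learning_rate * grad
    return w


print(gradient_descent(0.1))  # approaches the minimum at w = 3
print(gradient_descent(1.1))  # step too large: moves away from the minimum
```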

In [None]:
for opt in ["Adam", "SGD", "Nadam", "Adamax", "Adagrad"]:
    print(5 * "=", opt, 5 * "=")
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Flatten(input_shape=train_images[0].shape),
            tf.keras.layers.Dense(128, "relu"),
            tf.keras.layers.Dense(10, "linear"),
        ]
    )
    model.compile(
        optimizer=opt,  # specify optimizer
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(train_images, train_labels, epochs=2)


There are also many [loss functions](https://www.tensorflow.org/api_docs/python/tf/keras/losses) and [layer types](https://www.tensorflow.org/api_docs/python/tf/keras/layers) to choose from. We will look at some of them later.

In [None]:
# recent Keras versions require a .keras (or .h5) file extension
model.save("my_model.keras")


In [None]:
model = tf.keras.models.load_model("my_model.keras")
count = 20
print("Predicted:", np.argmax(model.predict(test_images[:count]), axis=1))
print("Actual:   ", test_labels[:count])


## Using PyTorch

[PyTorch](https://pytorch.org) is an open-source ML framework made by Facebook (now Meta), based on the [Torch](http://torch.ch/) library. Torch used an API based on the Lua scripting language, while PyTorch offers a Python and a C++ API.

TensorFlow is more popular, especially for image recognition applications, while PyTorch is often used for natural language processing (NLP) tasks.

In [None]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Download training data.
training_data = datasets.MNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)
# Download test data.
test_data = datasets.MNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)
training_data

In [None]:
# When the training set is large, training on the whole dataset can be slow
# Mini-batch learning uses a small subset of the dataset in each step
# (For TF with Keras, batch_size can be passed to model.fit(), defaults to 32)
# Another approach is SGD (Stochastic Gradient Descent), which only uses
# a single, randomly chosen input example for each training step.
batch_size = 64

# Create DataLoaders
# Shuffle training dataset before each epoch, so batches will be different
train_dataloader = DataLoader(training_data, batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, len(test_data))

# DataLoaders provide an iterator over the batches
for X, y in train_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break
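
What the shuffling `DataLoader` does can be sketched with plain NumPy. This is a simplified illustration, not the PyTorch implementation; `iterate_minibatches` and the toy data are hypothetical.

```python
import numpy as np


def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X, y) mini-batches that together cover one epoch."""
    order = rng.permutation(len(X))  # new random order for each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


rng = np.random.default_rng(0)
# Hypothetical toy data: 10 examples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

batches = list(iterate_minibatches(X, y, batch_size=4, rng=rng))
for X_batch, y_batch in batches:
    print(X_batch.shape, y_batch)  # the last batch holds the 2 leftover examples
```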

In [None]:
# Create a model
model = nn.Sequential(
    nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10)
)
print(model)


In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
for epoch in range(5):
    print("epoch", epoch)
    for batch, (X, y) in enumerate(train_dataloader):
        # Forward pass: Compute predicted y by passing X to the model
        y_logits = model(X)
        # Compute and print loss
        loss = loss_fn(y_logits, y)
        if batch % 100 == 0:
            curr, size = batch * batch_size, len(training_data)
            print(f"  loss: {loss.item():>7f} [{curr:>5d}/{size:>5d}]")
        # Set gradients to zero
        optimizer.zero_grad()
        # Perform a backward pass (backpropagation)
        loss.backward()
        # Update weights based on gradients
        optimizer.step()
    # At the end of each epoch, evaluate the model and print accuracy
    # (gradients are not needed for evaluation)
    with torch.no_grad():
        for X, y in test_dataloader:
            y_logits = model(X)
            y_pred = torch.argmax(y_logits, dim=1)
            acc = torch.mean((y_pred == y).float()).item()
            print(f"accuracy: {acc:%}")


In [None]:
torch.save(model.state_dict(), "model.pth")
print("Saved PyTorch Model State to model.pth")

In [None]:
# Recreate the same NN structure
# A nicer method is to define a custom subclass of nn.Module
# and build the network in the constructor
model = nn.Sequential(
    nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10)
)
model.load_state_dict(torch.load("model.pth"))
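# Evaluate on the first 1000 test images and print the wrong predictions
with torch.no_grad():
    for i in range(1000):
        x, y = test_data[i]
        pred = model(x)  # x has shape [1, 28, 28], so pred has shape [1, 10]
        predicted, actual = pred[0].argmax(0).item(), y
        if predicted != actual:
            print(f'Predicted: "{predicted}", Actual: "{actual}"')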
