# Artificial Neural Networks

There are many different types of models used in machine learning. One class that stands out is the artificial neural network (ANN). Because ANNs are used across all types of machine learning, we present their basics here.

ANNs are computational systems based on a collection of connected units (or nodes) called artificial neurons, which loosely mimic the neurons in a biological brain. Each connection, like a synapse in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal other artificial neurons connected to it.

*Deep learning* involves the study of complex algorithms related to ANNs. The complexity is attributed to elaborate patterns in how information flows throughout the model. Deep learning has the ability to represent the world as a nested hierarchy of concepts, each defined in relation to a simpler concept. Deep learning techniques are used extensively in reinforcement learning and natural language processing applications.

## ANNs: Architecture, Training and Hyperparameters

ANNs contain multiple neurons arranged in layers. During a training phase, an ANN learns to recognize patterns in the data by comparing its output with the desired output.

### Architecture

The architecture of an ANN comprises neurons, layers and weights.

#### Neurons

The basic building blocks of ANNs are neurons (also known as artificial neurons, nodes or perceptrons). Neurons have one or more inputs and one output. It is possible to create a network of neurons to compute complex logical propositions. Activation functions in these neurons create complicated, non-linear functional mappings between inputs and output.

As shown in the figure below, a neuron takes inputs (x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>), applies the learned parameters (the weights) to generate a weighted sum (*z*), and then passes that sum to an activation function (*f*) that computes the output *f(z)*.

<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c6/Artificial_neuron_structure.svg/1200px-Artificial_neuron_structure.svg.png" width="600">
    <figcaption>Artificial neuron</figcaption>
</figure>
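The computation performed by a single neuron can be sketched in a few lines of plain Python. This is only an illustration: the input values, the weights, the bias term `b` and the choice of a sigmoid as the activation *f* are assumptions made for the example, not values taken from the figure.

```python
import math

def sigmoid(z):
    """Example activation function f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Compute the weighted sum z, then return f(z)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical values chosen only for illustration.
x = [1.0, 2.0, 3.0]       # inputs x1, x2, x3
w = [0.5, -0.25, 0.1]     # one weight per input
b = 0.2                   # bias term (an assumption of this sketch)

output = neuron(x, w, b, sigmoid)
print(output)
```

Here the weighted sum is *z* = 0.5, so the neuron outputs the sigmoid of 0.5, roughly 0.62.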

#### Layers

The *f(z)* output of a single neuron (as shown in the figure above) will not be able to model complex tasks. So, in order to deal with more complex structures, we have multiple layers of these neurons. As we accumulate neurons horizontally and vertically, the class of functions that we can obtain becomes increasingly complex. The figure below shows the architecture of an ANN with an input layer, an output layer and a hidden layer.

<figure>
    <img src="https://i0.wp.com/i.postimg.cc/pLgLsJDt/Architecture.jpg?w=1230&ssl=1" width="600">
    <figcaption>Neural network Architecture</figcaption>
</figure>

##### Input layer
- the input layer takes the input from the dataset and is the exposed part of the network. Typically, a neural network is designed by having an input layer where each neuron corresponds to a different value present in the input data set. The neurons in the input layer just pass the input value to the next layer.

##### Hidden layers
- the layers after the input layer are called hidden, as they are not directly exposed to the input. The simplest network structure has a single neuron in the hidden layer that directly produces the output value.

A multilayer ANN is capable of solving complex tasks related to machine learning due to the hidden layer(s). Because of ever-increasing computing power and efficient libraries, neural networks with many layers can be built. ANNs with many hidden layers (more than three) are known as *deep neural networks*. Deep neural networks have several hidden layers, which allow the network to learn features from data in a hierarchical structure. In this hierarchy, the simplest attributes, learned in the first layers, are combined in subsequent layers to form more complex attributes. ANNs with many layers pass the input data (features) through more complex mathematical operations than ANNs with fewer layers, and are therefore more computationally intensive to train.

##### Output layer
- the final layer is called the output layer; it is responsible for producing a value or a vector of values that corresponds to the format required to solve the problem.

#### Neuron weights
- a weight represents the strength of the connection between units and measures the influence that the input will have on the output. If the weight from neuron 1 to neuron 2 has a greater magnitude, neuron 1 has a greater influence on neuron 2. Weights close to zero mean that changing this input will barely change the output. Negative weights mean that increasing this input will decrease the output.

### Training

Training a neural network basically means calibrating all the weights in the ANN. This optimization is performed with an iterative approach that involves forward propagation and backpropagation steps.

#### Forward propagation

Forward propagation is the process of feeding input values to the neural network and obtaining an output, which we call the *predicted value*. The first layer of the neural network passes the input values on without applying any operations. The second layer takes the values from the first and applies multiplication, addition and activation operations before passing the result to the next layer. The same process is repeated for any subsequent layers until an output value is obtained from the last layer.
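The forward pass described above can be sketched with NumPy. The layer sizes, the random weights and the choice of ReLU and sigmoid activations are assumptions made only for this illustration; `forward` is a hypothetical helper, not a library function.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Pass x through each (weights, bias, activation) triple in order."""
    a = x  # the input layer passes the values on unchanged
    for W, b, activation in layers:
        a = activation(a @ W + b)  # multiply, add, then activate
    return a

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(10, 4)), np.zeros(4), relu),    # hidden layer, 4 nodes
    (rng.normal(size=(4, 1)), np.zeros(1), sigmoid),  # output layer, 1 node
]
x = rng.random((5, 10))      # 5 samples with 10 features each
y_pred = forward(x, layers)  # one predicted value per sample
print(y_pred.shape)          # (5, 1)
```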

#### Backpropagation

After forward propagation, we obtain a predicted value from an ANN. Imagine that the desired output of a network is *Y* and the predicted value of the network from forward propagation is $Y'$. The difference between the predicted output and the desired output ($Y$ - $Y'$) is converted into the loss (or cost) function *J(w)*, where *w* represents the weights in the ANN. The objective is to optimize the loss function (that is, to make the loss as small as possible) in the training set.

The optimization method used is *gradient descent*. The goal is to find the gradient of *J(w)* with respect to *w* at the current point and take a small step in the direction of the negative gradient, repeating this until the minimum value is reached, as shown in the figure below.

<figure>
    <img src="https://miro.medium.com/v2/resize:fit:1142/1*AZzu43KoxDamVpWMVW0zfw.png" width="600">
    <figcaption>Gradient Descent</figcaption>
</figure>
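As a minimal sketch of gradient descent, consider a toy loss with a single weight, *J(w) = (w - 3)<sup>2</sup>*, whose minimum is at *w = 3*. The loss, the starting point and the learning rate are hypothetical choices for illustration; a real ANN has many weights and a far more complicated loss surface.

```python
def J(w):
    """Toy loss function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad_J(w):
    """Derivative of J with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0               # starting point
learning_rate = 0.1   # size of each step

for step in range(100):
    w -= learning_rate * grad_J(w)  # step against the gradient

print(w, J(w))
```

Each step shrinks the distance to the minimum by a constant factor, so after 100 steps *w* is very close to 3 and the loss is very close to zero.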

In an ANN, the function *J(w)* is basically a composition of multiple layers, as explained above. Thus, if layer 1 is represented as function *p()*, layer 2 as *q()*, and layer 3 as *r()*, then the general function will be *J(w) = r(q(p()))*. *w* consists of all the weights of the three layers. We want to find the gradient of *J(w)* with respect to each component of *w*.

Skipping the mathematical details, this in essence means that the gradient of a *w* component in the first layer depends on the gradients in the second and third layers. Likewise, the gradients in the second layer depend on the gradients in the third layer. Therefore, we compute the derivatives in the reverse direction, starting with the last layer, and use backpropagation to compute the gradients of each previous layer.

In general, during the backpropagation process, the model error (difference between the predicted output and the desired output) is backpropagated through the network, one layer at a time, and the weights are updated according to how much they contributed to the error.
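The following NumPy sketch shows one backward pass through a tiny two-layer network and checks one of the resulting gradients against a finite-difference estimate. Everything about the network is an assumption made for illustration: the random data, a tanh hidden layer, a linear output layer, a mean squared error loss, and the omission of bias terms.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((8, 3))        # 8 samples, 3 features
y = rng.random((8, 1))        # desired outputs
W1 = rng.normal(size=(3, 4))  # weights of layer 1
W2 = rng.normal(size=(4, 1))  # weights of layer 2

def loss(W1, W2):
    h = np.tanh(x @ W1)       # layer 1
    y_pred = h @ W2           # layer 2 (linear output)
    return np.mean((y_pred - y) ** 2)

# Forward pass, keeping the intermediate values.
h = np.tanh(x @ W1)
y_pred = h @ W2

# Backward pass: start at the output and move toward the input.
d_pred = 2.0 * (y_pred - y) / y.size    # dJ/dy_pred
grad_W2 = h.T @ d_pred                  # dJ/dW2
d_h = d_pred @ W2.T                     # error sent back to layer 1
grad_W1 = x.T @ (d_h * (1.0 - h ** 2))  # dJ/dW1, using tanh' = 1 - tanh^2

# Finite-difference check on a single weight of layer 1.
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
numeric = (loss(W1_shift, W2) - loss(W1, W2)) / eps
print(grad_W1[0, 0], numeric)
```

The gradient for layer 1 reuses `d_pred`, which was computed for layer 2. This is the dependency between layers that makes the reverse order efficient.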

### Hyperparameters

*Hyperparameters* are variables that are set before the training process and are not learned during it. ANNs have a large number of hyperparameters, which makes them very flexible. However, this flexibility makes it difficult to tune the model. Understanding hyperparameters and the intuition behind them helps us get an idea of what values are reasonable for each hyperparameter so we can narrow the search space. Let's start with the number of layers and hidden nodes.

#### Number of layers and hidden nodes

More layers or hidden nodes per layer means more parameters in the ANN, allowing the model to fit more complex functions. To have a trained network that generalizes well, we need to choose a suitable number of hidden layers, as well as a suitable number of nodes in each hidden layer. Too few layers and nodes will lead to high errors, as the relationships in the data may be too complex for a small number of nodes to capture. Too many layers and nodes will tend to overfit the training data and not generalize well.

There is no definitive recipe that tells us how to decide the number of layers and nodes.

The number of hidden layers basically depends on the complexity of the task. Very complex tasks, such as large image classifications or speech recognition, typically require neural networks with dozens of layers and a huge amount of training data. For most problems, we can start with just one or two hidden layers, and then gradually increase that number until we start to overfit the training set.

The number of hidden nodes must be related to the number of input and output nodes, the amount of training data available and the complexity of the function being modeled. As a general rule, the number of hidden nodes in each layer should be somewhere between the size of the input layer and the size of the output layer, ideally the average. This number should not exceed twice the number of input nodes to avoid overfitting.

#### Learning rate

When we train ANNs, we use many iterations of forward propagation and backpropagation to optimize the weights. In each iteration, we calculate the derivative of the loss function with respect to each weight, multiply it by the learning rate, and subtract the result from that weight. The learning rate therefore determines how quickly or slowly the weight values (parameters) are updated. It must be high enough to converge in a reasonable amount of time, yet low enough to settle at the minimum of the loss function rather than overshoot it.
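The trade-off can be seen on a toy loss *J(w) = w<sup>2</sup>*, whose minimum is at *w = 0*. The three learning rates, the number of steps and the helper `run` are hypothetical choices made only to show the effect.

```python
def run(learning_rate, steps=50, w=1.0):
    """Run gradient descent on J(w) = w**2 and return the final weight."""
    for _ in range(steps):
        w -= learning_rate * 2.0 * w  # the gradient of w**2 is 2w
    return w

too_low = run(0.01)     # still far from the minimum after 50 steps
reasonable = run(0.1)   # very close to the minimum
too_high = run(1.1)     # overshoots on every step and diverges

print(too_low, reasonable, too_high)
```

With the lowest rate the weight has only moved from 1.0 to about 0.36, with the middle rate it is essentially zero, and with the highest rate its magnitude has grown into the thousands.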

#### Activation functions

Activation functions are the functions applied to the weighted sum of inputs in each neuron of an ANN to obtain its output. They allow the network to combine inputs in more complex ways, giving it richer capabilities in the relationships it can learn and the outputs it can produce. They decide which neurons will be activated, that is, what information is passed on to subsequent layers.

Without activation functions, ANNs lose the bulk of their representation learning power. There are several activation functions. The most used are:

*Linear function (identity)*
- is represented by the equation of a straight line *(f(x) = mx + c)*, where activation is proportional to input. If we have many layers, and they are all linear in nature, the output of the last layer is still just a linear function of the input to the first layer, so stacking such layers adds no expressive power. The range of a linear function is *-inf* to *+inf*.

*Sigmoid function*
- is a function whose graph is S-shaped. It is represented by the mathematical equation *f(x) = 1 / (1 + $e^{-x}$)* and its output lies between 0 and 1. A large positive input results in an output close to 1; a large negative input results in an output close to 0. It is also referred to as the logistic activation function.

*Tanh function*
- It is similar to the sigmoid function, and its mathematical function is *Tanh (x) = 2Sigmoid (2x) - 1*, where *Sigmoid* represents the **sigmoid** function mentioned previously. The output of this function is between -1 and 1, with an equal mass on both sides of the zero axis.

*ReLU function*
- ReLU stands for rectified linear unit and is represented by *f(x) = max(x, 0)*. Thus, if the input is a positive number, the function returns the number itself; and if the input is a negative number, the function returns zero. It is one of the most commonly used functions due to its simplicity.
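The four functions listed above can be written directly from their formulas. This is a plain Python sketch for scalar inputs; the default slope `m = 1` and intercept `c = 0` for the linear function are assumptions of the example.

```python
import math

def linear(x, m=1.0, c=0.0):
    return m * x + c

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Expressed through the sigmoid, as in the formula above.
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    return max(x, 0.0)

for x in (-2.0, 0.0, 2.0):
    print(x, linear(x), sigmoid(x), tanh(x), relu(x))
```

At *x = 0* the sigmoid returns 0.5 and tanh returns 0, which reflects their different output ranges.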

There is no fixed rule for choosing an activation function. The decision depends on the properties of the problem and the relationships being modeled. We can try different activation functions and select the one that helps us achieve faster convergence and a more efficient training process. The choice of activation function in the output layer is strongly restricted by the type of problem being modeled.

#### Cost functions

Cost functions (also known as loss functions) are a performance metric for ANNs, measuring how well the ANN fits empirical data. The two best-known cost functions are:

*Mean Square Error (MSE)*
- It is the cost function used mainly for regression problems, where the output is a continuous value. MSE is measured as the average squared difference between the predictions and the actual observations.

*Cross entropy (or logarithmic loss)*
- this cost function is mainly used for classification problems, where the output is a probability value between 0 and 1. Cross-entropy increases as the predicted probability diverges from the actual label. A perfect model would have a cross-entropy of zero.
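Both cost functions are short enough to write out in plain Python. The sketch below uses the binary form of cross-entropy; the sample values and the small clipping constant `eps` (used to avoid taking the logarithm of zero) are assumptions of the example.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between observations and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average log loss for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])  # confident and right
bad = binary_cross_entropy([1, 0, 1], [0.1, 0.9, 0.2])   # confident and wrong
print(mse, good, bad)
```

The cross-entropy of the confident but wrong predictions is much larger than that of the confident and correct ones, which matches the description above.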

#### Optimizers

Optimizers update the weight parameters to minimize the loss function. The cost function acts as a guide to the terrain, telling the optimizer whether it is moving in the right direction to reach the global minimum. Below are some of the common optimizers:

*Momentum*
- the *momentum optimizer* analyzes previous gradients in addition to the current step. It takes bigger steps if the previous and current updates move the weights in the same direction (gaining momentum). It will take smaller steps if the gradient direction is opposite. A clever way to visualize this is to think of a ball rolling down a valley; it will gain momentum as it approaches the lower part of the valley.

*AdaGrad (Adaptive Gradient Algorithm)*
- *AdaGrad* adapts the learning rate to the parameters, performing smaller updates for parameters associated with features that occur frequently and larger updates for parameters associated with infrequent features.

*RMSProp*
- *RMSProp* stands for Root Mean Square Propagation. In RMSProp, the learning rate is automatically adjusted, and a different rate is chosen for each parameter.

*Adam (Adaptive Moment Estimation)*
- *Adam* combines the best properties of the AdaGrad and RMSProp algorithms to provide optimization, making it one of the most popular gradient descent optimization algorithms.
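As an illustration of the first optimizer in the list, here is a sketch of the classical momentum update on a toy loss *J(w) = (w - 3)<sup>2</sup>*. The loss, the learning rate and the momentum coefficient `beta` are hypothetical values; library implementations differ in details.

```python
def grad_J(w):
    return 2.0 * (w - 3.0)  # gradient of the toy loss (w - 3)**2

w, velocity = 0.0, 0.0
learning_rate = 0.1
beta = 0.9   # how strongly previous updates are remembered

steps = []
for _ in range(300):
    # Blend the previous update with the current gradient step.
    velocity = beta * velocity - learning_rate * grad_J(w)
    w += velocity
    steps.append(velocity)

print(steps[0], steps[1])  # the second step is larger than the first
print(w)
```

The second step is larger than the first even though the gradient has already shrunk, because both updates point in the same direction. This is the "gaining momentum" behaviour described above.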

#### Epoch

One round of updating the network for the complete set of training data is called an *epoch*. A network can be trained over tens, hundreds, or many thousands of epochs depending on data size and computational constraints.

#### Batch size

Batch size is the number of training examples in one forward/backward pass. A batch size of 32 means that 32 samples from the training dataset will be used to estimate the error gradient before the model weights are updated. The larger the batch, the more memory space is required.
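The relationship between dataset size, batch size and epochs is simple arithmetic. The numbers below are illustrative, and the sketch assumes that the final, smaller batch of each epoch is still used for an update.

```python
import math

n_samples = 1000
batch_size = 32
epochs = 10

# Number of weight updates in one pass over the data.
updates_per_epoch = math.ceil(n_samples / batch_size)
# The last batch holds whatever samples are left over.
last_batch_size = n_samples - (updates_per_epoch - 1) * batch_size
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, last_batch_size, total_updates)  # 32 8 320
```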

## Creating an Artificial Neural Network model

Previously, we talked about the steps for end-to-end development of a model in Python. Now, we will go a little deeper into the steps involved in creating an ANN-based model.

### Installing Keras and Machine Learning Packages

There are several Python libraries that allow you to create ANN and deep learning models quickly and easily, without going into the details of the underlying algorithms. Keras is one of the easiest to use; it provides efficient numerical computation for ANNs, and complex deep learning models can be defined and implemented with just a few lines of code. We will mainly use Keras to implement deep learning models in the various studies.

Keras (https://keras.io) is a high-level interface to lower-level numerical computing engines such as TensorFlow (https://www.tensorflow.org) and, in older Keras releases, Theano. To install Keras, you must first install a backend such as TensorFlow.

#### Importing packages

Before we start creating an ANN model, we need to import NumPy and two classes from the Keras package: **Sequential** and **Dense**:

```Python
import numpy as np
from keras.layers import Dense
from keras.models import Sequential
```

#### Loading data

This example makes use of **NumPy**'s *random* module to quickly generate some data and labels to be used by the ANN that we will create in the next step. Specifically, an array of size *(1000, 10)* will be constructed first. Next, we will create a label array consisting of zeros and ones with size *(1000, 1)*:

```Python
# 1000 samples with 10 random features each
data = np.random.random((1000, 10))
# Random binary labels (0 or 1), one per sample
y = np.random.randint(2, size = (1000, 1))
```

#### Model construction: defining the neural network architecture

A quick way to get started is to use the Keras Sequential model, which is a linear stack of layers. We will create a Sequential model and add layers one at a time until the network topology is finalized. The first thing we need to do is make sure the input layer has the right number of inputs. We can specify this when we create the first layer: we add a dense (fully connected) layer and use its *input_dim* argument to state how many input variables the network expects.

We add each layer to the model with the *add()* function, specifying the number of nodes in that layer. Finally, another dense layer is added as the output layer.

The model architecture will be as follows:

- the model expects rows of data with 10 variables (argument *input_dim = 10*);
- the first hidden layer has 32 nodes and uses the relu activation function;
- the second hidden layer has 32 nodes and uses the relu activation function;
- the output layer has 1 node and uses the sigmoid activation function.

The code is represented below:

```Python
model = Sequential()
model.add(Dense(32, input_dim = 10, activation = 'relu'))
model.add(Dense(32, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
```
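To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch assumes that each dense layer has one weight per connection plus one bias per node, which is how Keras `Dense` layers behave by default.

```python
layer_sizes = [10, 32, 32, 1]  # inputs, hidden layer 1, hidden layer 2, output

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_in * n_out + n_out  # weights plus one bias per node
    print(n_in, "->", n_out, ":", params)
    total += params

print("total:", total)  # total: 1441
```

Calling `model.summary()` on the model defined above should report the same total.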

#### Compiling the model

Once the model is built, it can be compiled with the help of the *compile()* function. Doing so leverages the powerful numerical libraries in the Theano or TensorFlow packages. When compiling, it is important to specify additional properties required in network training. Training a network means finding the best set of weights to make predictions for the problem in question. Thus, we must specify the loss function used to evaluate a set of weights, the optimizer used to search for different weights for the network, and any optional metrics we would like to collect and report during training.

In the following example, we use the *cross-entropy* loss function, which is defined in Keras as *binary_crossentropy*. We will also use the adam optimizer, which is a popular general-purpose choice. Finally, since this is a classification problem, we will collect and report classification accuracy as a metric. The code follows below:

```Python
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
```

#### Fitting the model

With our model defined and compiled, it's time to run it with the data. We can train - or fit - our model with the data we have loaded by calling the *fit()* function.

The training process will run for a fixed number of iterations (epochs) over the dataset, set with the *epochs* argument (called *nb_epoch* in very old Keras versions). We can also establish the number of instances that are evaluated before the weights are updated in the network. This is done using the *batch_size* argument. For this problem, we will run a small number of epochs (10) and use a batch size of 32. Again, these numbers can be chosen experimentally through trial and error:

```Python
model.fit(data, y, epochs = 10, batch_size = 32)
```

#### Evaluating the model

We have trained our neural network on the entire dataset, so we can evaluate its performance on that same dataset. This tells us how well the model fits the data it has seen (that is, the training accuracy), but not how well it will perform on new data. For that, we must separate the data into training and test sets. Here, the model is evaluated on the training data using the *evaluate()* function. This generates a prediction for each input and output pair and collects scores, including the average loss and any configured metrics such as accuracy:

```Python
# evaluate() returns the loss followed by the metrics passed to compile()
loss, accuracy = model.evaluate(data, y)
print("accuracy: %.2f%%" % (accuracy * 100))
```
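One simple way to separate the data into training and test sets is to shuffle the row indices and cut them at a chosen point. The sketch below uses NumPy only; the 80/20 ratio, the fixed seed and the variable names `X_train`, `X_test`, `y_train` and `y_test` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 10))           # same shape as the example data
y = rng.integers(0, 2, size=(1000, 1))  # random binary labels

# Shuffle the row indices, then split them 80/20.
indices = rng.permutation(len(data))
split = int(0.8 * len(data))
train_idx, test_idx = indices[:split], indices[split:]

X_train, y_train = data[train_idx], y[train_idx]
X_test, y_test = data[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (800, 10) (200, 10)
```

The model would then be trained with `model.fit(X_train, y_train, ...)` and evaluated with `model.evaluate(X_test, y_test)`. Because the labels in this example are random, the test accuracy should stay close to chance level.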

### Running a faster ANN model: GPU and cloud services

To train ANNs (especially deep neural networks with many layers), a large amount of computational power is required. Central processing units, or CPUs, are responsible for processing and executing instructions on a local machine. Since they have a limited number of cores and do much of their work sequentially, they cannot quickly perform the large number of matrix computations needed to train deep learning models. As a result, training deep learning models can be extremely slow on CPUs.

The following alternatives are useful for running ANNs that typically require a significant amount of time to run on a CPU:

- run the notebooks locally on a GPU;
- run notebooks on Kaggle Kernels or Google Colaboratory;
- use Amazon Web Services.

#### GPU

A Graphics Processing Unit, or GPU, is made up of hundreds of cores that can handle thousands of threads simultaneously. The execution of ANNs and deep learning models can be accelerated by using GPUs.

They are especially adept at processing complex matrix operations. GPU cores are highly specialized and massively speed up processes like deep learning training by taking processing away from CPUs and directing it to cores in the GPU subsystem.

Many machine learning-related Python packages, including TensorFlow, Theano, and Keras, can be configured to use GPUs.