# Artificial Neural Networks

Machine learning uses many different types of models, but one class that stands out is the artificial neural network (ANN). Because ANNs are used across all types of machine learning, this section presents their basics.

ANNs are computational systems based on a collection of connected units (or nodes) called artificial neurons, which loosely mimic the neurons in a biological brain. Each connection, like a synapse in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal the artificial neurons connected to it.

*Deep learning* involves the study of complex algorithms related to ANNs. The complexity comes from the elaborate patterns in which information flows through the model. Deep learning can represent the world as a nested hierarchy of concepts, each defined in relation to simpler concepts. Deep learning techniques are used extensively in reinforcement learning and natural language processing applications.

## ANNs: Architecture, Training and Hyperparameters

ANNs contain multiple neurons arranged in layers. An ANN goes through a training phase in which it learns to recognize patterns in the data by comparing the modeled output with the desired output.

### Architecture

The architecture of an ANN comprises neurons, layers and weights.

#### Neurons

Neurons (also known as artificial neurons, nodes or perceptrons) are the basic building blocks of ANNs. Each neuron has one or more inputs and one output. Neurons can be connected into a network to compute complex logical propositions. The activation functions in these neurons create complicated, non-linear functional mappings between inputs and output.

As shown in the figure below, a neuron takes its inputs (x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>), applies the learned parameters (the weights) to generate a weighted sum (*z*), and then passes that sum to an activation function (*f*) that computes the output *f(z)*.

<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c6/Artificial_neuron_structure.svg/1200px-Artificial_neuron_structure.svg.png" width="600">
    <figcaption>Artificial neuron</figcaption>
</figure>
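The computation performed by a single neuron can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the sigmoid activation, the bias term `b`, and the example values are assumptions chosen for the sketch, and the name `neuron_output` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, activation=sigmoid):
    """Compute f(z) for one neuron, where z is the weighted sum w . x + b."""
    z = np.dot(w, x) + b
    return activation(z)

x = np.array([1.0, 2.0, 3.0])    # inputs x1, x2, x3
w = np.array([0.5, -0.25, 0.0])  # one weight per input (illustrative values)
b = 0.0                          # bias term (an assumption; not named in the text)

y = neuron_output(x, w, b)
print(y)  # z = 0.5 - 0.5 + 0.0 = 0, and sigmoid(0) = 0.5
```

Swapping `sigmoid` for another function changes only the final non-linear step; the weighted sum stays the same.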

#### Layers

The output *f(z)* of a single neuron (as shown in the figure above) cannot model complex tasks. To deal with more complex structures, we arrange many of these neurons in multiple layers. As we add neurons within each layer and stack more layers, the class of functions the network can represent becomes increasingly complex. The figure below shows the architecture of an ANN with an input layer, a hidden layer and an output layer.

<figure>
    <img src="https://i0.wp.com/i.postimg.cc/pLgLsJDt/Architecture.jpg?w=1230&ssl=1" width="600">
    <figcaption>Neural network architecture</figcaption>
</figure>

##### Input layer
- The input layer takes the input from the dataset and is the exposed part of the network. Typically, a neural network is designed with an input layer in which each neuron corresponds to a different value (feature) in the input dataset. The neurons in the input layer simply pass the input value to the next layer.

##### Hidden layers
- The layers after the input layer are called hidden, as they are not directly exposed to the input. The simplest network structure has a single neuron in the hidden layer that directly produces the output value.

A multilayer ANN can solve complex machine learning tasks thanks to its hidden layer(s). Because of ever-increasing computing power and efficient libraries, neural networks with many layers can now be built. ANNs with many hidden layers (more than three) are known as *deep neural networks*. Their hidden layers allow the network to learn features from data in a hierarchical structure: the simplest features, learned in the first layers, are combined in subsequent layers to form more complex features. ANNs with many layers pass the input data (the features) through more mathematical operations than ANNs with fewer layers, and are therefore more computationally intensive to train.

##### Output layer
- The final layer is called the output layer; it is responsible for producing a value or a vector of values in the format required to solve the problem.

#### Neuron weights
- The weight of a connection represents the strength of the link between two units and measures the influence that the input will have on the output. If the weight from neuron 1 to neuron 2 has a greater magnitude, neuron 1 has a greater influence on neuron 2. A weight close to zero means that changing this input will barely change the output. A negative weight means that increasing this input will decrease the output.

### Training

Training a neural network essentially means calibrating all the weights in the ANN. This optimization is performed with an iterative approach that alternates forward propagation and backpropagation steps.

#### Forward propagation

Forward propagation is the process of feeding input values to the neural network and obtaining an output, which we call the *predicted value*. The input values pass through the first (input) layer without any operations. The second layer takes the values from the first and applies multiplication, addition and activation operations before passing the result to the next layer. The same process is repeated for each subsequent layer until the last layer produces an output value.
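The layer-by-layer process described above can be sketched as a short loop. This is a minimal example under stated assumptions: the layer sizes, the random weights, the zero biases, and the choice of ReLU for the hidden layer with an identity output are all illustrative, and the function name `forward` is hypothetical.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: keeps positive values, zeroes out the rest."""
    return np.maximum(0.0, z)

def identity(z):
    """No-op activation, used here for the output layer."""
    return z

def forward(x, layers):
    """Propagate x through a list of (W, b, activation) layers.

    The input layer performs no operation: x is handed to the first
    (W, b, activation) tuple unchanged.
    """
    a = x
    for W, b, activation in layers:
        z = W @ a + b        # multiplication and addition
        a = activation(z)    # activation
    return a

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4), relu),      # hidden layer: 3 inputs, 4 neurons
    (rng.normal(size=(1, 4)), np.zeros(1), identity),  # output layer: 4 inputs, 1 neuron
]

x = np.array([0.2, -1.0, 0.5])
y_pred = forward(x, layers)   # the predicted value, an array of shape (1,)
print(y_pred)
```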

#### Backpropagation

After forward propagation, we obtain a predicted value from the ANN. Suppose the desired output of the network is *Y* and the predicted value from forward propagation is *Y'*. The difference between the desired output and the predicted output (*Y* - *Y'*) is converted into the loss (or cost) function *J(w)*, where *w* represents the weights in the ANN. The objective is to minimize the loss function on the training set.

The optimization method used is *gradient descent*. The goal is to find the gradient of *J(w)* with respect to *w* at the current point and take a small step in the direction of the negative gradient, repeating until a minimum is reached, as shown in the figure below.

<figure>
    <img src="https://miro.medium.com/v2/resize:fit:1142/1*AZzu43KoxDamVpWMVW0zfw.png" width="600">
    <figcaption>Gradient Descent</figcaption>
</figure>
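As a minimal sketch of gradient descent, consider a one-dimensional loss whose minimum is known in advance. The quadratic loss *J(w) = (w - 3)²*, the step size, and the number of steps are assumptions picked for illustration; the function names are hypothetical.

```python
def gradient_descent(grad, w0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)   # small step in the negative gradient direction
    return w

def grad_J(w):
    """Gradient of the illustrative loss J(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)

w_min = gradient_descent(grad_J, w0=0.0)
print(w_min)  # very close to 3, the minimizer of J
```

In a real ANN, *w* is a large vector of weights and the gradient is obtained by backpropagation, but the update rule has the same form.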

In an ANN, the function *J(w)* is basically a composition of multiple layers, as explained above. Thus, if layer 1 is represented as function *p()*, layer 2 as *q()*, and layer 3 as *r()*, then the general function will be *J(w) = r(q(p()))*. *w* consists of all the weights of the three layers. We want to find the gradient of *J(w)* with respect to each component of *w*.

Skipping the mathematical details, the chain rule implies that the gradient with respect to a component of *w* in the first layer depends on the gradients in the second and third layers. Likewise, the gradients in the second layer depend on the gradients in the third layer. Therefore, we compute the derivatives in the reverse direction, starting with the last layer, and reuse them to compute the gradients of each preceding layer. This procedure is called backpropagation.

In general, during the backpropagation process, the model error (difference between the predicted output and the desired output) is backpropagated through the network, one layer at a time, and the weights are updated according to how much they contributed to the error.
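The reverse-order gradient computation can be illustrated on a tiny network with one hidden layer. This is a sketch under explicit assumptions: a `tanh` hidden layer, a linear output layer, no biases, and a squared-error loss *J = ½ (Y' - Y)²*; none of these choices come from the text. The analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

def forward(x, W1, W2):
    h = np.tanh(W1 @ x)   # hidden layer activations
    y_pred = W2 @ h       # linear output layer
    return h, y_pred

def loss(x, y, W1, W2):
    _, y_pred = forward(x, W1, W2)
    return 0.5 * np.sum((y_pred - y) ** 2)

def backward(x, y, W1, W2):
    """Return dJ/dW1 and dJ/dW2, computed from the last layer backwards."""
    h, y_pred = forward(x, W1, W2)
    delta2 = y_pred - y                          # error at the output layer
    grad_W2 = np.outer(delta2, h)                # gradient for the last layer
    delta1 = (W2.T @ delta2) * (1.0 - h ** 2)    # error pushed back through W2 and tanh
    grad_W1 = np.outer(delta1, x)                # gradient for the first layer
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
y = np.array([1.0])
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))

grad_W1, grad_W2 = backward(x, y, W1, W2)

# Finite-difference check of a single first-layer weight.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(x, y, W1_plus, W2) - loss(x, y, W1_minus, W2)) / (2 * eps)
print(grad_W1[0, 0], numeric)  # the two values should agree closely
```

Note how `delta1` is built from `delta2`: the first-layer gradient reuses the quantity already computed for the last layer, which is the dependency described above.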

### Hyperparameters

*Hyperparameters* are variables that are set before the training process and are not learned during it. ANNs have a large number of hyperparameters, which makes them very flexible. However, this flexibility also makes the model difficult to tune. Understanding hyperparameters and the intuition behind them helps us get an idea of what values are reasonable for each one, so we can narrow the search space. Let's start with the number of layers and hidden nodes.

#### Number of layers and hidden nodes

More layers or more hidden nodes per layer means more parameters in the ANN, allowing the model to fit more complex functions. To obtain a trained network that generalizes well, we need to choose a suitable number of hidden layers and a suitable number of nodes in each hidden layer. Too few layers and nodes will lead to high errors, as the relationships in the data may be too complex for a small number of nodes to capture. Too many layers and nodes tend to overfit the training data and generalize poorly.
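To see how quickly the parameter count grows, the following sketch counts the weights and biases of a fully connected network. It assumes every layer is dense and has one bias per neuron; the helper name `count_parameters` and the example layer sizes are hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of nodes per layer, input layer first,
    e.g. [10, 5, 1] for 10 inputs, one hidden layer of 5 nodes, 1 output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out   # one weight per connection
        total += n_out          # one bias per neuron (an assumption of this sketch)
    return total

print(count_parameters([10, 5, 1]))        # 61
print(count_parameters([10, 50, 50, 1]))   # 3151
```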

There is no definitive recipe that tells us how to decide the number of layers and nodes.

The number of hidden layers depends mainly on the complexity of the task. Very complex tasks, such as large-scale image classification or speech recognition, typically require neural networks with dozens of layers and a huge amount of training data. For most problems, we can start with just one or two hidden layers and then gradually increase that number until we start to overfit the training set.

The number of hidden nodes should take into account the number of input and output nodes, the amount of training data available and the complexity of the function being modeled. As a rule of thumb, the number of hidden nodes in each layer can be chosen somewhere between the size of the input layer and the size of the output layer, with their average as a starting point. A common heuristic is to keep this number below twice the number of input nodes to limit overfitting.

#### Learning rate

When we train ANNs, we use many iterations of forward propagation and backpropagation to optimize the weights. In each iteration, we calculate the derivative of the loss function with respect to each weight, multiply it by the learning rate, and subtract the result from that weight. The learning rate therefore determines how quickly or slowly the weight values (the parameters) are updated. It must be high enough for training to converge in a reasonable amount of time, yet low enough that the updates do not overshoot the minimum of the loss function.
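The trade-off can be seen on a toy problem. The sketch below runs the same update rule with three different learning rates on an illustrative loss *J(w) = (w - 3)²*; the loss, the three rates, and the step count are assumptions chosen to make the effect visible, and the function name `run` is hypothetical.

```python
def run(learning_rate, n_steps=20, w0=0.0):
    """Apply n_steps of gradient descent to J(w) = (w - 3)**2."""
    w = w0
    for _ in range(n_steps):
        gradient = 2.0 * (w - 3.0)
        w -= learning_rate * gradient
    return w

too_small = run(0.001)   # barely moves away from the starting point
reasonable = run(0.1)    # ends close to the minimum at w = 3
too_large = run(1.1)     # overshoots more on every step and diverges

for name, w in [("too small", too_small),
                ("reasonable", reasonable),
                ("too large", too_large)]:
    print(f"{name:>10}: w = {w:.3f}, distance from minimum = {abs(w - 3.0):.3f}")
```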

#### Activation functions

Activation functions