[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Mayo-Radiology-Informatics-Lab/MIDeL/blob/main/chapters/8A.ipynb)

*Author: Joe Sobek, MS*

# Chapter 8: Components of Deep Learning Models

## Introduction

In this chapter we give a basic description of the simplest and most fundamental elements of the most common form of deep learning model: artificial neural networks (also known as neural networks or ANNs for short). We’ll also discuss a few commonly used neural network ‘architectures’, which is another way to describe the overall structure of a neural network or the key design choices used by a group of similar neural networks. New neural network architectures are constantly being developed, but fundamentally these all make use of the concepts discussed in this chapter (or variations of them). Our intent is to provide you with enough knowledge so that you will feel familiar with the fundamental deep learning concepts and feel equipped to pursue deeper knowledge when the need arises.

## Nodes

The most basic element of an ANN is called a node or, in some literature, a neuron. Despite the complexity of biological neurons, neural network nodes are fairly simple and generally only have two parts.

### Weights and Biases

The two main components of nodes are the weight component and the bias component, mathematically described like this:

$y = \vec{w} \cdot \vec{x} + b$

Here $\vec{x}$ represents the input to the node and $y$ represents the output of the node. The node’s components are represented by $\vec{w}$ for the weight and $b$ for the bias applied to the node’s output. These weight and bias components contain the node’s parameters, which will be adjusted during the training process.

Note that an individual node’s input $\vec{x}$ and the node’s weights $\vec{w}$ are tensors (the 1-d form is more often called a vector, just as the 2-d form is often called a matrix), while the output $y$ and bias $b$ are scalars (single values). Multiplying the node’s weights and the input element by element creates a tensor of the same size. All the values in that tensor are then added together (this is the dot product in the equation above), and the bias is added to that sum to determine the node’s output. Like a biological neuron, each node in a neural network receives many inputs to generate (usually) a single output, which is then sent to one or many other nodes in subsequent layers.
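The node equation can be sketched in a few lines of plain Python. This is only an illustration: the function name `node_output` and the example values are made up for this sketch, and real frameworks perform the same arithmetic on whole layers at once.

```python
def node_output(x, w, b):
    """Weighted sum of the inputs plus a bias: y = w . x + b."""
    if len(x) != len(w):
        raise ValueError("input and weights must have the same length")
    # Multiply each input by its weight, then add everything together.
    products = [w_i * x_i for w_i, x_i in zip(w, x)]
    return sum(products) + b


x = [1.0, 2.0, 3.0]    # inputs to the node (illustrative values)
w = [0.5, -1.0, 2.0]   # one weight per input
b = 0.5                # a single bias value
y = node_output(x, w, b)
print(y)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```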

### Nonlinearities

Nonlinearities are exactly what they sound like: a nonlinear equation which is applied to the output of a node. Nonlinearities aren’t actually a part of an individual node, but because they are the key that makes ANNs so analytically powerful it’s important to introduce them as soon as possible.

Why are nonlinearities so important? As the equation at the beginning of this section shows, nodes use linear computations. Linear equations can only be combined to form other linear equations, so it’s impossible for a collection of linear equations to represent nonlinear phenomena. But most real-world data follows nonlinear patterns, and a model needs nonlinearities to learn them. This means purely linear ANNs aren’t very useful.

To address this issue, each node’s output $y$ is passed through an activation function, which applies a nonlinear function such as tanh before the result is sent to the next nodes. By doing this, ANNs gain the ability to represent nonlinear phenomena and machine learning practitioners are able to create ANNs with the potential to model complex, nonlinear problems. Later in this chapter we’ll give some examples of commonly used activation functions.
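The following sketch illustrates the argument with two single-input nodes chained together. The weights and biases are arbitrary values chosen for this example. Without an activation the two nodes collapse into one linear node, while inserting tanh between them produces a mapping that is no longer linear.

```python
import math


def linear(x, w, b):
    return w * x + b


def two_linear_nodes(x):
    # Node 2 applied to the output of node 1, with no activation in between.
    return linear(linear(x, 2.0, 1.0), 3.0, -4.0)


def collapsed(x):
    # The same mapping written as ONE linear node: 3*(2x + 1) - 4 = 6x - 1.
    return linear(x, 6.0, -1.0)


def two_nodes_with_tanh(x):
    # Same two nodes, but with a tanh activation after the first one.
    return linear(math.tanh(linear(x, 2.0, 1.0)), 3.0, -4.0)


for x in [-2.0, 0.0, 0.5, 3.0]:
    print(x, two_linear_nodes(x), collapsed(x), two_nodes_with_tanh(x))

# For any linear function f, f(-1) + f(1) equals 2 * f(0).
# The purely linear stack passes this check; the tanh version does not.
linear_gap = two_linear_nodes(-1.0) + two_linear_nodes(1.0) - 2 * two_linear_nodes(0.0)
tanh_gap = two_nodes_with_tanh(-1.0) + two_nodes_with_tanh(1.0) - 2 * two_nodes_with_tanh(0.0)
print(linear_gap, tanh_gap)
```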

## Linear Layers

Linear layers are the simplest layers found in ANNs. Each layer contains an arbitrary number of nodes like those described above. The simplest ANN consists of an input layer, a single hidden linear layer, and an output layer. Because each node is connected to every node in the preceding and following layers, these layers are often called “dense,” “densely connected,” or “fully connected” layers. As the PyTorch framework uses the term ‘Linear’ for this type of layer, we will use ‘Linear’ as well. Something to be aware of is the class of neural network architectures called ‘DenseNets.’ Despite the confusing nomenclature, these are not made using Linear layers and will be covered in a more advanced section of this chapter.

<br><img src="https://i.ibb.co/tD1FWtm/MIDeL8-1.png" alt="MIDeL8-1" border="0"><u><br /><b>Figure 1.</b>  Schematic of linear layers in an artificial neural network ([source](https://stackabuse.com/deep-learning-in-keras-building-a-deep-learning-model/))</u><br><br>

Now that we’ve introduced a simple deep learning network, we can define some terms to help compare architectures. In order to be considered a deep learning network, a network must contain an input layer, two or more hidden layers, and an output layer. ANNs become “deeper” as more hidden layers are added, and “wider” as the number of nodes in each hidden layer increases. Linear layers generally only allow you to choose the width of the layer and whether or not the nodes can use a bias parameter; however, some deep learning frameworks include a few more advanced options as well.

While linear layers form a key component of many ANN architectures and can be applied to nearly every deep learning task, there are a few important details that need to be kept in mind while working with them. Linear layers typically only accept 1-dimensional data (such as a list of values or a vector). Linear layers require their input to have a fixed size (e.g. if your layer is designed to accept a list of 30 values, then every list given as input must have 30 values). For problems where the length of each example can vary, additional work must be done. Finally, linear layers act on all of their input simultaneously and, though not a hard and fast rule, this may discard spatial information. A network created using linear layers may discover spatial relationships, but linear layers aren’t created with any mechanism to force models to consider spatial information. In fact, converting data where spatial information is important, such as images, into the 1D vector that linear layers require generally makes it much harder for the network to ‘see’ spatial relationships in the data.
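As a rough illustration of these constraints, the sketch below flattens a tiny 2x2 "image" into a 1-D list and passes it through a two-node linear layer. The helper names `flatten` and `linear_layer` and all numeric values are assumptions made for this example, not part of any framework.

```python
def flatten(image):
    """Turn a 2-D image (list of rows) into a 1-D list, row by row."""
    return [pixel for row in image for pixel in row]


def linear_layer(x, weights, biases):
    """Dense layer: every node sees every input value."""
    outputs = []
    for w, b in zip(weights, biases):
        if len(w) != len(x):
            # The input size is fixed by the shape of the weights.
            raise ValueError(f"expected {len(w)} inputs, got {len(x)}")
        outputs.append(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
    return outputs


image = [[1.0, 2.0],
         [3.0, 4.0]]
x = flatten(image)                  # [1.0, 2.0, 3.0, 4.0]
weights = [[1.0, 0.0, 0.0, 1.0],    # node 1
           [0.5, 0.5, 0.5, 0.5]]    # node 2
biases = [0.0, 1.0]
layer_out = linear_layer(x, weights, biases)
print(layer_out)  # [5.0, 6.0]
```

After flattening, pixels that were vertical neighbors in the image (1.0 and 3.0) are no longer adjacent in the list, and nothing in the layer records that they used to be.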

### Dense (“fully connected”) vs. Sparse connections

Though rarely used, there is another form of linear layer that is worth being aware of. In contrast to the “dense” linear layers described above, these “sparse” linear layers connect each of their nodes to only some of the layer’s inputs, and allow each node to send its output to only some of the subsequent layer’s nodes. They offer advantages in certain contexts, but are less straightforward to work with and go beyond the scope of this chapter.

### Dropout

Before moving past linear layers there is one final topic to discuss: dropout. Dropout can be used with many kinds of layers but is most straightforward with linear layers. Dropout layers randomly “turn off” some of the nodes in the previous layer during the training process by passing through a value of ‘0’ while the other nodes’ output is sent through. Which nodes are turned off changes randomly at every training step (every forward pass through the layer), not just once per epoch. During inference or validation, the dropout layer will generally **not** turn off any nodes.
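A minimal sketch of a dropout layer is shown below. The function name and signature are hypothetical. One detail goes beyond the description above and is an assumption of this sketch: the surviving values are scaled by $1/(1-p)$ during training (the "inverted dropout" convention used by common frameworks such as PyTorch) so that the expected output stays the same when dropout is switched off at inference time.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Zero each value with probability p during training; do nothing otherwise."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    # A fresh random mask is drawn every time the function is called.
    # Surviving values are scaled by 1/keep (an assumption, see above).
    return [v / keep if rng.random() >= p else 0.0 for v in values]


node_outputs = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)  # seeded only to make the example repeatable
train_out = dropout(node_outputs, p=0.5, training=True, rng=rng)
eval_out = dropout(node_outputs, p=0.5, training=False)
print(train_out)  # some entries are 0.0, the rest are doubled
print(eval_out)   # unchanged: [1.0, 2.0, 3.0, 4.0]
```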

<br><img src="https://i.ibb.co/tQx4SG2/MIDeL8-2.png" alt="MIDeL8-2" border="0"><u><br /><b>Figure 2.</b>  Effect of dropout ([source](https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf))</u><br><br>

Why is turning off nodes useful? To answer that question, we need to describe a problem you may encounter while training ANNs. Pursuing better model performance often leads us to creating wider and deeper ANNs. However, for large ANNs there is a risk that, as the training process updates the weights of nodes, it begins to specialize and overvalue the output of relatively few nodes while ignoring most other nodes. This results in a model which considers only a small amount of the available information (or features) in your dataset. In the worst cases, very large ANNs will begin to consider only features that are specific to the training examples. When this happens, it becomes much less likely that the model will be able to make accurate predictions when given new data. This is one mechanism by which overfitting occurs (See the discussion in chapter 10).

However, turning off over-specialized nodes forces other nodes to be used when classifying examples that node is specialized for, and temporarily reducing the overall power of the network by turning off many nodes forces over-specialized nodes to help classify examples they may not be specialized for. This makes it more likely that the model will make predictions using features that are present in most of your training dataset. This process of preventing models from specializing and encouraging them to use generalizable features is called model regularization. Dropout is only one of many such techniques.

## Convolutional layers

Our focus in this book is using deep learning with medical imaging. Though linear layers are useful in a wide variety of problem domains, when it comes to imaging they suffer major drawbacks. Because linear layers perform only a single calculation that uses the entire input at once, the number of parameters and memory requirements grow rapidly as image resolution increases. 
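To give a feel for the growth, the sketch below counts the parameters of a single linear layer applied to a flattened, single-channel square image. The choice of 10 nodes in the layer is an arbitrary assumption for illustration.

```python
def linear_param_count(in_features, out_features, bias=True):
    """One weight per input per node, plus one bias per node."""
    return in_features * out_features + (out_features if bias else 0)


counts = {}
for side in (28, 256, 512):
    n_pixels = side * side  # length of the flattened image
    counts[side] = linear_param_count(n_pixels, 10)
    print(f"{side}x{side}: {counts[side]} parameters")
# 28x28:   7850
# 256x256: 655370
# 512x512: 2621450
```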

This also means linear layers are only compatible with datasets of images that all have the same resolution (e.g. every image must be 28x28, or every image must be 256x256). In addition, spatial relationships (things like shapes and textures) are important for imaging problems and linear layers often struggle to find these. To overcome these issues, a different type of layer called a “convolutional layer” is often used for images.

Unlike the nodes in linear layers, the weight and bias parameters of a convolutional layer are structured into something called a kernel. Rather than looking at the entire input at once, the kernel accepts only a small piece of the input example, which it processes before moving on to the next piece of the input example. For each piece of the input example, the kernel generates one value which is placed in the output according to the position of the piece of input being fed to the kernel. This process of incrementally feeding different pieces of an input example to the kernel to generate the output is known as “convolution,” and is where convolutional layers get their name. Because the kernel only processes a small piece of the image at any given step, the convolutional layer only needs enough parameters to calculate the output from a portion of the image, which is far fewer than a linear layer would require.
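The sliding-kernel process can be sketched for a single-channel 2-D input as follows. This is a deliberately naive, loop-based illustration with no padding and a step of one pixel, and the function name `conv2d` is just a label for this sketch. Note that, like most deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel, bias=0.0):
    """Slide `kernel` over `image` one pixel at a time (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # One output value per position: weighted sum of the patch.
            total = bias
            for di in range(kh):
                for dj in range(kw):
                    total += kernel[di][dj] * image[i + di][j + dj]
            row.append(total)
        output.append(row)
    return output


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 1, 1],
          [1, 1, 1],
          [1, 1, 1]]  # 9 weights, reused at every position
feature_map = conv2d(image, kernel, bias=0)
print(feature_map)  # [[54, 63], [90, 99]]
```

The same 9 weights are reused at every position, which is why the parameter count does not depend on the size of the image.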

<br><img src="https://i.ibb.co/ZMTKcv4/MIDeL8-3.gif" alt="MIDeL8-3" border="0"><u><br /><b>Figure 3.</b>  Sample convolution operation ([source](https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1))</u><br><br>

The use of convolution has important consequences. Most significantly, the kernel’s structured (e.g. 2D) nature allows it to discover spatial relationships that exist in data, and the iterative combination of a structured kernel with a structured input allows the layer to create a structured output, which means the spatial relationships discovered by a convolutional layer can be passed to the next layer. Ultimately this allows networks made from convolutional layers, called convolutional neural networks or CNNs, to detect complex structures in images where a simpler ANN would struggle, such as when the structure of interest is not always found in the same orientation or position within the images.

In comparison to linear layers, convolutional layers include many hyperparameters that allow you to configure both the kernel and the convolution process. We’ll discuss those shortly.

### Kernel Dimensionality

The first thing to choose when configuring your kernel is its dimensionality, which is determined by the layer you choose to use. Typical choices are 1, 2, or 3 dimensions, and these options are included in most deep learning frameworks. Some example data for each case are: an ECG with 1 dimension, a radiograph with 2 dimensions, and a CT scan with 3 dimensions.

In most cases we choose a kernel that has the same number of dimensions as the number of real-space dimensions in our data, but sometimes, as with ECG data, the dimension the kernel moves along is something else, such as time. We’ll refer to these as spatial dimensions throughout this section, despite the occasional exceptions. There are also cases where it's useful to organize our data in a way that allows us to use convolutional kernels with more or fewer spatial dimensions than are present in the raw data, such as breaking a CT scan apart into single slices to feed into a 2-D CNN instead of using multiple slices at once.

While spatial dimensions typically determine the dimensionality of our kernel, imaging and medical data frequently includes a non-spatial dimension referred to as the “channel dimension.” While spatial dimensions typically represent a length of some kind, such as a distance or a timespan, channel dimensions can represent nearly anything. An example, as discussed in Chapter 4, is the typical color photograph, a 2-dimensional image with a height and a width that has 3 different color channels for each pixel: red, green, and blue. Similarly, ECGs often provide multiple channels of data, where each channel represents a different probe or detector. In contrast, a CT scan or a black and white image generally only has a single channel of data. MRIs often acquire multiple types of contrast for the same patient, and if these are ‘registered’ (spatially aligned) one can construct a multi-channel image from the multiple individual images.

This channel dimension generally isn't counted when choosing the kernel’s dimensionality, though exceptions exist. Instead, an extra dimension is added to the kernel to account for the incoming channels. For example, the data fed to a 3-D convolutional layer will have 5 total dimensions: 3 for the spatial dimensions, 1 for the channels, and a fifth dimension that accounts for the number of different examples in each mini-batch fed to the model. The output from each example is kept separate from the output of the other examples in the batch.

### Kernel Size

Kernel size, sometimes called shape, is the next hyperparameter to choose. This is a choice of the spatial length for each side of the kernel and represents how much of the input data the kernel will be given during each step of the convolution process. For example, a 3x3 kernel will analyze 9 pixels at each step, and a 3x3x3 kernel will look at 27 voxels at each step of the convolution process.

Your choice for a kernel’s size has important consequences, including the number of parameters the convolutional layer has and the “receptive field” of that layer as well as your network. To determine the total number of parameters needed for a convolutional layer, you multiply the lengths of each side of the kernel with the number of incoming channels, or input features, and the number of filters, or output features, desired for the layer. Finally, you add the number of bias parameters. For example, a convolutional layer using a 3x3 kernel with 3 input channels, 32 output channels, and with bias enabled has 896 parameters. This is 9 (one for each pixel in the 3x3 kernel) x 3 (one for each input channel) x 32 (one for each output channel) weight parameters (864 total) + 32 (one for each output channel) bias parameters.
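The counting rule above can be written as a small helper. The function name is hypothetical; the arithmetic follows the description in the text.

```python
from math import prod


def conv_param_count(kernel_size, in_channels, out_channels, bias=True):
    """Kernel side lengths x input channels x filters, plus one bias per filter."""
    weights = prod(kernel_size) * in_channels * out_channels
    return weights + (out_channels if bias else 0)


example = conv_param_count((3, 3), in_channels=3, out_channels=32)
print(example)  # 9 * 3 * 32 + 32 = 896
print(conv_param_count((3, 3, 3), in_channels=1, out_channels=8))  # 27 * 8 + 8 = 224
```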

“Receptive field” is a measure of how much of the original input image your convolutional layer sees at each step of the convolution process. At the input layer, this corresponds directly to the dimensions of the kernel. However, this can be difficult to determine in subsequent layers, as it will depend on the size or receptive field of every previous convolutional layer.

For example, a convolutional layer with a 3x3 kernel that receives its input from another convolutional layer with a 3x3 kernel applied to the input image will ‘see’ a 5x5 chunk of the input image. The figure below shows how this works. Green represents a single 3x3 convolution operation between Layers 1 and 2, while the overall 5x5 colored area in Layer 1 represents a single 3x3 convolution operation between Layers 2 and 3. Because each pixel in Layer 2 represents a 3x3 area in Layer 1, the yellow pixel in Layer 3 contains information from the 5x5 region shown in Layer 1.

<br><img src="https://i.ibb.co/mqK7LYf/MIDeL8-4.png" alt="MIDeL8-4" border="0"><u><br /><b>Figure 4.</b>  Kernel stacking ([source](https://doi.org/10.3390/rs9050480))</u><br><br>

One thing to note is that not every pixel shown in Layer 1 is valued equally in Layer 3’s yellow pixel. For example, Layer 1’s center pixel is part of all nine contributing convolution operations, while the corner pixels are only part of one convolution operation each. The pattern of each convolution extending the area ‘seen’ by one extra pixel length on every edge repeats as you add 3x3 convolutional layers. If you use larger convolutional kernels, the receptive field will be extended by a larger amount.
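The growth pattern can be sketched with a short recurrence. This is an illustration along one spatial dimension, assuming no dilation. The function name and the `(kernel_size, stride)` layer description are conventions invented for this example.

```python
def receptive_field(layers):
    """Receptive field (along one dimension) after a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from the
    layer nearest the input to the deepest layer.
    """
    rf = 1    # before any layer, a value "sees" only itself
    jump = 1  # distance, in input pixels, between neighboring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(3, 1)]))                  # 3
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(5, 1), (5, 1)]))          # 9
```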

A full description is beyond the scope of this chapter, but this ability for combinations of convolutional layers to analyze large pieces of your input image, despite each individual layer having a relatively small kernel, is important for understanding why CNNs are so powerful for imaging problems.

### Filters vs. Channels

We mentioned two terms above that may benefit from extra clarification: “filters” and “channels.” Channels are normally discussed as described above, representing different kinds of non-spatial information in the data, for example the intensities of different colors, measurements from different ECG leads, or different detectors in an MRI scanner. Deeper in the network, the data passed between layers uses channels to represent the presence of different features the network is looking for.

Filters, on the other hand, are related to the convolutional kernel. Each filter looks at every data channel in the area (or volume) that the kernel is looking at and generates a single output value for every step of the convolution. Each filter determines one channel of the layer’s output data and, when constructing a network, the number of filters coming out of a convolutional layer must match the number of channels the next convolutional layer expects to read. Though there may be more, fewer, or the same number of filters in the kernel as channels in the data, every filter generally uses information from every input channel to determine its output.

There are exceptions. Usually this is determined by a layer hyperparameter referred to as “groups.” This hyperparameter splits the input channels and the filters into separate groups and pairs each group of input channels with a specific group of filters. In other words, if you set groups to 2, the layer will show half of the incoming channels to half of the layer’s filters and will show the other half of the incoming channels to the other half of the filters, as if the single convolutional layer were actually two smaller, parallel convolutional layers. The default case, where every filter sees every incoming channel, is equivalent to setting groups to 1. This is similar to the sparse layers discussed earlier with regard to linear layers.

One important thing to note is that while filters and kernels are always related to one another, different frameworks and instructional materials use different definitions for each. The description given here uses the convention set by the TensorFlow framework, where kernels are assemblages of filters, but other sources describe filters as assemblages of kernels instead. Others, such as the PyTorch framework, treat kernels and filters as identical objects. In this case, the number of filters (as described here) is determined by the number of output channels or output features.
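The effect of groups can be sketched by listing which input channels each filter reads and by counting weights. Both helper names are hypothetical, and the sketch assumes channels are assigned to groups in contiguous blocks, which is one common convention.

```python
from math import prod


def input_channels_for_filter(filter_index, in_channels, out_channels, groups=1):
    """Indices of the input channels that a given filter reads."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("groups must divide both channel counts")
    in_per_group = in_channels // groups
    out_per_group = out_channels // groups
    group = filter_index // out_per_group  # which group this filter belongs to
    start = group * in_per_group
    return list(range(start, start + in_per_group))


def grouped_conv_weights(kernel_size, in_channels, out_channels, groups=1):
    """Weight count (biases excluded): each filter only covers its own group."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("groups must divide both channel counts")
    return prod(kernel_size) * (in_channels // groups) * out_channels


# 4 input channels, 8 filters, 3x3 kernel.
print(input_channels_for_filter(0, 4, 8, groups=1))  # [0, 1, 2, 3]
print(input_channels_for_filter(0, 4, 8, groups=2))  # [0, 1]
print(input_channels_for_filter(7, 4, 8, groups=2))  # [2, 3]
print(grouped_conv_weights((3, 3), 4, 8, groups=1))  # 288
print(grouped_conv_weights((3, 3), 4, 8, groups=2))  # 144
```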

### Dilation

Typically, as discussed above, a convolutional kernel looks at a contiguous chunk of its input data. However, stacks of kernels like this are limited to a total receptive field that is based on the total number of convolutional layers the network has and the size of each layer’s kernel. In other words, if you want your network to look for large scale features in your images, you need to use many layers or large convolutional kernels and your CNN will need many parameters.

<br><img src="https://i.ibb.co/m5kh59Z/MIDeL8-6.gif" alt="MIDeL8-6" border="0"><u><br /><b>Figure 5.</b>  Convolution with dilation=2 ([source](https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d))</u><br><br>

Dilation provides a way to get around this. What dilation does is spread the convolutional kernel further across its input. For example, a 3x3 kernel with a dilation of 2 leaves a gap of 1 pixel between each of the input pixels it reads, such that it covers a 5x5 total area. The kernel still only looks at 9 pixels within that 25 pixel region, but is now collecting information over a wider area of its input. The default case, where the kernel looks at adjacent pixels, corresponds with a dilation of 1.
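Along a single dimension, the positions a dilated kernel reads and the total extent it covers can be sketched as follows (the helper names are made up for this example).

```python
def dilated_offsets(kernel_size, dilation=1):
    """Positions, relative to the kernel's first element, that the kernel reads."""
    return [i * dilation for i in range(kernel_size)]


def effective_extent(kernel_size, dilation=1):
    """Total span of input covered by the kernel along one dimension."""
    return dilation * (kernel_size - 1) + 1


print(dilated_offsets(3, dilation=1))   # [0, 1, 2]
print(dilated_offsets(3, dilation=2))   # [0, 2, 4]
print(effective_extent(3, dilation=2))  # 5, i.e. a 5x5 area for a 3x3 kernel

# A 2-D 3x3 kernel still reads only 9 pixels, whatever the dilation.
pixels_read = len(dilated_offsets(3, 2)) ** 2
print(pixels_read)  # 9
```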

### Stride

Stride is another hyperparameter that allows you to increase your receptive field without increasing the number of parameters your model requires. Instead of expanding the kernel, like dilation, stride changes how far the kernel “steps” after completing its computation on one part of the image. In the default case, with a stride of 1, the “steps” between the pieces of input seen by the kernel are 1 pixel apart, and there will roughly be a piece of input centered on each pixel in the input (with one caveat we’ll discuss next). If we increase the stride to 2 the convolution process will take bigger steps: consecutive pieces of input are centered 2 pixels apart, skipping 1 pixel between them. To put it into rough numbers, if a layer needs 28 steps to process the first row of data using stride 1, then that layer will need only 14 steps to process the first row of data using stride 2. This reduction in steps occurs in all dimensions, so a stride of 2 applied to a 2D kernel would reduce the computation by a factor of 4. Conversely, if the layer uses a stride less than 1, which is sometimes useful, then the layer will need more steps across the first row of the input image, for example 56 steps with stride ½.

<br><img src="https://i.ibb.co/n8tWZQK/MIDeL8-7.gif" alt="MIDeL8-7" border="0"><u><br /><b>Figure 6.</b>  Convolution with stride=2 ([source](https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1))</u><br><br>

One thing to keep in mind is that changing your layer’s stride has dramatic effects on the size of its output. While a layer with stride 1 will produce an output with a similar size to its input, a layer with stride 2 will downsample the input by a factor of 2 for each dimension. In contrast, a layer with stride ½ will upsample the input by a factor of 2 for each dimension.
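A 1-D sketch makes the downsampling visible. The function below is a naive illustration without padding, and its name and the example signal are assumptions of this sketch. With stride 2 the layer produces roughly half as many outputs, and each one equals every other output of the stride-1 version.

```python
def conv1d(signal, kernel, stride=1):
    """Slide `kernel` along `signal`, moving `stride` positions per step."""
    k = len(kernel)
    return [
        sum(w * signal[start + i] for i, w in enumerate(kernel))
        for start in range(0, len(signal) - k + 1, stride)
    ]


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 1, 1]
dense = conv1d(signal, kernel, stride=1)
strided = conv1d(signal, kernel, stride=2)
print(dense)    # [6, 9, 12, 15, 18, 21, 24]  (7 steps)
print(strided)  # [6, 12, 18, 24]             (4 steps)
```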

### Padding

There’s an important factor about convolutions we’ve neglected to mention thus far. We’ve treated convolutions as if they do not change the resolution of their input--as if the output from a convolutional layer receiving a 28x28 image will also be a 28x28 image. This is not generally true.

<img src="https://i.ibb.co/9sDY0GS/MIDeL8-8.gif" alt="MIDeL8-8" border="0"><u><br /><b>Figure 7.</b>  Convolution with padding=1 ([source](https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1))</u><br>

The structured output from a convolutional layer has 1 pixel corresponding to every pixel that was fed to the center of the convolutional kernel during the convolution process. However, the kernel can never extend past the edge of its input. In other words, a 3x3 kernel convolved over a 28x28 image results in a 26x26 output, because the kernel cannot center itself over any of the pixels along the outside edge. If our kernel is 5x5, then the output will be 24x24.

Padding is a hyperparameter that allows us to avoid this change in size. Padding adds extra pixels around the outside border of the image before beginning the convolution process, so that we can control how large the output from the layer is. For instance, with padding 1, our 3x3 kernel, run over a 28x28 image, now produces a 28x28 output, and padding 2 enables the 5x5 kernel to produce a 28x28 output.
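The sizes quoted above follow from a standard output-size formula that combines kernel size, stride, padding, and dilation. The sketch below applies it along one dimension. The function name is hypothetical, and `padding` here means the number of pixels added to each side.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one dimension for an input of length n."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


print(conv_output_size(28, 3))                        # 26
print(conv_output_size(28, 5))                        # 24
print(conv_output_size(28, 3, padding=1))             # 28
print(conv_output_size(28, 5, padding=2))             # 28
print(conv_output_size(28, 3, stride=2, padding=1))   # 14
```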

Padding requires more choices than the other hyperparameters we’ve discussed here. You not only choose how much padding to add, but also what values the added pixels should have. Typical options are: zero-padding, which pads with zeros (be careful not to confuse this with no padding); constant padding, which is like zero-padding except you choose the value; nearest padding, which pads with the values from the input that are spatially nearest to the added pixels; and reflection padding, i.e. treating the edge of the input like a mirror and padding with the values present in the reflected input image. Since the edges of images usually do not contain important objects, the choice of padding usually doesn’t have a huge impact, but there can be cases where the edges are important, and then attention to the right form of padding is critical.
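The padding options can be compared on a short 1-D example. The function and mode names below are chosen for this sketch and do not correspond to any particular framework's API. The reflection mode here mirrors about the edge pixel without repeating it, which is one common convention.

```python
def pad_1d(values, pad, mode="zeros", constant=0):
    """Add `pad` values to each end of a 1-D list."""
    values = list(values)
    if mode == "zeros":
        left = right = [0] * pad
    elif mode == "constant":
        left = right = [constant] * pad
    elif mode == "nearest":
        left = [values[0]] * pad
        right = [values[-1]] * pad
    elif mode == "reflect":
        if pad >= len(values):
            raise ValueError("reflection needs pad < len(values)")
        left = values[pad:0:-1]             # mirror about the first value
        right = values[-2:-pad - 2:-1]      # mirror about the last value
    else:
        raise ValueError(f"unknown mode: {mode}")
    return list(left) + values + list(right)


row = [1, 2, 3, 4]
print(pad_1d(row, 2, "zeros"))                 # [0, 0, 1, 2, 3, 4, 0, 0]
print(pad_1d(row, 2, "constant", constant=9))  # [9, 9, 1, 2, 3, 4, 9, 9]
print(pad_1d(row, 2, "nearest"))               # [1, 1, 1, 2, 3, 4, 4, 4]
print(pad_1d(row, 2, "reflect"))               # [3, 2, 1, 2, 3, 4, 3, 2]
```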

### Pooling

Now that we’ve covered the hyperparameters used to define your convolutional layers, we’ll discuss another technique that’s commonly used with images. While stride can be used to downsample images, pooling is a more common technique for downsampling. Pooling layers still use a kernel, but that kernel performs a fixed operation and does not have any trainable parameters.

There are two main types of pooling layers: average pooling and max pooling. Average pooling outputs the average of all of the pixels within the kernel, while max pooling returns the maximum value within the kernel. You can configure the kernel size, stride, and dilation of a pooling layer, but typically they are configured such that every pixel in your image is part of the pooling operation exactly once, so that a pooling layer with a 2x2 kernel has a stride of 2, and outputs an image with half the height and width of the input (e.g. a 28x28 image becomes a 14x14 image). There are also “global” versions of these pooling layers, which calculate their values over the entire spatial extent of the image, such that a 224x224 RGB image becomes a length 3 vector after a 2-D global pooling layer.
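Both kinds of pooling, and the global variant, can be sketched in plain Python. The helper names are invented for this example, and the pooling function assumes the stride equals the kernel size, as in the typical configuration described above.

```python
def pool2d(image, size=2, mode="max"):
    """Non-overlapping pooling: the stride equals the kernel size."""
    output = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            window = [image[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(window))
            else:  # average pooling
                row.append(sum(window) / len(window))
        output.append(row)
    return output


def global_average_pool(channels):
    """One value per channel: the mean over the whole spatial extent."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in channels]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
max_pooled = pool2d(image, size=2, mode="max")
avg_pooled = pool2d(image, size=2, mode="avg")
print(max_pooled)  # [[6, 8], [14, 16]]
print(avg_pooled)  # [[3.5, 5.5], [11.5, 13.5]]

# A 3-channel image collapses to a length-3 vector.
pooled_vector = global_average_pool([image, image, image])
print(pooled_vector)  # [8.5, 8.5, 8.5]
```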

### One Final Note About Spatial Dimensions

There is one more thing we need to mention before we move past the discussion of convolutional hyperparameters. Up until now we have treated everything (kernel size, stride, padding, dilation) as if the operations were done symmetrically in space. However, you are not forced to choose only square (or cubic) dimensions. Each of these hyperparameters can be applied asymmetrically. You might only want to pad one or two edges of your image. Maybe you want to use stride (or pooling) to downsample your CT scan’s X and Y dimensions, but don’t want to change the Z dimension. A 3x1 kernel and a 1x3 kernel are commonly used together to replicate the receptive field of a 3x3 kernel using fewer parameters.

These discussions are easiest if we assume we’re applying the techniques discussed symmetrically, but in practice there are many times where you may want or need to apply them asymmetrically.

### Dropout in Convolutional Layers

Lastly, we will revisit dropout from the discussion about linear layers. This works in largely the same way with one caveat: there are two ways to perform the dropout, and you need to choose the appropriate layer carefully when adding dropout to a CNN. The typical form of dropout, as discussed above, randomly zeroes individual values in the layer’s output. This means that every filter is still applied to the input data, but scattered pixels in each output channel are set to 0. The other form of dropout, sometimes called spatial dropout, randomly drops entire filters by zeroing the whole output channel each one produces, as if the filter had not been applied at all. Because neighboring pixels in an output channel tend to be strongly correlated, the latter typically makes more sense for CNNs.
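Spatial dropout can be sketched as a per-channel coin flip on a layer's output. The function name and data layout (a list of 2-D channels) are assumptions of this sketch, as is the rescaling of surviving channels by $1/(1-p)$, a convention borrowed from common frameworks.

```python
import random


def spatial_dropout(feature_maps, p=0.5, training=True, rng=random):
    """Zero whole output channels (one per filter) with probability p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training:
        return [[row[:] for row in channel] for channel in feature_maps]
    keep = 1.0 - p
    output = []
    for channel in feature_maps:
        # One random decision per channel, shared by every pixel in it.
        scale = 1.0 / keep if rng.random() >= p else 0.0
        output.append([[v * scale for v in row] for row in channel])
    return output


feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],   # output of filter 0
    [[5.0, 6.0], [7.0, 8.0]],   # output of filter 1
    [[9.0, 1.0], [2.0, 3.0]],   # output of filter 2
]
rng = random.Random(1)  # seeded only to make the example repeatable
dropped = spatial_dropout(feature_maps, p=0.5, rng=rng)
for channel in dropped:
    print(channel)  # each channel is either all zeros or entirely doubled
```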

## Activation Layers

As mentioned above, the key to the analytical power of deep learning is the nonlinear functions used to “activate” the output from each node. Nearly any mathematical function can be an activation function, but we’ll limit ourselves to the most commonly used activation functions and describe how they are typically used in deep learning. Though some alternative activation functions offer advantages over those listed here, the following activation functions are enough to build ANNs and CNNs capable of performing nearly any task.

### ReLU

The Rectified Linear Unit, or “ReLU” as it’s most commonly called, is represented by the following equation:

$ReLU(x) = 0 \mbox{ if } x \leq 0$

$ReLU(x) = x \mbox{ if } x > 0$

Put in plain language, for any node whose output is less than or equal to 0, a ReLU activation sets that output to 0, and for any node whose output is greater than 0, a ReLU activation sends that output unchanged to the next layer.

<img src="https://i.ibb.co/brCHms9/MIDeL8-9.png" alt="MIDeL8-9" border="0"><u><br /><b>Figure 8.</b>  ReLU activation function</u><br>

Despite its relative simplicity, ReLU provides enough nonlinearity for deep ANNs and CNNs to model complex behavior and is a commonly chosen activation function for hidden layers.
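In code, ReLU is a one-line function. A minimal sketch:

```python
def relu(x):
    """Return 0 for inputs at or below 0, and the input itself otherwise."""
    return x if x > 0 else 0.0


node_outputs = [-2.0, -0.5, 0.0, 0.5, 3.5]
activated = [relu(v) for v in node_outputs]
print(activated)  # [0.0, 0.0, 0.0, 0.5, 3.5]
```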

### Sigmoid

The sigmoid function is described by a more complicated equation than ReLU, seen below:

$sigmoid(x) = \frac{1}{1 + e^{-x}}$

After plotting this equation, it’s fairly straightforward to see what makes sigmoid a useful activation function.

<img src="https://i.ibb.co/5YD61xK/MIDe-L8-10.png" alt="MIDeL8-10" border="0"><u><br /><b>Figure 9.</b>  Sigmoid activation function</u><br>

The first thing to take note of is that, no matter what value $x$ takes, $sigmoid(x)$ always falls between 0 and 1. This makes it ideal for calculating probabilities and, though it is sometimes used between hidden layers, sigmoid is typically used as the final activation for networks performing single-class (binary) classification, where the output is the probability that an example belongs to the class of interest. It is also useful for multi-class classification problems when a single data example can belong to or contain multiple (or no) classes, often called multi-label classification. For this usage, a threshold is set (typically, but not always, 0.5), and the example is considered to belong to the class of interest whenever the output of the model for that class is greater than that threshold value.
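The thresholding step can be sketched as follows. The class names and raw model outputs are invented purely for illustration. The sigmoid is written in a form that avoids overflow for large negative inputs, which is an implementation choice of this sketch rather than part of the definition.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Hypothetical raw outputs of a model's final layer, one per class.
raw_outputs = {"class_a": 2.0, "class_b": -1.0, "class_c": 0.3}
threshold = 0.5

probabilities = {name: sigmoid(v) for name, v in raw_outputs.items()}
predicted = [name for name, p in probabilities.items() if p > threshold]
print(probabilities)
print(predicted)  # ['class_a', 'class_c']
```

Each class is decided independently, so an example can be assigned several classes or none at all.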

### Tanh

Hyperbolic tangent, also known as tanh, activations are represented by the following equation:

$tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$

Though its equation may seem complicated, its behavior is notably similar to sigmoid with one important difference.

<img src="https://i.ibb.co/ZWJh7xJ/MIDe-L8-11.png" alt="MIDeL8-11" border="0"><u><br /><b>Figure 10.</b>  Tanh activation function</u><br>

Namely, $tanh(x)$ always falls between -1 and 1, whereas $sigmoid(x)$ always falls between 0 and 1. Though less commonly used than ReLU nowadays, tanh activations may also be used between hidden layers.
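The similarity between the two functions can be checked numerically: tanh is a sigmoid that has been stretched and shifted, $tanh(x) = 2 \cdot sigmoid(2x) - 1$. The sketch below verifies this identity at a few arbitrary points.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


points = [-3.0, -0.5, 0.0, 0.5, 3.0]
tanh_values = [math.tanh(x) for x in points]
rescaled_sigmoid = [2.0 * sigmoid(2.0 * x) - 1.0 for x in points]

for x, t, s in zip(points, tanh_values, rescaled_sigmoid):
    print(f"x={x:+.1f}  tanh={t:+.6f}  2*sigmoid(2x)-1={s:+.6f}")
```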

### Softmax

The final activation layer we’ll mention is the softmax function, mathematically denoted by:

$y_{i}=\frac{e^{x_{i}}}{\sum_{j=1}^{J}e^{x_{j}}}\mbox{ for }i = 1,...,J$

Unlike the previous activation functions, softmax doesn’t generate straightforward plots. This is because the output from the softmax function depends on the output of all of the nodes in a layer, rather than generating the output for each node independently of the others. Like sigmoid, the output from a softmax activation is always between 0 and 1; however, softmax applies an additional constraint: when you add all of the outputs together, they must sum to 1. This allows you to interpret the model’s output as the probability of an example belonging to each given class.

Softmax is generally used as the final activation layer for networks performing multiple-class classification for those problems where each example must belong to exactly one class. For this purpose, the model’s prediction is made by choosing the class whose output from the softmax function is greatest.
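A minimal sketch of softmax followed by the "pick the largest output" rule is shown below. The three input values are made up. Subtracting the largest input before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result of the equation above.

```python
import math

def softmax(values):
    """Convert a list of raw outputs into values that sum to 1."""
    largest = max(values)
    # Shifting by the largest value avoids overflow and leaves the result unchanged.
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

raw_outputs = [2.0, 1.0, 0.1]  # hypothetical outputs for 3 classes
probs = softmax(raw_outputs)

# The prediction is the class with the greatest softmax output.
predicted_class = probs.index(max(probs))

print([round(p, 3) for p in probs])
print("sum:", sum(probs))
print("predicted class index:", predicted_class)
```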

## Normalization Layers

In our discussion about dropout we mentioned that dropout acts as a regularizer for deep learning models and that there are other regularization techniques. Now it’s time to discuss another commonly used regularization technique: normalization. First, let’s talk a bit more about regularization in general.

As discussed with dropout, deep learning models are vulnerable to overfitting on their training data during the training process. One way to avoid this is to increase the size of your dataset and ensure that it is truly a representative dataset, but this can be difficult or impossible, and if your model is too powerful (i.e. has too many nodes) it can “memorize” the training examples. As we said earlier, this happens when models learn or activate too strongly on limited sets of features, and in extreme cases can reach a point where the features are only found in one or two training examples. In contrast, predictions made based on the detection of many features, even with a weaker signal, tend to be more generalizable, and models that weakly activate on many detected features are more likely to give reliable predictions on new data. Because of this, we ultimately want to prevent our deep learning models from activating too strongly on any one feature. Fortunately, we can build models that limit the strength of each activation.

The first way to address this is to normalize the input data. When all of the input data is forced to fall within the same range of values, the influence of systematic errors such as different intensity values between image acquisitions is reduced. This helps reduce potential issues caused by the input, but to prevent problems from developing deeper in the neural network we also need ways to regularize activations within the network.
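As one possible illustration of input normalization, the sketch below uses min-max scaling, which forces every input into the range 0 to 1. The two "scans" are made-up intensity lists that differ only by a constant scale factor, standing in for the acquisition differences mentioned above. Min-max scaling is just one choice here; z-score normalization is another common option.

```python
def min_max_normalize(values):
    """Rescale a list of numbers so the smallest is 0 and the largest is 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant input carries no contrast; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

# Hypothetical pixel intensities from two acquisitions of the same anatomy,
# where the second scanner reports values on a 10x larger scale.
scan_a = [0.0, 50.0, 100.0, 200.0]
scan_b = [0.0, 500.0, 1000.0, 2000.0]

print(min_max_normalize(scan_a))
print(min_max_normalize(scan_b))
```

After normalization the two scans are identical, so the network never sees the scale difference between them.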

Normalization layers are focused on addressing this challenge. These layers are added between hidden layers within the deep learning model and normalize their input in one of several different ways. Most of these layers have configurable hyperparameters that allow you to adjust their calculated normalization factors. Some examples of these include whether the layer can use information from earlier batches, the momentum it uses for that, and whether the layer has its own learnable parameters to further adjust the calculated normalization factors.

### Batch Norm

Batch normalization (“batchnorm”) calculates statistics for each channel or feature separately, but uses all of the data within the batch to calculate those values in order to normalize the data. For example, in a batch consisting of 8 RGB images, a batchnorm layer would calculate separate normalization factors for each color. All 8 red channels would have their total mean and standard deviation (stdev) calculated before being normalized, all 8 green channels would have their mean and stdev calculated together and normalized, and the same for the blue channels.
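The pooling described above can be sketched with nested Python lists. This is a simplified illustration rather than a framework implementation: the batch is a made-up toy with 2 images and 2 channels, and the sketch leaves out the learnable scale and shift and the running statistics that real batchnorm layers usually keep.

```python
import math
from statistics import fmean, pvariance

EPS = 1e-5  # small constant that guards against division by zero

# batch[n][c] holds the flattened pixel values of channel c in image n.
batch = [
    [[0.0, 2.0], [10.0, 30.0]],  # image 0
    [[4.0, 6.0], [50.0, 70.0]],  # image 1
]

def standardize(values, mean, variance):
    """Shift and scale values using the supplied statistics."""
    return [(v - mean) / math.sqrt(variance + EPS) for v in values]

def batch_norm(batch):
    """Normalize each channel with statistics pooled over the whole batch."""
    num_channels = len(batch[0])
    normalized = [[None] * num_channels for _ in batch]
    for c in range(num_channels):
        # Gather channel c from every image in the batch.
        pooled = [v for image in batch for v in image[c]]
        mean, variance = fmean(pooled), pvariance(pooled)
        for n, image in enumerate(batch):
            normalized[n][c] = standardize(image[c], mean, variance)
    return normalized

normalized = batch_norm(batch)
print(normalized[0][0], normalized[1][0])
```

Note that image 1 stays brighter than image 0 in channel 0 after normalization, because both images share one mean and one standard deviation per channel.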

<img src="https://i.ibb.co/1zjfPwF/MIDe-L8-12.png" alt="MIDeL8-12" border="0"><u><br /><b>Figure 11.</b>  BatchNorm layer ([source](https://arxiv.org/pdf/1803.08494.pdf))</u><br>

### Instance Norm

Where batch normalization uses the entire batch to normalize each feature/channel, instance normalization only uses the values found within each example. In our batch of 8 RGB images, each image’s red channel would be normalized based on its own values, without the values from the other images being factored in. Green and blue would also each be normalized based on only their own values separately for each individual image.
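A comparable sketch for instance normalization is given below, under the same assumptions as before: a made-up toy batch stored as nested lists, with no learnable scale or shift. The only change from pooling over the batch is that each channel of each image supplies its own statistics.

```python
import math
from statistics import fmean, pvariance

EPS = 1e-5  # small constant that guards against division by zero

# batch[n][c] holds the flattened pixel values of channel c in image n.
batch = [
    [[0.0, 2.0], [10.0, 30.0]],  # image 0
    [[4.0, 6.0], [50.0, 70.0]],  # image 1
]

def standardize(values, mean, variance):
    """Shift and scale values using the supplied statistics."""
    return [(v - mean) / math.sqrt(variance + EPS) for v in values]

def instance_norm(batch):
    """Normalize every channel of every image using only its own values."""
    return [
        [standardize(channel, fmean(channel), pvariance(channel)) for channel in image]
        for image in batch
    ]

normalized = instance_norm(batch)
print(normalized[0][0], normalized[1][0])
```

In this toy batch, channel 0 of image 1 is just channel 0 of image 0 plus a constant offset, so the two come out identical once each is normalized on its own.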

<img src="https://i.ibb.co/2ydrBrG/MIDe-L8-13.png" alt="MIDeL8-13" border="0"><u><br /><b>Figure 12.</b>  InstanceNorm layer ([source](https://arxiv.org/pdf/1803.08494.pdf))</u><br>

### Layer Norm

Where batch normalization calculates normalization factors for each feature separately and combines the information from every example in the batch, layer normalization calculates normalization factors for each example in the batch separately and combines all of the features together. In our batch of 8 RGB images, each of the 8 images would separately calculate the mean and stdev for their combined red, green, and blue channels.
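The same toy setup can illustrate layer normalization. As before, this is a sketch on made-up nested lists without learnable parameters: each image pools all of its channels together to get one mean and one variance.

```python
import math
from statistics import fmean, pvariance

EPS = 1e-5  # small constant that guards against division by zero

# batch[n][c] holds the flattened pixel values of channel c in image n.
batch = [
    [[0.0, 2.0], [10.0, 30.0]],  # image 0
    [[4.0, 6.0], [50.0, 70.0]],  # image 1
]

def standardize(values, mean, variance):
    """Shift and scale values using the supplied statistics."""
    return [(v - mean) / math.sqrt(variance + EPS) for v in values]

def layer_norm(batch):
    """Normalize each image with statistics pooled over all of its channels."""
    normalized = []
    for image in batch:
        pooled = [v for channel in image for v in channel]
        mean, variance = fmean(pooled), pvariance(pooled)
        normalized.append([standardize(channel, mean, variance) for channel in image])
    return normalized

normalized = layer_norm(batch)
print(normalized[0])
```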

<img src="https://i.ibb.co/vkw67nC/MIDe-L8-14.png" alt="MIDeL8-14" border="0"><u><br /><b>Figure 13.</b>  LayerNorm layer ([source](https://arxiv.org/pdf/1803.08494.pdf))</u><br>

### Group Norm

Group normalization is very similar to layer normalization and offers a middle ground between layer normalization and instance normalization. Instead of combining all of the channels/features into one large group, it combines the channels into smaller groups before calculating normalization statistics.

<img src="https://i.ibb.co/BgYbQ2P/MIDe-L8-15.png" alt="MIDeL8-15" border="0"><u><br /><b>Figure 14.</b>  GroupNorm layer ([source](https://arxiv.org/pdf/1803.08494.pdf))</u><br>

Group normalization layers generally require that the channels divide evenly into groups, so that every group has the same number of features. The 3 channels of an RGB image cannot be split evenly into 2 groups, so we can’t reuse the previous batch of 8 RGB images. If we instead use CMYK images so we have a fourth channel, an example of group normalization with 2 groups would follow the layer normalization example, except the cyan and magenta channels would have their normalization factors calculated together and the yellow and key/black channels would have their normalization factors calculated together. Typically you’d use this layer deeper in the network, where channels tend to have more abstract meanings.
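The CMYK example can be sketched as follows, again with made-up pixel values stored as nested lists and without learnable parameters. The function name `group_norm` and its `num_groups` argument are chosen for this illustration and are not taken from a particular library.

```python
import math
from statistics import fmean, pvariance

EPS = 1e-5  # small constant that guards against division by zero

# batch[n][c] holds the flattened pixel values of channel c in image n.
# Channel order is assumed to be cyan, magenta, yellow, key/black.
batch = [
    [[0.0, 2.0], [4.0, 6.0], [10.0, 30.0], [50.0, 70.0]],  # image 0
    [[1.0, 3.0], [5.0, 7.0], [20.0, 40.0], [60.0, 80.0]],  # image 1
]

def standardize(values, mean, variance):
    """Shift and scale values using the supplied statistics."""
    return [(v - mean) / math.sqrt(variance + EPS) for v in values]

def group_norm(batch, num_groups):
    """Normalize each image group by group, pooling the channels in a group."""
    num_channels = len(batch[0])
    if num_channels % num_groups != 0:
        raise ValueError("channels must divide evenly into groups")
    per_group = num_channels // num_groups
    normalized = []
    for image in batch:
        normalized_image = []
        for start in range(0, num_channels, per_group):
            group = image[start:start + per_group]
            pooled = [v for channel in group for v in channel]
            mean, variance = fmean(pooled), pvariance(pooled)
            normalized_image.extend(standardize(ch, mean, variance) for ch in group)
        normalized.append(normalized_image)
    return normalized

normalized = group_norm(batch, num_groups=2)
print(normalized[0])
```

With `num_groups=1` this sketch reduces to the layer normalization calculation, and with `num_groups=4` (one channel per group) it reduces to instance normalization, which is the sense in which group normalization sits between the two.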

## Example Networks

Now that we’ve discussed the basic elements used in modern ANNs and CNNs, it’s time to discuss some famous examples. In the interest of space we’ll limit ourselves to a couple of architectures which are notable not only for their past performance but for their continued relevance.

### VGGs

We’ll start with a family of CNN architectures referred to as VGG (Visual Geometry Group) models, which provide a useful combination of analytical power, contemporary usefulness, and architectural simplicity. VGG architectures introduced two primary innovations over earlier high-performance CNNs. First was shrinking the size of the convolutional kernels. Second was the introduction of the “block.”

In contrast to their predecessors, many of which used 7x7 or 11x11 convolutional kernels with carefully chosen strides for the early layers, VGGs use 3x3 convolutional kernels throughout the network. While shrinking the kernels decreased the receptive field of each individual layer in the CNN, each layer required dramatically fewer weights. Because of this, VGGs could be made with more layers than earlier models, with twice the depth of some earlier top performers. This increase in total depth not only kept the model’s effective receptive field high, but the added nonlinearity from more than doubling the number of layers also increased the model’s analytical power and led to increased overall performance.
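The trade-off between kernel size and depth can be checked with some quick arithmetic. The sketch below assumes stride 1, no dilation, and the same number of channels going into and out of every layer. The channel count of 64 is an arbitrary value chosen for illustration, and biases are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of stacked stride-1 convolutions.

    Each additional layer widens the field by (kernel_size - 1) pixels.
    """
    return 1 + num_layers * (kernel_size - 1)

channels = 64  # hypothetical channel count

one_7x7 = conv_weights(7, channels, channels)
three_3x3 = 3 * conv_weights(3, channels, channels)

print("receptive field, one 7x7 layer:   ", stacked_receptive_field(7, 1))
print("receptive field, three 3x3 layers:", stacked_receptive_field(3, 3))
print("weights, one 7x7 layer:   ", one_7x7)
print("weights, three 3x3 layers:", three_3x3)
```

Three stacked 3x3 layers see the same 7x7 region as a single 7x7 layer while using 27 weights per channel pair instead of 49, and they apply three nonlinearities instead of one.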

Blocks represent a more abstract innovation. The idea is to organize layers into higher-level structures that can easily be repeated when building networks. For example, VGG uses two kinds of blocks made out of convolutional layers with ReLU activations followed by a max pooling layer. The first kind of block has two convolutional layers before the pooling layer, and the second has three convolutional layers before the pooling layer. While hyperparameters typically vary between blocks, you know that two blocks of the same type will have the same general behavior.
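One way to see the value of blocks is that a whole VGG16 can be written down as a short list. The sketch below assumes the commonly published VGG16 configuration: five blocks of 3x3 convolutions, a 224x224 RGB input, two fully connected layers with 4096 nodes each, and 1000 output classes. It traces the feature map size through the blocks and tallies the weights and biases. The function name and its arguments are chosen for illustration only.

```python
# Each inner list is one block; each number is the output channel count of a
# 3x3 convolutional layer. Every block ends with a 2x2 max pooling layer.
VGG16_BLOCKS = [
    [64, 64],
    [128, 128],
    [256, 256, 256],
    [512, 512, 512],
    [512, 512, 512],
]

def count_vgg_parameters(blocks, in_channels=3, image_size=224,
                         hidden=4096, num_classes=1000):
    """Return (convolutional, fully connected) parameter counts."""
    conv_params = 0
    channels, size = in_channels, image_size
    for block in blocks:
        for out_channels in block:
            # 3x3 kernel weights plus one bias per output channel
            conv_params += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
        size //= 2  # max pooling halves the height and width
    flattened = channels * size * size
    fc_params = 0
    for n_in, n_out in [(flattened, hidden), (hidden, hidden), (hidden, num_classes)]:
        fc_params += n_in * n_out + n_out
    return conv_params, fc_params

conv_params, fc_params = count_vgg_parameters(VGG16_BLOCKS)
total = conv_params + fc_params
print(f"convolutional layers:   {conv_params:,}")
print(f"fully connected layers: {fc_params:,}")
print(f"total:                  {total:,}")
```

Under these assumptions the total comes to a little over 138 million parameters, most of which sit in the fully connected layers at the end of the network.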

While most deep learning frameworks include easy ways to create VGG networks and you most likely won’t need to reproduce them yourself, below is a figure to help you visualize how a VGG organizes its layers.

<img src="https://i.ibb.co/ssRpzqV/MIDe-L8-16.png" alt="MIDeL8-16" border="0"><u><br /><b>Figure 15.</b>  Architecture of VGG16 network ([source](https://neurohive.io/en/popular-networks/vgg16/))</u><br>

### ResNets

While VGG models made significant advancements over earlier CNNs, their simple architecture comes with some major disadvantages. First, despite using smaller convolutional kernels, they were very large networks, with the 16 layers of VGG16 having nearly 140 million parameters, which can make them computationally demanding to train. Second, stacking many convolutional layers makes networks vulnerable to what is known as the vanishing gradient problem (and the related exploding gradient problem).

A full discussion of where these problems come from is too advanced for now, but it’s good to keep in mind that, while adding depth makes networks more powerful, it also makes them more difficult to train. ResNets (Residual Networks) were developed as an attempt to address these issues and further push the envelope on network depth. In order to avoid the issues encountered by deep CNNs, ResNets make use of something called “residual connections.”

#### Residual Connections and Residual Blocks

Residual connections solve these issues by providing an alternate path for information to travel deeper into the network. See below:

<img src="https://i.ibb.co/XxLX4b6/MIDe-L8-17.png" alt="MIDeL8-17" border="0"><u><br /><b>Figure 16.</b>  Schematic of a residual block ([source](https://d2l.ai/chapter_convolutional-modern/resnet.html))</u><br>

In this diagram of a residual block, the solid red line on the right of the figure represents the residual connection. Because the block’s input, x, is added to the result of the block’s convolution operations, useful features from the network’s input can easily reach very deep layers. An additional advantage is that, during the parameter update process, information can travel back up the network along these residual connections, which helps resolve the gradient issues mentioned above. Thanks to these residual connections, residual blocks can be used to construct much deeper networks than VGG blocks. Where VGGs face issues beyond roughly 19 layers, ResNets are known to work well with as many as 152 layers.
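The addition at the heart of a residual block can be sketched without any deep learning framework. In the toy code below, `transform` is a hypothetical stand-in for the block’s convolutional layers, and the inputs are plain lists of numbers. The sketch assumes the ReLU is applied after the addition.

```python
def relu(values):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in values]

def residual_block(x, transform):
    """Compute relu(transform(x) + x): the input skips around the transform."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the shape of its input")
    return relu([f + xi for f, xi in zip(fx, x)])

def zero_transform(x):
    """Stand-in for layers that contribute nothing."""
    return [0.0 for _ in x]

def small_transform(x):
    """Stand-in for layers that learned a small correction."""
    return [0.1 * v for v in x]

x = [1.0, 2.0, 3.0]
print(residual_block(x, zero_transform))   # the input passes through unchanged
print(residual_block(x, small_transform))  # the input plus a small correction
```

Even when the transform outputs all zeros, the (non-negative) input still reaches the next layer intact, which is why stacking many such blocks does not cut off the flow of information.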

One final innovation worth mentioning is the separation of spatial convolutions and channel-based convolutions, which means even these far deeper networks require fewer parameters. For example, ResNet152 only uses around 60 million parameters, less than half of what VGG16 uses. How can that be? ResNets use 1x1 convolutional kernels in the layers that change the number of channels and 3x3 convolutional kernels for the other convolutional layers (except for the very first layer, which uses a 7x7 convolutional kernel).
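A rough sketch of why this arrangement saves weights is given below. It assumes a "bottleneck" layout in which a 1x1 layer reduces 256 channels to 64, a 3x3 layer works on those 64 channels, and a second 1x1 layer restores the 256 channels. These channel counts are assumptions chosen for illustration, and biases are ignored. The comparison is against two plain 3x3 layers that keep all 256 channels.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

wide, narrow = 256, 64  # assumed channel counts for this illustration

# 1x1 layers change the channel count; the 3x3 layer does the spatial work.
bottleneck = (
    conv_weights(1, wide, narrow)
    + conv_weights(3, narrow, narrow)
    + conv_weights(1, narrow, wide)
)

# Two 3x3 layers operating on the full channel count.
plain = 2 * conv_weights(3, wide, wide)

print("bottleneck block weights:", bottleneck)
print("plain 3x3 block weights: ", plain)
print(f"ratio: {plain / bottleneck:.1f}x")
```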

## Conclusion

We have described a long list of components of deep learning models. While not all of them are required in every case, it is important to be aware of them in order to understand how deep learning models work and how they might fail. You also should have a sense of the ways that you might alter the components for specific problems you are working on.

---

## ***Feedback***

*Now that you have completed this chapter, we would be very grateful if you would spend a few minutes of your time answering a short survey about it. We highly value your feedback and will do our best to use it to improve our educational content and/or strategies.*

[Click here to begin the survey!](https://docs.google.com/forms/d/e/1FAIpQLSddhdaAmeHmrKKRNXCLIQH6_mnIC3KR7XlDIVWGt3FSQhPDhQ/viewform)