# Neural Networks and Deep Learning

In this section, you will be introduced to the final topic on neural networks and deep learning. You will be learning about TensorFlow, Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). You will use key deep learning concepts to determine creditworthiness of individuals and predict housing prices in a neighborhood. Later on, you will also implement an image classification program using the skills you learned. By the end of this chapter, you will have a firm grasp on the concepts of neural networks and deep learning.

# Introduction

Previously, we learned about what clustering problems are and saw several algorithms, such as k-means, that can automatically group data points on their own. In this chapter, we will learn about neural networks and deep learning networks.

The difference between neural networks and deep learning networks is the complexity and depth of the networks. Traditionally, neural networks have only one hidden layer, while deep learning networks have more than that.

Although we will use neural networks and deep learning for supervised learning, note that neural networks can also model unsupervised learning techniques. This kind of model was actually quite popular in the 1980s, but because the computation power available at the time was limited, it's only recently that this model has been widely adopted. With the democratization of Graphics Processing Units (GPUs) and cloud computing, we now have access to a tremendous amount of computation power. This is the main reason why neural networks and especially deep learning are hot topics again.

Deep learning can model more complex patterns than traditional neural networks, and so deep learning is more widely used nowadays in computer vision (in applications such as face detection and image recognition) and natural language processing (in applications such as chatbots and text generation).

## Artificial Neurons

Artificial Neural Networks (ANNs), as the name implies, try to replicate how a human brain works, and more specifically how neurons work.

A neuron is a cell in the brain that communicates with other cells via electrical signals. Neurons can respond to stimuli such as sound, light, and touch. They can also trigger actions such as muscle contractions. On average, a human brain contains roughly 86 billion neurons. That's a pretty huge network, right? This is the reason why humans can achieve so many amazing things. This is also why researchers have tried to emulate how the brain operates and in doing so created ANNs.

ANNs are composed of multiple artificial neurons that connect to each other and form a network. An artificial neuron is simply a processing unit that performs mathematical operations on some inputs ($x_1, x_2, \dots, x_n$) and returns the final results (`y`) to the next unit, as shown here:

![Figure 6.1](img/fig6_01.jpg)

We will see how an artificial neuron works more in detail in the coming sections.

## Neurons in TensorFlow

TensorFlow is currently one of the most popular neural network and deep learning frameworks. It was created and is maintained by Google. TensorFlow is used for voice recognition and voice search, and it is also the brain behind translate.google.com. Later in this chapter, we will use TensorFlow to recognize written characters.

The TensorFlow API is available in many languages, including Python, JavaScript, Java, and C. TensorFlow works with **tensors**. You can think of a tensor as a container composed of a matrix (usually with high dimensions) and additional information related to the operations it will perform (such as weights and biases, which you will be looking at later in this chapter). A tensor of rank 0 (with no dimensions) is a scalar. A tensor of rank 1 is a vector, a rank 2 tensor is a matrix, and a rank 3 tensor is a three-dimensional array. The rank indicates the number of dimensions of a tensor. In this chapter, we will be looking at tensors of ranks 2 and 3.
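To make the notion of rank concrete, here is a minimal sketch that uses plain nested Python lists rather than real TensorFlow tensors. The helper `rank()` is a hypothetical name introduced only for illustration; it counts how deeply the lists are nested.

```Python
def rank(value):
    """Return the nesting depth of a regular nested list; a plain number has rank 0."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0]  # assumes non-empty, regularly shaped lists
    return depth

scalar = 5.0                                  # rank 0
vector = [1.0, 2.0, 3.0]                      # rank 1
matrix = [[1.0, 2.0], [3.0, 4.0]]             # rank 2
cube = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]]]             # rank 3

print(rank(scalar), rank(vector), rank(matrix), rank(cube))  # 0 1 2 3
```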

  > Mathematicians use the terms matrix and dimension, whereas deep learning programmers use tensor and rank instead.

TensorFlow also comes with mathematical functions to transform tensors, such as the following:

  * Arithmetic operations: `add` and `multiply`
  * Exponential operations: `exp` and `log`
  * Relational operations: `greater`, `less`, and `equal`
  * Array operations: `concat`, `slice`, and `split`
  * Matrix operations: `matrix_inverse`, `matrix_determinant`, and `matmul`
  * Non-linear operations: `sigmoid`, `relu`, and `softmax`
  
**Go to Exercise 6.1**

In Exercise 6.1, you successfully implemented an artificial neuron using TensorFlow. This is the basis of any neural network model using TensorFlow.

---

## Neural Network Architecture

Neural networks aren't the newest branch of **Artificial Intelligence (AI)**. Neural networks are inspired by how the human brain works. They were invented in the 1940s by Warren McCulloch and Walter Pitts. The neural network was a mathematical model that was used to describe how the human brain can solve problems.

We will use ANN to refer to both the mathematical model, and the biological neural network when talking about the human brain.

The way a neural network learns is more complex compared to other classification or regression models. The neural network model has a lot of internal variables, and the relationship between the input and output variables may involve multiple internal layers. Neural networks can often achieve higher accuracy than other supervised learning algorithms on complex problems, although this is not guaranteed.

  > **Note**  
  > Mastering neural networks with TensorFlow is a complex process. The purpose of this section is to provide you with an introductory resource to get started.

The main example we are going to use is the recognition of digits from an image. We are using this dataset since each image is small, and we have around 70,000 images available. The processing power required to process these images can be provided by a regular computer.

ANNs work similarly to how the human brain works. A dendrite in a human brain is connected to a nucleus, and the nucleus is connected to an axon. In an ANN, the input is the dendrite, where the calculations occur is the nucleus, and the output is the axon.

An artificial neuron is designed to replicate how a nucleus works. It will transform an input signal by calculating a matrix multiplication followed by an activation function. If this function determines that a neuron has to fire, a signal appears in the output. This signal can be the input of other neurons in the network:

![Figure 6.2](img/fig6_02.jpg)

Let's understand the preceding figure further by taking the example of $n=4$. In this case, the following applies:

  * $X$ is the input matrix, which is composed of $x_1, x_2, x_3$, and $x_4$.
  * $W$, the weight matrix, will be composed of $w_1, w_2, w_3$, and $w_4$.
  * $b$ is the bias.
  * $f$ is the activation function.

We will first calculate Z (the left-hand side of the neuron) with matrix multiplication and bias:

$$
Z = W * X + b = x_1 * w_1 + x_2 * w_2 + x_3 * w_3 + x_4 * w_4 + b
$$

Then the output, `y`, will be calculated by applying a function, `f`:

$$
y = f(Z) = f(x_1 * w_1 + x_2 * w_2 + x_3 * w_3 + x_4 * w_4 + b)
$$

Great – this is how an artificial neuron works under the hood. It is two matrix operations, a product followed by a sum, and a function transformation.
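The two steps above (weighted sum plus bias, then activation) can be sketched in plain Python. This is only an illustration under stated assumptions: the input values, weights, and bias are made up, and `sigmoid` is chosen arbitrarily as the activation function `f`.

```Python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    # Weighted sum: Z = x_1*w_1 + ... + x_n*w_n + b
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Output: y = f(Z)
    return activation(z)

# Illustrative values for n = 4 (not taken from the figure)
X = [1.0, 2.0, 3.0, 4.0]
W = [0.5, -0.25, 0.1, 0.0]
b = 0.2

y = neuron(X, W, b, sigmoid)
print(y)  # sigmoid(0.5), roughly 0.62
```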

We now move on to the next section – weights.

## Weights

**W** (also called the weight matrix) refers to weights, which are parameters that are automatically learned by neural networks in order to accurately predict the output, `y`.

A single neuron is the combination of the weighted sum and the activation function and can be referred to as a hidden layer. A neural network with one hidden layer is called a **regular neural network**:

![Figure 6.3](img/fig6_03.jpg)

When connecting inputs and outputs, we may have multiple hidden layers. A neural network with multiple layers is called a **deep neural network**.

The term deep learning comes from the presence of multiple layers. When creating an **Artificial Neural Network (ANN)**, we can specify the number of hidden layers.

---

## Biases
Consider the equation for a neuron without a bias term:

$$
y = f(x_1 * w_1 + x_2 * w_2 + x_3 * w_3 + x_4 * w_4)
$$

The problem with this equation is that there is no constant term that is independent of the inputs $x_1, x_2, x_3$, and $x_4$. The preceding equation can only model linear functions that go through the point 0: if all $x$ values are equal to 0, then the weighted sum will also be equal to 0. But what about other functions that don't go through the point 0? For example, imagine that we are predicting the probability of churn for an employee by their months of tenure. Even if they haven't worked for a full month yet, the probability of churn is not zero.

To accommodate this situation, we need to introduce a new parameter called **bias**. It is a constant that is also referred to as the **intercept**. Using the churn example, the bias `b` can be equal to 0.5, and therefore the churn probability for a new employee during the first month will be 50%.

Therefore, we add bias to the equation:

$$
\begin{aligned}
y &= f(x_1 * w_1 + x_2 * w_2 + x_3 * w_3 + x_4 * w_4 + b) \\
y &= f(X \cdot W + b)
\end{aligned}
$$

The first equation is the verbose form, describing the role of each coordinate, weight coefficient, and bias. The second equation is the vector form, where $X = (x_1, x_2, x_3, x_4)$ and $W = (w_1, w_2, w_3, w_4)$. The dot operator between the vectors symbolizes the dot or scalar product of the two vectors. The two equations are equivalent. We will use the second form in practice because it is easier to define a vector of variables using TensorFlow than to define each variable one by one.

Like $w_1, w_2, w_3$, and $w_4$, the bias, `b`, is a variable, meaning that its value can change during the learning process.

With this constant factor built into each neuron, a neural network model becomes more flexible in terms of fitting a specific training dataset better.
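As a small illustration of this flexibility, the sketch below shows how a bias shifts the point at which a neuron starts to produce a positive output. It assumes a ReLU-style activation and treats "firing" as a strictly positive output; the helper `fires()` and the sample values are hypothetical.

```Python
def relu(z):
    return max(0.0, z)

def fires(p, bias):
    """A neuron 'fires' here if its ReLU output is strictly positive."""
    return relu(p + bias) > 0.0

# p stands for the weighted sum x_1*w_1 + ... + x_4*w_4
for p in (-7.0, -3.0, 2.0):
    print(p, fires(p, bias=0.0), fires(p, bias=5.0))
```

Without a bias, only the positive weighted sum fires. With a bias of 5, the weighted sum of -3 fires as well, while -7 still does not.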

  > It may happen that the product $p = x_1*w_1 + x_2*w_2 + x_3*w_3 + x_4*w_4$ is negative due to the presence of a few negative weights. We may still want to give the model the flexibility to execute (or fire) a neuron with values above a given negative number. Therefore, adding a constant bias, $b = 5$, for instance, can ensure that the neuron fires for values between -5 and 0 as well.
  
TensorFlow provides the `Dense()` class to model the hidden layer of a neural network (also called the fully connected layer):

```Python
from tensorflow.keras import layers
layer1 = layers.Dense(units=128, input_shape=[200])
```

In this example, we have created a fully connected layer of 128 neurons that takes as input a tensor of shape 200.

  > You can find more information on this TensorFlow class at [https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense).
  
The `Dense()` class expects a flattened input (only one row). For instance, if your input is of shape `28` by `28`, you will have to flatten it beforehand with the `Flatten()` class in order to get a single row with 784 values (`28 * 28`):

```Python
from tensorflow.keras import layers
input_layer = layers.Flatten(input_shape=(28, 28))
layer1 = layers.Dense(units=128)
```
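Conceptually, flattening just concatenates the rows of the input into one long row. The following plain-Python sketch illustrates the idea on a nested list; it is not how TensorFlow implements `Flatten()`, and the all-zero image is a placeholder.

```Python
def flatten(image):
    """Concatenate the rows of a 2D nested list into one flat list."""
    return [pixel for row in image for pixel in row]

# Placeholder 28 x 28 image filled with zeros
image = [[0.0] * 28 for _ in range(28)]
flat = flatten(image)
print(len(flat))  # 784
```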

  > You can find more information on this TensorFlow class at [https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten).
  
In the following sections, we will learn about how we can extend this layer of neurons with additional parameters.

---

## Use Cases for ANNs

ANNs have their place among supervised learning techniques. They can model both classification and regression problems. A classifier neural network seeks a relationship between features and labels. The features are the input variables, while each class the classifier can choose as a return value is a separate output. In the case of regression, the input variables are the features, while there is one single output: the predicted value. While traditional classification and regression techniques have their use cases in AI, ANNs are generally better at finding complex relationships between inputs and outputs.

In the next section, we will be looking at activation functions and their different types.

## Activation Functions

As seen previously, a single neuron needs to perform a transformation by applying an activation function. Different activation functions can be used in neural networks. Without these functions, a neural network would simply be a linear model that could easily be described using matrix multiplication.

The activation function of a neural network provides non-linearity and therefore can model more complex patterns. Two very common activation functions are sigmoid and tanh (the hyperbolic tangent function).

### Sigmoid

The formula of `sigmoid` is as follows:

$$
f(x) = \sigma (x) = \frac{1}{1 + e^{-x}}
$$

The output values of a sigmoid function range from 0 to 1. This activation function is usually used at the last layer of a neural network for a binary classification problem.

### Tanh

The formula of the hyperbolic tangent is as follows:

$$
f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$

The `tanh` activation function is very similar to the `sigmoid` function and was quite popular until recently. It is usually used in the hidden layers of a neural network. Its values range between -1 and 1.

### ReLU

Another important activation function is `relu`. **ReLU** stands for **Rectified Linear Unit**. It is currently the most widely used activation function for hidden layers. Its formula is as follows:

$$
f(x) = \begin{cases}
 0 & \text{for } x \leq 0 \\
 x & \text{for } x > 0
\end{cases}
$$

There are now different variants of `relu` functions, such as `leaky ReLU` and `PReLU`.
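The three formulas above translate almost directly into code. The following sketch uses only Python's `math` module and is meant for experimenting with individual values, not as a replacement for TensorFlow's built-in activations.

```Python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):
    return x if x > 0 else 0.0

# Compare the three functions on a negative, zero, and positive input
for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 4), round(tanh(x), 4), relu(x))
```

Note how `sigmoid` stays between 0 and 1, `tanh` stays between -1 and 1, and `relu` passes positive values through unchanged.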

### Softmax

The `softmax` function shrinks the values of a list to be between 0 and 1 so that the sum of the elements of the list becomes 1. The definition of the `softmax` function is as follows:

$$
f_i \left(\overrightarrow{x} \right) = \frac{e^{x_i}}{\sum^J_{j=1} e^{x_j}} \quad \text{for } i = 1, \dots, J
$$

The `softmax` function is usually used as the last layer of a neural network for multi-class classification problems as it can generate probabilities for each of the different output classes.
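A minimal plain-Python sketch of `softmax` is shown below. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged. The scores are made-up values.

```Python
import math

def softmax(values):
    # Subtracting the maximum does not change the result but avoids overflow
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]          # illustrative raw outputs for three classes
probabilities = softmax(scores)
print(probabilities, sum(probabilities))
```

The largest score receives the largest probability, and the probabilities add up to 1.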

Remember, in TensorFlow, we can extend a `Dense()` layer with an activation function; we just need to set the `activation` parameter. In the following example, we will add the `relu` activation function:

```Python
from tensorflow.keras import layers
layer1 = layers.Dense(units=128, input_shape=[200], activation='relu')
```

**Go to Exercise 6.2**

---

## Forward Propagation and the Loss Function

So far, we have seen how a neuron can take an input and perform some mathematical operations on it and get an output. We learned that a neural network is a combination of multiple layers of neurons.

The process of transforming the inputs of a neural network into a result is called **forward propagation** (or the forward pass). What we are asking the neural network to do is to make a prediction (the final output of the neural network) by applying multiple neurons to the input data:

![Figure 6.11](img/fig6_11.jpg)

The neural network relies on the weight matrices, biases, and activation function of each neuron to calculate the predicted output value, $\hat{y}$. For now, let's assume the values of the weight matrices and biases are set in advance. The activation functions are defined when you design the architecture of the neural network.
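To see forward propagation in action, here is a small sketch of a network with two inputs, one hidden layer of two ReLU neurons, and a single sigmoid output. All weights and biases are hand-picked illustrative values (the "set in advance" assumption from above), and `dense()` is a hypothetical helper, not the Keras class.

```Python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` belongs to one neuron."""
    return [
        activation(sum(x * w for x, w in zip(inputs, neuron_weights)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Illustrative, hand-picked parameters
hidden_weights = [[0.5, -1.0], [1.5, 2.0]]
hidden_biases = [0.0, -1.0]
output_weights = [[1.0, -0.5]]
output_biases = [0.25]

x = [2.0, 1.0]
hidden = dense(x, hidden_weights, hidden_biases, relu)            # [0.0, 4.0]
y_hat = dense(hidden, output_weights, output_biases, sigmoid)[0]  # sigmoid(-1.75)
print(hidden, y_hat)
```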

As for any supervised machine learning algorithm, the goal is to make accurate predictions. This implies that we need to assess how accurate the predictions are compared to the true values. For traditional machine learning algorithms, we used scoring metrics such as mean squared error, accuracy, or the F1 score. This can also be applied to neural networks, but the only difference is that such scores are used in two different ways:

  * They are used by data scientists to assess the performance of a model on training and testing sets and then tune hyperparameters if needed. This also applies to neural networks, so nothing new here.
  * They are used by neural networks to automatically learn from mistakes and update weight matrices and biases. This will be explained in more detail in the next section, which is about backpropagation. So, the neural network will use a metric (also called a **loss function**) to compare its predicted values, $(\hat{y})$ to the true label, $y$, and then learn how to make better predictions automatically.
  
The loss function is critical to a neural network learning to make good predictions. It is a hyperparameter that needs to be defined by data scientists while designing the architecture of a neural network. There is no single correct loss function: the choice depends on the dataset and the problem you want to solve. Luckily for us, though, there are some basic rules of thumb that work in most cases:

  * If you are working on a regression problem, you can use mean squared error.
  * If it is a binary classification, the loss function should be binary cross-entropy.
  * If it is a multi-class classification, then categorical cross-entropy should be your go-to choice.
  
As a final note, the choice of loss function will also define which activation function you will have to use on the last layer of the neural network. Each loss function expects a certain type of data in order to properly assess prediction performance.

Here is the list of activation functions according to the loss function and type of project/problem:

| Problem Type   | Last-Layer Activation Function  | Loss Function  |
|---|---|---|
| Regression   | None (or identity function)  | Mean squared error   |
| Binary classification   | sigmoid   | Binary cross-entropy   |
| Multi-class classification   | softmax   | Categorical cross-entropy   |
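The three loss functions in the table can be written in a few lines of plain Python. This is a simplified sketch for intuition: it assumes predictions strictly between 0 and 1 for the cross-entropy losses, one-hot encoded labels for the categorical case, and made-up sample values.

```Python
import math

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred):
    # y_pred holds sigmoid outputs strictly between 0 and 1
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, y_pred)
    ) / len(y_true)

def categorical_cross_entropy(y_true, y_pred):
    # y_true is one-hot encoded; y_pred holds softmax outputs for one sample
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))  # 2.5
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

In each case, the loss shrinks toward 0 as the predictions get closer to the true values.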

With TensorFlow, in order to build your custom architecture, you can instantiate the `Sequential()` class and add your layers of fully connected neurons as shown in the following code snippet:

```Python
import tensorflow as tf
from tensorflow.keras import layers
model = tf.keras.Sequential()
input_layer = layers.Flatten(input_shape=(28,28))
layer1 = layers.Dense(128, activation='relu')
model.add(input_layer)
model.add(layer1)
```

---

## Backpropagation

Previously, we learned how a neural network makes predictions by using weight matrices and biases (we can combine them into a single matrix) from its neurons. Using the loss function, a network determines how good or bad the predictions are. It would be great if it could use this information and update the parameters accordingly. This is exactly what backpropagation is about: optimizing a neural network's parameters.

Training a neural network involves executing forward propagation and backpropagation multiple times in order to make predictions and update the parameters from the errors. During the first pass (or propagation), we start by initializing all the weights of the neural network. Then, we apply forward propagation, followed by backpropagation, which updates the weights.

We apply this process several times and the neural network will optimize its parameters iteratively. You can decide to stop this learning process by setting the maximum number of times the neural network will go through the entire dataset (each pass is called an epoch) or by defining an early stopping threshold if the neural network's score has not improved after a few epochs.
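The loop of forward propagation, backpropagation, and parameter update can be sketched on the simplest possible "network": a single linear neuron trained with mean squared error. The data, the learning rate, and the number of epochs are illustrative assumptions, and the gradients are derived by hand here, whereas TensorFlow computes them automatically.

```Python
# Toy data following y = 2x + 1 (illustrative values)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0          # initial parameters
learning_rate = 0.05
losses = []

for epoch in range(500):
    # Forward propagation: predictions and loss
    preds = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(preds, ys)]
    losses.append(sum(e ** 2 for e in errors) / len(xs))

    # Backpropagation: gradients of the loss with respect to w and b
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)

    # Parameter update
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3), losses[0], losses[-1])
```

After enough epochs, `w` and `b` approach 2 and 1, the values used to generate the data.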

---

## Optimizers and the Learning Rate

In the previous section, we saw that a neural network follows an iterative process to find the best solution for any input dataset. Its learning process is an optimization process. You can use different optimization algorithms (also called **optimizers**) for a neural network. The most popular ones are `Adam`, `SGD`, and `RMSprop`.

One important parameter of a neural network's optimizer is the learning rate. This value defines how quickly the neural network will update its weights. A learning rate that is too low will slow down the learning process, and the neural network will take a long time to find the right parameters. On the other hand, a learning rate that is too high can prevent the neural network from learning a solution, as it makes bigger weight changes than required. A good practice is to start with a not-too-small learning rate (such as 0.01 or 0.001), stop the training once the score starts to plateau or get worse, then lower the learning rate (by an order of magnitude, for instance) and keep training the network.
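The effect of the learning rate can be observed on a toy problem. The sketch below runs plain gradient descent on $f(w) = w^2$, whose minimum is at $w = 0$; the function, the starting point, and the three learning rates are assumptions chosen purely for illustration.

```Python
def minimize(learning_rate, steps=50, start=1.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

# Distance from the minimum (w = 0) after 50 steps
for lr in (0.001, 0.1, 1.1):
    print(lr, abs(minimize(lr)))
```

With the smallest rate, `w` has barely moved after 50 steps. With the middle rate, it is very close to 0. With the largest rate, every step overshoots the minimum and `w` moves further and further away.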

With TensorFlow, you can instantiate an optimizer from `tf.keras.optimizers`. For instance, the following code snippet shows us how to create an `Adam` optimizer with 0.001 as the learning rate and then compile our neural network by specifying the loss function (`'sparse_categorical_crossentropy'`) and metrics to be displayed (`'accuracy'`):

```Python
import tensorflow as tf
optimizer = tf.keras.optimizers.Adam(0.001)
model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
```

Once the model is compiled, we can then train the neural network with the `.fit()` method like this:

```Python
model.fit(features_train, label_train, epochs=5)
```

Here we trained the neural network on the training set for 5 epochs. Once trained, we can use the model on the testing set and assess its performance with the `.evaluate()` method:

```Python
model.evaluate(features_test, label_test)
```

**Go to Exercise 6.3**

---

## Regularization

As with any machine learning algorithm, neural networks can face the problem of overfitting when they learn patterns that are only relevant to the training set. In such a case, the model will not be able to generalize to unseen data.

Luckily, there are multiple techniques that can help reduce the risk of overfitting:

  * L1 regularization, which adds a penalty term (based on the absolute values of the weights) to the loss function
  * L2 regularization, which adds a penalty term (based on the squared values of the weights) to the loss function
  * Early stopping, which stops the training if the error for the validation set increases while the error decreases for the training set
  * Dropout, which will randomly remove some neurons during training
  
L1 regularization, L2 regularization, and dropout can be added at each layer of a neural network we create, while early stopping applies to the training process as a whole.
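To make the penalties and the early stopping rule more tangible, here is a plain-Python sketch. The function names, the `strength` and `patience` parameters, and all numbers are hypothetical; Keras exposes these ideas through its own regularizer and callback classes.

```Python
def l1_penalty(weights, strength):
    # Sum of absolute values of the weights, scaled by the strength
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    # Sum of squared weights, scaled by the strength
    return strength * sum(w ** 2 for w in weights)

def should_stop(validation_losses, patience):
    """Stop when the best validation loss is at least `patience` epochs old."""
    best_epoch = validation_losses.index(min(validation_losses))
    return len(validation_losses) - 1 - best_epoch >= patience

weights = [0.5, -2.0, 0.0, 1.0]
base_loss = 0.30  # illustrative loss before regularization

print(base_loss + l1_penalty(weights, 0.01))
print(base_loss + l2_penalty(weights, 0.01))
print(should_stop([0.9, 0.7, 0.6, 0.65, 0.7], patience=2))  # True
```

Large weights increase the regularized loss, so the optimizer is pushed toward smaller weights. The early stopping check returns `True` here because the validation loss has not improved for two epochs.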

**Go to Exercise 6.4**

---

## Deep Learning

Now that we are comfortable in building and training a neural network with one hidden layer, we can look at more complex architecture with deep learning.

Deep learning is just an extension of traditional neural networks but with deeper and more complex architecture. Deep learning can model very complex patterns and can be applied to tasks such as detecting objects in images and translating text into a different language.

### Shallow versus Deep Networks

As mentioned earlier, we can add more hidden layers to a neural network. This will increase the number of parameters to be learned but can potentially help to model more complex patterns. This is what deep learning is about: increasing the depth of a neural network to tackle more complex problems.

For instance, we can add a second layer to the neural network we presented earlier in the section on forward propagation and loss functions:

![Figure 6.19](img/fig6_19.jpg)

In theory, we can add as many hidden layers as we like. But there is a drawback with deeper networks: increasing the depth also increases the number of parameters to be optimized, so the neural network will have to train for longer. As good practice, it is better to start with a simpler architecture and then steadily increase its depth.
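The growth in parameters can be estimated with a short calculation. The sketch below counts the weights and biases of fully connected layers; the layer sizes (784 inputs, hidden layers of 128 units, and an assumed output layer of 10 units for the ten digits) are illustrative.

```Python
def count_parameters(layer_sizes):
    """Count weights and biases of fully connected layers.

    `layer_sizes` lists the number of units per layer, input first.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

shallow = count_parameters([784, 128, 10])           # one hidden layer
deep = count_parameters([784, 128, 128, 128, 10])    # three hidden layers
print(shallow, deep)  # 101770 134794
```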

## Computer Vision and Image Classification

Deep learning has achieved amazing results in computer vision and natural language processing. Computer vision is a field that involves analyzing digital images. A digital image is a matrix composed of **pixels**. Each pixel has a value between $0$ and $255$, and this value represents the intensity of the pixel. An image can be black and white (grayscale) and have only one channel. But it can also have colors, and in that case, it will have three channels for the colors red, green, and blue. This digital version of an image can be fed to a deep learning model.

There are multiple applications of computer vision, such as image classification (recognizing the main object in an image), object detection (localizing different objects in an image), and image segmentation (finding the edges of objects in an image). In this book, we will only look at image classification.

In the next section, we will look at a specific type of architecture: CNNs.

## Convolutional Neural Networks (CNNs)

CNNs are ANNs that are optimized for image-related pattern recognition. CNNs are based on convolutional layers instead of fully connected layers.

A convolutional layer is used to detect patterns in an image with a filter. A filter is just a matrix that is applied to a portion of an input image through a convolutional operation and the output will be another image (also called a feature map) with the highlighted patterns found by the filter. For instance, a simple filter can be one that recognizes vertical lines on a flower, such as for the following image:

![Figure 6.20](img/fig6_20.jpg)

These filters are not set in advance but learned by CNNs automatically. After the training is over, a CNN can recognize different shapes in an image. These shapes can be anywhere on the image, and the convolutional operator recognizes similar image information regardless of its exact position.

## Convolutional Operations

A convolution is a specific type of matrix operation. For an input image, a filter of size $n \times n$ will go through a specific area of an image and apply an element-wise product and a sum and return the calculated value:

![Figure 6.21](img/fig6_21.jpg)

In the preceding example, we applied a filter to the top-left part of the image. Then we applied an element-wise product that just multiplied an element from the input image to the corresponding value on the filter. In the example, we calculated the following:

  * 1st row, 1st column: $5 * 2 = 10$
  * 1st row, 2nd column: $10 * 0 = 0$
  * 1st row, 3rd column: $15 * (-1) = -15$
  * 2nd row, 1st column: $10 * 2 = 20$
  * 2nd row, 2nd column: $20 * 0 = 0$
  * 2nd row, 3rd column: $30 * (-1) = -30$
  * 3rd row, 1st column: $100 * 2 = 200$
  * 3rd row, 2nd column: $150 * 0 = 0$
  * 3rd row, 3rd column: $200 * (-1) = -200$

Finally, we perform the sum of these values: $10 + 0 -15 + 20 + 0 - 30 + 200 + 0 - 200 = -15$.
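The calculation above can be reproduced in a few lines of plain Python. The `patch` and `kernel` values are taken from the element-wise products listed above; the variable names are illustrative.

```Python
# Top-left 3 x 3 area of the input image
patch = [
    [5, 10, 15],
    [10, 20, 30],
    [100, 150, 200],
]
# The 3 x 3 filter
kernel = [
    [2, 0, -1],
    [2, 0, -1],
    [2, 0, -1],
]

# Element-wise product, then the sum of all products
products = [
    [p * k for p, k in zip(patch_row, kernel_row)]
    for patch_row, kernel_row in zip(patch, kernel)
]
result = sum(sum(row) for row in products)

print(products)  # [[10, 0, -15], [20, 0, -30], [200, 0, -200]]
print(result)    # -15
```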

Then we will perform the same operation by sliding the filter to the right by one column from the input image. We keep sliding the filter until we have covered the entire image:

![Figure 6.22](img/fig6_22.jpg)

Rather than sliding column by column, we can also slide by two, three, or more columns. The parameter defining the length of this sliding operation is called the **stride**.

You may have noticed that the result of the convolutional operation is an image (or feature map) with smaller dimensions than the input image. If you want to keep the exact same dimensions, you can add additional rows and columns with the value 0 around the border of the input image. This operation is called **padding**.

This is what is behind a convolutional operation. A convolutional layer is just the application of this operation with multiple filters.
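Putting sliding, stride, and padding together, a single-filter convolution can be sketched as follows. This is a simplified, unoptimized illustration on nested lists, assuming a square filter and a single channel; the `conv2d()` helper and the 4 x 4 `image` are hypothetical (only its top-left 3 x 3 area matches the earlier example).

```Python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` (2D nested lists) and return the feature map."""
    k = len(kernel)
    if padding > 0:
        # Surround the image with rows and columns of zeros
        width = len(image[0]) + 2 * padding
        side = [0] * padding
        top = [[0] * width for _ in range(padding)]
        bottom = [[0] * width for _ in range(padding)]
        image = top + [side + list(row) + side for row in image] + bottom
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [
        [
            sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

image = [
    [5, 10, 15, 20],
    [10, 20, 30, 40],
    [100, 150, 200, 250],
    [1, 2, 3, 4],
]
kernel = [[2, 0, -1], [2, 0, -1], [2, 0, -1]]

print(conv2d(image, kernel))                  # [[-15, 50], [-11, 50]]
print(len(conv2d(image, kernel, padding=1)))  # 4: same height as the input
```

Without padding, the 4 x 4 input shrinks to a 2 x 2 feature map. With one row and column of zeros on each side, the output keeps the input's 4 x 4 size.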

We can declare a convolutional layer in TensorFlow with the following code snippet:

```Python
from tensorflow.keras import layers
layers.Conv2D(32, kernel_size=(3, 3), strides=(1,1), padding="valid", activation="relu")
```

In the preceding example, we have instantiated a convolutional layer with $32$ filters (also called **kernels**) of size $(3, 3)$ with stride of $1$ (sliding window by 1 column or row at a time) and no padding (`padding="valid"`).

  > You can read more about this Conv2D class on TensorFlow's website, at [https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D).
  
In TensorFlow, convolutional layers expect the input to be tensors with the following format: (**rows, height, width, channel**), where rows is the number of images. Depending on the dataset, you may have to reshape the images to conform to this requirement. If the images are stored in a NumPy array, you can use its `reshape()` method for this, as shown in the following code snippet:

```Python
# reshape() returns a new array, so assign the result back
features_train = features_train.reshape(60000, 28, 28, 1)
```

## Pooling Layer

Another frequent layer in a CNN's architecture is the pooling layer. We have seen previously that the convolutional layer reduces the size of the image if no padding is added. Is this behavior expected? Why don't we keep the exact same size as for the input image? In general, with CNNs, we tend to reduce the size of the feature maps as we progress through different layers. The main reason for this is that we want to have more and more specific pattern detectors closer to the end of the network.

Closer to the beginning of the network, a CNN will tend to have more generic filters, such as vertical or horizontal line detectors, but as it goes deeper, we would, for example, have filters that can detect a dog's tail or a cat's whiskers if we were training a CNN to recognize cats versus dogs, or the texture of objects if we were classifying images of fruits. Also, having smaller feature maps reduces the risk of false patterns being detected.

By increasing the stride, we can further reduce the size of the output feature map. But there is another way to do this: adding a pooling layer after a convolutional layer. A pooling layer slides a window of a given size over the feature map and applies an aggregation function to each area. The most frequent aggregation method is finding the maximum value of a group of pixels:

![Figure 6.23](img/fig6_23.jpg)

In the preceding example, we use a max pooling of size $(2, 2)$ and `stride=2`. We look at the top-left corner of the feature map and find the maximum value among the pixels 6, 8, 1, and 2 and get the result, 8. Then we slide the max pooling by a stride of 2 and perform the same operation on the pixels 6, 1, 7, and 4. We repeat the same operation on the bottom groups and get a new feature map of size $(2,2)$.
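Max pooling can be sketched in plain Python as shown below. The top two rows of `feature_map` use the pixel values mentioned above (their exact arrangement inside each window is an assumption), while the bottom two rows are made-up values, since the figure's remaining pixels are not listed in the text.

```Python
def max_pool(feature_map, pool_size=2, stride=2):
    """Return the maximum of each pool_size x pool_size window."""
    out_h = (len(feature_map) - pool_size) // stride + 1
    out_w = (len(feature_map[0]) - pool_size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(pool_size)
                for j in range(pool_size)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

feature_map = [
    [6, 8, 6, 1],
    [1, 2, 7, 4],
    [3, 5, 0, 2],   # hypothetical values
    [4, 1, 9, 3],   # hypothetical values
]
pooled = max_pool(feature_map)
print(pooled)  # [[8, 7], [5, 9]]
```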

In TensorFlow, we can use the `MaxPool2D()` class to declare a max-pooling layer:

```Python
from tensorflow.keras import layers
layers.MaxPool2D(pool_size=(2, 2), strides=2)
```

  > You can read more about this MaxPool2D class on TensorFlow's website at [https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool2D).

## CNN Architecture

As you saw earlier, you can define your own custom CNN architecture by specifying the type and number of hidden layers, the activation functions to be used, and so on. But this may be a bit daunting for beginners. How do we know how many filters need to be added at each layer or what the right stride will be? We will have to try multiple combinations and see which ones work.

Luckily, a lot of researchers in deep learning have already done such exploratory work and have published the architecture they designed. Currently, the most famous ones are these:

  * AlexNet
  * VGG
  * ResNet
  * Inception
  
  > You can read more about the different CNN architectures implemented on TensorFlow at [https://www.tensorflow.org/api_docs/python/tf/keras/applications](https://www.tensorflow.org/api_docs/python/tf/keras/applications).

# Recurrent Neural Networks (RNNs)


In the last section, we learned how we can use CNNs for computer vision tasks such as classifying images. With deep learning, computers are now capable of achieving and sometimes surpassing human performance. Another field that is attracting a lot of interest from researchers is natural language processing. This is a field where RNNs excel.

In the last few years, we have seen a lot of different applications of RNN technology, such as speech recognition, chatbots, and text translation applications. But RNNs also perform well at predicting time series patterns, which is useful for tasks such as forecasting stock markets.

## RNN Layers

The common point with all the applications mentioned earlier is that the inputs are sequential. There is a time component with the input. For instance, a sentence is a sequence of words, and the order of words matters; stock market data consists of a sequence of dates with corresponding stock prices.
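
To make the idea of sequential input concrete, here is a minimal sketch of how a price series could be cut into ordered windows. Everything in it is an assumption made for illustration: the `make_windows` helper is hypothetical, the prices are invented, and the window length of 3 is arbitrary. The resulting array has the shape `(samples, timesteps, features)`, which is the layout Keras recurrent layers expect.

```Python
import numpy as np

def make_windows(series, timesteps):
    """Split a 1-D series into overlapping input windows and next-value targets."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - timesteps
    inputs = np.stack([series[i:i + timesteps] for i in range(n_samples)])
    targets = series[timesteps:]
    # Add a trailing feature axis: (samples, timesteps, features)
    return inputs[..., np.newaxis], targets

# Hypothetical daily closing prices, oldest first
prices = [10.0, 11.0, 12.5, 12.0, 13.0, 14.5]

X, y = make_windows(prices, timesteps=3)
print(X.shape, y.shape)  # (3, 3, 1) (3,)
```

Each row of `X` keeps the prices in their original order, and the matching entry of `y` is the price that comes right after that window.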

To accommodate such input, we need neural networks to be able to handle sequences of inputs and be able to maintain an understanding of the relationships between them. One way to do this is to create memory where the network can take into account previous inputs. This is exactly how a basic RNN works:

![Figure 6.24](img/fig6_24.jpg)

In the preceding figure, we can see a neural network that takes an input called $x_t$, performs some transformations, and gives the output result, $\hat{y_t}$. Nothing new so far.

But you may have noticed that there is an additional value called $H_{t-1}$: it is an output of the neural network at the previous time step and is fed back in as an input at the current one. This is how an RNN simulates memory – by considering its previous results and taking them in as an additional input. Therefore, the result $\hat{y_t}$ will depend on the input $x_t$ but also on $H_{t-1}$. Now, we can represent a sequence of four inputs that get fed into the same neural network:

![Figure 6.25](img/fig6_25.jpg)

We can see the neural network is taking an input ($x$) and generating an output ($y$) at each time step ($t, t+1, \dots, t+3$) but also another output ($h$), which is feeding the next iteration.

  > The preceding figure may be a bit misleading – there is actually only one RNN here (all the RNN boxes in the middle form one neural network), but it is easier to see how the sequencing works in this format.
  
An RNN cell looks like this on the inside:

![Figure 6.26](img/fig6_26.jpg)

It is very similar to a simple neuron, but it takes more inputs and uses `tanh` as the activation function.
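
The following NumPy sketch shows one way such a cell could be written and then applied to a sequence of four inputs. It is a simplified illustration rather than TensorFlow's implementation: the names `W_x`, `W_h`, `b`, and `rnn_step` are hypothetical, the weights are random instead of trained, and the output at each step is assumed to be the hidden state itself.

```Python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_units = 3, 4

# One set of weights, reused at every time step
W_x = rng.normal(size=(n_features, n_units))  # current input -> hidden
W_h = rng.normal(size=(n_units, n_units))     # previous hidden -> hidden
b = np.zeros(n_units)

def rnn_step(x_t, h_prev):
    """One time step: combine the current input with the previous hidden state."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

sequence = rng.normal(size=(4, n_features))  # four time steps
h = np.zeros(n_units)                        # no memory before the first step
hidden_states = []
for x_t in sequence:
    h = rnn_step(x_t, h)  # the result feeds the next iteration
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)

print(hidden_states.shape)  # (4, 4)
```

The loop calls the same `rnn_step` function with the same weights at every step, which is why the unrolled diagram represents a single network.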

  > You can use any activation function in an RNN cell. The default value in TensorFlow is `tanh`.

This is the basic logic of RNNs. In TensorFlow, we can instantiate an RNN layer with `layers.SimpleRNN`:

```Python
from tensorflow.keras import layers
layers.SimpleRNN(4, activation='tanh')
```

In the code snippet, we created an RNN layer with 4 output units and the `tanh` activation function (which is the most widely used activation function for RNNs).

## The GRU Layer

One drawback with the previous type of layer is that the final output takes into consideration all the previous outputs. If you have a sequence of 1,000 input units, the final output, $y$, is influenced by every single previous result. If this sequence was composed of 1,000 words and we were trying to predict the next word, it would really be overkill to have to memorize all of the 1,000 words before making a prediction. You probably only need to look at the 100 words or so that immediately precede the prediction.

This is exactly what **Gated Recurrent Unit (GRU)** cells are for. Let's look at what is inside them:

![Figure 6.27](img/fig6_27.jpg)

Compared to a simple RNN cell, a GRU cell has a few more elements:

  * A second activation function, which is `sigmoid`
  * A multiplier operation performed before generating the outputs $\hat{y_t}$ and $H_t$
  
The usual path with `tanh` is still responsible for making a prediction, but this time we will call it the "candidate." The sigmoid path acts as an "update" gate. This will tell the GRU cell whether it needs to discard this candidate or not. Remember that the output of the sigmoid function ranges between 0 and 1. If it is close to 0, the update gate (that is, the sigmoid path) will say we should not consider this candidate.

On the other hand, if it is closer to 1, we should definitely use the result of this candidate.

Remember that the output $H_t$ is related to $H_{t-1}$, which is related to $H_{t-2}$, and so on. So, this update gate will also define how much "memory" we should keep. It tends to prioritize previous outputs closer to the current one.
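
The effect of the update gate can be sketched in NumPy as follows. This is a deliberately reduced cell that follows the description above and leaves out the reset gate; the function name, weight names, and random values are all hypothetical. The blending rule used here (gate times candidate, plus one minus gate times previous state) is one common way to write it, and some references and implementations swap the roles of the gate and its complement.

```Python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simplified_gru_step(x_t, h_prev, params):
    """One step of a GRU-like cell with an update gate only (no reset gate)."""
    W_c, U_c, b_c, W_u, U_u, b_u = params
    candidate = np.tanh(x_t @ W_c + h_prev @ U_c + b_c)  # tanh path
    update = sigmoid(x_t @ W_u + h_prev @ U_u + b_u)     # sigmoid path, in (0, 1)
    # Gate near 1: use the candidate. Gate near 0: keep the previous state.
    return update * candidate + (1.0 - update) * h_prev

rng = np.random.default_rng(0)
n_features, n_units = 3, 4
W_c = rng.normal(size=(n_features, n_units))
U_c = rng.normal(size=(n_units, n_units))
W_u = rng.normal(size=(n_features, n_units))
U_u = rng.normal(size=(n_units, n_units))
b_c = np.zeros(n_units)

x_t = rng.normal(size=n_features)
h_prev = rng.normal(size=n_units)

# Large negative or positive gate biases push the gate towards 0 or 1
closed_gate = (W_c, U_c, b_c, W_u, U_u, np.full(n_units, -50.0))
open_gate = (W_c, U_c, b_c, W_u, U_u, np.full(n_units, 50.0))

h_keep = simplified_gru_step(x_t, h_prev, closed_gate)  # stays at h_prev
h_take = simplified_gru_step(x_t, h_prev, open_gate)    # becomes the candidate
```

With the gate forced towards 0, the previous state passes through unchanged, so older information survives. With the gate forced towards 1, the state is overwritten by the candidate.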

This is the basic logic of GRU (note that the GRU cell has one more component, the reset gate, but for the purpose of simplicity, we will not look at it). In TensorFlow, we can instantiate such a layer with `layers.GRU`:

```Python
from tensorflow.keras import layers
layers.GRU(4, activation='tanh', recurrent_activation='sigmoid')
```

In the code snippet, we have created a GRU layer with 4 output units, the `tanh` activation function for the candidate prediction, and `sigmoid` for the update gate.

## The LSTM Layer

There is another very popular type of cell for RNN architecture called the LSTM cell. LSTM stands for **Long Short-Term Memory**. LSTM came before GRU, but the latter is much simpler, and this is the reason why we presented it first. Here is what is under the hood of LSTM:

![Figure 6.28](img/fig6_28.jpg)

At first, this looks very complicated. It is composed of several elements:

  * `Cell state`: This is the concatenation of all the previous outputs. It is the "memory" of the LSTM cell.
  * `Forget gate`: This is responsible for defining whether we should keep or forget a given memory.
  * `Input gate`: This is responsible for defining whether the new memory candidate needs to be updated or not. This new memory candidate is then added to the previous memory.
  * `Output gate`: This is responsible for making the prediction based on the previous output $(H_{t-1})$, the current input $(x_t)$, and the memory.
  
An LSTM cell can consider previous results but also past memory, and this is the reason why it is so powerful.
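
As a rough illustration of how these elements fit together, the sketch below writes one LSTM time step in NumPy using the commonly cited LSTM equations. It is not TensorFlow's implementation: the `lstm_step` function, the stacked weight matrices `W`, `U`, `b`, and the order in which the four paths are stored are assumptions made for this example, and the weights are random rather than trained.

```Python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, and b stack the weights of the four paths."""
    n_units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b             # shape: (4 * n_units,)
    f = sigmoid(z[0 * n_units:1 * n_units])  # forget gate
    i = sigmoid(z[1 * n_units:2 * n_units])  # input gate
    g = np.tanh(z[2 * n_units:3 * n_units])  # new memory candidate
    o = sigmoid(z[3 * n_units:4 * n_units])  # output gate
    c_t = f * c_prev + i * g                 # keep part of the old memory, add new
    h_t = o * np.tanh(c_t)                   # prediction based on the memory
    return h_t, c_t

rng = np.random.default_rng(0)
n_features, n_units = 3, 4
W = rng.normal(size=(n_features, 4 * n_units))
U = rng.normal(size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)

h = np.zeros(n_units)  # previous output
c = np.zeros(n_units)  # cell state (memory)
for x_t in rng.normal(size=(5, n_features)):  # five time steps
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # (4,) (4,)
```

Note that two values are carried from one step to the next: the output `h` and the cell state `c`. A simple RNN cell carries only the first of these.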

In TensorFlow, we can instantiate such a layer with `layers.LSTM`:

```Python
from tensorflow.keras import layers
layers.LSTM(4, activation='tanh', recurrent_activation='sigmoid')
```

In the code snippet, we have created an LSTM layer with 4 output units, the `tanh` activation function for the new memory candidate, and `sigmoid` for the forget, input, and output gates.

  > You can read more about the SimpleRNN implementation in TensorFlow here: [https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN).

## Hardware for Deep Learning

As you may have noticed, training deep learning models takes longer than traditional machine learning algorithms. This is due to the number of calculations required for the forward pass and backpropagation. In this book, we trained very simple models with just a few layers. But there are architectures with hundreds of layers, and some with even more than that. That kind of network can take days or even weeks to train.

To speed up the training process, it is recommended to use a specific piece of hardware called a GPU. GPUs specialize in performing many mathematical operations in parallel and therefore are well suited to deep learning. Compared to a **Central Processing Unit (CPU)**, a GPU can be up to 10 times faster at training a deep learning model. You can buy a GPU and set up your own deep learning computer. You just need to get one that is CUDA-compliant (currently only NVIDIA GPUs are).

Another possibility is to use cloud providers such as AWS or Google Cloud Platform and train your models in the cloud. You will pay only for what you use and can switch them off as soon as you are done. The benefit is that you can scale the configuration up or down depending on the needs of your projects – but be mindful of the cost. You will be charged for the time your instance is up even if you are not training a model. So, don't forget to switch things off if you're not using them.

Finally, Google recently released some new hardware dedicated to deep learning: **Tensor Processing Units (TPUs)**. They are much faster than GPUs, but they are quite costly. Currently, only Google Cloud Platform provides such hardware in their cloud instances.

## Challenges and Future Trends

As with any new technology, deep learning comes with challenges. One of them is the big barrier to entry. To become a deep learning practitioner, you used to have to know all the mathematical theory behind deep learning very well and be an experienced programmer. On top of this, you had to learn the specifics of the deep learning framework you chose to use (be it TensorFlow, PyTorch, Caffe, or anything else). For a while, deep learning couldn't reach a broad audience and was mainly limited to researchers. This situation has changed, though it is not perfect. For instance, TensorFlow now comes with a higher-level API called Keras (this is the one you saw in this chapter) that is much easier to use than the core API. Hopefully, this trend will keep going and make deep learning frameworks more accessible to anyone interested in this field.

The second challenge was that deep learning models require a lot of computation power, as mentioned in the previous section. This was again a major blocker for anyone who wanted to have a go at it. Even though the cost of GPUs has gone down, deep learning still requires some upfront investment. Luckily for us, there is now a free option to train deep learning models with GPUs: Google Colab. It is an initiative from Google to promote research by providing temporary cloud computing for free. The only thing you need is a Google account. Once signed up, you can create notebooks (similar to Jupyter Notebooks) and choose a kernel to be run on a CPU, GPU (limited to 10 hours per day), or even a TPU (limited to half an hour per day). So, before investing in purchasing or renting a GPU, you can first practice with Google Colab.

  > You can find more information about Google Colab at [https://colab.research.google.com/](https://colab.research.google.com/).

More advanced deep learning models can be very deep and require weeks of training. So, it is hard for basic practitioners to use such architectures. But thankfully, a lot of researchers have embraced the open source movement and have shared not only the architectures they have designed but also the weights of the networks. This means you can now access state-of-the-art pre-trained models and fine-tune them to fit your own projects. This is called transfer learning (which is out of the scope of this book). It is very popular in computer vision, where you can find models pre-trained on ImageNet or MS COCO, for instance, which are large datasets of pictures. Transfer learning is also happening in natural language processing, but it is not as developed as it is for computer vision.

  > You can find more information about these datasets at [http://www.image-net.org/](http://www.image-net.org/) and [http://cocodataset.org/](http://cocodataset.org/).
  
Another very important topic related to deep learning is the increasing need to be able to interpret model results. Soon, these kinds of algorithms may be regulated, and deep learning practitioners will have to be able to explain why a model is making a given decision. Currently, deep learning models are more like black boxes due to the complexity of the networks. There are already some initiatives from researchers to find ways to interpret and understand deep neural networks, such as Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks", ECCV 2014. However, more work needs to be done in this field as such technologies become part of our day-to-day lives. For instance, we will need to make sure that these algorithms are not biased and are not making unfair decisions affecting specific groups of people.