# Ch.7 Advanced Deep Learning Best Practices
## 7.1 Going beyond the Sequential model: the Keras functional API

Up to this point, all neural networks introduced in these studies have been implemented using the **`Sequential`** model, which assumes that the network has exactly one input and one output and consists of a linear stack of layers. However, some networks require several independent inputs, others require multiple outputs, and some have internal branching between layers that makes them look like graphs of layers rather than a linear stack.

Some tasks require *multimodal* inputs: they merge data coming from different input sources, processing each type of data using different kinds of neural layers. Picture a deep learning model that tries to predict the most likely market price of a second-hand piece of clothing, using three inputs: user-provided metadata on the item (brand, age, etc.), a text description, and a photo.
 - If we only had the metadata, we could one-hot encode it and use a densely connected network to predict the price.
 - If we only had the text description, we could use an RNN or a 1D CNN.
 - If we only had the photo, we could use a 2D CNN.
 
But how can we use all three at the same time? A good way is to *jointly* learn a more accurate model of the data by using a model that can see all available input modalities simultaneously: a model with three branches.

![multi input model](images/7_1_1_multiinput.jpg)

Similarly, some tasks need to predict multiple target attributes of input data. Given the text of a book, we might want to automatically classify it by genre but also predict the date it was written. We could train two separate models, but genre and date written are not statistically independent, and we could build a better model that learns jointly to predict genre and date at the same time.

![multi output model](images/7_1_1_multioutput.jpg)

There even exist crazier, more complex networks structured as acyclic graphs. Here is a picture of an Inception model developed by Google.

![inception](images/7_1_1_inception.jpg)

## 7.1.1 Introduction to the functional API

With the functional API, we directly manipulate tensors, and use layers as functions that take tensors and return tensors:

In [11]:
from keras import Input, layers

input_tensor = Input(shape=(32,)) # a tensor

dense = layers.Dense(32, activation='relu') # layer is a function

output_tensor = dense(input_tensor) # layer may be called on a tensor

Let's start with a minimal example that shows a simple **Sequential** model side-by-side with its equivalent in the functional API:

In [12]:
from keras.models import Sequential, Model
from keras import layers
from keras import Input

seq_model = Sequential() # already know about this model
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64,))) #1st hidden layer
seq_model.add(layers.Dense(32, activation='relu')) # 2nd hidden layer
seq_model.add(layers.Dense(10, activation='softmax')) # output layer

# Below is the functional API equivalent of the model above
input_tensor = Input(shape=(64,)) # input layer
x = layers.Dense(32, activation='relu')(input_tensor) # 1st hidden layer
x = layers.Dense(32, activation='relu')(x) # 2nd hidden layer
output_tensor = layers.Dense(10, activation='softmax')(x) #output layer

# Model class turns an input tensor and output tensor into a model
model = Model(input_tensor, output_tensor)

# let's look at it!
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 64)                0         
_________________________________________________________________
dense_19 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_20 (Dense)             (None, 32)                1056      
_________________________________________________________________
dense_21 (Dense)             (None, 10)                330       
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________


Behind the scenes, Keras retrieves every layer involved in going from `input_tensor` to `output_tensor`, bringing them together into a graph-like data structure - a **Model**. The reason it works is because the `output_tensor` was obtained by repeatedly transforming the `input_tensor`. If we tried to build a model from inputs and outputs that weren't related, we'd get an error.

In [3]:
unrelated_input = Input(shape=(32,))
bad_model = Model(unrelated_input, output_tensor)

RuntimeError: Graph disconnected: cannot obtain value for tensor Tensor("input_2:0", shape=(?, 64), dtype=float32) at layer "input_2". The following previous layers were accessed without issue: []

This error tells us that `output_tensor` depends on `input_2` (the 64-dimensional input named in the message), which Keras cannot reach from `unrelated_input`.
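The idea of walking back from the output to the declared inputs can be sketched in a few lines of plain Python. This is a toy illustration only, not how Keras is implemented: `ToyTensor`, `toy_layer`, and `collect_layers` are hypothetical names invented for this sketch.

```python
class ToyTensor:
    """A stand-in for a symbolic tensor that remembers how it was produced."""

    def __init__(self, name, layer=None, parents=()):
        self.name = name
        self.layer = layer        # name of the producing layer (None for an input)
        self.parents = parents    # tensors consumed by the producing layer


def toy_layer(layer_name):
    """Return a callable that 'applies' a layer by recording the dependency."""
    def call(*inputs):
        return ToyTensor(layer_name + "_out", layer=layer_name, parents=inputs)
    return call


def collect_layers(inputs, output):
    """Walk back from `output`; fail on a source tensor that was not declared."""
    layers, stack, seen = [], [output], set()
    while stack:
        tensor = stack.pop()
        if id(tensor) in seen:
            continue
        seen.add(id(tensor))
        if tensor.layer is None:                       # reached a source tensor
            if not any(tensor is declared for declared in inputs):
                raise ValueError("Graph disconnected: cannot reach " + tensor.name)
            continue
        layers.append(tensor.layer)
        stack.extend(tensor.parents)
    return layers[::-1]                                # input-to-output order for a chain


inp = ToyTensor("input_a")
hidden = toy_layer("dense_1")(inp)
out = toy_layer("dense_2")(hidden)
print(collect_layers([inp], out))

unrelated = ToyTensor("input_b")
try:
    collect_layers([unrelated], out)
except ValueError as err:
    print(err)
```

The second call fails for the same reason as the Keras example: the output was computed from one input tensor, but a different, unconnected tensor was declared as the model input.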

When it comes to compiling, training, or evaluating such an instance of `Model`, the API is the same as `Sequential`.

In [13]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy') # compiles the model

# generate dummy data to train on
import numpy as np
x_train = np.random.random((1000, 64))
y_train = np.random.random((1000, 10))

# train the model for 10 epochs
model.fit(x_train, y_train, epochs=10, batch_size=128)

# evaluate the model
score = model.evaluate(x_train, y_train)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
  32/1000 [..............................] - ETA: 3s

## 7.1.2 Multi-input models

The functional API can be used to build models that have multiple inputs. Typically, such models merge their different input branches using a layer that can combine several tensors: by adding them, concatenating them, and so on. Let's look at a very simple example of a multi-input model: a question-answering model.

A typical question-answering model has two inputs: a natural-language question and a text snippet (such as a news article) providing information to be used for answering the question. The model must then produce an answer: in the simplest possible setup, this is a one-word answer obtained via a softmax over some predefined vocabulary.

![question](images/7_1_2_question.jpg)

Here is an example of how we can build this model using the functional API. We set up two independent branches, encoding the text input and the question input as representation vectors, then we concatenate these vectors, and add a softmax classifier on top of the concatenated representations.

**FUNCTIONAL API IMPLEMENTATION OF A TWO-INPUT QUESTION-ANSWERING MODEL**

In [14]:
from keras.models import Model
from keras import layers
from keras import Input

text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

# text input is a variable-length sequence of integers
text_input = Input(shape=(None,), dtype='int32', name='text')

# embed the inputs into a sequence of vectors of size 64
# (Embedding takes the vocabulary size first, then the vector size)
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)

# encodes the vectors in a single vector via an LSTM
encoded_text = layers.LSTM(32)(embedded_text)

# same process with different layer instances for the question
question_input = Input(shape=(None,), dtype='int32', name='question')

embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)

# concatenate the encoded question and encoded text
concatenated = layers.concatenate([encoded_text, encoded_question], axis=-1)

# add a softmax classifier on top
answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)

# model instantiation. Specify the two inputs and output
model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

Now, how do we train this two-input model? We can feed the model a list of Numpy arrays as inputs, or we can feed it a dictionary that maps input names to Numpy arrays (only an option if we give names to inputs).

**FEEDING DATA TO A MULTI-INPUT MODEL**

In [17]:
import numpy as np

num_samples = 1000
max_length = 100

# generate dummy Numpy data
text = np.random.randint(1, text_vocabulary_size, size=(num_samples, max_length))
question = np.random.randint(1, question_vocabulary_size, size=(num_samples, max_length))

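# answers are one-hot encoded, not integers:
# draw a random class index per sample, then expand it to a one-hot row
answer_indices = np.random.randint(0, answer_vocabulary_size, size=num_samples)
answers = np.eye(answer_vocabulary_size)[answer_indices]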

In [None]:
# fit using a list of inputs
model.fit([text, question], answers, epochs=10, batch_size=128)

# fit using a dictionary of inputs (only if inputs are named)
#model.fit({'text': text, 'question': question}, answers, epochs=10, batch_size=128)

## 7.1.3 Multi-output models

In the same way as above, we can use the functional API to build models with multiple *outputs* (or multiple *heads*). A simple example is a network that attempts to simultaneously predict different properties of the data, such as a network that takes a series of social media posts from one person as inputs and tries to predict attributes of that person, such as age, gender, and income level.

![multi output](images/7_1_3_multioutput.jpg)

**FUNCTIONAL API IMPLEMENTATION OF A THREE-OUTPUT MODEL**

In [21]:
from keras import layers
from keras import Input
from keras.models import Model
vocabulary_size = 50000
num_income_groups = 10

posts_input = Input(shape=(None,), dtype='int32', name='posts')
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input)
x = layers.Conv1D(128, 5, activation='relu')(embedded_posts)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation='relu')(x)

age_prediction = layers.Dense(1, name='age')(x) 
income_prediction = layers.Dense(num_income_groups,
                                 activation='softmax',
                                 name='income')(x)
gender_prediction = layers.Dense(1, activation='sigmoid', name='gender')(x)

model = Model(posts_input,
              [age_prediction, income_prediction, gender_prediction])

Training such a model requires the ability to specify different loss functions for different heads of the network: age prediction is a scalar regression task, but gender is a binary classification task, requiring a different training procedure. But because gradient descent requires us to minimize a scalar, we must combine these losses into a single value in order to train the model. The simplest way to combine different losses is to sum them all.

**COMPILATION OPTIONS OF A MULTI-OUTPUT MODEL: MULTIPLE LOSSES**

In [22]:
model.compile(optimizer='rmsprop', loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'])

Very imbalanced loss contributions cause the model representations to be optimized preferentially for the task with the largest individual loss, at the expense of the other tasks. To remedy this, we can weight each loss value's contribution to the final loss. For instance, the MSE loss used for age regression typically takes a value around 3-5, whereas the cross-entropy loss used for gender classification can be as low as 0.1. In this situation, we can assign a weight of 10 to the cross-entropy loss and a weight of 0.25 to the MSE loss, which brings the two contributions to a comparable scale.

**COMPILATION OPTIONS OF A MULTI-OUTPUT MODEL: LOSS WEIGHTING**

In [23]:
model.compile(optimizer='rmsprop', 
              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'],
              loss_weights=[0.25, 1., 10.])
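To see why these particular weights help, here is a small worked calculation in plain Python. The loss values are illustrative only: 4.0 is picked from the 3-5 range quoted above for the age MSE, and 0.1 is the gender cross-entropy figure. The income loss is left out because the text gives no typical magnitude for it.

```python
# Illustrative per-head loss values, using the rough magnitudes quoted in the text.
losses = {"age_mse": 4.0, "gender_crossentropy": 0.1}

# Unweighted: the total is a plain sum, so the MSE term dominates.
unweighted_total = sum(losses.values())
mse_share = losses["age_mse"] / unweighted_total

# Weighted: scale each loss before summing, using the weights from the text.
weights = {"age_mse": 0.25, "gender_crossentropy": 10.0}
weighted = {name: weights[name] * value for name, value in losses.items()}
weighted_total = sum(weighted.values())

print(f"unweighted total = {unweighted_total:.2f}, MSE share = {mse_share:.0%}")
print(f"weighted contributions = {weighted}, total = {weighted_total:.2f}")
```

With these numbers both heads contribute about 1.0 to the weighted total, so neither task dominates the gradient signal by sheer scale.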

In [None]:
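# feeding data to a multi-output model
# (posts, age_targets, income_targets, and gender_targets are assumed to be Numpy arrays)
model.fit(posts, [age_targets, income_targets, gender_targets],
          epochs=10, batch_size=64)

# equivalent, using a dictionary (possible only because the output layers are named)
model.fit(posts, {'age': age_targets,
                  'income': income_targets,
                  'gender': gender_targets},
          epochs=10, batch_size=64)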

## 7.1.4 Directed acyclic graphs of layers

With the functional API, not only can we build models with multiple inputs and multiple outputs, but we can also implement networks with a complex internal topology. Neural networks in Keras are allowed to be arbitrary *directed acyclic graphs* of layers. The qualifier **acyclic** is important: these graphs can’t have cycles. It’s impossible for a tensor x to become the input of one of the layers that generated x. The only processing loops that are allowed (that is, recurrent connections) are those internal to recurrent layers.

Several common neural-network components are implemented as graphs. Two notable ones are Inception modules and residual connections. To better understand how the functional API can be used to build graphs of layers, let’s take a look at how we can implement both of them in Keras.

### Inception modules
*Inception* is a popular type of network architecture for convolutional neural networks. It consists of a stack of modules that themselves look like small independent networks, split into several parallel branches. The most basic form of an Inception module has three to four branches starting with a 1 × 1 convolution, followed by a 3 × 3 convolution, and ending with the concatenation of the resulting features. This setup helps the network separately learn spatial features and channel-wise features, which is more efficient than learning them jointly. More-complex versions of an *Inception* module are also possible, typically involving pooling operations, different spatial convolution sizes (for example, 5 × 5 instead of 3 × 3 on some branches), and branches without a spatial convolution (only a 1 × 1 convolution). An example of such a module, taken from Inception V3, is shown below.

![inception](images/7_1_4_inception.jpg)

**THE PURPOSE OF 1x1 CONVOLUTIONS**

We already know that convolutions extract spatial patches around every tile in an input tensor and apply the same transformation to each patch. An edge case is when the patches extracted consist of a single tile. The convolution operation then becomes equivalent to running each tile vector through a Dense layer: it will compute features that mix together information from the channels of the input tensor, but it won’t mix information across space (because it’s looking at one tile at a time). Such 1 × 1 convolutions (also called pointwise convolutions) are featured in *Inception* modules, where they contribute to factoring out channel-wise feature learning and space-wise feature learning—a reasonable thing to do if we assume that each channel is highly autocorrelated across space, but different channels may not be highly correlated with each other.
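The equivalence between a 1 × 1 convolution and a Dense layer applied to each tile can be checked numerically. The following is a minimal NumPy sketch rather than Keras code. The shapes and the random weights are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

height, width, in_channels, out_channels = 4, 5, 3, 8
feature_map = rng.normal(size=(height, width, in_channels))   # one image, channels last
kernel = rng.normal(size=(in_channels, out_channels))         # weights of a 1 x 1 convolution
bias = rng.normal(size=(out_channels,))

# 1 x 1 convolution: every spatial position is multiplied by the same kernel.
conv_1x1 = np.einsum("hwc,co->hwo", feature_map, kernel) + bias

# Dense layer applied to each tile vector independently.
dense_per_tile = np.empty((height, width, out_channels))
for i in range(height):
    for j in range(width):
        dense_per_tile[i, j] = feature_map[i, j] @ kernel + bias

print(np.allclose(conv_1x1, dense_per_tile))
```

Each output position depends only on the channel vector at the same position, which is why a 1 × 1 convolution mixes channels but never mixes information across space.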

Here's how to implement the module featured in the image above using the functional API. Note this example assumes the existence of a 4D input tensor x:

In [None]:
from keras import layers

# every branch has the same stride value. This keeps all branch outputs the same size.
# padding='same' is used wherever a window is larger than 1 x 1, so that the
# spatial size of each branch depends only on the stride
branch_a = layers.Conv2D(128, 1, activation='relu', strides=2)(x)

# in this branch, striding occurs in the spatial convolution layer
branch_b = layers.Conv2D(128, 1, activation='relu')(x)
branch_b = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_b)

# striding occurs in the average pooling layer
branch_c = layers.AveragePooling2D(3, strides=2, padding='same')(x)
branch_c = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_c)

branch_d = layers.Conv2D(128, 1, activation='relu')(x)
branch_d = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_d)
branch_d = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_d)

# concatenates the branch outputs to obtain the module output
output = layers.concatenate([branch_a, branch_b, branch_c, branch_d], axis=-1)
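Concatenating along the channel axis only works if every branch ends up with the same spatial size. The sketch below applies the usual output-size arithmetic for `'valid'` and `'same'` padding along one axis. The helper `output_length` and the input size of 32 are hypothetical, introduced only for this illustration.

```python
import math


def output_length(input_length, kernel_size, stride=1, padding="valid"):
    """Spatial output size of a convolution or pooling window along one axis."""
    if padding == "same":
        return math.ceil(input_length / stride)
    if padding == "valid":
        return (input_length - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


size = 32  # hypothetical height/width of the input feature map

# Compare a strided 1 x 1 branch with a strided 3 x 3 branch under each padding mode.
for padding in ("valid", "same"):
    one_by_one = output_length(size, kernel_size=1, stride=2, padding=padding)
    three_by_three = output_length(size, kernel_size=3, stride=2, padding=padding)
    print(padding, one_by_one, three_by_three)
```

With `'valid'` padding the two branches differ by one pixel and could not be concatenated. With `'same'` padding the size depends only on the stride, so branches that share a stride line up.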

Note that the full Inception V3 architecture is available in Keras as `keras.applications.inception_v3.InceptionV3`, including weights pretrained on the ImageNet dataset. Another closely related model available as part of the Keras applications module is **Xception**. Xception, which stands for extreme inception, is a convnet architecture loosely inspired by Inception. It takes the idea of separating the learning of channel-wise and space-wise features to its logical extreme, and replaces Inception modules with depthwise separable convolutions consisting of a depthwise convolution (a spatial convolution where every input channel is handled separately) followed by a pointwise convolution (a 1 × 1 convolution)—effectively, an extreme form of an Inception module, where spatial features and channel-wise features are fully separated. Xception has roughly the same number of parameters as Inception V3, but it shows better runtime performance and higher accuracy on ImageNet as well as other large-scale datasets, due to a more efficient use of model parameters.

### Residual Connections
*Residual connections* are a common graph-like network component found in many post-2015 network architectures, including Xception. They tackle two common problems that plague any large-scale deep-learning model: vanishing gradients and representational bottlenecks. In general, adding residual connections to any model that has more than 10 layers is likely to be beneficial.

A residual connection consists of making the output of an earlier layer available as input to a later layer, effectively creating a shortcut in a sequential network. Rather than being concatenated to the later activation, the earlier output is summed with the later activation, which assumes that both activations are the same size. If they’re different sizes, we can use a linear transformation to reshape the earlier activation into the target shape (for example, a Dense layer without an activation or, for convolutional feature maps, a 1 × 1 convolution without an activation).

Here’s how to implement a residual connection in Keras when the feature-map sizes are the same, using identity residual connections. This example assumes the existence of a 4D input tensor x:

In [None]:
from keras import layers

x = ...
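# apply a transformation to x; padding='same' preserves the spatial size
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)

# add the original x back to the output features (x must already have 128 channels)
y = layers.add([y, x])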

And the following implements a residual connection when the feature-map sizes differ, using a linear residual connection (again, assuming the existence of a 4D input tensor x):

In [None]:
from keras import layers

x = ...
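y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
# downsample y by 2; padding='same' keeps the size consistent with the residual for odd inputs
y = layers.MaxPooling2D(2, strides=2, padding='same')(y)

# use a 1 x 1 convolution to linearly downsample the original x to the same shape as y
residual = layers.Conv2D(128, 1, strides=2, padding='same')(x)

# add the residual tensor back to the output features
y = layers.add([y, residual])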

**REPRESENTATIONAL BOTTLENECKS IN DEEP LEARNING**

In a Sequential model, each successive representation layer is built on top of the previous one, which means it only has access to information contained in the activation of the previous layer. If one layer is too small (for example, it has features that are too low-dimensional), then the model will be constrained by how much information can be crammed into the activations of this layer.

You can grasp this concept with a signal-processing analogy: if you have an audio-processing pipeline that consists of a series of operations, each of which takes as input the output of the previous operation, then if one operation crops your signal to a low-frequency range (for example, 0–15 kHz), the operations downstream will never be able to recover the dropped frequencies. Any loss of information is permanent. Residual connections, by reinjecting earlier information downstream, partially solve this issue for deep-learning models.

**VANISHING GRADIENTS IN DEEP LEARNING**

Backpropagation, the master algorithm used to train deep neural networks, works by propagating a feedback signal from the output loss down to earlier layers. If this feedback signal has to be propagated through a deep stack of layers, the signal may become tenuous or even be lost entirely, rendering the network untrainable. This issue is known as vanishing gradients.

This problem occurs both with deep networks and with recurrent networks over very long sequences—in both cases, a feedback signal must be propagated through a long series of operations. We’re already familiar with the solution that the LSTM layer uses to address this problem in recurrent networks: it introduces a carry track that propagates information parallel to the main processing track. Residual connections work in a similar way in feedforward deep networks, but they’re even simpler: they introduce a purely linear information carry track parallel to the main layer stack, thus helping to propagate gradients through arbitrarily deep stacks of layers.
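A toy calculation makes the effect of the linear carry track concrete. The sketch below uses scalar linear "layers" so the chain rule reduces to a product of numbers. Real networks involve matrices and nonlinearities, so this only conveys the intuition. The function name, the depth of 50, and the slope of 0.01 are all arbitrary assumptions.

```python
def gradient_through_stack(depth, local_slope, residual):
    """Derivative of the stack output w.r.t. its input for scalar linear layers.

    Plain layer:    y = local_slope * x        -> dy/dx = local_slope
    Residual layer: y = x + local_slope * x    -> dy/dx = 1 + local_slope
    """
    per_layer = (1.0 + local_slope) if residual else local_slope
    grad = 1.0
    for _ in range(depth):
        grad *= per_layer      # chain rule: multiply the local derivatives
    return grad


depth, slope = 50, 0.01  # hypothetical depth and per-layer slope
plain = gradient_through_stack(depth, slope, residual=False)
with_skip = gradient_through_stack(depth, slope, residual=True)
print(f"plain: {plain:.3e}, residual: {with_skip:.3f}")
```

Without the shortcut, the gradient shrinks by a factor of 0.01 at every layer and is effectively zero after 50 layers. With the shortcut, each layer contributes a factor close to 1, so the signal survives.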

## 7.1.5 Layer weight sharing
One more important feature of the functional API is the ability to reuse a layer instance several times. When we call a layer instance twice, instead of instantiating a new layer for each call, we reuse the same weights with every call. This allows us to build models that have shared branches: several branches that all share the same knowledge and perform the same operations. That is, they share the same representations and learn these representations simultaneously for different sets of inputs.

For example, consider a model that attempts to assess the semantic similarity between two sentences. The model has two inputs (the two sentences to compare) and outputs a score between 0 and 1, where 0 means unrelated sentences and 1 means sentences that are either identical or reformulations of each other. Such a model could be useful in many applications, including deduplicating natural-language queries in a dialog system.

In this setup, the two input sentences are interchangeable, because semantic similarity is a symmetrical relationship: the similarity of A to B is identical to the similarity of B to A. For this reason, it wouldn’t make sense to learn two independent models for processing each input sentence. Rather, we want to process both with a single LSTM layer. The representations of this LSTM layer (its weights) are learned based on both inputs simultaneously. This is what we call a Siamese LSTM model or a shared LSTM.

Here’s how to implement such a model using layer sharing (layer reuse) in the Keras functional API:

In [None]:
from keras import layers
from keras import Input
from keras.models import Model

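# instantiate a single LSTM layer, once
lstm = layers.LSTM(32)

# left branch: inputs are variable-length sequences of vectors of size 128
left_input = Input(shape=(None, 128))
left_output = lstm(left_input)

# right branch: calling the existing layer instance reuses its weights
right_input = Input(shape=(None, 128))
right_output = lstm(right_input)

# build the classifier on top of the two encoded sentences
merged = layers.concatenate([left_output, right_output], axis=-1)
predictions = layers.Dense(1, activation='sigmoid')(merged)

# instantiate, compile, and train the model; the LSTM weights are updated
# based on both inputs (left_data, right_data, and targets are assumed to exist)
model = Model([left_input, right_input], predictions)
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
model.fit([left_data, right_data], targets)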

Naturally, a layer instance may be used more than once - it can be called arbitrarily many times, reusing the same set of weights every time.
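The distinction between reusing one instance and creating two can be illustrated without Keras. In this toy sketch, `ToyDense` is a hypothetical stand-in for a layer: its weights are created once in the constructor, so every call on the same instance uses the same parameters.

```python
import random


class ToyDense:
    """A toy layer: weights are created once, in the constructor, and reused on every call."""

    def __init__(self, input_dim, units, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1, 1) for _ in range(units)]
                        for _ in range(input_dim)]

    def __call__(self, vector):
        units = len(self.weights[0])
        return [sum(vector[i] * self.weights[i][j] for i in range(len(vector)))
                for j in range(units)]

    def count_params(self):
        return len(self.weights) * len(self.weights[0])


# One shared instance called on two different inputs: a single set of weights.
shared = ToyDense(input_dim=3, units=2)
left_output = shared([1.0, 0.0, 2.0])
right_output = shared([0.5, 1.5, -1.0])

# Two separate instances: two independent sets of weights to learn.
separate_a = ToyDense(3, 2, seed=1)
separate_b = ToyDense(3, 2, seed=2)

print("shared params:  ", shared.count_params())
print("separate params:", separate_a.count_params() + separate_b.count_params())
```

The shared version has half as many parameters as the two-instance version, and any update to `shared.weights` affects both branches at once.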

## 7.1.6 Models as layers
Importantly, in the functional API, models can be used as you’d use layers—effectively, you can think of a model as a “bigger layer.” This is true of both the Sequential and Model classes. This means we can call a model on an input tensor and retrieve an output tensor:

`y = model(x)`

If the model has multiple input tensors and multiple output tensors, it should be called with a list of tensors:

`y1, y2 = model([x1, x2])`

When we call a model instance, we’re reusing the weights of the model—exactly like what happens when we call a layer instance. Calling an instance, whether it’s a layer instance or a model instance, will always reuse the existing learned representations of the instance—which is intuitive.

One simple practical example of what we can build by reusing a model instance is a vision model that uses a dual camera as its input: two parallel cameras, a few centimeters (one inch) apart. Such a model can perceive depth, which can be useful in many applications. We shouldn’t need two independent models to extract visual features from the left camera and the right camera before merging the two feeds. Such low-level processing can be shared across the two inputs: that is, done via layers that use the same weights and thus share the same representations. Here’s how to implement a Siamese vision model (shared convolutional base) in Keras:

In [25]:
from keras import layers
from keras import applications
from keras import Input

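# the base image-processing model is the Xception network (convolutional base only)
xception_base = applications.Xception(weights=None, include_top=False)

# the inputs are 250 x 250 RGB images
left_input = Input(shape=(250, 250, 3))
right_input = Input(shape=(250, 250, 3))

# call the same vision model twice, so both feeds share its weights
left_features = xception_base(left_input)
right_features = xception_base(right_input)

# the merged features contain information from both the left and right visual feeds
merged_features = layers.concatenate([left_features, right_features], axis=-1)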

## 7.1.7 Wrapping up
This concludes our introduction to the Keras functional API—an essential tool for building advanced deep neural network architectures. Now we know the following:

 - To step out of the Sequential API whenever we need anything more than a linear stack of layers
 - How to build Keras models with several inputs, several outputs, and complex internal network topology, using the Keras functional API
 - How to reuse the weights of a layer or model across different processing branches, by calling the same layer or model instance several times