# The Keras functional API

In the functional API, you directly manipulate tensors, and you use layers as _functions_ that take tensors and return tensors (hence the name _functional API_).

## Introduction to the functional API

Let’s start with a minimal example that shows side by side a simple Sequential model and its equivalent in the functional API.

In [1]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras import layers
from tensorflow.keras import Input

seq_model = Sequential()
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64,)))
seq_model.add(layers.Dense(32, activation='relu'))
seq_model.add(layers.Dense(10, activation='softmax'))

seq_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 32)                2080      
_________________________________________________________________
dense_1 (Dense)              (None, 32)                1056      
_________________________________________________________________
dense_2 (Dense)              (None, 10)                330       
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________


In [2]:
# The functional equivalent
input_tensor = Input(shape=(64,))
x = layers.Dense(32, activation='relu')(input_tensor)
x = layers.Dense(32, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(x)

model = Model(input_tensor, output_tensor) 

model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 64)]              0         
_________________________________________________________________
dense_3 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_4 (Dense)              (None, 32)                1056      
_________________________________________________________________
dense_5 (Dense)              (None, 10)                330       
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________
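Both summaries report the same parameter counts because both models contain the same three `Dense` layers. As a quick check, the counts can be reproduced by hand; `dense_params` below is a hypothetical helper written for this illustration, not a Keras function.

```python
def dense_params(input_dim, units):
    """Kernel weights (input_dim x units) plus one bias per unit."""
    return input_dim * units + units

# Input width followed by the number of units of each Dense layer
layer_sizes = [64, 32, 32, 10]

counts = [dense_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [2080, 1056, 330] 3466
```

The `InputLayer` row in the functional summary contributes no parameters, which is why the totals match.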


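The one step that may look surprising is instantiating a `Model` object from only an input tensor and an output tensor. Behind the scenes, _Keras_ retrieves every layer involved in going from `input_tensor` to `output_tensor` and gathers them into a graph-like structure. This works because `output_tensor` was obtained by repeatedly transforming `input_tensor`. If you instead pass an input and an output that aren't connected, _Keras_ raises an error telling you, in essence, that it couldn't reach `input_1` from the provided output tensor.

When it comes to compiling, training, or evaluating such an instance of `Model`, the API is the same as that of `Sequential`: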

In [5]:
import numpy as np

# Compiles the model
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# Generates dummy Numpy data to train on
x_train = np.random.random((1000, 64))
y_train = np.random.random((1000, 10))

# Trains the model for 10 epochs
model.fit(x_train, y_train, epochs=10, batch_size=128)

# Evaluates the model
score = model.evaluate(x_train, y_train)
print(score)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
117.60975646972656


## Multi-input models

The functional API can be used to build models that have multiple inputs. Typically, such models at some point merge their different input branches using a layer that can combine several tensors: by adding them, concatenating them, and so on. 

### A question-answering model

A typical question-answering model has two inputs: a natural-language question and a text snippet (such as a news article) providing information to be used for answering the question. The model must then produce an answer: in the simplest possible setup, this is a one-word answer obtained via a softmax over some predefined vocabulary:

<img src="./resources/question-answering-model.png" alt="qa-model" style="width: 300px" />

Here is the implementation with the _Keras_ functional API:

In [1]:
from tensorflow.keras.models import Model
from tensorflow.keras import layers
from tensorflow.keras import Input

text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

# The text input is a variable-length sequence of integers. Note that you can optionally name the inputs.
text_input = Input(shape=(None,), dtype='int32', name='text')
# Embeds the inputs into a sequence of vectors of size 64
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)
# Encodes the vectors in a single vector via an LSTM
encoded_text = layers.LSTM(32)(embedded_text)

# Same process (with different layer instances) for the question
question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)

# Concatenates the encoded question and encoded text
concatenated = layers.concatenate([encoded_text, encoded_question], axis=-1)
# Adds a softmax classifier on top
answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)

# At model instantiation, you specify the two inputs and the output.
model = Model([text_input, question_input], answer)

model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['acc']
)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text (InputLayer)               [(None, None)]       0                                            
__________________________________________________________________________________________________
question (InputLayer)           [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 64)     640000      text[0][0]                       
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 32)     320000      question[0][0]                   
______________________________________________________________________________________________

Now, **how do you train this two-input model?** There are two possible APIs: you can **feed the model a list of Numpy arrays as inputs**, or you can **feed it a dictionary that maps input names to Numpy arrays**. Naturally, the latter option is available only if you give names to your inputs.

In [7]:
import numpy as np

num_samples = 1000
max_length = 100

# Generates dummy Numpy data
text = np.random.randint(1, text_vocabulary_size, size=(num_samples, max_length))

question = np.random.randint(1, question_vocabulary_size, size=(num_samples, max_length))
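# Answers are one-hot encoded, not integers: draw one random class index
# per sample, then pick the matching rows of an identity matrix.
answer_indices = np.random.randint(0, answer_vocabulary_size, size=num_samples)
answers = np.eye(answer_vocabulary_size)[answer_indices]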

# Fitting using a list of inputs
model.fit([text, question], answers, epochs=10, batch_size=128)

# Fitting using a dictionary of inputs (only if inputs are named)
model.fit({'text': text, 'question': question}, answers, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1fcd77dfc10>

## Multi-output models

In the same way, you can use the functional API to build models with multiple outputs (or multiple _heads_).

### A three-output model

A simple example is a network that attempts to simultaneously predict different properties of the data, such as a network that takes as input a series of social media posts from a single anonymous person and tries to predict attributes of that person, such as age, gender, and income level:

<img src="./resources/three-outputs-model.png" alt="qa-model" style="width: 300px" />

In [8]:
vocabulary_size = 50000
num_income_groups = 10

posts_input = Input(shape=(None,), dtype='int32', name='posts')

# Embedding takes (input_dim, output_dim): the vocabulary size comes first,
# followed by the size of the embedding vectors (256 here).
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input)

x = layers.Conv1D(128, 5, activation='relu')(embedded_posts)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation='relu')(x)

# Note that the output layers are given names
age_prediction = layers.Dense(1, name='age')(x)
income_prediction = layers.Dense(num_income_groups, activation='softmax', name='income')(x)
gender_prediction = layers.Dense(1, activation='sigmoid', name='gender')(x)

model = Model(posts_input, [age_prediction, income_prediction, gender_prediction])
model.summary()

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
posts (InputLayer)              [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 256)    12800000    posts[0][0]                      
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, None, 128)    163968      embedding_2[0][0]                
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D)    (None, None, 128)    0           conv1d[0][0]                     
____________________________________________________________________________________________

#### Compile with multiple losses

Importantly, training such a model requires the ability to **specify different loss functions for different heads of the network**: for instance, age prediction is a scalar regression task, but gender prediction is a binary classification task, requiring a different training procedure. But because gradient descent requires you to minimize a _scalar_, you must **combine these losses into a single value** in order to train the model. The simplest way to combine different losses is to sum them all. In _Keras_, you can pass either a list or a dictionary of losses to `compile` to specify a different loss for each output; **the resulting loss values are summed into a global loss, which is minimized during training**.

In [9]:
model.compile(
    optimizer='rmsprop', 
    loss=['mse', 'categorical_crossentropy', 'binary_crossentropy']
)

# Equivalent (possible only if you give names to the output layers)
model.compile(
    optimizer='rmsprop', 
    loss={
        'age': 'mse',
        'income': 'categorical_crossentropy',
        'gender': 'binary_crossentropy'
    }
)

#### Loss weighting

Note that very **imbalanced loss contributions will cause the model representations to be optimized preferentially for the task with the largest individual loss**, at the expense of the other tasks. To remedy this, you can assign **different levels of importance** to the loss values in their contribution to the final loss. This is useful in particular if the losses' values use different scales.

In [9]:
model.compile(
    optimizer='rmsprop',
    loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'], 
    loss_weights=[0.25, 1., 10.]
)

# Equivalent (possible only if you give names to the output layers)
model.compile(
    optimizer='rmsprop', 
    loss={
        'age': 'mse',
        'income': 'categorical_crossentropy',
        'gender': 'binary_crossentropy'
    },
    loss_weights={
        'age': 0.25,
        'income': 1.,
        'gender': 10.
    }
)
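To make the effect of `loss_weights` concrete, here is a minimal sketch that computes the three per-head losses by hand for a single sample and combines them with and without weights. The targets and predictions are made-up values chosen for illustration, and the helper functions are simplified stand-ins for the Keras losses rather than their actual implementations.

```python
import math

def mse(y_true, y_pred):
    return (y_true - y_pred) ** 2

def categorical_crossentropy(y_true, y_pred):
    # y_true is one-hot, y_pred is a probability distribution
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def binary_crossentropy(y_true, y_pred):
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# Hypothetical targets and predictions for one sample
losses = {
    'age': mse(30.0, 28.0),
    'income': categorical_crossentropy([0, 1, 0], [0.2, 0.5, 0.3]),
    'gender': binary_crossentropy(1.0, 0.8),
}
loss_weights = {'age': 0.25, 'income': 1.0, 'gender': 10.0}

# Plain sum: the age loss dominates the total
unweighted_total = sum(losses.values())
# Weighted sum: each head's loss is scaled before being added
weighted_total = sum(loss_weights[name] * value for name, value in losses.items())

print(round(unweighted_total, 3), round(weighted_total, 3))  # 4.916 3.925
```

In this example the age loss makes up roughly 80% of the unweighted total but only about a quarter of the weighted one, which is the rebalancing the weights are meant to achieve.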

#### Feeding data to a multi-output model

Much as in the case of multi-input models, you can pass Numpy data to the model for training either via a list of arrays or via a dictionary of arrays.

```python
# age_targets, income_targets, and gender_targets are assumed to be Numpy arrays.
model.fit(
    posts, 
    [age_targets, income_targets, gender_targets], 
    epochs=10, 
    batch_size=64
)

# Equivalent (possible only if you give names to the output layers)
model.fit(
    posts, 
    {'age': age_targets, 'income': income_targets, 'gender': gender_targets},
    epochs=10, 
    batch_size=64
)
```
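The snippet above leaves `posts` and the three target arrays undefined. As a rough sketch, dummy arrays with suitable shapes could be generated as follows; the sample count and sequence length are assumptions made for illustration (the sequences only need to be long enough to survive the convolution and pooling layers).

```python
import numpy as np

num_samples = 1000       # hypothetical dataset size
max_length = 500         # hypothetical post length, in tokens
vocabulary_size = 50000
num_income_groups = 10

rng = np.random.default_rng(0)

# Integer word indices, one row per sample
posts = rng.integers(1, vocabulary_size, size=(num_samples, max_length))

# Age: one real-valued target per sample (regression head)
age_targets = rng.uniform(18, 80, size=(num_samples, 1))

# Income: one-hot rows over the income groups (softmax head)
income_indices = rng.integers(0, num_income_groups, size=num_samples)
income_targets = np.eye(num_income_groups)[income_indices]

# Gender: a single 0/1 label per sample (sigmoid head)
gender_targets = rng.integers(0, 2, size=(num_samples, 1)).astype('float32')
```

Each target array has one row per sample, and its second dimension matches the number of units of the corresponding output layer.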

## Directed acyclic graphs of layers

With the functional API, not only can you build models with multiple inputs and multiple outputs, but you can also implement networks with a complex internal topology. Neural networks in _Keras_ are allowed to be arbitrary _directed acyclic graphs_ of layers. The qualifier _acyclic_ is important: **these graphs can’t have cycles**. It’s impossible for a tensor x to become the input of one of the layers that generated x. The only processing loops that are allowed (that is, recurrent connections) are those internal to recurrent layers.
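The acyclicity requirement can be illustrated without Keras at all. The sketch below describes a small layer graph as a dictionary of dependencies and uses Python's standard `graphlib` module (Python 3.9+) to order it; the layer names are hypothetical, and this is only an analogy for the constraint, not how Keras represents models internally.

```python
from graphlib import TopologicalSorter, CycleError

# Each key lists the nodes it takes as inputs.
layer_graph = {
    'conv_1': {'input'},
    'conv_2': {'conv_1'},
    'add': {'conv_2', 'input'},  # a shortcut from the input is still acyclic
}
order = list(TopologicalSorter(layer_graph).static_order())

# Feeding the output of 'add' back into 'conv_1' would create a cycle.
with_cycle = dict(layer_graph, conv_1={'input', 'add'})
try:
    list(TopologicalSorter(with_cycle).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True

print(order[0], order[-1], has_cycle)  # input add True
```

A graph with a valid ordering can be evaluated layer by layer from inputs to outputs; once a cycle is introduced, no such ordering exists.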

### Inception modules

_Inception_ is a popular type of network architecture for convolutional neural networks; it was developed by _Christian Szegedy_ and his colleagues at Google in 2013–2014, inspired by the earlier network-in-network architecture. It consists of a stack of modules that themselves look like small independent networks, split into several parallel branches. Here's an example, taken from Inception V3:

<img src="./resources/inception-module.png" alt="qa-model" style="width: 400px" />

In [None]:
# This example assumes the existence of a 4D input tensor `x`.
# Every branch has the same stride value (2), and every 3 × 3 convolution
# and pooling layer uses padding='same', which is necessary to keep all
# branch outputs the same size so you can concatenate them.
branch_a = layers.Conv2D(128, 1, activation='relu', strides=2)(x)

# In this branch, the striding occurs in the spatial convolution layer.
branch_b = layers.Conv2D(128, 1, activation='relu')(x)
branch_b = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_b)

# In this branch, the striding occurs in the average pooling layer.
branch_c = layers.AveragePooling2D(3, strides=2, padding='same')(x)
branch_c = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_c)

# In this branch, the striding occurs in the last spatial convolution layer.
branch_d = layers.Conv2D(128, 1, activation='relu')(x)
branch_d = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_d)
branch_d = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_d)

# Concatenates the branch outputs to obtain the module output
output = layers.concatenate([branch_a, branch_b, branch_c, branch_d], axis=-1)
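Concatenating branches along the channel axis only works when their spatial dimensions agree. The sketch below applies the output-size rules for convolution and pooling layers (`'valid'`: floor((n - k) / s) + 1; `'same'`: ceil(n / s)) to a 1 × 1 branch and a 3 × 3 branch. The helper `out_size` and the 32-pixel input are assumptions made for illustration.

```python
import math

def out_size(n, kernel, stride=1, padding='valid'):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    if padding == 'same':
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1

n = 32  # hypothetical input height and width

# A 1 x 1 convolution with stride 2
branch_a = out_size(n, kernel=1, stride=2)

# A 1 x 1 convolution followed by a strided 3 x 3 convolution
valid_b = out_size(out_size(n, kernel=1), kernel=3, stride=2, padding='valid')
same_b = out_size(out_size(n, kernel=1), kernel=3, stride=2, padding='same')

print(branch_a, valid_b, same_b)  # 16 15 16
```

With `'valid'` padding, the 3 × 3 branch comes out one pixel smaller than the 1 × 1 branch and the two can't be concatenated; with `'same'` padding, both reduce the input by exactly the stride.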

### Residual connections

_Residual connections_ are a common graph-like network component found in many post-2015 network architectures, including _Xception_.
A residual connection consists of **making the output of an earlier layer available as input to a later layer**, effectively creating a shortcut in a sequential network. Rather than being concatenated to the later activation, **the earlier output is summed with the later activation**, which **assumes that both activations are the same size**. If they’re different sizes, you can use a linear transformation to reshape the earlier activation into the target shape (for example, a `Dense` layer without an activation or, for convolutional feature maps, a 1 × 1 convolution without an activation).

In [None]:
# Here’s how to implement a residual connection in Keras when the feature-map sizes are the same, using identity residual connections. This example assumes the existence of a 4D input tensor `x` (with 128 channels, so that its shape matches `y`).

from tensorflow.keras import layers

# Applies a transformation to x
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)

# Adds the original x back to the output features
y = layers.add([y, x])

In [None]:
# The following implements a residual connection when the feature-map sizes differ, using a linear residual connection (again, assuming the existence of a 4D input tensor `x`)

from tensorflow.keras import layers

y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.MaxPooling2D(2, strides=2)(y)

# Uses a 1 × 1 convolution to linearly downsample 
# the original x tensor to the same shape as y
residual = layers.Conv2D(128, 1, strides=2, padding='same')(x)

# Adds the residual tensor back to the output features
y = layers.add([y, residual])
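A 1 × 1 convolution is simply the same linear map applied to the channel vector of every pixel, and a stride of 2 keeps every second pixel. The NumPy sketch below mimics that operation to show how the shortcut ends up with the same shape as the main branch. The shapes are assumptions (a 32 × 32 input with 64 channels), the bias term is left out for brevity, and random arrays stand in for real activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: (batch, height, width, channels)
x = rng.normal(size=(1, 32, 32, 64))

# Stand-in for the main branch after 2 x 2 max pooling:
# half the spatial size of x, and 128 channels instead of 64.
y = rng.normal(size=(1, 16, 16, 128))

# 1 x 1 convolution with stride 2: subsample the pixels, then apply
# one (64 -> 128) linear map to each remaining pixel.
kernel = rng.normal(size=(64, 128))
residual = x[:, ::2, ::2, :] @ kernel

# The shapes now match, so the two tensors can be summed elementwise.
out = y + residual
print(residual.shape, out.shape)  # (1, 16, 16, 128) (1, 16, 16, 128)
```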

## Layer weight sharing

One more important feature of the functional API is the **ability to reuse a layer instance several times**. When you call a layer instance twice, instead of instantiating a new layer for each call, you reuse the same weights with every call. This allows you to build models that have branches that share the same knowledge and perform the same operations. That is, they share the same representations and learn these representations simultaneously for different sets of inputs.

For example, consider a model that attempts to assess the semantic similarity between two sentences. The model has two inputs (the two sentences to compare) and outputs a score between 0 and 1, where 0 means unrelated sentences and 1 means sentences that are either identical or reformulations of each other.

In this setup, the two input sentences are interchangeable, because semantic similarity is a symmetrical relationship: the similarity of A to B is identical to the similarity of B to A. For this reason, it wouldn’t make sense to learn two independent models for processing each input sentence. Rather, you want to process both with a single `LSTM` layer. The representations of this `LSTM` layer (its weights) are learned based on both inputs simultaneously. This is what we call a _Siamese LSTM model_ or a _shared LSTM_.

In [None]:
# Instantiates a single LSTM layer, once
lstm = layers.LSTM(32)

# Building the left branch of the model: 
# inputs are variable-length sequences of vectors of size 128.
left_input = Input(shape=(None, 128))
left_output = lstm(left_input)

# Building the right branch of the model: 
# when you call an existing layer instance, you reuse its weights.
right_input = Input(shape=(None, 128))
right_output = lstm(right_input)

# Builds the classifier on top
merged = layers.concatenate([left_output, right_output], axis=-1) 
predictions = layers.Dense(1, activation='sigmoid')(merged)

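# Instantiating and training the model: when you train such a model, 
# the weights of the LSTM layer are updated based on both inputs.
# left_data, right_data, and targets are assumed to be Numpy arrays.
model = Model([left_input, right_input], predictions)
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
model.fit([left_data, right_data], targets)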
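The effect of sharing one layer instance can be mimicked with a toy encoder in NumPy. `ToyEncoder` below is a hypothetical stand-in for a layer (a single matrix multiplication, nothing like a real `LSTM`); it only illustrates that the weights belong to the instance, so calling the same instance twice applies the same transformation to both inputs.

```python
import numpy as np

class ToyEncoder:
    """Toy stand-in for a layer: the weights are created once, in __init__."""

    def __init__(self, input_dim, units, seed=0):
        self.kernel = np.random.default_rng(seed).normal(size=(input_dim, units))

    def __call__(self, inputs):
        # Every call uses the same self.kernel
        return np.tanh(inputs @ self.kernel)

rng = np.random.default_rng(1)
left = rng.normal(size=(4, 128))
right = rng.normal(size=(4, 128))

# One shared instance: swapping the inputs simply swaps the outputs.
shared = ToyEncoder(128, 32)
left_out, right_out = shared(left), shared(right)
symmetric = np.allclose(left_out, shared(left)) and np.allclose(right_out, shared(right))

# Two separate instances hold different weights, so the same input is
# encoded differently depending on which branch it goes through.
other = ToyEncoder(128, 32, seed=2)
branches_agree = np.allclose(shared(left), other(left))

print(symmetric, branches_agree)  # True False
```

With two independent encoders, the similarity score could depend on which sentence is fed to which branch, which is exactly what sharing avoids.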

## Models as layers

Importantly, in the functional API, models can be used as you’d use layers—effectively, you can think of a model as a “bigger layer.” This is true of both the `Sequential` and `Model` classes.
When you call a model instance, **you’re reusing the weights of the model, exactly like what happens when you call a layer instance**. Calling an instance, whether it’s a layer instance or a model instance, will always **reuse the existing learned representations of the instance**, which is intuitive.

One simple practical example of what you can build by reusing a model instance is a vision model that uses a dual camera as its input: two parallel cameras, a few centimeters (one inch) apart. 
Such a model can perceive depth, which can be useful in many applications. 

**You shouldn’t need two independent models** to extract visual features from the left camera and the right camera before merging the two feeds. Such low-level processing can be shared across the two inputs: that is, done via layers that use the same weights and thus share the same representations. Here's how you'd implement a Siamese vision model (shared convolutional base) in _Keras_:

In [10]:
from tensorflow.keras import layers, Model
from tensorflow.keras import applications
from tensorflow.keras import Input

# The base image-processing model is the Xception network (convolutional base only).
xception_base = applications.Xception(weights=None, include_top=False)

# The inputs are 250 × 250 RGB images.
left_input = Input(shape=(250, 250, 3))
right_input = Input(shape=(250, 250, 3))

# Calls the same vision model twice
left_features = xception_base(left_input)
right_features = xception_base(right_input)

# The merged features contain information from the right visual feed and the left visual feed.
merged_features = layers.concatenate([left_features, right_features], axis=-1)