until now, all neural networks introduced have been implemented using the 'Sequential' model; the Sequential model assumes that the network has exactly one input and exactly one output, and that it consists of a linear stack of layers; the figures below contrast this with multi-input, multi-output, and graph-like models

![A Sequential Model: A Linear Stack of Layers](./seq_model.png)

![A Multi-input Model](./mul_input.png)

![A Multi-output (or Multihead) Model](./mul_output.png)

![Graph-Like Model](./graph_like_model.png)

### Introduction to the Functional API

in the functional API, you directly manipulate tensors, and use layers as functions that take tensors and return tensors

In [1]:
from keras import Input, layers

input_tensor = Input(shape=(32,)) #  a tensor

dense = layers.Dense(32, activation='relu') # a layer function

output_tensor = dense(input_tensor) # a layer may be called on a tensor, and it returns a tensor

a minimal example that shows side by side a simple Sequential model and its equivalent in the functional API:

In [7]:
from keras.models import Sequential, Model
from keras import layers
from keras import Input

# sequential model
seq_model = Sequential()
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64,)))
seq_model.add(layers.Dense(32, activation='relu'))
seq_model.add(layers.Dense(10, activation='softmax'))

# its functional equivalent
input_tensor = Input(shape=(64,))
x = layers.Dense(32, activation='relu')(input_tensor)
x = layers.Dense(32, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(x)
model = Model(input_tensor, output_tensor) # the Model class turns an input tensor and an output tensor into a model

model.summary()
seq_model.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 64)]              0         
                                                                 
 dense_16 (Dense)            (None, 32)                2080      
                                                                 
 dense_17 (Dense)            (None, 32)                1056      
                                                                 
 dense_18 (Dense)            (None, 10)                330       
                                                                 
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_13 (Dense)            (None,
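
the parameter counts in the summary can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. a minimal sketch of that arithmetic (the helper name `dense_params` is hypothetical, not part of keras):

```python
# Parameter count of a Dense layer: input_dim * units weights plus units biases.
def dense_params(input_dim, units):
    return input_dim * units + units

# (input_dim, units) for the three Dense layers of the model above
layer_sizes = [(64, 32), (32, 32), (32, 10)]

counts = [dense_params(i, u) for i, u in layer_sizes]
total = sum(counts)

print(counts)  # [2080, 1056, 330]
print(total)   # 3466
```

both models stack the same three layers, so the Sequential version is expected to report the same totals.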

the API is the same as that of Sequential when it comes to compiling, training, or evaluating such an instance of Model:

In [None]:
import numpy as np
x_train = np.random.random((1000, 64))
y_train = np.random.random((1000, 10))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x_train, y_train, epochs=10, batch_size=128)
score = model.evaluate(x_train, y_train)

seq_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
seq_model.fit(x_train, y_train, epochs=10, batch_size=128)
seq_score = seq_model.evaluate(x_train, y_train)

print(score, seq_score)

### Multi-Input Models

![A Question-Answering Model](./qa_model.png)

In [1]:
# functional API implementation of a two-input question-answering model
from keras.models import Model
from keras import layers 
from keras import Input

text_vocabulary_size = 10000
questioni_vocabulary_size = 10000
answer_vocabulary_size = 500

text_input = Input(shape=(None,), dtype='int32', name='text') # a variable-length sequence of integers

# embeds the inputs into a sequence of vectors of size 64; Embedding takes (input_dim, output_dim), so the vocabulary size comes first
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)

# encodes the vectors in a single vector via an LSTM
encoded_text = layers.LSTM(32)(embedded_text)

# same process (with different layer instances) for the question
question_input = Input(shape=(None,),
                       dtype='int32', 
                       name='question')

embedded_question = layers.Embedding(questioni_vocabulary_size, 32)(question_input) # Embedding takes (input_dim, output_dim)
encoded_question = layers.LSTM(16)(embedded_question)

# concatenates the encoded text and the encoded question
concatenated = layers.concatenate([encoded_text, encoded_question], axis=-1)

# add a softmax classifier on top
answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)

# at model instantiation, specify the two inputs and the output
model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

In [4]:
# feeding data to a multi-input model
import numpy as np

num_samples = 1000
max_length = 100

# generate dummy numpy data
text = np.random.randint(1, text_vocabulary_size, size=(num_samples, max_length))

question = np.random.randint(1, questioni_vocabulary_size, size=(num_samples, max_length))
answers = np.eye(answer_vocabulary_size)[np.random.randint(0, answer_vocabulary_size, size=num_samples)] # answers are one-hot encoded, not integers: one random class index per sample, expanded with np.eye

model.fit([text, question], answers, epochs=10, batch_size=128) # fitting using a list of inputs
model.fit({'text':text, 'question':question}, answers, epochs=10, batch_size=128) # fitting using a dictionary of inputs (only if inputs are named)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x28fb11282b0>
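
the softmax output trained with 'categorical_crossentropy' expects one-hot targets, that is, exactly one 1 per row. a minimal NumPy sketch of building such dummy targets; note that `np.random.randint` excludes its upper bound, so a call like `randint(0, 1, ...)` can only return zeros (the generator seed and variable names below are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, only for reproducibility
num_samples = 1000
answer_vocabulary_size = 500

# the upper bound is exclusive: randint(0, 1, ...) contains nothing but zeros
all_zeros = np.random.randint(0, 1, size=(4, 5))

# draw one class index per sample, then set that column to 1 in each row
answer_ids = rng.integers(0, answer_vocabulary_size, size=num_samples)
one_hot_answers = np.zeros((num_samples, answer_vocabulary_size), dtype='float32')
one_hot_answers[np.arange(num_samples), answer_ids] = 1.0

print(all_zeros.sum())                # 0
print(one_hot_answers.shape)          # (1000, 500)
print(one_hot_answers.sum(axis=1)[:3])  # [1. 1. 1.]
```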

![A Social Media Model with Three Heads](./sm_model_3.png)

In [5]:
# functional API implementation of a three-output model
from keras import layers
from keras import Input
from keras.models import Model

vocabulary_size = 50000
num_income_groups = 10

posts_input = Input(shape=(None,), dtype='int32', name='post')
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input) # Embedding takes (input_dim, output_dim)
x = layers.Conv1D(128, 5, activation='relu')(embedded_posts)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(128, activation='relu')(x)

age_prediction = layers.Dense(1, name='age')(x) # note that the output layers are given names
income_prediction = layers.Dense(num_income_groups, activation='softmax', name='income')(x)
gender_prediction = layers.Dense(1, activation='sigmoid', name='gender')(x)
model = Model(posts_input, [age_prediction, income_prediction, gender_prediction])

In [6]:
# compilation options of a multi-output model: multiple losses
model.compile(optimizer='rmsprop',
              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'])

model.compile(optimizer='rmsprop',
              loss={'age': 'mse', # equivalent (possible only if you give names to the output layers)
                    'income': 'categorical_crossentropy',
                    'gender': 'binary_crossentropy'})

In [7]:
# compilation options of a multi-output model: loss weighting
model.compile(optimizer='rmsprop', 
              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'],
              loss_weights=[0.25, 1., 10.])

model.compile(optimizer='rmsprop', 
              loss={'age': 'mse', # equivalent (possible only if you give names to the output layers)
                    'income': 'categorical_crossentropy',
                    'gender': 'binary_crossentropy'},
              loss_weights={'age': 0.25,
                            'income':1.,
                            'gender':10.})
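
with several outputs, the per-output losses are combined into one scalar that gradient descent minimizes; with `loss_weights`, that scalar is the weighted sum of the individual losses. a minimal sketch of the arithmetic, using the weights from the cell above and hypothetical loss values (they are not taken from a real training run):

```python
# hypothetical per-output loss values for one batch (illustration only)
losses = {'age': 4.0, 'income': 2.3, 'gender': 0.7}

# the weights passed to compile() above
loss_weights = {'age': 0.25, 'income': 1.0, 'gender': 10.0}

# contribution of each head to the total, and the total itself
contributions = {name: loss_weights[name] * losses[name] for name in losses}
total_loss = sum(contributions.values())

print(contributions)         # age contributes 1.0, income 2.3, gender about 7.0
print(round(total_loss, 4))  # 10.3
```

the weights rescale losses that live on different scales, so that no single head dominates the shared representation.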

In [None]:
# feeding data to a multi-output model
model.fit(posts, [age_targets, income_targets, gender_targets], # posts and the targets are assumed to be NumPy arrays
          epochs=10, batch_size=64)

model.fit(posts, {'age': age_targets, # equivalent (possible only if you give names to the output layers)
                  'income': income_targets,
                  'gender': gender_targets},
          epochs=10, batch_size=64)
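
the fit calls above assume that `posts` and the three target arrays already exist. a minimal sketch of dummy NumPy data with matching shapes; the dataset size, post length, and age range are hypothetical, and the helper `remaining_timesteps` only mirrors the Conv1D/MaxPooling1D stack above under the default 'valid' padding, to check that a post is long enough to survive it:

```python
import numpy as np

rng = np.random.default_rng(0)

num_samples = 256      # hypothetical dataset size
post_length = 500      # hypothetical number of tokens per post
vocabulary_size = 50000
num_income_groups = 10

def remaining_timesteps(length):
    """Timesteps left after the conv/pool stack (kernel size 5, pool size 5, 'valid' padding)."""
    for op in ['conv', 'pool', 'conv', 'conv', 'pool', 'conv', 'conv']:
        length = length - 4 if op == 'conv' else length // 5
    return length

posts = rng.integers(1, vocabulary_size, size=(num_samples, post_length))
age_targets = rng.uniform(18, 80, size=(num_samples, 1))           # regression target
income_ids = rng.integers(0, num_income_groups, size=num_samples)
income_targets = np.eye(num_income_groups)[income_ids]             # one-hot, for categorical_crossentropy
gender_targets = rng.integers(0, 2, size=(num_samples, 1))         # 0 or 1, for binary_crossentropy

print(remaining_timesteps(post_length))  # 10
print(remaining_timesteps(268))          # 0: too short for this stack
print(posts.shape, age_targets.shape, income_targets.shape, gender_targets.shape)
```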

### Directed Acyclic Graphs of Layers

- with the functional API, you can also implement networks with a complex internal topology; neural networks in keras are allowed to be arbitrary directed acyclic graphs of layers
- the qualifier acyclic is important: these graphs can't have cycles, so it's impossible for a tensor x to become the input of one of the layers that generated x
- the only processing loops that are allowed (recurrent connections) are those internal to recurrent layers
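
the acyclic constraint means the layers can always be put in an order where each one runs after the layers it depends on. a minimal sketch of that idea using the standard-library `graphlib` module (Python 3.9+); the layer names and the dictionary representation are hypothetical and only illustrate the concept, they are not how keras stores a model:

```python
from graphlib import TopologicalSorter, CycleError

# hypothetical layer graph: each key lists the layers whose outputs it consumes
layer_inputs = {
    'conv_a': ['input'],
    'conv_b': ['input'],
    'concat': ['conv_a', 'conv_b'],
    'dense': ['concat'],
}

# a valid execution order exists because the graph has no cycle
order = list(TopologicalSorter(layer_inputs).static_order())
print(order)

# feeding the output of 'dense' back into 'conv_a' creates a cycle
cyclic = dict(layer_inputs, conv_a=['input', 'dense'])
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True

print(has_cycle)  # True
```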

two notable components of such graph-like networks are Inception modules and residual connections

##### Inception Modules

Inception is a popular type of network architecture for convolutional neural networks; it consists of a stack of modules that themselves look like small independent networks, split into several parallel branches

![An Inception Module](./inception_module.png)

##### The Purpose of 1 x 1 Convolutions

convolutions extract spatial patches around every tile in an input tensor and apply the same transformation to each patch; an edge case is when the patches extracted consist of a single tile: the convolution operation then becomes equivalent to running each tile vector through a Dense layer; it will compute features that mix together information from the channels of the input tensor, but it won't mix information across space (because it's looking at one tile at a time)

such 1 x 1 convolutions (also called pointwise convolutions) are featured in Inception modules, where they contribute to factoring out channel-wise feature learning and space-wise feature learning, a reasonable thing to do if you assume that each channel is highly autocorrelated across space, but different channels may not be highly correlated with each other
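
the equivalence between a 1 x 1 convolution and a Dense layer applied to every tile can be checked numerically. a minimal NumPy sketch, with hypothetical shapes and random weights; it writes the 1 x 1 convolution as a channel-mixing einsum rather than calling keras:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 4, 3))  # (batch, height, width, channels)
w = rng.normal(size=(3, 5))        # mixes 3 input channels into 5 output channels
b = rng.normal(size=(5,))

# 1 x 1 convolution: at every spatial position, mix the channels with w
conv_1x1 = np.einsum('bhwc,cf->bhwf', x, w) + b

# Dense layer applied to each tile vector independently
tiles = x.reshape(-1, 3)
dense = (tiles @ w + b).reshape(2, 4, 4, 5)

print(np.allclose(conv_1x1, dense))  # True
```

no output position depends on any other spatial position of x, which is the "won't mix information across space" property.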

the following example assumes the existence of a 4D input tensor x:

In [None]:
from keras import layers

# every branch has the same stride value (2), which is necessary to keep all branch outputs the same size so you can concatenate them;
# padding='same' is added on the strided, pooling, and 3 x 3 layers, because with the default 'valid' padding each branch would shrink by a different amount
branch_a = layers.Conv2D(128, 1,
                         activation='relu', strides=2, padding='same')(x)

# the striding occurs in the spatial convolution layer in branch_b
branch_b = layers.Conv2D(128, 1, activation='relu')(x)
branch_b = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_b)

# the striding occurs in the average pooling layer
branch_c = layers.AveragePooling2D(3, strides=2, padding='same')(x)
branch_c = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_c)

branch_d = layers.Conv2D(128, 1, activation='relu')(x)
branch_d = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_d)
branch_d = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_d)

# concatenates the branch outputs to obtain the module output
output = layers.concatenate([branch_a, branch_b, branch_c, branch_d], axis=-1)
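
concatenating along the channel axis only works when every branch ends with the same height and width, and that depends on kernel size and padding as well as on stride. a minimal sketch using the standard output-size formulas for 'valid' and 'same' padding, applied to the first two branches (the helper name and the 32-pixel input size are hypothetical):

```python
import math

def conv_output_size(n, kernel, stride=1, padding='valid'):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    if padding == 'same':
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1

n = 32  # hypothetical input height and width

# branch_a: 1 x 1 convolution with stride 2
valid_a = conv_output_size(n, 1, stride=2)
same_a = conv_output_size(n, 1, stride=2, padding='same')

# branch_b: 1 x 1 convolution, then 3 x 3 convolution with stride 2
valid_b = conv_output_size(conv_output_size(n, 1), 3, stride=2)
same_b = conv_output_size(conv_output_size(n, 1, padding='same'), 3, stride=2, padding='same')

print(valid_a, valid_b)  # 16 15: sizes differ, concatenation would fail
print(same_a, same_b)    # 16 16: sizes match
```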

- the full Inception V3 architecture is available in keras as 'keras.applications.inception_v3.InceptionV3', including weights pretrained on the ImageNet dataset
- Xception (extreme inception) is a convnet architecture loosely inspired by Inception; it takes the idea of separating the learning of channel-wise and space-wise features to its logical extreme, and replaces Inception modules with depthwise separable convolutions, each consisting of a depthwise convolution (a spatial convolution where every input channel is handled separately) followed by a pointwise convolution (a 1 x 1 convolution); this is effectively an extreme form of an Inception module, where spatial features and channel-wise features are fully separated
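
one practical consequence of separating the spatial and channel-wise steps is a much smaller parameter count. a minimal sketch of the arithmetic for a 3 x 3 layer with 128 input and 128 output channels; the helper names are hypothetical, and the separable count assumes one depthwise filter per input channel with a bias only on the pointwise step:

```python
def conv2d_params(k, c_in, c_out):
    # every output channel has its own k x k filter over all input channels, plus a bias
    return k * k * c_in * c_out + c_out

def separable_conv2d_params(k, c_in, c_out):
    depthwise = k * k * c_in          # one k x k spatial filter per input channel
    pointwise = c_in * c_out + c_out  # 1 x 1 convolution mixing channels, plus a bias
    return depthwise + pointwise

regular = conv2d_params(3, 128, 128)
separable = separable_conv2d_params(3, 128, 128)

print(regular)    # 147584
print(separable)  # 17664
```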

##### Residual Connections

residual connections are a common graph-like network component; they tackle two common problems that plague any large-scale deep-learning model: vanishing gradients and representational bottlenecks; in general, adding residual connections to any model that has more than 10 layers is likely to be beneficial

a residual connection consists of making the output of an earlier layer available as input to a later layer, effectively creating a shortcut in a sequential network; rather than being concatenated to the later activation, the earlier output is summed with the later activation, which assumes that both activations are the same size

the following example implements a residual connection in keras when the feature-map sizes are the same, using an identity residual connection, assuming the existence of a 4D input tensor x:

In [None]:
from keras import layers

x = ...
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x) # applies a transformation to x
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)

y = layers.add([y, x]) # adds the original x back to the output features

the following example implements a residual connection when the feature-map sizes differ, using a linear residual connection, assuming the existence of a 4D input tensor x:

In [None]:
from keras import layers

x = ...
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.MaxPooling2D(2, strides=2)(y)

residual = layers.Conv2D(128, 1, strides=2, padding='same')(x) # use a 1 x 1 convolution to linearly downsample the original x tensor to the same shape as y

y = layers.add([y, residual]) # adds the residual tensor back to the output features

##### Representation Bottlenecks in Deep Learning

in a sequential model, each successive representation layer is built on top of the previous one, which means it only has access to information contained in the activation of the previous layer; if one layer is too small, then the model will be constrained by how much information can be crammed into the activations of that layer

you can grasp this concept with a signal-processing analogy: if you have an audio processing pipeline that consists of a series of operations, and one operation crops your signal to a low-frequency range, the operations downstream will never be able to recover the dropped frequencies; any loss of information is permanent

residual connections, by reinjecting earlier information downstream, partially solve this for deep-learning models
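
the bottleneck effect can be made concrete with linear algebra: a layer that keeps only 2 of 8 features produces a rank-2 output no matter what comes after it, while adding the original input back restores full rank. a minimal NumPy sketch with a deliberately simple, hypothetical bottleneck (it just drops 6 features; a trained layer would use learned weights):

```python
import numpy as np

x = np.eye(8)              # 8 samples that together span all 8 feature dimensions
w_down = np.eye(8)[:, :2]  # hypothetical bottleneck layer: keeps only the first 2 features
w_up = w_down.T            # projects back up to 8 features

bottlenecked = x @ w_down @ w_up  # the 6 dropped features cannot be recovered downstream
with_residual = bottlenecked + x  # the residual connection reinjects the original x

rank_plain = np.linalg.matrix_rank(bottlenecked)
rank_residual = np.linalg.matrix_rank(with_residual)

print(rank_plain)     # 2
print(rank_residual)  # 8
```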

##### Vanishing Gradient in Deep Learning

backpropagation, the master algorithm used to train deep neural networks, works by propagating a feedback signal from the output loss down to earlier layers; if this feedback signal has to be propagated through a deep stack of layers, the signal may become tenuous or even be lost entirely, rendering the network untrainable; this issue is known as vanishing gradients

this problem occurs both with deep networks and with recurrent networks over very long sequences; in both cases, a feedback signal must be propagated through a long series of operations; residual connections work in feedforward deep networks in a way similar to the carry track of an LSTM, but simpler: they introduce a purely linear information carry track parallel to the main layer stack, thus helping to propagate gradients through arbitrarily deep stacks of layers
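
a scalar toy model shows why the linear carry track helps: by the chain rule, the gradient through a stack is the product of the per-layer derivatives, and a residual layer y = x + f(x) has derivative 1 + f'(x) instead of f'(x). a minimal sketch with hypothetical per-layer derivatives of magnitude 0.1 (real networks have matrix-valued Jacobians, so this is only an illustration):

```python
depth = 30

# hypothetical local derivatives f'(x) of each layer, alternating in sign
local_grads = [0.1 if i % 2 == 0 else -0.1 for i in range(depth)]

plain_grad = 1.0
residual_grad = 1.0
for g in local_grads:
    plain_grad *= g           # plain layer:    y = f(x)      so dy/dx = f'(x)
    residual_grad *= 1.0 + g  # residual layer: y = x + f(x)  so dy/dx = 1 + f'(x)

print(abs(plain_grad))  # about 1e-30: the signal has vanished
print(residual_grad)    # about 0.86: the signal survives
```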

### Layer Weight Sharing

when you call a layer instance twice, instead of instantiating a new layer for each call, you reuse the same weights with every call; this allows you to build models that have shared branches: several branches that all share the same knowledge and perform the same operations; that is, they share the same representations and learn these representations simultaneously for different sets of inputs

the following example implements an LSTM model using layer sharing (layer reuse) in the keras functional API:

In [None]:
from keras import layers
from keras import Input 
from keras.models import Model

lstm = layers.LSTM(32) # instantiates a single LSTM layer once

left_input = Input(shape=(None, 128)) # left branch of the model
left_output = lstm(left_input)

right_input = Input(shape=(None, 128)) # right branch of the model: when you call an existing layer instance, you reuse its weights
right_output = lstm(right_input)

merged = layers.concatenate([left_output, right_output], axis=-1) # builds the classifier on top
predictions = layers.Dense(1, activation='sigmoid')(merged)

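model = Model([left_input, right_input], predictions) # instantiating and training the model; the weights of the LSTM layer are updated based on both inputs
model.compile(optimizer='rmsprop', loss='binary_crossentropy') # a model must be compiled before fit; the optimizer choice here is an assumption
model.fit([left_data, right_data], targets) # left_data, right_data, and targets are assumed to be NumPy arrays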

naturally, a layer instance may be used more than once -- it can be called arbitrarily many times, reusing the same set of weights every time
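
sharing also halves the number of weights to learn compared with giving each branch its own layer. a minimal sketch of the arithmetic for an LSTM with 32 units on 128-dimensional inputs; the helper name is hypothetical, and the formula assumes the standard LSTM parameterization with four transformations (three gates plus the cell candidate), each having an input kernel, a recurrent kernel, and a bias:

```python
def lstm_params(input_dim, units):
    # 4 transformations x (input kernel + recurrent kernel + bias)
    return 4 * (units * (input_dim + units) + units)

shared = lstm_params(128, 32)        # one LSTM instance called on both branches
separate = 2 * lstm_params(128, 32)  # one LSTM instance per branch

print(shared)    # 20608
print(separate)  # 41216
```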

### Models as Layers

In [None]:
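y = model(x) # a model can be called on an input tensor like a layer, and it returns an output tensor
y1, y2 = model([x1, x2]) # with several inputs and outputs, call it on a list of tensors and unpack the list of outputs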

the following example implements a Siamese vision model (shared convolutional base) in keras:

In [None]:
from keras import layers
from keras import applications
from keras import Input

xception_base = applications.Xception(weights=None, include_top=False) # the base image-processing model is the Xception network (convolutional base only)

# the inputs are 250 x 250 RGB images
left_input = Input(shape=(250, 250, 3))
right_input = Input(shape=(250, 250, 3))

# call the same vision model twice
left_features = xception_base(left_input)
right_features = xception_base(right_input)

merged_features = layers.concatenate([left_features, right_features], axis=-1)  # the merged features contain information from the right visual feed and the left visual feed

### Wrapping Up

this section introduced the keras functional API, an essential tool for building advanced deep neural network architectures:
- step out of the Sequential API whenever you need anything more than a linear stack of layers
- build keras models with several inputs, several outputs, and a complex internal network topology, using the keras functional API
- reuse the weights of a layer or model across different processing branches, by calling the same layer or model instance several times