# 3.6 Anatomy of a Neural Network: Understanding Core Keras APIs

## 3.6.1 Layers: The Building Blocks of Deep Learning

The fundamental data structure in NNs is the __layer__. A layer is a data processing module that takes as input one or more tensors and that outputs one or more tensors. 

Some layers are __stateless__, but more __frequently layers have a state__: the layer’s __weights__, one or several tensors learned with SGD, which together contain the __network’s knowledge__.

Different types of layers are appropriate for different tensor formats and different types of data processing. 

For instance, __simple vector data__, stored in rank-2 tensors of shape `(samples, features)`, is often processed by densely connected layers, also called fully connected or __dense layers__ (the `Dense` class in Keras).

__Sequence data__, stored in rank-3 tensors of shape `(samples, timesteps, features)`, is typically processed by __recurrent layers__, such as an __LSTM__ layer, or __1D convolution layers__ (`Conv1D`).

__Image data__, stored in rank-4 tensors, is usually processed by __2D convolution layers__ (`Conv2D`).

You can think of layers as the __LEGO bricks of DL__, a metaphor that is made explicit by Keras. Building DL models in Keras is done by __clipping together compatible layers to form useful data-transformation pipelines__.
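
As a rough illustration of the pairings above, the sketch below passes dummy tensors through one layer of each kind. The sizes (4 samples, 16 features, 10 timesteps, 28x28 images with 3 channels, 8 units or filters) are arbitrary choices for the example, not values from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Vector data: rank-2 tensor of shape (samples, features)
vectors = tf.zeros(shape=(4, 16))
dense_out = layers.Dense(8)(vectors)

# Sequence data: rank-3 tensor of shape (samples, timesteps, features)
sequences = tf.zeros(shape=(4, 10, 16))
lstm_out = layers.LSTM(8)(sequences)  # by default, only the last timestep's output
conv1d_out = layers.Conv1D(8, kernel_size=3)(sequences)

# Image data: rank-4 tensor, here laid out as (samples, height, width, channels)
images = tf.zeros(shape=(4, 28, 28, 3))
conv2d_out = layers.Conv2D(8, kernel_size=3)(images)

print(tuple(dense_out.shape))   # (4, 8)
print(tuple(lstm_out.shape))    # (4, 8)
print(tuple(conv1d_out.shape))  # (4, 8, 8)
print(tuple(conv2d_out.shape))  # (4, 26, 26, 8)
```

Each layer accepts only the tensor rank it was designed for, which is the sense in which layers must be "compatible" to be clipped together.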

### The Base Layer Class in Keras

Everything in Keras is either a `Layer` or something that closely interacts with a `Layer`.

A `Layer` is an object that encapsulates some state (__weights__) and some computation (a __forward pass__). The weights are typically defined in a `build()` method (although they could also be created in the constructor, `__init__()`), and the computation is defined in the `call()` method.

In the previous chapter, we implemented a `NaiveDense` class that contained two weights `W` and `b` and applied the computation `output = activation(dot(input, W) + b)`. This is what the same layer would look like in Keras.

In [2]:
import tensorflow as tf
from tensorflow import keras

# All Keras layers inherit from the base Layer class
class SimpleDense(keras.layers.Layer):
    def __init__(self, units, activation=None):
        super().__init__()
        self.units = units
        # Accept either a callable (tf.nn.relu) or a string name ("relu")
        self.activation = keras.activations.get(activation)

    # Weight creation takes place in the build() method
    def build(self, input_shape):
        input_dim = input_shape[-1]
        # add_weight() is a shortcut method for creating weights
        # It is also possible to create standalone variables and assign them as
        # layer attributes, like self.W = tf.Variable(tf.random.uniform(w_shape))
        self.W = self.add_weight(shape=(input_dim, self.units),
                                 initializer="random_normal")
        self.b = self.add_weight(shape=(self.units,),
                                 initializer="zeros")

    # We define the forward pass computation in the call() method
    def call(self, inputs):
        y = tf.matmul(inputs, self.W) + self.b
        if self.activation is not None:
            y = self.activation(y)
        return y

Once instantiated, a layer like this can be used just like a function, taking as input a TensorFlow tensor.

In [3]:
import tensorflow as tf

# instantiate Layer
my_dense = SimpleDense(units=32, activation=tf.nn.relu)
# create some test inputs
input_tensor = tf.ones(shape=(2, 784))
# call the layer on the inputs, just like a function
output_tensor = my_dense(input_tensor)
print(output_tensor.shape)

(2, 32)


You’re probably wondering, why did we have to implement `call()` and `build()`, since we ended up using our layer by plainly calling it, that is to say, by using its `__call__()` method? 

It’s because we want to be able to create the state just in time. Let’s see how that works.

### Automatic Shape Inference: Building Layers on the Fly

Just like with LEGO bricks, __you can only “clip” together layers that are compatible__. The notion of layer compatibility here refers specifically to the fact that __every layer will only accept input tensors of a certain shape and will return output tensors of a certain shape__.

In [6]:
from tensorflow.keras import layers

layer = layers.Dense(32, activation="relu")

This layer will return a tensor whose last dimension has been transformed to be 32. It can only be connected to a downstream layer that expects 32-dimensional vectors as its input.

When using Keras, you don’t have to worry about size compatibility most of the time, because __the layers you add to your models are dynamically built to match the shape of the incoming layer__.

In [7]:
from tensorflow.keras import models

model = models.Sequential([
    layers.Dense(32, activation="relu"),
    layers.Dense(32)
])

The layers didn’t receive any information about the shape of their inputs—instead, they __automatically inferred their input shape as being the shape of the first inputs they see__.
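
A small check makes this deferred weight creation visible. The sketch below assumes a batch of 2 samples with 784 features, the same shape used for `input_tensor` earlier; `built` and `weights` are standard `Layer` attributes.

```python
import tensorflow as tf
from tensorflow.keras import layers

layer = layers.Dense(32, activation="relu")

# Before the first call, the layer has seen no input, so it has no weights yet
built_before = layer.built
num_weights_before = len(layer.weights)
print(built_before, num_weights_before)  # False 0

# The first call triggers build() with the shape of the incoming tensor
outputs = layer(tf.ones(shape=(2, 784)))

weight_shapes = [tuple(w.shape) for w in layer.weights]
print(layer.built)    # True
print(weight_shapes)  # [(784, 32), (32,)]
```

The kernel's first dimension (784) was never passed to the constructor; it was read from the input tensor at call time.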

In the toy version of the Dense layer we implemented in chapter 2 (which we named `NaiveDense`), we had to pass the layer’s input size explicitly to the constructor in order to be able to create its weights.

That’s not ideal, because it would lead to models that look like this, where each new layer needs to be made aware of the shape of the layer before it.

In [None]:
# model = NaiveSequential([
#     NaiveDense(input_size=784, output_size=32, activation="relu"),
#     NaiveDense(input_size=32, output_size=64, activation="relu"),
#     NaiveDense(input_size=64, output_size=32, activation="relu"),
#     NaiveDense(input_size=32, output_size=10, activation="softmax")
# ])

It would be even worse if the rules used by a layer to produce its output shape were complex. For instance, what if our layer returned outputs of shape `(batch, input_size * 2 if input_size % 2 == 0 else input_size * 3)`?

If we were to reimplement our `NaiveDense` layer as a Keras layer capable of automatic shape inference, it would look like the previous `SimpleDense` layer, with its `build()` and `call()` methods.

In `SimpleDense`, we no longer create weights in the constructor like in the `NaiveDense` example; instead, we create them in a dedicated state-creation method, `build()`, which receives as an argument the first input shape seen by the layer.

The `build()` method is called automatically the first time the layer is called (via its `__call__()` method). In fact, that’s why we defined the computation in a separate `call()` method rather than in the `__call__()` method directly. The `__call__()` method of the base layer schematically looks like this:

In [9]:
def __call__(self, inputs):
  if not self.built:
    self.build(inputs.shape)
    self.built = True
  return self.call(inputs)

With automatic shape inference, our previous example becomes simple and neat.

In [10]:
model = keras.Sequential([
    SimpleDense(32, activation="relu"),
    SimpleDense(64, activation="relu"),
    SimpleDense(32, activation="relu"),
    SimpleDense(10, activation="softmax")
])

Note that __automatic shape inference__ is not the only thing that the `Layer` class’s `__call__()` method handles. It takes care of many more things, in particular __routing between eager and graph execution__ (a concept you’ll learn about in chapter 7), and __input masking__ (which we’ll cover in chapter 11). 

For now, just remember: __when implementing your own layers, put the forward pass in the `call()` method__.

## 3.6.2 From Layers to Models

A DL model is a __graph of layers__. In Keras, that’s the `Model` class. 

Until now, you’ve only seen `Sequential` models (a subclass of `Model`), which are simple stacks of layers, mapping a single input to a single output. But as you move forward, you’ll be exposed to a much broader variety of __network topologies__, such as two-branch networks, multihead networks, and residual connections.

The topology of a model defines a __hypothesis space__. You may remember that in chapter 1 we described machine learning as searching for useful representations of some input data, within a __predefined space of possibilities__, using guidance from a feedback signal. 

By choosing a __network topology__, you constrain your __space of possibilities__ (__hypothesis space__) to a specific series of tensor operations, mapping input data to output data. What you’ll then be searching for is a good set of values for the weight tensors involved in these tensor operations.

To learn from data, you have to make assumptions about it. These assumptions define what can be learned. As such, the structure of your hypothesis space—the architecture of your model—is extremely important. It __encodes the assumptions you make about your problem__, the prior knowledge that the model starts with.

Picking the right network architecture is __more an art than a science__.

## 3.6.3 The “compile” step: Configuring the learning process

Once the model architecture is defined, you still have to choose three more things:

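1. __Loss function__ (objective function): The quantity that will be minimized during training. It represents a measure of success for the task at hand.
2. __Optimizer__: Determines how the network will be updated based on the loss function. It implements a specific variant of SGD.
3. __Metrics__: The measures of success you want to monitor during training and validation, such as classification accuracy. Unlike the loss, training does not optimize directly for these metrics.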

Once you’ve picked your loss, optimizer, and metrics, you can use the built-in `compile()` and `fit()` methods to start training your model.

In [None]:
# define a linear classifier
model = keras.Sequential([keras.layers.Dense(1)])

# model configuration
model.compile(optimizer='rmsprop',
              loss='mean_squared_error',
              metrics=['accuracy'])

These strings are actually __shortcuts that get converted to Python objects__.

In [None]:
model.compile(optimizer=keras.optimizers.RMSprop(),
              loss=keras.losses.MeanSquaredError(),
              metrics=[keras.metrics.BinaryAccuracy()])

This is useful if you want to pass your own __custom losses or metrics__, or if you want to __further configure the objects you’re using__—for instance, by passing a `learning_rate` argument to the optimizer.

In [None]:
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-4),
              loss=my_custom_loss,
              metrics=[my_custom_metric_1, my_custom_metric_2])
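
The names `my_custom_loss`, `my_custom_metric_1`, and `my_custom_metric_2` above are placeholders. As a minimal sketch of what they could look like, the hypothetical definitions below are plain functions that take `(y_true, y_pred)` and return one value per sample. The particular formulas (absolute error, squared error, agreement at a 0.5 threshold) are chosen only for illustration.

```python
import tensorflow as tf
from tensorflow import keras

# Hypothetical custom loss: mean absolute error per sample
def my_custom_loss(y_true, y_pred):
    y_true = tf.cast(y_true, y_pred.dtype)
    return tf.reduce_mean(tf.abs(y_true - y_pred), axis=-1)

# Hypothetical custom metric: mean squared error per sample
def my_custom_metric_1(y_true, y_pred):
    y_true = tf.cast(y_true, y_pred.dtype)
    return tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)

# Hypothetical custom metric: share of outputs on the same side of 0.5 as the target
def my_custom_metric_2(y_true, y_pred):
    y_true = tf.cast(y_true, y_pred.dtype)
    matches = tf.equal(y_true > 0.5, y_pred > 0.5)
    return tf.reduce_mean(tf.cast(matches, tf.float32), axis=-1)

model = keras.Sequential([keras.layers.Dense(1)])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-4),
              loss=my_custom_loss,
              metrics=[my_custom_metric_1, my_custom_metric_2])
```

Only the loss is minimized by the optimizer; the two metric functions are merely reported alongside it.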

## 3.6.4 Picking a Loss Function

Choosing the right loss function for the right problem is extremely important: __your network will take any shortcut it can to minimize the loss__, so if the objective doesn’t fully correlate with success for the task at hand, your network will end up doing things you may not have wanted.

## 3.6.5 Understanding the `fit()` Method

After `compile()` comes `fit()`. The `fit()` method implements the training loop itself. These are its key arguments:

1. The __data__ (inputs and targets) to train on. It will typically be passed either in the form of __NumPy arrays__ or a __TensorFlow Dataset object__. 
1. The number of __epochs__ to train for: how many times the training loop should iterate over the data passed.
1. The __batch size__ to use within each epoch of mini-batch gradient descent: the number of training examples considered to compute the gradients for one weight update step.
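
The relationship between these arguments comes down to simple arithmetic. The sketch below assumes a hypothetical dataset of 1,000 samples (a number picked for the example) passed as NumPy arrays, in which case the final batch of each epoch may be smaller than `batch_size`.

```python
import math

num_samples = 1000  # hypothetical dataset size
batch_size = 128
epochs = 5

# One weight update per batch; the last batch holds the leftover samples
steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 8
print(last_batch_size)  # 104
print(total_updates)    # 40
```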

In [None]:
history = model.fit(
    inputs,
    targets,
    epochs=5,
    batch_size=128)

The call to `fit()` returns a `History` object. This object contains a `history` field, which is a dict mapping keys such as "loss" or specific metric names to the list of their per-epoch values.

In [None]:
>>> history.history
{"binary_accuracy": [0.855, 0.9565, 0.9555, 0.95, 0.951],
 "loss": [0.6573270302042366,
          0.07434618508815766,
          0.07687718723714351,
          0.07412414988875389,
          0.07617757616937161]}
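
Since `history.history` is an ordinary dict of lists, it can be inspected with plain Python. As a small sketch, the snippet below copies the values printed above into a literal dict and looks up the epoch with the lowest loss and the one with the highest accuracy (epochs are counted from 1 here).

```python
# Values copied from the history.history output shown above
history_dict = {
    "binary_accuracy": [0.855, 0.9565, 0.9555, 0.95, 0.951],
    "loss": [0.6573270302042366,
             0.07434618508815766,
             0.07687718723714351,
             0.07412414988875389,
             0.07617757616937161],
}

losses = history_dict["loss"]
accuracies = history_dict["binary_accuracy"]

# Each list holds one value per epoch, in training order
best_loss_epoch = losses.index(min(losses)) + 1
best_accuracy_epoch = accuracies.index(max(accuracies)) + 1

print(len(losses))          # 5
print(best_loss_epoch)      # 4
print(best_accuracy_epoch)  # 2
```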

## 3.6.6 Monitoring Loss and Metrics on Validation Data

To keep an eye on how the model does on new data, it’s standard practice to reserve a __subset of the training data__ as __validation data__: you won’t be training the model on this data, but you will __use it to compute a loss value and metrics value__. 

You do this by using the `validation_data` argument in `fit()`. Like the training data, the validation data could be passed as __NumPy arrays__ or as a __TensorFlow Dataset object__.

In [None]:
import numpy as np

# Assumes `inputs` and `targets` are NumPy arrays defined earlier

model = keras.Sequential([keras.layers.Dense(1)])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.1),
              loss=keras.losses.MeanSquaredError(),
              metrics=[keras.metrics.BinaryAccuracy()])

# Shuffle inputs and targets with the same permutation so pairs stay aligned
indices_permutation = np.random.permutation(len(inputs))
shuffled_inputs = inputs[indices_permutation]
shuffled_targets = targets[indices_permutation]

# Reserve 30% of the samples for validation and train on the rest
num_validation_samples = int(0.3 * len(inputs))
val_inputs = shuffled_inputs[:num_validation_samples]
val_targets = shuffled_targets[:num_validation_samples]
training_inputs = shuffled_inputs[num_validation_samples:]
training_targets = shuffled_targets[num_validation_samples:]

model.fit(
    training_inputs,
    training_targets,
    epochs=5,
    batch_size=16,
    validation_data=(val_inputs, val_targets)
)

The value of the loss on the validation data is called the __validation loss__, to distinguish it from the __training loss__. 

Note that if you want to compute the validation loss and metrics after the training is complete, you can call the `evaluate()` method.

In [None]:
loss_and_metrics = model.evaluate(val_inputs, val_targets, batch_size=128)

`evaluate()` will iterate in batches (of size `batch_size`) over the data passed and return a list of scalars, where the first entry is the validation loss and the following entries are the validation metrics.
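
As a minimal sketch of how the returned values might be consumed, the example below evaluates an untrained model on random stand-in data. The arrays here are hypothetical replacements for the real `val_inputs` and `val_targets`, so the numbers themselves are meaningless. It also uses the `return_dict=True` argument of `evaluate()`, which returns the same values keyed by name.

```python
import numpy as np
from tensorflow import keras

# Hypothetical stand-ins for the validation arrays used above
val_inputs = np.random.random((100, 2)).astype("float32")
val_targets = np.random.randint(0, 2, size=(100, 1)).astype("float32")

model = keras.Sequential([keras.layers.Dense(1)])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.1),
              loss=keras.losses.MeanSquaredError(),
              metrics=[keras.metrics.BinaryAccuracy()])

# List form: the loss comes first, followed by one entry per compiled metric
val_loss, val_accuracy = model.evaluate(
    val_inputs, val_targets, batch_size=128, verbose=0)

# Dict form: the same values, keyed by name
results = model.evaluate(
    val_inputs, val_targets, batch_size=128, verbose=0, return_dict=True)
print(val_loss, val_accuracy)
print(results)
```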

## 3.6.7 Inference: Using a model after training

Once you’ve trained your model, you’re going to want to use it to make predictions on new data. This is called __inference__. 

To do this, a naive approach would simply be to `__call__()` the model.

In [None]:
predictions = model(new_inputs)

However, this will process all inputs in `new_inputs` at once, which may not be feasible if you’re looking at a lot of data (in particular, it may require more memory than your GPU has).

A better way to do inference is to use the `predict()` method. It will iterate over the data in small batches and return a NumPy array of predictions. And unlike `__call__()`, it can also process TensorFlow Dataset objects.

In [None]:
predictions = model.predict(new_inputs, batch_size=128)
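
Schematically, batched prediction amounts to slicing the inputs, running each slice through the model, and concatenating the results. The sketch below is not how `predict()` is implemented in Keras; `predict_in_batches` and the stand-in `toy_model` are hypothetical names used only to illustrate the idea with NumPy.

```python
import numpy as np

def predict_in_batches(model_fn, inputs, batch_size=128):
    """Apply model_fn to consecutive slices of inputs and stack the outputs."""
    outputs = []
    for start in range(0, len(inputs), batch_size):
        batch = inputs[start:start + batch_size]
        outputs.append(np.asarray(model_fn(batch)))
    return np.concatenate(outputs, axis=0)

# Stand-in for a trained model: a fixed linear map from 4 features to 1 output
weights = np.full((4, 1), 0.5)

def toy_model(batch):
    return batch @ weights

new_inputs = np.ones((300, 4))
predictions = predict_in_batches(toy_model, new_inputs, batch_size=128)
print(predictions.shape)  # (300, 1)
```

Only one slice of at most `batch_size` rows is processed at a time, which is what keeps memory use bounded regardless of how many inputs are passed.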