# Chapter 12: Custom Models and Training with Tensorflow

In [2]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

## 12.1 A Quick Tour of Tensorflow

- Similar to NumPy but with GPU support.
- Supports distributed computing.
- Includes a just-in-time (JIT) compiler that allows it to optimize computations for speed and memory usage.
- Computation graphs can be exported to a portable format.
- Implements autodiff and provides some excellent optimizers.

## 12.2 Using TensorFlow like NumPy

**TensorFlow** - API revolves around **tensors**, which flow from operation to operation.

**Tensor** - Very similar to NumPy `ndarray`: it is usually a multidimensional array, but can also hold a scalar.

### 12.2.1 Tensors and Operations

Create a tensor with `tf.constant()`.

In [None]:
tf.constant([[1., 2., 3.], [4., 5., 6.]]) # matrix

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [None]:
t = tf.constant([[1., 2., 3.], [4., 5., 6.]])
t.shape

TensorShape([2, 3])

In [None]:
t.dtype

tf.float32

In [None]:
# Indexing similar to NumPy
t[:, 1:]

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[2., 3.],
       [5., 6.]], dtype=float32)>

In [None]:
t[..., 1, tf.newaxis] # ... = Access all unspecified elements

<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[2.],
       [5.]], dtype=float32)>

In [None]:
t + 10

# Python calls t.__add__(10)
# Which calls tf.add(t, 10)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[11., 12., 13.],
       [14., 15., 16.]], dtype=float32)>

In [None]:
tf.square(t)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)>

In [None]:
t @ tf.transpose(t)

# TensorFlow creates a new tensor object for transpose
# Cannot do NumPy's t.T

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[14., 32.],
       [32., 77.]], dtype=float32)>

> #### Keras' Low-Level API

> Keras API has its own low-level API, located in `keras.backend`. In `tf.keras`, these functions generally just call the corresponding TensorFlow operations. But if you want to write code that will be portable to other Keras implementations, you should use these Keras functions.

In [None]:
from tensorflow import keras

In [None]:
K = keras.backend
K.square(K.transpose(t)) + 10

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[11., 26.],
       [14., 35.],
       [19., 46.]], dtype=float32)>

### 12.2.2 Tensors and NumPy

You can create a tensor from a NumPy array, and vice versa. You can even apply TensorFlow operations to NumPy arrays and NumPy operations to tensors.

In [None]:
a = np.array([2., 4., 5.])
tf.constant(a)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([2., 4., 5.])>

In [None]:
t.numpy() # or np.array(t)

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

In [None]:
tf.square(a)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([ 4., 16., 25.])>

In [None]:
np.square(t)

array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)

### 12.2.3 Type Conversions

Type conversions can significantly hurt performance. To avoid this, TensorFlow does not perform any type conversions automatically; it just raises an exception if you try to execute an operation on tensors with incompatible types.

In [None]:
tf.constant(2.) + tf.constant(40) # Cannot add float and integer tensors

InvalidArgumentError: ignored

In [None]:
tf.constant(2.) + tf.constant(40., dtype=tf.float64) # Cannot add 32-bit float and 64-bit float tensors

InvalidArgumentError: ignored

In [None]:
t2 = tf.constant(40., dtype=tf.float64)
tf.constant(2.0) + tf.cast(t2, tf.float32) # Use tf.cast() to convert types

<tf.Tensor: shape=(), dtype=float32, numpy=42.0>

### 12.2.4 Variables

`tf.Tensor` values are immutable: you cannot modify them.

Not helpful as weights in neural networks since they need to be tweaked by backpropagation.

Use `tf.Variable`.

In [None]:
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
v

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

A `tf.Variable` acts much like a `tf.Tensor` but it can also be modified in place using the `assign()` method.

In [None]:
v.assign(2 * v) # Mutates v

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [None]:
v[0, 1].assign(42)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [None]:
v[:, 2].assign([0., 1.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  0.],
       [ 8., 10.,  1.]], dtype=float32)>

In [None]:
# Assign/update specific indices with specific values
v.scatter_nd_update(indices=[[0, 0], [1, 2]], updates=[100., 200.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[100.,  42.,   0.],
       [  8.,  10., 200.]], dtype=float32)>

### 12.2.5 Other Data Structures

**Sparse tensors** (`tf.SparseTensor`): Efficiently represent tensors containing mostly 0s.

**Tensor arrays** (`tf.TensorArray`): Lists of tensors. All tensors contained must have the same shape and data type.

**Ragged tensors** (`tf.RaggedTensor`): Represent static lists of lists of tensors, where every tensor has the same shape and data type.

**String tensors**: Regular tensors of type `tf.string`.
- These represent byte strings, not Unicode strings.
- `tf.string` is atomic, meaning that its length does not appear in the tensor's shape.
- Once you convert it to a Unicode tensor, then the length appears in the shape.

**Sets**: Represented as regular tensors (or sparse tensors).
- `tf.constant([[1, 2], [3, 4]])` represents 2 sets [1, 2] and [3, 4].

**Queues**: Store tensors across multiple steps, in `tf.queue` package.
- First In, First Out (FIFO) queues, "`FIFOQueue`"
- Queues that can prioritize some items, "`PriorityQueue`"
- Shuffle the items, "`RandomShuffleQueue`"
- Batch items of different shapes by padding, "`PaddingFIFOQueue`"

## 12.3 Customizing Models and Training Algorithms

### 12.3.1 Custom Loss Functions

Let's imagine implementing the Huber loss.

> Note: Always try to use vectorized implementation for better performance. To benefit from TensorFlow's graph feature, you should only use TensorFlow operations.

In [5]:
# FROM TEXTBOOK NOTEBOOK

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /root/scikit_learn_data


In [None]:
def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)

In [None]:
# FROM TEXTBOOK NOTEBOOK

input_shape = X_train.shape[1:]

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="selu", kernel_initializer="lecun_normal",
                       input_shape=input_shape),
    keras.layers.Dense(1),
])

In [None]:
model.compile(loss=huber_fn, optimizer="nadam")
# From textbook notebook
model.fit(X_train_scaled, y_train, epochs=2,
          validation_data=(X_valid_scaled, y_valid))

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f2a3a0e01d0>

### 12.3.2 Saving and Loading Models That Contain Custom Components

When you load a model containing custom objects, you need to map the names to the objects.

In [None]:
# From textbook notebook
model.save("my_model_with_a_custom_loss.h5")

model = keras.models.load_model("my_model_with_a_custom_loss.h5",
                                custom_objects={"huber_fn": huber_fn})

In [None]:
# Function that creates a configured loss function
def create_huber(threshold=1.0):
    def huber_fn(y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < threshold
        squared_loss = tf.square(error) / 2
        linear_loss = threshold * tf.abs(error) - threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    return huber_fn

model.compile(loss=create_huber(2.0), optimizer="nadam")

When you save the model, the `threshold` will not be saved. This means that you will have to specify the `threshold` value when loading the model.

In [None]:
# From textbook notebook
model.save("my_model_with_a_custom_loss_threshold_2.h5")

model = keras.models.load_model("my_model_with_a_custom_loss_threshold_2.h5",
                                custom_objects={"huber_fn": create_huber(2.0)})

By creating a subclass of `keras.losses.Loss` and implementing its `get_config()` method, you can solve this problem of having to specify the `threshold` value.

In [None]:
class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

Code explanation:

1. Constructor (`__init__`) accepts `**kwargs` and passes them to the parent constructor (`super().__init__`), which handles standard hyperparameters.
    - Note: `**kwargs` stands for unpacking (`**`) the keyword arguments dictionary (`kwargs`).

2. The `call()` method takes the labels and predictions, computes all the instance losses, and returns them.
    - Exact same as `huber_fn` from above.

3. The `get_config()` method returns a dictionary mapping each hyperparameter name to its value.
    - First calls the parent class's `get_config()` method (`super().get_config()`).
    - Then adds the new hyperparameters to this dictionary.
    - Note: `**base_config` unpacks the dictionary.

In [None]:
model.compile(loss=HuberLoss(2.), optimizer="nadam")

In [None]:
# From textbook notebook
model.save("my_model_with_a_custom_loss_class.h5")

model = keras.models.load_model("my_model_with_a_custom_loss_class.h5",
                                custom_objects={"HuberLoss": HuberLoss})

### 12.3.3 Custom Activation Functions, Initializers, Regularizers, and Constraints

In [None]:
# Custom Keras functions

def my_softplus(z): # return value is just tf.nn.softplus(z)
    return tf.math.log(tf.exp(z) + 1.0)

def my_glorot_initializer(shape, dtype=tf.float32):
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype)

def my_l1_regularizer(weights):
    return tf.reduce_sum(tf.abs(0.01 * weights))

def my_positive_weights(weights): # return value is just tf.nn.relu(weights)
    return tf.where(weights < 0., tf.zeros_like(weights), weights)

In [None]:
# Custom functions can then be used normally
layer = keras.layers.Dense(30, activation=my_softplus,
                           kernel_initializer=my_glorot_initializer,
                           kernel_regularizer=my_l1_regularizer,
                           kernel_constraint=my_positive_weights)

If a function has hyperparameters that need to be saved along with the model, then you will want to subclass the appropriate class.

In [None]:
# l1 regularization that saves its factor hyperparameter
# No calling parent constructor, super.__init__()
# Not defined in parent class
class MyL1Regularizer(keras.regularizers.Regularizer):
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, weights):
        return tf.reduce_sum(tf.abs(self.factor * weights))
    def get_config(self):
        return {"factor": self.factor}

> Note: You must implement the `call()` method for losses, layers, activation functions, and models, or `__call__()` for regularizers, initializers, and constraints.

### 12.3.4 Custom Metrics

Losses and metrics are conceptually not the same thing.

Losses (eg. cross entropy) are:
- Used by Gradient Descent to *train* a model.
- They must be differentiable (at least where they are evaluated).
- Their gradients should not be 0 everywhere.
- Okay if not easily interpretable by humans.

Metrics (eg. accuracy) are:
- Used to *evaluate* a model.
- Can be non-differentiable.
- Can have 0 gradients everywhere.
- Must be more easily interpretable.

Defining a custom metric function is exactly the same as defining a custom loss function. 

We can use the Huber loss function as a metric (though MAE or MSE is preferred).

In [None]:
model.compile(loss="mse", optimizer="nadam", metrics=[create_huber(2.0)])

> Recall: In Chapter 3, precision is the number of true positives divided by the number of positive predictions (true positives + false positives).

For each batch during training, Keras will compute this metric and keep track of its mean since the beginning of the epoch. This can be incorrect depending on the batch vs. overall.

`keras.metrics.Precision` class can keep track of the number of true positives and false positives and can compute their ratio.

In [None]:
precision = keras.metrics.Precision() # Create Precision object
# Pass labels and predictions of 1st batch
# 5 positive predictions, 4 correct
precision([0, 1, 1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0, 1])

<tf.Tensor: shape=(), dtype=float32, numpy=0.8>

In [None]:
# Pass labels and predictions of 2nd batch
# 3 positive predictions, 0 correct
precision([0, 1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0])

<tf.Tensor: shape=(), dtype=float32, numpy=0.5>

**Streaming metric (stateful metric)**: A metric that is gradually updated, batch after batch.

In [None]:
# Get the current value of the metric.
precision.result()

<tf.Tensor: shape=(), dtype=float32, numpy=0.5>

In [None]:
# Look at its variables (number of true/false positives)
precision.variables

[<tf.Variable 'true_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>,
 <tf.Variable 'false_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>]

In [None]:
# Reset these variables
precision.reset_states() # both variables get reset to 0.0

If you need to create such a streaming metric, create a subclass of `keras.metrics.Metric` class.

In [None]:
# Keeps track of total Huber loss
# Keeps track of number of instances seen so far
# When asked for result, returns the ratio, which is the mean Huber loss

class HuberMetric(keras.metrics.Metric):
    def __init__(self, threshold=1.0, **kwargs):
        super().__init__(**kwargs) # handles base args (eg. dtype)
        self.threshold = threshold
        self.huber_fn = create_huber(threshold)
        self.total = self.add_weight("total", initializer="zeros")
        self.count = self.add_weight("count", initializer="zeros")
    def update_state(self, y_true, y_pred, sample_weight=None):
        metric = self.huber_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(metric))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))
    def result(self):
        return self.total / self.count
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "treshold": self.threshold}

Code explanation:

1. The constructor uses the `add_weight()` method to create the variables needed to keep track of the metric's state over multiple batches (sum of all Huber losses, `total`, and number of instances seen so far, `count`).
    - Alternatively, can create variables manually since Keras tracks any `tf.Variable` that is set as an attribute.

2. The `updated_state()` method is called when you use an instance of this class as a function. It updates the variables, given the labels and predictions for 1 batch.

3. The `result()` method computes and returns the final result, in this case the mean Huber metric over all instances.
    - When you use the metric as a function, the `update_state()` method gets called first.
    - Then the `result()` method is called, and its output is returned.

4. The `get_config()` method ensures the `threshold` gets saved along with the model.

> **Not in code**: The `reset_states()` method resets all variables to 0.0 and can be overridden if needed.

In general, Keras calls the simple function metric (not custom) for each batch and keeps track of the mean during each epoch. But some metric, like precision, cannot be averaged over batches and so must implement a custom streaming metric.

### 12.3.5 Custom Layers

Custom layers are useful if you want to build an exotic layer with no default TensorFlow implementation or treat blocks of layers as a single layer.

If you want to create a custom layer without any weights, the simplest option is to write a function and wrap it in a `keras.layers.Lambda` layer.

This custom layer can then be used like any other layer, using the Sequential, Function, or Subclassing API. It can also be used as an activation function.

In [None]:
exponential_layer = keras.layers.Lambda(lambda x: tf.exp(x))

To build a custom stateful layer (ie. a layer with weights), you need to create a subclass of the `keras.layers.Layer` class.

In [None]:
# Simplified version of the Dense layer
class MyDense(keras.layers.Layer):
    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = keras.activations.get(activation)
    
    def build(self, batch_input_shape):
        self.kernel = self.add_weight(
            name="kernel", shape=[batch_input_shape[-1], self.units],
            initializer="glorot_normal")
        self.bias = self.add_weight(
            name="bias", shape=[self.units], initializer="zeros")
        super().build(batch_input_shape) # must be at the end
    
    def call(self, X):
        return self.activation(X @ self.kernel + self.bias)
    
    def compute_output_shape(self, batch_input_shape):
        return tf.TensorShape(batch_input_shape.as_list()[:-1] + [self.units])
    
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "units": self.units,
                "activation": keras.activations.serialize(self.activation)}

Code explanation:

1. The constructor takes all the hyperparameters as arguments (eg. `units` and `activation`), and `**kwargs` argument.
    - It calls the parent constructor and passes unpacked `kwargs` (`super().__init__(**kwargs)`), which takes care of standard arguments.
    - Saves hyperparameters as attributes.
    - Converts `activation` argument to appropriate activation function (`keras.activations.get()`).

2. The `build()` method creates the layer's variables by calling the `add_weight()` method for each weight.
    - Pass the shape of this layer's inputs to `build()`, which is necessary to create some of the weights.
    - We need to know the number of neurons in the previous layer in order to create the connection weights matrix.
    - `"kernel"` corresponds to the size of the last dimension of the inputs.
    - Only at the end, call parent's `build` method (`super().build()`) to tell Keras that the layer is built (ie. sets `self.built=True`).

3. The `call()` method performs the desired operations.
    - Compute the matrix multiplication of inputs `X` and layer's kernel.
    - Add the bias vector.
    - Apply activation function to result.
    - Gives output of the layer.

4. The `compute_output_shape()` method returns the shape of this layer's outputs.
    - Same shape as inputs, except last dimension is replaced with the number of neurons in the layer.
    - In `tf.keras`, shapes are instances of `tf.TensorShape` class can can be converted to Python lists using `as_list()`.

5. The `get_config()` method saves the hyperparameter values.
    - The activation function's full configuration is saved by calling `keras.activations.serialize()`.

> Note: You can generally omit `compute_output_shape()` method, as tf.keras automatically infers the output shape, except when the layer is dynamic.

To create a layer with multiple inputs (eg. `Concatenate`):
1. The argument to `call()` method should be a tuple containing all the inputs.
2. The argument to `compute_output_shape()` method should be a tuple containing each input's batch shape.

To create a layer with multiple outputs:
1. The `call()` method should return the list of outputs.
2. The `compute_output_shape()` method should return the list of batch output shapes (1 per output).

In [None]:
# Takes 2 inputs, returns 3 outputs
class MyMultiLayer(keras.layers.Layer):
    def call(self, X):
        X1, X2 = X
        return [X1 + X2, X1 * X2, X1 / X2]
    
    def compute_output_shape(self, batch_input_shape):
        b1, b2 = batch_input_shape
        return [b1, b1, b1] # should probably handle broadcasting rules

This layer can now be used like any other layer - only using Functional and Subclassing API, as Sequential only accepts 1 input and output.

If your layer needs to have a different behavior during training and during testing (eg. uses `Dropout` or `BatchNormalization` layers), then you must add a `training` argument to the `call()` method and use this argument to decide what to do.

In [None]:
# keras.layers.GaussianNoise does the same thing

# Adds Gaussian noise during training (for regularization)
# Does nothing during testing
class MyGaussianNoise(keras.layers.Layer):
    def __init__(self, stddev, **kwargs):
        super().__init__(**kwargs)
        self.stddev = stddev
    
    def call(self, X, training=None):
        if training:
            noise = tf.random.normal(tf.shape(X), stddev=self.stddev)
            return X + noise
        else:
            return X
        
    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape

### 12.3.6 Custom Models

To create custom models: subclass the `keras.Model` class, create layers and variables in the constructor, and implement the `call()` method to do whatever you want the model to do.

Suppose we want to build a custom model similar to *Figure 12-3*:
1. Input layer goes through 1st dense layer
2. Then through a **residual block**, which is composed of:
    - 2 dense layers
    - An addition operation
    - Concatenating the inputs to the output using the (+) operation
3. Through the residual block 3 more times
4. Into another residual block
5. Finally a dense output layer

To create this model, first create a `ResidualBlock` layer.

In [None]:
# Create ResidualBlock layer
class ResidualBlock(keras.layers.Layer):
    def __init__(self, n_layers, n_neurons, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(n_neurons, activation="elu",
                                          kernel_initializer="he_normal")
                       for _ in range(n_layers)]

    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return inputs + Z

Keras automatically detects that the `hidden` attribute contains trackable objects (layers in this case), so their variables are automatically added to this layer's list of variables.

In [None]:
# Create ResidualRegressor model
class ResidualRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(30, activation="elu",
                                          kernel_initializer="he_normal")
        self.block1 = ResidualBlock(2, 30)
        self.block2 = ResidualBlock(2, 30)
        self.out = keras.layers.Dense(output_dim)
    
    def call(self, inputs):
        Z = self.hidden1(inputs)
        for _ in range(1 + 3):
            Z = self.block1(Z)
        Z = self.block2(Z)
        return self.out(Z)

We create the layers in the constructor and then use them in the `call()` method.

To save the model and load it, you must implement the `get_config()` method in both the `ResidualBlock` class and the `ResidualRegressor` class. Alternatively, save and load the weights using `save_weights()` and `load_weights()` methods.

`Layer` class (superclass)  
$\downarrow$  
`Model` class (subclass of `Layer`)

> Best practices: Subclass `Layer` class for internal components of your model (ie. layers or reusable blocks of layers). Subclass `Model` class for the model itself (ie. the object you will train).

### 12.3.7 Losses and Metrics Based on Model Internals

So far custom losses and metrics were based on the labels and predictions.

But sometimes you want them based on other parts of the model, such as weights or activations of its hidden layers - useful for regularization or to monitor some internal aspect of the model.

To define a custom loss based on model internals, compute it based on any part of the model you want, then pass the result to the `add_loss()` method.

Suppose we want to build a custom regression MLP model composed of 5 hidden layers, and 1 output layer. It will have an auxiliary output on top of the upper hidden layer, with an associated loss called the **reconstruction loss**: the mean squared difference between the reconstruction and the inputs.

In [None]:
class ReconstructingRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(30, activation="selu",
                                          kernel_initializer="lecun_normal")
                       for _ in range(5)]
        self.out = keras.layers.Dense(output_dim)
    
    def build(self, batch_input_shape):
        n_inputs = batch_input_shape[-1]
        self.reconstruct = keras.layers.Dense(n_inputs)
        super().build(batch_input_shape)
    
    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        reconstruction = self.reconstruct(Z)
        recon_loss = tf.reduce_mean(tf.square(reconstruction - inputs))
        self.add_loss(0.05 * recon_loss)
        return self.out(Z)

Code explanation:

1. The constructor creates the DNN with 5 dense hidden layers and 1 dense output layer.

2. The `build()` method creates an extra dense layer which will be used to reconstruct the inputs of the model.
    - Must be created here because its number of units must be equal to the number of inputs and is unknown before `build()` is called.

3. The `call()` method:
    - Processes the inputs through all 5 hidden layers.
    - Passes the results through reconstruction layer, producing the reconstruction.
    - Computes the reconstruction loss.
    - Adds loss to model's list of losses using `add_loss()` method.
        - Scale down the reconstruction loss by multiplying by 0.05 (a hyperparameter you can tune).
        - Ensures that the reconstruction loss does not dominate the main loss.
    - Finally passes the output of the hidden layers to the output layer and returns its output.

Similarly, you can add a custom metric based on model internals as long as the result is the output of a metric object. For example, create a `keras.metrics.Mean` object in the constructor, call it in `call()` method, passing the `recon_loss` and add it to the model by calling `add_metric()` method. This will display both the mean loss and the mean reconstruction error over each epoch.

### 12.3.8 Computing Gradients Using Autodiff

In [None]:
def f(w1, w2):
    return 3*w1**2 + 2*w1*w2

$\frac{\partial f}{\partial w1} = 6*w1 + 2*w2$ 

$\frac{\partial f}{\partial w2} = 2*w1$.

So at $(w1, w2) = (5,3)$, the gradient vector is $(\frac{\partial f}{\partial w1}, \frac{\partial f}{\partial w2}) = (36, 10)$.

For a neural network, the function would be much more complex and finding the partials by hand is almost an impossible task.

One solution could be to compute an approximation of each partial derivative by measuring how much the function's output changes when the corresponding parameter is tweaked.

In [None]:
w1, w2 = 5, 3
eps = 1e-6
(f(w1 + eps, w2) - f(w1, w2)) / eps

36.000003007075065

In [None]:
(f(w1, w2 + eps) - f(w1, w2)) / eps

10.000000003174137

Works well but needing to call `f()` at least once per parameter is not suitable for large neural networks. Instead use autodiff.

In [None]:
w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape: # Records every operation that involves a variable
    z = f(w1, w2)

gradients = tape.gradient(z, [w1, w2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

In [None]:
# tape erased after calling its gradient() method
# Exception if gradient() called twice
with tf.GradientTape() as tape:
    z = f(w1, w2)

dz_dw1 = tape.gradient(z, w1) # => tensor 36.0
dz_dw2 = tape.gradient(z, w2) # RuntimeError!

RuntimeError: ignored

In [None]:
# To call gradient() more than once
with tf.GradientTape(persistent=True) as tape:
    z = f(w1, w2)

dz_dw1 = tape.gradient(z, w1) # => tensor 36.0
dz_dw2 = tape.gradient(z, w2) # => tensore 10.0, works fine now!
del tape # Delete when finished to free resources

In [None]:
# tape only tracks variable operations
# Result is None, otherwise
c1, c2 = tf.constant(5.), tf.constant(3.)
with tf.GradientTape() as tape:
    z = f(c1, c2)

gradients = tape.gradient(z, [c1, c2]) # returns [None, None]
gradients

[None, None]

In [None]:
# Force tape to watch any tensor and track their operations
with tf.GradientTape() as tape:
    tape.watch(c1)
    tape.watch(c2)
    z = f(c1, c2)

gradients = tape.gradient(z, [c1, c2]) # returns [tensor 36., tensor 10.]
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

To compute the gradients of a vector containing multiple losses, TensorFlow will compute the gradients of the vector's sum.

To get the individual gradients (eg. the gradients of each loss with regard to the model parameters), call the tape's `jacobian()` method: it will perform reverse-mode autodiff once for each loss in the vector.

In some cases use `tf.stop_gradient()` function to stop gradients from backpropagating through some part of the neural network.

In [None]:
def f(w1, w2):
    return 3*w1**2 + tf.stop_gradient(2*w1*w2)

with tf.GradientTape() as tape:
    z = f(w1, w2) # same result as without stop_gradient()

gradients = tape.gradient(z, [w1, w2]) # => returns [tensor 30., None]
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=30.0>, None]

In [None]:
# Numerical issues when computing gradients for large inputs
x = tf.Variable([100.])
with tf.GradientTape() as tape:
    z = my_softplus(x)

tape.gradient(z, [x]) # result is NaN

[<tf.Tensor: shape=(1,), dtype=float32, numpy=array([nan], dtype=float32)>]

Computing gradients using `my_softplus()` function leads to numerical difficulties because due to floating-point precision errors, autodiff ends up computing infinity divided by infinity (returning NaN).

The derivative of softplus function is $ 1 / (1 + 1 / \text{exp}(x)) $. 

So decorate with `@tf.custom_gradient` to tell TensorFlow to use this stable function when computing the gradients of `my_softplus()` function and making it return both its normal output and the function that computes the derivatives.

In [None]:
@tf.custom_gradient
def my_better_softplus(z):
    exp = tf.exp(z)
    def my_softplus_gradients(grad):
        return grad / (1 + 1 / exp)
    return tf.math.log(exp + 1), my_softplus_gradients

### 12.3.9 Custom Training Loops

Since `fit()` method only uses one optimizer (the one used when the model is compiled), `fit()` may not be flexible enough in some cases (eg. 1 optimizer for wide path, 1 optimizer for deep path) and would require writing a custom loop.

In [3]:
# Build a simple model
l2_reg = keras.regularizers.l2(0.05)
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="elu", kernel_initializer="he_normal",
                       kernel_regularizer=l2_reg)
])

In [4]:
def random_batch(X, y, batch_size=32):
    """Randomly sample a batch of instances from the training set."""
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]

def print_status_bar(iteration, total, loss, metrics=None):
    """Displays the training status, including the number of steps,
    the total number of steps, the mean loss since the start of the epoch,
    and other metrics."""
    metrics = " - ".join(["{}: {:.4f}".format(m.name, m.result())
                         for m in [loss] + (metrics or [])])
    end = "" if iteration < total else "\n"
    print("\r{}/{} - ".format(iteration, total) + metrics,
          end=end)

> Code Notes:

> - `{:.4f}` format a float with 4 digits after the decimal point.

> - $\backslash$ r (carriage return) along with `end=""` ensures stats bar gets printed on the same line.

In [6]:
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size # X_train is on housing set
optimizer = keras.optimizers.Nadam(lr=0.01)
loss_fn = keras.losses.mean_squared_error
mean_loss = keras.metrics.Mean()
metrics = [keras.metrics.MeanAbsoluteError()]

In [7]:
# Build custom loop
for epoch in range(1, n_epochs + 1):
    print("Epoch {}/{}".format(epoch, n_epochs))
    for step in range(1, n_steps + 1):
        X_batch, y_batch = random_batch(X_train_scaled, y_train)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch, training=True)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        mean_loss(loss)
        for metric in metrics:
            metric(y_batch, y_pred)
        print_status_bar(step * batch_size, len(y_train), mean_loss, metrics)
    print_status_bar(len(y_train), len(y_train), mean_loss, metrics)
    for metric in [mean_loss] + metrics:
        metric.reset_states()

Epoch 1/5
11610/11610 - mean: 2.3334 - mean_absolute_error: 1.0091
Epoch 2/5
11610/11610 - mean: 1.0631 - mean_absolute_error: 0.7386
Epoch 3/5
11610/11610 - mean: 1.0606 - mean_absolute_error: 0.7398
Epoch 4/5
11610/11610 - mean: 1.0299 - mean_absolute_error: 0.7259
Epoch 5/5
11610/11610 - mean: 1.0689 - mean_absolute_error: 0.7429


Code explanation:

1. (Lines 2 & 4): 2 nested loops for the epochs and the batches within an epoch.

2. (Line 5): Sample a random batch from the training set.

3. (Lines 6-9): Inside `tf.GradientTape()` block,
    - (Line 7): Make a prediction for one batch
    - (Line 8): Calculate the main loss. Since `loss_fn` contains `mean_squared_error()` which returns one loss per instance, compute the mean over the batch using `tf.reduce_mean()`.
    - (Line 9): Total loss is main loss + other losses (eg. regularization loss). Regularization is already reduced to a single scalar, so use `tf.add_n()`, which sums multiple tensors of the same shape and data type.

4. (Line 10): Ask tape to compute the gradient of the loss with regard to each **trainable** variable (not all variables!).

5. (Line 11): Apply to optimizer to perform a Gradient Descent step.

6. (Line 12): Update mean loss.

7. (Lines 13-14): Update metrics over the current epoch.

8. (Line 15): Display status bar.

9. (Line 16): At end of each epoch, display the status bar again to make it look complete and print a line feed.

10. (Lines 17-18): Reset the states of the mean loss and the metrics.

If you add weight constraints to your model (eg. by setting `kernel_constraint` or `bias_constraint` when creating a layer), you should update the training loop to apply these constraints just after `apply_gradients()`.

In [8]:
for variable in model.variables:
    if variable.constraint is not None:
        variable.assign(variable.constraint(variable))

> Note: This training loop does not handle layers that behave differently during training and testing (eg. `BatchNormalization` or `Dropout`). You need to call the model with `training=True` and make sure it propagates this to every layer that needs it.

## 12.4 TensorFlow Functions and Graphs

### 12.4.1 AutoGraph and Tracing

### 12.4.2 TF Function Rules