# Custom Models & Training with TensorFlow

Up until now, we've only used TensorFlow's high-level API, tf.keras, but it already got us pretty far: we built various neural network architectures, including regression & classification nets, wide & deep nets, & self-normalising nets, using all sorts of techniques, such as batch normalisation, dropout, & learning rate schedules. In fact, 95% of the use cases you will encounter will not require anything other than tf.keras. But now it's time to dive deeper into TensorFlow & take a look at its lower-level Python API. This will be useful when you need extra control to write custom loss functions, custom metrics, layers, models, initializers, regularizers, weight constraints, & more. You may need to fully control the training loop itself, for example to apply special transformation constraints to the gradients (beyond just clipping them) or to use multiple optimisers for different parts of the network. We will cover all these cases & look at how we can boost our custom models & training algorithms using TensorFlow's automatic graph generation feature. First, we'll take a quick tour of TensorFlow.

---

# A Quick Tour of TensorFlow

As you know, TensorFlow is a powerful library for numerical computation, particularly well suited & fine-tuned for large-scale machine learning (but you could use it for anything else that requires heavy computations). It was developed by the Google Brain team & it powers many of Google's large-scale services, such as Google Cloud Speech, Google Photos, & Google Search. It was open sourced in November 2015, & it is now the most popular deep learning library. Countless projects use TensorFlow for all sorts of machine learning tasks, such as image classification, natural language processing, recommender systems, & time series forecasting.

So what does TensorFlow offer? Here's a summary:

* Its core is very similar to numpy, but with GPU support.
* It supports distributed computing (across multiple devices & servers).
* It includes a kind of just-in-time (JIT) compiler that allows it to optimise computations for speed & memory usage. It works by extracting the *computation graph* from a Python function, then optimizing it (e.g., by pruning unused nodes), & running it efficiently (e.g., by automatically running independent operations in parallel).
* Computation graphs can be extorted to a portable format, so you can train a TensorFlow model in one environment (e.g, using Python on Linux) & run it in another (e.g., using Java on an Android device).
* It implements autodiff & provides some excellent optimizers, such as RMSProp & Nadam, so you can easily minimise all sorts of loss functions.

TensorFlow offers many more features built on top of these core featuers: the most important is of course `tf.keras`, but it also has data loading & preprocessing ops (`tf.data`, `tf.io`, etc.), image processing ops (`tf.image`), signal preprocessing ops (`tf.signal`), & more.

<img src = "Images/TensorFlow Python API.png" width = "600" style = "margin:auto"/>

At the lowest level, each TensorFlow operation (*op* for short) is implemented using highly efficiency C++ code. Many operations have multiple implementations called *kernels*: each kernel is dedicated to a specific device type, such as CPUs, GPUs, or even TPUs (*tensor processing units*). As you may know, GPUs can dramatically speed up computations by splitting them into many smaller chunks & running them in parallel across many GPU threads. TPUs are even faster: they are custom ASIC chips built specifically for deep learning operations.

TensorFlow's architecture is shown below. Most of the time, your code will use the high-level APIs (especially tf.keras & tf.data); but when you need more flexibility, you will use the lower-level Python API, handling tensors directly. Note that APIs for other languages are also available. In any case, TensorFlow's execution engine will take care of running the operations efficiently, even across multiple devices & machines if you tell it to.

<img src = "Images/TensorFlow's Architecture.png" width = "550" style = "margin:auto"/>

TensorFlow runs not only on Windows, Linux, & macOS, but also on mobile devices (using *TensorFlow Lites*, including both iOS & Android. If you do not want to use the Python API, there are C++, Java, Go, & Swift APIs. There is even a Javascript implmentation called *TensorFlow.js* that makes it possible to run your models directly in your browser.

There's more to TensorFlow than the library. TensorFlow is at the center of an extensive ecosystem of libraries. First, there's TensorBoard for visualisation. Next, there's TensorFlow Extended (TFX), which is a set of libraries built by Google to productionise TensorFlow projects: it includes tools for data validation, preprocessing, model analysis, & serving. Google's *TensorFlow Hub* provides a way to easily download & reuse pretrained neural networks. You can also get many neural network architectures, some of them pre-trained, in TensorFlow's model garden. Check out the TensorFlow Resources for more TensorFlow-based projects.

---

# Using TensorFlow like NumPy

TensorFlow's API revolves around *tensors*, which flow from operation to operation -- hence the name *TensorFlow*. A tensor is very similar to a NumPy `ndarray`: it is usually a multidimensional array, but it can also hold a scaler (a simple value, such as 42). These tensors will be important when we create custom cost functions, custom metrics, custom layers, & more, so let's see how to create & manipulate them.

## Tensors & Operations

You can create a tensor with `tf.constant()`. For example, here is a tensor representing a matrix with two rows & three columns of floats:

In [2]:
import tensorflow as tf

tf.constant([[1., 2., 3.], [4., 5., 6.]])

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [3]:
tf.constant(42)

<tf.Tensor: shape=(), dtype=int32, numpy=42>

Just like an `ndarray` a `tf.Tensor` has a shape & a data type (`dtype`):

In [4]:
t = tf.constant([[1., 2., 3.], [4., 5., 6.]])
t.shape

TensorShape([2, 3])

In [5]:
t.dtype

tf.float32

Indexing works much like in NumPy.

In [6]:
t[:, 1:]

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[2., 3.],
       [5., 6.]], dtype=float32)>

In [7]:
t[:, 1, tf.newaxis]

<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[2.],
       [5.]], dtype=float32)>

Most importantly, all sorts of tensor operations are available:

In [8]:
t + 10

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[11., 12., 13.],
       [14., 15., 16.]], dtype=float32)>

In [9]:
tf.square(t)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)>

In [10]:
t @ tf.transpose(t)

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[14., 32.],
       [32., 77.]], dtype=float32)>

Note that writing `t + 10` is equivalent to calling `tf.add(t, 10)` (indeed, Python calls the magic method `t._add__(10)`, which just calls `tf.add(t, 10)`. Other operators like - & * are also supported. The @ operator was added in Python 3.5, for matrix multiplication: it is equivalent to calling the `tf.matmul()` function.

You will find all the basic math operations you need (`tf.add()`, `tf.multiply()`, `tf.square()`, `tf.exp()`, `tf.sqrt()`, etc.) & more operations that you can find in NumPy (e.g., `tf.reshape()`, `tf.squeeze()`, `tf.tile()`). Some functions have a different name than in NumPy; for instance, `tf.reduce_mean()`, `tf.reduce_sum()`, `tf.reduce_max()`, & `tf.math.log()` are equivalent of `np.mean()`, `np.sum()`, `np.max()`, & `np.log()`. When the name differs, there is often a good reason for it. For example, in TensorFlow, you must write `tf.transpose(t)`; you cannot just write `t.T` like in NumPy. The reason is that the `tf.transpose()` function does not do exactly the same thing as NumPy's `T` attribute: in TensorFlow, a new tensor is created with its own copy of the transposed data, while in NumPy, `t.T` is just a transposed view on the same data. Similarly, the `tf.reduce_sum()` operation is named this way because its GPU kernel (i.e., GPU implementation) uses a reduce algorithm that does not guarantee the order in which the elements are added: because 32-bit floats have limited precision, the result may change slightly every time you call this operation. The same is true of `tf.reduce_mean()` (but of course `tf.reduce_max()` is deterministic).

## Tensors & NumPy

Tensors play nice with NumPy: you can create a tensor from a NumPy array, & vice versa. You can even apply TensorFlow operations to NumPy arrays & NumPy operations to tensors:

In [11]:
import numpy as np

a = np.array([2., 4., 5.])
tf.constant(a)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([2., 4., 5.])>

In [12]:
t.numpy()

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

In [13]:
tf.square(a)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([ 4., 16., 25.])>

In [14]:
np.square(t)

array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)

## Type Conversions

Type conversion can significantly hurt performance, & they can easily go unnoticed when they are done automatically. To avoid this, TensorFlow does not perform any type conversions automatically: it just raises an exception if you try to execute an operation on tensors with incompatible types. For example, you cannot add a flow tensor & an integer tensor, & you canot even add a 32-bit float & a 64-bit float.

In [15]:
tf.constant(2.) + tf.constant(40)

InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a int32 tensor [Op:AddV2]

In [16]:
tf.constant(2.) + tf.constant(40., dtype = tf.float64)

InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a double tensor [Op:AddV2]

This may be a bit annoying at first, but remember that it's for a good cause. & of course, you can use `tf.cast()` when you really need to convert types:

In [17]:
t2 = tf.constant(40., dtype = tf.float64)
tf.constant(2.0) + tf.cast(t2, tf.float32)

<tf.Tensor: shape=(), dtype=float32, numpy=42.0>

## Variables

The `tf.Tensor` values we've seen so far are immutable: you cannot modify them. This means that we cannot use regular tensors to implement weights in a neural network, since they need to be tweaked by backpropagation. Plus, other parameters may also need to change over time (e.g., a momentum optimizer keeps track of past gradients). What we need is a `tf.Variable`.

In [18]:
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
v

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

A `tf.Variable` acts much like a `tf.Tensor`: you can perform the same operations with it, it plays nicely with NumPy as well, & it is just as picky with types. But it can also be modified in place using the `assign()` method (or `assign_add()` or `assign_sub()`, which increment or decrement the variable by the given value). You can also modify individual cells (or slices), by using the cell's (or slice's) `assign()` method (direct item assignment will not work) or by using the `scatter_update()` or `scatter_nd_update()` methods:

In [19]:
v.assign(2 * v)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [20]:
v[0, 1].assign(42)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [21]:
v[:, 2].assign([0., 1.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  0.],
       [ 8., 10.,  1.]], dtype=float32)>

In [22]:
v.scatter_nd_update(indices = [[0, 0], [1, 2]], updates = [100., 200.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[100.,  42.,   0.],
       [  8.,  10., 200.]], dtype=float32)>

## Other Data Structures

TensorFlow supports several other data structures, including the following:

* *Sparse Tensors* (`tf.SparseTensor`)
   - Efficiently represent tensors containing mostly zeros. The `tf.sparse` package contains operations for sparse tensors.
* *Tensor Arrays* (`tf.TensorArray`)
   - Are lists of tensors. They have a fixed size by default but can optionally be made dynamic. All tensors they contain must have the same shape & data type.
* *Ragged Tensors* (`tf.RaggedTensor`)
   - Represent static lists of lists of tensors, where every tensor has the same shape & data type. The `tf.ragged` package contains operations for ragged tensors.
* *String Tensors*
   - Are regular tensors of type `tf.string`. These represent byte strings, not Unicode strings, so if you create a string tensor using a Unicode string (e.g., a regular python 3 string like `"cafe"`), then it will get encoded to UTF-8 automatically (e.g., `b"caf\xc3\xa9"`). Alternatively, you can represent Unicode strings using tensors of type `tf.int32`, where each item represents a Unicode code point (e.g., `[99, 97, 102, 233]`). The `tf.strings` package (with an `s`) contains ops for byte strings & Unicode strings (& to convert one into other). It's important to note that a `tf.string` is atomic, meaning that its length does not appear in the tensor's shape. once you convert it to a Unicode tensor (i.e., a tensor of type `tf.int32` holding Unicode code points), the length appears in the shape.
* *Sets*
   - Are represented as regular tensors (or sparse tensors). For example, `tf.constant([[1, 2], [3, 4]])` represents the two sets {1, 2} & {3, 4}. More generally, each set is represented by a vector in the tensor's last axis. You can manipulate sets using operations for the `tf.sets` pacakge.
* *Queues*
   - Store tensors across multiple steps. TensorFlow offers various kinds of queues: simple First IN, First Out (FIFO) queues (FIFOQueue), queues that can prioritise some items (`PriorityQueue`), shuffle their items (`RandomShuffleQueue`), & batch items of different shapes by padding (`PaddingFIFOQueue`). These classes are all in the `tf.queue` package.
   
With tensors, operations, variables, & various data structures at your disposal, you are now ready to customise your models & training algorithms

---

# Customising Models & Training Algorithms

Let's start by creating a custom loss function, which is a simple & common use case.

## Custom Loss Functions

Suppose you want to train a regression model, but your training set is a bit noisy. Of course, you start by trying to clean up your dataset by removing or fixing the outliers, but that turns out to be insufficient; the dataset is still noisy. Which loss function should you use? The mean squared error might penalise large errors too much & cause your model to be imprecise. The mean absolute error would not penalise outliers as much, but training might take a while to converge, & the trained model might not be very precise. This is probably a good time to use the Huber loss instead of the good old MSE. The Huber loss is not currently part of the official Keras API, but it is available in tf.keras (just use an instance of the `keras.losses.Huber` class). But let's pretend it's not there: implementing it is easy. Just create a function that takes the labels & predictions as arguments, & use TensorFlow operations to compute every instance's loss:

In [23]:
def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)

It is also preferable to return a tensor containing one loss per instance, rather than returning the mean loss. This way, Keras can apply class weights or sample weights when requested.

Now, you can use this loss when you compile & train a Keras model:

In [24]:
from tensorflow import keras
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target.reshape(-1, 1))
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

keras.backend.clear_session()

checkpoint_callback = keras.callbacks.ModelCheckpoint("my_model_with_a_custom_loss.h5", save_best_only = True)

input_shape = X_train.shape[1:]

model = keras.models.Sequential([
    keras.layers.Dense(30, activation = "selu", kernel_initializer = "lecun_normal",
                       input_shape = input_shape),
    keras.layers.Dense(1),
])

model.compile(loss = huber_fn, 
              optimizer = "nadam",
              metrics = ["mae"])
model.fit(X_train, y_train, epochs = 15,
          validation_data = (X_val, y_val),
          callbacks = [checkpoint_callback])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7fb415005910>

That's it. For each batch during training, Keras will call the `huber_fn()` function to compute the loss & use it to perform a Gradient Descent step. Moreover, it will keep track of the total loss since the beginning of the epoch, & it will display the mean loss.

But what happens to this custom loss when you save the model?

## Saving & Loading Models That Contain Custom Components

Saving a model containing a custom loss function works fine, as Keras saves the name of the function. When you load it, you'll need to provide a dictionary that maps the function name to the actual function. More generally, when you load a model containing custom objects, you need to map the names to the objects:

In [25]:
model = keras.models.load_model("my_model_with_a_custom_loss.h5",
                                custom_objects = {"huber_fn": huber_fn})

With the current implementation, any error between -1 & 1 is considered "small". But what if you want a different threshold? One solution is to create a function that creates a configured loss function:

In [26]:
def create_huber(threshold = 1.0):
    def huber_fn(y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < threshold
        squared_loss = tf.square(error) / 2
        linear_loss = threshold * tf.abs(error) - threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    return huber_fn

checkpoint_callback = keras.callbacks.ModelCheckpoint("my_model_with_a_custom_loss_threshold2.h5", save_best_only = True)

model.compile(loss = create_huber(2.0), optimizer = "nadam")
model.fit(X_train, y_train, epochs = 15,
          validation_data = (X_val, y_val),
          callbacks = [checkpoint_callback])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7fb4150dcd60>

Unfortunately, when you save the model, the `threshold` will not be saved. This means that you will have to specify the `threshold` value when loading the model (note that the name to use is `"huber_fn"`, which is the name of the function you gave Keras, not the name of the function that created it):

In [27]:
model = keras.models.load_model("my_model_with_a_custom_loss_threshold2.h5",
                                custom_objects = {"huber_fn": create_huber(2.0)})

You can solve this by creating a subclass of the `keras.losses.Loss` class, & then implementing its `get_config()` method:

In [28]:
class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold = 1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs) 
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2 
        return tf.where(is_small_error, squared_loss, linear_loss)
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

Let's walk through this code:

* The constructor accepts `**kwargs` & passes them to the parent constructor, which handles standard hyperparameters: the `name` of the loss & the `reduction` algorithm to use to aggregate the individual instance losses. By default, it is `"sum_over_batch_size"`, which means that the loss will be the sum of the instance losses, weighted by the sample weights, if any, & divided by the batch size (not by the sum of weights, so this is not the weighted mean). Other possible values are `"sum"` & `"none"`.
* The `call()` method takes the labels & predictions, computes all the instance losses, & returns them.
* The `get_config()` method returns a dictionary mapping each hyperparameter name to its value. It first calls the parent class's `get_config()` method, then addes the new hyperparameters to this dictionary.

You can then use any instance of this class when you compile the model:

In [29]:
checkpoint_callback = keras.callbacks.ModelCheckpoint("my_model_with_a_custom_loss_class.h5",
                                                      save_best_only = True)

model.compile(loss = HuberLoss(2.0), optimizer = "nadam")
model.fit(X_train, y_train, epochs = 15,
          validation_data = (X_val, y_val),
          callbacks = [checkpoint_callback])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7fb3f67a29d0>

In [30]:
model = keras.models.load_model("my_model_with_a_custom_loss_class.h5",
                                custom_objects = {"HuberLoss": HuberLoss})

When you save a model, Keras calls the loss instance's `get_config()` method & saves the config as JSON in the HDF5 file. When you load the model, it calls the `from_config()` class method on the `HuberLoss` class: this method is implemented by the base class (`Loss`) & creates an instance of the class, passing `**config` to the constructor.

That's it for losses. Pretty simple. Almost as simple as customer activation functions, initialisers, regularisers, & constraints.

## Custom Activation Functions, Initialisers, Regularisers, & Constraints

Most keras functionalities, such as losses, regularisers, constraints, initialisers, metrics, activation functions, layers, & even full models, can be customised in very much the same way. Most of the time, you will just need to write a simple function with the appropriate inputs & outputs. Here are examples of a custom activation function (equivalent to `keras.activations.softplus()` or `tf.nn.softplus()`), a custom Glorot initialiser (equivalent to `keras.initializers.glorot_normal()`), a custom $l_1$ regulariser (equivalent to `keras.regularizers.l1(0.01)`), & a custom constraint that ensures weights are all positive (equivalent to `keras.constraints.nonneg()` or `tf.nn.relu`):

In [31]:
def my_softplus(z):
    return tf.math.log(tf.exp(z) + 1.0)

def my_glorot_initialiser(shape, dtype = tf.float32):
    stddev = tf.sqrt(2.0 / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev = stddev, dtype = dtype)

def my_l1_regulariser(weights):
    return tf.reduce_sum(tf.abs(0.01 * weights))

def my_positive_weights(weights):
    return tf.where(weights < 0.0, tf.zeros_like(weights), weights)

As you can see, the argument depend on the type of custom function. These custom functions can then be used normally; for example:

In [32]:
layer = keras.layers.Dense(30, activation = my_softplus, 
                           kernel_initializer = my_glorot_initialiser, 
                           kernel_regularizer = my_l1_regulariser,
                           kernel_constraint = my_positive_weights)

The activation function will be applied to the output of this `Dense` layer, & its result will be passed on the next layer. The layer's weights will be initialised using the value returned by the initialiser. At each training step, the weights will be passed to the regularisation function to compute the regularisation loss, which will be added to the main loss to get the final loss used for training. Finally, the constraint function will be called after each training step, & the layer's weights will be replaced by the constrained weights.

If a function has hyperparameters that need to be saved along with the model, then you will want to subclass the appropriate class, such as `keras.regularizers.Regularizer`, `keras.constraints.Constraints`, `keras.initializers.Initializer`, or `keras.layers.Layer` (for any layer, including activation functions). Much like we did for the custom loss, here is a simple class for $l_1$ regularisation that saves its `factor` hyperparameter (this time, we do not need to call the parent constructor or the `get_config()` method, as they are not defined by the parent class):

In [33]:
class myL1Regulariser(keras.regularizers.Regularizer):
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, weights):
        return tf.reduce_sum(tf.abs(self.factor * weights))
    def get_config(self):
        return {"factor": self.factor}

Note that you must implement the `call` method for losses, layers (including activation functions), & models, or the `__call__` method for regularisers, initialisers, & constraints. For metrics, things are a bit different.

## Custom Metrics

Losses & metrics are conceptually not the same thing: losses (e.g., cross entropy) are used by gradient descent to *train* a model, so they must be differentiable (at least where they are evaluated), & there gradients should not be 0 everywhere. Plus it's okay if htey are not easily interpretable by humans. In contrast, metrics (e.g., accuracy) are used to *evaluate* a model: they must be more easily interpretable, & they can be non-differentiable or have 0 gradients everywhere.

That said, in most cases, defining a custom metric function is exactly the same as defining a custom loss function. In fact, we could even use the Huber loss function we created earlier as a metric; it would work just fine (& persistence would also work the same way, in this case only saving the name of the function, `"huber_fn"`):

In [34]:
model.compile(loss = "mse", optimizer = "nadam", metrics = [create_huber(2.0)])


For each batch during training, Keras will compute this metric & keeep track of its mean since the beginning of the epoch. Most of the time, this is exactly what you want. But not always! Consider a binary classifier's precision, for example. As we saw before, precision is the number of true positives divided the number of positive predictions (including both true positives & false positives). Suppose the model made 5 positive predictions in the first batch, 4 of which were correct: that's 80% precision. Then suppose the model made 3 positive predictions in the second batch, but they were all incorrect: that's 0% precision for the second batch. If you just compute the mean of these two precisions, you get 40%. But wait a second -- that's not the model's precision over these two batches! Indeed, there were a total of 4 true positives (4 + 0) out of eight positive predictions (5 + 3), so the overal prediction is 50%, not 40%. What we need is an object that can keep track of the number of true positives & the number of false positives & that can compute their ratio when requested. This is precisely what the `keras.metrics.Precision` class does:

In [35]:
precision = keras.metrics.Precision()
precision([0, 1, 1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0, 1])

<tf.Tensor: shape=(), dtype=float32, numpy=0.8>

In [36]:
precision([0, 1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0])

<tf.Tensor: shape=(), dtype=float32, numpy=0.5>

In this example, we created a `Precision` object, then we used it like a function, passing it the labels & predictions for the first batch, then the second batch (note that we could also have passed sample weights). We used the same number of true & false positives as in the example we just discussed. After the first batch, it returns a precision of 80%; then after the second batch, it returns 50% (which is the overall precision so far, not the second batch's precision). This is called a *streaming metric* (or *stateful metric*), as it is gradually updated, batch after batch.

At any point, we can call the `result()` method to get the current value of the metric. We can also look at its variables (tracking the number of true & false positives) by using the `variables` attribute, & we can reset these variables using the `reset_states()` method:

In [37]:
precision.result()

<tf.Tensor: shape=(), dtype=float32, numpy=0.5>

In [38]:
precision.variables

[<tf.Variable 'true_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>,
 <tf.Variable 'false_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>]

In [39]:
precision.reset_states()

If you need to create such a streaming metric, create a subclass of the `keras.metrics.Metric` class. Here is a simple example that keeps track of the total Huber loss & the number of instances seen so far. When asked for the result, it returns the ratio, which is simply the mean Huber loss:

In [40]:
class HuberMetric(keras.metrics.Metric):
    def __init(self, threshold = 1.0, **kwargs):
        super().__init__(**kwargs)
        self.threshold = threshold
        self.huber_fn = create_huber(threshold)
        self.total = self.add_weight("total", initializers = "zeros")
        self.count = self.add_weight("count", initializers = "zeros")
    def update_state(self, y_true, y_pred, sample_weight = None):
        metric = self.huber_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(metric))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))
    def result(self):
        return self.total / self.count
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "threshold":self.threshold}

Let's walk through this code:

* The constructor uses the `add_weight()` method to create the variables needed to keep track of the metrics's state over multiple batches -- in this case, the sum of all Huber losses (`total`) & the number of instances seen so far (`count`). You could just create variables manually if you preferred. Keras tracks any `tf.Variable` that is set as an attribute (& more generally, any "trackable" object, such as layers or models).
* The `update_state()` method is called when you use an instance of this class as a function (as we did with the `Precision` object). It updates the variables, given the labels & predictions for one batch (& sample weights, but in this case we ignore them).
* The `result()` method computes & returns the final result, in this case the mean Huber metric over all instance. When you use the metric as a function, the `updated_state()` method gets called first, then the `result()` method is called, & its output is returned.
* We also implement the `get_config()` method to ensure the `threshold` gets saved along with the model.
* The default implementation of the `reset_states()` method resets all variables to 0.0 (but you can override it if needed).

When you define a metric using a simple function, keras automatically calls it for each batch, & it keeps track of the mean during each epoch, just like we did manually. So the only benefit of our `HuberMetric` class is that the `threshold` will be saved. But of course, some metrics, like precision, cannot simply be averaged over batches: in those cases, there's no other option than to implement a streaming metric.  

Now that we have built a streaming metrics, building a custom layer will seem like a walk in the park

## Custom Layers

You may occasionally want to build an architecture that contains an exotic layer for which TensorFlow does not provide a default implementation. In this case, you will need to create a custom layer. Or you may simply want to build a very repetitive architecture, containing identical blocks of layers repeated many times, & it would be convenient to treat each block of layers as a single layer. For example, if the model is a sequence of layers A, B, C, A, B, C, A, B, C, then you might want to define a custom layer D containing layers A, B, C, so your model would simple be D, D, D. Let's see how to build customer layers.

First, som layers have no weights, such as `keras.layers.Flatten` or `keras.layers.ReLU`. If you want to create a custom layer without any weights, the simplest option is to write a function & wrap it in a `keras.layers.Lambda` layer. For example, the following layer will apply the exponential function to its inputs:

In [41]:
exponential_layer = keras.layers.Lambda(lambda x: tf.exp(x))

This custom layer can then be used like any other layer, using the sequential API, the functional API, or the subclassing API. You can also use it as an activation function (or you could use `activation = tf.exp`, `activation = keras.activations.exponential`, or simply `activation = "exponential"`). The exponential layer is sometimes used in the output layer of a regression model when the values to predict have very different scales (e.g., 0.001, 10., 1,000.).

As you've probably guessed by now, to build a custom stateful layer (i.e., a layer with weights), you need to create a subclass of the `keras.layers.Layer` class. For example, the following class implements a simplified version of the `Dense` layer:

In [42]:
class myDense(keras.layers.Layer):
    def __init__(self, units, activation = None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = keras.activations.get(activation)
    def build(self, batch_input_shape):
        self.kernel = self.add_weight(name = "kernel", shape = [batch_input_shape[-1], self.units],
                                      initializer = "glorot_normal")
        self.bias = self.add_wieght(name = "bias", shape = [self.units], initializer = "zeros")
        super().build(batch_input_shape)
    def call(self, X):
        return self.activation(X @ self.kernel + self.bias)
    def compute_output_shape(self, batch_input_shape):
        return tf.TensorShape(batch_input_shape.as_list()[:-1] + [self.units])
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "units": self.units,
                "activation": keras.activations.serialize(self.activation)}

Let's walk through this code:

* The constructor takes all the hyperparameters as argument (in this example, `unitss` & `activation`), & importantly it also takes a `**kwargs` argument. It calls the parent constructor, passing it the `kwargs`: this takes care of standard arguments such as `input_shape`, `trainable`, & `name`. Then it saves the hyperparameters as attributes, converting the `activation` arguments to the appropriate activation function using the `keras.activations.get()` function (it accepts functions, standard strings like `"relu"` or `"selu"`, or simply `None`).
* The `build()` method's role is to create the layer's variables by calling the `add_weight()` method for each weight. The `build()` method is called the first time the layer is used. At that point, Keras will know the shape of this layer's inputs, & it will pass it to the `build()` method, which is often necessary to create some of the weights. For example, we need to know the number of neurons in the previous layer in order to create the connection weights matrix (i.e., the `"kernel"`): this corresponds to the size of the last dimension of the inputs. At the end of the `build()` method (& only at the end), you must call the parent's `build()` method: this tells Keras that the layer is built (it just sets `self.built = True`).
* The `call()` method performs the desired operations. In this case, we compute the matrix multiplication of the inputs `X` & the layer's kernel, we add the bias vector, & we apply the activation function to the result, & this gives us the output of the layer.
* The `compute_output_shape()` method simply returns the shape of this layer's outputs. In this case, it is the same shpae as the inputs, except the last dimension is replaced with the number of neurons in the layer. Note that in tf.keras, shapes are instances of the `tf.TensorShape` class, which you can convert to Python lists using `as_list()`.
* The `get_config()` method is just like in the previous custom classes. Note that we save the activation function's full configuration by calling `keras.activations.serialize().`

You can now use a `myDense` layer just like any other layer.

To create a layer with multiple inputs (e.g., `Concatenate`), the argument to the `call()` method should be a tuple containing all the input, & similarly the argument to the `compute_output_shape()` method should be a tuple containing each input's batch shape. To create a layer with multiple outputs, the `call()` method should return the list of outputs, & `compute_output_shape()` should return the list of batch output shapes (one per output). For example, the following toy layer takes two inputs & returns three outputs:

In [43]:
class myMultiLayer(keras.layers.Layer):
    def call(self, X):
        X1, X2 = X
        return [X1 + X2, X1 * X2, X1 / X2]
    def compute_output_shape(self, batch_input_shape):
        b1, b2 = batch_input_shape
        return [b1, b1, b1]

This layer may now be used like any other layer, but of course only using the functional & subclassing APIs, not the sequential API (which only accepts layers with one input & one output).

If your layer needs to have a different behavior during training & during testing (e.g., if it uses `Dropout` or `BatchNormalization` layers), then you must add a `training` argument to the `call()` method & use this argument to decide what to do. For example, let's create a lyer that adds Gaussian noise during training (for regularisation) but does nothing during testing (Keras has a layer that does the same thing, `keras.layers.GaussianNoise`):

In [44]:
class myGaussianNoise(keras.layers.Layer):
    def __init__(self, stddev, **kwargs):
        super().__init__(**kwargs)
        self.stddev = stddev
    def call(self, X, training = None):
        if training: 
            noise = tf.random.normal(tf.shape(X), stddev = self.stddev)
            return X + noise
        else:
            return X
    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape

With that, you can now build any custom layer you need. Now let's create custom models.

## Custom Models

We already looked at creating custom model classes before, when we discussed the subclassing API. It's straightforward: subclass the `keras.Model` class, create layers & variables in the constructor, & implement the `call(0` method to do whatever you want the model to do. Suppose you want to build the model represented below.

<img src = "Images/Custom Model Example.png" width = "500" style = "margin:auto"/>

The inputs go through a first dense layer, then through a *residual block* composed of two dense layers & an additional operation, then through this same residual block three more times, then through a second residual black, & the final result goes through a dense output layer. Note that this model does not make much sense; it's just an example to illustrate the fact that you can easily build any kind of model you want, even on that contains loops & skip connections. To implement this model, it is best to first create a `ResidualBlock` layer, since we are going to create a couple of identical blocks (& we might want to reuse it in another model):

In [45]:
class residualBlock(keras.layers.Layer):
    def __init__(self, n_layers, n_neurons, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(n_neurons, activation = "elu",
                                          kernel_initializer = "he_normal")
                       for _ in range(n_layers)]
    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return inputs + Z

This layer is a bit special since it contains other layers. This is handled transparently by keras: it automatically detects that the `hidden` attribute contains trackable objects (layers in this case), so their variables are automatically added to this layer'slist of variables. The rest of this class is self-explanatory. Next, let's use the subclassing API to define the model itself:

In [46]:
class residualRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(30, activation = "elu",
                                          kernel_initializer = "he_normal")
        self.block1 = residualBlock(2, 30)
        self.block2 = residualBlock(2, 30)
        self.out = keras.layers.Dense(output_dim)
    def call(self, inputs):
        Z = self.hidden1(inputs)
        for _ in range (1 + 3):
            Z = self.block1(Z)
        Z = self.block2(Z)
        return self.out(Z)

We create the layers in the constructor & use them in the `call()` method. This model can then be used like any other model (compile it, fit it, evaluate it, & use it to make predictions). If you also want to be able to save the model using the `save()` method & load it using the `keras.models_load_model()` function, you must implement the `get_config()` method in both the `residualBlock` class & the `residualRegressor` class. Alternatively, you can save & load the weights using the `save_weights()` & `load_weights()` methods.

The `Model` class is a subclass of the `Layer` class, so models can be defined & used exactly like layers. But a model has some extra functionalities, including of course its `compile()`, `fit()`, `evaluate()`,  & `predict()` methods (& a few variants), plus the `get_layers()` method (which can return any of the model's layers by name or by index) & the `save()` method (& support for `keras.models.load_model()` & `keras.models.clone_model()`).

With that, you can naturally & concisely build almost any model that you find in a paper, using the sequentail API, the functional API, the subclassing API, or even a mix of these. "Almost" any model? Yes, there are still a few things that we need to look at: first, how to define losses or metrics based on model internals, & second, how to build a custom training loop.

## Losses & Metrics Based on Model Internals

The custom losses & metrics we defined earlier were all based on labels & the predictions (& optionally sample weights). There will be times when you want to define losses based on other parts of your model, such as the weights or activations of its hidden layers. This may be useful for regularisation purposes or to monitor some internal aspect of your model.

To define a fustom loss based on model internals, compute it based on any part of the model you want, then pass the result to the `add_loss()` method. For example, let's build a custom regression MLP model composed of a stack of five hidden layers plus an output layer. This custom model will also have an auxiliary output on top of the upper hidden layer. The loss associated to this auxiliary output will be called the *reconstruction loss*: it is the mean squared difference between the reconstruction & the inputs. By adding this reconstruction loss to the main loss, we will encourage the model to preserve as much information as possible through the hiddne layers -- even information that is not directly useful for the regression task itself. In practice, this loss sometimes improves genralisation (it is a regularisation loss). Here is the code for this custom model with a custom reconstruction loss:

In [47]:
class ReconstructingRegressor(keras.Model): 
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(30, activation="selu",
                                      kernel_initializer="lecun_normal")
                       for _ in range(5)]
        self.out = keras.layers.Dense(output_dim)
def build(self, batch_input_shape):
    n_inputs = batch_input_shape[-1] 
    self.reconstruct = keras.layers.Dense(n_inputs) 
    super().build(batch_input_shape)
def call(self, inputs): 
    Z = inputs
    for layer in self.hidden: 
        Z = layer(Z)
    reconstruction = self.reconstruct(Z)
    recon_loss = tf.reduce_mean(tf.square(reconstruction - inputs)) 
    self.add_loss(0.05 * recon_loss)
    return self.out(Z)

Let's go through the code:

* The constructor creates the DNN with 5 dense hidden layers & one dense output layer.
* The `build()` method creates an extra dense layer which will be used to reconstruct the inputs of the model. It must be created here because its number of units must be equal to the number of inputs, & this number is unknown before the `build()` method is called.
* The `call()` method processes the inputs through all five hidden layers, then passes the result through the reconstruction layer, which produces the reconstruction.
* Then the `call()` method computes the reconstruction loss (the mean squared difference between the reconstruction & the inputs), & adds it to the model's list of losses using the `add_loss()` method. Notice that we scale down the reconstruction loss by multiplying it by 0.05 (this is a hyperparameter you can tune). This ensures that the reconstruction loss does not dominate the main loss.
* Finally, the `call()` method passes the output of the hidden layers to the output layer & returns its output.

Similarly, you can add a custom metric based on model internals by computing it in anyway you want, as long as the result in the output of a metric object. For example, you can create a `keras.metrics.Mean` object in the constructor, then call it in the `call()` method, passing it the `recon_loss`, & finally add it to the model by calling the model's `add_metric()` method. This way, when you train the model, keras will display both the mean loss over each epoch (the loss is the sum of the main loss plus 0.05 times the reconstruction loss) & the mean reconstruction error over each epoch. Both will go down during training.

In over 99% of cases, everything we have discussed will be sufficient to implement whatever model you want to build,e ven with complex architectures, losses, & metrics. However, in some rare cases, you may need to customise the training loop itself. Before we get there, we need to look at how to compute gradients automatically in TensorFlow.

## Computing Gradients Using Autodiff

To understand how to use autodiff to compute gradients automatically, let's consider a toy function:

In [48]:
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

If you know calculus, you can analytically find that the partial derivative of this function with regard to `w1` is `6 * w1 + 2 * w2`. You can also find that its partial derivative with regard to `w2` is `2 * w1`. For example, at the point `(w1, w2) = (5, 3)`, these partial derivatives equal to 36 & 10, respectively, so the gradient vector at this point is (36, 10). but if this were a neural network, the function would be much more complex, typically with tens of thousands of parameters, & finding the partial derivative analytically by hand would be an almost impossible task. One solution could be to compute an approximation of each partial derivative by measuring how much the function's output changes when you tweak the corresponding parameter:

In [49]:
w1, w2 = 5, 3
eps = 1e-6
(f(w1 + eps, w2) - f(w1, w2)) / eps

36.000003007075065

In [50]:
(f(w1, w2 + eps) - f(w1, w2)) / eps

10.000000003174137

Looks about right. This works rather well & is easy to implement, but it is just an approximation, & importantly, you need to call `f()` at least once per parameter (not twice, since we could compute `f(w1, w2)` just once). Needing to call `f()` at least once makes this approach intractable for large neural networks. So instead, we should use autodiff. TensorFlow makes this pretty simple:

In [53]:
w1, w2 = tf.Variable(5.0), tf.Variable(3.0)
with tf.GradientTape() as tape:
    z = f(w1, w2)
gradients = tape.gradient(z, [w1, w2])

We first define two variables `w1` & `w2`, then we create a `tf.GradientTape` context that will automatically record every operation that involves a variable, & finally we ask this tape to compute the gradients of the result `z` with regard to both variables `[w1, w2]`. Let's take a look at the gradeints that TensorFlow computed:

In [54]:
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

Perfect! Not only is the result accurate (the precision is only limited by the floating-point errors), but the `gradient()` method only goes through the recorded computations once (in reverse order), no matter how many variables there are, so it is incredibly efficient. It's like magic.

The tape is automatically erased immediately after you call its `gradient()` method, so you will get an exception if you try to call `gradient()` twice:

In [55]:
with tf.GradientTape() as tape:
    z = f(w1, w2)
gradients = tape.gradient(z, w1)
gradients = tape.gradient(z, w2)

RuntimeError: A non-persistent GradientTape can only be used to compute one set of gradients (or jacobians)

If you need to call `gradient()` more than once, you must make the tape persistent & delete it each time you are done with it to free resources:

In [56]:
with tf.GradientTape(persistent = True) as tape:
    z = f(w1, w2)
    
dz_dw1 = tape.gradient(z, w1)
dz_dw1

<tf.Tensor: shape=(), dtype=float32, numpy=36.0>

In [57]:
dz_dw2 = tape.gradient(z, w2)
dz_dw2

<tf.Tensor: shape=(), dtype=float32, numpy=10.0>

In [59]:
del tape

By default, the tape will only track operations involving variables, so if you try to compute the gradient of `z` with regard to anything other than a variable, the result will be `None`:

In [60]:
c1, c2 = tf.constant(5.0), tf.constant(3.0)
with tf.GradientTape() as tape:
    z = f(c1, c2)
    
gradients = tape.gradient(z, [c1, c2])

However, you can force the tape to watch any tensors you like, to record every operation that involves them. You can then compute gradients with regard to these tensors, as if they were variables:

In [61]:
with tf.GradientTape() as tape:
    tape.watch(c1)
    tape.watch(c2)
    z = f(c1, c2)
    
gradients = tape.gradient(z, [c1, c2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

This can be useful in some cases, like if you want to implement a regularisation loss that penalises activations that vary a lot when the inputs vary little: the loss will be based on the gradient of the activations with regard to the inputs. Since the inputs are not variables, you need to tell the tape to watch them.

Most of the time, a gradient tape is used to compute the gradients of a single value (usually the loss) with regard to a set of values (usually the model parameters). This is where reverse-mode autodiff shines, as it just needs to do one forward pass & one reverse pass to get all the gradients at once. If you try to compute the gradients of a vector, for example, a vector containing multiple losses, then TensorFlow will compute the gradients of the vector's sum. So if you ever need to get the individual gradients (e.g., the gradients of each loss with regard to the model parameters), you must call the tape's `jacobian()` method: it will perform reverse-mode autodiff once for each loss in the vector (all in parallel by default). It is even possible to compute second-order partial derivatives (the Hessians, i.e., the partial derivatives of partial derivatives), but this is rarely needed in practice.

In some cases, you may want to stop the gradients from backpropagating through some part of your neural network. To do this, you must use the `tf.stop_gradient()` function. The unction returns its inputs during the forward pass (like `tf.identity()`), but it does not let gradients through during backpropagation (it acts like a constant):

In [63]:
def f(w1, w2):
    return 3 * w1 ** 2 + tf.stop_gradient (2 * w1 * w2)

with tf.GradientTape() as tape:
    z = f(w1, w2)
    
gradients = tape.gradient(z, [w1, w2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=30.0>, None]

Finally, you may occasionally run into some numerical issues when computing gradients. For example, if you compute the gradients of the `my_softplus()` function for large inputs, the result will be NaN:

In [64]:
x = tf.Variable([100.0])
with tf.GradientTape() as tape:
    z = my_softplus(x)
    
tape.gradient(z, [x])

[<tf.Tensor: shape=(1,), dtype=float32, numpy=array([nan], dtype=float32)>]

This is because computing the gradients of this function using autodiff leads to some numerical difficulties: due to floating-point precision errors, autodiff ends up computing infinity divided by infinity (which returns NaN). Fortunately, we can analytically find that the derivative of the softplus function is just $\frac{1}{1 + \frac{1}{e^x}}$, which is numerically stable. Next, we can tell TensorFlow to use this stable function when computing the gradients of the `my_softplus()` function by decorating it with `@tf.custom_gradient` & making it return both its normal output & the function that computes the derivatives (not that it will receive as input the gradients that were backpropagated so far, down to the softplus function; & according to the chain rule, we should multiply them with this function's gradients):

In [65]:
@tf.custom_gradient
def my_better_softplus(z):
    exp = tf.exp(z)
    def my_softplus_gradients(grad):
        return grad / (1 + 1 / exp)
    return tf.math.log(exp + 1), my_softplus_gradients

Now when we compute the gradients of the `my_better_softplus()`, function, we get the proper result, even for large input values (however, the main output still explodes because of the exponential; one workaround is to use `tf.where()` to return the inputs when they are large).

Congratulations! You can now compute the gradients of any function (provided it is differentiable at the point where you compute it), even blocking backpropagation when needed, & write your own gradient functions! This is probably more flexibility than you will eer need, even if you build your own custom training loops, as we'll see now.

## Custom Training Loops

In some rare cases, the `fit()` method may not be flexible enough for what you need to do. For example, recall the wide & deep model in previous lessons that uses two different optimisers: one for the wide path & another for the deep path. Since the `fit()` method only uses one optimiser (the one that we specify when compuling the model), implementing this paper requires writing your own custom loop.

You may also like to write custom training loops simply to feel more confident that they do precisely what you intend them to do (perhaps you are unsure about some details of the `fit()` method). It can sometimes feel safer to make everything explicity. However, remembers that writing a custom training loop will make your code longer, more error-prone, & harder to maintain.

First, let's build a simple model. No need to compile it, since we will handle the training loop manually:

In [68]:
keras.backend.clear_session()

l2_reg = keras.regularizers.l2(0.05)
model = keras.models.Sequential([
    keras.layers.Dense(30, activation = "elu", kernel_initializer = "he_normal",
                       kernel_regularizer = l2_reg),
    keras.layers.Dense(1, kernel_regularizer = l2_reg)
])

Next, let's create a tiny function that will randomly sample a batch of instances from the training set.

In [70]:
def random_batch(X, y, batch_size = 32):
    idx = np.random.randint(len(X), size = batch_size)
    return X[idx], y[idx]

Let's also define a function that will display the training status, including the number of steps, the total number of steps, the mean loss since the start of the epoch (i.e., we will use the `Mean` metric to compute it), & other metrics:

In [72]:
def print_status_bar(iteration, total, loss, metrics = None):
    metrics = " - ".join(["{}: {:.4f}".format(m.name, m.result())
                          for m in [loss] + (metrics or [])])
    end = "" if iteration < total else "\n"
    print("\r{}/{} - ".format(iteration, total) + metrics, end = end)

This code is self-explanatory, unless you are unfamiliar with Python string formatting: `{:.4f}` will format a float with four digits after the decimal point, & using `\r` (carriage return) along with `end = ""` ensures that the status bar always gets printed on the same line. In the notebook, the `print_status_bar()` function includes a progress bar, but you could use the handy `tqdm` library instead.

With that, let's get down to business! First, we need to define more hyperparameters & choose the optimiser, the loss function, & the metrics (just the MAE in this example):

In [74]:
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
optimiser = keras.optimizers.Nadam(lr = 0.01)
loss_fn = keras.losses.mean_squared_error
mean_loss = keras.metrics.Mean()
metrics = [keras.metrics.MeanAbsoluteError()]

& now we are ready to build the custom loop.

In [77]:
for epoch in range(1, n_epochs + 1):
    print("\nEpoch {}/{}".format(epoch, n_epochs))
    for step in range(1, n_steps + 1):
        X_batch, y_batch = random_batch(X_train_scaled, y_train)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch, training = True)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimiser.apply_gradients(zip(gradients, model.trainable_variables))
        mean_loss(loss)
        for metric in metrics:
            metric(y_batch, y_pred)
        print_status_bar(step * batch_size, len(y_train), mean_loss, metrics)
        for metric in [mean_loss] + metrics:
            metric.reset_states()


Epoch 1/5
11584/11610 - mean: 0.5317 - mean_absolute_error: 0.4983
Epoch 2/5
11584/11610 - mean: 0.6584 - mean_absolute_error: 0.4543
Epoch 3/5
11584/11610 - mean: 0.5356 - mean_absolute_error: 0.5061
Epoch 4/5
11584/11610 - mean: 0.4665 - mean_absolute_error: 0.4364
Epoch 5/5
11584/11610 - mean: 0.6518 - mean_absolute_error: 0.5195

There's a lot going on in this code, so let's walk through it:

* We create two nested loops: one for the epochs, the other for the batches within an epoch.
* Then we sample a random batch from the training set.
* Inside the `tf.GradientTape()` block, we make a prediction for one batch (using the model as a function), & we compute the loss: it is equal to the main loss plus the other losses (in this model, there is one regularisation loss per layer). Since the `mean_squared_error()` function returns one loss per instance, we compute the mean over the batch using `tf.reduce_mean()` (if you wanted to apply different weights to each instance, this is where you would do it). The regularisation losses are already reduced to a single scalar each, so if we just need to sum them (using `tf.add_n()`, which sums multiple tensors of the same shape & data type).
* Next, we ask the tape to compute the gradient of the loss with regard to each trainable variable (*not* all variables!), & we apply them to the optimiser to perform a gradient descent step.
* Then we update the mean loss & the metrics (over the current epoch), & we display the status bar.
* At the end of each epoch, we display the status bar again to make it look complete & to print a line feed, & we reset the states of the mean loss & the metrics.

If you set the optimiser's `clipnorm` or `clipvalue` hyperparameter, it will take care of this for you. If you want to apply any other transformation to the gradients, simply do so before calling the `apply_gradients()` method.

If you add weight constraints to your model (e.g., by setting `kernel_constraint` or `bias_constraint` when creating a layer), you should update the training loop to apply these constraints just after `apply_gradients()`:

In [78]:
for variable in model.variables:
    if variable.constraint is not None:
        variable.assign(variable.constraint(variable))

Most importantly, this training loop does not handle layers that behave differently during training & testing (e.g., `BatchNormalisation` or `Dropout`). To handle these, you need to call the model with `training = True` & make sure it propagates this to every layer that needs it.

As you can see, there are quite a lot of things you need to get write, & it's easy to make a mistake. But on the bright side, you can get full control, so it's your call.

Now that you know how to customise any poart of your models & training algorithms, let's see how you can use TensorFlow's automatic graph generation feature: it can speed up your custom code considerably, & it will also make it portable to any platform supported by TensorFlow.

---

# TensorFlow Functions & Graphs

In TensorFlow 2, graphs are simple to use. To show just how simple,let's start with a trivial function that computes the cube of its input:

In [79]:
def cube(x):
    return x ** 3

We can obviously call this function with a python value, such as an int or a float, or we can call it with a tensor:

In [80]:
cube(2)

8

In [81]:
cube(tf.constant(2.0))

<tf.Tensor: shape=(), dtype=float32, numpy=8.0>

Now, let's use `tf.function()` to convert this python function to a *TensorFlow Function*:

In [82]:
tf_cube = tf.function(cube)
tf_cube

<tensorflow.python.eager.def_function.Function at 0x7fb3f81d75e0>

This TF Function can then be used exactly like the original python function, & it will return the same result (but as tensors):

In [83]:
tf_cube(2)

<tf.Tensor: shape=(), dtype=int32, numpy=8>

In [84]:
tf_cube(tf.constant(2.0))

<tf.Tensor: shape=(), dtype=float32, numpy=8.0>

Under the hood, `tf.function()` analysed the computations performed by the `cube()` function & generated an equivalent computation graph! As you can see, it was rather painless (we'll see how this works shortly. Alternatively, we could have used `tf.function` as a decorator; this is actually more common:

In [85]:
@tf.function
def tf_cube(x):
    return x ** 3

The original python function is still available via the tf function's `python_function` attribute, in case you ever need it:

In [86]:
tf_cube.python_function(2)

8

TensorFlow optimises the computation graph, pruning unused nodes, simplifying expressions (e.g., 1 + 2 would get replaced with 3), & more. Once the optimised graph is ready, the tf function efficiently executes the operations in the graph, in the appropriate order (& in parallel when it can). As a result, a tf function will usually run much faster than the original python function, especially if it performs complex computations. Most of the time, you will not really need to know more than that: when you want to boost a python function, just transform it into a tf function. That's all.

Moreover, when you write a custom loss function, a custom metrics, a custom layer, or any other custom function & you use it in a keras model (as we did throughout this lesson), keras automatically converts your function into a tf function -- no need to use `tf.function()`. So most of the time, all this magic is 100% transparent.

By default, a tf function generates a new graph for every unique set of input shapes & data types & caches it for subsequent calls. For example, if you call `tf_cube(tf.constant(10))`, a graph will be generated for int32 tensors of shape []. Then, if you call `tf_cube(tf.constant(20))`, the same graph will be reused. But if you then call `tf_cube(tf.constant([10, 20]))`, a new graph will be geenrated for int32 tensors of shape [2]. This is how tf functions handle polymorphism (i.e., varying argument types & shapes). However, this is only true for tensor arguments: if you pass numerical python values to a tf function, a new graph will be generated for every distinct value: for example, calling `tf_cube(10)` & `tf_cube(20)` will generate two graphs

## AutoGraph & Tracing

So how does TensorFlow generate graphs? It starts by analysing the python function's source code to capture all the control flow statements, such as `for` loops, `while` loops, & `if` statements, as well as `break`, `continue`, & `return` statements. This first step is called *autograph*. The reason TensorFlow has to analyse the source code is that python does not provide any other way to capture control flow statements: it offers magic methods like `__add__()` & `__mul__()` to capture operators like `+` & `*`, but there are no `__while__()` or `__if__()` magic methods. After analysing the function's code, autograph outputs an upgraded version of that function in which all the control flow statements are replaced by the appropriate TensorFlow operations, such as `tf.while_loop()` for loops & `tf.cond()` for `if` statements. For example, in the below figure, autograph analyses the source code of the `sum_squares()` python function, & it generates the `tf_sum_squares()` function. In this function, the `for` loop is replaced by the definition of the `loop_body()` function (containing the body of teh originnal `for` loop), followed by a call to the `for_stmt()` function. This call will build the appropriate `tf.while_loop()` operation in the computation graph.

<img src = "Images/Autograph.png" width = "600" style = "margin:auto"/>

Next, TensorFlow calls this "upgraded" function, but instead of passing the argument, it passes a *symbolic tensor* -- a tensor without any actual value, only a name, a data type, & a shape. For example, if you call `sum_square(tf.constant(10))`, then the `tf__sum_squares()` function will be called with a symbolic tensor of type int32 & shape []. The function will run in *graph mode*, meaning that each TensorFlow operation will add a node in the graph to represent itself & its output tensor(s) (as opposed to the regular mode, called *eager execution* or *eager mode*). In graph mode, tf operations do not perform any computations. In the above figure, you can see the `tf__sum_squares()` function being called with a symbolic tensor as its argument (in this case, an int32 tensor of shape []) & the final graph being generated during tracing. The nodes represent operations, & the arrows represent tensors (both the generated function & the graph are simplified).

## TF Function Rules

Most of the time, converting a python function that performs TensorFlow operations into a tf function is trivial: decorate it with `@tf.function` or let keras take care of it for you. However, there are a few rules to respect:

* If you call any external library, including NumPy or even the standard library, this call will run only during tracing; it will not be part of the graph. Indeed a TensorFlow graph can only include TensorFlow constructs (tensors, operations, variables, datasets, & so on). So, make sure you use `tf.reduce_sum()` instead of `np.sum()`, `tf.sort()` instead of the built-in `sorted()` function, & so on (unless you really want the code to run only during tracing). This has a few additional implications:
   - If you define a tf function `f(x)` that just returns `np.random.rand()`, a random number will be generated when the function is traced, so `f(tf.constant(2.0))` & `f(tf.constant(3.0))` will return the same random number, but `f(tf.constant([2.0, 3.0])` will return a different one. If you replace `np.random.rand()` with `tf.random.uniform([])`, then a new random number will be generated upon every call, since the operation will be part of the graph.
   - If your non-TensorFlow code has side effects (such as logging something or updating a python counter), then you should not expect those side effects to occur every time you call the tf function, as they will only occur when the function is traced.
   - You can wrap arbitrary python code in a `tf.py_function()` operation, but doing so will hinder performance, as TensorFlow will not be able to do any graph optimisation on this code. It will also reduce portability, as the graph will only run on platforms where python is available (& where the right libraries are installed).
* You can call other python functions or tf functions, but they should follow the same rules, as tensorflow will capture their operations in the computation graph. Note that these other functions do not need to be decorated with `@tf.function`
* If the function creates a TensorFlow variable (or any other stateful tensorflow object, such as a dataset or a queue), it must do so upon the very first call, & only then, or else you will get an exception. It is usually preferable to create variables outside of the tf function (e.g., in the `build()` method of a custom layer). If you want to assign a new value to the variable, make sure you call its `assign()` method, instead of using the `=` operator.
* The source code of your python function should be available to tensorflow. If the source code is unavailable (for example, if you define your function in the python shell, which does not give access to the source code, or if you deploy only the compiled *.*pyc* python files to production), when the graph generation process will fail or have limited functionality.
* TensorFlow will only capture `for` loops that iterate over a tensor or a dataset. So make sure you use `for i in tf.range(x)` rather than `for i in range(x)` or else the loop will not be captured in the graph. Instead, it will run during tracing. (This may be what you want if the `for` loop is meant to build the graph, for example to create each layer in a neural network).
* As always, for performance reasons, you should prefer a vectorised implementation whenever you can, rather than using loops.

It's time to sum up. In this lesson, we started with a brief overview of TensorFlow, then we looked at TensorFlow's low-level API, including tensors, operations, variables, & special data structures. We then used these tools to customise almost every component of tf.keras. Finally we looked at how tf functions can boost performance, how graphs are generated using autograph & tracing, & what rules to follow when writing tf functions.