In [15]:
import tensorflow as tf
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
y_train = y_train.to_numpy()

# A Quick Tour of Tensorflow

Here's a summary of what TensorFlow has to offer:

1. Its core is very similar to NumPy, but with GPU support
2. It supports distributed computing (across multiple devices and servers)
3. It includes a kind of just-in-time (JIT) compiler that allows it to optimize computations for speed and memory usage. It works by extracting the *computation graph* from a Python function, then optimizing it (e.g. by pruning unused nodes), and finally running it efficiently (e.g. by automatically running independent operations in parallel.)
4. Computation graphs can be exported to a portable format so you can train a TensorFlow model in one environment (e.g. using Python on Linux) and run it in another (e.g. using Java on an Android device)
5. It implements autodiff (see Chapter 10 and Appendix D) and provides some excellent optimizers, such as RMSProp and Nadam (see Chapter 11), so you can easily minimize all sorts of loss functions.

TensorFlow offers many more features built on top of these core features: the most important is of course tf.keras, but it also has data loading and preprocessing ops, image processing ops, signal processing ops, and more. 

As you may know, GPUs can dramatically speed up computations by splitting them into many smaller chunks and running them in parallel across many GPU threads. TPUs are even faster: they are custom ASIC chips built specifically for Deep Learning operations.

There's even more to the TensorFlow library:
1. TensorBoard - for visualization
2. TensorFlow Extended (TFX) - a set of libraries built by Google to productionize TensorFlow projects. It includes tools for data validation, preprocessing, model analysis, and serving.
3. TensorFlow Hub - provides a way to easily download and reuse pretrained neural networks. You can also get many neural network architectures, some of them pretrained, in TensorFlows *model garden*
4. TensorFlow Resources - contains TensorFlow-based projects. You will find hundreds of TensorFlow projects on GitHub, so it is often easy to find existing coded for whatever you are trying to do.

More and more ML papers are released along with their implementations, and sometimes even with pretrained models. Check out https://paperswithcode.com/ to easily find them

## Using Tensorflow like NumPy

TensorFlow's API revolves around tensors. A tensor is usually a multidimensional array, but it can also hold a scaler. Let's see how to create and manipulate them

### Tensors and Operations

In [21]:
tf.constant([[1., 2., 3.], [4., 5., 6]]), tf.constant(42)

(<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
 array([[1., 2., 3.],
        [4., 5., 6.]], dtype=float32)>,
 <tf.Tensor: shape=(), dtype=int32, numpy=42>)

In [22]:
# Just like an ndarray, a tf.Tensor has a shape and a data type
t = tf.constant([[1., 2., 3.], [4., 5., 6]])
t.shape, t.dtype

(TensorShape([2, 3]), tf.float32)

In [23]:
# Indexing works much like in Numpy
t[:, 1:]

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[2., 3.],
       [5., 6.]], dtype=float32)>

In [24]:
t[..., 1, tf.newaxis]

<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[2.],
       [5.]], dtype=float32)>

In [25]:
# More importantly, all sorts of tensor operations are available
t + 10, tf.square(t), t @ tf.transpose(t)

(<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
 array([[11., 12., 13.],
        [14., 15., 16.]], dtype=float32)>,
 <tf.Tensor: shape=(2, 3), dtype=float32, numpy=
 array([[ 1.,  4.,  9.],
        [16., 25., 36.]], dtype=float32)>,
 <tf.Tensor: shape=(2, 2), dtype=float32, numpy=
 array([[14., 32.],
        [32., 77.]], dtype=float32)>)

You will find all the basic math operations you need (tf.add(), tf.multiply(), tf.square(), tf.exp(), tf.sqrt(), etc.) and most operations that you can find in Numpy (e.g. tf.reshape(), tf.squeeze(), tf.tile()). Some functions have a different name than Numpy; for instance, tf.reduce_mean(), tf.reduce_sum(), tf.reduce_max(), and tf.math.log() are the equivalent of np.mean(), np.sum(), np.max() and np.log().

When the name differs, there is often a good reason for it. For example, in TensorFlow you must write tf.transpose(t); you cannot use write t.T like in NumPy. The reason is that the tf.transpose() function does not do exactly the same thing as Numpy's T attribute: in TensorFlow, a new tensor is created with its own copy of the transposed data, whlie in Numpy, t.T is just a transposed view of the same data. 

Similarly, the tf.reduce_sum() operation is named this way because its GPU kernel (i.e. GPU implementation) uses a reduce algorithm that does not guarantee the order in which the elements are added: because 32-bit floats have limited precision, the result may change ever so slightly every time you call this operation.

### Tensors and NumPy

In [28]:
# Tensors play nice with Numpy
a = np.array([2., 4., 5.])
tf.constant(a), t.numpy()

(<tf.Tensor: shape=(3,), dtype=float64, numpy=array([2., 4., 5.])>,
 array([[1., 2., 3.],
        [4., 5., 6.]], dtype=float32))

In [29]:
tf.square(a), np.square(t)

(<tf.Tensor: shape=(3,), dtype=float64, numpy=array([ 4., 16., 25.])>,
 array([[ 1.,  4.,  9.],
        [16., 25., 36.]], dtype=float32))

Notice that NumPy uses 64-bit precision by default, while TensorFlow uses 32-bit. This is because 32-bit precision is generally more than enough for neural networks, plus it runs faster and uses less RAM. So when you create a tensor from a NumPy array, make sure to set dtype=tf.float32

### Type Conversions

Type conversions can significantly hurt performance, and they can easily go unnoticed when they are done automatically. To avoid this, TensorFlow does not perform any type conversions automatically: it just raises an exception if you try to execute an operation on tensors with incompatible types. This may be a bit annoying at first, but remember that it's for a good cause! And of course you can use tf.cast() when you really need to convert types.

In [33]:
# Example of Exception
# tf.constant(2.) + tf.constant(40)

' InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a int32 tensor [Op:AddV2]'

' InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a int32 tensor [Op:AddV2]'

In [34]:
# Example casting variables as the correct type
t2 = tf.constant(40., dtype=tf.float64)
tf.constant(2.0) + tf.cast(t2, tf.float32)

<tf.Tensor: shape=(), dtype=float32, numpy=42.0>

## Variables

The tf.Tensor values we've seen so far are immutable: you cannot modify them. For mutable tf.Tensor values we need tf.Variable. A tf.Variable acts much like a tf.Tensor: you can perform the same operations with it, it plays nicely with NumPy as well, and it is just as picky with types. But it can also be modified in place using the assign() method (or assign_add() or assign_sub(), which increment or decrement the variable by the given value).

In practice you will rarely have to create variables manually, since Keras provides an add_weight() method that will take care of it for you, as we will see. Moreover, model parameters will generally be updated directly by the optimizers, so you will rarely need to update variables manually.

In [37]:
# Variable examples
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
v

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [38]:
v.assign(2 * v)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [39]:
v[0, 1].assign(42)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [40]:
v[:, 2].assign([0., 1.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  0.],
       [ 8., 10.,  1.]], dtype=float32)>

In [41]:
v.scatter_nd_update(
    indices=[[0, 0], [1, 2]],
    updates=[100., 200.]
)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[100.,  42.,   0.],
       [  8.,  10., 200.]], dtype=float32)>

## Other Data Structures

TensorFlow supports several other data structures, including the following:

1. Sparse tensors (tf.SparseTensor)
> Efficiently represent tensors containing mostly zeros. The tf.sparse package contains operations for sparse tensors.

2. Tensor arrays (tf.TensorArray)
> Are lists of tensors. They have a fixed size by default but can optionally be made dynamic. All tensors they contain must have the same shape and data type.

3. Ragged tensors (tf.RaggedTensor)
> Represent static lists of lists of tensors, where every tensor has the same shape and data type. The tf.ragged package contains operations for ragged tensors.

4. String tensors
> Are regular tensors of type tf.string. These represent byte strings, not Unicode strings, so if you create a string tensor using a Unicode string (e.g., a regular Pythong 3 string like "coffee"), then it will get encoded to UTF-8 automatically. Alternatively, you can represent Unicode strings using tensors of type tf.int32, where each item represents a Unicode code point (e.g., [99, 97, 102, 233]). The tf.strings package (with an s) contains ops for byte strings and Unicode strings (and to convert one into the other). It's important to note that a tf.string is atomic, meaning that its length does not appear in the tensor's shape. Once you convert it to a Unicode tensor (i.e. a tensor of type tf.int32 holding Unicode code points), the length appears in the shape.

5. Sets
> Are represented as regular tensors (or sparse tensors). For example, tf.constant([[1, 2], [3, 4]]) represents the two sets {1, 2} and {3, 4}. More generally, each set is represented by a vector in the tensor's last axis. You can manipulate sets using operations from the tf.sets package.

6. Queues
> Store tensors across multiple steps. TensorFlow offers various kinds of queues, these classes are all in the tf.queue package:
>    1. Simple First In, First Out (FIFO) queues (FIFOQueue)
    2. Queues that can prioritize some items (PriorityQueue)
    3. Queues that shuffle their items (RandomShuffleQueue)
    4. Queus that batch items of different shapes by padding (PaddingFIFOQueue)

# Customizing Models and Training Algorithms

## Custom Loss Functions

Suppose you want to train a regression model, but your training set is a bit noisy. Of course, you start by trying to clean up your dataset by removing or fixing the outliers, but that turns out to be insufficient; the dataset is still noisy. Which loss function should you use? The mean squared error might penalize large errors too much and cause your model to be imprecise. The mean absolute error would not penalize outliers as much, but training might take a while to converge, and the trained model might not be very precise. This is probably a good time to use the Huber loss (introduced in Chapter 10) instead of the good old MSE. The Huber loss is not currently part of the official Keras API, but it is available in tf.keras (just use an instance of the keras.losses.Huber class). But let's pretend it's not there: implementing it is easy as pie! Just create a function that takes the labels and predictions as arguments, and use TensorFlow operations to compute every instance's loss:

In [47]:
def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)

' For best performance you should always vectorize implementations, as in this example. Moreover, if you want to benefit from TensorFlows graph features, you should use only TensorFlow operations'

# To use with a model
# model.compile(loss=huber_fn, optimizer='nadam')
# model.fit(X_train, y_train, [...])

' For best performance you should always vectorize implementations, as in this example. Moreover, if you want to benefit from TensorFlows graph features, you should use only TensorFlow operations'

It is also preferable to return a tensor containing one loss per instance, rather than returning the mean loss. This way, Keras can apply class weights or sample weights when requested (Chapter 10). Now you can use this loss when you compile the Keras model, then train your model, and that's it! But what happens to this custom loss when you save the model?

## Saving and Loading Models That Contain Custom Components

Saving a model containing a custom loss function works fine, as Keras saves the name of the function. When you load a model containing custom objects, you need to map the names to the objects, as shown below.

In [51]:
# model = tf.keras.models.load_model('model_with_custom_loss.h5', custom_objects={'huber_fn': huber_fn})

With the current implementation, any error between -1 and 1 is considered "small". But what if you want a different threshold? You can solve this by creating a subclass of the keras.losses.Loss class, and then implementing its get_config() method, as shown below

In [53]:
class HuberLoss(tf.keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)
        
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'threshold': self.threshold}

The Keras API currently only specifies how to use subclassing to define layers, models, callbacks, and regularizers. If you build other components (such as losses, metrics, initializers, or constraints) using subclassing, they may not be portable to other Keras implementations. It's likely that the Keras API will be updated to specify subclassing for all these components as well.

That said, let's walk through the code above:

1. The constructor accepts **kwargs** and passes them to the parent constructor, which handles standard hyperparameters: the name of the loss and the reduction algorithm to use to aggregate the individual instance losses. By default, it is 'sum_over_batch_size', which means that the loss will be the sum of the instance losses, weighted by the sample weights, if any, and divided by the batch size (not by the sum of weights, so this is *not* the weighted mean). It would not be a good idea to use a weighted mean: if you did, then two instances with the same weight but in different batches would have a different impact on training, depending on the total weight of each batch. Other possible values are 'sum' and 'None'.

2. The call() method takes the labels and predictions, computes all the instance losses, and returns them.

3. The get_config() method returns a dictionary mapping each hyperparameter name to its value. It first calls the parent classes get_config() method, then adds the new hyperparameters to this dictionary.

You can then use any instance of this class when you compile the model. When you save the model, the threshold will be saved along with it; and when you load the model, you just need to map the class name to the class itself, as shown below.

In [55]:
# model.compile(loss=HuberLoss(threshold=2.), optimizer='nadam')
# model = tf.keras.models.load_model('my_model_with_custom_loss_class.h5', custom_objects={'HuberLoss': HuberLoss})

## Custom Activation Functions, Initializers, Regularizers, and Contstraints

Most Keras functionalities, such as losses, regularizers, constraints, initializers, metrics, activation functions, layers, and even full models, can be customized in very much the same way. Most of the time, you will just need to write a simple function with the appropriate inputs and outputs. Some examples below:

In [58]:
def my_softplus(z):
    ' Return value is just tf.nn.softplus(z)'
    return tf.math.log(tf.exp(z) + 1.0)

In [59]:
def my_glorot_initializer(shape, dtype=tf.float32):
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype)

In [60]:
def my_l1_regularizer(weights):
    return tf.reduce_sum(tf.abs(0.01 * weights))

In [61]:
def my_positive_weights(weights):
    ' Return value is just tf.nn.relu(weights)'
    return tf.where(weights < 0., tf.zeroes_like(weights), weights)

In [62]:
layer = tf.keras.layers.Dense(
    units=30,
    activation=my_softplus,
    kernel_initializer=my_glorot_initializer,
    kernel_regularizer=my_l1_regularizer,
    kernel_constraint=my_positive_weights
)

The activiation function will be applied to the output of this Dense layer, and its result will be passed on to the next layer. The layer's weights will be initialized using the value returned by the initializer. At each training step the weights will be passed to the regularization function to compute the regularization loss, which will be added to the main loss to get the final loss used for training. Finally, the constraint function will be called after each training step, and the layer's weights will be replaced by the constained weights.

If a function has hyperparameters that need to be saved along with the model, then you will want to subclass the appropriate class. Note that you must implement the call() method for losses, layers (including activation functions), and models, or the __ call__() method for regularizers, initializers, and constraints. For metrics, things are a bit different.

In [64]:
class MyL1Regularizer(tf.keras.regularizers.Regularizer):
    def __init__(self, factor):
        self.factor = factor
        
    def __call__(self, weights):
        return tf.reduce_sum(tf.abs(self.factor * weights))
    
    def get_config(self):
        return {'factor': self.factor}

## Custom Metrics

Losses and metrics are conceptually not the same thing: losses are used by Gradient Descent to _train_ a model, so they must be differentiable and their gradients should not be 0 everywhere. In contrast, metrics are used to _evaluate_ a model: they must be more easily interpretable, and they can be non-differentiable or have 0 gradients everywhere. 

__That said, in most cases, defining a custom metric function is exactly the same as defining a custom loss function. In fact, we could even use the Huber loss function we created earlier as a metric; it would work just fine.__

For each batch during training, Keras will compute this metric and keep track of its mean since the beginning of the epoch. Most of the time, this is exactly what you want. But not always! Consider a binary classifier's precision, for example. In this case, what we need is an object that can keep track of the number of true positives and the number of false positives and that can compute their ratio when requested. This is precisely what the tf.keras.metrics.Precision class does. This is called a _streaming metric_ (or _stateful metric_), as iti s gradually updated, batch after batch.

If you need to create such a streaming metric, create a subclass of the tf.keras.metrics.Metric class. Here is simple example that keeps track of the total Huber loss and the number of instances seen so far.

In [67]:
def create_huber(threshold=1.0):
    def huber_fn(y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < threshold
        squared_loss = tf.square(error) / 2
        linear_loss = threshold * tf.abs(error) - threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)

In [68]:
# Example using a Loss function as a metric
# model.compile(loss='mse', optimizer='nadam', metrics=[create_huber(2.0)])

In [69]:
class HuberMetric(tf.keras.metrics.Metric):
    def __init__(self, threshold=1.0, **kwargs):
        super().__init__(**kwargs) # handles base args (e.g. dtype)
        self.huber_fn = create_huber(threshold)
        self.total = self.add_weight('total', initializer='zeros')
        self.count = self.add_weight('count', initializer='zeros')
        
    def update_state(self, y_true, y_pred, sample_weights=None):
        metric = self.huber_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(metric))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))
        
    def result(self):
        return self.total / self.count
    
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'threshold': self.threshold}

Let's walk through the class above:

1. The constructor uses the add_weight() method to create the variables needed to keep track of the metric's state over multiple batches - in this case, the sum of all Huber losses (total) and the number of instances seen so far (count). You could just create variables manually if you preferred. Keras tracks any tf.Variable that is set as an attribute (and more generally, any 'trackable' object, such as layers or models).

2. The update_state() method is called when you use an instance of this class as a function. It updates the variables, given the labels and predictions for one batch (and sample weights, but in this case we ignore them).

3. The result() method computes and returns the final result, in this case the mean Huber metric over all instances. When you use the metric as a function, the update_state() method gets called first, then the result() method is called, and its output is returned.

4. We also implement the get_config() method to ensure the threshold gets saved along with the model.

5. The default implementation of the reset_states() method resets all variables to 0.0 (but you can override this if needed).

Keras will take care of variable persistence seamlessly; no action is required. Now that we have built a streaming metric, building a custom layer will seem like a walk in the park!

## Custom Layers

You may occasionally want to build an architecture that contains an exotic layer for which TensorFlow does not provide a default implementation. In this case, you will need to create a custom layer. Or you may simply want to build a very repretivie architecture to treat each block of layers as a single layer. __For example, if the model is a sequence of layers A, B, C, A, B, C, A, B, C, then you might want to define a custom layer D containing layers A, B, C such that the new sequence is D, D, D.__

First off, some layers have no weights. If you want to create a custom layer without any weights, the simplest option is to write a function and wrap it in a keras.layers.Lambda layer, as shown below. As you've probably guessed by now, to build a custom stateful layer (i.e. a layer with weights), you need to create a subclass of the keras.layers.Layer class. An example of a simplified version of the Dense layer is shown below.

In [73]:
exponential_layer = tf.keras.layers.Lambda(lambda x: tf.exp(x))

In [74]:
class MyDense(tf.keras.layers.Layer):
    def __init__(self, units, activiation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = tf.keras.activations.get(activation)
        
    def build(self, batch_input_shape):
        self.kernel = self.add_weight(
            name='kernel',
            shape=[batch_input_shape[-1], self.units],
            initializer='glorot_normal' 
        )
        self.bias = self.add_weight(
            name='bias',
            shape=[self.units],
            initializer='zeros'
        )
        super().build(batch_input_shape) # must be at the end of this method
        
    def call(self, X):
        return self.activation(X @ self.kernel + self.bias)
    
    def compute_output_shape(self, batch_input_shape):
        return tf.TensorShape(batch_input_shape.as_list()[:-1] + [self.units])
    
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'units': self.units, 'activation': tf.keras.activations.serialize(self.activation)}

Let's walk through the code above:

1. The constructor takes all the hyperparameters as arguments (in this example, units and activation), and importantly it also takes the ** kwargs argument. It calls the parent constructor, passing it the kwargs: this takes care of standard arguments such as input_shape, trainable, and name. Then it saves the hyperparameters as attributes, converting the activation argument to the appropriate activation function using the tf.keras.activations.get() function (it accepts functions, standard strings like 'relu' or 'selu', or simply None).

2. The build() method's role is to create the layer's variables by calling the add_weight() method for each weight. The build() method is called the first time a layer is used. At that point, Keras will know the shape of this layer's inputs, and it will pass it to the build() method, which is often necessary to create some the weights. For example, we need to know the number of neurons in the previous layer in order to create a connection weights matrix (i.e. the kernel): this corresponds to the size of the last dimension of the inputs. At the end of this build() method (and only at the end), you must call the parent's build() method: this tells Keras that hte layer is built (it just sets self.built = True)

3. The call() method performs the desired operations. In this case, we compute the matrix multiplication of the inputs X and the layer's kernel, we add the bias vector, and we apply the activation function to the result, and this gives us the output of hte layer.

4. The compute_output_shape() method simply returns the shape of this layer's outputs. It this case, it is the same shape as the inputs, except the last dimension is replaced with the number of neurons in the layer. Note that in tf.keras, shapes are instances o the tf.TensorShape class, which you can convert to Python lists using as_list().

5. The get_config() method is just like in the previous custom classes. Note that we save the activation function's full configuration by calling tf.keras.activations.serialize()

You can now use a MyDense layer like any other layer! You can generally omit the compute_output_shape() method as tf.keras automatically infers the output shape, except when the layer is dynamic (as we will see shortly). In other Keras implementations, this method is either required or its default implementation assumes the output shape is the same as the input shape.

To create a layer with multiple inputs (e.g. Concatenate), the argument to the call() method should be a tuple containing all the inputs, and similarly the argument to the compute_output_shape() method should be a tuple containing each input's batch shape. An example is shown below:

In [76]:
class MyMultiLayer(tf.keras.layers.Layer):
    def call(self, X):
        if X == 0:
            X = 1e-9
            
        X1, X2 = X
        return [X1 + X2, X1 * X2, X1 / X2]
    
    def compute_output_shape(self, batch_input_shape):
        b1, b2 = [batch_input_shape]
        return [b1, b1, b1] # should probably handle broadcasting rules

If your layer needs to have a different behavior during training and during testing (e.g. if it uses a Dropout or BatchNormalization layer(s)), then you must add a training argument to the call() method and use this argument to decide what to do. For example, let's create a layer that adds Gaussian noise during training but does nothing during testing:

In [78]:
class MyGaussianNoise(tf.keras.layers.Layer):
    def __init__(self, stddev, **kwargs):
        super().__init__(**kwargs)
        self.stddev = stddev
        
    def call(self, X, training=None):
        if training:
            noise = tf.random.normal(tf.shape(X), stddev=self.stddev)
            return X + noise
        else:
            return X
        
    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape

In [79]:
' Residual block layer example'
class ResidualBlock(tf.keras.layers.Layer):
    def __init__(self, n_layers, n_neurons, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [
            tf.keras.layers.Dense(
                units=n_neurons,
                activation='elu',
                kernel_initializer='he_normal'
            ) 
            for _ in range(n_layers)
        ]
        
    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return inputs + Z

## Custom Models

We already looked at creating custom model classes in Chapter 10. It's straightforward: subclass the tf.keras.Model class, create layers and variables in the constructor, and implement the call() method to do whatever you want the model to do. 

If models provide more functionality than layers, why not just define every layer as a model? Well, technically you could, but it is usually cleaner to distinguish the internal components of your model (i.e. layers or reusable blocks of layers) from the model itself (i.e. the object you will train). The former should subclass the Layer class, while the latter should subclass the Model class.

In [82]:
class ResidualRegressor(tf.keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = tf.keras.layers.Dense(
            units=30,
            activiation='elu',
            kernel_initializer='he_normal'
        )
        self.block1 = ResidualBlock(2, 30)
        self.block2 = ResidualBlock(2, 30)
        self.out = tf.keras.layers.Dense(output_dim)
        
    def call(self, inputs):
        Z = self.hidden1(inputs)
        for _ in range(1 + 3):
            Z = self.block1(Z)
        Z = self.block2(Z)
        return self.out(Z)

# Losses and Metrics Based on Model Internals

There will be times when you want to define losses based on other parts of your model, such as the weights or activations of its hidden layers. __This may be useful for regularization purposes or to monitor some internal aspect of your model.__ To define a custom loss based on model internals, compute it based on any part of the model you want, then pass the result to the add_loss() method. 

The example below will have an auxiliary output on top of the upper hidden layer. The loss associated to this auxiliary output will be called the _reconstruction loss_: it is the mean squared difference between the reconstruction and the inputs. __By adding this reconstruction loss to the main loss, we will encourage the model to preserve as much information as possible through the hidden layers-- even information that is not directly useful for the regression task itself.__ In practice, this loss sometimes improves generalization.

In [85]:
class ReconstructionRegressor(tf.keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [
            tf.keras.layers.Dense(
                units=30,
                activation='selu',
                kernel_initializer='lecun_normal'
            )
            for _ in range(5)
        ]
        self.out = tf.keras.layers.Dense(output_dim)
        
    def build(self, batch_input_shape):
        n_inputs = batch_input_shape[-1]
        self.reconstruct = tf.keras.layers.Dense(n_inputs)
        super().build(batch_input_shape)
        
    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        reconstruction = self.reconstruct(Z)
        recon_loss = tf.reduce_mean(tf.square(reconstruction - inputs))
        self.add_loss(0.05 * recon_loss)
        return self.out(Z)

Let go through the code above:
    
1. The constructor creates the DNN with five dense hidden layers and one dense output layer.
2. The build() method creates an extra dense layer which will be used to reconstruct the inputs of the model. It must be created here because its number of units must be equal to the number of inputs, and this number is unknown before the build() method is called.
3. The call() method processes the inputs through all five hidden layers, then passes the result through the reconstruction layer, which prodces the reconstruction. Then the call() method computes the reconstruction loss (the mean squared difference between the reconstruction and inputs), and adds it to the model's list of losses using the add_loss() method. Notice that we scale down the reconstruction loss by multiplying it by 0.05 (this is a hyperparameter you can tune). This ensures that the reconstruction loss does not dominate the main loss. Finally, the call() method passes the output of the hidden layers to the output layer and returns its output

__In over 99% of cases, everything we have discussed so far will be sufficient to implement whatever model you want to build, even with complex architectures, losses, and metrics.__ However, in some rare cases you may need to customing the training loop itself. Before we get there, we need to look at how to compute gradients automatically in TensorFlow

## Computing Gradients Using Autodiff

To understand hwo to use autodiff to compute gradients automatically, lets consider a simple toy function:

In [89]:
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

w1, w2 = 5, 3
eps = 1e-6
(f(w1 + eps, w2) - f(w1, w2)) / eps, (f(w1, w2 + eps) - f(w1, w2)) / eps

(36.000003007075065, 10.000000003174137)

In [90]:
w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
    z = f(w1, w2)
    
gradients = tape.gradient(z, [w1, w2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

In [91]:
' The tape is automatically erased immediately after you call its gradient() method, so you will get an exception if you try to call gradient() twice'

# with tf.GradientTape() as tape:
#     z = f(w1, w2)
    
# dz_dw1 = tape.gradient(z, w1)
# dz_dw2 = tape.gradient(z, w2)

# RuntimeError: A non-persistent GradientTape can only be used to compute one set of gradients (or jacobians)

' The tape is automatically erased immediately after you call its gradient() method, so you will get an exception if you try to call gradient() twice'

To save memory, only put the strict minimum inside the tf.GradientTape() block. Alternatively, pause recording by creating a tape.stop_recording() block inside the tf.GradientTape() block.

If you need to call gradient() more than once, you must make the tape persistent and delete it each time you are done with it to free resources. By default, the tape will only track operations involving variables. However, you can force the tape to watch any tensors you like, to record every operation that involves. You can then compute gradients with regard to these tensors, as if they were variables. __This can be useful in some cases, like if you want to implement a regularization loss that penalizes activations that vary a lot when the inputs vary little: the loss will be based on the gradient of the activations with regard to the inputs.__ Since the inputs are not variables, you would need to tell the tape to watch them.

In some cases you may want to stop gradients from backpropagating through some part of you neural network. To do this, you must the tf.stop_gradient() function.

In [93]:
# Example keeping GradientTape persistent
with tf.GradientTape(persistent=True) as tape:
    z = f(w1, w2)
    
dz_dw1 = tape.gradient(z, w1)
dz_dw2 = tape.gradient(z, w2)
print(dz_dw1, dz_dw2)
del tape

tf.Tensor(36.0, shape=(), dtype=float32) tf.Tensor(10.0, shape=(), dtype=float32)


In [94]:
# Example using GradientTape to calculate the gradients of constants
c1, c2 = tf.constant(5.), tf.constant(3.)
with tf.GradientTape() as tape:
    z = f(c1, c2)
    
gradients = tape.gradient(z, [c1, c2])
gradients

[None, None]

In [95]:
# Example using GradientTape to calculate the gradients of any operations that use the constants from the previous examlpe
with tf.GradientTape() as tape:
    tape.watch(c1)
    tape.watch(c2)
    z = f(c1, c2)
    
gradients = tape.gradient(z, [c1, c2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

In [96]:
# Example using stop_gradient()
def f(w1, w2):
    return 3 * w1 ** 2 + tf.stop_gradient(2 * w1 * w2)

with tf.GradientTape() as tape:
    z = f(w1, w2)
    
gradients = tape.gradient(z, [w1, w2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=30.0>, None]

In [97]:
# Example writing a custom gradient function
@tf.custom_gradient
def my_better_softplus(z):
    exp = tf.exp(z)
    def my_softplus_gradients(grad):
        return grad / (1 + 1 / exp)
    return tf.math.log(exp + 1), my_softplus_gradients

Congratualtions, you can now compute the gradients of any function (provided it is differentiable at the point where you compute it), even blocking backpropagation when needed. This is probably more flexibility than you will ever need, even if you build your own custom training loops, as we will see next.

## Custom Training Loops

In some rate cases, the fit() method may not be flexible enough for what you need to do. __For example, the Wide & Deep paper we discussed in Chapter 10 uses two different optimizers__. Implementing this paper requires writing your own custom loop.

However, remember that writing a custom training loop will make your code longer, more error-prone, and harder to maintain. Unless you really need the extra flexibility, you should prefer using the fit() method rather than implementing your own training loop, especially if you work in a team. An example of a custom training loop is shown below:

In [101]:
# Build a simple model with l2 regularization
l2_reg = tf.keras.regularizers.l2(0.05)
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=30, activation='elu', kernel_initializer='he_normal', kernel_regularizer=l2_reg),
    tf.keras.layers.Dense(units=1, kernel_regularizer=l2_reg)
])

# Create a function that will randomly sample a batch of instances from the training set
def random_batch(X, y, batch_size=32):
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]

# Define a function that will display the training status
def print_status_bar(iteration, total, loss, metrics=None):
    metrics = ' - '.join(['{}: {:4f}'.format(m.name, m.result()) for m in [loss] + (metrics or [])])
    end = '' if iteration < total else '\n'
    print('\r{}/{} - '.format(iteration, total) + metrics, end=end)
    
# Declare hyperparameters
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = tf.keras.losses.mean_squared_error
mean_loss = tf.keras.metrics.Mean()
metrics = [tf.keras.metrics.MeanAbsoluteError()]

# Build the custom training loop
for epoch in range(1, n_epochs + 1):
    print('Epoch {}/{}'.format(epoch, n_epochs))
    for step in range(1, n_steps + 1):
        X_batch, y_batch = random_batch(X_train_scaled, y_train)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch, training=True)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
            
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        mean_loss(loss)
        for metric in metrics:
            metric(y_batch, y_pred)
            
        print_status_bar(step * batch_size, len(y_train), mean_loss, metrics)
        for metric in [mean_loss] + metrics:
            metric.reset_states()

Epoch 1/5
96/100 - mean: 5.163573 - mean_absolute_error: 1.139317Epoch 2/5
96/100 - mean: 4.336100 - mean_absolute_error: 0.805593Epoch 3/5
96/100 - mean: 3.884027 - mean_absolute_error: 0.559777Epoch 4/5
96/100 - mean: 3.860552 - mean_absolute_error: 0.582307Epoch 5/5
96/100 - mean: 3.692938 - mean_absolute_error: 0.579613

There's a lot going on in this code, so let's walk through it:

1. We create two nested loops: one for the epochs, the other for hte batches within an epoch
2. Then we sample the random batch from the training set.
3. Inside the tf.GradientTape() block, we make a prediction for one batch (using the model as a function), and we compute the loss: it is equal to the main loss plust the other losses (in this model, there is one regularization loss per layer). Since the mean_squared_error() function returns one loss per instance, we compute the mean over the batch using tf.reduce_mean() (if you wanted to apply different weights to each instance, this is where you would do it). The regulariztaion losses are already reduced to a single scaler each, so we just need to sum them (using tf.add_n(), which sums multile tensors of the same shape and data type).
4. Next, we ask the tape to compute the gradient of the loss with regard to each trainable variable (_not_ all variables!), and we apply them to the optimizer to perform a Gradient Descent step.
5. Then we update the mean loss and metrics (over the current epoch), and we display the status bar.
6. At the end of each epoch, we display the status bar again to make it look complete and to print a line feed, and we reset the states of the mean loss and the metrics.

As yo ucan see, there are quite a lot of things you need to get right, and it's easy to make a mistake. But on the bright side, you get full control, so it's your call.

# TensorFlow Functions and Graphs

TensorFlow graphs are easy to use. Let's start with a trivial function that computes the cube of its input and go from there:

In [105]:
def cube(x):
    return x ** 3

# Next use tf.function() to convert this Python function into a TensorFlow function
tf_cube = tf.function(cube)

# Alternatively
@tf.function
def tf_cube(x):
    return x ** 3

Under the hood, tf.function() analyzed the computations performed by the cube() function and generated an equivalent computation graph. As a result, a TF Function will usually run much fatser than the original Python function, especially if it performs complex computations. By default, a TF Function generates a new graph for every unique set of input shapes and data types and caches it for subsequent calls.

If you call a TF Function many times with different numerical Python values, then many graphs will be generated, slowing down your program and using a lot of RAM (you must delete the TF Function to release it). Python values should be reserved for arguments that will ahve few unique values, such as hyperparameters like the number of neurons per layer. This allows TensorFlow to better optimize each variant of your model.

## AutoGraph and Tracing

So how does TensorFlow generate graphs? It starts by analyzing the Python function's source code to capture all the control flow statements, such as for loops, while loops, and if statements, as well as break, continue and return statements. This first step is called _AutoGraph_.

After analyzing the function's code, AutoGraph outputs an upgraded version of that function in which all the control flow statements are replaced by the appropriate TensorFlow operations, such as tf.while_loop() for loops and tf.cond() for if statements.

Next, TensorFlow calls this 'upgraded' function, but instead of passing the argument, it passes a _symbolic tensor_ - a tensor without any actual value, only a name, a data type and a shape. The function will run in _graph mode_, meaning that each TensorFlow operation will add a node in the graph to represent itself and its output tensor(s) (as opposed to the regular mode, called _eager execution_, or _eager mode_).

To view the generated function's source code, you can call tf.autograph.to_code(sum_squares.python_function) (replacing sum_squares with the name of your python function). The code is not meant to be pretty, but it can sometimes help for debugging.

## TF Function Rules

Most of the time, converting a Python function that performs TensorFlow operations into a TF Function is trivial: decorate it with @tf.function or let Keras take care of it for you. However, there are a few rules to respect:

1. If you call any external library, including NumPy or even the standard library, this call will run only during tracing; it will not be part of the graph. Indeed, a TensorFlow graph can only include TensorFlow constructs (tensors, operations, variables, datasets, and so on). So make sure you use tf.reduce_sum() instead of np.sum(), tf.sort() instead of the built-in sorted() function, and so on (unless you really want the code to run only during tracing). This has a few additional implications:
    1. If you define a TF Function f(x) that just returns np.random.rand(), a random number will only be generated when the function is traced, so f(tf.constant(2.)) and f(tf.constant(3.)) will return the same random number, but f(tf.constant([2., 3.])) will return a different one. If you replace np.random.rand() with tf.random.uniform([]), then a new random number will be generated upon every call, since the operation will be part of the graph.
    2. If your non-TensorFlow code has side effects (such as logging something or updating a Python counter), then you should not expect those side effects to occur every time you call the TF Function, as they will only occur when the function is traced.
    3. You can wrap arbitrary Python code in a tf.py_function() operation, but doing so will hinder performance, as TensorFlow will not be able to do any graph optimization on this code. It will also reduce portability, as the graph will only run on platforms where Python is available (and where the right libraries are installed).
2. You can call other Python functions of TF Functions, but they should follow the same rules, as TensorFlow will capture their operations in the computation graph. Note that these other functions do not need to be decorated with @tf.function
3. If the function creates a TensorFlow variable (or any other stateful TensorFlow object, such as a dataset or a queue), it must do so upon the very first call, and only then, or else you will get an exception. It is usually preferable to create variables outside of the TF Function (e.g. in the build() method of a custom layer). If you want to assign a new value to the variable, make sure you call its assign() method, instead of using the = operator.
4. The source code of your Python function should be available to TensorFlow. If the source code is unavailable (for example, if you define your function in the Python shell, which does not give access to the source code, or if you deploy only compiled .pyc Python files to production), then the graph generation process will fail or have limited functionality.
5. TensorFlow will only capture for loops that iterate over a tensor or a dataset. So make sure you use for i in tf.range(x) rather than for i in range(x), or else the loop will not be captured in the graph. Instead, it will run during tracing. (This may be what you want if the for loop is meant to build the graph, for example to create each layer in a neural network).
6. As always, for performance reasons, you should prefer a vectorized implementation whenever you can, rather than using loops.

If you would like to open the black box a bit further, for example to explore the generated graphs, you will find technical details in Appendix G.

# Exercises

1. **How would you describe TensorFlow in a short sentence? What are its main features? Can you name other popular Deep Learning libraries?**

My Answer:

>TensorFlow is an ML library with end-to-end features for bringing an ML project from research to production. Some of its main features are code optimizers, parallel distribution, function graphs, and Keras, among many others. As of 12/2024 the most popular deep learning library is PyTorch.

Book Answer:

>TensorFlow is an open-source library for numerical computation, particularly well suited and fine-tuned for large-scale ML. Its core is similar to NumPy, but it also features GPU support, support for distributed computing, computation graph analysis and optimization capabilities (with a portable graph format that allows you to train a TensorFlow model in one environment and run it in another), an optimization API based on reverse-mode autodiff, and several powerful APIs such as tf.keras, tf.image, tf.signal, and more. Other popular Deep Learning libraries include PyTorch, MXNet, Microsoft Cognitive Toolkit, Theano, Caffe2, and Chainer.

2. **Is TensorFlow a drop-in replacement for NumPy? What are the main differences between the two?**

My Answer:

>No, while TensorFlow can do many of the things NumPy can do they are not exactly equivalent. NumPy uses 64-bit floating points while TensorFlow uses 32-bit. Many NumPY functions have a different naming convention in TensorFlow, for example np.sum() is equivalent to tf.reduce_sum(). Lastly, TensorFlow is more sensitive to typing that NumPy.

Book Answer:

>Although TensorFlow offers most of the functionalities provided by NumPy, it is not a drop-in replacement, for a few reasons.
>1. The names of the functions are not always the same (for example, tf.reduce_sum() vs np.sum()).
2. Some functions do not behave in exactly the same way (for example, tf.transpose() creats a transposed copy of a tensor, while NumPy's T attribute creates a treansposed view, without actually copying any data).
3. NumPy arrays are mutable, while TensorFlow tensors are not (but you can use a tf.Variable if you need a muatable object)

3. **Do you get the same result with tf.range(10) and tf.constant(np.arange(10))?**

My Answer:

>Yes, you get the same result.

>(<tf.Tensor: shape=(10,), dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>,

> <tf.Tensor: shape=(10,), dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>)

Book Answer:

>Both tf.range(10) and tf.constant(np.arange(10)) return a one-dimensional tensor containing the integers 0 to 9. However, the former uses 32-bit integers while the latter uses 64-bit integers. 

4. **Can you name six other data structures available in TensorFlow, beyond regular tensors?**

My Answer:

>1. Sparse Tensors
2. Tensor Arrays
3. Ragged Tensors
4. String Tensors
5. Sets
6. Queues

Book Answer:

>1. Sparse Tensors
2. Tensor Arrays
3. Ragged Tensors
4. String Tensors
5. Sets
6. Queues

5. **A custom loss function can be defined by writing a function or by subclassing the keras.losses.Loss class. When would you use each option?**

My Answer:

>If a function has hyperparameters that need to be saved along with the model, then you will want to subclass the appropriate class. Writing a function is simpler but subclassing is more cohesive with the entire TensorFlow architecture.

Book Answer:

>When you want to define a custom loss function, in general you can just implement it as a regular Python function. However, if your custom loss function must support some hyperparameters (or any other state), then you should subclass the keras.losses.Loss class and implement the __init__() and __call()__ methods. If you want the loss function's hyperparameters to be saved along with the model, then you must also implement the get_config() method.

6. **Similarly, a custom metric can be defined in a function or a subclass of keras.metrics.Metric. When would you use each option?**

My Answer:

>Defining a custom metric function is exactly the same as defining a custom loss function. Specifically, if you need a streaming metric you must subclass. My personal choice is to subclass in all cases because it gives me access to the full capabilities of the TensorFlow classes on top of my customized processes.

Book Answer:

>Much like custom loss functions, most metrics can be defined as regular Python functions. But if you want your custom metric to support some hyperparameters (or any other state), then you should subclass the keras.metrics.Metric class. Moreover, if computing the metfic over a whole epoch is not equivalent to computing the mean metric over all batches in that epoch (e.g. as for precision and recall metrics), then you should subclass the keras.metrics.Metric class and implement the __init__(), __update_state()__, and __result()__ methods to keep track of a running metric during each epoch. You should also implement the __reset_states()__ method unless all it needs to do is reset all variables to 0.0. If you want the state to be saved along with the model, then you should implement the __get_config__() method as well.

7. **When should you create a custom layer versus a custom model?**

My Answer:

>In general for clean programming and debugging purposes you should always use a custom layer for layer operations and custom models for model operations. This separation of duty is best practice. That said, custom models offer more functionality and can technically be used in place of custom layers.

Book Answer:

>You should distinguish the interal components of your model (i.e. layers or reusable blocks of layers) from the model itself (i.e. the object you will train). The former should subclass the keras.layers.Layer class and the latter should subclass the keras.models.Model class.

8. **What are some use cases that require writing your own custom training loop?**

My Answer:

>If you need to control aspects of the training process that are mid-layer or mid-epoch you can use a custom loop to do so. The example the book gave was the Wide & Deep paper which uses different optimizers on different layers.

Book Answer:

>Writing your own custom training loop is fairly advanced, so you should only do it if you really need to. Keras provides several tools to customize training without having to write a custom training loop: callbacks, custom regularizers, custom constraints, custom losses, and so on. You should use these instead of writing a custom training loop whenever possible: writing a custom training loop is more error-prone, and it will be harder to reuse the custom code you write. However, in some cases writing a custom training loop is necessary -- for example, if you want to use different optimizers for different parts of your neural network, like in the Wide & Deep paper. A custom training loop can also be useful when debugging, or when trying to understand exactly how training works.

9. **Can custom Keras components contain arbitrary Python code, or must they be convertible to TF Functions?**

My Answer:

>They can contain arbitrary Python code but if the code is not convertible to TF Functions then it will add memory burdens to your code. It is always preferable to encapsulte python code that is written friendly to TF Functions and TensorFlow graphs when using TensorFlow to harness the full potential TF offers.

Book Answer:

>Custom keras components should be convertible to TF Functions, which means they should stick to TF operations as much as possible and respect all the rules listed in 'TF Function Rules' on page 409. If you absolutely need to include arbitrary Python code in a custom component, you can either wrap it in a tf.py_function() operation (but this will reduce performance and limit your model's portability) or set dynamic=True when creating the custom layer or model (or set run_eagerly=True when calling the model's compile() method).

10. **What are the main rules to respect if you want a function to be convertible to a TF Function?**

My Answer:

>1. If you call any external library, including NumPy or even the standard library, this call will run only during tracing; it will not be part of the graph. Indeed, a TensorFlow graph can only include TensorFlow constructs (tensors, operations, variables, datasets, and so on). So make sure you use tf.reduce_sum() instead of np.sum(), tf.sort() instead of the built-in sorted() function, and so on (unless you really want the code to run only during tracing). This has a few additional implications:
    1. If you define a TF Function f(x) that just returns np.random.rand(), a random number will only be generated when the function is traced, so f(tf.constant(2.)) and f(tf.constant(3.)) will return the same random number, but f(tf.constant([2., 3.])) will return a different one. If you replace np.random.rand() with tf.random.uniform([]), then a new random number will be generated upon every call, since the operation will be part of the graph.
    2. If your non-TensorFlow code has side effects (such as logging something or updating a Python counter), then you should not expect those side effects to occur every time you call the TF Function, as they will only occur when the function is traced.
    3. You can wrap arbitrary Python code in a tf.py_function() operation, but doing so will hinder performance, as TensorFlow will not be able to do any graph optimization on this code. It will also reduce portability, as the graph will only run on platforms where Python is available (and where the right libraries are installed).
2. You can call other Python functions of TF Functions, but they should follow the same rules, as TensorFlow will capture their operations in the computation graph. Note that these other functions do not need to be decorated with @tf.function
3. If the function creates a TensorFlow variable (or any other stateful TensorFlow object, such as a dataset or a queue), it must do so upon the very first call, and only then, or else you will get an exception. It is usually preferable to create variables outside of the TF Function (e.g. in the build() method of a custom layer). If you want to assign a new value to the variable, make sure you call its assign() method, instead of using the = operator.
4. The source code of your Python function should be available to TensorFlow. If the source code is unavailable (for example, if you define your function in the Python shell, which does not give access to the source code, or if you deploy only compiled .pyc Python files to production), then the graph generation process will fail or have limited functionality.
5. TensorFlow will only capture for loops that iterate over a tensor or a dataset. So make sure you use for i in tf.range(x) rather than for i in range(x), or else the loop will not be captured in the graph. Instead, it will run during tracing. (This may be what you want if the for loop is meant to build the graph, for example to create each layer in a neural network).
6. As always, for performance reasons, you should prefer a vectorized implementation whenever you can, rather than using loops.

Book Answer:

>1. If you call any external library, including NumPy or even the standard library, this call will run only during tracing; it will not be part of the graph. Indeed, a TensorFlow graph can only include TensorFlow constructs (tensors, operations, variables, datasets, and so on). So make sure you use tf.reduce_sum() instead of np.sum(), tf.sort() instead of the built-in sorted() function, and so on (unless you really want the code to run only during tracing). This has a few additional implications:
    1. If you define a TF Function f(x) that just returns np.random.rand(), a random number will only be generated when the function is traced, so f(tf.constant(2.)) and f(tf.constant(3.)) will return the same random number, but f(tf.constant([2., 3.])) will return a different one. If you replace np.random.rand() with tf.random.uniform([]), then a new random number will be generated upon every call, since the operation will be part of the graph.
    2. If your non-TensorFlow code has side effects (such as logging something or updating a Python counter), then you should not expect those side effects to occur every time you call the TF Function, as they will only occur when the function is traced.
    3. You can wrap arbitrary Python code in a tf.py_function() operation, but doing so will hinder performance, as TensorFlow will not be able to do any graph optimization on this code. It will also reduce portability, as the graph will only run on platforms where Python is available (and where the right libraries are installed).
2. You can call other Python functions of TF Functions, but they should follow the same rules, as TensorFlow will capture their operations in the computation graph. Note that these other functions do not need to be decorated with @tf.function
3. If the function creates a TensorFlow variable (or any other stateful TensorFlow object, such as a dataset or a queue), it must do so upon the very first call, and only then, or else you will get an exception. It is usually preferable to create variables outside of the TF Function (e.g. in the build() method of a custom layer). If you want to assign a new value to the variable, make sure you call its assign() method, instead of using the = operator.
4. The source code of your Python function should be available to TensorFlow. If the source code is unavailable (for example, if you define your function in the Python shell, which does not give access to the source code, or if you deploy only compiled .pyc Python files to production), then the graph generation process will fail or have limited functionality.
5. TensorFlow will only capture for loops that iterate over a tensor or a dataset. So make sure you use for i in tf.range(x) rather than for i in range(x), or else the loop will not be captured in the graph. Instead, it will run during tracing. (This may be what you want if the for loop is meant to build the graph, for example to create each layer in a neural network).
6. As always, for performance reasons, you should prefer a vectorized implementation whenever you can, rather than using loops.

11. **When would you need to create a dynamic Keras model? How do you do that? Why not make all your models dynamic?**

My Answer:

>From chapter 10, when your model involves loops, varying shapes, conditional branching, or other dynamic behavior the sublcassing API is the tool to use. This involves subclassing the tf.keras.Model class and peforming the dynamic operations in the call() method. The extra flexibility comes at a cost: your model's architecture is hidden within the call() method, so Keras cannot easily inspect it; it cannot save or clone it; and when you call the summary() method, you only get a list of layers without any information on how they are connected to each other. Moreover, Keras cannot check types and shapes ahead of time, and it is easier to make mistakes.

Book Answer:

12. **Implement a custom layer that performs *Layer Normalization* (we will use this type of layer in Chapter 15):**

    1. The build() method should define two trainable weights $\alpha$ and $\beta$, both of shape input_shape[-1:] and data type tf.float32. $\alpha$ should be initialized with 1s and $\beta$ with 0s

    2. The call() method should compute the mean $\mu$ and standard deviation $\sigma$ of each instance's features. For this, you can use tf.nn.moments(inputs, axes=-1, keepdims=True), which returns the mean $\mu$ and the variance $\sigma^{2}$ of all instances (compute the square root of the variance to get the standard deviation). Then the function should computer and return $\large\alpha\bigotimes\frac{(X - \mu)}{\sigma + \epsilon} + \beta$, where $\bigotimes$ represents itemwise multiplication and $\epsilon$ is a smoothing term (small constant to avoid division by zero, e.g. 0.001)

    3. Ensure that your custom layer produces the same (or very nearly the same) outoput as the keras.layers.LayerNormalization layer.

13. **Train a model using a custom training loop to tackle the Fashion MNIST dataset (see Chapter 10).**

    1. Display the epoch, iteration, mean training loss, and mean accuracy over each epoch (updated at each iteration), as well as the validation loss and accuracy at the end of each epoch.
    
    2. Try using a different optimizer with a different learning rate for the upper layers and the lower layers.