### Best practices for the real world

#### Getting the most out of your models
Blindly trying out different architecture configurations works well enough if you just need something that works okay. In this section, we’ll go beyond “works okay” to “works great and wins machine learning competitions” via a set of must-know techniques for building state-of-the-art deep learning models.

##### Hyperparameter optimization
When building a deep learning model, you have to make many seemingly arbitrary decisions: How many layers should you stack? How many units or filters should go in each layer? Should you use relu as activation, or a different function? Should you use BatchNormalization after a given layer? How much dropout should you use? And so on. These architecture-level parameters are called **hyperparameters** to distinguish them from the **parameters** of a model, which are trained via backpropagation. <br>
In practice, experienced machine learning engineers and researchers build intuition over time as to what works and what doesn’t when it comes to these choices: they develop hyperparameter-tuning skills. But there are no formal rules. If you want to get to the very limit of what can be achieved on a given task, you can’t be content with such arbitrary choices. Your initial decisions are almost always suboptimal, even if you have very good intuition. You can refine your choices by tweaking them by hand and retraining the model repeatedly; that’s what machine learning engineers and researchers spend most of their time doing. But it shouldn’t be your job as a human to fiddle with hyperparameters all day. That is better left to a machine. <br>
Thus you need to explore the space of possible decisions automatically, systematically, in a principled way. You need to search the architecture space and find the best performing architectures empirically. That’s what the field of automatic hyperparameter optimization is about: it’s an entire field of research, and an important one. <br>
The process of optimizing hyperparameters typically looks like this:
1. Choose a set of hyperparameters (automatically).
2. Build the corresponding model.
3. Fit it to your training data, and measure performance on the validation data.
4. Choose the next set of hyperparameters to try (automatically).
5. Repeat.
6. Eventually, measure performance on your test data.
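
To make the loop above concrete, here is a minimal sketch that uses plain random search as the "choose automatically" step. Everything in it is an assumption made for illustration: the `SEARCH_SPACE` dictionary, the helper names, and especially `train_and_validate()`, which is a made-up scoring rule plus noise standing in for building and fitting a real model. Step 6 (the final measurement on test data) is left out.

```python
import random

# Hypothetical search space mirroring the kinds of choices discussed above.
SEARCH_SPACE = {
    "units": [16, 32, 48, 64],
    "dropout": [0.0, 0.25, 0.5],
    "optimizer": ["rmsprop", "adam"],
}

def sample_hyperparameters(rng):
    """Steps 1 and 4: pick one value per hyperparameter at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def train_and_validate(hp, rng):
    """Stand-in for steps 2 and 3 (build, fit, measure validation accuracy).

    This is a made-up scoring rule plus noise, not a real training run.
    """
    score = 0.90 + 0.0005 * hp["units"] - 0.04 * abs(hp["dropout"] - 0.25)
    if hp["optimizer"] == "adam":
        score += 0.005
    return score + rng.gauss(0.0, 0.002)

def random_search(num_trials, seed=0):
    rng = random.Random(seed)
    best_hp, best_score = None, float("-inf")
    for _ in range(num_trials):  # Step 5: repeat.
        hp = sample_hyperparameters(rng)
        score = train_and_validate(hp, rng)
        if score > best_score:
            best_hp, best_score = hp, score
    return best_hp, best_score

best_hp, best_score = random_search(num_trials=20)
print(best_hp, round(best_score, 4))
```

Random search ignores the results of earlier trials when picking the next one. Smarter strategies differ only in how `sample_hyperparameters()` would use that history.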

The key to this process is the algorithm that analyzes the relationship between validation performance and various hyperparameter values to choose the next set of hyperparameters to evaluate. Many different techniques are possible: Bayesian optimization, genetic algorithms, simple random search, and so on. <br>
Training the weights of a model is relatively easy: you compute a loss function on a mini-batch of data and then use backpropagation to move the weights in the right direction. Updating hyperparameters, on the other hand, presents unique challenges. <br>
Consider these points:
- The hyperparameter space is typically made up of discrete decisions and thus isn’t continuous or differentiable. Hence, you typically can’t do gradient descent in hyperparameter space. Instead, you must rely on gradient-free optimization techniques, which naturally are far less efficient than gradient descent.
- Computing the feedback signal of this optimization process (does this set of hyperparameters lead to a high-performing model on this task?) can be extremely expensive: it requires creating and training a new model from scratch on your dataset.
- The feedback signal may be noisy: if a training run performs 0.2% better, is that because of a better model configuration, or because you got lucky with the initial weight values?
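
The noise problem in the last point can be illustrated with a small simulation. This is only a sketch under stated assumptions: `noisy_validation_accuracy()` is a hypothetical stand-in for a training run, and the noise level of 0.002 (that is, 0.2%) is chosen to match the figure quoted above rather than measured on a real model.

```python
import random
import statistics

def noisy_validation_accuracy(rng, true_accuracy=0.97, noise_std=0.002):
    """Hypothetical stand-in for one training run: the same configuration
    scores slightly differently each time because of random initialization."""
    return true_accuracy + rng.gauss(0.0, noise_std)

def averaged_score(rng, num_runs):
    """Average several runs of the same configuration."""
    return statistics.mean(noisy_validation_accuracy(rng) for _ in range(num_runs))

rng = random.Random(0)
single_runs = [averaged_score(rng, num_runs=1) for _ in range(2000)]
averaged_runs = [averaged_score(rng, num_runs=4) for _ in range(2000)]

spread_single = statistics.stdev(single_runs)
spread_averaged = statistics.stdev(averaged_runs)
print(f"spread of a single run:    {spread_single:.5f}")
print(f"spread of a 4-run average: {spread_averaged:.5f}")
```

Averaging repeated runs shrinks the spread of the score, at the price of multiplying the training cost by the number of runs.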

Thankfully, there’s a tool that makes hyperparameter tuning simpler: **KerasTuner**. <br>
Let’s check it out.

##### USING KERASTUNER
Let’s start by installing KerasTuner:

```python
!pip install keras-tuner -q
```

KerasTuner lets you replace hard-coded hyperparameter values, such as units=32, with a range of possible choices, such as Int(name="units", min_value=16, max_value=64, step=16). This set of choices in a given model is called the search space of the hyperparameter tuning process. <br>
To specify a search space, define a model-building function (see the next listing). It takes an hp argument, from which you can sample hyperparameter ranges, and it returns a compiled Keras model.

##### A KerasTuner model-building function

In [1]:
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    # Sample hyperparameter values from the hp object. After sampling, these values (such as the "units" variable here) are just regular Python constants.
    units = hp.Int(name="units", min_value=16, max_value=64, step=16)
    model = keras.Sequential([
        layers.Dense(units, activation="relu"),
        layers.Dense(10, activation="softmax")
    ])
    # Different kinds of hyperparameters are available: Int, Float, Boolean, Choice.
    optimizer = hp.Choice(name="optimizer", values=["rmsprop", "adam"])
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )
    return model # The function returns a compiled model.

If you want to adopt a more modular and configurable approach to model-building, you can also subclass the **HyperModel** class and define a build method, as follows.

##### A KerasTuner HyperModel

In [2]:
import keras_tuner as kt

class SimpleMLP(kt.HyperModel):
    # Thanks to the object-oriented approach, we can configure model constants as constructor arguments (instead of hardcoding them in the model-building function).
    def __init__(self, num_classes):
        super().__init__()  # Let the HyperModel base class initialize its own attributes.
        self.num_classes = num_classes
    
    # The build() method is identical to our prior build_model() standalone function.
    def build(self, hp):
        units = hp.Int(name="units", min_value=16, max_value=64, step=16)
        model = keras.Sequential([
            layers.Dense(units, activation="relu"),
            layers.Dense(self.num_classes, activation="softmax")
        ])
        optimizer = hp.Choice(name="optimizer", values=["rmsprop", "adam"])
        model.compile(
            optimizer=optimizer,
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
        return model

hypermodel = SimpleMLP(num_classes=10)

The next step is to define a “tuner.” Schematically, you can think of a tuner as a for loop that will repeatedly
- Pick a set of hyperparameter values
- Call the model-building function with these values to create a model
- Train the model and record its metrics

KerasTuner has several built-in tuners available—**RandomSearch, BayesianOptimization**, and **Hyperband**. <br>
Let’s try **BayesianOptimization**, a tuner that attempts to make smart predictions for which new hyperparameter values are likely to perform best given the outcomes of previous choices:

In [3]:
tuner = kt.BayesianOptimization(
    hypermodel=build_model, # Specify the model-building function (or hypermodel instance).
    objective="val_accuracy", # Specify the metric that the tuner will seek to optimize. Always specify validation metrics, since the goal of the search process is to find models that generalize!
    max_trials=100, # Maximum number of different model configurations (“trials”) to try before ending the search.
    executions_per_trial=2, # To reduce metrics variance, you can train the same model multiple times and average the results. executions_per_trial is how many training rounds (executions) to run for each model configuration (trial).
    directory="mnist_kt_test", # Where to store search logs
    overwrite=True, # Whether to overwrite data in directory to start a new search. Set this to True if you’ve modified the model-building function, or to False to resume a previously started search with the same model-building function.
)

You can display an overview of the search space via **search_space_summary()**:

In [4]:
tuner.search_space_summary()
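
For a search space this small, you can also enumerate the configurations by hand to see how large it is. The sketch below does not use KerasTuner at all. It simply rebuilds the two ranges declared in the model-building function with plain Python, assuming that `max_value` is inclusive; the variable names are made up for illustration.

```python
from itertools import product

# The two hyperparameters declared in build_model().
units_choices = list(range(16, 64 + 1, 16))  # Int(min_value=16, max_value=64, step=16)
optimizer_choices = ["rmsprop", "adam"]      # Choice(values=["rmsprop", "adam"])

configurations = list(product(units_choices, optimizer_choices))
for units, optimizer in configurations:
    print(f"units={units:<3} optimizer={optimizer}")
print(f"{len(configurations)} distinct configurations")
```

Each extra hyperparameter multiplies this count, which is why the number of trials you can afford matters as soon as the space grows. Here, a budget of 100 trials is far larger than the number of distinct configurations.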

##### Objective maximization and minimization
For built-in metrics (like accuracy, in our case), the direction of the metric (accuracy should be maximized, but a loss should be minimized) is inferred by KerasTuner. However, for a custom metric, you should specify it yourself, like this:

```python
objective = kt.Objective(
    name="val_accuracy",  # The metric’s name, as found in epoch logs
    direction="max")  # The metric’s desired direction: "min" or "max"
tuner = kt.BayesianOptimization(
    build_model,
    objective=objective,
    ...
)
```

Finally, let’s launch the search. Don’t forget to pass validation data, and make sure not to use your test set as validation data—otherwise you’d quickly start overfitting to your test data, and you wouldn’t be able to trust your test metrics anymore:

In [5]:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape((-1, 28 * 28)).astype("float32") / 255
x_test = x_test.reshape((-1, 28 * 28)).astype("float32") / 255
# Reserve these for later.
x_train_full = x_train[:]
y_train_full = y_train[:]
# Set these aside as a validation set.
num_val_samples = 10000
x_train, x_val = x_train[:-num_val_samples], x_train[-num_val_samples:]
y_train, y_val = y_train[:-num_val_samples], y_train[-num_val_samples:]
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5),
]
# This takes the same arguments as fit() (it simply passes them down to fit() for each new model).
# Use a large number of epochs (you don’t know in advance how many epochs each model will need), and use an EarlyStopping callback to stop training when you start overfitting.
tuner.search(
    x_train, y_train,
    batch_size=128,
    epochs=100,
    validation_data=(x_val, y_val),
    callbacks=callbacks,
    verbose=2,
)

The preceding example will run in just a few minutes, since we’re only looking at a few possible choices and we’re training on MNIST. However, with a typical search space and dataset, you’ll often find yourself letting the hyperparameter search run overnight or even over several days. If your search process crashes, you can always restart it—just specify **overwrite=False** in the tuner so that it can resume from the trial logs stored on disk. <br>
Once the search is complete, you can query the best hyperparameter configurations, which you can use to create high-performing models that you can then retrain.

##### Querying the best hyperparameter configurations

In [6]:
top_n = 4
# Returns a list of HyperParameter objects, which you can pass to the model-building function
best_hps = tuner.get_best_hyperparameters(top_n)

Usually, when retraining these models, you may want to include the validation data as part of the training data, since you won’t be making any further hyperparameter changes, and thus you will no longer be evaluating performance on the validation data. In our example, we’d train these final models on the totality of the original MNIST training data, without reserving a validation set. <br>
Before we can train on the full training data, though, there’s one last parameter we need to settle: the optimal number of epochs to train for. Typically, you’ll want to train the new models for longer than you did during the search: using an aggressive patience value in the EarlyStopping callback saves time during the search, but it may lead to underfit models. Just use the validation set to find the best epoch:

In [7]:
def get_best_epoch(hp):
    model = build_model(hp)
    callbacks=[
        keras.callbacks.EarlyStopping(
            monitor="val_loss", mode="min", patience=10) # Note the very high patience value.
    ]
    history = model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        epochs=100,
        batch_size=128,
        callbacks=callbacks)
    val_loss_per_epoch = history.history["val_loss"]
    best_epoch = val_loss_per_epoch.index(min(val_loss_per_epoch)) + 1
    print(f"Best epoch: {best_epoch}")
    return best_epoch

Finally, train on the full dataset for just a bit longer than this epoch count, since you’re training on more data; 20% more in this case:

In [8]:
def get_best_trained_model(hp):
    best_epoch = get_best_epoch(hp)
    # Build a fresh, untrained model for the final training run.
    model = build_model(hp)
    model.fit(
        x_train_full, y_train_full,
        batch_size=128, epochs=int(best_epoch * 1.2))
    return model

best_models = []
for hp in best_hps:
    model = get_best_trained_model(hp)
    model.evaluate(x_test, y_test)
    best_models.append(model)
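
The epoch arithmetic used in these two functions can be checked in isolation. The sketch below uses a hypothetical list of validation losses in place of `history.history["val_loss"]`, and the helper name `epochs_for_full_training` is made up for illustration.

```python
def epochs_for_full_training(val_loss_per_epoch):
    """Return (best_epoch, final_epochs) from a list of per-epoch validation losses."""
    # Epochs are numbered from 1, while list indices start at 0.
    best_epoch = val_loss_per_epoch.index(min(val_loss_per_epoch)) + 1
    # Train 20% longer on the full dataset, truncated to a whole number of epochs.
    final_epochs = int(best_epoch * 1.2)
    return best_epoch, final_epochs

# Hypothetical validation losses: the minimum is reached at the tenth epoch.
val_losses = [0.50, 0.40, 0.34, 0.30, 0.28, 0.27, 0.26, 0.255, 0.252, 0.250, 0.251, 0.254]
best_epoch, final_epochs = epochs_for_full_training(val_losses)
print(f"best epoch: {best_epoch}, epochs for the final run: {final_epochs}")
```

Because `int()` truncates, the 20% extension has no effect when the best epoch is small: a best epoch of 4 gives `int(4.8)`, which is still 4.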

Note that if you’re not worried about slightly underperforming, there’s a shortcut you can take: just use the tuner to reload the top-performing models with the best weights saved during the hyperparameter search, without retraining new models from scratch:

In [None]:
best_models = tuner.get_best_models(top_n)

**NOTE** One important issue to think about when doing automatic hyperparameter optimization at scale is validation-set overfitting. Because you’re updating hyperparameters based on a signal that is computed using your validation data, you’re effectively training them on the validation data, and thus they will quickly overfit to the validation data. Always keep this in mind.
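
A small simulation shows why this matters. In the sketch below, which is a toy model and not a real search, every candidate configuration has exactly the same true accuracy (0.90). The only difference between trials is the luck of the draw on a finite validation set. The function name and all numbers are assumptions chosen for illustration.

```python
import random

def simulate_search(num_trials, num_val_samples=1000, true_accuracy=0.90, seed=0):
    """Every candidate has the same true accuracy; only validation noise differs.

    Returns the best validation accuracy seen during the search.
    """
    rng = random.Random(seed)
    best_val_accuracy = 0.0
    for _ in range(num_trials):
        # Each validation sample is classified correctly with probability true_accuracy.
        correct = sum(rng.random() < true_accuracy for _ in range(num_val_samples))
        best_val_accuracy = max(best_val_accuracy, correct / num_val_samples)
    return best_val_accuracy

for num_trials in (1, 10, 100):
    best = simulate_search(num_trials)
    print(f"{num_trials:>3} trials: best validation accuracy {best:.3f} (true accuracy 0.900)")
```

The more trials you run, the more the best validation score overstates what the selected model can really do, which is why the final measurement belongs on a test set that played no part in the search.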

##### THE ART OF CRAFTING THE RIGHT SEARCH SPACE
Overall, hyperparameter optimization is a powerful technique that is an absolute requirement for getting to state-of-the-art models on any task or for winning machine learning competitions. Think about it: once upon a time, people handcrafted the features that went into shallow machine learning models. That was very much suboptimal. Now, deep learning automates the task of hierarchical feature engineering: features are learned using a feedback signal, not hand-tuned, and that’s the way it should be. In the same way, you shouldn’t handcraft your model architectures; you should optimize them in a principled way. <br>
However, doing hyperparameter tuning is not a replacement for being familiar with model architecture best practices. Search spaces grow combinatorially with the number of choices, so it would be far too expensive to turn everything into a hyperparameter and let the tuner sort it out. You need to be smart about designing the right search space. Hyperparameter tuning is automation, not magic: you use it to automate experiments that you would otherwise have run by hand, but you still need to handpick experiment configurations that have the potential to yield good metrics. <br>
The good news is that by leveraging hyperparameter tuning, the configuration decisions you have to make graduate from micro-decisions (what number of units do I pick for this layer?) to higher-level architecture decisions (should I use residual connections throughout this model?). And while micro-decisions are specific to a certain model and a certain dataset, higher-level decisions generalize better across different tasks and datasets. For instance, pretty much every image classification problem can be solved via the same sort of search-space template. <br>
Following this logic, KerasTuner attempts to provide **premade search spaces** that are relevant to broad categories of problems, such as image classification. Just add data, run the search, and get a pretty good model. You can try the hypermodels **kt.applications.HyperXception** and **kt.applications.HyperResNet**, which are effectively tunable versions of Keras Applications models.

##### THE FUTURE OF HYPERPARAMETER TUNING: AUTOMATED MACHINE LEARNING

Currently, most of your job as a deep learning engineer consists of munging data with Python scripts and then tuning the architecture and hyperparameters of a deep network at length to get a working model, or even to get a state-of-the-art model, if you are that ambitious. Needless to say, that isn’t an optimal setup. But automation can help, and it won’t stop merely at hyperparameter tuning. <br>
Searching over a set of possible learning rates or possible layer sizes is just the first step. We can also be far more ambitious and attempt to generate the **model architecture** itself from scratch, with as few constraints as possible, such as via reinforcement learning or genetic algorithms. In the future, entire end-to-end machine learning pipelines will be automatically generated, rather than be handcrafted by engineer-artisans. This is called automated machine learning, or **AutoML**. You can already leverage libraries like **AutoKeras** (https://github.com/keras-team/autokeras) to solve basic machine learning problems with very little involvement on your part. <br>


##### Model ensembling
Another powerful technique for obtaining the best possible results on a task is **model ensembling**. Ensembling consists of pooling together the predictions of a set of different models to produce better predictions. If you look at machine learning competitions, in particular on Kaggle, you’ll see that the winners use very large ensembles of models that inevitably beat any single model, no matter how good. <br>
Ensembling relies on the assumption that different well-performing models trained independently are likely to be good for different reasons: each model looks at slightly different aspects of the data to make its predictions, getting part of the “truth” but not all of it. You may be familiar with the ancient parable of the blind men and the elephant: a group of blind men come across an elephant for the first time and try to understand what the elephant is by touching it. Each man touches a different part of the elephant’s body—just one part, such as the trunk or a leg. Then the men describe to each other what an elephant is: “It’s like a snake,” “Like a pillar or a tree,” and so on. The blind men are essentially machine learning models trying to understand the manifold of the training data, each from their own perspective, using their own assumptions (provided by the unique architecture of the model and the unique random weight initialization). Each of them gets part of the truth of the data, but not the whole truth. By pooling their perspectives together, you can get a far more accurate description of the data. The elephant is a combination of parts: not any single blind man gets it quite right, but, interviewed together, they can tell a fairly accurate story. <br>
Let’s use classification as an example. The easiest way to pool the predictions of a set of classifiers (**to ensemble the classifiers**) is to average their predictions at inference time:

```python
# Use four different models to compute initial predictions.
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
preds_c = model_c.predict(x_val)
preds_d = model_d.predict(x_val)
# This new prediction array should be more accurate than any of the initial ones.
final_preds = 0.25 * (preds_a + preds_b + preds_c + preds_d)
```

However, this will work only if the classifiers are more or less equally good. If one of them is significantly worse than the others, the final predictions may not be as good as the best classifier of the group. <br>
A smarter way to ensemble classifiers is to do a weighted average, where the weights are learned on the validation data—typically, the better classifiers are given a higher weight, and the worse classifiers are given a lower weight. To search for a good set of ensembling weights, you can use random search or a simple optimization algorithm, such as the Nelder-Mead algorithm:

```python
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
preds_c = model_c.predict(x_val)
preds_d = model_d.predict(x_val)
# These weights (0.5, 0.25, 0.1, 0.15) are assumed to be learned empirically.
final_preds = 0.5 * preds_a + 0.25 * preds_b + 0.1 * preds_c + 0.15 * preds_d
```
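
How might such weights be found? Here is a minimal sketch that uses random search, one of the two options mentioned above, on a toy binary classification problem. The "models" are hypothetical: `fake_model()` just produces noisy probabilities around the true labels, with a different noise level per model. The helper names and the use of log loss as the validation metric are also assumptions.

```python
import math
import random

def log_loss(labels, probs):
    """Binary cross-entropy, with probabilities clipped away from 0 and 1."""
    eps = 1e-7
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

def blend(weights, all_preds):
    """Weighted average of per-model probability lists."""
    return [sum(w * preds[i] for w, preds in zip(weights, all_preds))
            for i in range(len(all_preds[0]))]

def search_weights(labels, all_preds, num_candidates=300, seed=0):
    rng = random.Random(seed)
    num_models = len(all_preds)
    best_weights = [1.0 / num_models] * num_models  # Start from the plain average.
    best_loss = log_loss(labels, blend(best_weights, all_preds))
    for _ in range(num_candidates):
        raw = [rng.random() for _ in range(num_models)]
        weights = [r / sum(raw) for r in raw]  # Non-negative weights that sum to 1.
        loss = log_loss(labels, blend(weights, all_preds))
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights, best_loss

# Toy validation set and four hypothetical models of decreasing quality.
data_rng = random.Random(1)
labels = [data_rng.randint(0, 1) for _ in range(500)]

def fake_model(noise_std):
    return [min(max(0.8 * y + 0.1 + data_rng.gauss(0.0, noise_std), 0.0), 1.0)
            for y in labels]

all_preds = [fake_model(std) for std in (0.1, 0.2, 0.3, 0.6)]
uniform_loss = log_loss(labels, blend([0.25] * 4, all_preds))
weights, weighted_loss = search_weights(labels, all_preds)
print("weights:", [round(w, 2) for w in weights])
print(f"uniform average loss: {uniform_loss:.4f}, weighted loss: {weighted_loss:.4f}")
```

Since the search starts from the plain average, the weights it returns can never score worse than that average on the validation data. Whether they also help on new data is a separate question, because the weights are themselves fit to the validation set.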

There are many possible variants: you can do an average of an exponential of the predictions, for instance. In general, a simple weighted average with weights optimized on the validation data provides a very strong baseline. <br>
The key to making ensembling work is the **diversity** of the set of classifiers. Diversity is strength. If all the blind men only touched the elephant’s trunk, they would agree that elephants are like snakes, and they would forever stay ignorant of the truth of the elephant. Diversity is what makes ensembling work. In machine learning terms, if all of your models are biased in the same way, your ensemble will retain this same bias. **If your models are biased in different ways, the biases will cancel each other out, and the ensemble will be more robust and more accurate.** <br>
For this reason, **you should ensemble models that are as good as possible while being as different as possible.** This typically means using very different architectures or even different brands of machine learning approaches. One thing that is largely not worth doing is ensembling the same network trained several times independently, from different random initializations. If the only difference between your models is their random initialization and the order in which they were exposed to the training data, then your ensemble will be low-diversity and will provide only a tiny improvement over any single model.
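
The effect of diversity can be sketched with a toy simulation. It is a deliberate simplification: each "model" predicts the truth plus an error term, and that error is split into a part shared by all models (a common bias) and a part specific to each model. The function name and the chosen numbers are assumptions made for illustration, and real classifiers will not decompose this cleanly.

```python
import random
import statistics

def ensemble_error(num_models, shared_std, individual_std, num_samples=5000, seed=0):
    """Mean squared error of an averaged ensemble on a toy target.

    Each model's error is a shared error plus an individual error.
    The shared error plays the role of a bias that all models have in common.
    """
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(num_samples):
        shared = rng.gauss(0.0, shared_std)
        errors = [shared + rng.gauss(0.0, individual_std) for _ in range(num_models)]
        squared_errors.append(statistics.mean(errors) ** 2)
    return statistics.mean(squared_errors)

# In all three cases, each individual model has the same error variance (1.0).
single = ensemble_error(num_models=1, shared_std=0.0, individual_std=1.0)
similar = ensemble_error(num_models=4, shared_std=0.95,
                         individual_std=(1 - 0.95 ** 2) ** 0.5)
diverse = ensemble_error(num_models=4, shared_std=0.0, individual_std=1.0)

print(f"single model:               {single:.3f}")
print(f"4 models, mostly shared:    {similar:.3f}")
print(f"4 models, independent:      {diverse:.3f}")
```

Averaging only removes the part of the error that differs between models. When most of the error is shared, the ensemble is barely better than any one of its members.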

##### Scaling-up model training
Recall the “loop of progress” concept we introduced in chapter 7: the quality of your ideas is a function of how many refinement cycles they’ve been through (see figure 13.1). And the speed at which you can iterate on an idea is a function of how fast you can set up an experiment, how fast you can run that experiment, and finally, how well you can analyze the resulting data.

![](./images/13.1.png)

As you develop your expertise with the Keras API, how fast you can code up your deep learning experiments will cease to be the bottleneck of this progress cycle. The next bottleneck will become the speed at which you can train your models. Fast training infrastructure means that you can get your results back in 10–15 minutes, and hence, that you can go through dozens of iterations every day. Faster training directly improves the quality of your deep learning solutions. <br>
In this section, you’ll learn about three ways you can train your models faster:
- Mixed-precision training, which you can use even with a single GPU
- Training on multiple GPUs
- Training on TPUs

##### Speeding up training on GPU with mixed precision
What if I told you there’s a simple technique you can use to speed up the training of almost any model by up to 3X, basically for free? It seems too good to be true, and yet, such a trick does exist. That’s mixed-precision training. To understand how it works, we first need to take a look at the notion of “precision” in computer science.

##### UNDERSTANDING FLOATING-POINT PRECISION
Precision is to numbers what resolution is to images. Because computers can only process ones and zeros, any number seen by a computer has to be encoded as a binary string. For instance, you may be familiar with uint8 integers, which are integers encoded on eight bits: 00000000 represents 0 in uint8, and 11111111 represents 255. To represent integers beyond 255, you’d need to add more bits—eight isn’t enough. Most integers are stored on 32 bits, with which you can represent signed integers ranging from –2147483648 to 2147483647. <br>
Floating-point numbers are the same. In mathematics, real numbers form a continuous axis: there’s an infinite number of points in between any two numbers. You can always zoom in on the axis of reals. In computer science, this isn’t true: there’s a finite number of intermediate points between 3 and 4, for instance. How many? Well, it depends on the precision you’re working with—the number of bits you’re using to store a number. You can only zoom up to a certain resolution. <br>
There are three levels of precision you’d typically use:
- Half precision, or **float16**, where numbers are stored on 16 bits
- Single precision, or **float32**, where numbers are stored on 32 bits
- Double precision, or **float64**, where numbers are stored on 64 bits

The way to think about the resolution of floating-point numbers is in terms of the smallest distance between two numbers that you’ll be able to safely tell apart. For numbers of magnitude around 1:
- In single precision, that’s around 1e-7. 
- In double precision, that’s around 1e-16. 
- And in half precision, it’s only 1e-3.
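
You can check these orders of magnitude yourself with nothing but the standard library. The sketch below stores `1.0 + step` in each format using the `struct` module (format codes `e`, `f`, and `d` for 16, 32, and 64 bits) and reports whether the result can still be told apart from `1.0`. The helper name `round_trip` is made up for illustration.

```python
import struct

def round_trip(fmt, value):
    """Store a Python float (float64) in the given format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

FORMATS = {"float16": "<e", "float32": "<f", "float64": "<d"}

for name, fmt in FORMATS.items():
    for step in (1e-3, 1e-4, 1e-6, 1e-8):
        stored = round_trip(fmt, 1.0 + step)
        status = "kept" if stored != 1.0 else "lost"
        print(f"{name}: 1.0 + {step:g} -> {status}")
```

Note that the resolution is relative: the spacing between representable numbers grows with their magnitude, so the figures above apply to values near 1.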

Every model you’ve seen in this book so far used single-precision numbers: it stored its state as float32 weight variables and ran its computations on float32 inputs. That’s enough precision to run the forward and backward passes of a model without losing any information, particularly when it comes to small gradient updates (recall that the typical learning rate is 1e-3, and it’s pretty common to see weight updates on the order of 1e-6). <br>
You could also use float64, though that would be wasteful—operations like matrix multiplication or addition are much more expensive in double precision, so you’d be doing twice as much work for no clear benefits. But you could not do the same with float16 weights and computation; the gradient descent process wouldn’t run smoothly, since you couldn’t represent small gradient updates of around 1e-5 or 1e-6. <br>
You can, however, use a hybrid approach: that’s what mixed precision is about. The idea is to leverage 16-bit computations in places where precision isn’t an issue, and to work with 32-bit values in other places to maintain numerical stability. Modern GPUs and TPUs feature specialized hardware that can run 16-bit operations much faster and use less memory than equivalent 32-bit operations. By using these lower-precision operations whenever possible, you can speed up training on those devices by a significant factor. Meanwhile, by maintaining the precision-sensitive parts of the model in single precision, you can get these benefits without meaningfully impacting model quality. <br>
And those benefits are considerable: on modern NVIDIA GPUs, mixed precision can speed up training by up to 3X. It’s also beneficial when training on a TPU (a subject we’ll get to in a bit), where it can speed up training by up to 60%.

##### MIXED-PRECISION TRAINING IN PRACTICE
When training on a GPU, you can turn on mixed precision like this:

In [None]:
from tensorflow import keras
keras.mixed_precision.set_global_policy("mixed_float16")

Typically, most of the forward pass of the model will be done in float16 (with the exception of numerically unstable operations like softmax), while the weights of the model will be stored and updated in float32. <br>
Keras layers have a **variable_dtype** and a **compute_dtype** attribute. By default, both of these are set to float32. When you turn on mixed precision, the compute_dtype of most layers switches to float16, and those layers will cast their inputs to float16 and will perform their computations in float16 (using half-precision copies of the weights). However, since their variable_dtype is still float32, their weights will be able to receive accurate float32 updates from the optimizer, as opposed to half-precision updates. <br>
Note that some operations may be numerically unstable in float16 (in particular, softmax and crossentropy). If you need to opt out of mixed precision for a specific layer, just pass the argument dtype="float32" to the constructor of this layer.
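
Why keep the weights in float32 while computing in float16? A toy update loop makes the reason visible. The sketch below does not use Keras: it emulates rounding to 16 and 32 bits with the standard `struct` module and applies the same small update 1,000 times to a weight stored in each format. The helper names and the update size of 1e-4 are assumptions, and real mixed-precision training involves more machinery than this.

```python
import struct

def to_float16(value):
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_float32(value):
    return struct.unpack("<f", struct.pack("<f", value))[0]

update = 1e-4      # A small weight update (learning rate times gradient).
weight_fp16 = 1.0  # Weight stored and updated in half precision.
weight_fp32 = 1.0  # Weight stored and updated in single precision.

for _ in range(1000):
    weight_fp16 = to_float16(weight_fp16 + update)
    weight_fp32 = to_float32(weight_fp32 + update)

# In mixed precision, computations would use a float16 copy of the float32 weight.
compute_copy = to_float16(weight_fp32)

print(f"float16 weight after 1000 updates: {weight_fp16}")
print(f"float32 weight after 1000 updates: {weight_fp32:.6f}")
print(f"float16 copy used for computation: {compute_copy:.6f}")
```

The half-precision weight never moves, because each individual update is too small to survive rounding. The single-precision weight accumulates all of them, and its half-precision copy is still accurate enough to compute with.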

##### Multi-GPU training
While GPUs are getting more powerful every year, deep learning models are getting increasingly large, requiring ever more computational resources. Training on a single GPU puts a hard bound on how fast you can move. The solution? You could simply add more GPUs and start doing multi-GPU distributed training. <br>
There are two ways to distribute computation across multiple devices: **data parallelism** and **model parallelism**. <br>
- With **data parallelism**, a single model is replicated on multiple devices or multiple machines. 
  - Each of the model replicas processes different batches of data, and then they merge their results.
- With **model parallelism**, different parts of a single model run on different devices, processing a single batch of data together at the same time. 
  - This works best with models that have a naturally parallel architecture, such as models that feature multiple branches.

In practice, model parallelism is only used for models that are too large to fit on any single device: it isn’t used as a way to speed up training of regular models, but as a way to train larger models. <br>
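
Data parallelism, by contrast, is easy to picture with a toy example. The sketch below assumes that "merging results" means averaging the gradients computed by each replica, which is one common way to do it. It uses a one-parameter linear model and plain Python lists instead of real devices, and the data and names are made up for illustration.

```python
def gradient(w, batch):
    """Gradient of the mean squared error for the model y_pred = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical toy batch of (x, y) pairs and a current weight value.
global_batch = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
                (5.0, 10.1), (6.0, 11.9), (7.0, 14.2), (8.0, 15.8)]
w = 1.5

# Split the global batch into equal local batches, one per replica.
num_replicas = 4
local_size = len(global_batch) // num_replicas
local_batches = [global_batch[i * local_size:(i + 1) * local_size]
                 for i in range(num_replicas)]

# Each replica computes a gradient on its own local batch...
local_gradients = [gradient(w, batch) for batch in local_batches]
# ...and the results are merged by averaging.
merged_gradient = sum(local_gradients) / num_replicas

full_gradient = gradient(w, global_batch)
print(f"merged gradient from {num_replicas} replicas: {merged_gradient:.6f}")
print(f"gradient on the whole batch:       {full_gradient:.6f}")
```

With equally sized local batches, the merged gradient matches the gradient of the whole batch, so every replica can apply the same update and stay in sync.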

#### Summary
- You can leverage hyperparameter tuning and KerasTuner to automate the tedium out of finding the best model configuration. But be mindful of validation set overfitting!
- An ensemble of diverse models can often significantly improve the quality of your predictions.
- You can speed up model training on GPU by turning on mixed precision: you’ll generally get a nice speed boost at virtually no cost.