#### Modern convnet architecture patterns
A model’s “architecture” is the sum of the choices that went into creating it: which layers to use, how to configure them, and in what arrangement to connect them. These choices define the **hypothesis space** of your model: the space of possible functions that gradient descent can search over, parameterized by the model’s weights. Like feature engineering, a good hypothesis space encodes **prior knowledge** that you have about the problem at hand and its solution. For instance, using convolution layers means that you know in advance that the relevant patterns present in your input images are translation invariant. In order to effectively learn from data, you need to make assumptions about what you’re looking for.

Model architecture is often the difference between success and failure. If you make inappropriate architecture choices, your model may be stuck with suboptimal metrics, and no amount of training data will save it. Conversely, a good model architecture will accelerate learning and will enable your model to make efficient use of the training data available, reducing the need for large datasets. **A good model architecture is one that reduces the size of the search space or otherwise makes it easier to converge to a good point of the search space.** Just like feature engineering and data curation, **model architecture is all about making the problem simpler for gradient descent to solve**. And remember that gradient descent is a pretty stupid search process, so it needs all the help it can get.

Model architecture is more an art than a science. Experienced machine learning engineers are able to intuitively cobble together high-performing models on their first try, while beginners often struggle to create a model that trains at all. The keyword here is **intuitively**: no one can give you a clear explanation of what works and what doesn’t. Experts rely on pattern-matching, an ability that they acquire through extensive practical experience. You’ll develop your own intuition throughout this book. However, it’s not all about intuition either—there isn’t much in the way of actual science, but as in any engineering discipline, there are best practices.

In the following sections, we’ll review a few essential convnet architecture best practices: in particular, **residual connections**, **batch normalization**, and **separable convolutions**. Once you master how to use them, you will be able to build highly effective image models. We will apply them to our cat vs. dog classification problem.

Let’s start from the bird’s-eye view: the **modularity-hierarchy-reuse (MHR)** formula for system architecture.

##### Modularity, hierarchy, and reuse
If you want to make a complex system simpler, there’s a universal recipe you can apply: just structure your amorphous soup of complexity into **modules**, organize the modules into a **hierarchy**, and start **reusing** the same modules in multiple places as appropriate (“reuse” is another word for **abstraction** in this context). That’s the MHR formula (modularity-hierarchy-reuse), and it underlies system architecture across pretty much every domain where the term “architecture” is used. It’s at the heart of the organization of any system of meaningful complexity, whether it’s a cathedral, your own body, the US Navy, or the Keras codebase (see figure 9.7).

![](./chapter_images/9.7.png)

If you’re a software engineer, you’re already keenly familiar with these principles: an effective codebase is one that is modular, hierarchical, and where you don’t reimplement the same thing twice, but instead rely on reusable classes and functions. If you factor your code by following these principles, you could say you’re doing “software architecture.”

Deep learning itself is simply the application of this recipe to continuous optimization via gradient descent: you take a classic optimization technique (gradient descent over a continuous function space), and you structure the search space into modules (layers), organized into a deep hierarchy (often just a stack, the simplest kind of hierarchy), where you reuse whatever you can (for instance, convolutions are all about reusing the same information in different spatial locations). <br>
Likewise, deep learning model architecture is primarily about making clever use of modularity, hierarchy, and reuse. You’ll notice that all popular convnet architectures are not only structured into layers, they’re structured into repeated groups of layers (called “blocks” or “modules”). For instance, the popular VGG16 architecture we used in the previous chapter is structured into repeated “conv, conv, max pooling” blocks (see figure 9.8). <br>
Further, most convnets feature pyramid-like structures (feature hierarchies). Recall, for example, the progression in the number of convolution filters we used in the first convnet we built in the previous chapter: 32, 64, 128. **The number of filters grows with layer depth, while the size of the feature maps shrinks accordingly.** You’ll notice the same pattern in the blocks of the VGG16 model (see figure 9.8).

![](./chapter_images/9.8.png)

Deeper hierarchies are intrinsically good because they encourage feature reuse, and therefore abstraction. In general, a deep stack of narrow layers performs better than a shallow stack of large layers. However, there’s a limit to how deep you can stack layers, due to the problem of **vanishing gradients**. This leads us to our first essential model architecture pattern: **residual connections**.

##### Residual connections
You probably know about the game of Telephone, also called Chinese whispers in the UK and téléphone arabe in France, where an initial message is whispered in the ear of a player, who then whispers it in the ear of the next player, and so on. The final message ends up bearing little resemblance to its original version. It’s a fun metaphor for the **cumulative errors** that occur in sequential transmission over a noisy channel. As it happens, backpropagation in a sequential deep learning model is pretty similar to the game of Telephone. You’ve got a chain of functions, like this one: <br>
y = f4(f3(f2(f1(x)))) <br>
The name of the game is to adjust the parameters of each function in the chain based on the error recorded on the output of f4 (the loss of the model). To adjust f1, you’ll need to percolate error information through f2, f3, and f4. However, each successive function in the chain introduces some amount of noise. If your function chain is too deep, this noise starts overwhelming gradient information, and backpropagation stops working. Your model won’t train at all. This is the **vanishing gradients** problem.

The fix is simple: just force each function in the chain to be nondestructive—to retain a noiseless version of the information contained in the previous input. The easiest way to implement this is to use a **residual connection**. It’s dead easy: just **add the input of a layer or block of layers back to its output** (see figure 9.9). The **residual connection** acts as an **information shortcut** around destructive or noisy blocks (such as blocks that contain relu activations or dropout layers), enabling error gradient information from early layers to propagate noiselessly through a deep network. This technique was introduced in 2015 with the ResNet family of models (developed by He et al. at Microsoft).

![](./chapter_images/9.9.png)
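
One way to see why the shortcut helps is to treat every function in the chain as a scalar map with a small local slope. This is a deliberately simplified sketch, not how backpropagation is implemented in practice: the helper `chain_gradient` and the slope value 0.01 are illustrative assumptions. By the chain rule, the gradient that reaches the first function is the product of the local slopes, and wrapping each function in a residual connection turns each factor from `slope` into `1 + slope`.

```python
def chain_gradient(local_slopes, residual=False):
    """Gradient of the chain output with respect to its input (chain rule)."""
    grad = 1.0
    for slope in local_slopes:
        # Plain layer:    d f(x) / dx       = slope
        # Residual layer: d (x + f(x)) / dx = 1 + slope
        grad *= (1.0 + slope) if residual else slope
    return grad

slopes = [0.01] * 20  # 20 stacked functions, each with a small local slope (assumed value)

plain_grad = chain_gradient(slopes)
residual_grad = chain_gradient(slopes, residual=True)

print(f"plain chain:    {plain_grad:.3e}")
print(f"residual chain: {residual_grad:.3f}")
```

Under these assumptions the plain product collapses to roughly 1e-40, while the residual product stays close to 1, because the identity path always contributes a factor of 1 to each step.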

In practice, you’d implement a residual connection as follows.

##### A residual connection in pseudocode

```python
x = ... # Some input tensor
residual = x # Save a pointer to the original input. This is called the residual.
x = block(x) # This computation block can potentially be destructive or noisy, and that’s fine.
x = add([x, residual]) # Add the original input to the layer’s output: the final output will thus always preserve full information about the original input.
```

Note that adding the input back to the output of a block implies that **the output should have the same shape as the input**. However, this is not the case if your block includes convolutional layers with an increased number of filters, or a max pooling layer. In such cases, use a **1 × 1 Conv2D layer with no activation to linearly project the residual to the desired output shape**. 

You’d typically use **padding="same"** in the convolution layers in your target block so as to avoid spatial downsampling due to padding, and you’d use **strides** in the residual projection to match any downsampling caused by a max pooling layer.

##### Residual block where the number of filters changes

In [1]:
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(32, 32, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
residual = x # Set aside the residual
# This is the layer around which we create a residual connection: it increases the number of output filters from 32 to 64.
# Note that we use padding="same" to avoid downsampling due to padding.
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
# The residual only had 32 filters, so we use a 1 × 1 Conv2D to project it to the correct shape.
residual = layers.Conv2D(64, 1)(residual)
# Now the block output and the residual have the same shape and can be added.
x = layers.add([x, residual])

##### Case where target block includes a max pooling layer

In [2]:
inputs = keras.Input(shape=(32, 32, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
residual = x # Set aside the residual
# This is the block of two layers around which we create a residual connection: it includes a 2 × 2 max pooling layer. 
# Note that we use padding="same" in both the convolution layer and the max pooling layer to avoid downsampling due to padding.
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2, padding="same")(x)
# We use strides=2 in the residual projection to match the downsampling created by the max pooling layer.
residual = layers.Conv2D(64, 1, strides=2)(residual)
# Now the block output and the residual have the same shape and can be added.
x = layers.add([x, residual])
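
The shape bookkeeping in the two listings above can be checked by hand. The sketch below assumes the usual Keras conventions: with `padding="same"` the output size is `ceil(size / stride)`, and with the default `padding="valid"` it is `floor((size - kernel) / stride) + 1`. The helper `output_size` is a hypothetical name introduced for illustration.

```python
import math

def output_size(size, kernel, stride=1, padding="valid"):
    """Spatial output size of a Keras-style convolution or pooling layer."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # "valid": no padding

# Conv2D(32, 3) on a 32 x 32 input (default "valid" padding)
size = output_size(32, kernel=3)
# Target block: Conv2D(64, 3, padding="same"), then MaxPooling2D(2, padding="same")
block = output_size(size, kernel=3, padding="same")
block = output_size(block, kernel=2, stride=2, padding="same")
# Residual projection: Conv2D(64, 1, strides=2)
shortcut = output_size(size, kernel=1, stride=2)

print(size, block, shortcut)
```

Both branches end at the same spatial size, which is what allows the final `layers.add` call to succeed.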


To make these ideas more concrete, here’s an example of a simple convnet structured into a series of blocks, each made of two convolution layers and one optional max pooling layer, with a residual connection around each block:

In [3]:
inputs = keras.Input(shape=(32, 32, 3))
x = layers.Rescaling(1./255)(inputs)

# Utility function to apply a convolutional block with a residual connection, with an option to add max pooling
def residual_block(x, filters, pooling=False):
    residual = x
    x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
    x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)

    if pooling:
        x = layers.MaxPooling2D(2, padding="same")(x)
        # If we use max pooling, we add a strided convolution to project the residual to the expected shape.
        residual = layers.Conv2D(filters, 1, strides=2)(residual)
    elif filters != residual.shape[-1]:
        # If we don't use max pooling, we only project the residual if the number of filters has changed.
        residual = layers.Conv2D(filters, 1)(residual)
    x = layers.add([x, residual])
    return x

# First block
x = residual_block(x, filters=32, pooling=True)
# Second block; note the increasing filter count in each block.
x = residual_block(x, filters=64, pooling=True)
# The last block doesn't need a max pooling layer, since we will apply global average pooling right after it.
x = residual_block(x, filters=128, pooling=False)

x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_3 (InputLayer)           [(None, 32, 32, 3)]  0           []                               
                                                                                                  
 rescaling (Rescaling)          (None, 32, 32, 3)    0           ['input_3[0][0]']                
                                                                                                  
 conv2d_6 (Conv2D)              (None, 32, 32, 32)   896         ['rescaling[0][0]']              
                                                                                                  
 conv2d_7 (Conv2D)              (None, 32, 32, 32)   9248        ['conv2d_6[0][0]']               
                                                                                              

With residual connections, you can build networks of arbitrary depth, without having to worry about vanishing gradients.

Now let’s move on to the next essential convnet architecture pattern: **batch normalization**.

##### Batch normalization
**Normalization** is a broad category of methods that seek to make different samples seen by a machine learning model more similar to each other, which helps the model learn and generalize well to new data. The most common form of data normalization is one you’ve already seen several times in this book: **centering the data on zero by subtracting the mean from the data, and giving the data a unit standard deviation by dividing the data by its standard deviation**. In effect, this makes the assumption that the data follows a **normal (or Gaussian) distribution** and makes sure this distribution is centered and scaled to unit variance:

```python
normalized_data = (data - np.mean(data, axis=...)) / np.std(data, axis=...)
```

Previous examples in this book **normalized data before feeding it into models**. But data normalization may be of interest **after every transformation operated by the network**: even if the data entering a Dense or Conv2D layer has a 0 mean and unit variance, there’s no reason to expect a priori that this will be the case for the data coming out. Could normalizing intermediate activations help? <br>
**Batch normalization** does just that. It’s a type of layer (**BatchNormalization** in Keras) introduced in 2015 by Ioffe and Szegedy; it can adaptively normalize data even as the mean and variance change over time during training. During training, it uses the mean and variance of the current batch of data to normalize samples, and during inference (when a big enough batch of representative data may not be available), it uses an exponential moving average of the batch-wise mean and variance of the data seen during training. <br>
Although the original paper stated that **batch normalization** operates by “reducing internal covariate shift,” no one really knows for sure why **batch normalization** helps. There are various hypotheses, but no certitudes. You’ll find that this is true of many things in deep learning: it is not an exact science, but a set of ever-changing, empirically derived engineering best practices, woven together by unreliable narratives. You will sometimes feel like the book you have in hand tells you how to do something but doesn’t quite satisfactorily say why it works: that’s because we know the how but we don’t know the why. Whenever a reliable explanation is available, I make sure to mention it. Batch normalization isn’t one of those cases. <br>
In practice, the main effect of batch normalization appears to be that it helps with gradient propagation, much like residual connections, and thus allows for deeper networks. Some very deep networks can only be trained if they include multiple **BatchNormalization** layers. For instance, batch normalization is used liberally in many of the advanced convnet architectures that come packaged with Keras, such as ResNet50, EfficientNet, and Xception. **The BatchNormalization layer can be used after any layer**, such as Dense or Conv2D.

> **NOTE** Both *Dense* and *Conv2D* involve a bias vector, a learned variable whose purpose is to make the layer affine rather than purely linear. For instance, Conv2D returns, schematically, y = conv(x, kernel) + bias, and Dense returns y = dot(x, kernel) + bias. **Because the normalization step will take care of centering the layer’s output on zero, the bias vector is no longer needed when using BatchNormalization**, and the layer can be created without it via the option **use_bias=False**. This makes the layer slightly leaner.
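
To make the training and inference behavior concrete, here is a minimal NumPy sketch. `SimpleBatchNorm` is a hypothetical toy class, not the Keras implementation: it normalizes over axis 0 only, and its `gamma` and `beta` stay fixed instead of being learned. The `momentum=0.99` and `epsilon=1e-3` values mirror the defaults of the Keras layer. The last part also illustrates the point of the note: a constant bias added before normalization is removed by the mean subtraction.

```python
import numpy as np

class SimpleBatchNorm:
    """Toy batch normalization over axis 0 (hypothetical helper, not the Keras layer)."""

    def __init__(self, num_features, momentum=0.99, epsilon=1e-3):
        self.gamma = np.ones(num_features)   # learned scale in a real layer
        self.beta = np.zeros(num_features)   # learned offset in a real layer
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)
        self.momentum = momentum
        self.epsilon = epsilon

    def __call__(self, x, training):
        if training:
            # Normalize with the statistics of the current batch...
            mean, var = x.mean(axis=0), x.var(axis=0)
            # ...and fold them into the exponential moving averages.
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # At inference time, rely on the averages accumulated during training.
            mean, var = self.moving_mean, self.moving_var
        return self.gamma * (x - mean) / np.sqrt(var + self.epsilon) + self.beta

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=4)

# Simulate training on batches whose features have mean 5 and standard deviation 3.
for _ in range(1000):
    batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
    train_out = bn(batch, training=True)

# A constant bias added before normalization is cancelled by the mean subtraction.
# A throwaway layer is used so this extra call does not touch bn's moving averages.
shifted_out = SimpleBatchNorm(num_features=4)(batch + 10.0, training=True)

print(train_out.mean(axis=0).round(6))  # per-feature mean of the normalized batch
print(bn.moving_mean.round(2), bn.moving_var.round(2))
print(np.allclose(train_out, shifted_out))
```

After enough training steps the moving averages settle near the true statistics of the data, which is what the layer falls back on when called with `training=False`.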

Importantly, I would generally recommend placing the **previous layer’s activation after the batch normalization layer** (although this is still a subject of debate).

##### How not to use batch normalization

```python
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.BatchNormalization()(x)
```

##### How to use batch normalization: the activation comes last

```python
# Because the output of the Conv2D layer gets normalized, the layer doesn’t need its own bias vector.
x = layers.Conv2D(32, 3, use_bias=False)(x) # Note the lack of activation here.
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x) # We place the activation after the BatchNormalization layer.
```
The intuitive reason for this approach is that batch normalization will center your inputs on zero, while your **relu** activation uses zero as a pivot for keeping or dropping activated channels: doing normalization before the activation maximizes the utilization of the relu. That said, this ordering best practice is not exactly critical, so if you do convolution, then activation, and then batch normalization, your model will still train, and you won’t necessarily see worse results.

> ##### On batch normalization and fine-tuning 
> Batch normalization has many quirks. One of the main ones relates to fine-tuning: when fine-tuning a model that includes *BatchNormalization* layers, I recommend leaving these layers frozen (set their *trainable* attribute to *False*). Otherwise they will keep updating their internal mean and variance, which can interfere with the very small updates applied to the surrounding *Conv2D* layers.

Now let’s take a look at the last architecture pattern in our series: **depthwise separable convolutions**.

##### Depthwise separable convolutions
What if I told you that there’s a layer you can use as a drop-in replacement for Conv2D that will make your model smaller (fewer trainable weight parameters) and leaner (fewer floating-point operations) and cause it to perform a few percentage points better on its task? That is precisely what the depthwise separable convolution layer does (SeparableConv2D in Keras). This layer performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution (a 1 × 1 convolution), as shown in figure 9.10.

![](./chapter_images/9.10.png)
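
The two steps can be sketched in plain NumPy. This is an illustrative, unoptimized version under simplifying assumptions (a single image, a square kernel, stride 1, no padding, no bias, one depthwise filter per input channel); the function name `depthwise_separable_conv` is hypothetical and this is not how `SeparableConv2D` is implemented internally.

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_kernel, pointwise_kernel):
    """Valid-padding, stride-1 separable convolution on a single image.

    x:                (height, width, in_channels)
    depthwise_kernel: (k, k, in_channels), one spatial filter per input channel
    pointwise_kernel: (in_channels, out_channels), the 1 x 1 channel-mixing step
    """
    k = depthwise_kernel.shape[0]
    out_h = x.shape[0] - k + 1
    out_w = x.shape[1] - k + 1
    spatial = np.zeros((out_h, out_w, x.shape[2]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]
            # Step 1: each channel is filtered independently (no channel mixing).
            spatial[i, j, :] = np.sum(patch * depthwise_kernel, axis=(0, 1))
    # Step 2: the pointwise (1 x 1) convolution mixes channels at every location.
    return spatial @ pointwise_kernel

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))
depthwise = rng.normal(size=(3, 3, 3))
pointwise = rng.normal(size=(3, 16))

features = depthwise_separable_conv(image, depthwise, pointwise)
print(features.shape)
```

The result is the same as a regular convolution whose kernel factorizes as `depthwise[a, b, c] * pointwise[c, o]`, which is the structural constraint that makes the layer cheaper.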

This is equivalent to separating the learning of spatial features and the learning of channel-wise features. In much the same way that convolution relies on the assumption that the patterns in images are not tied to specific locations, depthwise separable convolution relies on the assumption that **spatial locations** in intermediate activations are **highly correlated**, but different channels are **highly independent**. Because this assumption is generally true for the image representations learned by deep neural networks, it serves as a useful prior that helps the model make more efficient use of its training data. A model with stronger priors about the structure of the information it will have to process is a better model—as long as the priors are accurate. <br>
**Depthwise separable convolution requires significantly fewer parameters and involves fewer computations compared to regular convolution, while having comparable representational power.** It results in smaller models that converge faster and are less prone to overfitting. These advantages become especially important when you’re training **small models from scratch on limited data**. <br>
When it comes to larger-scale models, depthwise separable convolutions are the basis of the **Xception** architecture, a high-performing convnet that comes packaged with Keras. You can read more about the theoretical grounding for depthwise separable convolutions and Xception in the paper “Xception: Deep Learning with Depthwise Separable Convolutions.”
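
The parameter savings are easy to work out by hand. The sketch below assumes a 3 × 3 kernel, 64 input channels, 64 output channels, no bias, and one depthwise filter per input channel; the two helper functions are hypothetical names introduced for illustration.

```python
def conv2d_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Weight count of a regular convolution."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if use_bias else 0)

def separable_conv2d_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Weight count of a depthwise separable convolution."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1 x 1 channel mixing
    return depthwise + pointwise + (out_channels if use_bias else 0)

regular = conv2d_params(3, 64, 64, use_bias=False)
separable = separable_conv2d_params(3, 64, 64, use_bias=False)

print(regular, separable, round(regular / separable, 1))
```

In this configuration the separable layer needs 4,672 weights instead of 36,864, close to an eightfold reduction, and the gap widens as the number of channels grows.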

##### Putting it together: A mini Xception-like model
As a reminder, here are the convnet architecture principles you’ve learned so far:
- Your model should be organized into repeated blocks of layers, usually made of multiple convolution layers and a max pooling layer.
- The number of filters in your layers should increase as the size of the spatial feature maps decreases.
- Deep and narrow is better than broad and shallow.
- Introducing residual connections around blocks of layers helps you train deeper networks.
- It can be beneficial to introduce batch normalization layers after your convolution layers.
- It can be beneficial to replace Conv2D layers with SeparableConv2D layers, which are more parameter-efficient.

Let’s bring these ideas together into a single model. Its architecture will resemble a smaller version of Xception, and we’ll apply it to the dogs vs. cats task from the last chapter. For data loading and model training, we’ll simply reuse the setup we used in chapter 8, but we’ll replace the model definition with the following convnet:

In [5]:
import os, shutil, pathlib
from tensorflow.keras.utils import image_dataset_from_directory

original_dir = pathlib.Path("train")
new_base_dir = pathlib.Path("cats_vs_dogs_small")

def make_subset(subset_name, start_index, end_index):
    for category in ("cat", "dog"):
        dir = new_base_dir / subset_name / category
        os.makedirs(dir)
        fnames = [f"{category}.{i}.jpg" for i in range(start_index, end_index)]
        for fname in fnames:
            shutil.copyfile(src=original_dir / fname,
                            dst=dir / fname)

make_subset("train", start_index=0, end_index=1000)
make_subset("validation", start_index=1000, end_index=1500)
make_subset("test", start_index=1500, end_index=2500)

train_dataset = image_dataset_from_directory(
    new_base_dir / "train",
    image_size=(180, 180),
    batch_size=32)
validation_dataset = image_dataset_from_directory(
    new_base_dir / "validation",
    image_size=(180, 180),
    batch_size=32)
test_dataset = image_dataset_from_directory(
    new_base_dir / "test",
    image_size=(180, 180),
    batch_size=32)

Found 2000 files belonging to 2 classes.
Found 1000 files belonging to 2 classes.
Found 2000 files belonging to 2 classes.


In [6]:
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.2),
    ]
)

In [7]:
inputs = keras.Input(shape=(180, 180, 3))
x = data_augmentation(inputs)

x = layers.Rescaling(1./255)(x)  # Don't forget input rescaling!
# Note that the assumption that underlies separable convolution, “feature channels are largely independent,” does not hold for RGB images! 
# Red, green, and blue color channels are actually highly correlated in natural images. 
# As such, the first layer in our model is a regular Conv2D layer. We’ll start using SeparableConv2D afterwards.
x = layers.Conv2D(filters=32, kernel_size=5, use_bias=False)(x)

# We apply a series of convolutional blocks with increasing feature depth. 
# Each block consists of two batch-normalized depthwise separable convolution layers and a max pooling layer, with a residual connection around the entire block.
for size in [32, 64, 128, 256, 512]:
    residual = x

    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.SeparableConv2D(size, 3, padding="same", use_bias=False)(x)

    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.SeparableConv2D(size, 3, padding="same", use_bias=False)(x)

    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)

    residual = layers.Conv2D(
        size, 1, strides=2, padding="same", use_bias=False)(residual)
    x = layers.add([x, residual])

# In the original model, we used a Flatten layer before the Dense layer. Here, we go with a GlobalAveragePooling2D layer.
x = layers.GlobalAveragePooling2D()(x)
# Like in the original model, we add a dropout layer for regularization.
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

In [8]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_4 (InputLayer)           [(None, 180, 180, 3  0           []                               
                                )]                                                                
                                                                                                  
 sequential (Sequential)        (None, 180, 180, 3)  0           ['input_4[0][0]']                
                                                                                                  
 rescaling_1 (Rescaling)        (None, 180, 180, 3)  0           ['sequential[0][0]']             
                                                                                                  
 conv2d_16 (Conv2D)             (None, 176, 176, 32  2400        ['rescaling_1[0][0]']      

This convnet has a parameter count of 721,857 (718,849 of them trainable; the remainder are the moving statistics of the BatchNormalization layers), lower than the 991,041 trainable parameters of the original model, but still in the same ballpark. Figure 9.11 shows its training and validation curves (for 100 epochs).
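
The count quoted above can be cross-checked by hand from the model definition. The sketch below tallies the weights layer by layer, assuming the standard parameter formulas for `Conv2D`, `SeparableConv2D`, `BatchNormalization`, and `Dense`; the helper names are hypothetical. It gives 721,857 parameters in total, 3,008 of which are the non-trainable moving means and variances of the `BatchNormalization` layers.

```python
def conv2d_weights(kernel_size, in_ch, out_ch):
    return kernel_size * kernel_size * in_ch * out_ch  # use_bias=False throughout

def separable_weights(kernel_size, in_ch, out_ch):
    # Depthwise filters plus 1 x 1 pointwise mixing, no bias.
    return kernel_size * kernel_size * in_ch + in_ch * out_ch

trainable = conv2d_weights(5, 3, 32)  # initial Conv2D(32, 5) on RGB input
non_trainable = 0
channels = 32
for size in [32, 64, 128, 256, 512]:
    for in_ch, out_ch in [(channels, size), (size, size)]:
        trainable += 2 * in_ch        # BatchNormalization gamma and beta
        non_trainable += 2 * in_ch    # BatchNormalization moving mean and variance
        trainable += separable_weights(3, in_ch, out_ch)
    trainable += conv2d_weights(1, channels, size)  # 1 x 1 residual projection
    channels = size
trainable += channels + 1             # Dense(1): 512 weights plus 1 bias

print(trainable, non_trainable, trainable + non_trainable)
```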

In [9]:
model.compile(loss="binary_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])
history = model.fit(
    train_dataset,
    epochs=10,
    validation_data=validation_dataset)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


![](./chapter_images/9.11.png)

You’ll find that our new model achieves a test accuracy of 90.8%, compared to 83.5% for the naive model in the last chapter. As you can see, following architecture best practices does have an immediate, sizable impact on model performance! <br>
At this point, if you want to further improve performance, you should start systematically tuning the hyperparameters of your architecture, a topic we’ll cover in detail in chapter 13. We haven’t gone through this step here, so the configuration of the preceding model is purely based on the best practices we discussed, plus, when it comes to gauging model size, a small amount of intuition. <br>
Note that these architecture best practices are relevant to computer vision in general, not just image classification. For example, Xception is used as the standard convolutional base in DeepLabV3, a popular state-of-the-art image segmentation solution.

This concludes our introduction to essential convnet architecture best practices. With these principles in hand, you’ll be able to develop higher-performing models across a wide range of computer vision tasks. You’re now well on your way to becoming a proficient computer vision practitioner. To further deepen your expertise, there’s one last important topic we need to cover: interpreting how a model arrives at its predictions.