# Advanced architecture patterns

We covered one important design pattern in detail in the previous section: residual
connections. There are two more design patterns you should know about:
normalization and depthwise separable convolution.

## Batch normalization

Batch normalization is a type of layer (BatchNormalization in Keras) introduced
in 2015 by Ioffe and Szegedy; it can adaptively normalize data even as the mean and
variance change over time during training. It works by internally maintaining an
exponential moving average of the batch-wise mean and variance of the data seen during
training. The main effect of batch normalization is that it helps with gradient
propagation, much like residual connections, and thus allows for deeper networks.
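To make the moving-average mechanism concrete, here is a minimal pure-Python sketch for a single feature. It is not the Keras implementation: the class name `SimpleBatchNorm` is hypothetical, the learned scale and offset are left out, and the `momentum` and `epsilon` defaults are assumptions chosen for illustration. During training the batch is normalized with its own statistics while the moving averages are updated; at inference the stored moving averages are used instead.

```python
import math


class SimpleBatchNorm:
    """Toy batch normalization for one feature (no learned scale or offset)."""

    def __init__(self, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum    # how slowly the moving averages change
        self.epsilon = epsilon      # avoids division by zero
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Normalize with the statistics of the current batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Update the exponential moving averages kept for inference.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # At inference, rely on the averages accumulated during training.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / math.sqrt(var + self.epsilon) for x in batch]


# A low momentum is used here only so the update is easy to follow by hand.
bn = SimpleBatchNorm(momentum=0.5)
train_out = bn([1.0, 3.0], training=True)   # uses batch mean 2.0, variance 1.0
test_out = bn([1.0], training=False)        # uses the moving mean and variance
```

After the single training step above, the moving mean has moved halfway from 0.0 toward the batch mean of 2.0, which is why the inference-time input of 1.0 is mapped to zero.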


## Depthwise separable convolution

What if I told you that there’s a layer you can use as a drop-in replacement for Conv2D
that will make your model lighter (fewer trainable weight parameters) and faster
(fewer floating-point operations) and cause it to perform a few percentage points
better on its task? That is precisely what the depthwise separable convolution layer
(SeparableConv2D) does.
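The "lighter" claim can be checked with simple arithmetic. The sketch below assumes the usual factorization of a depthwise separable convolution: one spatial filter per input channel, followed by a 1x1 pointwise convolution that mixes channels, with a depth multiplier of 1 and one bias per output channel. The helper names are hypothetical and only count weights; they do not build a layer.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weight count of a regular 2D convolution, with one bias per output channel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def separable_conv2d_params(kernel_size, in_channels, out_channels):
    """Weight count of a depthwise separable convolution (depth multiplier of 1)."""
    depthwise = kernel_size * kernel_size * in_channels  # one spatial filter per input channel
    pointwise = in_channels * out_channels               # 1x1 convolution mixing the channels
    return depthwise + pointwise + out_channels          # plus one bias per output channel


# A 3x3 layer going from 64 to 128 channels.
regular = conv2d_params(3, 64, 128)              # 73856
separable = separable_conv2d_params(3, 64, 128)  # 8896
print(regular, separable)
```

For this layer size the separable version needs roughly an eighth of the weights; the gap widens as the number of channels grows.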

## Hyperparameter optimization

Often, it turns out that random search (choosing hyperparameters to evaluate at
random, repeatedly) is the best solution, despite being the most naive one. But one
tool I have found reliably better than random search is Hyperopt
(https://github.com/hyperopt/hyperopt), a Python library for hyperparameter
optimization that internally uses trees of Parzen estimators to predict sets of
hyperparameters that are likely to work well. Another library called Hyperas
(https://github.com/maxpumperla/hyperas) integrates Hyperopt for use with Keras
models. Do check it out.
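As a baseline, random search fits in a few lines of plain Python. The sketch below is illustrative only: `random_search`, the search space, and `toy_objective` are hypothetical names, and the objective is a stand-in for "train a model with these hyperparameters and return its validation loss". It does not use Hyperopt or Hyperas.

```python
import math
import random


def random_search(objective, space, n_trials=20, seed=0):
    """Sample hyperparameters at random and keep the set with the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter from its sampler.
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space: log-uniform learning rate, small integer depth.
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),
    "num_layers": lambda rng: rng.randint(1, 4),
}


def toy_objective(params):
    """Stand-in for a validation loss; a real one would train and evaluate a model."""
    lr_term = (math.log10(params["learning_rate"]) + 2.5) ** 2
    depth_term = (params["num_layers"] - 2) ** 2
    return lr_term + depth_term


best_params, best_score = random_search(toy_objective, space, n_trials=50)
print(best_params, best_score)
```

Sampling the learning rate on a log scale is a design choice: useful values typically span several orders of magnitude, so uniform sampling on the raw scale would almost never try the small ones.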

## Model ensembling

Ensembling consists of pooling together the predictions of a set of different
models to produce better predictions. If you look at machine-learning competitions,
in particular on Kaggle, you’ll see that the winners use very large ensembles of
models that inevitably beat any single model, no matter how good.


The easiest way to pool the predictions of a set of classifiers (to ensemble the
classifiers) is to average their predictions at inference time.
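A minimal sketch of this averaging, assuming each classifier has already produced one predicted probability per sample. The function name and the prediction values are made up for illustration.

```python
def average_predictions(predictions):
    """Average per-sample predictions across models.

    `predictions` holds one list per model, all of the same length.
    """
    n_models = len(predictions)
    # zip(*predictions) groups the predictions made for the same sample.
    return [sum(sample) / n_models for sample in zip(*predictions)]


# Hypothetical probabilities from three classifiers on three samples.
preds_a = [0.9, 0.2, 0.6]
preds_b = [0.7, 0.4, 0.5]
preds_c = [0.8, 0.3, 0.1]

final_preds = average_predictions([preds_a, preds_b, preds_c])
print(final_preds)
```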


A smarter way to ensemble classifiers is to do a weighted average, where the
weights are learned on the validation data—typically, the better classifiers are given a
higher weight, and the worse classifiers are given a lower weight.
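One simple way to obtain such weights is a coarse grid search on the validation set, sketched below. This is only one option among many, not a prescribed method: the function names, the grid step, the use of binary log loss as the criterion, and the validation data are all assumptions made for the example.

```python
import itertools
import math


def weighted_average(predictions, weights):
    """Combine per-model predictions using one weight per model."""
    return [sum(w * p for w, p in zip(weights, sample)) for sample in zip(*predictions)]


def log_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for numerical safety."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def search_weights(val_predictions, val_labels, steps=10):
    """Try every weight combination on a grid (weights sum to 1), keep the best."""
    n_models = len(val_predictions)
    best_weights, best_loss = None, float("inf")
    for combo in itertools.product(range(steps + 1), repeat=n_models):
        if sum(combo) != steps:
            continue  # only keep combinations whose weights sum to 1
        weights = [c / steps for c in combo]
        loss = log_loss(val_labels, weighted_average(val_predictions, weights))
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights, best_loss


# Hypothetical validation labels and predictions from a strong and a weak model.
val_labels = [1, 0, 1, 0]
strong_model = [0.9, 0.1, 0.8, 0.2]
weak_model = [0.6, 0.5, 0.4, 0.6]

weights, loss = search_weights([strong_model, weak_model], val_labels)
print(weights, loss)
```

The grid grows quickly with the number of models, so for larger ensembles a random search or a general-purpose optimizer over the weights would be a more practical choice.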


Diversity is what makes ensembling work: you should ensemble models that are as
good as possible while being as different as possible.
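A toy numeric illustration of why diversity matters, using made-up values: averaging two models that make the same errors changes nothing, while averaging two models whose errors point in opposite directions cancels them. Real models are never this tidy; the example only shows the direction of the effect.

```python
def mean_squared_error(targets, preds):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


def average_two(preds_x, preds_y):
    """Plain average of two models' predictions."""
    return [(x + y) / 2 for x, y in zip(preds_x, preds_y)]


targets = [1.0, 2.0, 3.0]

model_a = [1.5, 1.5, 3.5]        # errors: +0.5, -0.5, +0.5
model_a_clone = [1.5, 1.5, 3.5]  # same errors as model_a (no diversity)
model_b = [0.5, 2.5, 2.5]        # errors: -0.5, +0.5, -0.5 (opposite bias)

similar_mse = mean_squared_error(targets, average_two(model_a, model_a_clone))
diverse_mse = mean_squared_error(targets, average_two(model_a, model_b))
print(similar_mse, diverse_mse)
```

Each individual model here has the same error, yet only the diverse pair improves when combined.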


In recent times, one style of basic ensemble that has been very successful in
practice is the wide and deep category of models, blending deep learning with shallow
learning.