Let’s revisit what overfitting means. To get higher accuracy, we typically build larger and larger models. One consequence is that the model can rote-memorize some or all of the training examples. The model learns the examples themselves instead of learning to generalize from them to accurately predict examples it never saw during training. In an extreme case, a model could achieve 100% training accuracy yet do no better than random guessing on the test data (for 10 equally likely classes, that would be 10% accuracy).
## Validation
Let’s say training the model takes several hours. Do you really want to wait until the end of training and then test on the test data to learn whether the model overfitted? Of course not. Instead, we set aside a small portion of the training data, which we call **validation data.**

We don’t train the model with the validation data. Instead, after each epoch, we use the validation data to estimate the likely result on the test data.
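To make the idea concrete, here is a minimal sketch of holding out validation data by hand. The helper name `holdout_split` and the toy lists are ours, for illustration only. The split takes the last fraction of the examples, which is also how the Keras `validation_split` argument used later in this section selects its validation data.

```python
def holdout_split(examples, labels, validation_fraction=0.1):
    """Split off the last `validation_fraction` of the data for validation."""
    num_validation = int(len(examples) * validation_fraction)
    split_at = len(examples) - num_validation
    train = (examples[:split_at], labels[:split_at])
    validation = (examples[split_at:], labels[split_at:])
    return train, validation


# Toy data standing in for real images and labels (hypothetical values).
examples = list(range(20))
labels = [value % 2 for value in examples]

(train_x, train_y), (val_x, val_y) = holdout_split(examples, labels, 0.1)
print(len(train_x), len(val_x))  # 18 2
```

Because the function only uses `len()` and slicing, the same sketch should also work on NumPy arrays such as `x_train` and `y_train`.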

<img src="img_4.png">


If a dataset is very small, and using even less data for training would hurt the model, we can use cross-validation. Instead of setting aside at the outset a portion of the training data that the model will never be trained on, we make a new random split for each epoch. At the beginning of each epoch, the validation examples are randomly selected, excluded from training for that epoch, and used for the validation test instead.

But since the selection is random, some or all of those examples will appear in the training data for other epochs. Today’s datasets are large, so you seldom see the need for this technique. Figure 4.6 illustrates cross-validation splitting.

<img src="img_5.png">
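The per-epoch splitting described above can be sketched in a few lines. This is only an illustration under our own assumptions: the function name `per_epoch_split`, the 10 examples, and the 20% validation fraction are hypothetical, and the sketch works on example indices rather than on real data. Note that it follows the per-epoch scheme described here, which differs from the common k-fold variant that partitions the data into fixed folds.

```python
import random


def per_epoch_split(num_examples, validation_fraction, rng):
    """Return (train_indices, validation_indices) for one epoch."""
    indices = list(range(num_examples))
    rng.shuffle(indices)  # a fresh random order for every call
    num_validation = int(num_examples * validation_fraction)
    return indices[num_validation:], indices[:num_validation]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
splits = [per_epoch_split(10, 0.2, rng) for _ in range(3)]

for epoch, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"epoch {epoch}: validate on {sorted(val_idx)}")
```

Within one epoch the two index lists never overlap, but an example held out in one epoch can be trained on in the next.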

Next, we will train a simple CNN to classify images from the CIFAR-10 dataset. CIFAR-10 is a labeled subset of a larger dataset of tiny images, each of size 32 × 32 × 3. It consists of 50,000 training and 10,000 test images (60,000 in total) covering 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
In our simple CNN, we have one convolutional layer of 32 filters with kernel size 3 × 3, followed by a strided max pooling layer. The output is then flattened and passed to the final dense output layer. Figure 4.7 illustrates this process.

<img src="img_6.png">

In [7]:
from keras.layers import Conv2D, Dense, MaxPooling2D, Flatten
from keras import Sequential
from keras.datasets import cifar10
import numpy as np

In [8]:
model = Sequential([
    Conv2D(32, kernel_size=3, activation="relu", input_shape=(32, 32, 3)),
    MaxPooling2D(pool_size=(2,2)),
    Flatten(),
    Dense(10, activation="softmax")])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=['acc'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_2 (Conv2D)           (None, 30, 30, 32)        896       
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 15, 15, 32)       0         
 2D)                                                             
                                                                 
 flatten_1 (Flatten)         (None, 7200)              0         
                                                                 
 dense_1 (Dense)             (None, 10)                72010     
                                                                 
Total params: 72,906
Trainable params: 72,906
Non-trainable params: 0
_________________________________________________________________
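The shapes and parameter counts in this summary can be checked by hand. The sketch below redoes the arithmetic in plain Python; the helper `conv_output_size` is our own name, and it assumes no padding, which matches the 30 × 30 output shape reported for the convolutional layer.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Spatial size after a convolution or pooling window with no padding."""
    return (input_size - kernel_size) // stride + 1


input_size, channels = 32, 3
filters, kernel_size = 32, 3
num_classes = 10

# Conv2D: each filter has 3 * 3 * 3 weights plus one bias.
conv_size = conv_output_size(input_size, kernel_size)               # 30
conv_params = (kernel_size * kernel_size * channels + 1) * filters  # 896

# MaxPooling2D with a 2 x 2 window and stride 2 has no parameters.
pool_size = conv_output_size(conv_size, kernel_size=2, stride=2)    # 15

# Flatten, then Dense: one weight per input per class, plus one bias per class.
flattened = pool_size * pool_size * filters                         # 7200
dense_params = (flattened + 1) * num_classes                        # 72010

print(conv_params + dense_params)  # 72906
```

Nearly all of the parameters sit in the final dense layer, because every one of the 7,200 flattened values connects to every class.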


In [18]:
# Load CIFAR-10 and scale the pixel values from [0, 255] to [0, 1]
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = (x_train / 255.0).astype(np.float32)
x_test = (x_test / 255.0).astype(np.float32)

In [10]:
model.fit(x_train, y_train, epochs=15, validation_split=0.1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x260707d9f70>

Here, we’ve added the keyword parameter validation_split=0.1 to the fit() method to set aside 10% of the training data for validation after each epoch.
The following is the output after running 15 epochs. You can see that after the fourth epoch, the training and validation accuracy are essentially the same. But after the fifth epoch, we start to see them spread apart (65% versus 61%). By the 15th epoch, the spread is very large (74% versus 63%). Our model clearly started overfitting around the fifth epoch:

<img src="img_7.png">
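Spotting the epoch where the two curves separate can also be automated. The following is a minimal sketch, not part of the original example: the function name `first_overfit_epoch` and the 3-point threshold are our own choices, and the accuracy lists are hypothetical except for the fifth-epoch values (65% versus 61%) quoted above.

```python
def first_overfit_epoch(train_acc, val_acc, max_gap=0.03):
    """Return the first 1-based epoch whose train/validation gap exceeds max_gap."""
    for epoch, (train, val) in enumerate(zip(train_acc, val_acc), start=1):
        if train - val > max_gap:
            return epoch
    return None  # the curves never separated by more than max_gap


# Hypothetical per-epoch accuracies; only the last pair comes from the text.
train_acc = [0.50, 0.57, 0.60, 0.62, 0.65]
val_acc = [0.50, 0.56, 0.59, 0.61, 0.61]

print(first_overfit_epoch(train_acc, val_acc))  # 5
```

In a real run, the two lists would come from the History object that fit() returns. With metrics=['acc'], they should be available as `history.history['acc']` and `history.history['val_acc']`.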

Let’s now work on getting the model to generalize from the examples instead of overfitting to them. As discussed in earlier chapters, we want to add some regularization (some noise) during training so the model cannot rote-memorize the training examples. In this code example, we modify our model by adding 50% dropout before the final dense layer. Because dropout slows learning (a consequence of the forgetting it introduces), we increase the number of epochs to 20:


In [11]:
from keras.layers import Dropout
model = Sequential([
    Conv2D(32, kernel_size=3, activation="relu", input_shape=(32, 32, 3)),
    MaxPooling2D(pool_size=(2,2)),
    Flatten(),
    Dropout(0.5),
    Dense(10, activation="softmax")])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=['acc'])
model.fit(x_train, y_train, epochs=20, validation_split=0.1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x26078a571c0>

We can see from the following output that while achieving comparable training accuracy requires more epochs, the training and validation accuracy stay close to each other. Thus, the model is learning to generalize instead of rote-memorizing the training examples:

<img src="img_8.png">
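To see what the dropout layer does to the values passing through it during training, here is a small sketch in plain Python. It is a simplified stand-in for the Keras layer, not its implementation: the function name `dropout` and the toy activations are ours. Like the Keras layer, it zeroes each value with probability `rate` and scales the survivors by 1 / (1 - rate) so that the expected sum stays the same.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the rest."""
    keep_probability = 1.0 - rate
    return [
        value / keep_probability if rng.random() < keep_probability else 0.0
        for value in activations
    ]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
activations = [1.0] * 8  # hypothetical flattened feature values

# Each training step sees a different random subset of the features.
for step in range(3):
    print(dropout(activations, rate=0.5, rng=rng))
```

Because a different half of the features disappears on every step, the final dense layer cannot rely on any single feature to recognize a specific training example.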

## Loss monitoring
Up to now, we’ve been focusing on accuracy. The other metric you see reported is the average loss across batches for both the training and validation data. Ideally, we would like to see a consistent increase in accuracy per epoch. But we might also see sequences of epochs for which the accuracy plateaus or even fluctuates +/– a small amount.

What is important is that we see a steady decrease in the loss. The plateau or fluctuations in this case occur because some predictions are hovering near a line of separation without having been pushed over it yet. The decrease in loss indicates that they are getting closer.

Let’s look at this another way. Assume you’re building a classifier for dogs versus cats. You have two output nodes on the classifier layer: one for cats and one for dogs. Assume that on a specific batch, when the model incorrectly classifies a dog as a cat, the output values (confidence levels) are 0.6 for cat and 0.4 for dog. In a subsequent batch, when the model again misclassifies a dog as a cat, the output values are 0.55 (cat) and 0.45 (dog). The values are now closer to the ground truth, and thus the loss is diminishing, but they still have not passed the 0.5 threshold, so the accuracy has not changed yet. Then assume that in another subsequent batch, the output values for the dog image are 0.49 (cat) and 0.51 (dog); the loss has further diminished, and because we crossed the 0.5 threshold, the accuracy has gone up.
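The dog-versus-cat example can be worked through numerically. The sketch below assumes the loss is cross-entropy, as in the models compiled earlier in this section, so the loss for a dog image is the negative log of the confidence assigned to the dog class. The variable names are ours.

```python
import math

# (cat, dog) confidence for the same dog image on three successive batches.
outputs = [(0.60, 0.40), (0.55, 0.45), (0.49, 0.51)]
DOG = 1  # index of the true class

# Cross-entropy for one example: -log(confidence given to the true class).
losses = [-math.log(confidences[DOG]) for confidences in outputs]

# The prediction counts as correct only once dog passes the 0.5 threshold.
correct = [confidences[DOG] > 0.5 for confidences in outputs]

for loss, is_correct in zip(losses, correct):
    print(f"loss={loss:.3f} correct={is_correct}")
# loss=0.916 correct=False
# loss=0.799 correct=False
# loss=0.673 correct=True
```

The loss falls on every batch, while the accuracy changes only on the last one.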

## Going deeper with layers

As mentioned in earlier chapters, simply going deeper with layers can lead to instability in the model unless we address the issues with techniques such as identity links and batch normalization. For example, many of the values we are matrix-multiplying are small numbers less than 1. Multiply two numbers less than 1, and you get an even smaller number. At some point, numbers get so small that the hardware can’t represent the value anymore, which is referred to as a vanishing gradient. In other cases, the parameters may be too close to distinguish from each other, or the opposite: spread too far apart, which is referred to as an exploding gradient.
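The effect of repeatedly multiplying small numbers is easy to reproduce. The sketch below is only an illustration of the arithmetic, not a real gradient computation: the factor 0.1 is an arbitrary choice of ours, and it uses Python’s 64-bit floats. Training typically uses 32-bit or 16-bit floats, which run out of representable values sooner.

```python
value = 1.0
steps = 0

# Keep multiplying by a number less than 1 until the result can no longer
# be represented and underflows to exactly zero.
while value > 0.0:
    value *= 0.1
    steps += 1

print(f"underflowed to {value} after {steps} multiplications")
```

Once a value has underflowed to zero, every later product that includes it is also zero, so nothing upstream of that point can be updated.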

The following code example demonstrates this by using a 40-layer DNN that lacks any of the methods that protect against numerical instability as we go deeper in layers, such as batch normalization after each dense layer:



In [22]:
model = Sequential()
# Flatten each 28x28 image into a 784-element vector. Without this step, Dense
# acts only on the last axis and the model emits 28 predictions per image,
# which no longer lines up with one label per image.
model.add(Flatten(input_shape=(28, 28)))
model.add(Dense(64, activation='relu'))
# 40 more hidden layers, with no batch normalization or identity links
for _ in range(40):
    model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

In [24]:
from keras.datasets import mnist
# Load MNIST and scale the pixel values from [0, 255] to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = (x_train / 255.0).astype(np.float32)
x_test = (x_test / 255.0).astype(np.float32)
model.fit(x_train, y_train, epochs=10, validation_split=0.1)

<img src="img_9.png">