# Training Deep Neural Networks

In the last chapter we covered neural networks in general. The networks there were shallow, with only a few layers and possibly several neurons in each. We also discussed the Keras Sequential and Functional APIs, used TensorBoard to visualize different runs, used callbacks to control the behavior of the model during training, and fine-tuned hyperparameters.

In this chapter we go deeper and train deep neural networks, which can handle complex tasks such as classifying hundreds of object types (rather than the binary or 10-class problems we have seen so far) or voice recognition. Solving such problems raises several issues we need to take care of: **Vanishing and Exploding Problems**, **Overfitting**, **Large NN with small data and vice versa**, and **Labeling the data and training time**. We will see how to handle each of them when training a deep NN.

## Vanishing and Exploding Problems

"The backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient on the way. Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem. In some cases, the opposite can happen: the gradients can grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges." This is the exploding gradients problem.

One solution to the **Vanishing and Exploding Problems** is to pair each activation function with a suitable weight initializer, as in the table below.
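To see why the choice of activation function matters here, the sketch below looks only at the activation derivatives that backpropagation multiplies together, one per layer. It is a simplified illustration, not a full backward pass: the weights are ignored, and the helper names (`sigmoid_grad`, `max_gradient_factor`) are made up for this example. The sigmoid derivative is never larger than 0.25, so even in the best case the product shrinks quickly with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), largest at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def max_gradient_factor(n_layers):
    # Best-case product of sigmoid derivatives along one path
    # through n_layers layers (weights deliberately ignored)
    return sigmoid_grad(0.0) ** n_layers

for n in (1, 5, 10, 20):
    print(n, max_gradient_factor(n))

# When the sigmoid saturates, the factor is far smaller still
print(sigmoid_grad(5.0))
```

The weights multiply into the same product, which is where the initializer comes in: it is chosen so that the signal neither dies out nor blows up as it passes through the layers.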

<img src="images/1.png">

By default, Keras uses Glorot initialization with a uniform distribution. You can change this to He initialization by setting kernel_initializer="he_uniform" or kernel_initializer="he_normal"
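As a rough illustration of what these initializers compute, the sketch below derives the target weight variance for the Glorot, He and LeCun schemes from a layer's fan-in and fan-out, and the matching limit for a uniform distribution. The function names are hypothetical and the code does not call Keras; it only reproduces the usual formulas (variance 1/fan_avg, 2/fan_in and 1/fan_in respectively).

```python
import math

def init_variance(fan_in, fan_out, scheme):
    """Target variance of the weights for a given initialization scheme."""
    fan_avg = (fan_in + fan_out) / 2
    if scheme == "glorot":
        return 1.0 / fan_avg
    if scheme == "he":
        return 2.0 / fan_in
    if scheme == "lecun":
        return 1.0 / fan_in
    raise ValueError(f"unknown scheme: {scheme}")

def uniform_limit(variance):
    # A uniform distribution on [-r, r] has variance r**2 / 3
    return math.sqrt(3.0 * variance)

# First hidden layer of the models in this chapter: 784 inputs, 300 units
for scheme in ("glorot", "he", "lecun"):
    var = init_variance(784, 300, scheme)
    print(scheme, "std:", math.sqrt(var), "uniform limit:", uniform_limit(var))
```

He initialization gives a larger variance than Glorot for this layer, which compensates for ReLU zeroing out roughly half of the activations.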

In [33]:
import os
import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import plot_model  # public import path, consistent with the tensorflow.keras imports above
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
import pandas as pd
from functools import partial
%matplotlib inline
%load_ext tensorboard
import warnings
warnings.filterwarnings('ignore')




In [2]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

## He initialization instead of default Glorot initialization

In [3]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu",  kernel_initializer="he_normal"),
    keras.layers.Dense(100, activation="relu",  kernel_initializer="he_normal"),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### He initialization with a uniform distribution, but based on fan_avg rather than fan_in as in the table

In [4]:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu",  kernel_initializer=he_avg_init),
    keras.layers.Dense(100, activation="relu",  kernel_initializer=he_avg_init),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Nonsaturating Activation Functions

Another way to mitigate the **Vanishing and Exploding Problems** is to use a nonsaturating activation function. The traditional choice, the **sigmoid function**, saturates when z is very large or very small. ReLU does not saturate for positive values, but it leads to the problem of **dying neurons**: a neuron whose weighted input z is negative for every training instance outputs 0 and stops learning. To overcome this, leaky ReLU and other ReLU variants were introduced. Leaky ReLU adds a hyperparameter α to tune (the slope for z < 0), randomized leaky ReLU picks α at random during training, and parametric leaky ReLU learns it. In each case the small negative slope helps avoid dying neurons.

Another activation function, the exponential linear unit (ELU), was reported to outperform ReLU and all its variants. It is slower to compute than ReLU, both during training and at test time, but it tends to converge in fewer epochs.
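The activation functions discussed above can be written out in a few lines. This is a plain-Python sketch for single values rather than the Keras implementation; `alpha=0.2` for leaky ReLU matches the value used in the LeakyReLU cell of this chapter, and `alpha=1.0` for ELU is an assumed default.

```python
import math

def relu(z):
    # Outputs 0 for every negative input, which is what lets neurons "die"
    return max(0.0, z)

def leaky_relu(z, alpha=0.2):
    # Small slope alpha for z < 0 keeps a nonzero gradient
    return z if z >= 0 else alpha * z

def elu(z, alpha=1.0):
    # Smooth for z < 0 and approaches -alpha as z goes to minus infinity
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

for z in (-3.0, -1.0, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z), elu(z))
```

For negative inputs ReLU is flat at 0, while leaky ReLU and ELU still change with z, so a gradient can keep flowing through the neuron.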

<table>
<tr>
<th><img src="images/2.png"></th>
<th><img src="images/3.png"></th>

</tr>
</table>
<img src="images/4.png">

## LeakyReLU instead of ReLU

In [5]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(alpha=.2),
    keras.layers.Dense(100, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(alpha=.2),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-3),
             metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Parametric ReLU instead of ReLU
Here α, which we fixed at 0.2 above, is learned by the model as a trainable parameter.

In [6]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer='he_normal'),
    keras.layers.PReLU(),
    keras.layers.Dense(100, kernel_initializer='he_normal'),
    keras.layers.PReLU(),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss="sparse_categorical_crossentropy", optimizer=keras.optimizers.SGD(learning_rate=1e-3),
             metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Batch Normalization

The previous techniques can significantly reduce the vanishing and exploding gradients problems at the beginning of training, but they do not guarantee that the problems won't come back during training. A technique called Batch Normalization (BN) was introduced to address this.
The technique consists of adding an operation in the model just before or after the activation function of each hidden layer. This operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer.
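A minimal sketch of that operation at training time, written in plain Python so each step is visible. The function name `batch_norm` and the list-of-lists layout are choices made for this example, `eps=1e-3` is an assumed smoothing term, and the moving averages used at inference time are left out.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch, then scale by gamma and shift by beta."""
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        for i in range(n):
            x_hat = (batch[i][j] - mean) / math.sqrt(var + eps)  # zero-center and normalize
            out[i][j] = gamma[j] * x_hat + beta[j]               # scale and shift
    return out

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0]))
```

With `gamma` set to ones and `beta` to zeros, both features end up on the same scale even though the second one was ten times larger on input.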

## BatchNormalization after the activation function

In [7]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10,  activation="softmax")
])

model.summary()


Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_4 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_12 (Dense)             (None, 300)               235500    
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 300)               0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_13 (Dense)             (None, 100)               30100     
_________________________________________________________________
leaky_re_lu_3 (LeakyReLU)    (None, 100)              

In [8]:
# Let’s look at the parameters of the first BN layer. Two are trainable (by backprop), and two are not:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]
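The four variables listed above also explain the `Param #` column of the summary. Each BN layer holds one value of each variable per input feature, so the count is four times the layer width, and half of it is not trained by backpropagation. The helper below is a small arithmetic check with a made-up name.

```python
def bn_param_counts(n_features):
    trainable = 2 * n_features      # gamma and beta
    non_trainable = 2 * n_features  # moving_mean and moving_variance
    return trainable + non_trainable, trainable, non_trainable

# Widths of the three BN layers in the model above
for n in (784, 300, 100):
    total, trainable, non_trainable = bn_param_counts(n)
    print(n, "features:", total, "params,", non_trainable, "non-trainable")
```

The totals for 784 and 300 features are 3136 and 1200, matching the summary.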

In [9]:
model.compile(loss="sparse_categorical_crossentropy", optimizer=keras.optimizers.SGD(learning_rate=1e-3),
             metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## BatchNormalization before the activation function

**Sometimes this gives a better result than placing it after the activation function.**

- Also note the accuracy: above 86%, compared with about 84% for the earlier models without normalization.

In [10]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300,  kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.PReLU(),
    keras.layers.Dense(100, kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.PReLU(),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-3),
             metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Let's do it with ELU

In [11]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300,  kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.ELU(),
    keras.layers.Dense(100, kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.ELU(),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-3),
             metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## BatchNormalization after activation function with ELU

In [12]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300,  kernel_initializer='he_normal'),
    keras.layers.ELU(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, kernel_initializer='he_normal'),
    keras.layers.ELU(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-3),
             metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Change kernel_initializer 

In [13]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300,  kernel_initializer='lecun_normal'),
    keras.layers.ELU(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, kernel_initializer='lecun_normal'),
    keras.layers.ELU(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-3),
             metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# Reusing Pretrained Layers

We have talked about the **Vanishing and Exploding Problems** and the different techniques for addressing them: pairing weight initializers with suitable activation functions and, lastly, Batch Normalization.

Another problem is that we may not have a large training set, or the resources to build a deep neural network from scratch. However, other people may already have trained networks of various architectures on similar tasks, and we can reuse their networks for our new task. **This is called Transfer Learning.**

<img src="images/5.png" width="500" height="500">


**The more similar the tasks are, the more layers you want to reuse (starting with the lower layers).**

Try freezing all the reused layers first so gradient descent won’t modify them, then train the model and see the result. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves and so on.

The more training data you have, the more layers you can unfreeze. **It is also useful to reduce the learning rate when you unfreeze reused layers: this will avoid wrecking their fine-tuned weights**.


## Transfer Learning With Keras

We will train a model on just 8 classes of the dataset, then reuse this trained model for the other two classes.

In [14]:
def split_dataset(X, y):
    # ~ is the elementwise NOT of the boolean mask
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6] # All other classes
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?
    
    return ((X[~y_5_or_6], y_A), (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]
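To make the label handling in `split_dataset` easier to follow, here is the same remapping applied to a short list of labels. It is a plain-Python sketch with a hypothetical name (`remap_labels`) and works on labels only, not on the images.

```python
def remap_labels(labels):
    """Split labels into task A (8 classes, re-indexed 0..7) and task B (shirt or not)."""
    labels_A, labels_B = [], []
    for y in labels:
        if y in (5, 6):
            # Task B: 1.0 for a shirt (class 6), 0.0 for a sandal (class 5)
            labels_B.append(1.0 if y == 6 else 0.0)
        else:
            # Task A: classes 7, 8, 9 move down to 5, 6, 7 to fill the gap
            labels_A.append(y - 2 if y > 6 else y)
    return labels_A, labels_B

print(remap_labels([0, 5, 6, 7, 9, 4]))
```

Classes 0 to 4 keep their indices, classes 7 to 9 are shifted down by two, and classes 5 and 6 become the binary target of task B.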

In [15]:
print(X_train_A.shape)
print(X_train_B.shape)

# from 0 to 7 because other classes are shifted down
print(y_train_A[:30])

# Contains only 0 or 1: 1 if shirt and 0 if sandal
print(y_train_B[:30])

(43986, 28, 28)
(200, 28, 28)
[4 0 5 7 7 7 4 4 3 4 0 1 6 3 4 3 2 6 5 3 4 5 1 3 4 2 0 6 7 1]
[1. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1.
 1. 0. 1. 1. 1. 1.]


In [16]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(8, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.SGD(1e-3),
             metrics=['accuracy'])
history = model.fit(X_train_A, y_train_A, epochs=10, validation_data=(X_valid_A, y_valid_A))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [17]:
model.save('trained_models/transfer_learning_8_class.h5')

# Clone model
First load the model, then clone it so that the original model is not affected when we train the model that reuses its layers.

In [18]:
model_A = keras.models.load_model('trained_models/transfer_learning_8_class.h5')
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

In [19]:
# All layers except the output layer as we moved from 8 classes to 2 classes
# Also the input shape for the two model is the same so there is no problem
model_B_on_A = keras.models.Sequential(model_A_clone.layers[:-1])
# As we classify image to either shirt or sandals
model_B_on_A.add(keras.layers.Dense(1, activation='sigmoid'))

## Notes!

We have now reused all the layers of model A, which was trained on 8 classes, except its output layer. As mentioned in the transfer learning discussion, the new task, classifying an image as either a shirt or a sandal, is close to the task model A was trained on:

- The input is the same.
- It is an image classification task.
- Only the output differs, from 8 classes to 2, so we do not reuse the output layer of model_A.

All the layers taken from model_A are already trained, but the output layer we added to model_B_on_A is initialized with random weights. It is therefore a good idea to first **freeze the reused layers** and train for a small number of epochs so the output layer learns reasonable weights. Then **unfreeze the reused layers, perhaps only the top ones,** and train again with a **small learning rate** so the reused weights are not pushed far from their trained values. You can then **unfreeze more layers, and so on**, but every time you **freeze or unfreeze layers you need to compile the model again**.


In [20]:
# Freeze all layers except the output layer
for layer in model_B_on_A.layers[:-1]: 
    layer.trainable = False

# Compile the model
model_B_on_A.compile(loss='binary_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-2),
                    metrics=['accuracy'])

# X_train_B is the binary data 
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=5, validation_data=(X_valid_B, y_valid_B))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [21]:
# Unfreeze all the reused layers (the output layer was never frozen)
for layer in model_B_on_A.layers[:-1]: 
    layer.trainable = True
    
# Compile the model again but with less learning rate
model_B_on_A.compile(loss='binary_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-4),
                    metrics=['accuracy'])


# X_train_B is the binary data 
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=10, validation_data=(X_valid_B, y_valid_B))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [22]:
# Freeze all layers except the last two (the final BatchNormalization and the output layer)
for layer in model_B_on_A.layers[:-2]: 
    layer.trainable = False
    
# Compile the model again
model_B_on_A.compile(loss='binary_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                    metrics=['accuracy'])

# Fit for another 10 epochs with the lower layers frozen
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=10, validation_data=(X_valid_B, y_valid_B))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [23]:
# unfreeze all layers again
for layer in model_B_on_A.layers[:-2]: 
    layer.trainable = True
    
# Compile the model again
model_B_on_A.compile(loss='binary_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-4),
                    metrics=['accuracy'])

# Fit for another 10 epochs with all layers trainable
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=10, validation_data=(X_valid_B, y_valid_B))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [24]:
model_B_on_A.evaluate(X_test_B, y_test_B)



[0.2011478692293167, 0.9330000281333923]

In [25]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-3, momentum=.9),
             metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
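The cell above switches from plain SGD to SGD with momentum. As a rough sketch of what the `momentum` and `nesterov` arguments change, the code below applies the update rule to the one-dimensional function f(w) = w². It is a plain-Python illustration, not the Keras optimizer itself; the function names are made up, and the learning rate of 0.1 is chosen only to keep the run short.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9, nesterov=False):
    # The velocity accumulates past gradients, decayed by the momentum factor
    velocity = momentum * velocity - learning_rate * grad
    if nesterov:
        # Nesterov variant: apply the gradient step on top of the look-ahead velocity
        w = w + momentum * velocity - learning_rate * grad
    else:
        w = w + velocity
    return w, velocity

def minimize(n_steps, nesterov=False, w=1.0):
    velocity = 0.0
    for _ in range(n_steps):
        grad = 2.0 * w  # gradient of f(w) = w**2
        w, velocity = sgd_momentum_step(w, velocity, grad, nesterov=nesterov)
    return w

print("momentum:", minimize(2), "nesterov:", minimize(2, nesterov=True))
```

After the first step the velocity is no longer zero, so each later update combines the current gradient with the direction of the previous updates.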


In [26]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-3, momentum=.9,
                                     nesterov=True), metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [27]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.RMSprop(learning_rate=1e-3, rho=.9),
             metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [28]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=1e-3,
                                                                 beta_1=.9, beta_2=.999), metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [29]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.SGD(learning_rate=1e-2, decay=1e-4),
             metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [30]:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * .1 **(epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=.01, s=20)
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(beta_1=.9, beta_2=.999), 
              metrics=['accuracy'])


history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), 
                   callbacks=[lr_scheduler])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
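To get a feel for the schedule used in the cell above, the snippet below repeats the `exponential_decay` function on its own and prints the learning rate at a few epochs. It needs no Keras; the chosen epochs are only examples.

```python
import math

def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        # The learning rate drops by a factor of 10 every s epochs
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

for epoch in (0, 5, 10, 20):
    print(epoch, exponential_decay_fn(epoch))
```

With `s=20`, a 10-epoch run only reaches the halfway point of the first tenfold drop, so the learning rate ends at roughly a third of its starting value.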


In [31]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return .01
    elif epoch < 15:
        return .005
    else:
        return .001

    
    
lr_scheduler = keras.callbacks.LearningRateScheduler(piecewise_constant_fn)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(beta_1=.9, beta_2=.999), metrics=['accuracy'])


history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), 
                   callbacks=[lr_scheduler])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [32]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=.5, patience=5)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(beta_1=.9, beta_2=.999), 
              metrics=['accuracy'])


history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid), 
                   callbacks=[lr_scheduler])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
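The `ReduceLROnPlateau` callback used above multiplies the learning rate by `factor` when the monitored metric has not improved for `patience` epochs. The sketch below imitates that behavior on a list of validation losses. It is a simplified stand-in, not the Keras implementation: the function name is hypothetical, it ignores options such as a minimum improvement threshold or a cooldown period, and the starting rate of 0.001 is an assumption.

```python
def reduce_on_plateau(val_losses, lr0=0.001, factor=0.5, patience=5):
    """Return the learning rate in effect after each epoch."""
    lr, best, wait = lr0, float("inf"), 0
    lrs = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:   # no improvement for `patience` epochs
                lr *= factor
                wait = 0
        lrs.append(lr)
    return lrs

# Three improving epochs followed by ten epochs without improvement
losses = [1.0, 0.9, 0.8] + [0.85] * 10
print(reduce_on_plateau(losses))
```

In this made-up loss sequence the rate is halved after the fifth stalled epoch and again after the tenth.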


In [36]:
exponential_decay_fn = exponential_decay(lr0=.01, s=20)
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

RegularizedDense = partial(keras.layers.Dense,
                          activation='elu', kernel_initializer='he_normal',
                          kernel_regularizer=keras.regularizers.l2(.01))


model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    keras.layers.BatchNormalization(),
    RegularizedDense(100),
    keras.layers.BatchNormalization(),
    RegularizedDense(10, activation='softmax', kernel_initializer='glorot_uniform')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(beta_1=.9, beta_2=.999), 
              metrics=['accuracy'])


history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), 
                   callbacks=[lr_scheduler])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
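The `kernel_regularizer` in the cell above adds a penalty on the layer's weights to the loss. As a sketch of the quantity involved, the functions below compute an L2 penalty (factor times the sum of squared weights) and, for comparison, an L1 penalty (factor times the sum of absolute weights) for a small weight matrix. The names and the example matrix are made up for illustration.

```python
def l2_penalty(weights, l2=0.01):
    # Sum of squared weights, scaled by the regularization factor
    return l2 * sum(w * w for row in weights for w in row)

def l1_penalty(weights, l1=0.01):
    # Sum of absolute weights, scaled by the regularization factor
    return l1 * sum(abs(w) for row in weights for w in row)

weights = [[1.0, -2.0], [0.5, 0.0]]
print("L2:", l2_penalty(weights), "L1:", l1_penalty(weights))
```

Because the penalty is added to the training loss, larger weights cost more, which pushes the optimizer toward smaller weights; the L2 term punishes large weights much more strongly than small ones.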


In [37]:
exponential_decay_fn = exponential_decay(lr0=.01, s=20)
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

RegularizedDense = partial(keras.layers.Dense,
                          activation='elu', kernel_initializer='he_normal',
                          kernel_regularizer=keras.regularizers.l1_l2(.01))


model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    keras.layers.BatchNormalization(),
    RegularizedDense(100),
    keras.layers.BatchNormalization(),
    RegularizedDense(10, activation='softmax', kernel_initializer='glorot_uniform')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(beta_1=.9, beta_2=.999), 
              metrics=['accuracy'])


history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), 
                   callbacks=[lr_scheduler])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [39]:
exponential_decay_fn = exponential_decay(lr0=.01, s=20)
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

RegularizedDense = partial(keras.layers.Dense,
                          activation='elu', kernel_initializer='he_normal',
                          kernel_regularizer=keras.regularizers.l2(.05))


model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    RegularizedDense(300),
    keras.layers.BatchNormalization(),
    RegularizedDense(100),
    keras.layers.BatchNormalization(),
    RegularizedDense(10, activation='softmax', kernel_initializer='glorot_uniform')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(beta_1=.9, beta_2=.999), 
              metrics=['accuracy'])


history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), 
                   callbacks=[lr_scheduler])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [40]:
exponential_decay_fn = exponential_decay(lr0=.01, s=20)
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

RegularizedDense = partial(keras.layers.Dense,
                          activation='elu', kernel_initializer='he_normal')


model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=.2),
    keras.layers.BatchNormalization(),
    RegularizedDense(300),
    keras.layers.Dropout(rate=.2),
    keras.layers.BatchNormalization(),
    RegularizedDense(100),
    keras.layers.Dropout(rate=.2),
    keras.layers.BatchNormalization(),
    RegularizedDense(10, activation='softmax', kernel_initializer='glorot_uniform')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(beta_1=.9, beta_2=.999), 
              metrics=['accuracy'])


history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), 
                   callbacks=[lr_scheduler])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
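The `Dropout(rate=.2)` layers in the cell above randomly zero out inputs during training. The sketch below shows the idea in plain Python: each input is dropped with probability `rate`, the survivors are scaled up by 1 / (1 - rate) so the expected sum is unchanged, and nothing is dropped outside training. The function name and the seeded `random.Random` generator are choices made for this example.

```python
import random

def dropout(inputs, rate, rng, training=True):
    if not training:
        # At inference time the inputs pass through unchanged
        return list(inputs)
    keep = 1.0 - rate
    # Drop each input with probability `rate`, scale the rest by 1 / keep
    return [x / keep if rng.random() >= rate else 0.0 for x in inputs]

rng = random.Random(0)
outputs = dropout([1.0] * 10, rate=0.2, rng=rng)
print(outputs)
```

Each output is either 0.0 or 1.25, and on average about one input in five is dropped.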


In [41]:
exponential_decay_fn = exponential_decay(lr0=.01, s=20)
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

RegularizedDense = partial(keras.layers.Dense,
                          activation='elu', kernel_initializer='he_normal')


model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(rate=.2),
    RegularizedDense(300),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(rate=.2),
    RegularizedDense(100),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(rate=.2),
    RegularizedDense(10, activation='softmax', kernel_initializer='glorot_uniform')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(beta_1=.9, beta_2=.999), 
              metrics=['accuracy'])


history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), 
                   callbacks=[lr_scheduler])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [42]:
model.evaluate(X_test, y_test)



[0.3157493472099304, 0.883400022983551]

In [43]:
exponential_decay_fn = exponential_decay(lr0=.01, s=20)
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

RegularizedDense = partial(keras.layers.Dense,
                          activation='elu', kernel_initializer='he_normal')


model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    RegularizedDense(300),
    keras.layers.BatchNormalization(),
    RegularizedDense(100),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(rate=.2),
    RegularizedDense(10, activation='softmax', kernel_initializer='glorot_uniform')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(beta_1=.9, beta_2=.999), 
              metrics=['accuracy'])


history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), 
                   callbacks=[lr_scheduler])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [44]:
model.evaluate(X_test, y_test)



[0.3216792941093445, 0.8920000195503235]