<a href="https://colab.research.google.com/github/Deep-Learning-Challenge/challenge-notebooks/blob/master/1.Multilayer%20Perceptrons/3.Advanced%20Lessons/5.Lift%20Performance%20With%20Learning%20Rate%20Schedules.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a>

# Lift Performance With Learning Rate Schedules

Training a neural network or a large deep learning model is a difficult optimization task. The classical algorithm used to train neural networks is stochastic gradient descent. It is well established that, on some problems, you can achieve better performance and faster training by using a learning rate that changes during training. In this lesson, you will discover how you can use different learning rate schedules for your neural network models in Python using the Keras deep learning library. After completing this lesson, you will know:

* The benefit of learning rate schedules on lifting model performance during training.
* How to configure and evaluate a time-based learning rate schedule.
* How to configure and evaluate a drop-based learning rate schedule.

Let's get started.

## Runtime Setup

In [None]:
import sys

dataset_name = "ionosphere.csv"
if 'google.colab' in sys.modules:
    DATASET = f"https://github.com/Deep-Learning-Challenge/challenge-notebooks/raw/master/datasets/{dataset_name}"
else:
    DATASET = f"../../datasets/{dataset_name}"
    
DATASET

## Learning Rate Schedule For Training Models

Adapting the learning rate for your stochastic gradient descent optimization procedure can increase performance and reduce training time. Sometimes this is called learning rate annealing or adaptive learning rates. Here we will call this approach a learning rate schedule, where the default schedule is to use a constant learning rate to update network weights for each training epoch.

The simplest and perhaps most widely used adaptations of the learning rate during training are techniques that reduce it over time. These make large changes at the beginning of the training procedure, when larger learning rate values are used, and then decrease the learning rate so that smaller updates are made to the weights later in training. This has the effect of quickly learning good weights early and fine-tuning them later. Two popular and easy-to-use learning rate schedules are as follows:

* Decrease the learning rate gradually based on the epoch.
* Decrease the learning rate using punctuated large drops at specific epochs.

Next, we will look at how you can use each of these learning rate schedules in turn with Keras.

## Ionosphere Classification Dataset

The Ionosphere binary classification problem is used as a demonstration in this lesson. The dataset describes radar returns where the target was free electrons in the ionosphere. It is a binary classification problem where positive cases (g for good) show evidence of some structure in the ionosphere and negative cases (b for bad) do not. It is a useful dataset for practicing with neural networks because all of the inputs are small numerical values of the same scale. There are 34 attributes and 351 observations.

State-of-the-art results on this dataset achieve an accuracy of approximately 94% to 98% using 10-fold cross-validation. You can learn more about the ionosphere dataset on the [UCI Machine Learning Repository website](https://archive.ics.uci.edu/ml/datasets/Ionosphere).

## Time-Based Learning Rate Schedule

Keras has a time-based learning rate schedule built-in. The stochastic gradient descent optimization algorithm implementation in the SGD class has an argument called decay. This argument is used in the time-based learning rate decay schedule equation as follows:

$$ Learning Rate = LearningRate\times\frac{1}{1 + decay\times epoch}  $$

When the decay argument is zero (the default), the equation leaves the learning rate unchanged. For example, with a learning rate of 0.1:

```
LearningRate = 0.1 * 1/(1 + 0.0 * 1)
LearningRate = 0.1
```

When the decay argument is specified, each epoch's learning rate is obtained by scaling the previous epoch's rate by the factor in the equation above, so the rate shrinks a little more with every epoch. For example, if we use an initial learning rate of 0.1 and a decay of 0.001, the first five epochs will adapt the learning rate as follows:

```
Epoch Learning Rate
    1 0.1
    2 0.0999000999
    3 0.0997006985
    4 0.09940249103
    5 0.09900646517
```

Extending this out to 100 epochs will produce the following graph of learning rate (y-axis) versus epoch (x-axis):

![Learning Rate Decay](../../images/learning_rate_decay.png)
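
The values in the table above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than Keras code: the helper name `time_based_schedule` is hypothetical, and it assumes, as the table does, that each epoch's rate is derived from the previous epoch's rate.

```python
def time_based_schedule(initial_lr, decay, n_epochs):
    """Return the learning rates for epochs 1..n_epochs (illustrative helper)."""
    rates = [initial_lr]  # epoch 1 uses the initial learning rate unchanged
    for epoch in range(1, n_epochs):
        # scale the previous epoch's rate by 1 / (1 + decay * epoch)
        rates.append(rates[-1] / (1.0 + decay * epoch))
    return rates

# print the first five epochs in the same layout as the table
for epoch, rate in enumerate(time_based_schedule(0.1, 0.001, 5), start=1):
    print(f"{epoch:5d} {rate:.10f}")
```

Calling the helper with `n_epochs=100` gives the values you would plot to obtain a curve like the one in the graph.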

You can create a nice default schedule by setting the decay value as follows:

```
Decay = LearningRate / Epochs
Decay = 0.1 / 100
Decay = 0.001
```

The example below demonstrates using the time-based learning rate adaptation schedule in Keras. A small neural network model is constructed with a single hidden layer with 34 neurons and using the rectifier activation function. The output layer has a single neuron and uses the sigmoid activation function to output probability-like values. The learning rate for stochastic gradient descent has been set to a higher value of 0.1. The model is trained for 50 epochs, and the decay argument has been set to 0.002, calculated as $\frac{0.1}{50}$.
Additionally, it can be a good idea to use momentum when using an adaptive learning rate. In this case, we use a momentum value of 0.8. The complete example is listed below.

In [None]:
# Time Based Learning Rate Decay
from pandas import read_csv
import numpy

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from sklearn.preprocessing import LabelEncoder

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataframe = read_csv(DATASET, header=None)
dataset = dataframe.values

# split into input (X) and output (Y) variables
X = dataset[:,0:34].astype(float)
Y = dataset[:,34]

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
Y = encoder.transform(Y)

# create model
model = Sequential()
model.add(Dense(34, input_dim=34, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

# Compile model
epochs = 50
learning_rate = 0.1
decay_rate = learning_rate / epochs
momentum = 0.8
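# `learning_rate` is the current name of the deprecated `lr` argument; passing `decay`
# assumes a TensorFlow 2.x release whose SGD class still accepts it
sgd = SGD(learning_rate=learning_rate, momentum=momentum, decay=decay_rate, nesterov=False)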
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

The model is trained on 67% of the dataset and evaluated using a 33% validation dataset. Running the example shows a classification accuracy of 99.14%. This is higher than the baseline of 95.69% without the learning rate decay or momentum.

In [None]:
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=epochs, batch_size=28, verbose=2)
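
Some newer Keras releases may not accept the `decay` argument on `SGD`. In that case, a similar per-epoch schedule can be written as a plain function. The sketch below is an illustration under stated assumptions: the helper name `make_time_based_decay` is hypothetical, and it uses the closed form `initial_lr / (1 + decay * epoch)`, which recomputes the rate from the initial value instead of compounding it, so its values differ slightly from the table earlier in this section.

```python
def make_time_based_decay(initial_lr, decay):
    """Build a schedule function of the zero-based epoch index (hypothetical helper)."""
    def schedule(epoch, lr=None):
        # `lr` (the optimizer's current rate) is accepted but not used here
        return initial_lr / (1.0 + decay * epoch)
    return schedule

# same settings as the example above: 0.1 initial rate, decay = 0.1 / 50
schedule = make_time_based_decay(initial_lr=0.1, decay=0.1 / 50)
for epoch in (0, 10, 49):
    print(epoch, schedule(epoch))
```

A function with this signature could be passed to the `LearningRateScheduler` callback covered in the next section, with the optimizer created without `decay`.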

## Drop-Based Learning Rate Schedule

Another popular learning rate schedule used with deep learning models is to systematically drop the learning rate at specific times during training. Often this method is implemented by dropping the learning rate by half every fixed number of epochs. For example, we may have an initial learning rate of 0.1 and drop it by a factor of 0.5 every ten epochs. The first ten epochs of training would use a value of 0.1, the next ten epochs would use a learning rate of 0.05, and so on. If we plot the learning rates for this example out to 100 epochs, we get the graph below, showing learning rate (y-axis) versus epoch (x-axis).

![Learning Rate Schedule](../../images/learning_rate_schedule.png)

We can implement this in Keras using the `LearningRateScheduler` callback when fitting the model. The `LearningRateScheduler` callback allows us to define a function that takes the epoch number as an argument and returns the learning rate to use in stochastic gradient descent. When the callback is used, the learning rate specified in the SGD optimizer is ignored. In the code below, we use the same example of a single hidden layer network on the Ionosphere dataset. A new `step_decay()` function is defined that implements the equation:

$$ LearningRate=InitialLearningRate \times {DropRate}^{floor(\frac{1+Epoch}{EpochDrop})} $$

`InitialLearningRate` is the learning rate at the beginning of the run, `EpochDrop` is how often the learning rate is dropped in epochs, and `DropRate` is how much to drop the learning rate each time it is dropped.

In [None]:
# Drop-Based Learning Rate Decay
from pandas import read_csv
import numpy
import math

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import LearningRateScheduler

from sklearn.preprocessing import LabelEncoder

# learning rate schedule
def step_decay(epoch):
    initial_lrate = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
    return lrate

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataframe = read_csv(DATASET, header=None)
dataset = dataframe.values

# split into input (X) and output (Y) variables
X = dataset[:,0:34].astype(float)
Y = dataset[:,34]

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
Y = encoder.transform(Y)

# create model
model = Sequential()
model.add(Dense(34, input_dim=34, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

# Compile model
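# the learning rate given here is overridden by the LearningRateScheduler callback;
# passing `decay` assumes a TensorFlow 2.x release whose SGD class still accepts it
sgd = SGD(learning_rate=0.0, momentum=0.9, decay=0.0, nesterov=False)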
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

# learning schedule callback
lrate = LearningRateScheduler(step_decay)
callbacks_list = [lrate]

Running the example results in a classification accuracy of 99.14% on the validation dataset, again an improvement over the baseline for the model on this dataset.

In [None]:
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=50, batch_size=28,
          callbacks=callbacks_list, verbose=2)
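
It can help to print the schedule before training to confirm where the drops fall. The sketch below restates the lesson's `step_decay()` formula with its constants exposed as keyword arguments (the default values are the ones used above; exposing them as parameters is an assumption made for illustration).

```python
import math

def step_decay(epoch, initial_lrate=0.1, drop=0.5, epochs_drop=10.0):
    """Same formula as the lesson's step_decay(), with the constants as parameters."""
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))

# Keras passes a zero-based epoch index to the schedule function
for epoch in (0, 8, 9, 18, 19, 29):
    print(epoch, step_decay(epoch))
```

Because of the `1 + epoch` term and the zero-based index, the first drop takes effect at index 9, which is the tenth epoch of training.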

## Tips for Using Learning Rate Schedules

This section lists some tips and tricks to consider when using learning rate schedules with neural networks.

* **Increase the initial learning rate**: Because the learning rate will decrease, start with a larger value to decrease from. A larger learning rate will result in much larger changes to the weights, at least in the beginning, allowing you to benefit from fine-tuning later.
* **Use a large momentum**: Using a larger momentum value will help the optimization algorithm continue to make updates in the right direction when your learning rate shrinks to small values.
* **Experiment with schedules**: It will not be clear which learning rate schedule to use, so try a few with different configuration options and see what works best on your problem. Also, try schedules that change exponentially and even schedules that respond to your model's accuracy on the training or test datasets.
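
As one way to try an exponential schedule, the sketch below defines a function of the epoch index that could be passed to `LearningRateScheduler` in place of `step_decay()`. The function name `exp_decay` and the decay constant `k=0.1` are assumptions chosen for illustration, not tuned values.

```python
import math

def exp_decay(epoch, initial_lrate=0.1, k=0.1):
    """Exponential schedule: initial_lrate * exp(-k * epoch); `k` is an assumed constant."""
    return initial_lrate * math.exp(-k * epoch)

# inspect the rate at a few zero-based epoch indices
for epoch in (0, 10, 25, 49):
    print(epoch, exp_decay(epoch))
```

For schedules that respond to model performance, Keras provides the `ReduceLROnPlateau` callback, which reduces the learning rate when a monitored metric stops improving.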

## Summary

In this lesson, you discovered learning rate schedules for training neural network models. You learned:

* The benefits of using learning rate schedules during training to lift model performance.
* How to configure and use a time-based learning rate schedule in Keras.
* How to develop your own drop-based learning rate schedule in Keras.

### Next

This concludes the lessons for Part IV. Now you know how to use more advanced features of Keras and more advanced techniques to get improved performance from your neural network models. Next, in Part V, you will discover a new type of model called the convolutional neural network that is achieving state-of-the-art results in computer vision and natural language processing problems.