# Introduction #

In the last lesson, we looked at how to define a neural network in Keras. This lesson is about the second major component of deep learning models: the **optimizer**. The optimizer tells the network how to change its weights to reduce the loss on the training set. The optimizer, in other words, tells the network how to learn.

There are a variety of optimization methods used in machine learning. By far the most common method with deep neural networks is **stochastic gradient descent**, which we introduced in Lesson 1. The neural network model is very flexible and there is great diversity in the kinds of networks you can create and in the kind of data you might apply them to. Correspondingly, there are a lot of options available for optimizing the network -- there isn't one method that is always the best choice... or that even works at all!

In this lesson, we'll examine some of the choices available to you for optimizing a network and what consequences they can have on your model's performance.

# Stochastic #

First let's break down the name: "Stochastic Gradient Descent." Though this can sound intimidating if you've never encountered it before, it's really just three ideas that are each relatively simple. Let's look at each in turn.

The word **Stochastic** simply means "determined by chance." We say "stochastic" in this case because we train the network on randomly chosen samples of the data, instead of the entire dataset at once. In other words, take a sample of the training set (without replacement), fit the network to the sample, and repeat until you've run out of data. That is one **epoch**. Training will usually take place over many runs through the dataset (over many epochs).

Each one of these random samples we call a **minibatch**, or sometimes just a **batch**. The size of the batch is one of the parameters you define before training. It's common to use batch sizes in the tens to the hundreds, and often even larger.
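To make the idea of an epoch concrete, here is a minimal sketch in plain Python. The helper name `iterate_minibatches` and the ten-item stand-in dataset are made up for illustration. It shuffles the example indices once and then slices them into batches, which is one simple way to sample without replacement.

```python
import random

def iterate_minibatches(data, batch_size, rng):
    """Yield minibatches that together cover `data` exactly once (one epoch)."""
    indices = list(range(len(data)))
    rng.shuffle(indices)  # random order; each example is used once per epoch
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        yield [data[i] for i in batch_indices]

rng = random.Random(0)      # fixed seed so the run is repeatable
data = list(range(10))      # hypothetical stand-in for ten training examples

epoch_batches = list(iterate_minibatches(data, batch_size=4, rng=rng))
for batch in epoch_batches:
    print(batch)
```

With ten examples and a batch size of 4, the epoch has three batches, and the last one is smaller because the data ran out.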

# Gradient #

The gradient is a vector which tells you how to decrease the loss.

The dataset is fixed, so the only way we can decrease the loss is by modifying the weights in the network. The following graph shows the MSE loss for this model: $y=x$.

<!-- loss graph -->

Say the network currently has weights $..$. The gradient at that point is a vector in the direction of greatest *increase* -- it points, in other words, up the steepest part of the slope. (If you were looking at a topographical map, this would be where the contour lines are closest together.)
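As a rough illustration, the sketch below computes the MSE gradient by hand. It assumes a one-input linear model `w * x + b` with targets drawn from $y=x$; the starting weights and the five data points are hypothetical values picked for this example. Taking a small step *along* the gradient makes the loss go up.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the linear model w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, b, xs, ys):
    """Partial derivatives of the MSE loss with respect to w and b."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = xs[:]               # targets from y = x

w, b = 0.0, 1.0          # hypothetical current weights
grad_w, grad_b = mse_gradient(w, b, xs, ys)

step = 0.01
loss_here = mse_loss(w, b, xs, ys)
loss_uphill = mse_loss(w + step * grad_w, b + step * grad_b, xs, ys)

print("gradient:", (grad_w, grad_b))
print("loss here:", loss_here, "loss after stepping along the gradient:", loss_uphill)
```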

# Descent #

How do we use the gradient to update the weights? If you take the weights and *subtract* the gradient, then you will move *down* the steepest part of the slope, towards a minimum.

You modify how *fast* you descend with the **learning rate**.
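
Written as a formula, the update might look like the following. The symbols are chosen here for illustration: $w$ stands for the weights, $L$ for the loss, and $\eta$ for the learning rate.

$$
w \leftarrow w - \eta \, \nabla_w L(w)
$$

A larger $\eta$ means a bigger step down the slope each time the weights are updated.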

So, in summary, *stochastic gradient descent* means:
1. take a minibatch sample from the training data
2. use the minibatch to compute the loss gradient
3. update the weights with the gradient to decrease the loss

Do that until the loss won't decrease anymore and you know that you've reached a minimum -- the model has *converged*. That's the whole idea behind SGD.
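
The three steps above can be sketched in plain Python. This is a toy under stated assumptions, not how Keras implements SGD: the model is a hypothetical one-input linear model `w * x + b`, the data come from `y = x`, and the loop runs for a fixed number of epochs instead of checking for convergence.

```python
import random

def minibatch_gradient(w, b, batch):
    """MSE gradient of the model w * x + b on one minibatch of (x, y) pairs."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

def sgd_fit(data, batch_size, learning_rate, epochs, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 1.0                       # hypothetical starting weights
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)             # new random order every epoch
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]         # 1. take a minibatch
            grad_w, grad_b = minibatch_gradient(w, b, batch)   # 2. compute the gradient
            w -= learning_rate * grad_w                        # 3. update the weights
            b -= learning_rate * grad_b
    return w, b

xs = [-2.0 + 4.0 * i / 19 for i in range(20)]   # 20 evenly spaced inputs
data = [(x, x) for x in xs]                     # targets from y = x

w, b = sgd_fit(data, batch_size=4, learning_rate=0.05, epochs=200)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Because the data here have no noise, the weights should settle very close to `w = 1` and `b = 0`.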

# Batch Size and Learning Rate #

Now, what does this mean in practice?

Sampling introduces variation. Small batches mean your gradient will have more error -- in other words, it's likely to be pointing somewhat in the wrong direction.

With a smaller learning rate, each step is shorter, so you recompute the gradient more often along the way and the errors tend to average out.

The moral is: if you don't know exactly where you're going, you'd better not go too fast!
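
One way to see the effect of batch size is to estimate the same gradient many times from different random batches and measure how much the estimates disagree. The sketch below does this for a hypothetical linear model `w * x + b` at made-up weights. The dataset, the batch sizes, and the number of trials are all assumptions for illustration.

```python
import random
import statistics

def grad_w_estimate(w, b, batch):
    """MSE gradient with respect to w, estimated from a batch of (x, y) pairs."""
    return sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)

rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(256)]
data = [(x, x) for x in xs]      # targets from y = x
w, b = 0.0, 1.0                  # hypothetical current weights

def gradient_spread(batch_size, trials=500):
    """Standard deviation of the gradient estimate across random minibatches."""
    estimates = [
        grad_w_estimate(w, b, rng.sample(data, batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(estimates)

spread_small = gradient_spread(batch_size=4)
spread_large = gradient_spread(batch_size=64)

print(f"full-data gradient:    {grad_w_estimate(w, b, data):.3f}")
print(f"spread, batch size 4:  {spread_small:.3f}")
print(f"spread, batch size 64: {spread_large:.3f}")
```

The estimates from the small batches scatter much more widely around the full-data gradient than the estimates from the large batches.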

# Choosing Parameters #

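One risk during training is getting stuck in a **local minimum**: a point where the loss is lower than at any nearby weights, but still higher than it could be elsewhere.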

It's as if, instead of going all the way down the hill, the optimizer simply stepped into a pothole and decided it was finished. Larger steps mean you can "bounce out" of these potholes, and with a larger batch size, you won't be going too much in the wrong direction.

As you approach your "final destination," however, you'll likely want to start slowing down. As the loss surface becomes narrower, you need a smaller learning rate in order to keep descending. If you kept running as fast as you could, you'd run right past the minimum -- and if you never slowed down after turning back around, you'd keep running back and forth forever.
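
A tiny example shows the "back and forth" behavior. The sketch below runs plain gradient descent on the toy loss `w ** 2`, chosen here only for illustration, once with a fixed learning rate that is too large and once with a learning rate that shrinks at every step. The decay factor of 0.9 is an arbitrary assumption, not a recommended setting.

```python
def descend(learning_rates, w=1.0):
    """Gradient descent on the toy loss w**2, one step per learning rate."""
    path = [w]
    for lr in learning_rates:
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad
        path.append(w)
    return path

steps = 20
fixed_path = descend([1.0] * steps)                        # never slows down
decayed_path = descend([0.9 ** t for t in range(steps)])   # slows down each step

print("fixed learning rate:   ", fixed_path[:5])    # [1.0, -1.0, 1.0, -1.0, 1.0]
print("decaying learning rate:", [round(w, 4) for w in decayed_path[:5]])
print("final weights:", fixed_path[-1], decayed_path[-1])
```

With the fixed rate, each step jumps to the opposite side of the minimum at exactly the same distance, so the weight never gets any closer. With the decaying rate, the jumps shrink and the weight settles near the minimum at 0.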

<!-- running back and forth forever -->

# Example - Learning Rate and Batch Size in Keras #

- Train a model
- Look at learning curves
- Train it again
- Compare

# Conclusion #

There are trade-offs to consider when choosing the parameters of your optimizer. As with most things in machine learning, there won't be a single choice that will always give the best performance.

Understanding what these parameters do can guide your choices, making your decisions more informed and more effective.

# Your Turn #