### Topics:
Learn how to optimize your deep learning models in Keras. Start by learning how to validate your models, then understand the concept of model capacity, and finally, experiment with wider and deeper networks.

### Understanding model optimization & Stochastic Gradient Descent

In practice, optimization is a hard problem. The optimal value for any one weight depends on the values of the other weights, and we are optimizing many weights at once. Even if the slope tells us which weights to increase and which to decrease, our updates may not improve the model meaningfully. A very small learning rate can make the updates to the model's weights so small that the model doesn't improve materially. A very large learning rate can take us too far in the direction that seemed good. A smart optimizer like Adam helps, but optimization problems can still occur.

The easiest way to see the effect of different learning rates is to use the simplest optimizer, Stochastic Gradient Descent, sometimes abbreviated to SGD. This optimizer uses a fixed learning rate. Learning rates around 0.01 are common, but you can specify the learning rate you need with the learning_rate argument (called lr in older Keras releases). We create models in a for loop, and each time around we compile the model using SGD with a different learning rate. We pass the optimizer object to the same optimizer argument where we previously passed the string "adam". In an exercise, you will compare the results of training models with low, medium, and high learning rates.
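To build some intuition for the three learning-rate regimes, here is a minimal sketch that runs plain gradient descent on a toy loss, `w**2`, rather than on a real network. The function name `run_gradient_descent`, the starting weight, and the number of steps are illustrative assumptions, not part of the exercise.

```python
def run_gradient_descent(learning_rate, n_steps=100, start=1.0):
    """Minimize the toy loss w**2 with a fixed learning rate; return the final loss."""
    w = start
    for _ in range(n_steps):
        slope = 2 * w                   # derivative of w**2 with respect to w
        w = w - learning_rate * slope   # fixed-size step against the slope
    return w ** 2


# Same three learning rates as the exercise: too small, reasonable, too large
for lr in [0.000001, 0.01, 1.0]:
    print(f"learning rate {lr:g}: final loss {run_gradient_descent(lr):.4f}")
```

With the tiny rate the loss barely moves from its starting value, with 0.01 it drops steadily, and with 1.0 each step overshoots the minimum and lands on the opposite side, so the loss never improves.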

### The dying neuron problem & Vanishing gradients

Even if your learning rate is well tuned, you can run into the so-called "dying-neuron" problem. This problem occurs when a neuron takes a value less than 0 for all rows of your data. With the ReLU activation function, any node with a negative input value produces an output of 0, and it also has a slope of 0. Because the slope is 0, the slopes of all weights flowing into that node are also 0, so those weights don't get updated. In other words, once the node starts always getting negative inputs, it may continue getting only negative inputs. It contributes nothing to the model at this point, hence the claim that the node or neuron is "dead."
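The argument above can be checked numerically on a single ReLU node. This is only a sketch: the data `X`, the weights `w`, the bias `b`, and the upstream slope of 1 are made-up values chosen so that the node input is negative for every row.

```python
import numpy as np


def relu(z):
    """ReLU activation: negative inputs become 0."""
    return np.maximum(z, 0.0)


def relu_slope(z):
    """Slope of ReLU: 1 where the input is positive, 0 elsewhere."""
    return (z > 0).astype(float)


# Hypothetical data: 4 rows, 2 non-negative features
X = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.1], [3.0, 1.0]])
w = np.array([-1.0, -0.5])   # weights flowing into the node
b = -0.1                     # bias of the node

z = X @ w + b                # node input for every row (all negative here)
output = relu(z)             # node output for every row
upstream = np.ones_like(z)   # assume the loss has slope 1 w.r.t. each output
grad_w = X.T @ (upstream * relu_slope(z))  # slope of the loss w.r.t. w

print("node inputs:", z)
print("node outputs:", output)
print("slopes for the incoming weights:", grad_w)
```

Because every slope for the incoming weights is 0, a gradient-based update leaves `w` unchanged, and the node keeps receiving negative inputs on the same data.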

At first, this might suggest using an activation function whose slope is never exactly zero, for example an s-shaped function called tanh. However, values outside the middle of the S are relatively flat, so they have small slopes. A small but non-zero slope might work in a network with only a few hidden layers. But in a deep network, one with many layers, the repeated multiplication of small slopes causes the slopes to get close to 0, which means updates in backprop are close to 0. This is called the vanishing gradient problem. This in turn might suggest using an activation function that isn't even close to flat anywhere. There is research in this area, including variations on ReLU. Those aren't widely used though. For now, it's a phenomenon worth keeping in mind if you are ever pondering why your model isn't training better. If it happens, changing the activation function may be the solution.
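A rough way to see the effect of repeated multiplication is to chain the slope of tanh across several layers. The sketch below deliberately ignores the weights and assumes every layer sees the same input value, so it only illustrates the trend; `tanh_slope` and `chained_slope` are hypothetical helper names.

```python
import math


def tanh_slope(x):
    """Slope of tanh at x: 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2


def chained_slope(x, n_layers):
    """Product of n_layers identical tanh slopes (weights ignored for simplicity)."""
    return tanh_slope(x) ** n_layers


# An input of 2.0 sits on the flat part of the S-curve
for n_layers in [1, 2, 5, 10]:
    print(f"{n_layers:2d} layers: chained slope {chained_slope(2.0, n_layers):.2e}")
```

The slope is largest (exactly 1) at the middle of the S, where the input is 0, and shrinks quickly as inputs move away from it, which is why the product collapses toward 0 as layers are added.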

### Model validation
As you know, your model's performance on the training data is not a good indication of how it will perform on new data. For this reason, we use validation data to test model performance. Validation data is data that is explicitly held out from training, and used only to test model performance. 

You may already be familiar with k-fold cross validation. In practice, few people run k-fold cross validation on deep learning models because deep learning is typically used on large datasets. The computational expense of running k-fold validation would be large, and we usually trust a score from a single validation run because the validation set is reasonably large. Keras makes it easy to use some of your data as validation data: we specify the fraction to hold out with the keyword argument validation_split when calling the fit method.
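As a rough picture of what a validation split does, the sketch below holds out the last fraction of the rows, which is how the Keras documentation describes `validation_split` (the held-out rows are taken from the end of the data, before shuffling). The helper `holdout_split` and the toy rows are assumptions for illustration, not the Keras implementation.

```python
import math


def holdout_split(rows, validation_split):
    """Split rows into (training, validation), holding out the last fraction."""
    split_at = math.floor(len(rows) * (1.0 - validation_split))
    return rows[:split_at], rows[split_at:]


# Toy data: 8 rows identified by their index
rows = list(range(8))
train_rows, val_rows = holdout_split(rows, validation_split=0.25)

print("training rows:  ", train_rows)
print("validation rows:", val_rows)
```

One practical consequence: if the rows are sorted (for example by class or by date), it is worth shuffling them before calling fit, otherwise the held-out rows may not be representative.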

### Early Stopping
We should keep training while the validation score is improving, and stop training when it stops improving. We do this with something called "early stopping." In Keras this is the EarlyStopping callback. It takes an argument called patience, which is how many epochs the model can go without improving before we stop training. 2 or 3 are reasonable values for patience. Sometimes you'll get a single epoch with no improvement, but the model will start improving again after that epoch. But if you see 3 epochs with no improvement, it's unlikely to turn around and start improving again. We pass the callback to the fit method under an argument called callbacks. Note that callbacks takes a list. You may consider adding other callbacks as you become more advanced, but early stopping is all you need for now.

By default, the fit method in tensorflow.keras trains for a single epoch. Now that we have smart logic for determining when to stop, we can set a high maximum number of epochs with the epochs argument. Keras will train until it reaches this number of epochs, unless the validation loss stops improving, in which case it will stop earlier. This is smarter training logic than relying on a fixed number of epochs without looking at the validation scores.
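The patience rule can be sketched in a few lines of plain Python. This is a simplified stand-in for the idea, not the Keras callback itself; the function name and the validation losses below are made up for illustration.

```python
def epochs_run_with_early_stopping(val_losses, patience):
    """Return how many epochs run before the patience rule stops training."""
    best = float("inf")   # best validation loss seen so far
    wait = 0              # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch          # stop early
    return len(val_losses)            # reached the maximum number of epochs


# Hypothetical validation losses, one per epoch
val_losses = [0.9, 0.7, 0.6, 0.61, 0.59, 0.60, 0.62, 0.58]

print(epochs_run_with_early_stopping(val_losses, patience=2))
print(epochs_run_with_early_stopping(val_losses, patience=3))
```

With a patience of 2, training stops after the two consecutive non-improving epochs that follow the loss of 0.59, so the later improvement to 0.58 is never reached. A patience of 3 lets training run through all eight epochs.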

### Experimentation
You can now experiment with different architectures: more layers or fewer layers, layers with more nodes or layers with fewer nodes, and so on. Creating a great model requires some experimentation. Before we finish, I'll give a little bit of insight into how to choose where you experiment. But now that you can get validation scores, you are poised to run those experiments and figure out what works best for your data.

### Changing optimization parameters

In [None]:
# Import the SGD optimizer
from tensorflow.keras.optimizers import SGD

# Create list of learning rates: lr_to_test
lr_to_test = [0.000001, 0.01, 1.0]

# Loop over learning rates
for lr in lr_to_test:
    print('\n\nTesting model with learning rate: %f\n'%lr )
    
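    # Build new model to test, unaffected by previous models
    # (get_new_model is assumed to be a helper defined earlier in the notebook)
    model = get_new_model()
    
    # Create SGD optimizer with specified learning rate: my_optimizer
    # (the argument is named learning_rate; lr is the deprecated older name)
    my_optimizer = SGD(learning_rate=lr)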
    
    # Compile the model
    model.compile(optimizer=my_optimizer, loss="categorical_crossentropy")
    
    # Fit the model
    model.fit(predictors, target)


### Evaluating model accuracy on validation dataset

In [None]:
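# Import the model and layer classes used below
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]
input_shape = (n_cols,)

# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape=input_shape))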
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model
hist = model.fit(predictors, target, validation_split=0.3)

### Early stopping: Optimizing the optimization


In [None]:
# Import EarlyStopping
from tensorflow.keras.callbacks import EarlyStopping

# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]
input_shape = (n_cols,)

# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape=input_shape))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Fit the model
model.fit(predictors, target, epochs=30, validation_split=0.3, callbacks=[early_stopping_monitor])

### Experimenting with wider networks

In [None]:
# Import pyplot for the comparison plot
import matplotlib.pyplot as plt

# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Create the new model: model_2
model_2 = Sequential()

# Add the first and second layers
model_2.add(Dense(100, activation='relu', input_shape=input_shape))
model_2.add(Dense(100, activation='relu'))

# Add the output layer
model_2.add(Dense(2, activation='softmax'))

# Compile model_2
model_2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit model_1 (the baseline model, assumed to be defined and compiled earlier)
model_1_training = model_1.fit(predictors, target, epochs=15, validation_split=0.2, callbacks=[early_stopping_monitor], verbose=False)

# Fit model_2
model_2_training = model_2.fit(predictors, target, epochs=15, validation_split=0.2, callbacks=[early_stopping_monitor], verbose=False)

# Create the plot (red: model_1, blue: model_2)
plt.plot(model_1_training.history['val_loss'], 'r', model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.show()

### Adding layers to a network

In [None]:
# The input shape to use in the first hidden layer
input_shape = (n_cols,)

# Create the new model: model_2
model_2 = Sequential()

# Add the first, second, and third hidden layers
model_2.add(Dense(10, activation='relu', input_shape=input_shape))
model_2.add(Dense(10, activation='relu'))
model_2.add(Dense(10, activation='relu'))

# Add the output layer
model_2.add(Dense(2, activation='softmax'))

# Compile model_2
model_2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit model_1 (the baseline model, assumed to be defined and compiled earlier)
model_1_training = model_1.fit(predictors, target, epochs=15, validation_split=0.4, verbose=False)

# Fit model 2
model_2_training = model_2.fit(predictors, target, epochs=15, validation_split=0.4, verbose=False)

# Create the plot (red: model_1, blue: model_2)
plt.plot(model_1_training.history['val_loss'], 'r', model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.show()

### Thinking about model capacity
At this point, you know how to run experiments and compare different models' performance. However, it takes some practice to get an intuition for what experiments or architectures to try. There is still a little more art to finding good deep learning architectures than there is to tuning other machine learning algorithms. But something called "model capacity" should be one of the key considerations when deciding what models to try. "Model capacity" or "network capacity" is closely related to the terms overfitting and underfitting.

Overfitting is the ability of a model to fit oddities in your training data that are there purely due to happenstance, and that won't apply in a new dataset. When you are overfitting, your model will make accurate predictions on training data, but it will make inaccurate predictions on validation data and new datasets. Underfitting is the opposite. That is when your model fails to find important predictive patterns in the training data. So it is accurate in neither the training data nor validation data. Because we want to do well on new datasets that weren't used for training the model, our validation score is the ultimate measure of a model's predictive quality. 

Let's get back to model capacity. Model capacity is a model's ability to capture predictive patterns in your data. Adding nodes to a layer or adding layers increases capacity. The more capacity a model has, the more closely it can fit the training data, so too little capacity leads to underfitting and too much leads to overfitting. With that in mind, here is a good workflow. Start with a simple network and get the validation score. Then keep adding capacity as long as the score keeps improving. Once it stops improving, you can decrease capacity slightly, but you are probably near the ideal.

You can experiment. But you should generally be thinking about whether you are trying to increase or decrease capacity, ideally homing in on the right capacity by looking at validation scores.
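One rough, imperfect proxy for capacity is the number of trainable parameters in a stack of dense layers. The sketch below counts weights and biases for a few layer layouts similar to the ones tried earlier; the helper name and the choice of 10 input features are assumptions for illustration only.

```python
def count_dense_parameters(n_inputs, hidden_units, n_outputs):
    """Count weights and biases in a fully connected network."""
    layer_sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out   # weight matrix plus one bias per node
    return total


# Hypothetical setup: 10 input features, 2 output classes
narrow = count_dense_parameters(10, [10, 10], 2)
wider = count_dense_parameters(10, [100, 100], 2)
deeper = count_dense_parameters(10, [10, 10, 10], 2)

print("narrow:", narrow)
print("wider: ", wider)
print("deeper:", deeper)
```

Widening the layers grows the parameter count much faster than adding one more narrow layer, which is one reason to change capacity in small steps and check the validation score after each change.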

### Recognizing handwritten digits
The MNIST dataset contains images of handwritten digits. It is a very popular dataset for getting started with images. Each image shows one handwritten digit (0, 1, 2, 3, 4, all the way up to 9) on a 28 pixel by 28 pixel grid. The image is represented by how dark each pixel is: 0 is as light as possible, and 255 is as dark as possible. I've flattened the 28 x 28 grid for you into an array of 784 values for each image. Your model will predict which digit was written. So you will create a deep learning model that takes those 784 features for each image as inputs and predicts a digit from among 10 possible values for the output.
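For readers starting from raw 28 x 28 images, the flattening step and the one-hot encoding that a categorical cross-entropy loss expects might look like the sketch below. The random pixel values, the labels, and the names `flatten_images`, `one_hot`, `X_demo`, and `y_demo` are placeholders, not the course data.

```python
import numpy as np


def flatten_images(images):
    """Turn an array of shape (n, 28, 28) into shape (n, 784)."""
    return images.reshape(len(images), -1)


def one_hot(labels, n_classes=10):
    """Encode integer digit labels as rows with a single 1."""
    encoded = np.zeros((len(labels), n_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


# Placeholder data: 5 random "images" with pixel values from 0 to 255
images = np.random.randint(0, 256, size=(5, 28, 28))
labels = np.array([3, 0, 9, 1, 4])

X_demo = flatten_images(images)
y_demo = one_hot(labels)

print(X_demo.shape)
print(y_demo.shape)
```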

In [None]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50, activation='relu', input_shape=(784,)))

# Add the second hidden layer
model.add(Dense(50, activation='relu'))


# Add the output layer
model.add(Dense(10, activation='softmax'))


# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model (X and y are assumed to hold the flattened images and one-hot encoded labels)
model.fit(X, y, validation_split=0.3, epochs=10)