# Fine-tuning Keras models


In [11]:
import numpy as np
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical
from keras.models import load_model
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping
from sklearn.metrics import mean_squared_error
import pandas as pd
import matplotlib.pyplot as plt  # needed by the plotting cells further down

## Understanding model optimization
Optimization is hard because we are simultaneously optimizing thousands of parameters with complex relationships. Updates may not improve the model meaningfully, for example if the learning rate is too large or too small.
The easiest way to see this is with the stochastic gradient descent (SGD) algorithm, since it uses a fixed learning rate:

In [5]:
df = pd.read_csv('titanic_all_numeric.csv.txt')
predictors = df.drop(['survived'], axis=1).values  # .as_matrix() was removed in pandas 1.0
target = to_categorical(df.survived)
n_cols = predictors.shape[1]
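
The `to_categorical` call above turns the 0/1 labels into one column per class. A minimal NumPy sketch of that encoding, using a made-up label array rather than the Titanic data (the name `one_hot` is illustrative, not part of Keras):

```python
import numpy as np

# Hypothetical class labels standing in for df.survived
labels = np.array([0, 1, 1, 0])
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix by the labels encodes every row at once.
one_hot = np.eye(num_classes)[labels]
print(one_hot)
```

Each row has a single 1 in the column of its class, which is the shape the two-unit softmax output layer expects.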

  


In [10]:
def get_new_model(n_cols):
    model = Sequential()
    model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
    model.add(Dense(100, activation = 'relu'))
    model.add(Dense(2, activation = 'softmax'))
    return model

lr_to_test = [0.000001, 0.01, 1]

for lr in lr_to_test:
    print('\n\nTesting model with learning rate: %f\n'%lr )
    
    # Build new model to test, unaffected by previous models
    model = get_new_model(n_cols)
    
    # Create SGD optimizer with specified learning rate: my_optimizer
    # (recent Keras releases name this argument learning_rate; older ones used lr)
    my_optimizer = SGD(learning_rate=lr)
    
    # Compile the model
    model.compile(optimizer = my_optimizer, loss='categorical_crossentropy')
    
    # Fit the model
    model.fit(predictors, target, epochs=5)
    



Testing model with learning rate: 0.000001

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Testing model with learning rate: 0.010000

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Testing model with learning rate: 1.000000

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
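
The effect of the step size can also be seen without Keras. The following is a toy sketch, not the Titanic model: it assumes a stand-in loss f(w) = w² and runs plain gradient descent with the same three learning rates. The function name `gradient_descent` and the 100-step budget are illustrative choices.

```python
def gradient_descent(lr, steps=100, w=1.0):
    """Minimize f(w) = w**2 with a fixed learning rate and return the final loss."""
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w = w - lr * grad     # fixed-size step, as in plain SGD
    return w ** 2

final_losses = {lr: gradient_descent(lr) for lr in [0.000001, 0.01, 1]}
for lr, loss in final_losses.items():
    print(f"lr={lr:g}: final loss {loss:.4f}")
```

Under these assumptions the smallest rate barely moves the weight, the largest makes it bounce between +1 and -1 without ever reducing the loss, and only the middle one makes real progress.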


Even when the learning rate is well tuned, we can run into the dying neuron problem:
- A neuron takes a value less than 0 for all rows of the data.
- ReLU produces an output of 0 for any negative input, and its slope there is also 0.
- This means the slopes of any weights flowing into that node are also zero.
- Those weights don't get updated!

Once a node starts always getting negative inputs, it may continue getting only negative inputs.

At that point it contributes nothing to the model: it is a dead neuron.
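
The chain of reasoning above can be checked numerically. This is a minimal sketch for a single ReLU neuron; the inputs, weights, and bias are made-up values chosen so that the pre-activation is negative for every row.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_slope(z):
    # Slope of ReLU: 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

# Hypothetical data: 4 rows, 2 non-negative features
X = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.1], [1.2, 0.7]])
w = np.array([-1.0, -2.0])   # assumed weights flowing into the neuron
b = -0.5                     # assumed bias

z = X @ w + b                # pre-activation, negative for every row
output = relu(z)             # so the neuron outputs 0 everywhere
# Gradient of the neuron's output with respect to its weights, per row:
# (ReLU slope) * (input values). It is zero when the slope is zero.
grad_w = relu_slope(z)[:, None] * X

print("pre-activations:", z)
print("outputs:", output)
print("weight gradients:\n", grad_w)
```

Because every gradient entry is zero, no update rule based on these gradients can move the weights, which is why the neuron tends to stay dead.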

A solution might be to use an activation function whose slope never drops to exactly zero, such as tanh(), but this leads to the vanishing gradient problem. It occurs when many layers have very small slopes (because their inputs fall on the flat part of tanh).
This worked fine in small networks, but in deep ones the backpropagation updates were close to 0.
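
A rough way to see the scale of the problem: backpropagation multiplies the slopes of the layers it passes through. The sketch below ignores the weights entirely and assumes every layer sits at the same pre-activation value of 2.0 (an arbitrary point on the flat part of tanh), so it only illustrates the trend.

```python
import numpy as np

def tanh_slope(z):
    # Derivative of tanh(z)
    return 1.0 - np.tanh(z) ** 2

z = 2.0  # assumed pre-activation on the flat part of tanh
gradient_scale = {n_layers: tanh_slope(z) ** n_layers for n_layers in [1, 3, 10]}

for n_layers, scale in gradient_scale.items():
    print(f"{n_layers:2d} layers: gradient scaled by {scale:.2e}")
```

With one layer the slope is small but usable; after ten such layers the product is many orders of magnitude smaller, so the earliest layers receive almost no update.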

There is no perfect answer at the moment.

## Model Validation
Model performance on training data is not a good measure of how it will perform on new data. 

We hold out validation/test data to test performance.

In many machine learning models we use k-fold cross-validation, but in deep learning, where we are often working with large datasets, it's not practical. We'll generally use a single validation split. 

A single validation score is based on a large amount of data, so it is reasonably reliable. 

To split data in Keras:

In [None]:
model.fit(predictors, target, validation_split=0.3)
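
According to the Keras documentation, `validation_split` holds out the last fraction of the rows, taken before any shuffling. The sketch below mimics that behavior with NumPy on a tiny made-up array; `holdout_split` is a hypothetical helper, not a Keras function, and its rounding may differ slightly from what Keras does internally.

```python
import numpy as np

def holdout_split(x, y, validation_split=0.3):
    """Hold out the last fraction of rows, with no shuffling (hypothetical helper)."""
    n_val = int(round(len(x) * validation_split))
    split_at = len(x) - n_val
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Made-up data: 10 rows, 2 features
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

(train_x, train_y), (val_x, val_y) = holdout_split(x, y, validation_split=0.3)
print(len(train_x), "training rows;", len(val_x), "validation rows:", val_y.tolist())
```

One consequence worth keeping in mind: if the rows are sorted in some meaningful order, the held-out tail may not be representative, so shuffling the data beforehand can matter.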

A reasonable way to train a network is to keep going while the validation score is improving and stop when it no longer improves. For this we can use early stopping. 

In [12]:
# Create an early stopping monitor before fitting the model.
# patience = how many epochs training continues without improvement before stopping.
early_stopping_monitor = EarlyStopping(patience=2)
model.fit(predictors, target, validation_split=0.3,
          epochs=20, callbacks=[early_stopping_monitor])
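
To make the role of `patience` concrete, here is a simplified sketch of the stopping rule in plain Python. It is not the Keras implementation (the real callback monitors `val_loss` by default and has further options such as `min_delta`); the function name `epochs_run` and the loss values are made up for illustration.

```python
def epochs_run(val_losses, patience=2):
    """Return how many epochs run before a patience-based rule stops training."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop: no improvement for `patience` epochs
    return len(val_losses)    # never triggered, all epochs ran

# Placeholder validation losses, one per epoch
placeholder_losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.50]
stopped_at = epochs_run(placeholder_losses, patience=2)
print("training stops after epoch", stopped_at)
```

In this made-up sequence the loss would have improved again at epoch 6, which shows the trade-off: a small patience saves time but can stop before a later improvement.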

Now that we have a way to measure performance of our network we can experiment with different architectures, +/- layers, +/- nodes.

Creating a good model requires experimentation!

### Evaluating model accuracy on validation dataset
Now it's your turn to monitor model accuracy with a validation data set. A model definition has been provided as model. Your job is to add the code to compile it and then fit it. You'll check the validation score in each epoch.

In [None]:
# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]
input_shape = (n_cols,)

# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape = input_shape))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss = 'categorical_crossentropy',metrics = ['accuracy'])

# Fit the model
hist = model.fit(predictors, target, validation_split = 0.3)


### Experimenting with wider networks
Now you know everything you need to begin experimenting with different models!

A model called model_1 has been pre-loaded. You can see a summary of this model printed in the IPython Shell. This is a relatively small network, with only 10 units in each hidden layer.

In this exercise you'll create a new model called model_2 which is similar to model_1, except it has 100 units in each hidden layer.

After you create model_2, both models will be fitted, and a graph showing both models' loss scores at each epoch will be shown. We added the argument verbose=False in the fitting commands to print out fewer updates, since you will look at these graphically instead of as text.

Because you are fitting two models, it will take a moment to see the outputs after you hit run, so be patient.

In [None]:
# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Create the new model: model_2
model_2 = Sequential()

# Add the first and second layers
model_2.add(Dense(100, activation ='relu', input_shape=input_shape))
model_2.add(Dense(100, activation ='relu'))

# Add the output layer
model_2.add(Dense(2, activation ='softmax'))

# Compile model_2
model_2.compile(optimizer = 'adam', loss = "categorical_crossentropy", metrics =['accuracy'])

# Fit model_1
model_1_training = model_1.fit(predictors, target, epochs=15, validation_split=0.2, callbacks=[early_stopping_monitor], verbose=False)

# Fit model_2
model_2_training = model_2.fit(predictors, target, epochs=15, validation_split=0.2, callbacks=[early_stopping_monitor], verbose=False)

# Create the plot
plt.plot(model_1_training.history['val_loss'], 'r', model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.show()
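
Besides reading the plot, the two runs can be compared numerically. The object returned by `fit` exposes a `history` dictionary of per-epoch lists; the sketch below works on plain dictionaries with placeholder numbers (not real results), and `best_epoch` is a hypothetical helper.

```python
def best_epoch(history):
    """Return (epoch number, loss) for the lowest validation loss in a history dict."""
    val_loss = history['val_loss']
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return best_index + 1, val_loss[best_index]

# Placeholder histories standing in for model_1_training.history
# and model_2_training.history; the numbers are made up.
history_1 = {'val_loss': [0.70, 0.66, 0.64, 0.65, 0.66]}
history_2 = {'val_loss': [0.68, 0.60, 0.58, 0.61, 0.63]}

for name, history in [('model_1', history_1), ('model_2', history_2)]:
    epoch, loss = best_epoch(history)
    print(f"{name}: best val_loss {loss:.2f} at epoch {epoch}")
```

With real runs you would pass `model_1_training.history` and `model_2_training.history` instead of the placeholders.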


### Adding layers to a network
You've seen how to experiment with wider networks. In this exercise, you'll try a deeper network (more hidden layers).

Once again, you have a baseline model called model_1 as a starting point. It has 1 hidden layer, with 50 units. You can see a summary of that model's structure printed out. You will create a similar network with 3 hidden layers (still keeping 50 units in each layer).

This will again take a moment to fit both models, so you'll need to wait a few seconds to see the results after you run your code.

In [None]:
# The input shape to use in the first hidden layer
input_shape = (n_cols,)

model_2 = Sequential()

# Add the first, second, and third hidden layers
model_2.add(Dense(50, activation ='relu', input_shape=input_shape))
model_2.add(Dense(50, activation ='relu'))
model_2.add(Dense(50, activation ='relu'))
# Add the output layer
model_2.add(Dense(2, activation ='softmax'))

# Compile model_2
model_2.compile(optimizer = 'adam', loss = "categorical_crossentropy", metrics =['accuracy'])
# Fit model 1
model_1_training = model_1.fit(predictors, target, epochs=20, validation_split=0.4, callbacks=[early_stopping_monitor], verbose=False)

# Fit model 2
model_2_training = model_2.fit(predictors, target, epochs=20, validation_split=0.4, callbacks=[early_stopping_monitor], verbose=False)

# Create the plot
plt.plot(model_1_training.history['val_loss'], 'r', model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.show()


## Thinking about model capacity
Model capacity is closely related to overfitting and underfitting. Overfitting is the ability of a model to fit patterns in the training data that are not relevant and that cause errors when the model is applied to the test data. 

Underfitting is the opposite. The model fails to find patterns in the training data and then also fails on the test data. 

Model (network) capacity, or complexity, is related to how many nodes and layers we add to our network.
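
One rough proxy for capacity is the number of trainable parameters. For a stack of fully connected layers, each layer has (inputs + 1) × units parameters, the extra 1 being the bias. The sketch below counts them for architectures like the ones in these notes; the input width of 10 is an assumption, and `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of Dense layers.

    layer_sizes lists the input width followed by the units of each layer.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_inputs = 10  # assumed number of predictor columns

architectures = {
    "1 hidden layer, 50 units": [n_inputs, 50, 2],
    "3 hidden layers, 50 units": [n_inputs, 50, 50, 50, 2],
    "2 hidden layers, 100 units": [n_inputs, 100, 100, 2],
}
for name, sizes in architectures.items():
    print(f"{name}: {dense_param_count(sizes)} parameters")
```

Both widening and deepening the network raise the count, which is one reason either change can push a model from underfitting toward overfitting.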

![](cap.PNG)

That's why validation scores are so important. We start with a small network and get its validation score. Then we increase capacity until the validation score stops improving. 
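
The search procedure described above can be sketched as a simple loop. Here the "validation score" is a loss (lower is better) looked up from a table of placeholder values; in practice it would come from fitting a model of each size. The names `pick_capacity` and `placeholder_losses` are illustrative.

```python
def pick_capacity(candidates, validation_loss_for):
    """Grow capacity step by step and stop once validation loss stops improving."""
    best_units, best_loss = None, float("inf")
    for units in candidates:
        loss = validation_loss_for(units)
        if loss >= best_loss:
            break  # no improvement: keep the previous, smaller network
        best_units, best_loss = units, loss
    return best_units, best_loss

# Made-up validation losses per hidden-layer width (not real results)
placeholder_losses = {10: 0.60, 50: 0.52, 100: 0.49, 200: 0.51, 400: 0.55}

chosen = pick_capacity(sorted(placeholder_losses), placeholder_losses.get)
print("chosen units:", chosen[0], "validation loss:", chosen[1])
```

Stopping at the first non-improvement keeps the search cheap, at the cost of possibly missing a better, larger configuration; validation scores are also noisy, so small differences should not be over-interpreted.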

## Stepping up to images
Recognizing handwritten digits with the MNIST dataset.

Each image is a 28x28 pixel grid.
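
Dense layers expect one flat row of numbers per example, so each 28x28 grid is typically unrolled into a vector of 784 values. A minimal sketch of that step, using random integers as a stand-in for real MNIST pixels (scaling to the 0-1 range is a common extra step, not something these notes prescribe):

```python
import numpy as np

# Synthetic stand-in for 5 grayscale images with pixel values 0-255
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28))

# Flatten each 28x28 grid into a row of 784 values and scale to [0, 1]
X = images.reshape(len(images), -1).astype("float32") / 255.0
print(X.shape)
```

With this layout, the `input_shape` for the first Dense layer would be `(784,)`.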

Deep learning with ipynb in the cloud:
https://www.datacamp.com/community/tutorials/deep-learning-jupyter-aws

### Building your own digit recognition model
You've reached the final exercise of the course - you now know everything you need to build an accurate model to recognize handwritten digits!

We've already done the basic manipulation of the MNIST dataset shown in the video, so you have X and y loaded and ready to model with. Sequential and Dense from keras are also pre-imported.

To add an extra challenge, we've loaded only 2500 images, rather than the 60000 you will see in some published results. Deep learning models perform better with more data, but they also take longer to train, especially as they become more complex.

If you have a computer with a CUDA compatible GPU, you can take advantage of it to improve computation time. If you don't have a GPU, no problem! You can set up a deep learning environment in the cloud that can run your models on a GPU. Here is a blog post by Dan that explains how to do this - check it out after completing this exercise! It is a great next step as you continue your deep learning journey.