# Improving Your Model Performance

In the previous chapters, you've trained a lot of models! You will now learn how to interpret learning curves to understand your models as they train. You will also visualize the effects of activation functions, batch sizes, and batch normalization. Finally, you will learn how to perform automatic hyperparameter optimization on your Keras models using sklearn.

# (1) Learning curves

<img src="image/Screenshot 2021-01-29 142344.png">
<img src="image/Screenshot 2021-01-29 142401.png">
<img src="image/Screenshot 2021-01-29 142416.png">
<img src="image/Screenshot 2021-01-29 142445.png">
<img src="image/Screenshot 2021-01-29 142506.png">
<img src="image/Screenshot 2021-01-29 142536.png">

## Learning curves

```
# Store initial model weights
init_weights = model.get_weights()
# Lists for storing accuracies
train_accs = []
test_accs = []
```

```
# Imports needed by the loop below
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping

for train_size in train_sizes:
    # Split a fraction according to train_size
    X_train_frac, _, y_train_frac, _ = train_test_split(X_train, y_train, train_size=train_size)
    # Set model initial weights
    model.set_weights(init_weights)
    # Fit model on the training set fraction
    model.fit(X_train_frac, y_train_frac, epochs=100, verbose=0, callbacks=[EarlyStopping(monitor='loss', patience=1)])
    # Get the accuracy for this training set fraction
    train_acc = model.evaluate(X_train_frac, y_train_frac, verbose=0)[1]
    train_accs.append(train_acc)
    # Get the accuracy on the whole test set
    test_acc = model.evaluate(X_test, y_test, verbose=0)[1]
    test_accs.append(test_acc)
    print("Done with size: ", train_size)
```

# Exercise I: Learning the digits

You're going to build a model on the **digits dataset**, a sample dataset that comes pre-loaded with scikit-learn. The **digits dataset** consists of **8x8 pixel handwritten digits from 0 to 9**:

<img src="image/digits_dataset_sample.png">

You want to distinguish between each of the 10 possible digits given an image, so we are dealing with multi-class classification.
The dataset has already been partitioned into `X_train`, `y_train`, `X_test`, and `y_test`, using 30% of the data as testing data. The labels are already one-hot encoded vectors, so you don't need to use the Keras `to_categorical()` function.
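
To make "one-hot encoded" concrete, here is a minimal pure-Python sketch of the kind of vector each label is stored as. The helper name `one_hot` is hypothetical; it is not part of Keras or of this exercise, and it only illustrates the idea behind `to_categorical()`.

```python
def one_hot(label, num_classes=10):
    """Return a list with a 1.0 at position `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

# The digit 3 becomes a vector of length 10 with a single 1.0 at index 3
encoded = one_hot(3)
print(encoded)
```

Each label vector has one entry per digit, which is why the output layer of the model needs 10 neurons.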

Let's build this new `model`!

### Instructions

- Add a `Dense` layer of 16 neurons with `relu` activation and an `input_shape` that takes the total number of pixels of the 8x8 digit image.
- Add a `Dense` layer with 10 outputs and `softmax` activation.
- Compile your model with `adam`, `categorical_crossentropy`, and `accuracy` metrics.
- Make sure your model works by predicting on `X_train`.
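
As a rough illustration of what `softmax` and `categorical_crossentropy` compute for a single image, here is a pure-Python sketch. The function names and the example scores are assumptions made for illustration; Keras has its own implementations, which this sketch only mirrors conceptually.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# Ten equal scores give a uniform distribution over the ten digits
probs = softmax([0.0] * 10)
label = [0.0] * 10
label[7] = 1.0  # one-hot label for the digit 7
loss = categorical_crossentropy(label, probs)
print(probs[7], loss)
```

With a one-hot label, only the probability assigned to the true digit contributes to the loss, so the loss shrinks as that probability approaches 1.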

In [None]:
# Instantiate a Sequential model
model = Sequential()

# Input and hidden layer with input_shape, 16 neurons, and relu 
model.add(Dense(16, input_shape = (64,), activation = 'relu'))

# Output layer with 10 neurons (one per digit) and softmax
model.add(Dense(10, activation='softmax'))

# Compile your model
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

# Test if your model is well assembled by predicting before training
print(model.predict(X_train))

# Exercise II: Is the model overfitting?

Let's train the `model` you just built and plot its learning curve to check whether it's overfitting! You can use the pre-loaded function `plot_loss()` to plot training loss against validation loss; both are available from the history callback.

If you want to inspect the `plot_loss()` function code, paste this in the console: `show_code(plot_loss)`

### Instructions

- Train your model for 60 `epochs`, using `X_test` and `y_test` as validation data.
- Use `plot_loss()` passing `loss` and `val_loss` as extracted from the history attribute of the `h_callback` object.

In [None]:
# Train your model for 60 epochs, using X_test and y_test as validation data
h_callback = model.fit(X_train, y_train, epochs = 60, validation_data = (X_test, y_test), verbose=0)

# Extract from the h_callback object loss and val_loss to plot the learning curve
plot_loss(h_callback.history['loss'], h_callback.history['val_loss'])

## Question

Just by looking at the picture, do you think the learning curve shows this model is overfitting after having trained for 60 epochs?

### Possible Answers

- Yes, it started to overfit since the test loss is higher than the training loss.

- No, the test loss is not getting higher as the epochs go by. (T)
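
The criterion in the correct answer (validation loss rising as the epochs go by) can also be checked numerically. The sketch below is one simple heuristic, not a standard API: the function name `validation_loss_is_rising`, the `tolerance` parameter, and the loss values are all made up for illustration.

```python
def validation_loss_is_rising(val_loss, tolerance=0.0):
    """Return True if the final validation loss is above its earlier minimum."""
    best = min(val_loss)
    return val_loss[-1] > best + tolerance

# Made-up loss values for illustration only
still_improving = [0.9, 0.6, 0.4, 0.3, 0.25]
overfitting = [0.9, 0.5, 0.3, 0.35, 0.45]

print(validation_loss_is_rising(still_improving))
print(validation_loss_is_rising(overfitting))
```

On a real run you could pass `h_callback.history['val_loss']` to the function; a small positive `tolerance` helps avoid flagging ordinary epoch-to-epoch noise.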

# Exercise III: Do we need more data?

It's time to check whether the **digits dataset** `model` you built benefits from more training examples!

In order to keep code to a minimum, various things are already initialized and ready to use:

- The `model` you just built.
- `X_train`, `y_train`, `X_test`, and `y_test`.
- The `initial_weights` of your model, saved after using `model.get_weights()`.
- A pre-defined list of training sizes: `training_sizes`.
- A pre-defined early stopping callback monitoring loss: `early_stop`.
- Two empty lists to store the evaluation results: `train_accs` and `test_accs`.

Train your model on the different training sizes and evaluate the results on `X_test`. End by plotting the results with `plot_results()`.

The full code for this exercise can be found on the slides!

### Instructions

- Get a fraction of the training data determined by the `size` we are currently evaluating in the loop.
- Set the model weights to the `initial_weights` with `set_weights()` and train your model on the fraction of training data using `early_stop` as a callback.
- Evaluate and store the accuracy for the training fraction and the test set.
- Call `plot_results()` passing in the training and test accuracies for each training size.

In [None]:
for size in training_sizes:
    # Get a fraction of training data (we only care about the training data)
    X_train_frac, y_train_frac = X_train[:size], y_train[:size]

    # Reset the model to the initial weights and train it on the new training data fraction
    model.set_weights(initial_weights)
    model.fit(X_train_frac, y_train_frac, epochs = 50, callbacks = [early_stop])

    # Evaluate and store both: the training data fraction and the complete test set results
    train_accs.append(model.evaluate(X_train_frac, y_train_frac)[1])
    test_accs.append(model.evaluate(X_test, y_test)[1])
    
# Plot train vs test accuracies
plot_results(train_accs, test_accs)

# (2) Activation functions

<img src="image/Screenshot 2021-01-29 153515.png">

## Sigmoid & Tanh function
<img src="image/Screenshot 2021-01-29 154557.png">

## ReLU & Leaky ReLU
<img src="image/Screenshot 2021-01-29 154805.png">

## Effects of activation functions

<img src="image/Screenshot 2021-01-29 154916.png">

## Effects of Sigmoid & Tanh
<img src="image/Screenshot 2021-01-29 154938.png">

## Effects of ReLU & Leaky ReLU
<img src="image/Screenshot 2021-01-29 155020.png">

## Which activation function to use?
- No magic formula
- Different properties
- Depends on our problem
- Goal to achieve in a given layer
- ReLU is a good first choice
- Sigmoid not recommended for deep models

## Comparing activation functions

```
# Set a random seed
np.random.seed(1)
# Return a new model with given activation
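def get_model(act_function):
    model = Sequential()
    model.add(Dense(4, input_shape=(2,), activation=act_function))
    model.add(Dense(1, activation='sigmoid'))
    # Compile so the model can be trained (optimizer and loss are assumed; the slide omits this step)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model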
```

## Comparing activation functions

```
# Activation functions to try out
activations = ['relu', 'sigmoid', 'tanh']

# Dictionary to store results
activation_results = {}
for funct in activations:
    model = get_model(act_function=funct)
    history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)
    activation_results[funct] = history
```

```
import pandas as pd

# Extract val_loss history of each activation function
val_loss_per_funct = {k: v.history['val_loss'] for k, v in activation_results.items()}

# Turn the dictionary into a pandas dataframe
val_loss_curves = pd.DataFrame(val_loss_per_funct)

# Plot the curves
val_loss_curves.plot(title='Loss per Activation function')
```

# Exercise IV: Different activation functions

The `sigmoid()`, `tanh()`, `ReLU()`, and `leaky_ReLU()` functions have been defined and are ready for you to use. Each function receives an input number X and returns its corresponding Y value.

Which of the statements below is **false**?

### Possible Answers

- The `sigmoid()` takes a value of 0.5 when X = 0 whilst `tanh()` takes a value of 0.

- The `leaky_ReLU()` takes a value of -0.01 when X = -1 whilst `ReLU()` takes a value of 0.

- The `sigmoid()` and `tanh()` both take values close to -1 for big negative numbers. (T)
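
The exercise's functions are pre-loaded and their code is not shown here, so the following is only a plausible pure-Python stand-in for checking the statements above. The lowercase names and the leaky slope `alpha=0.01` are assumptions, with the slope inferred from the value -0.01 at X = -1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the slope for negative inputs (assumed to be 0.01 here)
    return x if x > 0 else alpha * x

# Compare the four functions at a few inputs
for x in (-10.0, -1.0, 0.0, 1.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```

For large negative inputs `sigmoid` approaches 0 while `tanh` approaches -1, which is why the last statement is the false one.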

# Exercise V: Comparing activation functions

Comparing activation functions involves a bit of coding, but nothing you can't do!

You will try out different activation functions on the multi-label model you built for your farm irrigation machine in chapter 2. The function `get_model('relu')` returns a copy of this model and applies the `'relu'` activation function to its hidden layer.

You will loop through several activation functions, generate a new model for each and train it. By storing the history callback in a dictionary you will be able to visualize which activation function performed best in the next exercise!

`X_train`, `y_train`, `X_test`, `y_test` are ready for you to use when training your models.

### Instructions

- Fill up the activation functions array with `relu`, `leaky_relu`, `sigmoid`, and `tanh`.
- Get a new model for each iteration with `get_model()` passing the current activation function as a parameter.
- Fit your model providing the train and `validation_data`, use 20 `epochs` and set verbose to 0.


In [None]:
# Activation functions to try
activations = ['relu', 'leaky_relu', 'sigmoid', 'tanh']

# Loop over the activation functions
activation_results = {}

for act in activations:
  # Get a new model with the current activation
  model = get_model(act)
  # Fit the model and store the history results
  h_callback = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, verbose=0)
  activation_results[act] = h_callback

# Exercise VI: Comparing activation functions II

What you coded in the previous exercise has been executed to obtain the `activation_results` variable, this time using 100 epochs instead of 20. This way you will have more epochs to further compare how the training evolves per activation function.

For every `h_callback` of each activation function in `activation_results`:

- The `h_callback.history['val_loss']` has been extracted.
- The `h_callback.history['val_acc']` has been extracted.

Both are saved into two dictionaries: `val_loss_per_function` and `val_acc_per_function`.

Pandas is also loaded as `pd` for you to use. Let's plot some quick validation loss and accuracy charts!

### Instructions

- Use `pd.DataFrame()` to create a new DataFrame from the `val_loss_per_function` dictionary.
- Call `plot()` on the DataFrame.
- Create another pandas DataFrame from `val_acc_per_function`.
- Once again, plot the DataFrame.

In [None]:
# Create a dataframe from val_loss_per_function
val_loss = pd.DataFrame(val_loss_per_function)

# Call plot on the dataframe
val_loss.plot()
plt.show()

# Create a dataframe from val_acc_per_function
val_acc = pd.DataFrame(val_acc_per_function)

# Call plot on the dataframe
val_acc.plot()
plt.show()

## Validation Loss per function
<img src="image/2021-29-01 163353.svg">

## Validation Accuracy per function
<img src="image/2021-29-01 163523.svg">

# (3) Batch size and batch normalization

<img src="image/Screenshot 2021-01-29 164230.png">

<img src="image/Screenshot 2021-01-29 164321.png">

## Mini-batches

**Advantages**
- Networks train faster (more weight updates in the same amount of time)
- Less RAM required, so you can train on huge datasets
- Noise can help networks reach a lower error, escaping local minima

**Disadvantages**
- More iterations need to be run
- Batch size needs to be adjusted; we need to find a good value

<img src="image/Screenshot 2021-01-29 164644.png">

## Batch size in Keras

$\text{Standardization} = \frac{\text{data} - \text{mean}}{\text{standard deviation}}$
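
The formula can be tried out on a small list of numbers. This is a minimal pure-Python sketch with made-up values; the helper name `standardize` is hypothetical, and it uses the population standard deviation of the values it is given.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

batch = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up values for one feature
scaled = standardize(batch)
print(scaled)
```

After standardization the values have mean 0 and standard deviation 1. The sketch would divide by zero if all values were identical, which is one reason practical implementations add a small constant to the denominator.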

<img src="image/Screenshot 2021-01-29 164846.png">
<img src="image/Screenshot 2021-01-29 164903.png">

## Batch normalization advantages
- Improves gradient flow
- Allows higher learning rates
- Reduces dependence on weight initializations
- Acts as an unintended form of regularization
- Limits internal covariate shift

## Batch normalization in Keras

```
# Import BatchNormalization from keras layers
from keras.layers import BatchNormalization
# Instantiate a Sequential model
model = Sequential()
# Add an input layer
model.add(Dense(3, input_shape=(2,), activation='relu'))
# Add batch normalization for the outputs of the layer above
model.add(BatchNormalization())
# Add an output layer
model.add(Dense(1, activation='sigmoid'))
```

# Exercise VII: Changing batch sizes

You've seen models are usually trained in batches of a fixed size. The smaller the batch size, the more weight updates per epoch, but at the cost of a more unstable gradient descent, especially if the batch size is too small to be representative of the entire training set.

Let's see how different batch sizes affect the accuracy of a simple binary classification model that separates red from blue dots.

You'll use a batch size of one, updating the weights once per sample in your training set for each epoch. Then you will use the entire dataset, updating the weights only once per epoch.
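
The relationship between batch size and the number of weight updates per epoch can be sketched in a few lines. The training-set size of 700 is an assumption for illustration, and `updates_per_epoch` is a hypothetical helper name.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the training data."""
    return math.ceil(n_samples / batch_size)

n_samples = 700  # assumed training-set size, for illustration
for batch_size in (1, 32, n_samples):
    print(batch_size, updates_per_epoch(n_samples, batch_size))
```

A batch size of 1 gives one update per sample, while a batch as large as the training set gives a single update per epoch; anything in between trades update frequency against gradient stability.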

### Instructions 1/2

- Use `get_model()` to get a new, already compiled, model, then train your model for 5 `epochs` with a `batch_size` of 1.

In [None]:
# Get a fresh new model with get_model
model = get_model()

# Train your model for 5 epochs with a batch size of 1
model.fit(X_train, y_train, epochs=5, batch_size=1)
print("\n The accuracy when using a batch of size 1 is: ", model.evaluate(X_test, y_test)[1])

### Instructions 2/2

- Now train a new model with `batch_size` equal to the size of the training set.

In [None]:
model = get_model()

# Fit your model for 5 epochs with a batch the size of the training set
model.fit(X_train, y_train, epochs=5, batch_size=700)
print("\n The accuracy when using the whole training set as batch-size was: ", model.evaluate(X_test, y_test)[1])

# Exercise VIII: Batch normalizing a familiar model

Remember the **digits dataset** you trained in the first exercise of this chapter?

<img src="image/digits_dataset_sample.png">

A multi-class classification problem that you solved using `softmax` and 10 neurons in your output layer.
You will now build a new deeper model consisting of 3 hidden layers of 50 neurons each, using batch normalization in between layers. The `kernel_initializer` parameter is used to initialize weights in a similar way.

### Instructions

- Import `BatchNormalization` from keras layers.
- Build your deep network model, use **50 neurons for each hidden layer** adding batch normalization in between layers.
- Compile your model with stochastic gradient descent, `sgd`, as an optimizer.

In [None]:
# Import batch normalization from keras layers
from keras.layers import BatchNormalization

# Build your deep network
batchnorm_model = Sequential()
batchnorm_model.add(Dense(50, input_shape=(64,), activation='relu', kernel_initializer='normal'))
batchnorm_model.add(BatchNormalization())
batchnorm_model.add(Dense(50, activation='relu', kernel_initializer='normal'))
batchnorm_model.add(BatchNormalization())
batchnorm_model.add(Dense(50, activation='relu', kernel_initializer='normal'))
batchnorm_model.add(BatchNormalization())
batchnorm_model.add(Dense(10, activation='softmax', kernel_initializer='normal'))

# Compile your model with sgd
batchnorm_model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

# Exercise IX: Batch normalization effects

Batch normalization tends to increase the learning speed of our models and make their learning curves more stable. Let's see how two identical models with and without batch normalization compare.

The model you just built, `batchnorm_model`, is loaded for you to use. An exact copy of it without batch normalization, `standard_model`, is available as well. You can check their `summary()` in the console. `X_train`, `y_train`, `X_test`, and `y_test` are also loaded so that you can train both models.

You will compare the accuracy learning curves for both models plotting them with `compare_histories_acc()`.

You can check the function pasting `show_code(compare_histories_acc)` in the console.

### Instructions

- Train the `standard_model` for 10 epochs passing in train and `validation data`, storing its history in `h1_callback`.
- Train your `batchnorm_model` for 10 epochs passing in train and `validation data`, storing its history in `h2_callback`.
- Call `compare_histories_acc` passing in `h1_callback` and `h2_callback`.

In [None]:
# Train your standard model, storing its history callback
h1_callback = standard_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, verbose=0)

# Train the batch normalized model you recently built, store its history callback
h2_callback = batchnorm_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, verbose=0)

# Call compare_histories_acc passing in both model histories
compare_histories_acc(h1_callback, h2_callback)

## Batch Normalization Effects

<img src="image/2021-29-01 175735.svg">

# (4) Hyperparameter tuning

## Neural network hyperparameters
- Number of layers
- Number of neurons per layer
- Layer order
- Layer activations
- Batch sizes
- Learning rates
- Optimizers
- ...

## Sklearn recap
```
# Import RandomizedSearchCV and the classifier to tune
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
# Instantiate your classifier
tree = DecisionTreeClassifier()
# Define a series of parameters to look over
params = {'max_depth': [3, None], 'max_features': range(1, 4), 'min_samples_leaf': range(1, 4)}
# Perform random search with cross validation
tree_cv = RandomizedSearchCV(tree, params, cv=5)
# Fit the search to the data (X and y are assumed to be loaded)
tree_cv.fit(X, y)
# Print the best parameters
print(tree_cv.best_params_)
```

## Turn a Keras model into a Sklearn estimator

```
# Function that creates our Keras model
def create_model(optimizer='adam', activation='relu'):
    model = Sequential()
    model.add(Dense(16, input_shape=(2,), activation=activation))
    model.add(Dense(1, activation='sigmoid'))
    # Track accuracy so the sklearn wrapper can score the model
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Import sklearn wrapper from keras
from keras.wrappers.scikit_learn import KerasClassifier

# Create a model as a sklearn estimator
model = KerasClassifier(build_fn=create_model, epochs=6, batch_size=16)
```

## Cross-validation
```
# Import cross_val_score
from sklearn.model_selection import cross_val_score

# Check how your keras model performs with 5 fold crossvalidation
kfold = cross_val_score(model, X, y, cv=5)

# Print the mean accuracy across folds
kfold.mean()
```

```
# Print the standard deviation across folds
kfold.std()
```
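
To show what the `cv` argument is doing conceptually, here is a simplified pure-Python sketch of a k-fold split. The function `kfold_indices` is hypothetical and only produces contiguous, unshuffled folds; scikit-learn's own splitters may additionally shuffle or stratify the data.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # Spread the remainder over the first folds
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(kfold_indices(n_samples=10, n_splits=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample is used for testing exactly once, and the model is retrained from scratch for each fold, which is why the mean and standard deviation across folds give a more reliable picture than a single split.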

## Tips for neural networks hyperparameter tuning
- Random search is preferred over grid search
- Don't use many epochs
- Use a smaller sample of your dataset
- Play with batch size, activations, optimizers and learning rates
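
The difference between grid search and random search can be sketched without any machine learning library. The search space below is hypothetical, and the sampling is only a conceptual stand-in for what a randomized search does, not scikit-learn's implementation.

```python
import itertools
import random

# Hypothetical search space, for illustration only
param_grid = {
    "optimizer": ["sgd", "adam"],
    "batch_size": [5, 10, 20],
    "activation": ["relu", "tanh"],
}

# Grid search would try every combination
names = list(param_grid)
all_combinations = [dict(zip(names, values))
                    for values in itertools.product(*param_grid.values())]

# Random search tries only a fixed number of them
rng = random.Random(0)  # seeded so the sketch is repeatable
n_iter = 5
sampled = rng.sample(all_combinations, n_iter)

print(len(all_combinations), "combinations in the full grid")
for combo in sampled:
    print(combo)
```

Each combination means training a network once per cross-validation fold, so capping the number of combinations tried keeps the cost of the search under control.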

## Random search on Keras model

```
# Define a series of parameters
params = dict(optimizer=['sgd', 'adam'], epochs=[3], batch_size=[5, 10, 20], activation=['relu', 'tanh'])

# Create a random search cv object and fit it to the data
random_search = RandomizedSearchCV(model, param_distributions=params, cv=3)
random_search_results = random_search.fit(X, y)
# Print results
print("Best: {:f} using {}".format(random_search_results.best_score_, random_search_results.best_params_))
```

## Tuning other hyperparameters
```
def create_model(nl=1, nn=256):
    model = Sequential()
    model.add(Dense(16, input_shape=(2,), activation='relu'))
    # Add as many hidden layers as specified in nl
    for i in range(nl):
        model.add(Dense(nn, activation='relu'))
    # End defining and compiling your model...
```

```
# Define parameters, named just like in create_model()
params = dict(nl=[1, 2, 9], nn=[128,256, 1000])

# Repeat the random search...

# Print results...
```

# Exercise X: Preparing a model for tuning

Let's tune the hyperparameters of a **binary classification** model that does well classifying the **breast cancer dataset**.

You've seen that the first step to turn a model into a sklearn estimator is to build a function that creates it. The definition of this function is important since hyperparameter tuning is carried out by varying the arguments your function receives.

Build a simple `create_model()` function that receives both a learning rate and an activation function as arguments. The `Adam` optimizer has been imported as an object from `keras.optimizers` so that you can also change its learning rate parameter.

### Instructions

- Set the learning rate of the `Adam` optimizer object to the one passed in the arguments.
- Set the hidden layers activations to the one passed in the arguments.
- Pass the optimizer and the binary cross-entropy loss to the `.compile()` method.

In [None]:
# Creates a model given an activation and learning rate
def create_model(learning_rate, activation):

    # Create an Adam optimizer with the given learning rate
    opt = Adam(lr = learning_rate)

    # Create your binary classification model
    model = Sequential()
    model.add(Dense(128, input_shape = (30,), activation = activation))
    model.add(Dense(256, activation = activation))
    model.add(Dense(1, activation = 'sigmoid'))

    # Compile your model with your optimizer, loss, and metrics
    model.compile(optimizer = opt, loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model

# Exercise XI: Tuning the model parameters

It's time to try out different parameters on your model and see how well it performs!

The `create_model()` function you built in the previous exercise is ready for you to use.

Since fitting the `RandomizedSearchCV` object would take too long, the results you'd get are printed in the `show_results()` function. You could try `random_search.fit(X,y)` in the console yourself to check that it works after you have built everything else, but you will probably time out the exercise (so copy your code first if you try this, or you may lose your progress!).

You don't need to use the optional `epochs` and `batch_size` parameters when building your `KerasClassifier` object since you are passing them as `params` to the random search and this works already.

### Instructions

- Import `KerasClassifier` from keras `scikit_learn` wrappers.
- Use your `create_model` function when instantiating your `KerasClassifier`.
- Set `'relu'` and `'tanh'` as `activation`, 32, 128, and 256 as `batch_size`, 50, 100, and 200 `epochs`, and `learning_rate` of 0.1, 0.01, and 0.001.
- Pass your converted `model` and the chosen `params` as you build your `RandomizedSearchCV` object.


In [None]:
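# Import KerasClassifier from keras scikit learn wrappers
from keras.wrappers.scikit_learn import KerasClassifier
# Import the search and fold classes used below
from sklearn.model_selection import RandomizedSearchCV, KFold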

# Create a KerasClassifier
model = KerasClassifier(build_fn = create_model)

# Define the parameters to try out
params = {'activation': ['relu', 'tanh'], 'batch_size': [32, 128, 256], 'epochs': [50, 100, 200], 'learning_rate': [0.1, 0.01, 0.001]}

# Create a randomized search cv object passing in the parameters to try
random_search = RandomizedSearchCV(model, param_distributions = params, cv = KFold(3))

# Running random_search.fit(X,y) would start the search, but it takes too long!
show_results()

# Exercise XII: Training with cross-validation

Time to train your model with the best parameters found: a **learning rate of 0.001, 50 epochs, a batch size of 128**, and **relu activations**.

The `create_model()` function from the previous exercise is ready for you to use. `X` and `y` are loaded as features and labels.

Use the best values found for your model when creating your `KerasClassifier` object so that they are used when performing cross-validation.

End this chapter by training an awesome tuned model on the **breast cancer dataset**!

### Instructions

- Import `KerasClassifier` from keras `scikit_learn` wrappers.
- Create a `KerasClassifier` object providing the best parameters found.
- Pass your `model`, features and labels to `cross_val_score` to perform cross-validation with 3 folds.

In [None]:
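# Import KerasClassifier from keras wrappers
from keras.wrappers.scikit_learn import KerasClassifier
# Import cross_val_score to run the cross-validation
from sklearn.model_selection import cross_val_score

# Create a KerasClassifier, passing the function itself plus the best parameters found
model = KerasClassifier(build_fn = create_model, learning_rate = 0.001, activation = 'relu',
             epochs = 50, batch_size = 128, verbose = 0)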

# Calculate the accuracy score for each fold
kfolds = cross_val_score(model, X, y, cv = 3)

# Print the mean accuracy
print('The mean accuracy was:', kfolds.mean())

# Print the accuracy standard deviation
print('With a standard deviation of:', kfolds.std())