# Exercise - learning rate and batch size

1. Consider the last exercise (i.e. the MNIST data). Suppose you are restricted to **training for only 2 epochs** but still want a good model. You recognize that finding the right learning rate is going to be very important. For this reason, you split your training data into a train and a validation set and use the validation set to find the optimal learning rate. Train a model with your optimized learning rate and evaluate it on your test data.
1. Recognizing that the batch size is also important for training speed, you decide to extend your above analysis to also find the optimal batch size. Once again, train a model with your optimized learning rate *and* batch size and evaluate it on your test data.
1. You have heard that momentum is important. You know that many optimizers already incorporate momentum by default, but you are now forced by your evil teacher to use SGD and otherwise repeat (1) and (2). You decide to extend your above analysis to also find the optimal momentum for SGD (see https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD for how to set it). Once again, train a model with your optimized learning rate, batch size, *and* momentum and evaluate it on your test data.

**See slides for more details!**

# Exercise 1

Consider the last exercise (i.e. the MNIST data). Suppose you are restricted to **training for only 2 epochs** but still want a good model. You recognize that finding the right learning rate is going to be very important. For this reason, you split your training data into a train and a validation set and use the validation set to find the optimal learning rate. Train a model with your optimized learning rate and evaluate it on your test data.
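As a rough illustration of the train/validation split described above, here is a minimal sketch on toy data. The array names (`x_toy`, `x_tr`, ...), the 20% validation fraction, and the random seed are assumptions chosen for the example, not values prescribed by the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for MNIST: 100 random "images" of shape (28, 28) with labels 0-9.
rng = np.random.default_rng(0)
x_toy = rng.random((100, 28, 28))
y_toy = rng.integers(0, 10, size=100)

# Hold out 20% of the samples for validation (the fraction is an assumption).
x_tr, x_va, y_tr, y_va = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42
)

print(x_tr.shape, y_tr.shape, x_va.shape, y_va.shape)
# (80, 28, 28) (80,) (20, 28, 28) (20,)
```

The test data is left out of this step entirely, so it can still serve as an unbiased final check.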

In [None]:
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# A simple way to scale data. Remember pixels are in [0, 255] so this ensures [0, 1]
x_train = x_train / 255
x_test = x_test / 255

# Split train data to also get val data
x_train, x_val, y_train, y_val = ??  # hint: use train_test_split (imported above)

print(x_train.shape, y_train.shape, x_val.shape, y_val.shape, x_test.shape, y_test.shape)

Here is (parts of) a model to get you started.

It is very helpful to wrap it inside a function since you want to call it multiple times in a loop.

Take note of the "Flatten" layer. It reshapes each sample from (28, 28) to (784,), so that every image is passed to the Dense layers as a single vector.

Alternatively, you could reshape your data (the x's). This can be done using:

$\texttt{x = x.reshape(n, 784)}$ 

where $n$ is the number of samples (60k for training, 10k for test).

Then you don't need the Flatten layer, but remember to still specify an input shape for your first layer (i.e. 784 if you have done this reshaping).
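The reshaping alternative can be checked quickly on a small dummy array. This is only a sketch: the array `x` and the sample count `n = 5` are placeholders standing in for the real MNIST arrays.

```python
import numpy as np

# Dummy batch of n "images", each of shape (28, 28).
n = 5
x = np.zeros((n, 28, 28))

# Explicit reshape, as in the text above.
x_flat = x.reshape(n, 784)

# Equivalent form that lets NumPy infer the second dimension.
x_flat_inferred = x.reshape(len(x), -1)

print(x_flat.shape, x_flat_inferred.shape)
# (5, 784) (5, 784)
```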

**Note**: Do feel free to experiment with the number of layers, nodes per layer, and optimizer.

In [None]:
def build_model(learning_rate):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(??, activation=??),
        tf.keras.layers.Dense(??, activation=??),
        tf.keras.layers.Dense(??, activation=??),
    ])
    optimizer = tf.keras.optimizers.??(learning_rate=learning_rate)  # `lr` is a deprecated alias
    model.compile(
        loss=??,
        optimizer=optimizer,
        metrics=['accuracy'],
    )
    
    return model

Let us look at a single run.

In [None]:
model = build_model(??) # insert desired learning rate
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2)
model.evaluate(x_test, y_test)

Now for the optimization.

In [None]:
learning_rates = [] # must be positive floats. Default depends on optimizer

results = []

for learning_rate in learning_rates:
    model = build_model(learning_rate)
    model.fit(??) # you may want to use verbose=0 here to not get spammed. Also remember to use epochs=2
    loss, acc = model.evaluate(??)
    results.append((acc, learning_rate))
    
results = pd.DataFrame(results, columns=['Accuracy', 'Learning rate'])
results
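
The candidate list `learning_rates` in the sweep above is intentionally left empty. Learning rates are often searched on a logarithmic scale, so one possible way to generate candidates is sketched here. The range from 1e-4 to 1e-1 and the number of points are assumptions for illustration, not recommended values.

```python
import numpy as np

# Four candidates spaced evenly on a log scale between 1e-4 and 1e-1.
# .tolist() converts the NumPy floats to plain Python floats.
learning_rates = np.logspace(-4, -1, num=4).tolist()

print(learning_rates)
```

A coarse grid like this can be refined afterwards around the best-performing value.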

In [None]:
results[results['Accuracy'] == results['Accuracy'].max()]
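
The boolean mask above returns every row that ties for the best accuracy. If a single row or a ranked table is more convenient, `idxmax` and `sort_values` are alternatives. The sketch below uses a small table `toy_results` with made-up placeholder numbers; they are not measured results.

```python
import pandas as pd

# Made-up placeholder numbers purely for illustration, not measured results.
toy_results = pd.DataFrame(
    [(0.5, 0.001), (0.9, 0.01), (0.7, 0.1)],
    columns=['Accuracy', 'Learning rate'],
)

# Single best row (the first one if several rows tie).
best = toy_results.loc[toy_results['Accuracy'].idxmax()]
best_lr = float(best['Learning rate'])

# Full table ranked from best to worst.
ranked = toy_results.sort_values('Accuracy', ascending=False)

print(best_lr)
# 0.01
```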

In [None]:
# Train and evaluate final model.
# Remember to use both train and val data for training for best performance! 
# Similar to what we have done in all the other exercises/assignments
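
Combining the train and validation data for the final model can be done with `np.concatenate`. The sketch below uses dummy arrays with hypothetical names (`x_tr`, `x_va`, ...) and sizes, only to show that the shapes line up.

```python
import numpy as np

# Dummy train and validation arrays standing in for the real split.
x_tr, y_tr = np.zeros((80, 28, 28)), np.zeros(80, dtype=int)
x_va, y_va = np.ones((20, 28, 28)), np.ones(20, dtype=int)

# Stack along the sample axis so the final model sees all labelled data.
x_full = np.concatenate([x_tr, x_va], axis=0)
y_full = np.concatenate([y_tr, y_va], axis=0)

print(x_full.shape, y_full.shape)
# (100, 28, 28) (100,)
```

The final model would then be fit on `x_full` and `y_full` with the chosen hyperparameters and evaluated once on the test data.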

# Exercise 2

Recognizing that the batch size is also important for training speed, you decide to extend your above analysis to also find the optimal batch size. Once again, train a model with your optimized learning rate *and* batch size and evaluate it on your test data.
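One reason batch size interacts with a fixed epoch budget is that it determines how many gradient updates the model receives. The sketch below counts updates for a few batch sizes. The training-set size of 48,000 assumes a 20% validation split of the 60,000 MNIST training images, and the helper `count_updates` is hypothetical.

```python
import math

n_train = 48_000  # assumption: 60,000 training images minus a 20% validation split
epochs = 2

def count_updates(n_samples, batch_size, epochs):
    """Number of gradient updates when the last, smaller batch is kept."""
    return math.ceil(n_samples / batch_size) * epochs

for batch_size in (32, 128, 512):
    print(batch_size, count_updates(n_train, batch_size, epochs))
# 32 3000
# 128 750
# 512 188
```

Larger batches finish an epoch faster but take far fewer steps in 2 epochs, which is why the best learning rate may shift with the batch size.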

In [None]:
learning_rates = [] # must be positive floats. Default depends on optimizer
batch_sizes = [] # must be positive ints. Default is 32

results = []

for learning_rate in learning_rates:
    for batch_size in batch_sizes:
        model = build_model(learning_rate)
        model.fit(??) # remember to pass in batch_size here! Also remember to use epochs=2
        loss, acc = model.evaluate(??)
        results.append((acc, learning_rate, batch_size))
    
results = pd.DataFrame(results, columns=['Accuracy', 'Learning rate', 'Batch size'])
results
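
As the number of hyperparameters grows, the nested loops above can be flattened with `itertools.product`, which yields the same combinations in the same order. The candidate values below are placeholders, not recommendations.

```python
from itertools import product

# Placeholder candidates purely for illustration.
learning_rates = [0.001, 0.01]
batch_sizes = [32, 128]

# Every (learning_rate, batch_size) pair, in nested-loop order.
grid = list(product(learning_rates, batch_sizes))

for learning_rate, batch_size in grid:
    print(learning_rate, batch_size)
# 0.001 32
# 0.001 128
# 0.01 32
# 0.01 128
```

Note that the number of training runs is the product of the list lengths, so it grows quickly with each extra hyperparameter.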

In [None]:
results[results['Accuracy'] == results['Accuracy'].max()]

In [None]:
# Train and evaluate final model.
# Remember to use both train and val data for training for best performance! 
# Similar to what we have done in all the other exercises/assignments

# Exercise 3

You have heard that momentum is important. You know that many optimizers already incorporate momentum by default, but you are now forced by your evil teacher to use SGD and otherwise repeat (1) and (2). You decide to extend your above analysis to also find the optimal momentum for SGD (see https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD for how to set it). Once again, train a model with your optimized learning rate, batch size, *and* momentum and evaluate it on your test data.
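To build some intuition for what momentum does, here is a minimal pure-Python sketch of the (non-Nesterov) update rule given in the linked SGD documentation: `velocity = momentum * velocity - learning_rate * gradient`, then `w = w + velocity`. The toy objective f(w) = 0.5 * w**2, the helper `sgd_momentum`, and all numeric settings are assumptions chosen for illustration.

```python
def sgd_momentum(w0, learning_rate, momentum, steps):
    """Minimise f(w) = 0.5 * w**2 (gradient: w) with SGD plus momentum."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        grad = w  # derivative of 0.5 * w**2
        velocity = momentum * velocity - learning_rate * grad
        w = w + velocity
    return w

# Same learning rate and step budget, with and without momentum.
w_plain = sgd_momentum(1.0, learning_rate=0.01, momentum=0.0, steps=100)
w_mom = sgd_momentum(1.0, learning_rate=0.01, momentum=0.9, steps=100)

# Distance from the minimum at w = 0 after 100 steps.
print(abs(w_plain), abs(w_mom))
```

On this toy problem the run with momentum ends much closer to the minimum for the same small learning rate, which hints at why momentum can matter when only 2 epochs are available.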

In [None]:
def build_model_with_momentum(learning_rate, momentum):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(??, activation=??),
        tf.keras.layers.Dense(??, activation=??),
        tf.keras.layers.Dense(??, activation=??),
    ])
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum)  # `lr` is a deprecated alias
    model.compile(
        loss=??,
        optimizer=optimizer,
        metrics=['accuracy'],
    )
    
    return model

In [None]:
learning_rates = [] # must be positive floats. Default (for SGD) is 0.01
batch_sizes = [] # must be positive ints. Default is 32
momentums = [] # must be in [0, 1). Default (for SGD) is 0.0

results = []

for learning_rate in learning_rates:
    for batch_size in batch_sizes:
        for momentum in momentums:
            model = build_model_with_momentum(learning_rate, momentum=momentum)
            model.fit(??) # Remember to use epochs=2
            loss, acc = model.evaluate(??)
            results.append((acc, learning_rate, batch_size, momentum))
    
results = pd.DataFrame(results, columns=['Accuracy', 'Learning rate', 'Batch size', 'Momentum'])
results

In [None]:
results[results['Accuracy'] == results['Accuracy'].max()]

In [None]:
# Train and evaluate final model.
# Remember to use both train and val data for training for best performance! 
# Similar to what we have done in all the other exercises/assignments