# Dense Networks with Keras and TensorFlow

TensorFlow is an open source software library for numerical computation. It can be used for many things, but it is mostly known for its use in machine learning, and especially in deep learning.
Since its release in 2015, it has quickly become one of the most popular and most actively developed libraries for deep learning.
TensorFlow represents computations as graphs, which enables simple parallelization (independent operations can run in parallel rather than sequentially) and automatic differentiation.
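
To make the link between computation graphs and automatic differentiation concrete, here is a minimal sketch of reverse-mode differentiation in plain Python. The `Node` class and `backward` function are hypothetical names made up for this illustration; this is not how TensorFlow is implemented, only the same idea at toy scale.

```python
# Minimal reverse-mode automatic differentiation on a computation graph.
# Illustrative only: Node and backward are hypothetical, not TensorFlow's API.

class Node:
    def __init__(self, value, parents=()):
        self.value = value        # result of the forward computation
        self.parents = parents    # tuples of (parent_node, local_gradient)
        self.grad = 0.0           # d(output)/d(this node), filled in by backward()

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate gradients from the output node back through the graph."""
    # Topological ordering: every node appears after all of its inputs.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk the graph from the output towards the inputs (chain rule).
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += local_grad * node.grad


# f(x, y) = x * y + x, so df/dx = y + 1 and df/dy = x
x, y = Node(3.0), Node(4.0)
f = x * y + x
backward(f)
print(f.value, x.grad, y.grad)
```

Because the graph records which operation produced each value, the gradients fall out of a single backward pass without deriving them by hand.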

Pure TensorFlow is very verbose, so it is a good idea to use a high-level API.
Doing so simplifies and speeds up development, reduces the risk of bugs, and generally reduces headaches.
The officially supported high-level API is **[Keras](https://keras.io/)**, which will be the focus of this lab.


### External resources
If you want a deeper dive, the following are good places to start:

* [Deep Learning course exercises from DTU](https://github.com/DeepLearningDTU/02456-deep-learning) - more hands-on TensorFlow exercises.
* [Official TensorFlow getting started material](https://www.tensorflow.org/get_started/) - a collection of good tutorials, from beginner to quite advanced.
* [Official Keras getting started material](https://keras.io/getting-started/functional-api-guide/)
* [API documentation](https://www.tensorflow.org/api_docs/python/) - Most of the documentation for TF is written into the code, so the best way to figure out how something works is often to look it up in the API, and then look at the implementation. The [API guides](https://www.tensorflow.org/api_guides/python/array_ops) can also be very useful.


In [None]:
# Load dependencies and supporting functions by running the code block below.
from __future__ import absolute_import, division, print_function 

import time
import sys, os

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import numpy as np
from IPython.display import clear_output
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import TensorBoard  # public import path

from sklearn.preprocessing import OneHotEncoder

# The data

For this exercise we will use the [fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset.
Like the classical MNIST dataset, it consists of `28x28` grayscale images from 10 different classes, and the objective is to classify as many of them correctly as possible.
Fashion-MNIST is, however, more challenging (though still a toy dataset), which makes it more interesting to work with.

In [None]:
# Download and load data
mnist = keras.datasets.fashion_mnist

(x_train, y_train_),(x_test, y_test_) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize data to [0, 1] interval

## Print dataset statistics and visualize
print("""Information on dataset
----------------------""")
print("Training data shape:\t", x_train.shape, y_train_.shape)
print("Test data shape\t\t", x_test.shape, y_test_.shape)

Fashion-MNIST consists of images of 10 different types of clothing.
The labels are:

| Label:       | 0           | 1       | 2        | 3     |    4 | 5     | 6      | 7      | 8   |   9 |
| -| -| -| -| -| -| -| -| -| -| -|
| **Description:** | T-shirt/top | Trouser | Pullover | Dress | Coat | Sandal | Shirt | Sneaker | Bag | Ankle boot |


In [None]:
## Plot a few fashion-MNIST examples
img_to_show = 5
idx = 0
canvas = np.zeros((28*img_to_show, img_to_show*28))
print('\nLabels')
for i in range(img_to_show):
    for j in range(img_to_show):
        canvas[i*28:(i+1)*28, j*28:(j+1)*28] = x_train[idx]
        print(y_train_[idx], end=', ')
        idx += 1
    print()

print('\nInput data')
plt.figure(figsize=(6,6))
plt.axis('off')
plt.imshow(canvas, cmap='gray')
plt.title('fashion-MNIST data')
plt.show()

## One-hot encoding
Class labels for neural networks (almost) always use **one-hot encoding**.
Rather than using the integers directly, they are encoded as binary vectors in which one element has the value `1` and the rest are `0`.
The position of the `1` indicates the label.
The code below converts the integer labels to one-hot labels.
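
The same conversion can also be sketched in plain NumPy, which makes the mechanics explicit. The `one_hot` helper and the example labels are illustrative assumptions and not part of the notebook's own code.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer labels as one-hot row vectors."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]

# Hypothetical labels: ankle boot (9), T-shirt/top (0), dress (3)
example_labels = np.array([9, 0, 3])
encoded = one_hot(example_labels, num_classes=10)
print(encoded.shape)
print(encoded[0])
```

Taking `np.argmax` along the last axis recovers the original integer labels.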

In [None]:
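# Fit the encoder on the training labels only and reuse it for the test labels.
# Note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`.
encoder = OneHotEncoder(categories='auto', sparse=False)
y_train = encoder.fit_transform(y_train_[:, None])
y_test = encoder.transform(y_test_[:, None])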

for i in range(10):
    print(y_train_[i], '-->', y_train[i,:])


# Elements of Learning
Many parts of deep learning are still more art than science.
When defining and training a deep network there are many options, and no proven method of finding the best (or a good) configuration.
Hyperparameters can be found by experience (guessing) or by some search procedure.
[Random search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) is easy to implement and performs decently.
More advanced search procedures include [SPEARMINT](https://github.com/JasperSnoek/spearmint) and many others.
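
As a rough sketch of what random search could look like, the snippet below draws random configurations from ranges similar to the ballpark values in this section. The `sample_config` function, the dictionary keys, and the exact ranges are assumptions made for illustration; each sampled configuration would still have to be trained and compared on validation accuracy.

```python
import random

def sample_config(rng):
    """Draw one random hyperparameter configuration (illustrative ranges)."""
    return {
        # Learning rates are usually sampled on a log scale.
        'learning_rate': 10 ** rng.uniform(-5, -1),
        'hidden_units': rng.choice([128, 256, 512, 1024]),
        'dropout_rate': rng.uniform(0.1, 0.5),
        'batch_size': rng.choice([16, 32, 64, 128, 256]),
    }

rng = random.Random(0)  # fixed seed so the draws are reproducible
configs = [sample_config(rng) for _ in range(5)]
for config in configs:
    print(config)
```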

Some important factors, and good starting points are given below.

## Ballpark estimates of hyperparameters
__Number of hidden units and network architecture__
* Probably as big a network as possible (within memory and time constraints), and then apply regularization.
   You'll have to experiment :).
   One rarely goes below 256 units for feedforward networks unless you are training on a CPU...
   There is some research into stochastic depth networks: https://arxiv.org/pdf/1603.09382v2.pdf, but in general this is trial and error.
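
To get a feel for how network size translates into memory, the sketch below counts the trainable parameters of a fully connected network. The `count_dense_params` helper is a hypothetical name, and the layer sizes assume flattened `28x28` inputs and 10 output classes.

```python
def count_dense_params(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total

# 28 * 28 = 784 inputs, one hidden layer, 10 output classes
for hidden in (256, 512, 1024):
    print(hidden, count_dense_params([784, hidden, 10]))
```

With a single hidden layer, the parameter count grows roughly linearly with the number of hidden units.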
   
   
__Loss function__
The [loss function](https://keras.io/losses/) is typically determined by the problem, but there is some flexibility here as well.
Broadly speaking we use:
 * Cross entropy for classification (discrete targets)
 * Mean squared error for regression (continuous targets)
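
A minimal NumPy sketch of the two losses is given below. The function names and example values are made up for illustration, and clipping the predictions with a small `eps` is one common way to avoid `log(0)`, not necessarily what Keras does internally.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean cross entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([[0.0, 1.0, 0.0]])         # the true class is class 1
confident = np.array([[0.05, 0.90, 0.05]])   # most probability on the true class
unsure = np.array([[0.40, 0.30, 0.30]])      # probability spread out

ce_confident = categorical_cross_entropy(y_true, confident)
ce_unsure = categorical_cross_entropy(y_true, unsure)
mse = mean_squared_error(np.array([1.0, 2.0]), np.array([1.5, 2.5]))
print(ce_confident, ce_unsure, mse)
```

The cross entropy is small when the model puts most of the probability on the correct class and grows as that probability shrinks.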
 
__Parameter initialization__
[Parameter initialization](https://keras.io/initializers/) is extremely important. There are many different initializers. Commonly used initializers are:

1. He
2. Glorot
3. Uniform or Normal with small scale (0.1 - 0.01)
4. Orthogonal (works well for RNNs, sometimes)

Bias is nearly always initialized to zero.
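
The sketch below shows the scaling rules behind Glorot (uniform) and He (normal) initialization in NumPy. It is an approximation for illustration: the function names are hypothetical, the layer sizes are assumed, and Keras's own `he_normal` draws from a truncated normal rather than the plain normal used here.

```python
import numpy as np

rng = np.random.RandomState(0)

def glorot_uniform(fan_in, fan_out):
    """Uniform in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    """Zero-mean normal with standard deviation sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W_glorot = glorot_uniform(784, 256)
W_he = he_normal(784, 256)
b = np.zeros(256)  # biases start at zero

print(W_glorot.std(), W_he.std())
```

Both schemes shrink the weights as the layer gets wider, which keeps the scale of the activations roughly constant from layer to layer.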
   
__Nonlinearity__: The most commonly used [nonlinearities](https://keras.io/activations/) are:

1. ReLU
2. Leaky ReLU
3. ELU
4. Sigmoid: used for the output if it is binary. It is not used in the hidden layers. Squashes the output to between 0 and 1.
5. Softmax: used for the output if you have a classification problem. Normalizes the outputs so that they sum to 1.
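
For reference, these nonlinearities can be written in a few lines of NumPy. This is a sketch for building intuition; the `alpha` defaults are assumed values, and in the lab itself you would use the Keras activations.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small non-zero slope for negative inputs
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth exponential curve for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def sigmoid(x):
    # Output lies in the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the row maximum for numerical stability
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([[-2.0, 0.0, 3.0]])
print(relu(z), leaky_relu(z), elu(z), sigmoid(z), softmax(z), sep='\n')
```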

__Regularization:__

1. Dropout: dropout rate 0.1-0.5
2. [L2 and L1 regularization](https://keras.io/regularizers/): 1e-4 - 1e-8.
3. [Batchnorm](https://keras.io/layers/normalization/): Batchnorm also acts as a regularizer. Often very useful (faster and better convergence).
4. Early stopping: very frequently used.
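
To illustrate the first item, here is a minimal sketch of (inverted) dropout in NumPy. The `dropout` function is a hypothetical helper written for this example; in the lab you would normally use the Keras dropout layer instead.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is only active during training
    keep_prob = 1.0 - rate
    mask = rng.uniform(size=activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
h = np.ones((1000, 256))
h_dropped = dropout(h, rate=0.5, rng=rng)
print((h_dropped == 0).mean(), h_dropped.mean())
```

About half of the units are zeroed, while the mean activation stays close to its original value because of the rescaling.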
    
__Optimizers:__

1. SGD + Momentum: learning rate 1.0 - 0.1
2. ADAM: learning rate 3*1e-4 - 1e-5
3. RMSPROP: somewhere between SGD and ADAM
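
As a sketch of what SGD with momentum does at each step, the snippet below applies a classical momentum update to a toy quadratic problem. The function name, the toy objective, and the chosen learning rate and momentum are assumptions for illustration only.

```python
import numpy as np

def sgd_momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9):
    """One classical momentum update: accumulate a velocity, then move along it."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

# Toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)
for step in range(200):
    grad = w
    w, velocity = sgd_momentum_step(w, velocity, grad)

print(w)
```

Setting `momentum=0.0` reduces the update to plain gradient descent.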

__Mini-batch size__
* Usually people use 16-256. Bigger is not always better. With a smaller mini-batch size you get more updates per epoch, and your model might converge faster. Small batch sizes also use less memory, so you can train a model with more parameters.
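
The trade-off can be made concrete by counting parameter updates per epoch. The sketch below assumes 54000 training examples (60000 fashion-MNIST training images with 10% held out for validation), and `iterate_minibatches` is a hypothetical helper showing how shuffled mini-batches could be drawn.

```python
import math
import numpy as np

num_train = 54000  # assumed: 60000 training images minus a 10% validation split
for batch_size in (16, 64, 256):
    updates_per_epoch = math.ceil(num_train / batch_size)
    print(batch_size, updates_per_epoch)

def iterate_minibatches(x, y, batch_size, rng):
    """Yield shuffled (x, y) mini-batches that together cover the data once."""
    indices = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        yield x[batch], y[batch]

# Tiny stand-in data set, just to show the batch shapes
rng = np.random.RandomState(0)
x_demo = np.zeros((1000, 28, 28))
y_demo = np.zeros((1000, 10))
batch_shapes = [xb.shape for xb, _ in iterate_minibatches(x_demo, y_demo, 256, rng)]
print(batch_shapes)
```

The last batch is smaller whenever the data set size is not a multiple of the batch size.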


In [None]:
# Create list to hold experiments - but don't overwrite, even if we re-run this cell.
try:
    experiments
except NameError:
    experiments = []

In [None]:
## Helper functions
def visualize_experiments(experiments):
    fig = plt.figure(figsize=(8,6))
    for experiment in experiments:
        exp_name, _, info = experiment
 
        ax = plt.subplot(211)
        ax.set_title('Validation Accuracy')
        plt.plot(info.history['val_acc'], label=exp_name)
        plt.legend()

        ax = plt.subplot(212)
        ax.set_title('Validation Loss')
        plt.plot(info.history['val_loss'], label=exp_name)
        plt.legend()

    plt.tight_layout()
    plt.show()


def visualize_info(experiment):
    name, _, info = experiment
    print('Params:')
    for key in info.params:
        print('{:20}'.format(key), info.params[key])
    
    fig = plt.figure(figsize=(8,6))
    ax = plt.subplot(211)
    ax.set_title('Accuracy: '+ name)
    plt.plot(info.history['val_acc'], label='val_acc')
    plt.plot(info.history['acc'], label='train_acc')
    plt.legend()

    ax = plt.subplot(212)
    ax.set_title('Loss: ' + name)
    plt.plot(info.history['val_loss'], label='val_loss')
    plt.plot(info.history['loss'], label='loss')
    plt.legend()
    
    plt.tight_layout()
    plt.show()

    
def keep_best(experiments, n):
    """ Return the n best experiments."""
    if len(experiments) < n:
        return experiments
    
    exp_sorted = sorted(experiments, key=lambda x: np.max(x[2].history['val_acc']), reverse=True)
    return exp_sorted[:n]


In [None]:
# Training function
def train(model, loss, optimizer, num_epochs, exp_name=None, use_tensorboard=False):    
    exp_name = exp_name or 'log_{:.0f}'.format(time.time()*100)
    
    model.compile(optimizer=optimizer,
                  loss=loss,
                  metrics=['accuracy'])

    # Attach the TensorBoard callback only when requested.
    callbacks = [TensorBoard(log_dir='logdir/' + exp_name)] if use_tensorboard else []
    fit_info = model.fit(x_train, y_train, epochs=num_epochs, batch_size=256,
                         validation_split=0.1, callbacks=callbacks)
    return exp_name, model, fit_info

# Your Task
Your task is to create the best network possible and attain the highest validation performance you can.
Once you are happy with it, you can evaluate your model on the test set.


There are multiple **loss functions** available, but for classification we almost always use [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy).

In [None]:
loss_function = tf.keras.backend.categorical_crossentropy

There are also several optimizers (gradient descent algorithms) to choose from.
Stochastic Gradient Descent (SGD) is the classical one, and it is still used frequently.
There are, however, [cases where it doesn't perform so well](https://imgur.com/a/Hqolp), and where another, more advanced algorithm is better suited.

In [None]:
optimizer = keras.optimizers.SGD(lr=0.01, momentum=0.0, decay=0.0)

In [None]:
## Define the model we want to train

model = keras.models.Sequential([
    # Dense layers only work with feature vectors, so we need to flatten first
    keras.layers.Flatten(input_shape=(28, 28)),

    keras.layers.Dense(256),
    keras.layers.Activation('relu'),
    
    # The output layer must have the same number of units as there are classes
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.summary()

print('\nBegin Training')
num_epochs = 10
exp_name = 'default_network'
experiment = train(model, loss_function, optimizer, num_epochs, exp_name=exp_name)
experiments.append(experiment)
del model

In [None]:
visualize_info(experiment)
visualize_experiments(experiments)

for experiment in experiments:
    name, _, info = experiment
    print(name, np.max(info.history['val_acc']))

In [1]:
# If applicable - clean up your experiments list a bit
experiments = keep_best(experiments, 3)

# Testing
Ideally you would only ever use the test data once: when you are completely, 100% done and satisfied with your model.
This is rarely the practice in real life, but you should keep some discipline and not use the test set too often.

In [None]:
are_you_happy_about_your_model = False

if are_you_happy_about_your_model:
    best_experiment = keep_best(experiments, 1)[0]
    print('Best model test loss and accuracy:')
    print(best_experiment[1].evaluate(x_test, y_test, batch_size=128))
    print()

    visualize_info(best_experiment)
else:
    print('Come back when you are happy.')