# Exercise 4 - Fully Connected Networks and the MNIST dataset
This exercise is based on https://github.com/leriomaggio/deep-learning-keras-tensorflow



# The MNIST database

The MNIST (Modified National Institute of Standards and Technology) database ([link](http://yann.lecun.com/exdb/mnist)) is a collection of handwritten digits. The dataset consists of 28x28 grayscale images of the 10 digits.

![](mnist.png)

Since this dataset is **provided** with Keras, we just ask the `keras.datasets` module for training and test data.

`from keras.datasets import mnist`<br>
`(X_train, y_train), (X_test, y_test) = mnist.load_data()`

The training set has $60,000$ samples. 
The test set has $10,000$ samples.
The digits are size-normalized and centered in a fixed-size image. 
The data page describes how the data was collected. It also reports benchmarks of various algorithms on the test dataset.

## Task 1: Data preparation 
* Download the data
* Inspect the data and plot a few of the images using `matplotlib.pyplot.imshow` 
* Reshape the input data to be in vectorial form (original data are images)
* Convert the input data to dtype `float32` using `astype` in order to scale it afterwards
* Normalize the design matrix to values between 0 and 1.
* How many classes do you have? How much data of each class?
* Convert the class vector to binary class matrices (**one-hot vectors**). Use the `to_categorical` function from `keras.utils` to convert integer labels to **one-hot vectors**.
* Split the training set into training and validation data (30%)
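
The preparation steps listed above can be sketched on synthetic data. This is a minimal NumPy-only illustration, not the reference solution: the random arrays stand in for the MNIST images, `np.eye` stands in for `to_categorical`, and the sketch assumes that the 30% refers to the validation share.

```python
import numpy as np

# Synthetic stand-in for MNIST: 100 "images" of 28x28 pixels with values 0..255
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
y = rng.integers(0, 10, size=100)

# Flatten each image to a vector, convert to float32, and scale to [0, 1]
X_flat = X.reshape(len(X), -1).astype("float32") / 255.0

# Number of samples per class
counts = np.bincount(y, minlength=10)

# One-hot encoding with NumPy (stand-in for keras.utils.to_categorical)
Y = np.eye(10, dtype="float32")[y]

# Hold out the last 30% of the samples as validation data (assumed split)
n_val = int(0.3 * len(X_flat))
X_tr, X_val = X_flat[:-n_val], X_flat[-n_val:]
Y_tr, Y_val = Y[:-n_val], Y[-n_val:]

print(X_flat.shape, Y.shape, X_tr.shape, X_val.shape)
```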

## Task 2: Build and train a neural network
* Design a dense neural network structure. 
* Choose `softmax` as activation for the output layer (normalized multi-class probability)
* Use `categorical_crossentropy` as loss function (multi-class version of crossentropy)
* Use `adam` as optimizer and a batch size of 512 (speed things up)
* Train the NN over 50 epochs and plot the evolution of the training and validation loss as well as of one meaningful metric. What do you observe?
* Evaluate the performance on the test set using `sklearn.metrics`
* Plot the probability of being a *Zero* for true zeros and for all other numbers
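
To make the roles of `softmax` and `categorical_crossentropy` concrete, here is a small NumPy sketch of both. The function names and the example values are illustrative assumptions and this is not Keras's implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.sum(y_onehot * np.log(probs), axis=1)))

# Two samples, three classes (made-up values)
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
y_onehot = np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])

probs = softmax(logits)          # each row sums to 1
loss = categorical_crossentropy(y_onehot, probs)
print(probs.sum(axis=1), loss)
```

The second sample has equal logits, so every class receives probability 1/3 and its contribution to the loss is log(3).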

### Plot the confusion matrix

A good way to show the performance of a multi-class output is the confusion matrix: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

In [None]:
# Note: this code is adapted from the scikit-learn website; it is a nice way of viewing a confusion matrix.
import itertools
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes,
                          normalize=True,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize each row (true label) before plotting,
    # so that the colors match the annotated values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
from sklearn.metrics import confusion_matrix
# compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_cls) 
# plot the confusion matrix
plt.figure(figsize=(8,8))
plot_confusion_matrix(confusion_mtx, classes = range(10))
plt.figure(figsize=(8,8))
plot_confusion_matrix(confusion_mtx, classes = range(10), normalize=False)
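
To see what the matrix contains, the following NumPy-only sketch builds a confusion matrix by hand for a made-up three-class example and applies the same row normalization as the plotting function. Rows correspond to true labels and columns to predicted labels, as in `sklearn.metrics.confusion_matrix`.

```python
import numpy as np

# Made-up labels for a three-class problem
y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
# np.add.at accumulates repeated (true, predicted) index pairs correctly
np.add.at(cm, (y_true, y_pred), 1)

# Row normalization: each row (true class) sums to 1
cm_norm = cm.astype(float) / cm.sum(axis=1, keepdims=True)

print(cm)
print(cm_norm)
```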

### Plot wrong associations

Errors are the samples where the predicted label differs from the true label.

In [None]:
errors = (Y_cls - Y_true != 0)

Y_cls_errors = Y_cls[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_test_errors = X_test[errors]

Define a plotting function

In [None]:
def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)), cmap=plt.cm.Greys, interpolation='nearest')
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1


Rank errors by difference in probability

In [None]:
# Probabilities of the wrong predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)

# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))

# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors

# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)

# Top 6 errors 
most_important_errors = sorted_delta_errors[-6:]
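
The `np.take` plus `np.diagonal` combination used above builds an intermediate array of shape (n, n). As a hedged alternative, the same values can be read directly with integer array indexing. The sketch below checks the equivalence on made-up probabilities; the variable names mirror the cell above, the values do not come from the model.

```python
import numpy as np

# Made-up predicted probabilities for three misclassified samples
Y_pred_errors = np.array([[0.1, 0.7, 0.2],
                          [0.5, 0.3, 0.2],
                          [0.2, 0.2, 0.6]])
Y_true_errors = np.array([2, 1, 0])

# Approach used above: build an (n, n) array and keep its diagonal
via_take = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))

# Direct alternative: pick entry [i, Y_true_errors[i]] for every row i
via_index = Y_pred_errors[np.arange(len(Y_true_errors)), Y_true_errors]

# Gap between the winning probability and the probability of the true class
delta = Y_pred_errors.max(axis=1) - via_index
worst_first = np.argsort(delta)[::-1]
print(via_take, via_index, worst_first)
```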


In [None]:
# Show the top 6 errors
display_errors(most_important_errors, X_test_errors, Y_cls_errors, Y_true_errors)

## Using Dropout Layers

As we learned last time, the training and validation losses in the fit history are not comparable when using dropout. We can define our own callback, which calculates the loss and metric after each epoch for any dataset:

In [None]:
from keras.callbacks import Callback

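class HistoryEpoch(Callback):
    """Evaluate the model on a fixed dataset at the end of every epoch."""
    def __init__(self, data):
        super().__init__()
        self.data = data

    def on_train_begin(self, logs=None):
        # Reset the stored values at the start of each training run
        self.loss = []
        self.acc = []

    def on_epoch_end(self, epoch, logs=None):
        x, y = self.data
        # Assumes the model was compiled with exactly one metric (accuracy)
        l, a = self.model.evaluate(x, y, verbose=0)
        self.loss.append(l)
        self.acc.append(a)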
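
The reason for the mismatch is that dropout is only active during training, while evaluation runs the network without it. The NumPy sketch below illustrates this with a simplified inverted-dropout function; the function and its arguments are illustrative assumptions, not Keras's implementation.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: active only while training, identity otherwise."""
    if not training:
        return activations
    # Keep each unit with probability (1 - rate) and rescale the survivors
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((4, 1000))

train_out = dropout(a, rate=0.5, training=True, rng=rng)   # noisy: values 0 or 2
eval_out = dropout(a, rate=0.5, training=False, rng=rng)   # unchanged

print(train_out.mean(), eval_out.mean())
```

The expected activation is preserved in training mode, but individual outputs are noisy, so a loss computed on them differs from the loss computed in evaluation mode.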

## Task 3: Using regularizer

* Modify your previous example network by adding a Dropout layer after each hidden layer
* Add L2 regularization to the hidden layers
* Use the newly defined `HistoryEpoch` for the training, validation and test datasets in order to record a comparable loss and metric. For example: `train_hist=HistoryEpoch((X_train, Y_train))`. In the `fit` function, you can then register the callback by specifying `callbacks=[train_hist]`.
* Plot the loss and metric evolution and compare the calculated loss with the default loss from the history
* Evaluate the performance of the NN as for the unregularized NN and compare the performance
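
As a reminder of what L2 regularization adds to the objective, here is a minimal NumPy sketch. The weight matrix, the loss value and `l2_lambda` are hypothetical numbers chosen for illustration.

```python
import numpy as np

def l2_penalty(weights, l2_lambda):
    """Sum of squared weights scaled by l2_lambda."""
    return l2_lambda * float(np.sum(np.square(weights)))

W = np.array([[1.0, -2.0],
              [0.5, 0.0]])
data_loss = 0.30      # hypothetical cross-entropy value
l2_lambda = 0.01      # hypothetical regularization strength

# The penalty is added on top of the data loss
total_loss = data_loss + l2_penalty(W, l2_lambda)
print(total_loss)
```

Large weights increase the penalty, so the optimizer is pushed towards smaller weights.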

## Early Stopping as a regularizer

* If you continue training, at some point the validation loss will start to increase: that is when the model starts to **overfit**. We can use EarlyStopping as a regularizer:

In [None]:
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5, verbose=1)
# Alternative choice (the metric key is 'val_accuracy' in newer Keras versions):
#early_stop = EarlyStopping(monitor='val_acc', patience=5, verbose=1)

dropout=0.5

model_ES = Sequential()
model_ES.add(Dense(512, activation='relu', kernel_regularizer=l2(l2_lambda), input_dim=784))
model_ES.add(Dropout(dropout))
model_ES.add(Dense(256, activation='relu', kernel_regularizer=l2(l2_lambda)))
model_ES.add(Dropout(dropout))
model_ES.add(Dense(10, activation='softmax'))

model_ES.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model_ES.summary()


In [None]:
history_ES = model_ES.fit(X_train, Y_train, validation_data = (X_test, Y_test), epochs=100, batch_size=256, verbose=1, 
             callbacks=[early_stop]) 

In [None]:
plot_history(history_ES)
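
The role of `patience` can be illustrated without Keras. The following pure-Python sketch is a simplified stand-in for the stopping rule, not the actual `EarlyStopping` code. The function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember the new best value and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 2
val_losses = [0.90, 0.60, 0.50, 0.52, 0.51, 0.55, 0.58, 0.60]
stop = early_stopping_epoch(val_losses, patience=3)
print(stop)
```

With `patience=3`, training stops after three consecutive epochs without improvement over the best value seen so far.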

---

# Bonus: Inspecting Layers

In [None]:
# We already used `summary`
model_dropout.summary()

### `model.layers` is iterable

In [None]:
print('Model Input Tensors: ', model.input)
print('Layers - Network Configuration:')
for layer in model.layers:
    print(layer.name, layer.trainable)
    print('Layer Configuration:')
    print(layer.get_config(), )
print('Model Output Tensors: ', model.output)

## Extract hidden layer representation of the given data

One **simple** way to do it is to use the weights of your model to build a new model that's truncated at the layer you want to read. 

Then you can run the `.predict(X_batch)` method to get the activations for a batch of inputs.

In [None]:
model_truncated = Sequential()
model_truncated.add(Dense(512, activation='relu', input_shape=(784,)))
model_truncated.add(Dropout(dropout))
model_truncated.add(Dense(256, activation='relu'))

for i, layer in enumerate(model_truncated.layers):
    layer.set_weights(model_dropout.layers[i].get_weights())

model_truncated.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

In [None]:
# Check
np.all(model_truncated.layers[0].get_weights()[0] == model_dropout.layers[0].get_weights()[0])

In [None]:
hidden_features = model_truncated.predict(X_train)

In [None]:
hidden_features.shape

In [None]:
X_train.shape
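
What the truncated model computes can also be written out by hand. For a `Dense` layer, `get_weights()` returns the kernel and the bias, so the hidden representation is two affine maps, each followed by a ReLU. The sketch below uses random weights and hypothetical small layer sizes instead of the trained 784, 512 and 256 units.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
# Hypothetical small layers standing in for the trained kernels and biases
W1, b1 = rng.normal(size=(8, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)

X_batch = rng.normal(size=(4, 8))
# Dropout is the identity at prediction time, so it is skipped here
hidden = relu(relu(X_batch @ W1 + b1) @ W2 + b2)
print(hidden.shape)
```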

#### Hint: Alternative Method to get activations 

(Using `keras.backend` `function` on Tensors)

```python
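# Assumes a Keras version whose backend still provides K.function and K.learning_phase()
from keras import backend as K

def get_activations(model, layer, X_batch):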
    activations_f = K.function([model.layers[0].input, K.learning_phase()], [layer.output,])
    activations = activations_f((X_batch, False))
    return activations
```

### Generate the Embedding of Hidden Features

Dimensionality reduction to dim=20 by using principal component analysis (PCA)

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=20)
pca_result = pca.fit_transform(hidden_features)
print('Variance PCA: {}'.format(np.sum(pca.explained_variance_ratio_)))
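
For intuition about the explained variance ratio printed above, PCA can be sketched with a singular value decomposition in NumPy. This is an illustrative stand-in on random data, not a replacement for `sklearn.decomposition.PCA`; the feature matrix and the number of components are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random features with unequal variances per column
features = rng.normal(size=(200, 10)) * np.arange(1, 11)

# PCA works on mean-centered data
centered = features - features.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

n_components = 3
# Project onto the leading principal directions (rows of Vt)
projected = centered @ Vt[:n_components].T

# Share of the total variance carried by each component
explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
kept = explained_variance_ratio[:n_components].sum()
print(projected.shape, kept)
```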

Dimensionality reduction to dim=2 by using t-distributed stochastic neighbor embedding (t-SNE)

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(pca_result[:1000])  # only the first 1000 samples, to limit computation time

In [None]:
colors_map = np.argmax(Y_train, axis=1)

In [None]:
X_tsne.shape

In [None]:
nb_classes=10

In [None]:
np.where(colors_map==6)

In [None]:
colors = np.array([x for x in 'b-g-r-c-m-y-k-purple-coral-lime'.split('-')])
colors_map = np.argmax(Y_train, axis=1)
colors_map = colors_map[:1000]
plt.figure(figsize=(10,10))
for cl in range(nb_classes):
    indices = (colors_map == cl)  # boolean mask selecting the samples of this class
    plt.scatter(X_tsne[indices,0], X_tsne[indices, 1], c=colors[cl], label=cl)
plt.legend()
plt.show()