# Practical session: Programming with Keras - MNIST problem, confidence level

In this practical session, we attach a confidence level to our predictions using the MC-Dropout method (MC = Monte Carlo). The method keeps the Dropout operation active at test time and uses the randomness of Dropout to obtain variability in the network output: high variability implies a low confidence level, and vice versa.
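To make the idea concrete, here is a minimal NumPy sketch (not Keras's implementation) of why an active dropout gives different outputs for the same input. The helper `dropout`, the vector `h`, the weights `W` and the rate of 0.5 are hypothetical stand-ins for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, rng):
    # Inverted dropout: zero each unit with probability `rate`, rescale the others.
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# Hypothetical tiny "network": one hidden vector followed by a softmax layer.
h = np.array([1.0, 2.0, 0.5, -1.0])
W = rng.normal(size=(4, 3))

def forward(h, W, rng, rate=0.5):
    z = dropout(h, rate, rng) @ W
    e = np.exp(z - z.max())      # numerically stable softmax
    return e / e.sum()

# Several stochastic passes on the same input give different class probabilities.
preds = np.stack([forward(h, W, rng) for _ in range(20)])
spread = preds.std(axis=0)       # per-class variability across the passes
print(preds.mean(axis=0), spread)
```

The larger `spread` is, the less the passes agree, which is the variability that the rest of the session quantifies with entropy-based metrics.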

In this practical session, some cells must be filled in according to the instructions. They are identified by the word **Exercise**. In most cases you will perform the **Verifications** yourself, by checking that the algorithm works correctly and converges.

Below we import the required libraries.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow.keras as keras
from tqdm import tqdm

## Data definition

The following cell loads the MNIST data.

In [2]:
#DO NOT CHANGE

(X_train, Y_train), (X_test, Y_test) = keras.datasets.mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
[1m11490434/11490434[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step


**Exercise**: Normalize the input data (divide by 255) and convert the output data into categorical vectors (one-hot encoding with keras.utils.to_categorical).

In [None]:
#TO DO
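As an illustration of what these two operations do, here is a small NumPy sketch on toy arrays. `X_toy` and `y_toy` are hypothetical stand-ins for the MNIST arrays, and the `np.eye` indexing is only an assumed equivalent of the Keras helper for integer labels between 0 and 9.

```python
import numpy as np

# Toy stand-ins for the MNIST arrays (assumed: uint8 pixels, integer labels).
X_toy = np.array([[0, 128, 255]], dtype=np.uint8)
y_toy = np.array([3, 0, 9])

# Normalization: pixel values end up in [0, 1].
X_norm = X_toy.astype("float32") / 255.0

# One-hot encoding: one row per label, with a single 1 at the label position.
Y_onehot = np.eye(10, dtype="float32")[y_toy]

print(X_norm)
print(Y_onehot)
```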

**Exercise**: Add a channel dimension to X_train and X_test, so that each image has shape (28, 28, 1) and can be fed to 2D convolution layers.

In [None]:
#TO DO

## Keras model

### Model creation with convolutional layers

**Exercise**: Create a Keras model with name "my_model".

**Specific instructions**:
- Do not use the Sequential format: we need the functional format in order to pass a specific option that keeps Dropout active at test time.
- Use the following format:
  - x = your_layer_1(arguments)(x)
  - x = your_layer_2(arguments)(x)
  - ....
  - outputs = your_final_layer(arguments)(x)
- For the Dropout layers, pass the keyword argument "training = True" together with x when calling the layer, to keep Dropout active at prediction time. Do not use a BatchNormalization layer immediately before or after a Dropout layer.

In [None]:
inputs = keras.layers.Input((28,28,1))

x = keras.layers.Conv2D(#TO DO)(inputs)
x = keras.layers.BatchNormalization()(x)
    
#TO DO

outputs =#TO DO

my_model = keras.models.Model(inputs,outputs)


**Exercise**: Display your architecture by calling my_model.summary()

In [None]:
#TO DO

### Model compilation

**Exercise**: Compile your model and choose an optimizer. Use a suitable loss function and suitable metrics.

In [None]:
#TO DO

### Early stopping

**Exercise**: Define an early stopping procedure.

In [None]:
#TO DO

## Training

**Exercise**: Run the training as usual.

In [None]:
learning = #TO DO

**Verification**: The training loss should decrease and the accuracy should increase. The validation loss and accuracy should follow the same trend.

**Exercise**: Plot the evolution of the loss function, and the evolution of the accuracy, for the training set and the validation set. 

In [None]:
#TO DO

## Predicting with your model

**Exercise**: Pick an example at random and display its prediction. Run the prediction several times: you will see that the prediction is not always the same for this example. Do not use my_model.predict (it disables Dropout in the most recent versions of TensorFlow). Instead, apply the model directly to your example: my_model(example).

In [None]:
#TO DO

We will now characterize the variability of the predictions. To do so, we use information theory to build suitable metrics for estimating these uncertainties (self-evaluated by the network).

For a single prediction, we characterize the uncertainty from the probabilities assigned to each class. The idea is the following: if the prediction gives a high probability to one class and low probabilities to the others, the prediction is "certain". Conversely, if the probability is spread roughly evenly over the classes, the prediction is not "sure".

This notion can be quantified by the **Shannon entropy**, defined by:

\begin{equation}
\mathcal{H}(\hat{Y}) = -\sum_{i = 1}^{K} \hat{y}_i\log(\hat{y}_i)
\end{equation}

In this equation, the index $i$ runs over the $K$ classes and $\hat{y}_i$ is the predicted probability of class $i$.
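As a quick sanity check, consider the two limiting cases for $K = 10$ classes, assuming the natural logarithm and the usual convention $0 \log 0 = 0$. A fully confident (one-hot) prediction gives:

\begin{equation}
\mathcal{H}(\hat{Y}) = -1 \cdot \log(1) = 0
\end{equation}

while a uniform prediction, $\hat{y}_i = 1/10$ for every class, gives:

\begin{equation}
\mathcal{H}(\hat{Y}) = -\sum_{i = 1}^{10} \frac{1}{10}\log\left(\frac{1}{10}\right) = \log(10) \approx 2.303
\end{equation}

For a probability vector, the entropy therefore lies between $0$ (certain) and $\log(K)$ (maximally uncertain).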

**Exercise**: Complete the following function to compute the Shannon entropy. Assume that $y$ is a multi-dimensional array and that the entropy must be computed along a given axis (argument ax), which indexes the classes.

In [None]:
def shannon_entr(y,ax):

  entr = #TO DO

  return entr

**Verification**: Run the following cell.

In [None]:
#DO NOT CHANGE

np.random.seed(seed = 1)

y_hat = np.random.rand(3,10)

print(shannon_entr(y_hat,1))

The result should be [2.84552209 2.71503273 1.79409548]

Now, we run several predictions for the same example. Monte Carlo Dropout makes these predictions vary. The total uncertainty is represented by the Shannon entropy of the mean prediction. The uncertainty due to the intrinsic noise of the data (aleatoric uncertainty) is given by the mean of the Shannon entropies of the individual predictions. Finally, the uncertainty due to the variability between the different models (epistemic uncertainty) is given by the difference between the two previous quantities.

Mathematically speaking, it corresponds to:

  - $\mathcal{H}(\mathbb{E}_{w}(\hat{Y}))$ is the total uncertainty (the index $w$ means that the expectation is taken over the variability of the weights due to MC-Dropout)
  - $\mathbb{E}_{w}(\mathcal{H}(\hat{Y}))$ is the aleatoric uncertainty
  - $\mathcal{I}(\hat{Y};w) = \mathcal{H}(\mathbb{E}_{w}(\hat{Y})) - \mathbb{E}_{w}(\mathcal{H}(\hat{Y}))$ is the epistemic uncertainty. This quantity is called "mutual information" and represents the relation between the prediction and the variability of the weights due to MC-Dropout.
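The decomposition can be illustrated on two hand-made sets of stochastic predictions. This is only a sketch: the helper names `entropy` and `decompose`, the small constant `eps` (added to avoid log(0)) and the arrays `passes_a` and `passes_b` are assumptions of this example, not part of the notebook.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    # Shannon entropy with the natural log; eps avoids log(0) (assumption of this sketch).
    return -np.sum(p * np.log(p + eps), axis=axis)

# Hypothetical MC-Dropout outputs for one example: shape (n_mc, n_classes).
# Case A: every pass returns the same ambiguous prediction (noise in the data).
passes_a = np.tile([0.5, 0.5, 0.0], (4, 1))
# Case B: every pass is confident, but the passes disagree (model variability).
passes_b = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

def decompose(passes):
    total = entropy(passes.mean(axis=0))          # entropy of the mean prediction
    aleatoric = entropy(passes, axis=-1).mean()   # mean of the per-pass entropies
    return total, aleatoric, total - aleatoric    # the difference is the epistemic part

total_a, aleat_a, epist_a = decompose(passes_a)
total_b, aleat_b, epist_b = decompose(passes_b)
print("A:", total_a, aleat_a, epist_a)
print("B:", total_b, aleat_b, epist_b)
```

Both cases have the same mean prediction, hence the same total uncertainty (about log 2), but in case A it is entirely aleatoric and in case B it is entirely epistemic.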

**Exercise**: Take an example and duplicate it along axis 0 using np.repeat.

In [None]:
i = 0

X_test_i = X_test[i:(i+1)]

X_test_dup = #TO DO

**Exercise**: Run your model to make a prediction on this duplicated example.

In [None]:
Y_pred_dup = #TO DO

**Exercise**: Use this prediction to compute the aleatoric uncertainty. As seen above, the aleatoric uncertainty is the mean of the Shannon entropies computed on the individual predictions.

In [None]:
incert_aleat = #TO DO

print(incert_aleat)

**Exercise**: Use this prediction to compute the total uncertainty. As seen above, the total uncertainty is the Shannon entropy computed on the mean of the predictions.

In [None]:
incert_tot = #TO DO

print(incert_tot)

**Exercise**: Finally, you can compute the epistemic uncertainty, as the difference between the total uncertainty and the aleatoric uncertainty.

In [None]:
incert_epist = #TO DO

print(incert_epist)

### Testing on the whole test database

The following code duplicates the whole test set a hundred times. It first duplicates the test array along an additional axis, then reshapes it to obtain an array with dimensions (number of examples * n_mc, image dimensions), where n_mc is the number of Monte Carlo draws.

In [None]:
#DO NOT CHANGE

n_mc = 100

X_test_tot_dup = np.expand_dims(X_test,axis = 1)

X_test_tot_dup = np.repeat(X_test_tot_dup,n_mc,axis = 1)

X_test_tot_dup = np.reshape(X_test_tot_dup,(n_mc*X_test.shape[0],X_test.shape[1],X_test.shape[2],X_test.shape[3]))
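The ordering produced by these three operations can be checked on a toy array. In this sketch, `X` is a hypothetical set of two tiny "images" filled with their own example index, and `n_mc` is set to 3 to keep the output readable.

```python
import numpy as np

n_mc = 3
# Toy "images": 2 examples of shape (2, 2, 1), each filled with its example index.
X = np.stack([np.full((2, 2, 1), k, dtype=float) for k in range(2)])

X_dup = np.expand_dims(X, axis=1)                               # shape (2, 1, 2, 2, 1)
X_dup = np.repeat(X_dup, n_mc, axis=1)                          # shape (2, 3, 2, 2, 1)
X_dup = np.reshape(X_dup, (n_mc * X.shape[0],) + X.shape[1:])   # shape (6, 2, 2, 1)

# Example index carried by each row of the flattened array.
order = X_dup[:, 0, 0, 0]
print(order)

# Going back: examples along axis 0, Monte Carlo copies along axis 1.
X_back = X_dup.reshape(X.shape[0], n_mc, *X.shape[1:])
print(X_back.shape)
```

The copies of a given example are stored contiguously, so a reshape with the example axis first and the Monte Carlo axis second groups them back together.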

The function below will allow you to make predictions on batches of data using my_model directly (and not my_model.predict).
This approach helps prevent memory saturation on the machine.

In [None]:
#DO NOT CHANGE

def predict_on_batch_with_dropout(model, data, batch_size):
    predictions = []
    for i in tqdm(range(0, len(data), batch_size)):
        batch = data[i:i+batch_size]
        batch_predictions = model(batch, training=True)
        predictions.append(batch_predictions)
    return np.concatenate(predictions, axis=0)

**Exercise**: Run your model on this duplicated database to get a prediction using the function predict_on_batch_with_dropout.

In [None]:
Y_pred_tot = #TO DO

The shape of this array of predictions is now (n_examples * n_mc, 10). To compute the Shannon entropy, you must gather the predictions corresponding to the same example along a dedicated axis: the goal is a final array with shape (n_examples, n_mc, 10).

**Exercise**: Use the function np.reshape to get this shape.

In [None]:
Y_pred_tot = #TO DO

**Exercise**: Compute the aleatoric uncertainty for all of the predictions. The result must be a vector of size n_examples (= 10 000).

**Hint**: The main difficulty is choosing the axis along which you compute the entropy and the axis along which you compute the mean.

In [None]:
incert_aleat_tot = #TO DO

print(incert_aleat_tot.shape)

**Exercise**: Compute the total uncertainty for every example. Also store the mean of the predictions in the array Y_mean.

In [None]:
Y_mean = #TO DO

incert_totale_tot = #TO DO

print(incert_totale_tot.shape)

**Exercise**: Finally, compute the epistemic part.

In [None]:
incert_epist_tot = #TO DO

**Exercise**: Sort the examples by aleatoric uncertainty and visualize those with the highest aleatoric uncertainty. The function np.argsort can be helpful.

In [None]:
index_sort = #TO DO

index = #TO DO: store the index you want to visualize in this variable

label_pred = np.argmax(Y_mean[index])

figure = plt.figure(figsize = (16,9))

ax1 = plt.subplot(121)
ax1.imshow(X_test[index,:,:],cmap = "hot")
plt.title("Mean prediction: " + str(label_pred) + "\nTrue value: " + str(Y_test[index]))

ax2 = plt.subplot(122)
ax2.bar(np.arange(10),height = Y_mean[index],tick_label = np.arange(10))
plt.xlabel("Class")
plt.ylabel("Network output")
plt.title("Aleatoric uncertainty: " + str(incert_aleat_tot[index]) + 
          "\nEpistemic uncertainty: " + str(incert_epist_tot[index])+
          "\nTotal uncertainty: " + str(incert_totale_tot[index]))


**Exercise**: Do the same exercise for the epistemic uncertainty.

In [None]:
index_sort = #TO DO

index = #TO DO: store the index you want to visualize in this variable
 
label_pred = np.argmax(Y_mean[index])

figure = plt.figure(figsize = (16,9))

ax1 = plt.subplot(121)
ax1.imshow(X_test[index,:,:],cmap = "hot")
plt.title("Mean prediction: " + str(label_pred) + "\nTrue value: " + str(Y_test[index]))

ax2 = plt.subplot(122)
ax2.bar(np.arange(10),height = Y_mean[index],tick_label = np.arange(10))
plt.xlabel("Class")
plt.ylabel("Network output")
plt.title("Aleatoric uncertainty: " + str(incert_aleat_tot[index]) + 
          "\nEpistemic uncertainty: " + str(incert_epist_tot[index])+
          "\nTotal uncertainty: " + str(incert_totale_tot[index]))

**Exercise**: Do the same exercise for the total uncertainty.

In [None]:
index_sort = #TO DO

index =  #TO DO: store the index you want to visualize in this variable

label_pred = np.argmax(Y_mean[index])

figure = plt.figure(figsize = (16,9))

ax1 = plt.subplot(121)
ax1.imshow(X_test[index,:,:],cmap = "hot")
plt.title("Mean prediction: " + str(label_pred) + "\nTrue value: " + str(Y_test[index]))

ax2 = plt.subplot(122)
ax2.bar(np.arange(10),height = Y_mean[index],tick_label = np.arange(10))
plt.xlabel("Class")
plt.ylabel("Network output")
plt.title("Aleatoric uncertainty: " + str(incert_aleat_tot[index]) + 
          "\nEpistemic uncertainty: " + str(incert_epist_tot[index])+
          "\nTotal uncertainty: " + str(incert_totale_tot[index]))

You can continue studying these data, for instance by building histograms of the uncertainties or by analysing the correlation between the uncertainties and the misclassifications of the network.
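One possible starting point for the second suggestion is to compare the mean uncertainty of the misclassified examples with that of the correctly classified ones. The sketch below uses hypothetical toy values (`Y_mean_toy`, `labels_toy`, `incert_toy`) standing in for the mean predictions, the integer labels and the total uncertainty computed above. It only shows the mechanics and says nothing about what your own model will produce.

```python
import numpy as np

# Hypothetical toy values: 4 examples, 2 classes.
Y_mean_toy = np.array([[0.90, 0.10],
                       [0.40, 0.60],
                       [0.55, 0.45],
                       [0.20, 0.80]])
labels_toy = np.array([0, 0, 1, 1])   # integer labels (not one-hot)

# Total uncertainty: entropy of the mean prediction, class axis = 1.
incert_toy = -np.sum(Y_mean_toy * np.log(Y_mean_toy), axis=1)

# Boolean mask of the misclassified examples.
wrong = Y_mean_toy.argmax(axis=1) != labels_toy

mean_incert_wrong = incert_toy[wrong].mean()
mean_incert_right = incert_toy[~wrong].mean()
print(mean_incert_wrong, mean_incert_right)
```

In this toy case the misclassified examples have the higher mean uncertainty. On the real test set, the same mask can be used to draw one histogram of uncertainties per group.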