# Optimisers and Loss Functions

Tutorial by Mark Graham

# Optimisers

### Exercise 1.1 Gradient descent

In this section we'll dig into optimisers in more detail. So far, we have considered simple gradient descent, for which the update rule is:

$$ \mathbf{W}_{t+1} = \mathbf{W}_{t}  - \eta \left.\dfrac{\partial J}{\partial \mathbf{W}}\right|_{\mathbf{W}_{t}} $$

where $J$ is our cost function, $\mathbf{W}$ our parameters and $\eta$ our learning rate.
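
As a minimal sketch of this update rule, the snippet below applies it repeatedly to a toy cost that is not the one used in the exercise. The names `gradient_descent_step` and `toy_gradient`, and the cost $J = w_0^2 + 3w_1^2$, are assumptions chosen purely for illustration.

```python
import numpy as np

def gradient_descent_step(w, grad, learning_rate):
    """Return w - learning_rate * grad (one gradient descent update)."""
    return w - learning_rate * grad

# Illustrative cost J(w) = w0^2 + 3 * w1^2, whose gradient is (2 * w0, 6 * w1)
def toy_gradient(w):
    return np.array([2.0 * w[0], 6.0 * w[1]])

w = np.array([1.0, -2.0])
for _ in range(200):
    w = gradient_descent_step(w, toy_gradient(w), learning_rate=0.1)

print(w)  # both entries end up very close to the minimum at (0, 0)
```

Each step moves the parameters against the gradient, scaled by the learning rate, which is all the update rule says.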

Let's implement gradient descent. 
Consider the following cost function for a simple two-parameter system:

$$ J = 40w_1^2 + 10 w_2^2 + 40 w_2$$

*Remember that the **cost** function refers to the mean loss over the entire training set and the **loss** function gives the cost of an individual observation*
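
The distinction can be illustrated with a few made-up numbers. The targets, predictions and squared-error loss below are assumptions for illustration only.

```python
import numpy as np

# Hypothetical targets and predictions for four observations
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.0])

per_sample_loss = (y_true - y_pred) ** 2   # one loss value per observation
cost = per_sample_loss.mean()              # a single number for the whole set

print(per_sample_loss)
print(cost)  # 0.3125
```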

**1.1.1 Implement the loss function for the above expression in the cell below**

In [None]:
import numpy as np
#
# Loss function
#
def loss(w1, w2):
    #
    # ### STUDENT'S ANSWER HERE ####
    #
    return None
#

Because our cost function is simple, we can calculate its minimum analytically.

**1.1.2 Estimate the analytic minimum of a convex function**

Calculate the partial derivatives $\dfrac{\partial J}{\partial w_1}$ and $\dfrac{\partial J}{\partial w_2}$ analytically, then use them to complete the function below.

In [None]:
#
# Gradient of loss function
#
def calculate_gradient(w1, w2):
    #
    # ### STUDENT'S ANSWER HERE ####
    # 
    grad_w1 = None
    grad_w2 = None
    #
    return (grad_w1, grad_w2)
#

**1.1.3 Estimate the coordinates of the minimum**

Using your knowledge of differentiation and the partial derivatives calculated above, find the minimum of the cost function, $J_{min}$, together with the parameters at which it occurs: $w_{1min}, w_{2min}$.

Pass these values into the `plot_surface` function below.


In [None]:
#
# Generate data for a convex function, calculate minima and plot
#
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
#
# Generate evenly spaced values
#
w1 = np.linspace(-5, 5, 30)
w2 = np.linspace(-5, 5, 30)
#
W1, W2 = np.meshgrid(w1, w2)
#
# Compute value of loss function for each parameter combination
#
losses = loss(W1, W2)
#
# ### STUDENT'S ANSWER HERE ####
#
w_1min = None
w_2min = None
L_min  = None
#
# Do plot
#
def plot_surface(W1, W2, loss_coords, minima):
    #
    fig = plt.figure(figsize=(10, 10))
    ax  = plt.axes(projection='3d')
    #
    ax.contour3D(W1, W2, loss_coords, 50, cmap='binary')
    ax.set_xlabel('w1')
    ax.set_ylabel('w2')
    ax.set_zlabel('loss');
    #
    # plot the minimum
    #
    ax.scatter3D(minima[0], minima[1], minima[2], c='green', s=100)
    #
    return fig, ax
#
fig, ax = plot_surface(W1, W2, losses, [w_1min, w_2min, L_min])
#

Finally, we need a way of updating our parameters so we can make use of the gradient.

**1.1.4 Implement the gradient descent update rule, in the class outlined in the next cell**

In [None]:
#
# Class for carrying out gradient descent
#
class gradient_descent():
    #
    def __init__(self, learning_rate):
        #
        # We can store parameters of the descent algorithm here.
        #
        self.lr = learning_rate
    #
    def __call__(self, w1, w2, w1_grad, w2_grad):
        #
        # Our actual computation happens here.
        #
        # ### STUDENT'S ANSWER HERE ####
        #
        # Update the weights
        #
        w1_updated = None
        w2_updated = None
        #
        return w1_updated, w2_updated
    #
#

We're now ready to perform gradient descent.

**Run the following cell, which calls your functions** `loss`, `calculate_gradient` **and** `gradient_descent`**:**

In [None]:
#
# Plot the path of gradient descent
#
def plot_coords(w1_coords, w2_coords, loss_coords, axis):
    #
    # Draw on the axis passed in, rather than relying on the global `ax`
    axis.plot3D(w1_coords, w2_coords, loss_coords, c='red')
    axis.scatter3D(w1_coords, w2_coords, loss_coords, c='red', s=100)
    axis.set_xlim(-5, 5)
    axis.set_ylim(-5, 5)
    axis.set_zlim(-100, 1500)
#
# Start point
#
w1 = 4 * -1
w2 = 4
#
num_epochs = 100
lr         = 0.01
#
# Instantiate the gradient_descent optimiser
#
grad_descent = gradient_descent(learning_rate=lr)
#
# Create empty lists to hold the coordinates
#
w1_coords, w2_coords, loss_coords = [], [], []
#
# Do epochs
#
for epoch in range(num_epochs):
    #
    w1_coords.append(w1)
    w2_coords.append(w2)
    #
    Loss = loss(w1,w2)
    #
    loss_coords.append(Loss)
    #
    w1_grad, w2_grad =  calculate_gradient(w1, w2)
    #
    w1, w2 = grad_descent(w1, w2, w1_grad, w2_grad)
    #
    if (abs(w1_grad) > 0.01) or (abs(w2_grad) > 0.01):
        print('epoch {} , loss {:.3f}, w1 gradient: {:.3f} w2 gradient: {:.3f}'.format(epoch,Loss,w1_grad, w2_grad))
    else:
        break
    #
#
# Do plot
#
fig, ax = plot_surface(W1, W2, losses, [w_1min, w_2min, L_min])
plot_coords(w1_coords, w2_coords, loss_coords, ax)
#

It should work pretty well. Let's investigate some things:
- How sensitive is our solution to the learning rate? Try out different values in the above cell, both larger and smaller. What do you notice?
- Estimate how long it takes for our solution to converge by altering the number of epochs at a fixed `lr=0.003`.
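
One way to reason about learning-rate sensitivity is to look at a 1-D stand-in for the steepest direction of the cost. The sketch below assumes a cost $J(w) = a w^2$ with $a = 40$; the helper `run_gd` is hypothetical and not part of the exercise code.

```python
def run_gd(curvature, learning_rate, w0=1.0, steps=50):
    """Gradient descent on J(w) = curvature * w**2; returns the final |w|."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * curvature * w
        w = w - learning_rate * grad
    return abs(w)

# For this 1-D quadratic each step multiplies w by (1 - 2 * curvature * lr),
# so the iterates shrink only when lr < 1 / curvature.
curvature = 40.0
for lr in (0.001, 0.01, 0.024, 0.026):
    print(lr, run_gd(curvature, lr))
```

Under these assumptions the iterates shrink for learning rates just below $1/a = 0.025$ and grow just above it, which suggests why the direction with the largest curvature limits the usable learning rate.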

### Exercise 1.2: Gradient descent with momentum ###

Momentum adds 'memory' to our gradient descent, averaging the gradient at the current step with gradients from previous steps. The update equations for gradient descent with momentum are:

$$ \mathbf{z}_{t+1} = \beta   \mathbf{z}_{t} +  \eta \left.\dfrac{\partial J}{\partial \mathbf{W}}\right|_{\mathbf{W}_{t}} \\
\mathbf{W}_{t+1} = \mathbf{W}_{t}  -  \mathbf{z}_{t+1} $$

where $\beta$ is the momentum parameter and $\eta$ the learning rate. 
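
The two equations can be sketched as a single function acting on NumPy arrays. This is an illustrative form under assumed names (`momentum_step`, and a 1-D toy cost $J = w^2$), not the class-based interface the exercise asks for.

```python
import numpy as np

def momentum_step(w, z, grad, learning_rate, beta):
    """One momentum update: z <- beta * z + lr * grad, then w <- w - z."""
    z_new = beta * z + learning_rate * grad
    w_new = w - z_new
    return w_new, z_new

# Illustrative 1-D cost J(w) = w**2 with gradient 2 * w
w, z = np.array([1.0]), np.array([0.0])
for _ in range(200):
    w, z = momentum_step(w, z, 2.0 * w, learning_rate=0.05, beta=0.6)

print(w)  # very close to the minimum at 0
```

Note that the velocity `z` has to persist between calls, which is why the class in the exercise stores it as an attribute.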

**1.2.1 Implement the momentum update  in the cell below:**

In [None]:
#
# Class for carrying out gradient descent with momentum
#
class gradient_descent_momentum():
    #
    def __init__(self, momentum, learning_rate):
        #
        # As well as the parameters momentum and lr, we need to store z_1 and z_2 between update steps
        #
        self.momentum = momentum
        self.lr       = learning_rate
        #
        self.z_1 = 0
        self.z_2 = 0
    #    
    def __call__(self, w1, w2, w1_grad, w2_grad):
        #
        ### STUDENT CODE HERE####
        #
        # Using the above formula implement the momentum update
        #
        self.z_1 = None
        self.z_2 = None
        #
        # Update the weights 
        #
        w1_updated = None
        w2_updated = None
        #
        return w1_updated, w2_updated 
    #
#

The cell below calls the momentum update. **Run it**.

In [None]:
#
# Run the gradient descent with momentum
#
w1 = 4 * -1
w2 = 4
#
num_epochs = 100
lr         = 0.003
momentum   = 0.6
#
gradient_descent_mom = gradient_descent_momentum(momentum=momentum, learning_rate=lr)
#
w1_coords, w2_coords, loss_coords = [], [], []
#
# Do epochs
#
for epoch in range(num_epochs):
    #
    w1_coords.append(w1)
    w2_coords.append(w2)
    #
    Loss = loss(w1,w2)
    #
    loss_coords.append(Loss)
    #
    w1_grad, w2_grad = calculate_gradient(w1, w2)
    #
    w1, w2 = gradient_descent_mom(w1,w2, w1_grad, w2_grad)
    #
    if (abs(w1_grad) > 0.01) or (abs(w2_grad) > 0.01):
        print('epoch {} , loss {:.3f}, w1 gradient: {:.3f} w2 gradient: {:.3f}'.format(epoch, Loss, w1_grad, w2_grad))
    else:
        break
    #
#
# Do plot
#
fig, ax = plot_surface(W1, W2, losses, [w_1min, w_2min, L_min])
plot_coords(w1_coords, w2_coords, loss_coords, ax)
#

Now play with the momentum parameters:
1. What do you notice about the convergence speed of momentum compared to gradient descent? 
2. Vary the learning rate. What do you notice about higher learning rates, compared to gradient descent?

### Exercise 1.3: RMSProp

RMSProp also keeps a 'memory', but here it uses this memory to moderate the learning rate for each parameter independently, so that smaller steps are taken in directions with larger gradients. The update equations are:

$$ \mathbf{v}_{t+1} = \beta   \mathbf{v}_{t} +  (1-\beta) \left( \left.\dfrac{\partial J}{\partial \mathbf{W}}\right|_{\mathbf{W}_{t}}\right)^2 \\
\mathbf{W}_{t+1} = \mathbf{W}_{t}  - \dfrac{\eta}{\sqrt{\mathbf{v}_{t+1} + \epsilon}} \circ  \left.\dfrac{\partial J}{\partial \mathbf{W}}\right|_{\mathbf{W}_{t}} $$

The update looks complicated, but compare it with the gradient descent update. They're the same, except that the learning rate $\eta$ is divided, separately for each parameter, by a factor that is recalculated at each update step.
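
To see the per-parameter scaling in action, here is a minimal vectorised sketch of a single RMSProp step. The function name `rmsprop_step` and the example gradient values are assumptions for illustration; the exercise itself asks for a class that handles each parameter separately.

```python
import numpy as np

def rmsprop_step(w, v, grad, learning_rate, beta, epsilon=1e-5):
    """One RMSProp update following the two equations above."""
    v_new = beta * v + (1.0 - beta) * grad ** 2
    w_new = w - learning_rate / np.sqrt(v_new + epsilon) * grad
    return w_new, v_new

# Two parameters whose gradients differ by a factor of 100
w = np.array([1.0, 1.0])
v = np.zeros(2)
grad = np.array([100.0, 1.0])
w_new, v_new = rmsprop_step(w, v, grad, learning_rate=0.01, beta=0.9)

steps = w - w_new
print(steps)  # the two step sizes are nearly equal despite the 100x gradient gap
```

Because each gradient is divided by the square root of its own running average, the step taken depends much less on the raw gradient magnitude than in plain gradient descent.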


**1.3.1. Implement the RMSProp update step in the cell below**

As we have only 2 parameters, implement the updates for each one separately by:

- estimate `self.v_1` and `self.v_2`, corresponding to $\mathbf{v}_{t+1}$ for each parameter. This implements an exponential moving average of the squared gradient with respect to each parameter
- estimate `lr_1` and `lr_2`, the corrected learning rate ($\dfrac{\eta}{\sqrt{\mathbf{v}_{t+1} + \epsilon}}$) for each parameter
- update the weights for each parameter (`w1_updated`, `w2_updated`)

In [None]:
#
# Class for RMSprop optimizer
#
class RMSProp():
    #
    def __init__(self, momentum, learning_rate):
        #
        self.momentum = momentum
        self.lr       = learning_rate
        self.v_1      = 0
        self.v_2      = 0
        self.epsilon  = 1e-5
    #   
    def __call__(self, w1, w2, w1_grad, w2_grad):
        #
        # ## STUDENT CODE HERE ####
        #
        # Estimate self.v_1 and self.v_2 as moving averages of the square of the gradient
        #
        self.v_1 = None
        self.v_2 = None
        #
        # Implement the learning rate update for each parameter
        #
        lr_1 = None
        lr_2 = None
        #
        # Implement the parameter update
        #
        w1_updated = None
        w2_updated = None
        #
        return w1_updated, w2_updated    

Let's use RMSProp - **run the cell below**, which calls your implementation.

In [None]:
# 
# Run the RMSprop gradient descent
#
w1 = 4 * -1
w2 = 4

num_epochs = 30
lr         = 0.5
momentum   = 0.9
#
rmsprop = RMSProp(momentum=momentum, learning_rate=lr)
#
w1_coords, w2_coords, loss_coords = [], [], []
#
# Do epochs
#
for epoch in range(num_epochs):
    #
    w1_coords.append(w1)
    w2_coords.append(w2)
    #
    Loss = loss(w1, w2)
    #
    loss_coords.append(Loss)
    #
    w1_grad, w2_grad =  calculate_gradient(w1 ,w2)
    #
    w1, w2 = rmsprop(w1, w2, w1_grad, w2_grad)
    #
    if (abs(w1_grad) > 0.01) or (abs(w2_grad) > 0.01):
        print('epoch {} , loss {:.3f}, w1 gradient: {:.3f} w2 gradient: {:.3f}'.format(epoch, Loss, w1_grad, w2_grad))
    else:
        break
    #
#
# Do plot
#
fig, ax = plot_surface(W1, W2, losses, [w_1min, w_2min, L_min])
plot_coords(w1_coords, w2_coords, loss_coords, ax)
#

Play with the learning rate and momentum. What do you notice about the learning rate needed compared to previous update rules? What about the speed of convergence?

### Stochastic gradient descent

In order to investigate optimisers here, we've been analysing a simple, quadratic loss function with parameters that are known to us. This made it straightforward to calculate both the loss and its gradient for any set of parameters, $w_1,w_2$. However, in a practical ML application we won't have this nice functional form for the loss. For a simple two parameter regression problem, the cost function would take the form:

$$ J = \frac{1}{N} \sum_{i=1}^{N } (y_i - x_{1i}w_1 - x_{2i}w_2)^2$$

We can see that the cost will still be quadratic in our two parameters $w_1,w_2$, but it will depend on our $N$ training data points $\{\mathbf{x_i}, y_i\}$. At each iteration we will need to run our model over the full dataset to calculate the cost and gradient.

If $N$ is very large, or we have a large model with lots of parameters (e.g. a neural network) it can be time-consuming and memory-intensive to run through the full dataset to get the gradients for our next parameters update. In practice, we calculate the cost and gradients for a randomly chosen subset of the data at each iteration, approximating the cost and the gradients at that point. This has the effect of giving us noisy gradient updates - we don't necessarily take the optimal step at each iteration, but sometimes this can help us avoid or jump out of local minima.
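
The idea of estimating the gradient from a random subset can be sketched with NumPy on the two-parameter regression cost above. The synthetic data, the generating weights $(2, -3)$, and the helper `cost_gradient` are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 2 * x1 - 3 * x2 + noise
N = 1000
X = rng.normal(size=(N, 2))
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=N)

def cost_gradient(w, X_batch, y_batch):
    """Gradient of the mean squared error over the rows of X_batch."""
    residual = y_batch - X_batch @ w
    return -2.0 * X_batch.T @ residual / len(y_batch)

w = np.zeros(2)
batch_size, learning_rate = 32, 0.05
for step in range(500):
    idx = rng.choice(N, size=batch_size, replace=False)  # random subset
    w = w - learning_rate * cost_gradient(w, X[idx], y[idx])

print(w)  # noisy, but close to the generating weights (2, -3)
```

Each update only touches 32 of the 1000 rows, so it is much cheaper than a full pass, at the price of a noisier path towards the minimum.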

### Exercise 1.4: Stochastic gradient descent for real data

Let's repeat our MLP training loop from session 1, this time using stochastic gradient descent (SGD).

*This network consists of a hidden ReLU layer, and an output layer containing 1 sigmoid unit*
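
For orientation, the forward and backward passes of such a network can be sketched in plain NumPy and checked against a finite-difference estimate. The layer sizes, random data and function names below are hypothetical; only the architecture (one ReLU hidden layer, one sigmoid output, mean binary cross-entropy cost) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, X):
    """X has shape (N, D); returns hidden pre-activations, hidden outputs, predictions."""
    Z1 = W1 @ X.T          # (H, N)
    F1 = relu(Z1)          # (H, N)
    F2 = sigmoid(W2 @ F1)  # (1, N)
    return Z1, F1, F2

def cost(y, F2):
    """Mean binary cross-entropy over the N samples."""
    p = F2.ravel()
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

def gradients(W1, W2, X, y):
    """Chain-rule gradients of the mean cost with respect to W1 and W2."""
    Z1, F1, F2 = forward(W1, W2, X)
    N = X.shape[0]
    delta2 = F2 - y                      # (1, N): dL/dZ2 for sigmoid + cross-entropy
    dJ_dW2 = delta2 @ F1.T / N           # (1, H)
    delta1 = (W2.T @ delta2) * (Z1 > 0)  # (H, N): back through the ReLU
    dJ_dW1 = delta1 @ X / N              # (H, D)
    return dJ_dW1, dJ_dW2

# Hypothetical sizes: 8 samples, 4 features, 3 hidden units
X = rng.normal(size=(8, 4))
y = rng.integers(0, 2, size=8).astype(float)
W1 = 0.5 * rng.normal(size=(3, 4))
W2 = 0.5 * rng.normal(size=(1, 3))

dJ_dW1, dJ_dW2 = gradients(W1, W2, X, y)

# Finite-difference check on one entry of W2
h = 1e-6
W2_plus = W2.copy()
W2_plus[0, 1] += h
W2_minus = W2.copy()
W2_minus[0, 1] -= h
numeric = (cost(y, forward(W1, W2_plus, X)[2]) - cost(y, forward(W1, W2_minus, X)[2])) / (2 * h)
print(numeric, dJ_dW2[0, 1])  # the two values should agree closely
```

Dividing by the number of samples `N` reflects the fact that the cost is a mean over the batch; a finite-difference check like this is a cheap way to catch scaling or transposition mistakes in a hand-written backward pass.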

First let's create custom Dataset and Dataloader classes for our brain image data. Here, any preprocessing to be run on the whole data set should be run *only once*, and thus should go in the `__init__` function. 

**1.4.1 Complete the `__len__` and `__getitem__` methods**

Check that the class returns the number of items and the feature vector length that you expect.

**Don't forget to upload the data to colab and edit the path to match where you load it to**

In [None]:
#
# Dataset class for brain scan data
#
import io
import requests
import pandas as pd
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset
#
class PretermDataset(Dataset):
    # 
    def __init__(self): 
        #
        # Get the data
        #
        url      = "https://raw.githubusercontent.com/IS-pillar-3/datasets/main/prem_vs_termwrois.csv" 
        download = requests.get(url).content
        df       = pd.read_csv(io.StringIO(download.decode("utf-8")))
        #
        data = df.values[:, :-2].T
        y    = df.values[: , -1]
        #
        # Bias terms
        # 
        X = np.concatenate((np.ones((1, data.shape[1])), data))
        #
        # Normalise to ~ N(0, 1)
        #
        epsilon       = 1e-5
        X_centred     = np.ones_like(X)
        X_centred[1:] = (X[1:] - X[1:].mean(axis=1, keepdims=True)) / (X[1:].std(axis=1, keepdims=True) + epsilon)
        #               
        self.X = X_centred
        self.y = y
    #
    def __len__(self):
        #
        # ### STUDENTS COMPLETE ###
        #
        return None
    #
    def __getitem__(self, idx):
        #
        # ### STUDENTS COMPLETE ### 
        #
        # Return a sample from your data set as a tuple 
        #
        sample = None
        #
        return sample
#
#
dataset = PretermDataset()
print('Number of entries in dataset: {}'.format(len(dataset)))
#
x, y = dataset[5]
print('Shape of one item in x: {}'.format(x.shape))

**1.4.2 Create a `DataLoader` object to iterate through your `Dataset` class**

*How must we specify the* `DataLoader` *object in order to ensure random (i.e. Stochastic) selection of training observations?*

Initially set `batch_size` to equal the total number of examples

In [None]:
#
# Create a DataLoader
#
# ### STUDENTS TO COMPLETE ###
#
# Propose a batch size and instantiate a default PyTorch dataloader for this dataset
#
batch_size = None  
dataloader = None  
#
# Instantiate optimiser
#
learning_rate = 0.01
optimiser     = gradient_descent_momentum(0.9, learning_rate)
#

**1.4.3 Implement the training loop**

Create the training loop

*This network consists of a hidden ReLU layer, and an output layer containing 1 sigmoid unit*

In [None]:
#
# Train the model
#
import matplotlib.pyplot as plt
#
# ReLU activation
#
def relu(x):
    return x * (x >= 0)
#
# Sigmoid activation
#
def f(z):
    return 1 / (1 + torch.exp(-z))
#
# Calculate cost
#
def cost(y, y_pred):
    #
    epsilon = 1e-5
    #
    # Add epsilon to avoid log(0)
    #
    L = - y * torch.log(y_pred + epsilon) - (1 - y) * torch.log(1 - y_pred + epsilon)
    #
    J = torch.mean(L)
    #
    return J
#
# Performance metrics
#
def accuracy(y, y_pred, threshold = 0.5):
    #
    y_pred_thresholded  = y_pred > threshold
    correct_predictions = torch.sum(y == y_pred_thresholded)  
    total_predictions   = len(y)
    #
    accuracy = 100 * correct_predictions / total_predictions
    #
    return accuracy
#
# Initialise weights
#
W1 = torch.randn(5, 302)
W2 = torch.randn(1, 5)
#
cost_record_mlp     = []
accuracy_record_mlp = []
#
num_epochs    = 40
learning_rate = 1
#
# Training loop
#
for i in range(num_epochs):
    # STUDENTS CODE - implement training loop
    # as before implement the forwards and backwards pass with update
    # but this time use a dataloader to iterate
    #
    for batch_number, (data, labels) in enumerate(dataloader, 0):
        #
        # ### STUDENTS CODE ###
        #
        # FORWARD PASS
        #
        # Multiply input with hidden layer weights and apply ReLU
        #
        #Z1 = None
        #F1 = None
        Z1 = torch.matmul(W1, data.T)
        F1 = relu(Z1)
        #
        # Multiply hidden layer weights with output from hidden layer and apply sigmoid
        #
        Z2 = torch.matmul(W2, F1)
        F2 = f(Z2)  
        #Z2 = None
        #F2 = None
        #
        # Get the cost and accuracy and store in designated lists
        #
        #cost_val = None
        #cost_record_mlp.append(None)
        #accuracy_record_mlp.append(None)
        #
        cost_val = cost(labels, F2)
        #
        cost_record_mlp.append(cost_val)
        accuracy_record_mlp.append(accuracy(labels, F2))
        #
        # BACKWARD PASS
        #
        # Hint: use the chain rule to get del_J / del_W1 and del_J / del_W2 
        #
        # Output layer deltas
        #
        dL_dW2 = torch.matmul(F2 - labels, F1.T) 
        dJ_dW2 = (1 / data.shape[0]) * dL_dW2  # average over the batch, since J is the mean loss
        #dL_dW2 = None
        #dJ_dW2 = None
        #
        # Hidden layer deltas
        #
        dL_df1  = torch.matmul((F2 - labels).T, W2)  
        df1_dZ1 = 1.0 * (Z1 > 0)
        dL_dZ1  = torch.multiply(dL_df1.T, df1_dZ1)
        dL_dW1  = torch.matmul(dL_dZ1, data)
        dJ_dW1  = (1 / data.shape[0]) * dL_dW1  # average over the batch, since J is the mean loss
        #dL_df1  = None
        #df1_dZ1 = None
        #dL_dZ1  = None
        #dL_dW1  = None
        #dJ_dW1  = None
        #
        # Update the weights
        #
        W1, W2 = optimiser(W1, W2, dJ_dW1, dJ_dW2)
        #
    #
#
# Do plots
#
print('Training with a batch size of {}'.format(batch_size))
#
fig, ax = plt.subplots(1, 2, figsize=(18, 5))
#
ax[0].plot(cost_record_mlp)
ax[1].plot(accuracy_record_mlp)
# One value is recorded per batch, so the x-axis counts update steps
ax[0].set_xlabel('Update step')
ax[0].set_ylabel('Cost')
ax[1].set_xlabel('Update step')
ax[1].set_ylabel('Accuracy');
#

**1.4.4 Investigate the influence of batch size**

For a batch size of 101 we have (non-stochastic) gradient descent, since the entire training set has 101 entries. Try reducing the batch size to implement **stochastic** gradient descent. What do you notice about the training curves?

Also try swapping in the other optimisers you implemented above.

# Loss functions

### Exercise 1.5: Generalised Dice overlap ###

The Generalised Dice Loss (GDL) can be used for multiclass segmentation. The full paper is [here](https://arxiv.org/pdf/1707.03237.pdf).

Let's implement it.

First let's load in the data: a 2D slice from a T1 image, segmented into 7 classes. We have both a ground truth label and a label predicted by a partially trained neural network:

In [None]:
#
# Get and load data
#
!wget -nv https://github.com/IS-pillar-3/datasets/raw/main/images.npz
#
file  = np.load("images.npz")
#
image        = file["arr_0"]
ground_truth = file["arr_1"]
pred         = file["arr_2"]
#
# Label numerical equivalents
#
labels = {0: "Background",
          1: "CSF",
          2: "Basal Ganglia",
          3: "Cortex",
          4: "Brainstem",
          5: "Cerebellum",
          6: "White matter"}

Run the following cell to plot the data:

In [None]:
#
# Plot data
#
import seaborn as sns
from matplotlib import colors
from matplotlib.lines import Line2D
#
rgb_values   = sns.color_palette("Set3", 7)
cmap         = colors.ListedColormap(rgb_values, N=7)
custom_lines = [Line2D([0], [0], color=rgb_values[i], lw=4) for i in range(7)]
#
fig = plt.figure(figsize=(20,10))
#
plt.subplot(1, 3, 1)
plt.imshow(image,cmap='gray'); plt.axis('off'); plt.title('Image')
#
plt.subplot(1, 3, 2)
plt.imshow(ground_truth,cmap=cmap, vmin=0, vmax=6); plt.axis('off');plt.title('Ground Truth');
#
plt.subplot(1, 3, 3)
#
plt.imshow(pred,cmap=cmap, vmin=0, vmax=6)
plt.axis('off')
plt.title('Prediction');
plt.legend(custom_lines, labels.values(),loc='best', fontsize='medium');
#

The GDL can be expressed as 

$$
\mathrm{GDL}=1-2 \frac{\sum_{l=1}^{K} w_{l} \sum_{n} y_{l n} \hat{y}_{l n}}{\sum_{l=1}^{K} w_{l} \sum_{n}( y_{l n}+\hat{y}_{l n})}
$$

where:
- $y$ is the true segmentation map
- $\hat{y}$ is the predicted class label
- $w_l$ is the weight for class $l$ (out of $K$ classes in total)
- $l$ indexes the class
- $n$ indexes each pixel in the image

The class weight is estimated from:
$$w_l = 1 /\left(\sum_{n=1}^{N} y_{l n}\right)^{2}$$

which gives higher weight to classes with fewer examples.
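
As a small worked example of the weighting, consider a made-up one-hot ground truth with two classes of very different sizes. The array below is an assumption for illustration, not the tutorial data.

```python
import numpy as np

# Hypothetical one-hot ground truth with 2 classes on a 2 x 3 image
y = np.array([[[1, 1, 1],
               [1, 1, 0]],    # class 0 covers 5 pixels
              [[0, 0, 0],
               [0, 0, 1]]])   # class 1 covers 1 pixel

pixels_per_class = y.sum(axis=(1, 2))    # number of pixels in each class
weights = 1.0 / pixels_per_class ** 2    # inverse squared class size

print(pixels_per_class, weights)
```

Here the single-pixel class receives a weight of 1, while the five-pixel class receives 1/25, so the rare class contributes far more per pixel.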

The first stage is to one-hot encode the segmentation maps, transforming them from a $W \times H$ array to a $C \times W \times H$ array where each channel contains a binary segmentation mask for one class.
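
To make the target shape concrete, here is a sketch on a tiny made-up label map. It uses NumPy broadcasting rather than an explicit loop over classes, so it is an alternative formulation and not necessarily the approach intended for the exercise; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical 2 x 3 label map with 3 classes
label_map = np.array([[0, 2, 1],
                      [1, 1, 0]])
num_classes = 3

# Compare every class index against the map: (C, 1, 1) == (H, W) -> (C, H, W)
class_ids = np.arange(num_classes).reshape(-1, 1, 1)
encoded = (label_map == class_ids).astype(float)

print(encoded.shape)        # (3, 2, 3)
print(encoded.sum(axis=0))  # every pixel belongs to exactly one class
```

A useful check on any one-hot encoding is that summing over the class axis gives 1 at every pixel.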

**1.5.1 Implement a function to one hot encode the segmentation maps**

**Hint:** you will need to loop over all classes and, for each class $i$, create a binary segmentation (with the same dimensions as the original image) with value 1 where the pixel belongs to class $i$ and 0 where it does not.

In [None]:
#
# One hot encode the segmentation maps
#
def one_hot_encode(mask, num_classes):
    #
    # Initialise an empty mask for one hot encoding, with shape (num_classes, image_width, image_height)
    #
    mask_encoded = np.zeros((num_classes,mask.shape[0], mask.shape[1]))
    #
    # ### STUDENT'S CODE HERE ###
    #
    # Code a loop to fill mask_encoded - for each class we expect 1's only in the location of that region
    # 
    return mask_encoded
#

Check that the encoding makes sense. Run the following cell to one-hot encode and plot each class separately:

In [None]:
#
# One-hot encode each class
#
ground_truth_encoded = one_hot_encode(ground_truth, num_classes=7)
prediction_encoded   = one_hot_encode(pred, num_classes=7)
#
plt.figure(figsize=(20, 5))
for i in range(7):
    plt.subplot(2, 7, i + 1)
    plt.imshow(ground_truth_encoded[i, :, :]); plt.axis('off')
    plt.title(labels[i])
    plt.subplot(2, 7, i + 8)
    plt.imshow(prediction_encoded[i, :, :])
    plt.axis('off')

**1.5.2 Implement the GDL**

The function can be implemented without looping over all voxels, provided you make use of numpy vectorisation. Complete the function below to estimate:

a) the numerator $\sum_{l=1}^{K} w_{l} \sum_{n} y_{l n} \hat{y}_{l n}$

b) the denominator $\sum_{l=1}^{K} w_{l} \sum_{n}( y_{l n}+\hat{y}_{l n})$

c) the complete GDL 

To avoid division by zero when a class has no pixels in the ground truth, we suggest setting `weight=epsilon` in these circumstances.

In [None]:
#
# Function to apply Generalised Dice Overlap
#
def gdl(truth, prediction):
    #
    num_classes = truth.shape[0]
    numerator   = 0
    denominator = 0
    #
    # ### STUDENT CODE HERE ###
    #
    return 1 -  2 * np.divide(numerator, denominator, where = denominator != 0) 
#

Let's get the loss value for our example:

In [None]:
#
# Get loss value for example
#
gdl(ground_truth_encoded, prediction_encoded)
#

And sanity check: do we get a loss of 0 when our ground truth and prediction exactly match?
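
The expected answer can be worked out from the GDL formula. Assuming binary one-hot maps (so that $y_{ln}^2 = y_{ln}$), finite weights, and at least one labelled pixel, setting $\hat{y} = y$ gives:

$$
\mathrm{GDL} = 1 - 2 \frac{\sum_{l=1}^{K} w_{l} \sum_{n} y_{l n}^{2}}{\sum_{l=1}^{K} w_{l} \sum_{n} 2 y_{l n}} = 1 - 2 \cdot \frac{\sum_{l=1}^{K} w_{l} \sum_{n} y_{l n}}{2 \sum_{l=1}^{K} w_{l} \sum_{n} y_{l n}} = 1 - 2 \cdot \frac{1}{2} = 0
$$

so a correct implementation should return a value at, or numerically very close to, zero.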

In [None]:
#
# Check for loss of 0 when exact match
#
gdl(ground_truth_encoded, ground_truth_encoded)
#