## Write a custom loss function

This notebook is a quick example that shows how to:

* Write your own implementation of softmax.
* Write your own implementation of cross entropy loss.
* Compare your method against a built-in one, in this case: [sparse_softmax_cross_entropy_with_logits](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits).

The loss itself takes only a few lines of code. To have an end-to-end example, we'll also train a simple MNIST model using our implementation.

Note: this notebook is **not** an explainer of how softmax or cross entropy works, just the mechanics of writing your own version. 

### Install the nightly build

In [1]:
!pip install tf-nightly-2.0-preview



In [2]:
import tensorflow as tf
print("You have version", tf.__version__)
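# Compare the major version as an integer; comparing version strings
# lexically would misorder e.g. "10.0" and "2.0".
assert int(tf.__version__.split(".")[0]) >= 2 # TensorFlow ≥ 2.0 required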

You have version 2.0.0-dev20190203


In [0]:
import numpy as np

from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.nn import relu

In [0]:
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

x_train = x_train / 255
x_test = x_test / 255

# Cast the labels to int32: sparse_softmax_cross_entropy_with_logits,
# the built-in loss we compare against later, expects integer labels.
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)

In [0]:
BATCH_SIZE = 128
BUFFER_SIZE = len(x_train)

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(BATCH_SIZE)

In [0]:
class MyModel(Model):
  def __init__(self):
    super(MyModel, self).__init__()
    self.flatten = Flatten()
    self.d1 = Dense(128)
    self.d2 = Dense(10)

  def call(self, x):
    x = self.flatten(x)
    x = self.d1(x)
    x = relu(x)
    x = self.d2(x)
    return x 
  
model = MyModel()
optimizer = tf.keras.optimizers.Adam()

### 1) Comparison point

I thought it'd be helpful to compare our implementation against a built-in method [sparse_softmax_cross_entropy_with_logits](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits), so we can unpack what it's doing. Note: for a more modern example of a built-in loss function, check out this [example](https://github.com/random-forests/applied-dl/blob/master/examples/2.2-hello-subclassing.ipynb).

Let's unpack the name for starters.

* ```sparse``` indicates that our labels are integer encoded (as opposed to one-hot). If your labels were in one-hot format, you would use ```softmax_cross_entropy_with_logits``` instead.

* The next part of the name is ```softmax_cross_entropy_with_logits``` -- why are these grouped together? Softmax activation is commonly followed by cross entropy loss, so the two are grouped together for convenience.
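
To make the sparse versus one-hot distinction concrete, here is a small NumPy sketch (not the TensorFlow ops themselves). The batch values and variable names are made up for illustration. It shows that indexing the true class with integer labels gives the same per-example loss as multiplying by a one-hot mask and summing.

```python
import numpy as np

# Hypothetical batch: 3 examples, 4 classes.
logits = np.array([[2.0, 1.0, 0.1, -1.0],
                   [0.5, 2.5, 0.3, 0.0],
                   [1.0, 1.0, 1.0, 1.0]])
sparse_labels = np.array([0, 1, 3])        # integer-encoded labels
one_hot_labels = np.eye(4)[sparse_labels]  # the same labels, one-hot encoded

# Plain softmax (fine here because the logits are small).
exp = np.exp(logits)
probs = exp / exp.sum(axis=1, keepdims=True)
log_probs = np.log(probs)

# "Sparse" form: pick out the log-probability of the true class directly.
loss_sparse = -log_probs[np.arange(len(sparse_labels)), sparse_labels]

# One-hot form: mask with the one-hot labels, then sum over classes.
loss_one_hot = -(one_hot_labels * log_probs).sum(axis=1)

print(loss_sparse)
print(loss_one_hot)
```

The two arrays match, so the choice between the sparse and one-hot variants comes down to how your labels are stored.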

In [0]:
def built_in_loss(logits, labels):
  return tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(
          logits=logits, labels=labels))

### 2) Our implementation of softmax followed by cross entropy loss

Next, we'll write our own version of the above code. It's just the four lines below. To debug or understand what this block is doing, you can break it up into smaller pieces, and print out the shapes and/or data as you go. Note: you can also convert tensors to NumPy with ```.numpy()```.

In [0]:
def our_loss(logits, labels, n_classes=10):  
  # softmax part
  sm = tf.math.exp(logits) / tf.reduce_sum(tf.math.exp(logits), axis=1, keepdims=True)
  sm = tf.clip_by_value(sm, 1e-7, 1 - 1e-7)
  # loss part
  labels = tf.one_hot(labels, n_classes, dtype=tf.float32)
  return tf.reduce_mean(-tf.reduce_sum(labels * tf.math.log(sm), axis=1))  
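
One caveat with exponentiating the logits directly: for very large logits the exponential overflows. A common workaround, sketched below in NumPy, is to subtract each row's maximum before exponentiating, which leaves the softmax unchanged. The function names and test values here are hypothetical, and this is an illustration of the idea rather than a description of how the built-in op is implemented.

```python
import numpy as np

def naive_loss(logits, labels):
    # Same steps as our_loss (exponentiate, normalize, clip, log), using
    # integer indexing in place of the one-hot multiply.
    exp = np.exp(logits)
    sm = exp / exp.sum(axis=1, keepdims=True)
    sm = np.clip(sm, 1e-7, 1 - 1e-7)
    return np.mean(-np.log(sm[np.arange(len(labels)), labels]))

def stable_loss(logits, labels):
    # Shift each row by its max so the largest exponent is exp(0) = 1.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_sm = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return np.mean(-log_sm[np.arange(len(labels)), labels])

# On ordinary logits the two versions agree closely.
small = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
labels = np.array([0, 1])
naive_small = naive_loss(small, labels)
stable_small = stable_loss(small, labels)
print(naive_small, stable_small)

# Hypothetical extreme logits: exp(1000.) overflows to inf in float64,
# so the naive version returns nan while the shifted version stays finite.
large = np.array([[1000.0, 0.0, -1000.0]])
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = naive_loss(large, np.array([0]))
stable_large = stable_loss(large, np.array([0]))
print(naive_large, stable_large)
```

For the MNIST model in this notebook the logits stay small, so the simpler version is fine; the shift only matters when logits grow large.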

### Choose which loss function to use in our training loop

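The training step below computes both losses so we can compare them, and uses ours for the backward pass. To train with the built-in one instead, pass ```loss_one``` to ```tape.gradient```.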

In [0]:
def train_on_batch(model, images, labels):
  with tf.GradientTape() as tape:
    # Forward pass
    logits = model(images)
    loss_one = built_in_loss(logits, labels)
    loss_two = our_loss(logits, labels)    
    
  # Backward pass
  # Use our implementation to compute the gradients.
  grads = tape.gradient(loss_two, model.variables)
  optimizer.apply_gradients(zip(grads, model.variables))
  return loss_one, loss_two
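
For intuition about what the backward pass starts from, here is a NumPy sketch of the gradient of the mean softmax cross entropy loss with respect to the logits, which works out to `(softmax - one_hot) / batch_size` when the clipping step is ignored. The helper names and batch values are hypothetical, and a finite-difference check is included so you don't have to take the formula on faith.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def mean_cross_entropy(logits, labels):
    probs = softmax(logits)
    return np.mean(-np.log(probs[np.arange(len(labels)), labels]))

# Hypothetical batch: 2 examples, 3 classes.
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
labels = np.array([0, 1])

# Analytic gradient of the mean loss with respect to the logits.
one_hot = np.eye(logits.shape[1])[labels]
analytic = (softmax(logits) - one_hot) / len(labels)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        bump = np.zeros_like(logits)
        bump[i, j] = eps
        numeric[i, j] = (mean_cross_entropy(logits + bump, labels)
                         - mean_cross_entropy(logits - bump, labels)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

In the notebook, ```tf.GradientTape``` does this for you and continues the chain rule back through the dense layers.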

Since we're writing lower-level code, here is also a NumPy-style way of calculating accuracy. For a more modern example using object-oriented metrics (which you should use in practice), see this [example](https://github.com/random-forests/applied-dl/blob/master/examples/2.2-hello-subclassing.ipynb).

In [0]:
# Low-level code ahead.
# See the note above for a better way of calculating accuracy in practice.
def calc_accuracy(logits, labels):
  predictions = tf.argmax(logits, axis=1)
  batch_size = int(logits.shape[0])
  acc = tf.reduce_sum(
      tf.cast(tf.equal(predictions, labels), dtype=tf.float32)) / batch_size
  return acc * 100
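
The same calculation in plain NumPy may be easier to poke at. This is a sketch on a tiny made-up batch, with a hypothetical function name. Note that softmax isn't needed here: the largest logit in a row is also the largest probability.

```python
import numpy as np

def calc_accuracy_np(logits, labels):
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=1)
    # The mean of a boolean array is the fraction of matches.
    return np.mean(predictions == labels) * 100

# Hypothetical batch: 4 examples, 3 classes.
logits = np.array([[0.1, 2.0, 0.3],
                   [1.5, 0.2, 0.1],
                   [0.0, 0.1, 3.0],
                   [0.9, 0.8, 0.7]])
labels = np.array([1, 0, 2, 2], dtype=np.int32)

acc = calc_accuracy_np(logits, labels)
print(acc)  # 3 of 4 predictions are correct
```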

In [11]:
# Loop over the dataset, grab batches, and train our model.
# As we go, verify that the loss returned by our implementation
# matches the one returned by the built-in method.

EPOCHS = 5

for epoch in range(EPOCHS):
  print("Epoch", epoch + 1, "\n")
  for (batch, (images, labels)) in enumerate(train_dataset):
    loss_one, loss_two = train_on_batch(model, images, labels)
    
    # You can use something like this as a quick sanity check
    tf.debugging.assert_near(loss_one, loss_two, atol=0.001, rtol=0.001)
    
    step = optimizer.iterations.numpy() 
    if step % 100 == 0:
      print("Step", step)
      print("Built-in loss: %.4f, Our loss: %.4f" % (loss_one.numpy(), loss_two.numpy()))
      print("")
      
  print('Train accuracy %.2f' % calc_accuracy(model(x_train), y_train))
  print('Test accuracy %.2f\n' % calc_accuracy(model(x_test), y_test))

Epoch 1 

Step 100
Built-in loss: 0.4477, Our loss: 0.4477

Step 200
Built-in loss: 0.3544, Our loss: 0.3544

Step 300
Built-in loss: 0.3427, Our loss: 0.3427

Step 400
Built-in loss: 0.1488, Our loss: 0.1488

Train accuracy 94.62
Test accuracy 94.36

Epoch 2 

Step 500
Built-in loss: 0.2276, Our loss: 0.2276

Step 600
Built-in loss: 0.2211, Our loss: 0.2211

Step 700
Built-in loss: 0.1373, Our loss: 0.1373

Step 800
Built-in loss: 0.1791, Our loss: 0.1791

Step 900
Built-in loss: 0.0953, Our loss: 0.0953

Train accuracy 96.36
Test accuracy 95.80

Epoch 3 

Step 1000
Built-in loss: 0.1585, Our loss: 0.1585

Step 1100
Built-in loss: 0.1464, Our loss: 0.1464

Step 1200
Built-in loss: 0.1424, Our loss: 0.1424

Step 1300
Built-in loss: 0.0840, Our loss: 0.0840

Step 1400
Built-in loss: 0.1131, Our loss: 0.1131

Train accuracy 97.21
Test accuracy 96.40

Epoch 4 

Step 1500
Built-in loss: 0.0510, Our loss: 0.0510

Step 1600
Built-in loss: 0.0980, Our loss: 0.0980

Step 1700
Built-in loss: 0.