## How to write a custom loss function in TensorFlow 2.0

Hey everyone, I've gotten a couple of questions about writing custom loss functions, so here's a complete example, brought to you from a nice coffee shop in NYC.  This notebook shows you how to:

* Write your own implementation of softmax.
* Write your own implementation of cross entropy loss.
* Compare your method against a built-in one: [sparse_softmax_cross_entropy_with_logits](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits).

The custom softmax and cross entropy loss take only a few lines of code.

Note: this notebook is **not** an explainer of how softmax or cross entropy works; it just shows the mechanics of writing your own version. We'll also train a simple model on MNIST using our implementation, just so we have a complete example that runs end-to-end.

### Install the nightly build

In [1]:
!pip install tf-nightly-2.0-preview



In [2]:
import tensorflow as tf
print("You have version", tf.__version__)
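# Compare the major version as an integer; comparing version strings
# directly would misorder versions such as "10.0" and "2.0".
assert int(tf.__version__.split(".")[0]) >= 2 # TensorFlow ≥ 2.0 required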

You have version 2.0.0-dev20190203


In [0]:
import numpy as np

from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Flatten
relu = tf.nn.relu  # same function, reached through the tf namespace imported above

In [0]:
epochs = 5
batch_size = 128
n_classes = 10

In [0]:
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Types are needed later when calculating loss
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)

In [0]:
shuffle_buffer = len(x_train)
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(shuffle_buffer)
train_dataset = train_dataset.batch(batch_size)

In [0]:
class MyModel(Model):
  def __init__(self):
    super(MyModel, self).__init__()
    self.flatten = Flatten()
    self.d1 = Dense(128)
    self.d2 = Dense(n_classes)

  def call(self, x):
    x = self.flatten(x)
    x = self.d1(x)
    x = relu(x)
    x = self.d2(x)
    return x 

### 1) Comparison point

I figured it'd be helpful to compare our implementation against the helper method [sparse_softmax_cross_entropy_with_logits](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits) so we can understand what it's doing. Let's unpack the name for starters.

* ```sparse``` indicates that our labels are integer encoded (as opposed to one-hot). If your labels were in one-hot format, you would use ```softmax_cross_entropy_with_logits``` instead.

* The next part of the name is ```softmax_cross_entropy_with_logits``` -- why are these grouped together? Softmax activation is commonly followed by cross entropy loss, so the two are grouped together for convenience.
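
To make the naming concrete, here's a small sketch showing that the sparse version with integer labels gives the same per-example losses as the non-sparse version with one-hot labels. The logits and labels are made-up values (not MNIST data), and the variable names are just for illustration.

```python
import numpy as np
import tensorflow as tf

# Made-up logits for a batch of 3 examples and 4 classes (illustrative values only).
logits = tf.constant([[2.0, 1.0, 0.1, -1.0],
                      [0.5, 2.5, 0.3, 0.0],
                      [1.0, 1.0, 1.0, 1.0]])

# Integer-encoded labels: one class index per example.
sparse_labels = tf.constant([0, 1, 3], dtype=tf.int32)

# The same labels in one-hot format.
one_hot_labels = tf.one_hot(sparse_labels, depth=4)

# "sparse" variant: takes integer labels.
sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=sparse_labels, logits=logits)

# Non-sparse variant: takes one-hot labels.
dense_loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=one_hot_labels, logits=logits)

print(sparse_loss.numpy())
print(dense_loss.numpy())
```

Both calls return one loss per example (three values here), and the two printed arrays should agree up to floating-point noise. The last row has identical logits for all four classes, so its loss works out to log(4).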

In [0]:
def built_in_loss(logits, labels):
  return tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(
          logits=logits, labels=labels))

### 2) Our implementation of softmax followed by cross entropy loss

Next, we'll write our own version of the above code. It's just the four lines below. To debug or understand what this block is doing, you can break it up into smaller pieces, and print out the shapes and/or data as you go. Note: you can also convert tensors to NumPy with ```.numpy()```.

In [0]:
def our_loss(logits, labels):
  
  # softmax part
  sm = tf.math.exp(logits) / tf.reduce_sum(tf.math.exp(logits), axis=1, keepdims=True)
  sm = tf.clip_by_value(sm, 1e-7, 1 - 1e-7)
  
  # loss part
  # Use the softmax output's dtype so the multiplication below never mixes float types
  labels = tf.one_hot(labels, n_classes, dtype=sm.dtype)
  return tf.reduce_mean(-tf.reduce_sum(labels * tf.math.log(sm), axis=1))  
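
One caveat about the softmax line: exponentiating large logits directly can overflow, and clipping afterwards doesn't repair a `nan`. A common trick (not something the code above does) is to subtract each row's maximum before exponentiating, which leaves the softmax unchanged. Here's a minimal sketch in plain NumPy with made-up logits; the function names are hypothetical.

```python
import numpy as np

def naive_softmax(logits):
    # Direct translation of the formula: exp(x) / sum(exp(x)).
    e = np.exp(logits)
    return e / np.sum(e, axis=1, keepdims=True)

def stable_softmax(logits):
    # Subtracting the row-wise max does not change the result,
    # because the common factor exp(-max) cancels in the ratio.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

# Made-up logits: the second row is large enough to overflow float32 exp.
logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]], dtype=np.float32)

# Silence the overflow warnings so the output stays readable.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(logits)

stable = stable_softmax(logits)

print(naive)   # second row contains nan
print(stable)  # both rows are valid probabilities
```

The same shift should carry over to the TensorFlow version by subtracting ```tf.reduce_max(logits, axis=1, keepdims=True)``` from the logits before calling ```tf.math.exp```.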

### Train the model

You can choose which loss function to use below (ours, or the built-in one).

In [0]:
def train_on_batch(model, images, labels):
  with tf.GradientTape() as tape:
    
    # Forward pass
    logits = model(images)
    loss_one = built_in_loss(logits, labels)
    loss_two = our_loss(logits, labels)    
    
  # Backward pass
  # I'll use our implementation to compute the gradients,
  # then let the optimizer apply them to the trainable weights.
  grads = tape.gradient(loss_two, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))
  return loss_one, loss_two
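
If the tape is new to you, here's the same record-then-differentiate pattern on a one-variable toy problem. The variable `w`, the loss, and the learning rate are all made up for illustration, and the update is done by hand instead of through an optimizer.

```python
import tensorflow as tf

# A single trainable variable standing in for the model's weights (hypothetical).
w = tf.Variable(3.0)

with tf.GradientTape() as tape:
    # Operations on variables inside this block are recorded.
    loss = w * w  # toy loss: w^2

# d(w^2)/dw = 2w, evaluated at w = 3.0
grad = tape.gradient(loss, w)
print(grad.numpy())

# One plain gradient-descent step, applied by hand.
learning_rate = 0.1
w.assign_sub(learning_rate * grad)
print(w.numpy())
```

In the training function above, the optimizer's ```apply_gradients``` plays the role of the manual ```assign_sub``` step, using Adam's update rule rather than plain gradient descent.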

In [0]:
# A helper function to calculate accuracy.
# You can play around with the shapes to see what's going on.
def calc_accuracy(logits, labels):
  predictions = tf.argmax(logits, axis=1)
  batch_size = int(logits.shape[0])
  acc = tf.reduce_sum(
      tf.cast(tf.equal(predictions, labels), dtype=tf.float32)) / batch_size
  return acc * 100
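
Here's one way to play with the shapes: the same steps run one at a time on a tiny made-up batch. The values are illustrative only, and I've used int64 labels so they match the default output type of ```tf.argmax```.

```python
import tensorflow as tf

# Made-up logits for 4 examples and 3 classes (illustrative values only).
logits = tf.constant([[2.0, 0.1, 0.3],
                      [0.2, 1.5, 0.1],
                      [0.9, 0.8, 0.7],
                      [0.1, 0.2, 3.0]])
labels = tf.constant([0, 1, 2, 2], dtype=tf.int64)

# argmax over the class axis gives one predicted class per example.
predictions = tf.argmax(logits, axis=1)
print(predictions.shape)    # (4,)
print(predictions.numpy())  # [0 1 0 2]

# Compare with the labels, cast True/False to 1.0/0.0, and average.
correct = tf.cast(tf.equal(predictions, labels), tf.float32)
accuracy = tf.reduce_mean(correct) * 100
print(accuracy.numpy())     # 75.0
```

Taking the mean of the 0/1 values is equivalent to summing them and dividing by the batch size, which is what the helper above does.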

In [12]:
# Loop over the dataset, grab batches, and train our model.
# As we go, verify that the loss returned by our implementation
# matches the built-in method's.
model = MyModel()
optimizer = tf.keras.optimizers.Adam()

for epoch in range(epochs):
  print("Epoch", epoch + 1, "\n")
  for (batch, (images, labels)) in enumerate(train_dataset):
    loss_one, loss_two = train_on_batch(model, images, labels)
    
    # You can use something like this as a quick sanity check
    tf.debugging.assert_near(loss_one, loss_two, atol=0.001, rtol=0.001)
    
    step = optimizer.iterations.numpy() 
    if step % 100 == 0:
      print("Step", step)
      print("Built-in loss: %.4f, Our loss: %.4f" % (loss_one.numpy(), loss_two.numpy()))
      print("")
      
  print('Train accuracy %.2f' % calc_accuracy(model(x_train), y_train))
  print('Test accuracy %.2f\n' % calc_accuracy(model(x_test), y_test))

Epoch 1 

Step 100
Built-in loss: 0.3350, Our loss: 0.3350

Step 200
Built-in loss: 0.2593, Our loss: 0.2593

Step 300
Built-in loss: 0.3217, Our loss: 0.3217

Step 400
Built-in loss: 0.2138, Our loss: 0.2138

Train accuracy 94.09
Test accuracy 93.99

Epoch 2 

Step 500
Built-in loss: 0.1589, Our loss: 0.1589

Step 600
Built-in loss: 0.1708, Our loss: 0.1708

Step 700
Built-in loss: 0.2407, Our loss: 0.2407

Step 800
Built-in loss: 0.2550, Our loss: 0.2550

Step 900
Built-in loss: 0.1880, Our loss: 0.1880

Train accuracy 96.26
Test accuracy 95.75

Epoch 3 

Step 1000
Built-in loss: 0.2820, Our loss: 0.2820

Step 1100
Built-in loss: 0.0662, Our loss: 0.0662

Step 1200
Built-in loss: 0.1961, Our loss: 0.1961

Step 1300
Built-in loss: 0.1316, Our loss: 0.1316

Step 1400
Built-in loss: 0.1194, Our loss: 0.1194

Train accuracy 97.22
Test accuracy 96.52

Epoch 4 

Step 1500
Built-in loss: 0.1720, Our loss: 0.1720

Step 1600
Built-in loss: 0.1083, Our loss: 0.1083

Step 1700
Built-in loss: 0.