### Distributed Training

This tutorial demonstrates how to use tf.distribute.Strategy—a TensorFlow API that provides an abstraction for distributing your training across multiple processing units (GPUs, multiple machines, or TPUs)—with custom training loops. In this example, you will train a simple convolutional neural network on the Fashion MNIST dataset containing 70,000 images of size 28 x 28.

In [1]:
# Import TensorFlow
import tensorflow as tf

# Helper libraries
import numpy as np
import os

print(tf.__version__)

2.18.0


In [2]:
fashion_mnist = tf.keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Add a dimension to the array -> new shape == (28, 28, 1)
# This is done because the first layer in our model is a convolutional
# layer and it requires a 4D input (batch_size, height, width, channels).
# batch_size dimension will be added later on.
train_images = train_images[..., None]
test_images = test_images[..., None]

# Scale the images to the [0, 1] range.
train_images = train_images / np.float32(255)
test_images = test_images / np.float32(255)

Create a strategy to distribute the variables and the graph
How does tf.distribute.MirroredStrategy work?

- All the variables and the model graph are replicated across the replicas.
- Input is evenly distributed across the replicas.
- Each replica calculates the loss and gradients for the input it received.
- The gradients are synced across all the replicas by summing them.
- After the sync, the same update is made to the copies of the variables on each replica.
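
The steps above can be illustrated without TensorFlow. The following is a minimal plain-Python sketch, not how MirroredStrategy is implemented internally. It assumes a one-weight linear model `y_hat = w * x` with a squared-error loss, and the helper names `local_gradient` and `mirrored_step` are hypothetical, introduced only for this illustration.

```python
# Plain-Python sketch of one synchronous data-parallel update.
# Assumption: a one-weight model y_hat = w * x with squared-error loss.

def local_gradient(w, xs, ys, global_batch_size):
    """Gradient of sum((w*x - y)^2) / global_batch_size over one replica's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / global_batch_size


def mirrored_step(weights, xs, ys, learning_rate):
    """Split the input, compute local gradients, sum them, update every copy."""
    num_replicas = len(weights)
    global_batch_size = len(xs)
    shard = global_batch_size // num_replicas  # input is split evenly
    grads = [
        local_gradient(weights[r],
                       xs[r * shard:(r + 1) * shard],
                       ys[r * shard:(r + 1) * shard],
                       global_batch_size)
        for r in range(num_replicas)
    ]
    total_grad = sum(grads)  # gradients are synced by summing them
    # The same update is applied to each replica's copy of the weight.
    return [w - learning_rate * total_grad for w in weights]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
replica_weights = mirrored_step([0.0, 0.0], xs, ys, learning_rate=0.01)  # two replicas
single_device = mirrored_step([0.0], xs, ys, learning_rate=0.01)         # one device
print(replica_weights, single_device)
```

In this toy setting the two replicas end up with identical weights, and that weight matches the one obtained on a single device that sees the whole batch.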

In [3]:
# If the list of devices is not specified in
# `tf.distribute.MirroredStrategy` constructor, they will be auto-detected.
strategy = tf.distribute.MirroredStrategy()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)


In [4]:
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

Number of devices: 1


In [5]:
BUFFER_SIZE = len(train_images)

BATCH_SIZE_PER_REPLICA = 16
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

EPOCHS = 10

Create the training and test datasets as tf.data.Dataset objects

In [6]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE)
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE)

# Distribute the datasets across the replicas of the strategy.
train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)

In [7]:
train_dist_dataset

<tensorflow.python.distribute.input_lib.DistributedDataset at 0x273d4be6d10>

In [8]:
def create_model():
  regularizer = tf.keras.regularizers.L2(1e-5)
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3,
                            activation='relu',
                            kernel_regularizer=regularizer),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Conv2D(64, 3,
                            activation='relu',
                            kernel_regularizer=regularizer),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64,
                            activation='relu',
                            kernel_regularizer=regularizer),
      tf.keras.layers.Dense(10, kernel_regularizer=regularizer)
    ])

  return model

In [9]:
create_model().summary()

In [10]:
def create_model2():
  regularizer = tf.keras.regularizers.L2(1e-5)
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3,
                            activation='relu',
                            kernel_regularizer=regularizer,
                            padding="same", # keeps the 28x28 spatial size; the default padding="valid" would shrink it to (28 - 3)/1 + 1 = 26
                            input_shape=(28,28,1)), # an input shape is required so that summary() can report the number of parameters
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Conv2D(64, 3,
                            activation='relu',
                            kernel_regularizer=regularizer),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64,
                            activation='relu',
                            kernel_regularizer=regularizer),
      tf.keras.layers.Dense(10, kernel_regularizer=regularizer)
    ])

  return model
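
The comments in create_model2 refer to the spatial output size and to the parameter count. As a rough check, the sketch below recomputes both by hand. It assumes the Keras defaults used above (stride 1 for Conv2D, a 2x2 window with stride 2 for MaxPooling2D, a bias in every layer), and the helper names are hypothetical, not part of Keras.

```python
# Sketch: output sizes and parameter counts for the layers in create_model2.
# The helper names below are hypothetical and not part of Keras.

def conv_output_size(size, kernel, stride=1, padding="valid"):
    """Spatial output size of a convolution or pooling window along one axis."""
    if padding == "same":
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1


def conv2d_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units


valid_side = conv_output_size(28, 3)                      # default padding="valid"
side = conv_output_size(28, 3, padding="same")            # first Conv2D
params = conv2d_params(3, 1, 32)
side = conv_output_size(side, 2, stride=2)                # MaxPooling2D, 2x2
side = conv_output_size(side, 3)                          # second Conv2D, "valid"
params += conv2d_params(3, 32, 64)
side = conv_output_size(side, 2, stride=2)                # MaxPooling2D, 2x2
flat = side * side * 64                                   # Flatten
params += dense_params(flat, 64) + dense_params(64, 10)   # two Dense layers

print(valid_side, side, flat, params)  # 26 6 2304 166986
```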

In [11]:
create_model2().summary()

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


In [12]:
# Create a checkpoint directory to store the checkpoints.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt.weights.h5")

Define the loss function
Recall that the loss function consists of one or two parts:

- The prediction loss measures how far off the model's predictions are from the training labels for a batch of training examples. It is computed for each labeled example and then reduced across the batch by computing the average value.
- Optionally, regularization loss terms can be added to the prediction loss, to steer the model away from overfitting the training data. A common choice is L2 regularization, which adds a small fixed multiple of the sum of squares of all model weights, independent of the number of examples. The model above uses L2 regularization to demonstrate its handling in the training loop below.

For training on a single machine with a single GPU/CPU, this works as follows:

- The prediction loss is computed for each example in the batch, summed across the batch, and then divided by the batch size.
- The regularization loss is added to the prediction loss.
- The gradient of the total loss is computed w.r.t. each model weight, and the optimizer updates each model weight from the corresponding gradient.

With tf.distribute.Strategy, the input batch is split between replicas. For example, let's say you have 4 GPUs, each with one replica of the model. One batch of 256 input examples is distributed evenly across the 4 replicas, so each replica gets a batch of size 64: We have 256 = 4*64, or generally GLOBAL_BATCH_SIZE = num_replicas_in_sync * BATCH_SIZE_PER_REPLICA.

Each replica computes the loss from the training examples it gets and computes the gradients of the loss w.r.t. each model weight. The optimizer takes care that these gradients are summed up across replicas before using them to update the copies of the model weights on each replica.

So, how should the loss be calculated when using a tf.distribute.Strategy?

- Each replica computes the prediction loss for all examples distributed to it, sums up the results, and divides the sum by num_replicas_in_sync * BATCH_SIZE_PER_REPLICA, or equivalently, GLOBAL_BATCH_SIZE.
- Each replica computes the regularization loss(es) and divides them by num_replicas_in_sync.

Compared to non-distributed training, all per-replica loss terms are scaled down by a factor of 1/num_replicas_in_sync. On the other hand, all loss terms -- or rather, their gradients -- are summed across that number of replicas before the optimizer applies them. In effect, the optimizer on each replica uses the same gradients as if a non-distributed computation with GLOBAL_BATCH_SIZE had happened. This is consistent with the distributed and undistributed behavior of Keras Model.fit. See the Distributed training with Keras tutorial on how a larger global batch size makes it possible to scale up the learning rate.
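
The scaling rule can be checked with a few numbers. The sketch below uses plain Python and made-up loss values (assumptions for illustration only): 4 replicas, a global batch of 8 examples, and a single regularization term. It shows that summing the scaled per-replica losses reproduces the non-distributed loss.

```python
# Made-up per-example prediction losses for one global batch of 8 examples.
per_example_losses = [0.5, 1.5, 1.0, 3.0, 2.0, 4.0, 0.0, 4.0]
regularization_loss = 0.25  # made-up L2 penalty, independent of the batch

num_replicas = 4
global_batch_size = len(per_example_losses)
batch_size_per_replica = global_batch_size // num_replicas

# Non-distributed reference: mean prediction loss plus regularization loss.
reference = sum(per_example_losses) / global_batch_size + regularization_loss

replica_losses = []
for r in range(num_replicas):
    start = r * batch_size_per_replica
    shard = per_example_losses[start:start + batch_size_per_replica]
    prediction_loss = sum(shard) / global_batch_size      # divide by GLOBAL batch size
    scaled_regularization = regularization_loss / num_replicas
    replica_losses.append(prediction_loss + scaled_regularization)

total = sum(replica_losses)  # what a SUM reduction across replicas yields
print(replica_losses)        # [0.3125, 0.5625, 0.8125, 0.5625]
print(total, reference)      # 2.25 2.25
```

Because differentiation is linear, the same identity holds for the gradients, which is why the summed per-replica gradients match the non-distributed gradient.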

Distributed training:

In the data-parallel training used here, the model is replicated on every device (e.g., GPUs or TPU cores), and each replica computes the loss for its share of the global batch. Because each per-replica loss is already divided by GLOBAL_BATCH_SIZE, the per-replica losses are combined by summing them, which yields the average loss over the global batch.

In [13]:
with strategy.scope():
  # Set reduction to `NONE` so you can do the reduction yourself.
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True,
      reduction=tf.keras.losses.Reduction.NONE)
  def compute_loss(labels, predictions, model_losses):
    # Per-example loss for the examples on this replica, shape (batch,).
    per_example_loss = loss_object(labels, predictions)
    print('per_example_loss', per_example_loss) # Debugging output
    # Sum the per-example losses and divide by the global batch size
    # (per-replica batch size * num_replicas_in_sync).
    loss = tf.nn.compute_average_loss(per_example_loss)
    print('average_loss', loss)
    # Add the model's regularization losses (if present), scaled by
    # 1/num_replicas_in_sync.
    if model_losses:
      loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
      print('model_losses', loss)
    return loss # this replica's contribution to the global-batch loss
  
  

Input batches shorter than GLOBAL_BATCH_SIZE create unpleasant corner cases in several places. In practice, it often works best to avoid them by allowing batches to span epoch boundaries using Dataset.repeat().batch() and defining approximate epochs by step counts, not dataset ends. Alternatively, Dataset.batch(drop_remainder=True) maintains the notion of epoch but drops the last few examples.
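
The two options can be contrasted on a toy list. This is a plain-Python sketch of the idea, not tf.data code: `repeated_batches` and `drop_remainder_batches` are hypothetical helpers that mimic Dataset.repeat().batch() and Dataset.batch(drop_remainder=True) on 10 examples with a batch size of 4, without shuffling.

```python
from itertools import cycle, islice


def repeated_batches(examples, batch_size):
    """Yield full batches forever; a batch may span the end of the dataset."""
    stream = cycle(examples)
    while True:
        yield list(islice(stream, batch_size))


def drop_remainder_batches(examples, batch_size):
    """Return only the full batches of one pass; leftover examples are dropped."""
    full = len(examples) // batch_size
    return [examples[i * batch_size:(i + 1) * batch_size] for i in range(full)]


examples = list(range(10))
batch_size = 4
steps_per_epoch = len(examples) // batch_size  # approximate epoch length in steps

spanning = list(islice(repeated_batches(examples, batch_size), steps_per_epoch + 1))
dropped = drop_remainder_batches(examples, batch_size)

print(spanning)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 1]]
print(dropped)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In both cases every batch has exactly `batch_size` examples. The first variant keeps examples 8 and 9 by letting the third batch wrap around, while the second never uses them in that pass.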

In [14]:
with strategy.scope():
  test_loss = tf.keras.metrics.Mean(name='test_loss') # This metric tracks the average loss on the test dataset.
  print('test_loss', test_loss) 

  train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='train_accuracy') # This metric tracks the accuracy of the model on the training dataset.
  print('train_accuracy', train_accuracy)
  test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='test_accuracy') # This metric tracks the accuracy of the model on the test dataset.
  print('test_accuracy', test_accuracy)

test_loss <Mean name=test_loss>
train_accuracy <SparseCategoricalAccuracy name=train_accuracy>
test_accuracy <SparseCategoricalAccuracy name=test_accuracy>


Training loop

In [15]:
# A model, an optimizer, and a checkpoint must be created under `strategy.scope`.
with strategy.scope():
  model = create_model2()

  optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

  checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)

Two ways to apply the gradients

1. Apply gradients inside the per-replica step

Steps:

- Each replica computes the loss for its local batch, scaled as described in the loss section above.
- Each replica computes gradients for its local batch.
- Each replica calls optimizer.apply_gradients() inside the function passed to strategy.run().
- The per-replica losses are summed with strategy.reduce() for tracking.

Where are gradients applied? Inside the replica step, before the losses are summed for reporting.

Does this desynchronize the replicas? Not under a synchronous strategy such as tf.distribute.MirroredStrategy. Each replica computes gradients only from its own mini-batch, but, as noted in the loss section above, the optimizer sums the gradients across replicas before updating, so every copy of a variable receives the same update and the copies do not drift apart.

When could replicas see different weights? Only when updates are applied without this synchronization, as in asynchronous training (e.g., tf.distribute.ParameterServerStrategy). tf.distribute.MultiWorkerMirroredStrategy is synchronous, like MirroredStrategy.

2. Return gradients from each replica, reduce them, then apply them once

Steps:

- Each replica computes its local loss and its local gradients.
- The losses are summed across replicas using strategy.reduce().
- The gradients are summed across replicas.
- A single weight update is applied with the summed gradients.

Where are gradients applied? After the reduction across all replicas.

Notes on this method:

- Gradients cannot be computed from the reduced loss itself: the reduction happens outside tf.GradientTape, so each replica still has to compute gradients from its own loss.
- The summed gradients are based on one global batch, not on the entire dataset.
- The resulting update is mathematically the same as in method 1 (the sum of the per-replica gradients), so it is not expected to give better synchronization, lower variance, or faster convergence.
- In this notebook (TensorFlow 2.18.0), calling optimizer.apply_gradients() outside strategy.run() fails with the AttributeError shown in the traceback after the training loop below.

Which method to use?

Method 1 is the one used by the commented-out train_step below and needs no manual gradient aggregation. Method 2 is the experiment attempted in cells In [19] and In [26] below.

In [16]:
# def train_step(inputs):
#   images, labels = inputs # Unpacks the input data into images and labels.

#   with tf.GradientTape() as tape:
#     # Forward pass
#     predictions = model(images, training=True) # trainable parameters for   training
#     loss = compute_loss(labels, predictions, model.losses) # average loss of all sample for a replica batch 
#     print('loss:', loss)
    
#   # Backward pass:
#   gradients = tape.gradient(loss, model.trainable_variables) # Computes the gradients of the loss for replica batch with respect to the model's trainable variables weights.
#   print('gradients:', gradients)
#   optimizer.apply_gradients(zip(gradients, model.trainable_variables)) # Updates the model's trainable variables using the gradients and optimizer.

#   train_accuracy.update_state(labels, predictions) # Updates the training accuracy metric.
#   return loss # Returns per replica average loss

def test_step(inputs):
  images, labels = inputs # Unpacks the input data into images and labels.

  predictions = model(images, training=False) # inference mode (training-only behavior disabled)
  print('predictions:', predictions)
  t_loss = loss_object(labels, predictions) # per-example losses (reduction is NONE)
  print('t_loss:', t_loss)

  test_loss.update_state(t_loss) # Updates the test loss metric.
  test_accuracy.update_state(labels, predictions) # Updates the test accuracy metric.

In [17]:
EPOCHS = 5

On a single device (GPU or CPU):

The loss is computed for each example in a batch, the per-example losses are summed and divided by the batch size to get the average loss, and the gradients of that loss are then computed and used by the optimizer to update the weights.

In distributed training with strategy.reduce():

The model and all of its variables are copied to each replica, and each global batch is split into per-replica batches. Each replica (e.g., each GPU) computes the loss for every example in its batch, sums these losses, and divides by the global batch size. The gradients of these scaled losses are summed across replicas, and the same update is applied to the copy of the weights on every replica. strategy.reduce() sums the per-replica losses so that the loss over the global batch can be reported.

In [18]:
# strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
# combines the values returned by each replica using the SUM reduction.
# ReduceOp:
# tf.distribute.ReduceOp is an enumeration of the reduction operations (SUM, MEAN)
# that can be used to aggregate values across replicas.
# SUM reduction:
# SUM adds up the values across replicas. Here it sums the losses from each replica.
# Per-replica losses:
# per_replica_losses holds one value per replica. In this notebook each value is a
# scalar, because compute_loss has already reduced over the examples in the replica's batch.
# Reduction:
# When you call strategy.reduce, TensorFlow performs the following steps:
# 1. Gather values: collect the value from each replica.
# 2. Apply reduction: apply the specified operation (here SUM) across replicas.
#    With axis=None, no axis of the individual values is reduced.
# 3. Return result: return the reduced value.
# Example:
# Suppose you have 2 replicas, each returning [1.0, 2.0, 3.0]:
# [[1.0, 2.0, 3.0],
#  [1.0, 2.0, 3.0]]
# After the SUM reduction with axis=None, the result is:
# [2.0, 4.0, 6.0]
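
The reduction described in the comments above can be mimicked in plain Python. This is only a sketch of the arithmetic, under the assumption that the reduction is element-wise across replicas. `reduce_sum` is a hypothetical helper, not the TensorFlow implementation, and the scalar loss values are made up.

```python
def reduce_sum(per_replica_values):
    """Element-wise sum across replicas, mimicking ReduceOp.SUM with axis=None."""
    first = per_replica_values[0]
    if isinstance(first, (list, tuple)):
        # One vector per replica: add matching positions, keep the vector shape.
        return [sum(values) for values in zip(*per_replica_values)]
    # One scalar per replica: add them up.
    return sum(per_replica_values)


vector_result = reduce_sum([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
scalar_result = reduce_sum([0.3125, 0.5625])  # one scalar loss per replica

print(vector_result)  # [2.0, 4.0, 6.0]
print(scalar_result)  # 0.875
```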

Alternative approach:

Return the gradients from each replica and apply them after summing them across replicas.

In [19]:
def train_step(inputs):
  images, labels = inputs # Unpacks the input data into images and labels.
  print(images)
  print(labels)

  with tf.GradientTape() as tape:
    # Forward pass
    predictions = model(images, training=True) # training mode
    print('predictions:', predictions)
    loss = compute_loss(labels, predictions, model.losses) # this replica's loss, scaled by the global batch size
    print('loss:', loss)

  gradients = tape.gradient(loss, model.trainable_variables)  # Compute gradients

  train_accuracy.update_state(labels, predictions) # Updates the training accuracy metric.
  return loss, gradients # Returns this replica's loss and gradients; no update is applied here


In [26]:
@tf.function
def distributed_train_step(dataset_inputs):
    #  Run `train_step()` across all replicas
    per_replica_losses, per_replica_gradients = strategy.run(train_step, args=(dataset_inputs,))
    
    print("per replica loss",per_replica_losses)

    #  Compute global loss BEFORE applying gradients # Reduce the loss across replicas
    total_loss = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
    print("total_loss: ", total_loss)
    
    #  Aggregate gradients across replicas
    # reduced_gradients = [
    #     strategy.reduce(tf.distribute.ReduceOp.SUM, g, axis=None) 
    #     for g in zip(*per_replica_gradients)
    # ]
    # Use `tf.nest.map_structure` to aggregate gradients
    reduced_gradients = tf.nest.map_structure(
        lambda *g: strategy.reduce(tf.distribute.ReduceOp.SUM, g, axis=None),
        per_replica_gradients
    )
    
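    # Apply the aggregated gradients.
    # Note: this call runs in cross-replica context (outside `strategy.run`);
    # in the run recorded below it raises the AttributeError shown in the traceback.
    with strategy.scope():
        optimizer.apply_gradients(zip(reduced_gradients, model.trainable_variables))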

    #  Compute gradients globally (on the averaged loss)
    # with tf.GradientTape() as tape:
    #     global_loss = compute_loss(labels, predictions, model.losses) # Use summed loss
    # gradients = tape.gradient(global_loss, model.trainable_variables)

    # with strategy.scope():
        #  Apply gradients globally (after averaging across replicas)
        # optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        # strategy.run(optimizer.apply_gradients, args=(zip(gradients, model.trainable_variables),))

    return total_loss


In [27]:

@tf.function
def distributed_test_step(dataset_inputs):
  return strategy.run(test_step, args=(dataset_inputs,))

In [22]:
train_dist_dataset

<tensorflow.python.distribute.input_lib.DistributedDataset at 0x273d4be6d10>

In [23]:
next(iter(train_dist_dataset))

(<tf.Tensor: shape=(16, 28, 28, 1), dtype=float32, numpy=
 array([[[[0.],
          [0.],
          [0.],
          ...,
          [0.],
          [0.],
          [0.]],
 
         ...

In [28]:
print(tf.distribute.get_strategy())  # Outside `strategy.scope()` this returns the default (no-op) strategy, as the output below shows

<tensorflow.python.distribute.distribute_lib._DefaultDistributionStrategy object at 0x00000273D733D250>


In [29]:
# `run` replicates the provided computation and runs it
# with the distributed input.
# @tf.function
# def distributed_train_step(dataset_inputs):
#   per_replica_losses = strategy.run(train_step, args=(dataset_inputs,)) #  runs the train_step function on each replica in the distributed environment
#   # sum up per replica average losses across all replicas
#   return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, #This line reduces the list of losses across all replicas using the SUM reduction operation.
#                         axis=None) # The axis=None argument specifies that the reduction should be performed across all axes.
                        

for epoch in range(EPOCHS):  
  # TRAIN LOOP
  total_loss = 0.0
  num_batches = 0
  for x in train_dist_dataset: # iterate over the distributed training dataset, one global batch at a time
    total_loss += distributed_train_step(x) # run one distributed training step and accumulate its loss
    num_batches += 1

  train_loss = total_loss / num_batches # average loss per global batch over the epoch

  # TEST LOOP
  for x in test_dist_dataset:
    distributed_test_step(x)

  if epoch % 2 == 0:
    checkpoint.save(checkpoint_prefix)

  template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, "
              "Test Accuracy: {}")
  print(template.format(epoch + 1, train_loss,
                         train_accuracy.result() * 100, test_loss.result(),
                         test_accuracy.result() * 100))

  # Reset states for next epoch
  test_loss.reset_state()
  train_accuracy.reset_state()
  test_accuracy.reset_state()

Tensor("dataset_inputs:0", shape=(16, 28, 28, 1), dtype=float32)
Tensor("dataset_inputs_1:0", shape=(16,), dtype=uint8)
predictions: Tensor("sequential_2_1/dense_5_1/Add:0", shape=(16, 10), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0)
per_example_loss Tensor("sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits:0", shape=(16,), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0)
average_loss Tensor("div_no_nan:0", shape=(), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0)
model_losses Tensor("add:0", shape=(), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0)
loss: Tensor("add:0", shape=(), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0)
per replica loss Tensor("add:0", shape=(), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0)
total_loss:  Tensor("Identity:0", shape=(), dtype=float32, device=/job:localhost/replica:0/task:

AttributeError: in user code:

    File "C:\Users\lenovo\AppData\Local\Temp\ipykernel_176\3253139696.py", line 25, in distributed_train_step  *
        optimizer.apply_gradients(zip(reduced_gradients, model.trainable_variables))
    File "d:\a27_YEARS_OLD\deep_learning\venv\Lib\site-packages\keras\src\optimizers\base_optimizer.py", line 344, in apply_gradients  **
        self.apply(grads, trainable_variables)
    File "d:\a27_YEARS_OLD\deep_learning\venv\Lib\site-packages\keras\src\optimizers\base_optimizer.py", line 409, in apply
        self._backend_apply_gradients(grads, trainable_variables)
    File "d:\a27_YEARS_OLD\deep_learning\venv\Lib\site-packages\keras\src\optimizers\base_optimizer.py", line 472, in _backend_apply_gradients
        self._backend_update_step(
    File "d:\a27_YEARS_OLD\deep_learning\venv\Lib\site-packages\keras\src\backend\tensorflow\optimizer.py", line 122, in _backend_update_step
        tf.__internal__.distribute.interim.maybe_merge_call(

    AttributeError: 'NoneType' object has no attribute 'merge_call'


In [84]:
eval_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='eval_accuracy')

new_model = create_model2()
new_optimizer = tf.keras.optimizers.Adam()

test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE)

In [85]:
@tf.function
def eval_step(images, labels):
  predictions = new_model(images, training=False)
  eval_accuracy(labels, predictions)

In [67]:
checkpoint = tf.train.Checkpoint(optimizer=new_optimizer, model=new_model)
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

for images, labels in test_dataset:
  eval_step(images, labels)

print('Accuracy after restoring the saved model without strategy: {}'.format(
    eval_accuracy.result() * 100))

Accuracy after restoring the saved model without strategy: 91.97000122070312


In [34]:
for epoch in range(EPOCHS):
  total_loss = 0.0
  num_batches = 0
  train_iter = iter(train_dist_dataset)

  # Run a fixed number of steps per epoch instead of the whole dataset.
  for _ in range(10):
    total_loss += distributed_train_step(next(train_iter))
    num_batches += 1
  average_train_loss = total_loss / num_batches

  template = ("Epoch {}, Loss: {}, Accuracy: {}")
  print(template.format(epoch + 1, average_train_loss, train_accuracy.result() * 100))
  train_accuracy.reset_state()

Epoch 1, Loss: 0.15078850090503693, Accuracy: 94.6875
Epoch 2, Loss: 0.10492142289876938, Accuracy: 97.5
Epoch 3, Loss: 0.15510988235473633, Accuracy: 95.625
Epoch 4, Loss: 0.1762147843837738, Accuracy: 94.375
Epoch 5, Loss: 0.10520371049642563, Accuracy: 95.625


In [36]:
@tf.function
def distributed_train_epoch(dataset):
  # Assumes the `train_step` variant that applies the gradients itself and
  # returns only the loss (the commented-out version in cell In [16]).
  total_loss = 0.0
  num_batches = 0
  for x in dataset:
    per_replica_losses = strategy.run(train_step, args=(x,))
    total_loss += strategy.reduce(
      tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
    num_batches += 1
  return total_loss / tf.cast(num_batches, dtype=tf.float32)

for epoch in range(EPOCHS):
  train_loss = distributed_train_epoch(train_dist_dataset)

  template = ("Epoch {}, Loss: {}, Accuracy: {}")
  print(template.format(epoch + 1, train_loss, train_accuracy.result() * 100))

  train_accuracy.reset_state()

per_example_loss Tensor("while/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits:0", shape=(None,), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0)
loss Tensor("while/div_no_nan:0", shape=(), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0)
model_losses Tensor("while/add:0", shape=(), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0)
Epoch 1, Loss: 0.14403024315834045, Accuracy: 95.26249694824219
Epoch 2, Loss: 0.132389634847641, Accuracy: 96.10166931152344
Epoch 3, Loss: 0.1244950219988823, Accuracy: 96.38333129882812
Epoch 4, Loss: 0.11615363508462906, Accuracy: 96.82666778564453
Epoch 5, Loss: 0.11036597192287445, Accuracy: 97.08333587646484
