# <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">1. | Introduction  ⌛</div>
DTensor provides a way for you to distribute the training of your model across devices to improve efficiency, reliability and scalability. For more details on DTensor concepts, see the [DTensor Programming Guide](https://www.kaggle.com/code/atrisaxena/tensorflow-dtensor-distributed-training-part1).

In this notebook, you will train a Sentiment Analysis model with DTensor. Three distributed training schemes are demonstrated with this example:

* **Data Parallel training:** where the training samples are sharded (partitioned) to devices.
* **Model Parallel training:** where the model variables are sharded to devices.
* **Spatial Parallel training:** where the features of input data are sharded to devices. (Also known as Spatial Partitioning)
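To make the difference between the three schemes concrete, here is a rough sketch that splits plain Python lists rather than real DTensors. The helper names (`shard_rows`, `shard_columns`, `shape`) and the toy shapes are assumptions made for this illustration only; they are not part of the DTensor API.

```python
# Toy illustration only: nested Python lists stand in for tensors.

def shard_rows(matrix, num_devices):
    """Split a matrix (list of rows) into equal row blocks, one per device."""
    assert len(matrix) % num_devices == 0, "rows must divide evenly"
    size = len(matrix) // num_devices
    return [matrix[d * size:(d + 1) * size] for d in range(num_devices)]

def shard_columns(matrix, num_devices):
    """Split a matrix into equal column blocks, one per device."""
    num_cols = len(matrix[0])
    assert num_cols % num_devices == 0, "columns must divide evenly"
    size = num_cols // num_devices
    return [[row[d * size:(d + 1) * size] for row in matrix]
            for d in range(num_devices)]

def shape(matrix):
    return (len(matrix), len(matrix[0]))

# A batch of 4 examples with 6 features each, and a 6x2 weight matrix.
batch = [[float(10 * i + j) for j in range(6)] for i in range(4)]
weights = [[0.1 * (i + j) for j in range(2)] for i in range(6)]

# Data parallel: each device holds a slice of the training examples.
data_parallel = shard_rows(batch, num_devices=2)
# Model parallel: each device holds a slice of the model variables.
model_parallel = shard_columns(weights, num_devices=2)
# Spatial parallel: each device holds a slice of every example's features.
spatial_parallel = shard_columns(batch, num_devices=2)

print(shape(data_parallel[0]))     # (2, 6)
print(shape(model_parallel[0]))    # (6, 1)
print(shape(spatial_parallel[0]))  # (4, 3)
```

In each case the full tensor can be recovered by concatenating the per-device pieces along the axis that was split.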


This Notebook will walk through the following steps:

- First start with some data cleaning to obtain a `tf.data.Dataset` of tokenized sentences and their polarity.

- Next build an MLP model with custom Dense and BatchNorm layers. Use a `tf.Module` to track the inference variables. The model constructor takes additional `Layout` arguments to control the sharding of variables.

- For training, you will first use data parallel training together with `tf.experimental.dtensor`'s checkpoint feature. Then continue with Model Parallel Training and Spatial Parallel Training.

- The final section briefly describes the interaction between `tf.saved_model` and `tf.experimental.dtensor` as of TensorFlow 2.9.

In [1]:
!apt -y install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2
!pip install tensorflow==2.9.0 tensorflow-datasets




E: Unable to locate package libcudnn8

> <div style="font-weight:bold;font-size:20px;line-height:1.5;padding:12px;background-color: #f9ff21;font-family: Verdana, Cursive; color: #ff1f5a;border: 5px solid #17b978;">
    "Sharding here simply means dividing a large Tensor into smaller parts (also Tensors) that are placed on different devices."
    <br>
    <div> The word shard means "a small part of a whole."</div>
    </div>

In [2]:
from IPython.display import HTML  # only HTML is used below

# ----- Notebook Theme -----
color_map = ['#16a085', '#e8f6f3', '#d0ece7', '#a2d9ce', '#73c6b6', '#45b39d', 
                        '#16a085', '#138d75', '#117a65', '#0e6655', '#0b5345']

prompt = color_map[-1]
main_color = color_map[0]
strong_main_color = color_map[1]
custom_colors = [strong_main_color, main_color]

css_file = ''' 

div #notebook {
background-color: white;
line-height: 20px;
}

#notebook-container {
%s
margin-top: 2em;
padding-top: 2em;
border-top: 4px solid %s; /* light orange */
-webkit-box-shadow: 0px 0px 8px 2px rgba(224, 212, 226, 0.5); /* pink */
    box-shadow: 0px 0px 8px 2px rgba(224, 212, 226, 0.5); /* pink */
}

div .input {
margin-bottom: 1em;
}

.rendered_html h1, .rendered_html h2, .rendered_html h3, .rendered_html h4, .rendered_html h5, .rendered_html h6 {
color: %s; /* light orange */
font-weight: 600;
}

div.input_area {
border: none;
    background-color: %s; /* rgba(229, 143, 101, 0.1); light orange [exactly #E58F65] */
    border-top: 2px solid %s; /* light orange */
}

div.input_prompt {
color: %s; /* light blue */
}

div.output_prompt {
color: %s; /* strong orange */
}

div.cell.selected:before, div.cell.selected.jupyter-soft-selected:before {
background: %s; /* light orange */
}

div.cell.selected, div.cell.selected.jupyter-soft-selected {
    border-color: %s; /* light orange */
}

.edit_mode div.cell.selected:before {
background: %s; /* light orange */
}

.edit_mode div.cell.selected {
border-color: %s; /* light orange */

}
'''
def to_rgb(h): 
    return tuple(int(h[i:i+2], 16) for i in [0, 2, 4])

main_color_rgba = 'rgba(%s, %s, %s, 0.1)' % (to_rgb(main_color[1:]))
open('notebook.css', 'w').write(css_file % ('width: 95%;', main_color, main_color, main_color_rgba, main_color,  main_color, prompt, main_color, main_color, main_color, main_color))

def nb(): 
    return HTML("<style>" + open("notebook.css", "r").read() + "</style>")
nb()

# <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">2. | Importing Libraries  ⌛</div>

In [3]:
import tempfile
import numpy as np
import tensorflow_datasets as tfds
import warnings
warnings.filterwarnings("ignore")

import tensorflow as tf

from tensorflow.experimental import dtensor
print('TensorFlow version:', tf.__version__)



TensorFlow version: 2.9.0


# <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">3. | Configure Virtual CPUs</div>

Next, configure TensorFlow to use 8 virtual CPUs.

Even though this example uses CPUs, DTensor works the same way on CPU, GPU or TPU devices.

In [4]:
def configure_virtual_cpus(ncpu):
  phy_devices = tf.config.list_physical_devices('CPU')
  tf.config.set_logical_device_configuration(phy_devices[0], [
        tf.config.LogicalDeviceConfiguration(),
    ] * ncpu)

configure_virtual_cpus(8)
DEVICES = [f'CPU:{i}' for i in range(8)]

tf.config.list_logical_devices('CPU')


[LogicalDevice(name='/device:CPU:0', device_type='CPU'),
 LogicalDevice(name='/device:CPU:1', device_type='CPU'),
 LogicalDevice(name='/device:CPU:2', device_type='CPU'),
 LogicalDevice(name='/device:CPU:3', device_type='CPU'),
 LogicalDevice(name='/device:CPU:4', device_type='CPU'),
 LogicalDevice(name='/device:CPU:5', device_type='CPU'),
 LogicalDevice(name='/device:CPU:6', device_type='CPU'),
 LogicalDevice(name='/device:CPU:7', device_type='CPU')]

## <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">4. | Download the dataset</div>
Download the IMDB reviews data set to train the sentiment analysis model.

In [5]:
train_data = tfds.load('imdb_reviews', split='train', shuffle_files=True, batch_size=64)
train_data


<_OptionsDataset element_spec={'label': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>

## <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">5. | Prepare the data</div>
First, tokenize the text. This example uses an extension of one-hot encoding: the `'tf_idf'` mode of `tf.keras.layers.TextVectorization`.

- For the sake of speed, limit the number of tokens to 1200.
- To keep the `tf.Module` simple, run `TextVectorization` as a preprocessing step before the training.

The final result of the data cleaning section is a `Dataset` with the tokenized text as `x` and label as `y`.

**Note**: Running `TextVectorization` as a preprocessing step is **neither a usual practice nor a recommended one** as doing so assumes the training data fits into the client memory, which is not always the case.
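As a rough picture of what a TF-IDF encoding produces, the sketch below builds a tiny vocabulary and weights token counts by a smoothed inverse document frequency. It is a simplified stand-in, not the Keras implementation: the function names are hypothetical, the IDF formula is an assumption, and details such as punctuation stripping and the out-of-vocabulary slot are ignored.

```python
import math
from collections import Counter

def fit_vocab_and_idf(documents, max_tokens):
    """Keep the most frequent tokens and compute one IDF weight for each."""
    term_counts = Counter()
    doc_counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        term_counts.update(tokens)
        doc_counts.update(set(tokens))  # count each token once per document
    vocab = [token for token, _ in term_counts.most_common(max_tokens)]
    num_docs = len(documents)
    # Smoothed IDF (assumed for this sketch; the library's weighting may differ).
    idf = {token: math.log(1 + num_docs / (1 + doc_counts[token]))
           for token in vocab}
    return vocab, idf

def tfidf_vector(doc, vocab, idf):
    """Encode one document as a fixed-length vector of count * IDF values."""
    counts = Counter(doc.lower().split())
    return [counts[token] * idf[token] for token in vocab]

corpus = ["a great great movie", "a dull movie", "great acting"]
vocab, idf = fit_vocab_and_idf(corpus, max_tokens=3)
vector = tfidf_vector(corpus[0], vocab, idf)
print(vocab)
print(vector)
```

Every document maps to a vector whose length equals the vocabulary size, which is why the vectorized dataset in this notebook has a fixed feature dimension of 1200.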


In [6]:
text_vectorization = tf.keras.layers.TextVectorization(output_mode='tf_idf', max_tokens=1200, output_sequence_length=None)
text_vectorization.adapt(data=train_data.map(lambda x: x['text']))

In [7]:
def vectorize(features):
  return text_vectorization(features['text']), features['label']

train_data_vec = train_data.map(vectorize)
train_data_vec

<MapDataset element_spec=(TensorSpec(shape=(None, 1200), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

## <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">6. | Build a neural network with DTensor</div>

Now build a Multi-Layer Perceptron (MLP) network with `DTensor`. The network will use fully connected Dense and BatchNorm layers.

`DTensor` expands TensorFlow through single-program multi-data (SPMD) expansion of regular TensorFlow Ops according to the `dtensor.Layout` attributes of their input `Tensor` and variables.

Variables of `DTensor` aware layers are `dtensor.DVariable`, and the constructors of `DTensor` aware layer objects take additional `Layout` inputs in addition to the usual layer parameters.

Note: As of TensorFlow 2.9, Keras layers such as `tf.keras.layers.Dense` and `tf.keras.layers.BatchNormalization` accept `dtensor.Layout` arguments. Refer to the [DTensor Keras Integration Tutorial](https://www.tensorflow.org/tutorials/distribute/dtensor_keras_tutorial) for more information on using Keras with DTensor.

### <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">6.1. | Dense Layer</div>
The following custom Dense layer defines 2 layer variables: $W_{ij}$ is the variable for weights, and $b_j$ is the variable for the biases.

$$
y_j = \sigma(\sum_i x_i W_{ij} + b_j)
$$


### <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">6.2. | Layout deduction</div>
The bias layout used by the `Dense` layer below (the bias is sharded the same way as the last axis of the weight) comes from the following observations:

- The preferred DTensor sharding for operands to a matrix dot product $t_j = \sum_i x_i W_{ij}$ is to shard $\mathbf{W}$ and $\mathbf{x}$ the same way along the $i$-axis.

- The preferred DTensor sharding for operands to a matrix sum $t_j + b_j$ is to shard $\mathbf{t}$ and $\mathbf{b}$ the same way along the $j$-axis.
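The two observations can be checked by hand on a tiny example. The sketch below uses plain Python lists and two pretend devices; the names `matvec`, `x_shards` and `W_shards` are made up for the illustration, and the cross-device sum stands in for whatever collective a real system would use.

```python
def matvec(x, w):
    """t_j = sum_i x_i * W_ij for a vector x and a matrix W (list of rows)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

x = [1.0, 2.0, 3.0, 4.0]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0],
     [1.0, -1.0]]
b = [0.5, -0.5]

# Shard x and W the same way along the i-axis across two "devices".
x_shards = [x[:2], x[2:]]
W_shards = [W[:2], W[2:]]

# Each device computes a partial dot product from its local shards only.
partials = [matvec(xs, ws) for xs, ws in zip(x_shards, W_shards)]

# Summing the partial results across devices recovers the full product.
t = [sum(p[j] for p in partials) for j in range(len(b))]

# The sum t_j + b_j is elementwise in j, so t and b share their j sharding.
y = [t[j] + b[j] for j in range(len(b))]

print(t)  # [11.0, 4.0]
print(y)  # [11.5, 3.5]
```

Because both operands are split along the same $i$ range, each device can compute its partial sum without needing any rows held by the other device.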


In [8]:
class Dense(tf.Module):

  def __init__(self, input_size, output_size,
               init_seed, weight_layout, activation=None):
    super().__init__()

    random_normal_initializer = tf.function(tf.random.stateless_normal)

    self.weight = dtensor.DVariable(
        dtensor.call_with_layout(
            random_normal_initializer, weight_layout,
            shape=[input_size, output_size],
            seed=init_seed
            ))
    if activation is None:
      activation = lambda x:x
    self.activation = activation
    
    # bias is sharded the same way as the last axis of weight.
    bias_layout = weight_layout.delete([0])

    self.bias = dtensor.DVariable(
        dtensor.call_with_layout(tf.zeros, bias_layout, [output_size]))

  def __call__(self, x):
    y = tf.matmul(x, self.weight) + self.bias
    y = self.activation(y)

    return y

### <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">6.3. | BatchNorm</div>
A batch normalization layer helps avoid collapsing modes while training. In this case, adding batch normalization layers helps model training avoid producing a model that only produces zeros.

The constructor of the custom `BatchNorm` layer below does not take a `Layout` argument. This is because `BatchNorm` has no layer variables. This still works with DTensor because 'x', the only input to the layer, is already a DTensor that represents the global batch.

Note: With DTensor, the input Tensor 'x' always represents the global batch. Therefore `tf.nn.batch_normalization` is applied to the global batch. This differs from training with `tf.distribute.MirroredStrategy`, where Tensor 'x' only represents the per-replica shard of the batch (the local batch).
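The difference between normalizing over the global batch and over per-replica batches is easy to see numerically. This is a minimal sketch in plain Python with made-up values; it mirrors the mean/variance normalization used by the `BatchNorm` layer but does not use TensorFlow.

```python
def mean_and_variance(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def normalize(values, mean, variance, epsilon=1e-5):
    return [(v - mean) / (variance + epsilon) ** 0.5 for v in values]

# One feature column of a global batch of 4 examples, split over 2 replicas.
global_batch = [1.0, 2.0, 3.0, 10.0]
local_batches = [global_batch[:2], global_batch[2:]]

# Global view: a single mean and variance for the whole batch.
global_mean, global_var = mean_and_variance(global_batch)
global_result = normalize(global_batch, global_mean, global_var)

# Local view: each replica normalizes with its own statistics.
local_result = []
for local in local_batches:
    local_mean, local_var = mean_and_variance(local)
    local_result.extend(normalize(local, local_mean, local_var))

print(global_mean, global_var)  # 4.0 12.5
print(global_result)
print(local_result)
```

The two results differ because the outlier `10.0` only affects the statistics of the replica that holds it in the local view, while it shifts the statistics for every example in the global view.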

In [9]:
class BatchNorm(tf.Module):

  def __init__(self):
    super().__init__()

  def __call__(self, x, training=True):
    if not training:
      # This branch is not used in the Tutorial.
      pass
    mean, variance = tf.nn.moments(x, axes=[0])
    return tf.nn.batch_normalization(x, mean, variance, 0.0, 1.0, 1e-5)

A full featured batch normalization layer (such as `tf.keras.layers.BatchNormalization`) will need Layout arguments for its variables.

In [10]:
def make_keras_bn(bn_layout):
  return tf.keras.layers.BatchNormalization(gamma_layout=bn_layout,
                                            beta_layout=bn_layout,
                                            moving_mean_layout=bn_layout,
                                            moving_variance_layout=bn_layout,
                                            fused=False)

### <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">6.4. | Putting Layers Together</div>
Next, build a Multi-layer perceptron (MLP) network with the building blocks above.  The diagram below shows the axis relationships between the input `x` and the weight matrices for the two `Dense` layers without any DTensor sharding or replication applied.

<img src="https://www.tensorflow.org/images/dtensor/no_dtensor.png" alt="The input and weight matrices for a non distributed model." class="no-filter">


The output of the first `Dense` layer is passed into the input of the second `Dense` layer (after the `BatchNorm`). Therefore, the preferred DTensor sharding for the output axis of the first `Dense` layer's weights ($\mathbf{W_1}$) and the input axis of the second `Dense` layer's weights ($\mathbf{W_2}$) is to shard $\mathbf{W_1}$ and $\mathbf{W_2}$ the same way along the common axis $\hat{j}$:

$$
\mathsf{Layout}[{W_{1,ij}}; i, j] = \left[\hat{i}, \hat{j}\right] \\
\mathsf{Layout}[{W_{2,jk}}; j, k] = \left[\hat{j}, \hat{k} \right]
$$

Even though the layout deduction shows that the 2 layouts are not independent, for the sake of simplicity of the model interface, `MLP` will take 2 `Layout` arguments, one per Dense layer.
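Since the two layouts are passed independently, a caller may want a quick sanity check that they agree on the shared axis. The sketch below represents each layout as a plain list of sharding specs; `shared_axis_matches` is a hypothetical helper, and `UNSHARDED` is only a stand-in for `dtensor.UNSHARDED`. It flags a departure from the preferred sharding and makes no claim about how DTensor itself handles a mismatch.

```python
UNSHARDED = "unsharded"  # stand-in for dtensor.UNSHARDED in this sketch

def shared_axis_matches(w1_specs, w2_specs):
    """True if W1's output axis (j) and W2's input axis (j) are sharded alike."""
    return w1_specs[1] == w2_specs[0]

# Fully replicated weights: trivially consistent.
replicated = ([UNSHARDED, UNSHARDED], [UNSHARDED, UNSHARDED])
# The common j axis is sharded over a "model" mesh dimension in both layers.
model_sharded = ([UNSHARDED, "model"], ["model", UNSHARDED])
# W1 shards j but W2 does not: the layouts disagree on the common axis.
mismatched = ([UNSHARDED, "model"], [UNSHARDED, UNSHARDED])

for name, (w1, w2) in [("replicated", replicated),
                       ("model_sharded", model_sharded),
                       ("mismatched", mismatched)]:
    print(name, shared_axis_matches(w1, w2))
```

A constructor that builds both layouts from shared mesh dimension names removes the need for such a check, at the cost of a less flexible interface.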

In [11]:
from typing import Tuple

class MLP(tf.Module):

  def __init__(self, dense_layouts: Tuple[dtensor.Layout, dtensor.Layout]):
    super().__init__()

    self.dense1 = Dense(
        1200, 48, (1, 2), dense_layouts[0], activation=tf.nn.relu)
    self.bn = BatchNorm()
    self.dense2 = Dense(48, 2, (3, 4), dense_layouts[1])

  def __call__(self, x):
    y = x
    y = self.dense1(y)
    y = self.bn(y)
    y = self.dense2(y)
    return y


The trade-off between correctness in layout deduction constraints and simplicity of API is a common design point of APIs that use DTensor.
It is also possible to capture the dependency between `Layout`s with a different API. For example, the `MLPStricter` class creates the `Layout` objects in the constructor.

In [12]:
class MLPStricter(tf.Module):

  def __init__(self, mesh, input_mesh_dim, inner_mesh_dim1, output_mesh_dim):
    super().__init__()

    self.dense1 = Dense(
        1200, 48, (1, 2), dtensor.Layout([input_mesh_dim, inner_mesh_dim1], mesh),
        activation=tf.nn.relu)
    self.bn = BatchNorm()
    self.dense2 = Dense(48, 2, (3, 4), dtensor.Layout([inner_mesh_dim1, output_mesh_dim], mesh))


  def __call__(self, x):
    y = x
    y = self.dense1(y)
    y = self.bn(y)
    y = self.dense2(y)
    return y

To make sure the model runs, probe your model with fully replicated layouts and a fully replicated batch of `'x'` input.

In [13]:
WORLD = dtensor.create_mesh([("world", 8)], devices=DEVICES)

model = MLP([dtensor.Layout.replicated(WORLD, rank=2),
             dtensor.Layout.replicated(WORLD, rank=2)])

sample_x, sample_y = train_data_vec.take(1).get_single_element()
sample_x = dtensor.copy_to_mesh(sample_x, dtensor.Layout.replicated(WORLD, rank=2))
print(model(sample_x))



tf.Tensor([[-7.14074945 6.86515808]
 [-5.61041498 5.04737663]
 [-4.339118 2.92188787]
 ...
 [6.87280273 -3.5677619]
 [8.27548695 -5.70918512]
 [-1.98807597 1.71495867]], layout="sharding_specs:unsharded,unsharded, mesh:|world=8|0,1,2,3,4,5,6,7|0,1,2,3,4,5,6,7|/job:localhost/replica:0/task:0/device:CPU:0,/job:localhost/replica:0/task:0/device:CPU:1,/job:localhost/replica:0/task:0/device:CPU:2,/job:localhost/replica:0/task:0/device:CPU:3,/job:localhost/replica:0/task:0/device:CPU:4,/job:localhost/replica:0/task:0/device:CPU:5,/job:localhost/replica:0/task:0/device:CPU:6,/job:localhost/replica:0/task:0/device:CPU:7", shape=(64, 2), dtype=float32)


## <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">7. | Moving data to the device</div>
Usually, `tf.data` iterators (and other data fetching methods) yield tensor objects backed by the local host device memory. This data must be transferred to the accelerator device memory that backs DTensor's component tensors.

`dtensor.copy_to_mesh` is unsuitable for this situation because it replicates input tensors to all devices due to DTensor's global perspective. So in this tutorial, you will use a helper function `repack_local_tensor`, to facilitate the transfer of data. This helper function uses `dtensor.pack` to send (and only send) the shard of the global batch that is intended for a replica to the device backing the replica.

This simplified function assumes a single-client application. Determining the correct way to split the local tensor, and the mapping between the pieces of the split and the local devices, can be laborious in a multi-client application.

Additional DTensor API to simplify `tf.data` integration is planned, supporting both single-client and multi-client applications. Please stay tuned.
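The difference between replicating a batch and sending each device only its shard can be sketched without TensorFlow. Below, a list of 64 example ids stands in for a global batch and a dictionary keyed by device name stands in for device memory; `split_for_devices` is a hypothetical helper for a one-dimensional mesh only.

```python
def split_for_devices(global_batch, device_names):
    """Map each device name to its contiguous slice of the global batch."""
    num_devices = len(device_names)
    if len(global_batch) % num_devices != 0:
        raise ValueError("batch size must be divisible by the device count")
    size = len(global_batch) // num_devices
    return {name: global_batch[k * size:(k + 1) * size]
            for k, name in enumerate(device_names)}

devices = [f"CPU:{i}" for i in range(8)]
global_batch = list(range(64))  # 64 example ids, one global batch

# Replication: every device holds a copy of the whole batch.
replicated = {name: list(global_batch) for name in devices}

# Sharding: every device holds only the slice intended for it.
per_device = split_for_devices(global_batch, devices)

print(len(replicated["CPU:0"]))  # 64
print(len(per_device["CPU:0"]))  # 8
print(per_device["CPU:7"])
```

With 8 devices, sharding moves one eighth of the batch to each device, whereas replication moves the full batch eight times.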

In [14]:
def repack_local_tensor(x, layout):
  """Repacks a local Tensor-like to a DTensor with layout.

  This function assumes a single-client application.
  """
  x = tf.convert_to_tensor(x)
  sharded_dims = []

  # For every sharded dimension, use tf.split to split the tensor along that dimension.
  # The result is a nested list of split-tensors in queue[0].
  queue = [x]
  for axis, dim in enumerate(layout.sharding_specs):
    if dim == dtensor.UNSHARDED:
      continue
    num_splits = layout.shape[axis]
    queue = tf.nest.map_structure(lambda x: tf.split(x, num_splits, axis=axis), queue)
    sharded_dims.append(dim)

  # Now we can build the list of component tensors by looking up the location in
  # the nested list of split-tensors created in queue[0].
  components = []
  for locations in layout.mesh.local_device_locations():
    t = queue[0]
    for dim in sharded_dims:
      split_index = locations[dim]  # Only valid on single-client mesh.
      t = t[split_index]
    components.append(t)

  return dtensor.pack(components, layout)

## <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">8. | Data parallel training</div>
In this section, you will train your MLP model with data parallel training. The following sections will demonstrate model parallel training and spatial parallel training.

Data parallel training is a commonly used scheme for distributed machine learning:

 - Model variables are replicated on N devices each.
 - A global batch is split into N per-replica batches.
 - Each per-replica batch is trained on the replica device.
 - The gradients are reduced across replicas before the weight update is performed collectively on all replicas.

Data parallel training provides nearly linear speedup regarding the number of devices.
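The steps listed above can be traced on a one-parameter model. This is a minimal sketch in plain Python, assuming a summed squared-error loss and two replicas; the function and variable names are illustrative only.

```python
def grad_sum_squared_error(w, xs, ys):
    """d/dw of sum_n (w * x_n - y_n)^2 over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
num_replicas = 2
per_replica = len(xs) // num_replicas

# Each replica computes a gradient from its own shard of the global batch.
local_grads = [
    grad_sum_squared_error(
        w,
        xs[r * per_replica:(r + 1) * per_replica],
        ys[r * per_replica:(r + 1) * per_replica])
    for r in range(num_replicas)
]

# Reduce (sum) across replicas, then apply the same update on every replica.
reduced_grad = sum(local_grads)
learning_rate = 0.01
replica_weights = [w - learning_rate * reduced_grad
                   for _ in range(num_replicas)]

print(local_grads)   # [-15.0, -75.0]
print(reduced_grad)  # -90.0
```

Because the loss is a sum over examples, the reduced gradient equals the gradient computed on the whole batch, so every replica ends the step with identical weights.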

### <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">8.1 | Creating a data parallel mesh</div>
A typical data parallelism training loop uses a DTensor `Mesh` that consists of a single `batch` dimension, where each device becomes a replica that receives a shard from the global batch.

<img src="https://www.tensorflow.org/images/dtensor/dtensor_data_para.png" alt="Data parallel mesh" class="no-filter">


The replicated model runs on the replica, therefore the model variables are fully replicated (unsharded).

In [15]:
mesh = dtensor.create_mesh([("batch", 8)], devices=DEVICES)

model = MLP([dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh),
             dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh),])




### <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">8.2 | Packing training data to DTensors</div>
The training data batch should be packed into DTensors sharded along the `'batch'` (first) axis, such that DTensor will evenly distribute the training data to the `'batch'` mesh dimension.

**Note**: In DTensor, the `batch size` always refers to the global batch size. The batch size should be chosen such that it can be divided evenly by the size of the `batch` mesh dimension.
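A quick arithmetic check shows why the batch size of 64 works with an 8-way `batch` mesh dimension here. The sketch assumes the IMDB train split has 25,000 examples and that the final partial batch is kept rather than dropped; `per_replica_size` is a hypothetical helper.

```python
num_examples = 25000   # size of the IMDB train split
batch_size = 64        # global batch size
mesh_batch_dim = 8     # size of the 'batch' mesh dimension

def per_replica_size(global_batch_size, num_replicas):
    """Return the per-replica batch size, or fail if it is not an integer."""
    if global_batch_size % num_replicas != 0:
        raise ValueError(
            f"global batch of {global_batch_size} cannot be split evenly "
            f"across {num_replicas} replicas")
    return global_batch_size // num_replicas

full_batches, last_batch = divmod(num_examples, batch_size)

print(full_batches, last_batch)                      # 390 40
print(per_replica_size(batch_size, mesh_batch_dim))  # 8
print(per_replica_size(last_batch, mesh_batch_dim))  # 5
```

Both the full batches (64 examples) and the trailing batch (40 examples) divide evenly by 8, so every batch can be sharded. A different batch size or dataset size could leave a final batch that does not divide evenly, which would need to be dropped or padded.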

In [16]:
def repack_batch(x, y, mesh):
  x = repack_local_tensor(x, layout=dtensor.Layout(['batch', dtensor.UNSHARDED], mesh))
  y = repack_local_tensor(y, layout=dtensor.Layout(['batch'], mesh))
  return x, y

sample_x, sample_y = train_data_vec.take(1).get_single_element()
sample_x, sample_y = repack_batch(sample_x, sample_y, mesh)

print('x', sample_x[:, 0])
print('y', sample_y)

x tf.Tensor({"CPU:0": [57.1319923 85.6979904 66.6539917 ... 438.011932 111.089981 260.267975], "CPU:1": [117.437981 66.6539917 107.915985 ... 146.003983 260.267975 47.609993], "CPU:2": [215.83197 317.399963 136.481979 ... 355.487946 206.309967 101.567986], "CPU:3": [57.1319923 107.915985 79.3499908 ... 63.4799919 203.135971 371.357941], "CPU:4": [206.309967 73.0019913 34.9139938 ... 44.4359932 69.8279877 95.219986], "CPU:5": [82.5239868 219.005966 434.837952 ... 98.3939896 95.219986 345.965942], "CPU:6": [282.485962 38.0879936 133.307983 ... 174.569977 79.3499908 79.3499908], "CPU:7": [215.83197 590.363892 107.915985 ... 238.049973 244.397964 82.5239868]}, layout="sharding_specs:batch, mesh:|batch=8|0,1,2,3,4,5,6,7|0,1,2,3,4,5,6,7|/job:localhost/replica:0/task:0/device:CPU:0,/job:localhost/replica:0/task:0/device:CPU:1,/job:localhost/replica:0/task:0/device:CPU:2,/job:localhost/replica:0/task:0/device:CPU:3,/job:localhost/replica:0/task:0/device:CPU:4,/job:localhost/replica:0/task:0/de

### <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">8.3 | Training step</div>
This example uses a Stochastic Gradient Descent optimizer with the Custom Training Loop (CTL). Consult the [Custom Training Loop guide](https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch) and [Walk through](https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough) for more information on those topics.

The `train_step` is encapsulated as a `tf.function` to indicate this body is to be traced as a TensorFlow Graph. The body of `train_step` consists of a forward inference pass, a backward gradient pass, and the variable update.

Note that the body of `train_step` does not contain any special DTensor annotations. Instead, `train_step` only contains high-level TensorFlow operations that process the input `x` and `y` from the global view of the input batch and the model. All of the DTensor annotations (`Mesh`, `Layout`) are factored out of the train step.
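For readers who want to see what the loss and the metrics amount to numerically, here is a small pure-Python sketch of a sparse softmax cross-entropy summed over a batch, along with the per-sample loss and the accuracy. It follows the same definitions as the training step but is not the TensorFlow implementation; the function names and the toy logits are invented for the example.

```python
import math

def sparse_softmax_cross_entropy(logits, label):
    """-log(softmax(logits)[label]), using the log-sum-exp trick for stability."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[label]

def batch_metrics(batch_logits, labels):
    # Sum the per-example losses into one scalar for the global batch.
    total_loss = sum(sparse_softmax_cross_entropy(z, y)
                     for z, y in zip(batch_logits, labels))
    # Predicted class = index of the largest logit (first index on ties).
    predictions = [max(range(len(z)), key=z.__getitem__) for z in batch_logits]
    errors = sum(1 for p, y in zip(predictions, labels) if p != y)
    return {"loss": total_loss / len(labels),
            "accuracy": 1.0 - errors / len(labels)}

batch_logits = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 3.0]]
labels = [0, 1, 0, 0]
toy_metrics = batch_metrics(batch_logits, labels)
print(toy_metrics)
```

Nothing in this computation refers to devices or shards, which is the point made above: the step is written against the global batch, and the distribution is decided by the layouts of its inputs.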

In [17]:
# Refer to the CTL (custom training loop guide)
@tf.function
def train_step(model, x, y, learning_rate=tf.constant(1e-4)):
  with tf.GradientTape() as tape:
    logits = model(x)
    # tf.reduce_sum sums the batch sharded per-example loss to a replicated
    # global loss (scalar).
    loss = tf.reduce_sum(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logits, labels=y))
  parameters = model.trainable_variables
  gradients = tape.gradient(loss, parameters)
  for parameter, parameter_gradient in zip(parameters, gradients):
    parameter.assign_sub(learning_rate * parameter_gradient)

  # Define some metrics
  accuracy = 1.0 - tf.reduce_sum(tf.cast(tf.argmax(logits, axis=-1, output_type=tf.int64) != y, tf.float32)) / x.shape[0]
  loss_per_sample = loss / len(x)
  return {'loss': loss_per_sample, 'accuracy': accuracy}

### <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">8.4 | Checkpointing</div>
You can checkpoint a DTensor model using `dtensor.DTensorCheckpoint`. The format of a DTensor checkpoint is fully compatible with a standard TensorFlow checkpoint. There is ongoing work to consolidate `dtensor.DTensorCheckpoint` into `tf.train.Checkpoint`.

When a DTensor checkpoint is restored, `Layout`s of variables can be different from when the checkpoint is saved. This tutorial makes use of this feature to continue the training in the Model Parallel training and Spatial Parallel training sections.


In [18]:
CHECKPOINT_DIR = tempfile.mkdtemp()

def start_checkpoint_manager(mesh, model):
  ckpt = dtensor.DTensorCheckpoint(mesh, root=model)
  manager = tf.train.CheckpointManager(ckpt, CHECKPOINT_DIR, max_to_keep=3)

  if manager.latest_checkpoint:
    print("Restoring a checkpoint")
    ckpt.restore(manager.latest_checkpoint).assert_consumed()
  else:
    print("new training")
  return manager


### <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">8.5 | Training loop</div>
For the data parallel training scheme, train for a few epochs and report the progress. The 2 epochs used here are insufficient for training the model; an accuracy of 50% is as good as random guessing.

Enable checkpointing so that you can pick up the training later. In the following section, you will load the checkpoint and train with a different parallel scheme.

In [19]:
num_epochs = 2
manager = start_checkpoint_manager(mesh, model)

for epoch in range(num_epochs):
  step = 0
  pbar = tf.keras.utils.Progbar(target=int(train_data_vec.cardinality()), stateful_metrics=[])
  metrics = {'epoch': epoch}
  for x,y in train_data_vec:

    x, y = repack_batch(x, y, mesh)

    metrics.update(train_step(model, x, y, 1e-2))

    pbar.update(step, values=metrics.items(), finalize=False)
    step += 1
  manager.save()
  pbar.update(step, values=metrics.items(), finalize=True)

new training






## <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">9. | Model parallel Training</div>
If you switch to a 2-dimensional `Mesh` and shard the model variables along the second mesh dimension, the training becomes Model Parallel.

In Model Parallel training, each model replica spans multiple devices (2 in this case):

- There are 4 model replicas, and the training data batch is distributed to the 4 replicas.
- The 2 devices within a single model replica receive replicated training data.
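The grouping described above can be sketched in plain Python. This is only an illustration under the assumption that device ids are assigned to mesh coordinates in row-major order (the last mesh dimension varies fastest); the variable names are made up for the example and nothing here calls the DTensor API.

```python
import itertools

# Mesh dimensions as used in this section: 4-way "batch", 2-way "model".
mesh_dims = [("batch", 4), ("model", 2)]

# Assumption: device ids are laid out in row-major order over the mesh,
# so the last mesh dimension varies fastest.
sizes = [size for _, size in mesh_dims]
coords = list(itertools.product(*[range(s) for s in sizes]))
device_to_coord = {device_id: coord for device_id, coord in enumerate(coords)}

# Group devices by their "batch" coordinate: each group is one model replica.
replicas = {}
for device_id, (batch_idx, model_idx) in device_to_coord.items():
    replicas.setdefault(batch_idx, []).append(device_id)

for batch_idx, devices in replicas.items():
    print(f"replica {batch_idx}: devices {devices}")
```

Under this assumption, each of the 4 replicas contains 2 devices that differ only in their `"model"` coordinate.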


<img src="https://www.tensorflow.org/images/dtensor/dtensor_model_para.png" alt="Model parallel mesh" class="no-filter">


In [20]:
mesh = dtensor.create_mesh([("batch", 4), ("model", 2)], devices=DEVICES)
model = MLP([dtensor.Layout([dtensor.UNSHARDED, "model"], mesh), 
             dtensor.Layout(["model", dtensor.UNSHARDED], mesh)])

2023-01-19 15:51:44.388756: I tensorflow/dtensor/cc/dtensor_device.cc:1390] DTensor cache key lookup missed for __inference_stateless_random_normal_26119. DTensor is (re-)computing its SPMD transformation.
2023-01-19 15:51:44.421658: E tensorflow/core/framework/node_def_util.cc:630] NodeDef mentions attribute dtensor.device_seed_for_mesh_dims which is not in the op definition: Op<name=Squeeze; signature=input:T -> output:T; attr=T:type; attr=squeeze_dims:list(int),default=[],min=0> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node tf.StatefulPartitionedCall/tf.Squeeze}}
2023-01-19 15:51:44.471328: I tensorflow/dtensor/cc/dtensor_device.cc:1390] DTensor cache key lookup missed for __inference_stateless_random_normal_26141. DTensor is (re-)computing its SPMD transformation.
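The two layouts above shard the first kernel along its output axis (`[UNSHARDED, "model"]`) and the second kernel along its input axis (`["model", UNSHARDED]`). The NumPy sketch below suggests why that pairing works: each device computes its own slice of the hidden layer plus a partial output, and adding the partial outputs recovers the unsharded result. It assumes a plain two-layer MLP with ReLU, no bias and no BatchNorm, with invented sizes, and the final sum stands in for whatever collective a real system would use.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, features, hidden, classes = 4, 6, 8, 2  # made-up sizes
x = rng.normal(size=(batch, features))
w1 = rng.normal(size=(features, hidden))  # layout [UNSHARDED, "model"]
w2 = rng.normal(size=(hidden, classes))   # layout ["model", UNSHARDED]

def relu(a):
    return np.maximum(a, 0.0)

# Reference: the unsharded forward pass.
expected = relu(x @ w1) @ w2

# Two "model" devices: split w1 by columns and w2 by rows.
w1_shards = np.split(w1, 2, axis=1)
w2_shards = np.split(w2, 2, axis=0)

# Each device needs the full input x, computes its slice of the hidden
# layer, and produces a partial output with the full output shape.
partials = [relu(x @ w1_k) @ w2_k for w1_k, w2_k in zip(w1_shards, w2_shards)]

# Summing the partial outputs recovers the unsharded result.
combined = sum(partials)
print(np.allclose(combined, expected))
```

Because ReLU acts elementwise, applying it to each column slice of the hidden layer separately gives the same values as applying it to the whole layer.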


As the training data is still sharded along the batch dimension, you can reuse the same `repack_batch` function as the Data Parallel training case. DTensor will automatically replicate the per-replica batch to all devices inside the replica along the `"model"` mesh dimension.
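To make the effect of these layouts concrete, the following pure-Python sketch computes the per-device shape of a tensor from its global shape, its layout and the mesh. The helper `local_shape`, the `UNSHARDED` stand-in and the feature count are hypothetical and exist only for illustration; the batch size of 64 matches the batches used in this notebook.

```python
UNSHARDED = None  # stand-in for dtensor.UNSHARDED in this sketch

def local_shape(global_shape, layout_spec, mesh_dims):
    """Per-device shape for a tensor with `global_shape` under `layout_spec`.

    layout_spec has one entry per tensor axis: a mesh dimension name,
    or UNSHARDED if that axis is kept whole on every device.
    """
    sizes = dict(mesh_dims)
    shape = []
    for axis_size, mesh_dim in zip(global_shape, layout_spec):
        if mesh_dim is UNSHARDED:
            shape.append(axis_size)
        else:
            if axis_size % sizes[mesh_dim] != 0:
                raise ValueError(
                    f"axis of size {axis_size} is not divisible by "
                    f"mesh dimension {mesh_dim!r}")
            shape.append(axis_size // sizes[mesh_dim])
    return tuple(shape)

mesh_dims = [("batch", 4), ("model", 2)]
num_features = 1000  # hypothetical feature count, for illustration only

x_local = local_shape((64, num_features), ["batch", UNSHARDED], mesh_dims)
y_local = local_shape((64,), ["batch"], mesh_dims)
print(x_local, y_local)
```

Neither layout mentions the `"model"` mesh dimension, so the two devices that share a `"batch"` coordinate hold identical slices of the batch.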

In [21]:
def repack_batch(x, y, mesh):
  x = repack_local_tensor(x, layout=dtensor.Layout(['batch', dtensor.UNSHARDED], mesh))
  y = repack_local_tensor(y, layout=dtensor.Layout(['batch'], mesh))
  return x, y

Next run the training loop. The training loop reuses the same checkpoint manager as the Data Parallel training example, and the code looks identical.

Because the checkpoint from the Data Parallel run is restored, training continues from that model under the Model Parallel scheme.

In [22]:
num_epochs = 2
manager = start_checkpoint_manager(mesh, model)

for epoch in range(num_epochs):
  step = 0
  pbar = tf.keras.utils.Progbar(target=int(train_data_vec.cardinality()))
  metrics = {'epoch': epoch}
  for x,y in train_data_vec:
    x, y = repack_batch(x, y, mesh)
    metrics.update(train_step(model, x, y, 1e-2))
    pbar.update(step, values=metrics.items(), finalize=False)
    step += 1
  manager.save()
  pbar.update(step, values=metrics.items(), finalize=True)

Restoring a checkpoint


2023-01-19 15:51:45.243143: I tensorflow/dtensor/cc/dtensor_device.cc:1390] DTensor cache key lookup missed for __inference_train_step_26398. DTensor is (re-)computing its SPMD transformation.




2023-01-19 15:51:50.175531: I tensorflow/dtensor/cc/dtensor_device.cc:1390] DTensor cache key lookup missed for __inference_train_step_35449. DTensor is (re-)computing its SPMD transformation.




## <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">10. | Spatial Parallel Training</div>
When training on data of very high dimensionality (e.g. a very large image or a video), it may be desirable to shard along the feature dimension. This is called [Spatial Partitioning](https://cloud.google.com/blog/products/ai-machine-learning/train-ml-models-on-large-images-and-3d-volumes-with-spatial-partitioning-on-cloud-tpus), which was first introduced into TensorFlow for training models with large 3-D input samples.

<img src="https://www.tensorflow.org/images/dtensor/dtensor_spatial_para.png" alt="Spatial parallel mesh" class="no-filter">

DTensor also supports this case. The only change you need to make is to create a `Mesh` that includes a `feature` dimension, and apply the corresponding `Layout`.


In [23]:
mesh = dtensor.create_mesh([("batch", 2), ("feature", 2), ("model", 2)], devices=DEVICES)
model = MLP([dtensor.Layout(["feature", "model"], mesh), 
             dtensor.Layout(["model", dtensor.UNSHARDED], mesh)])


2023-01-19 15:51:56.102320: I tensorflow/dtensor/cc/dtensor_device.cc:1390] DTensor cache key lookup missed for __inference_stateless_random_normal_44717. DTensor is (re-)computing its SPMD transformation.
2023-01-19 15:51:56.121723: E tensorflow/core/framework/node_def_util.cc:630] NodeDef mentions attribute dtensor.device_seed_for_mesh_dims which is not in the op definition: Op<name=Squeeze; signature=input:T -> output:T; attr=T:type; attr=squeeze_dims:list(int),default=[],min=0> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node tf.StatefulPartitionedCall/tf.Squeeze}}
2023-01-19 15:51:56.181381: I tensorflow/dtensor/cc/dtensor_device.cc:1390] DTensor cache key lookup missed for __inference_stateless_random_normal_44739. DTensor is (re-)computing its SPMD transformation.


Shard the input data along the `feature` dimension when packing the input tensors to DTensors. You do this with a slightly different repack function, `repack_batch_for_spt`, where `spt` stands for Spatial Parallel Training.

In [24]:
def repack_batch_for_spt(x, y, mesh):
    # Shard data on feature dimension, too
    x = repack_local_tensor(x, layout=dtensor.Layout(["batch", 'feature'], mesh))
    y = repack_local_tensor(y, layout=dtensor.Layout(["batch"], mesh))
    return x, y
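When the input is sharded along its feature axis, each device along the `"feature"` mesh dimension sees only part of every sample's features, together with the matching rows of the first kernel. The matrix multiplication contracts over that sharded axis, so each device ends up with a partial product, and the partial products must be added together. The NumPy sketch below illustrates this arithmetic with invented sizes; it is not DTensor code, and the explicit addition stands in for the reduction a real system would perform.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, features, hidden = 4, 6, 3  # made-up sizes
x = rng.normal(size=(batch, features))
kernel = rng.normal(size=(features, hidden))

# Split the input along its feature axis and the kernel along its first
# axis, mirroring a 2-way "feature" mesh dimension.
x_shards = np.split(x, 2, axis=1)
kernel_shards = np.split(kernel, 2, axis=0)

# Each device multiplies its feature slice by the matching kernel rows.
partial_products = [x_k @ k_k for x_k, k_k in zip(x_shards, kernel_shards)]

# Adding the partial products gives the same result as the full matmul.
result = partial_products[0] + partial_products[1]
print(np.allclose(result, x @ kernel))
```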

Spatial Parallel training can also continue from a checkpoint created with other parallel training schemes.

In [25]:
num_epochs = 2

manager = start_checkpoint_manager(mesh, model)
for epoch in range(num_epochs):
  step = 0
  metrics = {'epoch': epoch}
  pbar = tf.keras.utils.Progbar(target=int(train_data_vec.cardinality()))

  for x, y in train_data_vec:
    x, y = repack_batch_for_spt(x, y, mesh)
    metrics.update(train_step(model, x, y, 1e-2))

    pbar.update(step, values=metrics.items(), finalize=False)
    step += 1
  manager.save()
  pbar.update(step, values=metrics.items(), finalize=True)

Restoring a checkpoint


2023-01-19 15:51:57.103881: I tensorflow/dtensor/cc/dtensor_device.cc:1390] DTensor cache key lookup missed for __inference_train_step_44998. DTensor is (re-)computing its SPMD transformation.
2023-01-19 15:51:57.202279: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:51:57.207747: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:51:57.252063: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:51:57.256259: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:51:57.266725: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:51:57.302559: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions


  7/391 [..............................] - ETA: 27s - epoch: 0.0000e+00 - loss: 0.7221 - accuracy: 0.6934

2023-01-19 15:51:57.308129: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:51:57.321242: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions




2023-01-19 15:52:01.880797: I tensorflow/dtensor/cc/dtensor_device.cc:1390] DTensor cache key lookup missed for __inference_train_step_54821. DTensor is (re-)computing its SPMD transformation.
2023-01-19 15:52:02.005012: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:52:02.009517: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:52:02.018236: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:52:02.026782: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:52:02.048491: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:52:02.056331: E tensorflow/core/grappler/costs/op_level_cost_estimator.cc:1104] Incompatible Matrix dimensions
2023-01-19 15:52:02.057889: E tensorflow/core/grappler/co



## <div style="font-family: Trebuchet MS; background-color: #D68910; color: #FFFFFF; padding: 12px; line-height: 1.5;border-radius:15px 60px 15px; text-align:center;border-color: red;border: 5px solid #17b978;overflow:hidden;">11. | SavedModel and DTensor</div>
The integration of DTensor and SavedModel is still under development. This section describes only the current status as of TensorFlow 2.9.0.

As of TensorFlow 2.9.0, `tf.saved_model` only accepts DTensor models with fully replicated variables.

As a workaround, you can convert a DTensor model to a fully replicated one by reloading a checkpoint. However, after a model is saved, all DTensor annotations are lost and the saved signatures can only be used with regular Tensors, not DTensors.

In [26]:
mesh = dtensor.create_mesh([("world", 1)], devices=DEVICES[:1])
mlp = MLP([dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh), 
           dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh)])

manager = start_checkpoint_manager(mesh, mlp)

model_for_saving = tf.keras.Sequential([
  text_vectorization,
  mlp
])

@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def run(inputs):
  return {'result': model_for_saving(inputs)}

tf.saved_model.save(
    model_for_saving, "/tmp/saved_model",
    signatures=run)

2023-01-19 15:52:07.898942: I tensorflow/dtensor/cc/dtensor_device.cc:1390] DTensor cache key lookup missed for __inference_stateless_random_normal_64863. DTensor is (re-)computing its SPMD transformation.
2023-01-19 15:52:07.937044: I tensorflow/dtensor/cc/dtensor_device.cc:1390] DTensor cache key lookup missed for __inference_stateless_random_normal_64885. DTensor is (re-)computing its SPMD transformation.


Restoring a checkpoint


As of TensorFlow 2.9.0, you can only call a loaded signature with a regular Tensor, or a fully replicated DTensor (which will be converted to a regular Tensor).

In [27]:
sample_batch = train_data.take(1).get_single_element()
sample_batch

{'label': <tf.Tensor: shape=(64,), dtype=int64, numpy=
 array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1,
        1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1])>,
 'text': <tf.Tensor: shape=(64,), dtype=string, numpy=
 array([b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.',
        b"This was an absolutely terrible

In [28]:
loaded = tf.saved_model.load("/tmp/saved_model")

run_sig = loaded.signatures["serving_default"]
result = run_sig(sample_batch['text'])['result']

In [29]:
np.mean(tf.argmax(result, axis=-1) == sample_batch['label'])

0.703125

## What's next?

This tutorial demonstrated building and training an MLP sentiment analysis model with DTensor.

Through `Mesh` and `Layout` primitives, DTensor can transform a TensorFlow `tf.function` to a distributed program suitable for a variety of training schemes.

In a real-world machine learning application, evaluation and cross-validation should be applied to avoid producing an over-fitted model. The techniques introduced in this tutorial can also be applied to introduce parallelism to evaluation.

Composing a model with `tf.Module` from scratch is a lot of work, and reusing existing building blocks such as layers and helper functions can drastically speed up model development.
As of TensorFlow 2.9, all Keras layers under `tf.keras.layers` accept DTensor layouts as their arguments, and can be used to build DTensor models. You can even directly reuse a Keras model with DTensor without modifying the model implementation. Refer to the [DTensor Keras Integration Tutorial](https://www.tensorflow.org/tutorials/distribute/dtensor_keras_tutorial) for information on using DTensor Keras.

<div style="font-weight:normal; font-size:18px;line-height:1.0;padding:12px;background-color: #f9ff21;font-family: Trebuchet MS; color: #ff1f5a;border: 5px solid #17b978;"><b>Note:</b><br><br> To avoid confusion between <code>Mesh</code> and <code>Layout</code>, the term <b>"dimension"</b> is always associated with <code>Mesh</code>, and the term <b>"axis"</b> with <code>Tensor</code> and <code>Layout</code> in this guide.</div>

<div style="font-weight:bold;font-size:20px;line-height:1.5;padding:12px;background-color: #f9ff21;font-family: Verdana, Cursive; color: #ff1f5a;border: 5px solid #17b978;">References: <li style="font-size:15px;font-weight:normal;margin-bottom:5px" >
    <a href="https://www.tensorflow.org/guide/dtensor_overview">https://www.tensorflow.org/guide/dtensor_overview</a>
</li>
    <li style="font-size:15px;font-weight:normal;margin-bottom:5px">
    <a href="https://github.com/tensorflow/docs/blob/master/site/en/guide/dtensor_overview.ipynb">https://github.com/tensorflow/docs/blob/master/site/en/guide/dtensor_overview.ipynb</a>
</li>
     <li style="font-size:15px;font-weight:normal;margin-bottom:5px" >
    <a href="https://www.infoq.com/news/2022/05/tensorflow-dtensor/">  https://www.infoq.com/news/2022/05/tensorflow-dtensor</a>
</li>
</div> 
