# HugeCTR Embedding  Plugin for TensorFlow

This notebook introduces a TensorFlow (TF) plugin for the HugeCTR embedding layer, embedding_plugin, where users may benefit from both the computational efficiency of the HugeCTR embedding layer and the ease of use of TensorFlow (TF).

## Build embedding_plugin ##
Before you can use the embedding_plugin, you must first build HugeCTR. You can do so by running the following commands:
```shell
$ git clone https://github.com/NVIDIA/HugeCTR.git
$ cd HugeCTR
$ git submodule update --init --recursive
$ mkdir -p build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release -DSM=80 .. # target is NVIDIA A100
$ make -j$(nproc)
```
A dynamic library is generated in the `lib/` directory that you'll have to load using TensorFlow. You can directly import `hugectr_tf_ops.py`, where we prepare the codes to load that dynamic library and wrap some operations for convenient usage, in your python script to be used with the embedding_plugin.

## Verify Accuracy ##
To verify whether the embedding_plugin can obtain the correct result, you can generate synthetic data for testing purposes as shown below.

In [1]:
# run this cell to clear all variables.
%reset -f

In [2]:
# import tensorflow and some modules
import tensorflow as tf
# do not let TF allocate all GPU memory
devices = tf.config.list_physical_devices("GPU")
for dev in devices:
    tf.config.experimental.set_memory_growth(dev, True)
    
import numpy as np

In [3]:
# import hugectr_tf_ops.py to use embedding_plugin ops
import sys
sys.path.append("../tools/embedding_plugin/python/")
import hugectr_tf_ops

In [4]:
# generate a random embedding table and show
vocabulary_size = 8
slot_num = 3
embedding_vector_size = 4

table = np.float32([i for i in range(1, vocabulary_size * embedding_vector_size + 1)]).reshape(vocabulary_size, embedding_vector_size)
print("init embedding table value:\n", table)

init embedding table value:
 [[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 9. 10. 11. 12.]
 [13. 14. 15. 16.]
 [17. 18. 19. 20.]
 [21. 22. 23. 24.]
 [25. 26. 27. 28.]
 [29. 30. 31. 32.]]


In HugeCTR, the corresponding dense shape of the input keys is `[batch_size, slot_num, max_nnz]`, and `0` is a valid key. Therefore, `-1` is used to denote invalid keys, which only occupy that position in the corresponding dense keys matrix.

In [5]:
# generate random keys to lookup from embedding table.
keys = np.array([[[0, -1],   # nnz = 1
                  [1, -1],   # nnz = 1
                  [2,  6]],  # nnz = 2
                 
                 [[0, -1],   # nnz = 1
                  [1, -1],   # nnz = 1
                  [-1, -1]], # nnz = 0
                 
                 [[0, -1],   # nnz = 1
                  [1, -1],   # nnz = 1
                  [6, -1]],  # nnz = 1
                 
                 [[0, -1],   # nnz = 1
                  [1, -1],   # nnz = 1
                  [2, -1]]], # nnz = 1
                dtype=np.int64) 
print("the dense shape of inputs keys:", keys.shape)

the dense shape of inputs keys: (4, 3, 2)


In [6]:
# define a simple forward propagation and backward propagation with embedding_plugin
# NOTE: cause hugectr_tf_ops.init can only be called once, if you want to run this cell multi-times, please restart the kernel.

with tf.GradientTape() as tape:
    # hugectr_tf_ops embedding_plugin initialize
    hugectr_tf_ops.init(visiable_gpus=[0], seed=123, key_type='int64', value_type='float', batch_size=4, batch_size_eval=4)
    
    # create a embedding_layer with embedding_plugin
    embedding_name = hugectr_tf_ops.create_embedding(init_value=table, opt_hparams=[0.1, 0.9, 0.99, 1e-3], 
                                              name_='embedding_verification', 
                                              max_vocabulary_size_per_gpu=vocabulary_size,
                                              slot_num=slot_num, embedding_vec_size=embedding_vector_size,
                                              embedding_type='distributed', max_nnz=2)
    
    # convert dense input keys to SparseTensor
    indices = tf.where(keys != -1)
    values = tf.gather_nd(keys, indices)
    
    # create a Variable used in backward propagation
    bp_trigger = tf.Variable(initial_value=1.0, trainable=True, dtype=tf.float32)
    
    # get forward result
    forward_result = hugectr_tf_ops.fprop(embedding_name=embedding_name,
                                   sparse_indices=indices, values=values, dense_shape=keys.shape,
                                   output_type=tf.float32, is_training=True, bp_trigger=bp_trigger)
    print("forward_result:\n", forward_result)
    
    # compute gradients & update params
    grads = tape.gradient(forward_result, bp_trigger)
    
    # do second forward propagation to check whether embedding table is updated.
    forward_2 = hugectr_tf_ops.fprop(embedding_name=embedding_name,
                              sparse_indices=indices, values=values, dense_shape=keys.shape,
                              output_type=tf.float32, is_training=True, bp_trigger=bp_trigger)
    print("\n")
    print("second forward_result:\n", forward_2)
    

forward_result:
 tf.Tensor(
[[[ 1.  2.  3.  4.]
  [ 5.  6.  7.  8.]
  [34. 36. 38. 40.]]

 [[ 1.  2.  3.  4.]
  [ 5.  6.  7.  8.]
  [ 0.  0.  0.  0.]]

 [[ 1.  2.  3.  4.]
  [ 5.  6.  7.  8.]
  [25. 26. 27. 28.]]

 [[ 1.  2.  3.  4.]
  [ 5.  6.  7.  8.]
  [ 9. 10. 11. 12.]]], shape=(4, 3, 4), dtype=float32)


second forward_result:
 tf.Tensor(
[[[ 0.90024936  1.9002494   2.9002495   3.9002495 ]
  [ 4.9002495   5.9002495   6.9002495   7.9002495 ]
  [33.800995   35.800995   37.800995   39.800995  ]]

 [[ 0.90024936  1.9002494   2.9002495   3.9002495 ]
  [ 4.9002495   5.9002495   6.9002495   7.9002495 ]
  [ 0.          0.          0.          0.        ]]

 [[ 0.90024936  1.9002494   2.9002495   3.9002495 ]
  [ 4.9002495   5.9002495   6.9002495   7.9002495 ]
  [24.900497   25.900497   26.900497   27.900497  ]]

 [[ 0.90024936  1.9002494   2.9002495   3.9002495 ]
  [ 4.9002495   5.9002495   6.9002495   7.9002495 ]
  [ 8.900497    9.900497   10.900497   11.900497  ]]], shape=(4, 3, 4), dtyp

In [7]:
# similarly, use original tensorflow op to compare whether results are consistent.

# define a tf embedding layer
class EmbeddingLayer(tf.keras.layers.Layer):
    def __init__(self, vocabulary_size, embedding_vec_size,
                init_value):
        super(EmbeddingLayer, self).__init__()
        self.vocabulary_size = vocabulary_size
        self.embedding_vec_size = embedding_vec_size
        self.init_value = init_value
        
    def build(self, _):
        self.Var = self.add_weight(shape=(self.vocabulary_size, self.embedding_vec_size),
                                         initializer=tf.constant_initializer(value=self.init_value))
        
    def call(self, inputs):
        return tf.nn.embedding_lookup_sparse(self.Var, inputs, sp_weights=None, combiner="sum")
    
with tf.GradientTape() as tape:
    # reshape keys into [batch_size * slot_num, max_nnz]
    reshape_keys = np.reshape(keys, newshape=(-1, keys.shape[-1]))
    indices = tf.where(reshape_keys != -1)
    values = tf.gather_nd(reshape_keys, indices)

    # define a layer
    tf_layer = EmbeddingLayer(vocabulary_size, embedding_vector_size, table)
    
    # wrap input keys components into a SparseTensor
    sparse_tensor = tf.sparse.SparseTensor(indices, values, reshape_keys.shape)
    
    tf_forward = tf_layer(sparse_tensor)
    print("tf forward_result:\n", tf.reshape(tf_forward, [keys.shape[0], keys.shape[1], tf_forward.shape[-1]]))
    
    # define an optimizer
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.1, beta_1=0.9, beta_2=0.99, epsilon=1e-3)
    
    # compute gradients & update params
    grads = tape.gradient(tf_forward, tf_layer.trainable_weights)
    optimizer.apply_gradients(zip(grads, tf_layer.trainable_weights))
    
    # do second forward propagation to check whether params are updated.
    tf_forward_2 = tf_layer(sparse_tensor)
    print("\n")
    print("tf second forward_result:\n", tf.reshape(tf_forward_2, [keys.shape[0], keys.shape[1], tf_forward_2.shape[-1]]))

tf forward_result:
 tf.Tensor(
[[[ 1.  2.  3.  4.]
  [ 5.  6.  7.  8.]
  [34. 36. 38. 40.]]

 [[ 1.  2.  3.  4.]
  [ 5.  6.  7.  8.]
  [ 0.  0.  0.  0.]]

 [[ 1.  2.  3.  4.]
  [ 5.  6.  7.  8.]
  [25. 26. 27. 28.]]

 [[ 1.  2.  3.  4.]
  [ 5.  6.  7.  8.]
  [ 9. 10. 11. 12.]]], shape=(4, 3, 4), dtype=float32)


tf second forward_result:
 tf.Tensor(
[[[ 0.90024906  1.9002491   2.900249    3.900249  ]
  [ 4.900249    5.900249    6.900249    7.900249  ]
  [33.800995   35.800995   37.800995   39.800995  ]]

 [[ 0.90024906  1.9002491   2.900249    3.900249  ]
  [ 4.900249    5.900249    6.900249    7.900249  ]
  [ 0.          0.          0.          0.        ]]

 [[ 0.90024906  1.9002491   2.900249    3.900249  ]
  [ 4.900249    5.900249    6.900249    7.900249  ]
  [24.900497   25.900497   26.900497   27.900497  ]]

 [[ 0.90024906  1.9002491   2.900249    3.900249  ]
  [ 4.900249    5.900249    6.900249    7.900249  ]
  [ 8.900497    9.900497   10.900497   11.900497  ]]], shape=(4, 3, 4)

In [8]:
# assert whether embedding_plugin's results are consistent with tensorflow original ops
first_forward_consistent = np.allclose(forward_result.numpy(), 
                                tf.reshape(tf_forward, [keys.shape[0], keys.shape[1], tf_forward.shape[-1]]).numpy())
print("Consistent in first forward propagation?", first_forward_consistent)

second_forwad_consistent = np.allclose(forward_2.numpy(), 
                                tf.reshape(tf_forward_2, [keys.shape[0], keys.shape[1], tf_forward_2.shape[-1]]))
print("Consistent in second forward propagation?", second_forwad_consistent)

Consistent in first forward propagation? True
Consistent in second forward propagation? True


The results from embedding_plugins and original TF ops are consistent in both first and second forward propagation, which means the embedding_plugin can get the same forward result and perform the same backward propagation as TF ops. Therefore, the embedding_plugin can obtain the correct results.

## DeepFM demo ##
In this notebook, TF 2.x is used to build the DeepFM model.

### Define Models with the Embedding_Plugin ###

**To proceed, Kernel must be restarted.**

In [1]:
# first, import tensorflow and import plugin ops from hugectr_tf_ops.py
import tensorflow as tf
# do not let TF allocate all GPU memory
devices = tf.config.list_physical_devices("GPU")
for dev in devices:
    tf.config.experimental.set_memory_growth(dev, True)
import sys
sys.path.append("../tools/embedding_plugin/python/")
import hugectr_tf_ops

In [2]:
# wrap plugin ops into a TF layer for easy use
class PluginEmbedding(tf.keras.layers.Layer):
    def __init__(self,
                 vocabulary_size,
                 slot_num,
                 embedding_vec_size,
                 gpu_count,
                 initializer=False,
                 name='plugin_embedding',
                 embedding_type='localized',
                 optimizer='Adam',
                 opt_hparam=[0.1, 0.9, 0.99, 1e-3],
                 update_type='Local',
                 atomic_update=True,
                 max_feature_num=int(1e3),
                 max_nnz=1,
                 combiner='sum',
                 ):
        super(PluginEmbedding, self).__init__()

        self.vocabulary_size_each_gpu = (vocabulary_size // gpu_count) + 1 
        self.slot_num = slot_num
        self.embedding_vec_size = embedding_vec_size
        self.embedding_type = embedding_type
        self.optimizer_type = optimizer
        self.opt_hparam = opt_hparam
        self.update_type = update_type
        self.atomic_update = atomic_update
        self.max_feature_num = max_feature_num
        self.max_nnz = max_nnz
        self.combiner = combiner
        self.gpu_count = gpu_count

        self.name_ = hugectr_tf_ops.create_embedding(initializer, name_=name, embedding_type=self.embedding_type, 
                                             optimizer_type=self.optimizer_type, 
                                             max_vocabulary_size_per_gpu=self.vocabulary_size_each_gpu,
                                             opt_hparams=self.opt_hparam, update_type=self.update_type,
                                             atomic_update=self.atomic_update, slot_num=self.slot_num,
                                             max_nnz=self.max_nnz, max_feature_num=self.max_feature_num,
                                             embedding_vec_size=self.embedding_vec_size, 
                                             combiner=self.combiner)

    def build(self, _):
        self.bp_trigger = self.add_weight(name="bp_trigger",
                                          shape=(1,), dtype=tf.float32, trainable=True)

    @tf.function
    def call(self, row_offsets, value_tensors, nnz_array, output_shape, training=False):
        return hugectr_tf_ops.fprop_v3(embedding_name=self.name_, row_offsets=row_offsets, value_tensors=value_tensors, 
                                nnz_array=nnz_array, bp_trigger=self.bp_trigger, is_training=training,
                                output_shape=output_shape)

In [3]:
# define other TF layers
class Multiply(tf.keras.layers.Layer):
    def __init__(self, out_units):
        super(Multiply, self).__init__()
        self.out_units = out_units

    def build(self, input_shape):
        self.w = self.add_weight(name='weight_vector', shape=(input_shape[1], self.out_units),
                                 initializer='glorot_uniform', trainable=True)
    
    def call(self, inputs):
        return inputs * self.w

In [4]:
# build DeepFM with plugin layer
class DeepFM_PluginEmbedding(tf.keras.models.Model):
    def __init__(self, 
                 vocabulary_size, 
                 embedding_vec_size,
                 which_embedding,
                 dropout_rate, # list of float
                 deep_layers, # list of int
                 initializer,
                 gpus,
                 batch_size,
                 batch_size_eval,
                 embedding_type = 'localized',
                 slot_num=1,
                 seed=123):
        super(DeepFM_PluginEmbedding, self).__init__()
        tf.keras.backend.clear_session()
        tf.compat.v1.set_random_seed(seed)

        self.vocabulary_size = vocabulary_size
        self.embedding_vec_size = embedding_vec_size
        self.which_embedding = which_embedding
        self.dropout_rate = dropout_rate
        self.deep_layers = deep_layers
        self.gpus = gpus
        self.batch_size = batch_size
        self.batch_size_eval = batch_size_eval 
        self.slot_num = slot_num
        self.embedding_type = embedding_type

        if isinstance(initializer, str):
            initializer = False
            
        # when building model with embedding_plugin ops, init() should be called prior to any other ops.
        hugectr_tf_ops.init(visiable_gpus=gpus, seed=seed, key_type='int64', value_type='float', 
                        batch_size=batch_size, batch_size_eval=batch_size_eval)
        
        # create a embedding_plugin layer
        self.plugin_embedding_layer = PluginEmbedding(vocabulary_size=vocabulary_size, slot_num=slot_num, 
                                            embedding_vec_size=embedding_vec_size + 1, 
                                            embedding_type=embedding_type,
                                            gpu_count=len(gpus), initializer=initializer)
        
        # other layers with TF original ops
        self.deep_dense = []
        for i, deep_units in enumerate(self.deep_layers):
            self.deep_dense.append(tf.keras.layers.Dense(units=deep_units, activation=None, use_bias=True,
                                                         kernel_initializer='glorot_normal', 
                                                         bias_initializer='glorot_normal'))
            self.deep_dense.append(tf.keras.layers.Dropout(dropout_rate[i]))
        self.deep_dense.append(tf.keras.layers.Dense(units=1, activation=None, use_bias=True,
                                                     kernel_initializer='glorot_normal',
                                                     bias_initializer=tf.constant_initializer(0.01)))
        self.add_layer = tf.keras.layers.Add()
        self.y_act = tf.keras.layers.Activation(activation='sigmoid')

        self.dense_multi = Multiply(1)
        self.dense_embedding = Multiply(self.embedding_vec_size)

        self.concat_1 = tf.keras.layers.Concatenate()
        self.concat_2 = tf.keras.layers.Concatenate()

    @tf.function
    def call(self, dense_feature, sparse_feature, training=True):
        """
        forward propagation.
        #arguments:
            dense_feature: [batch_size, dense_dim]
            sparse_feature: for OriginalEmbedding, it is a SparseTensor, and the dense shape is [batch_size * slot_num, max_nnz];
                            for PluginEmbedding, it is a list of [row_offsets, value_tensors, nnz_array]. 
        """
        with tf.name_scope("embedding_and_slice"):
            dense_0 = tf.cast(tf.expand_dims(dense_feature, 2), dtype=tf.float32) # [batchsize, dense_dim, 1]
            dense_mul = self.dense_multi(dense_0) # [batchsize, dense_dim, 1]
            dense_emb = self.dense_embedding(dense_0) # [batchsize, dense_dim, embedding_vec_size]
            dense_mul = tf.reshape(dense_mul, [dense_mul.shape[0], -1]) # [batchsize, dense_dim * 1]
            dense_emb = tf.reshape(dense_emb, [dense_emb.shape[0], -1]) # [batchsize, dense_dim * embedding_vec_size]

            sparse = self.plugin_embedding_layer(sparse_feature[0], sparse_feature[1], sparse_feature[2],
                                                output_shape=[self.batch_size, self.slot_num, self.embedding_vec_size + 1],
                                                training=training) # [batch_size, self.slot_num, self.embedding_vec_size + 1]

            sparse_1 = tf.slice(sparse, [0, 0, self.embedding_vec_size], [-1, self.slot_num, 1]) #[batchsize, slot_num, 1]
            sparse_1 = tf.squeeze(sparse_1, 2) # [batchsize, slot_num]

            sparse_emb = tf.slice(sparse, [0, 0, 0], [-1, self.slot_num, self.embedding_vec_size]) #[batchsize, slot_num, embedding_vec_size]
            sparse_emb = tf.reshape(sparse_emb, [-1, self.slot_num * self.embedding_vec_size]) #[batchsize, slot_num * embedding_vec_size]
        
        with tf.name_scope("FM"):
            with tf.name_scope("first_order"):
                first = self.concat_1([dense_mul, sparse_1]) # [batchsize, dense_dim + slot_num]
                first_out = tf.reduce_sum(first, axis=-1, keepdims=True) # [batchsize, 1]
                
            with tf.name_scope("second_order"):
                hidden = self.concat_2([dense_emb, sparse_emb]) # [batchsize, (dense_dim + slot_num) * embedding_vec_size]
                second = tf.reshape(hidden, [-1, dense_feature.shape[1] + self.slot_num, self.embedding_vec_size])
                square_sum = tf.math.square(tf.math.reduce_sum(second, axis=1, keepdims=True)) # [batchsize, 1, embedding_vec_size]
                sum_square = tf.math.reduce_sum(tf.math.square(second), axis=1, keepdims=True) # [batchsize, 1, embedding_vec_size]
                
                second_out = 0.5 * (sum_square - square_sum) # [batchsize, 1, embedding_vec_size]
                second_out = tf.math.reduce_sum(second_out, axis=-1, keepdims=False) # [batchsize, 1]
                
        with tf.name_scope("Deep"):
            for i, layer in enumerate(self.deep_dense):
                if i % 2 == 0: # dense
                    hidden = layer(hidden)
                else: # dropout
                    hidden = layer(hidden, training)

        y = self.add_layer([hidden, first_out, second_out])
        y = self.y_act(y) # [batchsize, 1]

        return y

The above cells wrap the embedding_plugin ops into a TF layer, and uses that layer to define a TF DeepFM model. Similarly, define an embedding layer with TF original ops, and define a DeepFM model with that layer. Because embedding_plugin supports model parallelism, the parameters of the original TF embedding layer are equally distributed to each GPU for a fair performance comparison.

### Define Models with the Original TF Ops ###

In [5]:
# define a TF embedding layer with TF original ops
class OriginalEmbedding(tf.keras.layers.Layer):
    def __init__(self, 
                 vocabulary_size,
                 embedding_vec_size,
                 initializer='uniform',
                 combiner="sum",
                 gpus=[0]):
        super(OriginalEmbedding, self).__init__()

        self.vocabulary_size = vocabulary_size
        self.embedding_vec_size = embedding_vec_size 
        if isinstance(initializer, str):
            self.initializer = tf.keras.initializers.get(initializer)
        else:
            self.initializer = initializer
        if combiner not in ["sum", "mean"]:
            raise RuntimeError("combiner must be one of \{'sum', 'mean'\}.")
        self.combiner = combiner
        if (not isinstance(gpus, list)) and (not isinstance(gpus, tuple)):
            raise RuntimeError("gpus must be a list or tuple.")
        self.gpus = gpus

    def build(self, _):
        if isinstance(self.initializer, tf.keras.initializers.Initializer):
            if len(self.gpus) > 1:
                self.embeddings_params = list()
                mod_size = self.vocabulary_size % len(self.gpus)
                vocabulary_size_each_gpu = [(self.vocabulary_size // len(self.gpus)) + (1 if dev_id < mod_size else 0)
                                            for dev_id in range(len(self.gpus))]

                for i, gpu in enumerate(self.gpus):
                    with tf.device("/gpu:%d" %gpu):
                        params_i = self.add_weight(name="embedding_" + str(gpu), 
                                                   shape=(vocabulary_size_each_gpu[i], self.embedding_vec_size),
                                                   initializer=self.initializer)
                    self.embeddings_params.append(params_i)

            else:
                self.embeddings_params = self.add_weight(name='embeddings', 
                                                        shape=(self.vocabulary_size, self.embedding_vec_size),
                                                        initializer=self.initializer)
        else:
            self.embeddings_params = self.initializer

    @tf.function
    def call(self, keys, output_shape):
        result = tf.nn.embedding_lookup_sparse(self.embeddings_params, keys, 
                                             sp_weights=None, combiner=self.combiner)
        return tf.reshape(result, output_shape)

In [6]:
# define DeepFM model with original TF embedding layer
class DeepFM_OriginalEmbedding(tf.keras.models.Model):
    def __init__(self, 
                 vocabulary_size, 
                 embedding_vec_size,
                 which_embedding,
                 dropout_rate, # list of float
                 deep_layers, # list of int
                 initializer,
                 gpus,
                 batch_size,
                 batch_size_eval,
                 embedding_type = 'localized',
                 slot_num=1,
                 seed=123):
        super(DeepFM_OriginalEmbedding, self).__init__()
        tf.keras.backend.clear_session()
        tf.compat.v1.set_random_seed(seed)

        self.vocabulary_size = vocabulary_size
        self.embedding_vec_size = embedding_vec_size
        self.which_embedding = which_embedding
        self.dropout_rate = dropout_rate
        self.deep_layers = deep_layers
        self.gpus = gpus
        self.batch_size = batch_size
        self.batch_size_eval = batch_size_eval 
        self.slot_num = slot_num
        self.embedding_type = embedding_type

        self.original_embedding_layer = OriginalEmbedding(vocabulary_size=vocabulary_size, 
                                            embedding_vec_size=embedding_vec_size + 1, 
                                            initializer=initializer, gpus=gpus)
        self.deep_dense = []
        for i, deep_units in enumerate(self.deep_layers):
            self.deep_dense.append(tf.keras.layers.Dense(units=deep_units, activation=None, use_bias=True,
                                                         kernel_initializer='glorot_normal', 
                                                         bias_initializer='glorot_normal'))
            self.deep_dense.append(tf.keras.layers.Dropout(dropout_rate[i]))
        self.deep_dense.append(tf.keras.layers.Dense(units=1, activation=None, use_bias=True,
                                                     kernel_initializer='glorot_normal',
                                                     bias_initializer=tf.constant_initializer(0.01)))
        self.add_layer = tf.keras.layers.Add()
        self.y_act = tf.keras.layers.Activation(activation='sigmoid')

        self.dense_multi = Multiply(1)
        self.dense_embedding = Multiply(self.embedding_vec_size)

        self.concat_1 = tf.keras.layers.Concatenate()
        self.concat_2 = tf.keras.layers.Concatenate()

    @tf.function
    def call(self, dense_feature, sparse_feature, training=True):
        """
        forward propagation.
        #arguments:
            dense_feature: [batch_size, dense_dim]
            sparse_feature: for OriginalEmbedding, it is a SparseTensor, and the dense shape is [batch_size * slot_num, max_nnz];
                            for PluginEmbedding, it is a list of [row_offsets, value_tensors, nnz_array]. 
        """
        with tf.name_scope("embedding_and_slice"):
            dense_0 = tf.cast(tf.expand_dims(dense_feature, 2), dtype=tf.float32) # [batchsize, dense_dim, 1]
            dense_mul = self.dense_multi(dense_0) # [batchsize, dense_dim, 1]
            dense_emb = self.dense_embedding(dense_0) # [batchsize, dense_dim, embedding_vec_size]
            dense_mul = tf.reshape(dense_mul, [dense_mul.shape[0], -1]) # [batchsize, dense_dim * 1]
            dense_emb = tf.reshape(dense_emb, [dense_emb.shape[0], -1]) # [batchsize, dense_dim * embedding_vec_size]

            sparse = self.original_embedding_layer(sparse_feature, output_shape=[-1, self.slot_num, self.embedding_vec_size + 1])

            sparse_1 = tf.slice(sparse, [0, 0, self.embedding_vec_size], [-1, self.slot_num, 1]) #[batchsize, slot_num, 1]
            sparse_1 = tf.squeeze(sparse_1, 2) # [batchsize, slot_num]

            sparse_emb = tf.slice(sparse, [0, 0, 0], [-1, self.slot_num, self.embedding_vec_size]) #[batchsize, slot_num, embedding_vec_size]
            sparse_emb = tf.reshape(sparse_emb, [-1, self.slot_num * self.embedding_vec_size]) #[batchsize, slot_num * embedding_vec_size]
        
        with tf.name_scope("FM"):
            with tf.name_scope("first_order"):
                first = self.concat_1([dense_mul, sparse_1]) # [batchsize, dense_dim + slot_num]
                first_out = tf.reduce_sum(first, axis=-1, keepdims=True) # [batchsize, 1]
                
            with tf.name_scope("second_order"):
                hidden = self.concat_2([dense_emb, sparse_emb]) # [batchsize, (dense_dim + slot_num) * embedding_vec_size]
                second = tf.reshape(hidden, [-1, dense_feature.shape[1] + self.slot_num, self.embedding_vec_size])
                square_sum = tf.math.square(tf.math.reduce_sum(second, axis=1, keepdims=True)) # [batchsize, 1, embedding_vec_size]
                sum_square = tf.math.reduce_sum(tf.math.square(second), axis=1, keepdims=True) # [batchsize, 1, embedding_vec_size]
                
                second_out = 0.5 * (sum_square - square_sum) # [batchsize, 1, embedding_vec_size]
                second_out = tf.math.reduce_sum(second_out, axis=-1, keepdims=False) # [batchsize, 1]
                
        with tf.name_scope("Deep"):
            for i, layer in enumerate(self.deep_dense):
                if i % 2 == 0: # dense
                    hidden = layer(hidden)
                else: # dropout
                    hidden = layer(hidden, training)

        y = self.add_layer([hidden, first_out, second_out])
        y = self.y_act(y) # [batchsize, 1]

        return y

Dataset is needed to use these models for training. [Kaggle Criteo datasets](http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/) provided by CriteoLabs is used as the training dataset. The original training set contains 45,840,617 examples. Each example contains a label (0 by default or 1 if the ad was clicked) and 39 features in which 13 of them are integer and the other 26 are categorial. Since TFRecord is suitable for the training process and the Criteo dataset is missing numerous values across the feature columns, preprocessing is needed. The original test set won't be used because it doesn't contain labels.

### Dataset processing ###
1. Download dataset from [http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/](http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/).
2. Extract the dataset by running the following command. 
    ```shell
    $ tar zxvf dac.tar.gz
    ```
3. Preprocess the datast and set missing values.
Preprocessing functions are defined in [preprocess.py](../tools/embedding_plugin/performance_profile/preprocess.py). Open that file and check the codes.

In [None]:
# specify source csv name and output csv name, run this command will do the preprocessing.
# Warning: this command will take serveral hours to do preprocessing.
%run ../tools/embedding_plugin/performance_profile/preprocess.py \
    --src_csv_path=train.txt --dst_csv_path=train.out.txt \
    --normalize_dense=0 --feature_cross=0

4. Split the dataset by running the following commands:
```shell
$ head -n 36672493 train.out.txt > train
$ tail -n 9168124 train.out.txt > valtest
$ head -n 4584062 valtest > val
$ tail -n 4584062 valtest > test
```

5. Convert the dataset into a TFRecord file. Converting functions are defined in [txt2tfrecord.py](../tools/embedding_plugin/performance_profile/txt2tfrecord.py). Open that file and check the codes.
After the data preprocessing is completed, *.tfrecord file(s) will be generated, which can be used for training. The training loop can now be configured to use the dataset and models to perform the training.

In [None]:
# specify source name and output tfrecord name, run this command will do the converting.
# Warning: this command will take half an hour to do converting.
%run ../tools/embedding_plugin/performance_profile/txt2tfrecord.py \
    --src_txt_name=train --dst_tfrecord_name=train.tfrecord \
    --normalized=0 --use_multi_process=1 --shard_num=1 
    # if multi tfrecord files are wanted, set shard_num to the number of files.

### Define training loop and do training ###
In [read_data.py](../tools/embedding_plugin/performance_profile/read_data.py), some preprocessing and TF data reading pipeline creation functions are defined.

In [7]:
# set env path, so that some modules can be imported
sys.path.append("../tools/embedding_plugin/performance_profile/")

import txt2tfrecord as utils
from read_data import create_dataset
import time
import logging
logging.basicConfig(format='%(asctime)s %(message)s')
logging.root.setLevel('INFO')

In [8]:
# choose wich model for training
which_model = "Plugin" # change it to "Original", if you want to try the model define with original tf ops.

In [9]:
# set some hyper parameters for training process
if ("Plugin" == which_model):
    batch_size = 16384
    n_epochs = 1
    distribute_keys = 1 
    gpus = [0] # use GPU0
    embedding_type = 'distributed'
    vocabulary_size = 1737710
    embedding_vec_size = 10
    slot_num = 26
    batch_size_eval = 1 * len(gpus)
    
elif ("Original" == which_model):
    batch_size = 16384
    n_epochs = 1
    distribute_keys = 0
    gpus = [0] # use GPU0
    vocabulary_size = 1737710
    embedding_vec_size = 10
    slot_num = 26
    batch_size_eval = 1 * len(gpus)
    embedding_type = 'distributed'

In [10]:
# define feature_description to read tfrecord examples.
cols = [utils.idx2key(idx, False) for idx in range(0, utils.NUM_TOTAL_COLUMNS)]
feature_desc = dict()
for col in cols:
    if col == 'label' or col.startswith("I"):
        feature_desc[col] = tf.io.FixedLenFeature([], tf.int64) # scaler
    else: 
        feature_desc[col] = tf.io.FixedLenFeature([1], tf.int64) # [slot_num, nnz]

In [11]:
# please set data_path to your tfrecord
data_path = "../tools/embedding_plugin/performance_profile/"

In [12]:
# create tfrecord reading pipeling
dataset_names = [data_path + "./train.tfrecord"]
dataset = create_dataset(dataset_names=dataset_names,
                         feature_desc=feature_desc,
                         batch_size=batch_size,
                         n_epochs=n_epochs,
                         distribute_keys=tf.constant(distribute_keys != 0, dtype=tf.bool),
                         gpu_count=len(gpus),
                         embedding_type=tf.constant(embedding_type, dtype=tf.string))

Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.


2020-11-23 09:45:31,393 From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/parallel_for/pfor.py:2380: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.


In [13]:
# define loss function and optimizer used in other TF layers.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=False)

In [14]:
# create model instance
if "Original" == which_model:
    model = DeepFM_OriginalEmbedding(vocabulary_size=vocabulary_size, embedding_vec_size=embedding_vec_size, 
                       which_embedding=which_model, embedding_type=embedding_type,
                       dropout_rate=[0.5] * 10, deep_layers=[1024] * 10,
                       initializer='uniform', gpus=gpus, batch_size=batch_size, batch_size_eval=batch_size_eval,
                       slot_num=slot_num)
elif "Plugin" == which_model:
    model = DeepFM_PluginEmbedding(vocabulary_size=vocabulary_size, embedding_vec_size=embedding_vec_size, 
                       which_embedding=which_model, embedding_type=embedding_type,
                       dropout_rate=[0.5] * 10, deep_layers=[1024] * 10,
                       initializer='uniform', gpus=gpus, batch_size=batch_size, batch_size_eval=batch_size_eval,
                       slot_num=slot_num)

In [15]:
# define training step
@tf.function
def _train_step(dense_batch, sparse_batch, y_batch, model, loss_fn, optimizer):
    with tf.GradientTape() as tape:
        y_batch = tf.cast(y_batch, dtype=tf.float32)
        logits = model(dense_batch, sparse_batch, training=True)
        loss = loss_fn(y_batch, logits)
        loss /= dense_batch.shape[0]
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return loss

In [16]:
# training loop
logging.info("begin to train")
begin_time = time.time()
display_begin = begin_time
for step, datas in enumerate(dataset):
    label, dense, others = datas[0], datas[1], datas[2:]
    if tf.constant(distribute_keys != 0, dtype=tf.bool):
        sparse = others[0:3]
    else:
        sparse = others[-1]
    
    train_loss = _train_step(dense, sparse, label, model, loss_fn, optimizer)
    loss_value = train_loss.numpy()
    
    if (step % 100 == 0 and step != 0):
        display_end = time.time()
        logging.info("step: %d, loss: %.7f, elapsed time: %.5f seconds." %(step, loss_value, (display_end - display_begin)))
        display_begin = display_end
        
end_time = time.time()
logging.info("Train End. Elapsed Time: %.3f seconds." %(end_time - begin_time))

2020-11-23 09:47:33,030 begin to train
2020-11-23 09:47:45,262 step: 100, loss: 0.0002282, elapsed time: 12.23093 seconds.
2020-11-23 09:47:54,933 step: 200, loss: 0.0002397, elapsed time: 9.67155 seconds.
2020-11-23 09:48:04,670 step: 300, loss: 0.0002279, elapsed time: 9.73684 seconds.
2020-11-23 09:48:14,446 step: 400, loss: 0.0002361, elapsed time: 9.77609 seconds.
2020-11-23 09:48:24,239 step: 500, loss: 0.0002272, elapsed time: 9.79255 seconds.
2020-11-23 09:48:34,095 step: 600, loss: 0.0002441, elapsed time: 9.85616 seconds.
2020-11-23 09:48:43,973 step: 700, loss: 0.0002372, elapsed time: 9.87768 seconds.
2020-11-23 09:48:53,886 step: 800, loss: 0.0002601, elapsed time: 9.91286 seconds.
2020-11-23 09:49:03,821 step: 900, loss: 0.0002376, elapsed time: 9.93518 seconds.
2020-11-23 09:49:13,772 step: 1000, loss: 0.0002412, elapsed time: 9.95164 seconds.
2020-11-23 09:49:23,746 step: 1100, loss: 0.0002340, elapsed time: 9.97380 seconds.
2020-11-23 09:49:33,740 step: 1200, loss: 0.0

## API signature ##
All embedding_plugin APIs are defined in [hugectr_tf_ops.py](../tools/embedding_plugin/python/hugectr_tf_ops.py).

In [17]:
%%html
<style>
table {float:left}
</style>

  ```python
  init(visiable_gpus, seed=0, key_type='int64', value_type='float', batch_size=1, batch_size_eval=1)
  ```
  
This function is used to create resource manager, which manages resources used by embedding_plugin.
**IMPORTANT:** This function can only be called once. It must be called before any other embedding_plugin API is called Currently, only key_type='int64', value_type='float' has been tested.


| Args ||
| :-----| :---- |
| visiable_gpus | list of integers, used to specify which gpus will be used by embedding_plugin. |
| seed | integer, the initializer random seed for embedding_plugin. |
| key_type| string, can be one of {'uint32', 'int64'}. Used to specify the input keys data type. |
| value_type| string, can be one of {'float', 'half'}. Used to specify the data type of embedding_plugin forward result. |
| batch_size| integer, batch_size used in training process. |
| batch_size_eval| integer, batch_size used in evaluation process. |


  ```python
  embedding_name = create_embedding(init_value, name_='hugectr_embedding', embedding_type='localized',
                                     optimizer_type='Adam', max_vocabulary_size_per_gpu=1, slot_size_array=[],
                                     opt_hparams=[0.001], update_type='Local', atomic_update=true, scaler=1.0,
                                     slot_num=1, max_nnz=1, max_feature_num=1000000, embedding_vec_size=1,
                                     combiner='sum')
  ```
  
| Args ||
| :-----| :---- |
|init_value| can be a `bool` or a 2-D matrix with `dtype=tf.float32`. When it is `bool`, parameters will be randomly initialized. When it is a 2-D matrix with `dtype=tf.float32`, that matrix will be used to initialize parameters, and the matrix's row-index will be deemed to be key of the embedding table.|
|name_|string, the name of this embedding layer. If `name_` is unique, then it will be used as the embedding layer name, otherwise, numerical suffix will be automatically added to `name_` to form an unique name for this embedding layer. |
|embedding_type| string, can be one of {'localized', 'distributed'}. |
| optimizer_type| string, can be one of {'Adam', 'MomentumSGD', 'Nesterov', 'SGD'}. | 
|max_vocabulary_size_per_gpu| integer, used to allocate GPU memory spaces for embedding layer.|
|slot_size_array| list of integers, used to allocate GPU memory spaces precisely for embedding layer.|
|opt_hparams| list of floats, used to specify hyper parameters for optimizer.<br>For `Adam`, `opt_hparams` must be a list of `[learning_rate, beta1, beta2, epsilon]`.<br>For `MomentumSGD`, `opt_hparams` must be a list of `[learning_rate, momentum_factor]`.<br>For `Nesterov`, `opt_hparams` must be a list of `[learning_rate, momentum_factor]`.<br>For `SGD`, `opt_hparams` must be a list of `[learning_rate]`.|
|update_type| string, can be one of {'Local', 'Global', 'LazyGlobal'}. |
|atomic_update| bool, only used in `SGD` optimizer. |
|scaler| float, can be one of {1.0, 128.0, 256.0, 512.0, 1024.0}, used in `mixed_precission` training. |
|slot_num| integer, how many slots (feature-fields) are unified in a single embedding layer. |
|max_nnz| integer, the number of valid keys in a single slot.|
|max_feature_num| integer, the number of valid keys in a single input sample.|
|embedding_vec_size| integer, the embedding vector size of this embedding layer.|
|combier|string, can be one of {'mean', 'sum'}. specify how to combine different embedding vector in the same slot.|

|Returns||
|:----| :---- |
|embedding_name| tf.Tensor, dtype=tf.string. An unique name for this embedding layer.|



  ```python
  forward_result = fprop(sparse_indices, values, dense_shape, embedding_name, 
                         bp_trigger, output_type, is_training=True)
  ```
  
This function can be used to do forward propagation for `distributed` and `localized` embedding layers. It will use all input keys that are stored in the SparseTensor format as its input, and will convert those keys to the CSR format within this function. Therefore, its performance is not very satisfying.
  
|Args||
|:----| :---- |
|sparse_indices| A 2-D int64 tensor of shape [N, 3], which specifies the indices of the elements in the sparse tensor that contain valid values. And `N` represents how many valid values in the corresponding dense tensor, 3 represent valid values' [batch_idx, slot_idx, nnz_idx].|
|values| A 1-D tensor of type specified in `init().key_type` ans shape [N], which supplies the valid values for each element in `sparse_indices`.|
|dense_shape| A 1-D int64 tensor of shape [3], which specifies the dense_shape: `[batch_size, slot_num, max_nnz]` of the sparse tensor.|
| embedding_name| tf.Tensor with `dtype=tf.string`, use which embedding layer to do forward propagation.|
| bp_trigger| tf.Variable(dtype=tf.float32), used to automatically trigger back propagation of the embedding layer.|
| output_type| should be the same with `init().value_type`.|
| is_training| bool, specify whether use `training` resources or `evaluation` resources.|

|Returns||
|:----| :---- |
|forward_result| tf.Tensor with `dtype=output_type`. Forward propagation results.|

  ```python
  forward_result = fprop_v3(embedding_name, row_offsets, value_tensors, nnz_array, 
                            bp_trigger, output_shape, is_training=True)
  ```
  
This function can be used to do forward propagation for `distributed` and `localized` embedding layers. Its inputs has been previously converted to the CSR format. Therefore, no conversion will be conducted within this function. For example, if the `embedding_plugin` uses four GPUs to perform the computation, then four CSR sparse matrices will be needed for its inputs. Addtionally, four row_offsets are stacked together to form a single tensor and value_tensors.

|Args||
|:----| :---- |
|embedding_name| tf.Tensor with `dtype=tf.string`, use which embedding layer to do forward propgation.|
|row_offsets| 2-D matrix with shape `[gpu_count, batch_size * slot_num + 1]`, `dtype=tf.int64`. Each row in this tensor denotes a CSR `row_offsets` for one GPU.|
|value_tensors| 2-D matrix with shape `[gpu_count, keys_nums_in_a_batch]`, `dtype=tf.int64`. Each row in this tensor denotes a CSR `values` for one GPU.|
|nnz_array| 1-D tensor with `dtype=tf.int64`, its length is equal to `gpu_count`, and each value denotes how many valid input keys in one CSR sparse matrix.| 
| bp_trigger| tf.Variable(dtype=tf.float32), used to automatically trigger back propagation of the embedding layer.|
| output_shape| 1-D tensor, and its value should be `[batch_size, slot_num, embedding_vec_size]`.|
| is_training| bool, specify whether use `training` resources or `evaluation` resources.|

|Returns||
|:----| :---- |
|forward_result| tf.Tensor with `dtype=init().value_type`. Forward propagation result.|

  ```python
  save(embedding_name, save_name)
  ```

This function is used to save the `embedding_plugin` parameters in the file.

|Args||
|:----| :---- |
|embedding_name| tf.Tensor with `dtype=tf.string`, save which embedding layer's parameter to file.|
|save_name| string, the name of saved parameters.|

  ```python
  restore(embedding_name, file_name)
  ```
  
This function is used to restore the `embedding_plugin` parameters from file.

|Args||
|:----| :---- |
|embedding_name| tf.Tensor with `dtype=tf.string`, restore parameters for which embedding layer.|
|file_name| string, restore paramters from this file. |