In [None]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# TensorFlow Embedding Plugin Benchmark

This notebook benchmarks the Merlin Sparse Operation Kit (SOK) TensorFlow embedding plugin against an equivalent native TensorFlow implementation.

## Requirements

This notebook is designed to run in the Merlin TensorFlow Docker image nvcr.io/nvidia/merlin/merlin-tensorflow-training:21.12, which is available from the NVIDIA GPU Cloud (NGC) [Merlin page](https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-tensorflow-training). Start the container with:

```
docker run --runtime=nvidia --net=host --rm -it -v $(pwd):/hugectr -w /hugectr -p 8888:8888 nvcr.io/nvidia/merlin/merlin-tensorflow-training:21.12
```

Then from within the container, start the Jupyter notebook server with:

```
jupyter notebook --ip 0.0.0.0 --allow-root
```

## Prerequisites

Make sure TensorFlow 2.x and CuPy are installed.

In [None]:
import tensorflow
print(tensorflow.__version__)

import cupy
cupy.__version__
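
The version check above relies on reading the printed output. As a minimal sketch, it could instead fail loudly when the major version is too old. `require_major` is a hypothetical helper written for this illustration and is not part of TensorFlow or CuPy.

```python
def require_major(version: str, minimum: int) -> int:
    """Return the major version parsed from `version`; raise if it is below `minimum`."""
    major = int(version.split(".")[0])
    if major < minimum:
        raise RuntimeError(
            f"found version {version}, need major version >= {minimum}"
        )
    return major


# Literal version strings keep the sketch self-contained; in the notebook
# you would pass tensorflow.__version__ instead.
print(require_major("2.5.0", 2))
```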

## Dataset

Next, we generate a synthetic dataset for this test.

In [None]:
CMD = """python3 gen_data.py \
    --global_batch_size=65536 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --iter_num=30 
    """
!cd ../documents/tutorials/DenseDemo && $CMD
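
To get a feel for how much data the command above asks for, the flag values can be multiplied out. This is a rough sketch that assumes the flags mean what their names suggest: `iter_num` batches of `global_batch_size` samples, each sample holding `slot_num` slots with `nnz_per_slot` keys. The actual file layout written by `gen_data.py` is not described in this notebook.

```python
# Values copied from the gen_data.py command above.
global_batch_size = 65536
slot_num = 100
nnz_per_slot = 10
iter_num = 30

# Assumption: one iteration consumes one global batch.
num_samples = global_batch_size * iter_num

# Assumption: every slot of every sample holds exactly nnz_per_slot keys.
keys_per_sample = slot_num * nnz_per_slot
total_keys = num_samples * keys_per_sample

print(f"samples:         {num_samples:,}")
print(f"keys per sample: {keys_per_sample:,}")
print(f"total keys:      {total_keys:,}")
```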

Next, we split the same dataset into 8 parts, which is better suited to multi-GPU training.

In [None]:
CMD = """python3 split_data.py \
    --filename="./data.file" \
    --split_num=8 \
    --save_prefix="./data_"
    """
!cd ../documents/tutorials/DenseDemo && $CMD
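
As a quick sanity check on the split, the sketch below works out how many samples each part would hold. It assumes that `split_data.py` divides the samples evenly across the `split_num` parts, which this notebook does not confirm.

```python
# Values copied from the gen_data.py and split_data.py commands above.
global_batch_size = 65536
iter_num = 30
split_num = 8

num_samples = global_batch_size * iter_num

# Assumption: samples are distributed evenly across the parts.
samples_per_part, remainder = divmod(num_samples, split_num)

print(f"samples per part: {samples_per_part:,} (remainder {remainder})")
```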

## Benchmarking TensorFlow model

We will first benchmark a TensorFlow model on 1 GPU.

In [None]:
CMD="""python3 run_tf.py \
    --data_filename="./data.file" \
    --global_batch_size=65536 \
    --vocabulary_size=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --stop_at_iter=30
    """
!cd ../documents/tutorials/DenseDemo && $CMD

## Benchmarking SOK TensorFlow embedding plugin model

We will next benchmark an equivalent model, but with the SOK TensorFlow embedding plugin, also on 1 GPU.

In [None]:
CMD="""mpiexec -n 1 --allow-run-as-root \
    python3 run_sok_MultiWorker_mpi.py \
    --data_filename="./data.file" \
    --global_batch_size=65536 \
    --max_vocabulary_size_per_gpu=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --data_splited=0 \
    --optimizer="adam"
    """
!cd ../documents/tutorials/DenseDemo && $CMD


## Benchmarking SOK multi-GPU

We will next benchmark the same model, but with the SOK TensorFlow embedding plugin on multiple GPUs.

For a DGX Station A100 with 4 GPUs:

In [None]:
CMD="""mpiexec -n 4 --allow-run-as-root \
    python3 run_sok_MultiWorker_mpi.py \
    --data_filename="./data_" \
    --global_batch_size=65536 \
    --max_vocabulary_size_per_gpu=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --data_splited=1 \
    --optimizer="adam"
    """
!cd ../documents/tutorials/DenseDemo && $CMD

For the NVIDIA DGX A100 with 8 GPUs:

In [None]:
CMD="""mpiexec -n 8 --allow-run-as-root \
    python3 run_sok_MultiWorker_mpi.py \
    --data_filename="./data_" \
    --global_batch_size=65536 \
    --max_vocabulary_size_per_gpu=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --data_splited=1 \
    --optimizer="adam" \
    --dgx_a100
    """
!cd ../documents/tutorials/DenseDemo && $CMD
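
The multi-GPU commands keep `--global_batch_size` and `--max_vocabulary_size_per_gpu` fixed while changing the number of MPI ranks. The sketch below shows what that could imply per GPU. It assumes one rank per GPU, a global batch divided evenly across ranks, and a total embedding capacity equal to the per-GPU limit times the number of GPUs. `per_gpu_settings` is a hypothetical helper, not part of SOK.

```python
# Values copied from the mpiexec commands above.
GLOBAL_BATCH_SIZE = 65536
MAX_VOCABULARY_SIZE_PER_GPU = 8192


def per_gpu_settings(num_gpus: int) -> tuple:
    """Return (samples per GPU per iteration, total embedding rows across GPUs)."""
    # Assumption: the global batch is split evenly across ranks.
    per_gpu_batch = GLOBAL_BATCH_SIZE // num_gpus
    # Assumption: each GPU can hold up to the per-GPU vocabulary limit.
    total_rows = MAX_VOCABULARY_SIZE_PER_GPU * num_gpus
    return per_gpu_batch, total_rows


for num_gpus in (1, 4, 8):
    batch, rows = per_gpu_settings(num_gpus)
    print(f"{num_gpus} GPU(s): per-GPU batch = {batch}, total rows = {rows}")
```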

## Performance numbers

This section lists SOK performance numbers measured on the DGX Station A100 and the DGX V100. All values are average iteration times in milliseconds; lower is better.


| Model                | Avg. iteration time, 1 GPU (ms) | Avg. iteration time, 4 GPUs (ms) |
|----------------------|---------------------------------|----------------------------------|
| TensorFlow 2.5       | 1831.1                          | N/A                              |
| SOK embedding plugin | 233.1                           | 77.6                             |

<center><b>Table 1. Iteration time (ms) on an NVIDIA DGX-Station A100 80GB.</b></center>



| Model                | Avg. iteration time, 1 GPU (ms) | Avg. iteration time, 8 GPUs (ms) |
|----------------------|---------------------------------|----------------------------------|
| TensorFlow 2.5       | 1606.6                          | N/A                              |
| SOK embedding plugin | 241.8                           | 113.1                            |

<center><b>Table 2. Iteration time (ms) on an NVIDIA DGX V100 32GB.</b></center>
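
The iteration times in Tables 1 and 2 can be turned into speedups and a scaling efficiency. The sketch below is one way to do that. The throughput figure assumes the tabulated times were measured with the global batch size of 65536 used in the commands in this notebook. `summarize` and the dictionary keys are names chosen for this illustration.

```python
# Average iteration times in milliseconds, copied from Tables 1 and 2.
RESULTS = {
    "DGX-Station A100 80GB": {
        "tf_1gpu": 1831.1, "sok_1gpu": 233.1, "sok_multi": 77.6, "num_gpus": 4,
    },
    "DGX V100 32GB": {
        "tf_1gpu": 1606.6, "sok_1gpu": 241.8, "sok_multi": 113.1, "num_gpus": 8,
    },
}

# Assumption: the tables were measured with this global batch size.
GLOBAL_BATCH_SIZE = 65536


def summarize(row: dict) -> dict:
    """Derive speedups, scaling efficiency, and throughput from one table."""
    sok_vs_tf = row["tf_1gpu"] / row["sok_1gpu"]
    multi_vs_single = row["sok_1gpu"] / row["sok_multi"]
    return {
        "sok_vs_tf_1gpu": sok_vs_tf,
        "multi_vs_single_gpu": multi_vs_single,
        # 1.0 would mean perfectly linear scaling with the number of GPUs.
        "scaling_efficiency": multi_vs_single / row["num_gpus"],
        # Samples per second in the multi-GPU run (times are in ms).
        "multi_gpu_samples_per_s": GLOBAL_BATCH_SIZE / (row["sok_multi"] / 1000.0),
    }


for system, row in RESULTS.items():
    s = summarize(row)
    print(system)
    print(f"  SOK vs. TensorFlow on 1 GPU: {s['sok_vs_tf_1gpu']:.2f}x")
    print(f"  SOK {row['num_gpus']} GPUs vs. 1 GPU: {s['multi_vs_single_gpu']:.2f}x")
    print(f"  scaling efficiency: {s['scaling_efficiency']:.1%}")
    print(f"  multi-GPU throughput: {s['multi_gpu_samples_per_s']:,.0f} samples/s")
```

On a single GPU, the SOK plugin is roughly 7.9x faster than the TensorFlow model on the DGX-Station A100 and roughly 6.6x faster on the DGX V100.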