In [None]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_hugectr_embedding-collection/nvidia_logo.png" style="width: 90px; float: right;">

# HugeCTR Embedding Collection

## About this Notebook

This notebook shows how to use an embedding collection in a DLRM model with the Criteo dataset for training and evaluation.

It demonstrates two key features of the embedding collection:
1. How to configure the table placement strategy.
2. How to use a dynamic hash table.

## Concepts and API Reference

The following key classes are used in this notebook:

- `hugectr.EmbeddingTableConfig`
- `hugectr.EmbeddingCollectionConfig`

For the concepts and API reference information about these classes, see the [Overview of Using the HugeCTR Embedding Collection](https://nvidia-merlin.github.io/HugeCTR/master/api/hugectr_layer_book.html#overview-of-using-the-hugectr-embedding-collection) in the HugeCTR Layer Classes and Methods information.

## Setup

To set up the environment, refer to [HugeCTR Example Notebooks](../notebooks) and follow the instructions there before running the following cells.

## Use an Embedding Collection with a DLRM Model


### Data Preparation

To download and prepare the dataset, perform the following steps. The shell commands at the end of this section run all of them from a terminal.

**Note**: If you already have the data downloaded, skip to the preprocessing step (2). If preprocessing is also done, skip to creating the soft link from the processed data to the `notebooks/` directory (3).


1. Download the Criteo dataset

Preprocessing the downloaded Kaggle Criteo dataset performs the following operations:

   * Reduce the amounts of data to speed up the preprocessing
   * Fill missing values
   * Remove the feature values whose occurrences are very rare, etc.


2. Preprocessing by Pandas:
   
   Meanings of the command line arguments:

   * The 1st argument is the dataset postfix. It is `1` when `day_1` is used.
   * The 2nd argument, `wdl_data`, is the directory where the preprocessed data is stored.
   * The 3rd argument, `pandas`, selects the preprocessing script.
   * The 4th argument, `1`, means that normalization is applied to the dense features.
   * The 5th argument, `1`, means that feature crossing is applied.
   * The 6th argument, `100`, is the number of data files in each file list.

   The commands at the end of this section pass different values (`0 deepfm_data_nvt nvt 1 0 0`): they process `day_0` with the `nvt` script and write the output to `deepfm_data_nvt`.

   For more details about the data preprocessing, please refer to the "Preprocess the Criteo Dataset" section of the README in the [samples/criteo](https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/samples/criteo) directory of the repository on GitHub.
   
3. Create a soft link of the dataset folder to the path of this notebook

#### Run the following commands on the terminal to prepare the data for this notebook


```shell
export project_root=/home/hugectr # set this to the directory where hugectr is downloaded
cd ${project_root}/tools
# Step 1
wget https://storage.googleapis.com/criteo-cail-datasets/day_0.gz
# Step 2
bash preprocess.sh 0 deepfm_data_nvt nvt 1 0 0
# Step 3
ln -s ${project_root}/tools/deepfm_data_nvt ${project_root}/notebooks/deepfm_data_nvt
```


### Prepare the Training Script

This notebook was developed on a single DGX-1, which is used to run the DLRM model. The DGX-1 consists of 8 V100-SXM2 GPUs; its GPU information is as follows.


In [6]:
! nvidia-smi

Thu Jun 23 00:14:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    45W / 300W |      0MiB / 16160MiB |      0%      Default |
|       

The training script, `dlrm_train.py`, uses the embedding collection API.
The script accepts a `--shard_plan` argument that selects the table placement strategy and a `--use_dynamic_hash_table` flag, so we can run it several times to evaluate different placement strategies with or without a dynamic hash table:


In [2]:
%%writefile dlrm_train.py
"""
 Copyright (c) 2023, NVIDIA CORPORATION.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
"""

import hugectr
import argparse
from mpi4py import MPI

parser = argparse.ArgumentParser(description="HugeCTR Embedding Collection DLRM model training script.")
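parser.add_argument(
    "--shard_plan",
    help="table placement strategy",
    type=str,
    choices=["round_robin", "uniform", "hybrid"],
    required=True,  # without a value, args.shard_plan is None and no plan can be built
)
parser.add_argument(
    "--use_dynamic_hash_table",
    help="use dynamic hash tables (max_vocabulary_size=-1) instead of fixed-size tables",
    action="store_true",
)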
args = parser.parse_args()


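def generate_shard_plan(slot_size_array, num_gpus):
    """Build the table placement for the embedding collection.

    Returns (shard_matrix, shard_strategy): shard_matrix[g] lists the names of the
    tables placed on GPU g, and shard_strategy groups the table names into
    model-parallel ("mp") and data-parallel ("dp") sets.
    """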
    if args.shard_plan == "round_robin":
        shard_strategy = [("mp", [str(i) for i in range(len(slot_size_array))])]
        shard_matrix = [[] for _ in range(num_gpus)]
        for i, table_id in enumerate(range(len(slot_size_array))):
            target_gpu = i % num_gpus
            shard_matrix[target_gpu].append(str(table_id))
    elif args.shard_plan == "uniform":
        shard_strategy = [("mp", [str(i) for i in range(len(slot_size_array))])]
        shard_matrix = [[] for _ in range(num_gpus)]
        for table_id in range(len(slot_size_array)):
            for gpu_id in range(num_gpus):
                shard_matrix[gpu_id].append(str(table_id))
    elif args.shard_plan == "hybrid":
        mp_table = [i for i in range(len(slot_size_array)) if slot_size_array[i] > 6000]
        dp_table = [i for i in range(len(slot_size_array)) if slot_size_array[i] <= 6000]
        shard_matrix = [[] for _ in range(num_gpus)]
        shard_strategy = [("mp", [str(i) for i in mp_table]), ("dp", [str(i) for i in dp_table])]

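        # Data-parallel tables are listed on every GPU.
        for table_id in dp_table:
            for gpu_id in range(num_gpus):
                shard_matrix[gpu_id].append(str(table_id))

        # Model-parallel tables are assigned to GPUs in round-robin order.
        for i, table_id in enumerate(mp_table):
            target_gpu = i % num_gpus
            shard_matrix[target_gpu].append(str(table_id))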
    else:
        raise ValueError(f"{args.shard_plan} is not supported")
    return shard_matrix, shard_strategy


solver = hugectr.CreateSolver(
    max_eval_batches=70,
    batchsize_eval=65536,
    batchsize=65536,
    lr=0.5,
    warmup_steps=300,
    vvgpu=[[0, 1, 2, 3, 4, 5, 6, 7]],
    repeat_dataset=True,
    i64_input_key=True,
    metrics_spec={hugectr.MetricsType.AverageLoss: 0.0},
    use_embedding_collection=True,
)
slot_size_array = [
    203931,
    18598,
    14092,
    7012,
    18977,
    4,
    6385,
    1245,
    49,
    186213,
    71328,
    67288,
    11,
    2168,
    7338,
    61,
    4,
    932,
    15,
    204515,
    141526,
    199433,
    60919,
    9137,
    71,
    34,
]
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,
    source=["./criteo_data/train/_file_list.txt"],
    eval_source="./criteo_data/val/_file_list.txt",
    check_type=hugectr.Check_t.Non,
)
optimizer = hugectr.CreateOptimizer(
    optimizer_type=hugectr.Optimizer_t.SGD, update_type=hugectr.Update_t.Local, atomic_update=True
)
model = hugectr.Model(solver, reader, optimizer)

num_embedding = 26

model.add(
    hugectr.Input(
        label_dim=1,
        label_name="label",
        dense_dim=13,
        dense_name="dense",
        data_reader_sparse_param_array=[
            hugectr.DataReaderSparseParam("data{}".format(i), 1, False, 1)
            for i in range(num_embedding)
        ],
    )
)

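# Create one embedding table per sparse feature.
# max_vocabulary_size=-1 selects a dynamic hash table; otherwise the table has a fixed size.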
embedding_table_list = []
for i in range(num_embedding):
    embedding_table_list.append(
        hugectr.EmbeddingTableConfig(
            name=str(i), max_vocabulary_size=-1 if args.use_dynamic_hash_table else slot_size_array[i], ev_size=128
        )
    )
# Create the embedding collection config and register one lookup per table.
ebc_config = hugectr.EmbeddingCollectionConfig(use_exclusive_keys=True)
for i in range(num_embedding):
    ebc_config.embedding_lookup(
        table_config=embedding_table_list[i],
        bottom_name="data{}".format(i),
        top_name="emb_vec{}".format(i),
        combiner="sum",
    )
shard_matrix, shard_strategy = generate_shard_plan(slot_size_array, 8)
ebc_config.shard(shard_matrix=shard_matrix, shard_strategy=shard_strategy)

model.add(ebc_config)
# Concatenate the per-table embedding vectors into one tensor for the Interaction layer.
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Concat,
        bottom_names=["emb_vec{}".format(i) for i in range(num_embedding)],
        top_names=["sparse_embedding1"],
    )
)

model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dense"],
        top_names=["fc1"],
        num_output=512,
    )
)

model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc1"], top_names=["relu1"])
)

model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["relu1"],
        top_names=["fc2"],
        num_output=256,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc2"], top_names=["relu2"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["relu2"],
        top_names=["fc3"],
        num_output=128,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc3"], top_names=["relu3"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Interaction,  # Interaction only supports 3-D input
        bottom_names=["relu3", "sparse_embedding1"],
        top_names=["interaction1"],
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["interaction1"],
        top_names=["fc4"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc4"], top_names=["relu4"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["relu4"],
        top_names=["fc5"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc5"], top_names=["relu5"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["relu5"],
        top_names=["fc6"],
        num_output=512,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc6"], top_names=["relu6"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["relu6"],
        top_names=["fc7"],
        num_output=256,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc7"], top_names=["relu7"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["relu7"],
        top_names=["fc8"],
        num_output=1,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
        bottom_names=["fc8", "label"],
        top_names=["loss"],
    )
)
model.compile()
model.summary()
model.fit(max_iter=1000, display=100, eval_interval=100, snapshot=10000000, snapshot_prefix="dlrm")

Overwriting dlrm_train.py


### Embedding Table Placement Strategy: Round Robin

With this embedding table placement strategy, each table is placed on a single GPU, and tables are assigned to GPUs in round-robin order.
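
To make the assignment concrete, the following plain-Python sketch repeats the round-robin loop from `generate_shard_plan` for the 26 tables and 8 GPUs used in this notebook. It does not call HugeCTR; it only shows which table names end up in each row of `shard_matrix`.

```python
# Round-robin placement: table i goes to GPU (i % num_gpus).
num_tables = 26  # one embedding table per sparse feature in dlrm_train.py
num_gpus = 8

shard_matrix = [[] for _ in range(num_gpus)]
for table_id in range(num_tables):
    shard_matrix[table_id % num_gpus].append(str(table_id))

for gpu_id, tables in enumerate(shard_matrix):
    print(f"GPU {gpu_id}: {tables}")
```

Because 26 is not a multiple of 8, GPUs 0 and 1 each hold four tables and the remaining GPUs hold three.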

In [15]:
!python3 dlrm_train.py --shard_plan round_robin

HugeCTR Version: 23.2
[HCTR][10:25:19.539][INFO][RK0][main]: Global seed is 3508545476
[HCTR][10:25:19.637][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
  GPU 1 ->  node 0
  GPU 2 ->  node 0
  GPU 3 ->  node 0
  GPU 4 ->  node 1
  GPU 5 ->  node 1
  GPU 6 ->  node 1
  GPU 7 ->  node 1
[HCTR][10:25:30.608][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 30.4714 
[HCTR][10:25:30.608][DEBUG][RK0][main]: [device 1] allocating 0.0000 GB, available 30.4441 
[HCTR][10:25:30.609][DEBUG][RK0][main]: [device 2] allocating 0.0000 GB, available 30.5378 
[HCTR][10:25:30.609][DEBUG][RK0][main]: [device 3] allocating 0.0000 GB, available 30.5339 
[HCTR][10:25:30.609][DEBUG][RK0][main]: [device 4] allocating 0.0000 GB, available 30.4636 
[HCTR][10:25:30.609][DEBUG][RK0][main]: [device 5] allocating 0.0000 GB, available 30.4480 
[HCTR][10:25:30.609][DEBUG][RK0][main]: [device 6] allocating 0.0000 GB, available 30.4949 
[HCTR][10:25:30.609][DEBUG][RK0][main]: [device 7] all

[HCTR][10:26:11.457][INFO][RK0][main]: Model structure on each GPU
Label                                   Dense                         Sparse                        
label                                   dense                          data0,data1,data2,data3,data4,data5,data6,data7,data8,data9,data10,data11,data12,data13,data14,data15,data16,data17,data18,data19,data20,data21,data22,data23,data24,data25
(8192,1)                                (8192,13)                               
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type                              Input Name                    Output Name                   Output Shape                  
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
EmbeddingCollection0                    data0                         emb_vec0                      (8192,1,128)                  
        

[HCTR][10:26:15.652][INFO][RK0][main]: Evaluation, AverageLoss: 0.14373
[HCTR][10:26:15.652][INFO][RK0][main]: Eval Time for 70 iters: 1.24478s
[HCTR][10:26:15.697][INFO][RK0][main]: Iter: 100 Time(100 iters): 4.23782s Loss: 0.142604 lr:0.168333
[HCTR][10:26:19.865][INFO][RK0][main]: Evaluation, AverageLoss: 0.142137
[HCTR][10:26:19.865][INFO][RK0][main]: Eval Time for 70 iters: 1.25698s
[HCTR][10:26:19.899][INFO][RK0][main]: Iter: 200 Time(100 iters): 4.19912s Loss: 0.142685 lr:0.335
[HCTR][10:26:24.035][INFO][RK0][main]: Evaluation, AverageLoss: 0.1404
[HCTR][10:26:24.035][INFO][RK0][main]: Eval Time for 70 iters: 1.24589s
[HCTR][10:26:24.080][INFO][RK0][main]: Iter: 300 Time(100 iters): 4.18021s Loss: 0.143021 lr:0.5
[HCTR][10:26:28.211][INFO][RK0][main]: Evaluation, AverageLoss: 0.139695
[HCTR][10:26:28.211][INFO][RK0][main]: Eval Time for 70 iters: 1.25073s
[HCTR][10:26:28.245][INFO][RK0][main]: Iter: 400 Time(100 iters): 4.16407s Loss: 0.141111 lr:0.5
[HCTR][10:26:32.375][INFO][R

### Embedding Table Placement Strategy: Uniform

With this embedding table placement strategy, every table is placed on all 8 GPUs under the model-parallel (`mp`) strategy.
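
As a minimal sketch, the `uniform` branch of `generate_shard_plan` produces the structures below: every row of `shard_matrix` lists all 26 table names, and all of them belong to the `mp` group. How HugeCTR divides the rows of each table among the GPUs is internal to the library and is not modeled here.

```python
# Uniform placement: every GPU lists every table.
num_tables = 26
num_gpus = 8

table_names = [str(t) for t in range(num_tables)]
shard_matrix = [list(table_names) for _ in range(num_gpus)]
shard_strategy = [("mp", list(table_names))]

print(len(shard_matrix), "GPUs,", len(shard_matrix[0]), "tables listed per GPU")
```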


In [4]:
!python3 dlrm_train.py --shard_plan uniform

HugeCTR Version: 23.3
[HCTR][06:33:37.284][INFO][RK0][main]: Global seed is 3445591887
[HCTR][06:33:37.408][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
  GPU 1 ->  node 0
  GPU 2 ->  node 0
  GPU 3 ->  node 0
  GPU 4 ->  node 1
  GPU 5 ->  node 1
  GPU 6 ->  node 1
  GPU 7 ->  node 1
[HCTR][06:33:56.384][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 30.4714 
[HCTR][06:33:56.385][DEBUG][RK0][main]: [device 1] allocating 0.0000 GB, available 30.4441 
[HCTR][06:33:56.385][DEBUG][RK0][main]: [device 2] allocating 0.0000 GB, available 30.5378 
[HCTR][06:33:56.385][DEBUG][RK0][main]: [device 3] allocating 0.0000 GB, available 30.5339 
[HCTR][06:33:56.385][DEBUG][RK0][main]: [device 4] allocating 0.0000 GB, available 30.4636 
[HCTR][06:33:56.386][DEBUG][RK0][main]: [device 5] allocating 0.0000 GB, available 30.4480 
[HCTR][06:33:56.386][DEBUG][RK0][main]: [device 6] allocating 0.0000 GB, available 30.4949 
[HCTR][06:33:56.386][DEBUG][RK0][main]: [device 7] all

[HCTR][06:34:46.251][INFO][RK0][main]: Evaluation, AverageLoss: 0.143524
[HCTR][06:34:46.251][INFO][RK0][main]: Eval Time for 70 iters: 2.34586s
[HCTR][06:34:46.345][INFO][RK0][main]: Iter: 100 Time(100 iters): 8.48449s Loss: 0.142247 lr:0.168333
[HCTR][06:34:54.657][INFO][RK0][main]: Evaluation, AverageLoss: 0.141641
[HCTR][06:34:54.657][INFO][RK0][main]: Eval Time for 70 iters: 2.33134s
[HCTR][06:34:54.751][INFO][RK0][main]: Iter: 200 Time(100 iters): 8.40384s Loss: 0.142243 lr:0.335
[HCTR][06:35:03.069][INFO][RK0][main]: Evaluation, AverageLoss: 0.139913
[HCTR][06:35:03.069][INFO][RK0][main]: Eval Time for 70 iters: 2.33118s
[HCTR][06:35:03.161][INFO][RK0][main]: Iter: 300 Time(100 iters): 8.40793s Loss: 0.142713 lr:0.5
[HCTR][06:35:11.479][INFO][RK0][main]: Evaluation, AverageLoss: 0.138901
[HCTR][06:35:11.479][INFO][RK0][main]: Eval Time for 70 iters: 2.34956s
[HCTR][06:35:11.568][INFO][RK0][main]: Iter: 400 Time(100 iters): 8.40618s Loss: 0.140238 lr:0.5
[HCTR][06:35:19.883][INFO

### Embedding Table Placement Strategy: Hybrid

With this embedding table placement strategy, small tables (size <= 6000) are placed in a data-parallel way on every GPU, and large tables (size > 6000) are assigned to GPUs in round-robin order.
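
The sketch below applies the same threshold to the `slot_size_array` from `dlrm_train.py`, using plain Python only, to show which tables fall into each group and how many tables each GPU ends up listing.

```python
# Hybrid placement: split tables by size, then place each group differently.
slot_size_array = [
    203931, 18598, 14092, 7012, 18977, 4, 6385, 1245, 49, 186213, 71328, 67288, 11,
    2168, 7338, 61, 4, 932, 15, 204515, 141526, 199433, 60919, 9137, 71, 34,
]
num_gpus = 8
threshold = 6000  # same cut-off as in generate_shard_plan

mp_table = [i for i, size in enumerate(slot_size_array) if size > threshold]
dp_table = [i for i, size in enumerate(slot_size_array) if size <= threshold]

shard_matrix = [[] for _ in range(num_gpus)]
for table_id in dp_table:  # small tables: listed on every GPU
    for gpu_id in range(num_gpus):
        shard_matrix[gpu_id].append(str(table_id))
for i, table_id in enumerate(mp_table):  # large tables: one GPU each, round robin
    shard_matrix[i % num_gpus].append(str(table_id))

print("mp tables:", mp_table)
print("dp tables:", dp_table)
print("tables per GPU:", [len(row) for row in shard_matrix])
```

With these sizes, 15 tables are model-parallel and 11 are data-parallel.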

In [21]:
!python3 dlrm_train.py --shard_plan hybrid

HugeCTR Version: 23.2
[HCTR][10:35:14.415][INFO][RK0][main]: Global seed is 198655838
[HCTR][10:35:14.517][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
  GPU 1 ->  node 0
  GPU 2 ->  node 0
  GPU 3 ->  node 0
  GPU 4 ->  node 1
  GPU 5 ->  node 1
  GPU 6 ->  node 1
  GPU 7 ->  node 1
[HCTR][10:35:25.731][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 30.4714 
[HCTR][10:35:25.731][DEBUG][RK0][main]: [device 1] allocating 0.0000 GB, available 30.4441 
[HCTR][10:35:25.731][DEBUG][RK0][main]: [device 2] allocating 0.0000 GB, available 30.5378 
[HCTR][10:35:25.731][DEBUG][RK0][main]: [device 3] allocating 0.0000 GB, available 30.5339 
[HCTR][10:35:25.731][DEBUG][RK0][main]: [device 4] allocating 0.0000 GB, available 30.4636 
[HCTR][10:35:25.731][DEBUG][RK0][main]: [device 5] allocating 0.0000 GB, available 30.4480 
[HCTR][10:35:25.731][DEBUG][RK0][main]: [device 6] allocating 0.0000 GB, available 30.4949 
[HCTR][10:35:25.731][DEBUG][RK0][main]: [device 7] allo

[HCTR][10:36:12.594][INFO][RK0][main]: Model structure on each GPU
Label                                   Dense                         Sparse                        
label                                   dense                          data0,data1,data2,data3,data4,data5,data6,data7,data8,data9,data10,data11,data12,data13,data14,data15,data16,data17,data18,data19,data20,data21,data22,data23,data24,data25
(8192,1)                                (8192,13)                               
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type                              Input Name                    Output Name                   Output Shape                  
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
EmbeddingCollection0                    data0                         emb_vec0                      (8192,1,128)                  
        

[HCTR][10:36:16.599][INFO][RK0][main]: Evaluation, AverageLoss: 0.144991
[HCTR][10:36:16.599][INFO][RK0][main]: Eval Time for 70 iters: 1.22035s
[HCTR][10:36:16.633][INFO][RK0][main]: Iter: 100 Time(100 iters): 4.03885s Loss: 0.144124 lr:0.168333
[HCTR][10:36:20.570][INFO][RK0][main]: Evaluation, AverageLoss: 0.144851
[HCTR][10:36:20.570][INFO][RK0][main]: Eval Time for 70 iters: 1.1863s
[HCTR][10:36:20.615][INFO][RK0][main]: Iter: 200 Time(100 iters): 3.98102s Loss: 0.145444 lr:0.335
[HCTR][10:36:24.540][INFO][RK0][main]: Evaluation, AverageLoss: 0.141821
[HCTR][10:36:24.540][INFO][RK0][main]: Eval Time for 70 iters: 1.18638s
[HCTR][10:36:24.580][INFO][RK0][main]: Iter: 300 Time(100 iters): 3.96441s Loss: 0.144249 lr:0.5
[HCTR][10:36:28.514][INFO][RK0][main]: Evaluation, AverageLoss: 0.139519
[HCTR][10:36:28.514][INFO][RK0][main]: Eval Time for 70 iters: 1.18203s
[HCTR][10:36:28.556][INFO][RK0][main]: Iter: 400 Time(100 iters): 3.97548s Loss: 0.140895 lr:0.5
[HCTR][10:36:32.490][INFO]

### Use Dynamic Hash Table with Round Robin Table Placement Strategy

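The embedding collection lets users configure a dynamic hash table. Such a table accepts hashed input keys, and its size grows when the table is full. In `dlrm_train.py`, the `--use_dynamic_hash_table` flag enables it by setting `max_vocabulary_size=-1` for every table.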
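
The sketch below isolates that choice. `table_config_kwargs` is a hypothetical helper, not part of HugeCTR; it only builds the keyword arguments that the training script passes to `hugectr.EmbeddingTableConfig`, so it runs without HugeCTR installed.

```python
def table_config_kwargs(slot_size_array, use_dynamic_hash_table, ev_size=128):
    """Return one kwargs dict per table (hypothetical helper for illustration)."""
    return [
        {
            "name": str(i),
            # -1 requests a dynamic hash table; otherwise the table has a fixed size.
            "max_vocabulary_size": -1 if use_dynamic_hash_table else size,
            "ev_size": ev_size,
        }
        for i, size in enumerate(slot_size_array)
    ]


sizes = [203931, 18598, 14092]  # first three entries of slot_size_array
static_cfg = table_config_kwargs(sizes, use_dynamic_hash_table=False)
dynamic_cfg = table_config_kwargs(sizes, use_dynamic_hash_table=True)

print(static_cfg[0])
print(dynamic_cfg[0])
# In dlrm_train.py, each dict corresponds to hugectr.EmbeddingTableConfig(**kwargs).
```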

In [18]:
!python3 dlrm_train.py --shard_plan round_robin --use_dynamic_hash_table

HugeCTR Version: 23.2
[HCTR][10:29:29.407][INFO][RK0][main]: Global seed is 1217153067
[HCTR][10:29:29.506][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
  GPU 1 ->  node 0
  GPU 2 ->  node 0
  GPU 3 ->  node 0
  GPU 4 ->  node 1
  GPU 5 ->  node 1
  GPU 6 ->  node 1
  GPU 7 ->  node 1
[HCTR][10:29:40.485][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 30.4714 
[HCTR][10:29:40.486][DEBUG][RK0][main]: [device 1] allocating 0.0000 GB, available 30.4441 
[HCTR][10:29:40.486][DEBUG][RK0][main]: [device 2] allocating 0.0000 GB, available 30.5378 
[HCTR][10:29:40.486][DEBUG][RK0][main]: [device 3] allocating 0.0000 GB, available 30.5339 
[HCTR][10:29:40.486][DEBUG][RK0][main]: [device 4] allocating 0.0000 GB, available 30.4636 
[HCTR][10:29:40.486][DEBUG][RK0][main]: [device 5] allocating 0.0000 GB, available 30.4480 
[HCTR][10:29:40.486][DEBUG][RK0][main]: [device 6] allocating 0.0000 GB, available 30.4949 
[HCTR][10:29:40.486][DEBUG][RK0][main]: [device 7] all

[HCTR][10:30:20.859][INFO][RK0][main]: Model structure on each GPU
Label                                   Dense                         Sparse                        
label                                   dense                          data0,data1,data2,data3,data4,data5,data6,data7,data8,data9,data10,data11,data12,data13,data14,data15,data16,data17,data18,data19,data20,data21,data22,data23,data24,data25
(8192,1)                                (8192,13)                               
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type                              Input Name                    Output Name                   Output Shape                  
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
EmbeddingCollection0                    data0                         emb_vec0                      (8192,1,128)                  
        

static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
static_map allocated, size=553648128
[HCTR][10:30:26.070][INFO][RK0][main]: Evaluation, AverageLoss: 0.142151
[HCTR][10:30:26.070][INFO][RK0][main]: Eval Time for 70 iters: 1.53912s
[HCT