<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Multi-GPU Offline Inference

## Overview

HugeCTR version 3.4.1 provides Python APIs for multi-GPU offline inference. These APIs leverage the [HugeCTR Hierarchical Parameter Server](https://nvidia-merlin.github.io/HugeCTR/master/hugectr_core_features.html#hierarchical-parameter-server) and enable concurrent execution on multiple devices. Multi-GPU offline inference currently supports the Norm and Parquet dataset formats.

This notebook shows how to run multi-GPU offline inference with the HugeCTR Python APIs. For more details, refer to [HugeCTR Python Interface](https://nvidia-merlin.github.io/HugeCTR/master/api/python_interface.html#inference-api).

## Table of Contents
-  [Installation](#1)
   * [Get HugeCTR from NGC](#11)
   * [Build HugeCTR from Source Code](#12)
-  [Demo](#2)
   * [Data Generation](#21)
   * [Train from Scratch](#22)
   * [Multi-GPU Offline Inference](#23)

<a id="1"></a>
## 1. Installation

<a id="11"></a>
### 1.1 Get HugeCTR from NGC
The HugeCTR Python module is preinstalled in the [Merlin Training Container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-training): `nvcr.io/nvidia/merlin/merlin-training:22.04`.

After launching the container, you can verify that the HugeCTR Python module is available by running the following command:
```bash
$ python3 -c "import hugectr"
```
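The same check can be done from inside Python without triggering an `ImportError`. This is a minimal sketch that uses only the standard library; the helper name `module_available` is hypothetical and is not part of HugeCTR.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# Reports whether the hugectr module is visible to this interpreter.
print("hugectr importable:", module_available("hugectr"))
```

Because `find_spec` only locates the module, this does not load any CUDA libraries; running `import hugectr` remains the definitive test.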

**Note**: This Python module contains both training APIs and offline inference APIs. For online inference with Triton, please refer to [HugeCTR Backend](https://github.com/triton-inference-server/hugectr_backend).

<a id="12"></a>
### 1.2 Build HugeCTR from Source Code

If you want to build HugeCTR from source instead of using the NGC container, refer to [How to Start Your Development](https://nvidia-merlin.github.io/HugeCTR/master/hugectr_contributor_guide.html#how-to-start-your-development).

<a id="2"></a>
## 2. Demo

<a id="21"></a>
### 2.1 Data Generation
HugeCTR provides a tool to generate synthetic datasets. The [Data Generator](https://nvidia-merlin.github.io/HugeCTR/master/api/python_interface.html#data-generator-api) can generate datasets in different formats and with different distributions. For this notebook, we generate multi-hot Parquet datasets with a power-law distribution:

In [1]:
import hugectr
from hugectr.tools import DataGeneratorParams, DataGenerator

data_generator_params = DataGeneratorParams(
  format = hugectr.DataReaderType_t.Parquet,
  label_dim = 2,
  dense_dim = 2,
  num_slot = 3,
  i64_input_key = True,
  nnz_array = [2, 1, 3],
  source = "./multi_hot_parquet/file_list.txt",
  eval_source = "./multi_hot_parquet/file_list_test.txt",
  slot_size_array = [10000, 10000, 10000],
  check_type = hugectr.Check_t.Non,
  dist_type = hugectr.Distribution_t.PowerLaw,
  power_law_type = hugectr.PowerLaw_t.Short,
  num_files = 16,
  eval_num_files = 4)
data_generator = DataGenerator(data_generator_params)
data_generator.generate()

[HCTR][15:01:03][INFO][RK0][main]: Generate Parquet dataset
[HCTR][15:01:03][INFO][RK0][main]: train data folder: ./multi_hot_parquet, eval data folder: ./multi_hot_parquet, slot_size_array: 10000, 10000, 10000, nnz array: 2, 1, 3, #files for train: 16, #files for eval: 4, #samples per file: 40960, Use power law distribution: 1, alpha of power law: 1.3
[HCTR][15:01:03][INFO][RK0][main]: ./multi_hot_parquet exist
[HCTR][15:01:03][INFO][RK0][main]: ./multi_hot_parquet/train/gen_0.parquet
[HCTR][15:01:05][INFO][RK0][main]: ./multi_hot_parquet/train/gen_1.parquet
[HCTR][15:01:05][INFO][RK0][main]: ./multi_hot_parquet/train/gen_2.parquet
[HCTR][15:01:05][INFO][RK0][main]: ./multi_hot_parquet/train/gen_3.parquet
[HCTR][15:01:05][INFO][RK0][main]: ./multi_hot_parquet/train/gen_4.parquet
[HCTR][15:01:05][INFO][RK0][main]: ./multi_hot_parquet/train/gen_5.parquet
[HCTR][15:01:05][INFO][RK0][main]: ./multi_hot_parquet/train/gen_6.parquet
[HCTR][15:01:06][INFO][RK0][main]: ./multi_hot_parquet/trai
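
Before training, it can be useful to confirm that the generated file list matches the files on record. The sketch below assumes the file list is plain text whose first non-empty line is the number of data files and whose remaining lines are their paths; treat that layout as an assumption to verify against your generated `file_list.txt`. The helper `read_file_list` is hypothetical and is not a HugeCTR API.

```python
from pathlib import Path


def read_file_list(path):
    """Parse a file list: first non-empty line is the count, the rest are paths."""
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if not lines:
        raise ValueError(f"{path} is empty")
    expected = int(lines[0])
    files = lines[1:]
    if expected != len(files):
        raise ValueError(f"{path} declares {expected} files but lists {len(files)}")
    return files


# Only inspect the list if the data generator has already produced it.
list_path = Path("./multi_hot_parquet/file_list.txt")
if list_path.exists():
    print(len(read_file_list(list_path)), "training files listed")
```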

<a id="22"></a>
### 2.2 Train from Scratch
We can train from scratch by following these steps with the Python APIs:

1. Create the solver, reader and optimizer, then initialize the model.
2. Construct the model graph by adding input, sparse embedding and dense layers in order.
3. Compile the model and print a summary of the model graph.
4. Dump the model graph to a JSON file.
5. Fit the model, which implicitly saves the model weights and optimizer states.
6. Dump one batch of evaluation results to files.

In [2]:
%%writefile multi_hot_train.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(model_name = "multi_hot",
                              max_eval_batches = 1,
                              batchsize_eval = 16384,
                              batchsize = 16384,
                              lr = 0.001,
                              vvgpu = [[0]],
                              i64_input_key = True,
                              repeat_dataset = True,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = ["./multi_hot_parquet/file_list.txt"],
                                  eval_source = "./multi_hot_parquet/file_list_test.txt",
                                  check_type = hugectr.Check_t.Non,
                                  slot_size_array = [10000, 10000, 10000])
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 2, label_name = "label",
                        dense_dim = 2, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("data1", [2, 1], False, 2),
                        hugectr.DataReaderSparseParam("data2", 3, False, 1),]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 4,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "data1",
                            optimizer = optimizer))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 2,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding2",
                            bottom_name = "data2",
                            optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding1"],
                            top_names = ["reshape1"],
                            leading_dim=32))                            
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding2"],
                            top_names = ["reshape2"],
                            leading_dim=16))                            
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                            bottom_names = ["reshape1", "reshape2", "dense"], top_names = ["concat1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["concat1"],
                            top_names = ["fc1"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu1"],
                            top_names = ["fc2"],
                            num_output=2))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.MultiCrossEntropyLoss,
                            bottom_names = ["fc2", "label"],
                            top_names = ["loss"],
                            target_weight_vec = [0.5, 0.5]))
model.compile()
model.summary()
model.graph_to_json("multi_hot.json")
model.fit(max_iter = 1100, display = 200, eval_interval = 1000, snapshot = 1000, snapshot_prefix = "multi_hot")
model.export_predictions("multi_hot_pred_" + str(1000), "multi_hot_label_" + str(1000))

Overwriting multi_hot_train.py
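
The `leading_dim` values of the two Reshape layers in the script follow from the slot counts and `embedding_vec_size`. The sketch below works through that arithmetic, assuming each slot yields one vector of length `embedding_vec_size` after the `sum` combiner; the variable names are illustrative only.

```python
embedding_vec_size = 16

# Number of slots behind each sparse input, taken from the last argument
# of the two DataReaderSparseParam entries in the training script.
slots_per_input = {"data1": 2, "data2": 1}
dense_dim = 2

# One embedding vector per slot, so the flattened width is slots * vec_size.
reshape_dims = {name: n * embedding_vec_size for name, n in slots_per_input.items()}

# The Concat layer joins both reshaped embeddings with the dense features.
concat_dim = sum(reshape_dims.values()) + dense_dim

print(reshape_dims)  # {'data1': 32, 'data2': 16}
print(concat_dim)    # 50
```

If you change the slot layout or `embedding_vec_size`, the `leading_dim` arguments need to be updated to match.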


In [3]:
!python3 multi_hot_train.py

HugeCTR Version: 3.4
[HCTR][15:04:04][INFO][RK0][main]: Initialize model: multi_hot
[HCTR][15:04:04][INFO][RK0][main]: Global seed is 2258929170
[HCTR][15:04:04][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
[HCTR][15:04:05][INFO][RK0][main]: Start all2all warmup
[HCTR][15:04:05][INFO][RK0][main]: End all2all warmup
[HCTR][15:04:05][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][15:04:05][INFO][RK0][main]: Device 0: Tesla V100-SXM2-32GB
[HCTR][15:04:05][INFO][RK0][main]: num of DataReader workers: 1
[HCTR][15:04:05][INFO][RK0][main]: Vocabulary size: 30000
[HCTR][15:04:05][INFO][RK0][main]: max_vocabulary_size_per_gpu_=65536
[HCTR][15:04:05][INFO][RK0][main]: max_vocabulary_size_per_gpu_=32768
[HCTR][15:04:05][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][15:04:14][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][15:04:14][INFO][RK0][main]: gpu0 init embedding done
[HCTR][15:04:14][INFO][RK0][main]: gpu0 start to init embedding
[HCT

<a id="23"></a>
### 2.3 Multi-GPU Offline Inference

We will demonstrate multi-GPU offline inference by doing the following steps with Python APIs:
1. Configure the inference hyperparameters.
2. Initialize the inference model, which is a collection of inference sessions deployed on multiple devices.
3. Run inference on the evaluation dataset.
4. Check correctness by comparing the predictions with the dumped evaluation results.

**Note**: The `max_batchsize` configured within `InferenceParams` is the global batch size, and it should be divisible by the number of deployed devices. The numpy array returned by `InferenceModel.predict` is of the shape `(max_batchsize * num_batches, label_dim)`.
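
The two constraints in the note can be checked with plain arithmetic before launching inference. This is a sketch that assumes the global batch is split evenly across the deployed devices, as the divisibility requirement suggests; `per_device_batch` and `predict_shape` are hypothetical helpers, not HugeCTR functions.

```python
def per_device_batch(max_batchsize, deployed_devices):
    """Return the per-device share of the global batch, or raise if it is uneven."""
    num_devices = len(deployed_devices)
    if num_devices == 0:
        raise ValueError("deployed_devices must not be empty")
    if max_batchsize % num_devices != 0:
        raise ValueError(
            f"max_batchsize={max_batchsize} is not divisible by {num_devices} devices"
        )
    return max_batchsize // num_devices


def predict_shape(max_batchsize, num_batches, label_dim):
    """Shape of the returned prediction array, as stated in the note above."""
    return (max_batchsize * num_batches, label_dim)


print(per_device_batch(1024, [0, 1, 2, 3]))  # 256
print(predict_shape(1024, 16, 2))            # (16384, 2)
```

With the values used in this notebook, the prediction array holds 16384 rows, which matches the `batchsize_eval` of 16384 used for the single dumped evaluation batch.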

In [4]:
import hugectr
from hugectr.inference import InferenceModel, InferenceParams
import numpy as np
from mpi4py import MPI

model_config = "multi_hot.json"
inference_params = InferenceParams(
    model_name = "multi_hot",
    max_batchsize = 1024,
    hit_rate_threshold = 1.0,
    dense_model_file = "multi_hot_dense_1000.model",
    sparse_model_files = ["multi_hot0_sparse_1000.model", "multi_hot1_sparse_1000.model"],
    deployed_devices = [0, 1, 2, 3],
    use_gpu_embedding_cache = True,
    cache_size_percentage = 0.5,
    i64_input_key = True
)
inference_model = InferenceModel(model_config, inference_params)
pred = inference_model.predict(
    16,
    "./multi_hot_parquet/file_list_test.txt",
    hugectr.DataReaderType_t.Parquet,
    hugectr.Check_t.Non,
    [10000, 10000, 10000]
)
ground_truth = np.loadtxt("multi_hot_pred_1000")
print("pred: ", pred)
print("ground_truth: ", ground_truth)
diff = pred.flatten()-ground_truth
mse = np.mean(diff*diff)
print("mse: ", mse)

[HCTR][15:04:58][INFO][RK0][main]: Global seed is 3101700364
[HCTR][15:04:58][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
  GPU 1 ->  node 0
  GPU 2 ->  node 0
  GPU 3 ->  node 0
[HCTR][15:05:01][INFO][RK0][main]: Start all2all warmup
[HCTR][15:05:02][INFO][RK0][main]: End all2all warmup
[HCTR][15:05:02][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][15:05:02][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][15:05:02][INFO][RK0][main]: Creating ParallelHashMap CPU database backend...
[HCTR][15:05:02][INFO][RK0][main]: Created parallel (16 partitions) blank database backend in local memory!
[HCTR][15:05:02][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][15:05:02][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][15:05:02][INFO][RK0][main]: Table: hctr_et.multi_hot.sparse_embedding1; cached 16597 / 16597 embeddings in volatile database (ParallelHashMap); load: 16597 / 18446744
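
The mean squared error alone can hide a few large outliers, so it may help to also report the largest absolute difference and apply an explicit tolerance. The following sketch uses only NumPy; the helper `compare_predictions` and the default tolerance `atol=1e-3` are assumptions chosen for illustration, not values prescribed by HugeCTR.

```python
import numpy as np


def compare_predictions(pred, reference, atol=1e-3):
    """Compare two prediction arrays after flattening them to 1-D."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    reference = np.asarray(reference, dtype=np.float64).ravel()
    if pred.shape != reference.shape:
        raise ValueError(f"size mismatch: {pred.size} vs {reference.size}")
    diff = pred - reference
    return {
        "mse": float(np.mean(diff * diff)),
        "max_abs_diff": float(np.max(np.abs(diff))),
        "within_tol": bool(np.all(np.abs(diff) <= atol)),
    }


# Placeholder arrays for demonstration only; in the notebook, pass `pred`
# and the array loaded from "multi_hot_pred_1000" instead.
print(compare_predictions([[0.25, 0.75]], [0.25, 0.75]))
```

The explicit size check catches the case where the number of inferred batches does not match the dumped evaluation results, which would otherwise surface as a broadcasting error.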