In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

## Overview

In this notebook, we give an overview of the HugeCTR framework, its features, and its benefits. We will use HugeCTR to train a basic neural network architecture and deploy the saved model to Triton Inference Server.

<b>Learning Objectives</b>:
* Adopt NVTabular workflow to provide input files to HugeCTR
* Define HugeCTR neural network architecture
* Train a deep learning model with HugeCTR
* Deploy HugeCTR to Triton Inference Server

### Why use HugeCTR?

HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs).<br>

HugeCTR offers multiple advantages to train deep learning recommender systems:
1. **Speed**: HugeCTR is a highly efficient framework written in C++. We experienced up to a 10x speedup. HugeCTR on an NVIDIA DGX A100 system proved to be the fastest commercially available solution for training the Deep Learning Recommender Model (DLRM) architecture developed by Facebook.
2. **Scale**: HugeCTR supports model parallel scaling. It distributes the large embedding tables over multiple GPUs or multiple nodes. 
3. **Easy-to-use**: Easy-to-use Python API similar to Keras. Examples for popular deep learning recommender systems architectures (Wide&Deep, DLRM, DCN, DeepFM) are available.

### Other Features of HugeCTR

HugeCTR is designed to scale deep learning models for recommender systems. It provides a list of other important features:
* Model oversubscription, which lets a single node train embedding tables that do not fit within GPU or CPU memory (only the required embeddings are prefetched from a parameter server per batch)
* Asynchronous and multithreaded data pipelines
* A highly optimized data loader
* Supported data formats such as parquet and binary
* Integration with Triton Inference Server for deployment to production


### Getting Started

Because HugeCTR optimizes training in CUDA C++, we need to define the training pipeline and model architecture and execute it via the command line. We will use the Python API, which is similar to Keras models.

If you are not familiar with HugeCTR's Python API and parameters, you can read more in its GitHub repository:

- HugeCTR User Guide
- HugeCTR Python API
- HugeCTR Configuration File
- HugeCTR example architectures

In this example, we will train a neural network with HugeCTR. We will use NVTabular for preprocessing.

#### Preprocessing and Feature Engineering with NVTabular

We use NVTabular to `Categorify` our categorical input columns.
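
To make the idea concrete, the sketch below shows the kind of mapping that `Categorify` produces: each distinct category value is replaced by a contiguous integer ID, which can then index a row of an embedding table. This is a plain-Python illustration, not NVTabular's implementation. The helper name `categorify`, the sample IDs, and the choice to reserve ID 0 for unseen values are assumptions made for the example.

```python
def categorify(values, reserve_unknown=True):
    """Map each distinct value to a contiguous integer ID (hypothetical helper)."""
    # Assumption for this sketch: ID 0 is kept free for values not seen here.
    start = 1 if reserve_unknown else 0
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = start + len(mapping)
    encoded = [mapping[value] for value in values]
    return encoded, mapping


# Made-up raw IDs, used only to illustrate the encoding.
raw_movie_ids = [1193, 661, 1193, 914, 661]
encoded, mapping = categorify(raw_movie_ids)

print(encoded)  # [1, 2, 1, 3, 2]
print(mapping)  # {1193: 1, 661: 2, 914: 3}
```

The number of distinct IDs after this step is the cardinality that the embedding table has to cover.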

In [1]:
# External dependencies
import os
import time
import gc

import nvtabular as nvt
import cudf 
import numpy as np

from os import path

We define our base directory, containing the data.

In [2]:
# path to store raw and preprocessed data

INPUT_DATA_DIR = os.environ.get('INPUT_DATA_DIR', os.path.expanduser("~/nvt-examples/movielens/data/"))
MODEL_BASE_DIR = os.environ.get('MODEL_BASE_DIR', '/model/')
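
The model defined later in this notebook reads `/model/data/train/_file_list.txt` and `/model/data/valid/_file_list.txt`. The sketch below shows how those paths relate to `MODEL_BASE_DIR`, assuming the preprocessed data sits in a `data` subdirectory of the model directory. The helper `file_list_path` is hypothetical and only illustrates the layout.

```python
import os

MODEL_BASE_DIR = os.environ.get("MODEL_BASE_DIR", "/model/")


def file_list_path(split, base_dir=MODEL_BASE_DIR):
    """Build the path of the file list for a data split (hypothetical helper)."""
    # Assumed layout: <base_dir>/data/<split>/_file_list.txt
    return os.path.join(base_dir, "data", split, "_file_list.txt")


print(file_list_path("train", "/model/"))
print(file_list_path("valid", "/model/"))
```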

## Scaling Accelerated Training with HugeCTR

HugeCTR is a deep learning framework dedicated to recommendation systems. It is written in CUDA C++. Because HugeCTR optimizes training in CUDA C++, we need to define the training pipeline and model architecture and execute it via the command line. We will use the Python API, which is similar to Keras models.

HugeCTR has three main components:
* Solver: Specifies various details such as active GPU list, batchsize, and model_file
* Optimizer: Specifies the type of optimizer and its hyperparameters
* Model: Specifies training/evaluation data (and their paths), embeddings, and dense layers. Note that embeddings must precede the dense layers

**Solver**

Let's take a look at the parameters for the `Solver`. Most of these hyperparameters should be familiar from other frameworks.

```
solver = hugectr.solver_parser_helper(
- vvgpu: GPU indices used in the training process, which has two levels. For example, [[0,1],[1,2]] indicates that two nodes are used: GPUs 0 and 1 on the first node and GPUs 1 and 2 on the second node. It is also possible to specify non-contiguous GPU indices such as [0, 2, 4, 7]
- max_iter: Total number of training iterations
- batchsize: Minibatch size used in training
- display: Intervals to print loss on the screen
- eval_interval: Evaluation interval in the unit of training iteration
- max_eval_batches: Maximum number of batches used in evaluation. It is recommended that the number is equal to or bigger than the actual number of batches in the evaluation dataset.
If max_iter is used, the evaluation happens for max_eval_batches by repeating the evaluation dataset infinitely.
On the other hand, with num_epochs, HugeCTR stops the evaluation if all the evaluation data is consumed
- batchsize_eval: Minibatch size used in evaluation
- mixed_precision: Enables mixed precision training with the scaler specified here. Only 128, 256, 512, and 1024 scalers are supported
)
```
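
The two-level structure of `vvgpu` can be read as one inner list per node. The sketch below is a plain-Python illustration of that reading and does not call HugeCTR. The helper `describe_vvgpu` is hypothetical.

```python
def describe_vvgpu(vvgpu):
    """Summarize a two-level GPU list: outer level = nodes, inner level = GPU indices."""
    return {
        "num_nodes": len(vvgpu),
        "gpus_per_node": [len(gpus) for gpus in vvgpu],
        "total_gpus": sum(len(gpus) for gpus in vvgpu),
    }


# Single node with a single GPU, as used in this notebook.
print(describe_vvgpu([[0]]))

# Two nodes with two GPUs each, as in the parameter description above.
print(describe_vvgpu([[0, 1], [1, 2]]))
```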

**Optimizer**

The optimizer is the algorithm to update the model parameters. HugeCTR supports the common algorithms.


```
optimizer = hugectr.optimizer.CreateOptimizer(
- optimizer_type: Optimizer algorithm - Adam, MomentumSGD, Nesterov, and SGD 
- learning_rate: Learning Rate for optimizer
)
```
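
As a reminder of what an optimizer step does, here is a minimal sketch of a single Adam update on one scalar parameter. It follows the standard textbook formulation of Adam and is not HugeCTR's GPU implementation. The default values for `beta1`, `beta2`, and `epsilon` are the commonly used ones and are an assumption here, not values confirmed for HugeCTR.

```python
import math


def adam_step(param, grad, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update to a scalar parameter and return the new state."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v


param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=0.5, m=m, v=v, t=1)
print(param)  # close to 0.999: the first step moves by about learning_rate
```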

**Model**

We initialize the model with the solver and optimizer:

```
model = hugectr.Model(solver, optimizer)
```

We can add multiple layers to the model with the `model.add` function. We will focus on:
- `Input` defines the input data
- `SparseEmbedding` defines the embedding layer
- `DenseLayer` defines dense layers, such as fully connected, ReLU, BatchNorm, etc.

**HugeCTR organizes the layers by names. For each layer, we define the input and output names.**
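
The name-based wiring can be pictured as a small graph in which every input name of a layer must already have been produced by the input layer or by an earlier layer. The sketch below checks that property for the layer names used in the model defined later in this notebook. It is an illustration in plain Python, not HugeCTR's own validation, and `check_wiring` is a hypothetical helper.

```python
def check_wiring(input_names, layers):
    """Verify that each layer only consumes names that were defined earlier."""
    available = set(input_names)
    for bottom_names, top_names in layers:
        missing = [name for name in bottom_names if name not in available]
        if missing:
            raise ValueError(f"undefined input names: {missing}")
        available.update(top_names)
    return available


# Names provided by the Input layer: label, dense, and sparse outputs.
input_names = ["label", "dense", "data1"]

# (bottom names, top names) for each layer, in the order they are added.
layers = [
    (["data1"], ["sparse_embedding1"]),
    (["sparse_embedding1"], ["reshape1"]),
    (["reshape1"], ["fc1"]),
    (["fc1"], ["relu1"]),
    (["relu1"], ["fc2"]),
    (["fc2"], ["relu2"]),
    (["relu2"], ["fc3"]),
    (["fc3", "label"], ["loss"]),
]

names = check_wiring(input_names, layers)
print(sorted(names))
```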

Input layer:

This layer is required to define the input data.

```
hugectr.Input(
    data_reader_type: Data format to read
    source: The training dataset file list.
    eval_source: The evaluation dataset file list.
    check_type: The data error detection mechanism (Sum: Checksum, None: no detection).
    label_dim: Number of label columns
    label_name: Name of label columns in network architecture
    dense_dim: Number of continuous columns
    dense_name: Name of continuous columns in network architecture
    slot_size_array: The list of categorical feature cardinalities
    data_reader_sparse_param_array: Configuration how to read sparse data
    sparse_names: Name of sparse/categorical columns in network architecture
)
```

SparseEmbedding:

This layer defines the embedding table.

```
hugectr.SparseEmbedding(
    embedding_type: Different embedding options to distribute embedding tables 
    max_vocabulary_size_per_gpu: Maximum vocabulary size or cardinality across all the input features
    embedding_vec_size: Embedding vector size
    combiner: Intra-slot reduction op (0=sum, 1=average)
    sparse_embedding_name: Layer name
    bottom_name: Input layer names
)
```
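
The `combiner` option decides how the embedding vectors of all keys in one slot are reduced to a single vector. The sketch below illustrates the two modes listed above (0 = sum, 1 = average) in plain Python. The tiny embedding table and the helper `combine` are made up for the example and do not reflect HugeCTR's internal data structures.

```python
def combine(vectors, combiner):
    """Reduce the embedding vectors of one slot: 0 = sum, 1 = average."""
    if combiner not in (0, 1):
        raise ValueError("combiner must be 0 (sum) or 1 (average)")
    dim = len(vectors[0])
    summed = [sum(vec[i] for vec in vectors) for i in range(dim)]
    if combiner == 0:
        return summed
    return [value / len(vectors) for value in summed]


# Hypothetical embedding table with two keys and vector size 2.
embedding_table = {3: [1.0, 2.0], 7: [3.0, 6.0]}
slot_keys = [3, 7]
vectors = [embedding_table[key] for key in slot_keys]

print(combine(vectors, 0))  # [4.0, 8.0]
print(combine(vectors, 1))  # [2.0, 4.0]
```

When a slot holds exactly one key per sample, both modes return the same vector.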

DenseLayer:

This layer is copied to each GPU and is normally used for the MLP tower.

```
hugectr.DenseLayer(
    layer_type: Layer type, such as FullyConnected, Reshape, Concat, Loss, BatchNorm, etc.
    bottom_names: Input layer names
    top_names: Layer name
    ...: Depending on the layer type, additional parameters can be defined
)
```
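
The MLP tower in this notebook alternates `InnerProduct` (fully connected) and `ReLU` layers. As a rough sketch of what one such pair computes for a single sample, the following plain-Python code applies a fully connected layer followed by ReLU. The weights, bias, and input are made-up numbers, and the functions are illustrative stand-ins rather than HugeCTR's GPU kernels.

```python
def inner_product(x, weights, bias):
    """Fully connected layer: one weight row and one bias value per output unit."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, value) for value in x]


x = [1.0, 2.0]                         # made-up input with 2 features
weights = [[0.5, 0.25], [-1.0, -1.0]]  # 2 output units
bias = [0.0, 0.5]

fc = inner_product(x, weights, bias)
print(fc)        # [1.0, -2.5]
print(relu(fc))  # [1.0, 0.0]
```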

## Let's define our model

We walked through the documentation because it helps to understand the API before using it. Now we can define our model. We will write the model to `./model.py` and execute it afterwards.

We need the cardinalities of each categorical feature to assign as `slot_size_array` in the model below.

In [3]:
workflow = nvt.Workflow.load(os.path.join(INPUT_DATA_DIR, "workflow"))

In [4]:
from nvtabular.ops import get_embedding_sizes

embeddings = get_embedding_sizes(workflow)
print(embeddings)

{'genres': (21, 16), 'movieId': (56586, 512), 'userId': (162542, 512)}


In addition, we will use the total cardinality as the `max_vocabulary_size_per_gpu` parameter. Since we are training the HugeCTR model only with the `movieId` and `userId` columns, we only need their cardinalities.

In [5]:
total_cardinality = embeddings['movieId'][0] + embeddings['userId'][0] 
total_cardinality

219128
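
The hard-coded numbers in the model definition can be derived from the `embeddings` dictionary printed above. The sketch below repeats that dictionary as a literal so it runs without NVTabular. The relation between `leading_dim` and the number of slots times the embedding vector size is our reading of the `Reshape` layer in this model and should be treated as an assumption.

```python
# Output of get_embedding_sizes(workflow), copied from the cell above.
embeddings = {"genres": (21, 16), "movieId": (56586, 512), "userId": (162542, 512)}

# Columns used by the HugeCTR model, in the order of the slots.
columns = ["movieId", "userId"]
embedding_vec_size = 16  # value used in the SparseEmbedding layer

slot_size_array = [embeddings[column][0] for column in columns]
max_vocabulary_size_per_gpu = sum(slot_size_array)
# Assumption: Reshape flattens the per-slot embedding vectors into one vector.
leading_dim = len(columns) * embedding_vec_size

print(slot_size_array)              # [56586, 162542]
print(max_vocabulary_size_per_gpu)  # 219128
print(leading_dim)                  # 32
```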

We will write the code to a `./model.py` file and execute it. It will create a snapshot, which we will use for inference in the next notebook.

In [6]:
%%writefile './model.py'

import hugectr
from mpi4py import MPI

solver = hugectr.solver_parser_helper(vvgpu = [[0]],
                                      max_iter = 2000,
                                      batchsize = 2048,
                                      display = 100,
                                      eval_interval = 200,
                                      batchsize_eval = 2048,
                                      max_eval_batches = 160,
                                      i64_input_key = True,
                                      use_mixed_precision = False,
                                      repeat_dataset = True,
                                      snapshot = 1900
                                      )
optimizer = hugectr.optimizer.CreateOptimizer(
    optimizer_type = hugectr.Optimizer_t.Adam,
    use_mixed_precision = False
)
model = hugectr.Model(solver, optimizer)

model.add(
    hugectr.Input(
        data_reader_type = hugectr.DataReaderType_t.Parquet,
        source = "/model/data/train/_file_list.txt",
        eval_source = "/model/data/valid/_file_list.txt",
        check_type = hugectr.Check_t.Non,
        label_dim = 1, 
        label_name = "label",
        dense_dim = 0, 
        dense_name = "dense",
        slot_size_array = [56586, 162542],  # cardinalities of movieId and userId
        data_reader_sparse_param_array = [
            hugectr.DataReaderSparseParam(hugectr.DataReaderSparse_t.Distributed, 3, 1, 2)
        ],
        sparse_names = ["data1"]
    )
)
model.add(
    hugectr.SparseEmbedding(
        embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
        max_vocabulary_size_per_gpu = 219128,  # total cardinality: 56586 + 162542
        embedding_vec_size = 16,
        combiner = 0,
        sparse_embedding_name = "sparse_embedding1",
        bottom_name = "data1"
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type = hugectr.Layer_t.Reshape,
        bottom_names = ["sparse_embedding1"],
        top_names = ["reshape1"],
        leading_dim=32  # 2 slots x embedding_vec_size of 16
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type = hugectr.Layer_t.InnerProduct,
        bottom_names = ["reshape1"], 
        top_names = ["fc1"],
        num_output=128
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type = hugectr.Layer_t.ReLU,
        bottom_names = ["fc1"], 
        top_names = ["relu1"],
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type = hugectr.Layer_t.InnerProduct,
        bottom_names = ["relu1"], 
        top_names = ["fc2"],
        num_output=128
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type = hugectr.Layer_t.ReLU,
        bottom_names = ["fc2"], 
        top_names = ["relu2"],
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type = hugectr.Layer_t.InnerProduct,
        bottom_names = ["relu2"], 
        top_names = ["fc3"],
        num_output=1
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
        bottom_names = ["fc3", "label"],
        top_names = ["loss"])
)
model.compile()
model.summary()
model.fit()

Writing ./model.py
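
The solver settings in `model.py` imply a fixed training budget, which the following arithmetic sketch spells out. It assumes that an evaluation runs every `eval_interval` iterations and that a snapshot is written every `snapshot` iterations. The exact scheduling inside HugeCTR may differ, so treat the counts as approximate.

```python
# Values copied from the solver definition in model.py.
max_iter = 2000
batchsize = 2048
eval_interval = 200
batchsize_eval = 2048
max_eval_batches = 160
snapshot = 1900

# Samples consumed during training (the dataset is repeated if necessary).
training_samples = max_iter * batchsize
# Upper bound on samples read in one evaluation round.
eval_samples_per_round = max_eval_batches * batchsize_eval
# Assumed: one evaluation round per eval_interval iterations.
eval_rounds = max_iter // eval_interval
# Assumed: one snapshot per `snapshot` iterations.
snapshot_iterations = list(range(snapshot, max_iter + 1, snapshot))

print(training_samples)        # 4096000
print(eval_samples_per_round)  # 327680
print(eval_rounds)             # 10
print(snapshot_iterations)     # [1900]
```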


In [7]:
!python model.py

[09d22h30m43s][HUGECTR][INFO]: Global seed is 1740424067
[09d22h30m45s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Device 0: Tesla V100-DGXS-16GB
[09d22h30m45s][HUGECTR][INFO]: num of DataReader workers: 1
[09d22h30m45s][HUGECTR][INFO]: num_internal_buffers 1
[09d22h30m45s][HUGECTR][INFO]: num_internal_buffers 1
[09d22h30m45s][HUGECTR][INFO]: Vocabulary size: 219128
[09d22h30m45s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=219128
[09d22h30m46s][HUGECTR][INFO]: gpu0 start to init embedding
[09d22h30m46s][HUGECTR][INFO]: gpu0 init embedding done
Label Name                    Dense Name                    Sparse Name                   
label                         dense                         data1                         
--------------------------------------------------------------------------------
Layer Type                    Input Name                    Output Name                   
----------------------------------------------------------------------------

We trained the model and created snapshots.