<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# HugeCTR Python Interface

## Overview

In HugeCTR version 3.1, we provide an enhanced Python interface that supports continuous training and inference with high-level APIs. There are three main improvements. First, the model graph can be constructed and dumped to a JSON file with Python code, which saves users from writing JSON configuration files by hand. Second, model oversubscription is supported through high-level APIs and extended to online training cases; a notebook can be found [here](./continuous_training.ipynb). Third, a freezing method is provided for both the sparse embeddings and the dense network, which enables transfer learning and fine-tuning for CTR tasks.

This notebook explains how to access and use the enhanced HugeCTR Python interface. Note that the low-level training APIs are still maintained for users who want precise control over each training iteration, although migrating to the high-level training APIs is strongly recommended. For more details on the Python API, refer to [HugeCTR Python Interface](../docs/python_interface.md).

## Table of Contents
-  [Installation](#1)
   * [Get HugeCTR from NGC](#11)
   * [Build HugeCTR from Source Code](#12)
-  [DCN Demo](#2)
   * [Download and Preprocess Data](#21)
   * [Train from Scratch](#22)
   * [Continue Training](#23)
   * [Inference](#24)
-  [Wide&Deep Demo](#3)
   * [Download and Preprocess Data](#31)
   * [Train from Scratch](#32)
   * [Fine-tune](#33)
   * [Low-level Training](#34)

<a id="1"></a>
## 1. Installation

<a id="11"></a>
### 1.1 Get HugeCTR from NGC
The HugeCTR Python module is preinstalled in the [Merlin Training Container](https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-training): `nvcr.io/nvidia/merlin/merlin-training:0.7`.

You can verify that the required libraries are present by running the following command after launching the container.
```bash
$ python3 -c "import hugectr"
```

<a id="12"></a>
### 1.2 Build HugeCTR from Source Code

If you want to build HugeCTR from source instead of using the NGC container, refer to [Setup development environment](../docs/hugectr_contributor_guide.md#setup-development-environment).

<a id="2"></a>
## 2. DCN Demo

<a id="21"></a>
### 2.1 Download and Preprocess Data
1. Download the Kaggle Criteo dataset using the following command:
   ```shell
   $ cd ${project_root}/tools
   $ wget http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_1.gz
   ```
   
   In preprocessing, we further reduce the amount of data to speed up the preprocessing, fill missing values, remove feature values that occur very rarely, and so on. Here we choose the pandas preprocessing method to make the dataset ready for HugeCTR training.

2. Preprocess the data with pandas using the following command:
   ```shell
   $ bash preprocess.sh 1 dcn_data pandas 1 0
   ```
   
   The first argument is the dataset postfix. It is 1 here since day_1 is used. The second argument, dcn_data, is where the preprocessed data is stored. The fourth argument (the one after pandas), 1, means that normalization is applied to the dense features. The last argument, 0, means that feature crossing is not applied.

3. Create a soft link to the dataset folder using the following command:
   ```shell
   $ ln -s ${project_root}/tools/dcn_data ${project_root}/notebooks/dcn_data
   ```

<a id="22"></a>
### 2.2 Train from Scratch

We can train from scratch, dump the model graph to a JSON file, and save the model weights and optimizer states by doing the following with the Python APIs:

1. Create the solver, reader and optimizer, then initialize the model.
2. Construct the model graph by adding input, sparse embedding and dense layers in order.
3. Compile the model and have an overview of the model graph.
4. Dump the model graph to the JSON file.
5. Fit the model, save the model weights and optimizer states implicitly.

Please note that the training mode is determined by `repeat_dataset` within `hugectr.CreateSolver`. If it is True, the non-epoch mode training will be adopted and the maximum iterations should be specified by `max_iter` within `hugectr.Model.fit`. If it is False, then the epoch-mode training will be adopted and the number of epochs should be specified by `num_epochs` within `hugectr.Model.fit`.
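
The pairing between `repeat_dataset` and the arguments of `fit` can be summarized with a small helper. This is only an illustrative sketch: `fit_arguments` is a hypothetical function that is not part of HugeCTR, and it does nothing more than encode the rule stated above.

```python
def fit_arguments(repeat_dataset, max_iter=None, num_epochs=None):
    """Pick the keyword that controls training length for the given mode.

    Hypothetical helper for illustration only; not a HugeCTR API.
    """
    if repeat_dataset:
        # Non-epoch mode: the dataset repeats, so an iteration budget is needed.
        if max_iter is None:
            raise ValueError("non-epoch mode (repeat_dataset=True) needs max_iter")
        return {"max_iter": max_iter}
    # Epoch mode: the dataset is consumed a fixed number of times.
    if num_epochs is None:
        raise ValueError("epoch mode (repeat_dataset=False) needs num_epochs")
    return {"num_epochs": num_epochs}


print(fit_arguments(True, max_iter=1200))   # settings used by the DCN demo
print(fit_arguments(False, num_epochs=1))   # settings used by the Wide&Deep demo
```

The resulting dictionary could be unpacked into a call such as `model.fit(**kwargs, display = 500, ...)`, which keeps the solver flag and the training length consistent.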

The optimizer that is used to initialize the model applies to the weights of dense layers, while the optimizer for each sparse embedding layer can be specified independently within `hugectr.SparseEmbedding`.

In [2]:
%%writefile dcn_train.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(max_eval_batches = 1500,
                              batchsize_eval = 4096,
                              batchsize = 4096,
                              lr = 0.001,
                              vvgpu = [[0]],
                              i64_input_key = False,
                              use_mixed_precision = False,
                              repeat_dataset = True,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                                  source = ["./dcn_data/file_list.txt"],
                                  eval_source = "./dcn_data/file_list_test.txt",
                                  check_type = hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 13, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("data1", 2, False, 26)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 88,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "data1",
                            optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding1"],
                            top_names = ["reshape1"],
                            leading_dim=416))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                            bottom_names = ["reshape1", "dense"], top_names = ["concat1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Slice,
                            bottom_names = ["concat1"],
                            top_names = ["slice11", "slice12"],
                            ranges=[(0,429),(0,429)]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.MultiCross,
                            bottom_names = ["slice11"],
                            top_names = ["multicross1"],
                            num_layers=6))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["slice12"],
                            top_names = ["fc1"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                            bottom_names = ["relu1"],
                            top_names = ["dropout1"],
                            dropout_rate=0.5))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["dropout1"],
                            top_names = ["fc2"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc2"],
                            top_names = ["relu2"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                            bottom_names = ["relu2"],
                            top_names = ["dropout2"],
                            dropout_rate=0.5))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                            bottom_names = ["dropout2", "multicross1"],
                            top_names = ["concat2"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["concat2"],
                            top_names = ["fc3"],
                            num_output=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                            bottom_names = ["fc3", "label"],
                            top_names = ["loss"]))
model.compile()
model.summary()
model.graph_to_json(graph_config_file = "dcn.json")
model.fit(max_iter = 1200, display = 500, eval_interval = 100, snapshot = 1000, snapshot_prefix = "dcn")

Overwriting dcn_train.py
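
The layer dimensions in the script above are linked by simple arithmetic. The sketch below reproduces them, assuming that the last argument of `DataReaderSparseParam` (26) is the number of slots and that each slot contributes one embedding vector after the `sum` combiner.

```python
num_slots = 26            # assumed meaning of the last DataReaderSparseParam argument
embedding_vec_size = 16   # embedding_vec_size of sparse_embedding1
dense_dim = 13            # dense_dim of the Input layer

# Reshape flattens the per-slot embedding vectors into one row per sample.
reshape_dim = num_slots * embedding_vec_size
# Concat appends the dense features to the flattened embeddings.
concat_dim = reshape_dim + dense_dim

print(reshape_dim)  # 416, the leading_dim of the Reshape layer
print(concat_dim)   # 429, the upper bound of both Slice ranges
```

Both Slice ranges are `(0,429)`, so under this reading the cross network and the deep network each receive the full concatenated input.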


In [3]:
!python3 dcn_train.py

[06d04h48m45s][HUGECTR][INFO]: Global seed is 3913288676
[06d04h48m46s][HUGECTR][INFO]: Device to NUMA mapping:
  GPU 0 ->  node 0

[06d04h48m48s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
[06d04h48m48s][HUGECTR][INFO]: Start all2all warmup
[06d04h48m48s][HUGECTR][INFO]: End all2all warmup
[06d04h48m48s][HUGECTR][INFO]: Using All-reduce algorithm NCCL
Device 0: Tesla V100-SXM2-16GB
[06d04h48m48s][HUGECTR][INFO]: num of DataReader workers: 12
[06d04h48m48s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=1441792
[06d04h49m00s][HUGECTR][INFO]: gpu0 start to init embedding
[06d04h49m00s][HUGECTR][INFO]: gpu0 init embedding done
[06d04h49m00s][HUGECTR][INFO]: Starting AUC NCCL warm-up
[06d04h49m00s][HUGECTR][INFO]: Warm-up done
Label                                   Dense                         Sparse                        
label                                   dense                          data1                         
(None, 1)                               (None, 
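
The `max_vocabulary_size_per_gpu_` value in the log above is consistent with dividing the embedding workspace by the size of one embedding row. The sketch below assumes 4-byte (float32) embedding values and no other overhead; `max_vocabulary_size` is a hypothetical helper, and the formula is inferred from the logged numbers rather than taken from the HugeCTR source.

```python
def max_vocabulary_size(workspace_mb, embedding_vec_size, bytes_per_value=4):
    """Rows that fit in the workspace, assuming float32 embedding values."""
    workspace_bytes = workspace_mb * 1024 * 1024
    return workspace_bytes // (embedding_vec_size * bytes_per_value)


# DCN demo: workspace_size_per_gpu_in_mb = 88, embedding_vec_size = 16
print(max_vocabulary_size(88, 16))   # 1441792, as logged above

# Wide&Deep demo (Section 3.2): 23 MB with size 1, and 358 MB with size 16
print(max_vocabulary_size(23, 1))    # 6029312
print(max_vocabulary_size(358, 16))  # 5865472
```

If training reports that the embedding table is full, this relation suggests how much to raise `workspace_size_per_gpu_in_mb` for a given number of distinct keys.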

<a id="23"></a>
### 2.3 Continue Training

We can continue our training based on the saved model graph, model weights and optimizer states by doing the following with Python APIs:

1. Create the solver, reader and optimizer, then initialize the model.
2. Construct the model graph from the saved JSON file.
3. Compile the model and have an overview of the model graph.
4. Load the model weights and optimizer states.
5. Fit the model, save the model weights and optimizer states implicitly.

In [4]:
!ls *.model

dcn0_opt_sparse_1000.model  dcn_dense_1000.model  dcn_opt_dense_1000.model

dcn0_sparse_1000.model:
emb_vector  key
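
The file names in the listing above follow a regular pattern built from `snapshot_prefix` and the iteration at which the snapshot was taken. The helper below is a hypothetical sketch that reconstructs those names; the pattern is inferred only from this listing and is not a documented HugeCTR guarantee.

```python
def snapshot_files(prefix, iteration, num_embeddings):
    """Rebuild snapshot names from the pattern seen in the listing above."""
    indices = range(num_embeddings)
    return {
        "dense": f"{prefix}_dense_{iteration}.model",
        "dense_opt": f"{prefix}_opt_dense_{iteration}.model",
        # One entry per sparse embedding layer, indexed from 0.
        "sparse": [f"{prefix}{i}_sparse_{iteration}.model" for i in indices],
        "sparse_opt": [f"{prefix}{i}_opt_sparse_{iteration}.model" for i in indices],
    }


files = snapshot_files("dcn", 1000, num_embeddings=1)
for kind, name in files.items():
    print(kind, name)
```

The four entries correspond to the arguments of `load_dense_weights`, `load_dense_optimizer_states`, `load_sparse_weights` and `load_sparse_optimizer_states` in the script that follows.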


In [5]:
%%writefile dcn_continue.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(max_eval_batches = 1500,
                              batchsize_eval = 4096,
                              batchsize = 4096,
                              vvgpu = [[0]],
                              i64_input_key = False,
                              use_mixed_precision = False,
                              repeat_dataset = True,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                                  source = ["./dcn_data/file_list.txt"],
                                  eval_source = "./dcn_data/file_list_test.txt",
                                  check_type = hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
model = hugectr.Model(solver, reader, optimizer)
model.construct_from_json(graph_config_file = "dcn.json", include_dense_network = True)
model.compile()
model.load_dense_weights("dcn_dense_1000.model")
model.load_sparse_weights(["dcn0_sparse_1000.model"])
model.load_dense_optimizer_states("dcn_opt_dense_1000.model")
model.load_sparse_optimizer_states(["dcn0_opt_sparse_1000.model"])
model.summary()
model.fit(max_iter = 500, display = 50, eval_interval = 100, snapshot = 10000, snapshot_prefix = "dcn")

Writing dcn_continue.py


In [6]:
!python3 dcn_continue.py

[06d04h50m09s][HUGECTR][INFO]: Global seed is 2224725357
[06d04h50m09s][HUGECTR][INFO]: Device to NUMA mapping:
  GPU 0 ->  node 0

[06d04h50m11s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
[06d04h50m11s][HUGECTR][INFO]: Start all2all warmup
[06d04h50m11s][HUGECTR][INFO]: End all2all warmup
[06d04h50m11s][HUGECTR][INFO]: Using All-reduce algorithm NCCL
Device 0: Tesla V100-SXM2-16GB
[06d04h50m11s][HUGECTR][INFO]: num of DataReader workers: 12
[06d04h50m11s][HUGECTR][INFO]: max_num_frequent_categories is not specified using default: 1
[06d04h50m11s][HUGECTR][INFO]: max_num_infrequent_samples is not specified using default: -1
[06d04h50m11s][HUGECTR][INFO]: p_dup_max is not specified using default: 0.010000
[06d04h50m11s][HUGECTR][INFO]: max_all_reduce_bandwidth is not specified using default: 130000000000.000000
[06d04h50m11s][HUGECTR][INFO]: max_all_to_all_bandwidth is not specified using default: 190000000000.000000
[06d04h50m11s][HUGECTR][INFO]: efficiency_bandwidth

<a id="24"></a>
### 2.4 Inference

HugeCTR inference is performed with the `hugectr.inference.InferenceSession.predict` method, which takes dense features, embedding columns and the row pointers of the slots as input and returns the prediction result. We need to convert the Criteo data to the inference format first.

In [7]:
!python3 ../tools/criteo_predict/criteo2predict.py --src_csv_path=dcn_data/val/test.txt --src_config=../tools/criteo_predict/dcn_data.json --dst_path=./dcn_csr.txt --batch_size=1024
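
To make the embedding columns and row pointers more concrete, here is a toy sketch of a CSR-style layout. It assumes the usual CSR convention, in which `row_ptrs[i]` is the offset of the first key of slot `i` and the final entry equals the total number of keys. The helper `to_csr` and the sample keys are hypothetical and are not taken from `criteo2predict.py`.

```python
def to_csr(keys_per_slot):
    """Flatten per-slot key lists into (embedding_columns, row_ptrs)."""
    embedding_columns = []
    row_ptrs = [0]
    for keys in keys_per_slot:
        embedding_columns.extend(keys)
        # Each pointer marks where the next slot starts in embedding_columns.
        row_ptrs.append(len(embedding_columns))
    return embedding_columns, row_ptrs


# Two toy samples with three slots each; the fifth slot is empty.
slots = [[11], [42, 43], [7], [12], [], [9]]
cols, ptrs = to_csr(slots)
print(cols)  # [11, 42, 43, 7, 12, 9]
print(ptrs)  # [0, 1, 3, 4, 5, 5, 6]
```

With this layout the keys of slot `i` are `cols[ptrs[i]:ptrs[i + 1]]`, so an empty slot shows up as two equal consecutive pointers.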

We can then run inference based on the saved model graph and model weights by doing the following with the Python APIs:

1. Configure the inference related parameters.
2. Create the inference session.
3. Make inference with the `InferenceSession.predict` method. 

In [8]:
%%writefile dcn_inference.py
from hugectr.inference import InferenceParams, CreateInferenceSession
from mpi4py import MPI

def calculate_accuracy(labels, output):
    num_samples = len(labels)
    flags = [1 if ((labels[i] == 0 and output[i] <= 0.5) or (labels[i] == 1 and output[i] > 0.5)) else 0 for i in range(num_samples)]
    correct_samples = sum(flags)
    return float(correct_samples)/(float(num_samples)+1e-16)

config_file = "dcn.json"
# dcn_csr.txt stores one field per line, in this order: labels, dense features,
# embedding columns, row pointers. Each line is space-separated.
with open("dcn_csr.txt") as data_file:
    labels = [int(item) for item in data_file.readline().split(' ')]
    dense_features = [float(item) for item in data_file.readline().split(' ') if item!="\n"]
    embedding_columns = [int(item) for item in data_file.readline().split(' ')]
    row_ptrs = [int(item) for item in data_file.readline().split(' ')]

# create parameter server, embedding cache and inference session
inference_params = InferenceParams(model_name = "dcn",
                                max_batchsize = 1024,
                                hit_rate_threshold = 0.6,
                                dense_model_file = "./dcn_dense_1000.model",
                                sparse_model_files = ["./dcn0_sparse_1000.model"],
                                device_id = 0,
                                use_gpu_embedding_cache = True,
                                cache_size_percentage = 0.9,
                                i64_input_key = False,
                                use_mixed_precision = False)
inference_session = CreateInferenceSession(config_file, inference_params)
output = inference_session.predict(dense_features, embedding_columns, row_ptrs)
accuracy = calculate_accuracy(labels, output)
print("[HUGECTR][INFO] number samples: {}, accuracy: {}".format(len(labels), accuracy))

Writing dcn_inference.py


In [9]:
!python3 dcn_inference.py

[06d04h52m24s][HUGECTR][INFO]: default_emb_vec_value is not specified using default: 0.000000
[06d04h52m26s][HUGECTR][INFO]: Global seed is 3956797427
[06d04h52m28s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
[06d04h52m28s][HUGECTR][INFO]: Start all2all warmup
[06d04h52m28s][HUGECTR][INFO]: End all2all warmup
[06d04h52m28s][HUGECTR][INFO]: Use mixed precision: 0
[06d04h52m28s][HUGECTR][INFO]: start create embedding for inference
[06d04h52m28s][HUGECTR][INFO]: sparse_input name data1
[06d04h52m28s][HUGECTR][INFO]: create embedding for inference success
[06d04h52m28s][HUGECTR][INFO]: Inference stage skip BinaryCrossEntropyLoss layer, replaced by Sigmoid layer
[HUGECTR][INFO] number samples: 1024, accuracy: 0.96875


<a id="3"></a>
## 3. Wide&Deep Demo

<a id="31"></a>
### 3.1 Download and Preprocess Data
1. Download the Kaggle Criteo dataset using the following command:
   ```shell
   $ cd ${project_root}/tools
   $ wget http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_1.gz
   ```
   
   In preprocessing, we further reduce the amount of data to speed up the preprocessing, fill missing values, remove feature values that occur very rarely, and so on. Here we choose the pandas preprocessing method to make the dataset ready for HugeCTR training.

2. Preprocess the data with pandas using the following command:
   ```shell
   $ bash preprocess.sh 1 wdl_data pandas 1 1 100
   ```
   
   The first argument is the dataset postfix. It is 1 here since day_1 is used. The second argument, wdl_data, is where the preprocessed data is stored. The fourth argument (the one after pandas), 1, means that normalization is applied to the dense features. The fifth argument, 1, means that feature crossing is applied. The last argument, 100, is the number of data files in each file list.
   
3. Create a soft link to the dataset folder using the following command:
   ```shell
   $ ln -s ${project_root}/tools/wdl_data ${project_root}/notebooks/wdl_data
   ```

<a id="32"></a>
### 3.2 Train from Scratch

We can train from scratch, dump the model graph to a JSON file, and save the model weights and optimizer states by following the same steps as in [Section 2.2](#22).

In [10]:
%%writefile wdl_train.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(max_eval_batches = 5000,
                              batchsize_eval = 1024,
                              batchsize = 1024,
                              lr = 0.001,
                              vvgpu = [[0]],
                              i64_input_key = False,
                              use_mixed_precision = False,
                              repeat_dataset = False,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                          source = ["wdl_data/file_list.0.txt"],
                          eval_source = "wdl_data/file_list.1.txt",
                          check_type = hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 13, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("wide_data", 30, True, 1),
                        hugectr.DataReaderSparseParam("deep_data", 2, False, 26)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 23,
                            embedding_vec_size = 1,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding2",
                            bottom_name = "wide_data",
                            optimizer = optimizer))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 358,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "deep_data",
                            optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding1"],
                            top_names = ["reshape1"],
                            leading_dim=416))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding2"],
                            top_names = ["reshape2"],
                            leading_dim=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                            bottom_names = ["reshape1", "dense"], top_names = ["concat1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["concat1"],
                            top_names = ["fc1"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                            bottom_names = ["relu1"],
                            top_names = ["dropout1"],
                            dropout_rate=0.5))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["dropout1"],
                            top_names = ["fc2"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc2"],
                            top_names = ["relu2"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                            bottom_names = ["relu2"],
                            top_names = ["dropout2"],
                            dropout_rate=0.5))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["dropout2"],
                            top_names = ["fc3"],
                            num_output=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Add,
                            bottom_names = ["fc3", "reshape2"],
                            top_names = ["add1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                            bottom_names = ["add1", "label"],
                            top_names = ["loss"]))
model.compile()
model.summary()
model.graph_to_json(graph_config_file = "wdl.json")
model.fit(num_epochs = 1, display = 500, eval_interval = 500, snapshot = 4000, snapshot_prefix = "wdl")

Overwriting wdl_train.py


In [11]:
!python3 wdl_train.py

[06d05h09m58s][HUGECTR][INFO]: Global seed is 222087711
[06d05h09m58s][HUGECTR][INFO]: Device to NUMA mapping:
  GPU 0 ->  node 0

[06d05h10m00s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
[06d05h10m00s][HUGECTR][INFO]: Start all2all warmup
[06d05h10m00s][HUGECTR][INFO]: End all2all warmup
[06d05h10m00s][HUGECTR][INFO]: Using All-reduce algorithm NCCL
Device 0: Tesla V100-SXM2-16GB
[06d05h10m00s][HUGECTR][INFO]: num of DataReader workers: 12
[06d05h10m00s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=6029312
[06d05h10m00s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=5865472
[06d05h10m04s][HUGECTR][INFO]: gpu0 start to init embedding
[06d05h10m04s][HUGECTR][INFO]: gpu0 init embedding done
[06d05h10m04s][HUGECTR][INFO]: gpu0 start to init embedding
[06d05h10m04s][HUGECTR][INFO]: gpu0 init embedding done
[06d05h10m04s][HUGECTR][INFO]: Starting AUC NCCL warm-up
[06d05h10m04s][HUGECTR][INFO]: Warm-up done
Label                                   Dense                      

<a id="33"></a>
### 3.3 Fine-tune

We can load only the sparse embedding layers and their corresponding weights, and then construct a new dense network on top of them. The dense weights are trained first and the sparse weights are fine-tuned afterwards. We can achieve this by doing the following with the Python APIs:

1. Create the solver, reader and optimizer, then initialize the model.
2. Load the sparse embedding layers from the saved JSON file.
3. Add the dense layers on top of the loaded model graph.
4. Compile the model and have an overview of the model graph.
5. Load the sparse weights and freeze the sparse embedding layers.
6. Train the dense weights.
7. Unfreeze the sparse embedding layers, freeze the dense layers, and reset the learning rate scheduler with a smaller base learning rate.
8. Fine-tune the sparse weights.
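
The two-phase schedule in steps 5 to 8 can be illustrated with a toy model that has one "embedding" parameter and one "dense" parameter. This is a plain-Python sketch of the freezing idea only: the `train` function, the loss and the parameter values are made up for illustration and do not reflect how HugeCTR implements `freeze_embedding` or `freeze_dense`.

```python
def train(params, frozen, lr, steps, x=1.0, target=2.0):
    """Gradient descent on (embedding * dense * x - target)^2, skipping frozen parameters."""
    for _ in range(steps):
        err = params["embedding"] * params["dense"] * x - target
        grads = {
            "embedding": 2 * err * params["dense"] * x,
            "dense": 2 * err * params["embedding"] * x,
        }
        for name, grad in grads.items():
            if name not in frozen:  # frozen parameters receive no update
                params[name] -= lr * grad
    return params


params = {"embedding": 1.0, "dense": 0.5}

# Phase 1: embedding frozen, dense trained (steps 5 and 6).
train(params, frozen={"embedding"}, lr=0.1, steps=50)
after_phase1 = dict(params)

# Phase 2: dense frozen, embedding fine-tuned with a smaller rate (steps 7 and 8).
train(params, frozen={"dense"}, lr=0.01, steps=50)

print(after_phase1)
print(params)
```

In the first phase only the dense value moves, and in the second phase only the embedding value can move, which mirrors the order of the `freeze_embedding`, `unfreeze_embedding` and `freeze_dense` calls in the script that follows.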

In [12]:
%%writefile wdl_fine_tune.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(max_eval_batches = 5000,
                              batchsize_eval = 1024,
                              batchsize = 1024,
                              vvgpu = [[0]],
                              i64_input_key = False,
                              use_mixed_precision = False,
                              repeat_dataset = False,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                          source = ["wdl_data/file_list.2.txt"],
                          eval_source = "wdl_data/file_list.3.txt",
                          check_type = hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
model = hugectr.Model(solver, reader, optimizer)
model.construct_from_json(graph_config_file = "wdl.json", include_dense_network = False)
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding1"],
                            top_names = ["reshape1"],
                            leading_dim=416))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding2"],
                            top_names = ["reshape2"],
                            leading_dim=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                            bottom_names = ["reshape1", "reshape2", "dense"], top_names = ["concat1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["concat1"],
                            top_names = ["fc1"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                            bottom_names = ["relu1"],
                            top_names = ["dropout1"],
                            dropout_rate=0.5))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["dropout1"],
                            top_names = ["fc2"],
                            num_output=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                            bottom_names = ["fc2", "label"],
                            top_names = ["loss"]))
model.compile()
model.summary()
model.load_sparse_weights(["wdl0_sparse_4000.model", "wdl1_sparse_4000.model"])
model.freeze_embedding()
model.fit(num_epochs = 1, display = 500, eval_interval = 1000, snapshot = 100000, snapshot_prefix = "wdl")
model.unfreeze_embedding()
model.freeze_dense()
model.reset_learning_rate_scheduler(base_lr = 0.0001)
model.fit(num_epochs = 2, display = 500, eval_interval = 1000, snapshot = 100000, snapshot_prefix = "wdl")

Writing wdl_fine_tune.py


In [13]:
!python3 wdl_fine_tune.py

[06d05h11m21s][HUGECTR][INFO]: Global seed is 2468078834
[06d05h11m21s][HUGECTR][INFO]: Device to NUMA mapping:
  GPU 0 ->  node 0

[06d05h11m23s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
[06d05h11m23s][HUGECTR][INFO]: Start all2all warmup
[06d05h11m23s][HUGECTR][INFO]: End all2all warmup
[06d05h11m23s][HUGECTR][INFO]: Using All-reduce algorithm NCCL
Device 0: Tesla V100-SXM2-16GB
[06d05h11m23s][HUGECTR][INFO]: num of DataReader workers: 12
[06d05h11m23s][HUGECTR][INFO]: max_num_frequent_categories is not specified using default: 1
[06d05h11m23s][HUGECTR][INFO]: max_num_infrequent_samples is not specified using default: -1
[06d05h11m23s][HUGECTR][INFO]: p_dup_max is not specified using default: 0.010000
[06d05h11m23s][HUGECTR][INFO]: max_all_reduce_bandwidth is not specified using default: 130000000000.000000
[06d05h11m23s][HUGECTR][INFO]: max_all_to_all_bandwidth is not specified using default: 190000000000.000000
[06d05h11m23s][HUGECTR][INFO]: efficiency_bandwidth

[60d50h12m70s][HUGECTR][INFO]: Iter: 500 Time(500 iters): 2.583560s Loss: 0.117236 lr:0.000100
[60d50h12m90s][HUGECTR][INFO]: Iter: 1000 Time(500 iters): 2.566226s Loss: 0.132951 lr:0.000100
[60d50h12m11s][HUGECTR][INFO]: Evaluation, AUC: 0.749469
[60d50h12m11s][HUGECTR][INFO]: Eval Time for 5000 iters: 1.435451s
[60d50h12m13s][HUGECTR][INFO]: Iter: 1500 Time(500 iters): 4.005556s Loss: 0.165593 lr:0.000100
[60d50h12m16s][HUGECTR][INFO]: Iter: 2000 Time(500 iters): 2.567100s Loss: 0.135726 lr:0.000100
[60d50h12m17s][HUGECTR][INFO]: Evaluation, AUC: 0.756422
[60d50h12m17s][HUGECTR][INFO]: Eval Time for 5000 iters: 1.379582s
[60d50h12m20s][HUGECTR][INFO]: Iter: 2500 Time(500 iters): 3.951681s Loss: 0.111865 lr:0.000100
[60d50h12m23s][HUGECTR][INFO]: Iter: 3000 Time(500 iters): 2.567938s Loss: 0.132111 lr:0.000100
[60d50h12m24s][HUGECTR][INFO]: Evaluation, AUC: 0.760195
[60d50h12m24s][HUGECTR][INFO]: Eval Time for 5000 iters: 1.385839s
[60d50h12m26s][HUGECTR][INFO]: Iter: 3500 Time(500 it

<a id="34"></a>
### 3.4 Low-level Training

The low-level training APIs are maintained in the enhanced HugeCTR Python interface. If you want precise control over each training iteration and each evaluation step, you may find these APIs helpful. Since the data reader behaves differently in epoch mode and non-epoch mode, we need to pay attention to how the data reader is handled when using low-level training. We will demonstrate how to write the low-level training scripts for non-epoch mode, epoch mode and model oversubscription mode.

In [14]:
%%writefile wdl_non_epoch.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(max_eval_batches = 5000,
                              batchsize_eval = 1024,
                              batchsize = 1024,
                              vvgpu = [[0]],
                              i64_input_key = False,
                              use_mixed_precision = False,
                              repeat_dataset = True,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                          source = ["wdl_data/file_list.0.txt"],
                          eval_source = "wdl_data/file_list.1.txt",
                          check_type = hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
model = hugectr.Model(solver, reader, optimizer)
model.construct_from_json(graph_config_file = "wdl.json", include_dense_network = True)
model.compile()
# Non-epoch mode: start the data reader once before the training loop.
model.start_data_reading()
lr_sch = model.get_learning_rate_scheduler()
max_iter = 2000
for i in range(max_iter):
    # Apply the scheduler's next learning rate before each training step.
    lr = lr_sch.get_next()
    model.set_learning_rate(lr)
    model.train()
    # Report the training loss every 100 iterations.
    if i % 100 == 0:
        loss = model.get_current_loss()
        print("[HUGECTR][INFO] iter: {}; loss: {}".format(i, loss))
    # Evaluate every 1000 iterations, skipping iteration 0.
    if i % 1000 == 0 and i != 0:
        for _ in range(solver.max_eval_batches):
            model.eval()
        metrics = model.get_eval_metrics()
        print("[HUGECTR][INFO] iter: {}, {}".format(i, metrics))
model.save_params_to_files("./", max_iter)

Writing wdl_non_epoch.py


In [15]:
!python3 wdl_non_epoch.py

[06d05h13m49s][HUGECTR][INFO]: Global seed is 743506007
[06d05h13m50s][HUGECTR][INFO]: Device to NUMA mapping:
  GPU 0 ->  node 0

[06d05h13m52s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
[06d05h13m52s][HUGECTR][INFO]: Start all2all warmup
[06d05h13m52s][HUGECTR][INFO]: End all2all warmup
[06d05h13m52s][HUGECTR][INFO]: Using All-reduce algorithm NCCL
Device 0: Tesla V100-SXM2-16GB
[06d05h13m52s][HUGECTR][INFO]: num of DataReader workers: 12
[06d05h13m52s][HUGECTR][INFO]: max_num_frequent_categories is not specified using default: 1
[06d05h13m52s][HUGECTR][INFO]: max_num_infrequent_samples is not specified using default: -1
[06d05h13m52s][HUGECTR][INFO]: p_dup_max is not specified using default: 0.010000
[06d05h13m52s][HUGECTR][INFO]: max_all_reduce_bandwidth is not specified using default: 130000000000.000000
[06d05h13m52s][HUGECTR][INFO]: max_all_to_all_bandwidth is not specified using default: 190000000000.000000
[06d05h13m52s][HUGECTR][INFO]: efficiency_bandwidth_

In [16]:
%%writefile wdl_epoch.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(max_eval_batches = 5000,
                              batchsize_eval = 1024,
                              batchsize = 1024,
                              vvgpu = [[0]],
                              i64_input_key = False,
                              use_mixed_precision = False,
                              repeat_dataset = False,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                          source = ["wdl_data/file_list.0.txt"],
                          eval_source = "wdl_data/file_list.1.txt",
                          check_type = hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
model = hugectr.Model(solver, reader, optimizer)
model.construct_from_json(graph_config_file = "wdl.json", include_dense_network = True)
model.compile()
lr_sch = model.get_learning_rate_scheduler()
# Epoch mode: the readers are controlled explicitly instead of calling start_data_reading().
data_reader_train = model.get_data_reader_train()
data_reader_eval = model.get_data_reader_eval()
data_reader_eval.set_source()
data_reader_eval_flag = True
iteration = 0
for epoch in range(2):
  print("[HUGECTR][INFO] epoch: ", epoch)
  # Reset the training reader at the start of every epoch.
  data_reader_train.set_source()
  data_reader_train_flag = True
  while True:
    lr = lr_sch.get_next()
    model.set_learning_rate(lr)
    # model.train() returns a false value once the training data is exhausted.
    data_reader_train_flag = model.train()
    if not data_reader_train_flag:
      break
    if iteration % 1000 == 0:
      # Evaluate at most max_eval_batches batches, stopping early if the evaluation data runs out.
      batches = 0
      while data_reader_eval_flag:
        if batches >= solver.max_eval_batches:
          break
        data_reader_eval_flag = model.eval()
        batches += 1
      # Reset the evaluation reader once it is exhausted so the next evaluation can proceed.
      if not data_reader_eval_flag:
        data_reader_eval.set_source()
        data_reader_eval_flag = True
      metrics = model.get_eval_metrics()
      print("[HUGECTR][INFO] iter: {}, metrics: {}".format(iteration, metrics))
    iteration += 1
model.save_params_to_files("./", iteration)

Writing wdl_epoch.py
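The evaluation handling in the epoch-mode script can be read as a small reusable pattern: evaluate up to a fixed number of batches, and rewind the evaluation source once it is exhausted. The sketch below isolates that pattern under stated assumptions. `evaluate` and `CountingSource` are hypothetical helpers introduced for illustration, not `hugectr` APIs. Unlike the script, this helper counts only the batches that were actually evaluated.

```python
def evaluate(eval_step, reset_source, max_eval_batches):
    """Run up to max_eval_batches evaluation steps.

    eval_step returns True if a batch was evaluated and False once the
    data is exhausted, in which case reset_source is called to rewind.
    Returns the number of batches actually evaluated.
    """
    batches = 0
    while batches < max_eval_batches:
        if not eval_step():
            reset_source()
            break
        batches += 1
    return batches


class CountingSource:
    """Hypothetical evaluation source holding a fixed number of batches."""

    def __init__(self, num_batches):
        self.num_batches = num_batches
        self.left = num_batches
        self.resets = 0

    def step(self):
        # Consume one batch if any remain.
        if self.left == 0:
            return False
        self.left -= 1
        return True

    def reset(self):
        # Rewind to the start and record that a rewind happened.
        self.left = self.num_batches
        self.resets += 1


src = CountingSource(3)
first = evaluate(src.step, src.reset, max_eval_batches=2)
second = evaluate(src.step, src.reset, max_eval_batches=2)
print(first, second, src.resets)
```

With a real model, `eval_step` would correspond to `model.eval` and `reset_source` to `data_reader_eval.set_source`, as used in `wdl_epoch.py`.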


In [17]:
!python3 wdl_epoch.py

[06d05h14m35s][HUGECTR][INFO]: Global seed is 2351083163
[06d05h14m35s][HUGECTR][INFO]: Device to NUMA mapping:
  GPU 0 ->  node 0

[06d05h14m37s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
[06d05h14m37s][HUGECTR][INFO]: Start all2all warmup
[06d05h14m37s][HUGECTR][INFO]: End all2all warmup
[06d05h14m37s][HUGECTR][INFO]: Using All-reduce algorithm NCCL
Device 0: Tesla V100-SXM2-16GB
[06d05h14m37s][HUGECTR][INFO]: num of DataReader workers: 12
[06d05h14m37s][HUGECTR][INFO]: max_num_frequent_categories is not specified using default: 1
[06d05h14m37s][HUGECTR][INFO]: max_num_infrequent_samples is not specified using default: -1
[06d05h14m37s][HUGECTR][INFO]: p_dup_max is not specified using default: 0.010000
[06d05h14m37s][HUGECTR][INFO]: max_all_reduce_bandwidth is not specified using default: 130000000000.000000
[06d05h14m37s][HUGECTR][INFO]: max_all_to_all_bandwidth is not specified using default: 190000000000.000000
[06d05h14m37s][HUGECTR][INFO]: efficiency_bandwidth