<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# HugeCTR Continuous Training

## Overview
This notebook shows how to use the model oversubscription feature (also known as the Embedding Training Cache) in HugeCTR for continuous training. Model oversubscription is designed for recommendation models with huge embedding tables. It trains such models incrementally, which lets you train a model even when its size is much larger than the available GPU memory.

Model oversubscription currently supports the following features:
- Support single-node multi-GPU and multi-node multi-GPU cases
- Support both the SSD-based parameter server (PS) and the host memory (HMEM)-based PS
  * For the SSD-based PS, the embedding table size can scale up to the SSD capacity
  * For the HMEM-based PS, the embedding table size can scale up to the aggregate HMEM size of all nodes
- Provide the `get_incremental_model()` interface to get the updated embedding features during training
- Support training from scratch or incremental training with existing sparse and dense models
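
The idea behind these features can be pictured with a toy model. The sketch below is purely conceptual and does not reflect how HugeCTR implements the feature: `ToyParameterServer` and `train_pass` are hypothetical names, the embedding update is a fake gradient step, and a Python dict stands in for the host-memory or SSD parameter server. It only illustrates that each training pass loads the embeddings named by a keyset into a small working set, updates them, and writes them back.

```python
class ToyParameterServer:
    """Hypothetical host-side store holding every embedding vector."""

    def __init__(self, vec_size):
        self.vec_size = vec_size
        self.table = {}  # key -> list of floats

    def load(self, keys):
        # Keys never seen before get a zero vector (a simplification).
        return {k: list(self.table.get(k, [0.0] * self.vec_size)) for k in keys}

    def dump(self, cache):
        # Write the updated working set back to the full table.
        self.table.update({k: list(v) for k, v in cache.items()})


def train_pass(ps, keyset, samples, lr=0.5):
    """Load only the keys in `keyset`, apply a fake update per sample, write back."""
    cache = ps.load(keyset)  # working set that would live in GPU memory
    for key in samples:
        # The keyset must cover every key in the samples, otherwise this raises KeyError.
        cache[key] = [w - lr * 1.0 for w in cache[key]]  # pretend the gradient is 1.0
    ps.dump(cache)
    return cache


ps = ToyParameterServer(vec_size=2)
first_cache = train_pass(ps, keyset={1, 2}, samples=[1, 2, 2])
second_cache = train_pass(ps, keyset={2, 3}, samples=[2, 3])
print(sorted(second_cache))  # [2, 3]
print(sorted(ps.table))      # [1, 2, 3]
```

The working set of the second pass holds only keys 2 and 3, while the full table on the toy parameter server keeps all three keys.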

To learn more about model oversubscription, see [Embedding Training Cache](../docs/hugectr_user_guide.md#embedding-training-cache).

To learn how to use the model oversubscription APIs, see [HugeCTR Python Interface](../docs/python_interface.md).

## Table of Contents
-  [Installation](#1)
   * [Get HugeCTR from NGC](#11)
   * [Build HugeCTR from Source Code](#12)
-  [Continuous Training](#2)
   * [Data Preparation](#21)
   * [Continuous Training with High-level API](#22)
   * [Continuous Training with Low-level API](#23)

<a id="1"></a>
## Installation

<a id="11"></a>
### 1.1 Get HugeCTR from NGC
The continuous training module is preinstalled in the [Merlin Training Container](https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-training): `nvcr.io/nvidia/merlin/merlin-training:0.7`.

After launching the container, you can verify that the required libraries are present by running the following command:
```bash
$ python3 -c "import hugectr"
```
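
The same check can be done from inside Python without importing the modules. This is a minimal sketch: `module_available` is a hypothetical helper name, and `mpi4py` is included only because the training scripts in this notebook import it.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be found on the Python path."""
    return importlib.util.find_spec(name) is not None


for name in ("hugectr", "mpi4py"):
    print(name, "found" if module_available(name) else "MISSING")
```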

<a id="12"></a>
### 1.2 Build HugeCTR from Source Code

To build HugeCTR from source instead of using the NGC container, see [Setup development environment](../docs/hugectr_contributor_guide.md#setup-development-environment).

<a id="2"></a>
## Continuous Training

<a id="21"></a>
### 2.1 Data Preparation
1. Download the Kaggle Criteo dataset using the following command:
   ```shell
   $ cd ${project_root}/tools
   $ wget http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_1.gz
   ```
   
   Preprocessing the downloaded dataset involves operations such as:
   * Reducing the amount of data to speed up the preprocessing
   * Filling missing values
   * Removing feature values that occur very rarely

2. Preprocess the data with Pandas using the following command:
   ```shell
   $ bash preprocess.sh 1 wdl_data pandas 1 1 100
   ```
   
   The command-line arguments have the following meanings:
   * The 1st argument is the dataset postfix. It is `1` here because `day_1` is used.
   * The 2nd argument, `wdl_data`, is the directory where the preprocessed data is stored.
   * The 3rd argument, `pandas`, selects the preprocessing script to use.
   * The 4th argument, `1`, indicates that normalization is applied to the dense features.
   * The 5th argument, `1`, indicates that feature crossing is applied.
   * The 6th argument, `100`, is the number of data files in each file list.

   For more details about the data preprocessing, please refer to [Preprocess the Criteo Dataset](../samples/criteo#preprocess-the-dataset).
   
3. Create a soft link of the dataset folder to the path of this notebook using the following command:
   ```shell
   $ ln -s ${project_root}/tools/wdl_data ${project_root}/notebooks/wdl_data
   ```
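
The preprocessing operations listed in step 1 can be illustrated with a small standard-library sketch. It is not the logic of `preprocess.sh`: the function name `toy_preprocess`, the empty-string convention for missing values, and the `RARE` placeholder are assumptions made for illustration only.

```python
from collections import Counter


def toy_preprocess(rows, max_rows, min_count, fill_value="0", rare_value="RARE"):
    """Illustrative only: truncate, fill missing values, and fold rare values together."""
    # 1. Reduce the amount of data by keeping only the first `max_rows` rows.
    rows = rows[:max_rows]
    # 2. Fill missing values (represented here by empty strings).
    filled = [[v if v != "" else fill_value for v in row] for row in rows]
    if not filled:
        return []
    # 3. Count occurrences per column, then replace values seen fewer than
    #    `min_count` times with a shared placeholder.
    counts = [Counter(row[c] for row in filled) for c in range(len(filled[0]))]
    return [[v if counts[c][v] >= min_count else rare_value
             for c, v in enumerate(row)] for row in filled]


rows = [["a", "x"], ["a", ""], ["b", "x"], ["a", "y"]]
result = toy_preprocess(rows, max_rows=3, min_count=2)
print(result)
```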

<a id="22"></a>
### 2.2 Continuous Training with High-level API
This section provides a code sample for continuous training with the Keras-like high-level API. The high-level API hides most of the complexity, which makes it easy to use and suitable for most scenarios in a production environment.

HugeCTR also provides low-level APIs that let you customize the training logic. A code sample using the low-level APIs is provided in the next section.

The code sample in this section trains a model from scratch using the model oversubscriber, gets the incremental model, and saves the trained dense weights and sparse embedding weights. It follows these steps:

1. Create the `solver`, `reader`, `optimizer` and `mos`, then initialize the model.
2. Construct the model graph by adding input, sparse embedding, and dense layers in order.
3. Compile the model and print a summary of the model graph.
4. Dump the model graph to a JSON file.
5. Train the sparse and dense model.
6. Set the new training datasets and their corresponding keysets.
7. Train the sparse and dense model incrementally.
8. Get the incrementally trained embedding table.
9. Save the model weights and optimizer states explicitly.

Note: `repeat_dataset` should be `False` when using the model oversubscriber. In this mode, the `num_epochs` argument of `Model::fit` specifies the number of training epochs.
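
In the script, each training file list is paired with a keyset file of the same index, both in `DataReaderParams` and in `set_source`. The sketch below builds those paired lists and reports any file that is missing before training starts. `dataset_lists` and `missing_files` are hypothetical helper names, not part of HugeCTR, and the `wdl_data` default follows the paths used in this notebook.

```python
import os


def dataset_lists(indices, data_dir="wdl_data"):
    """Return (source, keyset) lists with one entry per dataset index."""
    source = [f"{data_dir}/file_list.{i}.txt" for i in indices]
    keyset = [f"{data_dir}/file_list.{i}.keyset" for i in indices]
    return source, keyset


def missing_files(paths):
    """Return the subset of `paths` that does not exist as a regular file."""
    return [p for p in paths if not os.path.isfile(p)]


source, keyset = dataset_lists(range(2))
print(source)
print(keyset)
print("missing:", missing_files(source + keyset))
```

The two lists can then be passed as the `source` and `keyset` arguments, and `dataset_lists([3, 4])` gives the lists for the incremental pass.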

In [1]:
%%writefile wdl_train.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(max_eval_batches = 5000,
                              batchsize_eval = 1024,
                              batchsize = 1024,
                              lr = 0.001,
                              vvgpu = [[0]],
                              i64_input_key = False,
                              use_mixed_precision = False,
                              repeat_dataset = False,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                          source = ["wdl_data/file_list."+str(i)+".txt" for i in range(2)],
                          keyset = ["wdl_data/file_list."+str(i)+".keyset" for i in range(2)],
                          eval_source = "wdl_data/file_list.2.txt",
                          check_type = hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
mos = hugectr.CreateMOS(train_from_scratch = True, use_host_memory_ps = True, dest_sparse_models = ["./wdl_0_sparse_model", "./wdl_1_sparse_model"])
model = hugectr.Model(solver, reader, optimizer, mos)
model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 13, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("wide_data", 30, True, 1),
                        hugectr.DataReaderSparseParam("deep_data", 2, False, 26)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 23,
                            embedding_vec_size = 1,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding2",
                            bottom_name = "wide_data",
                            optimizer = optimizer))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 358,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "deep_data",
                            optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding1"],
                            top_names = ["reshape1"],
                            leading_dim=416))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding2"],
                            top_names = ["reshape2"],
                            leading_dim=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                            bottom_names = ["reshape1", "dense"], top_names = ["concat1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["concat1"],
                            top_names = ["fc1"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                            bottom_names = ["relu1"],
                            top_names = ["dropout1"],
                            dropout_rate=0.5))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["dropout1"],
                            top_names = ["fc2"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc2"],
                            top_names = ["relu2"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                            bottom_names = ["relu2"],
                            top_names = ["dropout2"],
                            dropout_rate=0.5))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["dropout2"],
                            top_names = ["fc3"],
                            num_output=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Add,
                            bottom_names = ["fc3", "reshape2"],
                            top_names = ["add1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                            bottom_names = ["add1", "label"],
                            top_names = ["loss"]))
model.compile()
model.summary()
model.graph_to_json(graph_config_file = "wdl.json")
model.fit(num_epochs = 1, display = 500, eval_interval = 1000)
# Get the updated embedding features in model.fit()
# updated_model = model.get_incremental_model()
model.set_source(source = ["wdl_data/file_list.3.txt", "wdl_data/file_list.4.txt"], keyset = ["wdl_data/file_list.3.keyset", "wdl_data/file_list.4.keyset"], eval_source = "wdl_data/file_list.5.txt")
model.fit(num_epochs = 1, display = 500, eval_interval = 1000)
# Get the updated embedding features in model.fit()
updated_model = model.get_incremental_model()
model.save_params_to_files("wdl_mos")

Overwriting wdl_train.py


In [2]:
!python3 wdl_train.py

[22d01h07m30s][HUGECTR][INFO]: Global seed is 3998117219
[22d01h07m31s][HUGECTR][INFO]: Device to NUMA mapping:
  GPU 0 ->  node 0

[22d01h07m32s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
[22d01h07m32s][HUGECTR][INFO]: Start all2all warmup
[22d01h07m32s][HUGECTR][INFO]: End all2all warmup
[22d01h07m32s][HUGECTR][INFO]: Using All-reduce algorithm OneShot
Device 0: Tesla V100-SXM2-32GB
[22d01h07m32s][HUGECTR][INFO]: num of DataReader workers: 12
[22d01h07m32s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=6029312
[22d01h07m32s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=5865472
[22d01h07m35s][HUGECTR][INFO]: gpu0 start to init embedding
[22d01h07m35s][HUGECTR][INFO]: gpu0 init embedding done
[22d01h07m35s][HUGECTR][INFO]: gpu0 start to init embedding
[22d01h07m35s][HUGECTR][INFO]: gpu0 init embedding done
[22d01h07m35s][HUGECTR][INFO]: Host MEM-based Parameter Server is enabled
[22d01h07m35s][HUGECTR][INFO]: construct sparse models for model oversubscriber: ./wdl_0_s

<a id="23"></a>
### 2.3 Continuous Training with Low-level API

This section provides a code sample for continuous training with the low-level API. It follows the same logic as the code sample in the previous section.

Although the low-level APIs provide fine-grained control over the training logic, we encourage you to use the high-level API if it satisfies your requirements, because working directly with the data reader and the model oversubscriber is not straightforward and is error-prone.

For more about the low-level API, please refer to [Low-level Training API](../docs/python_interface.md#low-level-training-api) and samples of [Low-level Training](./hugectr_criteo.ipynb).
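
The script in the next cell repeats the same training loop for the initial and the incremental datasets. One possible way to factor that loop is sketched below. The function names `run_eval` and `train_on_datasets` are hypothetical, and the sketch assumes only the method calls that already appear in the script (`set_source`, `update`, `get_next`, `set_learning_rate`, `train`, `eval`, `get_eval_metrics`) with the behavior the script relies on.

```python
def run_eval(model, data_reader_eval, max_eval_batches):
    """Evaluate up to `max_eval_batches` batches and return the metrics."""
    batches = 0
    has_data = True
    while has_data and batches < max_eval_batches:
        has_data = model.eval()
        batches += 1
    if not has_data:
        # The eval reader ran out of data: reset it, as the script does.
        data_reader_eval.set_source()
    return model.get_eval_metrics()


def train_on_datasets(model, lr_sch, data_reader_train, data_reader_eval,
                      model_oversubscriber, datasets, max_eval_batches,
                      eval_interval=1000, iteration=0):
    """Train once over each (file_list, keyset_file) pair; return the iteration count."""
    for file_list, keyset_file in datasets:
        data_reader_train.set_source(file_list)
        model_oversubscriber.update(keyset_file)
        while True:
            model.set_learning_rate(lr_sch.get_next())
            if not model.train():  # False once the current file list is exhausted
                break
            if iteration % eval_interval == 0:
                metrics = run_eval(model, data_reader_eval, max_eval_batches)
                print("[HUGECTR][INFO] iter: {}, metrics: {}".format(iteration, metrics))
            iteration += 1
        print("[HUGECTR][INFO] trained with data in {}".format(file_list))
    return iteration
```

With the objects created in the script, the first loop would become `iteration = train_on_datasets(model, lr_sch, data_reader_train, data_reader_eval, model_oversubscriber, dataset, solver.max_eval_batches)`, and the incremental pass would pass the returned `iteration` back in so that the evaluation schedule continues.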

In [3]:
%%writefile wdl_mos.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(max_eval_batches = 5000,
                              batchsize_eval = 1024,
                              batchsize = 1024,
                              vvgpu = [[0]],
                              i64_input_key = False,
                              use_mixed_precision = False,
                              repeat_dataset = False,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                          source = ["wdl_data/file_list."+str(i)+".txt" for i in range(2)],
                          keyset = ["wdl_data/file_list."+str(i)+".keyset" for i in range(2)],
                          eval_source = "wdl_data/file_list.2.txt",
                          check_type = hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
mos = hugectr.CreateMOS(train_from_scratch = True, use_host_memory_ps = True, dest_sparse_models = ["./wdl_low_0_sparse_model", "./wdl_low_1_sparse_model"])
model = hugectr.Model(solver, reader, optimizer, mos)
model.construct_from_json(graph_config_file = "wdl.json", include_dense_network = True)
model.compile()
lr_sch = model.get_learning_rate_scheduler()
data_reader_train = model.get_data_reader_train()
data_reader_eval = model.get_data_reader_eval()
model_oversubscriber = model.get_model_oversubscriber()
dataset = [("wdl_data/file_list."+str(i)+".txt", "wdl_data/file_list."+str(i)+".keyset") for i in range(2)]
data_reader_eval.set_source("wdl_data/file_list.2.txt")
data_reader_eval_flag = True
iteration = 0
for file_list, keyset_file in dataset:
  data_reader_train.set_source(file_list)
  data_reader_train_flag = True
  model_oversubscriber.update(keyset_file)
  while True:
    lr = lr_sch.get_next()
    model.set_learning_rate(lr)
    data_reader_train_flag = model.train()
    if not data_reader_train_flag:
      break
    if iteration % 1000 == 0:
      batches = 0
      while data_reader_eval_flag:
        if batches >= solver.max_eval_batches:
          break
        data_reader_eval_flag = model.eval()
        batches += 1
      if not data_reader_eval_flag:
        data_reader_eval.set_source()
        data_reader_eval_flag = True
      metrics = model.get_eval_metrics()
      print("[HUGECTR][INFO] iter: {}, metrics: {}".format(iteration, metrics))
    iteration += 1
  print("[HUGECTR][INFO] trained with data in {}".format(file_list))

dataset = [("wdl_data/file_list."+str(i)+".txt", "wdl_data/file_list."+str(i)+".keyset") for i in range(3, 5)]
for file_list, keyset_file in dataset:
  data_reader_train.set_source(file_list)
  data_reader_train_flag = True
  model_oversubscriber.update(keyset_file)
  while True:
    lr = lr_sch.get_next()
    model.set_learning_rate(lr)
    data_reader_train_flag = model.train()
    if not data_reader_train_flag:
      break
    if iteration % 1000 == 0:
      batches = 0
      while data_reader_eval_flag:
        if batches >= solver.max_eval_batches:
          break
        data_reader_eval_flag = model.eval()
        batches += 1
      if not data_reader_eval_flag:
        data_reader_eval.set_source()
        data_reader_eval_flag = True
      metrics = model.get_eval_metrics()
      print("[HUGECTR][INFO] iter: {}, metrics: {}".format(iteration, metrics))
    iteration += 1
  print("[HUGECTR][INFO] trained with data in {}".format(file_list))
incremental_model = model.get_incremental_model()
model.save_params_to_files("wdl_mos")

Overwriting wdl_mos.py


In [4]:
!python3 wdl_mos.py

[22d01h09m55s][HUGECTR][INFO]: Global seed is 457036880
[22d01h09m55s][HUGECTR][INFO]: Device to NUMA mapping:
  GPU 0 ->  node 0

[22d01h09m56s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
[22d01h09m56s][HUGECTR][INFO]: Start all2all warmup
[22d01h09m56s][HUGECTR][INFO]: End all2all warmup
[22d01h09m56s][HUGECTR][INFO]: Using All-reduce algorithm OneShot
Device 0: Tesla V100-SXM2-32GB
[22d01h09m56s][HUGECTR][INFO]: num of DataReader workers: 12
[22d01h09m56s][HUGECTR][INFO]: max_num_frequent_categories is not specified using default: 1
[22d01h09m56s][HUGECTR][INFO]: max_num_infrequent_samples is not specified using default: -1
[22d01h09m56s][HUGECTR][INFO]: p_dup_max is not specified using default: 0.010000
[22d01h09m56s][HUGECTR][INFO]: max_all_reduce_bandwidth is not specified using default: 130000000000.000000
[22d01h09m56s][HUGECTR][INFO]: max_all_to_all_bandwidth is not specified using default: 190000000000.000000
[22d01h09m56s][HUGECTR][INFO]: efficiency_bandwid