In [None]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_hugectr_training-with-hdfs/nvidia_logo.png" style="width: 90px; float: right;">

# HugeCTR Training and Inference with Remote File System Example

## Overview

HugeCTR can read Parquet data from, and load and save models to, remote file systems such as HDFS, AWS S3, and GCS. You can train on data stored in these file systems and, after training, dump the trained parameters and optimizer states back to them. During inference, you can read data and load sparse models from a remote file system. This notebook demonstrates the end-to-end procedure of training with HDFS, training plus inference with AWS S3, and training with GCS.

## Setup HugeCTR

To set up the environment, refer to [HugeCTR Example Notebooks](../notebooks) and follow the instructions there before running the following.

## Training with HDFS Example

Hadoop is not pre-installed in the Merlin Training Container. To help you build and install HDFS, we provide two scripts [here](https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/sbin). Build and install Hadoop with these scripts, then confirm that Hadoop is installed in your container by running the following:

In [19]:
!hadoop version

Hadoop 3.3.2
Source code repository https://github.com/apache/hadoop.git -r 0bcb014209e219273cb6fd4152df7df713cbac61
Compiled by root on 2022-07-25T09:53Z
Compiled with protoc 3.7.1
From source with checksum 4b40fff8bb27201ba07b6fa5651217fb
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.3.2.jar


### Data Preparation

Use [DataSourceParams](https://nvidia-merlin.github.io/HugeCTR/master/api/python_interface.html#data-source-api) to set up the file system configuration. Currently, `Local`, `HDFS`, `S3`, and `GCS` are supported.

**Firstly, we want to make sure that we have train and validation datasets ready:**

In [6]:
!hdfs dfs -ls hdfs://10.19.172.76:9000/dlrm_parquet/train

Found 8 items
-rw-r--r--   1 root supergroup  112247365 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_0.parquet
-rw-r--r--   1 root supergroup  112243637 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_1.parquet
-rw-r--r--   1 root supergroup  112251207 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_2.parquet
-rw-r--r--   1 root supergroup  112241764 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_3.parquet
-rw-r--r--   1 root supergroup  112247838 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_4.parquet
-rw-r--r--   1 root supergroup  112244076 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_5.parquet
-rw-r--r--   1 root supergroup  112253553 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_6.parquet
-rw-r--r--   1 root supergroup  112249557 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_7.parquet


In [7]:
!hdfs dfs -ls hdfs://10.19.172.76:9000/dlrm_parquet/val

Found 2 items
-rw-r--r--   1 root supergroup  112239093 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/val/gen_0.parquet
-rw-r--r--   1 root supergroup  112249156 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/val/gen_1.parquet


**Secondly, create `file_list.txt` and `file_list_test.txt`:**

In [9]:
!mkdir -p /dlrm_parquet/train
!mkdir -p /dlrm_parquet/val

In [12]:
%%writefile /dlrm_parquet/file_list.txt
8
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_0.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_1.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_2.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_3.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_4.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_5.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_6.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_7.parquet

Overwriting /dlrm_parquet/file_list.txt


In [13]:
%%writefile /dlrm_parquet/file_list_test.txt
2
hdfs://10.19.172.76:9000/dlrm_parquet/val/gen_0.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/val/gen_1.parquet

Overwriting /dlrm_parquet/file_list_test.txt
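The file lists above follow a simple layout: the number of files on the first line, then one path per line. As a minimal sketch, the same text could be generated rather than typed by hand. The helper name `build_file_list` is hypothetical and not part of HugeCTR.

```python
def build_file_list(paths):
    """Return file-list text: the file count on the first line, then one path per line."""
    lines = [str(len(paths))] + list(paths)
    return "\n".join(lines) + "\n"


# Paths mirroring the training list above (gen_0 .. gen_7).
train_paths = [
    f"hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_{i}.parquet" for i in range(8)
]
file_list_text = build_file_list(train_paths)
print(file_list_text)
```

The resulting string can be written to `/dlrm_parquet/file_list.txt` with a plain `open(..., "w")` call instead of `%%writefile`.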


**Lastly, create `_metadata.json` for both the train and validation datasets to specify the feature information of your dataset:**

In [15]:
%%writefile /dlrm_parquet/train/_metadata.json
{ "file_stats": [{"file_name": "./dlrm_parquet/train/gen_0.parquet", "num_rows":1000000}, {"file_name": "./dlrm_parquet/train/gen_1.parquet", "num_rows":1000000}, 
                 {"file_name": "./dlrm_parquet/train/gen_2.parquet", "num_rows":1000000}, {"file_name": "./dlrm_parquet/train/gen_3.parquet", "num_rows":1000000}, 
                 {"file_name": "./dlrm_parquet/train/gen_4.parquet", "num_rows":1000000}, {"file_name": "./dlrm_parquet/train/gen_5.parquet", "num_rows":1000000}, 
                 {"file_name": "./dlrm_parquet/train/gen_6.parquet", "num_rows":1000000}, {"file_name": "./dlrm_parquet/train/gen_7.parquet", "num_rows":1000000} ], 
  "labels": [{"col_name": "label0", "index":0} ], 
  "conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3}, 
            {"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6}, 
            {"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9}, 
            {"col_name": "C10", "index":10}, {"col_name": "C11", "index":11}, {"col_name": "C12", "index":12}, 
            {"col_name": "C13", "index":13} ], 
  "cats": [{"col_name": "C14", "index":14}, {"col_name": "C15", "index":15}, {"col_name": "C16", "index":16}, 
           {"col_name": "C17", "index":17}, {"col_name": "C18", "index":18}, {"col_name": "C19", "index":19}, 
           {"col_name": "C20", "index":20}, {"col_name": "C21", "index":21}, {"col_name": "C22", "index":22}, 
           {"col_name": "C23", "index":23}, {"col_name": "C24", "index":24}, {"col_name": "C25", "index":25}, 
           {"col_name": "C26", "index":26}, {"col_name": "C27", "index":27}, {"col_name": "C28", "index":28}, 
           {"col_name": "C29", "index":29}, {"col_name": "C30", "index":30}, {"col_name": "C31", "index":31}, 
           {"col_name": "C32", "index":32}, {"col_name": "C33", "index":33}, {"col_name": "C34", "index":34}, 
           {"col_name": "C35", "index":35}, {"col_name": "C36", "index":36}, {"col_name": "C37", "index":37}, 
           {"col_name": "C38", "index":38}, {"col_name": "C39", "index":39} ] }

Writing /dlrm_parquet/train/_metadata.json


In [16]:
%%writefile /dlrm_parquet/val/_metadata.json
{ "file_stats": [{"file_name": "./dlrm_parquet/val/gen_0.parquet", "num_rows":1000000}, 
                 {"file_name": "./dlrm_parquet/val/gen_1.parquet", "num_rows":1000000} ], 
  "labels": [{"col_name": "label0", "index":0} ], 
  "conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3}, 
            {"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6}, 
            {"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9}, 
            {"col_name": "C10", "index":10}, {"col_name": "C11", "index":11}, {"col_name": "C12", "index":12}, 
            {"col_name": "C13", "index":13} ], 
  "cats": [{"col_name": "C14", "index":14}, {"col_name": "C15", "index":15}, {"col_name": "C16", "index":16}, 
           {"col_name": "C17", "index":17}, {"col_name": "C18", "index":18}, {"col_name": "C19", "index":19}, 
           {"col_name": "C20", "index":20}, {"col_name": "C21", "index":21}, {"col_name": "C22", "index":22}, 
           {"col_name": "C23", "index":23}, {"col_name": "C24", "index":24}, {"col_name": "C25", "index":25}, 
           {"col_name": "C26", "index":26}, {"col_name": "C27", "index":27}, {"col_name": "C28", "index":28}, 
           {"col_name": "C29", "index":29}, {"col_name": "C30", "index":30}, {"col_name": "C31", "index":31}, 
           {"col_name": "C32", "index":32}, {"col_name": "C33", "index":33}, {"col_name": "C34", "index":34}, 
           {"col_name": "C35", "index":35}, {"col_name": "C36", "index":36}, {"col_name": "C37", "index":37}, 
           {"col_name": "C38", "index":38}, {"col_name": "C39", "index":39} ] }

Writing /dlrm_parquet/val/_metadata.json
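Because the two `_metadata.json` files differ only in their `file_stats` entries, they can also be produced programmatically. The following is a rough sketch that reproduces the layout shown above (one label, 13 continuous columns, 26 categorical columns). The function `build_metadata` and its parameters are illustrative assumptions, not a HugeCTR API.

```python
import json


def build_metadata(file_names, rows_per_file, num_labels=1, num_conts=13, num_cats=26):
    """Build a _metadata.json-style dict with the layout used in this notebook."""
    file_stats = [{"file_name": name, "num_rows": rows_per_file} for name in file_names]
    labels = [{"col_name": f"label{i}", "index": i} for i in range(num_labels)]
    # With the defaults, continuous columns are C1..C13 and categorical columns
    # are C14..C39, placed directly after the label column.
    conts = [{"col_name": f"C{i + 1}", "index": num_labels + i} for i in range(num_conts)]
    cats = [
        {"col_name": f"C{num_conts + i + 1}", "index": num_labels + num_conts + i}
        for i in range(num_cats)
    ]
    return {"file_stats": file_stats, "labels": labels, "conts": conts, "cats": cats}


val_metadata = build_metadata(
    [f"./dlrm_parquet/val/gen_{i}.parquet" for i in range(2)], rows_per_file=1000000
)
print(json.dumps(val_metadata, indent=2))
```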


### Training a DLRM model

**Important APIs used in the following script:**
1. We use [DataSourceParams](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#datasourceparams-class) to define the remote file system to read data from, in this case HDFS.
2. In [DataReaderParams](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#datareaderparams), we pass the `DataSourceParams` object.
3. In the [fit()](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#fit-method) method, we specify an HDFS path in the `snapshot_prefix` parameter to dump the trained model to HDFS.

In [19]:
%%writefile train_with_hdfs.py
import hugectr
from mpi4py import MPI
from hugectr.data import DataSourceParams

# Create a file system configuration 
data_source_params = DataSourceParams(
    source = hugectr.DataSourceType_t.HDFS, #use HDFS
    server = '10.19.172.76', #your HDFS namenode IP
    port = 9000, #your HDFS namenode port
)

# DLRM train
solver = hugectr.CreateSolver(max_eval_batches = 1280,
                              batchsize_eval = 1024,
                              batchsize = 1024,
                              lr = 0.01,
                              vvgpu = [[1]],
                              i64_input_key = True,
                              use_mixed_precision = False,
                              repeat_dataset = True,
                              use_cuda_graph = False)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = ["/dlrm_parquet/file_list.txt"],
                                  eval_source = "/dlrm_parquet/file_list_test.txt",
                                  slot_size_array = [405274, 72550, 55008, 222734, 316071, 156265, 220243, 200179, 234566, 335625, 278726, 263070, 312542, 203773, 145859, 117421, 78140, 3648, 156308, 94562, 357703, 386976, 238046, 230917, 292, 156382],
                                  data_source_params = data_source_params, #file system config for data reading
                                  check_type = hugectr.Check_t.Non)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.SGD,
                                    update_type = hugectr.Update_t.Local,
                                    atomic_update = True)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 13, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("data1", 1, True, 26)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
                            workspace_size_per_gpu_in_mb = 10720,
                            embedding_vec_size = 128,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "data1",
                            optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["dense"],
                            top_names = ["fc1"],
                            num_output=512))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))                           
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu1"],
                            top_names = ["fc2"],
                            num_output=256))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc2"],
                            top_names = ["relu2"]))                            
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu2"],
                            top_names = ["fc3"],
                            num_output=128))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc3"],
                            top_names = ["relu3"]))                              
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Interaction,
                            bottom_names = ["relu3","sparse_embedding1"],
                            top_names = ["interaction1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["interaction1"],
                            top_names = ["fc4"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc4"],
                            top_names = ["relu4"]))                              
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu4"],
                            top_names = ["fc5"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc5"],
                            top_names = ["relu5"]))                              
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu5"],
                            top_names = ["fc6"],
                            num_output=512))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc6"],
                            top_names = ["relu6"]))                               
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu6"],
                            top_names = ["fc7"],
                            num_output=256))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc7"],
                            top_names = ["relu7"]))                                                                              
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu7"],
                            top_names = ["fc8"],
                            num_output=1))                                                                                           
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                            bottom_names = ["fc8", "label"],
                            top_names = ["loss"]))
model.compile()
model.summary()

model.fit(max_iter = 2020, display = 200, eval_interval = 1000, snapshot = 2000, snapshot_prefix = "hdfs://10.19.172.76:9000/model/dlrm/") 

Overwriting train_with_hdfs.py


In [20]:
!python train_with_hdfs.py

HugeCTR Version: 3.8
[HCTR][07:51:52.502][INFO][RK0][main]: Global seed is 3218787045
[HCTR][07:51:52.505][INFO][RK0][main]: Device to NUMA mapping:
  GPU 1 ->  node 0
[HCTR][07:51:55.607][INFO][RK0][main]: Start all2all warmup
[HCTR][07:51:55.609][INFO][RK0][main]: End all2all warmup
[HCTR][07:51:56.529][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][07:51:56.530][INFO][RK0][main]: Device 1: NVIDIA A10
[HCTR][07:51:56.531][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][07:51:56.531][INFO][RK0][main]: num of DataReader workers for eval: 1
[HCTR][07:51:57.695][INFO][RK0][main]: Using Hadoop Cluster 10.19.172.76:9000
[HCTR][07:51:57.740][INFO][RK0][main]: Using Hadoop Cluster 10.19.172.76:9000
[HCTR][07:51:57.740][INFO][RK0][main]: Vocabulary size: 5242880
[HCTR][07:51:57.741][INFO][RK0][main]: max_vocabulary_size_per_gpu_=21954560
[HCTR][07:51:57.755][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][07:52:04.336][INFO][RK0][main]: gpu0 sta
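The two sizes reported in the log can be cross-checked against the training script. `Vocabulary size: 5242880` equals the sum of `slot_size_array`. `max_vocabulary_size_per_gpu_=21954560` is consistent with dividing the embedding workspace by the size of one embedding vector. The 4-byte value size is an assumption that happens to reproduce the logged number; the source does not state how HugeCTR computes it.

```python
slot_size_array = [
    405274, 72550, 55008, 222734, 316071, 156265, 220243, 200179, 234566,
    335625, 278726, 263070, 312542, 203773, 145859, 117421, 78140, 3648,
    156308, 94562, 357703, 386976, 238046, 230917, 292, 156382,
]
vocabulary_size = sum(slot_size_array)

workspace_size_per_gpu_in_mb = 10720
embedding_vec_size = 128
bytes_per_value = 4  # assumption: embedding values are stored as 32-bit floats

workspace_bytes = workspace_size_per_gpu_in_mb * 1024 * 1024
max_vocabulary_size_per_gpu = workspace_bytes // (embedding_vec_size * bytes_per_value)

print(vocabulary_size)              # 5242880
print(max_vocabulary_size_per_gpu)  # 21954560
```

Since the vocabulary is smaller than the per-GPU capacity, the configured workspace is large enough for this dataset under that assumption.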

**Check that our model files are saved in HDFS:**

In [22]:
!hdfs dfs -ls hdfs://10.19.172.76:9000/model/dlrm

Found 3 items
drwxr-xr-x   - root supergroup          0 2022-07-27 07:52 hdfs://10.19.172.76:9000/model/dlrm/0_sparse_2000.model
-rw-r--r--   3 root supergroup    9479684 2022-07-27 07:52 hdfs://10.19.172.76:9000/model/dlrm/_dense_2000.model
-rw-r--r--   3 root supergroup          0 2022-07-27 07:52 hdfs://10.19.172.76:9000/model/dlrm/_opt_dense_2000.model
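Judging only from the listing above, the snapshot files appear to be named by appending `<index>_sparse_<iter>.model`, `_dense_<iter>.model`, and `_opt_dense_<iter>.model` to `snapshot_prefix`. The sketch below builds the expected paths so they can be compared with the `hdfs dfs -ls` output. The helper `expected_snapshot_paths` is hypothetical, and numbering several sparse embeddings from 0 is an assumption, since this model has only one.

```python
def expected_snapshot_paths(snapshot_prefix, iteration, num_sparse_embeddings=1):
    """Build the snapshot paths implied by the naming pattern seen in the listing."""
    # Assumption: sparse embeddings are numbered from 0 in the order they were added.
    sparse = [
        f"{snapshot_prefix}{i}_sparse_{iteration}.model"
        for i in range(num_sparse_embeddings)
    ]
    dense = f"{snapshot_prefix}_dense_{iteration}.model"
    opt_dense = f"{snapshot_prefix}_opt_dense_{iteration}.model"
    return sparse + [dense, opt_dense]


snapshot_paths = expected_snapshot_paths("hdfs://10.19.172.76:9000/model/dlrm/", 2000)
for path in snapshot_paths:
    print(path)
```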


## Training a DCN model with AWS S3

**Before you start:**
Note that the AWS S3 SDKs are NOT preinstalled in the NGC Docker container. To use S3-related functionality, build HugeCTR from source with the following steps:
1. `git clone https://github.com/NVIDIA/HugeCTR.git`
2. `cd HugeCTR`
3. `git submodule update --init --recursive`
4. `mkdir -p build && cd build`
5. `cmake -DCMAKE_BUILD_TYPE=Release -DSM=70 -DENABLE_S3=ON ..` (the `ENABLE_S3` option installs the AWS S3 SDKs for you)
6. `make -j && make install`

### Data preparation

**Create `file_list.txt` and `file_list_test.txt`:**

In [1]:
!mkdir -p /hugectr-io-test/data/dcn_parquet/train
!mkdir -p /hugectr-io-test/data/dcn_parquet/val

In [2]:
%%writefile /hugectr-io-test/data/dcn_parquet/file_list.txt
16
s3://hugectr-io-test/data/dcn_parquet/train/gen_0.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_1.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_2.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_3.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_4.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_5.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_6.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_7.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_8.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_9.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_10.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_11.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_12.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_13.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_14.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_15.parquet

Writing /hugectr-io-test/data/dcn_parquet/file_list.txt


In [3]:
%%writefile /hugectr-io-test/data/dcn_parquet/file_list_test.txt
4
s3://hugectr-io-test/data/dcn_parquet/val/gen_0.parquet
s3://hugectr-io-test/data/dcn_parquet/val/gen_1.parquet
s3://hugectr-io-test/data/dcn_parquet/val/gen_2.parquet
s3://hugectr-io-test/data/dcn_parquet/val/gen_3.parquet

Writing /hugectr-io-test/data/dcn_parquet/file_list_test.txt
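With 16 training files, it is easy for the count on the first line to drift out of sync with the paths below it. A small sanity check, sketched here with a hypothetical `parse_file_list` helper, reads the file-list text and verifies that the declared count matches the number of listed paths.

```python
def parse_file_list(text):
    """Parse file-list text and check that the declared count matches the paths."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    declared = int(lines[0])
    paths = lines[1:]
    if declared != len(paths):
        raise ValueError(f"file list declares {declared} files but lists {len(paths)}")
    return paths


sample = """4
s3://hugectr-io-test/data/dcn_parquet/val/gen_0.parquet
s3://hugectr-io-test/data/dcn_parquet/val/gen_1.parquet
s3://hugectr-io-test/data/dcn_parquet/val/gen_2.parquet
s3://hugectr-io-test/data/dcn_parquet/val/gen_3.parquet
"""
val_paths = parse_file_list(sample)
print(len(val_paths))
```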


In [4]:
%%writefile /hugectr-io-test/data/dcn_parquet/train/_metadata.json
{ "file_stats": [{"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_0.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_1.parquet", "num_rows":40960}, 
                 {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_2.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_3.parquet", "num_rows":40960}, 
                 {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_4.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_5.parquet", "num_rows":40960}, 
                 {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_6.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_7.parquet", "num_rows":40960},
                 {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_8.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_9.parquet", "num_rows":40960}, 
                 {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_10.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_11.parquet", "num_rows":40960}, 
                 {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_12.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_13.parquet", "num_rows":40960}, 
                 {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_14.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_15.parquet", "num_rows":40960}], 
  "labels": [{"col_name": "label0", "index":0} ], 
  "conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3}, {"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6}, 
            {"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9}, {"col_name": "C10", "index":10}, {"col_name": "C11", "index":11}, {"col_name": "C12", "index":12}, 
            {"col_name": "C13", "index":13} ], 
  "cats": [{"col_name": "C14", "index":14}, {"col_name": "C15", "index":15}, {"col_name": "C16", "index":16}, {"col_name": "C17", "index":17}, {"col_name": "C18", "index":18}, 
            {"col_name": "C19", "index":19}, {"col_name": "C20", "index":20}, {"col_name": "C21", "index":21}, {"col_name": "C22", "index":22}, {"col_name": "C23", "index":23}, 
            {"col_name": "C24", "index":24}, {"col_name": "C25", "index":25}, {"col_name": "C26", "index":26}, {"col_name": "C27", "index":27}, {"col_name": "C28", "index":28}, 
            {"col_name": "C29", "index":29}, {"col_name": "C30", "index":30}, {"col_name": "C31", "index":31}, {"col_name": "C32", "index":32}, {"col_name": "C33", "index":33}, 
            {"col_name": "C34", "index":34}, {"col_name": "C35", "index":35}, {"col_name": "C36", "index":36}, {"col_name": "C37", "index":37}, {"col_name": "C38", "index":38}, {"col_name": "C39", "index":39} ] }

Writing /hugectr-io-test/data/dcn_parquet/train/_metadata.json


In [5]:
%%writefile /hugectr-io-test/data/dcn_parquet/val/_metadata.json
{ "file_stats": [{"file_name": "s3://hugectr-io-test/data/dcn_parquet/val/gen_0.parquet", "num_rows":40960}, 
                 {"file_name": "s3://hugectr-io-test/data/dcn_parquet/val/gen_1.parquet", "num_rows":40960},
                 {"file_name": "s3://hugectr-io-test/data/dcn_parquet/val/gen_2.parquet", "num_rows":40960}, 
                 {"file_name": "s3://hugectr-io-test/data/dcn_parquet/val/gen_3.parquet", "num_rows":40960}], 
  "labels": [{"col_name": "label0", "index":0} ], 
  "conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3}, {"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6}, 
            {"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9}, {"col_name": "C10", "index":10}, {"col_name": "C11", "index":11}, {"col_name": "C12", "index":12}, 
            {"col_name": "C13", "index":13} ], 
  "cats": [{"col_name": "C14", "index":14}, {"col_name": "C15", "index":15}, {"col_name": "C16", "index":16}, {"col_name": "C17", "index":17}, {"col_name": "C18", "index":18}, 
            {"col_name": "C19", "index":19}, {"col_name": "C20", "index":20}, {"col_name": "C21", "index":21}, {"col_name": "C22", "index":22}, {"col_name": "C23", "index":23}, 
            {"col_name": "C24", "index":24}, {"col_name": "C25", "index":25}, {"col_name": "C26", "index":26}, {"col_name": "C27", "index":27}, {"col_name": "C28", "index":28}, 
            {"col_name": "C29", "index":29}, {"col_name": "C30", "index":30}, {"col_name": "C31", "index":31}, {"col_name": "C32", "index":32}, {"col_name": "C33", "index":33}, 
            {"col_name": "C34", "index":34}, {"col_name": "C35", "index":35}, {"col_name": "C36", "index":36}, {"col_name": "C37", "index":37}, {"col_name": "C38", "index":38}, {"col_name": "C39", "index":39} ] }

Writing /hugectr-io-test/data/dcn_parquet/val/_metadata.json
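In this notebook the metadata columns line up with the values later passed to the training script: one label (`label_dim=1`), 13 continuous columns (`dense_dim=13`), and 26 categorical columns (the length of `slot_size_array`). The sketch below checks that correspondence before training. The function `check_metadata` is an illustrative assumption, not something HugeCTR provides.

```python
def check_metadata(metadata, label_dim, dense_dim, slot_size_array):
    """Return a list of problems; an empty list means the metadata is consistent."""
    problems = []
    columns = metadata["labels"] + metadata["conts"] + metadata["cats"]
    indices = sorted(col["index"] for col in columns)
    if indices != list(range(len(columns))):
        problems.append("column indices are not contiguous starting at 0")
    if len(metadata["labels"]) != label_dim:
        problems.append("label count differs from label_dim")
    if len(metadata["conts"]) != dense_dim:
        problems.append("continuous column count differs from dense_dim")
    if len(metadata["cats"]) != len(slot_size_array):
        problems.append("categorical column count differs from len(slot_size_array)")
    return problems


# Same column layout as the metadata above: label0, C1..C13, C14..C39.
metadata = {
    "labels": [{"col_name": "label0", "index": 0}],
    "conts": [{"col_name": f"C{i}", "index": i} for i in range(1, 14)],
    "cats": [{"col_name": f"C{i}", "index": i} for i in range(14, 40)],
}
group = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543]
slot_size_array = group + group + [63, 63] + group

problems = check_metadata(metadata, label_dim=1, dense_dim=13, slot_size_array=slot_size_array)
print(problems)
```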


### Training

**Important APIs used in the following script:**
1. We use [DataSourceParams](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#datasourceparams-class) to define the remote file system to read data from, in this case S3.
2. In [DataReaderParams](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#datareaderparams), we pass the `DataSourceParams` object.
3. In the [fit()](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#fit-method) method, we specify an S3 path in the `snapshot_prefix` parameter to dump the trained model to S3.

In [10]:
%%writefile train_with_s3.py
import hugectr
from mpi4py import MPI
from hugectr.data import DataSourceParams

# Create a file system configuration for data reading
data_source_params = DataSourceParams(
    source = hugectr.FileSystemType_t.S3, #use AWS S3
    server = 'us-east-1', #your AWS region
    port = 9000, #will be ignored
)

solver = hugectr.CreateSolver(
    max_eval_batches=1280,
    batchsize_eval=1024,
    batchsize=1024,
    lr=0.001,
    vvgpu=[[0]],
    i64_input_key=True,
    repeat_dataset=True,
)
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,
    source=["/hugectr-io-test/data/dcn_parquet/file_list.txt"],
    eval_source="/hugectr-io-test/data/dcn_parquet/file_list_test.txt",
    slot_size_array=[39884,39043,17289,7420,20263,3,7120,1543,39884,39043,17289,7420,20263,3,7120,1543,63,63,39884,39043,17289,7420,20263,3,7120,1543],
    data_source_params=data_source_params, # Using the S3 configurations
    check_type=hugectr.Check_t.Non,
)
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.SGD)
model = hugectr.Model(solver, reader, optimizer)
model.add(
    hugectr.Input(
        label_dim=1,
        label_name="label",
        dense_dim=13,
        dense_name="dense",
        data_reader_sparse_param_array=[
            hugectr.DataReaderSparseParam("data1", 1, True, 26)
        ],
    )
)
model.add(
    hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=150,
        embedding_vec_size=16,
        combiner="sum",
        sparse_embedding_name="sparse_embedding1",
        bottom_name="data1",
        optimizer=optimizer,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Reshape,
        bottom_names=["sparse_embedding1"],
        top_names=["reshape1"],
        leading_dim=416,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Concat, bottom_names=["reshape1", "dense"], top_names=["concat1"]
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Slice,
        bottom_names=["concat1"],
        top_names=["slice11", "slice12"],
        ranges=[(0, 429), (0, 429)],
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.MultiCross,
        bottom_names=["slice11"],
        top_names=["multicross1"],
        num_layers=6,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["slice12"],
        top_names=["fc1"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc1"], top_names=["relu1"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu1"],
        top_names=["dropout1"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Concat,
        bottom_names=["dropout1", "multicross1"],
        top_names=["concat2"],
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["concat2"],
        top_names=["fc2"],
        num_output=1,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
        bottom_names=["fc2", "label"],
        top_names=["loss"],
    )
)
model.compile()
model.summary()

model.fit(max_iter = 1100, display = 100, eval_interval = 500, snapshot = 1000, snapshot_prefix = "https://s3.us-east-1.amazonaws.com/hugectr-io-test/pipeline_test/dcn_model/")
model.graph_to_json(graph_config_file = "dcn.json")

Overwriting train_with_s3.py
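The hard-coded dimensions in the DCN script follow from the input definition. The `Reshape` layer flattens 26 slots of 16-dimensional embeddings, and the `Concat` layer appends the 13 dense features, which gives the upper bound used in the `Slice` ranges. A short check of that arithmetic, using only values from the script:

```python
num_slots = 26           # slot count passed to DataReaderSparseParam
embedding_vec_size = 16  # from SparseEmbedding
dense_dim = 13           # from Input

leading_dim = num_slots * embedding_vec_size  # width of "reshape1"
concat_width = leading_dim + dense_dim        # width of "concat1"

print(leading_dim)   # 416
print(concat_width)  # 429
```

If `embedding_vec_size` or the number of slots changes, `leading_dim` and both `ranges` entries need to be updated to match.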


In [11]:
!python train_with_s3.py

HugeCTR Version: 4.1
[HCTR][06:54:55.819][INFO][RK0][main]: Global seed is 569406237
[HCTR][06:54:55.822][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
[HCTR][06:54:57.710][INFO][RK0][main]: Start all2all warmup
[HCTR][06:54:57.710][INFO][RK0][main]: End all2all warmup
[HCTR][06:54:57.711][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][06:54:57.712][INFO][RK0][main]: Device 0: Tesla V100-SXM2-32GB
[HCTR][06:54:57.713][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][06:54:57.713][INFO][RK0][main]: num of DataReader workers for eval: 1
[HCTR][06:54:57.714][INFO][RK0][main]: Using S3 file system backend.
[HCTR][06:54:59.762][INFO][RK0][main]: Using S3 file system backend.
[HCTR][06:55:01.777][INFO][RK0][main]: Vocabulary size: 397821
[HCTR][06:55:01.777][INFO][RK0][main]: max_vocabulary_size_per_gpu_=2457600
[HCTR][06:55:01.780][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][06:55:03.407][INFO][RK0][main]: gpu0 start to init 

### Inference

**Important API used in the following script:**
1. In [InferenceParams()](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#inferenceparams-class), we specify S3 paths in the `dense_model_file` and `sparse_model_files` parameters to load the trained model from S3.
2. In [predict()](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#predict-method), we pass the `DataSourceParams` object to read data from S3.

**Note that reading model graphs from S3 is not supported yet. Only model files can be read from a remote file system.**

In [6]:
%%writefile inference_with_s3.py
import hugectr
from hugectr.inference import InferenceModel, InferenceParams
from hugectr.data import DataSourceParams
import numpy as np
from mpi4py import MPI


# Create a file system configuration for data reading
data_source_params = DataSourceParams(
    source = hugectr.FileSystemType_t.S3, # use AWS S3
    server = 'us-east-1', # your AWS region
    port = 9000, # will be ignored
)

model_config = "dcn.json" # the graph config must be a local file
inference_params = InferenceParams(
    model_name = "dcn",
    max_batchsize = 1024,
    hit_rate_threshold = 1.0,
    dense_model_file = "https://s3.us-east-1.amazonaws.com/hugectr-io-test/pipeline_test/dcn_model/_dense_1000.model", # S3 URL
    sparse_model_files = ["https://s3.us-east-1.amazonaws.com/hugectr-io-test/pipeline_test/dcn_model/0_sparse_1000.model"], # S3 URL
    deployed_devices = [0],
    use_gpu_embedding_cache = True,
    cache_size_percentage = 1.0,
    i64_input_key = True
)
inference_model = InferenceModel(model_config, inference_params)
pred = inference_model.predict(
    10,
    "/hugectr-io-test/data/dcn_parquet/file_list_test.txt",
    hugectr.DataReaderType_t.Parquet,
    hugectr.Check_t.Non,
    [39884,39043,17289,7420,20263,3,7120,1543,39884,39043,17289,7420,20263,3,7120,1543,63,63,39884,39043,17289,7420,20263,3,7120,1543],
    data_source_params
)
print(pred.shape)
print(pred)

Overwriting inference_with_s3.py


In [7]:
!python inference_with_s3.py

[HCTR][02:48:08.494][INFO][RK0][main]: Global seed is 2188274617
[HCTR][02:48:08.496][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
[HCTR][02:48:10.297][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 30.7791 
[HCTR][02:48:10.297][INFO][RK0][main]: Start all2all warmup
[HCTR][02:48:10.297][INFO][RK0][main]: End all2all warmup
[HCTR][02:48:10.298][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][02:48:10.298][INFO][RK0][main]: Creating HashMap CPU database backend...
[HCTR][02:48:10.298][DEBUG][RK0][main]: Created blank database backend in local memory!
[HCTR][02:48:10.298][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][02:48:10.298][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][02:48:10.298][DEBUG][RK0][main]: Created raw model loader in local memory!
[HCTR][02:48:10.298][INFO][RK0][main]: Using S3 file system backend.
[HCTR][02:48:21.335][INFO][RK0][main]: Table: hps_et.dcn.sparse_embedding1

## Training a DCN model with Google Cloud Storage

**Before you start:**
Please note that the GCS SDK is NOT preinstalled in the NGC docker image. To use GCS-related functionality, follow these steps to build HugeCTR with GCS support:
1. `git clone https://github.com/NVIDIA/HugeCTR.git`
2. `cd HugeCTR`
3. `git submodule update --init --recursive`
4. `mkdir -p build && cd build`
5. `cmake -DCMAKE_BUILD_TYPE=Release -DSM=70 -DENABLE_GCS=ON ..` (the `ENABLE_GCS` option installs the GCS SDK for you)
6. `make -j && make install`

### Data preparation

**Create `file_list.txt` and `file_list_test.txt`:**

In [19]:
!mkdir -p /hugectr-io-test/data/dcn_parquet/train
!mkdir -p /hugectr-io-test/data/dcn_parquet/val

In [20]:
%%writefile /hugectr-io-test/data/dcn_parquet/file_list.txt
16
gs://hugectr-io-test/data/dcn_parquet/train/gen_0.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_1.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_2.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_3.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_4.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_5.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_6.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_7.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_8.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_9.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_10.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_11.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_12.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_13.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_14.parquet
gs://hugectr-io-test/data/dcn_parquet/train/gen_15.parquet

Overwriting /hugectr-io-test/data/dcn_parquet/file_list.txt


In [21]:
%%writefile /hugectr-io-test/data/dcn_parquet/file_list_test.txt
4
gs://hugectr-io-test/data/dcn_parquet/val/gen_0.parquet
gs://hugectr-io-test/data/dcn_parquet/val/gen_1.parquet
gs://hugectr-io-test/data/dcn_parquet/val/gen_2.parquet
gs://hugectr-io-test/data/dcn_parquet/val/gen_3.parquet

Overwriting /hugectr-io-test/data/dcn_parquet/file_list_test.txt
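Both file lists above share the same layout: the first line holds the number of files, and each following line holds one Parquet URI. As a minimal sketch, the same text could be generated programmatically instead of typed by hand; `build_file_list` is a hypothetical helper introduced here for illustration, and it assumes the files are named `gen_<i>.parquet` as in this example.

```python
def build_file_list(base_uri, num_files):
    """Return file-list text: the file count, then one Parquet URI per line."""
    lines = [str(num_files)]
    lines += [f"{base_uri}/gen_{i}.parquet" for i in range(num_files)]
    return "\n".join(lines) + "\n"


train_list = build_file_list("gs://hugectr-io-test/data/dcn_parquet/train", 16)
val_list = build_file_list("gs://hugectr-io-test/data/dcn_parquet/val", 4)

# The text can then be written to file_list.txt / file_list_test.txt,
# for example with pathlib.Path(...).write_text(train_list).
print(val_list)
```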


In [22]:
%%writefile /hugectr-io-test/data/dcn_parquet/train/_metadata.json
{ "file_stats": [{"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_0.parquet", "num_rows":40960}, {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_1.parquet", "num_rows":40960}, 
                 {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_2.parquet", "num_rows":40960}, {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_3.parquet", "num_rows":40960}, 
                 {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_4.parquet", "num_rows":40960}, {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_5.parquet", "num_rows":40960}, 
                 {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_6.parquet", "num_rows":40960}, {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_7.parquet", "num_rows":40960},
                 {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_8.parquet", "num_rows":40960}, {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_9.parquet", "num_rows":40960}, 
                 {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_10.parquet", "num_rows":40960}, {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_11.parquet", "num_rows":40960}, 
                 {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_12.parquet", "num_rows":40960}, {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_13.parquet", "num_rows":40960}, 
                 {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_14.parquet", "num_rows":40960}, {"file_name": "gs://hugectr-io-test/data/dcn_parquet/train/gen_15.parquet", "num_rows":40960}], 
  "labels": [{"col_name": "label0", "index":0} ], 
  "conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3}, {"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6}, 
            {"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9}, {"col_name": "C10", "index":10}, {"col_name": "C11", "index":11}, {"col_name": "C12", "index":12}, 
            {"col_name": "C13", "index":13} ], 
  "cats": [{"col_name": "C14", "index":14}, {"col_name": "C15", "index":15}, {"col_name": "C16", "index":16}, {"col_name": "C17", "index":17}, {"col_name": "C18", "index":18}, 
            {"col_name": "C19", "index":19}, {"col_name": "C20", "index":20}, {"col_name": "C21", "index":21}, {"col_name": "C22", "index":22}, {"col_name": "C23", "index":23}, 
            {"col_name": "C24", "index":24}, {"col_name": "C25", "index":25}, {"col_name": "C26", "index":26}, {"col_name": "C27", "index":27}, {"col_name": "C28", "index":28}, 
            {"col_name": "C29", "index":29}, {"col_name": "C30", "index":30}, {"col_name": "C31", "index":31}, {"col_name": "C32", "index":32}, {"col_name": "C33", "index":33}, 
            {"col_name": "C34", "index":34}, {"col_name": "C35", "index":35}, {"col_name": "C36", "index":36}, {"col_name": "C37", "index":37}, {"col_name": "C38", "index":38}, {"col_name": "C39", "index":39} ] }

Overwriting /hugectr-io-test/data/dcn_parquet/train/_metadata.json


In [23]:
%%writefile /hugectr-io-test/data/dcn_parquet/val/_metadata.json
{ "file_stats": [{"file_name": "gs://hugectr-io-test/data/dcn_parquet/val/gen_0.parquet", "num_rows":40960}, 
                 {"file_name": "gs://hugectr-io-test/data/dcn_parquet/val/gen_1.parquet", "num_rows":40960},
                 {"file_name": "gs://hugectr-io-test/data/dcn_parquet/val/gen_2.parquet", "num_rows":40960}, 
                 {"file_name": "gs://hugectr-io-test/data/dcn_parquet/val/gen_3.parquet", "num_rows":40960}], 
  "labels": [{"col_name": "label0", "index":0} ], 
  "conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3}, {"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6}, 
            {"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9}, {"col_name": "C10", "index":10}, {"col_name": "C11", "index":11}, {"col_name": "C12", "index":12}, 
            {"col_name": "C13", "index":13} ], 
  "cats": [{"col_name": "C14", "index":14}, {"col_name": "C15", "index":15}, {"col_name": "C16", "index":16}, {"col_name": "C17", "index":17}, {"col_name": "C18", "index":18}, 
            {"col_name": "C19", "index":19}, {"col_name": "C20", "index":20}, {"col_name": "C21", "index":21}, {"col_name": "C22", "index":22}, {"col_name": "C23", "index":23}, 
            {"col_name": "C24", "index":24}, {"col_name": "C25", "index":25}, {"col_name": "C26", "index":26}, {"col_name": "C27", "index":27}, {"col_name": "C28", "index":28}, 
            {"col_name": "C29", "index":29}, {"col_name": "C30", "index":30}, {"col_name": "C31", "index":31}, {"col_name": "C32", "index":32}, {"col_name": "C33", "index":33}, 
            {"col_name": "C34", "index":34}, {"col_name": "C35", "index":35}, {"col_name": "C36", "index":36}, {"col_name": "C37", "index":37}, {"col_name": "C38", "index":38}, {"col_name": "C39", "index":39} ] }

Overwriting /hugectr-io-test/data/dcn_parquet/val/_metadata.json
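The two `_metadata.json` files above differ only in their `file_stats` entries; the `labels`, `conts`, and `cats` sections are identical. The following sketch builds the same structure in Python, which may be less error-prone than editing the JSON by hand. `build_metadata` is a hypothetical helper, and the defaults (40960 rows per file, 13 dense columns, 26 categorical columns) are simply the values used in this example.

```python
import json


def build_metadata(base_uri, num_files, rows_per_file=40960, num_dense=13, num_cat=26):
    """Build a `_metadata.json`-style dict with the layout used in this notebook."""
    file_stats = [
        {"file_name": f"{base_uri}/gen_{i}.parquet", "num_rows": rows_per_file}
        for i in range(num_files)
    ]
    labels = [{"col_name": "label0", "index": 0}]
    # Dense columns C1..C13 follow the label column.
    conts = [{"col_name": f"C{i}", "index": i} for i in range(1, num_dense + 1)]
    # Categorical columns C14..C39 follow the dense columns.
    cats = [
        {"col_name": f"C{i}", "index": i}
        for i in range(num_dense + 1, num_dense + num_cat + 1)
    ]
    return {"file_stats": file_stats, "labels": labels, "conts": conts, "cats": cats}


val_meta = build_metadata("gs://hugectr-io-test/data/dcn_parquet/val", 4)
print(json.dumps(val_meta, indent=2)[:200])
```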


### Training

**Important APIs used in the following script:**
1. We use the [DataSourceParams](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#datasourceparams-class) to define the remote file system to read data from, in this case, GCS.
2. In [DataReaderParams](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#datareaderparams), we specify the `DataSourceParams`.
3. In the [fit()](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#fit-method) method, we specify a GCS path in the `snapshot_prefix` parameter to dump the trained models to GCS.

In [27]:
# You need to set the GCP credentials environment variable to access GCS.

%env GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/gcs_key.json

env: GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/gcs_key.json
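Because the path above is a placeholder, it can help to verify that the variable is set and points to an existing file before launching training. The following is a minimal sketch of such a check; `gcs_credentials_problem` is a hypothetical helper, and it only tests that the file exists, not that the key is valid.

```python
import os


def gcs_credentials_problem(env=os.environ):
    """Return a description of the problem, or None if the key file is found."""
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(key_path):
        return f"credentials file not found: {key_path}"
    return None


problem = gcs_credentials_problem()
print(problem or "GCS credentials file found")
```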


In [13]:
%%writefile train_with_gcs.py
import hugectr
from mpi4py import MPI
from hugectr.data import DataSourceParams

# Create a file system configuration for data reading
data_source_params = DataSourceParams(
    source = hugectr.FileSystemType_t.GCS, #use Google Cloud Storage
    server = 'storage.googleapis.com', #your endpoint override, usually storage.googleapis.com or storage.google.cloud.com
    port = 9000, #will be ignored
)

solver = hugectr.CreateSolver(
    max_eval_batches=1280,
    batchsize_eval=1024,
    batchsize=1024,
    lr=0.001,
    vvgpu=[[0]],
    i64_input_key=True,
    repeat_dataset=True,
)
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,
    source=["/hugectr-io-test/data/dcn_parquet/file_list.txt"],
    eval_source="/hugectr-io-test/data/dcn_parquet/file_list_test.txt",
    slot_size_array=[39884,39043,17289,7420,20263,3,7120,1543,39884,39043,17289,7420,20263,3,7120,1543,63,63,39884,39043,17289,7420,20263,3,7120,1543],
    data_source_params=data_source_params, # Using the GCS configurations
    check_type=hugectr.Check_t.Non,
)
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.SGD)
model = hugectr.Model(solver, reader, optimizer)
model.add(
    hugectr.Input(
        label_dim=1,
        label_name="label",
        dense_dim=13,
        dense_name="dense",
        data_reader_sparse_param_array=[
            hugectr.DataReaderSparseParam("data1", 1, True, 26)
        ],
    )
)
model.add(
    hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=150,
        embedding_vec_size=16,
        combiner="sum",
        sparse_embedding_name="sparse_embedding1",
        bottom_name="data1",
        optimizer=optimizer,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Reshape,
        bottom_names=["sparse_embedding1"],
        top_names=["reshape1"],
        leading_dim=416,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Concat, bottom_names=["reshape1", "dense"], top_names=["concat1"]
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Slice,
        bottom_names=["concat1"],
        top_names=["slice11", "slice12"],
        ranges=[(0, 429), (0, 429)],
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.MultiCross,
        bottom_names=["slice11"],
        top_names=["multicross1"],
        num_layers=6,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["slice12"],
        top_names=["fc1"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc1"], top_names=["relu1"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu1"],
        top_names=["dropout1"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Concat,
        bottom_names=["dropout1", "multicross1"],
        top_names=["concat2"],
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["concat2"],
        top_names=["fc2"],
        num_output=1,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
        bottom_names=["fc2", "label"],
        top_names=["loss"],
    )
)
model.compile()
model.summary()

model.fit(max_iter = 1100, display = 100, eval_interval = 500, snapshot = 1000, snapshot_prefix = "https://storage.googleapis.com/hugectr-io-test/pipeline_test/")
model.graph_to_json(graph_config_file = "dcn.json")

Overwriting train_with_gcs.py
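The `leading_dim=416` of the Reshape layer and the upper bound `429` in the Slice ranges follow from the other parameters in the script. The short calculation below makes the relationship explicit; the variable names are chosen here for illustration only.

```python
num_slots = 26           # slot count in DataReaderSparseParam("data1", 1, True, 26)
embedding_vec_size = 16  # SparseEmbedding(embedding_vec_size=16)
dense_dim = 13           # Input(dense_dim=13)

# The Reshape layer flattens the 26 slot embeddings into one vector per sample.
reshape_dim = num_slots * embedding_vec_size
# Concat appends the 13 dense features, giving the width that the Slice layer covers.
concat_dim = reshape_dim + dense_dim

print(reshape_dim, concat_dim)  # 416 429
```

If the number of slots, the embedding vector size, or the dense dimension changes, `leading_dim` and the Slice `ranges` need to be updated to match.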


In [14]:
!python train_with_gcs.py

HugeCTR Version: 4.1
[HCTR][03:15:35.248][INFO][RK0][main]: Global seed is 1008636636
[HCTR][03:15:35.251][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
[HCTR][03:15:37.306][INFO][RK0][main]: Start all2all warmup
[HCTR][03:15:37.306][INFO][RK0][main]: End all2all warmup
[HCTR][03:15:37.307][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][03:15:37.308][INFO][RK0][main]: Device 0: Tesla V100-SXM2-32GB
[HCTR][03:15:37.308][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][03:15:37.308][INFO][RK0][main]: num of DataReader workers for eval: 1
[HCTR][03:15:37.309][INFO][RK0][main]: Using GCS file system backend.
[HCTR][03:15:37.323][INFO][RK0][main]: Using GCS file system backend.
[HCTR][03:15:37.328][INFO][RK0][main]: Vocabulary size: 397821
[HCTR][03:15:37.329][INFO][RK0][main]: max_vocabulary_size_per_gpu_=2457600
[HCTR][03:15:37.331][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][03:15:39.005][INFO][RK0][main]: gpu0 start to in
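Two values in the log above can be cross-checked against the training script. The reported vocabulary size equals the sum of `slot_size_array`, and `max_vocabulary_size_per_gpu_` is consistent with dividing the 150 MB workspace by the size of one embedding vector. The second calculation is a sketch that assumes 4-byte (float32) embedding values and no additional per-row optimizer state for SGD; that is an interpretation of the numbers, not something the log states.

```python
slot_size_array = [
    39884, 39043, 17289, 7420, 20263, 3, 7120, 1543,
    39884, 39043, 17289, 7420, 20263, 3, 7120, 1543,
    63, 63,
    39884, 39043, 17289, 7420, 20263, 3, 7120, 1543,
]
vocabulary_size = sum(slot_size_array)

workspace_bytes = 150 * 1024 * 1024  # workspace_size_per_gpu_in_mb=150
bytes_per_vector = 16 * 4            # embedding_vec_size=16, assuming float32 values
max_vocabulary_size_per_gpu = workspace_bytes // bytes_per_vector

print(vocabulary_size, max_vocabulary_size_per_gpu)  # 397821 2457600
```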

## Inference

### Data preparation

**Please note that reading model graphs and dense models from GCS is not supported yet. Only sparse models can be read from remote storage.**

In [24]:
%%writefile inference_with_gcs.py
import hugectr
from hugectr.inference import InferenceModel, InferenceParams
from hugectr.data import DataSourceParams
import numpy as np
from mpi4py import MPI


# Create a file system configuration for data reading
data_source_params = DataSourceParams(
    source = hugectr.FileSystemType_t.GCS, # use GCS
    server = 'storage.googleapis.com', # your GCS endpoint override
    port = 9000, # will be ignored
)

model_config = "dcn.json" # must be a local file
inference_params = InferenceParams(
    model_name = "dcn",
    max_batchsize = 1024,
    hit_rate_threshold = 1.0,
    dense_model_file = "./_dense_1000.model", # must be a local file; the training script above snapshots at iteration 1000
    sparse_model_files = ["https://storage.googleapis.com/hugectr-io-test/pipeline_test/0_sparse_1000.model"], # GCS URL
    deployed_devices = [0],
    use_gpu_embedding_cache = True,
    cache_size_percentage = 1.0,
    i64_input_key = True
)
inference_model = InferenceModel(model_config, inference_params)
pred = inference_model.predict(
    10,
    "/hugectr-io-test/data/dcn_parquet/file_list_test.txt",
    hugectr.DataReaderType_t.Parquet,
    hugectr.Check_t.Non,
    [39884,39043,17289,7420,20263,3,7120,1543,39884,39043,17289,7420,20263,3,7120,1543,63,63,39884,39043,17289,7420,20263,3,7120,1543],
    data_source_params
)
print(pred.shape)
print(pred)

Overwriting inference_with_gcs.py
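This notebook refers to GCS objects in two URL styles: `gs://<bucket>/<object>` in the file lists and metadata, and `https://storage.googleapis.com/<bucket>/<object>` for `snapshot_prefix` and `sparse_model_files`. The sketch below shows the textual correspondence between the two styles. `gs_to_https` and `https_to_gs` are hypothetical helpers, and whether HugeCTR accepts both styles interchangeably for a given parameter is not stated here, so the styles used in the scripts above are the safe choice.

```python
GCS_ENDPOINT = "storage.googleapis.com"


def gs_to_https(uri, endpoint=GCS_ENDPOINT):
    """Convert gs://bucket/object to https://<endpoint>/bucket/object."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri}")
    return f"https://{endpoint}/{uri[len('gs://'):]}"


def https_to_gs(url, endpoint=GCS_ENDPOINT):
    """Convert https://<endpoint>/bucket/object to gs://bucket/object."""
    prefix = f"https://{endpoint}/"
    if not url.startswith(prefix):
        raise ValueError(f"not a URL under {prefix}: {url}")
    return "gs://" + url[len(prefix):]


print(gs_to_https("gs://hugectr-io-test/pipeline_test/0_sparse_1000.model"))
print(https_to_gs("https://storage.googleapis.com/hugectr-io-test/pipeline_test/"))
```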


In [25]:
!python inference_with_gcs.py

[HCTR][09:30:37.214][INFO][RK0][main]: Global seed is 1015829727
[HCTR][09:30:37.217][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
[HCTR][09:30:39.061][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 30.7830 
[HCTR][09:30:39.061][INFO][RK0][main]: Start all2all warmup
[HCTR][09:30:39.061][INFO][RK0][main]: End all2all warmup
[HCTR][09:30:39.062][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][09:30:39.062][INFO][RK0][main]: Creating HashMap CPU database backend...
[HCTR][09:30:39.062][DEBUG][RK0][main]: Created blank database backend in local memory!
[HCTR][09:30:39.062][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][09:30:39.062][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][09:30:39.062][DEBUG][RK0][main]: Created raw model loader in local memory!
[HCTR][09:30:39.063][INFO][RK0][main]: Using GCS file system backend.
[HCTR][09:30:40.357][INFO][RK0][main]: Table: hps_et.dcn.sparse_embedding