# Training Large Recommender Models with NVIDIA Merlin HugeCTR and Vertex AI

This notebook demonstrates how to use Vertex AI Training to operationalize training and hyperparameter tuning of large-scale deep learning models developed with the [NVIDIA Merlin HugeCTR](https://developer.nvidia.com/nvidia-merlin/hugectr) framework. The notebook compiles prescriptive guidance for the following tasks:

- Building a custom Vertex training container derived from the NVIDIA NGC Merlin Training image
- Configuring, submitting and monitoring a Vertex custom training job
- Configuring, submitting and monitoring a Vertex hyperparameter tuning job
- Retrieving and analyzing results of a hyperparameter tuning job

The deep learning model used in this sample is [DeepFM](https://arxiv.org/abs/1703.04247) - a Factorization-Machine based Neural Network for CTR Prediction. The HugeCTR implementation of this model used in this notebook has been configured for the [Criteo dataset](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/). 

## 1. NVIDIA Merlin HugeCTR Overview

[Merlin HugeCTR](https://github.com/NVIDIA-Merlin/HugeCTR) is NVIDIA's GPU-accelerated, highly scalable recommender framework. We highly encourage reviewing the [HugeCTR User Guide](https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/docs/hugectr_user_guide.md) before proceeding with this notebook.

Merlin HugeCTR facilitates highly scalable implementations of leading deep learning recommender models including Google's [Wide and Deep](https://arxiv.org/abs/1606.07792), Facebook's [DLRM](https://ai.facebook.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/), and the [DeepFM](https://arxiv.org/abs/1703.04247) model used in this notebook.

A unique feature of HugeCTR is support for model-parallel embedding tables. Applying model parallelism to embedding tables enables massive scalability. Industrial-grade deep learning recommendation systems most often employ very large embedding tables. User and item embedding tables - cornerstones of any recommender - can easily exceed tens of millions of rows. Without model parallelism it would be impossible to fit embedding tables this large in the memory of a single device - especially when using large embedding vectors. In HugeCTR, embedding tables can span device memory across multiple GPUs on a single node or even across GPUs in a large distributed cluster.
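To make the memory argument concrete, the following back-of-the-envelope sketch estimates the footprint of a single embedding table. The row count, the vector width, FP32 storage, and the assumption that Adam keeps two extra values per weight are illustrative choices, not values taken from this notebook.

```python
def embedding_table_bytes(num_rows: int, embedding_dim: int, bytes_per_value: int = 4) -> int:
    """Return the raw size of a dense embedding table in bytes (weights only)."""
    return num_rows * embedding_dim * bytes_per_value


# Hypothetical table: 50 million rows of 128-dimensional FP32 vectors.
weights_bytes = embedding_table_bytes(num_rows=50_000_000, embedding_dim=128)

# Assumption: an Adam-style optimizer stores two additional values per weight.
weights_plus_adam_bytes = weights_bytes * 3

weights_gb = weights_bytes / 1e9
weights_plus_adam_gb = weights_plus_adam_bytes / 1e9

print(f"weights only:        {weights_gb:.1f} GB")
print(f"weights + Adam state: {weights_plus_adam_gb:.1f} GB")
```

Under these assumptions a single table, together with its optimizer state, already exceeds the 40 GB of device memory on one A100, which is why spreading tables across GPUs matters.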

HugeCTR supports multiple model-parallelism configurations for embedding tables. Refer to [the HugeCTR API reference](https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/docs/hugectr_layer_book.md#embedding-types-detail) for detailed descriptions. In the DeepFM implementation used in this notebook, we utilize the `LocalizedSlotSparseEmbeddingHash` embedding type. In this embedding type, an embedding table is segmented into multiple slots or feature fields. Each slot stores embeddings for a single categorical feature. A given slot is allocated to a single GPU and does not span multiple GPUs. However, in a multi-GPU environment different slots can be allocated to different GPUs.

The following diagram demonstrates an example configuration on a single node with multiple GPUs - the hardware topology used by Vertex jobs in this notebook.


<img src="./images/deepfm.png" alt="Model parallel embeddings" style="width: 50%;"/>


The Criteo dataset has 26 categorical features so there are 26 slots in the embedding table. Cardinalities of the categorical variables vary from tens of millions to low teens, so the sizes of the slots vary accordingly. Each slot in the embedding table utilizes an embedding vector of the same size. Note that the distribution of slots across GPUs is handled by HugeCTR; you don't have to explicitly pin a slot to a GPU.
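As a rough illustration of what distributing 26 slots over 4 GPUs can look like, the sketch below assigns slots round-robin by slot index. HugeCTR decides the actual placement internally; the round-robin rule and the function name `assign_slots_round_robin` are assumptions made only for this example.

```python
from collections import defaultdict


def assign_slots_round_robin(num_slots: int, num_gpus: int) -> dict:
    """Map each GPU index to the list of slot indices placed on it."""
    placement = defaultdict(list)
    for slot in range(num_slots):
        # Each slot lives on exactly one GPU; consecutive slots rotate across GPUs.
        placement[slot % num_gpus].append(slot)
    return dict(placement)


placement = assign_slots_round_robin(num_slots=26, num_gpus=4)
for gpu, slots in placement.items():
    print(f"GPU {gpu}: {len(slots)} slots -> {slots}")
```

Because slot cardinalities differ widely, an even slot count per GPU does not imply an even memory footprint per GPU.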

Dense layers of the DeepFM model are replicated on all GPUs using a canonical data-parallel pattern.

The choice of optimizer is critical when training large deep learning recommender systems. Different optimizers may result in significantly different convergence rates, impacting both the time (and cost) to train and the final model performance. Since large recommender systems are often retrained on a frequent basis, minimizing time to train is one of the key design objectives for a training workflow. In this notebook we use the Adam optimizer, which has been shown to work well with many deep learning recommender system architectures.

You can find the code that implements the DeepFM model in the [src/training/hugectr/trainer/model.py](https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai/blob/main/src/training/hugectr/model.py) file. 


## 2. Model Training Overview

The training workflow has been optimized for Vertex AI Training:
- Google Cloud Storage (GCS) and Vertex Training GCS Fuse are used for accessing training and validation data.
- A single-node, multi-GPU worker pool is used for Vertex Training jobs.
- Training code has been instrumented to support hyperparameter tuning using a Vertex Training hyperparameter tuning job.

You can find the code that implements the training workflow in the [src/training/hugectr/trainer/task.py](https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai/blob/main/src/training/hugectr/task.py) file.

### Training data access

Large deep learning recommender systems are trained on massive datasets, often hundreds of terabytes in size. Maintaining high throughput when streaming training data to GPU workers is of critical importance. HugeCTR features a highly efficient multi-threaded data reader that parallelizes data reading and model computations. The reader accesses training data through a file system interface; it cannot directly access object storage systems like Google Cloud Storage, which is the canonical storage system for large-scale training and validation datasets in Vertex AI Training. To expose Google Cloud Storage through a file system interface, the notebook uses an integrated feature of Vertex AI - Google Cloud Storage FUSE. Vertex AI GCS FUSE provides a high-performance file system interface layer to GCS that is self-tuning and requires minimal configuration. The following diagram depicts the training data access configuration:

<img src="images/gcsfuse.png" alt="GCS Fuse" style="width:50%"/>
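In practice this means that an object addressed as `gs://<bucket>/<path>` is visible to the training container as a file under `/gcs/<bucket>/<path>`. The helper below is a minimal sketch of that path translation; the function name and the bucket and object names are hypothetical.

```python
def gcs_uri_to_fuse_path(gcs_uri: str) -> str:
    """Translate a gs://bucket/object URI into its /gcs/bucket/object FUSE path."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Expected a gs:// URI, got: {gcs_uri}")
    return "/gcs/" + gcs_uri[len(prefix):]


# Hypothetical bucket and object names, for illustration only.
fuse_path = gcs_uri_to_fuse_path("gs://my-bucket/criteo/train/_file_list.txt")
print(fuse_path)  # /gcs/my-bucket/criteo/train/_file_list.txt
```

Paths in this `/gcs/...` form are what the HugeCTR data reader can open as ordinary files.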


### Vertex AI Training worker pool configuration

HugeCTR supports both single-node, multi-GPU configurations and multi-node, multi-GPU distributed cluster topologies. In this sample, we use a single-node, multi-GPU configuration. Due to the computational complexity of modern deep learning recommender models, we recommend using Vertex Training A2 series machines for large models implemented with HugeCTR. The A2 machines can be configured with up to 16 A100 GPUs, 96 vCPUs, and 1,360 GB of RAM. Each A100 GPU has 40 GB of device memory. These are powerful configurations that can handle complex models with large embeddings.

In this sample we use the `a2-highgpu-4g` machine type. 

Both the custom training and the hyperparameter tuning Vertex AI jobs demonstrated in this notebook are configured to use a [custom training container](https://cloud.google.com/vertex-ai/docs/training/containers-overview). The container is a derivative of the [NVIDIA NGC Merlin training container](https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-training). The definition of the container image is found in the [Dockerfile.hugectr](https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai/blob/main/src/Dockerfile.hugectr) file.

### HugeCTR hyperparameter tuning with Vertex AI

The training module has been instrumented to support [hyperparameter tuning with Vertex AI](https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview). The custom container includes the [cloudml-hypertune package](https://github.com/GoogleCloudPlatform/cloudml-hypertune), which is used to report the results of model evaluations to the Vertex AI hyperparameter tuning service. The following diagram depicts the training flow implemented by the training module.


<img src="images/hugectrtrainer.png" alt="Training regime" style="width:30%"/>


Note that as of the HugeCTR v3.2 release, the `hugectr.inference.InferenceSession.evaluate` method used in the trainer module only supports the *AUC* evaluation metric.
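A minimal sketch of how a final AUC value can be reported with `cloudml-hypertune` is shown below. The wrapper `report_auc`, the `RecordingReporter` stand-in, and the placeholder metric value are assumptions made for illustration; the actual trainer code may be organized differently.

```python
def report_auc(auc: float, step: int, reporter=None) -> None:
    """Report a final AUC value to the hyperparameter tuning service."""
    if reporter is None:
        # Imported lazily so the sketch can be exercised without the package installed.
        import hypertune
        reporter = hypertune.HyperTune()
    reporter.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="AUC",  # must match the metric name in the study config
        metric_value=auc,
        global_step=step,
    )


class RecordingReporter:
    """Hypothetical stand-in that records calls instead of writing metrics."""

    def __init__(self):
        self.calls = []

    def report_hyperparameter_tuning_metric(self, hyperparameter_metric_tag,
                                            metric_value, global_step):
        self.calls.append((hyperparameter_metric_tag, metric_value, global_step))


# Placeholder values, not results from an actual training run.
recorder = RecordingReporter()
report_auc(auc=0.5, step=25000, reporter=recorder)
print(recorder.calls)
```

Inside the training container, calling `report_auc(auc, step)` without a `reporter` argument would use the real `hypertune.HyperTune` object.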


## 3. Executing Model Training on Vertex AI

This notebook assumes that the Criteo dataset has been preprocessed using the preprocessing workflow detailed in the [01-dataset-preprocessing.ipynb](https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai/blob/main/01-dataset-preprocessing.ipynb) notebook and the resulting Parquet training and validation splits, and the processed data schema have been stored in Google Cloud Storage.

As you walk through the notebook you will execute the following steps:
- Configure notebook environment settings, including GCP project, compute region, and the GCS locations of training and validation data splits.
- Build a custom Vertex training container based on the NVIDIA NGC Merlin Training container
- Configure and submit a Vertex custom training job
- Configure and submit a Vertex hyperparameter tuning job
- Retrieve the results of the hyperparameter tuning job

## Setup

In this section of the notebook you configure your environment settings, including a GCP project, a GCP compute region, a Vertex AI service account and a Vertex AI staging bucket. You also set the locations of training and validation splits, and their schema as created in the [01-dataset-preprocessing.ipynb](01-dataset-preprocessing.ipynb) notebook.

Make sure to update the cells below with values that reflect your environment.

In [1]:
import json
import os
import time

from google.cloud import aiplatform as vertex_ai
from google.cloud.aiplatform import hyperparameter_tuning as hpt

In [3]:
# Project definitions
PROJECT = 'jk-mlops-dev' # Change to your project.
REGION = 'us-central1'  # Change to your region.

# Bucket definitions
BUCKET = 'jk-staging-us-central1' # Change to your bucket.
VERSION = 'v01'
MODEL_NAME = 'deepfm'
MODEL_DISPLAY_NAME = f'hugectr-{MODEL_NAME}-{VERSION}'
WORKSPACE = f'gs://{BUCKET}/{MODEL_DISPLAY_NAME}'

# Service Account address
VERTEX_SA = f'vertex-sa@{PROJECT}.iam.gserviceaccount.com' # Change to your service account.

# Docker definitions for training
IMAGE_NAME = 'hugectr-training'
IMAGE_URI = f'gcr.io/{PROJECT}/{IMAGE_NAME}'
DOCKERNAME = 'hugectr'

# Dataset information for training / validation
# The path must point to the file _file_list.txt
TRAIN_DATA = '/gcs/jk-staging-us-central1/criteo-merlin-recommender-v03/nvt-csv-pipeline/895222332033/nvt-csv-pipeline-20220227054638/transform-dataset-op_-8797194732859031552/transformed_dataset/train/_file_list.txt'
VALID_DATA = '/gcs/jk-staging-us-central1/criteo-merlin-recommender-v03/nvt-csv-pipeline/895222332033/nvt-csv-pipeline-20220227054638/transform-dataset-op-2_5037863322423132160/transformed_dataset/valid/_file_list.txt'

# Schema used by the training pipeline
# The path must point to the file schema.pbtxt
SCHEMA_PATH = '/gcs/jk-staging-us-central1/criteo-merlin-recommender-v03/nvt-csv-pipeline/895222332033/nvt-csv-pipeline-20220227054638/transform-dataset-op_-8797194732859031552/transformed_dataset/train/schema.pbtxt'

### Initialize Vertex AI SDK

In [4]:
vertex_ai.init(
    project=PROJECT,
    location=REGION,
    staging_bucket=os.path.join(WORKSPACE, 'stg')
)

## Submit a Vertex custom training job

In this section of the notebook you define, submit and monitor a Vertex custom training job. As noted in the introduction, the job uses a custom training container that is a derivative of the [NVIDIA NGC Merlin training container image](https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-training). The custom container image packages the training module, which includes a DeepFM model definition - [src/training/hugectr/model.py](https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai/blob/main/src/training/hugectr/model.py) - and a training and evaluation workflow - [src/training/hugectr/task.py](https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai/blob/main/src/training/hugectr/task.py). The custom container image also installs the `cloudml-hypertune` package for integration with Vertex AI hyperparameter tuning.

### Build a custom training container

In [9]:
FILE_LOCATION = './src'
! gcloud builds submit --config src/cloudbuild.yaml --substitutions _DOCKERNAME=$DOCKERNAME,_IMAGE_URI=$IMAGE_URI,_FILE_LOCATION=$FILE_LOCATION --timeout=2h --machine-type=e2-highcpu-8

Creating temporary tarball archive of 51 file(s) totalling 5.1 MiB before compression.
Some files were not included in the source upload.

Check the gcloud log [/home/jupyter/.config/gcloud/logs/2022.02.27/19.17.32.574938.log] to see which files and the contents of the
default gcloudignore file used (see `$ gcloud topic gcloudignore` to learn
more).

Uploading tarball of [.] to [gs://jk-mlops-dev_cloudbuild/source/1645989452.656979-a1e238dc491a42f6964b99a211e8bfe3.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/jk-mlops-dev/locations/global/builds/3b1768c3-a638-4eed-97c6-518af757fd4d].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/3b1768c3-a638-4eed-97c6-518af757fd4d?project=895222332033].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "3b1768c3-a638-4eed-97c6-518af757fd4d"

FETCHSOURCE
Fetching storage object: gs://jk-mlops-dev_cloudbuild/source/1645989452.656979-a1e238dc491a42f6964b99a211e8bfe3.

### Configure a custom training job

The training module accepts a set of parameters that allow you to fine-tune the DeepFM model implementation and configure the training workflow. Most of the parameters exposed by the training module map directly to the settings used in the [HugeCTR Python Interface](https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/docs/python_interface.md#createsolver-method).

- `NUM_EPOCHS`: The training workflow can run in either an epoch mode or a non-epoch mode. When `NUM_EPOCHS` is set to a value greater than zero, the model is trained for `NUM_EPOCHS` full epochs, where an epoch is defined as a single pass through all examples in the training data.

- `MAX_ITERATIONS`: If `NUM_EPOCHS` is set to zero, you must set `MAX_ITERATIONS` to a value greater than zero. `MAX_ITERATIONS` defines the number of batches to train the model on. When `NUM_EPOCHS` is greater than zero, `MAX_ITERATIONS` is ignored.

- `EVAL_INTERVAL` and `EVAL_BATCHES`: During the main training loop, the model is evaluated every `EVAL_INTERVAL` training batches using `EVAL_BATCHES` validation batches. In the current implementation the evaluation metric is `AUC`.

- `EVAL_BATCHES_FINAL`: After the main training loop completes, a final evaluation is run using `EVAL_BATCHES_FINAL` validation batches. The resulting `AUC` value is reported to the Vertex AI hyperparameter tuning service.

- `DISPLAY_INTERVAL`: Training progress will be reported every `DISPLAY_INTERVAL` batches.

- `SNAPSHOT_INTERVAL`: When set to a value greater than zero, a snapshot will be saved every `SNAPSHOT_INTERVAL` batches.

- `PER_GPU_BATCH_SIZE`: Per GPU batch size. This value should be set through experimentation and depends on model architecture, training features, and GPU type. It is highly dependent on the device memory available in a particular GPU. In our scenario - DeepFM, the Criteo dataset, and A100 GPUs - a batch size of 2048 works well.

- `LR`: The base learning rate for the HugeCTR solver.

- `DROPOUT_RATE`: The base dropout rate used in DeepFM dense layers.

- `NUM_WORKERS`: The number of HugeCTR data reader workers that concurrently load data. This is a per GPU value and should be estimated through experimentation. The default, which works well on A100 GPUs, is 12. For optimal performance, `NUM_WORKERS` should be aligned with the number of files (shards) in the training data.

- `SCHEMA`: The path to the `schema.pbtxt` file generated during the transformation phase. It is required to extract the cardinalities of the categorical features.
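The interaction between `NUM_EPOCHS` and `MAX_ITERATIONS` described above can be summarized as a small validation routine. This is a sketch of the documented rule only; the function name `resolve_training_mode` and the returned labels are hypothetical and are not part of the training module.

```python
def resolve_training_mode(num_epochs: int, max_iterations: int) -> str:
    """Return 'epoch' or 'iteration' according to the documented precedence."""
    if num_epochs < 0 or max_iterations < 0:
        raise ValueError("num_epochs and max_iterations must be non-negative")
    if num_epochs > 0:
        # Epoch mode: max_iterations is ignored.
        return "epoch"
    if max_iterations > 0:
        # Non-epoch mode: train on a fixed number of batches.
        return "iteration"
    raise ValueError("Set max_iterations > 0 when num_epochs is 0")


print(resolve_training_mode(num_epochs=0, max_iterations=25000))  # iteration
print(resolve_training_mode(num_epochs=2, max_iterations=25000))  # epoch
```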


#### Set HugeCTR model and trainer configuration

In [10]:
NUM_EPOCHS = 0
MAX_ITERATIONS = 25000
EVAL_INTERVAL = 1000
EVAL_BATCHES = 500
EVAL_BATCHES_FINAL = 2500
DISPLAY_INTERVAL = 200
SNAPSHOT_INTERVAL = 0
PER_GPU_BATCH_SIZE = 2048
LR = 0.001
DROPOUT_RATE = 0.5
NUM_WORKERS = 12

#### Set training node configuration

As described in the overview, we use a single node, multiple GPU worker pool configuration. For a complex deep learning model like DeepFM, we recommend using A2 machines that are equipped with A100 GPUs. 

In this sample we use an `a2-highgpu-4g` machine. For production systems, you may consider even more powerful configurations - `a2-highgpu-8g` with 8 A100 GPUs or `a2-megagpu-16g` with 16 A100 GPUs.

In [11]:
MACHINE_TYPE = 'a2-highgpu-4g'
ACCELERATOR_TYPE = 'NVIDIA_TESLA_A100'
ACCELERATOR_NUM = 4
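With these settings it is useful to know how much data a run processes. The sketch below assumes that each training iteration consumes one batch of `PER_GPU_BATCH_SIZE` examples on every GPU; that assumption is not confirmed by the trainer code shown in this notebook, so treat the numbers as an estimate.

```python
PER_GPU_BATCH_SIZE = 2048
ACCELERATOR_NUM = 4
MAX_ITERATIONS = 25000

# Assumption: the global batch is the per-GPU batch replicated across all GPUs.
global_batch_size = PER_GPU_BATCH_SIZE * ACCELERATOR_NUM

# Total examples processed in non-epoch mode under that assumption.
examples_processed = global_batch_size * MAX_ITERATIONS

print(global_batch_size)            # 8192
print(f"{examples_processed:,}")    # 204,800,000
```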

#### Configure worker pool specifications

In this cell we configure a worker pool specification for a Vertex Custom Training job. Refer to [Vertex AI Training documentation](https://cloud.google.com/vertex-ai/docs/training/create-custom-job) for more details. 

In [12]:
gpus = json.dumps([list(range(ACCELERATOR_NUM))]).replace(' ','')
                 
worker_pool_specs =  [
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            "command": ["python", "-m", "task"],
            "args": [
                f'--per_gpu_batch_size={PER_GPU_BATCH_SIZE}',
                f'--model_name={MODEL_NAME}',
                f'--train_data={TRAIN_DATA}', 
                f'--valid_data={VALID_DATA}',
                f'--schema={SCHEMA_PATH}',
                f'--max_iter={MAX_ITERATIONS}',
                f'--max_eval_batches={EVAL_BATCHES}',
                f'--eval_batches={EVAL_BATCHES_FINAL}',
                f'--dropout_rate={DROPOUT_RATE}',
                f'--lr={LR}',
                f'--num_workers={NUM_WORKERS}',
                f'--num_epochs={NUM_EPOCHS}',
                f'--eval_interval={EVAL_INTERVAL}',
                f'--snapshot={SNAPSHOT_INTERVAL}',
                f'--display_interval={DISPLAY_INTERVAL}',
                f'--gpus={gpus}',
            ],
        },
    }
]
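The `--gpus` argument is passed as a compact JSON string. The sketch below shows the value produced for four accelerators and one way a trainer could parse it back into a nested list, with the outer list representing nodes and the inner list representing GPU indices on a node. The parsing side is an assumption about how the argument is consumed.

```python
import json

ACCELERATOR_NUM = 4

# Same expression as in the worker pool cell: one node with GPUs 0..3.
gpus_arg = json.dumps([list(range(ACCELERATOR_NUM))]).replace(" ", "")
print(gpus_arg)  # [[0,1,2,3]]

# Hypothetical trainer-side parsing of the argument.
gpu_topology = json.loads(gpus_arg)
num_nodes = len(gpu_topology)
gpus_on_first_node = len(gpu_topology[0])
print(num_nodes, gpus_on_first_node)  # 1 4
```

Removing the spaces keeps the value a single shell token when it is passed on the container command line.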

### Submit and monitor the job

When submitting a training job using the `aiplatform.CustomJob` API you can configure the `job.run` function to block until the job completes or to return control to the notebook immediately after the job is submitted. You control this behavior with the `sync` argument.
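If you submit a job without blocking, you need some way to check on it later. The following is a generic polling sketch, not part of the Vertex AI SDK: `wait_for_terminal_state` and its `get_state` callable are hypothetical, and the state sequence is simulated so the example runs without a real job.

```python
import time

# State names follow the JobState values that appear in the Vertex AI job logs.
TERMINAL_STATES = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def wait_for_terminal_state(get_state, poll_seconds=60.0, max_polls=1000, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state name, then return it."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("Job did not reach a terminal state in time")


# Simulated state sequence standing in for a real job.
simulated = iter(["JOB_STATE_QUEUED", "JOB_STATE_PENDING",
                  "JOB_STATE_RUNNING", "JOB_STATE_SUCCEEDED"])
final_state = wait_for_terminal_state(lambda: next(simulated),
                                      poll_seconds=0, sleep=lambda s: None)
print(final_state)  # JOB_STATE_SUCCEEDED
```

With a real job object, `get_state` could plausibly be something like `lambda: job.state.name`, but verify this against the SDK version you are using.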

In [13]:
job_name = 'hugectr_{}'.format(time.strftime("%Y%m%d_%H%M%S"))
base_output_dir =  os.path.join(WORKSPACE, job_name)

job = vertex_ai.CustomJob(
    display_name=job_name,
    worker_pool_specs=worker_pool_specs,
    base_output_dir=base_output_dir
)
job.run(
    sync=True,
    service_account=VERTEX_SA,
    restart_job_on_worker_restart=False
)

INFO:google.cloud.aiplatform.jobs:Creating CustomJob
INFO:google.cloud.aiplatform.jobs:CustomJob created. Resource name: projects/895222332033/locations/us-central1/customJobs/133162952751579136
INFO:google.cloud.aiplatform.jobs:To use this CustomJob in another session:
INFO:google.cloud.aiplatform.jobs:custom_job = aiplatform.CustomJob.get('projects/895222332033/locations/us-central1/customJobs/133162952751579136')
INFO:google.cloud.aiplatform.jobs:View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/133162952751579136?project=895222332033
INFO:google.cloud.aiplatform.jobs:CustomJob projects/895222332033/locations/us-central1/customJobs/133162952751579136 current state:
JobState.JOB_STATE_QUEUED
INFO:google.cloud.aiplatform.jobs:CustomJob projects/895222332033/locations/us-central1/customJobs/133162952751579136 current state:
JobState.JOB_STATE_PENDING
INFO:google.cloud.aiplatform.jobs:CustomJob projects/895222332033/locations/us-central1/custom

## Submit and monitor a Vertex hyperparameter tuning job

In this section of the notebook, you configure, submit and monitor a [Vertex AI Training hyperparameter tuning](https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview) job. We will demonstrate how to use Vertex Training hyperparameter tuning to find optimal values for the base learning rate and the dropout rate. This example can be easily extended to other parameters - e.g. the batch size or even the optimizer type.

As noted in the overview, the training module has been instrumented to integrate with Vertex AI Training hyperparameter tuning. After the final evaluation is completed, the AUC value calculated on `EVAL_BATCHES_FINAL` batches from the validation dataset is reported to Vertex AI Training using the `report_hyperparameter_tuning_metric` method. When the training module is executed in the context of a Vertex Custom Job, this code path has no effect. When used with a Vertex AI Training hyperparameter tuning job, the job is configured to use the AUC as the metric to optimize.




### Configure a hyperparameter job

To prepare a Vertex Training hyperparameter tuning job you need to configure a worker pool specification and a hyperparameter study configuration. Configuring a worker pool is virtually the same as with a Custom Job. The only difference is that you don't need to explicitly pass values of the hyperparameters being tuned to the training container. They will be provided by the hyperparameter tuning service.


To configure a hyperparameter study you need to define a metric to optimize, an optimization goal, and a set of configurations for hyperparameters to tune.

In our case the metric is AUC, the optimization goal is to maximize AUC, and the hyperparameters are `lr` and `dropout_rate`. Notice that you have to match the name of the metric with the name used to report the metric in the training module. You also have to match the names of the hyperparameters with the respective names of command line parameters in your training container.

For each hyperparameter you specify a strategy to apply for sampling values from the hyperparameter's domain. For the `lr` hyperparameter we configure the tuning service to sample values from a continuous range between 0.001 and 0.01 using a logarithmic scale. For the `dropout_rate` we provide a list of discrete values to choose from.
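To build intuition for what a logarithmic scale means for the `lr` range, the sketch below draws values that are uniform in log space. This only illustrates the scale; it is not how the Vertex AI tuning service chooses trial values, and the helper `sample_log_uniform` is a hypothetical name.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(0.001, 0.01, rng) for _ in range(1000)]

# On a log scale, about half of the samples fall below the geometric midpoint.
geometric_midpoint = math.sqrt(0.001 * 0.01)
fraction_below = sum(s < geometric_midpoint for s in samples) / len(samples)

print(f"geometric midpoint: {geometric_midpoint:.5f}")
print(f"fraction below midpoint: {fraction_below:.2f}")
```

On a linear scale, by contrast, only about a quarter of the range lies below the geometric midpoint, so small learning rates would be explored far less often.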

For more information about configuring a hyperparameter study refer to [Vertex AI Hyperparameter job configuration](https://cloud.google.com/vertex-ai/docs/training/using-hyperparameter-tuning).

In [None]:
metric_spec = {'AUC': 'maximize'}

parameter_spec = {
    'lr': hpt.DoubleParameterSpec(min=0.001, max=0.01, scale='log'),
    'dropout_rate': hpt.DiscreteParameterSpec(values=[0.4, 0.5, 0.6], scale=None),
}

### Submit and monitor the job

We can now submit a hyperparameter tuning job. When submitting the job you specify a maximum number of trials to attempt and how many trials to run in parallel. 
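The trial settings also determine how many accelerators the job needs at once, which matters for quota planning. The arithmetic below is a sketch that assumes every trial runs on its own 4-GPU worker pool and that trials are scheduled in full waves; the constant names are illustrative.

```python
import math

MAX_TRIAL_COUNT = 4        # total trials to attempt
PARALLEL_TRIAL_COUNT = 2   # trials running at the same time
GPUS_PER_TRIAL = 4         # assumption: one a2-highgpu-4g worker per trial

# Number of sequential waves needed to run all trials.
waves = math.ceil(MAX_TRIAL_COUNT / PARALLEL_TRIAL_COUNT)

# Peak number of GPUs in use while a full wave is running.
peak_gpus = PARALLEL_TRIAL_COUNT * GPUS_PER_TRIAL

print(waves)      # 2
print(peak_gpus)  # 8
```

Running more trials in parallel shortens wall-clock time, but later trials then have fewer completed results to learn from.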

In [None]:
job_name = 'HUGECTR_HTUNING_{}'.format(time.strftime("%Y%m%d_%H%M%S"))
base_output_dir = os.path.join(WORKSPACE, "model_training", job_name)


custom_job = vertex_ai.CustomJob(
    display_name=job_name,
    worker_pool_specs=worker_pool_specs,
    base_output_dir=base_output_dir
)

hp_job = vertex_ai.HyperparameterTuningJob(
    display_name=job_name,
    custom_job=custom_job,
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=4,
    parallel_trial_count=2,
    search_algorithm=None)

hp_job.run(
    sync=True,
    service_account=VERTEX_SA,
    restart_job_on_worker_restart=False
)

### Retrieve trial results

After a hyperparameter tuning job completes you can retrieve the trial results from the job object. The results are returned as a list of trial records. To retrieve the trial with the best value of a metric - AUC - you need to scan through all trial records.

In [None]:
hp_job.trials

#### Find the best trial

In [None]:
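def metric_value(trial, metric_id='AUC'):
    # Look up the metric by name rather than by its position in the list.
    return next(m.value for m in trial.final_measurement.metrics
                if m.metric_id == metric_id)

best_trial = max(hp_job.trials, key=metric_value)

# Index parameters by name instead of relying on their position.
best_params = {p.parameter_id: p.value for p in best_trial.parameters}

print("Best trial ID:", best_trial.id)
print("   AUC:", metric_value(best_trial))
print("   LR:", best_params['lr'])
print("   Dropout rate:", best_params['dropout_rate'])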

## Next Steps

After completing this notebook you can proceed to the [03-inference-triton-hugectr.ipynb](https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai/blob/main/03-model-inference-hugectr.ipynb) notebook that demonstrates how to deploy the DeepFM model trained in this notebook using NVIDIA Triton Inference Server.