# NVTabular / HugeCTR Criteo Example 
Here we'll show how to use NVTabular first as a preprocessing library to prepare the [Criteo Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge) dataset, and then train a model using HugeCTR.

### Data Prep
Before we get started, make sure you've run the [`optimize_criteo` notebook](./optimize_criteo.ipynb), which converts the TSV data published by Criteo into the Parquet format that our accelerated readers prefer. Be aware that this notebook takes about 4 hours to run. While we're hoping to release accelerated CSV readers in the near future, we also believe that inefficiencies in existing data representations like CSV are in no small part a consequence of inefficiencies in the existing hardware/software stack. Accelerating these pipelines on new hardware like GPUs may require us to make new choices about the representations we use to store that data, and Parquet represents a strong alternative.

#### Quick Aside: Clearing Cache
The following line is not strictly necessary, but is included for those who want to validate NVIDIA's benchmarks. We start by clearing the existing cache to start as "fresh" as possible. If you're having trouble running it, try executing the container with the `--privileged` flag.

In [1]:
!sync; echo 3 > /proc/sys/vm/drop_caches

/bin/sh: 1: cannot create /proc/sys/vm/drop_caches: Read-only file system


In [2]:
import os
from time import time
import re
import glob
import warnings

# tools for data preproc/loading
import torch
import rmm
import nvtabular as nvt
from nvtabular.ops import Normalize,  Categorify,  LogOp, FillMissing, Clip, get_embedding_sizes

Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so.

For more information about alternatives visit: ('https://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice/.

For more information about alternatives visit: ('https://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')


### Initializing the Memory Pool
For applications like the one that follows, where RAPIDS will be the only heavy user of GPU memory and resources, a good practice is to use the RAPIDS Memory Manager library `rmm` to allocate a dedicated pool of GPU memory that allows for fast, asynchronous memory management. Here, we'll dedicate 80% of free GPU memory to this pool to make sure we get the most utilization possible.

In [3]:
rmm.reinitialize(pool_allocator=True, initial_pool_size=0.8 * nvt.io.device_mem_size(kind='free'))



### Dataset and Dataset Schema
Once our data is ready, we'll define some high-level parameters to describe where our data is and what it "looks like".

In [4]:
# define some information about where to get our data
INPUT_DATA_DIR = os.environ.get('INPUT_DATA_DIR', '/raid/criteo/tests/crit_int_pq')
OUTPUT_DATA_DIR = os.environ.get('OUTPUT_DATA_DIR', '/raid/criteo/tests/test_dask') # where we'll save our processed data to
BATCH_SIZE = int(os.environ.get('BATCH_SIZE', 800000))
NUM_PARTS = int(os.environ.get('NUM_PARTS', 2))
NUM_TRAIN_DAYS = 23 # number of days worth of data to use for training, the rest will be used for validation

# define our dataset schema
CONTINUOUS_COLUMNS = ['I' + str(x) for x in range(1,14)]
CATEGORICAL_COLUMNS =  ['C' + str(x) for x in range(1,27)]
LABEL_COLUMNS = ['label']
COLUMNS = CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS + LABEL_COLUMNS

In [5]:
# ! ls $INPUT_DATA_DIR


In [6]:
fname = 'day_{}.parquet'
num_days = len([i for i in os.listdir(INPUT_DATA_DIR) if re.match(fname.format('[0-9]{1,2}'), i) is not None])
train_paths = [os.path.join(INPUT_DATA_DIR, fname.format(day)) for day in range(NUM_TRAIN_DAYS)]
valid_paths = [os.path.join(INPUT_DATA_DIR, fname.format(day)) for day in range(NUM_TRAIN_DAYS, num_days)]
#print(train_paths)
#print(valid_paths)

['/dataset/crit_int_pq/day_0.parquet']
['/dataset/crit_int_pq/day_23.parquet']
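The split above can be tried without the real dataset. The following is a minimal sketch, not the notebook's own code: `split_by_day` and the sample `listing` are hypothetical names introduced for illustration. It differs from the cell above in two ways that are assumptions on our part: it uses `fullmatch` so that stray files such as `day_1.parquet.tmp` are ignored (`re.match` only anchors at the start of the string), and it builds paths from the day numbers actually present rather than assuming days `0..num_days-1` all exist.

```python
import os
import re

# Matches exactly "day_<n>.parquet" with a one- or two-digit day number.
DAY_FILE = re.compile(r"day_(\d{1,2})\.parquet")


def split_by_day(filenames, input_dir, num_train_days):
    """Split day files into (train_paths, valid_paths) by day number."""
    matches = (DAY_FILE.fullmatch(name) for name in filenames)
    days = sorted(int(m.group(1)) for m in matches if m)

    def to_path(day):
        return os.path.join(input_dir, "day_{}.parquet".format(day))

    train = [to_path(d) for d in days if d < num_train_days]
    valid = [to_path(d) for d in days if d >= num_train_days]
    return train, valid


# Hypothetical directory listing, including files that should be skipped.
listing = ["day_0.parquet", "day_1.parquet", "day_2.parquet",
           "_metadata", "day_1.parquet.tmp"]
train_paths, valid_paths = split_by_day(listing, "/data", num_train_days=2)
print(train_paths)
print(valid_paths)
```

With `NUM_TRAIN_DAYS = 23` and 24 day files, the same logic would place days 0 through 22 in the training set and day 23 in the validation set.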


### Preprocessing
At this point, our data still isn't in a form that's ideal for consumption by neural networks. The most pressing issues are missing values and the fact that our categorical variables are still represented by random, discrete identifiers, and need to be transformed into contiguous indices that can be leveraged by a learned embedding. Less pressing, but still important for learning dynamics, are the distributions of our continuous variables, which are distributed across multiple orders of magnitude and are uncentered (i.e. E[x] != 0).

We can fix these issues in a concise and GPU-accelerated manner with an NVTabular `Workflow`. We'll instantiate one with our current dataset schema, then symbolically add operations _on_ that schema. By setting all these `Ops` to use `replace=True`, the schema itself will remain unmodified, while the variables represented by each field in the schema will be transformed.

#### Frequency Thresholding
One thing worth pointing out is that we're using _frequency thresholding_ in our `Categorify` op. This maps all categories that occur in the dataset fewer than some threshold number of times (which we've set here to 15 occurrences throughout the dataset) to the _same_ index, keeping the model from overfitting to sparse signals.
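To make the idea concrete, here is a minimal pure-Python sketch of frequency thresholding. It is not NVTabular's implementation: the function names are hypothetical, and the choices to reserve index 0 for infrequent or unseen values, to treat the threshold as inclusive, and to order indices by sorted category value are assumptions made for illustration.

```python
from collections import Counter


def fit_categorify(values, freq_threshold):
    """Build a value -> index mapping from the training column.

    Values seen at least `freq_threshold` times get their own index
    starting at 1; everything else will share index 0.
    """
    counts = Counter(values)
    frequent = sorted(v for v, c in counts.items() if c >= freq_threshold)
    return {v: i + 1 for i, v in enumerate(frequent)}


def transform_categorify(values, mapping):
    """Encode a column, sending infrequent or unseen values to index 0."""
    return [mapping.get(v, 0) for v in values]


column = ["a", "b", "a", "c", "a", "b", "d"]
mapping = fit_categorify(column, freq_threshold=2)
encoded = transform_categorify(column, mapping)
print(mapping)   # {'a': 1, 'b': 2}
print(encoded)   # [1, 2, 1, 0, 1, 2, 0]
```

Here `"c"` and `"d"` each appear once, so both collapse to the shared index 0, and the embedding table only needs three rows instead of five.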

In [7]:
proc = nvt.Workflow(
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    label_name=LABEL_COLUMNS)

# log -> normalize continuous features. Note that doing this in the opposite
# order wouldn't make sense! Note also that we're zero filling continuous
# values before the log: this is a good time to remember that LogOp
# performs log(1+x), not log(x)
proc.add_cont_feature([FillMissing(), Clip(min_value=0), LogOp()])
proc.add_cont_preprocess(Normalize())

# categorification with frequency thresholding
proc.add_cat_preprocess(Categorify(freq_threshold=15, out_path=OUTPUT_DATA_DIR))
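For intuition, the continuous-feature pipeline above (fill missing with zero, clip at zero, `log(1+x)`, then normalize) can be sketched on plain Python lists. This is an illustrative approximation, not NVTabular's code: the helper names are hypothetical, and the use of the population standard deviation is an assumption. Statistics are fitted on the training column only and then reused for the validation column.

```python
import math


def log_transform(column):
    """Zero-fill missing values, clip negatives to 0, then apply log(1 + x)."""
    return [math.log1p(max(0.0 if x is None else float(x), 0.0)) for x in column]


def fit_normalizer(logged):
    """Compute mean and (population) standard deviation of a column."""
    mean = sum(logged) / len(logged)
    std = math.sqrt(sum((x - mean) ** 2 for x in logged) / len(logged))
    return mean, std or 1.0  # guard against a constant column


def normalize(logged, mean, std):
    """Center and scale a column with previously fitted statistics."""
    return [(x - mean) / std for x in logged]


train = [None, -5.0, 1.0, 3.0]
valid = [None, 7.0]

# Fit on the training data only, then apply the same statistics to both sets.
mean, std = fit_normalizer(log_transform(train))
train_out = normalize(log_transform(train), mean, std)
valid_out = normalize(log_transform(valid), mean, std)
print(train_out)
print(valid_out)
```

The missing value and the negative value both end up at `log1p(0) = 0` before normalization, which is why the order of the steps matters: taking the log of raw negative or missing values would not be defined.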

Now we instantiate dataset iterators to loop through our dataset (which we couldn't fit into GPU memory).

In [8]:
import numpy as np

dict_dtypes={}

for col in CONTINUOUS_COLUMNS:
    dict_dtypes[col] = np.float32
    
for col in CATEGORICAL_COLUMNS:
    dict_dtypes[col] = np.int64
    
for col in LABEL_COLUMNS:
    dict_dtypes[col] = np.float32

#print(dict_dtypes)

{'I1': <class 'numpy.float32'>, 'I2': <class 'numpy.float32'>, 'I3': <class 'numpy.float32'>, 'I4': <class 'numpy.float32'>, 'I5': <class 'numpy.float32'>, 'I6': <class 'numpy.float32'>, 'I7': <class 'numpy.float32'>, 'I8': <class 'numpy.float32'>, 'I9': <class 'numpy.float32'>, 'I10': <class 'numpy.float32'>, 'I11': <class 'numpy.float32'>, 'I12': <class 'numpy.float32'>, 'I13': <class 'numpy.float32'>, 'C1': <class 'numpy.int64'>, 'C2': <class 'numpy.int64'>, 'C3': <class 'numpy.int64'>, 'C4': <class 'numpy.int64'>, 'C5': <class 'numpy.int64'>, 'C6': <class 'numpy.int64'>, 'C7': <class 'numpy.int64'>, 'C8': <class 'numpy.int64'>, 'C9': <class 'numpy.int64'>, 'C10': <class 'numpy.int64'>, 'C11': <class 'numpy.int64'>, 'C12': <class 'numpy.int64'>, 'C13': <class 'numpy.int64'>, 'C14': <class 'numpy.int64'>, 'C15': <class 'numpy.int64'>, 'C16': <class 'numpy.int64'>, 'C17': <class 'numpy.int64'>, 'C18': <class 'numpy.int64'>, 'C19': <class 'numpy.int64'>, 'C20': <class 'numpy.int64'>, '

In [9]:
train_dataset = nvt.Dataset(train_paths, engine='parquet', part_mem_fraction=0.15, dtypes=dict_dtypes)
valid_dataset = nvt.Dataset(valid_paths, engine='parquet', part_mem_fraction=0.15, dtypes=dict_dtypes)

Now we run them through our workflow to collect statistics on the train set, then transform and save to Parquet files.

In [10]:
output_train_dir = os.path.join(OUTPUT_DATA_DIR, 'train/')
output_valid_dir = os.path.join(OUTPUT_DATA_DIR, 'valid/')
! mkdir -p $output_train_dir
! mkdir -p $output_valid_dir

For reference, let's time it to see how long it takes...

In [11]:
%%time
proc.apply(train_dataset, apply_offline=True, record_stats=True, shuffle=False, output_format="parquet", output_path=output_train_dir, out_files_per_proc=15)

CPU times: user 1min 13s, sys: 29.9 s, total: 1min 43s
Wall time: 2min 31s


In [13]:
%%time
proc.apply(valid_dataset, apply_offline=True, record_stats=False, shuffle=False, output_format="parquet", output_path=output_valid_dir, out_files_per_proc=15)

CPU times: user 24.1 s, sys: 23.5 s, total: 47.6 s
Wall time: 1min 32s


In [None]:
embeddings = get_embedding_sizes(proc)
print(embeddings.values())

And just like that, we have training and validation sets ready to feed to a model!

## HugeCTR
### Training
We'll run the `huge_ctr` binary, passing it a JSON configuration file.

First, we'll reinitialize our memory pool from earlier to free up some memory so that we can share it with HugeCTR.

In [15]:
rmm.reinitialize(pool_allocator=False)

In [18]:
! /usr/local/hugectr/bin/huge_ctr --train dcn_parquet.json

[0.001, init_start, ]
HugeCTR Version: 2.2.1
Config file: dcn_parquet.json
[25d20h38m04s][HUGECTR][INFO]: Default evaluation metric is AUC without threshold value
[25d20h38m04s][HUGECTR][INFO]: algorithm_search is not specified using default: 1
[25d20h38m04s][HUGECTR][INFO]: Algorithm search: ON
[25d20h38m06s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Device 0: Tesla V100-DGXS-16GB
[25d20h38m06s][HUGECTR][INFO]: Initial seed is 2794287524
[25d20h38m06s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[25d20h38m07s][HUGECTR][INFO]: Vocabulary size: 2116453
[25d20h38m08s][HUGECTR][INFO]: num_internal_buffers 1
[25d20h38m08s][HUGECTR][INFO]: num_internal_buffers 1
[25d20h38m08s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=2700000
[25d20h38m08s][HUGECTR][INFO]: All2All Warmup Start
[25d20h38m08s][HUGECTR][INFO]: All2All Warmup End
[25d20h38m09s][HUGECTR][INFO]: gpu0 start to init embedding
[25d20h38m09s][HUGECTR][INFO]: gpu0 init embedding done
[25d20h

In [1]:
! /usr/local/hugectr/bin/huge_ctr --train dlrm_fp32_64k.json

[0.001, init_start, ]
HugeCTR Version: 2.2.1
Config file: dlrm_fp32_64k-ALL.json
[25d23h08m44s][HUGECTR][INFO]: algorithm_search is not specified using default: 1
[25d23h08m44s][HUGECTR][INFO]: Algorithm search: ON
Device 0: Tesla V100-DGXS-16GB
Device 1: Tesla V100-DGXS-16GB
Device 2: Tesla V100-DGXS-16GB
Device 3: Tesla V100-DGXS-16GB
[25d23h08m56s][HUGECTR][INFO]: Initial seed is 2705795868
[25d23h08m56s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[25d23h08m56s][HUGECTR][INFO]: Vocabulary size: 2116453
[25d23h08m57s][HUGECTR][INFO]: num_internal_buffers 1
[25d23h08m57s][HUGECTR][INFO]: num_internal_buffers 1
[25d23h08m57s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=15500000
[25d23h08m57s][HUGECTR][INFO]: All2All Warmup Start
[25d23h08m57s][HUGECTR][INFO]: All2All Warmup End
[25d23h09m27s][HUGECTR][INFO]: gpu0 start to init embedding
[25d23h09m27s][HUGECTR][INFO]: gpu1 start to init embedding
[25d23h09m27s][HUGECTR][INFO]: gpu2 start to init embedding
[25d23h