In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Scaling Criteo: ETL with NVTabular

## Overview

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems. It provides a high level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS cuDF library.<br><br>

**In this notebook, we will show how to scale NVTabular to multiple GPUs and multiple nodes.** You should already be familiar with NVTabular and its API. You can read more about both in our [Getting Started with Movielens notebooks](https://github.com/NVIDIA/NVTabular/tree/main/examples/getting-started-movielens).<br><br>

The full [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/) contains ~1.3 TB of uncompressed click logs with over four billion samples spanning 24 days. In our benchmarks, we are able to preprocess and engineer features in **13.8min with 1x NVIDIA A100 GPU and 1.9min with 8x NVIDIA A100 GPUs**. This is a **speed-up of 100x-10000x** in comparison to different CPU versions. You can read more in our [blog](https://developer.nvidia.com/blog/announcing-the-nvtabular-open-beta-with-multi-gpu-support-and-new-data-loaders/).

Our pipeline is representative of the most common preprocessing transformations for deep learning recommender models.

* Categorical input features are `Categorified` into contiguous integers (0, ..., |C|) for the embedding layers
* Missing values of continuous input features are filled with 0. Afterwards the continuous features are clipped and normalized.
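The continuous-feature steps above can be sketched in plain Python. This is only a conceptual illustration of fill, clip and normalize on a tiny list, not NVTabular's GPU implementation; the function name `preprocess_continuous` and the sample values are hypothetical.

```python
import statistics

def preprocess_continuous(values, mean, std):
    """Fill missing values with 0, clip negatives to 0, then standardize."""
    filled = [0.0 if v is None else float(v) for v in values]  # fill missing with 0
    clipped = [max(v, 0.0) for v in filled]                     # clip at a minimum of 0
    return [(v - mean) / std for v in clipped]                  # center and scale

# Hypothetical raw column with one missing and one negative value
raw = [3.0, None, -2.0, 9.0]

# The statistics are computed on the filled and clipped values
cleaned = [max(0.0 if v is None else v, 0.0) for v in raw]
mean = statistics.fmean(cleaned)
std = statistics.pstdev(cleaned)

normalized = preprocess_continuous(raw, mean, std)
print(normalized)
```

After these steps the column has zero mean and unit standard deviation, which is the property the normalization step is meant to provide.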

### Learning objectives
In this notebook, we learn how to scale ETL with NVTabular:

- Learn to use larger than GPU/host memory datasets
- Use multi-GPU or multi-node setups for ETL
- Apply a common deep learning ETL workflow

### Multi-GPU and multi-node scaling

NVTabular is built on top of [RAPIDS.AI cuDF](https://github.com/rapidsai/cudf/), [dask_cudf](https://docs.rapids.ai/api/cudf/stable/dask-cudf.html) and [dask](https://dask.org/).<br><br>
**Dask** is a task-based library for parallel scheduling and execution. Although it is certainly possible to use the task-scheduling machinery directly to implement customized parallel workflows (we do it in NVTabular), most users only interact with Dask through a Dask Collection API. The most popular "collection" APIs include:

* Dask DataFrame: Dask-based version of the Pandas DataFrame/Series API. Note that dask_cudf is just a wrapper around this collection module (dask.dataframe).
* Dask Array: Dask-based version of the NumPy array API
* Dask Bag: Similar to a Dask-based version of PyToolz or a Pythonic version of PySpark RDD

For example, Dask DataFrame provides a convenient API for decomposing large Pandas (or cuDF) DataFrame/Series objects into a collection of DataFrame partitions.

<img src="./imgs/dask-dataframe.svg" width="20%">
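To illustrate the partitioning idea without a GPU, the following sketch splits a list into partitions, computes a partial result per partition, and combines the partials. It only mimics the pattern that Dask DataFrame automates; `make_partitions` and `partitioned_mean` are hypothetical helper names, not Dask APIs.

```python
def make_partitions(rows, partition_size):
    """Split a sequence into consecutive chunks of at most partition_size rows."""
    return [rows[i : i + partition_size] for i in range(0, len(rows), partition_size)]

def partitioned_mean(partitions):
    """Compute a global mean from per-partition (sum, count) pairs."""
    partials = [(sum(p), len(p)) for p in partitions]  # independent per-partition work
    total = sum(s for s, _ in partials)                # combine step
    count = sum(n for _, n in partials)
    return total / count

rows = list(range(10))
partitions = make_partitions(rows, partition_size=4)
print(len(partitions), partitioned_mean(partitions))
```

Because each partial result depends on a single partition only, the per-partition work can be handed to different workers, which is what makes the approach suitable for datasets larger than the memory of one device.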

We use **dask_cudf** to process large datasets as a collection of cuDF dataframes instead of Pandas dataframes. cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
<br><br>
**Dask makes it easy to schedule tasks for multiple workers: multi-GPU or multi-node. We just need to initialize a Dask cluster (`LocalCUDACluster`) and NVTabular will use the cluster to execute the workflow.**

## ETL with NVTabular
Here we'll show how to use NVTabular as a preprocessing library to prepare the [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/). The following notebooks can use the output to train a deep learning model.

### Data Prep
The previous notebook [01-Download-Convert](./01-Download-Convert.ipynb) converted the tsv data published by Criteo into the parquet format that our accelerated readers prefer. Accelerating these pipelines on new hardware like GPUs may require us to make new choices about the representations we use to store that data, and parquet represents a strong alternative.

We load the required libraries.

In [2]:
# Standard Libraries
import os
import re
import shutil
import warnings

# External Dependencies
import numpy as np
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# NVTabular
import nvtabular as nvt
from nvtabular.ops import Categorify, Clip, FillMissing, Normalize, AddMetadata
from nvtabular.utils import _pynvml_mem_size, device_mem_size

Once our data is ready, we'll define some high level parameters to describe where our data is and what it "looks like" at a high level.

In [3]:
# define some information about where to get our data
BASE_DIR = os.environ.get("BASE_DIR", "/raid/data/criteo")
INPUT_DATA_DIR = os.environ.get("INPUT_DATA_DIR", BASE_DIR + "/converted/criteo")
dask_workdir = os.path.join(BASE_DIR, "test_dask/workdir")
OUTPUT_DATA_DIR = os.environ.get("OUTPUT_DATA_DIR", BASE_DIR + "/test_dask/output")
stats_path = os.path.join(BASE_DIR, "test_dask/stats")

# Make sure we have a clean worker space for Dask
if os.path.isdir(dask_workdir):
    shutil.rmtree(dask_workdir)
os.makedirs(dask_workdir)

# Make sure we have a clean stats space for Dask
if os.path.isdir(stats_path):
    shutil.rmtree(stats_path)
os.mkdir(stats_path)

# Make sure we have a clean output path
if os.path.isdir(OUTPUT_DATA_DIR):
    shutil.rmtree(OUTPUT_DATA_DIR)
os.mkdir(OUTPUT_DATA_DIR)

We use the last day as the validation dataset and the remaining days as the training dataset.

In [4]:
fname = "day_{}.parquet"
num_days = len(
    [i for i in os.listdir(INPUT_DATA_DIR) if re.match(fname.format("[0-9]{1,2}"), i) is not None]
)
train_paths = [os.path.join(INPUT_DATA_DIR, fname.format(day)) for day in range(num_days - 1)]
valid_paths = [
    os.path.join(INPUT_DATA_DIR, fname.format(day)) for day in range(num_days - 1, num_days)
]

In [5]:
print(train_paths)
print(valid_paths)

['/raid/benny/04_criteo/converted/criteo/day_0.parquet']
['/raid/benny/04_criteo/converted/criteo/day_1.parquet']
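As a quick check of the split logic, the sketch below applies the same pattern and slicing to a hypothetical directory listing of 24 days. The file names are made up for illustration and nothing is read from disk.

```python
import re

fname = "day_{}.parquet"
pattern = fname.format("[0-9]{1,2}")  # regular expression matching day_<1-2 digits>.parquet

# Hypothetical listing: 24 day files plus one unrelated file
listing = [fname.format(d) for d in range(24)] + ["_metadata"]

num_days = len([f for f in listing if re.match(pattern, f) is not None])
train_files = [fname.format(day) for day in range(num_days - 1)]
valid_files = [fname.format(day) for day in range(num_days - 1, num_days)]

print(len(train_files), valid_files)
```

With 24 matching files, days 0 to 22 go to training and day 23 is held out for validation.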


### Deploy a Distributed-Dask Cluster

Now we configure and deploy a Dask cluster. [This document](https://github.com/NVIDIA/NVTabular/blob/d419a4da29cf372f1547edc536729b0733560a44/bench/examples/MultiGPUBench.md) explains how to set the parameters.

In [6]:
# Dask dashboard
dashboard_port = "8787"

# Deploy a Single-Machine Multi-GPU Cluster
protocol = "tcp"  # "tcp" or "ucx"
NUM_GPUS = [0, 1, 2, 3, 4, 5, 6, 7]
visible_devices = ",".join([str(n) for n in NUM_GPUS])  # Select devices to place workers
device_limit_frac = 0.7  # Spill GPU-Worker memory to host at this limit.
device_pool_frac = 0.8
part_mem_frac = 0.15

# Use total device size to calculate args.device_limit_frac
device_size = device_mem_size(kind="total")
device_limit = int(device_limit_frac * device_size)
device_pool_size = int(device_pool_frac * device_size)
part_size = int(part_mem_frac * device_size)

# Check if any device memory is already occupied
for dev in visible_devices.split(","):
    fmem = _pynvml_mem_size(kind="free", index=int(dev))
    used = (device_size - fmem) / 1e9
    if used > 1.0:
        warnings.warn(f"BEWARE - {used} GB is already occupied on device {int(dev)}!")

cluster = None  # (Optional) Specify existing scheduler port
if cluster is None:
    cluster = LocalCUDACluster(
        protocol=protocol,
        n_workers=len(visible_devices.split(",")),
        CUDA_VISIBLE_DEVICES=visible_devices,
        device_memory_limit=device_limit,
        local_directory=dask_workdir,
        dashboard_address=":" + dashboard_port,
        rmm_pool_size=(device_pool_size // 256) * 256,
    )

# Create the distributed client
client = Client(cluster)
client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 34057 instead
distributed.preloading - INFO - Import preload module: dask_cuda.initialize


Connection method: Cluster object
Cluster type: LocalCUDACluster
Dashboard: http://127.0.0.1:34057/status

Status: running
Using processes: True
Dashboard: http://127.0.0.1:34057/status
Workers: 1
Total threads: 1
Total memory: 0.98 TiB

Comm: tcp://127.0.0.1:35397
Workers: 1
Dashboard: http://127.0.0.1:34057/status
Total threads: 1
Started: Just now
Total memory: 0.98 TiB

Comm: tcp://127.0.0.1:43039
Total threads: 1
Dashboard: http://127.0.0.1:32795/status
Memory: 0.98 TiB
Nanny: tcp://127.0.0.1:36051
Local directory: /raid/benny/04_criteo/test_dask/workdir/dask-worker-space/worker-dvdj11v9
GPU: Tesla V100-SXM2-32GB
GPU memory: 31.75 GiB


That's it. We initialized our Dask cluster and NVTabular will execute the workflow on multiple GPUs. Similarly, we could define a cluster with multiple nodes.
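To make the memory settings in the cluster configuration concrete, the following sketch repeats the same arithmetic for an assumed device with 32 GiB of total memory. The device size is a made-up example value; on a real system it comes from `device_mem_size(kind="total")`.

```python
GIB = 1024 ** 3
device_size = 32 * GIB     # assumed total memory of one GPU, in bytes

device_limit_frac = 0.7    # spill from device to host above this fraction
device_pool_frac = 0.8     # fraction used for the RMM memory pool
part_mem_frac = 0.15       # fraction used as the dataset partition size

device_limit = int(device_limit_frac * device_size)
device_pool_size = int(device_pool_frac * device_size)
part_size = int(part_mem_frac * device_size)

# Round the pool size down to a multiple of 256 bytes, as in the cluster setup
rmm_pool_size = (device_pool_size // 256) * 256

for name, value in [
    ("device_limit", device_limit),
    ("rmm_pool_size", rmm_pool_size),
    ("part_size", part_size),
]:
    print(f"{name}: {value / GIB:.2f} GiB")
```

With these fractions a single partition stays well below the spill limit, so several partitions can be resident on a device at once.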

### Defining our Preprocessing Pipeline
At this point, our data still isn't in a form that's ideal for consumption by neural networks. The most pressing issues are missing values and the fact that our categorical variables are still represented by random, discrete identifiers, and need to be transformed into contiguous indices that can be leveraged by a learned embedding. Less pressing, but still important for learning dynamics, are the distributions of our continuous variables, which are distributed across multiple orders of magnitude and are uncentered (i.e. E[x] != 0).

We can fix these issues in a concise and GPU-accelerated manner with an NVTabular `Workflow`. We explained the NVTabular API in [Getting Started with Movielens notebooks](https://github.com/NVIDIA/NVTabular/tree/main/examples/getting-started-movielens) and hope you are familiar with the syntax.

#### Frequency Thresholding
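One interesting thing worth pointing out is the way `Categorify` can handle rare categories. With _frequency thresholding_ (the `freq_threshold` argument), all categories that occur fewer than a chosen number of times in the dataset are mapped to the _same_ index, which keeps the model from overfitting to sparse signals. In the workflow below we do not set `freq_threshold`; instead we cap each categorical column at `max_size=10000000` entries, so that only the most frequent categories keep their own index and the remaining ones share a single index.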
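As a rough illustration of how rare categories can be collapsed into one index, here is a minimal pure-Python sketch. It is not NVTabular's implementation: `build_vocab` and `encode` are hypothetical helpers, and the choice of index 0 for the shared bucket is an assumption of this sketch.

```python
from collections import Counter

def build_vocab(values, freq_threshold=0, max_size=0):
    """Map each kept category to an index >= 1; index 0 is shared by all others."""
    counts = Counter(values)
    # Rank categories from most to least frequent (ties broken by value)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [value for value, count in ranked if count >= freq_threshold]
    if max_size:
        kept = kept[: max_size - 1]  # leave room for the shared index 0
    return {value: index + 1 for index, value in enumerate(kept)}

def encode(values, vocab):
    """Replace categories by their index; unseen or rare categories become 0."""
    return [vocab.get(value, 0) for value in values]

train = ["a"] * 5 + ["b"] * 3 + ["c"]
vocab = build_vocab(train, freq_threshold=2)
print(encode(["a", "c", "b", "z"], vocab))
```

Both the rare category `"c"` and the unseen category `"z"` end up in the shared bucket, so the embedding table only needs rows for categories with enough training signal.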

#### Adding Metadata
Recently, we added support for tagging columns with the `AddMetadata` operator. Tags let us automate workflows and the training steps that follow by standardizing the dataset schema.
<br><br>
In many real world use cases, the feature engineering pipelines are defined for a group of features. For example, we apply normalization to all continuous features. Similarly, NVTabular defines workflows on a list of feature names. We can add the `AddMetadata` operator to define our dataset schema.<br><br>

```python
AddMetadata(tags=None, properties=None)
```

NVTabular will write a schema file when we store the dataset to disk with `workflow.transform(train_dataset).to_parquet`. In later steps, we can load the schema file and select columns by tags.<br><br>
We will add the tags `label`, `cat` and `cont`.
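The idea of selecting columns by tag can be sketched with a plain dictionary. This is a simplified, hypothetical stand-in for a schema (column name mapped to a list of tags), not the NVTabular `Schema` class; `columns_with_tag` is a made-up helper name.

```python
# Hypothetical, simplified schema: column name -> list of tags
schema_tags = {}
for col in ["I1", "I2"]:
    schema_tags[col] = ["cont"]
for col in ["C1", "C2", "C3"]:
    schema_tags[col] = ["cat"]
schema_tags["label"] = ["label"]

def columns_with_tag(schema, tag):
    """Return the column names that carry the given tag, in insertion order."""
    return [name for name, tags in schema.items() if tag in tags]

print(columns_with_tag(schema_tags, "cat"))
print(columns_with_tag(schema_tags, "label"))
```

Downstream code that only asks for "all columns tagged `cat`" keeps working when columns are added or renamed, which is the main benefit of standardizing on tags.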

In [7]:
# define our dataset schema
CONTINUOUS_COLUMNS = ["I" + str(x) for x in range(1, 14)]
CATEGORICAL_COLUMNS = ["C" + str(x) for x in range(1, 27)]
LABEL_COLUMNS = ["label"]
COLUMNS = CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS + LABEL_COLUMNS

num_buckets = 10000000
categorify_op = Categorify(out_path=stats_path, max_size=num_buckets)
cat_features = CATEGORICAL_COLUMNS >> categorify_op >> AddMetadata(tags=["cat"])
cont_features = (
    CONTINUOUS_COLUMNS
    >> FillMissing()
    >> Clip(min_value=0)
    >> Normalize()
    >> AddMetadata(tags=["cont"])
)
label_feature = LABEL_COLUMNS >> AddMetadata(tags=["label"])

features = cat_features + cont_features + label_feature

workflow = nvt.Workflow(features, client=client)

Now instantiate dataset iterators to loop through our dataset (which we couldn't fit into GPU memory).

In [8]:
train_dataset = nvt.Dataset(train_paths, engine="parquet", part_size=part_size)
valid_dataset = nvt.Dataset(valid_paths, engine="parquet", part_size=part_size)

Next, we create the output directories. We will then fit the workflow to collect statistics on the training set, and transform and save both datasets as parquet files.

In [9]:
output_train_dir = os.path.join(OUTPUT_DATA_DIR, "train/")
output_valid_dir = os.path.join(OUTPUT_DATA_DIR, "valid/")
! mkdir -p $output_train_dir
! mkdir -p $output_valid_dir

For reference, let's time it to see how long it takes...

In [10]:
%%time
workflow.fit(train_dataset)

CPU times: user 1.62 s, sys: 322 ms, total: 1.94 s
Wall time: 21.4 s


<nvtabular.workflow.workflow.Workflow at 0x7fac16fef970>

We need to enforce that the output datasets have the required HugeCTR data types. We define a dictionary with column names and data types. We give it as an argument when we transform and store our dataset to disk.

In [11]:
dict_dtypes = {}

for col in CATEGORICAL_COLUMNS:
    dict_dtypes[col] = np.int64

for col in CONTINUOUS_COLUMNS:
    dict_dtypes[col] = np.float32

for col in LABEL_COLUMNS:
    dict_dtypes[col] = np.float32

In [12]:
%%time

# Add "write_hugectr_keyset=True" to "to_parquet" if using this ETL Notebook for training with HugeCTR
workflow.transform(train_dataset).to_parquet(
    output_files=len(NUM_GPUS),
    output_path=output_train_dir,
    shuffle=nvt.io.Shuffle.PER_PARTITION,
    dtypes=dict_dtypes,
    cats=CATEGORICAL_COLUMNS,
    conts=CONTINUOUS_COLUMNS,
    labels=LABEL_COLUMNS,
)

CPU times: user 543 ms, sys: 252 ms, total: 796 ms
Wall time: 24.2 s
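The `shuffle` argument used above can be pictured with a small pure-Python sketch, assuming that `PER_PARTITION` means rows are shuffled within each partition independently and never move between partitions. The helper `shuffle_per_partition` is hypothetical and is not part of NVTabular.

```python
import random

def shuffle_per_partition(partitions, seed=0):
    """Shuffle rows inside each partition; rows never move between partitions."""
    rng = random.Random(seed)
    shuffled = []
    for part in partitions:
        rows = list(part)  # copy so the input is left untouched
        rng.shuffle(rows)
        shuffled.append(rows)
    return shuffled

partitions = [[0, 1, 2, 3], [4, 5, 6, 7]]
result = shuffle_per_partition(partitions)
print(result)
```

Shuffling only within a partition avoids moving data between workers, at the cost of a weaker shuffle than a global one.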


In [13]:
%%time

# Add "write_hugectr_keyset=True" to "to_parquet" if using this ETL Notebook for training with HugeCTR
workflow.transform(valid_dataset).to_parquet(
    output_path=output_valid_dir,
    dtypes=dict_dtypes,
    cats=CATEGORICAL_COLUMNS,
    conts=CONTINUOUS_COLUMNS,
    labels=LABEL_COLUMNS,
)

CPU times: user 494 ms, sys: 213 ms, total: 707 ms
Wall time: 21.3 s


In the next notebooks, we will train a deep learning model. Our training pipeline requires information about the data schema to define the neural network architecture. We will save the NVTabular workflow to disk so that we can restore it in the next notebooks.

In [14]:
workflow.save(os.path.join(OUTPUT_DATA_DIR, "workflow"))

### Reviewing Schema File

We can take a look at the schema file. In the output folders, NVTabular created a `schema.pbtxt` file.

In [15]:
os.system("head -n 20 " + output_train_dir + "/schema.pbtxt")

feature {
  name: "C1"
  type: INT
  int_domain {
    name: "C1"
    min: 0
    max: 9999999
    is_categorical: true
  }
  annotation {
    tag: "cat"
    tag: "categorical"
    extra_metadata {
      type_url: "type.googleapis.com/google.protobuf.Struct"
      value: "\n\021\n\013num_buckets\022\002\010\000\n\033\n\016freq_threshold\022\t\021\000\000\000\000\000\000\000\000\n\025\n\010max_size\022\t\021\000\000\000\000\320\022cA\n\030\n\013start_index\022\t\021\000\000\000\000\000\000\000\000\nP\n\010cat_path\022D\032B/raid/benny/04_criteo/test_dask/stats/categories/unique.C1.parquet\nG\n\017embedding_sizes\0224*2\n\030\n\013cardinality\022\t\021\000\000\000\340\317\022cA\n\026\n\tdimension\022\t\021\000\000\000\000\000\000\200@"
    }
  }
}
feature {
  name: "C2"


0

We can load the schema from the file.

In [16]:
schema = nvt.Schema().load(output_train_dir + "/schema.pbtxt")

We can select a column by name and get its tags and properties.

In [17]:
schema.select_by_name("C1")

[{'name': 'C1', 'tags': ['cat', <Tags.CATEGORICAL: 'categorical'>], 'properties': {'num_buckets': None, 'freq_threshold': 0.0, 'max_size': 10000000.0, 'start_index': 0.0, 'cat_path': '/raid/benny/04_criteo/test_dask/stats/categories/unique.C1.parquet', 'embedding_sizes': {'cardinality': 9999999.0, 'dimension': 512.0}, 'domain': {'min': 0, 'max': 9999999}}, 'dtype': <class 'int'>, '_is_list': False}]

We can select columns by tags. We will select the columns by our defined tags `label`, `cat` and `cont`.

In [18]:
[x.name for x in schema.select_by_tag("label")]

['label']

In [19]:
[x.name for x in schema.select_by_tag("cat")]

['C1',
 'C2',
 'C3',
 'C4',
 'C5',
 'C6',
 'C7',
 'C8',
 'C9',
 'C10',
 'C11',
 'C12',
 'C13',
 'C14',
 'C15',
 'C16',
 'C17',
 'C18',
 'C19',
 'C20',
 'C21',
 'C22',
 'C23',
 'C24',
 'C25',
 'C26']

In [20]:
[x.name for x in schema.select_by_tag("cont")]

['I1',
 'I2',
 'I3',
 'I4',
 'I5',
 'I6',
 'I7',
 'I8',
 'I9',
 'I10',
 'I11',
 'I12',
 'I13']

In [21]:
schema.select_by_name("label")

[{'name': 'label', 'tags': ['label'], 'properties': {}, 'dtype': <class 'int'>, '_is_list': False}]

We added tags to our data schema manually. NVTabular can also infer column types and adds tags to the schema file automatically. For example, the output of `Normalize` is tagged as continuous and the output of `Categorify` is tagged as categorical.<br><br>
We could use the pre-defined NVTabular tags as well. However, we added them manually to demonstrate how to add custom tags.

In [22]:
[x.name for x in schema.select_by_tag(nvt.graph.tags.Tags.CATEGORICAL)]

['C1',
 'C2',
 'C3',
 'C4',
 'C5',
 'C6',
 'C7',
 'C8',
 'C9',
 'C10',
 'C11',
 'C12',
 'C13',
 'C14',
 'C15',
 'C16',
 'C17',
 'C18',
 'C19',
 'C20',
 'C21',
 'C22',
 'C23',
 'C24',
 'C25',
 'C26']

In [23]:
[x.name for x in schema.select_by_tag(nvt.graph.tags.Tags.CONTINUOUS)]

['I1',
 'I2',
 'I3',
 'I4',
 'I5',
 'I6',
 'I7',
 'I8',
 'I9',
 'I10',
 'I11',
 'I12',
 'I13']