# Data Preprocessing at Scale with NVTabular and Vertex AI

This notebook demonstrates how to preprocess data using NVIDIA NVTabular and Vertex AI. The notebook covers the following:  
1. NVTabular Overview.  
2. Preprocessing Criteo Dataset.  
3. Preprocessing Pipeline on Vertex AI  
3.1. CSV preprocessing pipeline execution.  
3.2. BigQuery preprocessing pipeline execution.  


## 1. NVTabular Overview

[Merlin NVTabular](https://developer.nvidia.com/nvidia-merlin/nvtabular) is a feature engineering and preprocessing library designed to manipulate large datasets efficiently and to significantly reduce data preparation time. NVTabular:

* Processes large datasets not bound by CPU or GPU memory.
* Accelerates data preprocessing computation on GPUs using the RAPIDS cuDF library.
* Supports multi-GPU and multi-node scaling with Dask-CUDA distributed parallelism.
* Supports tabular data formats, including comma-separated values (CSV) files, Apache Parquet, Apache ORC, and Apache Avro.
* Provides data loaders that are optimized for TensorFlow, PyTorch, and Merlin HugeCTR.
* Includes multi-hot categoricals and vector continuous passing support to ease feature engineering.


To preprocess the data, we need to define a transformation `workflow`.  
Each transformation step in the transformation pipeline executes multiple calculations, called `ops`. 
NVTabular provides a [set of ops](https://nvidia.github.io/NVTabular/main/api/ops/index.html), which include:

 - Filtering outliers or missing values, or creating new features indicating that a value is missing;
 - Imputing and filling in missing data;
 - Discretization or bucketing of continuous features;
 - Creating features by splitting or combining existing features, for example, breaking down a date column into day-of-week, month-of-year, day-of-month features;
 - Normalizing numerical features to have zero mean and unit variance, or applying other transformations such as a log transform;
 - Encoding discrete features using one-hot vectors or converting them to continuous integer indices.  

NVTabular processes a dataset, given a pre-defined workflow, in two steps:

1. The `fit` step, where NVTabular computes the statistics required for transforming the data. This step requires at most `N` passes through the data, where `N` is the number of chained operations in the workflow.
2. The `apply` step, where NVTabular uses the fitted workflow to process the data. 

NVTabular is designed to minimize the number of passes through the data by using a lazy execution strategy: data operations are not executed until an explicit apply phase. Deferring execution lets NVTabular optimize workflows that require iterating over the entire dataset.
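
The two-step contract can be illustrated with a minimal, plain-Python sketch. This is not NVTabular's implementation: the `NormalizeSketch` class and the chunked input are hypothetical stand-ins. They only show why a statistics-based op can collect what it needs in one streaming pass during `fit`, and why the later apply step can then process each chunk independently.

```python
import math
from typing import Iterable, List


class NormalizeSketch:
    """Toy stand-in for a statistics-based op (hypothetical, not an NVTabular class)."""

    def __init__(self) -> None:
        self.mean = 0.0
        self.std = 1.0

    def fit(self, chunks: Iterable[List[float]]) -> "NormalizeSketch":
        # One streaming pass: accumulate count, sum, and sum of squares.
        count, total, total_sq = 0, 0.0, 0.0
        for chunk in chunks:
            count += len(chunk)
            total += sum(chunk)
            total_sq += sum(x * x for x in chunk)
        self.mean = total / count
        variance = total_sq / count - self.mean ** 2
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.std = math.sqrt(max(variance, 0.0)) or 1.0
        return self

    def apply(self, chunk: List[float]) -> List[float]:
        # Uses only the fitted statistics, so chunks can be processed independently.
        return [(x - self.mean) / self.std for x in chunk]


chunks = [[1.0, 2.0], [3.0, 4.0]]
op = NormalizeSketch().fit(chunks)
normalized = [op.apply(chunk) for chunk in chunks]
```

Because `apply` depends only on `mean` and `std`, the fitted state can be saved after `fit` and reused later on other data splits.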



## 2. Preprocessing Criteo dataset

The Criteo dataset contains over four billion samples spanning 24 CSV files. Each record contains 40 columns: 13 numerical columns, 26 categorical columns, and 1 binary target column. See [00-dataset-management.ipynb](00-dataset-management.ipynb) for more details.


### NVTabular preprocessing Workflow for Criteo dataset

In this example, the preprocessing `nvt.Workflow` consists of the following operations:
 - [Categorify](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/categorify.html): applied to categorical columns (column names that start with C). 
 - [FillMissing](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/fillmissing.html): applied to continuous columns (column names that start with I).
 - [Clip](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/clip.html):  applied to continuous columns after FillMissing.
 - [Normalize](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/normalize.html): applied to continuous columns after Clip.
 

![](images/dag_preprocessing.png)
 
The `nvt.Workflow` is created by the `create_criteo_nvt_workflow` method in the [src/preprocessing/etl.py](src/preprocessing/etl.py) module.
This `nvt.Workflow` defines which statistics need to be calculated and how the data transformation is executed.
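
As a rough sketch, the operation graph described above could be expressed as follows. The actual `create_criteo_nvt_workflow` in the repository may differ. The column numbering (`C1`..`C26`, `I1`..`I13`), the `label` column name, and the `Clip` lower bound of `0` are assumptions made for illustration.

```python
# Column names follow the convention described above: C* categorical, I* continuous.
CATEGORICAL_COLUMNS = [f"C{i}" for i in range(1, 27)]  # 26 categorical columns (assumed numbering)
CONTINUOUS_COLUMNS = [f"I{i}" for i in range(1, 14)]   # 13 continuous columns (assumed numbering)
LABEL_COLUMNS = ["label"]                              # assumed name of the binary target column


def build_criteo_workflow_sketch():
    """Build an nvt.Workflow with the op graph described above (illustrative only)."""
    # Imported lazily so this snippet can be loaded on a machine without NVTabular.
    import nvtabular as nvt
    from nvtabular import ops

    # Categorical columns are encoded as contiguous integer ids.
    cat_features = CATEGORICAL_COLUMNS >> ops.Categorify()

    # Continuous columns are chained: FillMissing, then Clip, then Normalize.
    cont_features = (
        CONTINUOUS_COLUMNS
        >> ops.FillMissing()
        >> ops.Clip(min_value=0)  # assumed bound; the text above does not state it
        >> ops.Normalize()
    )

    # The label column is passed through unchanged.
    return nvt.Workflow(cat_features + cont_features + LABEL_COLUMNS)
```

Calling `build_criteo_workflow_sketch()` requires an environment with NVTabular installed, such as the container image used by the pipeline.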
 

### Implementing the preprocessing pipelines using KFP

[src/pipelines/preprocessing_pipelines.py](https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai/blob/main/src/pipelines/preprocessing_pipelines.py) defines the KFP pipelines to preprocess the Criteo data. 
The `preprocessing_csv` pipeline processes the CSV data files in Cloud Storage, while the `preprocessing_bq` pipeline processes the data stored in BigQuery.

A pipeline component is a self-contained set of code that performs one step in your ML workflow. The pipeline uses the following components defined in [src/pipelines/components.py](src/pipelines/components.py):

1. `convert_csv_to_parquet_op`: this component converts raw CSV files to Parquet files and stores them in Cloud Storage.
2. `analyze_dataset_op`: this component creates a Criteo preprocessing `nvt.Workflow`, fits it to the training data split, and stores it in Cloud Storage.
3. `transform_dataset_op`: this component loads the fitted `nvt.Workflow` from Cloud Storage, uses it to transform an input data split, and stores the transformed data as Parquet files in Cloud Storage.

Each component is annotated with Inputs and Outputs to keep track of lineage metadata.
The `base_image` used to execute the components is defined in [Dockerfile.nvtabular](Dockerfile.nvtabular). 

Each step in the pipeline is configured with the required CPU, memory and GPU configurations, as follows:

```
component_being_executed.set_cpu_limit("8") # Number of CPUs
component_being_executed.set_memory_limit("32G") # Memory quantity
component_being_executed.set_gpu_limit("1") # Number of GPUs
component_being_executed.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-t4') # GPU type
```

See [Specify machine type for a pipeline step](https://cloud.google.com/vertex-ai/docs/pipelines/build-pipeline#specify-machine-type) for more information.


You can configure the pipeline by setting the variables in the [config.py](config.py) module.


## Setup

In [1]:
import os
import json
from datetime import datetime
from google.cloud import aiplatform as vertex_ai
from kfp.v2 import compiler

In [2]:
PROJECT_ID = 'renatoleite-mldemos' # Change to your project Id.
REGION = 'us-central1' # Change to your region.
DATASET_GCS_LOCATION = 'gs://workshop-datasets/criteo'
BUCKET =  'merlin-on-gcp' # Change to your bucket.

VERSION = 'v01'
MODEL_DISPLAY_NAME = f'criteo-merlin-recommender-{VERSION}'
WORKSPACE = f'gs://{BUCKET}/{MODEL_DISPLAY_NAME}'

IMAGE_NAME = 'nvt_preprocessing'
IMAGE_URI = f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'
DOCKERNAME = 'nvtabular'

PREPROCESS_CSV_PIPELINE_NAME = 'nvt-csv-pipeline'
PREPROCESS_CSV_PIPELINE_ROOT = os.path.join(WORKSPACE, PREPROCESS_CSV_PIPELINE_NAME)

PREPROCESS_BQ_PIPELINE_NAME =  'nvt-bq-pipeline'
PREPROCESS_BQ_PIPELINE_ROOT = os.path.join(WORKSPACE, PREPROCESS_BQ_PIPELINE_NAME)

### Set pipeline configurations

In [3]:
os.environ['PROJECT_ID'] = PROJECT_ID
os.environ['REGION'] = REGION
os.environ['BUCKET'] = BUCKET
os.environ['WORKSPACE'] = WORKSPACE

os.environ['NVT_IMAGE_URI'] = IMAGE_URI
os.environ['PREPROCESS_CSV_PIPELINE_NAME'] = PREPROCESS_CSV_PIPELINE_NAME
os.environ['PREPROCESS_CSV_PIPELINE_ROOT'] = PREPROCESS_CSV_PIPELINE_ROOT
os.environ['DOCKERNAME'] = DOCKERNAME

os.environ['MEMORY_LIMIT'] = '120G'
os.environ['CPU_LIMIT'] = '32'
os.environ['GPU_LIMIT'] = '4'
os.environ['GPU_TYPE'] = 'nvidia-tesla-t4'
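
The environment variables above are presumably picked up by [config.py](config.py). A minimal sketch of how such a module might read the resource limits is shown below. The `load_resource_limits` function and the default values are hypothetical and not taken from the repository.

```python
import os
from typing import Mapping


def load_resource_limits(env: Mapping[str, str] = os.environ) -> dict:
    """Read resource limits from environment variables (hypothetical helper)."""
    return {
        # Defaults mirror the values set in this notebook; they are assumptions.
        "memory_limit": env.get("MEMORY_LIMIT", "120G"),
        "cpu_limit": env.get("CPU_LIMIT", "32"),
        "gpu_limit": env.get("GPU_LIMIT", "4"),
        "gpu_type": env.get("GPU_TYPE", "nvidia-tesla-t4"),
    }


# Passing an explicit mapping makes the behavior easy to check without touching os.environ.
limits = load_resource_limits({"GPU_LIMIT": "2"})
```

If the configuration module reads these values at import time (an assumption), the variables need to be set before the pipeline module is imported, which is why this cell runs before the pipeline is compiled.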

### Initialize Vertex SDK client

In [4]:
vertex_ai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=os.path.join(WORKSPACE, 'stg') 
)

### Build Container Docker Image

The following command builds the Docker container image for the NVTabular preprocessing steps of the pipeline and pushes it to the [Google Container Registry](https://cloud.google.com/container-registr).

Note that building the Docker container image takes around 8 minutes.

In [3]:
FILE_LOCATION = './src'
! gcloud builds submit --config src/cloudbuild.yaml --substitutions _DOCKERNAME=$DOCKERNAME,_IMAGE_URI=$IMAGE_URI,_FILE_LOCATION=$FILE_LOCATION

Creating temporary tarball archive of 57 file(s) totalling 5.0 MiB before compression.
Uploading tarball of [.] to [gs://renatoleite-mldemos_cloudbuild/source/1637243053.593088-f43cafc781a34451af3f21646da82a65.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/renatoleite-mldemos/locations/global/builds/169bc0a1-7f78-48ec-9ae9-243b02c7bbee].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/169bc0a1-7f78-48ec-9ae9-243b02c7bbee?project=464015718044].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "169bc0a1-7f78-48ec-9ae9-243b02c7bbee"

FETCHSOURCE
Fetching storage object: gs://renatoleite-mldemos_cloudbuild/source/1637243053.593088-f43cafc781a34451af3f21646da82a65.tgz#1637243054327124
Copying gs://renatoleite-mldemos_cloudbuild/source/1637243053.593088-f43cafc781a34451af3f21646da82a65.tgz#1637243054327124...

Operation completed over 1 objects/4.1 MiB.
BUILD
Already have image (with digest): gcr.io/cloud-

## 3.1. CSV Preprocessing Pipeline Execution

The CSV Criteo data preprocessing pipeline performs the following steps:

 1. Read CSV files from Cloud Storage.
 2. Convert the CSV files to Parquet format and write them to Cloud Storage.
 3. Fit a pre-defined NVTabular workflow to the training data split to calculate transformation statistics.
 4. Transform the training and validation data splits using the fitted workflow.
 5. Output the transformed Parquet files to Cloud Storage.


<img src="./images/preprocessing_pipeline_csv.png" alt="Pipeline"/>

### Converting CSV files to Parquet with NVTabular

The Criteo dataset is provided in CSV format, but the recommended format for running the NVTabular preprocessing task with the best performance is [Parquet](http://parquet.apache.org/documentation/latest/), a compressed, column-oriented file format. While NVTabular also supports reading from CSV files, reading Parquet files can be 2x faster than reading CSV files.

To convert the Criteo CSV data to Parquet, the following steps are performed:

1. Create an `nvt.Dataset` object from the CSV data using the `create_csv_dataset` method in [src/preprocessing/etl.py](src/preprocessing/etl.py).
2. Convert the CSV data to Parquet and write it to Cloud Storage using the `convert_csv_to_parquet` method in [src/preprocessing/etl.py](src/preprocessing/etl.py).

The pipeline uses the `convert_csv_to_parquet_op` component, which is implemented in [src/pipelines/components.py](src/pipelines/components.py).
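
The two steps above can be sketched roughly as follows. This is an illustrative approximation, not the repository's `create_csv_dataset` or `convert_csv_to_parquet` code. The column names and their order (label first, then 13 continuous, then 26 categorical columns) are assumptions, and the function names are hypothetical.

```python
from typing import List


def criteo_column_names() -> List[str]:
    """Column names for the header-less Criteo files (assumed naming and order)."""
    continuous = [f"I{i}" for i in range(1, 14)]
    categorical = [f"C{i}" for i in range(1, 27)]
    return ["label"] + continuous + categorical


def convert_csv_to_parquet_sketch(csv_paths: List[str], output_path: str, sep: str = "\t") -> None:
    """Read CSV files into an nvt.Dataset and write them back out as Parquet."""
    # Imported lazily so this snippet can be loaded on a machine without NVTabular.
    import nvtabular as nvt

    # Step 1: create a dataset over the CSV files; extra keyword arguments are
    # forwarded to the CSV reader.
    dataset = nvt.Dataset(csv_paths, engine="csv", sep=sep, names=criteo_column_names())

    # Step 2: write the data as Parquet files to the output location.
    dataset.to_parquet(output_path)
```

A call such as `convert_csv_to_parquet_sketch(['gs://workshop-datasets/criteo/day_1'], output_path)` would need to run in an environment with NVTabular installed and access to the bucket.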

### Create pipeline parameters

NVTabular provides an option to shuffle the dataset before writing it to disk.  
The uniformly shuffled dataset enables the data loader to read in contiguous chunks of data that are already randomized across the entire dataset.
NVTabular provides the option to control the number of chunks that are combined into a batch, allowing the end user flexibility when trading off between performance and true randomization.  
This mechanism is critical when dealing with datasets that exceed CPU memory and per-epoch shuffling is desired during training.  
Full shuffling of such a dataset can exceed training time for the epoch by several orders of magnitude.
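
The trade-off between performance and randomization can be illustrated with a small plain-Python sketch. It is not NVTabular's shuffling or data loader code. The `chunked_shuffle` function and the `chunks_per_batch` parameter are hypothetical names for the idea of combining a few contiguous chunks and shuffling only within that buffer.

```python
import random
from typing import List, Sequence


def chunked_shuffle(chunks: Sequence[List[int]], chunks_per_batch: int, seed: int = 0) -> List[int]:
    """Approximate shuffle: shuffle the chunk order, then shuffle rows within groups of chunks."""
    rng = random.Random(seed)
    order = list(range(len(chunks)))
    rng.shuffle(order)  # randomize which chunks are read together

    output: List[int] = []
    for start in range(0, len(order), chunks_per_batch):
        # Only `chunks_per_batch` chunks are held in memory at a time.
        buffer = [row for idx in order[start:start + chunks_per_batch] for row in chunks[idx]]
        rng.shuffle(buffer)  # randomize rows within the in-memory buffer
        output.extend(buffer)
    return output


# Four contiguous chunks of four rows each.
chunks = [list(range(i, i + 4)) for i in range(0, 16, 4)]
shuffled = chunked_shuffle(chunks, chunks_per_batch=2)
```

A larger `chunks_per_batch` moves the result closer to a full shuffle at the cost of holding more rows in memory. A value of `1` only permutes rows within each chunk.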

In [5]:
TRAIN_PATHS = ['gs://workshop-datasets/criteo/day_1'] # Training CSV file to be preprocessed.
VALID_PATHS = ['gs://workshop-datasets/criteo/day_0'] # Validation CSV file to be preprocessed.
sep = '\t' # Separator for the CSV file.
num_output_files_train = 4
num_output_files_valid = 4

In [6]:
csv_parameter_values = {
    'train_paths': json.dumps(TRAIN_PATHS),
    'valid_paths': json.dumps(VALID_PATHS),
    'sep': sep,
    'shuffle': json.dumps(None), # select PER_PARTITION, PER_WORKER, FULL, or None.
    'num_output_files_train': num_output_files_train,
    'num_output_files_valid': num_output_files_valid
}
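
The list and `None` values above are passed as JSON strings. A small sketch of the round trip is shown below. It assumes that the pipeline components decode these parameters with `json.loads`, which is not confirmed here.

```python
import json

# Encode non-primitive values as JSON strings, as in the cell above.
encoded = {
    "train_paths": json.dumps(["gs://workshop-datasets/criteo/day_1"]),
    "shuffle": json.dumps(None),
}

# Assumed decoding step on the component side.
decoded = {name: json.loads(value) for name, value in encoded.items()}
```

Here `encoded["shuffle"]` is the string `'null'`, which decodes back to `None`.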

### Compile KFP pipeline

In [7]:
from src.pipelines.preprocessing_pipelines import preprocessing_csv

csv_compiled_pipeline_path = f'{PREPROCESS_CSV_PIPELINE_NAME}.json'
compiler.Compiler().compile(
       pipeline_func=preprocessing_csv,
       package_path=csv_compiled_pipeline_path
)

### Submit job to Vertex AI Pipelines

In [None]:
job_name = f'{datetime.now().strftime("%Y%m%d%H%M%S")}_{PREPROCESS_CSV_PIPELINE_NAME}'

pipeline_job = vertex_ai.PipelineJob(
    display_name=job_name,
    template_path=csv_compiled_pipeline_path,
    enable_caching=False,
    parameter_values=csv_parameter_values,
)

pipeline_job.submit()

## 3.2. BigQuery Preprocessing Pipeline Execution

The BigQuery Criteo data preprocessing pipeline performs the following steps:

 1. Read the data from BigQuery tables.
 2. Export the BigQuery data to Cloud Storage as Parquet files.
 3. Fit a pre-defined NVTabular workflow to the training data split to calculate transformation statistics.
 4. Transform the training and validation data splits using the fitted workflow.
 5. Output the transformed Parquet files to Cloud Storage.


<img src="./images/preprocessing_pipeline_bq.png" alt="Pipeline"/>

### Exporting BigQuery data to Cloud Storage as Parquet files

To preprocess the BigQuery data with NVTabular, the data must first be exported to Cloud Storage as Parquet files.
The `extract_table_from_bq` method in the [src/preprocessing/etl.py](src/preprocessing/etl.py) module exports the data from a BigQuery table to Cloud Storage as Parquet files. It uses the [extract_table](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.extract_table) API in the [BigQuery Python SDK](https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-install-python). The pipeline step is defined in the `export_parquet_from_bq_op` component in [src/pipelines/components.py](src/pipelines/components.py).
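
A rough sketch of such an export with the BigQuery Python client is shown below. It is not the repository's `extract_table_from_bq` implementation. The function names, the argument list, and the output file naming pattern are assumptions for illustration.

```python
def parquet_destination_uri(output_dir: str, table_name: str) -> str:
    """Build a wildcard URI so BigQuery can shard a large table into several files."""
    return f"{output_dir.rstrip('/')}/{table_name}-*.parquet"


def extract_table_sketch(project: str, dataset_name: str, table_name: str,
                         output_dir: str, location: str) -> None:
    """Export a BigQuery table to Cloud Storage as Parquet files (illustrative only)."""
    # Imported lazily so this snippet can be loaded without the BigQuery client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    table_ref = bigquery.DatasetReference(project, dataset_name).table(table_name)

    # Request Parquet output instead of the default CSV format.
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.PARQUET
    )

    extract_job = client.extract_table(
        table_ref,
        parquet_destination_uri(output_dir, table_name),
        location=location,  # must match the location of the BigQuery dataset
        job_config=job_config,
    )
    extract_job.result()  # block until the export job finishes
```

With the settings used in this notebook, the call might look like `extract_table_sketch(PROJECT_ID, 'criteo_pipeline', 'train', output_dir, 'us')`, where `output_dir` is a Cloud Storage path of your choice.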

### Set pipeline configurations

In [None]:
BQ_DATASET_NAME = 'criteo_pipeline' # Set to your BigQuery dataset including the Criteo dataset.
BQ_LOCATION = 'us' # Set to your BigQuery dataset location.
BQ_TRAIN_TABLE_NAME = 'train'
BQ_VALID_TABLE_NAME = 'valid'

os.environ['PREPROCESS_BQ_PIPELINE_NAME'] = PREPROCESS_BQ_PIPELINE_NAME
os.environ['PREPROCESS_BQ_PIPELINE_ROOT'] = PREPROCESS_BQ_PIPELINE_ROOT

os.environ['BQ_DATASET_NAME'] = BQ_DATASET_NAME
os.environ['BQ_LOCATION'] = BQ_LOCATION
os.environ['BQ_TRAIN_TABLE_NAME'] = BQ_TRAIN_TABLE_NAME
os.environ['BQ_VALID_TABLE_NAME'] = BQ_VALID_TABLE_NAME

### Create pipeline parameters

In [None]:
bq_parameter_values = {
    'shuffle': json.dumps(None) # select PER_PARTITION, PER_WORKER, FULL, or None.
}

### Compile KFP pipeline

In [None]:
from src.pipelines.preprocessing_pipelines import preprocessing_bq

bq_compiled_pipeline_path = f'{PREPROCESS_BQ_PIPELINE_NAME}.json'
compiler.Compiler().compile(
       pipeline_func=preprocessing_bq,
       package_path=bq_compiled_pipeline_path
)

### Submit job to Vertex AI Pipelines

In [None]:
job_name = f'{datetime.now().strftime("%Y%m%d%H%M%S")}_{PREPROCESS_BQ_PIPELINE_NAME}'

pipeline_job = vertex_ai.PipelineJob(
    display_name=job_name,
    template_path=bq_compiled_pipeline_path,
    enable_caching=False,
    parameter_values=bq_parameter_values,
)

pipeline_job.submit()

## Next Steps
After completing this notebook, you can proceed to the [02-model-training-hugectr.ipynb](02-model-training-hugectr.ipynb) notebook, which demonstrates how to train a DeepFM model using NVIDIA HugeCTR and Vertex AI.