In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# TensorFlow: Convert TFRecords to Parquet files

## TFRecords

[TFRecords](https://www.tensorflow.org/tutorials/load_data/tfrecord) are a popular file format for storing data for deep learning training with TensorFlow. The TensorFlow documentation describes it as a "simple format for storing a sequence of binary records". In many cases, the dataset is too large for host memory, so it is converted into one or more tfrecords files on disk. TensorFlow's ecosystem can stream the tfrecords from disk to train the model without loading the full dataset into memory.<br><br>
That sounds great, but there are some disadvantages when working with tabular datasets. TFRecords store each example as key-value pairs. In other domains, such as computer vision, this representation is efficient because the key is `image` and the values are the pixels. For an RGB image with 200x200 resolution, there are 120,000 (200x200x3) values behind a single key. In a tabular dataset, a feature is often a single number, so storing a key with every value in every example adds significant overhead. **In some of our experiments, tfrecords files were ~4-5x larger than `parquet` files for the same dataset.**
<br><br>
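To make the per-key overhead concrete, here is a rough back-of-the-envelope sketch. It is a toy size model, not the actual TFRecord or Parquet encoding: it assumes that each value takes 4 bytes, that the key-value layout stores the UTF-8 key next to every value, and that the columnar layout stores each key only once. The function names and the sample rows are hypothetical.

```python
import struct

VALUE_BYTES = struct.calcsize("<f")  # 4 bytes per value (assumption of this toy model)


def keyed_size(rows):
    """Bytes if every value in every row carries its own key."""
    return sum(len(key.encode("utf-8")) + VALUE_BYTES for row in rows for key in row)


def columnar_size(rows):
    """Bytes if each key is stored once, followed by the packed values."""
    keys = list(rows[0])
    return sum(len(k.encode("utf-8")) for k in keys) + len(rows) * len(keys) * VALUE_BYTES


# Tabular case: 1,000 rows with three single-number features.
rows = [{"cont0": 0.1, "cont1": 0.2, "cont2": 0.3} for _ in range(1000)]
tabular_keyed = keyed_size(rows)        # 1000 * 3 * (5 + 4) bytes
tabular_columnar = columnar_size(rows)  # 15 + 1000 * 3 * 4 bytes

# Image case: one key in front of 200 * 200 * 3 pixel values.
image_values = 200 * 200 * 3
image_key_share = len("image") / (len("image") + image_values * VALUE_BYTES)

print(tabular_keyed, tabular_columnar, round(tabular_keyed / tabular_columnar, 2))
print(f"key share for the image example: {image_key_share:.6%}")
```

In this toy model, the key-value layout is a bit more than twice as large for the tabular rows, while the key is negligible for the image. Real file sizes also depend on protobuf framing and on Parquet encoding and compression, so these numbers are only illustrative.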
[Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) is another file format for storing data. It is a free and open-source data storage format in the Hadoop ecosystem. Many popular systems, such as Spark or Pandas, can read and write parquet files.
<br><br>
We developed [NVTabular Data Loaders](https://nvidia-merlin.github.io/NVTabular/main/training/index.html) as a customized data loader that operates fully on the GPU. It reads the data from disk into GPU memory and prepares the next batch on the GPU, so preparing a batch requires no CPU-GPU communication. Our data loader uses parquet files to reduce the disk pressure. **In our experiments, the native data loader was the bottleneck when training tabular deep learning models; replacing it with the NVTabular Data Loader gave an 8-9x speed-up.**

### Convert TFRecords to Parquet files
That is a lot of background information. In many cases, users have their datasets stored as tfrecords files. In this notebook, we provide an example of converting tfrecords to parquet, so that users can transform their datasets and experiment with the NVTabular data loader.

We use the `pandas-tfrecords` library. We install it without its dependencies (`--no-deps`) because it would otherwise install a specific TensorFlow version.

In [2]:
!pip install --no-deps pandas-tfrecords==0.1.5
!pip install s3fs

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.


## Create a Synthetic Dataset

First, we create a synthetic dataset. Afterwards, we convert the synthetic data to a tfrecords file. The synthetic dataset contains `continuous features`, `categorical features`, `continuous features in a list with variable length`, `categorical features in a list with variable length` and the `label`.<br><br>
List features with variable length are common in session-based recommender systems. For example, a feature can hold the last page views in a session, and sessions have different lengths.

In [4]:
import numpy as np
import pandas as pd

import cudf

In [5]:
def create_synthetic_df(
    N_CONT_FEATURES, N_CAT_FEATURES, N_CONT_LIST_FEATURES, N_CAT_LIST_FEATURES, N_ROWS
):
    dict_features = {}
    for icont in range(N_CONT_FEATURES):
        dict_features["cont" + str(icont)] = np.random.uniform(-1, 1, size=N_ROWS)
    for icat in range(N_CAT_FEATURES):
        dict_features["cat" + str(icat)] = np.random.choice(list(range(10)), size=N_ROWS)
    for icontlist in range(N_CONT_LIST_FEATURES):
        feature_list = []
        for irow in range(N_ROWS):
            n_elements = np.random.choice(list(range(20)))
            feature_list.append(np.random.uniform(-1, 1, size=n_elements).tolist())
        dict_features["cont_list" + str(icontlist)] = feature_list
    for icatlist in range(N_CAT_LIST_FEATURES):
        feature_list = []
        for irow in range(N_ROWS):
            n_elements = np.random.choice(list(range(20)))
            feature_list.append(np.random.choice(list(range(10)), size=n_elements).tolist())
        dict_features["cat_list" + str(icatlist)] = feature_list
    dict_features["label"] = np.random.choice(list(range(2)), size=N_ROWS)
    df = pd.DataFrame(dict_features)
    return df

We can configure the size of the dataset and the number of features of each type. As this is just an example, we use only 20,000 rows.

In [6]:
N_ROWS = 20000
N_CONT_FEATURES = 5
N_CAT_FEATURES = 7
N_CONT_LIST_FEATURES = 2
N_CAT_LIST_FEATURES = 3

In [7]:
df = create_synthetic_df(
    N_CONT_FEATURES, N_CAT_FEATURES, N_CONT_LIST_FEATURES, N_CAT_LIST_FEATURES, N_ROWS
)
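As a quick sanity check on the configuration above, the expected number of columns is the sum of the four feature counts plus one label column. The snippet below is a small sketch that repeats the constants from the notebook so that it can run on its own; `expected_columns` and `expected_shape` are names introduced here for illustration.

```python
N_ROWS = 20000
N_CONT_FEATURES = 5
N_CAT_FEATURES = 7
N_CONT_LIST_FEATURES = 2
N_CAT_LIST_FEATURES = 3

# One column per feature, plus a single "label" column.
expected_columns = (
    N_CONT_FEATURES + N_CAT_FEATURES + N_CONT_LIST_FEATURES + N_CAT_LIST_FEATURES + 1
)
expected_shape = (N_ROWS, expected_columns)
print(expected_shape)
```

With these settings, `df.shape` should equal `expected_shape`, which is `(20000, 18)`.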

We can take a look at the dataset.

In [8]:
df.head()

Unnamed: 0,cont0,cont1,cont2,cont3,cont4,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cont_list0,cont_list1,cat_list0,cat_list1,cat_list2,label
0,-0.346288,-0.092784,0.878876,0.990467,-0.505079,2,2,8,9,0,2,4,"[-0.5329311666886798, -0.7973632802691455, -0....",[-0.7527243533757371],"[7, 5, 1, 9, 5, 6, 5, 7, 1, 6, 0, 7, 8, 1]","[2, 0, 0, 0, 6, 4, 2, 3]","[8, 3, 5, 7, 0, 5, 2, 1, 2, 7, 7]",1
1,-0.336003,-0.665982,0.902071,0.531961,-0.005143,1,2,6,9,3,0,7,"[0.9805303513847896, -0.1364336119532299, 0.39...",[],"[4, 5, 0, 7, 6, 7]","[9, 0, 6, 9, 2, 2]","[1, 2, 0, 6, 2, 4, 9, 4, 3, 3, 7, 4, 1, 5, 7, 9]",0
2,-0.089536,-0.922915,-0.63689,-0.494594,-0.123065,7,9,0,0,2,4,4,"[0.9677775916375682, 0.4868478686143529, 0.010...","[0.9863213170102452, 0.801522837843786, 0.8203...","[4, 5, 3, 5, 2, 5, 3, 4, 1, 8, 0, 4, 5, 3, 0, ...","[6, 2]","[8, 7, 4, 6, 5, 4, 7, 9, 0, 7, 6]",0
3,-0.2604,0.693127,-0.875754,0.456287,0.762904,3,5,3,3,1,7,3,"[-0.2644213019104138, -0.09665251017206655, -0...","[-0.8362007638643811, 0.1541830950440195, 0.79...","[8, 0, 1, 0, 9, 5, 9, 7, 9, 6, 7]","[0, 8, 9, 5, 9, 7, 8]","[7, 0, 7, 2, 0, 0, 8, 3, 5]",0
4,0.980959,-0.982329,0.628736,-0.311694,-0.88094,6,6,0,8,4,2,2,"[-0.34002032148205985, -0.28546136806218714, -...","[0.057850173597639776, 0.8166183641925591, -0....","[4, 8, 9, 9, 7, 9, 2]","[4, 3, 5, 9, 0, 3, 8, 5, 4, 0, 3, 1, 4, 8, 0, ...","[7, 4, 4, 2, 5, 0, 3, 9, 5, 8, 3, 9, 3, 1, 7, ...",0


In [9]:
CONTINUOUS_COLUMNS = ["cont" + str(i) for i in range(N_CONT_FEATURES)]
CATEGORICAL_COLUMNS = ["cat" + str(i) for i in range(N_CAT_FEATURES)]
CONTINUOUS_LIST_COLUMNS = ["cont_list" + str(i) for i in range(N_CONT_LIST_FEATURES)]
CATEGORICAL_LIST_COLUMNS = ["cat_list" + str(i) for i in range(N_CAT_LIST_FEATURES)]
LABEL_COLUMNS = ["label"]

## Convert the Synthetic Dataset into TFRecords

After creating the synthetic dataset, we write it to a tfrecords file.

In [10]:
import tensorflow as tf

In [11]:
import os
import multiprocessing as mp
from itertools import repeat


def transform_tfrecords(
    df,
    PATH,
    CONTINUOUS_COLUMNS,
    CATEGORICAL_COLUMNS,
    CONTINUOUS_LIST_COLUMNS,
    CATEGORICAL_LIST_COLUMNS,
    LABEL_COLUMNS,
):
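    # os.path.dirname returns the full directory only if PATH ends with a
    # path separator (e.g. "/raid/tfrecord-test/").
    write_dir = os.path.dirname(PATH)
    os.makedirs(write_dir, exist_ok=True)
    file_idx = 0
    writer = get_writer(write_dir, file_idx)
    # Column groups, in the order expected by pool_initializer and
    # build_and_serialize_example. The label is stored as an int64 feature
    # together with the categorical columns.
    column_names = [
        CONTINUOUS_COLUMNS,
        CATEGORICAL_COLUMNS + LABEL_COLUMNS,
        CONTINUOUS_LIST_COLUMNS,
        CATEGORICAL_LIST_COLUMNS,
    ]
    # Each worker process receives the column names once via pool_initializer.
    with mp.Pool(8, pool_initializer, column_names) as pool:
        data = []
        for col_names in column_names:
            if len(col_names) == 0:
                # Empty group: yield None for every row so that zip() still works.
                data.append(repeat(None))
            else:
                data.append(df[col_names].values)
        # One tuple per row: (numeric, categorical, numeric list, categorical list).
        data = zip(*data)
        record_map = pool.imap(build_and_serialize_example, data, chunksize=200)
        for record in record_map:
            writer.write(record)
    writer.close()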


def pool_initializer(num_cols, cat_cols, num_list_cols, cat_list_cols):
    global numeric_columns
    global categorical_columns
    global numeric_list_columns
    global categorical_list_columns
    numeric_columns = num_cols
    categorical_columns = cat_cols
    numeric_list_columns = num_list_cols
    categorical_list_columns = cat_list_cols


def build_and_serialize_example(data):
    numeric_values, categorical_values, numeric_list_values, categorical_list_values = data
    feature = {}
    if numeric_values is not None:
        feature.update(
            {
                col: tf.train.Feature(float_list=tf.train.FloatList(value=[val]))
                for col, val in zip(numeric_columns, numeric_values)
            }
        )
    if categorical_values is not None:
        feature.update(
            {
                col: tf.train.Feature(int64_list=tf.train.Int64List(value=[val]))
                for col, val in zip(categorical_columns, categorical_values)
            }
        )
    if numeric_list_values is not None:
        feature.update(
            {
                col: tf.train.Feature(float_list=tf.train.FloatList(value=val))
                for col, val in zip(numeric_list_columns, numeric_list_values)
            }
        )
    if categorical_list_values is not None:
        feature.update(
            {
                col: tf.train.Feature(int64_list=tf.train.Int64List(value=val))
                for col, val in zip(categorical_list_columns, categorical_list_values)
            }
        )
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()


def get_writer(write_dir, file_idx):
    filename = str(file_idx).zfill(5) + ".tfrecords"
    return tf.io.TFRecordWriter(os.path.join(write_dir, filename))
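The code above combines two patterns: the column names are handed to every worker once through a pool initializer, and the per-row input is built by zipping the column groups, with `repeat(None)` standing in for an empty group. The following simplified sketch shows the same flow without TensorFlow. It makes several assumptions for the sake of being runnable anywhere: it uses `multiprocessing.pool.ThreadPool` instead of `mp.Pool`, plain tuples instead of a DataFrame, and it returns dictionaries instead of serialized `tf.train.Example` records. The names `init_worker` and `build_row` are hypothetical.

```python
from itertools import repeat
from multiprocessing.pool import ThreadPool


def init_worker(num_cols, cat_cols):
    # Same idea as pool_initializer: store the column names once per worker.
    global numeric_columns, categorical_columns
    numeric_columns = num_cols
    categorical_columns = cat_cols


def build_row(data):
    # Same idea as build_and_serialize_example, but returns a plain dict.
    numeric_values, categorical_values = data
    feature = {}
    if numeric_values is not None:
        feature.update(zip(numeric_columns, numeric_values))
    if categorical_values is not None:
        feature.update(zip(categorical_columns, categorical_values))
    return feature


# Two numeric columns and an empty categorical group.
column_names = [["cont0", "cont1"], []]
column_values = [[(0.1, 0.2), (0.3, 0.4)], None]

data = []
for names, values in zip(column_names, column_values):
    # An empty group contributes None to every row instead of breaking zip().
    data.append(repeat(None) if len(names) == 0 else values)
rows = zip(*data)  # yields ((0.1, 0.2), None), then ((0.3, 0.4), None)

with ThreadPool(2, init_worker, column_names) as pool:
    records = list(pool.imap(build_row, rows, chunksize=1))

print(records)
```

Because `zip` stops at the shortest input, the infinite `repeat(None)` iterator is safe as long as at least one column group is non-empty. `imap` returns the results in input order, so the records keep the row order of the dataset.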

We define the output path.

In [12]:
PATH = "/raid/tfrecord-test/"

In [13]:
!rm -rf $PATH
!mkdir $PATH
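The shell commands above assume a Unix-like environment. A rough Python equivalent could look like the following sketch. It works on a temporary directory rather than `/raid/tfrecord-test/` so that it can run anywhere. The helper name `reset_dir` is hypothetical, and the file name pattern mirrors `get_writer`.

```python
import os
import shutil
import tempfile


def reset_dir(path):
    """Delete `path` if it exists, then create it again as an empty directory."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)


# Stand-in for PATH. The trailing separator matters because
# transform_tfrecords derives its write directory with os.path.dirname(PATH).
base = tempfile.mkdtemp()
path = os.path.join(base, "tfrecord-test", "")
reset_dir(path)

write_dir = os.path.dirname(path)
first_file = os.path.join(write_dir, str(0).zfill(5) + ".tfrecords")
print(os.path.basename(first_file))
```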

In [14]:
transform_tfrecords(
    df,
    PATH,
    CONTINUOUS_COLUMNS,
    CATEGORICAL_COLUMNS,
    CONTINUOUS_LIST_COLUMNS,
    CATEGORICAL_LIST_COLUMNS,
    LABEL_COLUMNS,
)

We can check the file.

In [15]:
!ls $PATH

00000.tfrecords


## Convert TFRecords to parquet files

Now, we have a dataset in the tfrecords format. Let's use the `convert_tfrecords_to_parquet` function to convert a tfrecord file into parquet.

In [16]:
import glob

from nvtabular.framework_utils.tensorflow.tfrecords_to_parquet import convert_tfrecords_to_parquet

Let's select all TFRecords in the folder.

In [17]:
filenames = glob.glob(PATH + "/*.tfrecords")

Let's call `convert_tfrecords_to_parquet`.<br><br>
Some details about the parameters:
* `compression_type` is the compression type of the tfrecords. Options: `""` (no compression), `"ZLIB"`, or `"GZIP"`
* `chunks` defines how many data points should be saved per `parquet` file, so a tfrecords file can be split into multiple parquet files.
* `convert_lists` defines whether feature lists should be converted into multiple feature columns. Note that even single-value features are read back from tfrecords as 1-dimensional arrays.
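To illustrate what a chunk size means, the following sketch batches a stream of records into groups of at most `chunks` items. This is not the NVTabular implementation, only a generic batching pattern. `iter_chunks` is a hypothetical helper, and how each chunk is then written to parquet is up to the library.

```python
from itertools import islice


def iter_chunks(records, chunks):
    """Yield lists of at most `chunks` consecutive records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, chunks))
        if not batch:
            return
        yield batch


# 20,000 records with chunks=1000, the values used in this notebook.
sizes = [len(batch) for batch in iter_chunks(range(20000), 1000)]
print(len(sizes), sizes[0], sizes[-1])

# A record count that is not a multiple of the chunk size leaves a smaller tail.
tail_sizes = [len(batch) for batch in iter_chunks(range(2500), 1000)]
print(tail_sizes)
```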

In [18]:
filenames

['/raid/tfrecord-test/00000.tfrecords']

In [19]:
convert_tfrecords_to_parquet(
    filenames=filenames, output_dir=PATH, compression_type="", chunks=1000, convert_lists=True
)

2021-09-22 21:56:53.202269: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-22 21:56:54.586055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30681 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:0b:00.0, compute capability: 7.0
2021-09-22 21:56:55.158643: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
20000it [00:12, 1665.20it/s]


## Let's take a look

`convert_tfrecords_to_parquet` writes its output into `output_dir`. Depending on the chunk size, it can create multiple files per `tfrecord`; in this run, the folder contains a single parquet file.

In [20]:
filenames = glob.glob(PATH + "/*.parquet")
filenames

['/raid/tfrecord-test/00000.parquet']

If we load the first file, we can see that it has the same structure as our original synthetic dataset: the same columns, now sorted alphabetically.

In [23]:
df = cudf.read_parquet(filenames[0])
df.head()

Unnamed: 0,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat_list0,cat_list1,cat_list2,cont0,cont1,cont2,cont3,cont4,cont_list0,cont_list1,label
0,2,2,8,9,0,2,4,"[7, 5, 1, 9, 5, 6, 5, 7, 1, 6, 0, 7, 8, 1]","[2, 0, 0, 0, 6, 4, 2, 3]","[8, 3, 5, 7, 0, 5, 2, 1, 2, 7, 7]",-0.346288,-0.092784,0.878876,0.990467,-0.505079,"[-0.53293115, -0.7973633, -0.047344275, -0.132...",[-0.75272435],1
1,1,2,6,9,3,0,7,"[4, 5, 0, 7, 6, 7]","[9, 0, 6, 9, 2, 2]","[1, 2, 0, 6, 2, 4, 9, 4, 3, 3, 7, 4, 1, 5, 7, 9]",-0.336003,-0.665982,0.902071,0.531961,-0.005143,"[0.9805303, -0.13643362, 0.39948544, 0.7434469...",[],0
2,7,9,0,0,2,4,4,"[4, 5, 3, 5, 2, 5, 3, 4, 1, 8, 0, 4, 5, 3, 0, ...","[6, 2]","[8, 7, 4, 6, 5, 4, 7, 9, 0, 7, 6]",-0.089536,-0.922915,-0.63689,-0.494594,-0.123065,"[0.9677776, 0.48684788, 0.010608715]","[0.98632133, 0.80152285, 0.820345, 0.015393688...",0
3,3,5,3,3,1,7,3,"[8, 0, 1, 0, 9, 5, 9, 7, 9, 6, 7]","[0, 8, 9, 5, 9, 7, 8]","[7, 0, 7, 2, 0, 0, 8, 3, 5]",-0.2604,0.693127,-0.875754,0.456287,0.762904,"[-0.2644213, -0.09665251, -0.92680424, 0.30409...","[-0.8362008, 0.15418309, 0.799706, 0.4666645, ...",0
4,6,6,0,8,4,2,2,"[4, 8, 9, 9, 7, 9, 2]","[4, 3, 5, 9, 0, 3, 8, 5, 4, 0, 3, 1, 4, 8, 0, ...","[7, 4, 4, 2, 5, 0, 3, 9, 5, 8, 3, 9, 3, 1, 7, ...",0.980959,-0.982329,0.628736,-0.311694,-0.88094,"[-0.34002033, -0.28546137, -0.2595898, -0.5337...","[0.057850175, 0.8166184, -0.3719872, -0.703909...",0
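The list values in `cont_list0` and `cont_list1` show fewer digits than in the original DataFrame. `tf.train.FloatList` stores 32-bit floats, so the 64-bit values generated by NumPy are rounded when they are written to tfrecords. The sketch below reproduces that rounding with the standard library only. `to_float32` is a hypothetical helper, and the sample value is the first element of `cont_list0` in row 0 of the original dataset.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


original = -0.5329311666886798  # first element of cont_list0 in row 0
stored = to_float32(original)

print(original, stored, abs(original - stored))
```

The difference is below 1e-7, which is usually negligible for training, but it means that the round trip through tfrecords is not bit-exact.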


In [24]:
df.shape

(20000, 18)