# Training Tabular Deep Learning Models with Keras on GPU
Deep learning has revolutionized the fields of computer vision (CV) and natural language processing (NLP) in the last few years, providing a fast and general framework for solving a host of difficult problems with unprecedented accuracy. Part and parcel of this revolution has been the development of APIs like [Keras](https://www.tensorflow.org/api_docs/python/tf/keras) for NVIDIA GPUs, allowing practitioners to quickly iterate on new and interesting ideas and receive feedback on their efficacy in shorter and shorter intervals.

One class of problem that has remained largely immune to this revolution, however, is the class involving tabular data. Part of the difficulty is that, unlike CV or NLP, where different datasets are governed by similar phenomena and can therefore be solved with similar mechanisms, "tabular datasets" span a vast array of phenomena, semantic meanings, and problem statements, from product and video recommendation to particle discovery and loan default prediction. This diversity makes universally useful components difficult to find or even define, and is only exacerbated by the notorious lack of standard, industrial-scale benchmark datasets in the tabular space. Accordingly, deep learning models are frequently bested by their classical machine learning counterparts on these important tasks.

Yet this diversity is also what makes tools like Keras all the more valuable. Architecture components can be quickly swapped in and out for different tasks like the implementation details they are, and new components can be built and tested with ease. Importantly, domain experts can interact with models at a high level and build their expertise into the model design without having to spend their time becoming Python programming wizards. However, the other key ingredient, fast feedback, which GPU acceleration enables and which translates into cost efficiency for production pipelines, is lacking in most out-of-the-box APIs. In this example, we will walk through some recent advancements made by NVIDIA's [NVTabular](https://github.com/nvidia/nvtabular) library that can alleviate existing bottlenecks and bring to bear the full power of GPU acceleration.

In [1]:
# TODO: include parquet -> tfrecord conversion

In [2]:
! mkdir -p logs/native logs/accelerated
%load_ext tensorboard
%tensorboard --logdir /home/docker/tensorflow/logs --host 0.0.0.0

Reusing TensorBoard on port 6006 (pid 3105), started 0:17:49 ago. (Use '!kill 3105' to kill it.)

In [3]:
# TODO: use these and fix other paths
DATA_DIR = "/data"
LOG_DIR = "logs/"
ACCELERATED = False

# TODO: reimplement the preproc from criteo-example here?
# Alternatively, make criteo its own folder, and split preproc
# and training into separate notebooks, then execute the
# preproc notebook from here?
NUMERIC_FEATURE_NAMES = [f"I{i}" for i in range(1, 14)]
CATEGORICAL_FEATURE_NAMES = [f"C{i}" for i in range(1, 27)]
CATEGORY_COUNTS = [
    7599500, 33521, 17022, 7339, 20046, 3, 7068, 1377, 63, 5345303,
    561810, 242827, 11, 2209, 10616, 100, 4, 968, 14, 7838519,
    2580502, 6878028, 298771, 11951, 97, 35
]
LABEL_NAME = "label"

# optimization params
BATCH_SIZE = 65536
STEPS = 1000
LEARNING_RATE = 0.001

# architecture params
EMBEDDING_DIM = 8
TOP_MLP_HIDDEN_DIMS = [1024, 512, 256]
BOTTOM_MLP_HIDDEN_DIMS = [1024, 1024, 512, 256]

In [4]:
import tensorflow as tf
import layers
import os
from functools import reduce
from itertools import filterfalse
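# fraction of GPU memory TensorFlow may claim; set before importing nvtabular
# so that the rest of the device memory is left for the data loader
os.environ["TF_MEMORY_ALLOCATION"] = "0.5"

from nvtabular.tf_dataloader import KerasSequenceDataset  # GPU data loading
from nvtabular import Workflow  # online preprocessing
from nvtabular.ops import HashBucket  # hashes categorical values into buckets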

Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so.

For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice/.

For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')


In [5]:
def get_columns(accelerated=False):
    numeric_columns = [tf.feature_column.numeric_column(name, (1,)) for name in NUMERIC_FEATURE_NAMES]
    categorical_columns = []
    for feature_name, count in zip(CATEGORICAL_FEATURE_NAMES, CATEGORY_COUNTS):
        # use fewer buckets than distinct categories to shrink the embedding tables
        num_buckets = int(0.75 * count)

        # when using nvtabular workflow, hashing gets done before tf, so
        # we can just use an identity column on the number of hash buckets
        if not accelerated:
            column = tf.feature_column.categorical_column_with_hash_bucket(feature_name, num_buckets, dtype=tf.int64)
        else:
            column = tf.feature_column.categorical_column_with_identity(feature_name, num_buckets)
        categorical_columns.append(column)
    return numeric_columns, categorical_columns

def get_dtype(column):
    return getattr(column, 'dtype', tf.int64)
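
The comments in `get_columns` describe hashing each categorical value into one of `num_buckets` slots, either inside TensorFlow or beforehand in the NVTabular workflow. As a rough illustration of the idea only, the sketch below does the same thing in plain Python. It uses `zlib.crc32` purely as a stand-in hash (it is not the hash function that TensorFlow or NVTabular actually use), and the raw ids are made up for the example.

```python
import zlib


def hash_bucket(value, num_buckets):
    """Map a raw categorical value to an integer id in [0, num_buckets)."""
    # crc32 is only a stand-in hash chosen because it is deterministic
    # across runs; the real libraries use their own hash functions.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


cardinality = 33521                    # second entry of CATEGORY_COUNTS
num_buckets = int(0.75 * cardinality)  # same rule as in get_columns

raw_ids = [17, 4242, 33520]            # hypothetical raw category ids
bucket_ids = [hash_bucket(v, num_buckets) for v in raw_ids]
print(num_buckets, bucket_ids)
```

Because `num_buckets` is smaller than the number of distinct categories, some categories necessarily share a bucket (and therefore an embedding row). That is the trade-off made in exchange for smaller embedding tables.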

In [6]:
def make_dataset(columns, accelerated=False):
    # make a tfrecord features dataset
    if not accelerated:
        # feature spec tells us how to parse tfrecords
        # using FixedLenFeatures keeps from using sparse machinery,
        # but obviously wouldn't extend to multi-hot categoricals
        feature_spec = {column.name: tf.io.FixedLenFeature((1,), get_dtype(column)) for column in columns}
        feature_spec[LABEL_NAME] = tf.io.FixedLenFeature((1,), tf.int64)

        return tf.data.experimental.make_batched_features_dataset(
            "/data/tfrecords/train/*.tfrecords",
            BATCH_SIZE,
            feature_spec,
            label_key=LABEL_NAME,
            num_epochs=1,
            shuffle=True,
            shuffle_buffer_size=4*BATCH_SIZE,
            )

    # make an nvtabular KerasSequenceDataset and add
    # a hash bucketing workflow for online preproc
    else:
        dataset = KerasSequenceDataset(
            "/data/stable/train/*.parquet",
            columns,
            batch_size=BATCH_SIZE,
            label_name=LABEL_NAME,
            shuffle=True,
            buffer_size=0.2,
            dataset_size=STEPS*BATCH_SIZE
          )
        workflow = Workflow(
            cat_names=CATEGORICAL_FEATURE_NAMES,
            cont_names=NUMERIC_FEATURE_NAMES,
            label_name=[LABEL_NAME]
        )
        num_buckets = {
            col.name: col.num_buckets for col in columns
                if col.name in CATEGORICAL_FEATURE_NAMES
        }
        workflow.add_cat_preprocess(HashBucket(num_buckets))
        workflow.finalize()
        dataset.map(workflow)
        return dataset

In [8]:
def embed_inputs(inputs, numeric_columns, embedding_columns, accelerated=False):
    get_name = lambda x: x.name.split(":")[0]
    categorical_inputs = {get_name(x): x for x in inputs if get_name(x) in CATEGORICAL_FEATURE_NAMES}
    numeric_inputs = {get_name(x): x for x in inputs if get_name(x) in NUMERIC_FEATURE_NAMES}

    # use vanilla Keras DenseFeatures layer
    if not accelerated:
        fm_x = tf.keras.layers.DenseFeatures(embedding_columns)(categorical_inputs)
        fm_x = tf.keras.layers.Reshape((len(embedding_columns), EMBEDDING_DIM))(fm_x)
        dense_x = tf.keras.layers.DenseFeatures(numeric_columns)(numeric_inputs)
    # don't need to do feature transformation, so we can use custom,
    # better optimized embedding layer
    else:
        fm_x = layers.ScalarDenseFeatures(embedding_columns, aggregation='stack')(categorical_inputs)
        dense_x = layers.ScalarDenseFeatures(numeric_columns, aggregation='concat')(numeric_inputs)
    return fm_x, dense_x

In [9]:
def get_architecture():
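    # the architecture is identical for the native and accelerated paths;
    # numeric_columns and categorical_columns are read from notebook globals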
    dense_input = tf.keras.Input(name="dense_x", shape=(len(numeric_columns),), dtype=tf.float32)
    fm_input = tf.keras.Input(name="fm_x", shape=(len(categorical_columns), EMBEDDING_DIM), dtype=tf.float32)

    dense_x = dense_input
    for dim in TOP_MLP_HIDDEN_DIMS:
        dense_x = tf.keras.layers.Dense(dim, activation="relu")(dense_x)
    dense_x = tf.keras.layers.Dense(EMBEDDING_DIM, activation="linear")(dense_x)
    dense_x = tf.keras.layers.Reshape((1, EMBEDDING_DIM))(dense_x)

    x = tf.keras.layers.Concatenate(axis=1)([fm_input, dense_x])
    x = layers.DotProductInteraction()(x)
    for dim in BOTTOM_MLP_HIDDEN_DIMS:
        x = tf.keras.layers.Dense(dim, activation="relu")(x)
    x = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    return tf.keras.Model(inputs=[dense_input, fm_input], outputs=x)
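
The `layers.DotProductInteraction` layer comes from a local module that is not shown in this notebook. The sketch below is therefore only a guess at its behavior: it assumes the layer emits one dot product per unordered pair of distinct feature vectors, an assumption that is consistent with the non-embedding parameter count reported at the end of the notebook. It is written in plain Python for a single example rather than a batch.

```python
from itertools import combinations


def dot_product_interaction(vectors):
    """Return the dot product of every unordered pair of distinct vectors.

    Assumed behavior only; the notebook's own layer may differ in detail.
    """
    return [
        sum(a * b for a, b in zip(u, v))
        for u, v in combinations(vectors, 2)
    ]


# toy input: three feature vectors of dimension 2
toy_features = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
toy_interactions = dot_product_interaction(toy_features)

# in the notebook: 26 categorical embeddings plus 1 projected dense vector
num_features = 26 + 1
num_pairs = num_features * (num_features - 1) // 2
print(toy_interactions, num_pairs)
```

Under this assumption, the 27 vectors of size `EMBEDDING_DIM` that are concatenated in `get_architecture` would yield 351 pairwise interaction values as input to the second MLP.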

In [10]:
def print_param_counts(model):
    mult = lambda x, y: x*y
    trainable_weights = model.trainable_weights
    check_name = lambda x: (
        x.name.split(':')[0].split('/')[-1] == "embedding_weights")

    trainable_embedding_weights = filter(check_name, trainable_weights)
    trainable_network_weights = filterfalse(check_name, trainable_weights)

    num_embedding_params = sum(
        [reduce(mult, x.shape) for x in trainable_embedding_weights]
    )
    num_network_params = sum(
        [reduce(mult, x.shape) for x in trainable_network_weights]
    )
    print("Embedding parameter count: {}".format(num_embedding_params))
    print("Non-embedding parameter count: {}".format(num_network_params))

In [11]:
numeric_columns, categorical_columns = get_columns(accelerated=ACCELERATED)
embedding_columns = [tf.feature_column.embedding_column(column, EMBEDDING_DIM) for column in categorical_columns]
columns = numeric_columns + categorical_columns

train_dataset = make_dataset(columns, accelerated=ACCELERATED)

make_input = lambda column: tf.keras.Input(name=column.name, shape=(1,), dtype=get_dtype(column))
inputs = list(map(make_input, columns))

fm_x, dense_x = embed_inputs(inputs, numeric_columns, embedding_columns, accelerated=ACCELERATED)
x = get_architecture()([dense_x, fm_x])
model = tf.keras.Model(inputs=inputs, outputs=x)

optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)
model.compile(
    optimizer,
    "binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(curve="ROC", name="auroc")]
)
print_param_counts(model)

callbacks = [
    tf.keras.callbacks.TensorBoard(
        os.path.join(LOG_DIR, "accelerated" if ACCELERATED else "native"),
        update_freq=10,
        profile_batch=100)
]
model.fit(
    train_dataset,
    epochs=1,
    steps_per_epoch=STEPS,
    callbacks=callbacks
)

Embedding parameter count: 188746160
Non-embedding parameter count: 2738953
Train for 1000 steps


<tensorflow.python.keras.callbacks.History at 0x7f0b2c6d02d0>
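
As a sanity check, the two parameter counts printed above can be reproduced by hand from the constants defined earlier. The sketch below is plain Python and does not touch TensorFlow. The helper `mlp_params` is introduced here for illustration, and the interaction width of 351 rests on the assumption that `layers.DotProductInteraction` outputs one value per unordered pair of the 27 feature vectors.

```python
CATEGORY_COUNTS = [
    7599500, 33521, 17022, 7339, 20046, 3, 7068, 1377, 63, 5345303,
    561810, 242827, 11, 2209, 10616, 100, 4, 968, 14, 7838519,
    2580502, 6878028, 298771, 11951, 97, 35
]
EMBEDDING_DIM = 8
NUM_NUMERIC = 13
TOP_MLP_HIDDEN_DIMS = [1024, 512, 256]
BOTTOM_MLP_HIDDEN_DIMS = [1024, 1024, 512, 256]


def mlp_params(in_dim, layer_dims):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for out_dim in layer_dims:
        total += in_dim * out_dim + out_dim  # kernel + bias
        in_dim = out_dim
    return total


# one embedding row of size EMBEDDING_DIM per hash bucket
embedding_params = sum(int(0.75 * c) * EMBEDDING_DIM for c in CATEGORY_COUNTS)

# dense branch: 13 numeric inputs -> hidden layers -> projection to EMBEDDING_DIM
dense_branch = mlp_params(NUM_NUMERIC, TOP_MLP_HIDDEN_DIMS + [EMBEDDING_DIM])

# interaction branch: assumed 27 * 26 / 2 pairwise dot products -> hidden layers -> 1 output
num_features = len(CATEGORY_COUNTS) + 1
num_pairs = num_features * (num_features - 1) // 2
interaction_branch = mlp_params(num_pairs, BOTTOM_MLP_HIDDEN_DIMS + [1])

network_params = dense_branch + interaction_branch
print("Embedding parameter count: {}".format(embedding_params))
print("Non-embedding parameter count: {}".format(network_params))
```

Both totals agree with the output of `print_param_counts`, and they make it plain that the embedding tables hold the overwhelming majority of the model's parameters.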