# Training Tabular Deep Learning Models with Keras on GPU
Deep learning has revolutionized the fields of computer vision (CV) and natural language processing (NLP) in the last few years, providing a fast and general framework for solving a host of difficult problems with unprecedented accuracy. Part and parcel of this revolution has been the development of APIs like [Keras](https://www.tensorflow.org/api_docs/python/tf/keras) for NVIDIA GPUs, allowing practitioners to quickly iterate on new and interesting ideas and receive feedback on their efficacy in shorter and shorter intervals.

One class of problems that has remained largely immune to this revolution, however, is the class involving tabular data. Part of the difficulty is that, unlike CV or NLP, where different datasets are underpinned by similar phenomena and can therefore be solved with similar mechanisms, "tabular datasets" span a vast array of phenomena, semantic meanings, and problem statements, from product and video recommendation to particle discovery and loan default prediction. This diversity makes universally useful components difficult to find or even define, and the problem is exacerbated by the notorious lack of standard, industrial-scale benchmark datasets in the tabular space. Accordingly, deep learning models are frequently bested by their traditional machine learning counterparts on these important tasks.

Yet this diversity is also what makes tools like Keras all the more valuable. Architecture components can be swapped in and out for different tasks like the implementation details they are, and new components can be built and tested with ease. Importantly, domain experts can interact with models at a high level and build their expertise into the design without having to become Python programming wizards. However, the other key ingredient, fast feedback enabled by GPU acceleration (which also translates into cost-efficiency for production pipelines), is lacking in most out-of-the-box APIs. In this example, we walk through some recent advancements in NVIDIA's [NVTabular](https://github.com/nvidia/nvtabular) library that can alleviate existing bottlenecks and bring the full power of GPU acceleration to bear.

In [1]:
! mkdir -p logs/native logs/accelerated
%load_ext tensorboard
%tensorboard --logdir /home/docker/tensorflow/logs --host 0.0.0.0

In [2]:
# TODO: use these and fix other paths
DATA_DIR = '/data'
LOG_DIR = 'logs/'
ACCELERATED = True

# TODO: reimplement the preproc from criteo-example here?
# Alternatively, make criteo its own folder, and split preproc
# and training into separate notebooks, then execute the
# preproc notebook from here?
NUMERIC_FEATURE_NAMES = [f'I{i}' for i in range(1, 14)]
CATEGORICAL_FEATURE_NAMES = [f'C{i}' for i in range(1, 27)]
CATEGORY_COUNTS = [
    7599500, 33521, 17022, 7339, 20046, 3, 7068, 1377, 63, 5345303,
    561810, 242827, 11, 2209, 10616, 100, 4, 968, 14, 7838519,
    2580502, 6878028, 298771, 11951, 97, 35
]
LABEL_NAME = 'label'

# optimization params
BATCH_SIZE = 65536
STEPS = 1000
LEARNING_RATE = 0.001

# architecture params
EMBEDDING_DIM = 8
TOP_MLP_HIDDEN_DIMS = [1024, 512, 256]
BOTTOM_MLP_HIDDEN_DIMS = [1024, 1024, 512, 256]

In [3]:
import tensorflow as tf
import layers
import os

In [4]:
def get_columns(accelerated=False):
    numeric_columns = [tf.feature_column.numeric_column(name, (1,)) for name in NUMERIC_FEATURE_NAMES]
    if not accelerated:
        categorical_columns = [
            tf.feature_column.categorical_column_with_hash_bucket(
                name, int(0.75*count), dtype=tf.int64) for
            name, count in zip(CATEGORICAL_FEATURE_NAMES, CATEGORY_COUNTS)
        ]
    else:
        categorical_columns = [
            tf.feature_column.categorical_column_with_identity(
                name, int(0.75*count)) for
            name, count in zip(CATEGORICAL_FEATURE_NAMES, CATEGORY_COUNTS)
        ]
    return numeric_columns, categorical_columns
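
Both branches of `get_columns` size each categorical column at `int(0.75*count)` buckets, so the embedding tables dominate the model's memory footprint. The sketch below is a rough back-of-the-envelope estimate, not part of the original pipeline: the helper `embedding_table_stats` is hypothetical, and it assumes float32 weights and ignores optimizer state (Adam keeps additional per-parameter slots).

```python
# Values copied from the configuration cell above.
CATEGORY_COUNTS = [
    7599500, 33521, 17022, 7339, 20046, 3, 7068, 1377, 63, 5345303,
    561810, 242827, 11, 2209, 10616, 100, 4, 968, 14, 7838519,
    2580502, 6878028, 298771, 11951, 97, 35
]
EMBEDDING_DIM = 8
BYTES_PER_FLOAT32 = 4  # assumption: embeddings are stored as float32


def embedding_table_stats(category_counts, embedding_dim, scale=0.75):
    """Return (bucket sizes, total embedding parameters, size in GiB)."""
    # Same sizing rule as get_columns: int(0.75 * count) buckets per column.
    buckets = [int(scale * count) for count in category_counts]
    total_params = sum(buckets) * embedding_dim
    size_gib = total_params * BYTES_PER_FLOAT32 / 2**30
    return buckets, total_params, size_gib


buckets, total_params, size_gib = embedding_table_stats(CATEGORY_COUNTS, EMBEDDING_DIM)
print(f"{len(buckets)} tables, {total_params:,} parameters, ~{size_gib:.2f} GiB in float32")
```

An estimate like this is a quick way to check whether the tables plausibly fit in GPU memory before changing `EMBEDDING_DIM` or the 0.75 scale factor.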

In [5]:
def make_dataset(columns, accelerated=False):
    if not accelerated:
        # feature spec tells us how to parse tfrecords
        # using FixedLenFeatures keeps from using sparse machinery,
        # but obviously wouldn't extend to multi-hot categoricals
        # TODO: this is being generous, since the typical tensorflow user
        # will use tf.feature_column.make_parse_example_spec which will
        # use varlenfeature by default for categorical columns, and the
        # sparse machinery will definitely hurt GPU perf. Should we just
        # use this, even though our API is leveraging the fact that all
        # categoricals are single hot?
        feature_spec = {
            column.name: tf.io.FixedLenFeature((1,), getattr(column, 'dtype', tf.int64)) for column in columns
        }
        feature_spec[LABEL_NAME] = tf.io.FixedLenFeature((1,), tf.int64)
        return tf.data.experimental.make_batched_features_dataset(
            '/data/tfrecords/train/*.tfrecords',
            BATCH_SIZE,
            feature_spec,
            label_key=LABEL_NAME,
            num_epochs=1,
            shuffle=True,
            shuffle_buffer_size=4*BATCH_SIZE,
            )
    else:
        from nvtabular import Workflow
        from nvtabular.tf_dataloader import KerasSequenceDataset
        from nvtabular.ops import HashBucket
        dataset = KerasSequenceDataset(
            '/data/stable/train/*.parquet',
            columns,
            batch_size=BATCH_SIZE,
            label_name=LABEL_NAME,
            shuffle=True,
            buffer_size=0.2,
            dataset_size=STEPS*BATCH_SIZE
          )
        workflow = Workflow(
            cat_names=CATEGORICAL_FEATURE_NAMES,
            cont_names=NUMERIC_FEATURE_NAMES,
            label_name=LABEL_NAME
        )
        num_buckets = {
            col.name: col.num_buckets for col in columns
                if col.name in CATEGORICAL_FEATURE_NAMES
        }
        # HashBucket acts on the categorical columns named in num_buckets
        workflow.add_cat_preprocess(HashBucket(num_buckets))
        workflow.finalize()
        dataset.map(workflow)
        return dataset
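
The accelerated branch appears to move hashing out of the TensorFlow graph: `HashBucket` runs inside the NVTabular workflow, so the model only needs `categorical_column_with_identity` on indices that are already in range. The pure-Python sketch below illustrates that idea only. The helpers `hash_bucket` and `prehash_batch` are hypothetical, the raw IDs are made up, and CRC32 is a stand-in for whatever hash functions TensorFlow and NVTabular actually use.

```python
import zlib


def hash_bucket(value, num_buckets):
    """Map a raw categorical value to a bucket index in [0, num_buckets)."""
    # CRC32 is a deterministic stand-in hash, chosen only for illustration.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def prehash_batch(batch, num_buckets):
    """Hash every categorical column of a batch given as {column: [raw values]}."""
    return {
        name: [hash_bucket(value, num_buckets[name]) for value in values]
        for name, values in batch.items()
    }


# int(0.75 * count) for two of the Criteo columns configured above.
num_buckets = {"C1": 5699625, "C6": 2}
# Made-up raw IDs, for illustration only.
raw_batch = {"C1": [1234567890, 42, 42], "C6": [7, 8, 9]}

hashed = prehash_batch(raw_batch, num_buckets)
print(hashed)
```

Because every index already lies in `[0, num_buckets)`, the downstream lookup can treat it as an identity-encoded category, which is what the accelerated branch of `get_columns` relies on.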

In [6]:
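def embed_inputs(numeric_inputs, categorical_inputs, accelerated=False):
    numeric_columns, categorical_columns = get_columns(accelerated=accelerated)
    # wrap each categorical column in an embedding so DenseFeatures can consume it,
    # rather than relying on a global defined in a later cell
    embedding_columns = [
        tf.feature_column.embedding_column(column, EMBEDDING_DIM) for column in categorical_columns
    ]

    # DenseFeatures concatenates all embeddings into one flat vector per example;
    # reshape to (num_categorical_features, EMBEDDING_DIM) for the interaction layer
    fm_x = tf.keras.layers.DenseFeatures(embedding_columns)(categorical_inputs)
    fm_x = tf.keras.layers.Reshape((len(categorical_columns), EMBEDDING_DIM))(fm_x)
    dense_x = tf.keras.layers.DenseFeatures(numeric_columns)(numeric_inputs)
    return fm_x, dense_x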

In [7]:
def get_architecture():
    dense_input = tf.keras.Input(name='dense_x', shape=(len(numeric_columns),), dtype=tf.float32)
    fm_input = tf.keras.Input(name='fm_x', shape=(len(categorical_columns), EMBEDDING_DIM), dtype=tf.float32)

    dense_x = dense_input
    for dim in TOP_MLP_HIDDEN_DIMS:
        dense_x = tf.keras.layers.Dense(dim, activation='relu')(dense_x)
    dense_x = tf.keras.layers.Dense(EMBEDDING_DIM, activation='linear')(dense_x)
    dense_x = tf.keras.layers.Reshape((1, EMBEDDING_DIM))(dense_x)

    x = tf.keras.layers.Concatenate(axis=1)([fm_input, dense_x])
    x = layers.DotProductInteraction()(x)
    for dim in BOTTOM_MLP_HIDDEN_DIMS:
        x = tf.keras.layers.Dense(dim, activation='relu')(x)
    x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

    return tf.keras.Model(inputs=[dense_input, fm_input], outputs=x)
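
The `layers` module that provides `DotProductInteraction` is not shown in this notebook. Assuming it follows the usual DLRM-style pattern of taking the dot product of every pair of distinct feature vectors, the computation for a single example can be sketched in pure Python as below. The function name is hypothetical, and the real layer may differ, for instance by operating on batched tensors or by handling self-interactions differently.

```python
from itertools import combinations


def dot_product_interaction(vectors):
    """Return the dot product of every unordered pair of distinct vectors."""
    return [
        sum(a * b for a, b in zip(u, v))
        for u, v in combinations(vectors, 2)
    ]


# Toy example: three feature vectors with embedding dimension 2.
example = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
interactions = dot_product_interaction(example)  # pairs (0,1), (0,2), (1,2)
print(interactions)

# In get_architecture, 26 categorical embeddings are concatenated with
# 1 projected dense vector, giving 27 vectors of size EMBEDDING_DIM.
num_vectors = 26 + 1
num_pairs = num_vectors * (num_vectors - 1) // 2
print(num_pairs)
```

Under that assumption, the MLP that follows the interaction sees one value per pair of features rather than the raw embeddings, so its input width grows quadratically with the number of feature vectors.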

In [8]:
numeric_columns, categorical_columns = get_columns(accelerated=ACCELERATED)
columns = numeric_columns + categorical_columns

# wrap categorical columns with an embedding to feed to DenseFeatures
embedding_columns = [tf.feature_column.embedding_column(column, EMBEDDING_DIM) for column in categorical_columns]

train_dataset = make_dataset(columns, accelerated=ACCELERATED)


# now we build the model
categorical_inputs = {
    column.name: tf.keras.Input(name=column.name, shape=(1,), dtype=getattr(column, 'dtype', tf.int64)) for
        column in categorical_columns
}
numeric_inputs = {
    column.name: tf.keras.Input(name=column.name, shape=(1,), dtype=getattr(column, 'dtype', tf.int64)) for
        column in numeric_columns
}
fm_x, dense_x = embed_inputs(numeric_inputs, categorical_inputs, accelerated=ACCELERATED)
x = get_architecture()([dense_x, fm_x])

inputs = list(categorical_inputs.values()) + list(numeric_inputs.values())
model = tf.keras.Model(inputs=inputs, outputs=x)

optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)
model.compile(optimizer, 'binary_crossentropy', metrics=[tf.keras.metrics.AUC(curve='ROC', name='auroc')])
model.summary()

run_name = 'accelerated' if ACCELERATED else 'native'
callbacks = [
    tf.keras.callbacks.TensorBoard(
        os.path.join(LOG_DIR, run_name),
        update_freq=10,
        profile_batch=100)
]
model.fit(
    train_dataset,
    epochs=1,
    steps_per_epoch=STEPS,
    callbacks=callbacks
)

