## HybridBackend Quickstart

In this tutorial, we use [HybridBackend](https://hybridbackend.readthedocs.io/en/latest/)
to speed up training of a sample ranking model, based on stacked
[DCNv2](https://arxiv.org/abs/2008.13535), on the
[Taobao ad click dataset](https://tianchi.aliyun.com/dataset/dataDetail?dataId=56).

### Why HybridBackend

- Training industrial recommendation models can benefit greatly from GPUs:

  - Embedding layers are becoming wider, consuming up to thousands of feature
    fields, which requires larger memory bandwidth;
  - Feature interaction layers are becoming deeper, combining multiple DNN
    submodules over different subsets of features, which requires higher
    computing capability;
  - GPUs provide much higher computing capability, larger memory bandwidth, and
    faster data movement.

- Canonical training frameworks do not let industrial recommendation models
  take full advantage of GPU resources:

  - Industrial recommendation models contain up to a thousand input feature
    fields, introducing fragmentary and memory-intensive operations;
  - The multiple feature interaction submodules introduce a large number of
    small compute kernels.

- A training framework for industrial recommendation models must be minimally
  invasive and compatible with existing workflows:

  - Training is only one part of a production recommendation system, and
    modifying the inference pipeline takes great effort;
  - AI scientists write models in a variety of ways, especially in a big team.
HybridBackend speeds up the training of industrial recommendation models on
GPUs with minimal effort. This tutorial shows how to use it to make training
such models much faster.

See the [HybridBackend GitHub repo](https://github.com/alibaba/HybridBackend) and
[the paper](https://ieeexplore.ieee.org/document/9835450) for more information.

### Requirements

- Hardware
  - Modern GPU and interconnect (e.g. A10 / PCIe Gen4)
  - Fast data storage (e.g. ESSD)
- Software
  - Ubuntu 20.04 or above
  - Python 3.8 or above
  - CUDA 11.4
  - TensorFlow 1.15
- [Taobao Ad Click Dataset](https://tianchi.aliyun.com/dataset/dataDetail?dataId=56)
  - TFRecord Format
  - Parquet Format

In [None]:
!pip3 install hybridbackend-tf115-cu114

### Sample ranking model

This tutorial uses a sample ranking model based on stacked
[DCNv2](https://arxiv.org/abs/2008.13535).
See the code in the `ranking` directory for details.
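
To make the model structure more concrete, here is a minimal pure-Python sketch of the cross layer defined in the DCN-V2 paper, `x_next = x0 * (W @ xl + b) + xl`, where `*` is element-wise. The names `cross_layer` and `stacked_cross` are illustrative and are not taken from the `ranking` code, which may be organized differently. In the stacked form described in the paper, the output of such a cross network is then fed into a deep network.

```python
# Sketch of the DCN-V2 cross layer for a single example (lists of floats).
# cross_layer and stacked_cross are hypothetical names used for illustration.

def cross_layer(x0, xl, weight, bias):
    """Return x0 * (weight @ xl + bias) + xl, with element-wise product."""
    # Linear projection of the current layer input: weight @ xl + bias
    projected = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, xl)) + b_i
        for row, b_i in zip(weight, bias)
    ]
    # Cross with the original input x0 and add the residual connection xl
    return [x0_i * p_i + xl_i for x0_i, p_i, xl_i in zip(x0, projected, xl)]


def stacked_cross(x0, layers):
    """Chain cross layers; `layers` is a list of (weight, bias) pairs."""
    xl = x0
    for weight, bias in layers:
        xl = cross_layer(x0, xl, weight, bias)
    return xl


# Toy input with identity weights and zero bias, chosen to be easy to verify.
x0 = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
one_layer = stacked_cross(x0, [(identity, zero_bias)])
two_layers = stacked_cross(x0, [(identity, zero_bias)] * 2)
print(one_layer, two_layers)
```

Each cross layer always multiplies by the original input `x0`, so stacking layers raises the degree of the explicit feature interactions by one per layer.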

In [None]:
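import os

# Hide INFO and WARNING messages from TensorFlow's C++ backend.
# This must be set before tensorflow is imported.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow.python.util import module_wrapper as deprecation
# Suppress per-module deprecation warnings from the TF 1.x API.
deprecation._PER_MODULE_WARNING_LIMIT = 0
# Avoid duplicate log lines through the root logger.
tf.get_logger().propagate = False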

from ranking.data import DataSpec
from ranking.model import stacked_dcn_v2


# Global configuration
train_max_steps = 100
train_batch_size = 16000
data_spec = DataSpec.read('ranking/taobao/data/spec.json')


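def train(iterator, embedding_weight_device, dnn_device):
  """Build the ranking model graph and train it for `train_max_steps` steps."""
  batch = iterator.get_next()
  batch.pop('ts')  # Drop the 'ts' field; it is not used as a feature.
  labels = tf.reshape(tf.to_float(batch.pop('label')), shape=[-1, 1])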
  features = []
  for f in batch:
    feature = batch[f]
    if data_spec.embedding_dims[f] is None:
      features.append(data_spec.transform_numeric(f, feature))
    else:
      with tf.device(embedding_weight_device):
        embedding_weights = tf.get_variable(
          f'{f}_weight',
          shape=(data_spec.embedding_sizes[f], data_spec.embedding_dims[f]),
          initializer=tf.random_uniform_initializer(-1e-3, 1e-3))
      features.append(
        data_spec.transform_categorical(f, feature, embedding_weights))

  with tf.device(dnn_device):
    logits = stacked_dcn_v2(
      features,
      [1024, 1024, 512, 256, 1])
    loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(labels, logits))
    step = tf.train.get_or_create_global_step()
    opt = tf.train.AdagradOptimizer(learning_rate=0.001)
    train_op = opt.minimize(loss, global_step=step)

  # Log the step rate every 10 steps and stop after train_max_steps steps.
  hooks = []
  hooks.append(tf.train.StepCounterHook(10))
  hooks.append(tf.train.StopAtStepHook(train_max_steps))
  config = tf.ConfigProto(allow_soft_placement=True)
  config.gpu_options.allow_growth = True
  config.gpu_options.force_gpu_compatible = True
  with tf.train.MonitoredTrainingSession(
      '', hooks=hooks, config=config) as sess:
    while not sess.should_stop():
      sess.run(train_op)

### Training without HybridBackend

Without HybridBackend, training the sample ranking model underutilizes GPUs.
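
To compare the runs in this tutorial, it can help to convert the `global_step/sec` value that `tf.train.StepCounterHook` reports into examples per second. The helpers below are a small sketch with hypothetical names, and the rates passed to them are placeholders rather than measured results; substitute the values from your own logs.

```python
# Hypothetical helpers for comparing training runs; not part of HybridBackend.

def examples_per_second(steps_per_second, batch_size):
    """Convert a logged step rate into a throughput in examples per second."""
    return steps_per_second * batch_size


def speedup(baseline_steps_per_second, optimized_steps_per_second):
    """Return how many times faster the optimized run is than the baseline."""
    return optimized_steps_per_second / baseline_steps_per_second


train_batch_size = 16000  # Same batch size as in this tutorial.

# Placeholder rates for illustration only; replace with values from your logs.
baseline_rate = 2.0
optimized_rate = 5.0
print(examples_per_second(baseline_rate, train_batch_size))
print(speedup(baseline_rate, optimized_rate))
```

Because every run in this tutorial uses the same batch size, the ratio of step rates equals the ratio of example throughputs.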

In [None]:
# Download training data in TFRecord format
!wget http://easyrec.oss-cn-beijing.aliyuncs.com/data/taobao/day_0.tfrecord

In [None]:
with tf.Graph().as_default():
  ds = tf.data.TFRecordDataset('./day_0.tfrecord', compression_type='GZIP')
  ds = ds.batch(train_batch_size, drop_remainder=True)
  ds = ds.map(
    lambda batch: tf.io.parse_example(batch, data_spec.to_example_spec()))
  ds = ds.prefetch(2)
  iterator = tf.data.make_one_shot_iterator(ds)

  with tf.device('/gpu:0'):
    train(iterator, '/cpu:0', '/gpu:0')

### Training with HybridBackend

With a single import line, HybridBackend automatically applies packing and
interleaving to speed up embedding layers dramatically.
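
As a rough intuition for why packing can help, the sketch below fuses several small per-field embedding tables into one table with row offsets, so that one lookup pass replaces one lookup per field. This only illustrates the general idea of replacing many small operations with fewer large ones. It is an assumption made for explanation and does not describe HybridBackend's actual implementation; `pack_tables` and `packed_lookup` are hypothetical names.

```python
# Illustration of packing many small lookups into one; all names are hypothetical.

def pack_tables(tables):
    """Concatenate per-field tables row-wise; return the table and row offsets."""
    packed, offsets = [], []
    for table in tables:
        offsets.append(len(packed))  # First row of this field in the packed table
        packed.extend(table)
    return packed, offsets


def packed_lookup(packed, offsets, ids_per_field):
    """Look up one id per field in a single pass over the packed table."""
    return [packed[offsets[field] + i] for field, i in enumerate(ids_per_field)]


# Two toy fields with 2 and 3 embedding rows of dimension 1.
field_a = [[0.1], [0.2]]
field_b = [[1.0], [2.0], [3.0]]
packed, offsets = pack_tables([field_a, field_b])
print(offsets)
print(packed_lookup(packed, offsets, [1, 2]))
```

With up to a thousand feature fields, the unpacked version would issue one small lookup per field, which is the kind of fragmentary operation described in the motivation above.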

In [None]:
# Note: Once HybridBackend is on, you need to restart the notebook to turn it off.
import hybridbackend.tensorflow as hb

# Exactly the same code as before, except that HybridBackend is on.
with tf.Graph().as_default():
  ds = tf.data.TFRecordDataset('./day_0.tfrecord', compression_type='GZIP')
  ds = ds.batch(train_batch_size, drop_remainder=True)
  ds = ds.map(
    lambda batch: tf.io.parse_example(batch, data_spec.to_example_spec()))
  ds = ds.prefetch(2)
  iterator = tf.data.make_one_shot_iterator(ds)

  with tf.device('/gpu:0'):
    train(iterator, '/cpu:0', '/gpu:0')

### Training with HybridBackend (Optimized data pipeline)

Even greater training performance gains can be achieved with the optimized
data pipeline provided by HybridBackend.

In [None]:
# Download training data in Parquet format
!wget http://easyrec.oss-cn-beijing.aliyuncs.com/data/taobao/day_0.parquet

In [None]:
# Note: Once HybridBackend is on, you need to restart the notebook to turn it off.
import hybridbackend.tensorflow as hb

with tf.Graph().as_default():
  ds = hb.data.Dataset.from_parquet('./day_0.parquet')
  ds = ds.batch(train_batch_size, drop_remainder=True)
  ds = ds.prefetch(2)
  iterator = tf.data.make_one_shot_iterator(ds)

  with tf.device('/gpu:0'):
    train(iterator, '/cpu:0', '/gpu:0')