In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Accelerate Training on Large Embedding Tables with LazyAdam

This notebook is created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) container. 

Merlin Models provides various model APIs for training, as shown in the notebook [Iterating over Deep Learning Models using Merlin Models](https://nvidia-merlin.github.io/models/main/examples/03-Exploring-different-models.html). We can create a model, such as [Two Tower](https://nvidia-merlin.github.io/models/main/models_overview.html?highlight=two%20tower#two-tower) or [DLRM](https://nvidia-merlin.github.io/models/main/examples/03-Exploring-different-models.html#dlrm-model), with a single line: `model=mm.DLRMModel(schema)`. Some models contain large embedding tables, and training can be slow on such large, sparse embeddings. This process can be accelerated by using a special optimizer, LazyAdam.

In this example, we use LazyAdam for the large embedding tables and normal Adam for the other trainable weights to accelerate the whole training process.


### Learning objectives
- Training a model with multiple optimizers
- Utilizing LazyAdam for training on large embedding tables
- Utilizing `get_blocks_by_name` to get a layer inside a model

In [2]:
import os

import tensorflow as tf

#import merlin
from merlin.datasets.synthetic import generate_data
import merlin.models.tf as ml
from merlin.schema import Schema, Tags

2022-08-14 06:05:07.036450: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-14 06:05:09.554587: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16255 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB-LS, pci bus id: 0000:8a:00.0, compute capability: 7.0


## Generate Synthetic Dataset
To generate the synthetic dataset for our example, we can use `generate_data()`.

In [3]:
NUM_ROWS = 1000000
train, valid = generate_data("e-commerce-large", int(NUM_ROWS), set_sizes=(0.7, 0.3))

We can print out the feature column names.

In [4]:
schema = train.schema
schema.column_names

['user_categories',
 'user_shops',
 'user_brands',
 'user_intentions',
 'user_profile',
 'user_group',
 'user_gender',
 'user_age',
 'user_consumption_1',
 'user_consumption_2',
 'user_is_occupied',
 'user_geography',
 'user_id',
 'item_category',
 'item_shop',
 'item_intention',
 'item_brand',
 'item_id',
 'user_item_categories',
 'user_item_shops',
 'user_item_brands',
 'user_item_intentions',
 'position',
 'click',
 'conversion']

## Building a Two-Tower Model with Merlin Models

In [5]:
item_embeddings = ml.Embeddings(schema.select_by_tag(Tags.ITEM), infer_embedding_sizes=True)
query_embeddings = ml.Embeddings(schema.select_by_tag(Tags.USER), infer_embedding_sizes=True)
model = ml.TwoTowerModel(schema, 
                         item_tower=ml.InputBlockV2(schema.select_by_tag(Tags.ITEM), embeddings=item_embeddings).connect(ml.MLPBlock([64])), 
                         query_tower=ml.InputBlockV2(schema.select_by_tag(Tags.USER), embeddings=query_embeddings).connect(ml.MLPBlock([64])),
                         samplers=[ml.InBatchSampler()],
)

The model initializer infers the shape of each embedding table from the schema. The first dimension (`input_dim`) of each table equals the cardinality (number of categories) of the feature, and the second dimension can be specified by the user. When `infer_embedding_sizes=True` is set, the initializer infers the second dimension from the cardinality:
$$output\_dim=\left \lfloor cardinality^{0.25}\times multiplier \right \rfloor$$
To achieve the best performance with GPU operators, the result is then rounded up to a multiple of 8. The table sizes printed later in this notebook correspond to a multiplier of 2: for example, `user_id` with a cardinality of 294737 gets a dimension of 48, and `user_profile` with a cardinality of 99 gets a dimension of 8.
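
The sizing rule can be checked by hand with a small sketch. The function below is our own illustration, not the Merlin Models implementation: the name `infer_dim`, the multiplier of 2, and the round-up-to-8 step are assumptions chosen to match the formula above and the table sizes printed later in this notebook.

```python
import math


def infer_dim(cardinality, multiplier=2.0, multiple_of=8):
    """Sketch of the sizing rule: floor(cardinality**0.25 * multiplier),
    then rounded up to the next multiple of `multiple_of` (assumed values)."""
    raw = math.floor(cardinality ** 0.25 * multiplier)
    return max(multiple_of, math.ceil(raw / multiple_of) * multiple_of)


# Cardinalities taken from the query-tower tables printed in this notebook.
examples = [
    ("user_id", 294737),
    ("user_shops", 116742),
    ("user_categories", 6087),
    ("user_profile", 99),
]
for name, cardinality in examples:
    print(name, cardinality, infer_dim(cardinality))
```

With these inputs the sketch yields 48, 40, 24 and 8, which are the second dimensions reported for the same tables in the printed output.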

## Apply Multiple Optimizers to the Model

We usually use a single optimizer to train a model. For large embedding tables, however, the weight updates in each batch can be very sparse: each step updates the model from a small batch of training data, so for a large embedding table (first dimension >> batch size) at most `batch_size` rows are updated. To accelerate training on large embedding tables, we therefore use Lazy Adam for those tables.

Compared with Adam, Lazy Adam is optimized for sparse updates: it only updates the rows (variable indices) that appear in the current batch. However, this may lead to slightly different experiment results compared with Adam.
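
To make the difference concrete, here is a minimal pure-Python sketch. It is not the optimizer implementation used by Merlin Models: the helper names (`adam_step`, `train`), the scalar "rows", and the hyperparameters are assumptions for illustration. Dense Adam visits every row at every step, so a row keeps moving through its decaying momentum even when it is absent from the batch. The lazy variant visits only the rows that received a gradient.

```python
import math


def adam_step(params, m, v, grads, step, rows, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Apply one Adam update to `rows`; `grads` maps row index -> gradient.
    Rows without an entry in `grads` are treated as having a zero gradient."""
    for r in rows:
        g = grads.get(r, 0.0)
        m[r] = b1 * m[r] + (1 - b1) * g
        v[r] = b2 * v[r] + (1 - b2) * g * g
        m_hat = m[r] / (1 - b1 ** step)
        v_hat = v[r] / (1 - b2 ** step)
        params[r] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return len(rows)  # number of rows visited in this step


def train(lazy, num_rows=6):
    """Run two steps on a toy table; each batch touches a single row."""
    params = [1.0] * num_rows
    m = [0.0] * num_rows
    v = [0.0] * num_rows
    batches = [{0: 0.5}, {1: -0.3}]  # step 1 hits row 0, step 2 hits row 1
    touched = 0
    for step, grads in enumerate(batches, start=1):
        rows = sorted(grads) if lazy else range(num_rows)
        touched += adam_step(params, m, v, grads, step, rows)
    return params, touched


dense_params, dense_touched = train(lazy=False)
lazy_params, lazy_touched = train(lazy=True)
print("rows visited:", dense_touched, "(dense) vs", lazy_touched, "(lazy)")
print("row 0 after two steps:", dense_params[0], "(dense) vs", lazy_params[0], "(lazy)")
```

In this toy run the dense version visits 12 rows and the lazy version visits 2. Row 0 stays at about 0.9 under the lazy variant but drifts to about 0.83 under dense Adam, which illustrates both the saved work and the slightly different results.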

#### Split Embedding Tables based on the First Dimension (`input_dim`)
Since we want to apply LazyAdam only to the large tables, we have to split all tables into two sets. `split_embeddings_on_size` returns two lists of embedding tables (here `item_large_tables` and `item_small_tables`).


In [6]:
item_large_tables, item_small_tables = ml.split_embeddings_on_size(item_embeddings, threshold=1000)
query_large_tables, query_small_tables = ml.split_embeddings_on_size(query_embeddings, threshold=1000)
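
The idea behind the split can be sketched without Merlin. The helper `split_on_size` below is hypothetical and works on plain `{name: input_dim}` pairs. Whether the real function treats a table exactly at the threshold as large or small is an assumption here, and it does not matter for this dataset because no table has exactly 1000 rows. The cardinalities are the query-tower values printed by this notebook.

```python
def split_on_size(table_sizes, threshold=1000):
    """Partition {name: input_dim} into (large, small) lists of names.
    Assumption: only tables strictly above the threshold count as large."""
    large = [name for name, size in table_sizes.items() if size > threshold]
    small = [name for name, size in table_sizes.items() if size <= threshold]
    return large, small


query_table_sizes = {
    "user_categories": 6087,
    "user_shops": 116742,
    "user_brands": 58016,
    "user_intentions": 33787,
    "user_profile": 99,
    "user_group": 15,
    "user_gender": 4,
    "user_age": 9,
    "user_consumption_1": 5,
    "user_consumption_2": 5,
    "user_is_occupied": 4,
    "user_geography": 6,
    "user_id": 294737,
}

large, small = split_on_size(query_table_sizes, threshold=1000)
print("large:", large)
print("small:", small)
```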

We can print out the size of each embedding table:

In [7]:
print("Large embedding tables of query tower:")
for t in query_large_tables:
    print(t.name, "first dimension: ", t.input_dim, "second dimension", t.dim)
print("Small embedding tables of query tower:")
for t in query_small_tables:
    print(t.name, "first dimension: ", t.input_dim, "second dimension", t.dim)

Large embedding tables of query tower:
user_categories first dimension:  6087 second dimension 24
user_shops first dimension:  116742 second dimension 40
user_brands first dimension:  58016 second dimension 32
user_intentions first dimension:  33787 second dimension 32
user_id first dimension:  294737 second dimension 48
Small embedding tables of query tower:
user_profile first dimension:  99 second dimension 8
user_group first dimension:  15 second dimension 8
user_gender first dimension:  4 second dimension 8
user_age first dimension:  9 second dimension 8
user_consumption_1 first dimension:  5 second dimension 8
user_consumption_2 first dimension:  5 second dimension 8
user_is_occupied first dimension:  4 second dimension 8
user_geography first dimension:  6 second dimension 8


### Set MultiOptimizer

The `MultiOptimizer` module enables multiple optimizers [(e.g. Adam, SGD, RMSProp, Adagrad)](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) to be used for different layers in parallel. Here we apply Lazy Adam to the large embedding tables and Adam to the small embedding tables and all other layers.

In [8]:
optimizer = ml.MultiOptimizer(
    default_optimizer="adam",
    optimizers_and_blocks=[
        ml.OptimizerBlocks(ml.LazyAdam(), item_large_tables + query_large_tables),
        ml.OptimizerBlocks("adam", item_small_tables + query_small_tables),
    ],
)

Note that all trainable parameters without an explicitly assigned optimizer use the `default_optimizer`.
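
The fallback behaviour can be pictured with a small routing sketch. This is not how `MultiOptimizer` is implemented: `assign_optimizers`, the string optimizer labels, and the block names `query_mlp` and `item_mlp` are hypothetical and only illustrate that blocks without an explicit entry end up with the default.

```python
def assign_optimizers(all_blocks, optimizers_and_blocks, default_optimizer="adam"):
    """Return {block_name: optimizer_label}.
    Blocks that are not listed explicitly fall back to the default."""
    assignment = {}
    for optimizer, blocks in optimizers_and_blocks:
        for block in blocks:
            assignment[block] = optimizer
    for block in all_blocks:
        assignment.setdefault(block, default_optimizer)
    return assignment


# Hypothetical block names: two large tables, one small table, two MLP blocks.
blocks = ["user_id", "item_id", "user_gender", "query_mlp", "item_mlp"]
routing = assign_optimizers(
    blocks,
    [("lazy_adam", ["user_id", "item_id"]), ("adam", ["user_gender"])],
    default_optimizer="adam",
)
for block, optimizer in routing.items():
    print(block, "->", optimizer)
```

Here the two MLP blocks are never mentioned in `optimizers_and_blocks`, so they receive the default label.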

## Train the Model and Evaluate Training Time

In [9]:
model.compile(optimizer=optimizer)
model.fit(train, batch_size=1024, epochs=1)

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


The sampler InBatchSampler returned no samples for this batch.


<keras.callbacks.History at 0x7f807f26a040>

### Compare Training Time with Adam

Now we create an identical model and compile it with the standard Adam optimizer. For this notebook, we use a single Tesla V100-SXM2-32GB-LS. With Adam, each training step takes about 71 ms. In the experiment above, each step with Lazy Adam takes about 18 ms, so Lazy Adam achieves a speedup of about 4x.

In [10]:
item_embeddings = ml.Embeddings(schema.select_by_tag(Tags.ITEM), infer_embedding_sizes=True)
query_embeddings = ml.Embeddings(schema.select_by_tag(Tags.USER), infer_embedding_sizes=True)
model = ml.TwoTowerModel(schema, 
                         item_tower=ml.InputBlockV2(schema.select_by_tag(Tags.ITEM), embeddings=item_embeddings).connect(ml.MLPBlock([64])), 
                         query_tower=ml.InputBlockV2(schema.select_by_tag(Tags.USER), embeddings=query_embeddings).connect(ml.MLPBlock([64])),
                         samplers=[ml.InBatchSampler()],
)
model.compile(optimizer="adam")
model.fit(train, batch_size=1024, epochs=1)

The sampler InBatchSampler returned no samples for this batch.




<keras.callbacks.History at 0x7f807d5dab80>
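
The per-step timings quoted above translate into a rough per-epoch estimate. The arithmetic below is a back-of-the-envelope sketch: it assumes that the 70% training split contains exactly 700,000 rows, that the final partial batch counts as a step, and that the step time stays constant over the epoch.

```python
import math

TRAIN_ROWS = 700_000         # 70% of NUM_ROWS = 1,000,000 (assumed exact)
BATCH_SIZE = 1024
ADAM_MS_PER_STEP = 71        # step time reported for Adam
LAZY_ADAM_MS_PER_STEP = 18   # step time reported for Lazy Adam

steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
speedup = ADAM_MS_PER_STEP / LAZY_ADAM_MS_PER_STEP
adam_epoch_s = steps_per_epoch * ADAM_MS_PER_STEP / 1000
lazy_epoch_s = steps_per_epoch * LAZY_ADAM_MS_PER_STEP / 1000

print(f"steps per epoch: {steps_per_epoch}")
print(f"speedup: {speedup:.2f}x")
print(f"estimated epoch time: {adam_epoch_s:.1f} s (Adam) vs {lazy_epoch_s:.1f} s (Lazy Adam)")
```

Under these assumptions one epoch has 684 steps, which takes roughly 48.6 s with Adam and 12.3 s with Lazy Adam, a ratio of about 3.9.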

## Retrieve Layers Inside a Model

When the embeddings are created by default inside the model initialization, as in the example model below, we need another way to access the embedding layer (the first layer) of the model. To find it, we can first print the whole architecture of the model with `model.summary(expand_nested=True, show_trainable=True, line_length=80)`.

We create a Wide and Deep model as an example. Since the embedding layer is created inside the initialization of the class `ml.WideAndDeepModel`, we have to print out the model and find the layer's name.

In [11]:
example_schema = train.schema.select_by_name(names=["user_categories", "item_category", "item_id", "user_id", "click"])
wide_schema = example_schema.select_by_name(names=["user_categories", "item_category"])
example_model = ml.WideAndDeepModel(
        example_schema,
        wide_schema=wide_schema,
        deep_schema=example_schema,
        deep_block=ml.MLPBlock([16]),
        prediction_tasks=ml.BinaryClassificationTask("click"),
    )

# Build the model before summary
example_model.compile(optimizer="adam")
example_model.predict(train, batch_size=1024, steps=1)

example_model.summary(expand_nested=True, show_trainable=True, line_length=80)

Model: "model"
___________________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     Trainable  
 parallel_block_6 (ParallelBlock)   multiple                        285393428   Y          
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|
| sequential_block_25 (SequentialBlo  multiple                     3           Y          |
| ck)                                                                                     |
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||
|| parallel_block_5 (ParallelBlock)  multiple                     0           Y          ||
|||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|||
||| tabular_block_5 (TabularBlock)  multiple                     0           Y          |||
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

The summary above shows the entire structure of the model, including the name, type, number of parameters, and trainable flag of each layer. Some layers are nested. For example, a `ParallelBlock` contains several sub-layers that have no order; data passes through them in parallel. A `SequentialBlock` contains a sequence of blocks, and data passes through them one by one. A `ParallelBlock` can also contain other `ParallelBlock` or `SequentialBlock` instances.
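
The two composition patterns can be sketched with plain functions. The helpers `sequential` and `parallel` below are hypothetical stand-ins, not the Merlin block classes. They only show how data flows: one block after another in the sequential case, and the same input into every branch in the parallel case.

```python
def sequential(*blocks):
    """Chain blocks: the output of one block is the input of the next."""
    def run(x):
        for block in blocks:
            x = block(x)
        return x
    return run


def parallel(**branches):
    """Feed the same input to every branch and collect the outputs by name."""
    def run(x):
        return {name: branch(x) for name, branch in branches.items()}
    return run


def double(x):
    return x * 2


def add_one(x):
    return x + 1


seq = sequential(double, add_one)                      # (x * 2) + 1
par = parallel(a=double, b=add_one)                    # both branches see x
nested = parallel(wide=double, deep=sequential(add_one, double))

print(seq(3))
print(par(3))
print(nested(3))
```

For an input of 3, the sequential chain gives 7, the parallel block gives `{'a': 6, 'b': 4}`, and the nested block gives `{'wide': 6, 'deep': 8}`.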

```
||||| embeddings (ParallelBlock)  multiple             285390448   Y           |||||
||||||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||||||
|||||| user_categories (EmbeddingTable)  multiple      146088      Y          ||||||
||||||                                                                        ||||||
|||||| item_category (EmbeddingTable)  multiple        205968      Y          ||||||
||||||                                                                        ||||||
|||||| item_id (EmbeddingTable)  multiple              270891016   Y          ||||||
||||||                                                                        ||||||
|||||| user_id (EmbeddingTable)  multiple               14147376    Y         ||||||
|||||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|||||
```

We can see that `embeddings` is a `ParallelBlock` that contains all embedding tables. To get this layer, we can call `model.get_blocks_by_name("embeddings")` and set a different optimizer for it.

In [12]:
embedding_layer = example_model.get_blocks_by_name("embeddings")
optimizer = ml.MultiOptimizer(
    default_optimizer="adam",
    optimizers_and_blocks=[ml.OptimizerBlocks(ml.LazyAdam(), embedding_layer)],
)
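
Looking up a nested layer by name amounts to a tree search. The sketch below is a hypothetical illustration, not the Merlin implementation: the `Block` class, the `collect_by_name` helper, and the simplified tree (with names borrowed from the summary above) are assumptions.

```python
class Block:
    """Hypothetical stand-in for a named, possibly nested layer."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def collect_by_name(block, name):
    """Depth-first search that returns every block whose name matches."""
    found = [block] if block.name == name else []
    for child in block.children:
        found.extend(collect_by_name(child, name))
    return found


# Simplified tree; the real model has more levels between the root and `embeddings`.
embeddings = Block("embeddings", [
    Block("user_categories"),
    Block("item_category"),
    Block("item_id"),
    Block("user_id"),
])
model_tree = Block("model", [
    Block("parallel_block_6", [
        Block("sequential_block_25", [Block("parallel_block_5"), embeddings]),
    ]),
])

matches = collect_by_name(model_tree, "embeddings")
print([block.name for block in matches])
print([child.name for child in matches[0].children])
```

The search returns a list, so a name that does not occur in the tree yields an empty list rather than an error.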