In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Incremental Training with Different Learning Rates and Layer-Freezing

This notebook is created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) container. 

Merlin Models provide various model APIs for training, as shown in Notebook [Iterating over Deep Learning Models using Merlin Models](https://nvidia-merlin.github.io/models/main/examples/03-Exploring-different-models.html). We can create a model, such as [Two Tower](https://nvidia-merlin.github.io/models/main/models_overview.html?highlight=two%20tower#two-tower), [DLRM](https://nvidia-merlin.github.io/models/main/examples/03-Exploring-different-models.html#dlrm-model) and so on, by simply one line: `model=mm.DLRMModel(schema)`. 

In this example, we fine-tune a model by setting different learning rates to different layers and freezing embedding tables.


### Learning objectives
- Training a model with multiple learning rates
- Fine-tune a model by freezing embedding tables

In [2]:
import os

import tensorflow as tf

import merlin
from merlin.datasets.synthetic import generate_data
import merlin.models.tf as ml
from merlin.schema import Schema, Tags
from merlin.io.dataset import Dataset

2022-08-15 01:26:26.521434: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-15 01:26:29.462851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16255 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB-LS, pci bus id: 0000:86:00.0, compute capability: 7.0



## Build a Two-Tower Model with Merlin Models

### Generate Synthetic Dataset
We use the data of the first day as training data, and the data of the second day as test data.

In [3]:
NUM_ROWS = 100000
data_1, data_2, data_3 = generate_data("e-commerce-large", int(NUM_ROWS), set_sizes=(0.33, 0.33, 0.34))
schema = data_1.schema

### Build the Two-Tower model by Merlon Models module.

In [4]:
item_embeddings = ml.Embeddings(schema.select_by_tag(Tags.ITEM), infer_embedding_sizes=True)
query_embeddings = ml.Embeddings(schema.select_by_tag(Tags.USER), infer_embedding_sizes=True)
model = ml.TwoTowerModel(schema, 
                         query_tower=ml.InputBlockV2(schema.select_by_tag(Tags.USER), embeddings=query_embeddings).connect(ml.MLPBlock([512, 256])), 
                         item_tower=ml.InputBlockV2(schema.select_by_tag(Tags.ITEM), embeddings=item_embeddings).connect(ml.MLPBlock([512, 256])),
)

## Iteration 1: Training on the First Day's Data 

At first, we train the model on the first day's data and evaluate it on the second day's data.

In [5]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), run_eagerly=True)
model.fit(data_1, batch_size=1024, epochs=1)

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


The sampler InBatchSampler returned no samples for this batch.


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method

The sampler InBatchSampler returned no samples for this batch.




<keras.callbacks.History at 0x7f751034c310>

In [6]:
model.evaluate(data_2, batch_size=1024)

The sampler InBatchSampler returned no samples for this batch.




The sampler InBatchSampler returned no samples for this batch.




[6.903665065765381,
 0.008818386122584343,
 0.008818386122584343,
 0.011168284341692924,
 0.0019424239872023463,
 0.01942424289882183,
 0.0]

## Iteration 2: Training on the Second Day's Data 

Now we continue to train the model on the second day's data but using different strategies. Now we can use different learning rate for different layers of the model, i.e. a smaller learning rate for embedding tables while a bigger learning rate for two towers. Here we choose 0.001 as the learning rate for embedding tables.

In [7]:
optimizer = ml.MultiOptimizer(
                default_optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                optimizers_and_blocks=[ml.OptimizerBlocks(tf.keras.optimizers.Adam(learning_rate=0.001),
                                                          [item_embeddings, query_embeddings])])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), run_eagerly=True)
model.fit(data_2, batch_size=1024, epochs=1)



<keras.callbacks.History at 0x7f74fb6fed30>

Test on the third day's data.

In [8]:
model.evaluate(data_3, batch_size=1024)



The sampler InBatchSampler returned no samples for this batch.




[6.903428554534912,
 0.614406406879425,
 0.614406406879425,
 0.6152668595314026,
 0.06182496249675751,
 0.6182500123977661,
 0.0]

## Iteration 3: Training with Layer-Freezing 

Suppose we have trained the model on all previous data and achieved a good performance. Now there is incoming new data, but we do not want to change the pretrained embedding tables and only want to train the top MLP layers. We can use `model.freeze_blocks()`.

Important note about layer-freezing: Calling `compile()` on a model is meant to "freeze" the behavior of that model, which means that `trainable` variables would be preserved for the model, so if you want to freeze any layer of the model, please make sure to compile it again.


In [9]:
model.freeze_blocks([item_embeddings, query_embeddings])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), run_eagerly=True)
model.summary(expand_nested=True, show_trainable=True, line_length=80)   

Model: "retrieval_model"
___________________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     Trainable  
 two_tower_block (TwoTowerBlock)    multiple                        341211576   Y          
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|
| sequential_block_9 (SequentialBloc  multiple                     22156736    Y          |
| k)                                                                                      |
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||
|| filter_2 (Filter)              multiple                        0           Y          ||
||                                                                                       ||
|| tower_block (TowerBlock)       multiple                        22156736    Y          ||
|||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

When we call `freeze_blocks`, what do we actually do? Each layer maintains a variable called `trainable`. When a layer is created, this variable is set. The default value is `True`, which means all the weights in this layer can be updated. If you change `trainable` into `False`, the weights would not be changed anymore, unless its `trainable` variable becomes `True` again. So when `freeze_blocks` is called, the `trainable` of the layer is set to False.

When we call `freeze_blocks` on some layers, all these layers and their children layers become non-trainable. For example, if a `ParallelBlock` is frozen, the children blocks inside this `ParallelBlock` are also frozen. As shown in below summary result, we freeze the `user_embeddings`, and it is a `ParallelBlock`, all the children layers are frozen as well.

```
|||||| embeddings (ParallelBlock)  multiple                   21902016    N          ||||||
|||||||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|||||||
||||||| user_categories (EmbeddingTable)  multiple           146088      N          |||||||
|||||||                                                                             |||||||
||||||| user_shops (EmbeddingTable)  multiple                4669680     N          |||||||
|||||||                                                                             |||||||
||||||| user_brands (EmbeddingTable)  multiple               1856512     N          |||||||
|||||||                                                                             |||||||
||||||| user_intentions (EmbeddingTable)  multiple           1081184     N          |||||||
```

## Freeze and Unfreeze layers

### Freeze Layers by Passing Names or Layers
In the above example, By calling `model.freeze_blocks([item_embeddings, query_embeddings])`, we pass the layers themselves into `freeze_blocks`. But if you want to freeze the layers initialized inside the model or just one embedding table of all embedding tables, another way to freeze layers is to pass names of them. 

Here we create another simple model as an example.

In [10]:
input_block = ml.InputBlockV2(schema.select_by_name(["user_categories", "item_category", "click"]))
body = input_block.connect(ml.MLPBlock([64, 32]))
model = ml.Model(body, ml.BinaryClassificationTask("click"))

# Build the model
model.compile(optimizer="adam", run_eagerly=True)
model.fit(data_1, batch_size=1024, epochs=1)
model.summary(expand_nested=True, show_trainable=True, line_length=80)  

Model: "model"
___________________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     Trainable  
 sequential_block_13 (SequentialBlo  multiple                       357272      Y          
 ck)                                                                                       
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|
| sequential_block_11 (SequentialBlo  multiple                     352056      Y          |
| ck)                                                                                     |
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||
|| as_ragged_features_137 (AsRaggedFe  multiple                   0           Y          ||
|| atures)                                                                               ||
||                                                               

The above result shows the summary table, and all `trainable` variables for every layer are `Y`, which means the entire model is not frozen. Then we can select names of the layers we want to freeze from below summary table, and freeze them. 

Note that in Jupyter Notebook, the name `sequential_block_12` may change if you run a cell for several times, and it would raise the error, we suggest you to check the summary output table to check the layer's name.

In [11]:
model.freeze_blocks(["user_categories","sequential_block_12"])
model.summary(expand_nested=True, show_trainable=True, line_length=80)   

Model: "model"
___________________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     Trainable  
 sequential_block_13 (SequentialBlo  multiple                       357272      Y          
 ck)                                                                                       
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|
| sequential_block_11 (SequentialBlo  multiple                     352056      Y          |
| ck)                                                                                     |
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||
|| as_ragged_features_137 (AsRaggedFe  multiple                   0           Y          ||
|| atures)                                                                               ||
||                                                               

From the result below, we can see that the layers of `user_categories`,`sequential_block_12`, and their children layers are frozen, i.e. their `trainable` variables become `N`.

### Unfreeze Layers by Passing Names or Layers

The freezing and unfreezing APIs provide flexibility, which allow users to unfreeze some or all frozen layers by names or by layers themselves, just like `freeze_blocks`. For example, we can unfreeze the item embedding layer by the layer itself, and then unfreeze the `user_categories` by the name.

In [12]:
model.unfreeze_blocks("user_categories")
model.summary(expand_nested=True, show_trainable=True, line_length=80) 

Model: "model"
___________________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     Trainable  
 sequential_block_13 (SequentialBlo  multiple                       357272      Y          
 ck)                                                                                       
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|
| sequential_block_11 (SequentialBlo  multiple                     352056      Y          |
| ck)                                                                                     |
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||
|| as_ragged_features_137 (AsRaggedFe  multiple                   0           Y          ||
|| atures)                                                                               ||
||                                                               

And `unfreeze_all_frozen_blocks` is provided to unfreeze all layers at once.

In [14]:
model.unfreeze_all_frozen_blocks()
model.summary(expand_nested=True, show_trainable=True, line_length=80) 

Model: "model"
___________________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     Trainable  
 sequential_block_13 (SequentialBlo  multiple                       357272      Y          
 ck)                                                                                       
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|
| sequential_block_11 (SequentialBlo  multiple                     352056      Y          |
| ck)                                                                                     |
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||
|| as_ragged_features_137 (AsRaggedFe  multiple                   0           Y          ||
|| atures)                                                                               ||
||                                                               