In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="https://developer.download.nvidia.com//notebooks/dlsw-notebooks/merlin_models_incremental-training-with-layer-freezing/nvidia_logo.png" style="width: 90px; float: right;">

# Incremental Training with Different Learning Rates and Layer Freezing

This notebook was created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) container.

In this example, we fine-tune a model by assigning different learning rates to different layers and by freezing embedding tables. Incremental training is a common practice that lets a model keep learning from new examples: it extends the existing model's knowledge by adjusting parameters that have already been learned. Another reason to train incrementally is to resume a training job that was stopped. Here, we first show how to incrementally train the same model architecture with different hyperparameter settings (adjusted learning rates) and different datasets. Then, in a second scenario, we show how to freeze certain layers of the model, such as pretrained embedding layers, and continue training.

**Learning objectives**
- Train a model with multiple learning rates
- Fine-tune a model by freezing embedding tables

In [2]:
import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"

import tensorflow as tf
from merlin.datasets.synthetic import generate_data
import merlin.models.tf as mm
from merlin.schema import Schema, Tags
from merlin.io.dataset import Dataset

2023-01-11 15:19:26.844088: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.




2023-01-11 15:19:32.610890: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
2023-01-11 15:19:32.610991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16254 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:15:00.0, compute capability: 7.0


## Building a Two-Tower Model with Merlin Models

We choose the Two-Tower model architecture for this example. To learn more about Two-Tower models, see this [notebook](https://github.com/NVIDIA-Merlin/models/blob/main/examples/05-Retrieval-Model.ipynb).

### Generate Synthetic Dataset

Let's create three synthetic datasets with the `generate_data()` function and name them `day_1`, `day_2`, and `day_3`. We assume that each dataset was collected on a different day, which lets us show how to do incremental training with Merlin Models.

In [3]:
NUM_ROWS = int(os.environ.get("NUM_ROWS", '10000'))
day_1, day_2, day_3 = generate_data("e-commerce", int(NUM_ROWS), set_sizes=(0.33, 0.33, 0.34))
schema = day_1.schema.without(['click', 'conversion'])
day_1.schema = schema
day_2.schema = schema
day_3.schema = schema
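
To make the `set_sizes=(0.33, 0.33, 0.34)` argument concrete, here is a minimal sketch of turning fractions into three row ranges. It is a plain-Python illustration only: `split_sizes` and `split_rows` are hypothetical helpers, and this is not a description of how `generate_data()` is implemented internally.

```python
def split_sizes(num_rows, fractions):
    """Turn fractions into integer row counts that sum to num_rows."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    sizes = [int(num_rows * f) for f in fractions]
    # Rows lost to rounding down are assigned to the last split.
    sizes[-1] += num_rows - sum(sizes)
    return sizes


def split_rows(rows, fractions):
    """Cut a list of rows into consecutive chunks, one per fraction."""
    splits, start = [], 0
    for size in split_sizes(len(rows), fractions):
        splits.append(rows[start:start + size])
        start += size
    return splits


# Hypothetical stand-in for the generated rows.
rows = list(range(10000))
day_1_rows, day_2_rows, day_3_rows = split_rows(rows, (0.33, 0.33, 0.34))
print(len(day_1_rows), len(day_2_rows), len(day_3_rows))
```

The three chunks never overlap and together cover every row, which is the property incremental training relies on when each chunk stands in for one day of data.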



In [4]:
day_1.schema

| | name | tags | dtype | is_list | is_ragged | properties.domain.min | properties.domain.max | properties.domain.name |
|---|---|---|---|---|---|---|---|---|
| 0 | user_categories | (Tags.USER, Tags.CATEGORICAL) | int64 | False | False | 0 | 300 | user_categories |
| 1 | user_shops | (Tags.USER, Tags.CATEGORICAL) | int64 | False | False | 0 | 500 | user_shops |
| 2 | user_brands | (Tags.USER, Tags.CATEGORICAL) | int64 | False | False | 0 | 250 | user_brands |
| 3 | user_intentions | (Tags.USER, Tags.CATEGORICAL) | int64 | False | False | 0 | 50 | user_intentions |
| 4 | user_profile | (Tags.USER, Tags.CATEGORICAL) | int64 | False | False | 0 | 20 | user_profile |
| 5 | user_group | (Tags.USER, Tags.CATEGORICAL) | int64 | False | False | 0 | 14 | user_group |
| 6 | user_gender | (Tags.USER, Tags.CATEGORICAL) | int64 | False | False | 0 | 3 | user_gender |
| 7 | user_age | (Tags.USER, Tags.CATEGORICAL) | int64 | False | False | 0 | 8 | user_age |
| 8 | user_consumption_1 | (Tags.USER, Tags.CATEGORICAL) | int64 | False | False | 0 | 4 | user_consumption_1 |
| 9 | user_consumption_2 | (Tags.USER, Tags.CATEGORICAL) | int64 | False | False | 0 | 4 | user_consumption_2 |


### Iteration 1: Using Different Learning Rates
First, we train the model on the first day's data and evaluate it on the second day's data.

Define the embeddings for the features in the item and query towers using the `Embeddings` class. Setting `infer_embedding_sizes` to `True` derives each embedding dimension automatically from the feature's cardinality in the schema.

In [5]:
item_embeddings = mm.Embeddings(schema.select_by_tag(Tags.ITEM), infer_embedding_sizes=True)
query_embeddings = mm.Embeddings(schema.select_by_tag(Tags.USER), infer_embedding_sizes=True)
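
The exact rule that `infer_embedding_sizes=True` applies is not shown in this notebook. As a rough illustration of deriving a dimension from cardinality, the sketch below assumes a common heuristic: scale the fourth root of the cardinality and round up to a multiple of 8. The function name, the multiplier of 2.0, and the rounding are all assumptions made for this example, not a statement of the Merlin Models API.

```python
import math


def infer_embedding_dim(cardinality, multiplier=2.0, multiple_of=8):
    """Hypothetical heuristic: dimension grows with the fourth root of cardinality."""
    raw_dim = math.ceil(multiplier * cardinality ** 0.25)
    # Round up so the dimension is a multiple of `multiple_of`.
    return int(math.ceil(raw_dim / multiple_of) * multiple_of)


# Domain maxima taken from the schema above, used here as cardinalities.
for name, cardinality in [("user_shops", 500), ("user_categories", 300), ("user_gender", 3)]:
    print(name, infer_embedding_dim(cardinality))
```

The point of such a rule is that high-cardinality features get wider embeddings than low-cardinality ones without anyone having to choose each size by hand.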

Build the Two-Tower model:

In [6]:
model = mm.TwoTowerModel(schema, 
                         query_tower=mm.InputBlockV2(schema.select_by_tag(Tags.USER), categorical=query_embeddings).connect(mm.MLPBlock([128, 64])), 
                         item_tower=mm.InputBlockV2(schema.select_by_tag(Tags.ITEM), categorical=item_embeddings).connect(mm.MLPBlock([128, 64])),
)

#### Training the model with first day's data

In [7]:
BATCH_SIZE = int(os.environ.get(
    "BATCH_SIZE", 
    '1024'
))

In [8]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
model.fit(day_1, batch_size=BATCH_SIZE, epochs=1)

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


The sampler InBatchSampler returned no samples for this batch.




<keras.callbacks.History at 0x7f5d8304a430>

**Evaluate the model using the day_2 dataset**

Training a model on data from one time period and then evaluating it on data from a later period, closer to the test set, is a common and sensible practice. Note, however, that the data in this example is split randomly, so `day_1`, `day_2`, and `day_3` do not form a real temporal sequence. The evaluation metrics are therefore not meaningful in themselves; this is a hypothetical example that showcases the functionality.

In [9]:
eval_metrics = model.evaluate(day_2, batch_size=BATCH_SIZE, return_dict=True)
eval_metrics

The sampler InBatchSampler returned no samples for this batch.




{'loss': 6.808794975280762,
 'recall_at_10': 0.03030303120613098,
 'mrr_at_10': 0.028212962672114372,
 'ndcg_at_10': 0.028665972873568535,
 'map_at_10': 0.028212962672114372,
 'precision_at_10': 0.0030303029343485832,
 'regularization_loss': 0.0,
 'loss_batch': 5.410911560058594}
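
To help read the top-k metrics above, here is a minimal sketch of how recall, precision, and MRR at `k` can be computed from ranks. It assumes exactly one relevant item per query, and the helper names are hypothetical; it is not the metric code that Merlin Models runs. Under that assumption, precision at `k` equals recall at `k` divided by `k`, which matches the reported values (`recall_at_10` of about 0.0303 and `precision_at_10` of about 0.00303).

```python
def rank_of_positive(scores, positive_index):
    """1-based rank of the positive item when items are sorted by descending score."""
    positive_score = scores[positive_index]
    return 1 + sum(1 for score in scores if score > positive_score)


def metrics_at_k(ranks, k=10):
    """Top-k metrics, assuming each query has exactly one relevant item."""
    num_queries = len(ranks)
    hits = [rank for rank in ranks if rank <= k]
    recall = len(hits) / num_queries
    return {
        "recall": recall,
        # At most one of the k retrieved items can be relevant.
        "precision": recall / k,
        # Reciprocal rank is counted as 0 when the positive is outside the top k.
        "mrr": sum(1.0 / rank for rank in hits) / num_queries,
    }


# Hypothetical ranks of the true item for four queries.
example_ranks = [1, 3, 12, 2]
example_metrics = metrics_at_k(example_ranks, k=10)
print(example_metrics)
```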

#### Training the model with second day's data 

Now we continue training the model on the second day's data, but with a different strategy. We can use different learning rates for different layers of the model, for example a smaller learning rate for the embedding tables and a larger one for the two towers. A small learning rate keeps the updates to the embedding weights small. Here we choose `0.001` as the learning rate for the embedding tables and keep `0.01` for the rest of the model.

In [10]:
optimizer = mm.MultiOptimizer(
    default_optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=0.01),
    optimizers_and_blocks=[
        mm.OptimizerBlocks(
            # Smaller learning rate for the embedding tables
            tf.keras.optimizers.legacy.Adam(learning_rate=0.001),
            [item_embeddings, query_embeddings],
        )
    ],
)
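
The idea behind per-block learning rates can be sketched without any framework. The example below is a simplified illustration that uses plain SGD instead of Adam: parameters in the hypothetical `"embeddings"` group are updated with a learning rate of 0.001, and everything else falls back to the default of 0.01. The function and parameter names are made up for this sketch and are not part of `mm.MultiOptimizer`.

```python
def sgd_step(params, grads, group_of, lr_by_group, default_lr):
    """One SGD update where each parameter uses the learning rate of its group."""
    updated = {}
    for name, value in params.items():
        # Parameters without a group use the default learning rate.
        lr = lr_by_group.get(group_of.get(name), default_lr)
        updated[name] = value - lr * grads[name]
    return updated


# Hypothetical scalar parameters, all with the same gradient.
params = {"item_embedding": 1.0, "query_embedding": 1.0, "mlp_weight": 1.0}
grads = {name: 1.0 for name in params}
group_of = {"item_embedding": "embeddings", "query_embedding": "embeddings"}

new_params = sgd_step(params, grads, group_of, {"embeddings": 0.001}, default_lr=0.01)
print(new_params)
```

With identical gradients, the embedding parameters move ten times less than the MLP weight, which is the effect we want when the embeddings should change only slightly.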

In [11]:
model.compile(optimizer=optimizer)
model.fit(day_2, batch_size=BATCH_SIZE, epochs=1)



<keras.callbacks.History at 0x7f5d82080190>

**Evaluate on the third day's data**

In [12]:
eval_metrics = model.evaluate(day_3, batch_size=BATCH_SIZE,  return_dict=True)



In [13]:
eval_metrics

{'loss': 6.803046226501465,
 'recall_at_10': 0.03735294193029404,
 'mrr_at_10': 0.03274918347597122,
 'ndcg_at_10': 0.03387266397476196,
 'map_at_10': 0.03274918347597122,
 'precision_at_10': 0.0037352940998971462,
 'regularization_loss': 0.0,
 'loss_batch': 5.776802062988281}

### Iteration 2: Training with Freezing Layers

Let's consider a new situation. Suppose we have trained the model on all previous data and achieved good performance. New data is now arriving, but we want to keep the pretrained embedding tables unchanged and train only the top MLP layers. We can do this with `model.freeze_blocks()`.

What does `freeze_blocks` actually do? Each layer has an attribute called `trainable`, which is set when the layer is created. Its default value is `True`, which means that all the weights in the layer can be updated during training. If you set `trainable` to `False`, the weights are no longer updated until `trainable` is set back to `True`. Calling `freeze_blocks` sets `trainable` to `False` on the given layers.

In [14]:
model.freeze_blocks([item_embeddings, query_embeddings])

# recompile your model after making any changes
# to the `trainable` attribute of any inner layer, so that your changes
# are taken into account
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01)) 
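
As a framework-free illustration of what a `trainable` flag means for a training step, consider the sketch below. The `Layer` class and `train_step` function are hypothetical stand-ins, not Keras or Merlin classes, and the update is plain SGD: a layer whose `trainable` flag is `False` is skipped, so its weight stays exactly as it was.

```python
class Layer:
    """Hypothetical layer holding a single scalar weight."""

    def __init__(self, name, weight, trainable=True):
        self.name = name
        self.weight = weight
        self.trainable = trainable


def train_step(layers, grads, lr):
    """Apply one SGD update, skipping every frozen layer."""
    for layer in layers:
        if not layer.trainable:
            continue  # frozen: leave the weight untouched
        layer.weight -= lr * grads[layer.name]


embedding = Layer("embedding", weight=1.0)
mlp = Layer("mlp", weight=1.0)

embedding.trainable = False  # freeze the pretrained embedding
train_step([embedding, mlp], grads={"embedding": 1.0, "mlp": 1.0}, lr=0.01)
print(embedding.weight, mlp.weight)
```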

In [15]:
model.fit(day_1, batch_size=BATCH_SIZE, epochs=1)
model.summary(expand_nested=True, show_trainable=True, line_length=80)  

Model: "retrieval_model"
___________________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     Trainable  
 two_tower_block (TwoTowerBlock)    multiple                        164728      Y          
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|
| tower_block (TowerBlock)         multiple                        122120      Y          |
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||
|| sequential_block_4 (SequentialBloc  multiple                   122120      Y          ||
|| k)                                                                                    ||
|||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|||
||| parallel_block (ParallelBlock)  multiple                     97352       Y          |||
||||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

When we call `freeze_blocks` on some layers, those layers and all of their child layers become non-trainable. For example, if a `ParallelBlock` is frozen, the blocks inside it are frozen too. In the excerpt of the summary below, `query_embeddings` is a `ParallelBlock` that we froze, so all of its child embedding tables are marked `N` in the `Trainable` column as well.

```
|||||| embeddings (ParallelBlock)  multiple                   21902016    N          ||||||
|||||||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|||||||
||||||| user_categories (EmbeddingTable)  multiple           146088      N          |||||||
|||||||                                                                             |||||||
||||||| user_shops (EmbeddingTable)  multiple                4669680     N          |||||||
|||||||                                                                             |||||||
||||||| user_brands (EmbeddingTable)  multiple               1856512     N          |||||||
|||||||                                                                             |||||||
||||||| user_intentions (EmbeddingTable)  multiple           1081184     N          |||||||
```
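
The way freezing reaches child layers can be sketched as a recursive walk over a tree of blocks. The `Block` class and the tree below are hypothetical and only mirror the shape of the summary above; this is not the implementation of `freeze_blocks`.

```python
class Block:
    """Hypothetical block that may contain child blocks."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.trainable = True

    def freeze(self):
        """Mark this block and every descendant as non-trainable."""
        self.trainable = False
        for child in self.children:
            child.freeze()


def trainable_flags(block):
    """Collect {name: trainable} for a block and all of its descendants."""
    flags = {block.name: block.trainable}
    for child in block.children:
        flags.update(trainable_flags(child))
    return flags


embeddings = Block("embeddings", [Block("user_categories"), Block("user_shops")])
tower = Block("tower_block", [embeddings, Block("mlp_block")])

embeddings.freeze()  # only the embeddings subtree is frozen
for name, trainable in trainable_flags(tower).items():
    print(name, "Y" if trainable else "N")
```

Freezing the `embeddings` block flips the flag on both embedding tables beneath it, while the enclosing tower and the MLP block stay trainable.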

## Summary

In this example notebook, we learned how to use different learning rates for different layers of a model architecture and how to freeze embedding layers so that their parameters are not updated during training.