In [13]:
# Copyright 2024 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_hugectr_hps-sok-to-dlrm-demo/nvidia_logo.png" style="width: 90px; float: right;">

# SOK DUMP/LOAD Demo

## Overview

This notebook demonstrates how to use SOK to dump embedding weights trained in parallel to the filesystem and load them back, and it verifies that the dump and load operations are correct. It covers `sok.dump`, `sok.load`, `sok.export`, and `sok.assign`.

For more details about SOK, please refer to [SOK Documentation](https://nvidia-merlin.github.io/HugeCTR/sparse_operation_kit/master/index.html). 

## Installation

### Get SOK from NGC

SOK Python modules are preinstalled in the 23.12 and later [Merlin Tensorflow Container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow): `nvcr.io/nvidia/merlin/merlin-tensorflow:nightly`.

You can check that the required libraries are present by running the following command after launching this container.

```bash
$ python3 -c "import sparse_operation_kit as sok"
```

## Configurations

To demonstrate SOK's dump/load functionality, this demo runs a forward and backward pass on multiple SOK variables with several TensorFlow optimizers in turn, then dumps the variables and loads them back. It also verifies that the values after the load match the values before the dump.

This notebook follows the workflow described in [Introduction to Horovod](https://enccs.github.io/upscalingAIcontainer/hvd_intro/?highlight=jupyter#training-with-model-fit): first define a function, then execute it with `horovod.run`.

First, we configure some SOK variable properties and define a function for the SOK forward and backward pass. This function takes SOK variables, lookup ids, and an optimizer, and performs the forward pass, the backward pass, and the optimizer update for the SOK variables.

In [14]:
import horovod
import tensorflow as tf
import horovod.tensorflow as hvd
import numpy as np
import sparse_operation_kit as sok
rows = [8192 * 5, 8192]
cols = [128, 4]
hotness = [10, 3]
combiners = ["mean", "sum"]
batch_size = 8192
iters = 100
initial_vals = [13, 17]

# One training step: forward lookup, backward pass, optimizer update
def train_step(params, indices, sok_optimizer):
    with tf.GradientTape() as tape:
        embeddings = sok.lookup_sparse(params, indices, combiners=combiners)
        loss = 0
        for i in range(len(embeddings)):
            loss = loss + tf.reduce_sum(embeddings[i])
    grads = tape.gradient(loss, params)
    sok_optimizer.apply_gradients(zip(grads, params))
    loss = hvd.allreduce(loss, op=hvd.Sum)
    return loss
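
To make the `combiners` setting concrete, here is a minimal pure-Python sketch of what a "sum" or "mean" combiner does to the embedding rows looked up for one sample. It only illustrates the reduction and is not SOK's implementation; the `combine` helper and the toy `table` are hypothetical names introduced for this example.

```python
def combine(table, ids, combiner):
    """Reduce the embedding rows selected by `ids` into a single vector."""
    rows = [table[i] for i in ids]
    dim = len(rows[0])
    summed = [sum(row[d] for row in rows) for d in range(dim)]
    if combiner == "sum":
        return summed
    if combiner == "mean":
        return [s / len(rows) for s in summed]
    raise ValueError(f"unsupported combiner: {combiner}")


# Toy table: 4 ids, embedding dimension 2 (hypothetical values).
table = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0], 3: [7.0, 8.0]}

# One sample with hotness 3 looks up ids 0, 1 and 3.
summed = combine(table, [0, 1, 3], "sum")     # [11.0, 14.0]
averaged = combine(table, [0, 1, 3], "mean")  # [11/3, 14/3]
```

In the cell above, the first variable uses "mean" and the second uses "sum", and the loss is simply the sum of all combined embeddings.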

Next, define a function to evaluate SOK dump and load with a given optimizer. It first calls `train_step` for one forward and backward pass, then dumps the values of the SOK variables to the filesystem. Next, it sets the SOK variables to 0 and loads the values back from the filesystem. Finally, it checks that the values before the dump and after the load are consistent.
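
The same snapshot, dump, zero, load, compare pattern can be sketched without SOK. The example below is a stand-in only: it uses a plain dictionary as the "embedding table" and JSON as the on-disk format, and the `dump` and `load` helpers are hypothetical functions written for this illustration, not `sok.dump` and `sok.load`.

```python
import json
import os
import tempfile


def dump(path, table):
    """Write the table to disk (stand-in for a real dump call)."""
    with open(path, "w") as f:
        json.dump({str(k): v for k, v in table.items()}, f)


def load(path, table):
    """Overwrite the table in place with the values stored on disk."""
    with open(path) as f:
        stored = json.load(f)
    for k, v in stored.items():
        table[int(k)] = v


table = {0: [13.0, 12.5], 7: [17.0, 16.0]}
before = {k: list(v) for k, v in table.items()}  # snapshot taken before the dump

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weight.json")
    dump(path, table)
    # Zero every row so that a passing check cannot come from stale values.
    for k in table:
        table[k] = [0.0] * len(table[k])
    zeroed = {k: list(v) for k, v in table.items()}
    load(path, table)

round_trip_ok = table == before
```

Zeroing the table between the dump and the load is what makes the final comparison meaningful: without it, the check would pass even if the load did nothing.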

In [15]:
def sok_dump_load_evaluate(optimizer, optimizer_id):
    sok_optimizer = sok.OptimizerWrapper(optimizer)
    # sok variables
    sok_vars = [
        sok.DynamicVariable(dimension=cols[i],
                            var_type="hybrid",
                            initializer=str(initial_vals[i]),
                            init_capacity=1024 * 1024,
                            max_capacity=1024 * 1024,)
        for i in range(len(cols))
    ]

    # Prepare lookup ids. local_indices holds the ids owned by this rank
    # (id % hvd.size() == hvd.rank()).
    local_indices = []
    for row in rows:
        local_size = row // hvd.size()
        if hvd.rank() < row % hvd.size():
            local_size += 1
        indices = np.arange(local_size) * hvd.size() + hvd.rank()
        indices = tf.convert_to_tensor(indices, dtype=tf.int64)
        local_indices.append(indices)

    total_indices = []
    for i in range(len(rows)):
        offsets = np.random.randint(1, hotness[i] + 1, iters * batch_size)
        offsets = tf.convert_to_tensor(offsets, dtype=tf.int64)
        offsets = hvd.broadcast(offsets, root_rank=0)
        values = np.random.randint(0, rows[i], tf.reduce_sum(offsets))
        values = tf.convert_to_tensor(values, dtype=tf.int64)
        values = hvd.broadcast(values, root_rank=0)
        total_indices.append(tf.RaggedTensor.from_row_lengths(values, offsets))
    left = batch_size // hvd.size() * hvd.rank()
    right = batch_size // hvd.size() * (hvd.rank() + 1)
    indices = []
    for j in range(len(total_indices)):
        indices.append(total_indices[j][batch_size + left : batch_size + right])
        
    # Run one forward and backward pass
    _ = train_step(sok_vars, indices, sok_optimizer)

    # Export every embedding table and optimizer slot state so that the values
    # held by the SOK variables before sok.dump can be compared with the
    # values held after sok.load.
    vars_unique_ids = []
    for sok_var in sok_vars:
        vars_unique_ids.append(sok_var._unique_id)
    # Check whether the optimizer holds slot states for these variables
    have_state = True
    for vars_unique_id in vars_unique_ids:
        tmp_slot = optimizer._slots.get(vars_unique_id)
        if tmp_slot is None:
            have_state = False
            break
    slot_names = optimizer.get_slot_names()
    slot_states_list_raw = []
    slot_states_index_list_raw = []
    slot_vars_list = []
    if have_state:
        for slot_name in slot_names:
            slot_vars_np_list_raw = []
            slot_vars_index_np_list_raw = []
            tmp_slot_var_list = []
            for sok_var in sok_vars:
                slot_var = optimizer.get_slot(sok_var, slot_name)
                ex_indices, ex_values = sok.export(slot_var)
                slot_vars_np_list_raw.append(ex_values.numpy())
                slot_vars_index_np_list_raw.append(ex_indices.numpy())
                tmp_slot_var_list.append(slot_var)
            slot_states_list_raw.append(slot_vars_np_list_raw)
            slot_states_index_list_raw.append(slot_vars_index_np_list_raw)
            slot_vars_list.append(tmp_slot_var_list)

    sok_var_nps_raw = []
    sok_var_index_nps_raw = []
    sok_var_nps_new = []
    sok_var_index_nps_new = []

    for sok_var in sok_vars:
        ex_indices, ex_values = sok.export(sok_var)
        sok_var_nps_raw.append(ex_values.numpy())
        sok_var_index_nps_raw.append(ex_indices.numpy())
    # Export of the embedding tables and optimizer slot states is done
    
    # Dump sok variable and opt slot states to file system    
    sok.dump("./weight", sok_vars, sok_optimizer)


    # Set every SOK variable to zero (like a memset) so that the check below
    # proves sok.load actually restored the values
    for sok_var in sok_vars:
        ex_indices, ex_values = sok.export(sok_var)
        zeros_values = tf.zeros(ex_values.shape)
        sok.assign(sok_var, ex_indices, zeros_values)

    
    for tmp_slot_list in slot_vars_list:
        for tmp_slot_var in tmp_slot_list:
            ex_indices, ex_values = sok.export(tmp_slot_var)
            zeros_values = tf.zeros(ex_values.shape)
            sok.assign(tmp_slot_var, ex_indices, zeros_values)
    # Zeroing of variables and slot states is done
    
    # Load weight from sok dump
    sok.load("./weight", sok_vars, sok_optimizer)

    # check var value before dump and var value after load
    for sok_var in sok_vars:
        ex_indices, ex_values = sok.export(sok_var)
        sok_var_nps_new.append(ex_values.numpy())
        sok_var_index_nps_new.append(ex_indices.numpy())
    slot_states_list_new = []
    slot_states_index_list_new = []
    if have_state:
        for slot_name in slot_names:
            slot_vars_np_list_new = []
            slot_vars_index_np_list_new = []
            for sok_var in sok_vars:
                slot_var = optimizer.get_slot(sok_var, slot_name)
                ex_indices, ex_values = sok.export(slot_var)
                slot_vars_np_list_new.append(ex_values.numpy())
                slot_vars_index_np_list_new.append(ex_indices.numpy())
            slot_states_list_new.append(slot_vars_np_list_new)
            slot_states_index_list_new.append(slot_vars_index_np_list_new)

    for i in range(len(sok_vars)):
        var_sorted = np.argsort(sok_var_index_nps_raw[i])
        var_pos = np.searchsorted(
            sok_var_index_nps_raw[i][var_sorted], sok_var_index_nps_new[i]
        )
        remap_indices = var_sorted[var_pos]
        tmp_sok_var_nps_raw = sok_var_nps_raw[i][remap_indices, :]

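        # Compare absolute differences; a signed difference would let large
        # negative errors pass the check
        assert (np.abs(sok_var_nps_new[i] - tmp_sok_var_nps_raw) < 1e-5).all()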

    if have_state:
        for i, tmp_slot_states_list in enumerate(slot_states_list_new):
            for j, tmp_array in enumerate(tmp_slot_states_list):
                index_raw = slot_states_index_list_raw[i][j]
                index_new = slot_states_index_list_new[i][j]
                var_sorted = np.argsort(index_raw)
                var_pos = np.searchsorted(index_raw[var_sorted], index_new)
                remap_indices = var_sorted[var_pos]
                tmp_var_raw = slot_states_list_raw[i][j][remap_indices, :]
                assert (np.abs(slot_states_list_new[i][j] - tmp_var_raw) < 1e-5).all()
    print(
        "[SOK INFO] dump load distribute dynamic test {} optimizer successfully".format(optimizer_id)
    )
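
The remapping step in the comparison exists because the keys exported after the load are not assumed to come back in the same order as the keys exported before the dump. The following NumPy sketch shows the `argsort` plus `searchsorted` alignment on toy data; the array names and values are hypothetical, and the extra key-equality guard is an addition for this example rather than part of the notebook code.

```python
import numpy as np

# Keys and rows as exported before the dump (arbitrary order).
keys_raw = np.array([42, 7, 19, 3], dtype=np.int64)
vals_raw = np.array([[4.2], [0.7], [1.9], [0.3]], dtype=np.float32)

# The same table exported after the load, in a different order.
keys_new = np.array([3, 19, 42, 7], dtype=np.int64)
vals_new = np.array([[0.3], [1.9], [4.2], [0.7]], dtype=np.float32)

order = np.argsort(keys_raw)                      # positions that sort keys_raw
pos = np.searchsorted(keys_raw[order], keys_new)  # slot of each new key in the sorted keys
remap = order[pos]                                # row of vals_raw that holds each new key

# Guard: searchsorted returns an insertion point even for keys that are absent,
# so confirm that the remapped keys really match.
assert np.array_equal(keys_raw[remap], keys_new)

aligned_raw = vals_raw[remap, :]
max_abs_diff = np.abs(vals_new - aligned_raw).max()
```

Taking the absolute value before comparing with the tolerance matters: a signed difference below `1e-5` would also accept arbitrarily large negative errors.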

Define a function that evaluates SOK dump and load with different TensorFlow optimizers.

In [17]:
import horovod

def training_func():
    import os
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '4'
    import tensorflow as tf
    import horovod.tensorflow as hvd
    import numpy as np
    import sparse_operation_kit as sok

    hvd.init()
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")
    sok.init()

    # TensorFlow optimizers supported by SOK that this demo exercises
    optimizers = [
        tf.optimizers.SGD(learning_rate=1.0),
        tf.optimizers.SGD(learning_rate=1.0, momentum=0.9),
        tf.optimizers.Adamax(learning_rate=1.0, beta_1=0.9, beta_2=0.999),
        tf.optimizers.Adadelta(learning_rate=1.0),
        tf.optimizers.Adagrad(learning_rate=1.0),
        tf.optimizers.Ftrl(learning_rate=1.0),
    ]

    # Suppress some TensorFlow stderr output
    class suppress_stderr:
        def __init__(self):
            self.null_fd = os.open(os.devnull, os.O_RDWR)
            self.save_fd = os.dup(2)

        def __enter__(self):
            os.dup2(self.null_fd, 2)

        def __exit__(self, *_):
            os.dup2(self.save_fd, 2)
            os.close(self.null_fd)
            os.close(self.save_fd)
            
    with suppress_stderr():
        for optimizer_id, optimizer in enumerate(optimizers):
            sok_dump_load_evaluate(optimizer, optimizer_id)
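
The `suppress_stderr` class works at the file-descriptor level with `os.dup` and `os.dup2`, which also silences output written directly to descriptor 2 by native code. The sketch below shows the same technique as a reusable context manager. To make the effect observable it redirects to a temporary file instead of `os.devnull`; the `redirect_fd` name and the `try`/`finally` restore are choices made for this example, not part of the notebook.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def redirect_fd(fd, target_path):
    """Temporarily point file descriptor `fd` at `target_path`."""
    saved = os.dup(fd)  # keep a copy of the original descriptor
    target = os.open(target_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.dup2(target, fd)  # fd now writes to target_path
        yield
    finally:
        os.dup2(saved, fd)  # restore the original even if the body raises
        os.close(target)
        os.close(saved)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "stderr.log")
    with redirect_fd(2, log_path):
        # Writes to descriptor 2 directly, bypassing sys.stderr.
        os.write(2, b"noisy native log line\n")
    with open(log_path, "rb") as f:
        captured = f.read()
```

Passing `os.devnull` as `target_path` gives the discard behavior used in the cell above.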

## Run With Horovod

Use `horovod.run` to launch the task with two processes.

In [18]:
horovod.run(training_func, np=2, verbose=False, disable_cache=True, use_mpi=True)

[1,0]<stdout>:[SOK INFO] Import /usr/local/lib/python3.10/dist-packages/merlin_sok-2.0.0-py3.10-linux-x86_64.egg/sparse_operation_kit/lib/libsparse_operation_kit.so
[1,1]<stdout>:[SOK INFO] Import /usr/local/lib/python3.10/dist-packages/merlin_sok-2.0.0-py3.10-linux-x86_64.egg/sparse_operation_kit/lib/libsparse_operation_kit.so
[1,0]<stdout>:[SOK INFO] Initialize finished, communication tool: horovod
[1,1]<stdout>:[SOK INFO] Initialize finished, communication tool: horovod
[1,0]<stdout>:[SOK INFO] SOK dump weight in path: ./weight  success!
[1,1]<stdout>:[SOK INFO] SOK dump weight in path: ./weight  success!
[1,1]<stdout>:[SOK INFO] SOK load weight from path: ./weight  success!
[1,0]<stdout>:[SOK INFO] SOK load weight from path: ./weight  success!
[1,0]<stdout>:[SOK INFO] dump load distribute dynamic test 0 optimizer successfully
[1,1]<stdout>:[SOK INFO] dump load distribute dynamic test 0 optimizer successfully
[1,0]<stdout>:[SOK INFO] SOK dump weight in path: ./weight  success!
[1,1]

[None, None]

In [19]:
!ls -l ./weight

total 292224
-rw-r--r-- 1 nobody nogroup 14065960 May 30 02:12 sok_dynamic_Variable_0_0-Adam-m
-rw-r--r-- 1 nobody nogroup 14065960 May 30 02:12 sok_dynamic_Variable_0_0-Adam-v
-rw-r--r-- 1 nobody nogroup   218880 May 30 02:31 sok_dynamic_Variable_0_0-key
-rw-r--r-- 1 nobody nogroup 13989672 May 30 02:31 sok_dynamic_Variable_0_0-weight
-rw-r--r-- 1 nobody nogroup 13922600 May 30 02:31 sok_dynamic_Variable_10_0-Ftrl-accumulator
-rw-r--r-- 1 nobody nogroup 13922600 May 30 02:31 sok_dynamic_Variable_10_0-Ftrl-linear
-rw-r--r-- 1 nobody nogroup   217832 May 30 02:31 sok_dynamic_Variable_10_0-key
-rw-r--r-- 1 nobody nogroup 13922600 May 30 02:31 sok_dynamic_Variable_10_0-weight
-rw-r--r-- 1 nobody nogroup   113752 May 30 02:31 sok_dynamic_Variable_11_0-Ftrl-accumulator
-rw-r--r-- 1 nobody nogroup   113752 May 30 02:31 sok_dynamic_Variable_11_0-Ftrl-linear
-rw-r--r-- 1 nobody nogroup    57024 May 30 02:31 sok_dynamic_Variable_11_0-key
-rw-r--r-- 1 nobody nogroup   113752 May 30 02:31 sok_dyn