# Introduction to the HugeCTR Embedding Collection
## About the HugeCTR embedding collection
The embedding collection lets users group multiple embedding lookup operations together. The embedding tables in a collection can differ in vector size, id space, optimizer, and table placement strategy, which improves both flexibility and performance.
This notebook covers:
1. The API of the embedding collection.
2. The Embedding Table Placement Strategy (ETPS) and how to configure it in the embedding collection.
3. An example that uses the embedding collection in the DLRM model on the Criteo dataset for training and evaluation, with two different ETPS configurations provided as references.

## API
Two APIs are related to the embedding collection:
### `hugectr.EmbeddingTableConfig`
A placeholder for configuring the attributes of an embedding table.
Parameters:
* `id_space`: Integer, the id space this table belongs to. Typically, it is the number of `EmbeddingTableConfig` objects that already exist when you create a new embedding table.
* `max_vocabulary_size`: Integer, the vocabulary size of this table. If positive, it is the number of embedding vectors the table contains, and exceeding it during training or evaluation causes an overflow. If you do not know the exact size of the embedding table, specify -1 so that a dynamic embedding table is used, whose size can grow during training or evaluation.
* `ev_size`: Integer, the size of each embedding vector in this table.
* `min_key`: Integer, the minimum value of the input key.
* `max_key`: Integer, the maximum value of the input key.
* `opt_params`: Optional, `hugectr.Optimizer`, the optimizer to use for this embedding table. If not specified, the optimizer specified in `hugectr.Model` is used.

Example:
```python
# create embedding table
num_embedding = 26
table_size_array = [203931, 18598, 14092, 7012, 18977, 4, 6385, 1245, 49, 186213, 71328, 67288, 11, 2168, 7338, 61, 4, 932, 15, 204515, 141526, 199433, 60919, 9137, 71, 34]
embedding_table_list = []
for i in range(num_embedding):
    embedding_table_list.append(hugectr.EmbeddingTableConfig(id_space=i, max_vocabulary_size=table_size_array[i], ev_size=128, min_key=0, max_key=table_size_array[i]))
```
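The table sizes above also give a rough idea of how much memory the embedding tables need. The sketch below assumes each embedding value is stored as a 4-byte float, which this notebook does not state, and it ignores optimizer state. `BYTES_PER_VALUE` and `estimate_table_bytes` are illustrative names, not part of HugeCTR.

```python
# Rough per-table memory estimate for the 26 tables configured above.
# Assumption (not from the notebook): 4 bytes per embedding value (float32).
table_size_array = [203931, 18598, 14092, 7012, 18977, 4, 6385, 1245, 49,
                    186213, 71328, 67288, 11, 2168, 7338, 61, 4, 932, 15,
                    204515, 141526, 199433, 60919, 9137, 71, 34]
EV_SIZE = 128          # same ev_size as in the example above
BYTES_PER_VALUE = 4    # hypothetical storage width


def estimate_table_bytes(num_rows, ev_size=EV_SIZE, bytes_per_value=BYTES_PER_VALUE):
    """Return the approximate size in bytes of one embedding table."""
    return num_rows * ev_size * bytes_per_value


table_bytes = [estimate_table_bytes(n) for n in table_size_array]
total_bytes = sum(table_bytes)
# Index of the table with the most rows.
largest = max(range(len(table_size_array)), key=lambda i: table_size_array[i])

print(f"total rows: {sum(table_size_array)}")
print(f"total size: {total_bytes / 2**20:.1f} MiB")
print(f"largest table: index {largest}, {table_bytes[largest] / 2**20:.1f} MiB")
```

An estimate like this is only a starting point for deciding which tables are small enough to replicate on every GPU and which ones need to be spread across GPUs.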

### `hugectr.EmbeddingPlanner`
`EmbeddingPlanner` provides `embedding_lookup` for users to specify lookup operations on `hugectr.EmbeddingTableConfig`. 

#### `embedding_lookup`
Parameters:
  * `emb_table`: `hugectr.EmbeddingTableConfig`, the embedding table to look up.
  * `input`: str, the input tensor name. It should be compatible with the `data_reader_sparse_param_array` in `hugectr.Input` in `hugectr.Model`. The training script later in this notebook passes it with the keyword `bottom_name`.
  * `output`: str, the output tensor name. The shape of the output tensor is (batch_size, 1, embedding vector size). The training script later in this notebook passes it with the keyword `top_name`.
  * `combiner`: str, the combiner operation. `mean`, `sum`, and `concat` are currently supported.

#### `create_embedding_collection`
After finishing all `embedding_lookup` calls, use `EmbeddingPlanner.create_embedding_collection` to create a `hugectr.EmbeddingCollection`, which can be added to `hugectr.Model` for training and evaluation.
Parameters:
  * `plan_file`: str, a JSON file that describes the table placement strategy. It is covered in more detail in the section `Plan and Embedding Table Placement Strategy`.

Example:
```python
embedding_planner = hugectr.EmbeddingPlanner()
# register one lookup per embedding table
for i in range(num_embedding):
    embedding_planner.embedding_lookup(embedding_table_list[i], "data{}".format(i), "emb_vec{}".format(i), "sum")
embedding_collection = embedding_planner.create_embedding_collection("./plan_7000.json")
```

## Plan and Embedding Table Placement Strategy (ETPS)
### What is ETPS and why is it important?
In recommendation systems, the embedding tables are usually so large that a single GPU cannot hold all of them, so sharding is needed to distribute the embedding tables across multiple GPUs. We call such a sharding strategy the **Embedding Table Placement Strategy (ETPS)**. It strongly affects embedding performance, since different sharding strategies influence the communication between GPUs, and the optimal placement strategy depends heavily on your dataset and lookup operations. It is therefore important for users to configure a table placement strategy suited to their own use case rather than rely on a fixed one.
### How to configure ETPS in the embedding collection?
We introduce a configurable ETPS interface so that users can adjust the table placement strategy to their own use case. A JSON file, which we call a **plan file**, describes the ETPS on all GPUs. For example, suppose you have 4 tables and 5 lookup operations, such as:
```python
num_embedding = 4  # number of embedding tables
table_size_array = [...]
embedding_table_list = []
for i in range(num_embedding):
    embedding_table_list.append(hugectr.EmbeddingTableConfig(id_space=i, max_vocabulary_size=table_size_array[i], ev_size=128, min_key=0, max_key=table_size_array[i]))

embedding_planner = hugectr.EmbeddingPlanner()
embedding_planner.embedding_lookup(embedding_table_list[0], "data0", "emb_vec0", "sum") # lookup 0
embedding_planner.embedding_lookup(embedding_table_list[1], "data1", "emb_vec1", "sum") # lookup 1
embedding_planner.embedding_lookup(embedding_table_list[2], "data2", "emb_vec2", "sum") # lookup 2
embedding_planner.embedding_lookup(embedding_table_list[1], "data3", "emb_vec3", "sum") # lookup 3
embedding_planner.embedding_lookup(embedding_table_list[3], "data4", "emb_vec4", "sum") # lookup 4
```
Now you want to configure the ETPS through a plan file. In the plan file, you can group several lookup operations together and shard them. You can specify which lookup operations run on which GPU and which portion of the embedding table each GPU holds. The basic principle is that one embedding table can only be sharded in a single way. For example, lookup 1 and lookup 3 both use the same embedding table (`embedding_table_list[1]`), so they should be grouped together and sharded in the same way.
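Finding the lookups that share a table can be automated. The following is a minimal sketch in plain Python, with no HugeCTR dependency. `lookup_to_table` and `group_lookups_by_table` are hypothetical names, and the list simply records which `embedding_table_list` index each of the five lookups above uses.

```python
from collections import defaultdict

# Table index used by each lookup, in lookup order (lookups 0..4 above).
lookup_to_table = [0, 1, 2, 1, 3]


def group_lookups_by_table(lookup_to_table):
    """Return {table_id: [lookup_id, ...]} so shared tables are easy to spot."""
    groups = defaultdict(list)
    for lookup_id, table_id in enumerate(lookup_to_table):
        groups[table_id].append(lookup_id)
    return dict(groups)


groups = group_lookups_by_table(lookup_to_table)
# Tables used by more than one lookup: these lookups must be sharded together.
shared = {table_id: ids for table_id, ids in groups.items() if len(ids) > 1}

print(groups)
print(shared)
```

Any table that appears in `shared` constrains the plan: all of its lookups need to go into the same group with the same sharding.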
If you have 2 GPUs and want to use data parallel for all 4 embedding tables, you can write a plan file like:
```json
[
  [
      {
          "local_embedding_list": [
              0, 1, 2, 3, 4
          ],
          "global_embedding_list": [
              [
                  0, 1, 2, 3, 4
              ],
              [
                  0, 1, 2, 3, 4
              ]
          ],
          "num_sharding": 1,
          "sharding_id": 0,
          "table_placement_strategy": "dp"
      }
  ],
  [
      {
          "local_embedding_list": [
              0, 1, 2, 3, 4
          ],
          "global_embedding_list": [
              [
                  0, 1, 2, 3, 4
              ],
              [
                  0, 1, 2, 3, 4
              ]
          ],
          "num_sharding": 1,
          "sharding_id": 0,
          "table_placement_strategy": "dp"
      }
  ]
]
```
The plan file consists of a list that describes the table placement strategy on each GPU, in order. For each GPU, a list describes the groups of sharded lookup operations. Each group of sharded lookup operations is a dictionary that includes:
* `local_embedding_list`: a list of integers, the lookup operations that the current GPU contains.
* `global_embedding_list`: a list of lists of integers, the lookup operations of the current group on all GPUs.
* `num_sharding`: an integer, the number of shards to split the current group of lookup operations into.
* `sharding_id`: an integer, the index of the shard that the current GPU holds for the current group of lookup operations.
* `table_placement_strategy`: str, either `mp` or `dp`. `mp` means model parallel and `dp` means data parallel.
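Because a plan file is hand-written JSON, a quick consistency check before training can catch typos. The sketch below is not part of HugeCTR. The rules it checks are inferred from the examples in this notebook rather than taken from a specification: the strategy is `mp` or `dp`, `global_embedding_list` has one entry per GPU, and each GPU's `local_embedding_list` equals its own entry in `global_embedding_list`. `validate_plan` is a hypothetical helper name.

```python
VALID_STRATEGIES = {"mp", "dp"}


def validate_plan(plan):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    num_gpus = len(plan)
    for gpu_id, gpu_plan in enumerate(plan):
        for group_id, group in enumerate(gpu_plan):
            where = f"gpu {gpu_id}, group {group_id}"
            strategy = group.get("table_placement_strategy")
            if strategy not in VALID_STRATEGIES:
                problems.append(f"{where}: unknown strategy {strategy!r}")
            global_list = group.get("global_embedding_list", [])
            if len(global_list) != num_gpus:
                problems.append(
                    f"{where}: global_embedding_list has {len(global_list)} "
                    f"entries, expected {num_gpus}")
            elif global_list[gpu_id] != group.get("local_embedding_list"):
                problems.append(
                    f"{where}: local_embedding_list differs from "
                    f"global_embedding_list[{gpu_id}]")
    return problems


# The 2-GPU data-parallel plan from above, written as Python data.
dp_group = {
    "local_embedding_list": [0, 1, 2, 3, 4],
    "global_embedding_list": [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]],
    "num_sharding": 1,
    "sharding_id": 0,
    "table_placement_strategy": "dp",
}
plan = [[dict(dp_group)], [dict(dp_group)]]
print(validate_plan(plan))  # []
```

To check a file on disk, load it with `json.load` and pass the result to `validate_plan`.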

You can also apply more complex ETPS configurations. Suppose we want to shard lookups 0, 1, 2, and 3 across GPUs while keeping lookup 4 data parallel. We can use:
```json
[
  [
      {
          "local_embedding_list": [
              0,
              2
          ],
          "global_embedding_list": [
              [
                  0,
                  2
              ],
              [
                  1,
                  3
              ]
          ],
          "table_placement_strategy": "mp"
      },
      {
          "local_embedding_list": [
              4
          ],
          "global_embedding_list": [
              [
                  4
              ],
              [
                  4
              ]
          ],
          "table_placement_strategy": "dp"
      }
  ],
  [
      {
          "local_embedding_list": [
              1,
              3
          ],
          "global_embedding_list": [
              [
                  0,
                  2
              ],
              [
                  1,
                  3
              ]
          ],
          "table_placement_strategy": "mp"
      },
      {
          "local_embedding_list": [
              4
          ],
          "global_embedding_list": [
              [
                  4
              ],
              [
                  4
              ]
          ],
          "table_placement_strategy": "dp"
      }
  ]
]
```
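To double-check a mixed plan like this one, it can help to list where each lookup ends up. The sketch below rebuilds the plan above as Python data and reports, for each lookup, its strategy and the GPUs whose `local_embedding_list` contains it. `lookup_placement` is a hypothetical helper, not a HugeCTR function.

```python
def lookup_placement(plan):
    """Map each lookup id to (strategy, [GPU ids that list it locally])."""
    placement = {}
    for gpu_id, gpu_plan in enumerate(plan):
        for group in gpu_plan:
            strategy = group["table_placement_strategy"]
            for lookup_id in group["local_embedding_list"]:
                # First sighting stores the strategy; later ones add the GPU.
                _, gpus = placement.setdefault(lookup_id, (strategy, []))
                gpus.append(gpu_id)
    return placement


# Same content as the JSON plan above, for 2 GPUs.
mp_global = [[0, 2], [1, 3]]
dp_global = [[4], [4]]
plan = [
    [
        {"local_embedding_list": mp_global[gpu],
         "global_embedding_list": mp_global,
         "table_placement_strategy": "mp"},
        {"local_embedding_list": dp_global[gpu],
         "global_embedding_list": dp_global,
         "table_placement_strategy": "dp"},
    ]
    for gpu in range(2)
]

placement = lookup_placement(plan)
for lookup_id in sorted(placement):
    strategy, gpus = placement[lookup_id]
    print(f"lookup {lookup_id}: {strategy} on GPUs {gpus}")
```

In this plan, each `mp` lookup is listed on exactly one GPU, while the `dp` lookup is listed on both.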

## DLRM Model
### Prepare Data
You can follow the instructions under [samples/deepfm/README.md#Preprocess the Dataset Through NVTabular](../samples/deepfm/README.md#Preprocess_the_Dataset_Through_NVTabular) to prepare the data.
### Prepare Train Script
We will use a single DGX-1 to run DLRM in this notebook. The GPU information for the DGX-1 is as follows. It consists of 8 V100-SXM2 GPUs.

In [6]:
! nvidia-smi

Thu Jun 23 00:14:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    45W / 300W |      0MiB / 16160MiB |      0%      Default |
|       

We build our training script with the embedding collection API, and we use a command-line argument to pass the plan file so that different ETPS configurations can be tested.

In [15]:
%%writefile dlrm_train.py
import os
import sys
sys.path.append('/workdir/build/lib')
import hugectr

plan_file = sys.argv[1]
table_size_array = [203931, 18598, 14092, 7012, 18977, 4, 6385, 1245, 49, 186213, 71328, 67288, 11, 2168, 7338, 61, 4, 932, 15, 204515, 141526, 199433, 60919, 9137, 71, 34]

solver = hugectr.CreateSolver(max_eval_batches = 70,
                              batchsize_eval = 65536,
                              batchsize = 65536,
                              lr = 0.5,
                              warmup_steps = 300,
                              vvgpu = [[0,1,2,3,4,5,6,7]],
                              repeat_dataset = True,
                              i64_input_key = True,
                              metrics_spec = {hugectr.MetricsType.AverageLoss:0.0},
                              use_embedding_collection = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = ["./deepfm_data_nvt/train/_file_list.txt"],
                                  eval_source = "./deepfm_data_nvt/val/_file_list.txt",
                                  check_type=hugectr.Check_t.Non,
                                  slot_size_array = table_size_array)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.SGD,
                                    update_type = hugectr.Update_t.Local,
                                    atomic_update = True)
model = hugectr.Model(solver, reader, optimizer)

num_embedding = 26

model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 13, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("data{}".format(i), 1, False, 1) for i in range(num_embedding)]))

# create embedding table
embedding_table_list = []
for i in range(num_embedding):
    embedding_table_list.append(hugectr.EmbeddingTableConfig(id_space=i, max_vocabulary_size=table_size_array[i], ev_size=128, min_key=0, max_key=table_size_array[i]))
# create embedding planner and embedding collection
embedding_planner = hugectr.EmbeddingPlanner()
emb_vec_list = []
for i in range(num_embedding):
    embedding_planner.embedding_lookup(emb_table=embedding_table_list[i], 
                                       bottom_name=f"data{i}", 
                                       top_name=f"emb_vec{i}", 
                                       combiner="sum")
embedding_collection = embedding_planner.create_embedding_collection(plan_file)

model.add(embedding_collection)
# need concat
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Concat,
                              bottom_names = ["emb_vec{}".format(i) for i in range(num_embedding)],
                              top_names = ["sparse_embedding1"],
                              axis = 1))

model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["dense"],
                            top_names = ["fc1"],
                            num_output=512))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))                           
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu1"],
                            top_names = ["fc2"],
                            num_output=256))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc2"],
                            top_names = ["relu2"]))                            
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu2"],
                            top_names = ["fc3"],
                            num_output=128))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc3"],
                            top_names = ["relu3"]))                              
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Interaction, # interaction only support 3-D input
                            bottom_names = ["relu3","sparse_embedding1"],
                            top_names = ["interaction1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["interaction1"],
                            top_names = ["fc4"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc4"],
                            top_names = ["relu4"]))                              
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu4"],
                            top_names = ["fc5"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc5"],
                            top_names = ["relu5"]))                              
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu5"],
                            top_names = ["fc6"],
                            num_output=512))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc6"],
                            top_names = ["relu6"]))                               
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu6"],
                            top_names = ["fc7"],
                            num_output=256))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc7"],
                            top_names = ["relu7"]))                                                                              
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu7"],
                            top_names = ["fc8"],
                            num_output=1))                                                                                           
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                            bottom_names = ["fc8", "label"],
                            top_names = ["loss"]))
model.compile()
model.summary()
model.fit(max_iter = 1000, display = 100, eval_interval = 100, snapshot = 10000000, snapshot_prefix = "dlrm")


Overwriting dlrm_train.py


### ETPS: Data parallel + Localized
We want to place small tables as data parallel, while each of the other tables is placed on a single GPU, with different GPUs holding different tables (the same approach used in `hugectr.LocalizedHashEmbedding`).

In [16]:
def print_plan(plan):
    for id,single_gpu_plan in enumerate(plan):
        print('single_gpu_plan index = {}'.format(id))
        for plan_attr in single_gpu_plan:
            for key in plan_attr:
                if (key != 'global_embedding_list'):
                    print('\t{}:{}'.format(key,plan_attr[key]))
                else:
                    prefix_len = len(key)
                    left_space_fill = ' '*prefix_len
                    print('\t{}:{}'.format(key,plan_attr[key][0]))
                    for index in range(1,len(plan_attr[key])):
                        print('\t{}:{}'.format(left_space_fill,plan_attr[key][index]))
def generate_plan(table_size_array, num_gpus, plan_file):
    
    mp_table = [i for i in range(len(table_size_array)) if table_size_array[i] > 6000]
    dp_table = [i for i in range(len(table_size_array)) if table_size_array[i] <= 6000]

    # place table across all gpus
    plan = []
    for gpu_id in range(num_gpus):
        single_gpu_plan = []
        mp_plan = {
          'local_embedding_list': [table_id for i, table_id in enumerate(mp_table) if i % num_gpus == gpu_id],
          'table_placement_strategy': 'mp'
        }
        dp_plan = {
          'local_embedding_list': dp_table,
          'table_placement_strategy': 'dp'
        }
        single_gpu_plan.append(mp_plan)
        single_gpu_plan.append(dp_plan)
        plan.append(single_gpu_plan)

    # generate global view of table placement
    mp_global_embedding_list = []
    dp_global_embedding_list = []
    for single_gpu_plan in plan:
        mp_global_embedding_list.append(single_gpu_plan[0]['local_embedding_list'])
        dp_global_embedding_list.append(single_gpu_plan[1]['local_embedding_list'])
    for single_gpu_plan in plan:
        single_gpu_plan[0]['global_embedding_list'] = mp_global_embedding_list
        single_gpu_plan[1]['global_embedding_list'] = dp_global_embedding_list
    print_plan(plan)
    # dump plan file
    import json
    with open(plan_file, 'w') as f:
        json.dump(plan, f, indent=4)

In [17]:
table_size_array = [203931, 18598, 14092, 7012, 18977, 4, 6385, 1245, 49, 186213, 71328, 67288, 11, 2168, 7338, 61, 4, 932, 15, 204515, 141526, 199433, 60919, 9137, 71, 34]
generate_plan(table_size_array, 8, "./dp_and_localized_plan.json")

single_gpu_plan index = 0
	local_embedding_list:[0, 11]
	table_placement_strategy:mp
	global_embedding_list:[0, 11]
	                     :[1, 14]
	                     :[2, 19]
	                     :[3, 20]
	                     :[4, 21]
	                     :[6, 22]
	                     :[9, 23]
	                     :[10]
	local_embedding_list:[5, 7, 8, 12, 13, 15, 16, 17, 18, 24, 25]
	table_placement_strategy:dp
	global_embedding_list:[5, 7, 8, 12, 13, 15, 16, 17, 18, 24, 25]
	                     :[5, 7, 8, 12, 13, 15, 16, 17, 18, 24, 25]
	                     :[5, 7, 8, 12, 13, 15, 16, 17, 18, 24, 25]
	                     :[5, 7, 8, 12, 13, 15, 16, 17, 18, 24, 25]
	                     :[5, 7, 8, 12, 13, 15, 16, 17, 18, 24, 25]
	                     :[5, 7, 8, 12, 13, 15, 16, 17, 18, 24, 25]
	                     :[5, 7, 8, 12, 13, 15, 16, 17, 18, 24, 25]
	                     :[5, 7, 8, 12, 13, 15, 16, 17, 18, 24, 25]
single_gpu_plan index = 1
	local_embedding_list:[1, 14]
	

In [20]:
!python3 dlrm_train.py ./dp_and_localized_plan.json

HugeCTR Version: 3.7
[HCTR][01:41:54.133][INFO][RK0][main]: Global seed is 1382527389
[HCTR][01:41:54.593][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
  GPU 1 ->  node 0
  GPU 2 ->  node 0
  GPU 3 ->  node 0
  GPU 4 ->  node 1
  GPU 5 ->  node 1
  GPU 6 ->  node 1
  GPU 7 ->  node 1
[HCTR][01:42:05.912][INFO][RK0][main]: Start all2all warmup
[HCTR][01:42:06.092][INFO][RK0][main]: End all2all warmup
[HCTR][01:42:06.104][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][01:42:06.105][INFO][RK0][main]: Device 0: Tesla V100-SXM2-16GB
[HCTR][01:42:06.106][INFO][RK0][main]: Device 1: Tesla V100-SXM2-16GB
[HCTR][01:42:06.106][INFO][RK0][main]: Device 2: Tesla V100-SXM2-16GB
[HCTR][01:42:06.107][INFO][RK0][main]: Device 3: Tesla V100-SXM2-16GB
[HCTR][01:42:06.108][INFO][RK0][main]: Device 4: Tesla V100-SXM2-16GB
[HCTR][01:42:06.109][INFO][RK0][main]: Device 5: Tesla V100-SXM2-16GB
[HCTR][01:42:06.109][INFO][RK0][main]: Device 6: Tesla V100-SXM2-16GB
[HCTR][01:42:06.110

[HCTR][01:42:52.116][INFO][RK0][main]: Iter: 100 Time(100 iters): 8.11962s Loss: 0.14083 lr:0.168333
[HCTR][01:42:54.073][DEBUG][RK0][tid #140421363656448]: file_name_ deepfm_data_nvt/val/0.35ab81b16b4a409ba42a1baf89dcba52.parquet file_total_rows_ 571942
[HCTR][01:42:54.111][DEBUG][RK0][tid #140421355263744]: file_name_ deepfm_data_nvt/val/1.01854d707a564342aef3af44b814de1c.parquet file_total_rows_ 573919
[HCTR][01:42:54.149][DEBUG][RK0][tid #140421237831424]: file_name_ deepfm_data_nvt/val/2.7d7593c16af64625973ed246f68af624.parquet file_total_rows_ 572137
[HCTR][01:42:54.186][DEBUG][RK0][tid #140421229438720]: file_name_ deepfm_data_nvt/val/3.eec657484d40418cbf2648541592d09e.parquet file_total_rows_ 572545
[HCTR][01:42:54.229][DEBUG][RK0][tid #140421221046016]: file_name_ deepfm_data_nvt/val/4.e60c2f9421d84490bbc4de5f15ec5a0f.parquet file_total_rows_ 573664
[HCTR][01:42:54.268][DEBUG][RK0][tid #140421103613696]: file_name_ deepfm_data_nvt/val/5.883be83fecd74c1fbac00321911f2787.parquet

[HCTR][01:43:41.042][DEBUG][RK0][tid #140425583122176]: file_name_ deepfm_data_nvt/train/6.92133f3ee3664684854969202958122f.parquet file_total_rows_ 4581782
[HCTR][01:43:41.759][DEBUG][RK0][tid #140421372049152]: file_name_ deepfm_data_nvt/train/7.9345ade3421b40a5803f518c48ae436f.parquet file_total_rows_ 4589169
[HCTR][01:43:45.226][INFO][RK0][main]: Iter: 600 Time(100 iters): 10.5546s Loss: 0.138196 lr:0.5
[HCTR][01:43:47.271][DEBUG][RK0][tid #140421363656448]: file_name_ deepfm_data_nvt/val/0.35ab81b16b4a409ba42a1baf89dcba52.parquet file_total_rows_ 571942
[HCTR][01:43:47.309][DEBUG][RK0][tid #140421355263744]: file_name_ deepfm_data_nvt/val/1.01854d707a564342aef3af44b814de1c.parquet file_total_rows_ 573919
[HCTR][01:43:47.347][DEBUG][RK0][tid #140421237831424]: file_name_ deepfm_data_nvt/val/2.7d7593c16af64625973ed246f68af624.parquet file_total_rows_ 572137
[HCTR][01:43:47.390][DEBUG][RK0][tid #140421229438720]: file_name_ deepfm_data_nvt/val/3.eec657484d40418cbf2648541592d09e.parqu

### ETPS: Distributed 
We want to distribute all tables across all GPUs (the same approach used in `hugectr.DistributedHashEmbedding`).

In [11]:
def generate_distributed_plan(table_size_array, num_gpus, plan_file):
    # place table across all gpus
    plan = []
    for gpu_id in range(num_gpus):
        distributed_plan = {
          'local_embedding_list': [table_id for table_id in range(len(table_size_array))],
          'table_placement_strategy': 'mp',
          'sharding_id': gpu_id,
          'num_sharding': num_gpus
        }
        plan.append([distributed_plan])

    # generate global view of table placement
    distributed_global_embedding_list = []
    for single_gpu_plan in plan:
        distributed_global_embedding_list.append(single_gpu_plan[0]['local_embedding_list'])
    for single_gpu_plan in plan:
        single_gpu_plan[0]['global_embedding_list'] = distributed_global_embedding_list
    print_plan(plan)
    # dump plan file
    import json
    with open(plan_file, 'w') as f:
        json.dump(plan, f, indent=4)

In [12]:
table_size_array = [203931, 18598, 14092, 7012, 18977, 4, 6385, 1245, 49, 186213, 71328, 67288, 11, 2168, 7338, 61, 4, 932, 15, 204515, 141526, 199433, 60919, 9137, 71, 34]
generate_distributed_plan(table_size_array, 8, "./distributed_plan.json")

single_gpu_plan index = 0
	local_embedding_list:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
	table_placement_strategy:mp
	sharding_id:0
	num_sharding:8
	global_embedding_list:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
	                     :[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
	                     :[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
	                     :[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
	                     :[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
	                     :[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
	                     :[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,

In [21]:
!python3 dlrm_train.py ./distributed_plan.json

HugeCTR Version: 3.7
[HCTR][01:52:30.368][INFO][RK0][main]: Global seed is 3880200297
[HCTR][01:52:30.827][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
  GPU 1 ->  node 0
  GPU 2 ->  node 0
  GPU 3 ->  node 0
  GPU 4 ->  node 1
  GPU 5 ->  node 1
  GPU 6 ->  node 1
  GPU 7 ->  node 1
[HCTR][01:52:42.751][INFO][RK0][main]: Start all2all warmup
[HCTR][01:52:42.935][INFO][RK0][main]: End all2all warmup
[HCTR][01:52:42.947][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][01:52:42.948][INFO][RK0][main]: Device 0: Tesla V100-SXM2-16GB
[HCTR][01:52:42.949][INFO][RK0][main]: Device 1: Tesla V100-SXM2-16GB
[HCTR][01:52:42.950][INFO][RK0][main]: Device 2: Tesla V100-SXM2-16GB
[HCTR][01:52:42.951][INFO][RK0][main]: Device 3: Tesla V100-SXM2-16GB
[HCTR][01:52:42.951][INFO][RK0][main]: Device 4: Tesla V100-SXM2-16GB
[HCTR][01:52:42.952][INFO][RK0][main]: Device 5: Tesla V100-SXM2-16GB
[HCTR][01:52:42.953][INFO][RK0][main]: Device 6: Tesla V100-SXM2-16GB
[HCTR][01:52:42.954

[HCTR][01:53:36.384][INFO][RK0][main]: Iter: 100 Time(100 iters): 15.3325s Loss: 0.143876 lr:0.168333
[HCTR][01:53:39.665][DEBUG][RK0][tid #139891807610624]: file_name_ deepfm_data_nvt/val/0.35ab81b16b4a409ba42a1baf89dcba52.parquet file_total_rows_ 571942
[HCTR][01:53:39.723][DEBUG][RK0][tid #139891799217920]: file_name_ deepfm_data_nvt/val/1.01854d707a564342aef3af44b814de1c.parquet file_total_rows_ 573919
[HCTR][01:53:39.781][DEBUG][RK0][tid #139891681785600]: file_name_ deepfm_data_nvt/val/2.7d7593c16af64625973ed246f68af624.parquet file_total_rows_ 572137
[HCTR][01:53:39.839][DEBUG][RK0][tid #139891673392896]: file_name_ deepfm_data_nvt/val/3.eec657484d40418cbf2648541592d09e.parquet file_total_rows_ 572545
[HCTR][01:53:39.899][DEBUG][RK0][tid #139891665000192]: file_name_ deepfm_data_nvt/val/4.e60c2f9421d84490bbc4de5f15ec5a0f.parquet file_total_rows_ 573664
[HCTR][01:53:39.957][DEBUG][RK0][tid #139891144914688]: file_name_ deepfm_data_nvt/val/5.883be83fecd74c1fbac00321911f2787.parque

[HCTR][01:55:06.057][DEBUG][RK0][tid #139891933435648]: file_name_ deepfm_data_nvt/train/6.92133f3ee3664684854969202958122f.parquet file_total_rows_ 4581782
[HCTR][01:55:07.441][DEBUG][RK0][tid #139891816003328]: file_name_ deepfm_data_nvt/train/7.9345ade3421b40a5803f518c48ae436f.parquet file_total_rows_ 4589169
[HCTR][01:55:14.044][INFO][RK0][main]: Iter: 600 Time(100 iters): 19.5667s Loss: 0.135635 lr:0.5
[HCTR][01:55:17.437][DEBUG][RK0][tid #139891807610624]: file_name_ deepfm_data_nvt/val/0.35ab81b16b4a409ba42a1baf89dcba52.parquet file_total_rows_ 571942
[HCTR][01:55:17.495][DEBUG][RK0][tid #139891799217920]: file_name_ deepfm_data_nvt/val/1.01854d707a564342aef3af44b814de1c.parquet file_total_rows_ 573919
[HCTR][01:55:17.553][DEBUG][RK0][tid #139891681785600]: file_name_ deepfm_data_nvt/val/2.7d7593c16af64625973ed246f68af624.parquet file_total_rows_ 572137
[HCTR][01:55:17.612][DEBUG][RK0][tid #139891673392896]: file_name_ deepfm_data_nvt/val/3.eec657484d40418cbf2648541592d09e.parqu

### Compare performance between different ETPS
We can see that the iteration time for data parallel + localized is 103.45s, while for distributed it is 190.85s, which shows that different ETPS configurations can greatly affect embedding performance. It is therefore better to place an embedding table as data parallel or localized when the table fits into a single GPU.
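The same comparison can be read directly from the training logs. The sketch below parses the `Iter: ... Time(100 iters): ...` lines that appear in the two runs above and prints the slowdown of the distributed plan for each logged iteration, followed by the ratio of the two totals quoted in the text. The regular expression and the helper name `parse_iter_times` are assumptions based on the log lines shown here, not a documented HugeCTR log format.

```python
import re

# Matches e.g. "Iter: 100 Time(100 iters): 8.11962s" as seen in the logs above.
ITER_RE = re.compile(r"Iter: (\d+) Time\((\d+) iters\): ([\d.]+)s")


def parse_iter_times(log_text):
    """Return {iteration: seconds for the preceding block of iterations}."""
    return {int(m.group(1)): float(m.group(3)) for m in ITER_RE.finditer(log_text)}


# Log lines copied from the two runs above.
localized_log = """
[HCTR][01:42:52.116][INFO][RK0][main]: Iter: 100 Time(100 iters): 8.11962s Loss: 0.14083 lr:0.168333
[HCTR][01:43:45.226][INFO][RK0][main]: Iter: 600 Time(100 iters): 10.5546s Loss: 0.138196 lr:0.5
"""
distributed_log = """
[HCTR][01:53:36.384][INFO][RK0][main]: Iter: 100 Time(100 iters): 15.3325s Loss: 0.143876 lr:0.168333
[HCTR][01:55:14.044][INFO][RK0][main]: Iter: 600 Time(100 iters): 19.5667s Loss: 0.135635 lr:0.5
"""

localized = parse_iter_times(localized_log)
distributed = parse_iter_times(distributed_log)
for iteration in sorted(localized):
    ratio = distributed[iteration] / localized[iteration]
    print(f"iter {iteration}: distributed / localized = {ratio:.2f}x")

# Ratio of the totals quoted in the text (190.85s vs 103.45s).
total_ratio = 190.85 / 103.45
print(f"totals: distributed / localized = {total_ratio:.2f}x")
```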