diff --git a/docs/QAList.md b/docs/QAList.md
index edaccb564e..4c9d141667 100644
--- a/docs/QAList.md
+++ b/docs/QAList.md
@@ -85,16 +85,18 @@ nnz=0 is supported in HugeCTR input. That means no features will be looked up.
 Firstly, you should construct your own configure file. You can refer to our [User Guide](hugectr_user_guide.md) and samples.
 Secondly, using our `data_generator` to generate a random dataset. Seeing [introductions](../README.md#benchmark).
 Thirdly, run with `./huge_ctr --train ./your_config.json`
-### 24. How to set workspace_size_per_gpu_in_mb and slot_size_array in .json file? ###
+### 24. How to set workspace_size_per_gpu_in_mb and slot_size_array? ###
 As embeddings are model parallel in HugeCTR,
-it's a reference number for HugeCTR to allocate GPU memory accordingly and not necessarily the exact number of features in your dataset.
+`workspace_size_per_gpu_in_mb` is a reference number that HugeCTR uses to allocate GPU memory accordingly; it is not necessarily the exact number of features in your dataset.
 In practice, we usually set it larger than the real size because of the non-uniform distribution of the keys.
-In DistributedSlotEmbedding, HugeCTR will allocate the same size of memory on each GPU which is `workspace_size_per_gpu_in_mb`.
-Users have to set this parameter big enough to make sure no overflow in any GPUs.
-In LocalizedSlotEmbedding, user can also provide `workspace_size_per_gpu_in_mb`,
-if the slot sizes are significantly different, we recommend that users give a number large enough to prevent overflow.
-Another approach for LocalizedSlotEmbedding is that users can provide the exact size for each slot,
-which is `slot_size_array` and HugeCTR will calculate the `workspace_size_per_gpu_in_mb` according to the given slot sizes.
+
+`slot_size_array` has two uses. It can be used as a replacement for `workspace_size_per_gpu_in_mb` to avoid wasting memory caused by imbalanced vocabulary sizes, and it can also be used as a reference for adding an offset to the keys in each slot.
+
+The relation between the embedding type, `workspace_size_per_gpu_in_mb`, and `slot_size_array` is as follows:
+* For `DistributedSlotEmbedding`, `workspace_size_per_gpu_in_mb` is required and `slot_size_array` is not needed. Each GPU allocates the same amount of memory for the embedding table.
+* For `LocalizedSlotSparseEmbeddingHash`, only one of `workspace_size_per_gpu_in_mb` and `slot_size_array` is needed. If you can provide the exact size of each slot, we recommend specifying `slot_size_array`, which helps avoid wasting memory caused by imbalanced vocabulary sizes. Otherwise, specify `workspace_size_per_gpu_in_mb` so that each GPU allocates the same amount of memory for the embedding table. If you specify both `slot_size_array` and `workspace_size_per_gpu_in_mb`, HugeCTR uses `slot_size_array` for `LocalizedSlotSparseEmbeddingHash`.
+* For `LocalizedSlotSparseEmbeddingOneHot`, `slot_size_array` is required. It is used both for allocating memory and for adding an offset to the keys in each slot.
+* For `HybridSparseEmbedding`, both `workspace_size_per_gpu_in_mb` and `slot_size_array` are required: `workspace_size_per_gpu_in_mb` is used for allocating memory, while `slot_size_array` is used for adding the offsets.
 ### 25. Is nvlink required in HugeCTR? ###
 GPU with nvlink is not required, but recommended because the performance of CTR training highly relies on the performance of inter-GPUs communication. GPU servers with PCIE connections are also supported.
 ### 26. Is DGX the only GPU server that is required in HugeCTR? ###
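As a supplement to question 24 above, a back-of-the-envelope sketch for choosing a starting value of `workspace_size_per_gpu_in_mb` follows. It is only an illustration: the 4-byte fp32 element size and the safety factor covering optimizer state, hash-table overhead, and non-uniform key distribution are assumptions, not values taken from the HugeCTR implementation. Increase the result if you still hit an overflow or out-of-memory error.

```python
# Rough, illustrative estimate of workspace_size_per_gpu_in_mb.
# Assumptions (not from the HugeCTR source): fp32 embedding values (4 bytes)
# and a user-chosen safety factor for optimizer state and uneven key distribution.
def estimate_workspace_size_per_gpu_in_mb(vocabulary_size: int,
                                          embedding_vec_size: int,
                                          num_gpus: int = 1,
                                          bytes_per_value: int = 4,
                                          safety_factor: float = 2.0) -> int:
    # Bytes needed per GPU if the vocabulary were split evenly across GPUs.
    bytes_per_gpu = vocabulary_size * embedding_vec_size * bytes_per_value / num_gpus
    # Apply the safety margin and convert to megabytes, rounding up.
    return int(bytes_per_gpu * safety_factor / (1024 * 1024)) + 1

# Example: 40M total keys, 16-dim embedding vectors, spread over 8 GPUs.
print(estimate_workspace_size_per_gpu_in_mb(40_000_000, 16, num_gpus=8))  # ~611 MB
```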
diff --git a/docs/python_interface.md b/docs/python_interface.md
index 08c8aa8f2f..3eb8b55748 100644
--- a/docs/python_interface.md
+++ b/docs/python_interface.md
@@ -249,7 +249,7 @@ hugectr.DataReaderParams()
 
 * `num_workers`: Integer, the number of data reader workers that concurrently load data. You can empirically decide the best one based on your dataset, training environment. The default value is 12.
 
-* `slot_size_array`: List[int], the cardinality array of input features. It should be consistent with that of the sparse input. We requires this argument for Parquet format data. The default value is an empty list.
+* `slot_size_array`: List[int], the cardinality array of input features. It should be consistent with that of the sparse input. We require this argument for the Parquet format and for the RawAsync format when you want to add an offset to the input keys. The default value is an empty list.
 
 * `async_param`: AsyncParam, the parameters for async raw data reader. This argument is restricted to MLPerf use.
 
@@ -332,8 +332,6 @@ The Raw dataset format is different from the Norm dataset format in that the tra
 
 **NOTE**: Only one-hot data is accepted with this format.
 
-When using the Raw dataset format, a user must preprocess their own dataset to generate the continuous keys for each slot, and specify the list of the slot sizes with the `slot_size_array` option. Therefore, when referencing the configuration snippet above, we assume that slot 0 has the continuous keyset `{0, 1, 2 ... 39884405}` while slot 1 has its keyset on a different space `{0, 1, 2 ... 39043}`.
-
 The Raw dataset format can be used with embedding type LocalizedSlotSparseEmbeddingOneHot only.
 
 Example:
@@ -375,7 +373,7 @@ reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Pa
                                   slot_size_array = [278899, 355877, 203750, 18573, 14082, 7020, 18966, 4, 6382, 1246, 49, 185920, 71354, 67346, 11, 2166, 7340, 60, 4, 934, 15, 204208, 141572, 199066, 60940, 9115, 72, 34])
 ```
 
-Similar to the Raw dataset format, you must preprocess your own dataset to generate the continuous keys for each slot, and specify the list of the slot sizes with the `slot_size_array` option. Therefore, in the configuration snippet noted above, we assume that slot 0 has the continuous keyset `{0, 1, 2 ... 220817329}` and slot 1 has its keyset on a different space `{0, 1, 2 ... 126535807}`.
+We provide an option to add an offset to the keys in each slot by specifying `slot_size_array`. `slot_size_array` is an array whose length is equal to the number of slots. To avoid duplicate keys after the offsets are added, we need to ensure that the keys of the i-th slot fall in the range `[0, slot_size_array[i])`. The offset is applied as follows: every key in the i-th slot is increased by slot_size_array[0] + slot_size_array[1] + ... + slot_size_array[i - 1]. In the configuration snippet noted above, an offset of 0 is added to the keys of the 0th slot, an offset of 278899 is added to the keys of the 1st slot, and an offset of 634776 (278899 + 355877) is added to the keys of the 2nd slot.
 
 ### **OptParamsPy** ###
 #### **CreateOptimizer method**
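To make the offset rule described in the Parquet section above concrete, here is a small sketch that computes the per-slot offsets from the `slot_size_array` shown in the snippet. It is plain Python for illustration only; HugeCTR applies the equivalent offsets internally, so you do not need to run this as part of training.

```python
# offset[i] = slot_size_array[0] + ... + slot_size_array[i - 1]  (exclusive prefix sum)
import itertools

slot_size_array = [278899, 355877, 203750, 18573, 14082, 7020, 18966, 4,
                   6382, 1246, 49, 185920, 71354, 67346, 11, 2166, 7340,
                   60, 4, 934, 15, 204208, 141572, 199066, 60940, 9115, 72, 34]

# Offset for slot i is the sum of all slot sizes before it.
offsets = [0] + list(itertools.accumulate(slot_size_array))[:-1]

print(offsets[:3])  # [0, 278899, 634776] -> offsets for the 0th, 1st, and 2nd slots
```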
@@ -451,7 +449,7 @@ hugectr.SparseEmbedding()
 **Arguments**
 * `embedding_type`: The embedding type to be used. The supported types include `hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash`, `hugectr.Embedding_t.LocalizedSlotSparseEmbeddingHash`, `hugectr.Embedding_t.LocalizedSlotSparseEmbeddingOneHot` and `hugectr.Embedding_t.HybridSparseEmbedding`. The type `Embedding_t.HybridSparseEmbedding` is valid only if `is_dlrm` is set `True` within `CreateSolver` and `data_reader_type` is specified as `DataReaderType_t.RawAsync` within `DataReaderParams`. There is NO default value and it should be specified by users.
 
-* `workspace_size_per_gpu_in_mb`: Integer, the workspace memory size in megabyte per GPU. This workspace memory must be big enough to hold all the embedding vocabulary used during the training and evaluation. There is NO default value and it should be specified by users. To understand how to set this value, please refer [QAList.md](./QAList.md#How-to-set-workspace_size_per_gpu_in_mb-and-slot_size_array-in-.json-file).
+* `workspace_size_per_gpu_in_mb`: Integer, the workspace memory size in megabytes per GPU. This workspace memory must be big enough to hold the entire embedding vocabulary used during training and evaluation. There is NO default value and it should be specified by users. To understand how to set this value, please refer to [How to set workspace_size_per_gpu_in_mb and slot_size_array](./QAList.md#24).
 
 * `embedding_vec_size`: Integer, the embedding vector size. There is NO default value and it should be specified by users.
 
@@ -461,7 +459,7 @@ hugectr.SparseEmbedding()
 
 * `bottom_name`: String, the number of the bottom tensor to be consumed by this sparse embedding layer. Please note that it should be a predefined sparse input name. There is NO default value and it should be specified by users.
 
-* `slot_size_array`: List[int], the cardinality array of input features. It should be consistent with that of the sparse input. This parameter is used in `LocalizedSlotSparseEmbeddingHash` and `LocalizedSlotSparseEmbeddingOneHot`, which can help avoid wasting memory caused by imbalance vocabulary size. Please refer [How to set workspace_size_per_gpu_in_mb and slot_size_array in .json file](./QAList.md#24). There is NO default value and it should be specified by users.
+* `slot_size_array`: List[int], the cardinality array of input features. It should be consistent with that of the sparse input. This parameter can be used in `LocalizedSlotSparseEmbeddingHash`, `LocalizedSlotSparseEmbeddingOneHot` and `HybridSparseEmbedding`. The meaning of `slot_size_array` varies with the embedding type. There is NO default value and it should be specified by users. Please refer to [How to set workspace_size_per_gpu_in_mb and slot_size_array](./QAList.md#24).
 
 * `optimizer`: OptParamsPy, the optimizer dedicated to this sparse embedding layer. If the user does not specify the optimizer for the sparse embedding, it will adopt the same optimizer as dense layers.
 
diff --git a/tutorial/multinode-training/README.md b/tutorial/multinode-training/README.md
index 096fa56871..bbc58f7448 100644
--- a/tutorial/multinode-training/README.md
+++ b/tutorial/multinode-training/README.md
@@ -23,19 +23,18 @@ If you need to create a new cluster, follow the instructions outlined below.
 
 3. Build HugeCTR with [multi-nodes training supported](../README.md)) and copy it to the same shared directory in each node.
 
-4. Configure the JSON file.
+4. Configure the Python script.
 
-   The [dcn8l8gpu2nodes.json](../../samples/dcn2nodes/dcn8l8gpu2nodes.json) is using two 8-GPU nodes. You can change the `"gpu"` setting based on the environment that you're using, and then copy the JSON file into a directory where the HugeCTR executable file is located as shown here:
+   The [dcn_2node_8gpu.py](../../samples/dcn/dcn_2node_8gpu.py) script uses two 8-GPU nodes. You can change the `"gpu"` setting based on the environment that you're using, and then add the HugeCTR Python library to `PYTHONPATH`:
    ```bash
-   cp ../../samples/dcn2nodes/dcn8l8gpu2nodes.json ../../build/bin/
+   export PYTHONPATH=../../build/lib/
    ```
 
 5. Configure `run_multinode.sh`.
 
    The [run_multinode.sh](./run_multinode.sh) uses `mpirun` to start the built docker container in each node. To use `run_multinode.sh`, you must set the following variables:
    * **WORK_DIR**: Parent path where you put the `hugectr` executable and your JSON config file.
-   * **BIN**: `hugectr` executable path relative to `WORK_DIR`
-   * **CONFIG_NAME**: JSON config file path relative to `WORK_DIR`
+   * **TEST_CMD**: The command used to run the Python training script.
    * **DATASET**: Real dataset path.
    * **VOL_DATASET**: Dataset path shown inside your docker container as a mapping from `DATASET`. HugeCTR only sees this path.
    * **IMAGENAME**: Name of your Docker image.
diff --git a/tutorial/multinode-training/run_multinode.sh b/tutorial/multinode-training/run_multinode.sh
index bcb00e4ce6..6594990e15 100644
--- a/tutorial/multinode-training/run_multinode.sh
+++ b/tutorial/multinode-training/run_multinode.sh
@@ -1,10 +1,9 @@
 # working dir, dataset dir, bin path, config path
 WORK_DIR="../../build/bin"
-BIN="huge_ctr"
-CONFIG_NAME="dcn8l8gpu2nodes.json"
+TEST_CMD="python3 ../../samples/dcn/dcn_2node_8gpu.py"
 DATASET="/dataset"
 VOL_DATASET="/dataset"
-IMAGENAME="hugectr:devel"
+IMAGENAME="hugectr:devel_train"
 
 export HOSTS="node1,node2"
 
@@ -108,7 +107,7 @@ sleep 10
 echo "FINISH CONTAINER CREATION"
 
 # You can adjust the mpirun args and adjust the running command, args
-docker exec $CONTNAME mpirun --allow-run-as-root --bind-to none -np $((${#hosts[@]})) -x NCCL_DEBUG=INFO ${BIN} --train ${CONFIG_NAME} 2>&1 |tee $JOBID.log
+docker exec $CONTNAME mpirun --allow-run-as-root --bind-to none -np $((${#hosts[@]})) -x NCCL_DEBUG=INFO ${TEST_CMD} 2>&1 |tee $JOBID.log
 
 echo "FINISH TRAINING"
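For reference, the `"gpu"` setting mentioned in the tutorial above corresponds to the `vvgpu` argument of `hugectr.CreateSolver` in the training script. The sketch below shows only that portion for two 8-GPU nodes; it is a hypothetical excerpt, not code taken from `dcn_2node_8gpu.py`, and the batch size and learning rate are placeholder values you would replace with your own.

```python
# Hypothetical excerpt of a multi-node HugeCTR training script.
# vvgpu is a vector of vectors: one inner list of GPU indices per node.
import hugectr

solver = hugectr.CreateSolver(
    vvgpu=[[0, 1, 2, 3, 4, 5, 6, 7],   # GPUs used on the first node
           [0, 1, 2, 3, 4, 5, 6, 7]],  # GPUs used on the second node
    batchsize=16384,                   # placeholder
    lr=0.001,                          # placeholder
    repeat_dataset=True)
```

With `TEST_CMD="python3 ../../samples/dcn/dcn_2node_8gpu.py"`, `mpirun` launches one process per node, and a `vvgpu` mapping like this assigns eight GPUs to each process.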