[BUG] Encountered ETC error of din model when training with multiple keyset. #429

Closed
dusir opened this issue Nov 15, 2023 · 3 comments

dusir commented Nov 15, 2023

Describe the bug

We are trying to enable the Embedding Training Cache (ETC) based on the DIN sample in the HugeCTR repo, training on our in-house data.

However, we found that when the dataset is preprocessed into multiple sources, for example

source = ['/root/keyset_dir/din_1k_seq3_v1/0/0.txt', '/root/keyset_dir/din_1k_seq3_v1/1/0.txt']
keyset = ['/root/keyet/0.keyset', '/root/keyet/0.keyset']

the following error occurs:

[HCTR][06:21:09.120][INFO][RK0][main]: synchronize  done.
[HCTR][06:21:10.185][ERROR][RK0][main]: Runtime error: invalid argument
	cudaMemPrefetchAsync(uvm_key_per_gpu[id], key_size_in_B, ((int)-1), embedding_data_.get_local_gpu(id).get_stream()) at load_parameters (/home/HugeCTR/HugeCTR/src/embeddings/distributed_slot_sparse_embedding_hash.cu:487)
[HCTR][06:21:10.186][ERROR][RK0][main]: Runtime error: invalid argument
	cudaMemPrefetchAsync(uvm_key_per_gpu[id], key_size_in_B, ((int)-1), embedding_data_.get_local_gpu(id).get_stream()) at load_parameters (/home/HugeCTR/HugeCTR/src/embeddings/distributed_slot_sparse_embedding_hash.cu:487)

Training works fine with a single source.

We previously hit a different error with multiple sources; see the linked issue.
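
For anyone reproducing this, a minimal sketch of a sanity check on the `source` and `keyset` lists quoted above. It assumes one keyset file is expected per source file, and the helper name `check_source_keyset_pairs` is hypothetical, not part of HugeCTR. Run on the lists from this report, it flags that the same keyset path is listed twice; whether that is intentional or related to the error is not established in this thread.

```python
from collections import Counter


def check_source_keyset_pairs(source, keyset):
    """Return a list of human-readable problems found in the pairing.

    Assumption (not confirmed in this thread): each source file is
    expected to have its own keyset file, listed in the same order.
    """
    problems = []
    if len(source) != len(keyset):
        problems.append(
            f"length mismatch: {len(source)} source files vs {len(keyset)} keyset files"
        )
    # Report any keyset path that appears more than once.
    for path, count in Counter(keyset).items():
        if count > 1:
            problems.append(f"keyset file listed {count} times: {path}")
    return problems


# Lists copied verbatim from the report above.
source = ['/root/keyset_dir/din_1k_seq3_v1/0/0.txt', '/root/keyset_dir/din_1k_seq3_v1/1/0.txt']
keyset = ['/root/keyet/0.keyset', '/root/keyet/0.keyset']

problems = check_source_keyset_pairs(source, keyset)
for problem in problems:
    print(problem)
```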

To Reproduce

Run the script.

  • hardware
    H800

  • container
    hugectr 23.06 & 23.09

@dusir dusir changed the title from "[BUG]" to "[BUG] DIN model interface exception on HugeCTR" on Nov 15, 2023
@JacoCheung JacoCheung self-assigned this Nov 15, 2023
@JacoCheung
Collaborator

The error messages:

[HCTR][06:21:09.120][INFO][RK0][main]: synchronize  done.
[HCTR][06:21:10.185][ERROR][RK0][main]: Runtime error: invalid argument
	cudaMemPrefetchAsync(uvm_key_per_gpu[id], key_size_in_B, ((int)-1), embedding_data_.get_local_gpu(id).get_stream()) at load_parameters (/home/HugeCTR/HugeCTR/src/embeddings/distributed_slot_sparse_embedding_hash.cu:487)
[HCTR][06:21:10.186][ERROR][RK0][main]: Runtime error: invalid argument
	cudaMemPrefetchAsync(uvm_key_per_gpu[id], key_size_in_B, ((int)-1), embedding_data_.get_local_gpu(id).get_stream()) at load_parameters (/home/HugeCTR/HugeCTR/src/embeddings/distributed_slot_sparse_embedding_hash.cu:487)

@JacoCheung JacoCheung changed the title from "[BUG] DIN model interface exception on HugeCTR" to "[BUG] Encountered ETC error of din model when training with multiple keyset." on Nov 23, 2023
@dusir
Author

dusir commented Nov 24, 2023

The test command is as follows:

din_seq3.py --model_name din_1k_seq3_v3_modify_v2 --keyset_dir '/root/keyset_dir' --batch_size 36000 --batchsize_eval 36000 --gpus '0,1,2,3,4,5,6,7' --train_dir '/data' --start_date '20231012' --end_date '20231013' --datePath '20231107' --workspace_size_per_gpu_in_mb 1200 --num_workers 30

The file din_seq3.py is the same as samples/din/din_parquet.py; we only added some arguments for testing.
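
Since din_seq3.py itself is not attached, here is a rough sketch of how the extra arguments might be declared with `argparse`. The flag names are taken from the command above; the types, defaults, and the `build_parser` helper are assumptions for illustration, not the actual script.

```python
import argparse


def parse_gpu_list(text):
    """Turn a comma-separated string such as '0,1,2' into a list of ints."""
    return [int(part) for part in text.split(",") if part.strip()]


def build_parser():
    # Hypothetical parser: flag names come from the command in this thread,
    # types and defaults are guesses.
    parser = argparse.ArgumentParser(description="DIN test script arguments (sketch)")
    parser.add_argument("--model_name", type=str, default="din")
    parser.add_argument("--keyset_dir", type=str, default="")
    parser.add_argument("--batch_size", type=int, default=1024)
    parser.add_argument("--batchsize_eval", type=int, default=1024)
    parser.add_argument("--gpus", type=parse_gpu_list, default=[0])
    parser.add_argument("--train_dir", type=str, default="")
    parser.add_argument("--start_date", type=str, default="")
    parser.add_argument("--end_date", type=str, default="")
    parser.add_argument("--datePath", type=str, default="")
    parser.add_argument("--workspace_size_per_gpu_in_mb", type=int, default=1024)
    parser.add_argument("--num_workers", type=int, default=1)
    return parser


# The argument values from the test command above, passed explicitly.
argv = [
    "--model_name", "din_1k_seq3_v3_modify_v2",
    "--keyset_dir", "/root/keyset_dir",
    "--batch_size", "36000",
    "--batchsize_eval", "36000",
    "--gpus", "0,1,2,3,4,5,6,7",
    "--train_dir", "/data",
    "--start_date", "20231012",
    "--end_date", "20231013",
    "--datePath", "20231107",
    "--workspace_size_per_gpu_in_mb", "1200",
    "--num_workers", "30",
]
args = build_parser().parse_args(argv)
print(args)
```

Dates are kept as strings here so that values like `20231012` can be used directly as directory names; the real script may handle them differently.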

@JacoCheung
Collaborator

Closing, as ETC is already deprecated.
