
[Bug] RandomForest: cuML throws exception when setting n_streams>1 on a node with 2 GPUs #111

Open
wbo4958 opened this issue Feb 23, 2023 · 0 comments

wbo4958 (Collaborator) commented Feb 23, 2023

Issue Description

When I run the RandomForestClassifier benchmark on a node with 2 GPUs, spark-rapids-ml throws the exception below:

terminate called after throwing an instance of 'raft::cuda_error'
  what():  CUDA error encountered at: file=/home/xxx/work.d/ml/cuml/cpp/src/decisiontree/batched-levelalgo/builder.cuh line=328: call='cudaMemsetAsync(done_count, 0, sizeof(int) * max_batch * n_col_blks, builder_stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 14 stack frames
#0 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x2e) [0x7f586987abe0]
#1 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN4raft9exceptionC2ENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5a) [0x7f586987ab8c]
#2 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x37) [0x7f586987c0b1]
#3 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN2ML2DT7BuilderINS0_21GiniObjectiveFunctionIfiiEEE15assignWorkspaceEPcS5_+0x5fc) [0x7f586a6bd504]
#4 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN2ML2DT7BuilderINS0_21GiniObjectiveFunctionIfiiEEEC1ERKN4raft8handle_tEP11CUstream_stimRKNS0_18DecisionTreeParamsEPKfPKiiiPN3rmm14device_uvectorIiEEiRKNS0_9QuantilesIfiEE+0x7ad) [0x7f586a69b2ff]
#5 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN2ML2DT12DecisionTree3fitIfiEESt10shared_ptrINS0_16TreeMetaDataNodeIT_T0_EEERKN4raft8handle_tEP11CUstream_stPKS5_iiPKS6_PN3rmm14device_uvectorIiEEiNS0_18DecisionTreeParamsEmRKNS0_9QuantilesIS5_iEEi+0xa1) [0x7f586a68193a]
#6 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(+0x1a441dd) [0x7f586a7101dd]
#7 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x97818) [0x7f5868a19818]
#8 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(__kmp_invoke_microtask+0x93) [0x7f5868a363b3]
#9 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x42194) [0x7f58689c4194]
#10 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x4189a) [0x7f58689c389a]
#11 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x96072) [0x7f5868a18072]
#12 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f598e381609]
#13 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f598e142133]
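
The failing cudaMemsetAsync call sits in cuML's batched level-algorithm decision-tree builder, whose workspace sizing depends on n_streams. For reference, a minimal sketch of how the estimator is assumed to be configured when the failure occurs; the parameter names here simply mirror the benchmark flags (--n_streams and --gpu_worker) and are assumptions, not taken from the benchmark source:

# Hedged sketch: assumes spark-rapids-ml's cuML-compatible RandomForestClassifier
# accepts n_streams and num_workers; adjust to the actual API if they differ.
from spark_rapids_ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    n_streams=4,    # mirrors --n_streams=4; >1 is what triggers the error on a 2-GPU worker
    num_workers=2,  # mirrors --gpu_worker=2; one task per GPU on the same node
)
model = rf.fit(train_df)  # train_df: the DataFrame loaded from --train_path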

How to repro

The issue can be reproduced in Spark local or standalone mode where 1 worker has 2 GPUs.

Generate datasets

python gen_data.py  classification --output_dir=/tmp/abc

Run the training job

python benchmark_runner.py  random_forest_classifier --train_path=/tmp/abc --n_streams=4 --gpu_worker=2

Please note that there is no such issue if there are 2 workers, each with 1 GPU.
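
To help narrow down whether the Spark layer is involved, the same parameter can be exercised in plain cuML in a single process. A minimal diagnostic sketch using cuML's documented RandomForestClassifier API; the synthetic data and sizes are purely illustrative:

# Diagnostic sketch: fit directly with n_streams > 1 on one GPU of the 2-GPU node
# to check whether the same cudaErrorInvalidValue appears outside Spark.
import numpy as np
from cuml.ensemble import RandomForestClassifier

X = np.random.rand(10000, 20).astype(np.float32)
y = np.random.randint(0, 2, size=10000).astype(np.int32)

clf = RandomForestClassifier(n_estimators=50, n_streams=4)
clf.fit(X, y)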
