Failed UT: multi_client_test_nccl_local_2gpus #1980

Open
i-chaochen opened this issue Jan 31, 2023 · 0 comments

i-chaochen commented Jan 31, 2023

Root cause: tensorflow@f734ee8
Initial fix: d29b6d6 or tensorflow#59501

exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //tensorflow/dtensor/python/tests:multi_client_test_nccl_local_2gpus
-----------------------------------------------------------------------------
2023-01-31 11:16:27.156744: E tensorflow/tsl/lib/monitoring/collection_registry.cc:81] Cannot register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay
2023-01-31 11:16:27.170465: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Check per client log in Test artifacts.
2023-01-31 11:16:28.129654: E tensorflow/tsl/lib/monitoring/collection_registry.cc:81] Cannot register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay
2023-01-31 11:16:28.143067: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

Could it be that AMD GPUs do not support multiple NCCL managers? See:
tensorflow#58090
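
The pasted log only shows the wrapper output; the per-client logs in the test artifacts are where the real failure should appear. As a rough aid for triaging those, here is a minimal sketch that pulls the E/F-severity lines out of TensorFlow-style log text. The regular expression is an assumption based only on the line shape seen above (`timestamp: LEVEL file:line] message`), and `error_lines` is a hypothetical helper, not part of TensorFlow or its test tooling.

```python
import re

# Assumed line shape, taken from the log excerpt in this issue:
#   2023-01-31 11:16:27.156744: E path/to/file.cc:81] message
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<src>\S+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def error_lines(log_text):
    """Return (source_location, message) pairs for E- and F-level lines."""
    hits = []
    for raw in log_text.splitlines():
        m = LOG_LINE.match(raw)
        if m and m.group("level") in ("E", "F"):
            location = f"{m.group('src')}:{m.group('line')}"
            hits.append((location, m.group("msg")))
    return hits


# Three lines copied from the log in this issue.
sample = """\
2023-01-31 11:16:27.156744: E tensorflow/tsl/lib/monitoring/collection_registry.cc:81] Cannot register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay
2023-01-31 11:16:27.170465: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
2023-01-31 11:16:28.129654: E tensorflow/tsl/lib/monitoring/collection_registry.cc:81] Cannot register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay
"""

for location, message in error_lines(sample):
    print(location, "->", message)
```

On the excerpt above this surfaces only the duplicate metric registration message, once per client process. Whether that message is related to the test failure is not established by this log.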

@i-chaochen i-chaochen changed the title multi_client_test_nccl_local_2gpus [TF] multi_client_test_nccl_local_2gpus Jan 31, 2023
@i-chaochen i-chaochen changed the title [TF] multi_client_test_nccl_local_2gpus [TF-failed UT] multi_client_test_nccl_local_2gpus Jan 31, 2023
@i-chaochen i-chaochen changed the title [TF-failed UT] multi_client_test_nccl_local_2gpus [Failed UT] multi_client_test_nccl_local_2gpus Jan 31, 2023
@i-chaochen i-chaochen changed the title [Failed UT] multi_client_test_nccl_local_2gpus Failed UT: multi_client_test_nccl_local_2gpus Jan 31, 2023