Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] 30TB query95 fails on the join with illegal memory access with 200 partitions #7036

Closed
abellina opened this issue Nov 9, 2022 · 1 comment
Labels
bug Something isn't working reliability Features to improve reliability or bugs that severly impact the reliability of the plugin

Comments

@abellina
Copy link
Collaborator

abellina commented Nov 9, 2022

As a follow on to #6983, we ran the q95 query at 30TB with the fix in this PR (rapidsai/cudf#12079) and we ended up failing during a couple of the joins later, an inner join and a left semi.

In both of those cases we are hitting instances of the overflowing strided loop issue in cuco's static_multimap::pair_count and static_map::insert (see compute-sanitizer output below). It looks like cuDF could work around this by using int64_t as the type in their counting_transform_iterator (like I did in this proof-of-concept), but it is not clear if that is the right solution. This issue is for our tracking, but the fix will be in cuDF or cuCollections.

The only current workaround is to increase our shuffle partitions (for example 400 partitions worked without issues).

Inner join:

========= Invalid __global__ read of size 4 bytes
=========     at 0x500 in void cuco::detail::pair_count<(unsigned int)128, (unsigned int)2, (bool)0, thrust::transform_iterator<cudf::detail::make_pair_function<cudf::row_hasher<cudf::detail::default_hash, cudf::nullate::DYNAMIC>, int>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, cuda::__4::atomic<unsigned long, (cuda::std::__4::__detail::thread_scope)1>, cuco::static_multimap<unsigned int, int, (cuda::std::__4::__detail::thread_scope)1, rmm::mr::stream_allocator_adaptor<default_allocator<char>>, cuco::double_hashing<(unsigned int)2, cudf::detail::MurmurHash3_32<unsigned int>, cudf::detail::MurmurHash3_32<unsigned int>>>::device_view, cudf::detail::pair_equality<cudf::row_equality_comparator<cudf::nullate::DYNAMIC>>>(T4, T4, T5 *, T6, T7)
=========     by thread (64,0,0) in block (14773391,0,0)
=========     Address 0xbcd89fe80 is out of bounds
=========     and is 1603745152 bytes before the nearest allocation at 0xc2d213400 of size 256 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x22da7a]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3deb04b]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame: [0x3e28798]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame:unsigned long cuco::static_multimap<unsigned int, int, (cuda::std::__4::__detail::thread_scope)1, rmm::mr::stream_allocator_adaptor<default_allocator<char> >, cuco::double_hashing<2u, cudf::detail::MurmurHash3_32<unsigned int>, cudf::detail::MurmurHash3_32<unsigned int> > >::pair_count<thrust::transform_iterator<cudf::detail::make_pair_function<cudf::row_hasher<cudf::detail::MurmurHash3_32, cudf::nullate::DYNAMIC>, int>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, cudf::detail::pair_equality<cudf::row_equality_comparator<cudf::nullate::DYNAMIC> > >(thrust::transform_iterator<cudf::detail::make_pair_function<cudf::row_hasher<cudf::detail::MurmurHash3_32, cudf::nullate::DYNAMIC>, int>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::transform_iterator<cudf::detail::make_pair_function<cudf::row_hasher<cudf::detail::MurmurHash3_32, cudf::nullate::DYNAMIC>, int>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, cudf::detail::pair_equality<cudf::row_equality_comparator<cudf::nullate::DYNAMIC> >, CUstream_st*) const [0x1e69a52]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame:unsigned long cudf::detail::_GLOBAL__N__dbc92c90_12_hash_join_cu_cd66f71b::compute_join_output_size<cudf::detail::join_kind>(cudf::table_device_view, cudf::detail::_GLOBAL__N__dbc92c90_12_hash_join_cu_cd66f71b::compute_join_output_size<cudf::detail::join_kind>, cuco::static_multimap<unsigned int, int, cuda::std::__4::__detail::thread_scope, rmm::mr::stream_allocator_adaptor<default_allocator<char>>, cudf::table_device_view::double_hashing<unsigned int=2, cudf::detail::MurmurHash3_32<unsigned int>, cudf::detail::MurmurHash3_32>> const &, bool, cudf::null_equality, cuda::std::__4::__detail::thread_scope::cuda_stream_view) [0x1e69fcc]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame:std::pair<std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector>>, std::default_delete<rmm::device_uvector>> cudf::detail::_GLOBAL__N__dbc92c90_12_hash_join_cu_cd66f71b::probe_join_hash_table<cudf::detail::join_kind>(cudf::table_device_view, std::pair<std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector>>, std::default_delete<rmm::device_uvector>>, cuco::static_multimap<unsigned int, int, cuda::std::__4::__detail::thread_scope, std::unique_ptr::mr::stream_allocator_adaptor<default_allocator<char>>, cudf::table_device_view::double_hashing<unsigned int=2, cudf::detail::MurmurHash3_32<unsigned int>, cudf::detail::MurmurHash3_32>> const &, bool, cudf::null_equality, std::optional<unsigned long>, std::unique_ptr::cuda_stream_view, cuda::std::__4::__detail::thread_scope::device_memory_resource*) [0x1e6f64f]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame:std::pair<std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector<int> > >, std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector<int> > > > cudf::detail::hash_join<cudf::detail::MurmurHash3_32<unsigned int> >::probe_join_indices<(cudf::detail::join_kind)0>(cudf::table_view const&, std::optional<unsigned long>, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) const [0x1e6f7f2]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame:std::pair<std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector<int> > >, std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector<int> > > > cudf::detail::hash_join<cudf::detail::MurmurHash3_32<unsigned int> >::compute_hash_join<(cudf::detail::join_kind)0>(cudf::table_view const&, std::optional<unsigned long>, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) const [0x1e6face]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame:cudf::hash_join::inner_join(cudf::table_view const&, std::optional<unsigned long>, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) const [0x1e679e3]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame:cudf::detail::inner_join(cudf::table_view const &, cudf::table_view const &, cudf::null_equality, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1e70633]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame:cudf::inner_join(cudf::table_view const &, cudf::table_view const &, cudf::null_equality, rmm::mr::device_memory_resource*) [0x1e70c5c]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame:Java_ai_rapids_cudf_Table_innerJoinGatherMaps [0x14ec5e3]
=========                in /tmp/cudf1750694697214535636.so
=========     Host Frame: [0x254ac96a7]
=========                in
=========

Leftsemi:

========= Invalid __global__ read of size 4 bytes
=========     at 0x440 in /spark-rapids-jni/thirdparty/cudf/cpp/include/cudf/column/column_device_view.cuh:431:T1 cudf::column_device_view::element<int, (void *)0>(int) const
=========     by thread (0,0,0) in block (29517103,0,0)
=========     Address 0xa4ba95700 is out of bounds
=========     and is 3222998784 bytes before the nearest allocation at 0xb0bc46600 of size 256 bytes
=========     Device Frame:/spark-rapids-jni/thirdparty/cudf/cpp/include/cudf/table/row_operators.cuh:538:unsigned int cudf::element_hasher_with_seed<cudf::detail::default_hash, cudf::nullate::DYNAMIC>::operator ()<int, (void *)0>(cudf::column_device_view, int) const [0x3f0]
=========     Device Frame:/spark-rapids-jni/thirdparty/cudf/cpp/include/cudf/utilities/type_dispatcher.hpp:455:decltype(auto) cudf::type_dispatcher<cudf::dispatch_storage_type, cudf::element_hasher_with_seed<cudf::detail::default_hash, cudf::nullate::DYNAMIC>, const cudf::column_device_view &, int &>(cudf::data_type, T2, T3 &&...) [0x2e0]
=========     Device Frame:/spark-rapids-jni/thirdparty/cudf/cpp/include/cudf/table/row_operators.cuh:605:cudf::row_hasher<cudf::detail::default_hash, cudf::nullate::DYNAMIC>::operator ()(int) const [0x1c0]
=========     Device Frame:/spark-rapids-jni/thirdparty/cudf/cpp/src/search/contains_table.cu:71:auto cudf::detail::<unnamed>::strong_index_hasher_adapter<cudf::row_hasher<cudf::detail::default_hash, cudf::nullate::DYNAMIC>>::operator ()<cudf::experimental::row::lhs_index_type, (void *)0>(T1) const [0x1c0]
=========     Device Frame:/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/cuco-src/include/cuco/static_map.cuh:510:cuco::pair<cuda::__4::atomic<cudf::experimental::row::lhs_index_type, (cuda::std::__4::__detail::thread_scope)1>, cuda::__4::atomic<int, (cuda::std::__4::__detail::thread_scope)1>> * cuco::static_map<cudf::experimental::row::lhs_index_type, int, (cuda::std::__4::__detail::thread_scope)1, rmm::mr::stream_allocator_adaptor<default_allocator<char>>>::device_view_base::initial_slot<cooperative_groups::__v1::thread_block_tile<(unsigned int)4, cooperative_groups::__v1::thread_block>, cudf::experimental::row::lhs_index_type, cudf::detail::<unnamed>::strong_index_hasher_adapter<cudf::row_hasher<cudf::detail::default_hash, cudf::nullate::DYNAMIC>>>(const T1 &, const T2 &, T3) [0x1c0]
=========     Device Frame:/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/cuco-src/include/cuco/detail/static_map.inl:520:bool cuco::static_map<cudf::experimental::row::lhs_index_type, int, (cuda::std::__4::__detail::thread_scope)1, rmm::mr::stream_allocator_adaptor<default_allocator<char>>>::device_mutable_view::insert<cooperative_groups::__v1::thread_block_tile<(unsigned int)4, cooperative_groups::__v1::thread_block>, cudf::detail::<unnamed>::strong_index_hasher_adapter<cudf::row_hasher<cudf::detail::default_hash, cudf::nullate::DYNAMIC>>, cudf::detail::<unnamed>::strong_index_comparator_adapter<cudf::row_equality_comparator<cudf::nullate::DYNAMIC>>>(const T1 &, const cuco::pair<cudf::experimental::row::lhs_index_type, int> &, T2, T3) [0xc0]
=========     Device Frame:/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/cuco-src/include/cuco/detail/static_map_kernels.cuh:154:void cuco::detail::insert<(unsigned long)128, (unsigned int)4, thrust::transform_iterator<_INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::detail::<unnamed>::contains_without_lists_or_nans(const _INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::table_view &, const _INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::table_view &, _INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::null_equality, rmm::cuda_stream_view, rmm::mr::device_memory_resource *)::[lambda(T1) (instance 1)], thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, cuda::__4::atomic<unsigned long, (cuda::std::__4::__detail::thread_scope)1>, cuco::static_map<_INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::experimental::row::lhs_index_type, int, (cuda::std::__4::__detail::thread_scope)1, rmm::mr::stream_allocator_adaptor<default_allocator<char>>>::device_mutable_view, _INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::detail::<unnamed>::strong_index_hasher_adapter<_INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::row_hasher<_INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::detail::default_hash, _INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::nullate::DYNAMIC>>, _INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::detail::<unnamed>::strong_index_comparator_adapter<_INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::row_equality_comparator<_INTERNAL_b2e14aee_17_contains_table_cu_f61ccc2b_310::cudf::nullate::DYNAMIC>>>(T3, T3, T4 *, T5, T6, T7) [0xc0]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x22da7a]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3decaab]
=========                in /tmp/cudf1875700103146174489.so
=========     Host Frame: [0x3e2a1f8]
=========                in /tmp/cudf1875700103146174489.so
=========     Host Frame:cuco::static_map<cudf::experimental::row::lhs_index_type(void, cudf::experimental::row::lhs_index_type, cudf::detail::_GLOBAL__N__b2e14aee_17_contains_table_cu_f61ccc2b_310::strong_index_comparator_adapter<cudf::row_equality_comparator<cudf::nullate>>, int, cuda::std::__4::__detail::thread_scope, CUstream_st*), int, cuda::std::__4::__detail::thread_scope, rmm::mr::stream_allocator_adaptor<default_allocator<char>>>::insert<thrust::transform_iterator<__nv_dl_wrapper_t<__nv_dl_tag<rmm::device_uvector<bool> (*) (cudf::table_view const &, cudf::table_view const &, cudf::null_equality, rmm::cuda_stream_view, rmm::mr::device_memory_resource*), __operator_&__(cudf::detail::_GLOBAL__N__b2e14aee_17_contains_table_cu_f61ccc2b_310::contains_without_lists_or_nans(cudf::table_view const &, cudf::table_view const &, cudf::null_equality, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)), unsigned int=1>>, thrust::counting_iterator<int, thrust::use_default, thrust::counting_iterator, thrust::counting_iterator>, thrust::counting_iterator, thrust::counting_iterator>, cudf::detail::_GLOBAL__N__b2e14aee_17_contains_table_cu_f61ccc2b_310::strong_index_hasher_adapter<cudf::row_hasher<cudf::detail::MurmurHash3_32, cudf::nullate::DYNAMIC>>, cudf::detail::_GLOBAL__N__b2e14aee_17_contains_table_cu_f61ccc2b_310::strong_index_comparator_adapter<cudf::row_equality_comparator<cudf::nullate>>> [0x2a607ce]
=========                in /tmp/cudf1875700103146174489.so
=========     Host Frame:cudf::detail::_GLOBAL__N__b2e14aee_17_contains_table_cu_f61ccc2b_310::contains_without_lists_or_nans(cudf::table_view const &, cudf::table_view const &, cudf::null_equality, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2a5ee26]
=========                in /tmp/cudf1875700103146174489.so
=========     Host Frame:cudf::detail::contains(cudf::table_view const &, cudf::table_view const &, cudf::null_equality, cudf::nan_equality, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2a5f222]
=========                in /tmp/cudf1875700103146174489.so
=========     Host Frame:cudf::detail::left_semi_anti_join(cudf::detail::join_kind, cudf::table_view const &, cudf::table_view const &, cudf::null_equality, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1e82d72]
=========                in /tmp/cudf1875700103146174489.so
=========     Host Frame:cudf::left_semi_join(cudf::table_view const &, cudf::table_view const &, cudf::null_equality, rmm::mr::device_memory_resource*) [0x1e83a1c]
=========                in /tmp/cudf1875700103146174489.so
=========     Host Frame:Java_ai_rapids_cudf_Table_leftSemiJoinGatherMap [0x14ebf33]
=========                in /tmp/cudf1875700103146174489.so
=========     Host Frame: [0x272ea7627]
=========                in
@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify reliability Features to improve reliability or bugs that severly impact the reliability of the plugin labels Nov 9, 2022
@abellina
Copy link
Collaborator Author

Fixed by NVIDIA/cuCollections#243.

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

No branches or pull requests

2 participants