[BUG] 30TB query95 fails on the join with illegal memory access with 200 partitions #7036
Labels
bug
Something isn't working
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
As a follow-on to #6983, we ran the q95 query at 30TB with the fix in this PR (rapidsai/cudf#12079), and it failed later during two of the joins: an inner join and a left semi join.
In both of those cases we are hitting instances of the overflowing strided loop issue in cuco's static_multimap::pair_count and static_map::insert (see compute-sanitizer output below). It looks like cuDF could work around this by using int64_t as the type in their counting_transform_iterator (like I did in this proof-of-concept), but it is not clear if that is the right solution. This issue is for our tracking, but the fix will be in cuDF or cuCollections. The only current workaround is to increase our shuffle partitions (for example 400 partitions worked without issues).
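To make the failure mode concrete, here is a minimal host-side C++ sketch of a strided loop index. It is not cuco or cuDF code: the function names, the thread index `tid`, and the `stride` value are all hypothetical, and the real kernels may be structured differently. It only models the arithmetic: a loop of the form `for (idx = tid; idx < n; idx += stride)` performs one final increment past `n`, and with a 32-bit signed index that increment can exceed INT32_MAX when `n` is close to 2^31, while a 64-bit index has room to spare.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical model of one thread in a strided loop:
//   for (idx = tid; idx < n; idx += stride) { ... }
// Returns true if the final `idx += stride` would not fit in a 32-bit
// signed integer, i.e. the loop would overflow with an `int` index.
bool strided_loop_overflows_int32(std::int64_t tid, std::int64_t stride, std::int64_t n)
{
  assert(tid >= 0 && stride > 0);
  if (tid >= n) { return false; }  // loop body never runs, no increment happens
  // Last in-range index this thread visits.
  std::int64_t const last = tid + ((n - 1 - tid) / stride) * stride;
  // The loop increments once more before the `idx < n` check fails.
  return last + stride > std::numeric_limits<std::int32_t>::max();
}

// The same loop with a 64-bit index: the final increment is always
// representable for sizes in this range, so the loop terminates normally.
std::int64_t strided_loop_iterations_int64(std::int64_t tid, std::int64_t stride, std::int64_t n)
{
  assert(tid >= 0 && stride > 0);
  std::int64_t count = 0;
  for (std::int64_t idx = tid; idx < n; idx += stride) { ++count; }
  return count;
}
```

In this model, a thread with `tid = 0` and `stride = 1048576` over `n = 2147000000` elements visits 2048 indices, and the increment after the last one lands on 2147483648, one past INT32_MAX. Raising the shuffle partition count shrinks the per-partition row count `n`, which is consistent with the 400-partition run avoiding the problem.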
Inner join:
Left semi: