[BUG] test_exact_percentile_groupby FAILED: hash_aggregate_test.py::test_exact_percentile_groupby with DATAGEN seed 1713362217 #10719

NvTimLiu · 2024-04-17T14:16:50Z

Describe the bug
test_exact_percentile_groupby FAILED: hash_aggregate_test.py::test_exact_percentile_groupby on DB-11.3

[2024-04-17T12:33:01.024Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_exact_percentile_groupby[[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]1][DATAGEN_SEED=1713347944, TZ=UTC, INJECT_OOM, IGNORE_ORDER] - AssertionError: GPU and CPU are not both null at [44, 'percentile(val, 0.1,...

Detailed failure output below:

=================================== FAILURES ===================================
_
linux -- Python 3.8.10 /usr/bin/python

data_gen = [('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]

    @ignore_order
    @pytest.mark.parametrize('data_gen', exact_percentile_groupby_data_gen, ids=idfn)
    def test_exact_percentile_groupby(data_gen):
>       assert_gpu_and_cpu_are_equal_collect(
            lambda spark: exact_percentile_groupby(gen_df(spark, data_gen))
        )

../../src/main/python/hash_aggregate_test.py:998: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../src/main/python/asserts.py:595: in assert_gpu_and_cpu_are_equal_collect
    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
../../src/main/python/asserts.py:517: in _assert_gpu_and_cpu_are_equal
    assert_equal(from_cpu, from_gpu)
../../src/main/python/asserts.py:107: in assert_equal
    _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
../../src/main/python/asserts.py:43: in _assert_equal
    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
../../src/main/python/asserts.py:36: in _assert_equal
    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cpu = None, gpu = 0.0
float_check = <function get_float_check.<locals>.<lambda> at 0x7fa1f8581820>
path = [44, 'percentile(val, 0.1, abs(freq))']

    def _assert_equal(cpu, gpu, float_check, path):
        t = type(cpu)
        if (t is Row):
            assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
                assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
                for field in cpu.__fields__:
                    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
            else:
                for index in range(len(cpu)):
                    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is list):
            assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            for index in range(len(cpu)):
                _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is tuple):
            assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            for index in range(len(cpu)):
                _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is pytypes.GeneratorType):
            index = 0
            # generator has no zip :( so we have to do this the hard way
            done = False
            while not done:
                sub_cpu = None
                sub_gpu = None
                try:
                    sub_cpu = next(cpu)
                except StopIteration:
                    done = True
    
                try:
                    sub_gpu = next(gpu)
                except StopIteration:
                    done = True
    
                if done:
                    assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
                else:
                    _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
    
                index = index + 1
        elif (t is dict):
            # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
            # so sort the items to do our best with ignoring the order of dicts
            cpu_items = list(cpu.items()).sort(key=_RowCmp)
            gpu_items = list(gpu.items()).sort(key=_RowCmp)
            _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
        elif (t is int):
            assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
        elif (t is float):
            if (math.isnan(cpu)):
                assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
            else:
                assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
        elif isinstance(cpu, str):
            assert cpu == gpu, "GPU and CPU string values are different at {}".format(path)
        elif isinstance(cpu, datetime):
            assert cpu == gpu, "GPU and CPU timestamp values are different at {}".format(path)
        elif isinstance(cpu, date):
            assert cpu == gpu, "GPU and CPU date values are different at {}".format(path)
        elif isinstance(cpu, bool):
            assert cpu == gpu, "GPU and CPU boolean values are different at {}".format(path)
        elif isinstance(cpu, Decimal):
            assert cpu == gpu, "GPU and CPU decimal values are different at {}".format(path)
        elif isinstance(cpu, bytearray):
            assert cpu == gpu, "GPU and CPU bytearray values are different at {}".format(path)
        elif isinstance(cpu, timedelta):
            # Used by interval type DayTimeInterval for Pyspark 3.3.0+
            assert cpu == gpu, "GPU and CPU timedelta values are different at {}".format(path)
        elif (cpu == None):
>           assert cpu == gpu, "GPU and CPU are not both null at {}".format(path)
E           AssertionError: GPU and CPU are not both null at [44, 'percentile(val, 0.1, abs(freq))']

../../src/main/python/asserts.py:100: AssertionError
----------------------------- Captured stdout call -----------------------------
### CPU RUN ###
### GPU RUN ###
### COLLECT: GPU TOOK 1.2428700923919678 CPU TOOK 0.9993729591369629 ###
--- CPU OUTPUT
+++ GPU OUTPUT
@@ -42,7 +42,7 @@
 Row(key=-699991384, percentile(val, 0.1)=inf, percentile(val, 0)=-3.233434253460157e+218, percentile(val, 1)=nan, percentile(val, array(0.1))=[inf], percentile(val, array())=None, percentile(val, array(0.1, 0.5, 0.9))=[inf, inf, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))=[-3.233434253460157e+218, inf, inf, nan, nan], percentile(val, 0.1, abs(freq))=-3.233434253460157e+218, percentile(val, 0, abs(freq))=-3.233434253460157e+218, percentile(val, 1, abs(freq))=nan, percentile(val, array(0.1), abs(freq))=[-3.233434253460157e+218], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[-3.233434253460157e+218, inf, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1), abs(freq))=[-3.233434253460157e+218, -3.233434253460157e+218, inf, nan, nan])
 Row(key=-663106112, percentile(val, 0.1)=inf, percentile(val, 0)=-8.040719402880842e-261, percentile(val, 1)=nan, percentile(val, array(0.1))=[inf], percentile(val, array())=None, percentile(val, array(0.1, 0.5, 0.9))=[inf, nan, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))=[-8.040719402880842e-261, inf, nan, nan, nan], percentile(val, 0.1, abs(freq))=-8.040719402880842e-261, percentile(val, 0, abs(freq))=-8.040719402880842e-261, percentile(val, 1, abs(freq))=nan, percentile(val, array(0.1), abs(freq))=[-8.040719402880842e-261], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[-8.040719402880842e-261, nan, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1), abs(freq))=[-8.040719402880842e-261, -8.040719402880842e-261, nan, nan, nan])
 Row(key=-642917234, percentile(val, 0.1)=inf, percentile(val, 0)=-1.0, percentile(val, 1)=nan, percentile(val, array(0.1))=[inf], percentile(val, array())=None, percentile(val, array(0.1, 0.5, 0.9))=[inf, inf, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))=[-1.0, -0.998, inf, nan, nan], percentile(val, 0.1, abs(freq))=inf, percentile(val, 0, abs(freq))=inf, percentile(val, 1, abs(freq))=nan, percentile(val, array(0.1), abs(freq))=[inf], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[inf, nan, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1), abs(freq))=[inf, inf, nan, nan, nan])
-Row(key=-421192727, percentile(val, 0.1)=inf, percentile(val, 0)=-2.158391834949709e-101, percentile(val, 1)=nan, percentile(val, array(0.1))=[inf], percentile(val, array())=None, percentile(val, array(0.1, 0.5, 0.9))=[inf, nan, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))=[-2.158391834949709e-101, 1.884807116393614e+262, nan, nan, nan], percentile(val, 0.1, abs(freq))=None, percentile(val, 0, abs(freq))=None, percentile(val, 1, abs(freq))=None, percentile(val, array(0.1), abs(freq))=None, percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=None, percentile(val, array(0, 0.0001, 0.5, 0.9999, 1), abs(freq))=None)
+Row(key=-421192727, percentile(val, 0.1)=inf, percentile(val, 0)=-2.158391834949709e-101, percentile(val, 1)=nan, percentile(val, array(0.1))=[inf], percentile(val, array())=None, percentile(val, array(0.1, 0.5, 0.9))=[inf, nan, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))=[-2.158391834949709e-101, 1.884807116393614e+262, nan, nan, nan], percentile(val, 0.1, abs(freq))=0.0, percentile(val, 0, abs(freq))=0.0, percentile(val, 1, abs(freq))=0.0, percentile(val, array(0.1), abs(freq))=[0.0], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[0.0, 0.0, 0.0], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1), abs(freq))=[4.9e-322, 5e-324, 5e-324, 2.5e-323, 5e-324])
 Row(key=-367157519, percentile(val, 0.1)=inf, percentile(val, 0)=inf, percentile(val, 1)=nan, percentile(val, array(0.1))=[inf], percentile(val, array())=None, percentile(val, array(0.1, 0.5, 0.9))=[inf, inf, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))=[inf, inf, inf, nan, nan], percentile(val, 0.1, abs(freq))=inf, percentile(val, 0, abs(freq))=inf, percentile(val, 1, abs(freq))=nan, percentile(val, array(0.1), abs(freq))=[inf], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[inf, inf, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1), abs(freq))=[inf, inf, inf, nan, nan])
 Row(key=-360509906, percentile(val, 0.1)=inf, percentile(val, 0)=-2.018187239055581e-271, percentile(val, 1)=nan, percentile(val, array(0.1))=[inf], percentile(val, array())=None, percentile(val, array(0.1, 0.5, 0.9))=[inf, inf, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))=[-2.018187239055581e-271, 1.1766604053981384e+156, inf, nan, nan], percentile(val, 0.1, abs(freq))=inf, percentile(val, 0, abs(freq))=-2.018187239055581e-271, percentile(val, 1, abs(freq))=nan, percentile(val, array(0.1), abs(freq))=[inf], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[inf, inf, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1), abs(freq))=[-2.018187239055581e-271, -2.018187239055581e-271, inf, nan, nan])
 Row(key=-355871595, percentile(val, 0.1)=inf, percentile(val, 0)=-3.451721078952032e-86, percentile(val, 1)=nan, percentile(val, array(0.1))=[inf], percentile(val, array())=None, percentile(val, array(0.1, 0.5, 0.9))=[inf, inf, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))=[-3.451721078952032e-86, inf, inf, nan, nan], percentile(val, 0.1, abs(freq))=inf, percentile(val, 0, abs(freq))=inf, percentile(val, 1, abs(freq))=nan, percentile(val, array(0.1), abs(freq))=[inf], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[inf, inf, nan], percentile(val, array(0, 0.0001, 0.5, 0.9999, 1), abs(freq))=[inf, inf, inf, nan, nan])


NvTimLiu · 2024-04-18T15:20:20Z

The failure was not observed in today's nightly, with no code update in between.

1023 FAILED, 1024 PASSED, same revision: 66f2cc5

Will keep monitoring.

mattahrens · 2024-04-23T20:43:07Z

Can you document what the datagen seed was for original failure and try to repro it? We want to keep this open for original failure with what the datagen seed is.

sameerz · 2024-04-23T20:46:13Z

Can you document what the datagen seed was for original failure and try to repro it? We want to keep this open for original failure with what the datagen seed is.

Updated title: DATAGEN seed = 1713362217
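
Recording the seed matters because a generator driven by a single seed produces the same rows on every run. The sketch below illustrates that property with a hypothetical `gen_rows` helper; it is only a stand-in and does not reflect how the integration tests' data generators are actually implemented.

```python
import random


def gen_rows(seed, n=5):
    """Hypothetical seeded generator of (key, val, freq) rows.

    Not the real data_gen code: it only shows that one seed
    fully determines the generated data.
    """
    rng = random.Random(seed)  # all randomness flows from this one seed
    rows = []
    for _ in range(n):
        key = rng.randint(-10, 10)
        val = rng.uniform(-1e6, 1e6)
        freq = rng.randint(0, 3)  # a zero frequency is possible
        rows.append((key, val, freq))
    return rows


seed = 1713362217  # seed quoted in the updated title
first = gen_rows(seed)
second = gen_rows(seed)
print(first == second)  # same seed, same rows
```

With the seed documented, a failing input can be regenerated instead of waiting for the nightly run to hit it again.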

jlowe · 2024-04-23T20:49:46Z

The diff comes from the CPU producing nulls where the GPU does not. Splitting the differing columns out onto their own lines:
CPU:

percentile(val, 0.1, abs(freq))=None,
percentile(val, 0, abs(freq))=None,
percentile(val, 1, abs(freq))=None,
percentile(val, array(0.1), abs(freq))=None,
percentile(val, array(), abs(freq))=None,
percentile(val, array(0.1, 0.5, 0.9), abs(freq))=None,
percentile(val, array(0, 0.0001, 0.5, 0.9999, 1), abs(freq))=None)

GPU:

percentile(val, 0.1, abs(freq))=0.0,
percentile(val, 0, abs(freq))=0.0,
percentile(val, 1, abs(freq))=0.0,
percentile(val, array(0.1), abs(freq))=[0.0],
percentile(val, array(), abs(freq))=None,
percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[0.0, 0.0, 0.0],
percentile(val, array(0, 0.0001, 0.5, 0.9999, 1), abs(freq))=[4.9e-322, 5e-324, 5e-324, 2.5e-323, 5e-324])
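
The same column-by-column split can be done mechanically. This is a minimal sketch, assuming each row is available as a plain dict of column name to value; `differing_columns` and `_same` are hypothetical helpers, not part of asserts.py. NaN is treated as equal to NaN so that columns where both sides return `nan` are not reported.

```python
import math


def _same(a, b):
    """Equality that treats NaN as equal to NaN and recurses into lists."""
    if isinstance(a, float) and isinstance(b, float):
        return (math.isnan(a) and math.isnan(b)) or a == b
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(_same(x, y) for x, y in zip(a, b))
    return a == b


def differing_columns(cpu_row, gpu_row):
    """Return {column: (cpu_value, gpu_value)} for columns that differ."""
    return {
        name: (cpu_val, gpu_row.get(name))
        for name, cpu_val in cpu_row.items()
        if not _same(cpu_val, gpu_row.get(name))
    }


# A few columns of the failing row, copied from the output above.
cpu_row = {
    "key": -421192727,
    "percentile(val, 1)": float("nan"),
    "percentile(val, 0.1, abs(freq))": None,
    "percentile(val, array(0.1), abs(freq))": None,
    "percentile(val, array(), abs(freq))": None,
}
gpu_row = {
    "key": -421192727,
    "percentile(val, 1)": float("nan"),
    "percentile(val, 0.1, abs(freq))": 0.0,
    "percentile(val, array(0.1), abs(freq))": [0.0],
    "percentile(val, array(), abs(freq))": None,
}

diffs = differing_columns(cpu_row, gpu_row)
for name, (cpu_val, gpu_val) in diffs.items():
    print(f"{name}: CPU={cpu_val!r} GPU={gpu_val!r}")
```

Only the two weighted columns with a non-empty percentile argument are reported; `percentile(val, array(), abs(freq))` is `None` on both sides and is left out.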

sameerz · 2024-04-24T17:10:26Z

Does this need a fixed seed, or do we need to fix the underlying problem?
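
One possible reading of the CPU output, not confirmed anywhere in this thread: the unweighted percentiles for this key are non-null while every weighted one is null, which is what one would expect if no row in the group contributes once the frequency is applied (for example, every `abs(freq)` is 0). The sketch below is a plain-Python model of that behavior. `exact_percentile` is a hypothetical name, and this is neither the Spark nor the spark-rapids implementation; the interpolation rule and the handling of zero frequencies are assumptions of the sketch.

```python
import math


def exact_percentile(values, freqs, p):
    """Weighted exact percentile with linear interpolation.

    Assumption of this sketch: rows with a null value or a zero
    frequency are skipped, and None is returned if nothing is left.
    """
    counts = {}
    for v, f in zip(values, freqs):
        if f < 0:
            raise ValueError("negative frequency")
        if v is None or f == 0:
            continue  # row contributes nothing
        counts[v] = counts.get(v, 0) + f
    if not counts:
        return None  # empty group after applying frequencies

    ordered = sorted(counts.items())
    total = sum(f for _, f in ordered)
    position = p * (total - 1)
    lower, upper = math.floor(position), math.ceil(position)

    def value_at(index):
        # Walk the cumulative counts to find the value at this rank.
        running = 0
        for v, f in ordered:
            running += f
            if index < running:
                return v

    low_val, high_val = value_at(lower), value_at(upper)
    if lower == upper or low_val == high_val:
        return low_val
    return (upper - position) * low_val + (position - lower) * high_val


print(exact_percentile([1.0, 3.0], [1, 1], 0.5))  # both rows count
print(exact_percentile([1.0, 3.0], [0, 0], 0.1))  # no row counts
```

If that reading is right, a fixed seed would only hide the mismatch; the GPU path would still need to return null for a group with no contributing rows.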

NvTimLiu added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels Apr 17, 2024

NvTimLiu self-assigned this Apr 18, 2024

mattahrens removed the "? - Needs Triage" (Need team to review and classify) label Apr 23, 2024

sameerz changed the title from "[BUG] test_exact_percentile_groupby FAILED: hash_aggregate_test.py::test_exact_percentile_groupby" to "[BUG] test_exact_percentile_groupby FAILED: hash_aggregate_test.py::test_exact_percentile_groupby with DATAGEN seed 1713362217" Apr 23, 2024

sameerz unassigned NvTimLiu Apr 23, 2024

thirtiseven mentioned this issue Apr 24, 2024: Use fixed seed for some random failed tests #10739 (Merged)

mattahrens assigned mythrocks Apr 26, 2024

mattahrens mentioned this issue Apr 30, 2024: [BUG] test_exact_percentile_groupby_partial_fallback_to_cpu failed with DATAGEN_SEED=1713928179 #10738 (Open)

