
[BUG] 330cdh failed test_hash_reduction_sum_full_decimal on CI #9779

Closed
gerashegalov opened this issue Nov 17, 2023 · 8 comments · Fixed by #10178
Labels: bug (Something isn't working)

gerashegalov (Collaborator):

Describe the bug

_ test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-Decimal(38,0)] _

--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1 +1 @@
-Row(sum(a)=None)
+Row(sum(a)=Decimal('-20600424020936538707021213418489297935'))

...

_ test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-Decimal(38,-10)] _

--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1 +1 @@
-Row(sum(a)=None)
+Row(sum(a)=Decimal('-2.0600424020936538707021213418489297935E+47'))

Steps/Code to reproduce bug
TBD

Expected behavior
should pass

Environment details

INFO ShimLoader: Loading shim for Spark version: 3.3.0.3.3.7180.0-274
INFO ShimLoader: Complete Spark build info: 3.3.0.3.3.7180.0-274, git@github.infra.cloudera.com:CDH/spark3.git, HEAD, 73b4169950a4b83435306674db9e31e28529e8b5, 2022-08-30T12:57:40Z


@gerashegalov gerashegalov added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 17, 2023
pxLi (Collaborator) commented Nov 20, 2023

This also failed multiple test cases in different CDH nightly runs; all show mismatched CPU and GPU results.

FAILED ../../src/main/python/string_test.py::test_initcap[DATAGEN_SEED=1700413755, INJECT_OOM, INCOMPAT]


[2023-11-19T17:11:01.115Z] ### CPU RUN ###
[2023-11-19T17:11:01.115Z] ### GPU RUN ###
[2023-11-19T17:11:01.115Z] ### COLLECT: GPU TOOK 0.11538124084472656 CPU TOOK 0.6901013851165771 ###
[2023-11-19T17:11:01.115Z] --- CPU OUTPUT
[2023-11-19T17:11:01.115Z] +++ GPU OUTPUT
[2023-11-19T17:11:01.115Z] @@ -1114,7 +1114,7 @@
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='Ⱥb3bb55a%@ab1é%  31ba%bⱥⱥÿ35-ÿⱥ3\t\né@5a5_a-%3éⱥÿa@\r @_@35abb3b5ⱥ3aÿ\t\tbÿb5bÿbÿ7_b7ⱥé%  ')
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='Aa1b@3775b-3aé7\n\r7aⱥbbⱥ%b3_5abaⱥ\tÿé@313%b3bÿ%%5%\r B1ⱥ%-éÿ@ba_bé%ÿ\t\t_77ÿ7%ÿ_%bb1ÿ_é\n\n')
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='3_ÿa1715ÿ1a_é7ÿ\r\t_aba1-5bb7é3ⱥÿ-\tÿ_b-%%é51%a@531\tbÿb5éⱥb5ⱥ@5b@_@\t\né3ⱥⱥbⱥ777bb_ⱥ37 ')
[2023-11-19T17:11:01.115Z] -Row(initcap(a)='ßo\x8e°}£\x11\x8aî\x94')
[2023-11-19T17:11:01.115Z] +Row(initcap(a)='SSo\x8e°}£\x11\x8aî\x94')
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='3bb-1%_@aé%b7ab\r\n35ⱥ-@baaÿ1-é3ÿa \né3ba3@3b3ⱥ@55%1\r\n7ÿa3ÿ%bÿ--5ⱥⱥbÿ\n\na_bba7b%5bbb131\r ')
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='É5@51@ⱥb7%ba7@3\r\naabⱥⱥÿa_-_--3b Aÿ@_3-a-3éÿ1a@_\r\nⱥ5-%7@bb7ÿ@b5b@\r\r15_-b%3a-éb%ba%\t\t')
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='1ÿé%bbé_%ⱥb7_ⱥ5\r Ÿ-%@%_ⱥⱥ%ÿaⱥ3aa\t%77bÿ5@51-bⱥ@5- \n7bé%ÿⱥ%ⱥ33%-3é\r\ré71355ⱥ5ab3@35a\t\t')

FAILED ../../src/main/python/schema_evolution_test.py::test_column_add_after_partition[orc][DATAGEN_SEED=1700413303, INJECT_OOM, IGNORE_ORDER({'local': True})]

[2023-11-19T17:02:42.493Z] ### CPU RUN ###
[2023-11-19T17:02:42.493Z] ### GPU RUN ###
[2023-11-19T17:02:42.493Z] ### COLLECT: GPU TOOK 1.816152572631836 CPU TOOK 2.8293328285217285 ###
[2023-11-19T17:02:42.493Z] --- CPU OUTPUT
[2023-11-19T17:02:42.493Z] +++ GPU OUTPUT
[2023-11-19T17:02:42.493Z] @@ -2323,7 +2323,7 @@
[2023-11-19T17:02:42.493Z]  Row(c=747949758072796799, new_0=False, new_1=68, new_2=-32415, new_3=1382142121, new_4=-2229952796518441831, new_5=None, new_6=6.443761270526859e+250, new_7='`0ÌÙÛ\x1bärÑE\x1e¼ª0D\x18ò@¤õLvÊ|\x99ã\x99]y»', new_8=datetime.date(6979, 5, 9), new_9=datetime.datetime(1842, 8, 20, 11, 32, 25, 128767), new_10=[datetime.date(2495, 9, 28), datetime.date(7989, 10, 3), datetime.date(7668, 3, 8), datetime.date(2483, 1, 7), datetime.date(7474, 12, 20), datetime.date(4109, 8, 6), datetime.date(6820, 2, 7), datetime.date(8399, 6, 9), datetime.date(5965, 4, 10), datetime.date(339, 10, 8), datetime.date(4812, 8, 16), datetime.date(9039, 3, 14)], new_11=Row(child0=None), new_12=Row(c0=[-4807041446153269385, -492861108868460956, 2445107072012345415, 8495904361905650441, 0, 3069048035912442783], c1=True), a=0, b='x')
[2023-11-19T17:02:42.493Z]  Row(c=756767049961098093, new_0=None, new_1=None, new_2=None, new_3=None, new_4=None, new_5=None, new_6=None, new_7=None, new_8=None, new_9=None, new_10=None, new_11=None, new_12=None, a=1, b='z')
[2023-11-19T17:02:42.493Z]  Row(c=758360272413391259, new_0=False, new_1=93, new_2=-24932, new_3=-1362240737, new_4=-326270103423974582, new_5=8.31479808960357e-27, new_6=-7.480874109391324e-227, new_7='wÏ\x02þ\xadèó\x11Ç\x8e\x05\x82\x86Ð$Î)ÑNä¢È\x10Uý\x028f<\x0f', new_8=datetime.date(4850, 1, 25), new_9=datetime.datetime(2178, 3, 5, 9, 50, 14, 38792), new_10=[datetime.date(2000, 3, 1), datetime.date(8824, 9, 15), datetime.date(214, 3, 10), datetime.date(9963, 2, 1)], new_11=Row(child0=Decimal('67480762349703055.83')), new_12=Row(c0=[0, 2092818463731806519], c1=True), a=0, b='z')
[2023-11-19T17:02:42.493Z] -Row(c=761887531456626263, new_0=False, new_1=-59, new_2=-27428, new_3=649879972, new_4=300241518059178542, new_5=0.0, new_6=-4.4153881189739035e+222, new_7=None, new_8=datetime.date(6464, 3, 19), new_9=datetime.datetime(2163, 1, 20, 6, 5, 33, 728202), new_10=[datetime.date(1582, 10, 15)], new_11=Row(child0=Decimal('-274330538087067736.29')), new_12=Row(c0=[-9169051718731907692, 3207217407084977732, 5828166240839985036, -769860422017136925, -6317725922746163153, -4531858387965210345, -7549594355732088622, 2870330501981589583, 4918878943345752411, -1, -2538576527345834980, -1777000261340738155, -454263620209757316, 8207708571539683202, 0, -7855648709717721967, -1580255216891208658, -5557174777925197249, 8490753925456683644, -6041309477268441620], c1=True), a=0, b='y')
[2023-11-19T17:02:42.493Z] +Row(c=761887531456626263, new_0=False, new_1=-59, new_2=-27428, new_3=649879972, new_4=300241518059178542, new_5=0.0, new_6=-4.4153881189739035e+222, new_7=None, new_8=datetime.date(6464, 3, 19), new_9=datetime.datetime(2163, 1, 20, 6, 5, 33, 728202), new_10=[datetime.date(1582, 10, 13)], new_11=Row(child0=Decimal('-274330538087067736.29')), new_12=Row(c0=[-9169051718731907692, 3207217407084977732, 5828166240839985036, -769860422017136925, -6317725922746163153, -4531858387965210345, -7549594355732088622, 2870330501981589583, 4918878943345752411, -1, -2538576527345834980, -1777000261340738155, -454263620209757316, 8207708571539683202, 0, -7855648709717721967, -1580255216891208658, -5557174777925197249, 8490753925456683644, -6041309477268441620], c1=True), a=0, b='y')
[2023-11-19T17:02:42.493Z]  Row(c=768380761926021119, new_0=True, new_1=-87, new_2=4034, new_3=None, new_4=8519045545037378653, new_5=5.065325137478845e+29, new_6=-4.666103834234089e-215, new_7='¶Î\x8f\x8cU¶\x9caa\nÿ[Ä\x90µ)\x85îm²d\x19Ù¨\x13Ý+\x15\xad\x16', new_8=datetime.date(2000, 3, 1), new_9=datetime.datetime(2141, 2, 23, 16, 7, 4, 731178), new_10=[datetime.date(3286, 5, 19), datetime.date(8069, 9, 17), datetime.date(5396, 12, 17), datetime.date(631, 11, 13), datetime.date(3529, 2, 18), datetime.date(2590, 4, 17), datetime.date(8565, 7, 10), datetime.date(1629, 6, 13)], new_11=Row(child0=Decimal('-880390660518370190.39')), new_12=Row(c0=None, c1=True), a=1, b='z')
[2023-11-19T17:02:42.493Z]  Row(c=789081968798337592, new_0=None, new_1=None, new_2=None, new_3=None, new_4=None, new_5=None, new_6=None, new_7=None, new_8=None, new_9=None, new_10=None, new_11=None, new_12=None, a=-1, b='z')
[2023-11-19T17:02:42.493Z]  Row(c=806216176268352943, new_0=True, new_1=-45, new_2=-26835, new_3=-1020998181, new_4=-4168755332298322914, new_5=-8.087319503242584e+24, new_6=5.857303241779751e+83, new_7=':Å°ø°Á÷\x00DM²PÚRT/\x13AH\x92eý~£µ+\x0b5c4', new_8=datetime.date(1650, 11, 11), new_9=datetime.datetime(2179, 5, 24, 1, 16, 57, 493474), new_10=[datetime.date(72, 12, 13), datetime.date(2033, 2, 23), datetime.date(8471, 1, 7), datetime.date(6000, 3, 1), datetime.date(7344, 6, 4), datetime.date(1875, 5, 20), datetime.date(9395, 8, 23), datetime.date(858, 8, 22), datetime.date(8000, 3, 1), datetime.date(1505, 1, 19), datetime.date(2624, 1, 30), datetime.date(5096, 11, 8), datetime.date(3053, 7, 14), datetime.date(4000, 2, 29), datetime.date(7627, 6, 8), datetime.date(4691, 5, 26)], new_11=Row(child0=None), new_12=Row(c0=[8883040341267712284, -5854667154153339848, -7992805157714132332, 3849673508210869062, -3458498021831871544, 5689542648647886628, -5973772120704534010, -6341409587235122230, 8986229899183325292, -3347809925252056780, 4127516452669723253, -914703985335797152, 8228864223747791863, 5034508634006205111, -4701340369623626054, 6539709654621966036, 3217011538087074601], c1=True), a=-1, b='y')

revans2 (Collaborator) commented Nov 20, 2023

@pxLi please file a separate issue for each failure. The two you listed here are not related to the SUM test failure that this issue is about.

pxLi (Collaborator) commented Nov 21, 2023

@pxLi please file a separate issue for each failure. The two you listed here are not related to the SUM test failure that this issue is about.

Got it, thanks for the clarification. Filed separate ones:
#9806
#9807

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Nov 21, 2023
jlowe (Member) commented Nov 22, 2023

The datagen seed was 1700246532:

[2023-11-17T19:05:41.753Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-Decimal(38,0)][DATAGEN_SEED=1700246532, IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT, ALLOW_NON_GPU(HashAggregateExec,AggregateExpression,UnscaledValue,MakeDecimal,AttributeReference,Alias,Sum,Count,Max,Min,Average,Cast,StddevPop,StddevSamp,VariancePop,VarianceSamp,NormalizeNaNAndZero,GreaterThan,Literal,If,EqualTo,First,SortAggregateExec,Coalesce,IsNull,EqualNullSafe,PivotFirst,GetArrayItem,ShuffleExchangeExec,HashPartitioning)]
[2023-11-17T19:05:41.753Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-Decimal(38,-10)][DATAGEN_SEED=1700246532, INJECT_OOM, IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT, ALLOW_NON_GPU(HashAggregateExec,AggregateExpression,UnscaledValue,MakeDecimal,AttributeReference,Alias,Sum,Count,Max,Min,Average,Cast,StddevPop,StddevSamp,VariancePop,VarianceSamp,NormalizeNaNAndZero,GreaterThan,Literal,If,EqualTo,First,SortAggregateExec,Coalesce,IsNull,EqualNullSafe,PivotFirst,GetArrayItem,ShuffleExchangeExec,HashPartitioning)]

jlowe (Member) commented Nov 22, 2023

While trying to reproduce this, I ran into a slightly different issue: instead of the CPU overflowing while the GPU did not, the opposite occurred. Here's what I ran to reproduce:

TEST_PARALLEL=0 PYSP_TEST_spark_master="local[1]" TZ=UTC DATAGEN_SEED=1700246532 REPORT_CHARS=fExXs SPARK_HOME=/home/jlowe/spark-3.3.3-bin-hadoop3/ integration_tests/run_pyspark_from_build.sh -k test_hash_reduction_sum_full_decimal
[...]
=================================== FAILURES ===================================
_ test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.batchSizeBytes': '250'}-Decimal(38,0)] _

data_gen = Decimal(38,0)
conf = {'spark.rapids.sql.batchSizeBytes': '250', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.variableFloatAgg.enabled': 'true'}
[...]
--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1 +1 @@
-Row(sum(a)=Decimal('-20600424020936538707021213418489297935'))
+Row(sum(a)=None)

_ test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.batchSizeBytes': '250'}-Decimal(38,-10)] _

data_gen = Decimal(38,-10)
conf = {'spark.rapids.sql.batchSizeBytes': '250', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.variableFloatAgg.enabled': 'true'}
[...]
--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1 +1 @@
-Row(sum(a)=Decimal('-2.0600424020936538707021213418489297935E+47'))
+Row(sum(a)=None)

jlowe (Member) commented Nov 22, 2023

Running with local[2] instead of local[1] leads to the CPU overflowing while the GPU does not.

jlowe (Member) commented Nov 27, 2023

The test is failing because of a difference in how the CPU checks for overflow vs. how the GPU does, and the CPU itself is not even internally consistent (e.g., overflow checking with codegen enabled differs from when it is disabled). In the single-task case, the GPU overflows only in the tests where we use a batch size of 250. That essentially emulates running with more tasks, since each batch acts like a separate partial aggregation, and the overflow check is done per batch (just as the CPU does it per partition). In the two-task case, the CPU overflows because it checks for overflow after the partial aggregation, but the intermediate values fit in 128 bits on the GPU, so we let them pass, which allows the final aggregation to produce a value that can be stored in Decimal(38).
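The batch-granularity effect described above can be sketched in plain Python (this is an illustration of the principle, not the plugin's actual code; the function name and the per-batch checking scheme are hypothetical). Checking the running total against the Decimal(38) bound at every batch boundary can report overflow for a sequence whose grand total actually fits, so the reported result depends on how the input is chunked:

```python
# Hypothetical sketch: a 128-bit-style decimal sum where overflow against
# the Decimal(38) bound is checked after every batch. Finer batching means
# more checkpoints, so intermediate overflow can be flagged even when the
# final total fits, mirroring the CPU-vs-GPU discrepancy in this issue.
DEC38_MAX = 10**38 - 1  # largest magnitude representable in Decimal(38)

def sum_with_batch_checks(values, batch_size):
    """Sum values, testing the running total against the Decimal(38)
    bound after each batch; return None (i.e., overflow) on failure."""
    total = 0
    for i in range(0, len(values), batch_size):
        total += sum(values[i:i + batch_size])
        if abs(total) > DEC38_MAX:
            return None  # overflow detected at a batch boundary
    return total

# A sequence whose intermediate sum overflows but whose final sum fits:
values = [DEC38_MAX, 5, -10]
print(sum_with_batch_checks(values, batch_size=1))  # None: flagged mid-stream
print(sum_with_batch_checks(values, batch_size=3))  # total fits, returned
```

With batch_size=1 the check fires after the second element, while batch_size=3 sums everything before checking and succeeds, which is analogous to how `spark.rapids.sql.batchSizeBytes=250` changed which side reported overflow.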

mattahrens (Collaborator):

Remaining scope is to fix the seed in a proper way so this failure will not be encountered going forward.
