
[BUG] 330cdh failed test_hash_reduction_sum_full_decimal on CI #9779

Closed
gerashegalov opened this issue Nov 17, 2023 · 8 comments · Fixed by #10178
Labels: bug (Something isn't working)

gerashegalov (Collaborator):

Describe the bug

_ test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-Decimal(38,0)] _

--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1 +1 @@
-Row(sum(a)=None)
+Row(sum(a)=Decimal('-20600424020936538707021213418489297935'))

...

_ test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-Decimal(38,-10)] _

--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1 +1 @@
-Row(sum(a)=None)
+Row(sum(a)=Decimal('-2.0600424020936538707021213418489297935E+47'))

Steps/Code to reproduce bug
TBD

Expected behavior
should pass

Environment details

INFO ShimLoader: Loading shim for Spark version: 3.3.0.3.3.7180.0-274
INFO ShimLoader: Complete Spark build info: 3.3.0.3.3.7180.0-274, git@github.infra.cloudera.com:CDH/spark3.git, HEAD, 73b4169950a4b83435306674db9e31e28529e8b5, 2022-08-30T12:57:40Z


@gerashegalov gerashegalov added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 17, 2023
pxLi (Collaborator) commented Nov 20, 2023

This also failed multiple test cases in different CDH nightly runs; all show mismatched CPU and GPU results.

FAILED ../../src/main/python/string_test.py::test_initcap[DATAGEN_SEED=1700413755, INJECT_OOM, INCOMPAT]


[2023-11-19T17:11:01.115Z] ### CPU RUN ###
[2023-11-19T17:11:01.115Z] ### GPU RUN ###
[2023-11-19T17:11:01.115Z] ### COLLECT: GPU TOOK 0.11538124084472656 CPU TOOK 0.6901013851165771 ###
[2023-11-19T17:11:01.115Z] --- CPU OUTPUT
[2023-11-19T17:11:01.115Z] +++ GPU OUTPUT
[2023-11-19T17:11:01.115Z] @@ -1114,7 +1114,7 @@
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='Ⱥb3bb55a%@ab1é%  31ba%bⱥⱥÿ35-ÿⱥ3\t\né@5a5_a-%3éⱥÿa@\r @_@35abb3b5ⱥ3aÿ\t\tbÿb5bÿbÿ7_b7ⱥé%  ')
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='Aa1b@3775b-3aé7\n\r7aⱥbbⱥ%b3_5abaⱥ\tÿé@313%b3bÿ%%5%\r B1ⱥ%-éÿ@ba_bé%ÿ\t\t_77ÿ7%ÿ_%bb1ÿ_é\n\n')
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='3_ÿa1715ÿ1a_é7ÿ\r\t_aba1-5bb7é3ⱥÿ-\tÿ_b-%%é51%a@531\tbÿb5éⱥb5ⱥ@5b@_@\t\né3ⱥⱥbⱥ777bb_ⱥ37 ')
[2023-11-19T17:11:01.115Z] -Row(initcap(a)='ßo\x8e°}£\x11\x8aî\x94')
[2023-11-19T17:11:01.115Z] +Row(initcap(a)='SSo\x8e°}£\x11\x8aî\x94')
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='3bb-1%_@aé%b7ab\r\n35ⱥ-@baaÿ1-é3ÿa \né3ba3@3b3ⱥ@55%1\r\n7ÿa3ÿ%bÿ--5ⱥⱥbÿ\n\na_bba7b%5bbb131\r ')
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='É5@51@ⱥb7%ba7@3\r\naabⱥⱥÿa_-_--3b Aÿ@_3-a-3éÿ1a@_\r\nⱥ5-%7@bb7ÿ@b5b@\r\r15_-b%3a-éb%ba%\t\t')
[2023-11-19T17:11:01.115Z]  Row(initcap(a)='1ÿé%bbé_%ⱥb7_ⱥ5\r Ÿ-%@%_ⱥⱥ%ÿaⱥ3aa\t%77bÿ5@51-bⱥ@5- \n7bé%ÿⱥ%ⱥ33%-3é\r\ré71355ⱥ5ab3@35a\t\t')

FAILED ../../src/main/python/schema_evolution_test.py::test_column_add_after_partition[orc][DATAGEN_SEED=1700413303, INJECT_OOM, IGNORE_ORDER({'local': True})]

[2023-11-19T17:02:42.493Z] ### CPU RUN ###
[2023-11-19T17:02:42.493Z] ### GPU RUN ###
[2023-11-19T17:02:42.493Z] ### COLLECT: GPU TOOK 1.816152572631836 CPU TOOK 2.8293328285217285 ###
[2023-11-19T17:02:42.493Z] --- CPU OUTPUT
[2023-11-19T17:02:42.493Z] +++ GPU OUTPUT
[2023-11-19T17:02:42.493Z] @@ -2323,7 +2323,7 @@
[2023-11-19T17:02:42.493Z]  Row(c=747949758072796799, new_0=False, new_1=68, new_2=-32415, new_3=1382142121, new_4=-2229952796518441831, new_5=None, new_6=6.443761270526859e+250, new_7='`0ÌÙÛ\x1bärÑE\x1e¼ª0D\x18ò@¤õLvÊ|\x99ã\x99]y»', new_8=datetime.date(6979, 5, 9), new_9=datetime.datetime(1842, 8, 20, 11, 32, 25, 128767), new_10=[datetime.date(2495, 9, 28), datetime.date(7989, 10, 3), datetime.date(7668, 3, 8), datetime.date(2483, 1, 7), datetime.date(7474, 12, 20), datetime.date(4109, 8, 6), datetime.date(6820, 2, 7), datetime.date(8399, 6, 9), datetime.date(5965, 4, 10), datetime.date(339, 10, 8), datetime.date(4812, 8, 16), datetime.date(9039, 3, 14)], new_11=Row(child0=None), new_12=Row(c0=[-4807041446153269385, -492861108868460956, 2445107072012345415, 8495904361905650441, 0, 3069048035912442783], c1=True), a=0, b='x')
[2023-11-19T17:02:42.493Z]  Row(c=756767049961098093, new_0=None, new_1=None, new_2=None, new_3=None, new_4=None, new_5=None, new_6=None, new_7=None, new_8=None, new_9=None, new_10=None, new_11=None, new_12=None, a=1, b='z')
[2023-11-19T17:02:42.493Z]  Row(c=758360272413391259, new_0=False, new_1=93, new_2=-24932, new_3=-1362240737, new_4=-326270103423974582, new_5=8.31479808960357e-27, new_6=-7.480874109391324e-227, new_7='wÏ\x02þ\xadèó\x11Ç\x8e\x05\x82\x86Ð$Î)ÑNä¢È\x10Uý\x028f<\x0f', new_8=datetime.date(4850, 1, 25), new_9=datetime.datetime(2178, 3, 5, 9, 50, 14, 38792), new_10=[datetime.date(2000, 3, 1), datetime.date(8824, 9, 15), datetime.date(214, 3, 10), datetime.date(9963, 2, 1)], new_11=Row(child0=Decimal('67480762349703055.83')), new_12=Row(c0=[0, 2092818463731806519], c1=True), a=0, b='z')
[2023-11-19T17:02:42.493Z] -Row(c=761887531456626263, new_0=False, new_1=-59, new_2=-27428, new_3=649879972, new_4=300241518059178542, new_5=0.0, new_6=-4.4153881189739035e+222, new_7=None, new_8=datetime.date(6464, 3, 19), new_9=datetime.datetime(2163, 1, 20, 6, 5, 33, 728202), new_10=[datetime.date(1582, 10, 15)], new_11=Row(child0=Decimal('-274330538087067736.29')), new_12=Row(c0=[-9169051718731907692, 3207217407084977732, 5828166240839985036, -769860422017136925, -6317725922746163153, -4531858387965210345, -7549594355732088622, 2870330501981589583, 4918878943345752411, -1, -2538576527345834980, -1777000261340738155, -454263620209757316, 8207708571539683202, 0, -7855648709717721967, -1580255216891208658, -5557174777925197249, 8490753925456683644, -6041309477268441620], c1=True), a=0, b='y')
[2023-11-19T17:02:42.493Z] +Row(c=761887531456626263, new_0=False, new_1=-59, new_2=-27428, new_3=649879972, new_4=300241518059178542, new_5=0.0, new_6=-4.4153881189739035e+222, new_7=None, new_8=datetime.date(6464, 3, 19), new_9=datetime.datetime(2163, 1, 20, 6, 5, 33, 728202), new_10=[datetime.date(1582, 10, 13)], new_11=Row(child0=Decimal('-274330538087067736.29')), new_12=Row(c0=[-9169051718731907692, 3207217407084977732, 5828166240839985036, -769860422017136925, -6317725922746163153, -4531858387965210345, -7549594355732088622, 2870330501981589583, 4918878943345752411, -1, -2538576527345834980, -1777000261340738155, -454263620209757316, 8207708571539683202, 0, -7855648709717721967, -1580255216891208658, -5557174777925197249, 8490753925456683644, -6041309477268441620], c1=True), a=0, b='y')
[2023-11-19T17:02:42.493Z]  Row(c=768380761926021119, new_0=True, new_1=-87, new_2=4034, new_3=None, new_4=8519045545037378653, new_5=5.065325137478845e+29, new_6=-4.666103834234089e-215, new_7='¶Î\x8f\x8cU¶\x9caa\nÿ[Ä\x90µ)\x85îm²d\x19Ù¨\x13Ý+\x15\xad\x16', new_8=datetime.date(2000, 3, 1), new_9=datetime.datetime(2141, 2, 23, 16, 7, 4, 731178), new_10=[datetime.date(3286, 5, 19), datetime.date(8069, 9, 17), datetime.date(5396, 12, 17), datetime.date(631, 11, 13), datetime.date(3529, 2, 18), datetime.date(2590, 4, 17), datetime.date(8565, 7, 10), datetime.date(1629, 6, 13)], new_11=Row(child0=Decimal('-880390660518370190.39')), new_12=Row(c0=None, c1=True), a=1, b='z')
[2023-11-19T17:02:42.493Z]  Row(c=789081968798337592, new_0=None, new_1=None, new_2=None, new_3=None, new_4=None, new_5=None, new_6=None, new_7=None, new_8=None, new_9=None, new_10=None, new_11=None, new_12=None, a=-1, b='z')
[2023-11-19T17:02:42.493Z]  Row(c=806216176268352943, new_0=True, new_1=-45, new_2=-26835, new_3=-1020998181, new_4=-4168755332298322914, new_5=-8.087319503242584e+24, new_6=5.857303241779751e+83, new_7=':Å°ø°Á÷\x00DM²PÚRT/\x13AH\x92eý~£µ+\x0b5c4', new_8=datetime.date(1650, 11, 11), new_9=datetime.datetime(2179, 5, 24, 1, 16, 57, 493474), new_10=[datetime.date(72, 12, 13), datetime.date(2033, 2, 23), datetime.date(8471, 1, 7), datetime.date(6000, 3, 1), datetime.date(7344, 6, 4), datetime.date(1875, 5, 20), datetime.date(9395, 8, 23), datetime.date(858, 8, 22), datetime.date(8000, 3, 1), datetime.date(1505, 1, 19), datetime.date(2624, 1, 30), datetime.date(5096, 11, 8), datetime.date(3053, 7, 14), datetime.date(4000, 2, 29), datetime.date(7627, 6, 8), datetime.date(4691, 5, 26)], new_11=Row(child0=None), new_12=Row(c0=[8883040341267712284, -5854667154153339848, -7992805157714132332, 3849673508210869062, -3458498021831871544, 5689542648647886628, -5973772120704534010, -6341409587235122230, 8986229899183325292, -3347809925252056780, 4127516452669723253, -914703985335797152, 8228864223747791863, 5034508634006205111, -4701340369623626054, 6539709654621966036, 3217011538087074601], c1=True), a=-1, b='y')

revans2 (Collaborator) commented Nov 20, 2023

@pxLi please file a separate issue for each failure. The two you listed here are not related to the SUM test failure that this issue is about.

pxLi (Collaborator) commented Nov 21, 2023

@pxLi please file a separate issue for each failure. The two you listed here are not related to the SUM test failure that this issue is about.

Got it, thanks for the clarification. Filed separate ones:
#9806
#9807

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Nov 21, 2023
jlowe (Member) commented Nov 22, 2023

The datagen seed was 1700246532:

[2023-11-17T19:05:41.753Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-Decimal(38,0)][DATAGEN_SEED=1700246532, IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT, ALLOW_NON_GPU(HashAggregateExec,AggregateExpression,UnscaledValue,MakeDecimal,AttributeReference,Alias,Sum,Count,Max,Min,Average,Cast,StddevPop,StddevSamp,VariancePop,VarianceSamp,NormalizeNaNAndZero,GreaterThan,Literal,If,EqualTo,First,SortAggregateExec,Coalesce,IsNull,EqualNullSafe,PivotFirst,GetArrayItem,ShuffleExchangeExec,HashPartitioning)]
[2023-11-17T19:05:41.753Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-Decimal(38,-10)][DATAGEN_SEED=1700246532, INJECT_OOM, IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT, ALLOW_NON_GPU(HashAggregateExec,AggregateExpression,UnscaledValue,MakeDecimal,AttributeReference,Alias,Sum,Count,Max,Min,Average,Cast,StddevPop,StddevSamp,VariancePop,VarianceSamp,NormalizeNaNAndZero,GreaterThan,Literal,If,EqualTo,First,SortAggregateExec,Coalesce,IsNull,EqualNullSafe,PivotFirst,GetArrayItem,ShuffleExchangeExec,HashPartitioning)]

jlowe (Member) commented Nov 22, 2023

While trying to reproduce this, I ran into a slightly different issue: instead of the CPU overflowing while the GPU did not, the opposite occurred. Here's what I ran to reproduce:

TEST_PARALLEL=0 PYSP_TEST_spark_master="local[1]" TZ=UTC DATAGEN_SEED=1700246532 REPORT_CHARS=fExXs SPARK_HOME=/home/jlowe/spark-3.3.3-bin-hadoop3/ integration_tests/run_pyspark_from_build.sh -k test_hash_reduction_sum_full_decimal
[...]
=================================== FAILURES ===================================
_ test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.batchSizeBytes': '250'}-Decimal(38,0)] _

data_gen = Decimal(38,0)
conf = {'spark.rapids.sql.batchSizeBytes': '250', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.variableFloatAgg.enabled': 'true'}
[...]
--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1 +1 @@
-Row(sum(a)=Decimal('-20600424020936538707021213418489297935'))
+Row(sum(a)=None)

_ test_hash_reduction_sum_full_decimal[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.batchSizeBytes': '250'}-Decimal(38,-10)] _

data_gen = Decimal(38,-10)
conf = {'spark.rapids.sql.batchSizeBytes': '250', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.variableFloatAgg.enabled': 'true'}
[...]
--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1 +1 @@
-Row(sum(a)=Decimal('-2.0600424020936538707021213418489297935E+47'))
+Row(sum(a)=None)

jlowe (Member) commented Nov 22, 2023

Running with local[2] instead of local[1] leads to the CPU overflowing while the GPU does not.

jlowe (Member) commented Nov 27, 2023

The test is failing because of a difference in how the CPU checks for overflow vs. how the GPU does, and the CPU itself is not even internally consistent (e.g., overflow checking with codegen enabled differs from when it is disabled). In the single-task case, the GPU overflows only in the tests where we use a batch size of 250. That essentially emulates running with more tasks, since each batch acts like a separate partial aggregation, and the overflow check is done per batch (just as the CPU does it per partition). In the two-task case, the CPU overflows because it checks for overflow after the partial aggregation, but the intermediate values fit in 128 bits on the GPU, so we let them pass, which allows the final aggregation to produce a value that can be stored in Decimal(38).
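The batch-granularity effect described above can be sketched in plain Python (this is an illustration of the principle, not the plugin's actual code; the function name and the per-batch checking scheme are hypothetical). Checking the running total against the Decimal(38) bound at every batch boundary can report overflow for a sequence whose grand total actually fits, so the reported result depends on how the input is chunked:

```python
# Hypothetical sketch: a 128-bit-style decimal sum where overflow against
# the Decimal(38) bound is checked after every batch. Finer batching means
# more checkpoints, so intermediate overflow can be flagged even when the
# final total fits, mirroring the CPU-vs-GPU discrepancy in this issue.
DEC38_MAX = 10**38 - 1  # largest magnitude representable in Decimal(38)

def sum_with_batch_checks(values, batch_size):
    """Sum values, testing the running total against the Decimal(38)
    bound after each batch; return None (i.e., overflow) on failure."""
    total = 0
    for i in range(0, len(values), batch_size):
        total += sum(values[i:i + batch_size])
        if abs(total) > DEC38_MAX:
            return None  # overflow detected at a batch boundary
    return total

# A sequence whose intermediate sum overflows but whose final sum fits:
values = [DEC38_MAX, 5, -10]
print(sum_with_batch_checks(values, batch_size=1))  # None: flagged mid-stream
print(sum_with_batch_checks(values, batch_size=3))  # total fits, returned
```

With batch_size=1 the check fires after the second element, while batch_size=3 sums everything before checking and succeeds, which is analogous to how `spark.rapids.sql.batchSizeBytes=250` changed which side reported overflow.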

mattahrens (Collaborator):

Remaining scope is to fix the seed in a proper way so this failure will not be encountered going forward.
