[BUG] -0.0 vs 0.0 is a hot mess #294

Open
revans2 opened this issue Jun 26, 2020 · 3 comments
Labels
bug (Something isn't working) · cudf_dependency (An issue or PR with this label depends on a new feature in cudf) · P2 (Not required for release)

Comments

@revans2
Collaborator

revans2 commented Jun 26, 2020

This is related to #84 and is a superset of it.

Spark is a bit of a hot mess in its support for the floating-point value -0.0.

Most SQL implementations normalize -0.0 to 0.0. Spark does this in the SQL parser, but not in the dataframe API. Spark also violates the IEEE 754 spec in that -0.0 != 0.0. This is because Java's Double.compare and Float.compare treat -0.0 as < 0.0.
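
To make that concrete, a minimal JVM reproduction (Scala, but the same holds in Java):

```scala
// Primitive comparison follows IEEE 754: the two zeros are equal.
assert(-0.0 == 0.0)

// But the boxed comparators order -0.0 strictly below 0.0.
assert(java.lang.Double.compare(-0.0, 0.0) < 0)
assert(java.lang.Float.compare(-0.0f, 0.0f) < 0)
```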

This is true everywhere except for a few cases: equi-join keys and hash aggregate keys, where Spark does normalize the values. Hive does not do this normalization; it always assumes the two are different.
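
A sketch of the aggregate-key behavior (assuming a SparkSession named spark with implicits imported; the expected output reflects the normalization described above):

```scala
import spark.implicits._

// Spark normalizes floating-point hash aggregate keys, so both zeros
// land in a single group. Hive, per the above, would keep them distinct.
Seq(0.0, -0.0).toDF("x").groupBy("x").count().show()
// Expected: one row, x = 0.0, count = 2
```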

cudf follows IEEE 754, so the two always end up equal. This causes issues in sorts, comparison operators, and joins that are not equi-joins.
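
For example, a sort on the CPU (a sketch; the ordering shown follows from the Double.compare behavior described above):

```scala
import spark.implicits._

// Under Spark's Double.compare-based ordering, -0.0 sorts strictly
// before 0.0. An IEEE 754 comparator (as in cudf) treats them as
// equal, so it may emit them in either order.
Seq(1.0, 0.0, -0.0, -1.0).toDF("x").orderBy("x").show()
// Spark CPU order: -1.0, -0.0, 0.0, 1.0
```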

I will file something against Spark, but I don't have high hopes that anything will be fixed.

@revans2 revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jun 26, 2020
@revans2
Collaborator Author

revans2 commented Jun 26, 2020

I filed https://issues.apache.org/jira/browse/SPARK-32110 to document what I have found in Spark.

@mythrocks
Collaborator

Some findings when compared against Apache Hive 3.x:

  1. Literals: Both the Hive CLI and SparkSQL treat the literals 0.0 and -0.0 as equivalent, i.e. 0.0 = -0.0 is TRUE. SELECT 0.0 AS a, -0.0 AS b selects 0.0 and 0.0.
  2. From data sources/files: The Spark REPL (and Scala, I’m guessing) treats the same literals as distinct. We can use this to write -0.0 into a file, e.g. Seq((-0.0, 0.0)).toDF.write.orc(path) writes distinct values.
  3. Equi-join: Hive 3 does not normalize float/double. Joining 0.0 and -0.0 from ORC-file sources does not match rows. Spark normalizes, and thus matches.
  4. Inequality joins: Both Hive 3 and SparkSQL 3 match on -0.0 < 0.0. This is because neither normalizes for non-equi-joins.

So in this regard, the only material difference between Hive and SparkSQL is that on equi-joins, Hive does not normalize and treats -0.0 as distinct from 0.0. It is at least consistent(ly wrong?) within itself. Spark normalizes, but only for equi-joins.
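
To make the equi-join difference concrete, a minimal sketch (again assuming a SparkSession with implicits in scope):

```scala
import spark.implicits._

val left  = Seq(-0.0).toDF("x")
val right = Seq(0.0).toDF("y")

// Spark normalizes floating-point join keys, so this matches one row.
// Per the findings above, Hive 3 would return no rows for the same join.
left.join(right, $"x" === $"y").show()
```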

@sameerz sameerz added P2 Not required for release and removed ? - Needs Triage Need team to review and classify labels Aug 25, 2020
@revans2
Copy link
Collaborator Author

revans2 commented Nov 23, 2020

I filed rapidsai/cudf#6834 in cudf so we can work around things with bit-wise operations if possible. I believe we should be able to make comparisons and sort match Spark exactly. Joins will be much harder, but we still might be able to do it. We need to be very careful with this, though: -0.0 and the various NaN values are rather rare in real life. I am not sure the added performance cost to sort is worth it, and for joins I am especially concerned about what it would take to make this work.
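
For illustration, one possible bit-wise normalization on the JVM side (a hypothetical sketch, not the actual plugin or cudf implementation):

```scala
import java.lang.{Double => JDouble}

// -0.0 is a single bit pattern (sign bit set, all else zero); map it to
// +0.0 and leave every other value (including NaNs) untouched.
val negZeroBits = JDouble.doubleToRawLongBits(-0.0) // 0x8000000000000000L

def normalizeZero(d: Double): Double =
  if (JDouble.doubleToRawLongBits(d) == negZeroBits) 0.0 else d

assert(JDouble.doubleToRawLongBits(normalizeZero(-0.0)) == 0L)
assert(normalizeZero(1.5) == 1.5)
```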

@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Feb 18, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023