[BUG] -0.0 vs 0.0 is a hot mess #294

Open
revans2 opened this issue Jun 26, 2020 · 3 comments
Labels
bug (Something isn't working) · cudf_dependency (An issue or PR with this label depends on a new feature in cudf) · P2 (Not required for release)

Comments

@revans2
Collaborator

revans2 commented Jun 26, 2020

This is related to #84 and is a superset of it.

Spark is a bit of a hot mess in its support for the floating-point value -0.0.

Most SQL implementations normalize -0.0 to 0.0. Spark does this in the SQL parser, but not in the dataframe API. Spark also violates the IEEE 754 spec in that -0.0 != 0.0. This is because Java's Double.compare and Float.compare treat -0.0 as < 0.0.
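
To make that concrete, a minimal JVM reproduction (Scala, but the same holds in Java):

```scala
// Primitive comparison follows IEEE 754: the two zeros are equal.
assert(-0.0 == 0.0)

// But the boxed comparators order -0.0 strictly below 0.0.
assert(java.lang.Double.compare(-0.0, 0.0) < 0)
assert(java.lang.Float.compare(-0.0f, 0.0f) < 0)
```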

This is true everywhere except for a few cases: equi-join keys and hash aggregate keys, where Spark does normalize the values. Hive does not do this normalization; it always assumes the two are different.
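
A sketch of the aggregate-key behavior (assuming a SparkSession named spark with implicits imported; the expected output reflects the normalization described above):

```scala
import spark.implicits._

// Spark normalizes floating-point hash aggregate keys, so both zeros
// land in a single group. Hive, per the above, would keep them distinct.
Seq(0.0, -0.0).toDF("x").groupBy("x").count().show()
// Expected: one row, x = 0.0, count = 2
```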

cudf follows IEEE 754, so the two always end up equal. This causes issues in sorts, comparison operators, and joins that are not equi-joins.
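
For example, a sort on the CPU (a sketch; the ordering shown follows from the Double.compare behavior described above):

```scala
import spark.implicits._

// Under Spark's Double.compare-based ordering, -0.0 sorts strictly
// before 0.0. An IEEE 754 comparator (as in cudf) treats them as
// equal, so it may emit them in either order.
Seq(1.0, 0.0, -0.0, -1.0).toDF("x").orderBy("x").show()
// Spark CPU order: -1.0, -0.0, 0.0, 1.0
```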

I will file something against Spark, but I don't have high hopes that anything will be fixed.

@revans2 revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jun 26, 2020
@revans2
Collaborator Author

revans2 commented Jun 26, 2020

I filed https://issues.apache.org/jira/browse/SPARK-32110 to document what I have found in Spark.

@mythrocks
Collaborator

Some findings when compared against Apache Hive 3.x:

  1. Literals: Both the Hive CLI and SparkSQL treat the literals 0.0 and -0.0 as equivalent, i.e. 0.0 = -0.0 is TRUE. SELECT 0.0 AS a, -0.0 AS b selects 0.0 and 0.0.
  2. From data sources/files: The Spark REPL (and Scala, I’m guessing) treats the same literals as distinct. We can use this to write -0.0 into a file, e.g. Seq((-0.0, 0.0)).toDF.write.orc(path) writes distinct values.
  3. Equi-join: Hive 3 does not normalize float/double. Joining 0.0 and -0.0 from ORC-file sources does not match rows. Spark normalizes, and thus matches.
  4. Inequality joins: Both Hive 3 and SparkSQL 3 match on -0.0 < 0.0. This is because neither normalizes for non-equi-joins.

So in this regard, the only material difference between Hive and SparkSQL is that on equi-joins, Hive does not normalize and treats -0.0 as distinct from 0.0. It is at least consistent(ly wrong?) within itself. Spark normalizes, but only for equi-joins.
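
To make the equi-join difference concrete, a minimal sketch (again assuming a SparkSession with implicits in scope):

```scala
import spark.implicits._

val left  = Seq(-0.0).toDF("x")
val right = Seq(0.0).toDF("y")

// Spark normalizes floating-point join keys, so this matches one row.
// Per the findings above, Hive 3 would return no rows for the same join.
left.join(right, $"x" === $"y").show()
```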

@sameerz sameerz added P2 Not required for release and removed ? - Needs Triage Need team to review and classify labels Aug 25, 2020
@revans2
Copy link
Collaborator Author

revans2 commented Nov 23, 2020

I filed rapidsai/cudf#6834 in cudf so we can work around things with bit-wise operations if possible. I believe we should be able to make comparisons and sort match Spark exactly. Joins will be much harder, but we still might be able to do it. We need to be very careful with this, though: -0.0 and the various NaN values are rather rare in real life. I am not sure the added performance cost to sort is worth it, and for joins I am especially concerned about what it would take to make this work.
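
For illustration, one possible bit-wise normalization on the JVM side (a hypothetical sketch, not the actual plugin or cudf implementation):

```scala
import java.lang.{Double => JDouble}

// -0.0 is a single bit pattern (sign bit set, all else zero); map it to
// +0.0 and leave every other value (including NaNs) untouched.
val negZeroBits = JDouble.doubleToRawLongBits(-0.0) // 0x8000000000000000L

def normalizeZero(d: Double): Double =
  if (JDouble.doubleToRawLongBits(d) == negZeroBits) 0.0 else d

assert(JDouble.doubleToRawLongBits(normalizeZero(-0.0)) == 0L)
assert(normalizeZero(1.5) == 1.5)
```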

@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Feb 18, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023