
Support round and bround SQL functions #1244

Merged: 16 commits into NVIDIA:branch-0.4, Jan 8, 2021

Conversation

@nartal1 (Collaborator) commented Dec 2, 2020

This fixes #37

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1 added the feature request label Dec 2, 2020
@nartal1 self-assigned this Dec 2, 2020
@nartal1 added the SQL label Dec 2, 2020
@nartal1 (Collaborator, Author) commented Dec 3, 2020

There is a bug in this which I am still figuring out; not all test cases pass.
If the scale is 0, the value gets incremented by 1 (which shouldn't happen).
For example:

df1 = spark.createDataFrame(data=[Decimal('3977631'), Decimal('8291028'), Decimal('8291025')], schema=DecimalType(7, 0))
ret = df1.selectExpr('round(value, 0)')

CPU result: Decimal('3977631'), Decimal('8291028'), Decimal('8291025')
GPU result: Decimal('3977632'), Decimal('8291029'), Decimal('8291026')

if (lhs.isInstanceOf[AutoCloseable]) {
  lhs.asInstanceOf[AutoCloseable].close()
}
if (rhs.isInstanceOf[AutoCloseable]) {
  rhs.asInstanceOf[AutoCloseable].close()
}

Contributor:
rhs could be null at this point, so it needs a null check here. Perhaps it would be better to introduce something similar to the withResource method that can work with Any values that might also be AutoCloseable?

Collaborator:

I added one:

/** Executes the provided code block and then closes the value if it is AutoCloseable */
def withResourceIfAllowed[T, V](r: T)(block: T => V): V = {
  try {
    block(r)
  } finally {
    r match {
      case c: AutoCloseable => c.close()
      case _ => // NOOP
    }
  }
}

Contributor:

It turns out we already have a withResource that can work here:

withResourceIfAllowed(left.columnarEval(batch)) { lhs =>
  withResourceIfAllowed(right.columnarEval(batch)) { rhs =>
    (lhs, rhs) match {
      case (l: GpuColumnVector, r) =>
        withResource(GpuScalar.from(r, right.dataType)) { scalar =>
          GpuColumnVector.from(doColumnar(l, scalar), dataType)
        }
      case _ => null
    }
  }
}

Collaborator:

But more to the point, why is this not a GpuBinaryExpression? BRound and Round are both BinaryExpressions, which would clean up this code.

Collaborator Author:

Thanks @andygrove and @revans2 for taking a look. Sorry, I missed this conversation. The reason I was not making it a GpuBinaryExpression is that there isn't a BinaryOp for round or BRound in JNI or in the cudf binary ops, and I was not sure what the overridden enum would be for these operators. But this also isn't working for all cases, so I will look into how to go about it as a GpuBinaryExpression.

Collaborator:

GpuBinaryExpression does not assume an enum but CudfBinaryExpression does.

Collaborator Author:

Thanks! Changed it to use GpuBinaryExpression. It will be in the next patch.

@@ -183,6 +183,22 @@ def test_shift_right_unsigned(data_gen):
'shiftrightunsigned(a, cast(null as INT))',
'shiftrightunsigned(a, b)'))

@pytest.mark.parametrize('data_gen', [decimal_gen_scale_precision], ids=idfn)
Collaborator:

round and bround support all numeric types, not just decimal (float, int, byte, double, etc.). Do we have tests for these too?

Collaborator Author:

Yes, will add tests for other numeric types.

Collaborator Author:

Added tests for other numeric types.
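
For context, here is a minimal, self-contained pyspark sketch of what round and bround actually do on a couple of numeric types (illustrative only; it is separate from the plugin's integration tests, which use the project's data generators and CPU/GPU comparison helpers, and the session setup is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("round-bround-demo").getOrCreate()

# round() uses HALF_UP, bround() uses HALF_EVEN; they differ only at .5 ties.
df = spark.createDataFrame([(2.5, 12345), (3.5, 67890)], ["f", "i"])
df.selectExpr(
    'round(f, 0)',    # 3.0, 4.0   (half up)
    'bround(f, 0)',   # 2.0, 4.0   (half to even)
    'round(i, -2)',   # 12300, 67900
    'bround(i, -2)'   # 12300, 67900 (no ties here, so identical to round)
).show()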

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1 (Collaborator, Author) commented Dec 8, 2020

Cleaned up the code to use GpuBinaryExpression. I still need to add numeric tests and also verify decimal types with a negative scale.

@revans2 mentioned this pull request Dec 8, 2020 (27 tasks)
@nartal1 (Collaborator, Author) commented Dec 10, 2020

There is a bug in this which I am still figuring out; not all test cases pass.
If the scale is 0, the value gets incremented by 1 (which shouldn't happen).
For example:

df1 = spark.createDataFrame(data=[Decimal('3977631'), Decimal('8291028'), Decimal('8291025')], schema=DecimalType(7, 0))
ret = df1.selectExpr('round(value, 0)')

CPU result: Decimal('3977631'), Decimal('8291028'), Decimal('8291025')
GPU result: Decimal('3977632'), Decimal('8291029'), Decimal('8291026')

I couldn't figure out the issue here, as the code seems straightforward, so I switched to seeing if I could reproduce it in Java. It is reproducible on the Java side as well, so I have to debug it in the cudf Java layer.

rapids-bot pushed a commit to rapidsai/cudf that referenced this pull request Dec 11, 2020:
… no-op (#6975)

@nartal1 found a small bug while working on: NVIDIA/spark-rapids#1244

The problem is that for `fixed_point`, rounding should be a no-op when the column `scale = -decimal_places`. The fix is to make it a no-op.

Authors:
  - Conor Hoekstra <codereport@outlook.com>

Approvers:
  - David
  - Karthikeyan

URL: #6975
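
To make the fixed-point corner case concrete: a DecimalType(7, 0) column rounded to 0 decimal places should come back unchanged, which is exactly the identity the GPU results above were violating. A plain-Python sketch of the expected behavior (illustrative only, not the cudf fix itself):

from decimal import Decimal, ROUND_HALF_UP

# Rounding a scale-0 decimal to 0 decimal places must be the identity.
for v in (Decimal('3977631'), Decimal('8291028'), Decimal('8291025')):
    assert v.quantize(Decimal(1), rounding=ROUND_HALF_UP) == v
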
val scaleVal = val1.getInt
val scale = dataType match {
  case DecimalType.Fixed(p, s) => s
  case ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType => val1.getInt
Collaborator Author:

@revans2 Could you please suggest how we should handle overflow for each type?
For example (considering ShortType), pyspark returns 0 for the round operation below:

>>> df2= spark.createDataFrame(data=[32562], schema=ShortType())
>>> ret = df2.selectExpr('round(value, -5)')
>>> ret.show()
+----------------+
|round(value, -5)|
+----------------+
|               0|
+----------------+

But we see a different GPU result (-31072), because overflow results in undefined behavior in libcudf (likely the multiplier 10^5 itself wrapping in 16 bits: 100000 mod 65536 = 34464, which as a signed short is -31072).
Should we throw an exception whenever we detect an overflow for each type at this point?

Collaborator:

The problem is entirely in how we implement round/bround vs. how Spark does it, and I am not 100% sure how to make them match without a lot of work on the cudf side for these corner cases.

cudf tries to do the round on the native type, which can result in an overflow. Spark converts the native value to a decimal value (128 bits if needed), sets the scale to do the rounding, and then converts the value back (with some special cases for NaN and Infinity in floating point).

https://github.com/apache/spark/blob/0626901bcbeebceb6937001e1f32934c71876210/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1220-L1249

There can be no overflow in those cases because all of the processing happens on 128 bits. For integer types smaller than a long we could cast to a long first, do the round/bround, and then cast back, but we would still end up with overflow issues for longs.

The same applies to float/double.
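
A rough plain-Python sketch of that promote-then-round strategy (illustrative only; round_like_spark is a hypothetical helper, not Spark or plugin code). Doing the rounding in unlimited precision is why round(32562, -5) comes out as 0 instead of an overflowed native value:

from decimal import Decimal, ROUND_HALF_UP

def round_like_spark(value: int, scale: int) -> int:
    # Round to the nearest 10**(-scale) in unlimited precision, then narrow back;
    # scale = -5 rounds to the nearest 100000.
    unit = Decimal(1).scaleb(-scale)  # e.g. Decimal('1E+5') for scale = -5
    return int((Decimal(value) / unit).quantize(Decimal(1), rounding=ROUND_HALF_UP) * unit)

print(round_like_spark(32562, -5))  # 0, matching the CPU result shown above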

Collaborator:

Do we have tests for overflow to check that it is working correctly, or are we going to mark the operators as incompatible until we can figure out a way to make them work properly?

@nartal1 changed the title from "[WIP] Support round and bround SQL functions" to "Support round and bround SQL functions" Dec 14, 2020
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1 (Collaborator, Author) commented Dec 16, 2020

@revans2 I am waiting on another fix in cudf to be merged before starting CI here. It would be great if you could take another look and say whether it looks okay for this iteration. If further changes are required, I can make them and verify locally.

rapids-bot pushed a commit to rapidsai/cudf that referenced this pull request Dec 17, 2020:
Found a small bug while working on NVIDIA/spark-rapids#1244:
for negative integers, it was not rounding to the nearest even number.

Authors:
  - Niranjan Artal <nartal@nvidia.com>
  - Conor Hoekstra <codereport@outlook.com>

Approvers:
  - Conor Hoekstra
  - Mark Harris

URL: #7014
@nartal1 (Collaborator, Author) commented Dec 18, 2020

build

@nartal1 (Collaborator, Author) commented Dec 18, 2020

build

@nartal1 added this to the Jan 4 - Jan 15 milestone Dec 18, 2020
@nartal1 (Collaborator, Author) commented Jan 6, 2021

build

val scaleVal = val1.getInt
val scale = dataType match {
  case DecimalType.Fixed(p, s) => s
  case ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType => val1.getInt
Collaborator:

Do we have tests for overflow to check that it is working correctly, or are we going to mark the operators as incompatible until we can figure out a way to make them work properly?

@revans2 (Collaborator) commented Jan 6, 2021

build

@revans2 (Collaborator) commented Jan 7, 2021

@andygrove, you reviewed this before; are you okay with merging it?

@andygrove (Contributor) left a comment:

LGTM

@revans2 merged commit 41b6a66 into NVIDIA:branch-0.4 Jan 8, 2021
@pxLi added this to In progress in Release 0.4 via automation Feb 8, 2021
@sameerz moved this from In progress to Done in Release 0.4 Mar 4, 2021
nartal1 added a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
nartal1 added a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
* Run pre-commit to format files. We were behind a bit.

* Update pre-commit config to 16.0.1 to match cudf. Re-ran formatting.

* Reformat of code via pre-commit

Signed-off-by: db <dbaranec@nvidia.com>

Labels: feature request (New feature or request), SQL (part of the SQL/Dataframe plugin)
Projects: Release 0.4 (Done)
Successfully merging this pull request may close: [FEA] round and bround SQL functions
3 participants