
[SPARK-51441][SQL] Add DSv2 APIs for constraints #50253

Open
wants to merge 2 commits into master

Conversation

@aokolnychyi (Contributor) commented Mar 12, 2025

What changes were proposed in this pull request?

This PR adds DSv2 APIs for constraints as per SPIP doc.
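
To make the new surface a bit more concrete, here is a rough usage sketch pieced together from the Check factory methods quoted in the review threads below. The constraints package path, the interface hosting the factories (Constraint here), the ConstraintState construction, and the wrapper class are assumptions for illustration, not the final API.

// Hypothetical sketch only: the constraints package path and the Constraint.check(...)
// factory placement are assumptions based on the snippets quoted in the review below.
import org.apache.spark.sql.connector.catalog.constraints.Check;           // assumed package
import org.apache.spark.sql.connector.catalog.constraints.Constraint;      // assumed package
import org.apache.spark.sql.connector.catalog.constraints.ConstraintState; // assumed package
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.filter.Predicate;

class ConstraintApiSketch {
  Check positiveId(ConstraintState state) {
    // SQL-based form: the search condition travels as a Spark SQL string.
    Check bySql = Constraint.check("positive_id", "id > 0", state);

    // Predicate-based form: the same condition expressed as a DSv2 Predicate.
    Predicate predicate = new Predicate(">",
        new Expression[] { Expressions.column("id"), Expressions.literal(0) });
    Check byPredicate = Constraint.check("positive_id", predicate, state);

    return bySql;
  }
}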

Why are the changes needed?

These changes are the first step for constraints support in Spark.

Does this PR introduce any user-facing change?

This PR adds new public interfaces that will be supported in the future.

How was this patch tested?

This PR comes with tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label Mar 12, 2025

import org.apache.spark.sql.connector.expressions.NamedReference;

import java.util.StringJoiner;

Member:

Could you place this before org.apache.spark. import statements?

Contributor (Author):

Done.

 * @return a CHECK constraint with the provided configuration
 */
static Check check(String name, String sql, ConstraintState state) {
  return new Check(name, sql, null /* no predicate */, state);

@dongjoon-hyun (Member), Mar 13, 2025:

Shall we use Java Optional.empty() instead of null? IIRC, this null was not a part of the SPIP, was it? Please correct me if there is a reason to have null. I might have missed a detail.

Contributor (Author):

Either the SQL string or the predicate must be present, so I don't think Optional is appropriate in this case. We have mixed usage in the connector API: Column and ProcedureParameter use nulls, while some other classes use Optional.
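
As a small illustration of that invariant from the consumer side, a hedged sketch; the sql() and predicate() accessor names are guesses inferred from the constructor arguments above, not confirmed by the PR:

// Hypothetical consumer-side handling of "exactly one of SQL / predicate is set";
// accessor names are assumed, not taken from the PR.
String describeCondition(Check check) {
  return check.predicate() != null
      ? check.predicate().toString()  // structured predicate available
      : check.sql();                  // otherwise fall back to the SQL text
}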

 * @return a CHECK constraint with the provided configuration
 */
static Check check(String name, Predicate predicate, ConstraintState state) {
  return new Check(name, null /* no SQL */, predicate, state);

Member:

ditto. Optional?

import org.apache.spark.sql.connector.expressions.NamedReference;

import java.util.Arrays;
import java.util.Objects;

Member:

ditto. import ordering.

Contributor (Author):

Fixed.

@dongjoon-hyun (Member) left a comment:

Thank you, @aokolnychyi.

Comment on lines 31 to 32
* (not {@code NULL}). The search condition must be deterministic and cannot contain subqueries and
* certain functions like aggregates.

Reviewer:

Is it possible to document what "certain" would imply here? An exhaustive list of the conditional expressions we want to support would be beneficial.
For example: can we have a UDF my_udf in a CHECK condition, which essentially is a subquery?


@aokolnychyi (Contributor, Author), Mar 20, 2025:

It is up to us in the implementation to decide what to support. I feel this topic should be discussed a bit more. For now, I added UDFs to the list of unsupported examples. Let's refine as we go.
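
For illustration, a few hedged examples of conditions that would or would not fit the documented restrictions; the constraint below is made up, and the factory placement reuses the assumptions from the earlier sketch:

// Deterministic, subquery-free conditions over the row's own columns are the intended use:
Check ok = Constraint.check("valid_order", "price > 0 AND quantity IS NOT NULL", state);

// Conditions like these would fall under the unsupported search conditions mentioned above:
//   price > (SELECT avg(price) FROM items)   -- subquery
//   sum(quantity) < 100                      -- aggregate function
//   my_udf(price) < 20                       -- user-defined function
//   rand() > 0.5                             -- non-deterministic function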

* The SQL string and predicate must be equivalent.
*
* @param name the constraint name
* @param sql the SQL representation of the search condition (Spark SQL dialect)

@singhpk234, Mar 15, 2025:

sql the SQL representation of the search condition (Spark SQL dialect)

[doubt] How are we planning to handle cases where, let's say, a built-in function is introduced in a certain Spark version and is referenced in a CHECK constraint, but the enforcer is on a lower version of Spark? Is capturing all the SQL under the Spark SQL dialect sufficient?

For example:

ALTER TABLE tbl ADD CONSTRAINT my_condition CHECK (func_spark_4(x) < 20);

Now the writer/enforcer is on Spark 3 and doesn't know func_spark_4.

@aokolnychyi (Contributor, Author), Mar 20, 2025:

That would mean the older version of Spark would not be able to parse/bind the expression, and the write would fail. I think that's acceptable behavior and is better than causing a silent constraint violation.
