
[SPARK-51441][SQL] Add DSv2 APIs for constraints #50253


Closed
wants to merge 6 commits from aokolnychyi/spark-51441 into master

Conversation

@aokolnychyi (Contributor) commented Mar 12, 2025:

What changes were proposed in this pull request?

This PR adds DSv2 APIs for constraints as per the SPIP doc (https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo/).

Why are the changes needed?

These changes are the first step for constraints support in Spark.

Does this PR introduce any user-facing change?

This PR adds new public interfaces that will be supported in the future.

How was this patch tested?

This PR comes with tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label Mar 12, 2025
* @return a CHECK constraint with the provided configuration
*/
static Check check(String name, String sql, ConstraintState state) {
return new Check(name, sql, null /* no predicate */, state);

@dongjoon-hyun (Member) commented Mar 13, 2025:

Shall we use Java Optional.empty() instead of null? IIRC, this null was not part of the SPIP, was it? Please correct me if there is a reason to have null; I might be missing a detail.

@aokolnychyi (Contributor, Author) replied:

Either the SQL string or the predicate must be present, so I don't think Optional is appropriate in this case. We have mixed use in the connector API: Column and ProcedureParameter use nulls, while some other classes use Optional.

* @return a CHECK constraint with the provided configuration
*/
static Check check(String name, Predicate predicate, ConstraintState state) {
return new Check(name, null /* no SQL */, predicate, state);

Reviewer (Member) commented:

ditto. Optional?

@dongjoon-hyun (Member) left a review comment:

Thank you, @aokolnychyi.

Comment on lines 31 to 32
* (not {@code NULL}). The search condition must be deterministic and cannot contain subqueries and
* certain functions like aggregates.

Reviewer commented:

Is it possible to document what "certain" would imply here? An exhaustive list of the conditional expressions we want to support would be beneficial.
For example: can we have a UDF my_udf in a CHECK condition, which is essentially a subquery?

@aokolnychyi (Contributor, Author) replied Mar 20, 2025:

It is up to us in the implementation to decide what to support. I feel this topic should be discussed a bit more. For now, I added UDFs to the list of unsupported examples. Let's refine as we go.
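
To make the restriction concrete, here is a minimal, hedged sketch of the kinds of conditions in question. The builder calls mirror the Constraint.check example quoted later in this review; the import path is an assumption, and the condition strings are purely illustrative, not taken from the PR.

```java
// Assumed package; adjust to wherever the constraints API actually lives.
import org.apache.spark.sql.connector.catalog.constraints.Constraint;

class CheckConditionExamples {
  void examples() {
    // Intended use case: a deterministic expression over table columns.
    Constraint.check("positive_id").predicateSql("id > 0").enforced(true).build();

    // Per the Javadoc under discussion, conditions like these are NOT supported:
    //   aggregates:  "sum(x) < 100"
    //   subqueries:  "x > (SELECT max(y) FROM t)"
    //   UDFs:        "my_udf(x) < 20"
  }
}
```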

* The SQL string and predicate must be equivalent.
*
* @param name the constraint name
* @param sql the SQL representation of the search condition (Spark SQL dialect)

@singhpk234 commented Mar 15, 2025:

sql the SQL representation of the search condition (Spark SQL dialect)

[doubt] How are we planning to handle cases where, say, a built-in function is introduced in a certain Spark version and is referenced in a CHECK constraint, but the enforcer is on a lower Spark version? Is capturing all the SQL under the Spark SQL dialect sufficient?

For example:

ALTER TABLE tbl ADD CONSTRAINT my_condition CHECK (func_spark_4(x) < 20);

Now the writer/enforcer is on Spark 3 and doesn't know func_spark_4.

@aokolnychyi (Contributor, Author) replied Mar 20, 2025:

That would mean the older version of Spark would not be able to parse/bind the expression, and the write would fail. I think that's acceptable behavior and is better than causing a silent constraint violation.

*
* @since 4.1.0
*/
public class Check extends BaseConstraint {

Reviewer (Member) commented:

Shall we override the enforced()/rely()/validationStatus() methods or variables? It would help developers understand them.
Otherwise, the default values will only exist in the docs and Spark internal code.

@aokolnychyi (Contributor, Author) replied:

I am not sure it is a good idea. Those methods already exist in the parent interface; they will show up in the Javadoc, and IDEs will suggest them as well. I am worried about all the code duplication this would cause.

@aokolnychyi (Contributor, Author) added:

As a compromise, we can add more Javadoc at the top of this class. What do you think?

Reviewer (Member) replied:

Adding more Javadoc seems OK.
If we change the default values in these classes, it can make the internal implementation simpler too. For example, we wouldn't need the default ConstraintCharacteristic for each internally parsed constraint: https://github.com/gengliangwang/spark/pull/13/files#diff-03507453aabc732a7e3efadc81cd840e436a392b1c62d5aa50647266bc3a9199R38

@aokolnychyi (Contributor, Author) replied:

Added more Javadoc.

If we change the default values in these classes, it can make the internal implementation simpler too. For example, we wouldn't need the default ConstraintCharacteristic for each internally parsed constraint.

Let's re-evaluate this in the implementation PR. I think we can either remove the common builder for constraints or handle this in the parser.

@gengliangwang (Member) commented:

I will merge this one next Monday if there are no more comments

*
* @since 4.1.0
*/
@Evolving

Reviewer (Contributor) commented:

QQ: why is this one @Evolving and not the others? I'm not sure about the convention in Spark.

@aokolnychyi (Contributor, Author) replied:

I wasn't sure about annotating classes that implement this interface, so I added it to be safe. It should be everywhere now.

* @param name the constraint name
* @return a CHECK constraint builder
*/
static Check.Builder check(String name) {

Reviewer (Contributor) commented:

How about createCheckBuilder?

@aokolnychyi (Contributor, Author) replied Mar 31, 2025:

I personally prefer shorter names whenever the usage/context is obvious enough.

Constraint.check("con1").predicateSql("id > 0").enforced(true).build();

This reads well to me and matches what we did for ProcedureParameter.

Reviewer (Member) replied:

+1 with @aokolnychyi, check is neat.
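
For readers skimming the thread, here is a slightly expanded, hedged version of the builder example above. The import paths are assumptions, and the exact builder surface may differ from what finally merged; treat it as a sketch of the intended usage, not the final API.

```java
// Assumed package; adjust to the actual location of the constraints API.
import org.apache.spark.sql.connector.catalog.constraints.Check;
import org.apache.spark.sql.connector.catalog.constraints.Constraint;

public class CheckBuilderExample {
  public static void main(String[] args) {
    // Short factory name on Constraint, then fluent builder calls,
    // as in the example quoted in the thread above.
    Check check = Constraint.check("con1")
        .predicateSql("id > 0")
        .enforced(true)
        .build();

    // Constraints describe themselves as DDL-like text (see the toDDL
    // discussion further down in this review).
    System.out.println(check);
  }
}
```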

* @param columns columns that comprise the unique key
* @return a UNIQUE constraint builder
*/
static Unique.Builder unique(String name, NamedReference[] columns) {

Reviewer (Contributor) commented:

createUniqueBuilder?

@aokolnychyi (Contributor, Author) replied:

Same as in CHECK.

* @param columns columns that comprise the primary key
* @return a PRIMARY KEY constraint builder
*/
static PrimaryKey.Builder primaryKey(String name, NamedReference[] columns) {

Reviewer (Contributor) commented:

createPrimaryKeyBuilder

@aokolnychyi (Contributor, Author) replied:

Same as in CHECK.

* @param refColumns the referenced columns in the referenced table
* @return a FOREIGN KEY constraint builder
*/
static ForeignKey.Builder foreignKey(

Reviewer (Contributor) commented:

createForeignKeyBuilder

@aokolnychyi (Contributor, Author) replied:

Same as in CHECK.

return toDDL();
}

protected String toDDL(NamedReference[] columns) {

Reviewer (Contributor) commented:

How about joinColumns?

@aokolnychyi (Contributor, Author) replied:

I'm sure that would be more descriptive, but I feel toDDL makes sense as it formats columns to be used in DDL.
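
For illustration only, a hedged sketch of what such a column-joining helper could look like. The class wrapper, method body, and use of fieldNames() are assumptions made for the sketch, not the PR's actual code.

```java
import java.util.StringJoiner;
import org.apache.spark.sql.connector.expressions.NamedReference;

abstract class ColumnListSketch {
  // Renders the referenced columns as a comma-separated list for DDL output,
  // e.g. columns (a, b) -> "a, b", so it can be embedded in "PRIMARY KEY (a, b)".
  protected String toDDL(NamedReference[] columns) {
    StringJoiner joiner = new StringJoiner(", ");
    for (NamedReference column : columns) {
      joiner.add(String.join(".", column.fieldNames()));
    }
    return joiner.toString();
  }
}
```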

Predicate predicate,
boolean enforced,
ValidationStatus validationStatus,
boolean rely) {

Reviewer (Contributor) commented:

Shall we put the parameters of the base class first, and then the subclass's parameters?

@aokolnychyi (Contributor, Author) replied:

It feels more natural to me to follow the order of importance/definition in SQL.

CONSTRAINT name CHECK (predicate) [NOT] ENFORCED [NO]RELY

boolean enforced,
ValidationStatus validationStatus,
boolean rely) {
super(name, enforced, validationStatus, rely);

Reviewer (Contributor) commented:

parameter order.

@aokolnychyi (Contributor, Author) replied:

Same opinion as in CHECK.

boolean enforced,
ValidationStatus validationStatus,
boolean rely) {
super(name, enforced, validationStatus, rely);

Reviewer (Contributor) commented:

ditto.

@aokolnychyi (Contributor, Author) replied:

Same opinion as in CHECK.

boolean enforced,
ValidationStatus validationStatus,
boolean rely) {
super(name, enforced, validationStatus, rely);

Reviewer (Contributor) commented:

ditto.

@aokolnychyi (Contributor, Author) replied:

Same opinion as in CHECK.

@gengliangwang (Member) commented:

Hi @beliefer, @dongjoon-hyun, @singhpk234 — just checking if you have any further comments on this PR. Thanks!

@gengliangwang (Member) commented:

Thanks, merging to master

@aokolnychyi (Contributor, Author) commented.

senthh pushed a commit to senthh/spark-1 that referenced this pull request Apr 2, 2025

Closes apache#50253 from aokolnychyi/spark-51441.

Authored-by: Anton Okolnychyi <aokolnychyi@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>