Skip to content

[SPARK-52706][SQL] Fix inconsistencies and refactor primitive types in parser #51335

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

mihailom-db
Copy link
Contributor

What changes were proposed in this pull request?

This PR proposes a change in how our parser treats datatypes. We introduce types with/without parameters and group accordingly.

Why are the changes needed?

Changes are needed for many reasons:

  1. Context of primitiveDataType is constantly getting bigger. This is not a good practice, as we have many null fields which only take up memory.
  2. We have inconsistencies in where we use each type. We get TIMESTAMP_NTZ in a separate rule, but we also mention it in primitive types.
  3. Primitive types should stay related to primitive types, adding ARRAY, STRUCT, MAP in the rule just because it is convenient is not good practice.
  4. Current structure does not give option of extending types with different features. For example, we introduced STRING collations, but what if we were to introduce CHAR/VARCHAR with collations. Current structure gives us 0 possibility of making a type CHAR(5) COLLATE UTF8_BINARY (We can only do CHAR COLLATE UTF8_BINARY (5)).

Does this PR introduce any user-facing change?

No. This is internal refactoring.

How was this patch tested?

All existing tests should pass, this is just code refactoring.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jul 1, 2025
@mihailom-db
Copy link
Contributor Author

@MaxGekk @cloud-fan Could you take a look at this PR? We should aim to keep our parser in the most readable and most efficient state.

primitiveType
: primitiveTypeWithParameters
| primitiveTypeWithoutParameters
| unsupportedType=identifier (LEFT_PAREN INTEGER_VALUE(COMMA INTEGER_VALUE)* RIGHT_PAREN)?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we get an unsupportedType with a suffix like: TIME WITH TIME ZONE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this potentially can make a problem. I mean, we would probably return a bad error message. Let me think if we can scope issues like this as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually error message will stay the same, even previously we would return syntax error. The only thing here is if we want to improve the error messages a bit further?

@@ -1340,7 +1340,20 @@ collateClause
: COLLATE collationName=multipartIdentifier
;

type
primitiveTypeWithParameters
: STRING collateClause?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be primitiveTypeWithoutParameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can go, but in this case I would say collation is a parameter as well. It can change it's value to some different value not known at parsing time. If we follow this case, then probably INTERVAL should go to primitiveTypeWithoutParameters, as it is actually without parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By not known at parsing time, I mean identifier/arbitrary value.

@mihailom-db mihailom-db changed the title [WIP] Fix inconsistencies and refactor primitive types in parser [SPARK-52706][SQL] Fix inconsistencies and refactor primitive types in parser Jul 8, 2025
| BINARY
| DECIMAL | DEC | NUMERIC
| VOID
| INTERVAL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we move this one into nonTrivialPrimitiveType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved both INTERVAL and TIMESTAMP so that in the code it is clear that trivial types do not require post processing, but only return a specific type.

@@ -165,6 +183,9 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
* Create a complex DataType. Arrays, Maps and Structures are supported.
*/
override def visitComplexDataType(ctx: ComplexDataTypeContext): DataType = withOrigin(ctx) {
if (ctx.LT() == null && ctx.NEQ() == null) {
throw QueryParsingErrors.nestedTypeMissingElementTypeError(ctx.getText, ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a new error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is refactoring that I did. Previously if someone only writes STRUCT/ARRAY/MAP without parameters it would go to path of the primitive type. This is not a good practice, so I made a change that complex types are isolated in the separate context. The only change here is that we would only change the error message for when someone writes STRUCT(2) where it would return unsupported primitive type instead of the complex type missing element. We could argue that this is a change, but if you ask me, we need to distinguish between primitive and complex types first, as this is a general practice in type theory. We have primitive types which can be used as leaf arguments in complex types, we do not want to go into some recursive link between the two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants