Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace grammar by antlr/grammars-v4's #73

Merged
merged 16 commits into from
Apr 3, 2023
Merged

Conversation

svengiegerich
Copy link
Contributor

@svengiegerich svengiegerich commented Mar 20, 2023

Motivation

Microsoft has never published a full open specification for T-SQL, so each grammar "replication" of Transact-SQL (TSQL) is incomplete. The project's current grammar is (mainly) based on https://github.com/antlr/codebuff/blob/master/grammars/org/antlr/codebuff/tsql.g4, which was last updated in 2016.

Consequently, in my projects, I encountered several (elements of) SQL statements that pytsql was or is currently incapable of parsing - i.e., it's not possible to execute them with pytsql.
Examples are PR #51, #59, or, more recently, I tried to parse DATETIME2 and STRING_AGG() without luck.

In the past, we always manually adjusted our .g4 grammar - e.g., PR #33. However, this solves the problem, namely that the grammar has (apparent) gaps, only fragmentarily. Also, manual changes to the grammar generally make the package harder to maintain and may need to be more focused on the actual focus of the package.

Fortunately, however, a relatively actively maintained grammar with significantly greater functionality exists: https://github.com/antlr/grammars-v4/tree/master/sql/tsql (*). It is even an evolution of our 2016 version. I propose to base our grammar on it instead.

I would even propose moving all future grammar extensions completely to this external repository to keep our package as light as possible.

👍 Pros:

  • Grammar (*) ships much more valid syntax
  • 👉 pytsql get's "leaner" & can focus less on grammar definition
  • 👉 In effect, more TSQL statements get parseable

👎 Cons:

  • Grammar (*) has more false positives. That is, sometimes invalid statements are still parsed successfully
    • example (1): USEE master does not raise an issue
    • example (2): in rare cases \N GO is not recognised correctly; see TSQL: GO falsely identify as column_alias? antlr/grammars-v4#3275 - adding ';' solves it for isolate_top_level_statements=False
    • however, despite pytsql "executing" these statements, sqlcmd in the end still handles them correctly...
    • ... so it may boils down to which tools should generate the error message (?)

Changes

Replace tsql.g4 by https://github.com/antlr/grammars-v4/tree/master/sql/tsql (*).

  • Modify grammar (*) slightly:
  • Generate new targets
  • Adjust tsql.py
  • Adjust tests:
    • +1: some minor flaws in current test queries were not flagged with our current grammar
    • -1: remove some tests that flagged wrong TSQL statements
  • Adjust documentation

Example of a false positive parsing of grammar (*):
image

Relabeling policy '<x>_label':
- 'from'
- 'id'
- 'property'
- 'str'
- 'format'
- 'input'
- 'with'
- 'type'
- name or parser and lexer
- incoporate new rule `batch_level_statement`
- replace `is not None` by `hasattr()`
- '\NGO' not recognized for simple SELECT statements, see antlr/grammars-v4#3275
- New grammar has more false positives; hence, some invalid queries are flagged as valid
@xhochy
Copy link
Member

xhochy commented Mar 24, 2023

For faster feedback this should also have Windows unit tests ;)

@svengiegerich
Copy link
Contributor Author

FYI, thanks to 🧙‍♂️ @xhochy, all checks are 🟢 now. Hence, the PR should be finally ready for your review

Copy link
Member

@SimeonStoykovQC SimeonStoykovQC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, at the moment, the new grammar doesn't seem to accommodate the current use cases we have for pytsql internally. See one example in #74 .

Also, regarding this:

The second moment to note is that indeed it seems grammar in this repo is not correct about dividing file to batches by GO-statements.

I would be a bit worried, because this is one of the most important uses of pytsql for us.

@svengiegerich
Copy link
Contributor Author

Sadly, at the moment, the new grammar doesn't seem to accommodate the current use cases we have for pytsql internally. See one example in #74 .

Thanks, @SimeonStoykovQC, for spotting! I was a bit naive. I've commented on #74. The issue is the order of appearance for the separate lexer rule RAW. This is an easy fix, but I first would like to get feedback here on whether the easiest fix is the most elegant one.

So let's wait for that ⏳

I've now applied tsql._split() to every *.sql file in my main internal project.
If I replace "raw.", I don't run into any other parsing issue.

Also, regarding antlr/grammars-v4#3275 (comment): [...]
I would be a bit worried, because this is one of the most important uses of pytsql for us.

True, but the issue only occurs if the user changes the default value of isolate_top_level_statements to false.
(Plus, I believe it only matters for some edge cases in combination with dynamic SQL - the test example is really a strange one) Also, there is an easy fix for it, simply using a semicolon (which is good practice either way).

But let's have a more general ☎️ on it; maybe we could also address the concerns with extended test cases?

Copy link
Collaborator

@kklein kklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a great ambition behind this undertaking!

Moving forward, three rather high-level things are not yet crystal clear to me:

  1. I haven't seen a license for the source of the new grammar at first sight. Are we sure we can use their code? If yes, should we give them credits in the Readme?

  2. You suggested that this change would make pytsql leaner and focalize attention on topics other than the grammar. Does that mean you intend to use a submodule or the like eventually? If no, what is the suggested mode of operation: should pytsql contributors always seek to contribute changes to the grammar via the antlr repository and then copy the grammar to pytsql, once merged?

@@ -309,7 +305,6 @@ def test_non_xml_method():
def test_linked_server():
seed = """
SELECT *
INTO linked_server.db.schm.table1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the description of the PR as well as the commit messages it's not obvious to me why we would no longer be able to support this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge (checked with one external colleague that uses a linked server setup), this is not a valid syntax / possible. You can insert into your own server but not into the linked one directly.

@@ -78,21 +78,25 @@ def compare_context(
py_context: ParserRuleContext, cpp_context: ParserRuleContext
) -> None:
assert type(py_context) == type(cpp_context)
assert len(py_context.children) == len(cpp_context.children)
py_context_children = list(py_context.getChildren())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a clear cut problem with using the children attribute here and in the following lines or was this syntactic preference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a clear cut problem with using the children attribute here and in the following lines or was this syntactic preference?

See https://github.com/amykyta3/speedy-antlr-tool/pull/13/files#diff-3a473f45aaf0af2ca559e0f45b4c1b3e8fb252b473a83ee80bdc1c7b24bacfb3, I simply aim to reflect this change as well

@@ -6,7 +6,7 @@


def test_ignore_go_inside_quoted_text():
seed = """DECLARE @td VARCHAR(20) = CONVERT(VARCHAR, GETDATE(), 112);
seed = """DECLARE @td VARCHAR(20) = CONVERT(VARCHAR, GETDATE(), 112)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to your comment, I could follow that the change in test_dynamics is due to antlr/grammars-v4#3275.

It's not clear to me why this change - and all other changes removing semicolons - is necessary. Was the previous grammar too lax; do we actually think this is an illegitimate statement? Or is this an unfortunate confinement of the newly proposed grammar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why this change - and all other changes removing semicolons - is necessary. Was the previous grammar too lax; do we actually think this is an illegitimate statement? Or is this an unfortunate confinement of the newly proposed grammar?

The last one. The new grammar "loses" these semicolons when parsing/visiting. I guess they are not needed when declaring variables. Hence, I believe it only affects the tests, and should have no impact on the actual execution.

Copy link
Contributor Author

@svengiegerich svengiegerich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kklein

I haven't seen a license for the source of the new grammar at first sight. Are we sure we can use their code? If yes, should we give them credits in the Readme?

The old and the new are the same: MIT license.

Update: a704903 gives credit in the README now

You suggested that this change would make pytsql leaner and focalize attention on topics other than the grammar. Does that mean you intend to use a submodule or the like eventually? If no, what is the suggested mode of operation: should pytsql contributors always seek to contribute changes to the grammar via the antlr repository and then copy the grammar to pytsql, once merged?

I lack experience here and have no informed opinion here. What would you suggest?

In general, I'd vote that all changes to the grammar should take place via the antlr repo (as long as it's active at least). Then, on our side, we should only need to adjust some protected labels - see 03eb658.

Moving forward, three rather high-level things are not yet crystal clear to me:

Did you forget to include your third-one?

@@ -309,7 +305,6 @@ def test_non_xml_method():
def test_linked_server():
seed = """
SELECT *
INTO linked_server.db.schm.table1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge (checked with one external colleague that uses a linked server setup), this is not a valid syntax / possible. You can insert into your own server but not into the linked one directly.

@@ -78,21 +78,25 @@ def compare_context(
py_context: ParserRuleContext, cpp_context: ParserRuleContext
) -> None:
assert type(py_context) == type(cpp_context)
assert len(py_context.children) == len(cpp_context.children)
py_context_children = list(py_context.getChildren())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a clear cut problem with using the children attribute here and in the following lines or was this syntactic preference?

See https://github.com/amykyta3/speedy-antlr-tool/pull/13/files#diff-3a473f45aaf0af2ca559e0f45b4c1b3e8fb252b473a83ee80bdc1c7b24bacfb3, I simply aim to reflect this change as well

@@ -6,7 +6,7 @@


def test_ignore_go_inside_quoted_text():
seed = """DECLARE @td VARCHAR(20) = CONVERT(VARCHAR, GETDATE(), 112);
seed = """DECLARE @td VARCHAR(20) = CONVERT(VARCHAR, GETDATE(), 112)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why this change - and all other changes removing semicolons - is necessary. Was the previous grammar too lax; do we actually think this is an illegitimate statement? Or is this an unfortunate confinement of the newly proposed grammar?

The last one. The new grammar "loses" these semicolons when parsing/visiting. I guess they are not needed when declaring variables. Hence, I believe it only affects the tests, and should have no impact on the actual execution.

@svengiegerich
Copy link
Contributor Author

Update:

You suggested that this change would make pytsql leaner and focalize attention on topics other than the grammar. Does that mean you intend to use a submodule or the like eventually?

  • @kklein I had another thought: we would always need to run helper_generate_parsers.sh on such a submodule to reflect the changes to our C++/Python targets fully. So a git submodule seems less practical to me...?
    • 606b1d6 adds a (currently commented out) step, that downloads the latest grammar directly from antlrs repo.
    • Hence, we could potentially add a GHA script that updates the grammar on a regular basis

Copy link
Collaborator

@ivergara ivergara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

src/pytsql/grammar/adjust_antlrs_grammar.py Show resolved Hide resolved
src/pytsql/grammar/adjust_antlrs_grammar.py Outdated Show resolved Hide resolved
Copy link
Member

@SimeonStoykovQC SimeonStoykovQC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! 🥇

@svengiegerich
Copy link
Contributor Author

I just received the results of another (internal) regression test which looks good as well.
Hence, I'm merging now.

@svengiegerich svengiegerich merged commit b86cf7f into main Apr 3, 2023
@svengiegerich svengiegerich deleted the sg_use_antlr_grammars_v4 branch April 3, 2023 06:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants