-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace grammar by antlr/grammars-v4
's
#73
Conversation
Relabeling policy '<x>_label': - 'from' - 'id' - 'property' - 'str' - 'format' - 'input' - 'with' - 'type'
- name or parser and lexer - incoporate new rule `batch_level_statement` - replace `is not None` by `hasattr()`
- '\NGO' not recognized for simple SELECT statements, see antlr/grammars-v4#3275 - New grammar has more false positives; hence, some invalid queries are flagged as valid
For faster feedback this should also have Windows unit tests ;) |
FYI, thanks to 🧙♂️ @xhochy, all checks are 🟢 now. Hence, the PR should be finally ready for your review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, at the moment, the new grammar doesn't seem to accommodate the current use cases we have for pytsql internally. See one example in #74 .
Also, regarding this:
The second moment to note is that indeed it seems grammar in this repo is not correct about dividing file to batches by GO-statements.
I would be a bit worried, because this is one of the most important uses of pytsql
for us.
Thanks, @SimeonStoykovQC, for spotting! I was a bit naive. I've commented on #74. The issue is the order of appearance for the separate lexer rule So let's wait for that ⏳ I've now applied
True, but the issue only occurs if the user changes the default value of But let's have a more general ☎️ on it; maybe we could also address the concerns with extended test cases? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a great ambition behind this undertaking!
Moving forward, three rather high-level things are not yet crystal clear to me:
-
I haven't seen a license for the source of the new grammar at first sight. Are we sure we can use their code? If yes, should we give them credits in the Readme?
-
You suggested that this change would make pytsql leaner and focalize attention on topics other than the grammar. Does that mean you intend to use a submodule or the like eventually? If no, what is the suggested mode of operation: should pytsql contributors always seek to contribute changes to the grammar via the
antlr
repository and then copy the grammar to pytsql, once merged?
@@ -309,7 +305,6 @@ def test_non_xml_method(): | |||
def test_linked_server(): | |||
seed = """ | |||
SELECT * | |||
INTO linked_server.db.schm.table1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the description of the PR as well as the commit messages it's not obvious to me why we would no longer be able to support this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge (checked with one external colleague that uses a linked server setup), this is not a valid syntax / possible. You can insert into your own server but not into the linked one directly.
@@ -78,21 +78,25 @@ def compare_context( | |||
py_context: ParserRuleContext, cpp_context: ParserRuleContext | |||
) -> None: | |||
assert type(py_context) == type(cpp_context) | |||
assert len(py_context.children) == len(cpp_context.children) | |||
py_context_children = list(py_context.getChildren()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there a clear cut problem with using the children
attribute here and in the following lines or was this syntactic preference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there a clear cut problem with using the children attribute here and in the following lines or was this syntactic preference?
See https://github.com/amykyta3/speedy-antlr-tool/pull/13/files#diff-3a473f45aaf0af2ca559e0f45b4c1b3e8fb252b473a83ee80bdc1c7b24bacfb3, I simply aim to reflect this change as well
@@ -6,7 +6,7 @@ | |||
|
|||
|
|||
def test_ignore_go_inside_quoted_text(): | |||
seed = """DECLARE @td VARCHAR(20) = CONVERT(VARCHAR, GETDATE(), 112); | |||
seed = """DECLARE @td VARCHAR(20) = CONVERT(VARCHAR, GETDATE(), 112) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks to your comment, I could follow that the change in test_dynamics
is due to antlr/grammars-v4#3275.
It's not clear to me why this change - and all other changes removing semicolons - is necessary. Was the previous grammar too lax; do we actually think this is an illegitimate statement? Or is this an unfortunate confinement of the newly proposed grammar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why this change - and all other changes removing semicolons - is necessary. Was the previous grammar too lax; do we actually think this is an illegitimate statement? Or is this an unfortunate confinement of the newly proposed grammar?
The last one. The new grammar "loses" these semicolons when parsing/visiting. I guess they are not needed when declaring variables. Hence, I believe it only affects the tests, and should have no impact on the actual execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't seen a license for the source of the new grammar at first sight. Are we sure we can use their code? If yes, should we give them credits in the Readme?
The old and the new are the same: MIT license.
Update: a704903 gives credit in the README now
You suggested that this change would make pytsql leaner and focalize attention on topics other than the grammar. Does that mean you intend to use a submodule or the like eventually? If no, what is the suggested mode of operation: should pytsql contributors always seek to contribute changes to the grammar via the antlr repository and then copy the grammar to pytsql, once merged?
I lack experience here and have no informed opinion here. What would you suggest?
In general, I'd vote that all changes to the grammar should take place via the antlr repo (as long as it's active at least). Then, on our side, we should only need to adjust some protected labels - see 03eb658.
Moving forward, three rather high-level things are not yet crystal clear to me:
Did you forget to include your third-one?
@@ -309,7 +305,6 @@ def test_non_xml_method(): | |||
def test_linked_server(): | |||
seed = """ | |||
SELECT * | |||
INTO linked_server.db.schm.table1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge (checked with one external colleague that uses a linked server setup), this is not a valid syntax / possible. You can insert into your own server but not into the linked one directly.
@@ -78,21 +78,25 @@ def compare_context( | |||
py_context: ParserRuleContext, cpp_context: ParserRuleContext | |||
) -> None: | |||
assert type(py_context) == type(cpp_context) | |||
assert len(py_context.children) == len(cpp_context.children) | |||
py_context_children = list(py_context.getChildren()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there a clear cut problem with using the children attribute here and in the following lines or was this syntactic preference?
See https://github.com/amykyta3/speedy-antlr-tool/pull/13/files#diff-3a473f45aaf0af2ca559e0f45b4c1b3e8fb252b473a83ee80bdc1c7b24bacfb3, I simply aim to reflect this change as well
@@ -6,7 +6,7 @@ | |||
|
|||
|
|||
def test_ignore_go_inside_quoted_text(): | |||
seed = """DECLARE @td VARCHAR(20) = CONVERT(VARCHAR, GETDATE(), 112); | |||
seed = """DECLARE @td VARCHAR(20) = CONVERT(VARCHAR, GETDATE(), 112) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why this change - and all other changes removing semicolons - is necessary. Was the previous grammar too lax; do we actually think this is an illegitimate statement? Or is this an unfortunate confinement of the newly proposed grammar?
The last one. The new grammar "loses" these semicolons when parsing/visiting. I guess they are not needed when declaring variables. Hence, I believe it only affects the tests, and should have no impact on the actual execution.
Update:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! 🥇
I just received the results of another (internal) regression test which looks good as well. |
Motivation
Microsoft has never published a full open specification for T-SQL, so each grammar "replication" of Transact-SQL (TSQL) is incomplete. The project's current grammar is (mainly) based on https://github.com/antlr/codebuff/blob/master/grammars/org/antlr/codebuff/tsql.g4, which was last updated in 2016.
Consequently, in my projects, I encountered several (elements of) SQL statements that
pytsql
was or is currently incapable of parsing - i.e., it's not possible to execute them withpytsql.
Examples are PR #51, #59, or, more recently, I tried to parse
DATETIME2
andSTRING_AGG()
without luck.In the past, we always manually adjusted our
.g4
grammar - e.g., PR #33. However, this solves the problem, namely that the grammar has (apparent) gaps, only fragmentarily. Also, manual changes to the grammar generally make the package harder to maintain and may need to be more focused on the actual focus of the package.Fortunately, however, a relatively actively maintained grammar with significantly greater functionality exists: https://github.com/antlr/grammars-v4/tree/master/sql/tsql (*). It is even an evolution of our 2016 version. I propose to base our grammar on it instead.
I would even propose moving all future grammar extensions completely to this external repository to keep our package as light as possible.
👍 Pros:
pytsql
get's "leaner" & can focus less on grammar definition👎 Cons:
USEE master
does not raise an issue\N GO
is not recognised correctly; see TSQL:GO
falsely identify ascolumn_alias
? antlr/grammars-v4#3275 - adding ';' solves it forisolate_top_level_statements=False
pytsql
"executing" these statements, sqlcmd in the end still handles them correctly...Changes
Replace
tsql.g4
by https://github.com/antlr/grammars-v4/tree/master/sql/tsql (*).<x>
to<x>_label
(similar to PR Add C++ ANTLR runtime #28)tsql.py
Example of a false positive parsing of grammar (*):
![image](https://user-images.githubusercontent.com/1476077/226296513-81837b0d-3f64-48c9-b65b-ed7998de61c6.png)