Initial support of context tokens (soft keywords, token value comparison operator) support #8

Draft
wants to merge 10 commits into
base: dev


KvanTTT
Owner

@KvanTTT KvanTTT commented Jan 2, 2024

Reference issues

Explanation

Consider the following JS example:

const obj = {
  get: "test",
  get latest() {
    return this.get;
  },
}

console.log(obj.latest); // prints "test"

Previously, supporting the soft keyword get required a grammar like the following (fragment):

propertyAssignment
    : id ':' singleExpression=STRING
    | GET id '(' ')' functionBody
    ;

id
    : ID
    | GET
    ;

But it is more natural to check the token value on the parser side, without changing the lexer:

propertyAssignment
    : ID ':' singleExpression=STRING
    | ID=='get' ID '(' ')' functionBody
    ;

Context tokens are especially useful for SQL-like languages with a large number of soft keywords. The user mistake I have encountered most often in the ANTLR grammars repository is defining a token in the lexer without adding it to the id parser rule (I think there are still many such errors in the SQL grammars). Context tokens significantly simplify grammars and reduce their size, and they are used in other languages as well (C#, Kotlin, and others).

caseInsensitive parser option

For SQL grammars, a caseInsensitive parser option was also implemented:

options { caseInsensitive=true; }

...

other_function
    : ID=='XMLEXISTS' '(' expression xml_passing_clause? ')' // Case insensitive value checking
    ;
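
As a rough illustration of what a check such as ID=='XMLEXISTS' amounts to, the sketch below compares both the token type and the token text, optionally ignoring case. It is a minimal sketch and not the code from this PR: the class ContextTokenCheck, the method matches, and the numeric value of ID are hypothetical names chosen for the example.

```java
// Hypothetical helper for illustration only; not part of the ANTLR API.
final class ContextTokenCheck {
    static final int ID = 1; // assumed token type number

    // True when the token has the expected type and its text equals the
    // expected value, ignoring case if caseInsensitive is set.
    static boolean matches(int actualType, String actualText,
                           int expectedType, String expectedValue,
                           boolean caseInsensitive) {
        if (actualType != expectedType) {
            return false;
        }
        return caseInsensitive
                ? actualText.equalsIgnoreCase(expectedValue)
                : actualText.equals(expectedValue);
    }
}
```

With caseInsensitive enabled, the input text xmlexists would satisfy ID=='XMLEXISTS'; without it, only the exact spelling would.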

How it's implemented

ANTLR tool

The ANTLR tool creates artificial tokens for all tokens defined in context token form. If it encounters multiple context tokens with the same value and type, a single token is created. For the following grammar, only a single ID_keyword token is created:

x
    : ID=='keyword' // It's `ID_keyword` token
    | ID=='keyword' // It's also `ID_keyword` token
    ;
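
The deduplication described above can be pictured as a registry keyed by the artificial token name. This is only a sketch under assumptions: the class ContextTokenRegistry, its methods, and the way type numbers are allocated are hypothetical and do not reflect how the tool is actually implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical registry for illustration only; not the ANTLR tool's code.
final class ContextTokenRegistry {
    private final Map<String, Integer> typeByName = new LinkedHashMap<>();
    private int nextType;

    ContextTokenRegistry(int firstFreeType) {
        this.nextType = firstFreeType;
    }

    // Returns the type of the artificial token for (baseTokenName, value),
    // creating it on first use and reusing it afterwards.
    int register(String baseTokenName, String value) {
        String name = baseTokenName + "_" + value; // e.g. ID_keyword
        Integer existing = typeByName.get(name);
        if (existing != null) {
            return existing;
        }
        int type = nextType++;
        typeByName.put(name, type);
        return type;
    }

    int size() {
        return typeByName.size();
    }
}
```

Registering ("ID", "keyword") twice yields the same type number, which mirrors the single ID_keyword token in the grammar above.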

ANTLR runtime

The ANTLR runtime tries to treat the next token from the input stream as a context token. If it can also be treated as a normal token, two DFA states are initialized. This is needed to resolve ambiguities. Consider the following grammar:

x
    : ID=='keyword' 'a'
    | ID=='keyword' 'b'
    | ID
    ;

With the following input:

keyword a keyword b

When the ANTLR runtime takes the first ID_keyword token, the DFA is initialized with both the context token and the normal token (ID_keyword and ID).

In effect, it works as if the grammar were defined in the ordinary way:

x
    : KEYWORD 'a'
    | KEYWORD 'b'
    | ID
    ;

id : ID | KEYWORD;

The difference is that the resolution is performed on the runtime side.
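
To make the resolution step more concrete, the sketch below hand-simulates the decision for rule x with the two-interpretation idea: the first token is considered both as ID_keyword and as a plain ID, and the following token decides which alternative survives. It is a heavily simplified stand-in for the real DFA/ATN machinery; the class AltResolver and the representation of tokens as plain strings are assumptions made for the example.

```java
import java.util.List;

// Hypothetical, simplified model of the decision for rule x; the real
// runtime uses DFA states rather than explicit string comparisons.
final class AltResolver {
    // Returns the 1-based alternative of rule x chosen for the tokens
    // starting at index i. Every token is assumed to be identifier-like.
    static int predict(List<String> tokens, int i) {
        String first = tokens.get(i);
        String next = i + 1 < tokens.size() ? tokens.get(i + 1) : null;

        // Context interpretation: the ID token whose text is "keyword".
        boolean asContextToken = first.equals("keyword");

        if (asContextToken && "a".equals(next)) {
            return 1; // ID=='keyword' 'a'
        }
        if (asContextToken && "b".equals(next)) {
            return 2; // ID=='keyword' 'b'
        }
        return 3; // normal interpretation: plain ID
    }
}
```

For the input keyword a keyword b, this picks the first alternative at position 0 and the second at position 2, while a lone keyword falls back to the plain ID alternative.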

For performance reasons, every context token is placed in a String-to-Integer map, which makes the check O(1). Further optimizations (some kind of caching) can be implemented later.
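
A minimal sketch of such a lookup is shown below, assuming the map goes from token text to the artificial token type. The class ContextTokenLookup, the constants, and their numeric values are hypothetical; the real runtime may key the map differently (for example, per base token type).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table for illustration only.
final class ContextTokenLookup {
    static final int ID = 1;            // assumed normal token type
    static final int ID_KEYWORD = 100;  // assumed artificial token types
    static final int ID_KEYWORD2 = 101;

    private static final Map<String, Integer> CONTEXT_TYPES = new HashMap<>();

    static {
        CONTEXT_TYPES.put("keyword", ID_KEYWORD);
        CONTEXT_TYPES.put("keyword2", ID_KEYWORD2);
    }

    // Average O(1): one hash lookup per token, falling back to the
    // normal type when the text is not a registered context value.
    static int contextTypeOrDefault(String text, int normalType) {
        Integer contextType = CONTEXT_TYPES.get(text);
        return contextType != null ? contextType : normalType;
    }
}
```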

It is worth mentioning that if the current rule is unambiguous (LL(1)), ANTLR generates more optimized code that uses a switch instead of an adaptivePredict call. This optimization was also implemented for context tokens. A related example:

x // no ambiguity -> fast switch is used
    : ID=='keyword'
    | ID=='keyword2'
    ;
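
The shape of the switch-based decision for this rule might look roughly like the sketch below. It is not the code ANTLR generates: the class FastSwitch, the method chooseAlt, and the token type numbers are hypothetical, and the real generated parser works on the token stream rather than on a bare integer.

```java
// Hypothetical illustration of a switch-based LL(1) decision.
final class FastSwitch {
    static final int ID_KEYWORD = 100;  // assumed artificial token types
    static final int ID_KEYWORD2 = 101;

    // Picks the alternative of rule x directly from the resolved context
    // token type; returns 0 when no alternative is viable.
    static int chooseAlt(int contextTokenType) {
        switch (contextTokenType) {
            case ID_KEYWORD:
                return 1; // ID=='keyword'
            case ID_KEYWORD2:
                return 2; // ID=='keyword2'
            default:
                return 0; // a real parser would report a syntax error here
        }
    }
}
```

Because one token of lookahead is enough to separate the alternatives, no prediction over multiple tokens is needed.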

Also, see the other tests in the ContextTokens subdirectory. More tests should still be added (for instance, tests for errors).

Testing

All tests are green (Java only), and it is worth mentioning that the new version can consume generated code from previous versions.

Plans

  • Add more tests (errors, semantic predicates, and others)
  • Check and improve performance
  • Support a negation operator (~ID='keyword' or ID!='keyword')
  • Try to rewrite and test big grammars (all of the SQL ones)

It will be used for token value comparison feature

antlr/antlr5-specs#15

Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>
…='keyword' -> ID_keyword)

antlr/antlr5-specs#15

Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>
@KvanTTT KvanTTT changed the title Initial context tokens (soft keywords, token value comparison operator) support Initial support of context tokens (soft keywords, token value comparison operator) support Jan 27, 2024