Initial support of context tokens (soft keywords, token value comparison operator) support #8

KvanTTT · 2024-01-02T15:45:12Z

Reference issues

Explanation

Consider the following JS example:

const obj = {
  get: "test",
  get latest() {
    return this.get;
  },
}

console.log(obj.latest); // prints "test"

Previously, supporting the soft keyword get required a grammar like the following (fragment):

propertyAssignment
    : id ':' singleExpression=STRING
    | GET id '(' ')' functionBody
    ;

id
    : ID
    | GET
    ;

But it's more natural to check the token value on the parser side, without changing the lexer:

propertyAssignment
    : ID ':' singleExpression=STRING
    | ID=='get' ID '(' ')' functionBody
    ;
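As a rough illustration of what checking the token value on the parser side amounts to, here is a small hand-written sketch. It is not ANTLR-generated code: the token shape `{ type, text }` and the names `isContextToken` and `classifyPropertyAssignment` are made up for this example.

```javascript
// Hypothetical token shape: { type: "ID" | "STRING" | ":" | "(" | ")", text: string }.
// The lexer only ever emits ID for `get`; there is no separate GET token type.

// Returns true if the token at `pos` is an ID whose text equals `value`.
function isContextToken(tokens, pos, value) {
  const t = tokens[pos];
  return t !== undefined && t.type === "ID" && t.text === value;
}

// Decides which alternative of `propertyAssignment` applies at `pos`:
//   ID ':' STRING                      -> "property"
//   ID=='get' ID '(' ')' functionBody  -> "getter"
function classifyPropertyAssignment(tokens, pos) {
  const next = tokens[pos + 1];
  if (isContextToken(tokens, pos, "get") && next !== undefined && next.type === "ID") {
    return "getter";
  }
  const first = tokens[pos];
  if (first !== undefined && first.type === "ID" && next !== undefined && next.type === ":") {
    return "property";
  }
  return "error";
}
```

With this sketch, `get: "test"` is classified as a plain property and `get latest()` as a getter, although both start with the same ID token.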

Context tokens are especially useful for SQL-like languages, which have a ton of soft keywords. The user mistake I've encountered most often in the ANTLR grammars repository is defining a token in the lexer without adding it to the id parser rule (I think there are still a lot of such errors in the SQL grammars). Context tokens significantly simplify grammars and reduce their size, and such keywords are used in other languages as well (C#, Kotlin, and others).

`caseInsensitive` parser option

Regarding SQL grammars: the caseInsensitive parser option was also implemented:

options { caseInsensitive=true; }

...

other_function
    : ID=='XMLEXISTS' '(' expression xml_passing_clause? ')' // Case insensitive value checking
    ;
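Case-insensitive value checking can be pictured with the sketch below. The commits mention that the context value is normalized when caseInsensitive is set; normalizing by lower-casing is an assumption here, and `normalize` and `matchesContextValue` are hypothetical names.

```javascript
// Assumption: normalization is done by lower-casing both sides.
// The normalization used by the actual implementation may differ.
function normalize(value, caseInsensitive) {
  return caseInsensitive ? value.toLowerCase() : value;
}

// Returns true if the token text matches the value written in the grammar
// (e.g. ID=='XMLEXISTS'), honoring the caseInsensitive option.
function matchesContextValue(tokenText, expected, caseInsensitive) {
  return normalize(tokenText, caseInsensitive) === normalize(expected, caseInsensitive);
}
```

Under this assumption, `xmlexists`, `XmlExists`, and `XMLEXISTS` all match `ID=='XMLEXISTS'` when the option is enabled, and only the exact spelling matches when it is not.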

How it's implemented

ANTLR tool

The ANTLR tool creates an artificial token for every token written in context-token form. If it encounters multiple context tokens with the same type and value, only a single token is created. For the following grammar, a single ID_keyword token is created:

x
    : ID=='keyword' // It's `ID_keyword` token
    | ID=='keyword' // It's also `ID_keyword` token
    ;
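The deduplication can be sketched as a small registry keyed by the artificial token name. Only the `ID_keyword` naming comes from the description above; the class `ContextTokenRegistry`, its `register` method, and the numbering scheme are assumptions made for illustration.

```javascript
// Hypothetical registry that hands out one artificial token type
// per distinct (base token, value) pair.
class ContextTokenRegistry {
  constructor(firstFreeType) {
    this.nextType = firstFreeType; // assumption: artificial types follow the ordinary ones
    this.byName = new Map();       // e.g. "ID_keyword" -> artificial token type
  }

  // Returns the artificial token type for (baseName, value),
  // creating it only on first use.
  register(baseName, value) {
    const name = `${baseName}_${value}`;
    if (!this.byName.has(name)) {
      this.byName.set(name, this.nextType++);
    }
    return this.byName.get(name);
  }
}
```

Registering `ID=='keyword'` twice returns the same type both times, so the two alternatives of rule x share one `ID_keyword` token.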

ANTLR runtime

The ANTLR runtime tries to treat the next token from the input stream as a context token. If it can also be treated as a normal token, two DFA states are initialized. This is needed to resolve ambiguities. Consider the following grammar:

x
    : ID=='keyword' 'a'
    | ID=='keyword' 'b'
    | ID
    ;

With the following input:

keyword a keyword b

When the ANTLR runtime takes the first ID_keyword token, the DFA is initialized with both the context token and the normal token (ID_keyword and ID).

In effect, it works as if the grammar were defined in the ordinary way:

x
    : KEYWORD 'a'
    | KEYWORD 'b'
    | ID
    ;

id : ID | KEYWORD;

But the resolution is performed on the runtime side.
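The idea of treating one input token under both interpretations can be illustrated with a deliberately naive sketch. It tries the alternatives in order and is not how the runtime's ATN/DFA simulation works internally; `ALTERNATIVES`, `candidateTypes`, and `predict` are hypothetical names, and tokens are assumed to look like `{ type, text }`.

```javascript
// Alternatives of rule x, written as sequences of expected token type names.
const ALTERNATIVES = [
  ["ID_keyword", "a"], // ID=='keyword' 'a'
  ["ID_keyword", "b"], // ID=='keyword' 'b'
  ["ID"],              // ID
];

// All interpretations of one input token: the context token first
// (if the text matches), then the ordinary token type.
function candidateTypes(token) {
  const types = [];
  if (token.type === "ID" && token.text === "keyword") types.push("ID_keyword");
  types.push(token.type);
  return types;
}

// Returns the 1-based number of the first alternative whose whole
// sequence matches at `pos`, or 0 if none matches.
function predict(tokens, pos) {
  for (let alt = 0; alt < ALTERNATIVES.length; alt++) {
    const matches = ALTERNATIVES[alt].every((expected, i) => {
      const t = tokens[pos + i];
      return t !== undefined && candidateTypes(t).includes(expected);
    });
    if (matches) return alt + 1;
  }
  return 0;
}
```

For the input `keyword a keyword b`, this sketch picks alternative 1 at the first `keyword` and alternative 2 at the second, while a lone `keyword` falls back to the plain ID alternative.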

For performance reasons, every context token is placed in a String-Integer map, which gives O(1) lookup complexity. Further optimizations (some kind of caching) can be implemented later.
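A minimal sketch of such a lookup is shown below. The numeric token types and the nesting of one text map per base token type are assumptions; the description only says that a String-Integer map is used.

```javascript
// Hypothetical numeric token types.
const ID = 1;
const ID_keyword = 2;
const ID_keyword2 = 3;

// Assumption: one text -> context-token-type map per base token type.
const contextTokens = new Map([
  [ID, new Map([["keyword", ID_keyword], ["keyword2", ID_keyword2]])],
]);

// Returns the context token type for (type, text), or null if there is none.
// Both steps are hash lookups, so the check is O(1) on average.
function lookupContextType(type, text) {
  const byText = contextTokens.get(type);
  if (byText === undefined) return null;
  const found = byText.get(text);
  return found === undefined ? null : found;
}
```

The cost of the check does not grow with the number of context tokens in the grammar, which matters for SQL grammars with hundreds of soft keywords.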

It's worth mentioning that if the current rule is unambiguous (LL(1)), ANTLR generates more optimized code that uses a switch instead of an adaptivePredict call. This optimization is also implemented for context tokens. A related example:

x // no ambiguity -> fast switch is used
    : ID=='keyword'
    | ID=='keyword2'
    ;
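The shape of the switch-based decision for this rule might look roughly like the sketch below. It is hand-written, not actual generated parser code: it switches on the token text for readability, whereas generated code presumably switches on a numeric token type, and `chooseAlternativeOfX` is a hypothetical name.

```javascript
// Picks the alternative of rule x with a plain switch, without any
// adaptive prediction. Returns the 1-based alternative number.
function chooseAlternativeOfX(token) {
  if (token.type !== "ID") {
    throw new Error(`mismatched input '${token.text}', expecting ID`);
  }
  switch (token.text) {
    case "keyword":  // ID=='keyword'
      return 1;
    case "keyword2": // ID=='keyword2'
      return 2;
    default:
      throw new Error(`no viable alternative at input '${token.text}'`);
  }
}
```

Because the two context values are distinct, one token of lookahead is enough to decide, which is what makes the switch possible.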

Also, see the other tests in the ContextTokens subdirectory. More tests should still be added (for instance, tests on errors).

Testing

All tests are green (Java only), and it's worth mentioning that the new version can consume generated code from previous versions.

Plans

Add more tests (errors, semantic predicates, and others)
Check and improve performance
Support the negation operator (~ID='keyword' or ID!='keyword')
Try to rewrite and test big grammars (all those under sql)

KvanTTT force-pushed the dev branch from 3f8d3cd to 55432c6 Compare January 9, 2024 18:03

KvanTTT added 10 commits January 11, 2024 14:46

Introduce new EQUALS (==) operator to ANTLR grammar

28a1575

It will be used for token value comparison feature antlr/antlr5-specs#15 Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>

Introduce new syntax (ID=='keyword') of checking token value

ba5b261

antlr/antlr5-specs#15 Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>

TerminalAST.getText returns token name including context value (ID=='keyword' -> ID_keyword)

7cc301b

antlr/antlr5-specs#15 Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>

Support serialization of context tokens (back compatible)

86b3b57

antlr/antlr5-specs#15 Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>

computeReachSet always returns not null ATNConfigSet

9f3bca5

antlr/antlr5-specs#15 Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>

Consider context token in runtime

276563c

antlr/antlr5-specs#15 Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>

Normalize token context value if caseInsensitive option is specified

d6d803d

antlr/antlr5-specs#15 Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>

Support caseInsensitive value serialization for parser ATN

5def919

antlr/antlr5-specs#15 Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>

Support case-insensitive context tokens in runtime

0440d55

antlr/antlr5-specs#15 Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>

Fir runtime to correctly match context tokens

e95da20

antlr/antlr5-specs#15 Signed-off-by: Ivan Kochurkin <kvanttt@gmail.com>

KvanTTT force-pushed the context-tokens branch from d4d8031 to e95da20 Compare January 11, 2024 13:48

KvanTTT changed the title ~~Initial context tokens (soft keywords, token value comparison operator) support~~ Initial support of context tokens (soft keywords, token value comparison operator) support Jan 27, 2024
