Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrepancy between terminal and single-match regex terminal #4082

Open
tothtamas28 opened this issue Mar 6, 2024 · 1 comment
Open

Discrepancy between terminal and single-match regex terminal #4082

tothtamas28 opened this issue Mar 6, 2024 · 1 comment
Labels

Comments

@tothtamas28
Copy link
Contributor

Consider the following definitions.

1. With terminal

module LEXICAL
    syntax Foo ::= "foo"
    rule <k> foo => .K ... </k>
endmodule

This works as expected.

2. With regex terminal

Let's change the symbol to a regex terminal.

syntax Foo ::= r"foo"

This defines the same set of valid tokens for Foo as the previous definition. But now there's a parse error:

[Error] Inner Parser: Parse error: unexpected token '=>' following token 'foo'.
        Source(/home/ttoth/git/pyk/lexical.k)
        Location(3,18,3,20)
        3 |         rule <k> foo => .K ... </k>
          .                      ^~
[Error] Compiler: Had 1 parsing errors.

3. Fix potential lexical conflicts

Speculatively, let's change the token to be disjoint from other lexicals in the prelude.

module LEXICAL
    syntax Foo ::= r"@foo"
    rule <k> @foo => .K ... </k>
endmodule

Now the error is something different:

[Error] Compiler: Expected format attribute on production with regular
expression terminal. Did you forget the 'token' attribute?
        Source(/home/ttoth/git/pyk/lexical.k)
        Location(2,20,2,27)
        2 |         syntax Foo ::= r"@foo"
          .                        ^~~~~~~
[Error] Compiler: Had 1 structural errors.

4. Add the format attribute

After adding the format attribute, kompilation works.

syntax Foo ::= r"@foo" [format(@foo)]

Questions

This raises a few questions.

  1. Why doesn't (2) work if (1) does?
  2. Why does the kompiler ask for the format attribute for (3) if for (2) it does not?
@Baltoli
Copy link
Collaborator

Baltoli commented Mar 15, 2024

I dug into this.

This defines the same set of valid tokens for Foo as the previous definition. But now there's a parse error:

I think you identified the reason here, but for concreteness the issue is a collision with the regex for #LowerId defined in kast.md.

Why doesn't (2) work if (1) does?

Regex terminals have a lower priority than non-regex terminals. The logic isn't especially obvious, but when we generate scanners there's a sorting process that happens here:

List<TerminalLike> ordered =
tokens.keySet().stream()
.sorted((t1, t2) -> tokens.get(t2)._2() - tokens.get(t1)._2())
.toList();

case class RegexTerminal(precedeRegex: String, regex: String, followRegex: String)
extends TerminalLike
with RegexTerminalToString {
lazy val pattern = new RunAutomaton(new RegExp(regex).toAutomaton, false)
def compareTo(t: TerminalLike): Int = {
if (t.isInstanceOf[Terminal]) {
return 1;
}
return regex.compareTo(t.asInstanceOf[RegexTerminal].regex)
}
}
case class Terminal(value: String)
extends TerminalLike // hooked
with TerminalToString {
def compareTo(t: TerminalLike): Int = {
if (t.isInstanceOf[RegexTerminal]) {
return -1;
}
return value.compareTo(t.asInstanceOf[Terminal].value)
}
}

This ordering places regexes at the end of the list, which means that if any non-regex matches first then the overlapping regex won't get considered. It also answers the question of why the #LowerId regex gets tried (and succeeds) before the Foo one. If you reverse this ordering, the code in (1) fails because foo is tokenized as a #LowerId.

Why does the kompiler ask for the format attribute for (3) if for (2) it does not?

Because of the tokenization problem, inner parsing hasn't even finished when the error in (2) is emitted. This means that we don't have the full semantic information to validate the attributes. For (3), inner parsing has succeeded and we can validate the attributes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants