Discrepancy between terminal and single-match regex terminal #4082

tothtamas28 · 2024-03-06T16:53:12Z

Consider the following definitions.

1. With terminal

module LEXICAL
    syntax Foo ::= "foo"
    rule <k> foo => .K ... </k>
endmodule

This works as expected.

2. With regex terminal

Let's change the symbol to a regex terminal.

syntax Foo ::= r"foo"

This defines the same set of valid tokens for Foo as the previous definition. But now there's a parse error:

[Error] Inner Parser: Parse error: unexpected token '=>' following token 'foo'.
        Source(/home/ttoth/git/pyk/lexical.k)
        Location(3,18,3,20)
        3 |         rule <k> foo => .K ... </k>
          .                      ^~
[Error] Compiler: Had 1 parsing errors.

3. Fix potential lexical conflicts

Speculatively, let's change the token to be disjoint from other lexicals in the prelude.

module LEXICAL
    syntax Foo ::= r"@foo"
    rule <k> @foo => .K ... </k>
endmodule

Now the error is something different:

[Error] Compiler: Expected format attribute on production with regular
expression terminal. Did you forget the 'token' attribute?
        Source(/home/ttoth/git/pyk/lexical.k)
        Location(2,20,2,27)
        2 |         syntax Foo ::= r"@foo"
          .                        ^~~~~~~
[Error] Compiler: Had 1 structural errors.

4. Add the `format` attribute

After adding the format attribute, kompilation works.

syntax Foo ::= r"@foo" [format(@foo)]

Questions

This raises a few questions.

Why doesn't (2) work if (1) does?
Why does the kompiler ask for the format attribute for (3) if for (2) it does not?

The text was updated successfully, but these errors were encountered:

Baltoli · 2024-03-15T16:52:52Z

I dug into this.

This defines the same set of valid tokens for Foo as the previous definition. But now there's a parse error:

I think you identified the reason here, but for concreteness the issue is a collision with the regex for #LowerId defined in kast.md.

Why doesn't (2) work if (1) does?

Regex terminals have a lower priority than non-regex terminals. The logic isn't especially obvious, but when we generate scanners there's a sorting process that happens here:

k/kernel/src/main/java/org/kframework/parser/inner/kernel/Scanner.java

Lines 153 to 156 in 67ef71b

    
           List<TerminalLike> ordered = 
        
               tokens.keySet().stream() 
        
                   .sorted((t1, t2) -> tokens.get(t2)._2() - tokens.get(t1)._2()) 
        
                   .toList();

k/kore/src/main/scala/org/kframework/definition/outer.scala

Lines 969 to 992 in 67ef71b

    
           case class RegexTerminal(precedeRegex: String, regex: String, followRegex: String) 
        
               extends TerminalLike 
        
               with RegexTerminalToString { 
        
             lazy val pattern = new RunAutomaton(new RegExp(regex).toAutomaton, false) 
        
             def compareTo(t: TerminalLike): Int = { 
        
               if (t.isInstanceOf[Terminal]) { 
        
                 return 1; 
        
               } 
        
               return regex.compareTo(t.asInstanceOf[RegexTerminal].regex) 
        
             } 
        
           } 
        
           case class Terminal(value: String) 
        
               extends TerminalLike // hooked 
        
               with TerminalToString { 
        
             def compareTo(t: TerminalLike): Int = { 
        
               if (t.isInstanceOf[RegexTerminal]) { 
        
                 return -1; 
        
               } 
        
               return value.compareTo(t.asInstanceOf[Terminal].value) 
        
             } 
        
           }

This ordering places regexes at the end of the list, which means that if any non-regex matches first then the overlapping regex won't get considered. It also answers the question of why the #LowerId regex gets tried (and succeeds) before the Foo one. If you reverse this ordering, the code in (1) fails because foo is tokenized as a #LowerId.

Why does the kompiler ask for the format attribute for (3) if for (2) it does not?

Because of the tokenization problem, inner parsing hasn't even finished when the error in (2) is emitted. This means that we don't have the full semantic information to validate the attributes. For (3), inner parsing has succeeded and we can validate the attributes.

tothtamas28 added the question label Mar 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discrepancy between terminal and single-match regex terminal #4082

Discrepancy between terminal and single-match regex terminal #4082

tothtamas28 commented Mar 6, 2024

Baltoli commented Mar 15, 2024 •

edited

Discrepancy between terminal and single-match regex terminal #4082

Discrepancy between terminal and single-match regex terminal #4082

Comments

tothtamas28 commented Mar 6, 2024

1. With terminal

2. With regex terminal

3. Fix potential lexical conflicts

4. Add the format attribute

Questions

Baltoli commented Mar 15, 2024 • edited

4. Add the `format` attribute

Baltoli commented Mar 15, 2024 •

edited