
Full support for all grammars in Lark examples #1804

Open
DRMacIver opened this issue Feb 10, 2019 · 12 comments
Labels: lark (issues related to support for Lark grammars)

Comments

@DRMacIver
Member

As of #1740 we support generating strategies from Lark grammars (🎉), but this is currently a bit under-tested, and I suspect it doesn't support a reasonably wide range of grammars. There are certainly features it doesn't handle yet: for example, it can't use Lark's example Python grammar, because we don't currently support contextual parsing.

The obvious thing to do is to turn all of Lark's examples into tests and flush out any bugs that come up, both in our support and in Lark.

This is a tracking ticket for those bugs - we should make specific issues (or just pull requests) for individual bugs we find.
This ticket will be complete when we have a passing test for each example in the lark examples directory.

DRMacIver added the lark (issues related to support for Lark grammars) and tracking labels on Feb 10, 2019
Zac-HD removed the tracking label on Apr 30, 2019
@mvaled
Contributor

mvaled commented Nov 7, 2019

I would like to add that ignored tokens are usually needed to actually generate valid (parseable) source.

The following little grammar is extracted from a bigger one (with a Haskell-like syntax):

>>> from hypothesis.extra.lark import from_lark
>>> from lark import Lark
>>> source = r"""
... PADDING : " "+
... %ignore PADDING
...
... KEYWORD_FORALL.130 : /\bforall\b/
... LOWER_IDENTIFIER.80 : /[a-z][a-z_]*/
...
... type_expr : KEYWORD_FORALL LOWER_IDENTIFIER+ "." LOWER_IDENTIFIER+
... """

>>> grammar = Lark(source, lexer="standard", start="type_expr")
>>> grammar.parse("forall a. a")                                                                                                                                                              
Tree(type_expr, [Token(KEYWORD_FORALL, 'forall'), Token(LOWER_IDENTIFIER, 'a'), Token(LOWER_IDENTIFIER, 'a')])

>>> example = from_lark(grammar, start="type_expr").example()                                                                                                                                 
>>> example                                                                                                                                                                                   
'forallaffhbc.in'

The generated example doesn't parse because there's no space between the keyword 'forall' and the identifier(s) generated -- I wouldn't know if hypothesis generated more than one identifier.
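
For completeness, a quick way to confirm the round-trip failure (a sketch; LarkError is lark's base exception class):

from lark.exceptions import LarkError

try:
    grammar.parse(example)
except LarkError as err:
    # Fails for the example above, since 'forall' runs straight into the identifier.
    print("generated example does not parse:", err)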

Notice that the keyword "forall" is enclosed by \b (is this a good practice?).

@Zac-HD
Member

Zac-HD commented Nov 8, 2019

Ignored tokens are usually needed to actually generate valid (parseable) source.

Hmm, I see what you mean, though 'usually' will depend on your application area.

Our current token generation is pretty well tuned for common use-cases. Unclear what we should do here - an option to force a specific thing to be drawn between every other terminal seems awkward but might be the best we can do.

Notice that the keyword "forall" is enclosed by \b (is this a good practice?).

\b doesn't do anything in from_regex(), so whatever suits you and your other use-cases is fine. I don't think it's going to do anything at the ends of a pattern though.

@mvaled
Contributor

mvaled commented Nov 8, 2019

@Zac-HD

Unclear what we should do here

I was trying to come up with an "easy" solution. There are two things that we'd need to balance:

  • The grammar should not have to be "fine-tuned" for it to work with Hypothesis (at least too much),
  • The generator should not have to replicate all the job done for the grammar.

The full grammar I'm working on has several things that make hypothesis generate invalid code.

I just copied the current code of hypothesis' lark integration with a one-char change. But then I had to make my terminal COMMENT explicit, because I assume the presence of '^' and '$' in my regexp allows hypothesis to insert it anywhere.

Another particular pain point is the generation of application, which in that grammar is just "fn arg1 arg2 ...".

To solve these problems I was thinking that we could generate a stateful machine that the programmer could inherit from and tweak as needed. Each terminal would get a bundle named "terminal_MYTERMINAL", and a rule named "create_terminal_MYTERMINAL" with the following code schema:

@rule(target=terminal_MYTERMINAL, value=st.from_regex(MYTERMINAL_regexp))
def create_terminal_MYTERMINAL(self, value):
    return value

(I would say that ignored and declared terminals are never generated, so the programmer fills the gaps.)

Each non-terminal would get another bundle "nonterminal_rulename" and a rule "create_nonterminal_rulename" following the shape of the grammar rule. For instance, the rule

_datatype_deriving : _NL? KEYWORD_DERIVING _LPAREN _derivations_list _RPAREN
                   | _NL? KEYWORD_DERIVING UPPER_IDENTIFIER

would generate something like:

@rule(target=nonterminal__datatype_deriving, value=(
     st.tuples(terminal__NL, terminal_KEYWORD_DERIVING, terminal__LPAREN, nonterminal__derivations_list, terminal__RPAREN).map(lambda args: u"".join(args)) |
     st.tuples(terminal_KEYWORD_DERIVING, terminal__LPAREN, nonterminal__derivations_list, terminal__RPAREN).map(lambda args: u"".join(args)) |
     st.tuples(terminal__NL, terminal_KEYWORD_DERIVING, terminal_UPPER_IDENTIFIER).map(lambda args: u"".join(args)) |
     st.tuples(terminal_KEYWORD_DERIVING, terminal_UPPER_IDENTIFIER).map(lambda args: u"".join(args))))
def create__datatype_deriving(self, value):
    return value

Maybe (most likely) this schema is too hard, and using the stateful machinery for this is a hack. But that's the first thing that came to mind. Creating a class would allow the programmer to inherit from it and/or override the rules they want to tweak.

What do you think?

@mvaled
Contributor

mvaled commented Nov 8, 2019

I've just realized that the map in the code schema would (again) create unparsable code by joining the KEYWORD_DERIVING with the UPPER_IDENTIFIER. So inheriting would be too hackish.

Another possibility is to do this in a command-line tool that generates the stateful machine code. That way the programmer can edit the generated code and re-run the tool to keep it in sync with grammar changes.

@Zac-HD
Member

Zac-HD commented Nov 9, 2019

The explicit strategy to override comments is pretty much exactly what that option is designed for. Note that the regex pattern for a terminal is within that terminal, and uses fullmatch=True, so the ^...$ markers don't do anything.
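
For anyone reading along, a minimal sketch of that explicit= option (the COMMENT terminal name and the comment text are placeholders, and this assumes the grammar %declares COMMENT rather than defining it):

from hypothesis import strategies as st
from hypothesis.extra.lark import from_lark

# Supply the %declare'd COMMENT terminal by name, instead of letting
# from_lark try to derive it from a regex with ^...$ anchors.
strategy = from_lark(grammar, explicit={"COMMENT": st.just("-- a comment\n")})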

I don't think that a state machine is a good intermediate representation - there's way too much indirection and implementation complexity for subclassing to make a good API.

In the limit, you might have noticed that Hypothesis strategies are parser combinators? So writing out your grammar as strategies (using st.deferred() for forward references) is usually a pretty close match to the grammar, and gives you maximum control.
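
To make that concrete, here's a hand-rolled sketch for the forall grammar from earlier in this thread (illustrative only; st.deferred() mostly earns its keep once rules are mutually recursive):

from hypothesis import strategies as st

lower_identifier = st.from_regex(r"[a-z][a-z_]*", fullmatch=True)
padding = st.text(" ", min_size=1)
identifiers = st.lists(lower_identifier, min_size=1).map(" ".join)

# st.deferred() lets type_expr refer to rules defined further down,
# mirroring forward references in the grammar.
type_expr = st.deferred(
    lambda: st.tuples(st.just("forall"), padding, identifiers, st.just(". "), identifiers)
    .map("".join)
)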

@mvaled
Contributor

mvaled commented Nov 9, 2019

Note that the regex pattern for a terminal is within that terminal, and uses fullmatch=True, so the ^...$ markers don't do anything.

I'm not really sure I follow your line of thought in this regard. I know that without the ^ a program that starts with a comment fails to be recognized as a comment:

>           raise UnexpectedToken(token, expect, considered_rules = set(to_scan))
E           lark.exceptions.UnexpectedToken: Unexpected token Token(OPERATOR, '--') at line 1, column 1.
E           Expected one of: 
E           	* KEYWORD_CLASS
E           	* _NL
E           	* KEYWORD_INSTANCE
E           	* LOWER_IDENTIFIER
E           	* _LPAREN
E           	* KEYWORD_DATA

(I haven't tested the $.)

In the limit, you might have noticed that Hypothesis strategies are parser combinators?

I hadn't thought about it that way, but yes! They can be regarded as such.

How do you think we can proceed? Bottling all the magic within a call to from_lark seems to be hard to extend. For instance, I don't think I have a way to properly override the strategy to generate instances of application without changing the rule.

@Zac-HD
Member

Zac-HD commented Nov 10, 2019

Yeah, the bottled-magic problem is hard. Unfortunately I haven't yet found a way to factor it out and allow parts to be overridden without compromising the simple case.

My Python-generating strategy may be of interest? On the other hand I've basically hit the limits of grammar-based generation there, and I'm planning to generate and unparse a typed AST in the next round of work on it.

@mvaled
Contributor

mvaled commented Nov 11, 2019

Even though I would like to, I can't make any promises about working on this. If I can, I will let you know.

@git-afsantos

Hello!

Is there any progress regarding this issue?

I would like to add that ignored tokens are usually needed to actually generate valid (parseable) source.
(...)
The generated example doesn't parse because there's no space between the keyword forall and the identifier(s) generated -- I wouldn't know if hypothesis generated more than one identifier.

Notice that the keyword forall is enclosed by \b (is this a good practice?).

I am facing exactly the same problem - of Hypothesis generating strings without white space between tokens - and I would like to know if there is any approach that is easier than (essentially) replicating the grammar with custom strategies.

The discussion above suggests that there is no definitive answer.

@Zac-HD
Member

Zac-HD commented Aug 18, 2023

Nobody has yet volunteered to overhaul from_grammar() to improve handling of ignored tokens, and in the absence of anyone paying for that work it hasn't happened 😅

I think your options are to (1) change your grammar, (2) write a custom strategy, or (3) volunteer or pay someone to implement this issue. In the last case I'd be happy to help with design and code review but don't have time to do it myself at the moment.

@git-afsantos

Thanks for the reply, Zac!

How big of an undertaking do you think this is? Maybe I could lend a hand.
I do not know the specifics of what the desirable API would be, but it seems to me that an initial approach would:

  1. take a keyword argument in from_grammar to specify how to handle ignored tokens,
  2. use (selected) ignored tokens as separators between other (selected?) tokens of interest.
@cacheable
@defines_strategy(force_reusable_values=True)
def from_lark(
    grammar: lark.lark.Lark,
    *,
    start: Optional[str] = None,
    explicit: Optional[Dict[str, st.SearchStrategy[str]]] = None,
    ignored: Union[bool, str, Iterable[str]] = False,
) -> st.SearchStrategy[str]:
    # if `ignored` is True, use all ignored tokens as possible separators
    # if `ignored` is a string, use the specified token as a separator
    # if `ignored` is a list of strings, use any of the specified tokens as separators

It's relatively easy to imagine how white space would be handled, but a general-purpose solution for any kind of ignored token might be trickier.
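
For illustration, usage under that proposal might look something like this (hypothetical, since the ignored= argument doesn't exist in the current API):

from hypothesis.extra.lark import from_lark

# All hypothetical -- sketching the proposed API, not the current one:
s = from_lark(grammar, start="type_expr", ignored=True)         # any ignored token may separate terminals
s = from_lark(grammar, start="type_expr", ignored="PADDING")    # only PADDING is used as a separator
s = from_lark(grammar, start="type_expr", ignored=["PADDING"])  # any of the listed tokens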

@Zac-HD
Member

Zac-HD commented Aug 18, 2023

My API design goal here is to avoid adding anything to the API: we should be able to get this working without asking users for help about how to handle ignored tokens. I think the plan below is a nontrivial but feasible undertaking if you'd like to try opening a PR for it.


Currently, we decide whether to generate an ignored token following each terminal, and then simply join all of the leaves as our result. The problem in this issue is that sometimes an ignored token is required between two terminals!

Instead, we could unconditionally draw a flag and a non-empty ignored token each time in gen_ignore(); include the ignored token as an unmarked terminal if the flag is True, and retain the token for later if the flag is False. Finally, we'd replace the simple joining with a helper:

  1. Do the simple string-join, and check that it parses back into the same sequence of terminals as expected. If so, return it.
  2. Add every hopefully-omitted ignored token back in, join them, and confirm that this parses as expected; raise an error if not. Note that there might not be any flagged ignore tokens, either because there are none in the grammar or because we already chose to include them all whether necessary or not.
  3. Iterate over the hopefully-omitted ignored tokens, checking whether we get the expected parse with this token omitted. For example with "1a2b3c" - taking letters as ignored and "a" as necessary - we would try 12b3c (fail), then 1a23c (pass), then keep that omission to try 1a23 (pass). (Side note: we may need to special-case an initial ignore token?)
  4. We've now discovered a string which has the parse tree we expected with a minimum of undesired ignore tokens and can return it!

The performance characteristics here aren't ideal, but I think they'll be OK in practice and certainly much better than the strategy not working at all for complex grammars.
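
A rough sketch of that joining helper, assuming a parses_as_expected() callback that re-parses a candidate string and checks it yields the drawn terminal sequence (all names here are illustrative, not existing Hypothesis internals):

def join_with_minimal_ignores(terminals, optional_ignores, parses_as_expected):
    # terminals: the drawn terminal strings, in order.
    # optional_ignores: {gap index: ignored-token string drawn but flagged as omittable}.
    # parses_as_expected: re-parses a string and checks the terminal sequence.
    def build(included):
        parts = []
        for i, tok in enumerate(terminals):
            parts.append(tok)
            if i in included:
                parts.append(optional_ignores[i])
        return "".join(parts)

    # 1. Try the plain join, with no optional ignored tokens at all.
    candidate = build(set())
    if parses_as_expected(candidate):
        return candidate

    # 2. Include every retained ignored token; if even this fails, give up loudly.
    included = set(optional_ignores)
    if not parses_as_expected(build(included)):
        raise ValueError("joined string does not parse back into the drawn terminals")

    # 3. Greedily drop each ignored token that turns out not to be required.
    for i in sorted(optional_ignores):
        if parses_as_expected(build(included - {i})):
            included.discard(i)

    # 4. Return the string with a minimal set of undesired ignored tokens.
    return build(included)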
