Selective auto-whitespace #64

sbelak · 2013-12-30T01:29:22Z

Added the option to auto-add whitespace only to select tags or to exclude certain tags from having whitespace added.
See tests for examples.

Engelberg · 2013-12-30T01:49:17Z

Looks like a useful addition.

Can you elaborate on whether inclusions/exclusions for whitespace rules "trickle down" to subrules?

For example,

S = A
A = 'a'+

If you say :only S, does that mean no whitespace is inserted (since the tokens actually occur in rule A), or the whitespace gets added because it occurs within S?

How about:

S = A B
A = C
B = C
C = 'ccc'+

If you :only A, would that suggest that the C you get to via A includes whitespace, but the version you get to via B does not?

sbelak · 2013-12-31T18:29:39Z

Hm, good question. I would say we want the most precise (and least surprising) meaning, so in the first case it should accept "aaa ", but not "a a a"; and in the second case it should accept "ccc ccc", but not "ccc ccc ". The latter case currently does not hold, as auto whitespace skips alias nodes. Will fix (provided you agree with the direction).

Oh, and a happy new year! :)

Engelberg · 2014-02-21T07:32:08Z

I've been thinking about this again.

So the way it works right now is that optional whitespace is added in front of all tokens (strings or regexes) and at the end of the start rule.

It does seem that allowing some sort of "trickling down" could be quite confusing, so one way to do this is to say that when you add :only and :except, you are saying whether optional whitespace is automatically added in front of every token, i.e., string or regex, in that rule.

But if you don't include the start rule, it's totally unclear what should happen with the space at the end. And if you only cover select rules, should those rules allow for spaces at the end of those productions?

Based on your answer above to my "second case", it sounds like you're advocating that optional whitespace also be inserted in front of every non-terminal as well. I'm somewhat concerned about how much that would balloon the grammar, but it's an option, if that's the most intuitive behavior.

I'd welcome another round of input from you on the matter.

sbelak · 2014-11-18T19:21:23Z

I think the clearest rule would be, if we treat autowhitespacing as a series of rewrite rules; that is, if there is well defined expansion the user could do on her own that would work the same. Furthermore to allow composability (my use-case for instance leans heavily on composability, overwriting and aliasing) these rewrite rules should be local to each rule.

As for ballooning, I think it's first of all an ideological question: is Instaparse first and foremost fast (part of infrastructure) or is it convenient (a hammer you reach for first). My current needs are right now in the latter camp and so is my bias.

danielytics · 2015-06-28T12:24:38Z

What about having some way to mark a () grouping as "don't insert auto whitespace here" and otherwise having auto-whitespace behave as it currently does?

Eg:

S = A B #(C D)

would insert whitespace before A and B but not before C and D. (assuming # is the character used to mark the grouping)

hlship · 2016-03-01T19:41:58Z

I suspect this would have helped us, as we liked using the whitespace, but our application (parsing a GraphQL query document) included literal text in quotes; the auto-whitespace removed that whitespace and we did not find a way to exclude auto-whitespace logic from the productions that were such literal text, so we reverted to a frequently used hidden ws element.

cjsauer · 2021-10-03T15:59:59Z

Here is a somewhat rough solution to this problem that I came up with for processing string literals. Remembering that an instaparse parser is just data, you can transform it prior to invocation to strip out the automatically injected whitespace rules. Given a grammar with a string literal rule:

string =
    <'"'> inner-string* <'"'> ;
<inner-string> = 
    #'[^"\\]'
    | ( <'\\'> escape-char )
    ;
escape-char = #'[trn\\"]' ;

Define the parser as usual, and then strip out the unwanted auto-whitespace rules:

(defparser my-cool-parser
  "resources/grammar.instaparse"
  :auto-whitespace :standard)

(def ^:private disable-auto-ws
  (memoize
    (fn [parser]
      (walk/postwalk
        (fn [form]
          (if (and (map-entry? form) (= :parsers (first form)))
            [:parsers
             (filter #(not= :whitespace (-> % :parser :keyword))
               (second form))]
            form))
        parser))))

(defn parse
  [text]
  (let [parser (update-in my-cool-parser [:grammar :inner-string] disable-auto-ws)]
    (parser text)))

By splitting string and inner-string rules, you can very accurately target whitespace disabling to the inside of the string literal, leaving all other automatic ws rules intact.

cjsauer · 2021-10-03T16:04:12Z

That said, I like @danielytics approach of wrapping a rule in an "atomic construct" like #( ... ) in his example. I have used other parser generators to great effect with this feature. For example, the pest parser generator in Rust uses this approach, and calls them "atomic rules".

Selective auto-whitespace

f019578

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Selective auto-whitespace #64

Selective auto-whitespace #64

sbelak commented Dec 30, 2013

Engelberg commented Dec 30, 2013

sbelak commented Dec 31, 2013

Engelberg commented Feb 21, 2014

sbelak commented Nov 18, 2014

danielytics commented Jun 28, 2015

hlship commented Mar 1, 2016

cjsauer commented Oct 3, 2021

cjsauer commented Oct 3, 2021 •

edited

Selective auto-whitespace #64

Are you sure you want to change the base?

Selective auto-whitespace #64

Conversation

sbelak commented Dec 30, 2013

Engelberg commented Dec 30, 2013

sbelak commented Dec 31, 2013

Engelberg commented Feb 21, 2014

sbelak commented Nov 18, 2014

danielytics commented Jun 28, 2015

hlship commented Mar 1, 2016

cjsauer commented Oct 3, 2021

cjsauer commented Oct 3, 2021 • edited

cjsauer commented Oct 3, 2021 •

edited