Handling Whitespace

roryokane edited this page Dec 3, 2012 · 9 revisions

One disadvantage of PEGs compared to lexer-based CFGs can be the handling of white space. In a traditional CFG-based parser with a separate lexer (scanner) phase, the lexer can simply skip all white space and generate tokens only for the actual parser to operate on. This frees the parser grammar from all white space treatment. Since PEGs do not have a lexer but operate directly on the raw input, they have to deal with white space in the grammar itself. Language designers with little experience in PEGs can sometimes be unsure of how best to handle white space in their grammar.
A common and highly recommended pattern is to always match white space immediately after a terminal (a single character or string) and nowhere else.
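
To make the pattern concrete, here is a minimal sketch of a hand-written PEG-style parser for sums such as `1 + 2 + 3`. It does not use parboiled; the class `SumParser` and all of its method names are made up for this illustration. Every terminal consumes its own trailing white space, and no other rule mentions white space. The one exception assumed here is a single skip at the very start of the input, since no terminal precedes it.

```java
// Illustrative only: a hand-rolled PEG-style parser, not parboiled code.
final class SumParser {
    private final String input;
    private int pos = 0;

    SumParser(String input) { this.input = input; }

    // WhiteSpace <- [ \t\r\n]*
    private void whiteSpace() {
        while (pos < input.length() && " \t\r\n".indexOf(input.charAt(pos)) >= 0) pos++;
    }

    // Terminal: match the literal, then immediately consume trailing white space.
    private boolean terminal(String literal) {
        if (!input.startsWith(literal, pos)) return false;
        pos += literal.length();
        whiteSpace();
        return true;
    }

    // Number <- [0-9]+ WhiteSpace
    private Integer number() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
        if (pos == start) return null;
        int value = Integer.parseInt(input.substring(start, pos));
        whiteSpace();
        return value;
    }

    // Sum <- Number ('+' Number)*   (no white space handling needed here)
    private Integer sum() {
        Integer total = number();
        if (total == null) return null;
        while (true) {
            int save = pos;
            if (!terminal("+")) break;
            Integer next = number();
            if (next == null) { pos = save; break; } // backtrack past the '+'
            total += next;
        }
        return total;
    }

    // Input <- WhiteSpace Sum EOI; returns null when the input does not match.
    static Integer parse(String text) {
        SumParser p = new SumParser(text);
        p.whiteSpace(); // leading white space is skipped once, up front
        Integer result = p.sum();
        return (result != null && p.pos == text.length()) ? result : null;
    }
}
```

Because `sum()` never touches white space itself, the grammar rule stays as short as it would be with a separate lexer.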

With parboiled you can take this rule one step further and factor out most whitespace handling into a single helper method. One way to do this is shown in the CalculatorParser3 example for parboiled for Java and the JSON Parser example for parboiled for Scala.
The technique is to override the default String-to-Rule conversion method and inject custom logic. The two examples listed above define a special rule building construct for string literals ending with a blank. These literals are wrapped in a sequence rule that automatically matches all trailing whitespace after the string (or character).
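
The following is a dependency-free sketch of that idea, not the code from the examples mentioned above. The classes `MiniParser` and `WhitespaceAwareParser`, the `Rule` interface, and the `pair()` rule are hypothetical stand-ins for a parser base class with a single string-to-rule conversion hook; the real parboiled API may differ in names and details.

```java
// Illustrative only: a tiny stand-in for a parser base class, not parboiled itself.
class MiniParser {
    // A rule returns the position after the match, or -1 on failure.
    interface Rule { int match(String input, int pos); }

    Rule string(String s) {
        return (input, pos) -> input.startsWith(s, pos) ? pos + s.length() : -1;
    }

    Rule whiteSpace() {
        return (input, pos) -> {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
            return pos;
        };
    }

    // Accepts rules and plain strings; strings go through toRule().
    Rule sequence(Object... parts) {
        Rule[] rules = new Rule[parts.length];
        for (int i = 0; i < parts.length; i++) rules[i] = toRule(parts[i]);
        return (input, pos) -> {
            for (Rule r : rules) {
                pos = r.match(input, pos);
                if (pos < 0) return -1;
            }
            return pos;
        };
    }

    Rule toRule(Object o) {
        if (o instanceof Rule) return (Rule) o;
        if (o instanceof String) return fromStringLiteral((String) o);
        throw new IllegalArgumentException("Cannot convert to a rule: " + o);
    }

    // Default conversion: a literal matches exactly its own characters.
    protected Rule fromStringLiteral(String literal) { return string(literal); }
}

class WhitespaceAwareParser extends MiniParser {
    // A literal ending with a blank matches the text before the blank,
    // followed by any amount of trailing white space.
    @Override
    protected Rule fromStringLiteral(String literal) {
        if (literal.endsWith(" ")) {
            String text = literal.substring(0, literal.length() - 1);
            return sequence(string(text), whiteSpace());
        }
        return string(literal);
    }

    // Pair <- '[' 'a' ',' 'b' ']'  with white space allowed after every token but the last
    Rule pair() { return sequence("[ ", "a ", ", ", "b ", "]"); }
}
```

In this sketch, `pair()` contains no explicit white space rule at all; the trailing blank in each literal is the only marker the grammar author has to write.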

The result is that wherever you use string literals ending with a blank in your grammar, any trailing white space will automatically be consumed as well. This can make your grammar rules much more compact, more readable, and therefore more maintainable.
However, there are a few things to remember when you use this solution:

  1. The input text matched by rules containing “whitespace-enabled” string literals will now also include an unknown amount of trailing white space, which can in some cases throw off parser action methods that expect otherwise.
  2. CharRange rules and AnyOf (String) rules are not affected by this solution, i.e. for them you still have to “manually” take care of matching trailing white space.
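
Both caveats can be seen in a small dependency-free sketch. The class `MatchedTextDemo` and its methods are hypothetical and only mimic what a whitespace-enabled literal and a character-range rule would do; they are not parboiled calls.

```java
// Illustrative only: plain Java helpers that mimic the two caveats above.
final class MatchedTextDemo {
    static int skipWhiteSpace(String input, int pos) {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        return pos;
    }

    // Mimics a whitespace-enabled literal: returns the end position
    // after the literal and its trailing white space, or -1 on failure.
    static int literalWithTrailingWhiteSpace(String input, int pos, String literal) {
        if (!input.startsWith(literal, pos)) return -1;
        return skipWhiteSpace(input, pos + literal.length());
    }

    // Caveat 1: an action should not assume the matched text is exactly the literal.
    static boolean booleanAction(String matchedText) {
        return Boolean.parseBoolean(matchedText.trim());
    }

    // Caveat 2: a character-range rule such as [0-9]+ does not pass through the
    // string literal conversion, so trailing white space is matched explicitly.
    static int digits(String input, int pos) {
        int start = pos;
        while (pos < input.length() && input.charAt(pos) >= '0' && input.charAt(pos) <= '9') pos++;
        return pos > start ? skipWhiteSpace(input, pos) : -1;
    }
}
```

For the input `"true  ,"`, the literal `true` matches up to position 6, so the matched text is `"true  "`. Passing that text directly to `Boolean.parseBoolean` yields `false`, while the trimming action yields `true`.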