1. Introduction
1.1. Technology
1.2. Installation & Use
1.3. Features
1.4. Tests
1.5. Contribute
1.6. Credits & License
2. PEG
2.1. Example
3. Files
4. References
5. User Manual
5.1. Example usage:
5.2. Creating PeglingJS Parsers
6. Technical Manual
6.1.1. `Pegling.mkXpiler`
6.1.2. A Generated Parser
6.1.3. `Pegling.prime`
6.1.4. `rdi` (internal)
6.1.5. `Pegling.pegRules`, `Pegling.mkPhase1`
6.2. Rules
6.3. Instructions
PeglingJS is no longer under development. Imho PEG's require too much effort to get right given small changed in the langues. Currently I use an ad-hoc parser.
PeglingJS is a parser-and-compiler generator based on Parsing Expression Grammars (PEG's [1]). PEG's offer an efficient way of defining languages and parsers.
PeglingJS is incredibly compact (7Kb source, 4Kb minified). PeglingJS is created in the context of the development of another parser generator based on [2], and as such several disadvantages of PeglingJS are accepted for the moment:
- PeglingJS uses recursive descent parsers, which are notoriously inefficient and difficult to debug
- PeglingJS parsers aren't particularly fast (400 KB, 13.000 lines of JSON parse in 13 minutes), but they are effective and efficient for small inputs.
- PeglingJS parsers are recursive and therefore limited by the JavaScript stack. In practice this isn't a limitation. Stack overflow usually means infinite recuresion.
- PeglingJS offers various logging/tracing capabilities, but:
- parser error-reporting is limited
- grammar debugging is difficult
PeglingJS is written entirely in plain JavaScript. It uses several ES6 features.
- Download or clone into the repository
- Include the local copy
<script src = "pegling.min.js" type = "text/javascript"/></script>
or use the JSDelivr CDN:
<script src="https://cdn.jsdelivr.net/gh/BabelfishNL/PeglingJS/src/pegling.min.js"></script> - This creates the global object
Pegling. See section 'Technical Manual' for further instructions
PeglingJS creates parsers/compilers (implemented in JavaScript) for languages defined with Parsing Expression Grammars. A developer can attribute the grammar with code to create abstract syntax trees, other datastructures, or direct interpretation. Strictly speaking 'parsing' refers to language recognition; building an abstract syntax tree or others data structure is 'compilation', whereas direct computation based on the input would be 'interpretation'. PeglingJS parses grammars and produces a set of rules which encode a parser for the given language. As such, PeglingJS is a compiler compiler.
Directory tst includes a HTML file which imports two JavaScript files which run tests (and produce output in the browser console.
- Fork it (https://github.com/yourname/yourproject/fork)
- Create your feature branch (
git checkout -b feature/fooBar) - Commit your changes (
git commit -am 'Add some fooBar') - Push to the branch (
git push origin feature/fooBar) - Create a new Pull Request
(c) 2019, Pum Walters, Babelfish
Pum Walters – @PumWalters – pum@babelfish.nl
Released under MIT license
PeglingJS uses the original PEG operators described in [1] with a few extensions (see extensions, below). The operators are:
- (juxtaposition): concatenation
- /: choice
- ** * **: iterate zero or more
- ** + **: iterate one or more
- ?: optional
- &: positive lookahead
- !: negative lookahead
- .: any character
- [...]: character class
- '...': literal text
A PEG defines a language by defining a function which accepts texts in that language and rejects other texts. In addition to this, texts are usually processed further, based on the structure as defined by the PEG. But the parse-tree is often not directly suitable to process. Instead one needs
- an abstract syntax tree (for instance to be processed by a compiler or other component)
- some other data structure: further compilation
- direct interpretation
PeglingJS accepts Javascript code in PEG rules, which describe the tranformation from the parse-tree to something else. In particular, PeglingJS PEG rules may look like
{{ ... code ... }}( ... PEG ... )where the code defines a function which will be applied to the parse-tree resulting from parsing the PEG.
Any PEG alternative can be preceded by '#'. In combination with the 'mylogs' directive this results in a log being printed precisely for the recognition of the elements in that specific rule/part. Use '#' in similar place as '!' and '&'; and #( PEG ) is always an appropriate sub-expression.
Any PEG alternative can be preceded by '^'. The expression '^PEG' succeeds only if parsing PEG succeeds and if at least one character was parsed to do so. Use this to prevent infinite recursion. Use '^' in similar place as '!' and '&'; and ^( PEG ) is always an appropriate sub-expression.
Below is the grammar of PEG's written as a PEG (taken from [3]). That is, this grammar defines the language it is written in.
grammar <- non-terminal '<-' sp pattern+
pattern <- alternative ('/' sp alternative)*
alternative <- ([!&#^]? sp suffix)+
suffix <- primary ([*+?] sp)*
primary <- '(' sp pattern sp ')' sp /
'.' sp /
literal /
charclass /
code '(' sp pattern ')' sp /
non-terminal !'<-'
literal <- ['] (!['] .)* ['] sp
charclass <- '[' ']'? (!']' (. '-' . / .))* ']' sp
non-terminal <- [a-zA-Z]+ sp
code <- '{{' (!'}}' .)* '}}' sp
sp <- [ \t\n]*
Notes:
- added
']'?in charclass to allow inclusion of](as first character) - added
#and^in 'alternative' for logging and infinite recurion prevention
It is quite possible to use this PEG directly to define the PEG language and generate a parser. However, without intervention, that parser will generate parse-trees which contain superfluous (for further processing) details. For example, the parse-tree of the whitespace between the fifth and sixth lines of this specification, is (in JSON) ["#*", [["#cc"," "], ["#cc","\n"], ["#cc","\t"], ["#cc","\t"], ["#cc","\t"], ["#cc","\t"]]] but contains no relevant detail other than that it concerns whitespace.
In order to create abstract syntax trees or other data structures from parse-trees, code snippets can be inserted in the PEG rules. For example, the rule for sp looks like:
sp <- {{ast=>[]}} ([ \t\n]*)
↑ PEG for whitespace ↑ function which transforms any whitespace parse-tree into an empty result
In this manner, PEG rules with appropriate functions, in PeglingJS, transform parse-trees on the fly to other data or actions.
File test1.js in this distribution contains the specification of PEG's which, when parsed by PeglingJS, will result in the same rules that PeglingJS uses to parse PEG's. By inference, applied to any PEG, those rules produce a parser/compiler for that language.
This distribution contains the following files:
README.md: this documentation- src/
pegling.js: the javascript source for the parser generator (PEG parse rules, 50 lines; recursive descent engine, 120 lines, tracing/logging, 20 lines; total, 4KB, 230 lines);pegling.min.js: minified version, made byminify pegling.js > pegling.min.js
- tst/
test1.js: first test file: self-compilation and some loggingtest2.js: JSON PEG and reader (30 lines)main.htmlHTML page for testingrandom.json: large JSON file for testingrandom.short.json: smaller test file
- Bryan Ford. Parsing expression grammars: A recognition-based syntactic foundation. In 31st Symposium on Principles of Programming Languages (POPL’04). ACM, January 2004.
- Medeiros, Sérgio & Ierusalimschy, Roberto. A parsing machine for PEGs. DLS'08: Proceedings of the 2008 Symposium on Dynamic Languages. (2008).
- Roberto Ierusalimschy. A Text Pattern-Matching Tool based on Parsing Expression Grammars. In Software: Practice and Experience. John Willey and Sons. (2008)
PeglingJS exposes an object with four entries:
pegRules, the rules which define the PEG grammar and parsersprime, the auxiliary function which given rules returns a parser (not normally used)mkPhase1, the auxiliary function which returns the PEG parser (not normally used)mkXpiler, the function which, given a annotated PEG grammar, returns a parser/compiler for that grammar.
let txt = ... ,
myPEG = ... ,
myparser = Pegling.mkXpiler(myPEG),
ast = myparser(txt,'rootNonterminal'),
myloggingparser = Pegling.mkXpiler(myPEG,['non-terminals','myfuns']),
more = myloggingparser(txt,'rootNonterminal')
This will return a parser for the given PEG and will apply that parser to the given input. Then it will create another parser which logs certain events (to the Javascript console), namely reduction for entire rules (non-terminals) and for the functions that are included in the PEG.
Creating PEG grammars is outside the scope of this document.
In this section we describe how to create a JSON parser which, given a JSON text, will produce the Javascript object described in that text (that is, this implementation is a parser/interpreter). This is contained in the file test2.js. The JSON PEG without code is
element <- wh value wh
value <- object / array / string / number / 'true' / 'false' / 'null'
object <- '{' ( wh member (wh ',' wh member)* )? wh '}' wh
member <- string wh ':' wh value
array <- '[' ( wh value (wh ',' wh value)* )? wh ']' wh
string <- '"' character* '"'
character <- '\\' (["\\/bfnrt] /
'u' [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) / !'"' .
number <- integer fraction? exponent?
integer <- '-'? ('0' / [1-9] [0-9]*)
fraction <- '.' [0-9]+
exponent <- [eE] [-+]? [0-9]+
wh <- [ \t\n\r]*
The process to annotate this with appropriate code snippets can be done in two ways:
- with knowledge of the parser generator, to predict the structure of the ensuing parse-trees;
- interactively, using a debugger.
We describe the second way.
Every PEG expression will lead to a structure (parse-tree) which can be transformed by applying a function to it. The interactive method is as follows:
- Whitespace is easiest because any parse-tree can be dropped. We add a function which returns the empty list for any input:
wh <- {{x=>[]}}([ \t\n\r]*) - Now, apply a named dummy function
f = x => xto some expression. E.g.,element <- wh value whbecomeselement <- {{f}}(wh value wh). - In a debugger place a breakpoint in that function and run the parser
- When it breaks, inspect the value of
x. In the example the parse-tree will have three sub-trees, two of which represent the (now cleared) whitespace. - Construct the desired structure or perform the actions that are needed. In our example we return the middle value, whatever non-terminal 'value' produced.
- Obviously, if the expression has variants, all these variants must be expected and must be dealt with in this function. For instance, a JSON number is an integer, optionally followed by a fraction, optionally followed by an exponent (e.g. -1.22e+2 is a legal JSON number). The PEG rule is
number <- {{number}}(integer fraction? exponent?)
and the function number must determine the variants:
number = x => function(x) {
let hd = x[1][0], // the int part
tl = (typeof x[1][1]==='string'?'.'+x[1][1]:'')+ // the decimal part
(typeof x[1][2]==='string'?x[1][2]:''); // the scietific exponent
return tl!=='' ? ['flt',Number(hd+tl)]:['int',Number(hd)];
}
- Repeat this process for every non-terminal.
- Often, the expression leads to lists of values of indeterminate length. If
lstis such a list, the following pattern can be used to process those values:lst.reduce((a,v)=>exp1,exp2)whereexp2is an initial accumulator (e.g.0or''or[]) andexp1computes the desired element value from parse-tree element v. - Often, the expression is complex and is easier processed by handling subexpressions first. The constants
'true','false'and'null'produce similar parse-trees, which differ however from the values returned byobject,array,stringandnumber. It is therefore convenient to split off a subexpression{{constant}}('true' / 'false' / 'null')
File test2.js contains the final result of applying these steps.
- Log events: these log events can be used:
attempts: every engine attemptnonterminals: attempt and reduction of every entire ruleengine: every engine reductionmylogs: every ` of the pseudo-instruction 'log'myfuns: every reduction of a code snippet
- arguments:
- An annotated PEG
- A list of log events to be applied during parser generation. This is relevant to debug grammars.
- A list of log events to be applied during parser application. This is relevant to debug compilers.
- returns:
- a parser
- arguments:
- Input text
- Expected root non-terminal
- returns:
- Whatever the code-snippets created
- arguments:
- Parse rules
- A list of log events
- returns:
- a parser
- arguments:
inp: the input textidx: the character index to start parsingpat: the pattern (non-terminal) to expect
- returns: [idxn,ast] where
idxn: is the character index after acceptingpatininp, or is -1 if the parse failedast: is an abstract syntax tree for the parsed non-terminal (or a data structure representing where parsing failed eventually ifidxnis -1
Auxiliary functions which form the two halves of Pegling.mkXpiler:
Pegling.pegRulesare the rules for the PEG grammarPegling.mkPhase1accepts a list of log events and returns a parser for PeglingJS PEG specifications
The rdi rules have the following format:
- The top-level is a hash with one field for every non-terminal (i.e. the set of non-terminals / rules), e.g.
{ grammar: ... PEG for non-terminal 'grammar' ...,
pattern: ... PEG for non-terminal 'pattern' ...,
... } - Other values are
- Lists containing other values. Common format: [ keyword, arguments... ]
- Strings, representing keywords and string data
- JavaScript code
Keywords coincide with the common PEG [1] operators plus extensions (as mentioned)
PEG's resolve a text into its structure, accepting it, or rejecting it if the text is not an instance of the PEG. In order to further process the text, that structure must be represented. It is straightforward to generate a parse-tree, which follows the breakdown in PEG subexpressions, but more appropriate is an abstract syntax tree. PeglingJS offers the ability to attribute PEG's with functions that transform the parse-tree (on the fly) to an abstract syntax tree.
['var', 'name']
parse the PEG for non-terminalname['cat', ...]
parse the PEG's folowingcat['/', ...]
attempt to parse the PEG's folowing/till the first successful parse['+', ...]
parse one or more instances of the PEG folowing+['*', ...]
parse zero or more instances of the PEG folowing*['?', ...]
if an instance of the PEG after?exists, accept it['&', ...]
if no instance of the PEG after&exists, fail['^', ...]
if an instance of the PEG after?exists which uses at least one character from the source, accept it['#', ...]
copy whatever the instace after '#' returned; log the transition if the 'myfuns' directive is active['!', ...]
if no instance of the PEG after!exists, fail['cc', ...]
if an instance of the charclass exists, accept it. A charclass consits of one or more elements of a single character (e.g., 'a') or a range of characters (e.x. 'A-Z')['.', ...]
accept a single character['span', ...]
if the string afterspanexists, accept it[ ...function..., ...]Defines a function which transforms the parse-tree for the PEG after the code into an abstract syntax tree.['log', ...]
The pseudo-keywordlogaccepts everything the PEG does, and is used for debugging.
For example
[ast=>logr(ast[0].join('')),
['cat',['+',['cc','a-zA-Z']],['var','sp']]]
joins the individual characters into a single string