CST Docs.
Relates to #215.
Shahar Soel authored and bd82 committed Mar 16, 2017
1 parent 26ee66d commit c472d39
Showing 3 changed files with 311 additions and 1 deletion.
307 changes: 307 additions & 0 deletions docs/concrete_syntax_tree.md
@@ -0,0 +1,307 @@
## Automatic Concrete Syntax Tree Creation
Chevrotain has the capability to **automatically** create a concrete syntax tree (CST)
during parsing. A CST is a simple structure which represents the entire parse tree.
It contains information on every token parsed.

The main advantage of using the automatic CST creation is that it enables writing "pure" grammars.
This means that the semantic actions are **not** embedded into the grammar implementation but are instead
completely separated from it.

This separation of concerns makes the grammar easier to maintain
and makes it easier to implement different capabilities on the grammar,
for example, separate logic for compilation and for IDE support.


### Differences between an AST and a CST.
There are two major differences.
1. An Abstract Syntax Tree would not normally contain all the syntactic information.
This means the **original** text could not be reconstructed from the AST.

2. An Abstract Syntax Tree would not represent the whole syntactic parse tree.
It would normally only contain nodes related to certain parse tree nodes, but not all of those (mostly leaf nodes).
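
The first difference can be illustrated with a small sketch that rebuilds the token text from a CST-shaped object. The sample data is hand-written, and the token fields `image` and `startOffset` are assumptions made for this example; the actual token objects may expose different properties.

```javascript
// Assumption: tokens look like { image, startOffset }; CstNodes look like { name, children }.
function isCstNode(element) {
    return element !== null && typeof element === "object" &&
        typeof element.name === "string" && typeof element.children === "object"
}

// Gathers every token found anywhere below the given node.
function collectTokens(node, result = []) {
    for (const key of Object.keys(node.children)) {
        for (const element of node.children[key]) {
            if (isCstNode(element)) {
                collectTokens(element, result)
            } else {
                result.push(element)
            }
        }
    }
    return result
}

// Orders the tokens by position and joins their text with single spaces.
function tokenText(cst) {
    return collectTokens(cst)
        .sort((a, b) => a.startOffset - b.startOffset)
        .map(tok => tok.image)
        .join(" ")
}

// Hand-written stand-in for the CST of "let x = 5".
const sampleCst = {
    name: "letStatement",
    children: {
        Let: [{ image: "let", startOffset: 0 }],
        Identifier: [{ image: "x", startOffset: 4 }],
        Equals: [{ image: "=", startOffset: 6 }],
        expression: [{
            name: "expression",
            children: { Integer: [{ image: "5", startOffset: 8 }] }
        }]
    }
}

const rebuilt = tokenText(sampleCst)
console.log(rebuilt) // prints "let x = 5"
```

A typical AST for the same input, for example `{ type: "let", name: "x", value: 5 }`, would have dropped the `let` and `=` tokens, so the same reconstruction would not be possible.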


### How to enable CST output?

In the future this capability will be enabled by default.
Currently this feature must be explicitly enabled by setting the **outputCst** flag
in the parser [configuration object](http://sap.github.io/chevrotain/documentation/0_23_0/interfaces/iparserconfig.html).

```JavaScript
class MyParser extends chevrotain.Parser {

    constructor(input) {
        super(input, allTokens, {outputCst : true})
    }
}
```

### The structure of the CST

The structure of the CST is very simple.

```TypeScript
export type CstElement = ISimpleTokenOrIToken | CstNode
export type CstChildrenDictionary = { [identifier:string]:CstElement[] }

export interface CstNode {
    readonly name:string

    readonly children:CstChildrenDictionary

    readonly recoveredNode?:boolean
}
```
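
Because every child is either a token or another `CstNode`, a single recursive function is enough to visit the whole tree. The sketch below renders a CST-shaped object as an indented outline. It assumes that tokens expose their text as `image` and that only `CstNode`s carry a `children` dictionary; the sample object is hand-written.

```javascript
// Returns one line per node or token, indented by depth.
function outline(node, depth = 0) {
    const pad = "  ".repeat(depth)
    const lines = [pad + node.name]
    for (const [key, elements] of Object.entries(node.children)) {
        for (const element of elements) {
            if (element && typeof element.children === "object") {
                // A nested CstNode: recurse one level deeper.
                lines.push(...outline(element, depth + 1))
            } else {
                // A token: show the dictionary key and the matched text.
                lines.push(`${pad}  ${key}: ${JSON.stringify(element.image)}`)
            }
        }
    }
    return lines
}

const outlineSample = {
    name: "qualifiedName",
    children: {
        Identifier: [{ image: "foo" }, { image: "bar" }],
        Dot: [{ image: "." }]
    }
}

const outlineLines = outline(outlineSample)
console.log(outlineLines.join("\n"))
```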

A single CstNode corresponds to the result of a single grammar rule invocation.

```JavaScript
$.RULE("qualifiedName", () => {

})

input = ""

output = {
    name: "qualifiedName",
    children: {}
}
```

Each Terminal will appear in the children dictionary using the terminal's name
as the key and an **array** of ISimpleTokenOrIToken as the value. These array items will be either
a Token instance or a Token structure, depending on the [Token type](docs/token_types.md) used.


```JavaScript
$.RULE("qualifiedName", () => {
    $.CONSUME(Identifier)
    $.CONSUME(Dot)
    $.CONSUME2(Identifier)
})

input = "foo.bar"

output = {
    name: "qualifiedName",
    children: {
        Dot : [dotToken1],
        Identifier : [identToken1, identToken2]
    }
}
```
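
Reading such a node is plain dictionary and array access. The sketch below joins the identifiers of a `qualifiedName` node back into a dotted name. The CST object is a hand-written stand-in, the helper `toQualifiedName` is hypothetical, and the `image` field on tokens is an assumption.

```javascript
// Hand-written stand-in for the CST of the input "foo.bar".
const qualifiedNameCst = {
    name: "qualifiedName",
    children: {
        Dot: [{ image: "." }],
        Identifier: [{ image: "foo" }, { image: "bar" }]
    }
}

// Hypothetical helper: both Identifier occurrences live in the same array,
// so mapping over it yields every part of the name.
function toQualifiedName(cstNode) {
    return cstNode.children.Identifier.map(tok => tok.image).join(".")
}

const fqn = toQualifiedName(qualifiedNameCst)
console.log(fqn) // prints "foo.bar"
```

Note that the dictionary groups tokens by name, so the relative order between different keys (for example `Dot` versus `Identifier`) is not encoded in the `children` object itself.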

Non-Terminals are handled similarly to Terminals, except that each item in the value's array
is the CstNode of the corresponding Non-Terminal.

```JavaScript
$.RULE("qualifiedName", () => {
    $.SUBRULE($.singleIdent)
})

$.RULE("singleIdent", () => {
    $.CONSUME(Identifier)
})

input = "foo"

output = {
    name: "qualifiedName",
    children: {
        singleIdent : [
            {
                name: "singleIdent",
                children: {
                    Identifier : [identToken1]
                }
            }
        ]
    }
}
```
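
Descending into a Non-Terminal is therefore a matter of indexing the array and then reading the nested `children`. The following sketch uses a hypothetical `firstChild` helper that tolerates missing or empty entries; the `image` field on tokens is an assumption.

```javascript
// Hand-written stand-in for the output shown above.
const nestedCst = {
    name: "qualifiedName",
    children: {
        singleIdent: [
            {
                name: "singleIdent",
                children: {
                    Identifier: [{ image: "foo" }]
                }
            }
        ]
    }
}

// Hypothetical helper: first element stored under a key, or undefined if there is none.
function firstChild(node, key) {
    const elements = node.children[key]
    return elements !== undefined && elements.length > 0 ? elements[0] : undefined
}

const singleIdentNode = firstChild(nestedCst, "singleIdent")
const identToken = singleIdentNode && firstChild(singleIdentNode, "Identifier")
const identImage = identToken ? identToken.image : undefined
console.log(identImage) // prints "foo"
```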

### In-Lined Rules

So far the CST structure is quite simple, but how would a more complex grammar be handled?
```JavaScript
$.RULE("statements", () => {
    $.OR([
        // let x = 5
        {ALT: () => {
            $.CONSUME(Let)
            $.CONSUME(Identifier)
            $.CONSUME(Equals)
            $.SUBRULE($.expression)
        }},
        // select age from employee where age = 120
        {ALT: () => {
            $.CONSUME(Select)
            $.CONSUME2(Identifier)
            $.CONSUME(From)
            $.CONSUME3(Identifier)
            $.CONSUME(Where)
            $.SUBRULE2($.expression)
        }}
    ])
})
```

Some of the Terminals and Non-Terminals are used in **both** alternatives.
It is possible to check for the existence of distinguishing terminals such as Let and Select,
but this is not a robust approach.

```javaScript
let cstResult = parser.statements()
if (cstResult.children.Let.length > 0) {
    // Let statement
    // do something...
}
else if (cstResult.children.Select.length > 0) {
    // Select statement
    // do something else.
}
```

Alternatively, it is possible to refactor the grammar in such a way that both alternatives
are completely wrapped in their own Non-Terminal rules.

```javascript
$.RULE("statements", () => {
    $.OR([
        {ALT: () => $.SUBRULE($.letStatement)},
        {ALT: () => $.SUBRULE($.selectStatement)}
    ])
})
```

This is the recommended approach in this case. Otherwise, as more alternatives are added, the grammar rule
will become too verbose to understand and maintain.

However, sometimes refactoring out rules is too much. This is where **in-lined** rules come to the rescue.

```JavaScript
$.RULE("statements", () => {
    $.OR([
        // let x = 5
        {
            NAME: "$letStatement",
            ALT: () => {
                $.CONSUME(Let)
                $.CONSUME(Identifier)
                $.CONSUME(Equals)
                $.SUBRULE($.expression)
            }},
        // select age from employee where age = 120
        {
            NAME: "$selectStatement",
            ALT: () => {
                $.CONSUME(Select)
                $.CONSUME2(Identifier)
                $.CONSUME(From)
                $.CONSUME3(Identifier)
                $.CONSUME(Where)
                $.SUBRULE2($.expression)
            }}
    ])
})

output = {
    name: "statements",
    children: {
        $letStatement : [/*...*/],
        $selectStatement : [/*...*/]
    }
}
```
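
With in-lined rules, the alternative that matched can be found by looking up the in-lined rule names instead of guessing from individual terminals. The sketch below dispatches on those names. It assumes that each in-lined rule appears as a nested node with its own `children` dictionary and that tokens expose `image`; the handler table and the `describeStatement` helper are hypothetical.

```javascript
// One handler per in-lined rule name (hypothetical handlers that build a short description).
const statementHandlers = {
    $letStatement: node => `let ${node.children.Identifier[0].image}`,
    $selectStatement: node =>
        `select ${node.children.Identifier[0].image} from ${node.children.Identifier[1].image}`
}

// Runs the handler of the first in-lined rule that has a result.
function describeStatement(statementsCst) {
    for (const [name, handler] of Object.entries(statementHandlers)) {
        const matches = statementsCst.children[name]
        if (matches !== undefined && matches.length > 0) {
            return handler(matches[0])
        }
    }
    return undefined
}

// Hand-written stand-in for the CST of "let x = 5" (expression child omitted for brevity).
const letCst = {
    name: "statements",
    children: {
        $letStatement: [{
            name: "$letStatement",
            children: {
                Let: [{ image: "let" }],
                Identifier: [{ image: "x" }],
                Equals: [{ image: "=" }]
            }
        }],
        $selectStatement: []
    }
}

const description = describeStatement(letCst)
console.log(description) // prints "let x"
```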

Providing a **NAME** property to the DSL methods will create an in-lined rule.
It is equivalent to extraction to a separate grammar rule with two differences:

* To avoid naming conflicts, the names of in-lined rules **must** start with a dollar ($) sign.
* In-lined rules do not possess error recovery (re-sync) capabilities as regular rules do.

Syntax Limitation:
* The **NAME** property of an in-lined rule must appear as the **first** property
of the **DSLMethodOpts** object.

```javascript
// GOOD
$.RULE("field", () => {
    $.OPTION({
        NAME:"$modifier",
        DEF: () => {
            $.CONSUME(Static)
        }
    })
})
// BAD - won't work.
$.RULE("field", () => {
    $.OPTION({
        DEF: () => {
            $.CONSUME(Static)
        },
        NAME:"$modifier"
    })
})
```


### CST And Error Recovery

CST output is also supported in combination with automatic error recovery.
This combination is actually stronger than regular error recovery because
even partially formed CstNodes will be present on the CST output.

For example, given this grammar and assuming the parser re-synced after a token mismatch at
the "Where" token:

```JavaScript
$.RULE("statements", () => {
    $.CONSUME(Select)
    $.CONSUME2(Identifier)
    $.CONSUME(From)
    $.CONSUME3(Identifier)
    $.CONSUME(Where)
    $.SUBRULE($.expression)
})
// Token mismatch due to the typo "wherrrre": parsing halts and re-syncs to an upper rule,
// so "wherrrre age > 25" is not parsed.
input = "select age from persons wherrrre age > 25"
output = {
    name: "statements",
    children: {
        Select: ["select"],
        Identifier: ["age", "persons"],
        From: ["from"],
        Where: [/*nothing here, due to parse error*/],
        expression: [/*nothing here, due to parse error*/],
    }
}
```

This access to **partial parsing results** means some post-parsing logic
may be able to perform further analysis, for example: offer auto-fix suggestions or provide better error messages.
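
As a rough illustration of such post-parsing logic, the sketch below inspects a partial CST and reports which expected children are missing. The `missingChildren` helper, the list of expected keys, and the assumption that the `recoveredNode` flag is set on this particular node are all hypothetical; the sample object is hand-written with tokens reduced to an `image` field.

```javascript
// Hypothetical helper: names of expected children that are absent or empty.
function missingChildren(node, expectedKeys) {
    return expectedKeys.filter(key => {
        const elements = node.children[key]
        return elements === undefined || elements.length === 0
    })
}

// Hand-written stand-in for the partial result shown above.
const partialCst = {
    name: "statements",
    recoveredNode: true, // assumption: the flag is set on the node that was re-synced
    children: {
        Select: [{ image: "select" }],
        Identifier: [{ image: "age" }, { image: "persons" }],
        From: [{ image: "from" }],
        Where: [],
        expression: []
    }
}

const missing = missingChildren(partialCst, ["Select", "From", "Where", "expression"])
const hint = partialCst.recoveredNode && missing.length > 0
    ? `Incomplete "${partialCst.name}": missing ${missing.join(", ")}`
    : undefined
console.log(hint) // prints: Incomplete "statements": missing Where, expression
```

A message built this way could, for instance, be attached to the position of the last successfully parsed token to suggest that a `where` clause was probably intended.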


### Performance of CST building.

Building the CST is a fairly intensive operation.
Benchmarking with a JSON grammar has shown that parsing with CST
building reaches 55-65% of the performance of parsing without any output structure.

* This is a bad benchmark as it compares apples to oranges: one scenario creates an output structure and
  the other does not.
  - A more representative benchmark will be provided in the future.

* Upcoming versions of Chrome (59) with the new V8 JS compilation pipeline enabled were faster (65%)
  than current versions of Chrome (56).

* Note that even when building a CST, the performance on the most recent versions of Chrome (59) was faster
  than that of any other tested parsing library (Antlr4/PegJS/Jison).
  - Again, we are unfortunately comparing apples to oranges, as most parsing libraries in that benchmark
    do not output any data structure.
2 changes: 1 addition & 1 deletion src/parse/cst/cst_public.ts
@@ -1,7 +1,7 @@
import {ISimpleTokenOrIToken} from "../../scan/tokens_public"

export type CstElement = ISimpleTokenOrIToken | CstNode
-export type CstChildrenDictionary = { [identifier:string]:CstElement | CstElement[] }
+export type CstChildrenDictionary = { [identifier:string]:CstElement[] }

/**
* A Concrete Syntax Tree Node.
3 changes: 3 additions & 0 deletions src/parse/parser_public.ts
@@ -258,6 +258,9 @@ export interface IParserState {
}

export interface DSLMethodOpts<T> {
/**
* in-lined method name
*/
NAME?:string

/**
