This repository has been archived by the owner on Sep 22, 2021. It is now read-only.

Monarch Tokenizer Syntax

This document is a mostly dry, precise description of all the constructs you can use in a tokenizer. It is meant to serve as a reference. The list is ordered from the 'largest' objects to the 'smallest', beginning with the root and states and ending with parser directives.

Tokenization

First, however, we will describe how Monarch actually interacts with a document, as that is essential to understanding how to program the tokenizer.

Fundamentally, Monarch is a line-based tokenizer. What this means is that your document will first be broken into lines, and then for every line Monarch will output a series of contiguous tokens that describe the syntax highlighting for that line.

When Monarch attempts to match regular expressions against the document, it only does so per-line. A regular expression will not be able to 'see' prior to the start of the line nor past the end of it. Any rules that are written for Monarch need to be written with these limitations in mind.
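The per-line restriction can be illustrated with a small sketch. This is not cm6-monarch's code: `matchPerLine` is a hypothetical helper that only mimics the "split into lines, then match within each line" behaviour described above, and it assumes the regex has no `g` flag.

```typescript
// Hypothetical helper, not part of cm6-monarch: reports which lines a
// (non-global) regex matches when each line is tested in isolation.
function matchPerLine(text: string, regex: RegExp): boolean[] {
  // The document is broken into lines first...
  const lines = text.split('\n')
  // ...and the regex only ever sees one line at a time.
  return lines.map(line => regex.test(line))
}

const sampleDoc = '/* a comment\nthat spans lines */'

// A rule that tries to match the whole comment at once never fires,
// because no single line contains both the opener and the closer.
const wholeComment = matchPerLine(sampleDoc, /\/\*[\s\S]*\*\//)

// Matching the opener and closer separately works within the limitation;
// a tokenizer would track the text in between with a state.
const openerHits = matchPerLine(sampleDoc, /\/\*/)
const closerHits = matchPerLine(sampleDoc, /\*\//)
```

This is why multi-line constructs such as block comments are normally written as an opening rule that pushes a state and a closing rule that pops it.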

Root

The root of the language object, where the tokenizer is defined in the tokenizer property, can host several other properties. A special property type called an attribute can be used within the tokenizer - see regex within the description of the rules objects.

The special properties, other than tokenizer, follow:

ignoreCase Defaults to false. If true, the language will automatically compile regular expressions to be case-insensitive.
defaultToken Defaults to ''. If the tokenizer cannot match some text to any rule, it will move forward a single character and assign that character the value of the defaultToken property.
start Defaults to 'root'. Sets the start state of the tokenizer.
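The fallback behaviour of defaultToken can be sketched as follows. This is a simplified illustration, not cm6-monarch's implementation: `tokenizeLine`, `SimpleRule`, and `SimpleToken` are hypothetical names, and states, actions, and ignoreCase are left out entirely.

```typescript
// Hypothetical, simplified rule shape: a regex and a token name.
type SimpleRule = [RegExp, string]
interface SimpleToken { text: string; token: string }

function tokenizeLine(line: string, rules: SimpleRule[], defaultToken = ''): SimpleToken[] {
  const out: SimpleToken[] = []
  let pos = 0
  while (pos < line.length) {
    const rest = line.slice(pos)
    let matched = false
    for (const [regex, token] of rules) {
      // Anchor the rule so it can only match at the current position.
      const m = new RegExp('^(?:' + regex.source + ')').exec(rest)
      // Empty matches are skipped so the loop always makes progress.
      if (m && m[0].length > 0) {
        out.push({ text: m[0], token })
        pos += m[0].length
        matched = true
        break
      }
    }
    if (!matched) {
      // No rule matched: advance one character and give it the defaultToken.
      out.push({ text: rest.charAt(0), token: defaultToken })
      pos += 1
    }
  }
  return out
}

// 'a' and 'b' match no rule, so each becomes its own 'invalid' token.
const fallbackTokens = tokenizeLine('ab1', [[/\d+/, 'number']], 'invalid')
```

Setting defaultToken to something conspicuous like 'invalid' while developing a grammar makes unmatched text easy to spot.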

States

States are named lists of rules that can be switched to and from as a document is tokenized. The currently active state is determined by the stack, which can be manipulated through the rules present in the currently active state.

type State = Rule[]

const tokenizer = {
  foo: [
    ...rules
  ],
  block: [
    ...rules
  ]
}

The start state, if not specified, is always 'root'. The start property of the language will change this default state.

Usually, a state will have a simple name, e.g. block. However, states can have sub-states. They take the following form:

const tokenizer = {
  'state.substate1.substate2.etc': [ ...rules ]
}

Sub-states do not have to be literally present within the tokenizer to be useful. If a state isn't found by Monarch, its parent will be searched by progressively decomposing sub-states from the name. As an example, if the current state of the stack was comment.foo, and the tokenizer had no such state as comment.foo, but did have comment, it would treat comment as the active state.
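The lookup described above, progressively stripping sub-states until a defined state is found, might look something like the sketch below. `resolveState` and `StateTable` are hypothetical names used for illustration; the real lookup in cm6-monarch may differ in its details.

```typescript
type StateTable = { [stateName: string]: unknown[] }

// Hypothetical helper: finds the state that would actually be used for a
// (possibly dotted) state name by dropping trailing sub-states one at a time.
function resolveState(tokenizer: StateTable, stateName: string): string | null {
  let candidate = stateName
  while (true) {
    if (Object.prototype.hasOwnProperty.call(tokenizer, candidate)) return candidate
    const lastDot = candidate.lastIndexOf('.')
    if (lastDot === -1) return null // nothing left to strip
    candidate = candidate.slice(0, lastDot)
  }
}

const exampleTokenizer: StateTable = { root: [], comment: [] }

// 'comment.foo' is not defined, so its parent 'comment' is used instead.
const resolvedParent = resolveState(exampleTokenizer, 'comment.foo')
// A name with no defined ancestor resolves to nothing.
const resolvedMissing = resolveState(exampleTokenizer, 'string.double')
```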

Sub-states are useful because they store information about how a state was reached within their names. These names can be parsed by actions in order to affect how the tokenizer progresses through the document.

Rules

Rules instruct the tokenizer what to match, how to 'tokenize' what it has matched, and how the tokenizer should progress past the match.

type Rule = [regex: RegExp, action: Action, next?: string]

They can be written in three ways:

let rule1 = [regex, action]
let rule2 = [regex, action, next]
let rule3 = { regex: regex, action: action }

The first two forms are simply terse alternatives for the last form, which is what the tokenizer actually uses when parsing.
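As a rough sketch of that equivalence, the terse forms could be normalized into the object form like this. `normalizeRule` and the type names are hypothetical, only string and plain-object actions are handled, and folding `next` into the action object is an assumption borrowed from how Monaco's Monarch documents its shorthand.

```typescript
type ActionObject = { token?: string; next?: string }
type ActionLike = string | ActionObject
type TerseRule = [RegExp, ActionLike, string?]
type FullRule = { regex: RegExp; action: ActionObject }

// Hypothetical helper: converts [regex, action] and [regex, action, next]
// into the { regex, action } object form.
function normalizeRule(rule: TerseRule | FullRule): FullRule {
  if (!Array.isArray(rule)) return rule // already in object form
  const regex = rule[0]
  const rawAction = rule[1]
  const next = rule[2]
  // A bare string is short-hand for { token: string }.
  const action: ActionObject =
    typeof rawAction === 'string' ? { token: rawAction } : { ...rawAction }
  // Assumption: the third element is folded into the action's next property.
  if (next !== undefined) action.next = next
  return { regex, action }
}

const normalizedRule = normalizeRule([/\/\*/, 'comment', '@comment'])
```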

A special type of rule is an include directive. It is a compile-time-only object that tells the compiler to duplicate the specified state's rules into the state where the directive resides. These are usually used for the sake of tidiness, organization, and 'don't-repeat-yourself'.

const rule = { include: 'foo' }
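To make the "duplicate the rules in place" behaviour concrete, here is a minimal sketch of include expansion. `expandIncludes` and the type names are hypothetical, rules are reduced to `[regex, token]` pairs, and the tolerance for a leading '@' and the circular-include check are assumptions of this sketch rather than documented behaviour.

```typescript
type IncludeRule = { include: string }
type PlainRule = [RegExp, string]
type RawTokenizer = { [state: string]: Array<PlainRule | IncludeRule> }

// Hypothetical helper: returns the rules of a state with every include
// directive replaced by the rules of the state it names.
function expandIncludes(tokenizer: RawTokenizer, stateName: string, seen: string[] = []): PlainRule[] {
  if (seen.indexOf(stateName) !== -1) throw new Error('circular include: ' + stateName)
  const state = tokenizer[stateName]
  if (!state) throw new Error('unknown state: ' + stateName)
  const out: PlainRule[] = []
  for (const rule of state) {
    if (Array.isArray(rule)) {
      out.push(rule)
    } else {
      // Assumption: both 'foo' and '@foo' are accepted as the target name.
      const target = rule.include.charAt(0) === '@' ? rule.include.slice(1) : rule.include
      for (const inner of expandIncludes(tokenizer, target, seen.concat(stateName))) {
        out.push(inner)
      }
    }
  }
  return out
}

const rawTokenizer: RawTokenizer = {
  root: [{ include: 'whitespace' }, [/\w+/, 'identifier']],
  whitespace: [[/\s+/, 'white']]
}

// The whitespace rule is copied to where the directive was, ahead of the identifier rule.
const flatRoot = expandIncludes(rawTokenizer, 'root')
```

Because the included rules land at the position of the directive, the placement of an include within a state affects rule priority.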

Rules contain only two types of objects, regex and actions.

regex

Monarch uses regular expressions to match against the document. In contrast to the original Monaco implementation of Monarch, cm6-monarch supports practically all of JavaScript's regex functionality, including lookahead and lookbehind.

const rule = [/(?<=\s)\w+/, 'scope']

Monarch provides an 'attribute' syntax for regex. A Monarch tokenizer is defined within the tokenizer property of the language, but attributes are special constants given as properties alongside the tokenizer property.

type Attribute = RegExp | string | string[]

const lang = {
  attribute1: 'foo',
  attribute2: /[(){}]/,
  attribute3: ['foo', 'bar'],
  // examples
  control: /[\\`@~*=^$_[\]{}()#+\-.!/]/,
  keywordsAsync: ['async', 'await'],

  tokenizer: {
    ...states
  }
}

Attributes can be referenced with the special @ character in regex.

// matches 'async', 'await'
const rule1 = [/@keywordsAsync/, '']

// matches '\w+' unless the next char is in the `@control` attribute.
const rule2 = [/\w+(?!@control)/, '']

Attributes are simply inserted directly where they are found inside of a regex.
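One way to picture this insertion is the sketch below. `expandAttributes` is a hypothetical helper, not cm6-monarch's compiler. Strings and regex sources are inserted verbatim; treating an array attribute as an escaped alternation group is an assumption made here so that `/@keywordsAsync/` behaves as the example above describes. Only the case-insensitivity flag is carried over.

```typescript
type AttributeValue = RegExp | string | string[]
type AttributeTable = { [attrName: string]: AttributeValue }

// Escapes characters that have a special meaning in a regex.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical helper: replaces every @name in the regex source with the
// corresponding attribute, if one exists.
function expandAttributes(regex: RegExp, attributes: AttributeTable): RegExp {
  const source = regex.source.replace(/@(\w+)/g, (whole: string, attrName: string) => {
    const value = attributes[attrName]
    if (value === undefined) return whole // not a known attribute; leave untouched
    if (typeof value === 'string') return value
    if (value instanceof RegExp) return value.source
    // Assumption: arrays become a non-capturing alternation of literal strings.
    return '(?:' + value.map(escapeRegExp).join('|') + ')'
  })
  return new RegExp(source, regex.ignoreCase ? 'i' : '')
}

const exampleAttributes: AttributeTable = {
  keywordsAsync: ['async', 'await'],
  brackets: /[(){}]/
}

const keywordRegex = expandAttributes(/@keywordsAsync/, exampleAttributes)
const wordRegex = expandAttributes(/\w+(?!@brackets)/, exampleAttributes)
```

Because the insertion is textual, an attribute that contains alternation should usually be wrapped in a group at the point of use or in its own definition.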

Actions

Actions inform the tokenizer about what to do after it has made a match. Actions have the most involved syntax - and so this document will break them down into their individual properties.

However, they do have some short-hands and alternative forms that should be described first.

type Action = { ... } | string | Action[]

const action1 = { token: 'foo' }
const action2 = 'foo'
const action3 = ['foo', 'bar']

The first two types are identical in effect. The last form is for group matches. Group matches effectively break a single regular expression into separate rules, one for each of its capture groups.

const rule = [/(match1)(match2)(match3)/, [action1, action2, action3]]
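The pairing of capture groups with actions can be sketched as below. `tokenizeGroups` is a hypothetical helper that only handles plain token names, and it assumes the capture groups cover the whole match contiguously, which this document does not state explicitly.

```typescript
interface GroupToken { text: string; token: string }

// Hypothetical helper: pairs each capture group with the token name at the
// same index of the action array.
function tokenizeGroups(match: RegExpExecArray, tokenNames: string[]): GroupToken[] {
  const out: GroupToken[] = []
  tokenNames.forEach((token, i) => {
    const text = match[i + 1] // capture groups start at index 1
    // Groups that did not participate or matched nothing produce no token.
    if (text) out.push({ text, token })
  })
  return out
}

const groupMatch = /(\w+)(\s*)(=)/.exec('x =')
const groupTokens = groupMatch
  ? tokenizeGroups(groupMatch, ['variable', '', 'operator'])
  : []
```

Here 'x' is tagged as 'variable', the space receives the empty token, and '=' is tagged as 'operator', all from a single rule.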

Actions can have the following properties:

token Assigns the matched text to the specified token.
next Pushes, or pops states from the stack.
switchTo Switches to states without pushing additional states on the stack.
goBack Reverses the tokenizer's position by the specified number of characters.
nextEmbedded Informs the parser what language to nest with, or to stop nesting with.
log Logs a message whenever the rule is matched.
parser Directs the parser to open or close syntax blocks.

A special type of action, cases, is exclusive with these properties.

Substitution

All action properties can make use of substitutions, which are literal substitutions derived from either the matched text or the current state/sub-states.

They can take three forms:

$# Substitutes the rule's match, or match group in a group match.
$n where n is a number. Substitutes for the nth capture group. The entire match is the special group $0.
$Sn where n is a number. Substitutes for the nth sub-state in the full state expansion. e.g. $S2 matches foo in comment.foo. The entire state is the special group $S0.
// tags the matched text with a token type equivalent to the text of match2
const rule = [/(match1)(match2)/, { token: '$2' }]
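The three substitution forms can be sketched as a single string replacement. `substitute` is a hypothetical helper for illustration, and its parameter layout is an assumption, not cm6-monarch's internal API.

```typescript
// Hypothetical helper: expands $#, $n and $Sn inside an action property value.
//   groups      - the regex match array; groups[0] is the entire match ($0)
//   matchedText - the text used for $# (the match, or the current group in a group match)
//   state       - the full current state name, e.g. 'comment.foo'
function substitute(template: string, groups: string[], matchedText: string, state: string): string {
  const stateParts = state.split('.')
  return template.replace(/\$(#|S\d+|\d+)/g, (_whole: string, key: string) => {
    if (key === '#') return matchedText
    if (key.charAt(0) === 'S') {
      const n = parseInt(key.slice(1), 10)
      // $S0 is the entire state name; $S1, $S2, ... are its dot-separated parts.
      return n === 0 ? state : (stateParts[n - 1] || '')
    }
    return groups[parseInt(key, 10)] || ''
  })
}

const exampleGroups = ['match1match2', 'match1', 'match2']

const fromGroup = substitute('$2', exampleGroups, 'match1match2', 'root')
const fromSubState = substitute('$S2', exampleGroups, 'match1match2', 'comment.foo')
const fromWholeState = substitute('$S0', exampleGroups, 'match1match2', 'comment.foo')
```

Substitutions can also be embedded in a longer value, for example `'string.$S2'`, so that a token or state name depends on how the current state was entered.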

token

The token property causes the matched text to become 'scoped', or 'tagged', with the specified token name. All actions, including (in a roundabout way) cases, require the token property.

An action that is neither an array nor object but a string is interpreted as a short-hand for the token property. e.g. 'foo' becomes { token: 'foo' }

It can be in one of three forms:

foo as in lowercased. Lowercased token names signify a styling tag. Tokens of this type will be automatically highlighted with CodeMirror's native highlighter tags, if the tag name itself is valid. Unknown tag names will automatically be exported in the language's tags property, which allows for specifying a custom highlighting style for that tag.
Foo as in capitalized. Capitalized token names do not signify anything by themselves. They are intended to be used with the language's configure property, the same as a Lezer grammar.
@rematch The special @rematch token type causes the tokenizer to completely reverse the current match's progress, and then restart the tokenizer from that point again. The purpose of this is that state changes are still processed. This allows you to 'cancel' or 'lookahead' with state changes.
const action = { token: 'foo' }

next

The next property informs the tokenizer to make a state change before the next match.

It can be in one of four forms:

foo Pushes the specified state to the stack, which makes it the active state. It can be prefixed with an @ character, or left without one.
@pop Pops the current state from the stack and returns to the previous state.
@push Pushes the current state to the stack.
@popall Pops all states except for the very first, returning to top/root.
// pushes, and then switches to the 'comment' tokenizer state
const action = { next: '@comment' }
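The four forms can be modelled as operations on an array of state names. `applyNext` is a hypothetical helper for illustration; refusing to pop the last remaining state is an assumption of this sketch.

```typescript
// Hypothetical helper: returns the new stack after applying a next value.
// The last element of the array is the active state.
function applyNext(stack: string[], next: string): string[] {
  const current = stack.length > 0 ? stack[stack.length - 1] : undefined
  if (current === undefined) throw new Error('stack must not be empty')
  switch (next) {
    case '@pop':
      // Assumption: the final state is never popped.
      return stack.length > 1 ? stack.slice(0, -1) : stack.slice()
    case '@push':
      // Push the current state again, increasing the depth by one.
      return stack.concat(current)
    case '@popall':
      // Keep only the very first state.
      return stack.slice(0, 1)
    default:
      // A state name, with or without the '@' prefix.
      return stack.concat(next.charAt(0) === '@' ? next.slice(1) : next)
  }
}

let demoStack = ['root']
demoStack = applyNext(demoStack, '@comment') // root > comment
demoStack = applyNext(demoStack, '@push')    // root > comment > comment
const afterPop = applyNext(demoStack, '@pop')
const afterPopAll = applyNext(demoStack, '@popall')
```

`@push` is what makes nested constructs, such as nested block comments, possible: each opener pushes the same state again and each closer pops once.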

switchTo

The switchTo property is much like next except that the state specified is switched to without altering the stack.

// switches to the 'comment' state without changing the depth of the stack
const action = { switchTo: '@comment' }
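In terms of a stack of state names, this can be read as replacing the top entry rather than adding a new one. The sketch below is an interpretation under that assumption, and `applySwitchTo` is a hypothetical helper.

```typescript
// Hypothetical helper: replaces the active (last) state with the target,
// leaving the depth of the stack unchanged.
function applySwitchTo(stack: string[], target: string): string[] {
  const state = target.charAt(0) === '@' ? target.slice(1) : target
  return stack.slice(0, -1).concat(state)
}

const beforeSwitch = ['root', 'string']
const afterSwitch = applySwitchTo(beforeSwitch, '@comment')
```

A later `@pop` from the switched-to state therefore returns to whatever was below the replaced state, here 'root'.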

goBack

The goBack property directs the tokenizer to reverse position by the specified number of characters.

// goes back 5 characters
const action = { goBack: 5 }

nextEmbedded

The nextEmbedded property looks somewhat like the next property, but instead of states it nests embedded languages. Unlike next, you cannot stack nextEmbedded. It is more like a flag that is set, with the tokenizer tracking what range of text should be filled in with the specified language.

It is very likely that a grammar will use substitution with this property. For example, Markdown code blocks specify the language after a series of backticks. The language specifier text itself could be matched in a capture group and used as the value for nextEmbedded.

It takes two forms:

foo where foo is the name of the language, such as typescript or golang. This sets the tokenizer to begin tracking the span of the text that is marked as the specified language.
@pop Terminates the range tracking procedure at the start of the token.
const action = { nextEmbedded: 'typescript' }
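The Markdown case mentioned above might be written roughly as follows. This is an illustrative sketch, not a tested grammar: the state name `codeblock`, the `meta` token, and the exact regexes are assumptions, and `\x60` is used in place of a literal backtick purely for readability.

```typescript
// \x60 is the backtick character, so \x60{3} matches a code fence.
const fenceOpen = /^\x60{3}(\w+)\s*$/
const fenceClose = /^\x60{3}\s*$/

// Hypothetical states for fenced code blocks.
const fencedCodeStates = {
  root: [
    // '$1' is substituted with the captured language name, e.g. 'typescript'.
    [fenceOpen, { token: 'meta', next: '@codeblock', nextEmbedded: '$1' }]
  ],
  codeblock: [
    // Stop nesting at the start of the closing fence and return to 'root'.
    [fenceClose, { token: 'meta', next: '@pop', nextEmbedded: '@pop' }],
    // Everything else inside the block is left for the embedded language.
    [/.*$/, '']
  ]
}

const openingFence = fenceOpen.exec('\x60\x60\x60typescript')
const embeddedLanguage = openingFence ? openingFence[1] : null
```

Pairing nextEmbedded with a dedicated state keeps the closing-fence rule from firing anywhere except inside a code block.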

log

The log property logs (with console.log) the specified message whenever the associated action executes.

const action = { log: 'my rule fired' }

{ cases: {} }

The special action type cases is intentionally similar to the switch -> case syntax found in many programming languages. It allows branching to different actions depending on whether the matched text matches certain patterns.

type Cases = { cases: {
  [guard: string]: Action
}}

const cases = { cases: {
  'foo'     : { token: 'bar' },
  'foobar'  : { token: 'foobar', next: '@foo' },
  '@default': { token: 'content' }
}}

The guard expression can be in one of four forms:

foo as in does not start with $ or @. This is parsed as regex, not as a simple string comparison. The regex provided is treated like any other regex in the tokenizer, although it does need to be escaped as it is a string. This form is technically a short-hand for $#~foo, which is explained in the next section.
@bar as in an attribute. Matches against an attribute string or against all strings within an attribute array. See the section on regex and attributes.
@eos Matches against the text being at the very end of the current line.
@default Matches against any input, like the default case in an ordinary switch -> case statement.

As alluded to, the previously shown 'regex' pattern is a short-hand for the syntax [pat][op]match.

The pattern is any substitution, e.g. $#.

The operator and match are any of the following:

~regex or !~regex Tests the pattern against the regex or the negation of the regex.
@attribute or !@attribute Tests if the pattern is an element of the attribute or if the pattern is not in the attribute.
==string or !=string Tests if the pattern is equivalent (or not equivalent) to the given string.
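The guard forms can be sketched as a small evaluator. This is an illustration only: `testGuard`, `pickCase`, and `GuardContext` are hypothetical names, guards are tried in the order they are written, and anchoring regex guards to the whole text (as Monaco's Monarch does) is an assumption rather than something this document specifies.

```typescript
type GuardAttributes = { [attrName: string]: string | string[] }

interface GuardContext {
  matched: string      // value substituted for $#
  groups: string[]     // capture groups; groups[0] is the whole match
  atEndOfLine: boolean // true if the match ends at the end of the line
  attributes: GuardAttributes
}

function testGuard(guard: string, ctx: GuardContext): boolean {
  if (guard === '@default') return true
  if (guard === '@eos') return ctx.atEndOfLine

  // Optional leading pattern ($# or $n); '$#' is implied when omitted.
  const patMatch = /^\$(#|\d+)/.exec(guard)
  const pat = patMatch ? patMatch[0] : '$#'
  const rest = patMatch ? guard.slice(pat.length) : guard
  const subject = pat === '$#' ? ctx.matched : (ctx.groups[parseInt(pat.slice(1), 10)] || '')

  // Optional operator; a bare operand is treated as '~operand'.
  const opMatch = /^(!~|~|!@|@|==|!=)?([\s\S]*)$/.exec(rest)
  const op = (opMatch && opMatch[1]) || '~'
  const operand = (opMatch && opMatch[2]) || ''

  let result: boolean
  if (op === '~' || op === '!~') {
    // Assumption: the regex must match the subject in full.
    result = new RegExp('^(?:' + operand + ')$').test(subject)
  } else if (op === '@' || op === '!@') {
    const attr = ctx.attributes[operand]
    result = attr === undefined ? false
      : typeof attr === 'string' ? attr === subject
      : attr.indexOf(subject) !== -1
  } else {
    result = subject === operand
  }
  return op.charAt(0) === '!' ? !result : result
}

// Returns the action of the first guard that passes, if any.
function pickCase<T>(cases: { [guard: string]: T }, ctx: GuardContext): T | undefined {
  for (const guard of Object.keys(cases)) {
    if (testGuard(guard, ctx)) return cases[guard]
  }
  return undefined
}

const exampleCases = {
  '@keywords': { token: 'keyword' },
  'foo': { token: 'bar' },
  'foobar': { token: 'foobar' },
  '@default': { token: 'content' }
}

function contextFor(text: string): GuardContext {
  return { matched: text, groups: [text], atEndOfLine: false, attributes: { keywords: ['if', 'else'] } }
}

const pickedForFoobar = pickCase(exampleCases, contextFor('foobar'))
const pickedForKeyword = pickCase(exampleCases, contextFor('else'))
const pickedForOther = pickCase(exampleCases, contextFor('xyz'))
```

Under the full-match assumption, 'foobar' skips the 'foo' guard and reaches its own case; with unanchored matching the 'foo' guard would shadow it.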

parser

The parser property attaches special meaning to the tokens it is defined on. Tokens with this property inform the parser to make special decisions, mainly opening and closing syntax nodes.

It can have a combination of two states, with two (optional) properties each:

open or close This directs the parser to open or close the given syntax nodes after or before the matched token.
start or end This directs the parser to open or close the given syntax nodes with the matched tokens inside of the opened/closed node.

The 'syntax node' type given acts exactly like the value type of the token property, with the exception that the @rematch string is not special. Generally, a language will use the parser property in conjunction with the capitalized Foo tags in order to support otherwise impossible language features.

type Exclusive = { open?: string[] | string, close?: string[] | string }
type Inclusive = { start?: string[] | string, end?: string[] | string }
type ParserAction = Exclusive & Inclusive

const exclusive = { token: 'foo', parser: { open: 'Block' } }
const inclusive = { token: 'bar', parser: { end: 'Block' } }
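As a slightly fuller illustration, the rules below contrast the two styles on bracket pairs. The node names `Block` and `Args` and the rule shapes are assumptions chosen for this sketch, not names that cm6-monarch defines.

```typescript
type Exclusive = { open?: string[] | string; close?: string[] | string }
type Inclusive = { start?: string[] | string; end?: string[] | string }
type ParserAction = Exclusive & Inclusive
type ParserRule = [RegExp, { token: string; parser?: ParserAction }]

// Hypothetical rules; 'Block' and 'Args' are illustrative node names.
const parserRules: ParserRule[] = [
  // Inclusive: the braces themselves sit inside the 'Block' node.
  [/\{/, { token: 'brace', parser: { start: 'Block' } }],
  [/\}/, { token: 'brace', parser: { end: 'Block' } }],
  // Exclusive: 'Args' opens after '(' and closes before ')',
  // so the parentheses stay outside the node.
  [/\(/, { token: 'paren', parser: { open: 'Args' } }],
  [/\)/, { token: 'paren', parser: { close: 'Args' } }]
]
```

Which style fits depends on whether later processing, such as folding or indentation, should treat the delimiters as part of the node.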