TAML Grammar Reference

Hint

This page is aimed at format support implementors.

For a user manual (even when using a TAML library as a developer), see taml_by_example.

TK: Use singular for headings.

All grammar is defined in terms of Unicode codepoint identity.

Where available, the canonical binary or at-rest encoding of TAML is UTF-8, while its runtime text-API representation should use the canonical representation of arbitrary Unicode strings in the target ecosystem.

Note

Where no standard Unicode text representation exists, it's likely best to provide only a binary UTF-8 API.

Whitespace

Note

TK: Format as regex section

[ \t]+

Whitespace is meaningless except when separating otherwise-joined tokens.

Note that line breaks are not included here.

Comment

Note

TK: Format as regex section

//[^\r\n]+

At (nearly) any point in the document, a line comment can be written as follows:

// This is a comment. It stretches for the rest of the line.
// This is another comment.

The only limitation to comment placement is that the line up to that point must be otherwise complete.
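As a rough illustration, the whitespace and comment patterns above can be combined to recognise lines that contain nothing but a comment. This is a minimal Python sketch, not part of any TAML tool; the helper name `is_comment_only` is made up for this example.

```python
import re

# Patterns copied from this page; both are applied to a single line of text.
WHITESPACE = re.compile(r"[ \t]+")
COMMENT = re.compile(r"//[^\r\n]+")


def is_comment_only(line: str) -> bool:
    """Return True if the line holds only optional whitespace and one comment."""
    match = WHITESPACE.match(line)
    rest = line[match.end():] if match else line
    return COMMENT.fullmatch(rest) is not None
```

Note that, as written, the comment pattern requires at least one character after `//`, so a bare `//` is not matched by this sketch.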

Line break

Note

TK: Format as regex section

\r?\n

TAML does not use commas to delineate values, outside of inline lists and rows.

Instead, line breaks are a grammar token that separates comments, headings, key-value pairs and table rows.

Note

"Line break" more specifically refers to Unicode code point U+000A LINE FEED (LF), which can optionally be prefixed with a single U+000D CARRIAGE RETURN (CR).

This is the only position in which verbatim carriage return characters are legal. Note that occurrences of the line feed character in quotes are not considered to be line break tokens, so a carriage return in front of one there is an error! Correct the literal in question by either replacing all verbatim carriage return characters with \r or deleting them.

Empty lines outside of quotes and lines containing only a comment can always be removed without changing the structure or contents of the document.
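The line break rule can be sketched as follows in Python. The function name `split_lines` is hypothetical, and the sketch assumes that the input contains no quoted literals spanning several lines, since line feeds inside quotes are not line break tokens.

```python
import re

# LF, optionally prefixed with a single CR, as in the pattern above.
LINE_BREAK = re.compile(r"\r?\n")


def split_lines(text: str) -> list[str]:
    """Split text at line break tokens and reject any other carriage return."""
    lines = LINE_BREAK.split(text)
    for number, line in enumerate(lines, start=1):
        if "\r" in line:
            # A CR that is not directly followed by LF is illegal here.
            raise ValueError(f"stray carriage return on line {number}")
    return lines
```

A document ending in a line break yields a final empty string, which a caller can drop like any other empty line.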

Hint

taml fmt preserves single empty lines but collapses longer blank parts of the document.

taml fix can fix your line endings for you without changing the meaning of quotes. (TODO) It warns about any occurrence of a carriage return character that it doesn't fix by default, in either sense. (TODO)

Identifier

Note

TK: Format as regex section

[a-zA-Z_][a-zA-Z\-_0-9]*
`([^\\`\r]|\\\\|\\`|\\r)*`

Identifiers in TAML are arbitrary Unicode strings and can appear in two forms, verbatim and quoted:

Verbatim

Verbatim identifiers must start with an ASCII letter or underscore (_). They may contain only those codepoints plus ASCII digits and the hyphen-minus character (-).

Hint

Support for - is a compatibility affordance.

When outlining a new configuration structure, I recommend for example a_b over a-b, as the former is treated as a single "word" by most text editors. (Try double-clicking each.)

Quoted

Backtick (`)-quoted identifiers are parsed as completely arbitrary Unicode strings.

Only the following characters are backslash-escaped:

  • \ as \\
  • ` as \`
  • U+000D CARRIAGE RETURN (CR) as \r

All other sequences starting with a backslash are invalid in quoted strings and must lead to an error.
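A minimal Python sketch of identifier parsing, following the two regular expressions given at the top of this section (including the `\r` escape that the quoted pattern allows). The function name `parse_identifier` and the `ESCAPES` table are made up for this example; this is not how any particular TAML library does it.

```python
import re

# Both patterns are taken from the top of this section.
VERBATIM = re.compile(r"[a-zA-Z_][a-zA-Z\-_0-9]*")
QUOTED = re.compile(r"`([^\\`\r]|\\\\|\\`|\\r)*`")

# Character after the backslash -> character it stands for.
ESCAPES = {"\\": "\\", "`": "`", "r": "\r"}


def parse_identifier(token: str) -> str:
    """Return the identifier's string value, or raise ValueError."""
    if VERBATIM.fullmatch(token):
        return token
    if QUOTED.fullmatch(token):
        body = token[1:-1]  # drop the surrounding backticks
        # The match above guarantees that only known escapes remain.
        return re.sub(r"\\(.)", lambda m: ESCAPES[m.group(1)], body)
    raise ValueError(f"not a valid identifier: {token!r}")
```

Because the quoted pattern is checked first, any other backslash sequence is rejected before unescaping starts.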

Warning

Identifiers formally may be empty or contain U+0000 NULL.

However, parsers for ecosystems where this cannot be safely supported are free to limit support here, as long as this limitation is prominently declared.

(A parser written in for example C# or Rust very much should support both, though. A parser written in C or C++ should consider not supporting NULL due to its common special meaning.)

TK: Define an error code that should be used here. Something like TAML-L0001?

Key

Only identifiers may be keys. Keys appear in section headers, enum variants and as part of key-value pairs like the following:

key: value

(value is a unit variant here, but could be replaced with any other value.)

Value

A value is any one of the following:

data literal, decimal, enum variant, integer, list, string, struct.

Warning

TAML processors should be as strict as is sensible regarding value types. For example, if a string is expected, don't accept an integer and vice versa.

In some cases, remapping TAML value types is a good idea, like when parsing rust_decimal values using Serde, which should still be written as decimals in TAML but internally processed as strings. Such remappings should be done explicitly on a case-by-case basis.

Integer

Note

TK: Format as regex section

-?(0|[1-9]\d*)

A whole number with base 10. Note that -0 is legal and may be interpreted differently from 0.

Additional leading zeroes are disallowed to avoid confusion with languages and/or parsing systems where this would denote base 8.
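The integer rules can be checked with a few lines of Python. This is only a sketch: `parse_integer` is a hypothetical helper, and it assumes that `\d` in the pattern above means ASCII digits only, which is why `re.ASCII` is set.

```python
import re

# re.ASCII restricts \d to 0-9 (an assumption about the intended grammar).
INTEGER = re.compile(r"-?(0|[1-9]\d*)", re.ASCII)


def parse_integer(token: str) -> tuple[bool, int]:
    """Return (is_negative, magnitude) so that -0 stays distinguishable from 0."""
    if not INTEGER.fullmatch(token):
        raise ValueError(f"not a valid integer: {token!r}")
    negative = token.startswith("-")
    return negative, int(token.lstrip("-"))
```

Returning the sign separately is one way to leave the interpretation of `-0` to the caller; Python's `int` is arbitrary-precision, so no range check is needed in this sketch.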

Hint

If your configuration requires setting a bitfield, consider accepting it as a data literal instead, for example like this:

some_bitfield: <bits:1000_0001 1111_0000>
another_encoding: <hex:81 F0>
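How such payloads are decoded is up to the application. The following Python sketch shows one possible reading of the two example encodings; the helper names `decode_bits` and `decode_hex` and the exact rules (eight bits per group, `_` as a visual separator) are assumptions made for this illustration, not something this page defines.

```python
def decode_bits(payload: str) -> bytes:
    """Decode space-separated groups of eight binary digits, ignoring '_'."""
    groups = payload.replace("_", "").split()
    for group in groups:
        if len(group) != 8 or set(group) - {"0", "1"}:
            raise ValueError(f"not an 8-bit group: {group!r}")
    return bytes(int(group, 2) for group in groups)


def decode_hex(payload: str) -> bytes:
    """Decode pairs of hexadecimal digits; whitespace between bytes is skipped."""
    return bytes.fromhex(payload)
```

Under these assumptions, the two literals in the example above describe the same two bytes.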

Decimal

Note

TK: Format as regex section

-?(0|[1-9]\d*)\.\d+

A fractional base 10 number. Note that -0.0 is legal and may be interpreted differently from 0.0.

Additional leading zeroes are disallowed for consistency with integers. Additional trailing zeroes are considered insignificant and must not make a difference when parsing a value.

Note

Integers and decimals should be considered disjoint. Don't accept one for the other unless not doing so would be unusually inconvenient.

Note

Decimals, like integers, are not required to fit any particular binary representation.

For example, they could be parsed and processed with arbitrary precision rather than as IEEE 754 floats.
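As an illustration of parsing without committing to a binary float, here is a minimal Python sketch built on the standard library's `decimal` module. The helper name `parse_decimal` is hypothetical, and `\d` is again assumed to mean ASCII digits.

```python
import re
from decimal import Decimal

# re.ASCII restricts \d to 0-9 (an assumption about the intended grammar).
DECIMAL = re.compile(r"-?(0|[1-9]\d*)\.\d+", re.ASCII)


def parse_decimal(token: str) -> Decimal:
    """Parse a TAML decimal token exactly, without going through a float."""
    if not DECIMAL.fullmatch(token):
        raise ValueError(f"not a valid decimal: {token!r}")
    # Decimal's string constructor keeps every digit, whatever its precision.
    return Decimal(token)
```

`Decimal` compares by numeric value, so `parse_decimal("5.50") == parse_decimal("5.5")` holds and trailing zeroes make no difference to equality. The sign of `-0.0` is preserved and can be read with `is_signed()`.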

Warning

taml fmt removes insignificant trailing zeroes from decimals.

serde_taml excludes them while lexing, which also affects reserde.

Absolutely do not make any distinction regarding additional trailing zeroes in decimals when writing a lexer or parser.

String

Note

TK: Format as regex section

"([^\\"\r]|\\\\|\\"|\\r)*"

Strings are written as quoted Unicode literals. The characters \, " and U+000D CARRIAGE RETURN (CR) must be escaped as \\, \" and \r, respectively.

The character U+0000 NULL may be unsupported in environments where processing it would be unreasonably error-prone.
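The string rules translate to Python roughly as follows. This is a sketch with a made-up helper name, `parse_string`; it accepts verbatim line feeds inside the quotes, since only carriage returns need escaping, and it does not restrict U+0000.

```python
import re

# Pattern taken from the top of this section.
STRING = re.compile(r'"([^\\"\r]|\\\\|\\"|\\r)*"')

# Character after the backslash -> character it stands for.
ESCAPES = {"\\": "\\", '"': '"', "r": "\r"}


def parse_string(token: str) -> str:
    """Return the string's value, or raise ValueError for an invalid literal."""
    if not STRING.fullmatch(token):
        raise ValueError(f"not a valid string literal: {token!r}")
    body = token[1:-1]  # drop the surrounding quotes
    return re.sub(r"\\(.)", lambda m: ESCAPES[m.group(1)], body)
```

A verbatim carriage return inside the quotes fails the pattern, which matches the correction advice in the line break section: escape it as `\r` or delete it.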

Enum Variants

TK

Unit Variant

Unit variants are written as single identifiers.

Notable unit variants are the boolean values true and false, which are not associated with more specific grammar in TAML.

List

TK

Inline Lists

Sections

TAML's grammar is, roughly speaking, split into three contexts:

  • structural sections
  • headings
  • tabular sections

Structural Sections

The initial context is a structural section. Structural sections can contain key-value pairs and nested sections, which can themselves be structural sections.

first: 1
second: 2

# third
first: 3.1
second: 3.2

Each nested section is introduced by a heading nested exactly one deeper than the surrounding section's.

It continues until a heading with at most equal depth is encountered or up to the end of the file. An empty nested heading can be used to semantically (but not grammatically!) return to its immediately surrounding structural section.

first: 1
second: 2

# third
first: 3.1
second: 3.2

## third
first: "3.3.1"
second: "3.3.2"

## fourth
first: "3.4.1"
second: "3.4.2"

#
fourth: 4
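The nesting rules above can be sketched with a stack of open sections. The Python below is a deliberately naive illustration, not a real parser: `parse_structural` is a hypothetical name, values are kept as raw text, only plain struct headings are handled, and quoted identifiers, trailing comments, duplicate keys and multi-line strings are ignored.

```python
import re


def parse_structural(text: str) -> dict:
    """Build nested dicts from key-value pairs and '#' headings."""
    root: dict = {}
    stack = [root]  # stack[d] is the section open at heading depth d
    for raw in re.split(r"\r?\n", text):
        line = raw.strip(" \t")
        if not line or line.startswith("//"):
            continue  # empty and comment-only lines carry no structure
        if line.startswith("#"):
            name = line.lstrip("#")
            depth = len(line) - len(name)
            name = name.strip(" \t")
            if depth > len(stack):
                raise ValueError(f"heading nested too deeply: {line!r}")
            del stack[depth:]  # close sections of equal or greater depth
            if name:
                section: dict = {}
                stack[-1][name] = section
                stack.append(section)
            else:
                # Empty heading: keep writing into the surrounding section.
                stack.append(stack[-1])
        else:
            key, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"expected 'key: value': {line!r}")
            stack[-1][key.strip(" \t")] = value.strip(" \t")
    return root
```

Applied to the example above, the final `fourth: 4` ends up next to the top-level `first` and `second`, because the empty `#` heading pushes the root section a second time instead of opening a new one.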

Headings

Tabular Sections

Tabular sections are a special shorthand to quickly define lists with structured content.

The following are equivalent:

# [[dishes].{id, name, [price].{currency, amount}}]
<luid:d6fce69d-9c9d>, "A", EUR, 10.95
<luid:c37dcc6a-2002>, "B", EUR, 5.50
<luid:00000000-0000>, "Test Item", EUR, 0.0

# [dishes]
id: <luid:d6fce69d-9c9d>
name: "A"
## price
currency: EUR
amount: 10.95

# [dishes]
id: <luid:c37dcc6a-2002>
name: "B"
## price
currency: EUR
amount: 5.50

# [dishes]
id: <luid:00000000-0000>
name: "Test Item"
## price
currency: EUR
amount: 0.0
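The equivalence shown above amounts to distributing each row's cells over the column paths named in the heading. The Python sketch below illustrates only that step; the column paths are written out by hand rather than parsed from the heading, cells are kept as raw token text, and `expand_row`, `COLUMNS` and `ROWS` are names invented for this example.

```python
def expand_row(columns: list[tuple[str, ...]], cells: list[str]) -> dict:
    """Place each cell at its column path, creating nested structs as needed."""
    if len(columns) != len(cells):
        raise ValueError("row length does not match the number of columns")
    item: dict = {}
    for path, cell in zip(columns, cells):
        target = item
        for key in path[:-1]:
            target = target.setdefault(key, {})
        target[path[-1]] = cell
    return item


# Column paths corresponding to the tabular heading in the example above.
COLUMNS = [("id",), ("name",), ("price", "currency"), ("price", "amount")]
ROWS = [
    ["<luid:d6fce69d-9c9d>", '"A"', "EUR", "10.95"],
    ["<luid:c37dcc6a-2002>", '"B"', "EUR", "5.50"],
    ["<luid:00000000-0000>", '"Test Item"', "EUR", "0.0"],
]
dishes = [expand_row(COLUMNS, row) for row in ROWS]
```

Each entry of `dishes` then has the same shape as one `# [dishes]` section in the long form, with `price` as a nested struct.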

Hint

As of right now, there is intentionally no way to define common values once per table.

I haven't found a way to express this that both is intuitive and won't make copy/paste errors much more likely.

Row

TK