All requirements and recommendations are provided with the implicit stipulation that they be implemented "as-if" they are true. The actual implementation may differ provided there is no way for a distinction to be made without invoking unspecified behavior or through environment introspection.
In this document "shall" is to be interpreted as a requirement on an implementation; "shall not" is to be interpreted as a prohibition. Failure to meet these requirements qualifies an implementation as non-conforming. If a "shall" or "shall not" requirement is violated, the behavior is undefined. Undefined behavior is otherwise indicated by the words "undefined behavior" or by the omission of any explicit definition of behavior.
In this document "should" is to be interpreted as a strong recommendation which may be broken in exceptional circumstances; "should not" is to be interpreted as a possible implementation which is strongly recommended against except in exceptional circumstances. The definition of "exceptional" here is left purposefully vague.
In this document "may" is to be interpreted as an example of a possible implementation for which conformance isn't necessary.
Behavior for which this standard imposes no requirements. Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or execution in a documented manner characteristic of the environment, to terminating a translation or execution (with the issuance of a diagnostic message).
Behavior which this standard provides two or more possibilities and and may change in future implementations or iterations of this standard. Relying on unspecified behavior means a program is ill-formed and nonportable.
Unspecified behavior where each implementation documents how the choice is made
The theoretical model used by an implementation in conformance with this standard to define the execution of a program.
The environment in which the abstract machine executes a program, including the architecture, operating system, and configuration.
The particular set of software that performs translation of programs for, and supports execution of functions in, a particular execution environment
Translation of source code to a sequence of distinct, non-overlapping tokens.
Translation of a sequence of tokens to an Abstract Syntax Tree (AST) describing the structure of the program.
Translation of an AST to an Intermediate Representation (IR), unspecified to the user code, which can be readily interpreted or compiled further into machine instructions.
Translation of all or part of the IR of a program into machine instructions which can be executed directly by the execution environment. This may result in an executable binary or Just-In-Time (JIT) compiled functions which can be executed without interpretation.
A unit of storage in the execution environment large enough to hold an object that may have exactly one of two values.
An addressable unit in the execution environment large enough to hold an object that may have a value in the range [0, 256)
All lvalues are also rvalues and uvalues. All prvalues are also rvalues. All uvalues are also lvalues.
An expression which can appear on the left-hand side of an assignment operator. This includes:
- Variable names
- Array subscripts
- Indexed object members
- Tuples, lists, dictionaries, or function calls
- Type assertions
An expression which can appear on the right-hand side of an assignment operator. This includes:
- Literals
- Expressions
- Function calls
- Tuples, lists, or dictionaries with no defaults
A regular expression using the PCRE format.
The lexical phase should be minimally stateful and receive minimal feedback from downstream phases in normal operation. Tokenization should not exceed the computational complexity of a Deterministic Pushdown Automata (DPA) and should aim for less wherever possible. Whitespace should serve the sole purpose of separating ambiguous tokens and should not be used in the parsing phase. Insertion of a space between two tokens should not change the semantics of the AST they parse to.
The parsing phase may not look ahead more than 1 symbol and may not backtrack.
A token is the minimal lexical element of the language. The categories of tokens are: whitespace, comments, words, literals, operators, and punctuators. If the input stream has been parsed into tokens up to a given character, the next token is the longest sequence of characters which constitute a token.
Whitespace is any character with an ASCII value less than or equal to 32 (the space character). They can occur between any pair of tokens without changing the semantics. The only exception is a newline after an identifier, which prevents the adjacency operator.
Comments can appear anywhere whitespace is valid. They generally do not affect the semantics, but depending on where they appear they may be included in the program as metadata.
Single-line comments begin with a hash character #
and continue until the end of the current line.
Multi-line comments begin with #*
and end with *#
. No nesting is considered.
A word is any sequence of characters satisfying the following regular expression (regex):
(?i)[a-z_][a-z\d_]*
A subset of these are designated as keywords which have special semantics in some contexts. These are:
if
then
else
loop
while
for
with
try
do
which
return
fail
yield
break
continue
case
import
export
namespace
proto
this
super
and
or
not
in
is
as
has
after
var
let
public
protected
private
static
The following is a list of reserved words which may become keywords in the future and should be avoided:
until
before
unless
likely
unlikely
elif
always
await
async
defer
sync
atomic
const
catch
where
of
def
from
inline
noinline
pure
define
ref
ptr
virtual
dynamic
abstract
final
new
delete
native
extern
unsafe
asm
strict
debug
goto
typedef
template
using
finally
macro
alias
fn
function
All numeric literals (integer and floating-point) may be followed by a word with no space in between to specify a numeric suffix. All numeric patterns permit trailing underscores - if more than one underscore is present, only the first shall be considered part of the numeric literal and the rest is the suffix. The semantics of this suffix are a lookup of a variable with that name which is called, passing the numeric literal's string value.
Integer literals come in 4 varieties satisfying the following regexes:
- Binary integer:
(?i)0b[01_]*
- Octal integer:
(?i)0o[0-7_]*
- Decimal integer:
(?i)\d[\d_]*(?!e)
- Hexadecimal integer:
(?i)0x(?i)[a-f\d_]*
Floating-point literals come in 2 varieties satisfying the following regexes:
- Decimal float:
(?i)^(?=.*\d)\d[\d_]*\.[\d_]*(e[-+]?[\d_]*)?
- Hexadecimal float:
(?i)^(?=.*\d)0x[a-f\d_]*\.[a-f\d_]*(p[-+]?[\d_]*)?
String literals come in 6 varieties satisfying the following regexes:
- Single-quoted:
'(\\.|.+)*'
- Double-quoted:
"(\\.|.+)*"
- Back-quoted:
(\\.|.+)*
- Triple-quoted varieties of the above
Single and double-quoted strings can contain character escapes, the \
character followed by one of the following sequences:
Sequence | Description |
---|---|
\ | Backslash |
' | Single quote |
" | Double quote |
` | Backtick |
n | Newline |
r | Carriage return |
t | Tab |
0-9 | 1-digit ASCII decimal |
xHH | 2-digit ASCII hex |
uHHHH | 4-digit unicode hex |
u{...} | n-digit unicode hex |
Back-quoted strings are "raw strings" that can only contain the character escapes for \`
and \\
; all other escapes are a literal backslash and following character.
Any newlines which appear in a string shall be converted to the ASCII newline character.
Interpolated strings support arbitrary nesting, and thus must be be split into 3 token regexes and a pushdown automata.
- Begin-interpolate:
(['"]|['"]{3}).*?\\\{
(push group 1) - Between-interpolate:
.*?\\\{
- End-interpolate:
.*?(['"]|['"]{3})
(pop, compare to group 1; syntax error if mismatched)
The between-interpolate and end-interpolate regexes should only be matched when the parser encounters an extra closing brace }
, matching the opening escape pattern \{
.
Punctuators satisfy one of the following sequences:
- Syntax:
(
)
[
]
{
}
.
..
...
:
::
;
,
Operators satisfy one of the following sequences:
- Comparison:
<
<=
>
>=
==
!=
===
!==
<=>
- Unary:
not
@
- Arithmetic:
+
-
*
/
%
**
//
%%
- Bitwise:
!
&
|
^
<<
<<<
>>
>>>
<<>
<>>
- Boolean:
and
or
??
?!
- Access:
->
- Other:
.=
..=
,=
:=
::=
??=
@=
~=
Assignment operators are any of the arithmetic, bitwise, or non-keyword boolean operators followed by an equals character =
with no other characters in between.
An expression is a sequence of operators and operands that specifies computation of a value, declares a variable, generates side effects, or performs some combination thereof.
The grouping of operators and operands is indicated by the syntax. The order of evaluation of subexpressions and their side effects is depth-first from beginning to end unless otherwise specified.
A relaxed identifier is an identifier with relaxed restrictions for certain contexts. It consists of any valid token which is not a punctuator.
Atoms are any of the following expressions:
- Constants
- Compound literals
- Identifiers
- Subexpressions within enclosing parentheses:
( expression )
- One of the keyword expressions
A list compound literal is an atom with the following syntax:
list = "[" ",".("..."? expression?)+ ","? "]";
A list can be an lvalue provided all subexpressions are also lvalues. If there is one or more rvalue subexpressions, the list is a prvalue.
An object compound literal is an atom with the following syntax:
computed_key = "[" expression "]"
key = relaxed_id | computed_key;
mapping = key ":" expression;
method = ("get" | "set" | "delete")? key "(" params ")" block;
property = ("public" | "protected" | "private")? (mapping | method);
entry = property | "..." expression;
object = "{" ",".(entry?)+ ","? "}";
An object is an lvalue if it contains only lvalue entries. It is a plvalue if it has any plvalues. A plvalue object shall contain no more than one ...
entry, and shall not contain any methods. If there is one or more rvalue subexpressions, or one or more methods, the object is a prvalue.
An object literal shall not contain duplicated mapping keys. If a computed key evaluates to an existing mapping key, its value overwrites the existing key.
The var
keyword denotes a mutable variable, while let
denotes an immutable variable. Both have the same syntax, as follows:
type_lval = id "(" params ")";
func_lval = type_lval "{" statements "}";
decl = tuple_lval | list_lval | object_lval | type_lval | func_lval;
varlet = ("var" | "let") ",".decl
Variable bindings are hoisted to the top of the current scope and initialized to none if no initializer is provided.
return
is a unary rvalue expression with the never
type. It immediately halts execution of the current function, returning the value of the expression if provided, or none
if not.
fail
is a unary rvalue expression with the never
type. It immediately halts execution and resumes at the next enclosing try
-else
block in the current function, or returns a failure object with the value of the expression if provided, or none
if not.
yield
is a unary rvalue expression with the option[any]
type. It immediately halts execution of the current function and returns control to the coroutine caller with the value provided, else none
. If the coroutine is resumed, the yield
expression returns the value provided to the next
call, else none
.