UIP-0107: specify @c and @t atoms, redefine +fitz

urbit · May 8, 2024 · 9d8a6bb · 9d8a6bb
1 parent 835fe3a
commit 9d8a6bb
Showing 1 changed file with 82 additions and 25 deletions.
diff --git a/UIPS/UIP-0107.md b/UIPS/UIP-0107.md
@@ -1,6 +1,6 @@
 ---
 uip: 0107
-title: Auras Renovation
+title: Aura Renovation
 description: Address inconsistencies in the aura type system
 author: ~ponmep-litsem
 status: Draft
@@ -11,49 +11,50 @@ created: 2023-06-30
 
 ## Abstract
 
-The aura type system provides a way to indicate a particular interperation of an atom in Hoon. An aura of an atom influences both how it is presented to the user, as well as specifies the parsing routine. Currently, auras are used in an inconsistent manner, thus preventing enforcement of atom sanity at the compiler level. In this proposal, we address these inconsistencies, together with related problems. With the implementation of this proposal, auras become a correct type system tool, dispensing with the need for manual atom sanity checks.
+The aura type system provides a way to indicate a particular interperation of an atom in Hoon. An aura of an atom influences both how it is presented to the user, as well as specifies its parsing routine. Currently, auras are used in an inconsistent manner, thus preventing enforcement of atom sanity at the compiler level. We address these inconsistencies together with related problems. With the implementation of this proposal auras become a correct type system tool, paving the way for compile-time atom sanity checking.
 
 ## Motivation
 
-There are several ways in which auras are used in an inconsistent manner.
+There are several ways in which auras are used in inconsistent manner.
 
 1. The `aura` type itself, defined as `@ta`, violates the constraint on knots, which do not permit uppercase letters.
 
-2. Use of a `term`, which is too restrictive, for an aura variable, which admits uppercase bit-width indicator.
+2. Use of `term` type is too restrictive for an aura variable, which admits uppercase bit-width indicator.
 	An example of this inconsistency is in the Hoon atom definition,
-	
+
 	```
 	+$  hoon
 	...
 	[%rock p=term q=*]
 	[%sand p=term q=*]
 	```
-as well as in many functions throughout `sys/hoon.hoon` which accept `term` for an `aura`. This is seen in 		practice 	with tapes, defined as `(list @tD)`. Another example is `+fitz` which accepts a `term` for a general aura.
+as well as in many functions throughout `sys/hoon.hoon` which accept `term` for what is an `aura`. This is seen in practice with tapes, defined as `(list @tD)`. Another example is `+fitz` which accepts a `term` for a general aura.
 
-3. Further, there are two Hoon literals which have uppercase letters in their digit set: `@uc` and `@uw`. The `scot` atom printer returns a `@ta`, and the `slaw` family of atom parsers similarly violates this same constraint, accepting a `@ta` as their string argument.
+3. There are two Hoon literals which have uppercase letters in their digit set: `@uc` and `@uw`. The `scot` atom printer returns a `@ta`, and the `slaw` family of atom parsers similarly violates this same constraint, accepting a `@ta` as their string argument.
 
 4. The `+slaw` family of atom parsers accepts a `@tas` for an aura, thus restricting valid auras with bit-width indicator.
 
 5. Clay persists invalid paths according to the path specification as a list of `@ta`. These can either be some valid `@uc`, `@uw` atoms, or, more generally, any path addressing a file whose name contains uppercase letters.
 
-These violations of the aura type system make it impossible to enforce sanity of atoms at the compiler level, as originally envisioned with the `+sane` and `+ruth` arms and had caused problems in `%clay` handling 
-of paths. Fixing the aura type system, as proposed here, is a required step before enabling compiler
-atom sanity checks in the future. 
+These violations of the aura type system make it impossible to enforce sanity of atoms at compile-time, as originally envisioned with the `+sane` and `+ruth` arms and had already caused problems in `%clay` handling of paths and desks. Fixing the aura type system, as proposed here, is a required step before enabling compile-time
+atom sanity checks in the future.
 
 Apart from the inconsistent use of the auras, there are also two related problems:
 
 - (A) `+sane`, so far not used at the compiler level, is under-specified and performs only some atom sanity checks
 - (B) Default hexadecimal printing of user-defined auras could become a source of unforseen bugs should this behaviour be changed in the future
 
 ## Specification
-- The knot aura `@ta` denotes a string composed of allowed characters: `[0-9], [a-z], [A-Z], cab, hep, sig, dot`. It is considered URL-safe, and allows the encoding of all Hoon constants.
-- Aura syntax. A valid aura consists of three parts, prefixed by `@`: the major part, the bit-width (optional), and the minor part (optional). The major part is matched by `[a-z]*`. The bit-width is matched by `[A-Z]`. The minor part is matched by `-[a-z]+`. Examples of valid auras include `@u`, `@uD`, `@uv-blob`, `@uxG-hash`.
-- Aura nesting rules. Aura B nests in aura A if the following conditions are all satisfied. (1) Nesting in the major part. B nests in A if the major part of B is an extension of the major part of A. (2) Nesting in the bit-width. B nests in A if the bit-width of B is not greater than the bit-width of A. Absence of bit-width indicates unlimited size. (3) Nesting in the minor part. B nests in A if the minor part of B matches the minor part of A.
-- Aura _compatibility_. Auras `A` and `B` are said to be compatible if and only if A nests in B or B nests in A.
-- (To be determined) The `@t` aura denotes a UTF-8 string composed of such and such characters.
-- (To be determined) The `@c` aura denotes a UTF-32 string composed of such and such characters.
-- Auras are printed by the `+scot` arm. The pretty-printer renders user-defined auras according their
- nearest defined nest aura.
+- The knot aura `@ta` denotes a string composed of allowed characters: `[0-9], [a-z], [A-Z], '_', '-', '~', '.'`. It is considered URL-safe, and allows the encoding of all Hoon constants.
+- Aura syntax. A valid aura consists of three parts prefixed by `@`: the major part, the bit-width (optional), and the minor part (optional). The major part is matched by `[a-z]*`. The bit-width is matched by `[A-Z]`. The minor part is matched by `-[a-z]+`. Examples of valid auras include `@u`, `@uD`, `@uv-blob`, `@uxG-hash`.
+- Aura nesting rules. Aura B nests in aura A if the following conditions are all satisfied. (1) Nesting in the major part. B nests in A if the major part of B is an extension of the major part of A. (2) Nesting in the bit-width. B nests in A if the bit-width of B is not greater than the bit-width of A. Absence of bit-width indicates unlimited size. (3) Nesting in the minor part. B nests in A if the minor part of B matches the minor part of A, or if A is the universal aura `@`.
+- _Universal aura `@`_. All atoms nest under the universal aura `@`. This is a consequence of specified nesting rules.
+- _Aura compatibility_. Auras `A` and `B` are said to be compatible if and only if A nests in B or B nests in A.
+- `+fitz` is defined to be an _aura nest check_.
+- The `@c` aura denotes a well-formed UTF-32 code unit sequence encoded in UTF-32LE encoding form, according to Definitions 88 and 90 of the Unicode standard. Detailed specification is found in the specification supplement.
+- `@t` aura denotes a well-formed UTF-8 code unit sequence encoded in UTF-8 encoding form, according to Definitions 86 and 92 of the Unicode standard. In addition, `@t` excludes the control code points `0x00..0x09` and `0x0B..0x1F`, but includes the newline control code point `0x0A`. Detailed specification is found in the specification supplement.
+- Conversion from a @c atom to a @t atom. Since a @c atom can assume values that represent control code points invalid for a @t atom, the conversion routine shall prevent creation of invalid @t atoms by crashing on invalid inputs.
+- The pretty-printer renders user-defined auras according to their nearest defined nest.
 
 ## Rationale
 
@@ -69,15 +70,33 @@ We therefore propose to relax this restriction and allow uppercase letters in `@
 
 #### Major part nests only by extension.
 
-The nesting rules for the major part are modified. Previously, the nesting rules functioned as a bidirectional compatibility check. The aura `@tas` nests in `@ta`, and `@ta` nests in `@tas`. The latter could easily result in production of mad atoms, since an aura extension could introduce a constraint not present in its parent.
+The nesting rules for the major part are modified. Previously, the nesting rules in the major part functioned as compatibility check, as realized by the `+fitz` arm. This enables the programmer to produce atoms violating the sanity check.
+
+```
+> (fitz %ta %tas)
+> %.y
+> (fitz %tas %ta)
+> %.y
+> ^-(@tas 'MAD')
+%MAD
+```
+Meawhile, in the bit-width `+fitz` functioned as a _nest check_ proper.
+
+```
+> (fitz 'uE' 'uD')
+> %.y
+> (fitz 'uD' 'uE')
+> %.n
+```
+We restore consistency by redefining `+fitz` as a one-directional _aura nest check_.
 
 #### Bit-width syntax with alternative nesting behaviour
 
 Currently, only the last uppercase letter is treated as a bit-width extension by the aura compatibility check arm `+fitz`. This means that an aura like `@uDE` denotes an aura of type `@uD` with the bit-width extension of `E` -- 16 bits. We propose to simplify things by further restricting the bit-width syntax: only a single, trailing uppercase letter is allowed in the aura specification. Thus  an aura `@uDE` is invalid, while `@uD` and `@uE` are allowed.
 
-#### Minor part nests by exact match
+#### Minor parts nest by exact match
 
-The minor part of an aura allows us to introduce more descriptive aura types, to state the indented purpose of the aura, in addition to its semantics. For instance, `@uxD-flag` could be used when communicating with hardware devices, while `@uv-blob` could be used in place of the less readable `@uvblob`. The minor part nests based on exact match, as opposed to the major part, which nests by extension. In particular, `@uv-blob` does not nest under `@uv`, and the _universal aura_ `@` is not compatible with any aura with a minor extension. 
+The minor part of an aura allows us to introduce more descriptive aura types, to state the indented purpose of the aura, in addition to its semantics. For instance, `@uxD-flag` could be used when communicating with hardware devices, while `@uv-blob` could be used in place of the less readable `@uvblob`. The minor parts nest based on exact match, as opposed to major parts, which nest by extension. In particular, `@uv-blob` does not nest under `@uv`, but it does nest under `@-blob`. As a special case, the universal aura `@` is considered to encompass auras, regardless of the minor part.
 
 ### Example parser of the new syntax
 
@@ -97,18 +116,56 @@ An example implementation of the new `+mota` to be introduced by this proposal i
 
 ### Change in default printing of atoms
 
-The default printing of user-defined auras as hexadecimal number without separators tempts the programmer to exploit this behaviour as a convenient way to print atoms. It is also inconsistent with the rendering of the empty aura `@` as an unsigned decimal atom. Rather, user-defined auras should render in the same way as the `@` aura, under which they nest.
+The default printing of user-defined auras as hexadecimal number without separators tempts the programmer to exploit this behaviour as a convenient way to print hexadecimal atoms. It is also inconsistent with the rendering of the empty aura `@` as an unsigned decimal atom. Rather, user-defined auras should render in the same way as the `@` aura, under which they nest.
+
+## Specification supplement
+
+### @c atom specification
+
+In UTF-32LE encoding form, each Unicode scalar value is assigned a 32-bit number with the same value as the Unicode scalar value. This number is then encoded in a Hoon atom in little-endian byte order. A @c atom is a sequence of Unicode scalar values encoded in this manner.
+
+The maximum Unicode scalar value is `0x10FFFF`. Since surrogate code points are not in the set of Unicode scalar values, the code points range `0xD800..0xDFFF` is excluded.
+
+### @t atom specification
+In UTF-8 encoding form, each Unicode scalar value is assigned a sequence of one to four bytes, following Definition 92 of the Unicode Standard. A @t atom is a sequence of bytes resulting from encoding a sequence of Unicode scalar values in this manner.
+
+Surrogate code points are likewise excluded. In addition, the control code points `0x00..0x09` and `0x0B..0x1F` are excluded, but the newline control code point `0x0A` is included.
+
+The mapping between Unicode scalar values and UTF-8 byte sequences is as follows.
+#### UTF-8 encoding of code points
+
+| Scalar Value                  | First Byte | Second Byte | Third Byte | Fourth Byte |
+| ----------------------------- | ---------- | ----------- | ---------- | ----------- |
+| 0000.0000 0xxx.xxxx           | 0xxxxxxx   | <br>        |            |             |
+| 0000.0yyy yyxx.xxxx           | 110yyyyy   | 10xxxxxx    |            |             |
+| zzzz.yyyy yyxx.xxxx           | 1110zzzz   | 10yyyyyy    | 10xxxxxx   |             |
+| 000u.uuuu zzzz.yyyy yyxx.xxxx | 11110uuu   | 10uuzzzz    | 10yyyyyy   | 10xxxxxx    |
+#### Well-formed UTF-8 byte sequences
+
+| Code Points              | First Byte | Second Byte | Third Byte | Fourth Byte |
+| ------------------------ | ---------- | ----------- | ---------- | ----------- |
+| U+000A, U+0020..U+007F   | 0A, 20..7F |             |            |             |
+| U+0080..U+07FF           | C2..DF     | 80..BF      |            |             |
+| U+0800..U+0FFF           | E0         | ==A0==..BF  | 80..BF     |             |
+| U+1000..U+CFFF           | E1..EC     | 80..BF      | 80..BF     |             |
+| U+D000..U+D7FF           | ED         | 80..==9F==  | 80..BF     |             |
+| U+E000..U+FFFF           | EE..EF     | 80..BF      | 80..BF     |             |
+| U+1.0000..U+3.FFFF       | F0         | ==90==..BF  | 80..BF     | 80..BF      |
+| U+4.0000..U+F.FFFF       | F1..F3     | 80..BF      | 80..BF     | 80..BF      |
+| U+10.0000..U+10.FFFF<br> | F4         | 80..==8F==  | 80..BF     | 80..BF      |
 
+In the above table, byte ranges for the second byte that are different from the usual range `80..BF` are highlighted. The structure of above table is a consequence of imposed code point range restrictions, as well as of mapping bits of a particular code point byte to resulting encoded bytes in such a way that the former are spread across byte boundaries.
 ## Backward Compatibility
 
 - The stricter aura syntax, with only single trailing uppercase bit-width annotation allowed, is currently obeyed across all standard desks.
+- The change in `+fitz` is not backward compatible. Code relying on casting to a compatible, but not nesting aura will crash.
 - The redefined `@ta` aura will cover a wider range of strings and is therefore backward-compatible.
 - With the change of default printing of user-defined auras, any code relying on it will malfunction.
 
-## Acknowledgements 
+## Acknowledgements
 
-~rovnys-ricfer proposed extending the aura syntax with minor part. 
+~rovnys-ricfer proposed extending the aura syntax with minor part.
 
 ## Copyright
 
-Copyright and related rights waived via [CC0](../LICENSE.md).
+Copyright and related rights waived via [CC0](../LICENSE.md).