Identifiers/spec

VERSION 0.4 Draft (Jan 1, 2019)

What are Identifiers?

Identifiers are data that distinguish a unique entity from other entities. The concept of identifiers has many uses in the world. In software, identifiers are found in every facet of development. Some types of identifiers, like UUID and URI, are standardized. Most identifiers, however, are not externally defined in a specification and depend on many factors specific to their application.

In practice, identifiers are serialized strings that must be interpreted, parsed, encoded and decoded along software system pathways. They transit multiple systems in many kinds of media, like JSON, emails and log files. Software that handles this data along the way has to know how to consume the identifier and interpret its value.

To illustrate this problem, consider a string identifier encoded with Base64. The generator of the identifier needs to convert their identifier value into a byte array or string. It transforms this array into Base64 and sends or stores this result. Later, another application encounters this Base64 string, and then must make several determinations:

  1. Is this string encoded?
  2. If so, how should it be decoded?
  3. Once decoded into a byte array, should it be transformed into another data structure?
  4. Once it is transformed, what are the semantics of the value?

The developer must find a source of truth to answer their questions about this multi-step process. Often docs are out-of-date, the developers are unavailable, or they provide incorrect guidance. This process is hard, error-prone and the source of many bugs, failures, and other negative outcomes.
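The ambiguity described above can be made concrete with a minimal sketch. The `produce` and `consume_*` functions below are hypothetical and are not part of the Identifiers spec; they assume a producer that packs a 64-bit integer as big-endian bytes before Base64-encoding it. A consumer that guesses the wrong interpretation decodes successfully but gets a meaningless value.

```python
import base64
import struct

def produce(account_id: int) -> str:
    """Hypothetical producer: 64-bit big-endian integer, then Base64."""
    raw = struct.pack(">q", account_id)
    return base64.b64encode(raw).decode("ascii")

def consume_as_long(token: str) -> int:
    """Consumer that knows the producer's convention."""
    raw = base64.b64decode(token)
    return struct.unpack(">q", raw)[0]

def consume_as_text(token: str) -> str:
    """Consumer that wrongly assumes the decoded bytes are text."""
    return base64.b64decode(token).decode("latin-1")

token = produce(1234567890123)
right = consume_as_long(token)   # round-trips to the original number
wrong = consume_as_text(token)   # eight unrelated characters, no error raised
```

Nothing in the token itself tells the second consumer that its guess was wrong, which is the gap a self-describing type code is meant to close.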

The Identifiers project hopes to tackle this problem by defining sharable identifier types that can be applied across software domains. It intends to make it simple to convert a data identifier into a string, transmit or store it, and then allow a different application to convert the encoded identifier back into a semantic data value for processing.

Identifier Types

Identifier types can be primitive values, semantic values or structured identifiers.

Primitive Identifiers

  • string (UTF-8)
  • boolean
  • integer (32-bit signed)
  • float (64-bit signed decimal, IEEE 754)
  • long (64-bit signed)
  • bytes (array of bytes)

Structural Variants

All identifier types have collection variants that hold multiple values of the type. The two collection types are list and map. Collections can only hold same-typed values at this time.

List

A list identifier is a list of values. It is not a list of identifiers, but a single identifier composed of multiple values of the same type.

Maps

A map identifier is a map of values. Maps are useful to create a single identifier composed of multiple labeled values of the same type. These values are labeled by the map keys. The keys are stored in alphabetically-sorted order for consistency.

Composite

Composite identifiers combine other identifiers of mixed types into a single identifier. One can combine primitive identifiers, semantic identifiers and structured variants together into one composite identifier. They can be either a list or a map of other identifiers.

Semantic Identifiers

Semantic identifiers are based on either single or structured primitive identifiers. They can be considered to "extend" a base identifier type.

| type | base type | structure | notes |
|---|---|---|---|
| uuid | bytes | 16 bytes | Supports all uuid versions. https://en.wikipedia.org/wiki/Universally_unique_identifier |
| datetime | long | single value | Time in Unix/Posix Epoch, in milliseconds. https://en.wikipedia.org/wiki/Epoch_(computing) |
| geo | float-list | [latitude, longitude] | Decimal latitude & longitude. https://en.wikipedia.org/wiki/Geotagging |

Future Possibilities

If you have suggestions please file an Issue to start a discussion.

Cross-Version Consumption

Semantic identifiers are guaranteed safe passage through older systems that do not understand the semantics of the identifier. Such systems can consume a semantic identifier, parse its data, and pass it through to another system without losing the semantic type information.

As an example, if a system encounters an IPv6 semantic identifier but has no explicit support for IPv6 identifiers, it will interpret the value as its base identifier type, which is a fixed list of 2 longs. If this system then passes the identifier on to another system that does understand IPv6 identifiers, that system will interpret it as an IPv6 identifier. The IPv6 type information is not lost along the way.
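The fallback behavior can be sketched as follows. This is an illustration under assumptions, not the reference implementation: the `KNOWN_SLOTS` registry and `interpret` function are hypothetical, and the IPv6 slot number (9) is invented for the example because this document does not assign one. The bit layout used (semantic flag `0x80`, slot in the second byte) follows the Type Codes section.

```python
SEMANTIC_FLAG = 0x80
BASE_MASK = 0x7F  # primitive bits plus structural flags in byte 1

# Hypothetical registry of the semantic slots this (older) consumer understands.
KNOWN_SLOTS = {0: "uuid", 1: "datetime", 2: "geo"}

def interpret(type_code: int, value):
    """Return (type_code, label, value); the type code is carried through unchanged."""
    slot = (type_code >> 8) & 0xFF
    if type_code & SEMANTIC_FLAG and slot in KNOWN_SLOTS:
        label = KNOWN_SLOTS[slot]
    else:
        # Unknown semantics: fall back to the base type encoded in byte 1.
        label = f"base:{type_code & BASE_MASK:#x}"
    return type_code, label, value

# Hypothetical IPv6 code: long (0x4) | list (0x8) | semantic (0x80) | slot 9.
IPV6_CODE = 0x4 | 0x8 | SEMANTIC_FLAG | (9 << 8)
result = interpret(IPV6_CODE, [1, 2])
```

Because the full type code is returned untouched, a downstream system that does know the slot can still recover the semantic type.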

String Encoding

Identifiers have two forms of string encoding—Data and Human. These forms have different uses.

Data Form

The data form is intended for identifiers that go into transmitted data like JSON and XML, as well as data storage like a SQL database. Data-form identifiers are not intended for use in URIs and are not human-enterable, though they are composed of visible characters.

Identifiers serialized for data purposes are encoded with a Base128 symbol set for minimum size bloat and safe transferability.

Human Form

Identifiers are often consumed and entered by humans and thus have different constraints. Examples of this form are account identifiers, URLs and serial numbers. These identifiers are often encountered in messages like text and email. Human-form identifiers are encoded with Base32; the details are in the Base32 specification.

Implementations

The following projects implement the Identifiers specification:

Implementation Requirements

This section applies to library authors who build implementations of the Identifiers spec for platforms of their choosing.

Primitive Identifiers

The primitive identifiers should map to existing platform types. Most platforms have string, boolean, and the other primitive types natively implemented. If one is not available, the implementer is encouraged to build the type support into the library rather than requiring the consumer to explicitly use a third-party library. For instance, JS does not support a full 64-bit long value, so the JS implementation uses a popular Long library to support the long number space.

Type Codes

All primitive identifier types are associated with a 1-byte type code. Semantic identifiers have a second type code to identify themselves. The type codes are calculated with bitwise operators to accumulate the various flags that compose their full value.

Byte 1 Positions

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| primitive | primitive | primitive | list | map | list-of | map-of | semantic |
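A small sketch of reading these positions back out of a type code. It assumes position 0 is the least-significant bit, which is consistent with the list flag being `0x8` and the semantic flag being `0x80` elsewhere in this document. The function name `describe_byte1` is hypothetical, and composite codes (`0x38`, `0x58`) are deliberately not handled here.

```python
PRIMITIVE_MASK = 0x07  # positions 0-2

# Flag name -> bit value, in position order 3..7.
FLAGS = {
    "list": 0x08,
    "map": 0x10,
    "list-of": 0x20,
    "map-of": 0x40,
    "semantic": 0x80,
}

def describe_byte1(type_code: int) -> dict:
    """Split the first byte of a type code into primitive bits and set flags."""
    byte1 = type_code & 0xFF
    return {
        "primitive_bits": byte1 & PRIMITIVE_MASK,
        "flags": [name for name, bit in FLAGS.items() if byte1 & bit],
    }

geo = describe_byte1(0x28B)
```

For the geo code `0x28b`, the first byte is `0x8b`, so the sketch reports primitive bits 3 (float) with the list and semantic flags set.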

Structured Variants

All identifier types also have structured variants that hold their values in collections. Their type codes combine the structural flags and the type code of the value. To create the full type code, bitwise-OR (|) the appropriate structural type code with the base primitive type code.

| type | code | MsgPack family |
|---|---|---|
| list | 0x8 | array |
| map | 0x10 | map |

MsgPack

String-encoded identifiers are compressed using MsgPack. More details are in the following section, but the related MsgPack information is included in the type tables for easy reference.

Primitive Types

Here are the type codes for primitive types, as well as their list and map structured types.

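| type | code | MsgPack family | list | map |
|---|---|---|---|---|
| string | 0x0 | string | 0x8 | 0x10 |
| boolean | 0x1 | bool | 0x9 | 0x11 |
| integer | 0x2 | int | 0xa | 0x12 |
| float | 0x3 | float | 0xb | 0x13 |
| long | 0x4 | int | 0xc | 0x14 |
| bytes | 0x5 | bin | 0xd | 0x15 |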
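The list and map columns follow mechanically from the base codes. As a minimal sketch (the names `PRIMITIVE_CODES` and `structured_codes` are illustrative, not part of the spec), the table can be regenerated by OR-ing each base code with the structural flags:

```python
PRIMITIVE_CODES = {
    "string": 0x0,
    "boolean": 0x1,
    "integer": 0x2,
    "float": 0x3,
    "long": 0x4,
    "bytes": 0x5,
}
LIST_FLAG = 0x08
MAP_FLAG = 0x10

def structured_codes(name: str) -> tuple:
    """Return (base, list, map) type codes for a primitive type name."""
    base = PRIMITIVE_CODES[name]
    return base, base | LIST_FLAG, base | MAP_FLAG

# Print one row per primitive type: name, base code, list code, map code.
for name in PRIMITIVE_CODES:
    base, as_list, as_map = structured_codes(name)
    print(f"{name:8} {base:#x} {as_list:#x} {as_map:#x}")
```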

Composite Types

Composite identifiers can be either lists or maps of other identifiers. Composite identifiers are not typed with primitive type flags. They contain fully-formed identifiers of any type. They can be used to define Semantic identifiers.

When encoded to MsgPack, the outer type is either composite-list or composite-map. The contents of composite identifiers are fully-encoded identifiers.

| type | code | MsgPack family |
|---|---|---|
| composite-list | 0x38 | array |
| composite-map | 0x58 | map |

Semantic Types

Semantic identifiers have 2-byte type values. The first byte holds the primitive and structural information, and the second byte holds the "slot" number. The integer type value is computed by starting with the base type (including structural type), OR-ing in the semantic flag (0x80), and then OR-ing in the slot number shifted left by 8 bits. The left shift pushes the slot number into the second byte. For example, the geo type code is calculated like this:

| float | list | semantic | slot |
|---|---|---|---|
| 0x3 | 0x8 | 0x80 | 2 << 0x8 |

0x3 | 0x8 | 0x80 | (2 << 0x8) = 0x28b
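The same calculation as a short sketch. The function name `semantic_type_code` is hypothetical, and the one-byte bound on the slot is an assumption drawn from the statement that the slot occupies the second byte.

```python
SEMANTIC_FLAG = 0x80

def semantic_type_code(base_code: int, slot: int) -> int:
    """Combine a base (primitive + structural) code with a semantic slot number."""
    if not 0 <= slot <= 0xFF:
        raise ValueError("slot must fit in one byte")
    return base_code | SEMANTIC_FLAG | (slot << 8)

FLOAT, LONG, BYTES, LIST = 0x3, 0x4, 0x5, 0x8

geo_code = semantic_type_code(FLOAT | LIST, 2)
datetime_code = semantic_type_code(LONG, 1)
uuid_code = semantic_type_code(BYTES, 0)
```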

The following table lists the defined semantic types:

| type | slot | code | MsgPack format | list | map |
|---|---|---|---|---|---|
| uuid | 0 | 0x85 | bin 16 size 16 | 0x8d | 0x95 |
| datetime | 1 | 0x184 | int | 0x18c | 0x194 |
| geo | 2 | 0x28b | fixarray size 2 floats | 0x2ab | 0x2cb |

List/Map of Structured Semantic Identifiers

It is possible for a semantic identifier's base type to be a list or map of primitives. An example of this is the geo identifier. In order to create a list or map of these identifiers, the structured types must be marked as either a list-of or map-of the semantic identifier. These type code addenda are defined in the following table:

| type | code | MsgPack family |
|---|---|---|
| list-of | 0x20 | array[semantic] |
| map-of | 0x40 | map[semantic] |

For example, to create the type code for a list of geos, set the list-of flag bit (0x20):

0x28b | 0x20 = 0x2ab
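Comparing this with the semantic types table suggests a rule: a semantic type whose base is a single primitive (uuid, datetime) takes the plain list or map flag, while one whose base is already structured (geo) takes list-of or map-of. The sketch below encodes that reading. It is an inference from the tables in this document rather than a stated rule, and the helper names are hypothetical.

```python
LIST_FLAG, MAP_FLAG = 0x08, 0x10
LIST_OF_FLAG, MAP_OF_FLAG = 0x20, 0x40

def _is_structured(code: int) -> bool:
    """True when the base type is already a list or map of primitives."""
    return bool(code & (LIST_FLAG | MAP_FLAG))

def list_variant(code: int) -> int:
    """Type code for a list of the given semantic identifier."""
    return code | (LIST_OF_FLAG if _is_structured(code) else LIST_FLAG)

def map_variant(code: int) -> int:
    """Type code for a map of the given semantic identifier."""
    return code | (MAP_OF_FLAG if _is_structured(code) else MAP_FLAG)

UUID, DATETIME, GEO = 0x85, 0x184, 0x28B
geo_list = list_variant(GEO)
```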

Encoding Format

In order to encode an identifier, one must first encode it to bytes using the MsgPack encoding format. These bytes are then encoded using either Base128 for data uses or Base32 for human uses. Implementations will auto-detect the encoding format and decode into an identifier value correctly.

MsgPack

Internally, encoded identifiers are compressed MsgPack data structures. In order to interoperate with MsgPack correctly, one must pass the MsgPack encoder the following array:

[type-code, identifier-encode-value]

Each identifier type has a specific encode value shape that must be met. Implementations will often have platform-specific formats of the identifier values, like native class representations, but these must be transformed into formats that are usable by MsgPack.
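To show what the two-element array looks like on the wire, here is a hand-rolled sketch covering only a tiny subset of MsgPack (fixarray, positive fixint, bool, fixstr). A real implementation would use a MsgPack library. The helper names are hypothetical, and the encode-value shapes for string and boolean are assumed to be the plain MsgPack string and bool, in line with the MsgPack family column of the primitive types table.

```python
def pack_small_uint(n: int) -> bytes:
    """MsgPack positive fixint: a single byte for 0..127."""
    if not 0 <= n <= 0x7F:
        raise ValueError("sketch only handles type codes below 0x80")
    return bytes([n])

def pack_bool(value: bool) -> bytes:
    """MsgPack true is 0xc3, false is 0xc2."""
    return b"\xc3" if value else b"\xc2"

def pack_short_str(text: str) -> bytes:
    """MsgPack fixstr: 0xa0 | length, then UTF-8 bytes (length up to 31)."""
    data = text.encode("utf-8")
    if len(data) > 31:
        raise ValueError("sketch only handles strings up to 31 bytes")
    return bytes([0xA0 | len(data)]) + data

def pack_identifier(type_code: int, packed_value: bytes) -> bytes:
    """MsgPack fixarray of two elements: [type-code, identifier-encode-value]."""
    return b"\x92" + pack_small_uint(type_code) + packed_value

STRING, BOOLEAN = 0x0, 0x1
bool_bytes = pack_identifier(BOOLEAN, pack_bool(True))
str_bytes = pack_identifier(STRING, pack_short_str("abc"))
```

These bytes are what would then be passed to the Base128 or Base32 encoder. Semantic type codes exceed `0x7f` and would need MsgPack's wider unsigned integer formats, which this sketch omits.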

Most MsgPack implementations have cross-platform quirks that will require fine-tuning or even fixing. For instance, the Java version of MsgPack treats all doubles as FLOAT64, while other platforms encode float values as either FLOAT32 or FLOAT64. The Java version of Identifiers therefore has to manually emit FLOAT32 for doubles that fit in single precision. The Test Compatibility Kit will aid the implementer in discovering and mitigating their platform's MsgPack anomalies.
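One way to make this choice is to check whether a double survives a round trip through single precision. The sketch below assumes the intended rule is "use FLOAT32 whenever it is lossless, otherwise FLOAT64"; that rule is an assumption, and the Test Compatibility Kit remains the authority on what an implementation must emit. The function names are hypothetical; the marker bytes `0xca` (float 32) and `0xcb` (float 64) come from the MsgPack format.

```python
import struct

def fits_float32(x: float) -> bool:
    """True if x is exactly representable as an IEEE 754 single."""
    try:
        return struct.unpack(">f", struct.pack(">f", x))[0] == x
    except OverflowError:
        # Magnitude too large for single precision.
        return False

def pack_float(x: float) -> bytes:
    """Emit MsgPack float 32 when lossless, otherwise float 64."""
    if fits_float32(x):
        return b"\xca" + struct.pack(">f", x)
    return b"\xcb" + struct.pack(">d", x)

compact = pack_float(1.5)  # exactly representable in single precision
wide = pack_float(0.1)     # needs double precision to round-trip
```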

Cross-Implementation Compatibility

It is expected that encoded identifiers created in one system will be consumed in another system of a different architecture. For instance, a Java server will encode an Identifier that will be consumed by a JavaScript client. To support this goal, all implementations must pass the Test Compatibility Kit.
