bminer/schemer

schemer

Lightweight and robust data encoding library for Go

Schemer provides an API to construct schemata that describe data structures; a schema is then used to encode and decode values into sequences of bytes to be sent over the network or written to a file.

Schemer seeks to be an alternative to protobuf or Avro, but it can also be used as a substitute for JSON.

Features

  • Compact binary data format
  • High-speed encoding and decoding
  • Forward and backward compatibility
  • No code generation and no new language to learn
  • Simple and lightweight library with no external dependencies
  • Supports custom encoding for user-defined data types
  • JavaScript library for web browser interoperability (coming soon!)

Why?

Schemer is an attempt to further simplify data encoding. Unlike other encoding libraries that use interface description languages (e.g. protobuf), schemer allows developers to construct schemata programmatically with an API. Rather than generating code from a schema, a schema can be constructed from code. In Go, schemata can be generated from Go types using the reflection library. This adds a surprising amount of flexibility and extensibility to the encoding library.
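To make the "schema from code" idea concrete, here is a minimal sketch that derives a schema description from a Go struct with the standard `reflect` package. It does not use schemer's API: `schemaOf`, `schema`, `field`, and `person` are hypothetical names invented for this illustration, the output layout simply mirrors the JSON schema format described later in this README, and lower-casing the first letter of each field name is an assumption made to follow the camelCase recommendation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"strings"
)

// field and schema are hypothetical types for this sketch only.
type field struct {
	Name   string `json:"name"`
	Type   string `json:"type"`
	Signed *bool  `json:"signed,omitempty"`
	Bits   int    `json:"bits,omitempty"`
}

type schema struct {
	Type   string  `json:"type"`
	Fields []field `json:"fields"`
}

// person is the example struct used throughout this sketch.
type person struct {
	FirstName string
	LastName  string
	Age       uint8
}

func boolPtr(b bool) *bool { return &b }

// schemaOf describes the exported fields of a struct value using reflection.
// Only strings and fixed-size integers are handled in this sketch.
func schemaOf(v interface{}) (schema, error) {
	t := reflect.TypeOf(v)
	if t == nil || t.Kind() != reflect.Struct {
		return schema{}, fmt.Errorf("expected a struct, got %v", t)
	}
	s := schema{Type: "object"}
	for i := 0; i < t.NumField(); i++ {
		sf := t.Field(i)
		if sf.PkgPath != "" { // skip unexported fields
			continue
		}
		// Assumption: lower-case the first letter to get a camelCase name.
		f := field{Name: strings.ToLower(sf.Name[:1]) + sf.Name[1:]}
		switch sf.Type.Kind() {
		case reflect.String:
			f.Type = "string"
		case reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
			f.Type, f.Signed, f.Bits = "int", boolPtr(true), sf.Type.Bits()
		case reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64:
			f.Type, f.Signed, f.Bits = "int", boolPtr(false), sf.Type.Bits()
		default:
			return schema{}, fmt.Errorf("unsupported field type %s", sf.Type)
		}
		s.Fields = append(s.Fields, f)
	}
	return s, nil
}

func main() {
	s, err := schemaOf(person{})
	if err != nil {
		panic(err)
	}
	out, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the schema is computed at run time from the type itself, nothing needs to be regenerated when a struct changes, which is the flexibility the paragraph above refers to.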

Here's how schemer stacks up against other encoding formats:

Property JSON XML MessagePack Protobuf Thrift Avro Gob Schemer
Human-Readable ✔️ 😐
Support for Many Programming Languages ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Widely Adopted ✔️ ✔️ ✔️
Precise Encoding of Numbers 😐 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Binary Strings ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Compact Encoded Payload ❌❌ ❌❌ ✔️ ✔️ ✔️ ✔️ ✔️
Fast Encoding / Decoding ✔️ ✔️ 😐 😐
Backward Compatibility ✔️ ✔️ ✔️ 😐 😐 ✔️ 😐 ✔️
Forward Compatibility ✔️ ✔️ ✔️ 😐 😐 ✔️ 😐 ✔️
No Language To Learn ✔️ ✔️ ✔️ 😐 ✔️ ✔️
Schema Support 😐 😐 ✔️ ✔️ ✔️ ✔️
Supports Fixed-field Objects ✔️ ✔️ ✔️ ✔️ ✔️
Works on Web Browser ✔️ ✔️ ✔️ ✔️ 😐 ✔️ 📆 soon…

The table above is intended to guide the reader toward an encoding format based on their requirements, but the evaluations of these encoding formats are, of course, rather subjective. Please feel free to open an issue if you feel something should be adjusted and/or corrected.

Types

schemer uses type information provided by the schema to encode values. The following are all of the types that are supported:

  • Integer
    • Can be signed or unsigned
    • Fixed-size or variable-size [1]
      • Fixed-size integers can be 8, 16, 32, or 64 bits
  • Floating-point number (32 or 64-bit)
  • Complex number (64 or 128-bit)
  • Boolean
  • Enumeration
  • String
    • Can support any encoding, including UTF-8 and binary
    • Fixed-size or variable-size [2]
  • Array
    • Fixed-size or variable-size
  • Object w/fixed fields (i.e. struct)
  • Object w/variable fields (i.e. map)
  • Schema (i.e. a schemer schema)
  • Dynamically-typed value (i.e. variant)
  • User-defined types
    • A few common types are provided for representing timestamps, time durations, IP addresses, UUIDs, regular expressions, etc.
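The difference between fixed-size and variable-size encodings is easiest to see in bytes. The sketch below uses only Go's standard library and is not schemer's implementation: the base-128 varint from `encoding/binary` is a stand-in for whatever variable-size integer format schemer actually uses (an assumption), and `encodeVarUint`, `encodeFixedUint32`, and `fixedString` are hypothetical helper names. The null-byte padding of fixed-size strings follows the footnote on string types.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeVarUint encodes v as a base-128 varint: small values take fewer bytes.
func encodeVarUint(v uint64) []byte {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, v)
	return buf[:n]
}

// encodeFixedUint32 always produces 4 bytes, regardless of the value.
// Little-endian byte order is an arbitrary choice for this sketch.
func encodeFixedUint32(v uint32) []byte {
	buf := make([]byte, 4)
	binary.LittleEndian.PutUint32(buf, v)
	return buf
}

// fixedString pads s with trailing null bytes up to length bytes.
func fixedString(s string, length int) ([]byte, error) {
	if len(s) > length {
		return nil, fmt.Errorf("string of %d bytes does not fit in %d bytes", len(s), length)
	}
	buf := make([]byte, length) // zero-filled, so the padding is already in place
	copy(buf, s)
	return buf, nil
}

func main() {
	for _, v := range []uint64{5, 300, 1 << 40} {
		fmt.Printf("%d: varint uses %d byte(s)\n", v, len(encodeVarUint(v)))
	}
	fmt.Printf("fixed uint32 always uses %d bytes\n", len(encodeFixedUint32(5)))

	padded, err := fixedString("abc", 5)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", padded)
}
```

Variable-size integers tend to win when most values are small; fixed-size fields give every record the same layout at the cost of some wasted bytes.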

Schema JSON Specification

Types

| Type | JSON Type Name | Additional Options |
|---|---|---|
| Fixed-size Integer | int | signed - boolean indicating if integer is signed or unsigned; bits - one of the following numbers indicating the size of the integer: 8, 16, 32, 64, 128, 256, 512, 1024. Note: integers larger than 64 bits are not fully supported |
| Variable-size Integer | int | signed - boolean indicating if integer is signed or unsigned; bits - must be null or omitted |
| Floating-point Number | float | bits - one of the following numbers indicating the size of the floating-point number: 32, 64 |
| Complex Number | complex | bits - one of the following numbers indicating the size of the complex number: 64, 128 |
| Boolean | bool | |
| Enum | enum | values - an object mapping strings to integer values |
| Fixed-Length String | string | length - the length of the string in bytes |
| Variable-Length String | string | length - must be null or omitted |
| Fixed-Length Array | array | length - the length of the array (number of elements) |
| Variable-Length Array | array | length - must be null or omitted |
| Object w/fixed fields | object | fields - an array of fields. Each field is a type object with keys: name [3], type, and any additional options for the type |
| Object w/variable fields | object | fields - must be null or omitted |
| Variant | variant | |

Example

Here's a struct with three fields:

  • firstName (string)
  • lastName (string)
  • age (uint8 - unsigned integer requiring a single byte)

Its schema, expressed in JSON:

{
  "type": "object",
  "fields": [
    {
      "name": "firstName",
      "type": "string"
    }, {
      "name": "lastName",
      "type": "string"
    }, {
      "name": "age",
      "type": "int",
      "signed": false,
      "bits": 8
    }
  ]
}
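Since the schema is plain JSON, any JSON parser can read it. The following sketch loads the example above with Go's `encoding/json` and prints a short description of each field. The `TypeSpec` and `FieldSpec` types and the `parseSchema` and `describe` functions are hypothetical names for this illustration, not schemer's own types, and treating an omitted `signed` key as "signed" is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TypeSpec mirrors the keys listed in the specification table.
// Pointers distinguish "null or omitted" from an explicit value.
type TypeSpec struct {
	Type   string      `json:"type"`
	Signed *bool       `json:"signed"`
	Bits   *int        `json:"bits"`
	Length *int        `json:"length"`
	Fields []FieldSpec `json:"fields"`
}

// FieldSpec is a type object with an additional name key.
type FieldSpec struct {
	Name string `json:"name"`
	TypeSpec
}

const personSchema = `{
  "type": "object",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int", "signed": false, "bits": 8}
  ]
}`

// parseSchema decodes a JSON schema document into a TypeSpec.
func parseSchema(data []byte) (TypeSpec, error) {
	var t TypeSpec
	err := json.Unmarshal(data, &t)
	return t, err
}

// describe returns a short human-readable summary of a type.
func describe(t TypeSpec) string {
	switch t.Type {
	case "int":
		prefix := "int" // assumption: signed when the key is omitted
		if t.Signed != nil && !*t.Signed {
			prefix = "uint"
		}
		if t.Bits == nil {
			return "variable-size " + prefix
		}
		return fmt.Sprintf("%s%d", prefix, *t.Bits)
	case "string", "array":
		if t.Length == nil {
			return "variable-length " + t.Type
		}
		return fmt.Sprintf("fixed-length %s (%d)", t.Type, *t.Length)
	default:
		return t.Type
	}
}

func main() {
	spec, err := parseSchema([]byte(personSchema))
	if err != nil {
		panic(err)
	}
	for _, f := range spec.Fields {
		fmt.Printf("%s: %s\n", f.Name, describe(f.TypeSpec))
	}
}
```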

Type Compatibility

When decoding values from one type to another, schemer employs the following compatibility rules. These rules, while rather opinionated, provide safe defaults when decoding values. Users who want to carefully craft how values are decoded from one type to another can simply create a custom type.

As a general rule, types are only compatible with themselves (e.g. boolean values can only be decoded to boolean values). The table below outlines a few notable exceptions and describes how using "weak" decoding mode can increase type compatibility by sacrificing type safety and by making a few assumptions.

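| Source \ Destination | int | float | complex | bool | enum | string | array (see #12) | object |
|---|---|---|---|---|---|---|---|---|
| int | ✔️ #1 | ✔️ #1 | ✔️ #1 | ❕ #6 | ❕ #7 | ❕ #9 | | |
| float | ✔️ #1 | ✔️ #1 | ✔️ #1 | | | ❕ #9 | | |
| complex | ✔️ #1 | ✔️ #1 | ✔️ #1 | | | ❕ #9 | ❕ #11 | |
| bool | ❕ #6 | | | ✔️ | | ❕ #10 | | |
| enum | ❕ #7 | | | | ✔️ #2 | ✔️ #2 | | |
| string | ❕ #8 | ❕ #8 | ❕ #8 | ❕ #10 | ✔️ #2 | ✔️ | | |
| array (see #12) | | | ❕ #11 | | | | ✔️ #3 | |
| object | | | | | | | | ✔️ #4 |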

Legend:
✔️ - indicates compatibility according to the specified rule
❕ - indicates compatibility according to the specified rule only if weak decoding is used
❌ - indicates that the source type cannot be decoded to the destination (excepting rule #12)

Compatibility Rules:

  1. Any number can be decoded to any other number, provided the decoded value can be stored into the destination without losing any precision. If weak decoding is specified, we loosen this restriction slightly by allowing floating-point and complex number conversions to lose precision.

    For example, if the number 3.14 is decoded, it can be stored as a float or complex number, but it cannot be stored as an integer. Similarly, the number 500 can be stored into a uint16 but not a uint8, since uint8 can only store values between 0 and 255.

  2. Enumerations are decoded to other enumerations by performing a case-insensitive match on the named value, not a match on the numeric value. If multiple matches occur, a case-sensitive match is then performed. Decoding fails if the decoded named value does not match a named value in the destination enumeration. Enumerations can also be converted to strings and vice-versa by matching on the enumeration's named value.

  3. Arrays can be decoded to arrays if the element type and array length are compatible. Specifically, when the destination array is of fixed-size and does not support null values, the decoded array must match exactly in length.

  4. Objects are decoded to other objects by performing a case-insensitive match on the key or field name. If multiple matches occur, a case-sensitive match is then performed. When the destination is an object with fixed fields and the decoded value does not have a matching key or field name, the key / field is simply skipped and will remain unchanged.

  5. Null values can only be decoded to destinations that support null values (e.g. pointers), but a non-null value can be decoded even if the destination does not support null values.
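Two of the rules above lend themselves to a short illustration: the lossless-number check from rule 1 and the name matching used by rules 2 and 4. The sketch below is not taken from schemer's source. `toUint8` and `matchName` are hypothetical helpers, and reporting "no match" when several names match case-insensitively but none matches exactly is an assumption, since the rules do not say what happens in that case.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// toUint8 converts f to a uint8 only when no information would be lost:
// the value must be a whole number within the range 0 to 255.
func toUint8(f float64) (uint8, error) {
	if f != math.Trunc(f) || f < 0 || f > math.MaxUint8 {
		return 0, fmt.Errorf("%v cannot be stored in a uint8 without loss", f)
	}
	return uint8(f), nil
}

// matchName picks the candidate that matches name. It first matches
// case-insensitively; if that is ambiguous, it falls back to an exact match.
func matchName(name string, candidates []string) (string, bool) {
	var folded []string
	for _, c := range candidates {
		if strings.EqualFold(c, name) {
			folded = append(folded, c)
		}
	}
	if len(folded) == 1 {
		return folded[0], true
	}
	for _, c := range folded {
		if c == name {
			return c, true
		}
	}
	return "", false // assumption: ambiguous or missing names do not match
}

func main() {
	for _, f := range []float64{200, 500, 3.14} {
		if v, err := toUint8(f); err != nil {
			fmt.Println("rejected:", err)
		} else {
			fmt.Println("accepted:", v)
		}
	}

	m, ok := matchName("firstname", []string{"FirstName", "LastName"})
	fmt.Println(m, ok)
	m, ok = matchName("ID", []string{"id", "ID"})
	fmt.Println(m, ok)
}
```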

The following compatibility rules apply for weak decoding only:

  6. The boolean value true can be converted to the integer value 1, and the boolean value false can be converted to the integer value 0. Similarly, the integer 0 will be decoded as false, and all other integers are decoded as true.
  7. Enumerations can be converted to integer values and vice-versa, and they are matched on the enumeration's numeric value.
  8. Strings can be decoded to numeric values by considering the string format according to the table below. The resulting numeric value is compatible with the destination according to the relevant compatibility rules.
  9. Numbers are always converted to strings in base 10.
  10. Boolean values true and false are converted to string values "true" and "false" respectively. Strings "1", "t", "T", "TRUE", "true", and "True" can be converted to the boolean value true. Strings "0", "f", "F", "FALSE", "false", and "False" can be converted to boolean value false.
  11. Complex numbers may be converted into 2-element arrays of floating-point numbers and vice-versa. The real part of the complex number will be matched with array element 0, and the imaginary part will be matched with array element 1.
  12. Single-element arrays can be decoded to a destination that is compatible with the array element and vice-versa.
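The weak-decoding conversions for booleans and complex numbers are small enough to sketch with the standard library alone. The helper names below are hypothetical and this is not schemer's code. Conveniently, Go's `strconv.ParseBool` accepts exactly the twelve strings listed in the boolean-to-string rule, so it is used here as a stand-in.

```go
package main

import (
	"fmt"
	"strconv"
)

// weakBoolFromInt treats 0 as false and every other integer as true.
func weakBoolFromInt(i int64) bool { return i != 0 }

// weakIntFromBool maps true to 1 and false to 0.
func weakIntFromBool(b bool) int64 {
	if b {
		return 1
	}
	return 0
}

// weakBoolFromString accepts "1", "t", "T", "TRUE", "true", "True",
// "0", "f", "F", "FALSE", "false", and "False"; anything else is an error.
func weakBoolFromString(s string) (bool, error) {
	return strconv.ParseBool(s)
}

// complexToPair stores the real part in element 0 and the imaginary part
// in element 1.
func complexToPair(c complex128) [2]float64 {
	return [2]float64{real(c), imag(c)}
}

// pairToComplex is the inverse of complexToPair.
func pairToComplex(p [2]float64) complex128 {
	return complex(p[0], p[1])
}

func main() {
	fmt.Println(weakBoolFromInt(42), weakIntFromBool(true))

	for _, s := range []string{"T", "false", "yes"} {
		b, err := weakBoolFromString(s)
		fmt.Println(s, b, err)
	}

	p := complexToPair(complex(2.34, 2))
	fmt.Println(p, pairToComplex(p))
}
```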

String to number decoding:

| String Example | Decoded As | Regular Expression |
|---|---|---|
| "-3.14" | Number, base 10 | `^[-+]?(0 |
| "0b1101" | Integer, base 2 | ^[-+]?0[bB][01]+$ |
| "0775" | Integer, base 8 | ^[-+]?0[oO]?[0-7]+$ |
| "0x2020" | Number, base 16 | ^[-+]?0[xX][0-9A-Fa-f]+(\.[0-9A-Fa-f]*)?([pP][+-]?[0-9A-Fa-f]+)?$ |
| "2.34 + 2i" | Complex number, base 10 | You don't want to see it, but here's the link. |
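As a rough approximation of the table above, Go's `strconv` package already understands the same base prefixes when `ParseInt` is called with base 0. The sketch below is only a stand-in, not schemer's parser: `parseNumber` is a hypothetical helper, and `strconv` does not accept exactly the same set of strings as the regular expressions (for example, it also allows underscores between digits and special values such as "inf"). Complex numbers are left out.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseNumber tries to read s as an integer first, letting the prefix select
// the base (0b = binary, 0 or 0o = octal, 0x = hexadecimal), and falls back
// to a floating-point number. It returns an int64 or a float64.
func parseNumber(s string) (interface{}, error) {
	if i, err := strconv.ParseInt(s, 0, 64); err == nil {
		return i, nil
	}
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return nil, fmt.Errorf("%q is not a recognised number: %w", s, err)
	}
	return f, nil
}

func main() {
	for _, s := range []string{"-3.14", "0b1101", "0775", "0x2020", "abc"} {
		v, err := parseNumber(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %v (%T)\n", s, v, v)
	}
}
```

Once the string has been turned into a number, the usual number compatibility rules decide whether it fits the destination.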

Credits

This library was created on April 14, 2021, the day of Bernie Madoff's death. What a schemer! May he rest in peace.

Special thanks to Benjamin Pritchard for his significant contributions to this library and for making it a reality.

Footnotes

  1. By default, integer types are encoded as variable integers, as this format will most likely generate the smallest encoded values.

  2. By default, string types are encoded as variable-size strings. Fixed-size strings are padded with trailing null bytes / zeros.

  3. It is strongly encouraged to use camelCase for object field names.
