doc: create "Why NestedText?" section
The index page had a lot of good material, but it was getting so long
that I worried people wouldn't read it.  To help with that, I moved some
of the content into a new section devoted to making the case for
*NestedText*.

This commit also adds a section on TOML to the "Alternatives" page.
kalekundert committed Oct 7, 2020
1 parent ab737a0 commit 7c17cdb
Showing 6 changed files with 166 additions and 133 deletions.
162 changes: 30 additions & 132 deletions README.rst
@@ -82,141 +82,39 @@ The format holds dictionaries (ordered collections of name/value pairs), lists
(ordered collections of values) and strings (text) organized hierarchically to
any depth. Indentation is used to indicate the hierarchy of the data, and
a simple natural syntax is used to distinguish the types of data in such
a manner that it is not easily confused. Specifically, lines that begin with
a word or words followed by a colon are dictionary items; a dash introduces list
items, and a leading greater-than symbol signifies a line in a multi-line
string. Dictionaries and lists are used for nesting and the leaf values are
always simple text, hence the name, *NestedText*. The top-level must be
a dictionary.
a manner that it is not easily confused. Specifically, lines that begin with a
word (or words) followed by a colon are dictionary items, lines that begin with
a dash are list items, and lines that begin with a greater-than sign are part
of a multi-line string. Dictionaries and lists can be nested arbitrarily, and
the leaf values are always text, hence the name *NestedText*.
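
These rules can be sketched as a toy line classifier. The ``classify``
function and the sample data below are hypothetical and written only for
illustration; the sketch ignores indentation and the finer points of the
format, and it is not how the *nestedtext* package actually parses files.

.. code-block:: python

    def classify(line):
        """Name the kind of line from how it begins (simplified rules)."""
        text = line.strip()
        if text == "-" or text.startswith("- "):
            return "list item"
        if text == ">" or text.startswith("> "):
            return "string line"
        if text.endswith(":") or ": " in text:
            return "dictionary item"
        return "other"

    # Hypothetical sample: a dictionary holding a string, a list,
    # and a multi-line string.
    sample = [
        "name: Example User",
        "emails:",
        "    - user@example.com",
        "    - admin@example.com",
        "address:",
        "    > 1 Example Street",
        "    > Springfield",
    ]
    kinds = [classify(line) for line in sample]

Whatever the kind of line, the value it carries stays a plain string.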

*NestedText* is somewhat unique in that the leaf values are always strings. Of
course the values start off as strings in the input file, but alternatives like
JSON or YAML aggressively convert those values into the underlying data types
such as integers, floats, and Booleans. For example, a value like 2.10 would be
converted to a floating point number. But making the decision to do so is based
purely on the form of the value, not the context in which it is found. This can
lead to misinterpretations. For example, assume that this value is the software
version number two point ten. By converting it to a floating point number it
becomes two point one, which is wrong. There are many possible versions of this
basic issue. But there is also the inverse problem; values that should be
converted to particular data types but are not recognized. For example, a value
of $2.00 should be converted to a real number but would be a string instead.
There are simply too many value types for a general purpose solution that is
only looking at the values themselves to be able to interpret all of them. For
example, 12/10/09 is likely a date, but is it in MM/DD/YY, YY/MM/DD or DD/MM/YY
form? The fact is, the value alone is often insufficient to reliably determine
how to convert values into internal data types. *NestedText* avoids these
problems by leaving the values in their original form and allowing the decision
to be made by the end application where more context is available to help guide
the conversions. If a price is expected for a value, then $2.00 would be
checked and converted accordingly. Similarly, local conventions along with the
fact that a date is expected for a particular value allows 12/10/09 to be
correctly validated and converted. This process of validation and conversion is
referred to as applying a schema to the data. There are packages such as
`Voluptuous <https://github.com/alecthomas/voluptuous>`_ and `Pydantic
<https://pydantic-docs.helpmanual.io>`_ available that make this process easy
and reliable.


The Zen of *NestedText*
-----------------------

*NestedText* aspires to be a simple dumb vessel that holds people's structured
data, and to do so in a way that allows people to easily interact with that
data.

The desire to be simple is an attempt to minimize the effort required to learn
and use the language. Ideally people can understand it by looking at one or two
examples and they can use it without needing to remember any arcane
rules and without relying on any of the knowledge that programmers accumulate
through years of experience. One source of simplicity is consistency. As such,
*NestedText* uses a small number of rules that it applies with few exceptions.

The desire to be dumb means that it tries not to transform the data in any
meaningful way. It allows you to recover the structure in your data without
doing anything that might change the interpretation of the data. Rather, it
tries to make it easy for you to interpret the data by managing the structure,
which allows you to analyze it in small easy to interpret pieces without making
any changes that would get in your way.


Alternatives
------------

There is no shortage of well-established alternatives to *NestedText* for
storing data in a human-readable text file. Probably the most obvious are `json
<https://docs.python.org/3/library/json.html>`_ and `YAML
<https://pyyaml.org/wiki/PyYAMLDocumentation>`_. Both are primarily intended to
be used as serialization languages. *NestedText* is not intended to be used as
a serialization language, rather it is more suitable for configuration and hand
generated and edited data files. In these applications, both *JSON* and *YAML*
have significant shortcomings.


JSON
""""

*JSON* is a subset of JavaScript suitable for holding data. Like *NestedText*,
it consists of a hierarchical collection of dictionaries, lists, and strings,
but also allows integers, floats, Booleans and nulls. The problem with *JSON*
for this application is that it is awkward. With all those data types it must
syntactically distinguish between them. For example, in *JSON* 32 is an
integer, 32.0 is the real version of 32, and "32" is the string version. These
distinctions are not meaningful and can be confusing to non-programmers. In
addition, in most datasets a majority of leaf values are strings and the
required quotes add substantial visual clutter. *NestedText* avoids these
issues by keeping all leaf values as unmodified strings; no need for quoting or
escaping. It is up to the application that employs *NestedText* as an input
format to use context to check these strings and convert them to the right
datatype.

*JSON* does not provide for multi-line strings and any special characters like
newlines are encoded with escape codes, which can make strings long and
difficult to interpret. Also, dictionary and list items must be separated with
commas, but a comma must not follow the last item. All of this results in *JSON*
being a frustrating format for humans to read, enter or edit.

*NestedText* has the following clear advantages over *JSON* as a human-readable
and writable data file format:

- text does not require quotes
- data is left in its original form
- comments
- multiline strings
- special characters without escaping them
- commas are not used to separate dictionary and list items


YAML
""""

*YAML* is considered by many to be a human friendly alternative to *JSON*, but
over time it has accumulated too many data types and too many formats. To
distinguish between all the various types and formats, a complicated and
non-intuitive set of rules developed. *YAML* at first appears very appealing
when used with simple examples, but things can quickly become complicated or
provide unexpected results. A reaction to this is the use of *YAML* subsets,
such as `StrictYAML <https://hitchdev.com/strictyaml>`_. However, the subsets
still try to maintain compatibility with *YAML* and so inherit much of its
complexity. For example, both *YAML* and *StrictYAML* support `nine different
ways of writing multi-line strings
<http://stackoverflow.com/a/21699210/660921>`_.

*YAML* avoids excessive quoting and supports comments and multiline strings.
Like *JSON*, it converts data to the underlying data types as appropriate.
Unlike *JSON*, however, the lack of quoting makes the format ambiguous, so it
has to guess at times, and small, seemingly insignificant details can affect
the result.

*NestedText* was inspired by *YAML*, but eschews its complexity. It has the
following clear advantages over *YAML* as a human-readable and writable data
file format:

- simple
- unambiguous (no implicit typing)
- data is left in its original form
- syntax is insensitive to special characters within text
- safe, no risk of malicious code execution
*YAML* or *TOML* aggressively convert those values into the underlying data
types such as integers, floats, and Booleans. For example, a value like 2.10
would be converted to a floating point number. But making the decision to do so
is based purely on the form of the value, not the context in which it is found.
This can lead to misinterpretations. For example, assume that this value is
the software version number two point ten. By converting it to a floating point
number it becomes two point one, which is wrong. There are many possible
versions of this basic issue. But there is also the inverse problem; values
that should be converted to particular data types but are not recognized. For
example, a value of $2.00 should be converted to a real number but would be a
string instead. There are simply too many value types for a general purpose
solution that is only looking at the values themselves to be able to interpret
all of them. For example, 12/10/09 is likely a date, but is it in MM/DD/YY,
YY/MM/DD or DD/MM/YY form? The fact is, the value alone is often insufficient
to reliably determine how to convert values into internal data types.
*NestedText* avoids these problems by leaving the values in their original form
and allowing the decision to be made by the end application where more context
is available to help guide the conversions. If a price is expected for a
value, then $2.00 would be checked and converted accordingly. Similarly, local
conventions along with the fact that a date is expected for a particular value
allows 12/10/09 to be correctly validated and converted. This process of
validation and conversion is referred to as applying a schema to the data.
There are packages such as `Pydantic <https://pydantic-docs.helpmanual.io>`_
and `Voluptuous <https://github.com/alecthomas/voluptuous>`_ available that
make this process easy and reliable.
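
As a rough illustration of the point above, the following sketch uses only the
Python standard library. The field names (``version``, ``price``, ``date``),
the ``convert`` helper, and the MM/DD/YY date convention are all assumptions
made for this example; a real application would more likely rely on one of the
schema packages.

.. code-block:: python

    from datetime import date, datetime
    from decimal import Decimal

    # Guessing from form alone loses information: "2.10" becomes 2.1.
    as_float = float("2.10")

    def convert(raw):
        """Convert string leaf values using what the application expects.

        Hypothetical helper: the field names and the MM/DD/YY convention
        are assumptions for this example.
        """
        return {
            # A version is a sequence of integers, not a real number.
            "version": tuple(int(part) for part in raw["version"].split(".")),
            # A price is an exact decimal with an optional currency sign.
            "price": Decimal(raw["price"].lstrip("$")),
            # The application knows which date convention applies.
            "date": datetime.strptime(raw["date"], "%m/%d/%y").date(),
        }

    # Every leaf value arrives as a string, as it would from NestedText.
    raw = {"version": "2.10", "price": "$2.00", "date": "12/10/09"}
    settings = convert(raw)

Because the application decides how each field is interpreted, the version
stays "two point ten" and the date is read with the expected convention.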


Issues
102 changes: 102 additions & 0 deletions doc/alternatives.rst
@@ -0,0 +1,102 @@
************
Alternatives
************

There is no shortage of well-established alternatives to *NestedText* for
storing configuration data in a human-readable text file. The features and
shortcomings of some of these alternatives are discussed below:

JSON
====

JSON_ is a subset of JavaScript suitable for holding data. Like *NestedText*,
it consists of a hierarchical collection of dictionaries, lists, and strings,
but also allows integers, floats, Booleans and nulls. The fundamental problem
with *JSON* in this context is that it's meant for serializing and exchanging
data between programs; it's not meant for configuration files. Of course, it's
used for this purpose anyway, where it has a number of glaring shortcomings.

To begin, it has an excessive amount of syntactic clutter. Dictionary keys and
strings both have to be quoted, commas are required between dictionary and list
items (but forbidden after the last item), braces are required around
dictionaries, etc. Features that would improve clarity are also lacking.
Comments are not allowed, multiline strings are not supported, and whitespace
is insignificant (leading to the possibility that the appearance of the data
may not match its true structure). More conceptually, it is the responsibility
of the user to provide data of the correct type (e.g. ``32`` vs. ``32.0`` vs.
``"32"``), even though the application already knows what type it expects. All
of this results in *JSON* being a frustrating format for humans to read, enter
or edit.
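
A small sketch with Python's standard ``json`` module shows both points. The
input strings are made up for this example.

.. code-block:: python

    import json

    # The same "32" is three different values depending on how it is written.
    values = json.loads('[32, 32.0, "32"]')
    kinds = [type(value).__name__ for value in values]

    # A comma after the last item is an error rather than being ignored.
    try:
        json.loads('{"name": "value",}')
        trailing_comma_ok = True
    except json.JSONDecodeError:
        trailing_comma_ok = False

Here ``kinds`` holds ``int``, ``float`` and ``str``, so the person writing the
file has to pick the type, even though the application already knows what it
expects.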

*NestedText* has the following clear advantages over *JSON* as a human-readable
and writable data file format:

- text does not require quotes
- data is left in its original form
- comments
- multiline strings
- special characters without escaping them
- commas are not used to separate dictionary and list items

YAML
====

YAML_ is considered by many to be a human friendly alternative to *JSON*, but
over time it has accumulated too many data types and too many formats. To
distinguish between all the various types and formats, a complicated and
non-intuitive set of rules developed. *YAML* at first appears very appealing
when used with simple examples, but things can quickly become complicated or
provide unexpected results. A reaction to this is the use of *YAML* subsets,
such as StrictYAML_. However, the subsets still try to maintain compatibility
with *YAML* and so inherit much of its complexity. For example, both *YAML* and
*StrictYAML* support `nine different ways of writing multi-line strings
<http://stackoverflow.com/a/21699210/660921>`_.

*YAML* avoids excessive quoting and supports comments and multiline strings.
Like *JSON*, it converts data to the underlying data types as appropriate.
Unlike *JSON*, however, the lack of quoting makes the format ambiguous, so it
has to guess at times, and small, seemingly insignificant details can affect
the result.
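
The kind of guessing described above can be sketched in a few lines of Python.
The ``guess`` function is a toy written for this page and is not *YAML*'s
actual resolution rules, but the surprises are of the same kind: under YAML
1.1, for instance, an unquoted ``no`` is read as a Boolean.

.. code-block:: python

    def guess(text):
        """Guess a type from the form of the value alone (toy rules)."""
        if text.lower() in ("yes", "true", "on"):
            return True
        if text.lower() in ("no", "false", "off"):
            return False
        for convert in (int, float):
            try:
                return convert(text)
            except ValueError:
                pass
        return text  # nothing matched, so keep the string

    # Hypothetical values: a version, a country code, an ID, and a price.
    guesses = {text: guess(text) for text in ["2.10", "no", "0123", "$2.00"]}

The version loses its trailing zero, the country code becomes ``False``, the
ID loses its leading zero, and the price is left as a string.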

*NestedText* was inspired by *YAML*, but eschews its complexity. It has the
following clear advantages over *YAML* as a human-readable and writable data
file format:

- simple
- unambiguous (no implicit typing)
- data is left in its original form
- syntax is insensitive to special characters within text
- safe, no risk of malicious code execution

TOML
====

TOML_ is a configuration file format inspired by the well-known *INI* syntax.
It supports a number of basic data types (notably including dates and times)
using syntax that is more similar to *JSON* (explicit but verbose) than to
*YAML* (succinct but confusing). As discussed previously, though, this makes
it the responsibility of the user to specify the correct type for each field,
when it should be the responsibility of the application to convert each field
to the correct type.

Another flaw in TOML is that it is difficult to specify deeply nested
structures. The only way to specify a nested dictionary is to give the full
key to that dictionary, relative to the root of the entire hierarchy. This is
not much of a problem if the hierarchy only has 1-2 levels, but any more than that
and you find yourself typing the same long keys over and over. A corollary to
this is that TOML-based configurations do not scale well: increases in
complexity are often accompanied by disproportionate decreases in readability
and writability.
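
The repetition is easy to see by generating table headers from nested data.
The ``toml_lines`` helper and the configuration below are hypothetical; the
sketch handles only nested dictionaries with string leaves and bare keys, and
it is not a complete *TOML* writer.

.. code-block:: python

    def toml_lines(data, path=()):
        """Yield TOML lines for nested dicts whose leaves are strings."""
        leaves = {k: v for k, v in data.items() if not isinstance(v, dict)}
        tables = {k: v for k, v in data.items() if isinstance(v, dict)}
        if path and leaves:
            # A table header repeats every key from the root down.
            yield "[" + ".".join(path) + "]"
        for key, value in leaves.items():
            yield f'{key} = "{value}"'
        for key, table in tables.items():
            yield from toml_lines(table, path + (key,))

    config = {
        "deploy": {
            "staging": {
                "database": {"host": "db.example.com", "port": "5432"},
                "cache": {"host": "cache.example.com"},
            }
        }
    }
    lines = list(toml_lines(config))

Both headers, ``[deploy.staging.database]`` and ``[deploy.staging.cache]``,
repeat the ``deploy.staging`` prefix, and the prefix grows with every
additional level of nesting.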

*NestedText* has the following clear advantages over *TOML* as a human-readable
and writable data file format:

- text does not require quotes
- data is left in its original form
- indentation used to succinctly represent nested data
- the structure of the file matches the structure of the data

.. _json: https://www.json.org/json-en.html
.. _yaml: https://yaml.org/
.. _strictyaml: https://hitchdev.com/strictyaml
.. _toml: https://toml.io/en/
1 change: 1 addition & 0 deletions doc/examples.rst
@@ -127,5 +127,6 @@ And finally, the code:
.. literalinclude:: ../examples/cryptocurrency
:language: python


.. _pydantic: https://pydantic-docs.helpmanual.io/
.. _voluptuous: https://github.com/alecthomas/voluptuous
9 changes: 8 additions & 1 deletion doc/index.rst
@@ -1,10 +1,17 @@
.. include:: ../README.rst

.. toctree::
:caption: Why NestedText?
:maxdepth: 1

Philosophy <philosophy>
alternatives

.. toctree::
:caption: Getting started
:maxdepth: 1

releases
installation
basic_syntax
basic_use
schemas
File renamed without changes.
25 changes: 25 additions & 0 deletions doc/philosophy.rst
@@ -0,0 +1,25 @@
***********************
The Zen of *NestedText*
***********************

*NestedText* aspires to be a simple dumb vessel that holds people's structured
data, and does so in a way that allows people to easily interact with that
data.

The desire to be simple is an attempt to minimize the effort required to learn
and use the language. Ideally people can understand it by looking at one or two
examples and they can use it without needing to remember any arcane
rules and without relying on any of the knowledge that programmers accumulate
through years of experience. One source of simplicity is consistency. As such,
*NestedText* uses a small number of rules that it applies with few exceptions.

The desire to be dumb means that *NestedText* tries not to transform the data
in any meaningful way. It parses the structure of the data without doing
anything that might change how the data is interpreted. Instead, it aims to
make it easy for you to interpret the data yourself. After all, you understand
what the data is supposed to mean, so you are in the best position to interpret
it. There are also many powerful tools available to help with :doc:`this exact
task <schemas>`.


