## Chapter 4: Encoding and Evolution
For a data-intensive application, evolution means being able to change data structures and source code without disrupting processes that are already running. In essence, it should be possible to roll out upgrades without downtime. To achieve this, the application must support two types of compatibility:
* Backward compatibility
    - Newer code can read data that was written by older code.
* Forward compatibility
    - Older code can read data that was written by newer code.
    
Both kinds of compatibility depend on the formats used to encode and decode the application’s data. By choosing an encoding format and managing how its schema changes over time, it is possible to achieve the desired compatibility.

!! Definition !!
* **Encoding or serialization:** The translation from the in-memory representation to a byte sequence.
* **Decoding or deserialization:** The translation from the byte sequence back to an in-memory representation.
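
As a minimal sketch of these two definitions, the snippet below encodes an in-memory dictionary into a byte sequence and decodes it back. JSON over UTF-8 is used only as a convenient illustration, and the `user` record is made up.

```python
import json

# In-memory representation: a Python dict (a made-up record for illustration).
user = {"name": "Martin", "favorite_number": 1337}

# Encoding (serialization): in-memory object -> byte sequence.
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialization): byte sequence -> in-memory object.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__, decoded == user)  # bytes True
```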

Achieving this compatibility depends on two things:
* The format used to encode the data
* The mode of dataflow (how the data moves between processes)

---

### Formats for Encoding Data

The most used formats for encoding data are:
* Language-Specific Formats
* JSON, XML, Binary variants
* Apache Thrift and Protocol Buffers
* Apache Avro

#### Language-Specific Formats
Many programming languages come with built-in support for encoding in-memory
objects into byte sequences. For example, Java has java.io.Serializable, Ruby
has Marshal, Python has pickle, and so on. Many third-party libraries also
exist, such as Kryo for Java.

* Pros
    - Easy to save and restore objects with few lines of code
* Cons
    - The encoding is often tied to a particular programming language, and reading the data in another language is very difficult. If you store or transmit data in such an encoding, you are committing yourself to your current programming language for potentially a very long time.
    - In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is insecure: an attacker who can get the application to decode an arbitrary byte sequence may be able to make it run arbitrary code.
    - Versioning data is often an afterthought in these libraries (backward and forward compatibility is usually problematic)
    - Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought
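
The first pro and the security con above can both be seen with Python’s `pickle`. This is only a sketch: the `Sneaky` class is hypothetical, and it uses the harmless built-in `sorted` as a stand-in for whatever callable an attacker might choose.

```python
import pickle

# Pro: a couple of lines are enough to save and restore an object.
profile = {"name": "Martin", "interests": ["daydreaming", "hacking"]}
restored = pickle.loads(pickle.dumps(profile))

# Con: the byte stream, not the reader, decides what is called while decoding.
class Sneaky:
    def __reduce__(self):
        # Tells pickle: "to rebuild me, call sorted([3, 1, 2])".
        # Harmless here, but the callable could just as well be a dangerous one.
        return (sorted, ([3, 1, 2],))

surprise = pickle.loads(pickle.dumps(Sneaky()))

print(restored == profile)  # True
print(surprise)             # [1, 2, 3]
```

The decoded value is not a `Sneaky` instance at all: decoding ran a function named inside the data, which is why such formats should not be used on untrusted input.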

#### JSON, XML, Binary variants
Moving to standardized encodings that can be written and read by many programming
languages, JSON and XML are the obvious contenders. They are widely known,
widely supported, and almost as widely disliked. XML is often criticized for being too
verbose and unnecessarily complicated. JSON’s popularity is mainly due to its
built-in support in web browsers (by virtue of being a subset of JavaScript) and simplicity
relative to XML. CSV is another popular language-independent format, albeit
less powerful.

* Pros
    - JSON, XML, and CSV are textual formats, and thus somewhat human-readable
* Cons
    - There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema). JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision.
    - JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding).
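
Both cons can be reproduced in a few lines. In this sketch, `parse_int=float` imitates a parser that stores every number as a double-precision float, and Base64 is shown as one common workaround for binary data; neither choice comes from the notes above.

```python
import base64
import json

# Integers above 2**53 cannot be represented exactly as IEEE 754 doubles.
big = "9007199254740993"                        # 2**53 + 1
as_int = json.loads(big)                        # Python keeps the exact integer
as_double = json.loads(big, parse_int=float)    # a double-only parser does not
print(as_int, int(as_double))  # 9007199254740993 9007199254740992

# JSON has no binary string type, so raw bytes are rejected.
payload = b"\x00\xffraw bytes"
try:
    json.dumps({"data": payload})
except TypeError as exc:
    print("not serializable:", type(exc).__name__)

# Workaround: wrap the bytes as Base64 text, at the cost of extra size.
wrapped = json.dumps({"data": base64.b64encode(payload).decode("ascii")})
unwrapped = base64.b64decode(json.loads(wrapped)["data"])
```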

**Binary variants**

JSON is less verbose than XML, but both still use a lot of space compared to binary
formats. This observation led to the development of a profusion of binary encodings
for JSON (MessagePack, BSON, BJSON, UBJSON, BISON, and Smile, to name a few)
and for XML (WBXML and Fast Infoset, for example). These formats have been
adopted in various niches, but none of them are as widely adopted as the textual versions
of JSON and XML.

#### Apache Thrift and Protocol Buffers
Apache Thrift and Protocol Buffers (protobuf) are binary encoding libraries
that are based on the same principle. Both Thrift and Protocol Buffers require a schema for any data that is encoded.

Thrift and Protocol Buffers each come with a code generation tool that takes a
schema definition and produces classes that implement the
schema in various programming languages. Your application code can call this
generated code to encode or decode records of the schema.

**Apache Thrift**

Thrift has two different binary encoding formats:
* BinaryProtocol

![](images/image_17.png)
    
* CompactProtocol

![](images/image_18.png)

**Protocol Buffers**

![](images/image_19.png)

**Forward and backward compatibility**

How do Thrift and Protocol Buffers handle schema changes while
keeping backward and forward compatibility?

* If a field value is not set, it is simply omitted from the encoded record.
* You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field’s tag, since that would make all existing encoded data invalid.
* You can add new fields to the schema, provided that you give each field a new tag number.
    - Forward compatibility: old code that does not recognize the new tag number simply ignores the field. The datatype annotation allows the parser to determine how many bytes it needs to skip, so old code can read records that were written by new code.
    - Backward compatibility: as long as each field has a unique tag number, new code can always read old data, because the tag numbers still have the same meaning. The one caveat is that a newly added field cannot be required, because old data will not contain it.
* Removing a field is just like adding a field, with backward and forward compatibility concerns reversed.
    - You can only remove a field that is optional (a required field can never be removed), and you can never use the same tag number again (because you may still have data written somewhere that includes the old tag number, and that field must be ignored by new code).
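
The rules above can be sketched with a toy tag-based format. This is not the real Thrift or Protocol Buffers wire format: the type codes, the fixed-width integers, and the `SCHEMA_V1`/`SCHEMA_V2` definitions are all assumptions made for illustration.

```python
import struct

INT64, STRING = 0, 1  # toy type codes, not the real Thrift/protobuf ones

def encode(record, schema):
    """schema maps field name -> (tag, type). Unset fields are omitted."""
    out = bytearray()
    for name, (tag, ftype) in schema.items():
        value = record.get(name)
        if value is None:
            continue
        out += struct.pack(">BB", tag, ftype)      # tag and type, never the name
        if ftype == INT64:
            out += struct.pack(">q", value)
        else:
            data = value.encode("utf-8")
            out += struct.pack(">I", len(data)) + data
    return bytes(out)

def decode(buf, schema):
    by_tag = {tag: name for name, (tag, _) in schema.items()}
    record, pos = {}, 0
    while pos < len(buf):
        tag, ftype = struct.unpack_from(">BB", buf, pos)
        pos += 2
        # The type annotation tells the parser how many bytes the value takes.
        if ftype == INT64:
            (value,) = struct.unpack_from(">q", buf, pos)
            pos += 8
        else:
            (length,) = struct.unpack_from(">I", buf, pos)
            pos += 4
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        if tag in by_tag:          # unknown tags are parsed past and ignored
            record[by_tag[tag]] = value
    return record

SCHEMA_V1 = {"user_name": (1, STRING), "favorite_number": (2, INT64)}
# V2 renames the field with tag 1 and adds a new field with a new tag, 3.
SCHEMA_V2 = {"name": (1, STRING), "favorite_number": (2, INT64), "email": (3, STRING)}

new_bytes = encode({"name": "Martin", "favorite_number": 1337,
                    "email": "m@example.com"}, SCHEMA_V2)
old_view = decode(new_bytes, SCHEMA_V1)   # forward compatibility
old_bytes = encode({"user_name": "Martin"}, SCHEMA_V1)
new_view = decode(old_bytes, SCHEMA_V2)   # backward compatibility

print(old_view)  # {'user_name': 'Martin', 'favorite_number': 1337}
print(new_view)  # {'name': 'Martin'}
```

The old reader skips tag 3 without failing, and the new reader treats the fields missing from old data as unset, which is why added fields must not be required.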

#### Apache Avro
Avro also uses a schema to specify the structure of the data being encoded. It has two
schema languages:
* Avro IDL
    - Intended for human editing
* JSON    
    - More easily machine-readable
    
To parse the binary data, you go through the fields in the order that they appear in
the schema and use the schema to tell you the datatype of each field. This means that
the binary data can only be decoded correctly if the code reading the data is using the
exact same schema as the code that wrote the data. Any mismatch in the schema
between the reader and the writer would mean incorrectly decoded data.

![](images/image_20.png)

With Avro, when an application wants to encode some data (to write it to a file or
database, to send it over the network, etc.), it encodes the data using whatever version
of the schema it knows about—for example, that schema may be compiled into the
application. This is known as the writer’s schema.

When an application wants to decode some data (read it from a file or database,
receive it from the network, etc.), it is expecting the data to be in some schema, which
is known as the reader’s schema. That is the schema the application code is relying on
—code may have been generated from that schema during the application’s build
process.

The key idea with Avro is that the writer’s schema and the reader’s schema don’t have
to be the same—they only need to be compatible.
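
A rough sketch of this idea follows. It is a toy, not Avro’s actual encoding (real Avro uses variable-length integers and stricter resolution rules), and the `WRITER` and `READER` schemas are hypothetical. Values are concatenated in the writer’s field order with no tags, decoded with the writer’s schema, and then matched to the reader’s schema by field name.

```python
import struct

def encode(record, writer_schema):
    """Concatenate values in schema order; no field names or tags are stored."""
    out = bytearray()
    for name, ftype, _default in writer_schema:
        if ftype == "long":
            out += struct.pack(">q", record[name])
        else:  # "string": 4-byte length prefix, then UTF-8 bytes
            data = record[name].encode("utf-8")
            out += struct.pack(">I", len(data)) + data
    return bytes(out)

def decode(buf, writer_schema, reader_schema):
    # Step 1: walk the bytes with the writer's schema to find field boundaries.
    written, pos = {}, 0
    for name, ftype, _default in writer_schema:
        if ftype == "long":
            (written[name],) = struct.unpack_from(">q", buf, pos)
            pos += 8
        else:
            (length,) = struct.unpack_from(">I", buf, pos)
            pos += 4
            written[name] = buf[pos:pos + length].decode("utf-8")
            pos += length
    # Step 2: resolve by name. Fields only the writer has are dropped;
    # fields only the reader has are filled with the reader's default.
    return {name: written.get(name, default)
            for name, _ftype, default in reader_schema}

# Each schema entry is (field name, type, default value).
WRITER = [("userName", "string", None), ("favoriteNumber", "long", None)]
READER = [("userName", "string", None), ("email", "string", "unknown")]

data = encode({"userName": "Martin", "favoriteNumber": 1337}, WRITER)
result = decode(data, WRITER, READER)
print(result)  # {'userName': 'Martin', 'email': 'unknown'}
```

Because the bytes carry no tags, the reader cannot decode anything without the writer’s schema, which is why the next question is how the reader obtains it.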

**Examples of how the reader learns the writer’s schema:**

* Large file with lots of records
    - In this case, the writer of that
file can just include the writer’s schema once at the beginning of the file. Avro
specifies a file format (object container files) to do this.
* Database with individually written records
    - The simplest solution is to include a version number at the beginning
of every encoded record, and to keep a list of schema versions in your database.
* Sending records over a network connection
    - When two processes are communicating over a bidirectional network connection,
they can negotiate the schema version on connection setup and then use
that schema for the lifetime of the connection. The Avro RPC protocol works like this.
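
The database case can be sketched as follows. The two-byte version prefix, the `SCHEMA_VERSIONS` table, and the JSON body are assumptions chosen to keep the example short; a real system would store a schema-dependent binary body after the prefix.

```python
import json
import struct

# Hypothetical list of schema versions kept in the database:
# version number -> field names in writer order.
SCHEMA_VERSIONS = {
    1: ["userName", "favoriteNumber"],
    2: ["userName", "favoriteNumber", "email"],
}

def encode_record(version, values):
    """Prefix the encoded values with a 2-byte schema version number."""
    body = json.dumps(values).encode("utf-8")  # stand-in for a binary body
    return struct.pack(">H", version) + body

def decode_record(buf):
    """Read the version, fetch the matching writer's schema, then decode."""
    (version,) = struct.unpack_from(">H", buf, 0)
    writer_schema = SCHEMA_VERSIONS[version]
    values = json.loads(buf[2:].decode("utf-8"))
    return version, dict(zip(writer_schema, values))

stored = encode_record(1, ["Martin", 1337])
version, record = decode_record(stored)
print(version, record)  # 1 {'userName': 'Martin', 'favoriteNumber': 1337}
```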

---

### Modes of Dataflow
A mode of dataflow is a way in which data moves between processes that do not share memory. The mode determines who encodes the data and who decodes it. The most common modes are:
* Via databases
* Via service calls
* Via asynchronous message passing

#### Dataflow through Databases
In a database, the process that writes to the database encodes the data, and the process
that reads from the database decodes it.

**Complications**
* Updating records
    - If old code reads a record that was written by newer code, updates it, and writes it back, it may silently drop the fields it does not know about. Ideally, the old code keeps those unknown fields intact.
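* Different values written at different times
    - A database may hold values written milliseconds ago next to values written years ago, so old encodings stay in use long after the code that wrote them has been replaced (data outlives code).
* Archival storage
    - Snapshots and backups are usually written in one go, so they are typically encoded with the latest schema even if the source database mixes schema versions.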
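
The record-update complication is easy to reproduce. In this sketch the stored record is JSON and the field names are made up. `lossy_update` models old code that rebuilds the record from the fields it knows, while `preserving_update` edits the decoded record in place.

```python
import json

def lossy_update(raw, new_name):
    """Old code: copies only the fields it knows, so anything else is dropped."""
    record = json.loads(raw)
    known = {"id": record["id"], "name": new_name}
    return json.dumps(known)

def preserving_update(raw, new_name):
    """Old code, written defensively: edits in place and keeps unknown fields."""
    record = json.loads(raw)
    record["name"] = new_name
    return json.dumps(record)

# Written by newer code, which added the "email" field.
stored = json.dumps({"id": 1, "name": "Martin", "email": "m@example.com"})

print("email" in json.loads(lossy_update(stored, "Martin K.")))       # False
print("email" in json.loads(preserving_update(stored, "Martin K.")))  # True
```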

#### Dataflow Through Services: REST and RPC

When processes need to communicate over a network, the most common arrangement is
to have two roles: _clients_ and _servers_. The servers expose an API over the network,
and the clients can connect to the servers to make requests to that API. The API
exposed by the server is known as a service.

How a service API can evolve depends on the encoding it uses:

* Thrift, gRPC (Protocol Buffers), and Avro RPC can be evolved according to the
compatibility rules of the respective encoding format.
* In SOAP, requests and responses are specified with XML schemas. These can be
evolved, but there are some subtle pitfalls.
* RESTful APIs most commonly use JSON (without a formally specified schema)
for responses, and JSON or URI-encoded/form-encoded request parameters for
requests. Adding optional request parameters and adding new fields to response
objects are usually considered changes that maintain compatibility.

#### Message-Passing Dataflow
Asynchronous message-passing systems sit somewhere between RPC and databases. They are similar to RPC in that a
client’s request (usually called a message) is delivered to another process with low
latency. They are similar to databases in that the message is not sent via a direct network
connection, but goes via an intermediary called a message broker (also called a
message queue or message-oriented middleware), which stores the message temporarily.
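
The role of the intermediary can be sketched with an in-process queue standing in for a real broker. The `publish` and `consume` helpers and the event fields are hypothetical. The sender encodes a message and returns immediately, and the recipient decodes it later.

```python
import json
import queue

broker = queue.Queue()  # in-process stand-in for a message broker

def publish(event):
    """Sender: encode, hand the bytes to the broker, do not wait for a reply."""
    broker.put(json.dumps(event).encode("utf-8"))

def consume():
    """Recipient: take the next stored message and decode it."""
    return json.loads(broker.get().decode("utf-8"))

publish({"type": "signup", "user": "Martin"})
# A newer sender adds a field; an older consumer can simply ignore it.
publish({"type": "signup", "user": "Ada", "referrer": "newsletter"})

pending = broker.qsize()  # both messages are stored until someone consumes them
first, second = consume(), consume()
print(pending, first["user"], second["user"])  # 2 Martin Ada
```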

The actor model is a programming model for concurrency in a single process. Rather
than dealing directly with threads (and the associated problems of race conditions,
locking, and deadlock), logic is encapsulated in actors. Each actor typically represents
one client or entity. It may have some local state (which is not shared with any other
actor), and it communicates with other actors by sending and receiving asynchronous
messages. Message delivery is not guaranteed: in certain error scenarios, messages
will be lost. Since each actor processes only one message at a time, it doesn’t
need to worry about threads, and each actor can be scheduled independently by the
framework.
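
A minimal sketch of an actor, assuming a thread with a mailbox queue as the scheduling mechanism (actor frameworks do this for you, and unlike this in-process queue they may lose messages). The `CounterActor` class and its message names are made up for illustration.

```python
import queue
import threading

class CounterActor:
    """Owns its state; the only way to affect it is to send a message."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # local state, touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        self._mailbox.put(message)  # asynchronous: returns immediately

    def _run(self):
        while True:
            message = self._mailbox.get()  # one message at a time, no locks
            if message == "stop":
                return
            if message == "increment":
                self._count += 1

    def stop_and_get(self):
        self.send("stop")
        self._thread.join()
        return self._count

def send_many(actor, n):
    for _ in range(n):
        actor.send("increment")

actor = CounterActor()
senders = [threading.Thread(target=send_many, args=(actor, 1000)) for _ in range(4)]
for t in senders:
    t.start()
for t in senders:
    t.join()

total = actor.stop_and_get()
print(total)  # 4000
```

Four threads send concurrently, yet the count is exact without any explicit locking, because only the actor’s own thread ever modifies `_count`.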