# Serialisation

Noodles lets you run jobs remotely and store/retrieve results in case of duplicate jobs or reruns. These features rely on the *serialisation* (and, no less important, the reconstruction) of all objects that are passed between scheduled functions. Serialisation refers to the process of turning any object into a stream of bytes from which we can reconstruct a functionally identical object. "Easy enough!" you might think, "just use `pickle`."

In [1]:
import pickle
function = pickle.dumps(str.upper)
message = pickle.dumps("Hello, Wold!")
print("function:", function, "\nmessage:", message)

function: b'\x80\x03cbuiltins\ngetattr\nq\x00cbuiltins\nstr\nq\x01X\x05\x00\x00\x00upperq\x02\x86q\x03Rq\x04.' 
message: b'\x80\x03X\x0c\x00\x00\x00Hello, Wold!q\x00.'


In [2]:
pickle.loads(function)(pickle.loads(message))

'HELLO, WOLD!'

However, `pickle` cannot serialise all objects. "Use `dill`!" you say; still, the pickle/dill method of serialisation is rather indiscriminate. Some of our objects may contain runtime data that we can't or don't want to store: coroutines, threads, locks, open files, you name it. We work with an SQLite3 database to store our data. An application might store gigabytes of numerical data. We don't want those binary blobs in our database; we would rather store them externally in an HDF5 file.
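To make the limitation concrete, here is a small sketch that uses only the standard library. A lambda and a thread lock are two everyday objects that plain `pickle` refuses to serialise, while the built-in `str.upper` from the cell above goes through fine. The helper `can_pickle` is a name introduced here for illustration only.

```python
import pickle
import threading


def can_pickle(obj):
    """Return True if `pickle.dumps` succeeds on `obj`, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        # Different object kinds fail with different exception types.
        return False


candidates = {
    "built-in method": str.upper,
    "lambda": lambda x: x + 1,
    "thread lock": threading.Lock(),
}

for name, obj in candidates.items():
    status = "ok" if can_pickle(obj) else "cannot be pickled"
    print(f"{name}: {status}")
```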

There are many cases where more fine-grained control of serialisation is in order. The bottom line is that there is *no silver bullet solution*. Here we show some examples of how to customise the Noodles serialisation mechanism.

## The registry

Noodles keeps a registry of `Serialiser` objects that know exactly how to serialise and reconstruct objects. This registry is specified to the backend when we call one of the `run` functions. To make the serialisation registry visible to remote parties, it is important that the registry can be imported. This is why it has to be a function of zero arguments (a *thunk*) returning the actual registry object.

```python
def registry():
    return Registry(...)
    
run(workflow,
    db_file='project-cache.db',
    registry=registry)
```

The registry that should always be included is `noodles.serial.base`. This registry knows how to serialise basic Python dictionaries, lists, tuples, sets, strings, bytes, slices and all objects that are internal to Noodles. Special care is taken with objects that have a `__name__` attached and can be imported using the `__module__.__name__` combination.

Registries can be composed using the `+` operator. For instance, suppose we want to use `pickle` as a default option for objects that are not in `noodles.serial.base`:

In [3]:
import noodles
from noodles.tutorial import highlight_lines

def registry():
    return noodles.serial.pickle() \
        + noodles.serial.base()

reg = registry()

Let's see what is made of our objects!

In [4]:
highlight_lines(reg.to_json([
    "These data are JSON compatible!", 0, 1.3, None,
    {"dictionaries": "too!"}], indent=2))

Great! JSON compatible data stays the same. Now try an object that JSON doesn't know about.

In [5]:
highlight_lines(reg.to_json({1, 2, 3}, indent=2), [1])

Objects are encoded as a dictionary containing a `'_noodles'` key. So what will happen if we serialise an object the registry cannot possibly know about? Next we define a little astronomical class describing a star in the [Morgan-Keenan classification scheme](https://en.wikipedia.org/wiki/Stellar_classification).

In [6]:
class Star(object):
    """Morgan-Keenan stellar classification."""
    def __init__(self, spectral_type, number, luminosity_class):
        assert spectral_type in "OBAFGKM"
        assert number in range(10)

        self.spectral_type = spectral_type
        self.number = number
        self.luminosity_class = luminosity_class

highlight_lines(reg.to_json(Star('G', 2, 'V'), indent=2), [4], max_width=60)

The registry obviously doesn't know about `Star`s, so it falls back to serialisation using `pickle`. The pickled data is further encoded using `base64`. This solution won't work if some of your data cannot be pickled. Also, if you're sensitive to aesthetics, the pickled output doesn't look very nice.

## `__serialize__` and `__construct__`

One way to take control of the serialisation of your objects is to add the `__serialize__` and `__construct__` methods.

In [7]:
class Star(object):
    """Morgan-Keenan stellar classification."""
    def __init__(self, spectral_type, number, luminosity_class):
        assert spectral_type in "OBAFGKM"
        assert number in range(10)

        self.spectral_type = spectral_type
        self.number = number
        self.luminosity_class = luminosity_class

    def __str__(self):
        return f'{self.spectral_type}{self.number}{self.luminosity_class}'
    
    @staticmethod
    def from_string(string):
        """Construct a new Star from a string describing the stellar type."""
        return Star(string[0], int(string[1]), string[2:])
    
    def __serialize__(self, pack):
        return pack(str(self))
    
    @classmethod
    def __construct__(cls, data):
        return Star.from_string(data)

highlight_lines(reg.to_json(Star('G', 2, 'V'), indent=2), [4])

The `__serialize__` method takes one argument (besides `self`). The argument `pack` is a function that creates the data record with all handles attached. The reason for this construct is that it takes keyword arguments for special cases.

```python
def pack(data, ref=None, files=None):
    pass
```

* The `ref` argument, if given as `True`, will make sure that this object will not get reconstructed unnecessarily. One instance where this is incredibly useful is when the object is a NumPy array that is gigabytes in size.
* The `files` argument, when given, should be a list of filenames. This makes sure Noodles knows about the involvement of external files.

The data passed to `pack` may be of any type, as long as the serialisation registry knows how to serialise it.

The `__construct__` method must be a *class method*. The `data` argument it is given can be expected to be identical to the data passed to the `pack` function at serialisation.
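The contract between the two methods can be illustrated without Noodles at all. The sketch below is not the Noodles implementation: `toy_pack`, `toy_encode`, `toy_decode` and the `Interval` class are hypothetical stand-ins, and the record layout is invented for illustration. It only shows that whatever `__serialize__` hands to `pack` is what `__construct__` receives back.

```python
import json


def toy_pack(data, ref=None, files=None):
    """Hypothetical stand-in for `pack`: wrap the data in a plain record."""
    return {"data": data, "ref": bool(ref), "files": files or []}


class Interval:
    """A small example class with custom serialisation methods."""

    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def __serialize__(self, pack):
        # Hand over data that the (toy) encoder already understands.
        return pack([self.lower, self.upper])

    @classmethod
    def __construct__(cls, data):
        # `data` is the list that was passed to `pack` above.
        return cls(*data)


def toy_encode(obj):
    """Serialise `obj` to a JSON string via its `__serialize__` method."""
    return json.dumps(obj.__serialize__(toy_pack))


def toy_decode(cls, text):
    """Rebuild an object of type `cls` from the JSON string."""
    record = json.loads(text)
    return cls.__construct__(record["data"])


original = Interval(0.5, 2.0)
restored = toy_decode(Interval, toy_encode(original))
print(restored.lower, restored.upper)  # 0.5 2.0
```

In this toy version the class is passed to the decoder by hand; a real registry has to record enough information in the serialised record to find the class again on its own.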

## Writing a Serialiser class (TODO)

Often, the class that needs serialising is not from your own package. In that case we need to write a specialised `Serialiser` class.

## Using hooks (TODO)

Not always can we find the correct serialisation method by looking at the type of an object. To catch the more slippery eels, we may use a *hook*.

### Footnote: better parsing
If you're interested in doing a bit better in parsing generic expressions into objects, take a look at `pyparsing`.

In [8]:
!pip install pyparsing



In [9]:
from pyparsing import Literal, replaceWith, OneOrMore, Word, nums, oneOf

def roman_numeral_literal(string, value):
    return Literal(string).setParseAction(replaceWith(value))
    
one = roman_numeral_literal("I", 1)
four = roman_numeral_literal("IV", 4)
five = roman_numeral_literal("V", 5)

roman_numeral = OneOrMore(
    (five | four | one).leaveWhitespace()) \
    .setName("roman") \
    .setParseAction(lambda s, l, t: sum(t))

integer = Word(nums) \
    .setName("integer") \
    .setParseAction(lambda t: int(t[0]))

mkStar = oneOf(list("OBAFGKM")) + integer + roman_numeral

In [12]:
list(mkStar.parseString('B2IV'))

['B', 2, 4]

In [13]:
roman_class = {
    'I': 'supergiant',
    'II': 'bright giant',
    'III': 'regular giant',
    'IV': 'sub-giant',
    'V': 'main-sequence',
    'VI': 'sub-dwarf',
    'VII': 'white dwarf'
}

# Implementation

A `Registry` object roughly consists of three parts. It works like a dictionary, searching for `Serialiser`s based on the class or base class of an object. If an object cannot be identified through its class or base classes, the `Registry` has a function hook that may use any test to determine the proper `Serialiser`. When neither the dictionary nor the hook gives a result, there is a default fall-back option.
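The three-stage lookup described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the actual Noodles `Registry`: the class `ToyRegistry`, its `lookup` method and the example classes are all hypothetical, and serialisers are represented by plain strings to keep the example short.

```python
class ToyRegistry:
    """Illustrative only: type lookup, then hook, then default."""

    def __init__(self, types=None, hook=None, default=None):
        self.types = dict(types or {})  # maps class -> serialiser
        self.hook = hook                # callable(obj) -> serialiser or None
        self.default = default          # fall-back serialiser

    def lookup(self, obj):
        # 1. Try the class itself, then its base classes, in MRO order.
        for cls in type(obj).__mro__:
            if cls in self.types:
                return self.types[cls]
        # 2. Let the hook apply any test it likes.
        if self.hook is not None:
            found = self.hook(obj)
            if found is not None:
                return found
        # 3. Nothing matched: use the default.
        return self.default


class Animal:
    pass


class Dog(Animal):
    pass


class Stream:
    def read(self):
        return b""


def file_like_hook(obj):
    """Recognise objects by behaviour rather than by type."""
    return "file-like serialiser" if hasattr(obj, "read") else None


registry = ToyRegistry(
    types={Animal: "animal serialiser"},
    hook=file_like_hook,
    default="pickle fall-back")

print(registry.lookup(Dog()))     # animal serialiser (found via base class)
print(registry.lookup(Stream()))  # file-like serialiser (found via hook)
print(registry.lookup(3.14))      # pickle fall-back (default)
```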