Revisit serialization in IPython.parallel

erialization code used in IPython.parallel could use an audit and overhaul.  It will be a while before I actually get to this, but I should make a note here to keep track of current thoughts.

The serialization code (principally in `utils.pickleutil/newserialized`) is largely inherited from the days of IPython.kernel, and could use some improvements. Specifically:
- the newserialized machinery for zero-copy sends should be extensible, allowing third-parties to define their own non-copying behavior.  Since this is currently implemented with a simple `if/elif/elif`-style switch statement, it can just as easily be done with a dict-based switch, which would trivially provide extensibilty.
- the whole can/uncan/serialize logic is a bit convoluted.  It might make better sense to conflate these two libraries. I'm not sure.
- the `serialize_object` method used for serializing _containers_ of objects (applied to args/kwargs and results when using apply) serializes objects _separately_, and then pickles the container itself.  This is extraordinarily expensive for large containers, which technically only applies to _results_, but is still important.

There is also a data-overhead from using this inefficient double-serialization:

``` python
from IPython.parallel.util import serialize_object
import cPickle as pickle
obj = range(1000)
print '    pickle: %6i bytes' % len(pickle.dumps(obj, -1))
print 'serialized: %6i bytes' % len(serialize_object(obj)[0])
#--
    pickle:   2752 bytes
serialized:  34086 bytes
```

There is also finite overhead of doing zero-copy sends (construction of Python objects and async invocation of the GIL from the ØMQ io_thread).  The Session object currently does zero-copy sends _exclusively_, but there is probably a finite threshold below which copying is actually cheaper.  We need to do some rough profiling to figure out what this should be, but I expect it's somewhere above 1kB.

A simple test case on my laptop:

``` python
In [11]: import zmq
    ...: import numpy as np
    ...: 
    ...: ctx = zmq.Context()
    ...: s = ctx.socket(zmq.PUB)
    ...: s.bind_to_random_port('tcp://127.0.0.1')
    ...: 
    ...: 
    ...: for n in np.logspace(3,5,9).astype(int):
    ...:     msg = 'x'*n
    ...:     print n
    ...:     print 'copy', 
    ...:     %timeit s.send(msg)
    ...:     print 'zero-copy',
    ...:     %timeit s.send(msg, copy=False)
1000
copy 1000000 loops, best of 3: 1.5 us per loop
zero-copy 100000 loops, best of 3: 5.85 us per loop
1778
copy 1000000 loops, best of 3: 1.53 us per loop
zero-copy 100000 loops, best of 3: 5.74 us per loop
3162
copy 1000000 loops, best of 3: 1.63 us per loop
zero-copy 100000 loops, best of 3: 5.86 us per loop
5623
copy 1000000 loops, best of 3: 1.75 us per loop
zero-copy 100000 loops, best of 3: 5.93 us per loop
10000
copy 100000 loops, best of 3: 2.22 us per loop
zero-copy 100000 loops, best of 3: 6.13 us per loop
17782
copy 100000 loops, best of 3: 3.44 us per loop
zero-copy 100000 loops, best of 3: 5.78 us per loop
31622
copy 100000 loops, best of 3: 4.92 us per loop
zero-copy 100000 loops, best of 3: 5.77 us per loop
56234
copy 100000 loops, best of 3: 7.34 us per loop
zero-copy 100000 loops, best of 3: 5.78 us per loop
100000
copy 100000 loops, best of 3: 11.7 us per loop
zero-copy 100000 loops, best of 3: 5.77 us per loop
```

Indicates that the threshold should be somewhere around 50kB.  Now, with the memory costs of performing this copy, as well as the hidden and unpredictable cost of asynchronous GIL calls from the IO thread when doing zero-copy make this much less clear than a simple crossover plot would suggest.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Revisit serialization in IPython.parallel #1504

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Revisit serialization in IPython.parallel #1504

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions