erialization code used in IPython.parallel could use an audit and overhaul. It will be a while before I actually get to this, but I should make a note here to keep track of current thoughts.
The serialization code (principally in utils.pickleutil/newserialized) is largely inherited from the days of IPython.kernel, and could use some improvements. Specifically:
- the newserialized machinery for zero-copy sends should be extensible, allowing third-parties to define their own non-copying behavior. Since this is currently implemented with a simple
if/elif/elif-style switch statement, it can just as easily be done with a dict-based switch, which would trivially provide extensibilty.
- the whole can/uncan/serialize logic is a bit convoluted. It might make better sense to conflate these two libraries. I'm not sure.
- the
serialize_object method used for serializing containers of objects (applied to args/kwargs and results when using apply) serializes objects separately, and then pickles the container itself. This is extraordinarily expensive for large containers, which technically only applies to results, but is still important.
There is also a data-overhead from using this inefficient double-serialization:
from IPython.parallel.util import serialize_object
import cPickle as pickle
obj = range(1000)
print ' pickle: %6i bytes' % len(pickle.dumps(obj, -1))
print 'serialized: %6i bytes' % len(serialize_object(obj)[0])
#--
pickle: 2752 bytes
serialized: 34086 bytes
There is also finite overhead of doing zero-copy sends (construction of Python objects and async invocation of the GIL from the ØMQ io_thread). The Session object currently does zero-copy sends exclusively, but there is probably a finite threshold below which copying is actually cheaper. We need to do some rough profiling to figure out what this should be, but I expect it's somewhere above 1kB.
A simple test case on my laptop:
In [11]: import zmq
...: import numpy as np
...:
...: ctx = zmq.Context()
...: s = ctx.socket(zmq.PUB)
...: s.bind_to_random_port('tcp://127.0.0.1')
...:
...:
...: for n in np.logspace(3,5,9).astype(int):
...: msg = 'x'*n
...: print n
...: print 'copy',
...: %timeit s.send(msg)
...: print 'zero-copy',
...: %timeit s.send(msg, copy=False)
1000
copy 1000000 loops, best of 3: 1.5 us per loop
zero-copy 100000 loops, best of 3: 5.85 us per loop
1778
copy 1000000 loops, best of 3: 1.53 us per loop
zero-copy 100000 loops, best of 3: 5.74 us per loop
3162
copy 1000000 loops, best of 3: 1.63 us per loop
zero-copy 100000 loops, best of 3: 5.86 us per loop
5623
copy 1000000 loops, best of 3: 1.75 us per loop
zero-copy 100000 loops, best of 3: 5.93 us per loop
10000
copy 100000 loops, best of 3: 2.22 us per loop
zero-copy 100000 loops, best of 3: 6.13 us per loop
17782
copy 100000 loops, best of 3: 3.44 us per loop
zero-copy 100000 loops, best of 3: 5.78 us per loop
31622
copy 100000 loops, best of 3: 4.92 us per loop
zero-copy 100000 loops, best of 3: 5.77 us per loop
56234
copy 100000 loops, best of 3: 7.34 us per loop
zero-copy 100000 loops, best of 3: 5.78 us per loop
100000
copy 100000 loops, best of 3: 11.7 us per loop
zero-copy 100000 loops, best of 3: 5.77 us per loop
Indicates that the threshold should be somewhere around 50kB. Now, with the memory costs of performing this copy, as well as the hidden and unpredictable cost of asynchronous GIL calls from the IO thread when doing zero-copy make this much less clear than a simple crossover plot would suggest.
erialization code used in IPython.parallel could use an audit and overhaul. It will be a while before I actually get to this, but I should make a note here to keep track of current thoughts.
The serialization code (principally in
utils.pickleutil/newserialized) is largely inherited from the days of IPython.kernel, and could use some improvements. Specifically:if/elif/elif-style switch statement, it can just as easily be done with a dict-based switch, which would trivially provide extensibilty.serialize_objectmethod used for serializing containers of objects (applied to args/kwargs and results when using apply) serializes objects separately, and then pickles the container itself. This is extraordinarily expensive for large containers, which technically only applies to results, but is still important.There is also a data-overhead from using this inefficient double-serialization:
There is also finite overhead of doing zero-copy sends (construction of Python objects and async invocation of the GIL from the ØMQ io_thread). The Session object currently does zero-copy sends exclusively, but there is probably a finite threshold below which copying is actually cheaper. We need to do some rough profiling to figure out what this should be, but I expect it's somewhere above 1kB.
A simple test case on my laptop:
Indicates that the threshold should be somewhere around 50kB. Now, with the memory costs of performing this copy, as well as the hidden and unpredictable cost of asynchronous GIL calls from the IO thread when doing zero-copy make this much less clear than a simple crossover plot would suggest.