Numpy bottleneck with ctable #174

Open

ARF1 opened this issue Apr 18, 2015 · 1 comment

ARF1 commented Apr 18, 2015

bcolz ctable seems to suffer from a numpy bottleneck:

import numpy as np
import bcolz

a = bcolz.open(rootdir='mydata.bcolz')

a
ctable((8769282,), [('date', 'S10'), ('valueSignificand', '<i4'),
('valueExponent', 'i1'), ('id', 'S12'), ('location', 'S20'), ... some other columns ...])
  nbytes: 1.05 GB; cbytes: 106.91 MB; ratio: 10.01
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
  rootdir := 'mydata.bcolz'
... snip ...

%timeit -r10 test = a[['id', 'date', 'location', 'valueSignificand', 'valueExponent']][:]

1 loops, best of 10: 2.59 s per loop

%%timeit -r10
test = np.ndarray(shape=(len(a),), dtype=a[['id', 'date', 'location', 'valueSignificand', 'valueExponent']].dtype)
test['id'] = a['id'][:]
test['date'] = a['date'][:]
test['location'] = a['location'][:]
test['valueSignificand'] = a['valueSignificand'][:]
test['valueExponent'] = a['valueExponent'][:]

1 loops, best of 10: 2.59 s per loop

%%timeit -r10
test1 = a['id'][:]
test2 = a['date'][:]
test3 = a['location'][:]
test4 = a['valueSignificand'][:]
test5 = a['valueExponent'][:]

1 loops, best of 10: 1.16 s per loop

bcolz.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     0.9.0-dev
NumPy version:     1.9.2
Blosc version:     1.5.5.dev ($Date:: 2015-04-14 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version:   2.3.1
Python version:    2.7.9 |Continuum Analytics, Inc.| (default, Dec 18 2014, 17:00:07) [MSC v.1500 32 bit (Intel)]
Byte-ordering:     little
Detected cores:    2
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

It appears that the assignment into the numpy structured array makes the read a factor of >2 slower (2.59 s vs. 1.16 s). In addition, on my machine at least, it prevents ctables from using the multi-core processor efficiently, which is a particular shame since multi-core use works fairly well for carrays.
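To pin down where the extra time goes, one could decompress the columns in a separate cell first and then time only the copies into the structured array. A minimal sketch (untested, reusing the column names from above; the decompression has to happen outside the timed cell):

cols = ['id', 'date', 'location', 'valueSignificand', 'valueExponent']
data = {c: a[c][:] for c in cols}  # decompress once, outside the timed cell
test = np.empty(len(a), dtype=a[cols].dtype)

%%timeit -r10
# only the per-field copies into the interleaved records are timed here
for c in cols:
    test[c] = data[c]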

Is this a known issue with ctable? Can anybody think of a way of fixing this bottleneck? I would be happy to help.
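One direction I could imagine trying (an untested sketch on my part, relying only on carray slicing and the chunklen attribute): decompress each column chunk by chunk and write each chunk straight into a preallocated structured array, so that only one chunk per column ever exists as an intermediate instead of a full decompressed column:

def fill_struct(ct, columns):
    # sketch: copy ctable columns into a structured array chunk by chunk,
    # avoiding the full-column intermediates created by col[:]
    out = np.empty(len(ct), dtype=[(c, ct[c].dtype) for c in columns])
    for c in columns:
        carr = ct[c]
        for start in range(0, len(carr), carr.chunklen):
            stop = min(start + carr.chunklen, len(carr))
            out[c][start:stop] = carr[start:stop]
    return out

Whether this actually wins anything would need measuring; the per-field writes are still strided.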

ARF1 commented Apr 18, 2015

Even if copying the read data were unavoidable, the numpy structured array is still a factor of ~1.6 slower than this ideal read-plus-copy:

%%timeit -r10
test1 = a['isin'][:].copy()
test2 = a['date'][:].copy()
test3 = a['tradeVenue'][:].copy()
test4 = a['closeSignificand'][:].copy()
test5 = a['closeExponent'][:].copy()

1 loops, best of 10: 1.67 s per loop
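My suspicion (unverified) is that the remaining gap sits in numpy itself: assigning into a field of a structured array is a strided copy into interleaved records, whereas copying a plain array is a single contiguous memcpy. A standalone comparison, independent of bcolz, with field sizes roughly matching the table above:

import numpy as np

n = 8769282
src = np.arange(n, dtype='<i4')
plain = np.empty(n, dtype='<i4')
records = np.empty(n, dtype=[('valueSignificand', '<i4'), ('id', 'S12')])

%timeit plain[:] = src                      # contiguous copy
%timeit records['valueSignificand'] = src   # strided copy into records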
