Numpy bottleneck with ctable #174

Open

ARF1 opened this issue Apr 18, 2015 · 1 comment

ARF1 commented Apr 18, 2015

bcolz ctable seems to suffer from a numpy bottleneck:

import numpy as np
import bcolz

a = bcolz.open(rootdir='mydata.bcolz')

a
ctable((8769282,), [('date', 'S10'), ('valueSignificand', '<i4'),
('valueExponent', 'i1'), ('id', 'S12'), ('location', 'S20'), ... some other columns ...])
  nbytes: 1.05 GB; cbytes: 106.91 MB; ratio: 10.01
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
  rootdir := 'mydata.bcolz'
... snip ...

%timeit -r10 test = a[['id', 'date', 'location', 'valueSignificand', 'valueExponent']][:]

1 loops, best of 10: 2.59 s per loop

%%timeit -r10
test = np.ndarray(shape=(len(a),), dtype=a[['id', 'date', 'location', 'valueSignificand', 'valueExponent']].dtype)
test['id'] = a['id'][:]
test['date'] = a['date'][:]
test['location'] = a['location'][:]
test['valueSignificand'] = a['valueSignificand'][:]
test['valueExponent'] = a['valueExponent'][:]

1 loops, best of 10: 2.59 s per loop

%%timeit -r10
test1 = a['id'][:]
test2 = a['date'][:]
test3 = a['location'][:]
test4 = a['valueSignificand'][:]
test5 = a['valueExponent'][:]

1 loops, best of 10: 1.16 s per loop

bcolz.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     0.9.0-dev
NumPy version:     1.9.2
Blosc version:     1.5.5.dev ($Date:: 2015-04-14 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version:   2.3.1
Python version:    2.7.9 |Continuum Analytics, Inc.| (default, Dec 18 2014, 17:00:07) [MSC v.1500 32 bit (Intel)]
Byte-ordering:     little
Detected cores:    2
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

It appears that the assignment into the numpy structured array makes the read a factor of >2 slower (2.59 s vs. 1.16 s). In addition, on my machine at least, it prevents ctables from using the multi-core processor efficiently, which is a particular shame since multi-core use works fairly well for carrays.
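To pin down where the extra time goes, one could decompress the columns in a separate cell first and then time only the copies into the structured array. A minimal sketch (untested, reusing the column names from above; the decompression has to happen outside the timed cell):

cols = ['id', 'date', 'location', 'valueSignificand', 'valueExponent']
data = {c: a[c][:] for c in cols}  # decompress once, outside the timed cell
test = np.empty(len(a), dtype=a[cols].dtype)

%%timeit -r10
# only the per-field copies into the interleaved records are timed here
for c in cols:
    test[c] = data[c]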

Is this a known issue with ctable? Can anybody think of a way of fixing this bottleneck? I would be happy to help.
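One direction I could imagine trying (an untested sketch on my part, relying only on carray slicing and the chunklen attribute): decompress each column chunk by chunk and write each chunk straight into a preallocated structured array, so that only one chunk per column ever exists as an intermediate instead of a full decompressed column:

def fill_struct(ct, columns):
    # sketch: copy ctable columns into a structured array chunk by chunk,
    # avoiding the full-column intermediates created by col[:]
    out = np.empty(len(ct), dtype=[(c, ct[c].dtype) for c in columns])
    for c in columns:
        carr = ct[c]
        for start in range(0, len(carr), carr.chunklen):
            stop = min(start + carr.chunklen, len(carr))
            out[c][start:stop] = carr[start:stop]
    return out

Whether this actually wins anything would need measuring; the per-field writes are still strided.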

ARF1 commented Apr 18, 2015

Even if copying the read data were unavoidable, the numpy structured array is still a factor of ~1.6 slower than this ideal read-plus-copy:

%%timeit -r10
test1 = a['isin'][:].copy()
test2 = a['date'][:].copy()
test3 = a['tradeVenue'][:].copy()
test4 = a['closeSignificand'][:].copy()
test5 = a['closeExponent'][:].copy()

1 loops, best of 10: 1.67 s per loop
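My suspicion (unverified) is that the remaining gap sits in numpy itself: assigning into a field of a structured array is a strided copy into interleaved records, whereas copying a plain array is a single contiguous memcpy. A standalone comparison, independent of bcolz, with field sizes roughly matching the table above:

import numpy as np

n = 8769282
src = np.arange(n, dtype='<i4')
plain = np.empty(n, dtype='<i4')
records = np.empty(n, dtype=[('valueSignificand', '<i4'), ('id', 'S12')])

%timeit plain[:] = src                      # contiguous copy
%timeit records['valueSignificand'] = src   # strided copy into records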
