BUG: possible indexing/selection bug #319

jreback · 2014-01-13T16:41:42Z

In [10]: tables.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:  3.0.0
HDF5 version:      1.8.4-patch1
NumPy version:     1.7.1
Numexpr version:   2.1 (not using Intel's VML/MKL)
Zlib version:      1.2.3.4 (in Python interpreter)
LZO version:       2.03 (Apr 30 2008)
Blosc version:     1.2.3 (2013-05-17)
Cython version:    0.17.2
Python version:    2.7.3 (default, Jun 21 2012, 07:50:29) 
[GCC 4.4.5]
Platform:          linux2-x86_64
Byte-ordering:     little
Detected cores:    12
Default encoding:  ascii
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

see here: pandas-dev/pandas#5913

Narrowed it down to this:

create a table with a larger expectedrows than the number of rows actually stored
create an index on a column
select via a fairly small start/stop range (in the example below, a chunksize of 1M does not trigger the problem, but 500k makes it fail)

If I don't pass expectedrows, then this works as expected!
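
For reference, the start/stop windows described above can be enumerated without PyTables. This is only a minimal sketch; `chunk_bounds` is a hypothetical helper name, not part of PyTables or pandas, and it also covers a trailing partial chunk, which the reproduction script does not need because 7,000,000 divides evenly by 500,000.

```python
def chunk_bounds(nrows, chunksize):
    """Yield (start, stop) pairs that cover range(nrows) in chunksize steps.

    Hypothetical helper for illustration; the last window is clipped to nrows.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    for start in range(0, nrows, chunksize):
        yield start, min(start + chunksize, nrows)


# Same sizes as the reproduction: 7,000,000 rows read in 500,000-row windows.
bounds = list(chunk_bounds(7000000, 500000))
print(len(bounds))   # number of windows
print(bounds[5])     # (start, stop) of the sixth window
```

Each pair would then be passed as the start and stop arguments of the selection call.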

Code to reproduce:

import numpy as np
import tables

def pr(result):
    print np.unique(result['o'])

n = 7000000
arr = np.zeros((n,),dtype=[('index', 'i8'), ('o', 'i8'), ('value', 'f8')])
arr['index'] = np.arange(n)
arr['o'] = np.random.randint(-20000,-15000,size=n)
arr['value'] = value = np.random.randn(n)

handle = tables.openFile('test.h5','w',filters=tables.Filters(complevel=9,complib='blosc'))
node   = handle.createGroup(handle.root, 'foo')
table  = handle.createTable(node, 'table', dict(
    index   = tables.Int64Col(),
    o   = tables.Int64Col(),
    value  = tables.FloatCol(shape=())),
                            expectedrows=10000000)

table.cols.index.createIndex()
table.append(arr)
handle.close()

v1 = np.unique(arr['o'])[0]
v2 = np.unique(arr['o'])[1]
selector = '((o == %s) | (o == %s))' % (v1, v2)
print "selecting values: %s" % selector

handle = tables.openFile('test.h5','a')
table  = handle.root.foo.table

print "select entire table"
pr(table.readWhere(selector))

print "index the column o"
table.cols.o.createIndex()

print "select via chunks"
cs = 500000
chunks = n / cs
for i in range(chunks):
    pr(table.readWhere(selector, start=i*cs,stop=(i+1)*cs))

handle.close()

Output: each chunk should yield [-20000 -19999], but extraneous values that are not in the selection spec are being selected.

[-20000 -19999]
[sheep-jreback-~/pandas] python test.py
selecting values: ((o == -20000) | (o == -19999))
select entire table
[-20000 -19999]
index the column o
select via chunks
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999 -18154]
[-20000 -19999]
[-20000 -19999 -15413]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
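
To make the bad chunks easy to spot, the printed values can be compared against the two values in the selector. This is a small sketch in plain Python; the list is copied from the 14 chunk outputs above, and `extraneous` is a hypothetical helper name.

```python
# The two values the selector is supposed to match.
allowed = {-20000, -19999}

# Unique 'o' values printed for each of the 14 chunks in the run above.
chunk_outputs = [[-20000, -19999] for _ in range(14)]
chunk_outputs[5] = [-20000, -19999, -18154]
chunk_outputs[7] = [-20000, -19999, -15413]


def extraneous(values, allowed):
    """Return the sorted values that the selector should not have matched."""
    return sorted(set(values) - set(allowed))


# Map chunk number -> unexpected values, keeping only the chunks that have any.
bad = {}
for i, values in enumerate(chunk_outputs):
    extra = extraneous(values, allowed)
    if extra:
        bad[i] = extra

print(bad)
```

With a chunksize of 500000, chunks 5 and 7 correspond to rows 2,500,000 to 3,000,000 and 3,500,000 to 4,000,000.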

CarstVaartjes · 2014-01-14T00:43:49Z

I can mail the login if anyone wants it (Jeff has it too and can share it with you too if you want)

scopatz · 2014-01-14T00:50:06Z

Is there any way you can provide a simple script that reproduces the failure? Additionally, is there any way that you could provide a failing script that doesn't use subprocess? Thanks!

jreback · 2014-01-14T00:55:46Z

@scopatz I couldn't find a smaller example.
I will email you the login/pw to retrieve the file.

I don't need subprocess; that's just to run ptrepack (which shows that it does select properly with no index).

jreback · 2014-01-14T00:58:45Z

@scopatz emailed you the login info

scopatz · 2014-01-14T01:00:53Z

@jreback A bash script that does the same thing then would be greatly appreciated. I want to ensure that our workflows are exactly the same so that we don't waste a lot of time. Also, I might not be able to get to this for a few days.

jreback · 2014-01-14T14:06:44Z

@scopatz found a reproducible example (I changed the top section).

scopatz · 2014-01-14T14:46:10Z

Thanks @jreback!

rockg · 2014-09-14T16:56:29Z

Has there been any progress with this? We have a couple of reproducible examples and nasty workarounds are required to avoid using the index which is seemingly unreliable. Thanks.

jreback · 2014-09-29T14:15:20Z

@scopatz any progress on this?

jreback · 2015-04-18T18:59:25Z

@FrancescAlted thanks!

jreback mentioned this issue Jan 13, 2014

HDF5 Select with Filter gives incorrect results when using Iteration pandas-dev/pandas#5913

Closed

jreback mentioned this issue Sep 14, 2014

HDF5 index corruption pandas-dev/pandas#8265

Closed

jreback mentioned this issue Mar 18, 2015

BUG: entries missing when reading from pytables hdf store using "where" statement pandas-dev/pandas#9676

Closed

alexfields mentioned this issue Mar 18, 2015

table.where query does not seem to be able to find rows... #409

Closed

FrancescAlted added a commit that referenced this issue Apr 18, 2015

The buffersize *must* be a multiple of chunkshape[0] while doing indexed queries. Fixes #319.

035dbd5
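
The constraint stated in the commit message (a buffer size that is a multiple of chunkshape[0]) can be illustrated with a few lines of arithmetic. This is a sketch only and not the code from the commit: the function name `align_to_chunk` and the choice to round down (with a floor of one chunk) are assumptions made for illustration.

```python
def align_to_chunk(nrows_in_buf, chunk_rows):
    """Return a row count that is a multiple of chunk_rows.

    Assumption for illustration: round down to the nearest multiple,
    but never below a single chunk. The actual fix may differ.
    """
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    if nrows_in_buf <= chunk_rows:
        return chunk_rows
    return (nrows_in_buf // chunk_rows) * chunk_rows


print(align_to_chunk(1000, 256))  # largest multiple of 256 that fits in 1000
print(align_to_chunk(100, 256))   # smaller than one chunk, so one chunk
```

The idea is that every buffer then starts and ends on a chunk boundary, so no read window is split partway through a chunk.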

FrancescAlted closed this as completed Apr 18, 2015

avalentino added this to the 3.2 milestone Apr 19, 2015

avalentino added the defect label Apr 19, 2015

kaukrise mentioned this issue Aug 20, 2015

Make _calc_nrowsinbuf in tables/leaf.py also check for classes inheriting Table #489

Merged

andreabedini mentioned this issue Sep 23, 2015

test_indexvalues.BuffersizeMultipleChunksize fails on win-amd64-py3.5 #506

Closed
