BUG: possible indexing/selection bug #319

Closed
jreback opened this issue Jan 13, 2014 · 10 comments

@jreback
Contributor

jreback commented Jan 13, 2014

In [10]: tables.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:  3.0.0
HDF5 version:      1.8.4-patch1
NumPy version:     1.7.1
Numexpr version:   2.1 (not using Intel's VML/MKL)
Zlib version:      1.2.3.4 (in Python interpreter)
LZO version:       2.03 (Apr 30 2008)
Blosc version:     1.2.3 (2013-05-17)
Cython version:    0.17.2
Python version:    2.7.3 (default, Jun 21 2012, 07:50:29) 
[GCC 4.4.5]
Platform:          linux2-x86_64
Byte-ordering:     little
Detected cores:    12
Default encoding:  ascii
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

see here: pandas-dev/pandas#5913

Narrowed it down to this:

create a table with a larger expectedrows than the number of rows actually stored
create an index on a column
select via a fairly small start/stop range (e.g. in the example below, a chunksize of 1M does not trigger the problem, but 500k makes it fail).

If I don't pass expectedrows, then this works as expected!

Code to reproduce:

import numpy as np
import tables

def pr(result):
    print np.unique(result['o'])

n = 7000000
arr = np.zeros((n,),dtype=[('index', 'i8'), ('o', 'i8'), ('value', 'f8')])
arr['index'] = np.arange(n)
arr['o'] = np.random.randint(-20000,-15000,size=n)
arr['value'] = value = np.random.randn(n)

handle = tables.openFile('test.h5','w',filters=tables.Filters(complevel=9,complib='blosc'))
node   = handle.createGroup(handle.root, 'foo')
table  = handle.createTable(node, 'table', dict(
    index   = tables.Int64Col(),
    o   = tables.Int64Col(),
    value  = tables.FloatCol(shape=())),
                            expectedrows=10000000)

table.cols.index.createIndex()
table.append(arr)
handle.close()

v1 = np.unique(arr['o'])[0]
v2 = np.unique(arr['o'])[1]
selector = '((o == %s) | (o == %s))' % (v1, v2)
print "selecting values: %s" % selector

handle = tables.openFile('test.h5','a')
table  = handle.root.foo.table

print "select entire table"
pr(table.readWhere(selector))

print "index the column o"
table.cols.o.createIndex()

print "select via chunks"
cs = 500000
chunks = n / cs
for i in range(chunks):
    pr(table.readWhere(selector, start=i*cs,stop=(i+1)*cs))

handle.close()
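
The chunk loop above computes `chunks = n / cs`, which is an integer only under Python 2. As a minimal sketch of the same start/stop arithmetic that also runs on Python 3, the helper below yields the ranges directly. The name `chunk_bounds` is hypothetical and not part of the original script or of PyTables.

```python
def chunk_bounds(nrows, chunksize):
    """Yield (start, stop) pairs covering rows [0, nrows) in chunksize steps."""
    for start in range(0, nrows, chunksize):
        # Clamp the last chunk so stop never exceeds the row count.
        yield start, min(start + chunksize, nrows)


# Same sizes as the reproduction script: 7M rows, 500k rows per chunk.
bounds = list(chunk_bounds(7000000, 500000))
print(len(bounds), bounds[0], bounds[-1])
```

Each pair could then be passed as the `start` and `stop` arguments of the `readWhere` call used in the script.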

Output: each chunk should print [-20000 -19999], but extraneous values that are not in the selection spec are being selected

[-20000 -19999]
[sheep-jreback-~/pandas] python test.py
selecting values: ((o == -20000) | (o == -19999))
select entire table
[-20000 -19999]
index the column o
select via chunks
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999 -18154]
[-20000 -19999]
[-20000 -19999 -15413]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
[-20000 -19999]
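
To make the failing chunks easy to spot, here is a small sketch that compares the unique values printed for each chunk against the two values named in the selector. The per-chunk lists are copied from the output above; the function name `extraneous_by_chunk` is hypothetical.

```python
EXPECTED = {-20000, -19999}

# Unique 'o' values printed for each of the 14 chunks in the output above.
chunk_uniques = [
    [-20000, -19999],
    [-20000, -19999],
    [-20000, -19999],
    [-20000, -19999],
    [-20000, -19999],
    [-20000, -19999, -18154],
    [-20000, -19999],
    [-20000, -19999, -15413],
    [-20000, -19999],
    [-20000, -19999],
    [-20000, -19999],
    [-20000, -19999],
    [-20000, -19999],
    [-20000, -19999],
]


def extraneous_by_chunk(uniques, expected):
    """Map chunk number -> sorted values that were selected but not requested."""
    leaks = {}
    for i, values in enumerate(uniques):
        extra = sorted(set(values) - set(expected))
        if extra:
            leaks[i] = extra
    return leaks


leaks = extraneous_by_chunk(chunk_uniques, EXPECTED)
for i, extra in sorted(leaks.items()):
    print("chunk %d: extraneous values %s" % (i, extra))
```

With a chunk size of 500k, the two chunks flagged here (numbered from 0) are the ones starting at rows 2,500,000 and 3,500,000.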
@CarstVaartjes

I can mail the login if anyone wants it (Jeff has it too and can share it with you too if you want)

@scopatz
Member

scopatz commented Jan 14, 2014

Is there any way you can provide a simple script that reproduces the failure? Additionally, is there any way that you could provide a failing script that doesn't use subprocess? Thanks!

@jreback
Contributor Author

jreback commented Jan 14, 2014

@scopatz I couldn't find a smaller example
I will email u the login/pw to retrieve the file

I don't need subprocess - that's just to run ptrepack (which shows that it does select properly with no index)

@jreback
Contributor Author

jreback commented Jan 14, 2014

@scopatz emailed u the login info

@scopatz
Member

scopatz commented Jan 14, 2014

@jreback A bash script that does the same thing then would be greatly appreciated. I want to ensure that our workflows are exactly the same so that we don't waste a lot of time. Also, I might not be able to get to this for a few days.

@jreback
Contributor Author

jreback commented Jan 14, 2014

@scopatz found a reproducible example (I changed the top section).

@scopatz
Member

scopatz commented Jan 14, 2014

Thanks @jreback!

@rockg

rockg commented Sep 14, 2014

Has there been any progress with this? We have a couple of reproducible examples and nasty workarounds are required to avoid using the index which is seemingly unreliable. Thanks.
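
For readers looking for a way to sidestep the index in the meantime, one possible approach (an assumption, not necessarily the workaround referred to above) is to read each row range without a query and filter it in memory with NumPy. The helper name `filter_without_index` is hypothetical; the PyTables call in the comment is shown only as intended usage and is not executed here.

```python
import numpy as np


def filter_without_index(chunk, column, wanted):
    """Return the rows of a structured array whose `column` value is in `wanted`."""
    mask = np.isin(chunk[column], list(wanted))
    return chunk[mask]


# Intended usage with PyTables (assumption, not run here):
#     chunk = table.read(start=start, stop=stop)   # plain range read, no query
#     result = filter_without_index(chunk, 'o', [v1, v2])

# Small in-memory stand-in with the same dtype as the reproduction script.
demo = np.zeros(6, dtype=[('index', 'i8'), ('o', 'i8'), ('value', 'f8')])
demo['index'] = np.arange(6)
demo['o'] = [-20000, -18154, -19999, -15413, -20000, -19999]

result = filter_without_index(demo, 'o', [-20000, -19999])
print(np.unique(result['o']))
```

The trade-off is that every row in the range is loaded before filtering, so this gives up the speed benefit the index was meant to provide.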

@jreback
Contributor Author

jreback commented Sep 29, 2014

@scopatz any progress on this?

@jreback
Contributor Author

jreback commented Apr 18, 2015

@FrancescAlted thanks!
