Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QueryCondition problems with ascii/bytes_ attributes #782

Closed
bkmartinjr opened this issue Nov 16, 2021 · 9 comments
Closed

QueryCondition problems with ascii/bytes_ attributes #782

bkmartinjr opened this issue Nov 16, 2021 · 9 comments

Comments

@bkmartinjr
Copy link

Test case below demonstrates a few issues with QueryCondition on Linux/TileDB-Py 0.11.0

As I understand the typing system, attributes of type 'ascii' should be the same as numpy.bytes_. If that is wrong, the first two items below may not be accurate. Three issues:

  1. Creating a schema Attr with dtype=np.bytes_ vs dtype="ascii" give you two different schema types in the underlying array:
    Eg,
                tiledb.Attr(name="bytes_", dtype=np.bytes_, var=True, filters=filters),
                tiledb.Attr(name="ascii", dtype="ascii", var=True, filters=filters),

Results in:

  attrs=[
    Attr(name='bytes_', dtype='|S0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='ascii', dtype='ascii', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
  ],

  1. A query condition on an attribute of type 'ascii' works OK. The same query condition on a type np.bytes_ fails with the exception:

[TileDB::QueryCondition] Error: Clause non-empty attribute may only be var-sized for ASCII strings: bytes_.

This is demonstrated by the difference in the first two tests:

    qc = tiledb.QueryCondition("ascii == '0'")

vs

     qc = tiledb.QueryCondition("bytes_ == '0'")

  1. a query condition that looks for an empty string goes into an infinite loop, and eventually will consume all memory in the host and crash. This is demonstrated by uncommenting the final test ("Test 3")

Test case:

import numpy as np
import tiledb


def main():
    uri = "obs"
    create_array(uri)

    vals = np.char.encode(np.array([str(i) for i in range(100)], dtype=str), encoding="ascii")
    with tiledb.open(uri, "w") as obs:
        obs[vals] = {"bytes_": vals, "ascii": vals}
    tiledb.consolidate(uri)
    tiledb.vacuum(uri)

    with tiledb.open(uri, "r") as obs:        
        # this works
        print("\n____Test 1:")
        qc = tiledb.QueryCondition("ascii == '0'")
        print(qc, obs.query(attr_cond=qc).df[:])

        # this throws an error
        print("\n____Test 2:")
        try:
            qc = tiledb.QueryCondition("bytes_ == '0'")
            print(qc, obs.query(attr_cond=qc).df[:])
        except Exception as e:
            print(e)

        # And if you uncomment these lines, you get an infinite loop that runs out of memory and crashes
        # print("\n____Test 3:")
        # qc = tiledb.QueryCondition("ascii == ''")
        # print(qc, obs.query(attr_cond=qc).df[:])


def create_array(uri):
    if tiledb.array_exists(uri):
        tiledb.remove(uri)

    filters = tiledb.FilterList([tiledb.ZstdFilter()])
    dims = [
        tiledb.Dim(name="dim0", dtype=np.bytes_, filters=filters, var=True),
    ]
    tiledb.Array.create(
        uri,
        tiledb.ArraySchema(
            domain=tiledb.Domain(dims),
            sparse=True,
            allows_duplicates=True,
            attrs=[
                tiledb.Attr(name="bytes_", dtype=np.bytes_, var=True, filters=filters),
                tiledb.Attr(name="ascii", dtype="ascii", var=True, filters=filters),
            ],
            cell_order="row-major",
            tile_order="row-major",
            capacity=10000,
        ),
    )


if __name__ == "__main__":
    main()
@nguyenv
Copy link
Collaborator

nguyenv commented Nov 16, 2021

Thanks for bringing these issues to our attention.

This is confusing but for dimensions, both "ascii" and np.bytes_/np.dtype('S1') map internally to TILEDB_STRING_ASCII, so in that case they are the same. However, for attributes, "ascii" maps to TILEDB_STRING_ASCII whereas np.bytes_/np.dtype('S1') maps to TILEDB_CHAR, so they are different here. We only support string matching with attributes of "ascii" (TILEDB_STRING_ASCII) dtypes. I'll make better note of this in the documentation.

I will add the bug into our issue tracker and take a look soon.

@Shelnutt2
Copy link
Member

@nguyenv would it be possible to dis-allow variable length TILEDB_CHAR? That would force users to use TILEDB_STRING_ASCII which is the intended var-length string.

Also I wonder, @ihnorton should np.bytes_ actually be TILEDB_UINT8? Forgive my python ignorance at what a byte means here :).

@nguyenv
Copy link
Collaborator

nguyenv commented Nov 16, 2021

@nguyenv would it be possible to dis-allow variable length TILEDB_CHAR? That would force users to use TILEDB_STRING_ASCII which is the intended var-length string.

https://github.com/TileDB-Inc/TileDB-Py/blob/dev/tiledb/libtiledb.pyx#L2267-L2288
Yes, we have added a deprecation warning for this, but we are still allowing it for the time being.

@bkmartinjr
Copy link
Author

@nguyenv - thank you!

Beyond query conditions, are there any additional differences implied by TILEDB_STRING_ASCII and variable-length TILEDB_CHAR (when used in attributes)?

@bkmartinjr
Copy link
Author

bkmartinjr commented Nov 16, 2021

Also, noted a very minor inconsistency with tiledb.Dim() and its handling of var=True

This creates an ascii, var=True dimension:

In [8]: tiledb.Dim(name='foo', dtype='ascii')
Out[8]: Dim(name='foo', domain=(None, None), tile='None', dtype='|S0', var=True)

I would expect that the addition of var=True would do the same (see output above), but it instead generates an error:

In [7]: tiledb.Dim(name='foo', dtype='ascii', var=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-1e861f61c7c6> in <module>
----> 1 tiledb.Dim(name='foo', dtype='ascii', var=True)

tiledb/libtiledb.pyx in tiledb.libtiledb.Dim.__init__()

TypeError: data type 'ascii' not understood

Oddly, it appears that tiledb.Attr() requires the var=True for dtype="ascii" and will generate an error without it:

In [1]: import tiledb

In [2]: tiledb.Attr(name='foo', dtype='ascii')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-68e19123de0a> in <module>
----> 1 tiledb.Attr(name='foo', dtype='ascii')

tiledb/libtiledb.pyx in tiledb.libtiledb.Attr.__init__()

TypeError: dtype is not compatible with var-length attribute

In [3]: tiledb.Attr(name='foo', dtype='ascii', var=True)
Out[3]: Attr(name='foo', dtype='ascii', var=True, nullable=False)

@bkmartinjr
Copy link
Author

bkmartinjr commented Nov 16, 2021

Another variation on the out-of-memory crasher ("Test 3" in the original test case above): it will occur for any value that does not exist in the query result dataset.

In other words, this fails in the same manner:

array.query(attr_cond=tiledb.QueryCondition("field == 'no_such_value_in_data'")).df[:]

@nguyenv
Copy link
Collaborator

nguyenv commented Nov 16, 2021

Thanks for the additional example. I believe the issue has something to do with when no values match the condition when allows_duplicates=True, but I haven't pinpointed why yet. These queries work fine when duplicates aren't allowed.

@nguyenv
Copy link
Collaborator

nguyenv commented Nov 23, 2021

Test 3 is now fixed by this PR in core (TileDB-Inc/TileDB#2643). We'll be keeping this issue open until we the string typing issue resolved.

@nguyenv
Copy link
Collaborator

nguyenv commented Feb 4, 2022

Applying query conditions to np.bytes_ attributes is now supported with TileDB-Inc/TileDB#2814 and will be available in the upcoming 0.12.3 release (linked against libtiledb 2.6.3). You may find an example of usage in the Python API here. Please reopen this issue if there are still any problems.

@nguyenv nguyenv closed this as completed Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants