
Extend Arrow support to cover nullable data. #4049

Merged: 13 commits merged into dev from de/sc-27472/arrow_nullable on Sep 11, 2024

Conversation

@eddelbuettel (Contributor) commented Apr 20, 2023

The arrowio header provides import/export support between TileDB and Arrow via an interface of two void* pointers. This PR extends that support to cover 'nullable' (aka 'validity map') data.

The PR still needs tests, but the existing test setup is a little involved between Python, pybind11, and C++, so @ihnorton has kindly 'volunteered' to add them.

The PR will remain a draft until we have tests.

[sc-27472]


TYPE: FEATURE
DESC: Extend Arrow support to cover nullable data.
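For background on what such support involves: TileDB validity buffers use one byte per cell, while the Arrow C Data Interface uses a packed, LSB-ordered validity bitmap (where a null bitmap pointer means "no nulls"). A minimal sketch of the expansion step this conversion needs; the helper name is hypothetical, not from the PR:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand an Arrow validity bitmap (1 bit per cell, least-significant bit
// first) into a TileDB-style validity bytemap (1 byte per cell, 1 = valid).
std::vector<uint8_t> bitmap_to_bytemap(const uint8_t* bitmap, size_t n_cells) {
    std::vector<uint8_t> bytemap(n_cells, 1);
    if (bitmap == nullptr)       // Arrow convention: absent bitmap means
        return bytemap;          // every cell is valid
    for (size_t i = 0; i < n_cells; i++)
        bytemap[i] = (bitmap[i / 8] >> (i % 8)) & 1;
    return bytemap;
}
```

The reverse direction (packing a bytemap into a bitmap) is the mirror image and is what an export path would need.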

@shortcut-integration

This pull request has been linked to Shortcut Story #27472: Extend arrow i/o operations to nullable columns.

@Shelnutt2 (Member)

Looks like a few MSVC errors: https://github.com/TileDB-Inc/TileDB/actions/runs/4758463351/jobs/8456556083?pr=4049#step:10:593

 D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): message : see usage of 'n' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): message : see usage of 'n' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(624,15): error C3863: array type 'uint8_t [n]' is not assignable (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(650,39): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : see usage of 'arw_array' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(651,39): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : see usage of 'arw_array' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(657,24): error C3863: array type 'int64_t [num_offsets+1]' is not assignable (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
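The C2131/C3863 errors are the classic variable-length-array pattern: an array like `uint8_t bitmap[n]` with a runtime `n` is a GCC/Clang extension that MSVC rejects. A hedged sketch of the portable fix; the function name and surrounding shape are assumptions, not the actual lines at arrow_io_impl.h:619:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed shape of the failing code: a stack VLA sized at runtime.
// The portable replacement is a heap-backed std::vector, which MSVC
// accepts and which also value-initializes its elements.
std::vector<uint8_t> make_bitmap(size_t n) {
    // was (non-portable): uint8_t bitmap[n]; std::memset(bitmap, 0, n);
    std::vector<uint8_t> bitmap(n, 0);  // standard C++, works everywhere
    return bitmap;
}
```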

@eddelbuettel (Contributor, Author)

I also saw an Ubuntu build fail but figure that would be spurious. My changes so far are quite gingerly and limited. Let's see what happens once new tests get injected.

@kounelisagis (Member)

@eddelbuettel if you see my last commit, we were basically never testing col sizes greater than zero.
My change causes the test to fail with the message:

  WriterBase: Invalid offsets for attribute utf_string1; the last offset: 0 is
  not equal to the size of the data buffer: 71

or something similar.

Do you have any idea? I spent some hours on it but couldn't find a solution.
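For context on this error message: TileDB var-length offset buffers hold n starting positions, while Arrow expects n+1 offsets whose final entry equals the total size of the data buffer. Leaving that last offset at 0 produces exactly this complaint. A minimal sketch of the required translation; the function name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Convert TileDB-style start offsets (n entries) into Arrow-style offsets
// (n+1 entries), where the appended final entry closes the last cell by
// pointing one past the end of the data buffer.
std::vector<int64_t> tiledb_to_arrow_offsets(
    const std::vector<uint64_t>& starts, int64_t data_size) {
    std::vector<int64_t> arrow(starts.begin(), starts.end());
    arrow.push_back(data_size);  // e.g. 71 in the failing test, not 0
    return arrow;
}
```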

@eddelbuettel (Contributor, Author)

Howdy. Love that you are trying to resurrect the PR. These days it would probably be better to wrap some of the code in the nanoarrow (header-only) helper functions we already use in tiledb-r and tiledbsoma (mostly in the R parts). See https://github.com/apache/arrow-nanoarrow and check out the recently expanded Python side (to manipulate objects; nanoarrow is really three packages: the header-only C/C++ helper, a somewhat mature R package for working on objects, and a growing Python package).

I think I wrote this PR after similar work in the R package, where I have since changed things. (Just took a look.) Hm, on the other hand, that work is quite motivated by SOMA's libtiledbsoma and its ColumnBuffers etc.

My pedestrian approach has usually been to set up a similar mock function standalone, or as an R extension, and then abstract out the pure C++ side for Core.

Back to your PR: I honestly do not know what the col size check is mad about. Is there something driving this now, and/or any urgency?

@kounelisagis kounelisagis marked this pull request as ready for review September 4, 2024 10:22
@eddelbuettel (Contributor, Author)

Thanks for giving this PR some attention and care! I tried to take a good look today 'from first principles' to see if we could use the updated code to read columns containing nulls, and whether we can write them.

The overall summary is 'yes' and 'no'. Reading works, writing has an issue which I identified but have no fix for yet. (This is likely an old issue and not caused by this update.)

The reading example worked. I created a (pretty simple) standalone C++ program which relies only on Core as in this PR, plus nanoarrow, which we already vendor elsewhere. Because the interface we have is still on a column-by-column basis, I opted to pick a column of the standard 'penguins' data set which has two known NAs among its 344 int values. Relying on nanoarrow and some of its (internal) testing code (not part of the shipped header), we can turn the column into JSON; the output is below. Note that the fourth validity value is zero, as is one towards the end. (The JSON writer used is barebones and does not format with linebreaks.)

$ ./arrow_ex2
{"name": "body_mass_g", "count": 344, "VALIDITY": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], "DATA": [3750, 3800, 3250, 0, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 3450, 4500, 3325, 4200, 3400, 3600, 3800, 3950, 3800, 3800, 3550, 3200, 3150, 3950, 3250, 3900, 3300, 3900, 3325, 4150, 3950, 3550, 3300, 4650, 3150, 3900, 3100, 4400, 3000, 4600, 3425, 2975, 3450, 4150, 3500, 4300, 3450, 4050, 2900, 3700, 3550, 3800, 2850, 3750, 3150, 4400, 3600, 4050, 2850, 3950, 3350, 4100, 3050, 4450, 3600, 3900, 3550, 4150, 3700, 4250, 3700, 3900, 3550, 4000, 3200, 4700, 3800, 4200, 3350, 3550, 3800, 3500, 3950, 3600, 3550, 4300, 3400, 4450, 3300, 4300, 3700, 4350, 2900, 4100, 3725, 4725, 3075, 4250, 2925, 3550, 3750, 3900, 3175, 4775, 3825, 4600, 3200, 4275, 3900, 4075, 2900, 3775, 3350, 3325, 3150, 3500, 3450, 3875, 3050, 4000, 3275, 4300, 3050, 4000, 3325, 3500, 3500, 4475, 3425, 3900, 3175, 3975, 3400, 4250, 3400, 3475, 3050, 3725, 3000, 3650, 4250, 3475, 3450, 3750, 3700, 
4000, 4500, 5700, 4450, 5700, 5400, 4550, 4800, 5200, 4400, 5150, 4650, 5550, 4650, 5850, 4200, 5850, 4150, 6300, 4800, 5350, 5700, 5000, 4400, 5050, 5000, 5100, 4100, 5650, 4600, 5550, 5250, 4700, 5050, 6050, 5150, 5400, 4950, 5250, 4350, 5350, 3950, 5700, 4300, 4750, 5550, 4900, 4200, 5400, 5100, 5300, 4850, 5300, 4400, 5000, 4900, 5050, 4300, 5000, 4450, 5550, 4200, 5300, 4400, 5650, 4700, 5700, 4650, 5800, 4700, 5550, 4750, 5000, 5100, 5200, 4700, 5800, 4600, 6000, 4750, 5950, 4625, 5450, 4725, 5350, 4750, 5600, 4600, 5300, 4875, 5550, 4950, 5400, 4750, 5650, 4850, 5200, 4925, 4875, 4625, 5250, 4850, 5600, 4975, 5500, 4725, 5500, 4700, 5500, 4575, 5500, 5000, 5950, 4650, 5500, 4375, 5850, 4875, 6000, 4925, 0, 4850, 5750, 5200, 5400, 3500, 3900, 3650, 3525, 3725, 3950, 3250, 3750, 4150, 3700, 3800, 3775, 3700, 4050, 3575, 4050, 3300, 3700, 3450, 4400, 3600, 3400, 2900, 3800, 3300, 4150, 3400, 3800, 3700, 4550, 3200, 4300, 3350, 4100, 3600, 3900, 3850, 4800, 2700, 4500, 3950, 3650, 3550, 3500, 3675, 4450, 3400, 4300, 3250, 3675, 3325, 3950, 3600, 4050, 3350, 3450, 3250, 4050, 3800, 3525, 3950, 3650, 3650, 4000, 3400, 3775, 4100, 3775]}
$ 

I am also quite pleased with how nanoarrow kept the code simple. The full file is here, but the key part really is just

    ArrowAdapter aa(&ctx, &query);

    nanoarrow::UniqueArray arr;
    nanoarrow::UniqueSchema sch;
    struct ArrowError ec;
    if (NANOARROW_OK != ArrowSchemaInitFromType(sch.get(), NANOARROW_TYPE_INT32)) error_exit("Bad init");
    if (NANOARROW_OK != ArrowArrayInitFromSchema(arr.get(), sch.get(), &ec)) error_exit(ec.message);

    aa.export_buffer("body_mass_g", arr.get(), sch.get());

which has our ArrowAdapter in the first and last line, and very little nanoarrow in the middle. We start from the C++-wrapped objects that guarantee lifetime, and simply access them where the otherwise bare void* would go.

However, on the write side, no such luck. Our issue is that the importer looks at a TypeInfo struct inferred from the schema. It has a boolean field nullable, but it is not correctly set to false for dimensions, so set_validity_buffer ends up being called for a dimension when the query buffer is set from the Arrow schema and array. There might be a way to set this explicitly; I will do some more digging tomorrow. If we one-off override this (based on the column name), then we can show it would work. Here is a simple 4 x 2 array, one dim and one attr, with one attr value set to NULL.

$ ./arrow_ex3 
$ r -pe 'tiledb::tiledb_array("/tmp/tiledb/arrow_ex_write/", return_as="data.table")[]'
     row  data
   <int> <int>
1:     1    10
2:    21    20
3:    42    NA
4:    67    40
$ 

As this is not really an issue with the current PR, I think we could also pass it and merge, and address this in a follow-up.

@eddelbuettel (Contributor, Author) commented Sep 6, 2024

There might be a way to set this explicitly, I will do some more digging tomorrow.

Found it. A column schema gets the nullable flag set by default in nanoarrow, something we had come across before and that I may bring up upstream. In any event, masking out the NULLABLE bit does it, e.g.

    dsch.get()->flags &= ~ARROW_FLAG_NULLABLE;

As that is 'client-side', this PR should be good to go. But as I started this PR, I don't seem to be able to review and approve the subsequent changes.

PS Issue filed here apache/arrow-nanoarrow#604
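As a side note on the idiom above: `flags &= ~ARROW_FLAG_NULLABLE` clears a single bit while leaving the other schema flags intact. Per the Arrow C Data Interface specification, the defined flag values are 1 (DICTIONARY_ORDERED), 2 (NULLABLE), and 4 (MAP_KEYS_SORTED). A standalone illustration, with the constant value taken from the spec and a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Flag value as defined by the Arrow C Data Interface specification.
constexpr int64_t kArrowFlagNullable = 2;

// Clearing (not ORing) the bit: AND with the complement removes only
// the NULLABLE bit and preserves any other flags that are set.
int64_t clear_nullable(int64_t flags) {
    return flags & ~kArrowFlagNullable;
}
```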

@eddelbuettel (Contributor, Author)

Can we rebase the PR branch to dev (or merge dev in; I tend to prefer rebase)?

That would allow us to, e.g., include the nanoarrow headers first.
@eddelbuettel (Contributor, Author)

I added another two-line commit that protects our Arrow C Data Interface definition from being redefined (which throws a compiler error), as can happen when, e.g., the nanoarrow headers were already included.
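The guard in question is presumably the one the Arrow C Data Interface specification itself recommends: wrap the struct definitions in an `ARROW_C_DATA_INTERFACE` preprocessor check so that multiple vendored copies (Core's, nanoarrow's) can coexist in one translation unit. A fragment of the pattern, with the struct members elided down to a sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Guarded definition, per the Arrow C Data Interface spec's own
   recommendation: if another vendored copy was included first, the
   macro is already defined and this duplicate definition is skipped. */
#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
struct ArrowSchema {
    const char* format;
    const char* name;
    int64_t flags;
    /* remaining members elided; see the spec for the full definition */
};
#endif
```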

@eddelbuettel (Contributor, Author)

Interesting: approved, all tests passed, but merging is blocked. Needs one more merge from dev, I presume?

@KiterLuc KiterLuc changed the title Extend Arrow support to cover nullable data Extend Arrow support to cover nullable data. Sep 10, 2024
@KiterLuc KiterLuc merged commit a35426a into dev Sep 11, 2024
65 checks passed
@KiterLuc KiterLuc deleted the de/sc-27472/arrow_nullable branch September 11, 2024 16:19