Describe a schema in an R object by eddelbuettel · Pull Request #344 · TileDB-Inc/TileDB-R

eddelbuettel · 2022-01-07T23:41:53Z

This PR gathers schema information we can use both to print the schema creation commands and to summarize array objects directly for more fine-grained formatting. It is rather unfortunate that this only comes together now, as @johnkerl could most likely have saved some time had I put this together earlier.

I will leave this as a draft for now. It currently returns a list with two data frames: one for array (high-level) descriptives and one for all 'data' columns. As dimensions and attributes are in fact a little distinct, it may be beneficial to return one data frame each for dimensions and attributes.

Current output format showing the two data frames directly on two sample arrays:

edd@rob:~/git/tiledb-r(de/sc-13273/schema_object)$ r -ltiledb -e'arr <- tiledb_array("/tmp/tiledb/quickstart_dense"); df <- tiledb_schema_object(arr); print(df)'
$array
                           uri  type cell_order tile_order capacity allow_dupes coord_filters        coord_options offset_filters     offset_filters.1
1 /tmp/tiledb/quickstart_dense dense  COL_MAJOR  COL_MAJOR    10000       FALSE          ZSTD COMPRESSION_LEVEL=-1           ZSTD COMPRESSION_LEVEL=-1

$data
     column names datatypes nullable varnum domains extend nfilters filters filtopts fillvalue
1 dimension  rows     INT32    FALSE      1   [1,4]      4        0                           
2 dimension  cols     INT32    FALSE      1   [1,4]      4        0                           
3 attribute     a     INT32    FALSE      1                       0                         NA

edd@rob:~/git/tiledb-r(de/sc-13273/schema_object)$ r -ltiledb -e'arr <- tiledb_array("/tmp/tiledb/penguins"); df <- tiledb_schema_object(arr); print(df)'
$array
                   uri   type cell_order tile_order capacity allow_dupes coord_filters        coord_options offset_filters     offset_filters.1
1 /tmp/tiledb/penguins sparse  COL_MAJOR  COL_MAJOR    10000        TRUE          ZSTD COMPRESSION_LEVEL=-1           ZSTD COMPRESSION_LEVEL=-1

$data
     column             names datatypes nullable varnum     domains extend nfilters filters             filtopts fillvalue
1 dimension           species     ASCII    FALSE     NA (null,null)   null        0                                       
2 dimension            island     ASCII    FALSE     NA (null,null)   null        0                                       
3 attribute    bill_length_mm   FLOAT64     TRUE      1                           1    ZSTD COMPRESSION_LEVEL=-1          
4 attribute     bill_depth_mm   FLOAT64     TRUE      1                           1    ZSTD COMPRESSION_LEVEL=-1          
5 attribute flipper_length_mm     INT32     TRUE      1                           1    ZSTD COMPRESSION_LEVEL=-1          
6 attribute       body_mass_g     INT32     TRUE      1                           1    ZSTD COMPRESSION_LEVEL=-1          
7 attribute               sex     ASCII     TRUE     NA                           1    ZSTD COMPRESSION_LEVEL=-1          
8 attribute              year     INT32    FALSE      1                           1    ZSTD COMPRESSION_LEVEL=-1        NA

edd@rob:~/git/tiledb-r(de/sc-13273/schema_object)$

shortcut-integration · 2022-01-07T23:41:55Z

This pull request has been linked to Shortcut Story #13273: Wrap ArraySchema as high-level object.

eddelbuettel · 2022-01-13T16:44:18Z

Some more work here to make it more like the Python sibling that describes an array schema in 'code'. A new function describe() was added to do just that. We may want to embed it in show() instead, via an option. That is easy to adjust.

This is now mostly feature complete. I have not done full round-trips yet to see whether all ASCII representations of enums map fully (there may be some cases of, say, "ASCII" != "TILEDB_ASCII").

For example for the Penguins array (with NAs) we now get this:

> library(tiledb)
TileDB R 0.10.2 with TileDB Embedded 2.7.0. See https://tiledb.com for more information.
> arr <- tiledb_array("/tmp/tiledb/penguins/")
> describe(arr)
dims <- c(tiledb_dim(name="species", domain=c(NULL,NULL), tile=NULL, type="ASCII"),
          tiledb_dim(name="island", domain=c(NULL,NULL), tile=NULL, type="ASCII")))
dom <- tiledb_domain(dims=dims)
attrs <- c(tiledb_attr(name="bill_length_mm", type="FLOAT64", ncells=1, nullable=TRUE, filter_list=c(tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))),
           tiledb_attr(name="bill_depth_mm", type="FLOAT64", ncells=1, nullable=TRUE, filter_list=c(tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))),
           tiledb_attr(name="flipper_length_mm", type="INT32", ncells=1, nullable=TRUE, filter_list=c(tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))),
           tiledb_attr(name="body_mass_g", type="INT32", ncells=1, nullable=TRUE, filter_list=c(tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))),
           tiledb_attr(name="sex", type="ASCII", ncells=NA, nullable=TRUE, filter_list=c(tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))),
           tiledb_attr(name="year", type="INT32", ncells=1, nullable=FALSE, filter_list=c(tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))))
sch <- tiledb_array_schema(domain=dom, attrs=attrs, cell_order="COL_MAJOR", tile_order="COL_MAJOR", sparse=TRUE, capacity=10000, allow_dupes=TRUE, 
                           coord_filters=filter_list=c(tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))), 
                           offset_filters=filter_list=c(tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))))
>

after which sch can be used with a suitable uri in tiledb_array_create(uri, sch).

johnkerl

awesome @eddelbuettel !!

one little fine-tune opportunity, i ran this for a few arrays then vim'ed the result & i am seeing some parenthesis imbalances --
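
One way to catch such imbalances mechanically, as a minimal sketch: base R's parse() only parses its input without evaluating it, so it raises an error on unbalanced parentheses even when tiledb is not loaded. The helper name parses_ok is hypothetical and not part of this PR, and the two sample strings are shortened variants of the dims line from the describe() output above.

```r
# Hypothetical helper (not part of the PR): TRUE if `code` parses as R,
# FALSE if parse() raises a syntax error (e.g. an unbalanced parenthesis).
# `code` may be a single string or a character vector of lines.
parses_ok <- function(code) {
  tryCatch({
    parse(text = code, keep.source = FALSE)  # parses only, evaluates nothing
    TRUE
  }, error = function(e) FALSE)
}

# Shortened variants of the generated 'dims' line: one balanced,
# one with a stray closing parenthesis at the end.
balanced   <- 'dims <- c(tiledb_dim(name="species", domain=c(NULL,NULL), tile=NULL, type="ASCII"))'
unbalanced <- 'dims <- c(tiledb_dim(name="species", domain=c(NULL,NULL), tile=NULL, type="ASCII")))'

print(parses_ok(balanced))
print(parses_ok(unbalanced))

# Assumption: describe() writes its code to standard output, so something
# like the following could check a real array (not run here):
#   parses_ok(capture.output(describe(arr)))
```

A parse check only confirms that the text is syntactically valid R; it says nothing about whether the recreated schema matches the original, which would still need a proper round-trip test.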

to be discussed if this should be an option to show() instead

johnkerl

🚢
🎉

eddelbuettel · 2022-01-13T21:48:53Z

We all may need to gab a little in a little while (as per my chat with @ihnorton) as we a) probably want to restore the object dump from core as an optional feature and b) need to work out if we want these verbose code pretty printers or the "look like core but ain't" you added as default as two may seem like one too many. Not urgent, but one of those things where a little chat prior to unannounced PRs can work wonders.

Also note that the code in the PR so far 'only looks pretty' but hasn't been to the dance yet. I haven't done any round-trip tests yet. I am sure there may be a many-legged creature hiding in a corner or two.

eddelbuettel requested review from Shelnutt2, aaronwolen, ihnorton and johnkerl January 7, 2022 23:41

eddelbuettel force-pushed the de/sc-13273/schema_object branch from 9f9f353 to a4bf68e Compare January 10, 2022 20:20

eddelbuettel force-pushed the master branch from 2a1a6e6 to ef27748 Compare January 10, 2022 22:00

eddelbuettel force-pushed the de/sc-13273/schema_object branch 4 times, most recently from 847a7a8 to cf154e1 Compare January 11, 2022 16:06

eddelbuettel marked this pull request as ready for review January 13, 2022 16:39

ihnorton approved these changes Jan 13, 2022

johnkerl suggested changes Jan 13, 2022

eddelbuettel added 11 commits January 13, 2022 15:27

work in progress on array and schema description

cdbbde3

summary of array

eafbe3e

return list of three objects rather than two

6a6ebfe

only call fill value getter if TileDB > 2.1.0

0e3adcd

snapshot with dimension printer

8bbfbb9

snapshot with attrs (sans filterlist)

a648753

small extension having filter option setter return mod'ed filter

70f1017

add filterlist support to filterlist

9f0828f

array schema too

4fe6895

new high-level function describe()

bb77d45

to be discussed if this should be an option to show() instead

correct parens (thanks for spotting this @johnkerl)

1a9379c

eddelbuettel force-pushed the de/sc-13273/schema_object branch from 367b0ad to 1a9379c Compare January 13, 2022 21:33

johnkerl approved these changes Jan 13, 2022

eddelbuettel merged commit 8c87c71 into master Jan 13, 2022

eddelbuettel deleted the de/sc-13273/schema_object branch January 13, 2022 21:55

eddelbuettel mentioned this pull request Jan 24, 2022

Release 0.11.0 #356

Merged

eddelbuettel merged 11 commits into master from de/sc-13273/schema_object
