Addition of from_numpy Support for mode={ingest,schema_only,append} #1185

Merged

Conversation

nguyenv
Collaborator

@nguyenv nguyenv commented Jun 21, 2022

No description provided.

@shortcut-integration

This pull request has been linked to Shortcut Story #17765: from_numpy support for mode={ingest,schema_only,append}.

Contributor

@johnkerl johnkerl left a comment

I don't know how to use this; sorry, cannot accept or reject the PR :(

Using a local checkout of https://github.com/single-cell-data/TileDB-SingleCell/tree/main/apis/python and trying to re-upload:

$ git diff
diff --git a/apis/python/src/tiledbsc/uns_array.py b/apis/python/src/tiledbsc/uns_array.py
index 838f071..4fde242 100644
--- a/apis/python/src/tiledbsc/uns_array.py
+++ b/apis/python/src/tiledbsc/uns_array.py
@@ -93,13 +93,18 @@ class UnsArray(TileDBArray):
             # Note arr.astype('str') does not lead to a successfuly tiledb.from_numpy.
             arr = np.array(arr, dtype="O")

+        mode = "ingest"
+        if self.exists():
+            mode = "append"
+            logger.info(f"{self._indent}Re-using existing array {self.uri}")
+
         # overwrite = False
         # if self.exists:
         #     overwrite = True
         #     logger.info(f"{self._indent}Re-using existing array {self.uri}")
         # tiledb.from_numpy(uri=self.uri, array=arr, ctx=self._ctx, overwrite=overwrite)
         # TODO: find the right syntax for update-in-place (tiledb.from_pandas uses `mode`)
-        tiledb.from_numpy(uri=self.uri, array=arr, ctx=self._ctx)
+        tiledb.from_numpy(uri=self.uri, array=arr, mode=mode, ctx=self._ctx)

         logger.info(
             util.format_elapsed(
$ ingestor anndata/pbmc3k_processed.h5ad s3://tiledb-johnkerl/scratch/try003
START  SOMA.from_h5ad anndata/pbmc3k_processed.h5ad -> s3://tiledb-johnkerl/scratch/try003
START  READING anndata/pbmc3k_processed.h5ad
FINISH READING anndata/pbmc3k_processed.h5ad TIME 0.076 seconds
START  DECATEGORICALIZING
FINISH DECATEGORICALIZING TIME 0.005 seconds
START  WRITING s3://tiledb-johnkerl/scratch/try003
  START  WRITING s3://tiledb-johnkerl/scratch/try003/uns
    START  WRITING s3://tiledb-johnkerl/scratch/try003/uns/draw_graph
      START  WRITING s3://tiledb-johnkerl/scratch/try003/uns/draw_graph/params
        START  WRITING FROM NUMPY.NDARRAY s3://tiledb-johnkerl/scratch/try003/uns/draw_graph/params/random_state
Traceback (most recent call last):
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/tools/ingestor", line 254, in <module>
    main()
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/tools/ingestor", line 167, in main
    ingest_one(
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/tools/ingestor", line 235, in ingest_one
    tiledbsc.io.from_h5ad(soma, input_path)
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/src/tiledbsc/io.py", line 17, in from_h5ad
    _from_h5ad_common(soma, input_path, from_anndata)
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/src/tiledbsc/io.py", line 46, in _from_h5ad_common
    handler_func(soma, anndata)
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/src/tiledbsc/io.py", line 112, in from_anndata
    soma.uns.from_anndata_uns(anndata.uns)
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/src/tiledbsc/uns_group.py", line 182, in from_anndata_uns
    subgroup.from_anndata_uns(value)
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/src/tiledbsc/uns_group.py", line 182, in from_anndata_uns
    subgroup.from_anndata_uns(value)
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/src/tiledbsc/uns_group.py", line 211, in from_anndata_uns
    elif array._maybe_from_numpyable_object(value):
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/src/tiledbsc/uns_array.py", line 65, in _maybe_from_numpyable_object
    self.from_numpy_ndarray(arr)
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/src/tiledbsc/uns_array.py", line 102, in from_numpy_ndarray
    tiledb.from_numpy(uri=self.uri, array=arr, mode='append', ctx=self._ctx)
  File "/Users/johnkerl/git/TileDB-Inc/TileDB-Py/tiledb/highlevel.py", line 98, in from_numpy
    return tiledb.DenseArray.from_numpy(uri, array, ctx=_get_ctx(ctx, config), **kwargs)
  File "tiledb/libtiledb.pyx", line 4302, in tiledb.libtiledb.DenseArrayImpl.from_numpy
  File "tiledb/libtiledb.pyx", line 4307, in tiledb.libtiledb.DenseArrayImpl.from_numpy
  File "tiledb/libtiledb.pyx", line 4816, in tiledb.libtiledb.DenseArrayImpl.write_direct
  File "tiledb/libtiledb.pyx", line 574, in tiledb.libtiledb._raise_ctx_err
  File "tiledb/libtiledb.pyx", line 559, in tiledb.libtiledb._raise_tiledb_error
tiledb.cc.TileDBError: [TileDB::Subarray] Error: Cannot add range to dimension '__dim_0'; Range [1, 1] is out of domain bounds [0, 0]

Here's what I do with tiledb.from_pandas:
https://github.com/single-cell-data/TileDB-SingleCell/blob/main/apis/python/src/tiledbsc/annotation_dataframe.py#L226-L229

What I want, and what I thought this PR provided (maybe I misunderstand it), is to be able to do with tiledb.from_numpy what I'm already successfully doing with tiledb.from_pandas.

@nguyenv nguyenv force-pushed the viviannguyen/sc-17765/from-numpy-support-for-mode-ingest-schema branch from c71d222 to c978bfa Compare June 27, 2022 22:10
@nguyenv
Collaborator Author

nguyenv commented Jun 27, 2022

@johnkerl

Based on our discussion earlier, I have added a start_idx flag, and I think the example below is closer to what you're trying to achieve. I decided not to add a separate mode="overwrite", because using mode="append" and passing a starting index is closer to how from_pandas behaves.

By default, mode="append" for from_numpy appends to the end of the existing data. I know this deviates from from_pandas, which defaults to start_row_idx=0; I can change this if needed.

import tiledb
import numpy as np
import os, shutil

uri = "example_overwrite"

if os.path.exists(uri):
    shutil.rmtree(uri)

# create an array with data [1, 2, 3]
tiledb.from_numpy(uri=uri, array=np.asarray([1, 2, 3]), mode="ingest")

with tiledb.open(uri) as A:
    print(A.df[:])

# overwrite the data beginning at index 0 with [4, 5, 6]
tiledb.from_numpy(uri=uri, array=np.asarray([4, 5, 6]), mode="append", start_idx=0)

with tiledb.open(uri) as A:
    print(A.df[:])
(tiledb-3.10) vivian@mangonada:~/tiledb-bugs$ python from_numpy_append.py
   __dim_0
0        0  1
1        1  2
2        2  3
   __dim_0
0        0  4
1        1  5
2        2  6
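
To make the index arithmetic explicit, here is a minimal pure-Python sketch of how the target range of an append could be resolved. It is only a model of the behavior described above, not TileDB-Py's implementation; `resolve_write_range` and its parameters are hypothetical names. It assumes a one-dimensional dense array, that the default append starts right after the last written cell, and that start_idx, when given, replaces that starting point. Under those assumptions it also appears consistent with the earlier traceback, where appending one cell to an array with domain [0, 0] targeted the range [1, 1].

```python
from typing import Optional, Tuple


def resolve_write_range(
    domain: Tuple[int, int],
    nonempty_end: int,
    n_new: int,
    start_idx: Optional[int] = None,
) -> Tuple[int, int]:
    """Hypothetical model: inclusive index range that an append of n_new cells targets."""
    lo, hi = domain
    # Assumed default: continue right after the last written cell.
    start = nonempty_end + 1 if start_idx is None else start_idx
    end = start + n_new - 1
    # A write may not extend past the domain fixed at array creation.
    if start < lo or end > hi:
        raise ValueError(
            f"Range [{start}, {end}] is out of domain bounds [{lo}, {hi}]"
        )
    return start, end


# Overwrite in place, as in the example above: 3 cells written from index 0.
print(resolve_write_range(domain=(0, 2), nonempty_end=2, n_new=3, start_idx=0))

# Default append to a one-cell array whose domain is [0, 0]: no room left.
try:
    resolve_write_range(domain=(0, 0), nonempty_end=0, n_new=1)
except ValueError as err:
    print(err)
```

In this model, passing start_idx=0 with an array of the same length always stays inside the original domain, which is why it behaves like an overwrite.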

@johnkerl
Contributor

johnkerl commented Jul 5, 2022

@nguyenv sorry for the delay in replying -- this works beautifully!

Member

@ihnorton ihnorton left a comment

LGTM, although this would probably be a good time to pull as much of the schema_like_numpy code as possible out of cython for maintainability (we'll have to do it eventually either way).

@nguyenv nguyenv force-pushed the viviannguyen/sc-17765/from-numpy-support-for-mode-ingest-schema branch from c978bfa to f98fd54 Compare July 7, 2022 17:18
@nguyenv nguyenv merged commit 353f2d1 into dev Jul 8, 2022
@nguyenv nguyenv deleted the viviannguyen/sc-17765/from-numpy-support-for-mode-ingest-schema branch July 8, 2022 13:11
@johnkerl
Contributor

johnkerl commented Jul 8, 2022

🎉

3 participants