
Conversation

nalinigans
Member

No description provided.

self._genomicsdb.query_variant_calls(processor, array_name, scan_range)

if non_blocking:
    query_thread = threading.Thread(target=query_calls)
Member Author

@nalinigans nalinigans Jun 21, 2024


Will be changing this to use the thread pool from asyncio.
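For context, a minimal sketch of that direction (illustrative only; it reuses the query_calls/non_blocking names from the excerpt above, and the actual change may be wired differently):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: offload the blocking GenomicsDB query to asyncio's
# thread pool instead of managing a threading.Thread by hand.
async def run_query(genomicsdb, processor, array_name, scan_range, non_blocking=True):
    def query_calls():
        genomicsdb.query_variant_calls(processor, array_name, scan_range)

    if non_blocking:
        loop = asyncio.get_running_loop()
        with ThreadPoolExecutor(max_workers=1) as pool:
            # run_in_executor returns an awaitable, so the event loop stays free
            # while the query and its processor callbacks run on a worker thread.
            await loop.run_in_executor(pool, query_calls)
    else:
        query_calls()
```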

requirements.txt Outdated
@@ -28,3 +28,4 @@
 numpy>=1.24.0
 pandas>=2.1.0
 protobuf>=4.21.1
+pyarrow


Could a version specifier be added to this dependency?

Member Author


@oliviawebber, any suggestions for pyarrow versions? I don't want to introduce incompatibilities with other packages.


If there's no known minimum, my usual rule of thumb is to pick a version from about a year ago, which would give 14.0.0. The one library I know we use that might eventually pull in pyarrow looks like it requires >= 7.0.0.

Member


Sorta related: I recently ran into this issue: https://stackoverflow.com/questions/78634235/numpy-dtype-size-changed-may-indicate-binary-incompatibility-expected-96-from. Based on that (and maybe this migration guide: https://numpy.org/devdocs/numpy_2_0_migration_guide.html), do we also need to pin to numpy<2.0.0?

Member Author


Yes, I am using the numpy C interface, which may need porting to numpy 2. So we will have:

numpy>=1.25,<2.0.0
pyarrow>=14.0.0

schema_obj = _ArrowSchemaWrapper._import_from_c_capsule(schema_capsule)
schema = pa.schema(schema_obj.children_schema)
yield schema.serialize().to_pybytes()
if arrow_array and arrow_schema:


The `and arrow_schema` part of this check seems redundant given the `arrow_schema == NULL` check just above, or is there some truthiness conversion?

Member Author


Agreed!

Member

@mlathara mlathara left a comment


Some minor questions

break

if non_blocking:
    query_thread.join()
Member


Do we need to clean up the array and schema here? Or does the client do that?

Member Author

@nalinigans nalinigans Jun 25, 2024


The client does the cleanup here, as the operations are zero-copy. See:

cdef object pycapsule_get_arrow_schema(void *schema):
    return PyCapsule_New(<ArrowSchema*>schema, "arrow_schema", &pycapsule_delete_arrow_schema);

and

cdef object pycapsule_get_arrow_array(void *array):
    return PyCapsule_New(<ArrowArray*>array, "arrow_array", &pycapsule_delete_arrow_array);

where we register the cleanup callbacks for the schema and the array, respectively.
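To illustrate the zero-copy ownership transfer, here is a hypothetical consumer-side sketch (not part of the GenomicsDB-Python API), assuming pyarrow >= 14 and its support for the Arrow PyCapsule interface:

```python
import pyarrow as pa  # assumes pyarrow >= 14, which understands the Arrow PyCapsule interface

class SchemaCapsuleHolder:
    """Hypothetical wrapper around an "arrow_schema" PyCapsule produced above."""

    def __init__(self, capsule):
        self._capsule = capsule

    def __arrow_c_schema__(self):
        # pyarrow consumes the capsule and becomes responsible for invoking the
        # ArrowSchema release callback, so the producer has nothing left to free.
        return self._capsule

def schema_from_capsule(schema_capsule):
    # Zero-copy import: pa.schema() accepts any object implementing __arrow_c_schema__.
    return pa.schema(SchemaCapsuleHolder(schema_capsule))
```

The PyCapsule destructors registered above would then act as a fallback for capsules that are never actually imported by a consumer.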

nalinigans pushed a commit to GenomicsDB/GenomicsDB that referenced this pull request Jun 24, 2024
nalinigans added a commit to GenomicsDB/GenomicsDB that referenced this pull request Jun 25, 2024
…lls() (#338)

* Support arrow output using nanoarrow in response to query_variant_calls()

* Introduce semaphore for separate array_schema() see GenomicsDB/GenomicsDB-Python#51, address review comments and some cleanup

* Test coverage for ArrowVariantCallProcessor::allocate_schema
Member

@mlathara mlathara left a comment


lgtm

@nalinigans nalinigans merged commit e082f3c into develop Jun 25, 2024
@nalinigans nalinigans deleted the ng_arrow branch June 25, 2024 17:02