-
Notifications
You must be signed in to change notification settings - Fork 0
Support pyarrow output for GenomicsDB queries #51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
self._genomicsdb.query_variant_calls(processor, array_name, scan_range) | ||
|
||
if non_blocking: | ||
query_thread = threading.Thread(target=query_calls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be changing this to using the thread pool from asyncio
requirements.txt
Outdated
@@ -28,3 +28,4 @@ | |||
numpy>=1.24.0 | |||
pandas>=2.1.0 | |||
protobuf>=4.21.1 | |||
pyarrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could a version specifier be added to this dependency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oliviawebber , any suggestions for pyarrow
versions? I don't want to introduce incompatibilities with other packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's no known minimum, my usual rule of thumb is to pick a version from like ~1 year ago which would give 14.0.0
. The one library I know we use that might eventually pull in pyarrow
looks like it requires >= 7.0.0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorta related: I ran into this issue https://stackoverflow.com/questions/78634235/numpy-dtype-size-changed-may-indicate-binary-incompatibility-expected-96-from recently. Based on that (and maybe this migration guide https://numpy.org/devdocs/numpy_2_0_migration_guide.html) do we also need to pin to numpy<2.0.0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am using numpy C interface that may need porting to 2. So will have
numpy>=1.25,<2.0.0
pyarrow>=14.0.0
src/genomicsdb.pyx
Outdated
schema_obj = _ArrowSchemaWrapper._import_from_c_capsule(schema_capsule) | ||
schema = pa.schema(schema_obj.children_schema) | ||
yield schema.serialize().to_pybytes() | ||
if arrow_array and arrow_schema: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The and arrow_schema
part of this check seems redundant given the arrow_schema == NULL
check just above, or is there some truthiness conversion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor questions
break | ||
|
||
if non_blocking: | ||
query_thread.join() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to clean up the array and schema here? Or does the client do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client does the cleanup here as the operations are zero copy
. See
GenomicsDB-Python/src/utils.pxi
Lines 102 to 103 in 4e37170
cdef object pycapsule_get_arrow_schema(void *schema): | |
return PyCapsule_New(<ArrowSchema*>schema, "arrow_schema", &pycapsule_delete_arrow_schema); |
GenomicsDB-Python/src/utils.pxi
Lines 105 to 106 in 4e37170
cdef object pycapsule_get_arrow_array(void *array): | |
return PyCapsule_New(<ArrowArray*>array, "arrow_array", &pycapsule_delete_arrow_array); |
…csDB-Python#51, address review comments and some cleanup
…lls() (#338) * Support arrow output using nanoarrow in respnse to query_variant_calls() * Introduce semaphore for separate array_schema() see GenomicsDB/GenomicsDB-Python#51, address review comments and some cleanup * Test coverage for ArrowVariantCallProcessor::allocate_schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
No description provided.