Add Backed AnnData Support via Low-Memory Chunked Processing and Numba Kernels #64
…ifferential expression calculations
- Added `_is_backed_array` function to check for backed/HDF5 arrays.
- Introduced `_load_chunk` to handle loading of matrix chunks.
- Updated `parallel_differential_expression` to automatically select a processing strategy based on data type.
- Enhanced error handling for backed AnnData in differential expression functions.
- Improved documentation for clarity on parameters and return values.
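The chunk-loading helpers named in this commit might look roughly like the sketch below. The function names mirror the commit message, but the bodies and the `iter_gene_chunks` driver are illustrative assumptions, not the actual implementation:

```python
import numpy as np

def _load_chunk(X, start, stop):
    # Illustrative stand-in for the commit's `_load_chunk`: slice a
    # column (gene) range and materialize it as a dense ndarray, which
    # also works for backed/HDF5 matrices that support column slicing.
    return np.asarray(X[:, start:stop])

def iter_gene_chunks(X, chunk_size):
    # Hypothetical driver: yield (start, dense_chunk) pairs so downstream
    # code only ever holds `chunk_size` genes in memory at once.
    n_genes = X.shape[1]
    for start in range(0, n_genes, chunk_size):
        yield start, _load_chunk(X, start, min(start + chunk_size, n_genes))
```

With a backed matrix, only each dense chunk is resident at a time, which is the source of the memory savings this PR targets.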
…rential_expression_vec_wrapper
- Introduced a `low_memory` option for optimized processing.
- Added a `gene_chunk_size` parameter to control chunk size during computations.
- Added a `show_progress` parameter to `parallel_differential_expression` and `parallel_differential_expression_vec_wrapper` for better user feedback during processing.
- Updated fold-change calculations to return NaN for zero means in specific cases, improving the robustness of differential expression calculations.
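The NaN handling for zero means can be illustrated with a minimal sketch. The function name and the exact guard conditions are assumptions; the commit only states that NaN is returned "in specific cases":

```python
import numpy as np

def log2_fold_change(mean_target, mean_ref):
    # Assumed guard: a zero mean in either group makes the ratio
    # undefined or degenerate, so return NaN instead of +/-inf.
    if mean_target == 0.0 or mean_ref == 0.0:
        return float("nan")
    return float(np.log2(mean_target / mean_ref))
```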
- Introduced a new module, `_parallel.py`, containing utilities for parallel processing in differential expression calculations.
- Implemented functions for default parallelization settings, Numba thread management, and processing targets in chunks.
- Added a vectorized Wilcoxon rank-sum test implementation for improved performance.
- Created a new test suite in `test_parallel.py` to validate the parallelization helpers.
- Moved rank-sum buffer preparation and kernel functions from `_single_cell.py` to `_parallel.py` for better organization and performance.
- Introduced a new `prepare_ranksum_buffers` function to allocate per-thread buffers.
- Added a vectorized implementation of the rank-sum test using Numba for parallel processing.
- Updated `_single_cell.py` to use the new rank-sum test functions, enhancing modularity and code clarity.
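A per-thread buffer allocator in the spirit of `prepare_ranksum_buffers` could be sketched as follows; the signature, dtype, and buffer shape are assumptions for illustration:

```python
import numpy as np

def prepare_ranksum_buffers(num_threads, buffer_size):
    # One scratch array per thread, so parallel rank-sum kernels can
    # write into pre-allocated memory instead of allocating inside the
    # hot loop (allocation inside Numba-parallel regions is costly).
    return [np.empty(buffer_size, dtype=np.float64) for _ in range(num_threads)]
```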
…ameters and refactor processing logic
- Added `num_workers` and `num_threads` parameters to `_parallel_differential_expression_chunked` and `parallel_differential_expression` for finer control over parallel processing.
- Refactored target processing to use `process_target_in_chunk` and `process_targets_parallel` for better modularity and performance.
- Updated documentation to clarify the new parameters and their impact on processing behavior.
- Updated progress descriptions in `process_targets_parallel` and `_parallel_differential_expression_chunked` to include the number of workers and Numba thread status for better user feedback.
- Enhanced logging to report the number of threads configured for Numba, improving transparency during execution.
- Added detailed explanations to the README of how in-memory and backed AnnData objects are handled, clarifying execution strategies and memory management.
- Updated the `parallel_differential_expression` docstring to specify the roles of `num_workers` and `num_threads` in low-memory mode.
- Enhanced documentation for the parallel processing utilities in `_parallel.py`, emphasizing their modularity and reusability.
- Introduced `is_integer_data` and `should_use_numba` functions to determine Numba applicability based on data type, improving performance for integer-like data.
- Updated `_parallel_differential_expression_chunked` to log a warning when Numba is disabled due to non-integer values, so users are informed of the fallback to SciPy.
- Adjusted the default `num_threads` parameter in several functions for usability and consistency.
- Added tests for the new functions to cover various scenarios, improving overall test coverage.
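A heuristic in the spirit of `is_integer_data` might look like the following. This is a sketch under stated assumptions; the actual check, its sampling strategy, and its signature in `_parallel.py` may differ:

```python
import numpy as np

def is_integer_data(X, sample_size=1000):
    # Assumed heuristic: treat data as integer-like (e.g. raw counts
    # stored as floats) if a sample of values survives truncation
    # unchanged. Integer-like data can use the histogram-based kernel.
    flat = np.asarray(X).ravel()
    sample = flat[: min(sample_size, flat.size)]
    return bool(np.all(sample == np.floor(sample)))
```

A check like this is what lets the pipeline keep Numba acceleration for raw counts while falling back to SciPy for log-transformed values.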
…unctions
- Changed the `num_workers` parameter in `parallel_differential_expression_vec_wrapper` to accept `None`, allowing more flexible worker configuration.
- Updated index creation in `build_small_anndata` and the `var` DataFrame to use `pd.Index`, improving compatibility with pandas operations.
- Modified `_sort_results` to accept both `pd.DataFrame` and `pl.DataFrame`, improving versatility in handling different DataFrame types.
…ions
- Reformatted the `_compute_means` call in `process_target_in_chunk` for better readability.
- Simplified iterable creation in `process_targets_parallel` by removing unnecessary line breaks.
- Streamlined the DataFrame creation in `build_small_anndata`, improving conciseness.
…upport
- Added dual-kernel functionality for the Wilcoxon rank-sum test, optimizing performance for both integer and float data types.
- Updated documentation to clarify when the histogram-based and sorting-based kernels are used.
- Refactored related functions for better organization and clarity, ensuring seamless integration with existing processing workflows.
- Adjusted parameter handling in `_parallel_differential_expression_chunked` to support the enhanced Numba integration.
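To illustrate the histogram-based kernel idea for integer data, here is a plain-NumPy sketch of the general technique, not the PR's Numba implementation. The name `ranksum_int_hist` and its signature are hypothetical, and the tie correction to the variance is omitted for brevity:

```python
import numpy as np
from math import erf, sqrt

def ranksum_int_hist(x, y, max_val):
    # Count occurrences of each integer value in both groups; the
    # combined histogram yields average ranks without an O(n log n) sort.
    cx = np.bincount(x, minlength=max_val + 1)
    cy = np.bincount(y, minlength=max_val + 1)
    c = cx + cy
    cum = np.cumsum(c)
    # The tied block at value v occupies ranks [cum[v]-c[v]+1, cum[v]];
    # its average rank is the midpoint below.
    ranks = cum - (c - 1) / 2.0
    rank_sum_x = float(np.sum(ranks * cx))
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    # Normal approximation to the U distribution (tie correction omitted).
    mu = n1 * n2 / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    # Two-sided p-value from the standard normal CDF.
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return u, p
```

For small integer counts this replaces sorting with two `bincount` passes, which is why a histogram kernel pays off on raw count matrices while float data still needs the sorting-based kernel.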
…tings
- Replaced print statements with logging for design parameter output in `pseudobulk_dex`, enhancing traceability.
- Added warnings in `parallel_differential_expression_vec_wrapper` for ignored parameters, improving user awareness of configuration limitations.
- Renamed a test for clarity regarding low-memory float data handling.
…ksum processing
- Added return type hints to function signatures in `_parallel.py`, improving code clarity and type safety.
- Reformatted function definitions for better readability, in line with modern Python conventions.
- Updated documentation comments to reflect the changed function signatures.
… documentation for pdex.
- Added support for polars DataFrame conversion in integration tests, ensuring compatibility across data types.
- Relaxed test assertions to allow small numerical differences between implementations, improving test robustness.
- Refactored correlation assertions for clarity and consistent error messaging.
Code Review
This pull request introduces a new low-memory, chunked processing mode for differential expression analysis, primarily benefiting backed AnnData objects and very large in-memory datasets. A new `_parallel.py` module centralizes the parallelization helpers, including a dual-kernel Numba-accelerated Wilcoxon rank-sum test that automatically dispatches between a histogram-based kernel for integer data and a sorting-based kernel for float data.

The main `parallel_differential_expression` function in `_single_cell.py` was refactored to act as a dispatcher, automatically selecting between the original shared-memory multiprocessing approach and the new chunked mode based on whether the AnnData object is backed or `low_memory=True` is explicitly set. The `README.md` was updated with extensive documentation on these execution strategies and the parallelization controls (`num_workers`, `num_threads`). Integration tests were added to validate the consistency of results between the standard and low-memory modes across various data types (dense/sparse, counts/log1p) and chunk sizes.

Review comments raised concerns about the use of `fastmath=True` in the Numba kernel due to potential precision issues in statistical calculations, suggested removing an unreachable `else` block in `process_target_in_chunk`, and questioned the high `atol` tolerance used in the integration tests for comparing p-values, which might mask numerical differences or be a consequence of `fastmath`.
- Updated assertion formatting in integration tests for better readability and consistent error messages.
- Adjusted the numerical tolerance in `test_pdex.py` to allow only smaller differences, enhancing test robustness.
This PR introduces a major enhancement to `pdex` by adding support for backed AnnData objects, enabling differential expression analysis on datasets that are too large to fit into memory. This is achieved through a new low-memory, chunked processing pipeline, accelerated by dual Numba kernels for the Wilcoxon rank-sum test.

Changes
Features & Enhancements

Low-Memory Chunked Processing:
- A new chunked implementation in `_single_cell.py` processes genes in chunks, drastically reducing peak memory usage.
- The `parallel_differential_expression` function now automatically detects if the input `AnnData` is backed and switches to this low-memory implementation.
- A `low_memory=True` option has been added to allow users to force the chunked implementation for large in-memory datasets.

Numba-accelerated Wilcoxon Kernels:
- The new `_parallel.py` module introduces high-performance Numba kernels for the Wilcoxon rank-sum test.

Enhanced Parallelization Control:
- The `parallel_differential_expression` function now offers two levels of parallelization for the low-memory backend:
  - `num_workers`: Controls the number of threads for target-level parallelism.
  - `num_threads`: Controls the number of Numba threads for gene-level parallelism within the Wilcoxon kernel.

Testing
- A new integration test suite (`tests/test_integration.py`) has been added to ensure that the results of the new low-memory implementation are consistent with the standard in-memory implementation across various scenarios (dense/sparse, count/log-transformed data).
- A new unit test suite (`tests/test_parallel.py`) provides unit tests for the new Numba kernels and parallelization helper functions.
- Existing tests in `tests/test_pdex.py` have been extended to validate the low-memory implementation against the standard one.

Documentation
- The `README.md` file has been significantly updated to explain the new features, including the automatic backend selection for backed AnnData and the new parallelization options.
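The two-level scheme described in the PR (worker threads over targets, Numba threads inside each kernel) can be sketched with stdlib threads standing in for the target level. The per-target job below is a hypothetical mean-difference computation, not the PR's actual statistics:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def process_target(target, X, labels):
    # Hypothetical per-target job: compare cells of one target group
    # against all remaining cells (mean difference per gene).
    mask = labels == target
    return target, X[mask].mean(axis=0) - X[~mask].mean(axis=0)

def process_targets_parallel(targets, X, labels, num_workers=4):
    # Target-level parallelism: one thread per in-flight target.
    # In the real pipeline, each job would additionally fan out to
    # `num_threads` Numba threads inside its rank-sum kernel.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return dict(pool.map(lambda t: process_target(t, X, labels), targets))
```

Threads (rather than processes) make sense at the outer level here because the heavy inner work would run in Numba kernels that release the GIL.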