Skip to content

scdl neighbor update#843

Merged
camirr-nv merged 5 commits into
mainfrom
camirr/scdl-neighbor-update
Jun 30, 2025
Merged

scdl neighbor update#843
camirr-nv merged 5 commits into
mainfrom
camirr/scdl-neighbor-update

Conversation

@camirr-nv
Copy link
Copy Markdown
Contributor

@camirr-nv camirr-nv commented Apr 23, 2025

Description

This PR introduces comprehensive neighbor sampling functionality to the Single Cell Data Loader (SCDL), enabling efficient retrieval of cell neighbors for graph-based machine learning workflows and spatial analysis tasks.

Core Functionality

  • Neighbor data loading: Automatic detection and loading of neighbor information from .neighbors files
  • Flexible neighbor sampling: Support for random sampling strategies with configurable neighbor counts
  • Memory-efficient implementation: Lazy loading of neighbor data with proper memory management

Public API Methods

  • get_row_with_neighbor(): Retrieve cell data with a single randomly sampled neighbor
  • get_row_padded_with_neighbor(): Get cell data with multiple neighbors, padded to specified length
  • get_neighbor_data(): Direct access to neighbor information for a given cell
  • get_neighbor_stats(): Statistical analysis of neighbor connectivity

Internal Implementation Methods

  • _init_neighbor_args(): Initialize neighbor-related configuration and validate parameters
  • _extract_neighbor_data(): Extract and process neighbor data from memory-mapped files
  • get_neighbor_indices_for_cell(): Retrieve all neighbor indices for a specific cell
  • sample_neighbor_index(): Efficiently sample random neighbor indices with validation

File Structure

  • Neighbor data stored in three memory-mapped .npy files:
    -- neighbor_indices.npy: Contains neighbor cell indices in CSR format
    -- neighbor_indptr.npy: Index pointers for CSR sparse matrix structure
    -- neighbor_values.npy: Weights/values associated with neighbor relationships
  • Data extracted from AnnData's .obsp sparse matrices (e.g., 'next_cell_ids')
  • Automatic file discovery and validation during dataset initialization

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

TODO: Add code snippet

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
@polinabinder1
Copy link
Copy Markdown
Collaborator

On a high level - test cases!

Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
@polinabinder1
Copy link
Copy Markdown
Collaborator

/ok to test c6fa91e

Copy link
Copy Markdown
Collaborator

@skothenhill-nv skothenhill-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left mostly style and documentation related comments. One thing I noticed is from a top down reading, its not clear what the expected behavior is when neighbor data is requested but doesnt exist. I saw a few methods where an exception is raised, and others return a boolean or maybe throw a warning. If you feel like you have the control flow in a good place for incomplete neighbor data then no worries- but if not it might be worth taking another pass. Approved for now, I trust youll merge when comments are addressed.

Comment thread sub-packages/bionemo-scdl/README.md
Comment thread sub-packages/bionemo-scdl/README.md
Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/util/torch_dataloader_utils.py Outdated
@camirr-nv camirr-nv changed the title DRAFT: scdl neighbor update scdl neighbor update Jun 12, 2025
Comment thread sub-packages/bionemo-scdl/tests/bionemo/scdl/util/test_torch_dataloader_utils.py Outdated
Copy link
Copy Markdown
Collaborator

@polinabinder1 polinabinder1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the 3rdParty changes - they should be the same as in main.

Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
@camirr-nv
Copy link
Copy Markdown
Contributor Author

Remove the 3rdParty changes - they should be the same as in main.

@polinabinder1 I'm not sure why, but I realize at some point when i initialized the submodules they showed up as changes to be committed under 'git status' and I did. I just reverted that commit.

@camirr-nv
Copy link
Copy Markdown
Contributor Author

/ok to test c562990

@camirr-nv camirr-nv requested a review from polinabinder1 June 26, 2025 17:14
@camirr-nv camirr-nv force-pushed the camirr/scdl-neighbor-update branch from c562990 to 7a33849 Compare June 26, 2025 17:41
@camirr-nv camirr-nv requested a review from nvdreidenbach as a code owner June 26, 2025 17:41
Signed-off-by: Camir Ricketts <camirr@nvidia.com>
@camirr-nv camirr-nv force-pushed the camirr/scdl-neighbor-update branch from 50d1981 to ac9cf78 Compare June 26, 2025 18:22
Signed-off-by: Camir Ricketts <camirr@nvidia.com>
@camirr-nv
Copy link
Copy Markdown
Contributor Author

/ok to test 6943412

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 26, 2025

Codecov Report

Attention: Patch coverage is 86.74699% with 22 lines in your changes missing coverage. Please review.

Project coverage is 84.24%. Comparing base (2aac67e) to head (402bc66).

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
.../src/bionemo/scdl/io/single_cell_memmap_dataset.py 85.80% 22 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #843      +/-   ##
==========================================
+ Coverage   84.18%   84.24%   +0.06%     
==========================================
  Files         144      144              
  Lines        9229     9394     +165     
==========================================
+ Hits         7769     7914     +145     
- Misses       1460     1480      +20     
Files with missing lines Coverage Δ
...dl/src/bionemo/scdl/util/torch_dataloader_utils.py 100.00% <100.00%> (ø)
.../src/bionemo/scdl/io/single_cell_memmap_dataset.py 90.02% <85.80%> (-2.04%) ⬇️

... and 1 file with indirect coverage changes

Comment thread sub-packages/bionemo-scdl/src/bionemo/scdl/io/single_cell_memmap_dataset.py Outdated
Comment thread sub-packages/bionemo-scdl/README.md Outdated
Signed-off-by: Camir Ricketts <camirr@nvidia.com>
Signed-off-by: Camir Ricketts <camirr@nvidia.com>
@camirr-nv
Copy link
Copy Markdown
Contributor Author

/ok to test 402bc66

@camirr-nv camirr-nv added this pull request to the merge queue Jun 30, 2025
Merged via the queue into main with commit 998ce9a Jun 30, 2025
10 checks passed
@camirr-nv camirr-nv deleted the camirr/scdl-neighbor-update branch June 30, 2025 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants