Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: add BPCell backend #110

Closed
Yunuuuu opened this issue Dec 20, 2023 · 8 comments
Closed

Feature request: add BPCell backend #110

Yunuuuu opened this issue Dec 20, 2023 · 8 comments

Comments

@Yunuuuu
Copy link

Yunuuuu commented Dec 20, 2023

BPCell brings a more effecient on-disk matrix attached with a series operation for single cell. Can this be integrated into Delayed Array ?

https://bnprks.github.io/BPCells/articles/web-only/benchmarks.html

@LTLA
Copy link
Contributor

LTLA commented Dec 20, 2023

Usually, new file-backed formats are handled by DelayedArray subclasses in separate packages, e.g., see HDF5Array. This avoids adding dependencies to the core DelayedArray package, which would require installation of those dependencies by all downstream packages (which is a lot). So if you have a new file format you'd like to wrap, you can create your own package that implements the relevant methods (e.g., extract_array, dim, is_sparse).

That said, wrapping BPCell files in a DelayedArray may not provide the performance improvements that you'd expect, as any reads via the DelayedArray wrapper will involve calling R code. IIRC BPCell's performance is at least partially based on doing most of their operations in C++, so going through an R layer may dilute some of that. This could be overcome by writing bindings to tatami, which is my unofficial "C++ API for DelayedArrays"; then downstream tatami-compliant packages will be able to call the C++ code directly, without going through R. There's not a lot of these packages yet, though.

If you want to give this a go, have a look at the TileDBArray package (and the associated tatami bindings) for a demonstration. Incidentally, TileDB also positions itself as a successor to HDF5, particularly in the single-cell space, so that might be worth checking out if you just want something better than HDF5.

@Yunuuuu
Copy link
Author

Yunuuuu commented Jan 19, 2024

Hi, I'm implementing the BPCells backend, but there are some errors confusing me.
Why DelayedArray function relies on extract_array methods which always return a dense matrix (I have tried to return a sparse matrix, but it failed) causing error: cannot allocate vector of size 177.3 Gb?

"pbmc_3k_rna_raw" is the path of BPCellsDir file created BPCells::write_matrix_dir

DelayedArray::DelayedArray(BPCellsDirSeed("pbmc_3k_rna_raw"))
>Error in h(simpleError(msg, call)) : 
>error in evaluating the argument 'x' in selecting a method for function 'type': cannot allocate vector of size 177.3 Gb

The codes I used are as follows:

#' @importClassesFrom BPCells MatrixDir
methods::setClass("BPCellsDirSeed", contains = "MatrixDir")

BPCellsDirSeed <- function(x, buffer_size = 8192L) {
    if (rlang::is_string(x)) {
        matrix_dir <- BPCells::open_matrix_dir(
            dir = x, buffer_size = buffer_size
        )
    } else if (methods::is(x, "MatrixDir")) {
        matrix_dir <- x
    } else {
        cli::cli_abort(
            "{.arg x} must be a {.cls string} or {.cls MatrixDir} object"
        )
    }
    methods::as(matrix_dir, "BPCellsDirSeed")
}

#' @importClassesFrom DelayedArray DelayedArray
methods::setClass("BPCellsDirArray",
    contains = "DelayedArray",
    slots = c(seed = "BPCellsDirSeed"),
    prototype = list(seed = methods::new("BPCellsDirSeed"))
)

#' @importMethodsFrom DelayedArray elayedArray
#' @importFrom DelayedArray new_DelayedArray
methods::setMethod(
    "DelayedArray", "BPCellsDirSeed",
    function(seed) {
        new_DelayedArray(seed, Class = "BPCellsDirArray")
    }
)

BPCellsDirArray <- function(x, ...) {
    DelayedArray(BPCellsDirSeed(x, ...))
}

#' @export
methods::setMethod("type", "BPCellsDirSeed", function(x) {
    # it seems R don't differentiate 32-bit and 64-bit real number
    switch(x@type,
        uint32_t = "integer",
        float = "double", # 32-bit real number
        double = "double" # 64-bit real number
    )
})

#' @export
methods::setMethod(
    "extract_array", "BPCellsDirSeed",
    function(x, index) {
        out <- as.matrix(extract_bpcells_array(x, index))
        storage.mode(out) <- type(x)
        out
    }
)

extract_bpcells_array <- function(x, index) {
    i <- index[[1L]]
    j <- index[[2L]]
    if (!length(i) && !length(j)) {
        out <- x
    } else if (length(i) && length(j)) {
        out <- x[i, j]
    } else if (length(i)) {
        out <- x[i, ]
    } else {
        out <- x[, j]
    }
    methods::as(out, "dgCMatrix")
}

#' @export
#' @importFrom DelayedArray path
methods::setMethod("path", "BPCellsDirSeed", function(object, ...) {
    object@dir
})

@hpages
Copy link
Contributor

hpages commented Jan 19, 2024

Hi @Yunuuuu, I'll take a look. Are you working on a BPCellArray package or somthing like that? It's going to be easier if the BPCell backend for DelayedArray objects is implemented in a dedicated package.

@Yunuuuu
Copy link
Author

Yunuuuu commented Jan 19, 2024

Yes, I have created a pacakge named BPCellsArray deposited in https://github.com/Yunuuuu/BPCellsArray, I finally figured it out, the problem exist in extract_bpcells_array function.

@Yunuuuu
Copy link
Author

Yunuuuu commented Jan 19, 2024

Now, two basic class MatrixSubset and ConvertMatrixType has been implemented, additionally, MatrixDir object using Directory of files format has also been implemented with DelayedArray. I'll implement Hdf5 file format tomorrow

https://bnprks.github.io/BPCells/articles/web-only/bitpacking-format.html

@hpages
Copy link
Contributor

hpages commented Jan 19, 2024

Thanks for the link. If you are now working on a BPCellsArray package, can we close this issue? Feel free to ask any question in a new issue if you run into other problems related to the implementation of this new backend.

@Yunuuuu
Copy link
Author

Yunuuuu commented Jan 19, 2024

Of course, you can close this thread. It is better to ask any questions on the Bioconductor support site. Apologies for any inconvenience caused.

@hpages
Copy link
Contributor

hpages commented Jan 19, 2024

No worries

@hpages hpages closed this as completed Jan 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants