Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extending TENxMatrix to work on variants of HDF5-backed storage #34

Closed
LTLA opened this issue Feb 5, 2021 · 16 comments
Closed

Extending TENxMatrix to work on variants of HDF5-backed storage #34

LTLA opened this issue Feb 5, 2021 · 16 comments

Comments

@LTLA
Copy link

LTLA commented Feb 5, 2021

From theislab/zellkonverter#37:

It seems that the TENxMatrix almost works with AnnData's sparse HDF5 format. Only a small modification is required, namely:

.get_tenx_shape <- function(filepath, group) {
    as.integer(h5readAttributes(filepath, group)$shape)
 #   .read_tenx_component(filepath, group, "shape")
}

And then it all works, with an extra transposition (because they store it row-based). Would it be possible to convert some of these internals into generics so that I can create a TENxMatrix subclass to handle these sparse AnnData datasets?

Might even make sense to have a more generally named virtual base class, e.g., CompressedSparseHDF5Matrix. And then subclasses could define generics to provide the paths to the elements, re-using the rest of the existing TENxMatrix machinery to do all the data extraction.

@hpages
Copy link
Contributor

hpages commented Feb 6, 2021

Hmm... seems like we are lacking support for the lzf filter at the moment. Using cheng18.processed.h5ad downloaded from here:

> h5read("cheng18.processed.h5ad", "/X/data", index=list(1:10))
Error: Unable to read dataset.
Not all required filters available.
Missing filters: lzf
Error: Error in h5checktype(). H5Identifier not valid.

@grimbough What are the plans for h5py LZF? Is it going to make it to rhdf5filters?

@LTLA
Copy link
Author

LTLA commented Feb 6, 2021

In the meantime, I've just been using files created using whatever h5py's default filters are:

library(scRNAseq)
sce <- ZeiselBrainData()
counts(sce) <- as(counts(sce), "dgCMatrix")

library(zellkonverter)
writeH5AD(sce, file="blah.h5ad")

@LTLA
Copy link
Author

LTLA commented Feb 8, 2021

Just to be clear: most H5AD files don't seem to use these external filters, they're just using the standard ones (probably DEFLATE). So we should still be able to proceed with these classes and get something useable for most people.

@hpages
Copy link
Contributor

hpages commented Feb 8, 2021

It's funny but the first H5AD file I tried had the LZF filter on it. Must be my usual good luck ;-)

Anyways I just added support for LZF compression to HDF5Array 1.19.2 (still need to test this on Windows). Once LZF makes its way to rhdf5filters, it'll be easy for me to pick it up from there.

I'll work on the H5ADMatrixSeed/H5ADMatrix classes after lunch.

@LTLA
Copy link
Author

LTLA commented Feb 8, 2021

Okay, thanks. Do you want H5ADMatrix to live in HDF5Array? These classes are pretty application-specific; it may make more sense for, e.g., some virtual base class containing 99% of the code to live in HDF5Array while application-specific packages contain the concrete subclasses.

This strategy would preserve the generality of the HDF5Array package, while downstream applications would be in charge of making sure everything works with specific formats. For example, DropletUtils already has a significant testing set-up to make sure that TENxMatrix works with all of the old and new Cellranger output formats.

@hpages
Copy link
Contributor

hpages commented Feb 8, 2021

It makes sense to have H5ADMatrix next to TENxMatrix.

@LTLA
Copy link
Author

LTLA commented Feb 8, 2021

Also, the H5ADMatrix is stored as a CSR matrix, as the AnnData infrastructure operates with a cells-as-rows paradigm. It would have to be transposed to truly be comparable to a TENxMatrix; we were planning to handle this at the zellkonverter level.

@hpages
Copy link
Contributor

hpages commented Feb 9, 2021

R already has the notions of rows and cols flipped so its natural to read these h5ad files as already transposed.

@grimbough
Copy link

@grimbough What are the plans for h5py LZF? Is it going to make it to rhdf5filters?

LZF support is now in rhdf5filters, should propagate in a few days.

> rhdf5::h5version()
This is Bioconductor rhdf5 2.35.2 linking to C-library HDF5 1.10.7 and rhdf5filters 1.3.4
> h5read("~/Downloads/cheng18.processed.h5ad", "/X/data", index=list(1:10))
 [1] 0.6057951 0.6057951 0.6057951 0.7852854 0.7852854 0.9670818 0.6607942 0.6607942
 [9] 0.6607942 0.8280389

That should add both read and write support if it's ever needed.

@hpages
Copy link
Contributor

hpages commented Feb 19, 2021

Thx @grimbough ! I'll drop my copy of the LZF code from HDF5Array.

@LTLA : Basic support for H5ADMatrix objects is now in HDF5Array 1.19.4. Still needs a few more examples and some unit tests. Let me know how it works for you.

@LTLA
Copy link
Author

LTLA commented Feb 20, 2021

Thanks @hpages. To set the context, I'm testing it on the mock dataset described in theislab/zellkonverter#37, i.e.:

library(scRNAseq)
out <- ZeiselBrainData()
counts(out) <- as(counts(out), "dgCMatrix")

library(zellkonverter)
writeH5AD(out, file="stuff.h5")

The first point is that, while X is the default name, it's not the only possible group path. The H5AD format also have layers, akin to SE's assays, where each layer is a separate assay. This would be easily supported by allowing the H5ADMatrix constructor to accept a path in the same manner that HDF5Array() and TENxMatrix() do.

Secondly, for the file generated above, the shape attributes are not named h5sparse_shape, but instead shape. zellkonverter generates these files via a reasonably up-to-date Python interface, so is as close to the truth as can be; I assume you also got your test H5AD files from some trusted source, which suggests that there was some kind of change to the format at some point. Guess you could just check for one or the other and just use whatever's there.

@hpages
Copy link
Contributor

hpages commented Feb 20, 2021

Thx, I'll implement these improvements (should be easy). Since I didn't manage to put my hands on many .h5ad files yet, and since the .h5ad specs are kind of minimalist (to say the least), that kind of feedback is useful. Reverse engineering can be fun but is only what it is: a hack.

Still need to test the H5ADMatrix stuff on all kinds of .h5ad files to polish all these rough edges.

@hpages
Copy link
Contributor

hpages commented Feb 21, 2021

Done in HDF5Array 1.19.5. Shouldn't writeH5AD() write the dgCMatrix assay in csr format instead of csc? The matrix gets transposed when written to the .h5ad file so csr in R should become csc in the file.

@LTLA
Copy link
Author

LTLA commented Feb 22, 2021

Just tested it, looks good to me.

Shouldn't writeH5AD() write the dgCMatrix assay in csr format instead of csc?

¯\_(ツ)_/¯. This is handled by the Python interface. All I can say is that we manually transpose the matrix when we drag it from R into Python, otherwise the Python-side object (their equivalent to our SE) cannot be properly constructed. This manual transposition is still a CSC matrix; it's just that the Python-side columns are equivalent to the R-side rows.

@hpages
Copy link
Contributor

hpages commented Feb 22, 2021

I see. Looks like transposing a dgCMatrix could have returned a dgRMatrix instead of a dgCMatrix. This would avoid the cost of re-computing the row indices:

library(Matrix)

transpose_dgCMatrix_to_dgRMatrix <- function(x)
{
    stopifnot(is(x, "dgCMatrix"))
    ## Nothing to compute, just moving the slot values around!
    new("dgRMatrix", p=x@p, j=x@i, Dim=rev(x@Dim), Dimnames=rev(x@Dimnames), x=x@x)
}

m0 <- rsparsematrix(1e6, 1e3, 0.1)

system.time(m1 <- t(m0))
#    user  system elapsed 
#   3.647   0.704   4.351 

class(m1)
# [1] "dgCMatrix"
# attr(,"package")
[1] "Matrix"

dim(m1)
# [1]    1000 1000000

system.time(m2 <- transpose_dgCMatrix_to_dgRMatrix(m0))
#    user  system elapsed 
#   0.708   0.000   0.709 

class(m2)
# [1] "dgRMatrix"
# attr(,"package")
# [1] "Matrix"

dim(m2)
# [1]    1000 1000000

@hpages
Copy link
Contributor

hpages commented Feb 28, 2021

So I'm going to close this. If any problem with the H5ADMatrix/H5ADMatrixSeed stuff, please open a new issue.

@hpages hpages closed this as completed Feb 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants