Extending TENxMatrix to work on variants of HDF5-backed storage #34
Comments
Hmm... seems like we are lacking support for the lzf filter at the moment. Using
@grimbough What are the plans for h5py LZF? Is it going to make it to rhdf5filters? |
In the meantime, I've just been using files created with whatever h5py's default filters are:
library(scRNAseq)
sce <- ZeiselBrainData()
counts(sce) <- as(counts(sce), "dgCMatrix")
library(zellkonverter)
writeH5AD(sce, file="blah.h5ad") |
Just to be clear: most H5AD files don't seem to use these external filters, they're just using the standard ones (probably DEFLATE). So we should still be able to proceed with these classes and get something useable for most people. |
It's funny but the first H5AD file I tried had the LZF filter on it. Must be my usual good luck ;-) Anyways I just added support for LZF compression to HDF5Array 1.19.2 (still need to test this on Windows). Once LZF makes its way to rhdf5filters, it'll be easy for me to pick it up from there. I'll work on the H5ADMatrixSeed/H5ADMatrix classes after lunch. |
Okay, thanks. Do you want This strategy would preserve the generality of the HDF5Array package, while downstream applications would be in charge of making sure everything works with specific formats. For example, DropletUtils already has a significant testing set-up to make sure that |
It makes sense to have H5ADMatrix next to TENxMatrix. |
Also, the |
R already has the notions of rows and cols flipped, so it's natural to read these h5ad files as already transposed. |
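To illustrate the point about flipped rows and columns, here is a minimal sketch in plain Python (no HDF5 involved). It shows that the three arrays of a row-compressed (CSR) matrix, read unchanged as a column-compressed (CSC) matrix with the dimensions swapped, describe the transpose. The helper `to_dense` and the toy arrays are hypothetical, made up for this illustration; this is not how HDF5Array or zellkonverter implement it.

```python
def to_dense(data, indices, indptr, shape, layout):
    """Expand compressed sparse arrays into a dense list of rows.

    layout="csr": indptr delimits rows and indices holds column numbers.
    layout="csc": indptr delimits columns and indices holds row numbers.
    """
    nrow, ncol = shape
    dense = [[0] * ncol for _ in range(nrow)]
    for major in range(len(indptr) - 1):
        for k in range(indptr[major], indptr[major + 1]):
            minor = indices[k]
            if layout == "csr":
                dense[major][minor] = data[k]
            else:
                dense[minor][major] = data[k]
    return dense


# Toy 2 x 3 matrix [[1, 0, 2], [0, 3, 0]] stored row-compressed (CSR).
data, indices, indptr = [1, 2, 3], [0, 2, 1], [0, 2, 3]

as_csr = to_dense(data, indices, indptr, (2, 3), "csr")
# The same three arrays, read column-compressed with the dimensions swapped.
as_csc = to_dense(data, indices, indptr, (3, 2), "csc")

# Explicit transpose of the CSR interpretation, for comparison.
transposed = [list(col) for col in zip(*as_csr)]
```

Under these assumptions `as_csc` and `transposed` are the same matrix, which is why a reader that treats the stored arrays as column-compressed ends up with the transposed matrix without touching the data.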
LZF support is now in rhdf5filters, should propagate in a few days.
> rhdf5::h5version()
This is Bioconductor rhdf5 2.35.2 linking to C-library HDF5 1.10.7 and rhdf5filters 1.3.4
> h5read("~/Downloads/cheng18.processed.h5ad", "/X/data", index=list(1:10))
[1] 0.6057951 0.6057951 0.6057951 0.7852854 0.7852854 0.9670818 0.6607942 0.6607942
[9] 0.6607942 0.8280389
That should add both read and write support if it's ever needed. |
Thx @grimbough ! I'll drop my copy of the LZF code from HDF5Array. @LTLA : Basic support for H5ADMatrix objects is now in HDF5Array 1.19.4. Still needs a few more examples and some unit tests. Let me know how it works for you. |
Thanks @hpages. To set the context, I'm testing it on the mock dataset described in theislab/zellkonverter#37, i.e.:
library(scRNAseq)
out <- ZeiselBrainData()
counts(out) <- as(counts(out), "dgCMatrix")
library(zellkonverter)
writeH5AD(out, file="stuff.h5")
The first point is that, while
Secondly, for the file generated above, the shape attributes are not named |
Thx, I'll implement these improvements (should be easy). Since I didn't manage to put my hands on many Still need to test the H5ADMatrix stuff on all kinds of |
Done in HDF5Array 1.19.5. Shouldn't |
Just tested it, looks good to me.
¯\_(ツ)_/¯. This is handled by the Python interface. All I can say is that we manually transpose the matrix when we drag it from R into Python, otherwise the Python-side object (their equivalent to our SE) cannot be properly constructed. This manual transposition is still a CSC matrix; it's just that the Python-side columns are equivalent to the R-side rows. |
I see. Looks like transposing a dgCMatrix could have returned a dgRMatrix instead of a dgCMatrix. This would avoid the cost of re-computing the row indices:
|
So I'm going to close this. If any problem with the |
From theislab/zellkonverter#37:
It seems that the TENxMatrix almost works with AnnData's sparse HDF5 format. Only a small modification is required, namely:
And then it all works, with an extra transposition (because they store it row-based). Would it be possible to convert some of these internals into generics so that I can create a TENxMatrix subclass to handle these sparse AnnData datasets?
Might even make sense to have a more generally named virtual base class, e.g., CompressedSparseHDF5Matrix. And then subclasses could define generics to provide the paths to the elements, re-using the rest of the existing TENxMatrix machinery to do all the data extraction.
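The base-class idea in this request can be sketched roughly as follows, in plain Python rather than R's S4 generics. Everything here is a hypothetical stand-in: the class names, the `paths()` method, the use of a dict in place of an open HDF5 file, and the `matrix/...` path names are assumptions made for illustration (only `/X/data` appears in this thread). It is not the design that HDF5Array adopted.

```python
from abc import ABC, abstractmethod


class CompressedSparseStore(ABC):
    """Hypothetical base class: subclasses only say where the pieces live."""

    def __init__(self, file):
        # `file` is a plain dict standing in for an open HDF5 file.
        self.file = file

    @abstractmethod
    def paths(self):
        """Return a dict with keys 'data', 'indices' and 'indptr'."""

    def extract_major(self, j):
        """Return {minor index: value} for the j-th compressed row or column."""
        p = self.paths()
        indptr = self.file[p["indptr"]]
        start, end = indptr[j], indptr[j + 1]
        indices = self.file[p["indices"]][start:end]
        data = self.file[p["data"]][start:end]
        return dict(zip(indices, data))


class TenxLikeStore(CompressedSparseStore):
    # Assumed layout: everything under a "matrix" group.
    def paths(self):
        return {"data": "matrix/data",
                "indices": "matrix/indices",
                "indptr": "matrix/indptr"}


class H5adLikeStore(CompressedSparseStore):
    # Assumed layout: everything under the "X" group.
    def paths(self):
        return {"data": "X/data",
                "indices": "X/indices",
                "indptr": "X/indptr"}


# Toy contents standing in for an h5ad-like file.
fake_file = {
    "X/data": [1.5, 2.5, 3.5],
    "X/indices": [0, 2, 1],
    "X/indptr": [0, 2, 3],
}
store = H5adLikeStore(fake_file)
first = store.extract_major(0)
```

The extraction logic lives once in the base class, and each format contributes only its element paths, which is the division of labour the request describes.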