Dual matrix seeds for efficient row/column accesses #36

Closed · LTLA opened this issue Feb 27, 2021 · 7 comments

LTLA commented Feb 27, 2021

Inspired by #34, I was wondering whether we could have a DualMatrixSeed that itself contains two seeds, both of which point to the same data but one of which is optimized for row access and the other for column access. The idea being that, depending on what index is supplied to extract_(sparse)_array(), we could dynamically switch between the two seeds for best performance.

We had a variant of this discussion before for dense HDF5 matrices: creating two datasets, one with row chunking and one with column chunking, for optimal extraction along either dimension. That turned out to be overkill, as rectangular chunks were largely satisfactory, especially given that most applications pull out blocks at a time rather than requiring efficient single row/column access.

However, with the compressed sparse variants (e.g., H5ADMatrixSeed, TENxMatrixSeed), there might be a lot of benefit in having this dual structure. These are stored in a manner that is inherently optimized for column (or row) access, so one could imagine storing the dataset twice and dynamically choosing the right representation for a given index. The data are sparse, so the extra storage should be pretty cheap; and even then, doubling the file size is not the biggest deal.
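For concreteness, here is a rough sketch of what that dual layout could look like, written in Python/h5py rather than R; the group and dataset names are made up, loosely mimicking the 10x/H5AD compressed-sparse convention:

```python
# Sketch: store one sparse matrix twice in a single HDF5 file, once in
# CSC layout (fast column access) and once in CSR layout (fast row access).
# Group/dataset names ("csc", "csr", "data", "indices", "indptr") are
# hypothetical, loosely mimicking the 10x convention.
import h5py
import numpy as np
from scipy import sparse

mat = sparse.random(1000, 200, density=0.05, format="csc", random_state=0)

with h5py.File("dual.h5", "w") as f:
    for name, layout in [("csc", mat.tocsc()), ("csr", mat.tocsr())]:
        g = f.create_group(name)
        g.create_dataset("data", data=layout.data)
        g.create_dataset("indices", data=layout.indices)
        g.create_dataset("indptr", data=layout.indptr)
    f.attrs["shape"] = mat.shape

# Reading back: pick the group whose major axis matches the request.
# A single column is one contiguous slab of the CSC copy's data array.
with h5py.File("dual.h5", "r") as f:
    g = f["csc"]                      # column request -> use the CSC copy
    j = 3
    start, end = g["indptr"][j], g["indptr"][j + 1]
    col_vals = g["data"][start:end]   # nonzero values of column j
    col_rows = g["indices"][start:end]  # their row positions
```

A row request would go through the "csr" group in exactly the same way, with the roles of rows and columns swapped.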

The choice between the two can be determined by minimizing the number of runs of adjacent values that need to be read from the file. A simple algorithm is to sort index[[1]], convert it into an rle, and multiply the number of runs by length(index[[2]]); do the same with the dimensions swapped; and then pick the orientation with the lower value. Some extra finesse could be applied when one dimension is NULL, because that implies a consecutive read across multiple entries of the other dimension.
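That heuristic is simple enough to sketch directly (in Python here, for illustration; function names are mine, not any existing API):

```python
# Sketch of the orientation heuristic: estimate how many contiguous runs
# each layout would have to read from the file, and pick the smaller.
import numpy as np

def n_runs(idx):
    """Number of runs of adjacent values in idx; None (a NULL dimension)
    means the whole extent, i.e. a single consecutive read."""
    if idx is None:
        return 1
    idx = np.sort(np.asarray(idx))
    return 1 + int(np.sum(np.diff(idx) != 1))

def pick_orientation(row_idx, col_idx, shape):
    """Return 'column' to use the CSC copy, 'row' to use the CSR copy."""
    n_rows_wanted = shape[0] if row_idx is None else len(row_idx)
    n_cols_wanted = shape[1] if col_idx is None else len(col_idx)
    # CSC copy: one set of row-runs is read per requested column;
    # CSR copy: one set of column-runs per requested row.
    csc_cost = n_runs(row_idx) * n_cols_wanted
    csr_cost = n_runs(col_idx) * n_rows_wanted
    return "column" if csc_cost <= csr_cost else "row"
```

For example, a block of 100 consecutive rows crossed with a few scattered columns favors the CSC copy (one run per column), while the transposed request favors the CSR copy.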

If this sounds interesting, I could try to whip up some timings and see what the benefit might be. I run into this every now and again for in-memory objects, see tatami-inc/beachmat#20.


hpages commented Mar 1, 2021

This idea was discussed here Bioconductor/DelayedArray#19 (comment) and abandoned, because at the time it seemed that converting a 10x Genomics dataset from TENxMatrixSeed to HDF5MatrixSeed format (e.g. like you did for the 1.3 Million Brain Cell dataset, using 100x100 chunks) was good enough. Can't the same be done with H5ADMatrixSeed objects? I mean, with the dual-seed solution you're going to have to re-write the dataset anyway, so why not just re-write it with a chunk geometry that works for both row and column access? With the as.sparse feature I implemented recently for HDF5MatrixSeed/HDF5Matrix objects, you still get some benefit from the sparse nature of the data, even if it's now stored in dense format.


LTLA commented Mar 1, 2021

The hope is that a truly sparse representation on file would be a lot faster to read. For example:

## Setting up; need a larger dataset than Zeisel to manifest benefits.
library(scRNAseq)
sce <- MacoskoRetinaData()

library(HDF5Array)
writeHDF5Array(counts(sce), "dense.h5", name="data")

library(zellkonverter)
writeH5AD(sce, file="stuff.h5")

## Comparing the reads.
library(HDF5Array)
D <- HDF5Array("dense.h5", "data", as.sparse=TRUE)
system.time(mat <- as(D, "dgCMatrix"))
##    user  system elapsed
##  43.790  14.216  60.740

X <- H5ADMatrix("stuff.h5")
system.time(mat2 <- as(X, "dgCMatrix"))
##    user  system elapsed
##   3.496   0.983   4.556

Pretty neat, huh. I was trying to do a more realistic example involving rowVars, but ran into #37, hence the full read above.

Along with the reduced overhead of decompression and memory allocation, it should also be possible to extract something closer to individual columns or rows, instead of having to drag in data from adjacent columns or rows in the same chunk. This should help improve the performance of interactive applications that just need to visualize a single row or column.
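A back-of-envelope calculation illustrates the point; the matrix dimensions and density below are my own illustrative assumptions, not numbers from this thread:

```python
# Elements touched to serve a single-column request, under two layouts.
# All figures here are illustrative assumptions.
import math

nrow = 30_000     # rows in a hypothetical cells-by-genes dataset
density = 0.05    # fraction of nonzero entries
chunk = 100       # side length of a square dense chunk

# CSC copy: read just that column's nonzeros, one contiguous slab.
csc_elements = int(nrow * density)

# Dense 100x100 chunks: every chunk overlapping the column is read whole.
chunks_spanned = math.ceil(nrow / chunk)
dense_elements = chunks_spanned * chunk * chunk

ratio = dense_elements / csc_elements  # extra data dragged in by chunking
```

Under these assumptions the dense chunked read touches 3,000,000 elements to deliver a column with 1,500 nonzeros, i.e. orders of magnitude more I/O for a single-column visualization.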


hpages commented Mar 1, 2021

Maybe, but using a proper chunk geometry and H5 type for the dense matrix would be a bit fairer:

system.time(rs <- rowSums(D))
#    user  system elapsed 
#  28.554   0.051  28.604 

writeHDF5Array(counts(sce), "dense100x100.h5", name="data", chunkdim=c(100, 100), H5type="H5T_STD_U16LE")
D2 <- HDF5Array("dense100x100.h5", "data", as.sparse=TRUE)

system.time(rs2 <- rowSums(D2))
#    user  system elapsed 
#  10.472   0.142  10.616 

So yeah of course rowSums(X) is still faster:

system.time(rsX <- rowSums(X))
#    user  system elapsed 
#   5.568   0.067   5.636 

but only twice as fast in this case. I'm not entirely convinced that's enough to embark on the dual-seed thing.

There's also the question of who's going to create the dual seeds, when, and how. For example, in the context of zellkonverter::readH5AD, are you going to write the sparse matrix with transposed compression to an HDF5 file on the fly? That will be quite expensive for big matrices. Also, where are you going to write it so that this costly operation is done only once?

Or do you envision using dual seeds in ExperimentHub only? That solves the question of dual-seed management, but it also means that only package examples and vignettes (including the OSCA book) will benefit; people working on their own datasets won't.


LTLA commented Mar 1, 2021

Why the H5type? That's a 4-fold smaller data type than a double, and there's no guarantee that the values being extracted would fit into that range, or even be integers for that matter, e.g., if we're holding normalized expression values. FWIW, chunking doesn't really make a difference for me:

library(scRNAseq)
sce <- MacoskoRetinaData()

library(HDF5Array)
writeHDF5Array(counts(sce), "dense.h5", name="data", chunkdim=c(100, 100))

library(zellkonverter)
writeH5AD(sce, file="stuff.h5")

library(HDF5Array)
D <- HDF5Array("dense.h5", "data", as.sparse=TRUE)
system.time(mat <- as(D, "dgCMatrix"))
##    user  system elapsed
##  41.157   9.198  53.561

X <- H5ADMatrix("stuff.h5")
system.time(mat2 <- as(X, "dgCMatrix"))
##    user  system elapsed
##   3.248   1.151   5.182

Switching to H5T_STD_U16LE brings that down to 30 seconds, but you could probably get a similar speed-up if you did the same for the sparse representation as well.

There's also the question of who's going to create the dual seeds, when, and how.

Yes, something like ExperimentHub. The two files corresponding to the dual seeds would be created ahead of time by the data provider and then both could be consumed by users. We also have internal systems that could fill EHub's role in this hypothetical case. For these centrally managed datasets, it seems like a worthwhile cost. Or at least something worth testing.

Anyway, I was planning to add this capability for our in-house applications, and making the dual seed is pretty straightforward, so I guess I'll just do it myself and I'll let you know how it goes.


hpages commented Mar 2, 2021

Why the H5type?

Why not? If I can use 16-bit integers instead of 32-bit integers or 64-bit doubles, then it makes sense to do that. Turns out that with many count datasets, I can.

Another reason the dual-seed idea didn't really get any traction is that we were kind of hoping to be able to use a storage backend with decent support for sparsity at some point. Once you replace the current CSR/CSC-on-top-of-HDF5 hack with native CSR/CSC (i.e. at the chunk or tile level), you no longer need the heavy and laborious dual-seed trick.
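To make the chunk-level idea concrete, here is a toy in-memory sketch (Python, for illustration only; this is not a feature of any real HDF5 backend): each tile is compressed independently, so a row of tiles or a column of tiles can be read without dragging in the rest of the matrix.

```python
# Toy sketch of chunk-level sparsity: each 100x100 tile is stored as its
# own small CSC block, so either a row or a column of tiles is cheap to
# read. This mimics what a natively sparse backend might do; it is NOT an
# existing HDF5 feature.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = (rng.random((300, 200)) < 0.05) * rng.integers(1, 10, (300, 200))
tile = 100

# Compress each tile independently, keyed by its top-left corner.
tiles = {}
for i in range(0, dense.shape[0], tile):
    for j in range(0, dense.shape[1], tile):
        tiles[(i, j)] = sparse.csc_matrix(dense[i:i + tile, j:j + tile])

# Reading row 57 touches only the tiles in its tile-row (i = 0):
row = np.concatenate([
    tiles[(0, j)].toarray()[57] for j in range(0, dense.shape[1], tile)
])
```

With tiles compressed on both axes like this, neither a second copy of the data nor an orientation heuristic would be needed.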


LTLA commented Mar 2, 2021

Turns out that with many count datasets, I can.

Right, and one could do the same for the sparse representations, probably with a similar speed-up. Of course, it wouldn't strictly be H5AD anymore, but I would imagine that H5ADMatrix could deal with the change in type easily enough.

Another reason why the dual seed idea didn't really get any traction is because we were kind of hoping to be able to use a storage backend with decent support for sparsity at some point.

And we continue to hope. HDF5's going to take ages (if ever?) to make this available in their C library, and none of the alternatives have nearly as much mindshare. The sparse natives also tend to use a multi-file structure, and say what you like about HDF5, at least everything's in one file and I don't have to worry about coordinating the download, caching, versioning, etc. of multiple files. Even the dual structure can be stuffed into a single HDF5 file via another set of groups.

Anyway, I think it's worth a shot. I'll try to find some time to run through this on our internal systems, probably with the 1.3 million cell dataset in TENxBrainData, and see how it handles.


LTLA commented Mar 23, 2021

I'll close this as we can continue this discussion in https://github.com/LTLA/DualIndexedMatrix.

LTLA closed this as completed Mar 23, 2021