Better method than Matrix::readMM for reading MatrixMarket formats directly into a SparseArray object #5

hpages · 2021-04-30T15:47:35Z

I heard someone say:

SparseArray could also provide a better method than Matrix::readMM for reading MatrixMarket formats directly into a SparseArray object.

Sounds good to me.

LTLA · 2021-05-01T17:53:14Z

It was I!

While we're on this topic, you could also take scuttle::readSparseCounts() off my hands. This will read a matrix in a dense CSV file into a dgCMatrix by chunk-wise processing. Such dense matrices are quite common, especially in older scRNA-seq studies where no one had an idea about what to do with sparse matrices and just treated them in the same way as bulk RNA-seq.
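The chunk-wise idea described above can be sketched roughly as follows. This is only an illustration, not scuttle's actual implementation: `read_dense_csv_as_sparse()` and its `chunk_rows` argument are hypothetical names, and the sketch assumes an unquoted, comma-separated file with a header line and row names in the first column.

```r
library(Matrix)

# Hypothetical helper (NOT scuttle::readSparseCounts()): read a dense CSV of
# counts a few rows at a time and keep only the nonzero entries.
read_dense_csv_as_sparse <- function(path, chunk_rows = 1000L) {
    con <- file(path, open = "r")
    on.exit(close(con))

    # Assumption: header line present, first field labels the row-name column.
    header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1L]]
    col_names <- header[-1L]

    i_list <- list(); j_list <- list(); x_list <- list()
    row_names <- character(0)

    repeat {
        lines <- readLines(con, n = chunk_rows)
        if (length(lines) == 0L) break
        fields <- strsplit(lines, ",", fixed = TRUE)
        offset <- length(row_names)
        row_names <- c(row_names, vapply(fields, `[`, character(1), 1L))

        # Only this chunk is ever held in dense form.
        vals <- matrix(as.numeric(unlist(lapply(fields, `[`, -1L))),
                       nrow = length(fields), byrow = TRUE)
        nz <- which(vals != 0, arr.ind = TRUE)

        k <- length(i_list) + 1L
        i_list[[k]] <- nz[, 1L] + offset   # shift to global row index
        j_list[[k]] <- nz[, 2L]
        x_list[[k]] <- vals[nz]
    }

    sparseMatrix(i = unlist(i_list), j = unlist(j_list), x = unlist(x_list),
                 dims = c(length(row_names), length(col_names)),
                 dimnames = list(row_names, col_names))
}

# Small usage example on a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("gene,cell1,cell2,cell3",
             "g1,0,2,0",
             "g2,0,0,0",
             "g3,5,0,1"), tmp)
m <- read_dense_csv_as_sparse(tmp, chunk_rows = 2L)
```

Peak memory is driven by the chunk size plus the number of nonzero values, rather than by the full dense matrix.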

hpages · 2023-06-06T00:43:01Z

SparseArray has readSparseCSV() which is similar to scuttle::readSparseCounts() but returns an SVT_SparseArray object (of type() "integer") instead of a dgCMatrix object (the user can just coerce if they want the latter). If it does what you need, feel free to deprecate scuttle::readSparseCounts() in favor of that. If it doesn't, let me know how it should be improved.
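For illustration, the read-then-coerce workflow might look like the sketch below. The file contents are made up, and the sketch assumes the default comma separator and a file whose first line holds the column names and whose first column holds the row names.

```r
library(SparseArray)
library(Matrix)

# Made-up example file: header line, then one row per gene.
tmp <- tempfile(fileext = ".csv")
writeLines(c("gene,cell1,cell2,cell3",
             "g1,0,2,0",
             "g2,0,0,0",
             "g3,5,0,1"), tmp)

svt <- readSparseCSV(tmp)      # SVT_SparseMatrix object
type(svt)                      # element type of the object

# Coerce only if downstream code needs a dgCMatrix.
dgc <- as(svt, "dgCMatrix")
```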

LTLA · 2023-06-06T18:15:26Z

Thanks Herve. A couple of thoughts.

Comparing SparseArray::readSparseCSV() and scuttle::readSparseCounts(), there are quite a few options in the latter that are not (yet) in the former. This refers to many of the read.table-like options such as skip.*, *.names, etc. Inspecting some real-world usage, I can see some CSVs with, e.g., different quoting methods for the row/column names, sometimes column names aren't present, sometimes rows/columns need to be skipped. I remember the Zeisel datasets being particularly tedious, though that particular script was before I wrote readSparseCounts(). Anyway, my point is that it would be helpful to have a few of these options, given that there isn't a standard way of storing matrices in CSVs and users need to be able to adapt to whatever zany formatting was provided by the data generator.

The other thought is that my original comment actually refers to Matrix::readMM, which creates a dgTMatrix by default. There's an opportunity for decent optimization if we can read this directly into a sparse array. If this were available, DropletUtils::read10xCounts() would switch to it ASAP.
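To make the intermediate step concrete, here is a minimal sketch using only the Matrix package. The toy matrix and temporary file are made up for the example; the point is that the file is first materialized in triplet form and then converted to a compressed representation, which is the extra step a direct reader could avoid.

```r
library(Matrix)

# Toy 3 x 3 sparse matrix written out in MatrixMarket format.
m <- sparseMatrix(i = c(1, 3, 3), j = c(2, 1, 3), x = c(2, 5, 1),
                  dims = c(3, 3))
tmp <- tempfile(fileext = ".mtx")
writeMM(m, tmp)

trip <- readMM(tmp)               # triplet (TsparseMatrix) representation
csc <- as(trip, "CsparseMatrix")  # second pass to get compressed columns
```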

hpages · 2023-06-06T19:04:02Z

Makes sense. Thanks for the feedback. Added to the TODO list:

SparseArray/TODO, lines 88 to 90 in 872c617:

    - Improve readSparseCSV() functionality by adding a few read.table-like args
      to it. See https://github.com/Bioconductor/SparseArray/issues/5 for the
      details.

drighelli · 2023-06-13T18:48:17Z

I was opening an issue about this new readSparseCSV function, but it seems to be related to Aaron's comment.

I found the function really fast (4 times faster than data.table::fread in my case) and helpful. However, I noticed that it seems to automatically use the first column in the file as the rownames of the returned SVT_SparseMatrix, which may not always be the desired behaviour.

I'm sure this will be easily solved by the improvements already mentioned because, as things currently stand, there is no argument for specifying which column to use for the rownames.

I hope this could be helpful :)

hpages · 2023-06-13T19:10:34Z

Thanks @drighelli. There are so many things on SparseArray's TODO list, but your input helps me prioritize them.

hpages transferred this issue from Bioconductor/S4Arrays Jun 6, 2023

hpages added the enhancement (New feature or request) label Jun 25, 2024
