Better method than Matrix::readMM for reading MatrixMarket formats directly into a SparseArray object #5

Open
hpages opened this issue Apr 30, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@hpages
Contributor

hpages commented Apr 30, 2021

I heard someone say:

SparseArray could also provide a better method than Matrix::readMM for reading MatrixMarket formats directly into a SparseArray object.

Sounds good to me.

@LTLA

LTLA commented May 1, 2021

It was I!

While we're on this topic, you could also take scuttle::readSparseCounts() off my hands. This will read a matrix in a dense CSV file into a dgCMatrix by chunk-wise processing. Such dense matrices are quite common, especially in older scRNA-seq studies where no one had an idea about what to do with sparse matrices and just treated them in the same way as bulk RNA-seq.
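For readers unfamiliar with the chunk-wise idea, here is a minimal Python sketch of it (not the actual scuttle::readSparseCounts() implementation). It assumes integer counts, a header row, and row names in the first column; the function name and the `chunk_rows` parameter are made up for illustration.

```python
import csv
import io
from itertools import islice

def read_dense_csv_as_sparse(handle, chunk_rows=2):
    """Read a dense CSV chunk by chunk, keeping only the non-zero
    entries as (row, col, value) triplets with 0-based indices."""
    reader = csv.reader(handle)
    col_names = next(reader)[1:]      # first header cell labels the row-name column
    row_names, triplets = [], []
    while True:
        # Only chunk_rows dense rows are held in memory at any time.
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            break
        for record in chunk:
            i = len(row_names)
            row_names.append(record[0])
            for j, cell in enumerate(record[1:]):
                value = int(cell)
                if value != 0:
                    triplets.append((i, j, value))
    return row_names, col_names, triplets

demo = io.StringIO("gene,cell1,cell2,cell3\nA,0,3,0\nB,0,0,0\nC,5,0,1\n")
rows, cols, nz = read_dense_csv_as_sparse(demo)
```

Memory use is bounded by the chunk size plus the number of non-zero entries, which is why this approach suits large dense files that are mostly zeros.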

hpages transferred this issue from Bioconductor/S4Arrays on Jun 6, 2023
@hpages
Contributor Author

hpages commented Jun 6, 2023

SparseArray has readSparseCSV() which is similar to scuttle::readSparseCounts() but returns an SVT_SparseArray object (of type() "integer") instead of a dgCMatrix object (the user can just coerce if they want the latter). If it does what you need, feel free to deprecate scuttle::readSparseCounts() in favor of that. If it doesn't, let me know how it should be improved.
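To illustrate why the coercion is cheap, here is a rough Python sketch that turns a per-column sparse layout into the three arrays a column-compressed matrix such as dgCMatrix stores (column pointers `p`, 0-based row indices `i`, values `x`). The per-column input layout is a simplification assumed for this example, not SparseArray's actual internal representation, and `columns_to_csc` is a made-up name.

```python
def columns_to_csc(columns):
    """columns: one (row_indices, values) pair per column, 0-based rows.
    Returns (p, i, x) in compressed sparse column order."""
    p, i, x = [0], [], []
    for rows, vals in columns:
        # Sort each column by row index, as column-compressed formats expect.
        order = sorted(range(len(rows)), key=rows.__getitem__)
        i.extend(rows[k] for k in order)
        x.extend(vals[k] for k in order)
        p.append(len(i))              # running count of stored entries
    return p, i, x

p, i, x = columns_to_csc([([2, 0], [7, 4]), ([], []), ([1], [9])])
```

An empty column simply repeats the previous pointer value, so no placeholder entries are needed.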

@LTLA

LTLA commented Jun 6, 2023

Thanks Herve. A couple of thoughts.

Comparing SparseArray::readSparseCSV() and scuttle::readSparseCounts(), there are quite a few options in the latter that are not (yet) in the former. This refers to many of the read.table-like options such as skip.*, *.names, etc. Inspecting some real-world usage, I can see some CSVs with, e.g., different quoting methods for the row/column names, sometimes column names aren't present, sometimes rows/columns need to be skipped. I remember the Zeisel datasets being particularly tedious, though that particular script was before I wrote readSparseCounts(). Anyway, my point is that it would be helpful to have a few of these options, given that there isn't a standard way of storing matrices in CSVs and users need to be able to adapt to whatever zany formatting was provided by the data generator.
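As a rough illustration of the kind of read.table-like flexibility described here, the following Python sketch handles leading lines to skip, a custom separator and quote character, and optional row/column names. All parameter names (`skip_lines`, `has_col_names`, `has_row_names`) are hypothetical and are not arguments of readSparseCSV() or readSparseCounts(); integer counts are assumed.

```python
import csv
import io

def read_sparse_counts(handle, sep=",", quotechar='"', skip_lines=0,
                       has_col_names=True, has_row_names=True):
    """Parse a dense delimited file into 0-based (row, col, value) triplets."""
    for _ in range(skip_lines):       # drop leading junk lines, like read.table's skip
        handle.readline()
    reader = csv.reader(handle, delimiter=sep, quotechar=quotechar)
    first = 1 if has_row_names else 0
    col_names = next(reader)[first:] if has_col_names else None
    row_names = [] if has_row_names else None
    triplets = []
    i = 0
    for record in reader:
        if not record:                # ignore blank lines
            continue
        if has_row_names:
            row_names.append(record[0])
        for j, cell in enumerate(record[first:]):
            value = int(cell)
            if value != 0:
                triplets.append((i, j, value))
        i += 1
    return row_names, col_names, triplets

# Semicolon-separated, single-quoted names, one line of preamble.
messy = io.StringIO("# exported by some tool\n'gene';'c1';'c2'\n'A';0;2\n'B';7;0\n")
named = read_sparse_counts(messy, sep=";", quotechar="'", skip_lines=1)

# No header and no row names at all.
bare = read_sparse_counts(io.StringIO("0,1\n2,0\n"),
                          has_col_names=False, has_row_names=False)
```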

The other thought is that my original comment actually refers to Matrix::readMM, which creates a dgTMatrix by default. There's an opportunity for decent optimization if we can read this directly into a sparse array. If this were available, DropletUtils::read10xCounts() would switch to it ASAP.
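The optimization can be sketched as follows: instead of building an intermediate triplet object and converting it afterwards, each entry is dropped straight into its column as it is parsed. This is a minimal Python illustration that only handles `coordinate` files with `integer` or `real` values and `general` symmetry; the function name and the per-column output layout are assumptions made for the example.

```python
import io

def read_mm_by_column(handle):
    """Read a Matrix Market coordinate file into per-column
    (row_indices, values) lists with 0-based row indices."""
    header = handle.readline().split()
    if len(header) != 5 or header[0] != "%%MatrixMarket":
        raise ValueError("missing Matrix Market header line")
    obj, fmt, field, symmetry = (h.lower() for h in header[1:])
    if (obj, fmt, symmetry) != ("matrix", "coordinate", "general"):
        raise ValueError("only general coordinate matrices are handled")
    if field not in ("integer", "real"):
        raise ValueError("only integer and real fields are handled")
    convert = int if field == "integer" else float

    # Skip comment and blank lines until the size line.
    line = handle.readline()
    while line and (line.startswith("%") or not line.strip()):
        line = handle.readline()
    if not line:
        raise ValueError("missing size line")
    nrow, ncol, nnz = map(int, line.split())

    columns = [([], []) for _ in range(ncol)]
    seen = 0
    for line in handle:
        if not line.strip():
            continue
        i, j, v = line.split()
        rows, vals = columns[int(j) - 1]   # file indices are 1-based
        rows.append(int(i) - 1)
        vals.append(convert(v))
        seen += 1
    if seen != nnz:
        raise ValueError("entry count does not match the size line")
    return (nrow, ncol), columns

mm = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "% a comment line\n"
    "3 2 3\n"
    "1 1 4\n"
    "3 1 7\n"
    "2 2 9\n"
)
shape, cols_mm = read_mm_by_column(mm)
```

Entries are stored in file order, so rows within a column are sorted only if the file lists them that way.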

@hpages
Contributor Author

hpages commented Jun 6, 2023

Makes sense. Thanks for the feedback. Added to the TODO list:

SparseArray/TODO

Lines 88 to 90 in 872c617

- Improve readSparseCSV() functionality by adding a few read.table-like args
to it. See https://github.com/Bioconductor/SparseArray/issues/5 for the
details.

@drighelli

drighelli commented Jun 13, 2023

I was opening an issue about this new readSparseCSV function, but it seems to be related to Aaron's comment.

I found the function really fast (4 times faster than data.table::fread in my case) and helpful. However, I noticed that it seems to automatically use the first column in the file as the rownames of the returned SVT_SparseMatrix, which may not always be the desired behaviour.

I'm sure this will be easily solved by the improvements already mentioned because, as it currently stands, there is no argument for specifying which column to use for the rownames.
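To make the request concrete, here is a small Python sketch of the behaviour being asked for: the caller picks which column (if any) supplies the row names. The `row_names_col` parameter is hypothetical, must be a non-negative index or `None`, and does not exist in readSparseCSV().

```python
import csv
import io

def split_row_names(records, row_names_col=None):
    """Separate row names from data cells.
    row_names_col=None means the file has no row-name column."""
    if row_names_col is None:
        return None, [list(r) for r in records]
    row_names, data = [], []
    for record in records:
        row_names.append(record[row_names_col])
        # Keep every cell except the one used as the row name.
        data.append(record[:row_names_col] + record[row_names_col + 1:])
    return row_names, data

records = list(csv.reader(io.StringIO("1,0,g1\n0,2,g2\n")))
last_col = split_row_names(records, row_names_col=2)
no_names = split_row_names(records)
```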

I hope this could be helpful :)

@hpages
Contributor Author

hpages commented Jun 13, 2023

Thanks @drighelli . So many things on SparseArray's TODO but your input helps me prioritize things.

hpages added the enhancement (New feature or request) label on Jun 25, 2024
3 participants