Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use CSC format rather than CSR #2

Closed
wants to merge 3 commits into from
Closed

Use CSC format rather than CSR #2

wants to merge 3 commits into from

Conversation

alexey0308
Copy link
Collaborator

** current state

*** Successfully stored methylation data for 20 cells with 21

real    2m30,985s
user    2m24,895s
sys     0m8,014s

*** Successfully stored methylation data for 50 cells with 21 chromosomes.

real    4m12,956s
user    4m4,463s
sys     0m10,402s

-rw-rw-r-- 1 alex alex 2,0M Jul 16 17:21 data/testres/10.npz

** csc instead of csr

*** Successfully stored methylation data for 20 cells with 21 chromosomes.

real    1m3,573s
user    1m2,027s
sys     0m3,247s

*** Successfully stored methylation data for 50 cells with 21 chromosomes.

real    2m52,079s
user    2m47,957s
sys     0m5,839s

-rw-rw-r-- 1 alex alex 1,8M Jul 16 17:34 data/testres/10.npz

@alexey0308
Copy link
Collaborator Author

alexey0308 commented Jul 16, 2021

And np.loadtxt is very slow.
After switching to pandas one gets further

*** Successfully stored methylation data for 50 cells with 21 chromosomes.

real 1m23,850s
user 1m21,203s
sys 0m4,180s

@LKremer, was there any reason to involve ndim argument in loadtxt? I guess you should be confident in the files
which are produced by the package?

dc31c2b#diff-1b359a4369a76403a79c9d54de3341764471d6e0aaa42ade712559b49c87fa30L122

@LKremer
Copy link
Owner

LKremer commented Jul 16, 2021

CSC might be faster to write, but it's much slower to work with later since we need fast access to genomic positions, which are stored in the rows. CSR has fast access to rows, which is what we need when we iterate through a bed file to check the regions listed in the bed.

@alexey0308
Copy link
Collaborator Author

alexey0308 commented Jul 16, 2021

I need to check it, but I believe, that changing the format in memory is rather fast after reading CSC from the disk.

@LKremer
Copy link
Owner

LKremer commented Jul 16, 2021

@LKremer, was there any reason to involve ndim argument in loadtxt? I guess you should be confident in the files
which are produced by the package?

I am confident in the files, but not in loadtxt's ability to guess the array dimensions from the file alone. In the rare case where the file has only one row (only one covered CpG on that chromosome in that cell), loadtxt guesses that the file is a 1D-vector instead of a 2D-matrix, which is incorrect and results in a crash. I added ndmin to fix that crash.

@alexey0308
Copy link
Collaborator Author

Also in case of CSR this line was very consuming

https://github.com/LKremer/scbs/blob/master/scbs/prepare.py#L40

@alexey0308
Copy link
Collaborator Author

So to triage the PR, it is left to compare the loading time of CSR npz vs CSC npz + transforming

@alexey0308
Copy link
Collaborator Author

alexey0308 commented Jul 16, 2021

I suggest you also try it on you machine :) also if you have big files from the field.
EDIT: my bad, I mixed the formats, the correct code here

import numpy as np
import os
import scipy.sparse as sp

datadir = "../testres"
dat = sp.load_npz(os.path.join(datadir, "10.npz"))

sp.save_npz("test-csc.npz", dat.tocsc())
sp.save_npz("test-csr.npz", dat.tocsr())

def read_from_csc():
    x = sp.load_npz("test-csc.npz")
    return x.tocsr()

def read_from_csr():
    return sp.load_npz("test-csr.npz")
In [89]: %timeit read_from_csr()
943 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [90]: %timeit read_from_csc()
312 ms ± 8.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@alexey0308
Copy link
Collaborator Author

OK, @simon-anders suggested to use CSR format, because there is no guarantee, that the whole file can fit into the memory.

@alexey0308 alexey0308 closed this Aug 4, 2021
@LKremer LKremer deleted the prepare-csc branch March 21, 2023 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants