Use CSC format rather than CSR #2

alexey0308 · 2021-07-16T15:40:04Z

** current state

*** Successfully stored methylation data for 20 cells with 21

real    2m30,985s
user    2m24,895s
sys     0m8,014s

*** Successfully stored methylation data for 50 cells with 21 chromosomes.

real    4m12,956s
user    4m4,463s
sys     0m10,402s

-rw-rw-r-- 1 alex alex 2,0M Jul 16 17:21 data/testres/10.npz

** csc instead of csr

*** Successfully stored methylation data for 20 cells with 21 chromosomes.

real    1m3,573s
user    1m2,027s
sys     0m3,247s

*** Successfully stored methylation data for 50 cells with 21 chromosomes.

real    2m52,079s
user    2m47,957s
sys     0m5,839s

-rw-rw-r-- 1 alex alex 1,8M Jul 16 17:34 data/testres/10.npz

looks like it saves some space and much faster

alexey0308 · 2021-07-16T15:45:27Z

And np.loadtxt is very slow.
After switching to pandas one gets further

*** Successfully stored methylation data for 50 cells with 21 chromosomes.

real 1m23,850s
user 1m21,203s
sys 0m4,180s

@LKremer, was there any reason to involve ndim argument in loadtxt? I guess you should be confident in the files
which are produced by the package?

dc31c2b#diff-1b359a4369a76403a79c9d54de3341764471d6e0aaa42ade712559b49c87fa30L122

LKremer · 2021-07-16T15:51:53Z

CSC might be faster to write, but it's much slower to work with later since we need fast access to genomic positions, which are stored in the rows. CSR has fast access to rows, which is what we need when we iterate through a bed file to check the regions listed in the bed.

alexey0308 · 2021-07-16T15:55:37Z

I need to check it, but I believe, that changing the format in memory is rather fast after reading CSC from the disk.

LKremer · 2021-07-16T15:56:54Z

@LKremer, was there any reason to involve ndim argument in loadtxt? I guess you should be confident in the files
which are produced by the package?

I am confident in the files, but not in loadtxt's ability to guess the array dimensions from the file alone. In the rare case where the file has only one row (only one covered CpG on that chromosome in that cell), loadtxt guesses that the file is a 1D-vector instead of a 2D-matrix, which is incorrect and results in a crash. I added ndmin to fix that crash.

alexey0308 · 2021-07-16T15:57:57Z

Also in case of CSR this line was very consuming

https://github.com/LKremer/scbs/blob/master/scbs/prepare.py#L40

alexey0308 · 2021-07-16T15:59:28Z

So to triage the PR, it is left to compare the loading time of CSR npz vs CSC npz + transforming

alexey0308 · 2021-07-16T16:13:29Z

I suggest you also try it on you machine :) also if you have big files from the field.
EDIT: my bad, I mixed the formats, the correct code here

import numpy as np
import os
import scipy.sparse as sp

datadir = "../testres"
dat = sp.load_npz(os.path.join(datadir, "10.npz"))

sp.save_npz("test-csc.npz", dat.tocsc())
sp.save_npz("test-csr.npz", dat.tocsr())

def read_from_csc():
    x = sp.load_npz("test-csc.npz")
    return x.tocsr()

def read_from_csr():
    return sp.load_npz("test-csr.npz")

In [89]: %timeit read_from_csr()
943 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [90]: %timeit read_from_csc()
312 ms ± 8.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

alexey0308 · 2021-08-04T12:48:55Z

OK, @simon-anders suggested to use CSR format, because there is no guarantee, that the whole file can fit into the memory.

alexey0308 added 2 commits July 16, 2021 17:38

Use CSC format rather than CSR

3e00f9a

looks like it saves some space and much faster

pandas is faster than loadtxt

dc31c2b

adapt echo text

87e4fa0

alexey0308 closed this Aug 4, 2021

LKremer deleted the prepare-csc branch March 21, 2023 08:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use CSC format rather than CSR #2

Use CSC format rather than CSR #2

alexey0308 commented Jul 16, 2021

alexey0308 commented Jul 16, 2021 •

edited

Loading

LKremer commented Jul 16, 2021

alexey0308 commented Jul 16, 2021 •

edited

Loading

LKremer commented Jul 16, 2021

alexey0308 commented Jul 16, 2021

alexey0308 commented Jul 16, 2021

alexey0308 commented Jul 16, 2021 •

edited

Loading

alexey0308 commented Aug 4, 2021

Use CSC format rather than CSR #2

Use CSC format rather than CSR #2

Conversation

alexey0308 commented Jul 16, 2021

alexey0308 commented Jul 16, 2021 • edited Loading

LKremer commented Jul 16, 2021

alexey0308 commented Jul 16, 2021 • edited Loading

LKremer commented Jul 16, 2021

alexey0308 commented Jul 16, 2021

alexey0308 commented Jul 16, 2021

alexey0308 commented Jul 16, 2021 • edited Loading

alexey0308 commented Aug 4, 2021

alexey0308 commented Jul 16, 2021 •

edited

Loading

alexey0308 commented Jul 16, 2021 •

edited

Loading

alexey0308 commented Jul 16, 2021 •

edited

Loading