Use CSC format rather than CSR #2
Looks like CSC saves some space and is much faster to write.
Also, np.loadtxt is very slow. *** Successfully stored methylation data for 50 cells with 21 chromosomes. real 1m23,850s @LKremer, was there any reason to use the ndmin argument in loadtxt? I guess you can be confident in the files dc31c2b#diff-1b359a4369a76403a79c9d54de3341764471d6e0aaa42ade712559b49c87fa30L122 |
CSC might be faster to write, but it's much slower to work with later since we need fast access to genomic positions, which are stored in the rows. CSR has fast access to rows, which is what we need when we iterate through a bed file to check the regions listed in the bed. |
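To illustrate the point about row access, here is a minimal sketch on toy data. It assumes rows are genomic positions and columns are cells, as described above; the matrix size and the region bounds are made up for the example and are not taken from scbs.

```python
import numpy as np
from scipy import sparse

# Toy data (hypothetical): 1000 genomic positions (rows) x 50 cells (columns).
rng = np.random.default_rng(0)
dense = (rng.random((1000, 50)) < 0.05).astype(np.int8)

csc = sparse.csc_matrix(dense)  # column-compressed: cheap column slices
csr = csc.tocsr()               # row-compressed: cheap row slices

# A bed-file region maps to a contiguous range of rows (positions).
start, end = 100, 200           # hypothetical region bounds
region_csr = csr[start:end]     # reads one contiguous chunk of the CSR arrays
region_csc = csc[start:end]     # has to pick entries out of every column

# Both formats return the same values; only the access cost differs.
assert np.array_equal(region_csr.toarray(), region_csc.toarray())
print(region_csr.shape)
```

Both slices give the same result, so the choice of format only affects how much work each region lookup costs.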
I need to check it, but I believe that converting the format in memory is rather fast after reading CSC from disk. |
I am confident in the files, but not in loadtxt's ability to guess the array dimensions from the file alone. In the rare case where the file has only one row (only one covered CpG on that chromosome in that cell), loadtxt guesses that the file is a 1D-vector instead of a 2D-matrix, which is incorrect and results in a crash. I added ndmin to fix that crash. |
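A small sketch of the single-row case described above. The three-column layout of the line is an assumption for illustration only, not the actual scbs input format.

```python
import io
import numpy as np

# A file with a single data row (hypothetical columns: position, two counts).
single_row = "1234 1 0\n"

# Without ndmin, loadtxt squeezes the result down to a 1D vector.
guessed = np.loadtxt(io.StringIO(single_row), dtype=int)

# With ndmin=2, the result is always a 2D matrix, even for one row.
forced = np.loadtxt(io.StringIO(single_row), dtype=int, ndmin=2)

print(guessed.shape)  # (3,)
print(forced.shape)   # (1, 3)
```

Any downstream code that indexes columns, such as `forced[:, 0]`, works on the second array but fails on the first.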
Also, in the case of CSR, this line was very time-consuming: https://github.com/LKremer/scbs/blob/master/scbs/prepare.py#L40 |
So to triage the PR, what remains is to compare the loading time of a CSR npz against the loading time of a CSC npz plus the conversion to CSR.
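One possible way to run that comparison, as a rough sketch on random data. The helper `time_load`, the file names, and the matrix size are all made up for the example; real timings will depend on the actual methylation matrices.

```python
import os
import tempfile
import time

import numpy as np
from scipy import sparse


def time_load(path, to_csr=False):
    """Load an npz file and optionally convert to CSR; return (matrix, seconds)."""
    t0 = time.perf_counter()
    mat = sparse.load_npz(path)
    if to_csr:
        mat = mat.tocsr()
    return mat, time.perf_counter() - t0


# Random stand-in for one chromosome: 20,000 positions x 50 cells.
mat = sparse.random(20_000, 50, density=0.02, format="csr", random_state=0)

with tempfile.TemporaryDirectory() as tmp:
    csr_path = os.path.join(tmp, "chr_csr.npz")  # hypothetical file names
    csc_path = os.path.join(tmp, "chr_csc.npz")
    sparse.save_npz(csr_path, mat)
    sparse.save_npz(csc_path, mat.tocsc())

    from_csr, t_csr = time_load(csr_path)
    from_csc, t_csc = time_load(csc_path, to_csr=True)

print(f"CSR npz load:           {t_csr:.4f} s")
print(f"CSC npz load + tocsr(): {t_csc:.4f} s")
```

A single run on data this small is noisy, so it is worth repeating the measurement a few times on real files before drawing a conclusion.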
I suggest you also try it on your machine :) especially if you have big real-world files.
|
OK, @simon-anders suggested using the CSR format, because there is no guarantee that the whole file fits into memory. |
** current state
*** Successfully stored methylation data for 20 cells with 21
*** Successfully stored methylation data for 50 cells with 21 chromosomes.
** csc instead of csr
*** Successfully stored methylation data for 20 cells with 21 chromosomes.
*** Successfully stored methylation data for 50 cells with 21 chromosomes.