About ld refinement on putative somatic SNVs, spend a lot of time #18

Open
monoplasty opened this issue Sep 20, 2023 · 3 comments

@monoplasty
Contributor

Hello,
I am running somatic SNV calling on scRNA-seq data. The second step (cellScan) takes a very long time, 30+ hours. Is there any way to speed it up?

The BAM file is about 8 GB, and the machine has 32 CPU cores.

Could you please provide some guidance on how to resolve this issue?
Thank you in advance for your assistance.

monoplasty changed the title from "About ld refinement on putative somatic SNVs" to "About ld refinement on putative somatic SNVs, spend a lot of time" on Sep 20, 2023

@jinzhuangdou
Collaborator

Yes, the cellScan step usually takes a long time because we need to extract cell-level read information. Could you let me know how many cells you included? You can select cells with the option --keep 0.8 (which selects the cells with the most variable reads) to reduce the computational burden.

@monoplasty
Contributor Author

@jinzhuangdou Thank you for your reply! The sample I ran had 8609 cells. I left --keep at its default value of 0.8.

@slinnarsson

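The cellScan step is written in such a way that execution time is quadratic in the number of cells. It takes the list of cell barcodes and, for each barcode, scans the entire BAM file to find the reads from that cell. The number of scans grows with the number of cells, and so does the size of the BAM file that each scan has to read.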

for cell in cell_lst:
	para = "merge" + ":" + cell + ":" + args.out + ":" + args.app_path
	joblst.append(para)
with Pool(processes=args.nthreads) as pool:
	result = pool.map(bamSplit, joblst)  # <--- bamSplit scans the whole BAM file for each cell

This means that if 8609 cells take 30 hours, 2x8609 cells would take about five days (4 x 30 hours) and 10x8609 cells would take about four months (100 x 30 hours).
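
The extrapolation can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes runtime scales exactly with the square of the cell count, takes the 30 hours reported above as the baseline, and uses a made-up helper name (projected_hours) that is not part of Monopogen.

```python
def projected_hours(base_cells, base_hours, new_cells):
    """Extrapolate runtime, assuming time grows with the square of the cell count."""
    return base_hours * (new_cells / base_cells) ** 2


# Baseline reported in this thread: 8609 cells took roughly 30 hours.
base_cells, base_hours = 8609, 30.0

for factor in (1, 2, 10):
    hours = projected_hours(base_cells, base_hours, factor * base_cells)
    print(f"{factor:>2}x cells: {hours:6.0f} hours = {hours / 24:5.1f} days")
```

Under that assumption, doubling the cells gives 120 hours (5 days) and ten times the cells gives 3000 hours (125 days, roughly four months).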

It could be rewritten to scan the BAM file just once, writing all the cell-specific BAM files in parallel and on the fly. That would likely reduce execution time from 30+ hours to a few minutes. It would make it possible to run Monopogen on much larger samples.
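
A minimal sketch of the single-pass idea, not Monopogen's actual code. To stay self-contained it works on SAM text lines with the standard library only; a real implementation would read and write BAM (for example with pysam). It assumes the cell barcode is stored in a CB:Z: tag, as in 10x Genomics output, and the names split_sam_by_cell and make_read are hypothetical.

```python
import os
import tempfile


def split_sam_by_cell(sam_lines, keep_barcodes, out_dir):
    """Write one SAM file per kept cell while reading the input exactly once."""
    keep = set(keep_barcodes)
    header = []   # header lines, copied into every per-cell file
    handles = {}  # barcode -> open output file
    counts = {}   # barcode -> number of reads written
    try:
        for line in sam_lines:
            if line.startswith("@"):  # SAM header lines precede the reads
                header.append(line)
                continue
            # Optional tags start at column 12; look for the cell barcode tag.
            barcode = None
            for field in line.rstrip("\n").split("\t")[11:]:
                if field.startswith("CB:Z:"):  # assumed barcode tag
                    barcode = field[5:]
                    break
            if barcode not in keep:
                continue
            if barcode not in handles:  # first read seen for this cell
                fh = open(os.path.join(out_dir, barcode + ".sam"), "w")
                fh.writelines(header)
                handles[barcode] = fh
            handles[barcode].write(line)
            counts[barcode] = counts.get(barcode, 0) + 1
    finally:
        for fh in handles.values():
            fh.close()
    return counts


def make_read(name, barcode):
    """Build a toy SAM record carrying a CB tag (illustrative data only)."""
    cols = [name, "0", "chr1", "100", "60", "4M", "*", "0", "0",
            "ACGT", "FFFF", "CB:Z:" + barcode]
    return "\t".join(cols) + "\n"


demo = ["@HD\tVN:1.6\n",
        make_read("r1", "AAAC-1"), make_read("r2", "TTTG-1"),
        make_read("r3", "AAAC-1"), make_read("r4", "GGGG-1")]

with tempfile.TemporaryDirectory() as tmp:
    counts = split_sam_by_cell(demo, ["AAAC-1", "TTTG-1"], tmp)
    print(counts)
```

Every read is touched once, so the work grows with the size of the file rather than with the file size times the number of cells. One practical caveat: this sketch keeps one file open per cell, so with thousands of cells the operating system's open-file limit may need raising, or reads could be buffered per cell and flushed in batches.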
