some samples running easily, others never finishing with dedup #340
See https://umi-tools.readthedocs.io/en/latest/faq.html for advice on speeding up and reducing memory usage. The running time/memory is far more dependent on the length of the UMI and the level of duplication than it is on the total number of reads. The biggest thing you can do here to improve things is not to generate the stats. Stats generation is by far the biggest time and space hog when used, as it randomly samples reads from the file to compute a null distribution.
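As a minimal sketch of what dropping the stats looks like on the command line (file names are placeholders, not taken from this thread):

```bash
# Hypothetical invocation -- input/output names are placeholders.
# Slower and memory-hungry: --output-stats samples reads from the BAM
# to build a null distribution of UMI edit distances.
umi_tools dedup -I sample.bam -S sample.dedup.bam --output-stats=sample_stats

# Same de-duplication, but without the stats generation.
umi_tools dedup -I sample.bam -S sample.dedup.bam
```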
Thanks so much. Sorry I didn't get this from the FAQ. I tried looking around on there but somehow missed the important information. I'll set up the runs sans stats and report the differences.
I re-ran my 2 sets of samples, this time not requesting the stats. As expected, this significantly reduced the time it took to analyze the samples. However, I still have a small number of samples that are maxing out the RAM available on our servers (355 GB). I'm seeing that the read-length parameter also reduced the RAM requirement, which likely makes sense as our miRNA reads are variable length post adapter trimming. All the reads are single-end. Do you have any other suggestions I can try?
By this do you mean the `--read-length` parameter? If you are doing miRNA-seq then it is often good to use this parameter, as your original (pre-PCR) molecules may have been of different lengths, and this can help to separate out things that are different before UMIs are even considered, so you should definitely use this.
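A hedged sketch of adding that flag (placeholder file names; the rest of the command is illustrative):

```bash
# Reads at the same position but with different mapped lengths are treated
# as distinct molecules before UMIs are compared -- useful when post-trimming
# miRNA read lengths vary.
umi_tools dedup -I sample.bam -S sample.dedup.bam --read-length
```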
I'm guessing that what you have in those few samples that are maxing out the memory is some positions with a very large number of different UMIs at them. As the number of UMIs grows, the networks become more and more complex and thus take more and more memory to hold. Also, at this point some of the assumptions that underlie UMI-tools start to break down. UMI-tools is based on the assumption that, all other things being equal, two different UMIs are unlikely to differ by just a single base (or at least, when 2 UMIs differ by only 1 base, it's more likely this is a sequencing error than a genuine chance occurrence). However, once you pass a certain level of saturation of UMI space, that stops being true. This suggests either that one miRNA is completely dominating your sample (unlikely I'd think in a normal sample), or you have massively over-sequenced.
If `--read-length` doesn't help enough, I can think of three options (see the sketch after this list):
1. Switch to a non-network-based de-duplication protocol. You could try percentile, which, while only slightly better than naive UMI counting, would use far less memory. Obviously you'd have to do this for all samples.
2. Down-sample the reads (the UMI-tools option `--subset` allows this).
3. Identify the sequence that has such massive read depth - it's possible it's not even a miRNA at all (e.g. it could be an rRNA fragment) - and remove the reads associated with it.
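A rough sketch of what the first two options could look like on the command line; the file names and the 0.1 subset fraction are placeholders, not recommendations:

```bash
# Option 1: a non-network method -- far less memory, but only marginally
# better than naive unique-UMI counting.
umi_tools dedup -I sample.bam -S sample.dedup.bam --method=percentile

# Option 2: randomly down-sample the reads that are considered;
# 0.1 is an arbitrary example fraction.
umi_tools dedup -I sample.bam -S sample.dedup.bam --subset=0.1
```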
Sorry I can't be more helpful.
Ian
Thanks @IanSudbery. In case it helps, and perhaps confirms what you described above, I tried pulling out some statistics based on using just chr17 for these data. Below I've posted the numbers captured from stdout, but I also have the stats output if that would be helpful. To my untrained eye it looks like the S2 samples, with the longer UMI, have a much more diverse set of UMIs, which seems to be related to the RAM required to process the data. All of the samples processed below came from the same original source, but different protocols were used to generate the S1 and S2 data. As you'd probably guess, the example UMIs aren't actually in the data; I just pulled one example from each set to show the format.
Command used:
Hi Richard, I hope you eventually managed to find a satisfactory way through this. We are currently in the process of applying for funding to make a real change in the efficiency of UMI-tools. If you are still interested in the tool, I wondered if you might be able to support the application by writing a letter saying how useful it would be for you if UMI-tools ran faster and used less memory?
Closing due to inactivity
Hi folks,
I'm using umi_tools 1.0.0 on two cohorts of miRNA BAM files.
Set 1 has about 40 million reads per BAM, with the UMI stored in the RX tag in a format like "CAGC-CCAC".
Set 2 has about 10 million reads per BAM, with the UMIs in the RX tag being slightly longer, e.g. "AACCTC-AAATTG".
All dedup commands look like one of the following (I've tried both and gotten similar results):
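A hedged sketch of a tag-based dedup invocation of this kind; the file names are placeholders, and this is illustrative rather than the exact commands used:

```bash
# Illustrative only -- placeholder paths; UMIs are read from the RX BAM tag
# as described above.
umi_tools dedup -I sample.bam -S sample.dedup.bam \
    --extract-umi-method=tag --umi-tag=RX
```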
My Set 1 commands dependably finish in less than a day. About half of the Set 2 datasets are killed on my cluster after they hit a RAM occupancy above 355 GB.
Do you have any suggestions or things I could look into to get this running well on all my samples?
thanks
Richard