some samples running easily, others never finishing with dedup #340
See https://umi-tools.readthedocs.io/en/latest/faq.html for advice on speeding up and reducing memory usage. The running time/memory is far more dependent on the length of the UMI and the level of duplication than it is on the total number of reads. The biggest thing you can do here to improve things is not to generate the stats. Stats generation is by far the biggest time and space hog when used, as it randomly samples reads from the file to compute a null distribution.
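As a minimal sketch of what dropping the stats looks like on the command line (file names are placeholders, not taken from this thread):

```bash
# Hypothetical invocation -- input/output names are placeholders.
# Slower and memory-hungry: --output-stats samples reads from the BAM
# to build a null distribution of UMI edit distances.
umi_tools dedup -I sample.bam -S sample.dedup.bam --output-stats=sample_stats

# Same de-duplication, but without the stats generation.
umi_tools dedup -I sample.bam -S sample.dedup.bam
```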
Thanks so much. Sorry I didn't get this from the FAQ. I tried looking around on there but somehow missed the important information. I'll set up the runs sans stats and report the differences.
I re-ran my 2 sets of samples, this time not requesting the stats. As expected, this significantly reduced the time it took to analyze the samples. However, I still have a small number of samples that are maxing out the RAM available on our servers (355 GB). I'm seeing that the read-length parameter also reduced the RAM requirement, which likely makes sense as our miRNA reads are variable length post adapter trimming. All the reads are single-end. Do you have any other suggestions I can try?
By this do you mean the `--read-length` parameter? If you are doing miRNA-seq then it is often good to use this parameter, as your original (pre-PCR) molecules may have been of different lengths, and this can help to separate out things that are different before UMIs are even considered, so you should definitely use this.
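A hedged sketch of adding that flag (placeholder file names; the rest of the command is illustrative):

```bash
# Reads at the same position but with different mapped lengths are treated
# as distinct molecules before UMIs are compared -- useful when post-trimming
# miRNA read lengths vary.
umi_tools dedup -I sample.bam -S sample.dedup.bam --read-length
```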
I'm guessing that what you have in those few samples that are maxing out the memory is some positions with a very large number of different UMIs at them. As the number of UMIs grows, the networks become more and more complex and thus take more and more memory to hold. Also, at this point some of the assumptions that underlie UMI-tools start to break down. UMI-tools is based on the assumption that, all other things being equal, two different UMIs are unlikely to differ by just a single base (or at least, when 2 UMIs differ by only 1 base, it's more likely this is a sequencing error than a genuine chance occurrence). However, once you pass a certain level of saturation of UMI space, that stops being true. This suggests either that one miRNA is completely dominating your sample (unlikely I'd think in a normal sample), or you have massively over-sequenced.
If `--read-length` doesn't help enough, I can think of three options (see the sketch after this list):
1. Switch to a non-network-based de-duplication protocol. You could try percentile, which, while only slightly better than naive UMI counting, would use far less memory. Obviously you'd have to do this for all samples.
2. Down-sample the reads (the UMI-tools option `--subset` allows this).
3. Identify the sequence that has such massive read depth - it's possible it's not even a miRNA at all (e.g. it could be an rRNA fragment) - and remove the reads associated with it.
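A rough sketch of what the first two options could look like on the command line; the file names and the 0.1 subset fraction are placeholders, not recommendations:

```bash
# Option 1: a non-network method -- far less memory, but only marginally
# better than naive unique-UMI counting.
umi_tools dedup -I sample.bam -S sample.dedup.bam --method=percentile

# Option 2: randomly down-sample the reads that are considered;
# 0.1 is an arbitrary example fraction.
umi_tools dedup -I sample.bam -S sample.dedup.bam --subset=0.1
```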
Sorry I can't be more helpful.
Ian
Thanks @IanSudbery. In case it helps, and perhaps confirms what you described above, I tried pulling out some statistics based on using just chr17 for these data. Below I've posted the numbers captured from stdout, but I also have the stats output if that would be helpful. To my untrained eye it looks like the S2 samples, with the longer UMI, have a much more diverse set of UMIs, which seems to be related to the RAM required to process the data. All of the samples processed below came from the same original source, but different protocols were used to generate the S1 and S2 data. As you'd probably guess, the example UMIs aren't actually in the data; I just pulled one example from each set to show the format.
Command used:
Hi Richard, I hope you eventually managed to find a satisfactory way through this. We are currently in the process of applying for funding to make a real change in the efficiency of UMI-tools. If you are still interested in the tool, I wondered if you might be able to support the application by writing a letter saying how useful it would be for you if UMI-tools ran faster and used less memory?
Closing due to inactivity
Hi folks,
I'm using umi_tools 1.0.0 on two cohorts of miRNA BAM files.
Set 1 has about 40 million reads per BAM, with the UMI stored in the RX tag in a format like "CAGC-CCAC".
Set 2 has about 10 million reads per BAM, with the UMIs in the RX tag being slightly longer, e.g. "AACCTC-AAATTG".
All dedup commands look like one of the following (I've tried both and gotten similar results):
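A hedged sketch of a tag-based dedup invocation of this kind; the file names are placeholders, and this is illustrative rather than the exact commands used:

```bash
# Illustrative only -- placeholder paths; UMIs are read from the RX BAM tag
# as described above.
umi_tools dedup -I sample.bam -S sample.dedup.bam \
    --extract-umi-method=tag --umi-tag=RX
```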
My Set 1 commands dependably finish in less than a day. About half of the Set 2 datasets are killed on my cluster after they hit a RAM occupancy above 355 GB.
Do you have any suggestions or things I could look into to get this running well on all my samples?
thanks
Richard