some samples running easily, others never finishing with dedup #340

Closed
RichardCorbett opened this issue Jun 4, 2019 · 7 comments

@RichardCorbett

Hi folks,

I'm using umi_tools 1.0.0 on two cohorts of miRNA BAM files.

Set 1 has about 40 million reads per BAM, with UMIs in the RX tag of the form "CAGC-CCAC".

Set 2 has about 10 million reads per BAM, with slightly longer UMIs in the RX tag, e.g. "AACCTC-AAATTG".

All dedup commands look like one of the following (I've tried both and gotten similar results):

umi_tools dedup -I ${1} --extract-umi-method=tag --umi-tag=RX -S ${1}.umi_tools_100_deduplicated.bam --output-stats=${1}.umi_tools_100_deduplicated.stats

umi_tools dedup -I ${1} --extract-umi-method=tag --umi-tag=RX --read-length -S ${1}.umi_tools_100_deduplicated_read_length.bam --output-stats=${1}.umi_tools_100_deduplicated_read_length.stats

My Set 1 commands dependably finish in less than a day. About half of the Set 2 jobs are killed on my cluster after their RAM usage climbs above 355 GB.

Do you have any suggestions, or things I could look into, to get this running well on all my samples?

thanks
Richard

@IanSudbery
Member

See https://umi-tools.readthedocs.io/en/latest/faq.html for advice on speeding up/memory usage.

The running time/memory is far more dependent on the length of the UMI and the level of duplication than it is on the total number of reads.

The biggest thing you can do here to improve matters is to not generate the stats. Stats generation is by far the biggest time and memory hog, as it randomly samples reads from the file to compute a null distribution.
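For example, your first command minus the stats output would be (paths as in your post):

# Same dedup call, but without --output-stats, which avoids the expensive
# null-distribution sampling.
umi_tools dedup -I ${1} --extract-umi-method=tag --umi-tag=RX \
    -S ${1}.umi_tools_100_deduplicated.bam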

@RichardCorbett
Author

Thanks so much. Sorry I didn't get this from the FAQ; I looked around there but somehow missed the important information. I'll set up the runs without stats and report the differences.

@RichardCorbett
Author

I re-ran my two sets of samples, this time without requesting the stats. As expected, this significantly reduced the time it took to analyze the samples. However, I still have a small number of samples that are maxing out the RAM available on our servers (355 GB).

I'm seeing that the --read-length parameter also reduces the RAM requirement, which likely makes sense as our miRNA reads are variable in length after adapter trimming.
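For reference, the re-run commands were essentially the earlier ones with the stats output dropped, e.g.:

# --read-length kept (our trimmed miRNA reads vary in length); no --output-stats.
umi_tools dedup -I ${1} --extract-umi-method=tag --umi-tag=RX --read-length \
    -S ${1}.umi_tools_100_deduplicated_read_length.bam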

All the reads are single end. Do you have any other suggestions I can try?

@IanSudbery
Member

IanSudbery commented Jun 10, 2019 via email

@RichardCorbett
Author

Thanks @IanSudbery ,

In case it helps, and perhaps confirms what you described above, I pulled out some statistics using just chr17 for these data. Below I've posted the numbers captured from stdout, but I also have the stats output if that's helpful. To my untrained eye, the S2 samples with the longer UMIs have a much more diverse set of UMIs, which seems to track with the RAM required to process the data. All of the samples below came from the same original source, but different protocols were used to generate the S1 and S2 data.

As you'd probably guess, the example UMIs aren't actually in the data; I just pulled one example from each set to show the format.

sample total_umis #umis example_umi RAM (GB) CPU (HH:MM:SS)
S1_1 3173421 65417 RX:Z:GGGC-GGGT 8.58 00:17:42
S1_2 2777234 66308 RX:Z:GGGC-GGGT 10.83 00:21:18
S1_3 5418676 73887 RX:Z:GGGC-GGGT 16.9 00:33:35
S1_4 7814809 75804 RX:Z:GGGC-GGGT 18.38 00:39:13
S1_5 10183956 77819 RX:Z:GGGC-GGGT 20.13 00:47:44
S2_1 635978 459178 RX:Z:TTATTT-GTTCAG 75.72 02:24:23
S2_2 620241 457182 RX:Z:TTATTT-GTTCAG 75.45 02:23:02
S2_3 776708 553499 RX:Z:TTATTT-GTTCAG 105.46 03:30:44
S2_4 602724 422818 RX:Z:TTATTT-GTTCAG 70.91 02:00:30
S2_5 507139 372762 RX:Z:TTATTT-GTTCAG 54 01:30:37
S2_6 553217 398429 RX:Z:TTATTT-GTTCAG 59.59 01:41:53
S2_7 502332 378169 RX:Z:TTATTT-GTTCAG 42.84 01:22:07
S2_8 2137629 1299840 RX:Z:TTATTT-GTTCAG 355.48 02:45:23

Command used:
umi_tools dedup -I ${1} --extract-umi-method=tag --umi-tag=RX -S ${1}.umi_tools_100_deduplicated.bam --output-stats ${1}.stats --chrom=17
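In case it's useful, counts like the ones in the table can be tallied directly from the BAMs with something along these lines. This is only a sketch: it assumes an indexed BAM, a chromosome named 17 in the header (as in the --chrom=17 call above), and UMIs made up of A/C/G/T/N plus a dash.

# Rough equivalent of the total_umis column: reads on chr17 carrying an RX tag.
samtools view ${1} 17 | grep -c 'RX:Z:'

# Rough equivalent of the #umis column: distinct RX UMIs seen on chr17.
samtools view ${1} 17 | grep -o 'RX:Z:[ACGTN-]*' | sort -u | wc -l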

@IanSudbery
Member

Hi Richard,

I hope you eventually managed to find a satisfactory way through this.

We are currently in the process of applying for funding to make a real improvement to the efficiency of UMI-tools. If you are still interested in the tool, I wondered if you might be able to support the application by writing a letter saying how useful it would be for you if UMI-tools ran faster and used less memory.

@TomSmithCGAT
Member

Closing due to inactivity
