makes umi_tools deterministic with --random-seed #550

Open · wants to merge 4 commits into master
Conversation

TomSmithCGAT
Member

Trying to resolve issues with umi_tools being non-deterministic even with --random-seed. We have two open PRs on this.

I found #365 was sometimes reporting the same read multiple times. For example, running the test
umi_tools group -L test.log --out-sam --random-seed=123456789 --method=adjacency --output-bam --group-out=group_adj_py3.tsv -I tests/chr19.bam > branch_out

SRR2057595.11597812_ATAAA is reported twice, but just once (as it should be) when using the master branch.

$ grep SRR2057595.11597812_ATAAA *out
branch_out:SRR2057595.11597812_ATAAA	16	chr19	4078297	255	38M	*	0	0	*	*	XA:i:1	MD:Z:29A8	NM:i:1	UG:i:52	BX:Z:ATAAA
branch_out:SRR2057595.11597812_ATAAA	16	chr19	4078297	255	38M	*	0	0	*	*	XA:i:1	MD:Z:29A8	NM:i:1	UG:i:53	BX:Z:ATAAA
master_out:SRR2057595.11597812_ATAAA	16	chr19	4078297	255	38M	*	0	0	*	*	XA:i:1	MD:Z:29A8	NM:i:1	UG:i:52	BX:Z:ATAAA

#470 works as intended but introduces a non-conda dependency, as discussed. We could vendor the code or get siphashc into conda, but there is a much simpler route if we don't mind updating the test files.

Following @TyberiusPrime's suggestion in #470, this PR just adds a sort to the components in network.py
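The idea can be sketched as follows (a minimal illustration, not the actual network.py code; the function name and component representation here are hypothetical). Components built from sets or dicts can come out in hash-dependent order, so sorting both the members and the components themselves pins down the iteration order:

```python
# Hypothetical sketch of the deterministic-sort idea: components derived
# from sets/dicts can be iterated in hash-dependent order across runs, so
# sorting them (and their members) makes downstream tie-breaking
# reproducible. Names are illustrative, not the actual umi_tools API.

def deterministic_components(components):
    """Sort each connected component's UMIs, then sort the components
    themselves, so iteration order no longer depends on hash seeds."""
    return sorted(sorted(component) for component in components)

# The same components presented in two different orders collapse to one
# canonical ordering.
a = [{"TTT", "TTA"}, {"AAA", "AAT"}]
b = [{"AAT", "AAA"}, {"TTA", "TTT"}]
assert deterministic_components(a) == deterministic_components(b)
```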

Runtime is unaffected, at least on the example.bam file we include in the release (9.4-9.7s with both master and this branch).

The test files will need to be updated of course. I just wanted to raise the PR first to check you are OK with this @IanSudbery before I update them and merge.

Note that with the adjacency method the output can change not only in which reads are output, but also in how many, since the order of UMIs with equal counts affects how many steps are required to account for a connected component. I see no issue with the number of reads returned in these cases being fixed by the sort, given we currently treat every possible resolution as equally probable.
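To illustrate the tie-breaking point above (a hypothetical example, not umi_tools's actual code): when several UMIs at a position have equal counts, picking the cluster "centre" with `max()` over a hash-ordered container can vary between runs. Adding an explicit secondary key makes the choice stable:

```python
from collections import Counter

def pick_centre(counts):
    """Deterministic centre choice: highest count wins; ties are broken
    toward the lexicographically largest UMI. (Illustrative only.)"""
    return max(counts, key=lambda umi: (counts[umi], umi))

# "AAAT" and "AATT" tie on count; the secondary key resolves the tie the
# same way on every run, regardless of dict/set iteration order.
counts = Counter({"AAAT": 5, "AATT": 5, "GGGG": 3})
assert pick_centre(counts) == "AATT"
```

Any fixed secondary key would do; what matters is that the resolution no longer depends on hash ordering.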

For the other methods, the number of reads returned will be unchanged.

@IanSudbery
Member

In general I have no issue with this, but I want to check performance on bigger input files (with 10s of thousands of UMIs at one position).

@TomSmithCGAT
Member Author

Have you got something in mind?

@IanSudbery
Member

I should have something somewhere from when I did benchmarking for the grant proposal.

@TomSmithCGAT
Member Author

Hi @IanSudbery - Did you have a chance to check the runtimes on bigger files?

@TomSmithCGAT
Member Author

Hi @IanSudbery. Would you be able to run the benchmarks you wanted before I merge this? The tests currently fail because many of the test files need updating: the deterministic ordering yields a different set of UMIs wherever the selection is between equally likely UMIs, compared with the previous code.

@IanSudbery
Member

IanSudbery commented Jun 13, 2023

Okay, I checked the timings. Not much in it. Here I measured the time to process a single position with an increasing number of "real" UMIs, each with twenty random PCR errors:

Current:

Centre_umis     time
1       17.33
4       17.27
16      17.3
64      17.21
256     17.65
1024    21.36
4096    70.68
5792    117.89
8192	223.08
11585   534.42

Proposed:

Centre_umis     time
1       9.26
4       9.15
16      9.21
64      9.19
256     9.61
1024    13.61
4096    62.99
5792    111.89
8192	219.17
11585   592.01

I'm not sure why there is a small but constant reduction in time for the new method at lower UMI counts - possibly something to do with the other changes since this branch was proposed? Anyway, other than that, very little in it.

I also tested on a 5M read real data set, with 4M positions, and an average of 1.1 UMI per position. Both versions took exactly the same 165 seconds.
