Computing a consensus sequence #181

AlisonKennedy · 2017-09-08T16:54:47Z

Hi,

I would like to use UMI-tools to compute a consensus sequence from a 'read group'. Is this modification possible?

Thanks!

TomSmithCGAT · 2017-09-09T15:38:47Z

hi @AK-WIMM. Yes, this should be possible, although it will require some straightforward additional bespoke code.

You can use the group command with --output-bam to output a BAM annotated with the UMI groups, i.e the reads which have originated from the same initial molecule. You would then need to parse the BAM to identify the consensus sequence for each UMI group.

By the way, are you working in the WIMM, Oxford by any chance?

AlisonKennedy · 2017-09-10T09:12:48Z

Hi. Thanks for getting back to me and yes I am based in the WIMM, Oxford.

TomSmithCGAT · 2017-09-11T08:18:45Z

Hi @AK-WIMM. I believe Nils Kölling (@koelling) is still at the WIMM and this is exactly what he was using UMI-tools for if I remember correctly

TomSmithCGAT · 2017-09-12T16:57:30Z

@AK-WIMM & @koelling - If either or both of you want to help us add consensus sequence identification to UMI-tools, this is something I'd be happy to assist with.

koelling · 2017-09-12T17:06:49Z

At the moment I am not actually outputting any consensus sequences. However, I am making a "consensus call" for each UMI group at each position in the genome, which I then use to calculate allele fractions. I was indeed planning on extending this to generate a BAM file with consensus sequences though, to help with visualisation and integrating this with downstream tools.

If @AK-WIMM is who I think it is then we will have a meeting soon to discuss this anyway. If not, please let me know! :)

TomSmithCGAT · 2017-10-13T07:48:07Z

Hi @koelling & @AK-WIMM. @dhoconno also wants to use UMI-tools to derive consensus sequences (#194). What approach did you decide to take and do you think this is something we could, and should, do within UMI-tools?

koelling · 2017-11-20T13:46:19Z

I've been looking into doing this in my local pipeline and now have an early prototype that calculates consensus sequences given a set of UMI-tagged reads. The algorithm is quite simple (basically a majority vote at every nucleotide) but it seems to work well enough and runs very fast.

I ended up adding writing this as a separate script that just uses UMI-tools for the UMI grouping, since that was a lot easier for me to implement. However, I would think that it makes sense to do add something like this UMI-tools too, since there is clearly some demand for this. I think it would also be quite straightforward to add this to the dedup operation.

I would imagine you could just add a --consensus parameter to the dedup operation. If that flag is set, instead of just outputting a random read per group, you would output a consensus read, which is built based on sequences of all the reads for each UMI group.

I think this would be quite straightforward to implement, but it would probably be easier if one of you adds this, since I'm not very familiar with the codebase. I'd be happy to share my consensus calling code though!

mfoll · 2017-11-29T10:57:48Z

I'm also interested in this feature, and would be happy to see it implemented in UMI-tools!

TomSmithCGAT · 2017-11-29T12:55:50Z

@mfoll - We'll try and include this in the next release
@koelling - Would you mind sharing a link to your script

TomSmithCGAT · 2017-11-29T12:58:31Z

@IanSudbery - My preference would be to have a separate command to perform consensus calling since I consider this to be a distinct task. What do you think?

Internally, it's essentially just the same as group but each group is reduced to a consensus read sequence prior to writing out. EDIT - Of course it's not that simple. group only labels one read per pair and at no point during dedup do we obtain all read pairs per group. This will need some more thought...

dhoconno · 2017-11-29T15:30:35Z

We would also be happy to help test the consensus calling functionality. Cheers, dave

…

On Wed, Nov 29, 2017 at 6:58 AM, Tom Smith ***@***.***> wrote: @IanSudbery <https://github.com/iansudbery> - My preference would be to have a separate command to perform consensus calling since I consider this to be a distinct task. What do you think? Internally, it's essentially just the same as group but each group is reduced to a consensus read sequence prior to writing out. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#181 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFyze7ueb2JqyM-6IrDzzVhC2tkdpgRqks5s7VT5gaJpZM4PRcCR> .

-- --- David O'Connor, Ph.D. Professor, UW-Madison (608) 301-5710 • @dho http://labs.pathology.wisc.edu/oconnor

crutching · 2018-07-12T19:51:23Z

Hi all,

Just out of curiosity, what is the current status of this issue? I am finding ways to utilize this as well.

Thanks!

TomSmithCGAT · 2019-02-11T14:03:02Z

After some consideration, we've decided not to implement consensus sequence computation within UMI-tools for the time being.

We're hoping to release v1.0.0 in the coming days with we have no plans personally to add functionality from this point. However, if any users would be interesting in adding a consensus command to UMI-tools, please do get in touch and we can try and help you out.

kenyon-alexander · 2021-01-03T22:06:04Z

Hey all,

I'm interested in implementing consensus functionality in UMITools. Thoughts:

I agree with @TomSmithCGAT's previous comment that consensus functionality should be a separate command. The ultimate goal is a umi_tools consensus command which exists alongside dedup and group.
To my understanding, consensus formation can currently be achieved with "straightforward additional bespoke code" after running umi_tools group. However, @TomSmithCGAT's above point that "group only labels one read per pair" complicates the bespoke downstream code. Do we need to modify group so we can include all reads?
I'm aware that many third-party consensus formation tools exist. I have experience with fgbio CallMolecularConsensusReads. As a first step, I propose building a command which automatically formats the group output for input to CallMolecularConsensusReads. Is there a more common consensus read caller which would be better for this first step?
I'd love input on the exact method of consensus formation that users want. Some considerations:
-- Should the consensus method consider sequencing quality and/or mapping quality?
-- Should the consensus method be designed for a specific sequencing platform, chemistry, or application (e.g. HiSeq, target capture and amplification, cfDNA variant calling). Does the consensus method need to change/be customizable depending on these details?
-- Should the consensus method consider unpaired reads or singletons? I need to confirm how UMITools currently deals with these.
-- Are there any shortcomings from UMITools' grouping algorithm which should be considered for consensus formation?

Let me know if I should open a new issue for this project, and if anyone else would like to get involved. Thanks!

IanSudbery · 2021-01-04T10:48:04Z

With regard to the second point, this is a long standing issue with the way UMITools deals with pairs.

In the group command, all read 2s are just output, unaltered, as soon as they are encountered. Adding grouping information to read2 would require knowing the group of the corresponding read1 when dealing with the read2. This means either:

Processing all read1s - remembering the read -> grouping information, parse the file a second time adding grouping information.
Storing read2s until the corresponding read1 is reached, and which point both read1 and read2 are removed from the queue. If a read2 is not in the buffer when read1 is reached, remembering read1 until we do encounter the corresponding read2.

All this is complicated by multi-mappers - there may be more than one read2 for each read1, even though read1 only points to one of them (and vice versa). Currently dedup uses a combination of the above two methods, but cruicially, this is feasible from a memory point of view because we only story the lead read from each group. Storing every read from every group is likely to use an infeasible amount of memory. The obvious solution is to store things on disk and use an index to find the relevant read as and when needed, but we tried this (basically using the BAM index to seek the read2 for a read1) and it was way too slow.

I do have a potential solution:

Sort the file by name.
Add a position tag that is either:
The read position (if the read is read1)
The read position of the corresponding read1 (if the read is read2).
Do a custom sort on this tag.

This way we keep the read1s in genome order, but keep the read2s next to them.

I've had this idea for a while, but never had chance to implement.

IanSudbery · 2021-01-04T10:50:53Z

-- Should the consensus method consider unpaired reads or singletons? I need to confirm how UMITools currently deals with these.

This is currently customisable - you can either choose to keep or discard unpaired reads. In dedup and count, in paired mode, unpaired read2s will always be discarded, but read1s can be kept or discarded depending on configuration.

RGellerLab · 2022-03-07T19:36:42Z

Hi,
It would be great if you include a method for generating a consensus from a group of reads derived from the same UMI that also includes deletions. Many of the current tools just deal with bases.

IanSudbery · 2022-03-08T09:37:56Z

Unfortunately I don't think either me nor Tom has the capacity to deal with this at the moment. It is on the list of improvements to make if we ever secure funding to maintain UMI-tools, although I have to say, we hadn't considered the question of indels, which would take the problem from being computationally trivial, to being highly demanding (probably the reason others don't do it), although it would be a motivating factor as to why such a development was necessary.

karlkashofer · 2022-12-10T13:30:43Z

I've been looking into doing this in my local pipeline and now have an early prototype that calculates consensus sequences given a set of UMI-tagged reads. The algorithm is quite simple (basically a majority vote at every nucleotide) but it seems to work well enough and runs very fast.

I ended up adding writing this as a separate script that just uses UMI-tools for the UMI grouping, since that was a lot easier for me to implement. However, I would think that it makes sense to do add something like this UMI-tools too, since there is clearly some demand for this. I think it would also be quite straightforward to add this to the dedup operation.

I would imagine you could just add a --consensus parameter to the dedup operation. If that flag is set, instead of just outputting a random read per group, you would output a consensus read, which is built based on sequences of all the reads for each UMI group.

I think this would be quite straightforward to implement, but it would probably be easier if one of you adds this, since I'm not very familiar with the codebase. I'd be happy to share my consensus calling code though!

Hey Nils ! Would you be so kind and share your script ?

TomSmithCGAT · 2024-03-07T21:36:12Z

Closing due to inactivity

TomSmithCGAT mentioned this issue Oct 13, 2017

Q: Handling UMI-only reads #194

Closed

TomSmithCGAT added the Next release label Nov 29, 2017

TomSmithCGAT mentioned this issue Dec 20, 2017

Grouping in --paired mode only tags First-in-pair #224

Closed

TomSmithCGAT added this to To do in 1.0.0 Jan 16, 2019

TomSmithCGAT moved this from To do to Kicking the can in 1.0.0 Jan 31, 2019

TomSmithCGAT closed this as completed Feb 11, 2019

TomSmithCGAT reopened this Jan 4, 2021

TomSmithCGAT closed this as completed Mar 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Computing a consensus sequence #181

Computing a consensus sequence #181

AlisonKennedy commented Sep 8, 2017

TomSmithCGAT commented Sep 9, 2017

AlisonKennedy commented Sep 10, 2017

TomSmithCGAT commented Sep 11, 2017 •

edited

Loading

TomSmithCGAT commented Sep 12, 2017

koelling commented Sep 12, 2017

TomSmithCGAT commented Oct 13, 2017

koelling commented Nov 20, 2017

mfoll commented Nov 29, 2017

TomSmithCGAT commented Nov 29, 2017

TomSmithCGAT commented Nov 29, 2017 •

edited

Loading

dhoconno commented Nov 29, 2017 via email

crutching commented Jul 12, 2018

TomSmithCGAT commented Feb 11, 2019

kenyon-alexander commented Jan 3, 2021 •

edited

Loading

IanSudbery commented Jan 4, 2021

IanSudbery commented Jan 4, 2021

RGellerLab commented Mar 7, 2022

IanSudbery commented Mar 8, 2022

karlkashofer commented Dec 10, 2022

TomSmithCGAT commented Mar 7, 2024

Computing a consensus sequence #181

Computing a consensus sequence #181

Comments

AlisonKennedy commented Sep 8, 2017

TomSmithCGAT commented Sep 9, 2017

AlisonKennedy commented Sep 10, 2017

TomSmithCGAT commented Sep 11, 2017 • edited Loading

TomSmithCGAT commented Sep 12, 2017

koelling commented Sep 12, 2017

TomSmithCGAT commented Oct 13, 2017

koelling commented Nov 20, 2017

mfoll commented Nov 29, 2017

TomSmithCGAT commented Nov 29, 2017

TomSmithCGAT commented Nov 29, 2017 • edited Loading

dhoconno commented Nov 29, 2017 via email

crutching commented Jul 12, 2018

TomSmithCGAT commented Feb 11, 2019

kenyon-alexander commented Jan 3, 2021 • edited Loading

IanSudbery commented Jan 4, 2021

IanSudbery commented Jan 4, 2021

RGellerLab commented Mar 7, 2022

IanSudbery commented Mar 8, 2022

karlkashofer commented Dec 10, 2022

TomSmithCGAT commented Mar 7, 2024

TomSmithCGAT commented Sep 11, 2017 •

edited

Loading

TomSmithCGAT commented Nov 29, 2017 •

edited

Loading

kenyon-alexander commented Jan 3, 2021 •

edited

Loading