Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Computing a consensus sequence #181

Closed
AlisonKennedy opened this issue Sep 8, 2017 · 20 comments
Closed

Computing a consensus sequence #181

AlisonKennedy opened this issue Sep 8, 2017 · 20 comments
Projects

Comments

@AlisonKennedy
Copy link

Hi,

I would like to use UMI-tools to compute a consensus sequence from a 'read group'. Is this modification possible?

Thanks!

@TomSmithCGAT
Copy link
Member

hi @AK-WIMM. Yes, this should be possible, although it will require some straightforward additional bespoke code.

You can use the group command with --output-bam to output a BAM annotated with the UMI groups, i.e the reads which have originated from the same initial molecule. You would then need to parse the BAM to identify the consensus sequence for each UMI group.

By the way, are you working in the WIMM, Oxford by any chance?

@AlisonKennedy
Copy link
Author

Hi. Thanks for getting back to me and yes I am based in the WIMM, Oxford.

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Sep 11, 2017

Hi @AK-WIMM. I believe Nils Kölling (@koelling) is still at the WIMM and this is exactly what he was using UMI-tools for if I remember correctly

@TomSmithCGAT
Copy link
Member

@AK-WIMM & @koelling - If either or both of you want to help us add consensus sequence identification to UMI-tools, this is something I'd be happy to assist with.

@koelling
Copy link

At the moment I am not actually outputting any consensus sequences. However, I am making a "consensus call" for each UMI group at each position in the genome, which I then use to calculate allele fractions. I was indeed planning on extending this to generate a BAM file with consensus sequences though, to help with visualisation and integrating this with downstream tools.

If @AK-WIMM is who I think it is then we will have a meeting soon to discuss this anyway. If not, please let me know! :)

@TomSmithCGAT
Copy link
Member

Hi @koelling & @AK-WIMM. @dhoconno also wants to use UMI-tools to derive consensus sequences (#194). What approach did you decide to take and do you think this is something we could, and should, do within UMI-tools?

@koelling
Copy link

I've been looking into doing this in my local pipeline and now have an early prototype that calculates consensus sequences given a set of UMI-tagged reads. The algorithm is quite simple (basically a majority vote at every nucleotide) but it seems to work well enough and runs very fast.

I ended up adding writing this as a separate script that just uses UMI-tools for the UMI grouping, since that was a lot easier for me to implement. However, I would think that it makes sense to do add something like this UMI-tools too, since there is clearly some demand for this. I think it would also be quite straightforward to add this to the dedup operation.

I would imagine you could just add a --consensus parameter to the dedup operation. If that flag is set, instead of just outputting a random read per group, you would output a consensus read, which is built based on sequences of all the reads for each UMI group.

I think this would be quite straightforward to implement, but it would probably be easier if one of you adds this, since I'm not very familiar with the codebase. I'd be happy to share my consensus calling code though!

@mfoll
Copy link

mfoll commented Nov 29, 2017

I'm also interested in this feature, and would be happy to see it implemented in UMI-tools!

@TomSmithCGAT
Copy link
Member

@mfoll - We'll try and include this in the next release
@koelling - Would you mind sharing a link to your script

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Nov 29, 2017

@IanSudbery - My preference would be to have a separate command to perform consensus calling since I consider this to be a distinct task. What do you think?

Internally, it's essentially just the same as group but each group is reduced to a consensus read sequence prior to writing out. EDIT - Of course it's not that simple. group only labels one read per pair and at no point during dedup do we obtain all read pairs per group. This will need some more thought...

@dhoconno
Copy link

dhoconno commented Nov 29, 2017 via email

@crutching
Copy link

Hi all,

Just out of curiosity, what is the current status of this issue? I am finding ways to utilize this as well.

Thanks!

@TomSmithCGAT TomSmithCGAT added this to To do in 1.0.0 Jan 16, 2019
@TomSmithCGAT TomSmithCGAT moved this from To do to Kicking the can in 1.0.0 Jan 31, 2019
@TomSmithCGAT
Copy link
Member

After some consideration, we've decided not to implement consensus sequence computation within UMI-tools for the time being.

We're hoping to release v1.0.0 in the coming days with we have no plans personally to add functionality from this point. However, if any users would be interesting in adding a consensus command to UMI-tools, please do get in touch and we can try and help you out.

@kenyon-alexander
Copy link

kenyon-alexander commented Jan 3, 2021

Hey all,

I'm interested in implementing consensus functionality in UMITools. Thoughts:

  • I agree with @TomSmithCGAT's previous comment that consensus functionality should be a separate command. The ultimate goal is a umi_tools consensus command which exists alongside dedup and group.
  • To my understanding, consensus formation can currently be achieved with "straightforward additional bespoke code" after running umi_tools group. However, @TomSmithCGAT's above point that "group only labels one read per pair" complicates the bespoke downstream code. Do we need to modify group so we can include all reads?
  • I'm aware that many third-party consensus formation tools exist. I have experience with fgbio CallMolecularConsensusReads. As a first step, I propose building a command which automatically formats the group output for input to CallMolecularConsensusReads. Is there a more common consensus read caller which would be better for this first step?
  • I'd love input on the exact method of consensus formation that users want. Some considerations:
    -- Should the consensus method consider sequencing quality and/or mapping quality?
    -- Should the consensus method be designed for a specific sequencing platform, chemistry, or application (e.g. HiSeq, target capture and amplification, cfDNA variant calling). Does the consensus method need to change/be customizable depending on these details?
    -- Should the consensus method consider unpaired reads or singletons? I need to confirm how UMITools currently deals with these.
    -- Are there any shortcomings from UMITools' grouping algorithm which should be considered for consensus formation?

Let me know if I should open a new issue for this project, and if anyone else would like to get involved. Thanks!

@IanSudbery
Copy link
Member

With regard to the second point, this is a long standing issue with the way UMITools deals with pairs.

In the group command, all read 2s are just output, unaltered, as soon as they are encountered. Adding grouping information to read2 would require knowing the group of the corresponding read1 when dealing with the read2. This means either:

  1. Processing all read1s - remembering the read -> grouping information, parse the file a second time adding grouping information.
  2. Storing read2s until the corresponding read1 is reached, and which point both read1 and read2 are removed from the queue. If a read2 is not in the buffer when read1 is reached, remembering read1 until we do encounter the corresponding read2.

All this is complicated by multi-mappers - there may be more than one read2 for each read1, even though read1 only points to one of them (and vice versa). Currently dedup uses a combination of the above two methods, but cruicially, this is feasible from a memory point of view because we only story the lead read from each group. Storing every read from every group is likely to use an infeasible amount of memory. The obvious solution is to store things on disk and use an index to find the relevant read as and when needed, but we tried this (basically using the BAM index to seek the read2 for a read1) and it was way too slow.

I do have a potential solution:

  1. Sort the file by name.
  2. Add a position tag that is either:
    The read position (if the read is read1)
    The read position of the corresponding read1 (if the read is read2).
  3. Do a custom sort on this tag.

This way we keep the read1s in genome order, but keep the read2s next to them.

I've had this idea for a while, but never had chance to implement.

@IanSudbery
Copy link
Member

-- Should the consensus method consider unpaired reads or singletons? I need to confirm how UMITools currently deals with these.

This is currently customisable - you can either choose to keep or discard unpaired reads. In dedup and count, in paired mode, unpaired read2s will always be discarded, but read1s can be kept or discarded depending on configuration.

@TomSmithCGAT TomSmithCGAT reopened this Jan 4, 2021
@RGellerLab
Copy link

Hi,
It would be great if you include a method for generating a consensus from a group of reads derived from the same UMI that also includes deletions. Many of the current tools just deal with bases.

@IanSudbery
Copy link
Member

Unfortunately I don't think either me nor Tom has the capacity to deal with this at the moment. It is on the list of improvements to make if we ever secure funding to maintain UMI-tools, although I have to say, we hadn't considered the question of indels, which would take the problem from being computationally trivial, to being highly demanding (probably the reason others don't do it), although it would be a motivating factor as to why such a development was necessary.

@karlkashofer
Copy link

I've been looking into doing this in my local pipeline and now have an early prototype that calculates consensus sequences given a set of UMI-tagged reads. The algorithm is quite simple (basically a majority vote at every nucleotide) but it seems to work well enough and runs very fast.

I ended up adding writing this as a separate script that just uses UMI-tools for the UMI grouping, since that was a lot easier for me to implement. However, I would think that it makes sense to do add something like this UMI-tools too, since there is clearly some demand for this. I think it would also be quite straightforward to add this to the dedup operation.

I would imagine you could just add a --consensus parameter to the dedup operation. If that flag is set, instead of just outputting a random read per group, you would output a consensus read, which is built based on sequences of all the reads for each UMI group.

I think this would be quite straightforward to implement, but it would probably be easier if one of you adds this, since I'm not very familiar with the codebase. I'd be happy to share my consensus calling code though!

Hey Nils ! Would you be so kind and share your script ?

@TomSmithCGAT
Copy link
Member

Closing due to inactivity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
1.0.0
Kicking the can
Development

No branches or pull requests

10 participants