Keeping unmapped reads #78

Poshi · 2017-02-02T12:49:11Z

I'm sorry I'm opening so many issues these days. This is a functionality request.
In our pipelines, we always keep all the input data, so we can re-generate the source FastQ file from the final BAM (go back to the original sequencer result file from the heavily processed final bam).
For this to work, we cannot discard any read, and for this we have to mark duplicates, and not remove them, so we will be mostly using group instead of dedup, and generating a BAM file with the proper tags and flags updated.
If I'm not wrong, the group command does not mark the duplicates, only gives you an overview of the results. And the dedup command, if I understood correctly, kills the duplicate reads, but does not mark them.
We will need two things:

Be able to mark the duplicates instead of removing them (this means SAM FLAG modification and some tag addition)
Don't remove any read. The unmapped ones can still be outputed unmodified (obviously, without a chromosome and position you cannot know if it is a duplicate, just keep that information in the output).

TomSmithCGAT · 2017-02-02T13:53:22Z

Hi @Poshi. Don't worry about opening up lots of issues. We're keen to make sure UMI-tools meets everyone's needs where possible so it's always good to get feedback about any functionality requirements. The issues regarding documentation and errors in the output are also obviously very helpful! Do you mind if I ask how you're using UMI-tools? I.e academic, industry, as part of another software product, etc?

The group command will mark the duplicates in the sense that each read will be assigned to a discrete read group and this read group can then be used to decide which reads to retain. The main use case of the group command would be where all the reads are required for the downstream analysis, e.g error correction when calling low frequency mutations. This was the motivating use case for introducing the group command (#54)

If you wanted to mark the duplicates but return all reads, I think the best option would be to extend the group command so that the reads are returned with an additional flag marking the reads as 'primary' or 'duplicate', such that removal of all secondary reads would give the same output as the dedup command. @IanSudbery - what do you think?

You make a very good point about the unmapped reads. Until recently, I'd always seen UMI-tools being used downstream of the initial alignment, hence the loss of unmapped reads would not be an issue. However, retaining unmapped reads would allow the BAM output of UMI-tools group to become the 'raw' data as it would be possible to convert back to the original fastq file as you say. This would of course necessitate adding functionality to extract to store the UMI base calling qualities too. Are you aware of any tools to convert BAM to fastq when the BAM contains a UMI information? If not, perhaps that's something to add to UMI-tools too in the longer term.

@IanSudbery - I think all we need to do to output unmapped reads for this is to add 'until_eof=True' to the pysam.fetch() call and modify the control flow within get_bundles (outputting unmapped reads should be optional of course, but perhaps default=true?)

IanSudbery · 2017-02-02T16:56:33Z

Two things we need to think about: What happens if we encounter an unmapped read in the middle of a load of mapped reads? Do we stop and output it? What happens if read1 is unmapped, but read 2 is mapped? How do we handle deduping? What happens if an unmapped read is fed to the TwoPassWriter? (At the moment we dedup on read1 mapping position and insert size).

Poshi · 2017-02-03T11:04:21Z

@TomSmithCGAT Regarding the marking of duplicates, I'm not sure if you are talking about adding tags with the information or modyfying the FLAG field of the sam file.
What I will do would be take a read as a representative (by better MAPQ score, less secondary alignments... whatever), leave that one as is, and mark the others as duplicates by setting the proper sam FLAG (0x400) and maybe also putting some tag to inform about the type of duplication (PCR, optical). Something similar to what Picard MarkDuplicates do. Also adding the tag with the assigned UMI.

You also have a point where most people, when doing the analysis, will just discard some not useful reads. But we do the sequencing and analysis for third parties and we have to give out the whole data got from the sequencer. This is why we want to keep everything in the BAM file. We want to avoid the need to keep the original FastQ file and the final bam.
This also implies that the UMI qualities should be somewhere, but so far, this is something that we didn't tackled yet. The best option would be to add them as another tag, so we will have RX for the UMI sequences and RQ(?) for the UMI quality. Some day...

So far, I'm not aware of any UMI tool that goes from BAM to FastQ and keeps UMI information. But this is not difficult to implement with a couple of awks. The information is there.
I'm aware of tools for going from FastQ to UMI tagged BAM: your tool, that puts the UMI in the ID, and fgbio AnnotateBamWithUmis, than only takes care of the UMI sequence by putting it in a tag.

@IanSudbery My five cents on how to deal with unmapped reads. They should not be a problem. They cannot be mapped, so just ignore them, output them as is and continue with the next. It won't be a duplicate because it is not mapped.

For the situation where one end is mapped but the other end is not, perform the operation as if it were single end, but keeping the other end close in the output file and unmapped. Maybe this situation is kind of tricky for the code.

TomSmithCGAT · 2017-02-03T11:17:35Z

@Poshi Yes this is exactly what I was thinking. The dedup command selects the best read according to MAPQ, number of mapping positions etc. We would use the same process to identify the 'primary' read and mark all other reads within the group as 'duplicate'.

We'll have to have a discussion internally about how to deal with quality scores.

Ian's point was more in relation to the internal workings of UMI-tools which will need to be modified to deal with unmapped reads. Read pairs with one unmapped read are a tricky one to deal with. Unfortunately, we can't really handle a mix of single and paired end reads so I think we would have to treat these read pairs as both unmapped and mark them accordingly (e.g not assigned to a read group).

I've added this issue to our TODO list for the next release so hopefully we'll resolve most of these points for then.

TomSmithCGAT · 2017-03-09T16:01:33Z

Retaining unmapped reads with the group command is now supported (v0.3.7).

See option --output-unmapped

lparsons · 2018-08-02T21:07:10Z

Sorry to comment on such an old issue, but it's not clear to me from the documentation. Does the group command mark reads (using FLAG (0x400)) as duplicates? That would be helpful for our workflow.

TomSmithCGAT closed this as completed Mar 9, 2017

ctuni mentioned this issue Apr 14, 2021

Problems with dedup output and RSEM #465

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keeping unmapped reads #78

Keeping unmapped reads #78

Poshi commented Feb 2, 2017

TomSmithCGAT commented Feb 2, 2017

IanSudbery commented Feb 2, 2017 via email

Poshi commented Feb 3, 2017

TomSmithCGAT commented Feb 3, 2017

TomSmithCGAT commented Mar 9, 2017

lparsons commented Aug 2, 2018

Keeping unmapped reads #78

Keeping unmapped reads #78

Comments

Poshi commented Feb 2, 2017

TomSmithCGAT commented Feb 2, 2017

IanSudbery commented Feb 2, 2017 via email

Poshi commented Feb 3, 2017

TomSmithCGAT commented Feb 3, 2017

TomSmithCGAT commented Mar 9, 2017

lparsons commented Aug 2, 2018