New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Keeping unmapped reads #78
Comments
Hi @Poshi. Don't worry about opening up lots of issues. We're keen to make sure UMI-tools meets everyone's needs where possible so it's always good to get feedback about any functionality requirements. The issues regarding documentation and errors in the output are also obviously very helpful! Do you mind if I ask how you're using UMI-tools? I.e academic, industry, as part of another software product, etc? The group command will mark the duplicates in the sense that each read will be assigned to a discrete read group and this read group can then be used to decide which reads to retain. The main use case of the group command would be where all the reads are required for the downstream analysis, e.g error correction when calling low frequency mutations. This was the motivating use case for introducing the group command (#54) If you wanted to mark the duplicates but return all reads, I think the best option would be to extend the group command so that the reads are returned with an additional flag marking the reads as 'primary' or 'duplicate', such that removal of all secondary reads would give the same output as the dedup command. @IanSudbery - what do you think? You make a very good point about the unmapped reads. Until recently, I'd always seen UMI-tools being used downstream of the initial alignment, hence the loss of unmapped reads would not be an issue. However, retaining unmapped reads would allow the BAM output of UMI-tools group to become the 'raw' data as it would be possible to convert back to the original fastq file as you say. This would of course necessitate adding functionality to extract to store the UMI base calling qualities too. Are you aware of any tools to convert BAM to fastq when the BAM contains a UMI information? If not, perhaps that's something to add to UMI-tools too in the longer term. @IanSudbery - I think all we need to do to output unmapped reads for this is to add 'until_eof=True' to the pysam.fetch() call and modify the control flow within get_bundles (outputting unmapped reads should be optional of course, but perhaps default=true?) |
Two things we need to think about:
What happens if we encounter an unmapped read in the middle of a load of
mapped reads? Do we stop and output it?
What happens if read1 is unmapped, but read 2 is mapped? How do we handle
deduping? What happens if an unmapped read is fed to the TwoPassWriter?
(At the moment we dedup on read1 mapping position and insert size).
|
@TomSmithCGAT Regarding the marking of duplicates, I'm not sure if you are talking about adding tags with the information or modyfying the FLAG field of the sam file. You also have a point where most people, when doing the analysis, will just discard some not useful reads. But we do the sequencing and analysis for third parties and we have to give out the whole data got from the sequencer. This is why we want to keep everything in the BAM file. We want to avoid the need to keep the original FastQ file and the final bam. So far, I'm not aware of any UMI tool that goes from BAM to FastQ and keeps UMI information. But this is not difficult to implement with a couple of awks. The information is there. @IanSudbery My five cents on how to deal with unmapped reads. They should not be a problem. They cannot be mapped, so just ignore them, output them as is and continue with the next. It won't be a duplicate because it is not mapped. For the situation where one end is mapped but the other end is not, perform the operation as if it were single end, but keeping the other end close in the output file and unmapped. Maybe this situation is kind of tricky for the code. |
@Poshi Yes this is exactly what I was thinking. The dedup command selects the best read according to MAPQ, number of mapping positions etc. We would use the same process to identify the 'primary' read and mark all other reads within the group as 'duplicate'. We'll have to have a discussion internally about how to deal with quality scores. Ian's point was more in relation to the internal workings of UMI-tools which will need to be modified to deal with unmapped reads. Read pairs with one unmapped read are a tricky one to deal with. Unfortunately, we can't really handle a mix of single and paired end reads so I think we would have to treat these read pairs as both unmapped and mark them accordingly (e.g not assigned to a read group). I've added this issue to our TODO list for the next release so hopefully we'll resolve most of these points for then. |
Retaining unmapped reads with the group command is now supported (v0.3.7). See option |
Sorry to comment on such an old issue, but it's not clear to me from the documentation. Does the |
I'm sorry I'm opening so many issues these days. This is a functionality request.
In our pipelines, we always keep all the input data, so we can re-generate the source FastQ file from the final BAM (go back to the original sequencer result file from the heavily processed final bam).
For this to work, we cannot discard any read, and for this we have to mark duplicates, and not remove them, so we will be mostly using group instead of dedup, and generating a BAM file with the proper tags and flags updated.
If I'm not wrong, the group command does not mark the duplicates, only gives you an overview of the results. And the dedup command, if I understood correctly, kills the duplicate reads, but does not mark them.
We will need two things:
The text was updated successfully, but these errors were encountered: