New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Output read groups rather than deduplicated BAM #54
Comments
I've been wondering recently why no one does this! I think we probably might want to do this in a new command - otherwise we are going to have to store all of the reads in memory - at the moment we only store the best of each position_splice_umi combo. At high read depths this is a substantial memory saving. However, the new problem is kind of a deduping problem though? Presumably you only want to group together reads with the same UMI and same position. I'd be interested to see how this progresses - I've got some project where using duplicates to resolve sequencing errors could be very useful. |
Yeah, in this case this is ultimately a de-duplication problem but the intermediate step could be quite bespoke, hence the solution to just return the read groups and allow the user to decide how to use the information. I'll make a group branch and start work shortly. |
Great, thank you very much Tom! Happy to help out with testing etc! |
@koelling Would you mind testing out the new group command (on the group branch). Latest commit here 9579cad. You can run it like so:
Currently, it only outputs a tsv mapping read_identiers to read groups according to their position and UMI. Once I've figured out how to add a tag to the reads, the following command should also generate a bam with the read group included as the 'UG' tag.
|
@koelling Best hold on the testing. Just discovered the output only includes the UMIs which would be outputted with the dedup command, e.g not the related UMIs. |
I gave it a try now and it looks good! Is it possible that you're only attaching the UG tag to the first read in every read pair though? Is that intentional? Also, it looks like the "Number of reads out:" output in the log does not actually match the number of "UG". For my test file, it says 28610 there but the highest UG I see is 26387. Weirdly, this also doesn't match the result that I get when I run "dedup" with exactly the same parameters (26482). The output BAM file also contains fewer reads than the input BAM file - I assume most of this difference is because you're excluding unmapped reads, but there seem to be a few additional reads that disappear. Do you use any other filtering criteria to exclude read pairs? |
@koelling. Happy to see it's at least working in principle. You're right about the read tags, that was an oversight on my side. This should be corrected in c2f6e81. I've also updated the counting so that it now reports the counts for "reads in", "reads out" & "groups." I've checked with a single end bam that "reads in" matches the number of reads in the input BAM, "reads out" matched the number of reads in the output BAM and flatfile and "groups" matches the maximum group number. Running dedup now gives the same "output reads" count as "groups" count we get from running group, which is as it should be. Would you mind pulling c2f6e81 and then running the group command with your paired end BAM and checking these counts also match up and that the tag is on both read pairs. Thanks! |
Thanks a lot! The number of groups does indeed match the highest UG tag I see now, assuming these are zero-indexed. It also matches the output from However, I'm still getting slightly weird results with the other numbers:
So on one hand the reported number of reads out is actually larger than the reported number of reads in (which itself does not seem to match the number of alignments in the input), and on the other hand it does not seem to match the number of alignments in the actual output file. I suspect this may be something about multi-mapping or unpaired reads? Regarding the UG tag it looks like this is now on almost all the read pairs, but there are still a few where it's only on one of them. These seem to be cases where one of the mates is multi-mapped and has a low mapq though (6), so maybe that's why? |
@koelling. This one took a while! The issue with the output being higher than the input occurs due to the way in which we resolve the directional network down to it's constituent parts (each of which should is inferred to represent a single UMI). Currently, this allows for the lowly abundant UMIs to be present in more than one network: Imagine we have the directional network below with UMIs and counts in parenthesis and arrows representing directional edges. To identify the true UMIs, we start with the most abundant UMI (TGTT) and follow the edges to nodes AGTT and AGCT to make our first connected component (TGTT, AGTT, AGCT), from which we assume TGTT is the true UMI. Since this component does not account for the whole network, we proceed to the next most abundant, AGTA and follow the edges along two paths to get the next component (AGTA, ACTA, AGTT, AGCT). Together the two components account for the whole network and we stop looking for connected components in the graph. However, AGTT & AGCT have been included in both components.
This issue was previously raised in discussion with @logust79 but I dismissed it as for de-duplication purposes we are interested only in the most abundant UMI and no UMI can be the most abundant in two components by this method. However, the new group command uses the connected components to derive the unique UMI groups. As the same UMI can be reported to be in multiple groups and it was thus being outputted once for each component in which it was found. Thanks for checking the counts matched up and reporting the issue. That one could have gone unnoticed for a long time! This should be have resolved in 1164250 |
@IanSudbery Can you sanity check 1164250 please. I've modified the breath_first_search function so that nodes which have already been visited are rejected. As I saw it, the other option would be to reform the adj_list after identifying each connected component but this would obviously be very slow for large networks. I actually see a negligible improvement in run-time by including this check in the search function. |
Thanks, this does seem to have fixed the number of alignments that are being returned. However, the input number is still slightly lower than the actual number of alignments in the input file. To help debug this, I extracted the read IDs from the input and output BAM and looked at a few cases that went missing. It looks like these were all cases where only one of the mates mapped successfully, with the other one being flagged as unmapped in the BAM file (but still having a location assigned). Here is one such pair:
Are you removing these on purpose? I could see how it would make sense to ignore them (and I probably wouldn't want to consider them anyway) but it would be good to know if this is a bug or not. I also still get four reads where the UG is missing. Looking at the output BAM these happen to be the last four reads in the entire file (all belonging to the same UMI group), so I suspect this may be a bug related to how the UG tag is being attached? |
@koelling Coud you try out the latest commit on the group branch 87c22a0 The logging of reads in/out has been improved and all reads should now have tags. You have two "types" of unmapped reads in your BAM. Reads without a reference contig and read with a reference contig but unmapped according to their SAM flag. The first group will not be retrieved by pysam.fetch() and so are not included in the counts. This means the total number of input reads is slightly different between |
@IanSudbery This is commonly done in repertoire sequencing http://dx.doi.org/10.1016/j.it.2015.09.006 |
@TomSmithCGAT I just came across a new issue - it seems that Looking at this I can see that the second copy of this read has two UG and FU tags attached that are quite similar ("UG:i:85 FU:Z:CTACAGTCGT UG:i:176 FU:Z:CTACAGGCGG") while the read itself has a UMI with an edit distance of 1 from both - CTACAGTCGG. So I assume what's happening here is that it was ambiguous which UMI group the read belonged to, so you assigned it to both. I'm not sure what the best solution would be for this but I was definitely not expecting duplicate reads in the BAM output. And of course the UG tags also should not show up twice for a single read. I'm actually surprised that pysam has been able to handle these at all! :) Maybe we could choose a UG at random (or set it to -1) and then provide a comma-separated list of possible UGs as an additional tag? That way I could handle this properly downstream. |
@koelling. Thanks for bringing this to my attention. You're right that this is caused by a read being assigned to multiple groups. I thought I'd resolved this problem but obviously there are still some cases which are not addressed. I'd like to stick to assigning a single group for each read in a logical manner rather than random assignment from multiple choices. I'll try and resolve this in the next few days. |
Thank you Tom! By the way, another thing that would be great would be if you could include unmapped reads (and invalid read pairs) in the output file, even if you can't assign them to a group. I'm not sure what other people might want, but at least for me the ideal output BAM file would contain exactly the same reads as the input BAM file, just with the UG tags added. But if this is complicated I can work around it. |
@koelling This shouldn't be too difficult. We can include an option to retain the unmapped reads. It looks like we'll need to add the "until_eof" to the pysam.fetch() call to include the unmapped reads in the iterator. |
@koelling. I'll try and return to this soon but for now I'd recommend using the directional method which will not suffer from this issue of returning the same read multiple times. |
@IanSudbery Just had a chat with Nils Koelling here at the WIMM. He's given some very helpful feedback on documentation which I'll feed into the master branch shortly.
He's using UMIs in a variant calling framework and therefore wants to group reads by their UMI and then perform a downstream step to identify the correct base call for each position in the set of reads to correctly resolve sequencing errors. We spoke about a couple of ways to achieve this:
Theoretically, these could be achieved within the dedup command but this is getting away from a pure deduplication problem as such. I think the best approach might be to provide a new umi_tools command "group" which uses the same network methods to identify the set of reads with the same UMI or related UMIs and then outputs either a tagged BAM or csv. We'd need to abstract some of the dedup.py code to a module file and reuse it in group.py.
This might also be a good way to address this issue regarding counting reads supporting each UMI: #45
What do you think?
The text was updated successfully, but these errors were encountered: