Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output read groups rather than deduplicated BAM #54

Closed
TomSmithCGAT opened this issue Oct 31, 2016 · 20 comments
Closed

Output read groups rather than deduplicated BAM #54

TomSmithCGAT opened this issue Oct 31, 2016 · 20 comments
Assignees

Comments

@TomSmithCGAT
Copy link
Member

@IanSudbery Just had a chat with Nils Koelling here at the WIMM. He's given some very helpful feedback on documentation which I'll feed into the master branch shortly.

He's using UMIs in a variant calling framework and therefore wants to group reads by their UMI and then perform a downstream step to identify the correct base call for each position in the set of reads to correctly resolve sequencing errors. We spoke about a couple of ways to achieve this:

  1. Add a tag to the BAM with a unique group identifier (1 -> infinity)
  2. Output a csv mapping read identifier to group

Theoretically, these could be achieved within the dedup command but this is getting away from a pure deduplication problem as such. I think the best approach might be to provide a new umi_tools command "group" which uses the same network methods to identify the set of reads with the same UMI or related UMIs and then outputs either a tagged BAM or csv. We'd need to abstract some of the dedup.py code to a module file and reuse it in group.py.

This might also be a good way to address this issue regarding counting reads supporting each UMI: #45

What do you think?

@TomSmithCGAT TomSmithCGAT self-assigned this Oct 31, 2016
@IanSudbery
Copy link
Member

I've been wondering recently why no one does this! I think we probably might want to do this in a new command - otherwise we are going to have to store all of the reads in memory - at the moment we only store the best of each position_splice_umi combo. At high read depths this is a substantial memory saving.

However, the new problem is kind of a deduping problem though? Presumably you only want to group together reads with the same UMI and same position.

I'd be interested to see how this progresses - I've got some project where using duplicates to resolve sequencing errors could be very useful.

@TomSmithCGAT
Copy link
Member Author

Yeah, in this case this is ultimately a de-duplication problem but the intermediate step could be quite bespoke, hence the solution to just return the read groups and allow the user to decide how to use the information.

I'll make a group branch and start work shortly.

@koelling
Copy link

Great, thank you very much Tom! Happy to help out with testing etc!

@TomSmithCGAT
Copy link
Member Author

@koelling Would you mind testing out the new group command (on the group branch). Latest commit here 9579cad. You can run it like so:

umi_tools group -I example.bam --group-out=test.tsv

Currently, it only outputs a tsv mapping read_identiers to read groups according to their position and UMI. Once I've figured out how to add a tag to the reads, the following command should also generate a bam with the read group included as the 'UG' tag.

umi_tools group -I example.bam -S deduplicated.bam --group-out=test.tsv --output-bam

@TomSmithCGAT
Copy link
Member Author

@koelling The group command should now be able to handle adding a read group too a96774d so you can test out outputting both the mapping tsv and BAM.

@TomSmithCGAT
Copy link
Member Author

@koelling Best hold on the testing. Just discovered the output only includes the UMIs which would be outputted with the dedup command, e.g not the related UMIs.

@TomSmithCGAT
Copy link
Member Author

@koelling Hopefully sorted. See c2f6e81

@koelling
Copy link

koelling commented Nov 7, 2016

I gave it a try now and it looks good!

Is it possible that you're only attaching the UG tag to the first read in every read pair though? Is that intentional?

Also, it looks like the "Number of reads out:" output in the log does not actually match the number of "UG". For my test file, it says 28610 there but the highest UG I see is 26387. Weirdly, this also doesn't match the result that I get when I run "dedup" with exactly the same parameters (26482).

The output BAM file also contains fewer reads than the input BAM file - I assume most of this difference is because you're excluding unmapped reads, but there seem to be a few additional reads that disappear. Do you use any other filtering criteria to exclude read pairs?

@TomSmithCGAT
Copy link
Member Author

@koelling. Happy to see it's at least working in principle. You're right about the read tags, that was an oversight on my side. This should be corrected in c2f6e81. I've also updated the counting so that it now reports the counts for "reads in", "reads out" & "groups."

I've checked with a single end bam that "reads in" matches the number of reads in the input BAM, "reads out" matched the number of reads in the output BAM and flatfile and "groups" matches the maximum group number. Running dedup now gives the same "output reads" count as "groups" count we get from running group, which is as it should be.

Would you mind pulling c2f6e81 and then running the group command with your paired end BAM and checking these counts also match up and that the tag is on both read pairs.

Thanks!

@koelling
Copy link

koelling commented Nov 8, 2016

Thanks a lot! The number of groups does indeed match the highest UG tag I see now, assuming these are zero-indexed. It also matches the output from dedup.

However, I'm still getting slightly weird results with the other numbers:

INFO Number of reads in: 339571, Number of reads out: 339671, Number of groups: 26674
...
$ echo $(( $(samtools view -F4 -c input.bam) / 2 ));
339688
$ echo $(( $(samtools view -F4 -c output.bam) / 2 ));
339621

So on one hand the reported number of reads out is actually larger than the reported number of reads in (which itself does not seem to match the number of alignments in the input), and on the other hand it does not seem to match the number of alignments in the actual output file. I suspect this may be something about multi-mapping or unpaired reads?

Regarding the UG tag it looks like this is now on almost all the read pairs, but there are still a few where it's only on one of them. These seem to be cases where one of the mates is multi-mapped and has a low mapq though (6), so maybe that's why?

@TomSmithCGAT
Copy link
Member Author

@koelling. This one took a while!

The issue with the output being higher than the input occurs due to the way in which we resolve the directional network down to it's constituent parts (each of which should is inferred to represent a single UMI). Currently, this allows for the lowly abundant UMIs to be present in more than one network:

Imagine we have the directional network below with UMIs and counts in parenthesis and arrows representing directional edges. To identify the true UMIs, we start with the most abundant UMI (TGTT) and follow the edges to nodes AGTT and AGCT to make our first connected component (TGTT, AGTT, AGCT), from which we assume TGTT is the true UMI. Since this component does not account for the whole network, we proceed to the next most abundant, AGTA and follow the edges along two paths to get the next component (AGTA, ACTA, AGTT, AGCT). Together the two components account for the whole network and we stop looking for connected components in the graph. However, AGTT & AGCT have been included in both components.

                    (TGTT: 60)  
                        ||  
                        \/
(AGTA: 30)   ==>    (AGTT: 3)    ==>   (AGCT: 1)
    ||  
    \/
(ACTA: 8)

This issue was previously raised in discussion with @logust79 but I dismissed it as for de-duplication purposes we are interested only in the most abundant UMI and no UMI can be the most abundant in two components by this method. However, the new group command uses the connected components to derive the unique UMI groups. As the same UMI can be reported to be in multiple groups and it was thus being outputted once for each component in which it was found.

Thanks for checking the counts matched up and reporting the issue. That one could have gone unnoticed for a long time!

This should be have resolved in 1164250

@TomSmithCGAT
Copy link
Member Author

TomSmithCGAT commented Nov 8, 2016

@IanSudbery Can you sanity check 1164250 please. I've modified the breath_first_search function so that nodes which have already been visited are rejected. As I saw it, the other option would be to reform the adj_list after identifying each connected component but this would obviously be very slow for large networks. I actually see a negligible improvement in run-time by including this check in the search function.

@koelling
Copy link

koelling commented Nov 8, 2016

Thanks, this does seem to have fixed the number of alignments that are being returned. However, the input number is still slightly lower than the actual number of alignments in the input file.

To help debug this, I extracted the read IDs from the input and output BAM and looked at a few cases that went missing. It looks like these were all cases where only one of the mates mapped successfully, with the other one being flagged as unmapped in the BAM file (but still having a location assigned). Here is one such pair:

XXX_TCCACTTTCG  121     chr2    101638859       60      53M     =       101638859       0       AGCTGAATCAGGTGTGCAAACTGCCGGGCGTCTATGCGGTACTGCTCAGCAAA   AAGGFC6,,8B<E,9FEFC6@+7@+C@@C66<C,@6,,C,CE<EEGFC<FC,C   NM:i:3  MD:Z:7A21A21T1  AS:i:41 XS:i:0  RG:Z:XXX      pr:Z:XXX  um:Z:TCCACTTTCG ol:i:80 ts:i:27 te:i:82

XXX_TCCACTTTCG  181     chr2    101638859       0       *       =       101638859       0       TATTCTGAGCAGTACGTGATAGAGGTCCGGCAGTTTGCACGCCTGTTTCCGCTAG F9<,E,,C6,C,C,,,6,,C,9,+++++C,,C:,,,8+,+,,C,F@:@7+CE;,F AS:i:0  XS:i:0  RG:Z:XXX      pr:Z:XXX  um:Z:TCCACTTTCG ol:i:80 ts:i:23 te:i:78 MQ:i:60

Are you removing these on purpose? I could see how it would make sense to ignore them (and I probably wouldn't want to consider them anyway) but it would be good to know if this is a bug or not.

I also still get four reads where the UG is missing. Looking at the output BAM these happen to be the last four reads in the entire file (all belonging to the same UMI group), so I suspect this may be a bug related to how the UG tag is being attached?

@TomSmithCGAT
Copy link
Member Author

@koelling Coud you try out the latest commit on the group branch 87c22a0

The logging of reads in/out has been improved and all reads should now have tags.

You have two "types" of unmapped reads in your BAM. Reads without a reference contig and read with a reference contig but unmapped according to their SAM flag. The first group will not be retrieved by pysam.fetch() and so are not included in the counts. This means the total number of input reads is slightly different between samtools view -c input.BAM and the umi_tools logging. The discrepency should be entirely due to these reads without a reference contig.

@peterch405
Copy link
Contributor

@IanSudbery This is commonly done in repertoire sequencing http://dx.doi.org/10.1016/j.it.2015.09.006
#23

@koelling
Copy link

@TomSmithCGAT I just came across a new issue - it seems that umi_tools group sometimes duplicates reads in the output file. For example I have a read pair here where the first mate suddenly appears twice in the grouped BAM file, even though it only appeared once in the input.

Looking at this I can see that the second copy of this read has two UG and FU tags attached that are quite similar ("UG:i:85 FU:Z:CTACAGTCGT UG:i:176 FU:Z:CTACAGGCGG") while the read itself has a UMI with an edit distance of 1 from both - CTACAGTCGG. So I assume what's happening here is that it was ambiguous which UMI group the read belonged to, so you assigned it to both.

I'm not sure what the best solution would be for this but I was definitely not expecting duplicate reads in the BAM output. And of course the UG tags also should not show up twice for a single read. I'm actually surprised that pysam has been able to handle these at all! :)

Maybe we could choose a UG at random (or set it to -1) and then provide a comma-separated list of possible UGs as an additional tag? That way I could handle this properly downstream.

@TomSmithCGAT
Copy link
Member Author

@koelling. Thanks for bringing this to my attention. You're right that this is caused by a read being assigned to multiple groups. I thought I'd resolved this problem but obviously there are still some cases which are not addressed. I'd like to stick to assigning a single group for each read in a logical manner rather than random assignment from multiple choices.

I'll try and resolve this in the next few days.

@TomSmithCGAT TomSmithCGAT mentioned this issue Nov 15, 2016
@koelling
Copy link

Thank you Tom! By the way, another thing that would be great would be if you could include unmapped reads (and invalid read pairs) in the output file, even if you can't assign them to a group. I'm not sure what other people might want, but at least for me the ideal output BAM file would contain exactly the same reads as the input BAM file, just with the UG tags added. But if this is complicated I can work around it.

@TomSmithCGAT
Copy link
Member Author

@koelling This shouldn't be too difficult. We can include an option to retain the unmapped reads.

It looks like we'll need to add the "until_eof" to the pysam.fetch() call to include the unmapped reads in the iterator.

@TomSmithCGAT
Copy link
Member Author

@koelling. I'll try and return to this soon but for now I'd recommend using the directional method which will not suffer from this issue of returning the same read multiple times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants