Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Also report position of 2nd read in "group --group-out" table #180

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

fgp
Copy link

@fgp fgp commented Sep 8, 2017

This extends the "group" tool's "--group-out" output to report the position of the 2nd read (in a separate column called "position2") of proper mate pairs. For improper pairs, only the first read's position is reported, and the 2nd read is ignored (same as now).

Implementation-wise, the main works happens in bundle_iterator, which now collects 2nd reads in a separate list parallel to "read" called "read2". For now, only all_reads=True is handled correctly -- for all_reads=False we must decide whether to always keep the best pair, or whether to keep the best (in some sense) 1st and 2nd read separately (i.e. the kept reads may stem from different pairs). To recover pairs of reads from the (position-sorted) input file, 2nd reads are stashed away until we reach the corresponding 1st read. Conversely, if the 1st is found first, it's processing is delayed until the 2nd read is reached. To avoid prematurely returning a bundle containing a position for which 1st reads are still deferred, the number of deferred reads for each position is tracked and the returning these position (plus any later ones) is deferred as well.

This happens only for reads that are part of proper pairs -- improper pairs (as flagged by the mapper) are treated same as now, i.e. only the first read is processed. The code assumes that for pairs flagged as proper, the two mates map to the same chromosome. If this assumption is violated, a warning is emitted.

Since the "group" code now has both reads at its disposal, I also changed the way the "--paired" mode works regarding the insert size. So far, the code used the tlen reported in the BAM file, which may change due to slight differences in mapping, and thus prevent UMIs from being grouped. Instead of tlen, the code now uses the position of read2 as computed by get_read_position (i.e. the position adjusted for clipped bases) as an additional key in --paired mode. This makes the grouping less sensitive to slight differences in mapping, and for my samples produces better results than the tlen-based approach. Similarly, read length and whether the read looks like its extending to the next exon is now taken into account for read2 is well (if the corresponding options are enabled). And finally, the UG tag is now added to read2 as well (since that was now easy to do, and seemed sensible to me).

So far, I've only done minimal tests -- it basically processes one of my samples successfully, and seems to produce the same results as my old pipeline. If this is deemed to be an acceptable approach, I'll fix the remaining issues and do some more tests.

best,
Florian

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Sep 11, 2017

Hi @fgp - I've had a look through your pull request and this seems to be a sensible approach. However, I think this will run into issues when the read pairs have multiple alignments since one could see the same read1 before observing the read2, or vis versa. We can of course check for this and store the multiple read1s/read2s but I think ultimately we would want to introduce a flag to allow the user to turn off the behaviour you introduce above when they have multi-mapped reads - reverting to outputting all reads2 as they are observed. I had a few comments for review but before I add them, I wonder if it might be worth considering another approach as I'm slightly concerned that the code for get_bundles.call() is getting a bit unwieldy, and will only get more so if we make this behaviour optional as I suggest.

Perhaps a cleaner solution in terms of maintaining the code might be to stick with the current approach of only tagging the read1s but introduce a --tag-read2 option (with documentation to say this isn't supported for reads with multiple alignments). This option would trigger re-parsing the BAM and adding information to the read2s. The position of read 2 would also be identified and added to the group flatfile. This would make the process of tagging the read2s easier to understand since it would be separated from the process of identifying read groups. We could of course still implement all the same checks for read pair validity which you have included. There would be a slight increase in run time but this would only be very minor.

If you'd like I can make a branch with this approach implemented and you could check it out with your input files.

@IanSudbery - Any thoughts on the relative merits of handling read2s prior and post UMI grouping?

@fgp
Copy link
Author

fgp commented Sep 11, 2017

I do understand your concern about the complexity of get_bundles.call(), but handling read2 in a second pass after determining UMI groups has the (severe, IMHO) disadvantage that the information from read2 can no longer be used to influence the grouping. As it stands, the patch uses the second read's mapping position as a group key, same as the first read's position. At least for my samples, this yields a much better grouping than the unpatches version in --paired mode which uses only the first read's position plus the fragment length as reported in the BAM file (tlen).

The reason seems to be that the mapping position as returned by get_read_position (i.e. adjusted for clipped bases) is quite resilient to slight differences in mapping between reads of the same UMI (which I assume is why that adjustment is done in the first place). tlen, on the other hand, can easily change by a few bases if read2 is mapped slightly differently (i.e. fewer or more bases are clipped).

I do agree that get_bundles.call() is unwieldy, but I believe that there are better ways to fix that. I could try to move the pair-building code from bundle_iterator to a separate iterator (say, pair_iterator). bundle_iterator would then receive pairs of reads, and the only additional code it would require is the actual handling of read2 (which isn't much).

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Sep 11, 2017

Hmmm.. yeah I hadn't really considered this. So there is a major advantage to considering the read2s in a single pass. OK, how about we replace lines 1299-1391 of get_bundle.__call__() with a call to self.check_unmapped_paired() with the code as per the master branch. Then we can create a new child class from bundle_iterator called, e.g pair_iterator with a check_unmapped_paired method which includes the functionality you've added here. In group we can decide which iterator to use (suggest default to bundle_iterator with option --parse-read2 to switch to pair_iterator).

What do you think? If you're happy with this, do you want to make the changes and I'll review the pull request fully.

@fgp
Copy link
Author

fgp commented Sep 12, 2017

Ok, I moved the pair-building code out of get_bundles. There is now a separate functor called get_readpairs that does all the legwork of forming pairs and complaining about inconsistent BAM files.

I've split the changes up onto smaller units now.

The first two commits on the group_endpos branch now fix what looks like oversights in the existing code to me -- one is a redundant check for read.is_mapped, the other seems to count unmapped reads twice in some occasions.

The next four introduce get_readpairs and its trivial counter-part get_singletons for single-end mode, extend get_bundles to use them, and extend the group tool to output position2.

Finally, get_bundles is taught to use the information from read2 to improve the grouping and to drop improperly paired reads.

I still need to add a test case for --paired mode and probably update the documentation, but I hope the code is close to being in a merge-able state now.

@TomSmithCGAT
Copy link
Member

Great. I'll review this tomorrow. I think @IanSudbery would also like to look this over too.

@fgp fgp force-pushed the group_endpos branch 2 times, most recently from 359fed0 to 85ec391 Compare September 12, 2017 21:53
@fgp
Copy link
Author

fgp commented Sep 12, 2017

I've now added two tests, and updated the "dedup" tool (and ReadDeduplicator) to use the pair support of get_bundles instead of TwoPassPairWriter.

Currently, the dedup_single_stats test seems to fail, I'll look into that...

@fgp fgp force-pushed the group_endpos branch 2 times, most recently from 88d9c9a to 56d8c07 Compare September 13, 2017 06:47
@fgp
Copy link
Author

fgp commented Sep 13, 2017

I've fixed the test failures and style violations. On my machine (using python 2.7) test_umi_tools.py and test_style.py now both succeed. We'll see if the CI build succeeds now as well.

From my end, this is ready for review. Documentation still needs to be updated probably, not sure where to explain the new behaviour though.

Copy link
Member

@IanSudbery IanSudbery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the worries I have highlighted in the review, I am also worried about the memory and run-time implications of this. We used to have an implementation that used a queue to output read pairs, but it was replaced with the post process resweep of the chr because the former was causing an explosion in memory usage and run times in the days rather than the minutes range.

I'm going to need to some substantial benchmarking before we'd be happy to merge.

I do have an alternative idea for how we might get around the problem of using the proper position of the read2 rather than the tlen, but I need to think about it a little....

if get_readpairs.NoMate in queue_entry:
# Incomplete entry. Turn into singleton entry
read = next(r for r in queue_entry if r is not get_readpairs.NoMate)
U.warn("inconsistent BAM file: only one of the two mates of proper pair %s found on chromosome %s, treating as unpaired" %
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this is probably going to be very common for people that are mapping to the transcriptome, where each transcript is a different contig.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but in this case does --paired mode make any sense at all?

# Read is part of an already queued (incomplete) read pair. Complete it.
queue_entry = self.incomplete_queue_entries.pop(read.qname)
if queue_entry[read_i] is not get_readpairs.NoMate:
U.warn("inconsistent BAM file: both mates %s flagged to be read%d" %
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is is going to be a problem with multi mappers. Many people allow multimappers, and they are unavoidable when mapping to the transcriptome.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but again I don't really see how multi-mapping and paired reads should work together at all. The fundamental problem seems to be that there are n*n possible pairing if both read1 and read2 are mapped to n locations. And the BAM format at best may tell us which the "primary" pair is, but not how the secondary mapping should be paired. Maybe I'm missing the point, though -- I haven't really worked with multi-mapped data so far...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I could do the same that TwoPassPairWriter does, and use reference name and mapping position as additional keys to identify mate pairs. Would that alleviate these concerns?

self.incomplete_queue_entries[read.qname] = queue_entry

# Output queued reads and readpairs up to the first incomplete pair
while self.queue and sum(1 for r in self.queue[0] if r is get_readpairs.NoMate) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is definately going to become a mess if you have say to alignments for read1 followed by an alignment for read2. The two alignments of read one will be treated as a read one and a read two, and output as a pair, then read 2 will be encountered and a new pair will be created.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yeah, multi-mapping breaks this. Again, I'm inclined to simply say "multi-mapping" and --paired are contradictory, but again I might be missing something...


for read in inreads:
for read, read2 in generator(inreads):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are waiting for pairs to be output, can we any longer assume that reads will be output in the same order that they are in in the file. Specifically, if a mate is not found, it will not be output until the end? But in order to collect all reads that map to a location, we assume that they are sorted by pos.

Copy link
Author

@fgp fgp Sep 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I had done that correctly, but I realize now there is be a bug in the current version. The idea was to keep the order "intact enough" for get_bundle's purposes. The original version (where collecting pairs was handled within get_bundles) was fine, but it seems I have introduced a bug when factoring get_readpairs out.

What get_bundle needs is basically that read pairs are ordered according to read1. Currently, that is the case if read1 occurs first in the file. But if read2 occurs first, the pair will be output at the correct position for read2, not read1. (In my old pipeline that was indeed the correct thing to do, because I used the leftmost mapped position of a pair for deduplication.)

This is easily fixed AFAICS by enqueueing a pair not immediately, but only after read1 (even if that's the second read of the pair in the BAM file) has been observed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now fixed, see commits below.

@fgp
Copy link
Author

fgp commented Sep 13, 2017

Regarding the performance -- I can certainly do some benchmarking.

Also, my existing pipeline uses the same kind of queuing to process pairs together, and I haven't run into any problems with memory consumption or runtime. In essence, the memory consumption is bound by the BAM file's maximum coverage, if you call a base "covered" that lies within a sequenced fragment (but wasn't necessarily sequenced itself). Even if that number reaches a million (which amounts to extremely deep sequencing), and each read takes 1k to store, a gigabyte of memory would still suffice. Runtime-wise, the queuing should be O(1) in the number of reads.

@fgp
Copy link
Author

fgp commented Oct 2, 2017

I've changed get_readpairs to report pairs in ascending mapping position of read1 -- this should be sufficient to not break the clustering algorithm

get_readpairs now also uses query name plus the mapping position of read1 as the "pair identification key" -- this should work better for multi-mappings. I can't really test this, though -- I don't have any multi-mapped paired-end data to test with

Apart from the general worries about performance, this should address the raised concerns I think. I'm currently doing some benchmarking (with pretty much worst-case files - a short bacterial genome, but sequenced extremely deeply, so the number of "pending" pairs is probably in the many thousands at any time) -- I'll post the results soonish

@ijhoskins
Copy link

Would love to test the new paired-end position/template length adjustment for clipped bases. Any estimate on when this PR will merge?

In --paired mode, get_bundles now uses get_readpairs to process the first
and second read of a pair together. The 2nd reads are now stored in the bundle
as well, in a list parallel to "read" called "read2". This currently works
only for all_reads=True.

So far, the second read is only stored. The next commits will introduce
changes that use the information from the second read to improve the grouping.
…d mate

The output contains a new column "position2" which contains the position of
the 2nd read, computed in the same way as the position of read1 (i.e, the
position is the position of the 1st base in sequencing (not reference!)
direction, and is adjusted for possible clipped bases.

This uses the previously introduced support for collecting read pairs in
get_bundles.
…assPairWriter

This includes changes to ReadDeduplicator, which now uses the 2nd reads
reported by get_bundles, and now returns a list of read pairs instead of
a list of plain reads.
… read1

Previously, they were reported in ascending mapping position of the left
(in reference direction) read, which for get_bundle's purposes is incorrect.

The test outputs are updated accordingly
…r's key in get_readpairs

This should make get_readpairs work also for multi-mapped data, as long as mates aren't
re-used in multiple (multi-mapped) pairs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants