Phasebam updated #4

pontushojer · 2020-03-18T09:30:10Z

This is an updated version of the PR Tobias posted (#3) in which i have done some edits. Some are rather substantial so i decided to open a new PR.

The basis is the same script but include several edits

In get_phased_snvs_hapcut2_format i have factored out PhaseBlockReader for reading the hapcut2 phaseblock files and also made a class PhasedVariant to handle the variants more easily.
I changed the filters so that currently only SNVs are used for phasing.
translate_positions() have been changed to use pysam's get_aligned_pairs(). I found that in some cases variants at the edges of reads were mislabeled with haplotypes in the PR Phasebam #3 script but it was hard to find the error using the get_blocks()method so I switched to correct the issue.
phase_read() now uses the base quality for scoring haplotypes.
I added an option for how to prioritise which haplotype to adding to a read when you have contradicting information from the barcode (molecule) and read level. Previously this was based on consensus meaning that no reads carrying contradicting haplotype information would be tagged. Now I have changed in to tag based on read information by default, based on that this also the case in whatshap haplotag but it could be something to test further.

I also compared our tagging to the output of whatshap haplotag and found the following:

phasebam uses the assigned molecule index (tag MI) to distribute haplotype over molecules while whatshap has to recalculate the 'read clouds' (or molecules as we refer to them). This means that some of the filtration using filterclusters is no longer effective for the whatshap tool as the filtration keeps the barcode but changes the MI tag for -1. Thus some reads within barcodes that we don't want to tag will still be tagged. One easy way to solve this would be to remove these barcodes entirely in the filterclusters step.
whatshap can handle other variants than SNVs.
whatshap have two mode to identify which haplotype a read belongs to. Possibly we should think about implementing one of them in our script.
- If a reference is supplied then realignment of the read sequence to the reference is performed.
- If no reference is given the cigar string is used to define the variant in the read.
pruned variants are still used by whatshap for tagging.

Comparison of % reads tagged

Tool	%tagged
phasebam no_filter	49.4
phasebam default	48.3
haplotag 30k	47.0
haplotag 50k	48.9

phasebam: "no_filter" means run using --include-pruned --min-switch-error 0 --min-missmatch-error 0. "default" means run using default settings
haplotag: Run using two different values for --linked-read-distance-cutoff "30k" for 30000 (same as in default blr configs) and "50k" for 50000 (haplotag default).

FrickTobias · 2020-03-23T08:36:24Z

Really awesome job, really like the pysam module you found!

Some comments;

I'd be a bit careful regarding mislabeling, where my idea is if we do not know it is correct to label a read in one way one should be careful to do it. But of course it is better if it can prioritize it somehow but then we should probably look into the relevant reads to understand what the effects are, so things like

How many reads are affected?
Manual inspection of some randomly chosen reads.
Do they seem to be allocated to specific kinds of regions?

I think we've run into enough small problems regarding trying not to remove information so maybe it is better to instead remove all reads/information we do not think is relevant. So for me it would be OK to start removing duplicates/tags etc and instead keep a "non-stripped" file for which the later parts of the analysis can be rerun on. What do you think?

With the new pysam module and the VCF file information we should be able to incorporate some phasing from variants without a crazy amount of hassle right?

Although it would be nice to be able to run with a reference, I think it should be lower on the priority list. If you are interested in phasing something I think it would be reasonable to assume most people won't have a reference for that individual already.

I have previously not found any information about 'whatshap haplotag' using molecule information at all and is rather working with only read information, have you found out it does use this information?

FrickTobias · 2020-03-23T08:43:36Z

src/blr/config.schema.yaml

@@ -59,6 +67,10 @@ properties:
    type: string
    description: Which variant caller to use.
    pattern: "gatk|bcftools|freebayes|deepvariant"
+  haplotag_tool:
+    type: string
+    description: Whish haplotype tagging tool to use.


Suggested change

description: Whish haplotype tagging tool to use.

description: Which haplotype tagging tool to use.

pontushojer · 2020-03-26T15:11:12Z

@FrickTobias

I'd be a bit careful regarding mislabeling, where my idea is if we do not know it is correct to label a read in one way one should be careful to do it. But of course it is better if it can prioritize it somehow but then we should probably look into the relevant reads to understand what the effects are, so things like

How many reads are affected?

In the chr22 dataset, for about 7% of the reads having both read and molecule phase info, the phasing info is contradictory.

Manual inspection of some randomly chosen reads.

I looked into some cases where the read and molecule haplotypes were different and they made sense based on how the program currently opperates and with respect to the variants overlapped by the read for those barcodes.

Do they seem to be allocated to specific kinds of regions?

Not really. I binned the discordant read position along with the phased variants and total coverage and they all overlap pretty well (See figure below).

I think it will be hard to find a the best way to label these reads. How do you weigh read and molecule evidence against eachother? �READ priority is what is preformed by whatshap haplotag but I am not sure why. On a read level this makes sense since the variants for that read is clearly matched to a particular haplotype. However the fact that the read has a different haplotype to the molecule is a sign that it either (1) was not assigned to the correct barcode/molecule or (2) we have a switch error for this variant. I am also uncertain of the actual effects this will have downstream i.e. how the haplotype is used for SV calling. Possibly we just have to try the different option out and see how they affect the results. For now however I guess the safest choice is to remove these entirely from the analysis (e.g running �CONSENSUS mode). Also as it affects only ~�7% of reads with both read and mol haplotype, it is a relatively small loss.

I think we've run into enough small problems regarding trying not to remove information so maybe it is better to instead remove all reads/information we do not think is relevant. So for me it would be OK to start removing duplicates/tags etc and instead keep a "non-stripped" file for which the later parts of the analysis can be rerun on. What do you think?

Yeah I think this is probably a good idea. It seams like we keep running into issues like this and then is just simpler to clear out the information altogether.

With the new pysam module and the VCF file information we should be able to incorporate some phasing from variants without a crazy amount of hassle right?

I am not entirely sure what you mean here? Do you mean to use the phased VCF rather than the HAPCUT2 file for the variants?

Although it would be nice to be able to run with a reference, I think it should be lower on the priority list. If you are interested in phasing something I think it would be reasonable to assume most people won't have a reference for that individual already.

Yeah I also think this might be too much work for now.

I have previously not found any information about 'whatshap haplotag' using molecule information at all and is rather working with only read information, have you found out it does use this information?

It does not use molecule information, rather the BX tag and a window of a desired size (much like in buldmolecules).

…ingly):wq :

…BAMs. Modified phasebam so that HP is int and moved 'anomaly_file' for optional arguments.

…rsing.

… haplotag.

- Added missmatch threshold - Use mapping quality to score haplotypes - Added phase quality (tag PC) for read phased reads. - Docs update - Use same 0-based position for hapcut2 entries. - Normalize variants to remove redundant SVs. - Skip reads with unmapped mate - Translate positions in read using get_aligned_pairs() for simplicity. - Added option to select priority when desiding haplotype.

- Don't tag unmapped reads with haplotype. - Sort variants after generation (position can differ between phaseblocks) - Name changes

FrickTobias · 2020-04-15T13:40:08Z

Alright, looks good! Although 7% is quite a lot I'm not sure what else we can do currently, other than removing the information.

What do you think about saving the stripped read names or positions to a separate file or something where we strip the information? Then we could also look into that to see if we can later correlate it to switch error rates, or at least investigate if they seem related. This way we could strengthen our idea about stripping the information.

Otherwise everything looks good!

pontushojer · 2020-04-16T07:25:53Z

@FrickTobias
I am not sure if 7 % percent is that much in this case. Note that these are the reads with both phasing information from the molecule and the read which in turn make up about ~12% of all reads.

Regarding storing the read names, you implemented this "anomaly" file in your script which I have kept. This outputs a TSV file which has the read names of the remove reads. They are marked with mol_hap!=read_hap in the second column of this file. I guess this should be sufficient?

…than phasebam.py. For a test dataset of chr22 with called variants 49.2% were tagged using whatshap while only 35.3% were tagged with phasebam.py. Probably this is due to the updates in whatshap (compare to PR #4). Whatshap haplotag is able to handle phased INDELs which is not the case for phasebam.py.

FrickTobias reviewed Mar 23, 2020

View reviewed changes

FrickTobias closed this Mar 23, 2020

FrickTobias reopened this Mar 23, 2020

FrickTobias and others added 16 commits April 1, 2020 16:48

Adding preliminary phasebam script. Runs, but not correctly yet (seem…

41ff2b7

…ingly):wq :

Adding phasebam to configs, tests etc

7433aa7

Remove debugging code

0bce071

Update skip_read

0f74a52

Bugfix: if, if, else => if, elif, else & clean up unused things

8f30668

Remove upstream variants from reads from phased snv dict

8d641a4

Fix filtering & update summary output

d676338

Use PySAMIO to get adjusted header

41e4747

Integrate phasebam with whatshap haplotag as options for haplotaging …

c471b2d

…BAMs. Modified phasebam so that HP is int and moved 'anomaly_file' for optional arguments.

Factor out reader and position classes for hapcut2 phaseblock file pa…

7d41be5

…rsing.

Small edits.

2669b8e

Minor edits. Changed from phaseblock number tagging to phaseset as in…

3fd2d36

… haplotag.

Add option to discard untagged reads.

b4f8c21

Further edits:

eb850d1

- Don't tag unmapped reads with haplotype. - Sort variants after generation (position can differ between phaseblocks) - Name changes

Changes based on review + updated docs and other small edits.

ca120cd

pontushojer force-pushed the phasebam-pontus branch from 2a904e2 to ca120cd Compare April 1, 2020 15:11

Fix flake8 error and reformat arguments.

7f66412

pontushojer merged commit e2149d9 into master Apr 22, 2020

pontushojer deleted the phasebam-pontus branch April 22, 2020 12:20

pontushojer mentioned this pull request Jul 6, 2020

Phase all reads into phase blocks FrickTobias/BLR#130

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phasebam updated #4

Phasebam updated #4

pontushojer commented Mar 18, 2020

FrickTobias commented Mar 23, 2020 •

edited

Loading

FrickTobias Mar 23, 2020

pontushojer commented Mar 26, 2020

FrickTobias commented Apr 15, 2020

pontushojer commented Apr 16, 2020 •

edited

Loading

	description: Whish haplotype tagging tool to use.
	description: Which haplotype tagging tool to use.

Phasebam updated #4

Phasebam updated #4

Conversation

pontushojer commented Mar 18, 2020

FrickTobias commented Mar 23, 2020 • edited Loading

FrickTobias Mar 23, 2020

Choose a reason for hiding this comment

pontushojer commented Mar 26, 2020

FrickTobias commented Apr 15, 2020

pontushojer commented Apr 16, 2020 • edited Loading

FrickTobias commented Mar 23, 2020 •

edited

Loading

pontushojer commented Apr 16, 2020 •

edited

Loading