Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phasebam updated #4

Merged
merged 17 commits into from
Apr 22, 2020
Merged

Phasebam updated #4

merged 17 commits into from
Apr 22, 2020

Conversation

pontushojer
Copy link
Collaborator

This is an updated version of the PR Tobias posted (#3) in which i have done some edits. Some are rather substantial so i decided to open a new PR.

The basis is the same script but include several edits

  • In get_phased_snvs_hapcut2_format i have factored out PhaseBlockReader for reading the hapcut2 phaseblock files and also made a class PhasedVariant to handle the variants more easily.
  • I changed the filters so that currently only SNVs are used for phasing.
  • translate_positions() have been changed to use pysam's get_aligned_pairs(). I found that in some cases variants at the edges of reads were mislabeled with haplotypes in the PR Phasebam #3 script but it was hard to find the error using the get_blocks()method so I switched to correct the issue.
  • phase_read() now uses the base quality for scoring haplotypes.
  • I added an option for how to prioritise which haplotype to adding to a read when you have contradicting information from the barcode (molecule) and read level. Previously this was based on consensus meaning that no reads carrying contradicting haplotype information would be tagged. Now I have changed in to tag based on read information by default, based on that this also the case in whatshap haplotag but it could be something to test further.

I also compared our tagging to the output of whatshap haplotag and found the following:

  • phasebam uses the assigned molecule index (tag MI) to distribute haplotype over molecules while whatshap has to recalculate the 'read clouds' (or molecules as we refer to them). This means that some of the filtration using filterclusters is no longer effective for the whatshap tool as the filtration keeps the barcode but changes the MI tag for -1. Thus some reads within barcodes that we don't want to tag will still be tagged. One easy way to solve this would be to remove these barcodes entirely in the filterclusters step.
  • whatshap can handle other variants than SNVs.
  • whatshap have two mode to identify which haplotype a read belongs to. Possibly we should think about implementing one of them in our script.
    • If a reference is supplied then realignment of the read sequence to the reference is performed.
    • If no reference is given the cigar string is used to define the variant in the read.
  • pruned variants are still used by whatshap for tagging.

Comparison of % reads tagged

Tool %tagged
phasebam no_filter 49.4
phasebam default 48.3
haplotag 30k 47.0
haplotag 50k 48.9

phasebam: "no_filter" means run using --include-pruned --min-switch-error 0 --min-missmatch-error 0. "default" means run using default settings
haplotag: Run using two different values for --linked-read-distance-cutoff "30k" for 30000 (same as in default blr configs) and "50k" for 50000 (haplotag default).

@FrickTobias
Copy link
Collaborator

FrickTobias commented Mar 23, 2020

Really awesome job, really like the pysam module you found!

Some comments;

I'd be a bit careful regarding mislabeling, where my idea is if we do not know it is correct to label a read in one way one should be careful to do it. But of course it is better if it can prioritize it somehow but then we should probably look into the relevant reads to understand what the effects are, so things like

  • How many reads are affected?
  • Manual inspection of some randomly chosen reads.
  • Do they seem to be allocated to specific kinds of regions?

I think we've run into enough small problems regarding trying not to remove information so maybe it is better to instead remove all reads/information we do not think is relevant. So for me it would be OK to start removing duplicates/tags etc and instead keep a "non-stripped" file for which the later parts of the analysis can be rerun on. What do you think?

With the new pysam module and the VCF file information we should be able to incorporate some phasing from variants without a crazy amount of hassle right?

Although it would be nice to be able to run with a reference, I think it should be lower on the priority list. If you are interested in phasing something I think it would be reasonable to assume most people won't have a reference for that individual already.

I have previously not found any information about 'whatshap haplotag' using molecule information at all and is rather working with only read information, have you found out it does use this information?

@@ -59,6 +67,10 @@ properties:
type: string
description: Which variant caller to use.
pattern: "gatk|bcftools|freebayes|deepvariant"
haplotag_tool:
type: string
description: Whish haplotype tagging tool to use.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description: Whish haplotype tagging tool to use.
description: Which haplotype tagging tool to use.

@FrickTobias FrickTobias reopened this Mar 23, 2020
@pontushojer
Copy link
Collaborator Author

@FrickTobias

I'd be a bit careful regarding mislabeling, where my idea is if we do not know it is correct to label a read in one way one should be careful to do it. But of course it is better if it can prioritize it somehow but then we should probably look into the relevant reads to understand what the effects are, so things like

  • How many reads are affected?

In the chr22 dataset, for about 7% of the reads having both read and molecule phase info, the phasing info is contradictory.

  • Manual inspection of some randomly chosen reads.

I looked into some cases where the read and molecule haplotypes were different and they made sense based on how the program currently opperates and with respect to the variants overlapped by the read for those barcodes.

  • Do they seem to be allocated to specific kinds of regions?

Not really. I binned the discordant read position along with the phased variants and total coverage and they all overlap pretty well (See figure below).

image

I think it will be hard to find a the best way to label these reads. How do you weigh read and molecule evidence against eachother? �READ priority is what is preformed by whatshap haplotag but I am not sure why. On a read level this makes sense since the variants for that read is clearly matched to a particular haplotype. However the fact that the read has a different haplotype to the molecule is a sign that it either (1) was not assigned to the correct barcode/molecule or (2) we have a switch error for this variant. I am also uncertain of the actual effects this will have downstream i.e. how the haplotype is used for SV calling. Possibly we just have to try the different option out and see how they affect the results. For now however I guess the safest choice is to remove these entirely from the analysis (e.g running �CONSENSUS mode). Also as it affects only ~�7% of reads with both read and mol haplotype, it is a relatively small loss.

I think we've run into enough small problems regarding trying not to remove information so maybe it is better to instead remove all reads/information we do not think is relevant. So for me it would be OK to start removing duplicates/tags etc and instead keep a "non-stripped" file for which the later parts of the analysis can be rerun on. What do you think?

Yeah I think this is probably a good idea. It seams like we keep running into issues like this and then is just simpler to clear out the information altogether.

With the new pysam module and the VCF file information we should be able to incorporate some phasing from variants without a crazy amount of hassle right?

I am not entirely sure what you mean here? Do you mean to use the phased VCF rather than the HAPCUT2 file for the variants?

Although it would be nice to be able to run with a reference, I think it should be lower on the priority list. If you are interested in phasing something I think it would be reasonable to assume most people won't have a reference for that individual already.

Yeah I also think this might be too much work for now.

I have previously not found any information about 'whatshap haplotag' using molecule information at all and is rather working with only read information, have you found out it does use this information?

It does not use molecule information, rather the BX tag and a window of a desired size (much like in buldmolecules).

FrickTobias and others added 16 commits April 1, 2020 16:48
…BAMs. Modified phasebam so that HP is int and moved 'anomaly_file' for optional arguments.
- Added missmatch threshold
- Use mapping quality to score haplotypes
- Added phase quality (tag PC) for read phased reads.
- Docs update
- Use same 0-based position for hapcut2 entries.
- Normalize variants to remove redundant SVs.
- Skip reads with unmapped mate
- Translate positions in read using get_aligned_pairs() for simplicity.
- Added option to select priority when desiding haplotype.
- Don't tag unmapped reads with haplotype.
- Sort variants after generation (position can differ between phaseblocks)
- Name changes
@FrickTobias
Copy link
Collaborator

Alright, looks good! Although 7% is quite a lot I'm not sure what else we can do currently, other than removing the information.

What do you think about saving the stripped read names or positions to a separate file or something where we strip the information? Then we could also look into that to see if we can later correlate it to switch error rates, or at least investigate if they seem related. This way we could strengthen our idea about stripping the information.

Otherwise everything looks good!

@pontushojer
Copy link
Collaborator Author

pontushojer commented Apr 16, 2020

@FrickTobias
I am not sure if 7 % percent is that much in this case. Note that these are the reads with both phasing information from the molecule and the read which in turn make up about ~12% of all reads.

Regarding storing the read names, you implemented this "anomaly" file in your script which I have kept. This outputs a TSV file which has the read names of the remove reads. They are marked with mol_hap!=read_hap in the second column of this file. I guess this should be sufficient?

@pontushojer pontushojer merged commit e2149d9 into master Apr 22, 2020
@pontushojer pontushojer deleted the phasebam-pontus branch April 22, 2020 12:20
pontushojer added a commit that referenced this pull request Aug 27, 2020
…than phasebam.py. For a test dataset of chr22 with called variants 49.2% were tagged using whatshap while only 35.3% were tagged with phasebam.py. Probably this is due to the updates in whatshap (compare to PR #4). Whatshap haplotag is able to handle phased INDELs which is not the case for phasebam.py.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants