-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Phasebam updated #4
Conversation
Really awesome job, really like the pysam module you found! Some comments; I'd be a bit careful regarding mislabeling, where my idea is if we do not know it is correct to label a read in one way one should be careful to do it. But of course it is better if it can prioritize it somehow but then we should probably look into the relevant reads to understand what the effects are, so things like
I think we've run into enough small problems regarding trying not to remove information so maybe it is better to instead remove all reads/information we do not think is relevant. So for me it would be OK to start removing duplicates/tags etc and instead keep a "non-stripped" file for which the later parts of the analysis can be rerun on. What do you think? With the new pysam module and the VCF file information we should be able to incorporate some phasing from variants without a crazy amount of hassle right? Although it would be nice to be able to run with a reference, I think it should be lower on the priority list. If you are interested in phasing something I think it would be reasonable to assume most people won't have a reference for that individual already. I have previously not found any information about 'whatshap haplotag' using molecule information at all and is rather working with only read information, have you found out it does use this information? |
src/blr/config.schema.yaml
Outdated
@@ -59,6 +67,10 @@ properties: | |||
type: string | |||
description: Which variant caller to use. | |||
pattern: "gatk|bcftools|freebayes|deepvariant" | |||
haplotag_tool: | |||
type: string | |||
description: Whish haplotype tagging tool to use. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description: Whish haplotype tagging tool to use. | |
description: Which haplotype tagging tool to use. |
In the chr22 dataset, for about 7% of the reads having both read and molecule phase info, the phasing info is contradictory.
I looked into some cases where the read and molecule haplotypes were different and they made sense based on how the program currently opperates and with respect to the variants overlapped by the read for those barcodes.
Not really. I binned the discordant read position along with the phased variants and total coverage and they all overlap pretty well (See figure below). I think it will be hard to find a the best way to label these reads. How do you weigh read and molecule evidence against eachother?
Yeah I think this is probably a good idea. It seams like we keep running into issues like this and then is just simpler to clear out the information altogether.
I am not entirely sure what you mean here? Do you mean to use the phased VCF rather than the HAPCUT2 file for the variants?
Yeah I also think this might be too much work for now.
It does not use molecule information, rather the BX tag and a window of a desired size (much like in |
…BAMs. Modified phasebam so that HP is int and moved 'anomaly_file' for optional arguments.
- Added missmatch threshold - Use mapping quality to score haplotypes - Added phase quality (tag PC) for read phased reads. - Docs update - Use same 0-based position for hapcut2 entries. - Normalize variants to remove redundant SVs. - Skip reads with unmapped mate - Translate positions in read using get_aligned_pairs() for simplicity. - Added option to select priority when desiding haplotype.
- Don't tag unmapped reads with haplotype. - Sort variants after generation (position can differ between phaseblocks) - Name changes
2a904e2
to
ca120cd
Compare
Alright, looks good! Although 7% is quite a lot I'm not sure what else we can do currently, other than removing the information. What do you think about saving the stripped read names or positions to a separate file or something where we strip the information? Then we could also look into that to see if we can later correlate it to switch error rates, or at least investigate if they seem related. This way we could strengthen our idea about stripping the information. Otherwise everything looks good! |
@FrickTobias Regarding storing the read names, you implemented this "anomaly" file in your script which I have kept. This outputs a TSV file which has the read names of the remove reads. They are marked with |
…than phasebam.py. For a test dataset of chr22 with called variants 49.2% were tagged using whatshap while only 35.3% were tagged with phasebam.py. Probably this is due to the updates in whatshap (compare to PR #4). Whatshap haplotag is able to handle phased INDELs which is not the case for phasebam.py.
This is an updated version of the PR Tobias posted (#3) in which i have done some edits. Some are rather substantial so i decided to open a new PR.
The basis is the same script but include several edits
get_phased_snvs_hapcut2_format
i have factored outPhaseBlockReader
for reading the hapcut2 phaseblock files and also made a classPhasedVariant
to handle the variants more easily.translate_positions()
have been changed to use pysam'sget_aligned_pairs()
. I found that in some cases variants at the edges of reads were mislabeled with haplotypes in the PR Phasebam #3 script but it was hard to find the error using theget_blocks()
method so I switched to correct the issue.phase_read()
now uses the base quality for scoring haplotypes.whatshap haplotag
but it could be something to test further.I also compared our tagging to the output of
whatshap haplotag
and found the following:phasebam
uses the assigned molecule index (tag MI) to distribute haplotype over molecules whilewhatshap
has to recalculate the 'read clouds' (or molecules as we refer to them). This means that some of the filtration usingfilterclusters
is no longer effective for the whatshap tool as the filtration keeps the barcode but changes the MI tag for -1. Thus some reads within barcodes that we don't want to tag will still be tagged. One easy way to solve this would be to remove these barcodes entirely in thefilterclusters
step.whatshap
can handle other variants than SNVs.whatshap
have two mode to identify which haplotype a read belongs to. Possibly we should think about implementing one of them in our script.whatshap
for tagging.Comparison of % reads tagged
phasebam: "no_filter" means run using
--include-pruned --min-switch-error 0 --min-missmatch-error 0
. "default" means run using default settingshaplotag: Run using two different values for
--linked-read-distance-cutoff
"30k" for 30000 (same as in default blr configs) and "50k" for 50000 (haplotag default).