Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract phased variants or reads #23

Open
Adamtaranto opened this issue Oct 28, 2022 · 5 comments
Open

Extract phased variants or reads #23

Adamtaranto opened this issue Oct 28, 2022 · 5 comments

Comments

@Adamtaranto
Copy link

Hi Omar,

Thanks for sharing this great tool. I was wondering if there is a way to export a phased vcf or tag long reads with their haplotype group as in Whatshap-haplotag?

@OmarOakheart
Copy link
Owner

I'm glad you find nPhase useful!

There's no function to turn nPhase results into a VCF or tag the long reads of a BAM file with a haplotype directly.
It should be straight-forward enough to create a phased vcf or a haplotagged BAM using the existing outputs of nPhase so I think that these are things I could add, are there any tools that you plan to use which would use either of these formats as an input?

@Adamtaranto
Copy link
Author

The aim would be to reassemble the phased reads for each scaffold similar to Garg et al. 2021.

We have HiFi data that would be used for variant calling and ONT ultra-long reads that can be added for phasing. The phased VCF would be used with WhatsHap-Haplotag, and tagged reads would be split into tagged groups per scaffold and assembled with Hifiasm (Hifi) or Shasta (ONT).

@OmarOakheart
Copy link
Owner

nPhase doesn't output haplotagged BAM files but it does output a fastQ file for each predicted haplotig, so you could still assemble the long reads this way, but I can see that not outputting a phased VCF does get in the way of you separating and assembling the short reads.

As far as I know, whatshap haplotag and split don't handle polyploids, and I'm not aware of another tool that currently does that (though maybe whatshap will soon whatshap/whatshap#408 )

If you really need to split the short reads to assemble them, maybe you can start by assembling the phased long reads, then merging all of the long read scaffolds and mapping the short reads onto them in order to separate them that way and later assemble them? It's a bit clunky but would require the least amount of custom development

@Adamtaranto
Copy link
Author

A phased VCF would do the trick! We have an allotetraploid genome with very little variation within sub-genomes so I can cluster the tracks and do the haplotag splitting in two steps.

@OmarOakheart
Copy link
Owner

OmarOakheart commented Nov 4, 2022

Hi Adam,

Sorry for the delays in responses, I'm still not sure how you plan to cluster the tracks together so it's a little difficult to anticipate what sort of phased VCF format would work.

The main issue is that there isn't a standard for creating a phased VCF for a polyploid. The PS tag that currently exists is an explicitly diploid format, and the alternatives that have been developed by tools like whatshap or flopp use the fact that they start by assigning a ploidy to the sample. For example if you use flopp and tell it that you have a tetraploid and therefore k=4, it will reliably output 4 columns for each variable position. So then you use a script for downstream analysis that works on the flopp output, using the same workflow would fail or give you messy results with an nPhase output because nPhase allows for aneuploidy and this wouldn't have been explicitly taken into account.

I did write you a (relatively untested) python function that takes the nPhase output (OUTPUT_FOLDER/Phased/$prefix_variants.tsv) and a VCF file and produces a version of a phased VCF that works with the nPhase output. I'm just not sure if I designed it in a way that's appropriate for your intended purpose.

So you'll get a VCF line that looks like this:

#CHROM  POS  ID  REF  ALT  QUAL   FILTER  INFO FORMAT   SAMPLE
chr1    100  .   A    T,C   50.0   .       .    GT:PS:PQ:HX:HY   0|1:100:22:cluster3,cluster8,cluster12,cluster15,cluster16:0,0,1,1,2

The way to interpret this is the HX field tells you the names of the different haplotypes that have a base in this position for this chromosome, the HY field tells you what the base is for each haplotype listed in HX. This means for example that cluster3 is associated with 0, which is the REF allele A, and cluster12 is associated with 2, which is the ALT allele C.

I know this format doesn't exist anywhere else, so I imagine this isn't very helpful to you, but maybe this still helps clears something up and leads you in the right direction

Here's the python script:

def makePhasedVCF(vcfFilePath,phasedFilePath,outputFilePath):
    phasedData={}
    phasedFile=open(phasedFilePath,"r")
    for line in phasedFile:
        line=line.strip("\n").split("\t")
        haplotigName=line[0]
        chromosome=line[1]
        position=line[2]
        base=line[3]
        phasedData.setdefault(chromosome,{})
        phasedData[chromosome].setdefault(position,{})
        phasedData[chromosome][position].setdefault(base,[])
        phasedData[chromosome][position][base].append(haplotigName)
    phasedFilePath.close()
    vcfOutputText=""
    vcfFilePath=open(vcfFilePath,"r")
    for line in vcfFilePath:
        if line[0]=="#":
            if line[0:6]=="#CHROM":
                vcfOutputText+='##FORMAT=<ID=HX,Number=n,Type=String,Description="HaplotypeNames">\n'
                vcfOutputText += '##FORMAT=<ID=HY,Number=n,Type=String,Description="HaplotypeBases">\n'
            vcfOutputText+=line
        else:
            line=line.strip("\n").split("\t")
            chromosome=line[0]
            position=line[1]
            bases=[line[3]]+line[4].split(",")
            phasedInfoX=""
            phasedInfoY=""
            if chromosome in phasedData:
                if position in phasedData[chromosome]:
                    hasBase=False
                    i=0
                    for base in bases:
                        if base in phasedData[chromosome][position]:
                            for haplotigName in phasedData[chromosome][position][base]:
                                phasedInfoX+=haplotigName+","
                                phasedInfoY+=str(i)+","
                            hasBase=True
                        i+=1
                    if hasBase:
                        phasedInfoX.strip(",")
                        phasedInfoY.strip(",")
                        line[8]=line[8]+":HX:HY"
                        line[9]=line[9]+phasedInfoX+":"+phasedInfoY
            line="\t".join(line)+"\n"
            vcfOutputText+=line
    vcfFilePath.close()
    outputFile=open(outputFilePath,"w")
    outputFile.write(vcfOutputText)
    outputFile.close()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants