Skip to content

Association testing with GPAT

jewmanchue edited this page Jun 9, 2014 · 11 revisions

Association testing with GPAT++

GPAT++ has three tools to test for population divergence at a SNV/SNP: pFst, wcFst, and bFst. A dataset from a recent publication (Shapiro et al.) is provided with GPAT++ for testing and tutorial purposes only. The file scaffold612.vcf, found in vcflib/sampls comprises genotypes of pigeons. The Ephb2 locus ~600KB is strongly associated with the head crest phenotype.

##pFst

pFst is a likelihood ratio test for allele frequency differences between populations. The parameters of the test are estimated using genotype counts or genotype likelihoods. By default genotype likelihoods are used as it can compensate for NGS genotyping errors. If genotype likelihoods are not available the "count" flag should be specified.

Usage statement for pFst :

INFO: help
INFO: description:
     pFst is a probabilistic approach for detecting differences in allele frequencies between two populations.

Output : 3 columns :
     1. seqid
     2. position
     3. pFst probability

INFO: usage:  pFst --target 0,1,2,3,4,5,6,7 --background 11,12,13,16,17,19,22 --file my.vcf --deltaaf 0.1 --type PL

INFO: required: t,target     -- argument: a zero based comma separated list of target individuals corrisponding to VCF columns
INFO: required: b,background -- argument: a zero based comma separated list of background individuals corrisponding to VCF columns
INFO: required: f,file       -- argument: a properly formatted VCF.
INFO: required: y,type       -- argument: genotype likelihood format ; genotypes: GP,GL or PL; pooled: PO
INFO: optional: d,deltaaf    -- argument: skip sites where the difference in allele frequencies is less than deltaaf, default is zero
INFO: optional: r,region     -- argument: a tabix compliant genomic range : seqid or seqid:start-end
INFO: optional: c,counts     -- switch  : use genotype counts rather than genotype likelihoods to estimate parameters, default false

INFO: version 1.0.0 ; date: April 2014 ; author: Zev Kronenberg; email : zev.kronenberg@utah.edu

Running provided example:

WARNING code blocks scroll horizontally

../bin/pFst --target 1,20,25,29,30,38,43,46 --background 2,3,4,5,6,7,21,22,22,23,24,26,26,28,31,32,33,34,35,36,37,39,40,41,42,44,45 --deltaaf 0.0 --file scaffold612.vcf --counts --type PL   > 612.counts
R --vanilla < ../bin/plotPfst.R --args 612.counts 

The resulting output:

with-counts

cd samples
../bin/pFst --target 1,20,25,29,30,38,43,46 --background 2,3,4,5,6,7,21,22,22,23,24,26,26,28,31,32,33,34,35,36,37,39,40,41,42,44,45 --deltaaf 0.0 --file scaffold612.vcf --type PL > 612.nocounts
R --vanilla < ../bin/plotPfst.R --args 612.nocounts

The resulting output:

without-counts

While the differences are visually subtle, genome-wide pFst run with counts has a higher signal to noise ratio. Use genotype likelihood when available!

##wcFst

Weir and Cockerham's Fst estimator is implemented in wcFst (paywal - paper link). wcFst uses the called genotypes with no error modeling.

Usage statement for wcFst :

INFO: help
INFO: description:
      wcFst is Weir & Cockerham's Fst for two populations.  Negative values are VALID,
      they are sites which can be treated as zero Fst. For more information see Evolution, Vol. 38 N. 6 Nov 1984.
      Specifically wcFst uses equations 1,2,3,4.

Output : 3 columns :
     1. seqid
     2. position
     3. target allele frequency
     4. background allele frequency
     5. wcFst

INFO: usage:  wcFst --target 0,1,2,3,4,5,6,7 --background 11,12,13,16,17,19,22 --file my.vcf --deltaaf 0.1 --type PL

INFO: required: t,target     -- argument: a zero based comma separated list of target individuals corrisponding to VCF columns
INFO: required: b,background -- argument: a zero based comma separated list of background individuals corrisponding to VCF columns
INFO: required: f,file       -- argument: proper formatted VCF
INFO: required, y,type       -- argument: genotype likelihood format; genotype : GL,PL,GP
INFO: optional: r,region     -- argument: a tabix compliant genomic range: seqid or seqid:start-end
INFO: optional: d,deltaaf    -- argument: skip sites where the difference in allele frequencies is less than deltaaf, default is zero

INFO: version 1.1.0 ; date: April 2014 ; author: Zev Kronenberg; email : zev.kronenberg@utah.edu

Running distributed example:

cd samples
../bin/wcFst --target 1,20,25,29,30,38,43,46 --background 2,3,4,5,6,7,21,22,22,23,24,26,26,28,31,32,33,34,35,36,37,39,40,41,42,44,45 --deltaaf 0.0 --file scaffold612.vcf   > 612.wcfst.txt
R --vanilla < ../bin/plotWCfst.R --args 612.wcfst.txt

The resulting output:

wc-fst

##bFst

A Bayesian approach to Fst is calculated in bFST. The fundamental concept was described by Kent Holsinger (paper link). This concept has been extended to use genotype likelihoods. The major advantage of bFst is Fst credible intervals can be obtained. bFst uses Metropolis Hastings to sample the posterior of Fst and therefore it is very computationally expensive. bFst is only fit for assaying small genomic regions regions.

Usage statement for bFst :

INFO: help: 

     bFst is a Bayesian approach to Fst.  Importantly bFst account for genotype uncertainty in the model using genotype likelihoods.
     For a more detailed description see: Holsinger et al. Molecular Ecology Vol 11, issue 7 2002.  The likelihood function has been 
     modified to use genotype likelihoods provided by variant callers. There are five free parameters estimated in the model: each 
     subpopulation's allele frequency and Fis (fixation index, within each subpopulation), a free parameter for the total population's 
     allele frequency, and Fst. 

Output : 11 columns :                          
     1.  Seqid                                     
     2.  Position				     
     3.  Observed allele frequency in target.	     
     4.  Estimated allele frequency in target.     
     5.  Observed allele frequency in background.  
     6.  Estimated allele frequency in background. 
     7.  Observed allele frequency combined. 	     
     8.  Estimated allele frequency in combined.   
     9.  ML estimate of Fst (mean)		     
     10. Lower bound of the 95% credible interval  
     11. Upper bound of the 95% credible interval  

INFO: usage:  bFst --target 0,1,2,3,4,5,6,7 --background 11,12,13,16,17,19,22 --file my.vcf --deltaaf 0.1

INFO: required: t,target     -- a zero based comma seperated list of target individuals corrisponding to VCF columns
INFO: required: b,background -- a zero based comma seperated list of background individuals corrisponding to VCF columns
INFO: required: f,file a     -- a proper formatted VCF file.  the FORMAT field MUST contain "PL"
INFO: required: d,deltaaf    -- skip sites were the difference in allele frequency is less than deltaaf

INFO: version 1.0.0 ; date: April 2014 ; author: Zev Kronenberg; email : zev.kronenberg@utah.edu 

Running distributed example:

cd samples
../bin/bFst --target 1,20,25,29,30,38,43,46 --background 2,3,4,5,6,7,21,22,22,23,24,26,26,28,31,32,33,34,35,36,37,39,40,41,42,44,45 --deltaaf 0.0 --file scaffold612.vcf > 612.bFst
R --vanilla < ../bin/plotBfst.R --args 612.bFst

The resulting output:

bFst