Outlier Analysis Workshop

There are lots of interesting patterns that you can extract from genetic variant data. This can include patterns of linkage, balancing selection, or even inbreeding signals. One of the most common ones is to try find sites on the genome that are under selection. The following vignette will take you through the basics of genetic selection analysis.

The project has been funded by the AES ERC Networking Grant Scheme and GSA.

Schedule

Day 1
9:00am Introduction Slides
9:30am Download data
10:00am Morning Tea
10:15am PCAdapt
12:00pm Lunch
1:00pm VCFtools
2:00 Afternoon tea
2:15pm VCFtool continued and setup for Bayescan and Baypass

Day 2
9:00am Bayescan
10:00am Morning Tea
10:15am Bayescan continued & Bayepass
12:00pm Lunch
1:00pm Baypass
2:00pm Afternoon tea
2:15pm Compiling results, group discussiona and metanalysis contribution

A brief introduction to Genetic Outlier and Association Analysis

When we look through a genome to try find loci that are under divergent selection, we often conduct what is called outlier or association analyses. Outlier analysis requires just knowledge of the genetics of your samples (plus sample metadata, for example population groupings), and tries to find loci that behave very differently from the underlying patterns across the genome (with the assumption being that the rest of the genome represents patterns of neutral genetic diveristy)^A. Meanwhile association analysis require some sort of covariate data, and tests whether there are any genetic variants statistically associated with this new data (you may have heard the term GWAS). This covariate data can come in the form of phenotype data (e.g. morphology, disease status, physiology measures), or could be spatial (e.g. environmental, climate). Association tests look for sites in the genome where the precense or absence of a variant is highly correlated with the values in the co-variate data, usually through some regression type analysis.

Throughout this vignette I will refer to outlier and association analysis collectively as selection analysis. There are a lot of programs that exist currently. You can find a very long (though not exhaustive) list of them here.

This vignette still start by covering some very simple outlier analyses:

PCAdapt, a program that can depect genetic marker outliers without having population^B designations, using a Principle Component Analysis (PCA) approach.
F_ST outlier analysis, an approach that uses pairwise comparisons between two populations and the fixation index metric to assess each genetic marker.

Next we will conduct some more advanced outlier analysis:

Bayescan, a program.
Baypass, a program that elaborates on the bayenv model (another popular association analysis program).

We'll cover the pre-processing of program specific input files, how to run the programs, how to visualise the output and also in some cases we'll need to take extra steps to map the genetic markers of interest back to the SNP data.

🔰 Reduced representation verses whole genome sequencing

Completing outlier analysis is possible and often done on reduced representation data. It is important to remember how your genome coverage (the number of genome variant sites / the genome length ^C) will affect your results and interpretation. Often with WGS data, you will see well resolved 'peaks' with a fairly smooth curve of points leading up to it either side. From this we often infer that the highest point is the genetic variant of interest, and the other sites either side of that exhibit signals of selection because they reside close to, and thus are linked, to the variant of interest. However, consider that even in WGS data, unless we have every single genetic variant represented (which may not be the case, depending on our variant calling and filtering parameters) it is possible that the genetic variant of interest that we have identified is not the main one, but is simply another neighbouring linked SNP to one that is not represented in the data. This problem becomes even more relevant with reduced representation sequencing (RRS), for which the genome coverage may be extremely patchy^C. Thus with all outlier analysis, but especially so for those using RRS data, remember that your flagged outliers are not exhaustive, and may themselves only be liked to the variant that is truely under selection.

Define you working directory for this project, and the VCF file location:

mkdir /nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis
DIR=/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis
cd $DIR

Our data tree will look like:

outlier_analysis/
├── analysis
│ ├── bayescan
│ ├── baypass
│ ├── pcadapt
│ ├── summary
│ └── vcftools_fst
├── data
├── programs
└── workshop_material

So lets set up our directories to match this

mkdir -p {analysis/{bayescan,baypass,pcadapt,summary,vcftools_fst},data,programs,workshop_material}

Project data

The data provided in this workshop contains 5007 SNPs loci for across 39 individuals (13 individuals each from 3 different locations). There is some missingness (i.e. missing SNP calls) within this data.

There is also a metadata file, that contains the individuals unique IDs, their assigned populations, and a wingspan measurement for each individual.

Let's grab this data from the project's git resository, place the data files into our data directory, and define the environmental variables VCF and METADATA with the locations of the genetic variant and metadata files respectively.

cd $DIR/workshop_material
git clone https://github.com/katarinastuart/Ev1_SelectionMetaAnalysis.git
cp $DIR/workshop_material/Ev1_SelectionMetaAnalysis/workshop_files/* $DIR/data
VCF=$DIR/data/starling_3populations.recode.vcf
METADATA=$DIR/data/starling_3populations_metadata.txt

❗ Working with your own data

Alternatively, you can also use your own data for this workshop. If so, it is a good idea to thin your SNP dataset down to roughly 5,000 SNPs to ensure compute times are not too long. If you have more than 50 individuals you may also want to reduce this too. If you would like to do this, just place your genetic variant and metadata file in the data directory and define VCF and METADATA based on their names.

Across this workshop, we will need the genetic data to be in several different formats. Let's prepare that now. First we convert the VCF to PLINK, and then to BED.

cd $DIR/data
module load VCFtools/0.1.15-GCC-9.2.0-Perl-5.30.1
module load PLINK/1.09b6.16 
vcftools --vcf $VCF --plink --out starling_3populations.plink
plink --file starling_3populations.plink --make-bed --noweb --out starling_3populations

PCAdapt

PCAdapt uses an ordinatio approach to find sites in a data set that are outliers with respect to background population structure. The PCAdapt manual is available here.

CITE: Privé, F., Luu, K., Vilhjálmsson, B. J., & Blum, M. G.B. (2020). Performing highly efficient genome scans for local adaptation with R package pcadapt version 4. Molecular Biology and Evolution.

First, let's install PCAdapt and set your working directory.

module load R/4.1.0-gimkl-2020a
R

install.packages("pcadapt")
library("pcadapt")

setwd("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/pcadapt/")

Now let's load in the data - PCAdapt uses bed file types.

starling_bed <- "/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/data/starling_3populations.bed"
starlings_pcadapt <- read.pcadapt(starling_bed, type = "bed")

Produce K plot

starlings_pcadapt_kplot <- pcadapt(input = starlings_pcadapt, K = 20)
pdf("pcadapt_starlings_kplot.pdf")
plot(starlings_pcadapt_kplot, option = "screeplot")
dev.off()

K value of 3 is most appropriate, as this is the value of K after which the curve starts to flatten out more.

starlings_pcadapt_pca <- pcadapt(starlings_pcadapt, K = 3)
summary(starlings_pcadapt_pca)

✔️ output
Length Class Mode
scores 117 -none- numeric
singular.values 3 -none- numeric
loadings 15021 -none- numeric
zscores 15021 -none- numeric
af 5007 -none- numeric
maf 5007 -none- numeric
chi2.stat 5007 -none- numeric
stat 5007 -none- numeric
gif 1 -none- numeric
pvalues 5007 -none- numeric
pass 4610 -none- numeric

Investigate axis projections:

poplist.names <- c(rep("Lemon", 13),rep("Warrnambool", 13),rep("Nowra", 13))
print(poplist.names)

pdf("pcadapt_starlings_projection1v2.pdf")
plot(starlings_pcadapt_kplot, option = "scores", i = 1, j = 2, pop = poplist.names)
dev.off()

pdf("pcadapt_starlings_projection5v7.pdf")
plot(starlings_pcadapt_kplot, option = "scores", i = 5, j = 7, pop = poplist.names)
dev.off()

Ignore the warning:

❗ Use of df$Pop is discouraged. Use Pop instead.

Investigate manhattan and Q-Qplot:

🔰 Manhattan plots are a way to visualise the GWAS (genome-wide association study) p-values (or other statistical values) at each SNP locus along the genome

🔰 Q-Qplots plots are just a quick way to visually check if your residuals are normally distributed. Check out more information here.

pdf("pcadapt_starlings_manhattan.pdf")
plot(starlings_pcadapt_pca, option = "manhattan")
dev.off()

pdf("pcadapt_starlings_qqplot.pdf")
plot(starlings_pcadapt_pca, option = "qqplot")
dev.off()

Plotting and correcting the pvalues

starling_pcadapt_pvalues <- as.data.frame(starlings_pcadapt_pca$pvalues)

library("ggplot2")

pdf("pcadapt_starlings_pvalues.pdf")
hist(starlings_pcadapt_pca$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")
dev.off()

starlings_pcadapt_padj <- p.adjust(starlings_pcadapt_pca$pvalues,method="bonferroni")
alpha <- 0.1
outliers <- which(starlings_pcadapt_padj < alpha)
length(outliers)

write.table(outliers, file="starlings_pcadapt_outliers.txt") 

length(outliers)

✔️ Output
[1] 3

After this, we will be jumping out of R and back into the command line by using the command:

q()

Mapping Outliers: PCAdapt

finding the SNP ID of the outlier variants

cd $DIR/analysis

The first thing we will do is create list of SNPs in VCF, assign line numbers that can be used to find matching line numbers in outliers (SNP ID is lost in PCadapt & Bayescan, line numbers used as signifiers).

We create this in the analysis folder because we will use it for more than just mapping the outlier SNPs for PCAdapt.

grep -v "^#" $DIR/data/starling_3populations.recode.vcf | cut -f1-3 | awk '{print $0"\t"NR}' > starling_3populations_SNPs.txt

Now let's jump back into the pcadapt directory to contiue working with our outliers. We grab column 2 of the outlier file using the AWK command, which contain the number of the outliers

cd $DIR/analysis/pcadapt
awk '{print $2}' starlings_pcadapt_outliers.txt > starlings_pcadapt_outliers_numbers.txt

We now make a list of outlier SNPS ID's

awk 'FNR==NR{a[$1];next} (($4) in a)' starlings_pcadapt_outliers_numbers.txt ../starling_3populations_SNPs.txt   | cut -f3 > pcadapt_outlierSNPIDs.txt
head pcadapt_outlierSNPIDs.txt

✔️ Output

230955:72:-
238881:46:+
286527:46:-

VCFtools windowed Fst

The VCFTools manual is available here.

CITE: Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST et al. 2011 The variant call format and VCFtools. Bioinformatics 27 2156–2158. (doi:10.1093/bioinformatics/btr330)

Fst outliers will allow us to identify SNPs that behave abnormally in pairwise comparisons between populations.

The first things we need to do is use our metadata file (currently defined by the environmental variable METADATA) to make three individual files containing just the list of individuals in each of the populations. We can do this by subseting our sample metadata file, using the command grep to grab lines that match each population's name, and then using awk to keep only the first column of metadta, i.e. the sample names.

module load VCFtools/0.1.15-GCC-9.2.0-Perl-5.30.1

cd $DIR/data

grep "Lemon" $METADATA | awk '{print $1}' > individuals_Lemon.txt
grep "War" $METADATA | awk '{print $1}' > individuals_War.txt
grep "Nowra" $METADATA | awk '{print $1}' > individuals_Nowra.txt

Now we can pick two populations to compare. Let's work with Lemon (short for Lemon Tree, QLD, AU) and War (short for Warnambool, VIC, AU), and so a SNP-based Fst comparison.

cd $DIR/analysis/vcftools_fst

vcftools --vcf $VCF --weir-fst-pop $DIR/data/individuals_Lemon.txt --weir-fst-pop $DIR/data/individuals_War.txt --out lemon_war

head -n 5 lemon_war.weir.fst

✔️ Output
CHROM POS WEIR_AND_COCKERHAM_FST
starling4 107735 0.160891
starling4 137462 -0.0805785
starling4 151332 0.0524489
starling4 227887 0.0569961

The important column is column 5: the Weighted Fst, from Weir and Cockerham’s 1984 publication. This corrects for INFO NEEDED.

wc -l lemon_war.weir.fst

✔️ Output
5008

Notice how there are as many lines as there are SNPs in the data set, plus one for a header. It is always a good idea to check your output, and make sure everything looks as you expect it to!

Next, instead of calculating pairwise populaiton differentiation on a SNP by SNP basis, we will be usiing a sliding window approach. The --fst-window-size 50000 refers to the window size of the genome (in bsae pairs) in which we are calculating one value: all SNPs within this window are used to caluclate Fst. The --fst-window-step option indicates how many base pairs the window is moving down the genome before calculating Fst for the second window, and then the third, and so on.

Warning These sliding windows only work on ordered SNPs on the same chromosome/scaffold/contig. If you data is not set up like this (i.e. all your SNPs are on a single pseudo chromosome) then this method is not appropriate for your data, as it will be making an assumption about where the SNPs are located with respect to one another.

vcftools --vcf $VCF --fst-window-size 50000 --fst-window-step 10000 --weir-fst-pop $DIR/data/individuals_Lemon.txt --weir-fst-pop $DIR/data/individuals_War.txt --out lemon_war

head lemon_war.windowed.weir.fst

✔️ Output
CHROM BIN_START BIN_END N_VARIANTS WEIGHTED_FST MEAN_FST
starling4 60001 110000 1 0.160891 0.160891
starling4 70001 120000 1 0.160891 0.160891
starling4 80001 130000 1 0.160891 0.160891
starling4 90001 140000 2 -0.00374291 0.0401563

Notice the output is different.

wc -l lemon_war.windowed.weir.fst

✔️ Output
10838

Notice the line count is different from the SNP-based Fst comparison; there are more lines in the sliding window based Fst comparison. This is because there are more sliding windows across the chromosome in this data set than there are SNPs. Consider which of these steps is better for your data: in low density SNP datasets, the sliding window approach might not be the best to use.

Now let's plot the Fst across the chromosome. To do this we will add line numbers on our Fst file that will be used to order the Fst measurements across the x-axis of our manhattan plot.

🔰 X-axis values in the following plot are done y using each ourlier window's line number, as they are in order along the genome. Ourlier windows are equally spces, and so line numbers are sufficient to capture the patterns along the genome. Consider that if you are plotting Fst values for SNPs (rather than windows), they may not be equally spaced along the genome and so SNP position may need to be used to make your manhattan plots.

awk '{print $0"\t"NR}' ./lemon_war.windowed.weir.fst  > lemon_war.windowed.weir.fst.edit

module load R/4.1.0-gimkl-2020a
R

library("ggplot2")

setwd("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/vcftools_fst")

windowed_fst <- read.table("lemon_war.windowed.weir.fst.edit", sep="\t", header=TRUE)
str(windowed_fst)

quantile(windowed_fst$WEIGHTED_FST, probs = c(.95, .99, .999))

✔️ Output
95% 99% 99.9%
0.1948850 0.3501600 0.5741306

Choose the quantile threshold above which SNPs will be classified as outliers. Below, we chose 99% (i.e. the top 1% of SNP windows).

pdf("fst_starlings_windowed.pdf", width=10, height=5)
ggplot(windowed_fst, aes(x=X1, y=WEIGHTED_FST)) + 
geom_point() + 
theme_classic() +
geom_hline(yintercept=0.35, linetype="dashed", color = "red")
dev.off()

q()

Finally, we will generate a list of outier SNP IDs. We do this by grabbing all of the SNPs located in the ourlier windows.

cd $DIR/analysis/vcftools_fst
cat lemon_war.windowed.weir.fst.edit | awk '$5>0.3501600' > lemon_war.windowed.outliers
wc -l lemon_war.windowed.outliers

✔️ Output
107 lemon_war_fst.windowed.outliers

cut -f1-3 lemon_war.windowed.outliers > lemon_war.windowed.outliers.bed 
vcftools --vcf $VCF --bed lemon_war.windowed.outliers.bed --out fst_outliers --recode
grep -v "#" fst_outliers.recode.vcf | awk '{print $3}' > vcftoolsfst_outlierSNPIDs.txt
wc -l vcftoolsfst_outlierSNPIDs.txt

✔️ Output
61

We have a total of 61 outlier SNPs locate across 107 outlier SNP windows.

Bayescan

The VCFTools manual is available here.

CITE: Foll M & Gaggiotti O 2008 A Genome-Scan Method to Identify Selected Loci Appropriate for Both Dominant and Codominant Markers: A Bayesian Perspective. Genetics 180 977–993. (doi:10.1534/genetics.108.092221)

Bayescan identified outlier SNPs based on allele frequencies. More explination about alpha and such.

First, we will need to convert out VCF to the Bayescan format. To do this we will use the genetic file conversion program called PGDspider.

cd $DIR/programs
wget http://www.cmpg.unibe.ch/software/PGDSpider/PGDSpider_2.1.1.5.zip
unzip *.zip

We also need to create a new populations metadata file, which contains individual names in column 1, and population names in column 2.

cd $DIR/data
cut -f1,2 $METADATA > starling_3populations_metadata_INDPOP.txt

Now navigate to the folder in which we will be running the bayescan analysis.

cd $DIR/analysis/bayescan

We now run PGDSpider in two steps: first we convert the VCF file to the PGD format, second from PGD format to Bayescan format. To do this we will need to create a SPID file. create a file called VCF_PGD.spid using the nano command. Paste in the below, replacing the location of you metadata file.

include snapshot of SPID:

# VCF Parser questions
PARSER_FORMAT=VCF
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/data/starling_3populations_metadata_INDPOP.txt
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=false
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false

# PGD Writer questions
WRITER_FORMAT=PGD

Now run the two step convserion.

module purge
module load Java/1.8.0_144

java -Xmx1024m -Xms512m -jar $DIR/programs/PGDSpider_2.1.1.5/PGDSpider2-cli.jar -inputfile $VCF -inputformat VCF -outputfile starling_3populations.pgd -outputformat  PGD -spid VCF_PGD.spid 

java -Xmx1024m -Xms512m -jar $DIR/programs/PGDSpider_2.1.1.5/PGDSpider2-cli.jar -inputfile starling_3populations.pgd -inputformat PGD -outputfile starling_3populations.bs -outputformat GESTE_BAYE_SCAN

Let's have a quick look at what the input file looks like:

head starling_3populations.bs

✔️ Output
[loci]=5007

[populations]=3

[pop]=1
1 12 2 9 3
2 20 2 11 9
3 18 2 15 3
4 20 2 0 20
5 22 2 2 20

So for each population we have a note of how many REF and ALT alleles we have at each genomic variant position.

🔰 An important note about additive genetic variance: It is important to bear in mind how the input genetic data for outlier or association models is being interpreted by the model. When dealing with many of these models (and input genotype files) the assumption is that the SNP effects are additive. This can be seen from, for example, the way we encode homozygous reference allele, heterozygous, and homozygous alternate allele as "0", "1", and "2" respectively in a BayPass input genofile. For the diploid organism (with two variant copies for each allele) one copy of a variant (i.e. heterozygous) is assumed to have half the effect of having two copies. However, what if the locus in question has dominance effects? This would mean the heterozygous form behaves the same as the homozygous dominant form, and it would be more appropriate to label these instead as "0", "0", "1". But with thousands, if not millions of (most likely) completely unknown variants in a dataset, how can we possibly know? The answer is we cannot. And most models will assume additive effects, because this the simplest assumption. However, by not factoring in dominance effects we could possible be missing many important functional variants, as Reynolds et al. 2021 demonstrates. Genomics is full of caveats and pitfalls, which while providing new directions to explore can be a bit overwhelming. Remember, you selection analysis doesn't have to be exhaustive, just make sure it is as fit for purpose within your study design. There is so much going on in just one genome, there is no way you can analyse everything in one go.

Now let's set Bayescan to run. Using the nano bayescan_starling.sl we will create a slurm script and submit it to run using the command sbatch bayescan_starling.sl. This should take approximately 1 hr to run. Currently everything is set to default, but read the manual if you want to understand what they mean and how to refine them if needed.

#!/bin/bash -e
#SBATCH --job-name=2023_04_14.bayescan_starling.sl
#SBATCH --account=uoa02613
#SBATCH --time=00-12:00:00
#SBATCH --mem=5GB
#SBATCH --output=%x_%j.errout
#SBATCH --mail-user=katarina.stuart@auckland.ac.nz
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --profile task

cd /nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/bayescan

#load bayescan
module load BayeScan/2.1-GCCcore-7.4.0

#run bayescan. 
bayescan_2.1 ./starling_3populations.bs -od ./ -threads 8 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10

Identify outliers:

module load R/4.1.0-gimkl-2020a
R
library(ggplot2)
setwd("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/bayescan")
source("/opt/nesi/CS400_centos7_bdw/BayeScan/2.1-GCCcore-7.4.0/R\ functions/plot_R.r")
outliers.bayescan=plot_bayescan("starling_3population_fst.txt",FDR=0.05)
outliers.bayescan
write.table(outliers.bayescan, file="bayescan_outliers.txt")

q()

✔️ Output
$outliers
[1] 333 395 1367 2376 3789
$nb_outliers
[1] 5

Mapping Outliers

cd $DIR/analysis/bayescan

Create list of SNPs in VCF, assign line numbers that can be used to find matching line numbers in outliers (SNP ID is lost in bayescan, line numbers used as signifiers).

grep -v "^#" ../../data/starling_3populations.recode.vcf  | cut -f1-3 | awk '{print $0"\t"NR}' > starling_3populations_SNPs.txt

awk '{print $2}' bayescan_outliers.txt > bayescan_outliers_numbers.txt

list of outlier SNPS, by matching column 1 of of the outliers list to the fourth column of the whole SNP list data.

awk 'FNR==NR{a[$1];next} (($4) in a)' bayescan_outliers_numbers.txt starling_3populations_SNPs.txt   | cut -f3 > bayescan_outlierSNPIDs.txt

Bayescane Log Plot, colouring the outliers in a different colour.

module load R/4.1.0-gimkl-2020a
R
library(ggplot2)
library(dplyr)
setwd("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/bayescan")

bayescan.out<- read.table("starling_3population_fst.txt", header=TRUE)
bayescan.out <- bayescan.out %>% mutate(ID = row_number())
bayescan.outiers<- read.table("bayescan_outliers_numbers.txt", header=FALSE)
outliers.plot <- filter(bayescan.out, ID %in% bayescan.outiers[["V1"]])

png("bayescan_outliers.png", width=600, height=350)
ggplot(bayescan.out, aes(x=log10.PO., y=alpha))+
geom_point(size=5,alpha=1)+xlim(-1.3,3.5)+ theme_classic(base_size = 18) + geom_vline(xintercept = 0, linetype="dashed", color = "black", size=3)+
geom_point(aes(x=log10.PO., y=alpha), data=outliers.plot, col="red", fill="red",size=5,alpha=1) + theme(axis.text=element_text(size=18), axis.title=element_text(size=22,face="bold"))
dev.off()

q()

BayPass

The Baypass manual can be found here.

CITE: Gautier M 2015 Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics 201 1555–1579. (doi:10.1534/genetics.115.181453)

Baypass requires that the allele frequency data be on a population, not an individual basis. The genotyping data file is simply organized as a matrix with nsnp rows and 2 ∗ npop columns. The row field separator is a space. More precisely, each row corresponds to one marker and the number of columns is twice the number of populations because each pair of numbers corresponds to each allele (or read counts for PoolSeq experiment) counts in one population.

To generate this population gene count data we will work with the PLINK file. First we have to fix the individual ID and population labels, as PLINK has pulled these directly from the VCF which has no popuation information. What we are aiming for is population in column 1, and individual ID in column 2.

cd $DIR/data

awk '{print $2,"\t",$1}' $METADATA > starling_3populations_metadata_POPIND.txt

cd $DIR/analysis/baypass

PLINK=$DIR/data/starling_3populations.plink.ped

#remove first 2 columns
cut -f 3- $PLINK > x.delete

paste $DIR/data/starling_3populations_metadata_POPIND.txt x.delete > starling_3populations.plink.ped
rm x.delete 

cp $DIR/data/starling_3populations.plink.map .
cp $DIR/data/starling_3populations.plink.log .

run the pop based allele frequency calculations

module load PLINK/1.09b6.16 

plink --file starling_3populations.plink --allow-extra-chr --freq counts --family --out starling_3populations

manipulate file so it has baypass format, numbers set for plink output file and pop number for column count

tail -n +2 starling_3populations.frq.strat | awk '{ $9 = $8 - $7 } 1' | awk '{print $7,$9}' | tr "\n" " " | sed 's/ /\n/6; P; D' > starling_3populations_baypass.txt

Now we can run Baypass by creating a slurm script baypass1_starling.sl, which should run for about 5 minutes.

#!/bin/bash -e
#SBATCH --job-name=2023_04_14.baypass1_starling.sl
#SBATCH --account=uoa02613
#SBATCH --time=00-12:00:00
#SBATCH --mem=5GB
#SBATCH --output=%x_%j.errout
#SBATCH --mail-user=katarina.stuart@auckland.ac.nz
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --profile task

module load BayPass/2.31-intel-2022a

cd /nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/baypass

i_baypass -npop 3 -gfile ./starling_3populations_baypass.txt -outprefix starling_3populations_baypass -nthreads 4

Running in R to make the anapod data. First let's quickly download the utils we need.

cd $DIR/programs
git clone https://github.com/andbeck/BayPass.git

Now let's generate some simulated data, based on the parameters calculated from our genetic data.

module load R/4.1.0-gimkl-2020a
R
setwd("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/baypass")
source("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/programs/BayPass/baypass_utils.R")

install.packages("ape")
library("ape")

install.packages("corrplot")
library("corrplot")

omega=as.matrix(read.table("starling_3populations_baypass_mat_omega.out"))
pi.beta.coef=read.table("starling_3populations_baypass_summary_beta_params.out",h=T)$Mean
bta14.data<-geno2YN("starling_3populations_baypass.txt")
simu.bta<-simulate.baypass(omega.mat=omega, nsnp=5000, sample.size=bta14.data$NN, beta.pi=pi.beta.coef,pi.maf=0,suffix="btapods")

q()

We now have the simulated genetic data, we can find the XtX statistic threshold above which we will consider genetic sites an outlier. Let's create another script in cd $DIR/analysis/baypass called baypass2_starling.sl.

#!/bin/bash -e
#SBATCH --job-name=2023_04_14.baypass2_starling.sl
#SBATCH --account=uoa02613
#SBATCH --time=00-12:00:00
#SBATCH --mem=5GB
#SBATCH --output=%x_%j.errout
#SBATCH --mail-user=katarina.stuart@auckland.ac.nz
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --profile task

module load BayPass/2.31-intel-2022a

cd /nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/baypass

i_baypass -npop 3 -gfile G.btapods  -outprefix G.btapods -nthreads 4

XtX calibration; get the pod XtX theshold

module load R/4.1.0-gimkl-2020a
R
setwd("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/baypass")
source("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/programs/BayPass/baypass_utils.R")
library("ape")
library("corrplot")

pod.xtx=read.table("G.btapods_summary_pi_xtx.out",h=T)$M_XtX

We compute the 1% threshold for the simulated neutral data.

pod.thresh=quantile(pod.xtx,probs=0.99)
pod.thresh

q()

✔️ Output
6.258372

Your values may be slightly different as the simlated data will not be identical.

Next, we filter the data for the outlier snps by identifying those above the threshold.

cat starling_3populations_baypass_summary_pi_xtx.out | awk '$4>6.258372 ' > baypass_outliers.txt

create list of SNPs in VCF, assign line numbers that can be used to find matching line numbers in outliers (SNP ID is lost in bayescan, line numbers used as signifiers).

grep -v "^#" $VCF  | cut -f1-3 | awk '{print $0"\t"NR}' > starling_3populations_SNPs.txt

List of outlier SNPS

awk 'FNR==NR{a[$1];next} (($4) in a)' baypass_outliers.txt starling_3populations_SNPs.txt | cut -f3 > baypass_outlierSNPIDs.txt
wc -l baypass_outlierSNPIDs.txt

✔️ Output
38

Low lets find SNPs that are statistically associated with wingspan. To do this we have to go back to the metadata and compute the average wingspan of each of out populations and place them in a file.

module load R/4.1.0-gimkl-2020a
R
setwd("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/baypass")
metadata <- read.table("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/data/starling_3populations_metadata.txt", sep="\t", header=FALSE)
str(metadata)
pop_metadata <- aggregate(V3 ~ V2, data = metadata, mean)
write(pop_metadata[,2], "pop_mean_wingspan.txt")

q()

✔️ Output
14.89805 19.63306 22.09655

Now we can submit a third and final baypass job baypass3_starling.sl, which will let us know which SNPs are statistically associated with wingspan.

#!/bin/bash -e
#SBATCH --job-name=2023_04_14.baypass3_starling.sl
#SBATCH --account=uoa02613
#SBATCH --time=00-12:00:00
#SBATCH --mem=5GB
#SBATCH --output=%x_%j.errout
#SBATCH --mail-user=katarina.stuart@auckland.ac.nz
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --profile task

module load BayPass/2.31-intel-2022a

cd /nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/baypass

i_baypass -npop 3 -gfile starling_3populations_baypass.txt -efile pop_mean_wingspan.txt -scalecov -auxmodel -nthreads 4 -omegafile starling_3populations_baypass_mat_omega.out -outprefix starling_3populations_baypass_wing

Next we plot the outliers. We are chosing a BF threshold of 20 dB, which indicates "Strong evidence for alternative hypothesis".

module load R/4.1.0-gimkl-2020a
R

library(ggplot2)

setwd("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/baypass")

covaux.snp.res.mass=read.table("starling_3populations_baypass_wing_summary_betai.out",h=T)
covaux.snp.xtx.mass=read.table("starling_3populations_baypass_summary_pi_xtx.out",h=T)$M_XtX

pdf("Baypass_plots.pdf")
layout(matrix(1:3,3,1))
plot(covaux.snp.res.mass$BF.dB.,xlab="Mass",ylab="BFmc (in dB)")
abline(h=20, col="red")
plot(covaux.snp.res.mass$M_Beta,xlab="SNP",ylab=expression(beta~"coefficient"))
plot(covaux.snp.xtx.mass, xlab="SNP",ylab="XtX corrected for SMS")
dev.off()

Finally, lets generate the list of phenotype-associated SNP IDs.

cat starling_3populations_baypass_wing_summary_betai.out | awk '$6>20' > starling_3populations_baypass_wing_BF20.txt
starling_3populations_baypass_wing_BF20.txt

wc -l starling_3populations_baypass_wing_BF20.txt

✔️ Output
48

Filtering the data sets for SNPS above BFmc threshold - these are out outlier SNPs that are associated with wingspan.

awk 'FNR==NR{a[$2];next} (($4) in a)' starling_3populations_baypass_wing_BF20.txt starling_3populations_SNPs.txt | cut -f3 > baypass_wingspan_outlierSNPIDs.txt

comm -12 <(sort baypass_wingspan_outlierSNPIDs.txt) <(sort baypass_outlierSNPIDs.txt) > double_outliers.txt

wc -l double_outliers.txt

✔️ Output
18

Comparing Outlier Overlap

Now we wil make an upset plot to compare the overlap of outliers detected over our different methods.

cd $DIR/analysis/summary
ln -s $DIR/analysis/pcadapt/pcadapt_outlierSNPIDs.txt .
ln -s $DIR/analysis/vcftools_fst/vcftoolsfst_outlierSNPIDs.txt .
ln -s $DIR/analysis/bayescan/bayescan_outlierSNPIDs.txt .
ln -s $DIR/analysis/baypass/baypass_outlierSNPIDs.txt .
ln -s $DIR/analysis/baypass/baypass_wingspan_outlierSNPIDs.txt .

Now we have a copy of all the SNP IDs for each of out outlier analysis, let's use the R package UpSetR to plot the overlap.

module load R/4.1.0-gimkl-2020a
R
setwd("/nesi/nobackup/uoa02613/kstuart_projects/outlier_analysis/analysis/summary")

pcadapt<-scan("pcadapt_outlierSNPIDs.txt", what = "", quiet=TRUE)
vcftools<-scan("vcftoolsfst_outlierSNPIDs.txt", what = "", quiet=TRUE)
bayescan<-scan("bayescan_outlierSNPIDs.txt", what = "", quiet=TRUE)
baypass<-scan("baypass_outlierSNPIDs.txt", what = "", quiet=TRUE)
baypass_wing<-scan("baypass_wingspan_outlierSNPIDs.txt", what = "", quiet=TRUE)  

all_outliers <- list(PCAdapt = pcadapt, VCFtools = vcftools, Bayescan = bayescan, Baypass = baypass, BaypassWing = baypass_wing)

install.packages("UpSetR")
library(UpSetR)

pdf("All_outliers_upsetplot.pdf")
upset(fromList(all_outliers), order.by = "freq", empty.intersections = "on", point.size = 3.5, line.size = 2, mainbar.y.label = "Outlier Count", sets.x.label = "Total Outliers", text.scale = c(1.3, 1.3, 1, 1, 2, 1.3), number.angles = 30, nintersects = 11 ) 
dev.off()

Let's have a discussion about the overlap between these five outlier groups.

And if you want to get really fancy you may event want to plot your variants at their location around your genome in a circle plot!

Workshop End discussion

A brief period of group discussion on one of the days about research question framing + grant integration

Outlier Analysis Metanalysis

This workshop was concieved as part of a larger project. The goal is to compile as many genomics data set with identified outliers as possible. While identifying outliers is an interesting and often necessary component of singular genomics projects, there is also a lot to be gained from looking at patterns across neutral versus outlier genetic variants across many different projects and taxa.

One of the goals of this project is to compile a collection of genetic data sets information. Most of these will come from prepublished work, but attendees of this workshop may opt in their data sets should they want to have their data invovled in this project.

Ideally for the metanalysis we need:
VCF file with all genetic variants (SNPS)
List of which variants are outliers, and what type of outliers these are
OPTIONAL but PREFERED: Reference genome that has been annotated and well scaffolded

Funding

Thank you to the AES ERC Networking Grant Scheme and GSA for funding this project.

Footnotes

^A it is very important then to account for any population substructure. There are many different ways to approach this: refer to introduction slides for some guidance.
^B I will say population for simplicity throughout this vignette. However, equally we can test for differences between sample sites, subpopulations, and other types of groupings. What counts as one 'group' of organisms will be dependent on your study system or study question.
^C You may also want to consider linkage blocks.

Name		Name	Last commit message	Last commit date
Latest commit History 130 Commits
.github/workflows		.github/workflows
Meta-analysis_project		Meta-analysis_project
docs		docs
example_scripts		example_scripts
public		public
slides		slides
workshop_files		workshop_files
.gitignore		.gitignore
README.md		README.md
mkdocs.yml		mkdocs.yml

GenomicsAotearoa/Outlier_Analysis_Workshop

Folders and files

Latest commit

History

Repository files navigation

Outlier Analysis Workshop

Schedule

A brief introduction to Genetic Outlier and Association Analysis

Define you working directory for this project, and the VCF file location:

Project data

PCAdapt

VCFtools windowed Fst

Bayescan

BayPass

Comparing Outlier Overlap

Workshop End discussion

Outlier Analysis Metanalysis

Funding

Footnotes

About

Topics

Resources

Stars

Watchers

Forks

Languages