Skip to content

Tutorial

Samuel Hamann edited this page Feb 3, 2020 · 4 revisions

Welcome! This is a short guide to population genetics analysis using ANGSD-wrapper. For additional information on how each wrapper is run, please refer to the Methods sidebar. We will be using a test data set containing sequences from Zea mays subsp. parviglumis, Zea mays subsp. mays, Zea mays subsp. mexicana, and Tripsacum dactylodies. For help with installing ANGSD-wrapper, please see the installation page. Note that we assume the user is using bash for this tutorial. Using zsh is possible, but you have to modify the path to point to the ANGSD-wrapper executable in your zsh configuration.

Setup Test Data

To download and set up a directory with test data and example configuration files, we run the following command:

./angsd-wrapper setup data

These data and configuration files are located in the Example_Data directory. Finally, ANGSD-wrapper will be installed system-wide so that it can be used from any working directory. To make sure ANGSD-wrapper installed correctly, run angsd-wrapper, without the ./ that we used before.

In the Example_Data directory, there are five directories: three for the three Zea samples that we will be running with and one for the reference and ancestral sequences. The fifth directory contains example configuration files that have been pre-filled. These will work for the tutorial, but we recommend following along to get practice creating your own configuration files.

Directory Contents File names
Configuration_Files Example configuration files pre-filled for all three Zea samples for the following methods:
  • Site Frequency Spectrum
  • Thetas Estimations
  • Admixture Analysis
  • Principal Component Analysis
In addition, there is an example Common_Config for each Zea sample
  • Maize_Example_Admixture_Config
  • Maize_Example_Common_Config
  • Maize_Example_Principal_Component_Analysis_Config
  • Maize_Example_Site_Frequency_Spectrum_Config
  • Maize_Example_Thetas_Config
  • Mexicana_Example_Admixture_Config
  • Mexicana_Example_Common_Config
  • Mexicana_Example_Principal_Component_Analysis_Config
  • Mexicana_Example_Site_Frequency_Spectrum_Config
  • Mexicana_Example_Thetas_Config
  • Teosinte_Example_Admixture_Config
  • Teosinte_Example_Common_Config
  • Teosinte_Example_Principal_Component_Analysis_Config
  • Teosinte_Example_Site_Frequency_Spectrum_Config
  • Teosinte_Example_Thetas_Config
Maize
  • BAM files for Zea mays subsp. mays
  • Indexed BAM files
  • Inbreeding coefficients for Zea mays subsp. mays
  • A regions file
  • Sample list
  • All .bam files
  • All .bai files
  • Maize_Inbreeding.indF
  • Maize_Regions.txt
  • Maize_Samples.txt
Mexicana
  • BAM files for Zea mexicana
  • Indexed BAM files
  • Inbreeding coefficients for Zea mexicana
  • A regions file
  • Sample list
  • All .bam files
  • All .bai files
  • Mexicana_Inbreeding.indF
  • Mexicana_Regions.txt
  • Mexicana_Samples.txt
Sequences
  • Ancestral Tripsacum dactylodies sequence
  • Reference Zea mays sequence
  • Indexed FASTA files
  • Tripsacum_TDD39103.fa
  • Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa
  • All .fai files
Teosinte
  • BAM files for Zea mays subsp. parviglumis
  • Indexed BAM files
  • Inbreeding coefficients for Zea mays subsp. parviglumis
  • A regions file
  • Sample list
  • All .bam files
  • All .bai files
  • Teosinte_Inbreeding.indF
  • Teosinte_Regions.txt
  • Teosinte_Samples.txt

Note: BAM Files MUST Have an @HD Header Line

Some programs, when generating BAM files, will not include the @HD header line. To see if you have this line, use SAMTools to check the header for your BAM files:

samtools view -H <name of BAM file> | head -1

The @HD header line should be the first line that pops up; if you don't see it, this Gist will add one for you.

ANGSD-wrapper has many different routines, or wrappers, that it can perform on a given dataset; we will be working with the Site Frequency Spectrum (SFS), Thetas Estimator, Admixture Analysis, and Principal Component Analysis (PCA) routines for this tutorial. To see all available wrappers, run angsd-wrapper without any arguments.

We will also be graphing our results using a Shiny web app. All analyses should be done using a supercomputer-like device, at least 32 GB of RAM, and all graphing should be done using a computer with a graphical user interface. If you have access to a supercomputer cluster, we recommend setting up ANGSD-wrapper on both the cluster for analysis and local machine for graphing.

Configuring ANGSD-wrapper with the Common_Config file

Note: There are pre-filled example configuration files in the Example_Data/Configuration_Files directory. If you would like to use these files instead of creating your own, you may do so, though we strongly recommend practicing creating your own configuration files

ANGSD-wrapper uses configuration files to figure out where the data is and what options should be passed to ANGSD and other dependencies. There is one configuration file per wrapper included with angsd-wrapper, as well as a common configuration file (Common_Config) that can be used by multiple wrappers. For this example, we will configure ANGSD-wrapper to analyze the Zea mays subsp. mays samples. All of these are located in the Configuration_Files directory; we recommend copying this directory to another directory so that there is always a clean copy of the configuration files available. In this case, starting in the angsd-wrapper directory, we will copy the Configuration_Files directory into the Maize directory inside the Example_Data directory using the following command:

cp -r Configuration_Files/ Example_Data/Maize/

Note: A Word About Configuration Files

Each wrapper-specific configuration file is split into three parts: the COMMON definition, the 'not-using-common' section, and the wrapper-specific variables section. If a wrapper utilizes the Common definition, it will always be on line 10. The 'not-using-common' section is blocked off by 94 hash marks (#). If you are not using the Common_Config file, please fill out the variable definitions in this section. Since we're using Common_Config, we can skip these lines. Finally, the wrapper-specific section includes any other variable definitions as well as parameters for the specific wrapper. If the Common_Config files are used, they will overwrite anything specified in the wrapper-specific section. This is to ensure similarity between analyses that make use of the Common_Config file.


Now, let's go into the Maize directory and figure out the full path to this directory using pwd.

cd Example_Data/Maize
pwd

This will output a string that starts with /home/; go ahead and copy everything following the forward slash after your user name. For example, if we get /home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize as our output, we only need /software/angsd-wrapper/Example_Data/Maize.

Now, we'll go find our configuration files in the Configuration_Files directory:

cd Configuration_Files/

Because we're using multiple wrappers in this tutorial, we'll use the Common_Config file to hold variables that will be used across all methods. Open Common_Config in your favorite text editor, such as Vim or Emacs.

First, we need to define a list of samples. On line 10 of Common_Config, there's a place to define this sample list. If we remember back in our Maize directory, our sample list is called Maize_Samples.txt

So, to tell ANGSD-wrapper where our sample list is, we will use our Maize example with the directory location being /home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize. We'll make sure line 10 looks like this:

SAMPLE_LIST=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Samples.txt

If you're using the dev branch, it looks like this:

GROUP_SAMPLES=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Samples.txt

We use ${HOME} to help ANGSD-wrapper find your files, if we used /home, some systems will error out, saying that the directory does not exist.


Adjust the /software/angsd-wrapper/Example_Data part to whatever you copied from your output.

Next, we need our list of inbreeding coefficients. This is called Maize_Inbreeding.indF, to run this we tell ANGSD-wrapper where this file is on line 13 of our Common_Config file:

SAMPLE_INBREEDING=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Inbreeding.indF

If you're using the dev branch, it looks like this:

GROUP_INBREEDING=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Inbreeding.indF

Lines 17 and 20 ask for our ancestral and reference sequences. These are Tripsacum_TDD39103.fa and Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa, respectively, found in the Sequences directory. In the Common_Config file, we'd enter the following on their respective lines:

ANC_SEQ=${HOME}/software/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
REF_SEQ=${HOME}/software/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa

Now we need to set up our outdirectory structure. We use two variables to define this: PROJECT and SCRATCH. All output files will be placed in $SCRATCH/$PROJECT/<name_of_program>; for example, if we set SCRATCH to be "${HOME}/scratch" and PROJECT to be "Maize" and calculated a site frequency spectrum, our outdirectory would be ${HOME}/scratch/Maize/SFS.

If we used the same SCRATCH and PROJECT assignments and estimated Thetas, our outdirectory would be ${HOME}/scratch/Maize/Thetas. The outdirectory structure is generated automatically, making any directory within the structure that doesn't already exist, so it is not necessary to make these directories before hand.

Let's set SCRATCH to be "${HOME}/scratch" and PROJECT to be "Maize"; we define these two variables on lines 23, for PROJECT, and 28, for SCRATCH:

PROJECT=Maize
SCRATCH=${HOME}/scratch

Finally, we need to specify a regions file for ANGSD-wrapper. While we can run ANGSD-wrapper without a regions file, it becomes very computationally expensive and takes much longer. If you would like to generate a regions file from an existing BED file, this Regions File Format page will show you how to convert a BED file to a valid regions file.

We have a regions file in our Example_Data directory called Maize_Regions.txt, let's tell ANGSD-wrapper where this is on line 32 of Common_Config

REGIONS=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Regions.txt

Now we're ready to run ANGSD-wrapper to calculate a site frequency spectrum, estimate Thetas, perform an analyze admixture, and run a principal component analysis. Close out of Common_Config, and be sure to save your changes. We're going to stay in the Configuration_Files directory for now; we're going to need the full path for this directory, which we obtain with the following command:

pwd

Again, we'll get an output starting with /home/ and we only need the part after the second forward slash. Using our directory structure from before, our output would be /home/software/angsd-wrapper/Example_Data/Maize/Configuration_Files and we need /software/angsd-wrapper/Example_Data/Maize/Configuration_Files

Site Frequency Spectrum

Each wrapper function has its own configuration file associated with it. To run the site frequency spectrum, we need the Site_Frequency_Spectrum_Config file. Open this up in your favorite text editor.

In Site_Frequency_Spectrum_Config, we need to tell ANGSD-wrapper where our Common_Config file is. This definition is on line 10:

COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config

Remember to adjust the /software/angsd-wrapper/Example_Data/Maize/Configuration_Files part for your own directory structure.

Most of the other variables and parameters are set up to run smoothly. For now, we're going to set OVERRIDE to be true, in case we run this another time and want updated results; we change this on line 53:

OVERRIDE=true

Note: How ANGSD-wrapper Knows What to do

ANGSD-wrapper has several functions, or wrappers, built into it. These are predefined and very specific. The syntax for running ANGSD-wrapper is as follows:

angsd-wrapper <wrapper> <configuration file>

Where <wrapper> is one of the wrappers that ANGSD-wrapper can perform and <configuration file> is the full path to the configuration file we set up for it. To see a full list of wrappers that ANGSD-wrapper has and how to call them, run the following command:

angsd-wrapper

This will display a usage message with the wrappers ANGSD-wrapper has and how to call them. Capitalization and spelling are very important with ANGSD-wrapper; you must type out what you see in the usage message to get ANGSD-wrapper to run. Also, you don't have to use our presupplied configuration files with ANGSD-wrapper, but you do need to have all of the variable definitions that we have supplied.


Now, lets calculate a site frequency spectrum using ANGSD-wrapper:

angsd-wrapper SFS ./Site_Frequency_Spectrum_Config

Once this finishes, our output files will be in the outdirectory we specified, ${HOME}/scratch/Maize/SFS, let's go there and look at our files:

cd ${HOME}/scratch/Maize/SFS/
ls

The following are the output files we should see in the SFS directory:

  • Maize_DerivedSFS.graph.me
  • Maize_SFSOut.arg
  • Maize_SFSOut.geno.gz
  • Maize_SFSOut.mafs.gz
  • Maize_SFSOut.saf.gz
  • Maize_SFSOut.saf.idx
  • Maize_SFSOut.saf.pos.gz

We'll need the Maize_DerivedSFS.graphe.me file for our Thetas estimation and graphing later on.

Thetas Estimation

Now, we need to go back to our Configuration_Files directory so we can set up ANGSD-wrapper to estimate Thetas values for us. We use the cd command to do this:

cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/

Open up Thetas_Config in your favorite text editor. We have three variables we need to define in this configuration file. First, we need to tell ANGSD-wrapper where our Common_Config file is; this will be the same as what we put in our Site_Frequency_Spectrum_Config file. On line 10, we'll put:

COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config

Next, we need to specify our pest (prior estimation) file. This file comes from our site frequency spectrum; in this case, our file is Maize_DerivedSFS.graph.me. We need to specify this on line 41 of Thetas_Config:

PEST=${HOME}/scratch/Maize/SFS/Maize_DerivedSFS.graph.me

Finally, let's set OVERRIDE to be true again; we do this on line 56:

OVERRIDE=true

Now, we can estimate Thetas values using ANGSD-wrapper; we do this with the following command:

angsd-wrapper Thetas ./Thetas_Config

Our output files will be in ${HOME}/scratch/Maize/Thetas, let's go there and look at our files

cd ${HOME}/scratch/Maize/Thetas/
ls

Here we have output files we should see in the Thetas directory:

  • Maize_Diversity.arg
  • Maize_Diversity.mafs.gz
  • Maize_Diversity.thetas.gz
  • Maize_Diversity.thetas.gz.bin
  • Maize_Diversity.thetas.gz.idx
  • Maize_Diversity.thetas.gz.pestPG
  • Maize_Diversity.thetas.graph.me

Genotype Likelihoods

Let's go back out to Configuration_Files directory to set up our admixture analysis:

First we need to get genotype likelihoods from the Genotypes command. Open the Genotype_Config file and add in the path to the Common_Config file:

COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config

Then run the Genotypes command to get out the genotype likelihoods (in beagle format).

angsd-wrapper Genotypes ./Genotype_Config

Admixture Analysis

Now that this is done, we can set up the admixture configuration file:

cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/

We need to edit variables in Admixture_Config to tell ANGSD-wrapper where everything is for the admixture analysis. Open up Admixture_Analysis with your favorite text editor. On line 10, we need to specify where our Common_Config file is:

COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config

The only other variable we need to specify is our likelihood file. On line 26 of Admixture_Config, we need to tell ANGSD-wrapper where this likelihood file is:

LIKELIHOOD=${HOME}/scratch/Maize/GenotypeLikelihoods/Maize_SFSOut.beagle.gz

Now, let's run the admixture analysis. This is done with the following command:

angsd-wrapper Admixture ./Admixture_Config

Note: A Word About Convergence

There are several commands that run optimization methods that may end up converging upon a local maxima. In order to make sure the optimization finds the global maximum, we recommend running the program several times from different starting points. For Admixture, this just means running the program several times because the starting point is chosen randomly.


Our output files will be in ${HOME}/scratch/Maize/Admixture, let's go there an look at our files.

cd ${HOME}/scratch/Maize/Admixture/
ls

Here, we see some more output files in the Admixture directory:

  • Maize.2.filter
  • Maize.2.fopt.gz
  • Maize.2.log
  • Maize.2.qopt.graph.me
  • Maize.3.filter
  • Maize.3.fopt.gz
  • Maize.3.log
  • Maize.3.qopt.graph.me
  • Maize.4.filter
  • Maize.4.fopt.gz
  • Maize.4.log
  • Maize.4.qopt.graph.me
  • Maize.5.filter
  • Maize.5.fopt.gz
  • Maize.5.log
  • Maize.5.qopt.graph.me

Where each number in the filename correlates with the number of K ancestral populations graphed.

Principal Component Analysis

Let's go back out to Configuration_Files directory to set up our principal component analysis (PCA):

cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/

We only need to tell ANGSD-wrapper where our Common_Config file is, everything else for the PCA is taken care of. We do this on line 10 of Principal_Component_Analysis_Config:

COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config

Now, let's run the PCA. We use the following command to do this:

angsd-wrapper PCA ./Principal_Component_Analysis_Config

Our output files will be in ${HOME}/scratch/Maize/PCA, let's go there and look at our files.

cd ${HOME}/scratch/Maize/PCA/
ls

Here are the output files we should see in the PCA directory:

  • Maize_PCA.arg
  • Maize_PCA.covar
  • Maize_PCA.geno
  • Maize_PCA.mafs.gz

Graphing

ANGSD-wrapper comes with a visualization package, based off of Rstudio's Shiny platform. To use this, we need to be on a machine that has a graphical user interface (GUI) and a web browser. If you used ANGSD-wrapper on a high performance computing system, please transfer your files to another machine with a GUI so we can utilize the visualization package. You may need to setup ANGSD-wrapper again.

The files we need for graphing are:

  • Maize_DerivedSFS.graph.me
  • Maize.pestPG
  • All .qopt.graph.me files
  • Maize_PCA.covar

Starting Shiny

Use git clone to clone ANGSD-wrapper in a terminal window on your local machine.

git clone https://github.com/ANGSD-wrapper/angsd-wrapper.git
cd angsd-wrapper

Setup ANGSD-wrapper on your local machine the same way you set it up at the beginning of this tutorial. You will not need to setup the data again, only the dependencies.

./angsd-wrapper setup dependencies
source ~/.bash_profile

Once setup is complete, you can now start Shiny.

To start the Shiny graphing interface run:

angsd-wrapper shiny graphing

Additional help text is available on the side panels within Shiny.