This repository contains all analysis code for two related projects on the baboons of Kafue National Park. Baboon DNA was prepared using double-digest RADseq and used for two related projects. The citations for the papers are below (citations will be updated as their statuses change):
Chiou, K. L., Bergey, C. M., Burrell, A. S., Disotell, T. R., Rogers, J., Jolly, C. J., & Phillips-Conroy, J. E. Genomic signatures of extreme body size divergence in baboons.
Chiou, K. L., Bergey, C. M., Burrell, A. S., Disotell, T. R., Rogers, J., Jolly, C. J., & Phillips-Conroy, J. E. Genome-wide ancestry and introgression in a Zambian baboon hybrid zone.
Some scripts in this pipeline borrow heavily from the awash-dopamine repository, written by coauthor Christina Bergey.
The analysis pipeline here begins with one pair of FASTQ files for each animal in the study. This should be most convenient for those wishing to replicate this analysis, as the files are the same as the ones available on SRA.
For those wishing to recreate the steps for demultiplexing and, in some cases, merging libraries across independent sequencing runs, raw reads are available upon request (output from bcl2fastq). Demultiplexing code is available in the demultiplex repository, which was originally written by Christina Bergey.
The code throughout is documented as though all code is run line by line, with all programs available in the $PATH
. In reality, the majority of the code was run on the NYU high performance computing cluster, particularly those steps requiring extra time, memory, or processors. Most of the programs were available as preinstalled modules maintained by the HPC staff. Here, the modules are omitted and instead software links and version numbers are provided to help users find the appropriate versions. Similarly, for R packages, you will need to check each script and install required libraries using install.packages
or Bioconductor.
If you find yourself unable to recreate any of the steps in this pipeline, start by checking version numbers and consulting the documentation for the relevant programs. If you still have questions, feel free to contact me and I will do my best to help.
Contact: Kenneth Chiou (website) kchiou [at] uw [dot] edu
Mapping and genotyping with the NGS-map pipeline
To begin, ensure that the NGS-map repository is up-to-date (written by Christina Bergey).
git submodule update NGS-map
Reads downloaded from SRA (BioProject PRJNA486659) will have different file names from those used here. To ensure that the pipeline works properly, you may need to rename the reads to those indicated in the table below. Once finished, place reads in the NGS-map/data
folder.
SRA accession | Read 1 filename | Read 2 filename |
---|---|---|
SRR7717351 | BZ06_218.read1.fastq | BZ06_218.read2.fastq |
SRR7717350 | BZ06_220.read1.fastq | BZ06_220.read2.fastq |
SRR7717353 | BZ06_221.read1.fastq | BZ06_221.read2.fastq |
SRR7717352 | BZ06_224.read1.fastq | BZ06_224.read2.fastq |
SRR7717355 | BZ06_225.read1.fastq | BZ06_225.read2.fastq |
SRR7717354 | BZ06_227.read1.fastq | BZ06_227.read2.fastq |
SRR7717357 | BZ07_001.read1.fastq | BZ07_001.read2.fastq |
SRR7717356 | BZ07_004.read1.fastq | BZ07_004.read2.fastq |
SRR7717359 | BZ07_005.read1.fastq | BZ07_005.read2.fastq |
SRR7717358 | BZ07_007.read1.fastq | BZ07_007.read2.fastq |
SRR7717328 | BZ07_029.read1.fastq | BZ07_029.read2.fastq |
SRR7717327 | BZ07_030.read1.fastq | BZ07_030.read2.fastq |
SRR7717326 | BZ07_032.read1.fastq | BZ07_032.read2.fastq |
SRR7717325 | BZ07_034.read1.fastq | BZ07_034.read2.fastq |
SRR7717332 | BZ07_035.read1.fastq | BZ07_035.read2.fastq |
SRR7717331 | BZ07_039.read1.fastq | BZ07_039.read2.fastq |
SRR7717330 | BZ07_041.read1.fastq | BZ07_041.read2.fastq |
SRR7717329 | BZ07_042.read1.fastq | BZ07_042.read2.fastq |
SRR7717324 | BZ07_045.read1.fastq | BZ07_045.read2.fastq |
SRR7717323 | BZ07_047.read1.fastq | BZ07_047.read2.fastq |
SRR7717395 | BZ07_100.read1.fastq | BZ07_100.read2.fastq |
SRR7717396 | BZ11_001.read1.fastq | BZ11_001.read2.fastq |
SRR7717393 | BZ11_002.read1.fastq | BZ11_002.read2.fastq |
SRR7717394 | BZ11_003.read1.fastq | BZ11_003.read2.fastq |
SRR7717399 | BZ11_004.read1.fastq | BZ11_004.read2.fastq |
SRR7717400 | BZ11_005.read1.fastq | BZ11_005.read2.fastq |
SRR7717397 | BZ11_006.read1.fastq | BZ11_006.read2.fastq |
SRR7717398 | BZ11_007.read1.fastq | BZ11_007.read2.fastq |
SRR7717401 | BZ11_008.read1.fastq | BZ11_008.read2.fastq |
SRR7717402 | BZ11_009.read1.fastq | BZ11_009.read2.fastq |
SRR7717384 | BZ11_010.read1.fastq | BZ11_010.read2.fastq |
SRR7717383 | BZ11_011.read1.fastq | BZ11_011.read2.fastq |
SRR7717386 | BZ11_012.read1.fastq | BZ11_012.read2.fastq |
SRR7717385 | BZ11_013.read1.fastq | BZ11_013.read2.fastq |
SRR7717388 | BZ11_014.read1.fastq | BZ11_014.read2.fastq |
SRR7717387 | BZ11_015.read1.fastq | BZ11_015.read2.fastq |
SRR7717390 | BZ11_016.read1.fastq | BZ11_016.read2.fastq |
SRR7717389 | BZ11_017.read1.fastq | BZ11_017.read2.fastq |
SRR7717392 | BZ11_018.read1.fastq | BZ11_018.read2.fastq |
SRR7717391 | BZ11_019.read1.fastq | BZ11_019.read2.fastq |
SRR7717293 | BZ11_020.read1.fastq | BZ11_020.read2.fastq |
SRR7717294 | BZ11_021.read1.fastq | BZ11_021.read2.fastq |
SRR7717295 | BZ11_022.read1.fastq | BZ11_022.read2.fastq |
SRR7717296 | BZ11_023.read1.fastq | BZ11_023.read2.fastq |
SRR7717297 | BZ11_024.read1.fastq | BZ11_024.read2.fastq |
SRR7717298 | BZ11_025.read1.fastq | BZ11_025.read2.fastq |
SRR7717299 | BZ11_026.read1.fastq | BZ11_026.read2.fastq |
SRR7717300 | BZ11_028.read1.fastq | BZ11_028.read2.fastq |
SRR7717301 | BZ11_029.read1.fastq | BZ11_029.read2.fastq |
SRR7717302 | BZ11_030.read1.fastq | BZ11_030.read2.fastq |
SRR7717281 | BZ11_031.read1.fastq | BZ11_031.read2.fastq |
SRR7717280 | BZ11_032.read1.fastq | BZ11_032.read2.fastq |
SRR7717279 | BZ11_033.read1.fastq | BZ11_033.read2.fastq |
SRR7717278 | BZ11_034.read1.fastq | BZ11_034.read2.fastq |
SRR7717277 | BZ11_035.read1.fastq | BZ11_035.read2.fastq |
SRR7717276 | BZ11_036.read1.fastq | BZ11_036.read2.fastq |
SRR7717275 | BZ11_037.read1.fastq | BZ11_037.read2.fastq |
SRR7717274 | BZ11_038.read1.fastq | BZ11_038.read2.fastq |
SRR7717283 | BZ11_039.read1.fastq | BZ11_039.read2.fastq |
SRR7717282 | BZ11_040.read1.fastq | BZ11_040.read2.fastq |
SRR7717321 | BZ11_041.read1.fastq | BZ11_041.read2.fastq |
SRR7717322 | BZ11_042.read1.fastq | BZ11_042.read2.fastq |
SRR7717319 | BZ11_043.read1.fastq | BZ11_043.read2.fastq |
SRR7717320 | BZ11_045.read1.fastq | BZ11_045.read2.fastq |
SRR7717317 | BZ11_046.read1.fastq | BZ11_046.read2.fastq |
SRR7717318 | BZ11_047.read1.fastq | BZ11_047.read2.fastq |
SRR7717315 | BZ11_048.read1.fastq | BZ11_048.read2.fastq |
SRR7717316 | BZ11_050.read1.fastq | BZ11_050.read2.fastq |
SRR7717313 | BZ11_051.read1.fastq | BZ11_051.read2.fastq |
SRR7717314 | BZ11_052.read1.fastq | BZ11_052.read2.fastq |
SRR7717310 | BZ11_053.read1.fastq | BZ11_053.read2.fastq |
SRR7717309 | BZ11_054.read1.fastq | BZ11_054.read2.fastq |
SRR7717312 | BZ11_056.read1.fastq | BZ11_056.read2.fastq |
SRR7717311 | BZ11_057.read1.fastq | BZ11_057.read2.fastq |
SRR7717306 | BZ11_058.read1.fastq | BZ11_058.read2.fastq |
SRR7717305 | BZ11_059.read1.fastq | BZ11_059.read2.fastq |
SRR7717308 | BZ11_061.read1.fastq | BZ11_061.read2.fastq |
SRR7717307 | BZ11_062.read1.fastq | BZ11_062.read2.fastq |
SRR7717304 | BZ11_063.read1.fastq | BZ11_063.read2.fastq |
SRR7717303 | BZ11_064.read1.fastq | BZ11_064.read2.fastq |
SRR7717368 | BZ11_065.read1.fastq | BZ11_065.read2.fastq |
SRR7717369 | BZ11_066.read1.fastq | BZ11_066.read2.fastq |
SRR7717370 | BZ11_067.read1.fastq | BZ11_067.read2.fastq |
SRR7717371 | BZ11_068.read1.fastq | BZ11_068.read2.fastq |
SRR7717364 | BZ11_069.read1.fastq | BZ11_069.read2.fastq |
SRR7717365 | BZ11_070.read1.fastq | BZ11_070.read2.fastq |
SRR7717366 | BZ11_071.read1.fastq | BZ11_071.read2.fastq |
SRR7717367 | BZ11_072.read1.fastq | BZ11_072.read2.fastq |
SRR7717361 | BZ11_073.read1.fastq | BZ11_073.read2.fastq |
SRR7717362 | BZ11_074.read1.fastq | BZ11_074.read2.fastq |
SRR7717340 | BZ11_075.read1.fastq | BZ11_075.read2.fastq |
SRR7717334 | BZ11_076.read1.fastq | BZ11_076.read2.fastq |
SRR7717335 | BZ12_001.read1.fastq | BZ12_001.read2.fastq |
SRR7717333 | BZ12_002.read1.fastq | BZ12_002.read2.fastq |
SRR7717341 | BZ12_003.read1.fastq | BZ12_003.read2.fastq |
SRR7717374 | BZ12_004.read1.fastq | BZ12_004.read2.fastq |
SRR7717339 | BZ12_005.read1.fastq | BZ12_005.read2.fastq |
SRR7717338 | BZ12_006.read1.fastq | BZ12_006.read2.fastq |
SRR7717363 | BZ12_007.read1.fastq | BZ12_007.read2.fastq |
SRR7717360 | BZ12_008.read1.fastq | BZ12_008.read2.fastq |
SRR7717344 | BZ12_009.read1.fastq | BZ12_009.read2.fastq |
SRR7717345 | BZ12_010.read1.fastq | BZ12_010.read2.fastq |
SRR7717342 | BZ12_011.read1.fastq | BZ12_011.read2.fastq |
SRR7717343 | BZ12_012.read1.fastq | BZ12_012.read2.fastq |
SRR7717348 | BZ12_030.read1.fastq | BZ12_030.read2.fastq |
SRR7717349 | BZ12_031.read1.fastq | BZ12_031.read2.fastq |
SRR7717346 | BZ12_032.read1.fastq | BZ12_032.read2.fastq |
SRR7717347 | BZ12_033.read1.fastq | BZ12_033.read2.fastq |
SRR7717336 | Chiou_14_001.read1.fastq | Chiou_14_001.read2.fastq |
SRR7717337 | Chiou_14_003.read1.fastq | Chiou_14_003.read2.fastq |
SRR7717376 | Chiou_14_004.read1.fastq | Chiou_14_004.read2.fastq |
SRR7717375 | Chiou_14_005.read1.fastq | Chiou_14_005.read2.fastq |
SRR7717378 | Chiou_14_030.read1.fastq | Chiou_14_030.read2.fastq |
SRR7717377 | Chiou_14_036.read1.fastq | Chiou_14_036.read2.fastq |
SRR7717380 | Chiou_14_039.read1.fastq | Chiou_14_039.read2.fastq |
SRR7717379 | Chiou_14_041.read1.fastq | Chiou_14_041.read2.fastq |
SRR7717382 | Chiou_14_042.read1.fastq | Chiou_14_042.read2.fastq |
SRR7717381 | Chiou_14_044.read1.fastq | Chiou_14_044.read2.fastq |
SRR7717373 | Chiou_14_050.read1.fastq | Chiou_14_050.read2.fastq |
SRR7717372 | Chiou_14_054.read1.fastq | Chiou_14_054.read2.fastq |
SRR7717284 | Chiou_14_056.read1.fastq | Chiou_14_056.read2.fastq |
SRR7717285 | Chiou_14_057.read1.fastq | Chiou_14_057.read2.fastq |
SRR7717286 | Chiou_14_058.read1.fastq | Chiou_14_058.read2.fastq |
SRR7717287 | Chiou_14_059.read1.fastq | Chiou_14_059.read2.fastq |
SRR7717288 | Chiou_14_065.read1.fastq | Chiou_14_065.read2.fastq |
SRR7717289 | Chiou_14_069.read1.fastq | Chiou_14_069.read2.fastq |
SRR7717290 | Chiou_15_003.read1.fastq | Chiou_15_003.read2.fastq |
SRR7717291 | Chiou_15_004.read1.fastq | Chiou_15_004.read2.fastq |
SRR7717292 | Chiou_15_005.read1.fastq | Chiou_15_005.read2.fastq |
Proceed to map samples and call and filter variants according to the NGS-map pipeline. The following outline may be helpful but will not work as is--some components of NGS-map will need to be manually configured (particularly paths to software).
- Prepare sample list and save to a file.
ls -1 NGS-map/data/*.fastq | \
sed 's/NGS-map\/data\/\([A-Za-z0-9_]*\)\.read[1-2].fastq/\1/' | \
sort -u > NGS-map/data/individual_list.txt
- Change to
NGS-map
directory.
cd NGS-map
- Download and index the baboon reference genome (Panu2.0).
genomes/download_papAnu2.sh
scripts/index_genome.sh genomes/papAnu2/papAnu2.fa
- Map samples to the reference genome using BWA mem (v0.7.15).
qsub -t 1-129 pbs/call_make.pbs
- Call genotypes using GATK (v3.5.0).
qsub -t 1-20 pbs/call_gatk_genotyper.pbs
- Filter genotype calls using GATK (v3.5.0).
qsub -t 1-20 pbs/filter_gatk_snps.pbs
- Change to directory containing variant calls.
cd baboon_snps
- Concatenate SNP calls from autosomes using VCFtools (v0.1.14).
vcf-concat $(echo chr{1..20}.pass.snp.vcf) > kafue.pass.snp.noX.vcf
- Change to base project directory.
cd ../..
Following the process above, transfer results files to the data
folder (at the base level of this repository)
mv NGS-map/baboon_snps/kafue.pass.snp.noX.vcf data
- Download Repeatmasker and Tandem Repeats Finder files for the baboon reference genome (Panu2.0).
scripts/download_repeats.sh
scripts/vcf_to_ped.sh kafue.pass.snp.noX
- Calculate stats on missingness using PLINK (v1.90).
plink --noweb --double-id --vcf data/kafue.pass.snp.noX.vcf --missing --out reports/kafue.pass.snp.noX
- Find repeats based on the Repeatmasker and Tandem Repeats Finder files and the PLINK
.map
file generated in the previous step. BEDtools (v2.26.0) must be installed and in the$PATH
.
scripts/repeat_finder.pl
- Filter dataset to remove repeats identified in the previous step as well as samples and loci with excessive missingness using PLINK (v1.90). This will produce two datasets: the larger
results/all.cleaned.*
dataset and the more stringentresults/strict.cleaned.*
dataset (the latter is not used in subsequent steps).
scripts/make_clean_dataset.sh
- Calculate SNPs in linkage disequilibrium and remove them from the dataset using PLINK (v1.90). This will produce the
results/all.cleaned.LDpruned.*
dataset.
scripts/find_and_remove_ld.sh
- Perform multidimensional scaling and principal components analyses using PLINK (v1.90). The scripts also plot respective results using R (v3.4.2).
scripts/mds_plot.sh
scripts/pca_plot.sh
- Visualize library mapping statistics from individuals passing filters using R (v3.4.2). This step will read the
results/mds.txt
file generated in the previous step.
scripts/plot_mapping.R
-
Raster and vector imagery for Zambia and Kafue National Park are used for subsequent spatial analyses or plotting. The following datasets were used to prepare a custom ASCII-formatted grid file for this project using R (v3.4.2):
-
The bio1 bioclimatic dataset, obtained from WorldClim and stored in ASCII format. File was moved to:
data/bio01.asc
. -
Zambia country administrative borders, obtained from DIVA-GIS. The specific file used is
ZMB_adm0.shp
, renamed to:data/zambia_borders.shp
. -
Park boundaries for Kafue National Park, kindly shared by the Zambia Wildlife Authority (now the Department of National Parks and Wildlife of Zambia. The particular file is not mine to share, but a perfectly suitable alternative is available for download at Protected Planet. The file was moved to:
data/kafue_parkboundary.shp
.
-
scripts/make_asc_file.R
- Convert VCF file to BED file using PLINK (v1.90) to prepare dataset for ancestry inference. Will generate
results/all.cleaned.LDpruned.vcf
.
plink --bfile results/all.cleaned.LDpruned --recode vcf --out results/all.cleaned.LDpruned
mv results/all.cleaned.LDpruned.log reports/bed_to_vcf.log
- Write spatial coordinates into file that is line-matched to the VCF data using R (v3.4.2). Will generate
data/all.cleaned.LDpruned.coord
.
scripts/make_coord_file.R
- Infer ancestry using ADMIXTURE (v1.3.0) in cross-validation mode testing values of K from 1 to 10. Plot cross-validated errors using R (v3.4.2).
scripts/run_admixture_cv.sh
scripts/plot_admixture_cv_error.R
- Run ADMIXTURE (v1.3.0) on the optimal value of K (K = 2).
scripts/run_admixture_k.sh 2
- Generate a variety of ancestry plots using R (v3.4.2), including comparisons to multidimensional scaling analyses done earlier. The script uses functions from
POPSutilities.R
, which is not provided here and must be downloaded (link) and moved to thescripts
folder.
scripts/plot_admixture_results.R
- Generate ancestry plots from Jolly et al. 2011. The script uses the data set
data/sex_chr_genotypes.txt
, which is not provided here but can be requested from Cliff Jolly.
scripts/plot_jolly_et_al_2011.R
- Compare autosomal and sex chromosome ancestries.
scripts/test_autosomes_vs_uniparental_markers.R
- Partition populations based on ancestry results using PLINK (v1.90) and prepare and convert requisite files for population genomic scans using PLINK/SEQ (v0.10), VCFtools (v0.1.14), htslib
tabix
(v1.4.1), and R (v3.4.2).
scripts/prep_popgen.sh
- Calculate a variety of within-population and between-population (FST) statistics using VCFtools (v0.1.14).
scripts/popgen_stats.sh
- Generate exploratory plot of inbreeding according to ancestry using R (v3.4.2).
scripts/plot_inbreeding.R
- Make maps for spatial visualization of sample locales and geospatial ancestry using R (v3.4.2).
scripts/map_samples.R
- Prepare for genome scans by downloading baboon and macaque annotations, finding open reading frames (ORFs), mapping baboon ORFs to macaque proteins, and finalizing baboon PANTHER annotations derived from homology. sesbio (v2016-10-18), HMMER2GO (v0.17.2), HMMER (v3.1b2), SAMtools (v1.6), and R (v3.4.2) are used and should be in the
$PATH
, along with the dependency EMBOSS (v6.6.0).
scripts/prep_genome_scans.sh
- Assign and average FST values across genes and conduct permutation test to build empirical FST null distribution using R (v3.4.2).
scripts/scan_fst.R
-
Some p values from the above procedure might be so close to the threshold of 0.05 that we cannot confidently determine their significance. For these sets of genes, we can run more permutations to augment ("boost") their empirical FST null distributions to bring the total up one order of magnitude (e.g., from 1 million to 10 million permutations). The procedure will make use of the
.RData
files in thecheckpoints
folder. To make this more efficient, we can parallelize the process using the following script. The script uses a modified R scriptscripts/scan_fst_boost.R
and passes as argument two variables:-
$reps
is automatically calculated and shows how many permutations have already been run (on a log10 scale) -
$1
(may be used for GNU parallel, PBS [replace$1
with$PBS_ARRAYID
], or Slurm [replace$1
with$SLURM_ARRAY_TASK_ID
]) is an arbitrary integer that ensures that result files do not conflict.
-
#!/bin/sh
reps=$(ls checkpoints/pval.perm.1e*.RData | sed -e 's/checkpoints\/pval.perm.1e\([0-9]*\)\.RData/\1/' | sort -rn | head -n 1)
scripts/scan_fst_boost.R $reps $1
- If the boost procedure described above was run, consolidate the results using R (v3.4.2) to generate the final empirical null distribution.
scripts/scan_fst_aggregate.R
- Convert genotypes into the genotype files require for bgc (v1.03). The script takes two VCF files as arguments because the first includes read count information but also extraneous SNPs (those not passing filters) while the second contains SNPs passing filters but no read counts. The conversion script merges information from the two to produce read counts for SNPs passing filters in the proper format.
scripts/vcf_to_bgc.R data/kafue.pass.snp.noX.vcf results/all.cleaned.LDpruned.vcf
- Run bgc (v1.03). The script below can be run in parallel (5 parallel runs were done in this case) by passing an arbitrary argument
$1
(may be used for GNU parallel, PBS [replace$1
with$PBS_ARRAYID
], or Slurm [replace$1
with$SLURM_ARRAY_TASK_ID
]). GSL (v1.16) and HDF5 (v1.8.17) are required as dependencies, and thebgc
andestpost
programs must be properly compiled and in the$PATH
. The stdout of one chain was saved to the filereports/bgc_err.txt
. Running on a remote machine is essentially required as each run will take a long time to converge and complete.
#!/bin/sh
scripts/run_bgc.sh $1
- Aggregate posterior MCMC chains across the 5 bgc (v1.03) runs.
scripts/run_bgc_aggregate.sh
- Visualize genomic cline results and compare to other ancestry results.
scripts/plot_bgc_results.R
- Assign genomic cline parameter estimates to genes based on overlap.
scripts/run_bgc_genes.R
- Plot results from both FST and genomic cline scans.
scripts/plot_all_scans.R
- Test for difference in beta cline parameter estimates between genic and nongenic SNPs.
scripts/run_bgc_genic_vs_nongenic.R
- Test for enrichment of Gene Ontology terms and PANTHER pathways for both FST and genomic cline analyses.
scripts/test_enrichment_all.R