Skip to content

kchiou/kafue-baboons-ddrad

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

kafue-baboons-ddrad

This repository contains all analysis code for two related projects on the baboons of Kafue National Park. Baboon DNA was prepared using double-digest RADseq and used for two related projects. The citations for the papers are below (citations will be updated as their statuses change):

Chiou, K. L., Bergey, C. M., Burrell, A. S., Disotell, T. R., Rogers, J., Jolly, C. J., & Phillips-Conroy, J. E. Genomic signatures of extreme body size divergence in baboons.

Chiou, K. L., Bergey, C. M., Burrell, A. S., Disotell, T. R., Rogers, J., Jolly, C. J., & Phillips-Conroy, J. E. Genome-wide ancestry and introgression in a Zambian baboon hybrid zone.

Some scripts in this pipeline borrow heavily from the awash-dopamine repository, written by coauthor Christina Bergey.

The analysis pipeline here begins with one pair of FASTQ files for each animal in the study. This should be most convenient for those wishing to replicate this analysis, as the files are the same as the ones available on SRA.

For those wishing to recreate the steps for demultiplexing and, in some cases, merging libraries across independent sequencing runs, raw reads are available upon request (output from bcl2fastq). Demultiplexing code is available in the demultiplex repository, which was originally written by Christina Bergey.

The code throughout is documented as though all code is run line by line, with all programs available in the $PATH. In reality, the majority of the code was run on the NYU high performance computing cluster, particularly those steps requiring extra time, memory, or processors. Most of the programs were available as preinstalled modules maintained by the HPC staff. Here, the modules are omitted and instead software links and version numbers are provided to help users find the appropriate versions. Similarly, for R packages, you will need to check each script and install required libraries using install.packages or Bioconductor.

If you find yourself unable to recreate any of the steps in this pipeline, start by checking version numbers and consulting the documentation for the relevant programs. If you still have questions, feel free to contact me and I will do my best to help.

Contact: Kenneth Chiou (website) kchiou [at] uw [dot] edu

Table of Contents

Mapping and genotyping with the NGS-map pipeline

To begin, ensure that the NGS-map repository is up-to-date (written by Christina Bergey).

git submodule update NGS-map

Reads downloaded from SRA (BioProject PRJNA486659) will have different file names from those used here. To ensure that the pipeline works properly, you may need to rename the reads to those indicated in the table below. Once finished, place reads in the NGS-map/data folder.

SRA accession Read 1 filename Read 2 filename
SRR7717351 BZ06_218.read1.fastq BZ06_218.read2.fastq
SRR7717350 BZ06_220.read1.fastq BZ06_220.read2.fastq
SRR7717353 BZ06_221.read1.fastq BZ06_221.read2.fastq
SRR7717352 BZ06_224.read1.fastq BZ06_224.read2.fastq
SRR7717355 BZ06_225.read1.fastq BZ06_225.read2.fastq
SRR7717354 BZ06_227.read1.fastq BZ06_227.read2.fastq
SRR7717357 BZ07_001.read1.fastq BZ07_001.read2.fastq
SRR7717356 BZ07_004.read1.fastq BZ07_004.read2.fastq
SRR7717359 BZ07_005.read1.fastq BZ07_005.read2.fastq
SRR7717358 BZ07_007.read1.fastq BZ07_007.read2.fastq
SRR7717328 BZ07_029.read1.fastq BZ07_029.read2.fastq
SRR7717327 BZ07_030.read1.fastq BZ07_030.read2.fastq
SRR7717326 BZ07_032.read1.fastq BZ07_032.read2.fastq
SRR7717325 BZ07_034.read1.fastq BZ07_034.read2.fastq
SRR7717332 BZ07_035.read1.fastq BZ07_035.read2.fastq
SRR7717331 BZ07_039.read1.fastq BZ07_039.read2.fastq
SRR7717330 BZ07_041.read1.fastq BZ07_041.read2.fastq
SRR7717329 BZ07_042.read1.fastq BZ07_042.read2.fastq
SRR7717324 BZ07_045.read1.fastq BZ07_045.read2.fastq
SRR7717323 BZ07_047.read1.fastq BZ07_047.read2.fastq
SRR7717395 BZ07_100.read1.fastq BZ07_100.read2.fastq
SRR7717396 BZ11_001.read1.fastq BZ11_001.read2.fastq
SRR7717393 BZ11_002.read1.fastq BZ11_002.read2.fastq
SRR7717394 BZ11_003.read1.fastq BZ11_003.read2.fastq
SRR7717399 BZ11_004.read1.fastq BZ11_004.read2.fastq
SRR7717400 BZ11_005.read1.fastq BZ11_005.read2.fastq
SRR7717397 BZ11_006.read1.fastq BZ11_006.read2.fastq
SRR7717398 BZ11_007.read1.fastq BZ11_007.read2.fastq
SRR7717401 BZ11_008.read1.fastq BZ11_008.read2.fastq
SRR7717402 BZ11_009.read1.fastq BZ11_009.read2.fastq
SRR7717384 BZ11_010.read1.fastq BZ11_010.read2.fastq
SRR7717383 BZ11_011.read1.fastq BZ11_011.read2.fastq
SRR7717386 BZ11_012.read1.fastq BZ11_012.read2.fastq
SRR7717385 BZ11_013.read1.fastq BZ11_013.read2.fastq
SRR7717388 BZ11_014.read1.fastq BZ11_014.read2.fastq
SRR7717387 BZ11_015.read1.fastq BZ11_015.read2.fastq
SRR7717390 BZ11_016.read1.fastq BZ11_016.read2.fastq
SRR7717389 BZ11_017.read1.fastq BZ11_017.read2.fastq
SRR7717392 BZ11_018.read1.fastq BZ11_018.read2.fastq
SRR7717391 BZ11_019.read1.fastq BZ11_019.read2.fastq
SRR7717293 BZ11_020.read1.fastq BZ11_020.read2.fastq
SRR7717294 BZ11_021.read1.fastq BZ11_021.read2.fastq
SRR7717295 BZ11_022.read1.fastq BZ11_022.read2.fastq
SRR7717296 BZ11_023.read1.fastq BZ11_023.read2.fastq
SRR7717297 BZ11_024.read1.fastq BZ11_024.read2.fastq
SRR7717298 BZ11_025.read1.fastq BZ11_025.read2.fastq
SRR7717299 BZ11_026.read1.fastq BZ11_026.read2.fastq
SRR7717300 BZ11_028.read1.fastq BZ11_028.read2.fastq
SRR7717301 BZ11_029.read1.fastq BZ11_029.read2.fastq
SRR7717302 BZ11_030.read1.fastq BZ11_030.read2.fastq
SRR7717281 BZ11_031.read1.fastq BZ11_031.read2.fastq
SRR7717280 BZ11_032.read1.fastq BZ11_032.read2.fastq
SRR7717279 BZ11_033.read1.fastq BZ11_033.read2.fastq
SRR7717278 BZ11_034.read1.fastq BZ11_034.read2.fastq
SRR7717277 BZ11_035.read1.fastq BZ11_035.read2.fastq
SRR7717276 BZ11_036.read1.fastq BZ11_036.read2.fastq
SRR7717275 BZ11_037.read1.fastq BZ11_037.read2.fastq
SRR7717274 BZ11_038.read1.fastq BZ11_038.read2.fastq
SRR7717283 BZ11_039.read1.fastq BZ11_039.read2.fastq
SRR7717282 BZ11_040.read1.fastq BZ11_040.read2.fastq
SRR7717321 BZ11_041.read1.fastq BZ11_041.read2.fastq
SRR7717322 BZ11_042.read1.fastq BZ11_042.read2.fastq
SRR7717319 BZ11_043.read1.fastq BZ11_043.read2.fastq
SRR7717320 BZ11_045.read1.fastq BZ11_045.read2.fastq
SRR7717317 BZ11_046.read1.fastq BZ11_046.read2.fastq
SRR7717318 BZ11_047.read1.fastq BZ11_047.read2.fastq
SRR7717315 BZ11_048.read1.fastq BZ11_048.read2.fastq
SRR7717316 BZ11_050.read1.fastq BZ11_050.read2.fastq
SRR7717313 BZ11_051.read1.fastq BZ11_051.read2.fastq
SRR7717314 BZ11_052.read1.fastq BZ11_052.read2.fastq
SRR7717310 BZ11_053.read1.fastq BZ11_053.read2.fastq
SRR7717309 BZ11_054.read1.fastq BZ11_054.read2.fastq
SRR7717312 BZ11_056.read1.fastq BZ11_056.read2.fastq
SRR7717311 BZ11_057.read1.fastq BZ11_057.read2.fastq
SRR7717306 BZ11_058.read1.fastq BZ11_058.read2.fastq
SRR7717305 BZ11_059.read1.fastq BZ11_059.read2.fastq
SRR7717308 BZ11_061.read1.fastq BZ11_061.read2.fastq
SRR7717307 BZ11_062.read1.fastq BZ11_062.read2.fastq
SRR7717304 BZ11_063.read1.fastq BZ11_063.read2.fastq
SRR7717303 BZ11_064.read1.fastq BZ11_064.read2.fastq
SRR7717368 BZ11_065.read1.fastq BZ11_065.read2.fastq
SRR7717369 BZ11_066.read1.fastq BZ11_066.read2.fastq
SRR7717370 BZ11_067.read1.fastq BZ11_067.read2.fastq
SRR7717371 BZ11_068.read1.fastq BZ11_068.read2.fastq
SRR7717364 BZ11_069.read1.fastq BZ11_069.read2.fastq
SRR7717365 BZ11_070.read1.fastq BZ11_070.read2.fastq
SRR7717366 BZ11_071.read1.fastq BZ11_071.read2.fastq
SRR7717367 BZ11_072.read1.fastq BZ11_072.read2.fastq
SRR7717361 BZ11_073.read1.fastq BZ11_073.read2.fastq
SRR7717362 BZ11_074.read1.fastq BZ11_074.read2.fastq
SRR7717340 BZ11_075.read1.fastq BZ11_075.read2.fastq
SRR7717334 BZ11_076.read1.fastq BZ11_076.read2.fastq
SRR7717335 BZ12_001.read1.fastq BZ12_001.read2.fastq
SRR7717333 BZ12_002.read1.fastq BZ12_002.read2.fastq
SRR7717341 BZ12_003.read1.fastq BZ12_003.read2.fastq
SRR7717374 BZ12_004.read1.fastq BZ12_004.read2.fastq
SRR7717339 BZ12_005.read1.fastq BZ12_005.read2.fastq
SRR7717338 BZ12_006.read1.fastq BZ12_006.read2.fastq
SRR7717363 BZ12_007.read1.fastq BZ12_007.read2.fastq
SRR7717360 BZ12_008.read1.fastq BZ12_008.read2.fastq
SRR7717344 BZ12_009.read1.fastq BZ12_009.read2.fastq
SRR7717345 BZ12_010.read1.fastq BZ12_010.read2.fastq
SRR7717342 BZ12_011.read1.fastq BZ12_011.read2.fastq
SRR7717343 BZ12_012.read1.fastq BZ12_012.read2.fastq
SRR7717348 BZ12_030.read1.fastq BZ12_030.read2.fastq
SRR7717349 BZ12_031.read1.fastq BZ12_031.read2.fastq
SRR7717346 BZ12_032.read1.fastq BZ12_032.read2.fastq
SRR7717347 BZ12_033.read1.fastq BZ12_033.read2.fastq
SRR7717336 Chiou_14_001.read1.fastq Chiou_14_001.read2.fastq
SRR7717337 Chiou_14_003.read1.fastq Chiou_14_003.read2.fastq
SRR7717376 Chiou_14_004.read1.fastq Chiou_14_004.read2.fastq
SRR7717375 Chiou_14_005.read1.fastq Chiou_14_005.read2.fastq
SRR7717378 Chiou_14_030.read1.fastq Chiou_14_030.read2.fastq
SRR7717377 Chiou_14_036.read1.fastq Chiou_14_036.read2.fastq
SRR7717380 Chiou_14_039.read1.fastq Chiou_14_039.read2.fastq
SRR7717379 Chiou_14_041.read1.fastq Chiou_14_041.read2.fastq
SRR7717382 Chiou_14_042.read1.fastq Chiou_14_042.read2.fastq
SRR7717381 Chiou_14_044.read1.fastq Chiou_14_044.read2.fastq
SRR7717373 Chiou_14_050.read1.fastq Chiou_14_050.read2.fastq
SRR7717372 Chiou_14_054.read1.fastq Chiou_14_054.read2.fastq
SRR7717284 Chiou_14_056.read1.fastq Chiou_14_056.read2.fastq
SRR7717285 Chiou_14_057.read1.fastq Chiou_14_057.read2.fastq
SRR7717286 Chiou_14_058.read1.fastq Chiou_14_058.read2.fastq
SRR7717287 Chiou_14_059.read1.fastq Chiou_14_059.read2.fastq
SRR7717288 Chiou_14_065.read1.fastq Chiou_14_065.read2.fastq
SRR7717289 Chiou_14_069.read1.fastq Chiou_14_069.read2.fastq
SRR7717290 Chiou_15_003.read1.fastq Chiou_15_003.read2.fastq
SRR7717291 Chiou_15_004.read1.fastq Chiou_15_004.read2.fastq
SRR7717292 Chiou_15_005.read1.fastq Chiou_15_005.read2.fastq

Proceed to map samples and call and filter variants according to the NGS-map pipeline. The following outline may be helpful but will not work as is--some components of NGS-map will need to be manually configured (particularly paths to software).

  1. Prepare sample list and save to a file.
ls -1 NGS-map/data/*.fastq | \
	sed 's/NGS-map\/data\/\([A-Za-z0-9_]*\)\.read[1-2].fastq/\1/' | \
	sort -u > NGS-map/data/individual_list.txt
  1. Change to NGS-map directory.
cd NGS-map
  1. Download and index the baboon reference genome (Panu2.0).
genomes/download_papAnu2.sh
scripts/index_genome.sh genomes/papAnu2/papAnu2.fa
  1. Map samples to the reference genome using BWA mem (v0.7.15).
qsub -t 1-129 pbs/call_make.pbs
  1. Call genotypes using GATK (v3.5.0).
qsub -t 1-20 pbs/call_gatk_genotyper.pbs
  1. Filter genotype calls using GATK (v3.5.0).
qsub -t 1-20 pbs/filter_gatk_snps.pbs
  1. Change to directory containing variant calls.
cd baboon_snps
  1. Concatenate SNP calls from autosomes using VCFtools (v0.1.14).
vcf-concat $(echo chr{1..20}.pass.snp.vcf) > kafue.pass.snp.noX.vcf
  1. Change to base project directory.
cd ../..

Following the process above, transfer results files to the data folder (at the base level of this repository)

mv NGS-map/baboon_snps/kafue.pass.snp.noX.vcf data

Analysis pipeline

SNP filtering

scripts/download_repeats.sh
  • Convert VCF files to PLINK PED and MAP files using VCFtools (v0.1.14).
scripts/vcf_to_ped.sh kafue.pass.snp.noX
  • Calculate stats on missingness using PLINK (v1.90).
plink --noweb --double-id --vcf data/kafue.pass.snp.noX.vcf --missing --out reports/kafue.pass.snp.noX
scripts/repeat_finder.pl
  • Filter dataset to remove repeats identified in the previous step as well as samples and loci with excessive missingness using PLINK (v1.90). This will produce two datasets: the larger results/all.cleaned.* dataset and the more stringent results/strict.cleaned.* dataset (the latter is not used in subsequent steps).
scripts/make_clean_dataset.sh
  • Calculate SNPs in linkage disequilibrium and remove them from the dataset using PLINK (v1.90). This will produce the results/all.cleaned.LDpruned.* dataset.
scripts/find_and_remove_ld.sh

Exploratory data visualization

  • Perform multidimensional scaling and principal components analyses using PLINK (v1.90). The scripts also plot respective results using R (v3.4.2).
scripts/mds_plot.sh
scripts/pca_plot.sh
  • Visualize library mapping statistics from individuals passing filters using R (v3.4.2). This step will read the results/mds.txt file generated in the previous step.
scripts/plot_mapping.R

Ancestry inference

  • Raster and vector imagery for Zambia and Kafue National Park are used for subsequent spatial analyses or plotting. The following datasets were used to prepare a custom ASCII-formatted grid file for this project using R (v3.4.2):

    • The bio1 bioclimatic dataset, obtained from WorldClim and stored in ASCII format. File was moved to: data/bio01.asc.

    • Zambia country administrative borders, obtained from DIVA-GIS. The specific file used is ZMB_adm0.shp, renamed to: data/zambia_borders.shp.

    • Park boundaries for Kafue National Park, kindly shared by the Zambia Wildlife Authority (now the Department of National Parks and Wildlife of Zambia. The particular file is not mine to share, but a perfectly suitable alternative is available for download at Protected Planet. The file was moved to: data/kafue_parkboundary.shp.

scripts/make_asc_file.R
  • Convert VCF file to BED file using PLINK (v1.90) to prepare dataset for ancestry inference. Will generate results/all.cleaned.LDpruned.vcf.
plink --bfile results/all.cleaned.LDpruned --recode vcf --out results/all.cleaned.LDpruned
mv results/all.cleaned.LDpruned.log reports/bed_to_vcf.log
  • Write spatial coordinates into file that is line-matched to the VCF data using R (v3.4.2). Will generate data/all.cleaned.LDpruned.coord.
scripts/make_coord_file.R
  • Infer ancestry using ADMIXTURE (v1.3.0) in cross-validation mode testing values of K from 1 to 10. Plot cross-validated errors using R (v3.4.2).
scripts/run_admixture_cv.sh
scripts/plot_admixture_cv_error.R
  • Run ADMIXTURE (v1.3.0) on the optimal value of K (K = 2).
scripts/run_admixture_k.sh 2
  • Generate a variety of ancestry plots using R (v3.4.2), including comparisons to multidimensional scaling analyses done earlier. The script uses functions from POPSutilities.R, which is not provided here and must be downloaded (link) and moved to the scripts folder.
scripts/plot_admixture_results.R
  • Generate ancestry plots from Jolly et al. 2011. The script uses the data set data/sex_chr_genotypes.txt, which is not provided here but can be requested from Cliff Jolly.
scripts/plot_jolly_et_al_2011.R
  • Compare autosomal and sex chromosome ancestries.
scripts/test_autosomes_vs_uniparental_markers.R

Exploratory population genetic analyses

  • Partition populations based on ancestry results using PLINK (v1.90) and prepare and convert requisite files for population genomic scans using PLINK/SEQ (v0.10), VCFtools (v0.1.14), htslib tabix (v1.4.1), and R (v3.4.2).
scripts/prep_popgen.sh
  • Calculate a variety of within-population and between-population (FST) statistics using VCFtools (v0.1.14).
scripts/popgen_stats.sh
  • Generate exploratory plot of inbreeding according to ancestry using R (v3.4.2).
scripts/plot_inbreeding.R
  • Make maps for spatial visualization of sample locales and geospatial ancestry using R (v3.4.2).
scripts/map_samples.R
  • Prepare for genome scans by downloading baboon and macaque annotations, finding open reading frames (ORFs), mapping baboon ORFs to macaque proteins, and finalizing baboon PANTHER annotations derived from homology. sesbio (v2016-10-18), HMMER2GO (v0.17.2), HMMER (v3.1b2), SAMtools (v1.6), and R (v3.4.2) are used and should be in the $PATH, along with the dependency EMBOSS (v6.6.0).

Genome scan for selection

scripts/prep_genome_scans.sh
  • Assign and average FST values across genes and conduct permutation test to build empirical FST null distribution using R (v3.4.2).
scripts/scan_fst.R
  • Some p values from the above procedure might be so close to the threshold of 0.05 that we cannot confidently determine their significance. For these sets of genes, we can run more permutations to augment ("boost") their empirical FST null distributions to bring the total up one order of magnitude (e.g., from 1 million to 10 million permutations). The procedure will make use of the .RData files in the checkpoints folder. To make this more efficient, we can parallelize the process using the following script. The script uses a modified R script scripts/scan_fst_boost.R and passes as argument two variables:

    1. $reps is automatically calculated and shows how many permutations have already been run (on a log10 scale)

    2. $1 (may be used for GNU parallel, PBS [replace $1 with $PBS_ARRAYID], or Slurm [replace $1 with $SLURM_ARRAY_TASK_ID]) is an arbitrary integer that ensures that result files do not conflict.

#!/bin/sh

reps=$(ls checkpoints/pval.perm.1e*.RData | sed -e 's/checkpoints\/pval.perm.1e\([0-9]*\)\.RData/\1/' | sort -rn | head -n 1)

scripts/scan_fst_boost.R $reps $1
  • If the boost procedure described above was run, consolidate the results using R (v3.4.2) to generate the final empirical null distribution.
scripts/scan_fst_aggregate.R

Genomic cline analysis

  • Convert genotypes into the genotype files require for bgc (v1.03). The script takes two VCF files as arguments because the first includes read count information but also extraneous SNPs (those not passing filters) while the second contains SNPs passing filters but no read counts. The conversion script merges information from the two to produce read counts for SNPs passing filters in the proper format.
scripts/vcf_to_bgc.R data/kafue.pass.snp.noX.vcf results/all.cleaned.LDpruned.vcf
  • Run bgc (v1.03). The script below can be run in parallel (5 parallel runs were done in this case) by passing an arbitrary argument $1 (may be used for GNU parallel, PBS [replace $1 with $PBS_ARRAYID], or Slurm [replace $1 with $SLURM_ARRAY_TASK_ID]). GSL (v1.16) and HDF5 (v1.8.17) are required as dependencies, and the bgc and estpost programs must be properly compiled and in the $PATH. The stdout of one chain was saved to the file reports/bgc_err.txt. Running on a remote machine is essentially required as each run will take a long time to converge and complete.
#!/bin/sh

scripts/run_bgc.sh $1
  • Aggregate posterior MCMC chains across the 5 bgc (v1.03) runs.
scripts/run_bgc_aggregate.sh
  • Visualize genomic cline results and compare to other ancestry results.
scripts/plot_bgc_results.R
  • Assign genomic cline parameter estimates to genes based on overlap.
scripts/run_bgc_genes.R
  • Plot results from both FST and genomic cline scans.
scripts/plot_all_scans.R
  • Test for difference in beta cline parameter estimates between genic and nongenic SNPs.
scripts/run_bgc_genic_vs_nongenic.R

Gene Ontology and PANTHER pathway enrichment

  • Test for enrichment of Gene Ontology terms and PANTHER pathways for both FST and genomic cline analyses.
scripts/test_enrichment_all.R

About

Compilation of scripts used for analysis of the Kafue National Park baboon population

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published