kafue-baboons-ddrad

This repository contains all analysis code for two related projects on the baboons of Kafue National Park. Baboon DNA was prepared using double-digest RADseq and used for two related projects. The citations for the papers are below (citations will be updated as their statuses change):

Chiou, K. L., Bergey, C. M., Burrell, A. S., Disotell, T. R., Rogers, J., Jolly, C. J., & Phillips-Conroy, J. E. Genomic signatures of extreme body size divergence in baboons.

Chiou, K. L., Bergey, C. M., Burrell, A. S., Disotell, T. R., Rogers, J., Jolly, C. J., & Phillips-Conroy, J. E. Genome-wide ancestry and introgression in a Zambian baboon hybrid zone.

Some scripts in this pipeline borrow heavily from the awash-dopamine repository, written by coauthor Christina Bergey.

The analysis pipeline here begins with one pair of FASTQ files for each animal in the study. This should be most convenient for those wishing to replicate this analysis, as the files are the same as the ones available on SRA.

For those wishing to recreate the steps for demultiplexing and, in some cases, merging libraries across independent sequencing runs, raw reads are available upon request (output from bcl2fastq). Demultiplexing code is available in the demultiplex repository, which was originally written by Christina Bergey.

The code throughout is documented as though all code is run line by line, with all programs available in the $PATH. In reality, the majority of the code was run on the NYU high performance computing cluster, particularly those steps requiring extra time, memory, or processors. Most of the programs were available as preinstalled modules maintained by the HPC staff. Here, the modules are omitted and instead software links and version numbers are provided to help users find the appropriate versions. Similarly, for R packages, you will need to check each script and install required libraries using install.packages or Bioconductor.

If you find yourself unable to recreate any of the steps in this pipeline, start by checking version numbers and consulting the documentation for the relevant programs. If you still have questions, feel free to contact me and I will do my best to help.

Contact: Kenneth Chiou (website) kchiou [at] uw [dot] edu

Mapping and genotyping with the NGS-map pipeline

To begin, ensure that the NGS-map repository is up-to-date (written by Christina Bergey).

git submodule update NGS-map

Reads downloaded from SRA (BioProject PRJNA486659) will have different file names from those used here. To ensure that the pipeline works properly, you may need to rename the reads to those indicated in the table below. Once finished, place reads in the NGS-map/data folder.

SRA accession	Read 1 filename	Read 2 filename
SRR7717351	BZ06_218.read1.fastq	BZ06_218.read2.fastq
SRR7717350	BZ06_220.read1.fastq	BZ06_220.read2.fastq
SRR7717353	BZ06_221.read1.fastq	BZ06_221.read2.fastq
SRR7717352	BZ06_224.read1.fastq	BZ06_224.read2.fastq
SRR7717355	BZ06_225.read1.fastq	BZ06_225.read2.fastq
SRR7717354	BZ06_227.read1.fastq	BZ06_227.read2.fastq
SRR7717357	BZ07_001.read1.fastq	BZ07_001.read2.fastq
SRR7717356	BZ07_004.read1.fastq	BZ07_004.read2.fastq
SRR7717359	BZ07_005.read1.fastq	BZ07_005.read2.fastq
SRR7717358	BZ07_007.read1.fastq	BZ07_007.read2.fastq
SRR7717328	BZ07_029.read1.fastq	BZ07_029.read2.fastq
SRR7717327	BZ07_030.read1.fastq	BZ07_030.read2.fastq
SRR7717326	BZ07_032.read1.fastq	BZ07_032.read2.fastq
SRR7717325	BZ07_034.read1.fastq	BZ07_034.read2.fastq
SRR7717332	BZ07_035.read1.fastq	BZ07_035.read2.fastq
SRR7717331	BZ07_039.read1.fastq	BZ07_039.read2.fastq
SRR7717330	BZ07_041.read1.fastq	BZ07_041.read2.fastq
SRR7717329	BZ07_042.read1.fastq	BZ07_042.read2.fastq
SRR7717324	BZ07_045.read1.fastq	BZ07_045.read2.fastq
SRR7717323	BZ07_047.read1.fastq	BZ07_047.read2.fastq
SRR7717395	BZ07_100.read1.fastq	BZ07_100.read2.fastq
SRR7717396	BZ11_001.read1.fastq	BZ11_001.read2.fastq
SRR7717393	BZ11_002.read1.fastq	BZ11_002.read2.fastq
SRR7717394	BZ11_003.read1.fastq	BZ11_003.read2.fastq
SRR7717399	BZ11_004.read1.fastq	BZ11_004.read2.fastq
SRR7717400	BZ11_005.read1.fastq	BZ11_005.read2.fastq
SRR7717397	BZ11_006.read1.fastq	BZ11_006.read2.fastq
SRR7717398	BZ11_007.read1.fastq	BZ11_007.read2.fastq
SRR7717401	BZ11_008.read1.fastq	BZ11_008.read2.fastq
SRR7717402	BZ11_009.read1.fastq	BZ11_009.read2.fastq
SRR7717384	BZ11_010.read1.fastq	BZ11_010.read2.fastq
SRR7717383	BZ11_011.read1.fastq	BZ11_011.read2.fastq
SRR7717386	BZ11_012.read1.fastq	BZ11_012.read2.fastq
SRR7717385	BZ11_013.read1.fastq	BZ11_013.read2.fastq
SRR7717388	BZ11_014.read1.fastq	BZ11_014.read2.fastq
SRR7717387	BZ11_015.read1.fastq	BZ11_015.read2.fastq
SRR7717390	BZ11_016.read1.fastq	BZ11_016.read2.fastq
SRR7717389	BZ11_017.read1.fastq	BZ11_017.read2.fastq
SRR7717392	BZ11_018.read1.fastq	BZ11_018.read2.fastq
SRR7717391	BZ11_019.read1.fastq	BZ11_019.read2.fastq
SRR7717293	BZ11_020.read1.fastq	BZ11_020.read2.fastq
SRR7717294	BZ11_021.read1.fastq	BZ11_021.read2.fastq
SRR7717295	BZ11_022.read1.fastq	BZ11_022.read2.fastq
SRR7717296	BZ11_023.read1.fastq	BZ11_023.read2.fastq
SRR7717297	BZ11_024.read1.fastq	BZ11_024.read2.fastq
SRR7717298	BZ11_025.read1.fastq	BZ11_025.read2.fastq
SRR7717299	BZ11_026.read1.fastq	BZ11_026.read2.fastq
SRR7717300	BZ11_028.read1.fastq	BZ11_028.read2.fastq
SRR7717301	BZ11_029.read1.fastq	BZ11_029.read2.fastq
SRR7717302	BZ11_030.read1.fastq	BZ11_030.read2.fastq
SRR7717281	BZ11_031.read1.fastq	BZ11_031.read2.fastq
SRR7717280	BZ11_032.read1.fastq	BZ11_032.read2.fastq
SRR7717279	BZ11_033.read1.fastq	BZ11_033.read2.fastq
SRR7717278	BZ11_034.read1.fastq	BZ11_034.read2.fastq
SRR7717277	BZ11_035.read1.fastq	BZ11_035.read2.fastq
SRR7717276	BZ11_036.read1.fastq	BZ11_036.read2.fastq
SRR7717275	BZ11_037.read1.fastq	BZ11_037.read2.fastq
SRR7717274	BZ11_038.read1.fastq	BZ11_038.read2.fastq
SRR7717283	BZ11_039.read1.fastq	BZ11_039.read2.fastq
SRR7717282	BZ11_040.read1.fastq	BZ11_040.read2.fastq
SRR7717321	BZ11_041.read1.fastq	BZ11_041.read2.fastq
SRR7717322	BZ11_042.read1.fastq	BZ11_042.read2.fastq
SRR7717319	BZ11_043.read1.fastq	BZ11_043.read2.fastq
SRR7717320	BZ11_045.read1.fastq	BZ11_045.read2.fastq
SRR7717317	BZ11_046.read1.fastq	BZ11_046.read2.fastq
SRR7717318	BZ11_047.read1.fastq	BZ11_047.read2.fastq
SRR7717315	BZ11_048.read1.fastq	BZ11_048.read2.fastq
SRR7717316	BZ11_050.read1.fastq	BZ11_050.read2.fastq
SRR7717313	BZ11_051.read1.fastq	BZ11_051.read2.fastq
SRR7717314	BZ11_052.read1.fastq	BZ11_052.read2.fastq
SRR7717310	BZ11_053.read1.fastq	BZ11_053.read2.fastq
SRR7717309	BZ11_054.read1.fastq	BZ11_054.read2.fastq
SRR7717312	BZ11_056.read1.fastq	BZ11_056.read2.fastq
SRR7717311	BZ11_057.read1.fastq	BZ11_057.read2.fastq
SRR7717306	BZ11_058.read1.fastq	BZ11_058.read2.fastq
SRR7717305	BZ11_059.read1.fastq	BZ11_059.read2.fastq
SRR7717308	BZ11_061.read1.fastq	BZ11_061.read2.fastq
SRR7717307	BZ11_062.read1.fastq	BZ11_062.read2.fastq
SRR7717304	BZ11_063.read1.fastq	BZ11_063.read2.fastq
SRR7717303	BZ11_064.read1.fastq	BZ11_064.read2.fastq
SRR7717368	BZ11_065.read1.fastq	BZ11_065.read2.fastq
SRR7717369	BZ11_066.read1.fastq	BZ11_066.read2.fastq
SRR7717370	BZ11_067.read1.fastq	BZ11_067.read2.fastq
SRR7717371	BZ11_068.read1.fastq	BZ11_068.read2.fastq
SRR7717364	BZ11_069.read1.fastq	BZ11_069.read2.fastq
SRR7717365	BZ11_070.read1.fastq	BZ11_070.read2.fastq
SRR7717366	BZ11_071.read1.fastq	BZ11_071.read2.fastq
SRR7717367	BZ11_072.read1.fastq	BZ11_072.read2.fastq
SRR7717361	BZ11_073.read1.fastq	BZ11_073.read2.fastq
SRR7717362	BZ11_074.read1.fastq	BZ11_074.read2.fastq
SRR7717340	BZ11_075.read1.fastq	BZ11_075.read2.fastq
SRR7717334	BZ11_076.read1.fastq	BZ11_076.read2.fastq
SRR7717335	BZ12_001.read1.fastq	BZ12_001.read2.fastq
SRR7717333	BZ12_002.read1.fastq	BZ12_002.read2.fastq
SRR7717341	BZ12_003.read1.fastq	BZ12_003.read2.fastq
SRR7717374	BZ12_004.read1.fastq	BZ12_004.read2.fastq
SRR7717339	BZ12_005.read1.fastq	BZ12_005.read2.fastq
SRR7717338	BZ12_006.read1.fastq	BZ12_006.read2.fastq
SRR7717363	BZ12_007.read1.fastq	BZ12_007.read2.fastq
SRR7717360	BZ12_008.read1.fastq	BZ12_008.read2.fastq
SRR7717344	BZ12_009.read1.fastq	BZ12_009.read2.fastq
SRR7717345	BZ12_010.read1.fastq	BZ12_010.read2.fastq
SRR7717342	BZ12_011.read1.fastq	BZ12_011.read2.fastq
SRR7717343	BZ12_012.read1.fastq	BZ12_012.read2.fastq
SRR7717348	BZ12_030.read1.fastq	BZ12_030.read2.fastq
SRR7717349	BZ12_031.read1.fastq	BZ12_031.read2.fastq
SRR7717346	BZ12_032.read1.fastq	BZ12_032.read2.fastq
SRR7717347	BZ12_033.read1.fastq	BZ12_033.read2.fastq
SRR7717336	Chiou_14_001.read1.fastq	Chiou_14_001.read2.fastq
SRR7717337	Chiou_14_003.read1.fastq	Chiou_14_003.read2.fastq
SRR7717376	Chiou_14_004.read1.fastq	Chiou_14_004.read2.fastq
SRR7717375	Chiou_14_005.read1.fastq	Chiou_14_005.read2.fastq
SRR7717378	Chiou_14_030.read1.fastq	Chiou_14_030.read2.fastq
SRR7717377	Chiou_14_036.read1.fastq	Chiou_14_036.read2.fastq
SRR7717380	Chiou_14_039.read1.fastq	Chiou_14_039.read2.fastq
SRR7717379	Chiou_14_041.read1.fastq	Chiou_14_041.read2.fastq
SRR7717382	Chiou_14_042.read1.fastq	Chiou_14_042.read2.fastq
SRR7717381	Chiou_14_044.read1.fastq	Chiou_14_044.read2.fastq
SRR7717373	Chiou_14_050.read1.fastq	Chiou_14_050.read2.fastq
SRR7717372	Chiou_14_054.read1.fastq	Chiou_14_054.read2.fastq
SRR7717284	Chiou_14_056.read1.fastq	Chiou_14_056.read2.fastq
SRR7717285	Chiou_14_057.read1.fastq	Chiou_14_057.read2.fastq
SRR7717286	Chiou_14_058.read1.fastq	Chiou_14_058.read2.fastq
SRR7717287	Chiou_14_059.read1.fastq	Chiou_14_059.read2.fastq
SRR7717288	Chiou_14_065.read1.fastq	Chiou_14_065.read2.fastq
SRR7717289	Chiou_14_069.read1.fastq	Chiou_14_069.read2.fastq
SRR7717290	Chiou_15_003.read1.fastq	Chiou_15_003.read2.fastq
SRR7717291	Chiou_15_004.read1.fastq	Chiou_15_004.read2.fastq
SRR7717292	Chiou_15_005.read1.fastq	Chiou_15_005.read2.fastq

Proceed to map samples and call and filter variants according to the NGS-map pipeline. The following outline may be helpful but will not work as is--some components of NGS-map will need to be manually configured (particularly paths to software).

Prepare sample list and save to a file.

ls -1 NGS-map/data/*.fastq | \
	sed 's/NGS-map\/data\/\([A-Za-z0-9_]*\)\.read[1-2].fastq/\1/' | \
	sort -u > NGS-map/data/individual_list.txt

Change to NGS-map directory.

cd NGS-map

Download and index the baboon reference genome (Panu2.0).

genomes/download_papAnu2.sh
scripts/index_genome.sh genomes/papAnu2/papAnu2.fa

Map samples to the reference genome using BWA mem (v0.7.15).

qsub -t 1-129 pbs/call_make.pbs

Call genotypes using GATK (v3.5.0).

qsub -t 1-20 pbs/call_gatk_genotyper.pbs

Filter genotype calls using GATK (v3.5.0).

qsub -t 1-20 pbs/filter_gatk_snps.pbs

Change to directory containing variant calls.

cd baboon_snps

Concatenate SNP calls from autosomes using VCFtools (v0.1.14).

vcf-concat $(echo chr{1..20}.pass.snp.vcf) > kafue.pass.snp.noX.vcf

Change to base project directory.

cd ../..

Following the process above, transfer results files to the data folder (at the base level of this repository)

mv NGS-map/baboon_snps/kafue.pass.snp.noX.vcf data

Analysis pipeline

SNP filtering

Download Repeatmasker and Tandem Repeats Finder files for the baboon reference genome (Panu2.0).

scripts/download_repeats.sh

Convert VCF files to PLINK PED and MAP files using VCFtools (v0.1.14).

scripts/vcf_to_ped.sh kafue.pass.snp.noX

Calculate stats on missingness using PLINK (v1.90).

plink --noweb --double-id --vcf data/kafue.pass.snp.noX.vcf --missing --out reports/kafue.pass.snp.noX

Find repeats based on the Repeatmasker and Tandem Repeats Finder files and the PLINK .map file generated in the previous step. BEDtools (v2.26.0) must be installed and in the $PATH.

scripts/repeat_finder.pl

Filter dataset to remove repeats identified in the previous step as well as samples and loci with excessive missingness using PLINK (v1.90). This will produce two datasets: the larger results/all.cleaned.* dataset and the more stringent results/strict.cleaned.* dataset (the latter is not used in subsequent steps).

scripts/make_clean_dataset.sh

Calculate SNPs in linkage disequilibrium and remove them from the dataset using PLINK (v1.90). This will produce the results/all.cleaned.LDpruned.* dataset.

scripts/find_and_remove_ld.sh

Exploratory data visualization

Perform multidimensional scaling and principal components analyses using PLINK (v1.90). The scripts also plot respective results using R (v3.4.2).

scripts/mds_plot.sh
scripts/pca_plot.sh

Visualize library mapping statistics from individuals passing filters using R (v3.4.2). This step will read the results/mds.txt file generated in the previous step.

scripts/plot_mapping.R

Ancestry inference

Raster and vector imagery for Zambia and Kafue National Park are used for subsequent spatial analyses or plotting. The following datasets were used to prepare a custom ASCII-formatted grid file for this project using R (v3.4.2):
- The bio1 bioclimatic dataset, obtained from WorldClim and stored in ASCII format. File was moved to: data/bio01.asc.
- Zambia country administrative borders, obtained from DIVA-GIS. The specific file used is ZMB_adm0.shp, renamed to: data/zambia_borders.shp.
- Park boundaries for Kafue National Park, kindly shared by the Zambia Wildlife Authority (now the Department of National Parks and Wildlife of Zambia. The particular file is not mine to share, but a perfectly suitable alternative is available for download at Protected Planet. The file was moved to: data/kafue_parkboundary.shp.

scripts/make_asc_file.R

Convert VCF file to BED file using PLINK (v1.90) to prepare dataset for ancestry inference. Will generate results/all.cleaned.LDpruned.vcf.

plink --bfile results/all.cleaned.LDpruned --recode vcf --out results/all.cleaned.LDpruned
mv results/all.cleaned.LDpruned.log reports/bed_to_vcf.log

Write spatial coordinates into file that is line-matched to the VCF data using R (v3.4.2). Will generate data/all.cleaned.LDpruned.coord.

scripts/make_coord_file.R

Infer ancestry using ADMIXTURE (v1.3.0) in cross-validation mode testing values of K from 1 to 10. Plot cross-validated errors using R (v3.4.2).

scripts/run_admixture_cv.sh
scripts/plot_admixture_cv_error.R

Run ADMIXTURE (v1.3.0) on the optimal value of K (K = 2).

scripts/run_admixture_k.sh 2

Generate a variety of ancestry plots using R (v3.4.2), including comparisons to multidimensional scaling analyses done earlier. The script uses functions from POPSutilities.R, which is not provided here and must be downloaded (link) and moved to the scripts folder.

scripts/plot_admixture_results.R

Generate ancestry plots from Jolly et al. 2011. The script uses the data set data/sex_chr_genotypes.txt, which is not provided here but can be requested from Cliff Jolly.

scripts/plot_jolly_et_al_2011.R

Compare autosomal and sex chromosome ancestries.

scripts/test_autosomes_vs_uniparental_markers.R

Exploratory population genetic analyses

Partition populations based on ancestry results using PLINK (v1.90) and prepare and convert requisite files for population genomic scans using PLINK/SEQ (v0.10), VCFtools (v0.1.14), htslib tabix (v1.4.1), and R (v3.4.2).

scripts/prep_popgen.sh

Calculate a variety of within-population and between-population (F_ST) statistics using VCFtools (v0.1.14).

scripts/popgen_stats.sh

Generate exploratory plot of inbreeding according to ancestry using R (v3.4.2).

scripts/plot_inbreeding.R

Make maps for spatial visualization of sample locales and geospatial ancestry using R (v3.4.2).

scripts/map_samples.R

Prepare for genome scans by downloading baboon and macaque annotations, finding open reading frames (ORFs), mapping baboon ORFs to macaque proteins, and finalizing baboon PANTHER annotations derived from homology. sesbio (v2016-10-18), HMMER2GO (v0.17.2), HMMER (v3.1b2), SAMtools (v1.6), and R (v3.4.2) are used and should be in the $PATH, along with the dependency EMBOSS (v6.6.0).

Genome scan for selection

scripts/prep_genome_scans.sh

Assign and average F_ST values across genes and conduct permutation test to build empirical F_ST null distribution using R (v3.4.2).

scripts/scan_fst.R

Some p values from the above procedure might be so close to the threshold of 0.05 that we cannot confidently determine their significance. For these sets of genes, we can run more permutations to augment ("boost") their empirical F_ST null distributions to bring the total up one order of magnitude (e.g., from 1 million to 10 million permutations). The procedure will make use of the .RData files in the checkpoints folder. To make this more efficient, we can parallelize the process using the following script. The script uses a modified R script scripts/scan_fst_boost.R and passes as argument two variables:
1. $reps is automatically calculated and shows how many permutations have already been run (on a log10 scale)
2. $1 (may be used for GNU parallel, PBS [replace $1 with $PBS_ARRAYID], or Slurm [replace $1 with $SLURM_ARRAY_TASK_ID]) is an arbitrary integer that ensures that result files do not conflict.

#!/bin/sh

reps=$(ls checkpoints/pval.perm.1e*.RData | sed -e 's/checkpoints\/pval.perm.1e\([0-9]*\)\.RData/\1/' | sort -rn | head -n 1)

scripts/scan_fst_boost.R $reps $1

If the boost procedure described above was run, consolidate the results using R (v3.4.2) to generate the final empirical null distribution.

scripts/scan_fst_aggregate.R

Genomic cline analysis

Convert genotypes into the genotype files require for bgc (v1.03). The script takes two VCF files as arguments because the first includes read count information but also extraneous SNPs (those not passing filters) while the second contains SNPs passing filters but no read counts. The conversion script merges information from the two to produce read counts for SNPs passing filters in the proper format.

scripts/vcf_to_bgc.R data/kafue.pass.snp.noX.vcf results/all.cleaned.LDpruned.vcf

Run bgc (v1.03). The script below can be run in parallel (5 parallel runs were done in this case) by passing an arbitrary argument $1 (may be used for GNU parallel, PBS [replace $1 with $PBS_ARRAYID], or Slurm [replace $1 with $SLURM_ARRAY_TASK_ID]). GSL (v1.16) and HDF5 (v1.8.17) are required as dependencies, and the bgc and estpost programs must be properly compiled and in the $PATH. The stdout of one chain was saved to the file reports/bgc_err.txt. Running on a remote machine is essentially required as each run will take a long time to converge and complete.

#!/bin/sh

scripts/run_bgc.sh $1

Aggregate posterior MCMC chains across the 5 bgc (v1.03) runs.

scripts/run_bgc_aggregate.sh

Visualize genomic cline results and compare to other ancestry results.

scripts/plot_bgc_results.R

Assign genomic cline parameter estimates to genes based on overlap.

scripts/run_bgc_genes.R

Plot results from both F_ST and genomic cline scans.

scripts/plot_all_scans.R

Test for difference in beta cline parameter estimates between genic and nongenic SNPs.

scripts/run_bgc_genic_vs_nongenic.R

Gene Ontology and PANTHER pathway enrichment

Test for enrichment of Gene Ontology terms and PANTHER pathways for both F_ST and genomic cline analyses.

scripts/test_enrichment_all.R

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
NGS-map @ d4b2d69		NGS-map @ d4b2d69
checkpoints		checkpoints
data		data
genomes		genomes
reports		reports
results		results
scripts		scripts
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NGS-map @ d4b2d69

NGS-map @ d4b2d69

checkpoints

checkpoints

data

data

genomes

genomes

reports

reports

results

results

scripts

scripts

.gitmodules

.gitmodules

LICENSE

LICENSE

README.md

README.md

Repository files navigation

kafue-baboons-ddrad

Table of Contents

Mapping and genotyping with the NGS-map pipeline

Analysis pipeline

SNP filtering

Exploratory data visualization

Ancestry inference

Exploratory population genetic analyses

Genome scan for selection

Genomic cline analysis

Gene Ontology and PANTHER pathway enrichment

About

Releases

Packages

Languages

License

kchiou/kafue-baboons-ddrad

Folders and files

Latest commit

History

Repository files navigation

kafue-baboons-ddrad

Table of Contents

Mapping and genotyping with the NGS-map pipeline

Analysis pipeline

SNP filtering

Exploratory data visualization

Ancestry inference

Exploratory population genetic analyses

Genome scan for selection

Genomic cline analysis

Gene Ontology and PANTHER pathway enrichment

About

Resources

License

Stars

Watchers

Forks

Languages