This repository contains the code required for the Simulation component of the IMLS_GCCO marker comparison project, which is quantifying how measures of ex situ conservation differ between RADseq and microsatellite markers. Simulations using different marker types (microsatellites, or "MSAT", and SNP markers, or "DNA") are performed using the fastSimcoal2
software and the strataG
R package.
This folder contains all the R scripts used for our analyses. Scripts are numbered (alphanumerically) according to the order they're run in the workflow. The purpose of these scripts is summarized below.
Functions used repeatedly throughout the scripts are declared in 0_functions.R. These functions are roughly grouped according to their purpose in the workflow.
There are 3 scripts named _simParams_
. Each of these scripts runs through the following steps, using the fastSimcoal simulation software and the R package strataG
:
- declare simulation parameters,
- run simulations according to parameters,
- convert the Arelquin objects generated by simulations into objects files (for resampling steps; see below),
- save the genind objects containing simulation data to the disk. The script will also "clean up" the simulation outputs, removing some files and moving others to certain directories.
These steps populate the files found in the SimulationOutputs directory (see below). The genind files generated in Step 4 are read in at the beginning of the 3_resampleAndBootstrap.R script.
The 3 scripts correspond to 3 different analyses, as summarized below: - N1200: These simulations use a total population size (nInd) of 1200 individuals. MSAT and DNA markers are simulated - N4800: These simulations use a total population size (nInd) of 4800 individuals. MSAT and DNA markers are simulated - lowMutDNA: These simulations use a lower mutation rate (1e-8) than what's used in the DNA simulations (both N1200 and N4800) described above (1e-7). Both N1200 and N4800 total population sizes are simulated in this script.
The script 2_senseCheck.R performs different checks on the genind files generated from Step #1 (running simulations using fastSimcoal
and strataG
). These checks test our expectations of the simulation scenarios based on biology--for instance, that average Fst values will be greater for scenarios with lower migration rates between populations.
The script 3_resampleAndBootstrap.R is used to summarize the allele frequency distributions of each simulation scenario (for each marker type), and then randomly sample each object to model ex situ conservation. The script reads in the genind files saved to disk, and then using these files, generates resampling arrays, with the following dimensions
- rows: number of randomly selected samples, for which allelic representation is calculated
- columns: allelelic representation categories - Column 1: the Total allelic representation (all categories of alleles) - Columns 2--5: the representation values for alleles of different frequency categories (Very common, Common, Low frequency, Rare)
- slices: each array slice represents a different, independent resampling replicate. For these scenarios, the current number of resampling replicates is 5
The resampling arrays generated in this script are read in at the beginning of the 4_linearModeling.R script.
This resampling script is built to run in the background of a server, with multiple clusters processing in parallel. For this reason, it uses many parallelized apply-family functions. Variable flags are used to indicate which of the analysis scenarios (specified above; N1200, N4800, DNA low mutation) to run resampling analyses on. In addition, for all scenarios except the DNA low mutation scenario, we explore using different levels of filtration for minor allele counts. Functionally, this means removing MSAT and DNA alleles which only appear once (N1) or twice (N2) in the entire population. We compare these results to no filter for minor allele count (N0).
The script reads in the allelic resampling data arrays generated for simulations generated in Rosenberger et al. 2022, which created species-specific simulations of 14 oak species native to the U.S. It then generates MSSEs and their prediction intervals, for each species.
Loci bootstrapping
Loci boostrapping happens simultaneously to resampling. For each level of loci that's used for the loci bootstrapping, a resampling array (of 5 simulation slices) is generated. Therefore, for each dataset (e.g. MSAT N1200, DNA N4800, etc.), the final output is a list of resampling arrays.
The script 4_linearModeling.R reads in resampling arrays and uses them to calculate the minimum samples size estimates (MSSE) required for representing 95% of the total allelic diversity found within a population using both MSAT and DNA markers. It also explores the simulation parameters that most greatly impact these MSSEs.
The script archived_SimulationsScenarios.R contains simulation parameters for scenarios we explored, but ultimately ended up not using. The script is kept here for documentation purposes.
The script demo_dnaMarkerIssue.R demonstrates the issue discovered when trying to simulation SNP ("DNA") markers using just fastSimcoal
. Markers will be concatenated, such that 3 blocks of chromosomes each 500 bp long will, in the resulting genind file, result in a single locus 1,500 bp long. This motivated our usage of strataG
.
This folder contains 7 subfolders: 6 containing outputs for simulations used in our analyses, as outlined below, and 1 containing the outputs for demonstrating a DNA marker concatenation issue we found using fastSimcoal
(see above):
- N1200
- MSAT
- DNA
- N4800
- MSAT
- DNA
- DNA low mutation
- N1200
- N4800
In each of these subfolders, there are directories for each of our 6 simulation scenarios (2 migration rates * 3 different population configurations). Within those directories are 2 files: a .log
file, which describes the steps taken by strataG
and fastSimcoal
in generating simulations, and a .par
file, which contains the simulation parameters.
We are unable to keep the actual files generated from simulations (Arlequin and genind files) on this repostory, due to file size restrictions (see our .gitignore
file for more details).
For questions about this project, open an Issue or contact Austin Koontz.