Skip to content

Schumerlab/mixnmatch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

##########################
#####mixnmatch quickstart
##########################

See the user manual pdf for more complete information

This is a pipeline for simulating admixed genomes to evaluate the accuracy of local ancestry callers under different scenarios

#####For step-by-step installation instructions see:
installation_instructions.txt

#short version: The following programs/dependencies **must** be in your PATH:

fastahack - https://github.com/ekg/fastahack
seqtk - https://github.com/lh3/seqtk
wgsim - https://github.com/lh3/wgsim
bedtools2 - https://github.com/arq5x/bedtools2.git
SELAM - https://github.com/russcd/SELAM
macs & msformatter - https://github.com/gchen98/macs
seq-gen - https://github.com/rambaut/Seq-Gen

other dependencies:
perl Math::Random module
perl use Math::Round module
biopython

To test that programs are in your PATH type each program name and hit enter. If it returns usage they are installed properly. e.g.:

fastahack

Option 1: Simulates genome sequences with macs and simulates parental and hybrid haplotypes based on provided parameters. An ancestral sequence can also be used with this option.

Optuon 2: Takes two co-linear genome assemblies as input, simulates data according parameters and produces reads for input into AncestryHMM or otehr programs. See here for generating them if you don't have any available: https://github.com/YourePrettyGood/PseudoreferencePipeline

##############################
####To run
#############################

perl simulate_admixed_genomes.pl my_cfg.cfg

##############################
######Configuration file
##############################

#see included example also: hybrid_simulation_configuration.cfg

#name of parent1 genome
genome1=

#name of parent2 genome
genome2=

#use a genome1 sequence as the ancestral sequence? Options are 0 - no and 1 - yes (only compatible with use_macs=1)
use_ancestral=

#proportion genome from parent species 1
mixture_prop_par1=

#expected recombination rate in Morgans per kb
rec_rate_Morgans_kb=

#number of individuals to simulate
num_indivs=

#generations since initial admixture
gens_since_admixture=

#chromosome to simulate, name must match exactly what is in the fasta file
chr_to_simulate=

#rate of polymorphism per bp in parent species 1
poly_perbp_par1=

#rate of polymorphism per bp in parent species 2
poly_perbp_par2=

#proportion of AIMs that are really shared polymorphisms (to simulate errors from this source)
rate_shared_poly_at_aims=

#read type, SE or PE
read_type=

#read length
read_length=

#per bp indel rate to simulate
per_bp_indels=

#sequencing error rate to simulate
sequencing_error=

#number of reads to simulate
number_reads=

#amount of cross-contamination to simulate
cross_contam=

#Simulate drift between the populations that admixed and the sampled parental species populations? 1 - yes, 0 - no
#NOTE: macs_par1_aims_pop and macs_par2_aims_pop must be defined with this option, and an appropriate macs command provided
parental_drift=

#If you are simulating drift, put which population to treat as the parent 1 reference population. NOTE: This value must be greater than 2 because populations 1 and 2 must be the admixing parental populations.
macs_par1_aims_pop=

#If you are simulating drift, put which population to treat as the parent 2 reference population. NOTE: This value must be greater than 2 because populations 1 and 2 must be the admixing parental populations.
macs_par2_aims_pop=

#Ancestry informative site frequency threshold to impose on parental species haplotypes (ratio 0-1, recommended >0.9)
aim_freq_cutoff=

#job submission command (i.e. sbatch or bash)
job_submit_cmd=

#Use macs to simulate sequences? 0 - no, use provided genomes, 1 - yes simulate with macs
use_macs=

#Use a recombination file provided for macs to generate rate variation in parentals and hybrids; 0 - use a uniform recombination rate, 1 - use a provided recombination map. NOTE: only compatible with use_macs=1
use_map=

#Provide a SELAM parameter file to specify hybrid population demography. If one is not specified the program will assume a simple demographic scenario, see documentation for more details
SELAM_param_file=

#Provide a SELAM selection file if you are simulating selection
SELAM_selection_file=

#If you are simulating sequences using macs (use_macs=1), you must provide a macs command string. The program will accept two populations only but arbitrary demographic history. A macs formatted recombination file must be provided with the -R parameter if use_map=1
#example: macs_params=60 20000000 -I 2 30 30 0 -t 0.001 -h 1e2 -r 0.001 -ej 2 2 1 -R recombination_map_for_macs.txt
#NOTE: if you are using the ancestral sequence option the chromosome length *must* match the length of the input scaffold
#NOTE: if you are using the parental drift option, you need to specify those populations and chromosomes to sample in the command line
macs_params=

#provide the full path to where the program scripts are located:
program_path=

#Use this number of parent 1 haplotypes to define aims. Must be <= the number of macs haplotypes simulated
par1_for_aims=

#Use this number of parent 2 haplotypes	to define aims. Must be	<= the number of macs haplotypes simulated
par2_for_aims=

#provide base composition and transition/transversion parameters for seq-gen. This is optional, otherwise defaults will be used:
seq_params=

#provide the header to use in the job submission file
#for example: job_header=#!/bin/sh #SBATCH --ntasks=1 #SBATCH --cpus-per-task=1 --mem=32000 #SBATCH -p schumer #SBATCH --time=02:00:00

#Number of individuals to simulate per job
num_indiv_per_job=

#######################
####EVALUATING ACCURACY
#######################

After your simulation, run Ancestry_HMM_pipeline as you would on any data (see: https://github.com/Schumerlab/Ancestry_HMM_pipeline)

########
#INPUT FILES FOR HMM pipeline
#######

The input ancestry informative sites, parental counts, and parental genomes (only generated if macs is used) files will be automatically generated by the pipeline. Regardless of the simulation parameters, the counts and AIMs files will be called:

simulated_parental_counts_for_AncestryHMM
simulated_AIMs_for_AncestryHMM

If you specified use_map=1 the program will also output the following file with local recombination rates indicated:

simulated_parental_counts_for_AncestryHMM_recombination_prior.txt

During the simulation an input reads file list will also be automatically generated. The name will be: 

simulated_hybrids_readlist_genX_prop_par1_Y

where X and Y are the parameters simulated.

If you are using macs to simulate, the genomes will be:

macs_simulated_parent1.fa
macs_simulated_parent2.fa

###########

Once the Ancestry_HMM_pipeline has finished you can run the following to determine accuracy:

perl post_hmm_accuracy_shell.pl ancestry-probs-par1_file ancestry-probs_par2_file path_to_simulation_reads_folder posterior_thresh path_to_simulator_install