-
Notifications
You must be signed in to change notification settings - Fork 1
Schumerlab/mixnmatch
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
########################## #####mixnmatch quickstart ########################## See the user manual pdf for more complete information This is a pipeline for simulating admixed genomes to evaluate the accuracy of local ancestry callers under different scenarios #####For step-by-step installation instructions see: installation_instructions.txt #short version: The following programs/dependencies **must** be in your PATH: fastahack - https://github.com/ekg/fastahack seqtk - https://github.com/lh3/seqtk wgsim - https://github.com/lh3/wgsim bedtools2 - https://github.com/arq5x/bedtools2.git SELAM - https://github.com/russcd/SELAM macs & msformatter - https://github.com/gchen98/macs seq-gen - https://github.com/rambaut/Seq-Gen other dependencies: perl Math::Random module perl use Math::Round module biopython To test that programs are in your PATH type each program name and hit enter. If it returns usage they are installed properly. e.g.: fastahack Option 1: Simulates genome sequences with macs and simulates parental and hybrid haplotypes based on provided parameters. An ancestral sequence can also be used with this option. Optuon 2: Takes two co-linear genome assemblies as input, simulates data according parameters and produces reads for input into AncestryHMM or otehr programs. See here for generating them if you don't have any available: https://github.com/YourePrettyGood/PseudoreferencePipeline ############################## ####To run ############################# perl simulate_admixed_genomes.pl my_cfg.cfg ############################## ######Configuration file ############################## #see included example also: hybrid_simulation_configuration.cfg #name of parent1 genome genome1= #name of parent2 genome genome2= #use a genome1 sequence as the ancestral sequence? Options are 0 - no and 1 - yes (only compatible with use_macs=1) use_ancestral= #proportion genome from parent species 1 mixture_prop_par1= #expected recombination rate in Morgans per kb rec_rate_Morgans_kb= #number of individuals to simulate num_indivs= #generations since initial admixture gens_since_admixture= #chromosome to simulate, name must match exactly what is in the fasta file chr_to_simulate= #rate of polymorphism per bp in parent species 1 poly_perbp_par1= #rate of polymorphism per bp in parent species 2 poly_perbp_par2= #proportion of AIMs that are really shared polymorphisms (to simulate errors from this source) rate_shared_poly_at_aims= #read type, SE or PE read_type= #read length read_length= #per bp indel rate to simulate per_bp_indels= #sequencing error rate to simulate sequencing_error= #number of reads to simulate number_reads= #amount of cross-contamination to simulate cross_contam= #Simulate drift between the populations that admixed and the sampled parental species populations? 1 - yes, 0 - no #NOTE: macs_par1_aims_pop and macs_par2_aims_pop must be defined with this option, and an appropriate macs command provided parental_drift= #If you are simulating drift, put which population to treat as the parent 1 reference population. NOTE: This value must be greater than 2 because populations 1 and 2 must be the admixing parental populations. macs_par1_aims_pop= #If you are simulating drift, put which population to treat as the parent 2 reference population. NOTE: This value must be greater than 2 because populations 1 and 2 must be the admixing parental populations. macs_par2_aims_pop= #Ancestry informative site frequency threshold to impose on parental species haplotypes (ratio 0-1, recommended >0.9) aim_freq_cutoff= #job submission command (i.e. sbatch or bash) job_submit_cmd= #Use macs to simulate sequences? 0 - no, use provided genomes, 1 - yes simulate with macs use_macs= #Use a recombination file provided for macs to generate rate variation in parentals and hybrids; 0 - use a uniform recombination rate, 1 - use a provided recombination map. NOTE: only compatible with use_macs=1 use_map= #Provide a SELAM parameter file to specify hybrid population demography. If one is not specified the program will assume a simple demographic scenario, see documentation for more details SELAM_param_file= #Provide a SELAM selection file if you are simulating selection SELAM_selection_file= #If you are simulating sequences using macs (use_macs=1), you must provide a macs command string. The program will accept two populations only but arbitrary demographic history. A macs formatted recombination file must be provided with the -R parameter if use_map=1 #example: macs_params=60 20000000 -I 2 30 30 0 -t 0.001 -h 1e2 -r 0.001 -ej 2 2 1 -R recombination_map_for_macs.txt #NOTE: if you are using the ancestral sequence option the chromosome length *must* match the length of the input scaffold #NOTE: if you are using the parental drift option, you need to specify those populations and chromosomes to sample in the command line macs_params= #provide the full path to where the program scripts are located: program_path= #Use this number of parent 1 haplotypes to define aims. Must be <= the number of macs haplotypes simulated par1_for_aims= #Use this number of parent 2 haplotypes to define aims. Must be <= the number of macs haplotypes simulated par2_for_aims= #provide base composition and transition/transversion parameters for seq-gen. This is optional, otherwise defaults will be used: seq_params= #provide the header to use in the job submission file #for example: job_header=#!/bin/sh #SBATCH --ntasks=1 #SBATCH --cpus-per-task=1 --mem=32000 #SBATCH -p schumer #SBATCH --time=02:00:00 #Number of individuals to simulate per job num_indiv_per_job= ####################### ####EVALUATING ACCURACY ####################### After your simulation, run Ancestry_HMM_pipeline as you would on any data (see: https://github.com/Schumerlab/Ancestry_HMM_pipeline) ######## #INPUT FILES FOR HMM pipeline ####### The input ancestry informative sites, parental counts, and parental genomes (only generated if macs is used) files will be automatically generated by the pipeline. Regardless of the simulation parameters, the counts and AIMs files will be called: simulated_parental_counts_for_AncestryHMM simulated_AIMs_for_AncestryHMM If you specified use_map=1 the program will also output the following file with local recombination rates indicated: simulated_parental_counts_for_AncestryHMM_recombination_prior.txt During the simulation an input reads file list will also be automatically generated. The name will be: simulated_hybrids_readlist_genX_prop_par1_Y where X and Y are the parameters simulated. If you are using macs to simulate, the genomes will be: macs_simulated_parent1.fa macs_simulated_parent2.fa ########### Once the Ancestry_HMM_pipeline has finished you can run the following to determine accuracy: perl post_hmm_accuracy_shell.pl ancestry-probs-par1_file ancestry-probs_par2_file path_to_simulation_reads_folder posterior_thresh path_to_simulator_install
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Packages 0
No packages published