Online resource "The first steps toward a global pandemic: Reconstructing the demographic history of parasite host switches in its native range"

This repository aims at providing an online resource for the preprint

Using whole-genome sequencing, (i) we compared sympatric populations genetic diversity, and (ii) estimated the demographic parameters of independent host switches in the Apis honey bee ectoparasites mites: Varroa destructor and V. jacobsoni. The following sections described how the bioinformatics analysis were processed and how to replicate them with available genome data reads, Snakemake pipeline and other file formats or codes necessary.

Where can you find the data?

Varroa mites whole-genome sequences

Varroa mite sequencing reads are available under the DDBJ DNA Data Bank of Japan BioProject PRJDB9195
GCF_002443255.1 Vdes_3.0 Reference genome can be directly downloaded from NCBI developed by Techer et al. 2020. Comms. Biology
Honey bees and Varroa mites fasta sequences and Bowtie2 index build are available in the ref2020 folder on DRYAD DOI TO ADD

Codes

Genomics analysis are summarized into a Snakemake pipeline is available in the file Snakefile, along with parameters file cluster.json and launcher snakemake.slurm.
All scripts called in Snakefile are present in the scripts folder.
Interactive visualization of sampling distribution and details is available here.
R markdown including the code lines used to generate the interactive maps, PCAs analysis and bootstraping for the demographic inferences can be found in R_data.

Input or output files

Variant calling files (Raw, filtered and LD_pruned) as well as some lists (samples ID, SNP list, ) DRYAD DOI to add
Fastsimcoal2 input files for demographic scenarios and SFS subsets are available in demography.
FASTA files generated for mitochondrial analysis (COX1 and COX1-COX3-ATP6-CYTB) and compared to Varroa reference sequences are available in alignment_mtDNA

Customed Snakemake workflow for Varroa population genomics

Software and dependencies necessary to run the Snakemake pipeline :

Samtools : version samtools/1.3.1 used here.
Mashtree
Bowtie2: version bowtie2/2.2 used here.
NextGenMap: version NextGenMap/0.5.0 used here.
VariantBam: version VariantBam/1.4.3 used here.
Freebayes: version freebayes/1.1.0 used here.
VCFtools: version vcftools/0.1.12b used here.
vcflib: version vcflib/1.0.0-rc1 used here.
PLINK: version plink/1.90b3.36 used here.
R: version R/3.4.2 used here.
NGSadmix: version NGSadmix/32 used here.

How to run the pipeline?

The first 72 lines of the Snakemake file are setting in which environment we ran our analysis (i.e., which folders the raw reads data can be found, and the paths of future generated resulting analysis data). We computed most analysis on Sango, OIST Super Computer Center.

You can change the targets of rule all depending on which particular steps of the pipeline you want to proceed following Snakemake's documentation.

For example, if you want to generate the mapped read .bam files from the original .fastq files downloaded on NCBI:

Place the fastq files in a reads folder
Samples name will be recognized according to line 45 for which fastq R1 files end with {sample}-R1_001.fastq.gz
Apply rule all as expand(outDir + "/alignments/ngm/{sample}.bam", sample = SAMPLES)

Or, if you want to generate the variant call file from NextGenMap reads apply: outDir + "/var/ngm/filtered.vcf" for which outDir is the path of your working folder.

Estimation of the mutation rate

All files are available in the DRYAD repository.

Demographic scenarios testing and parameters estimation with fastsimcoal2

We designed six evolutionary scenario using fastsimcoal2 to choose the most likely and estimate demographic parameters such as the size of founding population at time of independent host switch events by V. destructor and V. jacobsoni from A. cerana to A. mellifera.

Useful links and tools for running your own demographic inferences

Additionally to the complete manual available for fastsimcoal2 written by Laurent Excoffier, we recommend the helpful active google group forum.

An excellent tutorial on fastsimcoal2 usage on genomics data can be found here, hosted on the speciationgenomics Github page by Mark Ravinet & Joana Meier.

easySFS was used to generate SFS input files from our vcf files (developed by Isaac Overcast).

SFS-scripts were used to plot observed and expected 2D-SFS under each scenario model (developed by David Marques).

Templates files in `demography` folder:

Here, we described the command lines used for one demographic model (templates_vjac_MUTINB/14_mig_botgwt/mut_8e-10) with the SFS generated from all sites subset for V. destructor (VDES32by22_folded_50kbvcf2sfs.txt). The mig_botgwt_TEMPLATE_jointMAFpop1_0.obs file should be place in the same folder as the same named .tpl and .est files.

Briefly the mig_botgwt_TEMPLATE.tpl file draw the scenario with :

A two populations model between V. destructor mites on original host A. cerana with a current effective population size NVAC1 and NVAM0 for the novel host A. mellifera.
Observed SFS Data were projected using easySFS with 17 haploid genomes for both A. mellifera (population 0) and A. cerana (population 1) mites.
A growth rate GAM since the host switch event only for the novel host (expansion biologically known).
Two migration matrices were given for the population split and after.
We considered a single historical event TJUMP ending at TBOTEND from which a number of haploid mites NBOTAM splited from A. cerana population to found the new A. mellifera population.
The mutation rate was proposed following preliminary de novo mutations estimations.

In the mig_botgwt_TEMPLATE.est file, most parameters were sampled in a uniform distribution (NB: upper limit does not constitute a maximum bound for fsc26, see manual).

We copied-named mig_botgwt_TEMPLATE_XXX.est, mig_botgwt_TEMPLATE_XXX.tpl, mig_botgwt_TEMPLATE_XXX_jointMAFpop1_0.obs 100 times for which XXX is in {1..100}. Using an array bash script we then ran the following command for each replicate.

#!/bin/bash
#SBATCH --job-name=vd-migbotgwt
#SBATCH --partition=XXX
#SBATCH --mem=5G
#SBATCH -c 10
#SBATCH --time=3:00:00

number=$SLURM_ARRAY_TASK_ID
[PATH_TO_FASTSIMCOAL2]/fsc26 --tplfile mig_botgwt_TEMPLATE_"$number".tpl --estfile mig_botgwt_TEMPLATE_"$number".est -m --numsims 1000000 --maxlhood 0.001 --minnumloops 20 --numloops 100 -c 10

Then to extract the best results from each run, we simply apply the following command lines:

cat mig_botgwt_TEMPLATE_1/mig_botgwt_TEMPLATE_1.bestlhoods >> scenario_migbotgwt.txt
for i in {2..100}; do sed -n 2p mig_botgwt_TEMPLATE_"$i"/mig_botgwt_TEMPLATE_"$i".bestlhoods >> scenario_migbotgwt.txt; done
for i in {1..100}; do cat mig_botgwt_TEMPLATE_"$i"/mig_botgwt_TEMPLATE_"$i".bestlhoods >> scenario_migbotgwt.txt; done
cat scenario_migbotgwt.txt | wc -l # to check that we have the results for the 100 replicates cat scenario_migbotgwt.txt | sort -k 10nr # the first line is then the scenatio with the lowest MaxEstLhood

For the best scenario, bootstraps from the best replicate were performed following the tutorial in the manual (page 56-57). We simulated 100 SFS datasets from the best output .par file after modifying it to generate a DNA sequence data, using:
[PATH_TO_FASTSIMCOAL2]/NUMBER=YYY # where NUMBER is the replicate number with the lowest MaxEstLhood fsc26 -i mig_botgwt_TEMPLATE_${NUMBER}_boot.par -n100 -j -m -s0 -x -I -q

We then repeat the parameters estimation 100 times for each of the 100 simulated SFS mig_botgwt_TEMPLATE_${NUMBER}_boot_jointMAFpop1_0.obs using the same initial command lines.

The best replicate for each simulated SFS run is then used to obtain the confidence interval of NVAM0, NVAC1, TJUMP, NBOTAM, GAM and migration rates.

Contact

Questions about the data or scripts? please contact either:

Maeva Techer, JSPS Postdoctoral Fellow, Ecology and Evolution lab @OIST
OIST Email: maeva.techer@oist.jp
Gmail: maeva.angelique.techer@oist.jp

Alexander (Sasha) Mikheyev, Adjunct Professor Ecology and Evolution @OIST, Associate Professor Evolutionary Genomics@ANU OIST Email: alexander.mikheyev@oist.jp
ANU Email: alexander.mikheyev@anu.edu.au

Name		Name	Last commit message	Last commit date
Latest commit History 288 Commits
alignment_mtDNA		alignment_mtDNA
data		data
demography		demography
docs		docs
images		images
list		list
map_layer		map_layer
mutations		mutations
scripts		scripts
.Rhistory		.Rhistory
Host_Switch_Nov2020.Rmd		Host_Switch_Nov2020.Rmd
Host_Switch_Nov2020.html		Host_Switch_Nov2020.html
README.md		README.md
Snakefile		Snakefile
cluster.json		cluster.json
snakemake.slurm		snakemake.slurm

MaevaTecher/varroa-host-jump

Folders and files

Latest commit

History

Repository files navigation

Online resource "The first steps toward a global pandemic: Reconstructing the demographic history of parasite host switches in its native range"

Where can you find the data?

Varroa mites whole-genome sequences

Codes

Input or output files

Customed Snakemake workflow for Varroa population genomics

Software and dependencies necessary to run the Snakemake pipeline :

How to run the pipeline?

Estimation of the mutation rate

Demographic scenarios testing and parameters estimation with fastsimcoal2

Useful links and tools for running your own demographic inferences

Templates files in demography folder:

Contact

About

Resources

Stars

Watchers

Forks

Languages

Templates files in `demography` folder: