# Assembling simulated and empirical data

This Jupyter notebook provides a completely reproducible record of all the assembly analyses for the ipyrad manuscript. In this notebook we assemble multiple data sets using five different software programs: ipyrad, pyrad, stacks, aftrRAD, and dDocent. We include executable code to download and install each program, to download each data set, and to run each data set through each program.

### Jupyter notebook SSH Tunneling
This notebook was executed on the Yale Farnam cluster with access to 40 compute cores on a single node. We performed local I/O in the notebook using SSH tunneling as described in the ipyrad documentation (http://ipyrad.readthedocs.io/HPC_Tunnel.html).


### Organize the directory structure

In [1]:
import shutil
import glob
import sys
import os

In [2]:
## Set the default directories for exec and data. 
WORK_DIR = "/home/deren/manuscript-analysis"

## create dirs for the data 
EMPERICAL_DATA_DIR = os.path.join(WORK_DIR, "example_empirical_rad")
SIMULATION_DATA_DIR = os.path.join(WORK_DIR, "simulated_data")

## create dirs for results from each software
IPYRAD_DIR = os.path.join(WORK_DIR, "assembly-ipyrad")
PYRAD_DIR = os.path.join(WORK_DIR, "assembly-pyrad")
STACKS_DIR = os.path.join(WORK_DIR, "assembly-stacks")
AFTRRAD_DIR = os.path.join(WORK_DIR, "assembly-aftrRAD")
DDOCENT_DIR = os.path.join(WORK_DIR, "assembly-dDocent")

## (empirical data dir will be created for us when we untar it)
for path in [WORK_DIR, IPYRAD_DIR, PYRAD_DIR, STACKS_DIR, AFTRRAD_DIR, DDOCENT_DIR]:
    if not os.path.exists(path):
        os.makedirs(path)
        
## Simulated data directories
SIMNO = os.path.join(WORK_DIR, "simno")
SIMLO = os.path.join(WORK_DIR, "simlo")
SIMHI = os.path.join(WORK_DIR, "simhi")
SIMLA = os.path.join(WORK_DIR, "simla")

## create directories
for path in [SIMNO, SIMLO, SIMHI, SIMLA]:
    if not os.path.exists(path):
        os.makedirs(path)

## subdirectories for results
IPYRAD_OUTPUT = os.path.join(IPYRAD_DIR, "REALDATA")
PYRAD_OUTPUT = os.path.join(PYRAD_DIR, "REALDATA")
STACKS_OUTPUT = os.path.join(STACKS_DIR, "REALDATA")
STACKS_GAP_OUT = os.path.join(STACKS_OUTPUT, "gapped")
STACKS_UNGAP_OUT = os.path.join(STACKS_OUTPUT, "ungapped")
STACKS_DEFAULT_OUT = os.path.join(STACKS_OUTPUT, "default")
AFTRRAD_OUTPUT = os.path.join(AFTRRAD_DIR, "REALDATA")
DDOCENT_OUTPUT = os.path.join(DDOCENT_DIR, "REALDATA")

## Make the empirical output directories if they don't already exist
for path in [IPYRAD_OUTPUT, PYRAD_OUTPUT, STACKS_OUTPUT,
             STACKS_GAP_OUT, STACKS_UNGAP_OUT, STACKS_DEFAULT_OUT,
             AFTRRAD_OUTPUT, DDOCENT_OUTPUT]:
    if not os.path.exists(path):
        os.makedirs(path)
        
## in case we're not in the workdir, get there
os.chdir(WORK_DIR)
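
The repeated "create if missing" loops above could be folded into one helper. The following is a minimal sketch, not part of the original analysis; `ensure_dirs` is a hypothetical name, and the error handling is written so that it should behave the same under Python 2 and Python 3.

```python
import errno
import os


def ensure_dirs(paths):
    """Create each directory in `paths` if it is missing; return those created."""
    created = []
    for path in paths:
        try:
            os.makedirs(path)
            created.append(path)
        except OSError as err:
            # An already-existing directory is fine; re-raise anything else
            # (for example a permission error, or a regular file in the way).
            if err.errno != errno.EEXIST or not os.path.isdir(path):
                raise
    return created
```

A call such as `ensure_dirs([SIMNO, SIMLO, SIMHI, SIMLA])` would then replace one of the loops, and the return value shows which directories were new on this run.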

### Download the empirical RAD *Pedicularis* data set

This is a RAD data set for 13 taxa from Eaton and Ree (2013) ([open access link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3739883/pdf/syt032.pdf)). Here we grab the demultiplexed fastq data from a public Dropbox link, but the same data are also hosted on the NCBI SRA as SRP021469.


In [4]:
%%bash

## the location of the data 
dataloc="https://dl.dropboxusercontent.com/u/2538935/example_empirical_rad.tar.gz"

## grab data from the public link (uses an upper-case o argument, not a zero).
curl -LkO $dataloc > /dev/null 2>&1

## the tar command decompresses the data directory
tar -xvzf example_empirical_rad.tar.gz

## remove the large tar file
rm example_empirical_rad.tar.gz

example_empirical_rad/
example_empirical_rad/38362_rex.fastq.gz
example_empirical_rad/32082_przewalskii.fastq.gz
example_empirical_rad/40578_rex.fastq.gz
example_empirical_rad/30686_cyathophylla.fastq.gz
example_empirical_rad/39618_rex.fastq.gz
example_empirical_rad/41954_cyathophylloides.fastq.gz
example_empirical_rad/41478_cyathophylloides.fastq.gz
example_empirical_rad/33588_przewalskii.fastq.gz
example_empirical_rad/35855_rex.fastq.gz
example_empirical_rad/35236_rex.fastq.gz
example_empirical_rad/29154_superba.fastq.gz
example_empirical_rad/30556_thamno.fastq.gz
example_empirical_rad/33413_thamno.fastq.gz
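
Because the download output is silenced, it can be worth confirming that the archive unpacked into the expected number of samples. This is a small sketch under that assumption; `list_fastq_samples` and `check_sample_count` are hypothetical helpers, not functions from any of the assembly programs.

```python
import glob
import os


def list_fastq_samples(data_dir):
    """Return sorted sample names for the *.fastq.gz files in `data_dir`."""
    suffix = ".fastq.gz"
    paths = glob.glob(os.path.join(data_dir, "*" + suffix))
    return sorted(os.path.basename(p)[:-len(suffix)] for p in paths)


def check_sample_count(data_dir, expected):
    """Raise ValueError if the number of fastq files differs from `expected`."""
    samples = list_fastq_samples(data_dir)
    if len(samples) != expected:
        raise ValueError(
            "expected {} samples in {}, found {}".format(
                expected, data_dir, len(samples)))
    return samples
```

For the data set above, `check_sample_count("example_empirical_rad", 13)` should return the 13 sample names listed in the tar output, and raise if the download was incomplete.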


### Download the empirical *Heliconius* reference genome


In [6]:
%%bash
## Fetch the heliconius genome and rad data from
## Davey, John W., et al. "Major improvements to the Heliconius melpomene 
## genome assembly used to confirm 10 chromosome fusion events in 6 
## million years of butterfly evolution." G3: Genes| Genomes| Genetics 6.3
## (2016): 695-708.
##curl -LkO http://butterflygenome.org/sites/default/files/Hmel2-0_Release_20160201.tgz

## location of the data
dataloc="http://butterflygenome.org/sites/default/files/Hmel2-0_Release_20151013.tgz"

## grab the data from the public link
curl -LkO $dataloc > /dev/null 2>&1

## untar the files
tar -zxvf Hmel2-0_Release_20151013.tgz

## remove the tar file
rm Hmel2-0_Release_20151013.tgz

./._Hmel2
Hmel2/
Hmel2/._annotation
Hmel2/annotation/
Hmel2/._ChangeLog.txt
Hmel2/ChangeLog.txt
Hmel2/._Hmel2.fa
Hmel2/Hmel2.fa
Hmel2/._Hmel_mtDNA.fa
Hmel2/Hmel_mtDNA.fa
Hmel2/._maps
Hmel2/maps/
Hmel2/._README.txt
Hmel2/README.txt
Hmel2/._repeats
Hmel2/repeats/
Hmel2/._transfer
Hmel2/transfer/
Hmel2/transfer/._Hmel1-1_Hmel2.chain
Hmel2/transfer/Hmel1-1_Hmel2.chain
Hmel2/transfer/._Hmel2_broken.gff
Hmel2/transfer/Hmel2_broken.gff
Hmel2/transfer/._Hmel2_removed.gff
Hmel2/transfer/Hmel2_removed.gff
Hmel2/transfer/._Hmel2_transfer_new.tsv
Hmel2/transfer/Hmel2_transfer_new.tsv
Hmel2/transfer/._Hmel2_transfer_old.tsv
Hmel2/transfer/Hmel2_transfer_old.tsv
Hmel2/repeats/._Hmel.all.named.final.1-31.lib
Hmel2/repeats/Hmel.all.named.final.1-31.lib
Hmel2/maps/._Hmel2_chromosome_linkage.tsv
Hmel2/maps/Hmel2_chromosome_linkage.tsv
Hmel2/maps/._Hmel2_chromosomes.agp
Hmel2/maps/Hmel2_chromosomes.agp
Hmel2/maps/._Hmel2_scaffold_linkage.tsv
Hmel2/maps/Hmel2_scaffold_linkage.tsv
Hmel2/maps/._Hmel2_scaffo

# Install all the software 

### First install a new isolated miniconda directory

We could create a separate environment in an existing conda installation, but this is simpler since it works for users whether they have conda installed or not, and we can easily remove all the software at the end when we are done by simply deleting the miniconda directory.

In [3]:
%%bash -s "$WORK_DIR"

## path to the installer download
miniconda="https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh"

## Fetch the latest miniconda installer
curl -LkO $miniconda > /dev/null 2>&1

## Install miniconda silently to the work directory.
## -b = "batch" mode, -f = force overwrite, -p = install dir
bash Miniconda-latest-Linux-x86_64.sh -b -f -p $1/miniconda > /dev/null 2>&1 

## update conda, if this fails your other conda installation may be getting in 
## the way. Try temporarily removing your ~/.condarc file.
export PATH="$1/miniconda/bin:$PATH"
conda update -y conda > /dev/null 2>&1 
echo 'conda: '$(which conda)

## remove the install file
rm Miniconda-latest-Linux-x86_64.sh 

conda: /home/deren/manuscript-analysis/miniconda/bin/conda


### Install ipyrad (Eaton & Overcast 2017)

In [4]:
%%bash -s "$WORK_DIR"

## ensure we are using the workdir conda
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"

## installs the latest release (silently)
conda install -y -c ipyrad ipyrad > /dev/null 2>&1  
echo 'ipyrad: '$(which ipyrad)

ipyrad: /home/deren/manuscript-analysis/miniconda/bin/ipyrad


### Install stacks (Catchen)

For this notebook we are using Linux, where installing Stacks is a bit easier than on OSX. Stacks is picky about where it installs. If you build it from source and don't have permission to install to /usr/local (the case on most HPC systems), you need to provide the --prefix argument to ./configure so that it installs locally. Here we avoid that step by installing Stacks from the bioconda channel into the local miniconda directory.

In [5]:
%%bash -s "$WORK_DIR"

## ensure we are using the workdir conda
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"

## installs the latest release (silently)
conda install -y -c bioconda stacks > /dev/null 2>&1  
echo 'stacks: '$(which process_radtags)

stacks: /home/deren/manuscript-analysis/miniconda/bin/process_radtags


### Install dDocent

In [7]:
%%bash -s "$WORK_DIR"

## ensure we're in the right conda
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"

## install dDocent (pinned to version 2.2.4) and its requirements from bioconda
conda install -y -c bioconda ddocent=2.2.4 > /dev/null 2>&1  
echo 'dDocent: '$(which dDocent)

dDocent: /home/deren/manuscript-analysis/miniconda/bin/dDocent


### Install pyrad (Eaton 2014)

In [9]:
%%bash -s "$WORK_DIR"

## ensure we are using the workdir conda
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"

## Should be unnecessary because numpy and scipy are already installed by conda.
## (-y answers the confirmation prompt, which would otherwise be hidden by the redirect)
conda install -y numpy scipy > /dev/null 2>&1

## get muscle binary
muscle="http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz"
curl -LkO $muscle > /dev/null 2>&1
tar -xvzf muscle*.tar.gz > /dev/null 2>&1
mv muscle3.8.31_i86linux64 $WORK_DIR/miniconda/bin/muscle
rm muscle3.8.31_i86linux64.tar.gz
echo 'muscle: '$(which muscle)

## Download and install vsearch
vsearch="https://github.com/torognes/vsearch/releases/download/v2.0.3/vsearch-2.0.3-linux-x86_64.tar.gz"
curl -LkO $vsearch > /dev/null 2>&1
tar xzf vsearch-2.0.3-linux-x86_64.tar.gz > /dev/null 2>&1
mv vsearch-2.0.3-linux-x86_64/bin/vsearch $WORK_DIR/miniconda/bin/vsearch
rm -r vsearch-2.0.3-linux-x86_64/ vsearch-2.0.3-linux-x86_64.tar.gz
echo 'vsearch: '$(which vsearch)

## Fetch pyrad source from git repository 
if [ ! -d ./pyrad-git ]; then
  git clone https://github.com/dereneaton/pyrad.git pyrad-git > /dev/null 2>&1
fi;

## and install to conda using pip
cd ./pyrad-git
pip install -e . > /dev/null 2>&1
cd ..
echo 'pyrad: '$(which pyrad)

muscle: /home/deren/manuscript-analysis/miniconda/bin/muscle
vsearch: /home/deren/manuscript-analysis/miniconda/bin/vsearch
pyrad: /home/deren/manuscript-analysis/miniconda/bin/pyrad
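
Each install cell above echoes `which <tool>` as a check. The same check can be gathered into one place from Python. This is a sketch only: it assumes Python 3 (for `shutil.which`), and `resolve_tools` and `missing_tools` are hypothetical helper names.

```python
import os
import shutil


def resolve_tools(names, bin_dir):
    """Map each executable name to its path inside `bin_dir`, or None if absent."""
    return {name: shutil.which(name, path=bin_dir) for name in names}


def missing_tools(names, bin_dir):
    """Return the sorted names that could not be found in `bin_dir`."""
    found = resolve_tools(names, bin_dir)
    return sorted(name for name, path in found.items() if path is None)
```

For example, `missing_tools(["ipyrad", "process_radtags", "dDocent", "muscle", "vsearch", "pyrad"], os.path.join(WORK_DIR, "miniconda", "bin"))` should return an empty list if every install above succeeded. Restricting the search to the miniconda `bin` directory avoids picking up a copy of a tool from elsewhere on the system.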


# Simulate data
We use simrrls to generate simulated RAD-seq data for testing. simrrls was written by Deren Eaton and is available on GitHub: github.com/dereneaton/simrrls.git. It requires the Python egglib module, which is difficult to install in full, but we need only its simulation components, which are fairly easy to install. See below.

In [11]:
%%bash -s "$WORK_DIR"

## ensure we are using the new conda
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"

## install gsl
conda install gsl -y > /dev/null 2>&1

## get egglib cpp and py
eggcpp="http://mycor.nancy.inra.fr/egglib/releases/2.1.11/egglib-cpp-2.1.11.tar.gz"
curl -LkO $eggcpp > /dev/null 2>&1
tar -xzvf egglib-cpp-*.tar.gz > /dev/null 2>&1
cd egglib-cpp-*/
sh ./configure --prefix=$WORK_DIR/miniconda/ > /dev/null 2>&1
make > /dev/null 2>&1
make install > /dev/null 2>&1
cd ..

## install py module
eggpy="http://mycor.nancy.inra.fr/egglib/releases/2.1.11/egglib-py-2.1.11.tar.gz"
curl -LkO $eggpy > /dev/null 2>&1
tar -xvzf egglib-py-2.1.11.tar.gz > /dev/null 2>&1
cd egglib-py-2.1.11/
python setup.py build --prefix=$WORK_DIR/miniconda > /dev/null 2>&1
python setup.py install --prefix=$WORK_DIR/miniconda > /dev/null 2>&1
cd ..

## install simrrls
if [ ! -d simrrls ] ; then
  git clone https://github.com/dereneaton/simrrls.git
fi
cd simrrls
pip install -e . > /dev/null 2>&1
cd ..
echo 'simrrls: '$(which simrrls)

## cleanup
rm -r egglib-*

simrrls: /home/deren/manuscript-analysis/miniconda/bin/simrrls



### Simulating different RAD-datasets

Both pyRAD and Stacks have changed substantially since the original pyRAD analysis. Because improvements have been made, we want to test the performance of all the current pipelines and to compare current with past performance. We follow the original pyRAD manuscript analysis (Eaton 2014) by simulating modest-sized data sets with variable amounts of indels. We also simulate one much larger data set. Because Stacks has since added an option for gapped analysis, we test both gapped and ungapped assembly.


### Tuning simrrls indel parameter

The -I parameter for simrrls has changed since the initial pyRAD manuscript, so we had to explore new values for this parameter that approximate the number of indels we are after. To do this, we ran simrrls and piped the output to muscle to get a quick idea of the indel variation for different parameter values. This gives a good idea of how many indel-bearing sequences are generated.


In [12]:
# simrrls -n 1 -L 1 -I 1 -r1 $RANDOM 2>&1 | \
#                             grep 0 -A 1 | \
#                             tr '@' '>' | \
#                             muscle | grep T | head -n 60

In [13]:
# for i in {1..50}; 
#   do simrrls -n 1 -L 1 -I .05 -r1 $RANDOM 2>&1 | \
#                                   grep 0 -A 1 | \
#                                   tr '@' '>' | \
#                                   muscle |  \
#                                   grep T | \
#                                   head -n 40 >> rpt.txt;
# done
# grep "-" rpt.txt | wc -l

From experimentation:

-I value -- %loci w/ indels

    0.02 -- ~10%
    0.05 -- ~15%
    0.10 -- ~25%
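
The `grep "-" | wc -l` step in the commented commands above counts aligned sequences that contain at least one gap. A rough Python equivalent is sketched below, assuming the alignment is available as FASTA text (which we take to be the default muscle output). `parse_fasta` and `fraction_with_gaps` are hypothetical helpers, and like the grep approach they count sequences rather than loci, so the result is only a proxy for the percentages in the table.

```python
def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fraction_with_gaps(alignment_text):
    """Return the fraction of aligned sequences containing at least one '-'."""
    seqs = [seq for _, seq in parse_fasta(alignment_text)]
    if not seqs:
        return 0.0
    return sum("-" in seq for seq in seqs) / float(len(seqs))
```

Joining wrapped sequence lines before testing for gaps avoids counting one sequence several times, which a line-based grep can do when a gapped sequence spans more than one line.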

The simulated data will live in these directories:

    SIM_NO_DIR = WORK_DIR/simulated_data/simno
    SIM_LO_DIR = WORK_DIR/simulated_data/simlo
    SIM_HI_DIR = WORK_DIR/simulated_data/simhi
    SIM_LARGE_DIR = WORK_DIR/simulated_data/simlarge

Timing:

    10K loci -- ~8MB -- ~ 2 minutes
    100K loci -- ~80MB -- ~ 20 minutes



### Organize directories for sim data

In [15]:
import subprocess
import shutil

## if True, any existing simulation directories are deleted and recreated
force = True

## Directories for the simulation data
SIM_NO_DIR=os.path.join(SIMULATION_DATA_DIR, "simno")
SIM_LO_DIR=os.path.join(SIMULATION_DATA_DIR, "simlo")
SIM_HI_DIR=os.path.join(SIMULATION_DATA_DIR, "simhi")
SIM_LARGE_DIR=os.path.join(SIMULATION_DATA_DIR, "simlarge")

for idir in [SIMULATION_DATA_DIR, SIM_NO_DIR, SIM_LO_DIR,\
            SIM_HI_DIR, SIM_LARGE_DIR]:
    if force and os.path.exists(idir):
        shutil.rmtree(idir)
    if not os.path.exists(idir):
        os.makedirs(idir)

### Call *simrrls* to simulate data

In [None]:
%%bash -s "$WORK_DIR"

## fix path
export PATH=$1/miniconda/bin:$PATH; 

## call simrrls (No INDELS)
simrrls -o "simno/simno" -ds 2 -L 10000 -I 0 

## call simrrls (Low INDELS)
simrrls -o "simlo/simlo" -ds 2 -L 10000 -I 0.02

## call simrrls (High INDELS) 
simrrls -o "simhi/simhi" -ds 2 -L 10000 -I 0.05

## call simrrls on the large data set (this takes a few hours)
## (30x12=360 individuals at 100K loci)
simrrls -o "simla/Big_i360_L100K" -ds 0 -L 100000 -n 30


### Demultiplex all four libraries

We will start each analysis from the demultiplexed data files, since this is commonly what's available when data are downloaded from NCBI. We use ipyrad to demultiplex the data.


In [None]:
import ipyrad as ip

for idir in ["no", "lo", "hi", "la"]:
    ## demultiplex the library
    name = "sim{}".format(idir)
    data = ip.Assembly(name)
    data.set_params("project_dir", name)
    data.set_params("raw_fastq_path", name+"/*.gz")
    data.set_params("barcodes_path", name+"/*barcodes.txt")
    data.run("1")
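
For readers unfamiliar with the step, demultiplexing sorts reads into per-sample sets by the barcode at the start of each read. The sketch below illustrates only that idea with exact matching. It is not how ipyrad implements step 1, and `demultiplex` is a hypothetical function written for illustration.

```python
def demultiplex(reads, barcodes):
    """Group reads by sample using an exact barcode match at the read start.

    reads: iterable of (name, sequence) tuples.
    barcodes: dict mapping sample name -> barcode string.
    Returns (by_sample, unmatched), where by_sample maps each sample name to
    its reads with the barcode trimmed off.
    """
    by_sample = {sample: [] for sample in barcodes}
    unmatched = []
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    ordered = sorted(barcodes.items(), key=lambda kv: -len(kv[1]))
    for name, seq in reads:
        for sample, barcode in ordered:
            if seq.startswith(barcode):
                by_sample[sample].append((name, seq[len(barcode):]))
                break
        else:
            # No barcode matched this read.
            unmatched.append((name, seq))
    return by_sample, unmatched
```

Reads whose first bases match no barcode are returned separately rather than dropped, so the share of unassigned reads can be inspected.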

### ipyrad : simulated data assembly
Results

| Data set | Cores | Loci | Time |
|----------|-------|--------|-------|
| simno | 4 | 10000 | 13:58 |
| simlo | 4 | 10000 | 15:17 |
| simhi | 4 | 10000 | 13:58 |
| simla | 4 | 100000 | 13:38 |
| simla | 80 | 100000 | ... |

In [16]:
import ipyrad as ip
ip.__version__

'0.5.15'

In [None]:
%%timeit -n1 -r1 
## this records time to run the code in this cell once

## create Assembly
data1 = ip.Assembly("simno")

## set params, including path to store data in ipyrad results dir
data1.set_params("project_dir", os.path.join(IPYRAD_DIR, "SIMDATA"))
data1.set_params("sorted_fastq_path", "simno/simno_fastqs/*.gz")
data1.set_params('max_low_qual_bases', 4)
data1.set_params('filter_min_trim_len', 69)
data1.set_params('max_Ns_consens', (99,99))
data1.set_params('max_Hs_consens', (99,99))
data1.set_params('max_SNPs_locus', (100, 100))
data1.set_params('min_samples_locus', 2)
data1.set_params('max_Indels_locus', (99,99))
data1.set_params('max_shared_Hs_locus', 99)
data1.set_params('trim_overhang', (2,2,2,2))

## run ipyrad steps 1-7
data1.run("1234567", show_cluster=True)
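
If the same filter settings are wanted for the other simulated libraries, the `set_params` calls in the cell above could be driven from a single list. This is a sketch under that assumption: `apply_params` and `SIM_FILTER_PARAMS` are hypothetical names, and the only ipyrad call relied on is `set_params(name, value)` as used above.

```python
def apply_params(assembly, params):
    """Call assembly.set_params(name, value) for each (name, value) pair, in order."""
    for name, value in params:
        assembly.set_params(name, value)
    return assembly


# Values copied from the cell above; a list of pairs keeps the original order.
SIM_FILTER_PARAMS = [
    ("max_low_qual_bases", 4),
    ("filter_min_trim_len", 69),
    ("max_Ns_consens", (99, 99)),
    ("max_Hs_consens", (99, 99)),
    ("max_SNPs_locus", (100, 100)),
    ("min_samples_locus", 2),
    ("max_Indels_locus", (99, 99)),
    ("max_shared_Hs_locus", 99),
    ("trim_overhang", (2, 2, 2, 2)),
]
```

With this in place, `apply_params(ip.Assembly("simlo"), SIM_FILTER_PARAMS)` would configure a second assembly identically. The project directory and fastq path would still need to be set per library.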