# Species-tree & species-delimitation using *bpp* (BP&P) or *ibpp*
The program *bpp* by Rannala & Yang (2010; 2015) is useful for inferring species tree parameters and testing species delimitation hypotheses. It is *relatively* easy to use, and best of all, it's *quite fast*, although not highly parallelizable. This notebook describes a relatively streamlined approach we've developed to easily set up input files for testing different hypotheses in *bpp*, and to do so in a clear programmatic way. We also show how to submit many separate jobs to run in parallel. This approach also works with the program *ibpp*, which allows integration of traits with sequence data. 

### Using Jupyter notebooks
If you have not used Jupyter notebooks before, please see the other documentation for an introduction. This is a Jupyter notebook containing documented code, in this case all Python, that can be used to replicate an analysis. The purpose of these notebooks is to provide a document that is easy to share, rerun, and/or use as supplemental material, simply by uploading it to a site such as GitHub. 

In [88]:
## Start by importing a few python modules
import ipyrad
import ipyparallel as ipp
import subprocess
import socket
import os
import sys

### Download and install *bpp* v.3.3 locally (only tested on Linux)
The code in the link below creates a new directory in *~/local/src/* (if it does not already exist) and installs *bpp* there from its source code. This creates a binary file called **bpp** which can be executed from the command line. This code installs the software *locally*, meaning that you do not need administrator privileges to install it. You can copy and paste the code into a terminal, or into a cell of this notebook (with a %%bash header). When finished it will print out the location where it is installed. Keep note of that location because you may have to enter it later in this notebook. It will be in '~/local/src/bpp3.3/src/bpp'.
https://gist.github.com/dereneaton/73a377c643adaddc83635506a81180af


### Download and Install *ibpp* (v.2.1)
Installation of *ibpp* follows a similar procedure, and the program is installed in a similar place. The source code in this case is downloaded (cloned) from GitHub, so you will need to have *git* installed/loaded. This is usually available by default on a Linux machine and/or HPC cluster. Execute the code at the link below, which will print out the location where *ibpp* is installed. Keep note of that location because you may have to enter it later in this notebook. It will be in '~/local/src/iBPP/src/ibpp'.
https://gist.github.com/dereneaton/527b87488eede7b670222640fe26878d

### Create input files (.seq.txt, .imap.txt, .ctl.txt, and .traits.txt) 
We can create these files quite easily by parsing the sequence information from the *.loci* file produced by ipyrad, and by providing some additional information about which samples should be grouped together into the same "species" using Python dictionaries. I show an example of this below, using a function from the ipyrad API [**loci2bpp()**] that we've created for this purpose. This will create all of the dependency files for a bpp analysis. The first is the IMAP file (*.bpp.imap.txt*), which simply maps sample names to species groups. The second is the sequence file (*.bpp.seq.txt*), which obviously contains the sequence data, properly formatted. And the third is the *.bpp.ctl.txt* file, which contains parameters for the bpp analysis. A final optional 'traits' file can also be produced for ibpp analyses. 

The *loci2bpp()* function further contains several options for filtering loci or samples from the sequence data. For example, you can keep only loci that have at least N samples in each species, and it removes any sample from the data set that is not listed in your IMAP dictionary. You can also set all of the ctl parameters here. We'll start by creating an IMAP dictionary that matches 'species' names to lists of sample names belonging to each species. 

In [2]:
## Create a mapping dictionary
## The keys are 'species', i.e., clades/groups for your samples, 
## The values are lists of sample names that belong to each group

IMAP = {"A": ["1A_0", "1B_0", "1C_0", "1D_0"], 
        "B": ["2E_0", "2F_0", "2G_0", "2H_0"],
        "C": ["3I_0", "3J_0", "3K_0", "3L_0"]
       }

In [3]:
## Then you must write your tree hypothesis as a newick string.
## This must include all 'species' names in the imap dictionary

TREE = "((A,B),C);"
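Because the tree must contain exactly the 'species' names used in the IMAP dictionary, it can be worth checking this before building any input files. The following is a minimal sketch of such a check, not part of ipyrad: the helper names `newick_tips` and `check_tree` are hypothetical, and the regular expression assumes simple, unquoted tip labels.

```python
import re

def newick_tips(newick):
    """Extract tip labels from a simple Newick string (no quoted labels)."""
    # A tip label directly follows '(' or ','; any ':branchlength' is excluded.
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))

def check_tree(imap, newick):
    """Raise ValueError if the tree tips and the IMAP keys differ."""
    tips = newick_tips(newick)
    missing = set(imap) - tips
    extra = tips - set(imap)
    if missing or extra:
        raise ValueError("missing from tree: {}; not in IMAP: {}".format(
            sorted(missing), sorted(extra)))
    return True

# small stand-in dictionary; only the keys matter for this check
DEMO_IMAP = {"A": [], "B": [], "C": []}
tree_ok = check_tree(DEMO_IMAP, "((A,B),C);")
```

Calling `check_tree(IMAP, TREE)` with your own dictionary and tree should either return True or report which names do not match.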



(Optional): You can further designate an additional dictionary that will be used to subsample loci for inclusion in the *bpp* analysis. Below I call this dictionary MINMAP, and it will be used to filter loci so that we only include loci in the analysis that have at least N taxa with sequence data in a locus for each given 'species' group.  

In [89]:
## (Optional) The keys are 'species', i.e., clade/group names 
## The values are the number of samples in each 'species' that must have data
## for a given locus for it to be included in the data set. 
## The example here will allow no missing data. 

MINMAP = {"A": 4, 
          "B": 4, 
          "C": 4,
         }
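To make the filtering rule concrete, here is a small sketch of the test a MINMAP dictionary implies for a single locus. It only illustrates the logic; the helper `locus_passes_minmap` is hypothetical and is not how ipyrad necessarily implements the filter internally.

```python
def locus_passes_minmap(samples_in_locus, imap, minmap):
    """True if every species has at least minmap[species] samples present."""
    present = set(samples_in_locus)
    for species, minimum in minmap.items():
        n_present = len(present.intersection(imap[species]))
        if n_present < minimum:
            return False
    return True

# toy values for illustration only
DEMO_IMAP = {"A": ["1A_0", "1B_0"], "B": ["2E_0", "2F_0"]}
DEMO_MINMAP = {"A": 2, "B": 1}

# both A samples and one B sample are present: the locus is kept
full = locus_passes_minmap(["1A_0", "1B_0", "2E_0"], DEMO_IMAP, DEMO_MINMAP)

# only one A sample is present: the locus is dropped
sparse = locus_passes_minmap(["1A_0", "2E_0", "2F_0"], DEMO_IMAP, DEMO_MINMAP)
```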

In [87]:
## (Optional) Traits as a pandas DataFrame (used only by iBPP)
## If you have your trait data in CSV format you can easily read it
## in using Pandas. The first column should have sample names, and 
## all following columns should have quantitative traits. Missing 
## data should be left as blank cells.

TRAIT_CSV = """\
Indiv, t1, t2, t3
1A_0, 3, 40.1, 0.9
1B_0, 3, 38.8, 1.0
1C_0, 4, 35.4, 1.2
1D_0, 4, 37.0, 1.0
2E_0, 5, 33.0, 0.7
2F_0, 5, 32.4, 0.7
2G_0, , , 0.5
2H_0, , , 0.5
3I_0, 8, 65.0, 0.6
3J_0, 8, 67.4, 0.4
3K_0, 8, 68.2, 0.3
3L_0, 9, 59.9, 0.3
"""

## I'm using the stringIO library here b/c it will make the string TRAIT_CSV
## above act like it is a CSV file that we have saved on disk. 
import StringIO
csvfile = StringIO.StringIO(TRAIT_CSV)

## we can load it in using the pandas.read_csv() function, which can also 
## convert missing data cells to the proper 'nan' setting. In TRAIT_CSV above
## missing data cells are blank (a single space), which na_values=" " matches. 
import pandas
traits = pandas.read_csv(csvfile, delimiter=",", na_values=" ", index_col=0)
      
## mean standardize columns
traits = traits.apply(lambda x: (x - x.mean()) / (x.std()))
print traits

             t1        t2        t3
Indiv                              
1A_0  -1.167918 -0.497734  0.760639
1B_0  -1.167918 -0.582649  1.098701
1C_0  -0.735356 -0.804735  1.774824
1D_0  -0.735356 -0.700224  1.098701
2E_0  -0.302794 -0.961502  0.084515
2F_0  -0.302794 -1.000693  0.084515
2G_0        NaN       NaN -0.591608
2H_0        NaN       NaN -0.591608
3I_0   0.994893  1.128719 -0.253546
3J_0   0.994893  1.285486 -0.929670
3K_0   0.994893  1.337741 -1.267731
3L_0   1.427456  0.795590 -1.267731
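For readers who want to see what the standardization does without pandas, the sketch below repeats it for the `t3` column in plain Python. It assumes the sample standard deviation (dividing by n - 1), which is what pandas' `.std()` uses by default; the helper name `standardize` is hypothetical.

```python
import math

def standardize(values):
    """Z-score a list, ignoring None; uses the sample std (n - 1)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / float(len(observed))
    var = sum((v - mean) ** 2 for v in observed) / (len(observed) - 1)
    std = math.sqrt(var)
    # missing values stay missing, everything else is centered and scaled
    return [None if v is None else (v - mean) / std for v in values]

# the t3 column from TRAIT_CSV above
t3 = [0.9, 1.0, 1.2, 1.0, 0.7, 0.7, 0.5, 0.5, 0.6, 0.4, 0.3, 0.3]
z3 = standardize(t3)
```

The first and last values of `z3` should agree with the `t3` column printed above (about 0.7606 and -1.2677).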


### Run loci2bpp() to generate bpp input files
The loci2bpp() function has four required arguments: a name, a loci file, an IMAP dictionary, and a tree hypothesis. In addition there is a wide range of optional arguments that can be passed to fine tune the analysis. You can see further documentation of these options by checking out the help function (in a cell type:  **?ipyrad.file_conversion.loci2bpp()**). You can also see that it returns the path to the ctl file as a string, which you will see later can be quite useful.  

In [90]:
## enter the path to your loci file
locifile = "/home/deren/Documents/ipyrad/tests/cli/cli_outfiles/cli.loci"

## create bpp seq file with data for all samples in the loci file and IMAP dict.
## if you tell it verbose=True then it will also print the ctl file info to the screen
ipyrad.file_conversion.loci2bpp('test', locifile, IMAP, TREE, verbose=True)

ctl file
--------
seed = 12345
seqfile = /home/deren/Documents/ipyrad/tests/test.bpp.seq.txt
Imapfile = /home/deren/Documents/ipyrad/tests/test.bpp.imap.txt
mcmcfile = /home/deren/Documents/ipyrad/tests/test.bpp.mcmc.txt
outfile = /home/deren/Documents/ipyrad/tests/test.bpp.out.txt
nloci = 9999
usedata = 1
cleandata = 0
speciestree = 0
speciesdelimitation = 0 0 5
species&tree = 3 A C B
                 4 4 4
                 ((A,B),C);
thetaprior = 5 5
tauprior = 4 2 1
finetune = 1: 1 0.002 0.01 0.01 0.02 0.005 1.0
print = 1 0 0 0
burnin = 1000
sampfreq = 2
nsample = 10000
--------

new files created (9999 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt


'/home/deren/Documents/ipyrad/tests/test.bpp.ctl.txt'
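Since the ctl file is plain text, it is easy to read back in and confirm which settings a given run used. Below is a rough sketch of a parser for the `key = value` layout printed above. The helper `parse_ctl` is hypothetical, and it assumes that lines without an `=` (such as the extra lines under `species&tree`) continue the previous key.

```python
def parse_ctl(text):
    """Parse 'key = value' lines; other non-empty lines continue the last key."""
    params = {}
    key = None
    for line in text.splitlines():
        stripped = line.strip()
        # skip blank lines and '--------' separators
        if not stripped or set(stripped) == {"-"}:
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            key = key.strip()
            params[key] = value.strip()
        elif key is not None:
            params[key] += "\n" + stripped
    return params

# a shortened copy of the ctl text printed above
DEMO_CTL = """\
seed = 12345
nloci = 9999
species&tree = 3 A C B
                 4 4 4
                 ((A,B),C);
thetaprior = 5 5
"""
ctl = parse_ctl(DEMO_CTL)
```

For a file on disk you could pass `open(ctlfile).read()` to the same function.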

#### LOTS of extra arguments are available in *loci2bpp()*
These can be used to filter the loci that will be included in the data set, as well as to modify the parameters that will be used in *bpp* and which are specified in the *.ctl* file. The *.ctl* file has a large range of options, and so for some advanced usage you may still need to modify the file by hand, but our intention with this function is to at least provide a fairly easy way to produce these files programmatically, instead of always having to produce them by hand. You can see in the final example that we provided the traits dictionary, that loci2bpp() created an extra .traits.txt file, and that all of the files produced have ibpp in their names instead of bpp. 


In [9]:
## enter the path to your loci file
locifile = "/home/deren/Documents/ipyrad/tests/cli/cli_outfiles/cli.loci"

## Create bpp seq file with data for all samples in the loci file and IMAP dict
ipyrad.file_conversion.loci2bpp('test', locifile, IMAP, TREE)

## Create bpp file with only the first 100 loci
ipyrad.file_conversion.loci2bpp('test', locifile, IMAP, TREE, maxloci=100)

## Only keep loci that have at least MINMAP samples for each species
ipyrad.file_conversion.loci2bpp('test', locifile, IMAP, TREE, minmap=MINMAP)

## Only keep loci that have at least MINMAP samples for each species
## and write the ctl file so that we perform species delimitation
ipyrad.file_conversion.loci2bpp('test', locifile, IMAP, TREE, minmap=MINMAP, infer_delimit=1)

## Create an iBPP ctl file that includes trait information from the trait file.
## We will name this one 'itest' to differentiate it. 
ipyrad.file_conversion.loci2bpp('itest', locifile, IMAP, TREE, minmap=MINMAP, 
                                traitdict=TRAITS)

new files created (9999 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt
new files created (100 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt
new files created (9994 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt
new files created (9994 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt
new files created (9994 loci, 3 species, 12 samples)
  itest.ibpp.seq.txt
  itest.ibpp.imap.txt
  itest.ibpp.ctl.txt
  itest.ibpp.traits.txt


'/home/deren/Documents/ipyrad/tests/itest.ibpp.ctl.txt'

### Why?
You could of course alternatively create all of the bpp input files by hand but trust me, it's a pain. Besides, by making it programmatic in this way you can easily create a variety of input files for different jobs with different parameter settings. Furthermore, it will be easy to share your code with others to show how you created a range of analyses. It's certainly much easier to share a bit of code than it is to share 20 different ctl files that you produced. Below we show an example where we create bpp input files for a range of parameter values and submit them to run in parallel on a cluster. 

### What if I don't want to run parallel jobs?
Simple. You can just call bpp or ibpp on a single *.ctl.txt* file at a time. I would recommend running parallel code, however, since each job takes pretty long to run, and each bpp job can only run on a single CPU at a time. Although we can't parallelize a single run of *bpp*, we can run many jobs simultaneously, allowing us to test a bunch of different priors, or delimitation methods. 

In [10]:
## %%bash

## I've commented the code out, but you could uncomment it to run a single job.
## The '> bpp-log.txt 2>&1' part saves all of the output to a file instead of to the screen 
# bpp test.bpp.ctl.txt > bpp-log.txt 2>&1
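If you would rather stay in Python for a single run, the same idea (run one command and send everything it prints to a log file) can be sketched with `subprocess`. This is only an illustration: the helper `run_and_log` is hypothetical, and a stand-in command is used so that the cell runs even without *bpp* installed.

```python
import os
import subprocess
import sys
import tempfile

def run_and_log(cmd, logpath):
    """Run cmd, write combined stdout/stderr to logpath, return the exit code."""
    with open(logpath, "w") as log:
        return subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)

# write the log into a temporary directory for this demonstration
logpath = os.path.join(tempfile.mkdtemp(), "bpp-log.txt")

# Stand-in command; for a real run you would pass something like
# ["/path/to/bpp", "test.bpp.ctl.txt"] (both paths are assumptions).
exit_code = run_and_log([sys.executable, "-c", "print('hello')"], logpath)
```

A non-zero `exit_code` means the program reported an error, and the log file is the first place to look.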

### Set up a parallel client to submit parallel jobs through this notebook
We need to know a few tricks to submit parallel jobs from this Jupyter notebook. This is all handled by the ipyparallel library, which we loaded at the top of this notebook. We have a separate tutorial with more background about using ipyparallel. You will need to have an 'ipcluster' instance running in a separate terminal on your machine (or ideally, it is running on your HPC cluster). The code below simply connects to that cluster and prints how many CPUs are available for use. 

In [91]:
## Connect to the running ipcluster instance
## (you need to start it in a separate terminal)
ipyclient = ipp.Client()
lbview = ipyclient.load_balanced_view()

## print some information about our cluster
res = ipyclient[:].apply(socket.gethostname)
for host in set(res.result_dict.values()):
    print "compute node: [{} cores] on {}"\
          .format(res.result_dict.values().count(host), host)

compute node: [10 cores] on tinus


### A function to run bpp/ibpp
This function simply calls the bpp/ibpp binary. If you installed your binaries into a different location than the default in the install scripts at the beginning of this notebook then you will have to change the path to the binaries in this function.

In [93]:
def run_bpp(ctlfile):
    """ run bpp command line program """
    
    import subprocess    
    cmd = ["/home/deren/local/src/bpp3.3/src/bpp", ctlfile]
    proc = subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
    proc.communicate()
    
 

In [102]:
ctlfile = ipyrad.file_conversion.loci2bpp('test', locifile, IMAP, TREE, maxloci=100)
async = lbview.apply(run_bpp, ctlfile)

new files created (100 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt


Now, we want each job that we submit to have a unique name. The code further below creates new jobs over a range of theta and tau prior values. For each combination it builds a name (rname) that stores those values, passes it to the loci2bpp function to create new input files, and then submits the job to run on the cluster. You could edit this code to iterate over a different range of parameter settings. 

In [116]:
ipyclient.queue_status()

{0: {u'completed': 1, u'queue': 0, u'tasks': 0},
 1: {u'completed': 0, u'queue': 0, u'tasks': 1},
 2: {u'completed': 1, u'queue': 0, u'tasks': 0},
 3: {u'completed': 1, u'queue': 0, u'tasks': 0},
 4: {u'completed': 1, u'queue': 0, u'tasks': 0},
 5: {u'completed': 1, u'queue': 0, u'tasks': 0},
 6: {u'completed': 0, u'queue': 0, u'tasks': 1},
 7: {u'completed': 1, u'queue': 0, u'tasks': 0},
 8: {u'completed': 0, u'queue': 0, u'tasks': 1},
 9: {u'completed': 1, u'queue': 0, u'tasks': 0},
 u'unassigned': 0}

In [115]:
if async.ready():
    if not async.successful():
        print async.exception()
    else:
        print async, async.result()
else:
    print async, 'still running'

<AsyncResult: BPP> still running


In [32]:
## a dictionary to store our results in
asyncs = {}

## send jobs to run 'asynchronously' using 'apply' over a range of values
for theta in [(5, 5), (5, 50), (5, 500)]:
    for tau in [(1, 1, 1), (1, 10, 1), (1, 100, 1)]:
        
        ## name this run by its theta and tau params
        rname = 'TEST-o.{}.{}_t.{}.{}.{}'.format(*theta+tau)
    
        ## create input files for this run, the function returns the ctl
        ## file name as a string, which we will store and use below
        ctlfile = ipyrad.file_conversion.loci2bpp(rname, locifile, IMAP, TREE, 
                                                  thetaprior=theta, 
                                                  tauprior=tau, 
                                                  nsample=10000, 
                                                  burnin=1000,
                                                  maxloci=100)

        ## submit job to the queue as args to run_bpp
        asyncs[ctlfile] = lbview.apply(run_bpp, ctlfile)
        
        ## print that the job was submitted
        sys.stderr.write('job submitted: bpp {}\n\n'.format(ctlfile))

new files created (100 loci, 3 species, 12 samples)
  TEST-o.5.5_t.1.1.1.bpp.seq.txt
  TEST-o.5.5_t.1.1.1.bpp.imap.txt
  TEST-o.5.5_t.1.1.1.bpp.ctl.txt
job submitted: bpp TEST-o.5.5_t.1.1.1.bpp.ctl.txt

new files created (100 loci, 3 species, 12 samples)
  TEST-o.5.5_t.1.10.1.bpp.seq.txt
  TEST-o.5.5_t.1.10.1.bpp.imap.txt
  TEST-o.5.5_t.1.10.1.bpp.ctl.txt
job submitted: bpp TEST-o.5.5_t.1.10.1.bpp.ctl.txt

new files created (100 loci, 3 species, 12 samples)
  TEST-o.5.5_t.1.100.1.bpp.seq.txt
  TEST-o.5.5_t.1.100.1.bpp.imap.txt
  TEST-o.5.5_t.1.100.1.bpp.ctl.txt
job submitted: bpp TEST-o.5.5_t.1.100.1.bpp.ctl.txt

new files created (100 loci, 3 species, 12 samples)
  TEST-o.5.50_t.1.1.1.bpp.seq.txt
  TEST-o.5.50_t.1.1.1.bpp.imap.txt
  TEST-o.5.50_t.1.1.1.bpp.ctl.txt
job submitted: bpp TEST-o.5.50_t.1.1.1.bpp.ctl.txt

new files created (100 loci, 3 species, 12 samples)
  TEST-o.5.50_t.1.10.1.bpp.seq.txt
  TEST-o.5.50_t.1.10.1.bpp.imap.txt
  TEST-o.5.50_t.1.10.1.bpp.ctl.txt
job submitted:
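The nested loops above visit every combination of the three theta priors and the three tau priors. As a quick sanity check before submitting anything, the run names alone can be generated with `itertools.product`. This is just a sketch of the naming step and does not call loci2bpp or the cluster.

```python
import itertools

thetas = [(5, 5), (5, 50), (5, 500)]
taus = [(1, 1, 1), (1, 10, 1), (1, 100, 1)]

# one name per (theta, tau) combination, in the same order as the loops above
run_names = [
    'TEST-o.{}.{}_t.{}.{}.{}'.format(*(theta + tau))
    for theta, tau in itertools.product(thetas, taus)
]
```

Checking that `run_names` contains no duplicates is a cheap way to make sure that no run will overwrite the input files of another.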

### Track progress
You could interrupt and/or restart this progress tracker without it interrupting the jobs that are running on the ipcluster engines. As you can see, we can still continue to work in this notebook while these jobs are running. We will have to wait for them to finish before we move on to analyzing the results, however. 

In [None]:
## print success/failure of jobs
for job in asyncs:
    if asyncs[job].ready():
        if asyncs[job].successful():
            print "{:<40} -- finished".format(job)
        else:
            print "{:<40} -- failed: {}".format(job, asyncs[job].exception())
    else:
        print "{:<40} -- still running".format(job)
        
## if you wanted to cancel the jobs running on the cluster
## you can do so by uncommenting and running the code below.
# for job in asyncs:
#     asyncs[job].abort()
#     asyncs[job].cancel()
#     print asyncs[job]
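When many jobs are running, a tally is often easier to read than one line per job. The sketch below counts jobs by state using only the `ready()` and `successful()` methods. The helper `summarize_jobs` is hypothetical, and `FakeJob` is a stand-in for ipyparallel's AsyncResult so that the example runs without a cluster.

```python
from collections import Counter

def summarize_jobs(jobs):
    """Count jobs by state; each value needs ready() and successful()."""
    states = Counter()
    for name, job in jobs.items():
        if not job.ready():
            states["running"] += 1
        elif job.successful():
            states["finished"] += 1
        else:
            states["failed"] += 1
    return states

class FakeJob(object):
    """Stand-in for an AsyncResult, for demonstration only."""
    def __init__(self, ready, ok):
        self._ready, self._ok = ready, ok
    def ready(self):
        return self._ready
    def successful(self):
        return self._ok

demo_jobs = {"a.ctl.txt": FakeJob(True, True),
             "b.ctl.txt": FakeJob(True, False),
             "c.ctl.txt": FakeJob(False, False)}
summary = summarize_jobs(demo_jobs)
```

With a live cluster you would call `summarize_jobs(asyncs)` instead of using the fake jobs.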

### Interpreting/analyzing results
In this example we ran *bpp* under nine different prior settings (three theta priors crossed with three tau priors). We can compare the results of these analyses to investigate the effect of the prior on the estimated posterior distributions of the parameter estimates from the multi-species coalescent ($\theta$ and $\tau$). 

In [12]:
## I'll leave that to you.
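As one possible starting point, the sketch below computes the mean of each column in a table of posterior samples. It assumes that the samples are stored as tab-delimited text with a header row and a generation column named `Gen`; that layout and the column names in the mock data are assumptions, so check them against your own *.mcmc.txt* files before relying on this. The helper `column_means` is hypothetical.

```python
import csv
import io

def column_means(handle, skip=("Gen",)):
    """Mean of each numeric column in a tab-delimited file with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    totals, count = {}, 0
    for row in reader:
        count += 1
        for name, value in row.items():
            if name not in skip:
                totals[name] = totals.get(name, 0.0) + float(value)
    return {name: total / count for name, total in totals.items()}

# mock samples with made-up column names, for illustration only
MOCK = u"Gen\ttheta_A\ttau_ABC\n2\t0.010\t0.004\n4\t0.014\t0.006\n"
means = column_means(io.StringIO(MOCK))
```

Running the same function on the output of each prior setting and comparing the resulting dictionaries gives a first impression of how sensitive the estimates are to the priors.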


### So what's a smart test to perform?
Well, my interest in bpp was to perform species delimitation. And Rannala and Yang suggest that you try out both species delimitation algorithms and that you do so over a range of params for the two algorithms. They suggest that you run algorithm 0 with $\epsilon$=(2, 5, 10, 20), and algorithm 1 with $\alpha$=(1, 1.5, 2) and $m$=(1, 1.5, 2). And also to do this with different starting trees. So let's set up that test below for the example RAD data set from ipyrad, which in this case is the loci file that we have been using. 

In [158]:
## set up a couple tests to perform
DELIMIT_TESTS = [
    (0, (2)),
    (0, (5)),
    (0, (10)),
    (0, (20)),
    (1, (1.0, 1.0)),
    (1, (1.0, 1.5)),
    (1, (1.0, 2.0)),
    (1, (1.5, 1.0)), 
    (1, (1.5, 1.5)), 
    (1, (1.5, 2.0)),
    (1, (2.0, 1.0)), 
    (1, (2.0, 1.5)), 
    (1, (2.0, 2.0))
]

TREE_TESTS = [
    "((A,B),C);",
    "((A,C),B);",
    "((B,C),A);"
]
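Lists like these can also be generated rather than typed out, which makes it harder to skip or repeat a combination. The sketch below rebuilds the delimitation tests with `itertools.product` in the same shape as the hand-written list, and crosses them with the three possible rooted trees for three species. The lowercase names are hypothetical, and the tuple layout that `delimit_alg` expects should be confirmed in the loci2bpp help.

```python
import itertools

epsilons = [2, 5, 10, 20]
alphas = [1.0, 1.5, 2.0]
ms = [1.0, 1.5, 2.0]

# algorithm 0 takes a single epsilon; algorithm 1 takes (alpha, m)
delimit_tests = [(0, eps) for eps in epsilons]
delimit_tests += [(1, (a, m)) for a, m in itertools.product(alphas, ms)]

# the three rooted topologies for three species
tree_tests = ["((A,B),C);", "((A,C),B);", "((B,C),A);"]

# every delimitation setting paired with every starting tree
all_tests = list(itertools.product(delimit_tests, tree_tests))
```

Looping over `all_tests` instead of `DELIMIT_TESTS` alone would run each delimitation setting on each starting tree.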

In [None]:
## a dictionary to store our results in
asyncs = {}

## send jobs to run 'asynchronously'
for alg in DELIMIT_TESTS:
    ## name this run by its delimitation algorithm and its parameters
    params = alg[1] if isinstance(alg[1], tuple) else alg[1:]
    rname = 'DTEST-{}_{}'.format(alg[0], ".".join(str(i) for i in params))

    ## create input files for this run, the function returns the ctl
    ## file name as a string, which we will store and use below
    ctlfile = ipyrad.file_conversion.loci2bpp(rname, locifile, IMAP, TREE, 
                                              infer_delimit=1, 
                                              delimit_alg=(alg),
                                              thetaprior=theta, tauprior=tau, 
                                              nsample=10000, burnin=1000,
                                              maxloci=100)

    ## submit job to the queue as args to run_bpp
    asyncs[ctlfile] = lbview.apply(run_bpp, ctlfile)

    ## print that the job was submitted
    sys.stderr.write('job submitted: bpp {}\n\n'.format(ctlfile))

# Empirical example
This is an empirical example from Federman et al. (In Prep). 

In [30]:
%%bash

## download the Canarium .loci file from the web, it's kinda big (~100Mb)
## and will take a few minutes, so be patient.
curl -LkO https://dl.dropboxusercontent.com/u/2538935/CanEnd_min20.loci


#### Create the '.bpp.imap.txt' file

In [195]:
## a mapping dictionary made by hand, mapping 
## sample names to clades/groups.
IMAP = {
    "A": ['SF172', 'SF175', 'SF328', 'SF200', 'SF209',
          'D14528', 'SF276', 'SF286', 'D13052'],
    "B": ['D13101', 'D13103', 'D14482', 'D14483'],
    "C": ['D14504', 'D14505', 'D14506'],
    "D": ['D14477', 'D14478', 'D14480', 'D14485', 'D14501', 'D14513'], 
    "E": ['D13090', 'D12950'],
    "F": ['D13097', 'SF155', 'D13063', 'D12963', 'SF160', 'SF327',
          'SF224', 'SF228', '5573', 'SF153', 'SF164', 'D13075', 'SF197'], 
    }

## Optional: minmap
## enter here the minimum number of samples you require to be present 
## from each clade at every locus in the analysis. 
MINMAP = {
    "A": 8, 
    "B": 4, 
    "C": 3,
    "D": 4, 
    "E": 2, 
    "F": 10,
}

## Species tree hypothesis
TREE = "((((D,B),C),(E,F)),A);"

In [240]:
data

Unnamed: 0,Indiv,leaf_tot,juga,leaf_juga_ratio,stip_dist,stip_scar_length,pet_length,petiole_stip_ratio,lateral_petiolules,basal_petiolule,...,lateral_lft_W,lateral_L_widest_point,ll_lw_ratio,ll_wp_ratio,termil_lft_L,termil_lft_W,termil_L_widest_point,tl_tw_ratio,tl_wp_ratio,X2o_vein_pairs
0,SF175,371.77,4.67,79.49,21.38,2.3,59.43,2.91,19.45,8.98,...,36.79,55.04,2.98,2.98,85.59,40.27,45.99,2.18,1.88,10.67
1,SF328,268.61,4.0,67.15,39.91,1.39,67.65,1.72,9.01,5.4,...,33.04,41.29,2.35,2.35,67.84,29.39,40.11,2.32,1.71,10.5
2,SF200,208.42,3.67,58.15,12.6,1.78,32.21,2.6,8.58,4.63,...,27.35,37.53,2.37,2.37,55.54,30.66,34.89,1.82,1.59,10.0
3,SF209,218.85,4.0,56.69,13.0,2.04,42.98,3.46,8.62,4.94,...,30.8,35.68,2.05,2.05,61.2,30.43,30.8,2.03,2.01,10.67
4,D14528,264.57,4.0,66.14,10.85,2.99,53.21,4.9,12.23,9.12,...,33.91,53.94,2.61,2.61,79.49,35.67,49.57,2.22,1.6,10.33
5,SF276,283.45,3.0,94.48,21.84,2.64,74.72,3.55,12.56,8.35,...,37.79,52.69,2.31,2.31,82.05,44.05,51.32,1.87,1.6,8.0
6,SF286,288.35,3.0,96.12,27.57,2.75,72.73,2.65,17.31,11.73,...,52.27,42.52,1.58,1.58,83.78,46.91,44.01,1.78,1.93,8.5
7,D14504,323.02,6.0,54.21,12.54,2.82,65.82,6.09,8.6,5.32,...,48.74,44.06,1.7,1.7,70.46,36.71,34.97,1.94,2.05,16.67
8,D14505,448.82,8.0,56.25,9.51,3.0,63.97,6.7,5.77,5.67,...,54.47,45.0,2.29,2.29,129.33,64.38,65.85,2.01,2.03,17.5
9,D14506,534.65,7.0,77.69,11.67,3.5,91.8,8.0,15.4,13.51,...,70.34,66.34,2.85,2.85,155.46,61.82,74.68,2.52,2.06,20.33


In [None]:
## Read in trait data (csv) from ("https://dl.dropboxusercontent.com/u/2538935/CanEnd_trait2.csv")
import pandas
data = pandas.read_csv("https://dl.dropboxusercontent.com/u/2538935/CanEnd_trait2.csv", 
                       na_values="")

## it is assumed that traits are normally distributed, we can help by 
## mean-standardizing trait values.
data.ix[:, 1:] = data.ix[:, 1:].apply(lambda x: (x - x.mean()) / (x.std()))

## Fill nan cells with string NA.
#data = data.fillna("NA")

## convert to a dictionary {name: [traitlist]}
TRAITS = {}
for row in xrange(len(data)):
    TRAITS[data.Indiv[row]] = [round(x, 3) if isinstance(x, float) else "NA" for x in data.iloc[row][1:]]
    
## print a preview
for key, val in TRAITS.items():
    print "{}\t:  [{}\t] ...".format(key, "\t".join([str(i) for i in val[:11]]))

#### Create the '.bpp.seq.txt' file

In [154]:
## enter the path to your loci file
locifile = "./CanEnd_min20.loci"

## create a bpp file from the loci file. If you want fewer loci than the total, 
## for example just to make the analysis run faster, you can enter an optional 
## value for 'maxloci' to grab just the first N loci.
ipyrad.file_conversion.loci2bpp("CanEnd", locifile, IMAP, TREE, 
                                minmap=MINMAP,
                                maxloci=100, 
                                nsample=100000, burnin=10000, 
                                thetaprior=(2, 1000), 
                                tauprior=(2, 2000, 1),
                                traitdict=TRAITS,
                                verbose=True
                                )                                
                                

ctl file
--------
seed = 12345
seqfile = CanEnd.ibpp.seq.txt
Imapfile = CanEnd.ibpp.imap.txt
mcmcfile = CanEnd.ibpp.mcmc.txt
outfile = CanEnd.ibpp.out.txt
traitfile = CanEnd.ibpp.traits.txt
nloci = 100
cleandata = False
speciesdelimitation = 0
ntraits = 27
nindT = 36
usetraitdata = 1
useseqdata = 1
nu0 = 0
kappa0 = 0
species&tree = 6 A C B E D F
                 9 3 4 2 6 13
                 ((((D,B),C),(E,F)),A);
thetaprior = 2 1000
tauprior = 2 2000 1
finetune = 1: 1 0.002 0.01 0.01 0.02 0.005 1.0
print = 1 0 0 0
burnin = 10000
sampfreq = 2
nsample = 100000
--------

new files created (100 loci, 6 species, 37 samples)
  CanEnd.ibpp.seq.txt
  CanEnd.ibpp.imap.txt
  CanEnd.ibpp.ctl.txt
  CanEnd.ibpp.traits.txt


'CanEnd.ibpp.ctl.txt'

In [155]:
async = ipyclient[0].apply(run_bpp, 'CanEnd.ibpp.ctl.txt')

In [163]:
async.exception()

CancelledError: 

### Submit parallel jobs

In [43]:
## a dictionary to store our results in
asyncs = {}

## send jobs to run 'asynchronously' using 'apply' over a range of 
## prior values on theta and tau.
for theta in [5, 1, 0.5]:
    for tau in [15, 5, 1]:
        ## name this run
        rname = '{}-{}'.format(theta, tau)
    
        ## create a ctl file with these priors
        argdict['name'] = rname
        argdict['thetaprior'] = '{} {}'.format(theta, 5)
        argdict['tauprior'] = '{} {} {}'.format(tau, 5, 1)
        write_ctl(argdict)

        ## submit job to queue
        #asyncs[rname] = lbview.apply(bpp, 'ctl-{}.txt'.format(rname))
        
        ## print the job submitted
        print 'submitted: bpp', 'ctl-{}.txt'.format(rname)

submitted: bpp ctl-5-15.txt
submitted: bpp ctl-5-5.txt
submitted: bpp ctl-5-1.txt
submitted: bpp ctl-1-15.txt
submitted: bpp ctl-1-5.txt
submitted: bpp ctl-1-1.txt
submitted: bpp ctl-0.5-15.txt
submitted: bpp ctl-0.5-5.txt
submitted: bpp ctl-0.5-1.txt


ipyparallel.error.RemoteError(u'CalledProcessError',
                              u"Command '['/home/deren/local/src/bpp3.3/src/bpp', 'ctl-0.5-1.txt']' returned non-zero exit status 255")