In this notebook we prepare data from three sources (molecular data, morphological data, and fossils without character data, used only for their age information) for use with the BEAST2 suite of tools for divergence dating under the Fossilized Birth-Death (FBD) model. BEAST2 has a requirement that can make this challenging: every taxon has to be present in every data block (i.e., your morphology block has to contain all of your morphology and molecular taxa).

The first thing we do is load the two libraries we'll be using: Pandas, a data-crunching library, and DendroPy, a perennially excellent package for phylogenetic tree and dataset manipulation.

In [None]:
import dendropy
import pandas as pd
pd.options.display.max_rows = 999

I'm going to have us start out by processing the fossils that don't have associated morphological data. These fossils will have missing data for every cell in the phylogenetic matrix. We just want them for their age information. The cool thing about using a fossilized birth-death model is that you can use as many specimens as you have - making complete use of the data is one advantage of this method over the traditional node calibration framework.

For this work, I have obtained some ant fossil occurrences from PaleoDB. Let's read them in.

In [None]:
pbdb_raw = pd.read_csv("../Data/Morph/Raw/pbdb_data.csv", skiprows=16)


There's a lot of data in this, which I don't really want to deal with right now. So I'm going to slim it down to a few important columns (the name, the reference number, and the minimum and maximum ages) and drop any duplicate names.

In [None]:
pbdb_slim = pbdb_raw[['accepted_name', 'reference_no', 'max_ma', 'min_ma']]
# .copy() so that columns can be added later without a SettingWithCopyWarning
pbdb_nodup = pbdb_slim.drop_duplicates(subset='accepted_name').copy()

In [None]:
pbdb_nodup

This has 789 rows, down from ~1800. That being said, the quality assurance on them may be quite low.
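
One thing worth knowing about the de-duplication step: `drop_duplicates` keeps the first row it sees for each name, so the age range you end up with is whichever occurrence happened to come first. The sketch below uses a toy table (the names and ages are invented, not PBDB records) to show that behaviour next to one possible alternative, collapsing duplicates into the widest age range per name. That alternative is my assumption about what you might want, not something the rest of this notebook does.

```python
import pandas as pd

# Toy stand-in for the PBDB download; names and ages are invented.
toy = pd.DataFrame({
    "accepted_name": ["Genus_a", "Genus_a", "Genus_b"],
    "reference_no": [1, 2, 3],
    "max_ma": [40.0, 55.0, 20.0],
    "min_ma": [35.0, 48.0, 15.0],
})

# drop_duplicates keeps only the first row seen for each name.
first_only = toy.drop_duplicates(subset="accepted_name")

# Alternative: keep the youngest minimum and oldest maximum for each name.
widest = toy.groupby("accepted_name", as_index=False).agg(
    {"min_ma": "min", "max_ma": "max"}
)
```

Which of the two is appropriate depends on whether the duplicate occurrences really are the same taxon sampled at different times.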

Next, to prepare the data, I want to add some columns. In all of my files, I keep track of the higher taxonomy of the ants. In the future, I will want to be able to parse these files to subsample my ants by family, tribe or genus to look at issues of fossil sampling.

In [None]:
pbdb_nodup['SubFamily'] = 'None'
pbdb_nodup['Tribe'] = 'None'
pbdb_nodup['Genus'] = 'None'
pbdb_nodup['Fossil'] = 'Yes'
pbdb_nodup = pbdb_nodup[['accepted_name', 'reference_no', 'SubFamily','Tribe','Genus','Fossil','min_ma','max_ma']]
pbdb_nodup['Notes'] = 'Note'

I don't know this information yet, so I'm initializing these columns without data. A couple more housekeeping steps.

In [None]:
pbdb_nodup['accepted_name'] = pbdb_nodup['accepted_name'].str.replace(' ', '_')
# DataFrame.sort() has been removed from pandas; sort_values() replaces it
pbdb_sort = pbdb_nodup.sort_values(by='accepted_name')

I wanted each of the names to have an underscore in place of a space, since most phylogeny programs don't deal well with spaces. I also sorted the table alphabetically, just because that's nice to look at.
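
Here is a minimal, self-contained version of those two housekeeping steps on a toy table (the names are placeholders I made up). It also shows that `sort_values` returns a new table rather than reordering the original in place.

```python
import pandas as pd

names = pd.DataFrame({"accepted_name": ["Zeta beta", "Alpha gamma delta"]})

# regex=False treats the space as a literal character, not a pattern.
names["accepted_name"] = names["accepted_name"].str.replace(" ", "_", regex=False)

# sort_values returns a sorted copy; `names` keeps its original row order.
names_sorted = names.sort_values(by="accepted_name")
```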

In [None]:
# True where the first word of the name ends in 'nae', i.e. a subfamily
is_subfamily = pbdb_nodup.accepted_name.str.split('_').str[0].str[-3:] == 'nae'
pbdb_nodup.loc[is_subfamily, 'SubFamily'] = pbdb_nodup.accepted_name
pbdb_nodup.loc[~is_subfamily, 'Genus'] = pbdb_nodup.accepted_name


This operation fills in some of those placeholder columns. If the first word of a name ends in 'nae', we know it's a subfamily, so we can put that name in the SubFamily column. If it ends in anything else, we know it's not a subfamily and treat it as a genus.
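
To see the rule in action, here is a small sketch on three names. Note one deliberate difference from the cell above, which copies the whole name into the Genus column: this variant stores only the first word, so a species-level name still yields its genus. That is an assumption about what you might prefer, not what the original cell does.

```python
import pandas as pd

df = pd.DataFrame({"accepted_name": ["Myrmicinae", "Formica", "Formica_rufa"]})
df["SubFamily"] = "None"
df["Genus"] = "None"

# First word of each name, and whether it carries the subfamily ending.
first_word = df["accepted_name"].str.split("_").str[0]
is_subfamily = first_word.str.endswith("nae")

df.loc[is_subfamily, "SubFamily"] = df["accepted_name"]
# Variation: keep only the first word as the genus.
df.loc[~is_subfamily, "Genus"] = first_word
```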

Because some of these groups are extinct, they don't appear in my other taxonomies, which are of extants. So this matrix is still really sparse. I populated the rest by hand.

BEAST reads tip dates for fossils off of the end of your taxon labels. The below joins together the taxon name with its age info. If you don't want that, skip to the cell below it.

In [None]:
# lo/hi rather than min/max, so the Python built-ins are not shadowed
tax_labels = [tax + '_' + str(lo) + '_' + str(hi) for tax, lo, hi in zip(pbdb_nodup.accepted_name, pbdb_nodup.min_ma, pbdb_nodup.max_ma)]

In [None]:
tax_labels = pbdb_nodup.accepted_name
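
If you do go the route of packing ages into the labels, it helps to be able to get them back out again. The sketch below builds a label in the same name_min_max layout used above and then splits it from the right, so underscores inside the taxon name don't get in the way. The function names are hypothetical helpers of mine, and the taxon and ages are invented.

```python
def make_tip_label(name, min_ma, max_ma):
    """Join a taxon name and its two age bounds with underscores."""
    return "{}_{}_{}".format(name, min_ma, max_ma)


def split_tip_label(label):
    """Split from the right so underscores in the name are preserved."""
    name, min_ma, max_ma = label.rsplit("_", 2)
    return name, float(min_ma), float(max_ma)


label = make_tip_label("Genus_a", 33.9, 37.2)
parts = split_tip_label(label)
```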

As I mentioned in the intro, BEAST wants data for every taxon for each data type. So we can create a dictionary of the taxon names and the missing data symbol for those taxa that don't have data associated with them.

In [None]:
# One '?' per character: 139 for the morphological block, 4572 for the molecular block
dict_of_dat = {name: "?" * 139 for name in tax_labels}
dict_of_moldat = {name: "?" * 4572 for name in tax_labels}
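
The same idea can be wrapped in a small helper so the matrix widths are not hard-coded in two places. This is just a sketch: `missing_block` is a hypothetical name, and the two fossil labels are placeholders.

```python
def missing_block(taxon_labels, n_chars, symbol="?"):
    """Map every label to a run of the missing-data symbol, n_chars long."""
    return {label: symbol * n_chars for label in taxon_labels}


labels = ["Fossil_a", "Fossil_b"]
morph_missing = missing_block(labels, 139)
mol_missing = missing_block(labels, 4572)
```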

Now we can read in the molecular and morphological data. Because BEAST needs all data partitions to have the same taxon names, we want these to have the same taxon namespace, or the same expected taxon names.

In [None]:
taxa = dendropy.TaxonNamespace()
# DendroPy 4 reads files with .get(path=...); get_from_path is the older, deprecated spelling
morphDat = dendropy.StandardCharacterMatrix.get(path="../Data/Morph/KellerMatrix.nex", schema="nexus", taxon_namespace=taxa)
molDat = dendropy.DnaCharacterMatrix.get(path="../Data/Mol/Moreau_simple.nex", schema="nexus", taxon_namespace=taxa)

And then we're going to make dataset objects to turn our dictionaries of data into actual phylogenetic matrices. We're doing this so that they're in the taxon namespace.

In [None]:
fossDat = dendropy.StandardCharacterMatrix.from_dict(dict_of_dat,taxon_namespace=taxa)    	
fossmolDat = dendropy.DnaCharacterMatrix.from_dict(dict_of_moldat, taxon_namespace=taxa) 

Since some of our taxa with DNA don't have morphological data, and vice versa, we need to pack those taxa with missing data:

In [None]:
morphDat.pack()
molDat.pack()
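
If it's not obvious what packing buys us, here is the idea in plain Python, with dictionaries standing in for the matrices. This is an illustration of the concept only, not DendroPy's implementation; `pack_blocks` and the toy taxa are made up for the example.

```python
def pack_blocks(blocks, symbol="?"):
    """Return copies of the blocks in which every taxon appears in every block."""
    all_taxa = set()
    for seqs in blocks.values():
        all_taxa.update(seqs)

    packed = {}
    for block_name, seqs in blocks.items():
        # Pad absent taxa to the width of the longest sequence in this block.
        width = max(len(s) for s in seqs.values())
        packed[block_name] = {
            taxon: seqs.get(taxon, symbol * width) for taxon in sorted(all_taxa)
        }
    return packed


blocks = {
    "morph": {"Taxon_a": "0101", "Fossil_a": "????"},
    "mol": {"Taxon_a": "ACGTAC", "Taxon_b": "ACGTTC"},
}
packed = pack_blocks(blocks)
```

After this, both blocks list the same three taxa, which is exactly the condition BEAST2 insists on.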

So now we have two matrices, with all of our taxa (1027) represented in both the molecular and the morphological data, and we can write them out. These are ready to load into BEAST, with one caveat: pack uses 'None' as its missing data value. You can change this to whatever your missing data character is (for phylogenetics, a question mark) with a find-and-replace tool such as sed.

In [None]:
# Write the packed matrices, which now contain every taxon
morphDat.write(path='samp_morphTest.nex', schema='nexus')
molDat.write(path='samp_molTest.nex', schema='nexus')
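
If you'd rather stay in Python for the 'None' cleanup, something like the following could work. It is a rough sketch under a few assumptions that you should check against your own output: that the placeholder shows up as the literal text None in the character data, that MATRIX and the closing semicolon each sit on their own line, and that taxon labels contain no spaces. It only touches the data part of each row, so a label that happens to contain "None" is left alone. The sample text is invented.

```python
import re

# A matrix row: leading label (no spaces), whitespace, then character data.
ROW = re.compile(r"^(\s*\S+\s+)(\S.*)$")


def replace_none_in_matrix(nexus_text, missing="?"):
    """Swap literal 'None' for the missing symbol inside MATRIX rows only."""
    out = []
    in_matrix = False
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if stripped.upper() == "MATRIX":
            in_matrix = True
        elif stripped == ";":
            in_matrix = False
        elif in_matrix:
            match = ROW.match(line)
            if match:
                label, data = match.groups()
                line = label + data.replace("None", missing)
        out.append(line)
    return "\n".join(out)


sample = "\n".join([
    "BEGIN CHARACTERS;",
    "    MATRIX",
    "    None_such    01NoneNone",
    "    Taxon_b      0101",
    "    ;",
    "END;",
])
fixed = replace_none_in_matrix(sample)
```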

So, we have matrices of characters. The next thing we might want to do is get some taxon sets, which are groupings of taxa that we can tell BEAST to make monophyletic and keep track of. In the traditional node calibration framework, people would put a prior on the age of each taxon set. Because we are using FBD dating, the ages of all the fossils in that set can be used to date that group. 

I'm going to make taxon sets at the subfamily level. I have taxonomies for the taxa with molecular data, for my morphology taxa and for my calibration taxa. I want to read in those taxonomies and use them to group taxa by subfamily.

In [None]:
morphTable = pd.read_csv("../Data/Morph/morphTNRS.csv")
molTable = pd.read_csv("../Data/Mol/molTNRS.csv")
fossTable = pd.read_csv("../Data/Morph/FossilTNRS.csv")


# .copy() gives each subset its own table, so adding a column raises no SettingWithCopyWarning
fossMerge = fossTable[['accepted_name', 'subfamily']].copy()
fossMerge.columns = ['specimen', 'subfamily']
fossMerge['Fossil'] = 'Yes'
morphMerge = morphTable[['Specimen', 'SubFamily']].copy()
morphMerge.columns = ['specimen', 'subfamily']
morphMerge['Fossil'] = 'No'
molMerge = molTable[['Moreau_et_al_name', 'subfamily']].copy()
molMerge.columns = ['specimen', 'subfamily']
molMerge['Fossil'] = 'No'
mega_df = pd.concat([fossMerge, molMerge, morphMerge])
mega_df = mega_df.drop_duplicates(subset='specimen')

In the above, we read in the files of taxonomies, pulled the relevant data, made sure all the columns had the same name, and then concatenated them into one mega matrix. 
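
One subtlety in that concatenate-then-deduplicate pattern: the order of the tables decides which row survives when a specimen appears in more than one of them. The toy example below (invented specimens) puts the fossil table first, so a shared name keeps its fossil row.

```python
import pandas as pd

foss = pd.DataFrame({
    "specimen": ["Taxon_a"],
    "subfamily": ["Myrmicinae"],
    "Fossil": ["Yes"],
})
mol = pd.DataFrame({
    "specimen": ["Taxon_a", "Taxon_b"],
    "subfamily": ["Myrmicinae", "Myrmicinae"],
    "Fossil": ["No", "No"],
})

# Rows from `foss` come first, so they win when a specimen is duplicated.
combined = pd.concat([foss, mol], ignore_index=True)
combined = combined.drop_duplicates(subset="specimen")
```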

In [None]:
families = mega_df.groupby('subfamily')

for name, group in families:
    print(name)
    print(group)

The above code groups the ants by subfamily, and then prints all members of each subfamily to the screen. In BEAUti's taxon set panel, you can either enter these groupings by hand, or you can print them to a file and parse them into XML. You can look at an individual group like so:

In [None]:
families.get_group('Myrmicinae')

Or, if you wanted to export your group to a file so you can assemble a BEAST XML by hand, you could export the taxon set like so:

In [None]:
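for name, group in families:
    # One file per subfamily, named after the subfamily, one specimen per line
    fname = name
    # header=False keeps the column name out of the file on newer pandas versions
    group.specimen.to_csv(fname, index=False, header=False)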
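
If you would rather skip the intermediate files and go straight to an XML fragment, the standard library can build one. Treat the element and attribute names below (`taxonset`, `taxon`, `spec`) as my assumption about what a BEAST2 taxon set looks like, and compare against a file written by BEAUti before pasting anything in. In particular, taxa that are already declared elsewhere in your XML may need to be referenced rather than redeclared. The function name and the taxa are hypothetical.

```python
import xml.etree.ElementTree as ET


def taxonset_xml(set_id, taxon_labels):
    """Build a taxon set element with one child per taxon and return it as text."""
    taxonset = ET.Element("taxonset", {"id": set_id, "spec": "TaxonSet"})
    for label in taxon_labels:
        ET.SubElement(taxonset, "taxon", {"id": label, "spec": "Taxon"})
    return ET.tostring(taxonset, encoding="unicode")


xml_text = taxonset_xml("Myrmicinae", ["Taxon_a", "Taxon_b"])
```

With the grouped table from above, you could call this once per subfamily, passing the group name and `list(group.specimen)`.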

Those are the three main components of the input BEAST needs: taxon names, datasets and taxon sets. Parsing all this output can be a little daunting, so hopefully this was useful.