## Getting an evolutionary tree for taxa in a region using GBIF and OpenTree of Life
### Example from the University of California, Merced Vernal Pools Reserve
https://vernalpools.ucmerced.edu/

More info and context at https://github.com/McTavishLab/biodiversity_next/blob/master/biodiversity_next.pdf



### Get occurrence data from GBIF
Set polygon: https://www.gbif.org/occurrence/search?country=US&has_coordinate=true&has_geospatial_issue=false&taxon_key=1&advanced=1&geometry=POLYGON((-120.45565%2037.35309,-120.36587%2037.35309,-120.36587%2037.44063,-120.45565%2037.44063,-120.45565%2037.35309))

Download records.

Here I have limited the search to Animalia (`taxon_key=1` in the URL above).
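
To search a different region, the query URL can be rebuilt from a bounding box. The following is a minimal sketch using only the Python standard library; the helper names `bbox_to_wkt` and `gbif_search_url` are illustrative and not part of any GBIF or OpenTree package. The parameters mirror the ones in the URL above. Note that `urlencode` percent-encodes parentheses and commas, so the resulting string looks slightly different from the hand-written URL even though it carries the same query.

```python
from urllib.parse import urlencode, quote


def bbox_to_wkt(min_lon, min_lat, max_lon, max_lat):
    """Return a closed WKT POLYGON (counter-clockwise) for a lon/lat bounding box."""
    corners = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first corner to close the ring
    ]
    ring = ",".join("{} {}".format(lon, lat) for lon, lat in corners)
    return "POLYGON(({}))".format(ring)


def gbif_search_url(wkt, taxon_key=1, country="US"):
    """Build an occurrence-search URL with the same parameters as the example above."""
    params = {
        "country": country,
        "has_coordinate": "true",
        "has_geospatial_issue": "false",
        "taxon_key": taxon_key,
        "advanced": 1,
        "geometry": wkt,
    }
    # quote_via=quote encodes spaces as %20, matching the style of the URL above
    return "https://www.gbif.org/occurrence/search?" + urlencode(params, quote_via=quote)


# Bounding box taken from the polygon in the URL above
wkt = bbox_to_wkt(-120.45565, 37.35309, -120.36587, 37.44063)
url = gbif_search_url(wkt)
print(url)
```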



In [1]:
import sys
from opentree import OT, taxonomy_helpers

#### Download example data

Example GBIF occurrence data (downloaded 17 October 2019) are available from https://doi.org/10.15468/dl.9bigak

The name of the example data file is "0023214-190918142434337.csv". We will assign it to a variable named `input_gbif_file`.

Make sure to download and unzip the file into the same folder as this notebook, or change the file path accordingly.
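
Before running the matching loop, it can help to confirm that the file is where the notebook expects it and that it has the two columns the loop reads, `taxonKey` and `verbatimScientificName`. This is a sketch under the assumption that the download is tab-delimited despite its `.csv` extension (the notebook itself splits on tabs); `missing_columns` is a hypothetical helper name.

```python
import csv

# Column names read by the matching loop in this notebook
REQUIRED_COLUMNS = ("taxonKey", "verbatimScientificName")


def missing_columns(path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the file's header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        # An empty file yields an empty header, so every required column is reported
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]
```

Calling `missing_columns("0023214-190918142434337.csv")` should return an empty list if the download is intact; a `FileNotFoundError` means the path needs adjusting.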

In [2]:
input_gbif_file = "0023214-190918142434337.csv"


In [3]:
fi = open(input_gbif_file)
header = fi.readline().rstrip('\n').split('\t')  # strip the newline so the last column name is clean


#Get indexes for each column in the csv file
col_dict = {}
for i, col in enumerate(header):
    col_dict[col] = i


sys.stdout.write("Matching ids\n")

match_dict = {}
gbif_ids = []
ott_ids = []
i = 0
#Loop through each line in the input CSV file
for lin in fi:
    i += 1
    sys.stdout.write(".") #progress bar
    sys.stdout.flush()
    lii = lin.split('\t')
    gb_id = lii[col_dict['taxonKey']]
    if gb_id in match_dict:
        #Skip gb_id's you have already matched
        pass
    else:
        # Do a direct match to gbif id's in the open tree taxonomy
        gbiftax = "gbif:{}".format(int(gb_id))
        res = OT.taxon_info(source_id=gbiftax)
        if res.status_code == 200:
            ott_id = int(res.response_dict['ott_id'])
            match_dict[gb_id] = ott_id
        if res.status_code == 400:
            # If the GBIF id isn't found in the open tree taxonomy, search on scientific name
            spp_name = lii[col_dict['verbatimScientificName']]
            if spp_name == '':
                continue
#            sys.stdout.write("\n{},{} not matched on ID\n".format(gbiftax, spp_name))
            res2 = OT.tnrs_match([spp_name])
            if res2.status_code == 200:
                if len(res2.response_dict['results']) > 0:
                    if res2.response_dict['results'][0]['matches']:
                        ott_id = int(res2.response_dict['results'][0]['matches'][0]['taxon']['ott_id'])
                        match_dict[gb_id] = ott_id
                        ott_ids.append(ott_id)
                        sys.stdout.write("\n{},{} matched on name to ott id{}\n".format(gbiftax, spp_name, ott_id))
                    else:
                        sys.stdout.write("\n{},{} still NO MATCH\n".format(gbiftax, spp_name))
                        match_dict[gb_id] = None
                else:
                    sys.stdout.write("\n{},{} still NO MATCH\n".format(gbiftax, spp_name))
                    match_dict[gb_id] = None
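        # Record an id only if this taxonKey was matched, so a stale or undefined ott_id is never appended
        if match_dict.get(gb_id) is not None:
            ott_ids.append(match_dict[gb_id])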


Matching ids
........................
gbif:5229155,Pelecanus erythrorhynchos matched on name to ott id316989
....
gbif:9088491,Dryobates nuttallii matched on name to ott id701703
..
gbif:5231677,Mimus polyglottos matched on name to ott id571310
......................................
gbif:2498167,Anser caerulescens matched on name to ott id190878
...........
gbif:2498161,Anser rossii matched on name to ott id767830
..................................................................................................
gbif:6093694,Oreothlypis ruficapilla matched on name to ott id392341
.............................
gbif:7342009,Oreothlypis celata matched on name to ott id88835
.
gbif:8332393,Spatula clypeata matched on name to ott id656794
.....
gbif:9362027,Mareca strepera matched on name to ott id30856
..............................
gbif:7340222,Gallinula galeata matched on name to ott id181047
......................
gbif:9345027,Spatula cyanoptera matched on name to ott id82411
...........

In [4]:
ott_ids = set(ott_ids)
if None in ott_ids:
    ott_ids.remove(None)

trefile = "VernalPools.tre"
#Get the induced synthetic subtree from OpenTree and write it to a newick file.
output = taxonomy_helpers.labelled_induced_synth(ott_ids=list(ott_ids), label_format='name')
output['labelled_tree'].write(path = trefile, schema = "newick")
sys.stdout.write("Tree written to {}\n".format(trefile))


Tree written to VernalPools.tre


In [5]:
len(ott_ids)

202
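
The count above covers only the taxa that were matched. A minimal sketch for listing the GBIF keys that stayed unmatched, assuming a `match_dict` shaped like the one built in cell 3 (GBIF key mapped to an OTT id, or to `None` when the name search also failed). The helper name `summarize_matches` and the key `"hypothetical-key"` are made up for illustration; the two matched pairs are taken from the output above.

```python
def summarize_matches(match_dict):
    """Return (number of matched keys, sorted list of unmatched keys)."""
    matched = {key: ott for key, ott in match_dict.items() if ott is not None}
    unmatched = sorted(key for key, ott in match_dict.items() if ott is None)
    return len(matched), unmatched


# Toy input: two real pairs from the output above plus one hypothetical unmatched key
example = {"5229155": 316989, "9088491": 701703, "hypothetical-key": None}
n_matched, unmatched = summarize_matches(example)
print("{} matched, {} unmatched: {}".format(n_matched, len(unmatched), unmatched))
```

In the notebook, `summarize_matches(match_dict)` would be called directly. Records skipped because of an empty scientific name never enter `match_dict`, so they are not counted in either group.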

In [6]:
len(output['supporting_studies'])

160

In [7]:
print(OT.get_citations(output['supporting_studies']))

https://tree.opentreeoflife.org/curator/study/view/ot_816?tab=trees&tree=tree1
Gibson, Rosemary, Allan Baker. 2012. Multiple gene sequences resolve phylogenetic relationships in the shorebird suborder Scolopaci (Aves: Charadriiformes). Molecular Phylogenetics and Evolution 64 (1): 66-72
http://dx.doi.org/10.1016/j.ympev.2012.03.008

https://tree.opentreeoflife.org/curator/study/view/ot_1268?tab=trees&tree=tree5
Martín J. Ramírez, 2014, 'The Morphology And Phylogeny Of Dionychan Spiders (Araneae: Araneomorphae)', Bulletin of the American Museum of Natural History, vol. 390, pp. 1-374
http://dx.doi.org/10.1206/821.1

https://tree.opentreeoflife.org/curator/study/view/pg_1776?tab=trees&tree=tree3581
Cho, S., Zwick A., Regier J., Mitter C., Cummings M.P., Yao J., Du Z., Zhao H., Kawahara A.Y., Weller S.J., Davis D.R., Baixeras J., Brown J.W., & Parr C. 2011. Can deliberately incomplete gene sample augmentation improve a phylogeny estimate for the advanced moths and butterflies (Hexapoda: L