See the GBIF demo PDF (https://github.com/snacktavish/OpenTree_SSB2020/blob/master/pdfs/GBIF_demo_intro.pdf) for background on this example.

In [1]:
!pip install opentree



In [2]:
import sys
from opentree import OT

In [3]:
# Ooh! We can mix together Python and bash commands :P
# The '!' at the start of the line means the command is executed in bash
# This wget command pulls the GBIF data file from the internet and saves it as 'gbif_example.csv'
#!wget -O gbif_example.csv https://raw.githubusercontent.com/McTavishLab/biodiversity_next/master/example.csv
# This is actually occurrence data downloaded from GBIF, DOI https://doi.org/10.15468/dl.9bigak

In [4]:
!head -n 3 ../tutorial/gbif_example.csv
# Oof! Lots of information.

gbifID	datasetKey	occurrenceID	kingdom	phylum	class	order	family	genus	species	infraspecificEpithet	taxonRank	scientificName	verbatimScientificName	verbatimScientificNameAuthorship	countryCode	locality	stateProvince	occurrenceStatus	individualCount	publishingOrgKey	decimalLatitude	decimalLongitude	coordinateUncertaintyInMeters	coordinatePrecision	elevation	elevationAccuracy	depth	depthAccuracy	eventDate	day	month	year	taxonKey	speciesKey	basisOfRecord	institutionCode	collectionCode	catalogNumber	recordNumber	identifiedBy	dateIdentified	license	rightsHolder	recordedBy	typeStatus	establishmentMeans	lastInterpreted	mediaType	issue
2423004790	50c9509d-22c7-4a22-a47d-8c48425ef4a7	https://www.inaturalist.org/observations/32478397	Animalia	Arthropoda	Insecta	Odonata	Libellulidae	Tramea	Tramea lacerata		SPECIES	Tramea lacerata Hagen, 1861	Tramea lacerata		US		California			28eb1a3f-1c15-4a95-931a-4af90ecb574d	37.36422	-120.424003	12.0						2019-09-10T08:39:35Z	10	9	2019	1428475	1428475	HUMAN_O

In [5]:
filename = "../tutorial/gbif_example.csv"
with open(filename) as fi:
    # Save the first line separately as the header.
    # Strip the trailing newline so the last column name is clean.
    header = fi.readline().rstrip('\n').split('\t')
    gbif_data = fi.readlines() # Read in the remaining data lines

# Get indexes for each column in the csv file
col_dict = {}
for i, col in enumerate(header):
    col_dict[col] = i

# Would this make more sense to do in Pandas? Maybe! But I like loops.
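
If you would rather not track column indexes by hand, the standard-library `csv` module can do the same job without Pandas. The following is a minimal sketch, not part of the original notebook: it parses a tiny in-memory sample (three columns taken from the first record shown above) instead of the real file, and it assumes the GBIF download is plain tab-delimited text without quoted fields.

```python
import csv
import io

# A tiny tab-delimited sample standing in for gbif_example.csv
# (values copied from the first record printed by `head` above).
sample = (
    "gbifID\ttaxonKey\tverbatimScientificName\n"
    "2423004790\t1428475\tTramea lacerata\n"
)

# DictReader uses the header row as keys, so each row is a dict keyed by column name.
# QUOTE_NONE treats quote characters as ordinary text (an assumption about the file format).
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Columns are now addressed by name instead of by index.
taxon_keys = [row["taxonKey"] for row in rows]
print(taxon_keys)
```

To use this on the real file, swap `io.StringIO(sample)` for a file handle opened with `open(filename, newline="")`.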

In [6]:
# Now we know which column each of our data types is in.
# So much (many) data!
col_dict

{'gbifID': 0,
 'datasetKey': 1,
 'occurrenceID': 2,
 'kingdom': 3,
 'phylum': 4,
 'class': 5,
 'order': 6,
 'family': 7,
 'genus': 8,
 'species': 9,
 'infraspecificEpithet': 10,
 'taxonRank': 11,
 'scientificName': 12,
 'verbatimScientificName': 13,
 'verbatimScientificNameAuthorship': 14,
 'countryCode': 15,
 'locality': 16,
 'stateProvince': 17,
 'occurrenceStatus': 18,
 'individualCount': 19,
 'publishingOrgKey': 20,
 'decimalLatitude': 21,
 'decimalLongitude': 22,
 'coordinateUncertaintyInMeters': 23,
 'coordinatePrecision': 24,
 'elevation': 25,
 'elevationAccuracy': 26,
 'depth': 27,
 'depthAccuracy': 28,
 'eventDate': 29,
 'day': 30,
 'month': 31,
 'year': 32,
 'taxonKey': 33,
 'speciesKey': 34,
 'basisOfRecord': 35,
 'institutionCode': 36,
 'collectionCode': 37,
 'catalogNumber': 38,
 'recordNumber': 39,
 'identifiedBy': 40,
 'dateIdentified': 41,
 'license': 42,
 'rightsHolder': 43,
 'recordedBy': 44,
 'typeStatus': 45,
 'establishmentMeans': 46,
 'lastInterpreted': 47,
 'medi

In [7]:
# As described in the TNRS section,
# we can use OpenTree APIs to match our GBIF identifiers to Open Tree unique identifiers

match_dict = {} # This will list the matches
ott_ids = set() # And generate a set of taxa

# Loop through each line in the GBIF output
for lin in gbif_data:
    lii = lin.split('\t')
    gb_id = lii[col_dict['taxonKey']] # This grabs the GBIF id number from the right column
    sys.stdout.write(".") # Progress bar
    sys.stdout.flush()
    if gb_id in match_dict:
        # Skip GBIF ids you have already matched
        pass
    else:
        # Do a direct match to GBIF ids in the Open Tree taxonomy
        try:
            ott_id = OT.get_ottid_from_gbifid(gb_id)
        except Exception:
            # Sometimes we don't have a record of the GBIF id, but we do have a taxon with that exact name
            # Search on the name
            spp_name = lii[col_dict['verbatimScientificName']]
            ott_id = OT.get_ottid_from_name(spp_name)
            if ott_id is None:
                sys.stdout.write("Couldn't find an id for {}, gbif {}\n".format(spp_name, gb_id))
        match_dict[gb_id] = ott_id
        if ott_id is not None:
            ott_ids.add(ott_id) # Only keep taxa we actually matched

...................................................................................................
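
The matching pattern used here (try the identifier first, fall back to the name, look up each key only once) can be pulled out into a small function that is easy to test offline. The sketch below is illustrative only: `match_taxa`, `stub_by_id`, and `stub_by_name` are hypothetical names, the stub lookups stand in for the `OT` calls, and the second and third keys and all OTT ids are made-up placeholders rather than real identifiers.

```python
def match_taxa(records, lookup_by_id, lookup_by_name):
    """Map each unique GBIF key to an OTT id, trying the id first, then the name.

    records: iterable of (gbif_id, name) pairs.
    lookup_by_id: callable that raises an exception when the id is unknown.
    lookup_by_name: callable that returns None when the name is unknown.
    """
    matches = {}
    unmatched = []
    for gbif_id, name in records:
        if gbif_id in matches:
            continue  # Already looked up; skip the repeat query
        try:
            ott_id = lookup_by_id(gbif_id)
        except Exception:
            ott_id = lookup_by_name(name)  # Fall back to a name search
        matches[gbif_id] = ott_id
        if ott_id is None:
            unmatched.append((gbif_id, name))
    return matches, unmatched


# Stub lookups with placeholder ids, standing in for the OpenTree API calls.
known_ids = {"1428475": 111}
known_names = {"Sturnus vulgaris": 222}


def stub_by_id(gbif_id):
    return known_ids[gbif_id]  # Raises KeyError for unknown ids


def stub_by_name(name):
    return known_names.get(name)  # Returns None for unknown names


records = [
    ("1428475", "Tramea lacerata"),
    ("1428475", "Tramea lacerata"),   # Duplicate key, looked up only once
    ("placeholder-1", "Sturnus vulgaris"),  # Unknown id, matched by name
    ("placeholder-2", "Nonexistent taxon"),  # No match at all
]
matches, unmatched = match_taxa(records, stub_by_id, stub_by_name)

# Drop failed matches before asking for a tree.
matched_ott_ids = {ott for ott in matches.values() if ott is not None}
```

Keeping the unmatched records in their own list makes it easy to review them afterwards instead of scanning the progress output for warnings.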

In [8]:
# Let's grab a tree for those taxa!
output = OT.synth_induced_tree(ott_ids=list(ott_ids), label_format='name')
treefile = "gbif_taxa.tre"
output.tree.write(path=treefile, schema="newick")
sys.stdout.write("Tree written to {}\n".format(treefile))
output.tree.print_plot(width=100)

Tree written to gbif_taxa.tre
                                                              /---------++ Regulus calendula        
                                                              +                                     
                                                              |   /---++++ Sturnus vulgaris         
                                                  /-----------+++++                                 
                                                  |           |   \+++++++ Mimus polyglottos        
                                                  |           |                                     
                                                  |           \-----++++++ Phainopepla nitens       
                                                  |                                                 
                                                  |                /++++++ Passerculus sandwichensis
                                                  |          

In [None]:
# If we print the tree to a string, we can paste it into icytree.org or itol.embl.de for a quick look
output.tree.as_string(schema="newick")

In [9]:
# Don't forget to cite your friendly phylogeneticists!
studies = output.response_dict['supporting_studies']
cites = OT.get_citations(studies) #this can be a bit slow
print(cites)

https://tree.opentreeoflife.org/curator/study/view/ot_1510?tab=trees&tree=Tr55267
Carrano M.T., Benson R.B., & Sampson S.D. 2012. The phylogeny of Tetanurae (Dinosauria: Theropoda). Journal of Systematic Palaeontology, 10(2): 211-300.
http://dx.doi.org/10.1080/14772019.2011.630927

https://tree.opentreeoflife.org/curator/study/view/ot_1512?tab=trees&tree=Tr55545
Brusatte S.L., Benton M.J., Desojo J.B., & Langer M.C. 2010. The higher-level phylogeny of Archosauria (Tetrapoda: Diapsida). Journal of Systematic Palaeontology, 8(1): 3-47.
http://dx.doi.org/10.1080/14772010903537732

https://tree.opentreeoflife.org/curator/study/view/ot_1517?tab=trees&tree=Tr60095
Andres B.B., Clark J.M., & Xu X. 2012. A new rhamphorhynchid pterosaur from the Upper Jurassic of Xinjiang, China, and the phylogenetic relationships of basal pterosaurs. Journal of Vertebrate Paleontology, 30(1): 163-187.
http://dx.doi.org/10.1080/02724630903409220

https://tree.opentreeoflife.org/curator/study/view/ot_1513?tab=tr

# DIY: Go to GBIF and choose a region of interest to you. Download the data as csv, and see if you can get a phylogeny for those taxa!