## VARS DwC conversion - larger data set (all species observed in 2001)

Resources:
- https://dwc.tdwg.org/terms/
- https://tools.gbif.org/dwca-validator/extension.do?id=dwc:Occurrence
- https://www.mbari.org/products/research-software/video-annotation-and-reference-system-vars/query-interface/advanced-user-guide/
- https://www.gbif.org/data-quality-requirements-occurrences

In [1]:
## Imports

import pandas as pd
import numpy as np

import re # for extracting logon info from text file

import jaydebeapi # for connecting to VARS db
import VARS # for connecting to VARS db

from datetime import datetime # for handling dates
import pytz # for handling time zones

import urllib.request, urllib.parse, json # for dealing with WoRMS API and output
import WoRMS # functions for querying WoRMS REST API

### Obtain data from VARS database

In [2]:
# ## Extract logon information from text file

# # Get list of each line in file
# filename = 'VARS_logon_info.txt'
# f = open(filename, 'r')
# lines = f.readlines()
# f.close()

# # Function for extracting information from lines
# def get_single_quoted_text(s):
#     """ 
#     Takes string s and returns any text in s that is between the first set of single quotes, removing whitespace. 
    
#     Example:
#     s = "What if there's more ' than one' set of single' quotes?"
#     get_single_quoted_text(s) --> 's more'
    
#     """
    
#     extracted_text = re.search(r'''(?<=')\s*[^']+?\s*(?=')''', s)  # raw string so \s is passed to the regex engine unchanged
#     return extracted_text.group().strip()

# # Assign logon info
# dr = get_single_quoted_text(lines[2])
# name = get_single_quoted_text(lines[3])
# pw = get_single_quoted_text(lines[4])
# un = get_single_quoted_text(lines[5])
# url = get_single_quoted_text(lines[6])

An explanation of the regex in get_single_quoted_text() can be found here: <br>
https://stackoverflow.com/questions/42002931/regex-extract-string-between-single-quotes-trim-whitespace?rq=1
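
To see the pattern in action, here is a minimal standalone sketch. The function name `extract_first_single_quoted` and the sample strings are illustrative and not part of the notebook. Unlike the helper above, this version returns `None` when a line has no quoted text instead of raising an error.

```python
import re

# The lookbehind and lookahead anchor the match between two single quotes
# without including the quotes themselves.
QUOTED = re.compile(r"(?<=')\s*[^']+?\s*(?=')")

def extract_first_single_quoted(s):
    """Return the text between the first pair of single quotes, stripped, or None."""
    match = QUOTED.search(s)
    return match.group().strip() if match else None

# Hypothetical line from a logon file
print(extract_first_single_quoted("un = ' vars_user '"))
print(extract_first_single_quoted("no quotes here"))
```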

In [3]:
# ## Build SQL query

# sql = """
#         SELECT index_recorded_timestamp,
#                observation_uuid,
#                concept,
#                observation_group,
#                observer,
#                image_url,
#                depth_meters,
#                latitude,
#                longitude,
#                oxygen_ml_per_l,
#                psi,
#                salinity,
#                temperature_celsius,
#                video_uri,
#                video_sequence_name,
#                chief_scientist
#         FROM annotations a
#         WHERE NOT EXISTS (
#            SELECT DISTINCT observation_uuid
#            FROM annotations b
#            WHERE (
#              (  -- Delete last 2 years of annotations
#              index_recorded_timestamp > DATEADD([year], - 2, GETDATE()) OR
#              index_recorded_timestamp IS NULL OR
#              index_recorded_timestamp < CAST('1970-01-02' AS datetime)
#              )
#            OR ( -- Delete embargoes by dive
#              dive_number IN ('Ventana 50', 'Ventana 217', 'Ventana 218', 'Ventana 248')
#               )
#            OR (
#              dive_number IN ('Tiburon 1001', 'Tiburon 1029', 'Tiburon 1030', 'Tiburon 1031', 'Tiburon 1032', 'Tiburon 1033', 'Tiburon 1034')
#              )
#            OR ( -- Delete embargoes by selectedConcept
#              concept IN (
#                  'Aegina sp. 1',
#                  'Ctenophora',
#                  'Cydippida 2',
#                  'Cydippida',
#                  'Intacta',
#                  'Llyria',
#                  'Lyrocteis',
#                  'Lyroctenidae',
#                  'Mertensia',
#                  'Mertensiidae sp. A',
#                  'Mystery Mollusc',
#                  'Physonectae sp. 1',
#                  'Platyctenida sp. 1',
#                  'Platyctenida',
#                  'Thalassocalycida sp. 1',
#                  'Thalassocalycida',
#                  'Thliptodon sp. A',
#                  'Tjalfiella tristoma',
#                  'Tjalfiella',
#                  'Tjalfiellidae',
#                  'Tuscarantha braueri',
#                  'Tuscarantha luciae',
#                  'Tuscarantha',
#                  'Tuscaretta globosa',
#                  'Tuscaretta',
#                  'Tuscaridium cygneum',
#                  'Tuscaridium',
#                  'Tuscarilla campanella',
#                  'Tuscarilla nationalis',
#                  'Tuscarilla similis',
#                  'Tuscarilla',
#                  'Tuscarora',
#                  'Tuscaroridae'
#                  )
#             )
#         ) AND a.observation_uuid = b.observation_uuid
#     ) AND index_recorded_timestamp >= CAST('2001-01-01' AS datetime) 
#       AND index_recorded_timestamp <= CAST('2001-12-31' AS datetime) -- bound is midnight, so observations later on 2001-12-31 are excluded
#     """

In [4]:
# ## Query the database

# # Get connection
# conn = VARS.get_db_conn(dr, url, un, pw, name)

# # Submit query
# data = VARS.get_data(conn, sql)

# # Close connection
# conn.close()

In [5]:
# ## Check data is there

# print(data.shape)
# data.head()

**Note:** For some reason, the query result came back without column names, so the columns are labeled by integer position. I'll add the names here...
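
An alternative to renaming by position is to assign the whole list of names at once, in the same order as the `SELECT` list. This is a minimal sketch, assuming the result arrives as a DataFrame with integer column labels. The two sample rows are placeholders rather than real VARS records, and only three of the sixteen columns are shown to keep it short.

```python
import pandas as pd

# Column names in the same order as the SELECT list (truncated for brevity)
column_names = ['index_recorded_timestamp', 'observation_uuid', 'concept']

# Placeholder rows standing in for a query result with no column names
raw = pd.DataFrame([
    ['2001-05-14 23:50:03', 'uuid-1', 'Holothuroidea'],
    ['2001-07-05 21:16:45', 'uuid-2', 'Forskalia'],
])

# Guard against a mismatch between the query and the name list
if raw.shape[1] != len(column_names):
    raise ValueError('column count does not match the SELECT list')

raw.columns = column_names
print(raw.columns.tolist())
```

The length check is the main benefit here: if a column is added to the query but not to the list, the cell fails loudly instead of mislabeling data.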

In [6]:
# ## Add column names

# data.rename(columns={
#     0:'index_recorded_timestamp',
#     1:'observation_uuid',
#     2:'concept',
#     3:'observation_group',
#     4:'observer',
#     5:'image_url',
#     6:'depth_meters',
#     7:'latitude',
#     8:'longitude',
#     9:'oxygen_ml_per_l',
#     10:'psi',
#     11:'salinity',
#     12:'temperature_celsius',
#     13:'video_uri',
#     14:'video_sequence_name',
#     15:'chief_scientist'
# }, inplace=True)

# data.head()

In [7]:
# ## Save

# data.to_csv('VARS_2001_data.csv', index=False, na_rep='NaN')

### Read in saved data (if not pulled directly from the database)

In [26]:
## Load csv

path = ''
filename = 'VARS_2001_data.csv'
data = pd.read_csv(path+filename, dtype={'image_url': object})

print(data.shape)
data.head()

(210576, 16)


Unnamed: 0,index_recorded_timestamp,observation_uuid,concept,observation_group,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri,video_sequence_name,chief_scientist
0,2001-05-14 23:50:03,53CF02AF-535F-41AE-B31E-4BCAB4F39A56,manipulator,ROV,vars,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,1948.599976,21.742367,-159.504933,2.07,320.799988,34.594002,2.407,urn:tid:mbari.org:T0324-09,Tiburon 0324,David Clague
1,2001-05-14 23:50:03,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,ROV,vars,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,1948.599976,21.742367,-159.504933,2.07,320.799988,34.594002,2.407,urn:tid:mbari.org:T0324-09,Tiburon 0324,David Clague
2,2001-07-05 21:16:45,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,ROV,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,98.400002,36.702446,-122.059237,2.55,329.5,34.019001,9.307,urn:tid:mbari.org:V2016-05,Ventana 2016,Rob Sherlock
3,2001-07-05 21:16:45,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,ROV,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,98.400002,36.702446,-122.059237,2.55,329.5,34.019001,9.307,urn:tid:mbari.org:V2016-05,Ventana 2016,Rob Sherlock
4,2001-07-05 21:16:45,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,ROV,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,98.400002,36.702446,-122.059237,2.55,329.5,34.019001,9.307,urn:tid:mbari.org:V2016-05,Ventana 2016,Rob Sherlock


### Pre-processing

In [27]:
## Drop duplicate rows that arise from associations, which we don't care about here

data = data.drop_duplicates()
print(data.shape)
data.head()

(172102, 16)


Unnamed: 0,index_recorded_timestamp,observation_uuid,concept,observation_group,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri,video_sequence_name,chief_scientist
0,2001-05-14 23:50:03,53CF02AF-535F-41AE-B31E-4BCAB4F39A56,manipulator,ROV,vars,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,1948.599976,21.742367,-159.504933,2.07,320.799988,34.594002,2.407,urn:tid:mbari.org:T0324-09,Tiburon 0324,David Clague
1,2001-05-14 23:50:03,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,ROV,vars,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,1948.599976,21.742367,-159.504933,2.07,320.799988,34.594002,2.407,urn:tid:mbari.org:T0324-09,Tiburon 0324,David Clague
2,2001-07-05 21:16:45,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,ROV,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,98.400002,36.702446,-122.059237,2.55,329.5,34.019001,9.307,urn:tid:mbari.org:V2016-05,Ventana 2016,Rob Sherlock
5,2001-10-03 17:15:41,C283DC75-BB9A-4E98-9E72-BC8EABC6EED9,rock,ROV,svonthun,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,881.400024,36.760986,-121.984398,0.37,255.600006,34.513,4.42,urn:tid:mbari.org:V2076-02,Ventana 2076,Charlie Paull
6,2001-10-03 17:15:41,EBE21573-7B5D-422B-87ED-83CB16F1D611,ledge,ROV,svonthun,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,881.400024,36.760986,-121.984398,0.37,255.600006,34.513,4.42,urn:tid:mbari.org:V2076-02,Ventana 2076,Charlie Paull


### Convert

In [28]:
## Start with basic event data and change headings

converted = data[['index_recorded_timestamp', 'video_sequence_name', 'observation_group', 'chief_scientist']]
converted = converted.rename(columns={
    'index_recorded_timestamp':'eventDate',
    'video_sequence_name':'eventID',
    'observation_group':'samplingProtocol',
    'chief_scientist':'recordedBy'
})
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy
0,2001-05-14 23:50:03,Tiburon 0324,ROV,David Clague
1,2001-05-14 23:50:03,Tiburon 0324,ROV,David Clague
2,2001-07-05 21:16:45,Ventana 2016,ROV,Rob Sherlock
5,2001-10-03 17:15:41,Ventana 2076,ROV,Charlie Paull
6,2001-10-03 17:15:41,Ventana 2076,ROV,Charlie Paull


In [29]:
## Replace whitespace in eventID with underscores

converted['eventID'] = [event.replace(' ', '_') for event in converted['eventID']]
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy
0,2001-05-14 23:50:03,Tiburon_0324,ROV,David Clague
1,2001-05-14 23:50:03,Tiburon_0324,ROV,David Clague
2,2001-07-05 21:16:45,Ventana_2016,ROV,Rob Sherlock
5,2001-10-03 17:15:41,Ventana_2076,ROV,Charlie Paull
6,2001-10-03 17:15:41,Ventana_2076,ROV,Charlie Paull


**Note:** For Doc Ricketts dives, this code also places an underscore between 'Doc' and 'Ricketts'. Using 'DocRicketts' may be preferable.
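
If 'DocRicketts' turns out to be the preferred form, one option is to split off the dive number first and only then remove spaces from the vehicle name. This is a sketch under the assumption that every `video_sequence_name` ends with a space followed by the dive number. The function name `make_event_id` and the dive 'Doc Ricketts 0123' are made up for illustration.

```python
def make_event_id(video_sequence_name):
    """Collapse spaces in the vehicle name, then join it to the dive number with '_'."""
    parts = video_sequence_name.strip().rsplit(' ', 1)
    if len(parts) == 1:
        # No space at all: nothing to split, return the name unchanged
        return parts[0]
    vehicle, dive_number = parts
    return vehicle.replace(' ', '') + '_' + dive_number

print(make_event_id('Tiburon 0324'))
print(make_event_id('Doc Ricketts 0123'))  # hypothetical dive name
```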

In [30]:
## Add institutionCode

converted['institutionCode'] = 'MBARI'
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,institutionCode
0,2001-05-14 23:50:03,Tiburon_0324,ROV,David Clague,MBARI
1,2001-05-14 23:50:03,Tiburon_0324,ROV,David Clague,MBARI
2,2001-07-05 21:16:45,Ventana_2016,ROV,Rob Sherlock,MBARI
5,2001-10-03 17:15:41,Ventana_2076,ROV,Charlie Paull,MBARI
6,2001-10-03 17:15:41,Ventana_2076,ROV,Charlie Paull,MBARI


In [31]:
## Format eventDate

formatted = []

for dt in converted['eventDate']:
    
    # Convert string to datetime
    try:
        dt = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S.%f') # some datetimes have milliseconds
    except ValueError:
        dt = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S')
        
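    # Convert to UTC. Note: astimezone() treats a naive datetime as local system time,
    # so this shifts the clock time rather than just labeling it as UTC.
    utc = pytz.UTC
    dt = dt.astimezone(utc)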
    
    # Put in ISO format string
    dt = dt.isoformat()
    
    # Save in list
    formatted.append(dt)

converted['eventDate'] = formatted
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,institutionCode
0,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,MBARI
1,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,MBARI
2,2001-07-06T04:16:45+00:00,Ventana_2016,ROV,Rob Sherlock,MBARI
5,2001-10-04T00:15:41+00:00,Ventana_2076,ROV,Charlie Paull,MBARI
6,2001-10-04T00:15:41+00:00,Ventana_2076,ROV,Charlie Paull,MBARI
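
The output above shows the clock time moving forward by several hours (23:50:03 becomes 06:50:03 the next day), because `astimezone` on a naive datetime converts from the machine's local zone. Whether VARS stores `index_recorded_timestamp` in UTC or in local time is not stated in this notebook, so that needs checking. If the values are already UTC, a sketch of labeling them without shifting could look like this. It uses the standard library's `timezone.utc` rather than `pytz`, and the function name `to_iso_utc` is illustrative.

```python
from datetime import datetime, timezone

def to_iso_utc(text):
    """Parse a timestamp string and label it as UTC without shifting the clock time."""
    for fmt in ('%Y-%m-%d %H:%M:%S.%f', '%Y-%m-%d %H:%M:%S'):
        try:
            naive = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # replace() attaches the zone; astimezone() would convert from local time
        return naive.replace(tzinfo=timezone.utc).isoformat()
    raise ValueError(f'unrecognized timestamp: {text!r}')

print(to_iso_utc('2001-05-14 23:50:03'))
```

If the timestamps are in local time instead, the existing conversion is the right idea, but the result then depends on the time zone of the machine running the notebook.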


In [32]:
## Add in occurrence-related columns from data, renaming as needed

converted['occurrenceID'] = data['observation_uuid']
converted['scientificName'] = data['concept']
converted['identifiedBy'] = data['observer']
converted['minimumDepthInMeters'] = data['depth_meters']
converted['maximumDepthInMeters'] = data['depth_meters']
converted['decimalLatitude'] = data['latitude']
converted['decimalLongitude'] = data['longitude']
converted['dissolvedOxygenInMLPerL'] = data['oxygen_ml_per_l']
converted['pressureInPsi'] = data['psi']
converted['salinity'] = data['salinity']
converted['temperatureInCelsius'] = data['temperature_celsius']
converted['image_url'] = data['image_url']
converted['video_uri'] = data['video_uri']
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,institutionCode,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,maximumDepthInMeters,decimalLatitude,decimalLongitude,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri
0,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,MBARI,53CF02AF-535F-41AE-B31E-4BCAB4F39A56,manipulator,vars,1948.599976,1948.599976,21.742367,-159.504933,2.07,320.799988,34.594002,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0324-09
1,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,MBARI,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,vars,1948.599976,1948.599976,21.742367,-159.504933,2.07,320.799988,34.594002,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0324-09
2,2001-07-06T04:16:45+00:00,Ventana_2016,ROV,Rob Sherlock,MBARI,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,schlin,98.400002,98.400002,36.702446,-122.059237,2.55,329.5,34.019001,9.307,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2016-05
5,2001-10-04T00:15:41+00:00,Ventana_2076,ROV,Charlie Paull,MBARI,C283DC75-BB9A-4E98-9E72-BC8EABC6EED9,rock,svonthun,881.400024,881.400024,36.760986,-121.984398,0.37,255.600006,34.513,4.42,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2076-02
6,2001-10-04T00:15:41+00:00,Ventana_2076,ROV,Charlie Paull,MBARI,EBE21573-7B5D-422B-87ED-83CB16F1D611,ledge,svonthun,881.400024,881.400024,36.760986,-121.984398,0.37,255.600006,34.513,4.42,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2076-02


In [33]:
## Add coordinateUncertaintyInMeters 

converted['coordinateUncertaintyInMeters'] = round(converted['minimumDepthInMeters']*0.03, 2)
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,institutionCode,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,maximumDepthInMeters,decimalLatitude,decimalLongitude,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters
0,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,MBARI,53CF02AF-535F-41AE-B31E-4BCAB4F39A56,manipulator,vars,1948.599976,1948.599976,21.742367,-159.504933,2.07,320.799988,34.594002,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0324-09,58.46
1,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,MBARI,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,vars,1948.599976,1948.599976,21.742367,-159.504933,2.07,320.799988,34.594002,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0324-09,58.46
2,2001-07-06T04:16:45+00:00,Ventana_2016,ROV,Rob Sherlock,MBARI,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,schlin,98.400002,98.400002,36.702446,-122.059237,2.55,329.5,34.019001,9.307,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2016-05,2.95
5,2001-10-04T00:15:41+00:00,Ventana_2076,ROV,Charlie Paull,MBARI,C283DC75-BB9A-4E98-9E72-BC8EABC6EED9,rock,svonthun,881.400024,881.400024,36.760986,-121.984398,0.37,255.600006,34.513,4.42,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2076-02,26.44
6,2001-10-04T00:15:41+00:00,Ventana_2076,ROV,Charlie Paull,MBARI,EBE21573-7B5D-422B-87ED-83CB16F1D611,ledge,svonthun,881.400024,881.400024,36.760986,-121.984398,0.37,255.600006,34.513,4.42,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2076-02,26.44


**Note:** The calculation for coordinateUncertaintyInMeters may become more complex once I've talked to Dave Caress. For now, I've used a rule of thumb from Brian, which he may or may not have remembered correctly: uncertainty is roughly 3% of depth. I've also rounded, somewhat arbitrarily, to two decimal places. **It might be worth asking whether the depths, which are reported to 6 decimal places, are actually that precise. It seems unlikely.**
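
On the precision question: the depths in the tables above (1948.599976, 98.400002, 881.400024) are exactly what appears when 1948.6, 98.4 and 881.4 are stored as 32-bit floats and printed to six decimals. That suggests the source precision is closer to one decimal place, though this is an inference from the numbers and not something confirmed with the VARS team. The sketch below reproduces those values and applies the 3% rule of thumb. Both function names are illustrative.

```python
import struct

def as_float32(value):
    """Round-trip a Python float through 32-bit storage."""
    return struct.unpack('f', struct.pack('f', value))[0]

def coordinate_uncertainty(depth_m, fraction=0.03, decimals=2):
    """Rule of thumb from the note above: uncertainty is about 3% of depth."""
    return round(depth_m * fraction, decimals)

for depth in (1948.6, 98.4, 881.4):
    stored = as_float32(depth)
    # Columns: one-decimal depth, its 32-bit representation, 3% uncertainty
    print(depth, round(stored, 6), coordinate_uncertainty(stored))
```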

In [16]:
# ## Get a list of unique species names

# converted['scientificName'] = [name.lower().strip() for name in converted['scientificName']]
# names = converted['scientificName'].unique()

In [17]:
# %%capture cap --no-stderr --no-display

# ## Look up names in WoRMS and save matched name, name ID and taxon ID to dicts ----- TAKES ~ 15 MINUTES TO DO THE ENTIRE NAMES LIST

# name_id_dic, name_dic, id_dic = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

In [18]:
# ## Write failed names to log file

# with open('VARS_2001_WoRMS_log.txt', 'w') as f:
#     f.write(cap.stdout)

**Note:** I need to check through the log. Some terms look like they should have been recognized by WoRMS.

**Terms that seem like they should have matched:**
- hydromedusae
- phyllospadix-zostera (clearly phyllospadix OR zostera)
- gastrozooid (can be more broadly classified as cnidaria?)
- clam (bivalvia)
- teuthoidea 
- medusae (WoRMS will match Medaeus and Medusa)
- larvacean house didn't match, but neither did larvacean
- neptunea-buccinum complex
- doliolinetta (WoRMS will match Dolioletta)
- tomopterid (WoRMS will match Tomopteridae)
- vitreosalpa gemini
- sergestid (WoRMS will match Sergestidae)
- sebastomus complex (Perhaps an old name? WoRMS fuzzy-matches several similar terms)
- coryphaenoides armatus-leptolepis-yaquinae complex (matched to coryphaenoides)
- graneledoninae
- pyrosomida (WoRMS will match Pyrosoma)
- funiculina-halipteris complex (WoRMS will match Funiculina and Halipteris)

**Is the species1-species2 form a common way to denote AND, or OR? If so, we might be able to handle these names programmatically.**
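
As a starting point for handling these programmatically, the sketch below splits a hyphenated concept into candidate names that could each be looked up separately. It assumes that hyphens separate alternatives and that any leading word is a shared genus. Both assumptions need confirming with whoever maintains the VARS knowledge base. The function name `split_complex_name` is illustrative.

```python
def split_complex_name(name):
    """Split a hyphenated 'complex' concept into candidate taxon names."""
    cleaned = name.lower().strip()
    suffix = ' complex'
    if cleaned.endswith(suffix):
        cleaned = cleaned[:-len(suffix)].strip()
    words = cleaned.split()
    if not words:
        return []
    *prefix, last = words
    parts = [part for part in last.split('-') if part]
    if prefix:
        # Treat the leading words as a genus shared by the hyphenated epithets
        genus = ' '.join(prefix)
        return [f'{genus} {part}' for part in parts]
    return parts

print(split_complex_name('funiculina-halipteris complex'))
print(split_complex_name('coryphaenoides armatus-leptolepis-yaquinae complex'))
```

Each candidate could then be passed to the WoRMS look-up on its own. How to record an occurrence that maps to several taxa is a separate question this sketch does not address.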

Also **note** that some terms that did match weren't necessarily species:
- cylinder
- phyllospadix settlement rake matched to phyllospadix

#### If desired, save dictionaries as json

In [19]:
# ## Save dictionaries

# with open('VARS_2001_name_id_dict.json', 'w') as fp:
#     json.dump(name_id_dic, fp, sort_keys=True, indent=4)
    
# with open('VARS_2001_name_name_dict.json', 'w') as fp:
#     json.dump(name_dic, fp, sort_keys=True, indent=4)
    
# with open('VARS_2001_name_taxid_dict.json', 'w') as fp:
#     json.dump(id_dic, fp, sort_keys=True, indent=4)

#### If desired, read dictionaries rather than querying WoRMS

In [45]:
## Load dictionaries

with open('VARS_2001_name_id_dict.json') as f:
  name_id_dic = json.load(f)

with open('VARS_2001_name_name_dict.json') as f:
  name_dic = json.load(f)

with open('VARS_2001_name_taxid_dict.json') as f:
  id_dic = json.load(f)
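
The save and load cells could be folded into one helper, so the slow look-up only runs when no saved file exists. This is a generic sketch: `load_or_build` is a hypothetical name, and the `build` argument stands in for whatever function performs the WoRMS query.

```python
import json
import os

def load_or_build(path, build):
    """Return the JSON content at path, or call build() and save its result there."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = build()
    with open(path, 'w') as fp:
        json.dump(result, fp, sort_keys=True, indent=4)
    return result
```

Deleting the JSON file is then enough to force a fresh look-up the next time the notebook runs.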

In [58]:
## Create columns from WoRMS data

# Create scientificNameID column with the same content as scientificName - strip to ensure no whitespace, lowercase
converted['scientificNameID'] = converted['scientificName'].str.strip().str.lower()

# Use dictionary to replace scientific names with name IDs
converted.replace({'scientificNameID':name_id_dic}, inplace=True)

# Repeat to create taxonID
converted['taxonID'] = converted['scientificName'].str.strip().str.lower()
converted.replace({'taxonID':id_dic}, inplace=True)

converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,institutionCode,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,maximumDepthInMeters,...,decimalLongitude,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters,scientificNameID,taxonID
0,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,MBARI,53CF02AF-535F-41AE-B31E-4BCAB4F39A56,manipulator,vars,1948.599976,1948.599976,...,-159.504933,2.07,320.799988,34.594002,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0324-09,58.46,manipulator,manipulator
1,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,MBARI,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,vars,1948.599976,1948.599976,...,-159.504933,2.07,320.799988,34.594002,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0324-09,58.46,urn:lsid:marinespecies.org:taxname:123083,123083
2,2001-07-06T04:16:45+00:00,Ventana_2016,ROV,Rob Sherlock,MBARI,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,schlin,98.400002,98.400002,...,-122.059237,2.55,329.5,34.019001,9.307,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2016-05,2.95,urn:lsid:marinespecies.org:taxname:135396,135396
5,2001-10-04T00:15:41+00:00,Ventana_2076,ROV,Charlie Paull,MBARI,C283DC75-BB9A-4E98-9E72-BC8EABC6EED9,rock,svonthun,881.400024,881.400024,...,-121.984398,0.37,255.600006,34.513,4.42,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2076-02,26.44,rock,rock
6,2001-10-04T00:15:41+00:00,Ventana_2076,ROV,Charlie Paull,MBARI,EBE21573-7B5D-422B-87ED-83CB16F1D611,ledge,svonthun,881.400024,881.400024,...,-121.984398,0.37,255.600006,34.513,4.42,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2076-02,26.44,ledge,ledge


In [61]:
## Remove rows that didn't have a WoRMS match

converted = converted[converted['scientificName'].str.strip().str.lower() != converted['scientificNameID']]
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,institutionCode,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,maximumDepthInMeters,...,decimalLongitude,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters,scientificNameID,taxonID
1,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,MBARI,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,vars,1948.599976,1948.599976,...,-159.504933,2.07,320.799988,34.594002,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0324-09,58.46,urn:lsid:marinespecies.org:taxname:123083,123083
2,2001-07-06T04:16:45+00:00,Ventana_2016,ROV,Rob Sherlock,MBARI,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,schlin,98.400002,98.400002,...,-122.059237,2.55,329.5,34.019001,9.307,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2016-05,2.95,urn:lsid:marinespecies.org:taxname:135396,135396
9,2001-03-22T01:40:01+00:00,Tiburon_0266,ROV,Bruce Robison,MBARI,C8428DAB-39B5-46BC-A87F-C50A18E351BD,Lobata,schlin,264.299988,264.299988,...,-143.506874,5.27,254.699997,34.034,11.69,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0266-01,7.93,urn:lsid:marinespecies.org:taxname:603346,603346
15,2001-07-20T00:32:11+00:00,Tiburon_0336,ROV,Bruce Robison,MBARI,1EB91809-BABF-4282-9AB9-802A5B4CFC2A,Apolemia,svonthun,1063.699951,1063.699951,...,-122.529437,0.42,32.0,34.469002,3.841,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0336-04,31.91,urn:lsid:marinespecies.org:taxname:135393,135393
17,2001-05-05T04:16:36+00:00,Ventana_1971,ROV,Bruce Robison,MBARI,D1A62DD5-44DA-4EE9-B8C7-A04573964F0B,Solmissus,schlin,440.5,440.5,...,-122.057003,0.82,252.399994,34.278999,6.486,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V1971-04,13.22,urn:lsid:marinespecies.org:taxname:117074,117074
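
The two cells above detect unmatched rows by checking whether the ID column still holds the original name. A possibly clearer alternative is `Series.map`, which leaves `NaN` wherever the dictionary has no entry. The sketch below uses a four-row toy DataFrame and a two-entry dictionary built from IDs shown in the output above. The variable names are illustrative.

```python
import pandas as pd

# Two entries taken from the output tables above
name_to_id = {
    'holothuroidea': 'urn:lsid:marinespecies.org:taxname:123083',
    'forskalia': 'urn:lsid:marinespecies.org:taxname:135396',
}

df = pd.DataFrame({'scientificName': ['manipulator', ' Holothuroidea', 'Forskalia', 'rock']})

# Normalize once, then look up; unmatched names become NaN instead of staying as text
key = df['scientificName'].str.strip().str.lower()
df['scientificNameID'] = key.map(name_to_id)

# Keep only rows with a WoRMS match
matched = df[df['scientificNameID'].notna()]
print(matched)
```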


In [66]:
## Replace scientificName with matched scientific names from WoRMS

converted['scientificName'] = converted['scientificName'].str.strip().str.lower()
converted['scientificName'] = converted['scientificName'].replace(name_dic)  # assign back; inplace on a selected column may not update the DataFrame
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,institutionCode,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,maximumDepthInMeters,...,decimalLongitude,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters,scientificNameID,taxonID
1,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,MBARI,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,vars,1948.599976,1948.599976,...,-159.504933,2.07,320.799988,34.594002,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0324-09,58.46,urn:lsid:marinespecies.org:taxname:123083,123083
2,2001-07-06T04:16:45+00:00,Ventana_2016,ROV,Rob Sherlock,MBARI,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,schlin,98.400002,98.400002,...,-122.059237,2.55,329.5,34.019001,9.307,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2016-05,2.95,urn:lsid:marinespecies.org:taxname:135396,135396
9,2001-03-22T01:40:01+00:00,Tiburon_0266,ROV,Bruce Robison,MBARI,C8428DAB-39B5-46BC-A87F-C50A18E351BD,Lobata,schlin,264.299988,264.299988,...,-143.506874,5.27,254.699997,34.034,11.69,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0266-01,7.93,urn:lsid:marinespecies.org:taxname:603346,603346
15,2001-07-20T00:32:11+00:00,Tiburon_0336,ROV,Bruce Robison,MBARI,1EB91809-BABF-4282-9AB9-802A5B4CFC2A,Apolemia,svonthun,1063.699951,1063.699951,...,-122.529437,0.42,32.0,34.469002,3.841,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0336-04,31.91,urn:lsid:marinespecies.org:taxname:135393,135393
17,2001-05-05T04:16:36+00:00,Ventana_1971,ROV,Bruce Robison,MBARI,D1A62DD5-44DA-4EE9-B8C7-A04573964F0B,Solmissus,schlin,440.5,440.5,...,-122.057003,0.82,252.399994,34.278999,6.486,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V1971-04,13.22,urn:lsid:marinespecies.org:taxname:117074,117074


In [46]:
## Create additional needed columns

converted['nameAccordingTo'] = 'WoRMS'
converted['occurrenceStatus'] = 'present'
converted['basisOfRecord'] = 'HumanObservation'
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,maximumDepthInMeters,decimalLatitude,...,salinity,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord
1,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,vars,1948.599976,1948.599976,21.742367,...,34.594002,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0324-09,58.46,urn:lsid:marinespecies.org:taxname:123083,123083,WoRMS,present,HumanObservation
2,2001-07-06T04:16:45+00:00,Ventana_2016,ROV,Rob Sherlock,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,schlin,98.400002,98.400002,36.702446,...,34.019001,9.307,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2016-05,2.95,urn:lsid:marinespecies.org:taxname:135396,135396,WoRMS,present,HumanObservation
9,2001-03-22T01:40:01+00:00,Tiburon_0266,ROV,Bruce Robison,C8428DAB-39B5-46BC-A87F-C50A18E351BD,Lobata,schlin,264.299988,264.299988,28.181265,...,34.034,11.69,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0266-01,7.93,urn:lsid:marinespecies.org:taxname:603346,603346,WoRMS,present,HumanObservation
15,2001-07-20T00:32:11+00:00,Tiburon_0336,ROV,Bruce Robison,1EB91809-BABF-4282-9AB9-802A5B4CFC2A,Apolemia,svonthun,1063.699951,1063.699951,36.569538,...,34.469002,3.841,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0336-04,31.91,urn:lsid:marinespecies.org:taxname:135393,135393,WoRMS,present,HumanObservation
17,2001-05-05T04:16:36+00:00,Ventana_1971,ROV,Bruce Robison,D1A62DD5-44DA-4EE9-B8C7-A04573964F0B,Solmissus,schlin,440.5,440.5,36.726976,...,34.278999,6.486,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V1971-04,13.22,urn:lsid:marinespecies.org:taxname:117074,117074,WoRMS,present,HumanObservation


In [51]:
## Assemble associatedMedia

associatedMedia = []

for occ_id in converted['occurrenceID'].unique():
    
    # Select data associated with that occurrenceID:
    selected = converted[converted['occurrenceID'] == occ_id]
    
    # Retrieve unique image and video files
    image_files = selected['image_url'].drop_duplicates()
    video_files = selected['video_uri'].drop_duplicates()
    
    # Remove any NaN values
    image_files = image_files.dropna()
    video_files = video_files.dropna()
    
    # Join image and video files
    media = pd.concat([image_files, video_files])
    
    # Create a string with all the urls, separated by ' | '
    url_str = ' | '.join(media)
    
    # Add to associatedMedia
    associatedMedia.append(url_str)

**Note:** That took a little under 15 minutes to run. There may be a faster way to do this, since the loop filters the whole DataFrame once for every occurrenceID.
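
One candidate for a faster approach is to stack the image and video columns into a single long table and let `groupby` do the joining in one pass. This is a sketch and has not been timed against the loop. The function name `assemble_associated_media` is illustrative, and the toy data are placeholders. One small difference from the loop: a URL that appears as both an image and a video for the same occurrence would be listed once here, not twice.

```python
import pandas as pd

def assemble_associated_media(df):
    """Return a Series of ' | '-joined media URLs, indexed by occurrenceID."""
    images = df[['occurrenceID', 'image_url']].rename(columns={'image_url': 'url'})
    videos = df[['occurrenceID', 'video_uri']].rename(columns={'video_uri': 'url'})
    # Images are stacked first so they precede videos within each occurrence
    stacked = pd.concat([images, videos]).dropna(subset=['url']).drop_duplicates()
    media = stacked.groupby('occurrenceID', sort=False)['url'].agg(' | '.join)
    # Occurrences with no media at all get an empty string
    return media.reindex(df['occurrenceID'].unique(), fill_value='')

toy = pd.DataFrame({
    'occurrenceID': ['a', 'a', 'b', 'c'],
    'image_url': ['img1', 'img2', None, None],
    'video_uri': ['vid1', 'vid1', 'vid2', None],
})
media = assemble_associated_media(toy)
print(media)
```

The result can be attached after dropping duplicate occurrenceIDs with `converted['occurrenceID'].map(media)`, which aligns by ID and does not rely on list order.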

In [52]:
## Add to df

# First, need to remove rows with duplicate occurrenceIDs
converted = converted.drop_duplicates(subset='occurrenceID', keep="first")

# Add associatedMedia
converted['associatedMedia'] = associatedMedia
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,maximumDepthInMeters,decimalLatitude,...,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,associatedMedia
1,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,vars,1948.599976,1948.599976,21.742367,...,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0324-09,58.46,urn:lsid:marinespecies.org:taxname:123083,123083,WoRMS,present,HumanObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...
2,2001-07-06T04:16:45+00:00,Ventana_2016,ROV,Rob Sherlock,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,schlin,98.400002,98.400002,36.702446,...,9.307,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V2016-05,2.95,urn:lsid:marinespecies.org:taxname:135396,135396,WoRMS,present,HumanObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
9,2001-03-22T01:40:01+00:00,Tiburon_0266,ROV,Bruce Robison,C8428DAB-39B5-46BC-A87F-C50A18E351BD,Lobata,schlin,264.299988,264.299988,28.181265,...,11.69,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0266-01,7.93,urn:lsid:marinespecies.org:taxname:603346,603346,WoRMS,present,HumanObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...
15,2001-07-20T00:32:11+00:00,Tiburon_0336,ROV,Bruce Robison,1EB91809-BABF-4282-9AB9-802A5B4CFC2A,Apolemia,svonthun,1063.699951,1063.699951,36.569538,...,3.841,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,urn:tid:mbari.org:T0336-04,31.91,urn:lsid:marinespecies.org:taxname:135393,135393,WoRMS,present,HumanObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...
17,2001-05-05T04:16:36+00:00,Ventana_1971,ROV,Bruce Robison,D1A62DD5-44DA-4EE9-B8C7-A04573964F0B,Solmissus,schlin,440.5,440.5,36.726976,...,6.486,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V1971-04,13.22,urn:lsid:marinespecies.org:taxname:117074,117074,WoRMS,present,HumanObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...


In [53]:
## Drop extra columns

converted = converted.drop(['image_url', 'video_uri'], axis=1)
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,maximumDepthInMeters,decimalLatitude,...,pressureInPsi,salinity,temperatureInCelsius,coordinateUncertaintyInMeters,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,associatedMedia
1,2001-05-15T06:50:03+00:00,Tiburon_0324,ROV,David Clague,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,vars,1948.599976,1948.599976,21.742367,...,320.799988,34.594002,2.407,58.46,urn:lsid:marinespecies.org:taxname:123083,123083,WoRMS,present,HumanObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...
2,2001-07-06T04:16:45+00:00,Ventana_2016,ROV,Rob Sherlock,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,schlin,98.400002,98.400002,36.702446,...,329.5,34.019001,9.307,2.95,urn:lsid:marinespecies.org:taxname:135396,135396,WoRMS,present,HumanObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
9,2001-03-22T01:40:01+00:00,Tiburon_0266,ROV,Bruce Robison,C8428DAB-39B5-46BC-A87F-C50A18E351BD,Lobata,schlin,264.299988,264.299988,28.181265,...,254.699997,34.034,11.69,7.93,urn:lsid:marinespecies.org:taxname:603346,603346,WoRMS,present,HumanObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...
15,2001-07-20T00:32:11+00:00,Tiburon_0336,ROV,Bruce Robison,1EB91809-BABF-4282-9AB9-802A5B4CFC2A,Apolemia,svonthun,1063.699951,1063.699951,36.569538,...,32.0,34.469002,3.841,31.91,urn:lsid:marinespecies.org:taxname:135393,135393,WoRMS,present,HumanObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...
17,2001-05-05T04:16:36+00:00,Ventana_1971,ROV,Bruce Robison,D1A62DD5-44DA-4EE9-B8C7-A04573964F0B,Solmissus,schlin,440.5,440.5,36.726976,...,252.399994,34.278999,6.486,13.22,urn:lsid:marinespecies.org:taxname:117074,117074,WoRMS,present,HumanObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...


In [56]:
## Reorder columns

converted = converted[['eventID', 'eventDate', 'samplingProtocol', 'recordedBy', 'institutionCode', 'occurrenceID', 'scientificName', 'scientificNameID', 'taxonID', 
                       'nameAccordingTo', 'occurrenceStatus', 'basisOfRecord', 'identifiedBy', 'decimalLatitude', 'decimalLongitude', 'coordinateUncertaintyInMeters',
                       'minimumDepthInMeters', 'maximumDepthInMeters', 'dissolvedOxygenInMLPerL', 'pressureInPsi', 'salinity', 'temperatureInCelsius', 'associatedMedia']]
converted.head()

Unnamed: 0,eventID,eventDate,samplingProtocol,recordedBy,institutionCode,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,...,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius,associatedMedia
1,Tiburon_0324,2001-05-15T06:50:03+00:00,ROV,David Clague,MBARI,E9881AC2-AD21-4420-B2DD-3C4A73645E92,Holothuroidea,urn:lsid:marinespecies.org:taxname:123083,123083,WoRMS,...,21.742367,-159.504933,58.46,1948.599976,1948.599976,2.07,320.799988,34.594002,2.407,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...
2,Ventana_2016,2001-07-06T04:16:45+00:00,ROV,Rob Sherlock,MBARI,5F374B81-172F-4888-8C2B-1489E9C8A366,Forskalia,urn:lsid:marinespecies.org:taxname:135396,135396,WoRMS,...,36.702446,-122.059237,2.95,98.400002,98.400002,2.55,329.5,34.019001,9.307,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
9,Tiburon_0266,2001-03-22T01:40:01+00:00,ROV,Bruce Robison,MBARI,C8428DAB-39B5-46BC-A87F-C50A18E351BD,Lobata,urn:lsid:marinespecies.org:taxname:603346,603346,WoRMS,...,28.181265,-143.506874,7.93,264.299988,264.299988,5.27,254.699997,34.034,11.69,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...
15,Tiburon_0336,2001-07-20T00:32:11+00:00,ROV,Bruce Robison,MBARI,1EB91809-BABF-4282-9AB9-802A5B4CFC2A,Apolemia,urn:lsid:marinespecies.org:taxname:135393,135393,WoRMS,...,36.569538,-122.529437,31.91,1063.699951,1063.699951,0.42,32.0,34.469002,3.841,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...
17,Ventana_1971,2001-05-05T04:16:36+00:00,ROV,Bruce Robison,MBARI,D1A62DD5-44DA-4EE9-B8C7-A04573964F0B,Solmissus,urn:lsid:marinespecies.org:taxname:117074,117074,WoRMS,...,36.726976,-122.057003,13.22,440.5,440.5,0.82,252.399994,34.278999,6.486,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...


In [57]:
## Save

converted.to_csv('VARS_2001_converted.csv', index=False, na_rep='NaN')

**Note:** This pipeline takes 20-25 minutes to run on ~200,000 records. Most of that time goes to 1) the WoRMS look-up and 2) assembly of associatedMedia.

### Remaining issues

1. coordinateUncertaintyInMeters **Waiting to talk to Dave Caress**
2. How accurate are depths? Probably not to 6 decimal places. **Waiting to talk to Dave Caress**
3. Check through WoRMS log for names that should have matched **COMPLETE - emailed Brian about who to ask about ongoing questions**
4. Faster way to assemble associatedMedia?