## VARS DwC conversion

This code is adapted from Darwin Core (DwC) conversion code I worked on in early 2020. It's intended to convert VARS observations one year at a time, and it has been tested on data from 1989, 2001, 2010, and 2017.

Resources:
- https://dwc.tdwg.org/terms/
- https://tools.gbif.org/dwca-validator/extension.do?id=dwc:Occurrence
- https://www.mbari.org/products/research-software/video-annotation-and-reference-system-vars/query-interface/advanced-user-guide/
- https://www.gbif.org/data-quality-requirements-occurrences

In [1]:
## Imports

import pandas as pd
import numpy as np

import re # for extracting logon info from text file

import jaydebeapi # for connecting to VARS db
import VARS # for connecting to VARS db

from datetime import datetime # for handling dates
import pytz # for handling time zones

import urllib.request, urllib.parse, json # for dealing with WoRMS API and output
import WoRMS # functions for querying WoRMS REST API

### Obtain data from VARS database

In [2]:
## Extract logon information from text file

# Get list of each line in file
filename = 'VARS_logon_info.txt'
with open(filename, 'r') as f:
    lines = f.readlines()

# Function for extracting information from lines
def get_single_quoted_text(s):
    """
    Takes string s and returns the text between the first pair of single quotes, with surrounding whitespace removed.

    Example:
    s = "What if there's more ' than one' set of single' quotes?"
    get_single_quoted_text(s) --> 's more'

    """

    # Raw string so that \s reaches the regex engine unchanged
    extracted_text = re.search(r'''(?<=')\s*[^']+?\s*(?=')''', s)
    return extracted_text.group().strip()

# Assign logon info
dr = get_single_quoted_text(lines[2])
name = get_single_quoted_text(lines[3])
pw = get_single_quoted_text(lines[4])
un = get_single_quoted_text(lines[5])
url = get_single_quoted_text(lines[6])

An explanation of the regex in get_single_quoted_text() can be found here: <br>
https://stackoverflow.com/questions/42002931/regex-extract-string-between-single-quotes-trim-whitespace?rq=1
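As a quick illustration of how the lookbehind and lookahead behave, here is a minimal sketch that applies the same pattern to a made-up line. The line layout (`url = '...'`) and the value in it are hypothetical; the real contents of `VARS_logon_info.txt` are not shown in this notebook.

```python
import re

# Same pattern as in get_single_quoted_text(), written as a raw string.
# (?<=')         lookbehind: the match must start right after a single quote
# \s*[^']+?\s*   the quoted text, with optional whitespace on either side
# (?=')          lookahead: the match must end right before the next single quote
pattern = re.compile(r"""(?<=')\s*[^']+?\s*(?=')""")

# Hypothetical line, for illustration only
sample_line = "url = '  jdbc:example://host:1234  '\n"

match = pattern.search(sample_line)
extracted = match.group().strip() if match else None
print(extracted)
```

Because the quotes themselves sit inside the lookarounds, they are never part of the match, so only the surrounding whitespace has to be stripped afterwards.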

The following query is based on one provided by Brian Schlining that avoids pulling embargoed records. These embargoes (especially the embargoed concepts and dives) may need to be updated.

Currently, the query pulls records from 1989.
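Since the year is hard-coded in the date bounds of the query, switching years means editing the SQL by hand. One possible way to generate the date clause instead is sketched here. The helper names `year_bounds` and `year_filter` are hypothetical and not part of the VARS module; the sketch uses a half-open interval (on or after January 1, before January 1 of the next year) so that timestamps during December 31 are covered.

```python
from datetime import date

def year_bounds(year):
    """Return (start, end) ISO date strings for a half-open interval covering `year`."""
    start = date(year, 1, 1).isoformat()
    end = date(year + 1, 1, 1).isoformat()
    return start, end

def year_filter(year):
    """Build the date clause for one year (hypothetical helper)."""
    start, end = year_bounds(year)
    return (
        f"index_recorded_timestamp >= CAST('{start}' AS datetime)\n"
        f"AND index_recorded_timestamp < CAST('{end}' AS datetime)"
    )

print(year_filter(1989))
```

The resulting string could be appended to the main query in place of the two hard-coded date lines.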

In [4]:
## Build SQL query

sql = """
        SELECT index_recorded_timestamp,
               observation_uuid,
               concept,
               observation_group,
               observer,
               image_url,
               depth_meters,
               latitude,
               longitude,
               oxygen_ml_per_l,
               psi,
               salinity,
               temperature_celsius,
               video_uri,
               video_sequence_name,
               chief_scientist
        FROM annotations a
        WHERE NOT EXISTS (
           SELECT DISTINCT observation_uuid
           FROM annotations b
           WHERE (
             (  -- Delete last 2 years of annotations
             index_recorded_timestamp > DATEADD([year], - 2, GETDATE()) OR
             index_recorded_timestamp IS NULL OR
             index_recorded_timestamp < CAST('1970-01-02' AS datetime)
             )
           OR ( -- Delete embargoes by dive
             dive_number IN ('Ventana 50', 'Ventana 217', 'Ventana 218', 'Ventana 248')
              )
           OR (
             dive_number IN ('Tiburon 1001', 'Tiburon 1029', 'Tiburon 1030', 'Tiburon 1031', 'Tiburon 1032', 'Tiburon 1033', 'Tiburon 1034')
             )
           OR ( -- Delete embargoes by selectedConcept
             concept IN (
                 'Aegina sp. 1',
                 'Ctenophora',
                 'Cydippida 2',
                 'Cydippida',
                 'Intacta',
                 'Llyria',
                 'Lyrocteis',
                 'Lyroctenidae',
                 'Mertensia',
                 'Mertensiidae sp. A',
                 'Mystery Mollusc',
                 'Physonectae sp. 1',
                 'Platyctenida sp. 1',
                 'Platyctenida',
                 'Thalassocalycida sp. 1',
                 'Thalassocalycida',
                 'Thliptodon sp. A',
                 'Tjalfiella tristoma',
                 'Tjalfiella',
                 'Tjalfiellidae',
                 'Tuscarantha braueri',
                 'Tuscarantha luciae',
                 'Tuscarantha',
                 'Tuscaretta globosa',
                 'Tuscaretta',
                 'Tuscaridium cygneum',
                 'Tuscaridium',
                 'Tuscarilla campanella',
                 'Tuscarilla nationalis',
                 'Tuscarilla similis',
                 'Tuscarilla',
                 'Tuscarora',
                 'Tuscaroridae'
                 )
            )
        ) AND a.observation_uuid = b.observation_uuid
    ) AND index_recorded_timestamp >= CAST('1989-01-01' AS datetime) 
      AND index_recorded_timestamp < CAST('1990-01-01' AS datetime) -- half-open range so all of Dec 31 is included
    """

NOTE that the following query will not run unless you're connected to the MBARI VPN.

In [5]:
## Query the database

# Get connection
conn = VARS.get_db_conn(dr, url, un, pw, name)

# Submit query
data = VARS.get_data(conn, sql)

# Close connection
conn.close()

In [10]:
## Check data is there

col_names = data[1]
data = data[0]

data.columns = col_names

print(data.shape)
data.head()

(44747, 16)


Unnamed: 0,index_recorded_timestamp,observation_uuid,concept,observation_group,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri,video_sequence_name,chief_scientist
0,1989-05-17 22:32:03,E7A74B35-0C79-41C5-BB4F-3548FB74DBEB,shoe,ROV,jana,http://search.mbari.org/ARCHIVE/framegrabs/Ven...,399.269989,36.10751,-121.669977,,174.0,34.230999,6.719,urn:tid:mbari.org:V0050-07,Ventana 0050,Chris Grech
1,1989-04-27 17:54:50,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,ROV,amberR,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,415.0,36.612168,-122.029075,,44.0,34.234001,6.259,urn:tid:mbari.org:V0043-04,Ventana 0043,Chuck Baxter
2,1989-04-27 17:54:50,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,ROV,amberR,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,415.0,36.612168,-122.029075,,44.0,34.234001,6.259,urn:tid:mbari.org:V0043-04,Ventana 0043,Chuck Baxter
3,1989-09-12 22:01:59,4FF3A4DB-19AE-4C66-8D3F-7096FA5858E2,eggcase,ROV,unknown,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,329.48999,36.699454,-121.997758,,228.0,34.127998,7.936,urn:tid:mbari.org:V0076-18,Ventana 0076,Chuck Baxter
4,1989-09-12 22:01:59,4FF3A4DB-19AE-4C66-8D3F-7096FA5858E2,eggcase,ROV,unknown,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,329.48999,36.699454,-121.997758,,228.0,34.127998,7.936,urn:tid:mbari.org:V0076-18,Ventana 0076,Chuck Baxter


In [11]:
## Save the data if you don't want to have to retrieve it every time

data.to_csv('VARS_1989_data.csv', index=False, na_rep='NaN')

### Read in saved data (if not pulled directly from the database)

In [12]:
## Load csv

path = ''
filename = 'VARS_1989_data.csv'
data = pd.read_csv(path+filename, dtype={'image_url': object})

print(data.shape)
data.head()

(44747, 16)


Unnamed: 0,index_recorded_timestamp,observation_uuid,concept,observation_group,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri,video_sequence_name,chief_scientist
0,1989-05-17 22:32:03,E7A74B35-0C79-41C5-BB4F-3548FB74DBEB,shoe,ROV,jana,http://search.mbari.org/ARCHIVE/framegrabs/Ven...,399.269989,36.10751,-121.669977,,174.0,34.230999,6.719,urn:tid:mbari.org:V0050-07,Ventana 0050,Chris Grech
1,1989-04-27 17:54:50,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,ROV,amberR,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,415.0,36.612168,-122.029075,,44.0,34.234001,6.259,urn:tid:mbari.org:V0043-04,Ventana 0043,Chuck Baxter
2,1989-04-27 17:54:50,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,ROV,amberR,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,415.0,36.612168,-122.029075,,44.0,34.234001,6.259,urn:tid:mbari.org:V0043-04,Ventana 0043,Chuck Baxter
3,1989-09-12 22:01:59,4FF3A4DB-19AE-4C66-8D3F-7096FA5858E2,eggcase,ROV,unknown,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,329.48999,36.699454,-121.997758,,228.0,34.127998,7.936,urn:tid:mbari.org:V0076-18,Ventana 0076,Chuck Baxter
4,1989-09-12 22:01:59,4FF3A4DB-19AE-4C66-8D3F-7096FA5858E2,eggcase,ROV,unknown,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,329.48999,36.699454,-121.997758,,228.0,34.127998,7.936,urn:tid:mbari.org:V0076-18,Ventana 0076,Chuck Baxter


### Pre-processing

In [13]:
## Drop duplicate rows that arise from associations, which we don't care about here

data = data.drop_duplicates()
print(data.shape)
data.head()

(36184, 16)


Unnamed: 0,index_recorded_timestamp,observation_uuid,concept,observation_group,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri,video_sequence_name,chief_scientist
0,1989-05-17 22:32:03,E7A74B35-0C79-41C5-BB4F-3548FB74DBEB,shoe,ROV,jana,http://search.mbari.org/ARCHIVE/framegrabs/Ven...,399.269989,36.10751,-121.669977,,174.0,34.230999,6.719,urn:tid:mbari.org:V0050-07,Ventana 0050,Chris Grech
1,1989-04-27 17:54:50,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,ROV,amberR,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,415.0,36.612168,-122.029075,,44.0,34.234001,6.259,urn:tid:mbari.org:V0043-04,Ventana 0043,Chuck Baxter
3,1989-09-12 22:01:59,4FF3A4DB-19AE-4C66-8D3F-7096FA5858E2,eggcase,ROV,unknown,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,329.48999,36.699454,-121.997758,,228.0,34.127998,7.936,urn:tid:mbari.org:V0076-18,Ventana 0076,Chuck Baxter
5,1989-03-20 22:03:16.644000,63B6895A-4DB2-4F58-AA03-A4F994814952,eggcase,ROV,unknown,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,,,,,,,,urn:tid:mbari.org:V0033-04,Ventana 0033,Chuck Baxter
6,1989-03-20 22:03:16.644000,B1164C4F-E511-4FB9-A0DB-CCA3588AB685,eggcase,ROV,amberR,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,,,,,,,,urn:tid:mbari.org:V0033-04,Ventana 0033,Chuck Baxter


### Convert

In [14]:
## Start with basic event data and change headings

converted = data[['index_recorded_timestamp', 'video_sequence_name', 'observation_group', 'chief_scientist']]
converted = converted.rename(columns={
    'index_recorded_timestamp':'eventDate',
    'video_sequence_name':'eventID',
    'observation_group':'samplingProtocol',
    'chief_scientist':'recordedBy'
})
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy
0,1989-05-17 22:32:03,Ventana 0050,ROV,Chris Grech
1,1989-04-27 17:54:50,Ventana 0043,ROV,Chuck Baxter
3,1989-09-12 22:01:59,Ventana 0076,ROV,Chuck Baxter
5,1989-03-20 22:03:16.644000,Ventana 0033,ROV,Chuck Baxter
6,1989-03-20 22:03:16.644000,Ventana 0033,ROV,Chuck Baxter


A small number of records have samplingProtocol = NaN. However, they have one of the ROVs indicated in the eventID, so I feel comfortable forcing all records to ROV.

In [15]:
## Ensure samplingProtocol is always ROV

converted['samplingProtocol'] = 'ROV'
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy
0,1989-05-17 22:32:03,Ventana 0050,ROV,Chris Grech
1,1989-04-27 17:54:50,Ventana 0043,ROV,Chuck Baxter
3,1989-09-12 22:01:59,Ventana 0076,ROV,Chuck Baxter
5,1989-03-20 22:03:16.644000,Ventana 0033,ROV,Chuck Baxter
6,1989-03-20 22:03:16.644000,Ventana 0033,ROV,Chuck Baxter


In [16]:
## Replace spaces in eventID with underscores

converted['eventID'] = converted['eventID'].str.replace(' ', '_', regex=False)
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy
0,1989-05-17 22:32:03,Ventana_0050,ROV,Chris Grech
1,1989-04-27 17:54:50,Ventana_0043,ROV,Chuck Baxter
3,1989-09-12 22:01:59,Ventana_0076,ROV,Chuck Baxter
5,1989-03-20 22:03:16.644000,Ventana_0033,ROV,Chuck Baxter
6,1989-03-20 22:03:16.644000,Ventana_0033,ROV,Chuck Baxter


**Note** that this code also places an underscore between 'Doc' and 'Ricketts'. It's possible that using 'DocRicketts' could be preferable.
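If 'DocRicketts' turns out to be preferable, one option is to split off the dive number first and only then remove spaces from the platform name. This is a minimal sketch that assumes every sequence name has the form `<platform name> <dive number>`; the function name `make_event_id` and the 'Doc Ricketts 0123' value are made up for illustration.

```python
import pandas as pd

def make_event_id(sequence_name):
    """Remove spaces from the platform name, then join it to the dive number with '_'."""
    # Split on the last space only, so multi-word platform names stay together
    platform, dive = sequence_name.rsplit(' ', 1)
    return platform.replace(' ', '') + '_' + dive

# Illustrative values; the second one is hypothetical
sequence_names = pd.Series(['Ventana 0050', 'Doc Ricketts 0123'])
event_ids = sequence_names.map(make_event_id)
print(event_ids.tolist())
```

A name with no space at all would raise a `ValueError` on unpacking, which may be a useful signal that the format assumption does not hold.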

In [17]:
## Add datasetID

converted['datasetID'] = 'VARS'

In [18]:
## Add institutionCode

converted['institutionCode'] = 'MBARI'
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,datasetID,institutionCode
0,1989-05-17 22:32:03,Ventana_0050,ROV,Chris Grech,VARS,MBARI
1,1989-04-27 17:54:50,Ventana_0043,ROV,Chuck Baxter,VARS,MBARI
3,1989-09-12 22:01:59,Ventana_0076,ROV,Chuck Baxter,VARS,MBARI
5,1989-03-20 22:03:16.644000,Ventana_0033,ROV,Chuck Baxter,VARS,MBARI
6,1989-03-20 22:03:16.644000,Ventana_0033,ROV,Chuck Baxter,VARS,MBARI


In [19]:
## Format eventDate

formatted = []

for dt in converted['eventDate']:
    
    # Convert string to datetime
    try:
        dt = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S.%f') # some datetimes have milliseconds
    except ValueError:
        dt = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S')
        
    # No time zone conversion is applied: the timestamps are assumed to already be in UTC,
    # which is what the 'Z' suffix appended below indicates
    
    # Put in ISO format string
    dt = dt.isoformat()
    
    # Save in list
    formatted.append(dt + 'Z')

converted['eventDate'] = formatted
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,datasetID,institutionCode
0,1989-05-17T22:32:03Z,Ventana_0050,ROV,Chris Grech,VARS,MBARI
1,1989-04-27T17:54:50Z,Ventana_0043,ROV,Chuck Baxter,VARS,MBARI
3,1989-09-12T22:01:59Z,Ventana_0076,ROV,Chuck Baxter,VARS,MBARI
5,1989-03-20T22:03:16.644000Z,Ventana_0033,ROV,Chuck Baxter,VARS,MBARI
6,1989-03-20T22:03:16.644000Z,Ventana_0033,ROV,Chuck Baxter,VARS,MBARI
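The two-format `try`/`except` above could probably be collapsed into a single call, since `datetime.fromisoformat()` accepts a space separator and an optional fractional part of 3 or 6 digits. This is a sketch rather than a drop-in replacement: like the cell above, it assumes the stored timestamps are already in UTC, and the function name `to_iso_utc` is made up here.

```python
from datetime import datetime
import pandas as pd

def to_iso_utc(timestamp):
    """Parse 'YYYY-MM-DD HH:MM:SS[.ffffff]' and return an ISO 8601 string ending in 'Z'."""
    # Assumption: the timestamp is already UTC, so 'Z' is appended without conversion
    return datetime.fromisoformat(timestamp).isoformat() + 'Z'

raw = pd.Series(['1989-05-17 22:32:03', '1989-03-20 22:03:16.644000'])
formatted_dates = raw.map(to_iso_utc)
print(formatted_dates.tolist())
```

As with `strptime`, a value in any other format raises a `ValueError`, so malformed timestamps are not silently passed through.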


In [20]:
## Add in occurrence-related columns from data, renaming as needed

converted['occurrenceID'] = data['observation_uuid']
converted['scientificName'] = data['concept']
converted['occurrenceRemarks'] = data['concept']
converted['identifiedBy'] = data['observer']
# Depth margin of 2.2 m: the sensor is at most 2 m shallower than the camera/observed organism (?), and the sensor is accurate to within 20 cm
converted['minimumDepthInMeters'] = data['depth_meters'].round(1) - 2.2
converted['maximumDepthInMeters'] = data['depth_meters'].round(1) + 2.2
converted['verbatimDepth'] = data['depth_meters'].round(1)
converted['decimalLatitude'] = data['latitude']
converted['decimalLongitude'] = data['longitude']
converted['dissolvedOxygenInMLPerL'] = data['oxygen_ml_per_l']
converted['pressureInPsi'] = data['psi']
converted['salinity'] = data['salinity']
converted['temperatureInCelsius'] = data['temperature_celsius']
converted['image_url'] = data['image_url']
converted['video_uri'] = data['video_uri']
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,datasetID,institutionCode,occurrenceID,scientificName,occurrenceRemarks,identifiedBy,...,maximumDepthInMeters,verbatimDepth,decimalLatitude,decimalLongitude,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri
0,1989-05-17T22:32:03Z,Ventana_0050,ROV,Chris Grech,VARS,MBARI,E7A74B35-0C79-41C5-BB4F-3548FB74DBEB,shoe,shoe,jana,...,401.5,399.3,36.10751,-121.669977,,174.0,34.230999,6.719,http://search.mbari.org/ARCHIVE/framegrabs/Ven...,urn:tid:mbari.org:V0050-07
1,1989-04-27T17:54:50Z,Ventana_0043,ROV,Chuck Baxter,VARS,MBARI,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,Parmaturus xaniurus,amberR,...,417.2,415.0,36.612168,-122.029075,,44.0,34.234001,6.259,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0043-04
3,1989-09-12T22:01:59Z,Ventana_0076,ROV,Chuck Baxter,VARS,MBARI,4FF3A4DB-19AE-4C66-8D3F-7096FA5858E2,eggcase,eggcase,unknown,...,331.7,329.5,36.699454,-121.997758,,228.0,34.127998,7.936,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0076-18
5,1989-03-20T22:03:16.644000Z,Ventana_0033,ROV,Chuck Baxter,VARS,MBARI,63B6895A-4DB2-4F58-AA03-A4F994814952,eggcase,eggcase,unknown,...,,,,,,,,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0033-04
6,1989-03-20T22:03:16.644000Z,Ventana_0033,ROV,Chuck Baxter,VARS,MBARI,B1164C4F-E511-4FB9-A0DB-CCA3588AB685,eggcase,eggcase,amberR,...,,,,,,,,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0033-04
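The 2.2 m margin used for the depth bounds is the sum of the two figures in the code comment: up to 2 m of vertical offset between the sensor and the organism, plus 0.2 m of sensor accuracy. The sketch below makes that arithmetic explicit. The constant and function names are made up for illustration, and the offset value carries the same uncertainty as the "(?)" in the original comment.

```python
import pandas as pd

SENSOR_OFFSET_M = 2.0    # assumed maximum vertical offset between sensor and organism
SENSOR_ACCURACY_M = 0.2  # assumed sensor accuracy

def depth_bounds(depth_m):
    """Return min/max/verbatim depth columns for a Series of sensor depths in meters."""
    rounded = depth_m.round(1)
    margin = SENSOR_OFFSET_M + SENSOR_ACCURACY_M
    return pd.DataFrame({
        # Round again so floating-point noise from the subtraction/addition is removed
        'minimumDepthInMeters': (rounded - margin).round(1),
        'maximumDepthInMeters': (rounded + margin).round(1),
        'verbatimDepth': rounded,
    })

# Two depths taken from the table above, plus a missing value
depths = pd.Series([399.269989, 415.0, None])
bounds = depth_bounds(depths)
print(bounds)
```

Missing depths stay missing in all three columns, which matches the `NaN` rows in the output above.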


In [21]:
## Add coordinateUncertaintyInMeters 

converted['coordinateUncertaintyInMeters'] = 300
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,datasetID,institutionCode,occurrenceID,scientificName,occurrenceRemarks,identifiedBy,...,verbatimDepth,decimalLatitude,decimalLongitude,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters
0,1989-05-17T22:32:03Z,Ventana_0050,ROV,Chris Grech,VARS,MBARI,E7A74B35-0C79-41C5-BB4F-3548FB74DBEB,shoe,shoe,jana,...,399.3,36.10751,-121.669977,,174.0,34.230999,6.719,http://search.mbari.org/ARCHIVE/framegrabs/Ven...,urn:tid:mbari.org:V0050-07,300
1,1989-04-27T17:54:50Z,Ventana_0043,ROV,Chuck Baxter,VARS,MBARI,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,Parmaturus xaniurus,amberR,...,415.0,36.612168,-122.029075,,44.0,34.234001,6.259,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0043-04,300
3,1989-09-12T22:01:59Z,Ventana_0076,ROV,Chuck Baxter,VARS,MBARI,4FF3A4DB-19AE-4C66-8D3F-7096FA5858E2,eggcase,eggcase,unknown,...,329.5,36.699454,-121.997758,,228.0,34.127998,7.936,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0076-18,300
5,1989-03-20T22:03:16.644000Z,Ventana_0033,ROV,Chuck Baxter,VARS,MBARI,63B6895A-4DB2-4F58-AA03-A4F994814952,eggcase,eggcase,unknown,...,,,,,,,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0033-04,300
6,1989-03-20T22:03:16.644000Z,Ventana_0033,ROV,Chuck Baxter,VARS,MBARI,B1164C4F-E511-4FB9-A0DB-CCA3588AB685,eggcase,eggcase,amberR,...,,,,,,,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0033-04,300


In [22]:
## Get a list of unique concept names

converted['scientificName'] = converted['scientificName'].str.lower().str.strip()
names = converted['scientificName'].unique()

In [32]:
## Get names of all animals in VARS

animalia = pd.read_json('http://m3.shore.mbari.org/kb/v1/phylogeny/taxa/Animalia')
animalia['name'] = animalia['name'].str.lower()

In [35]:
## Filter dataframe to include only animal concepts, and reformulate names list

converted = converted[converted['scientificName'].isin(animalia['name'])].copy()
names = converted['scientificName'].unique()

**Note:** We are also interested in other forms of life (algae, etc.), so it may be worth asking Brian whether an equivalent list is available for them.

**I also haven't checked which records are filtered out during this step.**
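One low-effort way to review what the animal filter removes would be to count the excluded concepts before applying it. This is only a sketch on toy data: `observations` and `animal_names` stand in for `converted` and `animalia['name']`, and the values are illustrative.

```python
import pandas as pd

# Toy stand-ins for the notebook's data; values are illustrative only
observations = pd.DataFrame({
    'scientificName': ['parmaturus xaniurus', 'shoe', 'eggcase', 'eggcase', 'galatheidae']
})
animal_names = pd.Series(['parmaturus xaniurus', 'galatheidae'])

is_animal = observations['scientificName'].isin(animal_names)

# Count how many records each excluded concept accounts for
dropped_counts = observations.loc[~is_animal, 'scientificName'].value_counts()
print(dropped_counts)
```

Scanning this list for names that look biological (algae, misspelled taxa, and so on) would show whether the filter is discarding anything of interest.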

In [37]:
## Look up names in WoRMS

name_id_dic, name_dic, id_dic, class_dic = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for opisthoteuthis cf. californiana checking:  opisthoteuthis
Url didn't work, check name:  teuthoidea


Normally I check through names that didn't match manually, but I don't have time right now. Based on work I did in 2020, I have a note that 'teuthoidea' should be 'teuthida'.
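When there is time for the manual check, the list of names to review could be derived from the lookup results instead of from the printed messages. The sketch below assumes the dictionaries returned by the WoRMS helper are keyed by the queried name and contain entries only for names that matched; that assumption is not verified here, and the toy variables stand in for `names` and `name_id_dic`.

```python
# Toy stand-ins; in the notebook these would be `names` and `name_id_dic`
queried_names = ['parmaturus xaniurus', 'teuthoidea', 'galatheidae']
matched_ids = {
    'parmaturus xaniurus': 'urn:lsid:marinespecies.org:taxname:282166',
    'galatheidae': 'urn:lsid:marinespecies.org:taxname:106733',
}

# Names that were queried but have no entry in the results
unmatched = sorted(set(queried_names) - set(matched_ids))
print(unmatched)
```

Each name in `unmatched` would then need either an entry in the mapping dictionary defined in the next step or a decision to leave it out.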

#### Handle organisms that should have matched on WoRMS but didn't

In [38]:
## Create a dictionary mapping biological names that didn't match on WoRMS to names that should match

VARS_to_WoRMS_dict = {'teuthoidea':'teuthida'}

In [39]:
## Run these additional terms through WoRMS

revised_concepts = ['teuthida']
revised_name_id_dic, revised_name_dic, revised_id_dic, revised_class_dic = WoRMS.run_get_worms_from_scientific_name(revised_concepts, verbose_flag=True)

In [40]:
## Add values for revised names to original WoRMS output

name_id_dic.update(revised_name_id_dic)
name_dic.update(revised_name_dic)
id_dic.update(revised_id_dic)
class_dic.update(revised_class_dic)

In [41]:
## Create columns from WoRMS data

# Replace names that don't have a WoRMS match in scientificName with revised names
# (assigning the result back avoids relying on in-place replacement of a column selection)
converted['scientificName'] = converted['scientificName'].replace(VARS_to_WoRMS_dict)

# Create scientificNameID column with the same content as scientificName - strip to ensure no whitespace, lowercase
converted['scientificNameID'] = converted['scientificName'].str.strip().str.lower()

# Use dictionary to replace scientific names with name IDs
converted['scientificNameID'] = converted['scientificNameID'].replace(name_id_dic)

# Repeat to create taxonID
converted['taxonID'] = converted['scientificName'].str.strip().str.lower()
converted['taxonID'] = converted['taxonID'].replace(id_dic)

converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,datasetID,institutionCode,occurrenceID,scientificName,occurrenceRemarks,identifiedBy,...,decimalLongitude,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters,scientificNameID,taxonID
1,1989-04-27T17:54:50Z,Ventana_0043,ROV,Chuck Baxter,VARS,MBARI,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,parmaturus xaniurus,Parmaturus xaniurus,amberR,...,-122.029075,,44.0,34.234001,6.259,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0043-04,300,urn:lsid:marinespecies.org:taxname:282166,282166
12,1989-05-17T17:13:42Z,Ventana_0050,ROV,Chris Grech,VARS,MBARI,6198DB37-934F-478A-9820-251627551327,parmaturus,Parmaturus,amberR,...,-121.669329,,142.0,34.224998,6.996,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0050-02,300,urn:lsid:marinespecies.org:taxname:270301,270301
14,1989-09-12T22:01:07Z,Ventana_0076,ROV,Chuck Baxter,VARS,MBARI,7A96AB28-8188-425B-BB9F-92E9F35E12FF,galatheidae,Galatheidae,unknown,...,-121.997843,,226.0,34.132,7.93,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0076-18,300,urn:lsid:marinespecies.org:taxname:106733,106733
19,1989-05-10T16:39:10Z,Ventana_0049,ROV,Chris Harrold,VARS,MBARI,6684BDBA-4E42-4DE6-B346-C8F21EAD9E75,rathbunaster californicus,Rathbunaster californicus,unknown,...,-122.018691,,72.0,,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0049-01,300,urn:lsid:marinespecies.org:taxname:254844,254844
22,1989-05-10T16:39:10Z,Ventana_0049,ROV,Chris Harrold,VARS,MBARI,D09D6A8C-8DDA-4431-81AB-971872DE96A6,parmaturus xaniurus,Parmaturus xaniurus,amberR,...,-122.018691,,72.0,,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0049-01,300,urn:lsid:marinespecies.org:taxname:282166,282166
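Because the dictionary replacement leaves any name without a match untouched, an unmatched concept ends up with its own name in `scientificNameID` instead of an identifier. A quick check for that could look like the following sketch, which assumes every valid identifier has the `urn:lsid:marinespecies.org:taxname:<number>` form seen in the output above. The toy Series stands in for `converted['scientificNameID']`.

```python
import pandas as pd

# Pattern based on the identifiers shown in the output above
LSID_PATTERN = r'urn:lsid:marinespecies\.org:taxname:\d+'

ids = pd.Series([
    'urn:lsid:marinespecies.org:taxname:282166',
    'teuthoidea',  # a name that was never replaced by an identifier
])

# True where the whole value matches the LSID pattern
is_valid = ids.str.fullmatch(LSID_PATTERN)
print(ids[~is_valid].tolist())
```

An empty list would suggest that every record received a WoRMS identifier.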


In [43]:
## Replace scientificName with matched scientific names from WoRMS

converted['scientificName'] = converted['scientificName'].str.strip().str.lower()
converted['scientificName'] = converted['scientificName'].replace(name_dic)
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,datasetID,institutionCode,occurrenceID,scientificName,occurrenceRemarks,identifiedBy,...,decimalLongitude,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters,scientificNameID,taxonID
1,1989-04-27T17:54:50Z,Ventana_0043,ROV,Chuck Baxter,VARS,MBARI,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,Parmaturus xaniurus,amberR,...,-122.029075,,44.0,34.234001,6.259,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0043-04,300,urn:lsid:marinespecies.org:taxname:282166,282166
12,1989-05-17T17:13:42Z,Ventana_0050,ROV,Chris Grech,VARS,MBARI,6198DB37-934F-478A-9820-251627551327,Parmaturus,Parmaturus,amberR,...,-121.669329,,142.0,34.224998,6.996,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0050-02,300,urn:lsid:marinespecies.org:taxname:270301,270301
14,1989-09-12T22:01:07Z,Ventana_0076,ROV,Chuck Baxter,VARS,MBARI,7A96AB28-8188-425B-BB9F-92E9F35E12FF,Galatheidae,Galatheidae,unknown,...,-121.997843,,226.0,34.132,7.93,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0076-18,300,urn:lsid:marinespecies.org:taxname:106733,106733
19,1989-05-10T16:39:10Z,Ventana_0049,ROV,Chris Harrold,VARS,MBARI,6684BDBA-4E42-4DE6-B346-C8F21EAD9E75,Rathbunaster californicus,Rathbunaster californicus,unknown,...,-122.018691,,72.0,,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0049-01,300,urn:lsid:marinespecies.org:taxname:254844,254844
22,1989-05-10T16:39:10Z,Ventana_0049,ROV,Chris Harrold,VARS,MBARI,D09D6A8C-8DDA-4431-81AB-971872DE96A6,Parmaturus xaniurus,Parmaturus xaniurus,amberR,...,-122.018691,,72.0,,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0049-01,300,urn:lsid:marinespecies.org:taxname:282166,282166


In [44]:
## Create additional needed columns

converted['nameAccordingTo'] = 'WoRMS'
converted['occurrenceStatus'] = 'present'
converted['basisOfRecord'] = 'MachineObservation'

converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,datasetID,institutionCode,occurrenceID,scientificName,occurrenceRemarks,identifiedBy,...,salinity,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord
1,1989-04-27T17:54:50Z,Ventana_0043,ROV,Chuck Baxter,VARS,MBARI,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,Parmaturus xaniurus,amberR,...,34.234001,6.259,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0043-04,300,urn:lsid:marinespecies.org:taxname:282166,282166,WoRMS,present,MachineObservation
12,1989-05-17T17:13:42Z,Ventana_0050,ROV,Chris Grech,VARS,MBARI,6198DB37-934F-478A-9820-251627551327,Parmaturus,Parmaturus,amberR,...,34.224998,6.996,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0050-02,300,urn:lsid:marinespecies.org:taxname:270301,270301,WoRMS,present,MachineObservation
14,1989-09-12T22:01:07Z,Ventana_0076,ROV,Chuck Baxter,VARS,MBARI,7A96AB28-8188-425B-BB9F-92E9F35E12FF,Galatheidae,Galatheidae,unknown,...,34.132,7.93,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0076-18,300,urn:lsid:marinespecies.org:taxname:106733,106733,WoRMS,present,MachineObservation
19,1989-05-10T16:39:10Z,Ventana_0049,ROV,Chris Harrold,VARS,MBARI,6684BDBA-4E42-4DE6-B346-C8F21EAD9E75,Rathbunaster californicus,Rathbunaster californicus,unknown,...,,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0049-01,300,urn:lsid:marinespecies.org:taxname:254844,254844,WoRMS,present,MachineObservation
22,1989-05-10T16:39:10Z,Ventana_0049,ROV,Chris Harrold,VARS,MBARI,D09D6A8C-8DDA-4431-81AB-971872DE96A6,Parmaturus xaniurus,Parmaturus xaniurus,amberR,...,,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0049-01,300,urn:lsid:marinespecies.org:taxname:282166,282166,WoRMS,present,MachineObservation


In [45]:
## Assemble associatedMedia

associatedMedia = []

for occ_id in converted['occurrenceID'].unique():
    
    # Select data associated with that occurrenceID:
    selected = converted[converted['occurrenceID'] == occ_id]
    
    # Retrieve unique image and video files
    image_files = selected['image_url'].drop_duplicates()
    video_files = selected['video_uri'].drop_duplicates()
    
    # Remove any NaN values
    image_files = image_files.dropna()
    video_files = video_files.dropna()
    
    # Join image and video files
    media = pd.concat([image_files, video_files])
    
    # Create a string with all the urls, separated by ' | '
    url_str = ' | '.join(media)
    
    # Add to associatedMedia
    associatedMedia.append(url_str)

**Note:** The above can take some time depending on how many records must be processed. There's probably a better way to do it, but I'm not going to take the time to update this code right now.
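One candidate for a faster approach is to stack the image and video columns into long form and let `groupby` do the joining, which avoids filtering the whole dataframe once per occurrenceID. This is a sketch on toy data, not a tested replacement: the function name `build_associated_media` is made up, and unlike the loop it de-duplicates images and videos together rather than separately, which only matters if the same string appears in both columns.

```python
import numpy as np
import pandas as pd

def build_associated_media(df):
    """Return a Series of ' | '-joined media URLs indexed by occurrenceID."""
    images = df[['occurrenceID', 'image_url']].rename(columns={'image_url': 'media'})
    videos = df[['occurrenceID', 'video_uri']].rename(columns={'video_uri': 'media'})
    # Images are stacked before videos so they come first within each occurrence
    long_form = (
        pd.concat([images, videos], ignore_index=True)
        .dropna(subset=['media'])
        .drop_duplicates()
    )
    return long_form.groupby('occurrenceID', sort=False)['media'].agg(' | '.join)

# Toy data: A has two images and one video, B has only a video, C has no media
toy = pd.DataFrame({
    'occurrenceID': ['A', 'A', 'B', 'C'],
    'image_url': ['img1', 'img2', np.nan, np.nan],
    'video_uri': ['vid1', 'vid1', 'vid2', np.nan],
})

media_by_id = build_associated_media(toy)
deduplicated = toy.drop_duplicates(subset='occurrenceID', keep='first').copy()
# Mapping by occurrenceID means the result does not depend on row order
deduplicated['associatedMedia'] = deduplicated['occurrenceID'].map(media_by_id).fillna('')
print(deduplicated[['occurrenceID', 'associatedMedia']])
```

Mapping on occurrenceID also removes the need for the list and the dataframe to stay in the same order, which the loop-based version relies on.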

In [46]:
## Add to df

# First, need to remove rows with duplicate occurrenceIDs
converted = converted.drop_duplicates(subset='occurrenceID', keep="first")

# Add associatedMedia
converted['associatedMedia'] = associatedMedia
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,datasetID,institutionCode,occurrenceID,scientificName,occurrenceRemarks,identifiedBy,...,temperatureInCelsius,image_url,video_uri,coordinateUncertaintyInMeters,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,associatedMedia
1,1989-04-27T17:54:50Z,Ventana_0043,ROV,Chuck Baxter,VARS,MBARI,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,Parmaturus xaniurus,amberR,...,6.259,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0043-04,300,urn:lsid:marinespecies.org:taxname:282166,282166,WoRMS,present,MachineObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
12,1989-05-17T17:13:42Z,Ventana_0050,ROV,Chris Grech,VARS,MBARI,6198DB37-934F-478A-9820-251627551327,Parmaturus,Parmaturus,amberR,...,6.996,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0050-02,300,urn:lsid:marinespecies.org:taxname:270301,270301,WoRMS,present,MachineObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
14,1989-09-12T22:01:07Z,Ventana_0076,ROV,Chuck Baxter,VARS,MBARI,7A96AB28-8188-425B-BB9F-92E9F35E12FF,Galatheidae,Galatheidae,unknown,...,7.93,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0076-18,300,urn:lsid:marinespecies.org:taxname:106733,106733,WoRMS,present,MachineObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
19,1989-05-10T16:39:10Z,Ventana_0049,ROV,Chris Harrold,VARS,MBARI,6684BDBA-4E42-4DE6-B346-C8F21EAD9E75,Rathbunaster californicus,Rathbunaster californicus,unknown,...,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0049-01,300,urn:lsid:marinespecies.org:taxname:254844,254844,WoRMS,present,MachineObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
22,1989-05-10T16:39:10Z,Ventana_0049,ROV,Chris Harrold,VARS,MBARI,D09D6A8C-8DDA-4431-81AB-971872DE96A6,Parmaturus xaniurus,Parmaturus xaniurus,amberR,...,,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,urn:tid:mbari.org:V0049-01,300,urn:lsid:marinespecies.org:taxname:282166,282166,WoRMS,present,MachineObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...


In [47]:
## Save columns for MeasurementOrFact file

mof = converted[['occurrenceID', 'dissolvedOxygenInMLPerL', 'pressureInPsi', 'salinity', 'temperatureInCelsius']]

In [48]:
## Drop extra columns

converted = converted.drop(['image_url', 'video_uri', 'dissolvedOxygenInMLPerL', 'pressureInPsi', 'salinity', 'temperatureInCelsius'], axis=1)
converted.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,datasetID,institutionCode,occurrenceID,scientificName,occurrenceRemarks,identifiedBy,...,verbatimDepth,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,associatedMedia
1,1989-04-27T17:54:50Z,Ventana_0043,ROV,Chuck Baxter,VARS,MBARI,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,Parmaturus xaniurus,amberR,...,415.0,36.612168,-122.029075,300,urn:lsid:marinespecies.org:taxname:282166,282166,WoRMS,present,MachineObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
12,1989-05-17T17:13:42Z,Ventana_0050,ROV,Chris Grech,VARS,MBARI,6198DB37-934F-478A-9820-251627551327,Parmaturus,Parmaturus,amberR,...,368.5,36.103195,-121.669329,300,urn:lsid:marinespecies.org:taxname:270301,270301,WoRMS,present,MachineObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
14,1989-09-12T22:01:07Z,Ventana_0076,ROV,Chuck Baxter,VARS,MBARI,7A96AB28-8188-425B-BB9F-92E9F35E12FF,Galatheidae,Galatheidae,unknown,...,329.2,36.699518,-121.997843,300,urn:lsid:marinespecies.org:taxname:106733,106733,WoRMS,present,MachineObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
19,1989-05-10T16:39:10Z,Ventana_0049,ROV,Chris Harrold,VARS,MBARI,6684BDBA-4E42-4DE6-B346-C8F21EAD9E75,Rathbunaster californicus,Rathbunaster californicus,unknown,...,353.1,36.607831,-122.018691,300,urn:lsid:marinespecies.org:taxname:254844,254844,WoRMS,present,MachineObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
22,1989-05-10T16:39:10Z,Ventana_0049,ROV,Chris Harrold,VARS,MBARI,D09D6A8C-8DDA-4431-81AB-971872DE96A6,Parmaturus xaniurus,Parmaturus xaniurus,amberR,...,353.1,36.607831,-122.018691,300,urn:lsid:marinespecies.org:taxname:282166,282166,WoRMS,present,MachineObservation,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...


In [49]:
## Reorder columns

converted = converted[['eventID', 'eventDate', 'samplingProtocol', 'recordedBy', 'datasetID', 'institutionCode', 'occurrenceID', 'scientificName', 'scientificNameID', 'taxonID', 
                       'nameAccordingTo', 'occurrenceStatus', 'basisOfRecord', 'identifiedBy', 'occurrenceRemarks', 'decimalLatitude', 'decimalLongitude', 'coordinateUncertaintyInMeters',
                       'minimumDepthInMeters', 'maximumDepthInMeters', 'verbatimDepth', 'associatedMedia']]
converted.head()

Unnamed: 0,eventID,eventDate,samplingProtocol,recordedBy,datasetID,institutionCode,occurrenceID,scientificName,scientificNameID,taxonID,...,basisOfRecord,identifiedBy,occurrenceRemarks,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,verbatimDepth,associatedMedia
1,Ventana_0043,1989-04-27T17:54:50Z,ROV,Chuck Baxter,VARS,MBARI,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,Parmaturus xaniurus,urn:lsid:marinespecies.org:taxname:282166,282166,...,MachineObservation,amberR,Parmaturus xaniurus,36.612168,-122.029075,300,412.8,417.2,415.0,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
12,Ventana_0050,1989-05-17T17:13:42Z,ROV,Chris Grech,VARS,MBARI,6198DB37-934F-478A-9820-251627551327,Parmaturus,urn:lsid:marinespecies.org:taxname:270301,270301,...,MachineObservation,amberR,Parmaturus,36.103195,-121.669329,300,366.3,370.7,368.5,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
14,Ventana_0076,1989-09-12T22:01:07Z,ROV,Chuck Baxter,VARS,MBARI,7A96AB28-8188-425B-BB9F-92E9F35E12FF,Galatheidae,urn:lsid:marinespecies.org:taxname:106733,106733,...,MachineObservation,unknown,Galatheidae,36.699518,-121.997843,300,327.0,331.4,329.2,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
19,Ventana_0049,1989-05-10T16:39:10Z,ROV,Chris Harrold,VARS,MBARI,6684BDBA-4E42-4DE6-B346-C8F21EAD9E75,Rathbunaster californicus,urn:lsid:marinespecies.org:taxname:254844,254844,...,MachineObservation,unknown,Rathbunaster californicus,36.607831,-122.018691,300,350.9,355.3,353.1,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...
22,Ventana_0049,1989-05-10T16:39:10Z,ROV,Chris Harrold,VARS,MBARI,D09D6A8C-8DDA-4431-81AB-971872DE96A6,Parmaturus xaniurus,urn:lsid:marinespecies.org:taxname:282166,282166,...,MachineObservation,amberR,Parmaturus xaniurus,36.607831,-122.018691,300,350.9,355.3,353.1,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...


In [50]:
## Save occurrence file

converted.to_csv('VARS_1989_converted_20220104.csv', index=False, na_rep='NaN')

### Build MeasurementOrFact file

In [51]:
## Add columns by occurrenceID

mof.head()

Unnamed: 0,occurrenceID,dissolvedOxygenInMLPerL,pressureInPsi,salinity,temperatureInCelsius
1,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,,44.0,34.234001,6.259
12,6198DB37-934F-478A-9820-251627551327,,142.0,34.224998,6.996
14,7A96AB28-8188-425B-BB9F-92E9F35E12FF,,226.0,34.132,7.93
19,6684BDBA-4E42-4DE6-B346-C8F21EAD9E75,,72.0,,
22,D09D6A8C-8DDA-4431-81AB-971872DE96A6,,72.0,,


In [52]:
## Convert to long format

mof_long = pd.melt(mof, id_vars='occurrenceID', var_name='measurementType', value_name='measurementValue')
mof_long.head()

Unnamed: 0,occurrenceID,measurementType,measurementValue
0,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,dissolvedOxygenInMLPerL,
1,6198DB37-934F-478A-9820-251627551327,dissolvedOxygenInMLPerL,
2,7A96AB28-8188-425B-BB9F-92E9F35E12FF,dissolvedOxygenInMLPerL,
3,6684BDBA-4E42-4DE6-B346-C8F21EAD9E75,dissolvedOxygenInMLPerL,
4,D09D6A8C-8DDA-4431-81AB-971872DE96A6,dissolvedOxygenInMLPerL,
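
The preview above shows only `dissolvedOxygenInMLPerL` rows because `pd.melt` stacks the value columns one after another: all rows for the first measurement column come first, then all rows for the next. The toy example below illustrates this ordering. The frame `wide` and its values are made up for illustration and are not VARS data.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format frame: one row per occurrence, one column per measurement
wide = pd.DataFrame({
    'occurrenceID': ['occ-1', 'occ-2'],
    'salinity': [34.2, np.nan],
    'temperatureInCelsius': [6.3, 7.9],
})

# Same call shape as the notebook cell: one row per (occurrence, measurement) pair
long = pd.melt(wide, id_vars='occurrenceID', var_name='measurementType', value_name='measurementValue')

# Rows are grouped by measurement column, in the order the columns appear in `wide`
print(long)
```

Each occurrence therefore appears once per measurement column, so the long frame has (number of occurrences) x (number of measurement columns) rows, including rows whose value is NaN.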


Some rows have salinity = 0. That is not a plausible value, so I'm assuming salinity is unavailable for these records and changing them to NaN.

In [54]:
## Change salinity = 0 to NaN

mof_long.loc[(mof_long['measurementType'] == 'salinity') & (mof_long['measurementValue'] == 0), 'measurementValue'] = np.nan

In [55]:
## Round measurement values to two decimal places

mof_long['measurementValue'] = mof_long['measurementValue'].round(2)

In [56]:
## Change measurementType names

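# Assign the result back rather than calling replace(inplace=True) on the selected column,
# which does not reliably update mof_long in recent pandas versions
mof_long['measurementType'] = mof_long['measurementType'].replace({
    'dissolvedOxygenInMLPerL':'dissolvedOxygen',
    'pressureInPsi':'pressure',
    'temperatureInCelsius':'temperature'
})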

In [57]:
## Add measurementUnit

# Look up the unit for each (renamed) measurementType
mof_long['measurementUnit'] = mof_long['measurementType'].map({
    'dissolvedOxygen': 'mL per L seawater',
    'pressure': 'psi',
    'salinity': 'psu',
    'temperature': 'celsius'
})

mof_long.head()

Unnamed: 0,occurrenceID,measurementType,measurementValue,measurementUnit
0,A916C1F3-4C1A-45A1-B835-A2F0C7517D56,dissolvedOxygen,,mL per L seawater
1,6198DB37-934F-478A-9820-251627551327,dissolvedOxygen,,mL per L seawater
2,7A96AB28-8188-425B-BB9F-92E9F35E12FF,dissolvedOxygen,,mL per L seawater
3,6684BDBA-4E42-4DE6-B346-C8F21EAD9E75,dissolvedOxygen,,mL per L seawater
4,D09D6A8C-8DDA-4431-81AB-971872DE96A6,dissolvedOxygen,,mL per L seawater


In [58]:
## Save

mof_long.to_csv('VARS_1989_MoF_20220104.csv', index=False, na_rep='NaN')

### Remaining issues

1. Metadata
2. measurementType: should these use standard vocabulary terms?
3. NaN values were not dropped from the MoF file
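
For item 3, one possible approach is to drop rows with a missing `measurementValue` before saving. This is a minimal sketch, not something that has been run against the VARS data: the small `mof_long` frame below is a made-up stand-in with the same columns as the real one, and `mof_clean` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for mof_long (illustrative values only, not real VARS data)
mof_long = pd.DataFrame({
    'occurrenceID': ['occ-1', 'occ-1', 'occ-2', 'occ-2'],
    'measurementType': ['salinity', 'temperature', 'salinity', 'temperature'],
    'measurementValue': [34.23, 6.26, np.nan, np.nan],
    'measurementUnit': ['psu', 'celsius', 'psu', 'celsius'],
})

# Keep only rows that actually carry a measurement
mof_clean = mof_long.dropna(subset=['measurementValue']).reset_index(drop=True)

print(len(mof_long), '->', len(mof_clean))
```

If this were applied before the `to_csv` call, the `na_rep='NaN'` argument would no longer affect the `measurementValue` column, since no missing values would remain in it.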