## VARS Darwin Core conversion

Resources:
- https://dwc.tdwg.org/terms/
- https://tools.gbif.org/dwca-validator/extension.do?id=dwc:Occurrence#Event
- https://www.mbari.org/products/research-software/video-annotation-and-reference-system-vars/query-interface/advanced-user-guide/

In [1]:
## Imports

import pandas as pd
import numpy as np

from datetime import datetime # for handling dates
import pytz # for handling time zones

import urllib.request, urllib.parse, json # for dealing with WoRMS API and output

In [2]:
## Load csv

path = ''
filename = 'VARS_DwC_conversion_practice_200403.csv'
data = pd.read_csv(path+filename)

data.head()

Unnamed: 0,imaged_moment_uuid,index_elapsed_time_millis,index_recorded_timestamp,index_timecode,observation_uuid,activity,concept,duration_millis,observation_group,observation_timestamp,...,video_description,video_duration_millis,video_name,video_start_timestamp,camera_id,video_sequence_description,video_sequence_name,chief_scientist,dive_number,camera_platform
0,97BD5895-9489-478B-8797-6961D8A770D2,,2001-03-19 22:24:12,04:00:43:27,1040A0BF-2D50-45D2-9C96-35CC80708939,cruise,Dosidicus gigas,,ROV,2010-10-25 22:10:05.150000,...,,5529000.0,T0265-04,2001-03-19 20:57:56,Tiburon,,Tiburon 0265,Bruce Robison,Tiburon 0265,Tiburon
1,97BD5895-9489-478B-8797-6961D8A770D2,,2001-03-19 22:24:12,04:00:43:27,1040A0BF-2D50-45D2-9C96-35CC80708939,cruise,Dosidicus gigas,,ROV,2010-10-25 22:10:05.150000,...,,5529000.0,T0265-04,2001-03-19 20:57:56,Tiburon,,Tiburon 0265,Bruce Robison,Tiburon 0265,Tiburon
2,97BD5895-9489-478B-8797-6961D8A770D2,,2001-03-19 22:24:12,04:00:43:27,1040A0BF-2D50-45D2-9C96-35CC80708939,cruise,Dosidicus gigas,,ROV,2010-10-25 22:10:05.150000,...,,5529000.0,T0265-04,2001-03-19 20:57:56,Tiburon,,Tiburon 0265,Bruce Robison,Tiburon 0265,Tiburon
3,BE6C5C5B-8B04-45F0-BD48-C3EBE187B2EB,,2004-05-03 17:27:02,03:24:58:15,A3DE4416-5B67-4A10-A6BC-AE9A99552A13,cruise,Dosidicus gigas,,ROV,2004-05-03 18:29:03,...,,3600000.0,T0666-04,2004-05-03 17:12:42,Tiburon,,Tiburon 0666,David Clague,Tiburon 0666,Tiburon
4,DBCD3F40-E533-4AD0-9541-A87932C9EC78,,2010-02-25 18:58:34,00:38:45:13,BF43A883-1DE2-439A-8F55-F7F215FE2D4F,cruise,Dosidicus gigas,,ROV,2010-04-14 23:40:00.360000,...,,3596000.0,V3527-01HD,2010-02-25 18:40:18,Ventana,,Ventana 3527,Linda Kuhnz,Ventana 3527,Ventana


In [3]:
## List all columns

data.columns

Index(['imaged_moment_uuid', 'index_elapsed_time_millis',
       'index_recorded_timestamp', 'index_timecode', 'observation_uuid',
       'activity', 'concept', 'duration_millis', 'observation_group',
       'observation_timestamp', 'observer', 'image_reference_uuid',
       'image_description', 'image_format', 'image_height', 'image_width',
       'image_url', 'link_name', 'link_value', 'to_concept',
       'association_mime_type', 'associations', 'altitude',
       'coordinate_reference_system', 'depth_meters', 'latitude', 'longitude',
       'oxygen_ml_per_l', 'phi', 'xyz_position_units', 'pressure_dbar', 'psi',
       'salinity', 'temperature_celsius', 'theta', 'x', 'y', 'z',
       'light_transmission', 'video_reference_uuid', 'audio_codec',
       'video_container', 'video_reference_description', 'frame_rate',
       'video_height', 'video_sha512', 'video_size_bytes', 'video_uri',
       'video_codec', 'video_width', 'video_description',
       'video_duration_millis', 'video_nam

### What do all these columns mean?

I don't think all of them are in the user guide, even the advanced user guide. They also don't align with the columns listed in Brian's SQL code, for some reason. **They don't align with Brian's code because that code was written for a legacy version of the database.**

Below:
- **bold** = term or important observation
- <span style="color:red">**red**</span> = questions still to ask
- <span style="color:orange">**orange**</span> = things I still don't understand but it probably doesn't matter

imaged_moment_uuid = unique identifier for the imaged_moments table. All information about images and annotations is stored in the M3_ANNOTATIONS database.<br>
index_elapsed_time_millis = how long the video had been running when the image was captured, in milliseconds <br>
index_recorded_timestamp = **Recorded Date**, the time in UTC when the image was captured on camera <br>
index_timecode = **Tape Time Code**, the hours, minutes, seconds and frames since the beginning of the dive (00:00:00:00) in the format (HH:MM:SS:FF); used for tape <br>
observation_uuid = the unique id of the annotation, unique identifier for the observations table. **For one imaged moment, there can be multiple observations.** <br>
activity = **Camera Direction**, describes what the ROV was doing when the image was taken, possible values: 'cruise', 'descend', 'ascend', 'transect', 'stationary', 'unspecified', 'diel transect', nan <br>
concept = **Concept Name**, key terms referring to organisms, geologic features, sampling devices/scientific equipment, and marine debris - I assume we're interested in organisms only <br>
duration_millis = how long the concept was observed for, starting from index_recorded_timestamp. <br>
observation_group = only has a value of ROV in this sample data set <br>
observation_timestamp = **Observation Date**, the date the annotation was created in UTC, not necessarily the same as Recorded Date <br>
observer = **Observer**, in theory the MBARI username of the person who created the annotation, but in practice a mix of usernames, full names, partial names, different capitalizations, etc. <br>
image_reference_uuid = unique id for the image_reference table, which contains info about the size, type of image <br>
image_description = each imaged_moment is saved at least twice in compressed and uncompressed format; takes on values of 'compressed image with overlay', 'source image', 'uncompressed image', 'compressed image', nan <br>
image_format = format of associated image file: 'image/jpg', 'image/png', nan. One of these should correspond to compressed and one to uncompressed description. <br>
image_height = height of associated image file in pixels: 0, 1080, nan. <span style="color:red">**What does 0 mean versus nan in this context?**</span> <br>
image_width = width of associated image file in pixels: 0, 1920, nan. <span style="color:red">**What does 0 mean versus nan in this context?**</span> <br>
image_url = url where the associated image is stored online. <span style="color:red">**These links are available to everyone, including outside MBARI**</span> <br>
link_name = to the best of my understanding, link_name and link_value are like a dictionary key, value pair. The link_name indicates what kind of information is held in link_value. Some of these are clearly descriptors of the animal or its behavior, others are confusing (e.g. "on"). <span style="color:red">**Difference between nil and nan?**</span> **Each observation can have multiple associations (i.e. link_names, values, etc.)** <br>
link_value = value of the information described by link_name. <br>
to_concept = points to 'self' or to another concept. So, for example, if the annotation is for a red octopus, the concept might be the octopus species name, the link_name might be 'color', the to_concept might be 'self', and the link_value might be 'red'. <span style="color:red">**Difference between nil and nan?**</span> <br>
<span style="color:orange">association_mime_type</span> - either 'text/plain' or nan, possibly the data type of the link_value? <br>
<span style="color:orange">associations</span> <br>
altitude = how far the ROV was off the bottom? In meters? <span style="color:orange">**What do the negative values mean?**</span> <br>
coordinate_reference_system = all nan here, **Assume WGS84, but Brian wasn't sure and wanted to double check.** <br>
depth_meters = depth below the surface in meters, positive number, ranges from ~6 to ~3500 here. <br>
latitude = latitude where the image was taken in decimal degrees <br>
longitude = longitude where the image was taken in decimal degrees <br>
oxygen_ml_per_l = **Oxygen**, mL of dissolved oxygen per L seawater, includes nan values, ranges from -1.5 to 15 <br>
<span style="color:orange">phi</span> <br>
<span style="color:orange">xyz_position_units</span> - all nan for these data <br>
pressure_dbar = the pressure measured in decibars at the time the image was taken <br>
psi = <span style="color:red">**The same pressure converted to psi?**</span> <br>
salinity = **Salinity**, the salinity at the time the image was taken, calculated from conductivity and pressure measurements, or nan <br>
temperature_celsius = **Temperature**, the water temperature in degrees C when the image was taken, or nan <br>
<span style="color:orange">theta</span> <br>
<span style="color:orange">x</span> <br>
<span style="color:orange">y</span> <br>
<span style="color:orange">z</span> <br>
light_transmission = **Light**, percent light transmitted through the water column when the image was taken, or nan. **Patrick suggests not including this as a MeasurementOrFact, because the ROVs generate light and so it's unclear what's actually being measured.** <br>
video_reference_uuid = unique id in video_references table, which holds information about the size, format, etc. of video files. All video data are stored in the M3_VIDEO_ASSETS database. <br>
<span style="color:orange">audio_codec</span> - all nan for these data <br>
video_container = whether the original video exists on tape or digitally, options are 'tape', 'video/quicktime' <br>
<span style="color:orange">video_reference_description</span> - appears to be the words "Tape loaded from VARS on" plus a datetime <br>
frame_rate = frame rate of the camera, either 29.97 or 0 <span style="color:orange">**What does 0 mean in this context? It's a still image?**</span> <br>
video_height = height of the video frame in pixels, 1080 <br>
<span style="color:orange">video_sha512</span> <br>
video_size_bytes = size of the video file in bytes, 0.00000000e+00, 2.62291551e+10, 2.67221996e+10; larger videos are probably quicktime files <br>
video_uri = location of video file on MBARI's servers, or a pointer to where the tape is stored. If you want to open a video on the web, make sure you're VPN'ed into MBARI and choose an mp4 rather than a quicktime file. **These files are not currently publicly accessible.** <br>
<span style="color:orange">video_codec</span> - all nan for these data <br>
video_width = width of the video frame in pixels, 1920 <br>
<span style="color:orange">video_description</span> - all nan for these data <br>
video_duration_millis = total duration of the video in milliseconds. **Multiple videos are taken per dive.** <br>
<span style="color:orange">video_name</span> <br>
video_start_timestamp = time when the video started <span style="color:orange">**in local time?**</span> <br>
camera_id = **ROV name**, 'Tiburon', 'Ventana', or 'Doc Ricketts' <br>
<span style="color:orange">video_sequence_description</span> - all nan for these data <br>
video_sequence_name = ROV name plus a 4-digit integer (dive number); indicates which video sequence/dive a particular video came from. **Use this instead of dive_number, it is more up to date.** <br>
chief_scientist = **Chief Scientist**, the full name of PI for whom the dive was primarily conducted, maiden name sometimes included in parentheses <br>
dive_number = **Dive Number**, ROV name plus a 4-digit integer uniquely identifying which dive a video/image was taken on, **Out of date, use video_sequence_name** <br>
camera_platform = **ROV name**, 'Tiburon', 'Ventana', 'Doc Ricketts' or nan, <span style="color:red">**How is this different than camera_id? What does nan mean in this context?**</span>
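
Since one observation can repeat across rows (multiple images, multiple associations), it helps to see which columns actually vary within an observation. The sketch below is only an illustration on a tiny made-up frame: the `obs-1`/`obs-2` ids, file names, and link names are placeholders, not real VARS values. With the real export, `sample` would be replaced by `data`.

```python
import pandas as pd

# Tiny stand-in for the VARS export; ids and file names are made up.
sample = pd.DataFrame({
    "observation_uuid": ["obs-1", "obs-1", "obs-1", "obs-2"],
    "concept": ["Dosidicus gigas"] * 4,
    "image_url": ["a.png", "a.jpg", "a.png", None],
    "link_name": ["color", "color", "on", None],
})

# Number of rows each observation occupies in the export.
rows_per_observation = sample.groupby("observation_uuid").size()

# Number of distinct non-null values per column within each observation.
# Columns with a count above 1 are the ones causing the repeated rows.
distinct_per_observation = sample.groupby("observation_uuid").nunique()

print(rows_per_observation)
print(distinct_per_observation)
```

Here `obs-1` takes three rows because it has two image links and two link names, while `concept` stays constant within the observation.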

### How might these map to DwC terms?

**Event** = dive <br>
**Occurrence** = annotated observation

We will create two files for submission to OBIS, one containing event information and one containing occurrence information. Then, we can combine these into a single file as needed for compatibility with ERDDAP.
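
A minimal sketch of the "combine into a single file" step, assuming the two tables share an `eventID` column. The dive names and chief scientists come from the sample rows above; the occurrence ids are placeholders. I don't yet know exactly what layout ERDDAP needs, so this only shows the join itself.

```python
import pandas as pd

# One row per dive.
event = pd.DataFrame({
    "eventID": ["Tiburon 0265", "Ventana 3527"],
    "recordedBy": ["Bruce Robison", "Linda Kuhnz"],
})

# One row per annotated observation (occurrence ids are placeholders).
occurrence = pd.DataFrame({
    "occurrenceID": ["obs-1", "obs-2", "obs-3"],
    "eventID": ["Tiburon 0265", "Tiburon 0265", "Ventana 3527"],
    "scientificName": ["Dosidicus gigas"] * 3,
})

# Attach the dive-level fields to every occurrence. validate="many_to_one"
# raises an error if an eventID appears more than once in the event table.
flat = occurrence.merge(event, on="eventID", how="left", validate="many_to_one")

print(flat)
```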

#### Event file contents
index_recorded_timestamp = **eventDate** in UTC, ISO 8601:2004. Extract only date. <br>
observation_group = 'ROV', some or all of **samplingProtocol**? <br>
video_sequence_name = **eventID** <br>
chief_scientist = **recordedBy** <br>
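
A rough sketch of how the event mapping above could look in pandas, using a few values copied from the sample rows. Two things here are my assumptions: that `index_recorded_timestamp` is already in UTC, and that keeping the first row seen per dive is good enough for eventDate (a dive that crosses midnight UTC would need a different rule).

```python
import pandas as pd

# Stand-in rows with only the columns the event file needs.
sample = pd.DataFrame({
    "video_sequence_name": ["Tiburon 0265", "Tiburon 0265", "Ventana 3527"],
    "index_recorded_timestamp": [
        "2001-03-19 22:24:12",
        "2001-03-19 22:24:12",
        "2010-02-25 18:58:34",
    ],
    "observation_group": ["ROV", "ROV", "ROV"],
    "chief_scientist": ["Bruce Robison", "Bruce Robison", "Linda Kuhnz"],
})

# Parse the timestamps, treating them as UTC (assumption).
timestamps = pd.to_datetime(sample["index_recorded_timestamp"], utc=True)

event = (
    pd.DataFrame({
        "eventID": sample["video_sequence_name"],
        "eventDate": timestamps.dt.strftime("%Y-%m-%d"),  # date only
        "samplingProtocol": sample["observation_group"],
        "recordedBy": sample["chief_scientist"],
    })
    .drop_duplicates(subset="eventID")  # keep the first row per dive
    .reset_index(drop=True)
)

print(event)
```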

#### Occurrence file contents
index_recorded_timestamp = **eventDate** in UTC, ISO 8601:2004 <br>
observation_uuid = **occurrenceID**. Must be mindful that there may be multiple rows with the same observation_uuid due to multiple available images, multiple available videos, and/or multiple associations if using data from the associations table. <span style="color:red">Note that, at least for the first go, I'll probably leave out information from the associations table.</span> <br>
concept = **scientificName, scientificNameID, taxonID, nameAccordingToID, identificationReferences, occurrenceStatus, basisOfRecord** <br>
observer = **identifiedBy**. Also assign **institutionCode** as MBARI. <br>
image_url = **associatedMedia**. Note that we may need to choose which image to link to, or there may be a way to link to multiple images and/or videos. <br>
depth_meters = **minimumDepthInMeters, maximumDepthInMeters**. <br>
latitude = **decimalLatitude** <br>
longitude = **decimalLongitude** <br>
oxygen_ml_per_l = a MeasurementOrFact, **measurementType, measurementValue, measurementAccuracy, measurementUnits**. Note that if there is more than one MeasurementOrFact to be included, create sensible column names for each and then designate those columns as MeasurementOrFact columns in the metadata file. <br>
psi, salinity, temperature_celsius, light_transmission = other candidates for MeasurementOrFact <br>
video_uri = **associatedMedia**. Again, we may need to figure out how to deal with multiple media links, and which to provide if not all of them. <br>
video_sequence_name = **eventID** <br>
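
A sketch of the occurrence mapping, including one possible way to handle the repeated observation_uuid rows: keep one row per observation and join all of its image links into a single associatedMedia value separated by " | ", which is the list separator Darwin Core recommends. The observer names, coordinates, depths, and URLs below are made up for illustration, and the taxonomy fields (scientificNameID and friends) are left out because they depend on the WoRMS lookup.

```python
import pandas as pd

# Made-up stand-in rows; obs-1 repeats because it has two images.
sample = pd.DataFrame({
    "observation_uuid": ["obs-1", "obs-1", "obs-2"],
    "index_recorded_timestamp": [
        "2001-03-19 22:24:12",
        "2001-03-19 22:24:12",
        "2010-02-25 18:58:34",
    ],
    "concept": ["Dosidicus gigas"] * 3,
    "observer": ["observer_a", "observer_a", "observer_b"],
    "image_url": ["https://example.org/a.png", "https://example.org/a.jpg", None],
    "depth_meters": [512.3, 512.3, 1480.0],
    "latitude": [36.7, 36.7, 36.6],
    "longitude": [-122.0, -122.0, -122.4],
    "video_sequence_name": ["Tiburon 0265", "Tiburon 0265", "Ventana 3527"],
})

# All image links for an observation, joined into one string.
media = (
    sample.dropna(subset=["image_url"])
    .groupby("observation_uuid")["image_url"]
    .agg(" | ".join)
)

# One row per observation; timestamps are treated as UTC (assumption).
first_rows = sample.drop_duplicates(subset="observation_uuid")
timestamps = pd.to_datetime(first_rows["index_recorded_timestamp"], utc=True)

occurrence = pd.DataFrame({
    "occurrenceID": first_rows["observation_uuid"],
    "eventID": first_rows["video_sequence_name"],
    "eventDate": timestamps.dt.strftime("%Y-%m-%dT%H:%M:%SZ"),
    "scientificName": first_rows["concept"],
    "identifiedBy": first_rows["observer"],
    "institutionCode": "MBARI",
    "decimalLatitude": first_rows["latitude"],
    "decimalLongitude": first_rows["longitude"],
    "minimumDepthInMeters": first_rows["depth_meters"],
    "maximumDepthInMeters": first_rows["depth_meters"],
    "associatedMedia": first_rows["observation_uuid"].map(media),
}).reset_index(drop=True)

print(occurrence)
```

Observations without any image end up with a missing associatedMedia value rather than an empty string.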


-------

#### Columns that can be excluded from data set?
imaged_moment_uuid <br>
index_elapsed_time_millis <br>
index_timecode/**Tape Time Code** <br>
image_reference_uuid - <span style="color:red">Could be **eventID** if we pick the image/video to be the event rather than the dive</span> <br>
altitude <br>
coordinate_reference_system <br>
video_reference_uuid - <span style="color:red">Could be **eventID** if we pick the image/video to be the event rather than the dive</span> <br>
audio_codec <br>
video_reference_description <br>
video_sha512 <br>
video_codec <br>
video_description <br>
video_name <br>
video_start_timestamp <br>
camera_id <br>
video_sequence_description <br>
observation_timestamp

#### Not sure whether or how these should be included:
image_description <br>
image_height <br>
image_width - <span style="color:red">Can height and width be part of the metadata somehow?</span> <br>
link_value <br>
to_concept <br>
association_mime_type <br>
associations <br>
phi <br>
xyz_position_units <br>
theta <br>
x <br>
y <br>
z <br>
video_container <br>
frame_rate <br>
video_size_bytes <br>
video_height <br>
video_width - <span style="color:red">Can frame_rate, size, height and width be part of the metadata somehow?</span><br>
video_sequence_description - <span style="color:red">Pointer to other records that might be relevant?</span> <br>

### Questions

#### For Brian:
1. Double check understanding of what columns mean - especially links, associations, to_concepts.
2. Navigating the database - what is the cursor?
3. Why are the column names in your SQL for removing embargoed records different from the ones I see in the annotations table?
4. Long view: ultimately, best way to pull down data a chunk at a time for processing?

#### For Patrick:
1. parentEvent/event/occurrence structure - dive/video or image/annotation versus dive/annotation?
2. eventDate and occurrenceDate - the latter seems more relevant, but doesn't exist.
3. how to include information about the observer, chief scientist, MBARI?
4. can there be multiple MeasurementOrFacts per record? Say, if you wanted temperature and salinity? Seems like these columns would not have unique names?
5. 