## VARS Darwin Core conversion

Resources:
- https://dwc.tdwg.org/terms/
- https://www.mbari.org/products/research-software/video-annotation-and-reference-system-vars/query-interface/advanced-user-guide/

In [1]:
## Imports

import pandas as pd
import numpy as np

from datetime import datetime # for handling dates
import pytz # for handling time zones

import urllib.request, urllib.parse, json # for dealing with WoRMS API and output

In [2]:
## Load csv

path = ''
filename = 'VARS_DwC_conversion_practice_200403.csv'
data = pd.read_csv(path+filename)

data.head()

Unnamed: 0,imaged_moment_uuid,index_elapsed_time_millis,index_recorded_timestamp,index_timecode,observation_uuid,activity,concept,duration_millis,observation_group,observation_timestamp,...,video_description,video_duration_millis,video_name,video_start_timestamp,camera_id,video_sequence_description,video_sequence_name,chief_scientist,dive_number,camera_platform
0,97BD5895-9489-478B-8797-6961D8A770D2,,2001-03-19 22:24:12,04:00:43:27,1040A0BF-2D50-45D2-9C96-35CC80708939,cruise,Dosidicus gigas,,ROV,2010-10-25 22:10:05.150000,...,,5529000.0,T0265-04,2001-03-19 20:57:56,Tiburon,,Tiburon 0265,Bruce Robison,Tiburon 0265,Tiburon
1,97BD5895-9489-478B-8797-6961D8A770D2,,2001-03-19 22:24:12,04:00:43:27,1040A0BF-2D50-45D2-9C96-35CC80708939,cruise,Dosidicus gigas,,ROV,2010-10-25 22:10:05.150000,...,,5529000.0,T0265-04,2001-03-19 20:57:56,Tiburon,,Tiburon 0265,Bruce Robison,Tiburon 0265,Tiburon
2,97BD5895-9489-478B-8797-6961D8A770D2,,2001-03-19 22:24:12,04:00:43:27,1040A0BF-2D50-45D2-9C96-35CC80708939,cruise,Dosidicus gigas,,ROV,2010-10-25 22:10:05.150000,...,,5529000.0,T0265-04,2001-03-19 20:57:56,Tiburon,,Tiburon 0265,Bruce Robison,Tiburon 0265,Tiburon
3,BE6C5C5B-8B04-45F0-BD48-C3EBE187B2EB,,2004-05-03 17:27:02,03:24:58:15,A3DE4416-5B67-4A10-A6BC-AE9A99552A13,cruise,Dosidicus gigas,,ROV,2004-05-03 18:29:03,...,,3600000.0,T0666-04,2004-05-03 17:12:42,Tiburon,,Tiburon 0666,David Clague,Tiburon 0666,Tiburon
4,DBCD3F40-E533-4AD0-9541-A87932C9EC78,,2010-02-25 18:58:34,00:38:45:13,BF43A883-1DE2-439A-8F55-F7F215FE2D4F,cruise,Dosidicus gigas,,ROV,2010-04-14 23:40:00.360000,...,,3596000.0,V3527-01HD,2010-02-25 18:40:18,Ventana,,Ventana 3527,Linda Kuhnz,Ventana 3527,Ventana


In [3]:
## List all columns

data.columns

Index(['imaged_moment_uuid', 'index_elapsed_time_millis',
       'index_recorded_timestamp', 'index_timecode', 'observation_uuid',
       'activity', 'concept', 'duration_millis', 'observation_group',
       'observation_timestamp', 'observer', 'image_reference_uuid',
       'image_description', 'image_format', 'image_height', 'image_width',
       'image_url', 'link_name', 'link_value', 'to_concept',
       'association_mime_type', 'associations', 'altitude',
       'coordinate_reference_system', 'depth_meters', 'latitude', 'longitude',
       'oxygen_ml_per_l', 'phi', 'xyz_position_units', 'pressure_dbar', 'psi',
       'salinity', 'temperature_celsius', 'theta', 'x', 'y', 'z',
       'light_transmission', 'video_reference_uuid', 'audio_codec',
       'video_container', 'video_reference_description', 'frame_rate',
       'video_height', 'video_sha512', 'video_size_bytes', 'video_uri',
       'video_codec', 'video_width', 'video_description',
       'video_duration_millis', 'video_nam

### What do all these columns mean?

I don't think all of them are in the user guide, even the advanced user guide. They also don't align with the columns listed in Brian's SQL code, for some reason.

<span style="color:red">imaged_moment_uuid</span> <br>
<span style="color:red">index_elapsed_time_millis</span> <br>
index_recorded_timestamp = **Recorded Date**, the time in UTC when the image was captured on camera <br>
index_timecode = **Tape Time Code**, the hours, minutes, seconds and frames since the beginning of the dive (00:00:00:00) in the format (HH:MM:SS:FF) <br>
observation_uuid = the unique id of the annotation <br>
<span style="color:red">activity</span> - **Camera Direction?**, describes what the ROV was doing when the image was taken, possible values: 'cruise', 'descend', 'ascend', 'transect', 'stationary', 'unspecified', 'diel transect', nan <br>
concept = **Concept Name**, key terms referring to organisms, geologic features, sampling devices/scientific equipment, and marine debris - I assume we're interested in organisms only <br>
<span style="color:red">duration_millis</span> <br>
<span style="color:red">observation_group</span> - only has a value of ROV in this sample data set <br>
observation_timestamp = **Observation Date**, the date the annotation was created in UTC, not necessarily the same as Recorded Date <br>
observer = **Observer**, the mbari username of the person who created the annotation in theory, but in practice a mix of usernames, full names, partial names, different capitalizations, etc. <br>
<span style="color:red">image_reference_uuid</span> <br>
<span style="color:red">image_description</span> - takes on values of 'compressed image with overlay', 'source image', 'uncompressed image', 'compressed image', nan <br>
image_format = format of associated image file: 'image/jpg', 'image/png', nan <br>
image_height = height of associated image file in pixels: 0, 1080, nan. **What does 0 mean versus nan in this context?** <br>
image_width = width of associated image file in pixels: 0, 1920, nan. **What does 0 mean versus nan in this context?** <br>
image_url = url where the associated image is stored online. **Are these links permanent/stable?** <br>
<span style="color:red">link_name</span> - some of these are clearly descriptors of the animal or its behavior, others are confusing (e.g. "on"). **Difference between nil and nan? How do I make sense of this information?** <br>
<span style="color:red">link_value</span> <br>
<span style="color:red">to_concept</span> - points to other concepts? Something like *the animal in this annotation is doing [link_value] to/with the subject of to_concept*? **Difference between nil and nan?** <br>
<span style="color:red">association_mime_type</span> - either 'text/plain' or nan, possibly the data type of the link_value? <br>
<span style="color:red">associations</span> <br>
altitude = how far the ROV was off the bottom? In meters? **What do the negative values mean?** <br>
coordinate_reference_system = all nan here, **assume WGS84?** <br>
depth_meters = depth below the surface in meters, positive number, ranges from ~6 to ~3500 here. **Relationship to altitude?** <br>
latitude = latitude where the image was taken in decimal degrees <br>
longitude = longitude where the image was taken in decimal degrees <br>
oxygen_ml_per_l = **Oxygen**, mL of dissolved oxygen per L seawater, includes nan values (**not taken for all dives?**), ranges from -1.5 to 15 <br>
<span style="color:red">phi</span> <br>
<span style="color:red">xyz_position_units</span> - all nan for these data <br>
pressure_dbar = the pressure measured in decibars at the time the image was taken <br>
psi = **The same pressure converted to psi?** <br>
salinity = **Salinity**, the salinity at the time the image was taken, calculated from conductivity and pressure measurements, or nan <br>
temperature_celsius = **Temperature**, the water temperature in degrees C when the image was taken, or nan <br>
<span style="color:red">theta</span> <br>
<span style="color:red">x</span> <br>
<span style="color:red">y</span> <br>
<span style="color:red">z</span> <br>
light_transmission = **Light**, percent light transmitted through the water column when the image was taken (**100% at the surface?**), or nan <br>
<span style="color:red">video_reference_uuid</span> <br>
<span style="color:red">audio_codec</span> - all nan for these data <br>
video_container = whether the original video exists on tape or digitally, options are 'tape', 'video/quicktime' <br>
<span style="color:red">video_reference_description</span> - appears to be the words "Tape loaded from VARS on" plus a datetime <br>
frame_rate = frame rate of the camera, either 29.97 or 0 **What does 0 mean in this context? It's a still image?** <br>
video_height = height of the video frame in pixels, 1080 <br>
<span style="color:red">video_sha512</span> <br>
video_size_bytes = size of the video file in bytes, 0.00000000e+00, 2.62291551e+10, 2.67221996e+10 **Why is this so consistent across videos?** <br>
<span style="color:red">video_uri</span> <br>
<span style="color:red">video_codec</span> - all nan for these data <br>
video_width = width of the video frame in pixels, 1920 <br>
<span style="color:red">video_description</span> - all nan for these data <br>
video_duration_millis = total duration of the video in milliseconds <br>
<span style="color:red">video_name</span> <br>
video_start_timestamp = time when the video started **in local time?** <br>
camera_id = **ROV name**, 'Tiburon', 'Ventana', or 'Doc Ricketts' <br>
<span style="color:red">video_sequence_description</span> - all nan for these data <br>
<span style="color:red">video_sequence_name</span> - ROV name plus a 4-digit integer, possibly dive number? <br>
chief_scientist = **Chief Scientist**, the full name of PI for whom the dive was primarily conducted, maiden name sometimes included in parentheses <br>
dive_number = **Dive Number**, ROV name plus a 4-digit integer uniquely identifying which dive a video/image was taken on, **How is this different than video_sequence_name?** <br>
camera_platform = **ROV name**, 'Tiburon', 'Ventana', 'Doc Ricketts' or nan, **How is this different than camera_id? What does nan mean in this context?**
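
The value lists and "all nan" notes above can be regenerated for every column at once. Here is a minimal sketch, assuming the export has been loaded into a pandas frame as in the cells above. The helper name `summarize_columns` and the toy frame are hypothetical stand-ins, not part of VARS.

```python
import numpy as np
import pandas as pd

def summarize_columns(df, max_values=10):
    """Return one row per column: dtype, NaN count, and a sample of distinct values."""
    rows = []
    for col in df.columns:
        distinct = df[col].dropna().unique()
        rows.append({
            'column': col,
            'dtype': str(df[col].dtype),
            'n_nan': int(df[col].isna().sum()),
            'n_distinct': len(distinct),
            'all_nan': len(distinct) == 0,
            'sample_values': list(distinct[:max_values]),
        })
    return pd.DataFrame(rows).set_index('column')

# Toy stand-in for the VARS export (values are made up for illustration)
toy = pd.DataFrame({
    'activity': ['cruise', 'descend', np.nan],
    'audio_codec': [np.nan, np.nan, np.nan],
    'image_height': [0, 1080, np.nan],
})
summary = summarize_columns(toy)
print(summary)
```

Running `summarize_columns(data)` on the real frame should flag the columns that are entirely nan and show where 0 and nan both occur, which may help narrow down the questions for Brian.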

### How might these map to DwC terms?
Generally, it seems like the event should be the dive, and the occurrences should be annotations made from the video/images recorded during the dive. But this structure leads to some questions about how to assign dates/times to occurrences versus events, and how to assign ids to events, occurrences, and videos/images etc.

Alternatively, the dive could be the parent event, and the video/image could be the event, and the annotation could be the occurrence.
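
To make the second option concrete, here is a rough sketch of how a flat export might be split into an event table (dives as parent events, videos as child events) and an occurrence table (annotations). It assumes `dive_number` identifies the dive and `video_reference_uuid` identifies the video, neither of which is confirmed yet. The toy values are made up. Note that the first three rows of `data.head()` share one `observation_uuid`, so the occurrence table is deduplicated on that id.

```python
import pandas as pd

# Toy rows using column names from the VARS export (values are made up)
flat = pd.DataFrame({
    'dive_number': ['Tiburon 0265', 'Tiburon 0265', 'Tiburon 0265', 'Ventana 3527'],
    'video_reference_uuid': ['vid-A', 'vid-A', 'vid-A', 'vid-B'],
    'observation_uuid': ['obs-1', 'obs-1', 'obs-2', 'obs-3'],
    'concept': ['Dosidicus gigas'] * 4,
})

# Parent events: one row per dive (no parentEventID of their own)
dives = pd.DataFrame({'eventID': flat['dive_number'].drop_duplicates()})

# Child events: one row per video, pointing at its dive
videos = (flat[['video_reference_uuid', 'dive_number']]
          .drop_duplicates()
          .rename(columns={'video_reference_uuid': 'eventID',
                           'dive_number': 'parentEventID'}))

# Dives get NaN in parentEventID because the column is absent from `dives`
events = pd.concat([dives, videos], ignore_index=True)

# Occurrences: one row per annotation, linked to its video event
occurrences = (flat.rename(columns={'observation_uuid': 'occurrenceID',
                                    'video_reference_uuid': 'eventID',
                                    'concept': 'scientificName'})
               [['occurrenceID', 'eventID', 'scientificName']]
               .drop_duplicates('occurrenceID')
               .reset_index(drop=True))

print(events)
print(occurrences)
```

For the simpler dive/annotation structure, the `videos` step would be dropped and occurrences would point straight at the dive.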

index_recorded_timestamp = **eventDate** in UTC, ISO 8601:2004 - <span style="color:red">This actually seems like it should be an 'occurrenceDate', but no such field exists. eventDate could be just the date, not time?</span> <br>
observation_uuid = **occurrenceID** <br>
activity/**Camera Direction** = some or all of **samplingProtocol**? <br>
concept = **scientificName, scientificNameID, taxonID, nameAccordingToID, identificationReferences, occurrenceStatus, basisOfRecord** <br>
duration_millis = <span style="color:red">If this is actually the duration of the video or dive, it could be part of **samplingEffort**?</span> <br>
observation_group = 'ROV', some or all of **samplingProtocol**? <br>
observer = <span style="color:red">I think we have to figure out a way to include the person who created the annotation (observer) in addition to the PI (chief_scientist) and MBARI (possibly using **institutionID** or **institutionCode**). Not sure how to do this. Perhaps **recordedBy** in the format observer | PI?</span> <br>
image_url = **associatedMedia** <br>
link_name = <span style="color:red">Super confusing, but seems like it could contain information about **individualCount, lifeStage, sex,** and **behavior**.</span> <br>
depth_meters = **minimumDepthInMeters, maximumDepthInMeters**. <span style="color:red">I assume, if there's only one depth, these are the same?</span> <br>
latitude = **decimalLatitude** <br>
longitude = **decimalLongitude** <br>
oxygen_ml_per_l = a MeasurementOrFact, **measurementType, measurementValue, measurementAccuracy, measurementUnit**. <span style="color:red">Can there be more than one MeasurementOrFact per record?</span> <br>
pressure_dbar, psi, salinity, temperature_celsius, light_transmission = other candidates for MeasurementOrFact <br>
video_uri = **associatedMedia**, although I'm not sure what these URIs are <br>
video_duration_millis = <span style="color:red">If this is actually the duration of the video, it could be part of **samplingEffort**?</span> <br>
camera_platform = **eventID** <br>
chief_scientist
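
As a starting point for the date mapping, here is a small sketch that turns `index_recorded_timestamp` strings into ISO 8601 text with an explicit UTC marker. It assumes the values really are UTC and follow the `YYYY-MM-DD HH:MM:SS` pattern seen in `data.head()`. The function name `to_iso8601_utc` is my own, and it uses the standard library's `timezone.utc` (`pytz.utc` should behave the same for a fixed UTC offset).

```python
from datetime import datetime, timezone
import pandas as pd

def to_iso8601_utc(value):
    """Format a 'YYYY-MM-DD HH:MM:SS[.ffffff]' string, assumed to be UTC, as ISO 8601."""
    if pd.isna(value):
        return None  # keep missing timestamps missing
    dt = datetime.fromisoformat(str(value)).replace(tzinfo=timezone.utc)
    return dt.strftime('%Y-%m-%dT%H:%M:%SZ')  # fractional seconds are dropped

# First value is taken from the head() output above; the second stands in for a gap
toy = pd.DataFrame({'index_recorded_timestamp': ['2001-03-19 22:24:12', None]})
toy['eventDate'] = toy['index_recorded_timestamp'].map(to_iso8601_utc)
print(toy)
```

If it turns out that eventDate should carry only the date, the format string could be shortened to `'%Y-%m-%d'`.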

#### Columns that can be excluded from data set?
imaged_moment_uuid <br>
index_elapsed_time_millis <br>
index_timecode/**Tape Time Code** <br>
image_reference_uuid - <span style="color:red">Could be **eventID** if we pick the image/video to be the event rather than the dive</span> <br>
altitude <br>
coordinate_reference_system <br>
video_reference_uuid - <span style="color:red">Could be **eventID** if we pick the image/video to be the event rather than the dive</span> <br>
audio_codec <br>
video_reference_description <br>
video_sha512 <br>
video_codec <br>
video_description <br>
video_name <br>
video_start_timestamp <br>
camera_id <br>
video_sequence_description <br>
observation_timestamp
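
Once the exclusion list is settled, it could be applied with `DataFrame.drop`. Because `drop` raises a `KeyError` for any name that is not a column, the sketch below first checks the list against the frame, so a misspelled name shows up instead of stopping the run. The helper `split_known_unknown`, the toy frame, and `not_a_real_column` are hypothetical.

```python
import pandas as pd

def split_known_unknown(df, names):
    """Separate names that exist as columns in df from names that do not."""
    known = [n for n in names if n in df.columns]
    unknown = [n for n in names if n not in df.columns]
    return known, unknown

# Toy frame with a few of the real column names and no rows
toy = pd.DataFrame(columns=['index_timecode', 'altitude', 'concept'])
to_drop = ['index_timecode', 'altitude', 'not_a_real_column']

known, unknown = split_known_unknown(toy, to_drop)
print('Not found in frame:', unknown)

trimmed = toy.drop(columns=known)
print(list(trimmed.columns))
```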

#### Not sure whether or how these should be included:
image_description <br>
image_height <br>
image_width - <span style="color:red">Can height and width be part of the metadata somehow?</span> <br>
link_value <br>
to_concept <br>
association_mime_type <br>
associations <br>
phi <br>
xyz_position_units <br>
theta <br>
x <br>
y <br>
z <br>
video_container <br>
frame_rate <br>
video_size_bytes <br>
video_height <br>
video_width - <span style="color:red">Can frame_rate, size, height and width be part of the metadata somehow?</span><br>
video_sequence_description - <span style="color:red">Pointer to other records that might be relevant?</span> <br>

### Questions

#### For Brian:
1. Double check understanding of what columns mean - especially links, associations, to_concepts.
2. Navigating the database - what is the cursor?
3. Why are the column names in your SQL for removing embargoed records different from the ones I see in the annotations table?
4. Long view: ultimately, best way to pull down data a chunk at a time for processing?

#### For Patrick:
1. parentEvent/event/occurrence structure - dive/video or image/annotation versus dive/annotation?
2. eventDate and occurrenceDate - the latter seems more relevant, but doesn't exist.
3. how to include information about the observer, chief scientist, MBARI?
4. can there be multiple MeasurementOrFacts per record? Say, if you wanted temperature and salinity? Seems like these columns would not have unique names?
5. 
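
On question 4 for Patrick, my tentative understanding (to be confirmed) is that MeasurementOrFact records can live in a separate long-format table, one row per measurement, linked back to the occurrence by id. That would sidestep the unique-column-name problem. Here is a sketch of that reshaping with `pandas.melt`. The toy values are made up, and the unit strings are only those implied by the VARS column names.

```python
import pandas as pd

# Toy wide-format rows (values are made up for illustration)
toy = pd.DataFrame({
    'observation_uuid': ['obs-1', 'obs-2'],
    'temperature_celsius': [4.2, None],
    'oxygen_ml_per_l': [0.5, 0.7],
})

# Units inferred from the column names only (an assumption)
units = {'temperature_celsius': 'degrees Celsius', 'oxygen_ml_per_l': 'mL/L'}

mof = (toy.melt(id_vars='observation_uuid',
                value_vars=list(units),
                var_name='measurementType',
                value_name='measurementValue')
          .dropna(subset=['measurementValue'])   # skip measurements that were not taken
          .rename(columns={'observation_uuid': 'occurrenceID'})
          .reset_index(drop=True))
mof['measurementUnit'] = mof['measurementType'].map(units)
print(mof)
```

Each occurrence then contributes as many rows as it has non-missing measurements, and adding salinity or pressure would only mean extending `units`.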