## VARS Darwin Core conversion

Resources:
- https://dwc.tdwg.org/terms/
- https://tools.gbif.org/dwca-validator/extension.do?id=dwc:Occurrence#Event
- https://www.mbari.org/products/research-software/video-annotation-and-reference-system-vars/query-interface/advanced-user-guide/

In [1]:
## Imports

import pandas as pd
import numpy as np

from datetime import datetime # for handling dates
import pytz # for handling time zones

import urllib.request, urllib.parse, json # for dealing with WoRMS API and output

In [2]:
## Load csv

path = ''
filename = 'VARS_DwC_conversion_practice_200403.csv'
data = pd.read_csv(path+filename)

data.head()

Unnamed: 0,imaged_moment_uuid,index_elapsed_time_millis,index_recorded_timestamp,index_timecode,observation_uuid,activity,concept,duration_millis,observation_group,observation_timestamp,...,video_description,video_duration_millis,video_name,video_start_timestamp,camera_id,video_sequence_description,video_sequence_name,chief_scientist,dive_number,camera_platform
0,97BD5895-9489-478B-8797-6961D8A770D2,,2001-03-19 22:24:12,04:00:43:27,1040A0BF-2D50-45D2-9C96-35CC80708939,cruise,Dosidicus gigas,,ROV,2010-10-25 22:10:05.150000,...,,5529000.0,T0265-04,2001-03-19 20:57:56,Tiburon,,Tiburon 0265,Bruce Robison,Tiburon 0265,Tiburon
1,97BD5895-9489-478B-8797-6961D8A770D2,,2001-03-19 22:24:12,04:00:43:27,1040A0BF-2D50-45D2-9C96-35CC80708939,cruise,Dosidicus gigas,,ROV,2010-10-25 22:10:05.150000,...,,5529000.0,T0265-04,2001-03-19 20:57:56,Tiburon,,Tiburon 0265,Bruce Robison,Tiburon 0265,Tiburon
2,97BD5895-9489-478B-8797-6961D8A770D2,,2001-03-19 22:24:12,04:00:43:27,1040A0BF-2D50-45D2-9C96-35CC80708939,cruise,Dosidicus gigas,,ROV,2010-10-25 22:10:05.150000,...,,5529000.0,T0265-04,2001-03-19 20:57:56,Tiburon,,Tiburon 0265,Bruce Robison,Tiburon 0265,Tiburon
3,BE6C5C5B-8B04-45F0-BD48-C3EBE187B2EB,,2004-05-03 17:27:02,03:24:58:15,A3DE4416-5B67-4A10-A6BC-AE9A99552A13,cruise,Dosidicus gigas,,ROV,2004-05-03 18:29:03,...,,3600000.0,T0666-04,2004-05-03 17:12:42,Tiburon,,Tiburon 0666,David Clague,Tiburon 0666,Tiburon
4,DBCD3F40-E533-4AD0-9541-A87932C9EC78,,2010-02-25 18:58:34,00:38:45:13,BF43A883-1DE2-439A-8F55-F7F215FE2D4F,cruise,Dosidicus gigas,,ROV,2010-04-14 23:40:00.360000,...,,3596000.0,V3527-01HD,2010-02-25 18:40:18,Ventana,,Ventana 3527,Linda Kuhnz,Ventana 3527,Ventana


In [3]:
## List all columns

data.columns

Index(['imaged_moment_uuid', 'index_elapsed_time_millis',
       'index_recorded_timestamp', 'index_timecode', 'observation_uuid',
       'activity', 'concept', 'duration_millis', 'observation_group',
       'observation_timestamp', 'observer', 'image_reference_uuid',
       'image_description', 'image_format', 'image_height', 'image_width',
       'image_url', 'link_name', 'link_value', 'to_concept',
       'association_mime_type', 'associations', 'altitude',
       'coordinate_reference_system', 'depth_meters', 'latitude', 'longitude',
       'oxygen_ml_per_l', 'phi', 'xyz_position_units', 'pressure_dbar', 'psi',
       'salinity', 'temperature_celsius', 'theta', 'x', 'y', 'z',
       'light_transmission', 'video_reference_uuid', 'audio_codec',
       'video_container', 'video_reference_description', 'frame_rate',
       'video_height', 'video_sha512', 'video_size_bytes', 'video_uri',
       'video_codec', 'video_width', 'video_description',
       'video_duration_millis', 'video_nam

### What do all these columns mean?

I don't think all of them are in the user guide, even the advanced user guide. They also don't align with the columns listed in Brian's SQL code, for some reason. **They don't align with Brian's code because that code was written for a legacy version of the database.**

Below:
- **bold** = term or important observation
- <span style="color:red">**red**</span> = questions still to ask
- <span style="color:orange">**orange**</span> = things I still don't understand but it probably doesn't matter

imaged_moment_uuid = unique identifier for the imaged_moments table. All information about images and annotations is stored in the M3_ANNOTATIONS database.<br>
index_elapsed_time_millis = how long the video had been running, in milliseconds, when the image was captured <br>
index_recorded_timestamp = **Recorded Date**, the time in UTC when the image was captured on camera <br>
index_timecode = **Tape Time Code**, the hours, minutes, seconds and frames since the beginning of the dive (00:00:00:00) in the format (HH:MM:SS:FF); used for tape <br>
observation_uuid = the unique id of the annotation, unique identifier for the observations table. **For one imaged moment, there can be multiple observations.** <br>
activity = **Camera Direction**, describes what the ROV was doing when the image was taken, possible values: 'cruise', 'descend', 'ascend', 'transect', 'stationary', 'unspecified', 'diel transect', nan <br>
concept = **Concept Name**, key terms referring to organisms, geologic features, sampling devices/scientific equipment, and marine debris - I assume we're interested in organisms only <br>
duration_millis = how long the concept was observed for, starting from index_recorded_timestamp. <br>
observation_group = only has a value of ROV in this sample data set <br>
observation_timestamp = **Observation Date**, the date the annotation was created in UTC, not necessarily the same as Recorded Date <br>
observer = **Observer**, the mbari username of the person who created the annotation in theory, but in practice a mix of usernames, full names, partial names, different capitalizations, etc. <br>
image_reference_uuid = unique id for the image_reference table, which contains info about the size, type of image <br>
image_description = each imaged_moment is saved at least twice in compressed and uncompressed format; takes on values of 'compressed image with overlay', 'source image', 'uncompressed image', 'compressed image', nan <br>
image_format = format of associated image file: 'image/jpg', 'image/png', nan. One of these should correspond to compressed and one to uncompressed description. <br>
image_height = height of associated image file in pixels: 0, 1080, nan. <span style="color:red">**What does 0 mean versus nan in this context?**</span> <br>
image_width = width of associated image file in pixels: 0, 1920, nan. <span style="color:red">**What does 0 mean versus nan in this context?**</span> <br>
image_url = url where the associated image is stored online. <span style="color:red">**These links are available to everyone, including outside MBARI**</span> <br>
link_name = to the best of my understanding, link_name and link_value are like a dictionary key, value pair. The link_name indicates what kind of information is held in link_value. Some of these are clearly descriptors of the animal or its behavior, others are confusing (e.g. "on"). <span style="color:red">**Difference between nil and nan?**</span> **Each observation can have multiple associations (i.e. link_names, values, etc.)** <br>
link_value = value of the information described by link_name. <br>
to_concept = points to 'self' or to another concept. So, for example, if the annotation is for a red octopus, the concept might be the octopus species name, the link_name might be 'color', the to_concept might be 'self', and the link_value might be 'red'. <span style="color:red">**Difference between nil and nan?**</span> <br>
<span style="color:orange">association_mime_type</span> - either 'text/plain' or nan, possibly the data type of the link_value? <br>
<span style="color:orange">associations</span> <br>
altitude = how far the ROV was off the bottom? In meters? <span style="color:orange">**What do the negative values mean?**</span> <br>
coordinate_reference_system = all nan here, **Assume WGS84, but Brian wasn't sure and wanted to double check.** <br>
depth_meters = depth below the surface in meters, positive number, ranges from ~6 to ~3500 here. <br>
latitude = latitude where the image was taken in decimal degrees <br>
longitude = longitude where the image was taken in decimal degrees <br>
oxygen_ml_per_l = **Oxygen**, mL of dissolved oxygen per L seawater, includes nan values, ranges from -1.5 to 15 <br>
<span style="color:orange">phi</span> <br>
<span style="color:orange">xyz_position_units</span> - all nan for these data <br>
pressure_dbar = the pressure measured in decibars at the time the image was taken <br>
psi = <span style="color:red">**The same pressure converted to psi?**</span> <br>
salinity = **Salinity**, the salinity at the time the image was taken, calculated from conductivity and pressure measurements, or nan <br>
temperature_celsius = **Temperature**, the water temperature in degrees C when the image was taken, or nan <br>
<span style="color:orange">theta</span> <br>
<span style="color:orange">x</span> <br>
<span style="color:orange">y</span> <br>
<span style="color:orange">z</span> <br>
light_transmission = **Light**, percent light transmitted through the water column when the image was taken, or nan. **Patrick suggests not including this as a MeasurementOrFact, because the ROVs generate light and so it's unclear what's actually being measured.** <br>
video_reference_uuid = unique id in video_references table, which holds information about the size, format, etc. of video files. All video data are stored in the M3_VIDEO_ASSETS database. <br>
<span style="color:orange">audio_codec</span> - all nan for these data <br>
video_container = whether the original video exists on tape or digitally, options are 'tape', 'video/quicktime' <br>
<span style="color:orange">video_reference_description</span> - appears to be the words "Tape loaded from VARS on" plus a datetime <br>
frame_rate = frame rate of the camera, either 29.97 or 0 <span style="color:orange">**What does 0 mean in this context? It's a still image?**</span> <br>
video_height = height of the video frame in pixels, 1080 <br>
<span style="color:orange">video_sha512</span> <br>
video_size_bytes = size of the video file in bytes, 0.00000000e+00, 2.62291551e+10, 2.67221996e+10; larger videos are probably quicktime files <br>
video_uri = location of video file on MBARI's servers, or a pointer to where the tape is stored. If you want to open a video on the web, make sure you're VPN'ed into MBARI and choose an mp4 rather than a quicktime file. **These files are not currently publicly accessible.** <br>
<span style="color:orange">video_codec</span> - all nan for these data <br>
video_width = width of the video frame in pixels, 1920 <br>
<span style="color:orange">video_description</span> - all nan for these data <br>
video_duration_millis = total duration of the video in milliseconds. **Multiple videos are taken per dive.** <br>
<span style="color:orange">video_name</span> <br>
video_start_timestamp = time when the video started <span style="color:orange">**in local time?**</span> <br>
camera_id = **ROV name**, 'Tiburon', 'Ventana', or 'Doc Ricketts' <br>
<span style="color:orange">video_sequence_description</span> - all nan for these data <br>
video_sequence_name = ROV name plus a 4-digit integer (dive number); indicates which video sequence/dive a particular video came from. **Use this instead of dive_number, it is more up to date.** <br>
chief_scientist = **Chief Scientist**, the full name of PI for whom the dive was primarily conducted, maiden name sometimes included in parentheses <br>
dive_number = **Dive Number**, ROV name plus a 4-digit integer uniquely identifying which dive a video/image was taken on, **Out of date, use video_sequence_name** <br>
camera_platform = **ROV name**, 'Tiburon', 'Ventana', 'Doc Ricketts' or nan, <span style="color:red">**How is this different than camera_id? What does nan mean in this context?**</span>
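
Several of the open questions above (0 versus nan, nil versus nan) come down to which values each column actually holds. Here is a minimal sketch of how the value lists could be tabulated, using a small made-up `sample` frame in place of `data`. The values are illustrative only and are not taken from the real export.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few columns of `data`
sample = pd.DataFrame({
    'image_height': [0, 1080, np.nan, 1080],
    'to_concept': ['self', 'nil', np.nan, 'self'],
})

# dropna=False keeps missing values as their own category, so NaN is
# counted separately from literal values such as 0 or 'nil'
counts = {col: sample[col].value_counts(dropna=False) for col in sample.columns}

for col, tally in counts.items():
    print(col)
    print(tally)
```

Running the same loop over `data.columns` should show at a glance whether, say, 0 and nan co-occur in `image_height`, or whether 'nil' is a literal string distinct from a missing value.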

### How might these map to DwC terms?

**Event** = dive <br>
**Occurrence** = annotated observation

We will create two files for submission to OBIS, one containing event information and one containing occurrence information. Then, we can combine these into a single file as needed for compatibility with ERDDAP.
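
One way the two files could later be combined into a single flat file is a many-to-one join on **eventID**. This is only a sketch with made-up rows; the frame names `events` and `occurrences` and all values are placeholders, and the exact layout ERDDAP needs is not settled here.

```python
import pandas as pd

# Hypothetical event table: one row per dive
events = pd.DataFrame({
    'eventID': ['Tiburon 0265', 'Ventana 3527'],
    'samplingProtocol': ['ROV', 'ROV'],
    'recordedBy': ['Scientist A', 'Scientist B'],
})

# Hypothetical occurrence table: many rows per dive
occurrences = pd.DataFrame({
    'eventID': ['Tiburon 0265', 'Tiburon 0265', 'Ventana 3527'],
    'occurrenceID': ['obs-1', 'obs-2', 'obs-3'],
    'scientificName': ['Dosidicus gigas'] * 3,
})

# validate='many_to_one' raises an error if eventID is not unique in
# `events`, which guards against silently duplicating occurrence rows
flat = occurrences.merge(events, on='eventID', how='left', validate='many_to_one')
print(flat)
```

A left join keeps every occurrence even if its event row is missing, so gaps show up as nan rather than as dropped records.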

#### Event file contents
index_recorded_timestamp = **eventDate** in UTC, ISO 8601:2004. Extract only date. <br>
observation_group = 'ROV', some or all of **samplingProtocol**? <br>
video_sequence_name = **eventID** <br>
chief_scientist = **recordedBy** <br>

#### Occurrence file contents
index_recorded_timestamp = **eventDate** in UTC, ISO 8601:2004 <br>
observation_uuid = **occurrenceID**. Must be mindful that there may be multiple rows with the same observation_uuid due to multiple available images, multiple available videos, and/or multiple associations if using data from the associations table. <span style="color:red">Note that, at least for the first go, I'll probably leave out information from the associations table.</span> <br>
concept = **scientificName, scientificNameID, taxonID, nameAccordingToID, identificationReferences, occurrenceStatus, basisOfRecord** <br>
observer = **identifiedBy**. Also assign **institutionCode** as MBARI. <br>
image_url = **associatedMedia**. Note that we may need to choose which image to link to, or there may be a way to link to multiple images and/or videos. <br>
depth_meters = **minimumDepthInMeters, maximumDepthInMeters**. <br>
latitude = **decimalLatitude** <br>
longitude = **decimalLongitude** <br>
oxygen_ml_per_l = a MeasurementOrFact, **measurementType, measurementValue, measurementAccuracy, measurementUnits**. Note that if there is more than one MeasurementOrFact to be included, create sensible column names for each and then designate those columns as MeasurementOrFact columns in the metadata file. <br>
psi, salinity, temperature_celsius, light_transmission = other candidates for MeasurementOrFact <br>
video_uri = **associatedMedia**. Again, we may need to figure out how to deal with multiple media links, and which to provide if not all of them. <br>
video_sequence_name = **eventID** <br>

#### Maybe have some potential to be included in occurrence file:
link_value <br>
link_name <br>
to_concept <br>
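
For the `concept` mapping above, **scientificNameID** would presumably come from WoRMS, which is why `urllib` and `json` are imported at the top. Below is a rough sketch of what a lookup could look like. The endpoint path (`AphiaRecordsByName`), the query parameters, and the `lsid` field name are my assumptions about the WoRMS REST service and should be checked against its documentation. The helper names `worms_url` and `worms_lsid` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the WoRMS REST service
WORMS_REST = 'https://www.marinespecies.org/rest'


def worms_url(name):
    """Build the (assumed) exact-match lookup URL for a scientific name."""
    quoted = urllib.parse.quote(name)  # spaces become %20
    return '{}/AphiaRecordsByName/{}?like=false&marine_only=true'.format(WORMS_REST, quoted)


def worms_lsid(name):
    """Return the LSID of the first matching record, or None if nothing matches."""
    with urllib.request.urlopen(worms_url(name)) as response:
        body = response.read()
    if not body:  # an empty body is treated as "no match"
        return None
    records = json.loads(body.decode('utf-8'))
    return records[0].get('lsid') if records else None


print(worms_url('Dosidicus gigas'))
```

Calling `worms_lsid` once per unique concept, rather than once per row, would keep the number of requests small.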

In [4]:
## Get data with only relevant columns

df = data[['index_recorded_timestamp', 'observation_group', 'video_sequence_name', 'chief_scientist', 'observation_uuid', 'concept', 'observer', 'image_url', 'depth_meters', 
          'latitude', 'longitude', 'oxygen_ml_per_l', 'psi', 'salinity', 'temperature_celsius', 'video_uri']]
df.head()

Unnamed: 0,index_recorded_timestamp,observation_group,video_sequence_name,chief_scientist,observation_uuid,concept,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri
0,2001-03-19 22:24:12,ROV,Tiburon 0265,Bruce Robison,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,718.200012,30.402252,-137.669863,,320.899994,,,urn:tid:mbari.org:T0265-04
1,2001-03-19 22:24:12,ROV,Tiburon 0265,Bruce Robison,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,718.200012,30.402252,-137.669863,,320.899994,,,urn:tid:mbari.org:T0265-04
2,2001-03-19 22:24:12,ROV,Tiburon 0265,Bruce Robison,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,718.200012,30.402252,-137.669863,,320.899994,,,urn:tid:mbari.org:T0265-04
3,2004-05-03 17:27:02,ROV,Tiburon 0666,David Clague,A3DE4416-5B67-4A10-A6BC-AE9A99552A13,Dosidicus gigas,vars,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,546.400024,32.27368,-119.67217,0.38,70.800003,34.199001,5.58,urn:tid:mbari.org:T0666-04
4,2010-02-25 18:58:34,ROV,Ventana 3527,Linda Kuhnz,BF43A883-1DE2-439A-8F55-F7F215FE2D4F,Dosidicus gigas,linda,http://search.mbari.org/ARCHIVE/frameGrabs/Ven...,441.980011,36.797201,-122.188644,0.854,195.199997,34.175999,6.959,urn:tid:mbari.org:V3527-01HD


### Where do multiple rows with the same observation ID come from?

In [5]:
df[df['observation_uuid'] == df['observation_uuid'].iloc[0]]

Unnamed: 0,index_recorded_timestamp,observation_group,video_sequence_name,chief_scientist,observation_uuid,concept,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri
0,2001-03-19 22:24:12,ROV,Tiburon 0265,Bruce Robison,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,718.200012,30.402252,-137.669863,,320.899994,,,urn:tid:mbari.org:T0265-04
1,2001-03-19 22:24:12,ROV,Tiburon 0265,Bruce Robison,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,718.200012,30.402252,-137.669863,,320.899994,,,urn:tid:mbari.org:T0265-04
2,2001-03-19 22:24:12,ROV,Tiburon 0265,Bruce Robison,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,718.200012,30.402252,-137.669863,,320.899994,,,urn:tid:mbari.org:T0265-04
3772,2001-03-19 22:24:12,ROV,Tiburon 0265,Bruce Robison,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,718.200012,30.402252,-137.669863,,320.899994,,,urn:tid:mbari.org:T0265-04
3784,2001-03-19 22:24:12,ROV,Tiburon 0265,Bruce Robison,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,718.200012,30.402252,-137.669863,,320.899994,,,urn:tid:mbari.org:T0265-04
3791,2001-03-19 22:24:12,ROV,Tiburon 0265,Bruce Robison,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Tib...,718.200012,30.402252,-137.669863,,320.899994,,,urn:tid:mbari.org:T0265-04


In [6]:
pd.options.display.max_colwidth = 100 # default=50
df.loc[df['observation_uuid'] == df['observation_uuid'].iloc[0], 'image_url']

0       http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.jpg
1       http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.jpg
2       http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.jpg
3772    http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.png
3784    http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.png
3791    http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.png
Name: image_url, dtype: object

Ok, so two of these duplicates are arising from the compressed (.jpg) and uncompressed (.png) versions of the image files. I bet the rest of them are coming from associations information that I haven't included in df. Let's check that.

In [7]:
data.loc[data['observation_uuid'] == data['observation_uuid'].iloc[0], ['index_recorded_timestamp', 'observation_uuid', 'concept', 'image_description', 'image_url', 
                                                                       'link_name', 'link_value', 'to_concept', 'video_uri', 'video_size_bytes']]

Unnamed: 0,index_recorded_timestamp,observation_uuid,concept,image_description,image_url,link_name,link_value,to_concept,video_uri,video_size_bytes
0,2001-03-19 22:24:12,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,compressed image with overlay,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.jpg,perspective,close-up,self,urn:tid:mbari.org:T0265-04,0.0
1,2001-03-19 22:24:12,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,compressed image with overlay,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.jpg,identity-reference,4,self,urn:tid:mbari.org:T0265-04,0.0
2,2001-03-19 22:24:12,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,compressed image with overlay,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.jpg,identity-certainty,maybe,self,urn:tid:mbari.org:T0265-04,0.0
3772,2001-03-19 22:24:12,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,source image,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.png,perspective,close-up,self,urn:tid:mbari.org:T0265-04,0.0
3784,2001-03-19 22:24:12,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,source image,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.png,identity-reference,4,self,urn:tid:mbari.org:T0265-04,0.0
3791,2001-03-19 22:24:12,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,source image,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.png,identity-certainty,maybe,self,urn:tid:mbari.org:T0265-04,0.0


That seems right. So if we filter for unique observation IDs within compressed images only, that should work. However, we may want to provide the links to both images. In that case, we could:
1. Eliminate link_name, link_value, etc. from the data frame
2. Drop duplicates
3. Get multiple rows of associated media information (e.g. .jpg and .png format image links) into a single column of the form image_url_1 | image_url_2 | video_uri
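
The three steps above can be sketched roughly as follows. The frame `media` and its URLs are invented for illustration, and `' | '` is used as the separator to match the form given in step 3.

```python
import pandas as pd

# Hypothetical rows after steps 1 and 2: one row per (observation, image)
media = pd.DataFrame({
    'observation_uuid': ['obs-1', 'obs-1', 'obs-2'],
    'image_url': ['http://example.org/a.jpg', 'http://example.org/a.png', None],
    'video_uri': ['urn:tid:example:T0001-01', 'urn:tid:example:T0001-01',
                  'urn:tid:example:T0002-01'],
})

# Stack image and video links into one column, then drop missing links
# and links repeated within the same observation
links = (media
         .melt(id_vars='observation_uuid',
               value_vars=['image_url', 'video_uri'],
               value_name='link')
         .dropna(subset=['link'])
         .drop_duplicates(subset=['observation_uuid', 'link']))

# Step 3: one row per observation, links joined as url_1 | url_2 | video_uri
associated = (links
              .groupby('observation_uuid')['link']
              .agg(' | '.join)
              .rename('associatedMedia')
              .reset_index())
print(associated)
```

The result has one row per observation, so it can be merged back onto the de-duplicated occurrence table by `observation_uuid`.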

**Note** that I just tried to open an image link while not VPN'ed in to MBARI, and got:

Forbidden: You don't have permission to access /ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.jpg on this server.

Are we sure these files are publicly accessible?

In [8]:
## Drop duplicate rows from df

df = df.drop_duplicates()
df.head()

Unnamed: 0,index_recorded_timestamp,observation_group,video_sequence_name,chief_scientist,observation_uuid,concept,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri
0,2001-03-19 22:24:12,ROV,Tiburon 0265,Bruce Robison,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.jpg,718.200012,30.402252,-137.669863,,320.899994,,,urn:tid:mbari.org:T0265-04
3,2004-05-03 17:27:02,ROV,Tiburon 0666,David Clague,A3DE4416-5B67-4A10-A6BC-AE9A99552A13,Dosidicus gigas,vars,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2004/124/03_25_01_14.png,546.400024,32.27368,-119.67217,0.38,70.800003,34.199001,5.58,urn:tid:mbari.org:T0666-04
4,2010-02-25 18:58:34,ROV,Ventana 3527,Linda Kuhnz,BF43A883-1DE2-439A-8F55-F7F215FE2D4F,Dosidicus gigas,linda,http://search.mbari.org/ARCHIVE/frameGrabs/Ventana/images/3527/00_38_45_13.jpg,441.980011,36.797201,-122.188644,0.854,195.199997,34.175999,6.959,urn:tid:mbari.org:V3527-01HD
6,2007-11-02 20:22:18,ROV,Tiburon 1147,Bruce Robison,0EFEFAFD-B9ED-4640-9873-75A5EDA7C253,Muusoctopus robustus,svonthun,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/images/1147/06_40_40_00.jpg,3411.820068,36.323256,-122.903316,2.65,139.0,34.484001,1.589,urn:tid:mbari.org:T1147-07
7,2010-02-24 17:59:12,ROV,Ventana 3525,Chad Widmer,2E2280EF-397A-4384-B94B-23B12A287175,Dosidicus gigas,lonny,http://search.mbari.org/ARCHIVE/frameGrabs/Ventana/images/3525/01_06_04_24.jpg,454.769989,36.714396,-121.999731,0.828,99.300003,34.181,6.962,urn:tid:mbari.org:V3525-02HD


In [9]:
df.loc[df['observation_uuid'] == df['observation_uuid'].iloc[0], 'image_url']

0       http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.jpg
3772    http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.png
Name: image_url, dtype: object

### Break into event and occurrence data frames

In [10]:
event_df = df[['index_recorded_timestamp', 'video_sequence_name', 'observation_group', 'chief_scientist']]
event_df.head()

Unnamed: 0,index_recorded_timestamp,video_sequence_name,observation_group,chief_scientist
0,2001-03-19 22:24:12,Tiburon 0265,ROV,Bruce Robison
3,2004-05-03 17:27:02,Tiburon 0666,ROV,David Clague
4,2010-02-25 18:58:34,Ventana 3527,ROV,Linda Kuhnz
6,2007-11-02 20:22:18,Tiburon 1147,ROV,Bruce Robison
7,2010-02-24 17:59:12,Ventana 3525,ROV,Chad Widmer


In [11]:
occ_df = df[['video_sequence_name', 'index_recorded_timestamp', 'observation_uuid', 'concept', 'observer', 'depth_meters', 'latitude', 'longitude', 'oxygen_ml_per_l',
            'psi', 'salinity', 'temperature_celsius', 'image_url', 'video_uri']]
occ_df.head()

Unnamed: 0,video_sequence_name,index_recorded_timestamp,observation_uuid,concept,observer,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,image_url,video_uri
0,Tiburon 0265,2001-03-19 22:24:12,1040A0BF-2D50-45D2-9C96-35CC80708939,Dosidicus gigas,schlin,718.200012,30.402252,-137.669863,,320.899994,,,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2001/078/04_00_48_12.jpg,urn:tid:mbari.org:T0265-04
3,Tiburon 0666,2004-05-03 17:27:02,A3DE4416-5B67-4A10-A6BC-AE9A99552A13,Dosidicus gigas,vars,546.400024,32.27368,-119.67217,0.38,70.800003,34.199001,5.58,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/stills/2004/124/03_25_01_14.png,urn:tid:mbari.org:T0666-04
4,Ventana 3527,2010-02-25 18:58:34,BF43A883-1DE2-439A-8F55-F7F215FE2D4F,Dosidicus gigas,linda,441.980011,36.797201,-122.188644,0.854,195.199997,34.175999,6.959,http://search.mbari.org/ARCHIVE/frameGrabs/Ventana/images/3527/00_38_45_13.jpg,urn:tid:mbari.org:V3527-01HD
6,Tiburon 1147,2007-11-02 20:22:18,0EFEFAFD-B9ED-4640-9873-75A5EDA7C253,Muusoctopus robustus,svonthun,3411.820068,36.323256,-122.903316,2.65,139.0,34.484001,1.589,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/images/1147/06_40_40_00.jpg,urn:tid:mbari.org:T1147-07
7,Ventana 3525,2010-02-24 17:59:12,2E2280EF-397A-4384-B94B-23B12A287175,Dosidicus gigas,lonny,454.769989,36.714396,-121.999731,0.828,99.300003,34.181,6.962,http://search.mbari.org/ARCHIVE/frameGrabs/Ventana/images/3525/01_06_04_24.jpg,urn:tid:mbari.org:V3525-02HD


### Convert event data

In [12]:
## Change headings

event_df = event_df.rename(columns={
    'index_recorded_timestamp':'eventDate',
    'video_sequence_name':'eventID',
    'observation_group':'samplingProtocol',
    'chief_scientist':'recordedBy'
})
event_df.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy
0,2001-03-19 22:24:12,Tiburon 0265,ROV,Bruce Robison
3,2004-05-03 17:27:02,Tiburon 0666,ROV,David Clague
4,2010-02-25 18:58:34,Ventana 3527,ROV,Linda Kuhnz
6,2007-11-02 20:22:18,Tiburon 1147,ROV,Bruce Robison
7,2010-02-24 17:59:12,Ventana 3525,ROV,Chad Widmer


In [13]:
## Add institutionCode

event_df['institutionCode'] = 'MBARI'
event_df.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,institutionCode
0,2001-03-19 22:24:12,Tiburon 0265,ROV,Bruce Robison,MBARI
3,2004-05-03 17:27:02,Tiburon 0666,ROV,David Clague,MBARI
4,2010-02-25 18:58:34,Ventana 3527,ROV,Linda Kuhnz,MBARI
6,2007-11-02 20:22:18,Tiburon 1147,ROV,Bruce Robison,MBARI
7,2010-02-24 17:59:12,Ventana 3525,ROV,Chad Widmer,MBARI


In [14]:
## This probably still contains duplicate records due to multiple image_urls. Check:

event_df.duplicated().any()

True

In [15]:
## Drop duplicates

event_df.drop_duplicates(inplace=True)
event_df.head()

Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,institutionCode
0,2001-03-19 22:24:12,Tiburon 0265,ROV,Bruce Robison,MBARI
3,2004-05-03 17:27:02,Tiburon 0666,ROV,David Clague,MBARI
4,2010-02-25 18:58:34,Ventana 3527,ROV,Linda Kuhnz,MBARI
6,2007-11-02 20:22:18,Tiburon 1147,ROV,Bruce Robison,MBARI
7,2010-02-24 17:59:12,Ventana 3525,ROV,Chad Widmer,MBARI


In [16]:
## Eliminate time from eventDate and put in correct format

print(type(event_df.iloc[0,0]))

# Format dates
formatted_dates = []

for dt in event_df['eventDate']:
    
    # Convert string to datetime
    try:
        dt = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S.%f') # some datetimes have fractional seconds
    except ValueError:
        dt = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S')
        
    # ---- may need to insert code to handle timezones here -----
    
    # Convert to date
    dt = dt.date()
    
    # Put in ISO format string
    dt = dt.isoformat()
    
    # Save in list
    formatted_dates.append(dt)

event_df['eventDate'] = formatted_dates
event_df.head()

<class 'str'>


Unnamed: 0,eventDate,eventID,samplingProtocol,recordedBy,institutionCode
0,2001-03-19,Tiburon 0265,ROV,Bruce Robison,MBARI
3,2004-05-03,Tiburon 0666,ROV,David Clague,MBARI
4,2010-02-25,Ventana 3527,ROV,Linda Kuhnz,MBARI
6,2007-11-02,Tiburon 1147,ROV,Bruce Robison,MBARI
7,2010-02-24,Ventana 3525,ROV,Chad Widmer,MBARI
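
One thing to watch: the duplicates were dropped while eventDate still held full timestamps, so after truncating to dates the same dive can appear on several rows again. Here is a possible sketch of doing the date conversion and the per-dive de-duplication together, on made-up rows. Labelling the parsed value as UTC assumes the timestamps really are stored in UTC, and keeping the earliest date per eventID is my choice, not something required by the standard.

```python
from datetime import datetime, timezone

import pandas as pd

# Hypothetical event rows: two observations on one dive, one on another
events = pd.DataFrame({
    'eventID': ['Tiburon 0265', 'Tiburon 0265', 'Ventana 3527'],
    'eventDate': ['2001-03-19 22:24:12',
                  '2001-03-19 23:01:40.150000',  # with fractional seconds
                  '2010-02-25 18:58:34'],
    'samplingProtocol': ['ROV', 'ROV', 'ROV'],
})


def to_utc_date(timestamp):
    """Parse a 'YYYY-MM-DD HH:MM:SS[.ffffff]' string and return the ISO date."""
    parsed = datetime.fromisoformat(timestamp)
    # Assumption: values are already UTC, so only attach the label (no shift).
    # A conversion with astimezone() would go here if they were local times.
    parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.date().isoformat()


events['eventDate'] = events['eventDate'].map(to_utc_date)

# Keep one row per dive, using the earliest date seen for that dive
events = (events
          .sort_values(['eventID', 'eventDate'])
          .drop_duplicates(subset='eventID', keep='first')
          .reset_index(drop=True))
print(events)
```

`datetime.fromisoformat` accepts the timestamps both with and without fractional seconds, so the try/except is not needed in this version.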


In [17]:
## Check for any weird values in recordedBy

event_df['recordedBy'].unique()

array(['Bruce Robison', 'David Clague', 'Linda Kuhnz', 'Chad Widmer',
       'Patrick Whaling', 'Ken Smith', 'Bob Vrijenhoek',
       'Shannon Johnson', 'Joe Jones', 'Chuck Saltsman', 'Edie Widder',
       'Steve Haddock', 'Larry Madin', 'Steve Rock', 'Peter Brewer',
       'Andrew DeVogelaere', 'Jim Barry', 'Lisa Levin', nan,
       'Rob Sherlock', 'John Orcutt', 'Kim Reisenbichler', 'Craig Smith',
       'Gary Greene', 'Alastair Fothergill', 'Peter Walz',
       'Paul Schrader', 'William Chadwick', 'Erika Raymond',
       'Alana Sherman', 'Larry Bird', 'Charlie Paull', 'Dave Caress',
       'Craig McClain', 'Paul McGill', 'Meghan Powers', 'Debra Stakes',
       'Keenan Ball', 'Sebastian Sudek', 'Mark Chaffey', 'Scott Wankel',
       'Stephanie Bush', 'Sheri White', 'Craig Dawe', 'Peter Girguis',
       'Tony Ramirez', 'Russ Hopcroft', 'Michael Begnaud', 'Chuck Baxter',
       'Douglas Pargett', 'Bill Ussler', 'Knute Brekke', 'Luke Coletti',
       'Nancy Jacobsen', 'Lou Zeidberg', 'C

In [18]:
## Save

event_df.to_csv('event_df.csv', index=False)

### Datetimes and JDBC

Brian always saves timestamps in UTC, but apparently JayDeBeApi changes the timezones of datetime data when you use it to retrieve data from a database: <br>
https://github.com/baztian/jaydebeapi/issues/73

I just edited `_jdbc_connect_jpype` in JayDeBeApi's `__init__.py` as instructed by one of the posts in this thread:
```python
def _jdbc_connect_jpype(jclassname, url, driver_args, jars, libs):
    import jpype
    if not jpype.isJVMStarted():
        args = []
        class_path = []
        if jars:
            class_path.extend(jars)
        class_path.extend(_get_classpath())
        if class_path:
            args.append('-Djava.class.path=%s' %
                        os.path.pathsep.join(class_path))
        if libs:
            # path to shared libraries
            libs_path = os.path.pathsep.join(libs)
            args.append('-Djava.library.path=%s' % libs_path)
        args.append('-Duser.timezone=GMT') # ADD THIS LINE TO CODE
```

And re-pulled the data. They are saved as **VARS_DwC_conversion_practice_200416.csv**

#### Did this change actually change the datetimes?

In [19]:
## Load csv

path = ''
filename = 'VARS_DwC_conversion_practice_200416.csv'
new_times = pd.read_csv(path+filename)

# Get relevant columns
new_times = new_times[['index_recorded_timestamp', 'observation_group', 'video_sequence_name', 'chief_scientist', 'observation_uuid', 'concept', 'observer', 'image_url', 'depth_meters', 
          'latitude', 'longitude', 'oxygen_ml_per_l', 'psi', 'salinity', 'temperature_celsius', 'video_uri']]

# Drop duplicates
new_times = new_times.drop_duplicates()

# Examine
new_times.loc[new_times['observation_uuid'] == new_times['observation_uuid'].iloc[0]]

Unnamed: 0,index_recorded_timestamp,observation_group,video_sequence_name,chief_scientist,observation_uuid,concept,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri
0,2005-08-23 22:52:40,ROV,Tiburon 0884,Bob Vrijenhoek,A90437A1-F8F8-45C2-9862-3B54B0D036BF,Muusoctopus robustus,linda,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/images/0884/07_38_49_28.jpg,2755.620117,42.755189,-126.70974,1.24,136.699997,34.595001,1.756,urn:tid:mbari.org:T0884-08
2136,2005-08-23 22:52:40,ROV,Tiburon 0884,Bob Vrijenhoek,A90437A1-F8F8-45C2-9862-3B54B0D036BF,Muusoctopus robustus,linda,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/images/0884/07_38_49_28.png,2755.620117,42.755189,-126.70974,1.24,136.699997,34.595001,1.756,urn:tid:mbari.org:T0884-08


In [20]:
df.loc[df['observation_uuid'] == new_times['observation_uuid'].iloc[0]]

Unnamed: 0,index_recorded_timestamp,observation_group,video_sequence_name,chief_scientist,observation_uuid,concept,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri
883,2005-08-23 22:52:40,ROV,Tiburon 0884,Bob Vrijenhoek,A90437A1-F8F8-45C2-9862-3B54B0D036BF,Muusoctopus robustus,linda,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/images/0884/07_38_49_28.png,2755.620117,42.755189,-126.70974,1.24,136.699997,34.595001,1.756,urn:tid:mbari.org:T0884-08
3102,2005-08-23 22:52:40,ROV,Tiburon 0884,Bob Vrijenhoek,A90437A1-F8F8-45C2-9862-3B54B0D036BF,Muusoctopus robustus,linda,http://search.mbari.org/ARCHIVE/frameGrabs/Tiburon/images/0884/07_38_49_28.jpg,2755.620117,42.755189,-126.70974,1.24,136.699997,34.595001,1.756,urn:tid:mbari.org:T0884-08


Changing the code as suggested did **NOT** actually alter the dates or times at all. I've tried entering a few timezones, including 'UTC', 'GMT', and 'PST', but the output is always the same. Could it be that this bug has been fixed?
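
Spot-checking one record is not very convincing, so here is a rough sketch of how the two pulls could be compared across every shared `observation_uuid`. The two small data frames are hypothetical stand-ins for the 200403 and 200416 files, and `count_changed_timestamps` is a helper name I made up for illustration.

```python
import pandas as pd

def count_changed_timestamps(old, new):
    """Count shared observations whose recorded timestamp differs between two pulls."""
    cols = ['observation_uuid', 'index_recorded_timestamp']
    # Drop duplicate rows first: one observation can appear once per image file
    old = old[cols].drop_duplicates()
    new = new[cols].drop_duplicates()
    merged = old.merge(new, on='observation_uuid', suffixes=('_old', '_new'))
    differs = merged['index_recorded_timestamp_old'] != merged['index_recorded_timestamp_new']
    return int(differs.sum()), len(merged)

# Hypothetical stand-ins for the two csv files, using two records shown in this notebook
old_pull = pd.DataFrame({
    'observation_uuid': ['A90437A1-F8F8-45C2-9862-3B54B0D036BF',
                         '6A798718-A68B-47B6-B306-70F205A60F76'],
    'index_recorded_timestamp': ['2005-08-23 22:52:40', '2006-05-13 21:28:13'],
})
new_pull = old_pull.copy()

n_changed, n_shared = count_changed_timestamps(old_pull, new_pull)
print(n_changed, 'of', n_shared, 'shared observations have a different timestamp')
```

With the real files, `old_pull` and `new_pull` would be the two `pd.read_csv` results. A count of zero would back up the single-record comparison above.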

Is there any other way to double check this?

Maybe I can compare to results I get using the VARS query app.

In [21]:
## Get records from original csv file (200403) matching a record found using the VARS query app

df.loc[df['observation_uuid'] == '6A798718-A68B-47B6-B306-70F205A60F76']

Unnamed: 0,index_recorded_timestamp,observation_group,video_sequence_name,chief_scientist,observation_uuid,concept,observer,image_url,depth_meters,latitude,longitude,oxygen_ml_per_l,psi,salinity,temperature_celsius,video_uri
11672,2006-05-13 21:28:13,ROV,Tiburon 0984,Steve Haddock,6A798718-A68B-47B6-B306-70F205A60F76,Dosidicus gigas,schlin,,248.550003,35.633198,-122.736856,2.4,84.099998,33.962002,7.206,urn:tid:mbari.org:T0984-01


This timestamp matches what I found on the VARS query app... so maybe it's all OK?

### Convert occurrence data

In [40]:
## Change column names

occ_df = occ_df.rename(columns={
    'video_sequence_name':'eventID',
    'index_recorded_timestamp':'eventDate',
    'observation_uuid':'occurrenceID',
    'concept':'scientificName',
    'observer':'identifiedBy',
    'depth_meters':'minimumDepthInMeters',
    'latitude':'decimalLatitude',
    'longitude':'decimalLongitude',
    'oxygen_ml_per_l':'dissolvedOxygen',
    'psi':'pressureInPsi',
    'temperature_celsius':'temperatureInCelsius',    
})

# Sort by eventID
occ_df = occ_df.sort_values(by = ['eventID'])
occ_df.head()

Unnamed: 0,eventID,eventDate,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,decimalLatitude,decimalLongitude,dissolvedOxygen,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,maximumDepthInMeters
4563,Doc Ricketts 0011,2009-03-12T22:10:40,2FCE7413-3AD3-4B9B-952D-CE6D9715D22A,Patellogastropoda,lonny,2893.149902,36.613369,-122.435098,2.192,88.599998,34.623001,1.684,,urn:tid:mbari.org:D0011-06HD,urn:lsid:marinespecies.org:taxname:382158,382158,WoRMS,present,HumanObservation,2893.149902
13031,Doc Ricketts 0019,2009-05-07T01:48:49,F5546E33-118D-4893-806F-BF02AA3723B1,Dosidicus gigas,svonthun,252.029999,36.709469,-122.176063,0.991,256.0,34.118999,7.492,,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation,252.029999
12327,Doc Ricketts 0019,2009-05-07T02:04:28,D8051747-D106-4F3F-98BC-B7DB4861B9F6,Dosidicus gigas,svonthun,245.110001,36.709908,-122.176456,1.028,13.4,34.115002,7.577,,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation,245.110001
9590,Doc Ricketts 0019,2009-05-07T01:49:25,D3D54FAC-E00D-4509-919F-0F5871C8A3D3,Dosidicus gigas,svonthun,260.290009,36.709432,-122.176155,0.956,219.199997,34.123001,7.471,,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation,260.290009
8340,Doc Ricketts 0019,2009-05-07T01:49:05,BA0EF7C9-93E0-4114-A319-76FF0978D654,Dosidicus gigas,svonthun,255.830002,36.709461,-122.176098,0.975,234.5,34.120998,7.478,,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation,255.830002


In [23]:
## Get eventDate in the correct format

# Format dates
formatted_dt = []

for dt in occ_df['eventDate']:
    
    # Convert string to datetime
    try:
        dt = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S.%f') # some datetimes include fractional seconds
    except ValueError:
        dt = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S')
        
    # ---- may need to insert code to handle timezones here -----
    
    # Put in ISO format string
    dt = dt.isoformat()
    
    # Save in list
    formatted_dt.append(dt)

occ_df['eventDate'] = formatted_dt
occ_df.head()

Unnamed: 0,eventID,eventDate,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,decimalLatitude,decimalLongitude,dissolvedOxygen,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri
4563,Doc Ricketts 0011,2009-03-12T22:10:40,2FCE7413-3AD3-4B9B-952D-CE6D9715D22A,Patellogastropoda,lonny,2893.149902,36.613369,-122.435098,2.192,88.599998,34.623001,1.684,,urn:tid:mbari.org:D0011-06HD
428,Doc Ricketts 0019,2009-05-07T01:49:47,5815263A-A69E-48B4-A5C7-A1E17599EEC3,Dosidicus gigas,svonthun,264.170013,36.709389,-122.176214,0.946,258.399994,34.125,7.464,http://search.mbari.org/ARCHIVE/frameGrabs/Doc%20Ricketts/images/0019/05_36_01_06.jpg,urn:tid:mbari.org:D0019-06HD
5365,Doc Ricketts 0019,2009-05-07T01:49:25,5EAF101F-6EAA-44F5-8EC7-EA4CD84D4F3A,Dosidicus gigas,svonthun,260.290009,36.709432,-122.176155,0.956,219.199997,34.123001,7.471,,urn:tid:mbari.org:D0019-06HD
9590,Doc Ricketts 0019,2009-05-07T01:49:25,D3D54FAC-E00D-4509-919F-0F5871C8A3D3,Dosidicus gigas,svonthun,260.290009,36.709432,-122.176155,0.956,219.199997,34.123001,7.471,,urn:tid:mbari.org:D0019-06HD
13031,Doc Ricketts 0019,2009-05-07T01:48:49,F5546E33-118D-4893-806F-BF02AA3723B1,Dosidicus gigas,svonthun,252.029999,36.709469,-122.176063,0.991,256.0,34.118999,7.492,,urn:tid:mbari.org:D0019-06HD
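
If the timestamps turn out to be UTC (still an open question at this point), the timezone placeholder in the loop could be filled in roughly like this. This is only a sketch under that assumption: it labels each value as UTC without shifting it, and `to_iso_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def to_iso_utc(timestamp):
    """Parse a VARS-style timestamp string and return an ISO 8601 string with a UTC offset."""
    try:
        dt = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S.%f')  # with fractional seconds
    except ValueError:
        dt = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
    # Assumption: the stored value is already UTC, so attach the offset without converting
    return dt.replace(tzinfo=timezone.utc).isoformat()

# Two timestamp strings in the formats seen in the csv
examples = ['2009-03-12 22:10:40', '2010-10-25 22:10:05.150000']
iso_examples = [to_iso_utc(ts) for ts in examples]
print(iso_examples)
```

If the values were actually local time, they would need to be localized to that zone and converted instead of simply relabeled.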


In [24]:
## Get a list of unique scientific names (not all are identified to species)

names = occ_df['scientificName'].unique()
names

array(['Patellogastropoda', 'Dosidicus gigas', 'Dosidicus',
       'Muusoctopus robustus', 'Arachnactis', 'Muusoctopus leioderma'],
      dtype=object)

In [25]:
## Function to obtain scientificNameIDs and taxonIDs from WoRMS

def get_worms_from_scientific_name(sci_name):
    '''
    Using the WoRMS REST API, retrieve the IDs for a given scientific name
    
    Returns a tuple of:
        - scientificName: WoRMS-specified scientific name
        - scientificNameID: WoRMS LSID for the scientific name
        - taxonID: WoRMS AphiaID for the scientific name
    Returns None if neither the name nor its genus can be matched.
    '''
    
    sci_name_url = urllib.parse.quote(sci_name)
    _url = 'http://www.marinespecies.org/rest/AphiaRecordsByNames?scientificnames%5B%5D='+ sci_name_url + '&like=false&marine_only=false'
    
    try:
        with urllib.request.urlopen(_url) as url:
            data = json.loads(url.read().decode())
            return (data[0][0]['scientificname'], data[0][0]['lsid'], data[0][0]['AphiaID'])
        
    except Exception:
        # Try passing just the genus if the species name is unrecognized
        if len(sci_name.split()) > 1: # e.g. species is unknown and listed as spp. or sp.
            return get_worms_from_scientific_name(sci_name.split()[0])
        else:
            print("Url didn't work, check name, ", sci_name)
            return None

In [26]:
## Run function on test

test_name = names[2]

out = get_worms_from_scientific_name(test_name)
out

('Dosidicus', 'urn:lsid:marinespecies.org:taxname:341417', 341417)
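
For reference, the function relies on two things: the query URL format and the nested list-of-lists shape of the parsed response (hence `data[0][0]`). The sketch below shows both without sending a request. `build_worms_url` and `first_match` are hypothetical helper names, and `sample` is a hand-written stand-in for a parsed response that uses the values from the test output above.

```python
import urllib.parse

def build_worms_url(sci_name):
    """Build the AphiaRecordsByNames query URL for one name (no request is sent)."""
    base = 'http://www.marinespecies.org/rest/AphiaRecordsByNames'
    params = {'scientificnames[]': sci_name, 'like': 'false', 'marine_only': 'false'}
    # quote_via=quote encodes spaces as %20, matching the URL built in the function above
    return base + '?' + urllib.parse.urlencode(params, quote_via=urllib.parse.quote)

def first_match(data):
    """Return (scientificname, lsid, AphiaID) for the first record, or None if there is none."""
    if not data or not data[0]:
        return None
    rec = data[0][0]
    return (rec['scientificname'], rec['lsid'], rec['AphiaID'])

# Hand-written stand-in for a parsed response: one inner list per queried name
sample = [[{'scientificname': 'Dosidicus',
            'lsid': 'urn:lsid:marinespecies.org:taxname:341417',
            'AphiaID': 341417}]]

print(build_worms_url('Dosidicus gigas'))
print(first_match(sample))
```

Checking for an empty result explicitly, instead of relying on the exception from `data[0][0]`, would make it easier to tell a network failure apart from an unmatched name.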

In [27]:
%%time

## Run function on all names

# Initialize name and id dicts
name_id_dic = {}
name_dic = {}
id_dic = {}

for sci_name in names:
    
    sci_name = sci_name.strip()
    
    result = get_worms_from_scientific_name(sci_name)
    
    # The function returns None (after printing a warning) when WoRMS has no match
    if result is None:
        continue
    
    sname, sname_id, aphia_id = result
    name_id_dic[sci_name] = sname_id
    name_dic[sci_name] = sname
    id_dic[sci_name] = aphia_id

Wall time: 4.19 s


In [32]:
## Create columns from WoRMS data

# Create scientificNameID column with the same content as scientificName - strip to ensure no whitespace
occ_df['scientificNameID'] = occ_df['scientificName'].str.strip()

# Use dictionary to replace scientific names with name IDs
occ_df.replace({'scientificNameID':name_id_dic}, inplace=True)

# Repeat to create taxonID
occ_df['taxonID'] = occ_df['scientificName'].str.strip()
occ_df.replace({'taxonID':id_dic}, inplace=True)

# Create additional needed columns
occ_df['nameAccordingTo'] = 'WoRMS'
occ_df['occurrenceStatus'] = 'present'
occ_df['basisOfRecord'] = 'HumanObservation'

occ_df.head()

Unnamed: 0,eventID,eventDate,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,decimalLatitude,decimalLongitude,dissolvedOxygen,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord
4563,Doc Ricketts 0011,2009-03-12T22:10:40,2FCE7413-3AD3-4B9B-952D-CE6D9715D22A,Patellogastropoda,lonny,2893.149902,36.613369,-122.435098,2.192,88.599998,34.623001,1.684,,urn:tid:mbari.org:D0011-06HD,urn:lsid:marinespecies.org:taxname:382158,382158,WoRMS,present,HumanObservation
428,Doc Ricketts 0019,2009-05-07T01:49:47,5815263A-A69E-48B4-A5C7-A1E17599EEC3,Dosidicus gigas,svonthun,264.170013,36.709389,-122.176214,0.946,258.399994,34.125,7.464,http://search.mbari.org/ARCHIVE/frameGrabs/Doc%20Ricketts/images/0019/05_36_01_06.jpg,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation
5365,Doc Ricketts 0019,2009-05-07T01:49:25,5EAF101F-6EAA-44F5-8EC7-EA4CD84D4F3A,Dosidicus gigas,svonthun,260.290009,36.709432,-122.176155,0.956,219.199997,34.123001,7.471,,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation
9590,Doc Ricketts 0019,2009-05-07T01:49:25,D3D54FAC-E00D-4509-919F-0F5871C8A3D3,Dosidicus gigas,svonthun,260.290009,36.709432,-122.176155,0.956,219.199997,34.123001,7.471,,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation
13031,Doc Ricketts 0019,2009-05-07T01:48:49,F5546E33-118D-4893-806F-BF02AA3723B1,Dosidicus gigas,svonthun,252.029999,36.709469,-122.176063,0.991,256.0,34.118999,7.492,,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation
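
One caveat with `replace`: any name missing from the dictionary is left in place, so an unmatched name would sit in `scientificNameID` looking like an ID. Here is a small sketch of an alternative using `Series.map`, which leaves unmatched names as NaN so they can be listed. The toy data frame and the `'Unknown taxon'` entry are made up for illustration; the two IDs are the ones returned above.

```python
import pandas as pd

# Toy stand-in for occ_df; the last name is hypothetical and has no WoRMS match
toy = pd.DataFrame({'scientificName': ['Dosidicus gigas', ' Dosidicus ', 'Unknown taxon']})

# Name -> LSID lookup, as built from the WoRMS results
ids = {
    'Dosidicus gigas': 'urn:lsid:marinespecies.org:taxname:342291',
    'Dosidicus': 'urn:lsid:marinespecies.org:taxname:341417',
}

# map() returns NaN for names that are not in the dictionary
toy['scientificNameID'] = toy['scientificName'].str.strip().map(ids)

# List the names that still need attention
unmatched = toy.loc[toy['scientificNameID'].isna(), 'scientificName'].tolist()
print(unmatched)
```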


In [33]:
## Add maximumDepthInMeters column

occ_df['maximumDepthInMeters'] = occ_df['minimumDepthInMeters']
occ_df.head()

Unnamed: 0,eventID,eventDate,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,decimalLatitude,decimalLongitude,dissolvedOxygen,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,maximumDepthInMeters
4563,Doc Ricketts 0011,2009-03-12T22:10:40,2FCE7413-3AD3-4B9B-952D-CE6D9715D22A,Patellogastropoda,lonny,2893.149902,36.613369,-122.435098,2.192,88.599998,34.623001,1.684,,urn:tid:mbari.org:D0011-06HD,urn:lsid:marinespecies.org:taxname:382158,382158,WoRMS,present,HumanObservation,2893.149902
428,Doc Ricketts 0019,2009-05-07T01:49:47,5815263A-A69E-48B4-A5C7-A1E17599EEC3,Dosidicus gigas,svonthun,264.170013,36.709389,-122.176214,0.946,258.399994,34.125,7.464,http://search.mbari.org/ARCHIVE/frameGrabs/Doc%20Ricketts/images/0019/05_36_01_06.jpg,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation,264.170013
5365,Doc Ricketts 0019,2009-05-07T01:49:25,5EAF101F-6EAA-44F5-8EC7-EA4CD84D4F3A,Dosidicus gigas,svonthun,260.290009,36.709432,-122.176155,0.956,219.199997,34.123001,7.471,,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation,260.290009
9590,Doc Ricketts 0019,2009-05-07T01:49:25,D3D54FAC-E00D-4509-919F-0F5871C8A3D3,Dosidicus gigas,svonthun,260.290009,36.709432,-122.176155,0.956,219.199997,34.123001,7.471,,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation,260.290009
13031,Doc Ricketts 0019,2009-05-07T01:48:49,F5546E33-118D-4893-806F-BF02AA3723B1,Dosidicus gigas,svonthun,252.029999,36.709469,-122.176063,0.991,256.0,34.118999,7.492,,urn:tid:mbari.org:D0019-06HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation,252.029999


In [52]:
## Finally, assemble associatedMedia

num = occ_df['occurrenceID'].iloc[11000]
selected = occ_df.loc[occ_df['occurrenceID'] == num]
selected

Unnamed: 0,eventID,eventDate,occurrenceID,scientificName,identifiedBy,minimumDepthInMeters,decimalLatitude,decimalLongitude,dissolvedOxygen,pressureInPsi,salinity,temperatureInCelsius,image_url,video_uri,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,maximumDepthInMeters
2578,Ventana 3506,2010-01-29T21:00:53,EF4B8ABA-C5CB-4DBA-92AA-71308FF43C98,Dosidicus gigas,linda,389.649994,36.809049,-122.178092,0.494,277.399994,34.214001,6.333,http://search.mbari.org/ARCHIVE/frameGrabs/Ventana/images/3506/02_24_52_25.png,urn:tid:mbari.org:V3506-01HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation,389.649994
940,Ventana 3506,2010-01-29T21:00:53,EF4B8ABA-C5CB-4DBA-92AA-71308FF43C98,Dosidicus gigas,linda,389.649994,36.809049,-122.178092,0.494,277.399994,34.214001,6.333,http://search.mbari.org/ARCHIVE/frameGrabs/Ventana/images/3506/02_24_52_25.jpg,urn:tid:mbari.org:V3506-01HD,urn:lsid:marinespecies.org:taxname:342291,342291,WoRMS,present,HumanObservation,389.649994


In [55]:
# Collect the unique media files for this occurrence
# (done unconditionally so both variables exist even when there is only one row)
image_files = selected['image_url'].drop_duplicates()
video_files = selected['video_uri'].drop_duplicates()

print(image_files)
video_files

2578    http://search.mbari.org/ARCHIVE/frameGrabs/Ventana/images/3506/02_24_52_25.png
940     http://search.mbari.org/ARCHIVE/frameGrabs/Ventana/images/3506/02_24_52_25.jpg
Name: image_url, dtype: object


2578    urn:tid:mbari.org:V3506-01HD
Name: video_uri, dtype: object
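
Extending this to every occurrence, a possible next step is to group by `occurrenceID` and join the unique image and video identifiers into one `associatedMedia` string. The sketch below uses a toy data frame built from rows shown in this notebook; `join_media` is a hypothetical helper. The ` | ` separator follows the Darwin Core recommendation for list-valued terms.

```python
import pandas as pd

# Toy stand-in for occ_df: one occurrence with two image files, one with no image
toy = pd.DataFrame({
    'occurrenceID': ['EF4B8ABA-C5CB-4DBA-92AA-71308FF43C98',
                     'EF4B8ABA-C5CB-4DBA-92AA-71308FF43C98',
                     'F5546E33-118D-4893-806F-BF02AA3723B1'],
    'image_url': ['http://search.mbari.org/ARCHIVE/frameGrabs/Ventana/images/3506/02_24_52_25.png',
                  'http://search.mbari.org/ARCHIVE/frameGrabs/Ventana/images/3506/02_24_52_25.jpg',
                  None],
    'video_uri': ['urn:tid:mbari.org:V3506-01HD',
                  'urn:tid:mbari.org:V3506-01HD',
                  'urn:tid:mbari.org:D0019-06HD'],
})

def join_media(group):
    """Join the unique, non-missing image and video identifiers for one occurrence."""
    files = pd.concat([group['image_url'], group['video_uri']]).dropna().drop_duplicates()
    return ' | '.join(files)

# One associatedMedia string per occurrenceID
associated_media = pd.Series(
    {occ_id: join_media(group) for occ_id, group in toy.groupby('occurrenceID')}
)
print(associated_media)
```

On the real data, the result could be attached with `occ_df['occurrenceID'].map(associated_media)`. After that, the `image_url` and `video_uri` columns could be dropped and the rows de-duplicated, leaving one row per occurrence.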

In [41]:
occ_df.shape

(12022, 20)