# Processing zooplankton data to DWC compliant files
Adrienne wrote this in 2020 for AMBON projects/Data

Majorly updated Dec 2020 after back-and-forth with OBIS coordinator Abby Benson.

This notebook converts a file in the RW project data folder to a csv with column headers that match the DWC metadata schema. More on DWC can be found at https://dwc.tdwg.org/terms/ 

The steps are, roughly:
1. Make a new notebook, set up Python pandas and other libraries as needed
2. Read in the existing csv
3. Look at it
4. List out the existing column names and match them, as best as possible, to DWC terms (the many nuances of writing metadata will come into play)
5. Rename the columns with those DWC terms for the occurrence table (see below for required fields)
6. Rename the columns for an event table (as needed?)
7. Rename the columns and data for the MeasurementOrFact table (as needed?)
8. 'Tidy' the data into long form, with one quantity observation per row, instead of the wide form with separate 'abundance' and 'biomass' columns
9. Confirm whether an event table is needed
10. Write the data frames out into csvs for packages

The requirements for outputs, for OBIS compliant DWC datasets, for this project:
- Occurrence/organism quantity table with the following fields, minimum
    - eventDate
    - eventID
    - decimalLatitude
    - decimalLongitude
    - scientificName
    - basisOfRecord (according to controlled vocab, probably HumanObservation)
    - occurrenceID
    - datasetID (a short name for the OPT occurrence table; more at https://tools.gbif.org/dwca-validator/extension.do?id=dwc:event )
    - organismQuantity (as applicable)
    - organismQuantityType (as applicable)

- MeasurementOrFact table for additional attributes/variables
    - eventID
    - occurrenceID
    - measurementType
    - measurementValue
- Event Table
    - eventID
    - eventDate
    - depth
    - decimalLatitude
    - decimalLongitude
    - basisOfRecord


**Notes**
- As of 2020, OBIS is operating on a 3 file system for each archive package: 
    - 1 file that is a flat csv/data file for occurrence, which can include one instance of the attribute organismQuantity, i.e. one measurement attribute. 
    - Another file, the MeasurementOrFact table, for multiple measurements and more information about the event.  
    - A third table is the event table; eventID needs to be included elsewhere only if this table exists.
- https://obis.org/manual 

In this project, we received 2 wide datasets per cruise, with abundance and biomass measurements in each and column headers that are not compliant at all. So each cruise needs the 3 files, as tidy long datasets.
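
The wide-to-long step can be sketched with `pandas.melt`. This is a minimal sketch on a made-up two-row frame: the `occurrenceID` strings and values are illustrative only, and the labels written into `organismQuantityType` are assumptions, not a vocabulary confirmed by OBIS.

```python
import pandas as pd

# Toy wide frame: one row per occurrence, one column per measurement.
# IDs and values are made up for illustration.
wide = pd.DataFrame({
    'occurrenceID': ['demo_1', 'demo_2'],
    'scientificName': ['Acartia longiremis', 'Pisces'],
    'Abundance_[#/m3]': [0.057, 0.014],
    'Biomass_[mg dw/m3]': [0.0002, None],  # second row has no biomass
})

# Melt to long form: one row per (occurrence, measurement) pair.
tidy = wide.melt(
    id_vars=['occurrenceID', 'scientificName'],
    value_vars=['Abundance_[#/m3]', 'Biomass_[mg dw/m3]'],
    var_name='organismQuantityType',
    value_name='organismQuantity',
)

# Swap the raw headers for readable type labels (label wording is an assumption).
tidy['organismQuantityType'] = tidy['organismQuantityType'].replace({
    'Abundance_[#/m3]': 'individuals per cubic metre',
    'Biomass_[mg dw/m3]': 'mg dry weight per cubic metre',
})

# A missing measurement carries no information in long form, so drop it.
tidy = tidy.dropna(subset=['organismQuantity']).reset_index(drop=True)
print(tidy)
```

Both measurements keep the same `occurrenceID`, so they can still be traced back to the one occurrence they came from.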

In [366]:
#get coding environment ready
#bring in pandas for manipulating columns, put no limit on the amount of columns in display
import pandas as pd
pd.options.display.max_columns = None

#bring in numpy to manipulate numbers
import numpy as np

#I need random numbers
import random

#I need to manipulate dates
import datetime

# Let's start with data: read in existing csv

In [150]:
#find the file
# %ls is a 'line magic command' for notebooks ¯\_(ツ)_/¯

%ls "../Zooplankton_505"

AMBON2017505.csv


## take a look at the file

In [367]:
#Make a dataframe, parse dates
raw = pd.read_csv('../Zooplankton_505/AMBON2017505.csv', parse_dates=['Date_Time'])

raw.head()



Unnamed: 0,Cruise,Station,Type,Date_Time,Longitude_[decimal_degrees_east],Latitude_[decimal _degrees_north],Bottom_Depth_[m],Cast_Number,Depth_[m],Accepted_Organism_Identification,Life_Stage,APHIA_ID,Abundance_[#/m3],Biomass_[mg dw/m3]
0,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56,20,Acartia longiremis,,104257,0.057,0.0002
1,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56,20,Eurytemora pacifica,,232028,0.057,0.0007
2,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56,20,Hyperia,,101796,0.014,0.0012
3,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56,20,Limacina helicina,,140223,0.014,0.0013
4,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56,20,Pseudocalanus acuspes,,104514,0.114,0.0015


In [92]:
raw.tail()


Unnamed: 0,Cruise,Station,Type,Date_Time,Longitude_[decimal_degrees_east],Latitude_[decimal _degrees_north],Bottom_Depth_[m],Cast_Number,Depth_[m],Accepted_Organism_Identification,Life_Stage,APHIA_ID,Abundance_[#/m3],Biomass_[mg dw/m3]
1788,AMBON2017,DBO3.8,BONGO_505UM_MICROSCOPY,2017-08-06 19:33:00,-168.9554,67.6645,48,1,45,Beroe,,106331,0.047,1.0295
1789,AMBON2017,DBO3.8,BONGO_505UM_MICROSCOPY,2017-08-06 19:33:00,-168.9554,67.6645,48,1,45,Parasagitta elegans,,105440,5.985,2.8788
1790,AMBON2017,DBO3.8,BONGO_505UM_MICROSCOPY,2017-08-06 19:33:00,-168.9554,67.6645,48,1,45,Euphausiacea,juvenile,1128,10.259,5.4358
1791,AMBON2017,DBO3.8,BONGO_505UM_MICROSCOPY,2017-08-06 19:33:00,-168.9554,67.6645,48,1,45,Bolinopsis infundibulum,,106939,0.404,5.4557
1792,AMBON2017,DBO3.8,BONGO_505UM_MICROSCOPY,2017-08-06 19:33:00,-168.9554,67.6645,48,1,45,Pisces,larvae,11676,0.024,


## list existing column headers, match to DWC terms

In [368]:
# what are the existing column headers?
raw.columns

Index(['Cruise', 'Station', 'Type', 'Date_Time',
       'Longitude_[decimal_degrees_east]', 'Latitude_[decimal _degrees_north]',
       'Bottom_Depth_[m]', 'Cast_Number', 'Depth_[m]',
       'Accepted_Organism_Identification', 'Life_Stage', 'APHIA_ID',
       'Abundance_[#/m3]', 'Biomass_[mg dw/m3]'],
      dtype='object')

### ok how do these column headers match up with DWC and the tables we need?

'Cruise', - a string meant to identify the cruise  
 'Station', - a string meant to identify the station (location) of occurrence  
 'Type', - the gear used for the occurrence observation, I assume?  
 'Date_Time' - needs to be ISO standard for eventDate,  
 'Longitude_[decimal_degrees_east]',  
 'Latitude_[decimal _degrees_north]',  
 'Bottom_Depth_[m]',  
 'Cast_Number', - integer, identifies the attempt the occurrence was sampled from in the station procedures?  
 'Depth_[m]', - depth of the sample/occurrence, may be different from bottom depth  
 'Accepted_Organism_Identification', - the organism as identified  
 'Life_Stage', -   
 'APHIA_ID', - integer that corresponds with the WoRMS database for the organism identified  
 'Abundance_[#/m3]', - zooplankton abundance measurement, it's count per cubic meter  
 'Biomass_[mg dw/m3]' - zooplankton biomass measurement, milligrams of dry weight per cubic meter  
 
 ### Ok what tables do I need to make with which pieces?
 
#### Occurrence Table

 'Date_Time' == eventDate,  
 'Longitude_[decimal_degrees_east]' == decimalLongitude,  
 'Latitude_[decimal _degrees_north]' == decimalLatitude,  
 'Depth_[m]' == verbatimDepth, - depth of the sample/occurrence, may be different from bottom depth  
 'Accepted_Organism_Identification' == scientificName,  
 'APHIA_ID' == taxonID, - integer that corresponds with the WoRMS database for the organism identified  
 'Life_Stage' == lifeStage,  
 'Type' == occurrenceRemarks,  
 
 
 **Plus make these columns/values**
 
 scientificNameID == 'urn:lsid:marinespecies.org:taxname:' + df['taxonID']  
 nomenclaturalCode == 'WoRMS'  
 basisOfRecord == 'HumanObservation'  
 occurrenceID = datasetID + Cruise + random number?  
 datasetID - "AMBON_Zooplankton_2017"?  
 The abundance and biomass measurements combined into a 'tidy/long' dataset, with the value in organismQuantity and organismQuantityType identifying it as abundance or biomass with the units. Like this:  
        *#Abundance files : 'Abundance_[#/m3]': 'organismQuantity',  
        #Biomass files : 'Biomass_[mg dw/m3]': 'organismQuantity' and df['organismRemarks'] = 'dry weight'*
 
 **and then drop the other columns**
 
#### MeasurementOrFact table
 (if the above table is just the data as tidy data, then there is no need for an additional table, because both measurements are included with different types)
 
OH and there are different net sizes, or at least, that's how we got the data: 150 and 505.
So, let's rowbind those.

#### event table
 'Cruise', - a string meant to identify the cruise *so, leave it?*  ~ eventID  
 'Station', - a string meant to identify the station (location) of occurrence ~ eventID

In [369]:
# dataframe processing
# make the data frame from the csv
df = raw.copy()
df.head()


Unnamed: 0,Cruise,Station,Type,Date_Time,Longitude_[decimal_degrees_east],Latitude_[decimal _degrees_north],Bottom_Depth_[m],Cast_Number,Depth_[m],Accepted_Organism_Identification,Life_Stage,APHIA_ID,Abundance_[#/m3],Biomass_[mg dw/m3]
0,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56,20,Acartia longiremis,,104257,0.057,0.0002
1,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56,20,Eurytemora pacifica,,232028,0.057,0.0007
2,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56,20,Hyperia,,101796,0.014,0.0012
3,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56,20,Limacina helicina,,140223,0.014,0.0013
4,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56,20,Pseudocalanus acuspes,,104514,0.114,0.0015


### what have we done before?? 

OK, in Benthic Epifauna 2017, Luc built an occurrenceID, an eventID, and other such good things. So let's look there and see how to build an occurrenceID here.

Looks like, from Luc Mehl:

    ```
    abund = raw_abund.copy()

    abund['parentEventID'] = 'AMBON_Epifauna_2017'
    abund['eventID'] = abund['parentEventID'] + '_' + abund['eventID']

    # add occurrenceID
    abund['occurrenceID'] = abund['eventID'] + '_' + abund['taxonID'].astype(str)

    # convert time to UTC
    abund['eventDate'] = abund['eventDate'].dt.tz_localize(tz='US/Alaska').dt.tz_convert(tz='UTC')

    abund.rename(columns={'organismQuantity': 'count_per_1000m2'}, inplace=True)
    abund.drop(columns='organismQuantityType', inplace=True)

    abund.head()  
    ```

So, he made a new column to use as a root, and then rebuilt the eventID column with the root at the front. Then he used the eventID in the occurrenceID, together with the taxonID as a string. Wow, that's a multistep process. Reading through the [OBIS Manual](https://obis.org/manual/darwincore/) for occurrence, it looks like that's the recommendation for building an occurrenceID if one doesn't exist (well, something like that is recommended).

In [370]:
#what I need is class(df[Station]) and levels. In Pandas that's...
#df['Station'].astype('category')
#hmm there are 78
#df['APHIA_ID'].astype('category')
#hmmm 70

#OK, also, unique()
#df.describe(include='all', datetime_is_numeric=True)
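
The R idiom of checking `class()` and `levels()` maps onto a few pandas calls. A minimal sketch on a made-up frame (the station names are borrowed from the data above, the rows themselves are not):

```python
import pandas as pd

demo = pd.DataFrame({
    'Station': ['ML1.2', 'ML1.2', 'DBO3.8', 'BBL1'],
    'APHIA_ID': [104257, 232028, 106331, 11676],
})

print(demo['Station'].dtype)           # the column's dtype
print(demo['Station'].nunique())       # number of distinct values
print(demo['Station'].unique())        # the distinct values, in order of first appearance
print(demo['Station'].value_counts())  # how many rows fall under each value
```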

## Enough looking, let's get the two files as one

Zooplankton has a two-file system going on because of net size.
Let's rowbind the 150 and the 505 datasets together. Luckily, their columns already look alike.

In [371]:
#Rename df to its net size

df505 = df

df505.head()

#read in the 150 data

%ls ../Zooplankton_150

raw2 = pd.read_csv('../Zooplankton_150/AMBON2017150.csv', parse_dates=['Date_Time'])

raw2.tail()

AMBON2017150.csv


Unnamed: 0,Cruise,Station,Type,Date_Time,Longitude_[decimal_degrees_east],Latitude_[decimal _degrees_north],Bottom_Depth_[m],Cast_Number,Depth_[m],Accepted_Organism_Identification,Life_Stage,APHIA_ID,Abundance_[#/m3],Biomass_[mg dw/m3]
2931,AMBON2017,DBO3.6,TWINRING_150UM_MICROSCOPY,2017-08-22 03:47:00,-168.2368,67.897,59,,56,Parasagitta elegans,,105440.0,0.998,2.0528
2932,AMBON2017,DBO3.6,TWINRING_150UM_MICROSCOPY,2017-08-22 03:47:00,-168.2368,67.897,59,,56,Calanus glacialis/marshallae,,196770.0,10.112,2.8979
2933,AMBON2017,DBO3.6,TWINRING_150UM_MICROSCOPY,2017-08-22 03:47:00,-168.2368,67.897,59,,56,Neocalanus plumchrus,,196772.0,7.983,5.0574
2934,AMBON2017,DBO3.6,TWINRING_150UM_MICROSCOPY,2017-08-07 01:39:00,-168.2368,67.897,59,,56,Decapoda,megalopa,1130.0,2.195,5.0972
2935,AMBON2017,DBO3.6,TWINRING_150UM_MICROSCOPY,2017-08-07 01:39:00,-168.2368,67.897,59,,56,Neocalanus cristatus,,104470.0,1.064,5.0994


In [372]:
#rowbind these two csvs together
#first make the csvs raw copies
df150 = raw2.copy()

df150.head()



Unnamed: 0,Cruise,Station,Type,Date_Time,Longitude_[decimal_degrees_east],Latitude_[decimal _degrees_north],Bottom_Depth_[m],Cast_Number,Depth_[m],Accepted_Organism_Identification,Life_Stage,APHIA_ID,Abundance_[#/m3],Biomass_[mg dw/m3]
0,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Pisces,larvae,11676.0,1.172,
1,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Isopoda,,1131.0,1.172,
2,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Evadne nordmanni,,106273.0,1.172,0.0086
3,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Pseudocalanus newmani,,157679.0,4.686,0.0121
4,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Pseudocalanus acuspes,,104514.0,4.686,0.0335


In [373]:
#so now we have df150 and df505, let's concat

df = pd.concat([df150, df505])
#df[2933:2942]

# Yay for one big dataframe. The index numbers start all over again because concat keeps each frame's original index.
#resetting the index fixes that

df.reset_index(drop=True, inplace=True)
df[2933:2942]


Unnamed: 0,Cruise,Station,Type,Date_Time,Longitude_[decimal_degrees_east],Latitude_[decimal _degrees_north],Bottom_Depth_[m],Cast_Number,Depth_[m],Accepted_Organism_Identification,Life_Stage,APHIA_ID,Abundance_[#/m3],Biomass_[mg dw/m3]
2933,AMBON2017,DBO3.6,TWINRING_150UM_MICROSCOPY,2017-08-22 03:47:00,-168.2368,67.897,59,,56,Neocalanus plumchrus,,196772.0,7.983,5.0574
2934,AMBON2017,DBO3.6,TWINRING_150UM_MICROSCOPY,2017-08-07 01:39:00,-168.2368,67.897,59,,56,Decapoda,megalopa,1130.0,2.195,5.0972
2935,AMBON2017,DBO3.6,TWINRING_150UM_MICROSCOPY,2017-08-07 01:39:00,-168.2368,67.897,59,,56,Neocalanus cristatus,,104470.0,1.064,5.0994
2936,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56.0,20,Acartia longiremis,,104257.0,0.057,0.0002
2937,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56.0,20,Eurytemora pacifica,,232028.0,0.057,0.0007
2938,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56.0,20,Hyperia,,101796.0,0.014,0.0012
2939,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56.0,20,Limacina helicina,,140223.0,0.014,0.0013
2940,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56.0,20,Pseudocalanus acuspes,,104514.0,0.114,0.0015
2941,AMBON2017,ML1.2,BONGO_505UM_MICROSCOPY,2017-08-17 18:22:00,-163.0342,70.1434,23,56.0,20,Gammaridae,,101383.0,0.327,0.002
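
On the index question raised in the cell above: `pd.concat` keeps each input frame's own row labels by default, so the labels restart at 0 where the second frame begins. Passing `ignore_index=True` renumbers the rows in one step, which has the same effect as the `reset_index(drop=True)` call. A minimal sketch with two made-up frames:

```python
import pandas as pd

a = pd.DataFrame({'Station': ['BBL1', 'BBL1'], 'net': [150, 150]})
b = pd.DataFrame({'Station': ['ML1.2', 'ML1.2'], 'net': [505, 505]})

stacked = pd.concat([a, b])                        # row labels are 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)  # row labels are 0, 1, 2, 3

print(stacked.index.tolist())
print(renumbered.index.tolist())
```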


## EventDate as ISO format

In [374]:
#let's get the eventDate looking ISO_perfect
df.head()

list = df['Date_Time']
list

0      2017-08-20 22:48:00
1      2017-08-20 22:48:00
2      2017-08-20 22:48:00
3      2017-08-20 22:48:00
4      2017-08-20 22:48:00
               ...        
4724   2017-08-06 19:33:00
4725   2017-08-06 19:33:00
4726   2017-08-06 19:33:00
4727   2017-08-06 19:33:00
4728   2017-08-06 19:33:00
Name: Date_Time, Length: 4729, dtype: datetime64[ns]

In [375]:
#ISO-format one value first as a test; a pandas Timestamp has isoformat()
test = list[1]
test = test.isoformat()
print(test)
test

2017-08-20T22:48:00


'2017-08-20T22:48:00'

In [376]:
storage = []
for x in list:
    x=x.isoformat()
    print(x)
    storage.append(x)

    #reassigning x inside the loop only rebinds the loop variable and leaves the Series untouched, so the results go into another list

2017-08-20T22:48:00
2017-08-20T22:48:00
2017-08-20T22:48:00
...


In [377]:
#done: store the ISO strings as the eventDate column
df['eventDate']=storage
df.head()

Unnamed: 0,Cruise,Station,Type,Date_Time,Longitude_[decimal_degrees_east],Latitude_[decimal _degrees_north],Bottom_Depth_[m],Cast_Number,Depth_[m],Accepted_Organism_Identification,Life_Stage,APHIA_ID,Abundance_[#/m3],Biomass_[mg dw/m3],eventDate
0,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Pisces,larvae,11676.0,1.172,,2017-08-20T22:48:00
1,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Isopoda,,1131.0,1.172,,2017-08-20T22:48:00
2,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Evadne nordmanni,,106273.0,1.172,0.0086,2017-08-20T22:48:00
3,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Pseudocalanus newmani,,157679.0,4.686,0.0121,2017-08-20T22:48:00
4,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Pseudocalanus acuspes,,104514.0,4.686,0.0335,2017-08-20T22:48:00
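
The loop above can also be written as a single vectorized call. A minimal sketch, assuming a datetime64 column with whole-second timestamps like `Date_Time` (the two timestamps are copied from the output above):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2017-08-20 22:48:00', '2017-08-06 19:33:00']))

# Format the whole column at once; the result is a Series of ISO 8601 strings.
iso = dates.dt.strftime('%Y-%m-%dT%H:%M:%S')
print(iso.tolist())
```

As in the loop version, the strings carry no time zone designator, because the source timestamps have none.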


## eventID and datasetID and occurrenceID for all the tables

So my understanding is.....
One occurrenceID for each row as it is here, in the untidy data, because 2 measurements were taken of a given species from the 'station stop' on the cruise. Then reference that occurrenceID in the long tidy MeasurementOrFact table.


In [379]:
#let's do two eggs in one nest with the datasetID and occurrenceID stuff
#first, datasetID
df['datasetID'] = 'AMBON_Zooplankton_2017'

#occurrenceID -- only the prefix for now, I'll have to tackle the unique part separately
df['occurrenceID'] =df['datasetID']+'_'+df['Station']+'_' #+str(random.sample(numbers, k=1)) <- not working to make unique occurrenceIDs
#ok this didn't work -> df['occurrenceID'] = df['datasetID']+'_'+df['APHIA_ID'].astype(str), circle back to that later

#then eventID
df['eventID'] = df['datasetID']+'_'+df['Station']+'_'+df['eventDate']

In [380]:
#How many rows are in this dataframe?
#df.describe()
df.tail()
#

Unnamed: 0,Cruise,Station,Type,Date_Time,Longitude_[decimal_degrees_east],Latitude_[decimal _degrees_north],Bottom_Depth_[m],Cast_Number,Depth_[m],Accepted_Organism_Identification,Life_Stage,APHIA_ID,Abundance_[#/m3],Biomass_[mg dw/m3],eventDate,datasetID,occurrenceID,eventID
4724,AMBON2017,DBO3.8,BONGO_505UM_MICROSCOPY,2017-08-06 19:33:00,-168.9554,67.6645,48,1.0,45,Beroe,,106331.0,0.047,1.0295,2017-08-06T19:33:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_DBO3.8_,AMBON_Zooplankton_2017_DBO3.8_2017-08-06T19:33:00
4725,AMBON2017,DBO3.8,BONGO_505UM_MICROSCOPY,2017-08-06 19:33:00,-168.9554,67.6645,48,1.0,45,Parasagitta elegans,,105440.0,5.985,2.8788,2017-08-06T19:33:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_DBO3.8_,AMBON_Zooplankton_2017_DBO3.8_2017-08-06T19:33:00
4726,AMBON2017,DBO3.8,BONGO_505UM_MICROSCOPY,2017-08-06 19:33:00,-168.9554,67.6645,48,1.0,45,Euphausiacea,juvenile,1128.0,10.259,5.4358,2017-08-06T19:33:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_DBO3.8_,AMBON_Zooplankton_2017_DBO3.8_2017-08-06T19:33:00
4727,AMBON2017,DBO3.8,BONGO_505UM_MICROSCOPY,2017-08-06 19:33:00,-168.9554,67.6645,48,1.0,45,Bolinopsis infundibulum,,106939.0,0.404,5.4557,2017-08-06T19:33:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_DBO3.8_,AMBON_Zooplankton_2017_DBO3.8_2017-08-06T19:33:00
4728,AMBON2017,DBO3.8,BONGO_505UM_MICROSCOPY,2017-08-06 19:33:00,-168.9554,67.6645,48,1.0,45,Pisces,larvae,11676.0,0.024,,2017-08-06T19:33:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_DBO3.8_,AMBON_Zooplankton_2017_DBO3.8_2017-08-06T19:33:00


In [381]:
#OK, rename and then add new columns
#rename columns as necessary

df.rename(columns={'Depth_[m]': 'verbatimDepth',
                   'Life_Stage': 'lifeStage',
                   'Type': 'occurrenceRemarks',
                   'Longitude_[decimal_degrees_east]': 'decimalLongitude',
                   'Latitude_[decimal _degrees_north]': 'decimalLatitude',
                   'Accepted_Organism_Identification': 'scientificName',
                   'APHIA_ID': 'taxonID' }, inplace=True)

# add new parameters

df['scientificNameID'] = 'urn:lsid:marinespecies.org:taxname:'+df['taxonID'].astype(str)
df['nomenclaturalCode'] = 'WoRMS'
df['basisOfRecord'] = 'HumanObservation' #nospace!

df.head()

Unnamed: 0,Cruise,Station,occurrenceRemarks,Date_Time,decimalLongitude,decimalLatitude,Bottom_Depth_[m],Cast_Number,verbatimDepth,scientificName,lifeStage,taxonID,Abundance_[#/m3],Biomass_[mg dw/m3],eventDate,datasetID,occurrenceID,eventID,scientificNameID,nomenclaturalCode,basisOfRecord
0,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Pisces,larvae,11676.0,1.172,,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:11676.0,WoRMS,HumanObservation
1,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Isopoda,,1131.0,1.172,,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:1131.0,WoRMS,HumanObservation
2,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Evadne nordmanni,,106273.0,1.172,0.0086,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:106273.0,WoRMS,HumanObservation
3,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Pseudocalanus newmani,,157679.0,4.686,0.0121,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:157679.0,WoRMS,HumanObservation
4,AMBON2017,BBL1,TWINRING_150UM_MICROSCOPY,2017-08-20 22:48:00,-163.5095,69.3443,15,,12,Pseudocalanus acuspes,,104514.0,4.686,0.0335,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:104514.0,WoRMS,HumanObservation


In [382]:
# Remove unnecessary columns
# In this case, the 2015 DWC data does not include the biomass measurement and abundance measurement together, it's separated into two spreadsheets

df = df.drop(columns=['Cruise', 'Station', 'Bottom_Depth_[m]', 'Date_Time','Cast_Number'])

df.head()

Unnamed: 0,occurrenceRemarks,decimalLongitude,decimalLatitude,verbatimDepth,scientificName,lifeStage,taxonID,Abundance_[#/m3],Biomass_[mg dw/m3],eventDate,datasetID,occurrenceID,eventID,scientificNameID,nomenclaturalCode,basisOfRecord
0,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pisces,larvae,11676.0,1.172,,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:11676.0,WoRMS,HumanObservation
1,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Isopoda,,1131.0,1.172,,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:1131.0,WoRMS,HumanObservation
2,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Evadne nordmanni,,106273.0,1.172,0.0086,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:106273.0,WoRMS,HumanObservation
3,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pseudocalanus newmani,,157679.0,4.686,0.0121,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:157679.0,WoRMS,HumanObservation
4,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pseudocalanus acuspes,,104514.0,4.686,0.0335,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:104514.0,WoRMS,HumanObservation


In [383]:
#this needs to happen to all of the data, post stacking

#taxonID needs to not have the trailing .0
#use an anchored regex so only the literal '.0' suffix is removed
#(str.strip('.0') would also eat trailing zeros of the ID itself, e.g. '1130.0' -> '113')
taxons = df[['taxonID']].astype('string', errors='ignore')
t=taxons['taxonID'].convert_dtypes()
t=t.str.replace(r'\.0$', '', regex=True)
df['taxonID']=t
df.head()

#scientificNameID needs this too
sciids = df[['scientificNameID']].astype('string', errors='ignore')
s = sciids['scientificNameID'].convert_dtypes()
s=s.str.replace(r'\.0$', '', regex=True)
df['scientificNameID']=s
df.head()

Unnamed: 0,occurrenceRemarks,decimalLongitude,decimalLatitude,verbatimDepth,scientificName,lifeStage,taxonID,Abundance_[#/m3],Biomass_[mg dw/m3],eventDate,datasetID,occurrenceID,eventID,scientificNameID,nomenclaturalCode,basisOfRecord
0,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pisces,larvae,11676,1.172,,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:11676,WoRMS,HumanObservation
1,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Isopoda,,1131,1.172,,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:1131,WoRMS,HumanObservation
2,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Evadne nordmanni,,106273,1.172,0.0086,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:106273,WoRMS,HumanObservation
3,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pseudocalanus newmani,,157679,4.686,0.0121,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:157679,WoRMS,HumanObservation
4,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pseudocalanus acuspes,,104514,4.686,0.0335,2017-08-20T22:48:00,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,urn:lsid:marinespecies.org:taxname:104514,WoRMS,HumanObservation
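
One caution on removing the trailing `.0`: `str.strip('.0')` treats its argument as a set of characters and removes every leading or trailing `.` and `0`, so an ID that itself ends in zero loses digits. Anchoring a regular expression to the end of the string removes only the literal suffix. A minimal sketch using AphiaIDs that appear in the data above (11676, 1130 and 196770):

```python
import pandas as pd

ids = pd.Series(['11676.0', '1130.0', '196770.0'], dtype='string')

stripped = ids.str.strip('.0')                          # character-set strip
suffix_only = ids.str.replace(r'\.0$', '', regex=True)  # literal '.0' at the end only

print(stripped.tolist())
print(suffix_only.tolist())
```

Casting the numeric column to the nullable `Int64` dtype before converting it to strings is another way to avoid the `.0` in the first place, assuming every non-missing value is a whole number.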


In [384]:
#double checking
df.columns

Index(['occurrenceRemarks', 'decimalLongitude', 'decimalLatitude',
       'verbatimDepth', 'scientificName', 'lifeStage', 'taxonID',
       'Abundance_[#/m3]', 'Biomass_[mg dw/m3]', 'eventDate', 'datasetID',
       'occurrenceID', 'eventID', 'scientificNameID', 'nomenclaturalCode',
       'basisOfRecord'],
      dtype='object')

# Now for table 1 - event table

## Event table needs:

Event Table  

eventID == 'eventID'   
EventDate== 'eventDate'  
Depth == 'verbatimDepth'  
decimal Lat== 'decimalLatitude'  
decimal Long == 'decimalLongitude'  
basisofRecord== 'basisOfRecord'  

Without repeated or duplicate events, so filtered down to unique eventIDs. In pandas that's...

In [385]:
eventdf = df[['eventID', 'eventDate', 'verbatimDepth', 'decimalLatitude', 'decimalLongitude','basisOfRecord']]
eventdf.head()
eventdf['eventID'].describe()

#and, that's that? Check!

count                                                  4729
unique                                                  154
top       AMBON_Zooplankton_2017_DBO3.8_2017-08-06T18:14:00
freq                                                     54
Name: eventID, dtype: object

In [386]:
#The event table needs to only hold those 154 unique events, so....
eventdf.head()
eventdf = eventdf.drop_duplicates(subset=['eventID'])

eventdf['eventID'].describe()

count                                                154
unique                                               154
top       AMBON_Zooplankton_2017_CL1_2017-08-08T12:26:00
freq                                                   1
Name: eventID, dtype: object
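
Since `drop_duplicates(subset=['eventID'])` keeps the first row for each eventID, it may be worth confirming that the other event columns do not vary within an event before trusting that first row. A minimal sketch on made-up rows (the IDs and depths are illustrative, and the check itself is a suggestion, not part of the original workflow):

```python
import pandas as pd

events = pd.DataFrame({
    'eventID': ['ev_A', 'ev_A', 'ev_B', 'ev_B'],
    'verbatimDepth': [20, 20, 45, 12],
    'decimalLatitude': [70.1434, 70.1434, 67.6645, 67.6645],
})

# Count distinct values per column within each event; anything above 1 is a conflict.
per_event = events.groupby('eventID').nunique()
conflicts = per_event[(per_event > 1).any(axis=1)]
print(conflicts)
```

Here `ev_B` is flagged because it carries two different depths.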

# Now for table 2 - Occurrence

The occurrence table needs
- Occurrence/organism quantity table with the following fields, minimum
    - eventDate 
    - eventID
    - decimalLatitude
    - decimalLongitude
    - scientificName
    - basisOfRecord (according to controlled vocab, probably HumanObservation)
    - occurrenceID
    - datasetID (a short name for the OPT occurrence table)
    - organismQuantity (as applicable)
    - organismQuantityType (as applicable)
    
The tricky part here is the occurrenceID. Each occurrenceID should be unique to each row in the occurrence table. The recommendation is to build it out of the eventID and taxonID, or else just make one up, such as a UUID.
It also identifies a unique occurrence but NOT a unique measurement. So abundance and biomass come off the same occurrence.
So I'm not tidying the dataset yet. 
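
One deterministic way to get per-row IDs, shown here as a hedged sketch rather than the method this notebook uses: number the rows within each event with `groupby(...).cumcount()` and append that counter to the eventID. Rerunning on the same row order reproduces the same IDs. The frame below is made up; `uuid.uuid4()` from the standard library would be another option if reproducibility is not needed.

```python
import pandas as pd

occ = pd.DataFrame({
    'eventID': ['ev_A', 'ev_A', 'ev_A', 'ev_B'],
    'taxonID': ['104257', '104257', '101796', '11676'],  # same taxon twice in ev_A
})

# Running counter within each event: 1, 2, 3 for ev_A and 1 for ev_B.
counter = occ.groupby('eventID').cumcount() + 1

occ['occurrenceID'] = occ['eventID'] + '_' + counter.astype(str)

print(occ['occurrenceID'].tolist())
assert occ['occurrenceID'].is_unique
```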


### Make unique occurrenceIDs 

again, while the data isn't tidy

In [396]:
#recommended - build from taxonid and eventid
# newIDS = occdf['eventID']+'_'+occdf['taxonID']
# newIDS[0:20]

# pd.unique(newIDS)
# But that doesn't work for this dataset


#let's just do a random 4 digit number? not an elegant solution, but I'll take it.
numbers = random.sample(range(1000,9999), 4729)
numbers[0:10]
len(numbers)
#success, have list of random 4 digit numbers.
df['occurrenceID'].describe()

#next, concatenate

#loop?

newIDS = []
i = 0
for x in df['occurrenceID']:
     x = x+str(numbers[i])
     #print(x)
     i = i+1
     newIDS.append(x)

newIDS[0:20]

#     #sidenote- still don't know why I can't change list values 'in-place' --> need to learn
df['occurrenceID'] = newIDS

df['occurrenceID'].describe()

#I guess that works


count                                   4729
unique                                  4729
top       AMBON_Zooplankton_2017_DBO3.7_8688
freq                                       1
Name: occurrenceID, dtype: object
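
The random suffix above gives unique IDs, but a rerun of the notebook would produce different ones. As a rough sketch of a reproducible alternative (not what was submitted for this dataset), the ID could be built from eventID, taxonID and a running count for repeated pairs. The toy values and the helper name `make_occurrence_ids` are made up for illustration, and the approach assumes the row order of the source file stays stable.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe; values are made up for illustration
toy = pd.DataFrame({
    "eventID": ["EV1", "EV1", "EV1", "EV2"],
    "taxonID": ["104514", "104514", "1131", None],
})


def make_occurrence_ids(frame):
    """Hypothetical helper: eventID + taxonID + running count within that pair."""
    # Missing taxon IDs get a placeholder so they can still be grouped and joined
    taxon = frame["taxonID"].fillna("unknown")
    # cumcount numbers repeated (eventID, taxonID) rows 0, 1, 2, ... in row order
    repeat = frame.groupby([frame["eventID"], taxon]).cumcount() + 1
    return frame["eventID"] + "_" + taxon + "_" + repeat.astype(str)


toy["occurrenceID"] = make_occurrence_ids(toy)
print(toy["occurrenceID"].tolist())
print(toy["occurrenceID"].is_unique)
```

The running count is what keeps rows apart when the same taxon shows up more than once in an event, which is presumably why eventID plus taxonID alone was not enough here.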

In [397]:
#build the dataframe

occdf = df[['datasetID','eventID','occurrenceID', 'eventDate', 'verbatimDepth', 'decimalLatitude', 'decimalLongitude', 'scientificName', 'scientificNameID', 'taxonID','nomenclaturalCode', 'basisOfRecord']]

occdf.describe()



Unnamed: 0,verbatimDepth,decimalLatitude,decimalLongitude
count,4729.0,4729.0,4729.0
mean,40.226263,70.464017,-162.593389
std,8.212902,1.194594,23.136981
min,12.0,67.6645,-168.9554
25%,38.0,69.9114,-166.4237
50%,42.0,70.7718,-163.9288
75%,45.0,71.3222,-162.2024
max,56.0,72.4969,159.4106


In [398]:
occdf.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4729 entries, 0 to 4728
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   datasetID          4729 non-null   object 
 1   eventID            4729 non-null   object 
 2   occurrenceID       4729 non-null   object 
 3   eventDate          4729 non-null   object 
 4   verbatimDepth      4729 non-null   int64  
 5   decimalLatitude    4729 non-null   float64
 6   decimalLongitude   4729 non-null   float64
 7   scientificName     4729 non-null   object 
 8   scientificNameID   4729 non-null   string 
 9   taxonID            4727 non-null   string 
 10  nomenclaturalCode  4729 non-null   object 
 11  basisOfRecord      4729 non-null   object 
dtypes: float64(2), int64(1), object(7), string(2)
memory usage: 443.5+ KB


In [399]:
#where are the messy values in scientificNameID and taxonID?
missing_id = occdf['scientificNameID'] == 'urn:lsid:marinespecies.org:taxname:nan'
occdf[missing_id]
#rows 2351 (occurrence: _BBL9_7189) and 2905 (_DBO3.6_9137)

#blank them out using the mask rather than hard-coding the two row labels
occdf.loc[missing_id, 'scientificNameID'] = np.nan

#That's the best I can do, I think, this not really being my data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)
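
The warning above most likely shows up because occdf was created by selecting columns from df, so pandas cannot tell whether the assignment is meant for occdf or for df. A minimal sketch of one common way to avoid it, on a made-up toy frame (the names `toy` and `toy_occ` are illustrative only): take an explicit `.copy()` when building the subset.

```python
import numpy as np
import pandas as pd

# Toy frame; the second ID mimics the 'taxname:nan' placeholder seen above
toy = pd.DataFrame({
    "scientificName": ["Isopoda", "Unknown"],
    "scientificNameID": [
        "urn:lsid:marinespecies.org:taxname:1131",
        "urn:lsid:marinespecies.org:taxname:nan",
    ],
    "extra": [1, 2],
})

# .copy() makes the subset an independent dataframe, so later assignments are unambiguous
toy_occ = toy[["scientificName", "scientificNameID"]].copy()

# Select the placeholder rows with a boolean mask and blank them out
bad = toy_occ["scientificNameID"].str.endswith(":nan")
toy_occ.loc[bad, "scientificNameID"] = np.nan

print(toy_occ["scientificNameID"].isna().sum())  # rows blanked in the copy
print(toy["scientificNameID"].isna().sum())      # the source frame is untouched
```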


## Make the MeasurementOrFact table long and tidy

So the occurrence table is now bereft of actual specimen information; it references out to this table.
(It could hold one measurement of that information, but in this particular case study it made more sense not to.)

What I need is melt, so in pandas that's....
melt. ^_^

So abundance and biomass get stacked on top of each other: one column names the quantity type (abundance, or biomass as dry weight) and another holds the value. These play the role of 'organismQuantityType' and 'organismQuantity', and end up as measurementType and measurementValue in the MoF table.

MoFTable needs:
- eventID
- occurrenceID
- measurementType
- measurementValue
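
To make the wide-to-long step concrete, here is a small sketch of melt on a made-up two-row frame (the IDs and numbers are illustrative, not from the AMBON data). It also uses the `var_name` and `value_name` arguments, which is one way to get the MoF column names directly.

```python
import pandas as pd

# Wide form: one row per occurrence, one column per measurement (made-up values)
wide = pd.DataFrame({
    "occurrenceID": ["occ_1", "occ_2"],
    "Abundance_[#/m3]": [1.172, 4.686],
    "Biomass_[mg dw/m3]": [0.5, 0.25],
})

# Long form: one row per occurrence per measurement
long = wide.melt(
    id_vars=["occurrenceID"],
    value_vars=["Abundance_[#/m3]", "Biomass_[mg dw/m3]"],
    var_name="measurementType",
    value_name="measurementValue",
)

print(long)
```

The result has twice as many rows as the wide frame: all abundance rows first, then all biomass rows, each still carrying its occurrenceID.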


In [400]:
#OK, let's make this a dataset of as much of the information as possible
#DataFrame.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
# melt has no 'in place' option: it returns a new dataframe, so assign the result to a new name

tidydf = df.melt(id_vars=['datasetID','occurrenceID','eventID','eventDate','occurrenceRemarks','decimalLongitude','decimalLatitude','verbatimDepth','scientificName','lifeStage','taxonID','scientificNameID','nomenclaturalCode','basisOfRecord'],
        value_vars=['Abundance_[#/m3]', 'Biomass_[mg dw/m3]'])

tidydf.head()

Unnamed: 0,datasetID,occurrenceID,eventID,eventDate,occurrenceRemarks,decimalLongitude,decimalLatitude,verbatimDepth,scientificName,lifeStage,taxonID,scientificNameID,nomenclaturalCode,basisOfRecord,variable,value
0,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_4978,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,2017-08-20T22:48:00,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pisces,larvae,11676,urn:lsid:marinespecies.org:taxname:11676,WoRMS,HumanObservation,Abundance_[#/m3],1.172
1,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_9686,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,2017-08-20T22:48:00,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Isopoda,,1131,urn:lsid:marinespecies.org:taxname:1131,WoRMS,HumanObservation,Abundance_[#/m3],1.172
2,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_9198,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,2017-08-20T22:48:00,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Evadne nordmanni,,106273,urn:lsid:marinespecies.org:taxname:106273,WoRMS,HumanObservation,Abundance_[#/m3],1.172
3,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_9586,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,2017-08-20T22:48:00,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pseudocalanus newmani,,157679,urn:lsid:marinespecies.org:taxname:157679,WoRMS,HumanObservation,Abundance_[#/m3],4.686
4,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_3460,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,2017-08-20T22:48:00,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pseudocalanus acuspes,,104514,urn:lsid:marinespecies.org:taxname:104514,WoRMS,HumanObservation,Abundance_[#/m3],4.686


In [401]:
# I will take that, let's rename 'variable' and 'value'
tidydf.rename(columns={'variable': 'measurementType',
                   'value': 'measurementValue'}, inplace= True)
tidydf.head()

#tidydf.columns


Unnamed: 0,datasetID,occurrenceID,eventID,eventDate,occurrenceRemarks,decimalLongitude,decimalLatitude,verbatimDepth,scientificName,lifeStage,taxonID,scientificNameID,nomenclaturalCode,basisOfRecord,measurementType,measurementValue
0,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_4978,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,2017-08-20T22:48:00,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pisces,larvae,11676,urn:lsid:marinespecies.org:taxname:11676,WoRMS,HumanObservation,Abundance_[#/m3],1.172
1,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_9686,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,2017-08-20T22:48:00,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Isopoda,,1131,urn:lsid:marinespecies.org:taxname:1131,WoRMS,HumanObservation,Abundance_[#/m3],1.172
2,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_9198,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,2017-08-20T22:48:00,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Evadne nordmanni,,106273,urn:lsid:marinespecies.org:taxname:106273,WoRMS,HumanObservation,Abundance_[#/m3],1.172
3,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_9586,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,2017-08-20T22:48:00,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pseudocalanus newmani,,157679,urn:lsid:marinespecies.org:taxname:157679,WoRMS,HumanObservation,Abundance_[#/m3],4.686
4,AMBON_Zooplankton_2017,AMBON_Zooplankton_2017_BBL1_3460,AMBON_Zooplankton_2017_BBL1_2017-08-20T22:48:00,2017-08-20T22:48:00,TWINRING_150UM_MICROSCOPY,-163.5095,69.3443,12,Pseudocalanus acuspes,,104514,urn:lsid:marinespecies.org:taxname:104514,WoRMS,HumanObservation,Abundance_[#/m3],4.686


In [411]:
tidydf.describe()
tidydf['occurrenceID'].describe()

count                                   9458
unique                                  4729
top       AMBON_Zooplankton_2017_DBO3.7_8688
freq                                       2
Name: occurrenceID, dtype: object
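
The describe output above (9458 rows, 4729 unique IDs, top frequency 2) suggests every occurrence appears once per measurement. A small sketch of how that could be checked explicitly, shown on a made-up long-form frame rather than the real tidydf:

```python
import pandas as pd

# Made-up long-form frame with two measurement types per occurrence
tidy_toy = pd.DataFrame({
    "occurrenceID": ["occ_1", "occ_2", "occ_1", "occ_2"],
    "measurementType": [
        "Abundance_[#/m3]", "Abundance_[#/m3]",
        "Biomass_[mg dw/m3]", "Biomass_[mg dw/m3]",
    ],
    "measurementValue": [1.172, 4.686, 0.5, 0.25],
})

# Each occurrence should carry exactly two distinct measurement types
types_per_occurrence = tidy_toy.groupby("occurrenceID")["measurementType"].nunique()
all_have_two = bool((types_per_occurrence == 2).all())

# No (occurrenceID, measurementType) pair should be repeated
has_duplicates = bool(tidy_toy.duplicated(subset=["occurrenceID", "measurementType"]).any())

print(all_have_two, has_duplicates)
```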

## So, next is....

...submit package?? Ruh roh

## Write out the dataframes to new csv files

In [412]:
#Event Table
eventdf.to_csv('AMBON2017_Event_DWC.csv')
#Occurrence table
occdf.to_csv('AMBON2017_Occurrence_DWC.csv')
#MoF table
tidydf.to_csv('AMBON2017_MoF_DWC.csv')
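
One thing worth checking before packaging: by default `to_csv` also writes the dataframe index as a first, unnamed column, which may not be wanted in a DWC file. A small sketch of the difference, using an in-memory buffer and a made-up event frame instead of the real files:

```python
import io

import pandas as pd

# Made-up event rows, for illustration only
toy_event = pd.DataFrame({"eventID": ["EV1", "EV2"], "verbatimDepth": [12, 38]})

# Default behaviour: the index is written as an extra leading column
with_index = io.StringIO()
toy_event.to_csv(with_index)

# index=False keeps only the named columns
without_index = io.StringIO()
toy_event.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))
```

If the extra column is not wanted, passing `index=False` to each of the three `to_csv` calls above would drop it.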

In [413]:
%ls ./

ls: cannot access './2017zooplanton_to_dwc_.ipynb': No such file or directory
2017zooplanton_to_dwc_.ipynb  AMBON2017_Occurrence_DWC.csv
AMBON2017_Event_DWC.csv       AMBON2017_Zooplankton_DWC.csv
AMBON2017_MoF_DWC.csv


# *Fin*