# PISCO - size frequency data

Body size measurements of targeted invertebrate species are recorded by divers both along benthic transects and at random locations within a study site.

Measured sizes correspond with:
- test diameter for urchins
- length of longest arm for seastars
- shell length for shelled mollusks
- carapace length for lobsters
- total turgid length for sea cucumbers

In the case of urchins sampled by UCSB and VRG in Southern California, large numbers of individuals may be collected in bags and brought aboard the research vessel to facilitate measurement.

**It sounds like some of the animals with recorded sizes here may also have recorded sizes in the swath table. Is that correct? If so, what is the best way to handle it?**
    
**Resources:** <br>
https://opc.dataone.org/view/MLPA_kelpforest.metadata.1

In [41]:
## Imports

import pandas as pd
import numpy as np

from datetime import datetime # for handling dates

In [42]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [125]:
## Load data

filename = 'MLPA_kelpforest_sizefreq.1.csv'
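# Note: 'ANSI' is a Windows-only codec alias in Python; on other platforms an explicit code page such as 'cp1252' may be needed
data = pd.read_csv(filename, encoding='ANSI', dtype={'transect':str, 'site_name_old':str})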

print(data.shape)
data.head()

(80977, 18)


Unnamed: 0,campus,method,survey_year,year,month,day,site,location,zone,transect,classcode,count,size,disease,depth,observer,notes,site_name_old
0,UCSC,SBTL_SWATH_PISCO,2014,2014,8,8,HOPKINS_UC,TRANSECT,INNER,1,APOCAL,1,7.0,,4.6,COLIN GAYLORD,,
1,UCSC,SBTL_SWATH_PISCO,2018,2018,8,15,SAUNDERS_REFERENCE_1,TRANSECT,OUTER,2,ASTSPP,1,10.0,,20.2,MICHAEL LANGHANS,,
2,UCSC,SBTL_SWATH_PISCO,2014,2014,7,3,MACABEE_DC,TRANSECT,OUTER,2,DERIMB,2,12.0,HEALTHY,12.3,TRISTIN MCHUGH,SIZE 10-14 CM,
3,UCSC,SBTL_SWATH_PISCO,2014,2014,7,16,HOPKINS_DC,TRANSECT,MID,1,DERIMB,1,12.0,HEALTHY,8.8,TRISTIN MCHUGH,SIZE 10-14 CM,
4,UCSC,SBTL_SWATH_PISCO,2014,2014,7,16,HOPKINS_DC,TRANSECT,MID,1,DERIMB,1,17.0,HEALTHY,8.8,TRISTIN MCHUGH,SIZE <=15 CM,


### Column definitions

**campus** = UCSC, UCSB, HSU or VRG. The academic partner campus that conducted the survey. <br>
**method** = SBTL_SWATH_PISCO, SBTL_SWATH_HSU or SBTL_SWATH_VRG. The code describing the sampling technique and monitoring program that conducted each survey. **How is this different than the previous column? Does it actually indicate further methodological differences?** <br>
**survey_year** = 2003 - 2018. The designated year associated with the annual survey. In most cases, survey_year and year are the same. In rare cases, surveys are completed early in the year following the designated survey year. In these cases, survey_year will differ from year. <br>
**year** = 2003 - 2018. Year that the survey was conducted. <br>
**month** = 1 - 12. Month that the survey was conducted. <br>
**day** = 1 - 31. Day that the survey was conducted. <br>
**site** = One of 350 site codes. The unique site where the survey was performed (as defined in the site table). This site refers to a specific GPS location and is often associated with a geographic placename. Often, multiple site replicates will be associated with a single placename, and will be delineated with additional geographical or directional information (e.g. North/South/East/West/Central - N/S/E/W/CEN, Upcoast/Downcoast - UC/DC) <br>
**location** = RANDOM, TRANSECT or SIZE_TRANSECT. The location where the size observation was recorded. **Note that sizes recorded as part of PISCO and HSU swath surveys are duplicated in this dataset. These duplicated records should have method = SBTL_SWATH_PISCO or SBTL_SWATH_HSU.**
- RANDOM: sizes were measured across the general site and were not specifically located on the swath/upc transect
- TRANSECT: sizes were measured on the swath/upc transects
- SIZE_TRANSECT: applies to HSU only where specific size frequency transects are conducted separately from swath/upc transects
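
One way to keep track of the duplicated swath records is to flag them by method. This is a minimal sketch on toy rows, assuming the method codes listed above are the only ones that mark duplicates; the `in_swath_table` column name is hypothetical and not part of the data set.

```python
import pandas as pd

# Toy rows standing in for the size frequency table (illustrative values only)
toy = pd.DataFrame({
    'method': ['SBTL_SWATH_PISCO', 'SBTL_SIZEFREQ_PISCO', 'SBTL_SWATH_HSU', 'SBTL_SIZEFREQ_VRG'],
    'location': ['TRANSECT', 'RANDOM', 'TRANSECT', 'RANDOM'],
})

# Methods that, per the note above, mark sizes duplicated from the swath surveys
swath_methods = ['SBTL_SWATH_PISCO', 'SBTL_SWATH_HSU']

# Hypothetical helper column: True where a record is expected to also appear in the swath table
toy['in_swath_table'] = toy['method'].isin(swath_methods)

# Cross-tabulate method against location to see where the flagged records fall
summary = pd.crosstab(toy['method'], toy['location'])
print(summary)
```

Keeping the flag as a column, rather than dropping rows right away, leaves the decision about how to handle the duplicates open until it has been discussed with PISCO.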

**zone** = INNER, OUTER, OUTMID, INMID, MID or DEEP. A division of the site into 2 to 4 categories representing onshore-offshore stratification associated with targeted bottom depths for transects.
- INNER: Depth zone targeting roughly 5m of water depth, or the inner edge of the reef
- INMID: Depth zone targeting roughly 10m of water depth 
- MID: Depth zone targeting roughly 10-15m of water depth, used by VRG and in early years of PISCO
- OUTMID: Depth zone targeting roughly 15m of water depth 
- OUTER: Depth zone targeting roughly 20m of water depth 
- DEEP: Depth zone targeting roughly 25m of water depth, where present, used only by VRG

**transect** = The unique transect replicate within each site and zone. It seems like this should only be 1 - 8, but there are **a number of other designations as well.** <br>
**classcode** = One of 37 taxon codes. The unique taxonomic classification code that is being counted, as defined in the taxonomic table. This refers to a code that defines the Genus and Species that is identified, a code that represents a limited number of species that can't be narrowed down to one species, or in some cases family-level or higher order groupings. Generally, for invertebrates and algae, the classcode consists of the first three letters of the genus and the first three letters of the species (e.g. DERIMB for Dermasterias imbricata), with some exceptions <br>
**count** = The number of individuals of a given classcode and a given size (if applicable) per transect <br>
**size** = Size (in centimeters) of observation. For specific species groups, measured sizes correspond with test diameter for urchins, length of longest arm for seastars, shell length for shelled mollusks, carapace length for lobsters, total turgid length for sea cucumbers. <br>
**disease** = For some years echinoderm disease was recorded on transects for select species. When systematic observation for disease was conducted, disease is indicated here. Where blank, disease was not evaluated.
- HEALTHY: Individual was inspected and no disease was observed
- YES: Some form of disease was observed
- MILD: Mild disease was observed
- SEVERE: Severe disease was observed
- WASTING: Wasting disease was observed
- BLACK SPOT: Black spot disease was observed

**depth** = Between 1.8 and 26.5 meters. Depth of the transect estimated by the diver. **Does this mean a dive computer was used?** <br>
**observer** = The diver who conducted the survey transect <br>
**notes** = Free text notes taken at the time of the sample, or added at the time of data entry. <br>
**site_name_old** = In cases when specific sites have been surveyed by multiple campuses using different site names, this variable indicates the alternative (historical) site name.

**NOTE THAT 25419 RECORDS HAVE METHOD = 'SBTL_SIZEFREQ_PISCO' AND LOCATION = 'TRANSECT'. VRG AND HSU DO NOT SEEM TO HAVE THIS PROBLEM. HOW DO I INTERPRET THESE RECORDS?**
```python
data[(data['method'] == 'SBTL_SIZEFREQ_PISCO') & (data['location'] == 'TRANSECT')]
```

Also **note** that 6005 records have NaN in the transect field despite the location being listed as 'TRANSECT.'
```python
data[data['location'] == 'TRANSECT'].isna().sum()
```

**HOW SHOULD I INTERPRET THE TRANSECT AND ZONE FIELDS FOR RECORDS WHERE LOCATION = RANDOM?**
- Looking at the top 60 rows of the location != transect data, it's clear that counts and size bins re-start at new transects and zones. So I guess **I'll let events be transects for now, process all of the data, and then try to figure out which records need to be dropped and how when I talk to PISCO again.**
- **Update:** If zone and/or transect are specified, but location != transect, then the animal was observed in the vicinity of the transect but not on it. **Perhaps this can be specified in the metadata.**

```python
out = data[data['location'].isin(['RANDOM', 'SIZE_TRANSECT'])]
out.iloc[0:60, :]
```

### Strategy

So, for now, I'll let transects be **events** and species observations be **occurrences**.

There are no event-level measurements or facts, so we can just have an occurrence and a MoF file.

The **event** file should contain: eventID (from site, survey date, zone, transect), eventDate (from year, month, day), datasetID, locality (site), countryCode, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. Should I include the campus information somewhere? Observer?

The **occurrence** file should contain: eventID, eventDate, datasetID, locality, countryCode, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, samplingEffort, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe disease), organismQuantity (count), organismQuantityType.

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Size and disease status can be recorded at the occurrence level.
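
As a rough illustration of the MoF layout described above, size and disease can be reshaped from wide columns into one row per occurrence and measurement. This is only a sketch on toy rows: the IDs are made up, and using the raw column names as measurementType values is an assumption, not a settled vocabulary.

```python
import pandas as pd

# Toy occurrence-level records (illustrative values, not real data)
toy = pd.DataFrame({
    'eventID': ['SITE_A_20140808_INNER_1', 'SITE_A_20140808_INNER_1'],
    'occurrenceID': ['SITE_A_20140808_INNER_1_occ1', 'SITE_A_20140808_INNER_1_occ2'],
    'size': [7.0, 12.0],
    'disease': [None, 'HEALTHY'],
})

# Reshape to one row per (occurrence, measurement)
mof = toy.melt(
    id_vars=['eventID', 'occurrenceID'],
    value_vars=['size', 'disease'],
    var_name='measurementType',
    value_name='measurementValue',
)

# Drop measurements that were not recorded (e.g. disease not evaluated)
mof = mof.dropna(subset=['measurementValue']).reset_index(drop=True)

# Size is in centimeters per the column definitions; disease has no unit
mof['measurementUnit'] = mof['measurementType'].map({'size': 'centimeters'})
print(mof)
```

Dropping the empty rows follows the column definition for disease: a blank means disease was not evaluated, so there is no measurement to report.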

## Create occurrence file

### Get site names

In [126]:
## Load site table

filename = 'MLPA_kelpforest_site_table.1.csv'
sites = pd.read_csv(filename)

print(sites.shape)
sites.head()

(7458, 17)


Unnamed: 0,LTM_project_short_code,campus,method,survey_year,year,site,latitude,longitude,CA_MPA_Name_Short,site_designation,site_status,Secondary_MPA_Name,Secondary_site_designation,Secondary_site_status,BaselineRegion,LongTermRegion,MPA_priority_tier
0,LTM_Kelp_SRock,VRG,SBTL_SIZEFREQ_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
1,LTM_Kelp_SRock,VRG,SBTL_FISH_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
2,LTM_Kelp_SRock,VRG,SBTL_SWATH_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
3,LTM_Kelp_SRock,VRG,SBTL_UPC_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
4,LTM_Kelp_SRock,HSU,SBTL_UPC_HSU,2018,2018,ABALONE_POINT_1,39.6915,-123.8141,Ten Mile SMR,reference,reference,,,,North Coast,North Coast,I


There are a number of sites in the site table that have no size frequency data.

```python
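# Build the set of sites with size frequency data once, rather than on every iteration
data_sites = set(data['site'].unique())
for site in sites['site'].unique():
    if site not in data_sites:
        print(site)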
```

These include PISMO_W and SAL_E, which also don't appear in the fish, swath or upc data.

**PISMO_W and SAL_E also have latitude and longitude = NaN**

As checked in the fish transect conversion script, lat and lon have been consistently assigned within a site. Additionally, sites have been consistently labeled either 'mpa' or 'reference.' Since, for the purpose of DwC, we're not interested in which sites were sampled when, I can simplify the site table to only contain relevant information: site, latitude, longitude, and site status. **Note that which campus is responsible for a given site has changed between years in a number of cases, so I'm leaving this information out for now.**
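
The consistency check itself lives in the fish transect conversion script, which is not shown here. A minimal sketch of how such a check could look, on a toy site table with one deliberately inconsistent site:

```python
import pandas as pd

# Toy site table: SITE_B has two different latitudes (illustrative values only)
toy_sites = pd.DataFrame({
    'site': ['SITE_A', 'SITE_A', 'SITE_B', 'SITE_B'],
    'latitude': [36.62, 36.62, 34.00, 34.01],
    'longitude': [-121.90, -121.90, -119.42, -119.42],
    'site_status': ['mpa', 'mpa', 'reference', 'reference'],
})

# Count distinct values of each attribute within a site
n_unique = toy_sites.groupby('site')[['latitude', 'longitude', 'site_status']].nunique()

# Sites where any attribute takes more than one value
inconsistent = n_unique[(n_unique > 1).any(axis=1)]
print(inconsistent)
```

Note that `nunique` ignores NaN by default, so sites with missing coordinates (such as PISMO_W and SAL_E) would show a count of 0 rather than being reported as inconsistent.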

In [127]:
## Create simplified site table

site_summary = sites[['site', 'site_status', 'latitude', 'longitude']].copy()
site_summary.drop_duplicates(inplace=True)

print(site_summary.shape)
site_summary.head()

(382, 4)


Unnamed: 0,site,site_status,latitude,longitude
0,3 Palms East,reference,33.71762,-118.33215
4,ABALONE_POINT_1,reference,39.6915,-123.8141
15,ABALONE_POINT_2,reference,39.66502,-123.80435
26,ABALONE_POINT_3,reference,39.62877,-123.79658
33,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252


Some site names have spaces or ' - ' characters. I'll replace these in a sensible way.

In [128]:
# Replace ' ' and ' - ' in site names and add site_name column

site_name = [name.replace(' - ', '-') for name in site_summary['site']]
site_name = [name.replace(' ', '_') for name in site_name]
site_summary['site_name'] = site_name

site_summary.head()

Unnamed: 0,site,site_status,latitude,longitude,site_name
0,3 Palms East,reference,33.71762,-118.33215,3_Palms_East
4,ABALONE_POINT_1,reference,39.6915,-123.8141,ABALONE_POINT_1
15,ABALONE_POINT_2,reference,39.66502,-123.80435,ABALONE_POINT_2
26,ABALONE_POINT_3,reference,39.62877,-123.79658,ABALONE_POINT_3
33,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252,ANACAPA_ADMIRALS_CEN


### Convert

In [129]:
## Merge to add site_name (also lat, lon and site_status) to data table

data = data.merge(site_summary, how='left', on='site')
data.head()

Unnamed: 0,campus,method,survey_year,year,month,day,site,location,zone,transect,...,size,disease,depth,observer,notes,site_name_old,site_status,latitude,longitude,site_name
0,UCSC,SBTL_SWATH_PISCO,2014,2014,8,8,HOPKINS_UC,TRANSECT,INNER,1,...,7.0,,4.6,COLIN GAYLORD,,,mpa,36.621649,-121.900789,HOPKINS_UC
1,UCSC,SBTL_SWATH_PISCO,2018,2018,8,15,SAUNDERS_REFERENCE_1,TRANSECT,OUTER,2,...,10.0,,20.2,MICHAEL LANGHANS,,,reference,38.82227,-123.62233,SAUNDERS_REFERENCE_1
2,UCSC,SBTL_SWATH_PISCO,2014,2014,7,3,MACABEE_DC,TRANSECT,OUTER,2,...,12.0,HEALTHY,12.3,TRISTIN MCHUGH,SIZE 10-14 CM,,mpa,36.618184,-121.896835,MACABEE_DC
3,UCSC,SBTL_SWATH_PISCO,2014,2014,7,16,HOPKINS_DC,TRANSECT,MID,1,...,12.0,HEALTHY,8.8,TRISTIN MCHUGH,SIZE 10-14 CM,,mpa,36.623586,-121.904196,HOPKINS_DC
4,UCSC,SBTL_SWATH_PISCO,2014,2014,7,16,HOPKINS_DC,TRANSECT,MID,1,...,17.0,HEALTHY,8.8,TRISTIN MCHUGH,SIZE <=15 CM,,mpa,36.623586,-121.904196,HOPKINS_DC
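
A left merge like this can silently add rows (if a site appears more than once in the lookup table) or leave coordinates empty (if a site is missing from it). The sketch below shows one way to guard against both on toy tables, using the `validate` and `indicator` options of `DataFrame.merge`; the table contents are made up.

```python
import pandas as pd

# Toy stand-ins for the data and simplified site tables
toy_data = pd.DataFrame({'site': ['SITE_A', 'SITE_A', 'SITE_C'], 'size': [7.0, 10.0, 12.0]})
toy_summary = pd.DataFrame({'site': ['SITE_A', 'SITE_B'], 'latitude': [36.62, 34.00]})

# validate raises a MergeError if any site occurs more than once in toy_summary;
# indicator adds a _merge column recording where each row found a match
merged = toy_data.merge(toy_summary, how='left', on='site',
                        validate='many_to_one', indicator=True)

# Sites in the data with no entry in the site table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'site'].unique()
print(len(merged) == len(toy_data), list(unmatched))
```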


In [131]:
## Pad month and day as needed

data = data.astype({'month':str, 'day':str})
data['month'] = data['month'].str.pad(2, fillchar='0')
data['day'] = data['day'].str.pad(2, fillchar='0')
data.head()

Unnamed: 0,campus,method,survey_year,year,month,day,site,location,zone,transect,...,size,disease,depth,observer,notes,site_name_old,site_status,latitude,longitude,site_name
0,UCSC,SBTL_SWATH_PISCO,2014,2014,8,8,HOPKINS_UC,TRANSECT,INNER,1,...,7.0,,4.6,COLIN GAYLORD,,,mpa,36.621649,-121.900789,HOPKINS_UC
1,UCSC,SBTL_SWATH_PISCO,2018,2018,8,15,SAUNDERS_REFERENCE_1,TRANSECT,OUTER,2,...,10.0,,20.2,MICHAEL LANGHANS,,,reference,38.82227,-123.62233,SAUNDERS_REFERENCE_1
2,UCSC,SBTL_SWATH_PISCO,2014,2014,7,3,MACABEE_DC,TRANSECT,OUTER,2,...,12.0,HEALTHY,12.3,TRISTIN MCHUGH,SIZE 10-14 CM,,mpa,36.618184,-121.896835,MACABEE_DC
3,UCSC,SBTL_SWATH_PISCO,2014,2014,7,16,HOPKINS_DC,TRANSECT,MID,1,...,12.0,HEALTHY,8.8,TRISTIN MCHUGH,SIZE 10-14 CM,,mpa,36.623586,-121.904196,HOPKINS_DC
4,UCSC,SBTL_SWATH_PISCO,2014,2014,7,16,HOPKINS_DC,TRANSECT,MID,1,...,17.0,HEALTHY,8.8,TRISTIN MCHUGH,SIZE <=15 CM,,mpa,36.623586,-121.904196,HOPKINS_DC


In [139]:
## Create eventID

# Initialize the occurrence table (assumed here, since occ is not created in an earlier cell)
occ = pd.DataFrame()

# In this data set, 9268 transects = NaN. In order to form an event ID, I'm going to replace these with ''
data['transect'] = data['transect'].fillna('')

# In addition, there are 1706 zones = NaN. In order to form an event ID, I'm going to replace these with ''
data['zone'] = data['zone'].fillna('')

# Create eventID
occ['eventID'] = data['site_name'] + '_' + data['year'].astype(str) + data['month'] + data['day'] + '_' + data['zone'] + '_' + data['transect']

# Strip trailing underscores (rstrip takes a set of characters, not a pattern)
occ['eventID'] = occ['eventID'].str.rstrip('_')

occ.head()

Unnamed: 0,eventID
0,HOPKINS_UC_20140808_INNER_1
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2
2,MACABEE_DC_20140703_OUTER_2
3,HOPKINS_DC_20140716_MID_1
4,HOPKINS_DC_20140716_MID_1


**Note** that because some events had transect = NaN and/or zone = NaN (if zone = NaN, transect is not necessarily NaN), some of the eventIDs would otherwise end in stray underscores (e.g. ANACAPA_MIDDLE_ISLE_E_20050628__). I have stripped the trailing underscores (ANACAPA_MIDDLE_ISLE_E_20050628). **These event IDs will not be comparable to those in the swath data, even if location = TRANSECT. If Dan chooses to remove all location = TRANSECT records on a subsequent data submission, it's likely that all eventIDs should be aggregated to the site level, in which case it won't matter.**
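
An alternative to stripping underscores after the fact is to join only the components that are present. This is a sketch on toy rows, not what the cell above does: in particular, a record with a transect but no zone gets a single underscore before the transect here, whereas plain concatenation would leave a double underscore in the middle of the ID. The helper name `make_event_id` is hypothetical.

```python
import pandas as pd

# Toy rows covering the three cases: zone and transect present, both missing, zone missing only
toy = pd.DataFrame({
    'site_name': ['SITE_A', 'SITE_A', 'SITE_B'],
    'survey_date': ['20140808', '20050628', '20050628'],
    'zone': ['INNER', '', ''],
    'transect': ['1', '', '2'],
})

def make_event_id(row):
    """Join the non-empty ID components with underscores."""
    parts = [row['site_name'], row['survey_date'], row['zone'], row['transect']]
    return '_'.join(p for p in parts if p != '')

toy['eventID'] = toy.apply(make_event_id, axis=1)
print(toy['eventID'].tolist())
```

A row-wise `apply` is slower than vectorized string concatenation, which may matter for the full table, so this is mainly useful as a cross-check on the IDs.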

In [140]:
## eventDate

# Create survey_date column in data
data['survey_date'] = data['year'].astype(str) + data['month'] + data['day']

# Set eventDate to survey_date
occ['eventDate'] = data['survey_date']

# Format
formatted = [datetime.strptime(dt, '%Y%m%d').date().isoformat() for dt in occ['eventDate']]
occ['eventDate'] = formatted
occ.head()

Unnamed: 0,eventID,eventDate
0,HOPKINS_UC_20140808_INNER_1,2014-08-08
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15
2,MACABEE_DC_20140703_OUTER_2,2014-07-03
3,HOPKINS_DC_20140716_MID_1,2014-07-16
4,HOPKINS_DC_20140716_MID_1,2014-07-16
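
The same ISO 8601 formatting can also be done without a Python-level loop. A minimal sketch using `pd.to_datetime` on a toy series of date strings, which should also fail loudly if any string is not a valid date:

```python
import pandas as pd

# Toy survey_date strings in the same YYYYMMDD form built above
dates = pd.Series(['20140808', '20180815', '20140703'])

# Parse with an explicit format, then write back out as ISO 8601 dates
event_date = pd.to_datetime(dates, format='%Y%m%d').dt.strftime('%Y-%m-%d')
print(event_date.tolist())
```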


In [141]:
## Dataset ID

occ['datasetID'] = 'PISCO size-frequency'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency


In [142]:
## InstitutionID, locality, locationRemarks

occ['institutionID'] = data['campus']
occ['locality'] = data['site']
occ['locationRemarks'] = data['site_status']

# Update locationRemarks to vocabulary used in other data sets
occ['locationRemarks'] = occ['locationRemarks'].replace({'mpa':'marine protected area'})
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area


In [143]:
## Add countryCode, decimalLat, decimalLon

occ['countryCode'] = 'US'
occ['decimalLatitude'] = data['latitude']
occ['decimalLongitude'] = data['longitude']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area,US,36.621649,-121.900789
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference,US,38.82227,-123.62233
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area,US,36.618184,-121.896835
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196


In [145]:
## Add coordinateUncertaintyInMeters

occ['coordinateUncertaintyInMeters'] = 250

**Is this a reasonable coordinateUncertaintyInMeters?** Yes

In [146]:
## minimumDepthInMeters, maximumDepthInMeters

occ['minimumDepthInMeters'] = data['depth']
occ['maximumDepthInMeters'] = data['depth']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area,US,36.621649,-121.900789,250,4.6,4.6
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference,US,38.82227,-123.62233,250,20.2,20.2
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area,US,36.618184,-121.896835,250,12.3,12.3
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8


**Note** that here, I haven't had to aggregate by eventID, so there's no need to specifically handle situations where different depths were assigned to the same transect. 
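
For reference, if records were ever aggregated to the event level, differing depths within an event could be summarized as a range. A sketch on toy rows, assuming the shallowest and deepest recorded depths are the sensible minimum and maximum:

```python
import pandas as pd

# Toy records: event E1 has two different depths recorded
toy = pd.DataFrame({
    'eventID': ['E1', 'E1', 'E2'],
    'depth': [8.8, 9.4, 20.2],
})

# One row per event with the shallowest and deepest recorded depth
depth_range = (
    toy.groupby('eventID')['depth']
       .agg(['min', 'max'])
       .rename(columns={'min': 'minimumDepthInMeters', 'max': 'maximumDepthInMeters'})
)
print(depth_range)
```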

In [147]:
## Add samplingProtocol

occ['samplingProtocol'] = data['method']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area,US,36.621649,-121.900789,250,4.6,4.6,SBTL_SWATH_PISCO
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference,US,38.82227,-123.62233,250,20.2,20.2,SBTL_SWATH_PISCO
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area,US,36.618184,-121.896835,250,12.3,12.3,SBTL_SWATH_PISCO
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8,SBTL_SWATH_PISCO
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8,SBTL_SWATH_PISCO


**Note** that here, I've set samplingProtocol to PISCO's method column. **I think it might be good to do this for the fish and swath data as well. This change has now been made.**

Also **note** that the samplingEffort column has been excluded. I don't think there's been clear tracking of how long to look for individuals. **But I will double check this with Dan.**

In [148]:
## Add occurrenceID

occ['occurrenceID'] = data.groupby(['site', 'survey_date', 'zone', 'transect'])['classcode'].cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_occ' + occ['occurrenceID'].astype(str)

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,occurrenceID
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area,US,36.621649,-121.900789,250,4.6,4.6,SBTL_SWATH_PISCO,HOPKINS_UC_20140808_INNER_1_occ1
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference,US,38.82227,-123.62233,250,20.2,20.2,SBTL_SWATH_PISCO,SAUNDERS_REFERENCE_1_20180815_OUTER_2_occ1
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area,US,36.618184,-121.896835,250,12.3,12.3,SBTL_SWATH_PISCO,MACABEE_DC_20140703_OUTER_2_occ1
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ1
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ2


**Note** that for records where zone and/or transect information was missing, and was replaced with '', it's possible to have occurrences associated with a zone and transect at a given site on a given date, and occurrences that are NOT associated with a zone and transect on that same site and date. I.e., both ANACAPA_MIDDLE_ISLE_E_20050628_occ1 and ANACAPA_MIDDLE_ISLE_E_20050628_MID_2_occ1 are valid occurrenceIDs. 

I don't know if this actually ever arises in the data.
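
One way to find out would be to look for site and date combinations that contain both kinds of record. A sketch on toy rows, assuming missing zone and transect values have already been replaced with '' as above; the `unlocated` column name is hypothetical.

```python
import pandas as pd

# Toy rows: SITE_A mixes records with and without zone/transect on the same date
toy = pd.DataFrame({
    'site': ['SITE_A', 'SITE_A', 'SITE_B', 'SITE_B'],
    'survey_date': ['20050628', '20050628', '20050628', '20050628'],
    'zone': ['', 'MID', 'MID', 'MID'],
    'transect': ['', '2', '1', '2'],
})

# True where a record has neither zone nor transect
toy['unlocated'] = (toy['zone'] == '') & (toy['transect'] == '')

# A site/date with both True and False values contains both kinds of record
n_kinds = toy.groupby(['site', 'survey_date'])['unlocated'].nunique()
mixed = n_kinds[n_kinds > 1]
print(mixed.index.tolist())
```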

In [150]:
## Load species table

filename = 'MLPA_kelpforest_taxon_table.1.csv'
species = pd.read_csv(filename)

print(species.shape)
species.head()

(1336, 38)


Unnamed: 0,campus,sample_type,sample_subtype,classcode,orig_classcode,Kingdom,Phylum,Class,Order,Family,...,LOOKED2009,LOOKED2010,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018
0,HSU,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,no,no,no,no,no,yes,yes,no,yes,yes
1,UCSB,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,VRG,FISH,FISH,AARG,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
3,HSU,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no
4,UCSB,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no


In [151]:
## Select relevant species

sf_sp = species.loc[species['sample_type'] == 'SIZEFREQ', ['sample_type', 'classcode', 'species_definition', 'common_name']].drop_duplicates()

In [152]:
## Map classcodes to species definitions (usually scientific names) and classcodes to common names

code_to_sci_dict = dict(zip(sf_sp['classcode'], sf_sp['species_definition']))
code_to_com_dict = dict(zip(sf_sp['classcode'], sf_sp['common_name']))

In [153]:
## Update dictionaries according to note below

code_to_sci_dict['ASTSPP'] = 'Asteroidea'
code_to_com_dict['ASTSPP'] = 'Asteroidea'

code_to_sci_dict['LEPHEXAD'] = 'Leptasterias hexactis'
code_to_com_dict['LEPHEXAD'] = 'Six Arm Star - adult'

In [154]:
## Create scientificName, vernacularName

occ['vernacularName'] = data['classcode'].replace(code_to_com_dict)

occ['scientificName'] = data['classcode'].replace(code_to_sci_dict)

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,occurrenceID,vernacularName,scientificName
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area,US,36.621649,-121.900789,250,4.6,4.6,SBTL_SWATH_PISCO,HOPKINS_UC_20140808_INNER_1_occ1,California Sea Cucumber,Apostichopus californicus
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference,US,38.82227,-123.62233,250,20.2,20.2,SBTL_SWATH_PISCO,SAUNDERS_REFERENCE_1_20180815_OUTER_2_occ1,Asteroidea,Asteroidea
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area,US,36.618184,-121.896835,250,12.3,12.3,SBTL_SWATH_PISCO,MACABEE_DC_20140703_OUTER_2_occ1,Leather Star,Dermasterias imbricata
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ1,Leather Star,Dermasterias imbricata
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ2,Leather Star,Dermasterias imbricata


In [155]:
## Get unique scientific names for lookup in WoRMS

names = occ['scientificName'].unique()

No classcodes represent a choice between two possible species (ASTSPP is recorded at the class level, as Asteroidea), therefore an **identificationQualifier column is not necessary.**

There were two classcodes that did not have SIZEFREQ entries in the species table:
- ASTSPP
- LEPHEXAD

LEPHEXAD, as noted in the swath conversion script, is missing from the species table entirely. **ASTSPP is present in the species table, but it is only listed under sample_type = SWATH.** It matches to class Asteroidea, and its species_definition is also Asteroidea. **I've added both manually above.**
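
Classcodes without a mapping can be found with a set difference between the codes in the data and the dictionary keys. A minimal sketch on toy values; the toy dictionary stands in for `code_to_sci_dict` before the manual additions.

```python
import pandas as pd

# Toy classcode column and a toy mapping that lacks two of the codes
toy_codes = pd.Series(['APOCAL', 'ASTSPP', 'DERIMB', 'LEPHEXAD', 'DERIMB'])
toy_dict = {'APOCAL': 'Apostichopus californicus', 'DERIMB': 'Dermasterias imbricata'}

# Codes present in the data but absent from the mapping
missing = sorted(set(toy_codes.unique()) - set(toy_dict))
print(missing)
```

Running a check like this before the `replace` step matters because `replace` leaves unmapped codes in place, so a raw classcode could otherwise end up in scientificName unnoticed.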

In [156]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

In [157]:
## Add scientific name-related columns

occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area,US,36.621649,-121.900789,250,4.6,4.6,SBTL_SWATH_PISCO,HOPKINS_UC_20140808_INNER_1_occ1,California Sea Cucumber,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference,US,38.82227,-123.62233,250,20.2,20.2,SBTL_SWATH_PISCO,SAUNDERS_REFERENCE_1_20180815_OUTER_2_occ1,Asteroidea,Asteroidea,urn:lsid:marinespecies.org:taxname:123080,123080
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area,US,36.618184,-121.896835,250,12.3,12.3,SBTL_SWATH_PISCO,MACABEE_DC_20140703_OUTER_2_occ1,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ1,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ2,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771


In [158]:
## No identificationQualifier needed, replace scientificName using name_name_dict

occ['scientificName'] = occ['scientificName'].replace(name_name_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area,US,36.621649,-121.900789,250,4.6,4.6,SBTL_SWATH_PISCO,HOPKINS_UC_20140808_INNER_1_occ1,California Sea Cucumber,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference,US,38.82227,-123.62233,250,20.2,20.2,SBTL_SWATH_PISCO,SAUNDERS_REFERENCE_1_20180815_OUTER_2_occ1,Asteroidea,Asteroidea,urn:lsid:marinespecies.org:taxname:123080,123080
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area,US,36.618184,-121.896835,250,12.3,12.3,SBTL_SWATH_PISCO,MACABEE_DC_20140703_OUTER_2_occ1,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ1,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,8.8,8.8,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ2,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771


In [159]:
## Add final name-related columns

occ['nameAccordingTo'] = 'WoRMS'
occ['occurrenceStatus'] = 'present'
occ['basisOfRecord'] = 'HumanObservation'

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,maximumDepthInMeters,samplingProtocol,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area,US,36.621649,-121.900789,250,...,4.6,SBTL_SWATH_PISCO,HOPKINS_UC_20140808_INNER_1_occ1,California Sea Cucumber,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363,WoRMS,present,HumanObservation
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference,US,38.82227,-123.62233,250,...,20.2,SBTL_SWATH_PISCO,SAUNDERS_REFERENCE_1_20180815_OUTER_2_occ1,Asteroidea,Asteroidea,urn:lsid:marinespecies.org:taxname:123080,123080,WoRMS,present,HumanObservation
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area,US,36.618184,-121.896835,250,...,12.3,SBTL_SWATH_PISCO,MACABEE_DC_20140703_OUTER_2_occ1,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771,WoRMS,present,HumanObservation
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,...,8.8,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ1,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771,WoRMS,present,HumanObservation
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,...,8.8,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ2,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771,WoRMS,present,HumanObservation


In [160]:
## Add individualCount

occ['individualCount'] = data['count']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,samplingProtocol,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,individualCount
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area,US,36.621649,-121.900789,250,...,SBTL_SWATH_PISCO,HOPKINS_UC_20140808_INNER_1_occ1,California Sea Cucumber,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363,WoRMS,present,HumanObservation,1
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference,US,38.82227,-123.62233,250,...,SBTL_SWATH_PISCO,SAUNDERS_REFERENCE_1_20180815_OUTER_2_occ1,Asteroidea,Asteroidea,urn:lsid:marinespecies.org:taxname:123080,123080,WoRMS,present,HumanObservation,1
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area,US,36.618184,-121.896835,250,...,SBTL_SWATH_PISCO,MACABEE_DC_20140703_OUTER_2_occ1,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771,WoRMS,present,HumanObservation,2
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,...,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ1,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771,WoRMS,present,HumanObservation,1
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,...,SBTL_SWATH_PISCO,HOPKINS_DC_20140716_MID_1_occ2,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771,WoRMS,present,HumanObservation,1


In [161]:
## Add notes under occurrenceRemarks

occ['occurrenceRemarks'] = data['notes']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,individualCount,occurrenceRemarks
0,HOPKINS_UC_20140808_INNER_1,2014-08-08,PISCO size-frequency,UCSC,HOPKINS_UC,marine protected area,US,36.621649,-121.900789,250,...,HOPKINS_UC_20140808_INNER_1_occ1,California Sea Cucumber,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363,WoRMS,present,HumanObservation,1,
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,2018-08-15,PISCO size-frequency,UCSC,SAUNDERS_REFERENCE_1,reference,US,38.82227,-123.62233,250,...,SAUNDERS_REFERENCE_1_20180815_OUTER_2_occ1,Asteroidea,Asteroidea,urn:lsid:marinespecies.org:taxname:123080,123080,WoRMS,present,HumanObservation,1,
2,MACABEE_DC_20140703_OUTER_2,2014-07-03,PISCO size-frequency,UCSC,MACABEE_DC,marine protected area,US,36.618184,-121.896835,250,...,MACABEE_DC_20140703_OUTER_2_occ1,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771,WoRMS,present,HumanObservation,2,SIZE 10-14 CM
3,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,...,HOPKINS_DC_20140716_MID_1_occ1,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771,WoRMS,present,HumanObservation,1,SIZE 10-14 CM
4,HOPKINS_DC_20140716_MID_1,2014-07-16,PISCO size-frequency,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,...,HOPKINS_DC_20140716_MID_1_occ2,Leather Star,Dermasterias imbricata,urn:lsid:marinespecies.org:taxname:240771,240771,WoRMS,present,HumanObservation,1,SIZE <=15 CM


In [164]:
# Save size and disease data for MoF file

# Get size data
size_df = pd.DataFrame({'eventID':occ['eventID'],
                        'occurrenceID':occ['occurrenceID'],
                        'scientificName':occ['scientificName'],
                        'commonName':occ['vernacularName'],
                        'size':data['size']})
print(size_df.shape)

# Create a size_measurement_type column
size_df['measurementType'] = 'longest arm length' # sea stars
size_df.loc[size_df['scientificName'].isin(['Mesocentrotus franciscanus', 
                                            'Strongylocentrotus purpuratus',
                                            'Lytechinus pictus']), 'measurementType'] = 'test diameter' # urchins
size_df.loc[size_df['scientificName'].isin(['Haliotis rufescens', 
                                            'Haliotis walallensis',
                                            'Haliotis',
                                            'Haliotis kamtschatkana',
                                            'Haliotis corrugata',
                                            'Haliotis cracherodii',
                                            'Haliotis fulgens']), 'measurementType'] = 'shell length' # abalone
size_df.loc[size_df['scientificName'] == 'Panulirus interruptus', 'measurementType'] = 'carapace length' # lobsters
size_df.loc[size_df['scientificName'].isin(['Apostichopus californicus',
                                            'Apostichopus parvimensis']), 'measurementType'] = 'total turgid length' # sea cucumbers
size_df.loc[size_df['scientificName'].isin(['Kelletia kelletii', 
                                            'Megathura crenulata', 
                                            'Megastraea undosa',
                                            'Pomaulax gibberosus']), 'measurementType'] = 'shell length' # gastropods

# Get disease data
disease_df = pd.DataFrame({'eventID':occ['eventID'],
                           'occurrenceID':occ['occurrenceID'],
                           'scientificName':occ['scientificName'],
                           'disease':data['disease']})
disease_df.dropna(subset=['disease'], inplace=True)
print(disease_df.shape)

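# Change the disease category 'YES' to something more descriptive
# (use .loc with a column label so only 'disease' is overwritten, not every column in the matching rows)
disease_df.loc[disease_df['disease'] == 'YES', 'disease'] = 'DISEASED'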

(80977, 5)
(22404, 4)
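
The cell above defaults every taxon to 'longest arm length' and then overwrites the non-seastar taxa by name, so a taxon missing from those lists would silently be labeled as a sea star measurement. A minimal sketch of one way to surface such cases, assuming a lookup dictionary is acceptable in place of the chained `.loc` assignments; `MEASUREMENT_TYPES` and `assign_measurement_type` are hypothetical names, and the dictionary is abbreviated for illustration:

```python
import pandas as pd

# Hypothetical, abbreviated lookup; the full list of taxa is the one used in the cell above
MEASUREMENT_TYPES = {
    'Mesocentrotus franciscanus': 'test diameter',
    'Haliotis rufescens': 'shell length',
    'Panulirus interruptus': 'carapace length',
    'Apostichopus californicus': 'total turgid length',
    'Dermasterias imbricata': 'longest arm length',
}

def assign_measurement_type(names, lookup=MEASUREMENT_TYPES):
    """Return (measurementType Series, sorted list of names missing from the lookup)."""
    mapped = names.map(lookup)  # names not in the lookup become NaN
    unmapped = sorted(names[mapped.isna()].unique())
    return mapped, unmapped

# Toy input, not the PISCO data
toy_names = pd.Series(['Dermasterias imbricata', 'Haliotis rufescens', 'Kelletia kelletii'])
toy_types, toy_unmapped = assign_measurement_type(toy_names)
print(toy_unmapped)
```

Any name reported as unmapped can then be reviewed by hand rather than inheriting a default.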


In [165]:
## Change NaN to '' in string fields

occ['occurrenceRemarks'] = occ['occurrenceRemarks'].replace(np.nan, '')
occ.isna().sum()

eventID                              0
eventDate                            0
datasetID                            0
institutionID                        0
locality                             0
locationRemarks                      0
countryCode                          0
decimalLatitude                      0
decimalLongitude                     0
coordinateUncertaintyInMeters        0
minimumDepthInMeters             16744
maximumDepthInMeters             16744
samplingProtocol                     0
occurrenceID                         0
vernacularName                       0
scientificName                       0
scientificNameID                     0
taxonID                              0
nameAccordingTo                      0
occurrenceStatus                     0
basisOfRecord                        0
individualCount                      0
occurrenceRemarks                    0
dtype: int64

In [166]:
## Save

occ.to_csv('PISCO_sizefreq_occurrence_20210209.csv', index=False, na_rep='NaN')

## Create MoF file

In [167]:
## Finalize occurrence-level measurements and facts

# Size
size_mof = pd.DataFrame({'eventID':size_df['eventID'],
                        'occurrenceID':size_df['occurrenceID'],
                        'measurementType':size_df['measurementType'],
                        'measurementValue':size_df['size'],
                        'measurementUnit':'centimeters',
                        'measurementMethod':'measured by diver'})

# Disease
dis_mof = pd.DataFrame({'eventID':disease_df['eventID'],
                       'occurrenceID':disease_df['occurrenceID'],
                       'measurementType':'observation',
                       'measurementValue':disease_df['disease'].str.lower(),
                       'measurementUnit':np.nan,
                       'measurementMethod':'visually determined by diver'})
dis_mof

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
2,MACABEE_DC_20140703_OUTER_2,MACABEE_DC_20140703_OUTER_2_occ1,observation,healthy,,visually determined by diver
3,HOPKINS_DC_20140716_MID_1,HOPKINS_DC_20140716_MID_1_occ1,observation,healthy,,visually determined by diver
4,HOPKINS_DC_20140716_MID_1,HOPKINS_DC_20140716_MID_1_occ2,observation,healthy,,visually determined by diver
5,HOPKINS_DC_20140716_MID_2,HOPKINS_DC_20140716_MID_2_occ1,observation,healthy,,visually determined by diver
6,HOPKINS_DC_20140716_MID_2,HOPKINS_DC_20140716_MID_2_occ2,observation,healthy,,visually determined by diver
...,...,...,...,...,...,...
65588,ANACAPA_WEST_ISLE_W_20160726_OUTER_2,ANACAPA_WEST_ISLE_W_20160726_OUTER_2_occ4,observation,healthy,,visually determined by diver
65589,ANACAPA_WEST_ISLE_W_20160726_OUTER_2,ANACAPA_WEST_ISLE_W_20160726_OUTER_2_occ5,observation,healthy,,visually determined by diver
65590,ANACAPA_WEST_ISLE_W_20160726_OUTER_2,ANACAPA_WEST_ISLE_W_20160726_OUTER_2_occ6,observation,healthy,,visually determined by diver
65591,ANACAPA_WEST_ISLE_W_20160726_OUTER_2,ANACAPA_WEST_ISLE_W_20160726_OUTER_2_occ7,observation,healthy,,visually determined by diver


In [168]:
## Concatenate

mof = pd.concat([size_mof, dis_mof])
mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,HOPKINS_UC_20140808_INNER_1,HOPKINS_UC_20140808_INNER_1_occ1,total turgid length,7,centimeters,measured by diver
1,SAUNDERS_REFERENCE_1_20180815_OUTER_2,SAUNDERS_REFERENCE_1_20180815_OUTER_2_occ1,longest arm length,10,centimeters,measured by diver
2,MACABEE_DC_20140703_OUTER_2,MACABEE_DC_20140703_OUTER_2_occ1,longest arm length,12,centimeters,measured by diver
3,HOPKINS_DC_20140716_MID_1,HOPKINS_DC_20140716_MID_1_occ1,longest arm length,12,centimeters,measured by diver
4,HOPKINS_DC_20140716_MID_1,HOPKINS_DC_20140716_MID_1_occ2,longest arm length,17,centimeters,measured by diver


In [171]:
## Change NaN to '' in string fields

mof['measurementUnit'] = mof['measurementUnit'].replace(np.nan, '')
mof.isna().sum()

eventID              0
occurrenceID         0
measurementType      0
measurementValue     0
measurementUnit      0
measurementMethod    0
dtype: int64

In [172]:
## Save

mof.to_csv('PISCO_sizefreq_MoF_20210209.csv', index=False, na_rep='NaN')
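
Before sharing the two files, it may be worth confirming that every MoF row points at an occurrence that exists and that no occurrence carries the same measurementType twice. A minimal sketch on toy tables, assuming only the `occurrenceID` and `measurementType` columns used above; `check_mof_links` is a hypothetical helper, not part of the conversion script:

```python
import pandas as pd

def check_mof_links(occ_table, mof_table):
    """Return (orphan occurrenceIDs, rows repeating an occurrenceID/measurementType pair)."""
    # occurrenceIDs that appear in the MoF table but not in the occurrence table
    orphans = sorted(set(mof_table['occurrenceID']) - set(occ_table['occurrenceID']))
    # keep=False flags every member of a repeated pair, not just the later copies
    repeated = mof_table[mof_table.duplicated(subset=['occurrenceID', 'measurementType'],
                                              keep=False)]
    return orphans, repeated

# Toy tables, not the PISCO data
toy_occ = pd.DataFrame({'occurrenceID': ['A_occ1', 'A_occ2']})
toy_mof = pd.DataFrame({
    'occurrenceID': ['A_occ1', 'A_occ1', 'A_occ2', 'B_occ1'],
    'measurementType': ['longest arm length', 'observation', 'test diameter', 'test diameter'],
})

toy_orphans, toy_repeated = check_mof_links(toy_occ, toy_mof)
print(toy_orphans)
print(len(toy_repeated))
```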

## Questions

(In addition to those already listed on fish and swath data).

**Bigger question: How to most reasonably include the swath and sizefreq data? Are all the 'transect' data in sizefreq also in swath? Is it confusing to have size data in two data sets? (I.e., if someone wanted the size data and not the presence/absence data, they would have to look in two places?)**

1. 25419 records indicate that a sizefreq method was used, but the location is listed as 'TRANSECT'. This is only true for the SBTL_SWATH_PISCO (UCSC and UCSB) method. How do I interpret these, especially if I want to remove the records that are duplicated in the swath data? **This connects to a set of larger questions, which Dan's not sure about, including: How to most reasonably include the swath and sizefreq data? Are all the 'transect' data in sizefreq also in swath? Is it confusing to have size data in two data sets? (I.e., if someone wanted the size data and not the presence/absence data, they would have to look in two places?) Dan will look into why so many records with method = sizefreq also have location = transect. In his follow-up email, Dan said he's still working on this issue, but that they will likely drop the transect sizes from the sizefreq data for the upcoming 2020 submission and make this change clear in the metadata. I'm still not sure about the method = sizefreq, location = transect issue.**
2. How should I interpret the transect and zone fields when location = RANDOM? **If transect/zone is specified, organisms were measured in the vicinity of the transect but not on it. Perhaps this can be made clear in the metadata?**
3. Reasonable value for coordinateUncertaintyInMeters? **250 m is probably fine.**
4. LEPHEXAD, as noted in the swath conversion script, is missing entirely from the species table. ASTSPP is present in the species table, but it is only listed under sample_type = SWATH. **I conveyed this information to Dan.**
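
The record count in question 1 comes from comparing the `method` and `location` columns. A minimal sketch of how such a tally could be reproduced with a cross-tabulation, shown on toy rows; 'OTHER_METHOD' is a made-up placeholder, not a real PISCO method code:

```python
import pandas as pd

# Toy rows, not the PISCO data; 'OTHER_METHOD' is a hypothetical placeholder
toy_records = pd.DataFrame({
    'method': ['SBTL_SWATH_PISCO', 'SBTL_SWATH_PISCO', 'SBTL_SWATH_PISCO', 'OTHER_METHOD'],
    'location': ['TRANSECT', 'TRANSECT', 'RANDOM', 'RANDOM'],
})

# Rows are methods, columns are locations, cells are record counts
toy_counts = pd.crosstab(toy_records['method'], toy_records['location'])
print(toy_counts)

# Number of records with a given method recorded on a transect
n_swath_transect = toy_counts.loc['SBTL_SWATH_PISCO', 'TRANSECT']
```

Run against the full table, the TRANSECT column would show which methods contribute the records that may duplicate the swath data.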

**Dan would also like me to ask Rani: Where is the PISCO data actually stored? On PISCO's server or DataONE's?**

**One thing I didn't ask the first time:** Is there some way to characterize sampling effort here?

## Remaining Questions 2/9/21

1. Did Dan end up removing the location=TRANSECT records from this data set for the upcoming 2020 submission to DataONE? If not, it seems fine to me if he uploads the entire data set to DataONE, as long as I know that by filtering out records with location=TRANSECT, I can successfully get rid of duplicate records for OBIS (i.e., all of the 'transect' data in size frequency are also in swath, and all of the sizes that are not in swath are appropriately labeled as 'random' or 'size transect'). **Yes, he will be removing the location = TRANSECT records.**
2. If Dan did remove the location=TRANSECT records, does it make more sense to have events defined as surveys (site + date) rather than transects? **Probably; I will need to adjust for this.**
3. samplingEffort? **There isn't really a good way to quantify samplingEffort for these data.**
4. Would it be helpful to share the data with me before submission so I can run through my code and see if I catch anything else? **Yes, Dan will send the data in the next week or two.**
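
Items 1 and 2 together suggest dropping the location = TRANSECT records and defining events at the survey level (site + date). A minimal sketch of what that adjustment might look like, assuming the `site`, `year`, `month`, `day`, and `location` columns of the raw table; `survey_events` is a hypothetical helper and the `site_YYYYMMDD` eventID format is an assumption, not a decision recorded above:

```python
import pandas as pd

def survey_events(records):
    """Drop TRANSECT rows and build survey-level eventDate / eventID columns."""
    kept = records[records['location'] != 'TRANSECT'].copy()
    # pandas assembles datetimes from columns named year, month, and day
    dates = pd.to_datetime(kept[['year', 'month', 'day']])
    kept['eventDate'] = dates.dt.strftime('%Y-%m-%d')
    kept['eventID'] = kept['site'] + '_' + dates.dt.strftime('%Y%m%d')
    return kept

# Toy rows, not the PISCO data
toy_sizefreq = pd.DataFrame({
    'site': ['HOPKINS_UC', 'HOPKINS_UC', 'MACABEE_DC'],
    'year': [2014, 2014, 2014],
    'month': [8, 8, 7],
    'day': [8, 8, 3],
    'location': ['TRANSECT', 'RANDOM', 'RANDOM'],
})

toy_surveys = survey_events(toy_sizefreq)
print(toy_surveys[['eventID', 'eventDate', 'location']])
```

With survey-level events, the zone and transect values would no longer be part of the eventID, so occurrenceIDs built from it would need to be renumbered within each survey.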