# PISCO - size frequency data

Body sizes of targeted invertebrate species are measured by divers both along benthic transects and at random locations within a study site.

Measured sizes correspond with:
- test diameter for urchins
- length of longest arm for seastars
- shell length for shelled mollusks
- carapace length for lobsters
- total turgid length for sea cucumbers

In the case of urchins sampled by UCSB and VRG in Southern California, large numbers of individuals may be collected in bags and brought aboard the research vessel to facilitate measurement.

Originally, this data set contained sized animals that also occurred in the swath table. I believe Dan has removed these, so the data set should now contain only individuals that were found and sized off-transect.
    
**Resources:** <br>
https://opc.dataone.org/view/MLPA_kelpforest.metadata.1

In [1]:
## Imports

import pandas as pd
import numpy as np

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "/Users/dianalg/PycharmProjects/PythonScripts/MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Load data

filename = 'sizefreq_through_2020.csv'
data = pd.read_csv(filename, dtype={'transect':str, 'disease':str, 'site_name_old':str})

print(data.shape)
data.head()

(50348, 18)


Unnamed: 0,campus,method,survey_year,year,month,day,site,location,zone,transect,classcode,count,size,disease,depth,observer,notes,site_name_old
0,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,MESFRA,1,3.0,,,IAN TANIGUCHI,,
1,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,MESFRA,1,3.2,,,IAN TANIGUCHI,,
2,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,MESFRA,1,3.6,,,IAN TANIGUCHI,,
3,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,MESFRA,1,3.8,,,IAN TANIGUCHI,,
4,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,MESFRA,1,4.0,,,IAN TANIGUCHI,,


In [4]:
data['year'].max()

2020

### Column definitions

**campus** = UCSC, UCSB, HSU or VRG. The academic partner campus that conducted the survey. <br>
**method** = SBTL_SWATH_PISCO, SBTL_SWATH_HSU or SBTL_SWATH_VRG. The code describing the sampling technique and monitoring program that conducted each survey. **How is this different from the previous column? Does it actually indicate further methodological differences?** <br>
**survey_year** = 2003 - 2018. The designated year associated with the annual survey. In most cases, survey_year and year are the same. In rare cases, surveys are completed early in the year following the designated survey year. In these cases, survey_year will differ from year. <br>
**year** = 2003 - 2020. Year that the survey was conducted. <br>
**month** = 1 - 12. Month that the survey was conducted. <br>
**day** = 1 - 31. Day that the survey was conducted. <br>
**site** = One of 350 site codes. The unique site where the survey was performed (as defined in the site table). This site refers to a specific GPS location and is often associated with a geographic placename. Often, multiple site replicates will be associated with a single placename, and will be delineated with additional geographical or directional information (e.g. North/South/East/West/Central - N/S/E/W/CEN, Upcoast/Downcoast - UC/DC) <br>
**location** = RANDOM OR SIZE_TRANSECT. The location where the size observation was recorded. 
- RANDOM: sizes were measured across the general site and were not specifically located on the swath/upc transect
- SIZE_TRANSECT: applies to HSU only where specific size frequency transects are conducted separately from swath/upc transects

**zone** = INNER, OUTER, OUTMID, INMID, MID or DEEP. A division of the site into 2 to 4 categories representing onshore-offshore stratification associated with targeted bottom depths for transects.
- INNER: Depth zone targeting roughly 5m of water depth, or the inner edge of the reef
- INMID: Depth zone targeting roughly 10m of water depth 
- MID: Depth zone targeting roughly 10-15m of water depth, used by VRG and in early years of PISCO
- OUTMID: Depth zone targeting roughly 15m of water depth 
- OUTER: Depth zone targeting roughly 20m of water depth 
- DEEP: Depth zone targeting roughly 25m of water depth, where present, used only by VRG

**transect** = It seems like this should only be 1 - 8, but there are **a number of other designations as well.** The unique transect replicate within each site and zone. <br>
**classcode** = One of 37 taxon codes. The unique taxonomic classification code that is being counted, as defined in the taxonomic table. This refers to a code that defines the Genus and Species that is identified, a code that represents a limited number of species that can't be narrowed down to one species, or in some cases family-level or higher order groupings. Generally, for invertebrates and algae, the classcode is composed of the first letter of the genus and the first three letters of the species, with some exceptions. <br>
**count** = The number of individuals of a given classcode and a given size (if applicable) per transect <br>
**size** = Size (in centimeters) of observation. For specific species groups, measured sizes correspond with test diameter for urchins, length of longest arm for seastars, shell length for shelled mollusks, carapace length for lobsters, total turgid length for sea cucumbers. <br>
**disease** = For some years echinoderm disease was recorded on transects for select species. When systematic observation for disease was conducted, disease is indicated here. Where blank, disease was not evaluated.
- HEALTHY: Individual was inspected and no disease was observed
- YES: Some form of disease was observed
- MILD: Mild disease was observed
- SEVERE: Severe disease was observed
- WASTING: Wasting disease was observed
- BLACK SPOT: Black spot disease was observed

**depth** = Between 1.8 and 26.5 meters. Depth of the transect estimated by the diver. **Does this mean a dive computer was used?** <br>
**observer** = The diver who conducted the survey transect <br>
**notes** = Free text notes taken at the time of the sample, or added at the time of data entry. <br>
**site_name_old** = In cases when specific sites have been surveyed by multiple campuses using different site names, this variable indicates the alternative (historical) site name.
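
Several of the definitions above raise questions that can be checked directly against the table. Here is a minimal sketch of such checks, run on a small made-up frame rather than the real `data`. The rows, the `SBTL_SIZEFREQ_VRG` value and the helper name `summarize_columns` are illustrative assumptions, not part of the PISCO data.

```python
import pandas as pd

# Toy stand-in for the size frequency table; all values are invented for illustration
toy = pd.DataFrame({
    'survey_year': [2003, 2003, 2004],
    'year':        [2003, 2004, 2004],
    'transect':    ['1', '2', 'A'],
    'campus':      ['UCSB', 'UCSB', 'VRG'],
    'method':      ['SBTL_SIZEFREQ_PISCO', 'SBTL_SIZEFREQ_PISCO', 'SBTL_SIZEFREQ_VRG'],
})

def summarize_columns(df):
    """Hypothetical helper: quick answers to the questions raised in the definitions."""
    expected_transects = {str(i) for i in range(1, 9)}
    return {
        # Rows where the survey was completed in a different calendar year
        'n_year_mismatch': int((df['survey_year'] != df['year']).sum()),
        # Transect labels outside the expected 1 - 8 range
        'odd_transects': sorted(set(df['transect'].dropna()) - expected_transects),
        # Number of distinct method codes used by each campus
        'methods_per_campus': df.groupby('campus')['method'].nunique().to_dict(),
    }

summary = summarize_columns(toy)
print(summary)
```

If every campus uses exactly one method code, the method column probably adds little beyond campus, which would answer the question under **method**.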


### Strategy

I could let transects be **events** and species observations be **occurrences**. The only issue is that these observations are not specifically associated with a transect. The transect information is included because the individual was found *in the vicinity* of that transect. I think this is useful because it's a proxy for depth if a depth was not specifically recorded, and depth was not recorded for about half of the observations. So maybe I'll let site/date/zone combinations be events, and if the zone is missing, the event will have site/date information only.

There are no event-level measurements or facts, so we can just have an occurrence and a MoF file.

The **occurrence** file should contain: eventID, eventDate, datasetID, locality, countryCode, decimalLat, decimalLon, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, samplingEffort, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe disease), organismQuantity (count), organismQuantityType.

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Size and disease status can be recorded at the occurrence level.
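
To make the MoF layout concrete, here is a rough sketch of the long format, assuming size and disease each become one row per occurrence. The toy rows, the measurementType labels and the function name `build_mof` are assumptions for illustration. measurementMethod is left out because the right wording isn't settled yet.

```python
import pandas as pd

# Toy occurrences; IDs and values are invented for illustration
toy_occ = pd.DataFrame({
    'eventID': ['SITE_A_20031015_MID', 'SITE_A_20031015_MID'],
    'occurrenceID': ['SITE_A_20031015_MID_occ1', 'SITE_A_20031015_MID_occ2'],
    'size': [3.0, 3.2],
    'disease': [None, 'HEALTHY'],
})

def build_mof(df):
    """Hypothetical helper: stack size and disease into one long MoF table."""
    has_size = df['size'].notna()
    size = df.loc[has_size, ['eventID', 'occurrenceID']].copy()
    size['measurementType'] = 'size'
    size['measurementValue'] = df.loc[has_size, 'size'].astype(str)
    size['measurementUnit'] = 'centimeters'

    # Blank disease means it was not evaluated, so no row is written
    has_disease = df['disease'].notna()
    disease = df.loc[has_disease, ['eventID', 'occurrenceID']].copy()
    disease['measurementType'] = 'disease status'
    disease['measurementValue'] = df.loc[has_disease, 'disease']
    disease['measurementUnit'] = ''  # categorical value, no unit

    return pd.concat([size, disease], ignore_index=True)

mof = build_mof(toy_occ)
print(mof)
```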

## Create occurrence file

### Get site names

In [6]:
## Load site table

filename = 'site_table_through_2020.csv'
sites = pd.read_csv(filename)

print(sites.shape)
sites.head()

(570, 53)


Unnamed: 0,site,site_name_old,site_name_for_figures,campus,latitude,longitude,site_campus (unique_ID),Unnamed: 7,latitude_old,longitude_old,...,survey_2016,survey_2017,survey_2018,survey_2019,survey_2020,SurveyYears,time_series_category,Notes (ETS),notes_location_details,notes_data_density
0,120/OML,,120/OML,RCCA,33.7379,-118.392,120/OML RCCA,,,,...,,,,,,,,,,
1,3_PALMS_EAST,3 Palms East,3 Palms East,VRG,33.718105,-118.3326,3 Palms East VRG,,,,...,,,,,,,,Added 6/21 based on field waypoint files,,
2,Abalone Cove,,Abalone Cove,RCCA,33.7362,-118.376,Abalone Cove RCCA,,,,...,,,,,,,,,,
3,ABALONE_COVE_KELP_W,Abalone Cove Kelp West,Abalone Cove Kelp W,VRG,33.73922,-118.38789,Abalone Cove Kelp West VRG,,,,...,X,X,X,X,X,11.0,1.0,,,
4,ABALONE_POINT_1,,Abalone Point 1,HSU,39.6915,-123.8141,ABALONE_POINT_1 HSU,,,,...,,,,,,,2.0,,,


**Note** that this site table is not in the standard format given to me last year. Hopefully I can simplify it to site, latitude, longitude, and site status, and that will still work.

In [7]:
## Create simplified site table

site_summary = sites.loc[sites['campus'] != 'RCCA', ['site', 'site_status', 'latitude', 'longitude']].copy()
site_summary.drop_duplicates(inplace=True)

print(site_summary.shape)
site_summary.head()

(423, 4)


Unnamed: 0,site,site_status,latitude,longitude
1,3_PALMS_EAST,reference,33.718105,-118.3326
3,ABALONE_COVE_KELP_W,mpa,33.73922,-118.38789
4,ABALONE_POINT_1,reference,39.6915,-123.8141
5,ABALONE_POINT_2,reference,39.66502,-123.80435
6,ABALONE_POINT_3,reference,39.62877,-123.79658


Well, it sort of works.

I had to exclude the RCCA sites, which weren't included in my original site table. In addition, there are five sites where slightly different lat/lon values have been provided by different groups. **I'll have to arbitrarily pick one for the moment.**

There are also a number of sites that have no data in the sizefreq table. This is not necessarily a problem for me right now, but worth making a note of. 

Finally, there are a bunch of sites with site_status missing. A subset of these also have lat, lon missing.

```python
# Sites with conflicting lat/lon entries
site_summary[site_summary['site'].duplicated()]

# Sites with no data in sizefreq table
data_sites = set(data['site'].unique())
for name in site_summary['site'].unique():
    if name not in data_sites:
        print(name)

# Sites where site_status is missing
site_summary[site_summary['site_status'].isna()]

# Sites where lat, lon is missing
site_summary[site_summary['latitude'].isna()]
```
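
Before picking one coordinate pair arbitrarily, it may help to see how far apart the conflicting entries actually are. This is a minimal sketch on made-up sites (the names and coordinates are invented). It reports the spread in each coordinate per site.

```python
import pandas as pd

# Toy site table; SITE_A has two slightly different latitude entries
toy_sites = pd.DataFrame({
    'site': ['SITE_A', 'SITE_A', 'SITE_B'],
    'latitude': [34.0000, 34.0010, 36.5],
    'longitude': [-120.0000, -120.0000, -121.9],
})

# Range (max - min) of each coordinate within a site; zero means the entries agree
spread = toy_sites.groupby('site')[['latitude', 'longitude']].agg(lambda s: s.max() - s.min())

# Keep only the sites where at least one coordinate disagrees
conflicts = spread[(spread > 0).any(axis=1)]
print(conflicts)
```

For scale, 0.001 degrees of latitude is roughly 111 m, so a spread of that size is worth keeping in mind when choosing a coordinate uncertainty.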

In [8]:
## Remove sites with not-quite-identical coordinates manually ----- THIS CAN BE CHANGED WITH AN UPDATED SITE TABLE

print(site_summary.shape)
site_summary = site_summary.drop(index=site_summary[site_summary['site'].duplicated()].index)
site_summary.shape

(423, 4)


(418, 4)

### Convert

In [8]:
## Merge to add site_name (also lat, lon and site_status) to data table

data = data.merge(site_summary, how='left', on='site')
data.head()

Unnamed: 0,campus,method,survey_year,year,month,day,site,location,zone,transect,...,count,size,disease,depth,observer,notes,site_name_old,site_status,latitude,longitude
0,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,...,1,3.0,,,IAN TANIGUCHI,,,mpa,33.891567,-120.1192
1,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,...,1,3.2,,,IAN TANIGUCHI,,,mpa,33.891567,-120.1192
2,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,...,1,3.6,,,IAN TANIGUCHI,,,mpa,33.891567,-120.1192
3,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,...,1,3.8,,,IAN TANIGUCHI,,,mpa,33.891567,-120.1192
4,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,...,1,4.0,,,IAN TANIGUCHI,,,mpa,33.891567,-120.1192


In [9]:
## Pad month and day as needed

data = data.astype({'month':str, 'day':str})
data['month'] = data['month'].str.pad(2, fillchar='0')
data['day'] = data['day'].str.pad(2, fillchar='0')
data.head()

Unnamed: 0,campus,method,survey_year,year,month,day,site,location,zone,transect,...,count,size,disease,depth,observer,notes,site_name_old,site_status,latitude,longitude
0,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,...,1,3.0,,,IAN TANIGUCHI,,,mpa,33.891567,-120.1192
1,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,...,1,3.2,,,IAN TANIGUCHI,,,mpa,33.891567,-120.1192
2,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,...,1,3.6,,,IAN TANIGUCHI,,,mpa,33.891567,-120.1192
3,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,...,1,3.8,,,IAN TANIGUCHI,,,mpa,33.891567,-120.1192
4,UCSB,SBTL_SIZEFREQ_PISCO,2003,2003,10,15,SRI_SOUTH_POINT_E,,MID,,...,1,4.0,,,IAN TANIGUCHI,,,mpa,33.891567,-120.1192


In [10]:
## Create eventID

# There are 2571 zones = NaN. In order to form an event ID, I'm going to replace these with ''
data['zone'] = data['zone'].replace(np.nan, '')

# Create eventID
eventID = data['site'] + '_' + data['year'].astype(str) + data['month'] + data['day'] + '_' + data['zone']
occ = pd.DataFrame({'eventID':eventID})

# Strip trailing underscores (left behind when zone is missing)
occ['eventID'] = occ['eventID'].str.rstrip('_')

occ.head()

Unnamed: 0,eventID
0,SRI_SOUTH_POINT_E_20031015_MID
1,SRI_SOUTH_POINT_E_20031015_MID
2,SRI_SOUTH_POINT_E_20031015_MID
3,SRI_SOUTH_POINT_E_20031015_MID
4,SRI_SOUTH_POINT_E_20031015_MID
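
The rows shown all have a zone, so here is a small sketch of what the eventID looks like when the zone is missing. The frame is made up (the site name is invented), and the trailing underscore is removed with `rstrip`.

```python
import numpy as np
import pandas as pd

# Toy rows: same site and date, one with a zone and one without
toy = pd.DataFrame({
    'site': ['SITE_A', 'SITE_A'],
    'year': [2003, 2003],
    'month': ['10', '10'],
    'day': ['15', '15'],
    'zone': ['MID', np.nan],
})

zone = toy['zone'].fillna('')
event_id = (
    toy['site'] + '_' + toy['year'].astype(str) + toy['month'] + toy['day'] + '_' + zone
).str.rstrip('_')  # drop the underscore left behind by an empty zone

print(event_id.tolist())
```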


In [11]:
## eventDate

# Create survey_date column in data
data['survey_date'] = data['year'].astype(str) + data['month'] + data['day']

# Set eventDate to survey_date
occ['eventDate'] = data['survey_date']

# Format
formatted = [datetime.strptime(dt, '%Y%m%d').date().isoformat() for dt in occ['eventDate']]
occ['eventDate'] = formatted
occ.head()

Unnamed: 0,eventID,eventDate
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15
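
As an alternative to the list comprehension, `pd.to_datetime` can assemble dates from columns named year, month and day. This sketch uses a made-up frame with integer columns. An invalid date (say, day 31 in a 30-day month) raises an error unless `errors='coerce'` is passed.

```python
import pandas as pd

# Toy date parts; values are invented for illustration
toy = pd.DataFrame({'year': [2003, 2004], 'month': [10, 1], 'day': [15, 7]})

# Build datetimes from the three columns, then format as ISO 8601 dates
event_date = pd.to_datetime(toy[['year', 'month', 'day']]).dt.strftime('%Y-%m-%d')
print(event_date.tolist())
```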


In [14]:
## Dataset ID

occ['datasetID'] = 'PISCO size-frequency'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency


In [15]:
## InstitutionID, locality, locationRemarks

occ['institutionID'] = data['campus']
occ['locality'] = data['site']
occ['locationRemarks'] = data['site_status']

# Update locationRemarks to vocabulary used in other data sets
occ['locationRemarks'] = occ['locationRemarks'].replace({'mpa':'marine protected area'})
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area


In [16]:
## Add countryCode, decimalLat, decimalLon

occ['countryCode'] = 'US'
occ['decimalLatitude'] = data['latitude']
occ['decimalLongitude'] = data['longitude']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192


In [17]:
## Add coordinateUncertaintyInMeters

occ['coordinateUncertaintyInMeters'] = 250

In [18]:
## minimumDepthInMeters, maximumDepthInMeters

occ['minimumDepthInMeters'] = data['depth']
occ['maximumDepthInMeters'] = data['depth']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,
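
Depth is blank for many rows, including the ones shown. One option (not adopted in this notebook) would be to fall back on the nominal target depths from the zone definitions when no depth was recorded. This is only a sketch on a made-up frame. The zone targets are approximate ("roughly" in the definitions), so filled values would need to be flagged as estimates.

```python
import numpy as np
import pandas as pd

# Nominal (min, max) target depths in meters per zone, from the zone definitions
zone_depth = {
    'INNER': (5, 5), 'INMID': (10, 10), 'MID': (10, 15),
    'OUTMID': (15, 15), 'OUTER': (20, 20), 'DEEP': (25, 25),
}

# Toy rows: missing depth with a zone, recorded depth, and neither
toy = pd.DataFrame({'zone': ['MID', 'OUTER', ''], 'depth': [np.nan, 18.3, np.nan]})

zone_min = toy['zone'].map({z: d[0] for z, d in zone_depth.items()})
zone_max = toy['zone'].map({z: d[1] for z, d in zone_depth.items()})

# Recorded depth wins; the zone target is used only where depth is missing
toy['minimumDepthInMeters'] = toy['depth'].fillna(zone_min)
toy['maximumDepthInMeters'] = toy['depth'].fillna(zone_max)
print(toy)
```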


In [53]:
## Add samplingProtocol

occ['samplingProtocol'] = data['method']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO


In [54]:
## Add occurrenceID

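# Number occurrences within each event so that occurrenceID is unique;
# eventID does not include transect, so a per-transect counter could repeat within an event
occ['occurrenceID'] = occ.groupby('eventID').cumcount() + 1
occ['occurrenceID'] = occ['eventID'] + '_occ' + occ['occurrenceID'].astype(str)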

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,occurrenceID
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ1
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ2
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ3
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ4
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ5


In [55]:
## Load species table

filename = 'species_table_through_2020.csv'
species = pd.read_csv(filename)

print(species.shape)
species.head()

(1937, 42)


Unnamed: 0,sample_type,sample_subtype,campus,pisco_classcode,orig_classcode,crane_code,genus,species,common_name,max_total_length,...,LOOKED2019,LOOKED2020,Taxonomic_source,AphiaID,ScientificName,Kingdom,Phylum,Class,Order,Family
0,FISH,FISH,HSU,AARG,AARG,,Amphistichus,argenteus,Barred Surfperch,43.0,...,X,X,WoRMS,279594,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae
1,FISH,FISH,UCSB,AARG,AARG,,Amphistichus,argenteus,Barred Surfperch,43.0,...,X,X,WoRMS,279594,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae
2,FISH,FISH,VRG,AARG,Amphistichus argenteus,,Amphistichus,argenteus,Barred Surfperch,43.0,...,X,X,WoRMS,279594,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae
3,FISH,FISH,HSU,ACOR,ACOR,ACOR,Artedius,corallinus,Coralline Sculpin,14.0,...,,,WoRMS,279699,Artedius corallinus,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae
4,FISH,FISH,UCSB,ACOR,ACOR,ACOR,Artedius,corallinus,Coralline Sculpin,14.0,...,,,WoRMS,279699,Artedius corallinus,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae


In [66]:
## Change pisco_classcode column to classcode --------- THIS CAN BE CHANGED AFTER SPECIES TABLE IS UPDATED

species.rename({'pisco_classcode':'classcode',
                'ScientificName':'species_definition'}, axis='columns', inplace=True)

In [67]:
## Select relevant species

sf_sp = species.loc[species['sample_type'] == 'SIZEFREQ', ['sample_type', 'classcode', 'species_definition', 'common_name']]
sf_sp.drop_duplicates(inplace=True)

In [69]:
## Map classcodes to species definitions (usually scientific names) and classcodes to common names

code_to_sci_dict = dict(zip(sf_sp['classcode'], sf_sp['species_definition']))
code_to_com_dict = dict(zip(sf_sp['classcode'], sf_sp['common_name']))

In [72]:
## Create scientificName, vernacularName

occ['vernacularName'] = data['classcode'].replace(code_to_com_dict)

occ['scientificName'] = data['classcode'].replace(code_to_sci_dict)

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,occurrenceID,vernacularName,scientificName
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ1,Red Urchin - all sizes,Mesocentrotus franciscanus
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ2,Red Urchin - all sizes,Mesocentrotus franciscanus
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ3,Red Urchin - all sizes,Mesocentrotus franciscanus
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ4,Red Urchin - all sizes,Mesocentrotus franciscanus
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ5,Red Urchin - all sizes,Mesocentrotus franciscanus


In [73]:
## Get unique scientific names for lookup in WoRMS

names = occ['scientificName'].unique()

There were no species that had uncertain identifications at the species level, therefore an **identificationRemarks column is not necessary.**

There were a few classcodes that are in the species table but are not in the data:
- ASTSER
- HALSPP
- LYTPIC
- NO_ORG
- PYCHEL
- SOLDAW
- SOLSPP
- SOLSTI
- STYFOR
- DELETE

I don't think this is actually a problem, but worth noting.

```python
# Codes that are in species table but not data
for code in sf_sp['classcode'].unique():
    if code not in data['classcode'].unique():
        print(code)
```
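
The reverse direction seems worth checking too. A classcode that appears in the data but not in the species table would pass through `replace` unchanged and end up as a raw code in scientificName. Here is a minimal sketch using sets on made-up frames (`XXXXXX` is an invented code):

```python
import pandas as pd

# Toy tables; XXXXXX stands in for a code with no entry in the species table
toy_species = pd.DataFrame({'classcode': ['MESFRA', 'LYTPIC', 'NO_ORG']})
toy_data = pd.DataFrame({'classcode': ['MESFRA', 'MESFRA', 'XXXXXX']})

species_codes = set(toy_species['classcode'])
data_codes = set(toy_data['classcode'])

unused = sorted(species_codes - data_codes)    # defined but never observed
unmapped = sorted(data_codes - species_codes)  # observed but missing a definition

print(unused)
print(unmapped)
```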

In [84]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

In [86]:
## Add scientific name-related columns

occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ1,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ2,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ3,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ4,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ5,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102
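
In the rows shown, scientificNameID is a WoRMS LSID that ends in the same number stored in taxonID. A name that WoRMS failed to match would be left as plain text by `replace`, so a format check can catch it. This is a sketch on a made-up frame where the second row simulates an unmatched name:

```python
import pandas as pd

# Toy rows: one matched record and one simulated unmatched name
toy = pd.DataFrame({
    'scientificNameID': ['urn:lsid:marinespecies.org:taxname:591102', 'Some unmatched name'],
    'taxonID': [591102, 'Some unmatched name'],
})

lsid_prefix = 'urn:lsid:marinespecies.org:taxname:'

# A valid entry starts with the LSID prefix...
is_lsid = toy['scientificNameID'].str.startswith(lsid_prefix)
# ...and the remainder equals the taxonID
id_matches = (
    toy['scientificNameID'].str.replace(lsid_prefix, '', regex=False)
    == toy['taxonID'].astype(str)
)

ok = is_lsid & id_matches
print(toy[~ok])
```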


In [87]:
## No identificationRemarks needed, replace scientificName using name_name_dict

occ['scientificName'] = occ['scientificName'].replace(name_name_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ1,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ2,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ3,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ4,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ5,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102


In [88]:
## Add final name-related columns

occ['nameAccordingTo'] = 'WoRMS'
occ['occurrenceStatus'] = 'present'
occ['basisOfRecord'] = 'HumanObservation'

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,maximumDepthInMeters,samplingProtocol,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ1,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ2,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ3,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ4,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ5,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation


In [89]:
## Add individualCount

occ['individualCount'] = data['count']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,samplingProtocol,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,individualCount
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ1,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation,1
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ2,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation,1
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ3,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation,1
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ4,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation,1
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,SBTL_SIZEFREQ_PISCO,SRI_SOUTH_POINT_E_20031015_MID_occ5,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation,1


In [92]:
## Add notes under occurrenceRemarks

occ['occurrenceRemarks'] = data['notes']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,individualCount,occurrenceRemarks
0,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,SRI_SOUTH_POINT_E_20031015_MID_occ1,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation,1,
1,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,SRI_SOUTH_POINT_E_20031015_MID_occ2,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation,1,
2,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,SRI_SOUTH_POINT_E_20031015_MID_occ3,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation,1,
3,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,SRI_SOUTH_POINT_E_20031015_MID_occ4,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation,1,
4,SRI_SOUTH_POINT_E_20031015_MID,2003-10-15,PISCO size-frequency,UCSB,SRI_SOUTH_POINT_E,marine protected area,US,33.891567,-120.1192,250,...,SRI_SOUTH_POINT_E_20031015_MID_occ5,Red Urchin - all sizes,Mesocentrotus franciscanus,urn:lsid:marinespecies.org:taxname:591102,591102,WoRMS,present,HumanObservation,1,


In [93]:
# Save size and disease data for MoF file

# Get size data
size_df = pd.DataFrame({'eventID':occ['eventID'],
                        'occurrenceID':occ['occurrenceID'],
                        'scientificName':occ['scientificName'],
                        'commonName':occ['vernacularName'],
                        'size':data['size']})
print(size_df.shape)

# Create a measurementType column; sea stars are the default, other taxa are overwritten below
size_df['measurementType'] = 'longest arm length' # sea stars
size_df.loc[size_df['scientificName'].isin(['Mesocentrotus franciscanus', 
                                            'Strongylocentrotus purpuratus',
                                            'Lytechinus pictus']), 'measurementType'] = 'test diameter' # urchins
size_df.loc[size_df['scientificName'].isin(['Haliotis rufescens', 
                                            'Haliotis walallensis',
                                            'Haliotis',
                                            'Haliotis kamtschatkana',
                                            'Haliotis corrugata',
                                            'Haliotis cracherodii',
                                            'Haliotis fulgens']), 'measurementType'] = 'shell length' # abalone
size_df.loc[size_df['scientificName'] == 'Panulirus interruptus', 'measurementType'] = 'carapace length' # lobsters
size_df.loc[size_df['scientificName'].isin(['Apostichopus californicus',
                                            'Apostichopus parvimensis']), 'measurementType'] = 'total turgid length' # sea cucumbers
size_df.loc[size_df['scientificName'].isin(['Kelletia kelletii', 
                                            'Megathura crenulata', 
                                            'Megastraea undosa',
                                            'Pomaulax gibberosus']), 'measurementType'] = 'shell length' # gastropods

# Get disease data
disease_df = pd.DataFrame({'eventID':occ['eventID'],
                           'occurrenceID':occ['occurrenceID'],
                           'scientificName':occ['scientificName'],
                           'disease':data['disease']})
disease_df.dropna(subset=['disease'], inplace=True)
print(disease_df.shape)

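# Change the disease category 'YES' to something more descriptive
# (use .loc with a column label so only the disease column is overwritten, not the whole row)
disease_df.loc[disease_df['disease'] == 'YES', 'disease'] = 'DISEASED'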

(50348, 5)
(2305, 4)
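
The chain of `.loc` assignments in the cell above could also be written as a single lookup table. The following is a minimal sketch on toy data, assuming a hypothetical dictionary `MEASUREMENT_TYPES` and helper `assign_measurement_type` (neither is in the original notebook). It keeps 'longest arm length' as the fallback for any species not listed, as the cell above does, and shows only a few of the species.

```python
import pandas as pd

# Hypothetical lookup table (abbreviated); a full version would list every species used above
MEASUREMENT_TYPES = {
    'Mesocentrotus franciscanus': 'test diameter',       # urchins
    'Strongylocentrotus purpuratus': 'test diameter',
    'Haliotis rufescens': 'shell length',                # abalone
    'Kelletia kelletii': 'shell length',                 # gastropods
    'Panulirus interruptus': 'carapace length',          # lobsters
    'Apostichopus californicus': 'total turgid length',  # sea cucumbers
}

def assign_measurement_type(names, default='longest arm length'):
    """Map scientific names to a measurement type; unlisted species get the default."""
    return names.map(MEASUREMENT_TYPES).fillna(default)

# Toy data for illustration only
example = pd.DataFrame({'scientificName': ['Mesocentrotus franciscanus',
                                           'Pisaster giganteus',
                                           'Panulirus interruptus']})
example['measurementType'] = assign_measurement_type(example['scientificName'])
print(example)
```

One trade-off of this form is that a misspelled species name silently falls back to the sea star default, the same as in the original chain, so checking `value_counts()` on the result may still be worthwhile.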


In [95]:
## Change NaN to '' in string fields

occ['occurrenceRemarks'] = occ['occurrenceRemarks'].replace(np.nan, '')
occ.isna().sum()

eventID                              0
eventDate                            0
datasetID                            0
institutionID                        0
locality                             0
locationRemarks                    238
countryCode                          0
decimalLatitude                    238
decimalLongitude                   238
coordinateUncertaintyInMeters        0
minimumDepthInMeters             21186
maximumDepthInMeters             21186
samplingProtocol                     0
occurrenceID                         0
vernacularName                       0
scientificName                       0
scientificNameID                     0
taxonID                              0
nameAccordingTo                      0
occurrenceStatus                     0
basisOfRecord                        0
individualCount                      0
occurrenceRemarks                    0
dtype: int64
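
The summary above still shows 238 missing values in `locationRemarks`, which is a text field like `occurrenceRemarks`. With `na_rep='NaN'`, those would be written to the CSV as the string 'NaN'. If they should be empty strings instead, one option is to fill a named list of text columns in one step. This is a minimal sketch on toy data, assuming a hypothetical list `string_cols` (not in the original notebook); whether `locationRemarks` should be blanked is a judgment call for this data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for occ; values are illustrative only
toy = pd.DataFrame({'locationRemarks': ['marine protected area', np.nan],
                    'occurrenceRemarks': [np.nan, 'sized on boat'],
                    'decimalLatitude': [33.891567, np.nan]})

# Hypothetical list of text columns to blank; numeric columns are left as NaN
string_cols = ['locationRemarks', 'occurrenceRemarks']
toy[string_cols] = toy[string_cols].fillna('')

print(toy.isna().sum())
```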

In [96]:
## Save

occ.to_csv('PISCO_sizefreq_occurrence_20210816.csv', index=False, na_rep='NaN')

## Create MoF file

In [97]:
## Finalize occurrence-level measurements and facts

# Size
size_mof = pd.DataFrame({'eventID':size_df['eventID'],
                        'occurrenceID':size_df['occurrenceID'],
                        'measurementType':size_df['measurementType'],
                        'measurementValue':size_df['size'],
                        'measurementUnit':'centimeters',
                        'measurementMethod':'measured by diver'})

# Disease
dis_mof = pd.DataFrame({'eventID':disease_df['eventID'],
                       'occurrenceID':disease_df['occurrenceID'],
                       'measurementType':'observation',
                       'measurementValue':disease_df['disease'].str.lower(),
                       'measurementUnit':np.nan,
                       'measurementMethod':'visually determined by diver'})
dis_mof

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
18822,SCI_HAZARDS_W_20140702_OUTER,SCI_HAZARDS_W_20140702_OUTER_occ5,observation,healthy,,visually determined by diver
18823,SCI_HAZARDS_W_20140702_OUTER,SCI_HAZARDS_W_20140702_OUTER_occ6,observation,healthy,,visually determined by diver
18824,SCI_HAZARDS_W_20140702_OUTER,SCI_HAZARDS_W_20140702_OUTER_occ7,observation,healthy,,visually determined by diver
18825,SCI_HAZARDS_W_20140702_OUTER,SCI_HAZARDS_W_20140702_OUTER_occ8,observation,healthy,,visually determined by diver
18826,SCI_HAZARDS_W_20140702_OUTER,SCI_HAZARDS_W_20140702_OUTER_occ9,observation,mild,,visually determined by diver
...,...,...,...,...,...,...
27032,ANACAPA_LIGHTHOUSE_REEF_E_20201023_OUTER,ANACAPA_LIGHTHOUSE_REEF_E_20201023_OUTER_occ36,observation,black spot,,visually determined by diver
27044,ANACAPA_LIGHTHOUSE_REEF_E_20201023_OUTER,ANACAPA_LIGHTHOUSE_REEF_E_20201023_OUTER_occ48,observation,black spot,,visually determined by diver
27243,SMI_HARRIS_PT_RESERVE_W_20201024,SMI_HARRIS_PT_RESERVE_W_20201024_occ74,observation,wasting,,visually determined by diver
27574,SCI_GULL_ISLE_W_20201111_MID,SCI_GULL_ISLE_W_20201111_MID_occ62,observation,wasting,,visually determined by diver


In [98]:
## Concatenate

mof = pd.concat([size_mof, dis_mof])
mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,SRI_SOUTH_POINT_E_20031015_MID,SRI_SOUTH_POINT_E_20031015_MID_occ1,test diameter,3.0,centimeters,measured by diver
1,SRI_SOUTH_POINT_E_20031015_MID,SRI_SOUTH_POINT_E_20031015_MID_occ2,test diameter,3.2,centimeters,measured by diver
2,SRI_SOUTH_POINT_E_20031015_MID,SRI_SOUTH_POINT_E_20031015_MID_occ3,test diameter,3.6,centimeters,measured by diver
3,SRI_SOUTH_POINT_E_20031015_MID,SRI_SOUTH_POINT_E_20031015_MID_occ4,test diameter,3.8,centimeters,measured by diver
4,SRI_SOUTH_POINT_E_20031015_MID,SRI_SOUTH_POINT_E_20031015_MID_occ5,test diameter,4.0,centimeters,measured by diver


In [100]:
## Change NaN to '' in string fields

mof['measurementUnit'] = mof['measurementUnit'].replace(np.nan, '')
mof.isna().sum()

eventID              0
occurrenceID         0
measurementType      0
measurementValue     0
measurementUnit      0
measurementMethod    0
dtype: int64

In [101]:
## Save

mof.to_csv('PISCO_sizefreq_MoF_20210816.csv', index=False, na_rep='NaN')
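
Because the MoF rows are linked to the occurrence file through `occurrenceID`, a quick consistency check before publishing may be useful. This is a minimal sketch on toy data, assuming a hypothetical helper `check_mof_links` (not in the original notebook). It reports MoF IDs that have no matching occurrence record and occurrence IDs that appear more than once.

```python
import pandas as pd

def check_mof_links(occ, mof):
    """Return (MoF occurrenceIDs missing from occ, occurrenceIDs duplicated in occ)."""
    missing = sorted(set(mof['occurrenceID']) - set(occ['occurrenceID']))
    dup_mask = occ['occurrenceID'].duplicated()
    duplicated = sorted(occ.loc[dup_mask, 'occurrenceID'].unique())
    return missing, duplicated

# Toy data for illustration only
occ_example = pd.DataFrame({'occurrenceID': ['A_occ1', 'A_occ2']})
mof_example = pd.DataFrame({'occurrenceID': ['A_occ1', 'A_occ1', 'A_occ3']})

missing, duplicated = check_mof_links(occ_example, mof_example)
print(missing, duplicated)
```

Repeated IDs in the MoF table are expected, since one occurrence can carry both a size and a disease record, so only the occurrence table is checked for duplicates.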

## Questions

I have no questions or comments specific to this data set. The problems noted with the site and species tables in the fish and swath conversion code apply here as well.