# PISCO - fish transect data

The density of all conspicuous fishes (i.e. species whose adults are longer than 10 cm and visually detectable by SCUBA divers) is recorded visually along replicate transects that are 2 m wide by 2 m tall by 30 m long (120 m3).
- Transects are performed in 2-3 heights: bottom, mid-water and canopy
    - Bottom transects are always performed; a diver searches in cracks and crevices with a flashlight
    - Mid-water transects are always performed; a second diver surveys 120 m3 about 1/3 - 1/2 of the way up into the water column
    - Canopy transects are surveyed at a subset of sites, and are usually completed separately from the bottom and midwater transects; a diver swims 2m below the surface counting fishes in the top two meters of the water column
- Three 30 m long transects, distributed end-to-end and 5-10 m apart, are typically performed at each height, and at each of four depths:
    - 5m
    - 10m
    - 15m
    - 20m 
    - transects at the 25 m isobath are performed by VRG where habitat is available
- Survey depths may vary based on reef topography 
- Counts on mid-water and bottom transects are eventually combined, generating 12 replicate transects for each site. **Are these already combined in this data set? [No, doesn't look like it.]**
- **Note** that at sites with narrow kelp beds, particularly in parts of the Northern Channel Islands, only two depths are sampled, with four transects in each depth zone, for a total of eight replicate transects
- Surveyors record:
    - The total length (TL) of each fish observed
    - Transect depth
    - Horizontal visibility along each transect (**must be at least 3 m to perform fish transects**)
    - Water temperature
    - Sea state (surge)
    - Percent of the transect volume occupied by kelp (PISCO only)
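
The transect geometry and replicate counts described above reduce to simple arithmetic. A minimal sketch follows; the function name `replicates_per_site` is my own and is not part of the PISCO protocol.

```python
# Transect dimensions taken from the protocol description above.
WIDTH_M, HEIGHT_M, LENGTH_M = 2, 2, 30

# Volume searched on a single transect, in cubic meters.
transect_volume_m3 = WIDTH_M * HEIGHT_M * LENGTH_M

def replicates_per_site(n_depths, transects_per_depth):
    """Replicate transects per site once mid-water and bottom counts are combined."""
    return n_depths * transects_per_depth

# Standard design: four depths, three transects each.
standard_site = replicates_per_site(n_depths=4, transects_per_depth=3)

# Narrow kelp beds: two depths, four transects each.
narrow_kelp_site = replicates_per_site(n_depths=2, transects_per_depth=4)

print(transect_volume_m3, standard_site, narrow_kelp_site)
```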

**Resources**
- https://opc.dataone.org/view/MLPA_kelpforest.metadata.1

In [1]:
## Imports

import pandas as pd
import numpy as np
import random
import math

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Load data

# path = 'C:\\Users\\dianalg\\Documents\\Work\\MBARI\\MPA Data Integration\\PISCO\\'
filename = 'MLPA_kelpforest_fish.1.csv'
fish = pd.read_csv(filename, encoding='ANSI', dtype={'transect':str, 'sex':str, 'site_name_old':str})

print(fish.shape)
fish.head()

(381693, 24)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,max_tl,sex,observer,depth,vis,temp,surge,pctcnpy,notes,site_name_old
0,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,MARK CARR,6.1,2.4,,HIGH,1.0,,
1,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,MARK CARR,6.1,2.4,,HIGH,1.0,,
2,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,MARK CARR,6.1,2.4,,HIGH,1.0,,
3,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,MARK CARR,6.1,2.4,,HIGH,1.0,,
4,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,8.0,,MARK CARR,6.1,2.4,,HIGH,1.0,,


### Column definitions

**campus** = UCSC, UCSB, HSU or VRG. The academic partner campus that conducted the survey. <br>
**method** = SBTL_FISH_PISCO, SBTL_FISH_CRANE, SBTL_FISH_HSU or SBTL_FISH_VRG. The code describing the sampling technique and monitoring program that conducted each survey. **How is this different from the previous column? Does it actually indicate further methodological differences?** <br>
**survey_year** = 1999 - 2018. The designated year associated with the annual survey. In most cases, survey_year and year are the same. In rare cases, surveys are completed early in the year following the designated survey year. In these cases, survey_year will differ from year. <br>
**year** = 1999 - 2018. Year that the survey was conducted. <br>
**month** = 1 - 12. Month that the survey was conducted. <br>
**day** = 1 - 31. Day that the survey was conducted. <br>
**site** = One of 380 site codes. The unique site where the survey was performed (as defined in the site table). This site refers to a specific GPS location and is often associated with a geographic placename. Often, multiple site replicates will be associated with a single placename, and will be delineated with additional geographical or directional information (e.g. North/South/East/West/Central - N/S/E/W/CEN, Upcoast/Downcoast - UC/DC) <br>
**zone** = INNER, OUTER, OUTMID, INMID, MID or DEEP. A division of the site into 2 to 4 categories representing onshore-offshore stratification associated with targeted bottom depths for transects.
- INNER: Depth zone targeting roughly 5m of water depth, or the inner edge of the reef
- INMID: Depth zone targeting roughly 10m of water depth 
- MID: Depth zone targeting roughly 10-15m of water depth, used by VRG and in early years of PISCO
- OUTMID: Depth zone targeting roughly 15m of water depth 
- OUTER: Depth zone targeting roughly 20m of water depth 
- DEEP: Depth zone targeting roughly 25m of water depth, where present, used only by VRG

**level** = BOT, CAN, MID or CNMD. The vertical placement of the transect within the water column. Includes BOT: bottom transects placed at the seafloor, MID: midwater transects placed at roughly half the depth of the seafloor, and CAN: canopy transects placed at the surface to survey the top two meters of the water column and kelp canopy. CNMD is used when an inner transect is too shallow to allow both canopy and midwater transects without overlapping (applies to UCSB and VRG only) <br>
**transect** = It seems like this should only be 1 - 12, but there are a number of other designations as well. The unique transect replicate within each site, zone, and level. <br>
**classcode** = One of 166 taxon codes. The unique taxonomic classification code that is being counted, as defined in the taxonomic table. This refers to a code that defines the Genus and Species that is identified, a code that represents a limited number of species that can't be narrowed down to one species, or in some cases family-level or higher order groupings. Generally, for fishes, the classcode is comprised of the first letter of the genus, and the first three letters of the species, with some exceptions <br>
**count** = The number of individuals of a given classcode of a given size per transect <br>
**fish_tl** = The total length of an individual or group of individuals (of the same length) OR the average total length for a group of fish where a range in lengths is specified (rounded to the nearest cm) <br>
**min_tl** = The minimum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species <br>
**max_tl** = The maximum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species <br>
**sex** = MALE, FEMALE, TRANSITIONAL, JUVENILE or 'nan'. The sex classification for sexually dimorphic species where sex can be distinguished visually and is recorded. For some species, individuals with juvenile markings are also indicated here. The TRANSITIONAL class is used for fish with external morphological features consistent with both male and female (applies to sex changing fishes such as California sheephead). JUVENILE is not always indicated when a juvenile fish is observed. <br>
**observer** = The diver who conducted the survey transect <br>
**depth** = Between 0.2 and 33.4 meters. Depth of the transect estimated by the diver. **Does this mean a dive computer was used?** <br>
**vis** = Between 1 and 35 meters. The diver's estimation of horizontal visibility on each transect. Measured by reeling in the transect tape and noting the distance at which the end of the tape can first be seen. <br>
**temp** = Between 7 and 25.6 degrees C. The temperature on each transect measured by the diver's computer. <br>
**surge** = HIGH, MODERATE, LIGHT or 'nan'. The diver's estimation of magnitude of horizontal displacement on each transect.
- LIGHT: No significant surge
- MODERATE: Surge causing noticeable lateral movement, diver must compensate
- HIGH: Significant surge, diver moved out of transect bounds when not holding on

**pctcnpy** = 0 - 3 or NaN. The diver's estimation of the percent of the transect, by volume, that is occupied by kelp. This estimation is specific to the level of the transect that is being surveyed (i.e. excluding canopy transects, this is not an estimation of surface canopy but of the amount of kelp within the transect at the specified level). **I believe this measure was only recorded by PISCO.**
- 0: 0% of transect volume occupied by kelp
- 1: 1-33% of transect volume occupied by kelp
- 2: 34-66% of transect volume occupied by kelp
- 3: 67-100% of transect volume occupied by kelp
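
To make the pctcnpy codes easier to interpret downstream, they can be decoded into percent bounds. This is a minimal sketch on toy values; `PCTCNPY_BOUNDS` and `decode_pctcnpy` are names I made up for illustration, and the bounds are copied from the list above.

```python
import numpy as np
import pandas as pd

# Lower and upper percent bounds for each pctcnpy code (from the list above).
PCTCNPY_BOUNDS = {0: (0, 0), 1: (1, 33), 2: (34, 66), 3: (67, 100)}

def decode_pctcnpy(codes):
    """Return the lower/upper percent bounds for each code; NaN stays NaN."""
    codes = pd.Series(codes, dtype='float64')
    lower = codes.map(lambda c: PCTCNPY_BOUNDS[int(c)][0], na_action='ignore')
    upper = codes.map(lambda c: PCTCNPY_BOUNDS[int(c)][1], na_action='ignore')
    return pd.DataFrame({'pctcnpy': codes, 'pct_min': lower, 'pct_max': upper})

# Toy input, not real survey values.
example = decode_pctcnpy([1.0, 3.0, np.nan, 0.0])
print(example)
```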

**notes** = Free text notes taken at the time of the sample, or added at the time of data entry. <br>
**site_name_old** = In cases when specific sites have been surveyed by multiple campuses using different site names, this variable indicates the alternative (historical) site name.

### Strategy

As with the RCCA data, each transect can be an **event** and each fish observation can be an **occurrence**. There are both event-level and occurrence-level measurements, necessitating event and MoF files. 

The **event** file should contain: eventID (from site, survey_year, transect, level?), eventDate (from year, month, day), datasetID, locality (site), localityRemarks (maybe level and/or zone information), countryCode, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. Some notes might be eventRemarks. Should I include the campus information somewhere? Observer?

The **occurrence** file should contain: eventID, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe some notes), sex (sex), lifeStage (sex), organismQuantity (count), organismQuantityType.

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Depth, vis, temp, surge and pctcnpy can be recorded at the event level. Fish_tl, min_tl and max_tl can be recorded at the occurrence level.
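
As a rough illustration of the event-level part of the MoF file, wide measurement columns can be reshaped into one row per measurement with `DataFrame.melt`. This is only a sketch on toy data: the eventIDs, values and unit strings below are assumptions, not taken from the data set.

```python
import pandas as pd

# Toy event-level table; IDs and values are made up for illustration.
events = pd.DataFrame({
    'eventID': ['SITE_A_19990907_INNER_BOT_1', 'SITE_A_19990907_INNER_BOT_2'],
    'depth': [6.1, 7.0],
    'vis': [2.4, 3.0],
    'temp': [None, 12.5],
})

# Assumed unit labels; adjust to whatever vocabulary is chosen for the MoF file.
units = {'depth': 'meters', 'vis': 'meters', 'temp': 'degrees Celsius'}

# One row per (event, measurement), dropping measurements that were not taken.
mof = events.melt(id_vars='eventID', var_name='measurementType', value_name='measurementValue')
mof = mof.dropna(subset=['measurementValue']).reset_index(drop=True)
mof['measurementUnit'] = mof['measurementType'].map(units)

print(mof)
```

Occurrence-level measurements (fish_tl, min_tl, max_tl) could be reshaped the same way, with occurrenceID added to `id_vars`.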

## Create occurrence file

### Get site names

In [4]:
## Load site table

filename = 'MLPA_kelpforest_site_table.1.csv'
sites = pd.read_csv(filename)

print(sites.shape)
sites.head()

(7458, 17)


Unnamed: 0,LTM_project_short_code,campus,method,survey_year,year,site,latitude,longitude,CA_MPA_Name_Short,site_designation,site_status,Secondary_MPA_Name,Secondary_site_designation,Secondary_site_status,BaselineRegion,LongTermRegion,MPA_priority_tier
0,LTM_Kelp_SRock,VRG,SBTL_SIZEFREQ_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
1,LTM_Kelp_SRock,VRG,SBTL_FISH_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
2,LTM_Kelp_SRock,VRG,SBTL_SWATH_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
3,LTM_Kelp_SRock,VRG,SBTL_UPC_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
4,LTM_Kelp_SRock,HSU,SBTL_UPC_HSU,2018,2018,ABALONE_POINT_1,39.6915,-123.8141,Ten Mile SMR,reference,reference,,,,North Coast,North Coast,I


There are two sites in the site table that have no fish records:
- PISMO_W
- SAL_E

**These sites also have latitude and longitude = NaN. In addition, one site that is in the fish table has latitude and longitude = NaN:**
- SCI_PELICAN_FAR_WEST

```python
sites[sites['latitude'].isna()]
```
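
The sites that appear in the site table but have no fish records can be found with a set difference. A minimal sketch on toy stand-ins for `sites['site']` and `fish['site']`; the helper name `sites_without_records` is hypothetical.

```python
import pandas as pd

def sites_without_records(site_codes, observed_codes):
    """Return sorted site codes listed in the site table but absent from the fish table."""
    return sorted(set(site_codes) - set(observed_codes))

# Toy stand-ins for sites['site'] and fish['site'].
toy_site_codes = pd.Series(['HOPKINS_DC', 'PISMO_W', 'SAL_E', 'HOPKINS_DC'])
toy_fish_codes = pd.Series(['HOPKINS_DC', 'HOPKINS_DC'])

missing = sites_without_records(toy_site_codes, toy_fish_codes)
print(missing)
```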

Also, it looks like only one lat and lon is given for each site. Additionally, sites have been consistently labeled as either 'reference' or 'mpa'. To check this:
```python
# Groupby
out = sites.groupby(['site']).agg({
    'latitude':pd.Series.nunique,
    'longitude':pd.Series.nunique,
    'site_status':pd.Series.nunique,
    'campus':pd.Series.nunique
})
out.reset_index(inplace=True)

# Check
out[out['latitude'] > 1]
out[out['longitude'] > 1]
out[out['site_status'] > 1]
out[out['campus'] > 1]
```

Since, for the purpose of DwC, we're not interested in which sites were sampled when, I can simplify the site table to only contain relevant information: site, latitude, longitude, and site status. The campus responsible for the survey might also be good to include. **Which campus is responsible for a given site has changed between years in 13 cases. I'll leave this information out for now.**

In [5]:
## Create simplified site table

site_summary = sites[['site', 'site_status', 'latitude', 'longitude']].copy()
site_summary.drop_duplicates(inplace=True)

print(site_summary.shape)
site_summary.head()

(382, 4)


Unnamed: 0,site,site_status,latitude,longitude
0,3 Palms East,reference,33.71762,-118.33215
4,ABALONE_POINT_1,reference,39.6915,-123.8141
15,ABALONE_POINT_2,reference,39.66502,-123.80435
26,ABALONE_POINT_3,reference,39.62877,-123.79658
33,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252


Some site names have spaces or ' - ' characters. I'll replace these in a sensible way.

In [6]:
# Replace ' ' and ' - ' in site names and add site_name column

site_name = [name.replace(' - ', '-') for name in site_summary['site']]
site_name = [name.replace(' ', '_') for name in site_name]
site_summary['site_name'] = site_name

site_summary.head()

Unnamed: 0,site,site_status,latitude,longitude,site_name
0,3 Palms East,reference,33.71762,-118.33215,3_Palms_East
4,ABALONE_POINT_1,reference,39.6915,-123.8141,ABALONE_POINT_1
15,ABALONE_POINT_2,reference,39.66502,-123.80435,ABALONE_POINT_2
26,ABALONE_POINT_3,reference,39.62877,-123.79658,ABALONE_POINT_3
33,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252,ANACAPA_ADMIRALS_CEN


### Convert

In [7]:
## Pad month and day as needed

# Zero-pad to two characters (e.g. 7 -> '07'); kept as lists so they can be indexed by row position below
paddedDay = fish['day'].astype(str).str.zfill(2).tolist()
paddedMonth = fish['month'].astype(str).str.zfill(2).tolist()

In [8]:
## Merge to add site_name (also lat, lon and site_status) to fish table

fish = fish.merge(site_summary, how='left', on='site')
fish.head()

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,vis,temp,surge,pctcnpy,notes,site_name_old,site_status,latitude,longitude,site_name
0,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,2.4,,HIGH,1.0,,,mpa,36.623586,-121.904196,HOPKINS_DC
1,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,2.4,,HIGH,1.0,,,mpa,36.623586,-121.904196,HOPKINS_DC
2,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,2.4,,HIGH,1.0,,,mpa,36.623586,-121.904196,HOPKINS_DC
3,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,2.4,,HIGH,1.0,,,mpa,36.623586,-121.904196,HOPKINS_DC
4,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,2.4,,HIGH,1.0,,,mpa,36.623586,-121.904196,HOPKINS_DC
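
A left merge silently leaves NaN where a site code has no match, and silently duplicates rows if the site table has repeated keys. The sketch below, on toy data, shows how the `validate` and `indicator` arguments of `DataFrame.merge` can guard against both; the site code `UNKNOWN_SITE` is made up.

```python
import pandas as pd

# Toy stand-ins for the fish table and the simplified site table.
toy_fish = pd.DataFrame({'site': ['HOPKINS_DC', 'HOPKINS_DC', 'UNKNOWN_SITE'],
                         'count': [1, 85, 2]})
toy_sites = pd.DataFrame({'site': ['HOPKINS_DC'],
                          'latitude': [36.623586],
                          'longitude': [-121.904196]})

# validate raises MergeError if 'site' is duplicated in the right-hand table;
# indicator adds a '_merge' column showing where each row found a match.
merged = toy_fish.merge(toy_sites, how='left', on='site',
                        validate='many_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'site'].unique().tolist()
print(unmatched)
```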


In [9]:
## Create eventID

eventID = [fish['site_name'].iloc[i] + '_' + str(fish['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + fish['zone'].iloc[i] + '_' + fish['level'].iloc[i] + '_' +
           fish['transect'].iloc[i].replace(' ', '') for i in range(fish.shape[0])]
fish_occ = pd.DataFrame({'eventID':eventID})

fish_occ.head()

Unnamed: 0,eventID
0,HOPKINS_DC_19990907_INNER_BOT_1
1,HOPKINS_DC_19990907_INNER_BOT_1
2,HOPKINS_DC_19990907_INNER_BOT_1
3,HOPKINS_DC_19990907_INNER_BOT_1
4,HOPKINS_DC_19990907_INNER_BOT_1
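
The row-by-row list comprehension works but is slow on ~380,000 rows. The same eventID pattern can be built with vectorized string operations. This is a sketch on a toy frame with the same column names; I have not timed it against the real data.

```python
import pandas as pd

# Toy rows using the same column names as the merged fish table.
toy = pd.DataFrame({
    'site_name': ['HOPKINS_DC', 'HOPKINS_DC'],
    'year': [1999, 1999], 'month': [9, 10], 'day': [7, 12],
    'zone': ['INNER', 'OUTER'], 'level': ['BOT', 'CAN'],
    'transect': ['1', ' 2'],
})

# YYYYMMDD with zero-padded month and day.
date = (toy['year'].astype(str)
        + toy['month'].astype(str).str.zfill(2)
        + toy['day'].astype(str).str.zfill(2))

# site_date_zone_level_transect, with spaces removed from the transect label.
event_id = (toy['site_name'] + '_' + date + '_' + toy['zone'] + '_' + toy['level']
            + '_' + toy['transect'].str.replace(' ', '', regex=False))

print(event_id.tolist())
```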


In [10]:
## Add occurrenceID

# Create survey_date column in fish
fish['survey_date'] = [str(fish['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(fish.shape[0])]

# Groupby to create occurrenceID
fish_occ['occurrenceID'] = fish.groupby(['site', 'survey_date', 'zone', 'level', 'transect'])['classcode'].cumcount()+1
fish_occ['occurrenceID'] = fish_occ['eventID'] + '_occ' + fish_occ['occurrenceID'].astype(str)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5
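
Since occurrenceID must be unique across the whole file, it is worth checking that explicitly. A minimal sketch on toy data showing how `cumcount` numbers rows within each event, followed by a uniqueness check:

```python
import pandas as pd

# Toy occurrence rows; note that rows from the same event need not be adjacent.
toy = pd.DataFrame({
    'eventID': ['EV1', 'EV1', 'EV2', 'EV1'],
    'classcode': ['ELAT', 'OCAL', 'ELAT', 'OYT'],
})

# cumcount() numbers rows 0, 1, 2, ... within each group, in row order.
within_event = toy.groupby('eventID').cumcount() + 1
toy['occurrenceID'] = toy['eventID'] + '_occ' + within_event.astype(str)

ids_are_unique = toy['occurrenceID'].is_unique
print(toy['occurrenceID'].tolist(), ids_are_unique)
```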


In [11]:
## Load species table

filename = 'MLPA_kelpforest_taxon_table.1.csv'
species = pd.read_csv(filename)

print(species.shape)
species.head()

(1336, 38)


Unnamed: 0,campus,sample_type,sample_subtype,classcode,orig_classcode,Kingdom,Phylum,Class,Order,Family,...,LOOKED2009,LOOKED2010,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018
0,HSU,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,no,no,no,no,no,yes,yes,no,yes,yes
1,UCSB,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,VRG,FISH,FISH,AARG,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
3,HSU,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no
4,UCSB,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no


The subset of the species table that's currently relevant is entries with sample_type = 'FISH'. **Note** that there are 172 unique classcodes under this sample type, only 166 of which are actually in the fish data set. **Where does this discrepancy come from?** It seems like all classcodes should appear at least once, with a count of either 0 (looked for and not found) or NaN (not looked for). **It sounded like the data should already have NaN if a species was not looked for during a given survey and 0 if it was looked for and not found. Is that correct?**

Classcodes that do not appear in the data are:
- DMAC
- HSPI
- HSTE
- MXEN
- RBIN

```python
species.loc[species['classcode'].isin(['DMAC', 'HSPI', 'HSTE', 'MXEN', 'RBIN']), ['campus', 'classcode', 'species_definition', 'common_name']]
```

In [12]:
## Select species for fish surveys

fish_sp = species.loc[species['sample_type'] == 'FISH', ['classcode', 'species_definition', 'common_name']]
fish_sp.drop_duplicates(inplace=True)

print(fish_sp.shape)
fish_sp

(172, 3)


Unnamed: 0,classcode,species_definition,common_name
0,AARG,Amphistichus argenteus,Barred Surfperch
3,ACOR,Artedius corallinus,Coralline Sculpin
7,ADAV,Anisotremus davidsonii,Sargo
11,AFLA,Aulorhynchus flavidus,Tubesnout
15,AGUA,Apogon guadalupensis,Guadalupe Cardinalfish
...,...,...,...
510,UNID,Unidentified Fish,Unidentified Fish
513,URON,Umbrina roncador,Yellowfin drum
514,USAN,Ulvicola sanctaerosae,Kelp Gunnel
516,ZEXA,Zapteryx exasperata,Banded Guitarfish


In [13]:
## Map classcodes to species definitions (usually scientific names) and classcodes to common names

code_to_sci_dict = dict(zip(fish_sp['classcode'], fish_sp['species_definition']))
code_to_com_dict = dict(zip(fish_sp['classcode'], fish_sp['common_name']))

In [14]:
## Update code_to_sci_dict for code OYT (see taxon notes below)

code_to_sci_dict['OYT'] = 'Sebastes serranoides/flavidus'

In [15]:
## Create scientificName and vernacularName columns

# Assign the result of replace() rather than calling it in place on a column selection,
# which may not update the DataFrame in recent pandas versions
fish_occ['vernacularName'] = fish['classcode'].replace(code_to_com_dict)

fish_occ['scientificName'] = fish['classcode'].replace(code_to_sci_dict)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Striped Surfperch,Embiotoca lateralis
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Striped Surfperch,Embiotoca lateralis
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Kelp Greenling,Hexagrammos decagrammus
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Senorita,Oxyjulis californica
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,Olive Or Yellowtail Rockfish,Sebastes serranoides/flavidus


In [16]:
## Get unique scientific names for lookup in WoRMS

names = fish_occ['scientificName'].unique()

**Note** that there are a number of names that are not specific at the species level:
- Sebastes atrovirens/carnatus/chrysomelas/caurinus (matched to Sebastes; Kelp, Gopher, Black and Yellow, and Copper Rockfish YoY)
- Sebastes chrysomelas/carnatus (matched to Sebastes; Gopher and Black and Yellow Rockfish YoY)
- Synchirus/Rimicola (Manacled sculpin or kelp clingfish, SYRI) --> This should mean either Synchirus spp. or Rimicola spp. These are both from class **Actinopterygii**
- Sebastes serranoides/flavidus/melanops (matched to Sebastes; Olive, Yellowtail and Black Rockfish YoY)
- Sebastes carnatus/caurinus (matched to Sebastes; Gopher, Copper Rockfish YoY)

There are also some descriptions that lack a scientific name:
- No Organisms Present In This Sample (NO_ORG) --> **There are 12430 records with this designation. Should they just be removed?**
    - One thing that confuses me here is that **sometimes, NO_ORG observations occur in the same event as other observations.** You would think that a NO_ORG entry would be the only entry for a given event - see example below. Also note that **all records with a NO_ORG classcode also have count = 0, and vice versa.** 
    
```python
fish_occ[fish_occ['eventID'] == 'HOPKINS_DC_19990907_OUTER_CAN_2']
```

- Unidentified Fish (UNID) --> **I assume this should match to Actinopterygii, or maybe Pisces.**
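
To see how often NO_ORG shares an event with real observations, the events can be flagged with two grouped boolean tests. This is a sketch on toy data; on the real table the same logic would use `fish['classcode']` and the eventID column.

```python
import pandas as pd

# Toy data: EV1 mixes NO_ORG with a real observation, EV2 is NO_ORG only.
toy = pd.DataFrame({
    'eventID': ['EV1', 'EV1', 'EV2', 'EV3', 'EV3'],
    'classcode': ['NO_ORG', 'OCAL', 'NO_ORG', 'ELAT', 'OYT'],
})

is_no_org = toy['classcode'] == 'NO_ORG'

# Per event: does it contain any NO_ORG row, and does it contain any other row?
has_no_org = is_no_org.groupby(toy['eventID']).any()
has_other = (~is_no_org).groupby(toy['eventID']).any()

mixed_events = has_no_org.index[has_no_org & has_other].tolist()
print(mixed_events)
```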

Species with multiple common names:
- Atherinopsidae --> Grunion, Topsmelt Or Jacksmelt
- Clupeiformes --> Bait, Sardines/Anchovies (BAITBALL)
- Clinidae --> Kelpfishes And Fringeheads
- Clupeiformes --> Sardines And Anchovies (CLUP)
- Lethops connectens --> Kelp Goby, Halfblind Goby
- Scomber japonicus --> Pacific Mackerel, Greenback Mackerel
- Thaleichthys pacificus --> Candlefish, eulachon

Other classifications to be aware of:
- Hexagrammos --> Unidentified Hexagrammos
- Sebastes --> Rockfish, Unidentified Sp. (SEBSPP)
- Sebastes --> Rockfish Young Of The Year, Unidentified Sp (RFYOY)
- OYT matches to two species definitions: Sebastes serranoides/flavidus for HSU, UCSB, UCSC and Sebastes serranoides for VRG. **I think this is probably a mistake, and that the former is correct. I've changed OYT in code_to_sci_dict accordingly.** To check this, use:

```python
species.loc[species['classcode'] == 'OYT', ['campus', 'classcode', 'Genus', 'Species', 'species_definition', 'common_name']]
```

In [17]:
## Make changes based on the above observations

fish_occ.loc[fish_occ['scientificName'] == 'Synchirus/Rimicola', 'scientificName'] = 'Actinopterygii'
fish_occ.loc[fish_occ['scientificName'] == 'Unidentified Fish', 'scientificName'] = 'Actinopterygii'

# REMOVING NO_ORG RECORDS FOR NOW
fish_occ = fish_occ[fish_occ['scientificName'] != 'No Organisms Present In This Sample'].copy()

# Redefine names
names = fish_occ['scientificName'].unique()

In [18]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Sebastes serranoides/flavidus checking:  Sebastes
Url didn't work for Sebastes atrovirens/carnatus/chrysomelas/caurinus checking:  Sebastes
Url didn't work for Sebastes chrysomelas/carnatus checking:  Sebastes
Url didn't work for Sebastes serranoides/flavidus/melanops checking:  Sebastes
Url didn't work for Sebastes carnatus/caurinus checking:  Sebastes
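
The messages above show the multi-species definitions being checked against the genus instead. I haven't reproduced the `WoRMS` module here; the sketch below only illustrates the idea of a genus fallback, and `genus_fallback` is a hypothetical helper, not a function from that module.

```python
def genus_fallback(name):
    """Return the first whitespace-delimited token, assumed to be the genus."""
    return name.split()[0]

# Species definitions that cannot be matched as written.
ambiguous = ['Sebastes serranoides/flavidus', 'Sebastes carnatus/caurinus']
fallbacks = {name: genus_fallback(name) for name in ambiguous}

print(fallbacks)
```

Note that this would not work for a definition like 'Synchirus/Rimicola', where the first token is itself two genera; that case is handled separately below.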


In [19]:
## Add scientific name-related columns

# Assign the result of replace() rather than calling it in place on a column selection
fish_occ['scientificNameID'] = fish_occ['scientificName'].replace(name_id_dict)

fish_occ['taxonID'] = fish_occ['scientificName'].replace(name_taxid_dict)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Kelp Greenling,Hexagrammos decagrammus,urn:lsid:marinespecies.org:taxname:240732,240732
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Senorita,Oxyjulis californica,urn:lsid:marinespecies.org:taxname:240727,240727
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,Olive Or Yellowtail Rockfish,Sebastes serranoides/flavidus,urn:lsid:marinespecies.org:taxname:126175,126175


In [20]:
## Create identificationQualifier

qualifier_dict = {'Sebastes serranoides/flavidus':'Sebastes serranoides or Sebastes flavidus',
                  'Sebastes atrovirens/carnatus/chrysomelas/caurinus':'Sebastes atrovirens, Sebastes carnatus, Sebastes chrysomelas or Sebastes caurinus',
                  'Sebastes chrysomelas/carnatus':'Sebastes chrysomelas or Sebastes carnatus',
                  'Sebastes serranoides/flavidus/melanops':'Sebastes serranoides, Sebastes flavidus or Sebastes melanops',
                  'Sebastes carnatus/caurinus':'Sebastes carnatus or Sebastes caurinus'}

# Names without an entry in qualifier_dict get NaN
identificationQualifier = [qualifier_dict.get(name, np.nan) for name in fish_occ['scientificName']]

In [21]:
## Replace scientificName using name_name_dict

fish_occ['scientificName'] = fish_occ['scientificName'].replace(name_name_dict)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Kelp Greenling,Hexagrammos decagrammus,urn:lsid:marinespecies.org:taxname:240732,240732
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Senorita,Oxyjulis californica,urn:lsid:marinespecies.org:taxname:240727,240727
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,Olive Or Yellowtail Rockfish,Sebastes,urn:lsid:marinespecies.org:taxname:126175,126175


In [22]:
## Add final name-related columns

fish_occ['nameAccordingTo'] = 'WoRMS'
fish_occ['occurrenceStatus'] = 'present'
fish_occ['basisOfRecord'] = 'HumanObservation'
fish_occ['identificationQualifier'] = identificationQualifier

# Add identificationQualifier for Synchirus/Rimicola
fish_occ.loc[fish_occ['vernacularName'] == 'Manacled Sculpin/Kelp Clingfish', 'identificationQualifier'] = 'Synchirus spp. or Rimicola spp.'

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Kelp Greenling,Hexagrammos decagrammus,urn:lsid:marinespecies.org:taxname:240732,240732,WoRMS,present,HumanObservation,
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Senorita,Oxyjulis californica,urn:lsid:marinespecies.org:taxname:240727,240727,WoRMS,present,HumanObservation,
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,Olive Or Yellowtail Rockfish,Sebastes,urn:lsid:marinespecies.org:taxname:126175,126175,WoRMS,present,HumanObservation,Sebastes serranoides or Sebastes flavidus


In [23]:
## Pull sex and lifeStage information out of sex column

fish_occ['sex'] = fish['sex'].copy()
fish_occ['lifeStage'] = fish_occ['sex']

# Separate
fish_occ.loc[fish_occ['sex'].isin(['JUVENILE']), 'sex'] = np.nan
fish_occ.loc[fish_occ['lifeStage'].isin(['MALE', 'FEMALE', 'TRANSITIONAL']), 'lifeStage'] = np.nan

# Replace sex and lifeStage with controlled vocabulary
fish_occ['sex'] = fish_occ['sex'].replace({'MALE':'male', 'FEMALE':'female', 'TRANSITIONAL':'transitional'})
fish_occ['lifeStage'] = fish_occ['lifeStage'].replace({'JUVENILE':'juvenile'})

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Kelp Greenling,Hexagrammos decagrammus,urn:lsid:marinespecies.org:taxname:240732,240732,WoRMS,present,HumanObservation,,,
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Senorita,Oxyjulis californica,urn:lsid:marinespecies.org:taxname:240727,240727,WoRMS,present,HumanObservation,,,
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,Olive Or Yellowtail Rockfish,Sebastes,urn:lsid:marinespecies.org:taxname:126175,126175,WoRMS,present,HumanObservation,Sebastes serranoides or Sebastes flavidus,,


In [24]:
## Create density

fish_density = fish.loc[fish['classcode'] != 'NO_ORG', 'count'] # units = individuals per 120 m3
fish_occ['organismQuantity'] = fish_density
fish_occ['organismQuantityType'] = 'number of individuals per 120 m3'
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Kelp Greenling,Hexagrammos decagrammus,urn:lsid:marinespecies.org:taxname:240732,240732,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Senorita,Oxyjulis californica,urn:lsid:marinespecies.org:taxname:240727,240727,WoRMS,present,HumanObservation,,,,85.0,number of individuals per 120 m3
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,Olive Or Yellowtail Rockfish,Sebastes,urn:lsid:marinespecies.org:taxname:126175,126175,WoRMS,present,HumanObservation,Sebastes serranoides or Sebastes flavidus,,,100.0,number of individuals per 120 m3


**Note** that because we dropped the NO_ORG records, there are no instances where organismQuantity = 0. These instances would normally have occurrenceStatus = absent. 

```python
fish_occ[fish_occ['organismQuantity'] == 0]
```

Also, there is one record where organismQuantity (i.e. the count column in fish) is NaN. The eventID is PALO_COLORADO_20080904_OUTMID_BOT_2. See occurrence 3 in the following example:

```python
fish_occ[fish_occ['eventID'] == 'PALO_COLORADO_20080904_OUTMID_BOT_2']
```

**I will drop this record for now.**

In [25]:
## Drop record where organismQuantity is missing

print(fish_occ.shape)
fish_occ.dropna(subset=['organismQuantity'], inplace=True)
fish_occ.shape

(369263, 14)


(369262, 14)

### The notes column

I would like to extract useful information from the notes column and tidy it. **A few of the notes are inappropriate and/or include people's names. However, since PISCO has already shared them publicly, I assume it is still acceptable to share them.** Many of the notes contain potentially useful information.

A large number of the notes contain sex information. I can probably extract this. I need to look for and pull:
- M
- F
- M;
- F;
- MALE
- FEMALE
- FEAMLE
- MALE;
- FEMALE;
- TRANSITIONAL;
- TRANSITION
- MALE,

There is also some life stage information:
- JUVENILE
- JUVENILE;
- JEVENILE

Sometimes, sex is uncertain (e.g. 'M?'). I'll leave these in the notes.

Note cleaning:
- Explore cleaning lowercase versus capitals
- Some notes are preceded by '. '

To look at the non-sex-related notes, use:

```python
not_sex = fish[(fish['notes'].isna() == False) & (fish['notes'] != 'M') & (fish['notes'] != 'F') & (fish['notes'] != 'M;') & (fish['notes'] != 'F;') 
               & (fish['notes'] != 'MALE') & (fish['notes'] != 'FEMALE')].copy()

for note in not_sex['notes'].unique():
    print(note)
```
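
The same filter can be written more compactly with `isin`. This is a minimal sketch on a toy frame; `toy`, `sex_tokens` and `not_sex_toy` are illustrative names and the note values are made up, not taken from the PISCO data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `fish` table; values are invented for illustration
toy = pd.DataFrame({'notes': ['M', 'F;', 'FEMALE', 'IN KELP', np.nan, '. SCHOOLING']})

# Notes that consist only of a sex code (same six values as the filter above)
sex_tokens = ['M', 'F', 'M;', 'F;', 'MALE', 'FEMALE']

# Keep rows that have a note and whose note is not one of the sex codes
not_sex_toy = toy[toy['notes'].notna() & ~toy['notes'].isin(sex_tokens)]

for note in not_sex_toy['notes'].unique():
    print(note)
```

Adding further spellings (e.g. 'FEAMLE') then only requires extending the list.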

In [40]:
## Obtain relevant records from fish

notes = fish[['site', 'survey_date', 'classcode', 'count', 'sex', 'notes']].copy()
notes = notes[(notes['classcode'] != 'NO_ORG') & (notes['count'].isna() == False)]
notes.head()

Unnamed: 0,site,survey_date,classcode,count,sex,notes
0,HOPKINS_DC,19990907,ELAT,1.0,,
1,HOPKINS_DC,19990907,ELAT,1.0,,
2,HOPKINS_DC,19990907,HDEC,1.0,,
3,HOPKINS_DC,19990907,OCAL,85.0,,
4,HOPKINS_DC,19990907,OYT,100.0,,


In [41]:
## Extract sex from notes column

sex_notes = []
sex_options = ['M', 'F', 'MALE', 'FEMALE', 'FEAMLE', 'MALES', 'FEMALES', 'TRANSITIONAL', 'TRANSITION', 'TRANNY']

for note in notes['notes']:
    
    colon_overlap = []
    comma_overlap = []
    slash_overlap = []
    
    if note == note:  # NaN != NaN, so this is True only when a note is present
        
        colon_split = list(map(str.strip, note.split(';')))
        if (len(colon_split) > 1) & ('' not in colon_split):
            colon_overlap = list(set(sex_options) & set(colon_split))
            
        comma_split = list(map(str.strip, note.split(',')))
        if (len(comma_split) > 1) & ('' not in comma_split):
            comma_overlap = list(set(sex_options) & set(comma_split))
            
        slash_split = list(map(str.strip, note.split('/')))
        if (len(slash_split) > 1) & ('' not in slash_split):
            slash_overlap = list(set(sex_options) & set(slash_split))
          
        
        if note in sex_options:
            sex_notes.append(note)
        elif colon_overlap != []:
            sex_notes.extend(colon_overlap)
        elif comma_overlap != []:
            sex_notes.extend(comma_overlap)
        elif (slash_overlap != []) & (len(slash_overlap) == 1):
            sex_notes.extend(slash_overlap)
        
        else:
            sex_notes.append(np.nan)
            
    else:
        sex_notes.append(np.nan)
        
notes['sex_notes'] = sex_notes
notes.head()

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes
0,HOPKINS_DC,19990907,ELAT,1.0,,,
1,HOPKINS_DC,19990907,ELAT,1.0,,,
2,HOPKINS_DC,19990907,HDEC,1.0,,,
3,HOPKINS_DC,19990907,OCAL,85.0,,,
4,HOPKINS_DC,19990907,OYT,100.0,,,
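
The loop above relies on each note contributing exactly one entry to `sex_notes`; a note matching two options after a `;` or `,` split would add two entries via `extend` and make the list longer than `notes`. One way to guard against that, sketched here as a hypothetical helper (`extract_token` is not part of the original notebook), is to return a single value per note and treat multiple matches as ambiguous. Note that this is slightly stricter than the loop above, which only applies the single-match rule to the `/` split.

```python
import numpy as np

def extract_token(note, options, delimiters=(';', ',', '/')):
    """Return the single option found in `note`, or NaN if none or ambiguous."""
    if not isinstance(note, str):              # NaN or other non-string value
        return np.nan
    if note in options:                        # the whole note is one option
        return note
    for delim in delimiters:
        parts = [p.strip() for p in note.split(delim)]
        if len(parts) > 1 and '' not in parts:
            matches = sorted(set(options) & set(parts))
            if len(matches) == 1:              # skip ambiguous notes such as 'M/F'
                return matches[0]
    return np.nan

# Illustrative inputs only
sex_options = ['M', 'F', 'MALE', 'FEMALE']
examples = ['M', 'FEMALE; IN KELP', 'M/F', 'M;', np.nan]
print([extract_token(n, sex_options) for n in examples])
```

Because the helper always returns one value, `notes['notes'].map(lambda n: extract_token(n, sex_options))` would keep the result aligned with the rows of `notes`. As in the loop above, a note with a trailing delimiter such as 'M;' is not matched.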


In [42]:
## Clean sex_notes

print(notes['sex_notes'].unique())
notes['sex_notes'] = notes['sex_notes'].replace({'F':'female',
                  'M':'male',
                  'FEMALE':'female',
                  'MALE':'male',
                  'MALES':'male',
                  'FEMALES':'female',
                  'FEAMLE':'female',
                  'TRANSITIONAL':'transitional',
                  'TRANNY':'transitional',
                  'TRANSITION':'transitional'})
print(notes['sex_notes'].unique())

[nan 'F' 'M' 'FEMALE' 'MALE' 'TRANSITIONAL' 'MALES' 'FEMALES' 'TRANNY'
 'TRANSITION' 'FEAMLE']
[nan 'female' 'male' 'transitional']


In [43]:
# Add sex from fish_occ to notes

notes['occ_sex'] = fish_occ['sex']
notes.head()

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes,occ_sex
0,HOPKINS_DC,19990907,ELAT,1.0,,,,
1,HOPKINS_DC,19990907,ELAT,1.0,,,,
2,HOPKINS_DC,19990907,HDEC,1.0,,,,
3,HOPKINS_DC,19990907,OCAL,85.0,,,,
4,HOPKINS_DC,19990907,OYT,100.0,,,,


In [44]:
## Create new column merging information from occ_sex and sex_notes

new_sex = [notes['occ_sex'].iloc[i] if notes['occ_sex'].iloc[i] == notes['occ_sex'].iloc[i] else notes['sex_notes'].iloc[i] for i in range(notes.shape[0])]
notes['new_sex'] = new_sex
notes.head()

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes,occ_sex,new_sex
0,HOPKINS_DC,19990907,ELAT,1.0,,,,,
1,HOPKINS_DC,19990907,ELAT,1.0,,,,,
2,HOPKINS_DC,19990907,HDEC,1.0,,,,,
3,HOPKINS_DC,19990907,OCAL,85.0,,,,,
4,HOPKINS_DC,19990907,OYT,100.0,,,,,
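
The row-by-row comprehension above keeps `occ_sex` where it is present and falls back to `sex_notes` otherwise. A vectorized equivalent uses `fillna` with a second Series; the sketch below demonstrates it on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy data: occ_sex is the recorded sex, sex_notes is what was parsed from notes
toy = pd.DataFrame({
    'occ_sex':   ['female', np.nan, np.nan, 'male'],
    'sex_notes': [np.nan, 'male', np.nan, 'female'],
})

# Prefer occ_sex; use sex_notes only where occ_sex is missing
toy['new_sex'] = toy['occ_sex'].fillna(toy['sex_notes'])
print(toy['new_sex'].tolist())
```

In the last toy row the two sources disagree, and the recorded value ('male') wins, which matches the behavior of the comprehension.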


In [45]:
## Replace sex column in fish_occ with new_sex

fish_occ['sex'] = notes['new_sex']
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Kelp Greenling,Hexagrammos decagrammus,urn:lsid:marinespecies.org:taxname:240732,240732,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Senorita,Oxyjulis californica,urn:lsid:marinespecies.org:taxname:240727,240727,WoRMS,present,HumanObservation,,,,85.0,number of individuals per 120 m3
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,Olive Or Yellowtail Rockfish,Sebastes,urn:lsid:marinespecies.org:taxname:126175,126175,WoRMS,present,HumanObservation,Sebastes serranoides or Sebastes flavidus,,,100.0,number of individuals per 120 m3


To check that the above process is working, use:

```python
fish[fish['sex'] == 'FEMALE'].shape[0] # 16528
notes[notes['occ_sex'] == 'female'].shape[0] # 16528
notes[notes['new_sex'] == 'female'].shape[0] # 18031
fish_occ[fish_occ['sex'] == 'female'].shape[0]
```

In [58]:
## Repeat the process to extract lifeStage information from notes

stage_notes = []
stage_options = ['JUVENILE', 'JUV', 'JEVENILE']

for note in notes['notes']:
    
    colon_overlap = []
    comma_overlap = []
    slash_overlap = []
    
    if note == note:  # NaN != NaN, so this is True only when a note is present
        
        colon_split = list(map(str.strip, note.split(';')))
        if (len(colon_split) > 1) & ('' not in colon_split):
            colon_overlap = list(set(stage_options) & set(colon_split))
            
        comma_split = list(map(str.strip, note.split(',')))
        if (len(comma_split) > 1) & ('' not in comma_split):
            comma_overlap = list(set(stage_options) & set(comma_split))
            
        slash_split = list(map(str.strip, note.split('/')))
        if (len(slash_split) > 1) & ('' not in slash_split):
            slash_overlap = list(set(stage_options) & set(slash_split))
          
        
        if note in stage_options:
            stage_notes.append(note)
        elif colon_overlap != []:
            stage_notes.extend(colon_overlap)
        elif comma_overlap != []:
            stage_notes.extend(comma_overlap)
        elif (slash_overlap != []) & (len(slash_overlap) == 1):
            stage_notes.extend(slash_overlap)
        
        else:
            stage_notes.append(np.nan)
            
    else:
        stage_notes.append(np.nan)
        
notes['stage_notes'] = stage_notes
        
# Clean stage_notes
print(notes['stage_notes'].unique())
notes['stage_notes'] = notes['stage_notes'].replace({'JUV':'juvenile',
                  'JUVENILE':'juvenile',
                  'JEVENILE':'juvenile'})
print(notes['stage_notes'].unique())

# Add lifeStage from fish_occ to notes
notes['occ_stage'] = fish_occ['lifeStage']

# Create new column merging information from occ_stage and stage_notes
new_stage = [notes['occ_stage'].iloc[i] if notes['occ_stage'].iloc[i] == notes['occ_stage'].iloc[i] else notes['stage_notes'].iloc[i] for i in range(notes.shape[0])]
notes['new_stage'] = new_stage

# Replace lifeStage column in fish_occ with new_stage
fish_occ['lifeStage'] = notes['new_stage']
fish_occ.head()

[nan 'JUV' 'JUVENILE' 'JEVENILE']
[nan 'juvenile']


Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Kelp Greenling,Hexagrammos decagrammus,urn:lsid:marinespecies.org:taxname:240732,240732,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Senorita,Oxyjulis californica,urn:lsid:marinespecies.org:taxname:240727,240727,WoRMS,present,HumanObservation,,,,85.0,number of individuals per 120 m3
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,Olive Or Yellowtail Rockfish,Sebastes,urn:lsid:marinespecies.org:taxname:126175,126175,WoRMS,present,HumanObservation,Sebastes serranoides or Sebastes flavidus,,,100.0,number of individuals per 120 m3


Check:

```python
fish[fish['sex'] == 'JUVENILE'].shape[0] # 2336
notes[notes['occ_stage'] == 'juvenile'].shape[0] # 2336
notes[notes['new_stage'] == 'juvenile'].shape[0] # 2381
fish_occ[fish_occ['lifeStage'] == 'juvenile'].shape[0] # 2381
```

In [69]:
# ## Save notes to inspect process if desired

# notes.to_csv('notes.csv', index=False, na_rep='NaN')

In [30]:
# ## Clean notes if desired **NOT GOING TO DO THIS FOR NOW**

# cleaned_notes = fish['notes'].copy()
# print(cleaned_notes[cleaned_notes.index == 348366])
# for i in range(cleaned_notes.shape[0]):
#     note = cleaned_notes.iloc[i]
#     if note == note:
#         if note[0:2] == '. ':
#             cleaned_notes.iloc[i] = cleaned_notes.iloc[i][2:]
#         cleaned_notes.iloc[i] = cleaned_notes.iloc[i].lower()
    
# print(cleaned_notes[cleaned_notes.index == 348366])
# fish_occ['occurrenceRemarks'] = cleaned_notes
# fish_occ.head()

In [75]:
## Add notes under occurrenceRemarks

fish_occ['occurrenceRemarks'] = notes['notes']
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType,occurrenceRemarks
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3,
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3,
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Kelp Greenling,Hexagrammos decagrammus,urn:lsid:marinespecies.org:taxname:240732,240732,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3,
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Senorita,Oxyjulis californica,urn:lsid:marinespecies.org:taxname:240727,240727,WoRMS,present,HumanObservation,,,,85.0,number of individuals per 120 m3,
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,Olive Or Yellowtail Rockfish,Sebastes,urn:lsid:marinespecies.org:taxname:126175,126175,WoRMS,present,HumanObservation,Sebastes serranoides or Sebastes flavidus,,,100.0,number of individuals per 120 m3,


In [71]:
## Save Size, Min and Max for use in MoF file

# Obtain relevant records from fish
subset = fish[['site', 'survey_date', 'classcode', 'count', 'fish_tl', 'min_tl', 'max_tl']].copy()
subset = subset[(subset['classcode'] != 'NO_ORG') & (subset['count'].isna() == False)]

# Fix records where count = 1 and min and/or max values are present
subset.loc[subset['count'] == 1, ['min_tl', 'max_tl']] = np.nan

# Fix records where count > 1 and min and max don't provide a reasonable size range
subset.loc[(subset['fish_tl'] == subset['min_tl']) & (subset['max_tl'].isna() == True), 'min_tl'] = np.nan

# For groups where a size range exists, we want to drop the average length measure
subset.loc[(subset['fish_tl'] < subset['max_tl']) & (subset['fish_tl'] > subset['min_tl']), 'fish_tl'] = np.nan

# Assemble fish_sizes
fish_sizes = pd.DataFrame({
    'eventID':fish_occ['eventID'],
    'occurrenceID':fish_occ['occurrenceID'],
    'fish_tl':subset['fish_tl'],
    'min_tl':subset['min_tl'],
    'max_tl':subset['max_tl']
})
fish_sizes.dropna(how='all', subset=['fish_tl', 'min_tl', 'max_tl'], inplace=True) # Note that this drops 116 records where no size information was given (fish_tl = NaN)

print(fish_sizes.shape)
fish_sizes.head()

(369146, 5)


Unnamed: 0,eventID,occurrenceID,fish_tl,min_tl,max_tl
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,18.0,,
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,25.0,,
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,20.0,,
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,10.0,,
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,,7.0,8.0
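
To make the three size-cleaning rules in the cell above concrete, here is a small worked example on invented rows (the `toy` frame is illustrative only). Each row exercises one case.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'count':   [1,      1,    5,      5,   5],
    'fish_tl': [18.0,   20.0, 10.0,   7.5, 12.0],
    'min_tl':  [18.0,   15.0, np.nan, 7.0, 12.0],
    'max_tl':  [np.nan, 25.0, np.nan, 8.0, np.nan],
})

# Rule 1: a single fish cannot have a size range, so clear min and max
toy.loc[toy['count'] == 1, ['min_tl', 'max_tl']] = np.nan

# Rule 2: min equal to fish_tl with no max is not a usable range, so clear min
toy.loc[(toy['fish_tl'] == toy['min_tl']) & toy['max_tl'].isna(), 'min_tl'] = np.nan

# Rule 3: where a real range exists, drop the average length
toy.loc[(toy['fish_tl'] < toy['max_tl']) & (toy['fish_tl'] > toy['min_tl']), 'fish_tl'] = np.nan

print(toy)
```

After the three steps only the fourth row keeps `min_tl` and `max_tl` (7.0 and 8.0) and loses `fish_tl`; every other row keeps `fish_tl` alone.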


**Some notes on size values:**
- 0 records have a missing count (count = NaN)
- 0 records have count = 0
- 234652 records have count = 1
    - Of these, 3130 have a min value recorded:
        - 2843 have min values that = fish_tl and max values of NaN. **How do I interpret these records? For now I will disregard min values.** Note that the opposite is never the case (max = fish_tl, min = NaN)
        - 287 have both min and max values, and fish_tl is the average of those values
- 134610 records have count > 1
    - Of these, 104917 are all the same size (i.e. min_tl = max_tl = NaN)
    - 27907 are different sizes with a size range given (i.e. fish_tl is the average of min_tl and max_tl)
    - 1786 are of unknown size, with min values that = fish_tl and max values of NaN. **How do I interpret these records? For now I will disregard min values.**
        - Again, note that the opposite is never the case (max = fish_tl, min = NaN)
        
```python
# Records where count = 1
count1 = subset[subset['count'] == 1].copy()

# Records where count = 1, but min is present
count1[(count1['min_tl'].isna() == False) & (count1['max_tl'].isna() == True)]

# Records where count = 1, but min and max are present
count1[(count1['min_tl'].isna() == False) & (count1['max_tl'].isna() == False)]

# Records where count > 1
count2 = subset[subset['count'] > 1].copy()

# Records where count > 1 and all fish were the same size
count2[(count2['min_tl'].isna() == True) & (count2['max_tl'].isna() == True)]

# Records where count > 1 and fish were not the same size and a size range was given
count2[(count2['fish_tl'].isna() == False) & (count2['min_tl'].isna() == False) & (count2['max_tl'].isna() == False)]

# Records where count > 1 and fish are of unknown size (min = fish_tl, max not given)
count2[(count2['min_tl'].isna() == False) & (count2['max_tl'].isna() == True)]
```

### Save

In [76]:
## Save

fish_occ.to_csv('PISCO_occurrence_20200824.csv', index=False, na_rep='NaN')

## Create event file

The event file should contain: eventID (from site, survey_year, transect, level?), eventDate (from year, month, date), datasetID, locality (site), locationRemarks (maybe level and/or zone information), countryCode, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. Some notes might be eventRemarks. Should I include the campus information somewhere? Observer?

In [134]:
## Get unique eventIDs from occurrence file and their associated survey_dates

event = pd.DataFrame({'eventID':fish_occ['eventID'],
                    'eventDate':fish.loc[(fish['count'].isna() == False) & (fish['classcode'] != 'NO_ORG'), 'survey_date'],
                    'institutionID':fish.loc[(fish['count'].isna() == False) & (fish['classcode'] != 'NO_ORG'), 'campus'],
                    'locality':fish.loc[(fish['count'].isna() == False) & (fish['classcode'] != 'NO_ORG'), 'site']})
event.drop_duplicates(inplace=True)

print(event.shape)
event.head()

(49766, 4)


Unnamed: 0,eventID,eventDate,institutionID,locality
0,HOPKINS_DC_19990907_INNER_BOT_1,19990907,UCSC,HOPKINS_DC
7,HOPKINS_DC_19990907_INNER_BOT_2,19990907,UCSC,HOPKINS_DC
14,HOPKINS_DC_19990907_INNER_CAN_1,19990907,UCSC,HOPKINS_DC
16,HOPKINS_DC_19990907_INNER_CAN_2,19990907,UCSC,HOPKINS_DC
22,HOPKINS_DC_19990907_INNER_MID_1,19990907,UCSC,HOPKINS_DC


In [136]:
## Format eventDate

formatted = [datetime.strptime(dt, '%Y%m%d').date().isoformat() for dt in event['eventDate']]
event['eventDate'] = formatted
event.head()

Unnamed: 0,eventID,eventDate,institutionID,locality
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,UCSC,HOPKINS_DC
7,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,UCSC,HOPKINS_DC
14,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,UCSC,HOPKINS_DC
16,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,UCSC,HOPKINS_DC
22,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,UCSC,HOPKINS_DC
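
The same conversion can be done without a Python-level loop using `pd.to_datetime`. This is a sketch on two example date strings; it assumes the dates are stored as strings in `YYYYMMDD` form, so integer dates would need `.astype(str)` first.

```python
import pandas as pd

# Example survey dates in the same YYYYMMDD form as survey_date
dates = pd.Series(['19990907', '20080904'])

# Parse with an explicit format, then write back out as ISO 8601 dates
iso = pd.to_datetime(dates, format='%Y%m%d').dt.strftime('%Y-%m-%d')
print(iso.tolist())
```

An explicit `format` also makes malformed dates fail loudly instead of being guessed at.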


In [137]:
## Dataset ID

event.insert(2, 'datasetID', 'PISCO fish transects')
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC
7,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC
14,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC
16,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC
22,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC


In [138]:
## Merge to obtain decimalLatitude and decimalLongitude

event = event.merge(site_summary, how='left', left_on='locality', right_on='site')
event.rename(columns = {'site_status':'locationRemarks', 'latitude':'decimalLatitude', 'longitude':'decimalLongitude'}, inplace=True)
event['locationRemarks'] = event['locationRemarks'].replace({'mpa':'marine protected area'})
event.drop(['site', 'site_name'], axis=1, inplace=True)
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
1,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
2,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
3,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
4,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
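
A left merge silently leaves coordinates empty for any locality that is missing from the site table. A quick way to surface such cases is the `indicator` and `validate` options of `merge`. The sketch below uses toy tables (`toy_event`, `toy_sites`) with invented sites and coordinates.

```python
import pandas as pd

toy_event = pd.DataFrame({'eventID': ['A_1', 'A_2', 'B_1'],
                          'locality': ['SITE_A', 'SITE_A', 'SITE_B']})
toy_sites = pd.DataFrame({'site': ['SITE_A'],
                          'latitude': [36.62], 'longitude': [-121.90]})

# validate raises if a site appears more than once in the site table;
# indicator adds a _merge column recording where each row found a match
merged = toy_event.merge(toy_sites, how='left', left_on='locality', right_on='site',
                         validate='many_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'locality'].unique()
print(unmatched)
```

Here `SITE_B` is reported as unmatched, and the merge keeps exactly one row per event.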


In [139]:
## Add countryCode

event.insert(6, 'countryCode', 'US')
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
1,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
2,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
3,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
4,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196


In [140]:
## Add coordinateUncertaintyInMeters

event['coordinateUncertaintyInMeters'] = 250

**Is this a reasonable coordinateUncertaintyInMeters?**

In [168]:
## minimumDepthInMeters, maximumDepthInMeters

# Add eventID to fish
fish['eventID'] = eventID
fish_subset = fish[(fish['count'].isna() == False) & (fish['classcode'] != 'NO_ORG')].copy()

# Groupby eventID to obtain depth column
depth = fish_subset.groupby(['eventID']).agg({
    'depth':['min', 'max']
})
depth.reset_index(inplace=True)
depth.columns = depth.columns.droplevel()

# Add to event
event['minimumDepthInMeters'] = depth['min']
event['maximumDepthInMeters'] = depth['max']
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,10.5,10.5
1,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.5,9.5
2,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.5,9.5
3,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.0,9.0
4,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,,


**Note** that there are no duplicated depth measurements that I can see.

```python
any(fish_subset.groupby(['eventID'])['depth'].nunique() > 1)
```
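
One caveat on the depth step: column assignment in pandas aligns on the index, not on eventID, so `event['minimumDepthInMeters'] = depth['min']` pairs rows correctly only if `event` and the grouped `depth` table list events in the same order. `groupby` sorts its keys by default, so that may not hold here. A minimal sketch of a join keyed on eventID, using toy tables with invented IDs and depths:

```python
import pandas as pd

toy_fish = pd.DataFrame({
    'eventID': ['Z_BOT_1', 'Z_BOT_1', 'A_BOT_1'],
    'depth':   [10.5, 10.5, 6.0],
})
# Events in order of first appearance, which differs from sorted order
toy_event = pd.DataFrame({'eventID': ['Z_BOT_1', 'A_BOT_1']})

# One row per eventID with named min and max columns
depth = (toy_fish.groupby('eventID')['depth']
         .agg(minimumDepthInMeters='min', maximumDepthInMeters='max')
         .reset_index())

# Join on the key rather than relying on row position
toy_event = toy_event.merge(depth, how='left', on='eventID')
print(toy_event)
```

With the merge, `Z_BOT_1` receives 10.5 and `A_BOT_1` receives 6.0 regardless of row order.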

In [169]:
## Add samplingProtocol and samplingEffort

event['samplingProtocol'] = 'band transect'
event['samplingEffort'] = 'ADD EFFORT?'
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,10.5,10.5,band transect,ADD EFFORT?
1,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.5,9.5,band transect,ADD EFFORT?
2,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.5,9.5,band transect,ADD EFFORT?
3,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.0,9.0,band transect,ADD EFFORT?
4,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,,,band transect,ADD EFFORT?


**Sampling effort?**

In [192]:
## Get vis, temp, surge, and pctcnpy for MoF

# Get relevant measurementValues
event_MoF_values = fish_subset[['eventID', 'vis', 'temp', 'surge', 'pctcnpy']].copy()
event_MoF_values.drop_duplicates(inplace=True)

# vis
vis = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'visibility',
    'measurementValue':event_MoF_values['vis'],
    'measurementUnit':'meters',
    'measurementMethod':'Horizontal visibility on each transect, estimated by diver reeling in transect tape and noting the distance at which the end of the tape can first be seen.'
})

# temp
temp = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'temperature',
    'measurementValue':event_MoF_values['temp'],
    'measurementUnit':'degrees Celsius',
    'measurementMethod':"The temperature on each transect as measured by the diver's computer."
})

# surge
surge = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'surge',
    'measurementValue':event_MoF_values['surge'],
    'measurementUnit':np.nan,
    'measurementMethod':"The diver's categorical estimation of the magnitude of horizontal displacement on each transect."
})

# pctcnpy
pct = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'percent canopy',
    'measurementValue':event_MoF_values['pctcnpy'],
    'measurementUnit':np.nan,
    'measurementMethod':"The diver's categorical estimation of the percent of the transect, by volume, that is occupied by kelp."
})

pct.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,HOPKINS_DC_19990907_INNER_BOT_1,,percent canopy,1.0,,The diver's categorical estimation of the perc...
7,HOPKINS_DC_19990907_INNER_BOT_2,,percent canopy,2.0,,The diver's categorical estimation of the perc...
14,HOPKINS_DC_19990907_INNER_CAN_1,,percent canopy,1.0,,The diver's categorical estimation of the perc...
16,HOPKINS_DC_19990907_INNER_CAN_2,,percent canopy,2.0,,The diver's categorical estimation of the perc...
22,HOPKINS_DC_19990907_INNER_MID_1,,percent canopy,1.0,,The diver's categorical estimation of the perc...
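
The four event-level frames above share one shape, so an alternative is to reshape once with `melt` and attach the type and unit from a lookup. This is only a sketch: the `meta` dictionary and the toy values are assumptions for illustration, and measurementMethod could be added the same way.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'eventID': ['E1', 'E2'],
    'vis':  [2.4, np.nan],
    'temp': [12.0, 13.5],
})

# Hypothetical lookup: column name -> (measurementType, measurementUnit)
meta = {'vis': ('visibility', 'meters'),
        'temp': ('temperature', 'degrees Celsius')}

# Wide to long: one row per event and measured column
long_mof = toy.melt(id_vars='eventID', value_vars=list(meta),
                    var_name='column', value_name='measurementValue')
long_mof['measurementType'] = long_mof['column'].map(lambda c: meta[c][0])
long_mof['measurementUnit'] = long_mof['column'].map(lambda c: meta[c][1])
long_mof.insert(1, 'occurrenceID', np.nan)   # event-level rows have no occurrenceID

# Drop the helper column and any missing measurements
long_mof = long_mof.drop(columns='column').dropna(subset=['measurementValue'])
print(long_mof)
```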


In [170]:
## Save

event.to_csv('PISCO_event_20200824.csv', index=False, na_rep='NaN')

## Create MoF file

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Depth, vis, temp, surge and pctcnpy can be recorded at the event level. Fish_tl, min_tl and max_tl can be recorded at the occurrence level.

In [204]:
## Assemble fish_sizes data

# total length
tl_mof = pd.DataFrame({'eventID':fish_sizes['eventID'],
                      'occurrenceID':fish_sizes['occurrenceID'],
                      'measurementType':'length',
                      'measurementValue':fish_sizes['fish_tl'],
                      'measurementUnit':'centimeters',
                      'measurementMethod': 'The total length of an individual or group of individuals (of the same length), estimated visually to the nearest centimeter'})

# min length
min_mof = pd.DataFrame({'eventID':fish_sizes['eventID'],
                      'occurrenceID':fish_sizes['occurrenceID'],
                      'measurementType':'minimum length',
                      'measurementValue':fish_sizes['min_tl'],
                      'measurementUnit':'centimeters',
                      'measurementMethod': 'The minimum size recorded in a group of fish of the same species, estimated visually to the nearest centimeter'})

# max length
max_mof = pd.DataFrame({'eventID':fish_sizes['eventID'],
                      'occurrenceID':fish_sizes['occurrenceID'],
                      'measurementType':'maximum length',
                      'measurementValue':fish_sizes['max_tl'],
                      'measurementUnit':'centimeters',
                      'measurementMethod': 'The maximum size recorded in a group of fish of the same species, estimated visually to the nearest centimeter'})

In [206]:
## Concatenate dataframes

mof = pd.concat([vis, temp, surge, pct, tl_mof, min_mof, max_mof])
mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,HOPKINS_DC_19990907_INNER_BOT_1,,visibility,2.4,meters,"Horizontal visibility on each transect, estima..."
7,HOPKINS_DC_19990907_INNER_BOT_2,,visibility,2.4,meters,"Horizontal visibility on each transect, estima..."
14,HOPKINS_DC_19990907_INNER_CAN_1,,visibility,2.4,meters,"Horizontal visibility on each transect, estima..."
16,HOPKINS_DC_19990907_INNER_CAN_2,,visibility,2.4,meters,"Horizontal visibility on each transect, estima..."
22,HOPKINS_DC_19990907_INNER_MID_1,,visibility,2.4,meters,"Horizontal visibility on each transect, estima..."


In [208]:
## Drop missing measurements

print(mof.shape)
mof.dropna(subset=['measurementValue'], inplace=True)
mof.shape

(1306502, 6)


(558926, 6)

In [210]:
## Save

mof.to_csv('PISCO_MoF_20200824.csv', index=False, na_rep='NaN')

## Questions

1. Does the "method" column actually relate to different survey methods? For example, here, method is always "FISH" and the only thing that's different is the campus information.
2. I assume depths were estimated by dive computer?
3. There are two sites in the site table that have no fish records: PISMO_W and SAL_E. The lat, lon for these sites are missing. The lat, lon for SCI_PELICAN_FAR_WEST are also missing.
4. There are 172 unique classcodes under sample type "FISH", but only 166 of them are actually in the fish data set. Classcodes that do not appear in the data are: DMAC, HSPI, HSTE, MXEN, and RBIN. Where does this discrepancy come from? 
5. It seems like there are no absence records in this data, even though absence records could be formulated based on which species were looked for in a given year (as indicated by the species table). Is this correct? I draw this conclusion because the only record where count = NaN is eventID = PALO_COLORADO_20080904_OUTMID_BOT_2 (classcode = SACA). This record has been excluded for now, but it seems like it might be a data entry error?
6. There are 12430 records where the classcode is "NO_ORG." I assume these indicate empty transects, but as is, they are not useful as presence/absence data, and have been excluded. To make them useful, we would have to obtain whether or not the species was looked for during a given year, and add in 0 counts for the relevant transects. This is more work than I think it's reasonable for me to do.
7. Sometimes, NO_ORG observations occur in the same event as other observations. This does not make sense to me.
8. Should the unidentified fish (UNID) classcode match to Actinopterygii? Pisces?
9. In the species table, the classcode OYT matches to two species definitions: Sebastes serranoides/flavidus for HSU, UCSB, UCSC and Sebastes serranoides for VRG. I've assumed this is a mistake.
10. For data to be compatible with the DwC standard, sex information must be entered using a controlled vocabulary. This vocabulary includes 'hermaphrodite' and 'indeterminate', but not something like 'transitional.' Does either of these terms seem reasonable? I can also potentially ask for the vocabulary to be expanded.
11. There are 234652 records with count = 1. Of these, 2843 have fish_tl = min_tl and max_tl missing. 287 have min_tl and max_tl specified, with fish_tl being an average of these values. This is confusing; I would expect all records with count = 1 to have fish_tl specified, and min_tl and max_tl missing. Can you clarify?
12. Similarly, there are 134610 records with count > 1. Of these, 1786 have fish_tl = min_tl and max_tl missing. How do I interpret these records? 
13. Is 250 m a reasonable coordinateUncertaintyInMeters for these sites?
14. Is there a samplingEffort I can list for fish transects? Like a time goal or limit?
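
For reference on question 6, absence records could in principle be built as sketched below, assuming a table of which classcodes were looked for in each year. The tables `events`, `looked_for` and `counts` are hypothetical stand-ins with invented values; no such tables exist in this notebook in this form.

```python
import pandas as pd

# Hypothetical inputs for illustration only
events = pd.DataFrame({'eventID': ['E1', 'E2'], 'year': [1999, 1999]})
looked_for = pd.DataFrame({'year': [1999, 1999], 'classcode': ['ELAT', 'OCAL']})
counts = pd.DataFrame({'eventID': ['E1'], 'classcode': ['ELAT'], 'count': [2]})

# Every species searched for, on every event in that year
expected = events.merge(looked_for, on='year')

# Attach observed counts; combinations with no observation become 0
full = expected.merge(counts, how='left', on=['eventID', 'classcode'])
full['count'] = full['count'].fillna(0)
full['occurrenceStatus'] = full['count'].gt(0).map({True: 'present', False: 'absent'})
print(full)
```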