# PISCO - fish transect data

The density of all conspicuous fishes (i.e. species whose adults are longer than 10 cm and visually detectable by SCUBA divers) are visually recorded along replicate 2m wide by 2m tall by 30m long (120m3) transects. 
- Transects are performed in 2-3 heights: bottom, mid-water and canopy
    - Bottom transects are always performed; a diver searches in cracks and crevices with a flashlight
    - Mid-water transects are always performed; a second diver surveys 120 m3 about 1/3 - 1/2 of the way up into the water column
    - Canopy transects are surveyed at a subset of sites, and are usually completed separately from the bottom and midwater transects; a diver swims 2m below the surface counting fishes in the top two meters of the water column
- Three 30 m long transects, distributed end-to-end and 5-10 m apart, are typically performed at each height, and at each of four depths:
    - 5m
    - 10m
    - 15m
    - 20m 
    - transects at the 25 m isobath are performed by VRG where habitat is available
- Survey depths may vary based on reef topography 
- Counts on mid-water and bottom transects are eventually combined, generating 12 replicate transects for each site. **Are these already combined in this data set? [No, doesn't look like it.]** **Note** that at sites with narrow kelp beds, particularly in parts of the Northern Channel Islands, only two depths are sampled, with four transects in each depth zone for a total of eight replicate transects
- Surveyors record:
    - The total length (TL) of each fish observed
    - Transect depth
    - Horizontal visibility along each transect (**must be at least 3 m to perform fish transects**)
    - Water temperature
    - Sea state (surge)
    - Percent of the transect volume occupied by kelp (PISCO only)

**Resources**
- https://opc.dataone.org/view/MLPA_kelpforest.metadata.1

In [1]:
## Imports

import pandas as pd
import numpy as np
import random
import math

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [12]:
## Load data

# path = 'C:\\Users\\dianalg\\Documents\\Work\\MBARI\\MPA Data Integration\\PISCO\\'
filename = 'MLPA_kelpforest_fish_with_2020.csv'
fish = pd.read_csv(filename, dtype={'transect':str, 'sex':str, 'notes':str, 'site_name_old':str})

print(fish.shape)
fish.head()

(80977, 18)


Unnamed: 0,campus,method,survey_year,year,month,day,site,location,zone,transect,classcode,count,size,disease,depth,observer,notes,site_name_old
0,UCSC,SBTL_SWATH_PISCO,2014,2014,8,8,HOPKINS_UC,TRANSECT,INNER,1,APOCAL,1,7.0,,4.6,COLIN GAYLORD,,
1,UCSC,SBTL_SWATH_PISCO,2018,2018,8,15,SAUNDERS_REFERENCE_1,TRANSECT,OUTER,2,ASTSPP,1,10.0,,20.2,MICHAEL LANGHANS,,
2,UCSC,SBTL_SWATH_PISCO,2014,2014,7,3,MACABEE_DC,TRANSECT,OUTER,2,DERIMB,2,12.0,HEALTHY,12.3,TRISTIN MCHUGH,SIZE 10-14 CM,
3,UCSC,SBTL_SWATH_PISCO,2014,2014,7,16,HOPKINS_DC,TRANSECT,MID,1,DERIMB,1,12.0,HEALTHY,8.8,TRISTIN MCHUGH,SIZE 10-14 CM,
4,UCSC,SBTL_SWATH_PISCO,2014,2014,7,16,HOPKINS_DC,TRANSECT,MID,1,DERIMB,1,17.0,HEALTHY,8.8,TRISTIN MCHUGH,SIZE <=15 CM,


In [13]:
fish['year'].max()

2018

### Column definitions

**campus** = UCSC, USCB, HSU or VRG. The academic partner campus that conducted the survey. <br>
**method** = SBTL_FISH_PISCO, SBTL_FISH_CRANE, SBTL_FISH_HSU or SBTL_FISH_VRG. The code describing the sampling technique and monitoring program that conducted each survey. **How is this different than the previous column? Does it actually indicate further methodological differences?**" <br>
**survey_year** = 1999 - 2018. The designated year associated with the annual survey. In most cases, survey_year and year are the same. In rare cases, surveys are completed early in the year following the designated survey year. In these cases, survey_year will differ from year. <br>
**year** = 1999 - 2018. Year that the survey was conducted. <br>
**month** = 1 - 12. Month that the survey was conducted. <br>
**day** = 1 - 31. Day that the survey was conducted. <br>
**site** = One of 380 site codes. The unique site where the survey was performed (as defined in the site table). This site refers to a specific GPS location and is often associated with a geographic placename. Often, multiple site replicates will be associated with a single placename, and will be delineated with additional geographical or directional information (e.g. North/South/East/West/Central - N/S/E/W/CEN, Upcoast/Downcoast - UC/DC) <br>
**zone** = INNER, OUTER, OUTMID, INMID, MID or DEEP. A division of the site into 2 to 4 categories representing onshore-offshore stratification associated with targeted bottom depths for transects.
- INNER: Depth zone targeting roughly 5m of water depth, or the inner edge of the reef
- INMID: Depth zone targeting roughly 10m of water depth 
- MID: Depth zone targeting roughly 10-15m of water depth, used by VRG and in early years of PISCO
- OUTMID: Depth zone targeting roughly 15m of water depth 
- OUTER: Depth zone targeting roughly 20m of water depth 
- DEEP: Depth zone targeting roughly 25m of water depth, where present, used only by VRG

**level** = BOT, CAN, MID or CNMD. The horizontal placement of the transect within the water column. Includes BOT: bottom transects placed at the seafloor, MID: midwater transects placed at roughly half the depth of the seafloor, and CAN: canopy transects placed at the surface to survey the top two meters of the water column and kelp canopy. CNMD is used when an inner transect is too shallow to allow both canopy and midwater transects without overlapping (applies to UCSB and VRG only) <br>
**transect** = It seems like this should only be 1 - 12, but there are a number of other designations as well. The unique transect replicate within each site, zone, and level. <br>
**classcode** = One of 166 taxon codes. The unique taxonomic classification code that is being counted, as defined in the taxonomic table. This refers to a code that defines the Genus and Species that is identified, a code that represents a limited number of species that can't be narrowed down to one species, or in some cases family-level or higher order groupings. Generally, for fishes, the classcode is comprised of the first letter of the genus, and the first three letters of the species, with some exceptions <br>
**count** = The number of individuals of a given classcode of a given size per transect <br>
**fish_tl** = The total length of an individual or group of individuals (of the same length) OR the average total length for a group of fish where a range in lengths is specified (rounded to the nearest cm) <br>
**min_tl** = The minimum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species <br>
**max_tl** = The maximum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species <br>
**sex** = MALE, FEMALE, TRANSITIONAL, JUVENILE or 'nan'. The sex classification for sexually dimorphic species where sex can be distinguished visually and is recorded. For some species, individuals with juvenile markings are also indicated here. The TRANSITIONAL class is used for fish with external morphological features consistent with both male and female (applies to sex changing fishes such as California sheephead). JUVENILE is not always indicated when a juvenile fish is observed. <br>
**observer** = The diver who conducted the survey transect <br>
**depth** = Between 0.2 and 33.4 meters. Depth of the transect estimated by the diver. **Does this mean a dive computer was used?** <br>
**vis** = Between 1 and 35 meters. The diver's estimation of horizontal visibility on each transect. Measured by reeling in the transect tape and noting the distance at which the end of the tape can first be seen. <br>
**temp** = Between 7 and 25.6 degrees C. The temperature on each transect measured by the diver's computer. <br>
**surge** = HIGH, MODERATE, LIGHT or 'nan'. The diver's estimation of magnitude of horizontal displacement on each transect.
- LIGHT: No significant surge
- MODERATE: Surge causing noticeable lateral movement, diver must compensate
- HIGH: Significant surge, diver moved out of transect bounds when not holding on

**pctcnpy** = 0 - 3 or NaN. The diver's estimation of the percent of the transect, by volume, that is occupied by kelp. This estimation is specific to the level of the transect that is being surveyed (i.e. excluding canopy transects, this not an estimation of surface canopy but of the amount of kelp within the transect at the specified level). **I believe this measure was only recorded by PISCO.**
- 0: 0% of transect volume occupied by kelp
- 1: 1-33% of transect volume occupied by kelp
- 2: 34-66% of transect volume occupied by kelp
- 3: 67-100% of transect volume occupied by kelp

**notes** = Free text notes taken at the time of the sample, or added at the time of data entry. <br>
**site_name_old** = In cases when specific sites have been surveyed by multiple campuses using different site names, this variable indicates the alternative (historical) site name.

### Strategy

As with the RCCA data, each transect can be an **event** and each fish observation can be an **occurrence**. There are both event-level and occurrence-level measurements, necessitating event and MoF files. 

The **event** file should contain: eventID (from site, survey_year, transect, level?), eventDate (from year, month, date), datasetID, locality (site), localityRemarks (maybe level and/or zone information), countryCode, decimalLat, decimalLon, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. Some notes might be eventRemarks. Should I include the campus information somewhere? Observer?

The **occurrence** file should contain: eventID, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe some notes), sex (sex), lifeStage (sex), organismQuantity (count), organismQuantityType.

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Depth, vis, temp, surge and pctcnpy can be recorded at the event level. Fish_tl, min_tl and max_tl can be recorded at the occurrence level.

## Create occurrence file

### Get site names

In [4]:
## Load site table

filename = 'MLPA_kelpforest_site_table.1.csv'
sites = pd.read_csv(filename)

print(sites.shape)
sites.head()

(7458, 17)


Unnamed: 0,LTM_project_short_code,campus,method,survey_year,year,site,latitude,longitude,CA_MPA_Name_Short,site_designation,site_status,Secondary_MPA_Name,Secondary_site_designation,Secondary_site_status,BaselineRegion,LongTermRegion,MPA_priority_tier
0,LTM_Kelp_SRock,VRG,SBTL_SIZEFREQ_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
1,LTM_Kelp_SRock,VRG,SBTL_FISH_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
2,LTM_Kelp_SRock,VRG,SBTL_SWATH_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
3,LTM_Kelp_SRock,VRG,SBTL_UPC_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
4,LTM_Kelp_SRock,HSU,SBTL_UPC_HSU,2018,2018,ABALONE_POINT_1,39.6915,-123.8141,Ten Mile SMR,reference,reference,,,,North Coast,North Coast,I


There are two sites in the site table that have no fish records:
- PISMO_W
- SAL_E

**These sites also have latitude and longitude = NaN. In addition, one site that is in the fish table has latitude and longitude = NaN:**
- SCI_PELICAN_FAR_WEST

```python
sites[sites['latitude'].isna() == True]
```

Also, it looks like only one lat and lon is given for each site. Additionally, sites have been consistently labeled as either 'reference' or 'mpa'. To check this:
```python
# Groupby
out = sites.groupby(['site']).agg({
    'latitude':pd.Series.nunique,
    'longitude':pd.Series.nunique,
    'site_status':pd.Series.nunique,
    'campus':pd.Series.nunique
})
out.reset_index(inplace=True)

# Check
out[out['latitude'] > 1]
out[out['longitude'] > 1]
out[out['site_status'] > 1]
out[out['campus'] > 1]
```

Since, for the purpose of DwC, we're not interested in which sites were sampled when, I can simplify the site table to only contain relevant information: site, latitude, longitude, and site status. The campus responsible for the survey might also be good to include. **Which campus is responsible for a given site has changed between years in 13 cases. I'll leave this information out for now.**

In [13]:
## Create simplified site table

site_summary = sites[['site', 'site_status', 'latitude', 'longitude']].copy()
site_summary.drop_duplicates(inplace=True)

print(site_summary.shape)
site_summary.head()

(382, 4)


Unnamed: 0,site,site_status,latitude,longitude
0,3 Palms East,reference,33.71762,-118.33215
4,ABALONE_POINT_1,reference,39.6915,-123.8141
15,ABALONE_POINT_2,reference,39.66502,-123.80435
26,ABALONE_POINT_3,reference,39.62877,-123.79658
33,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252


In [17]:
## Update coordinates for SCI_PELICAN_FAR_WEST manually ----- THIS CAN BE CHANGED WITH THE NEW SITE TABLE, WHICH SHOULD HAVE THEM INCLUDED

site_summary.loc[site_summary['site'] == 'SCI_PELICAN_FAR_WEST', 'site_status'] = 'reference'
site_summary.loc[site_summary['site'] == 'SCI_PELICAN_FAR_WEST', 'latitude'] = 34.0324
site_summary.loc[site_summary['site'] == 'SCI_PELICAN_FAR_WEST', 'longitude'] = -119.698883

site_summary[site_summary['site'] == 'SCI_PELICAN_FAR_WEST']

Unnamed: 0,site,site_status,latitude,longitude
5529,SCI_PELICAN_FAR_WEST,reference,34.0324,-119.698883


Some site names have spaces or ' - ' characters. I'll replace these in a sensible way.

In [18]:
# Replace ' ' and ' - ' in site names and add site_name column

site_name = [name.replace(' - ', '-') for name in site_summary['site']]
site_name = [name.replace(' ', '_') for name in site_name]
site_summary['site_name'] = site_name

site_summary.head()

Unnamed: 0,site,site_status,latitude,longitude,site_name
0,3 Palms East,reference,33.71762,-118.33215,3_Palms_East
4,ABALONE_POINT_1,reference,39.6915,-123.8141,ABALONE_POINT_1
15,ABALONE_POINT_2,reference,39.66502,-123.80435,ABALONE_POINT_2
26,ABALONE_POINT_3,reference,39.62877,-123.79658,ABALONE_POINT_3
33,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252,ANACAPA_ADMIRALS_CEN


### Get species table

In [19]:
## Load species table

filename = 'MLPA_kelpforest_taxon_table.1.csv'
species = pd.read_csv(filename)

print(species.shape)
species.head()

(1336, 38)


Unnamed: 0,campus,sample_type,sample_subtype,classcode,orig_classcode,Kingdom,Phylum,Class,Order,Family,...,LOOKED2009,LOOKED2010,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018
0,HSU,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,no,no,no,no,no,yes,yes,no,yes,yes
1,UCSB,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,VRG,FISH,FISH,AARG,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
3,HSU,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no
4,UCSB,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no


The subset of the species table that's currently relevant is entries with sample_type = 'FISH'. **Note** that there are 172 unique classcodes under this sample type, only 166 of which are actually in the fish data set. **Where does this discrepancy come from?** It seems like all classcodes should appear at least once, with a count of either 0 (looked for and not found) or NaN (not looked for). **It sounded like the data should already have NaN if a species was not looked for during a given survey and 0 if it was looked for and not found. Is that correct?** No - the looked for columns need to be used to populate absence records (see below).

Classcodes that do not appear in the data are:
- DMAC
- HSPI
- HSTE
- MXEN
- RBIN

These are rare classcodes from VRG. Dan will likely remove them from the data for the 2020 submission to DataONE.

```python
species.loc[species['classcode'].isin(['DMAC', 'HSPI', 'HSTE', 'MXEN', 'RBIN']), ['campus', 'classcode', 'species_definition', 'common_name']]
```

In [9]:
## Select species for fish surveys



fish_sp = species.loc[species['sample_type'] == 'FISH', ['campus', 'classcode', 'species_definition', 'common_name', 'LOOKED1999', 
                                                         'LOOKED2000', 'LOOKED2001', 'LOOKED2002', 'LOOKED2003', 'LOOKED2004',
                                                         'LOOKED2005', 'LOOKED2006', 'LOOKED2007', 'LOOKED2008', 'LOOKED2009',
                                                         'LOOKED2010', 'LOOKED2011', 'LOOKED2012', 'LOOKED2013', 'LOOKED2014',
                                                         'LOOKED2015', 'LOOKED2016', 'LOOKED2017', 'LOOKED2018']]

print(fish_sp.shape)
fish_sp

(523, 24)


Unnamed: 0,campus,classcode,species_definition,common_name,LOOKED1999,LOOKED2000,LOOKED2001,LOOKED2002,LOOKED2003,LOOKED2004,...,LOOKED2009,LOOKED2010,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018
0,HSU,AARG,Amphistichus argenteus,Barred Surfperch,no,no,no,no,no,no,...,no,no,no,no,no,yes,yes,no,yes,yes
1,UCSB,AARG,Amphistichus argenteus,Barred Surfperch,yes,yes,yes,yes,yes,yes,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,VRG,AARG,Amphistichus argenteus,Barred Surfperch,no,no,no,no,no,yes,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
3,HSU,ACOR,Artedius corallinus,Coralline Sculpin,no,no,no,no,no,no,...,no,no,no,no,no,no,no,no,no,no
4,UCSB,ACOR,Artedius corallinus,Coralline Sculpin,no,no,no,no,no,no,...,no,no,no,no,no,no,no,no,no,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
518,VRG,ZEXA,Zapteryx exasperata,Banded Guitarfish,no,no,no,no,no,yes,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
519,HSU,ZROS,Zalembius rosaceus,Pink Surfperch,no,no,no,no,no,no,...,no,no,no,no,no,yes,yes,no,yes,yes
520,UCSB,ZROS,Zalembius rosaceus,Pink Surfperch,yes,yes,yes,yes,yes,yes,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
521,UCSC,ZROS,Zalembius rosaceus,Pink Surfperch,yes,yes,yes,yes,yes,yes,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes


In [10]:
## Melt species table

long = pd.melt(fish_sp, id_vars=fish_sp.columns[0:4], var_name='year', value_name='looked')
print(long.shape)
long.head()

(10460, 6)


Unnamed: 0,campus,classcode,species_definition,common_name,year,looked
0,HSU,AARG,Amphistichus argenteus,Barred Surfperch,LOOKED1999,no
1,UCSB,AARG,Amphistichus argenteus,Barred Surfperch,LOOKED1999,yes
2,VRG,AARG,Amphistichus argenteus,Barred Surfperch,LOOKED1999,no
3,HSU,ACOR,Artedius corallinus,Coralline Sculpin,LOOKED1999,no
4,UCSB,ACOR,Artedius corallinus,Coralline Sculpin,LOOKED1999,no


In [11]:
## Replace 

long['year'] = long['year'].str.split('D').str[1].astype(int)
long['year'].unique()

array([1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
       2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018])

### Fill in absence records in fish

In [12]:
## Check if there are any records where count data is missing

fish[fish['count'].isna() == True]

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,max_tl,sex,observer,depth,vis,temp,surge,pctcnpy,notes,site_name_old
57953,UCSC,SBTL_FISH_PISCO,2008,2008,9,4,PALO_COLORADO,OUTMID,BOT,2,...,100.0,,SCOTT GABARA,15.2,7.0,14.4,LIGHT,,"DOGFISH (LOTS), INTERFERANCE FROM MIDWATER DIV...",


In [13]:
## Drop these records 

print(fish.shape)
fish.dropna(subset=['count'], inplace=True)
fish.shape

(381693, 24)


(381692, 24)

While populating absence records, I observed that there are a couple entries missing in the species table (i.e., there are observations of a given classcode from a given campus, but that classcode is not listed in the species table for that campus.) **I need to let Dan know about this, and hopefully he can fix it in the next update to DataONE.** Until then, though, I'm using the following step to add these records in. They are:

- RFYOY for UCSC, observations in 2000, 2003, 2011, 2013, 2014, and 2017
- RFYOY for UCSB, observations in 2001, 2003, 2005
- SCAL for UCSC, observations in 2003

In [14]:
## Add missing entries to long-format species table --- THIS STEP SHOULD GO AWAY AFTER DAN MAKES CORRECTIONS TO THE SPECIES TABLE

to_add = pd.DataFrame({'campus':['UCSC']*7 + ['UCSB']*3,
                      'classcode':['RFYOY']*6 + ['SCAL'] + ['RFYOY']*3,
                      'species_definition':['Sebastes']*6 + ['Squatina californica'] + ['Sebastes']*3,
                      'common_name':['Rockfish Young Of The Year, Unidentified Sp.']*6 + ['Pacific Angel Shark'] + ['Rockfish Young Of The Year, Unidentified Sp.']*3,
                      'year':[2000, 2003, 2011, 2013, 2014, 2017, 2003, 2001, 2003, 2005],
                      'looked':'yes'})

print(long.shape)
long = pd.concat([long, to_add])
print(long.shape)
long.reset_index(drop=True, inplace=True)
long.tail()

(10460, 6)
(10470, 6)


Unnamed: 0,campus,classcode,species_definition,common_name,year,looked
10465,UCSC,RFYOY,Sebastes,"Rockfish Young Of The Year, Unidentified Sp.",2017,yes
10466,UCSC,SCAL,Squatina californica,Pacific Angel Shark,2003,yes
10467,UCSB,RFYOY,Sebastes,"Rockfish Young Of The Year, Unidentified Sp.",2001,yes
10468,UCSB,RFYOY,Sebastes,"Rockfish Young Of The Year, Unidentified Sp.",2003,yes
10469,UCSB,RFYOY,Sebastes,"Rockfish Young Of The Year, Unidentified Sp.",2005,yes


In [15]:
## Get a table telling whether each fish was looked for during each specific transect

survey_table = fish[['campus', 'method', 'day', 'month', 'survey_year', 'year', 'site', 'zone', 'level', 'transect']].merge(long[['campus', 'classcode', 'year', 'looked']], 
                                                                                                             how='left', 
                                                                                                             left_on=['campus', 'survey_year'],
                                                                                                             right_on=['campus', 'year'])
survey_table.drop_duplicates(inplace=True)
survey_table.rename(columns={'year_x':'year'}, inplace=True) # year_x retains actual year when survey took place
survey_table.drop(columns=['year_y'], inplace=True) # year_y == survey_year because of the merge
survey_table

Unnamed: 0,campus,method,day,month,survey_year,year,site,zone,level,transect,classcode,looked
0,UCSC,SBTL_FISH_PISCO,7,9,1999,1999,HOPKINS_DC,INNER,BOT,1,ACOR,no
1,UCSC,SBTL_FISH_PISCO,7,9,1999,1999,HOPKINS_DC,INNER,BOT,1,ADAV,yes
2,UCSC,SBTL_FISH_PISCO,7,9,1999,1999,HOPKINS_DC,INNER,BOT,1,AFLA,yes
3,UCSC,SBTL_FISH_PISCO,7,9,1999,1999,HOPKINS_DC,INNER,BOT,1,AHOL,no
4,UCSC,SBTL_FISH_PISCO,7,9,1999,1999,HOPKINS_DC,INNER,BOT,1,AOCE,yes
...,...,...,...,...,...,...,...,...,...,...,...,...
54117169,VRG,SBTL_FISH_VRG,12,8,2011,2011,Long Point East,DEEP,MID,2,TSYM,yes
54117170,VRG,SBTL_FISH_VRG,12,8,2011,2011,Long Point East,DEEP,MID,2,UHAL,yes
54117171,VRG,SBTL_FISH_VRG,12,8,2011,2011,Long Point East,DEEP,MID,2,URON,yes
54117172,VRG,SBTL_FISH_VRG,12,8,2011,2011,Long Point East,DEEP,MID,2,ZEXA,yes


In [15]:
## Merge with fish data to get final outcome

full_fish = fish.merge(survey_table, 
                             how='right', 
                             on=['campus', 'method', 'day', 'month', 'year', 'survey_year', 'site', 'zone', 'level', 'transect', 'classcode'])
full_fish

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,sex,observer,depth,vis,temp,surge,pctcnpy,notes,site_name_old,looked
0,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,,,,,no
1,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,,,,,yes
2,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,,,,,yes
3,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,,,,,no
4,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,,,,,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8883628,VRG,SBTL_FISH_VRG,2011,2011,8,12,Long Point East,DEEP,MID,2,...,,,,,,,,,,yes
8883629,VRG,SBTL_FISH_VRG,2011,2011,8,12,Long Point East,DEEP,MID,2,...,,,,,,,,,,yes
8883630,VRG,SBTL_FISH_VRG,2011,2011,8,12,Long Point East,DEEP,MID,2,...,,,,,,,,,,yes
8883631,VRG,SBTL_FISH_VRG,2011,2011,8,12,Long Point East,DEEP,MID,2,...,,,,,,,,,,yes


In [16]:
## Clean

full_fish = full_fish[full_fish['classcode'] != 'NO_ORG'].copy()
full_fish.loc[(full_fish['looked'] == 'yes') & (full_fish['count'].isna() == True), 'count'] = 0
full_fish.dropna(subset=['count'], inplace=True)
full_fish

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,sex,observer,depth,vis,temp,surge,pctcnpy,notes,site_name_old,looked
1,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,,,,,yes
2,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,,,,,yes
4,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,,,,,yes
5,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,,,,,yes
6,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,,,,,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8883628,VRG,SBTL_FISH_VRG,2011,2011,8,12,Long Point East,DEEP,MID,2,...,,,,,,,,,,yes
8883629,VRG,SBTL_FISH_VRG,2011,2011,8,12,Long Point East,DEEP,MID,2,...,,,,,,,,,,yes
8883630,VRG,SBTL_FISH_VRG,2011,2011,8,12,Long Point East,DEEP,MID,2,...,,,,,,,,,,yes
8883631,VRG,SBTL_FISH_VRG,2011,2011,8,12,Long Point East,DEEP,MID,2,...,,,,,,,,,,yes


**Note** that there are 2767 records where count > 0 but looked = no. This doesn't make sense. I'm not seeing trends here with respect to campus, year, etc.

```python
# Get records
weird = full_fish[(full_fish['count'] > 0) & (full_fish['looked'] == 'no')]

## Get table of campuses and years where there were observations for classcodes that were not looked for according to the species table

obs_exist = weird[['campus', 'survey_year', 'classcode']].copy()
obs_exist.drop_duplicates(inplace=True)
obs_exist.head()
```

### Convert

In [17]:
## Pad month and day as needed

paddedDay = ['0' + str(full_fish['day'].iloc[i]) if len(str(full_fish['day'].iloc[i])) == 1 else str(full_fish['day'].iloc[i]) for i in range(full_fish.shape[0])]
paddedMonth = ['0' + str(full_fish['month'].iloc[i]) if len(str(full_fish['month'].iloc[i])) == 1 else str(full_fish['month'].iloc[i]) for i in range(full_fish.shape[0])]

In [18]:
## Merge to add site_name (also lat, lon and site_status) to fish table

full_fish = full_fish.merge(site_summary, how='left', on='site')
full_fish.head()

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,temp,surge,pctcnpy,notes,site_name_old,looked,site_status,latitude,longitude,site_name
0,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,yes,mpa,36.623586,-121.904196,HOPKINS_DC
1,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,yes,mpa,36.623586,-121.904196,HOPKINS_DC
2,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,yes,mpa,36.623586,-121.904196,HOPKINS_DC
3,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,yes,mpa,36.623586,-121.904196,HOPKINS_DC
4,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,,,,yes,mpa,36.623586,-121.904196,HOPKINS_DC


In [19]:
## Create eventID

eventID = full_fish['site_name'] + '_' + full_fish['year'].astype(str) + pd.Series(paddedMonth) + pd.Series(paddedDay) + '_' + full_fish['zone'] + '_' + full_fish['level'] + '_' + full_fish['transect'].str.replace(' ', '')
fish_occ = pd.DataFrame({'eventID':eventID})

fish_occ.head()

Unnamed: 0,eventID
0,HOPKINS_DC_19990907_INNER_BOT_1
1,HOPKINS_DC_19990907_INNER_BOT_1
2,HOPKINS_DC_19990907_INNER_BOT_1
3,HOPKINS_DC_19990907_INNER_BOT_1
4,HOPKINS_DC_19990907_INNER_BOT_1


In [20]:
## Add occurrenceID

# Create survey_date column in fish
full_fish['survey_date'] = full_fish['year'].astype(str) + pd.Series(paddedMonth) + pd.Series(paddedDay)

# Groupby to create occurrenceID
fish_occ['occurrenceID'] = full_fish.groupby(['site', 'survey_date', 'zone', 'level', 'transect'])['classcode'].cumcount()+1
fish_occ['occurrenceID'] = fish_occ['eventID'] + '_occ' + fish_occ['occurrenceID'].astype(str)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5


In [21]:
## Map classcodes to species definitions (usually scientific names) and classcodes to common names

code_to_sci_dict = dict(zip(fish_sp['classcode'], fish_sp['species_definition']))
code_to_com_dict = dict(zip(fish_sp['classcode'], fish_sp['common_name']))

In [22]:
## Update code_to_sci_dict for code OYT (see taxon notes below)

code_to_sci_dict['OYT'] = 'Sebastes serranoides/flavidus'
code_to_sci_dict['OYT_VRG'] = 'Sebastes serranoides'

fish.loc[(fish['classcode'] == 'OYT') & (fish['campus'] == 'VRG'), 'classcode'] = 'OYT_VRG'
code_to_com_dict['OYT_VRG'] = 'Olive Rockfish'

In [23]:
## Create scientificName and vernacularName columns

fish_occ['vernacularName'] = full_fish['classcode']
fish_occ['vernacularName'].replace(code_to_com_dict, inplace=True)

fish_occ['scientificName'] = full_fish['classcode']
fish_occ['scientificName'].replace(code_to_sci_dict, inplace=True)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Sargo,Anisotremus davidsonii
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Tubesnout,Aulorhynchus flavidus
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Wolf Eel,Anarrhichthys ocellatus
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Penpiont Gunnel,Apodichthys flavidus
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,"Grunion, Topsmelt Or Jacksmelt",Atherinopsidae


In [24]:
## Get unique scientific names for lookup in WoRMS

names = fish_occ['scientificName'].unique()

**Note** that there are a number of names that are not specific at the species level:
- Sebastes atrovirens/carnatus/chrysomelas/caurinus (matched to Sebastes; Kelp, Gopher, Black and Yellow, and Copper Rockfish YoY)
- Sebastes chrysomelas/carnatus (matched to Sebastes; Gopher and Black and Yellow Rockfish YoY)
- Synchirus/Rimicola (Manacled sculpin or kelp clingfish, SYRI) --> This should mean either Synchirus spp. or Rimicola spp. These are both from class **Actinopterygii**
- Sebastes serranoides/flavidus/melanops (matched to Sebastes; Olive, Yellowtail and Black Rockfish YoY)
- Sebastes carnatus/caurinus (matched to Sebastes; Gopher, Copper Rockfish YoY)

There are also some descriptions that lack a scientific name:
- No Organisms Present In This Sample (NO_ORG) --> **There are 12430 records with this designation. Should they just be removed? Dan said yes, unless you're using them in combination with the looked for fields in the species table to populate absence records. Now that I have populated absence records, the NO_ORG classification has been handled.**
    - One thing that confuses me here is that **sometimes, NO_ORG observations occur in the same event as other observations.** You would think that a NO_ORG entry would be the only entry for a given event - see example below. Also note that **all records with a NO_ORG classcode also have count = 0, and vice versa.** 
    
```python
fish_occ[fish_occ['eventID'] == 'HOPKINS_DC_19990907_OUTER_CAN_2']
```

- Unidentified Fish (UNID) --> **I assume this should match to Actinopterygii, or maybe Pisces. Dan says Actinopterygii.**

Species with multiple common names:
- Atherinopsidae --> Grunion, Topsmelt Or Jacksmelt
- Clupeiformes --> Bait, Sardines/Anchovies (BAITBALL)
- Clinidae --> Kelpfishes And Fringeheads
- Clupeiformes --> Sardines And Anchovies (CLUP)
- Lethops connectens --> Kelp Goby, Halfblind Goby
- Scomber japonicus --> Pacific Mackerel, Greenback Mackerel
- Thaleichthys pacificus --> Candlefish, eulachon

Other classifications to be aware of:
- Hexagrammos --> Unidentified Hexagrammos
- Sebastes --> Rockfish, Unidentified Sp. (SEBSPP)
- Sebastes --> Rockfish Young Of The Year, Unidentified Sp (RFYOY)
- OYT matches to two species definitions: Sebastes serranoides/flavidus for HSU, UCSB, UCSC and Sebastes serranoides for VRG. **I think this is probably a mistake, and that the former is correct. I've changed OYT in code_to_sci_dict accordingly. Dan said that this is wrong - VRG probably does think they're actually seeing S. serranoides based on the region they're surveying in. I've created a new classcode, OYT_VRG, to match to S. serranoides rather than S. serranoides/flavidus.** To check this, use:

```python
species.loc[species['classcode'] == 'OYT', ['campus', 'classcode', 'Genus', 'Species', 'species_definition', 'common_name']]
```

In [25]:
## Make changes based on the above observations

fish_occ.loc[fish_occ['scientificName'] == 'Synchirus/Rimicola', 'scientificName'] = 'Actinopterygii'
fish_occ.loc[fish_occ['scientificName'] == 'Unidentified Fish', 'scientificName'] = 'Actinopterygii'

# Redefine names
names = fish_occ['scientificName'].unique()

In [26]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Sebastes chrysomelas/carnatus checking:  Sebastes
Url didn't work for Sebastes atrovirens/carnatus/chrysomelas/caurinus checking:  Sebastes
Url didn't work for Sebastes serranoides/flavidus/melanops checking:  Sebastes
Url didn't work for Sebastes serranoides/flavidus checking:  Sebastes
Url didn't work for Sebastes carnatus/caurinus checking:  Sebastes


In [27]:
## Add scientific name-related columns

fish_occ['scientificNameID'] = fish_occ['scientificName']
fish_occ['scientificNameID'].replace(name_id_dict, inplace=True)

fish_occ['taxonID'] = fish_occ['scientificName']
fish_occ['taxonID'].replace(name_taxid_dict, inplace=True)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Wolf Eel,Anarrhichthys ocellatus,urn:lsid:marinespecies.org:taxname:279605,279605
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Penpiont Gunnel,Apodichthys flavidus,urn:lsid:marinespecies.org:taxname:279664,279664
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,"Grunion, Topsmelt Or Jacksmelt",Atherinopsidae,urn:lsid:marinespecies.org:taxname:266995,266995


In [28]:
## Create identificationQualifier

qualifier_dict = {'Sebastes serranoides/flavidus':'Sebastes serranoides or Sebastes flavidus',
               'Sebastes atrovirens/carnatus/chrysomelas/caurinus':'Sebastes atrovirens, Sebastes carnatus, Sebastes chrysomelas or Sebastes Caurinus',
               'Sebastes chrysomelas/carnatus':'Sebastes chrysomelas or Sebastes carnatus',
               'Sebastes serranoides/flavidus/melanops':'Sebastes serranoides, Sebastes flavidus or Sebastes melanops',
                 'Sebastes carnatus/caurinus':'Sebastes carnatus or Sebastes caurinus'}

identificationQualifier = [qualifier_dict[name] if name in qualifier_dict.keys() else np.nan for name in fish_occ['scientificName']]

In [29]:
## Replace scientificName using name_name_dict

fish_occ['scientificName'].replace(name_name_dict, inplace=True)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Wolf Eel,Anarrhichthys ocellatus,urn:lsid:marinespecies.org:taxname:279605,279605
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Penpiont Gunnel,Apodichthys flavidus,urn:lsid:marinespecies.org:taxname:279664,279664
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,"Grunion, Topsmelt Or Jacksmelt",Atherinopsidae,urn:lsid:marinespecies.org:taxname:266995,266995


In [30]:
## Add final name-related columns

fish_occ['nameAccordingTo'] = 'WoRMS'
fish_occ['occurrenceStatus'] = 'present'
fish_occ['basisOfRecord'] = 'HumanObservation'
fish_occ['identificationQualifier'] = identificationQualifier

# Add identificationQualifier for Synchirus/Rimicola
fish_occ.loc[fish_occ['vernacularName'] == 'Manacled Sculpin/Kelp Clingfish', 'identificationQualifier'] = 'Synchirus spp. or Rimicola spp.'

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,present,HumanObservation,
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,present,HumanObservation,
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Wolf Eel,Anarrhichthys ocellatus,urn:lsid:marinespecies.org:taxname:279605,279605,WoRMS,present,HumanObservation,
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Penpiont Gunnel,Apodichthys flavidus,urn:lsid:marinespecies.org:taxname:279664,279664,WoRMS,present,HumanObservation,
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,"Grunion, Topsmelt Or Jacksmelt",Atherinopsidae,urn:lsid:marinespecies.org:taxname:266995,266995,WoRMS,present,HumanObservation,


In [31]:
## Pull sex and lifeStage information out of sex column

fish_occ['sex'] = full_fish['sex'].copy()
fish_occ['lifeStage'] = fish_occ['sex']

# Separate
fish_occ.loc[fish_occ['sex'].isin(['JUVENILE']), 'sex'] = np.nan
fish_occ.loc[fish_occ['lifeStage'].isin(['MALE', 'FEMALE', 'TRANSITIONAL']), 'lifeStage'] = np.nan

# Replace sex and lifeStage with controlled vocabulary
fish_occ['sex'].replace({'MALE':'male', 'FEMALE':'female', 'TRANSITIONAL':'hermaphrodite'}, inplace=True)
fish_occ['lifeStage'].replace({'JUVENILE':'juvenile'}, inplace=True)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,present,HumanObservation,,,
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,present,HumanObservation,,,
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Wolf Eel,Anarrhichthys ocellatus,urn:lsid:marinespecies.org:taxname:279605,279605,WoRMS,present,HumanObservation,,,
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Penpiont Gunnel,Apodichthys flavidus,urn:lsid:marinespecies.org:taxname:279664,279664,WoRMS,present,HumanObservation,,,
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,"Grunion, Topsmelt Or Jacksmelt",Atherinopsidae,urn:lsid:marinespecies.org:taxname:266995,266995,WoRMS,present,HumanObservation,,,


In [32]:
## Create density

fish_density = full_fish['count'] # units = individuals per 120 m3
fish_occ['organismQuantity'] = fish_density
fish_occ['organismQuantityType'] = 'number of individuals per 120 m3'
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,present,HumanObservation,,,,0.0,number of individuals per 120 m3
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,present,HumanObservation,,,,0.0,number of individuals per 120 m3
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Wolf Eel,Anarrhichthys ocellatus,urn:lsid:marinespecies.org:taxname:279605,279605,WoRMS,present,HumanObservation,,,,0.0,number of individuals per 120 m3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Penpiont Gunnel,Apodichthys flavidus,urn:lsid:marinespecies.org:taxname:279664,279664,WoRMS,present,HumanObservation,,,,0.0,number of individuals per 120 m3
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,"Grunion, Topsmelt Or Jacksmelt",Atherinopsidae,urn:lsid:marinespecies.org:taxname:266995,266995,WoRMS,present,HumanObservation,,,,0.0,number of individuals per 120 m3


**Note** that because we dropped the NO_ORG records, there are no instances where organismQuantity = 0. These instances would normally have occurrenceStatus = absent. 

```python
fish_occ[fish_occ['organismQuantity'] == 0]
```

**This is no longer true. NO_ORG records were dealt with when absence records were populated.**

Also, there is one record where organismQuantity (i.e. the count column in fish) is NaN. The eventID is PALO_COLORADO_20080904_OUTMID_BOT_2. See occurrence 3 in the following example:

```python
fish_occ[fish_occ['eventID'] == 'PALO_COLORADO_20080904_OUTMID_BOT_2']
```

**I will drop this record for now. --> Now that absence records were populated, this record has already been removed.**

In [33]:
## Update absence status

fish_occ.loc[fish_occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Wolf Eel,Anarrhichthys ocellatus,urn:lsid:marinespecies.org:taxname:279605,279605,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Penpiont Gunnel,Apodichthys flavidus,urn:lsid:marinespecies.org:taxname:279664,279664,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,"Grunion, Topsmelt Or Jacksmelt",Atherinopsidae,urn:lsid:marinespecies.org:taxname:266995,266995,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3


### The notes column

I would like to make some effort to extract useful information and tidy the notes column. **A few of the notes are pretty inappropriate and/or use names. However, since PISCO has already shared them publicly, I assume it continues to be OK to do so.** Many, many of the notes contain potentially useful information.

A large number of the notes contain sex information. I can probably extract this. I need to look for and pull:
- M
- F
- M;
- F;
- MALE
- FEMALE
- FEAMLE
- MALE;
- FEMALE;
- TRANSITIONAL;
- TRANSITION
- MALE,

There is also some life stage information:
- JUVENILE
- JUVENILE;
- JEVENILE

Sometimes, sex is uncertain (e.g. 'M?'). I'll leave these in the notes.

Note cleaning:
- Explore cleaning lowercase versus capitals
- Some notes are preceeded by '. '

To look at the non-sex-related notes, use:

```python
not_sex = fish[(fish['notes'].isna() == False) & (fish['notes'] != 'M') & (fish['notes'] != 'F') & (fish['notes'] != 'M;') & (fish['notes'] != 'F;') 
               & (fish['notes'] != 'MALE') & (fish['notes'] != 'FEMALE')].copy()

for note in not_sex['notes'].unique():
    print(note)
```

In [34]:
## Obtain relevant records from fish

notes = full_fish[['site', 'survey_date', 'classcode', 'count', 'sex', 'notes']].copy()
print(notes.shape)
notes.head()

(7655353, 6)


Unnamed: 0,site,survey_date,classcode,count,sex,notes
0,HOPKINS_DC,19990907,ADAV,0.0,,
1,HOPKINS_DC,19990907,AFLA,0.0,,
2,HOPKINS_DC,19990907,AOCE,0.0,,
3,HOPKINS_DC,19990907,APFL,0.0,,
4,HOPKINS_DC,19990907,ATHE,0.0,,


In [35]:
## Extract sex from notes column

sex_notes = []
sex_options = ['M', 'F', 'MALE', 'FEMALE', 'FEAMLE', 'MALES', 'FEMALES', 'TRANSITIONAL', 'TRANSITION', 'TRANNY']

for note in notes['notes']:
    
    colon_overlap = []
    comma_overlap = []
    slash_overlap = []
    
    if note == note:
        
        colon_split = list(map(str.strip, note.split(';')))
        if (len(colon_split) > 1) & ('' not in colon_split):
            colon_overlap = list(set(sex_options) & set(colon_split))
            
        comma_split = list(map(str.strip, note.split(',')))
        if (len(comma_split) > 1) & ('' not in comma_split):
            comma_overlap = list(set(sex_options) & set(comma_split))
            
        slash_split = list(map(str.strip, note.split('/')))
        if (len(slash_split) > 1) & ('' not in slash_split):
            slash_overlap = list(set(sex_options) & set(slash_split))
          
        
        if note in sex_options:
            sex_notes.append(note)
        elif colon_overlap != []:
            sex_notes.extend(colon_overlap)
        elif comma_overlap != []:
            sex_notes.extend(comma_overlap)
        elif (slash_overlap != []) & (len(slash_overlap) == 1):
            sex_notes.extend(slash_overlap)
        
        else:
            sex_notes.append(np.nan)
            
    else:
        sex_notes.append(np.nan)
        
notes['sex_notes'] = sex_notes
notes.head()

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes
0,HOPKINS_DC,19990907,ADAV,0.0,,,
1,HOPKINS_DC,19990907,AFLA,0.0,,,
2,HOPKINS_DC,19990907,AOCE,0.0,,,
3,HOPKINS_DC,19990907,APFL,0.0,,,
4,HOPKINS_DC,19990907,ATHE,0.0,,,


In [36]:
## Clean sex_notes

print(notes['sex_notes'].unique())
notes['sex_notes'].replace({'F':'female',
                  'M':'male',
                  'FEMALE':'female',
                  'MALE':'male',
                  'MALES':'male',
                  'FEMALES':'female',
                  'FEAMLE':'female',
                  'TRANSITIONAL':'hermaphrodite',
                  'TRANNY':'hermaphrodite',
                  'TRANSITION':'hermaphrodite'}, inplace=True)
print(notes['sex_notes'].unique())

[nan 'F' 'M' 'FEMALE' 'MALE' 'TRANSITIONAL' 'MALES' 'FEMALES' 'TRANNY'
 'TRANSITION' 'FEAMLE']
[nan 'female' 'male' 'hermaphrodite']


In [37]:
# Add sex from fish_occ to notes

notes['occ_sex'] = fish_occ['sex']
notes.head()

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes,occ_sex
0,HOPKINS_DC,19990907,ADAV,0.0,,,,
1,HOPKINS_DC,19990907,AFLA,0.0,,,,
2,HOPKINS_DC,19990907,AOCE,0.0,,,,
3,HOPKINS_DC,19990907,APFL,0.0,,,,
4,HOPKINS_DC,19990907,ATHE,0.0,,,,


In [38]:
## Create new column merging information from occ_sex and sex_notes

new_sex = [notes['occ_sex'].iloc[i] if notes['occ_sex'].iloc[i] == notes['occ_sex'].iloc[i] else notes['sex_notes'].iloc[i] for i in range(notes.shape[0])]
notes['new_sex'] = new_sex
notes.head()

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes,occ_sex,new_sex
0,HOPKINS_DC,19990907,ADAV,0.0,,,,,
1,HOPKINS_DC,19990907,AFLA,0.0,,,,,
2,HOPKINS_DC,19990907,AOCE,0.0,,,,,
3,HOPKINS_DC,19990907,APFL,0.0,,,,,
4,HOPKINS_DC,19990907,ATHE,0.0,,,,,


In [39]:
## Replace sex column in fish_occ with new_sex

fish_occ['sex'] = notes['new_sex']
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Wolf Eel,Anarrhichthys ocellatus,urn:lsid:marinespecies.org:taxname:279605,279605,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Penpiont Gunnel,Apodichthys flavidus,urn:lsid:marinespecies.org:taxname:279664,279664,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,"Grunion, Topsmelt Or Jacksmelt",Atherinopsidae,urn:lsid:marinespecies.org:taxname:266995,266995,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3


To check that the above process is working, use:

```python
fish[fish['sex'] == 'FEMALE'].shape[0] # 16528
full_fish[full_fish['sex'] == 'FEMALE'].shape[0] # 16528
notes[notes['occ_sex'] == 'female'].shape[0] # 16528
notes[notes['new_sex'] == 'female'].shape[0] # 18031
fish_occ[fish_occ['sex'] == 'female'].shape[0]
```

In [40]:
## Repeat the process to extract lifeStage information from notes

stage_notes = []
stage_options = ['JUVENILE', 'JUV', 'JEVENILE']

for note in notes['notes']:
    
    colon_overlap = []
    comma_overlap = []
    slash_overlap = []
    
    if note == note:
        
        colon_split = list(map(str.strip, note.split(';')))
        if (len(colon_split) > 1) & ('' not in colon_split):
            colon_overlap = list(set(stage_options) & set(colon_split))
            
        comma_split = list(map(str.strip, note.split(',')))
        if (len(comma_split) > 1) & ('' not in comma_split):
            comma_overlap = list(set(stage_options) & set(comma_split))
            
        slash_split = list(map(str.strip, note.split('/')))
        if (len(slash_split) > 1) & ('' not in slash_split):
            slash_overlap = list(set(stage_options) & set(slash_split))
          
        
        if note in stage_options:
            stage_notes.append(note)
        elif colon_overlap != []:
            stage_notes.extend(colon_overlap)
        elif comma_overlap != []:
            stage_notes.extend(comma_overlap)
        elif (slash_overlap != []) & (len(slash_overlap) == 1):
            stage_notes.extend(slash_overlap)
        
        else:
            stage_notes.append(np.nan)
            
    else:
        stage_notes.append(np.nan)
        
notes['stage_notes'] = stage_notes
        
# Clean stage_notes
print(notes['stage_notes'].unique())
notes['stage_notes'].replace({'JUV':'juvenile',
                  'JUVENILE':'juvenile',
                  'JEVENILE':'juvenile'}, inplace=True)
print(notes['stage_notes'].unique())

# Add lifeStage from fish_occ to notes
notes['occ_stage'] = fish_occ['lifeStage']

# Create new column merging information from occ_stage and stage_notes
new_stage = [notes['occ_stage'].iloc[i] if notes['occ_stage'].iloc[i] == notes['occ_stage'].iloc[i] else notes['stage_notes'].iloc[i] for i in range(notes.shape[0])]
notes['new_stage'] = new_stage

# Replace lifeStage column in fish_occ with new_stage
fish_occ['lifeStage'] = notes['new_stage']
fish_occ.head()

[nan 'JUV' 'JUVENILE' 'JEVENILE']
[nan 'juvenile']


Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Wolf Eel,Anarrhichthys ocellatus,urn:lsid:marinespecies.org:taxname:279605,279605,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Penpiont Gunnel,Apodichthys flavidus,urn:lsid:marinespecies.org:taxname:279664,279664,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,"Grunion, Topsmelt Or Jacksmelt",Atherinopsidae,urn:lsid:marinespecies.org:taxname:266995,266995,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3


Check:

```python
fish[fish['sex'] == 'JUVENILE'].shape[0] # 2336
notes[notes['occ_stage'] == 'juvenile'].shape[0] # 2336
notes[notes['new_stage'] == 'juvenile'].shape[0] # 2381
fish_occ[fish_occ['lifeStage'] == 'juvenile'].shape[0] # 2381
```

In [38]:
# ## Save notes to inspect process if desired

# notes.to_csv('notes.csv', index=False, na_rep='NaN')

In [39]:
# ## Clean notes if desired **NOT GOING TO DO THIS FOR NOW**

# cleaned_notes = fish['notes'].copy()
# print(cleaned_notes[cleaned_notes.index == 348366])
# for i in range(cleaned_notes.shape[0]):
#     note = cleaned_notes.iloc[i]
#     if note == note:
#         if note[0:2] == '. ':
#             cleaned_notes.iloc[i] = cleaned_notes.iloc[i][2:]
#         cleaned_notes.iloc[i] = cleaned_notes.iloc[i].lower()
    
# print(cleaned_notes[cleaned_notes.index == 348366])
# fish_occ['occurrenceRemarks'] = cleaned_notes
# fish_occ.head()

In [41]:
## Add notes under occurrenceRemarks

fish_occ['occurrenceRemarks'] = notes['notes']
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType,occurrenceRemarks
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3,
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3,
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Wolf Eel,Anarrhichthys ocellatus,urn:lsid:marinespecies.org:taxname:279605,279605,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3,
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Penpiont Gunnel,Apodichthys flavidus,urn:lsid:marinespecies.org:taxname:279664,279664,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3,
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,"Grunion, Topsmelt Or Jacksmelt",Atherinopsidae,urn:lsid:marinespecies.org:taxname:266995,266995,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3,


In [46]:
## Change NaN values in string fields to ''

fish_occ[['identificationQualifier', 'sex', 'lifeStage', 'occurrenceRemarks']] = fish_occ[['identificationQualifier', 'sex', 'lifeStage', 'occurrenceRemarks']].replace(np.nan, '')
fish_occ.isna().sum()

eventID                    0
occurrenceID               0
vernacularName             0
scientificName             0
scientificNameID           0
taxonID                    0
nameAccordingTo            0
occurrenceStatus           0
basisOfRecord              0
identificationQualifier    0
sex                        0
lifeStage                  0
organismQuantity           0
organismQuantityType       0
occurrenceRemarks          0
dtype: int64

In [64]:
## Save Size, Min and Max for use in MoF file

# Obtain relevant records from fish
subset = full_fish[['site', 'survey_date', 'classcode', 'count', 'fish_tl', 'min_tl', 'max_tl']].copy()

# Fix records where count = 1 and min and/or max values are present - this defaults to size = fish_tl if max_tl is missing, or if a size range has been given for a single fish
subset.loc[subset['count'] == 1, ['min_tl', 'max_tl']] = np.nan

# Fix records where count > 1 and min and max don't provide a reasonable size range - this defaults to size = fish_tl if max_tl is missing
subset.loc[(subset['fish_tl'] == subset['min_tl']) & (subset['max_tl'].isna() == True), 'min_tl'] = np.nan

# For groups where a size range exists, we want to drop the average length measure
subset.loc[(subset['fish_tl'] < subset['max_tl']) & (subset['fish_tl'] > subset['min_tl']), 'fish_tl'] = np.nan

# Assemble fish_sizes
fish_sizes = pd.DataFrame({
    'eventID':fish_occ['eventID'],
    'occurrenceID':fish_occ['occurrenceID'],
    'fish_tl':subset['fish_tl'],
    'min_tl':subset['min_tl'],
    'max_tl':subset['max_tl']
})
fish_sizes.dropna(how='all', subset=['fish_tl', 'min_tl', 'max_tl'], inplace=True) # Note that this drops 116 records where no size information was given (fish_tl = NaN)

print(fish_sizes.shape)
fish_sizes.head()

(369146, 5)


Unnamed: 0,eventID,occurrenceID,fish_tl,min_tl,max_tl
25,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ26,18.0,,
26,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ27,25.0,,
38,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ39,20.0,,
53,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ54,10.0,,
57,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ58,,7.0,8.0


**Some notes on size values:**
- 0 records have a missing count (count = NaN)
- 0 records have count = 0
- 234652 records have count = 1
    - Of these, 3130 have min values that = fish_tl and max values of NaN. **How do I interpret these records? For now I will disregard min values. This is correct - use fish_tl as the size.**
        - Note that the opposite is never the case (max = fish_tl, min = NaN)
    - Of these 3130, 287 have min and max values, and fish_tl is the average of those values **Here, someone couldn't decide how big the fish was and put a range. Use the average, fish_tl, as the size.**
- 134610 records have count > 1
    - Of these, 104917 are all the same size (i.e. min_tl = max_tl = NaN)
    - 27907 are different sizes with a size range given (i.e. fish_tl is the average of min_tl and max_tl)
    - 1786 are of unknown size, with min values that = fish_tl and max values of NaN. **How do I interpret these records? For now I will disregard min values. This is correct - use fish_tl as the size.**
        - Again, note that the opposite is never the case (max = fish_tl, min = NaN)
        
```python
# Records where count = 1
count1 = subset[subset['count'] == 1].copy()

# Records where count = 1, but min is present
count1[(count1['min_tl'].isna() == False) & (count1['max_tl'].isna() == True)]

# Records where count = 1, but min and max are present
count1[(count1['min_tl'].isna() == False) & (count1['max_tl'].isna() == False)]

# Records where count > 1
count2 = subset[subset['count'] > 1].copy()

# Records where count > 1 and all fish were the same size
count2[(count2['min_tl'].isna() == True) & (count2['max_tl'].isna() == True)]

# Records where count > 1 and fish were not the same size and a size range was given
count2[(count2['fish_tl'].isna() == False) & (count2['min_tl'].isna() == False) & (count2['max_tl'].isna() == False)]

# Records where count > 1 and fish are of unknown size (min = fish_tl, max not given)
count2[(count2['min_tl'].isna() == False) & (count2['max_tl'].isna() == True)]
```

To check that all sizes in the original data set have made it through the absence population process, use:

```python
fish[fish['fish_tl'].isna() == False].shape[0] # 369146
fish[fish['min_tl'].isna() == False].shape[0] # 32823
fish[fish['max_tl'].isna() == False].shape[0] # 28194

full_fish[full_fish['fish_tl'].isna() == False].shape[0] # 369146
full_fish[full_fish['min_tl'].isna() == False].shape[0] # 32823
full_fish[full_fish['max_tl'].isna() == False].shape[0] # 28194
```

**Note** that the number of entries in fish_sizes will not be identical to these, because of the corrections made in the codeblock above according to Dan's directions.

### Save

In [48]:
## Save

fish_occ.to_csv('PISCO_occurrence_20210119.csv', index=False, na_rep='NaN')

## Create event file

The event file should contain: eventID (from site, survey_year, transect, level?), eventDate (from year, month, date), datasetID, locality (site), localityRemarks (maybe level and/or zone information), countryCode, decimalLat, decimalLon, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. Some notes might be eventRemarks. Should I include the campus information somewhere? Observer?

In [50]:
## Get unique eventIDs from occurrence file and their associated survey_dates

event = pd.DataFrame({'eventID':fish_occ['eventID'],
                     'eventDate':full_fish['survey_date'],
                     'institutionID':full_fish['campus'],
                     'locality':full_fish['site']})
event.drop_duplicates(inplace=True)

print(event.shape)
event.head()

(62099, 4)


Unnamed: 0,eventID,eventDate,institutionID,locality
0,HOPKINS_DC_19990907_INNER_BOT_1,19990907,UCSC,HOPKINS_DC
120,HOPKINS_DC_19990907_INNER_BOT_2,19990907,UCSC,HOPKINS_DC
241,HOPKINS_DC_19990907_INNER_CAN_1,19990907,UCSC,HOPKINS_DC
360,HOPKINS_DC_19990907_INNER_CAN_2,19990907,UCSC,HOPKINS_DC
480,HOPKINS_DC_19990907_INNER_MID_1,19990907,UCSC,HOPKINS_DC


To double check that all eventIDs are also in occurrence table:

```python
test = event.merge(fish_occ[['eventID', 'scientificName']], how='outer', on='eventID', indicator=True)
test[test['_merge'] != 'both']
```

In [51]:
## Format eventDate

formatted = [datetime.strptime(dt, '%Y%m%d').date().isoformat() for dt in event['eventDate']]
event['eventDate'] = formatted
event.head()

Unnamed: 0,eventID,eventDate,institutionID,locality
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,UCSC,HOPKINS_DC
120,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,UCSC,HOPKINS_DC
241,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,UCSC,HOPKINS_DC
360,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,UCSC,HOPKINS_DC
480,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,UCSC,HOPKINS_DC


In [52]:
## Dataset ID

event.insert(2, 'datasetID', 'PISCO fish transects')
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC
120,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC
241,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC
360,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC
480,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC


In [53]:
## Merge to obtain decimalLatitude and decimalLongitude

event = event.merge(site_summary, how='left', left_on='locality', right_on='site')
event.rename(columns = {'site_status':'locationRemarks', 'latitude':'decimalLatitude', 'longitude':'decimalLongitude'}, inplace=True)
event['locationRemarks'].replace({'mpa':'marine protected area'}, inplace=True)
event.drop(['site', 'site_name'], axis=1, inplace=True)
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
1,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
2,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
3,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
4,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196


In [54]:
## Add countryCode

event.insert(6, 'countryCode', 'US')
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
1,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
2,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
3,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
4,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196


In [55]:
## Add coordinateUncertainty in Meters

event['coordinateUncertaintyInMeters'] = 250

**Is this a reasonable coordinateUncertaintyInMeters?** Yes

In [56]:
## minimumDepthInMeters, maximumDepthInMeters

# Add eventID to fish
full_fish['eventID'] = eventID

# Groupby eventID to obtain depth column
depth = full_fish.groupby(['eventID']).agg({
    'depth':[min, max]
})
depth.reset_index(inplace=True)
depth.columns = depth.columns.droplevel()

# Add to event
event['minimumDepthInMeters'] = depth['min']
event['maximumDepthInMeters'] = depth['max']
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,10.5,10.5
1,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.5,9.5
2,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.5,9.5
3,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.0,9.0
4,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,,


**Note** that there are no duplicated depth measurements that I can see.

```python
any(fish_subset.groupby(['eventID'])['depth'].nunique() > 1)
```

In [57]:
## Add samplingProtocol and samplingEffort

# samplingProtocol
protocol = full_fish[['method', 'site_name', 'survey_date', 'zone', 'level', 'transect']].copy()
protocol.drop_duplicates(inplace=True)
protocol.reset_index(drop=True, inplace=True)
print(protocol.shape)
event['samplingProtocol'] = protocol['method']

# samplingEffort
event['samplingEffort'] = 'average of 12 minutes per transect'
print(event.shape)
event.head()

(62099, 6)
(62099, 14)


Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
0,HOPKINS_DC_19990907_INNER_BOT_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,10.5,10.5,SBTL_FISH_PISCO,average of 12 minutes per transect
1,HOPKINS_DC_19990907_INNER_BOT_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.5,9.5,SBTL_FISH_PISCO,average of 12 minutes per transect
2,HOPKINS_DC_19990907_INNER_CAN_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.5,9.5,SBTL_FISH_PISCO,average of 12 minutes per transect
3,HOPKINS_DC_19990907_INNER_CAN_2,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.0,9.0,SBTL_FISH_PISCO,average of 12 minutes per transect
4,HOPKINS_DC_19990907_INNER_MID_1,1999-09-07,PISCO fish transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,,,SBTL_FISH_PISCO,average of 12 minutes per transect


**Sampling effort?** Average of 12 minutes per transect

In [58]:
## Get vis, temp, surge, and pctcnpy for MoF

# Get relevant measurementValues
event_MoF_values = full_fish[['eventID', 'vis', 'temp', 'surge', 'pctcnpy']].copy()
event_MoF_values.drop_duplicates(inplace=True)
event_MoF_values.dropna(how='all', subset=['vis', 'temp', 'surge', 'pctcnpy'], inplace=True)

# vis
vis = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'visibility',
    'measurementValue':event_MoF_values['vis'],
    'measurementUnit':'meters',
    'measurementMethod':'Horizontal visibility on each transect, estimated by diver reeling in transect tape and noting the distance at which the end of the tape can first be seen.'
})

# temp
temp = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'temperature',
    'measurementValue':event_MoF_values['temp'],
    'measurementUnit':'degrees Celsius',
    'measurementMethod':"The temperature on each transect as measured by the diver's computer."
})

# surge
surge = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'surge',
    'measurementValue':event_MoF_values['surge'],
    'measurementUnit':np.NaN,
    'measurementMethod':"The diver's categorical estimation of the magnitude of horizontal displacement on each transect."
})

# pctcnpy
pct = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'percent canopy',
    'measurementValue':event_MoF_values['pctcnpy'],
    'measurementUnit':np.nan,
    'measurementMethod':"The diver's categorical estimation of the percent of the transect, by volume, that is occupied by kelp."
})

pct.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
25,HOPKINS_DC_19990907_INNER_BOT_1,,percent canopy,1.0,,The diver's categorical estimation of the perc...
145,HOPKINS_DC_19990907_INNER_BOT_2,,percent canopy,2.0,,The diver's categorical estimation of the perc...
249,HOPKINS_DC_19990907_INNER_CAN_1,,percent canopy,1.0,,The diver's categorical estimation of the perc...
368,HOPKINS_DC_19990907_INNER_CAN_2,,percent canopy,2.0,,The diver's categorical estimation of the perc...
558,HOPKINS_DC_19990907_INNER_MID_1,,percent canopy,1.0,,The diver's categorical estimation of the perc...


**Note** that for a given measurement, use the following to make sure all the data in fish made it to full fish, and then to the correct data frame:

```python
print('vis')
print(fish[(fish['classcode'] != 'NO_ORG') & (fish['vis'].isna() == False)].shape[0]) # 352676
print(full_fish[full_fish['vis'].isna() == False].shape[0]) # 352676

full_fish[full_fish['vis'].isna() == False].shape[0]
test = full_fish.loc[full_fish['vis'].isna() == False, ['eventID', 'vis']].drop_duplicates()
print(test.shape[0]) # 47324
print(vis[vis['measurementValue'].isna() == False].shape[0]) # 47324
```

In [61]:
## Change NaN in string fields to '' ---- NOTE, WHEN DAN PROVIDES INFORMATION FOR MISSING SITE, THIS WILL NO LONGER BE NEEDED

event['locationRemarks'] = event['locationRemarks'].replace(np.nan, '')
event.isna().sum()

eventID                              0
eventDate                            0
datasetID                            0
institutionID                        0
locality                             0
locationRemarks                      0
countryCode                          0
decimalLatitude                     15
decimalLongitude                    15
coordinateUncertaintyInMeters        0
minimumDepthInMeters             18492
maximumDepthInMeters             18492
samplingProtocol                     0
samplingEffort                       0
dtype: int64

In [62]:
## Save

event.to_csv('PISCO_event_20210119.csv', index=False, na_rep='NaN')

## Create MoF file

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Depth, vis, temp, surge and pctcnpy can be recorded at the event level. Fish_tl, min_tl and max_tl can be recorded at the occurrence level.

In [65]:
## Assemble fish_sizes data

# total length
tl_mof = pd.DataFrame({'eventID':fish_sizes['eventID'],
                      'occurrenceID':fish_sizes['occurrenceID'],
                      'measurementType':'length',
                      'measurementValue':fish_sizes['fish_tl'],
                      'measurementUnit':'centimeters',
                      'measurementMethod': 'The total length of an individual or group of individuals (of the same length), estimated visually to the nearest centimeter'})

# min length
min_mof = pd.DataFrame({'eventID':fish_sizes['eventID'],
                      'occurrenceID':fish_sizes['occurrenceID'],
                      'measurementType':'minimum length',
                      'measurementValue':fish_sizes['min_tl'],
                      'measurementUnit':'centimeters',
                      'measurementMethod': 'The minimum size recorded in a group of fish of the same species, estimated visually to the nearest centimeter'})

# max length
max_mof = pd.DataFrame({'eventID':fish_sizes['eventID'],
                      'occurrenceID':fish_sizes['occurrenceID'],
                      'measurementType':'maximum length',
                      'measurementValue':fish_sizes['max_tl'],
                      'measurementUnit':'centimeters',
                      'measurementMethod': 'The maximum size recorded in a group of fish of the same species, estimated visually to the nearest centimeter'})

In [66]:
## Concatenate dataframes

mof = pd.concat([vis, temp, surge, pct, tl_mof, min_mof, max_mof])
mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
25,HOPKINS_DC_19990907_INNER_BOT_1,,visibility,2.4,meters,"Horizontal visibility on each transect, estima..."
145,HOPKINS_DC_19990907_INNER_BOT_2,,visibility,2.4,meters,"Horizontal visibility on each transect, estima..."
249,HOPKINS_DC_19990907_INNER_CAN_1,,visibility,2.4,meters,"Horizontal visibility on each transect, estima..."
368,HOPKINS_DC_19990907_INNER_CAN_2,,visibility,2.4,meters,"Horizontal visibility on each transect, estima..."
558,HOPKINS_DC_19990907_INNER_MID_1,,visibility,2.4,meters,"Horizontal visibility on each transect, estima..."


In [67]:
## Drop missing measurements

print(mof.shape)
mof.dropna(subset=['measurementValue'], inplace=True)
mof.shape

(1304134, 6)


(558926, 6)

To check that all the data are still there, use:

```python
mof[(mof['measurementType'] == 'visibility')] # 47324
```

In [69]:
## Change NaN in string fields to ''

mof[['occurrenceID', 'measurementUnit']] = mof[['occurrenceID', 'measurementUnit']].replace(np.nan, '')
mof.isna().sum()

eventID              0
occurrenceID         0
measurementType      0
measurementValue     0
measurementUnit      0
measurementMethod    0
dtype: int64

In [70]:
## Save

mof.to_csv('PISCO_MoF_20210119.csv', index=False, na_rep='NaN')

## Questions

1. Does the "method" column actually relate to different survey methods? For example, here, method is always "FISH" and the only thing that's different is the campus information. **The method column does indicate some small changes in methodology, but these are not meaningful in the context of this data submission.**
2. I assume depths were estimated by dive computer? **Yes**
3. There are two sites in the site table that have no fish records: PISMO_W and SAL_E. **Someone decided these sites had invalid data. Dan will likely remove them from the site table.** The lat, lon for these sites are missing. The lat, lon for SCI_PELICAN_FAR_WEST are also missing. **Dan said he sent a new site table with the lat, lon for SCI_PELICAN_FAR_WEST, but it was not attached. NEED TO FOLLOW UP ON THIS.**
4. There are 172 unique classcodes under sample type "FISH", but only 166 of them are actually in the fish data set. Classcodes that do not appear in the data are: DMAC, HSPI, HSTE, MXEN, and RBIN. Where does this discrepancy come from? **These should be non-UCSC, non-UCSB codes. Likely codes from very rare observations by VRG. Dan may tidy these up when he submits the 2020 data.**
5. It seems like there are no absence records in this data, even though absence records could be formulated based on which species were looked for in a given year (as indicated by the species table). Is this correct? I draw this conclusion because the only record where count = NaN is eventID = PALO_COLORADO_20080904_OUTMID_BOT_2 (classcode = SACA). This record has been excluded for now, but it seems like it might be a data entry error? **Dropped is fine. It is correct that there are no absence records in this data set. However, Dan says that I should try to create the absence records using the looked for and not looked for information in the species table. He notes that he still has to generate a separate set of lines for campus=OSU, since these will be slightly different than UCSC. I'm not sure whether or how that will affect my analysis.**
6. There are 12430 records where the classcode is "NO_ORG." I assume these indicate empty transects, but as is, they are not useful as presence/absence data, and have been excluded. To make them useful, we would have to obtain whether or not the species was looked for during a given year, and add in 0 counts for the relevant transects, as discussed above. **Yes, NO_ORG indicates an empty transect. The looked for columns in the species table can be integrated with the observations to create absence records, as discussed above.**
7. Sometimes, NO_ORG observations occur in the same event as other observations. This does not make sense to me. **The results from some early transects were recorded in 10 m segments, and some segments had NO_ORG if no fish were observed. These records can be dropped.**
8. Should the unidentified fish (UNID) classcode match to Actinopterygii? Pisces? **Actinopterygii**
9. In the species table, the classcode OYT matches to two species definitions: Sebastes serranoides/flavidus for HSU, UCSB, UCSC and Sebastes serranoides for VRG. I've assumed this is a mistake. **This is OK.**
10. For data to be compatible with DwC standard, sex information must be entered using a controlled vocabulary. This vocabulary includes 'hermaphrodite' and 'indeterminate', but not something like 'transitional.' Do either of these terms seem reasonable? I can also potentially ask for the vocabulary to be expanded. **These fish are transitioning from female to male. They are sequential hermaphrodites that are protogynous. The designation 'indeterminate' might be reasonable. If it had to be simplified, say they're male, since they're about to be. Abby et al. suggested using hermaphrodite and clarifying in the metadata, since at the moment of observation, the fish has both male and female parts.**
11. There are 234652 records with count = 1. Of these, 2843 have fish_tl = min_tl and max_tl missing. **All of these records are from HSU. I can assume that the fish_tl value is correct.** 287 have min_tl and max_tl specified, with fish_tl being an average of these values. This is confusing; I would expect all records with count = 1 to have fish_tl specified, and min_tl and max_tl missing. Can you clarify? **When a range of sizes has been given and count=1, someone couldn't make up their mind on how big the fish was. Use the average, fish_tl, as the best estimate.**
12. Similarly, there are 134610 records with count > 1. Of these, 1786 have fish_tl = min_tl and max_tl missing. **These are from HSU too. Use fish_tl as the true length.** How do I interpret these records? 
13. coordinateUncertaintyInMeters? **250 is fine**
14. Is there a samplingEffort I can list for fish transects? Like a time goal or limit? **Average 12 minutes**
15. Can I assume sex/lifestage info in notes is accurate? **Yes, assume the notes are accurate. Assemble a file for Dan showing how I've extracted information from the notes. He has updated the data based on his file, and these changes will be reflected in the 2020 update.**

## Remaining Questions - 1/19/21

1. I'm still missing the lat, lon for SCI_PELICAN_FAR_WEST.
2. Absence records are now populated, but there are a collection of situations where the species was not looked for in the species table, but appears as present in the data. The data sets fish_counted_but_not_looked_for.csv and fish_counted_but_not_looked_for_summary.csv contain the details for these instances. **These are cryptic species. They were not thoroughly looked for, and so PISCO doesn't want to claim they know the true density of these species. Dan will remove for 2020 submission.** 
3. Conversely, while populating absence records, I observed that there are a couple entries missing in the species table (i.e., there are observations of a given classcode from a given campus, but that classcode is not listed in the species table for that campus.) I need to let Dan know about this, and hopefully he can fix it in the next update to DataONE. Until then, though, I've added the following manually to the species table:
    - RFYOY for UCSC, observations in 2000, 2003, 2011, 2013, 2014, and 2017
    - RFYOY for UCSB, observations in 2001, 2003, 2005
    - SCAL for UCSC, observations in 2003
    
**These have been fixed in the 2020 species table.** The updated table can be found here: https://docs.google.com/spreadsheets/d/1j8CbdEY4TR5KjEqny42EYlwqPRwTqvhl4zdp5j2aKfo/edit#gid=0

4. Dan mentioned that he hadn't generated a separate set of lines for campus=OSU, so this may need to be done over again once that's accomplished. **Don't worry about this - Dan just got his data sets confused, it doesn't matter for this submission.**
5. Before doing the final version of this data set, it might be worth re-visiting controlled vocabularies in the context of measurementType, measurementUnit, etc.

## Find number of years each MPA was surveyed

In [53]:
transects_per_survey = fish[['site', 'year', 'month', 'day', 'zone', 'level', 'transect']].drop_duplicates()
transects_per_survey = transects_per_survey.groupby(['site', 'year', 'month', 'day'], as_index=False)['transect'].count() # 1-48 (target is 36, I think, mode=24)
fish['date'] = fish['year'].astype(str) + '-' + fish['month'].astype(str) + '-' + fish['day'].astype(str)
surveys_per_year = fish[['site', 'year', 'date']].drop_duplicates()
surveys_per_year = surveys_per_year.groupby(['site', 'year'], as_index=False)['date'].nunique() # 1, 2, 3 or 4, mode = 1
sites_and_years = fish[['site', 'year']].drop_duplicates()
merged = sites_and_years.merge(sites.loc[sites['CA_MPA_Name_Short'].isna() == False, ['site', 'CA_MPA_Name_Short']], how='left', on='site')
merged = merged[merged['CA_MPA_Name_Short'].isna() == False]
num_years_per_site = merged.groupby(['CA_MPA_Name_Short', 'site'], as_index=False)['year'].nunique()
num_sites_per_mpa = merged.groupby('CA_MPA_Name_Short', as_index=False)['site'].nunique() # 2-19, mode=2
num_years_per_mpa = merged.groupby('CA_MPA_Name_Short', as_index=False)['year'].nunique()
num_years_per_mpa = num_years_per_mpa.sort_values('CA_MPA_Name_Short')
num_years_per_mpa.to_csv('pisco_fish_transects_years_per_mpa.csv', index=False)