# PISCO - fish transect data

The density of all conspicuous fishes (i.e. species whose adults are longer than 10 cm and visually detectable by SCUBA divers) is visually recorded along replicate transects that are 2 m wide by 2 m tall by 30 m long (120 m3).
- Transects are performed in 2-3 heights: bottom, mid-water and canopy
    - Bottom transects are always performed; a diver searches in cracks and crevices with a flashlight
    - Mid-water transects are always performed; a second diver surveys 120 m3 about 1/3 - 1/2 of the way up into the water column
    - Canopy transects are surveyed at a subset of sites, and are usually completed separately from the bottom and midwater transects; a diver swims 2m below the surface counting fishes in the top two meters of the water column
- Three 30 m long transects, distributed end-to-end and 5-10 m apart, are typically performed at each height, and at each of four depths:
    - 5m
    - 10m
    - 15m
    - 20m 
    - transects at the 25 m isobath are performed by VRG where habitat is available
- Survey depths may vary based on reef topography 
- Counts on mid-water and bottom transects are eventually combined, generating 12 replicate transects for each site. **Are these already combined in this data set? [No, doesn't look like it.]** **Note** that at sites with narrow kelp beds, particularly in parts of the Northern Channel Islands, only two depths are sampled, with four transects in each depth zone for a total of eight replicate transects
- Surveyors record:
    - The total length (TL) of each fish observed
    - Transect depth
    - Horizontal visibility along each transect (**must be at least 3 m to perform fish transects**)
    - Water temperature
    - Sea state (surge)
    - Percent of the transect volume occupied by kelp (PISCO only)
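
The transect geometry described above fixes the volume searched on each pass, which is what turns a raw count into a density. The lines below are a minimal sketch of that arithmetic; `fish_density` and the example count are hypothetical and are not part of the PISCO data or code.

```python
# Transect dimensions taken from the protocol description
WIDTH_M, HEIGHT_M, LENGTH_M = 2, 2, 30

transect_volume_m3 = WIDTH_M * HEIGHT_M * LENGTH_M  # volume searched per transect

# Three transects at each of four depths after bottom and mid-water are combined
replicates_per_site = 3 * 4


def fish_density(count, volume_m3=transect_volume_m3):
    """Return fish per cubic meter for one transect (hypothetical helper)."""
    return count / volume_m3


# Hypothetical example: 6 fish counted on a single transect
example_density = fish_density(6)
```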

**Resources**
- https://opc.dataone.org/view/MLPA_kelpforest.metadata.1

In [1]:
## Imports

import pandas as pd
import numpy as np
import random
import math

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "/Users/dianalg/PycharmProjects/PythonScripts/MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Load data

filename = 'MLPA_kelpforest_fish.4.csv'
fish = pd.read_csv(filename, dtype={'transect':str, 'sex':str, 'notes':str, 'site_name_old':str})

print(fish.shape)
fish.head()

(432297, 23)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,min_tl,max_tl,sex,observer,depth,vis,temp,surge,pctcnpy,notes
0,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,15.2,6.1,17.2,LIGHT,1.0,
1,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,4,...,,,,,15.2,6.1,17.2,LIGHT,2.0,
2,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,6,...,,,,,15.2,6.1,17.2,LIGHT,3.0,
3,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,CAN,3,...,,,,,15.2,6.1,17.2,,1.0,
4,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,MID,3,...,,,,,15.2,6.1,17.2,,1.0,


In [4]:
fish['year'].max()

2020

### Column definitions

**campus** = UCSC, UCSB, HSU or VRG. The academic partner campus that conducted the survey. <br>
**method** = SBTL_FISH_PISCO, SBTL_FISH_CRANE, SBTL_FISH_HSU or SBTL_FISH_VRG. The code describing the sampling technique and monitoring program that conducted each survey. **How is this different from the previous column? Does it actually indicate further methodological differences?** <br>
**survey_year** = 1999 - 2018. The designated year associated with the annual survey. In most cases, survey_year and year are the same. In rare cases, surveys are completed early in the year following the designated survey year. In these cases, survey_year will differ from year. <br>
**year** = 1999 - 2020. Year that the survey was conducted. <br>
**month** = 1 - 12. Month that the survey was conducted. <br>
**day** = 1 - 31. Day that the survey was conducted. <br>
**site** = One of 380 site codes. The unique site where the survey was performed (as defined in the site table). This site refers to a specific GPS location and is often associated with a geographic placename. Often, multiple site replicates will be associated with a single placename, and will be delineated with additional geographical or directional information (e.g. North/South/East/West/Central - N/S/E/W/CEN, Upcoast/Downcoast - UC/DC) <br>
**zone** = INNER, OUTER, OUTMID, INMID, MID or DEEP. A division of the site into 2 to 4 categories representing onshore-offshore stratification associated with targeted bottom depths for transects.
- INNER: Depth zone targeting roughly 5m of water depth, or the inner edge of the reef
- INMID: Depth zone targeting roughly 10m of water depth 
- MID: Depth zone targeting roughly 10-15m of water depth, used by VRG and in early years of PISCO
- OUTMID: Depth zone targeting roughly 15m of water depth 
- OUTER: Depth zone targeting roughly 20m of water depth 
- DEEP: Depth zone targeting roughly 25m of water depth, where present, used only by VRG

**level** = BOT, CAN, MID or CNMD. The vertical placement of the transect within the water column. Includes BOT: bottom transects placed at the seafloor, MID: midwater transects placed at roughly half the depth of the seafloor, and CAN: canopy transects placed at the surface to survey the top two meters of the water column and kelp canopy. CNMD is used when an inner transect is too shallow to allow both canopy and midwater transects without overlapping (applies to UCSB and VRG only) <br>
**transect** = It seems like this should only be 1 - 12, but there are a number of other designations as well. The unique transect replicate within each site, zone, and level. <br>
**classcode** = One of 166 taxon codes. The unique taxonomic classification code that is being counted, as defined in the taxonomic table. This refers to a code that defines the Genus and Species that is identified, a code that represents a limited number of species that can't be narrowed down to one species, or in some cases family-level or higher order groupings. Generally, for fishes, the classcode is comprised of the first letter of the genus, and the first three letters of the species, with some exceptions <br>
**count** = The number of individuals of a given classcode of a given size per transect <br>
**fish_tl** = The total length of an individual or group of individuals (of the same length) OR the average total length for a group of fish where a range in lengths is specified (rounded to the nearest cm) <br>
**min_tl** = The minimum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species <br>
**max_tl** = The maximum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species <br>
**sex** = MALE, FEMALE, TRANSITIONAL, JUVENILE or 'nan'. The sex classification for sexually dimorphic species where sex can be distinguished visually and is recorded. For some species, individuals with juvenile markings are also indicated here. The TRANSITIONAL class is used for fish with external morphological features consistent with both male and female (applies to sex changing fishes such as California sheephead). JUVENILE is not always indicated when a juvenile fish is observed. <br>
**observer** = The diver who conducted the survey transect <br>
**depth** = Between 0.2 and 33.4 meters. Depth of the transect estimated by the diver. **Does this mean a dive computer was used?** <br>
**vis** = Between 1 and 35 meters. The diver's estimation of horizontal visibility on each transect. Measured by reeling in the transect tape and noting the distance at which the end of the tape can first be seen. <br>
**temp** = Between 7 and 25.6 degrees C. The temperature on each transect measured by the diver's computer. <br>
**surge** = HIGH, MODERATE, LIGHT or 'nan'. The diver's estimation of magnitude of horizontal displacement on each transect.
- LIGHT: No significant surge
- MODERATE: Surge causing noticeable lateral movement, diver must compensate
- HIGH: Significant surge, diver moved out of transect bounds when not holding on

**pctcnpy** = 0 - 3 or NaN. The diver's estimation of the percent of the transect, by volume, that is occupied by kelp. This estimation is specific to the level of the transect that is being surveyed (i.e. excluding canopy transects, this is not an estimation of surface canopy but of the amount of kelp within the transect at the specified level). **I believe this measure was only recorded by PISCO.**
- 0: 0% of transect volume occupied by kelp
- 1: 1-33% of transect volume occupied by kelp
- 2: 34-66% of transect volume occupied by kelp
- 3: 67-100% of transect volume occupied by kelp

**notes** = Free text notes taken at the time of the sample, or added at the time of data entry. <br>
**site_name_old** = In cases when specific sites have been surveyed by multiple campuses using different site names, this variable indicates the alternative (historical) site name.

### Strategy

As with the RCCA data, each transect can be an **event** and each fish observation can be an **occurrence**. There are both event-level and occurrence-level measurements, necessitating event and MoF files. 

The **event** file should contain: eventID (from site, survey_year, transect, level?), eventDate (from year, month, day), datasetID, locality (site), localityRemarks (maybe level and/or zone information), countryCode, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. Some notes might be eventRemarks. Should I include the campus information somewhere? Observer?

The **occurrence** file should contain: eventID, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe some notes), sex (sex), lifeStage (sex), organismQuantity (count), organismQuantityType.

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Depth, vis, temp, surge and pctcnpy can be recorded at the event level. Fish_tl, min_tl and max_tl can be recorded at the occurrence level.
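
To make the MoF layout concrete, here is a minimal sketch of how event-level columns could be reshaped into one row per measurement. The event rows, the `units` mapping and the unit strings are assumptions for illustration only; the real file would also need occurrenceID and measurementMethod.

```python
import pandas as pd

# Hypothetical event-level rows; column names follow the fish table
events = pd.DataFrame({
    'eventID': ['SITE_A_19990930_INNER_BOT_3', 'SITE_A_19990930_INNER_MID_3'],
    'depth': [15.2, 15.2],
    'vis': [6.1, 6.1],
    'temp': [17.2, 17.2],
})

# Assumed unit labels, chosen for illustration
units = {'depth': 'meters', 'vis': 'meters', 'temp': 'degrees Celsius'}

# One row per event per measurement
mof = events.melt(id_vars='eventID',
                  var_name='measurementType',
                  value_name='measurementValue')
mof = mof.dropna(subset=['measurementValue'])  # skip measurements that were not taken
mof['measurementUnit'] = mof['measurementType'].map(units)
```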

## Create occurrence file

### Get site names

In [5]:
## Load site table

filename = 'MLPA_kelpforest_site_table.4.csv'
sites = pd.read_csv(filename)

print(sites.shape)
sites.head()

(8138, 14)


Unnamed: 0,LTM_project_short_code,campus,method,survey_year,site,latitude,longitude,CA_MPA_Name_Short,site_designation,site_status,Secondary_MPA_Name,Secondary_site_designation,BaselineRegion,LongTermRegion
0,LTM_Kelp_SRock,VRG,SBTL_SIZEFREQ_VRG,2008,3_PALMS_EAST,33.718105,-118.3326,Abalone Cove SMCA,SMCA,reference,,,SOUTH,South Coast
1,LTM_Kelp_SRock,VRG,SBTL_FISH_VRG,2008,3_PALMS_EAST,33.718105,-118.3326,Abalone Cove SMCA,SMCA,reference,,,SOUTH,South Coast
2,LTM_Kelp_SRock,VRG,SBTL_SWATH_VRG,2008,3_PALMS_EAST,33.718105,-118.3326,Abalone Cove SMCA,SMCA,reference,,,SOUTH,South Coast
3,LTM_Kelp_SRock,VRG,SBTL_UPC_VRG,2008,3_PALMS_EAST,33.718105,-118.3326,Abalone Cove SMCA,SMCA,reference,,,SOUTH,South Coast
4,LTM_Kelp_SRock,VRG,SBTL_UPC_VRG,2010,ABALONE_COVE_KELP_W,33.73922,-118.38789,Abalone Cove SMCA,SMCA,mpa,,,SOUTH,South Coast


Since, for the purpose of DwC, we're not interested in which sites were sampled when, I can simplify the site table to only contain relevant information: campus, site, latitude, longitude, and site status. 

In [6]:
## Create simplified site table

site_summary = sites[['campus', 'site', 'site_status', 'latitude', 'longitude']].copy()
site_summary.drop_duplicates(inplace=True)

print(site_summary.shape)
site_summary.head()

(404, 5)


Unnamed: 0,campus,site,site_status,latitude,longitude
0,VRG,3_PALMS_EAST,reference,33.718105,-118.3326
4,VRG,ABALONE_COVE_KELP_W,mpa,33.73922,-118.38789
47,HSU,ABALONE_POINT_1,reference,39.6915,-123.8141
64,HSU,ABALONE_POINT_2,reference,39.66502,-123.80435
81,HSU,ABALONE_POINT_3,reference,39.62877,-123.79658
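
Because the simplified table is later joined to the fish data on campus and site, it may be worth confirming that each campus/site pair appears only once; a pair with two different statuses or coordinates would duplicate fish rows in the merge. A minimal sketch of that check on made-up rows (`toy_sites` is hypothetical):

```python
import pandas as pd

# Hypothetical simplified site table with one conflicting campus/site pair
toy_sites = pd.DataFrame({
    'campus': ['VRG', 'VRG', 'HSU'],
    'site': ['SITE_A', 'SITE_A', 'SITE_B'],
    'site_status': ['reference', 'mpa', 'reference'],
    'latitude': [33.7, 33.7, 39.6],
    'longitude': [-118.3, -118.3, -123.8],
})

# keep=False flags every row that shares a campus/site key with another row
conflicts = toy_sites[toy_sites.duplicated(subset=['campus', 'site'], keep=False)]
```

An empty `conflicts` table would mean the later merge cannot multiply rows.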


### Get species table

In [7]:
## Load species table

filename = 'MLPA_kelpforest_taxon_table.4.csv'
species = pd.read_csv(filename)

print(species.shape)
species.head()

(1221, 40)


Unnamed: 0,campus,sample_type,sample_subtype,classcode,orig_classcode,Kingdom,Phylum,Class,Order,Family,...,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018,LOOKED2019,LOOKED2020
0,HSU,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,no,no,no,yes,yes,no,yes,yes,yes,yes
1,UCSB,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,VRG,FISH,FISH,AARG,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
3,HSU,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no
4,UCSB,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no


The subset of the species table that's currently relevant is entries with sample_type = 'FISH'. **Note** that there are 173 unique classcodes under this sample type, only 146 of which are actually in the fish data set. I'm not sure that this is a problem; it's possible that some species have been looked for, but never seen, and therefore don't appear in the presence-only data. 


```python
# Unique fish classcodes in the species table
table_codes = set(species.loc[(species['sample_type'] == 'FISH') &
                              (species['sample_subtype'] == 'FISH'), 'classcode'].dropna())
print(len(table_codes))

# Unique classcodes in the fish data
data_codes = set(fish['classcode'].dropna())
print(len(data_codes))

# Classcodes that appear in the species table but not in the data
for sp in sorted(table_codes - data_codes):
    print(sp)
```

In [8]:
## Select species for fish surveys

fish_sp = species.loc[species['sample_type'] == 'FISH',  
                      ['campus', 'classcode', 'species_definition', 'common_name', 'LOOKED1999',
                       'LOOKED2000', 'LOOKED2001', 'LOOKED2002', 'LOOKED2003', 'LOOKED2004',
                       'LOOKED2005', 'LOOKED2006', 'LOOKED2007', 'LOOKED2008', 'LOOKED2009',
                       'LOOKED2010', 'LOOKED2011', 'LOOKED2012', 'LOOKED2013', 'LOOKED2014',
                       'LOOKED2015', 'LOOKED2016', 'LOOKED2017', 'LOOKED2018', 'LOOKED2019',
                       'LOOKED2020']]

print(fish_sp.shape)
fish_sp.head()

(530, 26)


Unnamed: 0,campus,classcode,species_definition,common_name,LOOKED1999,LOOKED2000,LOOKED2001,LOOKED2002,LOOKED2003,LOOKED2004,...,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018,LOOKED2019,LOOKED2020
0,HSU,AARG,Amphistichus argenteus,Barred Surfperch,no,no,no,no,no,no,...,no,no,no,yes,yes,no,yes,yes,yes,yes
1,UCSB,AARG,Amphistichus argenteus,Barred Surfperch,yes,yes,yes,yes,yes,yes,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,VRG,AARG,Amphistichus argenteus,Barred Surfperch,no,no,no,no,no,yes,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
3,HSU,ACOR,Artedius corallinus,Coralline Sculpin,no,no,no,no,no,no,...,no,no,no,no,no,no,no,no,no,no
4,UCSB,ACOR,Artedius corallinus,Coralline Sculpin,no,no,no,no,no,no,...,no,no,no,no,no,no,no,no,no,no


In [9]:
## Melt species table

long = pd.melt(fish_sp, id_vars=fish_sp.columns[0:4], var_name='year', value_name='looked')
print(long.shape)
long.head()

(11660, 6)


Unnamed: 0,campus,classcode,species_definition,common_name,year,looked
0,HSU,AARG,Amphistichus argenteus,Barred Surfperch,LOOKED1999,no
1,UCSB,AARG,Amphistichus argenteus,Barred Surfperch,LOOKED1999,yes
2,VRG,AARG,Amphistichus argenteus,Barred Surfperch,LOOKED1999,no
3,HSU,ACOR,Artedius corallinus,Coralline Sculpin,LOOKED1999,no
4,UCSB,ACOR,Artedius corallinus,Coralline Sculpin,LOOKED1999,no


In [10]:
## Convert year labels (e.g. 'LOOKED1999') to integer years

long['year'] = long['year'].str.split('D').str[1].astype(int)
long['year'].unique()

array([1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
       2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])
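
The melt and year conversion above can be tried on a small example. This is a sketch on made-up rows (`toy_sp` is hypothetical); it uses `str.extract` with a four-digit pattern as an alternative to splitting on 'D', which is an assumption about what is more robust rather than a change to the approach.

```python
import pandas as pd

# Hypothetical wide species table with two LOOKED columns
toy_sp = pd.DataFrame({
    'campus': ['HSU', 'UCSB'],
    'classcode': ['AARG', 'AARG'],
    'LOOKED1999': ['no', 'yes'],
    'LOOKED2000': ['no', 'yes'],
})

# Wide to long: one row per campus, classcode and year label
toy_long = toy_sp.melt(id_vars=['campus', 'classcode'],
                       var_name='year', value_name='looked')

# Pull the four-digit year out of labels such as 'LOOKED1999'
toy_long['year'] = toy_long['year'].str.extract(r'(\d{4})', expand=False).astype(int)
```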

### Fill in absence records in fish

In [11]:
## Check if there are any records where count data is missing

fish[fish['count'].isna()]

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,min_tl,max_tl,sex,observer,depth,vis,temp,surge,pctcnpy,notes


In [12]:
## Drop these records 

print(fish.shape)
fish.dropna(subset=['count'], inplace=True)
fish.shape

(432297, 23)


(432297, 23)

In [13]:
## Check whether any campus/species/year combinations were found in the data despite being marked as not looked for in the species table

# Get a dict of campus/species combinations for years where the species wasn't looked for
species_not_looked_for = long[long['looked'] == 'no'].copy()
species_not_looked_for_dict = dict(zip(species_not_looked_for['campus'] + '-' + 
                                       species_not_looked_for['classcode'] + '-' + 
                                       species_not_looked_for['year'].astype(str), 
                                       species_not_looked_for['looked']))

# Get a dict of campus/species combinations for years they were actually found
species_found = dict(zip(fish['campus'] + '-' + fish['classcode'] + '-' + fish['survey_year'].astype(str), ['yes']*fish.shape[0]))

# Identify whether species that ostensibly weren't looked for were actually found
found_but_not_looked_for = []
for key in species_found.keys():
    if key in species_not_looked_for_dict.keys():
        found_but_not_looked_for.append(key)
if len(found_but_not_looked_for) > 0:
    print('There were {x} campus/species/year combinations that were found but not looked for.'.format(x=len(found_but_not_looked_for)))

There are no campus/species combinations in the data that do not occur in the species table.
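
The same comparison can be written with sets of (campus, classcode, year) tuples instead of string-keyed dicts. This is a minimal sketch on made-up rows; `toy_long` and `toy_fish` are hypothetical stand-ins for the melted species table and the fish data.

```python
import pandas as pd

toy_long = pd.DataFrame({
    'campus': ['UCSB', 'UCSB', 'HSU'],
    'classcode': ['AARG', 'ACOR', 'AARG'],
    'year': [1999, 1999, 1999],
    'looked': ['yes', 'no', 'no'],
})
toy_fish = pd.DataFrame({
    'campus': ['UCSB', 'UCSB'],
    'classcode': ['AARG', 'ACOR'],
    'survey_year': [1999, 1999],
})

# Combinations flagged as not looked for
not_looked = toy_long.loc[toy_long['looked'] == 'no', ['campus', 'classcode', 'year']]
not_looked_keys = set(not_looked.itertuples(index=False, name=None))

# Combinations actually observed
found_keys = set(
    toy_fish[['campus', 'classcode', 'survey_year']].itertuples(index=False, name=None)
)

# Observed despite being flagged as not looked for
found_but_not_looked_for = found_keys & not_looked_keys
```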

In [14]:
## Get a table telling whether each fish was looked for during each specific transect

survey_table = fish[['campus', 
                     'method', 
                     'day', 
                     'month', 
                     'survey_year', 
                     'year', 
                     'site', 
                     'zone', 
                     'level', 
                     'transect']].merge(long[['campus', 
                                              'classcode', 
                                              'year', 
                                              'looked']], 
                                        how='left', 
                                        left_on=['campus', 'survey_year'],
                                        right_on=['campus', 'year'])
survey_table.drop_duplicates(inplace=True)
survey_table.rename(columns={'year_x':'year'}, inplace=True) # year_x retains actual year when survey took place
survey_table.drop(columns=['year_y'], inplace=True) # year_y == survey_year because of the merge
print(survey_table.shape)
survey_table.head()

(10138124, 12)


Unnamed: 0,campus,method,day,month,survey_year,year,site,zone,level,transect,classcode,looked
0,UCSB,SBTL_FISH_PISCO,30,9,1999,1999,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,AARG,yes
1,UCSB,SBTL_FISH_PISCO,30,9,1999,1999,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,ACOR,no
2,UCSB,SBTL_FISH_PISCO,30,9,1999,1999,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,ADAV,yes
3,UCSB,SBTL_FISH_PISCO,30,9,1999,1999,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,AFLA,yes
4,UCSB,SBTL_FISH_PISCO,30,9,1999,1999,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,AGUA,yes


In [20]:
## Merge with fish data to get final outcome

full_fish = fish.merge(survey_table, 
                             how='right', 
                             on=['campus', 
                                 'method', 
                                 'day', 
                                 'month', 
                                 'year', 
                                 'survey_year', 
                                 'site', 
                                 'zone', 
                                 'level', 
                                 'transect', 
                                 'classcode'])
print(full_fish.shape)
full_fish.head()

(10341441, 24)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,max_tl,sex,observer,depth,vis,temp,surge,pctcnpy,notes,looked
0,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,,,,yes
1,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,,,,no
2,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,,,,yes
3,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,,,,yes
4,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,,,,yes


In [21]:
## Clean

full_fish = full_fish[full_fish['classcode'] != 'NO_ORG'].copy()
full_fish.loc[(full_fish['looked'] == 'yes') & full_fish['count'].isna(), 'count'] = 0
full_fish.dropna(subset=['count'], inplace=True)
print(full_fish.shape)
full_fish.head()

(8863192, 24)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,max_tl,sex,observer,depth,vis,temp,surge,pctcnpy,notes,looked
0,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,,,,yes
2,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,,,,yes
3,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,,,,yes
4,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,,,,yes
6,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,,,,yes


You can double check that there are no records where count > 0 but looked = no. 

```python
# Get records
check = full_fish[(full_fish['count'] > 0) & (full_fish['looked'] == 'no')]

# Get table of campuses and years where there were observations for classcodes that were not looked for according to the species table
obs_exist = check[['campus', 'survey_year', 'classcode']].copy()
obs_exist.drop_duplicates(inplace=True)
obs_exist.head()
```
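
The absence-filling steps above are easier to follow on a tiny example. The sketch below uses made-up rows (`obs` and `looked` are hypothetical) and the same logic: pair every transect with every species listed for that campus and year, right-merge the observations, set missing counts to 0 where the species was looked for, and drop the rest.

```python
import pandas as pd

# Hypothetical observations: one species seen on one transect
obs = pd.DataFrame({
    'campus': ['UCSB'], 'survey_year': [1999], 'transect': ['3'],
    'classcode': ['AARG'], 'count': [4],
})
# Hypothetical species list: two species looked for, one not
looked = pd.DataFrame({
    'campus': ['UCSB', 'UCSB', 'UCSB'], 'year': [1999, 1999, 1999],
    'classcode': ['AARG', 'ADAV', 'ACOR'], 'looked': ['yes', 'yes', 'no'],
})

# Every transect paired with every species listed for that campus and year
surveys = (obs[['campus', 'survey_year', 'transect']]
           .drop_duplicates()
           .merge(looked, left_on=['campus', 'survey_year'], right_on=['campus', 'year'])
           .drop(columns='year'))

# Right merge keeps species that were never observed on the transect
full = obs.merge(surveys, how='right',
                 on=['campus', 'survey_year', 'transect', 'classcode'])

# Looked for but not seen becomes a zero count; not looked for is dropped
full.loc[(full['looked'] == 'yes') & full['count'].isna(), 'count'] = 0
full = full.dropna(subset=['count'])
```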

### Convert

**Note** that the site "Swami's" has not been changed in the fish data to "SWAMIS" to match the site table. I'll have to do this by hand now, and also let Dan know about it.

In [22]:
## Fix Swami's ------------- THIS STEP CAN BE REMOVED AFTER DATA ARE UPDATED

full_fish.loc[full_fish['site'] == "Swami's", 'site'] = 'SWAMIS'
"Swami's" in full_fish['site'].unique()

False

In [23]:
## Merge to add site_name (also lat, lon and site_status) to fish table

full_fish = full_fish.merge(site_summary, how='left', on=['campus', 'site'])
print(full_fish.shape)
full_fish.head()

(8863192, 27)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,depth,vis,temp,surge,pctcnpy,notes,looked,site_status,latitude,longitude
0,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,yes,reference,34.002883,-119.4252
1,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,yes,reference,34.002883,-119.4252
2,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,yes,reference,34.002883,-119.4252
3,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,yes,reference,34.002883,-119.4252
4,UCSB,SBTL_FISH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_CEN,INNER,BOT,3,...,,,,,,,yes,reference,34.002883,-119.4252


In [26]:
## Create eventID

# Get paddedMonth and paddedDay columns
full_fish['paddedMonth'] = full_fish['month'].astype(str).str.pad(
    width=2,
    side='left',
    fillchar='0'
)
full_fish['paddedDay'] = full_fish['day'].astype(str).str.pad(
    width=2,
    side='left',
    fillchar='0'
)

# Create eventID
full_fish['eventID'] = full_fish['site'] + '_' + \
                       full_fish['year'].astype(str) + \
                       full_fish['paddedMonth'] + \
                       full_fish['paddedDay'] + '_' + \
                       full_fish['zone'] + '_' + \
                       full_fish['level'] + '_' + \
                       full_fish['transect'].str.replace(' ', '')
fish_occ = pd.DataFrame({'eventID':full_fish['eventID'].copy()})

fish_occ.head()

Unnamed: 0,eventID
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3
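
For reference, the same eventID pattern can be built with `str.zfill`, which pads with leading zeros just as `str.pad(width=2, side='left', fillchar='0')` does. A minimal sketch on one made-up row (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'site': ['ANACAPA_ADMIRALS_CEN'], 'year': [1999], 'month': [9], 'day': [30],
    'zone': ['INNER'], 'level': ['BOT'], 'transect': ['3'],
})

# Zero-padded YYYYMMDD string
date_str = (toy['year'].astype(str)
            + toy['month'].astype(str).str.zfill(2)
            + toy['day'].astype(str).str.zfill(2))

# site_date_zone_level_transect, with any spaces removed from the transect label
toy['eventID'] = (toy['site'] + '_' + date_str + '_' + toy['zone'] + '_'
                  + toy['level'] + '_'
                  + toy['transect'].str.replace(' ', '', regex=False))
```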


In [27]:
## Add occurrenceID

# Create survey_date column in fish
full_fish['survey_date'] = full_fish['year'].astype(str) + full_fish['paddedMonth'] + full_fish['paddedDay']

# Groupby to create occurrenceID
fish_occ['occurrenceID'] = full_fish.groupby(['site', 'survey_date', 'zone', 'level', 'transect'])['classcode'].cumcount()+1
fish_occ['occurrenceID'] = fish_occ['eventID'] + '_occ' + fish_occ['occurrenceID'].astype(str)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5
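
The numbering relies on `groupby(...).cumcount()`, which counts rows within each group in their existing order, starting at 0. A minimal sketch with made-up event IDs (`toy` is hypothetical) shows that the resulting IDs are unique even when events are interleaved:

```python
import pandas as pd

toy = pd.DataFrame({
    'eventID': ['E1', 'E1', 'E2', 'E1'],
    'classcode': ['AARG', 'ADAV', 'AARG', 'AFLA'],
})

# Running row number within each event, shifted to start at 1
within_event = toy.groupby('eventID').cumcount() + 1
toy['occurrenceID'] = toy['eventID'] + '_occ' + within_event.astype(str)
```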


In [28]:
## Map classcodes to species definitions (usually scientific names) and classcodes to common names

code_to_sci_dict = dict(zip(fish_sp['classcode'], fish_sp['species_definition']))
code_to_com_dict = dict(zip(fish_sp['classcode'], fish_sp['common_name']))

In [29]:
## Create scientificName and vernacularName columns

# Assign the result back rather than calling replace(inplace=True) on a column selection
fish_occ['vernacularName'] = full_fish['classcode'].replace(code_to_com_dict)

fish_occ['scientificName'] = full_fish['classcode'].replace(code_to_sci_dict)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1,Barred Surfperch,Amphistichus argenteus
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2,Sargo,Anisotremus davidsonii
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3,Tubesnout,Aulorhynchus flavidus
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4,Guadalupe Cardinalfish,Apogon guadalupensis
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5,Smooth Alligatorfish,Anoplagonus inermis


In [30]:
## Get unique scientific names for lookup in WoRMS

names = fish_occ['scientificName'].unique()

**Note** that there are a number of names that are not specific at the species level:
- Sebastes atrovirens/carnatus/chrysomelas/caurinus (matched to Sebastes; Kelp, Gopher, Black and Yellow, and Copper Rockfish YoY)
- Sebastes chrysomelas/carnatus (matched to Sebastes; Gopher and Black and Yellow Rockfish YoY)
- Synchirus/Rimicola (Manacled sculpin or kelp clingfish, SYRI) --> This should mean either Synchirus spp. or Rimicola spp. These are both from class **Actinopterygii**
- Sebastes serranoides/flavidus/melanops (matched to Sebastes; Olive, Yellowtail and Black Rockfish YoY)
- Sebastes carnatus/caurinus (matched to Sebastes; Gopher, Copper Rockfish YoY)
- Sebastes serranoides/flavidus (matched to Sebastes; Olive and Yellowtail Rockfish YoY)

There are also some descriptions that lack a scientific name:
- No Organisms Present In This Sample (NO_ORG) --> **This classcode has been dealt with during the absence record population process.**
- Unidentified Fish (UNID) --> **I've matched this to Actinopterygii.**

Species with multiple common names:
- Atherinopsidae --> Grunion, Topsmelt Or Jacksmelt
- Clupeiformes --> Bait, Sardines/Anchovies (BAITBALL)
- Clinidae --> Kelpfishes And Fringeheads
- Clupeiformes --> Sardines And Anchovies (CLUP)
- Lethops connectens --> Kelp Goby, Halfblind Goby
- Scomber japonicus --> Pacific Mackerel, Greenback Mackerel
- Thaleichthys pacificus --> Candlefish, eulachon

Other classifications to be aware of:
- Hexagrammos --> Unidentified Hexagrammos
- Sebastes --> Rockfish, Unidentified Sp. (SEBSPP)
- Sebastes --> Rockfish Young Of The Year, Unidentified Sp (RFYOY)
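
The slash-separated Sebastes names follow a regular pattern (one genus, several species epithets), so the genus and a readable list of candidate species could be derived programmatically. This is a minimal sketch; `split_ambiguous_name` is a hypothetical helper, and names without a shared genus (such as Synchirus/Rimicola) are deliberately left unchanged.

```python
def split_ambiguous_name(name):
    """Return (name_to_match, remark) for names like 'Sebastes chrysomelas/carnatus'.

    Hypothetical helper: names without a slash, or without a genus followed
    by epithets, are returned unchanged with no remark.
    """
    if '/' not in name or ' ' not in name:
        return name, None
    genus, epithets = name.split(' ', 1)
    candidates = [f'{genus} {epithet}' for epithet in epithets.split('/')]
    # Join as 'A, B or C'
    remark = ', '.join(candidates[:-1]) + ' or ' + candidates[-1]
    return genus, remark


two = split_ambiguous_name('Sebastes chrysomelas/carnatus')
three = split_ambiguous_name('Sebastes serranoides/flavidus/melanops')
plain = split_ambiguous_name('Amphistichus argenteus')
no_genus = split_ambiguous_name('Synchirus/Rimicola')
```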

In [31]:
## Make changes based on the above observations

fish_occ.loc[fish_occ['scientificName'] == 'Synchirus/Rimicola', 'scientificName'] = 'Actinopterygii'
fish_occ.loc[fish_occ['scientificName'] == 'Unidentified Fish', 'scientificName'] = 'Actinopterygii'

# Redefine names
names = fish_occ['scientificName'].unique()

In [32]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Sebastes chrysomelas/carnatus checking:  Sebastes
Url didn't work for Sebastes atrovirens/carnatus/chrysomelas/caurinus checking:  Sebastes
Url didn't work for Sebastes serranoides/flavidus/melanops checking:  Sebastes
Url didn't work for Sebastes serranoides/flavidus checking:  Sebastes
Url didn't work for Sebastes carnatus/caurinus checking:  Sebastes
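
The failed lookups above all involve names containing a slash, which is plausible for a name placed directly in a URL path. How my `WoRMS` module builds its requests is not shown here, so the following is only a sketch under assumptions: it uses the public WoRMS REST endpoint `AphiaRecordsByName`, percent-encodes the name, and rebuilds the LSID format seen in the output below. Both function names are hypothetical, and no request is actually sent.

```python
from urllib.parse import quote

WORMS_REST = 'https://www.marinespecies.org/rest'


def records_by_name_url(name):
    """Build an AphiaRecordsByName URL with the name safely percent-encoded."""
    return (f"{WORMS_REST}/AphiaRecordsByName/{quote(name, safe='')}"
            "?like=false&marine_only=true")


def lsid_from_aphia_id(aphia_id):
    """Format an AphiaID as the LSID used for scientificNameID."""
    return f'urn:lsid:marinespecies.org:taxname:{aphia_id}'


genus_url = records_by_name_url('Sebastes')
slash_url = records_by_name_url('Sebastes carnatus/caurinus')
example_lsid = lsid_from_aphia_id(279594)
```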


In [33]:
## Add scientific name-related columns

# Assign the result back rather than calling replace(inplace=True) on a column selection
fish_occ['scientificNameID'] = fish_occ['scientificName'].replace(name_id_dict)

fish_occ['taxonID'] = fish_occ['scientificName'].replace(name_taxid_dict)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1,Barred Surfperch,Amphistichus argenteus,urn:lsid:marinespecies.org:taxname:279594,279594
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4,Guadalupe Cardinalfish,Apogon guadalupensis,urn:lsid:marinespecies.org:taxname:273016,273016
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5,Smooth Alligatorfish,Anoplagonus inermis,urn:lsid:marinespecies.org:taxname:279630,279630


In [34]:
## Create identificationRemarks

remarks_dict = {'Sebastes serranoides/flavidus':'Sebastes serranoides or Sebastes flavidus',
                'Sebastes atrovirens/carnatus/chrysomelas/caurinus':'Sebastes atrovirens, Sebastes carnatus, Sebastes chrysomelas or Sebastes caurinus',
                'Sebastes chrysomelas/carnatus':'Sebastes chrysomelas or Sebastes carnatus',
                'Sebastes serranoides/flavidus/melanops':'Sebastes serranoides, Sebastes flavidus or Sebastes melanops',
                'Sebastes carnatus/caurinus':'Sebastes carnatus or Sebastes caurinus'}

identificationRemarks = [remarks_dict[name] if name in remarks_dict.keys() else np.nan for name in fish_occ['scientificName']]

In [35]:
## Replace scientificName using name_name_dict

fish_occ['scientificName'] = fish_occ['scientificName'].replace(name_name_dict)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1,Barred Surfperch,Amphistichus argenteus,urn:lsid:marinespecies.org:taxname:279594,279594
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4,Guadalupe Cardinalfish,Apogon guadalupensis,urn:lsid:marinespecies.org:taxname:273016,273016
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5,Smooth Alligatorfish,Anoplagonus inermis,urn:lsid:marinespecies.org:taxname:279630,279630


In [36]:
## Add final name-related columns

fish_occ['nameAccordingTo'] = 'WoRMS'
fish_occ['occurrenceStatus'] = 'present'
fish_occ['basisOfRecord'] = 'HumanObservation'
fish_occ['identificationRemarks'] = identificationRemarks

# Add identificationRemarks for Synchirus/Rimicola
fish_occ.loc[fish_occ['vernacularName'] == 'Manacled Sculpin/Kelp Clingfish', 'identificationRemarks'] = 'Synchirus spp. or Rimicola spp.'

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1,Barred Surfperch,Amphistichus argenteus,urn:lsid:marinespecies.org:taxname:279594,279594,WoRMS,present,HumanObservation,
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,present,HumanObservation,
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,present,HumanObservation,
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4,Guadalupe Cardinalfish,Apogon guadalupensis,urn:lsid:marinespecies.org:taxname:273016,273016,WoRMS,present,HumanObservation,
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5,Smooth Alligatorfish,Anoplagonus inermis,urn:lsid:marinespecies.org:taxname:279630,279630,WoRMS,present,HumanObservation,


In [37]:
## Pull sex and lifeStage information out of sex column

fish_occ['sex'] = full_fish['sex'].copy()
fish_occ['lifeStage'] = fish_occ['sex']

# Separate
fish_occ.loc[fish_occ['sex'].isin(['JUVENILE']), 'sex'] = np.nan
fish_occ.loc[fish_occ['lifeStage'].isin(['MALE', 'FEMALE', 'TRANSITIONAL']), 'lifeStage'] = np.nan

# Replace sex and lifeStage with controlled vocabulary
fish_occ['sex'] = fish_occ['sex'].replace({'MALE':'male', 'FEMALE':'female', 'TRANSITIONAL':'hermaphrodite'})
fish_occ['lifeStage'] = fish_occ['lifeStage'].replace({'JUVENILE':'juvenile'})

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks,sex,lifeStage
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1,Barred Surfperch,Amphistichus argenteus,urn:lsid:marinespecies.org:taxname:279594,279594,WoRMS,present,HumanObservation,,,
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,present,HumanObservation,,,
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,present,HumanObservation,,,
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4,Guadalupe Cardinalfish,Apogon guadalupensis,urn:lsid:marinespecies.org:taxname:273016,273016,WoRMS,present,HumanObservation,,,
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5,Smooth Alligatorfish,Anoplagonus inermis,urn:lsid:marinespecies.org:taxname:279630,279630,WoRMS,present,HumanObservation,,,


In [38]:
## Create density

fish_density = full_fish['count'] # units = individuals per 120 m3
fish_occ['organismQuantity'] = fish_density
fish_occ['organismQuantityType'] = 'number of individuals per 120 m3'
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1,Barred Surfperch,Amphistichus argenteus,urn:lsid:marinespecies.org:taxname:279594,279594,WoRMS,present,HumanObservation,,,,0.0,number of individuals per 120 m3
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,present,HumanObservation,,,,0.0,number of individuals per 120 m3
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,present,HumanObservation,,,,0.0,number of individuals per 120 m3
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4,Guadalupe Cardinalfish,Apogon guadalupensis,urn:lsid:marinespecies.org:taxname:273016,273016,WoRMS,present,HumanObservation,,,,0.0,number of individuals per 120 m3
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5,Smooth Alligatorfish,Anoplagonus inermis,urn:lsid:marinespecies.org:taxname:279630,279630,WoRMS,present,HumanObservation,,,,0.0,number of individuals per 120 m3


In [39]:
## Update absence status

fish_occ.loc[fish_occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1,Barred Surfperch,Amphistichus argenteus,urn:lsid:marinespecies.org:taxname:279594,279594,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4,Guadalupe Cardinalfish,Apogon guadalupensis,urn:lsid:marinespecies.org:taxname:273016,273016,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5,Smooth Alligatorfish,Anoplagonus inermis,urn:lsid:marinespecies.org:taxname:279630,279630,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3


### The notes column

I'd like to extract the useful information from the notes column and tidy it. Many notes contain sex information, which I can probably extract by looking for the following values:
- M
- F
- M;
- F;
- MALE
- FEMALE
- FEAMLE
- MALE;
- FEMALE;
- TRANSITIONAL;
- TRANSITION
- MALE,

There is also some life stage information:
- JUVENILE
- JUVENILE;
- JEVENILE

Sometimes, sex is uncertain (e.g. 'M?'). I'll leave these in the notes.

To look at the non-sex-related notes, use:

```python
not_sex = fish[fish['notes'].notna()
               & ~fish['notes'].isin(['M', 'F', 'M;', 'F;', 'MALE', 'FEMALE'])].copy()

for note in not_sex['notes'].unique():
    print(note)
```
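
The extraction described above can be sketched as a small helper that splits a note on the delimiters seen in the data and looks each token up in a table. This is a minimal sketch, not the logic used in the cells below: `SEX_TOKENS` and `sex_from_note` are hypothetical names, and the rule "accept a note only when exactly one sex is named" is an assumption.

```python
import re

# Hypothetical lookup table; the keys are spelling variants seen in the notes
SEX_TOKENS = {
    'M': 'male', 'MALE': 'male', 'MALES': 'male',
    'F': 'female', 'FEMALE': 'female', 'FEMALES': 'female', 'FEAMLE': 'female',
    'TRANSITIONAL': 'hermaphrodite', 'TRANSITION': 'hermaphrodite',
}

def sex_from_note(note):
    """Return a controlled-vocabulary sex term, or None if none can be assigned."""
    if not isinstance(note, str):  # NaN or otherwise missing note
        return None
    # Split on ';', ',' and '/', then drop empty pieces and surrounding spaces
    tokens = [t.strip() for t in re.split(r'[;,/]', note) if t.strip()]
    found = {SEX_TOKENS[t] for t in tokens if t in SEX_TOKENS}
    # Assumption: only accept the note when exactly one sex is named
    return found.pop() if len(found) == 1 else None

for example in ['M', 'FEMALE; 2 fish', 'M?', 'M/F']:
    print(repr(example), '->', sex_from_note(example))
```

Uncertain entries such as 'M?' do not match any key, so they return `None` and stay in the notes only.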

In [40]:
## Obtain relevant records from fish

notes = full_fish[['site', 'survey_date', 'classcode', 'count', 'sex', 'notes']].copy()
print(notes.shape)
notes.head()

(8863192, 6)


Unnamed: 0,site,survey_date,classcode,count,sex,notes
0,ANACAPA_ADMIRALS_CEN,19990930,AARG,0.0,,
1,ANACAPA_ADMIRALS_CEN,19990930,ADAV,0.0,,
2,ANACAPA_ADMIRALS_CEN,19990930,AFLA,0.0,,
3,ANACAPA_ADMIRALS_CEN,19990930,AGUA,0.0,,
4,ANACAPA_ADMIRALS_CEN,19990930,AINE,0.0,,


In [41]:
## Extract sex from notes column

sex_notes = []
sex_options = ['M', 'F', 'MALE', 'FEMALE', 'FEAMLE', 'MALES', 'FEMALES', 'TRANSITIONAL', 'TRANSITION', 'TRANNY']

for note in notes['notes']:
    
    colon_overlap = []
    comma_overlap = []
    slash_overlap = []
    
    if note == note:  # NaN != NaN, so this is False for missing notes
        
        colon_split = list(map(str.strip, note.split(';')))
        if (len(colon_split) > 1) & ('' not in colon_split):
            colon_overlap = list(set(sex_options) & set(colon_split))
            
        comma_split = list(map(str.strip, note.split(',')))
        if (len(comma_split) > 1) & ('' not in comma_split):
            comma_overlap = list(set(sex_options) & set(comma_split))
            
        slash_split = list(map(str.strip, note.split('/')))
        if (len(slash_split) > 1) & ('' not in slash_split):
            slash_overlap = list(set(sex_options) & set(slash_split))
          
        
        if note in sex_options:
            sex_notes.append(note)
        elif colon_overlap != []:
            sex_notes.extend(colon_overlap)
        elif comma_overlap != []:
            sex_notes.extend(comma_overlap)
        elif (slash_overlap != []) & (len(slash_overlap) == 1):
            sex_notes.extend(slash_overlap)
        
        else:
            sex_notes.append(np.nan)
            
    else:
        sex_notes.append(np.nan)
        
notes['sex_notes'] = sex_notes
notes.head()

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes
0,ANACAPA_ADMIRALS_CEN,19990930,AARG,0.0,,,
1,ANACAPA_ADMIRALS_CEN,19990930,ADAV,0.0,,,
2,ANACAPA_ADMIRALS_CEN,19990930,AFLA,0.0,,,
3,ANACAPA_ADMIRALS_CEN,19990930,AGUA,0.0,,,
4,ANACAPA_ADMIRALS_CEN,19990930,AINE,0.0,,,


In [42]:
## Clean sex_notes

print(notes['sex_notes'].unique())
notes['sex_notes'] = notes['sex_notes'].replace({'F':'female',
                  'M':'male',
                  'FEMALE':'female',
                  'MALE':'male',
                  'MALES':'male',
                  'FEMALES':'female',
                  'FEAMLE':'female',
                  'TRANSITIONAL':'hermaphrodite',
                  'TRANNY':'hermaphrodite',
                  'TRANSITION':'hermaphrodite'})
print(notes['sex_notes'].unique())

[nan 'FEMALE' 'MALE' 'TRANSITIONAL' 'F' 'M' 'MALES' 'FEMALES' 'TRANSITION'
 'TRANNY' 'FEAMLE']
[nan 'female' 'male' 'hermaphrodite']


In [43]:
# Add sex from fish_occ to notes

notes['occ_sex'] = fish_occ['sex']
notes.head()

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes,occ_sex
0,ANACAPA_ADMIRALS_CEN,19990930,AARG,0.0,,,,
1,ANACAPA_ADMIRALS_CEN,19990930,ADAV,0.0,,,,
2,ANACAPA_ADMIRALS_CEN,19990930,AFLA,0.0,,,,
3,ANACAPA_ADMIRALS_CEN,19990930,AGUA,0.0,,,,
4,ANACAPA_ADMIRALS_CEN,19990930,AINE,0.0,,,,


In [44]:
## Create new column merging information from occ_sex and sex_notes

new_sex = notes['occ_sex'].fillna(notes['sex_notes'])  # keep occ_sex where present, else fall back to sex_notes
notes['new_sex'] = new_sex
notes.head()

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes,occ_sex,new_sex
0,ANACAPA_ADMIRALS_CEN,19990930,AARG,0.0,,,,,
1,ANACAPA_ADMIRALS_CEN,19990930,ADAV,0.0,,,,,
2,ANACAPA_ADMIRALS_CEN,19990930,AFLA,0.0,,,,,
3,ANACAPA_ADMIRALS_CEN,19990930,AGUA,0.0,,,,,
4,ANACAPA_ADMIRALS_CEN,19990930,AINE,0.0,,,,,


In [45]:
## Replace sex column in fish_occ with new_sex

fish_occ['sex'] = notes['new_sex']
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1,Barred Surfperch,Amphistichus argenteus,urn:lsid:marinespecies.org:taxname:279594,279594,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4,Guadalupe Cardinalfish,Apogon guadalupensis,urn:lsid:marinespecies.org:taxname:273016,273016,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5,Smooth Alligatorfish,Anoplagonus inermis,urn:lsid:marinespecies.org:taxname:279630,279630,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3


To check that the above process is working, use:

```python
fish[fish['sex'] == 'FEMALE'].shape[0] # 19714
full_fish[full_fish['sex'] == 'FEMALE'].shape[0] # 19714
notes[notes['occ_sex'] == 'female'].shape[0] # 19714
notes[notes['new_sex'] == 'female'].shape[0] # 21230
fish_occ[fish_occ['sex'] == 'female'].shape[0] # 21230
```
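
The counts above grow from `occ_sex` to `new_sex` because the value parsed from the notes is used only where the structured sex value is missing. A minimal sketch of that rule on toy data (the values below are made up, not PISCO records), using `Series.fillna` with a second Series as the fallback:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'occ_sex':   ['female', np.nan,   np.nan, 'male'],
    'sex_notes': [np.nan,   'female', np.nan, 'female'],
})

# Keep occ_sex where present; otherwise fall back to sex_notes
toy['new_sex'] = toy['occ_sex'].fillna(toy['sex_notes'])

n_occ = (toy['occ_sex'] == 'female').sum()
n_new = (toy['new_sex'] == 'female').sum()
print(n_occ, n_new)
```

In the last row the structured value wins, so a conflicting note never overwrites an existing sex; the female count can only stay the same or increase.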

In [46]:
## Repeat the process to extract lifeStage information from notes

stage_notes = []
stage_options = ['JUVENILE', 'JUV', 'JEVENILE']

for note in notes['notes']:
    
    colon_overlap = []
    comma_overlap = []
    slash_overlap = []
    
    if note == note:  # NaN != NaN, so this is False for missing notes
        
        colon_split = list(map(str.strip, note.split(';')))
        if (len(colon_split) > 1) & ('' not in colon_split):
            colon_overlap = list(set(stage_options) & set(colon_split))
            
        comma_split = list(map(str.strip, note.split(',')))
        if (len(comma_split) > 1) & ('' not in comma_split):
            comma_overlap = list(set(stage_options) & set(comma_split))
            
        slash_split = list(map(str.strip, note.split('/')))
        if (len(slash_split) > 1) & ('' not in slash_split):
            slash_overlap = list(set(stage_options) & set(slash_split))
          
        
        if note in stage_options:
            stage_notes.append(note)
        elif colon_overlap != []:
            stage_notes.extend(colon_overlap)
        elif comma_overlap != []:
            stage_notes.extend(comma_overlap)
        elif (slash_overlap != []) & (len(slash_overlap) == 1):
            stage_notes.extend(slash_overlap)
        
        else:
            stage_notes.append(np.nan)
            
    else:
        stage_notes.append(np.nan)
        
notes['stage_notes'] = stage_notes
        
# Clean stage_notes
print(notes['stage_notes'].unique())
notes['stage_notes'] = notes['stage_notes'].replace({'JUV':'juvenile',
                  'JUVENILE':'juvenile',
                  'JEVENILE':'juvenile'})
print(notes['stage_notes'].unique())

# Add lifeStage from fish_occ to notes
notes['occ_stage'] = fish_occ['lifeStage']

# Create new column merging information from occ_stage and stage_notes
new_stage = notes['occ_stage'].fillna(notes['stage_notes'])  # keep occ_stage where present, else fall back to stage_notes
notes['new_stage'] = new_stage

# Replace lifeStage column in fish_occ with new_stage
fish_occ['lifeStage'] = notes['new_stage']
fish_occ.head()

[nan 'JUVENILE' 'JUV' 'JEVENILE']
[nan 'juvenile']


Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1,Barred Surfperch,Amphistichus argenteus,urn:lsid:marinespecies.org:taxname:279594,279594,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4,Guadalupe Cardinalfish,Apogon guadalupensis,urn:lsid:marinespecies.org:taxname:273016,273016,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5,Smooth Alligatorfish,Anoplagonus inermis,urn:lsid:marinespecies.org:taxname:279630,279630,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3


Check:

```python
fish[fish['sex'] == 'JUVENILE'].shape[0] # 2548
notes[notes['occ_stage'] == 'juvenile'].shape[0] # 2548
notes[notes['new_stage'] == 'juvenile'].shape[0] # 2596
fish_occ[fish_occ['lifeStage'] == 'juvenile'].shape[0] # 2596
```

In [47]:
## Add notes under occurrenceRemarks

fish_occ['occurrenceRemarks'] = notes['notes']
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks,sex,lifeStage,organismQuantity,organismQuantityType,occurrenceRemarks
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ1,Barred Surfperch,Amphistichus argenteus,urn:lsid:marinespecies.org:taxname:279594,279594,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3,
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ2,Sargo,Anisotremus davidsonii,urn:lsid:marinespecies.org:taxname:279617,279617,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3,
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ3,Tubesnout,Aulorhynchus flavidus,urn:lsid:marinespecies.org:taxname:279839,279839,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3,
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ4,Guadalupe Cardinalfish,Apogon guadalupensis,urn:lsid:marinespecies.org:taxname:273016,273016,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3,
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ5,Smooth Alligatorfish,Anoplagonus inermis,urn:lsid:marinespecies.org:taxname:279630,279630,WoRMS,absent,HumanObservation,,,,0.0,number of individuals per 120 m3,


In [48]:
## Change NaN values in string fields to ''

fish_occ[['identificationRemarks', 'sex', 'lifeStage', 'occurrenceRemarks']] = fish_occ[['identificationRemarks', 'sex', 'lifeStage', 'occurrenceRemarks']].replace(np.nan, '')
fish_occ.isna().sum()

eventID                  0
occurrenceID             0
vernacularName           0
scientificName           0
scientificNameID         0
taxonID                  0
nameAccordingTo          0
occurrenceStatus         0
basisOfRecord            0
identificationRemarks    0
sex                      0
lifeStage                0
organismQuantity         0
organismQuantityType     0
occurrenceRemarks        0
dtype: int64

In [49]:
## Save Size, Min and Max for use in MoF file

# Obtain relevant records from fish
subset = full_fish[['site', 'survey_date', 'classcode', 'count', 'fish_tl', 'min_tl', 'max_tl']].copy()

# Fix records where count = 1 and min and/or max values are present - this defaults to size = fish_tl if max_tl is missing, or if a size range has been given for a single fish
subset.loc[subset['count'] == 1, ['min_tl', 'max_tl']] = np.nan

# Fix records where count > 1 and min and max don't provide a reasonable size range - this defaults to size = fish_tl if max_tl is missing
subset.loc[(subset['fish_tl'] == subset['min_tl']) & subset['max_tl'].isna(), 'min_tl'] = np.nan

# For groups where a size range exists, we want to drop the average length measure
subset.loc[(subset['fish_tl'] < subset['max_tl']) & (subset['fish_tl'] > subset['min_tl']), 'fish_tl'] = np.nan

# Assemble fish_sizes
fish_sizes = pd.DataFrame({
    'eventID':fish_occ['eventID'],
    'occurrenceID':fish_occ['occurrenceID'],
    'fish_tl':subset['fish_tl'],
    'min_tl':subset['min_tl'],
    'max_tl':subset['max_tl']
})
fish_sizes.dropna(how='all', subset=['fish_tl', 'min_tl', 'max_tl'], inplace=True) # Note that this drops 113 presence records where no size information was given (fish_tl = NaN)

print(fish_sizes.shape)
fish_sizes.head()

(417683, 5)


Unnamed: 0,eventID,occurrenceID,fish_tl,min_tl,max_tl
15,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ16,10.0,,
27,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ28,40.0,,
32,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ33,20.0,,
44,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ45,20.0,,
45,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3_occ46,15.0,,


**You can check for issues with the size data using the following checks. They are all problems that I've identified in the past.**
- 0 records have a missing count (count = NaN)
- 8445396 records have count = 0
- 268234 records have count = 1
    - Of these, 1 has a min value that = fish_tl and a max value of NaN. **Use fish_tl as the size.**
        - Note that the opposite is never the case (max = fish_tl, min = NaN)
    - Of these, 5443 have min and max values, and fish_tl is the average of those values. **Here, someone couldn't decide how big the fish was and recorded a range. Use the average, fish_tl, as the size.**
- 149562 records have count > 1
    - Of these, 117908 are all the same size (i.e. min_tl = max_tl = NaN)
    - 31654 are different sizes with a size range given (i.e. fish_tl is the average of min_tl and max_tl)
    - 0 are of unknown size, with min values that = fish_tl and max values of NaN. **If this occurs, use fish_tl as the size.**
        - Again, note that the opposite is never the case (max = fish_tl, min = NaN)
        
```python
# Records where count = 1
count1 = subset[subset['count'] == 1].copy()

# Records where count = 1, but min is present
count1[(count1['min_tl'].isna() == False) & (count1['max_tl'].isna() == True)]

# Records where count = 1, but min and max are present
count1[(count1['min_tl'].isna() == False) & (count1['max_tl'].isna() == False)]

# Records where count > 1
count2 = subset[subset['count'] > 1].copy()

# Records where count > 1 and all fish were the same size
count2[(count2['min_tl'].isna() == True) & (count2['max_tl'].isna() == True)]

# Records where count > 1 and fish were not the same size and a size range was given
count2[(count2['fish_tl'].isna() == False) & (count2['min_tl'].isna() == False) & (count2['max_tl'].isna() == False)]

# Records where count > 1 and fish are of unknown size (min = fish_tl, max not given)
count2[(count2['min_tl'].isna() == False) & (count2['max_tl'].isna() == True)]
```
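
The three size corrections can be traced on a toy table to see which fields survive in each case. This is a sketch under assumptions: the rows below are invented to cover the cases listed above, and `toy` is a hypothetical stand-in for `subset`.

```python
import numpy as np
import pandas as pd

# One row per case: single fish with min only, single fish with a range,
# group of same-size fish, group with a real range, group with min only
toy = pd.DataFrame({
    'count':   [1,      1,    3,      4,    2],
    'fish_tl': [12.0,   15.0, 20.0,   25.0, 30.0],
    'min_tl':  [12.0,   10.0, np.nan, 20.0, 30.0],
    'max_tl':  [np.nan, 20.0, np.nan, 30.0, np.nan],
})

# Single fish: ignore any min/max and keep fish_tl as the size
toy.loc[toy['count'] == 1, ['min_tl', 'max_tl']] = np.nan

# min equals fish_tl and max is missing: not a real range, so drop min
toy.loc[(toy['fish_tl'] == toy['min_tl']) & toy['max_tl'].isna(), 'min_tl'] = np.nan

# A real range exists: drop the average length and keep min/max
toy.loc[(toy['fish_tl'] > toy['min_tl']) & (toy['fish_tl'] < toy['max_tl']), 'fish_tl'] = np.nan

print(toy)
```

After these steps each row carries either a single size (`fish_tl`) or a range (`min_tl`, `max_tl`), never both.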

To check that all sizes in the original data set have made it through the absence population process, use:

```python
print(fish[fish['fish_tl'].isna() == False].shape[0]) # 417683
print(fish[fish['min_tl'].isna() == False].shape[0]) # 37098
print(fish[fish['max_tl'].isna() == False].shape[0]) # 37097

print(full_fish[full_fish['fish_tl'].isna() == False].shape[0]) # 417683
print(full_fish[full_fish['min_tl'].isna() == False].shape[0]) # 37098
print(full_fish[full_fish['max_tl'].isna() == False].shape[0]) # 37097
```

**Note** that the number of entries in fish_sizes will not necessarily be identical to these, because of the corrections made in the code block above according to Dan's directions.

### Save

In [50]:
## Save

fish_occ.to_csv('PISCO_occurrence_20210816.csv', index=False, na_rep='NaN')

## Create event file

In [51]:
## Get unique eventIDs from occurrence file and their associated survey_dates

event = pd.DataFrame({'eventID':fish_occ['eventID'],
                     'eventDate':full_fish['survey_date'],
                     'institutionID':full_fish['campus'],
                     'locality':full_fish['site']})
event.drop_duplicates(inplace=True)

print(event.shape)
event.head()

(71109, 4)


Unnamed: 0,eventID,eventDate,institutionID,locality
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,19990930,UCSB,ANACAPA_ADMIRALS_CEN
132,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_4,19990930,UCSB,ANACAPA_ADMIRALS_CEN
269,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_6,19990930,UCSB,ANACAPA_ADMIRALS_CEN
403,ANACAPA_ADMIRALS_CEN_19990930_INNER_CAN_3,19990930,UCSB,ANACAPA_ADMIRALS_CEN
532,ANACAPA_ADMIRALS_CEN_19990930_INNER_MID_3,19990930,UCSB,ANACAPA_ADMIRALS_CEN


To double-check that all eventIDs are also in the occurrence table:

```python
test = event.merge(fish_occ[['eventID', 'scientificName']], how='outer', on='eventID', indicator=True)
test[test['_merge'] != 'both']
```

In [52]:
## Format eventDate

formatted = [datetime.strptime(dt, '%Y%m%d').date().isoformat() for dt in event['eventDate']]
event['eventDate'] = formatted
event.head()

Unnamed: 0,eventID,eventDate,institutionID,locality
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,1999-09-30,UCSB,ANACAPA_ADMIRALS_CEN
132,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_4,1999-09-30,UCSB,ANACAPA_ADMIRALS_CEN
269,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_6,1999-09-30,UCSB,ANACAPA_ADMIRALS_CEN
403,ANACAPA_ADMIRALS_CEN_19990930_INNER_CAN_3,1999-09-30,UCSB,ANACAPA_ADMIRALS_CEN
532,ANACAPA_ADMIRALS_CEN_19990930_INNER_MID_3,1999-09-30,UCSB,ANACAPA_ADMIRALS_CEN


In [53]:
## Dataset ID

event.insert(2, 'datasetID', 'PISCO fish transects')
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN
132,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_4,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN
269,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_6,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN
403,ANACAPA_ADMIRALS_CEN_19990930_INNER_CAN_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN
532,ANACAPA_ADMIRALS_CEN_19990930_INNER_MID_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN


In [54]:
## Merge to obtain decimalLatitude and decimalLongitude

event = event.merge(site_summary, how='left', left_on=['institutionID', 'locality'], right_on=['campus', 'site'])
event.rename(columns = {'site_status':'locationRemarks', 'latitude':'decimalLatitude', 'longitude':'decimalLongitude'}, inplace=True)
event['locationRemarks'] = event['locationRemarks'].replace({'mpa':'marine protected area'})
event.drop(['campus', 'site'], axis=1, inplace=True)
print(event.shape)
event.head()

(71109, 8)


Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_4,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_6,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_CAN_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_MID_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252


In [56]:
## Add countryCode

event.insert(6, 'countryCode', 'US')
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_4,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_6,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_CAN_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_MID_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252


In [57]:
## Add coordinateUncertaintyInMeters

event['coordinateUncertaintyInMeters'] = 250

In [58]:
## minimumDepthInMeters, maximumDepthInMeters

# Groupby eventID to obtain depth column
depth = full_fish.groupby(['eventID']).agg({
    'depth':['min', 'max']
})
depth.reset_index(inplace=True)
depth.columns = depth.columns.droplevel()

# Add to event
event['minimumDepthInMeters'] = depth['min']
event['maximumDepthInMeters'] = depth['max']
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252,250,10.5,10.5
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_4,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252,250,9.5,9.5
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_6,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252,250,9.5,9.5
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_CAN_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252,250,9.0,9.0
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_MID_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252,250,,


**Note** that, as far as I can see, no event has more than one distinct depth value. The following check should return `False`:

```python
any(full_fish.groupby(['eventID'])['depth'].nunique() > 1)
```
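
The depth columns above are attached by row index, which depends on the grouped table and `event` being in the same row order. A key-based merge on `eventID` avoids that dependence. The sketch below uses toy stand-ins (`toy_fish`, `toy_event`, and their values are hypothetical) and named aggregation to get flat column names:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for full_fish and event (hypothetical IDs and depths)
toy_fish = pd.DataFrame({
    'eventID': ['B_1', 'B_1', 'A_1', 'C_1'],
    'depth':   [9.5,   9.5,   10.5,  np.nan],
})
toy_event = pd.DataFrame({'eventID': ['B_1', 'A_1', 'C_1']})  # order of first appearance

# groupby sorts by eventID, so this table is ordered A_1, B_1, C_1
depth = (toy_fish.groupby('eventID', as_index=False)
                 .agg(minimumDepthInMeters=('depth', 'min'),
                      maximumDepthInMeters=('depth', 'max')))

# Join on the key so each event receives its own depth, whatever the row order
toy_event = toy_event.merge(depth, how='left', on='eventID', validate='one_to_one')
print(toy_event)
```

`validate='one_to_one'` makes the merge raise if an eventID appears more than once on either side, which doubles as a duplicate check.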

In [60]:
## Add samplingProtocol and samplingEffort

# samplingProtocol
protocol = full_fish[['method', 'site', 'survey_date', 'zone', 'level', 'transect']].copy()
protocol.drop_duplicates(inplace=True)
protocol.reset_index(drop=True, inplace=True)
print(protocol.shape)
event['samplingProtocol'] = protocol['method']

# samplingEffort
event['samplingEffort'] = 'average of 12 minutes per transect'
print(event.shape)
event.head()

(71109, 6)
(71109, 14)


Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
0,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252,250,10.5,10.5,SBTL_FISH_PISCO,average of 12 minutes per transect
1,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_4,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252,250,9.5,9.5,SBTL_FISH_PISCO,average of 12 minutes per transect
2,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_6,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252,250,9.5,9.5,SBTL_FISH_PISCO,average of 12 minutes per transect
3,ANACAPA_ADMIRALS_CEN_19990930_INNER_CAN_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252,250,9.0,9.0,SBTL_FISH_PISCO,average of 12 minutes per transect
4,ANACAPA_ADMIRALS_CEN_19990930_INNER_MID_3,1999-09-30,PISCO fish transects,UCSB,ANACAPA_ADMIRALS_CEN,reference,US,34.002883,-119.4252,250,,,SBTL_FISH_PISCO,average of 12 minutes per transect
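
**Note** that assigning `protocol['method']` to `event['samplingProtocol']` depends on both frames having the same row order after `drop_duplicates` and `reset_index`. One way to avoid depending on row order might be to look the method up by `eventID` instead. A minimal sketch with toy frames (the method names and IDs are hypothetical, and it assumes each `eventID` has exactly one `method`):

```python
import pandas as pd

# Toy stand-ins for full_fish and event; values are invented
toy_fish = pd.DataFrame({
    'eventID': ['E2', 'E1', 'E1', 'E2'],
    'method': ['METHOD_A', 'METHOD_B', 'METHOD_B', 'METHOD_A'],
})
toy_event = pd.DataFrame({'eventID': ['E1', 'E2']})

# One method per eventID, indexed by eventID
method_by_event = toy_fish.drop_duplicates('eventID').set_index('eventID')['method']

# Look the method up by key rather than by row position
toy_event['samplingProtocol'] = toy_event['eventID'].map(method_by_event)
print(toy_event)
```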


In [61]:
## Get vis, temp, surge, and pctcnpy for MoF

# Get relevant measurementValues
event_MoF_values = full_fish[['eventID', 'vis', 'temp', 'surge', 'pctcnpy']].copy()
event_MoF_values.drop_duplicates(inplace=True)
event_MoF_values.dropna(how='all', subset=['vis', 'temp', 'surge', 'pctcnpy'], inplace=True)

# vis
vis = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'visibility',
    'measurementValue':event_MoF_values['vis'],
    'measurementUnit':'meters',
    'measurementMethod':'Horizontal visibility on each transect, estimated by diver reeling in transect tape and noting the distance at which the end of the tape can first be seen.'
})

# temp
temp = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'temperature',
    'measurementValue':event_MoF_values['temp'],
    'measurementUnit':'degrees Celsius',
    'measurementMethod':"The temperature on each transect as measured by the diver's computer."
})

# surge
surge = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'surge',
    'measurementValue':event_MoF_values['surge'],
    'measurementUnit':np.nan,
    'measurementMethod':"The diver's categorical estimation of the magnitude of horizontal displacement on each transect."
})

# pctcnpy
pct = pd.DataFrame({
    'eventID':event_MoF_values['eventID'],
    'occurrenceID':np.nan,
    'measurementType':'percent canopy',
    'measurementValue':event_MoF_values['pctcnpy'],
    'measurementUnit':np.nan,
    'measurementMethod':"The diver's categorical estimation of the percent of the transect, by volume, that is occupied by kelp."
})

pct.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
15,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,,percent canopy,1.0,,The diver's categorical estimation of the perc...
147,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_4,,percent canopy,2.0,,The diver's categorical estimation of the perc...
284,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_6,,percent canopy,3.0,,The diver's categorical estimation of the perc...
418,ANACAPA_ADMIRALS_CEN_19990930_INNER_CAN_3,,percent canopy,1.0,,The diver's categorical estimation of the perc...
547,ANACAPA_ADMIRALS_CEN_19990930_INNER_MID_3,,percent canopy,1.0,,The diver's categorical estimation of the perc...
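
**Note** that the four event-level frames differ only in the source column, measurement type, unit, and method text, so they could probably be built by one helper. A minimal sketch; `make_event_mof` is a hypothetical helper that is not part of this notebook, and the toy values are invented:

```python
import numpy as np
import pandas as pd

def make_event_mof(values, column, measurement_type, unit, method):
    """Build a long-format MoF frame for one event-level column (hypothetical helper)."""
    return pd.DataFrame({
        'eventID': values['eventID'],
        'occurrenceID': np.nan,
        'measurementType': measurement_type,
        'measurementValue': values[column],
        'measurementUnit': unit,
        'measurementMethod': method,
    })

# Toy stand-in for event_MoF_values; values are invented
toy_values = pd.DataFrame({
    'eventID': ['E1', 'E2'],
    'vis': [6.1, np.nan],
    'temp': [14.0, 15.0],
})

toy_vis = make_event_mof(toy_values, 'vis', 'visibility', 'meters', 'method text here')
toy_temp = make_event_mof(toy_values, 'temp', 'temperature', 'degrees Celsius', 'method text here')
print(pd.concat([toy_vis, toy_temp]).shape)
```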


**Note** that, for a given measurement, the following can be used to confirm that all the data in `fish` made it into `full_fish`, and then into the correct data frame:

```python
print('vis')
print(fish[(fish['classcode'] != 'NO_ORG') & (fish['vis'].isna() == False)].shape[0]) # 399051
print(full_fish[full_fish['vis'].isna() == False].shape[0]) # 399051

test = full_fish.loc[full_fish['vis'].isna() == False, ['eventID', 'vis']].drop_duplicates()
print(test.shape[0]) # 53913
print(vis[vis['measurementValue'].isna() == False].shape[0]) # 53913
```

In [63]:
## Check for NaN in string fields 

event.isna().sum()

eventID                              0
eventDate                            0
datasetID                            0
institutionID                        0
locality                             0
locationRemarks                      0
countryCode                          0
decimalLatitude                      0
decimalLongitude                     0
coordinateUncertaintyInMeters        0
minimumDepthInMeters             21389
maximumDepthInMeters             21389
samplingProtocol                     0
samplingEffort                       0
dtype: int64

In [64]:
## Save

event.to_csv('PISCO_event_20210816.csv', index=False, na_rep='NaN')

## Create MoF file

In [65]:
## Assemble fish_sizes data

# total length
tl_mof = pd.DataFrame({'eventID':fish_sizes['eventID'],
                      'occurrenceID':fish_sizes['occurrenceID'],
                      'measurementType':'length',
                      'measurementValue':fish_sizes['fish_tl'],
                      'measurementUnit':'centimeters',
                      'measurementMethod': 'The total length of an individual or group of individuals (of the same length), estimated visually to the nearest centimeter'})

# min length
min_mof = pd.DataFrame({'eventID':fish_sizes['eventID'],
                      'occurrenceID':fish_sizes['occurrenceID'],
                      'measurementType':'minimum length',
                      'measurementValue':fish_sizes['min_tl'],
                      'measurementUnit':'centimeters',
                      'measurementMethod': 'The minimum size recorded in a group of fish of the same species, estimated visually to the nearest centimeter'})

# max length
max_mof = pd.DataFrame({'eventID':fish_sizes['eventID'],
                      'occurrenceID':fish_sizes['occurrenceID'],
                      'measurementType':'maximum length',
                      'measurementValue':fish_sizes['max_tl'],
                      'measurementUnit':'centimeters',
                      'measurementMethod': 'The maximum size recorded in a group of fish of the same species, estimated visually to the nearest centimeter'})

In [66]:
## Concatenate dataframes

mof = pd.concat([vis, temp, surge, pct, tl_mof, min_mof, max_mof])
mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
15,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_3,,visibility,6.1,meters,"Horizontal visibility on each transect, estima..."
147,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_4,,visibility,6.1,meters,"Horizontal visibility on each transect, estima..."
284,ANACAPA_ADMIRALS_CEN_19990930_INNER_BOT_6,,visibility,6.1,meters,"Horizontal visibility on each transect, estima..."
418,ANACAPA_ADMIRALS_CEN_19990930_INNER_CAN_3,,visibility,6.1,meters,"Horizontal visibility on each transect, estima..."
547,ANACAPA_ADMIRALS_CEN_19990930_INNER_MID_3,,visibility,6.1,meters,"Horizontal visibility on each transect, estima..."


In [67]:
## Drop missing measurements

print(mof.shape)
mof.dropna(subset=['measurementValue'], inplace=True)
mof.shape

(1477457, 6)


(633841, 6)

To check that all the data are still there, use:

```python
print(mof[(mof['measurementType'] == 'visibility')].shape) # 53913
print(vis[vis['measurementValue'].isna() == False].shape)
```
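
The same comparison could be made for every measurement type at once by counting non-missing values per `measurementType` before the drop and rows per type after it. A minimal sketch on a toy frame (values are invented):

```python
import numpy as np
import pandas as pd

# Toy stand-in for mof before dropping missing values; values are invented
toy_mof = pd.DataFrame({
    'measurementType': ['visibility', 'visibility', 'temperature', 'surge'],
    'measurementValue': [6.1, np.nan, 14.0, np.nan],
})

# Non-missing values per type before the drop
before = toy_mof.groupby('measurementType')['measurementValue'].count()

# Rows per type after the drop
after = toy_mof.dropna(subset=['measurementValue'])['measurementType'].value_counts()

# Types with no non-missing values vanish after the drop, so fill them with 0
after = after.reindex(before.index, fill_value=0)

print((before == after).all())
```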

In [76]:
## Change NaN in string fields to ''

mof[['occurrenceID', 'measurementUnit']] = mof[['occurrenceID', 'measurementUnit']].replace(np.nan, '')
mof.isna().sum()

eventID              0
occurrenceID         0
measurementType      0
measurementValue     0
measurementUnit      0
measurementMethod    0
dtype: int64

In [77]:
## Save

mof.to_csv('PISCO_MoF_20210816.csv', index=False, na_rep='NaN')

## Questions

1. Should "Swami's" in the fish data be changed to "SWAMIS"?

## Future directions

1. This dataset is large (~9 million rows), and some steps take a long time to run. Are there techniques I could use to speed them up?
2. I could package my sanity checks into actual pass/fail tests.
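
As a rough illustration of the second point, a sanity check could be wrapped in a function that raises `AssertionError` when the counts disagree. A minimal sketch; `check_measurement_carried_over` is a hypothetical name, and the toy frames stand in for `full_fish` and `vis`:

```python
import numpy as np
import pandas as pd

def check_measurement_carried_over(source, column, mof_frame):
    """Hypothetical check: raise AssertionError unless the number of distinct
    (eventID, value) pairs with a non-missing value in `source` equals the
    number of non-missing measurementValue rows in `mof_frame`."""
    expected = (
        source.loc[source[column].notna(), ['eventID', column]]
        .drop_duplicates()
        .shape[0]
    )
    actual = int(mof_frame['measurementValue'].notna().sum())
    assert expected == actual, f"{column}: expected {expected} rows, found {actual}"

# Toy stand-ins; values are invented
toy_full_fish = pd.DataFrame({
    'eventID': ['E1', 'E1', 'E2', 'E3'],
    'vis': [6.1, 6.1, 5.0, np.nan],
})
toy_vis = pd.DataFrame({
    'eventID': ['E1', 'E2'],
    'measurementValue': [6.1, 5.0],
})

# Passes silently when the counts match
check_measurement_carried_over(toy_full_fish, 'vis', toy_vis)
```

Functions like this could be collected in a separate file and run with a test runner such as pytest, so that a failed check stops the workflow instead of relying on a visual comparison of printed counts.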