# Reef Check - abalone size/frequency data

Abalone size surveys are conducted north of the Golden Gate. Any red abalone encountered during usual Reef Check surveys are sized using calipers. In addition, independent abalone surveys are conducted where a diver swims over the reef and measures every red abalone encountered, with a goal of measuring 250 animals.

In [18]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates

In [19]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

In [20]:
## Load data

data = pd.read_csv('RCCA_abalone_size_data.csv')
print(data.shape)
data.head()

(20009, 13)


Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Classcode,Size,Latitude,Longitude,Depth_ft,Temp10m,Visibility
0,120 Reef,30,8,2008,30-Aug-08,5,pink abalone,10.0,33.737919,-118.392014,25.0,17.0,4.0
1,Andrew Molera,27,6,2017,27-Jun-17,2,red abalone,16.0,36.278454,-121.880859,52.0,10.0,6.0
2,Andrew Molera,27,6,2017,27-Jun-17,2,red abalone,20.0,36.278454,-121.880859,52.0,10.0,6.0
3,Andrew Molera,27,6,2017,27-Jun-17,4,red abalone,8.0,36.278454,-121.880859,38.5,10.0,5.0
4,Andrew Molera,27,6,2017,27-Jun-17,4,red abalone,10.0,36.278454,-121.880859,38.5,10.0,5.0


**A couple questions here:** 
1. I thought these surveys were conducted at the site level. Why are they broken out by transect? **Looking closer, the transect column is not actually informative. I don't need to aggregate by transect, since each individual has its own row and is assigned to the correct site and survey date regardless of transect. It's just weird that this information is available, seeing as the divers could measure abalone off-transect, and transect is never NaN.**
2. Based on the protocol document, I thought these surveys only covered red abalone, but several different species are listed here.

According to the table metadata, starting in 2015, red abalone north of San Francisco were measured to the nearest mm. Prior to 2015, and south of San Francisco in any year, abalone were measured to the nearest cm.
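To make the precision rule concrete, here is a minimal sketch that labels each record as measured to the nearest mm or cm. The cutoff latitude (`GOLDEN_GATE_LAT`, roughly the latitude of the Golden Gate) and the helper name `size_precision` are my own assumptions, not part of the RCCA metadata; the column names are the ones in the size table. Rows with a missing latitude fall through to 'cm' in this sketch.

```python
import pandas as pd

# Assumption: approximate latitude of the Golden Gate, used as the north/south cutoff
GOLDEN_GATE_LAT = 37.82

def size_precision(df, cutoff_lat=GOLDEN_GATE_LAT, first_mm_year=2015):
    """Return 'mm' or 'cm' per row, following the rule described in the table metadata."""
    is_mm = (
        (df['Classcode'] == 'red abalone')
        & (df['Year'] >= first_mm_year)
        & (df['Latitude'] > cutoff_lat)
    )
    return is_mm.map({True: 'mm', False: 'cm'})

# Small illustrative frame (not the real data)
example = pd.DataFrame({
    'Classcode': ['red abalone', 'red abalone', 'pink abalone'],
    'Year': [2017, 2017, 2008],
    'Latitude': [38.555119, 36.278454, 33.737919],
})
example['precision'] = size_precision(example)
print(example)
```

Only the first row (red abalone, 2017, north of the cutoff) is labeled 'mm'; the other two are labeled 'cm'.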

## Create occurrence file

Here, it seems reasonable for the **event** to be the survey (e.g. site + survey date). The **occurrences** can be the abalone observations.
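As a rough illustration of the site + survey date idea, the sketch below builds an identifier of the form `<SiteWithoutSpaces>_<YYYYMMDD>` for a single record. The helper name `make_event_id` is hypothetical, and the sketch does not handle special cases such as apostrophes in site names.

```python
from datetime import datetime

def make_event_id(site, survey_date):
    """Combine a site name and a date string like '30-Aug-08' into one event identifier."""
    date = datetime.strptime(survey_date, '%d-%b-%y')
    return site.replace(' ', '') + '_' + date.strftime('%Y%m%d')

print(make_event_id('120 Reef', '30-Aug-08'))
print(make_event_id('Andrew Molera', '27-Jun-17'))
```

Every abalone measured at the same site on the same day gets the same identifier, which is what lets the survey act as the event.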

In [4]:
## Get site names w/o spaces for use in eventID

# Get a list of site names with spaces removed
site_names = [name.replace(' ', '') for name in data['Site']]

# Map site_names to sites
site_name_dict = dict(zip(data['Site'], site_names))
site_name_dict["Lover's 3"] = 'Lovers3'
site_name_dict['West Long Point'] = 'LongPointWest'

# Create SiteName column from Site column
# (assign the result; calling replace(..., inplace=True) on a selected column is unreliable in newer pandas)
data['SiteName'] = data['Site'].replace(site_name_dict)
data.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Classcode,Size,Latitude,Longitude,Depth_ft,Temp10m,Visibility,SiteName
0,120 Reef,30,8,2008,30-Aug-08,5,pink abalone,10.0,33.737919,-118.392014,25.0,17.0,4.0,120Reef
1,Andrew Molera,27,6,2017,27-Jun-17,2,red abalone,16.0,36.278454,-121.880859,52.0,10.0,6.0,AndrewMolera
2,Andrew Molera,27,6,2017,27-Jun-17,2,red abalone,20.0,36.278454,-121.880859,52.0,10.0,6.0,AndrewMolera
3,Andrew Molera,27,6,2017,27-Jun-17,4,red abalone,8.0,36.278454,-121.880859,38.5,10.0,5.0,AndrewMolera
4,Andrew Molera,27,6,2017,27-Jun-17,4,red abalone,10.0,36.278454,-121.880859,38.5,10.0,5.0,AndrewMolera


In [5]:
## Pad month and day as needed

paddedDay = data['Day'].astype(str).str.zfill(2).tolist()
paddedMonth = data['Month'].astype(str).str.zfill(2).tolist()

In [6]:
## Create eventID

eventID = [data['SiteName'].iloc[i] + '_' + str(data['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(len(data['Site']))]
occ = pd.DataFrame({'eventID':eventID})

occ.head()

Unnamed: 0,eventID
0,120Reef_20080830
1,AndrewMolera_20170627
2,AndrewMolera_20170627
3,AndrewMolera_20170627
4,AndrewMolera_20170627


In [7]:
## Format dates and add eventDate

eventDate = [datetime.strptime(dt, '%d-%b-%y').date().isoformat() for dt in data['SurveyDate']]
occ['eventDate'] = eventDate
occ.head()

Unnamed: 0,eventID,eventDate
0,120Reef_20080830,2008-08-30
1,AndrewMolera_20170627,2017-06-27
2,AndrewMolera_20170627,2017-06-27
3,AndrewMolera_20170627,2017-06-27
4,AndrewMolera_20170627,2017-06-27


In [8]:
## Add datasetID

occ['datasetID'] = 'RCCA abalone size'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID
0,120Reef_20080830,2008-08-30,RCCA abalone size
1,AndrewMolera_20170627,2017-06-27,RCCA abalone size
2,AndrewMolera_20170627,2017-06-27,RCCA abalone size
3,AndrewMolera_20170627,2017-06-27,RCCA abalone size
4,AndrewMolera_20170627,2017-06-27,RCCA abalone size


In [9]:
## Add locality and countryCode

# locality
occ['locality'] = data['Site']
occ.loc[occ['locality'] == "Lover's 3", 'locality'] = 'Lovers 3'
occ.loc[occ['locality'] == 'West Long Point', 'locality'] = 'Long Point West'

# countryCode
occ['countryCode'] = 'US'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode
0,120Reef_20080830,2008-08-30,RCCA abalone size,120 Reef,US
1,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US
2,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US
3,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US
4,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US


**Note** that some latitude/longitude pairs are missing from the abalone size data, even though the sites are listed in the site table and coordinates for those same sites appear in other rows of the data. For example, the following query shows some missing values for the Aquarium site:

``` python
data[data['Site'] == 'Aquarium']
```

Given this, it seemed most reasonable to load the site table and populate lat, lons from there.
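Before relying on the site table for coordinates, it may be worth checking that every site in the size data actually has a match. This is a minimal sketch using made-up site names and placeholder coordinates; `occ_demo` and `sites_demo` are illustrative stand-ins, not the real tables.

```python
import pandas as pd

# Illustrative stand-ins for the occurrence and site tables (placeholder values)
occ_demo = pd.DataFrame({'locality': ['Site A', 'Site A', 'Site B']})
sites_demo = pd.DataFrame({'Site': ['Site A'], 'Latitude': [36.0], 'Longitude': [-122.0]})

# validate='many_to_one' raises if a site appears twice in the site table,
# which would otherwise silently duplicate occurrence rows.
checked = occ_demo.merge(
    sites_demo,
    how='left',
    left_on='locality',
    right_on='Site',
    validate='many_to_one',
    indicator=True,
)

# Sites present in the size data but absent from the site table
missing_sites = sorted(checked.loc[checked['_merge'] == 'left_only', 'locality'].unique())
print(missing_sites)
```

Any names listed in `missing_sites` would need to be added to the site table (or renamed to match it) before the coordinates can be filled in.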

In [10]:
## Load site table to get lat, lon

filename = 'RCCA_site_table.csv'
sites = pd.read_csv(filename, usecols=range(7))

sites.head()

Unnamed: 0,Research_group,Site,CA_MPA_Name_Short,MPA_status,LTM_project_short_code,Latitude,Longitude
0,RCCA,Macklyn Cove,,REF,LTM_Kelp_SRock,42.045155,-124.294724
1,RCCA,Pyramid Pt,Pyramid Point SMCA,MPA,LTM_Kelp_SRock,41.994801,-124.217308
2,RCCA,Flat Iron Rock,,,,41.059425,-124.157829
3,RCCA,Trinidad,,,,41.055,-124.139999
4,RCCA,MacKerricher North,MacKerricher SMCA,MPA,LTM_Kelp_SRock,39.492823,-123.80199


In [11]:
## Add rows to site table -- CAN BE DELETED WHEN SITE TABLE IS UPDATED ON DATAONE

sites_to_add = pd.DataFrame({'Research_group':['RCCA']*5,
                            'Site':['Cayucos', 'Hurricane Ridge', 'LA Federal Breakwater', 'Ocean Cove Kelper', 'Pier 400'],
                            'Latitude':[35.4408, 37.4701, 33.711899, 38.555119, 33.716301],
                            'Longitude':[-120.936302, -122.4796, -118.241997, -123.3046, -118.258003]})
sites = pd.concat([sites, sites_to_add])

In [12]:
## Merge to obtain decimalLat, decimalLon

# Merge
occ = occ.merge(sites[['Site', 'MPA_status', 'Latitude', 'Longitude']], how='left', left_on='locality', right_on='Site')

# Rename columns
occ.rename(columns={'MPA_status':'locationRemarks', 'Latitude':'decimalLatitude', 'Longitude':'decimalLongitude'}, inplace=True)

# Drop Site
occ.drop(columns='Site', inplace=True)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,locationRemarks,decimalLatitude,decimalLongitude
0,120Reef_20080830,2008-08-30,RCCA abalone size,120 Reef,US,MPA/REF,33.737919,-118.392014
1,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,MPA,36.278454,-121.880859
2,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,MPA,36.278454,-121.880859
3,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,MPA,36.278454,-121.880859
4,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,MPA,36.278454,-121.880859


According to the metadata on DataONE, locationRemarks can take on the following values:
- MPA = Site is in MPA
- REF = Site is outside of MPA and used as MPA reference site
- NaN = Site is not part of MPA monitoring
- MPA/REF = Site is in MPA and used as reference site for another MPA

Based on this, it sounds like MPA and MPA/REF should be converted to "marine protected area", and REF and NaN should be converted to "fished area".

**Note** that for the sites I added to the site table, I looked up manually whether the location was inside an MPA or not:
- Cayucos = fished
- Hurricane Ridge = fished
- LA Federal Breakwater = fished
- Ocean Cove Kelper = fished?
- Pier 400 = fished

In [13]:
## Clean locationRemarks

occ['locationRemarks'] = occ['locationRemarks'].replace({'MPA':'marine protected area',
                                                             'REF':'fished area',
                                                             np.nan:'fished area',
                                                             'MPA/REF':'marine protected area'})
occ.loc[occ['locality'].isin(['Cayucos', 'Hurricane Ridge', 'LA Federal Breakwater', 'Ocean Cove Kelper', 'Pier 400']), 'locationRemarks'] = 'fished area'
occ['locationRemarks'].unique()

array(['marine protected area', 'fished area'], dtype=object)

In [14]:
## Add coordinateUncertaintyInMeters

occ['coordinateUncertaintyInMeters'] = 250

In [15]:
## Add occurrenceID

occ['occurrenceID'] = data.groupby(['Site', 'SurveyDate'])['Classcode'].cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_occ' + occ['occurrenceID'].astype(str)

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,locationRemarks,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID
0,120Reef_20080830,2008-08-30,RCCA abalone size,120 Reef,US,marine protected area,33.737919,-118.392014,250,120Reef_20080830_occ1
1,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ1
2,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ2
3,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ3
4,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ4
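Since occurrenceID is built from a running count within each survey, a quick uniqueness check seems worthwhile. The sketch below reproduces the pattern on a three-row illustrative frame; unlike the cell above, it groups on `eventID` directly, which is my own simplification.

```python
import pandas as pd

# Three illustrative records: one survey with one abalone, one survey with two
demo = pd.DataFrame({
    'eventID': ['120Reef_20080830', 'AndrewMolera_20170627', 'AndrewMolera_20170627'],
})

# Running count within each event, starting at 1
counter = demo.groupby('eventID').cumcount() + 1
demo['occurrenceID'] = demo['eventID'] + '_occ' + counter.astype(str)

# Fails loudly if two records ever share an occurrenceID
assert demo['occurrenceID'].is_unique
print(demo)
```

The same `is_unique` check could be run on `occ['occurrenceID']` before saving.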


In [16]:
## Get unique common names

names = data['Classcode'].unique()
names

array(['pink abalone', 'red abalone', 'flat abalone', 'pinto abalone',
       'unknown abalone', 'green abalone', 'black abalone'], dtype=object)

In [17]:
## Load species table to obtain scientific names

filename = 'RCCA_invertebrate_lookup_table.csv'
# 'ansi' is only recognized on Windows; cp1252 is assumed here to be the equivalent code page
species = pd.read_csv(filename, encoding='cp1252')

print(species.shape)
species.head()

(32, 10)


Unnamed: 0,Kingdom,Phylum,Class,Order,Family,Genus,Species,Classcode,taxonomic_source,taxonomic_id
0,Animalia,Echinodermata,Asteroidea,Valvatida,Asterinidae,Patiria,miniata,bat star,www.marinespecies.org,382131
1,Animalia,Cnidaria,Anthozoa,Alcyonacea,Plexauridae,Muricea,fruticosa/californica,brown/golden gorgonian,www.marinespecies.org,177745
2,Animalia,Echinodermata,Holothuroidea,Synallactida,Stichopodidae,Parastichopus,californicus,CA sea cucumber,www.marinespecies.org,711954
3,Animalia,Arthropoda,Malacostraca,Decapoda,Palinuridae,Panulirus,interruptus,CA spiny lobster,www.marinespecies.org,382898
4,Animalia,Mollusca,Gastropoda,Littorinimorpha,Cypraeidae,Neobernaya,spadicea,chestnut cowry,www.marinespecies.org,580674


In [18]:
## Map scientific names to classcodes and create scientificName

# Create scientific name column in species
species['scientificName'] = species['Genus'] + ' ' + species['Species']

# Map scientific names to classcodes
subset = species[species['Classcode'].isin(names)].copy()
code_to_species_dict = dict(zip(subset['Classcode'], subset['scientificName']))

# Add missing code
code_to_species_dict['unknown abalone'] = 'Haliotis'

# Fix misspellings
code_to_species_dict['black abalone'] = 'Haliotis cracherodii'

# Create scientificName
occ['scientificName'] = data['Classcode'].replace(code_to_species_dict)

# Strip any whitespace
occ['scientificName'] = occ['scientificName'].str.strip()
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,locationRemarks,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,scientificName
0,120Reef_20080830,2008-08-30,RCCA abalone size,120 Reef,US,marine protected area,33.737919,-118.392014,250,120Reef_20080830_occ1,Haliotis corrugata
1,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ1,Haliotis rufescens
2,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ2,Haliotis rufescens
3,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ3,Haliotis rufescens
4,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ4,Haliotis rufescens


In [19]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(occ['scientificName'].unique(), verbose_flag=True)

In [20]:
## Add scientific name-related columns

occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,locationRemarks,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20080830,2008-08-30,RCCA abalone size,120 Reef,US,marine protected area,33.737919,-118.392014,250,120Reef_20080830_occ1,Haliotis corrugata,urn:lsid:marinespecies.org:taxname:445308,445308
1,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ1,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357
2,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ2,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357
3,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ3,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357
4,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,AndrewMolera_20170627_occ4,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357


In [21]:
## Add vernacularName

occ.insert(9, 'vernacularName', data['Classcode'])
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,locationRemarks,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,vernacularName,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20080830,2008-08-30,RCCA abalone size,120 Reef,US,marine protected area,33.737919,-118.392014,250,pink abalone,120Reef_20080830_occ1,Haliotis corrugata,urn:lsid:marinespecies.org:taxname:445308,445308
1,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ1,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357
2,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ2,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357
3,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ3,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357
4,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ4,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357


In [22]:
## Add final name-related columns

occ['nameAccordingTo'] = 'WoRMS'
occ['occurrenceStatus'] = 'present'
occ['basisOfRecord'] = 'HumanObservation'

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,locationRemarks,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,vernacularName,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord
0,120Reef_20080830,2008-08-30,RCCA abalone size,120 Reef,US,marine protected area,33.737919,-118.392014,250,pink abalone,120Reef_20080830_occ1,Haliotis corrugata,urn:lsid:marinespecies.org:taxname:445308,445308,WoRMS,present,HumanObservation
1,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ1,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,present,HumanObservation
2,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ2,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,present,HumanObservation
3,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ3,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,present,HumanObservation
4,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ4,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,present,HumanObservation


In [23]:
## Add depth

# Add eventID column to data
data['eventID'] = occ['eventID']

# Aggregate data to handle different transects with different depths
depth = data.groupby(['eventID']).agg({'Depth_ft':['min', 'max']})
depth.reset_index(inplace=True)
depth.columns = depth.columns.droplevel()
depth.rename(columns={'':'eventID'}, inplace=True)

# Join
occ = occ.merge(depth, how='left', on='eventID')
occ.rename(columns={'min':'minimumDepthInMeters', 'max':'maximumDepthInMeters'}, inplace=True)

# Convert from feet to meters
occ['minimumDepthInMeters'] = round(occ['minimumDepthInMeters']*0.3048, 1)
occ['maximumDepthInMeters'] = round(occ['maximumDepthInMeters']*0.3048, 1)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,locationRemarks,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,vernacularName,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,minimumDepthInMeters,maximumDepthInMeters
0,120Reef_20080830,2008-08-30,RCCA abalone size,120 Reef,US,marine protected area,33.737919,-118.392014,250,pink abalone,120Reef_20080830_occ1,Haliotis corrugata,urn:lsid:marinespecies.org:taxname:445308,445308,WoRMS,present,HumanObservation,7.6,7.6
1,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ1,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,present,HumanObservation,11.7,15.8
2,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ2,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,present,HumanObservation,11.7,15.8
3,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ3,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,present,HumanObservation,11.7,15.8
4,AndrewMolera_20170627,2017-06-27,RCCA abalone size,Andrew Molera,US,marine protected area,36.278454,-121.880859,250,red abalone,AndrewMolera_20170627_occ4,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,present,HumanObservation,11.7,15.8


**Note** that there are 326 records where depth was not available.

```python
occ[occ['minimumDepthInMeters'].isna()]
```
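As a sanity check on the feet-to-meters conversion, here is the arithmetic for the three depths that appear in the first rows of the data (25.0, 38.5 and 52.0 ft). The helper name `feet_to_meters` is just for illustration.

```python
FEET_TO_METERS = 0.3048  # exact by definition of the international foot

def feet_to_meters(depth_ft, ndigits=1):
    """Convert a depth in feet to meters, rounded to ndigits decimal places."""
    return round(depth_ft * FEET_TO_METERS, ndigits)

depths_m = [feet_to_meters(ft) for ft in (25.0, 38.5, 52.0)]
print(depths_m)
```

These give 7.6, 11.7 and 15.8 m, matching the minimum and maximum depths shown for 120 Reef and Andrew Molera above.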

## Save

In [24]:
## Save

occ.to_csv('RCCA_abalone_size_occurrence_20210212.csv', index=False, na_rep='NaN')

## Create MoF

I can include abalone sizes, temperature and visibility in MoF.

In [25]:
## Add eventID, occurrenceID and size

mof = pd.DataFrame({'eventID':occ['eventID']})
mof['occurrenceID'] = occ['occurrenceID']
mof['measurementType'] = 'diameter' 
mof['measurementValue'] = data['Size']
mof['measurementUnit'] = 'centimeters'
mof['measurementMethod'] = 'measured with calipers to the nearest millimeter; rounded to the nearest centimeter if prior to 2015 or south of San Francisco' 

print(mof.shape)
mof.head()

(20009, 6)


Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20080830,120Reef_20080830_occ1,diameter,10.0,centimeters,measured with calipers to the nearest millimet...
1,AndrewMolera_20170627,AndrewMolera_20170627_occ1,diameter,16.0,centimeters,measured with calipers to the nearest millimet...
2,AndrewMolera_20170627,AndrewMolera_20170627_occ2,diameter,20.0,centimeters,measured with calipers to the nearest millimet...
3,AndrewMolera_20170627,AndrewMolera_20170627_occ3,diameter,8.0,centimeters,measured with calipers to the nearest millimet...
4,AndrewMolera_20170627,AndrewMolera_20170627_occ4,diameter,10.0,centimeters,measured with calipers to the nearest millimet...


In [26]:
## Group temperature and visibility to event level (NOTE that there are different visibility measures for one event, but not for temperature)

# Aggregate (using np.mean)
event_mof = data.groupby(['eventID']).agg({
    'Temp10m':[lambda x: round(np.mean(x), 1)],
    'Visibility':[lambda x: round(np.mean(x), 0)]
})

# Tidy
event_mof.reset_index(inplace=True)
event_mof.columns = event_mof.columns.droplevel()
event_mof.columns = ['eventID', 'Temp10m_mean', 'Visibility_mean']

event_mof.head()

Unnamed: 0,eventID,Temp10m_mean,Visibility_mean
0,120Reef_20080830,17.0,4.0
1,120Reef_20180710,,4.0
2,AlbionCove_20180508,,5.0
3,AlbionCove_20190504,,3.0
4,AndrewMolera_20170627,10.0,5.0


In [27]:
## Add temperature

temp = pd.DataFrame({'eventID':event_mof['eventID']})
temp['occurrenceID'] = np.nan
temp['measurementType'] = 'temperature'
temp['measurementValue'] = event_mof['Temp10m_mean']
temp['measurementUnit'] = 'degrees Celsius'
temp['measurementMethod'] = 'measured by dive computer at 10 m depth, or at the seafloor if shallower than 10 m'

# Drop events that lack temperature data
temp.dropna(subset=['measurementValue'], inplace=True)

temp.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20080830,,temperature,17.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
4,AndrewMolera_20170627,,temperature,10.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
7,Aquarium_20070525,,temperature,10.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
9,Aquarium_20080517,,temperature,9.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
11,Aquarium_20090524,,temperature,12.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."


In [28]:
## Add visibility

vis = pd.DataFrame({'eventID':event_mof['eventID']})
vis['occurrenceID'] = np.nan
vis['measurementType'] = 'average visibility'
vis['measurementValue'] = event_mof['Visibility_mean']
vis['measurementUnit'] = 'meters'
vis['measurementMethod'] = 'determined by divers by measuring the distance from which the fingers on a hand held up into the water column can be counted accurately'

# Drop events that lack visibility data (32 events)
vis.dropna(subset=['measurementValue'], inplace=True)

vis.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20080830,,average visibility,4.0,meters,determined by divers by measuring the distance...
1,120Reef_20180710,,average visibility,4.0,meters,determined by divers by measuring the distance...
2,AlbionCove_20180508,,average visibility,5.0,meters,determined by divers by measuring the distance...
3,AlbionCove_20190504,,average visibility,3.0,meters,determined by divers by measuring the distance...
4,AndrewMolera_20170627,,average visibility,5.0,meters,determined by divers by measuring the distance...


**Note** that there are 295 events where multiple visibility measures were reported. I've averaged them here.

```python
# Count distinct visibility values per event in the raw data (vis already has one row per event)
out = data.groupby('eventID')['Visibility'].nunique()
out[out > 1]

# View a specific example
data[data['eventID'] == 'AlbionCove_20190504']
```

In [29]:
## Concatenate

mof = pd.concat([mof, temp, vis])
print(mof.shape)
mof.head()

(21016, 6)


Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20080830,120Reef_20080830_occ1,diameter,10.0,centimeters,measured with calipers to the nearest millimet...
1,AndrewMolera_20170627,AndrewMolera_20170627_occ1,diameter,16.0,centimeters,measured with calipers to the nearest millimet...
2,AndrewMolera_20170627,AndrewMolera_20170627_occ2,diameter,20.0,centimeters,measured with calipers to the nearest millimet...
3,AndrewMolera_20170627,AndrewMolera_20170627_occ3,diameter,8.0,centimeters,measured with calipers to the nearest millimet...
4,AndrewMolera_20170627,AndrewMolera_20170627_occ4,diameter,10.0,centimeters,measured with calipers to the nearest millimet...


In [32]:
## Replace NaN with '' in string fields

mof['occurrenceID'] = mof['occurrenceID'].replace(np.nan, '')
mof.isna().sum()

eventID              0
occurrenceID         0
measurementType      0
measurementValue     0
measurementUnit      0
measurementMethod    0
dtype: int64

## Save

In [33]:
## Save

mof.to_csv('RCCA_abalone_size_MoF_20210212.csv', index=False, na_rep='NaN')

## Questions

1. Verify that for these surveys, abalone can be found anywhere in the site (not just on transect). **From Dan: First, when we do kelp forest monitoring invert transects, we count and size all species of abalone to the nearest cm. Many times the abalone is back in a crack and can’t be measured, but we still count it for density. We have been doing these surveys since 2006. In 2016 we started doing additional abalone size-frequency surveys for just red abalone, just north of the Golden Gate, using specialized calipers that measure to the nearest mm.**
2. I thought this was only supposed to be for red abalone. When did you start tracking other species? **See the above response. Dan says that even though some abalone are found during transects, it's probably OK to exclude transect numbers for this data set. THIS LIKELY HAS THE SAME DUPLICATED RECORDS ISSUE AS THE PISCO DATA, WHERE AN INDIVIDUAL ABALONE CAN APPEAR TWICE: ONCE IN THE TRANSECT DATA AND ONCE IN THE SIZE-FREQUENCY DATA. HOW IMPORTANT IS THIS? DO I NEED TO DEAL WITH IT?**
3. Verify that the widest width of the shell is what's measured. What if there are fewer than 250 abalone at a site? Is there a search time associated with this survey type (especially if the survey is done independently of other transect-based surveys)? **Yes, the widest width is measured. There is not really a rigorously monitored search time associated with these surveys. During the red abalone surveys that started in 2016, they just try to measure as many as they can, and usually get a good number, although this is getting harder. Last year (2019) they averaged 100 individuals per survey.**
4. Some records are missing latitude and longitude, even though these are known and provided for other records from the same site.
5. Note that there were some surveys where visibility values differed by transect. I've averaged these for the MoF file.

## Find number of years each MPA was surveyed

In [17]:
surveys_per_year = data.groupby(['Site', 'Year'], as_index=False)['SurveyDate'].nunique() # 1, 2, or 4
sites_and_years = data[['Site', 'Year']].drop_duplicates()
merged = sites_and_years.merge(sites.loc[sites['CA_MPA_Name_Short'].notna(), ['Site', 'CA_MPA_Name_Short']], how='left', on='Site')
merged = merged[merged['CA_MPA_Name_Short'].notna()]
num_years_per_site = merged.groupby(['CA_MPA_Name_Short', 'Site'], as_index=False)['Year'].nunique()
num_years_per_mpa = merged.groupby('CA_MPA_Name_Short', as_index=False)['Year'].nunique()
num_years_per_mpa = num_years_per_mpa.sort_values('CA_MPA_Name_Short')
num_years_per_mpa.to_csv('rcca_abalone_size_years_per_mpa.csv', index=False)