# Reef Check - invertebrate data conversion

For each survey site, Reef Check performs 6 core transects where divers record inverts, kelp (plus presence/absence of invasives), UPC and fish. 12 additional fish-only transects, and abalone and urchin size surveys, are performed separately. For this reason, I originally thought it would be reasonable to have four converted datasets: 
1. Core transect data
2. Fish-only transect data
3. Urchin size data
4. Abalone size data

After some thought, however, I decided **it makes more sense to continue to group the data by organism type** (inverts, kelp, fish, UPC). The fact that certain organisms were found on the same core transect does provide some spatial information. But because that information is not preserved across surveys, statements about abundance and variability only make sense at the site level. Aside from the requirement that transects at a given site be deep or shallow, and approximately parallel to shore following a depth contour, nothing ensures that transect 1 at site A in 2001 matches transect 1 at site A in 2002.

**Note** that the urchin and abalone size data should also be converted separately. These surveys are not conducted as part of the core transects, so it would be misleading to attach presence or size information from them to the converted inverts data.

**Resources:**
- https://dwc.tdwg.org/terms/#occurrence
- https://reefcheck.org/
- https://reefcheck.org/PDFs/RCCAmanual9thedition.pdf

In [102]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates

In [103]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [104]:
## Load data

# path = 'C:\\Users\\dianalg\\Documents\\Work\\MBARI\\MPA Data Integration\\Reef Check\\'
filename = 'RCCA_invert_data.csv'
data = pd.read_csv(filename)

data.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Amount,Distance,Lat,Lon,Depth_ft,Region,Temp10m,Heading,Visibility
0,120 Reef,8,10,2006,8-Oct-06,1,bat star,13.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
1,120 Reef,8,10,2006,8-Oct-06,1,black abalone,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
2,120 Reef,8,10,2006,8-Oct-06,1,brown/golden gorgonian,11.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
3,120 Reef,8,10,2006,8-Oct-06,1,ca sea cucumber,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
4,120 Reef,8,10,2006,8-Oct-06,1,ca spiny lobster,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0


### Information on column definitions from Reef Check's metadata files

**Site** = The unique site code that indicates where the survey was performed. This site code refers to a specific entry in the site table. <br>
**Day** = The day that the survey was done. This date is expressed in D or DD format. Dates reflect measurements taken in local time.<br>
**Month** = The month that the survey was done. This month is expressed in M or MM format. Dates reflect measurements taken in local time.<br>
**Year** = The year that the survey was done. This year is expressed in YYYY format. Dates reflect measurements taken in local time.<br>
**SurveyDate** = The date that the survey was completed.<br>
**Transect** = A number representing one of the parallel transects through the study site. Core transects (i.e. transects at which fish, invertebrate, algae, and substrate data is collected) are numbered 1 - 6 with the transects in the offshore zone numbered as 1-3 and the inshore core transects numbered 4 - 6. Fish-only transects are numbered 7 - 18 with the offshore fish only transects numbered 7 - 12 and the inshore fish only transects numbered 13 - 18.<br>
**Species** = The unique taxonomic classification code that is being counted. The taxonomy of the species is defined in the species lookup table.<br>
**Amount** = Total number of individuals of a given classcode counted within the distance indicated in the Distance column along a transect.<br>
**Distance** = Distance along transect over which individuals of a given classcode were counted. When this distance is less than 30m, the species was sub-sampled at about 50 individuals. To generate densities for a 60 square meter area, the 'amount' variable needs to be divided by the 'distance' variable and multiplied by 30.<br>
**Lat** = Latitude of the site.<br>
**Lon** = Longitude of the site.<br>
**Depth_ft** = Average depth of a transect in feet as measured by diver using dive computer.<br>
**Region** = MLPA region<br>
**Temp10m** = The water temperature at the sites during the survey measured using a dive computer at 10 meter depth or the seafloor if site is shallower than 10 meters. Measured in degrees Celsius.<br>
**Heading** = General compass heading of the transect<br>
**Visibility** = Visibility in meters at the transect location as determined by divers by measuring the distance from which the fingers on a hand held up into the water column can be counted.<br>
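
As a quick illustration of the density conversion described under **Distance**, the sketch below scales a count to the full 30 m transect. The function name `density_per_60m2` and the `transect_length_m` parameter are my own labels, not part of Reef Check's metadata.

```python
def density_per_60m2(amount, distance_m, transect_length_m=30):
    """Scale a (possibly sub-sampled) count to the full transect length."""
    return amount / distance_m * transect_length_m

# A full-length count is unchanged; a count sub-sampled over 10 m is tripled.
full = density_per_60m2(13, 30)
subsampled = density_per_60m2(50, 10)
print(full, subsampled)
```

In other words, a species sub-sampled at 50 individuals over 10 m is reported as 150 individuals per 60 m2.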

## Convert

The **event** is a transect at a particular site and the **occurrence** is the observation of an animal along that transect.

### Get site names (will use to construct eventID)

In [105]:
## Load site table

filename = 'RCCA_site_table.csv'
sites = pd.read_csv(filename)

sites.head()

Unnamed: 0,IDNumSite,SiteName,County,CityIsland,State,DateOfEntry,Comments,AvgDepthSite,DistFromShore,Lat,Lon,Location,mpa,ProtectionStatus
0,1,120 Reef,Los Angeles,Los Angeles,CA,3/5/2008 0:00,,8,0,33.737919,-118.392014,South,Abalone Cove SMCA,State Marine Conservation Area
1,2,Abalone Cove,Los Angeles,Palos Verdes,CA,3/5/2008 0:00,,9,0,33.736149,-118.37632,South,Abalone Cove SMCA,State Marine Conservation Area
2,3,Aquarium,Monterey,Monterey,CA,3/5/2008 0:00,,10,0,36.619232,-121.899414,Central,Edward F. Ricketts SMCA,State Marine Conservation Area
3,4,Big Creek,Monterey,Santa Lucia,CA,3/5/2008 0:00,,13,0,36.069183,-121.600601,Central,Big Creek SMR,State Marine Reserve
4,5,Big Rock,Los Angeles,Malibu,CA,3/5/2008 0:00,,8,0,34.035168,-118.608086,South,Point Dume SMR,


In [106]:
## Let's use the SiteName column, and then add in locality or something similar later

# Create site_names
site_names = [name.replace(' ', '') for name in sites['SiteName']]

# Map site_names to names in data
site_name_dict = dict(zip(sites['SiteName'], site_names))

# Create a new column of site_names in data
# (assign the result rather than calling replace(inplace=True) on a column selection)
data['SiteName'] = data['Site'].replace(site_name_dict)

### Create eventID

In [107]:
## Pad month and day as needed

# zfill(2) left-pads single-digit values with a zero; lists keep positional indexing below
paddedDay = data['Day'].astype(str).str.zfill(2).tolist()
paddedMonth = data['Month'].astype(str).str.zfill(2).tolist()

In [108]:
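## Create eventID
# NOTE: the date portion is built as year + day + month (e.g. 20060810 for 8 Oct 2006), not YYYYMMDD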

eventID = [data['SiteName'].iloc[i] + '_' + str(data['Year'].iloc[i]) + paddedDay[i] + paddedMonth[i] + '_' + str(data['Transect'].iloc[i]) for i in range(len(data['Site']))]
converted = pd.DataFrame({'eventID':eventID})
converted.head()

Unnamed: 0,eventID
0,120Reef_20060810_1
1,120Reef_20060810_1
2,120Reef_20060810_1
3,120Reef_20060810_1
4,120Reef_20060810_1
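
For reference, the same eventID can be built with vectorized string operations instead of a row-by-row list comprehension. This is only a sketch on made-up rows (the second row is invented for illustration), and the helper `two_digit` is a name I introduced here.

```python
import pandas as pd

rows = pd.DataFrame({
    "SiteName": ["120Reef", "BigCreek"],
    "Year": [2006, 2010],
    "Month": [10, 7],
    "Day": [8, 23],
    "Transect": [1, 4],
})

def two_digit(col):
    """Render an integer column as zero-padded two-character strings."""
    return col.astype(str).str.zfill(2)

# Same field order as the cell above: year, then day, then month.
event_id = (
    rows["SiteName"] + "_"
    + rows["Year"].astype(str) + two_digit(rows["Day"]) + two_digit(rows["Month"])
    + "_" + rows["Transect"].astype(str)
)
print(event_id.tolist())
```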


### eventDate

In [109]:
## Reformat SurveyDate

eventDate = [datetime.strptime(dt, '%d-%b-%y').date().isoformat() for dt in data['SurveyDate']]
converted['eventDate'] = eventDate
converted.head()

Unnamed: 0,eventID,eventDate
0,120Reef_20060810_1,2006-10-08
1,120Reef_20060810_1,2006-10-08
2,120Reef_20060810_1,2006-10-08
3,120Reef_20060810_1,2006-10-08
4,120Reef_20060810_1,2006-10-08
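
Since the data carry the date twice (as Day/Month/Year and as SurveyDate), it may be worth checking that the two agree before relying on either. The sketch below does this on a toy frame in which the second row is deliberately inconsistent; I have not run this check on the real data.

```python
import pandas as pd

sample = pd.DataFrame({
    "Day": [8, 9],
    "Month": [10, 10],
    "Year": [2006, 2006],
    "SurveyDate": ["8-Oct-06", "8-Oct-06"],
})

# pd.to_datetime can assemble dates from columns named year, month and day
from_parts = pd.to_datetime(
    sample[["Year", "Month", "Day"]].rename(columns=str.lower)
)
from_string = pd.to_datetime(sample["SurveyDate"], format="%d-%b-%y")

# Rows where the two representations disagree
mismatched = sample[from_parts != from_string]
print(len(mismatched))
```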


### Add Location information

In [110]:
## View counties in data

sites['County'].unique()

array(['Los Angeles', 'Monterey', 'San Diego', 'Ventura', 'Santa Barbara',
       'Sonoma', 'Orange', 'Mendocino', 'San Luis Obispo', 'Humboldt',
       'LA', 'Test', 'Santa Cruz', 'San Mateo', 'Del Norte', 'Curry, OR'],
      dtype=object)

In [111]:
## Remove test county

sites = sites[sites['County'] != 'Test']

In [112]:
## Map to verified county names in the Getty Thesaurus of Geographic Names

# County names according to Getty Thesaurus of Geographic Names
county_names_dict = dict(zip(sites['County'].unique(), sites['County'].unique()))
county_names_dict['LA'] = 'Los Angeles'
county_names_dict['Curry, OR'] = 'Curry'

# County IDs according to Getty Thesaurus of Geographic Names 
county_ids_dict = {'Los Angeles':1002608,
                   'Monterey':1002684,
                   'San Diego':1002858,
                   'Ventura':1002972,
                   'Santa Barbara':1002867,
                   'Sonoma':7014516,
                   'Orange':1002748,
                   'Mendocino':2000185,
                   'San Luis Obispo':1002863,
                   'Humboldt':2000181,
                   'LA':1002608,
                   'Santa Cruz':1002869,
                   'San Mateo':1002864,
                   'Del Norte':2000180,
                   'Curry, OR':2001704                  
                  }

In [113]:
## Add county, state, IDs and authority

# County
site_to_county_dict = dict(zip(sites['SiteName'].str.strip(), sites['County'].str.strip()))
converted['county'] = data['Site'].replace(site_to_county_dict).replace(county_names_dict)

# State
# NOTE: the Darwin Core term is spelled 'stateProvince'; this column needs renaming before publishing
converted['stateProvence'] = 'California'
converted.loc[converted['county'] == 'Curry', 'stateProvence'] = 'Oregon'

# ID
# Site names not in site table: 'Cueva Valdez', 'Point Vicente East', 'Point Vicente West'
converted['locationID'] = data['Site'].replace(site_to_county_dict).replace(county_ids_dict)

# authority
converted['locationAccordingTo'] = 'Getty Thesaurus of Geographic Names'

converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names
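
Because `replace` leaves unmatched values untouched, any site missing from the site table keeps its site name in the county and locationID columns. A set difference is one way to list those sites up front. The frames below are toy stand-ins for `data` and `sites`, not the real tables.

```python
import pandas as pd

survey = pd.DataFrame({"Site": ["120 Reef", "Cueva Valdez", "Big Creek", "Cueva Valdez"]})
site_table = pd.DataFrame({"SiteName": ["120 Reef", "Big Creek "]})

# Strip whitespace on both sides so trailing spaces don't cause false misses
known = set(site_table["SiteName"].str.strip())
missing = sorted(set(survey["Site"].str.strip()) - known)
print(missing)
```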


In [114]:
## Add locality

site_to_mpa_dict = dict(zip(sites['SiteName'], sites['mpa']))
converted['locality'] = data['Site'].replace(site_to_mpa_dict)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA


In [115]:
## Add site names as verbatimLocality

converted['verbatimLocality'] = data['Site']
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef


In [116]:
## Add lat, lon

converted['decimalLatitude'] = round(data['Lat'], 4)
converted['decimalLongitude'] = round(data['Lon'], 4)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392


**Note:** I don't think coordinateUncertaintyInMeters is relevant here. Zach from the Standardizing Marine Biological Data working group suggested it would be the length of the transect, but that assumes that the coordinates correspond to the transect start location. Here, the coordinates correspond to the site location, and I'm not sure whether it's even the "middle" of the site, or what the extent of each site is.

### Specify whether transect was inshore or offshore using locationRemarks

In [117]:
## Add locationRemarks

converted['locationRemarks'] = 'inshore zone'
converted.loc[data['Transect'] < 4, 'locationRemarks'] = 'offshore zone'
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,locationRemarks
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone
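
The cell above labels every transect numbered 4 or higher as inshore, which fits core transects 1 - 6. If fish-only transects (7 - 18) were ever run through the same step, the zones would need the full numbering from the Transect definition. A minimal sketch, with `transect_zone` as a hypothetical helper name:

```python
def transect_zone(transect):
    """Return the zone for a Reef Check transect number (1-18)."""
    if 1 <= transect <= 3 or 7 <= transect <= 12:
        return "offshore zone"
    if 4 <= transect <= 6 or 13 <= transect <= 18:
        return "inshore zone"
    raise ValueError(f"unexpected transect number: {transect}")

# One example from each numbering block
zones = [transect_zone(t) for t in (1, 4, 7, 13)]
print(zones)
```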


In [118]:
## Add depth

converted['verbatimDepth'] = round(data['Depth_ft']*0.3048)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,locationRemarks,verbatimDepth
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0


Since these depths were taken by divers using dive computers, I think it's reasonable to round to the nearest meter. **Note** that I assume these depths are not corrected for tidal height. 
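
To make the conversion concrete, here is a small sketch of the feet-to-meters step. The helper name `feet_to_meters` is mine and does not appear in the cell above.

```python
FEET_TO_METERS = 0.3048  # exact by definition of the international foot

def feet_to_meters(depth_ft):
    """Convert a depth in feet to the nearest whole meter."""
    return round(depth_ft * FEET_TO_METERS)

# The 28 ft transect from the first rows of the data
depth_m = feet_to_meters(28)
print(depth_m)
```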

### Add occurrence data

In [119]:
## Add occurrenceID

converted['occurrenceID'] = range(1, converted.shape[0]+1)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,locationRemarks,verbatimDepth,occurrenceID
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,1
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,2
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,3
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,4
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,5


In [120]:
## Load species table

filename = 'RCCA_invert_species_table.csv'
species = pd.read_csv(filename)

species.head()

Unnamed: 0,SC_project_short_code,Kingdom,Phylum,Class,Order,Family,Genus,Species,Classcode
0,SC_CitSci_Kelp,Animalia,Echinodermata,Asteroidea,Spinulosida,Asterinidae,Patiria,miniata,bat star
1,SC_CitSci_Kelp,Animalia,Cnidaria,Anthozoa,Alcyonacea,Plexauridae,Muricea,fruticosa/californica,brown/golden gorgonian
2,SC_CitSci_Kelp,Animalia,Echinodermata,Holothuroidea,Aspidochirotida,Stichopodidae,Parastichopus,californicus,CA sea cucumber
3,SC_CitSci_Kelp,Animalia,Arthropoda,Malacostraca,Decapoda,Palinuridae,Panulirus,interruptus,CA spiny lobster
4,SC_CitSci_Kelp,Animalia,Mollusca,Gastropoda,Neotaeinoglossa,Cypraeidae,Cypraea,spadicea,chestnut cowry


In [121]:
## Map scientific names to classcodes

# Create scientific name column in species
species['scientificName'] = species['Genus'] + ' ' + species['Species']

# Fix species names where Genus and Species were NaN
species.loc[species['Family'] == 'Actiniidae', 'scientificName'] = 'Actiniidae' 
species.loc[species['Genus'] == 'Solaster/Pycnopodia', 'scientificName'] = 'Solaster/Pycnopodia spp.'

# Create map
code_to_species_dict = dict(zip(species['Classcode'], species['scientificName']))

In [122]:
## Create scientificName column

converted['scientificName'] = data['Species'].replace(code_to_species_dict)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,locationRemarks,verbatimDepth,occurrenceID,scientificName
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,1,Patiria miniata
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,2,Haliotis craherodii
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,3,Muricea fruticosa/californica
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,4,ca sea cucumber
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,5,ca spiny lobster


In [123]:
## Get unique scientific names for lookup in WoRMS

names = converted['scientificName'].unique()

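**Note** that some classcodes in the data did not match the species table: ca sea cucumber, ca spiny lobster, kellet's whelk, unknown abalone. The first two do appear in the species table, but as 'CA sea cucumber' and 'CA spiny lobster', so they fail to match only because the capitalization differs. Maybe I can manage these by hand?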

California sea cucumber matches to Apostichopus californicus <br>
California spiny lobster matches to Panulirus interruptus <br>
Kellet's whelk does not match in WoRMS, but likely Kelletia kelletii <br>
Unknown Abalone can be designated Haliotis

Also **note** that there are a number of names that are not specific at the species level. These will match at the genus level, but I may want to include occurrenceRemarks:
- Muricea fruticosa/californica
- Loxorhynchus grandis/crispatus
- Lithopoma undosa/gibberosa

**Assumed misspellings:**
- Haliotis craherodii --> Haliotis cracherodii
- Haliotis kantschatkana --> Haliotis kamtschatkana

**Assumed old names:**
- Lophogorgia chilensis (Lophogorgia has status 'unaccepted' by WoRMS) should be Leptogorgia chilensis
- Parastichopus parvimensis, although Parastichopus does match as a genus of sea cucumbers. Apostichopus parvimensis matches - **IS THIS CORRECT?**
- Lithopoma undosa, although Lithopoma does match as a genus of snail. Megastraea undosa matches - **IS THIS CORRECT?**

Finally, I assume seastars of genus Solaster and genus Pycnopodia cannot be distinguished by divers. These animals do not share a family or order, but are both of class Asteroidea. I can include the possible genera in occurrenceRemarks.
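
The slash-separated names listed above could also be split programmatically into a genus plus a remark. The sketch below is one way to do it; `split_ambiguous_name` is a hypothetical helper, and it only handles a slash in the species epithet. Cases like Lithopoma undosa/gibberosa (where one option has since been renamed) or Solaster/Pycnopodia spp. would still need to be handled by hand.

```python
def split_ambiguous_name(name):
    """Split 'Genus a/b' into (genus, 'Genus a or Genus b').

    Names without a slash in the epithet are returned unchanged with no remark.
    """
    genus, _, epithets = name.partition(" ")
    if "/" not in epithets:
        return name, None
    options = [f"{genus} {epithet}" for epithet in epithets.split("/")]
    return genus, " or ".join(options)

print(split_ambiguous_name("Muricea fruticosa/californica"))
print(split_ambiguous_name("Patiria miniata"))
```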

In [124]:
## Add manually identified scientific names to names; correct spelling errors

# Map each unmatched or misspelled name to its corrected form
name_corrections = {'ca sea cucumber':'Apostichopus californicus',
                    'ca spiny lobster':'Panulirus interruptus',
                    "kellet's whelk":'Kelletia kelletii',
                    'unknown abalone':'Haliotis',
                    'Haliotis craherodii':'Haliotis cracherodii',
                    'Haliotis kantschatkana':'Haliotis kamtschatkana',
                    'Lophogorgia chilensis':'Leptogorgia chilensis',
                    'Parastichopus parvimensis':'Apostichopus parvimensis',
                    'Solaster/Pycnopodia spp.':'Asteroidea'}

# Correct the unique names that will be looked up in WoRMS
names = np.array([name_corrections.get(name, name) for name in names], dtype=object)

# Also correct names in converted scientificName column
converted['scientificName'] = converted['scientificName'].replace(name_corrections)

In [125]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Muricea fruticosa/californica checking:  Muricea
Url didn't work for Cancer spp. checking:  Cancer
Url didn't work for Loxorhynchus grandis/crispatus checking:  Loxorhynchus
Url didn't work for Lithopoma undosa/gibberosa checking:  Lithopoma
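
Judging only from the messages above, the lookup seems to retry with the first word of a name (the genus) when the full name fails. I'm not showing the internals of my `WoRMS` module here, so the following is just a sketch of that fallback pattern. `match_with_genus_fallback` and the dictionary standing in for the WoRMS query are both hypothetical.

```python
def match_with_genus_fallback(name, lookup):
    """Try the full name first, then fall back to its first word (the genus)."""
    record = lookup(name)
    if record is not None:
        return name, record
    genus = name.split()[0]
    return genus, lookup(genus)

# Stand-in for a real WoRMS query: a dict lookup that returns None on a miss.
fake_worms = {"Muricea": 177745, "Patiria miniata": 382131}
matched = [
    match_with_genus_fallback(n, fake_worms.get)
    for n in ("Patiria miniata", "Muricea fruticosa/californica")
]
print(matched)
```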


In [126]:
## Add scientific name-related columns

converted['scientificNameID'] = converted['scientificName'].replace(name_id_dict)

converted['taxonID'] = converted['scientificName'].replace(name_taxid_dict)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,locationRemarks,verbatimDepth,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,3,Muricea fruticosa/californica,urn:lsid:marinespecies.org:taxname:177745,177745
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,4,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898


In [127]:
## Create occurrenceRemarks

remarks_dict = {'Muricea fruticosa/californica':'Muricea fruticosa or Muricea californica',
               'Loxorhynchus grandis/crispatus':'Loxorhynchus grandis or Loxorhynchus crispatus',
               'Lithopoma undosa/gibberosa':'Megastraea undosa or Lithopoma gibberosa',
               'Asteroidea':'Solaster spp. or Pycnopodia spp.'}

# Built before scientificName is overwritten with the matched names; attached to converted later
occurrenceRemarks = [remarks_dict[name] if name in remarks_dict else np.nan for name in converted['scientificName']]
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,locationRemarks,verbatimDepth,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,3,Muricea fruticosa/californica,urn:lsid:marinespecies.org:taxname:177745,177745
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,4,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898


In [128]:
## Replace scientificName using name_name_dict

converted['scientificName'] = converted['scientificName'].replace(name_name_dict)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,locationRemarks,verbatimDepth,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,4,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898


In [129]:
## Add final name-related columns

converted['nameAccordingTo'] = 'WoRMS'
converted['occurrenceStatus'] = 'present'
converted['basisOfRecord'] = 'HumanObservation'
converted['occurrenceRemarks'] = occurrenceRemarks

converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,locationRemarks,verbatimDepth,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,occurrenceRemarks
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,present,HumanObservation,
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,4,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363,WoRMS,present,HumanObservation,
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,offshore zone,9.0,5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,present,HumanObservation,


### Add measurementOrFacts

In [130]:
## Density

converted['density'] = round((data['Amount']/data['Distance'])*30, 2)
converted['densityUnits'] = 'individuals per 60 m2'
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,...,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,occurrenceRemarks,density,densityUnits
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,13.0,individuals per 60 m2
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,present,HumanObservation,,0.0,individuals per 60 m2
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,11.0,individuals per 60 m2
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,4,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363,WoRMS,present,HumanObservation,,0.0,individuals per 60 m2
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,present,HumanObservation,,0.0,individuals per 60 m2


**Note:** There are some records where density is NaN. What does this indicate?
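
One way to narrow down the NaN question is to check which input to the density calculation is responsible. The sketch below is a minimal illustration on made-up rows, not the real Reef Check table: only the `Amount` and `Distance` column names come from this notebook, while `toy` and `nanCause` are hypothetical names introduced here. It assumes NaN can arise from a missing count, a missing distance, or a 0/0 division.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the raw data; the values are invented
toy = pd.DataFrame({
    'Amount':   [13.0, 0.0, np.nan, 5.0, 0.0],
    'Distance': [30.0, 30.0, 30.0, np.nan, 0.0],
})

# Same calculation as the density cell
toy['density'] = ((toy['Amount'] / toy['Distance']) * 30).round(2)

# Label the likely cause of each NaN density (first matching condition wins)
toy['nanCause'] = np.select(
    [toy['Amount'].isna(), toy['Distance'].isna(), toy['Distance'] == 0],
    ['Amount missing', 'Distance missing', 'Distance is zero'],
    default='',
)

# Rows with a valid density need no label
toy.loc[toy['density'].notna(), 'nanCause'] = ''

print(toy)
```

Running the same labelling on `data` and `converted` and then calling `value_counts()` on the label column would show whether the NaN records share a single cause.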

**I'm assuming that a density of 0 indicates 'absent.'**

In [131]:
## Assign an occurrenceStatus of 'absent' to records where density = 0
# Records with a NaN density do not satisfy this comparison, so they keep their existing status

converted.loc[converted['density'] == 0, 'occurrenceStatus'] = 'absent'

In [132]:
## Temperature

converted['temperatureInCelsius'] = data['Temp10m']
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,...,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,occurrenceRemarks,density,densityUnits,temperatureInCelsius
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,13.0,individuals per 60 m2,15.0
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,absent,HumanObservation,,0.0,individuals per 60 m2,15.0
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,11.0,individuals per 60 m2,15.0
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,Apostichopus californicus,urn:lsid:marinespecies.org:taxname:529363,529363,WoRMS,absent,HumanObservation,,0.0,individuals per 60 m2,15.0
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,absent,HumanObservation,,0.0,individuals per 60 m2,15.0


In [133]:
## Visibility

converted['visibilityInMeters'] = data['Visibility']
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,...,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,occurrenceRemarks,density,densityUnits,temperatureInCelsius,visibilityInMeters
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,13.0,individuals per 60 m2,15.0,7.0
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,absent,HumanObservation,,0.0,individuals per 60 m2,15.0,7.0
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,11.0,individuals per 60 m2,15.0,7.0
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,urn:lsid:marinespecies.org:taxname:529363,529363,WoRMS,absent,HumanObservation,,0.0,individuals per 60 m2,15.0,7.0
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,...,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,absent,HumanObservation,,0.0,individuals per 60 m2,15.0,7.0
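
The cells above store each measurement as an extra column on the occurrence table. If a long-format measurementOrFact table is needed later, the wide columns can be reshaped with `melt`. This is only a sketch under assumptions: `wide`, `units` and `mof` are hypothetical names, the two rows are copied from the preview above, and the unit strings for temperature and visibility are my own guesses based on the column names. `measurementType`, `measurementValue` and `measurementUnit` are Darwin Core terms.

```python
import pandas as pd

# Hypothetical two-row stand-in for the converted table
wide = pd.DataFrame({
    'occurrenceID': [1, 2],
    'density': [13.0, 0.0],
    'temperatureInCelsius': [15.0, 15.0],
    'visibilityInMeters': [7.0, 7.0],
})

# Unit for each measurement column (temperature and visibility units are assumed)
units = {
    'density': 'individuals per 60 m2',
    'temperatureInCelsius': 'degrees Celsius',
    'visibilityInMeters': 'meters',
}

# One row per occurrence per measurement
mof = wide.melt(
    id_vars='occurrenceID',
    value_vars=list(units),
    var_name='measurementType',
    value_name='measurementValue',
)
mof['measurementUnit'] = mof['measurementType'].map(units)
mof = mof.sort_values(['occurrenceID', 'measurementType']).reset_index(drop=True)

print(mof)
```

Keeping the wide columns, as this notebook does, is simpler to read; the long layout mainly helps when measurements need to be shared as a separate extension table.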


## Save

In [134]:
## Save

converted.to_csv('InvertebrateDensity_ReefCheck_2006-2017.csv', index=False, na_rep='NaN')
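
Because missing values are written as the literal string `NaN`, it may be worth confirming that they come back as missing when the file is read again. The sketch below is a minimal round-trip check on a hypothetical two-row table, written to an in-memory buffer so that no file is touched. It relies on pandas treating `NaN` as a missing-value marker by default in `read_csv`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in with one missing locality and one missing density
toy = pd.DataFrame({
    'occurrenceID': [1, 2],
    'locality': ['Abalone Cove SMCA', np.nan],
    'density': [13.0, np.nan],
})

# Write with the same options as the save cell, but to memory
buffer = io.StringIO()
toy.to_csv(buffer, index=False, na_rep='NaN')

# Read it back and count the missing values per column
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.isna().sum())
```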

## Remaining Issues

**Questions:**
1. Depths do not take tidal height into account, correct?
2. Parastichopus parvimensis did not match in WoRMS, although Parastichopus does match as a genus of sea cucumbers. Apostichopus parvimensis matches - **IS THIS CORRECT?**
3. Lithopoma undosa did not match in WoRMS, although Lithopoma does match as a genus of snail. Megastraea undosa matches - **IS THIS CORRECT?**
4. I assume seastars of genus Solaster and genus Pycnopodia cannot be distinguished by divers.
5. Was there a set list of species looked for every year? I'm assuming that records with density = 0 are 'absent' records. What do the records where density = NaN indicate?

In [137]:
converted['locality'].unique()

array(['Abalone Cove SMCA', 'Point Sur SMR', 'Edward F. Ricketts SMCA',
       'Big Creek SMR', 'Point Dume SMR',
       'Blue Cavern (Catalina Island) SMCA', 'Blue Cavern SMCA',
       'Cabrillo SMR', nan, 'Carmel Bay SMCA',
       'Casino Point (Catalina Island) SMCA', 'Point Cabrillo SMR',
       'Anacapa Island Special Closure', 'Morro Bay SMRMA',
       'Point Vicente SMCA', 'Pacific Grove Marine Gardens SMCA',
       'Crystal Cove SMCA', 'Cueva Valdez', 'White Rock SMCA',
       'Laguna Beach SMR', 'Skunk Point (Santa Rosa Island) SMR',
       'Montara SMR', 'Painted Cave (Santa Cruz Island) SMCA',
       'Gerstle Cove SMR', 'MacKerricher SMCA', 'Lovers Point SMR',
       'Campus Point SMCA', 'South Point (Santa Rosa Island) SMR',
       'Judith Rock (San Miguel Island) SMR', 'Matlahuayl SMR',
       'Point Dume SMCA', 'Anacapa Island SMR',
       'Arrow Point to Lion Head Point (Catalina Island) SMCA',
       'Long Point (Catalina Island) SMR', 'Pyramid Point SMCA',
       'Poin