# Reef Check - kelp data conversion

**Resources:**
- https://dwc.tdwg.org/terms/#occurrence
- https://reefcheck.org/
- https://reefcheck.org/PDFs/RCCAmanual9thedition.pdf
- https://reefcheck.org/PDFs/Reef%20Check%20California%20Abalone%20Protocol.pdf

In [1]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Load data

# path = 'C:\\Users\\dianalg\\Documents\\Work\\MBARI\\MPA Data Integration\\Reef Check\\'
filename = 'RCCA_algae_data.csv'
data = pd.read_csv(filename)

data.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Amount,Stipes,Distance,Lat,Lon,Depth_ft,Region,Temp10m,Heading,Visibility
0,120 Reef,8,10,2006,8-Oct-06,1,bull kelp,0.0,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
1,120 Reef,8,10,2006,8-Oct-06,1,laminaria spp,0.0,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
2,120 Reef,8,10,2006,8-Oct-06,1,pterygophora,0.0,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
3,120 Reef,8,10,2006,8-Oct-06,1,so. sea palm,0.0,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
4,120 Reef,8,10,2006,8-Oct-06,1,giant kelp,0.0,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0


### Column definitions from Reef Check metadata

**Site** = The unique site code that indicates where the survey was performed. This site code refers to a specific entry in the site table. <br>
**Day** = The day that the survey was done. This date is expressed in D or DD format. Dates reflect measurements taken in local time.<br>
**Month** = The month that the survey was done. This month is expressed in M or MM format. Dates reflect measurements taken in local time.<br>
**Year** = The year that the survey was done. This year is expressed in YYYY format. Dates reflect measurements taken in local time.<br>
**SurveyDate** = The date that the survey was completed.<br>
**Transect** = A number representing one of the parallel transects through the study site. Core transects (i.e. transects at which fish, invertebrate, algae, and substrate data is collected) are numbered 1 - 6 with the transects in the offshore zone numbered as 1-3 and the inshore core transects numbered 4 - 6. Fish-only transects are numbered 7 - 18 with the offshore fish only transects numbered 7 - 12 and the inshore fish only transects numbered 13 - 18.<br>
**Species** = The unique taxonomic classification code that is being counted. The taxonomy of the species is defined in the species lookup table.<br>
**Amount** = Total number of individuals of a given classcode counted within the distance indicated in the Distance column along a transect.<br>
**Stipes** = Number of stipes of Macrocystis pyrifera counted per individual counted under 'Amount'. <br>
**Distance** = Distance along transect over which individuals of a given classcode were counted. When this distance is less than 30m, the species was sub-sampled at about 50 individuals. To generate densities for a 60 square meter area, the 'Amount' variable needs to be divided by the 'Distance' variable and multiplied by 30.<br>
**Lat** = Latitude of the site.<br>
**Lon** = Longitude of the site.<br>
**Depth_ft** = Average depth of a transect in feet as measured by diver using dive computer.<br>
**Region** = MLPA region<br>
**Temp10m** = The water temperature at the sites during the survey measured using a dive computer at 10 meter depth or the seafloor if site is shallower than 10 meters. Measured in degrees Celsius.<br>
**Heading** = General compass heading of the transect<br>
**Visibility** = Visibility in meters at the transect location as determined by divers by measuring the distance from which the fingers on a hand held up into the water column can be counted.<br>
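
The sub-sampling rule in the **Distance** definition can be checked with a small worked example. This is a minimal sketch, not part of the Reef Check metadata: the function name `density_per_60m2` and the example counts are made up for illustration.

```python
def density_per_60m2(amount, distance):
    """Scale a count made over `distance` meters of transect to the
    full 30 m transect, which the metadata equates to 60 square meters."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return amount / distance * 30

# Full transect: 12 individuals counted over all 30 m (hypothetical counts)
full = density_per_60m2(12, 30)

# Sub-sampled transect: 50 individuals reached after only 10 m (hypothetical counts)
subsampled = density_per_60m2(50, 10)

print(full, subsampled)
```

When `Distance` equals 30 the count is returned essentially unchanged, so the same formula can be applied to every row whether or not it was sub-sampled.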

## Convert

The **event** is a transect at a particular site and the **occurrence** is the observation of a kelp along that transect.

### Get site names (will use to construct eventID)

In [6]:
## Load site table

filename = 'RCCA_site_table.csv'
sites = pd.read_csv(filename)

sites.head()

Unnamed: 0,IDNumSite,SiteName,County,CityIsland,State,DateOfEntry,Comments,AvgDepthSite,DistFromShore,Lat,Lon,Location,mpa,ProtectionStatus
0,1,120 Reef,Los Angeles,Los Angeles,CA,3/5/2008 0:00,,8,0,33.737919,-118.392014,South,Abalone Cove SMCA,State Marine Conservation Area
1,2,Abalone Cove,Los Angeles,Palos Verdes,CA,3/5/2008 0:00,,9,0,33.736149,-118.37632,South,Abalone Cove SMCA,State Marine Conservation Area
2,3,Aquarium,Monterey,Monterey,CA,3/5/2008 0:00,,10,0,36.619232,-121.899414,Central,Edward F. Ricketts SMCA,State Marine Conservation Area
3,4,Big Creek,Monterey,Santa Lucia,CA,3/5/2008 0:00,,13,0,36.069183,-121.600601,Central,Big Creek SMR,State Marine Reserve
4,5,Big Rock,Los Angeles,Malibu,CA,3/5/2008 0:00,,8,0,34.035168,-118.608086,South,Point Dume SMR,


In [7]:
## Let's use the SiteName column, and then add in locality or something similar later

# Create site_names
site_names = [name.replace(' ', '') for name in sites['SiteName']]

# Map site_names to names in data
site_name_dict = dict(zip(sites['SiteName'], site_names))

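# Create a new column of site_names in data
# (assigning the result avoids calling replace(..., inplace=True) on a column selection,
# which newer pandas versions warn about)
data['SiteName'] = data['Site'].replace(site_name_dict)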

### Create eventID

In [9]:
## Pad month and day to two digits as needed

paddedDay = data['Day'].astype(str).str.zfill(2).tolist()
paddedMonth = data['Month'].astype(str).str.zfill(2).tolist()

In [10]:
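## Create eventID
# Note: the date part is assembled as year + day + month (YYYYDDMM), so 8 Oct 2006 becomes 20060810

eventID = [data['SiteName'].iloc[i] + '_' + str(data['Year'].iloc[i]) + paddedDay[i] + paddedMonth[i] + '_' + str(data['Transect'].iloc[i]) for i in range(len(data['Site']))]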
converted = pd.DataFrame({'eventID':eventID})
converted.head()

Unnamed: 0,eventID
0,120Reef_20060810_1
1,120Reef_20060810_1
2,120Reef_20060810_1
3,120Reef_20060810_1
4,120Reef_20060810_1
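
The same identifiers can also be built without index-based loops. The sketch below is one possible vectorized equivalent, assuming integer `Day`, `Month`, `Year` and `Transect` columns. The small `demo` frame is made up (its first row mirrors the first row of the data), and the year + day + month order of the cell above is kept.

```python
import pandas as pd

# Made-up rows for illustration; the first one mirrors the first row of the data
demo = pd.DataFrame({
    'SiteName': ['120Reef', 'AbaloneCove'],
    'Day': [8, 12],
    'Month': [10, 7],
    'Year': [2006, 2010],
    'Transect': [1, 4],
})

# Zero-pad day and month to two characters
day = demo['Day'].astype(str).str.zfill(2)
month = demo['Month'].astype(str).str.zfill(2)

# Concatenate the pieces column-wise: site, year + day + month, transect
demo_event_id = (
    demo['SiteName'] + '_'
    + demo['Year'].astype(str) + day + month + '_'
    + demo['Transect'].astype(str)
)

print(demo_event_id.tolist())
```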


### eventDate

In [12]:
## Reformat SurveyDate

eventDate = [datetime.strptime(dt, '%d-%b-%y').date().isoformat() for dt in data['SurveyDate']]
converted['eventDate'] = eventDate
converted.head()

Unnamed: 0,eventID,eventDate
0,120Reef_20060810_1,2006-10-08
1,120Reef_20060810_1,2006-10-08
2,120Reef_20060810_1,2006-10-08
3,120Reef_20060810_1,2006-10-08
4,120Reef_20060810_1,2006-10-08
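
A quick check of the format string used above, as a minimal sketch. `%d` accepts one- or two-digit days, `%b` is the abbreviated month name (locale-dependent; English month names are assumed here), and `%y` is the two-digit year. The sample strings are taken from `SurveyDate` values that appear in this notebook.

```python
from datetime import datetime

# SurveyDate strings in the D-Mon-YY form used by the Reef Check export
samples = ['8-Oct-06', '22-Aug-17']

# Parse each string and emit an ISO 8601 date (YYYY-MM-DD)
iso_dates = [datetime.strptime(s, '%d-%b-%y').date().isoformat() for s in samples]

print(iso_dates)
```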


### Add location information

In [14]:
## Remove test county

sites = sites[sites['County'] != 'Test']

In [15]:
## Map to verified county names in the Getty Thesaurus of Geographic Names

# County names according to Getty Thesaurus of Geographic Names
county_names_dict = dict(zip(sites['County'].unique(), sites['County'].unique()))
county_names_dict['LA'] = 'Los Angeles'
county_names_dict['Curry, OR'] = 'Curry'

# County IDs according to Getty Thesaurus of Geographic Names 
county_ids_dict = {'Los Angeles':1002608,
                   'Monterey':1002684,
                   'San Diego':1002858,
                   'Ventura':1002972,
                   'Santa Barbara':1002867,
                   'Sonoma':7014516,
                   'Orange':1002748,
                   'Mendocino':2000185,
                   'San Luis Obispo':1002863,
                   'Humboldt':2000181,
                   'LA':1002608,
                   'Santa Cruz':1002869,
                   'San Mateo':1002864,
                   'Del Norte':2000180,
                   'Curry, OR':2001704                  
                  }

In [16]:
## Add county, state, IDs and authority

# County
site_to_county_dict = dict(zip(sites['SiteName'].str.strip(), sites['County'].str.strip()))
converted['county'] = data['Site'].replace(site_to_county_dict).replace(county_names_dict)

# State
# Note: the Darwin Core term is spelled 'stateProvince'; the column name below is misspelled
converted['stateProvence'] = 'California'
converted.loc[converted['county'] == 'Curry', 'stateProvence'] = 'Oregon'

# ID
# Site names not in the site table ('Cueva Valdez', 'Point Vicente East', 'Point Vicente West')
# pass through both replacements unchanged
converted['locationID'] = data['Site'].replace(site_to_county_dict).replace(county_ids_dict)

# authority
converted['locationAccordingTo'] = 'Getty Thesaurus of Geographic Names'

converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names


In [17]:
## Add locality

site_to_mpa_dict = dict(zip(sites['SiteName'], sites['mpa']))
converted['locality'] = data['Site'].replace(site_to_mpa_dict)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA


In [18]:
## Add site names as verbatimLocality

converted['verbatimLocality'] = data['Site']
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef


**Note** that there are four sites with kelp surveys but no invertebrate data: Cat Harbor, Flat Iron Rock, Half Moon Reef and Iron Bound Cove. Is it possible that not all data types are acquired during every survey, even for core transects? Do I need to handle this in some way?
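
One way to list such sites is a set difference on the site names of the two data sets. This is a sketch only: `kelp_sites` and `invert_sites` are hypothetical stand-ins for the unique `Site` values of the kelp and invertebrate tables, filled here with made-up subsets.

```python
# Hypothetical stand-ins for set(data['Site']) in the kelp and invertebrate tables
kelp_sites = {'120 Reef', 'Cat Harbor', 'Flat Iron Rock', 'Big Creek'}
invert_sites = {'120 Reef', 'Big Creek'}

# Sites that appear in the kelp data but not in the invertebrate data
kelp_only = sorted(kelp_sites - invert_sites)

print(kelp_only)
```

Running the difference in the other direction (`invert_sites - kelp_sites`) would show whether the mismatch is one-sided.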

In [37]:
## Add lat, lon

converted['decimalLatitude'] = round(data['Lat'], 4)
converted['decimalLongitude'] = round(data['Lon'], 4)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392


In [39]:
## Add coordinateUncertaintyInMeters

converted['coordinateUncertaintyInMeters'] = 250

### Specify whether transect was inshore or offshore using locationRemarks

In [40]:
## Add locationRemarks

converted['locationRemarks'] = 'inshore zone'
converted.loc[data['Transect'] < 4, 'locationRemarks'] = 'offshore zone'
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,locationRemarks
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone


In [45]:
out = data.groupby(['Site', 'SurveyDate'])['Transect'].nunique()
out[out < 6]

Site                    SurveyDate
Casino Point            11-Nov-17     4
Caspar North            22-Aug-17     4
Cat Harbor              31-Oct-17     2
Flat Iron Rock          26-Jun-17     2
Frolic Cove             17-Jun-17     2
Ft Ross                 30-Jul-17     4
Half Moon Reef          15-Oct-17     2
Harmony                 29-Jun-17     4
Iron Bound Cove         30-Oct-17     2
Johnsons Lee            30-Aug-17     4
Lopez                   26-Jun-17     2
Mendocino Headlands     2-Aug-17      2
Otter Cove              24-Jun-17     4
Point Sur               27-Jun-17     2
Point Vicente East      17-Jul-17     2
South Grestle           17-Sep-17     4
South Monastery         5-Nov-17      2
Stillwater Cove Sonoma  29-Jul-17     4
Trinidad                13-Aug-17     4
Weston                  5-Oct-17      2
Name: Transect, dtype: int64

**Note:** Sometimes not all transects are conducted. Is it still safe to assume that 1-3 are offshore and 4-6 are inshore?
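
To see which transect numbers were actually surveyed on the incomplete visits, the numbers can be collected per site and date. A minimal sketch with made-up rows; the real `data` frame would take the place of `demo`.

```python
import pandas as pd

# Made-up rows: several species rows per transect, as in the real data
demo = pd.DataFrame({
    'Site': ['Cat Harbor'] * 4 + ['Casino Point'] * 4,
    'SurveyDate': ['31-Oct-17'] * 4 + ['11-Nov-17'] * 4,
    'Transect': [1, 1, 4, 4, 1, 2, 4, 5],
})

# Keep one row per surveyed transect, then list the transect numbers per visit
surveyed = (
    demo.drop_duplicates(['Site', 'SurveyDate', 'Transect'])
        .sort_values('Transect')
        .groupby(['Site', 'SurveyDate'])['Transect']
        .agg(list)
)

print(surveyed)
```

If the listed numbers still fall into both the 1-3 and 4-6 ranges, the zone assignment by transect number is at least consistent with how the transects were labelled.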

In [54]:
## Add depth

converted['verbatimDepth'] = round(data['Depth_ft']*0.3048)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,locationRemarks,verbatimDepth
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0


Since these depths were taken by divers using dive computers, I think it's reasonable to round to the nearest meter. **Note** that I assume these depths are not corrected for tidal height. 
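
As a worked check of the conversion: the first rows have `Depth_ft` = 28.0, and 28.0 * 0.3048 = 8.5344 m, which rounds to the 9.0 shown above. A minimal sketch (the helper name `feet_to_meters` is hypothetical):

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot

def feet_to_meters(depth_ft):
    """Convert a depth in feet to the nearest whole meter."""
    return round(depth_ft * FEET_TO_METERS)

print(feet_to_meters(28.0))
```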

### Add occurrence data

In [55]:
## Add occurrenceID

converted['occurrenceID'] = range(1, converted.shape[0]+1)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,locationRemarks,verbatimDepth,occurrenceID
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0,1
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0,2
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0,3
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0,4
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0,5


In [56]:
## Load species table

filename = 'RCCA_algae_species_table.csv'
species = pd.read_csv(filename)

species.head()

Unnamed: 0,SC_project_short_code,Kingdom,Division,Class,Order,Family,Genus,Species,Classcode
0,SC_CitSci_Kelp,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Lessoniaceae,Nereocystis,luetkeana,bull kelp
1,SC_CitSci_Kelp,Chromista,Phaeophyta,Phaeophycease,Laminariales,Alariaceae,Pterygophora,californica,Pterygophora
2,SC_CitSci_Kelp,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Alariaceae,Eisenia,arborea,so. sea palm
3,SC_CitSci_Kelp,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Laminariaceae,Laminaria,spp.,Laminaria spp
4,SC_CitSci_Kelp,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Lessoniaceae,Macrocystis,pyrifera,giant kelp


In [62]:
## Map scientific names to classcodes

# Create scientific name column in species
species['scientificName'] = species['Genus'] + ' ' + species['Species']

# Create map
code_to_species_dict = dict(zip(species['Classcode'], species['scientificName']))

# Add in classcodes that are different in data and species table
code_to_species_dict['laminaria spp'] = 'Laminaria spp.'
code_to_species_dict['pterygophora'] = 'Pterygophora californica'

In [63]:
## Create scientificName column

converted['scientificName'] = data['Species'].replace(code_to_species_dict)
converted.head()

Unnamed: 0,eventID,eventDate,county,stateProvence,locationID,locationAccordingTo,locality,verbatimLocality,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,locationRemarks,verbatimDepth,occurrenceID,scientificName
0,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0,1,Nereocystis luetkeana
1,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0,2,Laminaria spp.
2,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0,3,Pterygophora californica
3,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0,4,Eisenia arborea
4,120Reef_20060810_1,2006-10-08,Los Angeles,California,1002608,Getty Thesaurus of Geographic Names,Abalone Cove SMCA,120 Reef,33.7379,-118.392,250,offshore zone,9.0,5,Macrocystis pyrifera


In [64]:
## Get unique scientific names for lookup in WoRMS

names = converted['scientificName'].unique()

**Note:** There is one new species not listed in the species table: Sargassum horneri.
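
Names like this can be found by checking which values in the data have no entry in the classcode mapping, since `replace` passes unmapped values through as-is. A sketch with a made-up subset of the mapping and of the `Species` column; how the new species is actually spelled in the raw data is an assumption here.

```python
# Made-up subset of code_to_species_dict
code_to_species = {
    'bull kelp': 'Nereocystis luetkeana',
    'giant kelp': 'Macrocystis pyrifera',
}

# Stand-in for data['Species']; the third value has no classcode entry
observed = ['bull kelp', 'giant kelp', 'Sargassum horneri', 'giant kelp']

# Values in the data that the mapping does not cover
unmapped = sorted(set(observed) - set(code_to_species))

print(unmapped)
```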