# Reef Check - invertebrate data conversion

For each survey site, Reef Check performs 6 core transects where divers record inverts, kelp (plus presence/absence of invasives), UPC and fish. Twelve additional fish-only transects, plus abalone and urchin size surveys, are performed separately. For this reason, I originally thought it would be reasonable to have four converted datasets:
1. Core transect data
2. Fish-only transect data
3. Urchin size data
4. Abalone size data

After some thought, however, I decided **it makes more sense to continue to group the data by organism type** (inverts, kelp, fish, UPC). The fact that certain organisms were found on the same core transect does lend a certain amount of spatial information. But because that information is not preserved across surveys, statements about abundance and variability only make sense at the site level. Aside from the fact that transects at a given site must be deep or shallow, and approximately parallel to shore following a depth contour, there is nothing that ensures that transect 1 at site A in 2001 matches transect 1 at site A in 2002.

**Note** that the urchin and abalone size data should also be converted separately. These surveys are not conducted as part of the core transects, so it would be misleading to attach presence or size information from them to the converted inverts data.

**Resources:**
- https://dwc.tdwg.org/terms/#occurrence
- https://reefcheck.org/
- https://reefcheck.org/PDFs/RCCAmanual9thedition.pdf

In [1]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [5]:
## Load data

# path = 'C:\\Users\\dianalg\\Documents\\Work\\MBARI\\MPA Data Integration\\Reef Check\\'
filename = 'RCCA_invert_data.csv'
data = pd.read_csv(filename)

data.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Amount,Distance,Lat,Lon,Depth_ft,Region,Temp10m,Heading,Visibility
0,120 Reef,8,10,2006,8-Oct-06,1,bat star,13.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
1,120 Reef,8,10,2006,8-Oct-06,1,black abalone,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
2,120 Reef,8,10,2006,8-Oct-06,1,brown/golden gorgonian,11.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
3,120 Reef,8,10,2006,8-Oct-06,1,ca sea cucumber,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0
4,120 Reef,8,10,2006,8-Oct-06,1,ca spiny lobster,0.0,30.0,33.737919,-118.392014,28.0,South,15.0,170.0,7.0


### Information on column definitions from Reef Check's metadata files

**Site** = The unique site code that indicates where the survey was performed. This site code refers to a specific entry in the site table. <br>
**Day** = The day that the survey was done. This date is expressed in D or DD format. Dates reflect measurements taken in local time.<br>
**Month** = The month that the survey was done. This month is expressed in M or MM format. Dates reflect measurements taken in local time.<br>
**Year** = The year that the survey was done. This year is expressed in YYYY format. Dates reflect measurements taken in local time.<br>
**SurveyDate** = The date that the survey was completed.<br>
**Transect** = A number representing one of the parallel transects through the study site. Core transects (i.e. transects at which fish, invertebrate, algae, and substrate data is collected) are numbered 1 - 6 with the transects in the offshore zone numbered as 1-3 and the inshore core transects numbered 4 - 6. Fish-only transects are numbered 7 - 18 with the offshore fish only transects numbered 7 - 12 and the inshore fish only transects numbered 13 - 18.<br>
**Species** = The unique taxonomic classification code that is being counted. The taxonomy of the species is defined in the species lookup table.<br>
**Amount** = Total number of individuals of a given classcode counted within the distance indicated in the Distance column along a transect.<br>
**Distance** = Distance along transect over which individuals of a given classcode were counted. When this distance is less than 30m, the species was sub-sampled at about 50 individuals. To generate densities for a 60 square meter area, the 'amount' variable needs to be divided by the 'distance' variable and multiplied by 30.<br>
**Lat** = Latitude of the site.<br>
**Lon** = Longitude of the site.<br>
**Depth_ft** = Average depth of a transect in feet as measured by diver using dive computer.<br>
**Region** = MLPA region<br>
**Temp10m** = The water temperature at the sites during the survey measured using a dive computer at 10 meter depth or the seafloor if site is shallower than 10 meters. Measured in degrees Celsius.<br>
**Heading** = General compass heading of the transect<br>
**Visibility** = Visibility in meters at the transect location as determined by divers by measuring the distance from which the fingers on a hand held up into the water column can be counted.<br>
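
The Distance definition above implies a simple rescaling of sub-sampled counts. Here is a minimal sketch of that arithmetic, assuming a small made-up frame that reuses the `Amount` and `Distance` column names. The values, the second species label, and the `density_per_60m2` column name are illustrative assumptions, not part of the Reef Check data.

```python
import pandas as pd

# Hypothetical rows using the Amount and Distance columns described above;
# the values are illustrative, not taken from the Reef Check data.
example = pd.DataFrame({
    "Species": ["bat star", "purple urchin"],
    "Amount": [13.0, 50.0],
    "Distance": [30.0, 10.0],  # second row was sub-sampled after 10 m
})

# Divide by the distance actually surveyed, then scale to the full 30 m
# transect (the 60 square meter area mentioned in the metadata).
example["density_per_60m2"] = example["Amount"] / example["Distance"] * 30

print(example)
```

A row counted over the full 30 m keeps its original count, while a row sub-sampled over 10 m is scaled up by a factor of three.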

# Convert

The **event** is a transect at a particular site and the **occurrence** is the observation of an animal along that transect.

## Create eventID

In [43]:
## Pad month and day as needed

# Zero-pad single-digit days and months to two characters
paddedDay = data['Day'].astype(str).str.zfill(2).tolist()
paddedMonth = data['Month'].astype(str).str.zfill(2).tolist()

In [71]:
## Create eventID

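# Date portion is assembled as year + day + month (YYYYDDMM)
# Iterate over every row; range(len(...) - 1) would silently drop the last record
eventID = [data['Site'].iloc[i] + '_' + str(data['Year'].iloc[i]) + paddedDay[i] + paddedMonth[i] + '_' + str(data['Transect'].iloc[i]) for i in range(len(data['Site']))]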
converted = pd.DataFrame({'eventID':eventID})
converted.head()

Unnamed: 0,eventID
0,120 Reef_20060810_1
1,120 Reef_20060810_1
2,120 Reef_20060810_1
3,120 Reef_20060810_1
4,120 Reef_20060810_1


**Note** that when Jan sends along the site table, additional site information should be incorporated. I requested the site table on 5/20/20.

In [72]:
## Reformat SurveyDate

eventDate = [datetime.strptime(dt, '%d-%b-%y').date().isoformat() for dt in data['SurveyDate']]
converted['eventDate'] = eventDate
converted.head()

Unnamed: 0,eventID,eventDate
0,120 Reef_20060810_1,2006-10-08
1,120 Reef_20060810_1,2006-10-08
2,120 Reef_20060810_1,2006-10-08
3,120 Reef_20060810_1,2006-10-08
4,120 Reef_20060810_1,2006-10-08


**Note** that I don't see a good way to include transect number, except as part of eventID. But I'd like to include whether the transect was a core transect or not (methodology) and whether the transect was inshore or offshore (location). For now, I'm incorporating the former under samplingProtocol and the latter under eventRemarks. 

**Update:** There are no fish-only transects in this data set!! That will only be relevant for the fish data. Obviously.
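
The transect numbering in the metadata (1 - 6 core, 7 - 18 fish-only, with offshore and inshore halves) can be turned into a small lookup. This is only a sketch: `describe_transect` is a hypothetical helper, and the label strings are placeholders for whatever ends up in samplingProtocol and eventRemarks.

```python
def describe_transect(transect):
    """Return (transect type, zone) for a Reef Check transect number.

    Ranges follow the Transect definition in the metadata: 1-6 are core
    transects and 7-18 are fish-only; 1-3 and 7-12 are offshore, the
    remaining numbers are inshore.
    """
    if not 1 <= transect <= 18:
        raise ValueError(f"unexpected transect number: {transect}")
    kind = "core transect" if transect <= 6 else "fish-only transect"
    offshore = transect <= 3 or 7 <= transect <= 12
    zone = "offshore zone" if offshore else "inshore zone"
    return kind, zone


for number in (1, 4, 7, 13):
    print(number, describe_transect(number))
```

Something like `data['Transect'].map(describe_transect)` would apply it to a whole column. For the invert data only the core branch is ever reached.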

In [73]:
## Add eventRemarks

converted['eventRemarks'] = 'inshore zone'
converted.loc[data['Transect'] < 4, 'eventRemarks'] = 'offshore zone'
converted.head()

Unnamed: 0,eventID,eventDate,eventRemarks
0,120 Reef_20060810_1,2006-10-08,offshore zone
1,120 Reef_20060810_1,2006-10-08,offshore zone
2,120 Reef_20060810_1,2006-10-08,offshore zone
3,120 Reef_20060810_1,2006-10-08,offshore zone
4,120 Reef_20060810_1,2006-10-08,offshore zone


In [76]:
## Add lat, lon

converted['decimalLatitude'] = data['Lat']
converted['decimalLongitude'] = data['Lon']
converted.head()

Unnamed: 0,eventID,eventDate,eventRemarks,scientificName,decimalLatitude,decimalLongitude
0,120 Reef_20060810_1,2006-10-08,offshore zone,bat star,33.737919,-118.392014
1,120 Reef_20060810_1,2006-10-08,offshore zone,black abalone,33.737919,-118.392014
2,120 Reef_20060810_1,2006-10-08,offshore zone,brown/golden gorgonian,33.737919,-118.392014
3,120 Reef_20060810_1,2006-10-08,offshore zone,ca sea cucumber,33.737919,-118.392014
4,120 Reef_20060810_1,2006-10-08,offshore zone,ca spiny lobster,33.737919,-118.392014


**Note:** What to do about coordinateUncertaintyInMeters?

In [84]:
## Add depth

converted['minimumDepthInMeters'] = round(data['Depth_ft']*0.3048)
converted['maximumDepthInMeters'] = round(data['Depth_ft']*0.3048)
converted.head()

Unnamed: 0,eventID,eventDate,eventRemarks,scientificName,decimalLatitude,decimalLongitude,minimumDepthInMeters,maximumDepthInMeters
0,120 Reef_20060810_1,2006-10-08,offshore zone,bat star,33.737919,-118.392014,9.0,9.0
1,120 Reef_20060810_1,2006-10-08,offshore zone,black abalone,33.737919,-118.392014,9.0,9.0
2,120 Reef_20060810_1,2006-10-08,offshore zone,brown/golden gorgonian,33.737919,-118.392014,9.0,9.0
3,120 Reef_20060810_1,2006-10-08,offshore zone,ca sea cucumber,33.737919,-118.392014,9.0,9.0
4,120 Reef_20060810_1,2006-10-08,offshore zone,ca spiny lobster,33.737919,-118.392014,9.0,9.0


Since these depths were taken by divers using dive computers, I think it's reasonable to round to the nearest meter.
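
As a quick check on the conversion, here is the arithmetic for a single depth value, written as a standalone sketch. `feet_to_whole_meters` is a hypothetical helper name; the factor 0.3048 is the one used in the cell above.

```python
FEET_TO_METERS = 0.3048  # meters per foot, as used in the conversion above


def feet_to_whole_meters(depth_ft):
    """Convert a depth in feet to meters, rounded to the nearest meter."""
    return round(depth_ft * FEET_TO_METERS)


# 28 ft is the Depth_ft value in the first rows of the source data.
depth_m = feet_to_whole_meters(28.0)
print(depth_m)
```

28 ft works out to 8.5344 m, which rounds to the 9 m shown in the converted table.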

In [75]:
## Add occurrenceID
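
One possible way to build occurrenceID is to number the records within each event and append that counter to the eventID. This is a sketch under assumptions: the `_occ` suffix and the `demo` frame are made up for illustration, and nothing here has been checked against the full data set.

```python
import pandas as pd

# Illustrative eventIDs in the same format as the converted table.
demo = pd.DataFrame({
    "eventID": [
        "120 Reef_20060810_1",
        "120 Reef_20060810_1",
        "120 Reef_20060810_2",
    ]
})

# Number the records within each event (starting at 1) and append the counter.
counter = demo.groupby("eventID").cumcount() + 1
demo["occurrenceID"] = demo["eventID"] + "_occ" + counter.astype(str)

print(demo)
```

Because the counter restarts for each event, the IDs are unique only as long as the eventIDs themselves distinguish every transect.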

In [74]:
## Add scientificName

converted['scientificName'] = data['Species']
converted.head()

Unnamed: 0,eventID,eventDate,eventRemarks,scientificName
0,120 Reef_20060810_1,2006-10-08,offshore zone,bat star
1,120 Reef_20060810_1,2006-10-08,offshore zone,black abalone
2,120 Reef_20060810_1,2006-10-08,offshore zone,brown/golden gorgonian
3,120 Reef_20060810_1,2006-10-08,offshore zone,ca sea cucumber
4,120 Reef_20060810_1,2006-10-08,offshore zone,ca spiny lobster


**Note** that species information will have to be updated once the species table is received. This will include actual scientific names, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus and basisOfRecord. Species table requested on 5/20.
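
Until the species table arrives, the eventual update might look roughly like the sketch below. The `species_lookup` dictionary is a hand-written placeholder covering three of the common names, not Reef Check's table, and deriving occurrenceStatus from a zero `Amount` is my assumption about how absences should be recorded.

```python
import pandas as pd

# Placeholder lookup standing in for the species table, which has not arrived.
# Only three common names are filled in, purely for illustration.
species_lookup = {
    "bat star": "Patiria miniata",
    "black abalone": "Haliotis cracherodii",
    "ca spiny lobster": "Panulirus interruptus",
}

demo = pd.DataFrame({
    "Species": ["bat star", "black abalone", "ca spiny lobster"],
    "Amount": [13.0, 0.0, 0.0],
})

# Names missing from the lookup become NaN, which makes gaps easy to spot.
demo["scientificName"] = demo["Species"].map(species_lookup)

# A zero count is kept as an explicit absence rather than dropped.
demo["occurrenceStatus"] = ["present" if n > 0 else "absent" for n in demo["Amount"]]

print(demo)
```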

In [80]:
round(8.6)

9

In [85]:
## Density, temp and visibility can be measurements or facts
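
A rough sketch of how the event-level measurements could be reshaped into a long measurement-or-fact table follows. The `demo` frame is made up, the measurementType values are just the source column names used as placeholders, and the units come from the column definitions in the metadata. Density would additionally need to be tied to an occurrence rather than only to an event.

```python
import pandas as pd

# One illustrative event with the two event-level measurements.
demo = pd.DataFrame({
    "eventID": ["120 Reef_20060810_1"],
    "Temp10m": [15.0],
    "Visibility": [7.0],
})

# Reshape the wide columns into one row per measurement.
emof = demo.melt(
    id_vars=["eventID"],
    var_name="measurementType",
    value_name="measurementValue",
)

# Units as stated in the column definitions.
units = {"Temp10m": "degrees Celsius", "Visibility": "meters"}
emof["measurementUnit"] = emof["measurementType"].map(units)

print(emof)
```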