# Reef Check - transect data conversion

For each survey site, Reef Check performs 6 core transects where divers record inverts, kelp, UPC and fish. 12 additional fish-only transects are performed separately. Abalone and urchin size surveys are performed off-transect, and only at some sites. Presence/absence surveys for invasive kelps are also performed off-transect. For this reason, I think it would be reasonable to have four converted datasets: 
1. Core transect and fish-only transect data
2. Urchin size data
3. Abalone size data
4. Invasive kelp data

In this notebook, I deal with the transect data only.

**Resources:**
- https://dwc.tdwg.org/terms/#occurrence
- https://reefcheck.org/
- https://reefcheck.org/PDFs/RCCAmanual9thedition.pdf
- https://reefcheck.org/PDFs/Reef%20Check%20California%20Abalone%20Protocol.pdf

**SettingWithCopyWarning reference:** https://www.dataquest.io/blog/settingwithcopywarning/

In [375]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates

In [376]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [377]:
## Load inverts data

# path = 'C:\\Users\\dianalg\\Documents\\Work\\MBARI\\MPA Data Integration\\Reef Check\\'
filename = 'RCCA_invertebrate_swath_data.csv'
inverts = pd.read_csv(filename)

inverts.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Classcode,Amount,Distance,Latitude,Longitude,Depth_ft,Temp10m,Visibility
0,120 Reef,1,8,2010,1-Aug-10,1,bat star,9.0,30.0,33.73792,-118.392,21.0,13.0,3.0
1,120 Reef,1,8,2010,1-Aug-10,1,black abalone,0.0,30.0,33.73792,-118.392,21.0,13.0,3.0
2,120 Reef,1,8,2010,1-Aug-10,1,brown/golden gorgonian,1.0,30.0,33.73792,-118.392,21.0,13.0,3.0
3,120 Reef,1,8,2010,1-Aug-10,1,ca sea cucumber,0.0,30.0,33.73792,-118.392,21.0,13.0,3.0
4,120 Reef,1,8,2010,1-Aug-10,1,ca spiny lobster,0.0,30.0,33.73792,-118.392,21.0,13.0,3.0


### Information on column definitions from Reef Check's metadata files

**Site** = The unique site code that indicates where the survey was performed. This site code refers to a specific entry in the site table. <br>
**Day** = The day that the survey was done. This date is expressed in D or DD format. Dates reflect measurements taken in local time.<br>
**Month** = The month that the survey was done. This month is expressed in M or MM format. Dates reflect measurements taken in local time.<br>
**Year** = The year that the survey was done. This year is expressed in YYYY format. Dates reflect measurements taken in local time.<br>
**SurveyDate** = The  date that the survey was completed.<br>
**Transect** = A number representing one of the parallel transects through the study site. Core transects (i.e. transects at which fish, invertebrate, algae, and substrate data is collected) are numbered 1 - 6 with the transects in the offshore zone numbered as 1-3 and the inshore core transects numbered 4 - 6. Fish-only transects are numbered 7 - 18 with the offshore fish only transects numbered 7 - 12 and the inshore fish only transects numbered 13 - 18.<br>
**Classcode** = The unique taxonomic classification code that is being counted. The taxonomy of the species is defined in the species lookup table.<br>
**Amount** = Total number of individuals of a given classcode counted within the distance indicated in the Distance column along a transect.<br>
**Distance** = Distance along transect over which individuals of a given classcode were counted.  When this distance is less than 30m, the species was sub-sampled at about 50 individuals. To generate densities for a 60 square meter area the 'amount' variable needs to be  divided by the 'distance' variable and multiplied by 30.<br>
**Lat** = Latitude of the site.<br>
**Lon** = Longitude of the site.<br>
**Depth_ft** = Average depth of a transect in feet as measured by diver using dive computer.<br>
**Temp10m** = The water temperature at the sites during the survey measured using a dive computer at 10 meter depth or the seafloor if site is shallower than 10 meters. Measured in degrees Celsius.<br>
**Visibility** = Visibility in meters at the transect location as determined by divers by measuring the distance from which the fingers on a hand help up into the water column can be counted.<br>

In [378]:
## Load kelp data

filename = 'RCCA_algae_swath_data.csv'
kelp = pd.read_csv(filename)

kelp.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Classcode,Amount,Stipes,Distance,Latitude,Longitude,Depth_ft,Temp10m,Visibility
0,120 Reef,1,8,2010,1-Aug-10,1,bull kelp,0.0,,30.0,33.737919,-118.392014,21.0,13.0,3.0
1,120 Reef,1,8,2010,1-Aug-10,1,giant kelp,1.0,8.0,18.0,33.737919,-118.392014,21.0,13.0,3.0
2,120 Reef,1,8,2010,1-Aug-10,1,giant kelp,1.0,12.0,18.0,33.737919,-118.392014,21.0,13.0,3.0
3,120 Reef,1,8,2010,1-Aug-10,1,giant kelp,1.0,13.0,18.0,33.737919,-118.392014,21.0,13.0,3.0
4,120 Reef,1,8,2010,1-Aug-10,1,giant kelp,1.0,16.0,18.0,33.737919,-118.392014,21.0,13.0,3.0


### Additional column definitions for algae data

**Stipes** = Number of stipes of Macrocystis pyrifera counted per individual counted under 'Amount'.

In [379]:
## Load fish data

filename = 'RCCA_fish_data.csv'
fish = pd.read_csv(filename)

fish.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,Size_cm,Min_cm,Max_cm,Latitude,Longitude,Depth_ft,Temp10m,Visibility
0,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Small,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0
1,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,none,Medium,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,a,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,,,,33.737919,-118.392014,21.0,13.0,3.0


### Additional column definitions for fish data

**Sex** = For species for which males, females or juveniles can be identified their sex or state of maturity is identified as m =male, f=female; or a = adult, j=juvenile (none is recorded if sex can't be determined). <br>
**SizeCategory** = Prior to 2013, fish were sized in three categories: small, medium and large. Small fish are <15cm total length. Medium sized fish are 15 to 30cm total length, except for lingcod, cabazon, bocaccio and horn shark for which the medium size category is 15-50cm. Large fish are >30cm total length, except for lingcod, cabazon, bocaccio and horn shark for which the large category is >50cm. in 2013 and later years fish are sized to the nearest cm and 'NA is recorded for SizeCategory. <br>
**Size_cm** = The total length of an individual or group of individuals (of the same length) in centimeter (rounded to the nearest cm) OR the average total length for a group of fish for which a range in lengths is specified. For data collected in 2013 and following years. <br>
**Min_cm** = The minimum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species. For data collected in 2013 and following years. <br>
**Max_cm** = The maximum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species. For data collected in 2013 and following years. <br>

In [961]:
## Load UPC data (to be included in measurement or fact file)

filename = 'RCCA_upc_data.csv'
upc = pd.read_csv(filename)

upc.head()

Unnamed: 0,site,Day,Month,Year,Transect,Category,Classcode,Amount,Latitude,Longitude,Depth_ft,Temp10m,Visibility,SurveyDate
0,120 Reef,8,10,2006,1,Cover,articulated coralline,,33.737919,-118.392014,28.0,15.0,7.0,8-Oct-06
1,120 Reef,8,10,2006,1,Cover,brown seaweed,,33.737919,-118.392014,28.0,15.0,7.0,8-Oct-06
2,120 Reef,8,10,2006,1,Cover,crustose coralline,,33.737919,-118.392014,28.0,15.0,7.0,8-Oct-06
3,120 Reef,8,10,2006,1,Cover,green seaweed,,33.737919,-118.392014,28.0,15.0,7.0,8-Oct-06
4,120 Reef,8,10,2006,1,Cover,none,,33.737919,-118.392014,28.0,15.0,7.0,8-Oct-06


### Additional column definitions for UPC

**Category** = Defines the type of data collected in three categories. Data either describes the seafloor substrate (Substrate); the primary organism on the substrate (Cover); or the relief of the substrate (Relief).<br>
**Classcode** = A unique classification code of what is beeing recored at a UPC point. The classcodes are defined in the UPC lookup table. <br>
**Amount** = Total number of points of a given classcode encounted along a transect. (I assume out of 30 possible points). <br>

## Create occurrence file

Let's allow the **event** to be the transect, and the **occurrences** to be any organisms observed along that transect. We have MeasurementOrFacts at both the event and occurrence level (e.g. temperatures and percent covers by transect and sizes by occurrence). These can be incorporated into a single MoF file.

### Get site names

Once I've retrieved the site names from the site table, I can use them to create eventIDs.

In [381]:
## Load site table

filename = 'RCCA_site_table.csv'
sites = pd.read_csv(filename, usecols=range(7))

sites.head()

Unnamed: 0,Research_group,Site,CA_MPA_Name_Short,MPA_status,LTM_project_short_code,Latitude,Longitude
0,RCCA,Macklyn Cove,,REF,LTM_Kelp_SRock,42.045155,-124.294724
1,RCCA,Pyramid Pt,Pyramid Point SMCA,MPA,LTM_Kelp_SRock,41.994801,-124.217308
2,RCCA,Flat Iron Rock,,,,41.059425,-124.157829
3,RCCA,Trinidad,,,,41.055,-124.139999
4,RCCA,MacKerricher North,MacKerricher SMCA,MPA,LTM_Kelp_SRock,39.492823,-123.80199


In [382]:
## Create a new column containing site names w/o spaces, and add it to data files

# Get a list of site names with spaces removed
site_names = [name.replace(' ', '') for name in sites['Site']]
    
# Map site_names to SiteName in sites_df; add sites that are in fish, inverts and algae data but not in site table
site_name_dict = dict(zip(sites['Site'], site_names))
site_name_dict['Cayucos'] = 'Cayucos'
site_name_dict["Fry's Anchorage"] = 'FrysAnchorage'
site_name_dict['Hurricane Ridge'] = 'HurricaneRidge'
site_name_dict['LA Federal Breakwater'] = 'LAFederalBreakwater'
site_name_dict["Lover's 3"] = 'Lovers3'
site_name_dict['Ocean Cove Kelper'] = 'OceanCoveKelper'
site_name_dict['Pier 400'] = 'Pier400'
site_name_dict['West Long Point'] = 'WestLongPoint'
site_name_dict['Judith Reserve San Miguel Is'] = 'JudithReserveSanMiguelIsland'

def create_SiteName(df, site_name_dict):
    
    # Create SiteName column from Site column in df
    df['SiteName'] = df['Site']
    df['SiteName'].replace(site_name_dict, inplace=True)
    
    return(df)

inverts = create_SiteName(inverts, site_name_dict)
kelp = create_SiteName(kelp, site_name_dict)
fish = create_SiteName(fish, site_name_dict)

fish.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,Size_cm,Min_cm,Max_cm,Latitude,Longitude,Depth_ft,Temp10m,Visibility,SiteName
0,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Small,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef
1,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,none,Medium,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,a,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef


**Note** - sites in fish, kelp and inverts data that are not in site table:
- Cayucos
- Fry's Anchorage (I assume this is the same as Frys Anchorage)
- Hurricane Ridge
- LA Federal Breakwater
- Lover's 3 (I assume this is the same as Lovers 3)
- Ocean Cove Kelper (Is this the same as Ocean Cove?)
- Pier 400
- West Long Point

Also **note** that Judith Reserve San Miguel Island is written Judith Reserve San Miguel Is in fish data only.

I'm going to assemble columns for one organism type at a time, and then concatenate at the end.

### Inverts occurrence file

In [383]:
## Pad month and day as needed

paddedDay = ['0' + str(inverts['Day'].iloc[i]) if len(str(inverts['Day'].iloc[i])) == 1 else str(inverts['Day'].iloc[i]) for i in range(len(inverts['Day']))]
paddedMonth = ['0' + str(inverts['Month'].iloc[i]) if len(str(inverts['Month'].iloc[i])) == 1 else str(inverts['Month'].iloc[i]) for i in range(len(inverts['Month']))]

In [384]:
## Create eventID

eventID = [inverts['SiteName'].iloc[i] + '_' + str(inverts['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(inverts['Transect'].iloc[i]) for i in range(len(inverts['Site']))]
inverts_occ = pd.DataFrame({'eventID':eventID})

inverts_occ.head()

Unnamed: 0,eventID
0,120Reef_20100801_1
1,120Reef_20100801_1
2,120Reef_20100801_1
3,120Reef_20100801_1
4,120Reef_20100801_1


In [385]:
## Add occurrenceID

inverts_occ['occurrenceID'] = inverts.groupby(['Site', 'SurveyDate', 'Transect'])['Classcode'].cumcount()+1
inverts_occ['occurrenceID'] = inverts_occ['eventID'] + '_inverts_occ' + inverts_occ['occurrenceID'].astype(str)

inverts_occ.head()

Unnamed: 0,eventID,occurrenceID
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5


In [386]:
## Load species table

filename = 'RCCA_invertebrate_lookup_table.csv'
species = pd.read_csv(filename, encoding='ANSI')

species.head()

Unnamed: 0,Kingdom,Phylum,Class,Order,Family,Genus,Species,Classcode,taxonomic_source,taxonomic_id
0,Animalia,Echinodermata,Asteroidea,Valvatida,Asterinidae,Patiria,miniata,bat star,www.marinespecies.org,382131
1,Animalia,Cnidaria,Anthozoa,Alcyonacea,Plexauridae,Muricea,fruticosa/californica,brown/golden gorgonian,www.marinespecies.org,177745
2,Animalia,Echinodermata,Holothuroidea,Synallactida,Stichopodidae,Parastichopus,californicus,CA sea cucumber,www.marinespecies.org,711954
3,Animalia,Arthropoda,Malacostraca,Decapoda,Palinuridae,Panulirus,interruptus,CA spiny lobster,www.marinespecies.org,382898
4,Animalia,Mollusca,Gastropoda,Littorinimorpha,Cypraeidae,Neobernaya,spadicea,chestnut cowry,www.marinespecies.org,580674


In [395]:
## Map scientific names to classcodes

# Create scientific name column in species
species['scientificName'] = species['Genus'] + ' ' + species['Species']

# Strip any whitespace
species['scientificName'] = species['scientificName'].str.strip()

# Fix species names where Genus and Species were NaN
species.loc[species['Family'] == 'Actiniidae', 'scientificName'] = 'Actiniidae' 

# Create map
code_to_species_dict = dict(zip(species['Classcode'], species['scientificName']))

**Note** that some classcodes in inverts don't match the classcodes in the species table:
- ca sea cucumber --> CA sea cucumber
- ca spiny lobster --> CA spiny lobster
- kellet's welk --> Kellet's welk
- wavy red turban snail --> wavy/red turban snail
- california sea hare --> California sea hare
- ochre star --> ochre sea star

In [396]:
## Change classcodes that don't match classcodes in species table

inverts.loc[inverts['Classcode'] == 'ca sea cucumber', 'Classcode'] = 'CA sea cucumber'
inverts.loc[inverts['Classcode'] == 'ca spiny lobster', 'Classcode'] = 'CA spiny lobster'
inverts.loc[inverts['Classcode'] == "kellet's whelk", 'Classcode'] = "Kellet's whelk"
inverts.loc[inverts['Classcode'] == 'wavy red turban snail', 'Classcode'] = 'wavy/red turban snail'
inverts.loc[inverts['Classcode'] == 'california sea hare', 'Classcode'] = 'California sea hare'
inverts.loc[inverts['Classcode'] == 'ochre star', 'Classcode'] = 'ochre sea star'

In [397]:
## Create scientificName column

inverts_occ['scientificName'] = inverts['Classcode']
inverts_occ['scientificName'].replace(code_to_species_dict, inplace=True)
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis craherodii
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea fruticosa/californica
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus


In [400]:
## Drop the one row with scientificName = 'no inverts'

inverts_occ = inverts_occ[inverts_occ['scientificName'] != 'no inverts'].copy()

In [401]:
## Get unique scientific names for lookup in WoRMS

names = inverts_occ['scientificName'].unique()

**Note** that there are some classcodes that are not covered in the species table: sunflower/sun star, unknown abalone, no inverts 

Suflower/sun star can be designated as class Asteroidea, with Solaster spp. or Pycnopodia helianthoides in identificationQualifier <br>
Unknown Abalone can be designated Haliotis <br>
No inverts is designated in a single row - drop it? **What does this mean?**

Also **note** that there are a number of names that are not specific at the species level. These will match at the genus level, but may want to include identificationQualifier:
- Muricea fruticosa/californica
- Loxorhynchus grandis/crispatus
- Megastrea/Lithopoma undosa/gibberosa

**Assumed misspellings:**
- Haliotis craherodii --> Haliotis cracherodii
- Mesocentrotus francicanus --> Mesocentrotus franciscanus
- Crassadoma giganteum --> Crassedoma giganteum

**Assumed old names:**
- I think Megastraea/Lithopoma undosa/gibberosa means that either Megatrea undosa or Lithopoma gibberosa was observed
- Megastraea undosa matches in WoRMS. **Note** that Megastrea should be spelled Megastraea.
- Lithopoma gibberosa is unaccepted in WoRMS. The indicated accepted name is Pomaulax gibberosus
- I think it's best to put subfamily Turbininae for both, and then indicate possible species in the identificationQualifier column

In [402]:
## Add manually identified scientific names to names; correct spelling errors

names_to_change = ['sunflower/sun star', 'unknown abalone', 'Megastrea/Lithopoma undosa/gibberosa', 'Haliotis craherodii', 'Mesocentrotus francicanus', 'Crassadoma giganteum']
correct_names = ['Asteroidea', 'Haliotis', 'Turbininae', 'Haliotis cracherodii', 'Mesocentrotus franciscanus', 'Crassedoma giganteum']

for i in range(len(names_to_change)):
    names = np.where(names==names_to_change[i], correct_names[i], names)
    
# Also correct names in converted scientificName column
inverts_occ['scientificName'].replace({'sunflower/sun star':'Asteroidea',
                                      'unknown abalone':'Haliotis',
                                      'Megastrea/Lithopoma undosa/gibberosa':'Turbininae',
                                      'Haliotis craherodii':'Haliotis cracherodii',
                                      'Mesocentrotus francicanus':'Mesocentrotus franciscanus',
                                      'Crassadoma giganteum':'Crassedoma giganteum'}, inplace=True)

In [404]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Muricea fruticosa/californica checking:  Muricea
Url didn't work for Cancer spp. checking:  Cancer
Url didn't work for Loxorhynchus grandis/crispatus checking:  Loxorhynchus
Url didn't work for Solaster spp. checking:  Solaster


In [407]:
## Add scientific name-related columns

inverts_occ['scientificNameID'] = inverts_occ['scientificName']
inverts_occ['scientificNameID'].replace(name_id_dict, inplace=True)

inverts_occ['taxonID'] = inverts_occ['scientificName']
inverts_occ['taxonID'].replace(name_taxid_dict, inplace=True)
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea fruticosa/californica,urn:lsid:marinespecies.org:taxname:177745,177745
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898


In [408]:
## Create identificationQualifier

qualifier_dict = {'Muricea fruticosa/californica':'Muricea fruticosa or Muricea californica',
               'Loxorhynchus grandis/crispatus':'Loxorhynchus grandis or Loxorhynchus crispatus',
               'Turbininae':'Megastraea undosa or Pomaulax gibberosus (previously Lithopoma gibberosa)',
               'Asteroidea':'Solaster spp. or Pycnopodia helianthoides'}

identificationQualifier = [qualifier_dict[name] if name in qualifier_dict.keys() else np.nan for name in inverts_occ['scientificName']]

In [409]:
## Replace scientificName using name_name_dict

inverts_occ['scientificName'].replace(name_name_dict, inplace=True)
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898


In [411]:
## Add final name-related columns

inverts_occ['nameAccordingTo'] = 'WoRMS'
inverts_occ['occurrenceStatus'] = 'present'
inverts_occ['basisOfRecord'] = 'HumanObservation'
inverts_occ['identificationQualifier'] = identificationQualifier
inverts_occ['occurrenceRemarks'] = np.nan  # no occurrenceRemarks required for inverts

inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,present,HumanObservation,,
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954,WoRMS,present,HumanObservation,,
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,present,HumanObservation,,


In [412]:
## Add sex and lifeStage column (no data for inverts)

inverts_occ['sex'] = np.nan
inverts_occ['lifeStage'] = np.nan
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,,,
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,present,HumanObservation,,,,
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,,,
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954,WoRMS,present,HumanObservation,,,,
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,present,HumanObservation,,,,


Next, I'll create a density column under the column name organismQuantity. I'll use this column to indicate absence records, **assuming that a density of 0 indicates absent**, and save the series to be used in the MoF file.

In [413]:
## Density

# First, remove the 'no inverts' row from inverts
inverts = inverts[inverts['Classcode'] != 'no inverts']

# Calculate density
inverts_density = round((inverts['Amount']/inverts['Distance'])*30, 2) 
inverts_occ['organismQuantity'] = inverts_density
inverts_occ['organismQuantityType'] = 'number of individuals per 60 m2'
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,,,,9.0,number of individuals per 60 m2
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,present,HumanObservation,,,,,0.0,number of individuals per 60 m2
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,,,,1.0,number of individuals per 60 m2
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954,WoRMS,present,HumanObservation,,,,,0.0,number of individuals per 60 m2
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,present,HumanObservation,,,,,0.0,number of individuals per 60 m2


**Note** that there are 562 records that have density = NaN. Where do these come from?

The inverts data frame never has Distance = 0. But both Distance and Amount are sometimes NaN. Amount is NaN for 562 records; Distance is NaN for 501 records. For the remaining 61 records for which Amount is NaN and Distance is not, Distance takes on a number of values (not just 30). **What does all this indicate?** I'm going to leave the NaN records in for now.

In [414]:
## Assign an occurrenceStatus of 'absent' to records where density = 0

inverts_occ.loc[inverts_occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'

Temperature and visibility data will be included in the MoF file at the event level, so I'll leave that alone for now.

### Algae occurrence file

In the algae data, giant kelp have been given multiple rows if more than one was observed on a given transect. So, for example, at 120 Reef on October 8, 2006, during the 5th transect, 4 giant kelps were observed. Correspondingly, 'giant kelp' has four rows, while the other species only have 1 to indicate their presence/absence and density. The stipe column indicates the nubmer of stipes for each of these giant kelps.

To be consistent with the inverts data, I think the only thing that makes sense is to group these giant kelp observations together and provide a density value for a single occurrenceID. I also want to preserve the stipe counts, though. I think it will be OK to give a list of the number of stipes for each plant with the corresponding occurrenceID in the MoF file. So, for example, the occurrenceID for the giant kelp observation for the 120 Reef survey mentioned above would be 120Reef_20060810_5_algae_occ5, the density would be 4 per 60 m2, and the associated MoF would be type = size, value = 10|6|4|7, unit = number of stipes.

In [415]:
## Reformat kelp so that giant kelp individuals are aggregated and there is a column containing a list of stipe counts

# Convert stipe value to string and add separator
kelp['StipeList'] = kelp['Stipes'].astype(str) + ' | '

# replace 'nan | ' strings with empty strings
kelp.loc[kelp['StipeList'] == 'nan | ', 'StipeList'] = ''

# repeat the stipe value the number of times indicated by the Amount column
kelp.loc[kelp['Amount'].isna() == False, 'StipeList'] = kelp.loc[kelp['Amount'].isna() == False, 'StipeList']*kelp.loc[kelp['Amount'].isna() == False, 'Amount'].astype(int)

# Aggregate kelp to obtain the total number of individuals per distance, with a list of associated stipe counts
kelp_agg = kelp.groupby(['SiteName', 'Day', 'Month', 'Year', 'Latitude', 'Longitude', 'Transect', 'Classcode']).agg({
    'Amount':[sum],
    'Distance':[max],
    'StipeList':[sum]
})
kelp_agg.reset_index(inplace=True)
kelp_agg.columns = kelp_agg.columns.droplevel(level=1)

# Remove trailing ' | ' characters
kelp_agg['StipeList'] = kelp_agg['StipeList'].str[:-3]

print(kelp_agg.shape)
kelp_agg.head()

(39102, 11)


Unnamed: 0,SiteName,Day,Month,Year,Latitude,Longitude,Transect,Classcode,Amount,Distance,StipeList
0,120Reef,1,8,2010,33.737919,-118.392014,1,bull kelp,0.0,30.0,
1,120Reef,1,8,2010,33.737919,-118.392014,1,giant kelp,49.0,18.0,8.0 | 12.0 | 13.0 | 16.0 | 3.0 | 3.0 | 7.0 | 7...
2,120Reef,1,8,2010,33.737919,-118.392014,1,laminaria spp.,0.0,30.0,
3,120Reef,1,8,2010,33.737919,-118.392014,1,pterygophora,2.0,30.0,
4,120Reef,1,8,2010,33.737919,-118.392014,1,souther sea palm,0.0,30.0,


In [416]:
## Build eventID

# Pad month and day as needed
paddedDay = ['0' + str(kelp_agg['Day'].iloc[i]) if len(str(kelp_agg['Day'].iloc[i])) == 1 else str(kelp_agg['Day'].iloc[i]) for i in range(len(kelp_agg['Day']))]
paddedMonth = ['0' + str(kelp_agg['Month'].iloc[i]) if len(str(kelp_agg['Month'].iloc[i])) == 1 else str(kelp_agg['Month'].iloc[i]) for i in range(len(kelp_agg['Month']))]

# Create eventID
eventID = [kelp_agg['SiteName'].iloc[i] + '_' + str(kelp_agg['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(kelp_agg['Transect'].iloc[i]) for i in range(len(kelp_agg['SiteName']))]
kelp_occ = pd.DataFrame({'eventID':eventID})

print(kelp_occ.shape)
kelp_occ.head()

(39102, 1)


Unnamed: 0,eventID
0,120Reef_20100801_1
1,120Reef_20100801_1
2,120Reef_20100801_1
3,120Reef_20100801_1
4,120Reef_20100801_1


In [417]:
## Add occurrenceID

# Create SurveyDate column to groupby
SurveyDate = [str(kelp_agg['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(len(kelp_agg['SiteName']))]
kelp_agg['SurveyDate'] = SurveyDate

# Use SurveyDate to create occurrenceID
kelp_occ['occurrenceID'] = kelp_agg.groupby(['SiteName', 'SurveyDate', 'Transect'])['Classcode'].cumcount()+1
kelp_occ['occurrenceID'] = kelp_occ['eventID'] + '_algae_occ' + kelp_occ['occurrenceID'].astype(str)

kelp_occ.head()

Unnamed: 0,eventID,occurrenceID
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5


In [418]:
## Load species table

filename = 'RCCA_algae_species_lookup_table.csv'
species = pd.read_csv(filename)

species.head()

Unnamed: 0,Kingdom,Division,Class,Order,Family,Genus,Species,Classcode,taxonomic_source,taxonomic_id,species_definition
0,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Phaeophyceae,Nereocystis,luetkeana,bull kelp,www.marinespecies.org,240752,bull kelp
1,Chromista,Phaeophyta,Phaeophycease,Laminariales,Alariaceae,Pterygophora,californica,pterygophora,www.marinespecies.org,240750,Pterygophora
2,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Lessoniaceae,Eisenia,arborea,southern sea palm,www.marinespecies.org,371990,Southern sea palm larger than 30 cm. Prior to ...
3,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Lessoniaceae,Eisenia,arborea,southern sea palm small,www.marinespecies.org,371990,Southern sea palm smaller than 30 cm. Prior to...
4,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Laminariaceae,Laminaria,spp.,Laminaria spp,www.marinespecies.org,144199,Laminaria farlowii and L. setchellii were coun...


In [419]:
## Map scientific names to classcodes and create scientificName

# Create scientific name column in species
species['scientificName'] = species['Genus'] + ' ' + species['Species']

# Create map
code_to_species_dict = dict(zip(species['Classcode'], species['scientificName']))

# Add in classcodes that are different in data and species table
code_to_species_dict['laminaria spp.'] = 'Laminaria spp.'
code_to_species_dict['souther sea palm'] = 'Eisenia arborea'
code_to_species_dict['souther sea palm small'] = 'Eisenia arborea'
code_to_species_dict['laminaria farlowi'] = 'Laminaria farlowii'
code_to_species_dict['laminaria setchel'] = 'Laminaria setchellii'
code_to_species_dict['sargassum horneri'] = 'Sargassum horneri'

# Create scientificName
kelp_occ['scientificName'] = kelp_agg['Classcode']
kelp_occ['scientificName'].replace(code_to_species_dict, inplace=True)
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,Nereocystis luetkeana
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,Macrocystis pyrifera
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,Laminaria spp.
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,Pterygophora californica
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,Eisenia arborea


In [420]:
## Get unique scientific names for lookup in WoRMS

names = kelp_occ['scientificName'].unique()

In [421]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Laminaria spp. checking:  Laminaria


In [422]:
## Add scientific name-related columns

kelp_occ['scientificNameID'] = kelp_occ['scientificName']
kelp_occ['scientificNameID'].replace(name_id_dict, inplace=True)

kelp_occ['taxonID'] = kelp_occ['scientificName']
kelp_occ['taxonID'].replace(name_taxid_dict, inplace=True)
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,Laminaria spp.,urn:lsid:marinespecies.org:taxname:516207,516207
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,Pterygophora californica,urn:lsid:marinespecies.org:taxname:240750,240750
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990


In [423]:
## Create identificationQualifier to handle Laminaria spp. and occurrenceRemarks to handle Eisenia arborea

identificationQualifier = ['Laminaria farlowii or Laminaria setchellii' if kelp_occ['scientificName'].iloc[i] == 'Laminaria spp.' else np.nan for i in range(kelp_occ.shape[0])]

occurrenceRemarks = []
for i in range(kelp_occ.shape[0]):
    if kelp_agg['Classcode'].iloc[i] == 'souther sea palm':
        occurrenceRemarks.append('individuals >= 30 cm in size')
    elif kelp_agg['Classcode'].iloc[i] == 'souther sea palm small':
        occurrenceRemarks.append('individuals < 30 cm in size')
    else:
        occurrenceRemarks.append(np.nan)

In [424]:
## Replace scientificName using name_name_dict

kelp_occ['scientificName'].replace(name_name_dict, inplace=True)
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,Pterygophora californica,urn:lsid:marinespecies.org:taxname:240750,240750
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990


In [425]:
## Add final name-related columns

kelp_occ['nameAccordingTo'] = 'WoRMS'
kelp_occ['occurrenceStatus'] = 'present'
kelp_occ['basisOfRecord'] = 'HumanObservation'
kelp_occ['identificationQualifier'] = identificationQualifier
kelp_occ['occurrenceRemarks'] = occurrenceRemarks

kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752,WoRMS,present,HumanObservation,,
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,present,HumanObservation,Laminaria farlowii or Laminaria setchellii,
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,Pterygophora californica,urn:lsid:marinespecies.org:taxname:240750,240750,WoRMS,present,HumanObservation,,
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,present,HumanObservation,,individuals >= 30 cm in size


In [426]:
## Add sex and lifeStage column (no data for kelp)

kelp_occ['sex'] = np.nan
kelp_occ['lifeStage'] = np.nan
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752,WoRMS,present,HumanObservation,,,,
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,present,HumanObservation,Laminaria farlowii or Laminaria setchellii,,,
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,Pterygophora californica,urn:lsid:marinespecies.org:taxname:240750,240750,WoRMS,present,HumanObservation,,,,
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,present,HumanObservation,,individuals >= 30 cm in size,,


In [427]:
## Create density to find absence records and for use in MoF

kelp_density = round((kelp_agg['Amount']/kelp_agg['Distance'])*30, 2) # units = individuals per 60 km2
kelp_occ['organismQuantity'] = kelp_density
kelp_occ['organismQuantityType'] = 'number of individuals per 60 m2'
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752,WoRMS,present,HumanObservation,,,,,0.0,number of individuals per 60 m2
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,,81.67,number of individuals per 60 m2
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,present,HumanObservation,Laminaria farlowii or Laminaria setchellii,,,,0.0,number of individuals per 60 m2
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,Pterygophora californica,urn:lsid:marinespecies.org:taxname:240750,240750,WoRMS,present,HumanObservation,,,,,2.0,number of individuals per 60 m2
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,present,HumanObservation,,individuals >= 30 cm in size,,,0.0,number of individuals per 60 m2


**Note** that here, again, 1642 densities are NaN. In kelp_occ, 1641 records have Amount = NaN. 1635 of these also have Distance = NaN. 7 records where Amount = NaN have been assigned a distance of 30. A single record where Distance = NaN has been assigned an Amount of 74. **How do I interpret these rows?**

In [428]:
## Assign an occurrenceStatus of 'absent' to records where density = 0

kelp_occ.loc[kelp_occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'

In [429]:
## Save StipeList for use in MoF file

kelp_sizes = kelp_agg['StipeList']
kelp_sizes

0                                                         
1        8.0 | 12.0 | 13.0 | 16.0 | 3.0 | 3.0 | 7.0 | 7...
2                                                         
3                                                         
4                                                         
                               ...                        
39097                                                     
39098    1.0 | 5.0 | 8.0 | 9.0 | 4.0 | 4.0 | 6.0 | 6.0 ...
39099                                                     
39100                                                     
39101                                                     
Name: StipeList, Length: 39102, dtype: object

### Fish occurrence file

There are lots of complexities here as well. Following on the idea that the occurrence file should contain one line per species with a density per 60 m2, then it seems like I should have organismQuantity (density), organismQuantityType, sex, and lifeStage in the occurrence file. Sex and lifeStage will have to be separated out from the Sex column in the original fish dataframe. Then, in the MoF file, each occurrenceID can be associated with a density and size list (formatted similarly to the stipes list from kelp). Sizes will be "Small", "Medium" or "Large" prior to 2013, and estimated to the nearest cm afterwards. If a min and max size has been provided (as may occur after 2013 for groups of fish), these sizes will not be included in the size list but will instead have their own min size and max size row associated with the occurrenceID.

Note that sometimes a subset of fish are sized individually while others have been given a size range (e.g. 120 Reef on 7/10/18, transect 4, black perch).

**Note** that there are a large number (~160,000) of records where amount is NaN. I'm not sure what this indicates. In addition, for these records, Size_cm is either blank or NaN, and Min_cm and Max_cm are either 0 or NaN. Which value (0 or NaN) is used does not depend on whether the Year is before or after 2013. How do I interpret these records? This also affects the calculation of density at the bottom of this section.

In [430]:
## Start by pulling sex and lifeStage information out of Sex column

fish['lifeStage'] = fish['Sex']
fish.loc[fish['lifeStage'].isin(['m', 'f', 'none']), 'lifeStage'] = ''
fish.loc[fish['Sex'].isin(['a', 'j', 'none']), 'Sex'] = ''

# Replace Sex and lifeStage with controlled vocabulary
fish['Sex'].replace({'m':'male', 'f':'female'}, inplace=True)
fish['lifeStage'].replace({'a':'adult', 'j':'juvenile'}, inplace=True)

# Change NaN values to ''
fish.loc[fish['lifeStage'].isnull() == True, 'lifeStage'] = ''
fish.loc[fish['Sex'].isnull() == True, 'Sex'] = ''
fish.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,Size_cm,Min_cm,Max_cm,Latitude,Longitude,Depth_ft,Temp10m,Visibility,SiteName,lifeStage
0,120 Reef,1,8,2010,8/1/2010,1,black perch,,Small,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,
1,120 Reef,1,8,2010,8/1/2010,1,black perch,,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,,Medium,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,adult
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,


There is one row where the value 'Sex' appears in the Sex column. I feel comfortable changing that to NaN.

In [431]:
## Change 'Sex' to NaN

fish.loc[fish['Sex'] == 'Sex', 'Sex'] = ''
fish.loc[fish['lifeStage'] == 'Sex', 'lifeStage'] = ''

In [432]:
## Combine SizeCategory and Size columns

# Remove Size_cm values if Min_cm and Max_cm are assigned
fish.loc[(fish['Size_cm'] > fish['Min_cm']) & (fish['Size_cm'] < fish['Max_cm']), 'Size_cm'] = np.nan

# Convert Size_cm to string
fish['Size_cm'] = fish['Size_cm'].astype(str)

# Change NaN values in both Size_cm and SizeCategory to ''
fish.loc[fish['Size_cm'] == 'nan', 'Size_cm'] = ''
fish.loc[fish['SizeCategory'].isna() == True, 'SizeCategory'] = ''

# Combine columns
size = [fish['SizeCategory'].iloc[i] if fish['SizeCategory'].iloc[i] != '' else fish['Size_cm'].iloc[i] for i in range(fish.shape[0])]
fish['Size'] = size

fish.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,...,Min_cm,Max_cm,Latitude,Longitude,Depth_ft,Temp10m,Visibility,SiteName,lifeStage,Size
0,120 Reef,1,8,2010,8/1/2010,1,black perch,,Small,1.0,...,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,,Small
1,120 Reef,1,8,2010,8/1/2010,1,black perch,,Medium,2.0,...,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,,Medium
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,,Medium,1.0,...,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,,Medium
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,,Medium,2.0,...,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,adult,Medium
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,...,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,,


In [433]:
## Reformat fish so that the amounts of multiple individuals of the same species are aggregated and there are columns containing lists of sizes, sexes and lifeStages

# Add separators to Size, Sex and lifeStage columns
fish.loc[fish['Size'] != '', 'Size'] = fish.loc[fish['Size'] != '', 'Size'] + ' | '
fish.loc[fish['Sex'] != '', 'Sex'] = fish.loc[fish['Sex'] != '', 'Sex'] + ' | '
fish.loc[fish['lifeStage'] != '', 'lifeStage'] = fish.loc[fish['lifeStage'] != '', 'lifeStage'] + ' | '

# Repeat the size value the number of times indicated by the Amount column
fish.loc[fish['amount'].isna() == False, 'Size'] = fish.loc[fish['amount'].isna() == False, 'Size']*fish.loc[fish['amount'].isna() == False, 'amount'].astype(int)

# Repeat the sex value the number of times indicated by the Amount column
fish.loc[fish['amount'].isna() == False, 'Sex'] = fish.loc[fish['amount'].isna() == False, 'Sex']*fish.loc[fish['amount'].isna() == False, 'amount'].astype(int)

# Repeat the lifeStage value the number of times indicated by the Amount column
fish.loc[fish['amount'].isna() == False, 'lifeStage'] = fish.loc[fish['amount'].isna() == False, 'lifeStage']*fish.loc[fish['amount'].isna() == False, 'amount'].astype(int)

fish.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,...,Min_cm,Max_cm,Latitude,Longitude,Depth_ft,Temp10m,Visibility,SiteName,lifeStage,Size
0,120 Reef,1,8,2010,8/1/2010,1,black perch,,Small,1.0,...,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,,Small |
1,120 Reef,1,8,2010,8/1/2010,1,black perch,,Medium,2.0,...,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,,Medium | Medium |
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,,Medium,1.0,...,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,,Medium |
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,,Medium,2.0,...,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,adult | adult |,Medium | Medium |
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,...,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,,


In [434]:
## Aggregate fish to obtain the total number of individuals per distance, with a list of associated sizes -- TAKES ~ 2 MINS

fish_agg = fish.groupby(['SiteName', 'Day', 'Month', 'Year', 'Latitude', 'Longitude', 'Transect', 'Species']).agg({
    'amount':[sum],
    'Size':[sum],
    'Sex':[sum],
    'lifeStage':[sum]
})
fish_agg.reset_index(inplace=True)
fish_agg.columns = fish_agg.columns.droplevel(level=1)
fish_agg.head()

Unnamed: 0,SiteName,Day,Month,Year,Latitude,Longitude,Transect,Species,amount,Size,Sex,lifeStage
0,120Reef,1,8,2010,33.737919,-118.392014,1,black perch,3.0,Small | Medium | Medium |,,
1,120Reef,1,8,2010,33.737919,-118.392014,1,blacksmith,1.0,Medium |,,
2,120Reef,1,8,2010,33.737919,-118.392014,1,garibaldi,2.0,Medium | Medium |,,adult | adult |
3,120Reef,1,8,2010,33.737919,-118.392014,1,giant sea bass,0.0,,,
4,120Reef,1,8,2010,33.737919,-118.392014,1,rubberlip perch,1.0,Medium |,,


In [435]:
## Join aggregated dataframe to dataframe containing only records with min and max sizes to associate min and max with the appropriate site, date, etc.

# Create dataframe containing only records with min_cm and max_cm values
min_max = fish[(fish['Size_cm'] == '') & (fish['Min_cm'].isna() == False) & (fish['Max_cm'].isna() == False)]
min_max = min_max[['SiteName', 'Day', 'Month', 'Year', 'Transect', 'Species', 'Min_cm', 'Max_cm']]

# Join
fish_agg_joined = fish_agg.merge(min_max, how='left', on=['SiteName', 'Day', 'Month', 'Year', 'Transect', 'Species'])
fish_agg_joined.head()

Unnamed: 0,SiteName,Day,Month,Year,Latitude,Longitude,Transect,Species,amount,Size,Sex,lifeStage,Min_cm,Max_cm
0,120Reef,1,8,2010,33.737919,-118.392014,1,black perch,3.0,Small | Medium | Medium |,,,,
1,120Reef,1,8,2010,33.737919,-118.392014,1,blacksmith,1.0,Medium |,,,,
2,120Reef,1,8,2010,33.737919,-118.392014,1,garibaldi,2.0,Medium | Medium |,,adult | adult |,,
3,120Reef,1,8,2010,33.737919,-118.392014,1,giant sea bass,0.0,,,,,
4,120Reef,1,8,2010,33.737919,-118.392014,1,rubberlip perch,1.0,Medium |,,,,


In [436]:
## Remove trailing ' | ' characters in Size, Sex and lifeStage columns

fish_agg_joined['Size'] = fish_agg_joined['Size'].str[:-3]
fish_agg_joined['Sex'] = fish_agg_joined['Sex'].str[:-3]
fish_agg_joined['lifeStage'] = fish_agg_joined['lifeStage'].str[:-3]

print(fish_agg_joined.shape)
fish_agg_joined.head()

(406515, 14)


Unnamed: 0,SiteName,Day,Month,Year,Latitude,Longitude,Transect,Species,amount,Size,Sex,lifeStage,Min_cm,Max_cm
0,120Reef,1,8,2010,33.737919,-118.392014,1,black perch,3.0,Small | Medium | Medium,,,,
1,120Reef,1,8,2010,33.737919,-118.392014,1,blacksmith,1.0,Medium,,,,
2,120Reef,1,8,2010,33.737919,-118.392014,1,garibaldi,2.0,Medium | Medium,,adult | adult,,
3,120Reef,1,8,2010,33.737919,-118.392014,1,giant sea bass,0.0,,,,,
4,120Reef,1,8,2010,33.737919,-118.392014,1,rubberlip perch,1.0,Medium,,,,


In [437]:
## Build eventID

# Pad month and day as needed
paddedDay = ['0' + str(fish_agg_joined['Day'].iloc[i]) if len(str(fish_agg_joined['Day'].iloc[i])) == 1 else str(fish_agg_joined['Day'].iloc[i]) 
             for i in range(len(fish_agg_joined['Day']))]
paddedMonth = ['0' + str(fish_agg_joined['Month'].iloc[i]) if len(str(fish_agg_joined['Month'].iloc[i])) == 1 else str(fish_agg_joined['Month'].iloc[i]) 
               for i in range(len(fish_agg_joined['Month']))]

# Create eventID
eventID = [fish_agg_joined['SiteName'].iloc[i] + '_' + str(fish_agg_joined['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + 
           '_' + str(fish_agg_joined['Transect'].iloc[i]) for i in range(len(fish_agg_joined['SiteName']))]
fish_occ = pd.DataFrame({'eventID':eventID})

print(fish_occ.shape)
fish_occ.head()

(406515, 1)


Unnamed: 0,eventID
0,120Reef_20100801_1
1,120Reef_20100801_1
2,120Reef_20100801_1
3,120Reef_20100801_1
4,120Reef_20100801_1


In [438]:
## Add occurrenceID

# Create SurveyDate column to groupby
SurveyDate = [str(fish_agg_joined['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(len(fish_agg_joined['SiteName']))]
fish_agg_joined['SurveyDate'] = SurveyDate

# Use SurveyDate to create occurrenceID
fish_occ['occurrenceID'] = fish_agg_joined.groupby(['SiteName', 'SurveyDate', 'Transect'])['Species'].cumcount()+1
fish_occ['occurrenceID'] = fish_occ['eventID'] + '_fish_occ' + fish_occ['occurrenceID'].astype(str)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5


In [439]:
## Load species table

filename = 'RCCA_fish_species_lookup_table.csv'
species = pd.read_csv(filename, encoding='ansi')

print(species.shape)
species.head()

(39, 11)


Unnamed: 0,Kingdom,Phylum,Class,Order,Family,Genus,Species,Name,taxonomic_source,taxonomic_id,species_definition
0,Animalia,Chordata,Actinopterygii,Perciformes,Haemulidae,Anisotremus,davidsonii,sargo,www.marinespecies.org,279617,Anisotremus davidsonii
1,Animalia,Chordata,Actinopterygii,Perciformes,Pomacentridae,Chromis,punctipinnis,blacksmith,www.marinespecies.org,273751,Chromis punctipinnis
2,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,Embiotoca,lateralis,striped perch,www.marinespecies.org,240740,Embiotoca lateralis
3,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,Embiotoca,jacksoni,black perch,www.marinespecies.org,240746,Embiotoca jacksoni
4,Animalia,Chordata,Actinopterygii,Perciformes,Kyphosidae,Girella,nigricans,opaleye,www.marinespecies.org,280865,Girella nigricans


In [440]:
## Map scientific names to classcodes and create scientificName

# Create scientific name column in species
species['scientificName'] = species['Genus'] + ' ' + species['Species']

# Create map
code_to_species_dict = dict(zip(species['Name'], species['scientificName']))

# Indicate that yoy rockfish should be Sebastes
code_to_species_dict['yoy rockfish'] = 'Sebastes'

# Add in classcodes that are different in data and species table
code_to_species_dict['moray eel'] = 'Gymnothorax mordax'

# Create scientificName
fish_occ['scientificName'] = fish_agg_joined['Species']
fish_occ['scientificName'].replace(code_to_species_dict, inplace=True)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,Embiotoca jacksoni
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,Chromis punctipinnis
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,Hypsypops rubicundus
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,Stereolepis gigas
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,Rhacochilus toxotes


**Note** that there is one species in fish_occ that does not match it's entry in the species table: fish_occ has 'moray eel' instead of 'California moray', as in the specie table.

In [441]:
## Get unique scientific names for lookup in WoRMS

names = fish_occ['scientificName'].unique()

**Note**  that there are a number of names that are not specific at the species level. These will match at the genus level, but may want to include identificationQualifier:
- Sebastes flavidus/serranoides
- Sebastes miniatus/pinger

**Assumed misspellings:**
- Sebastes pinger --> Sebastes pinniger
- Balistes polyepis --> Balistes polylepis

In [451]:
## Add manually identified scientific names to names; correct spelling errors

names_to_change = ['Sebastes miniatus/pinger', 'Balistes polyepis']
correct_names = ['Sebastes miniatus/pinniger', 'Balistes polylepis']

for i in range(len(names_to_change)):
    names = np.where(names==names_to_change[i], correct_names[i], names)
    
# Also correct names in converted scientificName column
fish_occ['scientificName'].replace({'Sebastes miniatus/pinger':'Sebastes miniatus/pinniger',
                                      'Balistes polyepis':'Balistes polylepis'}, inplace=True)

In [453]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Sebastes flavidus/serranoides checking:  Sebastes
Url didn't work for Sebastes miniatus/pinniger checking:  Sebastes


In [454]:
## Add scientific name-related columns

fish_occ['scientificNameID'] = fish_occ['scientificName']
fish_occ['scientificNameID'].replace(name_id_dict, inplace=True)

fish_occ['taxonID'] = fish_occ['scientificName']
fish_occ['taxonID'].replace(name_taxid_dict, inplace=True)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,Rhacochilus toxotes,urn:lsid:marinespecies.org:taxname:282575,282575


In [455]:
## Create identificationQualifier

qualifier_dict = {'Sebastes flavidus/serranoides':'Sebastes flavidus or Sebastes serranoides',
               'Sebastes miniatus/pinniger':'Sebastes miniatus or Sebastes pinniger'}

identificationQualifier = [qualifier_dict[name] if name in qualifier_dict.keys() else np.nan for name in fish_occ['scientificName']]

In [456]:
## Replace scientificName using name_name_dict

fish_occ['scientificName'].replace(name_name_dict, inplace=True)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,Rhacochilus toxotes,urn:lsid:marinespecies.org:taxname:282575,282575


In [457]:
## Add final name-related columns

fish_occ['nameAccordingTo'] = 'WoRMS'
fish_occ['occurrenceStatus'] = 'present'
fish_occ['basisOfRecord'] = 'HumanObservation'
fish_occ['identificationQualifier'] = identificationQualifier
fish_occ['occurrenceRemarks'] = np.nan

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751,WoRMS,present,HumanObservation,,
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130,WoRMS,present,HumanObservation,,
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884,WoRMS,present,HumanObservation,,
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,Rhacochilus toxotes,urn:lsid:marinespecies.org:taxname:282575,282575,WoRMS,present,HumanObservation,,


In [462]:
## Add sex and lifeStage information

fish_occ['sex'] = fish_agg_joined['Sex']
fish_occ['sex'].replace('', np.nan, inplace=True)

fish_occ['lifeStage'] = fish_agg_joined['lifeStage']
fish_occ['lifeStage'].replace('', np.nan, inplace=True)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751,WoRMS,present,HumanObservation,,,,
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130,WoRMS,present,HumanObservation,,,,adult | adult
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884,WoRMS,present,HumanObservation,,,,
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,Rhacochilus toxotes,urn:lsid:marinespecies.org:taxname:282575,282575,WoRMS,present,HumanObservation,,,,


In [463]:
## Create density to find absence records and for use in MoF

fish_density = fish_agg_joined['amount'] # units = individuals per 60 km2
fish_occ['organismQuantity'] = fish_density
fish_occ['organismQuantityType'] = 'number of individuals per 60 m2'
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,,3.0,number of individuals per 60 m2
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751,WoRMS,present,HumanObservation,,,,,1.0,number of individuals per 60 m2
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130,WoRMS,present,HumanObservation,,,,adult | adult,2.0,number of individuals per 60 m2
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884,WoRMS,present,HumanObservation,,,,,0.0,number of individuals per 60 m2
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,Rhacochilus toxotes,urn:lsid:marinespecies.org:taxname:282575,282575,WoRMS,present,HumanObservation,,,,,1.0,number of individuals per 60 m2


In [464]:
## Assign an occurrenceStatus of 'absent' to records where density = 0

fish_occ.loc[fish_occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'

In [465]:
## Save Size, Min and Max for use in MoF file

fish_sizes = fish_agg_joined['Size']
min_size = fish_agg_joined['Min_cm']
max_size = fish_agg_joined['Max_cm']

### Aggregate inverts, kelp and fish

In [466]:
## Aggregate

occ = pd.concat([inverts_occ, kelp_occ, fish_occ])
occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,,,,9.0,number of individuals per 60 m2
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,,,,1.0,number of individuals per 60 m2
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2


## Save occurrence file

In [467]:
## Save

occ.to_csv('RCCA_occurrence_20200723.csv', index=False, na_rep='NaN')

## Load occurrence file if desired

In [517]:
## Load

occ = pd.read_csv('RCCA_occurrence_20200723.csv', dtype={'occurrenceRemarks':str, 'sex':str, 'lifeStage':str})
print(occ.shape)
occ.head()

(630934, 14)


Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,,,,9.0,number of individuals per 60 m2
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,,,,1.0,number of individuals per 60 m2
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2


## Create event file

The event file needs to include eventID, eventDate, datasetID, locality, countryCode, decimalLat, decimalLon, coordinateUncertaintyInMeters, locationRemarks, minDepth, maxDepth, and verbatimDepth (if desired).

In addition, depth, temperature and visibility can be included as event-level measurements in the MoF file.

In [812]:
## Get unique eventIDs from occurrence file

event = pd.DataFrame({'eventID':occ['eventID']})
event.drop_duplicates(inplace=True)

print(event.shape)
event.head()

(19381, 1)


Unnamed: 0,eventID
0,120Reef_20100801_1
28,120Reef_20100801_2
56,120Reef_20100801_3
84,120Reef_20100801_4
112,120Reef_20100801_5


In [813]:
## Create eventDate from eventID

eventDate = [datetime.strptime(ID.split('_')[1], '%Y%m%d').date().isoformat() for ID in event['eventID']]
event['eventDate'] = eventDate
event.head()

Unnamed: 0,eventID,eventDate
0,120Reef_20100801_1,2010-08-01
28,120Reef_20100801_2,2010-08-01
56,120Reef_20100801_3,2010-08-01
84,120Reef_20100801_4,2010-08-01
112,120Reef_20100801_5,2010-08-01


In [814]:
## Dataset ID

event['datasetID'] = 'RCCA transects'

In [815]:
## Add locality and countryCode

# Get site name out of eventID
locality = event['eventID'].str.split('_')
locality = locality.str[0]
event['locality'] = locality

# Reverse site_name_dict
reversed_site_name_dict = {v: k for k, v in site_name_dict.items()}

# Use reversed dict to retrieve original names from locality
event['locality'].replace(reversed_site_name_dict, inplace=True)

# Change names as needed to match names in site table
event.loc[event['locality'] == "Lover's 3", 'locality'] = 'Lovers 3'
event.loc[event['locality'] == "Fry's Anchorage", 'locality'] = 'Frys Anchorage'
event.loc[event['locality'] == 'Judith Reserve San Miguel Is', 'locality'] = 'Judith Reserve San Miguel Island'

# Add countryCode
event['countryCode'] = 'US'

event.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode
0,120Reef_20100801_1,2010-08-01,RCCA transects,120 Reef,US
28,120Reef_20100801_2,2010-08-01,RCCA transects,120 Reef,US
56,120Reef_20100801_3,2010-08-01,RCCA transects,120 Reef,US
84,120Reef_20100801_4,2010-08-01,RCCA transects,120 Reef,US
112,120Reef_20100801_5,2010-08-01,RCCA transects,120 Reef,US


In [816]:
## Merge to obtain decimalLatitude and decimalLongitude

event = event.merge(sites[['Site', 'Latitude', 'Longitude']], how='left', left_on='locality', right_on='Site')
event.rename(columns = {'Latitude':'decimalLatitude', 'Longitude':'decimalLongitude'}, inplace=True)
event.drop('Site', axis=1, inplace=True)
event.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude
0,120Reef_20100801_1,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014
1,120Reef_20100801_2,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014
2,120Reef_20100801_3,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014
3,120Reef_20100801_4,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014
4,120Reef_20100801_5,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014


**As noted above** - there are some sites in fish, kelp and inverts data that are not in site table:
- Cayucos
- Fry's Anchorage (I assume this is the same as Frys Anchorage)
- Hurricane Ridge
- LA Federal Breakwater
- Lover's 3 (I assume this is the same as Lovers 3)
- Ocean Cove Kelper (Is this the same as Ocean Cove?)
- Pier 400
- West Long Point

Also **note** that Judith Reserve San Miguel Island is written Judith Reserve San Miguel Is in fish data only.

**Have to get lat, lon for Cayucos, Hurricane Ridge, LA Federal Breakwater, Pier 400 and West Long Point from Reef Check. Also, possibly Ocean Cove Kelper if it is not the same as Ocean Cove.**

In [817]:
## Add coordinateUncertainty in Meters

event['coordinateUncertaintyInMeters'] = 250

In [818]:
## Add locationRemarks to indicate whether the transect was offshore or onshore

locationRemarks = event['eventID'].str.split('_')
locationRemarks = locationRemarks.str[2]
locationRemarks[locationRemarks.isin(['1', '2', '3', '7', '8', '9', '10', '11', '12'])] = 'inshore zone'
locationRemarks[locationRemarks.isin(['4', '5', '6', '13', '14', '15', '16', '17', '18'])] = 'offshore zone'
event['locationRemarks'] = locationRemarks
event.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,locationRemarks
0,120Reef_20100801_1,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,inshore zone
1,120Reef_20100801_2,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,inshore zone
2,120Reef_20100801_3,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,inshore zone
3,120Reef_20100801_4,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,offshore zone
4,120Reef_20100801_5,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,offshore zone


**Note** that there are some transects numbered 19 - 31. When are these done? Are they inshore or offshore?

In [819]:
## In order to get depth data (and temp and visibility for MoF file), we'll need an eventID in inverts, kelp and fish

# Inverts
inverts['eventID'] = inverts_occ['eventID']

# Kelp
paddedDay = ['0' + str(kelp['Day'].iloc[i]) if len(str(kelp['Day'].iloc[i])) == 1 else str(kelp['Day'].iloc[i]) for i in range(len(kelp['Day']))]
paddedMonth = ['0' + str(kelp['Month'].iloc[i]) if len(str(kelp['Month'].iloc[i])) == 1 else str(kelp['Month'].iloc[i]) for i in range(len(kelp['Month']))]
eventID = [kelp['SiteName'].iloc[i] + '_' + str(kelp['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(kelp['Transect'].iloc[i]) for i in range(len(kelp['SiteName']))]
kelp['eventID'] = eventID

# Fish
paddedDay = ['0' + str(fish['Day'].iloc[i]) if len(str(fish['Day'].iloc[i])) == 1 else str(fish['Day'].iloc[i]) for i in range(len(fish['Day']))]
paddedMonth = ['0' + str(fish['Month'].iloc[i]) if len(str(fish['Month'].iloc[i])) == 1 else str(fish['Month'].iloc[i]) for i in range(len(fish['Month']))]
eventID = [fish['SiteName'].iloc[i] + '_' + str(fish['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(fish['Transect'].iloc[i]) for i in range(len(fish['SiteName']))]
fish['eventID'] = eventID

In [820]:
## Obtain depth

depth = pd.concat([inverts[['eventID', 'Depth_ft']],
           kelp[['eventID', 'Depth_ft']],
           fish[['eventID', 'Depth_ft']]])
depth.drop_duplicates(inplace=True)

# A small number of events have more than one depth listed - average these
depth = depth.groupby('eventID').mean()
depth.reset_index(inplace=True)

print(depth.shape)
depth.head()

(19381, 2)


Unnamed: 0,eventID,Depth_ft
0,120Reef_20061008_1,28.0
1,120Reef_20061008_10,28.0
2,120Reef_20061008_11,26.5
3,120Reef_20061008_12,32.5
4,120Reef_20061008_13,23.5


**Note** that a small number of events have more than one depth listed. I'm not sure if these are typos or multiple measurements from different dive computers. Assuming the latter, I'm averaging them. I'll do the same with duplicate temperature and visibility measurements.

In [821]:
## Add depth to event file

event['minimumDepthInMeters'] = depth['Depth_ft']
event['maximumDepthInMeters'] = depth['Depth_ft']
event.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,locationRemarks,minimumDepthInMeters,maximumDepthInMeters
0,120Reef_20100801_1,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,inshore zone,28.0,28.0
1,120Reef_20100801_2,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,inshore zone,28.0,28.0
2,120Reef_20100801_3,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,inshore zone,26.5,26.5
3,120Reef_20100801_4,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,offshore zone,32.5,32.5
4,120Reef_20100801_5,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,offshore zone,23.5,23.5


In [822]:
## Get temperature and visibility data for MoF file

temp = pd.concat([inverts[['eventID', 'Temp10m']],
           kelp[['eventID', 'Temp10m']],
           fish[['eventID', 'Temp10m']]])
temp.drop_duplicates(inplace=True)
temp = temp.groupby('eventID').mean()
temp.reset_index(inplace=True)

vis = pd.concat([inverts[['eventID', 'Visibility']],
           kelp[['eventID', 'Visibility']],
           fish[['eventID', 'Visibility']]])
vis.drop_duplicates(inplace=True)
vis = vis.groupby('eventID').mean()
vis.reset_index(inplace=True)

In [823]:
## Add samplingProtocol and samplingEffort

event['samplingProtocol'] = 'band transect'
event['samplingEffort'] = '10-15 minutes'
event.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,locationRemarks,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
0,120Reef_20100801_1,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,inshore zone,28.0,28.0,band transect,10-15 minutes
1,120Reef_20100801_2,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,inshore zone,28.0,28.0,band transect,10-15 minutes
2,120Reef_20100801_3,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,inshore zone,26.5,26.5,band transect,10-15 minutes
3,120Reef_20100801_4,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,offshore zone,32.5,32.5,band transect,10-15 minutes
4,120Reef_20100801_5,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,offshore zone,23.5,23.5,band transect,10-15 minutes


## Save

In [824]:
## Save

event.to_csv('RCCA_event_20200723.csv', index=False, na_rep='NaN')

## Create MoF file

The MoF file here needs to contain both event and occurrence level measurements. At the event level, we need eventID, occurrenceID = NaN, measurementType, measurementValue and measurementUnit. measurementMethod could optionally be included. measurementTypes available are: temperature, visibility. I also want to include the UPC data here.

At the occurrence level, both an eventID and an occurrenceID will be listed. measurementTypes available are: kelp stipe counts, fish sizes (including min size and max size for groups).

### Assemble UPC data

All amounts are out of 30 possible points surveyed - theoretically the sum over all classcodes in a category should always be 30. **Note** that upon checking this, there are only 4 surveys where it is not true. At Isthmus Reef on 10-Dec-2019, Transect 3, Category, Cover and Relief all have 60 points. At Lovers Point on 16-Sep-2006, Transect 1, Substrate has 31 points. At South La Jolla on 11-Aug-2019, Transect 1, Substrate has 31 points. And At Stillwater Cove Monterey on 26-Aug-2013, Transect 3, Relief has 31 points.

Generally, there are somewhere between 8 and 11 categories for Cover, 5 categories for Substrate and 4 categories for Relief. **Note** that there is one survey for which there seem to be duplicate entries with values that conflict (Isthmus Reef, 10-Dec-2019, Transect 3). **Also**, relief categories are not all labeled consistently ('0 - 10cm' = '0 - 10c', etc.).

In [969]:
## Aggregate UPC data so that all classcodes are collapsed into one row

# Add SiteName column
upc['SiteName'] = upc['site'].copy()
upc['SiteName'].replace(site_name_dict, inplace=True)

# Create eventID
paddedDay = ['0' + str(upc['Day'].iloc[i]) if len(str(upc['Day'].iloc[i])) == 1 else str(upc['Day'].iloc[i]) for i in range(len(upc['Day']))]
paddedMonth = ['0' + str(upc['Month'].iloc[i]) if len(str(upc['Month'].iloc[i])) == 1 else str(upc['Month'].iloc[i]) for i in range(len(upc['Month']))]
eventID = [upc['SiteName'].iloc[i] + '_' + str(upc['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(upc['Transect'].iloc[i]) for i in range(len(upc['SiteName']))]
upc['eventID'] = eventID

# Fix problems with classcodes
upc['Classcode'].replace({'> 2':'> 2m',
                          '> 1 - 2':'> 1m - 2m',
                          '> 10c -1':'> 10cm - 1m',
                          '0 - 10c':'0 - 10cm'}, inplace=True)

# Create Percent and UPC columns
upc['Percent'] = round((upc['Amount']/30)*100, 1)
upc = upc[(upc['Percent'] > 0) & (upc['Percent'].isna() == False)]
upc['Percent'] = upc['Percent'].astype(str)
upc['UPC'] = upc['Percent'] + '% ' + upc['Classcode'] + ' | '

# Aggregate
upc_agg = upc.groupby(['eventID', 'Category']).agg({'UPC':sum})
upc_agg.reset_index(inplace=True)
upc_agg['UPC'] = upc_agg['UPC'].str[:-3]

upc_agg

Unnamed: 0,eventID,Category,UPC
0,120Reef_20061008_1,Relief,3.3% 0 - 10cm | 90.0% > 10cm - 1m | 6.7% > 1m ...
1,120Reef_20061008_1,Substrate,83.3% bedrock | 16.7% cobble
2,120Reef_20061008_2,Cover,43.3% crustose coralline | 43.3% none | 13.3% ...
3,120Reef_20061008_2,Relief,46.7% 0 - 10cm | 50.0% > 10cm - 1m | 3.3% > 1m...
4,120Reef_20061008_2,Substrate,50.0% bedrock | 50.0% sand
...,...,...,...
19394,Yellowbanks_20131107_5,Relief,23.3% 0 - 10cm | 76.7% > 10cm - 1m
19395,Yellowbanks_20131107_5,Substrate,13.3% bedrock | 3.3% boulder | 16.7% cobble | ...
19396,Yellowbanks_20131107_6,Cover,13.3% articulated coralline | 3.3% brown seawe...
19397,Yellowbanks_20131107_6,Relief,6.7% 0 - 10cm | 93.3% > 10cm - 1m


In [975]:
## Use upc_agg to create a mof dataframe with eventID, occurrenceID = NaN, measurementType, measurementValue, measurementUnit, measurementMethod

upc_mof = pd.DataFrame({'eventID':upc_agg['eventID']})
upc_mof['occurrenceID'] = np.nan
upc_mof['measurementType'] = upc_agg['Category'].str.lower()
upc_mof['measurementValue'] = upc_agg['UPC']
upc_mof['measurementUnit'] = 'percent cover'
upc_mof['measurementMethod'] = 'UPC'
upc_mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20061008_1,,relief,3.3% 0 - 10cm | 90.0% > 10cm - 1m | 6.7% > 1m ...,percent cover,UPC
1,120Reef_20061008_1,,substrate,83.3% bedrock | 16.7% cobble,percent cover,UPC
2,120Reef_20061008_2,,cover,43.3% crustose coralline | 43.3% none | 13.3% ...,percent cover,UPC
3,120Reef_20061008_2,,relief,46.7% 0 - 10cm | 50.0% > 10cm - 1m | 3.3% > 1m...,percent cover,UPC
4,120Reef_20061008_2,,substrate,50.0% bedrock | 50.0% sand,percent cover,UPC
...,...,...,...,...,...,...
19394,Yellowbanks_20131107_5,,relief,23.3% 0 - 10cm | 76.7% > 10cm - 1m,percent cover,UPC
19395,Yellowbanks_20131107_5,,substrate,13.3% bedrock | 3.3% boulder | 16.7% cobble | ...,percent cover,UPC
19396,Yellowbanks_20131107_6,,cover,13.3% articulated coralline | 3.3% brown seawe...,percent cover,UPC
19397,Yellowbanks_20131107_6,,relief,6.7% 0 - 10cm | 93.3% > 10cm - 1m,percent cover,UPC


### Assemble remaining measurements

In [976]:
## Assemble event-level measurements

# Temperature
mof = pd.DataFrame({'eventID':temp['eventID']})
mof['occurrenceID'] = np.nan
mof['measurementType'] = 'temperature'
mof['measurementValue'] = temp['Temp10m']
mof['measurementUnit'] = 'degrees Celsius'
mof['measurementMethod'] = 'measured by dive computer at 10 m depth, or at the seafloor if shallower than 10 m'
mof = mof[mof['measurementValue'].isna() == False]

# Visibility
vis_mof = pd.DataFrame({'eventID':vis['eventID']})
vis_mof['occurrenceID'] = np.nan
vis_mof['measurementType'] = 'visibility'
vis_mof['measurementValue'] = vis['Visibility']
vis_mof['measurementUnit'] = 'meters'
vis_mof['measurementMethod'] = 'determined by divers by measuring the distance from which the fingers on a hand held up into the water column can be counted accurately'
vis_mof = vis_mof[vis_mof['measurementValue'].isna() == False]
vis_mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20061008_1,,visibility,7.0,meters,determined by divers by measuring the distance...
1,120Reef_20061008_10,,visibility,6.0,meters,determined by divers by measuring the distance...
2,120Reef_20061008_11,,visibility,6.0,meters,determined by divers by measuring the distance...
3,120Reef_20061008_12,,visibility,6.0,meters,determined by divers by measuring the distance...
4,120Reef_20061008_13,,visibility,7.0,meters,determined by divers by measuring the distance...


In [977]:
## Assemble occurrence-level measurements

# Kelp sizes
kelp_mof = pd.DataFrame({'eventID':kelp_occ['eventID']})
kelp_mof['occurrenceID'] = kelp_occ['occurrenceID']
kelp_mof['measurementType'] = 'size'
kelp_mof['measurementValue'] = kelp_sizes
kelp_mof['measurementUnit'] = 'number of stipes per individual'
kelp_mof['measurementMethod'] = 'counted for each Macrocystis pyrifera surveyed'
kelp_mof = kelp_mof[kelp_mof['measurementValue'] != '']

# Fish sizes for individuals
fish_mof = pd.DataFrame({'eventID':fish_occ['eventID']})
fish_mof['occurrenceID'] = fish_occ['occurrenceID']
fish_mof['measurementType'] = 'length'
fish_mof['measurementValue'] = fish_sizes
fish_mof['measurementUnit'] = 'size category or centimeters per individual'
fish_mof['measurementMethod'] = 'categorized as small/medium/large prior to 2013, estimated visually to the nearest centimeter afterwards'
fish_mof = fish_mof[fish_mof['measurementValue'] != '']

# Fish sizes for groups
min_mof = pd.DataFrame({'eventID':fish_occ['eventID']})
min_mof['occurrenceID'] = fish_occ['occurrenceID']
min_mof['measurementType'] = 'minimum length'
min_mof['measurementValue'] = min_size
min_mof['measurementUnit'] = 'centimeters'
min_mof['measurementMethod'] = 'minimum size observed in a group of fish of the same species, estimated visually to the nearest centimeter'
min_mof = min_mof[min_mof['measurementValue'].isna() == False]

max_mof = pd.DataFrame({'eventID':fish_occ['eventID']})
max_mof['occurrenceID'] = fish_occ['occurrenceID']
max_mof['measurementType'] = 'maximum length'
max_mof['measurementValue'] = max_size
max_mof['measurementUnit'] = 'centimeters'
max_mof['measurementMethod'] = 'maximum size observed in a group of fish of the same species, estimated visually to the nearest centimeter'
max_mof = max_mof[max_mof['measurementValue'].isna() == False]
max_mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
618,120Reef_20180710_4,120Reef_20180710_4_fish_occ3,maximum length,10.0,centimeters,maximum size observed in a group of fish of th...
753,120Reef_20180710_9,120Reef_20180710_9_fish_occ30,maximum length,12.0,centimeters,maximum size observed in a group of fish of th...
819,120Reef_20180710_11,120Reef_20180710_11_fish_occ24,maximum length,34.0,centimeters,maximum size observed in a group of fish of th...
836,120Reef_20180710_12,120Reef_20180710_12_fish_occ5,maximum length,18.0,centimeters,maximum size observed in a group of fish of th...
872,120Reef_20180710_13,120Reef_20180710_13_fish_occ5,maximum length,15.0,centimeters,maximum size observed in a group of fish of th...


In [978]:
## Concatenate dataframes

mof = pd.concat([mof, vis_mof, upc_mof, kelp_mof, fish_mof, min_mof, max_mof])
mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20061008_1,,temperature,15,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
1,120Reef_20061008_10,,temperature,15,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
2,120Reef_20061008_11,,temperature,15,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
3,120Reef_20061008_12,,temperature,15,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
4,120Reef_20061008_13,,temperature,15,degrees Celsius,"measured by dive computer at 10 m depth, or at..."


## Save

In [979]:
## Save

mof.to_csv('RCCA_MoF_20200723.csv', index=False, na_rep='NaN')

## Questions

1. There are sites listed in the inverts, kelp and fish data that are not in the site table. They are: Cayucos, Hurricane Ridge, LA Federal Breakwater, Ocean Cove, Pier 400 and West Long Point. I need the lat and lon for these sites. Also, is Ocean Cove the same as Ocean Cove Kelper?
2. In the inverts and kelp data, both Distance and Amount are sometimes NaN. There are also records where Distance is a number (including but not limited to 30) and Amount is NaN. How should I interpret these records? Are NaN values different than 0 values?
3. In the fish data, a large proportion of records have Amount = NaN (~160,000). In addition, for these records, Size_cm is either blank or NaN, and Min_cm and Max_cm are either 0 or NaN. Which value (0 or NaN) is used does not depend on whether the Year is before or after 2013. How do I interpret these records? 
4. In the kelp data, there is sometimes a 7th transect. Similarly, in the fish data, there are occassionally transects numbering 19-31. When are extra transects done? Are they inshore or offshore? I thought that inshore transects were always labeled 1-3 and 7-12, and offshore transects were always labeled 4-6 and 13-18. Is this not correct?
5. There are some transects where more than one depth is given. Should I assume this is a typo? Or that two measurements were taken?
6. The values in the Amount column for each Category of the UPC surveys should sum to 30. There are 4 surveys where this doesn't happen: Isthmus Reef on 10-Dec-2019, Transect 3 (Category, Cover and Relief all sum to 60); Lovers Point on 16-Sep-2006, Transect 1 (Substrate sums to 31); South La Jolla on 11-Aug-2019, Transect 1 (Substrate sums to 31); and Stillwater Cove Monterey on 26-Aug-2013, Transect 3 (Relief sums to 31). Do you know where the errors are here? Particularly for Isthmus Reef, where there seem to be duplicate entries with conflicting data?