# Reef Check - transect data conversion

For each survey site, Reef Check performs 6 core transects along which divers record invertebrates, kelp, uniform point contact (UPC) data, and fish. 12 additional fish-only transects are performed separately. Abalone and urchin size surveys are performed off-transect, and only at some sites. Presence/absence surveys for invasive kelps are also performed off-transect. Because these survey types are structured differently, I think it would be reasonable to have four converted datasets:
1. Core transect and fish-only transect data
2. Urchin size data
3. Abalone size data
4. Invasive kelp data

In this notebook, I deal with the transect data only.

**Resources:**
- https://dwc.tdwg.org/terms/#occurrence
- https://reefcheck.org/
- https://reefcheck.org/PDFs/RCCAmanual9thedition.pdf
- https://reefcheck.org/PDFs/Reef%20Check%20California%20Abalone%20Protocol.pdf

**SettingWithCopyWarning reference:** https://www.dataquest.io/blog/settingwithcopywarning/
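
The pattern behind that warning comes up repeatedly in this notebook. A minimal sketch of two habits that avoid it, using a toy data frame (the column names and values are illustrative assumptions, not taken from the Reef Check files):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'no inverts', 'b'], 'value': [1, 2, 3]})

# Take an explicit copy when filtering, so later assignments target an independent frame
subset = df[df['name'] != 'no inverts'].copy()

# Assign the result of replace() back, rather than calling it in place on a column selection
subset['name'] = subset['name'].replace({'a': 'A'})

print(subset)
print(df)  # the original frame is left unchanged
```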

In [1]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Load inverts data

# path = 'C:\\Users\\dianalg\\Documents\\Work\\MBARI\\MPA Data Integration\\Reef Check\\'
filename = 'RCCA_invertebrate_swath_data.csv'
inverts = pd.read_csv(filename)

inverts.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Classcode,Amount,Distance,Latitude,Longitude,Depth_ft,Temp10m,Visibility
0,120 Reef,1,8,2010,1-Aug-10,1,bat star,9.0,30.0,33.73792,-118.392,21.0,13.0,3.0
1,120 Reef,1,8,2010,1-Aug-10,1,black abalone,0.0,30.0,33.73792,-118.392,21.0,13.0,3.0
2,120 Reef,1,8,2010,1-Aug-10,1,brown/golden gorgonian,1.0,30.0,33.73792,-118.392,21.0,13.0,3.0
3,120 Reef,1,8,2010,1-Aug-10,1,ca sea cucumber,0.0,30.0,33.73792,-118.392,21.0,13.0,3.0
4,120 Reef,1,8,2010,1-Aug-10,1,ca spiny lobster,0.0,30.0,33.73792,-118.392,21.0,13.0,3.0


### Information on column definitions from Reef Check's metadata files

**Site** = The unique site code that indicates where the survey was performed. This site code refers to a specific entry in the site table. <br>
**Day** = The day that the survey was done. This date is expressed in D or DD format. Dates reflect measurements taken in local time.<br>
**Month** = The month that the survey was done. This month is expressed in M or MM format. Dates reflect measurements taken in local time.<br>
**Year** = The year that the survey was done. This year is expressed in YYYY format. Dates reflect measurements taken in local time.<br>
**SurveyDate** = The date that the survey was completed.<br>
**Transect** = A number representing one of the parallel transects through the study site. Core transects (i.e. transects at which fish, invertebrate, algae, and substrate data is collected) are numbered 1 - 6 with the transects in the offshore zone numbered as 1-3 and the inshore core transects numbered 4 - 6. Fish-only transects are numbered 7 - 18 with the offshore fish only transects numbered 7 - 12 and the inshore fish only transects numbered 13 - 18.<br>
**Classcode** = The unique taxonomic classification code that is being counted. The taxonomy of the species is defined in the species lookup table.<br>
**Amount** = Total number of individuals of a given classcode counted within the distance indicated in the Distance column along a transect.<br>
**Distance** = Distance along transect over which individuals of a given classcode were counted. When this distance is less than 30 m, the species was sub-sampled at about 50 individuals. To generate densities for a 60 square meter area, the 'Amount' variable needs to be divided by the 'Distance' variable and multiplied by 30.<br>
**Lat** = Latitude of the site.<br>
**Lon** = Longitude of the site.<br>
**Depth_ft** = Average depth of a transect in feet as measured by diver using dive computer.<br>
**Temp10m** = The water temperature at the sites during the survey measured using a dive computer at 10 meter depth or the seafloor if site is shallower than 10 meters. Measured in degrees Celsius.<br>
**Visibility** = Visibility in meters at the transect location, determined by divers by measuring the distance from which the fingers on a hand held up in the water column can be counted.<br>
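
The density rule in the **Distance** definition can be sketched on toy values as follows. The function name `density_per_60m2` and the example counts are illustrative assumptions; the only thing taken from the metadata is the rule itself (divide Amount by Distance, multiply by 30).

```python
import pandas as pd

def density_per_60m2(amount, distance):
    """Scale a count made over `distance` metres of transect up to a full 30 m transect."""
    return amount / distance * 30

# Toy rows: one fully surveyed transect, and one sub-sampled after 10 m
toy = pd.DataFrame({'Amount': [9.0, 50.0], 'Distance': [30.0, 10.0]})
toy['density'] = density_per_60m2(toy['Amount'], toy['Distance'])

print(toy)
```

In the second row, 50 individuals counted over 10 m scale to 150 per full transect, which is why sub-sampled counts cannot be compared directly with full-transect counts.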

In [4]:
## Load kelp data

filename = 'RCCA_algae_swath_data.csv'
kelp = pd.read_csv(filename)

kelp.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Classcode,Amount,Stipes,Distance,Latitude,Longitude,Depth_ft,Temp10m,Visibility
0,120 Reef,1,8,2010,1-Aug-10,1,bull kelp,0.0,,30.0,33.737919,-118.392014,21.0,13.0,3.0
1,120 Reef,1,8,2010,1-Aug-10,1,giant kelp,1.0,8.0,18.0,33.737919,-118.392014,21.0,13.0,3.0
2,120 Reef,1,8,2010,1-Aug-10,1,giant kelp,1.0,12.0,18.0,33.737919,-118.392014,21.0,13.0,3.0
3,120 Reef,1,8,2010,1-Aug-10,1,giant kelp,1.0,13.0,18.0,33.737919,-118.392014,21.0,13.0,3.0
4,120 Reef,1,8,2010,1-Aug-10,1,giant kelp,1.0,16.0,18.0,33.737919,-118.392014,21.0,13.0,3.0


### Additional column definitions for algae data

**Stipes** = Number of stipes of Macrocystis pyrifera counted per individual counted under 'Amount'.

In [5]:
## Load fish data

filename = 'RCCA_fish_data.csv'
fish = pd.read_csv(filename)

fish.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,Size_cm,Min_cm,Max_cm,Latitude,Longitude,Depth_ft,Temp10m,Visibility
0,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Small,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0
1,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,none,Medium,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,a,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,,,,33.737919,-118.392014,21.0,13.0,3.0


### Additional column definitions for fish data

**Sex** = For species in which males, females or juveniles can be distinguished, sex or state of maturity is recorded as m = male, f = female, a = adult, or j = juvenile ('none' is recorded if sex can't be determined). <br>
**SizeCategory** = Prior to 2013, fish were sized in three categories: small, medium and large. Small fish are <15 cm total length. Medium fish are 15 to 30 cm total length, except for lingcod, cabezon, bocaccio and horn shark, for which the medium size category is 15-50 cm. Large fish are >30 cm total length, except for lingcod, cabezon, bocaccio and horn shark, for which the large category is >50 cm. In 2013 and later years, fish are sized to the nearest cm and 'NA' is recorded for SizeCategory. <br>
**Size_cm** = The total length of an individual or group of individuals (of the same length) in centimeters (rounded to the nearest cm) OR the average total length for a group of fish for which a range in lengths is specified. For data collected in 2013 and following years. <br>
**Min_cm** = The minimum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species. For data collected in 2013 and following years. <br>
**Max_cm** = The maximum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species. For data collected in 2013 and following years. <br>

In [6]:
## Load UPC data (to be included in measurement or fact file)

filename = 'RCCA_upc_data.csv'
upc = pd.read_csv(filename)

upc.head()

Unnamed: 0,site,Day,Month,Year,Transect,Category,Classcode,Amount,Latitude,Longitude,Depth_ft,Temp10m,Visibility,SurveyDate
0,120 Reef,8,10,2006,1,Cover,articulated coralline,,33.737919,-118.392014,28.0,15.0,7.0,8-Oct-06
1,120 Reef,8,10,2006,1,Cover,brown seaweed,,33.737919,-118.392014,28.0,15.0,7.0,8-Oct-06
2,120 Reef,8,10,2006,1,Cover,crustose coralline,,33.737919,-118.392014,28.0,15.0,7.0,8-Oct-06
3,120 Reef,8,10,2006,1,Cover,green seaweed,,33.737919,-118.392014,28.0,15.0,7.0,8-Oct-06
4,120 Reef,8,10,2006,1,Cover,none,,33.737919,-118.392014,28.0,15.0,7.0,8-Oct-06


### Additional column definitions for UPC

**Category** = Defines the type of data collected in three categories. Data either describes the seafloor substrate (Substrate); the primary organism on the substrate (Cover); or the relief of the substrate (Relief).<br>
**Classcode** = A unique classification code for what is being recorded at a UPC point. The classcodes are defined in the UPC lookup table. <br>
**Amount** = Total number of points of a given classcode encountered along a transect. (I assume out of 30 possible points.) <br>

## Create occurrence file

Let's allow the **event** to be the transect, and the **occurrences** to be any organisms observed along that transect. We have MeasurementOrFacts at both the event and occurrence level (e.g. temperatures and percent covers by transect and sizes by occurrence). These can be incorporated into a single MoF file.

### Get site names

Once I've retrieved the site names from the site table, I can use them to create eventIDs.

In [7]:
## Load site table

filename = 'RCCA_site_table.csv'
sites = pd.read_csv(filename, usecols=range(7))

sites.head()

Unnamed: 0,Research_group,Site,CA_MPA_Name_Short,MPA_status,LTM_project_short_code,Latitude,Longitude
0,RCCA,Macklyn Cove,,REF,LTM_Kelp_SRock,42.045155,-124.294724
1,RCCA,Pyramid Pt,Pyramid Point SMCA,MPA,LTM_Kelp_SRock,41.994801,-124.217308
2,RCCA,Flat Iron Rock,,,,41.059425,-124.157829
3,RCCA,Trinidad,,,,41.055,-124.139999
4,RCCA,MacKerricher North,MacKerricher SMCA,MPA,LTM_Kelp_SRock,39.492823,-123.80199


**Note** - there are some sites in fish, kelp and inverts data that are not in site table:
- Cayucos
- Fry's Anchorage (Same as Frys Anchorage)
- Hurricane Ridge
- LA Federal Breakwater
- Lover's 3 (Same as Lovers 3)
- Ocean Cove Kelper 
- Pier 400
- West Long Point (Same as Long Point West)

I am going to manually add the lat and lon for these sites to the site table. However, I talked to Jan and Dan on 8/6, and they're planning to update the official site table on DataONE as well.

Also **note** that Judith Reserve San Miguel Island is written Judith Reserve San Miguel Is in fish data only.
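
A check like the one behind the note above can be sketched with a set difference. The two toy frames below are stand-ins for the site table and one data file; their contents are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the site table and one data file
site_table = pd.DataFrame({'Site': ['120 Reef', 'Frys Anchorage', 'Lovers 3']})
survey_data = pd.DataFrame({'Site': ['120 Reef', "Fry's Anchorage", 'Cayucos', 'Cayucos']})

# Sites that appear in the data but have no exact match in the site table
missing_sites = sorted(set(survey_data['Site']) - set(site_table['Site']))

print(missing_sites)
```

Because the comparison is an exact string match, spelling variants such as "Fry's Anchorage" show up alongside sites that are truly absent, so the result still needs a manual look.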

In [8]:
## Add rows to site table -- CAN BE DELETED WHEN SITE TABLE IS UPDATED ON DATAONE

sites_to_add = pd.DataFrame({'Research_group':['RCCA']*5,
                            'Site':['Cayucos', 'Hurricane Ridge', 'LA Federal Breakwater', 'Ocean Cove Kelper', 'Pier 400'],
                            'Latitude':[35.4408, 37.4701, 33.711899, 38.555119, 33.716301],
                            'Longitude':[-120.936302, -122.4796, -118.241997, -123.3046, -118.258003]})
sites = pd.concat([sites, sites_to_add])

In [9]:
## Create a new column containing site names w/o spaces, and add it to data files

# Get a list of site names with spaces removed
site_names = [name.replace(' ', '') for name in sites['Site']]
    
# Map site_names to SiteName in sites_df; add sites that are in fish, inverts and algae data but not in site table
site_name_dict = dict(zip(sites['Site'], site_names))
site_name_dict["Fry's Anchorage"] = 'FrysAnchorage'
site_name_dict["Lover's 3"] = 'Lovers3'
site_name_dict['West Long Point'] = 'LongPointWest'
site_name_dict['Judith Reserve San Miguel Is'] = 'JudithReserveSanMiguelIsland'

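def create_SiteName(df, site_name_dict):
    
    # Create SiteName column by mapping Site through site_name_dict;
    # assigning the result (rather than replacing in place on a column selection)
    # avoids chained-assignment problems in recent pandas versions
    df['SiteName'] = df['Site'].replace(site_name_dict)
    
    return df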

inverts = create_SiteName(inverts, site_name_dict)
kelp = create_SiteName(kelp, site_name_dict)
fish = create_SiteName(fish, site_name_dict)

fish.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,Size_cm,Min_cm,Max_cm,Latitude,Longitude,Depth_ft,Temp10m,Visibility,SiteName
0,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Small,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef
1,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,none,Medium,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,a,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef


I'm going to assemble columns for one organism type at a time, and then concatenate at the end.

### Inverts occurrence file

In [10]:
## Pad month and day as needed

paddedDay = inverts['Day'].astype(str).str.zfill(2).tolist()      # e.g. 1 -> '01'
paddedMonth = inverts['Month'].astype(str).str.zfill(2).tolist()  # e.g. 8 -> '08'

In [11]:
## Create eventID

eventID = [inverts['SiteName'].iloc[i] + '_' + str(inverts['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(inverts['Transect'].iloc[i]) for i in range(len(inverts['Site']))]
inverts_occ = pd.DataFrame({'eventID':eventID})

inverts_occ.head()

Unnamed: 0,eventID
0,120Reef_20100801_1
1,120Reef_20100801_1
2,120Reef_20100801_1
3,120Reef_20100801_1
4,120Reef_20100801_1
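
The same eventID pattern (SiteName, then YYYYMMDD, then transect number) can also be built column-wise instead of row by row. This is a sketch on a toy frame with illustrative values, not a replacement tested against the full data:

```python
import pandas as pd

toy = pd.DataFrame({'SiteName': ['120Reef', 'Cayucos'],
                    'Year': [2010, 2006],
                    'Month': [8, 10],
                    'Day': [1, 8],
                    'Transect': [1, 12]})

# Zero-pad month and day to two digits, then concatenate the columns as strings
date_part = (toy['Year'].astype(str)
             + toy['Month'].astype(str).str.zfill(2)
             + toy['Day'].astype(str).str.zfill(2))
toy['eventID'] = toy['SiteName'] + '_' + date_part + '_' + toy['Transect'].astype(str)

print(toy['eventID'].tolist())
```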


In [12]:
## Add occurrenceID

inverts_occ['occurrenceID'] = inverts.groupby(['Site', 'SurveyDate', 'Transect'])['Classcode'].cumcount()+1
inverts_occ['occurrenceID'] = inverts_occ['eventID'] + '_inverts_occ' + inverts_occ['occurrenceID'].astype(str)

inverts_occ.head()

Unnamed: 0,eventID,occurrenceID
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5
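
To see what `cumcount()` contributes here, a toy example may help (the event IDs and classcodes below are illustrative). It numbers rows within each group starting from 0, which is why 1 is added before building the ID:

```python
import pandas as pd

toy = pd.DataFrame({'eventID': ['A_20100801_1'] * 3 + ['A_20100801_2'] * 2,
                    'Classcode': ['bat star', 'black abalone', 'chestnut cowry',
                                  'bat star', 'black abalone']})

# Running row number within each event, starting at 1
toy['n'] = toy.groupby('eventID').cumcount() + 1
toy['occurrenceID'] = toy['eventID'] + '_inverts_occ' + toy['n'].astype(str)

print(toy[['eventID', 'occurrenceID']])
```

The numbering restarts for each event, so an occurrenceID is unique only in combination with its eventID prefix.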


In [13]:
## Load species table

filename = 'RCCA_invertebrate_lookup_table.csv'
species = pd.read_csv(filename, encoding='cp1252')  # 'ANSI' only resolves on Windows; cp1252 is the usual Windows "ANSI" code page

species.head()

Unnamed: 0,Kingdom,Phylum,Class,Order,Family,Genus,Species,Classcode,taxonomic_source,taxonomic_id
0,Animalia,Echinodermata,Asteroidea,Valvatida,Asterinidae,Patiria,miniata,bat star,www.marinespecies.org,382131
1,Animalia,Cnidaria,Anthozoa,Alcyonacea,Plexauridae,Muricea,fruticosa/californica,brown/golden gorgonian,www.marinespecies.org,177745
2,Animalia,Echinodermata,Holothuroidea,Synallactida,Stichopodidae,Parastichopus,californicus,CA sea cucumber,www.marinespecies.org,711954
3,Animalia,Arthropoda,Malacostraca,Decapoda,Palinuridae,Panulirus,interruptus,CA spiny lobster,www.marinespecies.org,382898
4,Animalia,Mollusca,Gastropoda,Littorinimorpha,Cypraeidae,Neobernaya,spadicea,chestnut cowry,www.marinespecies.org,580674


In [14]:
## Map scientific names to classcodes

# Create scientific name column in species
species['scientificName'] = species['Genus'] + ' ' + species['Species']

# Strip any whitespace
species['scientificName'] = species['scientificName'].str.strip()

# Fix species names where Genus and Species were NaN
species.loc[species['Family'] == 'Actiniidae', 'scientificName'] = 'Actiniidae' 

# Create map
code_to_species_dict = dict(zip(species['Classcode'], species['scientificName']))

**Note** that some classcodes in inverts don't match the classcodes in the species table:
- ca sea cucumber --> CA sea cucumber
- ca spiny lobster --> CA spiny lobster
- kellet's whelk --> Kellet's whelk
- wavy red turban snail --> wavy/red turban snail
- california sea hare --> California sea hare
- ochre star --> ochre sea star
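
Mismatches like those listed above can be found by comparing the classcodes in the data against the lookup table. The sketch below uses toy series with illustrative values, and additionally separates case-only differences from ones that need a manual decision:

```python
import pandas as pd

data_codes = pd.Series(['bat star', 'ca sea cucumber', 'ochre star', 'ca sea cucumber'])
lookup_codes = pd.Series(['bat star', 'CA sea cucumber', 'ochre sea star'])

# Codes used in the data with no exact match in the lookup table
unmatched = sorted(set(data_codes) - set(lookup_codes))

# Of those, the ones that differ only by letter case could be fixed automatically
lower_to_lookup = {code.lower(): code for code in lookup_codes}
case_fixes = {code: lower_to_lookup[code.lower()]
              for code in unmatched if code.lower() in lower_to_lookup}
needs_manual_fix = [code for code in unmatched if code not in case_fixes]

print(case_fixes)
print(needs_manual_fix)
```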

In [15]:
## Change classcodes that don't match classcodes in species table

inverts.loc[inverts['Classcode'] == 'ca sea cucumber', 'Classcode'] = 'CA sea cucumber'
inverts.loc[inverts['Classcode'] == 'ca spiny lobster', 'Classcode'] = 'CA spiny lobster'
inverts.loc[inverts['Classcode'] == "kellet's whelk", 'Classcode'] = "Kellet's whelk"
inverts.loc[inverts['Classcode'] == 'wavy red turban snail', 'Classcode'] = 'wavy/red turban snail'
inverts.loc[inverts['Classcode'] == 'california sea hare', 'Classcode'] = 'California sea hare'
inverts.loc[inverts['Classcode'] == 'ochre star', 'Classcode'] = 'ochre sea star'

In [16]:
## Create scientificName column

# Map classcodes to scientific names; codes missing from the dictionary are left unchanged
inverts_occ['scientificName'] = inverts['Classcode'].replace(code_to_species_dict)
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis craherodii
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea fruticosa/californica
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus


In [17]:
## Drop the one row with scientificName = 'no inverts'

inverts_occ = inverts_occ[inverts_occ['scientificName'] != 'no inverts'].copy()

In [18]:
## Get unique scientific names for lookup in WoRMS

names = inverts_occ['scientificName'].unique()

**Note** that there are some classcodes that are not covered in the species table: sunflower/sun star, unknown abalone, no inverts 

Sunflower/sun star can be designated as class Asteroidea, with Solaster spp. or Pycnopodia helianthoides in identificationQualifier <br>
Unknown Abalone can be designated Haliotis <br>
No inverts is designated in a single row - drop it? **What does this mean?**

Also **note** that a number of names are not specific at the species level. These will match at the genus level, but I may want to add an identificationQualifier for them:
- Muricea fruticosa/californica
- Loxorhynchus grandis/crispatus
- Megastrea/Lithopoma undosa/gibberosa

**Assumed misspellings:**
- Haliotis craherodii --> Haliotis cracherodii
- Mesocentrotus francicanus --> Mesocentrotus franciscanus
- Crassadoma giganteum --> Crassedoma giganteum

**Assumed old names:**
- I think Megastrea/Lithopoma undosa/gibberosa means that either Megastraea undosa or Lithopoma gibberosa was observed
- Megastraea undosa matches in WoRMS. **Note** that Megastrea should be spelled Megastraea.
- Lithopoma gibberosa is unaccepted in WoRMS. The indicated accepted name is Pomaulax gibberosus
- I think it's best to put subfamily Turbininae for both, and then indicate possible species in the identificationQualifier column

In [19]:
## Add manually identified scientific names to names; correct spelling errors

names_to_change = ['sunflower/sun star', 'unknown abalone', 'Megastrea/Lithopoma undosa/gibberosa', 'Haliotis craherodii', 'Mesocentrotus francicanus', 'Crassadoma giganteum']
correct_names = ['Asteroidea', 'Haliotis', 'Turbininae', 'Haliotis cracherodii', 'Mesocentrotus franciscanus', 'Crassedoma giganteum']

# Build the mapping once so the same corrections are applied in both places
name_corrections = dict(zip(names_to_change, correct_names))

# Correct the unique names that will be looked up in WoRMS
for old_name, new_name in name_corrections.items():
    names = np.where(names == old_name, new_name, names)
    
# Also correct names in converted scientificName column
inverts_occ['scientificName'] = inverts_occ['scientificName'].replace(name_corrections)

In [20]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Muricea fruticosa/californica checking:  Muricea
Url didn't work for Cancer spp. checking:  Cancer
Url didn't work for Loxorhynchus grandis/crispatus checking:  Loxorhynchus
Url didn't work for Solaster spp. checking:  Solaster
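
The log above suggests that when a full name fails, the lookup is retried with the genus alone. The internals of the `WoRMS` module are not shown in this notebook, so the helper below (`genus_only`, a hypothetical name) is only a guess at how that fallback name might be derived:

```python
def genus_only(scientific_name):
    """Return the first word of a name, e.g. the genus of a binomial or a 'Genus spp.' entry."""
    return scientific_name.split()[0]

for name in ['Muricea fruticosa/californica', 'Cancer spp.', 'Loxorhynchus grandis/crispatus']:
    print(name, '->', genus_only(name))
```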


In [21]:
## Add scientific name-related columns

# Look up the WoRMS LSID and taxon ID for each (corrected) scientific name
inverts_occ['scientificNameID'] = inverts_occ['scientificName'].replace(name_id_dict)

inverts_occ['taxonID'] = inverts_occ['scientificName'].replace(name_taxid_dict)
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea fruticosa/californica,urn:lsid:marinespecies.org:taxname:177745,177745
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898


In [22]:
## Create identificationQualifier

qualifier_dict = {'Muricea fruticosa/californica':'Muricea fruticosa or Muricea californica',
               'Loxorhynchus grandis/crispatus':'Loxorhynchus grandis or Loxorhynchus crispatus',
               'Turbininae':'Megastraea undosa or Pomaulax gibberosus (previously Lithopoma gibberosa)',
               'Asteroidea':'Solaster spp. or Pycnopodia helianthoides'}

# Names without an entry in qualifier_dict get NaN
identificationQualifier = [qualifier_dict.get(name, np.nan) for name in inverts_occ['scientificName']]

In [23]:
## Replace scientificName using name_name_dict

inverts_occ['scientificName'] = inverts_occ['scientificName'].replace(name_name_dict)
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898


In [24]:
## Add vernacularName

inverts_occ.insert(2, 'vernacularName', inverts.loc[inverts['Classcode'] != 'no inverts', 'Classcode'])
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,bat star,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,black abalone,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,brown/golden gorgonian,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,CA sea cucumber,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,CA spiny lobster,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898


In [25]:
## Add final name-related columns

inverts_occ['nameAccordingTo'] = 'WoRMS'
inverts_occ['occurrenceStatus'] = 'present'
inverts_occ['basisOfRecord'] = 'HumanObservation'
inverts_occ['identificationQualifier'] = identificationQualifier
inverts_occ['occurrenceRemarks'] = np.nan  # no occurrenceRemarks required for inverts

inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,bat star,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,black abalone,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,present,HumanObservation,,
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,brown/golden gorgonian,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,CA sea cucumber,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954,WoRMS,present,HumanObservation,,
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,CA spiny lobster,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,present,HumanObservation,,


In [26]:
## Add sex and lifeStage column (no data for inverts)

inverts_occ['sex'] = np.nan
inverts_occ['lifeStage'] = np.nan
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,bat star,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,,,
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,black abalone,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,present,HumanObservation,,,,
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,brown/golden gorgonian,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,,,
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,CA sea cucumber,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954,WoRMS,present,HumanObservation,,,,
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,CA spiny lobster,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,present,HumanObservation,,,,


Next, I'll create a density column under the column name organismQuantity. I'll use this column to indicate absence records, **assuming that a density of 0 indicates absent**, and save the series to be used in the MoF file.

In [27]:
## Density

# First, remove the 'no inverts' row from inverts
inverts = inverts[inverts['Classcode'] != 'no inverts']

# Calculate density
inverts_density = round((inverts['Amount']/inverts['Distance'])*30, 2) 
inverts_occ['organismQuantity'] = inverts_density
inverts_occ['organismQuantityType'] = 'number of individuals per 60 m2'
inverts_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,bat star,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,,,,9.0,number of individuals per 60 m2
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,black abalone,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,present,HumanObservation,,,,,0.0,number of individuals per 60 m2
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,brown/golden gorgonian,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,,,,1.0,number of individuals per 60 m2
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,CA sea cucumber,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954,WoRMS,present,HumanObservation,,,,,0.0,number of individuals per 60 m2
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,CA spiny lobster,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,present,HumanObservation,,,,,0.0,number of individuals per 60 m2


**Note** that there are 562 (of 185317) records that have density = NaN. Where do these come from?

The inverts data frame never has Distance = 0. But both Distance and Amount are sometimes NaN. Amount is NaN for 562 records; Distance is NaN for 501 records. For the remaining 61 records for which Amount is NaN and Distance is not, Distance takes on a number of values (not just 30). Jan and Dan said that if Amount is NaN, I can safely assume that those species were not looked for during that survey. **So I can drop rows where Amount is NaN.**

To explore this further, use:

```python
missing = inverts[inverts['Amount'].isna()]
missing['Classcode'].unique()
```
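
The counts quoted above (how often Amount and Distance are missing together versus separately) can be reproduced with boolean masks. This sketch uses a toy frame with illustrative values in place of `inverts`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Amount': [9.0, np.nan, np.nan, 0.0],
                    'Distance': [30.0, np.nan, 30.0, 30.0]})

amount_nan = toy['Amount'].isna()
distance_nan = toy['Distance'].isna()

both_nan = int((amount_nan & distance_nan).sum())           # Amount and Distance both missing
amount_only_nan = int((amount_nan & ~distance_nan).sum())   # Amount missing, Distance recorded
distance_only_nan = int((~amount_nan & distance_nan).sum()) # Distance missing, Amount recorded

print(both_nan, amount_only_nan, distance_only_nan)
```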

In [28]:
## Drop records where density (organismQuantity) is NaN, i.e. where Amount was NaN

inverts_occ.dropna(subset=['organismQuantity'], inplace=True)
inverts_occ.shape

(184755, 15)

In [29]:
## Assign an occurrenceStatus of 'absent' to records where density = 0

inverts_occ.loc[inverts_occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'

Temperature and visibility data will be included in the MoF file at the event level, so I'll leave that alone for now.
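
As a rough idea of what those event-level records might look like, here is one possible long-format layout built with `melt`. The column names `measurementType`, `measurementValue` and `measurementUnit` are Darwin Core MeasurementOrFact terms, but the layout itself and the toy values are assumptions, not the final design. The units follow the column definitions given earlier.

```python
import pandas as pd

toy = pd.DataFrame({'eventID': ['120Reef_20100801_1', '120Reef_20100801_1', '120Reef_20100801_2'],
                    'Temp10m': [13.0, 13.0, 14.0],
                    'Visibility': [3.0, 3.0, 4.0]})

# Keep one row per event, then reshape to one row per (event, measurement)
events = toy.drop_duplicates(subset='eventID')
mof = events.melt(id_vars='eventID', value_vars=['Temp10m', 'Visibility'],
                  var_name='measurementType', value_name='measurementValue')
mof['measurementUnit'] = mof['measurementType'].map({'Temp10m': 'degrees Celsius',
                                                     'Visibility': 'meters'})

print(mof)
```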

### Algae occurrence file

In the algae data, giant kelp have been given multiple rows if more than one was observed on a given transect. So, for example, at 120 Reef on October 8, 2006, during the 5th transect, 4 giant kelps were observed. Correspondingly, 'giant kelp' has four rows, while the other species only have 1 to indicate their presence/absence and density. The Stipes column indicates the number of stipes for each of these giant kelps.
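
The row structure described above can be illustrated by collapsing a toy transect to one row per species. The stipe counts match the first rows of the kelp table shown earlier, while the `eventID` value and the aggregation itself are only an illustration, not a step this notebook performs:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'eventID': ['E1'] * 5,
                    'Classcode': ['bull kelp', 'giant kelp', 'giant kelp', 'giant kelp', 'giant kelp'],
                    'Amount': [0.0, 1.0, 1.0, 1.0, 1.0],
                    'Stipes': [np.nan, 8.0, 12.0, 13.0, 16.0]})

# One row per species per transect; sum() skips NaN, so species without
# stipe counts end up with a total of 0
summary = (toy.groupby(['eventID', 'Classcode'], as_index=False)
              .agg(individuals=('Amount', 'sum'), total_stipes=('Stipes', 'sum')))

print(summary)
```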

In [30]:
## Build eventID

# Pad month and day as needed
paddedDay = kelp['Day'].astype(str).str.zfill(2).tolist()      # e.g. 1 -> '01'
paddedMonth = kelp['Month'].astype(str).str.zfill(2).tolist()  # e.g. 8 -> '08'

# Create eventID
eventID = [kelp['SiteName'].iloc[i] + '_' + str(kelp['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(kelp['Transect'].iloc[i]) for i in range(len(kelp['SiteName']))]
kelp_occ = pd.DataFrame({'eventID':eventID})

print(kelp_occ.shape)
kelp_occ.head()

(72114, 1)


Unnamed: 0,eventID
0,120Reef_20100801_1
1,120Reef_20100801_1
2,120Reef_20100801_1
3,120Reef_20100801_1
4,120Reef_20100801_1


In [31]:
## Add occurrenceID

# Create SurveyDate column to groupby
SurveyDate = [str(kelp['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(len(kelp['SiteName']))]
kelp['SurveyDate'] = SurveyDate

# Use SurveyDate to create occurrenceID
kelp_occ['occurrenceID'] = kelp.groupby(['SiteName', 'SurveyDate', 'Transect'])['Classcode'].cumcount()+1
kelp_occ['occurrenceID'] = kelp_occ['eventID'] + '_algae_occ' + kelp_occ['occurrenceID'].astype(str)

kelp_occ.head()

Unnamed: 0,eventID,occurrenceID
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5


In [32]:
## Load species table

filename = 'RCCA_algae_species_lookup_table.csv'
species = pd.read_csv(filename)

species.head()

Unnamed: 0,Kingdom,Division,Class,Order,Family,Genus,Species,Classcode,taxonomic_source,taxonomic_id,species_definition
0,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Phaeophyceae,Nereocystis,luetkeana,bull kelp,www.marinespecies.org,240752,bull kelp
1,Chromista,Phaeophyta,Phaeophycease,Laminariales,Alariaceae,Pterygophora,californica,pterygophora,www.marinespecies.org,240750,Pterygophora
2,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Lessoniaceae,Eisenia,arborea,southern sea palm,www.marinespecies.org,371990,Southern sea palm larger than 30 cm. Prior to ...
3,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Lessoniaceae,Eisenia,arborea,southern sea palm small,www.marinespecies.org,371990,Southern sea palm smaller than 30 cm. Prior to...
4,Chromista,Phaeophyta,Phaeophyceae,Laminariales,Laminariaceae,Laminaria,spp.,Laminaria spp,www.marinespecies.org,144199,Laminaria farlowii and L. setchellii were coun...


In [33]:
## Map scientific names to classcodes and create scientificName

# Create scientific name column in species
species['scientificName'] = species['Genus'] + ' ' + species['Species']

# Create map
code_to_species_dict = dict(zip(species['Classcode'], species['scientificName']))

# Add in classcodes that are different in data and species table
code_to_species_dict['laminaria spp.'] = 'Laminaria spp.'
code_to_species_dict['souther sea palm'] = 'Eisenia arborea'
code_to_species_dict['souther sea palm small'] = 'Eisenia arborea'
code_to_species_dict['laminaria farlowi'] = 'Laminaria farlowii'
code_to_species_dict['laminaria setchel'] = 'Laminaria setchellii'
code_to_species_dict['sargassum horneri'] = 'Sargassum horneri'

# Create scientificName
kelp_occ['scientificName'] = kelp['Classcode']
kelp_occ['scientificName'] = kelp_occ['scientificName'].replace(code_to_species_dict)
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,Nereocystis luetkeana
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,Macrocystis pyrifera
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,Macrocystis pyrifera
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,Macrocystis pyrifera
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,Macrocystis pyrifera
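
One way to find classcodes that differ between the data and the species table (like the ones added by hand above) is to use `map()` instead of `replace()` for the check. This is a sketch on made-up values: `toy_map` and `toy_codes` are stand-ins, not the real lookup table.

```python
import pandas as pd

# Hypothetical lookup and data values, for illustration only
toy_map = {'bull kelp': 'Nereocystis luetkeana',
           'giant kelp': 'Macrocystis pyrifera'}
toy_codes = pd.Series(['bull kelp', 'giant kelp', 'souther sea palm'])

# map() returns NaN for any code with no dictionary entry,
# whereas replace() would leave the unmatched code unchanged
mapped = toy_codes.map(toy_map)
unmapped = sorted(toy_codes[mapped.isna()].unique())

print(unmapped)  # ['souther sea palm']
```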


In [34]:
## Get unique scientific names for lookup in WoRMS

names = kelp_occ['scientificName'].unique()

In [35]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Laminaria spp. checking:  Laminaria


In [36]:
## Add scientific name-related columns

kelp_occ['scientificNameID'] = kelp_occ['scientificName'].replace(name_id_dict)

kelp_occ['taxonID'] = kelp_occ['scientificName'].replace(name_taxid_dict)
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
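
In the output above, each scientificNameID is the taxonID (the WoRMS AphiaID) with a fixed LSID prefix. A small sketch of that relationship, useful as a sanity check on the two columns. The helper names are hypothetical and are not part of my WoRMS module.

```python
LSID_PREFIX = 'urn:lsid:marinespecies.org:taxname:'

def aphia_id_to_lsid(aphia_id):
    """Build the LSID string from a numeric AphiaID (hypothetical helper)."""
    return LSID_PREFIX + str(int(aphia_id))

def lsid_to_aphia_id(lsid):
    """Recover the numeric AphiaID from an LSID string (hypothetical helper)."""
    return int(lsid.rsplit(':', 1)[1])

print(aphia_id_to_lsid(240752))  # urn:lsid:marinespecies.org:taxname:240752
print(lsid_to_aphia_id('urn:lsid:marinespecies.org:taxname:232231'))  # 232231
```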


In [37]:
## Create identificationQualifier to handle Laminaria spp. and occurrenceRemarks to handle Eisenia arborea

identificationQualifier = ['Laminaria farlowii or Laminaria setchellii' if kelp_occ['scientificName'].iloc[i] == 'Laminaria spp.' else np.nan for i in range(kelp_occ.shape[0])]

occurrenceRemarks = []
for i in range(kelp_occ.shape[0]):
    if kelp['Classcode'].iloc[i] == 'souther sea palm':
        occurrenceRemarks.append('individuals >= 30 cm in size')
    elif kelp['Classcode'].iloc[i] == 'souther sea palm small':
        occurrenceRemarks.append('individuals < 30 cm in size')
    else:
        occurrenceRemarks.append(np.nan)

In [38]:
## Replace scientificName using name_name_dict

kelp_occ['scientificName'] = kelp_occ['scientificName'].replace(name_name_dict)
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231


In [39]:
## Add vernacularName

kelp_occ.insert(2, 'vernacularName', kelp['Classcode'])
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,bull kelp,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231


In [40]:
## Clean vernacularName

kelp_occ['vernacularName'] = kelp_occ['vernacularName'].replace({
    'souther sea palm':'southern sea palm',
    'laminaria farlowi':'laminaria farlowii',
    'laminaria setchel':'laminaria setchellii',
    'souther sea palm small':'southern sea palm'})
kelp_occ['vernacularName'].unique()

array(['bull kelp', 'giant kelp', 'laminaria spp.', 'pterygophora',
       'southern sea palm', 'feather boa', 'laminaria farlowii',
       'laminaria setchellii', 'sargassum horneri'], dtype=object)

In [41]:
## Add final name-related columns

kelp_occ['nameAccordingTo'] = 'WoRMS'
kelp_occ['occurrenceStatus'] = 'present'
kelp_occ['basisOfRecord'] = 'HumanObservation'
kelp_occ['identificationQualifier'] = identificationQualifier
kelp_occ['occurrenceRemarks'] = occurrenceRemarks

kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,bull kelp,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752,WoRMS,present,HumanObservation,,
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,


In [42]:
## Add sex and lifeStage columns (no data for kelp)

kelp_occ['sex'] = np.nan
kelp_occ['lifeStage'] = np.nan
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,bull kelp,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752,WoRMS,present,HumanObservation,,,,
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,


In [43]:
## Create density to find absence records and for use in MoF

# Change instances in kelp where Distance = NaN to Distance = 30 (see Markdown cell below for reasoning)
kelp.loc[kelp['Distance'].isna() == True, 'Distance'] = 30

# Calculate density
kelp_density = round((kelp['Amount']/kelp['Distance'])*30, 2) # units = individuals per 60 m2
kelp_occ['organismQuantity'] = kelp_density
kelp_occ['organismQuantityType'] = 'number of individuals per 60 m2'
kelp_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,120Reef_20100801_1,120Reef_20100801_1_algae_occ1,bull kelp,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752,WoRMS,present,HumanObservation,,,,,0.0,number of individuals per 60 m2
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,,1.67,number of individuals per 60 m2
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,,1.67,number of individuals per 60 m2
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,,1.67,number of individuals per 60 m2
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,,1.67,number of individuals per 60 m2


**Note** that here, again, 9806 densities are NaN. Where do these come from?

In kelp, there are 9802 records with Distance = NaN. Of these:
- 9684 records have Amount = 0 
- 1 record has Amount > 0 
- 117 records have Amount = NaN 

To look at these records, use:
```python
# Amount = 0
kelp[(kelp['Distance'].isna() == True) & (kelp['Amount']==0)].shape

# Amount > 0
kelp[(kelp['Distance'].isna() == True) & (kelp['Amount'] > 0)]

# Amount = NaN
kelp[(kelp['Distance'].isna() == True) & (kelp['Amount'].isna() == True)].shape
```

**For all of these, I think I should assume that Distance = 30, and the NaN values are true missing values.**

There are 4 remaining records where Distance is not NaN, but density is. These have Amount = NaN but Distance != NaN. To look at these records, use:
```python
(kelp[(kelp['Distance'].isna() == False) & (kelp['Amount'].isna() == True)])
```

**These are true missing values.**

**CONCLUSION: After correcting records where Distance = NaN to Distance = 30, there should be 121 true NaN values in density. These records can be dropped.**
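
The accounting above can be sketched on a toy frame (made-up values, not the real kelp data). After filling missing Distance with 30, density is NaN only where Amount itself is NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows covering each case discussed above
toy = pd.DataFrame({
    'Amount':   [5.0, 0.0, np.nan, np.nan, 3.0],
    'Distance': [30.0, np.nan, np.nan, 15.0, 15.0],
})

# Assume a full 30 m transect wherever Distance is missing
toy['Distance'] = toy['Distance'].fillna(30)

# Scale counts to a full 30 m transect
toy['density'] = ((toy['Amount'] / toy['Distance']) * 30).round(2)

# Only the rows with Amount = NaN remain missing
n_missing = int(toy['density'].isna().sum())
print(n_missing)  # 2
```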

In [44]:
## Assign an occurrenceStatus of 'absent' to records where density = 0; drop records where organismQuantity = NaN

# Absent records
kelp_occ.loc[kelp_occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'

# Drop organismQuantity = NaN
kelp_occ.dropna(subset=['organismQuantity'], inplace=True)
kelp_occ.shape

(71993, 15)

In [45]:
## Save Stipes for use in MoF file

kelp_sizes = pd.DataFrame({
    'eventID':kelp_occ['eventID'],
    'occurrenceID':kelp_occ['occurrenceID'],
    'size':kelp.loc[kelp['Amount'].isna() == False, 'Stipes']
})
kelp_sizes.dropna(inplace=True)
kelp_sizes.head()

Unnamed: 0,eventID,occurrenceID,size
1,120Reef_20100801_1,120Reef_20100801_1_algae_occ2,8.0
2,120Reef_20100801_1,120Reef_20100801_1_algae_occ3,12.0
3,120Reef_20100801_1,120Reef_20100801_1_algae_occ4,13.0
4,120Reef_20100801_1,120Reef_20100801_1_algae_occ5,16.0
5,120Reef_20100801_1,120Reef_20100801_1_algae_occ6,3.0


**Note** that there are 1400 rows where size = 0. Looking back in kelp, these rows also have amount = 0, and so I think these rows can be deleted. To check this, use:
```python
# 1400 records where size = 0
kelp_sizes[kelp_sizes['size'] == 0]

# All of these have Amount = 0 in original kelp data
kelp.loc[kelp['Stipes'] == 0, 'Amount'].unique()
```

In [46]:
## Remove zeros

kelp_sizes = kelp_sizes[kelp_sizes['size'] > 0]

### Fish occurrence file

There are lots of complexities here as well. It seems like I should have organismQuantity (density), organismQuantityType, sex, and lifeStage in the occurrence file. Sex and lifeStage will have to be separated out from the Sex column in the original fish dataframe. Then, in the MoF file, each occurrenceID can be associated with a size. Sizes will be "Small", "Medium" or "Large" prior to 2013, and estimated to the nearest cm afterwards. If a min and max size have been provided (as may occur after 2013 for groups of fish), each will have its own min size or max size row associated with the occurrenceID.

**Note** that there are a good number (~17,000) of records where amount is NaN. For these records, Size_cm, Min_cm and Max_cm are either 0 or NaN. Which value (0 or NaN) is used does not depend on whether the Year is before or after 2013. To check this, use:
```python
# Number of records where Amount = NaN
fish[fish['amount'].isna() == True].shape

# For these records, 'Size_cm' is either 0 or NaN
fish.loc[fish['amount'].isna() == True, 'Size_cm'].unique()

# For these records, 'Min_cm' is either 0 or NaN
fish.loc[fish['amount'].isna() == True, 'Min_cm'].unique()

# For these records, 'Max_cm' is either 0 or NaN
fish.loc[fish['amount'].isna() == True, 'Max_cm'].unique()
```

**Anytime Amount = NaN, it should be treated as a true missing value. Size_cm, Min_cm and Max_cm can be ignored, and the records can be dropped.**

In [47]:
## Start by pulling sex and lifeStage information out of Sex column

fish['lifeStage'] = fish['Sex']
fish.loc[fish['lifeStage'].isin(['m', 'f', 'none']), 'lifeStage'] = np.nan
fish.loc[fish['Sex'].isin(['a', 'j', 'none']), 'Sex'] = np.nan

# Replace Sex and lifeStage with controlled vocabulary
fish['Sex'] = fish['Sex'].replace({'m':'male', 'f':'female'})
fish['lifeStage'] = fish['lifeStage'].replace({'a':'adult', 'j':'juvenile'})

print(fish.shape)
fish.head()

(485046, 20)


Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,Size_cm,Min_cm,Max_cm,Latitude,Longitude,Depth_ft,Temp10m,Visibility,SiteName,lifeStage
0,120 Reef,1,8,2010,8/1/2010,1,black perch,,Small,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,
1,120 Reef,1,8,2010,8/1/2010,1,black perch,,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,,Medium,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,adult
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,,,,33.737919,-118.392014,21.0,13.0,3.0,120Reef,


There is one row where the value 'Sex' appears in the Sex column. I feel comfortable changing that to NaN.

In [48]:
## Change 'Sex' to NaN

fish.loc[fish['Sex'] == 'Sex', 'Sex'] = np.nan
fish.loc[fish['lifeStage'] == 'Sex', 'lifeStage'] = np.nan
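
An alternative way to split the combined Sex column, sketched on made-up values. Using `map()` with one dictionary per target column sends every code that is not in the dictionary to NaN, which covers 'none' and the stray 'Sex' value without a separate clean-up step. The `toy_sex` series is an assumption for illustration.

```python
import pandas as pd

# The codes that appear in the Sex column, plus the stray 'Sex' value
toy_sex = pd.Series(['m', 'f', 'a', 'j', 'none', 'Sex'])

# Codes missing from each dictionary become NaN
sex = toy_sex.map({'m': 'male', 'f': 'female'})
life_stage = toy_sex.map({'a': 'adult', 'j': 'juvenile'})

print(pd.DataFrame({'code': toy_sex, 'sex': sex, 'lifeStage': life_stage}))
```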

Also **note** that there are 4509 records where amount = 0 but the SizeCategory is listed as 'Small'. It seems like these records should have SizeCategory = NaN. **I have assumed this is correct in assembling the size data for the MoF file below.** To check this:
```python
fish.loc[(fish['amount'] == 0) & (fish['SizeCategory'].isna() == False), 'SizeCategory'].unique()
```

This is not a problem for Size_cm, which is only > 0 if amount is > 0.

In [49]:
## Build eventID

# Pad month and day as needed
paddedDay = ['0' + str(fish['Day'].iloc[i]) if len(str(fish['Day'].iloc[i])) == 1 else str(fish['Day'].iloc[i]) for i in range(len(fish['Day']))]
paddedMonth = ['0' + str(fish['Month'].iloc[i]) if len(str(fish['Month'].iloc[i])) == 1 else str(fish['Month'].iloc[i]) for i in range(len(fish['Month']))]

# Create eventID
eventID = [fish['SiteName'].iloc[i] + '_' + str(fish['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(fish['Transect'].iloc[i]) for i in range(len(fish['SiteName']))]
fish_occ = pd.DataFrame({'eventID':eventID})

print(fish_occ.shape)
fish_occ.head()

(485046, 1)


Unnamed: 0,eventID
0,120Reef_20100801_1
1,120Reef_20100801_1
2,120Reef_20100801_1
3,120Reef_20100801_1
4,120Reef_20100801_1
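
The padding and eventID steps above could also be written with vectorized string methods rather than list comprehensions. A sketch on made-up rows (the second site and its values are assumptions); `str.zfill(2)` left-pads single-digit months and days with a zero.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    'SiteName': ['120Reef', 'AbaloneCove'],
    'Year': [2010, 2011],
    'Month': [8, 11],
    'Day': [1, 15],
    'Transect': [1, 3],
})

# Build YYYYMMDD with zero-padded month and day
date_str = (toy['Year'].astype(str)
            + toy['Month'].astype(str).str.zfill(2)
            + toy['Day'].astype(str).str.zfill(2))

# eventID = site + date + transect
event_id = toy['SiteName'] + '_' + date_str + '_' + toy['Transect'].astype(str)

print(event_id.tolist())
# ['120Reef_20100801_1', 'AbaloneCove_20111115_3']
```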


In [50]:
## Add occurrenceID

# Create SurveyDate column to groupby
SurveyDate = [str(fish['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(len(fish['SiteName']))]
fish['SurveyDate'] = SurveyDate

# Use SurveyDate to create occurrenceID
fish_occ['occurrenceID'] = fish.groupby(['SiteName', 'SurveyDate', 'Transect'])['Species'].cumcount()+1
fish_occ['occurrenceID'] = fish_occ['eventID'] + '_fish_occ' + fish_occ['occurrenceID'].astype(str)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5


In [51]:
## Load species table

filename = 'RCCA_fish_species_lookup_table.csv'
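# Note: 'ansi' is a Windows-only codec alias; on other platforms an explicit code page such as 'cp1252' may be needed
species = pd.read_csv(filename, encoding='ansi')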

print(species.shape)
species.head()

(39, 11)


Unnamed: 0,Kingdom,Phylum,Class,Order,Family,Genus,Species,Name,taxonomic_source,taxonomic_id,species_definition
0,Animalia,Chordata,Actinopterygii,Perciformes,Haemulidae,Anisotremus,davidsonii,sargo,www.marinespecies.org,279617,Anisotremus davidsonii
1,Animalia,Chordata,Actinopterygii,Perciformes,Pomacentridae,Chromis,punctipinnis,blacksmith,www.marinespecies.org,273751,Chromis punctipinnis
2,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,Embiotoca,lateralis,striped perch,www.marinespecies.org,240740,Embiotoca lateralis
3,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,Embiotoca,jacksoni,black perch,www.marinespecies.org,240746,Embiotoca jacksoni
4,Animalia,Chordata,Actinopterygii,Perciformes,Kyphosidae,Girella,nigricans,opaleye,www.marinespecies.org,280865,Girella nigricans


In [52]:
## Map scientific names to classcodes and create scientificName

# Create scientific name column in species
species['scientificName'] = species['Genus'] + ' ' + species['Species']

# Create map
code_to_species_dict = dict(zip(species['Name'], species['scientificName']))

# Indicate that yoy rockfish should be Sebastes
code_to_species_dict['yoy rockfish'] = 'Sebastes'

# Add in classcodes that are different in data and species table
code_to_species_dict['moray eel'] = 'Gymnothorax mordax'

# Create scientificName
fish_occ['scientificName'] = fish['Species']
fish_occ['scientificName'] = fish_occ['scientificName'].replace(code_to_species_dict)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,Embiotoca jacksoni
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,Embiotoca jacksoni
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,Chromis punctipinnis
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,Hypsypops rubicundus
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,Stereolepis gigas


**Note** that there is one species in fish_occ that does not match its entry in the species table: fish_occ has 'moray eel' instead of 'California moray', as in the species table.

In [53]:
## Get unique scientific names for lookup in WoRMS

names = fish_occ['scientificName'].unique()

**Note** that a number of names are not specific at the species level. These will match at the genus level, so I may want to include an identificationQualifier for them:
- Sebastes flavidus/serranoides
- Sebastes miniatus/pinger

**Assumed misspellings:**
- Sebastes pinger --> Sebastes pinniger
- Balistes polyepis --> Balistes polylepis

In [54]:
## Add manually identified scientific names to names; correct spelling errors

names_to_change = ['Sebastes miniatus/pinger', 'Balistes polyepis']
correct_names = ['Sebastes miniatus/pinniger', 'Balistes polylepis']

for i in range(len(names_to_change)):
    names = np.where(names==names_to_change[i], correct_names[i], names)
    
# Also correct names in converted scientificName column
fish_occ['scientificName'] = fish_occ['scientificName'].replace({
    'Sebastes miniatus/pinger':'Sebastes miniatus/pinniger',
    'Balistes polyepis':'Balistes polylepis'})

In [55]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Sebastes flavidus/serranoides checking:  Sebastes
Url didn't work for Sebastes miniatus/pinniger checking:  Sebastes


In [56]:
## Add scientific name-related columns

fish_occ['scientificNameID'] = fish_occ['scientificName'].replace(name_id_dict)

fish_occ['taxonID'] = fish_occ['scientificName'].replace(name_taxid_dict)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884


In [57]:
## Create identificationQualifier

qualifier_dict = {'Sebastes flavidus/serranoides':'Sebastes flavidus or Sebastes serranoides',
               'Sebastes miniatus/pinniger':'Sebastes miniatus or Sebastes pinniger'}

identificationQualifier = [qualifier_dict[name] if name in qualifier_dict.keys() else np.nan for name in fish_occ['scientificName']]

In [58]:
## Replace scientificName using name_name_dict

fish_occ['scientificName'] = fish_occ['scientificName'].replace(name_name_dict)
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884


In [59]:
## Add vernacularName

fish_occ.insert(2, 'vernacularName', fish['Species'])
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,blacksmith,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,garibaldi,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,giant sea bass,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884


In [60]:
## Clean vernacularName

fish_occ['vernacularName'] = fish_occ['vernacularName'].replace({
    'yellowtail/olive':'yellowtail/olive rockfish',
    'vermilion/canary':'vermilion/canary rockfish',
    'black and yellow':'black and yellow rockfish',
    'yoy rockfish':'young of the year rockfish',
    'vermilion':'vermilion rockfish'})
fish_occ['vernacularName'].unique()

array(['black perch', 'blacksmith', 'garibaldi', 'giant sea bass',
       'rubberlip perch', 'senorita', 'sheephead',
       'young of the year rockfish', 'kelp bass', 'pile perch', 'opaleye',
       'rainbow perch', 'rock wrasse', 'barred sand bass',
       'blue rockfish', 'kelp rockfish', 'sargo', 'brown rockfish',
       'china rockfish', 'striped perch', 'treefish', 'cabezon',
       'grass rockfish', 'black and yellow rockfish', 'black rockfish',
       'copper rockfish', 'finescale triggerfish', 'gopher rockfish',
       'halfmoon', 'horn shark', 'kelp greenling', 'largemouth blenny',
       'lingcod', 'moray eel', 'vermilion rockfish',
       'yellowtail/olive rockfish', 'bocaccio', 'rock greenling',
       'vermilion/canary rockfish'], dtype=object)

In [61]:
## Add final name-related columns

fish_occ['nameAccordingTo'] = 'WoRMS'
fish_occ['occurrenceStatus'] = 'present'
fish_occ['basisOfRecord'] = 'HumanObservation'
fish_occ['identificationQualifier'] = identificationQualifier
fish_occ['occurrenceRemarks'] = np.nan

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,blacksmith,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751,WoRMS,present,HumanObservation,,
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,garibaldi,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130,WoRMS,present,HumanObservation,,
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,giant sea bass,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884,WoRMS,present,HumanObservation,,


In [62]:
## Add sex and lifeStage information

fish_occ['sex'] = fish['Sex']
fish_occ['lifeStage'] = fish['lifeStage']

fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,blacksmith,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751,WoRMS,present,HumanObservation,,,,
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,garibaldi,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130,WoRMS,present,HumanObservation,,,,adult
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,giant sea bass,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884,WoRMS,present,HumanObservation,,,,


In [63]:
## Create density to find absence records and for use in MoF

fish_density = fish['amount'] # units = individuals per 120 m3
fish_occ['organismQuantity'] = fish_density
fish_occ['organismQuantityType'] = 'number of individuals per 120 m3'
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,,1.0,number of individuals per 120 m3
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,,2.0,number of individuals per 120 m3
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,blacksmith,Chromis punctipinnis,urn:lsid:marinespecies.org:taxname:273751,273751,WoRMS,present,HumanObservation,,,,,1.0,number of individuals per 120 m3
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,garibaldi,Hypsypops rubicundus,urn:lsid:marinespecies.org:taxname:281130,281130,WoRMS,present,HumanObservation,,,,adult,2.0,number of individuals per 120 m3
4,120Reef_20100801_1,120Reef_20100801_1_fish_occ5,giant sea bass,Stereolepis gigas,urn:lsid:marinespecies.org:taxname:282884,282884,WoRMS,present,HumanObservation,,,,,0.0,number of individuals per 120 m3


In [64]:
## Assign an occurrenceStatus of 'absent' to records where density = 0

fish_occ.loc[fish_occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'

**Note** that here, again, 16923 densities are NaN. These are all from fish, where amount = NaN for 16932 records. **These should be interpreted as true missing values, i.e., the records can be dropped.**

In [65]:
## Drop instances where organismQuantity = NaN (as decided above)

fish_occ.dropna(subset=['organismQuantity'], inplace=True)

In [66]:
## Save Size, Min and Max for use in MoF file

# Drop records where amount = NaN
fish.dropna(subset=['amount'], inplace=True)

# Fix instances where amount = 0 and SizeCategory exists
fish.loc[(fish['amount'] == 0) & (fish['SizeCategory'].isna() == False), 'SizeCategory'] = np.nan

# When size_cm = min_cm = max_cm, it means all individuals were the same size and min_cm and max_cm should be NaN (see point 1 below)
fish.loc[(fish['Size_cm'] == fish['Min_cm']) & (fish['Size_cm'] == fish['Max_cm']), ['Min_cm', 'Max_cm']] = np.nan

# If the amount is 0 and Size_cm is 0, all size measurements should be NaN (see point 2 below)
fish.loc[(fish['amount'] == 0) & (fish['Size_cm'] == 0), ['Size_cm', 'Min_cm', 'Max_cm']] = np.nan

# If species = yoy rockfish, Size_cm should = NaN, Min_cm should = 1, and Max_cm should = 10 (see points 3 and 4 below, and additional notes on points 3 and 4)
fish.loc[(fish['Species'] == 'yoy rockfish') & (fish['amount'].isna() == False) & (fish['amount'] > 0), 'Size_cm'] = np.nan
fish.loc[(fish['Species'] == 'yoy rockfish') & (fish['amount'].isna() == False) & (fish['amount'] > 0), 'Min_cm'] = 1
fish.loc[(fish['Species'] == 'yoy rockfish') & (fish['amount'].isna() == False) & (fish['amount'] > 0), 'Max_cm'] = 10

# If amount = 1 and a size range is given, and min_cm = 0, set min_cm and size_cm to max_cm (see points 3 and 4 below, and additional notes on points 3 and 4)
fish.loc[(fish['amount'] == 1) & 
         (fish['Size_cm'].isna() == False) & 
         (fish['Min_cm'].isna() == False) &
         (fish['Min_cm'] == 0), 'Size_cm'] = fish.loc[(fish['amount'] == 1) & 
                                                      (fish['Size_cm'].isna() == False) &
                                                      (fish['Min_cm'].isna() == False) &
                                                      (fish['Min_cm'] == 0), 'Max_cm'].copy()
fish.loc[(fish['amount'] == 1) & (fish['Size_cm'].isna() == False) & (fish['Min_cm'].isna() == False) & (fish['Min_cm'] == 0), ['Min_cm', 'Max_cm']] = np.nan

# If amount = 1 and a size range is given, and max_cm is missing, set size_cm to min_cm and clear the range (see points 3 and 4 below, and additional notes on points 3 and 4)
fish.loc[(fish['amount'] == 1) & 
         (fish['Size_cm'].isna() == False) & 
         (fish['Min_cm'].isna() == False) & 
         (fish['Max_cm'].isna() == True), 'Size_cm'] = fish.loc[(fish['amount'] == 1) & 
                                                                (fish['Size_cm'].isna() == False) & 
                                                                (fish['Min_cm'].isna() == False) & 
                                                                (fish['Max_cm'].isna() == True), 'Min_cm'].copy()
fish.loc[(fish['amount'] == 1) & (fish['Size_cm'].isna() == False) & (fish['Min_cm'].isna() == False) & (fish['Max_cm'].isna() == True), ['Min_cm', 'Max_cm']] = np.nan

# If amount = 1 and a size range is given, and size_cm != min_cm != max_cm, these are real data entry errors. Exclude from MoF. 
# (see points 3 and 4 below, and additional notes on points 3 and 4)
fish.loc[(fish['amount'] == 1) & 
         (fish['Size_cm'].isna() == False) & 
         (fish['Min_cm'].isna() == False) & 
         (fish['Max_cm'].isna() == False) & 
         (fish['Min_cm'] != 0), ['Size_cm', 'Min_cm', 'Max_cm']] = np.nan

# If amount > 1 and a size range is given, min_cm = 0 and species != yoy rockfish, assume max_cm is the correct size of the individual
# (see points 3 and 4 below, and additional notes on points 3 and 4)
fish.loc[fish['Min_cm'] == 0, 'Size_cm'] = fish.loc[fish['Min_cm'] == 0, 'Max_cm'] 
fish.loc[fish['Min_cm'] == 0, ['Min_cm', 'Max_cm']] = np.nan

# Forgot to account for situations that meet the above criteria but min_cm != 0, i.e., true size ranges. Want to exclude Size_cm in these cases.
fish.loc[(fish['amount'] > 1) & (fish['Size_cm'].isna() == False) & (fish['Min_cm'].isna() == False) & (fish['Max_cm'].isna() == False), ['Size_cm']] = np.nan

# Assemble fish_sizes
fish_sizes = pd.DataFrame({
    'eventID':fish_occ['eventID'],
    'occurrenceID':fish_occ['occurrenceID'],
    'size_cat':fish['SizeCategory'],
    'size_cm':fish['Size_cm'],
    'min_size':fish['Min_cm'],
    'max_size':fish['Max_cm']
})
fish_sizes.dropna(how='all', subset=['size_cat', 'size_cm', 'min_size', 'max_size'], inplace=True)
print(fish_sizes.shape)
fish_sizes.head()

(153695, 6)


Unnamed: 0,eventID,occurrenceID,size_cat,size_cm,min_size,max_size
0,120Reef_20100801_1,120Reef_20100801_1_fish_occ1,Small,,,
1,120Reef_20100801_1,120Reef_20100801_1_fish_occ2,Medium,,,
2,120Reef_20100801_1,120Reef_20100801_1_fish_occ3,Medium,,,
3,120Reef_20100801_1,120Reef_20100801_1_fish_occ4,Medium,,,
5,120Reef_20100801_1,120Reef_20100801_1_fish_occ6,Medium,,,


**Note:** There are a number of problems with the min and max columns.
1. In 74035 cases, amount = 1 and Min_cm and Max_cm = Size_cm. This is correct based on how Dan and Jan describe the data entry procedures. I am interested in the Size_cm category only for these individuals. **I'm going to let Min_cm and Max_cm be NaN.**
2. In a number of cases, amount = 0 and Size_cm = 0, Min_cm = NaN and Max_cm = NaN. I assume Size_cm, Min_cm and Max_cm should all be NaN. **I'm no longer seeing any records where these conditions are met. Instead, there are 314179 records where amount = 0, and Size_cm is either 0 or NaN for those records. If Size_cm = 0, then Min_cm and Max_cm = 0. If Size_cm = NaN, then Min_cm and Max_cm = NaN. I want to set any instances of Size = 0 to Size = NaN. In these instances, the associated Min_cm and Max_cm values also need to be NaN.**
3. There are also 1145 cases where amount = 1 and Min_cm != Max_cm != Size_cm. How am I supposed to interpret these rows? It looks like in these cases, Size_cm remains the average of Min_cm and Max_cm. **These are the ones that I haven't handled in a sensible way. Saving these records and sending them to Dan.**
4. Finally, there are 6738 cases where Min_cm = 0. Most of these also have amount = NaN, and so can be ignored. But 767 of them don't. This doesn't seem like it should be a valid minimum measurement. **Dan and Jan said that this should only have been applied to yoy rockfish, for whom it is hard to estimate a minimum size. These are generally not sized, but given Min_cm = 0 and Max_cm = 10 by default. However, there are many other species that also have Min_cm = 0 occasionally. I'm saving these records and sending them to Dan.**

To view examples of each of these in the code:
```python
# Amount = 1, Size_cm = Min_cm = Max_cm
fish[(fish['amount'] == 1) & (fish['Size_cm'] == fish['Min_cm']) & (fish['Size_cm'] == fish['Max_cm'])]

# Amount = 0, Size_cm = 0, Min_cm = Max_cm = NaN --> There are no records that meet these criteria
fish[(fish['amount'] == 0) & (fish['Size_cm'] == 0) & (fish['Min_cm'].isna() == True) & (fish['Max_cm'].isna() == True)]

# If Amount = 0 and Size_cm = 0, then Min_cm = Max_cm = 0
fish.loc[(fish['amount'] == 0) & (fish['Size_cm'] == 0), 'Min_cm'].unique()
fish.loc[(fish['amount'] == 0) & (fish['Size_cm'] == 0), 'Max_cm'].unique()

# If Amount = 0 and Size_cm = NaN, then Min_cm = Max_cm = NaN
fish.loc[(fish['amount'] == 0) & (fish['Size_cm'].isna() == True), 'Min_cm'].unique()
fish.loc[(fish['amount'] == 0) & (fish['Size_cm'].isna() == True), 'Max_cm'].unique()

# Amount = 1 and Size_cm != Min_cm != Max_cm
fish[(fish['amount'] == 1) & (fish['Size_cm'].isna() == False) & (fish['Min_cm'].isna() == False) & (fish['Size_cm'] != fish['Min_cm'])]

# Example file created for Dan
example = fish[(fish['amount'] == 1) & (fish['Size_cm'].isna() == False) & (fish['Min_cm'].isna() == False) & (fish['Size_cm'] != fish['Min_cm'])]
example.to_csv('RCCA_fish_size_range_example.csv', index=False, na_rep='')

# Min_cm = 0, Species != yoy rockfish
fish[(fish['Min_cm'] == 0) & (fish['Species'] != 'yoy rockfish')]

# Example file created for Dan
example = fish[(fish['Min_cm'] == 0) & (fish['Species'] != 'yoy rockfish')]
example = example[example['amount'].isna() == False]
example.to_csv('RCCA_fish_min_size_example.csv', index=False, na_rep='')
```

**Additional developments with respect to points 3 and 4 above, 9/1/20**

Both of these points are all mixed up in the yoy rockfish situation. Dan says that yoy rockfish should always have min_cm = 1, max_cm = 10, and size_cm = 5.5. This is super not the case. Instead:
- There are 17253 records where species = yoy rockfish
- Of these, 269 records have amount = NaN, and **can be interpreted as true missing records.** (There are 16984 records remaining after dropping these.)
- 11839 have amount = 0. As discussed in point 2 above, these records have been changed so that Size_cm, Min_cm and Max_cm are always NaN.
- 5145 have amount > 0. **It sounds like I should alter all of these to have Size_cm = NaN, Min_cm = 1, and Max_cm = 10.** Note that I'm changing Size_cm = NaN instead of 5.5 because it's just an average of the min/max numbers, and therefore not an actual measurement. This is consistent with what I have done for other fish species. For the record, though:
    - 1480 of these have Size_cm = NaN. For these records, Min_cm and Max_cm also = NaN.
    - 56 have a Size_cm that exists, but both Min_cm and Max_cm = NaN. **It's interesting to note that some of these sizes are very large. Like 44 cm. I wonder if any of these are data entry errors?**
    - Of the remaining 3609 records with Size_cm, Min_cm and Max_cm specified, these columns take on all sorts of values.
        - 975 have Min_cm = 1, as expected. For 972 of these, Max_cm = 10, also as expected. There are three records where Max_cm = 6, 2 and 20 respectively.
        - 2525 have Min_cm = 0, which as Dan noted, seems to be the result of a common data entry error. For 2518 of these, Max_cm = 10 as expected. There are 7 records with Max_cm values ranging from 1 to 19.
        - The remaining 109 records have Min_cm > 1. 33 of these have Max_cm = 10 as expected. The rest have a wide range of values. One common pattern seems to be Size_cm = Min_cm = Max_cm = 5.
        
**I am going to change all yoy rockfish observations to have Size_cm = NaN, Min_cm = 1, and Max_cm = 10.**

**After making this change:**
- There are 716 records remaining that have the problem **discussed in point 3 above.** As Dan noted:
    - 561 of these have a minimum value of zero. Dan said to **set size_cm and min_cm to max_cm for these records.**
    - 6 of these have missing maximum values. Dan said to **set size_cm and max_cm to min_cm for these records.**
    - The remaining 149 records appear to be true errors. **Dan will look into these, and I won't add any size data in the MoF for these for now.**
- There are 206 records still remaining that have the problem **discussed in point 4 above. Dan says that the max value for these records should be treated as the single size for the group of fish. I.e., size_cm should = max_cm, and min_cm and max_cm should be NaN.**
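
The rules above can be sketched on a toy table using named boolean masks. This is only an illustration: the species and values below are invented, and only the column names (`Species`, `amount`, `Size_cm`, `Min_cm`, `Max_cm`) come from the fish dataframe. It also assumes that records where Size_cm = Min_cm = Max_cm have already had their min and max cleared, as described in point 1.

```python
import numpy as np
import pandas as pd

# Toy records (invented values), one per rule
toy = pd.DataFrame({
    'Species': ['yoy rockfish', 'kelp bass', 'kelp bass', 'kelp bass', 'blacksmith'],
    'amount':  [3, 1, 1, 1, 4],
    'Size_cm': [5.0, 10.0, 12.0, 15.0, 10.0],
    'Min_cm':  [0.0, 0.0, 12.0, 10.0, 0.0],
    'Max_cm':  [10.0, 20.0, np.nan, 20.0, 20.0],
})

# yoy rockfish with amount > 0: no single size, fixed 1-10 cm range
yoy = (toy['Species'] == 'yoy rockfish') & (toy['amount'] > 0)
toy.loc[yoy, 'Size_cm'] = np.nan
toy.loc[yoy, 'Min_cm'] = 1
toy.loc[yoy, 'Max_cm'] = 10

# Build every mask before changing anything else, so the rules do not interfere
single = toy['amount'] == 1
has_range = toy['Size_cm'].notna() & toy['Min_cm'].notna()
zero_min = single & has_range & (toy['Min_cm'] == 0)            # max is the size
no_max = single & has_range & toy['Max_cm'].isna()              # min is the size
unresolved = single & has_range & toy['Max_cm'].notna() & (toy['Min_cm'] != 0)
group_zero_min = (toy['amount'] > 1) & (toy['Min_cm'] == 0)     # max is the group size

use_max = zero_min | group_zero_min
toy.loc[use_max, 'Size_cm'] = toy.loc[use_max, 'Max_cm']
toy.loc[no_max, 'Size_cm'] = toy.loc[no_max, 'Min_cm']
toy.loc[unresolved, 'Size_cm'] = np.nan
toy.loc[use_max | no_max | unresolved, ['Min_cm', 'Max_cm']] = np.nan

print(toy)
```

Computing the masks up front means each rule is evaluated against the same state of the data, which makes it easier to check that no record is caught by two rules.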


```python
# yoy rockfish observations with amount > 0
fish[(fish['Species'] == 'yoy rockfish') & (fish['amount'].isna() == False) & (fish['amount'] > 0)]

# Size_cm exists, Min_cm and Max_cm = NaN. Some of these sizes seem really odd.
out = fish[(fish['Species'] == 'yoy rockfish') & (fish['amount'].isna() == False) & (fish['amount'] > 0)]
out[(out['Size_cm'].isna() == False) & (out['Min_cm'].isna() == True)]

# Remaining records where Size_cm, Min_cm and Max_cm take on all sorts of values
out[(out['Size_cm'].isna() == False) & (out['Min_cm'].isna() == False)]

# 149 records that are likely true data entry errors
example = fish[(fish['amount'] == 1) & (fish['Size_cm'].isna() == False) & (fish['Min_cm'].isna() == False) & (fish['Size_cm'] != fish['Min_cm'])]
example[(example['Min_cm'] != 0) & (example['Max_cm'].isna() == False)] 

# Remaining records that meet point 4 criteria
fish[fish['Min_cm'] == 0]
```

### Aggregate inverts, kelp and fish

In [67]:
## Aggregate

occ = pd.concat([inverts_occ, kelp_occ, fish_occ])
occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,bat star,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,,,,9.0,number of individuals per 60 m2
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,black abalone,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,brown/golden gorgonian,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,,,,1.0,number of individuals per 60 m2
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,CA sea cucumber,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,CA spiny lobster,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2


## Tidy and save occurrence file

After initial load on OBIS, Abby said that NaN values in string fields were making OBIS interpret those fields as numeric. Instead, she requested that I use an empty string ''.

Missing values occur only in the identificationQualifier, occurrenceRemarks, sex, and lifeStage columns. To check this:
```python
occ.isna().sum()
```
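
One thing to keep in mind with empty strings: by default, `pd.read_csv` parses an empty field as NaN, so a file written with '' will not come back as '' unless the reader is told to keep it. A minimal sketch with an invented two-row table and an in-memory buffer standing in for the CSV file:

```python
import io

import numpy as np
import pandas as pd

# Invented example rows; lifeStage has one missing value
toy = pd.DataFrame({'occurrenceID': ['occ1', 'occ2'],
                    'lifeStage': ['juvenile', np.nan]})
toy['lifeStage'] = toy['lifeStage'].fillna('')

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns the empty field back into NaN
buffer.seek(0)
default = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)

print(default['lifeStage'].isna().tolist())
print(kept['lifeStage'].tolist())
```

Note that `keep_default_na=False` applies to every column, so numeric columns with genuinely missing values would need separate handling (for example, `fillna('')` on just the string columns after loading).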

In [82]:
## Tidy NaN values in string columns

occ[['identificationQualifier', 'occurrenceRemarks', 'sex', 'lifeStage']] = occ[['identificationQualifier', 'occurrenceRemarks', 'sex', 'lifeStage']].replace(np.nan, '')
occ.isna().sum()

eventID                    0
occurrenceID               0
vernacularName             0
scientificName             0
scientificNameID           0
taxonID                    0
nameAccordingTo            0
occurrenceStatus           0
basisOfRecord              0
identificationQualifier    0
occurrenceRemarks          0
sex                        0
lifeStage                  0
organismQuantity           0
organismQuantityType       0
dtype: int64

In [83]:
## Save

occ.to_csv('RCCA_occurrence_20210125.csv', index=False, na_rep='NaN')

## Load occurrence file if desired

In [517]:
## Load

occ = pd.read_csv('RCCA_occurrence_20210125.csv', dtype={'occurrenceRemarks':str, 'sex':str, 'lifeStage':str})
print(occ.shape)
occ.head()

(630934, 14)


Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
0,120Reef_20100801_1,120Reef_20100801_1_inverts_occ1,Patiria miniata,urn:lsid:marinespecies.org:taxname:382131,382131,WoRMS,present,HumanObservation,,,,,9.0,number of individuals per 60 m2
1,120Reef_20100801_1,120Reef_20100801_1_inverts_occ2,Haliotis cracherodii,urn:lsid:marinespecies.org:taxname:405012,405012,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
2,120Reef_20100801_1,120Reef_20100801_1_inverts_occ3,Muricea,urn:lsid:marinespecies.org:taxname:177745,177745,WoRMS,present,HumanObservation,Muricea fruticosa or Muricea californica,,,,1.0,number of individuals per 60 m2
3,120Reef_20100801_1,120Reef_20100801_1_inverts_occ4,Parastichopus californicus,urn:lsid:marinespecies.org:taxname:711954,711954,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
4,120Reef_20100801_1,120Reef_20100801_1_inverts_occ5,Panulirus interruptus,urn:lsid:marinespecies.org:taxname:382898,382898,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2


## Create event file

The event file needs to include eventID, eventDate, datasetID, locality, countryCode, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, locationRemarks, minimumDepthInMeters, maximumDepthInMeters, and verbatimDepth (if desired).

In addition, depth, temperature and visibility can be included as event-level measurements in the MoF file.

In [84]:
## Get unique eventIDs from occurrence file

event = pd.DataFrame({'eventID':occ['eventID']})
event.drop_duplicates(inplace=True)

print(event.shape)
event.head()

(19375, 1)


Unnamed: 0,eventID
0,120Reef_20100801_1
28,120Reef_20100801_2
56,120Reef_20100801_3
84,120Reef_20100801_4
112,120Reef_20100801_5


In [85]:
## Create eventDate from eventID

eventDate = [datetime.strptime(ID.split('_')[1], '%Y%m%d').date().isoformat() for ID in event['eventID']]
event['eventDate'] = eventDate
event.head()

Unnamed: 0,eventID,eventDate
0,120Reef_20100801_1,2010-08-01
28,120Reef_20100801_2,2010-08-01
56,120Reef_20100801_3,2010-08-01
84,120Reef_20100801_4,2010-08-01
112,120Reef_20100801_5,2010-08-01
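
The same date extraction can also be done without a Python loop. A minimal sketch, assuming (as in the eventIDs shown here) that site names contain no underscores, so the date is always the second piece of the ID. The `toy` dataframe is invented for illustration:

```python
import pandas as pd

# Invented example IDs in the same SiteName_YYYYMMDD_Transect pattern
toy = pd.DataFrame({'eventID': ['120Reef_20100801_1', 'Torqua_20080527_1']})

# Take the middle piece of the ID and parse it as a date
date_part = toy['eventID'].str.split('_').str[1]
toy['eventDate'] = pd.to_datetime(date_part, format='%Y%m%d').dt.strftime('%Y-%m-%d')

print(toy)
```

Passing an explicit `format` means a malformed ID raises an error instead of being silently parsed as some other date.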


In [86]:
## Dataset ID

event['datasetID'] = 'RCCA transects'

In [87]:
## Add locality and countryCode

# Get site name out of eventID
locality = event['eventID'].str.split('_')
locality = locality.str[0]
event['locality'] = locality

# Reverse site_name_dict
reversed_site_name_dict = {v: k for k, v in site_name_dict.items()}

# Use reversed dict to retrieve original names from locality
event['locality'] = event['locality'].replace(reversed_site_name_dict)

# Change names as needed to match names in site table
event.loc[event['locality'] == "Lover's 3", 'locality'] = 'Lovers 3'
event.loc[event['locality'] == "Fry's Anchorage", 'locality'] = 'Frys Anchorage'
event.loc[event['locality'] == 'Judith Reserve San Miguel Is', 'locality'] = 'Judith Reserve San Miguel Island'
event.loc[event['locality'] == 'West Long Point', 'locality'] = 'Long Point West'

# Add countryCode
event['countryCode'] = 'US'

event.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode
0,120Reef_20100801_1,2010-08-01,RCCA transects,120 Reef,US
28,120Reef_20100801_2,2010-08-01,RCCA transects,120 Reef,US
56,120Reef_20100801_3,2010-08-01,RCCA transects,120 Reef,US
84,120Reef_20100801_4,2010-08-01,RCCA transects,120 Reef,US
112,120Reef_20100801_5,2010-08-01,RCCA transects,120 Reef,US


In [88]:
## Merge to obtain decimalLatitude and decimalLongitude

event = event.merge(sites[['Site', 'Latitude', 'Longitude']], how='left', left_on='locality', right_on='Site')
event.rename(columns = {'Latitude':'decimalLatitude', 'Longitude':'decimalLongitude'}, inplace=True)
event.drop('Site', axis=1, inplace=True)
event.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude
0,120Reef_20100801_1,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014
1,120Reef_20100801_2,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014
2,120Reef_20100801_3,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014
3,120Reef_20100801_4,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014
4,120Reef_20100801_5,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014


In [89]:
## Add coordinateUncertaintyInMeters

event['coordinateUncertaintyInMeters'] = 250

In [90]:
## In order to get depth data (and temp and visibility for MoF file), we'll need an eventID in inverts, kelp and fish

# Inverts
inverts.dropna(subset=['Amount'], inplace=True)
inverts['eventID'] = inverts_occ['eventID']

# Kelp
paddedDay = ['0' + str(kelp['Day'].iloc[i]) if len(str(kelp['Day'].iloc[i])) == 1 else str(kelp['Day'].iloc[i]) for i in range(len(kelp['Day']))]
paddedMonth = ['0' + str(kelp['Month'].iloc[i]) if len(str(kelp['Month'].iloc[i])) == 1 else str(kelp['Month'].iloc[i]) for i in range(len(kelp['Month']))]
eventID = [kelp['SiteName'].iloc[i] + '_' + str(kelp['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(kelp['Transect'].iloc[i]) for i in range(len(kelp['SiteName']))]
kelp['eventID'] = eventID

# Fish
paddedDay = ['0' + str(fish['Day'].iloc[i]) if len(str(fish['Day'].iloc[i])) == 1 else str(fish['Day'].iloc[i]) for i in range(len(fish['Day']))]
paddedMonth = ['0' + str(fish['Month'].iloc[i]) if len(str(fish['Month'].iloc[i])) == 1 else str(fish['Month'].iloc[i]) for i in range(len(fish['Month']))]
eventID = [fish['SiteName'].iloc[i] + '_' + str(fish['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(fish['Transect'].iloc[i]) for i in range(len(fish['SiteName']))]
fish['eventID'] = eventID
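
The zero-padded eventID built in the cell above can also be written with vectorized string methods. A minimal sketch on an invented table, assuming `Year`, `Month`, `Day` and `Transect` are stored as integers (a float column would produce strings like '1.0'):

```python
import pandas as pd

# Invented example rows using the same column names as kelp and fish
toy = pd.DataFrame({'SiteName': ['120Reef', 'OtterCove'],
                    'Year': [2010, 2008],
                    'Month': [8, 12],
                    'Day': [1, 3],
                    'Transect': [1, 3]})

# str.zfill(2) pads single-digit months and days with a leading zero
toy['eventID'] = (toy['SiteName'] + '_'
                  + toy['Year'].astype(str)
                  + toy['Month'].astype(str).str.zfill(2)
                  + toy['Day'].astype(str).str.zfill(2)
                  + '_' + toy['Transect'].astype(str))

print(toy['eventID'].tolist())
```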

In [91]:
## Merge and combine first to obtain depths, with depths in fish prioritized as most accurate

# Get eventID and depth from fish, inverts and kelp
fish_events = fish[['eventID','Depth_ft']].copy()
fish_events.drop_duplicates(inplace=True)
inverts_events = inverts[['eventID', 'Depth_ft']].copy()
inverts_events.drop_duplicates(inplace=True)
kelp_events = kelp[['eventID', 'Depth_ft']].copy()
kelp_events.drop_duplicates(inplace=True)

# Merge with event
temp = event.merge(fish_events, how='left', on='eventID')
temp = temp.merge(inverts_events, how='left', on='eventID', suffixes=('', '_inverts'))
temp = temp.merge(kelp_events, how='left', on='eventID', suffixes=('', '_kelp'))

# Combine first to obtain final depths from depth_ft, depth_ft_inverts and depth_ft_kelp
depth = temp['Depth_ft'].combine_first(temp['Depth_ft_inverts'])
depth = depth.combine_first(temp['Depth_ft_kelp'])

# Create depth df for MoF
depth_df = pd.DataFrame({'eventID':temp['eventID'], 'Depth_ft':depth})
depth_df.drop_duplicates(inplace=True)

print(depth_df.shape)
depth_df.head()

(19375, 2)


Unnamed: 0,eventID,Depth_ft
0,120Reef_20100801_1,21.0
1,120Reef_20100801_2,21.5
2,120Reef_20100801_3,26.5
3,120Reef_20100801_4,16.0
4,120Reef_20100801_5,16.0


**Note** that 2 events have more than one depth listed (OtterCove_20080803_3 and Torqua_20080527_1). In both cases, this arises from the inverts and kelp data sets having different depths for the same event. **I averaged them originally, but based on feedback from Dan, have redone the code such that the value from the fish dataframe is taken if it exists. I will do the same with other measurements for the MoF file below.**
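
To illustrate the priority order, here is a minimal sketch of how `combine_first` fills gaps: the first series wins wherever it has a value, and later series are only consulted where everything before them is missing. The event labels and depths are invented.

```python
import numpy as np
import pandas as pd

events = ['e1', 'e2', 'e3']
fish_depth = pd.Series([21.0, np.nan, np.nan], index=events)
inverts_depth = pd.Series([20.0, 30.0, np.nan], index=events)
kelp_depth = pd.Series([19.0, 31.0, 40.0], index=events)

# fish first, then inverts, then kelp
depth = fish_depth.combine_first(inverts_depth).combine_first(kelp_depth)

print(depth)
```

Here e1 takes the fish value even though the other two tables disagree, e2 falls back to inverts, and e3 falls back to kelp.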

In [92]:
## Add depth to event file

event['minimumDepthInMeters'] = round(depth_df['Depth_ft']*0.3048, 1)
event['maximumDepthInMeters'] = round(depth_df['Depth_ft']*0.3048, 1)
event.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters
0,120Reef_20100801_1,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,6.4,6.4
1,120Reef_20100801_2,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,6.6,6.6
2,120Reef_20100801_3,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,8.1,8.1
3,120Reef_20100801_4,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,4.9,4.9
4,120Reef_20100801_5,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,4.9,4.9


In [93]:
## Get temperature and visibility data for MoF file

# Temperature
temp2 = event.merge(fish[['eventID', 'Temp10m']], how='left', on='eventID')
temp2.drop_duplicates(inplace=True)
temp2 = temp2.merge(inverts[['eventID', 'Temp10m']], how='left', on='eventID', suffixes=('', '_inverts'))
temp2.drop_duplicates(inplace=True)
temp2 = temp2.merge(kelp[['eventID', 'Temp10m']], how='left', on='eventID', suffixes=('', '_kelp'))
temp2.drop_duplicates(inplace=True)

tempC = temp2['Temp10m'].combine_first(temp2['Temp10m_inverts'])
tempC = tempC.combine_first(temp2['Temp10m_kelp'])

temp_df = pd.DataFrame({'eventID':temp2['eventID'], 'Temp10m':tempC})
temp_df.drop_duplicates(inplace=True)

# Visibility
temp3 = event.merge(fish[['eventID', 'Visibility']], how='left', on='eventID')
temp3.drop_duplicates(inplace=True)
temp3 = temp3.merge(inverts[['eventID', 'Visibility']], how='left', on='eventID', suffixes=('', '_inverts'))
temp3.drop_duplicates(inplace=True)
temp3 = temp3.merge(kelp[['eventID', 'Visibility']], how='left', on='eventID', suffixes=('', '_kelp'))
temp3.drop_duplicates(inplace=True)

vis = temp3['Visibility'].combine_first(temp3['Visibility_inverts'])
vis = vis.combine_first(temp3['Visibility_kelp'])

vis_df = pd.DataFrame({'eventID':temp3['eventID'], 'Visibility':vis})
vis_df.drop_duplicates(inplace=True)

In [94]:
## Add samplingProtocol and samplingEffort

event['samplingProtocol'] = 'band transect'
event['samplingEffort'] = '10-15 minutes per transect'
event.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
0,120Reef_20100801_1,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,6.4,6.4,band transect,10-15 minutes per transect
1,120Reef_20100801_2,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,6.6,6.6,band transect,10-15 minutes per transect
2,120Reef_20100801_3,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,8.1,8.1,band transect,10-15 minutes per transect
3,120Reef_20100801_4,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,4.9,4.9,band transect,10-15 minutes per transect
4,120Reef_20100801_5,2010-08-01,RCCA transects,120 Reef,US,33.737919,-118.392014,250,4.9,4.9,band transect,10-15 minutes per transect


## Save

In [99]:
## Save

event.to_csv('RCCA_event_20210125.csv', index=False, na_rep='NaN')

## Create MoF file

The MoF file here needs to contain both event and occurrence level measurements. At the event level, we need eventID, occurrenceID = NaN, measurementType, measurementValue and measurementUnit. measurementMethod could optionally be included. measurementTypes available are: temperature, visibility. I also want to include the UPC data here.

At the occurrence level, both an eventID and an occurrenceID will be listed. measurementTypes available are: kelp stipe counts, fish sizes (including min size and max size for groups).

### Assemble UPC data

All amounts are out of 30 possible points surveyed - theoretically the sum over all classcodes in a category should always be 30. **Note** that upon checking this, there are only 4 surveys where it is not true. At Isthmus Reef on 10-Dec-2019, Transect 3, Substrate, Cover and Relief all have 60 points. At Lovers Point on 16-Sep-2006, Transect 1, Substrate has 31 points. At South La Jolla on 11-Aug-2019, Transect 1, Substrate has 31 points. And at Stillwater Cove Monterey on 26-Aug-2013, Transect 3, Relief has 31 points.

**After talking with Dan and Jan, I have fixes for Isthmus Reef, where Transect 1 was accidentally entered as a duplicate Transect 3, and South La Jolla, where there was also a data entry error. They needed to track down the original data sheets to address the problems with Lovers Point and Stillwater Cove, which may take some time. I've fixed what I could below.**

Generally, there are somewhere between 8 and 11 categories for Cover, 5 categories for Substrate and 4 categories for Relief. **Note** that there is one survey for which there seem to be duplicate entries with values that conflict (Isthmus Reef, 10-Dec-2019, Transect 3). **Also**, relief categories are not all labeled consistently ('0 - 10cm' = '0 - 10c', etc.).

To check this:
```python
# Look at surveys where Amount > 30
out = upc.groupby(['site', 'SurveyDate', 'Transect', 'Category'])[['Amount', 'Depth_ft']].sum()
out.reset_index(inplace=True)
out[out['Amount'] > 30]

## Look at a particular survey
upc[(upc['site'] == 'South La Jolla') & (upc['SurveyDate'] == '11-Aug-19') & (upc['Transect'] == 1) & (upc['Category'] == 'Substrate')]
```

In [96]:
## Fix UPC data entry errors

# Isthmus Reef, 12/10/2019
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Substrate') & (upc['Classcode'] == 'sand' ) & 
       (upc['Amount'] == 1), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Substrate') & (upc['Classcode'] == 'cobble' ) & 
       (upc['Amount'] == 14), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Substrate') & (upc['Classcode'] == 'boulder' ) & 
       (upc['Amount'] == 3), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Substrate') & (upc['Classcode'] == 'bedrock' ) & 
       (upc['Amount'] == 12), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Substrate') & (upc['Classcode'] == 'other' ) & 
       (upc['Amount'] == 0), 'Transect'] = [1, 3]

upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'none' ) & 
       (upc['Amount'] == 2), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'brown seaweed' ) & 
       (upc['Amount'] == 0), 'Transect'] = [1, 3]
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'other brown seaweed' ) & 
       (upc['Amount'] == 18), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'green seaweed' ) & 
       (upc['Amount'] == 0), 'Transect'] = [1, 3]
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'red seaweed' ) & 
       (upc['Amount'] == 0), 'Transect'] = [1, 3]
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'encrusting red' ) & 
       (upc['Amount'] == 2), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'articulated coralline' ) & 
       (upc['Amount'] == 0), 'Transect'] = [1, 3]
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'crustose coralline' ) & 
       (upc['Amount'] == 8), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'sessile invertebrates' ) & 
       (upc['Amount'] == 0), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'mobile invertebrates' ) & 
       (upc['Amount'] == 0), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Cover') & (upc['Classcode'] == 'seagrasses' ) & 
       (upc['Amount'] == 0), 'Transect'] = [1, 3]

upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Relief') & (upc['Classcode'] == '0 - 10cm' ) & 
       (upc['Amount'] == 0), 'Transect'] = [1, 3]
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Relief') & (upc['Classcode'] == '> 10cm - 1m' ) & 
       (upc['Amount'] == 29), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Relief') & (upc['Classcode'] == '> 1m - 2m' ) & 
       (upc['Amount'] == 1), 'Transect'] = 1
upc.loc[(upc['site'] == 'Isthmus Reef') & (upc['SurveyDate'] == '10-Dec-19') & (upc['Transect'] == 3) & (upc['Category'] == 'Relief') & (upc['Classcode'] == '> 2m' ) & 
       (upc['Amount'] == 0), 'Transect'] = [1, 3]

# South La Jolla, 8/11/19
upc.drop(upc[(upc['site'] == 'South La Jolla') & (upc['SurveyDate'] == '11-Aug-19') & (upc['Transect'] == 1) & (upc['Category'] == 'Substrate') & (upc['Classcode'] == 'bedrock') &
         (upc['Amount'] == 0)].index, inplace = True)
upc.drop(upc[(upc['site'] == 'South La Jolla') & (upc['SurveyDate'] == '11-Aug-19') & (upc['Transect'] == 1) & (upc['Category'] == 'Substrate') & (upc['Classcode'] == 'boulder') &
         (upc['Amount'] == 3)].index, inplace = True)

In [97]:
## Aggregate UPC data so that all classcodes are collapsed into one row

# Add SiteName column
upc['SiteName'] = upc['site'].replace(site_name_dict)

# Create eventID
paddedDay = ['0' + str(upc['Day'].iloc[i]) if len(str(upc['Day'].iloc[i])) == 1 else str(upc['Day'].iloc[i]) for i in range(len(upc['Day']))]
paddedMonth = ['0' + str(upc['Month'].iloc[i]) if len(str(upc['Month'].iloc[i])) == 1 else str(upc['Month'].iloc[i]) for i in range(len(upc['Month']))]
eventID = [upc['SiteName'].iloc[i] + '_' + str(upc['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + str(upc['Transect'].iloc[i]) for i in range(len(upc['SiteName']))]
upc['eventID'] = eventID

# Fix problems with classcodes
upc['Classcode'] = upc['Classcode'].replace({'> 2':'> 2m',
                                             '> 1 - 2':'> 1m - 2m',
                                             '> 10c -1':'> 10cm - 1m',
                                             '0 - 10c':'0 - 10cm'})

# Create Percent and UPC columns
upc['Percent'] = round((upc['Amount']/30)*100, 1)
upc = upc[(upc['Percent'] > 0) & (upc['Percent'].isna() == False)].copy() # copy so the assignments below act on an independent dataframe
upc['Percent'] = upc['Percent'].astype(str)
upc['UPC'] = upc['Percent'] + '% ' + upc['Classcode'] + ' | '

# Aggregate
upc_agg = upc.groupby(['eventID', 'Category']).agg({'UPC':sum})
upc_agg.reset_index(inplace=True)
upc_agg['UPC'] = upc_agg['UPC'].str[:-3]

upc_agg

Unnamed: 0,eventID,Category,UPC
0,120Reef_20061008_1,Relief,3.3% 0 - 10cm | 90.0% > 10cm - 1m | 6.7% > 1m ...
1,120Reef_20061008_1,Substrate,83.3% bedrock | 16.7% cobble
2,120Reef_20061008_2,Cover,43.3% crustose coralline | 43.3% none | 13.3% ...
3,120Reef_20061008_2,Relief,46.7% 0 - 10cm | 50.0% > 10cm - 1m | 3.3% > 1m...
4,120Reef_20061008_2,Substrate,50.0% bedrock | 50.0% sand
...,...,...,...
19397,Yellowbanks_20131107_5,Relief,23.3% 0 - 10cm | 76.7% > 10cm - 1m
19398,Yellowbanks_20131107_5,Substrate,13.3% bedrock | 3.3% boulder | 16.7% cobble | ...
19399,Yellowbanks_20131107_6,Cover,13.3% articulated coralline | 3.3% brown seawe...
19400,Yellowbanks_20131107_6,Relief,6.7% 0 - 10cm | 93.3% > 10cm - 1m


In [114]:
## Use upc_agg to create a mof dataframe with eventID, occurrenceID = NaN, measurementType, measurementValue, measurementUnit, measurementMethod

upc_mof = pd.DataFrame({'eventID':upc_agg['eventID']})
upc_mof['occurrenceID'] = np.nan
upc_mof['measurementType'] = upc_agg['Category'].str.lower()
upc_mof['measurementValue'] = upc_agg['UPC']
upc_mof['measurementUnit'] = 'percent cover'
upc_mof['measurementMethod'] = 'uniform point contact'
upc_mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20061008_1,,relief,3.3% 0 - 10cm | 90.0% > 10cm - 1m | 6.7% > 1m ...,percent cover,uniform point contact
1,120Reef_20061008_1,,substrate,83.3% bedrock | 16.7% cobble,percent cover,uniform point contact
2,120Reef_20061008_2,,cover,43.3% crustose coralline | 43.3% none | 13.3% ...,percent cover,uniform point contact
3,120Reef_20061008_2,,relief,46.7% 0 - 10cm | 50.0% > 10cm - 1m | 3.3% > 1m...,percent cover,uniform point contact
4,120Reef_20061008_2,,substrate,50.0% bedrock | 50.0% sand,percent cover,uniform point contact


### Assemble remaining measurements

In [115]:
## Assemble event-level measurements

# Temperature
mof = pd.DataFrame({'eventID':temp_df['eventID']})
mof['occurrenceID'] = np.nan
mof['measurementType'] = 'temperature'
mof['measurementValue'] = temp_df['Temp10m']
mof['measurementUnit'] = 'degrees Celsius'
mof['measurementMethod'] = 'measured by dive computer at 10 m depth, or at the seafloor if shallower than 10 m'
mof = mof[mof['measurementValue'].isna() == False]

# Visibility
vis_mof = pd.DataFrame({'eventID':vis_df['eventID']})
vis_mof['occurrenceID'] = np.nan
vis_mof['measurementType'] = 'visibility'
vis_mof['measurementValue'] = vis_df['Visibility']
vis_mof['measurementUnit'] = 'meters'
vis_mof['measurementMethod'] = 'determined by divers by measuring the distance from which the fingers on a hand held up into the water column can be counted accurately'
vis_mof = vis_mof[vis_mof['measurementValue'].isna() == False]
vis_mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20100801_1,,visibility,3.0,meters,determined by divers by measuring the distance...
17,120Reef_20100801_2,,visibility,3.0,meters,determined by divers by measuring the distance...
29,120Reef_20100801_3,,visibility,3.0,meters,determined by divers by measuring the distance...
40,120Reef_20100801_4,,visibility,3.0,meters,determined by divers by measuring the distance...
57,120Reef_20100801_5,,visibility,3.0,meters,determined by divers by measuring the distance...


In [116]:
## Assemble occurrence-level measurements

# Kelp sizes
kelp_mof = kelp_sizes[['eventID', 'occurrenceID']].copy()
kelp_mof['measurementType'] = 'size'
kelp_mof['measurementValue'] = kelp_sizes['size']
kelp_mof['measurementUnit'] = 'number of stipes per individual'
kelp_mof['measurementMethod'] = 'counted for each Macrocystis pyrifera surveyed'

# Fish sizes for individuals pre 2013
fish_mof = fish_sizes[['eventID', 'occurrenceID']].copy()
fish_mof['measurementType'] = 'length'
fish_mof['measurementValue'] = fish_sizes['size_cat']
fish_mof['measurementUnit'] = 'size category per individual'
fish_mof['measurementMethod'] = 'categorized as small/medium/large prior to 2013'
fish_mof = fish_mof[fish_mof['measurementValue'].isna() == False]

# Fish sizes for individuals from 2013 on, and for groups of the same size
fish_cm_mof = fish_sizes[['eventID', 'occurrenceID']].copy()
fish_cm_mof['measurementType'] = 'length'
fish_cm_mof['measurementValue'] = fish_sizes['size_cm']
fish_cm_mof['measurementUnit'] = 'centimeters'
fish_cm_mof['measurementMethod'] = 'size of an individual or group of fish of the same species and size, estimated visually to the nearest centimeter from 2013 on'
fish_cm_mof = fish_cm_mof[fish_cm_mof['measurementValue'].isna() == False]

# Fish sizes for groups of different sizes 
min_mof = fish_sizes[['eventID', 'occurrenceID']].copy()
min_mof['measurementType'] = 'minimum length'
min_mof['measurementValue'] = fish_sizes['min_size']
min_mof['measurementUnit'] = 'centimeters'
min_mof['measurementMethod'] = 'minimum size observed in a group of fish of the same species, estimated visually to the nearest centimeter'
min_mof = min_mof[min_mof['measurementValue'].isna() == False]

max_mof = fish_sizes[['eventID', 'occurrenceID']].copy()
max_mof['measurementType'] = 'maximum length'
max_mof['measurementValue'] = fish_sizes['max_size']
max_mof['measurementUnit'] = 'centimeters'
max_mof['measurementMethod'] = 'maximum size observed in a group of fish of the same species, estimated visually to the nearest centimeter'
max_mof = max_mof[max_mof['measurementValue'].isna() == False]
max_mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
872,120Reef_20180710_4,120Reef_20180710_4_fish_occ6,maximum length,10.0,centimeters,maximum size observed in a group of fish of th...
1044,120Reef_20180710_9,120Reef_20180710_9_fish_occ40,maximum length,12.0,centimeters,maximum size observed in a group of fish of th...
1130,120Reef_20180710_11,120Reef_20180710_11_fish_occ31,maximum length,34.0,centimeters,maximum size observed in a group of fish of th...
1154,120Reef_20180710_12,120Reef_20180710_12_fish_occ6,maximum length,18.0,centimeters,maximum size observed in a group of fish of th...
1205,120Reef_20180710_13,120Reef_20180710_13_fish_occ6,maximum length,15.0,centimeters,maximum size observed in a group of fish of th...


In [117]:
## Concatenate dataframes

mof = pd.concat([mof, vis_mof, upc_mof, kelp_mof, fish_mof, fish_cm_mof, min_mof, max_mof])
mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20100801_1,,temperature,13,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
17,120Reef_20100801_2,,temperature,13,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
29,120Reef_20100801_3,,temperature,13,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
40,120Reef_20100801_4,,temperature,13,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
57,120Reef_20100801_5,,temperature,13,degrees Celsius,"measured by dive computer at 10 m depth, or at..."


## Tidy and save

I'm assuming that the NaN values in occurrenceID should also be written as empty strings (''), so I'll implement that here.

In [118]:
## Tidy nan values

mof['occurrenceID'] = mof['occurrenceID'].replace(np.nan, '')
mof.isna().sum()

eventID              0
occurrenceID         0
measurementType      0
measurementValue     0
measurementUnit      0
measurementMethod    0
dtype: int64

In [119]:
## Save

mof.to_csv('RCCA_MoF_20210125.csv', index=False, na_rep='NaN')

## Questions

1. There are sites listed in the inverts, kelp and fish data that are not in the site table. They are: Cayucos, Hurricane Ridge, LA Federal Breakwater, Ocean Cove, Pier 400 and West Long Point. I need the lat and lon for these sites. Also, is Ocean Cove the same as Ocean Cove Kelper? **These sites are mostly ones that were only sampled once. Jan and Dan will update the site table with missing sites. Note that West Long Point is the same as Long Point West. Ocean Cove Kelper is not the same as Ocean Cove**
2. In the inverts and kelp data, both Distance and Amount are sometimes NaN. There are also records where Distance is a number (including but not limited to 30) and Amount is NaN. How should I interpret these records? Are NaN values different than 0 values? **If the amount is NaN, assume that the species was not looked for during that transect, regardless of distance value.**
3. In the fish data, a good proportion of records have Amount = NaN (~17,000). In addition, for these records, Size_cm, Min_cm and Max_cm are either 0 or NaN. Which value (0 or NaN) is used does not depend on whether the Year is before or after 2013. How do I interpret these records? **Again, if amount is NaN, disregard size columns. Assume it is a true missing value, and that the fish was not looked for during the transect.**
4. Also in the fish data, there are a good number of records where amount = 0 but a SizeCategory is listed. I've assumed that these records should have SizeCategory = NaN. (Note that this is not a problem for Size_cm, which is only > 0 if amount is > 0.) **This assumption is correct.**
5. In the fish data, there are a number of problems with the min and max size columns.
    - In a number of cases, amount = 1 and Min_cm and Max_cm = Size_cm. I've assumed in these cases Min_cm and Max_cm should be NaN. **This is correct.**
    - In a number of cases, amount = 0 and Size_cm = 0, while Min_cm and Max_cm = NaN. I've assumed that Size_cm, Min_cm and Max_cm should all be NaN. **This is correct.**
    - There are also ~1000 cases where amount = 0 and Min_cm != Max_cm != Size_cm. How should I interpret these rows? It looks like in these cases, Size_cm remains the average of Min_cm and Max_cm. <span style="color:red">**As per Dan's instructions, I've handled all but 149 of these, which appear to be true data entry problems. He will look at it and get back to me. In the meantime, I haven't included any size information in the MoF for these records.**</span>
    - There are many cases where Min_cm = 0. This doesn't seem like a valid minimum value - is it? **Dan and Jan said these should be yoy rockfish, which are always entered as having a min_size of 1 and a max_size of 10. This has been entered very inconsistently, and I tried to find and fix all of the problems as best I could. After doing this, there are still 206 records that are not yoy rockfish and have a minimum size of 0. Dan looked back at the datasheets, and said that here, I should assume that max_size is the true size of the fish.**
6. In the kelp data, there is sometimes a 7th transect. Similarly, in the fish data, there are occasionally transects numbering 19-31. When are extra transects done? Are they inshore or offshore? I thought that inshore transects were always labeled 1-3 and 7-12, and offshore transects were always labeled 4-6 and 13-18. Is this not correct? **This is generally correct. Extra transects have been done since 2018. It would be best to trust the depth measurement rather than the inshore/offshore category - get rid of this column.**
7. There are some transects where more than one depth is given. Should I assume this is a typo? Or that two measurements were taken? **Dan clarified that the depths and visibilities recorded in the fish data should be considered the most accurate standard. I've changed the code so that depths/visibilities are filled from the inverts and/or kelp data only when they are missing in the fish data.**
8. The values in the Amount column for each Category of the UPC surveys should sum to 30. There are 4 surveys where this doesn't happen: Isthmus Reef on 10-Dec-2019, Transect 3 (Category, Cover and Relief all sum to 60); Lovers Point on 16-Sep-2006, Transect 1 (Substrate sums to 31); South La Jolla on 11-Aug-2019, Transect 1 (Substrate sums to 31); and Stillwater Cove Monterey on 26-Aug-2013, Transect 3 (Relief sums to 31). Do you know where the errors are here? Particularly for Isthmus Reef, where there seem to be duplicate entries with conflicting data? **Dan has provided correct data for Isthmus Reef and South La Jolla.** <span style="color:red">**Lovers Point and Stillwater Cove appear to be true problems with the data that need to be traced back to the original data sheet; this may take some time.**</span>
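
The check behind question 8 (each UPC Category should total 30 points per transect) can be sketched roughly as follows. This is a minimal illustration on an invented toy table, not the code used to find the four surveys above; `upc_demo`, `POINTS_PER_TRANSECT`, `totals` and `bad_totals` are hypothetical names, and the column names are assumed to match the raw UPC data.

```python
import pandas as pd

# Toy stand-in for the UPC table. Column names mirror the notebook; all values are invented.
upc_demo = pd.DataFrame({
    'site': ['Site A'] * 4 + ['Site B'] * 3,
    'SurveyDate': ['1-Jan-20'] * 4 + ['2-Jan-20'] * 3,
    'Transect': [1, 1, 1, 1, 2, 2, 2],
    'Category': ['Substrate', 'Substrate', 'Relief', 'Relief',
                 'Substrate', 'Substrate', 'Substrate'],
    'Classcode': ['bedrock', 'sand', '0 - 10cm', '> 10cm - 1m',
                  'bedrock', 'boulder', 'sand'],
    'Amount': [20, 10, 15, 15, 25, 3, 3],
})

POINTS_PER_TRANSECT = 30  # expected total per Category, per the protocol described above

# Sum Amount within each site / date / transect / category combination
totals = (upc_demo
          .groupby(['site', 'SurveyDate', 'Transect', 'Category'], as_index=False)['Amount']
          .sum())

# Keep only the combinations whose total is not the expected 30 points
bad_totals = totals[totals['Amount'] != POINTS_PER_TRANSECT]
print(bad_totals)
```

On the toy data, only the Site B substrate rows (25 + 3 + 3 = 31) are flagged. Running the same grouping on the real `upc` table before any corrections would be expected to surface the surveys listed in question 8.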

## Check whether fish data is the same as rcca_fish_zeropop_final.csv provided by Dan

Note that all I've done so far is import packages and load fish.

In [4]:
print(fish.shape)
fish.head()

(485046, 18)


Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,Size_cm,Min_cm,Max_cm,Latitude,Longitude,Depth_ft,Temp10m,Visibility
0,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Small,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0
1,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,none,Medium,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,a,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,,,,33.737919,-118.392014,21.0,13.0,3.0


In [5]:
## Load Dan's data

filename = 'rcca_fish_zeropop_final.csv'
fish_da = pd.read_csv(filename)

print(fish_da.shape)
fish_da.head()

(485046, 21)


Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,...,Min_cm,Max_cm,Lat,Lon,Depth_ft,Region,Temp10m,Heading,Visibility,Observer1
0,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Small,1.0,...,,,33.737919,-118.392014,21.0,South,13.0,0.0,3.0,
1,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Medium,2.0,...,,,33.737919,-118.392014,21.0,South,13.0,0.0,3.0,
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,none,Medium,1.0,...,,,33.737919,-118.392014,21.0,South,13.0,0.0,3.0,
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,a,Medium,2.0,...,,,33.737919,-118.392014,21.0,South,13.0,0.0,3.0,
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,...,,,33.737919,-118.392014,21.0,South,13.0,0.0,3.0,


In [6]:
## Drop columns that are not in fish

fish_da.drop(columns=['Region', 'Heading', 'Observer1'], inplace=True)
print(fish_da.shape)
fish_da.head()

(485046, 18)


Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,Size_cm,Min_cm,Max_cm,Lat,Lon,Depth_ft,Temp10m,Visibility
0,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Small,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0
1,120 Reef,1,8,2010,8/1/2010,1,black perch,none,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0
2,120 Reef,1,8,2010,8/1/2010,1,blacksmith,none,Medium,1.0,,,,33.737919,-118.392014,21.0,13.0,3.0
3,120 Reef,1,8,2010,8/1/2010,1,garibaldi,a,Medium,2.0,,,,33.737919,-118.392014,21.0,13.0,3.0
4,120 Reef,1,8,2010,8/1/2010,1,giant sea bass,,,0.0,,,,33.737919,-118.392014,21.0,13.0,3.0
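
With the extra columns dropped, the two tables have the same shape, but Dan's file names the coordinates `Lat`/`Lon` where `fish` uses `Latitude`/`Longitude`. A direct comparison would therefore need the columns aligned first. The sketch below shows one way this could be done on invented toy frames; `fish_demo`, `fish_da_demo`, `renamed` and `same` are hypothetical names, and this is not a check that was run in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for fish and fish_da; all values are invented.
fish_demo = pd.DataFrame({'Site': ['120 Reef', '120 Reef'],
                          'amount': [1.0, np.nan],
                          'Latitude': [33.737919, 33.737919],
                          'Longitude': [-118.392014, -118.392014]})
fish_da_demo = pd.DataFrame({'Site': ['120 Reef', '120 Reef'],
                             'amount': [1.0, np.nan],
                             'Lat': [33.737919, 33.737919],
                             'Lon': [-118.392014, -118.392014]})

# Align the column names and order with fish before comparing
renamed = fish_da_demo.rename(columns={'Lat': 'Latitude', 'Lon': 'Longitude'})
renamed = renamed[fish_demo.columns]

# DataFrame.equals treats NaNs in the same position as equal,
# but it also requires matching dtypes and index labels
same = fish_demo.equals(renamed)
print(same)
```

If `equals` returned False on the real data, the next step would be to locate the differing cells, bearing in mind that a dtype mismatch alone is enough to make the comparison fail.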


In [7]:
## Query for 206 instances where more than one fish was observed, a range of sizes was provided, the minimum size was 0 cm, and the species was not yoy rockfish

fish[(fish['amount'] > 1) & (fish['Size_cm'] != fish['Min_cm']) & (fish['Size_cm'] != fish['Max_cm']) & (fish['Min_cm'] == 0) & (fish['Species'] != 'yoy rockfish')]

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,Size_cm,Min_cm,Max_cm,Latitude,Longitude,Depth_ft,Temp10m,Visibility
57816,Carmel River,16,9,2014,9/16/2014,14,kelp greenling,m,,3.0,26.5,0.0,53.0,36.539082,-121.935097,32.0,13.00,5.0
68414,Casino Point,18,7,2015,7/18/2015,2,blacksmith,none,,2.0,9.5,0.0,19.0,33.349167,-118.324966,45.0,21.67,8.0
68455,Casino Point,18,7,2015,7/18/2015,3,blacksmith,none,,2.0,5.0,0.0,10.0,33.349167,-118.324966,49.0,21.67,7.2
68456,Casino Point,18,7,2015,7/18/2015,3,blacksmith,none,,3.0,6.0,0.0,12.0,33.349167,-118.324966,49.0,21.67,7.2
68466,Casino Point,18,7,2015,7/18/2015,3,garibaldi,a,,2.0,10.0,0.0,20.0,33.349167,-118.324966,49.0,21.67,7.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
463724,WIES Intake Pipes,26,11,2013,11/26/2013,13,senorita,none,,5.0,2.5,0.0,5.0,33.446999,-118.484848,38.5,16.00,12.0
483977,Yellowbanks,7,11,2013,11/7/2013,1,senorita,none,,20.0,2.5,0.0,5.0,33.998798,-119.550499,27.5,14.00,7.5
484009,Yellowbanks,7,11,2013,11/7/2013,2,senorita,none,,2.0,2.5,0.0,5.0,33.998798,-119.550499,27.5,14.00,7.5
484212,Yellowbanks,7,11,2013,11/7/2013,8,senorita,none,,4.0,2.5,0.0,5.0,33.998798,-119.550499,28.0,14.00,7.5


In [10]:
## Same query in Dan's data

examples = fish_da[(fish_da['amount'] > 1) & (fish_da['Size_cm'] != fish_da['Min_cm']) & (fish_da['Size_cm'] != fish_da['Max_cm']) & (fish_da['Min_cm'] == 0) & 
                   (fish_da['Species'] != 'yoy rockfish')]
examples

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Species,Sex,SizeCategory,amount,Size_cm,Min_cm,Max_cm,Lat,Lon,Depth_ft,Temp10m,Visibility
57816,Carmel River,16,9,2014,9/16/2014,14,kelp greenling,m,,3.0,26.5,0.0,53.0,36.539082,-121.935097,32.0,13.00,5.0
68414,Casino Point,18,7,2015,7/18/2015,2,blacksmith,none,,2.0,9.5,0.0,19.0,33.349167,-118.324966,45.0,21.67,8.0
68455,Casino Point,18,7,2015,7/18/2015,3,blacksmith,none,,2.0,5.0,0.0,10.0,33.349167,-118.324966,49.0,21.67,7.2
68456,Casino Point,18,7,2015,7/18/2015,3,blacksmith,none,,3.0,6.0,0.0,12.0,33.349167,-118.324966,49.0,21.67,7.2
68466,Casino Point,18,7,2015,7/18/2015,3,garibaldi,a,,2.0,10.0,0.0,20.0,33.349167,-118.324966,49.0,21.67,7.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
463724,WIES Intake Pipes,26,11,2013,11/26/2013,13,senorita,none,,5.0,2.5,0.0,5.0,33.446999,-118.484848,38.5,16.00,12.0
483977,Yellowbanks,7,11,2013,11/7/2013,1,senorita,none,,20.0,2.5,0.0,5.0,33.998798,-119.550499,27.5,14.00,7.5
484009,Yellowbanks,7,11,2013,11/7/2013,2,senorita,none,,2.0,2.5,0.0,5.0,33.998798,-119.550499,27.5,14.00,7.5
484212,Yellowbanks,7,11,2013,11/7/2013,8,senorita,none,,4.0,2.5,0.0,5.0,33.998798,-119.550499,28.0,14.00,7.5


In [11]:
examples.to_csv('RCCA_fish_min_size_zero_example_with_rownums.csv', na_rep='')

## Try to address OBIS error regarding incorrect eventIDs in MoF

In [133]:
## Any eventID/occurrenceID combinations that are in mof but not in occ?

test = mof[mof['occurrenceID'] != ''].merge(occ[['eventID', 'occurrenceID']], how='outer', on=['eventID', 'occurrenceID'], indicator=True)
test[test['_merge'] == 'left_only']

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod,_merge


In [137]:
## Any eventIDs in mof that are not in event?

test = event[['eventID', 'eventDate']].merge(mof, how='outer', on='eventID', indicator=True)
test[test['_merge'] == 'right_only']

Unnamed: 0,eventID,eventDate,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod,_merge
255377,120Reef_20180711_4,,,cover,13.3% sessile invertebrates | 20.0% crustose c...,percent cover,uniform point contact,right_only
255378,120Reef_20180711_4,,,relief,100.0% > 10cm - 1m,percent cover,uniform point contact,right_only
255379,120Reef_20180711_4,,,substrate,93.3% bedrock | 3.3% cobble | 3.3% sand,percent cover,uniform point contact,right_only
255380,120Reef_20180711_5,,,cover,26.7% red seaweed | 10.0% brown seaweed | 13.3...,percent cover,uniform point contact,right_only
255381,120Reef_20180711_5,,,relief,6.7% 0 - 10cm | 3.3% > 1m - 2m | 90.0% > 10cm ...,percent cover,uniform point contact,right_only
...,...,...,...,...,...,...,...,...
255588,SpanishBay_20170922_6,,,relief,30.0% > 1m - 2m | 70.0% > 10cm - 1m,percent cover,uniform point contact,right_only
255589,SpanishBay_20170922_6,,,substrate,6.7% boulder | 93.3% bedrock,percent cover,uniform point contact,right_only
255590,Weston_20181004_1,,,cover,3.3% mobile invertebrates | 83.3% crustose cor...,percent cover,uniform point contact,right_only
255591,Weston_20181004_1,,,relief,16.7% > 1m - 2m | 83.3% > 10cm - 1m,percent cover,uniform point contact,right_only


Ok, looks like there are a few events in mof that are not in event.

In [145]:
## How many events are there?

len(test.loc[test['_merge'] == 'right_only', 'eventID'].unique())

72

Ok, so that matches up with the OBIS error. At first glance, these look like events where only UPC data were collected. For example, fish, invert, and algae transects were conducted at 120 Reef on 2018-07-10. On 2018-07-11, additional fish transects were conducted, along with three transects where only UPC data were taken. The eventIDs for the latter transects appear only in the mof file. To check this, use the code blocks below.
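
One more way to test the "UPC-only events" idea is to look at which measurement types the orphaned eventIDs carry. The sketch below uses invented toy tables; `event_demo`, `mof_demo`, `orphans`, `orphan_types` and `upc_types` are hypothetical names. It assumes the UPC measurement types are exactly 'cover', 'relief' and 'substrate', as in the `upc_mof` output above.

```python
import pandas as pd

# Toy stand-ins for the event and mof tables; eventIDs are invented.
event_demo = pd.DataFrame({'eventID': ['SiteA_20180710_1', 'SiteA_20180711_1']})
mof_demo = pd.DataFrame({
    'eventID': ['SiteA_20180710_1', 'SiteA_20180710_1',
                'SiteA_20180711_4', 'SiteA_20180711_4'],
    'measurementType': ['temperature', 'cover', 'cover', 'relief'],
})

# Rows of mof whose eventID has no matching row in event
orphans = mof_demo[~mof_demo['eventID'].isin(event_demo['eventID'])]

# If the hypothesis holds, the orphaned rows carry only UPC measurement types
orphan_types = set(orphans['measurementType'])
upc_types = {'cover', 'relief', 'substrate'}
print(orphans['eventID'].nunique(), orphan_types <= upc_types)
```

On the real tables, a result of `72 True` would support the explanation. Any non-UPC measurement type among the orphans would point to a different cause.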

In [151]:
event[(event['locality'] == '120 Reef') & (event['eventDate'].str.startswith('2018'))]

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
24,120Reef_20180710_1,2018-07-10,RCCA transects,120 Reef,US,33.737919,-118.392014,250,,,band transect,10-15 minutes per transect
25,120Reef_20180710_2,2018-07-10,RCCA transects,120 Reef,US,33.737919,-118.392014,250,,,band transect,10-15 minutes per transect
26,120Reef_20180710_3,2018-07-10,RCCA transects,120 Reef,US,33.737919,-118.392014,250,,,band transect,10-15 minutes per transect
27,120Reef_20180710_4,2018-07-10,RCCA transects,120 Reef,US,33.737919,-118.392014,250,4.1,4.1,band transect,10-15 minutes per transect
28,120Reef_20180710_5,2018-07-10,RCCA transects,120 Reef,US,33.737919,-118.392014,250,6.2,6.2,band transect,10-15 minutes per transect
29,120Reef_20180710_6,2018-07-10,RCCA transects,120 Reef,US,33.737919,-118.392014,250,6.6,6.6,band transect,10-15 minutes per transect
6595,120Reef_20180710_9,2018-07-10,RCCA transects,120 Reef,US,33.737919,-118.392014,250,6.9,6.9,band transect,10-15 minutes per transect
6596,120Reef_20180710_10,2018-07-10,RCCA transects,120 Reef,US,33.737919,-118.392014,250,7.3,7.3,band transect,10-15 minutes per transect
6597,120Reef_20180710_11,2018-07-10,RCCA transects,120 Reef,US,33.737919,-118.392014,250,10.2,10.2,band transect,10-15 minutes per transect
6598,120Reef_20180710_12,2018-07-10,RCCA transects,120 Reef,US,33.737919,-118.392014,250,10.8,10.8,band transect,10-15 minutes per transect


In [159]:
t = occ[occ['eventID'].str.startswith('120Reef_20180710')]
t[t['occurrenceID'].str.contains('algae')]

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
251,120Reef_20180710_1,120Reef_20180710_1_algae_occ1,bull kelp,Nereocystis luetkeana,urn:lsid:marinespecies.org:taxname:240752,240752,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
252,120Reef_20180710_1,120Reef_20180710_1_algae_occ2,feather boa,Egregia menziesii,urn:lsid:marinespecies.org:taxname:372502,372502,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
253,120Reef_20180710_1,120Reef_20180710_1_algae_occ3,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,,1.0,number of individuals per 60 m2
254,120Reef_20180710_1,120Reef_20180710_1_algae_occ4,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,,1.0,number of individuals per 60 m2
255,120Reef_20180710_1,120Reef_20180710_1_algae_occ5,giant kelp,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,,,,1.0,number of individuals per 60 m2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
386,120Reef_20180710_6,120Reef_20180710_6_algae_occ18,laminaria setchellii,Laminaria setchellii,urn:lsid:marinespecies.org:taxname:240748,240748,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
387,120Reef_20180710_6,120Reef_20180710_6_algae_occ19,pterygophora,Pterygophora californica,urn:lsid:marinespecies.org:taxname:240750,240750,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
388,120Reef_20180710_6,120Reef_20180710_6_algae_occ20,sargassum horneri,Sargassum horneri,urn:lsid:marinespecies.org:taxname:494853,494853,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 60 m2
389,120Reef_20180710_6,120Reef_20180710_6_algae_occ21,southern sea palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,absent,HumanObservation,,individuals >= 30 cm in size,,,0.0,number of individuals per 60 m2


In [162]:
t = occ[occ['eventID'].str.startswith('120Reef_20180711')]
t[t['occurrenceID'].str.contains('fish')]

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,occurrenceRemarks,sex,lifeStage,organismQuantity,organismQuantityType
1457,120Reef_20180711_1,120Reef_20180711_1_fish_occ1,barred sand bass,Paralabrax nebulifer,urn:lsid:marinespecies.org:taxname:282059,282059,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 120 m3
1458,120Reef_20180711_1,120Reef_20180711_1_fish_occ2,black and yellow rockfish,Sebastes chrysomelas,urn:lsid:marinespecies.org:taxname:240737,240737,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 120 m3
1459,120Reef_20180711_1,120Reef_20180711_1_fish_occ3,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,,1.0,number of individuals per 120 m3
1460,120Reef_20180711_1,120Reef_20180711_1_fish_occ4,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,,1.0,number of individuals per 120 m3
1461,120Reef_20180711_1,120Reef_20180711_1_fish_occ5,black perch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,,1.0,number of individuals per 120 m3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1701,120Reef_20180711_8,120Reef_20180711_8_fish_occ43,striped perch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 120 m3
1702,120Reef_20180711_8,120Reef_20180711_8_fish_occ44,treefish,Sebastes serriceps,urn:lsid:marinespecies.org:taxname:274853,274853,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 120 m3
1703,120Reef_20180711_8,120Reef_20180711_8_fish_occ45,vermilion rockfish,Sebastes miniatus,urn:lsid:marinespecies.org:taxname:274820,274820,WoRMS,absent,HumanObservation,,,,,0.0,number of individuals per 120 m3
1704,120Reef_20180711_8,120Reef_20180711_8_fish_occ46,yellowtail/olive rockfish,Sebastes,urn:lsid:marinespecies.org:taxname:126175,126175,WoRMS,absent,HumanObservation,Sebastes flavidus or Sebastes serranoides,,,,0.0,number of individuals per 120 m3
