# MARINe: Phototransect conversion

MARINe conducts long-term monitoring at sites along the coast of North America approximately annually. Permanent photoplots are used to monitor the cover of target species assemblages representing different intertidal zones. Plots are established at sites with sufficient cover of the target species. Cover is estimated by sampling 5 permanent 50 x 75 cm (0.375 m²) plots and scoring point-contact occurrences on a uniform grid of 100 dots superimposed on each resulting image.

Additionally, permanent point-intercept transects are used to monitor the cover of Phyllospadix scouleri/torreyi, Egregia menziesii, and Red Algae (turf algae, including articulated corallines and other red algae) at sites with sufficient cover of the relevant species. Cover is estimated by scoring occurrences at 100 points spaced at 10 cm intervals along each of 3 permanent 10 m transects.
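
As a worked illustration of the scoring described above, here is a small sketch that turns point-contact counts from five plots into a mean, standard deviation and standard error of percent cover. The counts are invented, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the MARINe metadata should be checked for the exact formulas behind the `stddev` and `stderr` columns.

```python
import math
import statistics

# Hypothetical number of grid points (out of 100) that hit the target
# species in each of 5 photoplots; invented values for illustration only
hits_per_plot = [60, 45, 80, 55, 65]
points_per_plot = 100

# Percent cover in each plot
cover = [100 * h / points_per_plot for h in hits_per_plot]

# Summary statistics across plots
mean_cover = statistics.mean(cover)
sd_cover = statistics.stdev(cover)  # sample SD (n - 1); an assumption
se_cover = sd_cover / math.sqrt(len(cover))

print(mean_cover, round(sd_cover, 3), round(se_cover, 3))
```

The same arithmetic applies to transects, with 3 transects of 100 points each in place of 5 plots.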

**Resources:**
- https://data.piscoweb.org/metacatui/view/doi:10.6085/AA/marine_ltm.12.5

In [1]:
## Import packages

import pandas as pd
import numpy as np
from datetime import datetime

import pyworms

from SPARQLWrapper import SPARQLWrapper, JSON

I wanted to start doing WoRMS lookups with pyworms, but the package is really annoying me. I should really consider contributing to the project/doing my own version of the package. But for now, I'm just going to load my own functions, too.

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, '/Users/dianalg/PycharmProjects/PythonScripts/MPA data integration/')

import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Load data

data = pd.read_csv('MARINe_LTM_photoplots_2020.csv')
print(data.shape)
data.head()

(269257, 30)


Unnamed: 0,group_code,marine_site_name,site_code,site_lat,site_long,ltm_lat,ltm_long,marine_sort_order,marine_common_year,season_name,...,stderr,state_province,georegion,bioregion,mpa_designation,mpa_region,mpa_lt_region,mpa_name,island,last_updated
0,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,33.570782,-117.83773,6660,2002,Fall,...,1.4,California,CA South,Government Point to Mexico,SMCA,South Coast,South Coast,Crystal Cove SMCA,Mainland,2021-07-28 16:02:58
1,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,33.570782,-117.83773,6660,2002,Fall,...,1.4,California,CA South,Government Point to Mexico,SMCA,South Coast,South Coast,Crystal Cove SMCA,Mainland,2021-07-28 16:02:58
2,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,33.570782,-117.83773,6660,2002,Fall,...,0.0,California,CA South,Government Point to Mexico,SMCA,South Coast,South Coast,Crystal Cove SMCA,Mainland,2021-07-28 16:02:58
3,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,33.570782,-117.83773,6660,2002,Fall,...,0.0,California,CA South,Government Point to Mexico,SMCA,South Coast,South Coast,Crystal Cove SMCA,Mainland,2021-07-28 16:02:58
4,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,33.570782,-117.83773,6660,2002,Fall,...,6.316645,California,CA South,Government Point to Mexico,SMCA,South Coast,South Coast,Crystal Cove SMCA,Mainland,2021-07-28 16:02:58


In [4]:
## Site table

site = pd.read_csv('MARINe_site_table_2020.csv')
print(site.shape)
site.head()

(372, 17)


Unnamed: 0,site_code,marine_site_name,marine_sort_order,county,island,state_province,country,latitude,longitude,pisco_code,mpa_designation,mpa_region,mpa_lt_region,mpa_name,LTM_project_short_code,georegion,bioregion
0,PBOR,Otter Rock; Peterson Bay,1100,Kenai Peninsula,Mainland,Alaska,United States,59.574501,-151.2952,NONE,NONE,NONE,NONE,NONE,NONE,AK Southcentral,Alaska to British Columbia
1,ORST,Otter Rock; Peterson Bay ST,1105,Kenai Peninsula,Mainland,Alaska,United States,59.579582,-151.2951,NONE,NONE,NONE,NONE,NONE,NONE,AK Southcentral,Alaska to British Columbia
2,PBLG,Peterson Bay Lagoon,1106,Kenai Peninsula,Mainland,Alaska,United States,59.5793,-151.2959,NONE,NONE,NONE,NONE,NONE,NONE,AK Southcentral,Alaska to British Columbia
3,PBCP,China Poot,1110,Kenai Peninsula,Mainland,Alaska,United States,59.572102,-151.3027,NONE,NONE,NONE,NONE,NONE,NONE,AK Southcentral,Alaska to British Columbia
4,CPST,China Poot ST,1111,Kenai Peninsula,Mainland,Alaska,United States,59.571854,-151.30235,NONE,NONE,NONE,NONE,NONE,NONE,AK Southcentral,Alaska to British Columbia


In [5]:
## Species table

species = pd.read_csv('MARINe_species_table_2020.csv')
print(species.shape)
species.head()

(264, 13)


Unnamed: 0,species_code,marine_species_name,marine_species_definition,WoRMS_AphiaID,taxonomic_source,kingdom,phylum,class,order,family,genus,species,load_date
0,ACAPUN,acanthina punctulata,acanthina punctulata,580790.0,World Register of Marine Species: www.marines...,Animalia,Mollusca,Gastropoda,Neogastropoda,Muricidae,Acanthina,Acanthina punctulata,2021-03-30
1,ACASPI,acanthina spirata,acanthina spirata,580791.0,World Register of Marine Species: www.marines...,Animalia,Mollusca,Gastropoda,Neogastropoda,Muricidae,Acanthina,Acanthina spirata,2021-03-30
2,ACASPP,acanthinucella spp,acanthinucella spp,403857.0,World Register of Marine Species: www.marines...,Animalia,Mollusca,Gastropoda,Neogastropoda,Muricidae,Acanthinucella,,2021-03-30
3,ACRSPP,acrosiphonia spp; cladophora spp,acrosiphonia/cladophora spp (not c. columbiana),146216.0,World Register of Marine Species: www.marines...,Plantae,(Division) Chlorophyta,Ulvophyceae,,,,,2021-03-30
4,AHNLIN,ahnfeltiopsis linearis,ahnfeltiopsis linearis,372234.0,World Register of Marine Species: www.marines...,Plantae,(Division) Rhodophyta,Florideophyceae,Gigartinales,Phyllophoraceae,Ahnfeltiopsis,Ahnfeltiopsis linearis,2021-03-30


## Conversion

I'll define an event as a survey, uniquely identified by the site code, minimum survey date, and maximum survey date.

An occurrence can be defined as a percent cover measurement of a species observed during the event. (I.e., if percent cover is greater than 0, at least 1 organism was observed, and the species is present).

Measurements only pertain to occurrences, so I don't need a separate event file.
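
To make these definitions concrete, here is a minimal sketch on a toy table. The column names match the MARINe data, but the rows are invented. Each unique site/date-range combination becomes one event, and records are numbered within each event to give one occurrence per row.

```python
import pandas as pd

# Toy stand-in for the MARINe data; rows are invented for illustration
toy = pd.DataFrame({
    'site_code': ['CRCO', 'CRCO', 'CRCO', 'PBOR'],
    'min_survey_date': ['2002-12-06', '2002-12-06', '2003-05-10', '2003-06-01'],
    'max_survey_date': ['2002-12-06', '2002-12-06', '2003-05-11', '2003-06-01'],
    'species_code': ['ACAPUN', 'ROCK', 'ACAPUN', 'ACAPUN'],
})

# One event per unique site/date-range combination
toy['eventID'] = (
    toy['site_code'] + '_' + toy['min_survey_date'] + '_' + toy['max_survey_date']
)

# Number the records within each event to get one occurrence per row
within_event = toy.groupby('eventID').cumcount() + 1
toy['occurrenceID'] = toy['eventID'] + '_' + within_event.astype(str)

print(toy[['eventID', 'occurrenceID']])
```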

In [6]:
## eventID

occ = pd.DataFrame({'eventID':data['site_code'] + '_' + data['min_survey_date'] + '_' + data['max_survey_date']})
print(occ.shape)
occ.head()

(269257, 1)


Unnamed: 0,eventID
0,CRCO_2002-12-06_2002-12-06
1,CRCO_2002-12-06_2002-12-06
2,CRCO_2002-12-06_2002-12-06
3,CRCO_2002-12-06_2002-12-06
4,CRCO_2002-12-06_2002-12-06


In [7]:
## eventDate

occ['eventDate'] = data['min_survey_date'] + '/' + data['max_survey_date']
occ.head()

Unnamed: 0,eventID,eventDate
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06
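
When the start and end dates are identical, the interval form is redundant. One optional alternative, sketched here on invented dates, writes a single date for one-day surveys and keeps the interval otherwise. This is not what the cell above does; it's just a variant to consider.

```python
import numpy as np
import pandas as pd

# Invented survey dates: one single-day survey and one two-day survey
dates = pd.DataFrame({
    'min_survey_date': ['2002-12-06', '2003-05-10'],
    'max_survey_date': ['2002-12-06', '2003-05-11'],
})

# Use a single date when start and end match, otherwise a start/end interval
same_day = dates['min_survey_date'] == dates['max_survey_date']
dates['eventDate'] = np.where(
    same_day,
    dates['min_survey_date'],
    dates['min_survey_date'] + '/' + dates['max_survey_date'],
)

print(dates['eventDate'].tolist())
```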


In [8]:
## datasetID

occ['datasetID'] = 'MARINe LTM - percent cover from photoplots and phototransects'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...


In [9]:
## Merge with site table to get locality, county, stateProvince, countryCode, decimalLat, decimalLon

# Add site code to occ
occ['site_code'] = data['site_code']

# Define columns to merge from site table
site_cols = [
    'site_code',
    'marine_site_name',
    'county',
    'state_province',
    'country',
    'latitude',
    'longitude',
]

# Define DwC terms for these columns after merge
dwc_cols = [
    'eventID',
    'eventDate',
    'datasetID',
    'locality',
    'county',
    'stateProvince',
    'countryCode',
    'decimalLatitude',
    'decimalLongitude',
]

# Merge
occ = occ.merge(site[site_cols], how='left', on='site_code')
occ.drop(columns=['site_code'], inplace=True)
occ.columns = dwc_cols
print(occ.shape)
occ.head()

(269257, 9)


Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773


In [10]:
## Check county names against GTGN

# Define SPARQL query strings
counties_query = """
    select distinct * {
        ?place skos:inScheme tgn: ;
        gvp:placeTypePreferred [gvp:prefLabelGVP [xl:literalForm ?type]];
        gvp:placeType|(gvp:placeType/gvp:broaderGenericExtended) [rdfs:label "counties"@en];
        gvp:broaderPartitiveExtended [rdfs:label "United States"@en];
        gvp:prefLabelGVP [xl:literalForm ?name];
        gvp:parentString ?parents}
"""

divisions_query = """
    select distinct * {
        ?place skos:inScheme tgn: ;
        gvp:placeTypePreferred [gvp:prefLabelGVP [xl:literalForm ?type]];
        gvp:placeType|(gvp:placeType/gvp:broaderGenericExtended) [rdfs:label "national divisions"@en];
        gvp:broaderPartitiveExtended [rdfs:label "United States"@en];
        gvp:prefLabelGVP [xl:literalForm ?name];
        gvp:parentString ?parents}
"""

# Set up query
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery(counties_query)

# Obtain county results
try:
    counties_ret = sparql.query().convert()
except Exception as e:
    print(e)

# Obtain national division results (Alaska has boroughs and census districts, not counties)
sparql.setQuery(divisions_query)
try:
    div_ret = sparql.query().convert()
except Exception as e:
    print(e)

In [11]:
## Clean result

# Extract into data frame
county_df = pd.DataFrame(counties_ret['results']['bindings'])
county_df = county_df.applymap(lambda x: x['value'])
div_df = pd.DataFrame(div_ret['results']['bindings'])
div_df = div_df.applymap(lambda x: x['value'])

# Concatenate
county_df = pd.concat([county_df, div_df])
county_df.drop_duplicates(inplace=True)

# Unpack state, country etc. that each county is located in
county_df[['state', 'country', 'continent', 'planet', 'other']] = county_df['parents'].str.split(', ', expand=True)

# Filter
county_df = county_df[(county_df['country'] == 'United States') & (county_df['state'].isin(occ['stateProvince'].unique()))].copy()
county_df.head()

Unnamed: 0,place,type,name,parents,state,country,continent,planet,other
17,http://vocab.getty.edu/tgn/2002238,counties,Adams,"Washington, United States, North and Central A...",Washington,United States,North and Central America,World,
42,http://vocab.getty.edu/tgn/1002138,counties,Alameda,"California, United States, North and Central A...",California,United States,North and Central America,World,
89,http://vocab.getty.edu/tgn/1002145,counties,Alpine,"California, United States, North and Central A...",California,United States,North and Central America,World,
95,http://vocab.getty.edu/tgn/2002239,counties,Asotin,"Washington, United States, North and Central A...",Washington,United States,North and Central America,World,
101,http://vocab.getty.edu/tgn/1002146,counties,Amador,"California, United States, North and Central A...",California,United States,North and Central America,World,


In [12]:
## Check MARINe counties

for c in occ['county'].unique():
    if c not in county_df['name'].unique():
        print('County {} is not listed in GTGN. Double check name'.format(c))

All county names appear to be accurate.
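
Note that the loop above compares county names only, so a name that exists in a different state would still pass. A stricter check, sketched here on invented stand-ins for `occ` and `county_df`, compares (county, state) pairs with a merge. The frame names `occ_places` and `gazetteer` are hypothetical.

```python
import pandas as pd

# Invented stand-ins for occ[['county', 'stateProvince']] and county_df[['name', 'state']]
occ_places = pd.DataFrame({
    'county': ['Orange', 'Kenai Peninsula', 'Adams'],
    'stateProvince': ['California', 'Alaska', 'California'],
})
gazetteer = pd.DataFrame({
    'name': ['Orange', 'Kenai Peninsula', 'Adams'],
    'state': ['California', 'Alaska', 'Washington'],
})

# Left-merge on both name and state; the indicator column marks unmatched pairs
check = occ_places.drop_duplicates().merge(
    gazetteer.drop_duplicates(),
    how='left',
    left_on=['county', 'stateProvince'],
    right_on=['name', 'state'],
    indicator=True,
)
unmatched = check.loc[check['_merge'] == 'left_only', ['county', 'stateProvince']]

print(unmatched)
```

In this toy example the "Adams, California" pair is flagged, because the gazetteer stand-in only lists Adams under Washington.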

In [13]:
## Clean countryCode

occ['countryCode'] = occ['countryCode'].str.replace('United States', 'US')
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773


In [14]:
## coordinateUncertaintyInMeters

occ['coordinateUncertaintyInMeters'] = 350
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350


In [48]:
## Add minimumDepthInMeters, maximumDepthInMeters, samplingProtocol and samplingEffort

# Depth
occ['minimumDepthInMeters'] = 0
occ['maximumDepthInMeters'] = 0

# samplingProtocol
occ['samplingProtocol'] = data['survey_type_code']

# samplingEffort
occ['samplingEffort'] = data['num_plots_sampled']
occ.loc[occ['samplingProtocol'] == 'photo_plot_surveys', 'samplingEffort'] = \
    occ.loc[occ['samplingProtocol'] == 'photo_plot_surveys', 'samplingEffort'].astype(str) + ' plot(s)'
occ.loc[occ['samplingProtocol'] == 'transect_surveys', 'samplingEffort'] = \
    occ.loc[occ['samplingProtocol'] == 'transect_surveys', 'samplingEffort'].astype(str) + ' transect(s)'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,samplingEffort,occurrenceID,scientificName,taxonID,scientificNameID,nameAccordingTo,occurrenceStatus,basisOfRecord,organismQuantity,organismQuantityType
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,5 plot(s),CRCO_2002-12-06_2002-12-06_1,Actiniidae,100653,urn:lsid:marinespecies.org:taxname:100653,WoRMS,present,Human Observation,1.4,percent cover
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,5 plot(s),CRCO_2002-12-06_2002-12-06_2,Corallinaceae,143691,urn:lsid:marinespecies.org:taxname:143691,WoRMS,present,Human Observation,1.4,percent cover
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,5 plot(s),CRCO_2002-12-06_2002-12-06_3,Polyplacophora,55,urn:lsid:marinespecies.org:taxname:55,WoRMS,absent,Human Observation,0.0,percent cover
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,5 plot(s),CRCO_2002-12-06_2002-12-06_4,Chondracanthus canaliculatus,371723,urn:lsid:marinespecies.org:taxname:371723,WoRMS,absent,Human Observation,0.0,percent cover
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,5 plot(s),CRCO_2002-12-06_2002-12-06_5,Sessilia,106033,urn:lsid:marinespecies.org:taxname:106033,WoRMS,present,Human Observation,61.0,percent cover


**Do we want to distinguish between photoplots and phototransects here in the samplingProtocol column? If so, do we want to update samplingEffort to say "3 plot(s)" or "3 transect(s)" as is appropriate?**

In [46]:
occ['samplingProtocol'].unique()

array(['phototransect'], dtype=object)

In [45]:
data['survey_type_code'].unique()

array(['photo_plot_surveys', 'transect_surveys'], dtype=object)
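
The samplingEffort labelling in the `In [48]` cell above could also be written with a lookup dictionary, which avoids repeating the `.loc` filter for each survey type. A sketch on invented rows; the two `survey_type_code` values are the ones shown above, but the counts are made up and `unit_labels` is a hypothetical name.

```python
import pandas as pd

# Invented rows, one per survey type
toy = pd.DataFrame({
    'survey_type_code': ['photo_plot_surveys', 'transect_surveys'],
    'num_plots_sampled': [5, 3],
})

# Unit label for each survey type
unit_labels = {
    'photo_plot_surveys': 'plot(s)',
    'transect_surveys': 'transect(s)',
}

# Combine the count with the label that matches the survey type
toy['samplingEffort'] = (
    toy['num_plots_sampled'].astype(str)
    + ' '
    + toy['survey_type_code'].map(unit_labels)
)

print(toy['samplingEffort'].tolist())
```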

In [16]:
## occurrenceID

occ['occurrenceID'] = data.groupby(['site_code', 'min_survey_date', 'max_survey_date'])['species_code'].cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_' + occ['occurrenceID'].astype(str)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort,occurrenceID
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_1
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_2
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_3
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_4
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_5


In [17]:
## scientificName

# Get species codes
occ['scientificName'] = data['species_code']

# Create scientificName column in species table
sp = species[['species_code', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
sp = sp.replace('NULL', np.nan, regex=True)
sp['scientificName'] = sp['species']
sp['scientificName'] = sp['scientificName'].combine_first(sp['family'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['order'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['class'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['phylum'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['kingdom'])

# Build dictionary mapping codes to names
sp_dict = dict(zip(sp['species_code'], sp['scientificName']))

# Match OTHALG with Biota
sp_dict['OTHALG'] = 'Biota'

# Match abiotic codes with written-out names
sp_dict.update({
    'OTHSUB':'other substrate',
    'ROCK':'rock',
    'SAND':'sand',
    'TAR':'tar',
})

# Get rid of "(Division)" designations
sp_dict.update({
    'OTHGRE':'Chlorophyta',
    'OTHRED':'Rhodophyta',
})

# Replace codes with names in occ
occ['scientificName'] = occ['scientificName'].replace(sp_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort,occurrenceID,scientificName
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_1,Actiniidae
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_2,Corallinaceae
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_3,Polyplacophora
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_4,Chondracanthus canaliculatus
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_5,Sessilia


In [18]:
## Get unique names

names = occ['scientificName'].unique()
print(len(names))

45


**Note** that the following codes matched to NaN: 
- OTHALG
- OTHSUB
- ROCK
- SAND
- TAR

Rock, sand and tar are obviously abiotic and can't be matched to a species name. **Did Abby say these could remain in the dataset and simply wouldn't show up in OBIS?** OTHALG is pretty clearly other algae. OTHSUB must be other substrate, which is also abiotic.

**OTHALG probably should be matched to Biota. I've changed this above.**

Additionally, I should probably match the abiotic codes to the written-out version. **I've changed this above as well.**
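
If the abiotic records ever need to be separated from the biological ones, one option is to flag them by name. A sketch on invented rows, assuming the four written-out names used in the mapping above; the variable names are hypothetical.

```python
import pandas as pd

# Written-out names assigned to the abiotic codes
abiotic_names = {'rock', 'sand', 'tar', 'other substrate'}

# Invented rows for illustration
toy = pd.DataFrame({
    'scientificName': ['Actiniidae', 'rock', 'Biota', 'other substrate'],
    'organismQuantity': [1.4, 20.0, 3.0, 0.5],
})

# Split into biological and abiotic records
is_abiotic = toy['scientificName'].isin(abiotic_names)
biotic_rows = toy[~is_abiotic]
abiotic_rows = toy[is_abiotic]

print(biotic_rows['scientificName'].tolist())
```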

In [19]:
## Check names in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work, check name:  rock
Url didn't work, check name:  sand
Url didn't work, check name:  tar
Url didn't work for other substrate checking:  other
Url didn't work, check name:  other
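
The `WoRMS` helper module is local and isn't reproduced in this notebook. As a rough sketch of the kind of request involved, the functions below build a WoRMS REST `AphiaRecordsByName` URL and format an AphiaID as an LSID. The function names are hypothetical, and this is not necessarily how `run_get_worms_from_scientific_name` works. Nothing is sent over the network unless `fetch_records` is called.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

WORMS_REST = 'https://www.marinespecies.org/rest'

def build_name_url(name, like=False, marine_only=False):
    """Build the AphiaRecordsByName URL for a scientific name (hypothetical helper)."""
    return '{}/AphiaRecordsByName/{}?like={}&marine_only={}'.format(
        WORMS_REST, quote(name), str(like).lower(), str(marine_only).lower()
    )

def aphia_lsid(aphia_id):
    """Format an AphiaID as the LSID string used in scientificNameID."""
    return 'urn:lsid:marinespecies.org:taxname:{}'.format(int(aphia_id))

def fetch_records(name):
    """Query WoRMS for a name and return a list of record dicts (needs network access)."""
    with urlopen(build_name_url(name)) as response:
        body = response.read()
    # An empty body is treated as "no match"; this handling is an assumption
    return json.loads(body) if body else []

print(build_name_url('Chondracanthus canaliculatus'))
print(aphia_lsid(100653))
```

Names such as "rock" or "sand" would be expected to come back without a match, which is consistent with the messages printed above.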


In [120]:
# ## Check names on WoRMS using pyworms (may or may not run)

# match_dict = {}
# results = []

# for i in range(0, len(names)):
#     result = pyworms.aphiaRecordsByMatchNames(names[i])
#     if result == [[]]:
#         match_dict[names[i]] = 'no result'
#     else:
#         match_dict[names[i]] = result[0][0]['scientificname']
#         results.append(result)

# # Unpack results
# worms_out = pd.json_normalize(results[0])
# for i in range(1, len(results)):
#     norm = pd.json_normalize(results[i])
#     worms_out = pd.concat([worms_out, norm])
    
# if len(results) == worms_out.shape[0]: print('All names found.')
# else: print('{x} of {y} names found'.format(x = str(worms_out.shape[0]), y = str(len(results))))
# worms_out.head()

In [20]:
## Add taxonomy columns

# scientificName
occ['scientificName'] = occ['scientificName'].str.strip()
occ['scientificName'] = occ['scientificName'].replace(name_name_dict)

# taxonID
occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)

# scientificNameID
occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

# Other
occ['nameAccordingTo'] = 'WoRMS'
occ['occurrenceStatus'] = 'present'
occ['basisOfRecord'] = 'Human Observation'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,maximumDepthInMeters,samplingProtocol,samplingEffort,occurrenceID,scientificName,taxonID,scientificNameID,nameAccordingTo,occurrenceStatus,basisOfRecord
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,0,phototransect,5,CRCO_2002-12-06_2002-12-06_1,Actiniidae,100653,urn:lsid:marinespecies.org:taxname:100653,WoRMS,present,Human Observation
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,0,phototransect,5,CRCO_2002-12-06_2002-12-06_2,Corallinaceae,143691,urn:lsid:marinespecies.org:taxname:143691,WoRMS,present,Human Observation
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,0,phototransect,5,CRCO_2002-12-06_2002-12-06_3,Polyplacophora,55,urn:lsid:marinespecies.org:taxname:55,WoRMS,present,Human Observation
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,0,phototransect,5,CRCO_2002-12-06_2002-12-06_4,Chondracanthus canaliculatus,371723,urn:lsid:marinespecies.org:taxname:371723,WoRMS,present,Human Observation
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,0,phototransect,5,CRCO_2002-12-06_2002-12-06_5,Sessilia,106033,urn:lsid:marinespecies.org:taxname:106033,WoRMS,present,Human Observation


In [21]:
## Tidy taxonomy columns

# Blank out the abiotic categories, which have no WoRMS match
abiotic = ['rock', 'sand', 'tar', 'other substrate']
occ['scientificNameID'] = occ['scientificNameID'].replace(abiotic, '')
occ['taxonID'] = occ['taxonID'].replace(abiotic, np.nan)

# taxonID
occ['taxonID'] = occ['taxonID'].astype('Int32')

# Other
occ.loc[occ['taxonID'].isna(), ['nameAccordingTo', 'occurrenceStatus', 'basisOfRecord']] = ''
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,maximumDepthInMeters,samplingProtocol,samplingEffort,occurrenceID,scientificName,taxonID,scientificNameID,nameAccordingTo,occurrenceStatus,basisOfRecord
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,0,phototransect,5,CRCO_2002-12-06_2002-12-06_1,Actiniidae,100653,urn:lsid:marinespecies.org:taxname:100653,WoRMS,present,Human Observation
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,0,phototransect,5,CRCO_2002-12-06_2002-12-06_2,Corallinaceae,143691,urn:lsid:marinespecies.org:taxname:143691,WoRMS,present,Human Observation
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,0,phototransect,5,CRCO_2002-12-06_2002-12-06_3,Polyplacophora,55,urn:lsid:marinespecies.org:taxname:55,WoRMS,present,Human Observation
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,0,phototransect,5,CRCO_2002-12-06_2002-12-06_4,Chondracanthus canaliculatus,371723,urn:lsid:marinespecies.org:taxname:371723,WoRMS,present,Human Observation
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,0,phototransect,5,CRCO_2002-12-06_2002-12-06_5,Sessilia,106033,urn:lsid:marinespecies.org:taxname:106033,WoRMS,present,Human Observation


In [22]:
## Add percent cover

# organismQuantity
occ['organismQuantity'] = data['average_percent_cover']

# Change occurrenceStatus to absent if organismQuantity = 0
occ.loc[(occ['scientificNameID'] != '') & (occ['organismQuantity'] == 0), 'occurrenceStatus'] = 'absent'

# Add organismQuantityType
occ['organismQuantityType'] = 'percent cover'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,samplingEffort,occurrenceID,scientificName,taxonID,scientificNameID,nameAccordingTo,occurrenceStatus,basisOfRecord,organismQuantity,organismQuantityType
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,5,CRCO_2002-12-06_2002-12-06_1,Actiniidae,100653,urn:lsid:marinespecies.org:taxname:100653,WoRMS,present,Human Observation,1.4,percent cover
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,5,CRCO_2002-12-06_2002-12-06_2,Corallinaceae,143691,urn:lsid:marinespecies.org:taxname:143691,WoRMS,present,Human Observation,1.4,percent cover
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,5,CRCO_2002-12-06_2002-12-06_3,Polyplacophora,55,urn:lsid:marinespecies.org:taxname:55,WoRMS,absent,Human Observation,0.0,percent cover
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,5,CRCO_2002-12-06_2002-12-06_4,Chondracanthus canaliculatus,371723,urn:lsid:marinespecies.org:taxname:371723,WoRMS,absent,Human Observation,0.0,percent cover
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,...,5,CRCO_2002-12-06_2002-12-06_5,Sessilia,106033,urn:lsid:marinespecies.org:taxname:106033,WoRMS,present,Human Observation,61.0,percent cover


In [49]:
## Save

occ.to_csv('MARINe_LTM_photoplot_phototran_occurrence_20210825.csv', index=False)

## MoF

In [33]:
## Assemble measurementType, measurementValue

mean_pc = occ[['eventID', 'occurrenceID', 'organismQuantityType', 'organismQuantity']].copy()
mean_pc.rename(columns={
    'organismQuantityType':'measurementType',
    'organismQuantity':'measurementValue'
}, inplace=True)
mean_pc['measurementType'] = 'Proportion coverage mean of biological entity specified elsewhere' # Closest term I can find on NVS
mean_pc.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue
0,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_1,Proportion coverage mean of biological entity ...,1.4
1,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_2,Proportion coverage mean of biological entity ...,1.4
2,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_3,Proportion coverage mean of biological entity ...,0.0
3,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_4,Proportion coverage mean of biological entity ...,0.0
4,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_5,Proportion coverage mean of biological entity ...,61.0


In [34]:
## Add measurementUnit, measurementMethod

mean_pc['measurementUnit'] = 'percent'
mean_pc['measurementMethod'] = 'the number of point contact occurrences out of 100 points averaged across 3-5 transects or plots'
mean_pc.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_1,Proportion coverage mean of biological entity ...,1.4,percent,the number of point contact occurrences out of...
1,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_2,Proportion coverage mean of biological entity ...,1.4,percent,the number of point contact occurrences out of...
2,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_3,Proportion coverage mean of biological entity ...,0.0,percent,the number of point contact occurrences out of...
3,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_4,Proportion coverage mean of biological entity ...,0.0,percent,the number of point contact occurrences out of...
4,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_5,Proportion coverage mean of biological entity ...,61.0,percent,the number of point contact occurrences out of...


In [37]:
## Do the same for standard deviation and standard error

std_pc = occ[['eventID', 'occurrenceID']].copy()
std_pc['measurementType'] = 'Proportion coverage standard deviation of biological entity specified elsewhere'
std_pc['measurementValue'] = data['stddev']
std_pc['measurementUnit'] = 'percent'
std_pc['measurementMethod'] = 'the standard deviation of the number of point contact occurrences out of 100 points averaged across 3-5 transects or plots'

stderr_pc = occ[['eventID', 'occurrenceID']].copy()
stderr_pc['measurementType'] = 'Proportion coverage standard error of biological entity specified elsewhere'
stderr_pc['measurementValue'] = data['stderr']
stderr_pc['measurementUnit'] = 'percent'
stderr_pc['measurementMethod'] = 'the standard error of the number of point contact occurrences out of 100 points averaged across 3-5 transects or plots'
stderr_pc.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_1,Proportion coverage standard error of biologic...,1.4,percent,the standard error of the number of point cont...
1,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_2,Proportion coverage standard error of biologic...,1.4,percent,the standard error of the number of point cont...
2,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_3,Proportion coverage standard error of biologic...,0.0,percent,the standard error of the number of point cont...
3,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_4,Proportion coverage standard error of biologic...,0.0,percent,the standard error of the number of point cont...
4,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_5,Proportion coverage standard error of biologic...,6.316645,percent,the standard error of the number of point cont...


In [38]:
## Concatenate

mof = pd.concat([mean_pc, std_pc, stderr_pc])
mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_1,Proportion coverage mean of biological entity ...,1.4,percent,the number of point contact occurrences out of...
1,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_2,Proportion coverage mean of biological entity ...,1.4,percent,the number of point contact occurrences out of...
2,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_3,Proportion coverage mean of biological entity ...,0.0,percent,the number of point contact occurrences out of...
3,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_4,Proportion coverage mean of biological entity ...,0.0,percent,the number of point contact occurrences out of...
4,CRCO_2002-12-06_2002-12-06,CRCO_2002-12-06_2002-12-06_5,Proportion coverage mean of biological entity ...,61.0,percent,the number of point contact occurrences out of...
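
An equivalent way to reach the same long format is to put the three statistics side by side and reshape with `melt`. A sketch on invented values; the short measurementType labels here are placeholders, not the NVS terms used above.

```python
import pandas as pd

# Invented wide table: one row per occurrence, one column per statistic
wide = pd.DataFrame({
    'eventID': ['E1', 'E1'],
    'occurrenceID': ['E1_1', 'E1_2'],
    'mean': [1.4, 61.0],
    'stddev': [3.1, 14.1],
    'stderr': [1.4, 6.3],
})

# Reshape to one row per occurrence per statistic
long = wide.melt(
    id_vars=['eventID', 'occurrenceID'],
    value_vars=['mean', 'stddev', 'stderr'],
    var_name='measurementType',
    value_name='measurementValue',
)
long['measurementUnit'] = 'percent'

print(long)
```

The concat approach used above makes it easier to attach a different measurementMethod string to each statistic, so the choice is mostly a matter of taste.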


In [39]:
## Save

mof.to_csv('MARINe_LTM_photoplot_phototran_mof_20210825.csv', index=False)
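
Before sharing the files, a quick consistency check can confirm that occurrenceIDs are unique and that every MoF row points at an existing occurrence. A sketch on invented frames standing in for `occ` and `mof`:

```python
import pandas as pd

# Invented stand-ins for the occurrence and MoF tables
occ_toy = pd.DataFrame({'occurrenceID': ['E1_1', 'E1_2']})
mof_toy = pd.DataFrame({
    'occurrenceID': ['E1_1', 'E1_2', 'E1_1', 'E1_2'],
    'measurementType': ['mean', 'mean', 'stderr', 'stderr'],
})

# Every occurrenceID should appear exactly once in the occurrence table
ids_unique = occ_toy['occurrenceID'].is_unique

# MoF rows whose occurrenceID has no match in the occurrence table
orphans = mof_toy[~mof_toy['occurrenceID'].isin(occ_toy['occurrenceID'])]

print(ids_unique, len(orphans))
```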

## Questions

1. Do we want to distinguish between photoplots and phototransects here in the samplingProtocol column? If so, do we want to update samplingEffort to say "3 plot(s)" or "3 transect(s)" as is appropriate? **Yes and yes. Rani will add a survey type column to the data. DONE.**

For Abby:
1. Did Abby say nonbiological percent cover measurements could remain in the dataset and simply wouldn't show up in OBIS?