# MARINe: Phototransect conversion

MARINe conducts long-term monitoring, approximately annually, at sites along the coast of North America. Permanent photoplots are used to monitor the cover of target species assemblages representing different intertidal zones. Plots are established at sites with sufficient cover of the target species. Cover is estimated by photographing 5 permanent 50 x 75 cm (0.375 m²) plots and scoring point-contact occurrences under a uniform grid of 100 dots superimposed on each image.

Additionally, permanent point-intercept transects are used to monitor the cover of Phyllospadix scouleri/torreyi, Egregia menziesii, and Red Algae (turf algae, including articulated corallines and other red algae) at sites with sufficient cover of the relevant species. Cover is estimated by scoring occurrences at 100 points spaced at 10 cm intervals along each of 3 permanent 10 m transects.
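
As a rough illustration of how point contacts turn into the summary statistics in this dataset, here is a minimal sketch. The hit counts are made up for illustration, and the variable names are my own. The `stdev`, `stderr` and `num_plots_sampled` columns in the summary file appear consistent with stderr = stdev / sqrt(n) (for example, 3.130495 / sqrt(5) ≈ 1.4), but I'm treating that as an assumption rather than documented MARINe practice.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical point-contact counts for one target species in 5 photoplots,
# each scored with a 100-point grid (illustrative values, not MARINe data).
POINTS_PER_PLOT = 100
hits_per_plot = [12, 15, 9, 14, 10]

# Percent cover per plot = points on the target / points scored * 100
cover_per_plot = [100 * h / POINTS_PER_PLOT for h in hits_per_plot]

# Site-level summary across plots
average_cover = mean(cover_per_plot)
cover_stdev = stdev(cover_per_plot)  # sample standard deviation (n - 1)
cover_stderr = cover_stdev / sqrt(len(cover_per_plot))

print(average_cover, round(cover_stdev, 3), round(cover_stderr, 3))
```

The same arithmetic applies to transects, with 3 transects of 100 points each instead of 5 plots.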

**Resources:**
- https://data.piscoweb.org/metacatui/view/doi:10.6085/AA/marine_ltm.12.4

In [1]:
## Import packages

import pandas as pd
import numpy as np
from datetime import datetime

import pyworms

from SPARQLWrapper import SPARQLWrapper, JSON

## Load data

In [2]:
## Load data

data = pd.read_csv('phototransummarysd_download.csv')
print(data.shape)
data.head()

(262623, 25)


Unnamed: 0,group_code,marine_site_name,site_code,latitude,longitude,marine_sort_order,marine_common_year,season_name,marine_season_code,marine_common_season,...,num_plots_sampled,stdev,stderr,state_province,georegion,bioregion,mpa_designation,mpa_region,island,last_updated
0,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,6660,2002,Fall,FA02,87,...,5,3.130495,1.4,California,CA South,Government Point to Mexico,SMCA,South Coast,Mainland,2021-03-26 10:55:59
1,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,6660,2002,Fall,FA02,87,...,5,3.130495,1.4,California,CA South,Government Point to Mexico,SMCA,South Coast,Mainland,2021-03-26 10:55:59
2,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,6660,2002,Fall,FA02,87,...,5,0.0,0.0,California,CA South,Government Point to Mexico,SMCA,South Coast,Mainland,2021-03-26 10:55:59
3,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,6660,2002,Fall,FA02,87,...,5,0.0,0.0,California,CA South,Government Point to Mexico,SMCA,South Coast,Mainland,2021-03-26 10:55:59
4,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,6660,2002,Fall,FA02,87,...,5,14.124447,6.316645,California,CA South,Government Point to Mexico,SMCA,South Coast,Mainland,2021-03-26 10:55:59


In [3]:
## Site table

site = pd.read_csv('MARINe_site_table.txt')
print(site.shape)
site.head()

(269, 17)


Unnamed: 0,marine_sort_order,site_code,marine_site_name,county,island,state_province,country,latitude,longitude,pisco_code,mpa_region,mpa_lt_region,mpa_designation,mpa_name,LTM_project_short_code,georegion,bioregion
0,1500,GRA,Graves Harbor,Skagway-Hoonah-Angoon,Mainland,Alaska,United States,58.270821,-136.73148,IGHAXX,,,,,,AK,Alaska to British Columbia
1,1600,YAK,Yakobi,Skagway-Hoonah-Angoon,Yakobi,Alaska,United States,58.081501,-136.55453,IYAKXX,,,,,,AK,Alaska to British Columbia
2,1700,PMA,Port Mary,Sitka,Kruzoff,Alaska,United States,57.154228,-135.75452,IPMAXX,,,,,,AK,Alaska to British Columbia
3,1720,SAGE,Sage Rock,Sitka,Baranoff,Alaska,United States,57.048698,-135.32314,ISAGXX,,,,,,AK,Alaska to British Columbia
4,1730,KAY,Kayak Island,Sitka,Kayak,Alaska,United States,57.022499,-135.35387,IKAYXX,,,,,,AK,Alaska to British Columbia


In [4]:
## Species table

species = pd.read_csv('MARINe_species_table.txt')
print(species.shape)
species.head()

(50, 12)


Unnamed: 0,lumping_code,marine_species_name,marine_species_definition,taxonomic_id,taxonomic_source,kingdom,phylum,class,order,family,genus,species
0,ANTELE,anthopleura elegantissima; anthopleura sola,anthopleura elegantissima/sola; may also inclu...,100696.0,World Register of Marine Species: www.marines...,Animalia,Cnidaria,Anthozoa,Actiniaria,Actiniidae,Anthopleura,
1,ARTCOR,articulated corallines,erect; jointed; calcified; red algae of the Fa...,143691.0,World Register of Marine Species: www.marines...,Plantae,(Division) Rhodophyta,Florideophyceae,Corallinales,Corallinaceae,,
2,BARNAC,barnacles,any species of barnacle; used for transects wh...,106033.0,World Register of Marine Species: www.marines...,Animalia,Arthropoda,Thecostraca,Sessilia,,,
3,CHITON,chitons,any species of chiton,55.0,World Register of Marine Species: www.marines...,Animalia,Mollusca,Polyplacophora,,,,
4,CHOCAN,chondracanthus canaliculatus,chondracanthus canaliculatus,371723.0,World Register of Marine Species: www.marines...,Plantae,(Division) Rhodophyta,Florideophyceae,Gigartinales,Gigartinaceae,Chondracanthus,Chondracanthus canaliculatus


## Conversion

I'll define an event as a survey, uniquely identified by the site code, min date, and max date.

An occurrence can be defined as a percent cover measurement of a species observed during the event. (I.e., if percent cover is greater than 0, at least 1 organism was observed, and the species is present).

Measurements only pertain to occurrences, so I don't need a separate event file.
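
The presence rule above could be made explicit with the Darwin Core `occurrenceStatus` term. This is only a sketch on a toy frame: `percent_cover` is a hypothetical column name standing in for whichever column of the summary file holds the cover estimate.

```python
import pandas as pd

# Toy frame; `percent_cover` is a hypothetical column name (an assumption),
# and the values are illustrative.
toy = pd.DataFrame({
    'lumping_code': ['ANTELE', 'ARTCOR', 'CHITON'],
    'percent_cover': [3.4, 0.0, 0.2],
})

# Cover above zero means at least one organism was scored, so the taxon is present
toy['occurrenceStatus'] = (
    toy['percent_cover'].gt(0).map({True: 'present', False: 'absent'})
)
print(toy)
```

Keeping zero-cover rows as `absent` rather than dropping them preserves the information that the taxon was looked for and not found.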

In [6]:
## eventID

occ = pd.DataFrame({'eventID':data['site_code'] + '_' + data['min_survey_date'] + '_' + data['max_survey_date']})
print(occ.shape)
occ.head()

(262623, 1)


Unnamed: 0,eventID
0,CRCO_2002-12-06_2002-12-06
1,CRCO_2002-12-06_2002-12-06
2,CRCO_2002-12-06_2002-12-06
3,CRCO_2002-12-06_2002-12-06
4,CRCO_2002-12-06_2002-12-06
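
Because eventID is built by string concatenation, a missing site code or date would silently produce a missing ID. A quick sanity check might look like the sketch below, which uses a toy stand-in for `data` (the rows are invented, and one date is deliberately missing).

```python
import pandas as pd

# Toy stand-in for `data`; the second row has a missing max_survey_date
toy = pd.DataFrame({
    'site_code': ['CRCO', 'CRCO', 'GRA'],
    'min_survey_date': ['2002-12-06', '2003-05-01', '2010-06-12'],
    'max_survey_date': ['2002-12-06', None, '2010-06-13'],
})

event_id = (
    toy['site_code'] + '_' + toy['min_survey_date'] + '_' + toy['max_survey_date']
)

# Concatenating with a missing value yields a missing eventID
n_missing = int(event_id.isna().sum())
# nunique() ignores missing values, so this counts well-formed events only
n_events = event_id.nunique()
print(n_missing, n_events)
```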


In [8]:
## eventDate

occ['eventDate'] = data['min_survey_date'] + '/' + data['max_survey_date']
occ.head()

Unnamed: 0,eventID,eventDate
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06
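
When the min and max survey dates are identical, the interval form is valid but redundant. If a single date is preferred for one-day surveys, something like the following sketch would do it (toy data; whether to collapse the interval is a choice, not a requirement).

```python
import pandas as pd

# Toy stand-in for the two date columns in `data`
toy = pd.DataFrame({
    'min_survey_date': ['2002-12-06', '2010-06-12'],
    'max_survey_date': ['2002-12-06', '2010-06-13'],
})

same_day = toy['min_survey_date'] == toy['max_survey_date']
interval = toy['min_survey_date'] + '/' + toy['max_survey_date']

# Keep the single date where start == end, otherwise use the start/end interval
toy['eventDate'] = toy['min_survey_date'].where(same_day, interval)
print(toy['eventDate'].tolist())
```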


In [11]:
## datasetID

occ['datasetID'] = 'MARINe LTM - percent cover from photoplots and phototransects'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...


In [12]:
## Merge with site table to get locality, county, stateProvince, countryCode, decimalLat, decimalLon

# Add site code to occ
occ['site_code'] = data['site_code']

# Define columns to merge from site table
site_cols = [
    'site_code',
    'marine_site_name',
    'county',
    'state_province',
    'country',
    'latitude',
    'longitude',
]

# Define DwC terms for these columns after merge
dwc_cols = [
    'eventID',
    'eventDate',
    'datasetID',
    'locality',
    'county',
    'stateProvince',
    'countryCode',
    'decimalLatitude',
    'decimalLongitude',
]

# Merge
occ = occ.merge(site[site_cols], how='left', on='site_code')
occ.drop(columns=['site_code'], inplace=True)
occ.columns = dwc_cols
print(occ.shape)
occ.head()

(262623, 9)


Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
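
Assigning `occ.columns` positionally works here, but it depends on the merged columns arriving in exactly the expected order. A slightly more defensive variant, sketched on toy frames, renames by name and asks pandas to confirm that each site code appears only once in the site table.

```python
import pandas as pd

# Toy stand-ins for `occ` and `site` (one site, two occurrence rows)
occ_toy = pd.DataFrame({
    'eventID': ['CRCO_2002-12-06_2002-12-06'] * 2,
    'site_code': ['CRCO', 'CRCO'],
})
site_toy = pd.DataFrame({
    'site_code': ['CRCO'],
    'marine_site_name': ['Crystal Cove'],
    'county': ['Orange'],
    'state_province': ['California'],
    'country': ['United States'],
    'latitude': [33.570782],
    'longitude': [-117.83773],
})

# Map source column names to Darwin Core terms explicitly
rename_map = {
    'marine_site_name': 'locality',
    'state_province': 'stateProvince',
    'country': 'countryCode',
    'latitude': 'decimalLatitude',
    'longitude': 'decimalLongitude',
}

merged = (
    occ_toy
    # validate raises MergeError if site_code is duplicated in the site table
    .merge(site_toy, how='left', on='site_code', validate='many_to_one')
    .drop(columns=['site_code'])
    .rename(columns=rename_map)
)
print(merged.columns.tolist())
```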


In [14]:
## Check county names against GTGN

# Define sql strings
counties_query = """
    select distinct * {
        ?place skos:inScheme tgn: ;
        gvp:placeTypePreferred [gvp:prefLabelGVP [xl:literalForm ?type]];
        gvp:placeType|(gvp:placeType/gvp:broaderGenericExtended) [rdfs:label "counties"@en];
        gvp:broaderPartitiveExtended [rdfs:label "United States"@en];
        gvp:prefLabelGVP [xl:literalForm ?name];
        gvp:parentString ?parents}
"""

divisions_query = """
    select distinct * {
        ?place skos:inScheme tgn: ;
        gvp:placeTypePreferred [gvp:prefLabelGVP [xl:literalForm ?type]];
        gvp:placeType|(gvp:placeType/gvp:broaderGenericExtended) [rdfs:label "national divisions"@en];
        gvp:broaderPartitiveExtended [rdfs:label "United States"@en];
        gvp:prefLabelGVP [xl:literalForm ?name];
        gvp:parentString ?parents}
"""

# Set up query
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery(counties_query)

# Obtain county results
try:
    counties_ret = sparql.query().convert()
except Exception as e:
    # `except e:` would raise a NameError; catch the exception class instead
    print(e)

# Obtain national division results (Alaska has boroughs and census districts, not counties)
sparql.setQuery(divisions_query)
try:
    div_ret = sparql.query().convert()
except Exception as e:
    print(e)

In [15]:
## Clean result

# Extract into data frame
county_df = pd.DataFrame(counties_ret['results']['bindings'])
county_df = county_df.applymap(lambda x: x['value'])
div_df = pd.DataFrame(div_ret['results']['bindings'])
div_df = div_df.applymap(lambda x: x['value'])

# Concatenate
county_df = pd.concat([county_df, div_df])
county_df.drop_duplicates(inplace=True)

# Unpack state, country etc. that each county is located in
county_df[['state', 'country', 'continent', 'planet', 'other']] = county_df['parents'].str.split(', ', expand=True)

# Filter
county_df = county_df[(county_df['country'] == 'United States') & (county_df['state'].isin(occ['stateProvince'].unique()))].copy()
county_df.head()

Unnamed: 0,place,type,name,parents,state,country,continent,planet,other
17,http://vocab.getty.edu/tgn/2002238,counties,Adams,"Washington, United States, North and Central A...",Washington,United States,North and Central America,World,
41,http://vocab.getty.edu/tgn/1002138,counties,Alameda,"California, United States, North and Central A...",California,United States,North and Central America,World,
89,http://vocab.getty.edu/tgn/1002145,counties,Alpine,"California, United States, North and Central A...",California,United States,North and Central America,World,
95,http://vocab.getty.edu/tgn/2002239,counties,Asotin,"Washington, United States, North and Central A...",Washington,United States,North and Central America,World,
101,http://vocab.getty.edu/tgn/1002146,counties,Amador,"California, United States, North and Central A...",California,United States,North and Central America,World,


In [16]:
## Check MARINe counties

for c in occ['county'].unique():
    if c not in county_df['name'].unique():
        print('County {} is not listed in GTGN. Double check name'.format(c))

All county names appear to be accurate.
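
The loop above checks county names alone, so a name that exists in a different state would still pass. A stricter check compares (county, state) pairs. The sketch below uses toy frames with invented rows; `gazetteer_toy` stands in for `county_df`.

```python
import pandas as pd

# Toy stand-ins: the third row pairs a real county name with the wrong state
occ_toy = pd.DataFrame({
    'county': ['Orange', 'Sitka', 'Adams'],
    'stateProvince': ['California', 'Alaska', 'California'],
})
gazetteer_toy = pd.DataFrame({
    'name': ['Orange', 'Sitka', 'Adams'],
    'state': ['California', 'Alaska', 'Washington'],
})

pairs = occ_toy[['county', 'stateProvince']].drop_duplicates()

# indicator=True adds a `_merge` column marking rows found only on the left
check = pairs.merge(
    gazetteer_toy,
    how='left',
    left_on=['county', 'stateProvince'],
    right_on=['name', 'state'],
    indicator=True,
)
unmatched = check.loc[check['_merge'] == 'left_only', ['county', 'stateProvince']]
print(unmatched)
```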

In [17]:
## Clean countryCode

occ['countryCode'] = occ['countryCode'].str.replace('United States', 'US')
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773


In [18]:
## coordinateUncertaintyInMeters

occ['coordinateUncertaintyInMeters'] = 350
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350


In [27]:
## Add minimumDepthInMeters, maximumDepthInMeters, samplingProtocol and samplingEffort

# Depth
occ['minimumDepthInMeters'] = 0
occ['maximumDepthInMeters'] = 0

# samplingProtocol
occ['samplingProtocol'] = 'phototransect'

# samplingEffort
occ['samplingEffort'] = data['num_plots_sampled']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5


**Do we want to distinguish between photoplots and phototransects here in the samplingProtocol column? If so, do we want to update samplingEffort to say "3 plot(s)" or "3 transect(s)" as is appropriate?**
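
If the answer is yes, one way to do it is sketched below. Note that `survey_method` and its values are hypothetical: I'm assuming the summary file has some column that separates plots from transects, and the real column name and codes would need to be substituted.

```python
import pandas as pd

# Toy frame; `survey_method` is a hypothetical column (an assumption)
toy = pd.DataFrame({
    'survey_method': ['photoplot', 'transect'],
    'num_plots_sampled': [5, 3],
})

# Hypothetical lookups from method to protocol label and effort unit
protocol_map = {'photoplot': 'photoplot', 'transect': 'point-intercept transect'}
unit_map = {'photoplot': 'plot(s)', 'transect': 'transect(s)'}

toy['samplingProtocol'] = toy['survey_method'].map(protocol_map)
toy['samplingEffort'] = (
    toy['num_plots_sampled'].astype(str) + ' ' + toy['survey_method'].map(unit_map)
)
print(toy[['samplingProtocol', 'samplingEffort']])
```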

In [28]:
## occurrenceID

occ['occurrenceID'] = data.groupby(['site_code', 'min_survey_date', 'max_survey_date'])['lumping_code'].cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_' + occ['occurrenceID'].astype(str)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort,occurrenceID
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_1
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_2
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_3
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_4
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_5
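
Since occurrenceID has to be unique across the dataset, it's worth confirming that after building it. The sketch below repeats the cumcount approach on a toy frame (invented rows) and checks uniqueness. One caveat: because the counter follows row order, the IDs would change if the source file were re-sorted.

```python
import pandas as pd

# Toy stand-in: two events with three and two records respectively
toy = pd.DataFrame({
    'eventID': ['A_2002-12-06_2002-12-06'] * 3 + ['B_2010-06-12_2010-06-13'] * 2,
    'lumping_code': ['ANTELE', 'ARTCOR', 'CHITON', 'ANTELE', 'ROCK'],
})

# Number records 1..n within each event, in row order
counter = toy.groupby('eventID').cumcount() + 1
toy['occurrenceID'] = toy['eventID'] + '_' + counter.astype(str)

ids_are_unique = toy['occurrenceID'].is_unique
print(ids_are_unique)
```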


In [57]:
## scientificName

# Get species codes
occ['scientificName'] = data['lumping_code']

# Create scientificName column in species table
sp = species[['lumping_code', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']].copy()  # copy so new columns aren't added to a view
sp = sp.replace('NULL', np.nan, regex=True)
sp['scientificName'] = sp['species'].combine_first(sp['genus'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['family'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['order'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['class'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['phylum'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['kingdom'])

# Build dictionary mapping codes to names
sp_dict = dict(zip(sp['lumping_code'], sp['scientificName']))

# Match OTHALG with Biota
sp_dict['OTHALG'] = 'Biota'

# Replace codes with names in occ
occ['scientificName'] = occ['scientificName'].replace(sp_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort,occurrenceID,scientificName
0,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_1,Anthopleura
1,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_2,Corallinaceae
2,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_3,Polyplacophora
3,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_4,Chondracanthus canaliculatus
4,CRCO_2002-12-06_2002-12-06,2002-12-06/2002-12-06,MARINe LTM - percent cover from photoplots and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,350,0,0,phototransect,5,CRCO_2002-12-06_2002-12-06_5,Sessilia


In [58]:
## Get unique names

names = occ['scientificName'].unique()
names

array(['Anthopleura ', 'Corallinaceae ', 'Polyplacophora ',
       'Chondracanthus canaliculatus', 'Sessilia ',
       'Cladophora columbiana', 'Egregia menziesii', 'Eisenia arborea',
       'Endocladia muricata', 'Phaeophyceae ', 'Fucus distichus',
       'Stephanocystis ', 'Hesperophycus californicus', 'Gastropoda ',
       'Lottia gigantea', 'Mastocarpus ', 'Mazzaella affinis',
       'Mazzaella ', 'Mytilus californianus', 'Plantae ', 'Biota',
       'Thecostraca ', 'Chromista ', '(Division) Chlorophyta',
       'Animalia ', '(Division) Rhodophyta', 'Pelvetiopsis limitata',
       'Sabellariidae ', 'Phyllospadix ', 'Pisaster ochraceus',
       'Pollicipes polymerus', 'Pyropia ', nan, 'Sargassum muticum',
       'Mytilidae ', 'Silvetia compressa', 'Tetraclita squamosa',
       'Ulvophyceae ', 'Neorhodomela larix', 'Semibalanus cariosus',
       'Mytilus ', 'Hedophyllum sessile', 'Zostera marina'], dtype=object)
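
Several of these names carry trailing whitespace, and some have a "(Division) " prefix, both of which are likely to get in the way of name matching later. A minimal cleaning sketch on a few of the values above (stripping the prefix is my assumption about what a taxonomic lookup would want):

```python
import pandas as pd

# A few values copied from the output above, plus a missing value
raw_names = pd.Series(
    ['Anthopleura ', '(Division) Rhodophyta', 'Egregia menziesii', None]
)

cleaned = (
    raw_names
    .str.replace(r'^\(Division\)\s*', '', regex=True)  # drop the rank prefix
    .str.strip()                                       # drop stray whitespace
)
print(cleaned.tolist())
```

Missing values pass through the `.str` methods unchanged, so they can still be dealt with separately.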

**Note** that there are still NaN values in scientificName. Where are they coming from?

The following codes were originally matched to NaN (OTHALG has since been mapped to Biota in the cell above):
- OTHALG
- OTHSUB
- ROCK
- SAND
- TAR

Rock, sand and tar are obviously abiotic and can't be matched to a species name. **Did Abby say these could remain in the dataset, they just wouldn't show up in OBIS?** OTHALG is pretty clearly other algae. OTHSUB must be other substrate - also abiotic.

**OTHALG probably should be matched to Biota. I've changed this above.**
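
To list the codes that end up without a name, rather than working backwards from the NaN in the output, something like this sketch could be used. The dictionary here is a toy version of `sp_dict`, with only a handful of entries.

```python
import pandas as pd

# Toy version of sp_dict: abiotic codes have no taxonomic name
sp_dict_toy = {
    'ANTELE': 'Anthopleura',
    'OTHALG': 'Biota',
    'ROCK': float('nan'),
    'SAND': float('nan'),
}
codes = pd.Series(['ANTELE', 'ROCK', 'SAND', 'OTHALG', 'ROCK'])

# map() returns NaN both for codes mapped to NaN and for codes missing from the dict
mapped = codes.map(sp_dict_toy)
unmatched_codes = sorted(codes[mapped.isna()].unique())
print(unmatched_codes)
```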