# PISCO - swath and upc data (inverts, kelp, substrate)

**In this version of the conversion script, cover and superlayer organisms from UPC surveys have NOT been included as occurrences; the information is only in the MoF. Additionally, absence records have been populated. Removing cover and superlayer organisms from the occurrence file made this process easier.**

The density of conspicuous, individually distinguishable macroalgae and macroinvertebrates (i.e. organisms larger than 2.5 cm and visually detectable by SCUBA divers) are visually recorded along replicate 2m wide by 30m long (60m2) transects. 
- For select species (e.g., sea urchins), high densities are spatially subsampled 
- 2 x 30m transects are distributed end-to-end and 5-10m apart at each of the following depths:
    - 5m
    - 12.5m
    - 20m 
- Additional 25m transects are conducted by VRG where habitat is available
- This usually results in 6 replicate transects per site. 
- Surveyors record:
    - Counts of individually distinguishable macroinvertebrates
    - Counts of Giant kelp (Macrocystis pyrifera) and bull kelp (Nereocystis luetkeana), > 1m in height
    - Stipe counts for qualifying giant kelp individuals 
    
In addition, the percent cover of non-individually distinguishable macroalgae and macroinvertebrates are visually recorded, usually on the **same replicate transects as the swath surveys described above.** 
- At each meter mark along the 30m transect, the diver records:
    - the underlying substrate (bedrock, boulder, cobble, or sand)
    - the vertical relief ( 0-10cm, 10cm-1m, 1-2m, and >2m) 
    - the cover (non-mobile primary space holding organism or bare substrate type)
    - the superlayer, if present (a small subset of specific organisms which may be ephemeral, and tend to create a layer over primary space holders)
    
**Resources:** <br>
https://opc.dataone.org/view/MLPA_kelpforest.metadata.1

In [1]:
## Imports

import pandas as pd
import numpy as np

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "/Users/dianalg/PycharmProjects/PythonScripts/MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [4]:
## Load swath data

filename = 'swath_through_2020.csv'
swath = pd.read_csv(filename, dtype={'transect':str, 'disease':str, 'notes':str, 'site_name_old':str})

print(swath.shape)
swath.head()

(266148, 17)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,classcode,count,size,disease,depth,observer,notes,site_name_old
0,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,ALAMAR,3.0,,,,,,
1,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,4,ALAMAR,1.0,,,,,,
2,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,5,ALAMAR,5.0,,,,,,
3,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,5,ANTSPP,2.0,,,,,,
4,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,7,ANTSPP,14.0,,,,,,


In [7]:
swath['year'].max()

2020

### Column definitions

**campus** = UCSC, USCB, HSU or VRG. The academic partner campus that conducted the survey. <br>
**method** = SBTL_SWATH_PISCO, SBTL_SWATH_HSU or SBTL_SWATH_VRG. The code describing the sampling technique and monitoring program that conducted each survey.  <br>
**survey_year** = 1999 - 2018. The designated year associated with the annual survey. In most cases, survey_year and year are the same. In rare cases, surveys are completed early in the year following the designated survey year. In these cases, survey_year will differ from year. <br>
**year** = 1999 - 2018. Year that the survey was conducted. <br>
**month** = 1 - 12. Month that the survey was conducted. <br>
**day** = 1 - 31. Day that the survey was conducted. <br>
**site** = One of 350 site codes. The unique site where the survey was performed (as defined in the site table). This site refers to a specific GPS location and is often associated with a geographic placename. Often, multiple site replicates will be associated with a single placename, and will be delineated with additional geographical or directional information (e.g. North/South/East/West/Central - N/S/E/W/CEN, Upcoast/Downcoast - UC/DC) <br>
**zone** = INNER, OUTER, OUTMID, INMID, MID or DEEP. A division of the site into 2 to 4 categories representing onshore-offshore stratification associated with targeted bottom depths for transects.
- INNER: Depth zone targeting roughly 5m of water depth, or the inner edge of the reef
- INMID: Depth zone targeting roughly 10m of water depth 
- MID: Depth zone targeting roughly 10-15m of water depth, used by VRG and in early years of PISCO
- OUTMID: Depth zone targeting roughly 15m of water depth 
- OUTER: Depth zone targeting roughly 20m of water depth 
- DEEP: Depth zone targeting roughly 25m of water depth, where present, used only by VRG

**transect** = It seems like this should only be 1 - 8, but there are **a number of other designations as well.** The unique transect replicate within each site and zone. <br>
**classcode** = One of 187 taxon codes. The unique taxonomic classification code that is being counted, as defined in the taxonomic table. This refers to a code that defines the Genus and Species that is identified, a code that represents a limited number of species that can't be narrowed down to one species, or in some cases family-level or higher order groupings. Generally, for invertebrates and algae, the classcode is comprised of the first letter of the genus, and the first three letters of the species, with some exceptions <br>
**count** = The number of individuals of a given classcode and a given size (if applicable) per transect <br>
**size** = For Macrocystis pyrifera, this represents the number of individual stipes growing for each individual. For a select number of invertebrate species that are measured on swath transects, this represents the size (in centimeters) of the following: test diameter for urchins, length of longest arm for seastars, shell length for abalone, carapace length for lobsters, total turgid length for sea cucumbers. <br>
**disease** = For some years echinoderm disease was recorded on transects for select species. When systematic observation for disease was conducted, disease is indicated here. Where blank, disease was not evaluated.
- HEALTHY: Individual was inspected and no was disease observed
- YES: Some form of disease was observed
- MILD: Mild disease was observed
- SEVERE: Severe disease was observed
- WASTING: Wasting disease was observed
- BLACK SPOT: Black spot disease was observed

**depth** = Between 1.2 and 28 meters. Depth of the transect estimated by the diver. **Does this mean a dive computer was used?** <br>
**observer** = The diver who conducted the survey transect <br>
**notes** = Free text notes taken at the time of the sample, or added at the time of data entry. <br>
**site_name_old** = In cases when specific sites have been surveyed by multiple campuses using different site names, this variable indicates the alternative (historical) site name.

Counts have already been adjusted if subsampling occurred.

In [5]:
## Load upc data

filename = 'upc_through_2020.csv'
upc = pd.read_csv(filename, dtype={'transect':str, 'notes':str, 'site_name_old':str})

print(upc.shape)
upc.head()

(191889, 17)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,category,classcode,count,pct_cov,depth,observer,notes,site_name_old
0,UCSB,SBTL_UPC_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,1,COVER,ANEM,3,3.9,,,,
1,UCSB,SBTL_UPC_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,2,COVER,ANEM,1,1.1,,,,
2,UCSB,SBTL_UPC_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,COVER,ANEM,1,1.2,,,,
3,UCSB,SBTL_UPC_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,5,COVER,ANEM,1,1.0,,,,
4,UCSB,SBTL_UPC_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,6,COVER,ANEM,1,1.1,,,,


In [6]:
upc['year'].max()

2020

### Additional column notes

**category** = The code indicating which of four types of data are collected on each point. Categories may be SUBSTRATE, RELIEF, COVER and SUPERLAYER. Percent cover should be calculated for each category separately. For SUPERLAYER, the total number of points will not necessarily equal the targeted number of points surveyed in each of the other categories (i.e. superlayer is specific to certain organisms that are only included if present).
- COVER: Primary space holding, non-mobile organism or bare substrate type present at each point, cover types are defined in taxonomic table
- RELIEF: Physical relief is measured as the greatest vertical relief that exists within a 1m wide section across the tape and a 0.5m section along that tape that is centered over the point. Relief categories can be 0-10cm, 10cm-1m, 1-2m, and >2m
- SUBSTRATE: Substrate type underlying each point. Substrate categories include bedrock , boulder, cobble, and sand and are further defined in the taxonomic table
- SUPERLAYER: Superlayer includes a small subset of specific organisms which may be ephemeral, and tend to create a layer over primary space holders. Examples include low-lying, very large-bladed macroalgae such as Laminaria farlowii, brittle stars, and drift algae and in the Northern region abalone are recorded as a superlayer when occupying the space at the point. Not recorded at all points, only where present

**pct_cov** = Percent cover, calculated by dividing the number of points of a given category and classcode by the total number of points of a category per transect. The percent cover of superlayer is calculated by dividing the number of points of a superlayer classcode by the total number of points in the cover category, since superlayers are not present at all points.

### Strategy

As with RCCA, each transect can be an **event**, and each organism observation can be an **occurrence**.

The **event** file should contain: eventID (from site, survey date, zone, transect), eventDate (from year, month, date), datasetID, locality (site), countryCode, decimalLat, decimalLon, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. 

The **occurrence** file should contain: eventID, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe disease), organismQuantity (count), organismQuantityType.

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Cover, substrate, superlayer, and relief (pct_cov values) can be recorded at the event level. Size can be recorded at the occurrence level.

## Create occurrence file

### Get site names

In [8]:
## Load site table

filename = 'site_table_through_2020.csv'
sites = pd.read_csv(filename)

print(sites.shape)
sites.head()

(570, 53)


Unnamed: 0,site,site_name_old,site_name_for_figures,campus,latitude,longitude,site_campus (unique_ID),Unnamed: 7,latitude_old,longitude_old,...,survey_2016,survey_2017,survey_2018,survey_2019,survey_2020,SurveyYears,time_series_category,Notes (ETS),notes_location_details,notes_data_density
0,120/OML,,120/OML,RCCA,33.7379,-118.392,120/OML RCCA,,,,...,,,,,,,,,,
1,3_PALMS_EAST,3 Palms East,3 Palms East,VRG,33.718105,-118.3326,3 Palms East VRG,,,,...,,,,,,,,Added 6/21 based on field waypoint files,,
2,Abalone Cove,,Abalone Cove,RCCA,33.7362,-118.376,Abalone Cove RCCA,,,,...,,,,,,,,,,
3,ABALONE_COVE_KELP_W,Abalone Cove Kelp West,Abalone Cove Kelp W,VRG,33.73922,-118.38789,Abalone Cove Kelp West VRG,,,,...,X,X,X,X,X,11.0,1.0,,,
4,ABALONE_POINT_1,,Abalone Point 1,HSU,39.6915,-123.8141,ABALONE_POINT_1 HSU,,,,...,,,,,,,2.0,,,


**Note** that this site table is not in the standard format given to me last year. Hopefully I can simplify it to site, latitude, longitude, and site status, and that will still work.

In [9]:
## Create simplified site table

site_summary = sites.loc[sites['campus'] != 'RCCA', ['site', 'site_status', 'latitude', 'longitude']].copy()
site_summary.drop_duplicates(inplace=True)

print(site_summary.shape)
site_summary.head()

(423, 4)


Unnamed: 0,site,site_status,latitude,longitude
1,3_PALMS_EAST,reference,33.718105,-118.3326
3,ABALONE_COVE_KELP_W,mpa,33.73922,-118.38789
4,ABALONE_POINT_1,reference,39.6915,-123.8141
5,ABALONE_POINT_2,reference,39.66502,-123.80435
6,ABALONE_POINT_3,reference,39.62877,-123.79658


Well, it sortof works.

I had to exclude the RCCA sites, which weren't included in my original site table. In addition, there are five sites where slightly different lat/lon have been provided by slightly different groups. **I'll have to arbitrarily pick one for the moment.**

There are also a number of sites that have no data in the swath or upc tables. This is not necessarily a problem for me right now, but worth making a note of. There are no sites that have upc data but not swath data. The same sites as previously noted have swath data but not upc data:
- SMI_PRINCE_ISLAND_CEN
- SMI_PRINCE_ISLAND_N
- SRI_CARRINGTON_E
- SRI_CARRINGTON_CEN
- SRI_CARRINGTON_W
- SRI_BEE_ROCK_W

Finally, there are still a bunch of sites with lat, lon missing.

```python
# Sites with conflicting lat/lon entries
site_summary[site_summary['site'].duplicated()]

# Sites with no data in swath table
for name in ['site'].unique():
    if name not in fish['site'].unique():
        print(name)
        
# A very quick glance suggests that the same sites don't have upc data
for name in site_summary['site'].unique():
    if name not in upc['site'].unique():
        print(name)
        
# Sites with swath data but no upc data
for name in swath['site'].unique():
    if name not in upc['site'].unique():
        print(name)
        
# Sites where lat, lon is missing
site_summary[site_summary['latitude'].isna() == True]
```

In [17]:
## Remove sites with not-quite-identical coordinates manually ----- THIS CAN BE CHANGED WITH AN UPDATED SITE TABLE

print(site_summary.shape)
site_summary = site_summary.drop(index=site_summary[site_summary['site'].duplicated()].index)
site_summary.shape

(423, 4)


(418, 4)

### Get species table

In [18]:
## Load species table

filename = 'species_table_through_2020.csv'
species = pd.read_csv(filename)

print(species.shape)
species.head()

(1937, 42)


Unnamed: 0,sample_type,sample_subtype,campus,pisco_classcode,orig_classcode,crane_code,genus,species,common_name,max_total_length,...,LOOKED2019,LOOKED2020,Taxonomic_source,AphiaID,ScientificName,Kingdom,Phylum,Class,Order,Family
0,FISH,FISH,HSU,AARG,AARG,,Amphistichus,argenteus,Barred Surfperch,43.0,...,X,X,WoRMS,279594,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae
1,FISH,FISH,UCSB,AARG,AARG,,Amphistichus,argenteus,Barred Surfperch,43.0,...,X,X,WoRMS,279594,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae
2,FISH,FISH,VRG,AARG,Amphistichus argenteus,,Amphistichus,argenteus,Barred Surfperch,43.0,...,X,X,WoRMS,279594,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae
3,FISH,FISH,HSU,ACOR,ACOR,ACOR,Artedius,corallinus,Coralline Sculpin,14.0,...,,,WoRMS,279699,Artedius corallinus,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae
4,FISH,FISH,UCSB,ACOR,ACOR,ACOR,Artedius,corallinus,Coralline Sculpin,14.0,...,,,WoRMS,279699,Artedius corallinus,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae


The subset of the species table that's currently relevant is entries with sample_type = 'SWATH'. **Note** that there are 187 unique classcodes under this sample type, only 175 of which are actually in the swath data set. I'm not sure that this is a problem; it's possible that some species have been looked for, but never seen, and therefore don't appear in the presence-only data. 

Classcodes that do not appear in the data are:
- LAMSAC
- ATRIDA
- CALSPP
- CERNUT
- CROCAL
- CUCSPP
- DELETE
- HERMSPP
- NORSPP
- SCYORE
- TEGSPP
- TEST

It seems reasonable to assume that the "DELETE" classcode indicates that the row should be deleted in the final version of the table.


```python
# Number of unique swath classcodes in species table
species.loc[species['sample_type'] == 'SWATH', 'pisco_classcode'].nunique()

# Number of unique classcodes in swath data
swath['classcode'].nunique()

# Classcodes that appear in the species table but not in the data
for sp in species.loc[species['sample_type'] == 'SWATH', 'pisco_classcode'].unique():
    if sp not in swath['classcode'].unique():
        print(sp)
```

In [26]:
## Select species for swath surveys



swath_sp = species.loc[species['sample_type'] == 'SWATH', 
                       ['campus', 'pisco_classcode', 'ScientificName', 'common_name', 'LOOKED1999',
                        'LOOKED2000', 'LOOKED2001', 'LOOKED2002', 'LOOKED2003', 'LOOKED2004',
                        'LOOKED2005', 'LOOKED2006', 'LOOKED2007', 'LOOKED2008', 'LOOKED2009',
                        'LOOKED2010', 'LOOKED2011', 'LOOKED2012', 'LOOKED2013', 'LOOKED2014',
                        'LOOKED2015', 'LOOKED2016', 'LOOKED2017', 'LOOKED2018', 'LOOKED2019',
                        'LOOKED2020']]

print(swath_sp.shape)
swath_sp.head()

(516, 26)


Unnamed: 0,campus,pisco_classcode,ScientificName,common_name,LOOKED1999,LOOKED2000,LOOKED2001,LOOKED2002,LOOKED2003,LOOKED2004,...,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018,LOOKED2019,LOOKED2020
919,HSU,ALAMAR,Alaria marginata,Alaria,,,,,,,...,,,,X,X,,X,X,X,X
920,UCSB,ALAMAR,Alaria marginata,Alaria,X,X,X,X,X,X,...,X,X,X,X,X,X,X,X,X,X
921,UCSC,ALAMAR,Alaria marginata,Alaria,X,X,X,X,X,X,...,X,X,X,X,X,X,X,X,X,X
922,HSU,COSCOS,Costaria costata,Costaria,,,,,,,...,,,,X,X,,X,X,X,X
923,UCSB,COSCOS,Costaria costata,Costaria,,,,,,X,...,X,X,X,X,X,X,X,X,X,X


In [27]:
## Clean LOOKED columns -------- THIS STEP MIGHT BE DELETED AFTER RECIEVING A FINAL TABLE

swath_sp.iloc[:, 4:] = swath_sp.iloc[:, 4:].replace({'X':'yes', np.nan:'no'})

In [28]:
## Melt species table

long = pd.melt(swath_sp, id_vars=swath_sp.columns[0:4], var_name='year', value_name='looked')
print(long.shape)
long.head()

(11352, 6)


Unnamed: 0,campus,pisco_classcode,ScientificName,common_name,year,looked
0,HSU,ALAMAR,Alaria marginata,Alaria,LOOKED1999,no
1,UCSB,ALAMAR,Alaria marginata,Alaria,LOOKED1999,yes
2,UCSC,ALAMAR,Alaria marginata,Alaria,LOOKED1999,yes
3,HSU,COSCOS,Costaria costata,Costaria,LOOKED1999,no
4,UCSB,COSCOS,Costaria costata,Costaria,LOOKED1999,no


In [29]:
## Replace 

long['year'] = long['year'].str.split('D').str[1].astype(int)
long['year'].unique()

array([1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
       2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])

### Fill in absence records

In [31]:
## Check if there are any records where count data is missing

swath[swath['count'].isna() == True].shape

(21, 17)

**Note** that there are 21 records where count data is missing.

In [32]:
## Drop these records 

print(swath.shape)
swath.dropna(subset=['count'], inplace=True)
swath.shape

(266148, 17)


(266127, 17)

When I was populating absence records for the fish transect data, I found that there were observations of a given classcode from a given campus, but that the classcode was not listed in the species table for that campus. I'd like to try to check for a similar problem here, to avoid having to track down redundant errors later.

In [36]:
## Determine if there are missing campus/year/classcode combos in the species table

# Get unique combinations of campus, survey_year, and classcode from the data
observed_species = swath[['campus', 'survey_year', 'classcode']].drop_duplicates()

# Merge with species table
test = observed_species.merge(long, how='outer', left_on=['campus', 'survey_year', 'classcode'], right_on=['campus', 'year', 'pisco_classcode'], indicator=True)

# Look for campus, survey_year, and classcode combinations that only appear in the observed data
test[test['_merge'] == 'left_only']

Unnamed: 0,campus,survey_year,classcode,pisco_classcode,ScientificName,common_name,year,looked,_merge


It looks like this is not an issue for the swath data, and I can proceed with absence population.



START HERE ---------

In [19]:
## Get a table telling whether each organism was looked for during each specific transect

survey_table = swath[['campus', 'method', 'day', 'month', 'survey_year', 'year', 'site', 'zone', 'transect']].merge(long[['campus', 'classcode', 'year', 'looked']], 
                                                                                                             how='left', 
                                                                                                             left_on=['campus', 'survey_year'],
                                                                                                             right_on=['campus', 'year'])
survey_table.drop_duplicates(inplace=True)
survey_table.rename(columns={'year_x':'year'}, inplace=True) # year_x retains actual year when survey took place
survey_table.drop(columns=['year_y'], inplace=True) # year_y == survey_year because of the merge
survey_table

Unnamed: 0,campus,method,day,month,survey_year,year,site,zone,transect,classcode,looked
0,UCSC,SBTL_SWATH_PISCO,7,9,1999,1999,HOPKINS_DC,INNER,1,ALAMAR,yes
1,UCSC,SBTL_SWATH_PISCO,7,9,1999,1999,HOPKINS_DC,INNER,1,COSCOS,no
2,UCSC,SBTL_SWATH_PISCO,7,9,1999,1999,HOPKINS_DC,INNER,1,DICSPP,yes
3,UCSC,SBTL_SWATH_PISCO,7,9,1999,1999,HOPKINS_DC,INNER,1,EISARBAD,yes
4,UCSC,SBTL_SWATH_PISCO,7,9,1999,1999,HOPKINS_DC,INNER,1,LAMFAR,no
...,...,...,...,...,...,...,...,...,...,...,...
24065917,VRG,SBTL_SWATH_VRG,20,11,2018,2018,Bunker Point,INNER,2,TYLFUN,yes
24065918,VRG,SBTL_SWATH_VRG,20,11,2018,2018,Bunker Point,INNER,2,URTCOL,yes
24065919,VRG,SBTL_SWATH_VRG,20,11,2018,2018,Bunker Point,INNER,2,URTMCP,yes
24065920,VRG,SBTL_SWATH_VRG,20,11,2018,2018,Bunker Point,INNER,2,URTSPP,yes


In [20]:
## Merge with swath data to get final outcome

full_swath = swath.merge(survey_table, 
                             how='right', 
                             on=['campus', 'method', 'day', 'month', 'year', 'survey_year', 'site', 'zone', 'transect', 'classcode'])
full_swath

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,classcode,count,size,disease,depth,observer,notes,site_name_old,looked
0,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,ALAMAR,,,,,,,,yes
1,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COSCOS,,,,,,,,no
2,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,DICSPP,2.0,,,6.1,STACEY BUCKELEW,,,yes
3,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,EISARBAD,,,,,,,,yes
4,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,LAMFAR,,,,,,,,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1309140,VRG,SBTL_SWATH_VRG,2018,2018,11,20,Bunker Point,INNER,2,TYLFUN,,,,,,,,yes
1309141,VRG,SBTL_SWATH_VRG,2018,2018,11,20,Bunker Point,INNER,2,URTCOL,,,,,,,,yes
1309142,VRG,SBTL_SWATH_VRG,2018,2018,11,20,Bunker Point,INNER,2,URTMCP,,,,,,,,yes
1309143,VRG,SBTL_SWATH_VRG,2018,2018,11,20,Bunker Point,INNER,2,URTSPP,,,,,,,,yes


In [21]:
## Clean

full_swath = full_swath[full_swath['classcode'] != 'NO_ORG'].copy()
full_swath.loc[(full_swath['looked'] == 'yes') & (full_swath['count'].isna() == True), 'count'] = 0
full_swath.dropna(subset=['count'], inplace=True)
full_swath

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,classcode,count,size,disease,depth,observer,notes,site_name_old,looked
0,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,ALAMAR,0.0,,,,,,,yes
2,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,DICSPP,2.0,,,6.1,STACEY BUCKELEW,,,yes
3,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,EISARBAD,0.0,,,,,,,yes
5,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,LAMSPP,0.0,,,,,,,yes
7,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,MACPYRAD,2.0,2.0,,6.1,STACEY BUCKELEW,,,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1309139,VRG,SBTL_SWATH_VRG,2018,2018,11,20,Bunker Point,INNER,2,TRIHEL,0.0,,,,,,,yes
1309140,VRG,SBTL_SWATH_VRG,2018,2018,11,20,Bunker Point,INNER,2,TYLFUN,0.0,,,,,,,yes
1309141,VRG,SBTL_SWATH_VRG,2018,2018,11,20,Bunker Point,INNER,2,URTCOL,0.0,,,,,,,yes
1309142,VRG,SBTL_SWATH_VRG,2018,2018,11,20,Bunker Point,INNER,2,URTMCP,0.0,,,,,,,yes


**Note** that there are 249 records where count > 0 but looked = no. These probably need to be changed to looked = yes. To check:

```python
# Get records
weird = full_swath[(full_swath['count'] > 0) & (full_swath['looked'] == 'no')]

# Get table of campuses and years where there were observations for classcodes that were not looked for according to the species table
obs_exist = weird[['campus', 'survey_year', 'classcode']].copy()
obs_exist.drop_duplicates(inplace=True)
obs_exist.head()
```

### Convert

In [22]:
## Merge to add site_name (also lat, lon and site_status) to swath table

full_swath = full_swath.merge(site_summary, how='left', on='site')
print(full_swath.shape)
full_swath.head()

(986135, 22)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,classcode,...,disease,depth,observer,notes,site_name_old,looked,site_status,latitude,longitude,site_name
0,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,ALAMAR,...,,,,,,yes,mpa,36.623586,-121.904196,HOPKINS_DC
1,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,DICSPP,...,,6.1,STACEY BUCKELEW,,,yes,mpa,36.623586,-121.904196,HOPKINS_DC
2,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,EISARBAD,...,,,,,,yes,mpa,36.623586,-121.904196,HOPKINS_DC
3,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,LAMSPP,...,,,,,,yes,mpa,36.623586,-121.904196,HOPKINS_DC
4,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,MACPYRAD,...,,6.1,STACEY BUCKELEW,,,yes,mpa,36.623586,-121.904196,HOPKINS_DC


In [23]:
## Pad month and day as needed

paddedDay = ['0' + str(full_swath['day'].iloc[i]) if len(str(full_swath['day'].iloc[i])) == 1 else str(full_swath['day'].iloc[i]) for i in range(full_swath.shape[0])]
paddedMonth = ['0' + str(full_swath['month'].iloc[i]) if len(str(full_swath['month'].iloc[i])) == 1 else str(full_swath['month'].iloc[i]) for i in range(full_swath.shape[0])]

In [78]:
## Create eventID

eventID = [full_swath['site_name'].iloc[i] + '_' + str(full_swath['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + full_swath['zone'].iloc[i] + '_' +
           full_swath['transect'].iloc[i].replace(' ', '') for i in range(full_swath.shape[0])]
swath_occ = pd.DataFrame({'eventID':eventID})

swath_occ.head()

Unnamed: 0,eventID
0,HOPKINS_DC_19990907_INNER_1
1,HOPKINS_DC_19990907_INNER_1
2,HOPKINS_DC_19990907_INNER_1
3,HOPKINS_DC_19990907_INNER_1
4,HOPKINS_DC_19990907_INNER_1


In [79]:
## Add occurrenceID

# Create survey_date column in swath
full_swath['survey_date'] = [str(full_swath['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(full_swath.shape[0])]

# Groupby to create occurrenceID
swath_occ['occurrenceID'] = full_swath.groupby(['site', 'survey_date', 'zone', 'transect'])['classcode'].cumcount()+1
swath_occ['occurrenceID'] = swath_occ['eventID'] + '_occ' + swath_occ['occurrenceID'].astype(str)

swath_occ.head()

Unnamed: 0,eventID,occurrenceID
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5


In [80]:
## Get relevant records from species table

swath_sp = swath_sp[['classcode', 'species_definition', 'common_name']].copy()
swath_sp.drop_duplicates(inplace=True)
swath_sp

Unnamed: 0,classcode,species_definition,common_name
590,ALAMAR,Alaria marginata,Alaria
593,COSCOS,Costaria costata,Costaria
596,DESLIG,Desmarestia ligulata,Acid Weed
597,DICSPP,Dictyoneurum californicum/reticulatum,Dictyoneurum
599,EGRMEN,Egregia menziesii,Egregia
...,...,...,...
999,URTCRA,Urticina crassicornis,Christmas Anemone
1002,URTMCP,Urticina mcpeaki,McPeak's Anemone
1003,URTPIS,Urticina piscivora,Fish-Eating Anemone
1006,URTSPP,Urticina,Urticina Spp.


**Note** that there are 10 classcodes in swath_sp that do not appear in the swath or biological upc data. They are:
- LAMSAC
- SARPAL
- STEDIOAD
- APAPRI
- ENTDOF
- FELCAL
- PELMUL
- POLATR
- PSEMON
- SCYORE

```python
for code in swath_sp['classcode'].unique():
    if code not in swath['classcode'].unique():
        print(code)
```

**There is also one classcode from the swath data that is NOT in the species table: LEPHEXAD. Dan says it should correspond to an adult Leptasterias hexactis. For some surveys in some years, only adults (i.e. individuals over a certain size) were counted.**

In [81]:
## Map classcodes to species definitions (usually scientific names) and classcodes to common names

code_to_sci_dict = dict(zip(swath_sp['classcode'], swath_sp['species_definition']))
code_to_com_dict = dict(zip(swath_sp['classcode'], swath_sp['common_name']))

# Add LEPHEXAD
code_to_sci_dict['LEPHEXAD'] = 'Leptasterias hexactis'
code_to_com_dict['LEPHEXAD'] = 'Six Arm Star - adult'

In [82]:
## Create scientificName

swath_occ['scientificName'] = full_swath['classcode']
swath_occ['scientificName'].replace(code_to_sci_dict, inplace=True)

swath_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1,Alaria marginata
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2,Dictyoneurum californicum/reticulatum
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3,Eisenia arborea
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4,Laminaria
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5,Macrocystis pyrifera


In [83]:
## Get unique scientific names for lookup in WoRMS

names = swath_occ['scientificName'].unique()

**Note** that there are a number of names that are not specific at the species level:
- Dictyoneurum californicum/reticulatum (matched to Dictyoneurum)
- Loxorhynchus crispatus/Scyra acutifrons (Loxorhynchus crispatus or Scyra acutifrons, shared subfamily Pisinae)
- Scyra acutifrons/Oregonia gracilis (shared superfamily Majoidea)
- Holothuria (Vaneyothuria) zacae - this appears exactly as-is on WoRMS, so maybe it's fine? Yes
- Urticina columbiana/mcpeaki (matched to Urticina)
- Lopholithodes mandtii/foraminatus (matched to Lopholithodes)
- Ceratostoma/Pteropurpura (Ceratostoma spp. or Pteropurpura spp., shared subfamily Ocenebrinae)

**Add these to identificationQualifier column.**

There are also some descriptions that lack a scientific name:
- LEPHEXAD (as noted above, this classcode is not in the species table at all)
- Gorgonian Adult (order Alcyonacea)
- Whole urchin test (this is the code for an urchin test, i.e., a dead animal. **I will exclude these records for now.**)
    - There are 66 presence records in the swath data with this classcode
- No Organisms Present In This Sample (this is the code for a completely empty transect, **I think). I will exclude these records for now.**
    - Here, as with the fish transect data, **the NO_ORG observation occurs in the same event as other observations.** You would think that a NO_ORG entry would be the only entry for a given event - see example below. Also note that **this observation has count = 0.** 
    - **Dan says this would occur if no algae were observed on the transect, but invertebrates were. After checking on this, it doesn't seem to be right. But since there's only 1 NO_ORG observation in the data set, it seems alright to exclude it. Maybe it's a data entry error?**
- Unidentified Mobile Invert Species (scientific_name **Animalia**.)


```python
# NO_ORG observation in the same event as other observations
swath_occ[swath_occ['eventID'] == 'SRI_JOHNSONS_LEE_NORTH_W_20050907_INNER_1']
```

**UPDATE on NO_ORG classcode (2/1/2021):** These have been removed as part of the absence record population process.

In [84]:
## Make changes based on the above observations

swath_occ.loc[swath_occ['scientificName'] == 'Loxorhynchus crispatus/Scyra acutifrons', 'scientificName'] = 'Pisinae'
swath_occ.loc[swath_occ['scientificName'] == 'Scyra acutifrons/Oregonia gracilis', 'scientificName'] = 'Majoidea'
swath_occ.loc[swath_occ['scientificName'] == 'Ceratostoma/Pteropurpura', 'scientificName'] = 'Ocenebrinae'
swath_occ.loc[swath_occ['scientificName'] == 'Gorgonian Adult', 'scientificName'] = 'Alcyonacea'
swath_occ.loc[swath_occ['scientificName'] == 'Unidentified Mobile Invert Species', 'scientificName'] = 'Animalia'

# REMOVING TEST RECORDS FOR NOW
idx = swath_occ[swath_occ['scientificName'].isin(['Whole urchin test'])].index
swath_occ.drop(idx, inplace=True)

# Redefine names
names = swath_occ['scientificName'].unique()

In [61]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Dictyoneurum californicum/reticulatum checking:  Dictyoneurum
Url didn't work for Pugettia spp checking:  Pugettia
Url didn't work for Tegula spp checking:  Tegula
Url didn't work for Urticina columbiana/mcpeaki checking:  Urticina
Url didn't work for Lopholithodes mandtii/foraminatus checking:  Lopholithodes


In [85]:
## Add scientific name-related columns

swath_occ['scientificNameID'] = swath_occ['scientificName']
swath_occ['scientificNameID'].replace(name_id_dict, inplace=True)

swath_occ['taxonID'] = swath_occ['scientificName']
swath_occ['taxonID'].replace(name_taxid_dict, inplace=True)
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2,Dictyoneurum californicum/reticulatum,urn:lsid:marinespecies.org:taxname:369575,369575
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231


In [86]:
## Create identificationQualifier

qualifier_dict = {'Dictyoneurum californicum/reticulatum':'Dictyoneurum californicum or Dictyoneurum reticulatum',
                  'Loxorhynchus crispatus/Scyra acutifrons':'Loxorhynchus crispatus or Scyra acutifrons',
                  'Scyra acutifrons/Oregonia gracilis':'Scyra acutifrons or Oregonia gracilis',
                  'Urticina columbiana/mcpeaki':'Urticina columbiana or Urticina mcpeaki',
                  'Lopholithodes mandtii/foraminatus':'Lopholithodes mandtii or Lopholithodes foraminatus',
                  'Ceratostoma/Pteropurpura':'Ceratostoma spp. or Pteropurpura spp.',
                  'Animalia':'Unidentified mobile invertebrate species'}

identificationQualifier = [qualifier_dict[name] if name in qualifier_dict.keys() else np.nan for name in swath_occ['scientificName']]

In [87]:
## Replace scientificName using name_name_dict

swath_occ['scientificName'].replace(name_name_dict, inplace=True)
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231


In [88]:
## Add vernacular name

swath_occ.insert(2, 'vernacularName', full_swath.loc[full_swath['classcode'] != 'TEST', 'classcode'].copy())
swath_occ.replace(code_to_com_dict, inplace=True)
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1,Alaria,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2,Dictyoneurum,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3,Southern Sea Palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4,Laminaria,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5,Macrocystis,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231


In [89]:
## Add final name-related columns

swath_occ['nameAccordingTo'] = 'WoRMS'
swath_occ['occurrenceStatus'] = 'present'
swath_occ['basisOfRecord'] = 'HumanObservation'
swath_occ['identificationQualifier'] = identificationQualifier

swath_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1,Alaria,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791,WoRMS,present,HumanObservation,
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2,Dictyoneurum,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575,WoRMS,present,HumanObservation,Dictyoneurum californicum or Dictyoneurum reti...
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3,Southern Sea Palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,present,HumanObservation,
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4,Laminaria,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,present,HumanObservation,
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5,Macrocystis,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,


In [92]:
## Create density

swath_occ['organismQuantity'] = full_swath.loc[full_swath['classcode'] != 'TEST', 'count'].copy()
swath_occ['organismQuantityType'] = 'number of individuals per 60 m2'
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,organismQuantity,organismQuantityType
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1,Alaria,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791,WoRMS,present,HumanObservation,,0.0,number of individuals per 60 m2
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2,Dictyoneurum,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575,WoRMS,present,HumanObservation,Dictyoneurum californicum or Dictyoneurum reti...,2.0,number of individuals per 60 m2
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3,Southern Sea Palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,present,HumanObservation,,0.0,number of individuals per 60 m2
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4,Laminaria,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,present,HumanObservation,,0.0,number of individuals per 60 m2
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5,Macrocystis,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,2.0,number of individuals per 60 m2


In [109]:
## Update absence status

swath_occ.loc[swath_occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,organismQuantity,organismQuantityType,occurrenceRemarks
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1,Alaria,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2,
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2,Dictyoneurum,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575,WoRMS,present,HumanObservation,Dictyoneurum californicum or Dictyoneurum reti...,2.0,number of individuals per 60 m2,
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3,Southern Sea Palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2,
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4,Laminaria,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2,
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5,Macrocystis,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,2.0,number of individuals per 60 m2,


In [110]:
## Add notes under occurrenceRemarks

swath_occ['occurrenceRemarks'] = full_swath.loc[full_swath['classcode'] != 'TEST', 'notes'].copy()
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,organismQuantity,organismQuantityType,occurrenceRemarks
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1,Alaria,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2,
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2,Dictyoneurum,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575,WoRMS,present,HumanObservation,Dictyoneurum californicum or Dictyoneurum reti...,2.0,number of individuals per 60 m2,
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3,Southern Sea Palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2,
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4,Laminaria,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2,
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5,Macrocystis,Macrocystis pyrifera,urn:lsid:marinespecies.org:taxname:232231,232231,WoRMS,present,HumanObservation,,2.0,number of individuals per 60 m2,


In [112]:
## Change NaN values in string fields to ''

swath_occ[['identificationQualifier', 'occurrenceRemarks']] = swath_occ[['identificationQualifier', 'occurrenceRemarks']].replace(np.nan, '')
swath_occ.isna().sum()

eventID                    0
occurrenceID               0
vernacularName             0
scientificName             0
scientificNameID           0
taxonID                    0
nameAccordingTo            0
occurrenceStatus           0
basisOfRecord              0
identificationQualifier    0
organismQuantity           0
organismQuantityType       0
occurrenceRemarks          0
dtype: int64

In [107]:
## Save size and disease data for MoF file

# Get size data
size = full_swath.loc[full_swath['classcode'] != 'TEST', 'size']
size_df = pd.DataFrame({'eventID':swath_occ['eventID'],
                        'occurrenceID':swath_occ['occurrenceID'],
                        'scientificName':swath_occ['scientificName'],
                        'size':size})
size_df.dropna(subset=['size'], inplace=True)
print(size_df.shape)

# Create a size_measurement_type column
size_df['measurementType'] = 'number of stipes'
size_df.loc[size_df['scientificName'].isin(['Mesocentrotus franciscanus', 
                                            'Strongylocentrotus purpuratus',
                                            'Lytechinus pictus']), 'measurementType'] = 'test diameter in centimeters' # urchins
size_df.loc[size_df['scientificName'].isin(['Haliotis rufescens', 
                                            'Haliotis walallensis',
                                            'Haliotis',
                                            'Haliotis kamtschatkana',
                                            'Haliotis corrugata',
                                            'Haliotis cracherodii',
                                            'Haliotis fulgens']), 'measurementType'] = 'shell length in centimeters' # abalone
size_df.loc[size_df['scientificName'].isin(['Patiria miniata', 
                                            'Pisaster giganteus',
                                            'Pisaster ochraceus',
                                            'Pycnopodia helianthoides',
                                            'Pisaster',
                                            'Pisaster brevispinus',
                                            'Asteroidea',
                                            'Solaster stimpsoni',
                                            'Leptasterias hexactis',
                                            'Solaster dawsoni',
                                            'Henricia leviuscula',
                                            'Dermasterias imbricata',
                                            'Mediaster aequalis',
                                            'Orthasterias koehleri',
                                            'Stylasterias forreri',
                                            'Linckia columbiae',
                                            'Astrometis sertulifera']), 'measurementType'] = 'longest arm length in centimeters' # sea stars
size_df.loc[size_df['scientificName'] == 'Panulirus interruptus', 'measurementType'] = 'carapace length in centimeters' # lobsters

# Get disease data
disease = full_swath.loc[full_swath['classcode'] != 'TEST', 'disease']
disease_df = pd.DataFrame({'eventID':swath_occ['eventID'],
                           'occurrenceID':swath_occ['occurrenceID'],
                           'scientificName':swath_occ['scientificName'],
                           'disease':disease})
disease_df.dropna(subset=['disease'], inplace=True)
print(disease_df.shape)

# Change the disease category 'YES' to something more descriptive
disease_df[disease_df['disease'] == 'YES'] = 'DISEASED'

(89223, 4)
(15451, 4)


**Note** that I did not find any size measurements for sea cucumbers, even though the methods say they may have been sized.

### Save

In [113]:
## Save

swath_occ.to_csv('PISCO_swath_occurrence_20210203.csv', index=False, na_rep='NaN')

## Create event file

**Note** that there are no events where only nonbiological UPC data was taken.

In [115]:
swath_occ.shape

(986069, 13)

In [117]:
full_swath[full_swath['classcode'] != 'TEST'].shape

(986069, 23)

In [118]:
## Get unique eventIDs from occurrence file and their associated data

event = pd.DataFrame({'eventID':swath_occ['eventID'],
                    'eventDate':full_swath.loc[full_swath['classcode'] != 'TEST', 'survey_date'],
                    'institutionID':full_swath.loc[full_swath['classcode'] != 'TEST', 'campus'],
                    'locality':full_swath.loc[full_swath['classcode'] != 'TEST', 'site'],
                    'locationRemarks':full_swath.loc[full_swath['classcode'] != 'TEST', 'site_status'],
                    'decimalLatitude':full_swath.loc[full_swath['classcode'] != 'TEST', 'latitude'],
                    'decimalLongitude':full_swath.loc[full_swath['classcode'] != 'TEST', 'longitude']})
event.drop_duplicates(inplace=True)
event.reset_index(drop=True, inplace=True)

print(event.shape)
event.head()

(11998, 7)


Unnamed: 0,eventID,eventDate,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,HOPKINS_DC_19990907_INNER_1,19990907,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
1,HOPKINS_DC_19990907_INNER_2,19990907,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
2,HOPKINS_DC_19990907_MID_1,19990907,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
3,HOPKINS_DC_19990907_MID_2,19990907,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
4,HOPKINS_DC_19990907_OUTER_1,19990907,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196


In [119]:
## Format eventDate

formatted = [datetime.strptime(dt, '%Y%m%d').date().isoformat() for dt in event['eventDate']]
event['eventDate'] = formatted
event.head()

Unnamed: 0,eventID,eventDate,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,HOPKINS_DC_19990907_INNER_1,1999-09-07,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
1,HOPKINS_DC_19990907_INNER_2,1999-09-07,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
2,HOPKINS_DC_19990907_MID_1,1999-09-07,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
3,HOPKINS_DC_19990907_MID_2,1999-09-07,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
4,HOPKINS_DC_19990907_OUTER_1,1999-09-07,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196


In [120]:
## Dataset ID

event.insert(2, 'datasetID', 'PISCO swath and upc transects')
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,HOPKINS_DC_19990907_INNER_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
1,HOPKINS_DC_19990907_INNER_2,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
2,HOPKINS_DC_19990907_MID_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
3,HOPKINS_DC_19990907_MID_2,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196
4,HOPKINS_DC_19990907_OUTER_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,mpa,36.623586,-121.904196


In [121]:
## Update vocabulary in locationRemarks to be consistent with CCFRP and other PISCO data

event['locationRemarks'].replace({'mpa':'marine protected area'}, inplace=True)
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,HOPKINS_DC_19990907_INNER_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
1,HOPKINS_DC_19990907_INNER_2,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
2,HOPKINS_DC_19990907_MID_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
3,HOPKINS_DC_19990907_MID_2,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196
4,HOPKINS_DC_19990907_OUTER_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,36.623586,-121.904196


In [122]:
## Add countryCode

event.insert(6, 'countryCode', 'US')
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude
0,HOPKINS_DC_19990907_INNER_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
1,HOPKINS_DC_19990907_INNER_2,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
2,HOPKINS_DC_19990907_MID_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
3,HOPKINS_DC_19990907_MID_2,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196
4,HOPKINS_DC_19990907_OUTER_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196


In [123]:
## Add coordinateUncertainty in Meters

event['coordinateUncertaintyInMeters'] = 250

**Is this a reasonable coordinateUncertaintyInMeters?** Yes

In [125]:
## minimumDepthInMeters, maximumDepthInMeters

# Add eventID to swath
full_swath['eventID'] = eventID
swath_subset = full_swath.loc[full_swath['classcode'] != 'TEST', ['eventID', 'depth']]

# Groupby eventID to obtain depth column
depth = swath_subset.groupby(['eventID']).agg({
    'depth':[min, max]
})
depth.reset_index(inplace=True)
depth.columns = depth.columns.droplevel()

# Add to event
event['minimumDepthInMeters'] = depth['min']
event['maximumDepthInMeters'] = depth['max']
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters
0,HOPKINS_DC_19990907_INNER_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.0,9.0
1,HOPKINS_DC_19990907_INNER_2,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.0,9.0
2,HOPKINS_DC_19990907_MID_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,5.0,5.0
3,HOPKINS_DC_19990907_MID_2,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,5.0,5.0
4,HOPKINS_DC_19990907_OUTER_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,5.49,5.49


**Note** that there are 875 transects that had missing depths. In addition, 7277 transects had multiple depths listed. These have been summarized using min and max.

``` python
test = swath_subset.drop_duplicates()
test.dropna(inplace=True)
test[test['eventID'].duplicated()]
```

In [126]:
## Add samplingProtocol and samplingEffort

# samplingProtocol
method = pd.DataFrame({'eventID':swath_occ['eventID'], 
                       'method':full_swath.loc[full_swath['classcode'] != 'TEST', 'method']})
method.drop_duplicates(inplace=True)
method = method.groupby('eventID')['method'].unique().str.join(', ')
print(method.shape)
method.reset_index(drop=True, inplace=True)
event['samplingProtocol'] = method

# samplingEffort
event['samplingEffort'] = 'all organisms present were counted'
event.head()

(11998,)


Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
0,HOPKINS_DC_19990907_INNER_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.0,9.0,SBTL_SWATH_VRG,all organisms present were counted
1,HOPKINS_DC_19990907_INNER_2,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,9.0,9.0,SBTL_SWATH_VRG,all organisms present were counted
2,HOPKINS_DC_19990907_MID_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,5.0,5.0,SBTL_SWATH_VRG,all organisms present were counted
3,HOPKINS_DC_19990907_MID_2,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,5.0,5.0,SBTL_SWATH_VRG,all organisms present were counted
4,HOPKINS_DC_19990907_OUTER_1,1999-09-07,PISCO swath and upc transects,UCSC,HOPKINS_DC,marine protected area,US,36.623586,-121.904196,250,5.49,5.49,SBTL_SWATH_HSU,all organisms present were counted


**Sampling effort?** Divers try to count everything they can find. Benthic swath surveys are a snapshot in time - no matter how much time you spend looking, because the organisms are sessile, you don't run the risk of double counting.

In [127]:
## Check for NaN values in string columns

event.isna().sum()

eventID                             0
eventDate                           0
datasetID                           0
institutionID                       0
locality                            0
locationRemarks                     0
countryCode                         0
decimalLatitude                     0
decimalLongitude                    0
coordinateUncertaintyInMeters       0
minimumDepthInMeters             1188
maximumDepthInMeters             1188
samplingProtocol                    0
samplingEffort                      0
dtype: int64

In [128]:
## Save

event.to_csv('PISCO_swath_event_20210119.csv', index=False, na_rep='NaN')

## Create MoF file

In [145]:
## Assemble UPC data

# Create eventID
paddedDay = ['0' + str(upc['day'].iloc[i]) if len(str(upc['day'].iloc[i])) == 1 else str(upc['day'].iloc[i]) for i in range(upc.shape[0])]
paddedMonth = ['0' + str(upc['month'].iloc[i]) if len(str(upc['month'].iloc[i])) == 1 else str(upc['month'].iloc[i]) for i in range(upc.shape[0])]
upc = upc.merge(site_summary, how='left', on='site')
eventID = [upc['site_name'].iloc[i] + '_' + str(upc['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + upc['zone'].iloc[i] + '_' +
           upc['transect'].iloc[i].replace(' ', '') for i in range(upc.shape[0])]
upc_mof = pd.DataFrame({'eventID':eventID,
                       'category':upc['category'],
                       'classcode':upc['classcode'],
                       'count':upc['count'],
                       'pct_cov':upc['pct_cov']})

# Create a column with more interpretable definitions of species codes
upc_mof['common_name'] = upc_mof['classcode']
upc_sp = species[species['sample_type'] == 'UPC']
upc_code_to_com_dict = dict(zip(upc_sp['classcode'], upc_sp['species_definition']))
upc_mof['common_name'].replace(upc_code_to_com_dict, inplace=True)

# Create Percent and UPC columns
upc_mof['pct_cov'] = upc_mof['pct_cov'].astype(str)
upc_mof['UPC'] = upc_mof['pct_cov'] + '% ' + upc_mof['common_name'] + ' | '

# Aggregate
upc_agg = upc_mof.groupby(['eventID', 'category']).agg({'UPC':sum})
upc_agg.reset_index(inplace=True)
upc_agg['UPC'] = upc_agg['UPC'].str[:-3]

# Make category column lower case
upc_agg['category'] = upc_agg['category'].str.lower()

upc_agg

Unnamed: 0,eventID,category,UPC
0,3_Palms_East_20080708_MID_1,cover,3.2% Bare Rock | 19.4% Bare Sand | 12.9% Phaeo...
1,3_Palms_East_20080708_MID_1,relief,29.0% Vertical Relief: Flat | 71.0% Vertical R...
2,3_Palms_East_20080708_MID_1,substrate,12.9% Substrate: Bedrock | 58.1% Substrate: Bo...
3,3_Palms_East_20080708_MID_2,cover,29.0% Bare Rock | 3.2% Bare Sand | 9.7% Phaeop...
4,3_Palms_East_20080708_MID_2,relief,45.2% Vertical Relief: Flat | 9.7% Vertical Re...
...,...,...,...
37418,Whites_Point_20120724_MID_1,relief,9.7% Vertical Relief: Flat | 6.5% Vertical Rel...
37419,Whites_Point_20120724_MID_1,substrate,100.0% Substrate: Bedrock
37420,Whites_Point_20120724_MID_2,cover,3.2% Bare Sand | 12.9% Phaeophyceae | 3.2% Bry...
37421,Whites_Point_20120724_MID_2,relief,6.5% Vertical Relief: Flat | 93.5% Vertical Re...


In [146]:
## Use mof dataframe with upc data

mof = pd.DataFrame({'eventID':upc_agg['eventID']})
mof['occurrenceID'] = np.nan
mof['measurementType'] = upc_agg['category']
mof['measurementValue'] = upc_agg['UPC']
mof['measurementUnit'] = 'percent cover'
mof['measurementMethod'] = 'uniform point contact'
mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,3_Palms_East_20080708_MID_1,,cover,3.2% Bare Rock | 19.4% Bare Sand | 12.9% Phaeo...,percent cover,uniform point contact
1,3_Palms_East_20080708_MID_1,,relief,29.0% Vertical Relief: Flat | 71.0% Vertical R...,percent cover,uniform point contact
2,3_Palms_East_20080708_MID_1,,substrate,12.9% Substrate: Bedrock | 58.1% Substrate: Bo...,percent cover,uniform point contact
3,3_Palms_East_20080708_MID_2,,cover,29.0% Bare Rock | 3.2% Bare Sand | 9.7% Phaeop...,percent cover,uniform point contact
4,3_Palms_East_20080708_MID_2,,relief,45.2% Vertical Relief: Flat | 9.7% Vertical Re...,percent cover,uniform point contact


In [147]:
## Add occurrence-level measurements and facts

# Size
size_mof = pd.DataFrame({'eventID':size_df['eventID'],
                        'occurrenceID':size_df['occurrenceID'],
                        'measurementType':size_df['measurementType'],
                        'measurementValue':size_df['size'],
                        'measurementUnit':'centimeters',
                        'measurementMethod':'measured by diver'})
size_mof.loc[size_mof['measurementType'] == 'number of stipes', 'measurementUnit'] = 'number of stipes'

# Disease
dis_mof = pd.DataFrame({'eventID':disease_df['eventID'],
                       'occurrenceID':disease_df['occurrenceID'],
                       'measurementType':'observation',
                       'measurementValue':disease_df['disease'].str.lower(),
                       'measurementUnit':np.nan,
                       'measurementMethod':'visually determined by diver'})
dis_mof

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
326510,MACABEE_DC_20140703_INNER_1,MACABEE_DC_20140703_INNER_1_occ70,observation,healthy,,visually determined by diver
326511,MACABEE_DC_20140703_INNER_1,MACABEE_DC_20140703_INNER_1_occ71,observation,severe,,visually determined by diver
326513,MACABEE_DC_20140703_INNER_1,MACABEE_DC_20140703_INNER_1_occ73,observation,healthy,,visually determined by diver
326526,MACABEE_DC_20140703_INNER_1,MACABEE_DC_20140703_INNER_1_occ86,observation,healthy,,visually determined by diver
326603,MACABEE_DC_20140703_INNER_2,MACABEE_DC_20140703_INNER_2_occ66,observation,healthy,,visually determined by diver
...,...,...,...,...,...,...
814502,SRI_SOUTH_POINT_W_20180822_MID_1,SRI_SOUTH_POINT_W_20180822_MID_1_occ65,observation,healthy,,visually determined by diver
814841,SCI_GULL_ISLE_W_20180823_OUTER_1,SCI_GULL_ISLE_W_20180823_OUTER_1_occ80,observation,healthy,,visually determined by diver
816503,SCI_YELLOWBANKS_CEN_20180829_OUTER_2,SCI_YELLOWBANKS_CEN_20180829_OUTER_2_occ70,observation,healthy,,visually determined by diver
819504,SCI_PAINTED_CAVE_E_20181101_OUTER_1,SCI_PAINTED_CAVE_E_20181101_OUTER_1_occ65,observation,healthy,,visually determined by diver


In [148]:
## Append

mof_agg = pd.concat([mof, size_mof, dis_mof])
mof_agg.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,3_Palms_East_20080708_MID_1,,cover,3.2% Bare Rock | 19.4% Bare Sand | 12.9% Phaeo...,percent cover,uniform point contact
1,3_Palms_East_20080708_MID_1,,relief,29.0% Vertical Relief: Flat | 71.0% Vertical R...,percent cover,uniform point contact
2,3_Palms_East_20080708_MID_1,,substrate,12.9% Substrate: Bedrock | 58.1% Substrate: Bo...,percent cover,uniform point contact
3,3_Palms_East_20080708_MID_2,,cover,29.0% Bare Rock | 3.2% Bare Sand | 9.7% Phaeop...,percent cover,uniform point contact
4,3_Palms_East_20080708_MID_2,,relief,45.2% Vertical Relief: Flat | 9.7% Vertical Re...,percent cover,uniform point contact


In [150]:
## Change NaN in string fields to ''

mof_agg[['occurrenceID', 'measurementUnit']] = mof_agg[['occurrenceID', 'measurementUnit']].replace(np.nan, '')
mof_agg.isna().sum()

eventID              0
occurrenceID         0
measurementType      0
measurementValue     0
measurementUnit      0
measurementMethod    0
dtype: int64

In [151]:
## Save

mof_agg.to_csv('PISCO_swath_MoF_20210119.csv', index=False, na_rep='NaN')

## Questions

**These questions are in addition to those listed in PISCO_fish_transect_converstion.ipynb**

1. How do I know if an organism was subsampled? Have counts already been adjusted to 60 m2? **Yes, counts have already been adjusted, so don't worry about it.**
2. What's up with the transects that aren't just numbers ('1_E', '1_S', etc.)? Should I strip off the letters or leave them? **The actual transect numbers aren't meaningful - they're just different to denote replicates. Dan agrees that they are a bit confusing, but the research group would like to keep them.**
3. There are 12 classcodes in the species table that do not appear in the swath or upc data. In addition, one classcode, LEPHEXAD, is missing from the species table. **This classcode refers to a Leptasterias adult. It will be changed to LEPHEX in the 2020 submission.**
4. Sometimes, as in the fish data, NO_ORG observations occur in the same event as other observations. This does not make sense to me. **This should mean that no kelp was observed on the transect. Dan will confirm, and I can look into this as well. -- This does not appear to be correct, but there's only one NO_ORG record, and for now I've just excluded it. The wording in Dan's follow-up email was a bit confusing, but I think he's saying that he will drop this record for the next submission, too.**
5. I did not find any size measurements for sea cucumbers, although the metadata says were sometimes sized. **This is fine; sea cucumber sizes can be found in the size frequency data.**
6. As in the fish data, what is an appropriate coordinateUncertaintyInMeters? **250 m should be fine.**
7. Is there a samplingEffort I can list for swath transects? UPC transects? Like a time goal or limit? **No. Since these transects represent a snapshot in time, and the organisms counted are sessile, it doesn't matter how long it takes someone to search the transect. It takes as long as it takes to survey everything that's there.**

## Remaining Questions - 2/4/21

1. I'm still missing the lat, lon for SCI_PELICAN_FAR_WEST.
2. Absence records are now populated, but there are a collection of situations where the species was not looked for in the species table, but appears as present in the data. The data sets swath_counted_but_not_looked_for.csv and swath_counted_but_not_looked_for_summary.csv contain the details for these instances. **Dan is looking into this.**
3. Conversely, while populating absence records, I observed that there are a couple entries missing in the species table (i.e., there are observations of a given classcode from a given campus, but that classcode is not listed in the species table for that campus.) I need to let Dan know about this, and hopefully he can fix it in the next update to DataONE. Until then, though, I've added the following manually to the species tablew:
    - LAMSETAD for UCSC, observations in 2001-2018
    - LEPHEXAD for UCSC, observations in 2014-2018 (Dan says he's changing this classcode to LEPHEX in the 2020 submission, so this will need to reflect that).
    
**The updated species table for the 2020 submission reflects these changes, although Dan now says that LEPHEXAD will persist in the 2020 submission, so I need to make sure I'm handling that correctly.**

4. Dan mentioned that he hadn't generated a separate set of lines for campus=OSU, so this may need to be done over again once that's accomplished. **Not relevant for this data set.**
5. Before doing the final version of this data set, it might be worth re-visiting controlled vocabularies in the context of measurementType, measurementUnit, etc, especially for the disease observations.