# PISCO - swath and upc data (inverts, kelp, substrate)

**In this version of the conversion script, cover and superlayer organisms from UPC surveys have NOT been included as occurrences; the information is only in the MoF. Additionally, absence records have been populated. Removing cover and superlayer organisms from the occurrence file made this process easier.**

The density of conspicuous, individually distinguishable macroalgae and macroinvertebrates (i.e. organisms larger than 2.5 cm and visually detectable by SCUBA divers) are visually recorded along replicate 2m wide by 30m long (60m2) transects. 
- For select species (e.g., sea urchins), high densities are spatially subsampled 
- 2 x 30m transects are distributed end-to-end and 5-10m apart at each of the following depths:
    - 5m
    - 12.5m
    - 20m 
- Additional 25m transects are conducted by VRG where habitat is available
- This usually results in 6 replicate transects per site. 
- Surveyors record:
    - Counts of individually distinguishable macroinvertebrates
    - Counts of Giant kelp (Macrocystis pyrifera) and bull kelp (Nereocystis luetkeana), > 1m in height
    - Stipe counts for qualifying giant kelp individuals 
    
In addition, the percent cover of non-individually distinguishable macroalgae and macroinvertebrates are visually recorded, usually on the **same replicate transects as the swath surveys described above.** 
- At each meter mark along the 30m transect, the diver records:
    - the underlying substrate (bedrock, boulder, cobble, or sand)
    - the vertical relief ( 0-10cm, 10cm-1m, 1-2m, and >2m) 
    - the cover (non-mobile primary space holding organism or bare substrate type)
    - the superlayer, if present (a small subset of specific organisms which may be ephemeral, and tend to create a layer over primary space holders)
    
**Resources:** <br>
https://opc.dataone.org/view/MLPA_kelpforest.metadata.1

In [1]:
## Imports

import pandas as pd
import numpy as np

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "/Users/dianalg/PycharmProjects/PythonScripts/MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Load swath data

filename = 'MLPA_kelpforest_swath.4.csv'
swath = pd.read_csv(filename, dtype={'transect':str, 'disease':str, 'notes':str, 'site_name_old':str})

print(swath.shape)
swath.head()

(266134, 17)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,classcode,count,size,disease,depth,observer,notes,site_name_old
0,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,ALAMAR,3.0,,,,,,
1,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,4,ALAMAR,1.0,,,,,,
2,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,5,ALAMAR,5.0,,,,,,
3,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,5,ANTSPP,2.0,,,,,,
4,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,7,ANTSPP,14.0,,,,,,


In [4]:
swath['year'].max()

2020

### Column definitions

**campus** = UCSC, USCB, HSU or VRG. The academic partner campus that conducted the survey. <br>
**method** = SBTL_SWATH_PISCO, SBTL_SWATH_HSU or SBTL_SWATH_VRG. The code describing the sampling technique and monitoring program that conducted each survey.  <br>
**survey_year** = 1999 - 2018. The designated year associated with the annual survey. In most cases, survey_year and year are the same. In rare cases, surveys are completed early in the year following the designated survey year. In these cases, survey_year will differ from year. <br>
**year** = 1999 - 2018. Year that the survey was conducted. <br>
**month** = 1 - 12. Month that the survey was conducted. <br>
**day** = 1 - 31. Day that the survey was conducted. <br>
**site** = One of 350 site codes. The unique site where the survey was performed (as defined in the site table). This site refers to a specific GPS location and is often associated with a geographic placename. Often, multiple site replicates will be associated with a single placename, and will be delineated with additional geographical or directional information (e.g. North/South/East/West/Central - N/S/E/W/CEN, Upcoast/Downcoast - UC/DC) <br>
**zone** = INNER, OUTER, OUTMID, INMID, MID or DEEP. A division of the site into 2 to 4 categories representing onshore-offshore stratification associated with targeted bottom depths for transects.
- INNER: Depth zone targeting roughly 5m of water depth, or the inner edge of the reef
- INMID: Depth zone targeting roughly 10m of water depth 
- MID: Depth zone targeting roughly 10-15m of water depth, used by VRG and in early years of PISCO
- OUTMID: Depth zone targeting roughly 15m of water depth 
- OUTER: Depth zone targeting roughly 20m of water depth 
- DEEP: Depth zone targeting roughly 25m of water depth, where present, used only by VRG

**transect** = It seems like this should only be 1 - 8, but there are **a number of other designations as well.** The unique transect replicate within each site and zone. <br>
**classcode** = One of 187 taxon codes. The unique taxonomic classification code that is being counted, as defined in the taxonomic table. This refers to a code that defines the Genus and Species that is identified, a code that represents a limited number of species that can't be narrowed down to one species, or in some cases family-level or higher order groupings. Generally, for invertebrates and algae, the classcode is comprised of the first letter of the genus, and the first three letters of the species, with some exceptions <br>
**count** = The number of individuals of a given classcode and a given size (if applicable) per transect <br>
**size** = For Macrocystis pyrifera, this represents the number of individual stipes growing for each individual. For a select number of invertebrate species that are measured on swath transects, this represents the size (in centimeters) of the following: test diameter for urchins, length of longest arm for seastars, shell length for abalone, carapace length for lobsters, total turgid length for sea cucumbers. <br>
**disease** = For some years echinoderm disease was recorded on transects for select species. When systematic observation for disease was conducted, disease is indicated here. Where blank, disease was not evaluated.
- HEALTHY: Individual was inspected and no was disease observed
- YES: Some form of disease was observed
- MILD: Mild disease was observed
- SEVERE: Severe disease was observed
- WASTING: Wasting disease was observed
- BLACK SPOT: Black spot disease was observed

**depth** = Between 1.2 and 28 meters. Depth of the transect estimated by the diver. **Does this mean a dive computer was used?** <br>
**observer** = The diver who conducted the survey transect <br>
**notes** = Free text notes taken at the time of the sample, or added at the time of data entry. <br>
**site_name_old** = In cases when specific sites have been surveyed by multiple campuses using different site names, this variable indicates the alternative (historical) site name.

Counts have already been adjusted if subsampling occurred.

In [5]:
## Load upc data

filename = 'MLPA_kelpforest_upc.4.csv'
upc = pd.read_csv(filename, dtype={'transect':str, 'notes':str, 'site_name_old':str})

print(upc.shape)
upc.head()

(191889, 17)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,category,classcode,count,pct_cov,depth,observer,notes,site_name_old
0,UCSB,SBTL_UPC_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,1,COVER,ANEM,3,3.9,,,,
1,UCSB,SBTL_UPC_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,2,COVER,ANEM,1,1.1,,,,
2,UCSB,SBTL_UPC_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,COVER,ANEM,1,1.2,,,,
3,UCSB,SBTL_UPC_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,5,COVER,ANEM,1,1.0,,,,
4,UCSB,SBTL_UPC_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,6,COVER,ANEM,1,1.1,,,,


In [6]:
upc['year'].max()

2020

### Additional column notes

**category** = The code indicating which of four types of data are collected on each point. Categories may be SUBSTRATE, RELIEF, COVER and SUPERLAYER. Percent cover should be calculated for each category separately. For SUPERLAYER, the total number of points will not necessarily equal the targeted number of points surveyed in each of the other categories (i.e. superlayer is specific to certain organisms that are only included if present).
- COVER: Primary space holding, non-mobile organism or bare substrate type present at each point, cover types are defined in taxonomic table
- RELIEF: Physical relief is measured as the greatest vertical relief that exists within a 1m wide section across the tape and a 0.5m section along that tape that is centered over the point. Relief categories can be 0-10cm, 10cm-1m, 1-2m, and >2m
- SUBSTRATE: Substrate type underlying each point. Substrate categories include bedrock , boulder, cobble, and sand and are further defined in the taxonomic table
- SUPERLAYER: Superlayer includes a small subset of specific organisms which may be ephemeral, and tend to create a layer over primary space holders. Examples include low-lying, very large-bladed macroalgae such as Laminaria farlowii, brittle stars, and drift algae and in the Northern region abalone are recorded as a superlayer when occupying the space at the point. Not recorded at all points, only where present

**pct_cov** = Percent cover, calculated by dividing the number of points of a given category and classcode by the total number of points of a category per transect. The percent cover of superlayer is calculated by dividing the number of points of a superlayer classcode by the total number of points in the cover category, since superlayers are not present at all points.

### Strategy

As with RCCA, each transect can be an **event**, and each organism observation can be an **occurrence**.

The **event** file should contain: eventID (from site, survey date, zone, transect), eventDate (from year, month, date), datasetID, locality (site), countryCode, decimalLat, decimalLon, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. 

The **occurrence** file should contain: eventID, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe disease), organismQuantity (count), organismQuantityType.

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Cover, substrate, superlayer, and relief (pct_cov values) can be recorded at the event level. Size can be recorded at the occurrence level.

## Create occurrence file

### Get site names

In [7]:
## Load site table

filename = 'MLPA_kelpforest_site_table.4.csv'
sites = pd.read_csv(filename)

print(sites.shape)
sites.head()

(8138, 14)


Unnamed: 0,LTM_project_short_code,campus,method,survey_year,site,latitude,longitude,CA_MPA_Name_Short,site_designation,site_status,Secondary_MPA_Name,Secondary_site_designation,BaselineRegion,LongTermRegion
0,LTM_Kelp_SRock,VRG,SBTL_SIZEFREQ_VRG,2008,3_PALMS_EAST,33.718105,-118.3326,Abalone Cove SMCA,SMCA,reference,,,SOUTH,South Coast
1,LTM_Kelp_SRock,VRG,SBTL_FISH_VRG,2008,3_PALMS_EAST,33.718105,-118.3326,Abalone Cove SMCA,SMCA,reference,,,SOUTH,South Coast
2,LTM_Kelp_SRock,VRG,SBTL_SWATH_VRG,2008,3_PALMS_EAST,33.718105,-118.3326,Abalone Cove SMCA,SMCA,reference,,,SOUTH,South Coast
3,LTM_Kelp_SRock,VRG,SBTL_UPC_VRG,2008,3_PALMS_EAST,33.718105,-118.3326,Abalone Cove SMCA,SMCA,reference,,,SOUTH,South Coast
4,LTM_Kelp_SRock,VRG,SBTL_UPC_VRG,2010,ABALONE_COVE_KELP_W,33.73922,-118.38789,Abalone Cove SMCA,SMCA,mpa,,,SOUTH,South Coast


In [8]:
## Create simplified site table

site_summary = sites[['campus', 'site', 'site_status', 'latitude', 'longitude']].copy()
site_summary.drop_duplicates(inplace=True)

print(site_summary.shape)
site_summary.head()

(404, 5)


Unnamed: 0,campus,site,site_status,latitude,longitude
0,VRG,3_PALMS_EAST,reference,33.718105,-118.3326
4,VRG,ABALONE_COVE_KELP_W,mpa,33.73922,-118.38789
47,HSU,ABALONE_POINT_1,reference,39.6915,-123.8141
64,HSU,ABALONE_POINT_2,reference,39.66502,-123.80435
81,HSU,ABALONE_POINT_3,reference,39.62877,-123.79658


### Get species table

In [9]:
## Load species table

filename = 'MLPA_kelpforest_taxon_table.4.csv'
species = pd.read_csv(filename)

print(species.shape)
species.head()

(1221, 40)


Unnamed: 0,campus,sample_type,sample_subtype,classcode,orig_classcode,Kingdom,Phylum,Class,Order,Family,...,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018,LOOKED2019,LOOKED2020
0,HSU,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,no,no,no,yes,yes,no,yes,yes,yes,yes
1,UCSB,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,VRG,FISH,FISH,AARG,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
3,HSU,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no
4,UCSB,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no


The subset of the species table that's currently relevant is entries with sample_type = 'SWATH'. **Note** that there are 177 unique classcodes under this sample type, only 173 of which are actually in the swath data set. I'm not sure that this is a problem; it's possible that some species have been looked for, but never seen, and therefore don't appear in the presence-only data. 

```python
# Number of unique swath classcodes in species table
species.loc[species['sample_type'] == 'SWATH', 'classcode'].nunique()

# Number of unique classcodes in swath data
swath['classcode'].nunique()

# Classcodes that appear in the species table but not in the data
for sp in species.loc[species['sample_type'] == 'SWATH', 'classcode'].unique():
    if sp not in swath['classcode'].unique():
        print(sp)
```

In [20]:
## Select species for swath surveys



swath_sp = species.loc[species['sample_type'] == 'SWATH', 
                       ['campus', 'classcode', 'species_definition', 'common_name', 'LOOKED1999',
                        'LOOKED2000', 'LOOKED2001', 'LOOKED2002', 'LOOKED2003', 'LOOKED2004',
                        'LOOKED2005', 'LOOKED2006', 'LOOKED2007', 'LOOKED2008', 'LOOKED2009',
                        'LOOKED2010', 'LOOKED2011', 'LOOKED2012', 'LOOKED2013', 'LOOKED2014',
                        'LOOKED2015', 'LOOKED2016', 'LOOKED2017', 'LOOKED2018', 'LOOKED2019',
                        'LOOKED2020']]

print(swath_sp.shape)
swath_sp.head()

(356, 26)


Unnamed: 0,campus,classcode,species_definition,common_name,LOOKED1999,LOOKED2000,LOOKED2001,LOOKED2002,LOOKED2003,LOOKED2004,...,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018,LOOKED2019,LOOKED2020
608,HSU,ALAMAR,Alaria marginata,Alaria,no,no,no,no,no,no,...,no,no,no,yes,yes,no,yes,yes,yes,yes
609,UCSB,ALAMAR,Alaria marginata,Alaria,yes,yes,yes,yes,yes,yes,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
610,UCSC,ALAMAR,Alaria marginata,Alaria,yes,yes,yes,yes,yes,yes,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
611,HSU,COSCOS,Costaria costata,Costaria,no,no,no,no,no,no,...,no,no,no,yes,yes,no,yes,yes,yes,yes
612,UCSB,COSCOS,Costaria costata,Costaria,no,no,no,no,no,yes,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes


In [21]:
## Melt species table

long = pd.melt(swath_sp, id_vars=swath_sp.columns[0:4], var_name='year', value_name='looked')
print(long.shape)
long.head()

(7832, 6)


Unnamed: 0,campus,classcode,species_definition,common_name,year,looked
0,HSU,ALAMAR,Alaria marginata,Alaria,LOOKED1999,no
1,UCSB,ALAMAR,Alaria marginata,Alaria,LOOKED1999,yes
2,UCSC,ALAMAR,Alaria marginata,Alaria,LOOKED1999,yes
3,HSU,COSCOS,Costaria costata,Costaria,LOOKED1999,no
4,UCSB,COSCOS,Costaria costata,Costaria,LOOKED1999,no


In [22]:
## Replace 

long['year'] = long['year'].str.split('D').str[1].astype(int)
long['year'].unique()

array([1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
       2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])

### Fill in absence records

In [24]:
## Check if there are any records where count data is missing

swath[swath['count'].isna() == True].shape

(21, 17)

**Note** that there are 21 records where count data is missing.

In [25]:
## Drop these records 

print(swath.shape)
swath.dropna(subset=['count'], inplace=True)
swath.shape

(266148, 17)


(266127, 17)

When I was populating absence records for the fish transect data, I found that there were observations of a given classcode from a given campus, but that the classcode was not listed in the species table for that campus. I'd like to try to check for a similar problem here, to avoid having to track down redundant errors later.

In [26]:
## Determine if there are missing campus/year/classcode combos in the species table

# Get unique combinations of campus, survey_year, and classcode from the data
observed_species = swath[['campus', 'survey_year', 'classcode']].drop_duplicates()

# Merge with species table
test = observed_species.merge(long, how='outer', left_on=['campus', 'survey_year', 'classcode'], right_on=['campus', 'year', 'classcode'], indicator=True)

# Look for campus, survey_year, and classcode combinations that only appear in the observed data
test[test['_merge'] == 'left_only']

Unnamed: 0,campus,survey_year,classcode,species_definition,common_name,year,looked,_merge


It looks like this is not an issue for the swath data, and I can proceed with absence population.

In [27]:
## Get a table telling whether each organism was looked for during each specific transect

survey_table = swath[['campus', 'method', 'day', 'month', 'survey_year', 'year', 'site', 'zone', 'transect']].merge(long[['campus', 'classcode', 'year', 'looked']], 
                                                                                                             how='left', 
                                                                                                             left_on=['campus', 'survey_year'],
                                                                                                             right_on=['campus', 'year'])
survey_table.drop_duplicates(inplace=True)
survey_table.rename(columns={'year_x':'year'}, inplace=True) # year_x retains actual year when survey took place
survey_table.drop(columns=['year_y'], inplace=True) # year_y == survey_year because of the merge
survey_table

Unnamed: 0,campus,method,day,month,survey_year,year,site,zone,transect,classcode,looked
0,UCSB,SBTL_SWATH_PISCO,30,9,1999,1999,ANACAPA_ADMIRALS_E,INNER,3,ALAMAR,yes
1,UCSB,SBTL_SWATH_PISCO,30,9,1999,1999,ANACAPA_ADMIRALS_E,INNER,3,COSCOS,no
2,UCSB,SBTL_SWATH_PISCO,30,9,1999,1999,ANACAPA_ADMIRALS_E,INNER,3,DICSPP,yes
5,UCSB,SBTL_SWATH_PISCO,30,9,1999,1999,ANACAPA_ADMIRALS_E,INNER,3,EGRMEN,yes
6,UCSB,SBTL_SWATH_PISCO,30,9,1999,1999,ANACAPA_ADMIRALS_E,INNER,3,EISARBAD,no
...,...,...,...,...,...,...,...,...,...,...,...
37626954,VRG,SBTL_SWATH_VRG,16,12,2020,2020,POINT_DUME,OUTER,2,TYLFUN,yes
37626955,VRG,SBTL_SWATH_VRG,16,12,2020,2020,POINT_DUME,OUTER,2,URTCOL,yes
37626956,VRG,SBTL_SWATH_VRG,16,12,2020,2020,POINT_DUME,OUTER,2,URTMCP,yes
37626957,VRG,SBTL_SWATH_VRG,16,12,2020,2020,POINT_DUME,OUTER,2,URTSPP,yes


In [28]:
## Merge with swath data to get final outcome

full_swath = swath.merge(survey_table, 
                             how='right', 
                             on=['campus', 'method', 'day', 'month', 'year', 'survey_year', 'site', 'zone', 'transect', 'classcode'])
full_swath

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,classcode,count,size,disease,depth,observer,notes,site_name_old,looked
0,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,ALAMAR,3.0,,,,,,,yes
1,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,COSCOS,,,,,,,,no
2,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,DICSPP,,,,,,,,yes
3,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,EGRMEN,,,,,,,,yes
4,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,EISARBAD,,,,,,,,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1534635,VRG,SBTL_SWATH_VRG,2020,2020,12,16,POINT_DUME,OUTER,2,TYLFUN,,,,,,,,yes
1534636,VRG,SBTL_SWATH_VRG,2020,2020,12,16,POINT_DUME,OUTER,2,URTCOL,,,,,,,,yes
1534637,VRG,SBTL_SWATH_VRG,2020,2020,12,16,POINT_DUME,OUTER,2,URTMCP,,,,,,,,yes
1534638,VRG,SBTL_SWATH_VRG,2020,2020,12,16,POINT_DUME,OUTER,2,URTSPP,,,,,,,,yes


In [29]:
## Clean

full_swath = full_swath[full_swath['classcode'] != 'NO_ORG'].copy()
full_swath.loc[(full_swath['looked'] == 'yes') & (full_swath['count'].isna() == True), 'count'] = 0
full_swath.dropna(subset=['count'], inplace=True)
full_swath

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,classcode,count,size,disease,depth,observer,notes,site_name_old,looked
0,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,ALAMAR,3.0,,,,,,,yes
2,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,DICSPP,0.0,,,,,,,yes
3,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,EGRMEN,0.0,,,,,,,yes
5,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,EISARBADSUB,0.0,,,,,,,yes
9,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,LAMSPP,0.0,,,,,,,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1534634,VRG,SBTL_SWATH_VRG,2020,2020,12,16,POINT_DUME,OUTER,2,TRIHEL,0.0,,,,,,,yes
1534635,VRG,SBTL_SWATH_VRG,2020,2020,12,16,POINT_DUME,OUTER,2,TYLFUN,0.0,,,,,,,yes
1534636,VRG,SBTL_SWATH_VRG,2020,2020,12,16,POINT_DUME,OUTER,2,URTCOL,0.0,,,,,,,yes
1534637,VRG,SBTL_SWATH_VRG,2020,2020,12,16,POINT_DUME,OUTER,2,URTMCP,0.0,,,,,,,yes


**Note** that there are 12 records where count > 0 but looked = no. **These need to be changed to looked = yes.** The unique campus/year/classcode combinations for these 12 records are:
- HSU, PHYPAP, 2014 & 2015
- HSU, PODMAC, 2014 & 2015
- HSU, GERRUB, 2018
- VRG, 2004, OPHESM

Note that these don't come up in the earlier test for species that were observed but not in the species table because these campus/year/classcode combinations ARE in the species table, they're just there with looked = 'no'. I've added some code below to verify this.

I'VE ALSO FIXED THESE ISSUES MANUALLY ABOVE.

```python
# Get records
weird = full_swath[(full_swath['count'] > 0) & (full_swath['looked'] == 'no')]

# Get table of campuses and years where there were observations for classcodes that were not looked for according to the species table
obs_exist = weird[['campus', 'survey_year', 'classcode']].copy()
obs_exist.drop_duplicates(inplace=True)
obs_exist.head()

# Verify that records in both species and swath tables where looked = 'no' are the same as the ones listed above
test[(test['_merge'] == 'both') & (test['looked'] == 'no')]
```

### Convert

In [35]:
## Merge to add site_name (also lat, lon and site_status) to swath table

full_swath = full_swath.merge(site_summary, how='left', on='site')
print(full_swath.shape)
full_swath.head()

(1176661, 21)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,classcode,...,size,disease,depth,observer,notes,site_name_old,looked,site_status,latitude,longitude
0,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,ALAMAR,...,,,,,,,yes,reference,34.003433,-119.418
1,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,DICSPP,...,,,,,,,yes,reference,34.003433,-119.418
2,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,EGRMEN,...,,,,,,,yes,reference,34.003433,-119.418
3,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,EISARBADSUB,...,,,,,,,yes,reference,34.003433,-119.418
4,UCSB,SBTL_SWATH_PISCO,1999,1999,9,30,ANACAPA_ADMIRALS_E,INNER,3,LAMSPP,...,,,,,,,yes,reference,34.003433,-119.418


In [36]:
## Pad month and day as needed

paddedDay = ['0' + str(full_swath['day'].iloc[i]) if len(str(full_swath['day'].iloc[i])) == 1 else str(full_swath['day'].iloc[i]) for i in range(full_swath.shape[0])]
paddedMonth = ['0' + str(full_swath['month'].iloc[i]) if len(str(full_swath['month'].iloc[i])) == 1 else str(full_swath['month'].iloc[i]) for i in range(full_swath.shape[0])]

In [37]:
## Create eventID

eventID = [full_swath['site'].iloc[i] + '_' + str(full_swath['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + full_swath['zone'].iloc[i] + '_' +
           full_swath['transect'].iloc[i].replace(' ', '') for i in range(full_swath.shape[0])]
swath_occ = pd.DataFrame({'eventID':eventID})

swath_occ.head()

Unnamed: 0,eventID
0,ANACAPA_ADMIRALS_E_19990930_INNER_3
1,ANACAPA_ADMIRALS_E_19990930_INNER_3
2,ANACAPA_ADMIRALS_E_19990930_INNER_3
3,ANACAPA_ADMIRALS_E_19990930_INNER_3
4,ANACAPA_ADMIRALS_E_19990930_INNER_3


In [38]:
## Add occurrenceID

# Create survey_date column in swath
full_swath['survey_date'] = [str(full_swath['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(full_swath.shape[0])]

# Groupby to create occurrenceID
swath_occ['occurrenceID'] = full_swath.groupby(['site', 'survey_date', 'zone', 'transect'])['classcode'].cumcount()+1
swath_occ['occurrenceID'] = swath_occ['eventID'] + '_occ' + swath_occ['occurrenceID'].astype(str)

swath_occ.head()

Unnamed: 0,eventID,occurrenceID
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ1
1,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ2
2,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ3
3,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ4
4,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ5


In [39]:
## Get relevant records from species table

swath_sp = swath_sp[['classcode', 'species_definition', 'common_name']].copy()
swath_sp.drop_duplicates(inplace=True)
swath_sp

Unnamed: 0,classcode,species_definition,common_name
919,ALAMAR,Alaria marginata,Alaria
922,COSCOS,Costaria costata,Costaria
925,DESLIG,Desmarestia ligulata,Acid Weed
926,DICSPP,Dictyoneurum californicum/reticulatum,Dictyoneurum
932,EGRMEN,Egregia menziesii,Egregia
...,...,...,...
1419,URTCRA,Urticina crassicornis,Christmas Anemone
1422,URTMCP,Urticina mcpeaki,McPeak's Anemone
1424,URTPIS,Urticina piscivora,Fish-Eating Anemone
1427,URTSPP,Urticina,Urticina Spp.


**Note** that there are 10 classcodes in swath_sp that do not appear in the swath data. They are:
- ATRIDA
- CALSPP
- CERNUT
- CROCAL
- CUCSPP
- DELETE
- HERMSPP
- NORSPP
- TEGSPP
- TEST

**I wonder if I should just drop records with classcodes like DELETE and TEST? Or maybe Dan wants to handle it on his end?**

```python
for code in swath_sp['classcode'].unique():
    if code not in full_swath['classcode'].unique():
        print(code)
```

In [41]:
## Map classcodes to species definitions (usually scientific names) and classcodes to common names

code_to_sci_dict = dict(zip(swath_sp['classcode'], swath_sp['species_definition']))
code_to_com_dict = dict(zip(swath_sp['classcode'], swath_sp['common_name']))

In [42]:
## Create scientificName

swath_occ['scientificName'] = full_swath['classcode']
swath_occ['scientificName'].replace(code_to_sci_dict, inplace=True)

swath_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ1,Alaria marginata
1,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ2,Dictyoneurum californicum/reticulatum
2,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ3,Egregia menziesii
3,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ4,Eisenia arborea
4,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ5,Laminaria


In [43]:
## Get unique scientific names for lookup in WoRMS

names = swath_occ['scientificName'].unique()

**Note** that there are a number of names that are not specific at the species level:
- Dictyoneurum californicum/reticulatum (matched to Dictyoneurum)
- Loxorhynchus crispatus/Scyra acutifrons (Loxorhynchus crispatus or Scyra acutifrons, shared subfamily Pisinae)
- Scyra acutifrons/Oregonia gracilis (shared superfamily Majoidea)
- Holothuria (Vaneyothuria) zacae - this appears exactly as-is on WoRMS, so it's fine
- Urticina columbiana/mcpeaki (matched to Urticina)
- Lopholithodes mandtii/foraminatus (matched to Lopholithodes)
- Ceratostoma/Pteropurpura (Ceratostoma spp. or Pteropurpura spp., shared subfamily Ocenebrinae)

**Add these to identificationRemarks column.**

There are also some descriptions that lack a scientific name:
- Gorgonian Adult (order Alcyonacea)
- Unidentified Mobile Invert Species (scientific_name **Animalia**.)

In [44]:
## Make changes based on the above observations

swath_occ.loc[swath_occ['scientificName'] == 'Loxorhynchus crispatus/Scyra acutifrons', 'scientificName'] = 'Pisinae'
swath_occ.loc[swath_occ['scientificName'] == 'Scyra acutifrons/Oregonia gracilis', 'scientificName'] = 'Majoidea'
swath_occ.loc[swath_occ['scientificName'] == 'Ceratostoma/Pteropurpura', 'scientificName'] = 'Ocenebrinae'
swath_occ.loc[swath_occ['scientificName'] == 'Gorgonian Adult', 'scientificName'] = 'Alcyonacea'
swath_occ.loc[swath_occ['scientificName'] == 'Unidentified Mobile Invert Species', 'scientificName'] = 'Animalia'

# Redefine names
names = swath_occ['scientificName'].unique()

In [45]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Dictyoneurum californicum/reticulatum checking:  Dictyoneurum
Url didn't work for Pugettia spp checking:  Pugettia
Url didn't work for Poraniopsis inflata checking:  Poraniopsis
Url didn't work for Urticina columbiana/mcpeaki checking:  Urticina
Url didn't work for Lopholithodes mandtii/foraminatus checking:  Lopholithodes


Interesting. For the first time, I'm getting some species names that I know are on WoRMS that are not matching. I wonder if the WoRMS API doesn't like me querying this much/rapidly?

Either way, for now, I'm just going to add in some code to redo any queries that didn't go throug the first time.

If I want to check this manually in the future, here's some code from WoRMS.py:

```python
import urllib.parse
import urllib.request
import json

sci_name = 'Stylaster californicus'
sci_name_url = urllib.parse.quote(sci_name)
_url = 'http://www.marinespecies.org/rest/AphiaRecordsByNames?scientificnames%5B%5D=' + sci_name_url + '&like=false&marine_only=false'
with urllib.request.urlopen(_url) as url:
    data = json.loads(url.read().decode())
data[0][0]['scientificname'], data[0][0]['lsid'], data[0][0]['AphiaID'], data[0][0]['class']
```

In [113]:
## Add any names to dictionary that should have matched but didn't ------ I DON'T KNOW IF THIS WILL NEED TO BE UPDATED IN THE FUTURE

name_id, name_name, name_taxid, name_class = WoRMS.run_get_worms_from_scientific_name(['Orthasterias koehleri'], verbose_flag=True)
name_id_dict.update(name_id)
name_name_dict.update(name_name)
name_taxid_dict.update(name_taxid)
name_class_dict.update(name_class)

In [48]:
## Add scientific name-related columns

swath_occ['scientificNameID'] = swath_occ['scientificName']
swath_occ['scientificNameID'].replace(name_id_dict, inplace=True)

swath_occ['taxonID'] = swath_occ['scientificName']
swath_occ['taxonID'].replace(name_taxid_dict, inplace=True)
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ1,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791
1,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ2,Dictyoneurum californicum/reticulatum,urn:lsid:marinespecies.org:taxname:369575,369575
2,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ3,Egregia menziesii,urn:lsid:marinespecies.org:taxname:372502,372502
3,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ4,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990
4,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ5,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207


In [49]:
## Create identificationQualifier

remarks_dict = {'Dictyoneurum californicum/reticulatum':'Dictyoneurum californicum or Dictyoneurum reticulatum',
                'Loxorhynchus crispatus/Scyra acutifrons':'Loxorhynchus crispatus or Scyra acutifrons',
                'Scyra acutifrons/Oregonia gracilis':'Scyra acutifrons or Oregonia gracilis',
                'Urticina columbiana/mcpeaki':'Urticina columbiana or Urticina mcpeaki',
                'Lopholithodes mandtii/foraminatus':'Lopholithodes mandtii or Lopholithodes foraminatus',
                'Ceratostoma/Pteropurpura':'Ceratostoma spp. or Pteropurpura spp.',
                'Alcyonacea':'Gorgonian Adult',
                'Animalia':'Unidentified mobile invertebrate species'}

identificationRemarks = [remarks_dict[name] if name in remarks_dict.keys() else np.nan for name in swath_occ['scientificName']]

In [50]:
## Replace scientificName using name_name_dict

swath_occ['scientificName'].replace(name_name_dict, inplace=True)
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName,scientificNameID,taxonID
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ1,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791
1,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ2,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575
2,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ3,Egregia menziesii,urn:lsid:marinespecies.org:taxname:372502,372502
3,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ4,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990
4,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ5,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207


In [51]:
## Add vernacular name

swath_occ.insert(2, 'vernacularName', full_swath['classcode'].copy())
swath_occ.replace(code_to_com_dict, inplace=True)
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ1,Alaria,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791
1,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ2,Dictyoneurum,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575
2,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ3,Egregia,Egregia menziesii,urn:lsid:marinespecies.org:taxname:372502,372502
3,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ4,Southern Sea Palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990
4,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ5,Setchell's Kelp,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207


In [52]:
## Add final name-related columns

swath_occ['nameAccordingTo'] = 'WoRMS'
swath_occ['occurrenceStatus'] = 'present'
swath_occ['basisOfRecord'] = 'HumanObservation'
swath_occ['identificationRemarks'] = identificationRemarks

swath_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ1,Alaria,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791,WoRMS,present,HumanObservation,
1,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ2,Dictyoneurum,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575,WoRMS,present,HumanObservation,Dictyoneurum californicum or Dictyoneurum reti...
2,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ3,Egregia,Egregia menziesii,urn:lsid:marinespecies.org:taxname:372502,372502,WoRMS,present,HumanObservation,
3,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ4,Southern Sea Palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,present,HumanObservation,
4,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ5,Setchell's Kelp,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,present,HumanObservation,


In [53]:
## Create density

swath_occ['organismQuantity'] = full_swath['count'].copy()
swath_occ['organismQuantityType'] = 'number of individuals per 60 m2'
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks,organismQuantity,organismQuantityType
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ1,Alaria,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791,WoRMS,present,HumanObservation,,3.0,number of individuals per 60 m2
1,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ2,Dictyoneurum,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575,WoRMS,present,HumanObservation,Dictyoneurum californicum or Dictyoneurum reti...,0.0,number of individuals per 60 m2
2,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ3,Egregia,Egregia menziesii,urn:lsid:marinespecies.org:taxname:372502,372502,WoRMS,present,HumanObservation,,0.0,number of individuals per 60 m2
3,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ4,Southern Sea Palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,present,HumanObservation,,0.0,number of individuals per 60 m2
4,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ5,Setchell's Kelp,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,present,HumanObservation,,0.0,number of individuals per 60 m2


In [54]:
## Update absence status

swath_occ.loc[swath_occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks,organismQuantity,organismQuantityType
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ1,Alaria,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791,WoRMS,present,HumanObservation,,3.0,number of individuals per 60 m2
1,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ2,Dictyoneurum,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575,WoRMS,absent,HumanObservation,Dictyoneurum californicum or Dictyoneurum reti...,0.0,number of individuals per 60 m2
2,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ3,Egregia,Egregia menziesii,urn:lsid:marinespecies.org:taxname:372502,372502,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2
3,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ4,Southern Sea Palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2
4,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ5,Setchell's Kelp,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2


In [55]:
## Add notes under occurrenceRemarks

swath_occ['occurrenceRemarks'] = full_swath['notes'].copy()
swath_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationRemarks,organismQuantity,organismQuantityType,occurrenceRemarks
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ1,Alaria,Alaria marginata,urn:lsid:marinespecies.org:taxname:371791,371791,WoRMS,present,HumanObservation,,3.0,number of individuals per 60 m2,
1,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ2,Dictyoneurum,Dictyoneurum,urn:lsid:marinespecies.org:taxname:369575,369575,WoRMS,absent,HumanObservation,Dictyoneurum californicum or Dictyoneurum reti...,0.0,number of individuals per 60 m2,
2,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ3,Egregia,Egregia menziesii,urn:lsid:marinespecies.org:taxname:372502,372502,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2,
3,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ4,Southern Sea Palm,Eisenia arborea,urn:lsid:marinespecies.org:taxname:371990,371990,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2,
4,ANACAPA_ADMIRALS_E_19990930_INNER_3,ANACAPA_ADMIRALS_E_19990930_INNER_3_occ5,Setchell's Kelp,Laminaria,urn:lsid:marinespecies.org:taxname:516207,516207,WoRMS,absent,HumanObservation,,0.0,number of individuals per 60 m2,


In [56]:
## Change NaN values in string fields to ''

swath_occ[['identificationRemarks', 'occurrenceRemarks']] = swath_occ[['identificationRemarks', 'occurrenceRemarks']].replace(np.nan, '')
swath_occ.isna().sum()

eventID                  0
occurrenceID             0
vernacularName           0
scientificName           0
scientificNameID         0
taxonID                  0
nameAccordingTo          0
occurrenceStatus         0
basisOfRecord            0
identificationRemarks    0
organismQuantity         0
organismQuantityType     0
occurrenceRemarks        0
dtype: int64

In [57]:
## Save size and disease data for MoF file

# Get size data
size_df = pd.DataFrame({'eventID':swath_occ['eventID'],
                        'occurrenceID':swath_occ['occurrenceID'],
                        'scientificName':swath_occ['scientificName'],
                        'size':full_swath['size']})
size_df.dropna(subset=['size'], inplace=True)
print(size_df.shape)

# Create a size_measurement_type column
size_df['measurementType'] = 'number of stipes'
size_df.loc[size_df['scientificName'].isin(['Mesocentrotus franciscanus', 
                                            'Strongylocentrotus purpuratus',
                                            'Lytechinus pictus']), 'measurementType'] = 'test diameter in centimeters' # urchins
size_df.loc[size_df['scientificName'].isin(['Haliotis rufescens', 
                                            'Haliotis walallensis',
                                            'Haliotis',
                                            'Haliotis kamtschatkana',
                                            'Haliotis corrugata',
                                            'Haliotis cracherodii',
                                            'Haliotis fulgens']), 'measurementType'] = 'shell length in centimeters' # abalone
size_df.loc[size_df['scientificName'].isin(['Patiria miniata', 
                                            'Pisaster giganteus',
                                            'Pisaster ochraceus',
                                            'Pycnopodia helianthoides',
                                            'Pisaster',
                                            'Pisaster brevispinus',
                                            'Asteroidea',
                                            'Solaster stimpsoni',
                                            'Leptasterias hexactis',
                                            'Solaster dawsoni',
                                            'Henricia leviuscula',
                                            'Dermasterias imbricata',
                                            'Mediaster aequalis',
                                            'Orthasterias koehleri',
                                            'Stylasterias forreri',
                                            'Linckia columbiae',
                                            'Astrometis sertulifera']), 'measurementType'] = 'longest arm length in centimeters' # sea stars
size_df.loc[size_df['scientificName'] == 'Panulirus interruptus', 'measurementType'] = 'carapace length in centimeters' # lobsters

# Get disease data
disease_df = pd.DataFrame({'eventID':swath_occ['eventID'],
                           'occurrenceID':swath_occ['occurrenceID'],
                           'scientificName':swath_occ['scientificName'],
                           'disease':full_swath['disease']})
disease_df.dropna(subset=['disease'], inplace=True)
print(disease_df.shape)

# Change the disease category 'YES' to something more descriptive
disease_df[disease_df['disease'] == 'YES'] = 'DISEASED'

(101975, 4)
(22589, 4)


**Note** that I did not find any size measurements for sea cucumbers, even though the methods say they may have been sized.

### Save

In [67]:
## Save

swath_occ.to_csv('PISCO_swath_occurrence_20210816.csv', index=False, na_rep='NaN')

## Create event file

**Note** that there are no events where only nonbiological UPC data was taken.

In [68]:
## Get unique eventIDs from occurrence file and their associated data

event = pd.DataFrame({'eventID':swath_occ['eventID'],
                    'eventDate':full_swath['survey_date'],
                    'institutionID':full_swath['campus'],
                    'locality':full_swath['site'],
                    'locationRemarks':full_swath['site_status'],
                    'decimalLatitude':full_swath['latitude'],
                    'decimalLongitude':full_swath['longitude']})
event.drop_duplicates(inplace=True)
event.reset_index(drop=True, inplace=True)

print(event.shape)
event.head()

(13745, 7)


Unnamed: 0,eventID,eventDate,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,19990930,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
1,ANACAPA_ADMIRALS_E_19990930_INNER_4,19990930,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
2,ANACAPA_ADMIRALS_E_19990930_INNER_5,19990930,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
3,ANACAPA_ADMIRALS_E_19990930_INNER_7,19990930,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
4,ANACAPA_ADMIRALS_E_19990930_INNER_1,19990930,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418


In [69]:
## Format eventDate

formatted = [datetime.strptime(dt, '%Y%m%d').date().isoformat() for dt in event['eventDate']]
event['eventDate'] = formatted
event.head()

Unnamed: 0,eventID,eventDate,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,1999-09-30,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
1,ANACAPA_ADMIRALS_E_19990930_INNER_4,1999-09-30,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
2,ANACAPA_ADMIRALS_E_19990930_INNER_5,1999-09-30,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
3,ANACAPA_ADMIRALS_E_19990930_INNER_7,1999-09-30,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
4,ANACAPA_ADMIRALS_E_19990930_INNER_1,1999-09-30,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418


In [70]:
## Dataset ID

event.insert(2, 'datasetID', 'PISCO swath and upc transects')
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
1,ANACAPA_ADMIRALS_E_19990930_INNER_4,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
2,ANACAPA_ADMIRALS_E_19990930_INNER_5,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
3,ANACAPA_ADMIRALS_E_19990930_INNER_7,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
4,ANACAPA_ADMIRALS_E_19990930_INNER_1,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418


In [71]:
## Update vocabulary in locationRemarks to be consistent with CCFRP and other PISCO data

event['locationRemarks'].replace({'mpa':'marine protected area'}, inplace=True)
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,decimalLatitude,decimalLongitude
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
1,ANACAPA_ADMIRALS_E_19990930_INNER_4,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
2,ANACAPA_ADMIRALS_E_19990930_INNER_5,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
3,ANACAPA_ADMIRALS_E_19990930_INNER_7,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418
4,ANACAPA_ADMIRALS_E_19990930_INNER_1,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,34.003433,-119.418


In [72]:
## Add countryCode

event.insert(6, 'countryCode', 'US')
event.head()

Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418
1,ANACAPA_ADMIRALS_E_19990930_INNER_4,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418
2,ANACAPA_ADMIRALS_E_19990930_INNER_5,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418
3,ANACAPA_ADMIRALS_E_19990930_INNER_7,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418
4,ANACAPA_ADMIRALS_E_19990930_INNER_1,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418


In [73]:
## Add coordinateUncertainty in Meters

event['coordinateUncertaintyInMeters'] = 250

In [74]:
## minimumDepthInMeters, maximumDepthInMeters

# Add eventID to swath
full_swath['eventID'] = eventID
swath_subset = full_swath[['eventID', 'depth']].copy()

# Groupby eventID to obtain depth column
depth = swath_subset.groupby(['eventID']).agg({
    'depth':[min, max]
})
depth.reset_index(inplace=True)
depth.columns = depth.columns.droplevel()
print(depth.shape)

# Add to event
event['minimumDepthInMeters'] = depth['min']
event['maximumDepthInMeters'] = depth['max']
event.head()

(13745, 3)


Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418,250,9.0,9.0
1,ANACAPA_ADMIRALS_E_19990930_INNER_4,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418,250,9.0,9.0
2,ANACAPA_ADMIRALS_E_19990930_INNER_5,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418,250,5.0,5.0
3,ANACAPA_ADMIRALS_E_19990930_INNER_7,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418,250,5.0,5.0
4,ANACAPA_ADMIRALS_E_19990930_INNER_1,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418,250,4.0,4.0


**Note** that there are 1202 transects that had missing depths. In addition, 2614 transects had multiple depths listed. These have been summarized using min and max.

``` python
# Missing depths
depth[depth['min'].isna() == True]

# Transects with multiple depths
multiple_depths = swath_subset.groupby('eventID', as_index=False)['depth'].nunique()
multiple_depths[multiple_depths['depth'] > 1]
```

In [75]:
## Add samplingProtocol and samplingEffort

# samplingProtocol
method = pd.DataFrame({'eventID':swath_occ['eventID'], 
                       'method':full_swath['method']})
method.drop_duplicates(inplace=True)
method = method.groupby('eventID')['method'].unique().str.join(', ')
print(method.shape)
method.reset_index(drop=True, inplace=True)
event['samplingProtocol'] = method

# samplingEffort
event['samplingEffort'] = 'all organisms present were counted'
event.head()

(13745,)


Unnamed: 0,eventID,eventDate,datasetID,institutionID,locality,locationRemarks,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
0,ANACAPA_ADMIRALS_E_19990930_INNER_3,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418,250,9.0,9.0,SBTL_SWATH_VRG,all organisms present were counted
1,ANACAPA_ADMIRALS_E_19990930_INNER_4,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418,250,9.0,9.0,SBTL_SWATH_VRG,all organisms present were counted
2,ANACAPA_ADMIRALS_E_19990930_INNER_5,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418,250,5.0,5.0,SBTL_SWATH_VRG,all organisms present were counted
3,ANACAPA_ADMIRALS_E_19990930_INNER_7,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418,250,5.0,5.0,SBTL_SWATH_VRG,all organisms present were counted
4,ANACAPA_ADMIRALS_E_19990930_INNER_1,1999-09-30,PISCO swath and upc transects,UCSB,ANACAPA_ADMIRALS_E,reference,US,34.003433,-119.418,250,4.0,4.0,SBTL_SWATH_VRG,all organisms present were counted


In [76]:
## Check for NaN values in string columns

event.isna().sum()

eventID                             0
eventDate                           0
datasetID                           0
institutionID                       0
locality                            0
locationRemarks                    34
countryCode                         0
decimalLatitude                    34
decimalLongitude                   34
coordinateUncertaintyInMeters       0
minimumDepthInMeters             1202
maximumDepthInMeters             1202
samplingProtocol                    0
samplingEffort                      0
dtype: int64

In [77]:
## Replace NaN values in string columns ------ THIS CAN GET DELETED WHEN SITE TABLE IS UPDATED

event['locationRemarks'] = event['locationRemarks'].replace(np.nan, '')
event.isna().sum()

eventID                             0
eventDate                           0
datasetID                           0
institutionID                       0
locality                            0
locationRemarks                     0
countryCode                         0
decimalLatitude                    34
decimalLongitude                   34
coordinateUncertaintyInMeters       0
minimumDepthInMeters             1202
maximumDepthInMeters             1202
samplingProtocol                    0
samplingEffort                      0
dtype: int64

In [78]:
## Save

event.to_csv('PISCO_swath_event_20210816.csv', index=False, na_rep='NaN')

## Create MoF file

In [79]:
## Assemble UPC data

# Create eventID
paddedDay = ['0' + str(upc['day'].iloc[i]) if len(str(upc['day'].iloc[i])) == 1 else str(upc['day'].iloc[i]) for i in range(upc.shape[0])]
paddedMonth = ['0' + str(upc['month'].iloc[i]) if len(str(upc['month'].iloc[i])) == 1 else str(upc['month'].iloc[i]) for i in range(upc.shape[0])]
upc = upc.merge(site_summary, how='left', on='site')
eventID = [upc['site'].iloc[i] + '_' + str(upc['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + upc['zone'].iloc[i] + '_' +
           upc['transect'].iloc[i].replace(' ', '') for i in range(upc.shape[0])]
upc_mof = pd.DataFrame({'eventID':eventID,
                       'category':upc['category'],
                       'classcode':upc['classcode'],
                       'count':upc['count'],
                       'pct_cov':upc['pct_cov']})

# Create a column with more interpretable definitions of species codes
upc_mof['common_name'] = upc_mof['classcode']
upc_sp = species[species['sample_type'] == 'UPC']
upc_code_to_com_dict = dict(zip(upc_sp['classcode'], upc_sp['species_definition']))
upc_mof['common_name'].replace(upc_code_to_com_dict, inplace=True)

# Create Percent and UPC columns
upc_mof['pct_cov'] = upc_mof['pct_cov'].astype(str)
upc_mof['UPC'] = upc_mof['pct_cov'] + '% ' + upc_mof['common_name'] + ' | '

# Aggregate
upc_agg = upc_mof.groupby(['eventID', 'category']).agg({'UPC':sum})
upc_agg.reset_index(inplace=True)
upc_agg['UPC'] = upc_agg['UPC'].str[:-3]

# Make category column lower case
upc_agg['category'] = upc_agg['category'].str.lower()

upc_agg

Unnamed: 0,eventID,category,UPC
0,3_PALMS_EAST_20080708_MID_1,cover,3.2% Bare Rock | 19.4% Bare Sand | 12.9% Phaeo...
1,3_PALMS_EAST_20080708_MID_1,relief,29.0% Vertical Relief: Flat | 71.0% Vertical R...
2,3_PALMS_EAST_20080708_MID_1,substrate,12.9% Substrate: Bedrock | 58.1% Substrate: Bo...
3,3_PALMS_EAST_20080708_MID_2,cover,29.0% Bare Rock | 3.2% Bare Sand | 9.7% Phaeop...
4,3_PALMS_EAST_20080708_MID_2,relief,45.2% Vertical Relief: Flat | 9.7% Vertical Re...
...,...,...,...
43389,WHITE_ROCK_UC_20110629_OUTER_1,relief,10.0% Vertical Relief: Flat | 3.3% Vertical Re...
43390,WHITE_ROCK_UC_20110629_OUTER_1,substrate,83.3% Substrate: Bedrock | 3.3% Substrate: Bou...
43391,WHITE_ROCK_UC_20110629_OUTER_2,cover,3.3% Bare Rock | 3.3% Bryozoa | 3.3% Tunicate ...
43392,WHITE_ROCK_UC_20110629_OUTER_2,relief,20.0% Vertical Relief: Flat | 16.7% Vertical R...


In [80]:
## Use mof dataframe with upc data

mof = pd.DataFrame({'eventID':upc_agg['eventID']})
mof['occurrenceID'] = np.nan
mof['measurementType'] = upc_agg['category']
mof['measurementValue'] = upc_agg['UPC']
mof['measurementUnit'] = 'percent cover'
mof['measurementMethod'] = 'uniform point contact'
mof.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,3_PALMS_EAST_20080708_MID_1,,cover,3.2% Bare Rock | 19.4% Bare Sand | 12.9% Phaeo...,percent cover,uniform point contact
1,3_PALMS_EAST_20080708_MID_1,,relief,29.0% Vertical Relief: Flat | 71.0% Vertical R...,percent cover,uniform point contact
2,3_PALMS_EAST_20080708_MID_1,,substrate,12.9% Substrate: Bedrock | 58.1% Substrate: Bo...,percent cover,uniform point contact
3,3_PALMS_EAST_20080708_MID_2,,cover,29.0% Bare Rock | 3.2% Bare Sand | 9.7% Phaeop...,percent cover,uniform point contact
4,3_PALMS_EAST_20080708_MID_2,,relief,45.2% Vertical Relief: Flat | 9.7% Vertical Re...,percent cover,uniform point contact


In [81]:
## Add occurrence-level measurements and facts

# Size
size_mof = pd.DataFrame({'eventID':size_df['eventID'],
                        'occurrenceID':size_df['occurrenceID'],
                        'measurementType':size_df['measurementType'],
                        'measurementValue':size_df['size'],
                        'measurementUnit':'centimeters',
                        'measurementMethod':'measured by diver'})
size_mof.loc[size_mof['measurementType'] == 'number of stipes', 'measurementUnit'] = 'number of stipes'

# Disease
dis_mof = pd.DataFrame({'eventID':disease_df['eventID'],
                       'occurrenceID':disease_df['occurrenceID'],
                       'measurementType':'observation',
                       'measurementValue':disease_df['disease'].str.lower(),
                       'measurementUnit':np.nan,
                       'measurementMethod':'visually determined by diver'})
dis_mof

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
607344,ARROYO_QUEMADO_W_20140804_OUTER_1,ARROYO_QUEMADO_W_20140804_OUTER_1_occ72,observation,healthy,,visually determined by diver
608679,HORSESHOE_REEF_E_20140916_INNER_1,HORSESHOE_REEF_E_20140916_INNER_1_occ62,observation,healthy,,visually determined by diver
610422,NAPLES_E_20140804_INNER_2,NAPLES_E_20140804_INNER_2_occ61,observation,healthy,,visually determined by diver
610980,NAPLES_W_20140728_OUTER_1,NAPLES_W_20140728_OUTER_1_occ69,observation,healthy,,visually determined by diver
611070,SCI_CAVERN_POINT_E_20140708_INNER_1,SCI_CAVERN_POINT_E_20140708_INNER_1_occ76,observation,healthy,,visually determined by diver
...,...,...,...,...,...,...
932263,WESTON_UC_20200803_MID_2,WESTON_UC_20200803_MID_2_occ90,observation,healthy,,visually determined by diver
932264,WESTON_UC_20200803_MID_2,WESTON_UC_20200803_MID_2_occ91,observation,healthy,,visually determined by diver
932265,WESTON_UC_20200803_MID_2,WESTON_UC_20200803_MID_2_occ92,observation,healthy,,visually determined by diver
932266,WESTON_UC_20200803_MID_2,WESTON_UC_20200803_MID_2_occ93,observation,healthy,,visually determined by diver


In [82]:
## Append

mof_agg = pd.concat([mof, size_mof, dis_mof])
mof_agg.head()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,3_PALMS_EAST_20080708_MID_1,,cover,3.2% Bare Rock | 19.4% Bare Sand | 12.9% Phaeo...,percent cover,uniform point contact
1,3_PALMS_EAST_20080708_MID_1,,relief,29.0% Vertical Relief: Flat | 71.0% Vertical R...,percent cover,uniform point contact
2,3_PALMS_EAST_20080708_MID_1,,substrate,12.9% Substrate: Bedrock | 58.1% Substrate: Bo...,percent cover,uniform point contact
3,3_PALMS_EAST_20080708_MID_2,,cover,29.0% Bare Rock | 3.2% Bare Sand | 9.7% Phaeop...,percent cover,uniform point contact
4,3_PALMS_EAST_20080708_MID_2,,relief,45.2% Vertical Relief: Flat | 9.7% Vertical Re...,percent cover,uniform point contact


In [83]:
## Change NaN in string fields to ''

mof_agg[['occurrenceID', 'measurementUnit']] = mof_agg[['occurrenceID', 'measurementUnit']].replace(np.nan, '')
mof_agg.isna().sum()

eventID              0
occurrenceID         0
measurementType      0
measurementValue     0
measurementUnit      0
measurementMethod    0
dtype: int64

In [84]:
## Save

mof_agg.to_csv('PISCO_swath_MoF_20210816.csv', index=False, na_rep='NaN')

## Remaining issues

1. Site table
    - 5 sites where different campuses have listed slightly different lat, lon
        - DEL_MAR_REFERENCE_2
        - DEL_MAR_REFERENCE_3
        - LECHUZA
        - POINT_ARENA_REFERENCE_3
        - SAUNDERS_REFERENCE_1
    - There are a bunch of sites with site_status missing. A subset of these also have lat, lon missing
2. Species table
    - Just for your information, there are 12 SWATH species codes that are in the table but not in the swath data. I assume you know about some of them (e.g. DELETE), but just to be thorough:
        - LAMSAC
        - ATRIDA
        - CALSPP
        - CERNUT
        - CROCAL
        - CUCSPP
        - DELETE
        - HERMSPP
        - NORSPP
        - SCYORE
        - TEGSPP
        - TEST
    - There are 12 records where count > 0 but the species was not looked for according to the species table. These need to be changed to looked = yes. The unique campus/year/classcode combinations for these 12 records are:
        - HSU, PHYPAP, 2014 & 2015
        - HSU, PODMAC, 2014 & 2015
        - HSU, GERRUB, 2018
        - VRG, 2004, OPHESM
3. Swath table
    - There are 21 records where count is missing. These are mostly from HSU in July 2015. 
    - There are 1202 transects (19676 records) that have missing depths. 