# PISCO - swath and upc data (inverts, kelp, substrate)

The density of conspicuous, individually distinguishable macroalgae and macroinvertebrates (i.e., organisms larger than 2.5 cm and visually detectable by SCUBA divers) is visually recorded along replicate 2m wide by 30m long (60m2) transects. 
- For select species (e.g., sea urchins), high densities are spatially subsampled.
- 2 x 30m transects are distributed end-to-end and 5-10m apart at each of the following depths:
    - 5m
    - 12.5m
    - 20m 
- Additional 25m transects are conducted by VRG where habitat is available
- This usually results in 6 replicate transects per site. 
- Surveyors record:
    - Counts of individually distinguishable macroinvertebrates
    - Counts of giant kelp (Macrocystis pyrifera) and bull kelp (Nereocystis luetkeana), > 1m in height
    - Stipe counts for qualifying giant kelp individuals 
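
Because each swath transect covers a nominal 60m2, a count can be turned into a density by dividing by the transect area. The sketch below uses made-up rows (not real PISCO records) and assumes the counts already represent the whole transect, which has not been confirmed for subsampled species.

```python
import pandas as pd

TRANSECT_AREA_M2 = 2 * 30  # 2 m wide by 30 m long

# Hypothetical counts for one transect (illustrative values only)
toy = pd.DataFrame({
    'classcode': ['CRYSTE', 'MACPYRAD', 'PATMIN'],
    'count': [1.0, 4.0, 30.0],
})

# Individuals per square meter, assuming count covers the full transect
toy['density_per_m2'] = toy['count'] / TRANSECT_AREA_M2
print(toy)
```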
    
In addition, the percent cover of non-individually distinguishable macroalgae and macroinvertebrates is visually recorded, usually on the **same replicate transects as the swath surveys described above.** 
- At each meter mark along the 30m transect, the diver records:
    - the underlying substrate (bedrock, boulder, cobble, or sand)
    - the vertical relief (0-10cm, 10cm-1m, 1-2m, and >2m) 
    - the cover (non-mobile primary space holding organism or bare substrate type)
    - the superlayer, if present (a small subset of specific organisms which may be ephemeral, and tend to create a layer over primary space holders)
    
**Resources:** <br>
https://opc.dataone.org/view/MLPA_kelpforest.metadata.1

In [46]:
## Imports

import pandas as pd
import numpy as np

from datetime import datetime # for handling dates

In [47]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [48]:
## Load swath data

filename = 'MLPA_kelpforest_swath.1.csv'
swath = pd.read_csv(filename, encoding='ANSI', dtype={'transect':str, 'disease':str, 'notes':str, 'site_name_old':str})

print(swath.shape)
swath.head()

(233646, 17)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,classcode,count,size,disease,depth,observer,notes,site_name_old
0,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,CRYSTE,1.0,,,6.1,STACEY BUCKELEW,,
1,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,DICSPP,2.0,,,6.1,STACEY BUCKELEW,,
2,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,MACPYRAD,2.0,2.0,,6.1,STACEY BUCKELEW,,
3,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,MACPYRAD,1.0,5.0,,6.1,STACEY BUCKELEW,,
4,UCSC,SBTL_SWATH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,MACPYRAD,1.0,7.0,,6.1,STACEY BUCKELEW,,


### Column definitions

**campus** = UCSC, UCSB, HSU or VRG. The academic partner campus that conducted the survey. <br>
**method** = SBTL_SWATH_PISCO, SBTL_SWATH_HSU or SBTL_SWATH_VRG. The code describing the sampling technique and monitoring program that conducted each survey. **How is this different from the previous column? Does it actually indicate further methodological differences?** <br>
**survey_year** = 1999 - 2018. The designated year associated with the annual survey. In most cases, survey_year and year are the same. In rare cases, surveys are completed early in the year following the designated survey year. In these cases, survey_year will differ from year. <br>
**year** = 1999 - 2018. Year that the survey was conducted. <br>
**month** = 1 - 12. Month that the survey was conducted. <br>
**day** = 1 - 31. Day that the survey was conducted. <br>
**site** = One of 350 site codes. The unique site where the survey was performed (as defined in the site table). This site refers to a specific GPS location and is often associated with a geographic placename. Often, multiple site replicates will be associated with a single placename, and will be delineated with additional geographical or directional information (e.g. North/South/East/West/Central - N/S/E/W/CEN, Upcoast/Downcoast - UC/DC) <br>
**zone** = INNER, OUTER, OUTMID, INMID, MID or DEEP. A division of the site into 2 to 4 categories representing onshore-offshore stratification associated with targeted bottom depths for transects.
- INNER: Depth zone targeting roughly 5m of water depth, or the inner edge of the reef
- INMID: Depth zone targeting roughly 10m of water depth 
- MID: Depth zone targeting roughly 10-15m of water depth, used by VRG and in early years of PISCO
- OUTMID: Depth zone targeting roughly 15m of water depth 
- OUTER: Depth zone targeting roughly 20m of water depth 
- DEEP: Depth zone targeting roughly 25m of water depth, where present, used only by VRG

**transect** = It seems like this should only be 1 - 8, but there are **a number of other designations as well.** The unique transect replicate within each site and zone. <br>
**classcode** = One of 173 taxon codes. The unique taxonomic classification code that is being counted, as defined in the taxonomic table. This refers to a code that defines the Genus and Species that is identified, a code that represents a limited number of species that can't be narrowed down to one species, or in some cases family-level or higher order groupings. Generally, for invertebrates and algae, the classcode is composed of the first three letters of the genus and the first three letters of the species (e.g., CRYSTE for Cryptochiton stelleri), with some exceptions. <br>
**count** = The number of individuals of a given classcode and a given size (if applicable) per transect <br>
**size** = For Macrocystis pyrifera, this represents the number of individual stipes growing for each individual. For a select number of invertebrate species that are measured on swath transects, this represents the size (in centimeters) of the following: test diameter for urchins, length of longest arm for seastars, shell length for abalone, carapace length for lobsters, total turgid length for sea cucumbers. <br>
**disease** = For some years echinoderm disease was recorded on transects for select species. When systematic observation for disease was conducted, disease is indicated here. Where blank, disease was not evaluated.
- HEALTHY: Individual was inspected and no disease was observed
- YES: Some form of disease was observed
- MILD: Mild disease was observed
- SEVERE: Severe disease was observed
- WASTING: Wasting disease was observed
- BLACK SPOT: Black spot disease was observed

**depth** = Between 1.2 and 28 meters. Depth of the transect estimated by the diver. **Does this mean a dive computer was used?** <br>
**observer** = The diver who conducted the survey transect <br>
**notes** = Free text notes taken at the time of the sample, or added at the time of data entry. <br>
**site_name_old** = In cases when specific sites have been surveyed by multiple campuses using different site names, this variable indicates the alternative (historical) site name.
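
Two of the column notes above (survey_year differing from year, and transect labels outside 1 - 8) lend themselves to quick checks. The following is a minimal sketch on a made-up frame; the rows are placeholders, and the same expressions would be applied to `swath`.

```python
import pandas as pd

# Hypothetical rows mimicking the swath columns (not real records)
toy = pd.DataFrame({
    'survey_year': [2010, 2010, 2011],
    'year': [2010, 2011, 2011],
    'transect': ['1', '2', 'A'],
})

# Surveys completed in the calendar year after the designated survey year
late = toy[toy['survey_year'] != toy['year']]

# Transect labels outside the expected 1-8 range
expected = {str(n) for n in range(1, 9)}
odd_transects = sorted(set(toy['transect']) - expected)

print(len(late), odd_transects)
```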

**HOW DO I KNOW IF AN ORGANISM WAS SUBSAMPLED? HAVE THE COUNTS BEEN FIXED TO REFLECT COUNTS/60M2?**

In [49]:
## Load upc data

filename = 'MLPA_kelpforest_upc.1.csv'
upc = pd.read_csv(filename, encoding='ANSI', dtype={'transect':str, 'notes':str, 'site_name_old':str})

print(upc.shape)
upc.head()

(169686, 17)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,category,classcode,count,pct_cov,depth,observer,notes,site_name_old
0,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,ANEM,1,1.1,6.1,JARED FIGURSKI,,
1,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,BARROC,10,11.1,6.1,JARED FIGURSKI,,
2,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,BRANCH,31,34.4,6.1,JARED FIGURSKI,,
3,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,BROWN,2,2.2,6.1,JARED FIGURSKI,,
4,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,COMTUN,1,1.1,6.1,JARED FIGURSKI,,


### Additional column notes

**category** = The code indicating which of four types of data are collected on each point. Categories may be SUBSTRATE, RELIEF, COVER and SUPERLAYER. Percent cover should be calculated for each category separately. For SUPERLAYER, the total number of points will not necessarily equal the targeted number of points surveyed in each of the other categories (i.e. superlayer is specific to certain organisms that are only included if present).
- COVER: Primary space holding, non-mobile organism or bare substrate type present at each point, cover types are defined in taxonomic table
- RELIEF: Physical relief is measured as the greatest vertical relief that exists within a 1m wide section across the tape and a 0.5m section along that tape that is centered over the point. Relief categories can be 0-10cm, 10cm-1m, 1-2m, and >2m
- SUBSTRATE: Substrate type underlying each point. Substrate categories include bedrock, boulder, cobble, and sand and are further defined in the taxonomic table
- SUPERLAYER: Superlayer includes a small subset of specific organisms which may be ephemeral, and tend to create a layer over primary space holders. Examples include low-lying, very large-bladed macroalgae such as Laminaria farlowii, brittle stars, and drift algae and in the Northern region abalone are recorded as a superlayer when occupying the space at the point. Not recorded at all points, only where present

**pct_cov** = Percent cover, calculated by dividing the number of points of a given category and classcode by the total number of points of a category per transect. The percent cover of superlayer is calculated by dividing the number of points of a superlayer classcode by the total number of points in the cover category, since superlayers are not present at all points.
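
To make the denominator rule concrete, here is a small sketch of the pct_cov calculation for a single transect. The point counts are invented; with real data the totals would also need to be grouped by the columns that identify a transect.

```python
import pandas as pd

# Hypothetical point counts for one transect (illustrative values only)
toy = pd.DataFrame({
    'category': ['COVER', 'COVER', 'SUPERLAYER'],
    'classcode': ['BARROC', 'BRANCH', 'UNDLAMFAR'],
    'count': [10, 20, 3],
})

# Total points recorded in each category
totals = toy.groupby('category')['count'].sum()

# Superlayer rows borrow the COVER total as their denominator
denominator = toy['category'].replace({'SUPERLAYER': 'COVER'}).map(totals)
toy['pct_cov'] = (100 * toy['count'] / denominator).round(1)
print(toy)
```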

### Strategy

As with RCCA, each transect can be an **event**, and each organism observation can be an **occurrence**.

The **event** file should contain: eventID (from site, survey date, zone, transect), eventDate (from year, month, day), datasetID, locality (site), countryCode, decimalLat, decimalLon, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. Should I include the campus information somewhere? Observer?

The **occurrence** file should contain: eventID, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe disease), organismQuantity (count), organismQuantityType. **May try to include Cover data from UPC here, too, just with organismQuantityType as "percent cover."**

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Substrate and Relief (pct_cov values) can be recorded at the event level. Size can be recorded at the occurrence level.
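
As a rough illustration of the MoF layout, event-level substrate values could be reshaped into one row per measurement. This is only a sketch: the eventID, the substrate labels and the wording of measurementType and measurementMethod are placeholders, not decisions made in this notebook.

```python
import pandas as pd

# Hypothetical substrate summary for one event (illustrative values only)
toy = pd.DataFrame({
    'eventID': ['SITE_A_19990907_INNER_1'] * 2,
    'substrate': ['bedrock', 'sand'],
    'pct_cov': [80.0, 20.0],
})

mof = pd.DataFrame({
    'eventID': toy['eventID'],
    'occurrenceID': pd.NA,  # event-level measurement, so no occurrence is linked
    'measurementType': 'substrate: ' + toy['substrate'],
    'measurementValue': toy['pct_cov'],
    'measurementUnit': 'percent',
    'measurementMethod': 'uniform point contact',  # placeholder wording
})
print(mof)
```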

## Create occurrence file

### Get site names

In [50]:
## Load site table

filename = 'MLPA_kelpforest_site_table.1.csv'
sites = pd.read_csv(filename)

print(sites.shape)
sites.head()

(7458, 17)


Unnamed: 0,LTM_project_short_code,campus,method,survey_year,year,site,latitude,longitude,CA_MPA_Name_Short,site_designation,site_status,Secondary_MPA_Name,Secondary_site_designation,Secondary_site_status,BaselineRegion,LongTermRegion,MPA_priority_tier
0,LTM_Kelp_SRock,VRG,SBTL_SIZEFREQ_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
1,LTM_Kelp_SRock,VRG,SBTL_FISH_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
2,LTM_Kelp_SRock,VRG,SBTL_SWATH_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
3,LTM_Kelp_SRock,VRG,SBTL_UPC_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
4,LTM_Kelp_SRock,HSU,SBTL_UPC_HSU,2018,2018,ABALONE_POINT_1,39.6915,-123.8141,Ten Mile SMR,reference,reference,,,,North Coast,North Coast,I


There are 32 sites in the site table that do not have any swath data, and 36 sites in the site table that do not have any upc data. It's worth noting that PISMO_W and SAL_E, which did not appear in the fish transect data and also do not appear in the swath data, were sampled during upc surveys. 

Sites that have swath data but not upc data are:
- SMI_PRINCE_ISLAND_CEN
- SMI_PRINCE_ISLAND_N
- SRI_CARRINGTON_E
- SRI_CARRINGTON_CEN
- SRI_CARRINGTON_W
- SRI_BEE_ROCK_W

Sites that have upc data but not swath data are:
- PISMO_W
- SAL_E

**PISMO_W and SAL_E also have latitude and longitude = NaN**
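
Comparisons like the ones listed above can be reproduced with set differences. The sketch below uses placeholder site names; with the real tables the inputs would be `sites['site']`, `swath['site']` and `upc['site']`.

```python
import pandas as pd

# Hypothetical site lists (names are placeholders)
site_table = pd.Series(['SITE_A', 'SITE_B', 'SITE_C', 'SITE_D'])
swath_sites = pd.Series(['SITE_A', 'SITE_B', 'SITE_A'])
upc_sites = pd.Series(['SITE_B', 'SITE_C'])

swath_only = sorted(set(swath_sites) - set(upc_sites))  # swath data but no upc data
upc_only = sorted(set(upc_sites) - set(swath_sites))    # upc data but no swath data
no_swath = sorted(set(site_table) - set(swath_sites))   # in site table, never in swath

print(swath_only, upc_only, no_swath)
```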

As checked in the fish transect conversion script, lat and lon have been consistently assigned within a site. Additionally, sites have been consistently labeled either 'mpa' or 'reference.' Since, for the purpose of DwC, we're not interested in which sites were sampled when, I can simplify the site table to only contain relevant information: site, latitude, longitude, and site status. **Note that which campus is responsible for a given site has changed between years in a number of cases, so I'm leaving this information out for now.**
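
For reference, a consistency check of this kind might look like the sketch below, which counts distinct coordinate and status values per site. The table is invented, and this is not necessarily how the check was done in the fish transect script.

```python
import pandas as pd

# Hypothetical site rows; SITE_B deliberately has two latitudes
toy_sites = pd.DataFrame({
    'site': ['SITE_A', 'SITE_A', 'SITE_B', 'SITE_B'],
    'latitude': [36.62, 36.62, 34.00, 34.01],
    'longitude': [-121.90, -121.90, -119.42, -119.42],
    'site_status': ['mpa', 'mpa', 'reference', 'reference'],
})

# Distinct values per site; anything above 1 signals an inconsistency
# (nunique ignores NaN, so sites with missing coordinates report 0)
n_unique = toy_sites.groupby('site')[['latitude', 'longitude', 'site_status']].nunique()
inconsistent = n_unique[(n_unique > 1).any(axis=1)].index.tolist()

print(inconsistent)
```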

In [51]:
## Create simplified site table

site_summary = sites[['site', 'site_status', 'latitude', 'longitude']].copy()
site_summary.drop_duplicates(inplace=True)

print(site_summary.shape)
site_summary.head()

(382, 4)


Unnamed: 0,site,site_status,latitude,longitude
0,3 Palms East,reference,33.71762,-118.33215
4,ABALONE_POINT_1,reference,39.6915,-123.8141
15,ABALONE_POINT_2,reference,39.66502,-123.80435
26,ABALONE_POINT_3,reference,39.62877,-123.79658
33,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252


Some site names have spaces or ' - ' characters. I'll replace these in a sensible way.

In [52]:
# Replace ' ' and ' - ' in site names and add site_name column

site_name = [name.replace(' - ', '-') for name in site_summary['site']]
site_name = [name.replace(' ', '_') for name in site_name]
site_summary['site_name'] = site_name

site_summary.head()

Unnamed: 0,site,site_status,latitude,longitude,site_name
0,3 Palms East,reference,33.71762,-118.33215,3_Palms_East
4,ABALONE_POINT_1,reference,39.6915,-123.8141,ABALONE_POINT_1
15,ABALONE_POINT_2,reference,39.66502,-123.80435,ABALONE_POINT_2
26,ABALONE_POINT_3,reference,39.62877,-123.79658,ABALONE_POINT_3
33,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252,ANACAPA_ADMIRALS_CEN


### Convert

In [53]:
## Merge to add site_name (also lat, lon and site_status) to swath and upc tables

swath = swath.merge(site_summary, how='left', on='site')
upc = upc.merge(site_summary, how='left', on='site')
upc.head()

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,category,...,count,pct_cov,depth,observer,notes,site_name_old,site_status,latitude,longitude,site_name
0,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,...,1,1.1,6.1,JARED FIGURSKI,,,mpa,36.623586,-121.904196,HOPKINS_DC
1,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,...,10,11.1,6.1,JARED FIGURSKI,,,mpa,36.623586,-121.904196,HOPKINS_DC
2,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,...,31,34.4,6.1,JARED FIGURSKI,,,mpa,36.623586,-121.904196,HOPKINS_DC
3,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,...,2,2.2,6.1,JARED FIGURSKI,,,mpa,36.623586,-121.904196,HOPKINS_DC
4,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,...,1,1.1,6.1,JARED FIGURSKI,,,mpa,36.623586,-121.904196,HOPKINS_DC
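
A left merge silently duplicates rows if a site appears more than once in the lookup table, and leaves NaN coordinates for sites it cannot match. The sketch below shows one way to guard against both using the `validate` and `indicator` options of `merge`; the frames are placeholders.

```python
import pandas as pd

# Hypothetical observations and site lookup (illustrative values only)
obs = pd.DataFrame({'site': ['SITE_A', 'SITE_A', 'SITE_X'], 'count': [1, 2, 3]})
site_lookup = pd.DataFrame({'site': ['SITE_A', 'SITE_B'], 'latitude': [36.6, 34.0]})

# validate raises MergeError if any site is repeated in the lookup;
# indicator adds a _merge column recording where each row was matched
merged = obs.merge(site_lookup, how='left', on='site',
                   validate='many_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'site'].unique().tolist()
print(len(merged), unmatched)
```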


I'd like to include some of the upc cover data in the occurrence file - it seems like some of those occurrences should be findable on something like OBIS, and if I only put them in the MoF, they won't be. In order to do that, though, I'll have to sort out the non-biological cover types.

In [54]:
## Load species table

filename = 'MLPA_kelpforest_taxon_table.1.csv'
species = pd.read_csv(filename)

print(species.shape)
species.head()

(1336, 38)


Unnamed: 0,campus,sample_type,sample_subtype,classcode,orig_classcode,Kingdom,Phylum,Class,Order,Family,...,LOOKED2009,LOOKED2010,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018
0,HSU,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,no,no,no,no,no,yes,yes,no,yes,yes
1,UCSB,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,VRG,FISH,FISH,AARG,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
3,HSU,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no
4,UCSB,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no


In [55]:
## Select species for upc cover

cover_sp = species.loc[(species['sample_type'] == 'UPC') & (species['sample_subtype'].isin(['COVER', 'SUPERLAYER'])), 
                                                            ['sample_subtype', 'classcode', 'Kingdom', 'Phylum', 'Class', 'Order', 
                                                             'Family', 'Genus', 'Species', 'species_definition', 'common_name']]
cover_sp.drop_duplicates(inplace=True)

print(cover_sp.shape)
cover_sp.head()

(80, 11)


Unnamed: 0,sample_subtype,classcode,Kingdom,Phylum,Class,Order,Family,Genus,Species,species_definition,common_name
1013,COVER,AGLSTR,Animalia,Cnidaria,Hydrozoa,Leptothecata,Aglaopheniidae,Aglaophenia,struthionides,Aglaophenia struthionides,Ostrich-Plume Hydroid
1016,COVER,ANEM,Animalia,Cnidaria,Anthozoa,Actiniaria,,,,Actiniaria,Anemone
1034,COVER,BARNAC,Animalia,Arthropoda,Hexanauplia,,,,,Cirripedia,Barnacle
1038,COVER,BARROC,,,,,,,,Bare Rock,Bare Rock
1042,COVER,BARSAN,,,,,,,,Bare Sand,Bare Sand


Basically, I think I want to remove any species where Kingdom = NaN. That seems to sort out the non-biological cover types pretty well. 

```python
cover_sp[cover_sp['Kingdom'].isna()]
```

Once I've identified these classcodes, I can select all cover and superlayer records from upc, exclude the non-biological cover records, and append the result to swath. The remaining upc data will go in the MoF file.

In [56]:
## Get classcodes for non-biological cover and superlayer options

nonbio_cover = cover_sp.loc[cover_sp['Kingdom'].isna(), 'classcode']
nonbio_cover

1038      BARROC
1042      BARSAN
1100    DEADHOLD
1175         MUD
1226        SCUM
1229       SHELL
1285        UNID
1324      UNDDET
Name: classcode, dtype: object

In [57]:
## Select cover and superlayer records from upc, exclude nonbio_cover records, and append the result to swath

# Select relevant records
cover_records = upc[upc['category'].isin(['COVER', 'SUPERLAYER'])].copy()

# Exclude non-biological cover records
idx = cover_records[cover_records['classcode'].isin(nonbio_cover)].index
cover_records.drop(idx, inplace=True)

# Append to swath
swath = pd.concat([swath, cover_records], ignore_index=True)

In [58]:
## Create a df of remaining upc data for inclusion in MoF, then move on to creating occurrence file

# Get classcodes from UPC data corresponding to substrate, relief and nonbiological cover categories
upc_sp = species.loc[(species['sample_type'] == 'UPC') & species['Kingdom'].isna(), ['sample_subtype', 'classcode', 'species_definition']]
upc_sp.drop_duplicates(inplace=True)

# Select nonbiological upc data for MoF
nonbio_upc = upc[upc['classcode'].isin(upc_sp['classcode'])]
nonbio_upc.head()

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,transect,category,...,count,pct_cov,depth,observer,notes,site_name_old,site_status,latitude,longitude,site_name
1,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,COVER,...,10,11.1,6.1,JARED FIGURSKI,,,mpa,36.623586,-121.904196,HOPKINS_DC
10,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,RELIEF,...,2,2.2,6.1,JARED FIGURSKI,,,mpa,36.623586,-121.904196,HOPKINS_DC
11,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,RELIEF,...,4,4.4,6.1,JARED FIGURSKI,,,mpa,36.623586,-121.904196,HOPKINS_DC
12,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,RELIEF,...,33,36.7,6.1,JARED FIGURSKI,,,mpa,36.623586,-121.904196,HOPKINS_DC
13,UCSC,SBTL_UPC_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,1,RELIEF,...,51,56.7,6.1,JARED FIGURSKI,,,mpa,36.623586,-121.904196,HOPKINS_DC


In [60]:
## Pad month and day as needed

# zfill(2) left-pads single-digit values with a zero; lists are kept so later cells can index by position
paddedDay = swath['day'].astype(str).str.zfill(2).tolist()
paddedMonth = swath['month'].astype(str).str.zfill(2).tolist()
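
An alternative worth noting: `pd.to_datetime` can assemble dates directly from columns named year, month and day, which gives both the compact form used in eventID and an ISO 8601 form suitable for eventDate. A minimal sketch with made-up dates:

```python
import pandas as pd

# Hypothetical survey dates (illustrative values only)
toy = pd.DataFrame({'year': [1999, 2005], 'month': [9, 11], 'day': [7, 23]})

# to_datetime builds a datetime Series from year/month/day columns
dates = pd.to_datetime(toy[['year', 'month', 'day']])

compact = dates.dt.strftime('%Y%m%d').tolist()  # form used inside eventID
iso = dates.dt.strftime('%Y-%m-%d').tolist()    # ISO 8601 form for eventDate

print(compact, iso)
```

A side benefit of this route is that impossible dates raise an error instead of passing through as strings.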

In [106]:
## Create eventID

eventID = [swath['site_name'].iloc[i] + '_' + str(swath['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + swath['zone'].iloc[i] + '_' +
           swath['transect'].iloc[i].replace(' ', '') for i in range(swath.shape[0])]
swath_occ = pd.DataFrame({'eventID':eventID})

swath_occ.head()

Unnamed: 0,eventID
0,HOPKINS_DC_19990907_INNER_1
1,HOPKINS_DC_19990907_INNER_1
2,HOPKINS_DC_19990907_INNER_1
3,HOPKINS_DC_19990907_INNER_1
4,HOPKINS_DC_19990907_INNER_1


In [107]:
## Add occurrenceID

# Create survey_date column in swath
swath['survey_date'] = [str(swath['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(swath.shape[0])]

# Groupby to create occurrenceID
swath_occ['occurrenceID'] = swath.groupby(['site', 'survey_date', 'zone', 'transect'])['classcode'].cumcount()+1
swath_occ['occurrenceID'] = swath_occ['eventID'] + '_occ' + swath_occ['occurrenceID'].astype(str)

swath_occ.head()

Unnamed: 0,eventID,occurrenceID
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5
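
Since occurrenceID has to be unique across the whole file, it may be worth confirming that after numbering. The sketch below shows the cumcount pattern on a made-up frame, grouping on eventID itself for simplicity, followed by a uniqueness check.

```python
import pandas as pd

# Hypothetical records across two events (illustrative values only)
toy = pd.DataFrame({
    'eventID': ['EV1', 'EV1', 'EV1', 'EV2'],
    'classcode': ['CRYSTE', 'MACPYRAD', 'MACPYRAD', 'PATMIN'],
})

# cumcount numbers rows within each group starting at 0, so add 1
within_event = toy.groupby('eventID').cumcount() + 1
toy['occurrenceID'] = toy['eventID'] + '_occ' + within_event.astype(str)

print(toy['occurrenceID'].tolist(), toy['occurrenceID'].is_unique)
```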


In [103]:
## Get relevant records from species table

swath_sp = species.loc[species['sample_type'].isin(['SWATH', 'UPC']), ['classcode', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family', 
                                                                       'Genus', 'Species', 'species_definition', 'common_name']]
swath_sp.drop_duplicates(inplace=True)

# Forward fill NaN values row-wise to get the last known major rank when Genus/Species is not known and use to create scientific_name column
swath_sp.ffill(axis=1, inplace=True)

scientific_name = []
for i in range(swath_sp.shape[0]):
    if swath_sp['Genus'].iloc[i] == swath_sp['Species'].iloc[i]:
        scientific_name.append(swath_sp['Genus'].iloc[i])
    else:
        scientific_name.append(swath_sp['Genus'].iloc[i] + ' ' + swath_sp['Species'].iloc[i])
        
swath_sp['scientific_name'] = scientific_name

swath_sp

Unnamed: 0,classcode,Kingdom,Phylum,Class,Order,Family,Genus,Species,species_definition,common_name,scientific_name
590,ALAMAR,Chromista,Ochrophyta,Phaeophyceae,Laminariales,Alariaceae,Alaria,marginata,Alaria marginata,Alaria,Alaria marginata
593,COSCOS,Chromista,Ochrophyta,Phaeophyceae,Laminariales,Costariaceae,Costaria,costata,Costaria costata,Costaria,Costaria costata
596,DESLIG,Chromista,Ochrophyta,Phaeophyceae,Desmarestiales,Desmarestiaceae,Desmarestia,ligulata,Desmarestia ligulata,Acid Weed,Desmarestia ligulata
597,DICSPP,Chromista,Ochrophyta,Phaeophyceae,Laminariales,Costariaceae,Dictyoneurum,californicum/reticulatum,Dictyoneurum californicum/reticulatum,Dictyoneurum,Dictyoneurum californicum/reticulatum
599,EGRMEN,Chromista,Ochrophyta,Phaeophyceae,Laminariales,Lessoniaceae,Egregia,menziesii,Egregia menziesii,Egregia,Egregia menziesii
...,...,...,...,...,...,...,...,...,...,...,...
1327,UNDHAL,Animalia,Mollusca,Gastropoda,Lepetellida,Haliotidae,Haliotis,spp.,Haliotis,Abalone superlayer,Haliotis spp.
1328,UNDJUVLAM,Chromista,Ochrophyta,Phaeophyceae,Laminariales,Laminariales,Laminariales,Laminariales,Laminariales Recruit,Laminariales Recruit,Laminariales
1330,UNDLAMFAR,Chromista,Ochrophyta,Phaeophyceae,Laminariales,Laminariaceae,Laminaria,farlowii,Laminaria farlowii,Laminaria farlowii Sub-Canopy (Layer Above Pri...,Laminaria farlowii
1333,UNDNEOFIM,Chromista,Ochrophyta,Phaeophyceae,Laminariales,Agaraceae,Neoagarum,fimbriatum,Neoagarum fimbriatum,Sieve/colander kelp,Neoagarum fimbriatum
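
The row-wise forward fill is what lets higher-rank taxa fall through to the name column: when Genus and Species are missing, both end up holding the last known rank, so the two are equal and only one copy is kept. A toy illustration with two made-up rows:

```python
import pandas as pd

# Hypothetical taxon rows: one known only to order, one known to species
toy = pd.DataFrame({
    'Order': ['Actiniaria', 'Laminariales'],
    'Family': [None, 'Laminariaceae'],
    'Genus': [None, 'Laminaria'],
    'Species': [None, 'farlowii'],
})

# Fill each row left to right with the last non-missing rank
filled = toy.ffill(axis=1)

# Equal Genus and Species means the row was filled from a higher rank
names = [g if g == s else f'{g} {s}'
         for g, s in zip(filled['Genus'], filled['Species'])]
print(names)
```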


**Note** that there are 12 classcodes in swath_sp that do not appear in the swath or biological upc data. They are:
- LAMSAC
- SARPAL
- STEDIOAD
- APAPRI
- ENTDOF
- FELCAL
- PELMUL
- POLATR
- PSEMON
- SCYORE
- CAPMAT
- UNDNEOFIM

```python
for code in swath_sp['classcode'].unique():
    if code not in swath['classcode'].unique():
        if code not in nonbio_upc['classcode'].unique():
            print(code)
```

**There is also one classcode from the swath data that is NOT in the species table: LEPHEXAD**
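
Both directions of this comparison can also be done with boolean masks rather than loops. A sketch with placeholder series standing in for the observed classcodes and the species table:

```python
import pandas as pd

# Hypothetical classcodes: observed in the data vs. listed in the lookup
observed = pd.Series(['CRYSTE', 'LEPHEXAD', 'MACPYRAD', 'CRYSTE'])
lookup = pd.Series(['CRYSTE', 'MACPYRAD', 'LAMSAC'])

# Codes in the data that the lookup cannot translate
missing_from_lookup = observed[~observed.isin(lookup)].unique().tolist()

# Codes in the lookup that never occur in the data
unused_in_data = lookup[~lookup.isin(observed)].tolist()

print(missing_from_lookup, unused_in_data)
```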

In [104]:
## Map classcodes to species definitions (usually scientific names) and classcodes to common names

code_to_sci_dict = dict(zip(swath_sp['classcode'], swath_sp['scientific_name']))
code_to_com_dict = dict(zip(swath_sp['classcode'], swath_sp['common_name'])) # Use this for either occurrenceRemarks or identificationRemarks

In [108]:
## Create scientificName

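# Assign the mapped result directly; calling replace(inplace=True) on a selected column can trigger chained-assignment warnings in recent pandas
swath_occ['scientificName'] = swath['classcode'].replace(code_to_sci_dict)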

swath_occ.head()

Unnamed: 0,eventID,occurrenceID,scientificName
0,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ1,Cryptochiton stelleri
1,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ2,Dictyoneurum californicum/reticulatum
2,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ3,Macrocystis pyrifera
3,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ4,Macrocystis pyrifera
4,HOPKINS_DC_19990907_INNER_1,HOPKINS_DC_19990907_INNER_1_occ5,Macrocystis pyrifera


In [109]:
## Get unique scientific names for lookup in WoRMS

names = swath_occ['scientificName'].unique()
names

array(['Cryptochiton stelleri', 'Dictyoneurum californicum/reticulatum',
       'Macrocystis pyrifera', 'Patiria miniata', 'Pisaster giganteus',
       'Stephanocystis osmundacea', 'Henricia leviuscula',
       'Mediaster aequalis', 'Apostichopus spp', 'Diodora aspera',
       'Pomaulax gibberosus', 'Pycnopodia helianthoides',
       'Tethya californiana', 'Urticina spp',
       'Loxorhynchus/Scyra crispatus/acutifrons', 'Neobernaya spadicea',
       'Orthasterias koehleri', 'Pugettia spp',
       'Strongylocentrotus purpuratus', 'Dermasterias imbricata',
       'Pisaster brevispinus', 'Crassadoma gigantea',
       'Mesocentrotus franciscanus', 'Pisaster ochraceus',
       'Pterygophora californica', 'Metridium spp',
       'Loxorhynchus grandis', 'Haliotis rufescens', 'Cancridae',
       'Nereocystis luetkeana', 'Kelletia kelletii', 'Haliotis spp',
       'Laminaria spp', 'Aplysia californica', 'Eisenia arborea',
       'Stylaster californicus', 'Anthopleura spp',
       'Styela monte

**Note** that there are a number of names that are not specific at the species level:
- Dictyoneurum californicum/reticulatum (matched to Dictyoneurum)
- Loxorhynchus/Scyra crispatus/acutifrons (Loxorhynchus crispatus or Scyra acutifrons, shared subfamily Pisinae)
- Megastrea/Lithopoma/Pomaulax/Astraea spp (Megastraea sp. - misspelled! - or Lithopoma sp. or Pomaulax sp. or Astraea sp., shared subfamily Turbininae)
- Urticina columbiana/mcpeaki (matched to Urticina)
- Lopholithodes mandtii/foraminatus (matched to Lopholithodes)
- Ceratostoma/Pteropurpura spp (Ceratostoma spp. or Pteropurpura spp., shared subfamily Ocenebrinae)
- Diopatra/Chaetopterus spp (Diopatra spp. or Chaetopterus spp., shared class Polychaeta)
- Thylacodes/Petaloconchus squamigerus/montereyensis (Thylacodes squamigerus or Petaloconchus montereyensis, shared family Vermetidae)
- Dodecaceria fewkesi/concharum (matched to Dodecaceria)

**Add these to identificationQualifier column.**
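
One possible shape for this step is a lookup from each ambiguous name to the matched taxon plus a qualifier string. In the sketch below the matched genera come from the list above, but the qualifier wording and the dictionary itself are placeholders, not the final mapping.

```python
import pandas as pd

# Matched taxa follow the list above; qualifier wording is a placeholder
ambiguous = {
    'Dictyoneurum californicum/reticulatum': ('Dictyoneurum', 'californicum or reticulatum'),
    'Urticina columbiana/mcpeaki': ('Urticina', 'columbiana or mcpeaki'),
}

occ = pd.DataFrame({
    'scientificName': ['Dictyoneurum californicum/reticulatum', 'Patiria miniata'],
})

# Fill the qualifier first, while the original names are still in place
occ['identificationQualifier'] = occ['scientificName'].map(
    lambda n: ambiguous[n][1] if n in ambiguous else None)
occ['scientificName'] = occ['scientificName'].map(
    lambda n: ambiguous[n][0] if n in ambiguous else n)
print(occ)
```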

There are also some descriptions that lack a scientific name:
- LEPHEXAD (as noted above, this classcode is not in the species table at all)
- TEST (this is the code for an urchin test, i.e., a dead animal. **I will exclude these records for now.**)
    - There are 66 records in the swath data with this classcode
- NO_ORG (this is the code for a completely empty transect, **I think. I will exclude these records for now.**)
    - Here, as with the fish transect data, **the NO_ORG observation occurs in the same event as other observations.** You would think that a NO_ORG entry would be the only entry for a given event - see example below. Also note that **this observation has count = 0.** 
- UNIDSP (this is the classcode for an unidentified mobile invertebrate species, which has been given the scientific_name **Animalia**.)

```python
# NO_ORG observation in the same event as other observations
swath_occ[swath_occ['eventID'] == 'SRI_JOHNSONS_LEE_NORTH_W_20050907_INNER_1']
```
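
To see how widespread the mixed NO_ORG events are, and to drop the codes that have no usable name, something like the following sketch could be used. The frame is made up; the same logic would run on the real classcode and eventID columns.

```python
import pandas as pd

# Hypothetical records (illustrative values only)
toy = pd.DataFrame({
    'eventID': ['EV1', 'EV1', 'EV2', 'EV3'],
    'classcode': ['NO_ORG', 'PATMIN', 'NO_ORG', 'TEST'],
    'count': [0, 3, 0, 1],
})

# Events where NO_ORG appears alongside at least one other record
is_no_org = toy['classcode'] == 'NO_ORG'
event_size = toy.groupby('eventID')['classcode'].transform('size')
mixed_events = toy.loc[is_no_org & (event_size > 1), 'eventID'].unique().tolist()

# Drop records without a usable scientific name
occurrences = toy[~toy['classcode'].isin(['NO_ORG', 'TEST'])]

print(mixed_events, occurrences['classcode'].tolist())
```

Note that an event such as EV2 here disappears from the occurrence records entirely, so it would still need to be retained when building the event file.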

In [110]:
## Match species in WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Dictyoneurum californicum/reticulatum checking:  Dictyoneurum
Url didn't work for Apostichopus spp checking:  Apostichopus
Url didn't work for Urticina spp checking:  Urticina
Url didn't work for Loxorhynchus/Scyra crispatus/acutifrons checking:  Loxorhynchus/Scyra
Url didn't work, check name:  Loxorhynchus/Scyra
Url didn't work for Pugettia spp checking:  Pugettia
Url didn't work for Metridium spp checking:  Metridium
Url didn't work for Haliotis spp checking:  Haliotis
Url didn't work for Laminaria spp checking:  Laminaria
Url didn't work for Anthopleura spp checking:  Anthopleura
Url didn't work for Asteroidea spp checking:  Asteroidea
Url didn't work for Solaster spp checking:  Solaster
Url didn't work, check name:  LEPHEXAD
Url didn't work for Pisaster spp checking:  Pisaster
Url didn't work for Laminariales spp checking:  Laminariales
Url didn't work, check name:  TEST
Url didn't work for Cucumaria spp checking:  Cucumaria
Url didn't work for Megastrea/Lithopo