# MARINe: Sea star and Katharina count conversion

MARINe (the Multi-Agency Rocky Intertidal Network) conducts long-term monitoring at sites along the coast of North America, with each site surveyed approximately annually. At sites where they are sufficiently abundant, ochre sea stars (*Pisaster ochraceus*) are monitored either within band transects or in irregularly sized plots. Other organisms, such as other sea star species, are often counted as well if they are present within the plots. The protocol document says that there are 3-5 replicate plots per site, but **Rani said there were most commonly 6 replicate plots.** These plots are permanent, but they were chosen to target high densities of sea stars. They are intended to track changes in density and size frequency within a site, so the resulting data should not be used for comparisons between sites.

Within each plot, sea stars are counted and measured. Measurements are taken with calipers from the center of the disk to the tip of the longest ray. They are recorded to the nearest 5 mm for small sea stars (< 10 mm arm length) and to the nearest 10 mm for larger sea stars. Sizes often have to be estimated because of the orientation or inaccessibility of the sea star. Early surveys binned sea stars into size classes, and the size class definitions changed once over time. If there are many sea stars within a particular plot, only a subset of them may be measured.

At sites where sea stars are not abundant, a timed search protocol is used to document rarity. **I don't know if these data have been included in this data set; I suspect not.**

**Notes:**
- Katharina counts do not appear to be discussed anywhere in the methods document. **What are the methods for these surveys?**

**Resources:**
- DataONE link: https://data.piscoweb.org/metacatui/view/doi:10.6085/AA/marine_ltm.4.7

In [1]:
## Import packages

import pandas as pd
import numpy as np
from datetime import datetime

import pyworms

In [2]:
## Import WoRMS functions - although I may use Pyworms this time

# import sys
# sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

# import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Load data

data = pd.read_csv('seastarkat_size_count_zeroes_totals_download.csv')
print(data.shape)
data.head()

(17979, 25)


Unnamed: 0,group_code,marine_site_name,site_code,latitude,longitude,marine_sort_order,marine_common_year,season_name,marine_season_code,marine_common_season,...,total,num_plots_sampled,method_code,state_province,georegion,bioregion,mpa_designation,mpa_region,island,last_updated
0,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,6660,2006,Spring,SP06,101,...,50,1,GSES,California,CA South,Government Point to Mexico,SMCA,South Coast,Mainland,2021-03-26 13:06
1,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,6660,2006,Fall,FA06,103,...,25,1,GSES,California,CA South,Government Point to Mexico,SMCA,South Coast,Mainland,2021-03-26 13:06
2,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,6660,2007,Spring,SP07,105,...,74,1,GSES,California,CA South,Government Point to Mexico,SMCA,South Coast,Mainland,2021-03-26 13:06
3,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,6660,2007,Fall,FA07,107,...,66,1,GSES,California,CA South,Government Point to Mexico,SMCA,South Coast,Mainland,2021-03-26 13:06
4,CSUF,Crystal Cove,CRCO,33.570782,-117.83773,6660,2008,Spring,SP08,109,...,24,1,GSES,California,CA South,Government Point to Mexico,SMCA,South Coast,Mainland,2021-03-26 13:06


In [4]:
## Site table

site = pd.read_csv('MARINe_site_table.txt')
print(site.shape)
site.head()

(269, 17)


Unnamed: 0,marine_sort_order,site_code,marine_site_name,county,island,state_province,country,latitude,longitude,pisco_code,mpa_region,mpa_lt_region,mpa_designation,mpa_name,LTM_project_short_code,georegion,bioregion
0,1500,GRA,Graves Harbor,Skagway-Hoonah-Angoon,Mainland,Alaska,United States,58.270821,-136.73148,IGHAXX,,,,,,AK,Alaska to British Columbia
1,1600,YAK,Yakobi,Skagway-Hoonah-Angoon,Yakobi,Alaska,United States,58.081501,-136.55453,IYAKXX,,,,,,AK,Alaska to British Columbia
2,1700,PMA,Port Mary,Sitka,Kruzoff,Alaska,United States,57.154228,-135.75452,IPMAXX,,,,,,AK,Alaska to British Columbia
3,1720,SAGE,Sage Rock,Sitka,Baranoff,Alaska,United States,57.048698,-135.32314,ISAGXX,,,,,,AK,Alaska to British Columbia
4,1730,KAY,Kayak Island,Sitka,Kayak,Alaska,United States,57.022499,-135.35387,IKAYXX,,,,,,AK,Alaska to British Columbia


In [5]:
## Species table

species = pd.read_csv('MARINe_species_table.txt')
print(species.shape)
species.head()

(50, 12)


Unnamed: 0,lumping_code,marine_species_name,marine_species_definition,taxonomic_id,taxonomic_source,kingdom,phylum,class,order,family,genus,species
0,ANTELE,anthopleura elegantissima; anthopleura sola,anthopleura elegantissima/sola; may also inclu...,100696.0,World Register of Marine Species: www.marines...,Animalia,Cnidaria,Anthozoa,Actiniaria,Actiniidae,Anthopleura,
1,ARTCOR,articulated corallines,erect; jointed; calcified; red algae of the Fa...,143691.0,World Register of Marine Species: www.marines...,Plantae,(Division) Rhodophyta,Florideophyceae,Corallinales,Corallinaceae,,
2,BARNAC,barnacles,any species of barnacle; used for transects wh...,106033.0,World Register of Marine Species: www.marines...,Animalia,Arthropoda,Thecostraca,Sessilia,,,
3,CHITON,chitons,any species of chiton,55.0,World Register of Marine Species: www.marines...,Animalia,Mollusca,Polyplacophora,,,,
4,CHOCAN,chondracanthus canaliculatus,chondracanthus canaliculatus,371723.0,World Register of Marine Species: www.marines...,Plantae,(Division) Rhodophyta,Florideophyceae,Gigartinales,Gigartinaceae,Chondracanthus,Chondracanthus canaliculatus


## Conversion

### Occurrence
Here, an **event** could be defined as a survey, identified by site code, year, and season. However, some sites have multiple surveys within the same year and season, so that combination is not always unique. A safer key is site code plus minimum and maximum survey date.

```python
# Instances where multiple surveys occurred within a given site, year, and season
out = data.groupby(['site_code', 'marine_common_year', 'season_name'])['min_survey_date'].nunique()
out[out > 1]

# Examine the dates for an example:
data[(data['site_code'] == 'BOA') & (data['marine_common_year'] == 2003)].iloc[:, 0:20]
```

An **occurrence** can be defined as an individual organism observed during an event.

Measurements only pertain to occurrences, so I don't need a separate event file.

In [122]:
## eventID

occ = pd.DataFrame({'eventID':data['site_code'] + '_' + data['min_survey_date'] + '_' + data['max_survey_date']})
print(occ.shape)
occ.head()

(17979, 1)


Unnamed: 0,eventID
0,CRCO_2006-03-26_2006-03-26
1,CRCO_2006-11-04_2006-11-04
2,CRCO_2007-04-14_2007-04-14
3,CRCO_2007-10-27_2007-10-27
4,CRCO_2008-03-16_2008-03-16


In [123]:
## eventDate

occ['eventDate'] = data['min_survey_date'] + '/' + data['max_survey_date']
occ.head()

Unnamed: 0,eventID,eventDate
0,CRCO_2006-03-26_2006-03-26,2006-03-26/2006-03-26
1,CRCO_2006-11-04_2006-11-04,2006-11-04/2006-11-04
2,CRCO_2007-04-14_2007-04-14,2007-04-14/2007-04-14
3,CRCO_2007-10-27_2007-10-27,2007-10-27/2007-10-27
4,CRCO_2008-03-16_2008-03-16,2008-03-16/2008-03-16
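
Darwin Core's `eventDate` accepts either a single ISO 8601 date or a start/end interval, so same-day surveys do not strictly need the `date/date` form. A minimal sketch of collapsing them, assuming the two survey date columns are ISO-formatted strings as in the output above. The toy frame (including the second row's dates) and the name `event_date` are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the survey date columns (values are illustrative only)
toy = pd.DataFrame({
    'min_survey_date': ['2006-03-26', '2003-05-01'],
    'max_survey_date': ['2006-03-26', '2003-05-03'],
})

# Start with the interval form for every row
interval = toy['min_survey_date'] + '/' + toy['max_survey_date']

# Keep the interval where the dates differ; otherwise fall back to the single date
same_day = toy['min_survey_date'] == toy['max_survey_date']
event_date = interval.where(~same_day, toy['min_survey_date'])

print(event_date.tolist())
```

Whether to collapse same-day intervals is a style choice; the interval form used in this notebook is also valid.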


In [124]:
## datasetID

occ['datasetID'] = 'MARINe LTM - sea star and katharina counts and sizes'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID
0,CRCO_2006-03-26_2006-03-26,2006-03-26/2006-03-26,MARINe LTM - sea star and katharina counts and...
1,CRCO_2006-11-04_2006-11-04,2006-11-04/2006-11-04,MARINe LTM - sea star and katharina counts and...
2,CRCO_2007-04-14_2007-04-14,2007-04-14/2007-04-14,MARINe LTM - sea star and katharina counts and...
3,CRCO_2007-10-27_2007-10-27,2007-10-27/2007-10-27,MARINe LTM - sea star and katharina counts and...
4,CRCO_2008-03-16_2008-03-16,2008-03-16/2008-03-16,MARINe LTM - sea star and katharina counts and...


In [125]:
## Merge with site table to get locality, county, stateProvince, countryCode, decimalLat, decimalLon

# Add site code to occ
occ['site_code'] = data['site_code']

# Define columns to merge from site table
site_cols = [
    'site_code',
    'marine_site_name',
    'county',
    'state_province',
    'country',
    'latitude',
    'longitude',
]

# Define DwC terms for these columns after merge
dwc_cols = [
    'eventID',
    'eventDate',
    'datasetID',
    'locality',
    'county',
    'stateProvince',
    'countryCode',
    'decimalLatitude',
    'decimalLongitude',
]

# Merge
occ = occ.merge(site[site_cols], how='left', on='site_code')
occ.drop(columns=['site_code'], inplace=True)
occ.columns = dwc_cols
print(occ.shape)
occ.head()

(17979, 9)


Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude
0,CRCO_2006-03-26_2006-03-26,2006-03-26/2006-03-26,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
1,CRCO_2006-11-04_2006-11-04,2006-11-04/2006-11-04,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
2,CRCO_2007-04-14_2007-04-14,2007-04-14/2007-04-14,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
3,CRCO_2007-10-27_2007-10-27,2007-10-27/2007-10-27,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
4,CRCO_2008-03-16_2008-03-16,2008-03-16/2008-03-16,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,United States,33.570782,-117.83773
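
The merge above relies on two things that are easy to break silently: `site_code` being unique in the site table, and the column order matching `dwc_cols`. A hedged sketch of guarding against both, using toy frames (`occ_toy` and `site_toy` are hypothetical stand-ins, not the real tables):

```python
import pandas as pd

# Toy stand-ins for occ and the site table (values are illustrative only)
occ_toy = pd.DataFrame({
    'eventID': ['A_1', 'A_2', 'B_1'],
    'site_code': ['A', 'A', 'B'],
})
site_toy = pd.DataFrame({
    'site_code': ['A', 'B'],
    'marine_site_name': ['Site A', 'Site B'],
    'country': ['United States', 'United States'],
})

# validate='m:1' raises a MergeError if site_code is duplicated in the site table
merged = occ_toy.merge(site_toy, how='left', on='site_code', validate='m:1')

# Rename by name, so a change in column order cannot mislabel a column
merged = merged.drop(columns=['site_code']).rename(columns={
    'marine_site_name': 'locality',
    'country': 'countryCode',
})

# A left merge on a unique key should preserve the row count
assert len(merged) == len(occ_toy)
print(merged.columns.tolist())
```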


**Note** that ideally we would check county and state names against the [Getty Thesaurus of Geographic Names](http://www.getty.edu/research/tools/vocabularies/tgn/). But there are too many values for me to do by hand, and I don't know if they have an API. **I could look into this.**

In [126]:
## Clean countryCode

occ['countryCode'] = occ['countryCode'].str.replace('United States', 'US')
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude
0,CRCO_2006-03-26_2006-03-26,2006-03-26/2006-03-26,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
1,CRCO_2006-11-04_2006-11-04,2006-11-04/2006-11-04,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
2,CRCO_2007-04-14_2007-04-14,2007-04-14/2007-04-14,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
3,CRCO_2007-10-27_2007-10-27,2007-10-27/2007-10-27,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773
4,CRCO_2008-03-16_2008-03-16,2008-03-16/2008-03-16,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773


In [127]:
## coordinateUncertaintyInMeters

occ['coordinateUncertaintyInMeters'] = 250
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters
0,CRCO_2006-03-26_2006-03-26,2006-03-26/2006-03-26,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250
1,CRCO_2006-11-04_2006-11-04,2006-11-04/2006-11-04,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250
2,CRCO_2007-04-14_2007-04-14,2007-04-14/2007-04-14,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250
3,CRCO_2007-10-27_2007-10-27,2007-10-27/2007-10-27,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250
4,CRCO_2008-03-16_2008-03-16,2008-03-16/2008-03-16,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250


**Is this a reasonable value for coordinateUncertaintyInMeters? It's what PISCO and Reef Check used.**

In [128]:
## Add minimumDepthInMeters, maximumDepthInMeters, samplingProtocol and samplingEffort

# Depth
occ['minimumDepthInMeters'] = 0
occ['maximumDepthInMeters'] = 0

# Protocol
occ['samplingProtocol'] = data['method_code']
occ['samplingProtocol'] = occ['samplingProtocol'].replace({
    'BT25':'band transect 2m x 5m',
    'GSES':'General search entire site',
    'IP':'Irregular plot',
    'TS30':'Timed search 30 minutes'
})

# Effort
occ['samplingEffort'] = data['num_plots_sampled'].astype(str) + ' plot(s)'

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
0,CRCO_2006-03-26_2006-03-26,2006-03-26/2006-03-26,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s)
1,CRCO_2006-11-04_2006-11-04,2006-11-04/2006-11-04,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s)
2,CRCO_2007-04-14_2007-04-14,2007-04-14/2007-04-14,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s)
3,CRCO_2007-10-27_2007-10-27,2007-10-27/2007-10-27,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s)
4,CRCO_2008-03-16_2008-03-16,2008-03-16/2008-03-16,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s)


**I assume for intertidal work both minimum and maximum depth should be 0 m?**

Also, **note** that for the protocol 'general search entire site,' the effort is always 1. **I assume this means 1 site was searched? Should I change the wording to sites instead of plots?**
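
If the answer turns out to be yes, one way to vary the unit by method is sketched below. It assumes that only `GSES` should be reported in sites and that every other method stays in plots; that mapping is my assumption and has not been confirmed against the protocol. The toy values are illustrative:

```python
import pandas as pd

# Toy stand-in for the method and effort columns (values are illustrative only)
toy = pd.DataFrame({
    'method_code': ['GSES', 'IP', 'BT25'],
    'num_plots_sampled': [1, 5, 3],
})

# Assumption: general searches are counted in sites, everything else in plots
unit = toy['method_code'].map({'GSES': ' site(s)'}).fillna(' plot(s)')
effort = toy['num_plots_sampled'].astype(str) + unit

print(effort.tolist())
```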

In [129]:
## occurrenceID

occ['occurrenceID'] = data.groupby(['site_code', 'min_survey_date', 'max_survey_date'])['species_code'].cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_' + occ['occurrenceID'].astype(str)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort,occurrenceID
0,CRCO_2006-03-26_2006-03-26,2006-03-26/2006-03-26,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s),CRCO_2006-03-26_2006-03-26_1
1,CRCO_2006-11-04_2006-11-04,2006-11-04/2006-11-04,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s),CRCO_2006-11-04_2006-11-04_1
2,CRCO_2007-04-14_2007-04-14,2007-04-14/2007-04-14,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s),CRCO_2007-04-14_2007-04-14_1
3,CRCO_2007-10-27_2007-10-27,2007-10-27/2007-10-27,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s),CRCO_2007-10-27_2007-10-27_1
4,CRCO_2008-03-16_2008-03-16,2008-03-16/2008-03-16,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s),CRCO_2008-03-16_2008-03-16_1
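
Since `occurrenceID` has to be unique across the data set, it may be worth asserting that directly. A small sketch on toy data (the rows and the names `event_id`, `seq`, and `occurrence_id` are illustrative), built the same way as the cell above:

```python
import pandas as pd

# Toy stand-in: two rows share an event, two rows are alone in theirs
toy = pd.DataFrame({
    'site_code': ['CRCO', 'CRCO', 'CRCO', 'BOA'],
    'min_survey_date': ['2006-03-26', '2006-03-26', '2006-11-04', '2003-05-01'],
    'max_survey_date': ['2006-03-26', '2006-03-26', '2006-11-04', '2003-05-03'],
})

event_id = toy['site_code'] + '_' + toy['min_survey_date'] + '_' + toy['max_survey_date']

# Number the rows within each event, starting at 1
keys = ['site_code', 'min_survey_date', 'max_survey_date']
seq = toy.groupby(keys).cumcount() + 1
occurrence_id = event_id + '_' + seq.astype(str)

# Fail loudly if any identifier is repeated
assert occurrence_id.is_unique
print(occurrence_id.tolist())
```

Note that identifiers built this way depend on row order, so re-sorting the input before this step would assign different numbers to the same records.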


In [130]:
## scientificName

# Get species codes
occ['scientificName'] = data['species_code']

# Create scientificName column in species table
sp = species[['lumping_code', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']].copy()
sp = sp.replace('NULL', np.nan, regex=True)

# Use the most specific rank available: species, else genus, else family, and so on.
# Each step builds on the previous result rather than restarting from sp['species'].
sp['scientificName'] = sp['species'].combine_first(sp['genus'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['family'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['order'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['class'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['phylum'])
sp['scientificName'] = sp['scientificName'].combine_first(sp['kingdom'])

# Build dictionary mapping codes to names
sp_dict = dict(zip(sp['lumping_code'], sp['scientificName']))

# Replace codes with names in occ
occ['scientificName'] = occ['scientificName'].replace(sp_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort,occurrenceID,scientificName
0,CRCO_2006-03-26_2006-03-26,2006-03-26/2006-03-26,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s),CRCO_2006-03-26_2006-03-26_1,Pisaster ochraceus
1,CRCO_2006-11-04_2006-11-04,2006-11-04/2006-11-04,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s),CRCO_2006-11-04_2006-11-04_1,Pisaster ochraceus
2,CRCO_2007-04-14_2007-04-14,2007-04-14/2007-04-14,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s),CRCO_2007-04-14_2007-04-14_1,Pisaster ochraceus
3,CRCO_2007-10-27_2007-10-27,2007-10-27/2007-10-27,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s),CRCO_2007-10-27_2007-10-27_1,Pisaster ochraceus
4,CRCO_2008-03-16_2008-03-16,2008-03-16/2008-03-16,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,0,0,General search entire site,1 plot(s),CRCO_2008-03-16_2008-03-16_1,Pisaster ochraceus
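
The intent of the `scientificName` step is to take the lowest taxonomic rank that is filled in for each code. A compact sketch of that fallback as a loop, using three rows copied in abbreviated form from the species table output above (`sp_toy` and `ranks` are illustrative names, and only three ranks are included to keep it short):

```python
import numpy as np
import pandas as pd

# Abbreviated rows from the species table shown earlier
sp_toy = pd.DataFrame({
    'lumping_code': ['CHOCAN', 'ANTELE', 'CHITON'],
    'class': ['Florideophyceae', 'Anthozoa', 'Polyplacophora'],
    'genus': ['Chondracanthus', 'Anthopleura', np.nan],
    'species': ['Chondracanthus canaliculatus', np.nan, np.nan],
})

# Ranks ordered from most to least specific
ranks = ['species', 'genus', 'class']

# Start from the most specific rank and fill remaining gaps from each coarser rank in turn
name = sp_toy[ranks[0]]
for rank in ranks[1:]:
    name = name.combine_first(sp_toy[rank])
sp_toy['scientificName'] = name

print(sp_toy[['lumping_code', 'scientificName']])
```

The key point is that each `combine_first` call is applied to the running result, so a value found at a finer rank is never overwritten by a coarser one.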


In [131]:
## Get unique names

names = occ['scientificName'].unique()
names

array(['Pisaster ochraceus', 'LEPTAS', 'PATMIN', 'PISBRE', 'PISGIG',
       'EVATRO', 'HENSPP', 'PYCHEL', 'ORTKOE', 'DERIMB', 'KATTUN'],
      dtype=object)

**Well, that's a problem. Most of the codes are not in the species table.**

Fortunately, I know what most of these are off the top of my head, but **I will need to get Rani to update the species table.**
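
A quick way to list the unmapped codes before doing the replacement is a set difference between the codes used in the data and the codes in the species table. This is a sketch on toy series; the specific codes in `data_codes` and `table_codes` are placeholders chosen for illustration:

```python
import pandas as pd

# Toy stand-ins for data['species_code'] and species['lumping_code']
data_codes = pd.Series(['PISOCH', 'LEPTAS', 'KATTUN', 'PISOCH'])
table_codes = pd.Series(['PISOCH', 'ANTELE', 'CHITON'])

# Codes that appear in the data but have no row in the species table
missing = sorted(set(data_codes) - set(table_codes))
print(missing)
```

Running a check like this up front would flag the gap before any codes leak into `scientificName`.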

In [132]:
## Add names manually to sp_dict ----- THIS SHOULD CHANGE ONCE RANI UPDATES SPECIES TABLE

sp_dict['LEPTAS'] = 'Leptasterias'
sp_dict['PATMIN'] = 'Patiria miniata'
sp_dict['PISBRE'] = 'Pisaster brevispinus'
sp_dict['PISGIG'] = 'Pisaster giganteus'
sp_dict['EVATRO'] = 'Evasterias troschelii'
sp_dict['HENSPP'] = 'Henricia'
sp_dict['PYCHEL'] = 'Pycnopodia helianthoides'
sp_dict['ORTKOE'] = 'Orthasterias koehleri'
sp_dict['DERIMB'] = 'Dermasterias imbricata'
sp_dict['KATTUN'] = 'Katharina tunicata'

# Replace codes with names in occ
occ['scientificName'] = occ['scientificName'].replace(sp_dict)

names = occ['scientificName'].unique()
names

array(['Pisaster ochraceus', 'Leptasterias', 'Patiria miniata',
       'Pisaster brevispinus', 'Pisaster giganteus',
       'Evasterias troschelii', 'Henricia', 'Pycnopodia helianthoides',
       'Orthasterias koehleri', 'Dermasterias imbricata',
       'Katharina tunicata'], dtype=object)

In [133]:
## Check names on WoRMS

results = pyworms.aphiaRecordsByMatchNames(names.tolist())
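# A name with no match comes back as an empty list, so check each result rather than the overall length
if all(len(r) > 0 for r in results): print('All names found.')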

# Unpack results
worms_out = pd.json_normalize(results[0])
for i in range(1, len(results)):
    norm = pd.json_normalize(results[i])
    worms_out = pd.concat([worms_out, norm])
worms_out

All names found.


Unnamed: 0,AphiaID,url,scientificname,authority,status,unacceptreason,taxonRankID,rank,valid_AphiaID,valid_name,...,genus,citation,lsid,isMarine,isBrackish,isFreshwater,isTerrestrial,isExtinct,match_type,modified
0,240755,http://www.marinespecies.org/aphia.php?p=taxde...,Pisaster ochraceus,"(Brandt, 1835)",accepted,,220,Species,240755,Pisaster ochraceus,...,Pisaster,"Mah, C.L. (2021). World Asteroidea Database. P...",urn:lsid:marinespecies.org:taxname:240755,1,0.0,0.0,0.0,0.0,exact,2008-10-22T04:46:32.530Z
0,123222,http://www.marinespecies.org/aphia.php?p=taxde...,Leptasterias,"Verrill, 1866",accepted,,180,Genus,123222,Leptasterias,...,Leptasterias,"Mah, C.L. (2021). World Asteroidea Database. L...",urn:lsid:marinespecies.org:taxname:123222,1,0.0,0.0,0.0,0.0,exact,2008-12-31T03:51:57.887Z
0,382131,http://www.marinespecies.org/aphia.php?p=taxde...,Patiria miniata,"(Brandt, 1835)",accepted,,220,Species,382131,Patiria miniata,...,Patiria,"Mah, C.L. (2021). World Asteroidea Database. P...",urn:lsid:marinespecies.org:taxname:382131,1,0.0,0.0,0.0,0.0,exact,2014-10-15T20:23:40.910Z
0,240757,http://www.marinespecies.org/aphia.php?p=taxde...,Pisaster brevispinus,"(Stimpson, 1857)",accepted,,220,Species,240757,Pisaster brevispinus,...,Pisaster,"Mah, C.L. (2021). World Asteroidea Database. P...",urn:lsid:marinespecies.org:taxname:240757,1,0.0,0.0,0.0,0.0,exact,2008-10-22T04:46:32.530Z
0,240758,http://www.marinespecies.org/aphia.php?p=taxde...,Pisaster giganteus,"(Stimpson, 1857)",accepted,,220,Species,240758,Pisaster giganteus,...,Pisaster,"Mah, C.L. (2021). World Asteroidea Database. P...",urn:lsid:marinespecies.org:taxname:240758,1,0.0,0.0,0.0,0.0,exact,2008-10-22T04:46:32.530Z
0,255040,http://www.marinespecies.org/aphia.php?p=taxde...,Evasterias troschelii,"(Stimpson, 1862)",accepted,,220,Species,255040,Evasterias troschelii,...,Evasterias,"Mah, C.L. (2021). World Asteroidea Database. E...",urn:lsid:marinespecies.org:taxname:255040,1,0.0,0.0,0.0,0.0,exact,2017-06-19T16:23:34.257Z
0,123276,http://www.marinespecies.org/aphia.php?p=taxde...,Henricia,"Gray, 1840",accepted,,180,Genus,123276,Henricia,...,Henricia,"Mah, C.L. (2021). World Asteroidea Database. H...",urn:lsid:marinespecies.org:taxname:123276,1,0.0,0.0,0.0,0.0,exact,2008-01-19T09:31:50.063Z
0,240764,http://www.marinespecies.org/aphia.php?p=taxde...,Pycnopodia helianthoides,"(Brandt, 1835)",accepted,,220,Species,240764,Pycnopodia helianthoides,...,Pycnopodia,"Mah, C.L. (2021). World Asteroidea Database. P...",urn:lsid:marinespecies.org:taxname:240764,1,0.0,0.0,0.0,0.0,exact,2008-03-28T05:01:10.653Z
0,255048,http://www.marinespecies.org/aphia.php?p=taxde...,Orthasterias koehleri,"(deLoriol, 1897)",accepted,,220,Species,255048,Orthasterias koehleri,...,Orthasterias,"Mah, C.L. (2021). World Asteroidea Database. O...",urn:lsid:marinespecies.org:taxname:255048,1,0.0,0.0,0.0,0.0,exact,2019-12-30T03:47:59.320Z
0,240771,http://www.marinespecies.org/aphia.php?p=taxde...,Dermasterias imbricata,"(Grube, 1857)",accepted,,220,Species,240771,Dermasterias imbricata,...,Dermasterias,"Mah, C.L. (2021). World Asteroidea Database. D...",urn:lsid:marinespecies.org:taxname:240771,1,0.0,0.0,0.0,0.0,exact,2009-01-30T03:03:03Z


**So, there's a lot that could be done here.** One thing that comes to mind is handling the output more elegantly, and handling what happens when a name isn't found. Right now, an empty list (i.e. `[]`) is returned for that name. If I query a name that's almost right (like *Pisaster gigantea* instead of *Pisaster giganteus*), it will still match, but the `match_type` column will say `near_2` instead of `exact`. Finally, I'm not sure what the difference is between `scientificname` and `valid_name`.
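
A hedged sketch of what that handling could look like. To stay runnable offline it uses a hand-built `results` list rather than a live query; the mock assumes one inner list per queried name, in the same order as the query, with an empty list for a name that was not found. The records are trimmed to a few fields and the query names are illustrative:

```python
import pandas as pd

# Mock of the nested structure described above (not real WoRMS output)
names = ['Pisaster giganteus', 'Pisaster gigantea', 'Not a taxon']
results = [
    [{'scientificname': 'Pisaster giganteus', 'match_type': 'exact'}],
    [{'scientificname': 'Pisaster giganteus', 'match_type': 'near_2'}],
    [],
]

# Names that returned no records at all
not_found = [name for name, recs in zip(names, results) if not recs]

# Flatten the records, keeping track of which query produced each one
records = [{'query': name, **rec} for name, recs in zip(names, results) for rec in recs]
matches = pd.DataFrame(records)

# Queries that matched, but not exactly; these deserve a manual look
inexact = matches.loc[matches['match_type'] != 'exact', 'query'].tolist()

print('Not found:', not_found)
print('Inexact:', inexact)
```

Keeping the original query alongside each record also makes it possible to merge back on the queried name rather than on `scientificname`, which matters when the two differ.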

In [134]:
## Merge to add remaining taxonomy columns

# Indicate desired columns to merge
worms_cols = [
    'AphiaID',
    'scientificname',
    'kingdom',
    'phylum',
    'class',
    'order',
    'family', 
    'genus',
    'lsid'
]

# Give desired dwc column names
dwc_cols = occ.columns.to_list()
dwc_cols.extend([
    'taxonID',
    'kingdom',
    'phylum',
    'class',
    'order',
    'family', 
    'genus',
    'scientificNameID',
])

# Merge
occ = occ.merge(worms_out[worms_cols], how='left', left_on='scientificName', right_on='scientificname')
occ = occ.drop(columns=['scientificname'])
occ.columns = dwc_cols
print(occ.shape)
occ.head()

(17979, 24)


Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,occurrenceID,scientificName,taxonID,kingdom,phylum,class,order,family,genus,scientificNameID
0,CRCO_2006-03-26_2006-03-26,2006-03-26/2006-03-26,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,...,CRCO_2006-03-26_2006-03-26_1,Pisaster ochraceus,240755,Animalia,Echinodermata,Asteroidea,Forcipulatida,Asteriidae,Pisaster,urn:lsid:marinespecies.org:taxname:240755
1,CRCO_2006-11-04_2006-11-04,2006-11-04/2006-11-04,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,...,CRCO_2006-11-04_2006-11-04_1,Pisaster ochraceus,240755,Animalia,Echinodermata,Asteroidea,Forcipulatida,Asteriidae,Pisaster,urn:lsid:marinespecies.org:taxname:240755
2,CRCO_2007-04-14_2007-04-14,2007-04-14/2007-04-14,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,...,CRCO_2007-04-14_2007-04-14_1,Pisaster ochraceus,240755,Animalia,Echinodermata,Asteroidea,Forcipulatida,Asteriidae,Pisaster,urn:lsid:marinespecies.org:taxname:240755
3,CRCO_2007-10-27_2007-10-27,2007-10-27/2007-10-27,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,...,CRCO_2007-10-27_2007-10-27_1,Pisaster ochraceus,240755,Animalia,Echinodermata,Asteroidea,Forcipulatida,Asteriidae,Pisaster,urn:lsid:marinespecies.org:taxname:240755
4,CRCO_2008-03-16_2008-03-16,2008-03-16/2008-03-16,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,...,CRCO_2008-03-16_2008-03-16_1,Pisaster ochraceus,240755,Animalia,Echinodermata,Asteroidea,Forcipulatida,Asteriidae,Pisaster,urn:lsid:marinespecies.org:taxname:240755


In [137]:
## Change taxonID to int

occ['taxonID'] = occ['taxonID'].astype(int)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,county,stateProvince,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,...,occurrenceID,scientificName,taxonID,kingdom,phylum,class,order,family,genus,scientificNameID
0,CRCO_2006-03-26_2006-03-26,2006-03-26/2006-03-26,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,...,CRCO_2006-03-26_2006-03-26_1,Pisaster ochraceus,240755,Animalia,Echinodermata,Asteroidea,Forcipulatida,Asteriidae,Pisaster,urn:lsid:marinespecies.org:taxname:240755
1,CRCO_2006-11-04_2006-11-04,2006-11-04/2006-11-04,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,...,CRCO_2006-11-04_2006-11-04_1,Pisaster ochraceus,240755,Animalia,Echinodermata,Asteroidea,Forcipulatida,Asteriidae,Pisaster,urn:lsid:marinespecies.org:taxname:240755
2,CRCO_2007-04-14_2007-04-14,2007-04-14/2007-04-14,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,...,CRCO_2007-04-14_2007-04-14_1,Pisaster ochraceus,240755,Animalia,Echinodermata,Asteroidea,Forcipulatida,Asteriidae,Pisaster,urn:lsid:marinespecies.org:taxname:240755
3,CRCO_2007-10-27_2007-10-27,2007-10-27/2007-10-27,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,...,CRCO_2007-10-27_2007-10-27_1,Pisaster ochraceus,240755,Animalia,Echinodermata,Asteroidea,Forcipulatida,Asteriidae,Pisaster,urn:lsid:marinespecies.org:taxname:240755
4,CRCO_2008-03-16_2008-03-16,2008-03-16/2008-03-16,MARINe LTM - sea star and katharina counts and...,Crystal Cove,Orange,California,US,33.570782,-117.83773,250,...,CRCO_2008-03-16_2008-03-16_1,Pisaster ochraceus,240755,Animalia,Echinodermata,Asteroidea,Forcipulatida,Asteriidae,Pisaster,urn:lsid:marinespecies.org:taxname:240755
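
One caveat with `astype(int)`: it raises an error if any row failed to match and therefore has a missing `taxonID`. A minimal sketch of the alternative, pandas' nullable `Int64` dtype, on a toy series (the missing value is inserted deliberately for illustration):

```python
import numpy as np
import pandas as pd

# Toy taxonID column as it looks after a left merge with one unmatched row
taxon_id = pd.Series([240755.0, np.nan, 123222.0])

# taxon_id.astype(int) would raise here, because NaN has no integer representation.
# The nullable Int64 dtype keeps integers as integers and missing values as <NA>.
taxon_id_nullable = taxon_id.astype('Int64')

print(taxon_id_nullable)
```

If every name is expected to match, the plain `astype(int)` failure is arguably useful as an alarm, so this is only worth adopting if missing IDs are acceptable.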


Next: nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe some notes)

Maybe: organismQuantity, organismQuantityType, individualCount?
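
A rough sketch of how a few of these terms might be filled in. It assumes that the `total` column is the count for each row and that a zero means the taxon was looked for and not found; both are assumptions I have not confirmed against the metadata. The toy values are illustrative:

```python
import pandas as pd

# Toy stand-in for data['total'] (values are illustrative only)
toy = pd.DataFrame({'total': [50, 0, 24]})

# Assumption: a zero total is a true absence record
toy['occurrenceStatus'] = toy['total'].gt(0).map({True: 'present', False: 'absent'})

# Field counts by observers fit the HumanObservation basis of record
toy['basisOfRecord'] = 'HumanObservation'

# Carry the count through as a quantity with an explicit type
toy['organismQuantity'] = toy['total']
toy['organismQuantityType'] = 'individuals'

print(toy)
```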