# North coast kelp data conversion

Since Laura only shared with me an MS Access DB file (.mdb), I'll have to start by extracting the data from there.

Resources:
- [Wiki for pyodbc](https://github.com/mkleehammer/pyodbc/wiki)

In [1]:
## Imports

import pandas as pd
import numpy as np
import csv, pyodbc
import pickle
import datetime

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Connect to db and retrieve data

Note that (as described in the Wiki) the original database filename did not work because it included underscores. I renamed 'Abalone_DiveSurveys_EH_06242020.mdb' to 'AbaloneDiveSurveys-06242020.mdb' to fix this problem.

In [49]:
## Check available access driver

[x for x in pyodbc.drivers() if x.startswith('Microsoft Access Driver')]

['Microsoft Access Driver (*.mdb, *.accdb)']

In [87]:
## Function for connecting to db

def get_data_from_db(tbl_name):
    """Connects to .mdb file and extracts data from tbl_name. Saves data in a .csv file and pickled column names in a '_cols.data' file."""
    
    # Connect
    conn_str = (
        r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
        r'DBQ=C:\Users\dianalg\PycharmProjects\PythonScripts\MPA data integration\North coast kelp\AbaloneDiveSurveys-06242020.mdb;'
        )
    cnxn = pyodbc.connect(conn_str)
    crsr = cnxn.cursor()
    
    # Get column names
    cols = [row.column_name for row in crsr.columns(table=tbl_name)]
    
    # Get rows
    SQL = 'SELECT * FROM ' + tbl_name + ';'
    rows = crsr.execute(SQL).fetchall()
    
    # Close connection
    crsr.close()
    cnxn.close()
    
    # Save data (newline='' stops the csv module from writing blank rows on Windows)
    with open(tbl_name + '.csv', 'w', newline='') as file:
        csv_writer = csv.writer(file)
        csv_writer.writerows(rows)
        
    # Save column names
    with open(tbl_name + '_cols.data', 'wb') as file:
        pickle.dump(cols, file)

**Note** that the table names in this database are:
- Paste Errors
- Switchboard Items
- Tbl_Diver_Lookup
- Tbl_Format_Lookup
- Tbl_New_size
- Tbl_SppSeen
- Tbl_Survey_Lookup
- tblCounts
- tblHabitat
- tblNewGrowth
- tblQuadrat
- tblSite
- TblSpecies
- tblSubstrate
- tblSurvey
- tblSwimGroups
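
Several of these names contain spaces (e.g. 'Paste Errors'), which would break the `'SELECT * FROM ' + tbl_name` query in `get_data_from_db`. Below is a minimal sketch of one way to guard against that, assuming Access SQL's square-bracket quoting for identifiers. `quote_access_name` and `build_select` are hypothetical helpers, not part of this notebook or of pyodbc.

```python
def quote_access_name(name):
    """Wrap a table name in square brackets so names with spaces are valid in Access SQL."""
    # Brackets inside the name are rejected here to keep the quoting simple
    if '[' in name or ']' in name:
        raise ValueError('Brackets are not allowed in table names: ' + repr(name))
    return '[' + name + ']'


def build_select(tbl_name):
    """Build a SELECT statement that retrieves every row of tbl_name."""
    return 'SELECT * FROM ' + quote_access_name(tbl_name) + ';'


print(build_select('Paste Errors'))
print(build_select('tblSite'))
```

The four tables extracted below have no spaces in their names, so this only matters if tables such as 'Paste Errors' or 'Switchboard Items' are ever needed.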

In [104]:
## Extract data from .mdb file

tables = ['tblSite', 'tblSurvey', 'tblCounts', 'tblSpecies']

for table in tables:
    get_data_from_db(table)

## Load data

In [3]:
## Function to load data

def load_table(tbl_name):
    """Takes tbl_name (a string) and loads saved data from that table."""
    
    # Get filenames
    col_name = tbl_name + '_cols.data'
    data_name = tbl_name + '.csv'
    
    # Retrieve column names
    with open(col_name, 'rb') as file:
        cols = pickle.load(file)
        
    # Load data
    data = pd.read_csv(data_name, header=None, names=cols, encoding='ANSI')
    return data

In [4]:
## Load data

site = load_table('tblSite')
survey = load_table('tblSurvey')
count = load_table('tblCounts')
species = load_table('tblSpecies')

In [5]:
## Tidy survey data

# Remove columns that are all nan; there are no rows that are all nan
survey.drop(columns=survey.columns[survey.isna().all()], inplace=True)

# Can probably drop these columns: DIVER, Orientation, Buddy, RANGE, TIDEHEIGHT, SLAT, SLONG, ELAT, ELONG, SLAT_old, SLONG_old, ELAT_old, ELONG_old, Format, DISTANCE, COMMENTS
survey.drop(columns=['DIVER (LEFT FOR TRANSECT)', 'Orientation', 'Buddy (RIGHT FOR TRANSECT)', 'RANGE', 'TIDEHEIGHT', 'SLAT', 'SLONG', 'ELAT', 'ELONG', 'SLAT_old', 'SLONG_old',
                     'ELAT_old', 'ELONG_old', 'Format', 'DISTANCE', 'COMMENTS'], inplace=True)

# We are only interested in the emergent survey type at this stage
survey = survey[survey['TYPE'] == 'EMERGENT']

# View
print(survey.shape)
survey.head()

(455, 16)


Unnamed: 0,Survey_ID,SiteID,Survey_Num,SURVEY,DATE,TYPE,TIME_of_Day,NUMBER,Avg Depth,Min_DEPTH,Max_Depth,SLAT_DD,SLONG_DD,ELAT_DD,ELONG_DD,TIME_MIN
37,1232,BR,BR99-01,Transect - 30m (Emergent),1999-07-29 00:00:00,EMERGENT,,CAVE01,0.0,39.0,,,,,,
38,1243,BR,BR99-02,Transect - 30m (Emergent),1999-07-29 00:00:00,EMERGENT,,CAVE02,0.0,52.0,,,,,,
39,1254,BR,BR99-03,Transect - 30m (Emergent),1999-07-29 00:00:00,EMERGENT,,CAVE03,0.0,27.0,,,,,,
40,1258,BR,BR99-04,Transect - 30m (Emergent),1999-07-29 00:00:00,EMERGENT,,CAVE04,0.0,25.0,,,,,,
41,1259,BR,BR99-05,Transect - 30m (Emergent),1999-07-29 00:00:00,EMERGENT,,CAVE05,0.0,20.0,,,,,,


**Note** that by filtering for TYPE = 'EMERGENT', we are discarding the majority of the data (all but 455 records out of ~ 3400). **Is this really the right thing to do?**

Just for documentation, this eliminates the following values under SURVEY:
- 'Transect - 30m (Rapid Emergent)'
- 'Transect - 20mx2m (Emergent)'
- nan
- 'ARM'
- 'Transect - 30mx2m (Emergent)'
- 'Swim'
- 'Transect - 5m (Invasive)'
- 'Transect - 20mx5m (Emergent)'
- 'Transect - 16mx2m (Emergent)'
- 'Transect - 1/4m'
- 'Transect - 1/4M'
- 'Transect - running transect'
- 'Transect - Running Transect'

Only 'Transect - 30m (Emergent)' is left. 
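
To document what the TYPE filter removes, the discarded rows can be tallied by SURVEY value. The sketch below uses a toy stand-in for the survey table; the rows and the 'SWIM' TYPE value are hypothetical and only illustrate the approach.

```python
import pandas as pd

# Toy stand-in for the survey table (hypothetical rows, not real data)
toy_survey = pd.DataFrame({
    'Survey_Num': ['BR99-01', 'XX00-01', 'XX00-02', 'XX00-03'],
    'SURVEY': ['Transect - 30m (Emergent)', 'Swim', 'ARM', 'Swim'],
    'TYPE': ['EMERGENT', 'SWIM', None, 'SWIM'],
})

# Rows the TYPE filter would discard (a missing TYPE also fails the equality test)
discarded = toy_survey[toy_survey['TYPE'] != 'EMERGENT']

# Count discarded rows per SURVEY value, keeping NaN as its own category
discarded_counts = discarded['SURVEY'].value_counts(dropna=False)
print(discarded_counts)
```

Running the same two lines on the unfiltered `survey` table would give the number of records lost per survey type, not just the list of names.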

**Also note** that there are some weird values in the COUNT column of the count data, including:
- 'RG18-C3-'
- 'P'
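
One way to locate such values is to coerce the column to numeric and look at what fails to parse. This is a sketch on a toy column; the numeric entries are made up, and only 'RG18-C3-' and 'P' come from the real data.

```python
import pandas as pd

# Toy COUNT column mixing numbers, the odd values noted above, and a missing entry
toy_count = pd.Series(['0', '12', 'RG18-C3-', 'P', None], name='COUNT')

# Coerce to numeric: anything that cannot be parsed becomes NaN
as_number = pd.to_numeric(toy_count, errors='coerce')

# Values that were present in the raw column but failed to parse
bad_values = toy_count[as_number.isna() & toy_count.notna()]
print(bad_values.tolist())
```

Separating "present but unparseable" from "missing" matters here, because the count table also contains genuinely empty COUNT values.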

## Convert

There is very minimal metadata for this data set, so I'm not sure what a lot of the columns mean. I'm just going to try to work through a basic conversion process, and see where questions and problems arise. **I'm going to start by assuming that this data set can be summarized by an occurrence file only**, containing: eventID, eventDate, datasetID, locality, localityRemarks (if needed), countryCode, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (if needed), occurrenceRemarks (if needed), individualCount (or organismQuantity and organismQuantityType), minimumDepthInMeters, maximumDepthInMeters, samplingProtocol and samplingEffort.

### eventID

As with other similar surveys, it seems reasonable to assume that each transect is an event. **I don't know what the broader study organization is for this, i.e. number of sites, inside/outside MPAs, number of transects per site, what was looked for, what was measured.** The metadata states that the Survey_Num field is composed of the site location, year and site ID. Transect numbers should only be assigned for rapid assessment surveys, which were filtered out because they're not type EMERGENT. 

That said, each record in the survey table has a unique Survey_Num, so hopefully that can be used as an eventID. **For now I'm going to assume that the Survey_Num format is the site ID + year + transect number. Note** that there are a number of Survey_Nums that do not fit this format, though, including:
- 'PA71C-01'
- 'PA86dfg-13'
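
A regular expression can flag the Survey_Nums that don't fit the assumed format. The pattern below is my guess at that format (2-3 uppercase letters, 2-digit year, hyphen, 2-digit transect number), not something taken from the metadata, and `fits_assumed_format` is a hypothetical helper.

```python
import re

# Assumed format: site ID (2-3 uppercase letters) + 2-digit year + '-' + 2-digit transect number
SURVEY_NUM_PATTERN = re.compile(r'[A-Z]{2,3}\d{2}-\d{2}')


def fits_assumed_format(survey_num):
    """Return True if the whole string matches the assumed Survey_Num format."""
    return SURVEY_NUM_PATTERN.fullmatch(survey_num) is not None


examples = ['BR99-01', 'PA71C-01', 'PA86dfg-13']
odd_ones = [s for s in examples if not fits_assumed_format(s)]
print(odd_ones)
```

Applied to `survey['Survey_Num']`, this would give the full list of exceptions to ask Laura about.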

In [6]:
## eventID - as with other similar surveys, I'll assume the event is a transect

occ = pd.DataFrame({'eventID':survey['Survey_Num']})
occ.head()

Unnamed: 0,eventID
37,BR99-01
38,BR99-02
39,BR99-03
40,BR99-04
41,BR99-05


In [7]:
## eventDate

occ['eventDate'] = survey['DATE']

# format
eventDate = [datetime.datetime.strptime(dt, '%Y-%m-%d %H:%M:%S').date().isoformat() for dt in occ['eventDate']]
occ['eventDate'] = eventDate
occ.head()

Unnamed: 0,eventID,eventDate
37,BR99-01,1999-07-29
38,BR99-02,1999-07-29
39,BR99-03,1999-07-29
40,BR99-04,1999-07-29
41,BR99-05,1999-07-29


There are 71 unique survey dates in this data set. **Note** that the earliest of these is 1971-09-01 and the latest of these is 2001-08-22. **This makes no sense, seeing as Laura was talking about having entered the 2017 data over the summer. It must be the 'EMERGENT' filtering step. But I'll let it be for now.**

In [8]:
## datasetID

occ['datasetID'] = 'North coast kelp emergent transects'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID
37,BR99-01,1999-07-29,North coast kelp emergent transects
38,BR99-02,1999-07-29,North coast kelp emergent transects
39,BR99-03,1999-07-29,North coast kelp emergent transects
40,BR99-04,1999-07-29,North coast kelp emergent transects
41,BR99-05,1999-07-29,North coast kelp emergent transects


In [9]:
## locality

occ['SiteID'] = survey['SiteID']
occ = occ.merge(site, how='left', on='SiteID')
occ.rename(columns={'SITE':'locality'}, inplace=True)

## ------ TODO: Drop SiteID, but for now leave it in in case I need to merge on it again.

print(occ.shape)
occ.head()

(455, 5)


Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML)
1,BR99-02,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML)
2,BR99-03,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML)
3,BR99-04,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML)
4,BR99-05,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML)


**Note** that for most of the kelp forest survey data sets, there's a localityRemarks column indicating whether each site was inside or outside an MPA. I don't seem to have this information here, though.

In [10]:
## countryCode

occ['countryCode'] = 'US'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US
1,BR99-02,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US
2,BR99-03,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US
3,BR99-04,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US
4,BR99-05,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US


**The site table doesn't contain lat/lon information, and the survey table seems to contain weird forms of it.** I received the following information from Laura in June:

| Site           | ID  | Latitude  | Longitude  |
|----------------|-----|-----------|------------|
| Albion Bay     | AB  | 39 22.900 | 123 77.881 |
| Caspar Cove    | CC  | 39 36.556 | 123 82.362 |
| Fort Ross      | FR  | 38 50.797 | 123 23.851 |
| Ocean Cove     | OC  | 38 55.363 | 123 30.688 |
| Point Arena    | PA  | 38 54.783 | 123 42.835 |
| Point Cabrillo | PC  | 39 20.752 | 123 49.628 |
| Russian Gulch  | RG  | 39 32.687 | 123 80.782 |
| Salt Point     | SP  | 38 33.972 | 123 20.172 |
| Sea Ranch      | SER | 38 42.468 | 123 27.302 |
| Timber Cove    | TC  | 38 53.476 | 123 28.192 |
| Todds Point    | TP  | 38 43.195 | 123 81.780 |
| Van Damme      | VD  | 39 16.044 | 123 47.662 |

**Note** that something is clearly wrong here. I assume these coordinates are in degrees + decimal minutes, but several of the decimal minute entries are > 60. This shouldn't be.
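
The out-of-range entries can be picked out programmatically. The sketch below checks the longitude minutes from the table above and shows a degrees + decimal minutes conversion that refuses invalid minutes. `dm_to_dd` is a hypothetical helper; the `west_or_south` flag reflects the fact that these sites lie at west longitudes, which Darwin Core expects as negative decimalLongitude values.

```python
def dm_to_dd(degrees, minutes, west_or_south=False):
    """Convert degrees + decimal minutes to decimal degrees.

    Raises ValueError if minutes fall outside [0, 60).
    """
    if not 0 <= minutes < 60:
        raise ValueError('minutes out of range: ' + str(minutes))
    dd = degrees + minutes / 60
    return -dd if west_or_south else dd


# Longitude minutes copied from the table above, keyed by site ID
lon_min = {'AB': 77.881, 'CC': 82.362, 'FR': 23.851, 'OC': 30.688,
           'PA': 42.835, 'PC': 49.628, 'RG': 80.782, 'SP': 20.172,
           'SER': 27.302, 'TC': 28.192, 'TP': 81.780, 'VD': 47.662}

# Sites whose longitude minutes cannot be valid decimal minutes
flagged = [site_id for site_id, minutes in lon_min.items() if not 0 <= minutes < 60]
print(flagged)

# A valid entry converts normally (Fort Ross longitude)
fort_ross_lon = dm_to_dd(123, 23.851, west_or_south=True)
print(fort_ross_lon)
```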

In [11]:
## Create a digital copy of the table Laura sent

# Define columns
site_name = ['Albion Bay', 'Caspar Cove', 'Fort Ross', 'Ocean Cove', 'Point Arena', 'Point Cabrillo', 'Russian Gulch', 'Salt Point', 'Sea Ranch', 'Timber Cove', 'Todds Point', 'Van Damme']
site_id = ['AB', 'CC', 'FR', 'OC', 'PA', 'PC', 'RG', 'SP', 'SER', 'TC', 'TP', 'VD']
lat_deg = [39, 39, 38, 38, 38, 39, 39, 38, 38, 38, 38, 39]
lat_min = [22.900, 36.556, 50.797, 55.363, 54.783, 20.752, 32.687, 33.972, 42.468, 53.476, 43.195, 16.044]
lon_deg = [123]*12
lon_min = [77.881, 82.362, 23.851, 30.688, 42.835, 49.628, 80.782, 20.172, 27.302, 28.192, 81.780, 47.662]

# Create df
site_lat_lon = pd.DataFrame({'Site':site_name,
                            'ID':site_id,
                            'Lat_deg':lat_deg,
                            'Lat_min':lat_min,
                            'Lon_deg':lon_deg,
                            'Lon_min':lon_min})

# Convert lat, lons to decimal degrees; these sites are all at west longitudes, so longitude is negative
site_lat_lon['Lat_dd'] = site_lat_lon['Lat_deg'] + site_lat_lon['Lat_min']/60
site_lat_lon['Lon_dd'] = -(site_lat_lon['Lon_deg'] + site_lat_lon['Lon_min']/60)

# Merge with site table to populate what few lat, lons we know at this time. **Note that based on the prior text box, some of these are wrong.**
site = site.merge(site_lat_lon, how='left', left_on='SiteID', right_on='ID')
site.head()

Unnamed: 0,SiteID,SITE,Site,ID,Lat_deg,Lat_min,Lon_deg,Lon_min,Lat_dd,Lon_dd
0,ALB,Albion Bay,,,,,,,,
1,BR,Bodega Marine Life Refuge (BML),,,,,,,,
2,CAT,"Catalina Island, Bird Rock (Southern CA)",,,,,,,,
3,CC,Caspar Cove,Caspar Cove,CC,39.0,36.556,123.0,82.362,39.609267,-124.3727
4,FM,Fisk Mill Cove,,,,,,,,


In [12]:
## decimalLatitude, decimalLongitude

occ = occ.merge(site[['SiteID', 'Lat_dd', 'Lon_dd']], how='left', on='SiteID')
occ.rename(columns={'Lat_dd':'decimalLatitude', 'Lon_dd':'decimalLongitude'}, inplace=True)
print(occ.shape)
occ.head()

(455, 8)


Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,
1,BR99-02,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,
2,BR99-03,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,
3,BR99-04,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,
4,BR99-05,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,


**Note** that once merged, the only lat, lons that are missing are from Bodega Marine Life Refuge.
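
It's worth confirming which site IDs failed to pick up coordinates, since the site table uses 'ALB' for Albion Bay while Laura's table uses 'AB'. A minimal sketch using the `indicator` option of `DataFrame.merge` on toy tables (IDs taken from the outputs above; latitudes rounded for illustration):

```python
import pandas as pd

# Toy versions of the site table and the coordinate table
toy_site = pd.DataFrame({'SiteID': ['ALB', 'BR', 'CC']})
toy_coords = pd.DataFrame({'ID': ['AB', 'CC'], 'Lat_dd': [39.3817, 39.6093]})

# indicator=True adds a '_merge' column recording where each row found a match
merged = toy_site.merge(toy_coords, how='left', left_on='SiteID', right_on='ID', indicator=True)

# Site IDs that exist in the site table but have no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'SiteID'].tolist()
print(unmatched)
```

If 'ALB' shows up as unmatched in the real tables, the IDs would need to be reconciled before Albion Bay surveys can get coordinates.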

In [13]:
## coordinateUncertaintyInMeters

occ['coordinateUncertaintyInMeters'] = 250
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250
1,BR99-02,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250
2,BR99-03,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250
3,BR99-04,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250
4,BR99-05,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250


**Is 250 m reasonable for coordinateUncertaintyInMeters?**

In [14]:
## occurrenceID

# Merge with counts
occ = occ.merge(count[['Survey_Num', 'SpeciesID', 'COUNT']], how='left', left_on='eventID', right_on='Survey_Num')

# Create occurrenceID
occ['occurrenceID'] = occ.groupby('eventID').cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_occ' + occ['occurrenceID'].astype(str)

# Drop redundant columns
occ.drop(columns=['Survey_Num'], inplace=True)

print(occ.shape)
occ.head()

(3440, 12)


Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,SpeciesID,COUNT,occurrenceID
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,A1,0,BR99-01_occ1
1,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,A11,0,BR99-01_occ2
2,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,A12,0,BR99-01_occ3
3,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,C1,0,BR99-01_occ4
4,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,F11,0,BR99-01_occ5


In [15]:
## scientificName

# Merge to get species names from codes
occ = occ.merge(species[['SpeciesID', 'SPECIES', 'Scientific']], how='left', on='SpeciesID')

# Rename columns
occ.rename(columns={'SPECIES':'vernacularName',
                   'Scientific':'scientificName'}, inplace=True)

# Drop
occ.drop(columns=['SpeciesID'], inplace=True)

print(occ.shape)
occ.head()

(3440, 13)


Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,COUNT,occurrenceID,vernacularName,scientificName
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ1,Red Abalone,H. Rufescens
1,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ2,Flat Abalone,H. walallensis
2,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ3,Pinto Abalone,H. kamtschatkana
3,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ4,Cancer sp.,general
4,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ5,Cabezon,Scorpaenichthys marmoratus


In [16]:
## Get unique names

names = occ['scientificName'].unique()
names

array(['H. Rufescens', 'H. walallensis', 'H. kamtschatkana', 'general',
       'Scorpaenichthys marmoratus', 'Ophiodon elongatus',
       'Pisaster spp.', 'Dermasterias imbricata', 'Henricia leviuscula',
       'Patiria miniata', 'Othasterias koehleri',
       'Pycnopodia helianthoides', 'S. franciscanus', 'S. purpuratus'],
      dtype=object)

**There are a number of changes that need to be made to scientific names before they can go through WoRMS.**
1. A number of abalone species don't have the genus written out
    - H. Rufescens, H. walallensis, H. kamtschatkana --> Haliotis
2. A number of urchin species don't have the genus written out
    - S. franciscanus and S. purpuratus --> Strongylocentrotus
3. A number of scientific names are simply "general"
    - These seem to correspond to vernacular names: Cancer sp., Solaster species
    
**In addition,** misspellings:
- Othasterias koehleri should be Orthasterias koehleri

In [17]:
## Update names and occ based on the above observations

# Update occ
occ.replace({'H. Rufescens':'Haliotis rufescens',
            'H. walallensis':'Haliotis walallensis',
            'H. kamtschatkana':'Haliotis kamtschatkana',
            'S. franciscanus':'Strongylocentrotus franciscanus',
            'S. purpuratus':'Strongylocentrotus purpuratus',
            'Othasterias koehleri':'Orthasterias koehleri'}, inplace=True)
occ.loc[occ['vernacularName'] == 'Cancer sp.', 'scientificName'] = 'Cancer sp.'
occ.loc[occ['vernacularName'] == 'Solaster species', 'scientificName'] = 'Solaster species'

# Get unique names again
names = occ['scientificName'].unique()
names

array(['Haliotis rufescens', 'Haliotis walallensis',
       'Haliotis kamtschatkana', 'Cancer sp.',
       'Scorpaenichthys marmoratus', 'Ophiodon elongatus',
       'Pisaster spp.', 'Solaster species', 'Dermasterias imbricata',
       'Henricia leviuscula', 'Patiria miniata', 'Orthasterias koehleri',
       'Pycnopodia helianthoides', 'Strongylocentrotus franciscanus',
       'Strongylocentrotus purpuratus'], dtype=object)

In [18]:
## Run through WoRMS

name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)

Url didn't work for Cancer sp. checking:  Cancer
Url didn't work for Pisaster spp. checking:  Pisaster
Url didn't work for Solaster species checking:  Solaster


In [19]:
## Add scientific name-related columns

# Assign the result of replace() instead of calling it in place on a column selection
occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,COUNT,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ1,Red Abalone,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357
1,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ2,Flat Abalone,Haliotis walallensis,urn:lsid:marinespecies.org:taxname:445374,445374
2,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ3,Pinto Abalone,Haliotis kamtschatkana,urn:lsid:marinespecies.org:taxname:405014,405014
3,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ4,Cancer sp.,Cancer sp.,urn:lsid:marinespecies.org:taxname:106876,106876
4,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ5,Cabezon,Scorpaenichthys marmoratus,urn:lsid:marinespecies.org:taxname:282726,282726


In [20]:
## Since there are no identificationQualifiers so far, swap out scientificName too

occ['scientificName'] = occ['scientificName'].replace(name_name_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,COUNT,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ1,Red Abalone,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357
1,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ2,Flat Abalone,Haliotis walallensis,urn:lsid:marinespecies.org:taxname:445374,445374
2,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ3,Pinto Abalone,Haliotis kamtschatkana,urn:lsid:marinespecies.org:taxname:405014,405014
3,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ4,Cancer sp.,Cancer,urn:lsid:marinespecies.org:taxname:106876,106876
4,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ5,Cabezon,Scorpaenichthys marmoratus,urn:lsid:marinespecies.org:taxname:282726,282726


In [21]:
## Add other name-related columns

occ['nameAccordingTo'] = 'WoRMS'
occ['occurrenceStatus'] = 'present'
occ['basisOfRecord'] = 'HumanObservation'

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,COUNT,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ1,Red Abalone,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,present,HumanObservation
1,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ2,Flat Abalone,Haliotis walallensis,urn:lsid:marinespecies.org:taxname:445374,445374,WoRMS,present,HumanObservation
2,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ3,Pinto Abalone,Haliotis kamtschatkana,urn:lsid:marinespecies.org:taxname:405014,405014,WoRMS,present,HumanObservation
3,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ4,Cancer sp.,Cancer,urn:lsid:marinespecies.org:taxname:106876,106876,WoRMS,present,HumanObservation
4,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,0,BR99-01_occ5,Cabezon,Scorpaenichthys marmoratus,urn:lsid:marinespecies.org:taxname:282726,282726,WoRMS,present,HumanObservation


In [22]:
## Get count information in the right place

# Set organismQuantity
occ['organismQuantity'] = occ['COUNT'].copy()

# Drop COUNT column
occ.drop(columns=['COUNT'], inplace=True)

# Add organismQuantityType
occ['organismQuantityType'] = 'density in number of individuals per 60m2'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,organismQuantity,organismQuantityType
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ1,Red Abalone,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,present,HumanObservation,0,density in number of individuals per 60m2
1,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ2,Flat Abalone,Haliotis walallensis,urn:lsid:marinespecies.org:taxname:445374,445374,WoRMS,present,HumanObservation,0,density in number of individuals per 60m2
2,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ3,Pinto Abalone,Haliotis kamtschatkana,urn:lsid:marinespecies.org:taxname:405014,405014,WoRMS,present,HumanObservation,0,density in number of individuals per 60m2
3,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ4,Cancer sp.,Cancer,urn:lsid:marinespecies.org:taxname:106876,106876,WoRMS,present,HumanObservation,0,density in number of individuals per 60m2
4,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ5,Cabezon,Scorpaenichthys marmoratus,urn:lsid:marinespecies.org:taxname:282726,282726,WoRMS,present,HumanObservation,0,density in number of individuals per 60m2


**Note** that I should check the organismQuantityType values.

**Also note** that organismQuantity is currently a string. I think this is because there were weird values in the original data set. These were eliminated by the EMERGENT filtering step, but may return once that filter is broadened, which I think it needs to be. **Also, 274 rows in the count table have COUNT = NaN.** I'm not sure if this is meaningful or an error. Because of the merge, these rows also appear in occ, and I'm going to leave them for now. That means organismQuantity has to be a float rather than an int.
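
If whole-number counts turn out to matter for the final file, pandas' nullable integer dtype is one possible alternative to float. This is only a sketch on a toy column, not something the notebook currently does:

```python
import pandas as pd

# Toy organismQuantity column stored as strings, with one missing value
toy_quantity = pd.Series(['0', '3', None])

# Plain conversion gives float64 because NaN cannot live in a regular int column
as_float = pd.to_numeric(toy_quantity)

# The nullable 'Int64' dtype keeps integers and represents the gap as <NA>
as_nullable_int = as_float.astype('Int64')
print(as_nullable_int)
```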

In [23]:
## Update presence/absence status

# Ensure organismQuantity is numeric (float rather than int, because of the NaN counts)
occ['organismQuantity'] = pd.to_numeric(occ['organismQuantity'])

# Change 0 counts to absent
occ.loc[occ['organismQuantity'] == 0, 'occurrenceStatus'] = 'absent'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,organismQuantity,organismQuantityType
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ1,Red Abalone,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2
1,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ2,Flat Abalone,Haliotis walallensis,urn:lsid:marinespecies.org:taxname:445374,445374,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2
2,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ3,Pinto Abalone,Haliotis kamtschatkana,urn:lsid:marinespecies.org:taxname:405014,405014,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2
3,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ4,Cancer sp.,Cancer,urn:lsid:marinespecies.org:taxname:106876,106876,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2
4,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ5,Cabezon,Scorpaenichthys marmoratus,urn:lsid:marinespecies.org:taxname:282726,282726,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2


In [24]:
## Depth

occ = occ.merge(survey[['Survey_Num', 'Min_DEPTH', 'Max_Depth']], how='left', left_on='eventID', right_on='Survey_Num')
occ.drop(columns=['Survey_Num'], inplace=True)
occ.rename(columns={'Min_DEPTH':'minimumDepthInMeters',
           'Max_Depth':'maximumDepthInMeters'}, inplace=True)
print(occ.shape)
occ.head()

(3440, 21)


Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,...,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,organismQuantity,organismQuantityType,minimumDepthInMeters,maximumDepthInMeters
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ1,...,Haliotis rufescens,urn:lsid:marinespecies.org:taxname:445357,445357,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2,39.0,
1,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ2,...,Haliotis walallensis,urn:lsid:marinespecies.org:taxname:445374,445374,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2,39.0,
2,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ3,...,Haliotis kamtschatkana,urn:lsid:marinespecies.org:taxname:405014,405014,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2,39.0,
3,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ4,...,Cancer,urn:lsid:marinespecies.org:taxname:106876,106876,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2,39.0,
4,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ5,...,Scorpaenichthys marmoratus,urn:lsid:marinespecies.org:taxname:282726,282726,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2,39.0,


**Note** that when maximumDepthInMeters is not NaN, it is equal to minimumDepthInMeters. These values look like they're in feet, but **I should check this.** Also, **when was average depth recorded**, and how should I include it?
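
If the depths do turn out to be in feet, they would need converting before they can go into minimumDepthInMeters and maximumDepthInMeters. A minimal sketch under that unconfirmed assumption; `feet_to_meters` is a hypothetical helper and rounding to one decimal place is my choice:

```python
import math

FEET_TO_METERS = 0.3048  # exact, by definition of the international foot


def feet_to_meters(depth_ft):
    """Convert a depth in feet to meters, passing missing values (None or NaN) through."""
    if depth_ft is None or math.isnan(depth_ft):
        return depth_ft
    return round(depth_ft * FEET_TO_METERS, 1)


# First Min_DEPTH value shown above, if it is in feet
converted = feet_to_meters(39.0)
print(converted)
```

On the real data this could be applied with `occ['minimumDepthInMeters'].apply(feet_to_meters)` once the units are confirmed.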

In [191]:
## samplingProtocol, samplingEffort

occ['samplingProtocol'] = 'band transect'
occ['samplingEffort'] = '10-15 minutes per transect'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,SiteID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,...,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,organismQuantity,organismQuantityType,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol,samplingEffort
0,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ1,...,445357,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2,39.0,,band transect,10-15 minutes per transect
1,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ2,...,445374,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2,39.0,,band transect,10-15 minutes per transect
2,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ3,...,405014,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2,39.0,,band transect,10-15 minutes per transect
3,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ4,...,106876,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2,39.0,,band transect,10-15 minutes per transect
4,BR99-01,1999-07-29,North coast kelp emergent transects,BR,Bodega Marine Life Refuge (BML),US,,,250,BR99-01_occ5,...,282726,WoRMS,absent,HumanObservation,0.0,density in number of individuals per 60m2,39.0,,band transect,10-15 minutes per transect


**Check samplingProtocol and samplingEffort.**

In [192]:
## Save

occ.to_csv('occ_20201028.csv', index=False, na_rep='NaN')

## Questions

1. If you want habitat and size data to be included, where are these located? **The size data should be in Tbl_new_size, and the substrate data should be in tblSubstrate. tblHabitat should contain algae data.**
2. Laura said that she only wanted to share the data from emergent surveys. If I filter for TYPE = 'EMERGENT', I end up discarding the majority of the data (all but 455 records out of ~ 3400). Is this really the right thing to do? There are the following values under SURVEY, many of which use the word 'Emergent'. **Laura now says she would like to share all of the data (including the Rapid emergent, or REAS surveys). Below, I've indicated Y or N for whether Laura thought the survey type should be included. She also said that anything with 'emergent' somewhere in the title should have TYPE = EMERGENT.**
- 'Transect - 30m (Rapid Emergent)' - Y
- 'Transect - 20mx2m (Emergent)' - N
- nan
- 'ARM' - N
- 'Transect - 30mx2m (Emergent)' - Y (old surveys?)
- 'Swim' - N
- 'Transect - 5m (Invasive)' - N
- 'Transect - 20mx5m (Emergent)' - N
- 'Transect - 16mx2m (Emergent)' - N
- 'Transect - 1/4m' - N
- 'Transect - 1/4M' - N
- 'Transect - running transect' - N
- 'Transect - Running Transect' - N
- 'Transect - 30m (Emergent)' (this is the only value left after filtering by TYPE = EMERGENT) - Y
3. Is it possible to get a general description of what happens during the emergent surveys? How many sites are surveyed and how often? Are some sites inside MPAs and some outside MPAs? Which? How many transects are done per site? Are quadrats important here? **Laura conducts a lot of transects at a small number of sites per year. The surveys started in 1999 with only 1 site. Within each site, transects are depth-stratified into four categories: A (shallow, 0-15 ft), B (16-30 ft), C (31-45 ft), D (46-60 ft). Within each stratum, they randomly place 36 transects, each 30 x 2 m (36 is the target; it's sometimes a little more and sometimes less). Transects are placed by randomly selecting GPS points. Over time, they have eliminated GPS points over sandy habitat: if a diver dropped on their coordinate and it was > 50% sand, the transect was not conducted and the point was removed from future drawings. During each transect, there is a left and a right diver, who each survey their respective 1 x 30 m area. Divers record inverts, benthic fish, substrate type, algae type, depth, and size. They take as long as they need to survey everything they can find. To quantify substrate type, they stop at the 0, 10, 20, and 30 m marks, look at the 1 m2 area in front of them, and estimate the % of each substrate category. Algae type is recorded by functional group (e.g. crustose, etc.), and can add up to > 100% because they capture layering. Depth is recorded at the 0, 10, 20, and 30 m marks. Target species (e.g. abalone, urchins) are sized during each transect; typically, if there are > 30 animals on a transect, only about 30 are sized. Numbers of organisms are recorded in 5 m bins, so for each organism there should be 6 abundance numbers for the right side of the transect and 6 for the left side. A few other points to keep in mind:**
- **Some sites (Salt Point, Casper) contain regions that were historically closed to red sea urchin fishing. They distinguish transects that are inside and outside of these zones (how?). Also, these sites often have more than 36 transects, because they try to do additional transects inside the historical no-take zones.**
- **Note that the C and D depth strata are out of reach for breath-hold abalone divers, and so are effectively protected.**

**Rapid Emergent surveys are slightly different and are conducted in years when they want to survey more sites quickly. A dive team drops in at a single location and runs four transects from that location (so these transects are not as spatially independent). Otherwise, the transects are identical to those described above, except that substrate is quantified using UPC (uniform point contact) at every meter along the transect tape.**
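
Since each diver records counts in six 5 m bins, a transect has 12 numbers per species. A minimal sketch of collapsing these to one count per side and one per transect; the long-format layout and the column names (`transect`, `species`, `side`, `bin_start_m`, `count`) are assumptions for illustration, not the actual table structure:

```python
import pandas as pd

# Hypothetical long-format layout: one row per transect, species, diver side and 5 m bin
counts = pd.DataFrame({
    'transect': ['BR99-01'] * 12,
    'species': ['Haliotis rufescens'] * 12,   # toy data, not real counts
    'side': ['L'] * 6 + ['R'] * 6,
    'bin_start_m': [0, 5, 10, 15, 20, 25] * 2,
    'count': [1, 0, 2, 0, 0, 1, 0, 3, 0, 0, 1, 0],
})

# One number per diver side (each side covers 1 x 30 m = 30 m2)
per_side = counts.groupby(['transect', 'species', 'side'], as_index=False)['count'].sum()

# One number per transect (both sides together cover 2 x 30 m = 60 m2)
per_transect = counts.groupby(['transect', 'species'], as_index=False)['count'].sum()

print(per_side)
print(per_transect)
```

The per-transect total covers 60 m2, which lines up with the `density in number of individuals per 60m2` value used for organismQuantityType. Note that `sum()` skips NaN bins, so empty bins would need to be checked before trusting the totals.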

4. How are entries in Survey_Num formatted? Most look like the site code + the year + a number. Is the number a transect number? There are a number of Survey_Nums that do not fit this format, for example:
- 'PA71C-01'
- 'PA86dfg-13'

**Survey_Num contains a unique 2-letter site code (Point Arena = PA, etc.) + a 2-digit year (99 - 19) + transect information. Transect numbers are usually accompanied by a letter indicating the depth stratum (e.g. A02). If the dive team had to move their transect off of the designated coordinate for some reason, it's indicated by AA (e.g. BAA7 indicates an alternate transect for B7). There are also data from 1971 and 1986. These are historical data from surveys that Laura did not conduct. The methods were similar, although the sites were all accessible from shore. These survey numbers may not follow the usual format. There is a report somewhere on these data that may help us flesh out differences in survey methods.**
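
A minimal sketch of splitting a Survey_Num into its parts, based on the format described above. The function name `parse_survey_num`, the regular expression, and the century cutoff (71-99 read as 19xx, 00-70 as 20xx) are my own assumptions; the transect portion is returned unparsed because its format varies:

```python
import re

# Two uppercase letters (site), two digits (year), then whatever is left (transect info)
SURVEY_NUM_PATTERN = re.compile(r'^(?P<site>[A-Z]{2})(?P<yy>\d{2})(?P<rest>.*)$')

def parse_survey_num(survey_num):
    """Split a Survey_Num into site code, 4-digit year and raw transect text.

    Returns None if the value does not start with 2 letters + 2 digits.
    """
    match = SURVEY_NUM_PATTERN.match(survey_num.strip())
    if match is None:
        return None
    yy = int(match.group('yy'))
    # ASSUMPTION: the earliest data are from 1971, so 71-99 -> 1900s and 00-70 -> 2000s
    year = 1900 + yy if yy >= 71 else 2000 + yy
    return {
        'site': match.group('site'),
        'year': year,
        'transect': match.group('rest').lstrip('-'),
    }

for value in ['BR99-01', 'PA71C-01', 'PA86dfg-13', 'not a survey']:
    print(value, '->', parse_survey_num(value))
```

Values that return None, or that come back with an unexpected transect string, would be good candidates to send back to Laura.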

5. By filtering by TYPE = EMERGENT, we are also only getting survey dates between 1971 and 2001. This doesn't seem right.
6. What are the correct lat, lons for the survey sites? Columns in the survey table are confusing. In June, Laura sent me a screenshot of another table, but the lat, lons are confusing there, too. Specifically, they appear to be in degrees plus decimal minutes format, but several of the decimal minutes values are > 60. (E.g. Albion Bay, 39 22.900, 123 77.881) **Laura sees that there's something wrong here; she will look for better coordinates. Unlike the other surveys, she also does have transect-level coordinates, which she would prefer to include if possible. I said I'd look at the data and try to give her a better picture of where transect-level coordinates are missing.**
7. Is 250 m reasonable for coordinateUncertaintyInMeters?
8. There are 274 values in the count table where COUNT = NaN. What does this mean? Are these records different from when COUNT = 0? Also, there are some other weird values in the COUNT column, including:
- 'RG18-C3-'
- 'P'
9. How and when was depth recorded for these surveys? Should I use average, min, max?
10. What should I put for samplingProtocol? samplingEffort? **samplingProtocol = band transect (I didn't check but this is likely OK. Or maybe she'd prefer rapid versus emergent). samplingEffort = survey everything.**
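
Following Laura's answer to question 2, a minimal sketch of fixing TYPE and keeping only the survey types she marked Y. The toy DataFrame stands in for the real survey table; `SURVEY` and `TYPE` are the column names from the notes above, and `INCLUDED_SURVEYS` is a name I made up:

```python
import numpy as np
import pandas as pd

# Survey types Laura marked 'Y' in question 2
INCLUDED_SURVEYS = {
    'Transect - 30m (Rapid Emergent)',
    'Transect - 30mx2m (Emergent)',
    'Transect - 30m (Emergent)',
}

# Toy stand-in for the survey table
surveys = pd.DataFrame({
    'SURVEY': ['Transect - 30m (Rapid Emergent)', 'Transect - 20mx2m (Emergent)',
               'ARM', np.nan, 'Transect - 30m (Emergent)'],
    'TYPE': [np.nan, np.nan, 'ARM', np.nan, 'EMERGENT'],
})

# Anything with 'emergent' in the survey name should have TYPE = EMERGENT
has_emergent = surveys['SURVEY'].str.contains('emergent', case=False, na=False)
surveys.loc[has_emergent, 'TYPE'] = 'EMERGENT'

# Keep only the survey types to be shared (rows with SURVEY = NaN are dropped)
kept = surveys[surveys['SURVEY'].isin(INCLUDED_SURVEYS)]
print(kept)
```

Filtering on the explicit list rather than on TYPE matters here, because some surveys with 'Emergent' in the name (e.g. the 20mx2m ones) were marked N.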

### In addition, Laura would like to cover:
1. The left and right sides of the transects in the database, and possibly combining them. **Currently, the transect data for each species are in the form of 6 data points on the right and 6 on the left. We would have to sum them to get a total count for each transect. Should we keep 12 numbers or convert them to 1? This is a decision we should keep in mind and make eventually.**
2. The total number of transects and how it can change depending on the species. **I'm a bit confused on this point, but Laura said that sometimes the number of transects for each species changes at a site. E.g., sometimes there are 36 transects for abalone, and 10 for urchin. I didn't think transects were separated out by species, but perhaps this will become clearer as I look at the data. She said this happens primarily in recent years, so I can keep an eye out for that.**
3. The need for zeros for some of the species on some transects. **Some species were looked for but not found, and no 0 was entered; the record is simply empty. Laura would like these zeros populated. She said it would involve figuring out how many transects happened at that site that year, and then checking that we have all of the zeros. She said that almost all species were looked for in almost all years, except purple urchin in 2017. If I can get the data clean enough, I may be able to do something like I did for PISCO to populate missing absence records.**
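
One way the missing zeros in item 3 could be populated, as a rough sketch only. All table layouts, column names, survey numbers and counts below are hypothetical; the only detail taken from the notes is that purple urchin was not looked for in 2017. It builds every transect x species combination, drops combinations that were not searched, and fills the rest with 0:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs (toy data)
transects = pd.DataFrame({'transect': ['SP16-A01', 'SP17-A01'], 'year': [2016, 2017]})
species = pd.DataFrame({'species': ['red abalone', 'purple urchin']})
observed = pd.DataFrame({'transect': ['SP17-A01'], 'species': ['red abalone'], 'count': [3]})

# (species, year) pairs that were NOT looked for, so no zero should be added
NOT_SEARCHED = {('purple urchin', 2017)}

# Every transect x species combination (how='cross' needs pandas >= 1.2)
full = transects.merge(species, how='cross')

# Drop combinations where the species was not searched for that year
searched = [(s, int(y)) not in NOT_SEARCHED for s, y in zip(full['species'], full['year'])]
full = full.loc[searched]

# Attach recorded counts; anything still missing becomes an explicit zero
filled = full.merge(observed, on=['transect', 'species'], how='left')
filled['count'] = filled['count'].fillna(0).astype(int)
filled['occurrenceStatus'] = np.where(filled['count'] > 0, 'present', 'absent')
print(filled)
```

This only works once the list of transects per site and year is trustworthy, and it treats every empty record as a true zero, which may not hold for the records where COUNT = NaN (question 8).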


**In addition: Ask Laura for the site area estimation table. She said there are shapefiles for all of the sites; she can connect me with Paulo Serpa, her GIS guy, for those. They would be good to include in the DataONE submission.**