# North coast kelp
## Prep for DataONE
Submission guidelines: https://opc.dataone.org/support

In [1]:
## Imports

import pandas as pd
import numpy as np
import csv, pyodbc
import pickle
import datetime

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, '/Users/dianalg/PycharmProjects/PythonScripts/MPA data integration/')

import WoRMS # functions for querying WoRMS REST API

## Connect to db and retrieve data

Note that (as described in the Wiki) the original database filename did not work because it included underscores. I renamed 'Abalone_DiveSurveys_EH_06242020.mdb' to 'AbaloneDiveSurveys-06242020.mdb' to fix this problem.

Also note that **Microsoft does not produce Access ODBC drivers for Mac. So now that I'm on a Mac, I won't be able to access the actual database without [workarounds](https://github.com/mkleehammer/pyodbc/wiki/Connecting-to-Microsoft-Access).** Fortunately, I've already extracted and saved the data. So...
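
The extraction step itself is not shown in this notebook. As a rough sketch only, this is one way a table could be dumped into the two files that `load_table` (below) expects: a headerless CSV plus a pickled list of column names. `save_table` and the demo table are hypothetical, and `sqlite3` stands in for the Access connection so the sketch runs on any machine; with Access the cursor would come from pyodbc instead.

```python
import csv
import os
import pickle
import sqlite3
import tempfile


def save_table(cursor, tbl_name, out_dir='.'):
    """Write <tbl_name>.csv (no header row) and <tbl_name>_cols.data (pickled column names)."""
    cursor.execute('SELECT * FROM [{}]'.format(tbl_name))
    cols = [d[0] for d in cursor.description]  # DB-API: first item of each entry is the name
    rows = cursor.fetchall()

    with open(os.path.join(out_dir, tbl_name + '_cols.data'), 'wb') as file:
        pickle.dump(cols, file)
    with open(os.path.join(out_dir, tbl_name + '.csv'), 'w', newline='') as file:
        csv.writer(file).writerows(rows)
    return cols


# Stand-in database with one made-up row
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tblSite (SiteID TEXT, SITE TEXT)')
conn.execute("INSERT INTO tblSite VALUES ('ALB', 'Albion Bay')")

out_dir = tempfile.mkdtemp()
saved_cols = save_table(conn.cursor(), 'tblSite', out_dir)
print(saved_cols)
```

Keeping the column names in a separate file matches the `header=None, names=cols` call used when the data are loaded.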

## Load data

In [3]:
## Function to load data

def load_table(tbl_name):
    """Takes tbl_name (a string) and loads saved data from that table."""
    
    # Get filenames
    col_name = tbl_name + '_cols.data'
    data_name = tbl_name + '.csv'
    
    # Retrieve column names
    with open(col_name, 'rb') as file:
        cols = pickle.load(file)
        
    # Load data
    data = pd.read_csv(data_name, header=None, names=cols)
    return data

In [29]:
## Load data

site = load_table('tblSite')
survey = load_table('tblSurvey')
count = load_table('tblCounts')
species = load_table('tblSpecies')
size = load_table('Tbl_New_size')
substrate = load_table('tblSubstrate')
habitat = load_table('tblHabitat')

The data tables are as follows:
- **site**: Contains the site name (SITE) and its associated two or three letter SiteID
- **survey**: Contains the Survey_ID, SiteID, a unique Survey_Num, a description of the type of survey (SURVEY), the date of the survey (DATE), information about depth (Avg Depth, Min_DEPTH, Max_Depth), information about location (SLAT, SLONG, ELAT, ELONG, SLAT_old, SLONG_old, ELAT_old, ELONG_old, SLAT_DD, SLONG_DD, ELAT_DD, ELONG_DD), and comments (COMMENTS).
- **count**: Contains the Survey_Num, a Layer/Quadrat value indicating the 5 m block along the transect surveyed and whether it was on the left (L) or right (R), the SpeciesID, and the number observed (COUNT)
- **species**: Contains the SpeciesID, common name (SPECIES), scientific name (Scientific) and Notes.
- **size**: Contains sizes (SIZE) of ~ 30 or fewer individuals of target species (SpeciesID) obtained during a survey (Survey_Num).
- **substrate**: Contains the percentage (%Total) of each biotic and abiotic substrate type (HabitatID) for a given Survey_Num. A Subsample column seems to indicate whether the observation was associated with the left (L) side of the transect, the right (R) side of the transect, or both (LR). **Laura said these measurements are taken at the 0, 10, 20 and 30 m marks. Have the values been averaged here? Added?**
- **habitat**: Contains substrate type codes (HabitatID) and descriptions (HABITAT).

For DataONE, I suggest creating the following tables (based on DataONE guidelines):
- **Site** table, containing site codes, site names, coordinates, CA_MPA_Name_Short, and LTM_project_short_code
- **Species** table, containing species codes, scientific name, ideally common name as well, major taxonomic ranks, WoRMS ID, and species_definition
- **Count** table, containing the number of each organism observed in each layer of each transect during each survey
- **Percent cover** table, containing the percentage of each biotic and abiotic substrate type observed on each transect of each survey
- **Percent cover categories** table, describing biotic and abiotic substrate codes
- **Size** table, containing the sizes of organisms sampled during each survey

First, I'm going to tidy the survey table by removing survey types that Laura doesn't want included. I might also limit the columns to those that seem relevant to me, although **Laura should weigh in on whether some of these should remain**. Then I'll work through problems with each of these proposed tables.

In [32]:
## Tidy survey table

# Select relevant columns
sur = survey[[
    'Survey_ID', 
    'SiteID', 
    'Survey_Num', 
    'SURVEY', 
    'DATE', 
    'Avg Depth',
    'Min_DEPTH', 
    'Max_Depth',
    'SLAT',
    'SLONG',
    'ELAT',
    'ELONG',
    'SLAT_old',
    'SLONG_old',
    'ELAT_old',
    'ELONG_old',
    'SLAT_DD',
    'SLONG_DD',
    'ELAT_DD',
    'ELONG_DD',
    'COMMENTS',
]]

# Filter survey type as instructed by Laura
print(sur.shape)
surveys_to_keep = [
    'Transect - 30m (Rapid Emergent)',
    'Transect - 30mx2m (Emergent)',
    'Transect - 30m (Emergent)',
]
sur = sur[sur['SURVEY'].isin(surveys_to_keep)]
print(sur.shape)

# View
sur.head()

(3904, 21)
(2943, 21)


Unnamed: 0,Survey_ID,SiteID,Survey_Num,SURVEY,DATE,Avg Depth,Min_DEPTH,Max_Depth,SLAT,SLONG,...,ELONG,SLAT_old,SLONG_old,ELAT_old,ELONG_old,SLAT_DD,SLONG_DD,ELAT_DD,ELONG_DD,COMMENTS
0,4790,FR,FR18-D5-1,Transect - 30m (Rapid Emergent),2018-07-12 00:00:00,,48.0,,,,...,,,,,,0.0,0.0,0.0,0.0,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...
1,4919,ALB,ALB18-A02-1,Transect - 30m (Rapid Emergent),2018-08-30 00:00:00,,7.0,10.0,,,...,,,,,,0.0,0.0,0.0,0.0,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...
2,4920,ALB,ALB18-A02-2,Transect - 30m (Rapid Emergent),2018-08-30 00:00:00,,10.0,14.0,,,...,,,,,,0.0,0.0,0.0,0.0,RAPID EMERGENT SITE A02: 2 OF 2 TRANSECTS. HEA...
3,4915,ALB,ALB18-A1-1,Transect - 30m (Rapid Emergent),2018-08-30 00:00:00,,12.0,16.0,,,...,,,,,,0.0,0.0,0.0,0.0,RAPID EMERGENT SITE A1: 1 OF 4 TRANSECTS. HEAD...
4,4916,ALB,ALB18-A1-2,Transect - 30m (Rapid Emergent),2018-08-30 00:00:00,,8.0,8.0,,,...,,,,,,0.0,0.0,0.0,0.0,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...


## Site table

In [33]:
## Remove sites that have no survey data

no_surveys = []
for s in site['SiteID'].unique():
    if s not in sur['SiteID'].unique():
        no_surveys.append(s)

site_clean = site[~site['SiteID'].isin(no_surveys)].copy()

In [34]:
## Change column names to something sensible

site_clean.columns = ['SiteID', 'SiteName']

In [35]:
## Add missing information

site_clean['Lat'] = np.nan
site_clean['Lon'] = np.nan
site_clean['CA_MPA_Name_Short'] = ''
site_clean['LTM_project_short_code'] = ''
site_clean

Unnamed: 0,SiteID,SiteName,Lat,Lon,CA_MPA_Name_Short,LTM_project_short_code
0,ALB,Albion Bay,,,,
1,BR,Bodega Marine Life Refuge (BML),,,,
3,CC,Caspar Cove,,,,
4,FM,Fisk Mill Cove,,,,
6,FR,Fort Ross State Park,,,,
7,HMS,Hopkins Marine Station,,,,
11,MC,Moat Creek,,,,
14,OC,Ocean Cove,,,,
15,PA,Point Arena,,,,
16,PC,Point Cabrillo Lighthouse Reserve,,,,


**Problems:**
- There are a bunch of sites in the site table that do not appear in the survey table. **Remove these?**
- Sites need to be matched to the appropriate CA_MPA_Name_Short value. This column would be left blank if the site is not inside an MPA (e.g. is a reference site)
- Which LTM_project_short_code to use? LTM_Kelp_SRock? (Or NA if site is not part of long term MPA monitoring)
- Correct lat, lon for each site in WGS84 decimal degrees needs to be provided. Coordinates given previously were wonky (see Site location information.png)
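
Until verified coordinates are supplied, one possible stopgap would be to derive a rough site position from the transect start coordinates, treating zeros as missing. This is only a sketch under that assumption (it has not been agreed with Laura), and the numbers in `toy_sur`, which stands in for `sur`, are made up.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy_sur = pd.DataFrame({
    'SiteID': ['ALB', 'ALB', 'ALB', 'FR'],
    'SLAT_DD': [39.20, 39.30, 0.0, 0.0],
    'SLONG_DD': [-123.70, -123.80, 0.0, 0.0],
})

site_coords = (
    toy_sur[['SiteID', 'SLAT_DD', 'SLONG_DD']]
    # Zeros are placeholders, not real coordinates
    .replace({'SLAT_DD': {0: np.nan}, 'SLONG_DD': {0: np.nan}})
    .groupby('SiteID', as_index=False)[['SLAT_DD', 'SLONG_DD']]
    .mean()  # NaNs are skipped; a site with no usable coordinates stays NaN
    .rename(columns={'SLAT_DD': 'Lat', 'SLONG_DD': 'Lon'})
)
print(site_coords)
```

Sites with no usable transect coordinates stay empty, so they would still need to be filled in by hand.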

## Species table

In [36]:
## Change column names

species_clean = species.copy()
species_clean.columns = [
    'SpeciesID',
    'CommonName',
    'ScientificName',
    'species_definition'
]
species_clean.head()

Unnamed: 0,SpeciesID,CommonName,ScientificName,species_definition
0,A0,Abalone sp.,Haliotis spp.,unidentified
1,A1,Red Abalone,H. Rufescens,
2,A11,Flat Abalone,H. walallensis,
3,A12,Pinto Abalone,H. kamtschatkana,
4,A13,Green Abalone,H. fulgens,


In [37]:
## Add in missing information

species_clean['taxonomic_source'] = 'WoRMS'
species_clean['taxonomic_id'] = ''
species_clean['Kingdom'] = ''
species_clean['Phylum'] = ''
species_clean['Class'] = ''
species_clean['Order'] = ''
species_clean['Family'] = ''
species_clean['Genus'] = ''
species_clean['Species'] = ''

species_clean.head()

Unnamed: 0,SpeciesID,CommonName,ScientificName,species_definition,taxonomic_source,taxonomic_id,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,A0,Abalone sp.,Haliotis spp.,unidentified,WoRMS,,,,,,,,
1,A1,Red Abalone,H. Rufescens,,WoRMS,,,,,,,,
2,A11,Flat Abalone,H. walallensis,,WoRMS,,,,,,,,
3,A12,Pinto Abalone,H. kamtschatkana,,WoRMS,,,,,,,,
4,A13,Green Abalone,H. fulgens,,WoRMS,,,,,,,,


**Problems:**
- Scientific Names should:
    - Have genus and species fully written out (unless using sp.)
    - If the organism category is more general, then fill in the appropriate order, class, phylum, etc. Find the lowest possible taxonomic rank you can use. E.g. Flatworm would have 'Platyhelminthes' in the ScientificName column. Perhaps the PISCO species table would be helpful filling these out, esp for the algae groups.
- species_definition column is required by DataONE. It should either contain the same value as ScientificName, or a more general description, e.g. 'red algae.' Some of the content from the original Notes column is fine here. But I would update entries to include a description as well. E.g. 'Post 2013 seastar wasting event' could be updated to 'Leather star (Dermasterias imbricata) with wasting disease, only recorded post 2013 seastar wasting event.'
- Add other columns required by DataONE: taxonomic_id, kingdom, phylum, class, order, family, genus, species. These can be obtained programmatically provided a decent ScientificName is provided.
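
Writing out abbreviated genus names could be partly automated. A minimal sketch, assuming a hand-checked lookup from abbreviation to genus (`GENUS_ABBREVIATIONS` and `expand_scientific_name` are hypothetical names, and the lookup here only covers the abalone rows shown above):

```python
import pandas as pd

# Hypothetical lookup; would need to be completed and checked by hand
GENUS_ABBREVIATIONS = {'H.': 'Haliotis'}


def expand_scientific_name(name, genus_lookup=GENUS_ABBREVIATIONS):
    """Expand e.g. 'H. Rufescens' to 'Haliotis rufescens'; leave other names unchanged."""
    if not isinstance(name, str):
        return name  # missing values pass through untouched
    parts = name.strip().split()
    if len(parts) == 2 and parts[0] in genus_lookup:
        # Species epithets are lower case by convention
        return '{} {}'.format(genus_lookup[parts[0]], parts[1].lower())
    return name.strip()


names = pd.Series(['Haliotis spp.', 'H. Rufescens', 'H. walallensis', None])
expanded = names.map(expand_scientific_name).tolist()
print(expanded)
```

More general categories (e.g. Flatworm) would still need a ScientificName assigned by hand before any WoRMS lookup.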
    
## Count table

In [38]:
## Clean Survey_Num - this is probably the best ID column (unique, no missing values)

sur_num_clean = sur['Survey_Num'].copy()

# Clean leading or trailing whitespace
sur_num_clean = sur_num_clean.str.strip()

# Expected format: site code, two-digit year, hyphen, depth letter(s), transect number
# (raw string so that \d is not treated as a string escape)
survey_num_pattern = r'[A-Z]{2,3}\d\d-[ABCD]{1,2}\d{1,2}'
fits_pattern = sur_num_clean.str.fullmatch(survey_num_pattern)

# Identify survey numbers that do not fit the formula
do_not_fit = sur_num_clean[~fits_pattern].to_list()

# Identify survey numbers that do fit
fit = sur_num_clean[fits_pattern].to_list()
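
To make the pattern concrete, here is a small sketch of how it classifies a few Survey_Num values that appear elsewhere in this notebook. The list of examples is only illustrative. Note that values with a trailing transect suffix, such as `FR18-D5-1`, do not fit the pattern as written.

```python
import pandas as pd

pattern = r'[A-Z]{2,3}\d\d-[ABCD]{1,2}\d{1,2}'
examples = pd.Series(['FR15-CA3', 'VD03-A9', 'FR18-D5-1', 'ST07-15', ' FR18-D5-1'])

# fullmatch requires the whole (stripped) string to fit the pattern
matches = examples.str.strip().str.fullmatch(pattern)
print(pd.DataFrame({'Survey_Num': examples, 'fits': matches}))
```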

In [39]:
## Clean SURVEY

survey_clean = pd.DataFrame({
    'Survey_Num':sur_num_clean,
    'SURVEY':sur['SURVEY']
})

survey_clean['SURVEY'] = 'Transect - 30 m x 2 m (Emergent)'

In [40]:
## Clean DATE

# There are two surveys that don't have a date. Dropping these for now.
survey_clean['DATE'] = sur['DATE'].copy()
print(survey_clean.shape)
survey_clean.dropna(inplace=True)
print(survey_clean.shape)

# Turn DATE into datetime
survey_clean['DATE'] = pd.to_datetime(survey_clean['DATE'])

# Add year, month, day as required by DataONE
survey_clean['Year'] = survey_clean['DATE'].dt.year
survey_clean['Month'] = survey_clean['DATE'].dt.month
survey_clean['Day'] = survey_clean['DATE'].dt.day

# Add timezone as required by DataONE
survey_clean['Timezone'] = 'PDT'

(2943, 3)
(2941, 3)
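
The hard-coded 'PDT' could instead be looked up from the time zone database. A minimal sketch, assuming every survey took place in the America/Los_Angeles zone and around midday local time (neither assumption is confirmed by the data shown here; the dates are examples):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2018-07-12', '2018-08-30', '2017-12-05']))

# Assume surveys happen around midday local time (survey times are mostly missing)
local_noon = (dates + pd.Timedelta(hours=12)).dt.tz_localize('America/Los_Angeles')

# %Z gives the abbreviation in force on that date: PDT in summer, PST in winter
timezone = local_noon.dt.strftime('%Z')
print(timezone.tolist())
```

Using midday avoids the hour around 2 a.m. on daylight saving changeover days, where local times can be ambiguous or nonexistent.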


In [41]:
## Add depth

survey_clean['Min_DEPTH'] = sur.loc[sur['DATE'].isna() == False, 'Min_DEPTH'].copy()
survey_clean['Max_Depth'] = sur.loc[sur['DATE'].isna() == False, 'Max_Depth'].copy()

In [42]:
## Deal with lat, lon

survey_clean[['SLAT_DD', 'SLONG_DD']] = sur.loc[sur['DATE'].isna() == False, ['SLAT_DD', 'SLONG_DD']].replace(0, np.nan)
survey_clean.head()

Unnamed: 0,Survey_Num,SURVEY,DATE,Year,Month,Day,Timezone,Min_DEPTH,Max_Depth,SLAT_DD,SLONG_DD
0,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,
1,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,
2,ALB18-A02-2,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,10.0,14.0,,
3,ALB18-A1-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,12.0,16.0,,
4,ALB18-A1-2,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,8.0,8.0,,


In [43]:
## Comments

survey_clean['COMMENTS'] = sur.loc[sur['DATE'].isna() == False, 'COMMENTS'].copy()
survey_clean.head()

Unnamed: 0,Survey_Num,SURVEY,DATE,Year,Month,Day,Timezone,Min_DEPTH,Max_Depth,SLAT_DD,SLONG_DD,COMMENTS
0,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...
1,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...
2,ALB18-A02-2,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,10.0,14.0,,,RAPID EMERGENT SITE A02: 2 OF 2 TRANSECTS. HEA...
3,ALB18-A1-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,12.0,16.0,,,RAPID EMERGENT SITE A1: 1 OF 4 TRANSECTS. HEAD...
4,ALB18-A1-2,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,8.0,8.0,,,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...


In [44]:
## Merge with count table

# First need to strip whitespace from Survey_Num in count table
count['Survey_Num'] = count['Survey_Num'].str.strip()

# Merge
survey_and_count_clean = survey_clean.merge(count.iloc[:, 1:], how='left', on='Survey_Num')
print(survey_and_count_clean.shape)
survey_and_count_clean.head()

(66593, 15)


Unnamed: 0,Survey_Num,SURVEY,DATE,Year,Month,Day,Timezone,Min_DEPTH,Max_Depth,SLAT_DD,SLONG_DD,COMMENTS,Layer/Quadrat,SpeciesID,COUNT
0,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,LR0,A1,0
1,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,LR5,A1,0
2,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,LR10,A1,0
3,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,LR15,A1,0
4,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,LR20,A1,0


In [45]:
## Identify problems

# Surveys with no layer/quadrat information
no_layer = survey_and_count_clean.loc[survey_and_count_clean['Layer/Quadrat'].isna() == True, 'Survey_Num'].unique().tolist()
print('Surveys with at least one entry where layer is missing: {}'.format(len(no_layer)))

# Surveys missing both layer/quadrat information AND species ID - COUNT is either NaN or 0
no_layer_no_species = survey_and_count_clean.loc[(survey_and_count_clean['Layer/Quadrat'].isna() == True) & 
                                                 (survey_and_count_clean['SpeciesID'].isna() == True), 'Survey_Num'].unique().tolist()
print('Surveys with at least one entry where both layer and species id are missing: {}'.format(len(no_layer_no_species)))

# Surveys missing speciesID (not layer) - COUNT is NaN, 0, or 1
no_species = survey_and_count_clean.loc[(survey_and_count_clean['SpeciesID'].isna() == True) &
                                        (survey_and_count_clean['Layer/Quadrat'].isna() == False), 'Survey_Num'].unique().tolist()
print('Surveys with all layer info but at least one entry where species id missing: {}'.format(len(no_species)))

# Surveys lacking layer/quadrat, speciesID and COUNT - Merge problem in db?
no_everything = survey_and_count_clean.loc[(survey_and_count_clean['Layer/Quadrat'].isna() == True) &
                                           (survey_and_count_clean['SpeciesID'].isna() == True) &
                                           (survey_and_count_clean['COUNT'].isna() == True), 'Survey_Num'].unique().tolist()
print('Surveys with at least one entry where layer, species and count are missing: {}'.format(len(no_everything)))

Surveys with at least one entry where layer is missing: 487
Surveys with at least one entry where both layer and species id are missing: 32
Surveys with all layer info but at least one entry where species id missing: 5
Surveys with at least one entry where layer, species and count are missing: 21


In [46]:
## Rename columns

survey_and_count_clean.columns = [
    'SurveyID',
    'SurveyType',
    'SurveyDate',
    'Year',
    'Month',
    'Day',
    'Timezone',
    'MinimumDepth',
    'MaximumDepth',
    'Lat',
    'Lon',
    'Comments',
    'Layer',
    'SpeciesID',
    'Count'
]
survey_and_count_clean.head()

Unnamed: 0,SurveyID,SurveyType,SurveyDate,Year,Month,Day,Timezone,MinimumDepth,MaximumDepth,Lat,Lon,Comments,Layer,SpeciesID,Count
0,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,LR0,A1,0
1,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,LR5,A1,0
2,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,LR10,A1,0
3,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,LR15,A1,0
4,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,LR20,A1,0


**Problems**:
- Do we still need the following columns: TYPE, TIME_of_Day, NUMBER, DIVER (LEFT FOR TRANSECT), Orientation, Buddy (RIGHT FOR TRANSECT), RANGE, TIDEHEIGHT, Format, TIME_MIN, DISTANCE, ArmDescr, SwmDescr, TranDscr, GrwthDescr
    - Percent of rows with missing values for these columns:
        - TYPE (84%)
        - TIME_of_Day (58%)
        - NUMBER (1%) - **This one seems like it might contain something important.** 
        - DIVER (LEFT FOR TRANSECT) (13%) - **I can understand wanting to track this information, but do you want it online?**
        - Orientation (100%)
        - Buddy (RIGHT FOR TRANSECT) (21%) - **I can understand wanting to track this information, but do you want it online?**
        - RANGE (98%)
        - TIDEHEIGHT (98%)
        - Format (69%)
        - TIME_MIN (98%)
        - DISTANCE (99%)
        - ArmDescr (100%)
        - SwmDescr (100%)
        - TranDscr (100%)
        - GrwthDescr (100%)

```python
# Calculate % missing rows per column
(survey.isna().sum()/survey.shape[0])*100
```

- Survey_Num should be composed of the site code + the last two digits of the year + a letter indicating depth + a transect number. There are numerous exceptions to this formula. I was able to search for them with regex, but they might need to be corrected by hand. There's also sometimes leading/trailing whitespace. To see Survey_Nums that need to be corrected, look at list `do_not_fit` generated above. Note that Survey_Num values are unique, so they don't absolutely have to be fixed, but best practice would be to name things consistently.
- Suggest using a controlled vocabulary for SURVEY column
- There are two surveys with no date: OC18-B1-4, OC18-B8-1
- Some timezones should be PST rather than PDT, and I'm not sure how to figure this out programmatically
- Which depth to use? Pref. min and max for DwC conversion, although average could still be included for DataONE if desired. All depths have missing values and I think this is a required field for OBIS.
- Transect lats and lons:
    - All have at least 30% missing (NaN)
    - I assume the "old" coordinates shouldn't be used? Also, a bunch of them are in degrees, minutes, seconds which is making pandas read them as strings. There's probably a way to convert.
    - Many values are also 0. 
    - SLAT and SLONG have the fewest missing values, but are in a format I don't understand. Conversion? If SLAT_DD and SLONG_DD contain the converted coordinates, less than half of the existing, nonzero values have been converted.
    - I'm going to use SLAT_DD and SLONG_DD for now, and convert zeros to NaNs

```python
# Percent non-zero and non-missing values for all coordinate columns
num_na = sur.isna().sum()
num_na_or_not_zero = sur.astype(bool).sum()
((num_na_or_not_zero - num_na)/sur.shape[0])*100
```
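
On the degrees, minutes, seconds strings: a conversion along the following lines might work, but the exact formats used in the `*_old` columns have not been checked, so the parsing rules here are assumptions (up to three numbers, with an optional N/S/E/W suffix or leading minus sign). `dms_to_decimal` is a hypothetical helper.

```python
import re


def dms_to_decimal(value):
    """Convert a degrees/minutes/seconds string to decimal degrees.

    Assumes up to three numbers (deg, min, sec) and an optional N/S/E/W suffix;
    anything else returns NaN so it can be reviewed by hand.
    """
    if not isinstance(value, str):
        return float('nan')
    numbers = re.findall(r'\d+(?:\.\d+)?', value)
    if not numbers or len(numbers) > 3:
        return float('nan')
    # Pad with zeros so that 'deg' and 'deg min' strings also work
    parts = [float(n) for n in numbers] + [0.0, 0.0]
    decimal = parts[0] + parts[1] / 60 + parts[2] / 3600
    # South and west are negative; a leading minus sign is also honoured
    text = value.strip().upper()
    if text.startswith('-') or re.search(r'[SW]\s*$', text):
        decimal = -decimal
    return decimal


print(dms_to_decimal('38 30 36 N'))
print(dms_to_decimal("123°15'00\"W"))
```

It could be applied to a column with `Series.map`, after which the result should be spot-checked against known site locations.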

- Survey_Num has at least one entry with leading/trailing whitespace in the count table (' FR18-D5-1')
- There are two surveys with no data in the count table: ST07-15, ST10-11
- COUNT column is being interpreted as a string because of entries like '00'
- There is a bunch of missing information in the layer, speciesID, and count columns after merging the survey and count tables (i.e., after filtering the count table for Survey_Nums that are the correct survey type - rapid emergent, 30 m x 2 m). The simple solution would be to drop everything. This removes 3551 rows (5%).
    - I don't know if this missing data has anything to do with the missing zeros that Laura said needed to be populated for some species/transects.
- Based on Laura's description, I would expect these Layer/Quadrat values: L0, L5, L10, L15, L20, L25, R0, R5, R10, R15, R20, R25. There are actually a lot more values than that. I would need to know what they all mean and how to combine/replace. I suggest using a controlled vocabulary in the future (e.g., only the 12 values listed above). Only 142 surveys (out of 2941) have a total of 12 values (hopefully the expected ones, I haven't checked that).
- There is one species code that's not in the species table: Wh1. Typo?
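
For the COUNT entries that are read as strings, a minimal sketch of a numeric clean-up follows. The values in `raw_count` are made up apart from '00', and anything that cannot be parsed is set aside for review rather than silently kept.

```python
import pandas as pd

raw_count = pd.Series(['0', '00', '12', ' 3', None, 'x'])

# Unparseable entries become NaN so they can be reviewed rather than silently kept
clean_count = pd.to_numeric(raw_count.str.strip(), errors='coerce')

# Entries that had a value originally but could not be converted
needs_review = raw_count[clean_count.isna() & raw_count.notna()]

print(clean_count.tolist())
print(needs_review.tolist())
```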

```python
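# Note: these snippets use the original column names (Survey_Num, Layer/Quadrat, COUNT),
# so run them before the rename cell above
# survey_and_count_clean with missing values removed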
survey_and_count_clean.dropna(subset=['Layer/Quadrat', 'SpeciesID', 'COUNT'], how='any').shape

# Current Layer/Quadrat values
survey_and_count_clean['Layer/Quadrat'].unique()

# Number of surveys with 12 Layer/Quadrat values, as expected
no_na = survey_and_count_clean.dropna(subset=['Layer/Quadrat', 'SpeciesID', 'COUNT'], how='any')
num_layer_labels = no_na.groupby('Survey_Num', as_index=False)['Layer/Quadrat'].nunique()
num_layer_labels.loc[num_layer_labels['Layer/Quadrat'] == 12, 'Survey_Num'].nunique()

# Species codes in survey but not in species table
for sp in survey_and_count_clean['SpeciesID'].unique():
    if sp not in species['SpeciesID'].unique():
        print(sp)
```
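
Counting 12 distinct Layer/Quadrat values does not confirm they are the 12 expected ones. A sketch of a stricter check against a controlled vocabulary, using a made-up table in place of `survey_and_count_clean` (the vocabulary itself is the list of values expected from Laura's description):

```python
import pandas as pd

# L0 ... L25 and R0 ... R25: one label per 5 m block on each side of the transect
EXPECTED_LAYERS = {side + str(m) for side in 'LR' for m in range(0, 30, 5)}

# Made-up data: S1 is complete, S2 uses other labels
toy = pd.DataFrame({
    'Survey_Num': ['S1'] * 12 + ['S2'] * 3,
    'Layer/Quadrat': sorted(EXPECTED_LAYERS) + ['LR0', 'LR5', 'L0'],
})

# Set of labels seen in each survey, compared with the expected set
layers_per_survey = toy.groupby('Survey_Num')['Layer/Quadrat'].apply(set)
has_expected_layers = layers_per_survey.map(lambda seen: seen == EXPECTED_LAYERS)

# Labels that would need to be interpreted or recoded
unexpected_values = sorted(set(toy['Layer/Quadrat']) - EXPECTED_LAYERS)

print(has_expected_layers)
print(unexpected_values)
```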

## Percent cover table & percent cover categories table

In [22]:
## Merge cleaned survey data with percent cover data (in substrate table)

survey_and_pc_clean = survey_clean.merge(substrate[[
    'Survey_Num',
    'Subsample',
    'HabitatID',
    '%Total',
]], how='left', on='Survey_Num')
survey_and_pc_clean.head()

Unnamed: 0,Survey_Num,SURVEY,DATE,Year,Month,Day,Timezone,Min_DEPTH,Max_Depth,SLAT_DD,SLONG_DD,COMMENTS,Subsample,HabitatID,%Total
0,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,,,
1,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...,,,
2,ALB18-A02-2,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,10.0,14.0,,,RAPID EMERGENT SITE A02: 2 OF 2 TRANSECTS. HEA...,,,
3,ALB18-A1-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,12.0,16.0,,,RAPID EMERGENT SITE A1: 1 OF 4 TRANSECTS. HEAD...,,,
4,ALB18-A1-2,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,8.0,8.0,,,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...,,,


In [23]:
# Rename columns

survey_and_pc_clean.columns = [
    'SurveyID',
    'SurveyType',
    'SurveyDate',
    'Year',
    'Month',
    'Day',
    'Timezone',
    'MinimumDepth',
    'MaximumDepth',
    'Lat',
    'Lon',
    'Comments',
    'Subsample',
    'HabitatID',
    'PercentCover'
]
survey_and_pc_clean.head()

Unnamed: 0,SurveyID,SurveyType,SurveyDate,Year,Month,Day,Timezone,MinimumDepth,MaximumDepth,Lat,Lon,Comments,Subsample,HabitatID,PercentCover
0,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,,,
1,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...,,,
2,ALB18-A02-2,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,10.0,14.0,,,RAPID EMERGENT SITE A02: 2 OF 2 TRANSECTS. HEA...,,,
3,ALB18-A1-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,12.0,16.0,,,RAPID EMERGENT SITE A1: 1 OF 4 TRANSECTS. HEAD...,,,
4,ALB18-A1-2,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,8.0,8.0,,,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...,,,


In [24]:
## Create percent cover categories table

# Drop codes that are not used in the percent cover data
pc_categories = habitat.iloc[0:10, :].copy()

# Change column names
pc_categories.columns = ['HabitatID', 'HabitatType']

# Add habitat definition
pc_categories['HabitatDefinition'] = ''
pc_categories

Unnamed: 0,HabitatID,HabitatType,HabitatDefinition
0,ALG1,BARE ROCK,
1,ALG2,ENCRUSTING,
2,ALG3,TURF,
3,ALG4,FOLIOSE,
4,ALG5,SUBCANOPY,
5,ALG6,CANOPY,
6,SUB1,REEF,
7,SUB2,BOULDER,
8,SUB3,COBBLE (Movable),
9,SUB4,SAND,


**Problems:**
- This table can use most of the same columns as the count table.
- There are quite a few surveys with no percent cover data. If these really weren't collected, I'd probably just drop the associated records.
- Two surveys are listed in both the substrate and survey tables, but have no percent cover data: FR15-CA3, VD03-A9 
- There are missing values generally in the HabitatID, %Total, and Subsample columns
- Based on what Laura told me, I expected subsample to include: 0, 10, 20, 30. There are a number of other values. In particular, in the surveys we care about, there are L, R, LR, and R30. I would need to know how to interpret these. I would also suggest using a controlled vocabulary from now on.
- To avoid confusion, I would create a percent_cover_categories table to accompany the percent cover table. This would be the existing habitat table with the UPC codes removed. I would update the HABITAT column to something more descriptive - maybe HabitatType - and add a HabitatDefinition column that says in words what each code means. E.g. for ENCRUSTING you could put something like "encrusting algal species, such as example1, example2." 

```python
# Surveys with no percent cover data
merged = survey_clean.merge(substrate, how='left', on='Survey_Num', indicator=True)
no_pc = merged.loc[merged['_merge'] == 'left_only', 'Survey_Num'].unique().tolist()
len(no_pc)

# Surveys in both substrate and survey tables but lacking percent cover data anyway
merged[
    (merged['_merge'] == 'both') &
    (merged['HabitatID'].isna() == True) &
    (merged['%Total'].isna() == True) &
    (merged['Subsample'].isna() == True)
]
```
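
One way to get a hint about whether %Total values were averaged or added would be to total them within each survey and subsample. The sketch below uses made-up rows and assumes that algal cover (ALG codes) and substrate (SUB codes) should each sum to 100 on their own; that grouping is a guess and would need to be confirmed with Laura.

```python
import pandas as pd

# Made-up rows for illustration only
toy_substrate = pd.DataFrame({
    'Survey_Num': ['S1'] * 4 + ['S2'] * 2,
    'Subsample': ['L', 'L', 'L', 'L', 'LR', 'LR'],
    'HabitatID': ['SUB1', 'SUB2', 'ALG1', 'ALG3', 'SUB1', 'SUB4'],
    '%Total': [60, 40, 70, 30, 50, 30],
})

# Assumption: the first three characters separate algal cover (ALG) from substrate (SUB)
toy_substrate['Group'] = toy_substrate['HabitatID'].str[:3]
totals = (
    toy_substrate
    .groupby(['Survey_Num', 'Subsample', 'Group'], as_index=False)['%Total']
    .sum()
)

# Rows that do not total 100 are the ones worth a closer look
off_total = totals[totals['%Total'] != 100]
print(off_total)
```

Totals well above 100 would suggest the values were added across the measurement marks rather than averaged.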

## Size table

In [25]:
## Merge clean survey data with size data (in size table)

survey_and_size_clean = survey_clean.merge(size[[
    'Survey_Num', 
    'Lft_or_Rt', 
    'SpeciesID', 
    'SIZE',
]], how='left', on='Survey_Num')

print(survey_and_size_clean.shape)
survey_and_size_clean.head()

(76813, 15)


Unnamed: 0,Survey_Num,SURVEY,DATE,Year,Month,Day,Timezone,Min_DEPTH,Max_Depth,SLAT_DD,SLONG_DD,COMMENTS,Lft_or_Rt,SpeciesID,SIZE
0,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,,,
1,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...,,A1,181.0
2,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...,,A1,180.0
3,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...,,A1,135.0
4,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...,,A1,295.0


In [26]:
## Rename columns
survey_and_size_clean.columns = [
    'SurveyID',
    'SurveyType',
    'SurveyDate',
    'Year',
    'Month',
    'Day',
    'Timezone',
    'MinimumDepth',
    'MaximumDepth',
    'Lat',
    'Lon',
    'Comments',
    'LeftOrRight',
    'SpeciesID',
    'Size'
]
survey_and_size_clean.head()

Unnamed: 0,SurveyID,SurveyType,SurveyDate,Year,Month,Day,Timezone,MinimumDepth,MaximumDepth,Lat,Lon,Comments,LeftOrRight,SpeciesID,Size
0,FR18-D5-1,Transect - 30 m x 2 m (Emergent),2018-07-12,2018,7,12,PDT,48.0,,,,RAPID EMERGENT SURVEY. D5: 1 OUT OF 2 TRANSECT...,,,
1,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...,,A1,181.0
2,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...,,A1,180.0
3,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...,,A1,135.0
4,ALB18-A02-1,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PDT,7.0,10.0,,,RAPID EMERGENT SITE A02: 1 OF 2 TRANSECTS. HEA...,,A1,295.0


**Problems**
- About 500 surveys don't have any size data associated with them. If this is accurate, I would probably drop these from the table.
- Is the Left or Right designation important here? If so, it's missing in ~1/4 of the rows. 
- The Wh1 species code appears in the size data but, as noted in the count data, is missing from the species table
- There are no missing values in the SpeciesID column. Yay! But there are some missing sizes. There are 17 surveys for which at least one size value is missing. The species code is always defined for these rows, and the Left or Right designation is sometimes also defined.

```python
# Surveys with no size data
merged = survey_clean.merge(size, how='left', on='Survey_Num', indicator=True)
no_size = merged.loc[merged['_merge'] == 'left_only', 'Survey_Num'].unique().tolist()
len(no_size)

# Surveys with size data but at least one size missing
both = merged[merged['_merge'] == 'both']
no_size = both[both['SIZE'].isna() == True]
no_size['Survey_Num'].unique().tolist()
```

## General concerns
- I have only identified problems with the data associated with the survey type Laura wants to share, so I can't give guidance for the rest of it
- I don't know what to do about the surveys where zeros need to be populated. To do this programmatically, I would need a table of which organisms were looked for during which years, and I would need very clean data tables to work with. I can't do it by hand. Does this affect just the count data, or does it affect the percent cover data as well? The counts are not *wrong* if zeros aren't present, but they're definitely a lot harder to use. At the very least, we would have to provide a clear disclaimer and suggest users talk to Laura to figure things out.
- In addition to the data, we will have to fill out the metadata on DataONE, including descriptions of each column (what it is, data type, if it's categories what all the different categories mean)
- DataONE would also like you to submit a written protocol as a separate file
- How is Laura going to replicate this longer term? My 'clean' tables won't reflect what she has on her Access Db without overhauling the database.
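
If a table of which organisms were looked for in which years did exist, populating zeros could look roughly like the sketch below. All three tables are made up, `looked_for` is the hypothetical year-by-species table, and counts are treated as one value per survey and species; filling zeros per layer would need Layer added to the merge keys.

```python
import pandas as pd

# All three tables are made up; `looked_for` is the hypothetical year-by-species table
surveys = pd.DataFrame({'SurveyID': ['S1', 'S2'], 'Year': [2017, 2018]})
looked_for = pd.DataFrame({'Year': [2017, 2018, 2018], 'SpeciesID': ['A1', 'A1', 'A11']})
counts = pd.DataFrame({'SurveyID': ['S1', 'S2'], 'SpeciesID': ['A1', 'A11'], 'Count': [4, 2]})

# Every species that should have been recorded on every survey
expected = surveys.merge(looked_for, on='Year', how='left')

# Attach observed counts; anything expected but absent becomes an explicit zero
filled = expected.merge(counts, on=['SurveyID', 'SpeciesID'], how='left')
filled['Count'] = filled['Count'].fillna(0).astype(int)
print(filled)
```

This only works if a missing row really means "looked for and not seen", which is exactly the question raised above.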

## What *has* to be done?
- We *could* do nothing. I think DataONE would take the tables as they are. But they definitely would not meet guidelines/best practices, and would not be very useful.
- For the data to be useful, at minimum, IMO, we should:
    - Get lat, lon for the site table
    - Fill in correct scientific name for each organism or organism category in the species table. Figure out what Wh1 refers to (in the count and size data) and add it to the species table or fix the code in the count and size tables.
    - Define habitat codes in percent cover categories table
    - Remove useless columns from the survey table (at the least, the ones that are > 95% missing)
    - Create and apply a controlled vocabulary for survey type in the survey table
    - Identify and populate as much as possible the correct lat, lon in the survey table
    - Clean the count column in the count table of non-numeric values
    - Ensure missing data in the count table is truly missing, and drop any rows where there is no count value. Can we trust that a missing value means the organism wasn't looked for? Or do they ever represent zero counts? Alternatively, we could drop all rows with any missing data.
    - Figure out what to do with the Layer/Quadrat column. What do all the different values mean? Does the area to which the count applies differ for different values of Layer/Quadrat? How do we interpret the count value when Layer/Quadrat is missing? Implement a controlled vocabulary. **Note that this column will have to be fully filled out and accurate if we want to combine data to a single count for an entire transect. Also note that very few surveys have the expected 12 values... there's a lot of variation in how this column has been filled out.**
    - Ensure missing data in the percent cover table is truly missing (i.e. the correct surveys are lacking data). Drop any rows where there is no percent cover value. Can we trust that a missing value means the organism wasn't looked for? Or do missing values ever represent zero percent cover?
    - Define different values in the subsample column, and implement a controlled vocabulary. Does the subsample change how we interpret the data?
    - Ensure missing data in the size table is truly missing (i.e. size data weren't collected during that survey). Drop any rows where there is no size value.
    - Define the left or right column in the size table and figure out if/how it's important
    - Provide a sentence or two defining each column in each table, and defining any controlled vocabulary terms
    - Provide a PDF document that briefly summarizes the goals of this survey, what data are collected, and how the data are collected
- To help, I can:
    - Look up organisms' taxonomies and IDs on WoRMS
    - Remove columns and rows that need to be dropped (e.g. because of missing values)
    - Apply controlled vocabularies
    - Make it clear what area a count or percent cover value applies to using cleaned Left or Right/Subsample/Layer columns
    - Create a submission on DataONE and upload all data tables
    - Populate the required metadata on DataONE
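
Three of the mechanical steps in the list above (dropping mostly-empty columns, cleaning non-numeric counts, and applying a controlled vocabulary) could be sketched as follows. This is a minimal sketch, not the final cleaning code: the helper names, the example vocabulary, and the example values are all hypothetical, and the real survey-type terms still need to be agreed on.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.95):
    """Keep only columns whose fraction of missing values is <= max_missing."""
    return df.loc[:, df.isna().mean() <= max_missing]

def clean_counts(series):
    """Coerce to numeric; anything non-numeric becomes NaN so it can be reviewed."""
    return pd.to_numeric(series, errors="coerce")

def apply_vocabulary(series, vocab):
    """Map free-text values to controlled terms; unmapped values become NaN."""
    return series.str.strip().str.lower().map(vocab)

# Hypothetical example data
survey = pd.DataFrame({
    "Survey_Num": [1, 2, 3],
    "SURVEY": ["Transect ", "transect", "Timed Swim"],
    "SLAT_old": [np.nan, np.nan, np.nan],  # entirely missing, so it is dropped
})
raw_counts = pd.Series(["3", "12", "P", None, "4?"])
vocab = {"transect": "transect", "timed swim": "timed swim"}  # placeholder terms

survey_clean = drop_sparse_columns(survey)
counts_clean = clean_counts(raw_counts)
survey_type = apply_vocabulary(survey_clean["SURVEY"], vocab)
```

Coercing to NaN rather than dropping rows keeps the problem entries visible, so they can be checked against the original datasheets before anything is deleted.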
    
**Time investment so far:** ~ 20 hours (not including initial data extraction from db, initial exploration, initial conversations with Laura about what everything means)

**Projected time investment:** Maybe 3 hours for conversations/meetings, 4 hours to code WoRMS ids, controlled vocabs, etc., 16 hours to create and populate DataONE submission? So maybe 24 hours? Realistically, that 24 hours of work would be spread over 3-4 weeks.

## Meeting outcomes

### DLG TODO:
1. Edit count table
    - Create MPA or not column
    - Add average depth column back in
    - Remove layer column
        - Assess transects with missing data in SpeciesID and Count columns
    - Aggregate counts such that there is one number per species per transect
        - A = absent, count = 0
        - P = present, count = NaN
        - 9999 = present but it wasn't possible to measure the individual
    - Add in present/absent column 
2. Edit percent cover table
    - Create MPA or not column
    - Add average depth column back in
3. Edit size table
    - Create MPA or not column
    - Add average depth column back in
    - Figure out exactly which transects have missing data
    - Drop LeftOrRight column
4. Share deliverables (by 8/13)
    - Email a prioritized to-do list to Laura and Bob
    - Create and share Google Drive with template tables
    - Color code template tables for what Laura needs to do versus what I'll do
    - Share link to [PISCO DataONE submission](https://opc.dataone.org/view/MLPA_kelpforest.metadata.1) to serve as an example
    - Share list of surveyIDs that do and don't conform to formatting
    - Share list of transects with missing data in SpeciesID and Count columns in count table
    - Share findings about what data is missing in the size table
5. Ask DataONE
    - how they would like CA_MPA_Name_Short designated if some transects are inside an MPA and some are outside
    - how and whether to fill out LTM_project_short_code
6. Lingering issues
    - PDT versus PST in the timezone column
    - Whether and which transect-level lat, lons to include
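
The count aggregation in item 1 could be sketched roughly as below. Several things here are my assumptions rather than decisions from the meeting: the column names (`Survey_Num`, `SpeciesID`, `Count`) and species codes are placeholders, 9999 is treated the same way as P (present, count unknown), rows with a missing SpeciesID or Count are set aside for separate assessment, and any unknown count within a transect makes that transect's total NaN.

```python
import numpy as np
import pandas as pd

SENTINEL = 9999  # present, but no usable number was recorded

def recode(raw):
    """Translate one raw Count entry into a (count, present) pair."""
    text = str(raw).strip().upper()
    if text == "A":                        # absent
        return 0.0, False
    if text == "P" or float(text) == SENTINEL:
        return np.nan, True                # present, count unknown
    value = float(text)                    # raises ValueError on unexpected text
    return value, value > 0

def aggregate_counts(df):
    """One row per species per transect, with a present/absent column."""
    # Rows missing SpeciesID or Count are assessed separately, not aggregated
    known = df.dropna(subset=["SpeciesID", "Count"])
    recoded = [recode(raw) for raw in known["Count"]]
    known = known.assign(Count=[c for c, _ in recoded],
                         Present=[p for _, p in recoded])
    grouped = known.groupby(["Survey_Num", "SpeciesID"], as_index=False)
    return grouped.agg(
        # skipna=False: a single unknown count makes the total unknown
        Count=("Count", lambda s: s.sum(skipna=False)),
        Present=("Present", "any"),
    )

# Hypothetical example data
raw = pd.DataFrame({
    "Survey_Num": [1, 1, 1, 1, 2, 2],
    "SpeciesID": ["RA", "RA", "PU", "PU", "RA", "PU"],
    "Count": ["3", "2", "P", "1", "A", "9999"],
})
totals = aggregate_counts(raw)
```

An alternative would be to sum only the known counts and report the result as a minimum, but that would need an extra column flagging which totals are incomplete.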

### Laura/Bob TODO:
1. Fill in a "central" lat, lon for each site
2. Wh1 is a species code in the data but not in the table. Does it need to be added to the table? Or is it a typo in the data? (LRB)
3. Clean scientific names 
    - Full Latin names (Genus and Species written out)
    - If a group cannot be determined to the species level, give the next highest taxonomic rank that is accurate (e.g. class Asteroidea for general sea stars)
4. Fill out species definition column 
    - This can just repeat the scientific name if known to the species level
    - You can also give more information on each of your codes, such as when the code started being used, whether it refers to live or dead animals, etc.
5. Fill in HabitatDefinition column with explanations of HabitatType 
6. Add and fill a column specifying whether each transect is inside MPA, outside MPA, or has some other level of protection or restriction
7. Flesh out depth columns 
8. Address other missing data?
9. Standardize SurveyID format
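
Once a SurveyID convention is agreed on, splitting the IDs into conforming and non-conforming lists could be done with something like the sketch below. The pattern is purely a placeholder I made up (a two- or three-letter SiteID, a hyphen, then digits), and the example IDs are hypothetical; the real format still has to be decided.

```python
import pandas as pd

# HYPOTHETICAL pattern, for illustration only: the real convention is undecided.
EXAMPLE_PATTERN = r"[A-Z]{2,3}-\d+"

def split_by_format(ids, pattern=EXAMPLE_PATTERN):
    """Return (conforming, non_conforming) lists of IDs for the given regex."""
    ids = pd.Series(ids).astype(str)
    ok = ids.str.fullmatch(pattern, na=False)  # the whole ID must match
    return ids[ok].tolist(), ids[~ok].tolist()

# Hypothetical example IDs
conforming, non_conforming = split_by_format(["VD-101", "FR-12", "vd101", "PtA_7"])
```

Keeping the pattern as a parameter means the same check can be rerun unchanged after Laura and Bob settle on the format.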