# North coast kelp
## Prep for DataONE
Submission guidelines: https://opc.dataone.org/support

This code uses the new data tables, which were formulated from the guidelines and templates I developed in north_coast_kelp_dataone.ipynb.

In [1]:
## Imports

import pandas as pd
import numpy as np
# import csv, pyodbc
import pickle
import datetime
import pyworms

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, '/Users/dianalg/PycharmProjects/PythonScripts/MPA data integration/')

import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Load data from csv files

site = pd.read_csv('site_template-complete.csv')
species = pd.read_csv('species_template-complete.csv')
pc_cat = pd.read_csv('percent_cover_categories_template-complete.csv')
survey = pd.read_csv('Revision to tblSurvey-RK-21-11-11.csv')

In [4]:
## Function to load data from pickled db tables

def load_table(tbl_name):
    """Takes tbl_name (a string) and loads saved data from that table."""
    
    # Get filenames
    col_name = tbl_name + '_cols.data'
    data_name = tbl_name + '.csv'
    
    # Retrieve column names
    with open(col_name, 'rb') as file:
        cols = pickle.load(file)
        
    # Load data
    data = pd.read_csv(data_name, header=None, names=cols)
    return data

In [5]:
## Load data from pickled db tables

count = load_table('tblCounts')
size = load_table('Tbl_New_size')
pc = load_table('tblSubstrate')

**NOTE** that in the future, I hope that Laura will just send me .csv files of these tables.
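
If the tables do arrive as plain .csv files with a header row, `load_table` reduces to a single `pd.read_csv` call. Until then, here is a minimal sketch of how the current file pair fits together, using a made-up two-column table written to a temporary directory (the table name `tblDemo` and its contents are hypothetical):

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for one exported db table
demo = pd.DataFrame({'Survey_Num': ['X18-A1-1', 'X18-A1-2'], 'COUNT': [3, 0]})

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, 'tblDemo')

    # Export side: headerless csv plus pickled column names
    demo.to_csv(base + '.csv', header=False, index=False)
    with open(base + '_cols.data', 'wb') as file:
        pickle.dump(list(demo.columns), file)

    # Import side: same steps as load_table
    with open(base + '_cols.data', 'rb') as file:
        cols = pickle.load(file)
    reloaded = pd.read_csv(base + '.csv', header=None, names=cols)

print(reloaded)
```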

The data tables are as follows:
- **site**: Contains site info
- **survey**: Contains survey-level info, including Survey_ID, SiteID, a unique Survey_Num, a description of the type of survey (SURVEY), the date of the survey (DATE), information about depth (Avg Depth, Min_DEPTH, Max_Depth), information about location, etc.
- **count**: Contains number (COUNT) of each species (SpeciesID) observed during a 5 m block (Layer/Quadrat) along the transect surveyed 
- **species**: Contains species info
- **size**: Contains sizes (SIZE) of ~ 30 or fewer individuals of target species (SpeciesID) obtained during a survey (Survey_Num).
- **pc**: Contains the percentage (%Total) of each biotic and abiotic substrate type (HabitatID) for a given Survey_Num. 
- **pc_cat**: Contains substrate type codes (HabitatID) and descriptions (HabitatDefinition).

For DataONE, I suggest creating the following tables (based on DataONE guidelines):
- **Site** table, containing site codes, site names, coordinates, CA_MPA_Name_Short, and LTM_project_short_code
- **Species** table, containing species codes, scientific name, ideally common name as well, major taxonomic ranks, WoRMS ID, and species_definition
- **Count** table, containing the number of each organism observed in each layer of each transect during each survey
- **Percent cover** table, containing the percentage of each biotic and abiotic substrate type observed on each transect of each survey
- **Percent cover categories** table, describing biological and abiological substrate codes
- **Size** table, containing the sizes of organisms sampled during each survey

First, I'm going to tidy the survey table by removing survey types that Laura doesn't want included. I will also limit the columns to those that are relevant based on my work with Laura and Bob.

In [6]:
## Tidy survey table

# Select relevant columns
sur = survey[[ 
    'SiteID', 
    'DFW_short_code',
    'Survey_Num',
    'MPA Status',
    'new_Lat',
    'new_Lon',
    'SURVEY', 
    'DATE', 
    'Avg Depth',
    'Min_DEPTH', 
    'Max_Depth',
    'COMMENTS',
]]

# Filter survey type as instructed by Laura
print(sur.shape)
surveys_to_keep = [
    'Transect - 30m (Rapid Emergent)',
    'Transect - 30mx2m (Emergent)',
    'Transect - 30m (Emergent)',
]
sur = sur[sur['SURVEY'].isin(surveys_to_keep)]
print(sur.shape)

# View
sur.head()

(3566, 12)
(2941, 12)


Unnamed: 0,SiteID,DFW_short_code,Survey_Num,MPA Status,new_Lat,new_Lon,SURVEY,DATE,Avg Depth,Min_DEPTH,Max_Depth,COMMENTS
0,Albion Bay,ALB,ALB17-A1-2,non-MPA,39.25166,-123.49137,Transect - 30m (Rapid Emergent),8/30/2018,8.0,8.0,8.0,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...
1,Albion Bay,ALB,ALB18-A1-3,non-MPA,39.13404,-123.46242,Transect - 30m (Rapid Emergent),8/20/2018,12.5,11.0,14.0,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...
2,Albion Bay,ALB,ALB18-A1-4,non-MPA,39.13404,-123.46242,Transect - 30m (Rapid Emergent),8/30/2018,5.5,4.0,7.0,RAPID EMERGENT SITE A1: 4 OF 4 TRANSECTS. HEAD...
3,Albion Bay,ALB,ALB18-A3-1,non-MPA,39.13404,-123.46242,Transect - 30m (Rapid Emergent),8/30/2018,13.5,12.0,15.0,RAPID EMERGENT SITE A3: 1 OF 2 TRANSECTS. HEAD...
4,Albion Bay,ALB,ALB18-A3-2,non-MPA,39.13404,-123.46242,Transect - 30m (Rapid Emergent),8/30/2018,15.0,12.0,18.0,RAPID EMERGENT SITE A3: 2 OF 2 TRANSECTS. HEAD...
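
Before filtering, it can help to see which survey types are about to be dropped. A sketch on made-up rows (the type 'Swath' is hypothetical and only stands in for any type not on the keep list):

```python
import pandas as pd

toy = pd.DataFrame({'SURVEY': [
    'Transect - 30m (Emergent)',
    'Transect - 30m (Rapid Emergent)',
    'Swath',  # hypothetical type not on the keep list
    'Transect - 30m (Emergent)',
]})
keep = [
    'Transect - 30m (Rapid Emergent)',
    'Transect - 30mx2m (Emergent)',
    'Transect - 30m (Emergent)',
]

is_kept = toy['SURVEY'].isin(keep)
print(toy.loc[~is_kept, 'SURVEY'].value_counts())  # types that would be removed
filtered = toy[is_kept]
```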


## Site table

This looks clean. I only have two comments:
- Lon should be negative
- Remove spaces for DFW_short_code PA Pier

In [7]:
## Tidy site

# Use -abs() so that re-running this cell doesn't flip the sign back
site['Lon'] = -site['Lon'].abs()
site.loc[site['DFW_short_code'] == 'PA Pier', 'DFW_short_code'] = 'PA_Pier'
site.head()

Unnamed: 0,SiteID,SiteName,Lat,Lon,ProtectionStatus,DFW_short_code,CA_MPA_Name_Short,LTM_project_short_code,Notes
0,ALB,Albion Bay,39.13404,-123.46242,non-MPA,ALB,,,
1,BR,Bodega Marine Life Refuge (BML),38.1856,-123.04157,no take MPA,BR,Bodega Head SMR,,
2,CC,Caspar Cove (South),39.21606,-123.49697,no commercial red sea urchin fishing,CCin,,,
3,CC,Caspar Cove (North),39.22394,-123.49743,non-MPA,CCout,,,
4,FM,Fisk Mill Cove,38.35483,-123.21148,non-MPA,FM,,,
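
A quick sanity check along the same lines, sketched on a hypothetical three-row table rather than the real site file (the longitude given for PA Pier is invented). Taking `-abs()` rather than multiplying by -1 keeps the sign fix safe to run more than once.

```python
import pandas as pd

# Hypothetical site rows: two longitudes missing their sign, one code with a space
demo_site = pd.DataFrame({
    'DFW_short_code': ['ALB', 'PA Pier', 'FM'],
    'Lon': [123.46242, -123.7, 123.21148],
})

# Force western-hemisphere longitudes to be negative (idempotent)
demo_site['Lon'] = -demo_site['Lon'].abs()

# Replace spaces in site codes with underscores
demo_site['DFW_short_code'] = demo_site['DFW_short_code'].str.replace(' ', '_', regex=False)

assert (demo_site['Lon'] < 0).all()
assert not demo_site['DFW_short_code'].str.contains(' ').any()
```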


## Species table

I defined SpeciesID Wh1 as a whelk not identified beyond the family Buccinidae. Note that another code in the table has the same meaning.

Also, there are a number of species in the species table that do not occur in the data. These are likely southern/central California species that do not appear in this survey type. We can remove these for the purposes of this submission.
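
One caveat when removing rows this way: boolean filtering keeps the original row labels, and assigning a column from another DataFrame or Series aligns on those labels rather than on position. A small sketch with made-up codes shows why resetting the index after a filter matters when values from a separately built table are attached later:

```python
import pandas as pd

# Hypothetical species table; only A1 and K11 occur in the data
demo_species = pd.DataFrame({'SpeciesID': ['A0', 'A1', 'B5', 'K11']})
kept = demo_species[demo_species['SpeciesID'].isin(['A1', 'K11'])]
print(kept.index.tolist())  # the filter keeps the original labels

# A lookup result built separately has a fresh 0..n-1 index
lookup = pd.Series(['id_for_A1', 'id_for_K11'])

# Label alignment: the values land on the wrong rows or go missing
misaligned = kept.assign(taxonomic_id=lookup)

# Resetting the index first makes position and label agree
aligned = kept.reset_index(drop=True).assign(taxonomic_id=lookup)
print(misaligned)
print(aligned)
```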

In [22]:
## Remove species not observed in this survey type

sp_to_keep = count['SpeciesID'].str.strip().unique()
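# Reset the index so later assignments from worms_out (indexed 0..n-1) line up row by row
species = species[species['SpeciesID'].isin(sp_to_keep)].reset_index(drop=True)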

In [23]:
## Query WoRMS for remaining information

names = species['ScientificName']
results = pyworms.aphiaRecordsByMatchNames(names.tolist())

# Unpack results (DataFrame.append was removed in pandas 2.0, so collect frames and concat once)
frames = []
for i, matches in enumerate(results):
    if not matches:  # Handle if no match was found (placeholder row of NaN)
        frames.append(pd.DataFrame(index=[0]))
    else:
        norm = pd.json_normalize(matches)
        if norm.shape[0] > 1:  # print warning if multiple matches were found
            print('Multiple matches found for {name} at index {idx} in names list.'.format(name=names.iloc[i], idx=i))
        frames.append(norm)
worms_out = pd.concat(frames)

print(len(names))
print(worms_out.shape)
worms_out.head()

177
(177, 27)


Unnamed: 0,AphiaID,url,scientificname,authority,status,unacceptreason,taxonRankID,rank,valid_AphiaID,valid_name,...,genus,citation,lsid,isMarine,isBrackish,isFreshwater,isTerrestrial,isExtinct,match_type,modified
0,138050,https://www.marinespecies.org/aphia.php?p=taxd...,Haliotis,"Linnaeus, 1758",accepted,,180,Genus,138050,Haliotis,...,Haliotis,MolluscaBase eds. (2022). MolluscaBase. Haliot...,urn:lsid:marinespecies.org:taxname:138050,1,,,,,exact,2019-08-15T01:19:50.897Z
0,445357,https://www.marinespecies.org/aphia.php?p=taxd...,Haliotis rufescens,"Swainson, 1822",accepted,,220,Species,445357,Haliotis rufescens,...,Haliotis,MolluscaBase eds. (2022). MolluscaBase. Haliot...,urn:lsid:marinespecies.org:taxname:445357,1,0.0,0.0,0.0,,exact,2022-01-27T23:12:16.977Z
0,445374,https://www.marinespecies.org/aphia.php?p=taxd...,Haliotis walallensis,"Stearns, 1899",accepted,,220,Species,445374,Haliotis walallensis,...,Haliotis,MolluscaBase eds. (2022). MolluscaBase. Haliot...,urn:lsid:marinespecies.org:taxname:445374,1,0.0,0.0,0.0,,exact,2022-01-27T23:12:16.977Z
0,405014,https://www.marinespecies.org/aphia.php?p=taxd...,Haliotis kamtschatkana,"Jonas, 1845",accepted,,220,Species,405014,Haliotis kamtschatkana,...,Haliotis,MolluscaBase eds. (2022). MolluscaBase. Haliot...,urn:lsid:marinespecies.org:taxname:405014,1,0.0,0.0,0.0,,exact,2020-03-26T04:52:02.953Z
0,445325,https://www.marinespecies.org/aphia.php?p=taxd...,Haliotis fulgens,"Philippi, 1845",accepted,,220,Species,445325,Haliotis fulgens,...,Haliotis,MolluscaBase eds. (2022). MolluscaBase. Haliot...,urn:lsid:marinespecies.org:taxname:445325,1,,,,,exact,2010-10-03T15:39:41.620Z
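
The unpacking step can be exercised without a network call. The sketch below assumes the return shape used above (one list of record dicts per queried name, empty when nothing matched) and uses a hand-written stand-in for `results`. The second record for 'Haliotis' and the name 'Not a taxon' are invented purely to trigger the multiple-match and no-match branches.

```python
import pandas as pd

queried = ['Haliotis', 'Haliotis rufescens', 'Not a taxon']

# Stand-in for the pyworms response (assumed shape: list of lists of dicts)
fake_results = [
    [{'AphiaID': 138050, 'scientificname': 'Haliotis'},
     {'AphiaID': -1, 'scientificname': 'Haliotis'}],  # invented duplicate
    [{'AphiaID': 445357, 'scientificname': 'Haliotis rufescens'}],
    [],
]

frames = []
for name, matches in zip(queried, fake_results):
    if not matches:
        frame = pd.DataFrame(index=[0])  # placeholder row of NaN
    else:
        frame = pd.json_normalize(matches)
        if len(frame) > 1:
            print('Multiple matches found for {name}'.format(name=name))
    frame['query_name'] = name  # record which name produced each row
    frames.append(frame)

demo_out = pd.concat(frames, ignore_index=True)
print(demo_out)
```

Attaching `query_name` inside the loop, rather than afterwards, keeps each row tied to the name that produced it even when a name returns several records.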


In [24]:
## Tidy results

# Reset index (index seems to be getting messed up sometimes)
worms_out = worms_out.reset_index(drop=True)

# Find scientificname for name that was matched multiple times
name = results[128][0]['scientificname']

# Look at rows that were added to worms_out
rows = worms_out[worms_out['scientificname'] == name]

# Go to WoRMS and/or data provider to select correct match and remove others
match_idx = 128
for i in rows.index:
    if i != match_idx:
        worms_out.drop(i, inplace=True)
        
# Reset index
worms_out.reset_index(drop=True, inplace=True)
        
worms_out.shape

(177, 27)

In [25]:
## Add name that was searched

worms_out['query_name'] = names

In [26]:
## Identify names that weren't found

worms_out.loc[worms_out['scientificname'].isna(), 'query_name']

Series([], Name: query_name, dtype: object)

In [27]:
## Complete species table

species['taxonomic_id'] = worms_out['AphiaID']
species['Kingdom'] = worms_out['kingdom']
species['Phylum'] = worms_out['phylum']
species['Class'] = worms_out['class']
species['Order'] = worms_out['order']
species['Family'] = worms_out['family']
species['Genus'] = worms_out['genus']
species['Species'] = species['ScientificName'].copy()

species.head()

Unnamed: 0,SpeciesID,CommonName,ScientificName,species_definition,taxonomic_source,taxonomic_id,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,A0,Abalone species,Haliotis spp.,"abalone, not identified to species level",WoRMS,138050.0,Animalia,Mollusca,Gastropoda,Lepetellida,Haliotidae,Haliotis,Haliotis spp.
1,A1,Red Abalone,Haliotis rufescens,Haliotis rufescens,WoRMS,445357.0,Animalia,Mollusca,Gastropoda,Lepetellida,Haliotidae,Haliotis,Haliotis rufescens
2,A11,Flat Abalone,Haliotis walallensis,Haliotis walallensis,WoRMS,445374.0,Animalia,Mollusca,Gastropoda,Lepetellida,Haliotidae,Haliotis,Haliotis walallensis
3,A12,Pinto Abalone,Haliotis kamtschatkana,Haliotis kamtschatkana,WoRMS,405014.0,Animalia,Mollusca,Gastropoda,Lepetellida,Haliotidae,Haliotis,Haliotis kamtschatkana
4,A13,Green Abalone,Haliotis fulgens,Haliotis fulgens,WoRMS,445325.0,Animalia,Mollusca,Gastropoda,Lepetellida,Haliotidae,Haliotis,Haliotis fulgens


## Count table

Notes:
- 182 surveys remain with missing depths
- There are 27 surveys where count data were not taken (leaving these in as Laura requested)
- Many count records belong to surveys that aren't listed in the survey table, probably because they come from other survey types. I've dropped them for now.

```python
# Check whether there are any Species IDs in the count table that are missing in the species table
for code in survey_and_count_clean['SpeciesID'].unique():
    if code not in species['SpeciesID'].unique():
        print(code)
        
# Surveys where no count data were taken
out = survey_clean.merge(count_agg, how='outer', on='Survey_Num', indicator=True)
out[out['_merge'] == 'left_only'].shape

# Surveys that have count data but are not listed in survey table
out.loc[out['_merge'] == 'right_only', 'Survey_Num'].unique()
```

In [28]:
## Clean SURVEY

survey_clean = sur[['Survey_Num', 'DFW_short_code', 'SURVEY']].copy()
survey_clean['SURVEY'] = 'Transect - 30 m x 2 m (Emergent)'

In [29]:
## Clean DATE

# There are still two surveys that don't have a date (OC18-B1-4, OC18-B8-1)
survey_clean['DATE'] = sur['DATE'].copy()

# Turn DATE into datetime
survey_clean['DATE'] = pd.to_datetime(survey_clean['DATE'])

# Add year, month, day as required by DataONE
survey_clean['Year'] = survey_clean['DATE'].dt.year
survey_clean['Month'] = survey_clean['DATE'].dt.month
survey_clean['Day'] = survey_clean['DATE'].dt.day

# Add timezone as required by DataONE
survey_clean['Timezone'] = 'PT'
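
If any DATE value is still missing at this point, the `.dt` accessors return NaN for those rows and the Year, Month and Day columns become floats. One option, sketched here on made-up dates, is to cast to pandas' nullable integer type so the parts are written as `2018` rather than `2018.0`:

```python
import pandas as pd

# Hypothetical DATE values in the same m/d/Y style, one missing
dates = pd.to_datetime(pd.Series(['8/30/2018', None, '8/20/2018']), format='%m/%d/%Y')

parts = pd.DataFrame({
    'Year': dates.dt.year.astype('Int64'),
    'Month': dates.dt.month.astype('Int64'),
    'Day': dates.dt.day.astype('Int64'),
})
print(parts)
```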

In [30]:
## Add MPA status

survey_clean['ProtectionStatus'] = sur['MPA Status'].copy()
survey_clean.head()

Unnamed: 0,Survey_Num,DFW_short_code,SURVEY,DATE,Year,Month,Day,Timezone,ProtectionStatus
0,ALB17-A1-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA
1,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA
2,ALB18-A1-4,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA
3,ALB18-A3-1,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA
4,ALB18-A3-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA


In [31]:
## Add depth

survey_clean['Avg Depth'] = sur['Avg Depth'].copy()
survey_clean['Min_DEPTH'] = sur['Min_DEPTH'].copy()
survey_clean['Max_Depth'] = sur['Max_Depth'].copy()
survey_clean.head()

Unnamed: 0,Survey_Num,DFW_short_code,SURVEY,DATE,Year,Month,Day,Timezone,ProtectionStatus,Avg Depth,Min_DEPTH,Max_Depth
0,ALB17-A1-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,8.0,8.0,8.0
1,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0
2,ALB18-A1-4,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,5.5,4.0,7.0
3,ALB18-A3-1,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,13.5,12.0,15.0
4,ALB18-A3-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,15.0,12.0,18.0


In [32]:
## Lat, lon

survey_clean['Latitude'] = sur['new_Lat'].copy()
survey_clean['Longitude'] = sur['new_Lon'].copy()
survey_clean.head()

Unnamed: 0,Survey_Num,DFW_short_code,SURVEY,DATE,Year,Month,Day,Timezone,ProtectionStatus,Avg Depth,Min_DEPTH,Max_Depth,Latitude,Longitude
0,ALB17-A1-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,8.0,8.0,8.0,39.25166,-123.49137
1,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242
2,ALB18-A1-4,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,5.5,4.0,7.0,39.13404,-123.46242
3,ALB18-A3-1,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,13.5,12.0,15.0,39.13404,-123.46242
4,ALB18-A3-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,15.0,12.0,18.0,39.13404,-123.46242


In [33]:
## Comments

survey_clean['COMMENTS'] = sur['COMMENTS'].copy()
survey_clean.head()

Unnamed: 0,Survey_Num,DFW_short_code,SURVEY,DATE,Year,Month,Day,Timezone,ProtectionStatus,Avg Depth,Min_DEPTH,Max_Depth,Latitude,Longitude,COMMENTS
0,ALB17-A1-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,8.0,8.0,8.0,39.25166,-123.49137,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...
1,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...
2,ALB18-A1-4,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,5.5,4.0,7.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 4 OF 4 TRANSECTS. HEAD...
3,ALB18-A3-1,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,13.5,12.0,15.0,39.13404,-123.46242,RAPID EMERGENT SITE A3: 1 OF 2 TRANSECTS. HEAD...
4,ALB18-A3-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,15.0,12.0,18.0,39.13404,-123.46242,RAPID EMERGENT SITE A3: 2 OF 2 TRANSECTS. HEAD...


Before merging the survey and count tables, we need to clean up the count table. We want to ignore the layer column and aggregate by survey and species. Before the aggregation, we need to extract presence/absence information into its own column. 

In [34]:
## Tidy count table and create presence table

# Fix '00' typo
count_copy = count.copy()
count_copy.loc[count_copy['COUNT'] == '00', 'COUNT'] = 0

# Create presence column from COUNT
count_copy.loc[count_copy['COUNT'] == 'P', 'Presence'] = 'present'
count_copy.loc[count_copy['COUNT'] == 'p', 'Presence'] = 'present'
count_copy.loc[count_copy['COUNT'] == 'A', 'Presence'] = 'absent'

# Remove p/a values from COUNT column
count_copy.loc[count_copy['COUNT'] == 'P', 'COUNT'] = np.nan
count_copy.loc[count_copy['COUNT'] == 'p', 'COUNT'] = np.nan
count_copy.loc[count_copy['COUNT'] == 'A', 'COUNT'] = 0

# Drop rows with other weird COUNT values
count_copy = count_copy[count_copy['COUNT'] != 'RG18-C3-']

# Convert COUNT to numeric
count_copy['COUNT'] = count_copy['COUNT'].astype(float)

# Add additional p/a values based on COUNT column
count_copy.loc[count_copy['COUNT'] > 0, 'Presence'] = 'present'
count_copy.loc[count_copy['COUNT'] == 0, 'Presence'] = 'absent'

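# Function to aggregate presence/absence data
def keep_missing(s):
    """Collapse the per-layer Presence values for one survey/species group.

    Returns 'present' if any layer is present, 'absent' if every layer is
    absent, NaN if nothing was recorded, and 'mixed' for absent plus unrecorded.
    """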
    if s.eq('present').any():
        return 'present'
    elif s.eq('absent').all():
        return 'absent'
    elif s.isna().all():
        return np.nan
    else:
        return 'mixed'

# Aggregate
count_agg = count_copy.groupby(['Survey_Num', 'SpeciesID'], as_index=False).agg({
    'COUNT': 'sum',
    'Presence': keep_missing
})
count_agg.head()

Unnamed: 0,Survey_Num,SpeciesID,COUNT,Presence
0,FR18-D5-1,A1,0.0,absent
1,FR18-D5-1,A12,1.0,present
2,FR18-D5-1,DA12,0.0,absent
3,FR18-D5-1,EA1,0.0,absent
4,FR18-D5-1,K11,0.0,absent
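
To make the aggregation rule concrete, here is the same `keep_missing` logic applied to four hand-made layer sequences (the labels on the left are only for illustration):

```python
import numpy as np
import pandas as pd

def keep_missing(s):
    if s.eq('present').any():
        return 'present'
    elif s.eq('absent').all():
        return 'absent'
    elif s.isna().all():
        return np.nan
    else:
        return 'mixed'

cases = {
    'seen in one layer': ['absent', 'present', np.nan],
    'never seen': ['absent', 'absent'],
    'never recorded': [np.nan, np.nan],
    'absent or unrecorded': ['absent', np.nan],
}
for label, values in cases.items():
    print(label, '->', keep_missing(pd.Series(values, dtype='object')))
```

Only the last case falls through to 'mixed': some layers were recorded as absent and others were not recorded at all.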


In [35]:
## Merge with count table

# First need to strip whitespace from Survey_Num
count_agg['Survey_Num'] = count_agg['Survey_Num'].str.strip()
survey_clean['Survey_Num'] = survey_clean['Survey_Num'].str.strip()

# Merge
survey_and_count_clean = survey_clean.merge(count_agg, how='left', on='Survey_Num')
print(survey_and_count_clean.shape)
survey_and_count_clean.head()

(23834, 18)


Unnamed: 0,Survey_Num,DFW_short_code,SURVEY,DATE,Year,Month,Day,Timezone,ProtectionStatus,Avg Depth,Min_DEPTH,Max_Depth,Latitude,Longitude,COMMENTS,SpeciesID,COUNT,Presence
0,ALB17-A1-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,8.0,8.0,8.0,39.25166,-123.49137,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...,,,
1,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,A1,4.0,present
2,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,DA12,0.0,absent
3,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,EA1,0.0,absent
4,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,G11,1.0,present
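
A left merge keeps every survey even when it has no count rows, which is how surveys without count data stay in the output, and it drops count rows whose survey isn't in the survey table. A sketch on toy tables, with `validate='one_to_many'` added as an optional guard (not used in the cell above) so the merge fails loudly if `Survey_Num` is duplicated on the survey side:

```python
import pandas as pd

toy_survey = pd.DataFrame({'Survey_Num': ['S1', 'S2'], 'DFW_short_code': ['ALB', 'FM']})
toy_counts = pd.DataFrame({
    'Survey_Num': ['S2', 'S2', 'S9'],  # S9 is not in the survey table
    'SpeciesID': ['A1', 'K11', 'A1'],
    'COUNT': [4.0, 0.0, 2.0],
})

merged = toy_survey.merge(toy_counts, how='left', on='Survey_Num', validate='one_to_many')
print(merged)
```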


In [36]:
## Rename columns

survey_and_count_clean.columns = [
    'SurveyNum',
    'DFW_short_code',
    'SurveyType',
    'SurveyDate',
    'Year',
    'Month',
    'Day',
    'Timezone',
    'ProtectionStatus',
    'AverageDepth',
    'MinimumDepth',
    'MaximumDepth',
    'Lat',
    'Lon',
    'Comments',
    'SpeciesID',
    'Count',
    'Presence'
]
survey_and_count_clean.head()

Unnamed: 0,SurveyNum,DFW_short_code,SurveyType,SurveyDate,Year,Month,Day,Timezone,ProtectionStatus,AverageDepth,MinimumDepth,MaximumDepth,Lat,Lon,Comments,SpeciesID,Count,Presence
0,ALB17-A1-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,8.0,8.0,8.0,39.25166,-123.49137,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...,,,
1,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,A1,4.0,present
2,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,DA12,0.0,absent
3,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,EA1,0.0,absent
4,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,G11,1.0,present
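
Assigning a full list to `.columns` depends on the column order staying exactly as it is. A mapping-based alternative, sketched on a one-row frame with a subset of the columns, renames by name and leaves anything unlisted untouched:

```python
import pandas as pd

toy = pd.DataFrame(
    [['ALB18-A1-3', 'Transect', 12.5, 'A1', 4.0]],
    columns=['Survey_Num', 'SURVEY', 'Avg Depth', 'SpeciesID', 'COUNT'],
)

new_names = {
    'Survey_Num': 'SurveyNum',
    'SURVEY': 'SurveyType',
    'Avg Depth': 'AverageDepth',
    'COUNT': 'Count',
}
renamed = toy.rename(columns=new_names)
print(renamed.columns.tolist())
```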


## Percent cover

Notes:
- There are 1092 surveys where no percent cover data were taken. Leaving these in as Laura requested.

```python
# Surveys with no pc data
survey_and_pc_clean.loc[survey_and_pc_clean['HabitatID'].isna(), 'Survey_Num'].nunique()
```

No changes needed for percent cover categories table.

In [38]:
## Merge cleaned survey data with percent cover data (in pc table - substrate table in Laura's db)

survey_and_pc_clean = survey_clean.merge(pc[[
    'Survey_Num',
    'Subsample',
    'HabitatID',
    '%Total',
]], how='left', on='Survey_Num')
survey_and_pc_clean.head(2)

Unnamed: 0,Survey_Num,DFW_short_code,SURVEY,DATE,Year,Month,Day,Timezone,ProtectionStatus,Avg Depth,Min_DEPTH,Max_Depth,Latitude,Longitude,COMMENTS,Subsample,HabitatID,%Total
0,ALB17-A1-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,8.0,8.0,8.0,39.25166,-123.49137,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...,,,
1,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,,,


In [39]:
## Drop subsample

survey_and_pc_clean.drop(columns=['Subsample'], inplace=True)

In [40]:
# Rename columns

survey_and_pc_clean.columns = [
    'Survey_Num',
    'DFW_short_code',
    'SurveyType',
    'SurveyDate',
    'Year',
    'Month',
    'Day',
    'Timezone',
    'ProtectionStatus',
    'AverageDepth',
    'MinimumDepth',
    'MaximumDepth',
    'Lat',
    'Lon',
    'Comments',
    'HabitatID',
    'PercentCover'
]
survey_and_pc_clean.head()

Unnamed: 0,Survey_Num,DFW_short_code,SurveyType,SurveyDate,Year,Month,Day,Timezone,ProtectionStatus,AverageDepth,MinimumDepth,MaximumDepth,Lat,Lon,Comments,HabitatID,PercentCover
0,ALB17-A1-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,8.0,8.0,8.0,39.25166,-123.49137,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...,,
1,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,,
2,ALB18-A1-4,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,5.5,4.0,7.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 4 OF 4 TRANSECTS. HEAD...,,
3,ALB18-A3-1,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,13.5,12.0,15.0,39.13404,-123.46242,RAPID EMERGENT SITE A3: 1 OF 2 TRANSECTS. HEAD...,,
4,ALB18-A3-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,15.0,12.0,18.0,39.13404,-123.46242,RAPID EMERGENT SITE A3: 2 OF 2 TRANSECTS. HEAD...,,


## Size

- There are 570 rows with missing SpeciesIDs and SIZEs. These are probably from surveys where size data were not taken. Laura would like these left in.
- There are an additional 45 rows where SpeciesID is present, but SIZE is not. These don't make sense; there are other size data available for the corresponding surveys in all cases. I've dropped them.

```python
# Other size data are available for all surveys that contain records where SpeciesID is present, but SIZE is not
for s in missing_size_only['Survey_Num'].unique():
    sizes = survey_and_size_clean[(survey_and_size_clean['Survey_Num'] == s) &
                                  survey_and_size_clean['SpeciesID'].notna()]
    if sizes.shape[0] < 2:
        print(s)
```

In [42]:
## Merge clean survey data with size data (in size table)

survey_and_size_clean = survey_clean.merge(size[[
    'Survey_Num', 
    'Lft_or_Rt', 
    'SpeciesID', 
    'SIZE',
]], how='left', on='Survey_Num')

print(survey_and_size_clean.shape)
survey_and_size_clean.head()

(76721, 18)


Unnamed: 0,Survey_Num,DFW_short_code,SURVEY,DATE,Year,Month,Day,Timezone,ProtectionStatus,Avg Depth,Min_DEPTH,Max_Depth,Latitude,Longitude,COMMENTS,Lft_or_Rt,SpeciesID,SIZE
0,ALB17-A1-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,8.0,8.0,8.0,39.25166,-123.49137,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...,,,
1,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,,A1,137.0
2,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,,A1,162.0
3,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,,A1,190.0
4,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,,A1,127.0


In [43]:
## Drop Lft_or_Rt column

survey_and_size_clean.drop(columns=['Lft_or_Rt'], inplace=True)

In [44]:
## Assess surveys with missing size data

missing_id = survey_and_size_clean[survey_and_size_clean['SpeciesID'].isna() == True]
print('There are {x} rows with missing SpeciesIDs from {y} surveys'.format(x=missing_id.shape[0],
                                                                           y=missing_id['Survey_Num'].nunique()))

missing_size = survey_and_size_clean[survey_and_size_clean['SIZE'].isna() == True]
print('There are {x} rows with missing SIZEs from {y} surveys'.format(x=missing_size.shape[0],
                                                                      y=missing_size['Survey_Num'].nunique()))

There are 570 rows with missing SpeciesIDs from 570 surveys
There are 615 rows with missing SIZEs from 587 surveys


In [45]:
## Assess surveys where only SIZE is missing

missing_size_only = survey_and_size_clean[
    survey_and_size_clean['SIZE'].isna() &
    survey_and_size_clean['SpeciesID'].notna()
]
print('There are {x} rows from {y} surveys where SIZE but not SpeciesID is missing.'.format(
    x=missing_size_only.shape[0],
    y=missing_size_only['Survey_Num'].nunique()))

There are 45 rows from 17 surveys where SIZE but not SpeciesID is missing.


In [46]:
## Drop rows where only size information is missing

idx = missing_size_only.index
survey_and_size_clean.drop(index=idx, inplace=True)
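
The two kinds of gap can be separated with a pair of boolean masks. A sketch on invented rows: placeholder rows from surveys with no size data have both fields empty and are kept, while rows with a SpeciesID but no SIZE are dropped.

```python
import numpy as np
import pandas as pd

toy_size = pd.DataFrame({
    'Survey_Num': ['S1', 'S2', 'S2', 'S2'],
    'SpeciesID': [np.nan, 'A1', 'A1', 'A1'],
    'SIZE': [np.nan, 137.0, np.nan, 162.0],
})

# Survey had no size data at all (kept)
no_size_data = toy_size['SpeciesID'].isna() & toy_size['SIZE'].isna()
# Species recorded but size missing (dropped)
size_only_missing = toy_size['SpeciesID'].notna() & toy_size['SIZE'].isna()

cleaned = toy_size[~size_only_missing]
print(cleaned)
```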

In [47]:
## Rename columns
survey_and_size_clean.columns = [
    'Survey_Num',
    'DFW_short_code',
    'SurveyType',
    'SurveyDate',
    'Year',
    'Month',
    'Day',
    'Timezone',
    'ProtectionStatus',
    'AverageDepth',
    'MinimumDepth',
    'MaximumDepth',
    'Lat',
    'Lon',
    'Comments',
    'SpeciesID',
    'Size'
]
survey_and_size_clean.head()

Unnamed: 0,Survey_Num,DFW_short_code,SurveyType,SurveyDate,Year,Month,Day,Timezone,ProtectionStatus,AverageDepth,MinimumDepth,MaximumDepth,Lat,Lon,Comments,SpeciesID,Size
0,ALB17-A1-2,ALB,Transect - 30 m x 2 m (Emergent),2018-08-30,2018,8,30,PT,non-MPA,8.0,8.0,8.0,39.25166,-123.49137,RAPID EMERGENT SITE A1: 2 OF 4 TRANSECTS. HEAD...,,
1,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,A1,137.0
2,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,A1,162.0
3,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,A1,190.0
4,ALB18-A1-3,ALB,Transect - 30 m x 2 m (Emergent),2018-08-20,2018,8,20,PT,non-MPA,12.5,11.0,14.0,39.13404,-123.46242,RAPID EMERGENT SITE A1: 3 OF 4 TRANSECTS. HEAD...,A1,127.0


## Save final tables

In [50]:
## Save tables

site.to_csv('north_coast_kelp_site_table_v1_20220201.csv', index=False, na_rep='')
species.to_csv('north_coast_kelp_species_table_v1_20220201.csv', index=False, na_rep='')
survey_and_count_clean.to_csv('north_coast_kelp_count_table_v1_20220201.csv', index=False, na_rep='')
survey_and_pc_clean.to_csv('north_coast_kelp_percent_cover_table_v1_20220201.csv', index=False, na_rep='')
pc_cat.to_csv('north_coast_kelp_percent_cover_categories_table_v1_20220201.csv', index=False, na_rep='')
survey_and_size_clean.to_csv('north_coast_kelp_size_table_v1_20220201.csv', index=False, na_rep='')
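
A read-back check can confirm that the saved files look as intended. This sketch writes a toy table to a temporary directory with the same `to_csv` options and checks that missing values come back as NaN and that no index column was written; the file name is hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame({'Survey_Num': ['S1', 'S2'], 'Size': [137.0, np.nan]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_size_table.csv')
    toy.to_csv(path, index=False, na_rep='')
    with open(path) as f:
        raw = f.read()
    back = pd.read_csv(path)

print(raw)
print(back.columns.tolist(), int(back['Size'].isna().sum()))
```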