# Reef Check - invasive algae presence/absence data

For each survey site, Reef Check indicates the presence or absence of four invasive algal species. These species can be observed anywhere on-site, including off-transect. 

In [83]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates

In [84]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [134]:
## Load data

data = pd.read_csv('RCCA_algae_invasives_data.csv')
print(data.shape)
data.head()

(5508, 15)


Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,SargassumMuticum,SargassumFilicinum,Undaria,Caulerpa,Lat,Lon,Depth_ft,Temp10m,Visibility
0,120 Reef,8,10,2006,10/8/2006,1,No,No,No,No,33.737919,-118.392014,28.0,15.0,7.0
1,120 Reef,8,10,2006,10/8/2006,2,No,No,No,No,33.737919,-118.392014,28.0,15.0,5.0
2,120 Reef,8,10,2006,10/8/2006,3,No,No,No,No,33.737919,-118.392014,26.0,15.0,5.0
3,120 Reef,8,10,2006,10/8/2006,4,No,No,No,No,33.737919,-118.392014,21.5,15.0,5.0
4,120 Reef,8,10,2006,10/8/2006,5,No,No,No,No,33.737919,-118.392014,17.0,15.0,4.0


### Aggregate data by site and survey date

**Jan and Dan said that these should be aggregated by site and survey date. If there is a Yes in any of the columns for any of the transects, the survey should have a Yes.**
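
As a minimal sketch of this rule on made-up data (the toy frame and its values are assumptions for illustration, not rows from the Reef Check file), a survey is marked present for a species if any of its transects recorded a Yes:

```python
import pandas as pd

# Toy transect-level data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Site': ['A', 'A', 'A', 'B', 'B'],
    'SurveyDate': ['10/8/2006'] * 3 + ['6/1/2007'] * 2,
    'Transect': [1, 2, 3, 1, 2],
    'Undaria': ['No', 'Yes', 'No', 'No', 'No'],
})

# A survey (site + survey date) is True if any transect recorded 'Yes'/'yes'
survey_presence = (
    toy['Undaria'].str.lower().eq('yes')
    .groupby([toy['Site'], toy['SurveyDate']])
    .any()
    .reset_index()
)
print(survey_presence)
```

The cells below reach the same result by summing 0/1 values per survey and treating any sum above zero as present.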

In [183]:
## Change data from wide to long format

data_long = pd.melt(data, 
                    id_vars=['Site', 'Day', 'Month', 'Year', 'SurveyDate', 'Transect', 'Lat', 'Lon', 'Depth_ft', 'Temp10m', 'Visibility'], 
                    value_vars=['SargassumMuticum', 'SargassumFilicinum', 'Undaria', 'Caulerpa'])
print(data_long.shape)
data_long.head()

(22032, 13)


Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Lat,Lon,Depth_ft,Temp10m,Visibility,variable,value
0,120 Reef,8,10,2006,10/8/2006,1,33.737919,-118.392014,28.0,15.0,7.0,SargassumMuticum,No
1,120 Reef,8,10,2006,10/8/2006,2,33.737919,-118.392014,28.0,15.0,5.0,SargassumMuticum,No
2,120 Reef,8,10,2006,10/8/2006,3,33.737919,-118.392014,26.0,15.0,5.0,SargassumMuticum,No
3,120 Reef,8,10,2006,10/8/2006,4,33.737919,-118.392014,21.5,15.0,5.0,SargassumMuticum,No
4,120 Reef,8,10,2006,10/8/2006,5,33.737919,-118.392014,17.0,15.0,4.0,SargassumMuticum,No


In [184]:
## Change No to 0 and Yes to 1

data_long.replace({'No':0, 'Yes':1, 'yes':1}, inplace=True)
data_long.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Lat,Lon,Depth_ft,Temp10m,Visibility,variable,value
0,120 Reef,8,10,2006,10/8/2006,1,33.737919,-118.392014,28.0,15.0,7.0,SargassumMuticum,0.0
1,120 Reef,8,10,2006,10/8/2006,2,33.737919,-118.392014,28.0,15.0,5.0,SargassumMuticum,0.0
2,120 Reef,8,10,2006,10/8/2006,3,33.737919,-118.392014,26.0,15.0,5.0,SargassumMuticum,0.0
3,120 Reef,8,10,2006,10/8/2006,4,33.737919,-118.392014,21.5,15.0,5.0,SargassumMuticum,0.0
4,120 Reef,8,10,2006,10/8/2006,5,33.737919,-118.392014,17.0,15.0,4.0,SargassumMuticum,0.0


In [185]:
## Group replaced data to obtain the number of invasive algae sightings for each site/survey date; group data_long to handle transect-level differences in depth and temperature

# Get number of invasive algae observations
num_obs = data_long.groupby(['Site', 'SurveyDate', 'variable'])['value'].sum()
num_obs = num_obs.reset_index()

# Handle differences in depth, temp in original data
data_long = data_long.groupby(['Site', 'Day', 'Month', 'Year', 'SurveyDate']).agg({
    'Depth_ft':['min', 'max'],
    'Temp10m':['mean']
})
data_long.reset_index(inplace=True)
data_long.columns = ['Site', 'Day', 'Month', 'Year', 'SurveyDate', 'Depth_ft_min', 'Depth_ft_max', 'Temp10m_mean']

# Merge
data_agg = num_obs.merge(data_long, how='left', on=['Site', 'SurveyDate'])
print(data_agg.shape)
data_agg.head()

(3788, 10)


Unnamed: 0,Site,SurveyDate,variable,value,Day,Month,Year,Depth_ft_min,Depth_ft_max,Temp10m_mean
0,120 Reef,10/14/2012,Caulerpa,0.0,14,10,2012,12.0,28.0,17.0
1,120 Reef,10/14/2012,SargassumFilicinum,0.0,14,10,2012,12.0,28.0,17.0
2,120 Reef,10/14/2012,SargassumMuticum,0.0,14,10,2012,12.0,28.0,17.0
3,120 Reef,10/14/2012,Undaria,0.0,14,10,2012,12.0,28.0,17.0
4,120 Reef,10/8/2006,Caulerpa,0.0,8,10,2006,17.0,28.0,15.0


### Join with site table to retrieve lat and lon

In [186]:
## Load site table

filename = 'RCCA_site_table.csv'
sites = pd.read_csv(filename, usecols=range(7))

sites.head()

Unnamed: 0,Research_group,Site,CA_MPA_Name_Short,MPA_status,LTM_project_short_code,Latitude,Longitude
0,RCCA,Macklyn Cove,,REF,LTM_Kelp_SRock,42.045155,-124.294724
1,RCCA,Pyramid Pt,Pyramid Point SMCA,MPA,LTM_Kelp_SRock,41.994801,-124.217308
2,RCCA,Flat Iron Rock,,,,41.059425,-124.157829
3,RCCA,Trinidad,,,,41.055,-124.139999
4,RCCA,MacKerricher North,MacKerricher SMCA,MPA,LTM_Kelp_SRock,39.492823,-123.80199


As noted for previous RCCA data sets, the invasive algae data include some sites that are not in the site table:
- Cayucos
- LA Federal Breakwater
- Pier 400
- Fry's Anchorage (as noted previously, the site table spells this Frys Anchorage; the name is corrected below)
- West Long Point (the site table lists this as Long Point West; the name is corrected below)

I am going to manually add the lat and lon for these sites to the site table. However, I talked to Jan and Dan on 8/6, and they're planning to update the official site table on DataONE as well.
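
One way to list the sites that lack a match in the site table is a left merge with an indicator column. The sketch below uses small stand-in frames (`toy_data` and `toy_sites` are hypothetical names, not objects from this notebook):

```python
import pandas as pd

# Hypothetical stand-ins for the aggregated data and the site table
toy_data = pd.DataFrame({'Site': ['120 Reef', 'Cayucos', 'Pier 400', 'Cayucos']})
toy_sites = pd.DataFrame({'Site': ['120 Reef', 'Trinidad'],
                          'Latitude': [33.737919, 41.055],
                          'Longitude': [-118.392014, -124.139999]})

# Left-merge unique site names with an indicator, then keep names found only in the data
check = toy_data[['Site']].drop_duplicates().merge(
    toy_sites[['Site']], how='left', on='Site', indicator=True)
missing_sites = sorted(check.loc[check['_merge'] == 'left_only', 'Site'])
print(missing_sites)
```

Running the same check after the merge with the real site table would confirm that no survey is left without coordinates.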

In [187]:
## Add rows to site table -- CAN BE DELETED WHEN SITE TABLE IS UPDATED ON DATAONE

sites_to_add = pd.DataFrame({'Research_group':['RCCA']*5,
                            'Site':['Cayucos', 'Hurricane Ridge', 'LA Federal Breakwater', 'Ocean Cove Kelper', 'Pier 400'],
                            'Latitude':[35.4408, 37.4701, 33.711899, 38.555119, 33.716301],
                            'Longitude':[-120.936302, -122.4796, -118.241997, -123.3046, -118.258003]})
sites = pd.concat([sites, sites_to_add])

In [188]:
## Correct values in invasives data that do not match in the site table

data_agg.loc[data_agg['Site'] == "Fry's Anchorage", 'Site'] = 'Frys Anchorage'
data_agg.loc[data_agg['Site'] == 'West Long Point', 'Site'] = 'Long Point West'

In [189]:
## Merge data_agg and sites

data_agg = data_agg.merge(sites[['Site', 'Latitude', 'Longitude']], how='left', on='Site')
print(data_agg.shape)
data_agg.head()

(3788, 12)


Unnamed: 0,Site,SurveyDate,variable,value,Day,Month,Year,Depth_ft_min,Depth_ft_max,Temp10m_mean,Latitude,Longitude
0,120 Reef,10/14/2012,Caulerpa,0.0,14,10,2012,12.0,28.0,17.0,33.737919,-118.392014
1,120 Reef,10/14/2012,SargassumFilicinum,0.0,14,10,2012,12.0,28.0,17.0,33.737919,-118.392014
2,120 Reef,10/14/2012,SargassumMuticum,0.0,14,10,2012,12.0,28.0,17.0,33.737919,-118.392014
3,120 Reef,10/14/2012,Undaria,0.0,14,10,2012,12.0,28.0,17.0,33.737919,-118.392014
4,120 Reef,10/8/2006,Caulerpa,0.0,8,10,2006,17.0,28.0,15.0,33.737919,-118.392014


## Create occurrence file

Here, it seems reasonable for the **event** to be the survey (i.e. site + survey date). The **occurrences** can be the presence/absence observations of each algal species. I can include the temperature information in an MoF file.
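
To make the event/occurrence split concrete, here is a small sketch on a toy frame (the frame is an assumption for illustration; the ID patterns follow the ones built in the cells below). Each site + date pair gets one eventID, and each species row within that event gets its own occurrenceID:

```python
from datetime import datetime

import pandas as pd

# Toy long-format data: one row per species per survey (hypothetical subset)
toy = pd.DataFrame({
    'Site': ['120 Reef', '120 Reef', '120 Reef'],
    'SurveyDate': ['10/14/2012', '10/14/2012', '10/8/2006'],
    'variable': ['Caulerpa', 'Undaria', 'Caulerpa'],
})

# Event = site name without spaces + '_' + date as YYYYMMDD
date_part = toy['SurveyDate'].map(
    lambda s: datetime.strptime(s, '%m/%d/%Y').strftime('%Y%m%d'))
toy['eventID'] = toy['Site'].str.replace(' ', '', regex=False) + '_' + date_part

# Occurrence = running count of species rows within each event
toy['occurrenceID'] = (toy['eventID'] + '_occ'
                       + (toy.groupby('eventID').cumcount() + 1).astype(str))
print(toy[['eventID', 'occurrenceID']])
```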

In [212]:
## Get site names w/o spaces for use in eventID

# Get a list of site names with spaces removed
site_names = [name.replace(' ', '') for name in data_agg['Site']]

# Map site_names to sites
site_name_dict = dict(zip(data_agg['Site'], site_names))

# Create SiteName column from Site column
data_agg['SiteName'] = data_agg['Site'].replace(site_name_dict)
data_agg.head()

Unnamed: 0,Site,SurveyDate,variable,value,Day,Month,Year,Depth_ft_min,Depth_ft_max,Temp10m_mean,Latitude,Longitude,SiteName
0,120 Reef,10/14/2012,Caulerpa,0.0,14,10,2012,12.0,28.0,17.0,33.737919,-118.392014,120Reef
1,120 Reef,10/14/2012,SargassumFilicinum,0.0,14,10,2012,12.0,28.0,17.0,33.737919,-118.392014,120Reef
2,120 Reef,10/14/2012,SargassumMuticum,0.0,14,10,2012,12.0,28.0,17.0,33.737919,-118.392014,120Reef
3,120 Reef,10/14/2012,Undaria,0.0,14,10,2012,12.0,28.0,17.0,33.737919,-118.392014,120Reef
4,120 Reef,10/8/2006,Caulerpa,0.0,8,10,2006,17.0,28.0,15.0,33.737919,-118.392014,120Reef


In [213]:
## Pad month and day as needed

paddedDay = data_agg['Day'].astype(str).str.zfill(2).tolist()
paddedMonth = data_agg['Month'].astype(str).str.zfill(2).tolist()

In [226]:
## Create eventID

eventID = [data_agg['SiteName'].iloc[i] + '_' + str(data_agg['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(len(data_agg['Site']))]
occ = pd.DataFrame({'eventID':eventID})

occ.head()

Unnamed: 0,eventID
0,120Reef_20121014
1,120Reef_20121014
2,120Reef_20121014
3,120Reef_20121014
4,120Reef_20061008


In [227]:
## Format dates and add eventDate

eventDate = [datetime.strptime(dt, '%m/%d/%Y').date().isoformat() for dt in data_agg['SurveyDate']]
occ['eventDate'] = eventDate
occ.head()

Unnamed: 0,eventID,eventDate
0,120Reef_20121014,2012-10-14
1,120Reef_20121014,2012-10-14
2,120Reef_20121014,2012-10-14
3,120Reef_20121014,2012-10-14
4,120Reef_20061008,2006-10-08


In [228]:
## Add datasetID

occ['datasetID'] = 'RCCA invasive algae'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID
0,120Reef_20121014,2012-10-14,RCCA invasive algae
1,120Reef_20121014,2012-10-14,RCCA invasive algae
2,120Reef_20121014,2012-10-14,RCCA invasive algae
3,120Reef_20121014,2012-10-14,RCCA invasive algae
4,120Reef_20061008,2006-10-08,RCCA invasive algae


In [229]:
## Add locality and countryCode

# locality
occ['locality'] = data_agg['Site']

# countryCode
occ['countryCode'] = 'US'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode
0,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US
1,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US
2,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US
3,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US
4,120Reef_20061008,2006-10-08,RCCA invasive algae,120 Reef,US


In [230]:
## decimalLat, decimalLon

occ['decimalLatitude'] = data_agg['Latitude']
occ['decimalLongitude'] = data_agg['Longitude']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude
0,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014
1,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014
2,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014
3,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014
4,120Reef_20061008,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014


In [231]:
## Add coordinateUncertaintyInMeters

occ['coordinateUncertaintyInMeters'] = 250

In [232]:
## Add occurrenceID

occ['occurrenceID'] = data_agg.groupby(['Site', 'SurveyDate'])['variable'].cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_occ' + occ['occurrenceID'].astype(str)

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID
0,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ1
1,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ2
2,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ3
3,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ4
4,120Reef_20061008,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_occ1


In [233]:
## Get unique scientific names

sci_names = data_agg['variable'].unique()
sci_names = np.where(sci_names=='SargassumMuticum', 'Sargassum muticum', sci_names)
sci_names = np.where(sci_names=='SargassumFilicinum', 'Sargassum filicinum', sci_names)

sci_names

array(['Caulerpa', 'Sargassum filicinum', 'Sargassum muticum', 'Undaria'],
      dtype=object)

In [234]:
## Call run_get_worms_from_scientific_name

name_id_dict, name_name_dict, name_taxid_dict = WoRMS.run_get_worms_from_scientific_name(sci_names, verbose_flag=True)

In [235]:
## Add scientific name-related columns

occ['scientificName'] = data_agg['variable'].replace({'SargassumMuticum':'Sargassum muticum',
                                                      'SargassumFilicinum':'Sargassum filicinum'})

# Map each scientific name to its WoRMS LSID and numeric taxon ID
occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ1,Caulerpa,urn:lsid:marinespecies.org:taxname:143816,143816
1,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ2,Sargassum filicinum,urn:lsid:marinespecies.org:taxname:496117,496117
2,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ3,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791
3,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ4,Undaria,urn:lsid:marinespecies.org:taxname:144196,144196
4,120Reef_20061008,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_occ1,Caulerpa,urn:lsid:marinespecies.org:taxname:143816,143816


In [236]:
## Add nameAccordingTo

occ['nameAccordingTo'] = 'WoRMS'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo
0,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ1,Caulerpa,urn:lsid:marinespecies.org:taxname:143816,143816,WoRMS
1,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ2,Sargassum filicinum,urn:lsid:marinespecies.org:taxname:496117,496117,WoRMS
2,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ3,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS
3,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ4,Undaria,urn:lsid:marinespecies.org:taxname:144196,144196,WoRMS
4,120Reef_20061008,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_occ1,Caulerpa,urn:lsid:marinespecies.org:taxname:143816,143816,WoRMS


In [243]:
## Add occurrenceStatus

occ['value'] = data_agg['value']
occ.loc[occ['value'] == 0, 'occurrenceStatus'] = 'absent'
occ.loc[occ['value'] > 0, 'occurrenceStatus'] = 'present'
occ.drop(labels='value', axis=1, inplace=True)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus
0,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ1,Caulerpa,urn:lsid:marinespecies.org:taxname:143816,143816,WoRMS,absent
1,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ2,Sargassum filicinum,urn:lsid:marinespecies.org:taxname:496117,496117,WoRMS,absent
2,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ3,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent
3,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ4,Undaria,urn:lsid:marinespecies.org:taxname:144196,144196,WoRMS,absent
4,120Reef_20061008,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_occ1,Caulerpa,urn:lsid:marinespecies.org:taxname:143816,143816,WoRMS,absent


In [245]:
## Add basisOfRecord

occ['basisOfRecord'] = 'HumanObservation'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord
0,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ1,Caulerpa,urn:lsid:marinespecies.org:taxname:143816,143816,WoRMS,absent,HumanObservation
1,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ2,Sargassum filicinum,urn:lsid:marinespecies.org:taxname:496117,496117,WoRMS,absent,HumanObservation
2,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ3,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation
3,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ4,Undaria,urn:lsid:marinespecies.org:taxname:144196,144196,WoRMS,absent,HumanObservation
4,120Reef_20061008,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_occ1,Caulerpa,urn:lsid:marinespecies.org:taxname:143816,143816,WoRMS,absent,HumanObservation


In [246]:
## Add depth

occ['minimumDepthInMeters'] = round(data_agg['Depth_ft_min']*0.3048, 1)
occ['maximumDepthInMeters'] = round(data_agg['Depth_ft_max']*0.3048, 1)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,minimumDepthInMeters,maximumDepthInMeters
0,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ1,Caulerpa,urn:lsid:marinespecies.org:taxname:143816,143816,WoRMS,absent,HumanObservation,3.7,8.5
1,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ2,Sargassum filicinum,urn:lsid:marinespecies.org:taxname:496117,496117,WoRMS,absent,HumanObservation,3.7,8.5
2,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ3,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation,3.7,8.5
3,120Reef_20121014,2012-10-14,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20121014_occ4,Undaria,urn:lsid:marinespecies.org:taxname:144196,144196,WoRMS,absent,HumanObservation,3.7,8.5
4,120Reef_20061008,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_occ1,Caulerpa,urn:lsid:marinespecies.org:taxname:143816,143816,WoRMS,absent,HumanObservation,5.2,8.5


## Save

In [247]:
occ.to_csv('RCCA_invasives_occurrence_20200807.csv', index=False, na_rep='NaN')

## Create MoF file

It seems worthwhile to include the temperature measurements in a MoF file. Visibility was not carried through the aggregation above, so only temperature is written here.

In [263]:
## Add eventID, occurrenceID and temperature

mof = pd.DataFrame({'eventID':occ['eventID']})
mof['occurrenceID'] = np.nan
mof['measurementType'] = 'temperature'
mof['measurementValue'] = round(data_agg['Temp10m_mean'], 1)
mof['measurementUnit'] = 'degrees Celsius'
mof['measurementMethod'] = 'measured by dive computer at 10 m depth, or at the seafloor if shallower than 10 m'

print(mof.shape)
mof.head()

(3788, 6)


Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20121014,,temperature,17.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
1,120Reef_20121014,,temperature,17.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
2,120Reef_20121014,,temperature,17.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
3,120Reef_20121014,,temperature,17.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
4,120Reef_20061008,,temperature,15.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."


In [264]:
## Drop duplicate values and measurementValues that are NaN

# Duplicates
mof.drop_duplicates(inplace=True)

# NaNs
mof = mof[mof['measurementValue'].notna()]

print(mof.shape)
mof.head()

(783, 6)


Unnamed: 0,eventID,occurrenceID,measurementType,measurementValue,measurementUnit,measurementMethod
0,120Reef_20121014,,temperature,17.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
4,120Reef_20061008,,temperature,15.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
8,120Reef_20140614,,temperature,15.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
12,120Reef_20130615,,temperature,16.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."
20,120Reef_20120609,,temperature,17.0,degrees Celsius,"measured by dive computer at 10 m depth, or at..."


## Save

In [265]:
mof.to_csv('RCCA_invasives_MoF_20200807.csv', index=False, na_rep='NaN')

## Questions

1. Why are these data broken down by transect? I thought that for the invasive algae surveys, a dive group would record a species as present if it was seen anywhere at the site. **This is correct; these data should be aggregated by site. I went ahead and did this.**
2. As noted in previous RCCA data, the invasive algae data include some sites that are not in the site table: Cayucos, LA Federal Breakwater, Pier 400 and West Long Point. This was not a problem because the lat, lon were available in the original data set (as opposed to the site table).
3. For each invasive algae, presence or absence is indicated by "Yes", "No" or "NaN". How are No and NaN different? **NaN would suggest that someone forgot to look for invasives during the survey. After aggregating, these went away.**
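
The last answer can be illustrated with a small sketch on made-up values (the toy frame is an assumption, not data from the file). By default, pandas `groupby(...).sum()` skips NaN, so a survey whose transects are all NaN sums to 0 and would be labeled absent by the steps above; passing `min_count=1` keeps such a survey as NaN instead. Whether any surveys in the actual data are entirely NaN is not checked here.

```python
import numpy as np
import pandas as pd

# Toy data: site A was never scored (all NaN), site B has one sighting
toy = pd.DataFrame({
    'Site': ['A', 'A', 'B', 'B'],
    'value': [np.nan, np.nan, 0.0, 1.0],
})

# Default behavior: NaN is skipped, so an all-NaN group sums to 0
default_sum = toy.groupby('Site')['value'].sum()

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN
strict_sum = toy.groupby('Site')['value'].sum(min_count=1)

print(default_sum)
print(strict_sum)
```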
