# Reef Check - invasive algae presence/absence data

For each survey site, Reef Check indicates the presence or absence of four invasive algal species. These species can be observed anywhere on-site, including off-transect. 

In [1]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Load data

data = pd.read_csv('RCCA_algae_invasives_data.csv')
print(data.shape)
data.head()

(5508, 15)


Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,SargassumMuticum,SargassumFilicinum,Undaria,Caulerpa,Lat,Lon,Depth_ft,Temp10m,Visibility
0,120 Reef,8,10,2006,10/8/2006,1,No,No,No,No,33.737919,-118.392014,28.0,15.0,7.0
1,120 Reef,8,10,2006,10/8/2006,2,No,No,No,No,33.737919,-118.392014,28.0,15.0,5.0
2,120 Reef,8,10,2006,10/8/2006,3,No,No,No,No,33.737919,-118.392014,26.0,15.0,5.0
3,120 Reef,8,10,2006,10/8/2006,4,No,No,No,No,33.737919,-118.392014,21.5,15.0,5.0
4,120 Reef,8,10,2006,10/8/2006,5,No,No,No,No,33.737919,-118.392014,17.0,15.0,4.0


**Why are these data recorded by transect rather than by site? I thought divers noted the presence of these species anywhere on the site.**
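One way to look into this question is to test whether the species columns ever differ between transects of the same survey. The sketch below uses a small made-up table with the same column names as the RCCA file; `toy`, `species_cols` and `varies_within_survey` are illustrative names, not part of the original workflow.

```python
import pandas as pd

# Toy stand-in for the RCCA table (same column names, made-up values)
toy = pd.DataFrame({
    'Site': ['A', 'A', 'A', 'B', 'B'],
    'SurveyDate': ['10/8/2006'] * 3 + ['6/1/2007'] * 2,
    'Transect': [1, 2, 3, 1, 2],
    'Undaria': ['No', 'No', 'No', 'Yes', 'No'],
    'Caulerpa': ['No', 'No', 'No', 'No', 'No'],
})

species_cols = ['Undaria', 'Caulerpa']

# Count distinct values per survey (site + date) for each species column;
# more than one distinct value means transects of the same survey disagree
n_distinct = toy.groupby(['Site', 'SurveyDate'])[species_cols].nunique(dropna=False)
varies_within_survey = n_distinct.gt(1).any(axis=1)

print(varies_within_survey)
```

If no survey in the real data is flagged, the values are effectively site-level observations repeated on every transect row.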

## Create occurrence file

Here, it seems reasonable for the **event** to be the survey (e.g. site + survey date). The **occurrences** can be the presence/absence observations of each algal species. I think we can get away with only having an occurrence file for this data set.

In [19]:
## Change data from wide to long format

data_long = pd.melt(data, 
                    id_vars=['Site', 'Day', 'Month', 'Year', 'SurveyDate', 'Transect', 'Lat', 'Lon', 'Depth_ft'], 
                    value_vars=['SargassumMuticum', 'SargassumFilicinum', 'Undaria', 'Caulerpa'])
print(data_long.shape)
data_long.head()

(22032, 11)


Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Lat,Lon,Depth_ft,variable,value
0,120 Reef,8,10,2006,10/8/2006,1,33.737919,-118.392014,28.0,SargassumMuticum,No
1,120 Reef,8,10,2006,10/8/2006,2,33.737919,-118.392014,28.0,SargassumMuticum,No
2,120 Reef,8,10,2006,10/8/2006,3,33.737919,-118.392014,26.0,SargassumMuticum,No
3,120 Reef,8,10,2006,10/8/2006,4,33.737919,-118.392014,21.5,SargassumMuticum,No
4,120 Reef,8,10,2006,10/8/2006,5,33.737919,-118.392014,17.0,SargassumMuticum,No
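As a quick sanity check on the reshape, the same `pd.melt` call can be tried on a tiny made-up table. This is only a sketch; `wide` and `long_df` are illustrative names.

```python
import pandas as pd

# Two transects, two species columns (made-up values)
wide = pd.DataFrame({
    'Site': ['A', 'A'],
    'Transect': [1, 2],
    'Undaria': ['No', 'Yes'],
    'Caulerpa': ['No', 'No'],
})

long_df = pd.melt(wide, id_vars=['Site', 'Transect'],
                  value_vars=['Undaria', 'Caulerpa'])

# Rows: 2 transects x 2 species = 4; columns: 2 id columns + variable + value
print(long_df.shape)
```

The same arithmetic explains the shape above: 5508 rows x 4 species = 22032 rows, and 9 id columns + `variable` + `value` = 11 columns.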


In [23]:
## Get site names w/o spaces for use in eventID

# Get a list of site names with spaces removed
site_names = [name.replace(' ', '') for name in data['Site']]

# Map site_names to sites
site_name_dict = dict(zip(data['Site'], site_names))
site_name_dict["Fry's Anchorage"] = 'FrysAnchorage'

# Create SiteName column from Site column
data_long['SiteName'] = data_long['Site'].replace(site_name_dict)
data_long.head()

Unnamed: 0,Site,Day,Month,Year,SurveyDate,Transect,Lat,Lon,Depth_ft,variable,value,SiteName
0,120 Reef,8,10,2006,10/8/2006,1,33.737919,-118.392014,28.0,SargassumMuticum,No,120Reef
1,120 Reef,8,10,2006,10/8/2006,2,33.737919,-118.392014,28.0,SargassumMuticum,No,120Reef
2,120 Reef,8,10,2006,10/8/2006,3,33.737919,-118.392014,26.0,SargassumMuticum,No,120Reef
3,120 Reef,8,10,2006,10/8/2006,4,33.737919,-118.392014,21.5,SargassumMuticum,No,120Reef
4,120 Reef,8,10,2006,10/8/2006,5,33.737919,-118.392014,17.0,SargassumMuticum,No,120Reef


**As noted for previous RCCA data sets**, the invasive algae data include some sites that are not in the site table:
- Cayucos
- LA Federal Breakwater
- Pier 400
- Fry's Anchorage (as noted previously, this appears as Frys Anchorage and has been corrected)
- West Long Point
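A list like this can be produced with a set difference between the site names in the survey data and those in the site table. The sketch below assumes a hypothetical `site_table` with a `Site` column; the real site table is not loaded in this notebook, and the values here are made up.

```python
import pandas as pd

# Hypothetical site table and survey data (made-up values)
site_table = pd.DataFrame({'Site': ['120 Reef', 'Frys Anchorage']})
survey = pd.DataFrame({'Site': ['120 Reef', "Fry's Anchorage", 'Cayucos', 'Cayucos']})

# Site names in the survey data with no exact match in the site table
missing = sorted(set(survey['Site']) - set(site_table['Site']))
print(missing)
```

Because the match is exact, spelling variants such as an apostrophe show up as missing sites and need to be reconciled by hand.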

In [24]:
## Pad month and day as needed

# Zero-pad to two characters (e.g. 8 -> '08'); kept as lists for positional indexing below
paddedDay = data_long['Day'].astype(str).str.zfill(2).tolist()
paddedMonth = data_long['Month'].astype(str).str.zfill(2).tolist()

In [25]:
## Create eventID

eventID = [data_long['SiteName'].iloc[i] + '_' + str(data_long['Year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + 
           str(data_long['Transect'].iloc[i]) for i in range(len(data_long['Site']))]
occ = pd.DataFrame({'eventID':eventID})

occ.head()

Unnamed: 0,eventID
0,120Reef_20061008_1
1,120Reef_20061008_2
2,120Reef_20061008_3
3,120Reef_20061008_4
4,120Reef_20061008_5


In [26]:
## Format dates and add eventDate

eventDate = [datetime.strptime(dt, '%m/%d/%Y').date().isoformat() for dt in data_long['SurveyDate']]
occ['eventDate'] = eventDate
occ.head()

Unnamed: 0,eventID,eventDate
0,120Reef_20061008_1,2006-10-08
1,120Reef_20061008_2,2006-10-08
2,120Reef_20061008_3,2006-10-08
3,120Reef_20061008_4,2006-10-08
4,120Reef_20061008_5,2006-10-08
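Since the date is encoded twice (once inside `eventID`, once as `eventDate`), it may be worth confirming that the two agree. This is a minimal sketch on made-up rows; `embedded`, `expected` and `mismatches` are illustrative names, and the regular expression assumes the `Site_YYYYMMDD_Transect` pattern shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    'eventID': ['120Reef_20061008_1', '120Reef_20061008_2'],
    'eventDate': ['2006-10-08', '2006-10-08'],
})

# Pull the YYYYMMDD block out of the eventID and compare it to eventDate
embedded = toy['eventID'].str.extract(r'_(\d{8})_\d+$')[0]
expected = toy['eventDate'].str.replace('-', '', regex=False)
mismatches = toy[embedded != expected]

print(len(mismatches))
```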


In [27]:
## Add datasetID

occ['datasetID'] = 'RCCA invasive algae'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID
0,120Reef_20061008_1,2006-10-08,RCCA invasive algae
1,120Reef_20061008_2,2006-10-08,RCCA invasive algae
2,120Reef_20061008_3,2006-10-08,RCCA invasive algae
3,120Reef_20061008_4,2006-10-08,RCCA invasive algae
4,120Reef_20061008_5,2006-10-08,RCCA invasive algae


In [32]:
## Add locality and countryCode

# locality
occ['locality'] = data_long['Site']
occ.loc[occ['locality'] == "Fry's Anchorage", 'locality'] = 'Frys Anchorage'

# countryCode
occ['countryCode'] = 'US'
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode
0,120Reef_20061008_1,2006-10-08,RCCA invasive algae,120 Reef,US
1,120Reef_20061008_2,2006-10-08,RCCA invasive algae,120 Reef,US
2,120Reef_20061008_3,2006-10-08,RCCA invasive algae,120 Reef,US
3,120Reef_20061008_4,2006-10-08,RCCA invasive algae,120 Reef,US
4,120Reef_20061008_5,2006-10-08,RCCA invasive algae,120 Reef,US


In [33]:
## Add decimalLatitude and decimalLongitude

occ['decimalLatitude'] = data_long['Lat']
occ['decimalLongitude'] = data_long['Lon']
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude
0,120Reef_20061008_1,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014
1,120Reef_20061008_2,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014
2,120Reef_20061008_3,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014
3,120Reef_20061008_4,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014
4,120Reef_20061008_5,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014


In [34]:
## Add coordinateUncertaintyInMeters

occ['coordinateUncertaintyInMeters'] = 250

In [35]:
## Add occurrenceID

occ['occurrenceID'] = data_long.groupby(['Site', 'SurveyDate', 'Transect'])['variable'].cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_occ' + occ['occurrenceID'].astype(str)

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID
0,120Reef_20061008_1,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_1_occ1
1,120Reef_20061008_2,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_2_occ1
2,120Reef_20061008_3,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_3_occ1
3,120Reef_20061008_4,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_4_occ1
4,120Reef_20061008_5,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_5_occ1
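The first five rows all end in `_occ1` because each belongs to a different transect, so the counter restarts. The sketch below reproduces the numbering on made-up rows and checks that the resulting IDs are unique; it groups by `eventID` directly, which is an assumption that each site/date/transect combination maps to exactly one `eventID`.

```python
import pandas as pd

toy_long = pd.DataFrame({
    'eventID': ['A_20061008_1', 'A_20061008_2'] * 2,
    'variable': ['Undaria', 'Undaria', 'Caulerpa', 'Caulerpa'],
})

# Number the records within each event, starting at 1
counter = toy_long.groupby('eventID').cumcount() + 1
toy_long['occurrenceID'] = toy_long['eventID'] + '_occ' + counter.astype(str)

print(toy_long['occurrenceID'].tolist())
print(toy_long['occurrenceID'].is_unique)
```

Running `occ['occurrenceID'].is_unique` on the real table would be a cheap check before publishing.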


In [42]:
## Get unique scientific names

sci_names = data_long['variable'].unique()
sci_names[0] = 'Sargassum muticum'
sci_names[1] = 'Sargassum filicinum'
sci_names

array(['Sargassum muticum', 'Sargassum filicinum', 'Undaria', 'Caulerpa'],
      dtype=object)

In [43]:
## Call run_get_worms_from_scientific_name

name_id_dict, name_name_dict, name_taxid_dict = WoRMS.run_get_worms_from_scientific_name(sci_names, verbose_flag=True)
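The `WoRMS` module is project-specific and its code is not shown here. Judging by the table output in this notebook, `scientificNameID` is an LSID string that embeds the numeric `taxonID` (the WoRMS AphiaID). The sketch below only illustrates that relationship; `aphia_ids` and `lsid_from_aphia_id` are hypothetical names, not functions from the module.

```python
# AphiaID taken from this notebook's own output for Sargassum muticum
aphia_ids = {'Sargassum muticum': 494791}


def lsid_from_aphia_id(aphia_id):
    """Build a WoRMS-style LSID string from a numeric AphiaID."""
    return f'urn:lsid:marinespecies.org:taxname:{aphia_id}'


# Name -> LSID lookup, analogous in shape to name_id_dict
name_to_lsid = {name: lsid_from_aphia_id(i) for name, i in aphia_ids.items()}
print(name_to_lsid)
```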

In [49]:
## Add scientific name-related columns

# Assign the result of replace() rather than using inplace=True on a column selection,
# which may not update occ under newer pandas versions
occ['scientificName'] = data_long['variable'].replace({'SargassumMuticum':'Sargassum muticum',
                                                       'SargassumFilicinum':'Sargassum filicinum'})

occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,scientificName,scientificNameID,taxonID
0,120Reef_20061008_1,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_1_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791
1,120Reef_20061008_2,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_2_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791
2,120Reef_20061008_3,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_3_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791
3,120Reef_20061008_4,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_4_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791
4,120Reef_20061008_5,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_5_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791


In [56]:
## Add final name-related columns

occ['nameAccordingTo'] = 'WoRMS'

# Map Yes/No (including lowercase 'yes') to Darwin Core terms; NaN is left as-is
occ['occurrenceStatus'] = data_long['value'].replace({'yes':'present',
                                                      'Yes':'present',
                                                      'No':'absent'})

occ['basisOfRecord'] = 'HumanObservation'

occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord
0,120Reef_20061008_1,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_1_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation
1,120Reef_20061008_2,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_2_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation
2,120Reef_20061008_3,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_3_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation
3,120Reef_20061008_4,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_4_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation
4,120Reef_20061008_5,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_5_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation


In addition to present and absent, **some occurrenceStatus values are NaN**. In other words, in the original data, the possible values were Yes, No and NaN. **How are No and NaN different?**
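To see how many records fall into each category, missing values included, `value_counts(dropna=False)` is handy. The sketch below runs on a made-up series; `raw`, `status` and `counts` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up raw values, including a lowercase 'yes' and a missing entry
raw = pd.Series(['No', 'Yes', 'yes', np.nan, 'No'])

# Normalize case, then map; values missing from the mapping (including NaN) stay NaN
status = raw.str.capitalize().map({'Yes': 'present', 'No': 'absent'})

# dropna=False keeps the NaN category in the tally
counts = status.value_counts(dropna=False)
print(counts)
```

Applied to `occ['occurrenceStatus']`, this would show how common the NaN case is and whether it clusters in particular years or sites.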

In [63]:
## Add depth

depth_m = (data_long['Depth_ft'] * 0.3048).round(1)  # feet to meters
occ['minimumDepthInMeters'] = depth_m
occ['maximumDepthInMeters'] = depth_m
occ.head()

Unnamed: 0,eventID,eventDate,datasetID,locality,countryCode,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,occurrenceID,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,minimumDepthInMeters,maximumDepthInMeters
0,120Reef_20061008_1,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_1_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation,8.5,8.5
1,120Reef_20061008_2,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_2_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation,8.5,8.5
2,120Reef_20061008_3,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_3_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation,7.9,7.9
3,120Reef_20061008_4,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_4_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation,6.6,6.6
4,120Reef_20061008_5,2006-10-08,RCCA invasive algae,120 Reef,US,33.737919,-118.392014,250,120Reef_20061008_5_occ1,Sargassum muticum,urn:lsid:marinespecies.org:taxname:494791,494791,WoRMS,absent,HumanObservation,5.2,5.2


## Save

In [64]:
occ.to_csv('RCCA_invasives_occurrence_20200803.csv', index=False, na_rep='NaN')
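With `na_rep='NaN'`, missing values are written as the literal text `NaN`. The sketch below shows the round trip on a made-up table, writing to an in-memory buffer rather than to disk; `toy`, `buffer` and `reloaded` are illustrative names.

```python
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame({'occurrenceID': ['a_occ1', 'a_occ2'],
                    'occurrenceStatus': ['absent', np.nan]})

# Write to an in-memory buffer instead of a file, using the same options
buffer = io.StringIO()
toy.to_csv(buffer, index=False, na_rep='NaN')
text = buffer.getvalue()

# pandas treats the literal string 'NaN' as missing by default when reading
reloaded = pd.read_csv(io.StringIO(text))
print(text)
print(reloaded['occurrenceStatus'].isna().tolist())
```

Readers other than pandas may treat `NaN` as an ordinary string, so downstream consumers of the occurrence file should be told how missing values are encoded.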