# PISCO - fish transect data

The density of all conspicuous fishes (i.e. species whose adults are longer than 10 cm and visually detectable by SCUBA divers) is visually recorded along replicate 2 m wide by 2 m tall by 30 m long (120 m³) transects.
- Transects are performed in 2-3 heights: bottom, mid-water and canopy
    - Bottom transects are always performed; a diver searches in cracks and crevices with a flashlight
    - Mid-water transects are always performed; a second diver surveys 120 m³ about 1/3 - 1/2 of the way up into the water column
    - Canopy transects are surveyed at a subset of sites, and are usually completed separately from the bottom and midwater transects; a diver swims 2m below the surface counting fishes in the top two meters of the water column
- Three 30 m long transects, distributed end-to-end and 5-10 m apart, are typically performed at each height, and at each of four depths:
    - 5m
    - 10m
    - 15m
    - 20m 
    - transects at the 25 m isobath are performed by VRG where habitat is available
- Survey depths may vary based on reef topography 
- Counts on mid-water and bottom transects are eventually combined, generating 12 replicate transects for each site. **Are these already combined in this data set?** **Note** that at sites with narrow kelp beds, particularly in parts of the Northern Channel Islands, only two depths are sampled, with four transects in each depth zone for a total of eight replicate transects
- Surveyors record:
    - The total length (TL) of each fish observed
    - Transect depth
    - Horizontal visibility along each transect (**must be at least 3 m to perform fish transects**)
    - Water temperature
    - Sea state (surge)
    - Percent of the transect volume occupied by kelp (PISCO only)
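
As a rough illustration of the geometry described above, the sketch below computes the volume surveyed by one transect and converts a raw count into a density. The helper `density_per_100m3` and the example count are hypothetical and are not part of the PISCO protocol.

```python
# Transect dimensions from the protocol description (metres)
WIDTH_M, HEIGHT_M, LENGTH_M = 2, 2, 30
TRANSECT_VOLUME_M3 = WIDTH_M * HEIGHT_M * LENGTH_M

def density_per_100m3(count, volume_m3=TRANSECT_VOLUME_M3):
    """Convert a raw transect count to individuals per 100 m^3 (hypothetical helper)."""
    return count * 100 / volume_m3

# Typical design once bottom and mid-water counts are combined:
# 3 transects at each of 4 depths
replicates_per_site = 3 * 4

print(TRANSECT_VOLUME_M3)     # 120
print(replicates_per_site)    # 12
print(density_per_100m3(6))   # 5.0
```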

**Resources**
- https://opc.dataone.org/view/MLPA_kelpforest.metadata.1

In [48]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates

In [49]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [56]:
## Load data

# path = 'C:\\Users\\dianalg\\Documents\\Work\\MBARI\\MPA Data Integration\\PISCO\\'
filename = 'MLPA_kelpforest_fish.1.csv'
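# 'ANSI' is only recognized as a codec name on Windows; 'cp1252' is assumed here to be the intended code page
fish = pd.read_csv(filename, encoding='cp1252', dtype={'transect':str, 'sex':str, 'site_name_old':str})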

print(fish.shape)
fish.head()

(381693, 24)


Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,max_tl,sex,observer,depth,vis,temp,surge,pctcnpy,notes,site_name_old
0,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,MARK CARR,6.1,2.4,,HIGH,1.0,,
1,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,MARK CARR,6.1,2.4,,HIGH,1.0,,
2,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,MARK CARR,6.1,2.4,,HIGH,1.0,,
3,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,,,MARK CARR,6.1,2.4,,HIGH,1.0,,
4,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,8.0,,MARK CARR,6.1,2.4,,HIGH,1.0,,


### Column definitions

**campus** = UCSC, UCSB, HSU or VRG. The academic partner campus that conducted the survey. <br>
**method** = SBTL_FISH_PISCO, SBTL_FISH_CRANE, SBTL_FISH_HSU or SBTL_FISH_VRG. The code describing the sampling technique and monitoring program that conducted each survey. **How is this different from the previous column? Does it actually indicate further methodological differences?** <br>
**survey_year** = 1999 - 2018. The designated year associated with the annual survey. In most cases, survey_year and year are the same. In rare cases, surveys are completed early in the year following the designated survey year. In these cases, survey_year will differ from year. <br>
**year** = 1999 - 2018. Year that the survey was conducted. <br>
**month** = 1 - 12. Month that the survey was conducted. <br>
**day** = 1 - 31. Day that the survey was conducted. <br>
**site** = One of 380 site codes. The unique site where the survey was performed (as defined in the site table). This site refers to a specific GPS location and is often associated with a geographic placename. Often, multiple site replicates will be associated with a single placename, and will be delineated with additional geographical or directional information (e.g. North/South/East/West/Central - N/S/E/W/CEN, Upcoast/Downcoast - UC/DC) <br>
**zone** = INNER, OUTER, OUTMID, INMID, MID or DEEP. A division of the site into 2 to 4 categories representing onshore-offshore stratification associated with targeted bottom depths for transects.
- INNER: Depth zone targeting roughly 5m of water depth, or the inner edge of the reef
- INMID: Depth zone targeting roughly 10m of water depth 
- MID: Depth zone targeting roughly 10-15m of water depth, used by VRG and in early years of PISCO
- OUTMID: Depth zone targeting roughly 15m of water depth 
- OUTER: Depth zone targeting roughly 20m of water depth 
- DEEP: Depth zone targeting roughly 25m of water depth, where present, used only by VRG

**level** = BOT, CAN, MID or CNMD. The vertical placement of the transect within the water column. Includes BOT: bottom transects placed at the seafloor, MID: midwater transects placed at roughly half the depth of the seafloor, and CAN: canopy transects placed at the surface to survey the top two meters of the water column and kelp canopy. CNMD is used when an inner transect is too shallow to allow both canopy and midwater transects without overlapping (applies to UCSB and VRG only) <br>
**transect** = It seems like this should only be 1 - 12, but there are a number of other designations as well. The unique transect replicate within each site, zone, and level. <br>
**classcode** = One of 166 taxon codes. The unique taxonomic classification code that is being counted, as defined in the taxonomic table. This refers to a code that defines the genus and species identified, a code that represents a small group of species that can't be narrowed down to one, or in some cases a family-level or higher-order grouping. Generally, for fishes, the classcode consists of the first letter of the genus and the first three letters of the species, with some exceptions <br>
**count** = The number of individuals of a given classcode of a given size per transect <br>
**fish_tl** = The total length of an individual or group of individuals (of the same length) OR the average total length for a group of fish where a range in lengths is specified (rounded to the nearest cm) <br>
**min_tl** = The minimum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species <br>
**max_tl** = The maximum size of the sampled class, used only when a range of sizes was recorded for a group of individuals of a species <br>
**sex** = MALE, FEMALE, TRANSITIONAL, JUVENILE or 'nan'. The sex classification for sexually dimorphic species where sex can be distinguished visually and is recorded. For some species, individuals with juvenile markings are also indicated here. The TRANSITIONAL class is used for fish with external morphological features consistent with both male and female (applies to sex changing fishes such as California sheephead). JUVENILE is not always indicated when a juvenile fish is observed. <br>
**observer** = The diver who conducted the survey transect <br>
**depth** = Between 0.2 and 33.4 meters. Depth of the transect estimated by the diver. **Does this mean a dive computer was used?** <br>
**vis** = Between 1 and 35 meters. The diver's estimation of horizontal visibility on each transect. Measured by reeling in the transect tape and noting the distance at which the end of the tape can first be seen. <br>
**temp** = Between 7 and 25.6 degrees C. The temperature on each transect measured by the diver's computer. <br>
**surge** = HIGH, MODERATE, LIGHT or 'nan'. The diver's estimation of magnitude of horizontal displacement on each transect.
- LIGHT: No significant surge
- MODERATE: Surge causing noticeable lateral movement, diver must compensate
- HIGH: Significant surge, diver moved out of transect bounds when not holding on

**pctcnpy** = 0 - 3 or NaN. The diver's estimation of the percent of the transect, by volume, that is occupied by kelp. This estimation is specific to the level of the transect that is being surveyed (i.e. excluding canopy transects, this is not an estimation of surface canopy but of the amount of kelp within the transect at the specified level). **I believe this measure was only recorded by PISCO.**
- 0: 0% of transect volume occupied by kelp
- 1: 1-33% of transect volume occupied by kelp
- 2: 34-66% of transect volume occupied by kelp
- 3: 67-100% of transect volume occupied by kelp

**notes** = Free text notes taken at the time of the sample, or added at the time of data entry. <br>
**site_name_old** = In cases when specific sites have been surveyed by multiple campuses using different site names, this variable indicates the alternative (historical) site name.
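
The documented value sets above can be turned into a quick sanity check. This is a minimal sketch on a small made-up table; the `toy` DataFrame and the `unexpected_values` helper are hypothetical, and with the real data `fish` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Allowed values taken from the column definitions above
ALLOWED = {
    'zone': {'INNER', 'INMID', 'MID', 'OUTMID', 'OUTER', 'DEEP'},
    'level': {'BOT', 'MID', 'CAN', 'CNMD'},
    'surge': {'HIGH', 'MODERATE', 'LIGHT'},
    'pctcnpy': {0, 1, 2, 3},
}

def unexpected_values(df, allowed=ALLOWED):
    """Return {column: set of non-missing values outside the documented set}."""
    problems = {}
    for col, ok in allowed.items():
        found = set(df[col].dropna().unique()) - ok
        if found:
            problems[col] = found
    return problems

# Made-up rows for illustration only
toy = pd.DataFrame({
    'zone': ['INNER', 'OUTER', 'SHALLOW'],
    'level': ['BOT', 'MID', 'CAN'],
    'surge': ['HIGH', np.nan, 'LIGHT'],
    'pctcnpy': [1.0, np.nan, 3.0],
})

print(unexpected_values(toy))  # {'zone': {'SHALLOW'}}
```

Missing values are dropped before the comparison, so columns such as surge and pctcnpy, which are documented as allowing NaN, are not flagged for that alone.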

### Strategy

As with the RCCA data, each transect can be an **event** and each fish observation can be an **occurrence**. There are both event-level and occurrence-level measurements, necessitating event and MoF files. 

The **event** file should contain: eventID (from site, survey_year, transect, level?), eventDate (from year, month, day), datasetID, locality (site), localityRemarks (maybe level and/or zone information), countryCode, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. Some notes might be eventRemarks. Should I include the campus information somewhere? Observer?

The **occurrence** file should contain: eventID, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe some notes), sex (sex), lifeStage (sex), organismQuantity (count), organismQuantityType.

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Depth, vis, temp, surge and pctcnpy can be recorded at the event level. Fish_tl, min_tl and max_tl can be recorded at the occurrence level.
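
To make the MoF layout concrete, here is a minimal sketch of reshaping event-level measurements from wide to long format with `DataFrame.melt`. The `events` rows and the `UNITS` mapping are made up for illustration; the unit strings are assumptions and would need to follow whatever vocabulary the final data set uses.

```python
import pandas as pd

# Made-up event-level rows; column names follow the fish table
events = pd.DataFrame({
    'eventID': ['SITE_A_19990907_INNER_BOT_1', 'SITE_A_19990907_INNER_BOT_2'],
    'depth': [6.1, 6.4],
    'vis': [2.4, 3.0],
    'temp': [None, 12.5],
})

# Assumed units for each measurement (based on the column definitions)
UNITS = {'depth': 'meters', 'vis': 'meters', 'temp': 'degrees Celsius'}

# Wide -> long: one row per (event, measurement)
mof = events.melt(id_vars='eventID', value_vars=list(UNITS),
                  var_name='measurementType', value_name='measurementValue')
mof = mof.dropna(subset=['measurementValue'])   # skip unrecorded measurements
mof['measurementUnit'] = mof['measurementType'].map(UNITS)
mof = mof.sort_values(['eventID', 'measurementType']).reset_index(drop=True)

print(mof)
```

Occurrence-level measurements (fish_tl, min_tl, max_tl) could be reshaped the same way, with occurrenceID added to `id_vars`.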

## Create occurrence file

### Get site names

In [58]:
## Load site table

filename = 'MLPA_kelpforest_site_table.1.csv'
sites = pd.read_csv(filename)

print(sites.shape)
sites.head()

(7458, 17)


Unnamed: 0,LTM_project_short_code,campus,method,survey_year,year,site,latitude,longitude,CA_MPA_Name_Short,site_designation,site_status,Secondary_MPA_Name,Secondary_site_designation,Secondary_site_status,BaselineRegion,LongTermRegion,MPA_priority_tier
0,LTM_Kelp_SRock,VRG,SBTL_SIZEFREQ_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
1,LTM_Kelp_SRock,VRG,SBTL_FISH_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
2,LTM_Kelp_SRock,VRG,SBTL_SWATH_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
3,LTM_Kelp_SRock,VRG,SBTL_UPC_VRG,2008,2008,3 Palms East,33.71762,-118.33215,Abalone Cove SMCA,reference,reference,,,,South Coast,South Coast,II
4,LTM_Kelp_SRock,HSU,SBTL_UPC_HSU,2018,2018,ABALONE_POINT_1,39.6915,-123.8141,Ten Mile SMR,reference,reference,,,,North Coast,North Coast,I


There are two sites in the site table that have no fish records:
- PISMO_W
- SAL_E

Also, it looks like only one lat and lon is given for each site. Additionally, sites have been consistently labeled as either 'reference' or 'mpa'. To check this:
```python
# Groupby
out = sites.groupby(['site']).agg({
    'latitude':pd.Series.nunique,
    'longitude':pd.Series.nunique,
    'site_status':pd.Series.nunique,
    'campus':pd.Series.nunique
})
out.reset_index(inplace=True)

# Check
out[out['latitude'] > 1]
out[out['longitude'] > 1]
out[out['site_status'] > 1]
out[out['campus'] > 1]
```

Since, for the purpose of DwC, we're not interested in which sites were sampled when, I can simplify the site table to only contain relevant information: site, latitude, longitude, and site status. The campus responsible for the survey might also be good to include. **Which campus is responsible for a given site has changed between years in 13 cases. I'll leave this information out for now.**

In [78]:
## Create simplified site table

site_summary = sites[['site', 'site_status', 'latitude', 'longitude']].copy()
site_summary.drop_duplicates(inplace=True)

print(site_summary.shape)
site_summary.head()

(382, 4)


Unnamed: 0,site,site_status,latitude,longitude
0,3 Palms East,reference,33.71762,-118.33215
4,ABALONE_POINT_1,reference,39.6915,-123.8141
15,ABALONE_POINT_2,reference,39.66502,-123.80435
26,ABALONE_POINT_3,reference,39.62877,-123.79658
33,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252


Some site names have spaces or ' - ' characters. I'll replace these in a sensible way.

In [85]:
# Replace ' ' and ' - ' in site names and add site_name column

site_name = [name.replace(' - ', '-') for name in site_summary['site']]
site_name = [name.replace(' ', '_') for name in site_name]
site_summary['site_name'] = site_name

site_summary.head()

Unnamed: 0,site,site_status,latitude,longitude,site_name
0,3 Palms East,reference,33.71762,-118.33215,3_Palms_East
4,ABALONE_POINT_1,reference,39.6915,-123.8141,ABALONE_POINT_1
15,ABALONE_POINT_2,reference,39.66502,-123.80435,ABALONE_POINT_2
26,ABALONE_POINT_3,reference,39.62877,-123.79658,ABALONE_POINT_3
33,ANACAPA_ADMIRALS_CEN,reference,34.002883,-119.4252,ANACAPA_ADMIRALS_CEN
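
The same replacement can also be written with pandas string methods rather than list comprehensions. The sketch below uses made-up site names (`toy_sites` is hypothetical) and adds a check that the cleaned names are still unique, since two different raw names could in principle collapse to the same cleaned name.

```python
import pandas as pd

# Made-up site names for illustration
toy_sites = pd.DataFrame({'site': ['3 Palms East', 'POINT A - NORTH', 'ABALONE_POINT_1']})

# Order matters: collapse ' - ' first, then turn remaining spaces into underscores
toy_sites['site_name'] = (
    toy_sites['site']
    .str.replace(' - ', '-', regex=False)
    .str.replace(' ', '_', regex=False)
)

print(toy_sites['site_name'].tolist())
# ['3_Palms_East', 'POINT_A-NORTH', 'ABALONE_POINT_1']

# Cleaned names are used in eventIDs, so they should stay unique
print(toy_sites['site_name'].is_unique)  # True
```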


### Convert

In [87]:
## Pad month and day as needed

# Zero-pad to two characters (e.g. 7 -> '07'); kept as lists so they can be indexed by position below
paddedDay = fish['day'].astype(str).str.zfill(2).tolist()
paddedMonth = fish['month'].astype(str).str.zfill(2).tolist()
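
An alternative to padding month and day by hand is to let pandas assemble real dates, which also raises an error on impossible dates such as month 13. This is a minimal sketch on a made-up table (`toy` is hypothetical); it assumes the year, month and day columns are complete integers, as they appear to be in the fish table.

```python
import pandas as pd

toy = pd.DataFrame({'year': [1999, 2018], 'month': [9, 11], 'day': [7, 23]})

# pd.to_datetime assembles dates from columns named year, month and day
dates = pd.to_datetime(toy[['year', 'month', 'day']])

toy['survey_date'] = dates.dt.strftime('%Y%m%d')   # compact form used in IDs
toy['eventDate'] = dates.dt.strftime('%Y-%m-%d')   # ISO 8601 form

print(toy[['survey_date', 'eventDate']])
```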

In [101]:
## Merge to add site_name (also lat, lon and site_status) to fish table

fish = fish.merge(site_summary, how='left', on='site')
fish.head()

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,vis,temp,surge,pctcnpy,notes,site_name_old,site_status,latitude,longitude,site_name
0,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,2.4,,HIGH,1.0,,,mpa,36.623586,-121.904196,HOPKINS_DC
1,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,2.4,,HIGH,1.0,,,mpa,36.623586,-121.904196,HOPKINS_DC
2,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,2.4,,HIGH,1.0,,,mpa,36.623586,-121.904196,HOPKINS_DC
3,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,2.4,,HIGH,1.0,,,mpa,36.623586,-121.904196,HOPKINS_DC
4,UCSC,SBTL_FISH_PISCO,1999,1999,9,7,HOPKINS_DC,INNER,BOT,1,...,2.4,,HIGH,1.0,,,mpa,36.623586,-121.904196,HOPKINS_DC
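
A left merge silently duplicates rows if a key appears more than once in the right-hand table, and silently leaves NaN where a key has no match. The sketch below shows one way to guard against both using the `validate` and `indicator` options of `DataFrame.merge`; the `toy_fish` and `toy_sites` tables are made up for illustration.

```python
import pandas as pd

toy_fish = pd.DataFrame({'site': ['A', 'A', 'B', 'C'], 'count': [1, 2, 3, 4]})
toy_sites = pd.DataFrame({'site': ['A', 'B'], 'latitude': [36.6, 34.0]})

merged = toy_fish.merge(
    toy_sites, how='left', on='site',
    validate='many_to_one',   # raises MergeError if a site appears twice in toy_sites
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

# A left merge on unique keys should preserve the row count
assert len(merged) == len(toy_fish)

# Sites in the fish table that found no match in the site table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'site'].unique().tolist()
print(unmatched)  # ['C']
```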


In [110]:
## Create eventID

eventID = [fish['site_name'].iloc[i] + '_' + str(fish['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] + '_' + fish['zone'].iloc[i] + '_' + fish['level'].iloc[i] + '_' +
           fish['transect'].iloc[i].replace(' ', '') for i in range(fish.shape[0])]
fish_occ = pd.DataFrame({'eventID':eventID})

fish_occ.head()

Unnamed: 0,eventID
0,HOPKINS_DC_19990907_INNER_BOT_1
1,HOPKINS_DC_19990907_INNER_BOT_1
2,HOPKINS_DC_19990907_INNER_BOT_1
3,HOPKINS_DC_19990907_INNER_BOT_1
4,HOPKINS_DC_19990907_INNER_BOT_1


In [119]:
## Add occurrenceID

# Create survey_date column in fish
fish['survey_date'] = [str(fish['year'].iloc[i]) + paddedMonth[i] + paddedDay[i] for i in range(fish.shape[0])]

# Groupby to create occurrenceID
fish_occ['occurrenceID'] = fish.groupby(['site', 'survey_date', 'zone', 'level', 'transect'])['classcode'].cumcount()+1
fish_occ['occurrenceID'] = fish_occ['eventID'] + '_occ' + fish_occ['occurrenceID'].astype(str)

fish_occ.head()

Unnamed: 0,eventID,occurrenceID
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5
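
For reference, the two ID columns can also be built with column-wise string operations instead of row-by-row loops. This is a sketch on made-up rows (`toy` is hypothetical). Unlike the cells above, it numbers occurrences by grouping on the finished eventID string rather than on the raw columns, which is an assumption about the desired behavior: it guarantees the counter restarts exactly when the eventID changes.

```python
import pandas as pd

toy = pd.DataFrame({
    'site_name': ['HOPKINS_DC'] * 3 + ['HOPKINS_UC'],
    'survey_date': ['19990907'] * 4,
    'zone': ['INNER'] * 4,
    'level': ['BOT', 'BOT', 'MID', 'BOT'],
    'transect': ['1', '1', ' 1', '2'],
})

# Column-wise string concatenation instead of a row-by-row loop
event_id = (
    toy['site_name'] + '_' + toy['survey_date'] + '_' + toy['zone'] + '_'
    + toy['level'] + '_' + toy['transect'].str.replace(' ', '', regex=False)
)

# Number the records within each event, starting at 1
occ_number = event_id.groupby(event_id).cumcount() + 1
occurrence_id = event_id + '_occ' + occ_number.astype(str)

print(occurrence_id.tolist())
print(occurrence_id.is_unique)  # True
```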


In [124]:
## Load species table

filename = 'MLPA_kelpforest_taxon_table.1.csv'
species = pd.read_csv(filename)

print(species.shape)
species.head()

(1336, 38)


Unnamed: 0,campus,sample_type,sample_subtype,classcode,orig_classcode,Kingdom,Phylum,Class,Order,Family,...,LOOKED2009,LOOKED2010,LOOKED2011,LOOKED2012,LOOKED2013,LOOKED2014,LOOKED2015,LOOKED2016,LOOKED2017,LOOKED2018
0,HSU,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,no,no,no,no,no,yes,yes,no,yes,yes
1,UCSB,FISH,FISH,AARG,AARG,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,VRG,FISH,FISH,AARG,Amphistichus argenteus,Animalia,Chordata,Actinopterygii,Perciformes,Embiotocidae,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
3,HSU,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no
4,UCSB,FISH,FISH,ACOR,ACOR,Animalia,Chordata,Actinopterygii,Scorpaeniformes,Cottidae,...,no,no,no,no,no,no,no,no,no,no


The subset of the species table that's currently relevant is entries with sample_type = 'FISH'. **Note** that there are 172 unique classcodes under this sample type, only 166 of which are actually in the fish data set. **Where does this discrepancy come from?** It seems like all classcodes should appear at least once, with a count of either 0 (looked for and not found) or NaN (not looked for).

Classcodes that do not appear in the data are:
- DMAC
- HSPI
- HSTE
- MXEN
- RBIN

```python
species.loc[species['classcode'].isin(['DMAC', 'HSPI', 'HSTE', 'MXEN', 'RBIN']), ['campus', 'classcode', 'species_definition', 'common_name']]
```
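
One way to track down a mismatch like this is to compare the two sets of codes in both directions, since codes can be missing from the data or missing from the taxon table. The sketch below uses made-up tables (`toy_species` and `toy_fish` are hypothetical); with the real data, `species` and `fish` would take their place.

```python
import pandas as pd

# Made-up codes for illustration
toy_species = pd.DataFrame({'sample_type': ['FISH', 'FISH', 'FISH', 'SWATH'],
                            'classcode': ['AAAA', 'BBBB', 'CCCC', 'DDDD']})
toy_fish = pd.DataFrame({'classcode': ['AAAA', 'BBBB', 'NO_ORG']})

table_codes = set(toy_species.loc[toy_species['sample_type'] == 'FISH', 'classcode'])
data_codes = set(toy_fish['classcode'].dropna())

only_in_table = sorted(table_codes - data_codes)   # in the taxon table but never recorded
only_in_data = sorted(data_codes - table_codes)    # recorded but missing from the taxon table

print(only_in_table)  # ['CCCC']
print(only_in_data)   # ['NO_ORG']
```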

In [136]:
## Select species for fish surveys

fish_sp = species.loc[species['sample_type'] == 'FISH', ['classcode', 'species_definition', 'common_name']]
fish_sp.drop_duplicates(inplace=True)

print(fish_sp.shape)
fish_sp

(172, 3)


Unnamed: 0,classcode,species_definition,common_name
0,AARG,Amphistichus argenteus,Barred Surfperch
3,ACOR,Artedius corallinus,Coralline Sculpin
7,ADAV,Anisotremus davidsonii,Sargo
11,AFLA,Aulorhynchus flavidus,Tubesnout
15,AGUA,Apogon guadalupensis,Guadalupe Cardinalfish
...,...,...,...
510,UNID,Unidentified Fish,Unidentified Fish
513,URON,Umbrina roncador,Yellowfin drum
514,USAN,Ulvicola sanctaerosae,Kelp Gunnel
516,ZEXA,Zapteryx exasperata,Banded Guitarfish


In [140]:
fish_sp.iloc[121:]

Unnamed: 0,classcode,species_definition,common_name
354,SACA,Squalus acanthias,Spiny Dogfish
357,SARG,Sphyraena argentea,California Barracuda
361,SATR,Sebastes atrovirens,Kelp Rockfish
365,SAUR,Sebastes auriculatus,Brown Rockfish
369,SCAL,Squatina californica,Pacific Angel Shark
372,SCAR,Sebastes carnatus,Gopher Rockfish
376,SCARSCAU,Sebastes carnatus/caurinus,"Gopher, Copper Rockfish Young Of Year"
378,SCAU,Sebastes caurinus,Copper Rockfish
382,SCHI,Sarda chiliensis chiliensis,Eastern Pacific Bonito
385,SCHR,Sebastes chrysomelas,Black And Yellow Rockfish


In [132]:
fish[fish['classcode'] == 'DMAC']

Unnamed: 0,campus,method,survey_year,year,month,day,site,zone,level,transect,...,temp,surge,pctcnpy,notes,site_name_old,site_status,latitude,longitude,site_name,survey_date


In [133]:
fish['classcode'].unique()

array(['ELAT', 'HDEC', 'OCAL', 'OYT', 'RVAC', 'SMEL', 'SMYS', 'BFRE',
       'NO_ORG', 'KGB', 'OELO', 'OPIC', 'RNIC', 'RTOX', 'SATR', 'SPIN',
       'COTT', 'SCHR', 'SPAU', 'EJAC', 'HCAR', 'SPUL', 'SRAS', 'GBY',
       'PCLA', 'SCAR', 'AFLA', 'SMIN', 'ZROS', 'CPUN', 'GNIG', 'CLIN',
       'SMAR', 'LCON', 'PFUR', 'ATHE', 'SYRI', 'CVIO', 'SCAU', 'HLAG',
       'RFYOY', 'OYB', 'ACOR', 'SYNG', 'PHOL', 'OTRI', 'STRE', 'HELL',
       'HROS', 'UNID', 'BRAY', 'SENT', 'SSAX', 'PCAL', 'SNEB', 'CSOR',
       'KSEI', 'BATH', 'JZON', 'SHOP', 'BLEN', 'USAN', 'EMBI', 'SSAG',
       'GIBB', 'HFRA', 'CAGG', 'HARG', 'AOCE', 'HANA', 'CSTI', 'SDAL',
       'SDIP', 'EMOR', 'GOBI', 'PATR', 'SCAL', 'CSAT', 'TSEM', 'SEBSPP',
       'MMOL', 'CVEN', 'SAUR', 'GMAE', 'LIPA', 'STICH', 'CNUG', 'LLEP',
       'EWAL', 'SROS', 'PTRI', 'RALL', 'SACA', 'BOTH', 'CANA', 'GMOR',
       'RHYP', 'RSTE', 'ADAV', 'PLEU', 'APFL', 'SRUB', 'SMAL', 'EBIS',
       'HEXA', 'HHEM', 'RRIC', 'TCAL', 'CITH', 'SSEM', 'PNOT', 'BAITBALL',
