# PISCO - size frequency data

Divers record the body sizes of targeted invertebrate species both along benthic transects and at random locations within a study site.

Measured sizes correspond with:
- test diameter for urchins
- length of longest arm for seastars
- shell length for shelled mollusks
- carapace length for lobsters
- total turgid length for sea cucumbers

In the case of urchins sampled by UCSB and VRG in Southern California, large numbers of individuals may be collected in bags and brought aboard the research vessel to facilitate measurement.

**It sounds like some of the animals with recorded sizes here may also have recorded sizes in the swath table. Is that correct? If so, what is the best way to handle it?**
    
**Resources:** <br>
https://opc.dataone.org/view/MLPA_kelpforest.metadata.1

In [1]:
## Imports

import pandas as pd
import numpy as np

from datetime import datetime # for handling dates

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [6]:
## Load data

filename = 'MLPA_kelpforest_sizefreq.1.csv'
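# Note: the 'ANSI' codec name is only recognized by Python on Windows; elsewhere 'cp1252' is the usual equivalent
# transect and site_name_old are read as strings because they hold mixed designations
data = pd.read_csv(filename, encoding='ANSI', dtype={'transect':str, 'site_name_old':str})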

print(data.shape)
data.head()

(80977, 18)


| | campus | method | survey_year | year | month | day | site | location | zone | transect | classcode | count | size | disease | depth | observer | notes | site_name_old |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UCSC | SBTL_SWATH_PISCO | 2014 | 2014 | 8 | 8 | HOPKINS_UC | TRANSECT | INNER | 1 | APOCAL | 1 | 7.0 | | 4.6 | COLIN GAYLORD | | |
| 1 | UCSC | SBTL_SWATH_PISCO | 2018 | 2018 | 8 | 15 | SAUNDERS_REFERENCE_1 | TRANSECT | OUTER | 2 | ASTSPP | 1 | 10.0 | | 20.2 | MICHAEL LANGHANS | | |
| 2 | UCSC | SBTL_SWATH_PISCO | 2014 | 2014 | 7 | 3 | MACABEE_DC | TRANSECT | OUTER | 2 | DERIMB | 2 | 12.0 | HEALTHY | 12.3 | TRISTIN MCHUGH | SIZE 10-14 CM | |
| 3 | UCSC | SBTL_SWATH_PISCO | 2014 | 2014 | 7 | 16 | HOPKINS_DC | TRANSECT | MID | 1 | DERIMB | 1 | 12.0 | HEALTHY | 8.8 | TRISTIN MCHUGH | SIZE 10-14 CM | |
| 4 | UCSC | SBTL_SWATH_PISCO | 2014 | 2014 | 7 | 16 | HOPKINS_DC | TRANSECT | MID | 1 | DERIMB | 1 | 17.0 | HEALTHY | 8.8 | TRISTIN MCHUGH | SIZE <=15 CM | |


### Column definitions

**campus** = UCSC, UCSB, HSU or VRG. The academic partner campus that conducted the survey. <br>
**method** = SBTL_SWATH_PISCO, SBTL_SWATH_HSU or SBTL_SWATH_VRG. The code describing the sampling technique and monitoring program that conducted each survey. **How is this different from the previous column? Does it actually indicate further methodological differences?** <br>
**survey_year** = 2003 - 2018. The designated year associated with the annual survey. In most cases, survey_year and year are the same. In rare cases, surveys are completed early in the year following the designated survey year. In these cases, survey_year will differ from year. <br>
**year** = 2003 - 2018. Year that the survey was conducted. <br>
**month** = 1 - 12. Month that the survey was conducted. <br>
**day** = 1 - 31. Day that the survey was conducted. <br>
**site** = One of 350 site codes. The unique site where the survey was performed (as defined in the site table). This site refers to a specific GPS location and is often associated with a geographic placename. Often, multiple site replicates will be associated with a single placename, and will be delineated with additional geographical or directional information (e.g. North/South/East/West/Central - N/S/E/W/CEN, Upcoast/Downcoast - UC/DC) <br>
**location** = RANDOM, TRANSECT OR SIZE_TRANSECT. The location where the size observation was recorded. **Note that sizes recorded as part of PISCO and HSU swath surveys are duplicated in this dataset. These duplicated records should have method = SBTL_SWATH_PISCO or SBTL_SWATH_HSU.**
- RANDOM: sizes were measured across the general site and were not specifically located on the swath/upc transect
- TRANSECT: sizes were measured on the swath/upc transects
- SIZE_TRANSECT: applies to HSU only where specific size frequency transects are conducted separately from swath/upc transects

**zone** = INNER, OUTER, OUTMID, INMID, MID or DEEP. A division of the site into 2 to 4 categories representing onshore-offshore stratification associated with targeted bottom depths for transects.
- INNER: Depth zone targeting roughly 5m of water depth, or the inner edge of the reef
- INMID: Depth zone targeting roughly 10m of water depth 
- MID: Depth zone targeting roughly 10-15m of water depth, used by VRG and in early years of PISCO
- OUTMID: Depth zone targeting roughly 15m of water depth 
- OUTER: Depth zone targeting roughly 20m of water depth 
- DEEP: Depth zone targeting roughly 25m of water depth, where present, used only by VRG

**transect** = It seems like this should only be 1 - 8, but there are **a number of other designations as well.** The unique transect replicate within each site and zone. <br>
**classcode** = One of 37 taxon codes. The unique taxonomic classification code that is being counted, as defined in the taxonomic table. This refers to a code that defines the Genus and Species that is identified, a code that represents a limited number of species that can't be narrowed down to one species, or in some cases family-level or higher order groupings. Generally, for invertebrates and algae, the classcode consists of the first letter of the genus and the first three letters of the species, with some exceptions. <br>
**count** = The number of individuals of a given classcode and a given size (if applicable) per transect <br>
**size** = Size (in centimeters) of observation. For specific species groups, measured sizes correspond with test diameter for urchins, length of longest arm for seastars, shell length for shelled mollusks, carapace length for lobsters, total turgid length for sea cucumbers. <br>
**disease** = For some years echinoderm disease was recorded on transects for select species. When systematic observation for disease was conducted, disease is indicated here. Where blank, disease was not evaluated.
- HEALTHY: Individual was inspected and no disease was observed
- YES: Some form of disease was observed
- MILD: Mild disease was observed
- SEVERE: Severe disease was observed
- WASTING: Wasting disease was observed
- BLACK SPOT: Black spot disease was observed

**depth** = Between 1.8 and 26.5 meters. Depth of the transect estimated by the diver. **Does this mean a dive computer was used?** <br>
**observer** = The diver who conducted the survey transect <br>
**notes** = Free text notes taken at the time of the sample, or added at the time of data entry. <br>
**site_name_old** = In cases when specific sites have been surveyed by multiple campuses using different site names, this variable indicates the alternative (historical) site name.
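
The documented value sets above lend themselves to a quick sanity check. The following is a minimal sketch, assuming a small toy table in place of the real data (the rows, the `sample` name, and the deliberately invalid `SHALLOW` zone are illustrative, not from the dataset). It lists values that fall outside the documented vocabularies and flags rows where `survey_year` and `year` differ.

```python
import pandas as pd

# Toy rows standing in for the real table; values are illustrative only.
sample = pd.DataFrame({
    "campus": ["UCSC", "UCSB", "HSU", "VRG"],
    "location": ["TRANSECT", "RANDOM", "SIZE_TRANSECT", "RANDOM"],
    "zone": ["INNER", "OUTER", "MID", "SHALLOW"],  # SHALLOW is deliberately invalid
    "survey_year": [2014, 2014, 2015, 2016],
    "year": [2014, 2015, 2015, 2016],
})

# Allowed values, copied from the column definitions.
allowed = {
    "campus": {"UCSC", "UCSB", "HSU", "VRG"},
    "location": {"RANDOM", "TRANSECT", "SIZE_TRANSECT"},
    "zone": {"INNER", "INMID", "MID", "OUTMID", "OUTER", "DEEP"},
}

# For each column, collect the non-null values outside the documented set.
unexpected = {
    col: sorted(set(sample[col].dropna()) - values)
    for col, values in allowed.items()
}
print(unexpected)

# Rows whose survey was completed in a different calendar year
# than the designated survey year.
year_mismatch = sample[sample["survey_year"] != sample["year"]]
print(year_mismatch)
```

Running the same check with `data` in place of `sample` would show whether the real table matches the definitions.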

**NOTE THAT 25419 RECORDS HAVE METHOD = 'SBTL_SIZEFREQ_PISCO' AND LOCATION = 'TRANSECT'. VRG AND HSU DO NOT SEEM TO HAVE THIS PROBLEM. HOW DO I INTERPRET THESE RECORDS?**
```python
data[(data['method'] == 'SBTL_SIZEFREQ_PISCO') & (data['location'] == 'TRANSECT')]
```
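
To see how these records are distributed across campuses, one option is to cross-tabulate `method` against `location`. This is only a sketch on made-up rows (the `toy` frame and its counts are illustrative); the same `pd.crosstab` call could be applied to `data`.

```python
import pandas as pd

# Made-up rows; the method and location codes are taken from the notes above.
toy = pd.DataFrame({
    "method": [
        "SBTL_SIZEFREQ_PISCO", "SBTL_SIZEFREQ_PISCO", "SBTL_SWATH_PISCO",
        "SBTL_SIZEFREQ_VRG", "SBTL_SIZEFREQ_HSU",
    ],
    "location": ["TRANSECT", "RANDOM", "TRANSECT", "RANDOM", "SIZE_TRANSECT"],
})

# One row per method, one column per location, cells hold record counts.
counts = pd.crosstab(toy["method"], toy["location"])
print(counts)
```

A nonzero cell for a SIZEFREQ method under TRANSECT is the combination questioned above.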

**HOW SHOULD I INTERPRET THE TRANSECT FIELD FOR RECORDS WHERE LOCATION = RANDOM?**

### Strategy

Here, it seems like I should eliminate all the data from transects (i.e., the duplicated records that are already in the swath data). Once that is done, it seems like an event 

The **event** file should contain: eventID (from site, survey date, zone, transect), eventDate (from year, month, day), datasetID, locality (site), countryCode, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, minimumDepthInMeters, maximumDepthInMeters, samplingProtocol, and samplingEffort. Should I include the campus information somewhere? Observer?
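
As a rough illustration of the event fields derived from the data columns, the sketch below builds eventDate and eventID on toy rows and collapses them to one row per event. The eventID pattern (`site_YYYYMMDD_zone_transect`), the site names, and the use of min/max of `depth` for the depth fields are assumptions made for this example, not decisions already made in this notebook.

```python
import pandas as pd

# Toy rows; site names and values are placeholders.
toy = pd.DataFrame({
    "site": ["SITE_A", "SITE_A", "SITE_B"],
    "year": [2014, 2014, 2015],
    "month": [8, 8, 7],
    "day": [8, 8, 16],
    "zone": ["INNER", "INNER", "MID"],
    "transect": ["1", "1", "2"],  # read as strings, as in the load step
    "depth": [4.6, 5.0, 8.8],
})

# pd.to_datetime can assemble dates from columns named year, month and day.
dates = pd.to_datetime(toy[["year", "month", "day"]])
toy["eventDate"] = dates.dt.strftime("%Y-%m-%d")

# Hypothetical eventID pattern: site_YYYYMMDD_zone_transect.
toy["eventID"] = (
    toy["site"] + "_" + dates.dt.strftime("%Y%m%d")
    + "_" + toy["zone"] + "_" + toy["transect"]
)

# One row per event, with the depth range observed within it.
event = (
    toy.groupby(["eventID", "eventDate", "site"], as_index=False)
       .agg(minimumDepthInMeters=("depth", "min"),
            maximumDepthInMeters=("depth", "max"))
       .rename(columns={"site": "locality"})
)
print(event)
```

Coordinates and the remaining fields would have to come from the site table and project metadata, which this sketch does not touch.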

The **occurrence** file should contain: eventID, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (for taxa with two possible species matches), occurrenceRemarks (any other necessary information, maybe disease), organismQuantity (count), organismQuantityType. **May try to include Cover data from UPC here, too, just with organismQuantityType as "percent cover."**
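
A possible shape for the occurrence records is sketched below on toy rows. The occurrenceID pattern (`eventID_occN`) and the `individuals` quantity type are assumptions for illustration; the scientific name fields would come from matching `classcode` against the taxon table and WoRMS, which is not shown here.

```python
import pandas as pd

# Toy rows; eventID values are placeholders.
toy = pd.DataFrame({
    "eventID": ["E1", "E1", "E2"],
    "classcode": ["APOCAL", "DERIMB", "DERIMB"],
    "count": [1, 2, 1],
})

occurrence = toy.rename(columns={"count": "organismQuantity"}).copy()

# Number the records within each event to build a hypothetical occurrenceID.
within_event = occurrence.groupby("eventID").cumcount() + 1
occurrence["occurrenceID"] = (
    occurrence["eventID"] + "_occ" + within_event.astype(str)
)

# Constant fields; the values chosen here are assumptions for the sketch.
occurrence["organismQuantityType"] = "individuals"
occurrence["occurrenceStatus"] = "present"
occurrence["basisOfRecord"] = "HumanObservation"
print(occurrence)
```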

The **MoF** file should contain: eventID, occurrenceID, measurementType, measurementValue, measurementUnit and measurementMethod. Substrate and Relief (pct_cov values) can be recorded at the event level. Size can be recorded at the occurrence level.
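
Recording size at the occurrence level could look roughly like the sketch below, which turns a `size` column into measurement rows. The IDs, the `body size` label, and the measurementMethod wording are placeholders chosen for illustration; only the unit (centimeters) comes from the column definitions.

```python
import pandas as pd

# Toy occurrence rows; IDs are placeholders, one size is missing on purpose.
occ = pd.DataFrame({
    "eventID": ["E1", "E1", "E2"],
    "occurrenceID": ["E1_occ1", "E1_occ2", "E2_occ1"],
    "size": [7.0, 12.0, None],
})

mof_columns = [
    "eventID", "occurrenceID", "measurementType",
    "measurementValue", "measurementUnit", "measurementMethod",
]

mof = (
    occ.dropna(subset=["size"])                       # skip records with no size
       .rename(columns={"size": "measurementValue"})
       .assign(
           measurementType="body size",               # placeholder label
           measurementUnit="centimeters",             # per the size definition
           measurementMethod="measured by diver",     # placeholder wording
       )[mof_columns]
)
print(mof)
```

Since the measured dimension differs by taxon (test diameter, arm length, shell length, and so on), measurementType may need to vary with `classcode` rather than being a single label.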

In [25]:
data.shape

(80977, 18)

In [56]:
out = data[data['location'].isin(['RANDOM', 'SIZE_TRANSECT'])]
print(out.shape)
out['method'].unique()

(20150, 18)


array(['SBTL_SIZEFREQ_PISCO', 'SBTL_SIZEFREQ_HSU', 'SBTL_SIZEFREQ_VRG'],
      dtype=object)

In [61]:
out.loc[out['location'] == 'SIZE_TRANSECT', 'transect'].unique()

array(['3', '4'], dtype=object)

NEXT: Trying to figure out how to define event/occurrence in this case. Should I include the transect value even though it doesn't make sense for location = RANDOM? Maybe I should be including method under samplingProtocol for all the PISCO data?