# CCFRP data: convert to DwC

**Reminder:** Still have to deal with the olive/yellowtail species name issue.

In [1]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates
import pytz # for handling time zones


In [54]:
## Load CCFRP count data

path = 'C:\\Users\\dianalg\\Documents\\Work\\MBARI\\MPA Data Integration\\CCFRP\\'
filename = 'Counts.csv'
data = pd.read_csv(path+filename)

data.head()

Unnamed: 0,Area,Site,Lat Center Point,Lon Center Point,Year,Barred Sand Bass,Bat Ray,Bigmouth Sole,Black-and-Yellow Rockfish,Black Rockfish,...,Vermilion Rockfish,White Croaker,White Seabass,Widow Rockfish,Wolf Eel,Yelloweye Rockfish,Yellowfin Croaker,Yellowtail Jack,Yellowtail Rockfish,Total
0,Trinidad,REF,41.115,-124.173,2018,0,0,0,0,708,...,2,0,0,0,0,0,0,0,22,898
1,Trinidad,REF,41.115,-124.173,2019,0,0,0,0,384,...,1,0,0,0,0,1,0,0,16,504
2,Cape Mendocino,MPA,40.426,-124.478,2017,0,0,0,0,113,...,9,0,0,0,0,3,0,0,4,229
3,Cape Mendocino,MPA,40.426,-124.478,2018,0,0,0,0,58,...,20,0,0,0,0,10,0,0,7,300
4,Cape Mendocino,MPA,40.426,-124.478,2019,0,0,0,0,52,...,15,0,0,0,0,6,0,0,10,234


In [55]:
## Load scientific names

path = 'C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\MPA data integration\\CCFRP\\'
filename = 'CCFRP_common_to_scientific.csv'
species = pd.read_csv(path+filename)

species.head()

Unnamed: 0,common_names,scientific_names
0,Bigmouth Sole,Hippoglossina stomata
1,Longfin Sanddab,Citharichthys xanthostigma
2,Pacific Halibut,Hippoglossus stenolepis
3,Pelagic Stingray,Pteroplatytrygon violacea
4,Northern Anchovy,Engraulis mordax


### Convert data to long format

In [56]:
## I don't think we want to include Total as a species, so drop it

data.drop('Total', axis=1, inplace=True)

In [57]:
## How many unique areas are there?

len(data['Area'].unique())

16

In [58]:
## How many years has each area been surveyed?

num_years = data.groupby(['Area', 'Site'])['Year'].count()
num_years

Area              Site
Anacapa Island    MPA      3
                  REF      3
Ano Nuevo         MPA     13
                  REF     13
Bodega Head       MPA      3
                  REF      3
Cape Mendocino    MPA      3
                  REF      3
Carrington Point  MPA      3
                  REF      3
Farallon Islands  MPA      2
                  REF      2
Laguna Beach      MPA      1
                  REF      1
Piedras Blancas   MPA     11
                  REF     11
Point Buchon      MPA     13
                  REF     13
Point Conception  MPA      1
                  REF      1
Point Lobos       MPA     13
                  REF     13
South La Jolla    MPA      3
                  REF      3
Stewarts Point    MPA      3
                  REF      3
Swamis            MPA      3
                  REF      3
Ten Mile          MPA      3
                  REF      3
Trinidad          REF      2
Name: Year, dtype: int64

In [59]:
## So how many rows should each species have after converting to long format?

sum(num_years)

158

In [60]:
## Melt data

data_long = pd.melt(data, id_vars=data.columns[0:5].tolist(), var_name='species_common_name', value_name='count')
data_long.head()

Unnamed: 0,Area,Site,Lat Center Point,Lon Center Point,Year,species_common_name,count
0,Trinidad,REF,41.115,-124.173,2018,Barred Sand Bass,0
1,Trinidad,REF,41.115,-124.173,2019,Barred Sand Bass,0
2,Cape Mendocino,MPA,40.426,-124.478,2017,Barred Sand Bass,0
3,Cape Mendocino,MPA,40.426,-124.478,2018,Barred Sand Bass,0
4,Cape Mendocino,MPA,40.426,-124.478,2019,Barred Sand Bass,0


In [61]:
## Check number of records per species

print(data_long[data_long['species_common_name'] == 'Barred Sand Bass'].shape)
print(data_long[data_long['species_common_name'] == 'Garibaldi'].shape)
print(data_long[data_long['species_common_name'] == 'Unknown'].shape)

(158, 7)
(158, 7)
(158, 7)
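
Spot-checking three species works, but the same check can be run over every species at once. The sketch below is a minimal illustration on a toy table, not the CCFRP data: the frame `wide`, its counts, and the names `expected_rows`, `rows_per_species` and `all_match` are all made up for the example.

```python
import pandas as pd

# Toy wide table standing in for Counts.csv (counts are made up for illustration)
wide = pd.DataFrame({
    'Area': ['Trinidad', 'Trinidad', 'Cape Mendocino'],
    'Site': ['REF', 'REF', 'MPA'],
    'Year': [2018, 2019, 2017],
    'Black Rockfish': [5, 3, 1],
    'Unknown': [0, 0, 2],
})

# One row per Area/Site/Year combination in the wide table
expected_rows = len(wide)

long = pd.melt(wide, id_vars=['Area', 'Site', 'Year'],
               var_name='species_common_name', value_name='count')

# Number of rows each species ended up with after melting
rows_per_species = long.groupby('species_common_name').size()

# True only if every species has exactly one row per survey
all_match = bool((rows_per_species == expected_rows).all())
print(all_match)
```

On the real data, `expected_rows` would presumably be the 158 computed above.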


### Join to obtain scientific names

In [62]:
## Merge

data_long = data_long.merge(species, how='left', left_on='species_common_name', right_on='common_names')
data_long.head()

Unnamed: 0,Area,Site,Lat Center Point,Lon Center Point,Year,species_common_name,count,common_names,scientific_names
0,Trinidad,REF,41.115,-124.173,2018,Barred Sand Bass,0,Barred Sand Bass,Paralabrax nebulifer
1,Trinidad,REF,41.115,-124.173,2019,Barred Sand Bass,0,Barred Sand Bass,Paralabrax nebulifer
2,Cape Mendocino,MPA,40.426,-124.478,2017,Barred Sand Bass,0,Barred Sand Bass,Paralabrax nebulifer
3,Cape Mendocino,MPA,40.426,-124.478,2018,Barred Sand Bass,0,Barred Sand Bass,Paralabrax nebulifer
4,Cape Mendocino,MPA,40.426,-124.478,2019,Barred Sand Bass,0,Barred Sand Bass,Paralabrax nebulifer
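
A left join can silently add rows if a common name appears more than once in the lookup table. One possible guard, sketched here on toy frames (the names `long`, `lookup`, `n_before` and `n_after` are illustrative, and the counts are placeholders), is the `validate` argument of `merge`, which raises an error when the right-hand keys are not unique.

```python
import pandas as pd

# Toy long-format counts (placeholder values)
long = pd.DataFrame({
    'species_common_name': ['Barred Sand Bass', 'Barred Sand Bass', 'Unknown'],
    'count': [0, 0, 0],
})

# Toy lookup table: one row per common name
lookup = pd.DataFrame({
    'common_names': ['Barred Sand Bass', 'Bigmouth Sole'],
    'scientific_names': ['Paralabrax nebulifer', 'Hippoglossina stomata'],
})

n_before = len(long)

# validate='many_to_one' raises pandas.errors.MergeError if any
# common name is duplicated in the lookup table
merged = long.merge(lookup, how='left',
                    left_on='species_common_name', right_on='common_names',
                    validate='many_to_one')

n_after = len(merged)
print(n_before, n_after)
```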


In [70]:
## Double check that only Unknown species have missing scientific_names

# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 60)
data_long[data_long['scientific_names'].isnull()]

Unnamed: 0,Area,Site,Lat Center Point,Lon Center Point,Year,species_common_name,count,common_names,scientific_names
12482,Trinidad,REF,41.115,-124.173,2018,Unknown,0,Unknown,
12483,Trinidad,REF,41.115,-124.173,2019,Unknown,0,Unknown,
12484,Cape Mendocino,MPA,40.426,-124.478,2017,Unknown,0,Unknown,
12485,Cape Mendocino,MPA,40.426,-124.478,2018,Unknown,0,Unknown,
12486,Cape Mendocino,MPA,40.426,-124.478,2019,Unknown,0,Unknown,
...,...,...,...,...,...,...,...,...,...
12635,South La Jolla,MPA,32.815,-117.298,2018,Unknown,0,Unknown,
12636,South La Jolla,MPA,32.815,-117.298,2019,Unknown,0,Unknown,
12637,South La Jolla,REF,32.839,-117.302,2017,Unknown,0,Unknown,
12638,South La Jolla,REF,32.839,-117.302,2018,Unknown,0,Unknown,


**Question:** Should we include observations of unknown species in this data set? My gut reaction is no; those are important data in some contexts, but not very helpful in a presence/absence data set.
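
If the answer turns out to be no, the check and the drop could look something like the sketch below. It runs on a toy frame, not the real `data_long`, and the names `merged`, `unmatched_names`, `only_unknown_missing` and `known_only` are illustrative. Whether to drop these rows at all is still the open question above.

```python
import pandas as pd

# Toy stand-in for the merged long table (placeholder counts)
merged = pd.DataFrame({
    'species_common_name': ['Barred Sand Bass', 'Unknown', 'Unknown'],
    'count': [0, 0, 0],
    'scientific_names': ['Paralabrax nebulifer', None, None],
})

# Rows where the join found no scientific name
missing = merged['scientific_names'].isnull()

# Which common names failed to match? Ideally only 'Unknown'.
unmatched_names = set(merged.loc[missing, 'species_common_name'])
only_unknown_missing = unmatched_names <= {'Unknown'}
print(unmatched_names, only_unknown_missing)

# Assumption: unknown-species rows are excluded from the output
known_only = merged[~missing].reset_index(drop=True)
```

Looking at the set of unmatched names avoids scrolling through the full table, and would also surface any misspelled common name that failed to join.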

In [71]:
## Drop unnecessary columns

data_long.drop(['species_common_name', 'common_names'], axis=1, inplace=True)
data_long.head()

Unnamed: 0,Area,Site,Lat Center Point,Lon Center Point,Year,count,scientific_names
0,Trinidad,REF,41.115,-124.173,2018,0,Paralabrax nebulifer
1,Trinidad,REF,41.115,-124.173,2019,0,Paralabrax nebulifer
2,Cape Mendocino,MPA,40.426,-124.478,2017,0,Paralabrax nebulifer
3,Cape Mendocino,MPA,40.426,-124.478,2018,0,Paralabrax nebulifer
4,Cape Mendocino,MPA,40.426,-124.478,2019,0,Paralabrax nebulifer


### Conversion terms

**eventID** - Need to create this. Perhaps Area_Site_Year (e.g. Trinidad_REF_2018) <br>
**year** - year. <span style="color:red">Is eventDate required?</span><br>
**habitat** - <span style="color:red">Perhaps this is a good heading for reference versus MPA information?</span> <br>
**location** - <span style="color:red">Is there anything in Location that would be a good fit for area data?</span> <br>
**decimalLatitude, decimalLongitude** - Lat Center Point, Lon Center Point. <span style="color:red">Is there some way to give the corners of the grid as well? Or indicate that this is the center of a larger area over which the result is cumulative? Perhaps something associated with sampling? Should a **samplingProtocol** be linked?</span><br>
**occurrenceID** - Need to create this. <span style="color:red">Ideas? Can it just be a number increasing from 1 to the number of occurrences?</span> <br>
**scientificName** - scientific_names <br>
**scientificNameID** - WoRMS ID <br>
**taxonID** - WoRMS taxon ID <br>
**nameAccordingTo** - WoRMS <br>
**occurrenceStatus** - present <br>
**basisOfRecord** - HumanObservation <br>
**individualCount** - count <br>
**organismQuantity, organismQuantityType** - <span style="color:red">Do we want to include CPUE this way? Can join this data set with CPUE data. **Actually, CPUE isn't really an "organism quantity." Perhaps there's a better option under MeasurementOrFact?**</span>
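
To make a few of these terms concrete, here is a rough sketch of how eventID, occurrenceID and occurrenceStatus might be built. It uses a toy frame with made-up counts, and several choices in it are assumptions rather than decisions: replacing spaces in area names with underscores, building occurrenceID from the eventID plus a running number, and deriving occurrenceStatus from whether the count is above zero (using the `present`/`absent` values that Darwin Core recommends for that term).

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_long (counts are made up for illustration)
df = pd.DataFrame({
    'Area': ['Trinidad', 'Trinidad', 'Cape Mendocino'],
    'Site': ['REF', 'REF', 'MPA'],
    'Year': [2018, 2019, 2017],
    'count': [0, 3, 1],
    'scientific_names': ['Paralabrax nebulifer'] * 3,
})

# eventID as Area_Site_Year; replacing spaces is an assumption
event_id = (df['Area'].str.replace(' ', '_', regex=False)
            + '_' + df['Site'] + '_' + df['Year'].astype(str))

# Running number from 1 to the number of occurrences (hypothetical scheme)
running = pd.Series(range(1, len(df) + 1), index=df.index).astype(str)

dwc = pd.DataFrame({
    'eventID': event_id,
    'occurrenceID': event_id + '_' + running,
    'scientificName': df['scientific_names'],
    'individualCount': df['count'],
    # Assumption: a zero count is recorded as an absence
    'occurrenceStatus': np.where(df['count'] > 0, 'present', 'absent'),
})

print(dwc)
```

Prefixing the running number with the eventID keeps each occurrenceID traceable to its survey, but a plain counter would also be unique within this data set.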

<span style="color:red">**Do we want to include some kind of attribution, like institutionCode? Or will that be clear enough in the data submission?**</span>

**Where are all the places the MPA data will ultimately be submitted, anyway?**
