# Using EcoFOCIpy to process raw field data

## CTD / Profile Data

Basic workflow for each instrument grouping is *(initial archive level)*:
- Parse data from raw files into pandas dataframe
- output initial files (pandas->csv) **ERDDAP NRT** when no meta data is added
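
As a minimal sketch of the pandas->csv step above, the block below writes a small dataframe to CSV text. The two-sample profile and its values are hypothetical stand-ins for a parsed raw file, and the in-memory buffer is only used so the example has no side effects.

```python
import io

import pandas as pd

# Hypothetical two-sample profile standing in for a parsed raw file
profile = pd.DataFrame(
    {"t090C": [3.6086, 3.6097], "sal00": [32.0680, 32.0681]},
    index=pd.Index([1.0, 2.0], name="pressure"),
)

# Write to an in-memory buffer here; pass a file path instead to write to disk
buffer = io.StringIO()
profile.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

No metadata is attached at this level; the CSV carries only the index and the raw column names.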

Convert to xarray dataset for all following work *(working or final data level)*:
- TODO: Add metadata from cruise yaml files and/or header info
- ingest metadata from cruise / cast logs
- process data beyond simple file translate
- apply any calibrations or corrections
    + field corrections
    + offsets
    + instrument compensations
    + some QC where available... this would be old-school simple bounds mostly
- adjust time bounds and sample frequency (xarray dataset)
- save as CF netcdf via xarray (many of the steps above are optional):
    + **ERDDAP NRT** if no corrections, offsets or time bounds are applied but some meta data is
    + **Working and awaiting QC** has no ERDDAP representation and is a holding spot
    + **ERDDAP Final** fully calibrated, qc'd and populated with meta information
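
The offset and simple-bounds steps in the list above can be sketched with plain pandas. This is an illustration under stated assumptions, not the EcoFOCIpy implementation: the function name `apply_offset_and_bounds`, the offset value, and the bounds are all hypothetical.

```python
import pandas as pd

# Hypothetical values for illustration only (not from a real calibration sheet)
TEMP_OFFSET = 0.002          # degrees C, added to every sample
TEMP_BOUNDS = (-2.0, 35.0)   # accepted temperature range, degrees C


def apply_offset_and_bounds(df, column, offset, bounds):
    """Return a copy with `column` shifted by `offset` plus a boolean QC column."""
    out = df.copy()
    out[column] = out[column] + offset
    lower, upper = bounds
    # True marks a sample that falls outside the accepted range
    out[column + "_qc_fail"] = ~out[column].between(lower, upper)
    return out


profile = pd.DataFrame(
    {"temperature": [3.6086, 3.6097, 99.0]},
    index=pd.Index([1.0, 2.0, 3.0], name="pressure"),
)
checked = apply_offset_and_bounds(profile, "temperature", TEMP_OFFSET, TEMP_BOUNDS)
print(checked)
```

Flagging rather than deleting out-of-range samples keeps the original record intact for later manual QC.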

Plot for preview and QC
- preview images (indiv and/or collectively)
- manual qc process
- automated qc process ML/AI

Further refinements for ERDDAP hosting:


## Example below is for SBE 9/11+ V2 but the workflow is similar for any SBE instruments.

Future processing of this instrument can be a simplified (no markdown) process which can be archived so that the procedure can be traced or updated

In [1]:
import yaml
import glob

import EcoFOCIpy.io.sbe_ctd_parser as sbe_ctd_parser #<- instrument specific
import EcoFOCIpy.io.ncCFsave as ncCFsave
import EcoFOCIpy.metaconfig.load_config as load_config

The sample_data_dir should be included in the github package but may not be included in the pip install of the package

## Simple Processing - first step

In [2]:
sample_data_dir = '../'

In [3]:
###############################################################
# edit to point to {cruise specific} raw datafiles
datafile = sample_data_dir+'staticdata/example_data/profile_data/' #<- point to cruise and process all files within
cruise_name = 'd'
cruise_meta_file = sample_data_dir+'staticdata/cruise_example.yaml'
inst_meta_file = sample_data_dir+'staticdata/instr_metaconfig/sbe16_cf.yaml'
inst_shortname = ''
###############################################################

#init and load data
cruise = sbe_ctd_parser.sbe9_11p()
filename_list = sorted(glob.glob(datafile + '*.cnv'))

(cruise_data,cruise_header) = cruise.parse(filename_list)

There are 6 files in the example folder ('ctd001.cnv', 'ctd002.cnv', etc.).  The routine reads every .cnv file in the specified directory and names each entry by splitting the path on "/".  This will work for all CTD 'cnv' files.
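
The naming rule can be pictured with a few lines of standard-library Python. This is only a sketch of the idea, assuming the key is the text after the last "/"; the helper `key_from_path` is hypothetical and is not the parser's own code.

```python
def key_from_path(path):
    """Use the text after the last '/' as the dictionary key."""
    return path.split("/")[-1]


# Hypothetical paths that mirror the example folder layout
filename_list = sorted([
    "../staticdata/example_data/profile_data/ctd002.cnv",
    "../staticdata/example_data/profile_data/ctd001.cnv",
])

keys = [key_from_path(p) for p in filename_list]
print(keys)
```

Splitting on "/" assumes POSIX-style paths; `os.path.basename` would be the portable alternative on systems that use a different separator.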

In [4]:
#example of a single profile
cruise_data['ctd001.cnv']

Pressure [dbar],c0mS/cm,c1mS/cm,flECO-AFL,sbeox0V,t090C,t190C,timeS,sbeox1V,par,turbWETntu0,...,sigma-t00,sigma-t11,sbeox0ML/L,sbox0Mm/Kg,sbeox0PS,sbeox1ML/L,sbox1Mm/Kg,sbeox1PS,nbin,flag
1.0,29.753597,29.757207,0.4821,2.5202,3.6086,3.6106,112.124,2.5944,304.01000,0.6675,...,25.4895,25.4912,7.3258,319.039,98.191,7.2834,317.191,97.628,15.0,False
2.0,29.755165,29.756555,0.5064,2.5197,3.6097,3.6089,113.162,2.5965,231.14000,0.6915,...,25.4896,25.4916,7.3273,319.103,98.213,7.2978,317.820,97.818,23.0,False
3.0,29.757821,29.759602,0.4824,2.5203,3.6125,3.6125,116.405,2.5987,172.83000,0.6696,...,25.4893,25.4910,7.3337,319.381,98.305,7.3013,317.971,97.873,35.0,False
4.0,29.765491,29.766984,0.5234,2.5208,3.6211,3.6207,149.648,2.6022,136.68000,0.6786,...,25.4886,25.4904,7.3304,319.238,98.282,7.3011,317.964,97.891,37.0,False
5.0,29.764901,29.766794,0.5624,2.5207,3.6198,3.6198,167.871,2.6045,112.94000,0.6606,...,25.4887,25.4905,7.3354,319.457,98.346,7.3068,318.212,97.965,36.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63.0,29.781214,29.783378,0.7888,2.4997,3.6091,3.6094,503.219,2.5851,0.20995,1.2531,...,25.4896,25.4914,7.3121,318.442,98.009,7.3056,318.157,97.923,35.0,False
64.0,29.781581,29.783670,0.8303,2.5250,3.6090,3.6094,506.872,2.5860,0.20996,1.3506,...,25.4897,25.4913,7.4655,325.122,100.064,7.3082,318.270,97.958,53.0,False
65.0,29.782189,29.784315,0.7998,2.5467,3.6093,3.6096,511.657,2.5859,0.20993,1.2990,...,25.4896,25.4914,7.5336,328.086,100.977,7.3031,318.051,97.891,64.0,False
66.0,29.782807,29.784974,0.8836,2.5173,3.6096,3.6099,517.273,2.5847,0.20994,1.6337,...,25.4895,25.4913,7.3721,321.056,98.814,7.2956,317.724,97.791,38.0,False


In [5]:
cruise_header['ctd001.cnv']

KeyError: 'ctd001.cnv'

In [6]:
#preview a dataframe
cruise_data['ctd001.cnv'].describe()

Unnamed: 0,c0mS/cm,c1mS/cm,flECO-AFL,sbeox0V,t090C,t190C,timeS,sbeox1V,par,turbWETntu0,...,sal11,sigma-t00,sigma-t11,sbeox0ML/L,sbox0Mm/Kg,sbeox0PS,sbeox1ML/L,sbox1Mm/Kg,sbeox1PS,nbin
count,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0,...,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0
mean,29.769978,29.772047,0.789513,2.488628,3.610676,3.610854,341.44903,2.59619,22.77178,0.7787,...,32.068264,25.489587,25.491396,7.24503,315.520418,97.113104,7.308734,318.294388,97.968955,36.985075
std,0.006822,0.006822,0.089931,0.041992,0.004228,0.00412,92.84064,0.006939,54.862609,0.239176,...,0.000206,0.000442,0.000454,0.145682,6.34442,1.956144,0.00974,0.42384,0.135408,16.090174
min,29.753597,29.756555,0.4821,2.3835,3.6057,3.6059,112.124,2.5825,0.20983,0.6036,...,32.0679,25.4886,25.4902,6.878,299.536,92.19,7.2834,317.191,97.628,14.0
25%,29.765546,29.76735,0.78045,2.4788,3.6079,3.6081,287.5195,2.591,0.212345,0.64895,...,32.0681,25.4895,25.49125,7.236,315.1275,96.9875,7.3025,318.024,97.886,25.5
50%,29.768302,29.770389,0.8143,2.5041,3.609,3.6094,341.452,2.5952,1.2341,0.6851,...,32.0682,25.4897,25.4915,7.2973,317.799,97.818,7.3086,318.29,97.966,35.0
75%,29.77487,29.777149,0.83885,2.51775,3.612,3.6123,394.2645,2.6016,13.5115,0.77615,...,32.06835,25.4899,25.4917,7.328,319.135,98.227,7.31545,318.5865,98.056,44.5
max,29.783384,29.785494,0.8836,2.5467,3.6213,3.6215,543.732,2.61,304.01,1.6697,...,32.0688,25.4904,25.4923,7.5336,328.086,100.977,7.3321,319.312,98.302,95.0


## Time Properties

Not traditionally dealt with for CTD files as they are likely dynamically updated via GPS feed.

## Depth Properties and other assumptions

- currently, all processing and binning (1m for FOCI) is done via Sea-Bird routines and the Windows software.  This may change with the python ctd package for a few tasks.
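
For readers who want to see what 1 m binning means in practice, here is a rough pandas sketch. It is not the Sea-Bird algorithm and makes no claim to reproduce its output; the function `bin_average`, the column names, and the sample values are hypothetical, and bins are assumed to be centered on whole decibars.

```python
import numpy as np
import pandas as pd


def bin_average(df, pressure_col="pressure", bin_size=1.0):
    """Average all samples into pressure bins centered on multiples of bin_size."""
    # floor(p / size + 0.5) assigns each sample to its nearest bin center
    centers = np.floor(df[pressure_col] / bin_size + 0.5) * bin_size
    binned = df.drop(columns=pressure_col).groupby(centers).mean()
    binned.index.name = pressure_col
    # nbin records how many raw samples went into each bin
    binned["nbin"] = df.groupby(centers).size()
    return binned


# Hypothetical unbinned samples
raw = pd.DataFrame({
    "pressure": [0.6, 1.2, 1.4, 1.6, 2.3],
    "temperature": [3.60, 3.62, 3.64, 3.70, 3.72],
})
binned = bin_average(raw)
print(binned)
```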

## Add Deployment meta information

Two methods are available (if coming from the python2 world, ordereddict was important... in py38 a dictionary is inherently ordered)

In [7]:
#just a dictionary of dictionaries - simple
with open(cruise_meta_file) as file:
    cruise_config = yaml.full_load(file)

FileNotFoundError: [Errno 2] No such file or directory: '../staticdata/cruise_example.yaml'

In [11]:
#Generates an ordereddict but prints better for summary
#likely to be deprecated as an ordered dict may not be useful and drops a dependency if it's EOL
mooring_config_v2 = load_config.load_config(mooring_meta_file)

In [12]:
mooring_config['Instrumentation'][instrument]

{'InstType': 'SBE-16',
 'SerialNo': '7166',
 'DesignedDepth': 1.0,
 'ActualDepth': 0.0,
 'PreDeploymentNotes': 'UAF',
 'PostDeploymentNotes': '',
 'Deployed': 'y',
 'Recovered': 'y'}

## Add Instrument meta information

Time, depth, lat, lon should be added regardless (always our coordinates) but for a mooring site it's going to be a (1,1,1,t) dataset.
The variables of interest should be read from the data file and matched to a key for naming.  That key is in the inst_config file seen below and should represent common conversion names in the raw data
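
The matching of raw column names to config keys can be sketched as follows. This is a hedged illustration only: the two-entry `inst_config`, the `rename_map`, and the sample values are hypothetical, and the filtering shown here merely mimics the idea of dropping variables that have no entry in the yaml file.

```python
import pandas as pd

# Hypothetical subset of an instrument config: only the keys matter here
inst_config = {"temperature": {}, "salinity": {}}

# Raw header name -> key in the instrument config
rename_map = {"t090C": "temperature", "sal00": "salinity"}

raw = pd.DataFrame({"t090C": [3.6086], "sal00": [32.068], "nbin": [15.0]})

renamed = raw.rename(columns=rename_map)
# Keep only the columns that have an entry in the config
kept = renamed[[c for c in renamed.columns if c in inst_config]]
dropped = [c for c in renamed.columns if c not in inst_config]
print(list(kept.columns), dropped)
```

Listing the dropped names makes it easy to spot a raw variable that still needs an entry in the yaml file.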

In [13]:
with open(inst_meta_file) as file:
    inst_config = yaml.full_load(file)
inst_config

{'time': {'epic_key': 'TIM_601',
  'name': 'time',
  'generic_name': 'time',
  'standard_name': 'time',
  'long_name': 'date and time since reference time',
  'time_origin': '1900-01-01 00:00:00',
  'units': 'days since 1900-01-01T00:00:00Z'},
 'depth': {'epic_key': 'D_3',
  'generic_name': 'depth',
  'units': 'meter',
  'long_name': 'depth below surface (meters)',
  'standard_name': 'depth'},
 'latitude': {'epic_key': 'LON_501',
  'name': 'latitude',
  'generic_name': 'latitude',
  'units': 'degrees_north',
  'long_name': 'latitude',
  'standard_name': 'latitude'},
 'longitude': {'epic_key': 'LAT_500',
  'name': 'longitude',
  'generic_name': 'longitude',
  'units': 'degrees_east',
  'long_name': 'longitude',
  'standard_name': 'longitude'},
 'timeseries_id': {'cf_role': 'timeseries_id',
  'long_name': 'timeseries id',
  'standard_name': ''},
 'temperature': {'epic_key': 'T_20',
  'generic_name': 'temp',
  'long_name': 'Sea temperature in-situ ITS-90 scale',
  'standard_name': 'sea_wa

In [14]:
#sbe16 data uses header info to name variables... but we want standard names from the dictionary I've created, so we need to rename column variables appropriately
#rename values to appropriate names, if a value isn't in the .yaml file, you can add it
sbe16_wop_data = sbe16_wop_data.rename(columns={'tv290C':'temperature',
                        'sal00':'salinity',
                        'sbeox0Mm/Kg':'oxy_conc',
                        'sbeox0ML/L':'oxy_concM',
                        'sigma-È00':'sigma_theta',
                        'CStarAt0':'Attenuation',
                        'CStarTr0':'Transmittance',
                        'flECO-AFL':'chlor_fluorescence',
                        'empty':'empty', #this will be ignored
                        'flag':'flag'})
sbe16_wop_data.sample()

date_time,timeJV2,temperature,salinity,oxy_conc,oxy_concM,sigma_theta,Attenuation,Transmittance,chlor_fluorescence,flag
2016-09-11 15:00:00,129.750255,5.3333,31.7828,320.326,7.3525,25.0887,1.4648,69.3366,10.9974,0.0


In [14]:
# Add meta data and prelim processing based on meta data
# Convert to xarray and add meta information - save as CF netcdf file
# pass -> data, instmeta, depmeta
sbe16_wop_nc = ncCFsave.EcoFOCI_CFnc_moored(df=sbe16_wop_data, 
                                instrument_yaml=inst_config, 
                                mooring_yaml=mooring_config, 
                                instrument_id=instrument, 
                                inst_shortname=inst_shortname)
sbe16_wop_nc

<EcoFOCIpy.io.ncCFsave.EcoFOCI_CFnc_moored at 0x165d436d0>

At this point, you could save your file with the `.xarray2netcdf_save()` method and have a functioning dataset, but it would be very simple, with no additional qc, meta-data, or tuned parameters for optimizing software like ferret or erddap.

In [15]:
# expand the dimensions and coordinate variables
# renames them appropriately and prepares them for meta-filled values
sbe16_wop_nc.expand_dimensions()

In [16]:
#build list from columns in data - if a variable isn't in the yaml file, it will be dropped from the final data fields
sbe16_wop_nc.variable_meta_data(variable_keys=list(sbe16_wop_data.columns.values),drop_missing=True)
sbe16_wop_nc.temporal_geospatioal_meta_data(depth='designed')
#adding dimension meta needs to come after updating the dimension values... BUG?
sbe16_wop_nc.dimension_meta_data(variable_keys=['depth','latitude','longitude'])

The following steps can happen in just about any order and are all meta-data driven.  Therefore, they are not required to have a functioning dataset, but they are required to have a well described dataset

In [17]:
#add global attributes
sbe16_wop_nc.deployment_meta_add()
sbe16_wop_nc.get_xdf()

#add institutional global attributes
sbe16_wop_nc.institution_meta_add()

#add creation date/time - provenance data
sbe16_wop_nc.provinance_meta_add()

#provide initial qc status field
sbe16_wop_nc.qc_status(qc_status='unknown')


## Save CF Netcdf files

Currently we stick to netcdf3 classic, but migrating to netcdf4 (default) should cause no problems for most modern purposes.  It's easy enough to pass the `format` kwarg through to the netcdf api of xarray.

In [18]:
# combine trim (not mandatory) and filename together (saves to test.nc without name)
sbe16_wop_nc.xarray2netcdf_save(xdf = sbe16_wop_nc.autotrim_time(),
                           filename=sbe16_wop_nc.filename_const(),format="NETCDF3_CLASSIC")

# don't trim the data and pass your own filename
sbe16_wop_nc.xarray2netcdf_save(xdf = sbe16_wop_nc.get_xdf(),
                           filename=sbe16_wop_nc.filename_const(manual_label='test'),format="NETCDF4_CLASSIC")

In [19]:
sbe16_wop_nc.get_xdf()

## Next Steps

QC of data (plot parameters with other instruments)
- be sure to update the qc_status and the history