# Using EcoFOCIpy to process raw field data

## CTD / BTL Data

Basic workflow for each instrument grouping is *(initial archive level)*:
- SBE workflow must happen first
- Parse data from cnv files into pandas dataframe
- Output initial files (pandas -> csv): **ERDDAP NRT** when no metadata is added, Preliminary when no QC has been applied, Final after QC

Convert to an xarray Dataset for all following work *(working or final data level)*:
- Add metadata from cruise yaml files and/or header info
- ingest metadata from cruise / cast logs
- process data beyond simple file translate
- apply any calibrations or corrections
    + field corrections
    + offsets
    + instrument compensations
    + some QC where available (mostly old-school simple bounds checks)
- adjust time bounds and sample frequency (xarray Dataset)
- save as CF netCDF via xarray; because many of the steps above are optional, the result falls into one of these levels:
    + **ERDDAP NRT** if no corrections, offsets or time bounds are applied but some meta data is
    + **Working and awaiting QC** has no ERDDAP representation and is a holding spot
    + **ERDDAP Final** fully calibrated, qc'd and populated with meta information

Plot for preview and QC
- preview images (indiv and/or collectively)
- manual qc process
- automated qc process ML/AI

Further refinements for ERDDAP hosting:


## Example below is for SBE 9/11+ V2 but the workflow is similar for any SBE instruments.

Future processing of this instrument can be a simplified (no markdown) process which can be archived so that the procedure can be traced or updated

We process each cast as an individual file so this example will not loop over all profiles.  See `example/all_casts.py` example for processing an entire cruise at once.

Adding discrete samples (such as oxygen, chlorophyll, salinity) to BTL data is covered in `example/discrete_castdata.py`.  Its purpose is to match Niskin/bottle information to depth for the discrete data.

In [1]:
import yaml
import glob

import EcoFOCIpy.io.sbe_ctd_parser as sbe_ctd_parser #<- instrument specific
import EcoFOCIpy.io.ncCFsave as ncCFsave
import EcoFOCIpy.metaconfig.load_config as load_config

The sample_data_dir should be included in the GitHub package but may not be included in a pip install of the package.

## Simple Processing - first step

In [2]:
sample_data_dir = '../'

In [3]:
###############################################################
# edit to point to {cruise specific} raw datafiles 
datafile = sample_data_dir+'staticdata/example_data/profile_data/' #<- point to cruise and process all files within
cruise_name = 'DY1805' #no hyphens
cruise_meta_file = sample_data_dir+'staticdata/cruise_example.yaml'
inst_meta_file = sample_data_dir+'staticdata/instr_metaconfig/FOCI_standard_CTD.yaml'
group_meta_file = sample_data_dir+'staticdata/institutional_meta_example.yaml'
inst_shortname = ''
###############################################################

#init and load data
cruise = sbe_ctd_parser.sbe_btl()
filename_list = sorted(glob.glob(datafile + '*.btl'))

cruise_data = cruise.manual_parse(filename_list)

There are 6 files in the example folder ('ctd001.btl', 'ctd002.btl', etc.).  The routine reads every .btl file in the specified directory and names each resulting dataframe after its file, obtained by splitting the path on "/".  This works for all CTD 'btl' files.
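
As a rough sketch of the naming step described above (not the parser's actual implementation), the dictionary key for each cast can be derived by splitting the path on "/" and keeping the last element. The `cast_key` helper is hypothetical; the paths mirror the example directory layout.

```python
def cast_key(path):
    """Return the file-name portion of a '/'-separated path."""
    return path.split("/")[-1]


paths = sorted([
    "../staticdata/example_data/profile_data/ctd002.btl",
    "../staticdata/example_data/profile_data/ctd001.btl",
])
keys = [cast_key(p) for p in paths]
print(keys)
```

Splitting on "/" assumes POSIX-style paths; on Windows, `os.path.basename` or `pathlib.Path(path).name` would be the safer choice.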

In [4]:
#quick statistical look at the distribution of data for a cast
# preview a dataframe
cruise_data['ctd001.btl'].describe()

Unnamed: 0,sbeox0ml/l,sbeox0ps,sbox0mm/kg,sbeox1ml/l,sbeox1ps,sbox1mm/kg,sal00,sal11,sigma-t00,sigma-t11,fleco-afl,t090c,t190c,turbwetntu0,prdm,scan
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,7.3231,98.1716,318.9212,7.30787,97.9674,318.2575,32.06588,32.06813,25.48912,25.49089,0.80206,3.61483,3.61533,0.90477,33.5259,24406.1
std,0.141579,1.898755,6.166035,0.014719,0.207848,0.641301,7.9e-05,0.000106,0.000807,0.000817,0.113152,0.008306,0.008331,0.352494,21.967025,8129.420158
min,7.0205,94.102,305.743,7.2792,97.57,317.009,32.0658,32.0679,25.4874,25.4893,0.5261,3.6091,3.6094,0.6597,3.948,12686.0
25%,7.3202,98.12025,318.7945,7.3025,97.88425,318.02475,32.0658,32.0681,25.48895,25.490525,0.78545,3.609325,3.609625,0.676975,17.32975,18528.0
50%,7.32285,98.1595,318.91,7.3138,98.0545,318.5165,32.0659,32.0681,25.48945,25.4913,0.8375,3.61115,3.6115,0.7493,28.775,26144.0
75%,7.326875,98.23575,319.08625,7.318475,98.101,318.719,32.0659,32.0682,25.489675,25.4914,0.8619,3.616575,3.6183,0.9257,48.9415,29765.75
max,7.6205,102.145,331.874,7.3214,98.191,318.848,32.066,32.0683,25.4897,25.4915,0.9022,3.6322,3.6321,1.5855,65.415,35967.0


## Time Properties

Time is not traditionally adjusted for CTD files, as the timestamps are likely updated dynamically via a GPS feed.  However, FOCI tends to label the cast date/time with the ***at depth*** timestamp.
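
One way to picture the "at depth" convention is to take the timestamp of the deepest record in a cast. The sketch below assumes "at depth" means the sample with the greatest pressure; the `time_at_depth` helper and the sample values are made up for illustration.

```python
from datetime import datetime

# (timestamp, pressure in dbar) pairs; illustrative values only
samples = [
    (datetime(2018, 4, 30, 19, 21, 0), 3.9),
    (datetime(2018, 4, 30, 19, 25, 30), 65.4),
    (datetime(2018, 4, 30, 19, 30, 57), 27.8),
]


def time_at_depth(records):
    """Return the timestamp of the record with the greatest pressure."""
    timestamp, _pressure = max(records, key=lambda rec: rec[1])
    return timestamp


at_depth = time_at_depth(samples)
print(at_depth.isoformat())
```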

## Depth Properties and other assumptions

- Currently, all processing and binning (1 m for FOCI) is done via Sea-Bird routines and their Windows software.  A few tasks may move to the python ctd package.
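
For readers unfamiliar with bin averaging, here is a minimal sketch of what 1 m (or 1 dbar) binning does. It is not the Sea-Bird algorithm, which has more options; the `bin_average` helper is hypothetical and uses pressure as the binning variable.

```python
import math
from collections import defaultdict


def bin_average(pressures, values, bin_size=1.0):
    """Average values into bins centred on multiples of bin_size."""
    bins = defaultdict(list)
    for p, v in zip(pressures, values):
        # floor(x + 0.5) rounds halves up, avoiding Python's banker's rounding
        centre = math.floor(p / bin_size + 0.5) * bin_size
        bins[centre].append(v)
    return {centre: sum(vs) / len(vs) for centre, vs in sorted(bins.items())}


pressures = [0.6, 1.2, 1.4, 2.1]
temperatures = [3.0, 4.0, 5.0, 5.0]
binned = bin_average(pressures, temperatures)
print(binned)
```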

## Add Deployment meta information

In [5]:
#just a dictionary of dictionaries - simple
with open(cruise_meta_file) as file:
    cruise_config = yaml.full_load(file)
cruise_config[cruise_name]

{'CruiseID': 'DY1805',
 'CruiseID_Historic': None,
 'CruiseID_Alternates': None,
 'Project_Leg': '',
 'Vessel': 'R/V Oscar Dyson',
 'ShipID': 'DY',
 'StartDate': datetime.date(2018, 4, 29),
 'EndDate': datetime.date(2018, 5, 10),
 'Project': 'EcoFOCI',
 'ChiefScientist': 'Peter Proctor',
 'StartPort': 'Dutch Harbor, AK',
 'EndPort': 'Dutch Harbor, AK',
 'CruiseLocation': 'Bering Sea',
 'Description': 'FOCI Spring Mooring Survey',
 'CruiseYear': 2018,
 'ctdlogs_pdf_name': 'DY1805_CastLogs.pdf'}

In [6]:
#and if you want a cast from the cruise, just use the consecutive cast number
cruise_config['CTDCasts']['CTD001']

{'id': 22869,
 'Vessel': 'R/V Oscar Dyson',
 'CruiseID': 'DY1805',
 'Project_Leg': '',
 'UniqueCruiseID': 'DY1805',
 'Project': 'FOCI Spring Mooring Survey',
 'StationNo_altname': 's1h1',
 'ConsecutiveCastNo': 'CTD001',
 'LatitudeDeg': 56,
 'LatitudeMin': 52.28,
 'GeoLocation': None,
 'LongitudeDeg': 164,
 'LongitudeMin': 2.92,
 'GMTDay': 30,
 'GMTMonth': 'Apr',
 'GMTYear': 2018,
 'GMTTime': 69660,
 'DryBulb': 3.7,
 'RelativeHumidity': 98,
 'WetBulb': -99.9,
 'Pressure': 1013,
 'SeaState': '-99',
 'Visibility': '-99',
 'WindDir': 230,
 'WindSpd': 21.0,
 'CloudAmt': '-99',
 'CloudType': '-99',
 'Weather': '-99',
 'SurfaceTemp': -99.9,
 'BottomDepth': 65,
 'StationNameID': 'M2C',
 'MaxDepth': 72,
 'InstrumentSerialNos': 'Press SN =291, Pri Temp SN = 4379, Sec Temp SN =2376, Pri Cond SN = 04-2985, Sec Cond Sn =04-3127, PAR Sn =70547, Fluor Sn = FLNTUS-2057, pri O2 Sn = 1961, sec O2 Sn = 0904, Turbid SN = FLNTUS-2057',
 'Notes': 'Niskin 2 did not close - bottom cap\r\nNiskin 10 did not clo
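
The cast log stores position as degrees plus decimal minutes, so a conversion is needed before the values can serve as coordinates. The sketch below makes two assumptions that the record itself does not state: that the longitude is in degrees west (and so is negated to get degrees east), and that `GMTTime` is seconds since midnight UTC. The `to_decimal_degrees` helper is hypothetical.

```python
from datetime import timedelta


def to_decimal_degrees(degrees, minutes, sign=1):
    """Convert degrees + decimal minutes to signed decimal degrees."""
    return sign * (degrees + minutes / 60.0)


# Values copied from the CTD001 record shown above
cast = {"LatitudeDeg": 56, "LatitudeMin": 52.28,
        "LongitudeDeg": 164, "LongitudeMin": 2.92, "GMTTime": 69660}

lat = to_decimal_degrees(cast["LatitudeDeg"], cast["LatitudeMin"])
# Assumption: longitude is recorded in degrees west, so negate for degrees_east
lon = to_decimal_degrees(cast["LongitudeDeg"], cast["LongitudeMin"], sign=-1)

# Assumption: GMTTime is seconds since midnight UTC
time_of_day = timedelta(seconds=cast["GMTTime"])
print(round(lat, 4), round(lon, 4), time_of_day)
```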

## Add Instrument meta information

Time, depth, lat, lon should be added regardless (they are always our coordinates), but for a mooring site it's going to be a (1,1,1,t) dataset.
The variables of interest should be read from the data file and matched to a key for naming.  That key is in the inst_config file seen below and should represent common conversion names in the raw data.
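
Before renaming, it can help to check which columns will end up without a matching entry in the instrument config. The sketch below uses trimmed, hypothetical stand-ins for the rename map and the yaml contents; it only illustrates the matching idea, not how EcoFOCIpy handles unmatched variables.

```python
# Hypothetical, trimmed stand-ins for the rename map and the instrument yaml
rename_map = {
    "t090c": "temperature_ch1",
    "sal00": "salinity_ch1",
    "prdm": "Pressure [dbar]",
}
inst_config = {
    "temperature_ch1": {"standard_name": "sea_water_temperature"},
    "salinity_ch1": {"standard_name": "sea_water_practical_salinity"},
}

raw_columns = ["t090c", "sal00", "prdm", "scan"]

# Columns without a rename entry keep their raw name
renamed = [rename_map.get(col, col) for col in raw_columns]
# Anything not present in the config has no metadata to attach
missing_meta = [col for col in renamed if col not in inst_config]
print(missing_meta)
```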

In [7]:
with open(inst_meta_file) as file:
    inst_config = yaml.full_load(file)
inst_config

{'time': {'epic_key': 'TIM_601',
  'name': 'time',
  'generic_name': 'time',
  'standard_name': 'time',
  'long_name': 'date and time since reference time',
  'time_origin': '1900-01-01 00:00:00',
  'units': 'days since 1900-01-01T00:00:00Z'},
 'depth': {'epic_key': 'D_3',
  'generic_name': 'depth',
  'units': 'meter',
  'long_name': 'depth below surface (meters)',
  'standard_name': 'depth'},
 'latitude': {'epic_key': 'LON_501',
  'name': 'latitude',
  'generic_name': 'latitude',
  'units': 'degrees_north',
  'long_name': 'latitude',
  'standard_name': 'latitude'},
 'longitude': {'epic_key': 'LAT_500',
  'name': 'longitude',
  'generic_name': 'longitude',
  'units': 'degrees_east',
  'long_name': 'longitude',
  'standard_name': 'longitude'},
 'temperature_ch1': {'epic_key': 'T_28',
  'generic_name': 'temp channel 1',
  'long_name': 'Sea temperature in-situ ITS-90 scale',
  'standard_name': 'sea_water_temperature',
  'units': 'degree_C'},
 'temperature_ch2': {'epic_key': 'T2_35',
  'ge

In [26]:
#sbe data uses header info to name variables... but we want standard names from the dictionary I've created, so we need to rename column variables appropriately
#rename values to appropriate names, if a value isn't in the .yaml file, you can add it

#*** biggest *** difference between moored and profile data is there may be multiple instruments with the same datatype (e.g. temperature)
# on the same platform.  We _used_ to use the phrases primary and secondary, but will now only refer to them as ch1, ch2 etc
cruise_data['ctd001.btl'] = cruise_data['ctd001.btl'].rename(columns={
                        't090c':'temperature_ch1',
                        't190c':'temperature_ch2',
                        'sal00':'salinity_ch1',
                        'sal11':'salinity_ch2',
                        'sbox0mm/kg':'oxy_conc_ch1',
                        'sbeox0ml/l':'oxy_concM_ch1',
                        'sbox1mm/kg':'oxy_conc_ch2',
                        'sbeox1ml/l':'oxy_concM_ch2',
                        'sbeox0ps':'oxy_percentsat_ch1',
                        'sbeox1ps':'oxy_percentsat_ch2',
                        'sigma-t00':'sigma_t_ch1',
                        'sigma-t11':'sigma_t_ch2',
                        'cstarat0':'Attenuation',
                        'cstartr0':'Transmittance',
                        'fleco-afl':'chlor_fluorescence',
                        'turbwetntu0':'turbidity',
                        'empty':'empty', #this will be ignored
                        'prdm':'Pressure [dbar]',
                        'flag':'flag'})

cruise_data['ctd001.btl'].sample()

bottle,oxy_concM_ch1,oxy_percentsat_ch1,oxy_conc_ch1,oxy_concM_ch2,oxy_percentsat_ch2,oxy_conc_ch2,salinity_ch1,salinity_ch2,sigma_t_ch1,sigma_t_ch2,chlor_fluorescence,temperature_ch1,temperature_ch2,turbidity,Pressure [dbar],scan,datetime
7.0,7.3226,98.156,318.898,7.3181,98.095,318.701,32.066,32.0683,25.4895,25.4914,0.8962,3.6112,3.6114,0.763,27.823,27479.0,2018-04-30 19:30:57


In [27]:
cruise_data['ctd001.btl'].columns

Index(['oxy_concM_ch1', 'oxy_percentsat_ch1', 'oxy_conc_ch1', 'oxy_concM_ch2',
       'oxy_percentsat_ch2', 'oxy_conc_ch2', 'salinity_ch1', 'salinity_ch2',
       'sigma_t_ch1', 'sigma_t_ch2', 'chlor_fluorescence', 'temperature_ch1',
       'temperature_ch2', 'turbidity', 'Pressure [dbar]', 'scan', 'datetime'],
      dtype='object')

## Add institutional meta-information


In [28]:
with open(group_meta_file) as file:
    group_config = yaml.full_load(file)
group_config

{'source_documents': 'http://www.oceansites.org/docs/oceansites_data_format_reference_manual.pdf',
 'institution': 'Pacific Marine Environmental Lab (PMEL)',
 'project': 'EcoFOCI',
 'project_url': 'https://www.ecofoci.noaa.gov',
 'principal_investigator': 'Phyllis Stabeno',
 'principal_investigator_email': 'phyllis.stabeno (at) noaa.gov',
 'creator_name': 'Shaun Bell',
 'creator_email': 'shaun.bell (at) noaa.gov',
 'creator_institution': 'PMEL',
 'keywords': 'Mooring, Oceanographic',
 'comment': 'Provisional data',
 'sea_area': 'Bering Sea (BS)',
 'featureType': 'timeSeries',
 'conventions': '”CF-1.6, ~OceanSITES-1.5, ACDD-1.2”',
 'license': '',
 'references': '',
 'citation': '',
 'acknowledgement': ''}

In [29]:
# Add meta data and prelim processing based on meta data
# Convert to xarray and add meta information - save as CF netcdf file
# pass -> data, instmeta, depmeta
cruise_data_nc = ncCFsave.EcoFOCI_CFnc_profile(df=cruise_data['ctd001.btl'], 
                                instrument_yaml=inst_config, 
                                cruise_yaml=cruise_config)
cruise_data_nc

<EcoFOCIpy.io.ncCFsave.EcoFOCI_CFnc_profile at 0x15fc90940>

At this point, you could save your file with the `.xarray2netcdf_save()` method and have a functioning dataset, but it would be very simple, with no additional QC, metadata, or tuned parameters for optimizing software like Ferret or ERDDAP.

In [30]:
# expand the dimensions and coordinate variables
# renames them appropriately and prepares them for meta-filled values
cruise_data_nc.expand_dimensions(geophys_sort=False)

In [31]:
#build list from columns in data - if a variable isn't in the yaml file, it will be dropped from the final data fields
cruise_data_nc.variable_meta_data(variable_keys=list(cruise_data['ctd001.btl'].columns.values),drop_missing=False)
#adding dimension meta needs to come after updating the dimension values... BUG?
cruise_data_nc.dimension_meta_data(variable_keys=['depth','latitude','longitude'])

The following steps can happen in just about any order and are all metadata driven.  They are not required to have a functioning dataset, but they are required to have a well-described dataset.

In [32]:
#add global attributes
cruise_data_nc.deployment_meta_add(conscastno='CTD001')
cruise_data_nc.get_xdf()

#add institutional global attributes
cruise_data_nc.institution_meta_add(group_config)

#add creation date/time - provenance data
cruise_data_nc.provinance_meta_add()

#provide initial qc status field
cruise_data_nc.qc_status(qc_status='unknown')


## Save CF Netcdf files

Currently we stick to netCDF3 classic, but migrating to netCDF4 (the default) should cause no problems for most modern purposes.  It's easy enough to pass the `format` kwarg through to the netCDF API of xarray.

In [33]:
cast = 'CTD001'.split('D')[-1]
cruise_data_nc.xarray2netcdf_save(xdf = cruise_data_nc.get_xdf(),
                           filename=cruise_data_nc.filename_const(manual_label=cruise_name+'c'+cast.zfill(3)+'_ctd'),format="NETCDF3_CLASSIC")
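
The label passed to `filename_const` above is built by splitting the cast ID on 'D', which works for IDs like 'CTD001'. A slightly more defensive variant, shown here as a sketch with a hypothetical `cast_label` helper, pulls the trailing digits out with a regular expression instead.

```python
import re


def cast_label(cruise_name, conscastno, suffix="_ctd"):
    """Build a label such as 'DY1805c001_ctd' from a cast ID like 'CTD001'."""
    match = re.search(r"(\d+)$", conscastno)
    if match is None:
        raise ValueError(f"no trailing cast number in {conscastno!r}")
    return f"{cruise_name}c{match.group(1).zfill(3)}{suffix}"


label = cast_label("DY1805", "CTD001")
print(label)
```

This produces the same label as the cell above for 'CTD001' and also pads shorter cast numbers to three digits.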

In [34]:
cruise_data_nc.get_xdf()

## Next Steps

QC of data (plot parameters with other instruments)
- be sure to update the qc_status and the history
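
A minimal sketch of what updating the status and history might look like on a plain attribute dictionary. The `qc_status` key, the status values, and the `append_history` helper are assumptions for illustration; EcoFOCIpy may name or store these differently. The timestamped, newline-separated history line follows common CF practice.

```python
from datetime import datetime, timezone


def append_history(attrs, message, when=None):
    """Return a copy of attrs with a timestamped line appended to 'history'."""
    when = when or datetime.now(timezone.utc)
    entry = f"{when.strftime('%Y-%m-%dT%H:%M:%SZ')} {message}"
    updated = dict(attrs)
    previous = updated.get("history", "")
    updated["history"] = f"{previous}\n{entry}" if previous else entry
    return updated


attrs = {"qc_status": "unknown", "history": ""}  # hypothetical attribute names
attrs = append_history(
    attrs,
    "manual QC against bottle salinity",
    when=datetime(2018, 6, 1, 12, 0, tzinfo=timezone.utc),
)
attrs["qc_status"] = "final"  # hypothetical status value
print(attrs["history"])
```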