# Import and filter data

### Import modules

In [None]:
%pylab inline
import glidertools as gt
import xarray as xr # for file I/O
from cmocean import cm as cmo  # we use this for colormaps

## Working with Seaglider base station files

GliderTools supports loading Seaglider files, including `scicon` data (different sampling frequencies).  
There is a function that makes it easier to find variable names that you'd like to load: `gt.load.seaglider_show_variables`  

This function is demonstrated in the cell below.
The function accepts a **list of file names**. It can also receive a string with a wildcard placeholder (`*`); basic regular expressions are also supported. In the example below we use a simple asterisk placeholder to match all the files. 

Note that the function chooses only one file from the passed list or glob string, and the name of this file is shown. The returned table lists the variable name, dimensions, units and a brief comment if one is available. 

In [None]:
filenames = 'data/p*.nc'

gt.load.seaglider_show_variables(filenames)
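
To see which files a wildcard string would pick up before handing it to the loader, the pattern can be expanded by hand. This is a minimal sketch using Python's standard `glob` module on a throwaway directory; the file names are made up for illustration, and the matching that GliderTools performs internally may differ in detail.

```python
import glob
import os
import tempfile

# Build a temporary directory with hypothetical Seaglider-style file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["p5420001.nc", "p5420002.nc", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()

    # Same kind of pattern as 'data/p*.nc': any file starting with 'p'
    # and ending in '.nc' inside the directory.
    pattern = os.path.join(tmp, "p*.nc")
    matched = sorted(os.path.basename(f) for f in glob.glob(pattern))

print(matched)
```

Sorting the result keeps the files in dive order when the dive number is part of the file name.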

### Load variables

From the variable listing, one can choose multiple variables to load. Note that one only needs the variable name to load the data.

The `gt.load.seaglider_basestation_netCDFs` function is used to load a list of variables. It requires the filename string or list (as described above) and keys. It may be that these variables are not sampled at the same frequency. In this case, the loading function will load the sampling frequency dimensions separately. The function will try to find a time variable for each sampling frequency/dimension. 

#### Coordinates and automatic *time* fetching
All associated coordinate variables will also be loaded with the data if coordinates are documented. These may include *latitude, longitude, depth* and *time* (naming may vary). If time cannot be found for a dimension, a *time* variable from a different dimension with the same number of observations is used instead.

#### Merging data based on time
If `return_merged` is set to *True*, the function will merge every dimension that has an associated *time* variable. 

The function returns a dictionary of `xarray.Dataset` objects. xarray is a Python package for coordinate-indexed, multi-dimensional arrays, and it allows the original metadata to be copied with the data. We recommend reading the documentation (http://xarray.pydata.org/en/stable/), as this package is used throughout *GliderTools*. The dictionary keys are the names of the dimensions. If `return_merged` is set to *True*, an additional entry under the key `merged` is included.

The structure of a dimension output is shown below. Note that the merged data uses the largest dimension as the primary dataset, and the other data are merged onto that time index. Data are linearly interpolated to the nearest time measurement of the primary index, but only by one measurement, to ensure transparency.
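
The idea of merging a sparser sensor onto the primary time index can be pictured with a small, self-contained sketch. It is not the GliderTools implementation: it uses plain nearest-neighbour placement rather than linear interpolation, and the function name `align_nearest`, the `max_gap` tolerance and the sample values are all assumptions made for illustration.

```python
from bisect import bisect_left
import math


def align_nearest(primary_t, secondary_t, secondary_v, max_gap):
    """Place each secondary value on the nearest primary time stamp.

    Values further than `max_gap` from any primary time are dropped,
    and primary times without a match stay NaN.
    """
    out = [math.nan] * len(primary_t)
    for t, v in zip(secondary_t, secondary_v):
        i = bisect_left(primary_t, t)
        # The nearest primary time is the neighbour to the left or right.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(primary_t)]
        j = min(candidates, key=lambda k: abs(primary_t[k] - t))
        if abs(primary_t[j] - t) <= max_gap:
            out[j] = v
    return out


ctd_time = [0.0, 5.0, 10.0, 15.0, 20.0]  # primary (largest) dimension, seconds
par_time = [4.0, 16.0, 40.0]             # sparser sensor
par_vals = [1.5, 0.9, 0.2]

aligned = align_nearest(ctd_time, par_time, par_vals, max_gap=2.5)
print(aligned)
```

The last PAR sample lies far from any CTD time stamp, so it is not placed anywhere; limiting how far a value may travel is what keeps the merge transparent.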

#### Metadata handling
If the keyword argument `keep_global_attrs=True` is set, the global attributes of the original netCDF files (those that are the same across all files) are passed on to the output *Datasets*. The variable attributes (units, comments, axis...) are passed on by default, but this can also be set to False if not wanted. GliderTools functions automatically pass these attributes on to their outputs if an `xarray.DataArray` with attributes is given. 
All functions applied to data are also recorded under the variable attribute `processing`.

In [None]:
names = [
    'ctd_depth',
    'ctd_time',
    'ctd_pressure',
    'salinity',
    'temperature',
    'eng_wlbb2flvmt_Chlsig',
    'eng_wlbb2flvmt_wl470sig',
    'eng_wlbb2flvmt_wl700sig',
    'aanderaa4330_dissolved_oxygen',
    'eng_qsp_PARuV',
]

ds_dict = gt.load.seaglider_basestation_netCDFs(filenames, names, return_merged=True, keep_global_attrs=False)

In [None]:
# Here we drop the time variables imported for the PAR variable;
# we don't need these anymore. You might have to change this 
# depending on the dataset.
merged = ds_dict['merged']
if 'time' in merged:
    # drop_vars replaces the deprecated Dataset.drop for removing variables
    merged = merged.drop_vars(["time", "time_dt64"])


# To make it easier and clearer to work with, we rename the 
# original variables to something that makes more sense. This
# is done with the xarray.Dataset.rename({}) function.
# We only use the merged dataset as this contains all the 
# imported dimensions. 
# NOTE: The renaming has to be specific to the dataset otherwise an error will occur
dat = merged.rename({
    'salinity': 'salt_raw',
    'temperature': 'temp_raw',
    'ctd_pressure': 'pressure',
    'ctd_depth': 'depth',
    'ctd_time_dt64': 'time',
    'ctd_time': 'time_raw',
    'eng_wlbb2flvmt_wl700sig': 'bb700_raw',
    'eng_wlbb2flvmt_wl470sig': 'bb470_raw',
    'eng_wlbb2flvmt_Chlsig': 'flr_raw',
    'eng_qsp_PARuV': 'par_raw',
    'aanderaa4330_dissolved_oxygen': 'oxy_raw',
})

print(dat)
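
Because renaming fails when a listed variable is missing from the dataset, one option is to filter the mapping down to the names that are actually present first. The sketch below shows only that filtering step on plain Python containers; the helper name `existing_renames` and the example names in `available` are hypothetical.

```python
def existing_renames(available, mapping):
    """Keep only the rename pairs whose original name is available."""
    return {old: new for old, new in mapping.items() if old in available}


# Hypothetical set of variable names found in a loaded dataset.
available = {"salinity", "temperature", "ctd_depth"}

wanted = {
    "salinity": "salt_raw",
    "temperature": "temp_raw",
    "eng_qsp_PARuV": "par_raw",  # not present in this example
}

safe = existing_renames(available, wanted)
print(safe)
```

With an xarray dataset, `available` could be built from `set(merged.variables)` and the filtered mapping passed to `merged.rename(safe)`.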



In [None]:
# variable assignment for convenient access
depth = dat.depth
dives = dat.dives
lats = dat.latitude
lons = dat.longitude
time = dat.time
pres = dat.pressure
temp = dat.temp_raw
salt = dat.salt_raw

# name coordinates for quicker plotting
x = dat.dives
y = dat.depth

GliderTools has built-in plotting routines for data visualisation.

In [None]:
gt.plot(x, y, salt, cmap=cmo.haline, robust=True, shading='auto')
title('Original Data')


# Cleaning

The `cleaning` module contains several tools that help to remove erroneous data, whether whole profiles or individual points. 
These filters can be applied *globally* (IQR and standard deviation limits), *vertically* (running average filters) or *horizontally* (horizontal filters on gridded data only). 

Below we use **salinity** to demonstrate the different functions available to users.

## Global filtering: outlier limits 
These functions find upper and lower limits for data outliers using standard deviations of the entire dataset. Multipliers can be set to make the filters more or less strict. Alternatively, data can be filtered by interquartile range using `outlier_bounds_iqr`.

In [None]:
salt_std = gt.cleaning.outlier_bounds_std(salt, multiplier=2)
gt.plot(x, y, salt_std, cmap=cmo.haline, robust=True, shading='flat')
title('Outlier Bounds Stdev Method')
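
The two global limits can be written out in a few lines to make the role of the multiplier concrete. This is a minimal sketch in plain Python, assuming the bounds are mean ± multiplier × standard deviation and quartiles ± multiplier × IQR, with values outside the bounds replaced by NaN; the helper names and the sample values are made up, and GliderTools' own functions may differ in detail.

```python
import math
import statistics


def bounds_std(values, multiplier=3):
    """Lower and upper limit as mean -/+ multiplier * standard deviation."""
    clean = [v for v in values if not math.isnan(v)]
    mean = statistics.fmean(clean)
    std = statistics.pstdev(clean)
    return mean - multiplier * std, mean + multiplier * std


def bounds_iqr(values, multiplier=1.5):
    """Lower and upper limit as Q1 - m * IQR and Q3 + m * IQR."""
    clean = [v for v in values if not math.isnan(v)]
    q1, _, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


def mask_outside(values, lower, upper):
    """Replace values outside [lower, upper] with NaN."""
    return [v if lower <= v <= upper else math.nan for v in values]


# Made-up salinity values with one obvious outlier at the end.
salinity = [34.1, 34.2, 34.15, 34.18, 34.12, 34.2, 34.16, 39.0]

cleaned_std = mask_outside(salinity, *bounds_std(salinity, multiplier=2))
cleaned_iqr = mask_outside(salinity, *bounds_iqr(salinity, multiplier=1.5))
print(cleaned_std)
print(cleaned_iqr)
```

A single extreme value inflates the standard deviation, so the standard deviation limits are less strict on this sample than the IQR limits, which depend only on the central half of the data.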


#### Despiking
This approach was used by Briggs et al. (2010). The idea is to apply a rolling filter to the data (along the time dimension). The filtered result forms the baseline, and the differences between the original data and the baseline are the spikes. 

There are two rolling filters that can be applied to the data. The *median* approach is the equivalent of a rolling median. The *minmax* approach first applies a rolling minimum and then rolling maximum to data. This is useful particularly for optics data where spikes are particles in the water column and are not normally distributed. 

Here we use the median filter for salinity. Custom rolling filters can be created and applied with the `gt.cleaning.rolling_window` function.

In [None]:
salt_base, salt_spike = gt.cleaning.despike(salt, window_size=5, spike_method='median')

fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, salt_base, cmap=cmo.haline, ax=ax[0])
ax[0].set_title('Despiked using median filter')
ax[0].cb.set_label('Salinity despiked')
ax[0].set_xlabel('')

gt.plot(x, y, salt_spike, cmap=cm.RdBu_r, vmin=-6e-3, vmax=6e-3, ax=ax[1])
ax[1].cb.set_label('Salinity spikes')
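
The baseline and spike split can be illustrated on a short list of numbers. The sketch below is an assumption-laden stand-in rather than the GliderTools code: it uses a centred window that is simply truncated at the edges, does not handle NaNs, and the helper names `rolling` and `despike_sketch` are hypothetical.

```python
import statistics


def rolling(values, window, func):
    """Apply func over a centred window, truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        out.append(func(values[start:stop]))
    return out


def despike_sketch(values, window=5, method="median"):
    """Return (baseline, spikes) where spikes = values - baseline."""
    if method == "median":
        baseline = rolling(values, window, statistics.median)
    elif method == "minmax":
        # Rolling minimum first, then a rolling maximum of that result.
        baseline = rolling(rolling(values, window, min), window, max)
    else:
        raise ValueError("method must be 'median' or 'minmax'")
    spikes = [v - b for v, b in zip(values, baseline)]
    return baseline, spikes


signal = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0]  # one spike in flat data

base_med, spikes_med = despike_sketch(signal, window=5, method="median")
base_mm, spikes_mm = despike_sketch(signal, window=5, method="minmax")
print(base_med, spikes_med)
```

On this flat signal both methods return the same baseline. They differ on real data: the minmax baseline follows the lower envelope of the signal, so positive spikes such as particles in optics data end up entirely in the spike component.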

### Savitzky-Golay 
The Savitzky-Golay function fits a low-order polynomial to a rolling window of the time series. This has the result of smoothing the data. A larger window with a lower-order polynomial will give a smoother fit.

We recommend a 2nd order kernel. Here we use first order to show that the difference can be quite big.

In [None]:
salt_savgol = gt.cleaning.savitzky_golay(salt, window_size=11, order=1)
fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, salt_savgol, cmap=cmo.haline, ax=ax[0])
ax[0].set_title('Smoothing the data with Savitzky-Golay')
ax[0].cb.set_label('Smoothed salinity')
ax[0].set_xlabel('')

gt.plot(x, y, salt_savgol - salt, cmap=cm.RdBu, vmin=-6e-3, vmax=6e-3, ax=ax[1])
ax[1].cb.set_label('Difference from original');
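
The effect of the polynomial order can be checked on a curve with a known shape. For a 5-point window, a first order Savitzky-Golay fit evaluated at the window centre reduces to a plain moving average, while a second order fit uses the well-known weights (-3, 12, 17, 12, -3)/35. The sketch below applies both to a parabola; the helper name `smooth` is hypothetical, edge points are left untouched for simplicity, and the GliderTools routine may treat edges and NaNs differently.

```python
def smooth(values, weights):
    """Weighted rolling sum over interior points; edges are left as-is."""
    half = len(weights) // 2
    out = list(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        out[i] = sum(w * v for w, v in zip(weights, window))
    return out


# 5-point Savitzky-Golay weights evaluated at the window centre.
ORDER1 = [1 / 5] * 5                             # linear fit = moving average
ORDER2 = [w / 35 for w in (-3, 12, 17, 12, -3)]  # quadratic fit

curve = [float(x * x) for x in range(-4, 5)]     # a simple parabola

smooth1 = smooth(curve, ORDER1)
smooth2 = smooth(curve, ORDER2)

for original, s1, s2 in zip(curve, smooth1, smooth2):
    print(original, round(s1, 3), round(s2, 3))
```

The second order filter reproduces the parabola at the interior points, whereas the first order filter offsets it by a constant. This is why a first order kernel smooths more aggressively and can differ noticeably from the original data wherever the profile is curved.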


Several filtering steps can be applied with a single function call using the `gt.calc_physics` wrapper.

In [None]:
salt_qc = gt.calc_physics(salt, x, y, 
                          mask_frac=0.2, iqr=2.5, 
                          spike_window=5, spike_method='median', 
                          savitzky_golay_window=11, savitzky_golay_order=2)
        
fig, ax = plt.subplots(3, 1, figsize=[9, 8.5], sharex=True, dpi=90)

gt.plot(x, y, salt, cmap=cmo.haline, ax=ax[0])
gt.plot(x, y, salt_qc, cmap=cmo.haline, ax=ax[1])
gt.plot(x, y, salt_qc - salt, cmap=cm.RdBu_r, vmin=-0.02, vmax=0.02, ax=ax[2])

[a.set_xlabel('') for a in ax]

ax[0].cb.set_label('Original Data')
ax[1].cb.set_label('Cleaned Data')
ax[2].cb.set_label('Difference from Original')

plt.show()

### Effect of cleaning on a single data profile

In [None]:
fig, ax = subplots(1, 3, figsize=[10, 5], dpi=90)
fig.subplots_adjust(wspace=0.3)

dive_no = 310.5

idx = dat.dives==dive_no
colors = rcParams['axes.prop_cycle'].by_key()['color']

for i in range(2):
    ax[i].plot(salt[idx],         y[idx], c=colors[0], label='Raw', lw=4)
    # salt_base was computed above with window_size=5
    ax[i].plot(salt_base[idx],    y[idx], c=colors[3], label='Despike window = 5')
    ax[i].plot(salt_savgol[idx],  y[idx], c='k', label='Savitzky-Golay')
    
ax[2].barh(y[idx], salt_savgol[idx]  - salt[idx], zorder=100, facecolor='k')
ax[2].barh(y[idx], salt_base[idx]    - salt[idx], zorder=100, facecolor=colors[3])

ax[0].set_xlim(34, 34.4)
ax[1].set_xlim(34, 34.2)
ax[0].legend(loc=4)

ymin, ymax= 0, 100
ax[0].fill_between([33, 35], [ymin, ymin], [ymax, ymax], facecolor='k', alpha=0.2)
ax[0].set_ylim(500, 0)
ax[1].set_ylim(ymax, ymin)
ax[0].set_ylabel('Depth [m]', labelpad=15)
ax[0].set_xlabel('Salinity', labelpad=15)
ax[1].set_xlabel('Salinity', labelpad=15)
ax[2].set_xlabel(r'$\Delta$Salinity', labelpad=15)  # raw string avoids an invalid escape
ax[1].set_title('Profile ' + str(dive_no))

ax[2].set_ylim(ymax, ymin)
ax[2].set_xlim(-0.01, 0.01)
[a.grid(c='0.75', ls='--') for a in ax]

plt.show()

### Assigning data

New variables can be added to the dataset.

In [None]:
temp_qc = gt.calc_physics(temp, x, y, 
                          mask_frac=0.2, iqr=2.5, 
                          spike_window=5, spike_method='median', 
                          savitzky_golay_window=11, savitzky_golay_order=2)

In [None]:
dat['salt_qc'] = salt_qc
dat['temp_qc'] = temp_qc

### Saving data

You can save the `dat` object using [xarray's netCDF read/write](https://xarray.pydata.org/en/stable/io.html).

In [None]:
dat.to_netcdf('physics_processed.nc')

This netCDF file can be read back using `xr.open_dataset`.

In [None]:
dat_from_file = xr.open_dataset('physics_processed.nc')

In [None]:
dat_from_file