# Basic CTD Processing Pipeline Demo

This notebook demonstrates basic pipeline functionality, using CTD processing as the example. Glider data is very rarely collected without CTD measurements, so this processing can be applied to data from almost any glider.

The data used here can be found at the [BODC BIO-Carbon Deployment Catalogue](https://platforms.bodc.ac.uk/deployment-catalogue/BIO-Carbon/) under [Nelson (unit_397) OG NetCDF](https://linkedsystems.uk/erddap/files/Public_OG1_Data_001/Nelson_20240528/Nelson_646_R.nc). Context for the deployment can be found on the download site; for further detail, see the [NOC BIO-Carbon Project Page](https://noc.ac.uk/projects/bio-carbon).

Once you have downloaded the data, place it in the **examples/data/OG1** folder. Alternatively, edit the config for this notebook (**examples/configs/example_config_nelson.yaml**) so that "file_path" in the "Load OG1" step points to where your data is stored.

You can also run the next cell, which checks for the data file and downloads it if it is not present.


In [None]:
from pathlib import Path

import requests

# Location where the example config expects to find the input file
input_dir = Path("../data/OG1")
input_file = input_dir / "Nelson_646_R.nc"
url = "https://linkedsystems.uk/erddap/files/Public_OG1_Data_001/Nelson_20240528/Nelson_646_R.nc"

# Create the data folder if needed, then download only if the file is missing
input_dir.mkdir(parents=True, exist_ok=True)
if not input_file.exists():
    # A timeout stops the cell from hanging if the server does not respond
    response = requests.get(url, timeout=60)
    if response.status_code == 200:
        input_file.write_bytes(response.content)
        print(f"Example file downloaded and written to {input_dir.resolve()}")
    else:
        print(f"File download failed (HTTP status {response.status_code})")

## Viewing the data

This step isn't necessary for the pipeline to work, but it is useful for anyone unfamiliar with glider data to see what the data looks like.

In [None]:
import xarray as xr

dataset = xr.load_dataset("../data/OG1/Nelson_646_R.nc")
dataset

Feel free to explore the data variables in this dataset. Many will be completely filled with NaNs, but for our purposes we only care about TIME, LATITUDE, LONGITUDE, PRES (pressure), CNDC (conductivity), and TEMP (temperature). You may notice that this data has no coordinates, which means it does not conform to the glider-community [OG1 format](https://github.com/OceanGlidersCommunity/OG-format-user-manual/blob/main/OG_Format.adoc). However, it is still compatible with the pipeline so long as the N_MEASUREMENTS dimension exists.
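
If you want to confirm these points programmatically, the following is a minimal sketch. The helper name `check_inputs` is hypothetical (it is not part of the toolbox), and the example call uses stand-in names so that it runs without the data file.

```python
# Variables this notebook works with (taken from the text above)
VARIABLES_OF_INTEREST = ("TIME", "LATITUDE", "LONGITUDE", "PRES", "CNDC", "TEMP")


def check_inputs(dimension_names, variable_names, required=VARIABLES_OF_INTEREST):
    """Return (has_n_measurements, missing_variables).

    Hypothetical helper for illustration; it only compares names and
    never touches the data values.
    """
    has_n_measurements = "N_MEASUREMENTS" in dimension_names
    missing = [name for name in required if name not in variable_names]
    return has_n_measurements, missing


# Stand-in names: the dimension is present but CNDC is absent
example = check_inputs(
    ["N_MEASUREMENTS"],
    ["TIME", "LATITUDE", "LONGITUDE", "PRES", "TEMP"],
)
print(example)  # (True, ['CNDC'])
```

With the dataset loaded above, the equivalent call would be `check_inputs(dataset.dims, dataset.variables)`.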

## Importing the pipeline

Before anything can be run, we need to make sure that the necessary packages are installed. All of the requirements are listed in **requirements.txt** and can be installed automatically using `pip install -r requirements.txt` in a terminal (if using Anaconda, make sure that you have selected the right environment using `conda activate your_env_name`). If you are missing any packages, the pipeline may break.

When we import the pipeline, it will try to register all of the steps available to it. This is labeled as `[Discovery]` in the print log. If you are missing a package requirement, you will see a message of the form `Failed to import <step>: <error message>`, which should indicate which package you are missing.
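
If you would rather check for missing packages before importing the pipeline, here is a minimal sketch using only the standard library. The helper name `find_missing_packages` is hypothetical, and the module names in the example are just the ones mentioned in this notebook; **requirements.txt** remains the authoritative list.

```python
from importlib.util import find_spec


def find_missing_packages(module_names):
    """Return the top-level module names that cannot be found.

    Hypothetical helper: find_spec() looks a module up without importing it
    and returns None when a top-level module is not installed.
    """
    return [name for name in module_names if find_spec(name) is None]


# Illustrative subset only (assumption): packages mentioned in this notebook
missing = find_missing_packages(["xarray", "requests", "gsw"])
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All checked packages are installed")
```

Note that the import name of a package can differ from the name used to install it with pip, so a name listed in **requirements.txt** may need adjusting before it is passed to this check.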

In [None]:
# Set the working directory and import the pipeline
import os
cwd = os.getcwd()
os.chdir(f"{cwd}/../../src")
from toolbox.pipeline import *

## Making the pipeline

To make the pipeline, we create an instance of the `Pipeline` class, passing it the path to the config that defines how we want to process the data. You should see a series of print logs indicating that the steps specified in the config have been found and added to the pipeline.

In [None]:
# Create the pipeline using the specified config
Pipe = Pipeline(
    config_path=r"../examples/configs/example_config_nelson.yaml"
)

## Running the pipeline

Calling the `Pipeline.run()` method tells the pipeline to execute all of the steps listed in the config, in order from top to bottom. As it runs, it will plot diagnostics for any step whose diagnostics setting is True. For more details on what is being run, see the comments in the config. Once the run has completed, your processed data should be saved as **Nelson_646_R_Processed.nc** in the data/OG1 folder.
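
To make the idea of config-driven, top-to-bottom execution concrete, here is a toy sketch. It is not the toolbox's implementation: the step names, the config layout, and the `run_steps` function are all hypothetical, chosen only to show steps being looked up by name and run in the listed order while passing results along in a shared dictionary.

```python
# Toy illustration only: these steps and this config layout are hypothetical
# and do not reflect the toolbox's real steps or its YAML schema.

def load_step(context, parameters):
    """Put some starting values into the shared context."""
    context["data"] = list(parameters["values"])
    return context


def scale_step(context, parameters):
    """Multiply every value in the context by a factor."""
    context["data"] = [value * parameters["factor"] for value in context["data"]]
    return context


# Registry mapping step names (as written in the config) to functions
STEP_REGISTRY = {"Load": load_step, "Scale": scale_step}


def run_steps(config):
    """Run each configured step in listed order, passing the context along."""
    context = {}
    for step in config["steps"]:
        step_function = STEP_REGISTRY[step["name"]]
        context = step_function(context, step.get("parameters", {}))
    return context


toy_config = {
    "steps": [
        {"name": "Load", "parameters": {"values": [1.0, 2.0, 3.0]}},
        {"name": "Scale", "parameters": {"factor": 2.0}},
    ]
}
result = run_steps(toy_config)
print(result["data"])  # [2.0, 4.0, 6.0]
```

Because each step receives the output of the one before it, the order of the steps in the config matters.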

In [None]:
# Run the pipeline
Pipe.run()

## Checking the output

Because our pipeline `Pipe` is still stored as a variable in the current Python session, we can access the data directly without loading the processed file from disk, as we did for the input data. This can be done by looking in the pipeline context as follows:

In [None]:
data = Pipe._context["data"]
data

If you compare the number of data variables here to that of the input data, you will see that we have gained 30 variables. These are a mixture of derived variables and new QC columns.
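
To list exactly which variables were added, a set difference on the variable names is enough. The sketch below uses a hypothetical helper, `added_variables`, and made-up stand-in names so that it runs on its own.

```python
def added_variables(input_names, output_names):
    """Return the sorted names present in the output but not in the input."""
    return sorted(set(output_names) - set(input_names))


# Stand-in names for illustration (hypothetical, not the real variable list)
before = ["TEMP", "CNDC"]
after = ["TEMP", "CNDC", "TEMP_QC", "DERIVED_EXAMPLE"]
new_names = added_variables(before, after)
print(new_names)  # ['DERIVED_EXAMPLE', 'TEMP_QC']
```

With the two datasets from this notebook, the equivalent call would be `added_variables(dataset.data_vars, data.data_vars)`.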

I'd recommend checking the data with some plotting to make sure everything looks OK. In fact, with this dataset, not everything is OK with the input CTD variables: if you look at the salinity outputs, they are much smaller than expected. This is because the CNDC input to the pipeline was in the wrong units for the gsw-python implementation of the equation of state of seawater. This kind of error has to be rectified manually by modifying the input data, as the pipeline expects inputs in the correct units.
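
As a minimal sketch of such a manual fix, the helper below converts conductivity to mS/cm, the unit that `gsw.SP_from_C` expects. It assumes the original values are in S/m (1 S/m = 10 mS/cm); that is an assumption about this file, so check the units attribute of CNDC before applying it. The function name and the accepted unit strings are hypothetical.

```python
# Conversion factors to mS/cm, keyed by a few common ways of writing the unit.
# The accepted spellings are an assumption; extend them to match your file.
FACTORS_TO_MS_PER_CM = {
    "S/m": 10.0,       # 1 S/m = 10 mS/cm
    "S m-1": 10.0,
    "mS/cm": 1.0,      # already in the expected unit
    "mS cm-1": 1.0,
}


def conductivity_to_ms_per_cm(values, units):
    """Return conductivity values converted to mS/cm.

    Raises ValueError for unrecognised units rather than guessing.
    """
    if units not in FACTORS_TO_MS_PER_CM:
        raise ValueError(f"Unrecognised conductivity units: {units!r}")
    factor = FACTORS_TO_MS_PER_CM[units]
    return [value * factor for value in values]


converted = conductivity_to_ms_per_cm([3.5, 4.0], "S/m")
print(converted)  # [35.0, 40.0]
```

For an xarray dataset, the same scaling could be applied to the CNDC variable before saving a corrected input file, remembering to update the variable's units attribute as well.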