# Getting Started with Echodataflow: Processing EK60 Data to Extract Target Strength

Welcome to this beginner-friendly notebook that guides you through the process of using `echodataflow` to process EK60 data from the SH1707 survey. `echodataflow` is a powerful tool for ocean acoustics data processing. In this notebook, we'll walk through the following steps:

1. **Setting Up**: Importing necessary libraries and setting up the configuration paths for the dataset and pipeline.

2. **Getting Data**: Using `glob_url` to retrieve a list of URLs matching the specified pattern of raw EK60 data files.

3. **Preparing Files**: Extracting file names from URLs and creating a file listing for the transect.

4. **Processing with echodataflow**: Starting the echodataflow processing using the specified configurations.

5. **Results**: Displaying the first entry from the processed data.

Let's get started!

## Step 0: Install Echodataflow

Before using `echodataflow`, let's make sure it is installed and configured on your system. Feel free to skip this step if you already have echodataflow set up.

### Step 0.1: Make a virtual environment

To keep your Echodataflow environment isolated, it's recommended to create a virtual environment using Conda or Python's built-in venv module. Here's an example using Conda:

```bash
conda create --name echodataflow-env python  # include Python (and pip) in the new environment
conda activate echodataflow-env
```

Or, using Python's venv:

```bash
python -m venv echodataflow-env
source echodataflow-env/bin/activate  # On Windows, use `echodataflow-env\Scripts\activate`
```
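
To confirm that the environment is active before installing anything, a quick check from Python can help. The sketch below uses only the standard library; the helper name `describe_environment` is illustrative and not part of Echodataflow. It assumes that Conda exports `CONDA_DEFAULT_ENV` on activation and that a venv makes `sys.prefix` differ from `sys.base_prefix`.

```python
import os
import sys


def describe_environment() -> str:
    """Return a short description of the active Python environment."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix does not.
    if sys.prefix != sys.base_prefix:
        return f"venv at {sys.prefix}"
    # Conda exports the active environment name on `conda activate`.
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    if conda_env:
        return f"conda environment '{conda_env}'"
    return "no virtual environment detected"


print(describe_environment())
```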

### Step 0.2: Clone the Repository
Now that you have a virtual environment set up, you can clone the Echodataflow project repository to your local machine using the following command:

```bash
git clone <echodataflow_repo>
```

### Step 0.3: Install the Package
Navigate to the project directory you've just cloned and install the Echodataflow package. The `-e` flag installs the package in editable mode, so changes to the source take effect without reinstalling, which is especially helpful during development and testing. Installation may take a while, so let it run while you enjoy your coffee.

```bash
cd <project_directory>
pip install -e .
```
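
Once `pip` finishes, it can be worth confirming that the package is visible to the interpreter in your environment. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper `installed_version` is hypothetical, and it assumes the distribution is registered under the name `echodataflow`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "echodataflow" is assumed to be the distribution name registered by pip.
found = installed_version("echodataflow")
print(f"echodataflow version: {found}" if found else "echodataflow is not installed")
```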

---

### Step 0.4: Initialize Echodataflow and Prefect

To kickstart your journey with Echodataflow and Prefect, follow these simple initialization steps:

##### 0.4.1 Initializing Echodataflow
Begin by initializing Echodataflow with the following command:

```bash
echodataflow init
```

This command sets up the groundwork for your Echodataflow environment, preparing it for seamless usage.

##### 0.4.2 Initializing Prefect
For Prefect, initialization depends on whether you use Prefect Cloud. Choose one of the following options:

- If you have a Prefect Cloud account, run the command below to start the authentication process. Type your Prefect API key when prompted and press Enter to securely link your account.

```bash
prefect cloud login
```

- If you don't have a Prefect Cloud account yet, you can use a local Prefect profile instead. This is especially useful for those who are just starting out and want to explore Prefect without an account.

```bash
prefect profiles create echodataflow-local
```


Once these steps are complete, both Echodataflow and Prefect are set up and ready to run your workflows, whether through Prefect Cloud or a local profile.

---

## Step 1: Setting Up

We begin by importing the required libraries and specifying the paths for the dataset and pipeline configuration files. These files contain the necessary information for data processing.

```python
from pathlib import Path
from echodataflow import echodataflow_start, glob_url

dataset_config = Path("./datastore.yaml").resolve()
pipeline_config = Path("./pipeline.yaml").resolve()
```
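
Because `resolve()` does not fail when a path is absent, a small pre-flight check can catch a misplaced configuration file before the pipeline starts. This is a sketch under the assumption that both YAML files sit next to the notebook; `missing_files` is a hypothetical helper, not an Echodataflow function.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(paths: Iterable[Path]) -> List[Path]:
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not p.is_file()]


# Same locations as the configuration paths defined in this step.
configs = [Path("./datastore.yaml").resolve(), Path("./pipeline.yaml").resolve()]
for path in missing_files(configs):
    print(f"Missing configuration file: {path}")
```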

## Step 1.1
### Pipeline Configuration: Target Strength Processing
In this section, we will provide you with the pipeline configuration that we'll be using for our target strength processing. The configuration is presented in YAML format, which is a structured and human-readable way to define settings for data processing.

Here's the configuration we'll be using:

```yaml
active_recipe: target_strength 
use_local_dask: true
n_workers: 5
pipeline:
- recipe_name: target_strength
  stages:
  - name: echodataflow_open_raw
    module: echodataflow.stages.subflows.open_raw
    options:
      save_raw_file: true
      use_raw_offline: true
      use_offline: true
  - name: echodataflow_compute_TS
    module: echodataflow.stages.subflows.compute_TS
    options:
      use_offline: true
```

Let's break down the components of this configuration:

- **active_recipe**: Specifies the recipe to be used for processing, which is set as "target_strength" in this case.

- **use_local_dask**: This flag indicates that we'll be utilizing a local Dask Cluster for parallel processing.

- **n_workers**: Determines the number of worker processes in the Dask Cluster. Here, we're using 5 workers for efficient parallelization.

- **pipeline**: This section defines the sequence of stages to execute. In this example, we're following the "target_strength" recipe, which comprises two stages.

- **echodataflow_open_raw**: This stage uses the `open_raw` subflow module to open raw data files. Its options control saving the raw files (`save_raw_file`), using raw data in offline mode (`use_raw_offline`), and using offline data (`use_offline`).

- **echodataflow_compute_TS**: This stage uses the `compute_TS` subflow module to compute target strength. Its only option here, `use_offline`, enables the use of offline data.

**Note**: For a more comprehensive understanding of each option and its functionality, you can refer to the [Pipeline documentation](https://github.com/OSOceanAcoustics/echodataflow/blob/dev/docs/configuration/pipeline.md).

Keep in mind that this example sets up a local Dask Cluster with 5 workers so that the data can be processed in parallel. To run without it, set `use_local_dask` to `false`.

Feel free to explore and modify the configuration to understand it better.
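
To make the relationship between `active_recipe` and `pipeline` concrete, the sketch below writes the YAML above as the equivalent Python dictionary and picks out the stages of the active recipe. The names `example_pipeline` and `active_stage_names` are illustrative; this mirrors how the configuration reads, not necessarily how Echodataflow resolves recipes internally.

```python
from typing import Any, Dict, List

# The pipeline YAML above, written out as the equivalent Python dictionary.
example_pipeline: Dict[str, Any] = {
    "active_recipe": "target_strength",
    "use_local_dask": True,
    "n_workers": 5,
    "pipeline": [
        {
            "recipe_name": "target_strength",
            "stages": [
                {
                    "name": "echodataflow_open_raw",
                    "module": "echodataflow.stages.subflows.open_raw",
                    "options": {
                        "save_raw_file": True,
                        "use_raw_offline": True,
                        "use_offline": True,
                    },
                },
                {
                    "name": "echodataflow_compute_TS",
                    "module": "echodataflow.stages.subflows.compute_TS",
                    "options": {"use_offline": True},
                },
            ],
        }
    ],
}


def active_stage_names(config: Dict[str, Any]) -> List[str]:
    """Return the stage names of the recipe selected by `active_recipe`."""
    for recipe in config["pipeline"]:
        if recipe["recipe_name"] == config["active_recipe"]:
            return [stage["name"] for stage in recipe["stages"]]
    raise ValueError(f"No recipe named {config['active_recipe']!r} in the pipeline")


print(active_stage_names(example_pipeline))
# ['echodataflow_open_raw', 'echodataflow_compute_TS']
```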

## Step 1.2
### Datastore Configuration: Organizing Data for Processing

In this section, we'll delve into the configuration that defines how the data will be organized and managed for processing. This configuration is provided in YAML format and plays a crucial role in structuring data inputs and outputs.

Here's the configuration we'll be using:

```yaml
name: Bell_M._Shimada-SH1707-EK60
sonar_model: EK60 
raw_regex: (.*)-?D(?P<date>\w{1,8})-T(?P<time>\w{1,6}) 
args: 
  urlpath: s3://ncei-wcsd-archive/data/raw/{{ ship_name }}/{{ survey_name }}/{{ sonar_model }}/*.raw
  parameters:
    ship_name: Bell_M._Shimada
    survey_name: SH1707
    sonar_model: EK60
  storage_options:
    anon: true
  transect:
    file: ./EK60_SH1707_Shimada.txt
  default_transect_num: 2017
  json_export: true 
output: 
  urlpath: ./echodataflow-output
  retention: false
  overwrite: true
```

Let's delve into the individual components of the configuration presented here:

- **name**: Specifies a descriptive name for the configuration, aiding in identifying its purpose.

- **sonar_model**: Indicates the type of sonar being utilized, which in this case is "EK60".

- **raw_regex**: Defines a regular expression pattern for extracting date and time information from raw data file names.

- **args**: This section provides crucial arguments for structuring data inputs:

  - **urlpath**: Defines the URL pattern to access the raw data files stored on a remote server. The placeholders `{{ ship_name }}`, `{{ survey_name }}`, and `{{ sonar_model }}` are dynamically replaced with the specified values.

  - **storage_options**: Sets storage options, such as anonymous access (`anon: true`), for retrieving the data.

  - **transect**: Specifies a file (`EK60_SH1707_Shimada.txt`) containing the list of files to process, along with default transect information.

  - **json_export**: Enables JSON metadata export.

- **output**: This section configures the output settings for processed data:

  - **urlpath**: Determines the output directory (`./echodataflow-output`) where the processed data will be stored.

  - **retention**: Disables data retention, indicating that only Target Strength data will be stored in this case.

  - **overwrite**: Allows data overwriting if the data already exists.
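
To see what `raw_regex` captures, the pattern can be tried on a single file name with Python's `re` module. The file name below is an illustrative example of the EK60 naming scheme and is an assumption, not a file taken from this document.

```python
import re

# Pattern copied from `raw_regex` in the datastore configuration.
RAW_REGEX = re.compile(r"(.*)-?D(?P<date>\w{1,8})-T(?P<time>\w{1,6})")

# Illustrative file name following the EK60 naming scheme (an assumption).
example_name = "Summer2017-D20170615-T190214.raw"

match = RAW_REGEX.match(example_name)
if match is None:
    raise ValueError(f"{example_name!r} does not match raw_regex")

print(match.group("date"))  # 20170615
print(match.group("time"))  # 190214
```

The named groups `date` and `time` are what make the two values retrievable by name rather than by position.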

**Note**:
- For a more comprehensive understanding of each option and its functionality, you can refer to the [Datastore documentation](https://github.com/OSOceanAcoustics/echodataflow/blob/dev/docs/configuration/datastore.md).
- The pipeline will store Target Strength output under `./echodataflow-output`. As the retention is set to false, only Target Strength files will be stored. To specify files for processing, create a list of file names and store it in `EK60_SH1707_Shimada.txt`, at the path given under `transect` in the configuration (here, `./EK60_SH1707_Shimada.txt`).

This configuration facilitates efficient data organization and management for the processing pipeline. Feel free to tailor it to your specific data and processing requirements.
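
The placeholder substitution in `urlpath` can be illustrated with a few lines of standard-library Python. This is a stand-in sketch that assumes simple `{{ name }}` placeholders; `render_urlpath` is a hypothetical helper and does not claim to reproduce the templating Echodataflow uses internally.

```python
import re
from typing import Dict


def render_urlpath(template: str, parameters: Dict[str, str]) -> str:
    """Replace each `{{ name }}` placeholder with its value from `parameters`."""

    def substitute(match: "re.Match") -> str:
        # A missing key raises KeyError, which surfaces typos in the config.
        return parameters[match.group(1)]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


urlpath = render_urlpath(
    "s3://ncei-wcsd-archive/data/raw/{{ ship_name }}/{{ survey_name }}/{{ sonar_model }}/*.raw",
    {"ship_name": "Bell_M._Shimada", "survey_name": "SH1707", "sonar_model": "EK60"},
)
print(urlpath)
# s3://ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/*.raw
```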


## Step 2: Getting Data
Next, we'll use the `glob_url` function to retrieve a list of URLs matching a specific pattern. In this case, we're targeting raw EK60 data files from the SH1707 survey.

```python
all_files = glob_url("s3://ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/*.raw", {"anon": True})
```
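
As a rough local analogue of what the `*.raw` wildcard keeps, the sketch below filters a hypothetical listing with the standard library's `fnmatchcase`. It is not how `glob_url` is implemented and makes no network calls; the candidate names are invented for illustration.

```python
from fnmatch import fnmatchcase

PREFIX = "s3://ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/"
PATTERN = PREFIX + "*.raw"

# Hypothetical listing; the real bucket contents are not reproduced here.
candidates = [
    PREFIX + "Summer2017-D20170615-T190214.raw",
    PREFIX + "Summer2017-D20170615-T190214.idx",
    PREFIX + "Summer2017-D20170615-T190843.raw",
]

# Keep only the entries that match the wildcard pattern.
matches = [url for url in candidates if fnmatchcase(url, PATTERN)]
print(len(matches))  # 2
```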

## Step 3: Preparing Files
We'll now extract the file names from the URLs and create a file listing for the transect. This will help us organize and work with the data effectively.

```python
# Keep only the file name (the last path component) of each URL.
files = [url.split("/")[-1] for url in all_files]

# Write the first 10 file names to the transect listing, one per line.
with open("EK60_SH1707_Shimada.txt", "w") as transect:
    for name in files[:10]:
        transect.write(name + "\n")
```
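
Before starting the pipeline, it can help to read the listing back and confirm it contains what you expect. The sketch below is a minimal check using `pathlib`; `read_transect_listing` is a hypothetical helper, and the one-name-per-line format is taken from the cell above.

```python
from pathlib import Path
from typing import List


def read_transect_listing(path: Path) -> List[str]:
    """Return the non-empty lines of a transect listing, stripped of whitespace."""
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


listing = Path("EK60_SH1707_Shimada.txt")
if listing.is_file():
    names = read_transect_listing(listing)
    not_raw = [name for name in names if not name.endswith(".raw")]
    print(f"{len(names)} files listed, {len(not_raw)} without a .raw extension")
else:
    print(f"{listing} has not been created yet")
```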

## Step 4: Processing with echodataflow
Now, we're ready to kick off the data processing using echodataflow. We'll provide the dataset and pipeline configurations, along with additional options.

```python
options = {"storage_options_override": False}
data = echodataflow_start(dataset_config=dataset_config, pipeline_config=pipeline_config, options=options)
```


```text
Checking Configuration

Configuration Check Completed
Checking Connection to Prefect Server

Starting the Pipeline


Dataset Configuration Loaded For This Run
--------------------------------------------------
{'name': 'Bell_M._Shimada-SH1707-EK60', 'sonar_model': 'EK60', 'raw_regex': '(.*)-?D(?P<date>\\w{1,8})-T(?P<time>\\w{1,6})', 'args': {'urlpath': 's3://ncei-wcsd-archive/data/raw/{{ ship_name }}/{{ survey_name }}/{{ sonar_model }}/*.raw', 'parameters': {'ship_name': 'Bell_M._Shimada', 'survey_name': 'SH1707', 'sonar_model': 'EK60'}, 'storage_options': {'anon': True}, 'transect': {'file': './EK60_SH1707_Shimada.txt'}, 'default_transect_num': 2017, 'json_export': True}, 'output': {'urlpath': './echodataflow-output', 'retention': True, 'overwrite': True}}
Pipeline Configuration Loaded For This Run
--------------------------------------------------
{'active_recipe': 'target_strength', 'use_local_dask': False, 'n_workers': 5, 'pipeline': [{'recipe_name': 'target_strength', 'stages': [

ValueError: Must have raw_dicts or raw_json_path present.
```

## Step 5: Results
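Finally, let's take a look at the first entry from the processed data. Note that the captured run above stopped with a `ValueError`, so `data` is only available after a run that completes.

```python
data[0][0]
```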

**Congratulations!** Once the pipeline runs to completion, you've successfully processed EK60 data using echodataflow. This notebook provides a simplified overview, and you can explore the capabilities of echodataflow for more advanced processing tasks.

Feel free to modify the parameters, paths, and configurations as needed to adapt to your data and requirements.