# Getting Started with Echoflow on AWS: Processing EK60 Data to Compute MVBS


Welcome to the Echoflow AWS Deployment Guide notebook! This notebook will walk you through the steps to deploy an Echoflow pipeline on Amazon Web Services (AWS). Echoflow is a powerful tool for acoustics data processing and analysis, and AWS provides a scalable and reliable cloud platform for running your workflows.

In this notebook, you will learn how to set up an EC2 instance, install Echoflow, configure AWS S3 storage for your processed data, and initiate your Echoflow pipeline using Prefect. By following these steps, you'll be able to seamlessly run your acoustics data processing workflows in the cloud.

Let's get started!

## Prerequisites

Before you begin, make sure you have the following prerequisites in place:

- An AWS account with necessary permissions to create EC2 instances and S3 buckets.
- A private key (.pem) file for SSH access to your EC2 instance.
- Basic familiarity with the command-line interface (CLI).

## Notebook Setup

In this notebook, we'll go through the following steps:

1. Create an EC2 Instance: Set up an AWS EC2 instance to run your Echoflow pipeline.
2. Connect to EC2 using SSH: Establish a secure connection to your EC2 instance using SSH.
3. Install Echoflow: Create a virtual environment, clone the Echoflow repository, and install the package.
4. Initialize Echoflow and Prefect: Configure your Echoflow and Prefect environments.
5. Create an S3 Bucket: Set up an AWS S3 bucket to store your processed data.
6. Store AWS Credentials: Store your AWS credentials securely for access.
7. Create Credential Blocks: Set up credential blocks for secure access within Prefect.
8. Open Jupyter Notebook: Launch Jupyter Notebook to execute your Echoflow pipeline.
9. Set Up: Import the libraries and review the pipeline and datastore configurations.
10. Get Data: List the raw files available in the source S3 bucket.
11. Prepare Files: Write the transect file that lists the files to process.
12. Process with Echoflow: Execute the pipeline.

Now, let's dive into each step to deploy your Echoflow pipeline on AWS!

---


# Step 1: Create an EC2 Instance
Refer to the [AWS EC2](https://docs.aws.amazon.com/efs/latest/ug/gs-step-one-create-ec2-resources.html) documentation for a step-by-step guide on creating an EC2 instance. Make sure you have the private key (.pem) file for SSH access.

---

# Step 2: Connect to EC2 using SSH
Connect to your EC2 instance using SSH. You can follow the instructions in the [AWS SSH](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-linux-inst-ssh.html) documentation. You'll need the private key (.pem) file.

---

# Step 3: Install Echoflow

### Step 3.1: Make a virtual environment

To keep your Echoflow environment isolated, it's recommended to create a virtual environment using Conda or Python's built-in venv module. Here's an example using Conda:

```bash
conda create --name echoflow-env python pip  # include Python and pip so later installs stay inside the environment
conda activate echoflow-env
```

Or, using Python's venv:

```bash
python -m venv echoflow-env
source echoflow-env/bin/activate  # On Windows, use `echoflow-env\Scripts\activate`
```
### Step 3.2: Clone the Repository
With the virtual environment active, clone the Echoflow project repository onto the EC2 instance using the following command:

```bash
git clone <echoflow_repo>
```

### Step 3.3: Install the Package
Navigate to the project directory you've just cloned and install the Echoflow package. The `-e` flag installs the package in editable mode, so changes to the source take effect without reinstalling, which is especially helpful during development and testing. The installation can take a few minutes.

```bash
cd <project_directory>
pip install -e .
```
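Once the installation finishes, a quick check from Python can confirm that the package is visible in the active environment. This is a minimal sketch using only the standard library; it assumes the distribution is registered under the name `echoflow`, and the helper `installed_version` is a hypothetical name introduced here.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution name registered by `pip install -e .` is "echoflow"
version = installed_version("echoflow")
if version:
    print(f"echoflow version: {version}")
else:
    print("echoflow is not installed in this environment")
```

If the second message appears, check that the virtual environment from Step 3.1 is still active.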

---

# Step 4: Initialize Echoflow and Prefect

To kickstart your journey with Echoflow and Prefect, follow these simple initialization steps:

### Step 4.1: Initialize Echoflow
Begin by initializing Echoflow with the following command:

```bash
echoflow init
```

This command sets up the groundwork for your Echoflow environment, preparing it for seamless usage.

### Step 4.2: Initialize Prefect
Prefect initialization involves an extra authentication step. Which command you run depends on whether you use Prefect Cloud:

- If you have a Prefect Cloud account, provide your Prefect API key to securely link your account. Type your API key when prompted and press Enter.

```bash
prefect cloud login
```

- If you don't have a Prefect Cloud account yet, you can create a local Prefect profile instead. This is especially useful if you are just starting out and want to explore Prefect without an account.

```bash
prefect profiles create echoflow-local
```


The initialization process will ensure that both Echoflow and Prefect are properly set up and ready for you to dive into your cloud-based workflows.

---

# Step 5: Create an S3 bucket to store the output

Create an S3 bucket to store the processed output. Refer to the [AWS S3](https://docs.aws.amazon.com/quickstarts/latest/s3backup/step-1-create-bucket.html) documentation for guidance. Then add the bucket's S3 URI to `datastore.yaml` (in the same directory as this notebook) under the `urlpath` key of the `output` section:

```yaml
# ...rest of the configuration
output: # Output arguments
  urlpath: <YOUR_S3_URI> # Destination data URL parameters
  overwrite: true 
  storage_options: 
    block_name: echoflow-aws-credentials
    type: AWS
```
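A malformed `urlpath` is easy to miss until the pipeline tries to write its output. The sketch below checks that a value looks like `s3://bucket/prefix` using only the standard library. The function name `split_s3_uri` and the bucket name are hypothetical, and the check is purely syntactic: it does not confirm that the bucket exists or that you can write to it.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key_prefix); raise ValueError if it is malformed."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Expected a URI like s3://bucket/prefix, got: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name, used only for illustration
bucket, prefix = split_s3_uri("s3://my-echoflow-output/mvbs")
print(bucket, prefix)
```

An unreplaced placeholder such as `<YOUR_S3_URI>` raises `ValueError`, which makes a forgotten edit easy to spot.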

---

# Step 6: Store AWS Credentials

Edit the `~/.echoflow/credentials.ini` file and add your AWS access key ID and secret access key.

```bash
nano ~/.echoflow/credentials.ini

# add the following and save:
[echoflow-aws-credentials]
aws_access_key_id=my-aws-key
aws_secret_access_key=my-aws-secret
provider=AWS
```
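If you prefer not to edit the file by hand, the same section can be written from Python. This is a sketch under assumptions: it uses the standard `configparser` module, the helper `write_credentials` is a hypothetical name, and it assumes Echoflow accepts a plain INI file with the keys shown above. The demo writes to a temporary directory so that it cannot overwrite a real credentials file.

```python
import configparser
import os
import tempfile
from pathlib import Path


def write_credentials(path, section, key, secret, provider="AWS"):
    """Add or update one credentials section in an INI file and return its path."""
    path = Path(path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    # interpolation=None keeps "%" characters in secrets from being interpreted
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)  # a file that does not exist yet is silently skipped
    config[section] = {
        "aws_access_key_id": key,
        "aws_secret_access_key": secret,
        "provider": provider,
    }
    with open(path, "w") as handle:
        # Write "key=value" without spaces, matching the format shown above
        config.write(handle, space_around_delimiters=False)
    os.chmod(path, 0o600)  # restrict access to the current user (POSIX systems)
    return path


# Demo against a temporary file; the key and secret are placeholders
with tempfile.TemporaryDirectory() as tmp:
    demo_path = write_credentials(
        Path(tmp) / "credentials.ini",
        "echoflow-aws-credentials",
        "my-aws-key",
        "my-aws-secret",
    )
    print(demo_path.read_text())
```

To write the real file, pass `~/.echoflow/credentials.ini` as the path. Existing sections with other names are preserved.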

---

# Step 7: Create Credential blocks

Once you have stored the credentials in the ini file, run the command below to create a block that is stored securely in your Prefect account. For more about blocks, refer to [Blocks](https://github.com/OSOceanAcoustics/echoflow/blob/dev/docs/configuration/blocks.md).

```bash
echoflow load-credentials
```

---

# Step 8: Jupyter Notebook
Open Jupyter Notebook from a terminal in the same activated environment:

```bash
jupyter notebook
```

---

# Step 9: Setting Up
We begin by importing the required libraries and specifying the paths for the dataset and pipeline configuration files. These files contain the necessary information for data processing.

```python
from pathlib import Path
from echoflow import echoflow_start, StorageType, glob_url

dataset_config = Path("./datastore.yaml").resolve()
pipeline_config = Path("./pipeline.yaml").resolve()
```
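Before going further, it can help to confirm that both configuration files are actually present in the working directory. The following is a minimal sketch; the helper `missing_files` is a hypothetical name and not part of Echoflow.

```python
from pathlib import Path


def missing_files(*paths):
    """Return the given paths that do not exist as regular files."""
    return [Path(p) for p in paths if not Path(p).is_file()]


missing = missing_files("./datastore.yaml", "./pipeline.yaml")
if missing:
    print("Missing configuration files:", ", ".join(str(p) for p in missing))
else:
    print("Both configuration files were found.")
```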

## Step 9.1
### Pipeline Configuration: Mean Volume Backscattering Strength
In this section, we will provide you with the pipeline configuration that we'll be using for our MVBS processing. The configuration is presented in YAML format, which is a structured and human-readable way to define settings for data processing.

Here's the configuration we'll be using:

```yaml
active_recipe: standard 
use_local_dask: true
n_workers: 3
pipeline:
- recipe_name: standard 
  stages: 
  - name: echoflow_open_raw 
    module: echoflow.stages.subflows.open_raw 
    options: 
      save_raw_file: true
      use_raw_offline: true 
      use_offline: true 
  - name: echoflow_combine_echodata
    module: echoflow.stages.subflows.combine_echodata
    options:
      use_offline: true
  - name: echoflow_compute_SV
    module: echoflow.stages.subflows.compute_SV
    options:
      use_offline: true
  - name: echoflow_compute_MVBS
    module: echoflow.stages.subflows.compute_MVBS
    options:
      use_offline: true
    external_params:
      range_meter_bin: 20 
      ping_time_bin: 20S

```
    
Let's break down the components of this configuration:

- **active_recipe**: Specifies the recipe to be used for processing, which is set as "standard" in this case.

- **use_local_dask**: This flag indicates that we'll be utilizing a local Dask Cluster for parallel processing.

- **n_workers**: Determines the number of worker processes in the Dask Cluster. Here, we're using 3 workers for efficient parallelization.

- **pipeline**: This section defines the sequence of stages to execute. In this example, we're following the "standard" recipe, which comprises four stages.

    - **echoflow_open_raw**: This stage uses the `open_raw` subflow module to open raw data files. Its options control saving the converted raw files, reusing previously converted raw files (`use_raw_offline`), and reusing previously stored stage output (`use_offline`).

    - **echoflow_combine_echodata**: This stage uses the `combine_echodata` subflow module to combine EchoData objects by transect. It includes an option to use offline data.

    - **echoflow_compute_SV**: This stage uses the `compute_SV` subflow module to compute volume backscattering strength (Sv). It includes an option to use offline data.

    - **echoflow_compute_MVBS**: This stage uses the `compute_MVBS` subflow module to calculate MVBS, binning the data by `range_meter_bin` and `ping_time_bin`. It includes an option to use offline data.

**Note**: For a more comprehensive understanding of each option and its functionality, you can refer to the [Pipeline documentation](https://github.com/OSOceanAcoustics/echoflow/blob/dev/docs/configuration/pipeline.md).

Keep in mind that in this example, we'll be setting up a local Dask Cluster with 3 workers for parallel processing. This configuration will enable us to efficiently process our data for MVBS analysis. To turn it off, toggle `use_local_dask` to false.

Feel free to explore and modify the configuration to understand it better.
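To make the relationship between `active_recipe` and the `pipeline` list concrete, the sketch below writes the configuration as the dictionary a YAML loader would produce, trimmed to the fields it needs, and looks up the stages of the active recipe. The function `active_stage_names` is a hypothetical helper for illustration; it is not how Echoflow itself resolves recipes.

```python
# The pipeline configuration above as a dictionary, trimmed to stage names and modules
pipeline_config = {
    "active_recipe": "standard",
    "use_local_dask": True,
    "n_workers": 3,
    "pipeline": [
        {
            "recipe_name": "standard",
            "stages": [
                {"name": "echoflow_open_raw",
                 "module": "echoflow.stages.subflows.open_raw"},
                {"name": "echoflow_combine_echodata",
                 "module": "echoflow.stages.subflows.combine_echodata"},
                {"name": "echoflow_compute_SV",
                 "module": "echoflow.stages.subflows.compute_SV"},
                {"name": "echoflow_compute_MVBS",
                 "module": "echoflow.stages.subflows.compute_MVBS"},
            ],
        }
    ],
}


def active_stage_names(config):
    """Return the stage names of the recipe selected by `active_recipe`, in order."""
    for recipe in config["pipeline"]:
        if recipe["recipe_name"] == config["active_recipe"]:
            return [stage["name"] for stage in recipe["stages"]]
    raise KeyError(f"No recipe named {config['active_recipe']!r}")


print(active_stage_names(pipeline_config))
# ['echoflow_open_raw', 'echoflow_combine_echodata', 'echoflow_compute_SV', 'echoflow_compute_MVBS']
```

Adding a second recipe to the `pipeline` list and changing `active_recipe` would select a different sequence of stages without touching the first.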

## Step 9.2
### Datastore Configuration: Organizing Data for Processing

In this section, we'll delve into the configuration that defines how the data will be organized and managed for processing. This configuration is provided in YAML format and plays a crucial role in structuring data inputs and outputs.

Here's the detailed breakdown of the configuration:

```yaml
name: Bell_M._Shimada-SH1707-EK60
sonar_model: EK60 
raw_regex: (.*)-?D(?P<date>\w{1,8})-T(?P<time>\w{1,6}) 
args: 
  urlpath: s3://ncei-wcsd-archive/data/raw/{{ ship_name }}/{{ survey_name }}/{{ sonar_model }}/*.raw
  parameters:
    ship_name: Bell_M._Shimada
    survey_name: SH1707
    sonar_model: EK60
  storage_options:
    anon: true
  transect:
    file: ./EK60_SH1707_Shimada.txt
    default_transect_num: 2017
  json_export: true 
output: 
  urlpath: <YOUR-S3-BUCKET>
  retention: false
  overwrite: true
  storage_options: 
    block_name: echoflow-aws-credentials
    type: AWS
```

Let's delve into the individual components of the configuration presented here:

- **name**: Specifies a descriptive name for the configuration, aiding in identifying its purpose.

- **sonar_model**: Indicates the type of sonar being utilized, which in this case is "EK60".

- **raw_regex**: Defines a regular expression pattern for extracting date and time information from raw data file names.

- **args**: This section provides crucial arguments for structuring data inputs:

  - **urlpath**: Defines the URL pattern to access the raw data files stored on a remote server. The placeholders `{{ ship_name }}`, `{{ survey_name }}`, and `{{ sonar_model }}` are dynamically replaced with the specified values.

  - **storage_options**: Sets storage options, such as anonymous access (`anon: true`), for retrieving the data.

  - **transect**: Specifies a file (`EK60_SH1707_Shimada.txt`) containing the list of files to process, along with default transect information.

  - **json_export**: Enables JSON metadata export.

- **output**: This section configures the output settings for processed data:

  - **urlpath**: Determines the output directory (`<YOUR-S3-BUCKET>`) where the processed data will be stored.

  - **retention**: Disables data retention, indicating that only MVBS data will be stored in this case.

  - **overwrite**: Allows data overwriting if the data already exists.
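The placeholder substitution in `urlpath` can be illustrated with a few lines of Python. This sketch uses a simple regular expression to fill `{{ name }}` placeholders from the `parameters` mapping; `render_urlpath` is a hypothetical helper, and Echoflow's own templating may work differently.

```python
import re


def render_urlpath(template, parameters):
    """Replace each `{{ name }}` placeholder with the matching value from `parameters`."""
    def substitute(match):
        # A placeholder without a matching parameter raises KeyError
        return str(parameters[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    "s3://ncei-wcsd-archive/data/raw/"
    "{{ ship_name }}/{{ survey_name }}/{{ sonar_model }}/*.raw"
)
parameters = {
    "ship_name": "Bell_M._Shimada",
    "survey_name": "SH1707",
    "sonar_model": "EK60",
}

print(render_urlpath(template, parameters))
# s3://ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/*.raw
```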

**Note**: 
- For a more comprehensive understanding of each option and its functionality, you can refer to the [Datastore documentation](https://github.com/OSOceanAcoustics/echoflow/blob/dev/docs/configuration/datastore.md).
- The pipeline stores the MVBS output under `<YOUR-S3-BUCKET>`. Because `retention` is set to false, only the MVBS files are stored. To specify files for processing, list the file names in `EK60_SH1707_Shimada.txt` and place the file at the path given by the `file` key of the `transect` section (here, the same directory as this notebook).

This configuration facilitates efficient data organization and management for the processing pipeline. Feel free to tailor it to your specific data and processing requirements.
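To see what `raw_regex` extracts, the pattern can be applied to a file name with Python's `re` module. In this sketch the file name is an assumed example of the naming scheme, and `parse_raw_name` is a hypothetical helper rather than an Echoflow function.

```python
import re

# The raw_regex value from the datastore configuration
RAW_REGEX = r"(.*)-?D(?P<date>\w{1,8})-T(?P<time>\w{1,6})"


def parse_raw_name(file_name):
    """Return the (date, time) strings captured from a raw file name."""
    match = re.match(RAW_REGEX, file_name)
    if match is None:
        raise ValueError(f"File name does not match raw_regex: {file_name!r}")
    return match.group("date"), match.group("time")


# Assumed example file name, used only for illustration
print(parse_raw_name("Summer2017-D20170615-T190214.raw"))
# ('20170615', '190214')
```

The named groups `date` and `time` capture up to eight and six word characters respectively, which corresponds to `YYYYMMDD` and `HHMMSS` in names of this form.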


# Step 10: Getting Data
Next, we'll use the `glob_url` function to retrieve a list of URLs matching a specific pattern. In this case, we're targeting raw EK60 data files from the SH1707 survey.

```python
all_files = glob_url("s3://ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/*.raw", {'anon': True})
```

# Step 11: Preparing Files
We'll now extract the file names from the URLs and write them to the transect file. To keep the run short, only the first five files are listed.

```python
# Keep only the file name (the part after the last "/") from each URL
files = [url.split("/")[-1] for url in all_files]

# Write the first five file names to the transect file, one per line
with open("EK60_SH1707_Shimada.txt", "w") as transect:
    for name in files[:5]:
        transect.write(name + "\n")
```

# Step 12: Processing with echoflow
Now, we're ready to kick off the data processing using echoflow. We'll provide the dataset and pipeline configurations, along with additional options.

```python
options = {"storage_options_override": False}
data = echoflow_start(dataset_config=dataset_config, pipeline_config=pipeline_config, options=options)
```

```text
2023-08-30 16:01:16,445 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-7jjfqrq8', purging

<Client: 'tcp://127.0.0.1:35413' processes=3 threads=6, memory=15.61 GiB>
--------------------------------------------------

Executing stage :  name='echoflow_open_raw' module='echoflow.stages.subflows.open_raw' external_params=None options={'save_raw_file': True, 'use_raw_offline': True, 'use_offline': True} prefect_config=None
[Errno 111] Connection refused

Completed stage name='echoflow_open_raw' module='echoflow.stages.subflows.open_raw' external_params=None options={'save_raw_file': True, 'use_raw_offline': True, 'use_offline': True} prefect_config=None
--------------------------------------------------
<Client: 'tcp://127.0.0.1:35413' processes=3 threads=6, memory=15.61 GiB>
--------------------------------------------------

Executing stage :  name='echoflow_combine_echodata' module='echoflow.stages.subflows.combine_echodata' external_params=None options={'use_offline': True} prefect_config=None
[Errno 111] Connection refused
Cleaning :  s3://echoflow-workground/combined_files
```



# Step 13: Results
Finally, let's inspect the processed data returned by `echoflow_start`.

```python
data
```

```text
[[<xarray.Dataset>
  Dimensions:            (channel: 3, ping_time: 456, echo_range: 38)
  Coordinates:
    * channel            (channel) <U37 'GPT  18 kHz 009072058c8d 1-1 ES18-11' ...
    * echo_range         (echo_range) float64 0.0 20.0 40.0 ... 700.0 720.0 740.0
    * ping_time          (ping_time) datetime64[ns] 2017-06-15T19:02:00 ... 201...
  Data variables:
      Sv                 (channel, ping_time, echo_range) float64 dask.array<chunksize=(2, 456, 38), meta=np.ndarray>
      frequency_nominal  (channel) float64 dask.array<chunksize=(3,), meta=np.ndarray>
  Attributes:
      processing_function:          commongrid.compute_MVBS
      processing_software_name:     echopype
      processing_software_version:  0.7.2.dev51+gb45c942
      processing_time:              2023-08-30T15:54:41Z]]
```
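The dimensions in this output follow from the bin sizes set in the pipeline configuration (`range_meter_bin: 20` and `ping_time_bin: 20S`). The short calculation below reproduces the `echo_range` length from the coordinate values shown above and estimates the time covered by the `ping_time` axis. It is a sketch that assumes evenly spaced bin labels and no gaps in time; `bin_count` is a hypothetical helper.

```python
def bin_count(start, stop, width):
    """Number of evenly spaced bin labels from start to stop, both inclusive."""
    return int(round((stop - start) / width)) + 1


# echo_range labels run from 0.0 m to 740.0 m in 20 m steps
print(bin_count(0.0, 740.0, 20))  # 38

# 456 ping_time bins of 20 s each cover this many minutes if the bins are contiguous
print(456 * 20 / 60)  # 152.0
```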