# Example overview of the `echopop` dataflow

## Setting up the environment

First, import the `ingest_nasc` module from the latest version of `Echopop` to load the data. Assigning a `DATA_ROOT` directory path is optional; it simply saves writing out the full file path of each target dataset.

```python
from echopop.nwfsc_feat import ingest_nasc
from pathlib import Path

DATA_ROOT = Path("C:/Data/")
```

`````{admonition} Function arguments
:class: tip
Use the `help()` function or hover tooltips in certain IDEs (e.g. `VSCode`) to investigate how certain functions are parameterized and used.
`````

## NASC ingestion

NASC exports produced by Echoview are ingested using the `ingest_nasc` module. The first step is to collate all of the raw NASC exports (`*(analysis).csv`, `*(intervals).csv`, `*(layers).csv`, `*(cells).csv`) into `pandas.DataFrame` objects. This is done by the `merge_echoview_nasc` function, which produces two dataframes: 1) the processed transect intervals and 2) the merged intervals-layers-cells information.

```python
# Merge exports
df_intervals, df_exports = ingest_nasc.merge_echoview_nasc(
    nasc_path=DATA_ROOT / "raw_nasc/",
    filename_transect_pattern=r"T(\d+)",
    default_transect_spacing=10.0,
    default_latitude_threshold=60.0,
)
```
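
To make the role of `filename_transect_pattern` more concrete, the following standalone sketch applies the same `T(\d+)` expression to a few file names and checks that each transect has all four export types. The file names and both helper functions are hypothetical and only illustrate the idea; they are not part of `Echopop`, and real Echoview exports may be named differently.

```python
import re
from collections import defaultdict

TRANSECT_PATTERN = r"T(\d+)"  # same expression passed to merge_echoview_nasc above
EXPORT_TYPES = ("analysis", "intervals", "layers", "cells")

# Hypothetical file names, used only for illustration
filenames = [
    "survey_T1 (analysis).csv",
    "survey_T1 (intervals).csv",
    "survey_T1 (layers).csv",
    "survey_T1 (cells).csv",
    "survey_T12 (intervals).csv",
    "survey_T12 (cells).csv",
]


def group_exports_by_transect(names):
    """Hypothetical helper: map each transect number to the export types found."""
    grouped = defaultdict(set)
    for name in names:
        transect = re.search(TRANSECT_PATTERN, name)
        export_type = re.search(r"\((\w+)\)\.csv$", name)
        if transect and export_type and export_type.group(1) in EXPORT_TYPES:
            grouped[int(transect.group(1))].add(export_type.group(1))
    return dict(grouped)


def missing_exports(grouped):
    """Hypothetical helper: list the export types each transect still lacks."""
    return {
        transect: sorted(set(EXPORT_TYPES) - found)
        for transect, found in grouped.items()
        if set(EXPORT_TYPES) - found
    }


grouped = group_exports_by_transect(filenames)
print(missing_exports(grouped))  # {12: ['analysis', 'layers']}
```

A check like this can be run on the contents of the `raw_nasc` directory before merging to catch incomplete export sets early.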

#### Transect region haul key

In some cases, a transect-region-haul key file is required for mapping biological trawls to specific Echoview export region identifiers and their respective transect numbers. This may also require passing a dictionary to the `rename_dict` argument so that the column names in the file match those expected by `Echopop`.

```python
# Read in transect-region-haul keys
TRANSECT_REGION_FILEPATH_ALL_AGES = (
    DATA_ROOT / "Stratification/US_CAN_2019_transect_region_haul_age1+ auto_final.xlsx"
)
TRANSECT_REGION_SHEETNAME_ALL_AGES: str = "Sheet1"

TRANSECT_REGION_FILE_RENAME: dict = {
    "tranect": "transect_num",
    "region id": "region_id",
    "trawl #": "haul_num",
}

# Read in the transect-region-haul key files for each group
transect_region_haul_key_all_ages = ingest_nasc.read_transect_region_haul_key(
    filename=TRANSECT_REGION_FILEPATH_ALL_AGES,
    sheetname=TRANSECT_REGION_SHEETNAME_ALL_AGES,
    rename_dict=TRANSECT_REGION_FILE_RENAME,
)
```
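
The effect of a rename dictionary can be pictured with a small standalone sketch. It assumes the dictionary keys must match the spreadsheet's column headers exactly, and that the three renamed columns are the ones needed downstream. Both helper functions and the `REQUIRED_COLUMNS` set are hypothetical; they do not describe how `read_transect_region_haul_key` works internally.

```python
RENAME = {
    "tranect": "transect_num",
    "region id": "region_id",
    "trawl #": "haul_num",
}

# Assumption: these are the column names needed after renaming
REQUIRED_COLUMNS = {"transect_num", "region_id", "haul_num"}


def rename_columns(columns, rename_dict):
    """Hypothetical helper: rename headers that have an entry, keep the rest."""
    return [rename_dict.get(column, column) for column in columns]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Hypothetical helper: report required columns that are still absent."""
    return sorted(set(required) - set(columns))


# Headers spelled exactly like the dictionary keys are all renamed
renamed = rename_columns(["tranect", "region id", "trawl #", "notes"], RENAME)
print(renamed)  # ['transect_num', 'region_id', 'haul_num', 'notes']

# A header spelled differently from its key is left untouched and shows up as missing
mismatched = rename_columns(["transect", "region id", "trawl #"], RENAME)
print(missing_columns(mismatched))  # ['transect_num']
```

If a column turns up missing after reading a key file, comparing the file's headers against the dictionary keys character by character is a reasonable first check.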

#### Processing export region identifiers

In some cases, the export region identifiers are codes that embed additional information. For `Echopop`, details such as the `region class`, `haul number`, and `country` may be expected from these codes.

```python
REGION_NAME_EXPR_DICT = {
    "REGION_CLASS": {
        "Age-1 Hake": "^(?:h1a(?![a-z]|m))",
        "Age-1 Hake Mix": "^(?:h1am(?![a-z]|1a))",
        "Hake": "^(?:h(?![a-z]|1a)|hake(?![_]))",
        "Hake Mix": "^(?:hm(?![a-z]|1a)|hake_mix(?![_]))",
    },
    "HAUL_NUM": {
        "[0-9]+",
    },
    "COUNTRY": {
        "CAN": "^[cC]",
        "US": "^[uU]",
    },
}

# Process the region name codes to define the region classes
# e.g. H5C - Region 2 corresponds to "Hake, Haul #5, Canada"
df_exports_with_regions = ingest_nasc.process_region_names(
    df=df_exports,
    region_name_expr_dict=REGION_NAME_EXPR_DICT,
    can_haul_offset=200,
)
```
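
As a rough illustration of how such expressions can decode a region name, the sketch below parses a code by consuming the region class, haul number, and country from left to right. This is a guess at the general approach, not the implementation used by `process_region_names`. The `parse_region_name` helper is hypothetical, the matching is assumed to be case-insensitive for the region class, and `can_haul_offset` is assumed to be added to Canadian haul numbers.

```python
import re

# Same expressions as in REGION_NAME_EXPR_DICT, repeated so the sketch runs on its own
REGION_CLASS = {
    "Age-1 Hake": "^(?:h1a(?![a-z]|m))",
    "Age-1 Hake Mix": "^(?:h1am(?![a-z]|1a))",
    "Hake": "^(?:h(?![a-z]|1a)|hake(?![_]))",
    "Hake Mix": "^(?:hm(?![a-z]|1a)|hake_mix(?![_]))",
}
HAUL_NUM = "[0-9]+"
COUNTRY = {"CAN": "^[cC]", "US": "^[uU]"}


def parse_region_name(name, can_haul_offset=200):
    """Hypothetical helper: split a region name into class, haul number, and country."""
    code = name.split(" - ")[0].strip()  # drop a trailing label such as " - Region 2"

    # 1) Region class: first expression that matches the start of the code
    region_class, rest = None, code
    for label, pattern in REGION_CLASS.items():
        match = re.match(pattern, code, flags=re.IGNORECASE)
        if match:
            region_class, rest = label, code[match.end():]
            break

    # 2) Haul number: digits that follow the region class prefix
    haul_match = re.match(HAUL_NUM, rest)
    haul_num = int(haul_match.group()) if haul_match else None
    rest = rest[haul_match.end():] if haul_match else rest

    # 3) Country: letter that follows the haul number
    country = next((key for key, pattern in COUNTRY.items() if re.match(pattern, rest)), None)

    # Assumption: Canadian hauls are shifted by the offset to keep them distinct
    if country == "CAN" and haul_num is not None:
        haul_num += can_haul_offset
    return region_class, haul_num, country


print(parse_region_name("H5C - Region 2"))  # ('Hake', 205, 'CAN')
print(parse_region_name("H1AM3U"))          # ('Age-1 Hake Mix', 3, 'US')
```

The negative lookaheads in the region class expressions are what keep `h` from also matching codes that begin with `hm` or `h1a`, so each code falls into exactly one class.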

#### Consolidate all of the NASC data

Once all of this information has been ingested and organized, the final NASC `pandas.DataFrame` can be produced for the targeted region classes.

```python
# Consolidate the Echoview NASC export files
df_nasc_all_ages = ingest_nasc.consolidate_echvoiew_nasc(
    df_merged=df_exports_with_regions,
    interval_df=df_intervals,
    region_class_names=["Age-1 Hake", "Age-1", "Hake", "Hake Mix"],
    impute_region_ids=True,
    transect_region_haul_key_df=transect_region_haul_key_all_ages,
)
```
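
One way to picture what targeting specific region classes means is a filter followed by a per-interval sum. The sketch below is purely conceptual: the rows, the column names, and the `sum_nasc_by_interval` helper are all hypothetical, and it does not reproduce the consolidation function's actual logic (for example, it ignores `impute_region_ids` and the transect-region-haul key).

```python
from collections import defaultdict

# Hypothetical merged export rows; the column names are illustrative only
rows = [
    {"transect_num": 1, "interval": 1, "region_class": "Hake", "nasc": 120.0},
    {"transect_num": 1, "interval": 1, "region_class": "Hake Mix", "nasc": 30.0},
    {"transect_num": 1, "interval": 1, "region_class": "Age-1 Hake", "nasc": 15.0},
    {"transect_num": 1, "interval": 2, "region_class": "Hake", "nasc": 80.0},
]


def sum_nasc_by_interval(rows, region_class_names):
    """Hypothetical helper: keep the chosen region classes and sum NASC per interval."""
    wanted = set(region_class_names)
    totals = defaultdict(float)
    for row in rows:
        if row["region_class"] in wanted:
            totals[(row["transect_num"], row["interval"])] += row["nasc"]
    return dict(totals)


all_ages = sum_nasc_by_interval(rows, ["Age-1 Hake", "Hake", "Hake Mix"])
adults_only = sum_nasc_by_interval(rows, ["Hake", "Hake Mix"])
print(all_ages)     # {(1, 1): 165.0, (1, 2): 80.0}
print(adults_only)  # {(1, 1): 150.0, (1, 2): 80.0}
```

Changing the list of region class names changes which backscatter is attributed to the target group, which is why the list passed to the consolidation step should match the region classes defined when processing the region names.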