(biodata-ingestion)=
# Biological data ingestion

## Import necessary modules

First, import the `biological` module from the `ingest` sub-package of the latest version of `Echopop` to load the data.

In [1]:
from echopop.ingest import biological as load_data

In [2]:
from pathlib import Path

DATA_ROOT = Path("C:/Data/EchopopData/echopop_2019")

## Ingesting the data

All of the biodata are expected to come from a single master spreadsheet with three specific sheets: 1) `catch`, 2) `length`, and 3) `specimen`. However, those exact sheet names may not be present in the file, so a `biodata_sheet_map` needs to be defined that links each expected dataset type to its actual sheet name.

In [3]:
BIODATA_SHEET_MAP = {
    "catch": "biodata_catch", 
    "length": "biodata_length",
    "specimen": "biodata_specimen",
}
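Before loading, it can help to confirm that every sheet named in the map actually exists in the workbook. The sketch below uses a hypothetical helper, `missing_sheets`, which is not part of `Echopop`. The list of available sheet names is hard-coded for illustration; in practice it could be read with `pandas.ExcelFile(path).sheet_names`.

```python
def missing_sheets(sheet_map, available_sheets):
    """Return the dataset types whose mapped sheet name is absent (hypothetical helper)."""
    available = set(available_sheets)
    return {
        dataset: sheet
        for dataset, sheet in sheet_map.items()
        if sheet not in available
    }


# Invented sheet names standing in for the contents of a real workbook
example_sheets = ["biodata_catch", "biodata_length", "biodata_gear"]

sheet_map = {
    "catch": "biodata_catch",
    "length": "biodata_length",
    "specimen": "biodata_specimen",
}

# Any entry returned here points at a sheet that would not be found
unmatched = missing_sheets(sheet_map, example_sheets)
print(unmatched)
```

An empty dictionary means every expected dataset type maps to a sheet that is present.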

The data from each sheet also need to be filtered by target ship identifier, survey name, and species code, with an optional offset added to the haul numbering to avoid overlap between ships. Moreover, some column names in the file may not match those expected by `Echopop`, and certain biodata labels such as `"sex"` may need to be recoded. These can be defined via:

In [5]:
SUBSET_DICT = {
    "ships": {
        160: {
            "survey": 201906
        },
        584: {
            "survey": 2019097,
            "haul_offset": 200
        }
    },
    "species_code": [22500]
}
EXPECTED_ECHOPOP_BIODATA_COLUMNS = {
    "frequency": "length_count",
    "haul": "haul_num",
    "weight_in_haul": "weight",
}
BIODATA_LABEL_MAP = {
    "sex": {
        1: "male",
        2: "female",
        3: "unsexed"
    }
}

All of this additional information can then be supplied to `load_biological_data` to read in the biological data file. 

In [6]:
dict_df_bio = load_data.load_biological_data(
    biodata_filepath=DATA_ROOT / "Biological/1995-2023_biodata_redo.xlsx", 
    biodata_sheet_map=BIODATA_SHEET_MAP, 
    column_name_map=EXPECTED_ECHOPOP_BIODATA_COLUMNS, 
    subset_dict=SUBSET_DICT, 
    biodata_label_map=BIODATA_LABEL_MAP
)
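To make the roles of these mappings concrete, the following sketch applies the same kinds of operations to a toy table using plain `pandas`. It illustrates the concepts only (renaming, subsetting, haul offsetting, relabeling) and is not `Echopop`'s implementation; the toy rows and the order of operations are assumptions.

```python
import pandas as pd

# Toy stand-in for one raw sheet; all values are invented
raw = pd.DataFrame({
    "ship": [160, 160, 584, 584],
    "survey": [201906, 201806, 2019097, 2019097],
    "species_code": [22500, 22500, 22500, 21740],
    "haul": [1, 2, 1, 2],
    "sex": [1, 2, 3, 1],
})

subset = {
    "ships": {160: {"survey": 201906}, 584: {"survey": 2019097, "haul_offset": 200}},
    "species_code": [22500],
}

# Rename file columns to the names expected downstream
df = raw.rename(columns={"haul": "haul_num"})

# Keep only the requested species
df = df[df["species_code"].isin(subset["species_code"])]

# Keep only the requested ship/survey pairs, applying any haul offset
parts = []
for ship, opts in subset["ships"].items():
    part = df[(df["ship"] == ship) & (df["survey"] == opts["survey"])].copy()
    part["haul_num"] += opts.get("haul_offset", 0)
    parts.append(part)
df = pd.concat(parts, ignore_index=True)

# Replace numeric sex codes with text labels
df["sex"] = df["sex"].map({1: "male", 2: "female", 3: "unsexed"})
print(df)
```

In this toy case, haul 1 from ship 584 becomes haul 201, so it can no longer collide with haul 1 from ship 160.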

## Removing mismatched hauls

There are some cases where all of the specimens from a particular haul were individually processed. Typically, the summed haul catch weights from the `catch` sheet comprise the bulk weights measured from the non-individual length measurements. However, when an entire catch is represented in `specimen`, the summed weights from `specimen` and `catch`, when combined, effectively double-count the total haul weight from that trawl. These hauls can be removed via `remove_specimen_hauls` from the `biology` module of the `nwfsc_feat` sub-package:

In [7]:
from echopop.workflows.nwfsc_feat import biology

biology.remove_specimen_hauls(dict_df_bio)
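The idea can be sketched with toy tables. This is one plausible reading of the step, not the actual logic of `remove_specimen_hauls`: it assumes that a haul present in `specimen` but absent from `length` was fully individually processed, and drops that haul from `catch`. All values are invented.

```python
import pandas as pd

catch = pd.DataFrame({"haul_num": [1, 2, 3], "weight": [50.0, 12.0, 30.0]})
length = pd.DataFrame({"haul_num": [1, 1, 3], "length": [40.0, 42.0, 38.0]})
specimen = pd.DataFrame({"haul_num": [1, 2, 2], "weight": [0.5, 0.6, 0.7]})

# Hauls that appear in `specimen` but never in `length` (assumed criterion)
specimen_only = sorted(set(specimen["haul_num"]) - set(length["haul_num"]))

# Drop those hauls from the catch table so their weight is only counted once
catch_filtered = catch[~catch["haul_num"].isin(specimen_only)].reset_index(drop=True)
print(catch_filtered)
```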

## Adding age and length bins

Many of the analyses used by `Echopop` assume that the underlying biological datasets are distributed over length, or age and length. Consequently, our ingested biodata need to be "binified" here using the `binify` function from the `utils` module. Here we want to break up age into integer bins from 1 to 22 years at increments of 1 year, and length into float bins from 2 to 80 cm at increments of 2 cm.

In [8]:
import numpy as np
from echopop import utils

# Age bins
AGE_BINS = np.linspace(start=1., stop=22, num=22)
utils.binify(
    data=dict_df_bio, bins=AGE_BINS, bin_column="age",
)

# Length bins
LENGTH_BINS = np.linspace(start=2., stop=80., num=40)
utils.binify(
    data=dict_df_bio, bins=LENGTH_BINS, bin_column="length", 
)
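As a quick check on the bin definitions, the sketch below confirms that 40 evenly spaced points between 2 and 80 are exactly 2 cm apart, and then assigns a few invented lengths to bins with `pandas.cut`. Treating each bin value as the midpoint of its interval is an assumption made for this illustration and may differ from what `binify` does internally.

```python
import numpy as np
import pandas as pd

length_bins = np.linspace(start=2.0, stop=80.0, num=40)

# 40 points spanning 78 cm leave 39 gaps of 2 cm each
spacing = np.diff(length_bins)

# Assumption: each bin value is a midpoint, so edges sit half a width either side
width = spacing.mean()
edges = np.concatenate([[length_bins[0] - width / 2], length_bins + width / 2])

# Invented lengths (cm), labeled with the midpoint of the interval they fall in
lengths = pd.Series([2.4, 19.0, 41.7, 80.5])
midpoints = pd.cut(lengths, bins=edges, labels=length_bins).astype(float)
print(midpoints.tolist())
```

With the default right-closed intervals of `pandas.cut`, a length that falls exactly on an edge (19.0 cm here) is assigned to the lower bin.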

## Fitting length-weight regressions

The next step is to fit the length-weight regressions ({ref}`Eq. 2.6 <eq-26>`) for each sex in the dataset, and then across all fish (inclusive of unsexed fish). This can be done using the `fit_length_weight_regression` function, imported here from `echopop.survey`:

In [10]:
from echopop.survey import fit_length_weight_regression

# Fit length-weight regression
# ---- Create dictionary container
dict_length_weight_coefs = {}
# ---- Regression for all fish
dict_length_weight_coefs["all"] = dict_df_bio["specimen"].assign(sex="all").groupby(["sex"]).apply(
    fit_length_weight_regression,
    include_groups=False
)
# ---- Regression for each sex
dict_length_weight_coefs["sex"] = dict_df_bio["specimen"].groupby(["sex"]).apply(
    fit_length_weight_regression,
    include_groups=False
)
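For intuition, a length-weight relationship of the common allometric form $W = aL^b$ becomes linear after a log transform, so it can be fit with a first-degree polynomial. The sketch below assumes that form and uses synthetic data with invented coefficients; it is not a reproduction of `fit_length_weight_regression`.

```python
import numpy as np

# Synthetic fish following W = a * L**b with a = 0.01 and b = 3 (invented values)
length = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
weight = 0.01 * length**3

# Fit log10(W) = intercept + slope * log10(L)
slope, intercept = np.polyfit(np.log10(length), np.log10(weight), deg=1)

# Predict the weight of a 35 cm fish from the fitted coefficients
predicted = 10.0 ** (intercept + slope * np.log10(35.0))
print(slope, intercept, predicted)
```

Because the synthetic data are noise-free, the fit recovers the exponent as the slope and $\log_{10}(a)$ as the intercept up to floating-point error.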

## Computing the mean weights per length bin

The mean weights per length bin can be computed using the `length_binned_weights` function from the `nwfsc_feat.biology` module. Similar to the length-weight regression fitting, this can be done across all fish as well as for each sex individually. 

In [11]:
import pandas as pd

# Sex-specific (grouped coefficients)
df_binned_weights_sex = biology.length_binned_weights(
    data=dict_df_bio["specimen"],
    length_bins=LENGTH_BINS,
    regression_coefficients=dict_length_weight_coefs["sex"],
    impute_bins=True,
    minimum_count_threshold=5
)

# All fish (single coefficient set)
df_binned_weights_all = biology.length_binned_weights(
    data=dict_df_bio["specimen"].assign(sex="all"),
    length_bins=LENGTH_BINS,
    regression_coefficients=dict_length_weight_coefs["all"],
    impute_bins=True,
    minimum_count_threshold=5,
)

# Combine the pivot tables by adding the "all" column to the sex-specific table
binned_weight_table = pd.concat([df_binned_weights_sex, df_binned_weights_all], axis=1)
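The sketch below shows one way a per-bin mean weight with a regression fallback could work. It assumes, based only on the parameter names, that bins with fewer observations than a minimum count are filled with regression-predicted weights; the actual behavior of `impute_bins` and `minimum_count_threshold` in `length_binned_weights` may differ. All data and fitted values are invented.

```python
import numpy as np
import pandas as pd

# Invented specimen weights (kg) already assigned to length bins (cm)
specimen = pd.DataFrame({
    "length_bin": [20, 20, 20, 22, 24, 24],
    "weight": [0.08, 0.09, 0.10, 0.12, 0.14, 0.16],
})

# Hypothetical regression-predicted weight for every bin, including an empty one
fitted = pd.Series({20: 0.085, 22: 0.11, 24: 0.145, 26: 0.19})
min_count = 2

# Observed mean and sample size per bin, aligned to the full set of bins
stats = specimen.groupby("length_bin")["weight"].agg(["mean", "count"]).reindex(fitted.index)

# Use the observed mean where enough fish were measured, else the fitted value
use_observed = stats["count"] >= min_count
mean_weight = stats["mean"].where(use_observed, fitted)
print(mean_weight)
```

Here the 22 cm bin (one fish) and the empty 26 cm bin fall back to the fitted values, while the 20 cm and 24 cm bins keep their observed means.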