(basic-inversion)=
# Estimate abundance and biomass from backscatter

In [None]:
%run ./weight_proportions.ipynb

This method converts $\textit{NASC}$ into number densities (animals nmi<sup>-2</sup>) via acoustic inversion by relating fish length to acoustic backscatter. One basic type of inversion uses the {ref}`TS-length relationship <ts-l-relationship>`, a linear regression that predicts the average fish $\textit{TS}$ from length. In practice, this means that the stratified length distributions can be leveraged to calculate a mean $\textit{TS}$ for each stratum.
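As a rough illustration of this idea, the sketch below assumes the conventional log-linear form $\textit{TS} = m \log_{10}(L) + b$ with length $L$ in cm, converts each $\textit{TS}$ to the linear backscattering cross-section $\sigma_\text{bs} = 10^{\textit{TS}/10}$, and averages in the linear domain before converting back to decibels. The function names and example lengths are hypothetical and are not part of Echopop; the default slope and intercept match the values used later in this section.

```python
import math


def ts_from_length(length_cm, slope=20.0, intercept=-68.0):
    """TS (dB re 1 m^2), assuming TS = slope * log10(length) + intercept."""
    return slope * math.log10(length_cm) + intercept


def mean_sigma_bs(lengths_cm, slope=20.0, intercept=-68.0):
    """Mean sigma_bs (m^2) over a set of lengths, averaged in the linear domain."""
    sigma = [10 ** (ts_from_length(L, slope, intercept) / 10.0) for L in lengths_cm]
    return sum(sigma) / len(sigma)


# Hypothetical lengths (cm) for a single stratum
lengths = [20.0, 40.0, 50.0]
sigma_bs = mean_sigma_bs(lengths)

# Convert the mean sigma_bs back to a mean TS in dB
mean_ts = 10.0 * math.log10(sigma_bs)
```

Averaging is done on $\sigma_\text{bs}$ rather than on $\textit{TS}$ because decibel values are logarithmic and do not average arithmetically.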

## Import necessary modules

With all of the biological and acoustic data ingested and processed, we can use the `InversionLengthTS` class from the `inversion` sub-package to convert our integrated backscatter into population estimates.

In [4]:
from echopop.inversion import InversionLengthTS

## The `InversionLengthTS` class

The `InversionLengthTS` class is initialized using a parameter dictionary with the following required keys:

- `"ts_length_regression"`: A dictionary with values for the slope and intercept of the $\textit{TS}$-length regression
- `"stratify_by"`: A single column name, or a list of column names, over which to compute the mean $\sigma_\text{bs}$ (and therefore $\textit{TS}$).

Beyond these mandatory keys, several optional keys are available:

- `"expected_strata"`: An array of specific strata to expect in the biological data. This can be further used to define any particular strata that should be explicitly included in the analysis.
- `"impute_missing_strata"`: A boolean variable that determines whether or not to impute $\sigma_\text{bs}$ for any strata where it is missing. This usually occurs when there is insufficient biological data in a particular stratum.
- `"haul_replicates"`: A boolean variable. When `True`, individual hauls are used as replicates instead of individual lengths: the average $\sigma_\text{bs}$ of each haul is treated as a single sample, and these haul-level values are then averaged over their respective strata. This approach has the benefit of avoiding statistical issues due to pseudoreplication.
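To make the effect of `"haul_replicates"` concrete, the sketch below compares the two averaging schemes on made-up data. It is not Echopop's implementation: the helper names, the haul dictionary, and the log-linear $\textit{TS}$-length form are all assumptions for illustration.

```python
import math


def sigma_bs(length_cm, slope=20.0, intercept=-68.0):
    """sigma_bs (m^2), assuming TS = slope * log10(length) + intercept."""
    return 10 ** ((slope * math.log10(length_cm) + intercept) / 10.0)


def stratum_mean_sigma_bs(lengths_by_haul, haul_replicates=True):
    """Mean sigma_bs for one stratum from a {haul_num: [lengths]} mapping."""
    if haul_replicates:
        # Average within each haul first, then average the haul means
        haul_means = [
            sum(sigma_bs(L) for L in lengths) / len(lengths)
            for lengths in lengths_by_haul.values()
        ]
        return sum(haul_means) / len(haul_means)
    # Otherwise pool every individual length, regardless of haul
    pooled = [sigma_bs(L) for lengths in lengths_by_haul.values() for L in lengths]
    return sum(pooled) / len(pooled)


# Hypothetical stratum: haul 1 caught four small fish, haul 2 one large fish
hauls = {1: [20.0, 20.0, 20.0, 20.0], 2: [50.0]}
by_haul = stratum_mean_sigma_bs(hauls, haul_replicates=True)
pooled = stratum_mean_sigma_bs(hauls, haul_replicates=False)
```

With pooling, the haul with many fish dominates the mean; with haul replicates, each haul contributes equally, so `by_haul` is larger than `pooled` in this example.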

So now we can initialize our `InversionLengthTS` object:

In [11]:
MODEL_PARAMETERS = {
    "ts_length_regression": {
        "slope": 20.,
        "intercept": -68.
    },
    "stratify_by": ["stratum_ks"],
    "expected_strata": df_dict_strata["ks"].stratum_num.unique(),
    "impute_missing_strata": True,
    "haul_replicates": True,
}

# Initiate object to perform inversion
invert_hake = InversionLengthTS(MODEL_PARAMETERS)

We can verify that the parameters used to initialize the object are actually being used via the `model_params` attribute:

In [12]:
invert_hake.model_params

{'ts_length_regression': {'slope': 20.0, 'intercept': -68.0},
 'stratify_by': ['stratum_ks'],
 'expected_strata': array([1, 2, 3, 4, 0, 5, 6, 7, 8]),
 'impute_missing_strata': True,
 'haul_replicates': True}

The next step is to run the inversion via the `InversionLengthTS.invert` method. This has two required arguments: 

- `"df_nasc"`: A `pandas.DataFrame` with the transect $\textit{NASC}$ data.
- `"df_length"`: An individual `pandas.DataFrame`, or list of `pandas.DataFrame`s, containing length measurements sorted by each haul (`"haul_num"`) and strata (e.g., `"stratum_ks"`).

Since there are two separate length datasets in `dict_df_bio["length"]` and `dict_df_bio["specimen"]`, these can both be used to construct the overall length distribution required for the inversion:

In [13]:
df_nasc_all_ages = invert_hake.invert(df_nasc=df_nasc_all_ages,
                                      df_length=[dict_df_bio["length"], dict_df_bio["specimen"]])

(sigma-bs-strata)=

Before investigating the results, we can inspect the mean $\sigma_\text{bs}$ for each stratum via the `sigma_bs_strata` attribute:

In [14]:
invert_hake.sigma_bs_strata

| stratum_ks | sigma_bs |
|---|---|
| 0 | 0.000425 |
| 1 | 8.4e-05 |
| 2 | 7e-05 |
| 3 | 0.000114 |
| 4 | 0.000191 |
| 5 | 0.000295 |
| 6 | 0.000247 |
| 7 | 0.000384 |
| 8 | 0.000481 |


The resulting `df_nasc_all_ages` output from `invert_hake` now has the column `"number_density"`, which represents the number of animals nmi<sup>-2</sup>.

In [16]:
df_nasc_all_ages.columns

Index(['stratum_ks', 'transect_num', 'region_id', 'distance_s', 'distance_e',
       'latitude', 'longitude', 'transect_spacing', 'layer_mean_depth',
       'layer_height', 'bottom_depth', 'nasc', 'haul_num', 'mean length',
       'year', 'stratum_inpfc', 'stratum name', 'nasc_proportion',
       'geostratum_ks', 'geostratum_inpfc', 'number_density'],
      dtype='object')
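The number densities follow from the standard echo-integration relationship $\rho_A = \textit{NASC} / (4\pi\,\bar{\sigma}_\text{bs})$, where $\bar{\sigma}_\text{bs}$ is the mean value for the stratum that an interval belongs to. The sketch below applies that formula to a single interval. The function name and the $\textit{NASC}$ value are hypothetical, and the `invert` method may apply additional steps (for example, scaling by `"nasc_proportion"`) that are not shown here.

```python
import math


def number_density(nasc, sigma_bs):
    """Animals per nmi^2 from NASC (m^2 nmi^-2) and mean sigma_bs (m^2)."""
    return nasc / (4.0 * math.pi * sigma_bs)


# Hypothetical interval in stratum 1, using the stratum 1 sigma_bs shown earlier
nasc_example = 100.0
sigma_bs_stratum = 8.4e-05
rho = number_density(nasc_example, sigma_bs_stratum)
```

Because $\bar{\sigma}_\text{bs}$ sits in the denominator, strata with smaller fish (lower $\sigma_\text{bs}$) yield more animals for the same $\textit{NASC}$.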

## Estimating abundance and biomass

The new number density estimates can be used to estimate abundance (number of animals) and biomass (kg). First, we have to estimate the along-transect distances between intervals, which are needed later to estimate the area of each interval. The `compute_interval_distance` function from the `survey.transect` module serves this purpose.

In [17]:
from echopop.survey import transect

transect.compute_interval_distance(df_nasc=df_nasc_all_ages, interval_threshold=0.05)


The `interval_threshold` argument is used for detecting (and removing) erroneously small or large intervals relative to the median interval distance. Let $d$ and $t$ represent interval distance (nmi) and transect, respectively. An interval is flagged when:

$$
    |d_t - \text{med}(d_t)| > \Delta,
$$

where $\Delta$ is the distance threshold.
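The flagging criterion can be sketched in a few lines. This is only an illustration of the inequality above, applied within a single transect and treating $\Delta$ as an absolute distance in nmi; the function name and distances are hypothetical, and how `compute_interval_distance` subsequently handles flagged intervals is not shown here.

```python
from statistics import median


def flag_outlier_intervals(distances, threshold=0.05):
    """Flag intervals whose distance deviates from the median by more than threshold."""
    med = median(distances)
    return [abs(d - med) > threshold for d in distances]


# Hypothetical interval distances (nmi) along one transect
distances = [0.50, 0.51, 0.49, 0.50, 0.92, 0.50]
flags = flag_outlier_intervals(distances, threshold=0.05)
```

Here only the fifth interval (0.92 nmi) exceeds the threshold relative to the 0.50 nmi median.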

With the interval distances calculated, the areas can be readily computed via:

In [18]:
df_nasc_all_ages["area_interval"] = (
    df_nasc_all_ages["transect_spacing"] * df_nasc_all_ages["distance_interval"]
)

Now the number densities can be combined with the areas for each interval to calculate along-transect abundances and biomasses. This can be done in a variety of ways, but in this case we can use the FEAT-specific functions `compute_abundance` and `compute_biomass` from the `nwfsc_feat.biology` module. For abundance, we need to use the previously calculated number proportions that serve to "normalize" the length distributions:

In [20]:
from echopop.workflows.nwfsc_feat import biology

biology.compute_abundance(
    dataset=df_nasc_all_ages,
    stratify_by=["stratum_ks"],
    group_by=["sex"],
    exclude_filter={"sex": "unsexed"},
    number_proportions=dict_df_number_proportion
)

The argument `group_by`, when not an empty list, tells the function to apportion the abundance values across the groups in the named column(s). In this case, not only was the column `"abundance"` appended to our DataFrame `df_nasc_all_ages`, but also `"abundance_female"` and `"abundance_male"`. This also applies to the column `"number_density"`:

In [21]:
df_nasc_all_ages.columns

Index(['stratum_ks', 'transect_num', 'region_id', 'distance_s', 'distance_e',
       'latitude', 'longitude', 'transect_spacing', 'layer_mean_depth',
       'layer_height', 'bottom_depth', 'nasc', 'haul_num', 'mean length',
       'year', 'stratum_inpfc', 'stratum name', 'nasc_proportion',
       'geostratum_ks', 'geostratum_inpfc', 'number_density',
       'distance_interval', 'area_interval', 'abundance',
       'number_density_female', 'number_density_male', 'abundance_female',
       'abundance_male'],
      dtype='object')

When a dictionary is supplied to the `exclude_filter` argument, the defined groups (e.g., `"sex": "unsexed"`) are removed from the biological data, and their respective proportions are redistributed among the non-excluded groups. So in this case, `exclude_filter={"sex": "unsexed"}` instructed the function to remove all unsexed fish.
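The arithmetic behind these two steps can be sketched as follows. The sketch assumes that "redistributed" means the remaining proportions are renormalized to sum to one, which may differ from what `compute_abundance` does internally; the function names and all numbers are hypothetical.

```python
def redistribute(proportions, exclude):
    """Drop excluded groups and renormalize the rest to sum to 1 (assumed behavior)."""
    kept = {k: v for k, v in proportions.items() if k not in exclude}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}


def apportion_abundance(number_density, area, proportions):
    """Total abundance for an interval, plus its split across groups."""
    abundance = number_density * area  # animals nmi^-2 * nmi^2
    return abundance, {k: abundance * p for k, p in proportions.items()}


# Hypothetical number proportions for one stratum
props = redistribute(
    {"female": 0.45, "male": 0.40, "unsexed": 0.15}, exclude={"unsexed"}
)
total, by_sex = apportion_abundance(number_density=1000.0, area=5.0, proportions=props)
```

After renormalization the female and male abundances sum to the interval total, which is the property the redistribution is meant to preserve.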

(stratum-weights)=

With these abundances, we can then shift to calculating the biomass and areal biomass densities (kg nmi<sup>-2</sup>). This first requires calculating the mean weight per stratum using the `stratum_averaged_weight` function from the `survey.proportions` module:

In [22]:
df_averaged_weight = get_proportions.stratum_averaged_weight(
    proportions_dict=dict_df_number_proportion,
    binned_weight_table=binned_weight_table,
    stratify_by=["stratum_ks"],
    group_by=["sex"],
)

This produces a `pandas.DataFrame` indexed by `stratum_ks`, with one column for each unique value of the `group_by` column (i.e., `"sex"`) alongside an `"all"` column:

In [24]:
display(df_averaged_weight)

| stratum_ks | all | female | male |
|---|---|---|---|
| 0 | 0.707167 | 0.763183 | 0.621742 |
| 1 | 0.070003 | 0.071017 | 0.068227 |
| 2 | 0.052842 | 0.054119 | 0.054396 |
| 3 | 0.125313 | 0.143962 | 0.10305 |
| 4 | 0.2639 | 0.291243 | 0.237447 |
| 5 | 0.500277 | 0.515293 | 0.476358 |
| 6 | 0.394585 | 0.389276 | 0.393118 |
| 7 | 0.766103 | 0.810593 | 0.666063 |
| 8 | 1.158306 | 1.186256 | 0.962219 |
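Conceptually, a stratum-averaged weight is a proportion-weighted mean of the binned weights. The sketch below shows only that core idea for a single stratum and group; `stratum_averaged_weight` may combine several proportion tables in ways not captured here, and the function name, length bins, and values are hypothetical.

```python
def averaged_weight(number_proportions, binned_weights):
    """Proportion-weighted mean weight (kg) across length bins."""
    return sum(p * binned_weights[length_bin] for length_bin, p in number_proportions.items())


# Hypothetical length bins (cm) with number proportions and mean weights (kg)
proportions = {20: 0.5, 30: 0.3, 40: 0.2}
weights = {20: 0.06, 30: 0.20, 40: 0.45}
mean_weight = averaged_weight(proportions, weights)
```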


These can then be combined with the abundances to get the biomass and biomass density estimates:

In [25]:
biology.compute_biomass(
    dataset=df_nasc_all_ages,
    stratify_by=["stratum_ks"],
    group_by=["sex"],
    df_average_weight=df_averaged_weight,
)

Just like for `"number_density"` and `"abundance"`, not only are the columns `"biomass"` and `"biomass_density"` appended to `df_nasc_all_ages`, but so are columns with the suffixes `"_female"` and `"_male"`:

In [26]:
df_nasc_all_ages.columns

Index(['stratum_ks', 'transect_num', 'region_id', 'distance_s', 'distance_e',
       'latitude', 'longitude', 'transect_spacing', 'layer_mean_depth',
       'layer_height', 'bottom_depth', 'nasc', 'haul_num', 'mean length',
       'year', 'stratum_inpfc', 'stratum name', 'nasc_proportion',
       'geostratum_ks', 'geostratum_inpfc', 'number_density',
       'distance_interval', 'area_interval', 'abundance',
       'number_density_female', 'number_density_male', 'abundance_female',
       'abundance_male', 'biomass_density_female', 'biomass_density_male',
       'biomass_density', 'biomass_female', 'biomass_male', 'biomass'],
      dtype='object')
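The underlying conversion is a multiplication by the stratum mean weight: biomass is abundance times mean weight, and biomass density is number density times mean weight. The sketch below applies this to two made-up intervals using the `"all"` weights for strata 1 and 4 from `df_averaged_weight`. The function name and the interval values are hypothetical, and `compute_biomass` itself may differ in detail.

```python
def add_biomass(interval, mean_weight_by_stratum):
    """Return a copy of an interval record with biomass (kg) and biomass density (kg nmi^-2)."""
    weight = mean_weight_by_stratum[interval["stratum_ks"]]
    result = dict(interval)
    result["biomass"] = interval["abundance"] * weight
    result["biomass_density"] = interval["number_density"] * weight
    return result


# Mean weights (kg) for strata 1 and 4, taken from the "all" column
mean_weight = {1: 0.070003, 4: 0.2639}

# Hypothetical intervals with number densities and abundances already computed
intervals = [
    {"stratum_ks": 1, "number_density": 1000.0, "abundance": 5000.0},
    {"stratum_ks": 4, "number_density": 200.0, "abundance": 1000.0},
]
with_biomass = [add_biomass(row, mean_weight) for row in intervals]
```

The same multiplication, using the `"female"` and `"male"` weight columns with the sex-specific abundances, would give the suffixed columns.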