# **Example overview of the `echopop` dataflow**

## **`Survey`-class initialization**

Import the latest version of `echopop`.

In [1]:
import pprint

from echopop.survey import Survey

Initialize the `Survey` object by loading in input data (`Survey.input`) and configuration settings (`Survey.config`). The former reads in data from all the defined input files contained within the `./config_files/survey_year_2019_config.yml` configuration file. The latter reads in various arguments as well as the file paths that point to the input files. 

In [2]:
survey = Survey( init_config_path = "C:/Users/Brandyn/Documents/GitHub/echopop/config_files/initialization_config.yml" ,
                 survey_year_config_path = "C:/Users/Brandyn/Documents/GitHub/echopop/config_files/survey_year_2019_config.yml" )
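
The absolute paths in the cell above are specific to one machine. As a minimal sketch, assuming the configuration files sit in a `config_files` folder relative to the working directory (an assumption; point `config_dir` at your own checkout), the same two paths can be built portably with `pathlib`:

```python
from pathlib import Path

# Assumed location of the configuration folder; change this to match your setup
config_dir = Path("config_files")

init_config_path = config_dir / "initialization_config.yml"
survey_year_config_path = config_dir / "survey_year_2019_config.yml"

# The resulting strings could then be passed to `Survey` as in the cell above:
# survey = Survey( init_config_path = str( init_config_path ) ,
#                  survey_year_config_path = str( survey_year_config_path ) )
print(init_config_path.as_posix())
print(survey_year_config_path.as_posix())
```

The `Survey(...)` call is left as a comment so that the snippet runs without the input data being present.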

Not only are all the necessary acoustic, biological, kriging, and stratification data imported and contained within `survey`, but they can also be parsed in a relatively straightforward manner. There are five `Survey`-class attributes to be aware of:
* `Survey.meta`: this is currently undeveloped, but this is where necessary information such as the date the object was created and general data workflow/provenance would be collected.
* `Survey.config`: this stores the background configuration settings. 
* `Survey.input`: this contains the imported acoustic, biological, kriging, and stratification data. This can be further investigated via the various nested dictionaries that correspond to specific types of dataset. 
* `Survey.analysis`: this is the working directory that contains relevant intermediate data products and calculations that may be of interest to the user and/or are required for later calculations.
* `Survey.results`: this stores the overall results of each analysis.
  
## **Initial data processing**

**`Survey.meta`**

As previously mentioned, `Survey.meta` is undeveloped, but the `provenance` key will be iteratively updated with the performed analyses. Additional metadata can also be appended to this attribute.

In [3]:
pprint.pprint( survey.meta )

{'date': '2024-05-31 18:23:15', 'provenance': {}}


**`Survey.config`**

This attribute contains a variety of nested dictionaries that help to organize the entries in an intentional format that ideally minimizes ambiguity on how to access the associated values. Accessible dictionaries can be listed via `survey.config.keys()`:

In [4]:
survey.config.keys( )

dict_keys(['stratified_survey_mean_parameters', 'TS_length_regression_parameters', 'geospatial', 'survey_year', 'species', 'CAN_haul_offset', 'data_root_dir', 'biological', 'stratification', 'NASC', 'kriging'])

The overall dictionary structure of `survey.config` can also be accessed. Although not required for printing out the values in this attribute, the `pprint` library is helpful for formatting nested dictionaries into a legible format in both the console and interactive notebooks.

In [5]:
pprint.pprint( survey.config )

{'CAN_haul_offset': 200,
 'NASC': {'all_ages': {'filename': 'Exports/US_CAN_detailsa_2019_table1y+_ALL_final '
                                   '- updated.xlsx',
                       'sheetname': 'Sheet1'},
          'no_age1': {'filename': 'Exports/US_CAN_detailsa_2019_table2y+_ALL_final '
                                  '- updated.xlsx',
                      'sheetname': 'Sheet1'}},
 'TS_length_regression_parameters': {'pacific_hake': {'TS_L_intercept': -68.0,
                                                      'TS_L_slope': 20.0,
                                                      'length_units': 'cm',
                                                      'number_code': 22500}},
 'biological': {'catch': {'CAN': {'filename': 'Biological/CAN/2019_biodata_catch_CAN.xlsx',
                                  'sheetname': 'biodata_catch_CAN'},
                          'US': {'filename': 'Biological/US/2019_biodata_catch.xlsx',
                                 'sheetname': 'biod

**`Survey.input`**

Similar to `Survey.config`, the input data are grouped into various nested dictionaries. Data contained within the `Survey.input` attribute are specifically stored in four general nested dictionaries: `acoustics`, `biology`, `spatial`, and `statistics`. 

In [6]:
survey.input.keys()

dict_keys(['acoustics', 'biology', 'spatial', 'statistics'])

This results in the following branched data structure for `Survey.input`:
* `acoustics`
  * `nasc_df`: acoustic trawl data (all age and age-2+ NASC)
* `biology`
  * `catch_df`: unaged haul weight totals
  * `distributions`
    * `age_bins_df`: age distribution histogram bins
    * `length_bins_df`: length distribution histogram bins
  * `haul_to_transect_df`: haul-to-transect key that links haul numbers to their respective transects
  * `length_df`: unaged length measurements
  * `specimen_df`: aged length and weight measurements
* `spatial`
  * `strata_df`: the `KS` stratum definitions and fraction of hake for each haul
  * `geo_strata_df`: latitudinal limits of the `KS` strata
  * `inpfc_strata_df`: the `INPFC` stratum definitions and their respective latitudinal limits
* `statistics`
  * `kriging`
    * `mesh_df`: kriging mesh
    * `isobath_200m_df`: 200 m isobath coordinates
    * `model_config`: dictionary comprising all required arguments for the kriging analysis
  * `variogram`
    * `model_config`: dictionary comprising all required arguments for the variogram analysis

## **`Survey.transect_analysis(...)`**

`````{admonition} Type-hinting
:class: tip
Hover your cursor over the various functions included in the code blocks below to get additional type hints and context for usage
`````

The method `Survey.transect_analysis(...)` populates various analysis variables (`Survey.analysis`) and results (`Survey.results`). This class-method currently takes four user arguments:

* `species_id (integer, list)`: the species number code(s) (default: `22500`)
* `exclude_age1 (boolean)`: whether age-1 fish should be excluded from the analysis (default: `True`)
* `stratum (string)`: the stratum used for the various acoustic and biological calculations (default: `'ks'`)
* `verbose (boolean)`: dialogue messages will appear in the console including a summary report of the results when this is set to `True` (default: `True`) 
  
This is the primary biological data processing workhorse that is further used for later analyses, such as computing the number and weight proportions across all animals.

In [7]:
survey.transect_analysis( species_id = 22500 , exclude_age1 = True , stratum = 'ks' , verbose = True )


    --------------------------------
    TRANSECT RESULTS
    --------------------------------
    | Variable: Biomass (kmt)
    | Age-1 fish excluded: True
    | Stratum definition: KS
    --------------------------------
    GENERAL RESULTS
    --------------------------------
    Total biomass: 1651.1 kmt
        Age-1: 7.9 kmt
        Age-2+: 1643.2 kmt
    Total female biomass: 832.2 kmt
        Age-1: 4.0 kmt
        Age-2+: 828.2 kmt
    Total male biomass: 818.5 kmt
        Age-1: 3.9 kmt
        Age-2+: 814.6 kmt
    Total unsexed biomass: 0.4 kmt
    Total mixed biomass: 36.8 kmt
    --------------------------------


A variety of intermediate data products are stored in `Survey.analysis`, currently organized into four nested dictionaries:
* `kriging`: intermediate results specific to the kriging analysis (`Survey.kriging_analysis(...)`)
* `settings`: this provides a full recording of user-inputs and other variable definitions used for each analysis to improve replicability
* `stratified`: intermediate results specific to the stratified sampling analysis (`Survey.stratified_analysis(...)`)
* `transect`: intermediate results specific to the transect analysis (`Survey.transect_analysis(...)`)

In [8]:
survey.analysis.keys( )

dict_keys(['transect', 'settings', 'stratified'])

Since `Survey.transect_analysis(...)` has been run, the specific arguments used for the analysis can be directly accessed via:

In [9]:
pprint.pprint( survey.analysis[ 'settings' ][ 'transect' ] )

{'exclude_age1': True,
 'species_id': 22500,
 'stratum': 'ks',
 'stratum_name': 'stratum_num'}
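
Because the settings are recorded to improve replicability, it can be convenient to write them out next to the results. The sketch below serializes a copy of the dictionary printed above with the standard `json` module; it assumes the settings contain only JSON-compatible values (booleans, numbers, strings), which holds for the entries shown here but may not hold for every analysis.

```python
import json

# Copy of the output above; stand-in for survey.analysis[ 'settings' ][ 'transect' ]
transect_settings = {
    "exclude_age1": True,
    "species_id": 22500,
    "stratum": "ks",
    "stratum_name": "stratum_num",
}

# Serialize to a text form that can be saved alongside the results
serialized = json.dumps(transect_settings, indent=2, sort_keys=True)
print(serialized)

# Reading the text back recovers the same settings
restored = json.loads(serialized)
```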


The intermediate data products can be similarly accessed under the `transect` dictionary within `Survey.analysis`: 

In [10]:
survey.analysis[ 'transect' ].keys()

dict_keys(['acoustics', 'biology', 'coordinates'])

The results from each analysis are then stored within the `Survey.results` attribute:

In [11]:
survey.results.keys()

dict_keys(['transect', 'stratified', 'kriging'])

So we can generally glean all results recorded within `Survey.results` and also access those specific to `Survey.transect_analysis(...)` within `transect`:

In [12]:
pprint.pprint( survey.results )

{'kriging': {},
 'stratified': {},
 'transect': {'biomass_summary_df':        sex  biomass_adult  biomass_age1   biomass_all
0      all   1.643215e+09  7.869992e+06  1.651085e+09
1   female   8.282280e+08  3.950822e+06  8.321788e+08
2     male   8.146258e+08  3.919170e+06  8.185449e+08
3  unsexed   3.609296e+05  0.000000e+00  3.609296e+05
4    mixed   3.680784e+07 -4.656613e-10  3.680784e+07}}


In [13]:
survey.results[ 'transect' ]

{'biomass_summary_df':        sex  biomass_adult  biomass_age1   biomass_all
 0      all   1.643215e+09  7.869992e+06  1.651085e+09
 1   female   8.282280e+08  3.950822e+06  8.321788e+08
 2     male   8.146258e+08  3.919170e+06  8.185449e+08
 3  unsexed   3.609296e+05  0.000000e+00  3.609296e+05
 4    mixed   3.680784e+07 -4.656613e-10  3.680784e+07}
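
The values in `biomass_summary_df` are much larger than those in the printed report. Comparing the two suggests that the table is expressed in kg while the report uses kmt (an inference from the numbers, not something stated by the output). A minimal sketch of the conversion, using a plain-dictionary copy of the `biomass_all` column:

```python
# `biomass_all` column copied from the output above; units assumed to be kg
biomass_all_kg = {
    "all": 1.651085e09,
    "female": 8.321788e08,
    "male": 8.185449e08,
    "unsexed": 3.609296e05,
    "mixed": 3.680784e07,
}

KG_PER_KMT = 1e6  # 1 kmt = 1,000 metric tons = 1,000,000 kg

# Convert to kmt and round to one decimal place, as in the printed report
biomass_all_kmt = {
    sex: round(kg / KG_PER_KMT, 1) for sex, kg in biomass_all_kg.items()
}
print(biomass_all_kmt)
# {'all': 1651.1, 'female': 832.2, 'male': 818.5, 'unsexed': 0.4, 'mixed': 36.8}
```

These match the totals in the transect report. The `-4.656613e-10` entry under `biomass_age1` for mixed fish is presumably floating-point round-off and can be treated as zero.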

## **`Survey.stratified_analysis(...)`**

`Survey.stratified_analysis(...)` computes various stratified statistics, including the coefficient of variation (*CV*) estimates using the Jolly and Hampton (1990) stratified sampling method. There are a variety of arguments used for this function: 
* `dataset ('transect', 'kriging')`: data input selection (default: `'transect'`). This will use either the results of `Survey.transect_analysis(...)` or `Survey.kriging_analysis(...)`
* `stratum ('ks','inpfc')`: the stratum used for the various acoustic and biological calculations (default: `'inpfc'`)
* `variable ('abundance', 'biomass', 'nasc')`: the data variable that will be used for the stratified resampling analysis (default: `'biomass'`)
* `bootstrap_ci`: the confidence interval (default: `0.95`) used for computing the uncertainty intervals around population and coefficient of variation (*CV*) estimates
* `bootstrap_ci_method`: the specific method/algorithm used for computing the bootstrap intervals (default: `'BCa'`)
* `bootstrap_ci_method_alt`: an optional argument that provides an alternative `bootstrap_ci_method` in case of skewness issues
* `bootstrap_adjust_bias`: a boolean argument (default: `True`) that determines whether the bootstrap intervals should be adjusted to account for the bootstrap bias
* `verbose (boolean)`: dialogue messages will appear in the console including a summary report of the results when this is set to `True` (default: `True`)

There are also analysis-specific optional arguments that are used depending on how `dataset` is defined:

* `mesh_transect_per_latitude (integer)`: the number of virtual transects per degree latitude when `dataset = 'kriging'`
* `transect_sample`: the resampling proportion used to resample transects within each stratum without replacement (default: inherits value from `Survey.config['stratified_survey_mean_parameters']`)
* `transect_replicates`: the number of resampling iterations that will be run (default: inherits value from `Survey.config['stratified_survey_mean_parameters']`)
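
To make the roles of `transect_sample` and `transect_replicates` concrete, the sketch below resamples transects within each stratum without replacement and computes a *CV* for every replicate. The estimator is written in the style of Jolly and Hampton (1990): a transect-length-weighted mean per stratum combined across strata by area. It is a simplified illustration with made-up numbers, and the function names are hypothetical; `echopop`'s own implementation may differ.

```python
import math
import random


def stratum_mean_and_variance(densities, lengths):
    """Length-weighted mean density of one stratum and the variance of that mean."""
    n = len(densities)
    mean_length = sum(lengths) / n
    weights = [length / mean_length for length in lengths]
    mean = sum(w * d for w, d in zip(weights, densities)) / n
    squared = sum(w**2 * (d - mean) ** 2 for w, d in zip(weights, densities))
    return mean, squared / (n * (n - 1))


def stratified_cv(strata):
    """CV of the area-weighted total across all strata."""
    total, variance = 0.0, 0.0
    for stratum in strata:
        mean, var = stratum_mean_and_variance(stratum["density"], stratum["length"])
        total += stratum["area"] * mean
        variance += stratum["area"] ** 2 * var
    return math.sqrt(variance) / total


def resampled_cv(strata, transect_sample, transect_replicates, seed=0):
    """One CV per replicate, subsampling transects within each stratum."""
    rng = random.Random(seed)
    cvs = []
    for _ in range(transect_replicates):
        replicate = []
        for stratum in strata:
            n = len(stratum["density"])
            k = max(2, round(transect_sample * n))  # a variance needs >= 2 transects
            keep = rng.sample(range(n), k)  # sampling without replacement
            replicate.append({
                "area": stratum["area"],
                "density": [stratum["density"][i] for i in keep],
                "length": [stratum["length"][i] for i in keep],
            })
        cvs.append(stratified_cv(replicate))
    return cvs


# Made-up strata: area, per-transect mean density, and transect length
strata = [
    {"area": 10.0, "density": [1.0, 3.0, 2.0, 4.0], "length": [5.0, 6.0, 5.5, 4.5]},
    {"area": 20.0, "density": [0.5, 1.5, 1.0, 2.5], "length": [8.0, 7.0, 9.0, 8.5]},
]
cvs = resampled_cv(strata, transect_sample=0.75, transect_replicates=100)
print(f"mean CV over {len(cvs)} replicates: {sum(cvs) / len(cvs):.3f}")
```

The list of per-replicate values is the kind of resampled distribution from which a mean *CV* and its bootstrap interval could then be summarized.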


In [14]:
survey.stratified_analysis( dataset = 'transect' , stratum = 'inpfc' , variable = 'biomass' , bootstrap_ci = 0.95 , bootstrap_ci_method = "BCa" , bootstrap_ci_method_alt = "t-jackknife", verbose = True )

```{warning}
You cannot run `Survey.stratified_analysis( dataset = 'kriging' , ... )` unless you have already computed the kriging results via `Survey.kriging_analysis(...)`.
```
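
One way to guard against this in a script is to check whether the kriging results exist before requesting `dataset = 'kriging'`. The sketch below relies on the observation from the earlier `survey.results` output that the `'kriging'` entry is an empty dictionary before the kriging analysis has been run; `kriging_results_available` is a hypothetical helper, and the dictionaries are stand-ins with placeholder values.

```python
def kriging_results_available(results):
    """True when the 'kriging' entry exists and is non-empty."""
    return bool(results.get("kriging"))


# Stand-ins for survey.results before and after a kriging run (placeholder values)
results_before = {"transect": {"biomass_summary_df": "..."}, "stratified": {}, "kriging": {}}
results_after = {"transect": {}, "stratified": {}, "kriging": {"survey_cv": "..."}}

for label, results in [("before", results_before), ("after", results_after)]:
    # Fall back to the transect results when kriging has not been run yet
    dataset = "kriging" if kriging_results_available(results) else "transect"
    print(label, "->", dataset)
```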

Depending on how `dataset` is parameterized, the intermediate and final results are stored within a sub-dictionary so the outputs from both `dataset = 'transect'` and `dataset = 'kriging'` can be compared. For `Survey.analysis`, these are separated immediately below the top-level dictionary: 

In [None]:
survey.analysis[ 'stratified' ].keys( )

Here the resampled distributions of multiple statistics can be directly accessed for additional uncertainty analyses and visualizing the underlying statistical distributions: 

In [None]:
survey.analysis[ 'stratified' ][ 'transect' ].keys()

In [None]:
survey.analysis[ 'stratified' ][ 'transect' ][ 'stratified_replicates_df' ]

The final results stored within `Survey.results` are formatted in an identical way:

In [None]:
survey.results[ 'stratified' ].keys( )

In [None]:
survey.results[ 'stratified' ][ 'transect' ].keys()

In [None]:
pprint.pprint( survey.results[ 'stratified' ][ 'transect' ])

## **`Survey.kriging_analysis(...)`**

`Survey.kriging_analysis(...)` computes the kriged estimates for the target variable via ordinary kriging with an adaptive search radius. The arguments to `Survey.kriging_analysis(...)` include:
* `coordinate_transform (boolean)`: when `True`, the transect and mesh longitude/latitude coordinates are transformed to a standardized format as x/y (default: `True`)
* `crop_method ('interpolation', 'convex_hull')`: when `extrapolate = False`, this determines the method used for cropping the kriging mesh. Setting `crop_method = 'interpolation'` (*default*) resamples the latitudinal resolution of the mesh grid and interpolates over the extent of the eastern and western endpoints of each transect line. This is conducted in a piece-wise fashion to account for the island of Haida Gwaii. Setting `crop_method = 'convex_hull'` uses a polygon-based approach for cropping the mesh grid based on the survey extent.
* `extrapolate (boolean)`: when `True`, the entire kriging mesh is used. Otherwise, different methods are used to crop the kriging mesh to limit extrapolation beyond the extent of the survey transects.
* `stratum ('ks','inpfc')`: the stratum used for mapping the defined kriged `variable` (default: `'ks'`) 
* `variable (string)`: the data variable that will be used for the kriging analysis (default: `'biomass_density'`)
* `verbose (boolean)`: dialogue messages will appear in the console including a summary report of the results when this is set to `True` (default: `True`)

There are also analysis-specific optional arguments that are used depending on how `crop_method` is defined:
* When `crop_method = 'interpolation'`:
  * `latitude_resolution (float)`: the updated latitudinal resolution (**in nmi**) used for interpolation
* When `crop_method = 'convex_hull'`:
  * `mesh_buffer_distance`: this is a dilation factor (**in nmi**) that expands/buffers the extent of the polygon defining the survey extent (default: `1.25`)
  * `num_nearest_transects`: this defines the number of nearest neighboring transects used for generating smaller polygons that are then constructed into the survey-wide polygon

Lastly, there are additional arguments that are optional since they are otherwise inherited from various parts of the `Survey` object: 
* `kriging_parameters (dictionary)`: a dictionary containing various kriging parameter variables and arguments
* `projection (string)`: an EPSG string that defines the mapping projection
* `variogram_parameters (dictionary)`: a dictionary containing various variogram parameter variables and arguments
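
For readers unfamiliar with ordinary kriging, the sketch below shows the core calculation in plain Python: semivariances between samples form a linear system, bordered by a constraint that forces the weights to sum to one, and the estimate is the weighted sum of the sample values. It is deliberately simplified and is not how `echopop` is implemented: it uses every sample rather than an adaptive search radius, an exponential variogram chosen only for illustration, and made-up coordinates and parameter values. All function names are hypothetical.

```python
import math


def exponential_variogram(h, sill=1.0, corr_range=2.0):
    """Semivariance at lag distance h (no nugget, so the value at h = 0 is 0)."""
    return sill * (1.0 - math.exp(-h / corr_range))


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back-substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ordinary_kriging(points, values, target):
    """Return the kriged estimate at `target` and the sample weights."""
    n = len(points)
    # Sample-to-sample semivariances, bordered by the sum-to-one constraint
    lhs = [[exponential_variogram(math.dist(p, q)) for q in points] + [1.0]
           for p in points]
    lhs.append([1.0] * n + [0.0])
    # Sample-to-target semivariances, plus the constraint value
    rhs = [exponential_variogram(math.dist(p, target)) for p in points] + [1.0]
    solution = solve_linear(lhs, rhs)
    weights = solution[:n]  # the last entry is the Lagrange multiplier
    estimate = sum(w * z for w, z in zip(weights, values))
    return estimate, weights


# Made-up samples on the corners of a unit square
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [1.0, 2.0, 3.0, 4.0]

estimate, weights = ordinary_kriging(points, values, target=(0.5, 0.5))
print(f"estimate at the center: {estimate:.3f}")
print(f"weights sum to: {sum(weights):.3f}")
```

At the center of the square all four samples are equidistant, so they receive equal weights and the estimate is simply their average.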

In [None]:
survey.kriging_analysis( bearing_tolerance = 15.0 , coordinate_transform = True , crop_method = 'interpolation' , extrapolate = False , latitude_resolution = 1.25 , stratum = 'ks' , variable = 'biomass_density' , verbose = True )

There are then various results stored within `Survey.results[ 'kriging' ]`:

In [None]:
survey.results[ 'kriging' ].keys()

Some of these values are single values:

In [None]:
pprint.pprint( [survey.results['kriging'].get(key) for key in ['variable' , 'survey_mean' , 'survey_estimate' , 'survey_cv'] ] )

The meshed results can also be retrieved:

In [None]:
survey.results[ 'kriging' ][ 'mesh_results_df' ]

The `tables` sub-dictionary includes the sum of each variable distributed over age, length, and sex (in this case, `variable = 'biomass_density'` produces estimates of kriged `biomass` for these tables).

Biomass distributed over age, length, and sex for aged fish:

In [None]:
survey.results['kriging']['tables'][ 'aged_tbl' ]

Biomass distributed over length and sex for unaged fish:

In [None]:
survey.results['kriging']['tables']['unaged_tbl']

Combined biomass from both the aged and unaged fish distributed over length, age, and sex: 

In [None]:
survey.results['kriging']['tables']['overall_apportionment_df']

Now that the kriging results are computed, they can then be used to parameterize `Survey.stratified_analysis( dataset = 'kriging' , ...)` to conduct the stratified resampling analysis: 

In [None]:
survey.stratified_analysis( 'kriging' )

## **Other 'useful' features**

Although a summary of the results is printed in the console when `verbose = True`, it is a bit obnoxious to have to re-run the entire analysis to regenerate the same message. This is addressed by the `Survey.summary(...)` method, which takes a single input:
* `results_name (string)`: the name of the analysis results that should be printed to the console. This can be formatted either as a single name (e.g. `'transect'`, `'kriging'`) or as a nested/layered name (e.g. `'stratified:transect'`), where a colon (`':'`) is the delimiter that separates the two result layer names.

In [None]:
survey.summary( 'transect' )

In [None]:
survey.summary( 'stratified:transect' )

In [None]:
survey.summary( 'stratified:kriging' )

In [None]:
survey.summary( 'kriging' )
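
The colon-delimited naming convention maps naturally onto nested dictionary lookups. The sketch below illustrates only that convention; how `Survey.summary(...)` resolves names internally is not shown in this overview, and both `lookup_results` and `mock_results` are hypothetical stand-ins with placeholder values.

```python
def lookup_results(results, results_name):
    """Follow a colon-delimited name such as 'stratified:transect' through nested dicts."""
    node = results
    for layer in results_name.split(":"):
        node = node[layer]  # raises KeyError if a layer name does not exist
    return node


# Stand-in mirroring the layout of survey.results; values are placeholders
mock_results = {
    "transect": {"label": "transect results"},
    "stratified": {
        "transect": {"label": "stratified results (transect data)"},
        "kriging": {"label": "stratified results (kriged data)"},
    },
    "kriging": {"label": "kriging results"},
}

for name in ["transect", "stratified:transect", "stratified:kriging", "kriging"]:
    print(name, "->", lookup_results(mock_results, name)["label"])
```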