(jolly-hampton)=
# Stratified resampling

In [None]:
%run ./geostatistical.ipynb

The {cite:t}`jolly_hampton_1990` algorithm for [estimating the mean and variance of stratified random transects](stratified-resampling-algo) is implemented using a [bootstrapping method](jolly-hampton-bootstrap). This enables the calculation of [confidence intervals](jolly-hampton-ci) for stratum-specific and overall survey population estimates, and it characterizes the overall variance via the coefficient of variation ($\textit{CV}$, {ref}`Eq. 2.21 <intext_eq_221_md>`).
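
As a rough illustration of the quantities involved, the sketch below computes a distance-weighted mean density and its variance for each stratum, then combines the strata by area into a survey-level $\textit{CV}$. It follows one common statement of the Jolly and Hampton estimator. The function names and toy inputs are hypothetical, and the exact form used by `echopop` (Eq. 2.21) may differ in detail.

```python
import math


def stratum_estimates(distances, densities):
    """Distance-weighted mean density and its variance for one stratum."""
    n = len(distances)
    mean_distance = sum(distances) / n
    # Each transect is weighted by its length relative to the stratum mean
    weights = [d / mean_distance for d in distances]
    mean_density = sum(w * rho for w, rho in zip(weights, densities)) / n
    variance = sum(
        w**2 * (rho - mean_density) ** 2 for w, rho in zip(weights, densities)
    ) / (n * (n - 1))
    return mean_density, variance


def survey_cv(strata):
    """Area-weighted CV across strata given (area, distances, densities) tuples."""
    total_area = sum(area for area, _, _ in strata)
    weighted_mean = 0.0
    weighted_variance = 0.0
    for area, distances, densities in strata:
        mean_density, variance = stratum_estimates(distances, densities)
        weighted_mean += area * mean_density
        weighted_variance += area**2 * variance
    weighted_mean /= total_area
    weighted_variance /= total_area**2
    return math.sqrt(weighted_variance) / weighted_mean


# Toy strata (made-up values): (area, transect distances, transect densities)
toy_strata = [
    (100.0, [10.0, 12.0, 8.0], [2.0, 4.0, 3.0]),
    (250.0, [20.0, 18.0], [1.0, 1.5]),
]
print(f"CV = {survey_cv(toy_strata):.3f}")
```

Weighting by transect length means that the stratum mean equals the summed quantity divided by the summed distance, so long transects contribute more than short ones.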

## Import necessary modules

This algorithm is implemented via the `JollyHampton` class from the `survey` sub-package.

In [3]:
from echopop.survey import JollyHampton

The `JollyHampton` class can be initialized with two arguments:

- `model_parameters`: A dictionary that can be configured to include the following keys:
    - `num_replicates`: The number of bootstrap replicates.
    - `strata_transect_proportion`: The proportion of transects to sample per stratum. 
    - `transects_per_latitude`: The number of transects per degree latitude. This key is only needed for virtual transect generation, which is discussed in the kriged mesh section.
- `resample_seed`: An optional argument that sets the random seed for reproducible bootstrapping. If no seed is provided, this defaults to `None`.
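
To illustrate why a seed matters, the sketch below draws the same random subset of transects twice from a seeded generator. It uses Python's standard `random` module purely for illustration; `draw_transects` is a hypothetical helper, and how `echopop` consumes `resample_seed` internally is not shown here.

```python
import random


def draw_transects(transect_ids, n_sample, seed=None):
    """Draw transects without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # seed=None gives a different draw on each run
    return sorted(rng.sample(transect_ids, n_sample))


transect_ids = list(range(1, 11))
first = draw_transects(transect_ids, 8, seed=2024)
second = draw_transects(transect_ids, 8, seed=2024)
print(first == second)  # True: identical seeds give identical draws
```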

## Transects

For the transect data, the `JollyHampton` class only needs `strata_transect_proportion` and `num_replicates`:

In [4]:
TRANSECT_JOLLYHAMPTON_PARAMETERS = {
    "strata_transect_proportion": 0.75,
    "num_replicates": 1000,
}

jh_transect = JollyHampton(TRANSECT_JOLLYHAMPTON_PARAMETERS)

The next step is the bootstrap itself, which runs the built-in analysis pipeline. This is done with the `JollyHampton.stratified_bootstrap` method, which has three required inputs:

- `data_df`: Input `pandas.DataFrame` containing transect data.
- `stratify_by`: List of column names defining the strata (e.g., `["geostratum_inpfc"]`).
- `variable`: Name of the response variable column in `data_df` (e.g., `"biomass"`).

In [5]:
jh_transect.stratified_bootstrap(
    data_df=df_nasc_no_age1_prt,
    stratify_by=["geostratum_inpfc"],
    variable="biomass",
)
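
Conceptually, a stratified bootstrap of this kind repeatedly draws a subset of transects within each stratum and recomputes the estimate. The sketch below is a simplified stand-in, not the `echopop` implementation: it assumes sampling without replacement, uses made-up transect values, and scales each resampled line density by the full stratum distance. All names are hypothetical.

```python
import random
import statistics

# Toy transects (made-up values): stratum -> list of (distance, biomass)
toy_transects = {
    1: [(35.0, 4.0e6), (35.0, 3.0e5), (51.0, 0.0), (48.0, 1.3e7)],
    2: [(40.0, 2.0e7), (42.0, 1.5e7), (38.0, 3.0e7), (45.0, 1.0e7)],
}


def bootstrap_totals(transects, proportion, num_replicates, seed=None):
    """Return one resampled total per replicate."""
    rng = random.Random(seed)
    totals = []
    for _ in range(num_replicates):
        total = 0.0
        for rows in transects.values():
            n_sample = round(len(rows) * proportion)
            sample = rng.sample(rows, n_sample)  # assumed: without replacement
            # Line density of the resampled transects
            density = sum(b for _, b in sample) / sum(d for d, _ in sample)
            # Scale back up to the full stratum distance
            total += density * sum(d for d, _ in rows)
        totals.append(total)
    return totals


replicates = bootstrap_totals(toy_transects, 0.75, 1000, seed=1)
print(statistics.fmean(replicates), statistics.stdev(replicates))
```

The spread of `replicates` is what the confidence intervals and $\textit{CV}$ summarize.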

This method populates several attributes. The first is `transect_summary`, which reports the summed biomass (kg), distance, area coverage, and areal and line biomass densities (kg nmi<sup>-2</sup> and kg nmi<sup>-1</sup>, respectively) for each transect. Only the first ten rows are displayed below.

In [12]:
display(jh_transect.transect_summary.head(10))

Unnamed: 0_level_0,Unnamed: 1_level_0,biomass,distance,area,biomass_areal_density,biomass_distance_density
geostratum_inpfc,transect_num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,1.0,4010244.0,34.809236,348.092365,11520.632646,115206.326459
1,2.0,310113.5,35.341287,353.532537,877.185052,8774.820805
1,3.0,0.0,50.660997,507.671065,0.0,0.0
1,4.0,13129770.0,48.246504,483.030217,27182.089235,272139.315187
1,5.0,0.0,48.119031,480.726902,0.0,0.0
1,6.0,1209238.0,42.895494,428.615484,2821.266345,28190.337432
1,7.0,1541750.0,45.527123,455.378065,3385.648627,33864.430843
1,8.0,2774609.0,44.055394,440.435515,6299.694199,62980.008285
1,9.0,3618213.0,40.192165,402.164809,8996.84211,90022.850305
1,10.0,13779560.0,34.756911,347.653526,39635.892922,396455.200679
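
The two density columns are consistent with dividing each transect's biomass by its area and by its distance. The check below uses the first two displayed rows. Small differences from the table are expected because the displayed biomass values are rounded.

```python
# First two rows copied from the displayed transect summary
rows = [
    {"transect_num": 1, "biomass": 4010244.0, "distance": 34.809236, "area": 348.092365},
    {"transect_num": 2, "biomass": 310113.5, "distance": 35.341287, "area": 353.532537},
]

for row in rows:
    row["biomass_areal_density"] = row["biomass"] / row["area"]
    row["biomass_distance_density"] = row["biomass"] / row["distance"]
    print(
        row["transect_num"],
        round(row["biomass_areal_density"], 2),
        round(row["biomass_distance_density"], 2),
    )
```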


The next attribute is `strata_summary`, which provides the transect count (and the number of transects to resample), total distance, area, biomass, and mean areal and line biomass densities for each stratum.

In [13]:
display(jh_transect.strata_summary.head(10))

Unnamed: 0_level_0,transect_counts,num_transects_to_sample,distance,area,biomass,biomass_distance_density,biomass_density
geostratum_inpfc,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,10,8,424.604141,4247.300484,40373500.0,95085.028859,9505.684182
2,24,18,899.718778,10057.202226,425774200.0,473230.298614,42335.25153
3,13,10,504.27772,5773.969771,327420100.0,649285.338462,56706.242571
4,15,11,619.190882,7060.629519,446662900.0,721365.418732,63261.05751
5,15,11,666.979139,7068.645037,260750400.0,390942.338685,36888.312143
6,36,27,1392.286678,20673.571163,175849300.0,126302.502696,8505.994951
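
The `num_transects_to_sample` column is consistent with multiplying `transect_counts` by `strata_transect_proportion` (0.75 here) and rounding to the nearest integer. This is inferred from the displayed values rather than from the source code, so the handling of exact ties in particular is an assumption.

```python
# Transect counts copied from the displayed strata summary
transect_counts = {1: 10, 2: 24, 3: 13, 4: 15, 5: 15, 6: 36}
strata_transect_proportion = 0.75

num_transects_to_sample = {
    stratum: round(count * strata_transect_proportion)
    for stratum, count in transect_counts.items()
}
print(num_transects_to_sample)
```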


Next is `survey_summary`, which provides the total population estimate and density for each stratum and for the entire survey. These are stored under the two keys `"strata"` and `"survey"`.

In [38]:
jh_transect.survey_summary["strata"]

Unnamed: 0_level_0,biomass,biomass,biomass,biomass,biomass,biomass,biomass_density,biomass_density,biomass_density,biomass_density,biomass_density,biomass_density
geostratum_inpfc,1,2,3,4,5,6,1,2,3,4,5,6
0,40373500.0,425774200.0,327420100.0,446662900.0,260750400.0,175849300.0,9505.684182,42335.25153,56706.242571,63261.05751,36888.312143,8505.994951


In [39]:
jh_transect.survey_summary["survey"]

Unnamed: 0,biomass,biomass_density,cv
0,1677048000.0,30553.755539,0.109971


The last attribute to note is `bootstrap_replicates`, which is a DataFrame comprising the computed estimators (e.g., total coverage area, biomass, weighted biomass density) for each of the bootstrap replicates. This can be used to examine the distributions of different metrics across replicates, both within each stratum and for the entire survey.

In [40]:
display(jh_transect.bootstrap_replicates.head(10))

Unnamed: 0_level_0,area,area,area,area,area,area,biomass,biomass,biomass,biomass,...,distance_weighted_biomass_density,distance_weighted_biomass_density,distance_weighted_biomass_density,distance_weighted_biomass_density,distance_weighted_variance,distance_weighted_variance,distance_weighted_variance,distance_weighted_variance,distance_weighted_variance,distance_weighted_variance
geostratum_inpfc,1,2,3,4,5,6,1,2,3,4,...,3,4,5,6,1,2,3,4,5,6
replicate,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
0,3491.603138,7509.820151,4511.928114,5513.006534,5457.315275,14318.621027,45556460.0,434633100.0,341084800.0,530597600.0,...,28284320.0,34487540.0,19985860.0,7926901.0,2551602000000.0,24344650000000.0,87312250000000.0,115033900000000.0,40557780000000.0,3275461000000.0
1,3362.105458,7484.79516,4437.237772,5335.221508,5092.035493,15363.064304,29531890.0,331431100.0,344324700.0,448348900.0,...,28768240.0,29982170.0,21319880.0,2120924.0,940463300000.0,11123450000000.0,90440580000000.0,86939690000000.0,46021870000000.0,238653500000.0
2,3335.654784,7954.843829,4213.283876,4795.224655,5210.857907,17234.327973,32543110.0,485727500.0,291813400.0,324564000.0,...,22428660.0,22109420.0,20907680.0,7328970.0,1171072000000.0,28935180000000.0,53102280000000.0,47109950000000.0,44612040000000.0,2786318000000.0
3,3545.675582,7623.011593,4658.918907,4770.894617,5377.238005,16149.719545,45066420.0,382892000.0,351049900.0,371864000.0,...,28980800.0,24884680.0,23141160.0,6932840.0,2483088000000.0,14850160000000.0,91799480000000.0,59781160000000.0,54210450000000.0,2539439000000.0
4,3404.70016,7536.521289,4522.591358,4976.292893,4742.204911,13191.110861,42475840.0,426701500.0,325118400.0,522123200.0,...,27069040.0,33813570.0,14626090.0,8508163.0,2308959000000.0,23324540000000.0,79962040000000.0,110189200000000.0,20587070000000.0,3865258000000.0
5,3545.675582,7361.792693,4681.977201,5329.516331,5015.109211,13626.957423,45066420.0,343128800.0,321947300.0,452963800.0,...,26988730.0,29664330.0,17895990.0,8546989.0,2483088000000.0,11687230000000.0,79524850000000.0,85368720000000.0,31601220000000.0,3876241000000.0
6,3453.332432,7778.876291,4661.664549,5358.256039,5228.580027,14272.287058,46610970.0,422208900.0,323950900.0,398568800.0,...,27129350.0,26315280.0,21879290.0,3124508.0,2691887000000.0,22750280000000.0,80432380000000.0,66878680000000.0,48609280000000.0,516422000000.0
7,3459.211443,7467.118724,4480.570122,5018.232907,5338.222629,17603.312432,29774160.0,300911700.0,354382600.0,498485700.0,...,29248060.0,32283570.0,21792900.0,7216606.0,1284094000000.0,9251804000000.0,93461340000000.0,100253900000000.0,47959790000000.0,2701337000000.0
8,3363.306935,7710.787861,4452.249166,5246.909016,5202.153736,16506.412792,47028140.0,454307100.0,297863200.0,401957000.0,...,25077160.0,26113210.0,20354000.0,7055204.0,2888872000000.0,25743080000000.0,68773610000000.0,66169270000000.0,42193410000000.0,2687097000000.0
9,3418.481217,7684.859244,4684.746919,5377.169006,5397.60379,15359.356385,45454070.0,437973900.0,334195800.0,404918200.0,...,27822860.0,26463110.0,19778790.0,3290864.0,2717833000000.0,24297420000000.0,84498160000000.0,67922750000000.0,39776800000000.0,554311700000.0


These bootstrap replicates can then be processed into summary statistics computed for each stratum as well as the entire survey. This uses the `JollyHampton.summarize` method, which takes the following arguments:

- `ci_percentile`: Confidence level for estimating the uncertainty interval. This defaults to `0.95`.
- `ci_method`: The method used for computing the confidence interval ($\textit{CI}$). Currently, valid entries for `ci_method` are:
    - `"student_jackknife"`: This is the default method that computes the studentized $\textit{CI}$ using jackknife (or leave-one-out) resampling.
    - `"bc"`: Bias-corrected $\textit{CI}$ that adjusts for bias in the bootstrap distribution using the empirical cumulative distribution function.
    - `"bca"`: Bias-corrected and accelerated $\textit{CI}$ that not only accounts for bias in the bootstrap sample but also corrects for skewness using finite-sample jackknife resampling.
    - `"empirical"`: Sometimes known as the "delta method", this selection uses the distribution of bootstrapped deviations between the replicate means and population statistic.
    - `"normal"`: This assumes that the bootstrap replicates are normally distributed.
    - `"percentile"`: The $\textit{CI}$ are constructed directly from the bootstrap distribution quantiles.
    - `"student_standard"`: This assumes that the bootstrap replicates are approximately $t$-distributed.
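
To make two of the simpler options concrete, the sketch below computes a percentile interval and a normal-approximation interval from a synthetic set of replicates. It only illustrates the general techniques with the standard library. The replicate values are made up, `quantile` is a hypothetical helper, and the `echopop` implementations may differ in detail.

```python
import random
import statistics

rng = random.Random(0)
# Synthetic stand-in for one column of bootstrap replicates (made-up values)
replicates = sorted(rng.gauss(1.67e9, 1.0e8) for _ in range(1000))

ci_percentile = 0.95
alpha = 1.0 - ci_percentile


def quantile(sorted_values, q):
    """Linear-interpolation quantile of pre-sorted values."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


# Percentile interval: read the bounds directly from the replicate distribution
percentile_ci = (quantile(replicates, alpha / 2), quantile(replicates, 1 - alpha / 2))

# Normal interval: mean +/- z * standard deviation of the replicates
mean = statistics.fmean(replicates)
sd = statistics.stdev(replicates)
z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
normal_ci = (mean - z * sd, mean + z * sd)

print(percentile_ci)
print(normal_ci)
```

When the replicates are close to normally distributed, the two intervals should be similar. They diverge as the bootstrap distribution becomes skewed, which is the situation the bias-corrected options are designed for.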

In [41]:
jh_transect.summarize()

Unnamed: 0_level_0,biomass,biomass,biomass,biomass,biomass_density,biomass_density,biomass_density,biomass_density,cv,cv,cv,cv
metric,low,mean,high,bias,low,mean,high,bias,low,mean,high,bias
1,28209060.0,40409930.0,50101020.0,36434.18,6641.644748,9514.262378,11795.966714,8.578196,,,,
2,307027100.0,422563400.0,510350200.0,-3210779.0,30528.078222,42015.999774,50744.749903,-319.251756,,,,
3,243721800.0,325636800.0,379379200.0,-1783342.0,42210.443573,56397.383683,65705.088947,-308.858888,,,,
4,324248200.0,445204600.0,541868300.0,-1458327.0,45923.418822,63054.514082,76745.03914,-206.543428,,,,
5,193529900.0,263039500.0,319419500.0,2289147.0,27378.645785,37212.157417,45188.219506,323.845274,,,,
6,83192000.0,175026500.0,229105600.0,-822752.9,4024.075188,8466.197624,11082.052377,-39.797327,,,,
survey,1462387000.0,1671881000.0,1865804000.0,-5166822.0,26646.356142,30463.567847,33997.061607,-90.187691,0.123976,0.130357,0.141477,0.020386


This yields a `pandas.DataFrame` indexed by each stratum and overall survey, with columns organized by their respective metrics. In this case, we have `"biomass"`, `"biomass_density"`, and `"cv"`. The columns `"low"`, `"mean"`, and `"high"` correspond to the lower bound of the $\textit{CI}$, the distribution mean, and the upper bound of the $\textit{CI}$, respectively. The `"bias"` represents the deviation between the bootstrapped means and the original estimates. The metric `"cv"` is only calculated for the entire survey, so there are no valid values to report for each stratum (i.e., `NaN`).
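
The `"bias"` values can be checked against the tables shown earlier: subtracting the original stratum biomass from the bootstrap mean reproduces the reported bias up to display rounding. The sign convention (mean minus original) is inferred from the displayed values.

```python
# Values copied from the displayed tables for strata 1 and 2
original = {1: 40373500.0, 2: 425774200.0}        # strata_summary biomass
bootstrap_mean = {1: 40409930.0, 2: 422563400.0}  # summarize() biomass "mean"

bias = {stratum: bootstrap_mean[stratum] - original[stratum] for stratum in original}
print(bias)
```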

## Kriged mesh

We can also run this analysis for the kriged mesh estimates, although a few modifications are needed to make the gridded points compatible with the expected transect sampling design. We therefore initialize the `JollyHampton` object with the additional dictionary key `"transects_per_latitude"`, which defines the number of virtual transects that will be generated per degree latitude.

In [42]:
KRIGED_JOLLYHAMPTON_PARAMETERS = {
    "transects_per_latitude": 5,
    "strata_transect_proportion": 0.75,
    "num_replicates": 1000,
}

jh_kriged = JollyHampton(KRIGED_JOLLYHAMPTON_PARAMETERS)

Before calling the `JollyHampton.stratified_bootstrap` method, the `JollyHampton.create_virtual_transects` method must be run. This method creates virtual transects from the gridded data and then assigns them to individual strata. It has four arguments:

- `data_df`: A `pandas.DataFrame` containing gridded kriged data.
- `geostrata_df`: A `pandas.DataFrame` containing geographical stratum boundaries and definitions (e.g., `"geostratum_inpfc"`).
- `stratify_by`: A list of column names in `data_df` to stratify by (e.g., `["geostratum_inpfc"]`).
- `variable`: Name of the response variable column in `data_df` (e.g., `"biomass"`).

This returns a `pandas.DataFrame` that can then be passed to the other `JollyHampton` methods as if it were standard transect data.

In [None]:
kriged_transects = jh_kriged.create_virtual_transects(
    data_df=unextrapolated_results,
    geostrata_df=df_dict_geostrata["inpfc"],
    stratify_by=["geostratum_inpfc"],
    variable="biomass",
)

In [45]:
display(kriged_transects.head(10))

Unnamed: 0,transect_num,latitude,transect_distance,transect_area,biomass,geostratum_inpfc
0,1,34.6,48.675265,292.051592,1278193.0,1
1,2,34.8,54.211796,650.541558,1061705.0,1
2,3,35.0,54.655882,655.870584,10710390.0,1
3,4,35.2,56.585071,679.020851,646848.3,1
4,5,35.4,51.338622,616.063464,1791419.0,1
5,6,35.6,57.280884,687.370604,3334136.0,1
6,7,35.8,51.768742,621.224904,3610589.0,1
7,8,36.0,48.788137,585.457646,14940460.0,2
8,9,36.2,47.058672,564.704058,8543800.0,2
9,10,36.4,44.176795,530.121544,3189396.0,2
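
The 0.2° spacing in the `latitude` column is consistent with `transects_per_latitude = 5`. One plausible way to form such transects is to snap each mesh cell to the nearest multiple of that spacing and aggregate within each bin, as sketched below. The mesh values are made up, and the actual binning and stratum assignment performed by `create_virtual_transects` may differ.

```python
from collections import defaultdict

transects_per_latitude = 5  # implies 1 / 5 = 0.2 degree spacing

# Made-up mesh cells: (latitude, biomass)
mesh_cells = [
    (34.58, 1.0e5),
    (34.63, 2.0e5),
    (34.79, 5.0e4),
    (34.84, 7.5e4),
    (35.02, 3.0e5),
]

virtual = defaultdict(float)
for latitude, biomass in mesh_cells:
    # Snap each cell to the nearest multiple of the spacing
    binned = round(round(latitude * transects_per_latitude) / transects_per_latitude, 6)
    virtual[binned] += biomass

# Number the virtual transects from south to north
for transect_num, (latitude, biomass) in enumerate(sorted(virtual.items()), start=1):
    print(transect_num, latitude, biomass)
```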


These virtual transects can now be used to compute the bootstrap replicates and the subsequent summary statistics.

In [46]:
# Generate replicates
jh_kriged.stratified_bootstrap(
    data_df=kriged_transects,
    stratify_by=["geostratum_inpfc"],
    variable="biomass",
)

# Summarize
jh_kriged.summarize()

Unnamed: 0_level_0,biomass,biomass,biomass,biomass,biomass_density,biomass_density,biomass_density,biomass_density,cv,cv,cv,cv
metric,low,mean,high,bias,low,mean,high,bias,low,mean,high,bias
1,10786350.0,22324700.0,29056770.0,-108579.1,2540.437518,5257.987864,6843.547876,-25.572912,,,,
2,308270800.0,394072900.0,464813800.0,213204.9,23037.811113,29450.005131,34736.640264,15.933312,,,,
3,256537700.0,343962500.0,410175200.0,-952431.4,36675.415948,49173.923503,58639.899534,-136.162483,,,,
4,390392700.0,492615300.0,558101000.0,2772988.0,49920.542282,62992.023841,71365.848143,354.589325,,,,
5,188085500.0,266365400.0,310634200.0,1098833.0,20822.068401,29488.068006,34388.863662,121.646641,,,,
6,119169700.0,196667200.0,251626100.0,89022.4,3320.180188,5479.333932,7010.538926,2.480248,,,,
survey,1517318000.0,1716008000.0,1876902000.0,2931529.0,19611.80867,22179.938517,24259.542433,40.236983,0.128841,0.134447,0.141746,0.018865
