(stratify-data)=
# Stratify acoustic and biological data

In [4]:
%run ./biodata_ingestion.ipynb
%run ./nasc_ingestion.ipynb

## Setting up the environment

`Echopop` can read two types of stratification files: 1) haul-based and 2) latitude-based. Functions from the `load_data` module read in and preprocess these files. As with the other data ingestion steps, some column renaming may be required for compatibility with `Echopop`.

## Haul-based stratification

The haul-based stratification files can include any number of different stratification definitions so long as they are assigned different names. For instance, if a spreadsheet has two sheets labeled "INPFC" and "Base KS", then those can be mapped to the keys `"inpfc"` and `"ks"` using:

In [None]:
# Sheet-stratum name mapping
HAUL_STRATA_SHEETS_MAP = {
    "inpfc": "INPFC",
    "ks": "Base KS",
}

These can then be read in using the `load_strata` function:

In [9]:
# Filepath 
HAUL_STRATA_FILE = DATA_ROOT / "Stratification/US_CAN strata 2019_final.xlsx"

# Column name mapping
EXPECTED_ECHOPOP_STRATA_COLUMNS = {
    "fraction_hake": "nasc_proportion",
    "haul": "haul_num",
    "stratum": "stratum_num",
}

# Load files
df_dict_strata = load_data.load_strata(
    strata_filepath=HAUL_STRATA_FILE, 
    strata_sheet_map=HAUL_STRATA_SHEETS_MAP, 
    column_name_map=EXPECTED_ECHOPOP_STRATA_COLUMNS
)
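
To illustrate what a column name mapping like the one above does, here is a minimal sketch using plain `pandas` on a toy table. The toy values and the variable `raw_sheet` are made up for illustration, and this is not how `load_strata` is implemented internally; it only shows the effect of renaming source columns to the names `Echopop` expects.

```python
import pandas as pd

# Hypothetical stand-in for one sheet of a haul-based stratification file
raw_sheet = pd.DataFrame(
    {
        "haul": [1, 2, 3],
        "stratum": [1, 1, 2],
        "fraction_hake": [1.0, 0.8, 0.5],
    }
)

# Same mapping as in the cell above: source column -> Echopop column
column_name_map = {
    "fraction_hake": "nasc_proportion",
    "haul": "haul_num",
    "stratum": "stratum_num",
}

# Columns absent from the mapping would be left unchanged by `rename`
renamed_sheet = raw_sheet.rename(columns=column_name_map)
print(renamed_sheet.columns.tolist())
```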

Once loaded in, this stratification can be directly applied to the previously ingested data using the `join_strata_by_haul` function. If we want to retain *both* stratification definitions ("INPFC" and "KS"), then `join_strata_by_haul` needs to be called for each one with a unique `stratum_name` argument value.

In [11]:
# Add KS to NASC data
df_nasc_all_ages = load_data.join_strata_by_haul(data=df_nasc_all_ages,
                                                 strata_df=df_dict_strata["ks"],
                                                 stratum_name="stratum_ks")

# Add INPFC to NASC data
df_nasc_all_ages = load_data.join_strata_by_haul(data=df_nasc_all_ages, 
                                                 strata_df=df_dict_strata["inpfc"],
                                                 stratum_name="stratum_inpfc") 

# Add KS to biodata
dict_df_bio = load_data.join_strata_by_haul(dict_df_bio,
                                            df_dict_strata["ks"],
                                            stratum_name="stratum_ks") 

# Add INPFC to biodata
dict_df_bio = load_data.join_strata_by_haul(dict_df_bio,
                                            df_dict_strata["inpfc"],
                                            stratum_name="stratum_inpfc")
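
Conceptually, a haul-based join is a lookup from haul number to stratum. The sketch below reproduces that idea with `pandas.DataFrame.merge` on toy data. The function `join_by_haul_sketch` and all values are hypothetical, and the left join (which leaves unmatched hauls as missing values) is an assumption rather than a description of how `join_strata_by_haul` treats such hauls.

```python
import pandas as pd

# Toy NASC-like table; haul 4 has no entry in the stratification table
nasc = pd.DataFrame({"haul_num": [1, 2, 3, 4], "nasc": [10.0, 20.0, 30.0, 40.0]})

# Toy haul-to-stratum table using the renamed Echopop columns
strata = pd.DataFrame({"haul_num": [1, 2, 3], "stratum_num": [1, 1, 2]})


def join_by_haul_sketch(data, strata_df, stratum_name):
    """Attach a stratum column to `data` by matching on `haul_num`."""
    lookup = strata_df[["haul_num", "stratum_num"]].rename(
        columns={"stratum_num": stratum_name}
    )
    # Left join keeps every row of `data`; unmatched hauls get NaN
    return data.merge(lookup, on="haul_num", how="left")


joined = join_by_haul_sketch(nasc, strata, stratum_name="stratum_ks")
print(joined)
```

Using a distinct `stratum_name` per call is what allows several stratification definitions to sit side by side as separate columns.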

## Latitude-based stratification

Alternatively, any georeferenced data with a correctly projected/referenced `latitude` column can be stratified by latitudinal position instead of haul number. In this instance, only the NASC data have a `latitude` column; however, latitudes could be added to the biological data to enable the same functionality. This first requires reading in the latitude-based stratification file, which is referred to as the "geostratification" to differentiate it from the haul-based mapping. It can be loaded via the `load_geostrata` function.

In [12]:
# Filepath
GEOSTRATA_FILE = DATA_ROOT / "Stratification/Stratification_geographic_Lat_2019_final.xlsx"

# Sheet-stratum name mapping
GEOSTRATA_SHEETS_MAP = {
    "inpfc": "INPFC",
    "ks": "stratification1",
}

# Column renaming
EXPECTED_ECHOPOP_GEOSTRATA_COLUMNS = {
    "latitude (upper limit)": "northlimit_latitude",
    "stratum": "stratum_num",
}

# Load in file
df_dict_geostrata = load_data.load_geostrata(
    geostrata_filepath=GEOSTRATA_FILE, 
    geostrata_sheet_map=GEOSTRATA_SHEETS_MAP, 
    column_name_map=EXPECTED_ECHOPOP_GEOSTRATA_COLUMNS
)

Similar to the haul-based stratification, the geostrata can be applied directly to the acoustic dataset via the `join_geostrata_by_latitude` function. When multiple geostrata are to be stored within the same `pandas.DataFrame`, call the function once per geostratification with a different `stratum_name` argument each time.

In [13]:
# Apply KS (geostratum) to NASC
df_nasc_all_ages = load_data.join_geostrata_by_latitude(data=df_nasc_all_ages,
                                                        geostrata_df=df_dict_geostrata["ks"],
                                                        stratum_name="geostratum_ks")

# Apply INPFC (geostratum) to NASC
df_nasc_all_ages = load_data.join_geostrata_by_latitude(data=df_nasc_all_ages,
                                                        geostrata_df=df_dict_geostrata["inpfc"],
                                                        stratum_name="geostratum_inpfc")
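
The idea behind a latitude-based join can be sketched as binning each latitude against the sorted northern limits of the strata. The example below uses `numpy.searchsorted` on toy data. The function `join_by_latitude_sketch` and all values are hypothetical, and two choices are assumptions rather than documented `Echopop` behavior: northern limits are treated as inclusive, and latitudes north of the last limit are placed in the northernmost stratum.

```python
import numpy as np
import pandas as pd

# Toy geostratification: each stratum is defined by its northern latitude limit
geostrata = pd.DataFrame(
    {"northlimit_latitude": [36.0, 40.5, 43.0], "stratum_num": [1, 2, 3]}
)

# Toy georeferenced data
nasc = pd.DataFrame({"latitude": [35.2, 36.0, 41.7, 42.9]})


def join_by_latitude_sketch(data, geostrata_df, stratum_name):
    """Assign each row the stratum whose northern limit first reaches its latitude."""
    geo = geostrata_df.sort_values("northlimit_latitude")
    limits = geo["northlimit_latitude"].to_numpy()
    # Index of the first limit that is >= the latitude (limits are inclusive)
    idx = np.searchsorted(limits, data["latitude"].to_numpy(), side="left")
    # Assumption: anything north of the last limit joins the northernmost stratum
    idx = np.clip(idx, 0, len(limits) - 1)
    out = data.copy()
    out[stratum_name] = geo["stratum_num"].to_numpy()[idx]
    return out


binned = join_by_latitude_sketch(nasc, geostrata, stratum_name="geostratum_ks")
print(binned)
```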