In [1]:
# # Run this and then restart the kernel at the start of each session to install
# # 'teotil3' in development mode
# !pip install -e /home/jovyan/projects/teotil3/

In [2]:
import os

import geopandas as gpd
import pandas as pd
import teotil3 as teo

# Task 2.2: Generate catchment hierarchy

From the proposal text:

> **Task 2.2: Generate catchment hierarchy**
>
> Write code to generate the "adjacency matrix" required to represent the regine catchment hierarchy in TEOTIL. The regine naming convention is defined by NVE, and a basic catchment structure can be obtained by sorting regine IDs alphanumerically. However, more sophisticated code will be needed to handle "edge cases" and to check that the final hierarchy is hydrologically reasonable. Functions to facilitate the processing will be added to the input/output module of TEOTIL.


NVE's description of the regine hierarchy is [here](https://www.nve.no/media/2297/regines-inndelingssystem.pdf). A few key points to note:

 * The first three digits of the regine code before the decimal point specify the vassdragsområde (`001` to `314`)
 
 * Each vassdragsområde is sub-divided into nedbørfelt, sentralfelt, randfelt and kystfelt, each of which can be further divided - possibly multiple times - into sub-areas (and sub-sub-areas, and sub-sub-sub-areas...)
 
 * The number of characters after the decimal point determines the "level" of the catchment. In the regine dataset for 2022, the deepest level is 8
 
 * A basic hierarchy can be achieved by sorting the regine codes alphanumerically. Kystfelt are assigned wholly numeric codes, while other areas are assigned a mixture of letters and numbers. Intercatchments are indexed starting from their outflow in the upstream direction, beginning with the letter `A`. This means that, within any given "level", we navigate downstream by working backwards through the alphabet and then down through the numbers
 
 * `Z` and `0` have a special significance: `Z` represents an entire, hydrologically complete (sub-)catchment, while `0` groups all partial fields at the same level (i.e. XXX0 groups XXX1, XXX2, ..., XXX9)
 
 * `I` and `O` are omitted from the regine codes to avoid confusion with `1` and `0`
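
The naming rules above can be illustrated with a few lines of plain Python. This is only a minimal sketch: the helper names (`split_regine`, `is_valid_code`, `is_kystfelt`) are hypothetical and are not part of `teotil3`. It assumes that a regine ID always contains exactly one decimal point and that a wholly numeric code identifies a kystfelt, as described in the list.

```python
# Characters allowed in a regine code ('I' and 'O' are omitted)
REGINE_CHARS = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ"


def split_regine(regine):
    """Split a regine ID into (vassom, code, level).

    'level' is taken to be the number of characters after the decimal point.
    """
    vassom, code = regine.split(".", 1)
    return vassom, code, len(code)


def is_valid_code(code):
    """True if every character of the code is in the allowed alphabet."""
    return all(ch in REGINE_CHARS for ch in code)


def is_kystfelt(code):
    """Assumption: kystfelt are the regines with wholly numeric codes."""
    return code.isdigit()


# IDs taken from the first rows of the regine dataset, deliberately shuffled
regines = ["001.1A2B", "001.10", "001.1A20", "001.1A1", "001.1A2A"]

# Python's default string sort places digits before upper-case letters,
# which gives the basic alphanumeric ordering
ordered = sorted(regines)
for regine in ordered:
    vassom, code, level = split_regine(regine)
    print(regine, vassom, code, level, is_valid_code(code), is_kystfelt(code))
```

Sorting the shuffled IDs restores the order in which they appear in the regine dataset, with the kystfelt `001.10` first.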

## 1. Read NVE regine dataset

The latest dataset for 2022 was downloaded and processed as part of Task 2.1.

In [3]:
# Read regine data
spatial_data_fold = "/home/jovyan/shared/common/teotil3/core_data"
teo_gpkg = os.path.join(spatial_data_fold, "tidied", "teotil3_data.gpkg")
reg_gdf = gpd.read_file(teo_gpkg, layer="regines", driver="GPKG")
reg_gdf.head()

Unnamed: 0,regine,vassom,a_cat_poly_km2,upstr_a_km2,ospar_region,komnr_2014,fylnr_2014,komnr_2015,fylnr_2015,komnr_2016,...,a_sea_km2,a_upland_km2,a_urban_km2,a_wood_km2,ar50_tot_a_km2,a_cat_land_km2,a_lake_nve_km2,runoff_mm/yr,q_cat_m3/s,geometry
0,001.10,1,1.44279,0.0,Skagerrak,101,1,101,1,101,...,0.28194,0.0,0.0,0.849188,0.849201,1.16085,0.0,592,0.02178,"POLYGON ((297006.830 6543966.950, 297169.290 6..."
1,001.1A1,1,1.432479,777.9,Skagerrak,101,1,101,1,101,...,6.7e-05,0.004615,0.0,1.377476,1.430189,1.432412,0.043955,620,0.02814,"POLYGON ((297505.440 6543157.790, 297543.100 6..."
2,001.1A20,1,0.34016,777.9,Skagerrak,101,1,101,1,101,...,4.5e-05,0.0,0.0,0.303492,0.340114,0.340114,0.0,594,0.0064,"POLYGON ((297770.368 6543429.036, 297787.114 6..."
3,001.1A2A,1,17.647822,58.96,Skagerrak,101,1,101,1,101,...,0.0,0.467374,0.131585,15.030746,17.647822,17.647822,0.18634,637,0.35623,"POLYGON ((299678.370 6544460.320, 299667.220 6..."
4,001.1A2B,1,41.298255,41.3,Skagerrak,101,1,101,1,101,...,0.0,2.250799,0.161524,29.798394,41.298255,41.298255,7.344123,637,0.83362,"POLYGON ((303353.460 6552989.330, 303341.620 6..."


## 2. An algorithm for hydrological sorting

A general description of the algorithm is as follows:

 1. Separate the vassdragsområde (`vassom` in the code below) from the rest of the regine code (`code` in the code below) by splitting at the decimal point
 
 2. For each vassdragsområde, first **sort the codes alphanumerically**, such that downstream regines are generally towards the top of the list. Then, for each code:
 
     1. Search the list of possible downstream catchments for a suitable match (using criteria defined below). Because of the alphanumeric sorting, possible downstream catchments are all those higher in the list than the current catchment
     
     2. Begin by searching sequentially along the list of possible downstream codes. In other words, iterate the final character of the code by working backwards along the following list: `0123456789ABCDEFGHJKLMNPQRSTUVWXYZ`, searching for matches in the regines above the current one in the sorted list. Note that the next regine downstream may be further subdivided using numbers. For example, if the current catchment is `002.DC7BB`, the first search should be `002.DC7BA*`, followed by `002.DC7B9*`, then `002.DC7B8*`, all the way down to `002.DC7B1*`, where `*` could represent any sequence of digits (including none). This stage of the search terminates before `0` (e.g. there is no need to search for `002.DC7B0*`), since `0` only appears at the end of a regine code (if considering just the part after the decimal point)
     
     3. If no match can be found, truncate the code by one character and repeat the search. For the example above, the next step after `002.DC7B1*` would be `002.DC7B*` (which is a repeat), then `002.DC7A*`, `002.DC79*` etc.
     
     4. If no match is found at this level, continue truncating. For example, first try `002.DC7*` (which is another repeat), then eventually `002.D*`, `002.C*` and so on
     
     5. As soon as any matches are found, the search is stopped and the last match in the list of matches is assigned as the next catchment downstream (i.e. from the list of sorted matches, choose the one at the end, which will be the closest to the target catchment)
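
The search described in steps 1 to 5 can be sketched in plain Python as follows. This is one simplified reading of the description above, not the code used by `assign_regine_hierarchy`: the names `find_downstream` and `assign_downstream` are hypothetical, the function works on the part of the code after the decimal point for a single vassdragsområde, and it returns `None` when nothing is found, leaving the caller to decide what to do with outlets and hanging catchments.

```python
import re

# Search alphabet ('I' and 'O' are omitted from regine codes)
CHARS = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ"


def find_downstream(code, candidates):
    """Return the next code downstream of 'code', or None.

    'candidates' must be the sorted codes that come before 'code'.
    """
    stem = code
    include_stem = False  # the full code itself is never searched for
    while stem:
        start = CHARS.index(stem[-1]) - (0 if include_stem else 1)
        # Work backwards through CHARS, stopping before '0' (index 0)
        for i in range(start, 0, -1):
            prefix = stem[:-1] + CHARS[i]
            # '*' in the text: the prefix followed by any run of digits
            pattern = re.compile(re.escape(prefix) + r"\d*")
            matches = [c for c in candidates if pattern.fullmatch(c)]
            if matches:
                # Last match is the closest to the target catchment
                return matches[-1]
        # Nothing found: truncate by one character and repeat,
        # this time starting with the truncated stem itself
        stem = stem[:-1]
        include_stem = True
    return None


def assign_downstream(codes):
    """Map each code to its next code downstream (None if not found)."""
    codes = sorted(codes)
    return {c: find_downstream(c, codes[:i]) for i, c in enumerate(codes)}


# Codes (part after the decimal point) from vassdragsområde 001
print(assign_downstream(["10", "1A1", "1A20", "1A2A", "1A2B"]))
```

For these five codes the sketch links `1A2B` to `1A2A`, `1A2A` to `1A20`, `1A20` to `1A1` and `1A1` to `10`, while `10` has no candidate above it and is returned as `None`. Each code is compared against every candidate, so the sketch scales quadratically and is only intended to make the search order concrete.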
     
**Note**: For some catchments draining to Sweden (i.e. in vassdragsområder 301 to 315), the regines stop just over the border and do not continue down to the catchment outflow. This leaves some "hanging catchments", which do not have a valid downstream ID based on the algorithm above. For completeness, it usually makes sense to assign these catchments to the parent vassdragsområde, rather than leaving the `regine_down` as NaN. This can be controlled using the `nan_to_vass` keyword argument, which defaults to `False` (if `True`, it prints a warning if NaNs are filled for any catchments other than those draining to Sweden). See the docstring of `assign_regine_hierarchy` for details.
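
As a rough illustration of what filling hanging catchments means, the sketch below replaces missing downstream IDs with the parent vassdragsområde (written as the vassdragsområde number followed by a decimal point, e.g. `301.`). The function name `fill_hanging` and the two regine IDs are made up for the example, and this is not how `nan_to_vass` is implemented in `teotil3`.

```python
def fill_hanging(regine_down):
    """Assign catchments without a downstream ID to their vassdragsområde.

    'regine_down' maps each regine ID to its downstream ID (or None).
    """
    filled = {}
    for regine, down in regine_down.items():
        if down is None:
            # Parent vassdragsområde, e.g. '301.B' -> '301.'
            vassom = regine.split(".", 1)[0]
            down = f"{vassom}."
        filled[regine] = down
    return filled


# Hypothetical IDs, for illustration only
example = {"301.B": None, "301.C": "301.B"}
print(fill_hanging(example))
```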

### 2.1. Differences compared to the original TEOTIL

In the original TEOTIL, kystfelt were sorted according to their numeric codes, as described in the algorithm above. This is correct according to the regine naming convention, but produces unrealistic routing of fluxes in coastal areas. As an illustration, consider the example below from vassdragsområde `020`, between Kristiansand and Grimstad.

<center><img src="https://raw.githubusercontent.com/NIVANorge/teotil3/main/images/order_coastal_catchments.png" width="800"/></center>

In the original TEOTIL, fluxes are routed sequentially through the kystfelt (from "upstream" to "downstream") as follows:

    020.4260 >  020.425 >  020.4240 >  020.42320 >  020.4231 >  020.4220 >  020.421 >  020.41 >  020.32 >  020.3120 >  020.3110 >  020.224 >  020.2230 >  020.2220 >  020.2210 >  020.21 >  020.12 >  020.110
    
This is not realistic, as the numbering does not reflect dominant flow directions in coastal areas. In most cases, it is probably better to simply **assign each kystfelt directly to the parent vassdragsområde** i.e. `020.XXXX > 020.` for all kystfelt listed above.

To allow for this, the function `assign_regine_hierarchy` includes a keyword argument, `order_coastal`, which is `True` by default (as in the original TEOTIL model). Setting this to `False` will assign each kystfelt directly to its parent vassdragsområde.
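
The effect of routing kystfelt directly to the vassdragsområde can be sketched as below. The helper `kystfelt_to_vassom` is hypothetical and only mimics the idea behind `order_coastal=False`. It assumes that kystfelt are exactly the regines with wholly numeric codes, and it leaves all other regines out of the result. The ID `020.A1` is invented to show a non-coastal code being skipped.

```python
def kystfelt_to_vassom(regines):
    """Map each kystfelt directly to its parent vassdragsområde."""
    routing = {}
    for regine in regines:
        vassom, code = regine.split(".", 1)
        if code.isdigit():  # wholly numeric code: treated as a kystfelt
            routing[regine] = f"{vassom}."
    return routing


# Three kystfelt from the example above, plus one invented non-coastal ID
print(kystfelt_to_vassom(["020.4260", "020.425", "020.4240", "020.A1"]))
```

Each kystfelt then drains to `020.` in a single step, instead of passing through the whole chain of neighbouring kystfelt.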

In [4]:
# Assign next down ID for each regine
# Set kwargs for easiest comparison with original hierarchy
reg_df = reg_gdf[["regine", "vassom"]].copy()
reg_df = teo.io.assign_regine_hierarchy(
    reg_df,
    regine_col="regine",
    regine_down_col='regine_down',
    order_coastal=True,
    nan_to_vass=False,
    land_to_vass=False,
    add_offshore=False,
)
reg_df.head()

99.95 % of regines assigned.


Unnamed: 0,regine,regine_down,vassom
0,001.10,001.,1
1,001.1A1,001.10,1
2,001.1A20,001.1A1,1
3,001.1A2A,001.1A20,1
4,001.1A2B,001.1A2A,1


In [5]:
# Read old TEOTIL hierarchy
teo2_csv = r"https://raw.githubusercontent.com/NIVANorge/teotil2/main/data/core_input_data/regine_2020.csv"
reg_df_old = pd.read_csv(teo2_csv, sep=";")
reg_df_old = reg_df_old[["regine", "regine_ned"]]
regines_old = reg_df_old["regine"].tolist()
regines_new = reg_df["regine"].tolist()
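# Keep only regines present in both datasets, then flag whether the
# old and new next-down IDs agree
reg_df_comp = pd.merge(reg_df_old, reg_df, how="inner", on="regine")
reg_df_comp["match"] = reg_df_comp["regine_ned"] == reg_df_comp["regine_down"]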
reg_df_comp.head()

Unnamed: 0,regine,regine_ned,regine_down,vassom,match
0,001.10,001.,001.,1,True
1,001.1A1,001.10,001.10,1,True
2,001.1A20,001.1A1,001.1A1,1,True
3,001.1A2A,001.1A20,001.1A20,1,True
4,001.1A2B,001.1A2A,001.1A2A,1,True


In [6]:
n_no_match = len(reg_df_comp.query("match == False"))
n_match = len(reg_df_comp.query("match == True"))
n_reg = len(reg_df_comp)
print(n_reg, "regines found in both 'old' and 'new' datasets.")
print(
    f"Of these, {n_no_match} do not have the same next down ID (while {n_match} agree)."
)
print("")
reg_df_comp.query("match == False").head()

18217 regines found in both 'old' and 'new' datasets.
Of these, 798 do not have the same next down ID (while 17419 agree).



Unnamed: 0,regine,regine_ned,regine_down,vassom,match
148,002.116Z,002.1160,002.1163,2,False
149,002.1170,002.1160,002.1163,2,False
171,002.A2A,002.A20,002.A22,2,False
174,002.A3,002.A20,002.A22,2,False
184,002.AA3,002.AA2,002.AA22,2,False


Based on manual testing in ArcGIS, cases where the old hierarchy differs from the new one seem to be due to changes in the way regines have been subdivided. For example, some regines in the new dataset are not present in the old one, and vice versa. **For the examples I have tested, the new hierarchy seems reasonable**.
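
One simple way to see which regines exist in only one of the datasets is a set comparison. The sketch below uses a hypothetical helper, `compare_ids`, and toy IDs from a non-existent vassdragsområde `999`. In this notebook, the `regines_old` and `regines_new` lists could be passed instead.

```python
def compare_ids(old_ids, new_ids):
    """Split IDs into those found only in old, only in new, or in both."""
    old_set, new_set = set(old_ids), set(new_ids)
    return {
        "only_old": sorted(old_set - new_set),
        "only_new": sorted(new_set - old_set),
        "both": sorted(old_set & new_set),
    }


# Toy IDs, for illustration only
toy_old = ["999.A", "999.B"]
toy_new = ["999.A", "999.B1", "999.B2"]
result = compare_ids(toy_old, toy_new)
for key, ids in result.items():
    print(key, ids)
```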

Another way of testing this is to use the old regine dataset from TEOTIL2 and apply the new algorithm to that instead. This avoids the additional complications caused by changes to the regine boundaries, and instead focuses only on the new algorithm.

In [7]:
# Read old TEOTIL hierarchy
teo2_csv = r"https://raw.githubusercontent.com/NIVANorge/teotil2/main/data/core_input_data/regine_2020.csv"
reg_df_old = pd.read_csv(teo2_csv, sep=";")
reg_df_old = reg_df_old[["regine", "regine_ned"]]

# Remove invalid regine codes from the old data (used by TEOTIL2, e.g. OSPAR regions etc.) 
reg_df_old = reg_df_old[~reg_df_old['regine'].str.endswith('.')]
reg_df_old = reg_df_old[~reg_df_old['regine'].str.contains('_')]

# Apply new algorithm with 'legacy' settings to match TEOTIL2
reg_df_old = teo.io.assign_regine_hierarchy(
    reg_df_old,
    regine_col="regine",
    regine_down_col='regine_down',
    order_coastal=True,
    nan_to_vass=False,
    land_to_vass=False,
    add_offshore=False,
)
reg_df_old.head()

100.00 % of regines assigned.


Unnamed: 0,regine,regine_down,regine_ned
0,001.10,001.,001.
1,001.1A1,001.10,001.10
2,001.1A20,001.1A1,001.1A1
3,001.1A2A,001.1A20,001.1A20
4,001.1A2B,001.1A2A,001.1A2A


In [8]:
pct_match = 100 * (reg_df_old['regine_ned'] == reg_df_old['regine_down']).sum() / len(reg_df_old)
print(f"{pct_match:.0f}% of regines match.")

100% of regines match.


This shows that **the new algorithm for TEOTIL3, when run using "legacy" parameters, is capable of *exactly* reproducing the hierarchy from TEOTIL2**. This is a good result. Although the new algorithm is not perfect, it is at least as good as the old one, and it adds options (such as `order_coastal` and `nan_to_vass`) that allow more realistic handling of kystfelt and of hanging catchments draining to Sweden.