### Hydrofabric Builds MVP

This cookbook is an MVP that demonstrates a build procedure for going from the v3 Reference Fabric to a hydrofabric product with Nexus, Flowpath, Divide, and Network layers.

The data files in this notebook are preprocessed v3 reference flowpaths and divides, clipped to a HUC12 basin (010600010202) at Crystal Lake-Collyer Brook.

In [2]:
# import all necessary modules
import geopandas as gpd
from workflow import (
    aggregate_geometries_from_pairs_and_groups,
    aggregate_with_all_rules,
    build_hydroseq_network,
    find_outlets_by_hydroseq,
    reindex_layers_with_topology,
)

In [3]:
# load all references used in this proof of concept
huc_gdf = gpd.read_file("sample_huc.gpkg")
fp_gdf = gpd.read_file("sample_flowpaths.gpkg")
div_gdf = gpd.read_file("sample_divides.gpkg")

In [4]:
# plot the HUC boundary, divides, and flowpaths to visualize the reference inputs
m = huc_gdf.explore(color="black", fill=False, weight=5)
m = div_gdf.explore(m=m, color="red", alpha=0.2)
fp_gdf.explore(m=m, color="blue")

### Step 1: Create a Network Structure

First, we need to identify the individual networks within the reference and construct a graph-oriented object that connects each flowpath to its upstream neighbors.

In [5]:
network = build_hydroseq_network(fp_gdf)
outlets = find_outlets_by_hydroseq(fp_gdf)

print(f"Found {len(outlets)} outlets")
print(f"Outlet: {outlets}")

print("Network Object:")
print(network)

Found 1 outlets
Outlet: ['6720797']
Network Object:
{'6720675': ['6722501'], '6720683': ['6720773', '6720689'], '6720797': ['6720703', '6720701'], '6720703': ['6720683', '6720651'], '6720689': ['6720681', '6720679'], '6720679': ['6720677', '6720675']}
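
The dictionary printed above maps each flowpath ID to the IDs immediately upstream of it. As a rough illustration of the idea (not the actual `build_hydroseq_network` implementation), the sketch below builds the same kind of mapping from `hydroseq`/`dnhydroseq` pairs. The toy records, the `id` key, and the helper names `build_upstream_graph` and `find_outlets` are hypothetical, and the sketch assumes an outlet is any flowpath whose `dnhydroseq` matches no `hydroseq` in the dataset.

```python
from collections import defaultdict

# Hypothetical toy records: four flowpaths draining to a single outlet.
records = [
    {"id": "A", "hydroseq": 10, "dnhydroseq": 0},   # outlet: nothing downstream
    {"id": "B", "hydroseq": 20, "dnhydroseq": 10},
    {"id": "C", "hydroseq": 30, "dnhydroseq": 10},
    {"id": "D", "hydroseq": 40, "dnhydroseq": 20},
]


def build_upstream_graph(rows):
    """Map each flowpath ID to the IDs that flow directly into it."""
    id_by_hydroseq = {row["hydroseq"]: row["id"] for row in rows}
    graph = defaultdict(list)
    for row in rows:
        downstream_id = id_by_hydroseq.get(row["dnhydroseq"])
        if downstream_id is not None:
            graph[downstream_id].append(row["id"])
    return dict(graph)


def find_outlets(rows):
    """Return IDs whose dnhydroseq does not match any hydroseq in the data."""
    known = {row["hydroseq"] for row in rows}
    return [row["id"] for row in rows if row["dnhydroseq"] not in known]


toy_network = build_upstream_graph(records)
toy_outlets = find_outlets(records)
print(toy_network)
print(toy_outlets)
```

As in the real output, headwater flowpaths (here `C` and `D`) never appear as keys because nothing flows into them.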


### Step 2: Refactoring of the network to determine divides to be aggregated

Once the graph structure is created for the reference flowpaths, small flowpaths and divides need to be aggregated to better support routing stability. Currently, we enforce the following assumptions and rules when aggregating flowpaths:

#### Reference flowpath network consistency assumptions
- Hydroseq Reliability: The code assumes hydroseq/dnhydroseq fields provide accurate network topology. Lower hydroseq values indicate more downstream position, with differences reflecting true flow direction.
- Complete Network Coverage: All flowpaths in the dataset form a connected network graph. Missing or invalid dnhydroseq values indicate either outlets or data quality issues.
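
Both assumptions can be checked before any aggregation is attempted. The following is a minimal sketch that works on hypothetical plain-dict records rather than the GeoDataFrame and is not part of the `workflow` module. It flags flowpaths whose `dnhydroseq` matches nothing (candidate outlets or data quality issues) and flowpaths whose downstream neighbor does not have a lower `hydroseq`.

```python
def check_hydroseq_consistency(rows):
    """Return (unmatched, misordered) lists of flowpath IDs."""
    known = {row["hydroseq"] for row in rows}
    unmatched, misordered = [], []
    for row in rows:
        downstream = row["dnhydroseq"]
        if downstream not in known:
            # No downstream flowpath in the data: an outlet or a data issue.
            unmatched.append(row["id"])
        elif downstream >= row["hydroseq"]:
            # Downstream flowpaths are expected to have a lower hydroseq.
            misordered.append(row["id"])
    return unmatched, misordered


# Hypothetical records; "C" deliberately violates the ordering assumption.
sample_rows = [
    {"id": "A", "hydroseq": 10, "dnhydroseq": 0},
    {"id": "B", "hydroseq": 20, "dnhydroseq": 10},
    {"id": "C", "hydroseq": 5, "dnhydroseq": 20},
]
unmatched_ids, misordered_ids = check_hydroseq_consistency(sample_rows)
print("Unmatched dnhydroseq:", unmatched_ids)
print("Misordered hydroseq:", misordered_ids)
```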

#### Aggregation rules
- Rule Priority Order: The algorithm applies rules in strict sequence: (1) 4 km Segment Length Independence, (2) Small Catchment Aggregation (< 0.1 km²), (3) Stream Order-1 Branch Aggregation, (4) Drainage Area Aggregation. The first applicable rule takes precedence.
- Length Threshold Absolute: Flowpaths ≥ 4 km in length always remain independent, regardless of other characteristics. This represents a minimum modeling unit size requirement.
- Small Catchment Definition: Flowpaths with `areasqkm_left` < 0.1 km² are considered "small catchments" requiring special aggregation treatment to prevent loss of hydrologic significance.
- Stream Order Hierarchy: Order-2 streams are preferred aggregation targets over Order-1 streams when available. Order-1 streams aggregate their entire upstream branch as a headwater unit.
- Drainage Area Dominance: When no order-specific rules apply, flowpaths aggregate with their largest upstream tributary by total drainage area (`totdasqkm`).
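
The priority order can be read as a chain of early returns. The sketch below is a simplified illustration under stated assumptions, not the logic inside `aggregate_with_all_rules`: the function name `first_applicable_rule`, the `streamorder` key, and the returned labels are hypothetical, and the Order-2 target preference is left out for brevity.

```python
SEGMENT_LENGTH_THRESHOLD_KM = 4.0
SMALL_CATCHMENT_THRESHOLD_KM2 = 0.1


def first_applicable_rule(flowpath, upstream):
    """Return (rule_name, target_id) for the first rule that applies."""
    # Rule 1: segments at or above the length threshold stay independent.
    if flowpath["lengthkm"] >= SEGMENT_LENGTH_THRESHOLD_KM:
        return "independent", None
    # Rule 2: very small catchments get special aggregation handling.
    if flowpath["areasqkm_left"] < SMALL_CATCHMENT_THRESHOLD_KM2:
        return "small_catchment", None
    # Rule 3: order-1 streams aggregate their upstream branch as one unit.
    if flowpath["streamorder"] == 1:
        return "headwater_group", None
    # Rule 4: otherwise pair with the tributary draining the largest area.
    if upstream:
        target = max(upstream, key=lambda fp: fp["totdasqkm"])
        return "drainage_area_pair", target["id"]
    return "no_rule", None


# Illustrative values only; they are not taken from the sample data.
long_fp = {"id": "A", "lengthkm": 4.8, "areasqkm_left": 4.0, "streamorder": 2}
tiny_fp = {"id": "B", "lengthkm": 0.3, "areasqkm_left": 0.05, "streamorder": 2}
head_fp = {"id": "C", "lengthkm": 1.9, "areasqkm_left": 4.1, "streamorder": 1}
mid_fp = {"id": "D", "lengthkm": 1.7, "areasqkm_left": 6.3, "streamorder": 2}
tributaries = [{"id": "U1", "totdasqkm": 26.9}, {"id": "U2", "totdasqkm": 9.1}]

for fp, ups in [(long_fp, []), (tiny_fp, []), (head_fp, []), (mid_fp, tributaries)]:
    print(fp["id"], first_applicable_rule(fp, ups))
```

Because each rule returns immediately, a 4.8 km order-1 flowpath would be classed as independent and never reach the headwater rule, which is the "first applicable rule takes precedence" behavior.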

In [6]:
segment_length_threshold = 4.0
small_catchment_threshold = 0.1
all_aggregation_pairs = []
all_headwater_groups = []
all_independent_flowpaths = []
all_minor_flowpaths = []

for outlet in outlets:
    print(f"  Processing outlet {outlet}...")
    result = aggregate_with_all_rules(
        network_graph=network,
        fp=fp_gdf,
        start_id=outlet,
        segment_length_threshold=segment_length_threshold,
        small_catchment_threshold=small_catchment_threshold,
    )

    all_aggregation_pairs.extend(result["aggregation_pairs"])
    all_headwater_groups.extend(result["headwater_groups"])
    all_independent_flowpaths.extend(result["independent_flowpaths"])
    all_minor_flowpaths.extend(result["minor_flowpaths"])

print("\nTotal aggregation relationships identified:")
print(f"  Pairs: {len(all_aggregation_pairs)}")
print(f"  Headwater groups: {len(all_headwater_groups)}")
print(f"  Independent: {len(all_independent_flowpaths)}")
print(f"  Minor flowpaths (flowlines): {len(all_minor_flowpaths)}")

  Processing outlet 6720797...

=== Starting aggregation with all rules from outlet 6720797 ===
Small catchment threshold: 0.1 km²
Segment length threshold: 4.0 km
Starting stack-based trace from 6720797
  Processing 6720797: length=4.84km, area=48.975km², order=2.0
    Found 2 upstream: ['6720703', '6720701']
    Segment Length RULE: 4.84km >= 4.0km - INDEPENDENT
  Processing 6720703: length=1.69km, area=38.297km², order=2.0
    Found 2 upstream: ['6720683', '6720651']
    Current segment 1.69km < 4.0km - applying aggregation rules
    DRAINAGE AREA PAIR: 6720703 -> 6720683 (area: 26.924km²)
    HEADWATER GROUP: ['6720651']
  Processing 6720701: length=7.97km, area=6.644km², order=1.0
    Found 0 upstream: []
    Segment Length RULE: 7.97km >= 4.0km - INDEPENDENT
  Processing 6720773: length=1.93km, area=4.166km², order=1.0
    Found 0 upstream: []
    No upstream - adding to headwater group
  Processing 6720689: length=0.49km, area=18.672km², order=2.0
    Found 2 upstream: ['6720681
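
The log mentions a stack-based trace starting at the outlet. As a hedged illustration of that general technique, the sketch below walks the network dictionary printed in Step 1 with an explicit stack. The helper `trace_upstream` is hypothetical, and its visiting order is not claimed to match the order used by `aggregate_with_all_rules`.

```python
# Upstream adjacency copied from the network object printed in Step 1.
printed_network = {
    "6720675": ["6722501"],
    "6720683": ["6720773", "6720689"],
    "6720797": ["6720703", "6720701"],
    "6720703": ["6720683", "6720651"],
    "6720689": ["6720681", "6720679"],
    "6720679": ["6720677", "6720675"],
}


def trace_upstream(graph, start_id):
    """Visit every flowpath reachable upstream of start_id using a stack."""
    visited = []
    seen = set()
    stack = [start_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        visited.append(current)
        # Headwaters have no entry in the graph, so default to an empty list.
        stack.extend(graph.get(current, []))
    return visited


visited_ids = trace_upstream(printed_network, "6720797")
headwater_ids = [fp_id for fp_id in visited_ids if fp_id not in printed_network]
print(f"{len(visited_ids)} flowpaths reached, {len(headwater_ids)} headwaters")
```

An explicit stack avoids recursion limits on deep networks, which is one plausible reason to prefer it over a recursive walk.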

### Step 3: Aggregate Geometries 

This step takes any flowpaths that need to be combined and merges their geometries into a single shape. This is important for divides, which are used in zonal statistic calculations.

In [7]:
geometry_result = aggregate_geometries_from_pairs_and_groups(
    flowpaths_gdf=fp_gdf,
    divides_gdf=div_gdf,
    aggregation_pairs=all_aggregation_pairs,
    headwater_groups=all_headwater_groups,
    independent_flowpaths=all_independent_flowpaths,
    minor_flowpaths=all_minor_flowpaths,
)

=== AGGREGATING GEOMETRIES ===
Processing 3 aggregation pairs...
  Group 1: Aggregating 2 flowpaths: ['6720703', '6720683']
    Ordering geometries by hydroseq: [np.float64(4337205029.0), np.float64(4337205027.0)]
    Created aggregated flowpath: 4.38km, 65.22km², 2 segments
  Group 2: Aggregating 2 flowpaths: ['6720689', '6720681']
    Ordering geometries by hydroseq: [np.float64(4337205035.0), np.float64(4337205030.0)]
    Created aggregated flowpath: 2.95km, 28.61km², 2 segments
  Group 3: Aggregating 3 flowpaths: ['6720679', '6720675', '6722501']
    Ordering geometries by hydroseq: [np.float64(4337205033.0), np.float64(4337205032.0), np.float64(4337205031.0)]
    Created aggregated flowpath: 3.90km, 20.28km², 3 segments

Processing 3 headwater groups...
  Group 1: Single headwater 6720651 - keeping as individual
  Group 2: Single headwater 6720773 - keeping as individual
  Group 3: Single headwater 6720677 - keeping as individual

Processing 2 independent flowpaths...
  Independen
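
The log above reports a combined length and a segment count for each aggregated flowpath, with members listed by `hydroseq`. The sketch below covers only that attribute bookkeeping and omits the geometry merge itself. It assumes lengths simply add and that members are ordered by descending `hydroseq`, as the lists in the log appear to be. The helper `aggregate_attributes` and the sample values are hypothetical.

```python
def aggregate_attributes(segments):
    """Combine per-segment attributes into one record for the merged unit."""
    # Assumed ordering: highest hydroseq (most upstream) first.
    ordered = sorted(segments, key=lambda seg: seg["hydroseq"], reverse=True)
    return {
        "member_ids": [seg["id"] for seg in ordered],
        "lengthkm": round(sum(seg["lengthkm"] for seg in ordered), 2),
        "n_segments": len(ordered),
    }


# Illustrative segments; these values are not taken from the sample data.
toy_segments = [
    {"id": "X", "hydroseq": 27, "lengthkm": 1.5},
    {"id": "Y", "hydroseq": 29, "lengthkm": 2.25},
]
merged = aggregate_attributes(toy_segments)
print(merged)
```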

### Step 4: Re-index Flowpaths and Divides and Create Nexus Points

Once the geometries are aggregated, the layers must be re-indexed and a nexus topology built. Each catchment is connected to its flowpath as a 1:1 reference, and nexus points are created where flow aggregates. The network table is also created in this step.

*NOTE:* Flowlines (which we call minor flowpaths to reduce confusion) are not included in the flowpaths layer because the v3 prototype Hydrofabric layers in VPU01 did not include them. This capability is meant to use the v3 prototype Hydrofabric as a reference.

*NOTE:* NHD references through `hf_id` were not included as the NHD dataset is too large to fit in a notebook and ship with the code.

In [8]:
hf = reindex_layers_with_topology(fp_gdf, div_gdf, geometry_result)

Processing 7 aggregated flowpaths
Processing 8 aggregated divides
Pre-analyzing confluence points...
    Unit 1 (indep_6720797) outlet flows to wb-0
    Unit 2 (indep_6720701) outlet flows to wb-1
    Unit 3 (agg_pair_1_6720703_6720683) outlet flows to wb-1
    Unit 4 (6720651) outlet flows to wb-3
    Unit 5 (agg_pair_2_6720689_6720681) outlet flows to wb-3
    Unit 6 (agg_group_3_6720679_6720675_6722501) outlet flows to wb-5
    Unit 7 (6720773) outlet flows to wb-3
Creating shared nexus assignments...
  nex-1 -> wb-0 (serves 1 flowpaths)
  nex-2 -> wb-1 (serves 2 flowpaths)
  nex-3 -> wb-3 (serves 3 flowpaths)
  nex-4 -> wb-5 (serves 1 flowpaths)
Processing aggregated units in hydroseq order:
  Unit 1: indep_6720797 (original IDs: ['6720797'], min hydroseq: 4337205025.0)
    Created wb-1, cat-1 -> nex-1 -> wb-0 (area: 4.034 km²)
  Unit 2: indep_6720701 (original IDs: ['6720701'], min hydroseq: 4337205026.0)
    Created wb-2, cat-2 -> nex-2 -> wb-1 (area: 6.644 km²)
  Unit 3: agg_pai
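
The "shared nexus assignments" in the log can be reproduced by grouping units on their downstream flowpath. The sketch below uses the unit-to-downstream pairs from the log above. The helper `assign_shared_nexus` is hypothetical and numbers nexus points in order of first appearance, which is an assumption that happens to match the log rather than a description of `reindex_layers_with_topology`.

```python
from collections import defaultdict

# Downstream target of each unit, copied from the log output above.
unit_downstream = {
    "indep_6720797": "wb-0",
    "indep_6720701": "wb-1",
    "agg_pair_1_6720703_6720683": "wb-1",
    "6720651": "wb-3",
    "agg_pair_2_6720689_6720681": "wb-3",
    "agg_group_3_6720679_6720675_6722501": "wb-5",
    "6720773": "wb-3",
}


def assign_shared_nexus(downstream_by_unit):
    """Give each distinct downstream flowpath one nexus shared by its inflows."""
    nexus_by_target = {}
    units_by_nexus = defaultdict(list)
    for unit, target in downstream_by_unit.items():
        if target not in nexus_by_target:
            nexus_by_target[target] = f"nex-{len(nexus_by_target) + 1}"
        units_by_nexus[nexus_by_target[target]].append(unit)
    return nexus_by_target, dict(units_by_nexus)


nexus_by_target, units_by_nexus = assign_shared_nexus(unit_downstream)
for target, nexus_id in nexus_by_target.items():
    count = len(units_by_nexus[nexus_id])
    print(f"{nexus_id} -> {target} (serves {count} flowpaths)")
```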

### Let's look at our created Hydrofabric 

Now that we've run all of the end-to-end steps, let's take a look at each of the required layers. The database schema of each layer is intended to match that of the prototype v3 Hydrofabric.

In [9]:
# network
hf["network"].head()

Unnamed: 0,fid,flowpath_id,flowpath_toid,mainstem,hydroseq,lengthkm,divide_id,poi_id,vpuid,divide_toid,type,areasqkm,flowline_id,hf_part,hf_id,hf_source
0,1,wb-1,nex-1,4336380000.0,4337205000.0,4.836,cat-1,cat-1,1,cat-1,network,4.0338,[6720797],,,MVP
1,2,wb-2,nex-2,4336380000.0,4337205000.0,7.972,cat-2,cat-2,1,wb-1,network,6.64425,[6720701],,,MVP
2,3,wb-3,nex-2,4336380000.0,4337205000.0,4.375,cat-3,cat-3,1,wb-1,network,6.32745,"[6720703, 6720683]",,,MVP
3,4,wb-4,nex-3,4336380000.0,4337205000.0,5.915,cat-4,cat-4,1,wb-3,network,9.131399,[6720651],,,MVP
4,5,wb-5,nex-3,4336380000.0,4337205000.0,2.954,cat-5,cat-5,1,wb-3,network,10.22535,"[6720689, 6720681]",,,MVP


In [10]:
# divides
hf["divides"].head()

Unnamed: 0,fid,divide_id,divide_toid,type,ds_id,areasqkm,vpuid,flowpath_id,lengthkm,has_flowline,geometry
0,1,cat-1,nex-1,network,,4.0338,1,wb-1,4.836,True,"MULTIPOLYGON (((2033835 2599185, 2033805 25991..."
1,2,cat-2,nex-2,network,,6.64425,1,wb-2,7.972,True,"MULTIPOLYGON (((2031885 2601405, 2031765 26013..."
2,3,cat-3,nex-2,network,,6.32745,1,wb-3,4.375,True,"POLYGON ((2028045 2602935, 2028135 2602785, 20..."
3,4,cat-4,nex-3,network,,9.131399,1,wb-4,5.915,True,"MULTIPOLYGON (((2028045 2602935, 2028045 26033..."
4,5,cat-5,nex-3,network,,10.22535,1,wb-5,2.954,True,"POLYGON ((2028675 2599665, 2028525 2599665, 20..."
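
Step 4 describes divides and flowpaths as a 1:1 reference, and the rows above can be used for a quick sanity check. This is a minimal sketch with a hypothetical helper, `is_one_to_one`, applied to the five `divide_id`/`flowpath_id` pairs shown in the table. In the notebook the same check could be run over the full columns.

```python
from collections import Counter

# (divide_id, flowpath_id) pairs copied from the divides table above.
divide_flowpath_pairs = [
    ("cat-1", "wb-1"),
    ("cat-2", "wb-2"),
    ("cat-3", "wb-3"),
    ("cat-4", "wb-4"),
    ("cat-5", "wb-5"),
]


def is_one_to_one(pairs):
    """True when no left-hand or right-hand ID appears more than once."""
    left_counts = Counter(left for left, _ in pairs)
    right_counts = Counter(right for _, right in pairs)
    return all(n == 1 for n in left_counts.values()) and all(
        n == 1 for n in right_counts.values()
    )


print(is_one_to_one(divide_flowpath_pairs))
```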


In [11]:
# flowpaths
hf["flowpaths"].head()

Unnamed: 0,fid,flowpath_id,flowpath_toid,hydroseq,mainstem,lengthkm,divide_id,poi_id,vpuid,geometry
0,1,wb-1,nex-1,4337205000.0,4336380000.0,4.836,cat-1,,1,"MULTILINESTRING ((2030901.602 2600376.245, 203..."
1,2,wb-2,nex-2,4337205000.0,4336380000.0,7.972,cat-2,,1,"MULTILINESTRING ((2030118.994 2605583.92, 2030..."
2,3,wb-3,nex-2,4337205000.0,4336380000.0,4.375,cat-3,,1,"MULTILINESTRING ((2028876.04 2599961.997, 2028..."
3,4,wb-4,nex-3,4337205000.0,4336380000.0,5.915,cat-4,,1,"MULTILINESTRING ((2028002.719 2606200.239, 202..."
4,5,wb-5,nex-3,4337205000.0,4336380000.0,2.954,cat-5,,1,"MULTILINESTRING ((2026556.183 2600175.429, 202..."


In [12]:
# nexus
hf["nexus"].head()


In [13]:
# Let's view the outputs on a map
m = huc_gdf.explore(color="black", fill=False, weight=5)
m = hf["divides"].explore(m=m, color="red", alpha=0.2)
m = hf["flowpaths"].explore(m=m, color="blue")
hf["nexus"].explore(m=m, color="black")

In [14]:
# We can also save the outputs to disk for external verification if necessary
output_file = "MVP_NGWPC_hydrofabric.gpkg"
for table_name, _layer in hf.items():
    if len(_layer) > 0:
        gpd.GeoDataFrame(_layer).to_file(output_file, layer=table_name, driver="GPKG")
    else:
        print(f"Warning: {table_name} layer is empty")
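
For external verification, the layers written to the file can be listed without any GIS tooling, because a GeoPackage is an SQLite database that registers its tables in `gpkg_contents`. The sketch below is one way to do that with the standard library. The helper `list_gpkg_layers` is hypothetical, and the exact `data_type` recorded for each layer (for example `features` or `attributes`) depends on how the writer registered it.

```python
import os
import sqlite3
from contextlib import closing


def list_gpkg_layers(path):
    """Return (table_name, data_type) rows registered in a GeoPackage."""
    with closing(sqlite3.connect(path)) as conn:
        rows = conn.execute(
            "SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name"
        ).fetchall()
    return rows


gpkg_path = "MVP_NGWPC_hydrofabric.gpkg"
# Only query the file if the save step above has already produced it.
if os.path.exists(gpkg_path):
    for table_name, data_type in list_gpkg_layers(gpkg_path):
        print(f"{table_name}: {data_type}")
```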