
## Catchments

**Pour points**

Definition: Specific locations on the river network where we want to measure or model water flow.
Examples: Gauging stations, river mouths, dams, hydropower plants.

**Catchments**
 
Definition: The area of land where all rainfall drains to the same pour point.
Each catchment is linked to exactly one pour point.

**Steps**
- Clip the HydroSHEDS DEM, ACC, and DIR rasters to Bhutan plus a buffer.
- Create pour points (CSV with lon, lat, id).
- Snap pour points to the nearest high-ACC pixels (river cells).
- Use flow direction (DIR) and the snapped points to generate catchments.
- Convert catchments to polygons and calculate basic attributes.


- Use your clipped DIR (as_dir_Bhutan_and_buffer.tif) to build a catchment ID raster (one ID per basin).
- Convert catchments to polygons (optional, for QA/visualization).
- Build a point→catchment_id mapping for any Bhutan locations (CSV of lon/lat).
- Use the same raster to tag every weather pixel with catchment_id, then group/aggregate.
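The snapping step can be illustrated on a toy grid. This is a minimal sketch, assuming a square search window and a plain nested-list ACC grid; `snap_to_max_acc` and `radius` are hypothetical names introduced here, not part of the notebook.

```python
def snap_to_max_acc(acc, row, col, radius=2):
    """Return (row, col) of the highest-ACC cell within `radius` cells (square window)."""
    n_rows, n_cols = len(acc), len(acc[0])
    best = (row, col)
    for r in range(max(0, row - radius), min(n_rows, row + radius + 1)):
        for c in range(max(0, col - radius), min(n_cols, col + radius + 1)):
            if acc[r][c] > acc[best[0]][best[1]]:
                best = (r, c)
    return best

# Toy ACC grid: the "river" runs down column 2, accumulating southward.
toy_acc = [
    [1, 2, 50, 1],
    [1, 3, 60, 2],
    [1, 1, 75, 1],
    [1, 2, 90, 3],
]

# A pour point placed slightly off the river at (1, 0) is moved onto it.
snapped = snap_to_max_acc(toy_acc, row=1, col=0, radius=2)
print(snapped)
```

A larger radius tends to pull the point further downstream, so in practice the radius is kept to a few cells.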

## 1. Crop and Save a Smaller DEM, ACC, DIR (TIF) for Bhutan + Buffer
Instead of processing the full Asia-wide TIF files, we first crop and save smaller GeoTIFF files limited to the Bhutan region and its buffer (latitude 25.0°–29.5°, longitude 87.0°–93.5°).
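As a quick sanity check on the crop extent, the expected raster size can be worked out by hand. This sketch assumes the `3s` in the file names means a 3 arc-second grid (1200 cells per degree); `CELLS_PER_DEGREE` is a name introduced here for illustration.

```python
# Assumption: 3 arc-second cells, so 3600 / 3 = 1200 cells per degree.
CELLS_PER_DEGREE = 1200

min_lon, max_lon = 87.0, 93.5
min_lat, max_lat = 25.0, 29.5

# Number of cells spanned by the bounding box in each direction.
width = round((max_lon - min_lon) * CELLS_PER_DEGREE)
height = round((max_lat - min_lat) * CELLS_PER_DEGREE)
print(f"Expected crop size: {width} x {height} cells")
```

Under that assumption the box spans 7800 x 5400 cells, which is the size a correctly cropped raster should report.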

**DEM**: Digital Elevation Model

- A raster grid of ground elevation (usually meters above sea level).

**DIR**: Flow Direction

- A raster showing which neighboring cell water flows to from each cell (downslope).
- In HydroSHEDS (ESRI D8), values encode directions: 1=E, 2=SE, 4=S, 8=SW, 16=W, 32=NW, 64=N, 128=NE. 
- Computed from the DEM.
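To make the D8 encoding concrete, the sketch below follows the codes cell by cell on a tiny hand-made grid. The offsets use the same (row, column) convention as the code table above, with rows increasing southward; `trace_downstream` and `toy_dir` are illustrative names only.

```python
# ESRI D8 code -> (row offset, col offset); rows increase southward.
D8_OFFSETS = {
    1: (0, 1), 2: (1, 1), 4: (1, 0), 8: (1, -1),
    16: (0, -1), 32: (-1, -1), 64: (-1, 0), 128: (-1, 1),
}

def trace_downstream(dir_grid, row, col):
    """Follow D8 codes from (row, col) until flow leaves the grid or hits a non-D8 value."""
    n_rows, n_cols = len(dir_grid), len(dir_grid[0])
    path = [(row, col)]
    visited = {(row, col)}
    while True:
        offset = D8_OFFSETS.get(dir_grid[row][col])
        if offset is None:  # sink or NoData code
            break
        row, col = row + offset[0], col + offset[1]
        out_of_grid = not (0 <= row < n_rows and 0 <= col < n_cols)
        if out_of_grid or (row, col) in visited:  # left the grid, or loop guard
            break
        path.append((row, col))
        visited.add((row, col))
    return path

# Toy DIR grid; the 0 in the bottom-right corner acts as a sink.
toy_dir = [
    [4, 4, 8],
    [2, 4, 8],
    [1, 1, 0],
]
flow_path = trace_downstream(toy_dir, 0, 0)
print(flow_path)
```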

**ACC**: Flow Accumulation

- For each cell, how much upstream area drains into it.
- High ACC = river channels; used to define streams and to snap pour points.
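The relation between DIR and ACC can be sketched on a toy grid: each cell passes its own count plus one to its downstream neighbour. This is a minimal illustration, not how HydroSHEDS derived its ACC layer; it counts strictly upstream cells, whereas some products also count the cell itself.

```python
from collections import deque

D8_OFFSETS = {
    1: (0, 1), 2: (1, 1), 4: (1, 0), 8: (1, -1),
    16: (0, -1), 32: (-1, -1), 64: (-1, 0), 128: (-1, 1),
}

def flow_accumulation(dir_grid):
    """Count the number of upstream cells for every cell of a D8 grid."""
    n_rows, n_cols = len(dir_grid), len(dir_grid[0])

    def target(r, c):
        offset = D8_OFFSETS.get(dir_grid[r][c])
        if offset is None:
            return None
        nr, nc = r + offset[0], c + offset[1]
        return (nr, nc) if 0 <= nr < n_rows and 0 <= nc < n_cols else None

    # In-degree = number of neighbours draining directly into each cell.
    indeg = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            t = target(r, c)
            if t is not None:
                indeg[t[0]][t[1]] += 1

    acc = [[0] * n_cols for _ in range(n_rows)]
    queue = deque((r, c) for r in range(n_rows) for c in range(n_cols) if indeg[r][c] == 0)
    while queue:  # process cells from the ridges down to the outlets
        r, c = queue.popleft()
        t = target(r, c)
        if t is None:
            continue
        acc[t[0]][t[1]] += acc[r][c] + 1
        indeg[t[0]][t[1]] -= 1
        if indeg[t[0]][t[1]] == 0:
            queue.append(t)
    return acc

toy_dir = [
    [4, 4, 8],
    [2, 4, 8],
    [1, 1, 0],
]
toy_acc = flow_accumulation(toy_dir)
for row in toy_acc:
    print(row)
```

In this toy grid all nine cells drain to the bottom-right corner, so that cell ends up with the highest value, which is why thresholding ACC picks out channels.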

In [10]:
# First, crop the DEM
import rasterio
from rasterio.windows import from_bounds
import os
import gc

# Define the input and output paths
input_tif = "../../data/HydroSHEDS/as_dem_3s.tif"
output_tif = "../../data/HydroSHEDS/as_dem_Bhutan_and_buffer.tif"

# Define the bounding box for Bhutan + buffer (in degrees)
min_lon, max_lon = 87.0, 93.5
min_lat, max_lat = 25.0, 29.5

# Open the source TIFF file
with rasterio.open(input_tif) as src:
    print(f"📦 Number of bands in TIFF: {src.count}")
    if src.count != 1:
        raise ValueError("❌ Expected only one band in the DEM file.")

    # Compute the pixel window corresponding to the bounding box
    window = from_bounds(min_lon, min_lat, max_lon, max_lat, transform=src.transform)

    # Read the data within that window (band 1 = elevation)
    data = src.read(1, window=window)

    # Get the updated transform for the cropped window
    transform = src.window_transform(window)

    # Save the cropped raster to a new TIF
    out_meta = src.meta.copy()
    out_meta.update({
        "height": window.height,
        "width": window.width,
        "transform": transform
    })

    with rasterio.open(output_tif, "w", **out_meta) as out_src:
        out_src.write(data, 1)

print(f"✅ Saved: {output_tif}")
print(f"📐 Size (width x height): {window.width} x {window.height}")

# Check file size in MB
file_size_mb = os.path.getsize(output_tif) / (1024 * 1024)
print(f"💾 File size: {file_size_mb:.2f} MB")

with rasterio.open(output_tif) as tif_check:
    print(f"📌 CRS: {tif_check.crs}")
    print(f"🧭 Bounds: {tif_check.bounds}")
    print(f"📦 Data type: {tif_check.dtypes[0]}")
    print(f"🧮 NoData value: {tif_check.nodata}")

# 🔥 Clean up memory
del data, transform, out_meta, window, tif_check, src, out_src
gc.collect()
print("✅ Memory cleaned up.")

📦 Number of bands in TIFF: 1
✅ Saved: ../../data/HydroSHEDS/as_dem_Bhutan_and_buffer.tif
📐 Size (width x height): 7800.0 x 5400.0
💾 File size: 80.37 MB
📌 CRS: EPSG:4326
🧭 Bounds: BoundingBox(left=87.0, bottom=25.000000000000007, right=93.5, top=29.500000000000007)
📦 Data type: int16
🧮 NoData value: 32767.0
✅ Memory cleaned up.


In [11]:
# Crop ACC and DIR
import rasterio
from rasterio.windows import from_bounds
import os
import gc

# === ACC ===
input_tif_acc  = "../../data/HydroSHEDS/as_acc_3s.tif"
output_tif_acc = "../../data/HydroSHEDS/as_acc_Bhutan_and_buffer.tif"

# Bhutan + buffer (degrees)
min_lon, max_lon = 87.0, 93.5
min_lat, max_lat = 25.0, 29.5

with rasterio.open(input_tif_acc) as src_acc:
    print(f"📦 Number of bands in TIFF: {src_acc.count}")
    if src_acc.count != 1:
        raise ValueError("❌ Expected only one band in the ACC file.")

    window_acc = from_bounds(min_lon, min_lat, max_lon, max_lat, transform=src_acc.transform)

    data_acc = src_acc.read(1, window=window_acc)
    transform_acc = src_acc.window_transform(window_acc)

    out_meta_acc = src_acc.meta.copy()
    out_meta_acc.update({
        "height": window_acc.height,
        "width": window_acc.width,
        "transform": transform_acc
    })

    with rasterio.open(output_tif_acc, "w", **out_meta_acc) as out_src_acc:
        out_src_acc.write(data_acc, 1)

print(f"✅ Saved: {output_tif_acc}")
print(f"📐 Size (width x height): {window_acc.width} x {window_acc.height}")

file_size_mb_acc = os.path.getsize(output_tif_acc) / (1024 * 1024)
print(f"💾 File size: {file_size_mb_acc:.2f} MB")

with rasterio.open(output_tif_acc) as tif_check_acc:
    print(f"📌 CRS: {tif_check_acc.crs}")
    print(f"🧭 Bounds: {tif_check_acc.bounds}")
    print(f"📦 Data type: {tif_check_acc.dtypes[0]}")
    print(f"🧮 NoData value: {tif_check_acc.nodata}")

# 🔥 Clean up memory (ACC)
del data_acc, transform_acc, out_meta_acc, window_acc, tif_check_acc, src_acc, out_src_acc
gc.collect()
print("✅ Memory cleaned up (ACC).")


# === DIR ===
input_tif_dir  = "../../data/HydroSHEDS/as_dir_3s.tif"
output_tif_dir = "../../data/HydroSHEDS/as_dir_Bhutan_and_buffer.tif"

with rasterio.open(input_tif_dir) as src_dir:
    print(f"📦 Number of bands in TIFF: {src_dir.count}")
    if src_dir.count != 1:
        raise ValueError("❌ Expected only one band in the DIR file.")

    window_dir = from_bounds(min_lon, min_lat, max_lon, max_lat, transform=src_dir.transform)

    data_dir = src_dir.read(1, window=window_dir)
    transform_dir = src_dir.window_transform(window_dir)

    out_meta_dir = src_dir.meta.copy()
    out_meta_dir.update({
        "height": window_dir.height,
        "width": window_dir.width,
        "transform": transform_dir
    })

    with rasterio.open(output_tif_dir, "w", **out_meta_dir) as out_src_dir:
        out_src_dir.write(data_dir, 1)

print(f"✅ Saved: {output_tif_dir}")
print(f"📐 Size (width x height): {window_dir.width} x {window_dir.height}")

file_size_mb_dir = os.path.getsize(output_tif_dir) / (1024 * 1024)
print(f"💾 File size: {file_size_mb_dir:.2f} MB")

with rasterio.open(output_tif_dir) as tif_check_dir:
    print(f"📌 CRS: {tif_check_dir.crs}")
    print(f"🧭 Bounds: {tif_check_dir.bounds}")
    print(f"📦 Data type: {tif_check_dir.dtypes[0]}")
    print(f"🧮 NoData value: {tif_check_dir.nodata}")

# 🔥 Clean up memory (DIR)
del data_dir, transform_dir, out_meta_dir, window_dir, tif_check_dir, src_dir, out_src_dir
gc.collect()
print("✅ Memory cleaned up (DIR).")

📦 Number of bands in TIFF: 1
✅ Saved: ../../data/HydroSHEDS/as_acc_Bhutan_and_buffer.tif
📐 Size (width x height): 7800.0 x 5400.0
💾 File size: 160.71 MB
📌 CRS: EPSG:4326
🧭 Bounds: BoundingBox(left=87.0, bottom=25.000000000000007, right=93.5, top=29.500000000000007)
📦 Data type: uint32
🧮 NoData value: 4294967295.0
✅ Memory cleaned up (ACC).
📦 Number of bands in TIFF: 1
✅ Saved: ../../data/HydroSHEDS/as_dir_Bhutan_and_buffer.tif
📐 Size (width x height): 7800.0 x 5400.0
💾 File size: 40.20 MB
📌 CRS: EPSG:4326
🧭 Bounds: BoundingBox(left=87.0, bottom=25.000000000000007, right=93.5, top=29.500000000000007)
📦 Data type: uint8
🧮 NoData value: 255.0
✅ Memory cleaned up (DIR).
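
The reported file sizes can be cross-checked against the raster dimensions and data types. A rough sketch, assuming the cropped files are written uncompressed (an assumption, since the crop code sets no compression option of its own):

```python
# Raw payload = width x height x bytes per cell; GeoTIFF headers add a little on top.
width, height = 7800, 5400
bytes_per_cell = {"DEM (int16)": 2, "ACC (uint32)": 4, "DIR (uint8)": 1}

raw_size_mb = {}
for name, nbytes in bytes_per_cell.items():
    raw_size_mb[name] = width * height * nbytes / (1024 * 1024)
    print(f"{name}: about {raw_size_mb[name]:.2f} MB of raw cell data")
```

The raw payloads come out slightly below the 80.37, 160.71, and 40.20 MB printed above, which is consistent with uncompressed files plus a small header overhead.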


## 2. Generate catchments (basins) from DIR
Note: the DIR layer stores flow directions using ESRI's D8 scheme. Each pixel holds a code for the direction water flows to:
1 = E, 2 = SE, 4 = S, 8 = SW, 16 = W, 32 = NW, 64 = N, 128 = NE.
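
Conceptually, a catchment raster is built by following each cell downstream until it reaches a pour point. The sketch below does this on a toy grid with plain Python lists; it illustrates the idea only and is not the algorithm WhiteboxTools uses internally. `label_catchments` is a hypothetical helper name.

```python
D8_OFFSETS = {
    1: (0, 1), 2: (1, 1), 4: (1, 0), 8: (1, -1),
    16: (0, -1), 32: (-1, -1), 64: (-1, 0), 128: (-1, 1),
}

def label_catchments(dir_grid, pour_points):
    """pour_points maps (row, col) -> catchment ID; 0 marks cells draining to no pour point."""
    n_rows, n_cols = len(dir_grid), len(dir_grid[0])
    labels = [[None] * n_cols for _ in range(n_rows)]
    for (r, c), cid in pour_points.items():
        labels[r][c] = cid

    for r0 in range(n_rows):
        for c0 in range(n_cols):
            path, on_path = [], set()
            r, c = r0, c0
            while True:
                if not (0 <= r < n_rows and 0 <= c < n_cols):
                    cid = 0  # flowed off the grid without meeting a pour point
                    break
                if labels[r][c] is not None:
                    cid = labels[r][c]  # reached a pour point or an already labelled cell
                    break
                if (r, c) in on_path:
                    cid = 0  # loop guard for inconsistent direction data
                    break
                path.append((r, c))
                on_path.add((r, c))
                offset = D8_OFFSETS.get(dir_grid[r][c])
                if offset is None:
                    cid = 0  # sink or NoData cell
                    break
                r, c = r + offset[0], c + offset[1]
            for pr, pc in path:  # label every cell visited on the way down
                labels[pr][pc] = cid
    return labels

toy_dir = [
    [4, 8, 4, 8],
    [1, 4, 1, 4],
    [1, 4, 1, 4],
]
# Two outlets on the bottom row, both flowing south off the grid.
toy_labels = label_catchments(toy_dir, {(2, 1): 1, (2, 3): 2})
for row in toy_labels:
    print(row)
```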

In [18]:
# === Install (into THIS kernel) + Build clustered watersheds @ ACC threshold = 10,000 ===
# - Installs: pyogrio (for GPKG I/O), geopandas (optional)
# - Streams from ACC (threshold = 10,000)
# - Boundary outlet candidates -> 8-connected clusters -> pick max-ACC per cluster
# - Watershed per selected outlet (ESRI-D8), polygonize to SHP, write GPKG via pyogrio

import sys, subprocess, os, gc
from pathlib import Path
import numpy as np
import rasterio

# 0) Install libs into THIS kernel (idempotent)
def ensure_package(name):
    try:
        __import__(name)
        return
    except ImportError:
        print(f"📦 Installing {name} ...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", name])
        __import__(name)

for pkg in ["pyogrio", "geopandas"]:
    ensure_package(pkg)

import pyogrio  # now available

# 1) Py3.8 shim so whitebox works on Python < 3.9
if sys.version_info < (3, 9):
    import importlib.resources as ir
    try:
        import importlib_resources
        if not hasattr(ir, "files"):
            ir.files = importlib_resources.files
    except Exception:
        pass

import whitebox
wbt = whitebox.WhiteboxTools()

# 2) Parameters
THRESHOLD_CELLS = 10_000  # ACC >= 10k -> stream (~81 km² if 0.0081 km²/pixel)
print(f"Using ACC threshold = {THRESHOLD_CELLS} cells")

# 3) Paths (absolute)
root = Path("../../data/HydroSHEDS").resolve()
dir_tif = (root / "as_dir_Bhutan_and_buffer.tif").resolve()
acc_tif = (root / "as_acc_Bhutan_and_buffer.tif").resolve()
out_dir = (root / "bt_out").resolve()
out_dir.mkdir(parents=True, exist_ok=True)

streams_tif = out_dir / "streams.tif"
pp_rast     = out_dir / "auto_pour_points_clustered.tif"
ws_tif      = out_dir / "watersheds_by_outlets_clustered.tif"
ws_shp      = out_dir / "watersheds_by_outlets_clustered.shp"
ws_gpkg     = out_dir / "watersheds_by_outlets_clustered.gpkg"

def safe_remove(p: Path):
    try:
        if p.exists():
            p.unlink()
    except Exception as e:
        print(f"⚠️ Could not remove {p}: {e}")

def cleanup_shapefile(stem: Path):
    base = stem.with_suffix("")
    for ext in (".shp", ".shx", ".dbf", ".prj", ".cpg", ".qmd"):
        safe_remove(base.with_suffix(ext))

# clean old outputs
for p in [streams_tif, pp_rast, ws_tif, ws_gpkg]:
    safe_remove(p)
cleanup_shapefile(ws_shp)

# 4) Input checks & alignment
assert dir_tif.exists(), f"Missing DIR: {dir_tif}"
assert acc_tif.exists(), f"Missing ACC: {acc_tif}"
with rasterio.open(dir_tif) as rD, rasterio.open(acc_tif) as rA:
    assert rD.crs == rA.crs, "DIR and ACC CRS differ"
    assert rD.transform == rA.transform, "DIR and ACC grids not aligned"
    assert (rD.width, rD.height) == (rA.width, rA.height), "DIR and ACC size mismatch"
    H, W = rD.height, rD.width
    profile = rD.profile
print(f"✅ DIR/ACC aligned: {W} x {H} | CRS={profile['crs']}")

wbt.work_dir = str(out_dir)

# 5) Streams from ACC
ok_s = wbt.extract_streams(flow_accum=str(acc_tif), output=str(streams_tif), threshold=THRESHOLD_CELLS)
print("✅ Streams:", ok_s, "→", streams_tif)

# 6) Boundary outlet candidates (stream on outer border + D8 points outside)
with rasterio.open(dir_tif) as r_dir, rasterio.open(streams_tif) as r_str:
    dir_arr = r_dir.read(1)
    str_arr = r_str.read(1).astype(bool)

code2offset = {1:(0,1), 2:(1,1), 4:(1,0), 8:(1,-1), 16:(0,-1), 32:(-1,-1), 64:(-1,0), 128:(-1,1)}

cand_coords = []
# top/bottom rows
for c in range(W):
    for r in (0, H-1):
        if str_arr[r, c]:
            d = int(dir_arr[r, c])
            if d in code2offset:
                dr, dc = code2offset[d]
                nr, nc = r + dr, c + dc
                if nr < 0 or nr >= H or nc < 0 or nc >= W:
                    cand_coords.append((r, c))
# left/right cols (skip corners to avoid dupes)
for r in range(1, H-1):
    for c in (0, W-1):
        if str_arr[r, c]:
            d = int(dir_arr[r, c])
            if d in code2offset:
                dr, dc = code2offset[d]
                nr, nc = r + dr, c + dc
                if nr < 0 or nr >= H or nc < 0 or nc >= W:
                    cand_coords.append((r, c))

cand_coords = list(dict.fromkeys(cand_coords))
print(f"🔎 Candidate boundary outlets (raw): {len(cand_coords)}")

# 7) Cluster candidates (8-connectivity) and pick max-ACC per cluster
class DSU:
    def __init__(self, n):
        self.p = list(range(n)); self.r = [0]*n
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb: return
        if self.r[ra] < self.r[rb]:
            self.p[ra] = rb
        elif self.r[ra] > self.r[rb]:
            self.p[rb] = ra
        else:
            self.p[rb] = ra; self.r[ra] += 1

N = len(cand_coords)
idx_map = {rc:i for i, rc in enumerate(cand_coords)}
cand_set = set(cand_coords)
dsu = DSU(N)
nbrs = [(dr, dc) for dr in (-1,0,1) for dc in (-1,0,1) if not (dr==0 and dc==0)]

for i, (r, c) in enumerate(cand_coords):
    for dr, dc in nbrs:
        nr, nc = r+dr, c+dc
        if (nr, nc) in cand_set:
            dsu.union(i, idx_map[(nr, nc)])

groups = {}
for i in range(N):
    root_i = dsu.find(i)
    groups.setdefault(root_i, []).append(i)

with rasterio.open(acc_tif) as r_acc:
    acc = r_acc.read(1)
    acc_nodata = r_acc.nodata

selected_rc = []
for root_i, members in groups.items():
    best_rc, best_val = None, -1
    for i in members:
        rr, cc = cand_coords[i]
        val = acc[rr, cc]
        if acc_nodata is not None and val == acc_nodata:
            continue
        if val > best_val:
            best_val = val; best_rc = (rr, cc)
    if best_rc is None:
        best_rc = cand_coords[members[0]]
    selected_rc.append(best_rc)

print(f"✅ Clustered outlets (one per mouth): {len(selected_rc)}")

if not selected_rc:
    raise RuntimeError("No clustered outlets found. Increase threshold or verify DIR/ACC.")

# 8) Rasterize pour-points (unique IDs)
pp_arr = np.zeros((H, W), dtype=np.int32)
for i, (rr, cc) in enumerate(selected_rc, start=1):
    pp_arr[rr, cc] = i

profile_pp = profile.copy()
profile_pp.update(dtype=rasterio.int32, count=1, compress="deflate", tiled=True, BIGTIFF="IF_SAFER")
with rasterio.open(pp_rast, "w", **profile_pp) as dst:
    dst.write(pp_arr, 1)
print("✅ Pour-point raster (clustered):", pp_rast)

# 9) Watershed per clustered outlet (ESRI D8)
ok_w = wbt.watershed(d8_pntr=str(dir_tif), pour_pts=str(pp_rast), output=str(ws_tif), esri_pntr=True)
print("✅ Watersheds by clustered outlets:", ok_w, "→", ws_tif)

# 10) Polygonize to SHP (Whitebox), then write GPKG via pyogrio (no Fiona)
ok_p = wbt.raster_to_vector_polygons(i=str(ws_tif), output=str(ws_shp))
print("✅ Polygons (SHP):", ok_p, "→", ws_shp)

gdf = pyogrio.read_dataframe(str(ws_shp))
if gdf.crs is None:
    with rasterio.open(ws_tif) as rt:
        gdf.set_crs(rt.crs, inplace=True)
pyogrio.write_dataframe(
    gdf,
    str(ws_gpkg),
    layer="watersheds",
    driver="GPKG",
    append=False
)
print("✅ GeoPackage written (pyogrio):", ws_gpkg)
try:
    print("Layers:", pyogrio.list_layers(str(ws_gpkg)))
except Exception:
    pass

# 11) Count unique watersheds
with rasterio.open(ws_tif) as r:
    WZ = r.read(1)
    nd = r.nodata
    n_ws = np.unique(WZ[WZ != nd]).size
print("🧮 Unique watersheds (clustered):", n_ws)

gc.collect()
print("Done.")

📦 Installing pyogrio ...
Defaulting to user installation because normal site-packages is not writeable
Collecting pyogrio
  Downloading pyogrio-0.9.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Downloading pyogrio-0.9.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.2 MB)
Installing collected packages: pyogrio
Successfully installed pyogrio-0.9.0
Using ACC threshold = 10000 cells
✅ DIR/ACC aligned: 7800 x 5400 | CRS=EPSG:4326
./whitebox_tools --run="ExtractStreams" --wd="/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out" --flow_accum='/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/as_acc_Bhutan_and_buffer.tif' --output='/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/streams.tif' --threshold='10000' -v --compress_rasters=False

*****************************
* Welcome to ExtractStreams *
* Powered by WhiteboxTools  *
* www.whiteboxgeo.com       *
*****************************
Reading data...

🧮 Unique watersheds (clustered): 1761
Done.
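
The ≈81 km² quoted for the 10,000-cell stream threshold assumes 0.0081 km² per pixel, that is, a 90 m square cell. On a geographic (lon/lat) grid the cell area shrinks with latitude, so the sketch below estimates it for the latitudes covered here. It assumes a spherical Earth and 3 arc-second cells; `cell_area_km2` is an illustrative helper, not part of the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)

def cell_area_km2(lat_deg, cell_deg=1 / 1200):
    """Area of one lon/lat grid cell centred at lat_deg, on a sphere."""
    half = cell_deg / 2
    lat_south = math.radians(lat_deg - half)
    lat_north = math.radians(lat_deg + half)
    dlon = math.radians(cell_deg)
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(lat_north) - math.sin(lat_south))

for lat in (0.0, 25.0, 27.25, 29.5):
    area = cell_area_km2(lat)
    print(f"lat {lat:5.2f}: {area:.5f} km2/cell, 10,000 cells = {area * 10_000:.1f} km2")
```

Under these assumptions a cell near the middle of the crop window covers somewhat less than 0.0081 km², so the 81 km² figure is best read as an upper-end approximation.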


In [19]:
# === Rebuild clustered watersheds with outlet filtering + proximity merge ===
# - Streams from ACC (threshold = 10,000 cells)
# - Boundary outlet candidates (stream on outer border + ESRI-D8 points outside)
# - Filter outlets by minimum ACC (major mouths only)
# - Merge outlets within MERGE_RADIUS_PX (Chebyshev)
# - Watershed per outlet; polygonize; write GPKG via pyogrio

import sys, subprocess, os, gc
from pathlib import Path
import numpy as np
import rasterio

# Ensure pyogrio present in THIS kernel
def ensure_package(name):
    try:
        __import__(name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", name])

ensure_package("pyogrio")
import pyogrio

# Py3.8 shim for whitebox
if sys.version_info < (3, 9):
    import importlib.resources as ir
    try:
        import importlib_resources
        if not hasattr(ir, "files"):
            ir.files = importlib_resources.files
    except Exception:
        pass

import whitebox
wbt = whitebox.WhiteboxTools()

# -------- Parameters (tune these) --------
STREAM_THRESHOLD_CELLS = 10_000     # ACC >= this -> stream (≈81 km² if 0.0081 km²/pixel)
MIN_OUTLET_ACC_CELLS   = 20_000     # keep only outlets with ACC >= this (≈162 km²)
MERGE_RADIUS_PX        = 4          # merge outlets within this distance (Chebyshev pixels, inclusive)
# -----------------------------------------

# Paths
root = Path("../../data/HydroSHEDS").resolve()
dir_tif = (root / "as_dir_Bhutan_and_buffer.tif").resolve()
acc_tif = (root / "as_acc_Bhutan_and_buffer.tif").resolve()
out_dir = (root / "bt_out").resolve()
out_dir.mkdir(parents=True, exist_ok=True)

streams_tif = out_dir / "streams.tif"
pp_rast     = out_dir / "auto_pour_points_filtered_merged.tif"
ws_tif      = out_dir / "watersheds_filtered_merged.tif"
ws_shp      = out_dir / "watersheds_filtered_merged.shp"
ws_gpkg     = out_dir / "watersheds_filtered_merged.gpkg"

def safe_remove(p: Path):
    try:
        if p.exists(): p.unlink()
    except Exception as e:
        print(f"⚠️ Could not remove {p}: {e}")

def cleanup_shapefile(stem: Path):
    base = stem.with_suffix("")
    for ext in (".shp", ".shx", ".dbf", ".prj", ".cpg", ".qmd"):
        safe_remove(base.with_suffix(ext))

# Clean previous outputs
for p in [streams_tif, pp_rast, ws_tif, ws_gpkg]:
    safe_remove(p)
cleanup_shapefile(ws_shp)

# Checks & alignment
assert dir_tif.exists() and acc_tif.exists()
with rasterio.open(dir_tif) as rD, rasterio.open(acc_tif) as rA:
    assert rD.crs == rA.crs
    assert rD.transform == rA.transform
    assert (rD.width, rD.height) == (rA.width, rA.height)
    H, W = rD.height, rD.width
    profile = rD.profile
print(f"✅ DIR/ACC aligned: {W}x{H} | CRS={profile['crs']}")

wbt.work_dir = str(out_dir)

# 1) Streams from ACC
ok_s = wbt.extract_streams(flow_accum=str(acc_tif), output=str(streams_tif), threshold=STREAM_THRESHOLD_CELLS)
print("✅ Streams:", ok_s, "→", streams_tif)

# 2) Boundary outlet candidates
with rasterio.open(dir_tif) as r_dir, rasterio.open(streams_tif) as r_str:
    dir_arr = r_dir.read(1)
    str_arr = r_str.read(1).astype(bool)

code2offset = {1:(0,1), 2:(1,1), 4:(1,0), 8:(1,-1), 16:(0,-1), 32:(-1,-1), 64:(-1,0), 128:(-1,1)}

cand = []
# top/bottom rows
for c in range(W):
    for r in (0, H-1):
        if str_arr[r, c]:
            d = int(dir_arr[r, c])
            if d in code2offset:
                dr, dc = code2offset[d]
                nr, nc = r + dr, c + dc
                if nr < 0 or nr >= H or nc < 0 or nc >= W:
                    cand.append((r, c))
# left/right cols (skip corners dupes)
for r in range(1, H-1):
    for c in (0, W-1):
        if str_arr[r, c]:
            d = int(dir_arr[r, c])
            if d in code2offset:
                dr, dc = code2offset[d]
                nr, nc = r + dr, c + dc
                if nr < 0 or nr >= H or nc < 0 or nc >= W:
                    cand.append((r, c))

# dedupe
cand = list(dict.fromkeys(cand))
print(f"🔎 Boundary outlet candidates (raw): {len(cand)}")

# 3) Filter by minimum ACC at outlet
with rasterio.open(acc_tif) as r_acc:
    acc = r_acc.read(1)
    acc_nd = r_acc.nodata

def acc_val(rc):
    v = acc[rc[0], rc[1]]
    return -1 if (acc_nd is not None and v == acc_nd) else v

cand2 = [rc for rc in cand if acc_val(rc) >= MIN_OUTLET_ACC_CELLS]
print(f"✅ After ACC filter (≥ {MIN_OUTLET_ACC_CELLS} cells): {len(cand2)}")

# 4) Merge outlets within MERGE_RADIUS_PX (Chebyshev)
#    DSU over points; union if max(|dr|,|dc|) <= MERGE_RADIUS_PX
class DSU:
    def __init__(self, n):
        self.p = list(range(n)); self.r = [0]*n
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb: return
        if self.r[ra] < self.r[rb]:
            self.p[ra] = rb
        elif self.r[ra] > self.r[rb]:
            self.p[rb] = ra
        else:
            self.p[rb] = ra; self.r[ra] += 1

N = len(cand2)
dsu = DSU(N)
for i in range(N):
    r1, c1 = cand2[i]
    # check only j>i to reduce work
    for j in range(i+1, N):
        r2, c2 = cand2[j]
        if max(abs(r1-r2), abs(c1-c2)) <= MERGE_RADIUS_PX:
            dsu.union(i, j)

groups = {}
for i in range(N):
    # use root_i so the `root` data-directory Path defined above is not overwritten
    root_i = dsu.find(i)
    groups.setdefault(root_i, []).append(i)

selected = []
for root_i, idxs in groups.items():
    # pick point with max ACC
    best_rc, best_v = None, -1
    for i in idxs:
        rc = cand2[i]
        v = acc_val(rc)
        if v > best_v:
            best_v, best_rc = v, rc
    selected.append(best_rc)

print(f"✅ After proximity merge (≤ {MERGE_RADIUS_PX}px): {len(selected)}")

# 5) Rasterize pour points
pp_arr = np.zeros((H, W), dtype=np.int32)
for i, (rr, cc) in enumerate(selected, start=1):
    pp_arr[rr, cc] = i

profile_pp = profile.copy()
profile_pp.update(dtype=rasterio.int32, count=1, compress="deflate", tiled=True, BIGTIFF="IF_SAFER")
with rasterio.open(pp_rast, "w", **profile_pp) as dst:
    dst.write(pp_arr, 1)
print("✅ Pour-point raster:", pp_rast)

# 6) Watersheds
ok_w = wbt.watershed(d8_pntr=str(dir_tif), pour_pts=str(pp_rast), output=str(ws_tif), esri_pntr=True)
print("✅ Watersheds:", ok_w, "→", ws_tif)

# 7) Polygonize + GPKG
wbt.raster_to_vector_polygons(i=str(ws_tif), output=str(ws_shp))
print("✅ SHP:", ws_shp)

gdf = pyogrio.read_dataframe(str(ws_shp))
if gdf.crs is None:
    with rasterio.open(ws_tif) as rt:
        gdf.set_crs(rt.crs, inplace=True)
pyogrio.write_dataframe(gdf, str(ws_gpkg), layer="watersheds", driver="GPKG", append=False)
print("✅ GPKG:", ws_gpkg)

# 8) Count watersheds
with rasterio.open(ws_tif) as r:
    WZ = r.read(1)
    nd = r.nodata
    n_ws = np.unique(WZ[WZ != nd]).size
print("🧮 Unique watersheds:", n_ws)

gc.collect()
print("Done.")

✅ DIR/ACC aligned: 7800x5400 | CRS=EPSG:4326
./whitebox_tools --run="ExtractStreams" --wd="/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out" --flow_accum='/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/as_acc_Bhutan_and_buffer.tif' --output='/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/streams.tif' --threshold='10000' -v --compress_rasters=False

*****************************
* Welcome to ExtractStreams *
* Powered by WhiteboxTools  *
* www.whiteboxgeo.com       *
*****************************
Reading data...

🧮 Unique watersheds: 47
Done.
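
The proximity merge is easiest to see on a handful of made-up outlets. The sketch below groups points whose Chebyshev distance is within a radius and keeps the highest-ACC point of each group. The outlet coordinates and ACC values are invented for illustration, and `merge_outlets` is a hypothetical helper that compresses the DSU logic above into one function.

```python
def merge_outlets(outlets, radius):
    """outlets: list of (row, col, acc). Returns one max-ACC outlet per merged group."""
    parent = list(range(len(outlets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, (r1, c1, _) in enumerate(outlets):
        for j in range(i + 1, len(outlets)):
            r2, c2, _ = outlets[j]
            if max(abs(r1 - r2), abs(c1 - c2)) <= radius:  # Chebyshev distance
                parent[find(i)] = find(j)

    best = {}
    for i, outlet in enumerate(outlets):
        group = find(i)
        if group not in best or outlet[2] > best[group][2]:
            best[group] = outlet
    return sorted(best.values())

# Invented example: three nearby outlets on the top row plus one far away.
toy_outlets = [(0, 10, 25_000), (0, 12, 90_000), (0, 15, 30_000), (0, 100, 40_000)]
merged = merge_outlets(toy_outlets, radius=4)
print(merged)
```

Merging is transitive: the outlets at columns 10 and 15 are 5 pixels apart, yet they end up in the same group because both are within 4 pixels of column 12. With `radius=1` the same function reproduces 8-connected clustering.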


1) Compute basin areas and export a summary table

Adds area_km2 and a robust ws_id to the GPKG + a CSV summary.
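
Areas must be computed in an equal-area projection because a degree of longitude covers less ground at higher latitudes. The sketch below shows the principle with a Lambert cylindrical equal-area projection on a sphere and the shoelace formula. It is a simplified stand-in for a proper equal-area CRS, and the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)

def to_equal_area(lon_deg, lat_deg):
    """Lambert cylindrical equal-area projection on a sphere, in km."""
    x = EARTH_RADIUS_KM * math.radians(lon_deg)
    y = EARTH_RADIUS_KM * math.sin(math.radians(lat_deg))
    return x, y

def polygon_area_km2(lonlat_ring):
    """Shoelace area of a lon/lat ring after projecting its vertices."""
    pts = [to_equal_area(lon, lat) for lon, lat in lonlat_ring]
    total = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# The Bhutan + buffer crop window used throughout this notebook.
bbox_ring = [(87.0, 25.0), (93.5, 25.0), (93.5, 29.5), (87.0, 29.5)]
bbox_area = polygon_area_km2(bbox_ring)
print(f"Crop window area: about {bbox_area:,.0f} km2")
```

The window area gives an upper bound for any basin clipped to it, which is a handy plausibility check on the `area_km2` column.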

In [20]:
from pathlib import Path
import pyogrio
import geopandas as gpd
import rasterio

root   = Path("../../data/HydroSHEDS").resolve()
outdir = root / "bt_out"
gpkg   = outdir / "watersheds_filtered_merged.gpkg"
layer  = "watersheds"
summary_csv = outdir / "watersheds_filtered_merged_summary.csv"

# Read polygons
gdf = pyogrio.read_dataframe(gpkg, layer=layer)

# Ensure CRS is present; if missing, copy from raster
if gdf.crs is None:
    with rasterio.open(outdir / "watersheds_filtered_merged.tif") as r:
        gdf = gdf.set_crs(r.crs)

# Try to find an integer ID column from Whitebox (commonly 'value' or 'FID')
id_col = None
for cand in ["value", "VALUE", "fid", "FID", "Id", "ID"]:
    if cand in gdf.columns:
        id_col = cand
        break
if id_col is None:
    # Fall back to index-based ID
    gdf["ws_id"] = gdf.index.astype(int) + 1
else:
    gdf = gdf.rename(columns={id_col: "ws_id"})
    gdf["ws_id"] = gdf["ws_id"].astype(int)

# Compute area in km² using an equal-area projection
gdf_eq = gdf.to_crs("EPSG:6933")  # WGS 84 / NSIDC EASE-Grid 2.0 Global (cylindrical equal-area)
gdf["area_km2"] = gdf_eq.geometry.area.values / 1_000_000.0

# Write back to GPKG (overwriting layer) and CSV summary
pyogrio.write_dataframe(gdf, gpkg, layer=layer, driver="GPKG", append=False)
gdf[["ws_id", "area_km2"]].to_csv(summary_csv, index=False)

print("✅ Updated GPKG with ws_id + area_km2:", gpkg)
print("✅ Summary CSV:", summary_csv)
print("🧮 Basins:", len(gdf), " | area_km2 stats: ",
      float(gdf["area_km2"].min()), "→", float(gdf["area_km2"].max()))

✅ Updated GPKG with ws_id + area_km2: /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/watersheds_filtered_merged.gpkg
✅ Summary CSV: /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/watersheds_filtered_merged_summary.csv
🧮 Basins: 47  | area_km2 stats:  2.127651543268542 → 143958.44097443993


2) Map any (lon, lat) points to watershed IDs (your "dictionary")

Takes a CSV of points and adds the ws_id from the watershed raster.
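
Sampling a raster at a point boils down to converting (lon, lat) into a (row, col) index. A minimal sketch of that conversion, assuming the crop window's origin (left = 87.0, top = 29.5), a 3 arc-second cell size, and a 7800 x 5400 grid; the sample coordinates are approximate and for illustration only.

```python
import math

LEFT, TOP = 87.0, 29.5          # upper-left corner of the cropped raster (degrees)
CELL_DEG = 1 / 1200             # assumed 3 arc-second cell size
N_ROWS, N_COLS = 5400, 7800

def lonlat_to_rowcol(lon, lat):
    """Return the (row, col) of the cell containing the point, or None if outside the grid."""
    col = math.floor((lon - LEFT) / CELL_DEG)
    row = math.floor((TOP - lat) / CELL_DEG)   # rows count downward from the top edge
    if 0 <= row < N_ROWS and 0 <= col < N_COLS:
        return row, col
    return None

inside = lonlat_to_rowcol(89.6419, 27.4728)   # roughly Thimphu
outside = lonlat_to_rowcol(80.0, 27.0)        # west of the crop window
print(inside, outside)
```

Points that fall outside the grid have no watershed ID, which is why the mapping needs a nullable integer column.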

In [21]:
import pandas as pd
import numpy as np
import rasterio
from rasterio.sample import sample_gen
from pathlib import Path

# Inputs
points_csv = "../../data/your_points.csv"  # <- change to your file (must have 'longitude','latitude')
root   = Path("../../data/HydroSHEDS").resolve()
ws_tif = root / "bt_out" / "watersheds_filtered_merged.tif"
out_csv = Path(points_csv).with_name(Path(points_csv).stem + "_with_ws_id.csv")

# Load points
df = pd.read_csv(points_csv)
assert {"longitude","latitude"}.issubset(df.columns), "CSV must have columns: longitude, latitude"

# Sample watershed raster at point locations (lon/lat order for EPSG:4326)
with rasterio.open(ws_tif) as r:
    assert r.crs.to_string() == "EPSG:4326", "Expecting EPSG:4326 raster"
    coords = list(zip(df["longitude"].values, df["latitude"].values))
    vals = list(r.sample(coords))  # each is a 1-length array

ws_vals = np.array([v[0] for v in vals])
# Treat NoData as NaN
with rasterio.open(ws_tif) as r:
    nodata = r.nodata
ws_id = ws_vals.astype("float64")
if nodata is not None:
    ws_id = np.where(ws_id == nodata, np.nan, ws_id)
# Convert through pandas: "Int64" is a pandas nullable dtype, not a NumPy one
df["ws_id"] = pd.Series(ws_id, index=df.index).astype("Int64")

# Save
df.to_csv(out_csv, index=False)
print("✅ Points mapped to watersheds:", out_csv)

# Quick QA: how many points per basin?
counts = df["ws_id"].value_counts(dropna=True).sort_index()
print("🧮 Points per ws_id (head):")
print(counts.head(10))

FileNotFoundError: [Errno 2] No such file or directory: '../../data/your_points.csv'