
## Catchments

**Pour points**

Definition: Specific locations on the river network where we want to measure or model water flow.
Examples: Gauging stations, river mouths, dams, hydropower plants.

**Catchments**
 
Definition: The area of land where all rainfall drains to the same pour point.
Each catchment is linked to exactly one pour point.

**Steps**
- Clip the HydroSHEDS DEM, ACC, and DIR rasters to Bhutan + buffer.
- Create pour points (CSV with lon, lat, id).
- Snap pour points to nearest high-ACC pixels (river cells).
- Use flow direction (DIR) + snapped points to generate catchments.
- Convert catchments to polygons and calculate basic attributes.


- Use your clipped DIR (as_dir_Bhutan_and_buffer.tif) to build a catchment ID raster (one ID per basin).
- Convert catchments to polygons (optional, for QA/visualization).
- Build a point→catchment_id mapping for any Bhutan locations (CSV of lon/lat).
- Use the same raster to tag every weather pixel with catchment_id, then group/aggregate.
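The snapping step listed above (moving a pour point onto the nearest high-ACC cell) can be sketched in a few lines. This is a minimal illustration on a toy grid, not the method used later in this notebook; the function name `snap_to_max_acc`, the `radius` parameter, and the toy values are assumptions made for the example.

```python
def snap_to_max_acc(acc, row, col, radius=2, nodata=None):
    """Return the (row, col) of the highest-ACC cell within `radius` cells.

    `acc` is a 2-D grid (list of lists) of flow-accumulation values.
    Cells equal to `nodata` are ignored; on ties the starting cell wins,
    then the first cell found in row-major order.
    """
    n_rows, n_cols = len(acc), len(acc[0])
    best_rc, best_val = (row, col), acc[row][col]
    for r in range(max(0, row - radius), min(n_rows, row + radius + 1)):
        for c in range(max(0, col - radius), min(n_cols, col + radius + 1)):
            val = acc[r][c]
            if nodata is not None and val == nodata:
                continue  # a large NoData value must never win the search
            if val > best_val:
                best_rc, best_val = (r, c), val
    return best_rc


# Toy ACC grid (hypothetical values): the "river" runs down column 3.
toy_acc = [
    [1, 1, 2, 50, 1],
    [1, 2, 3, 60, 1],
    [1, 1, 4, 75, 2],
    [1, 1, 2, 90, 1],
]

# A pour point placed one cell west of the river, in row 1.
snapped = snap_to_max_acc(toy_acc, row=1, col=2, radius=1)
print(snapped)
```

Because the search picks the largest value in the window, the point tends to move downstream along the channel. Masking NoData matters in practice, since a large NoData value would otherwise look like the biggest river in the window.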

## 1. Crop and Save a Smaller DEM, ACC, DIR (TIF) for Bhutan + Buffer
Instead of processing the full Asia-wide TIF files, we first crop each one and save a smaller GeoTIFF limited to Bhutan plus a buffer (latitude 25.0°–29.5°, longitude 87.0°–93.5°).
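As a quick sanity check on the crop, the expected window size and uncompressed file sizes can be worked out by hand. The sketch below assumes a 3 arc-second grid (suggested by the `3s` in the file names) and single-band rasters stored as int16, uint32, and uint8; `CELLS_PER_DEGREE` and `raw_size_mb` are names introduced only for this example.

```python
CELLS_PER_DEGREE = 1200  # assumed: 3 arc-seconds per cell -> 3600 / 3 cells per degree

min_lon, max_lon = 87.0, 93.5
min_lat, max_lat = 25.0, 29.5

# Window size in cells for the bounding box.
width = round((max_lon - min_lon) * CELLS_PER_DEGREE)
height = round((max_lat - min_lat) * CELLS_PER_DEGREE)


def raw_size_mb(bytes_per_cell):
    """Uncompressed size of one band in MB (1024 * 1024 bytes)."""
    return width * height * bytes_per_cell / (1024 * 1024)


print(f"window: {width} x {height} cells")
for name, nbytes in [("DEM int16", 2), ("ACC uint32", 4), ("DIR uint8", 1)]:
    print(f"{name}: about {raw_size_mb(nbytes):.1f} MB uncompressed")
```

If the cropped files come out much larger or smaller than these estimates (allowing a little extra for headers), the window or the data type is probably not what was intended.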

**DEM** — Digital Elevation Model

- A raster grid of ground elevation (usually meters above sea level).

**DIR** — Flow Direction

- A raster showing which neighboring cell water flows to from each cell (downslope).
- In HydroSHEDS (ESRI D8), values encode directions: 1=E, 2=SE, 4=S, 8=SW, 16=W, 32=NW, 64=N, 128=NE. 
- Computed from the DEM.
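To make the D8 encoding concrete, here is a small sketch that follows the codes from one cell to the next until the flow leaves the grid. The toy grid and the helper name `trace_downstream` are illustrative assumptions; only the code-to-direction table comes from the scheme described above.

```python
# ESRI D8 codes mapped to (row_step, col_step); rows increase southward.
D8_OFFSETS = {
    1: (0, 1),     # E
    2: (1, 1),     # SE
    4: (1, 0),     # S
    8: (1, -1),    # SW
    16: (0, -1),   # W
    32: (-1, -1),  # NW
    64: (-1, 0),   # N
    128: (-1, 1),  # NE
}


def trace_downstream(dir_grid, row, col, max_steps=10_000):
    """Follow D8 codes from (row, col) until flow exits the grid or hits an unknown code."""
    n_rows, n_cols = len(dir_grid), len(dir_grid[0])
    path = [(row, col)]
    for _ in range(max_steps):
        step = D8_OFFSETS.get(dir_grid[row][col])
        if step is None:  # sink, NoData, or any code outside the table
            break
        row, col = row + step[0], col + step[1]
        if not (0 <= row < n_rows and 0 <= col < n_cols):
            break  # flow leaves the grid
        path.append((row, col))
    return path


# Hypothetical 3 x 3 direction grid.
toy_dir = [
    [2, 4, 8],
    [1, 4, 16],
    [1, 4, 16],
]
path = trace_downstream(toy_dir, 0, 0)
print(path)
```

Starting at the top-left cell, the path goes south-east to the centre, then south, and stops when the next step would fall outside the grid.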

**ACC** — Flow Accumulation

- For each cell, the number of upstream cells that drain into it (a proxy for upstream area).
- High ACC = river channels; used to define streams and to snap pour points.
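Since ACC counts cells, a stream threshold only becomes an area once a cell size is fixed. The sketch below estimates that conversion under two stated assumptions: a 3 arc-second grid and a spherical Earth of radius 6371 km. The function names are hypothetical and the numbers are approximations.

```python
import math

EARTH_RADIUS_KM = 6371.0   # assumed mean Earth radius (spherical approximation)
CELL_DEG = 3.0 / 3600.0    # assumed 3 arc-second cell size


def cell_area_km2(lat_deg):
    """Approximate area of one grid cell centred at lat_deg."""
    side_km = EARTH_RADIUS_KM * math.radians(CELL_DEG)  # north-south extent
    # The east-west extent shrinks with cos(latitude).
    return side_km * side_km * math.cos(math.radians(lat_deg))


def threshold_area_km2(n_cells, lat_deg):
    """Upstream area implied by an ACC threshold of n_cells."""
    return n_cells * cell_area_km2(lat_deg)


for lat in (25.0, 27.25, 29.5):
    area = threshold_area_km2(10_000, lat)
    print(f"lat {lat:5.2f}: 10,000 cells is about {area:.1f} km2")
```

Within the cropped latitude range, this gives values a few km² below what a flat 90 m x 90 m cell (0.0081 km²) would imply, because cells narrow toward the poles. For choosing a stream threshold the difference is rarely important, but it is worth keeping in mind when quoting areas.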

In [10]:
# First, crop the DEM
import rasterio
from rasterio.windows import from_bounds
from rasterio.enums import Resampling
import os
import gc

# Define the input and output paths
input_tif = "../../data/HydroSHEDS/as_dem_3s.tif"
output_tif = "../../data/HydroSHEDS/as_dem_Bhutan_and_buffer.tif"

# Define the bounding box for Bhutan + buffer (in degrees)
min_lon, max_lon = 87.0, 93.5
min_lat, max_lat = 25.0, 29.5

# Open the source TIFF file
with rasterio.open(input_tif) as src:
    print(f"📦 Number of bands in TIFF: {src.count}")
    if src.count != 1:
        raise ValueError("❌ Expected only one band in the DEM file.")

    # Compute the pixel window corresponding to the bounding box
    window = from_bounds(min_lon, min_lat, max_lon, max_lat, transform=src.transform)

    # Read the data within that window (band 1 = elevation)
    data = src.read(1, window=window)

    # Get the updated transform for the cropped window
    transform = src.window_transform(window)

    # Save the cropped raster to a new TIF
    out_meta = src.meta.copy()
    out_meta.update({
        "height": window.height,
        "width": window.width,
        "transform": transform
    })

    with rasterio.open(output_tif, "w", **out_meta) as out_src:
        out_src.write(data, 1)

print(f"✅ Saved: {output_tif}")
print(f"📐 Size (width x height): {window.width} x {window.height}")

# Check file size in MB
file_size_mb = os.path.getsize(output_tif) / (1024 * 1024)
print(f"💾 File size: {file_size_mb:.2f} MB")

with rasterio.open(output_tif) as tif_check:
    print(f"📌 CRS: {tif_check.crs}")
    print(f"🧭 Bounds: {tif_check.bounds}")
    print(f"📦 Data type: {tif_check.dtypes[0]}")
    print(f"🧮 NoData value: {tif_check.nodata}")

# 🔥 Clean up memory
del data, transform, out_meta, window, tif_check, src, out_src
gc.collect()
print("✅ Memory cleaned up.")

📦 Number of bands in TIFF: 1
✅ Saved: ../../data/HydroSHEDS/as_dem_Bhutan_and_buffer.tif
📐 Size (width x height): 7800.0 x 5400.0
💾 File size: 80.37 MB
📌 CRS: EPSG:4326
🧭 Bounds: BoundingBox(left=87.0, bottom=25.000000000000007, right=93.5, top=29.500000000000007)
📦 Data type: int16
🧮 NoData value: 32767.0
✅ Memory cleaned up.


In [11]:
# Crop ACC and DIR
import rasterio
from rasterio.windows import from_bounds
from rasterio.enums import Resampling
import os
import gc

# === ACC ===
input_tif_acc  = "../../data/HydroSHEDS/as_acc_3s.tif"
output_tif_acc = "../../data/HydroSHEDS/as_acc_Bhutan_and_buffer.tif"

# Bhutan + buffer (degrees)
min_lon, max_lon = 87.0, 93.5
min_lat, max_lat = 25.0, 29.5

with rasterio.open(input_tif_acc) as src_acc:
    print(f"📦 Number of bands in TIFF: {src_acc.count}")
    if src_acc.count != 1:
        raise ValueError("❌ Expected only one band in the ACC file.")

    window_acc = from_bounds(min_lon, min_lat, max_lon, max_lat, transform=src_acc.transform)

    data_acc = src_acc.read(1, window=window_acc)
    transform_acc = src_acc.window_transform(window_acc)

    out_meta_acc = src_acc.meta.copy()
    out_meta_acc.update({
        "height": window_acc.height,
        "width": window_acc.width,
        "transform": transform_acc
    })

    with rasterio.open(output_tif_acc, "w", **out_meta_acc) as out_src_acc:
        out_src_acc.write(data_acc, 1)

print(f"✅ Saved: {output_tif_acc}")
print(f"📐 Size (width x height): {window_acc.width} x {window_acc.height}")

file_size_mb_acc = os.path.getsize(output_tif_acc) / (1024 * 1024)
print(f"💾 File size: {file_size_mb_acc:.2f} MB")

with rasterio.open(output_tif_acc) as tif_check_acc:
    print(f"📌 CRS: {tif_check_acc.crs}")
    print(f"🧭 Bounds: {tif_check_acc.bounds}")
    print(f"📦 Data type: {tif_check_acc.dtypes[0]}")
    print(f"🧮 NoData value: {tif_check_acc.nodata}")

# 🔥 Clean up memory (ACC)
del data_acc, transform_acc, out_meta_acc, window_acc, tif_check_acc, src_acc, out_src_acc
gc.collect()
print("✅ Memory cleaned up (ACC).")


# === DIR ===
input_tif_dir  = "../../data/HydroSHEDS/as_dir_3s.tif"
output_tif_dir = "../../data/HydroSHEDS/as_dir_Bhutan_and_buffer.tif"

with rasterio.open(input_tif_dir) as src_dir:
    print(f"📦 Number of bands in TIFF: {src_dir.count}")
    if src_dir.count != 1:
        raise ValueError("❌ Expected only one band in the DIR file.")

    window_dir = from_bounds(min_lon, min_lat, max_lon, max_lat, transform=src_dir.transform)

    data_dir = src_dir.read(1, window=window_dir)
    transform_dir = src_dir.window_transform(window_dir)

    out_meta_dir = src_dir.meta.copy()
    out_meta_dir.update({
        "height": window_dir.height,
        "width": window_dir.width,
        "transform": transform_dir
    })

    with rasterio.open(output_tif_dir, "w", **out_meta_dir) as out_src_dir:
        out_src_dir.write(data_dir, 1)

print(f"✅ Saved: {output_tif_dir}")
print(f"📐 Size (width x height): {window_dir.width} x {window_dir.height}")

file_size_mb_dir = os.path.getsize(output_tif_dir) / (1024 * 1024)
print(f"💾 File size: {file_size_mb_dir:.2f} MB")

with rasterio.open(output_tif_dir) as tif_check_dir:
    print(f"📌 CRS: {tif_check_dir.crs}")
    print(f"🧭 Bounds: {tif_check_dir.bounds}")
    print(f"📦 Data type: {tif_check_dir.dtypes[0]}")
    print(f"🧮 NoData value: {tif_check_dir.nodata}")

# 🔥 Clean up memory (DIR)
del data_dir, transform_dir, out_meta_dir, window_dir, tif_check_dir, src_dir, out_src_dir
gc.collect()
print("✅ Memory cleaned up (DIR).")

📦 Number of bands in TIFF: 1
✅ Saved: ../../data/HydroSHEDS/as_acc_Bhutan_and_buffer.tif
📐 Size (width x height): 7800.0 x 5400.0
💾 File size: 160.71 MB
📌 CRS: EPSG:4326
🧭 Bounds: BoundingBox(left=87.0, bottom=25.000000000000007, right=93.5, top=29.500000000000007)
📦 Data type: uint32
🧮 NoData value: 4294967295.0
✅ Memory cleaned up (ACC).
📦 Number of bands in TIFF: 1
✅ Saved: ../../data/HydroSHEDS/as_dir_Bhutan_and_buffer.tif
📐 Size (width x height): 7800.0 x 5400.0
💾 File size: 40.20 MB
📌 CRS: EPSG:4326
🧭 Bounds: BoundingBox(left=87.0, bottom=25.000000000000007, right=93.5, top=29.500000000000007)
📦 Data type: uint8
🧮 NoData value: 255.0
✅ Memory cleaned up (DIR).


## 2. Generate catchments (basins) from DIR
Note: the DIR raster uses ESRI’s D8 scheme. Each pixel stores a code for the direction in which water leaves the cell:
1 = E, 2 = SE, 4 = S, 8 = SW, 16 = W, 32 = NW, 64 = N, 128 = NE.
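Conceptually, a watershed is grown from an outlet by repeatedly collecting every neighbour whose D8 code points into a cell that is already in the basin. The sketch below shows that idea on a toy grid with a breadth-first search. It is an illustration of the principle under simplified assumptions, not a description of how WhiteboxTools implements its `Watershed` tool; `delineate` and the toy grid are hypothetical.

```python
from collections import deque

# ESRI D8 codes mapped to (row_step, col_step); rows increase southward.
D8_OFFSETS = {1: (0, 1), 2: (1, 1), 4: (1, 0), 8: (1, -1),
              16: (0, -1), 32: (-1, -1), 64: (-1, 0), 128: (-1, 1)}


def delineate(dir_grid, outlet):
    """Return the set of cells whose D8 path reaches `outlet` (outlet included)."""
    n_rows, n_cols = len(dir_grid), len(dir_grid[0])
    basin = {outlet}
    queue = deque([outlet])
    while queue:
        r, c = queue.popleft()
        # Inspect the 8 neighbours and keep those that drain into (r, c).
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr, dc) == (0, 0):
                    continue
                if not (0 <= nr < n_rows and 0 <= nc < n_cols):
                    continue
                if (nr, nc) in basin:
                    continue
                step = D8_OFFSETS.get(dir_grid[nr][nc])
                if step is not None and (nr + step[0], nc + step[1]) == (r, c):
                    basin.add((nr, nc))
                    queue.append((nr, nc))
    return basin


# Hypothetical 3 x 3 direction grid; the right-hand column drains east, off the grid.
toy_dir = [
    [4, 4, 1],
    [2, 8, 1],
    [1, 1, 1],
]
basin = delineate(toy_dir, outlet=(2, 1))
print(sorted(basin))
```

With the outlet at the bottom-centre cell, the basin contains the six cells of the left and middle columns. The right-hand column is excluded because none of its cells drain toward the outlet.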

In [18]:
# === Install (into THIS kernel) + Build clustered watersheds @ ACC threshold = 10,000 ===
# - Installs: pyogrio (for GPKG I/O), geopandas (optional)
# - Streams from ACC (threshold = 10,000)
# - Boundary outlet candidates -> 8-connected clusters -> pick max-ACC per cluster
# - Watershed per selected outlet (ESRI-D8), polygonize to SHP, write GPKG via pyogrio

import sys, subprocess, os, gc
from pathlib import Path
import numpy as np
import rasterio

# 0) Install libs into THIS kernel (idempotent)
def ensure_package(name):
    try:
        __import__(name)
        return
    except ImportError:
        print(f"📦 Installing {name} ...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", name])
        __import__(name)

for pkg in ["pyogrio", "geopandas"]:
    ensure_package(pkg)

import pyogrio  # now available

# 1) Py3.8 shim so whitebox works on Python < 3.9
if sys.version_info < (3, 9):
    import importlib.resources as ir
    try:
        import importlib_resources
        if not hasattr(ir, "files"):
            ir.files = importlib_resources.files
    except Exception:
        pass

import whitebox
wbt = whitebox.WhiteboxTools()

# 2) Parameters
THRESHOLD_CELLS = 10_000  # ACC >= 10k -> stream (~81 km² if 0.0081 km²/pixel)
print(f"Using ACC threshold = {THRESHOLD_CELLS} cells")

# 3) Paths (absolute)
root = Path("../../data/HydroSHEDS").resolve()
dir_tif = (root / "as_dir_Bhutan_and_buffer.tif").resolve()
acc_tif = (root / "as_acc_Bhutan_and_buffer.tif").resolve()
out_dir = (root / "bt_out").resolve()
out_dir.mkdir(parents=True, exist_ok=True)

streams_tif = out_dir / "streams.tif"
pp_rast     = out_dir / "auto_pour_points_clustered.tif"
ws_tif      = out_dir / "watersheds_by_outlets_clustered.tif"
ws_shp      = out_dir / "watersheds_by_outlets_clustered.shp"
ws_gpkg     = out_dir / "watersheds_by_outlets_clustered.gpkg"

def safe_remove(p: Path):
    try:
        if p.exists():
            p.unlink()
    except Exception as e:
        print(f"⚠️ Could not remove {p}: {e}")

def cleanup_shapefile(stem: Path):
    base = stem.with_suffix("")
    for ext in (".shp", ".shx", ".dbf", ".prj", ".cpg", ".qmd"):
        safe_remove(base.with_suffix(ext))

# clean old outputs
for p in [streams_tif, pp_rast, ws_tif, ws_gpkg]:
    safe_remove(p)
cleanup_shapefile(ws_shp)

# 4) Input checks & alignment
assert dir_tif.exists(), f"Missing DIR: {dir_tif}"
assert acc_tif.exists(), f"Missing ACC: {acc_tif}"
with rasterio.open(dir_tif) as rD, rasterio.open(acc_tif) as rA:
    assert rD.crs == rA.crs, "DIR and ACC CRS differ"
    assert rD.transform == rA.transform, "DIR and ACC grids not aligned"
    assert (rD.width, rD.height) == (rA.width, rA.height), "DIR and ACC size mismatch"
    H, W = rD.height, rD.width
    profile = rD.profile
print(f"✅ DIR/ACC aligned: {W} x {H} | CRS={profile['crs']}")

wbt.work_dir = str(out_dir)

# 5) Streams from ACC
ok_s = wbt.extract_streams(flow_accum=str(acc_tif), output=str(streams_tif), threshold=THRESHOLD_CELLS)
print("✅ Streams:", ok_s, "→", streams_tif)

# 6) Boundary outlet candidates (stream on outer border + D8 points outside)
with rasterio.open(dir_tif) as r_dir, rasterio.open(streams_tif) as r_str:
    dir_arr = r_dir.read(1)
    str_arr = r_str.read(1).astype(bool)

code2offset = {1:(0,1), 2:(1,1), 4:(1,0), 8:(1,-1), 16:(0,-1), 32:(-1,-1), 64:(-1,0), 128:(-1,1)}

cand_coords = []
# top/bottom rows
for c in range(W):
    for r in (0, H-1):
        if str_arr[r, c]:
            d = int(dir_arr[r, c])
            if d in code2offset:
                dr, dc = code2offset[d]
                nr, nc = r + dr, c + dc
                if nr < 0 or nr >= H or nc < 0 or nc >= W:
                    cand_coords.append((r, c))
# left/right cols (skip corners to avoid dupes)
for r in range(1, H-1):
    for c in (0, W-1):
        if str_arr[r, c]:
            d = int(dir_arr[r, c])
            if d in code2offset:
                dr, dc = code2offset[d]
                nr, nc = r + dr, c + dc
                if nr < 0 or nr >= H or nc < 0 or nc >= W:
                    cand_coords.append((r, c))

cand_coords = list(dict.fromkeys(cand_coords))
print(f"🔎 Candidate boundary outlets (raw): {len(cand_coords)}")

# 7) Cluster candidates (8-connectivity) and pick max-ACC per cluster
class DSU:
    def __init__(self, n):
        self.p = list(range(n)); self.r = [0]*n
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb: return
        if self.r[ra] < self.r[rb]:
            self.p[ra] = rb
        elif self.r[ra] > self.r[rb]:
            self.p[rb] = ra
        else:
            self.p[rb] = ra; self.r[ra] += 1

N = len(cand_coords)
idx_map = {rc:i for i, rc in enumerate(cand_coords)}
cand_set = set(cand_coords)
dsu = DSU(N)
nbrs = [(dr, dc) for dr in (-1,0,1) for dc in (-1,0,1) if not (dr==0 and dc==0)]

for i, (r, c) in enumerate(cand_coords):
    for dr, dc in nbrs:
        nr, nc = r+dr, c+dc
        if (nr, nc) in cand_set:
            dsu.union(i, idx_map[(nr, nc)])

groups = {}
for i in range(N):
    root_i = dsu.find(i)
    groups.setdefault(root_i, []).append(i)

with rasterio.open(acc_tif) as r_acc:
    acc = r_acc.read(1)
    acc_nodata = r_acc.nodata

selected_rc = []
for root_i, members in groups.items():
    best_rc, best_val = None, -1
    for i in members:
        rr, cc = cand_coords[i]
        val = acc[rr, cc]
        if acc_nodata is not None and val == acc_nodata:
            continue
        if val > best_val:
            best_val = val; best_rc = (rr, cc)
    if best_rc is None:
        best_rc = cand_coords[members[0]]
    selected_rc.append(best_rc)

print(f"✅ Clustered outlets (one per mouth): {len(selected_rc)}")

if not selected_rc:
    raise RuntimeError("No clustered outlets found. Increase threshold or verify DIR/ACC.")

# 8) Rasterize pour-points (unique IDs)
pp_arr = np.zeros((H, W), dtype=np.int32)
for i, (rr, cc) in enumerate(selected_rc, start=1):
    pp_arr[rr, cc] = i

profile_pp = profile.copy()
profile_pp.update(dtype=rasterio.int32, count=1, compress="deflate", tiled=True, BIGTIFF="IF_SAFER")
with rasterio.open(pp_rast, "w", **profile_pp) as dst:
    dst.write(pp_arr, 1)
print("✅ Pour-point raster (clustered):", pp_rast)

# 9) Watershed per clustered outlet (ESRI D8)
ok_w = wbt.watershed(d8_pntr=str(dir_tif), pour_pts=str(pp_rast), output=str(ws_tif), esri_pntr=True)
print("✅ Watersheds by clustered outlets:", ok_w, "→", ws_tif)

# 10) Polygonize to SHP (Whitebox), then write GPKG via pyogrio (no Fiona)
ok_p = wbt.raster_to_vector_polygons(i=str(ws_tif), output=str(ws_shp))
print("✅ Polygons (SHP):", ok_p, "→", ws_shp)

gdf = pyogrio.read_dataframe(str(ws_shp))
if gdf.crs is None:
    with rasterio.open(ws_tif) as rt:
        gdf.set_crs(rt.crs, inplace=True)
pyogrio.write_dataframe(
    gdf,
    str(ws_gpkg),
    layer="watersheds",
    driver="GPKG",
    append=False
)
print("✅ GeoPackage written (pyogrio):", ws_gpkg)
try:
    print("Layers:", pyogrio.list_layers(str(ws_gpkg)))
except Exception:
    pass

# 11) Count unique watersheds
with rasterio.open(ws_tif) as r:
    WZ = r.read(1)
    nd = r.nodata
    n_ws = np.unique(WZ[WZ != nd]).size
print("🧮 Unique watersheds (clustered):", n_ws)

gc.collect()
print("Done.")

📦 Installing pyogrio ...
Defaulting to user installation because normal site-packages is not writeable
Collecting pyogrio
  Downloading pyogrio-0.9.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Downloading pyogrio-0.9.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.2 MB)
Installing collected packages: pyogrio
Successfully installed pyogrio-0.9.0
Using ACC threshold = 10000 cells
✅ DIR/ACC aligned: 7800 x 5400 | CRS=EPSG:4326
./whitebox_tools --run="ExtractStreams" --wd="/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out" --flow_accum='/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/as_acc_Bhutan_and_buffer.tif' --output='/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/streams.tif' --threshold='10000' -v --compress_rasters=False

*****************************
* Welcome to ExtractStreams *
* Powered by WhiteboxTools  *
* www.whiteboxgeo.com       *
*****************************
Reading data...


🧮 Unique watersheds (clustered): 1761
Done.


In [19]:
# === Rebuild clustered watersheds with outlet filtering + proximity merge ===
# - Streams from ACC (threshold = 10,000 cells)
# - Boundary outlet candidates (stream on outer border + ESRI-D8 points outside)
# - Filter outlets by minimum ACC (major mouths only)
# - Merge outlets within MERGE_RADIUS_PX (Chebyshev)
# - Watershed per outlet; polygonize; write GPKG via pyogrio

import sys, subprocess, os, gc
from pathlib import Path
import numpy as np
import rasterio

# Ensure pyogrio present in THIS kernel
def ensure_package(name):
    try:
        __import__(name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", name])

ensure_package("pyogrio")
import pyogrio

# Py3.8 shim for whitebox
if sys.version_info < (3, 9):
    import importlib.resources as ir
    try:
        import importlib_resources
        if not hasattr(ir, "files"):
            ir.files = importlib_resources.files
    except Exception:
        pass

import whitebox
wbt = whitebox.WhiteboxTools()

# -------- Parameters (tune these) --------
STREAM_THRESHOLD_CELLS = 10_000     # ACC >= this -> stream (≈81 km² if 0.0081 km²/pixel)
MIN_OUTLET_ACC_CELLS   = 20_000     # keep only outlets with ACC >= this (≈162 km²)
MERGE_RADIUS_PX        = 4          # merge outlets closer than this (Chebyshev pixels)
# -----------------------------------------

# Paths
root = Path("../../data/HydroSHEDS").resolve()
dir_tif = (root / "as_dir_Bhutan_and_buffer.tif").resolve()
acc_tif = (root / "as_acc_Bhutan_and_buffer.tif").resolve()
out_dir = (root / "bt_out").resolve()
out_dir.mkdir(parents=True, exist_ok=True)

streams_tif = out_dir / "streams.tif"
pp_rast     = out_dir / "auto_pour_points_filtered_merged.tif"
ws_tif      = out_dir / "watersheds_filtered_merged.tif"
ws_shp      = out_dir / "watersheds_filtered_merged.shp"
ws_gpkg     = out_dir / "watersheds_filtered_merged.gpkg"

def safe_remove(p: Path):
    try:
        if p.exists(): p.unlink()
    except Exception as e:
        print(f"⚠️ Could not remove {p}: {e}")

def cleanup_shapefile(stem: Path):
    base = stem.with_suffix("")
    for ext in (".shp", ".shx", ".dbf", ".prj", ".cpg", ".qmd"):
        safe_remove(base.with_suffix(ext))

# Clean previous outputs
for p in [streams_tif, pp_rast, ws_tif, ws_gpkg]:
    safe_remove(p)
cleanup_shapefile(ws_shp)

# Checks & alignment
assert dir_tif.exists() and acc_tif.exists()
with rasterio.open(dir_tif) as rD, rasterio.open(acc_tif) as rA:
    assert rD.crs == rA.crs
    assert rD.transform == rA.transform
    assert (rD.width, rD.height) == (rA.width, rA.height)
    H, W = rD.height, rD.width
    profile = rD.profile
print(f"✅ DIR/ACC aligned: {W}x{H} | CRS={profile['crs']}")

wbt.work_dir = str(out_dir)

# 1) Streams from ACC
ok_s = wbt.extract_streams(flow_accum=str(acc_tif), output=str(streams_tif), threshold=STREAM_THRESHOLD_CELLS)
print("✅ Streams:", ok_s, "→", streams_tif)

# 2) Boundary outlet candidates
with rasterio.open(dir_tif) as r_dir, rasterio.open(streams_tif) as r_str:
    dir_arr = r_dir.read(1)
    str_arr = r_str.read(1).astype(bool)

code2offset = {1:(0,1), 2:(1,1), 4:(1,0), 8:(1,-1), 16:(0,-1), 32:(-1,-1), 64:(-1,0), 128:(-1,1)}

cand = []
# top/bottom rows
for c in range(W):
    for r in (0, H-1):
        if str_arr[r, c]:
            d = int(dir_arr[r, c])
            if d in code2offset:
                dr, dc = code2offset[d]
                nr, nc = r + dr, c + dc
                if nr < 0 or nr >= H or nc < 0 or nc >= W:
                    cand.append((r, c))
# left/right cols (skip corners dupes)
for r in range(1, H-1):
    for c in (0, W-1):
        if str_arr[r, c]:
            d = int(dir_arr[r, c])
            if d in code2offset:
                dr, dc = code2offset[d]
                nr, nc = r + dr, c + dc
                if nr < 0 or nr >= H or nc < 0 or nc >= W:
                    cand.append((r, c))

# dedupe
cand = list(dict.fromkeys(cand))
print(f"🔎 Boundary outlet candidates (raw): {len(cand)}")

# 3) Filter by minimum ACC at outlet
with rasterio.open(acc_tif) as r_acc:
    acc = r_acc.read(1)
    acc_nd = r_acc.nodata

def acc_val(rc):
    v = acc[rc[0], rc[1]]
    return -1 if (acc_nd is not None and v == acc_nd) else v

cand2 = [rc for rc in cand if acc_val(rc) >= MIN_OUTLET_ACC_CELLS]
print(f"✅ After ACC filter (≥ {MIN_OUTLET_ACC_CELLS} cells): {len(cand2)}")

# 4) Merge outlets within MERGE_RADIUS_PX (Chebyshev)
#    DSU over points; union if max(|dr|,|dc|) <= MERGE_RADIUS_PX
class DSU:
    def __init__(self, n):
        self.p = list(range(n)); self.r = [0]*n
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb: return
        if self.r[ra] < self.r[rb]:
            self.p[ra] = rb
        elif self.r[ra] > self.r[rb]:
            self.p[rb] = ra
        else:
            self.p[rb] = ra; self.r[ra] += 1

N = len(cand2)
dsu = DSU(N)
for i in range(N):
    r1, c1 = cand2[i]
    # check only j>i to reduce work
    for j in range(i+1, N):
        r2, c2 = cand2[j]
        if max(abs(r1-r2), abs(c1-c2)) <= MERGE_RADIUS_PX:
            dsu.union(i, j)

groups = {}
for i in range(N):
    root_i = dsu.find(i)  # separate name so the `root` Path defined above is not overwritten
    groups.setdefault(root_i, []).append(i)

selected = []
for root_i, idxs in groups.items():
    # pick point with max ACC
    best_rc, best_v = None, -1
    for i in idxs:
        rc = cand2[i]
        v = acc_val(rc)
        if v > best_v:
            best_v, best_rc = v, rc
    selected.append(best_rc)

print(f"✅ After proximity merge (≤ {MERGE_RADIUS_PX}px): {len(selected)}")

# 5) Rasterize pour points
pp_arr = np.zeros((H, W), dtype=np.int32)
for i, (rr, cc) in enumerate(selected, start=1):
    pp_arr[rr, cc] = i

profile_pp = profile.copy()
profile_pp.update(dtype=rasterio.int32, count=1, compress="deflate", tiled=True, BIGTIFF="IF_SAFER")
with rasterio.open(pp_rast, "w", **profile_pp) as dst:
    dst.write(pp_arr, 1)
print("✅ Pour-point raster:", pp_rast)

# 6) Watersheds
ok_w = wbt.watershed(d8_pntr=str(dir_tif), pour_pts=str(pp_rast), output=str(ws_tif), esri_pntr=True)
print("✅ Watersheds:", ok_w, "→", ws_tif)

# 7) Polygonize + GPKG
wbt.raster_to_vector_polygons(i=str(ws_tif), output=str(ws_shp))
print("✅ SHP:", ws_shp)

gdf = pyogrio.read_dataframe(str(ws_shp))
if gdf.crs is None:
    with rasterio.open(ws_tif) as rt:
        gdf.set_crs(rt.crs, inplace=True)
pyogrio.write_dataframe(gdf, str(ws_gpkg), layer="watersheds", driver="GPKG", append=False)
print("✅ GPKG:", ws_gpkg)

# 8) Count watersheds
with rasterio.open(ws_tif) as r:
    WZ = r.read(1)
    nd = r.nodata
    n_ws = np.unique(WZ[WZ != nd]).size
print("🧮 Unique watersheds:", n_ws)

gc.collect()
print("Done.")

✅ DIR/ACC aligned: 7800x5400 | CRS=EPSG:4326
./whitebox_tools --run="ExtractStreams" --wd="/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out" --flow_accum='/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/as_acc_Bhutan_and_buffer.tif' --output='/home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/streams.tif' --threshold='10000' -v --compress_rasters=False

*****************************
* Welcome to ExtractStreams *
* Powered by WhiteboxTools  *
* www.whiteboxgeo.com       *
*****************************
Reading data...


🧮 Unique watersheds: 47
Done.


## 3. Compute basin areas and export a summary table

Adds area_km2 and a robust ws_id to the GPKG + a CSV summary.
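A raster-side cross-check of the polygon areas is possible by summing cell areas per watershed ID. On a latitude/longitude grid every cell in a row has the same area, so a per-row weight is enough. The sketch below assumes a spherical Earth (radius 6371 km), a north-up grid, and `0` as the NoData ID; all names and the toy grid are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def band_cell_area_km2(lat_top_deg, cell_deg):
    """Spherical area of one cell whose top edge lies at lat_top_deg."""
    lat_top = math.radians(lat_top_deg)
    lat_bottom = math.radians(lat_top_deg - cell_deg)
    width_rad = math.radians(cell_deg)
    return EARTH_RADIUS_KM ** 2 * width_rad * (math.sin(lat_top) - math.sin(lat_bottom))


def basin_areas_km2(id_grid, top_lat_deg, cell_deg, nodata=0):
    """Sum cell areas per watershed ID; every cell in a row shares one area."""
    areas = {}
    for i, row in enumerate(id_grid):
        cell_area = band_cell_area_km2(top_lat_deg - i * cell_deg, cell_deg)
        for ws_id in row:
            if ws_id != nodata:
                areas[ws_id] = areas.get(ws_id, 0.0) + cell_area
    return areas


# Hypothetical 3 x 3 watershed-ID grid with 3 arc-second cells, top edge at 29.5 N.
toy_ids = [
    [1, 1, 2],
    [1, 2, 2],
    [0, 2, 2],
]
areas = basin_areas_km2(toy_ids, top_lat_deg=29.5, cell_deg=3.0 / 3600.0)
for ws_id, km2 in sorted(areas.items()):
    print(ws_id, round(km2, 4))
```

Totals from this kind of cell count should be close to areas measured in an equal-area projection. Small differences are expected from the spherical assumption and from polygon edges following cell boundaries.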

In [13]:
from pathlib import Path
import pyogrio
import geopandas as gpd
import rasterio

root   = Path("../../data/HydroSHEDS").resolve()
outdir = root / "bt_out"
gpkg   = outdir / "watersheds_filtered_merged.gpkg"
layer  = "watersheds"
summary_csv = outdir / "watersheds_filtered_merged_summary.csv"

# Read polygons
gdf = pyogrio.read_dataframe(gpkg, layer=layer)

# Ensure CRS is present; if missing, copy from raster
if gdf.crs is None:
    with rasterio.open(outdir / "watersheds_filtered_merged.tif") as r:
        gdf = gdf.set_crs(r.crs)

# Try to find an integer ID column from Whitebox (commonly 'value' or 'FID')
id_col = None
for cand in ["value", "VALUE", "fid", "FID", "Id", "ID"]:
    if cand in gdf.columns:
        id_col = cand
        break
if id_col is None:
    # Fall back to index-based ID
    gdf["ws_id"] = gdf.index.astype(int) + 1
else:
    gdf = gdf.rename(columns={id_col: "ws_id"})
    gdf["ws_id"] = gdf["ws_id"].astype(int)

# Compute area in km² using an equal-area projection
gdf_eq = gdf.to_crs("EPSG:6933")  # World Cylindrical Equal Area
gdf["area_km2"] = gdf_eq.geometry.area.values / 1_000_000.0

# Write back to GPKG (overwriting layer) and CSV summary
pyogrio.write_dataframe(gdf, gpkg, layer=layer, driver="GPKG", append=False)
gdf[["ws_id", "area_km2"]].to_csv(summary_csv, index=False)

print("✅ Updated GPKG with ws_id + area_km2:", gpkg)
print("✅ Summary CSV:", summary_csv)
print("🧮 Basins:", len(gdf), " | area_km2 stats: ",
      float(gdf["area_km2"].min()), "→", float(gdf["area_km2"].max()))

✅ Updated GPKG with ws_id + area_km2: /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/watersheds_filtered_merged.gpkg
✅ Summary CSV: /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/watersheds_filtered_merged_summary.csv
🧮 Basins: 47  | area_km2 stats:  2.127651543268542 → 143958.44097443993


## 4. Map any (lon, lat) points to watershed IDs (your “dictionary”)

Takes a CSV of points and adds the ws_id from the watershed raster.   TODO
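The lookup itself is simple index arithmetic: convert each lon/lat to a row and column using the grid origin and cell size, then read the ID stored there. The following sketch shows that on a toy grid, assuming a north-up grid with square cells and `0` as NoData. `lookup_ws_id` and the sample points are hypothetical and stand in for sampling the real watershed raster.

```python
import math


def lookup_ws_id(id_grid, left, top, cell_deg, lon, lat, nodata=0):
    """Return the watershed ID at (lon, lat), or None if outside the grid or NoData.

    `left` and `top` are the coordinates of the grid's upper-left corner;
    `cell_deg` is the cell size in degrees.
    """
    col = math.floor((lon - left) / cell_deg)
    row = math.floor((top - lat) / cell_deg)
    if not (0 <= row < len(id_grid) and 0 <= col < len(id_grid[0])):
        return None  # point falls outside the raster extent
    value = id_grid[row][col]
    return None if value == nodata else value


# Hypothetical 3 x 3 grid with 0.5-degree cells, upper-left corner at (90.0, 28.0).
toy_ids = [
    [1, 1, 2],
    [1, 2, 2],
    [0, 2, 2],
]
points = {"a": (90.2, 27.9), "b": (91.3, 27.2), "c": (90.1, 26.6), "d": (95.0, 27.0)}
mapping = {name: lookup_ws_id(toy_ids, 90.0, 28.0, 0.5, lon, lat)
           for name, (lon, lat) in points.items()}
print(mapping)
```

Points `a` and `b` resolve to watersheds 1 and 2. Point `c` lands on a NoData cell and `d` lies outside the grid, so both map to `None`, which is the case a real workflow needs to report rather than silently drop.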

In [15]:
# --- Update watersheds GPKG with ws_id + area_km2 and write summary CSV ---

from pathlib import Path
import sys, subprocess

# ensure pyogrio is available in THIS kernel (geopandas already installed earlier)
def ensure_package(name: str):
    try:
        __import__(name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", name])

ensure_package("pyogrio")

import pyogrio
import geopandas as gpd
import rasterio

# Paths
root    = Path("../../data/HydroSHEDS").resolve()
outdir  = root / "bt_out"
gpkg    = outdir / "watersheds_filtered_merged.gpkg"
layer   = "watersheds"
ws_tif  = outdir / "watersheds_filtered_merged.tif"
summary_csv = outdir / "watersheds_filtered_merged_summary.csv"

assert gpkg.exists(),  f"GPKG not found: {gpkg}"
assert ws_tif.exists(), f"Raster not found: {ws_tif}"

# Read polygons via pyogrio (avoids Fiona issues)
gdf = pyogrio.read_dataframe(gpkg, layer=layer)

# Ensure CRS is set; if missing, copy from raster
if gdf.crs is None:
    with rasterio.open(ws_tif) as r:
        gdf = gdf.set_crs(r.crs)

# Determine/normalize the watershed ID column
id_col = None
for cand in ["ws_id", "value", "VALUE", "fid", "FID", "Id", "ID"]:
    if cand in gdf.columns:
        id_col = cand
        break

if id_col is None:
    # Fall back to index-based ID (1..N)
    gdf["ws_id"] = gdf.index.astype(int) + 1
else:
    gdf = gdf.rename(columns={id_col: "ws_id"})
    gdf["ws_id"] = gdf["ws_id"].astype(int)

# Compute area in km² using an equal-area projection
gdf_eq = gdf.to_crs("EPSG:6933")      # World Cylindrical Equal Area
gdf["area_km2"] = gdf_eq.geometry.area.values / 1_000_000.0

# Overwrite GPKG layer and write summary CSV
pyogrio.write_dataframe(gdf, gpkg, layer=layer, driver="GPKG", append=False)
gdf[["ws_id", "area_km2"]].to_csv(summary_csv, index=False)

print("✅ Updated GPKG with ws_id + area_km2:", gpkg)
print("✅ Summary CSV:", summary_csv)
print("🧮 Basins:", len(gdf), "| area_km2 range:",
      float(gdf["area_km2"].min()), "→", float(gdf["area_km2"].max()))

✅ Updated GPKG with ws_id + area_km2: /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/watersheds_filtered_merged.gpkg
✅ Summary CSV: /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/watersheds_filtered_merged_summary.csv
🧮 Basins: 47 | area_km2 range: 2.127651543268542 → 143958.44097443993


In [17]:
# --- Create a demo points CSV with one point per watershed (representative point) ---

from pathlib import Path
import pyogrio
import geopandas as gpd

root    = Path("../../data/HydroSHEDS").resolve()
outdir  = root / "bt_out"
gpkg    = outdir / "watersheds_filtered_merged.gpkg"
layer   = "watersheds"
demo_csv = outdir / "watershed_centroids_demo.csv"

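gdf = pyogrio.read_dataframe(gpkg, layer=layer)
if gdf.crs is None:
    # CRS missing: copy it from the watershed raster written in the previous step
    import rasterio
    with rasterio.open(outdir / "watersheds_filtered_merged.tif") as r:
        gdf = gdf.set_crs(r.crs)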

# Representative points are guaranteed to lie inside polygon (better than plain centroid)
pts = gdf.geometry.representative_point()
pts_ll = pts.to_crs("EPSG:4326") if gdf.crs and gdf.crs.to_string() != "EPSG:4326" else pts

df_demo = gdf[["ws_id"]].copy()
df_demo["longitude"] = pts_ll.x
df_demo["latitude"]  = pts_ll.y

df_demo.to_csv(demo_csv, index=False)
print("✅ Wrote demo points CSV:", demo_csv)
print(df_demo.head())

✅ Wrote demo points CSV: /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/watershed_centroids_demo.csv
   ws_id  longitude   latitude
0      1  87.012083  29.482500
1      2  90.256667  28.658750
2      3  93.455833  29.438750
3      4  93.358750  29.375833
4      5  93.437500  29.232083


In [22]:
# Map points to watershed IDs and attach basin area (km²)
# You can later swap `points_csv` to your own file with lon/lat columns.

from pathlib import Path
import pandas as pd
import numpy as np
import rasterio

# --- Paths ---
root       = Path("../../data/HydroSHEDS").resolve()  # same relative root as the earlier cells
outdir     = root / "bt_out"
points_csv = outdir / "watershed_centroids_demo.csv"  # replace with your own CSV when ready
ws_tif     = outdir / "watersheds_filtered_merged.tif"
summary_csv= outdir / "watersheds_filtered_merged_summary.csv"

# --- Checks ---
assert points_csv.exists(), f"Points CSV not found: {points_csv}"
assert ws_tif.exists(),     f"Watershed raster not found: {ws_tif}"
assert summary_csv.exists(),f"Summary CSV not found: {summary_csv}"

# --- Load points (need lon/lat in EPSG:4326) ---
df = pd.read_csv(points_csv, encoding="utf-8-sig")
df.columns = df.columns.str.strip()

# find lon/lat columns (supports common aliases)
lon_col = next((c for c in ["longitude","lon","LONGITUDE","Lon","x","X"] if c in df.columns), None)
lat_col = next((c for c in ["latitude","lat","LATITUDE","Lat","y","Y"] if c in df.columns), None)
if lon_col is None or lat_col is None:
    raise ValueError(f"Need lon/lat columns; found: {list(df.columns)}")

# --- Sample watershed raster at those coordinates ---
with rasterio.open(ws_tif) as r:
    assert r.crs and r.crs.to_string()=="EPSG:4326", "Watershed raster must be EPSG:4326"
    left, bottom, right, top = r.bounds
    lon = df[lon_col].astype(float)
    lat = df[lat_col].astype(float)
    outside = ((lon < left) | (lon > right) | (lat < bottom) | (lat > top)).to_numpy()
    if outside.any():
        print(f"⚠️  {int(outside.sum())} point(s) outside raster bounds → ws_id = NaN")

    coords  = list(zip(lon, lat))
    vals    = np.array([v[0] for v in r.sample(coords)], dtype="float64")
    nodata  = r.nodata

# set raster NoData and out-of-bounds samples to NaN, then store as pandas nullable Int
if nodata is not None:
    vals = np.where(vals == nodata, np.nan, vals)
vals[outside] = np.nan  # mask explicitly: do not rely on r.sample() to flag out-of-bounds points
df["ws_id"] = pd.Series(vals, index=df.index).astype("Int64")  # sampled ID

# --- Join basin area from summary table ---
meta = pd.read_csv(summary_csv, encoding="utf-8-sig")
meta.columns = meta.columns.str.strip()
if "ws_id" not in meta.columns or "area_km2" not in meta.columns:
    if "VALUE" in meta.columns:
        meta = meta.rename(columns={"VALUE":"ws_id"})
    if "ws_id" not in meta.columns or "area_km2" not in meta.columns:
        raise ValueError(f"Summary must contain 'ws_id' and 'area_km2'. Found: {list(meta.columns)}")
meta["ws_id"] = meta["ws_id"].astype(int)

df = df.merge(meta[["ws_id","area_km2"]], on="ws_id", how="left")

# --- Save next to the input CSV ---
out_csv = points_csv.with_name(points_csv.stem + "_with_ws_id.csv")
df.to_csv(out_csv, index=False)

# --- Report ---
n_all = len(df)
n_nan = int(df["ws_id"].isna().sum())
print(f"✅ Saved: {out_csv}")
print(f"🧮 Mapped {n_all - n_nan} / {n_all} points (NaN = outside/NoData: {n_nan})")
print("📌 Preview:")
print(df.head(5))

# Optional: top basins by number of points (for the demo each basin has 1 point)
print("\nTop basins by number of points:")
print(df["ws_id"].value_counts(dropna=True).head(10))

✅ Saved: /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/watershed_centroids_demo_with_ws_id.csv
🧮 Mapped 47 / 47 points (NaN = outside/NoData: 0)
📌 Preview:
   ws_id  longitude   latitude     area_km2
0     29  87.012083  29.482500   428.292418
1     33  90.256667  28.658750  1360.994845
2     30  93.455833  29.438750   414.545374
3     31  93.358750  29.375833   810.601809
4     32  93.437500  29.232083  3670.449086

Top basins by number of points:
ws_id
29    1
14    1
16    1
18    1
47    1
8     1
10    1
20    1
21    1
15    1
Name: count, dtype: Int64


In [26]:
# Build a (point -> ws_id) dictionary from the demo CSV we already created.
# You can later swap `points_with_ws_csv` to your own *_with_ws_id.csv file.

from pathlib import Path
import pandas as pd
import numpy as np

# Paths (use the demo output we produced earlier)
hydro_root = Path("../../data/HydroSHEDS").resolve()
outdir = hydro_root / "bt_out"
points_with_ws_csv = outdir / "watershed_centroids_demo_with_ws_id.csv"  # <-- change if you have your own file

# Safety checks
assert points_with_ws_csv.exists(), f"Input CSV not found: {points_with_ws_csv}"

# Load
df = pd.read_csv(points_with_ws_csv, encoding="utf-8-sig")
df.columns = df.columns.str.strip()

# Ensure required columns exist
required_cols = {"ws_id"}
missing = required_cols - set(df.columns)
if missing:
    raise ValueError(f"CSV is missing required column(s) {sorted(missing)}. Found: {list(df.columns)}")

# Find lon/lat columns for keying the dictionary (we support common aliases)
def find_col(possible, cols):
    return next((c for c in possible if c in cols), None)

lon_col = find_col(["longitude","lon","LONGITUDE","Lon","x","X"], df.columns)
lat_col = find_col(["latitude","lat","LATITUDE","Lat","y","Y"], df.columns)

# Build the mapping:
# - If there's a unique point identifier, prefer that (uncomment and set your column name).
# - Otherwise, use (longitude, latitude) tuple as the key.
# Example with a point_id:
# mapping = dict(zip(df["point_id"], df["ws_id"].astype("Int64")))

if lon_col is not None and lat_col is not None:
    # Key by coordinates
    mapping = {
        (float(row[lon_col]), float(row[lat_col])): (int(row["ws_id"]) if pd.notna(row["ws_id"]) else None)
        for _, row in df.iterrows()
    }
else:
    # No lon/lat available -> fall back to row index as the key
    mapping = {
        int(i): (int(row["ws_id"]) if pd.notna(row["ws_id"]) else None)
        for i, row in df.iterrows()
    }

# Report
n_all = len(df)
n_nan = int(df["ws_id"].isna().sum())
print(f"✅ Loaded: {points_with_ws_csv}")
print(f"🧮 Points mapped: {n_all - n_nan} / {n_all} (NaN = outside/NoData: {n_nan})")

# Top basins by number of points
top = df["ws_id"].value_counts(dropna=True).head(10)
print("\nTop basins by number of points:")
print(top)

# Show a few mapping examples
examples = list(mapping.items())[:5]
print("\n🔎 Mapping examples (first 5):")
for k, v in examples:
    print(f"  {k} -> ws_id {v}")

✅ Loaded: /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/watershed_centroids_demo_with_ws_id.csv
🧮 Points mapped: 47 / 47 (NaN = outside/NoData: 0)

Top basins by number of points:
ws_id
29    1
14    1
16    1
18    1
47    1
8     1
10    1
20    1
21    1
15    1
Name: count, dtype: int64

🔎 Mapping examples (first 5):
  (87.01208333333332, 29.48250000000001) -> ws_id 29
  (90.25666666666666, 28.65875000000001) -> ws_id 33
  (93.45583333333332, 29.43875000000001) -> ws_id 30
  (93.35875, 29.37583333333334) -> ws_id 31
  (93.4375, 29.232083333333343) -> ws_id 32


In [27]:
import json
mapping_json = {f"{k[0]},{k[1]}": (int(v) if v is not None else None) for k, v in mapping.items()}
json_path = outdir / "point_to_ws_id.json"
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(mapping_json, f, ensure_ascii=False, indent=2)
print("💾 Saved mapping:", json_path)

💾 Saved mapping: /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/point_to_ws_id.json


In [28]:
# Clean bt_out/: move intermediate/temporary files into a dated _trash folder
# Keeps only the essentials for analysis/ML:
#   - watersheds_filtered_merged.tif
#   - watersheds_filtered_merged.gpkg
#   - watersheds_filtered_merged_summary.csv
#   - point_to_ws_id.json
#   - watershed_centroids_demo*.csv  (can be toggled)

from pathlib import Path
import shutil
import time
from typing import Iterable

# ---- Settings ---------------------------------------------------------------
hydro_root = Path("../../data/HydroSHEDS").resolve()
bt_out     = hydro_root / "bt_out"

KEEP_DEMO_CSV    = True   # keep watershed_centroids_demo.csv and *_with_ws_id.csv
KEEP_STREAMS_TIF = False  # set True if you want to keep streams.tif
# ---------------------------------------------------------------------------

assert bt_out.exists(), f"bt_out not found: {bt_out}"

# Essentials to KEEP (exact filenames)
keep_exact = {
    "watersheds_filtered_merged.tif",
    "watersheds_filtered_merged.gpkg",
    "watersheds_filtered_merged_summary.csv",
    "point_to_ws_id.json",
}
if KEEP_DEMO_CSV:
    keep_exact |= {
        "watershed_centroids_demo.csv",
        "watershed_centroids_demo_with_ws_id.csv",
    }
if KEEP_STREAMS_TIF:
    keep_exact.add("streams.tif")

# Patterns to TRASH (globs); we will exclude anything in keep_exact
trash_globs = [
    # intermediate rasters/vectors, any extension (this includes the big .tif files)
    "auto_pour_points*.*",
    "watersheds_by_outlets*.*",   # also matches watersheds_by_outlets_clustered*
    "catchments_id*.*",
    # shapefile sidecars for filtered_merged (we keep GPKG instead)
    "watersheds_filtered_merged.shp",
    "watersheds_filtered_merged.shx",
    "watersheds_filtered_merged.dbf",
    "watersheds_filtered_merged.prj",
    "watersheds_filtered_merged.cpg",
    "watersheds_filtered_merged.qmd",
]
# streams.tif is optional — add to trash if we decided NOT to keep it
if not KEEP_STREAMS_TIF:
    trash_globs.append("streams.tif")

def iter_files(patterns: Iterable[str]):
    seen = set()
    for pat in patterns:
        for p in bt_out.glob(pat):
            if p.is_file() and p.name not in keep_exact and p not in seen:
                seen.add(p)
                yield p

# Collect candidates
candidates = list(iter_files(trash_globs))

if not candidates:
    print("✨ Nothing to clean — bt_out already tidy.")
else:
    # Create a dated trash folder
    stamp = time.strftime("%Y%m%d_%H%M%S")
    trash = bt_out / f"_trash_{stamp}"
    trash.mkdir(parents=True, exist_ok=True)

    # Move files
    total_bytes = 0
    moved = []
    for p in candidates:
        try:
            total_bytes += p.stat().st_size
        except Exception:
            pass
        dest = trash / p.name
        try:
            shutil.move(str(p), str(dest))
            moved.append((p, dest))
        except Exception as e:
            print(f"⚠️  Failed to move {p.name}: {e}")

    # Report
    mb = total_bytes / (1024 * 1024)
    print(f"🧹 Moved {len(moved)} file(s) to {trash}  (≈ {mb:.2f} MB)")
    if moved:
        print("📦 Examples:")
        for src, dst in moved[:10]:
            print(f"  {src.name}  →  {dst.name}")

    # List what we kept (top level)
    kept = [p.name for p in bt_out.iterdir() if p.is_file() and p.name in keep_exact]
    print("\n✅ Kept essentials:")
    for name in sorted(kept):
        print("  ", name)

    # Small reminder
    print("\nℹ️ You can restore any file by moving it back from the _trash_* folder.")

🧹 Moved 25 file(s) to /home/merlin/Bhutan-Climate-Change/bhutan_climate_modeling/bhutan_climate_modeling/data/HydroSHEDS/bt_out/_trash_20250816_025059  (≈ 656.29 MB)
📦 Examples:
  auto_pour_points_clustered.tif  →  auto_pour_points_clustered.tif
  auto_pour_points_filtered_merged.tif  →  auto_pour_points_filtered_merged.tif
  auto_pour_points.tif  →  auto_pour_points.tif
  watersheds_by_outlets.shx  →  watersheds_by_outlets.shx
  watersheds_by_outlets.tif  →  watersheds_by_outlets.tif
  watersheds_by_outlets.shp  →  watersheds_by_outlets.shp
  watersheds_by_outlets_clustered.gpkg  →  watersheds_by_outlets_clustered.gpkg
  watersheds_by_outlets_clustered.shp  →  watersheds_by_outlets_clustered.shp
  watersheds_by_outlets_clustered.shx  →  watersheds_by_outlets_clustered.shx
  watersheds_by_outlets_clustered.dbf  →  watersheds_by_outlets_clustered.dbf

✅ Kept essentials:
   point_to_ws_id.json
   watershed_centroids_demo.csv
   watershed_centroids_demo_with_ws_id.csv
   watersheds_filter

## End-to-end pipeline we built

**Clip inputs to Bhutan + buffer**

- Cropped HydroSHEDS DEM, DIR (D8 flow direction, ESRI codes) and ACC (flow accumulation, in cells) to longitude 87.0°–93.5°, latitude 25.0°–29.5° (EPSG:4326).
- Verified that the grids are perfectly aligned (same CRS, transform, width/height).

**Extract a stream network from ACC**

- ExtractStreams with threshold = 10,000 cells (≈ 81 km² at 90 m pixels).
- Produces streams.tif (binary).
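
The ≈ 81 km² figure is 10,000 cells × (90 m)². As a rough check, the sketch below converts a cell count to area two ways: with the nominal 90 m pixel, and with a spherical-Earth approximation at a latitude inside the study area. The function names, the 3 arc-second resolution, and the 27.5°N reference latitude are assumptions for illustration and are not part of the pipeline.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)

def cells_to_km2(n_cells, pixel_size_m=90.0):
    """Area covered by n_cells square pixels with a nominal side length in metres."""
    return n_cells * pixel_size_m ** 2 / 1e6

def cells_to_km2_at_lat(n_cells, lat_deg, res_arcsec=3.0):
    """Area covered by n_cells geographic pixels at a given latitude (spherical Earth)."""
    step_rad = math.radians(res_arcsec / 3600.0)
    height_km = EARTH_RADIUS_KM * step_rad                      # north-south extent
    width_km = height_km * math.cos(math.radians(lat_deg))      # east-west extent
    return n_cells * height_km * width_km

stream_km2 = cells_to_km2(10_000)                 # stream threshold
outlet_km2 = cells_to_km2(20_000)                 # outlet filter
stream_km2_lat = cells_to_km2_at_lat(10_000, 27.5)

print(f"nominal: {stream_km2:.0f} km² / {outlet_km2:.0f} km²")
print(f"latitude-adjusted stream threshold: {stream_km2_lat:.1f} km²")
```

At this latitude the pixels are narrower east-west than 90 m, so the same cell count covers somewhat less area than the nominal figure.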

**Find candidate outlet cells on the boundary**

- Selected boundary stream cells whose D8 arrow flows out of the clipped raster.
- Kept only those with ACC ≥ 20,000 cells (≈ 162 km²) to ignore tiny mouths.
- Merged nearby outlets within 4 pixels (8-connected, DSU clustering), keeping the max-ACC cell per cluster.
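
The merge step can be sketched with a small disjoint-set union (DSU). This is a minimal illustration, not the notebook's original implementation: it assumes "within 4 pixels" means a Chebyshev distance of at most 4 (consistent with 8-connectivity), uses a simple all-pairs loop that is only suitable for a modest number of outlets, and the function name and sample cells are hypothetical.

```python
def cluster_outlets(outlets, radius=4):
    """Merge outlet cells that lie within `radius` pixels of each other.

    outlets: list of (row, col, acc) tuples.
    Returns one (row, col, acc) tuple per cluster: the cell with the highest ACC.
    """
    parent = list(range(len(outlets)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Link every pair of outlets whose Chebyshev distance is <= radius.
    for i in range(len(outlets)):
        for j in range(i + 1, len(outlets)):
            d_row = abs(outlets[i][0] - outlets[j][0])
            d_col = abs(outlets[i][1] - outlets[j][1])
            if max(d_row, d_col) <= radius:
                union(i, j)

    # Keep the max-ACC cell of each cluster.
    best = {}
    for i, cell in enumerate(outlets):
        root = find(i)
        if root not in best or cell[2] > best[root][2]:
            best[root] = cell
    return sorted(best.values())

# Hypothetical outlet cells: three close together on the top edge, one far away.
demo_outlets = [(0, 10, 25_000), (0, 13, 90_000), (0, 16, 30_000), (50, 0, 40_000)]
merged = cluster_outlets(demo_outlets)
print(merged)
```

Note that clustering is transitive: the cells at columns 10 and 16 are 6 pixels apart, but both are within 4 pixels of column 13, so all three end up in one cluster.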

**Rasterize pour points**

- Wrote a 32-bit integer raster in which each retained outlet gets a unique ID.

**Delineate watersheds**

- Watershed (WhiteboxTools) with ESRI D8 pointer → watersheds_filtered_merged.tif, where every pixel stores its watershed ID.
- The chosen thresholds yield 47 catchments for the region.

**Vectorize + package**

- Polygonized to SHP, then wrote a clean GeoPackage: watersheds_filtered_merged.gpkg (layer: watersheds).
- Ensured the CRS is set and computed area_km2 using an equal-area CRS (EPSG:6933).
- Exported a friendly summary CSV: ws_id, area_km2.

**Demo points + mapping**

- Created watershed_centroids_demo.csv (one representative point inside each polygon).
- Sampled the watershed raster at those lon/lat → watershed_centroids_demo_with_ws_id.csv.
- Built a simple dict (lon, lat) → ws_id and saved point_to_ws_id.json.
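
Sampling a raster at a lon/lat boils down to finding the pixel that contains the point. The sketch below shows that arithmetic for a north-up grid, assuming the clipped extent stated above and a 3 arc-second resolution (1200 cells per degree). The function name and default values are assumptions for illustration; in the notebook itself rasterio does this from the dataset's transform (for example via `dataset.index(x, y)` or `dataset.sample(...)`).

```python
import math

def lonlat_to_rowcol(lon, lat, left=87.0, bottom=25.0, right=93.5, top=29.5,
                     cells_per_degree=1200):
    """Return (row, col) of the pixel containing (lon, lat), or None if outside.

    Row 0 is the northernmost row; column 0 is the westernmost column.
    """
    if not (left <= lon < right and bottom < lat <= top):
        return None
    col = math.floor((lon - left) * cells_per_degree)
    row = math.floor((top - lat) * cells_per_degree)
    return row, col

inside = lonlat_to_rowcol(90.2504, 27.4996)
outside = lonlat_to_rowcol(95.0, 27.0)
print(inside, outside)
```

Points that sit exactly on a pixel edge are sensitive to floating-point rounding, which is one reason to let the raster library do the lookup in real use.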

**Tidy up**

- The optional clean-up cell moves intermediate files to a dated _trash_* folder to keep the repo lean.

## Why we did this & what you can do now

**Goal:** model riverine flooding at the catchment level. “Same catchment” means that water flows to the same outlet following D8 directions.

**Outcome:** a reproducible partition of the Bhutan + buffer region into hydrologically meaningful units (ws_id), plus tooling to assign any point to its catchment and to compute basin areas.

**Use it for**

- Attach ws_id to weather stations, towns, assets → aggregate any point-level features by catchment (groupby ws_id).
- Do zonal statistics of gridded climate (e.g., ERA5) over each basin polygon (mean precip, max runoff, etc.).
- Build ML features at the basin level (area, upstream accumulation, climate summaries) for flood susceptibility / risk models.
- Map & QC: the GPKG opens nicely in QGIS/ArcGIS.

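## Quick “how to use” now

- You already have watershed_centroids_demo_with_ws_id.csv.
- For your own points (CSV with longitude, latitude columns): set points_csv in the sampling cell (In [22]) to your file and run it. It writes your_points_with_ws_id.csv next to the input, which you can then group or aggregate by ws_id.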
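
Once points carry a ws_id, per-catchment aggregation is a plain pandas groupby. The sketch below uses a tiny in-memory table; the station names, the precip_mm column, and all values are made up for illustration. In practice you would load your own *_with_ws_id.csv with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical points that already have a ws_id (None = outside the raster / NoData)
points = pd.DataFrame({
    "station": ["A", "B", "C", "D"],
    "ws_id": pd.array([29, 29, 33, None], dtype="Int64"),
    "precip_mm": [12.0, 18.0, 7.5, 3.0],
})

# Drop unmapped points, then summarise each catchment
per_basin = (
    points.dropna(subset=["ws_id"])
          .groupby("ws_id")
          .agg(n_points=("station", "size"), mean_precip_mm=("precip_mm", "mean"))
          .reset_index()
)
print(per_basin)
```

Dropping rows without a ws_id first keeps unmapped points from silently forming their own group or being lost without notice; count them separately if that matters for your analysis.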

## What each result file is (short guide)

**Essential**

- as_dir_Bhutan_and_buffer.tif: ESRI D8 flow directions (clipped).
- as_acc_Bhutan_and_buffer.tif: flow accumulation (number of upstream cells, clipped).
- streams.tif: extracted stream network at the 10k-cell threshold. The clean-up cell moves it to _trash_* unless KEEP_STREAMS_TIF = True.
- auto_pour_points_filtered_merged.tif: raster of selected outlet points (one ID per mouth). The clean-up cell moves it to _trash_* because it matches auto_pour_points*.*; restore it from there if you need it.
- watersheds_filtered_merged.tif: final watershed ID raster.
- watersheds_filtered_merged.gpkg (watersheds layer): polygons with ws_id and area_km2.
- watersheds_filtered_merged_summary.csv: ws_id, area_km2 table.
- watershed_centroids_demo.csv: one in-polygon representative point per basin.
- watershed_centroids_demo_with_ws_id.csv: those points with the sampled ws_id attached.
- point_to_ws_id.json: JSON dictionary mapping "lon,lat" string keys → ws_id (for quick lookups in apps).
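
The keys in point_to_ws_id.json are "lon,lat" strings written with full float precision, so an exact-match lookup fails as soon as a coordinate is rounded. A minimal sketch of a tolerant lookup follows; the helper names and the 0.01° tolerance are assumptions, and the two sample entries are copied from the mapping examples printed above. It only re-finds points that are already stored: for a new location, sample the watershed raster instead, because the nearest stored point may lie in a different catchment.

```python
import json
import math

def load_mapping(json_text):
    """Parse {"lon,lat": ws_id} JSON text into {(lon, lat): ws_id}."""
    raw = json.loads(json_text)
    return {tuple(float(x) for x in key.split(",")): ws_id for key, ws_id in raw.items()}

def lookup_ws_id(point_map, lon, lat, max_dist_deg=0.01):
    """ws_id of the closest stored point, or None if none lies within max_dist_deg."""
    best = min(point_map, key=lambda k: math.hypot(k[0] - lon, k[1] - lat), default=None)
    if best is None or math.hypot(best[0] - lon, best[1] - lat) > max_dist_deg:
        return None
    return point_map[best]

sample_json = (
    '{"87.01208333333332,29.48250000000001": 29, '
    '"90.25666666666666,28.65875000000001": 33}'
)
point_map = load_mapping(sample_json)

print(lookup_ws_id(point_map, 87.0121, 29.4825))   # rounded copy of a stored point
print(lookup_ws_id(point_map, 89.0, 27.0))          # far from every stored point
```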

**Optional / intermediate (can be trashed)**

- watersheds_by_outlets*.tif/.shp, auto_pour_points*.tif, catchments_id*.*, and the SHP sidecars of the final layer: intermediate artifacts we kept only for debugging.

## Notes & assumptions

- CRS is EPSG:4326 for rasters; we switch to EPSG:6933 only to compute accurate areas.
- Thresholds matter. You can loosen or tighten them to change basin granularity:
  - Streams: 10,000 cells (~81 km²).
  - Outlet filter: 20,000 cells (~162 km²).
  - Merge radius: 4 px (to avoid multiple mouths on the same confluence).
- Flow directions use ESRI D8 coding, so we passed --esri_pntr to WhiteboxTools.
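
To make the ESRI D8 coding concrete, the sketch below decodes each code into a (row, column) step and uses it to find cells whose flow leaves the grid, which is the idea behind the boundary outlet search. It is an illustration on plain nested lists (NumPy arrays index the same way), not the notebook's implementation; the function name and the toy grids are hypothetical. Because the 20,000-cell ACC filter is stricter than the 10,000-cell stream threshold, it implies "is a stream cell" here.

```python
# ESRI D8 codes → (row step, col step); rows increase southward, columns eastward
D8_ESRI = {
    1: (0, 1),    2: (1, 1),     4: (1, 0),    8: (1, -1),
    16: (0, -1),  32: (-1, -1),  64: (-1, 0),  128: (-1, 1),
}

def boundary_outlets(dir_grid, acc_grid, min_acc=20_000):
    """Return (row, col) of cells whose D8 arrow points outside the grid
    and whose flow accumulation is at least min_acc."""
    n_rows, n_cols = len(dir_grid), len(dir_grid[0])
    outlets = []
    for r in range(n_rows):
        for c in range(n_cols):
            step = D8_ESRI.get(int(dir_grid[r][c]))
            if step is None:          # NoData or any non-direction code: skip
                continue
            nr, nc = r + step[0], c + step[1]
            leaves_grid = not (0 <= nr < n_rows and 0 <= nc < n_cols)
            if leaves_grid and acc_grid[r][c] >= min_acc:
                outlets.append((r, c))
    return outlets

# Toy 3x3 example: (0, 0) flows north out of the grid but has tiny ACC;
# (1, 2) flows east out of the grid with large ACC.
toy_dir = [[64, 4, 4],
           [1, 1, 1],
           [64, 64, 64]]
toy_acc = [[100, 0, 0],
           [0, 0, 50_000],
           [0, 0, 0]]
found = boundary_outlets(toy_dir, toy_acc)
print(found)
```

Only edge cells can ever point outside the grid, so a production version would scan the four borders instead of every cell.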
