[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sreejakr/openforest4d-forest-metrics/blob/main/notebooks/differencing_script.ipynb)

# Lidar Processing Pipeline Notebook

This notebook provides a step‑by‑step implementation of our lidar metric differencing pipeline, including overview and VRT generation. Each code cell can be run interactively in Jupyter or Google Colab. Adjust the parameters in the **Configuration** cell at the end to point to your data and to select which metrics to difference and mosaic.

---

## Introduction

Suppose we have two timepoints (e.g., 2017 and 2019) of lidar‑derived raster metrics split into 1×1 km tiles. This pipeline:

1. Indexes tiles in each folder (dropping year tags from filenames).
2. Warps each pair of matching tiles to a common extent and resolution.
3. Computes raw differenced rasters and masks values outside the range −100 to +100 as NoData.
4. Builds overviews on all output TIFFs (original and differenced).
5. Creates VRT mosaics for user‑selected metrics (original or differenced).

All processing uses GDAL command‑line utilities (`gdalwarp`, `gdal_calc`, `gdaladdo`, `gdalbuildvrt`, `gdal_edit`).

---

# File Naming and Output Structure

This pipeline compares raster metric tiles (e.g. CHM, DTM, Canopy Cover) from two years and computes spatial differences. Input files and folders must follow a consistent structure and naming convention.


## File Naming Convention

Each raster tile filename must follow this pattern:

    X_Y_YEAR_METRIC.tif

Where:
- X_Y  represents the lower-left corner coordinates of the tile (for example, 640000_4310000)  
- YEAR is the year tag, such as 2012 or 2018  
- METRIC is the metric name in lowercase (chm, dtm, dsm, canopy_cover, density, rumple); it must equal the metric folder name without `_Tiles`, lowercased  

Examples of valid filenames:

    640000_4310000_2012_chm.tif  
    640000_4310000_2018_chm.tif  

The script uses these names to match spatially overlapping tiles across years.
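
As a minimal sketch of the naming convention, the snippet below splits a filename into its parts with a regular expression. The names `TILE_PATTERN`, `TileName` and `parse_tile_filename` are hypothetical helpers for illustration; the pipeline itself matches tiles by suffix rather than by regex.

```python
import re
from typing import NamedTuple

# Hypothetical pattern for X_Y_YEAR_METRIC.tif (not used by the pipeline itself).
TILE_PATTERN = re.compile(
    r"^(?P<x>\d+)_(?P<y>\d+)_(?P<year>\d{4})_(?P<metric>[a-z0-9_]+)\.tif$"
)


class TileName(NamedTuple):
    x: int        # lower-left X coordinate
    y: int        # lower-left Y coordinate
    year: str     # year tag, kept as a string like in the filenames
    metric: str   # lowercase metric name


def parse_tile_filename(filename: str) -> TileName:
    """Split 'X_Y_YEAR_METRIC.tif' into its parts, or raise ValueError."""
    match = TILE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Not of the form X_Y_YEAR_METRIC.tif: {filename}")
    return TileName(int(match["x"]), int(match["y"]), match["year"], match["metric"])


print(parse_tile_filename("640000_4310000_2012_chm.tif"))
```

A check like this can be run over a folder before processing to catch misnamed tiles early.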

## Output Files

For each metric, differenced tiles are saved under:

    Differences/<metric>_Diff/

Each output tile is named: X_Y_metric_diff.tif

Example: 640000_4310000_chm_diff.tif

## Virtual Raster (VRT) Files

The script generates VRT mosaics for each metric and its differenced version:

    extracted_2012_metrics/Metric_2012.vrt  
    extracted_2018_metrics/Metric_2018.vrt  
    Differences/Metric_Diff.vrt  

Loading a VRT in QGIS or another GIS tool displays the entire mosaic without physically merging all tiles.

## Directory Structure

The project directory must look like this:
```
CA_Placer_Co/
  extracted_2012_metrics/
    CHM_Tiles/
      656000_4370000_2012_chm.tif
    DTM_Tiles/
      656000_4370000_2012_dtm.tif
    DSM_Tiles/
      656000_4370000_2012_dsm.tif
    Canopy_Cover_Tiles/
      656000_4370000_2012_canopy_cover.tif
    Density_Tiles/
      656000_4370000_2012_density.tif
    Rumple_Tiles/
      656000_4370000_2012_rumple.tif

  extracted_2018_metrics/
    CHM_Tiles/
      656000_4370000_2018_chm.tif
    DTM_Tiles/
      656000_4370000_2018_dtm.tif
    DSM_Tiles/
      656000_4370000_2018_dsm.tif
    Canopy_Cover_Tiles/
      656000_4370000_2018_canopy_cover.tif
    Density_Tiles/
      656000_4370000_2018_density.tif
    Rumple_Tiles/
      656000_4370000_2018_rumple.tif
```
### Result 

The created directory looks like this:

```
  Differences/                     
    CHM_Diff/
      656000_4370000_chm_diff.tif
    DTM_Diff/
      656000_4370000_dtm_diff.tif
    DSM_Diff/
      656000_4370000_dsm_diff.tif
    Canopy_Cover_Diff/
      656000_4370000_canopy_cover_diff.tif
    Density_Diff/
      656000_4370000_density_diff.tif
    Rumple_Diff/
      656000_4370000_rumple_diff.tif
```
The VRTs of CHM, DSM, DTM, Rumple, Canopy Cover and Density for the original tiles are written to their respective original folders (extracted_2012_metrics, extracted_2018_metrics).
```
CA_Placer_Co/
  extracted_2012_metrics/
    Canopy_Cover_2012.vrt
    CHM_2012.vrt
    Density_2012.vrt
    DSM_2012.vrt
    DTM_2012.vrt
    Rumple_2012.vrt
    
CA_Placer_Co/
  extracted_2018_metrics/
    Canopy_Cover_2018.vrt
    CHM_2018.vrt
    Density_2018.vrt
    DSM_2018.vrt
    DTM_2018.vrt
    Rumple_2018.vrt

```
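
Before running the pipeline it can help to confirm that a year folder has the expected metric subfolders. The sketch below assumes the folder names shown in the directory listing; `missing_metric_folders` is a hypothetical helper, not part of the notebook, and the demo uses a throwaway temporary directory.

```python
import os
import tempfile

# Folder names taken from the directory layout in this notebook.
EXPECTED_METRIC_FOLDERS = (
    "CHM_Tiles", "DTM_Tiles", "DSM_Tiles",
    "Canopy_Cover_Tiles", "Density_Tiles", "Rumple_Tiles",
)


def missing_metric_folders(year_root, expected=EXPECTED_METRIC_FOLDERS):
    """Return the expected metric subfolders that do not exist under year_root."""
    return [
        name for name in expected
        if not os.path.isdir(os.path.join(year_root, name))
    ]


# Demo: a temporary "year" folder that only contains two metric subfolders.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "CHM_Tiles"))
    os.makedirs(os.path.join(root, "DTM_Tiles"))
    print(missing_metric_folders(root))
```

Missing folders are not an error for the pipeline, which skips metrics whose folders are absent, but listing them makes a partial run easier to interpret.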

## Current Execution Order in run()

1) *Differencing* (only when `compute_diff=True`): Computes tile-by-tile differences: 2018 - 2012
`Output folders: CHM_Diff/, DTM_Diff/, etc.`

2) *VRTs for Original, Hillshade & Differenced Metrics*: Builds VRTs for the original tiles, for any `*_hillshade.tif` tiles in the original folders (CHM_Tiles, etc.), and for the differenced tiles
`Original: CHM_2012.vrt, CHM_2018.vrt (if CHM_Tiles exists)
Hillshade: CHM_Hillshade_2012.vrt, CHM_Hillshade_2018.vrt, etc.
Differenced: CHM_Diff.vrt (if CHM_Diff folder exists)`

3) *Overviews*: Adds overviews to every .tif file in:
- folder1 (2012)
- folder2 (2018)
- output (`*_Diff` folders)

## Summary Table

 Element                   | Format / Location  
---------------------------|---------------------------------------------  
 Input tile name           | X_Y_YEAR_METRIC.tif (e.g. 640000_4310000_2012_chm.tif)  
 Output difference file    | X_Y_metric_diff.tif in Differences/<metric>_Diff/  
 VRT for year A            | extracted_2012_metrics/Metric_2012.vrt  
 VRT for year B            | extracted_2018_metrics/Metric_2018.vrt  
 VRT for differences       | Differences/Metric_Diff.vrt  
 NoData value (diff tiles) | –9999  

## Library Imports

In [30]:
import os
import glob
import subprocess
import traceback

In [31]:
import sys

# Connect to local google drive when opened in colab
if "google.colab" in sys.modules:
    from google.colab import drive
    drive.mount("/gdrive/")

## Helper Functions

### Indexing Tiles

To difference CHM tiles such as 500000_4100000_2017_chm.tif against their counterparts from another year (e.g. 500000_4100000_2019_chm.tif), the pipeline first builds a lookup table that maps each tile's base name (its `X_Y` coordinates) to the full file path for a given year and metric. Pairing then reduces to looking up the same key in both tables.

- folder: Path to the directory where TIFF tiles are stored.
- year_tag: Year label embedded in the filenames (for example, "2017").
- suffix_key: Metric identifier used in filenames (for example, "chm", "dtm", "rumple", etc.).
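
The pairing idea can be sketched with two such lookup tables and a key intersection. This is a simplified stand-in, assuming the naming convention from this notebook: `index_by_base` and `pair_tiles` are hypothetical names, the hillshade exclusion of the real `index_files` is omitted, and the demo uses empty placeholder files in temporary folders.

```python
import os
import tempfile


def index_by_base(folder, year_tag, metric):
    """Map tile base names ('X_Y') to file paths for one year and metric."""
    suffix = f"_{year_tag}_{metric}.tif"
    return {
        fn[: -len(suffix)]: os.path.join(folder, fn)
        for fn in os.listdir(folder)
        if fn.endswith(suffix)
    }


def pair_tiles(index_a, index_b):
    """Return (base, path_a, path_b) for every tile present in both indexes."""
    common = sorted(index_a.keys() & index_b.keys())
    return [(base, index_a[base], index_b[base]) for base in common]


# Demo: two tiles in the first year, only one of them in the second year.
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    for name in ("500000_4100000_2017_chm.tif", "501000_4100000_2017_chm.tif"):
        open(os.path.join(dir_a, name), "w").close()
    open(os.path.join(dir_b, "500000_4100000_2019_chm.tif"), "w").close()

    pairs = pair_tiles(
        index_by_base(dir_a, "2017", "chm"),
        index_by_base(dir_b, "2019", "chm"),
    )
    print([base for base, _, _ in pairs])  # ['500000_4100000']
```

Tiles without a counterpart in the other year simply drop out of the intersection, which corresponds to the "skipped" tiles in the differencing step.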


In [32]:
# This function scans a directory of raster tiles for a given metric/year combination,
# strips the common suffix (_<year>_<metric>.tif) to derive the tile 'base' name,
# and returns a dict mapping that base name to the full path. This index allows
# us to efficiently look up matching tiles between two timepoints by the same base name.

def index_files(folder, year_tag, suffix_key):
    """
    Build a mapping from tile base names to their file paths.

    This helps match up tiles between two years for differencing.

    Args:
        folder (str): Directory containing tiled TIFFs.
        year_tag (str): Year string embedded in filenames (e.g. '2017').
        suffix_key (str): Metric key in filenames (e.g. 'chm' for canopy height).

    Returns:
        dict: Keys are tile base names (e.g. '500000_4100000'),
              values are the full file paths to the matching .tif.
    """
    index = {}
    suffix = f"_{year_tag}_{suffix_key}.tif"
    for fn in os.listdir(folder):
        if fn.endswith(suffix) and "hillshade" not in fn.lower():
            base = fn[: -len(suffix)]  # strip only the trailing suffix
            index[base] = os.path.join(folder, fn)
    return index

### Parsing Coordinates & Extents

Interpret tile filenames that encode spatial coordinates and derive exact bounding boxes (with optional margins) for use in warping, clipping, or other spatial operations.

Given a tile named like 500000_4100000, these functions extract its lower‐left corner coordinates and compute the full extent, including a configurable buffer. This ensures that adjacent processing overlaps or margins are respected.

Buffer size and tile dimensions can be adjusted to suit tile overlaps, edge‐effect mitigation, or processing margins.
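
As a worked example of the extent arithmetic, the sketch below uses the notebook's default tile size (1000 × 1000) and buffer (20). `buffered_extent` is a hypothetical name that takes the already parsed corner coordinates.

```python
def buffered_extent(x_min, y_min, tile_size=(1000, 1000), buffer=20):
    """Return [xmin, ymin, xmax, ymax] of a tile grown by `buffer` on each side."""
    return [
        x_min - buffer,
        y_min - buffer,
        x_min + tile_size[0] + buffer,
        y_min + tile_size[1] + buffer,
    ]


extent = buffered_extent(500000, 4100000)
print(extent)                  # [499980, 4099980, 501020, 4101020]
print(extent[2] - extent[0])   # 1040: tile width plus a buffer on both sides
```

With a 20-unit buffer, each warped tile therefore covers 1040 × 1040 map units, so neighbouring tiles overlap by 40 units along shared edges.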

In [33]:
# These functions derive spatial extents from a tile's filename, which encodes the lower-left
# corner as 'X_Y'. We use that to compute the exact bounding box for each tile, adding an
# optional buffer so that any warp or calculation includes a margin around the tile edges.

def parse_coords_from_filename(tile_name):
    """
    Extract integer X and Y from a tile_name string of format 'X_Y'.

    Args:
        tile_name (str): Name like '500000_4100000'.
    Returns:
        tuple[int, int]: (x_min, y_min) coordinates for the bottom-left corner.
    Raises:
        ValueError: if the tile_name cannot be split into two integers.
    """
    parts = tile_name.split("_")
    if len(parts) < 2:
        raise ValueError(f"Invalid tile name: {tile_name}")
    return int(parts[0]), int(parts[1])


def compute_extent_from_tile(tile_name, tile_size, buffer):
    """
    Compute the [xmin,ymin,xmax,ymax] of a tile plus buffer.

    Args:
        tile_name (str): Base name encoding 'X_Y'.
        tile_size (tuple): (width, height) in same units as coordinates.
        buffer (float): Extra distance to extend each side.
    Returns:
        list[float]: [xmin, ymin, xmax, ymax] expanded by buffer.
    """
    x0, y0 = parse_coords_from_filename(tile_name)
    return [
        x0 - buffer,
        y0 - buffer,
        x0 + tile_size[0] + buffer,
        y0 + tile_size[1] + buffer
    ]

### Warping Rasters

If the two rasters aren't sitting on exactly the same grid, your diff maps will end up with ugly stripes or shifted edges. So before we do any subtraction, both years get warped into the same extent and resolution.

The basics:

* We tell gdalwarp exactly which bounding box to use (-te xmin ymin xmax ymax) so both tiles cover the exact same footprint.
* We lock in a pixel size (-tr xres yres) so one dataset doesn't get resampled at a slightly different spacing.
* -srcnodata and -dstnodata make sure that missing pixels (NaN) stay missing.
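* For continuous metrics such as CHM, DTM and DSM, -r bilinear usually looks better than nearest-neighbor. Note that the `warp_if_needed` function below uses -r near, which leaves the original pixel values uninterpolated.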

Example command from my Kaibab run:


```bash
gdalwarp -overwrite -of GTiff \
  -te 640000 4310000 641000 4311000 \
  -tr 1 1 \
  -r bilinear \
  -srcnodata nan -dstnodata nan \
  -co TILED=YES -co COMPRESS=LZW \
  input_2012_chm.tif warped_2012_chm.tif
```

Once this step is done, both rasters are on the same grid, and subtraction works without those checkerboard artifacts.
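
To inspect a warp command before running it, the argument list can be assembled and printed without calling GDAL. This is a dry-run sketch: `build_warp_command` and its `resampling` parameter are hypothetical, and the file names are placeholders. The flags are the same ones used in the command shown in this section.

```python
import shlex


def build_warp_command(src, dest, extent, resolution=None, resampling="bilinear"):
    """Assemble a gdalwarp argument list for a fixed extent (nothing is executed)."""
    cmd = [
        "gdalwarp", "-overwrite", "-of", "GTiff",
        "-te", *map(str, extent),          # xmin ymin xmax ymax
        "-r", resampling,
        "-srcnodata", "nan", "-dstnodata", "nan",
        "-co", "TILED=YES", "-co", "COMPRESS=LZW",
    ]
    if resolution is not None:
        # gdalwarp expects positive pixel sizes, hence abs() on the y resolution
        cmd += ["-tr", str(resolution[0]), str(abs(resolution[1]))]
    return cmd + [src, dest]


cmd = build_warp_command(
    "input_2012_chm.tif", "warped_2012_chm.tif",
    extent=[640000, 4310000, 641000, 4311000],
    resolution=(1, -1),
)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, provided `gdalwarp` is on the PATH.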

In [34]:
# Leverages gdalwarp to crop or pad a source raster to the target extent/resolution.
# Ensures both timepoints share identical grids for differencing.

def warp_if_needed(src, extent, resolution, dest):
    """
    Crop and resample a raster to a fixed geographic window using gdalwarp.

    Args:
        src (str): Path to the input TIFF.
        extent (list[float]): [xmin, ymin, xmax, ymax] target window.
        resolution (tuple[float, float] or None): (x_res, y_res) in same units;
            if None, native resolution is retained.
        dest (str): Path for the output warped TIFF.

    Returns:
        str: Path to the warped output (same as `dest`).
    """
    cmd = [
        "gdalwarp", "-overwrite", "-of", "GTiff",
        "-te", *map(str, extent),
        "-r", "near",
        "-srcnodata", "nan", "-dstnodata", "nan",
        "-co", "TILED=YES", "-co", "COMPRESS=LZW"
    ]
    if resolution:
        # Absolute value for y-res ensures positive spacing
        cmd += ["-tr", str(resolution[0]), str(abs(resolution[1]))]
    cmd += [src, dest]

    subprocess.run(cmd, check=True)
    return dest


### Computing Differences

Performs pixel‑wise raster differencing between two timepoints (e.g., 2017 and 2019) over a set of geospatial tiles.  
The function computes the raw difference A − B, where A is the later year and B the earlier one, and masks out values outside [−100, 100] to reduce noise and artifacts.

#### Processing Steps

1. **Tile matching**  
   Finds matching tiles in both folders based on their `X_Y` basename.

2. **Extent calculation**  
   Computes a buffered bounding box for each tile to reduce edge effects.

3. **Warping**  
   Resamples both rasters to a shared grid using `gdalwarp`:

```bash
gdalwarp \
     -te xmin ymin xmax ymax \
     -tr xres yres  \
     -srcnodata nan -dstnodata nan \
     -r near \
     -co TILED=YES -co COMPRESS=LZW \
     input.tif warped.tif
```     
4. **Differencing**
Computes the raw difference (A − B):

```bash
# A is the later year and B the earlier year; the result is A minus B per pixel.
# Float32 allows negative and fractional values; --overwrite replaces an
# existing tile_raw.tif.
gdal_calc \
  -A year2.tif \
  -B year1.tif \
  --calc="A-B" \
  --outfile=tile_raw.tif \
  --type=Float32 \
  --overwrite
```

5. **Filtering**
Masks out values outside the range [−100, 100], setting them to NoData (-9999):

```bash
# Keep values between -100 and +100; everything else becomes -9999, which
# --NoDataValue marks as NoData in the output. The --co options apply internal
# tiling and LZW compression.
gdal_calc \
  -A tile_raw.tif \
  --calc="where((A>=-100)&(A<=100),A,-9999)" \
  --NoDataValue=-9999 \
  --type=Float32 \
  --co TILED=YES --co COMPRESS=LZW \
  --overwrite \
  --outfile=tile_filtered.tif
```

Explicitly set the NoData flag with:
`gdal_edit -a_nodata -9999 tile_filtered.tif`

**Important:** Initially, when the NoData flag wasn't set to -9999, QGIS rendered the plots incorrectly. The tiles appeared blank or fully transparent, even though data was present. Explicitly defining NoData ensures proper visualization and color scaling in QGIS or any GIS tool.
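
The per-pixel rule behind the two `gdal_calc` steps can be sketched in plain Python on a handful of values. This only illustrates the arithmetic: `filtered_difference` is a hypothetical name, and in the real pipeline `gdal_calc` operates on whole rasters and applies its own NoData handling.

```python
import math

NODATA = -9999.0


def filtered_difference(later, earlier, limit=100.0, nodata=NODATA):
    """Compute later - earlier per value; out-of-range or NaN results become nodata."""
    result = []
    for a, b in zip(later, earlier):
        diff = a - b
        # NaN fails both comparisons, so missing inputs also end up as nodata.
        result.append(diff if -limit <= diff <= limit else nodata)
    return result


later = [12.5, 30.0, 250.0, math.nan]
earlier = [10.0, 35.0, 20.0, 5.0]
print(filtered_difference(later, earlier))  # [2.5, -5.0, -9999.0, -9999.0]
```

The third value (a change of 230) is rejected as implausible, and the fourth is rejected because one input is missing.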

In [35]:
# Runs two-step differencing: (A - B) then masks out values beyond ±100 to filter spurious changes.

def compute_difference(
    folder1, tag1, folder2, tag2,
    output_dir, tile_size, buffer, resolution, suffix
):
    """
    Perform two‑step raster differencing (A – B) with value filtering.

    This routine iterates over matching tiles in two directories (e.g. two years' outputs),
    warps each to a common grid, computes the raw difference, then masks spurious values.

    Sub‑functions used:
      - index_files(dir, tag, suffix)  
        Builds a dict mapping tile bases ("X_Y") to TIFF paths for the given year tag and metric suffix.

      - compute_extent_from_tile(base, tile_size, buffer)  
        Parses "X_Y" into coordinates and returns [xmin, ymin, xmax, ymax] expanded by the buffer.

      - warp_if_needed(src, extent, resolution, dest)  
        Invokes gdalwarp to crop/resample 'src' to the specified 'extent' and 'resolution',
        preserving NaN nodata, using nearest‑neighbor, and producing a tiled, compressed TIFF.

    Args:
        folder1 (str): Directory containing the first set of TIFFs (the earlier year, B in the calculation).  
        tag1 (str): Year tag embedded in folder1 filenames (e.g. 2017); also used to name temp files.  
        folder2 (str): Directory containing the second set of TIFFs (the later year, A in the calculation).  
        tag2 (str): Year tag embedded in folder2 filenames (e.g. 2019); also used to name temp files.  
        output_dir (str): Target directory for final differenced TIFFs.  
        tile_size (tuple[int,int]): Width and height of each tile in coordinate units.  
        buffer (float): Margin added around each tile extent to avoid edge artifacts.  
        resolution (tuple[float,float] or None): Desired (x_res, y_res); native if None.  
        suffix (str): Metric key (e.g. chm, dtm) used in filenames.  
    """
    os.makedirs(output_dir, exist_ok=True)
    
    idx1 = index_files(folder1, tag1, suffix)
    idx2 = index_files(folder2, tag2, suffix)
    processed, skipped = 0, 0
    
    for base in sorted(idx1):
        if base not in idx2:
            # print(f"Skipping {base}: no match")
            skipped += 1
            continue
        path1 = idx1[base]
        path2 = idx2[base]
        out = os.path.join(output_dir, f"{base}_{suffix}_diff.tif")
        tmp1 = out.replace('.tif', f"_tmp_{tag1}.tif")
        tmp2 = out.replace('.tif', f"_tmp_{tag2}.tif")
        try:
            ext = compute_extent_from_tile(base, tile_size, buffer)
            warp_if_needed(path1, ext, resolution, tmp1)
            warp_if_needed(path2, ext, resolution, tmp2)
            raw = out.replace('.tif', '_raw.tif')
            flt = out.replace('.tif', '_filtered.tif')
            
            # Step 1: raw difference A - B
            subprocess.run([
                'gdal_calc','-A',tmp2,'-B',tmp1,
                '--calc=A-B','--outfile',raw,
                '--type','Float32','--overwrite'
            ], check=True)
            
            # Step 2: mask values outside [-100,100]
            subprocess.run([
                'gdal_calc','-A',raw,
                '--calc','where((A>=-100)&(A<=100),A,-9999)',
                '--NoDataValue','-9999','--type','Float32',
                '--co','TILED=YES','--co','COMPRESS=LZW',
                '--overwrite','--outfile',flt
            ], check=True)
            
            os.replace(flt, out)
            subprocess.run(['gdal_edit','-a_nodata','-9999',out], check=True)
            os.remove(raw)
            os.remove(tmp1)
            os.remove(tmp2)
            # print(f"Processed {base}")
            processed += 1
            
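        except Exception as e:
            print(f"Error on {base}: {e}")
            traceback.print_exc()
            skipped += 1

    # Report the counters so each run shows how many tiles were differenced.
    print(f"{suffix}: processed {processed}, skipped {skipped}")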

### Overviews & VRTs
The following Python functions automate post‑processing of GeoTIFF outputs, adding internal overviews for fast display and building VRT mosaics for seamless GIS loading. 

Overviews are reduced‐resolution "pyramids" that live inside a GeoTIFF. When a GIS application (such as QGIS) displays a large raster at small scales (zoomed out), it can load these lower‐resolution layers instead of the full‑resolution data. This dramatically improves rendering speed and interactivity without creating separate .ovr files, because the pyramids are embedded directly in the TIFF.

A VRT is an XML index file that points to multiple source rasters and presents them as one seamless dataset. Instead of physically merging tiles into a large GeoTIFF (which duplicates data and consumes disk space), a VRT lists the locations of each tile on disk. When loaded in GIS, the VRT behaves like a single raster layer, enabling easy panning and zooming across many tiles without additional storage overhead. 

### Function: `add_overviews`

This function adds *overview pyramids* (lower-resolution previews) inside each GeoTIFF so that large rasters load much faster in GIS tools like QGIS. These pyramids help with smooth zooming and panning at different scales.

#### **Inputs**

* `base`: The main folder where all the TIFFs are stored (it checks inside subfolders too).
* `levels`: A list of how much to shrink the image for each overview. For example, `[2, 4, 8, 16]` creates 1/2, 1/4, 1/8, and 1/16 scale versions.

#### **How it works**

* It walks through all subfolders in `base`, looking for `.tif` files (ignoring `.vrt` files).
* For each TIFF, it builds a command like this:

  ```python
  ['gdaladdo', '-r', 'average', path, '2', '4', '8', '16']
  ```

  * `gdaladdo`: GDAL's tool to add overviews.
  * `-r average`: Uses the average of neighboring pixels when downsampling.
  * `path`: The full path to the raster.
  * The numbers specify the zoom levels to build.

* It then runs that command using Python's `subprocess` module. If something goes wrong, it'll raise an error.

#### Why it's useful

Without overviews, large rasters can be slow to open or navigate. With them, the GIS can draw a reduced-resolution preview almost immediately and stay responsive while zooming and panning, even with very large datasets.
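
To make the `levels` argument concrete, the sketch below builds the `gdaladdo` argument list and estimates the pixel dimensions of each overview. Both helper names are hypothetical, the 1040 × 1040 input size assumes a 1 m grid with the default tile size and buffer, and the sizes assume overview dimensions are rounded up.

```python
def overview_command(path, levels, resampling="average"):
    """Assemble the gdaladdo argument list for one TIFF (nothing is executed)."""
    return ["gdaladdo", "-r", resampling, path, *map(str, levels)]


def overview_sizes(width, height, levels):
    """Estimate (width, height) of each overview, rounding up to whole pixels."""
    return [(-(-width // f), -(-height // f)) for f in levels]


levels = [2, 4, 8, 16]
print(overview_command("tile_chm_diff.tif", levels))
print(overview_sizes(1040, 1040, levels))
# [(520, 520), (260, 260), (130, 130), (65, 65)]
```

Each additional level is a quarter the size of the previous one, so the full pyramid adds roughly one third to the uncompressed size of the base image.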

---

 
### Function: `build_vrt`

 Create a Virtual Raster (VRT) that stitches individual TIFF tiles into one seamless layer without data duplication. VRTs enable single‐file loading in GIS while preserving tile metadata and georeferencing.

1. **Arguments**

   * `parent_dir`: Top‐level folder containing the tile subfolder.
   * `subfolder`: Name of the directory with `.tif` files (e.g., `Placer_2012_Tiled` or `CHM_Diff`).
   * `prefix`: Base name for the output VRT (e.g., `CHM`).
   * `tag`: Timepoint or variant tag appended to the VRT name (e.g., `2012`, `Diff`).
   * `nodata`: Numeric code representing empty pixels in differenced rasters (default `-9999`).

2. **Workflow**

   * **Collect TIFF paths**:

     ```python
     tifs = sorted([os.path.join(src, f) for f in os.listdir(src) if f.lower().endswith('.tif')])
     ```

     Builds a sorted list of all GeoTIFF files in `parent_dir/subfolder`.

   * **Filelist creation**:
     Writes each TIFF path (with forward slashes) to a text file named `<prefix>_<tag>_filelist.txt`. GDAL uses this list to know which files to mosaic.

   * **Determine VRT path**:
     `<parent_dir>/<prefix>_<tag>.vrt`

   * **Command variants**:

     **a. Differenced rasters (`subfolder` ends with `_Diff`):**

     ```bash
     gdalbuildvrt \
       -srcnodata -9999 \
       -vrtnodata -9999 \
       -hidenodata \
       -input_file_list CHM_Diff_filelist.txt \
       CHM_Diff.vrt
     ```

     * `-srcnodata -9999`: Treats pixels with value `-9999` in source TIFFs as NoData.
     * `-vrtnodata -9999`: Marks these pixels as NoData in the VRT itself.
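     * `-hidenodata`: Makes the VRT bands not report a NoData value, even though `-9999` is still used to fill areas without source data. This is useful for controlling the background value of the mosaic.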

     **b. Original metrics (no special NoData handling):**

     ```bash
     gdalbuildvrt \
       -input_file_list CHM_2012_filelist.txt \
       CHM_2012.vrt
     ```

     * Simpler invocation: no NoData override is passed, because the original tiles use NaN as their NoData value.
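
The workflow above can be sketched as two small steps: write the file list with forward slashes, then assemble one of the two command variants. `write_filelist` and `vrt_command` are hypothetical names, the tile paths are placeholders, and nothing is executed.

```python
import os
import tempfile


def write_filelist(tif_paths, filelist_path):
    """Write one path per line, sorted, with backslashes normalised to '/'."""
    with open(filelist_path, "w") as fh:
        for path in sorted(tif_paths):
            fh.write(path.replace("\\", "/") + "\n")
    return filelist_path


def vrt_command(filelist_path, vrt_path, nodata=None):
    """Assemble the gdalbuildvrt argument list; NoData flags only when requested."""
    cmd = ["gdalbuildvrt"]
    if nodata is not None:
        cmd += ["-srcnodata", str(nodata), "-vrtnodata", str(nodata), "-hidenodata"]
    return cmd + ["-input_file_list", filelist_path, vrt_path]


with tempfile.TemporaryDirectory() as tmp:
    filelist = write_filelist(
        [r"C:\tiles\656000_4370000_chm_diff.tif", r"C:\tiles\655000_4370000_chm_diff.tif"],
        os.path.join(tmp, "CHM_Diff_filelist.txt"),
    )
    with open(filelist) as fh:
        print(fh.read())
    print(vrt_command(filelist, os.path.join(tmp, "CHM_Diff.vrt"), nodata=-9999))
```

Calling `vrt_command` without `nodata` yields the simpler variant used for the original metrics.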

---


These helper functions finalize the pipeline by producing GIS‑optimized files: internal pyramids for fast display and virtual mosaics for seamless data access.


In [36]:
# After differencing, we want to speed up GIS display and mosaic creation.
# "Overviews" are internal, reduced-resolution pyramids inside each TIFF. They let viewers
# (e.g. QGIS) draw large rasters quickly when zoomed out.

# The pipeline used to regenerate overviews on every run. This check skips
# TIFFs that already have them, which makes re-runs faster.
def has_overviews(tif_path):
    """
    Returns True if overviews already exist for the TIFF.
    """
    try:
        output = subprocess.check_output(['gdalinfo', tif_path], stderr=subprocess.DEVNULL)
        return b'Overviews' in output
    except (subprocess.CalledProcessError, OSError):
        # gdalinfo failed or is not installed: treat the file as having no overviews
        return False

def add_overviews(base, levels):
    """
    Add internal overviews to all .tif files under `base`.

    Args:
        base (str): Root directory.
        levels (list[int]): Overview levels (e.g., [2, 4, 8, 16]).
    """
    total_added = 0
    total_skipped = 0

    print(" Adding overviews...\n")

    for root, _, files in os.walk(base):
        added, skipped = 0, 0
        tif_files = [f for f in files if f.lower().endswith('.tif') and 'vrt' not in f]

        for fn in tif_files:
            path = os.path.join(root, fn)
            if has_overviews(path):
                skipped += 1
                continue
            try:
                cmd = ['gdaladdo', '-r', 'average', path] + list(map(str, levels))
                subprocess.run(cmd, check=True)
                added += 1
            except subprocess.CalledProcessError as e:
                print(f" Error adding overviews to: {path}\n{e}")

        if added > 0 or skipped > 0:
            print(f" {os.path.abspath(root)} — Overviews added: {added}, Skipped: {skipped}")




# A VRT is a "virtual" mosaic: a lightweight XML that points to multiple tiles
# so they appear as one continuous dataset. We build separate VRTs for
# each metric or differenced metric, for easy loading in GIS.

def build_vrt(parent_dir, subfolder, prefix, tag, nodata='-9999', suffix_filter='.tif'):
    """
    Generate a VRT file from all TIFFs in a tile folder.

    Args:
        parent_dir (str): Directory containing the subfolder of TIFFs.
        subfolder (str): Name of the folder with .tif files.
        prefix (str): Prefix for the VRT filename (e.g. 'CHM').
        tag (str): Timepoint tag, appended to the VRT name (e.g. '2019').
        nodata (str): NoData value to apply in the mosaic.
        suffix_filter (str): Optional filter for filenames (default '.tif').
                             For example, '_hillshade.tif' for hillshade mosaics.

    Behavior:
        - Lists all matching .tif files under parent_dir/subfolder,
          writes them to a text file, then calls gdalbuildvrt
          to create prefix_tag.vrt in parent_dir.
        - Uses a simpler command for original metrics, and
          includes nodata flags for differenced tiles.
    """
    src = os.path.join(parent_dir, subfolder)
    tifs = []
    
    # Go through each file in the folder
    for filename in os.listdir(src):
        # Check if the file ends with the desired suffix (e.g., ".tif" or "_hillshade.tif")
        if filename.lower().endswith(suffix_filter):
            # Build the full path and add to the list
            full_path = os.path.join(src, filename)
            tifs.append(full_path)
    
    # Sort the list of file paths alphabetically
    tifs.sort()
    

    if not tifs:
        print(f"No matching TIFFs found in {subfolder} for suffix '{suffix_filter}', skipping VRT.")
        return

    filelist = os.path.join(parent_dir, f"{prefix}_{tag}_filelist.txt")
    
    # Windows paths use backslashes. GDAL and most command-line tools (like gdalbuildvrt) expect Unix-style forward slashes.
    with open(filelist, 'w') as fh:
        for p in tifs:
            fh.write(p.replace('\\', '/') + '\n')  # Normalize slashes for GDAL

    vrt_path = os.path.join(parent_dir, f"{prefix}_{tag}.vrt")

    # Diff tiles need explicit NoData handling (-9999). The original tiles use NaN as NoData.
    if subfolder.endswith('_Diff'):
        cmd = [
            'gdalbuildvrt',
            '-srcnodata', nodata,
            '-vrtnodata', nodata,
            '-hidenodata',
            '-input_file_list', filelist,
            vrt_path
        ]
    else:
        cmd = [
            'gdalbuildvrt',
            '-input_file_list', filelist,
            vrt_path
        ]

    subprocess.run(cmd, check=True)
    print(f"Built VRT: {vrt_path}")



## Full Pipeline Function

In [37]:
def build_all_vrts(
    folder1, folder2,
    tag1, tag2,
    metrics,
    output
):
    """
    Build three kinds of mosaics (VRTs) so QGIS can open everything fast without
    physically merging tiles:

    1) Original metric VRTs (per year, per metric)
       e.g., CHM_2012.vrt and CHM_2018.vrt built from the tiles inside
       <folder>/CHM_Tiles/*.tif

    2) Hillshade VRTs (per year, per metric), but only if *_hillshade.tif exist
       in the metric folders. This lets you layer quick shaded relief backdrops.

    3) Difference VRTs (one per metric)
       e.g., Differences/CHM_Diff.vrt built from the tiles in
       <output>/CHM_Diff/*.tif

    Why bother?
    - VRTs are tiny XML "indexes" that point at the tiles. You get the feel of a
      single raster without the disk cost of a giant merge.
    - We build both originals and diffs so you can toggle between "state" and
      "change" in one project.
    """
    
    # original tiles
    for m in metrics:
        base = m.replace('_Tiles','')
        for folder, tag in ((folder1, tag1), (folder2, tag2)):
            src = os.path.join(folder, m)
            if os.path.isdir(src):
                build_vrt(folder, m, base, tag, suffix_filter=f"_{base.lower()}.tif")

    # hillshade tiles
    for m in metrics:
        base = m.replace('_Tiles','') + "_Hillshade"
        for folder, tag in ((folder1, tag1), (folder2, tag2)):
            src = os.path.join(folder, m)
            if os.path.isdir(src):
                build_vrt(folder, m, base, tag, suffix_filter="_hillshade.tif")

    # differenced folders
    for m in metrics:
        diff_folder = m.replace('_Tiles','_Diff')
        src = os.path.join(output, diff_folder)
        if os.path.isdir(src):
            prefix = diff_folder.replace('_Diff','')
            build_vrt(output, diff_folder, prefix, 'Diff', suffix_filter=".tif")  # e.g. CHM_Diff.vrt


def run(
    folder1, folder2,
    tag1, tag2,
    output,
    metrics=None,
    compute_diff=False,           
    tile_size=(1000,1000),
    buffer=20,
    resolution=None,
    overviews=(2, 4, 8, 16, 32)   # tuple avoids a mutable default argument
):
    """
    Orchestrates the whole workflow in three passes:

    1) (Optional) Compute differences
       - For each metric present in both years, warp both tiles to a shared grid,
         subtract later - earlier, mask out extreme values (±100), and write a
         clean *_diff.tif to <output>/<metric>_Diff/.
       - Toggle with compute_diff=True.

    2) Build all VRTs
       - Original per-year mosaics (for each metric)
       - Hillshade mosaics (if hillshade tiles exist)
       - Difference mosaics (for each metric that has *_Diff tiles)

    3) Add internal overviews to every .tif folder we touched
       - Embedded pyramids make QGIS/ArcGIS smooth when zooming out.
       - We only add pyramids to TIFFs (not to VRTs), and we skip ones that
         already have them to keep re-runs fast.

    Parameters you'll actually tweak:
    - folder1, folder2: roots for the two timepoints (each contains metric subfolders)
    - tag1, tag2: year tags embedded in filenames (used for matching)
    - output: where all *_Diff tiles and Diff VRTs live
    - metrics: list of metric subfolder names (e.g., 'CHM_Tiles', 'DTM_Tiles', …)
    - tile_size, buffer, resolution: control how we "frame" and resample tiles
    - overviews: pyramid factors (2x, 4x, 8x, …)

    """
    
    if metrics is None:
        metrics = [
            'CHM_Tiles','DTM_Tiles','DSM_Tiles',
            'Canopy_Cover_Tiles','Density_Tiles','Rumple_Tiles'
        ]

    # 1) DIFFERENCE (only if requested)
    if compute_diff:
        for m in metrics:
            d1 = os.path.join(folder1, m)
            d2 = os.path.join(folder2, m)
            if os.path.isdir(d1) and os.path.isdir(d2):
                suffix = m.replace('_Tiles', '').lower() 
                compute_difference(
                    d1, tag1, d2, tag2,
                    os.path.join(output, m.replace('_Tiles','_Diff')),
                    tile_size, buffer, resolution,
                    suffix
                )
    # 2) BUILD VRTs
    build_all_vrts(folder1, folder2, tag1, tag2, metrics, output)

    # 3) ADD OVERVIEWS
    all_folders = []
    for root in (folder1, folder2):
        for m in metrics:
            path = os.path.join(root, m)
            if os.path.isdir(path):
                all_folders.append(path)
    for m in metrics:
        diff_dir = os.path.join(output, m.replace('_Tiles','_Diff'))
        if os.path.isdir(diff_dir):
            all_folders.append(diff_dir)

    for folder in set(all_folders):
        add_overviews(folder, overviews)
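
To see which mosaics a run can produce, the naming rules used by `build_all_vrts` can be replayed without touching any files. This is a sketch: `planned_vrt_names` is a hypothetical helper, the difference VRT name follows the Summary Table (`<Metric>_Diff.vrt`), and each file is only actually built when matching tiles exist.

```python
def planned_vrt_names(metrics, tag1, tag2):
    """List the VRT filenames implied by the metric folder names and year tags."""
    names = []
    for folder_name in metrics:
        prefix = folder_name.replace("_Tiles", "")   # 'CHM_Tiles' -> 'CHM'
        for tag in (tag1, tag2):
            names.append(f"{prefix}_{tag}.vrt")            # original metric mosaic
            names.append(f"{prefix}_Hillshade_{tag}.vrt")  # only if hillshade tiles exist
        names.append(f"{prefix}_Diff.vrt")                 # only if <prefix>_Diff/ exists
    return names


for name in planned_vrt_names(["CHM_Tiles", "DTM_Tiles"], "2012", "2018"):
    print(name)
```

Comparing this list with the files actually present after a run is a quick way to spot metrics that were skipped.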



## Configuration & Execution

### Pipeline Invocation and Configuration

This is the bit that actually kicks off the whole differencing + VRT build process. Point it at two sets of metric folders (e.g., 2012 and 2018), tell it where to write outputs, and it'll handle the rest.

**What you need to set:**

* **folder1\_base** – earlier year's metric folders (e.g., `…/extracted_2012_metrics`).
* **folder2\_base** – later year's metric folders (e.g., `…/extracted_2018_metrics`).
* **tag1/tag2** – the year tags that are baked into the filenames (these must match exactly).
* **output\_base** – where all `_Diff` tiles and the diff VRTs will be saved. The folder will be created if it doesn't exist.

**Optional tweaks:**

* **metrics** – list of metric subfolder names to process. Drop any you don't need.
* **tile\_size** – tile width/height in map units (1 km × 1 km is standard here).
* **buffer** – how much to pad each tile when warping (20 m works well to avoid edge artifacts).
* **resolution** – set `(x_res, y_res)` to force a uniform grid, or leave `None` to keep native pixel size.
* **overviews** – list of pyramid factors for embedded TIFF overviews. Bigger lists = smoother zooming but more disk space.

**Tip:** If you're just testing, set `compute_diff=False` to skip the heavy lifting and only rebuild VRTs.
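
A short pre-flight check can catch a wrong path or a missing GDAL tool before the long-running steps start. The sketch below assumes the tool names used in this notebook's `subprocess` calls; `preflight` is a hypothetical helper, and the folder path in the demo is a placeholder. Some GDAL installs expose the calculator as `gdal_calc.py` instead of `gdal_calc`, so adjust the list to your environment.

```python
import os
import shutil

# Tool names as they are invoked in this notebook (an assumption about your install).
REQUIRED_TOOLS = (
    "gdalwarp", "gdal_calc", "gdaladdo", "gdalbuildvrt", "gdal_edit", "gdalinfo",
)


def preflight(folders, tools=REQUIRED_TOOLS):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    for folder in folders:
        if not os.path.isdir(folder):
            problems.append(f"missing folder: {folder}")
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"tool not on PATH: {tool}")
    return problems


for problem in preflight(["path/to/extracted_2012_metrics"]):
    print(problem)
```

The output folder is deliberately not checked, since the pipeline creates it when needed.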



In [38]:
folder1_base = r"C:/Users/sreeja/Documents/CA_Placer_Co/extracted_2012_metrics"  # Directory containing the previous year metric subfolders
folder2_base = r"C:/Users/sreeja/Documents/CA_Placer_Co/extracted_2018_metrics"  # Directory containing the later metric subfolders

#These tags are embedded in the filenames
tag1 = "2012" # Previous year tag
tag2 = "2018" # Later year tag

# Output directory for differenced tiles and VRTs (must be defined before calling run)
output_base = r"C:/Users/sreeja/Documents/CA_Placer_Co/Differences"

# Define available metric folders
metrics = [
    'CHM_Tiles',
    'DTM_Tiles',
    'DSM_Tiles',
    'Canopy_Cover_Tiles',
    'Density_Tiles',
    'Rumple_Tiles'
]

# Now invoke the pipeline:
#   folder1_base   Path to first (earlier) timepoint directory
#   folder2_base   Path to second (later) timepoint directory
#   tag1           Year tag for first dataset
#   tag2           Year tag for second dataset
#   output_base    Directory to write outputs (diffs & VRTs)
#   metrics        list of metric subfolders to difference and mosaic
#   compute_diff   True to run differencing, False to only rebuild VRTs and overviews
#   tile_size      each tile's width,height in map units
#   buffer         meters to extend around each tile for warping
#   resolution     (x_res, y_res) or None for native grid
#   overviews      pyramid factors for internal overviews
run(
    folder1_base,     # earlier year, e.g. "…/extracted_2012_metrics"
    folder2_base,     # later year, e.g. "…/extracted_2018_metrics"
    tag1,             # "2012"
    tag2,             # "2018"
    output_base,      # "…/Differences"
    metrics=metrics,  # your list of metrics
    compute_diff=True,    # set to False to skip differencing
    tile_size=(1000,1000),
    buffer=20,
    resolution=None,
    overviews=[2,4,8,16,32]
)

## Next Steps
- Adjust the `metrics` and `overviews` lists for your needs.  
- Visualize outputs in QGIS or Python.  