# DEA Processor Verification Notebook

This notebook provides a conda-friendly verification workflow for the DEA land cover processing code.

## Overview

This notebook will:
1. Show conda commands to create and activate an environment
2. Run the AOI fetch script to create state boundary GeoJSON files
3. Execute reclassification unit tests to validate `reclassify_dea_classes` behaviour
4. Verify that `fetch_dea_raster_for_year` raises `NotImplementedError` (as expected)
5. Provide instructions for running the driver and next steps

## 1. Environment Setup

### Conda Installation Instructions

For a conda-based installation, follow these steps:

```bash
# Create a new conda environment with Python 3.9
conda create -n aus_land_clearing python=3.9 -y

# Activate the environment
conda activate aus_land_clearing

# Install heavy geospatial dependencies from conda-forge first
conda install -c conda-forge geopandas rasterio fiona shapely pyproj gdal -y

# Install remaining dependencies with pip
pip install -r requirements.txt

# Install the package in development mode
pip install -e .
```

### Why use conda for geospatial packages?

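Geospatial Python packages such as GDAL, rasterio, and geopandas depend on compiled C/C++ libraries (GDAL, GEOS, PROJ). Installing them from conda-forge provides prebuilt binaries built against mutually compatible versions of those libraries, which avoids most of the compilation and version-mismatch problems that a pip-only install can run into.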
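
After installation, a quick import check can confirm that the compiled packages load. The sketch below is an illustration and not part of the repository: the helper name `check_modules` is hypothetical, and the module list mirrors the conda install command above (GDAL's Python bindings are imported as `osgeo`).

```python
import importlib


def check_modules(names):
    """Return {module name: version string, or None if the import fails}."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module defines __version__, so fall back to a placeholder.
            report[name] = getattr(module, "__version__", "unknown")
    return report


geo_modules = ["geopandas", "rasterio", "fiona", "shapely", "pyproj", "osgeo"]
for name, version in check_modules(geo_modules).items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{name:<10} {status}")
```

Any module reported as `NOT INSTALLED` points to a step in the conda commands that did not complete.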

## 2. Setup and Imports

In [None]:
import sys
import subprocess
from pathlib import Path
import numpy as np

# Determine repository root
# If running from notebooks/ directory, go up one level
repo_root = Path.cwd()
if repo_root.name == 'notebooks':
    repo_root = repo_root.parent

print(f"Repository root: {repo_root}")

# Add src to Python path
sys.path.insert(0, str(repo_root / 'src'))

# Import the DEA processor module
from aus_land_clearing import dea_processor

print("✓ Successfully imported dea_processor module")
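
The directory-name check above assumes the notebook is launched from either the repository root or `notebooks/`. A slightly more tolerant variant, sketched here as an assumption rather than the project's method, walks up the directory tree until it finds a marker directory. The helper name `find_repo_root` and the choice of `src` as the marker are illustrative.

```python
from pathlib import Path


def find_repo_root(start, marker="src"):
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")
```

Calling `find_repo_root(Path.cwd())` would then replace the `repo_root.name == 'notebooks'` test, at the cost of failing loudly when no `src` directory exists anywhere up the tree.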

## 3. Fetch Australian State Boundaries

Run the fetch script to download the NSW and QLD state boundaries from the GADM database.

In [None]:
# Path to the fetch script
fetch_script = repo_root / "scripts" / "fetch_australian_state_geojson.py"

print(f"Running: python {fetch_script}")
print("This may take a moment to download state boundaries...\n")

try:
    # Run the fetch script
    result = subprocess.run(
        [sys.executable, str(fetch_script)],
        cwd=str(repo_root),
        capture_output=True,
        text=True,
        timeout=120  # 2 minute timeout
    )
    
    # Print output
    print(result.stdout)
    
    if result.returncode != 0:
        print(f"\n⚠ Script exited with code {result.returncode}")
        if result.stderr:
            print(f"Error output:\n{result.stderr}")
    else:
        print("\n✓ State boundaries fetched successfully")
        
        # Verify files were created
        data_dir = repo_root / "data"
        nsw_file = data_dir / "nsw.geojson"
        qld_file = data_dir / "qld.geojson"
        
        if nsw_file.exists():
            print(f"  ✓ {nsw_file} created ({nsw_file.stat().st_size:,} bytes)")
        else:
            print(f"  ✗ {nsw_file} not found")
            
        if qld_file.exists():
            print(f"  ✓ {qld_file} created ({qld_file.stat().st_size:,} bytes)")
        else:
            print(f"  ✗ {qld_file} not found")
            
except subprocess.TimeoutExpired:
    print("⚠ Script timed out after 2 minutes")
    print("This may happen if network is slow or GADM server is unavailable.")
except FileNotFoundError:
    print(f"✗ Script not found at {fetch_script}")
except Exception as e:
    print(f"✗ Error running script: {e}")
    print("\nNote: If network or dependencies are unavailable, the expected output is:")
    print("  - data/nsw.geojson (New South Wales boundary)")
    print("  - data/qld.geojson (Queensland boundary)")
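
File existence and size alone do not show that the boundaries are usable. As a further check, the sketch below parses each file with the standard `json` module and reports its top-level type and feature count. It assumes the fetch script writes a GeoJSON `FeatureCollection`, which this notebook does not confirm, and `summarise_geojson` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarise_geojson(path):
    """Return (top-level GeoJSON type, number of features) for a file."""
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    # A FeatureCollection stores its members under "features"; other
    # GeoJSON types (a bare Feature or geometry) have no such key.
    return payload.get("type"), len(payload.get("features", []))


for name in ("nsw.geojson", "qld.geojson"):
    path = Path("data") / name  # relative to the repository root
    if path.exists():
        print(name, summarise_geojson(path))
    else:
        print(name, "not found")
```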

## 4. Test Reclassification Logic

Test the `reclassify_dea_classes` function which converts DEA land cover classes to woody (1) / non-woody (0) binary classifications.
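
The behaviour that the tests in this section expect can be summarised in a few lines of NumPy. The following is a minimal sketch inferred from those test cases (woody codes map to 1, non-woody codes to 0, anything else to NaN, with a float32 result). It is not the implementation in `dea_processor`, and the name `reclassify_sketch` is hypothetical.

```python
import numpy as np


def reclassify_sketch(data, classes_map):
    """Map class codes to 1.0 (woody), 0.0 (non-woody) or NaN (anything else)."""
    data = np.asarray(data)
    # Start with every pixel marked as unknown.
    out = np.full(data.shape, np.nan, dtype=np.float32)
    out[np.isin(data, classes_map.get("non_woody", []))] = 0.0
    out[np.isin(data, classes_map.get("woody", []))] = 1.0
    return out


example = reclassify_sketch(
    np.array([[111, 999], [214, 124]]),
    {"woody": [111, 124], "non_woody": [214, 215]},
)
print(example)
```

Using a float output is what makes NaN available as a "no data" marker, since integer arrays cannot hold NaN.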

These tests are converted from `tests/test_dea_processor.py`.

In [None]:
# Test 1: Simple reclassification
print("Test 1: Simple reclassification")
print("=" * 50)

data = np.array([[111, 124], [214, 215]])
classes_map = {
    'woody': [111, 124],
    'non_woody': [214, 215]
}

result = dea_processor.reclassify_dea_classes(data, classes_map)
expected = np.array([[1, 1], [0, 0]], dtype=np.float32)

print(f"Input data:\n{data}")
print(f"\nExpected output:\n{expected}")
print(f"\nActual output:\n{result}")

try:
    np.testing.assert_array_equal(result, expected)
    print("\n✓ Test 1 PASSED")
except AssertionError as e:
    print(f"\n✗ Test 1 FAILED: {e}")

print()

In [None]:
# Test 2: Mixed classes
print("Test 2: Mixed woody and non-woody classes")
print("=" * 50)

data = np.array([[111, 214], [124, 215], [112, 216]])
classes_map = {
    'woody': [111, 112, 124],
    'non_woody': [214, 215, 216]
}

result = dea_processor.reclassify_dea_classes(data, classes_map)
expected = np.array([[1, 0], [1, 0], [1, 0]], dtype=np.float32)

print(f"Input data:\n{data}")
print(f"\nExpected output:\n{expected}")
print(f"\nActual output:\n{result}")

try:
    np.testing.assert_array_equal(result, expected)
    print("\n✓ Test 2 PASSED")
except AssertionError as e:
    print(f"\n✗ Test 2 FAILED: {e}")

print()

In [None]:
# Test 3: Unknown classes as NaN
print("Test 3: Unknown classes should be NaN")
print("=" * 50)

data = np.array([[111, 999], [124, 888]])
classes_map = {
    'woody': [111, 124],
    'non_woody': [214, 215]
}

result = dea_processor.reclassify_dea_classes(data, classes_map)

print(f"Input data:\n{data}")
print(f"\nOutput:\n{result}")

tests_passed = []

# Check known classes are correct
if result[0, 0] == 1:
    print("✓ result[0, 0] == 1 (woody class 111)")
    tests_passed.append(True)
else:
    print(f"✗ result[0, 0] == {result[0, 0]}, expected 1")
    tests_passed.append(False)

if result[1, 0] == 1:
    print("✓ result[1, 0] == 1 (woody class 124)")
    tests_passed.append(True)
else:
    print(f"✗ result[1, 0] == {result[1, 0]}, expected 1")
    tests_passed.append(False)

# Check unknown classes are NaN
if np.isnan(result[0, 1]):
    print("✓ result[0, 1] is NaN (unknown class 999)")
    tests_passed.append(True)
else:
    print(f"✗ result[0, 1] == {result[0, 1]}, expected NaN")
    tests_passed.append(False)

if np.isnan(result[1, 1]):
    print("✓ result[1, 1] is NaN (unknown class 888)")
    tests_passed.append(True)
else:
    print(f"✗ result[1, 1] == {result[1, 1]}, expected NaN")
    tests_passed.append(False)

if all(tests_passed):
    print("\n✓ Test 3 PASSED")
else:
    print("\n✗ Test 3 FAILED")

print()
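
The element-by-element checks above can also be written as a single comparison, because `np.testing.assert_array_equal` treats NaNs in matching positions as equal. The sketch below uses a hand-written stand-in array in place of the real function output so that it runs on its own.

```python
import numpy as np

# Stand-in for the output of reclassify_dea_classes in Test 3.
result = np.array([[1, np.nan], [1, np.nan]], dtype=np.float32)
expected = np.array([[1, np.nan], [1, np.nan]], dtype=np.float32)

# Passes: NaNs at the same positions compare as equal here,
# unlike the plain == operator, where NaN != NaN.
np.testing.assert_array_equal(result, expected)
print("NaN-aware comparison passed")
```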

In [None]:
# Test 4: Output data type (int32 input should yield float32 output)
print("Test 4: Output dtype should be float32")
print("=" * 50)

data = np.array([[111, 124]], dtype=np.int32)
classes_map = {
    'woody': [111, 124],
    'non_woody': []
}

result = dea_processor.reclassify_dea_classes(data, classes_map)

print(f"Input dtype: {data.dtype}")
print(f"Output dtype: {result.dtype}")

if result.dtype == np.float32:
    print("\n✓ Test 4 PASSED")
else:
    print(f"\n✗ Test 4 FAILED: Expected float32, got {result.dtype}")

print()

In [None]:
# Test 5: Large array reclassification (shape and value checks, no timing)
print("Test 5: Large array reclassification")
print("=" * 50)

# Create a larger test array
np.random.seed(42)  # For reproducibility
data = np.random.choice([111, 124, 214, 215], size=(100, 100))

classes_map = {
    'woody': [111, 124],
    'non_woody': [214, 215]
}

result = dea_processor.reclassify_dea_classes(data, classes_map)

print(f"Input shape: {data.shape}")
print(f"Output shape: {result.shape}")

tests_passed = []

# Check output shape matches input
if result.shape == data.shape:
    print("✓ Output shape matches input shape")
    tests_passed.append(True)
else:
    print(f"✗ Shape mismatch: {result.shape} vs {data.shape}")
    tests_passed.append(False)

# Check that all values are either 0, 1, or NaN
valid_mask = ~np.isnan(result)
valid_values = result[valid_mask]

if np.all((valid_values == 0) | (valid_values == 1)):
    print("✓ All non-NaN values are either 0 or 1")
    tests_passed.append(True)
else:
    print("✗ Some values are not 0, 1, or NaN")
    tests_passed.append(False)

print("\nValue distribution:")
print(f"  Woody (1): {np.sum(result == 1)} pixels")
print(f"  Non-woody (0): {np.sum(result == 0)} pixels")
print(f"  No data (NaN): {np.sum(np.isnan(result))} pixels")

if all(tests_passed):
    print("\n✓ Test 5 PASSED")
else:
    print("\n✗ Test 5 FAILED")

print()
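
Test 5 checks shape and value range on a larger array but does not measure speed. If timing matters, a helper along these lines could be adapted. This is a sketch under stated assumptions: `best_time` is a hypothetical name, and `np.isin` is used only as a placeholder workload so that the block runs without the package installed.

```python
import time

import numpy as np


def best_time(func, *args, repeats=3):
    """Return the fastest wall-clock time (seconds) over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


rng = np.random.default_rng(42)
data = rng.choice([111, 124, 214, 215], size=(1000, 1000))

# Placeholder workload; pass the real reclassification function and its
# arguments instead to time that.
seconds = best_time(np.isin, data, [111, 124])
print(f"{data.size:,} pixels: best of 3 runs = {seconds:.4f} s")
```

Taking the minimum over a few repeats reduces the influence of one-off delays such as cache warm-up.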

## 5. Verify Template Function Behaviour

Test that `fetch_dea_raster_for_year` raises `NotImplementedError` as expected. This is the template function that will be implemented in sweep-2 to actually fetch DEA data.

In [None]:
print("Test: fetch_dea_raster_for_year raises NotImplementedError")
print("=" * 50)

# Try to call the template function
try:
    # Create dummy inputs
    import geopandas as gpd
    from shapely.geometry import Point
    
    # Create a dummy AOI
    dummy_aoi = gpd.GeoDataFrame(
        {'geometry': [Point(150, -30).buffer(0.1)]},
        crs='EPSG:4326'
    )
    
    # Create a minimal config
    dummy_config = {
        'dea_profile': {
            'product_id': 'ga_ls_landcover_class_cyear_2',
            'crs': 'EPSG:3577',
            'resolution': 25
        }
    }
    
    # This should raise NotImplementedError
    result = dea_processor.fetch_dea_raster_for_year(
        year=2020,
        aoi=dummy_aoi,
        config=dummy_config
    )
    
    # If we get here, the function didn't raise the expected error
    print("✗ FAILED: Function did not raise NotImplementedError")
    print(f"  Instead returned: {result}")
    
except NotImplementedError as e:
    print("✓ PASSED: Function correctly raises NotImplementedError")
    print("\nError message:")
    print(e)
    
except Exception as e:
    print(f"✗ FAILED: Function raised unexpected error: {type(e).__name__}")
    print(f"  {e}")

print()
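
The try/except above is the notebook-friendly form of an "expect this exception" check. The same pattern can be wrapped in a small helper, sketched below with the standard library only. Both `raises` and `fetch_stub` are hypothetical names; `fetch_stub` merely imitates a template function and is not the project's `fetch_dea_raster_for_year`.

```python
def raises(expected_exception, func, *args, **kwargs):
    """Return True if calling func raises expected_exception, else False.

    Any other exception type propagates to the caller.
    """
    try:
        func(*args, **kwargs)
    except expected_exception:
        return True
    return False


def fetch_stub(year, aoi=None, config=None):
    """Hypothetical placeholder that imitates an unimplemented template."""
    raise NotImplementedError(f"Data fetching for {year} is not implemented yet")


print(raises(NotImplementedError, fetch_stub, 2020))
```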

## 6. Summary and Next Steps

### What We've Verified

✓ **Environment Setup**: Instructions for conda-based installation  
✓ **Data Fetching**: AOI fetch script creates state boundary files  
✓ **Reclassification Logic**: `reclassify_dea_classes` correctly converts DEA classes to woody/non-woody  
✓ **Template Function**: `fetch_dea_raster_for_year` correctly raises `NotImplementedError`

### Running the Full Pipeline

Once the data-fetching backend is implemented in sweep-2, you can run the full processing pipeline:

```bash
# Process NSW data
python scripts/run_dea_processing.py --state nsw

# Process QLD data
python scripts/run_dea_processing.py --state qld

# Process both states with custom year range
python scripts/run_dea_processing.py --state nsw --state qld --start-year 2015 --end-year 2023
```
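
For orientation, flags like those above could be parsed with `argparse` as sketched below. This is an assumption about the driver's interface based only on the example commands; the defaults and the restriction to `nsw`/`qld` are illustrative, and the actual `scripts/run_dea_processing.py` may differ.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the example commands."""
    parser = argparse.ArgumentParser(description="Run DEA land cover processing")
    # action="append" lets --state be repeated: --state nsw --state qld
    parser.add_argument("--state", action="append", choices=["nsw", "qld"],
                        required=True, help="State to process (repeatable)")
    parser.add_argument("--start-year", type=int, default=None)
    parser.add_argument("--end-year", type=int, default=None)
    return parser


args = build_parser().parse_args(
    ["--state", "nsw", "--state", "qld", "--start-year", "2015", "--end-year", "2023"]
)
print(args.state, args.start_year, args.end_year)
```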

### Next Steps for Implementation (Sweep-2)

The `fetch_dea_raster_for_year` function needs to be implemented using one of these approaches:

1. **Open Data Cube (ODC)** - if you have a local datacube instance
   ```python
   import datacube
   dc = datacube.Datacube()
   data = dc.load(product='ga_ls_landcover_class_cyear_2', ...)
   ```

2. **STAC API** - using odc-stac to query DEA's STAC catalog
   ```python
   from odc.stac import load
   from pystac_client import Client
   catalog = Client.open('https://explorer.dea.ga.gov.au/stac/')
   ```

3. **Direct Download** - from DEA's data repository

See the function docstring in `src/aus_land_clearing/dea_processor.py` for implementation details.
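
Whichever backend is chosen, the year and AOI have to be turned into a spatial and temporal query. The sketch below builds such parameters as a plain dictionary. It is an illustration only: `build_stac_query` is a hypothetical helper, the bounds are assumed to be `(min_lon, min_lat, max_lon, max_lat)` in EPSG:4326 (for example from `aoi.total_bounds`), and whether DEA's catalogue uses the product ID as the collection name should be checked against the catalogue itself.

```python
def build_stac_query(year, bounds, collection):
    """Return search parameters covering one calendar year over a bounding box."""
    min_lon, min_lat, max_lon, max_lat = (float(v) for v in bounds)
    if min_lon >= max_lon or min_lat >= max_lat:
        raise ValueError(f"Degenerate bounding box: {bounds}")
    return {
        "collections": [collection],
        "bbox": [min_lon, min_lat, max_lon, max_lat],
        # Closed date interval in the "start/end" form used by STAC APIs.
        "datetime": f"{year}-01-01/{year}-12-31",
    }


query = build_stac_query(2020, (149.9, -30.1, 150.1, -29.9),
                         "ga_ls_landcover_class_cyear_2")
print(query)
```

With `pystac_client`, a dictionary of this shape could be unpacked into `catalog.search(**query)`; loading the matched items into an array would remain a separate step.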

### Running Tests with pytest

The unit tests can also be run using pytest:

```bash
# Run all tests
pytest tests/

# Run with verbose output
pytest tests/test_dea_processor.py -v

# Run a specific test
pytest tests/test_dea_processor.py::TestReclassifyDEAClasses::test_simple_reclassification -v
```

### Continuous Integration

A GitHub Actions workflow has been added at `.github/workflows/run-tests.yml` that automatically runs the unit tests on every push and pull request. This ensures the core reclassification logic remains correct as the codebase evolves.