You are a Databricks Workflows expert. Help me create a production-grade Job DAG in the Databricks Workflows UI for my pipeline in catalog main, schema demo.

Goal:
Raw → tiling → tile stats → surface cells → surface patches → features → table optimization.

Tasks (in order) and dependencies:
1) 01_ingest_raw  (no dependency)
2) 02_spatial_tiling_v2 depends on 01_ingest_raw
3) 03_tile_stats_v2 depends on 02_spatial_tiling_v2
4) 04_surface_cells_v2 depends on 03_tile_stats_v2
5) 04_surface_patches_v2 depends on 04_surface_cells_v2
6) 05_feature_water_bodies_v2 depends on 04_surface_cells_v2
7) 05_feature_building_candidates_v2 depends on 04_surface_patches_v2
8) 99_optimize_tables (SQL task) depends on both feature tasks

Tables involved:
points_raw
processed_points_tiled_v2
tile_stats_v2
surface_cells_v2
surface_patches_v2
features_water_bodies_v2
features_building_candidates_v2

Job parameters (with defaults):
siteId = "wellington_cbd"
cellSizeM = "0.5"
patchWaterThreshold = "0.7"
skipWaterTileRatio = "0.8"
targetIngestRunId = "" (optional)

Requirements:
- Use a job cluster
- Set max concurrent runs = 1
- Enable task retries (2-3) and a reasonable timeout per task
- Pass parameters to scripts via spark.conf (e.g., pipeline.siteId) or notebook widgets
- Put the SQL OPTIMIZE/ZORDER step at the end
- Provide step-by-step UI instructions to configure each task and recommended compute settings

Ask me for the workspace paths of each script if needed and show placeholders like <PATH_TO_SCRIPT>.


# Databricks Workflows Job DAG - Step-by-Step Configuration Guide

## Overview
This guide walks you through creating a production-grade Job DAG for the Trimble Geospatial point cloud processing pipeline.

---

## Part 1: Create the Job

### Step 1.1: Navigate to Workflows
1. Click **Workflows** in the left sidebar
2. Click the **Create Job** button (top right)
3. Enter Job Name: `Geospatial_Pipeline_Wellington_CBD`

### Step 1.2: Configure Job-Level Settings
1. Click **Edit** next to the job name
2. Set **Max concurrent runs**: `1` (prevents concurrent processing of same site)
3. Click **Advanced** (if available)
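   - **Timeout**: `7200` seconds (2 hours for the entire run). The per-task timeouts in Part 4 add up to 4.5 hours along the longest dependency chain, so raise this value if tasks are expected to approach their individual limits.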
4. Click **Save**

---

## Part 2: Configure Job Cluster

### Step 2.1: Create Job Cluster
1. In the job configuration page, scroll to **Compute** section
2. Click **Add** or **Edit compute**
3. Select **Job cluster** (not All-purpose cluster)

### Step 2.2: Cluster Configuration
**Cluster Name**: `geospatial-pipeline-cluster`

**Cluster Mode**: `Standard`

**Databricks Runtime**: `16.4 LTS (Scala 2.12, Spark 3.5.3)` with Photon

**Node Type**:
- **Driver**: `Standard_D4ads_v6` (4 cores, 16 GB RAM)
- **Workers**: `Standard_D4ads_v6` (4 cores, 16 GB RAM)
- **Workers count**: Enable autoscaling with min `2` and max `4` workers as a starting point

**Advanced Options**:
- **Spark Config**:
  ```
  spark.databricks.delta.optimizeWrite.enabled true
  spark.databricks.delta.autoCompact.enabled true
  spark.sql.adaptive.enabled true
  ```
- **Environment Variables** (optional):
  ```
  PYSPARK_PYTHON=/databricks/python3/bin/python3
  ```

4. Click **Confirm**
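
If this configuration is later moved into code, the settings above map roughly onto a `job_clusters` entry in a Jobs API payload. The following is a minimal sketch rather than a definitive spec: the runtime key `16.4.x-scala2.12` and the 2-to-4 worker autoscale range are assumptions, so check the exact values offered in your workspace.

```python
import json

# Sketch of the Step 2.2 settings as a Jobs API `job_clusters` entry.
# Assumptions: the runtime key string and the 2-4 worker autoscale range.
job_cluster = {
    "job_cluster_key": "geospatial-pipeline-cluster",
    "new_cluster": {
        "spark_version": "16.4.x-scala2.12",
        "runtime_engine": "PHOTON",
        "node_type_id": "Standard_D4ads_v6",
        "driver_node_type_id": "Standard_D4ads_v6",
        "autoscale": {"min_workers": 2, "max_workers": 4},
        "spark_conf": {
            "spark.databricks.delta.optimizeWrite.enabled": "true",
            "spark.databricks.delta.autoCompact.enabled": "true",
            "spark.sql.adaptive.enabled": "true",
        },
        "spark_env_vars": {"PYSPARK_PYTHON": "/databricks/python3/bin/python3"},
    },
}

# Print the JSON form that would be pasted into a job definition.
print(json.dumps(job_cluster, indent=2))
```

Keeping the cluster definition in one dictionary makes it easy to diff against what the UI shows under **Compute**.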

---

## Part 3: Configure Job Parameters

### Step 3.1: Add Job Parameters
1. Scroll to **Parameters** section
2. Click **Add parameter** for each:

| Parameter Name | Default Value | Description |
|----------------|---------------|-------------|
| `siteId` | `wellington_cbd` | Site identifier |
| `cellSizeM` | `0.5` | Surface cell size in meters |
| `patchWaterThreshold` | `0.7` | Water threshold for patches |
| `skipWaterTileRatio` | `0.8` | Skip tiles with >80% water |
| `targetIngestRunId` | `` | Optional: specific ingest run |
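
Job parameters always arrive in a notebook as strings, so numeric values such as `cellSizeM` need to be converted and checked before use. The sketch below shows one way to do that; the `PipelineParams` class and `parse_params` function are hypothetical helpers, and the range checks (positive cell size, ratios between 0 and 1) are assumptions about what the pipeline expects.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults mirror the parameter table above; all values are strings,
# which is how job parameters are delivered to tasks.
DEFAULTS = {
    "siteId": "wellington_cbd",
    "cellSizeM": "0.5",
    "patchWaterThreshold": "0.7",
    "skipWaterTileRatio": "0.8",
    "targetIngestRunId": "",
}


@dataclass(frozen=True)
class PipelineParams:
    site_id: str
    cell_size_m: float
    patch_water_threshold: float
    skip_water_tile_ratio: float
    target_ingest_run_id: Optional[str]


def parse_params(raw: Mapping[str, str]) -> PipelineParams:
    """Convert string job parameters into typed, validated values."""
    merged = {**DEFAULTS, **raw}
    site_id = merged["siteId"].strip()
    if not site_id:
        raise ValueError("siteId must not be empty")
    cell_size = float(merged["cellSizeM"])
    if cell_size <= 0:
        raise ValueError("cellSizeM must be positive")
    ratios = {}
    for name in ("patchWaterThreshold", "skipWaterTileRatio"):
        value = float(merged[name])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        ratios[name] = value
    # An empty targetIngestRunId means "not specified".
    run_id = merged["targetIngestRunId"].strip() or None
    return PipelineParams(
        site_id,
        cell_size,
        ratios["patchWaterThreshold"],
        ratios["skipWaterTileRatio"],
        run_id,
    )


print(parse_params({"cellSizeM": "0.25"}))
```

Inside a Databricks notebook, the input mapping could be built with `{name: dbutils.widgets.get(name) for name in DEFAULTS}`, provided each widget has been declared first.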

---

## Part 4: Add Tasks (8 Tasks Total)

### Task 1: Ingest Raw Data

**Click "Add task" button**

**Task Configuration**:
- **Task name**: `01_ingest_raw`
- **Type**: `Notebook`
- **Source**: `Workspace`
- **Path**: `/Users/jinapang2003@gmail.com/trimble-geospacial-demo/databricks/pipelines/01_ingest/01_ingest_raw`
- **Cluster**: Select the job cluster created above
- **Depends on**: (none - this is the first task)
- **Timeout**: `1800` seconds (30 min)
- **Retries**: `2`
- **Retry interval**: `60` seconds

**Parameters** (click "Add" under Base parameters):
```
siteId: {{job.parameters.siteId}}
targetIngestRunId: {{job.parameters.targetIngestRunId}}
```

**Click "Create task"**

---

### Task 2: Spatial Tiling

**Click "Add task" button**

**Task Configuration**:
- **Task name**: `02_spatial_tiling_v2`
- **Type**: `Notebook`
- **Source**: `Workspace`
- **Path**: `/Users/jinapang2003@gmail.com/trimble-geospacial-demo/databricks/pipelines/02_processing/02_spatial_tiling_v2`
- **Cluster**: (same job cluster)
- **Depends on**: `01_ingest_raw` ✅
- **Timeout**: `3600` seconds (1 hour)
- **Retries**: `2`
- **Retry interval**: `60` seconds

**Parameters**:
```
siteId: {{job.parameters.siteId}}
```

**Click "Create task"**

---

### Task 3: Tile Statistics

**Click "Add task" button**

**Task Configuration**:
- **Task name**: `03_tile_stats_v2`
- **Type**: `Notebook`
- **Source**: `Workspace`
- **Path**: `/Users/jinapang2003@gmail.com/trimble-geospacial-demo/databricks/pipelines/03_aggregation/03_tile_stats_v2`
- **Cluster**: (same job cluster)
- **Depends on**: `02_spatial_tiling_v2` ✅
- **Timeout**: `1800` seconds (30 min)
- **Retries**: `2`
- **Retry interval**: `60` seconds

**Parameters**:
```
siteId: {{job.parameters.siteId}}
skipWaterTileRatio: {{job.parameters.skipWaterTileRatio}}
```

**Click "Create task"**

---

### Task 4: Surface Cells

**Click "Add task" button**

**Task Configuration**:
- **Task name**: `04_surface_cells_v2`
- **Type**: `Notebook`
- **Source**: `Workspace`
- **Path**: `/Users/jinapang2003@gmail.com/trimble-geospacial-demo/databricks/pipelines/04_surface/04_surface_cells_v2`
- **Cluster**: (same job cluster)
- **Depends on**: `03_tile_stats_v2` ✅
- **Timeout**: `3600` seconds (1 hour)
- **Retries**: `2`
- **Retry interval**: `60` seconds

**Parameters**:
```
siteId: {{job.parameters.siteId}}
cellSizeM: {{job.parameters.cellSizeM}}
```

**Click "Create task"**

---

### Task 5: Surface Patches

**Click "Add task" button**

**Task Configuration**:
- **Task name**: `04_surface_patches_v2`
- **Type**: `Notebook`
- **Source**: `Workspace`
- **Path**: `/Users/jinapang2003@gmail.com/trimble-geospacial-demo/databricks/pipelines/04_surface/04_surface_patches_v2`
- **Cluster**: (same job cluster)
- **Depends on**: `04_surface_cells_v2` ✅
- **Timeout**: `1800` seconds (30 min)
- **Retries**: `2`
- **Retry interval**: `60` seconds

**Parameters**:
```
siteId: {{job.parameters.siteId}}
patchWaterThreshold: {{job.parameters.patchWaterThreshold}}
```

**Click "Create task"**

---

### Task 6: Extract Water Bodies

**Click "Add task" button**

**Task Configuration**:
- **Task name**: `05_feature_water_bodies_v2`
- **Type**: `Notebook`
- **Source**: `Workspace`
- **Path**: `/Users/jinapang2003@gmail.com/trimble-geospacial-demo/databricks/pipelines/05_feature/05_feature_water_bodies_v2`
- **Cluster**: (same job cluster)
- **Depends on**: `04_surface_cells_v2` ✅ (parallel with patches)
- **Timeout**: `1800` seconds (30 min)
- **Retries**: `2`
- **Retry interval**: `60` seconds

**Parameters**:
```
siteId: {{job.parameters.siteId}}
```

**Click "Create task"**

---

### Task 7: Extract Building Candidates

**Click "Add task" button**

**Task Configuration**:
- **Task name**: `05_feature_building_candidates_v2`
- **Type**: `Notebook`
- **Source**: `Workspace`
- **Path**: `/Users/jinapang2003@gmail.com/trimble-geospacial-demo/databricks/pipelines/05_feature/05_feature_building_candidates_v2`
- **Cluster**: (same job cluster)
- **Depends on**: `04_surface_patches_v2` ✅
- **Timeout**: `1800` seconds (30 min)
- **Retries**: `2`
- **Retry interval**: `60` seconds

**Parameters**:
```
siteId: {{job.parameters.siteId}}
```

**Click "Create task"**

---

### Task 8: Optimize Tables (SQL)

**Click "Add task" button**

**Task Configuration**:
- **Task name**: `99_optimize_tables`
- **Type**: `Notebook` (or `SQL` if you have a .sql file; note that a SQL task runs on a SQL warehouse, not on the job cluster)
- **Source**: `Workspace`
- **Path**: `/Users/jinapang2003@gmail.com/trimble-geospacial-demo/databricks/pipelines/99_optimization/99_optimize_tables`
- **Cluster**: (same job cluster)
- **Depends on**: 
  - `05_feature_water_bodies_v2` ✅
  - `05_feature_building_candidates_v2` ✅
  - (Click "Add dependency" to add both)
- **Timeout**: `1800` seconds (30 min)
- **Retries**: `1`
- **Retry interval**: `60` seconds

**Parameters**:
```
siteId: {{job.parameters.siteId}}
```

**Click "Create task"**
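
The contents of `99_optimize_tables` are not shown in this guide. As a rough illustration of what such a notebook might do with the `siteId` parameter, the sketch below builds one `OPTIMIZE` statement per table. It assumes every table is partitioned by `siteId` (the `WHERE` clause of `OPTIMIZE` only accepts partition predicates), and the `tileId` Z-order column in the example is hypothetical.

```python
import re
from typing import List, Sequence

TABLES = [
    "points_raw",
    "processed_points_tiled_v2",
    "tile_stats_v2",
    "surface_cells_v2",
    "surface_patches_v2",
    "features_water_bodies_v2",
    "features_building_candidates_v2",
]

# Only allow simple identifiers so the value can be embedded in SQL safely.
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9_\-]+$")


def optimize_statement(table: str, site_id: str, zorder_by: Sequence[str] = ()) -> str:
    """Build an OPTIMIZE statement restricted to one site's partitions."""
    if not _SAFE_VALUE.match(site_id):
        raise ValueError(f"unsafe siteId: {site_id!r}")
    statement = f"OPTIMIZE main.demo.{table} WHERE siteId = '{site_id}'"
    if zorder_by:
        # Z-order columns must not be partition columns.
        statement += " ZORDER BY (" + ", ".join(zorder_by) + ")"
    return statement


def optimize_all(site_id: str) -> List[str]:
    """Return one plain OPTIMIZE statement per pipeline table."""
    return [optimize_statement(table, site_id) for table in TABLES]


for statement in optimize_all("wellington_cbd"):
    print(statement)  # in a notebook: spark.sql(statement)

# Hypothetical Z-order column, shown only to illustrate the syntax.
print(optimize_statement("tile_stats_v2", "wellington_cbd", ["tileId"]))
```

Restricting each statement to a single site keeps the optimization step proportional to the data the run just wrote.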

---

## Part 5: Verify DAG Structure

### Expected DAG Flow:
```
01_ingest_raw
      ↓
02_spatial_tiling_v2
      ↓
03_tile_stats_v2
      ↓
04_surface_cells_v2
      ↓               ↘
04_surface_patches_v2   05_feature_water_bodies_v2
      ↓                           ↓
05_feature_building_candidates_v2 ↓
      ↓                           ↓
      └─────→ 99_optimize_tables ←┘
```

### Verification Steps:
1. Click **Graph** tab to view the DAG visualization
2. Verify all dependencies are correct
3. Check that Task 6 (water bodies) can run in parallel with Task 5 (patches) and Task 7 (building candidates)
4. Confirm Task 8 waits for both feature tasks

---

## Part 6: Configure Notifications (Optional)

### Step 6.1: Add Email Alerts
1. Scroll to **Notifications** section
2. Click **Add notification**
3. Configure:
   - **On failure**: Send email to `jinapang2003@gmail.com`
   - **On success**: (optional) Send email
   - **On start**: (optional)

---

## Part 7: Save and Test

### Step 7.1: Save the Job
1. Click **Save** (top right)
2. Review the job configuration summary

### Step 7.2: Test Run
1. Click the **Run now** button
2. Monitor the run in the **Runs** tab
3. Check each task's logs for errors
4. Verify data in Unity Catalog tables:
   ```sql
   SELECT COUNT(*) FROM main.demo.points_raw WHERE siteId = 'wellington_cbd';
   SELECT COUNT(*) FROM main.demo.processed_points_tiled_v2 WHERE siteId = 'wellington_cbd';
   SELECT COUNT(*) FROM main.demo.tile_stats_v2 WHERE siteId = 'wellington_cbd';
   ```

### Step 7.3: Schedule (Optional)
1. Click **Add trigger** in the job page
2. Choose:
   - **Scheduled**: Quartz cron expression (e.g., `0 0 2 * * ?` for 2 AM daily)
   - **File arrival**: Trigger on new files in landing zone
   - **Continuous**: For streaming pipelines
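
Databricks job schedules use Quartz cron syntax, which has a leading seconds field and requires `?` in either the day-of-month or day-of-week position. A small sketch of a daily schedule as it would appear in a Jobs API `schedule` block follows; the `daily_quartz` helper is hypothetical, and the `Pacific/Auckland` timezone is an assumption chosen to match the Wellington site.

```python
def daily_quartz(hour: int, minute: int = 0) -> str:
    """Build a Quartz cron expression that fires once a day.

    Field order: seconds minutes hours day-of-month month day-of-week.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be in 0..59")
    return f"0 {minute} {hour} * * ?"


# Sketch of the `schedule` block in a job definition.
schedule = {
    "quartz_cron_expression": daily_quartz(2),
    "timezone_id": "Pacific/Auckland",  # assumption: site-local time
    "pause_status": "UNPAUSED",
}

print(schedule)
```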

---

## Part 8: Production Checklist

### Before Production Deployment:
- [ ] All notebooks have proper error handling
- [ ] Site lock mechanism is tested
- [ ] External locations are configured
- [ ] Job parameters are documented
- [ ] Cluster size is right-sized for data volume
- [ ] Retry logic is appropriate
- [ ] Notifications are configured
- [ ] Access controls are set (who can run/modify the job)
- [ ] Cost monitoring is enabled
- [ ] Backup/rollback plan is documented

---

## Troubleshooting Tips

### Common Issues:

**Issue 1: Task fails with "No parent external location"**
- **Solution**: Run setup notebooks in `00_setup/` first
- Verify external locations exist: `SHOW EXTERNAL LOCATIONS;`

**Issue 2: Site lock timeout**
- **Solution**: Check if another job is holding the lock
- Manually release: Run `release_site_lock()` in a notebook

**Issue 3: Out of memory errors**
- **Solution**: Increase worker node size or count
- Enable adaptive query execution (already in Spark config)

**Issue 4: Slow performance**
- **Solution**: Run `99_optimize_tables` more frequently
- Check partition pruning in queries
- Review Spark UI for skewed partitions

---

## Next Steps

1. **Follow this guide** to create the job in Databricks UI
2. **Test with small dataset** first (single site)
3. **Monitor first production run** closely
4. **Optimize cluster size** based on actual resource usage
5. **Set up alerting** for failures
6. **Document runbook** for on-call engineers

---

## Quick Reference: Notebook Paths

```
/Users/jinapang2003@gmail.com/trimble-geospacial-demo/databricks/pipelines/
├── 01_ingest/01_ingest_raw.ipynb
├── 02_processing/02_spatial_tiling_v2.ipynb
├── 03_aggregation/03_tile_stats_v2.ipynb
├── 04_surface/04_surface_cells_v2.ipynb
├── 04_surface/04_surface_patches_v2.ipynb
├── 05_feature/05_feature_water_bodies_v2.ipynb
├── 05_feature/05_feature_building_candidates_v2.ipynb
└── 99_optimization/99_optimize_tables.ipynb
```

# Trimble Geospatial Demo - Pipelines Directory Structure

## Overview
This directory contains the complete point cloud processing pipeline for geospatial data analysis.

## Directory Tree

```
pipelines/
│
├── 00_setup/                                    # Initial setup and configuration
│   ├── create_site_lock_table.ipynb            # Create distributed locking table
│   ├── setup_external_locations.ipynb          # Register Unity Catalog external locations
│   └── setup_tiling_params.ipynb               # Configure spatial tiling parameters
│
├── 01_ingest/                                   # Data ingestion layer
│   ├── 01_ingest_raw.ipynb                     # Ingest raw point cloud data from landing zone
│   └── lesson learned.ipynb                    # Documentation of ingestion insights
│
├── 02_processing/                               # Data processing and transformation
│   ├── 02_spatial_tiling_v2.ipynb              # Spatial tiling with H3/custom grid
│   └── benchmark_to_control_params.ipynb       # Performance tuning and parameter optimization
│
├── 03_aggregation/                              # Statistical aggregation
│   └── 03_tile_stats_v2.ipynb                  # Compute per-tile statistics (z, intensity, water)
│
├── 04_surface/                                  # Surface reconstruction
│   ├── 04_surface_cells_v2.ipynb               # Generate surface grid cells
│   └── 04_surface_patches_v2.ipynb             # Create surface patches from cells
│
├── 05_feature/                                  # Feature extraction
│   ├── 05_feature_building_candidates_v2.ipynb # Detect building candidates
│   ├── 05_feature_water_bodies_v2.ipynb        # Extract water body features
│   └── debug.ipynb                             # Feature extraction debugging
│
├── 99_optimization/                             # Table maintenance and optimization
│   ├── 99_optimize_tables.ipynb                # Run OPTIMIZE on Delta tables
│   ├── analyze_tables.ipynb                    # Compute table statistics
│   └── vacuum_tables.ipynb                     # Clean up old Delta versions
│
└── path_structure.ipynb                         # Documentation of storage paths
```

## Pipeline Stages

### Stage 0: Setup (One-Time)
* **Purpose**: Initialize infrastructure and configuration
* **Notebooks**: 3
* **Run Frequency**: Once per workspace/site

### Stage 1: Ingestion
* **Purpose**: Load raw point cloud data into Delta Lake
* **Input**: Parquet files from landing zone
* **Output**: `main.demo.points_raw` (partitioned by siteId, ingestRunId)

### Stage 2: Processing
* **Purpose**: Spatial indexing and tiling
* **Input**: `points_raw`
* **Output**: `processed_points_tiled_v2` (partitioned by siteId, tileId)

### Stage 3: Aggregation
* **Purpose**: Compute tile-level statistics
* **Input**: `processed_points_tiled_v2`
* **Output**: `tile_stats_v2` (includes water detection)

### Stage 4: Surface Reconstruction
* **Purpose**: Generate digital surface models
* **Input**: `processed_points_tiled_v2`, `tile_stats_v2`
* **Output**: `surface_cells_v2`, `surface_patches_v2`

### Stage 5: Feature Extraction
* **Purpose**: Identify buildings, water bodies, and other features
* **Input**: `surface_cells_v2`, `surface_patches_v2`
* **Output**: `features_building_candidates_v2`, `features_water_bodies_v2`

### Stage 99: Optimization
* **Purpose**: Maintain Delta table performance
* **Run Frequency**: Weekly or after major ingestion

## Key Design Patterns

### 1. Site-Level Locking
* All processing notebooks use `acquire_site_lock()` / `release_site_lock()`
* Prevents concurrent writes to the same site
* Ensures latest-snapshot semantics
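
The signatures of `acquire_site_lock()` and `release_site_lock()` are not shown in this document. To illustrate the idea of a per-site lock that expires after a TTL, here is an in-memory sketch; the `InMemorySiteLocks` class is hypothetical and is not the pipeline's implementation, which would need an atomic conditional update against the lock table to be safe across concurrent jobs.

```python
import time
from typing import Callable, Dict, Tuple

LOCK_TTL_SECONDS = 60 * 60  # 60-minute TTL, as listed in the Notes


class InMemorySiteLocks:
    """Illustrative stand-in for a lock table; single-process only."""

    def __init__(
        self,
        ttl_seconds: float = LOCK_TTL_SECONDS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # siteId -> (owner, expiry timestamp)
        self._locks: Dict[str, Tuple[str, float]] = {}

    def acquire(self, site_id: str, owner: str) -> bool:
        """Take the lock unless another owner holds an unexpired one."""
        now = self._clock()
        holder = self._locks.get(site_id)
        if holder is not None and holder[1] > now and holder[0] != owner:
            return False
        self._locks[site_id] = (owner, now + self._ttl)
        return True

    def release(self, site_id: str, owner: str) -> bool:
        """Release the lock only if this owner currently holds it."""
        holder = self._locks.get(site_id)
        if holder is None or holder[0] != owner:
            return False
        del self._locks[site_id]
        return True


locks = InMemorySiteLocks()
if locks.acquire("wellington_cbd", "run-1"):
    try:
        pass  # process the site here
    finally:
        locks.release("wellington_cbd", "run-1")
```

The TTL matters because a crashed run never calls release; once the expiry passes, the next run can take over the site instead of waiting indefinitely.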

### 2. External Table Pattern
* All tables use Unity Catalog external locations
* Storage paths: `abfss://raw@...`, `abfss://processed@...`, `abfss://aggregated@...`
* Enables cross-workspace data sharing

### 3. Versioned Notebooks
* `_v2` suffix indicates second iteration
* Allows A/B testing and gradual migration

### 4. Partitioning Strategy
* **Raw/Processed**: `siteId` + `ingestRunId` or `tileId`
* **Aggregated**: `siteId` + derived keys
* Enables efficient `replaceWhere` operations
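
To make the `replaceWhere` point concrete, the sketch below builds a predicate over the partition columns and passes it to a Delta overwrite, so only the matching slice of the table is replaced. Both helper functions are hypothetical, and the column names `siteId` and `ingestRunId` are taken from the partitioning described above; the actual notebooks may structure this differently.

```python
import re
from typing import Optional

# Only allow simple identifiers so values can be embedded in SQL safely.
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9_\-]+$")


def replace_where_predicate(site_id: str, ingest_run_id: Optional[str] = None) -> str:
    """Build a partition predicate for a site, optionally narrowed to one ingest run."""
    for value in (site_id, ingest_run_id):
        if value is not None and not _SAFE_VALUE.match(value):
            raise ValueError(f"unsafe value: {value!r}")
    predicate = f"siteId = '{site_id}'"
    if ingest_run_id is not None:
        predicate += f" AND ingestRunId = '{ingest_run_id}'"
    return predicate


def overwrite_slice(df, table: str, predicate: str) -> None:
    """Overwrite only the rows matching the predicate.

    `df` is expected to be a Spark DataFrame whose rows all satisfy the predicate.
    """
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)
        .saveAsTable(table)
    )


print(replace_where_predicate("wellington_cbd"))
print(replace_where_predicate("wellington_cbd", "run_001"))
```

Because the predicate lines up with partition columns, a rerun for one site rewrites that site's partitions and leaves other sites untouched, which is what makes task retries safe.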

## Storage Locations

| Layer | Container | External Location | Example Path |
|-------|-----------|-------------------|-------------|
| Raw | `raw` | `raw_container` | `abfss://raw@trimblegeospatialdemo.../points` |
| Processed | `processed` | `processed_container` | `abfss://processed@.../points_tiled_v2` |
| Aggregated | `aggregated` | `aggregated_container` | `abfss://aggregated@.../tile_stats_v2` |

## Execution Order

```
00_setup (once)
    ↓
01_ingest_raw
    ↓
02_spatial_tiling_v2
    ↓
03_tile_stats_v2
    ↓
04_surface_cells_v2 → 04_surface_patches_v2
    ↓
05_feature_building_candidates_v2
05_feature_water_bodies_v2
    ↓
99_optimize_tables (periodic)
```

## Notes

* All notebooks use `main.demo` catalog/schema
* Water classification code: `9` (LAS standard)
* Mostly water threshold: `0.80` (80% water points)
* Site lock TTL: 60 minutes (configurable)