# AIMS Data Platform - Orchestration Pipeline

## Pipeline Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        EXTERNAL DATA SOURCE                                  │
│                    (SFTP Server / Manual Upload)                            │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  PHASE 0: LANDING ZONE → BRONZE (Full Refresh)                              │
│  ═════════════════════════════════════════════                              │
│                                                                             │
│  📁 /landing/                                                               │
│      └── *.parquet (raw files arrive here via SFTP)                        │
│                                                                             │
│  Actions:                                                                   │
│  1. Scan landing zone for new parquet files                                │
│  2. CLEAR Bronze layer (remove old files)                                  │
│  3. COPY files from landing → Bronze                                       │
│  4. Track filenames in LANDING_FILES_TO_ARCHIVE list                       │
│                                                                             │
│  ⚠️ IMPORTANT: Landing is PRESERVED as safety net until Phase 4            │
│     If pipeline fails, landing files remain for retry                       │
│                                                                             │
│  Code: copy_file_fabric(), clear_directory()                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  PHASE 1: DATA PROFILING                                                    │
│  ═══════════════════════                                                    │
│                                                                             │
│  📂 Input:  /Bronze/*.parquet                                               │
│  📂 Output: /config/data_quality/*_validation.yml                          │
│                                                                             │
│  Actions:                                                                   │
│  1. Scan Bronze layer for all parquet files                                │
│  2. For each file (parallel workers):                                       │
│     • Load sample (100K rows default)                                       │
│     • Profile columns (types, nulls, unique values, patterns)              │
│     • Generate Great Expectations suite                                     │
│     • Write YAML validation config                                         │
│                                                                             │
│  Code: BatchProfiler.run_parallel_profiling()                              │
│                                                                             │
│  Output Example (table_validation.yml):                                    │
│  ┌─────────────────────────────────────────────────────────────┐            │
│  │ expectations:                                               │            │
│  │   - expect_column_to_exist: id                              │            │
│  │   - expect_column_values_to_not_be_null: id                 │            │
│  │   - expect_column_values_to_be_unique: id                   │            │
│  │   - expect_column_values_to_be_of_type: date_col, datetime  │            │
│  └─────────────────────────────────────────────────────────────┘            │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  PHASE 2: DATA VALIDATION & INGESTION                                       │
│  ════════════════════════════════════                                       │
│                                                                             │
│  📂 Input:  /Bronze/*.parquet + /config/data_quality/*.yml                 │
│  📂 Output: /Silver/*.parquet                                               │
│                                                                             │
│  Actions:                                                                   │
│  1. CLEAR Silver layer (fresh data each run)                               │
│  2. For each Bronze file:                                                   │
│     • Load validation config (YAML)                                        │
│     • Load parquet data                                                     │
│     • Run Great Expectations validation                                     │
│     • Calculate success_rate (% expectations passed)                       │
│                                                                             │
│  Decision:                                                                  │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  IF success_rate >= threshold (default 85%):                        │   │
│  │     ✅ PASSED → Write to Silver layer                               │   │
│  │  ELSE:                                                              │   │
│  │     ❌ FAILED → Skip ingestion                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Code: DataQualityValidator.validate() → df.to_parquet(Silver/)            │
│  Results: validation_results.json                                          │
│                                                                             │
│  ⚠️ Silver is COMPLETE OVERWRITE each run (not append/merge)               │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  PHASE 3: DATA QUALITY MONITORING                                           │
│  ═══════════════════════════════                                            │
│                                                                             │
│  📂 Input:  /config/validation_results/validation_results.json             │
│  📂 Output: Console metrics, execution log                                  │
│                                                                             │
│  Actions:                                                                   │
│  • Load validation results JSON                                             │
│  • Calculate aggregate metrics:                                             │
│    - Average Quality Score                                                  │
│    - Pass Rate (% tables passed)                                            │
│    - Tables monitored count                                                 │
│  • Generate summary DataFrame                                               │
│                                                                             │
│  Output Metrics:                                                            │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Table          │ Success % │ Status │ Evaluated │ Successful       │   │
│  │  ────────────── │ ───────── │ ────── │ ───────── │ ──────────       │   │
│  │  customers      │   91.7    │ Passed │    12     │    11           │   │
│  │  orders         │   87.5    │ Passed │    16     │    14           │   │
│  │  products       │   70.0    │ Failed │    10     │     7           │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Pipeline Success Check: success_rate >= 80% → proceed to Phase 4          │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                    ┌───────────────┴───────────────┐
                    │                               │
              success >= 80%                  success < 80%
                    │                               │
                    ▼                               ▼
┌─────────────────────────────────┐  ┌─────────────────────────────────┐
│  PHASE 4: ARCHIVE & CLEANUP     │  │  ⚠️ PHASE 4 SKIPPED             │
│  ═════════════════════════════  │  │                                 │
│                                 │  │  Landing preserved for:         │
│  📂 Input:  /landing/*.parquet  │  │  • Investigation                │
│  📂 Output: /archive/YYYYMMDD_  │  │  • Retry after fixes            │
│             HHMMSS/             │  │                                 │
│                                 │  │  Re-run pipeline after          │
│  Actions:                       │  │  resolving issues               │
│  1. Create timestamped archive  │  └─────────────────────────────────┘
│     folder                      │
│                                 │
│  2. COPY landing files →        │
│     /archive/YYYYMMDD_HHMMSS/   │
│                                 │
│  3. CLEAR landing zone          │
│     (only after successful      │
│      archive)                   │
│                                 │
│  4. Save manifest.json with:    │
│     • archive_date              │
│     • pipeline_run timestamp    │
│     • files_archived list       │
│     • validation_summary        │
│     • success_rate              │
│                                 │
│  Code: copy_file_fabric(),      │
│        delete_file_fabric()     │
└─────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  ✅ PIPELINE COMPLETE                                                        │
│                                                                             │
│  Final State:                                                               │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  /landing/  → EMPTY (ready for next SFTP batch)                     │   │
│  │  /Bronze/   → Current batch raw data (CLEARED next run)             │   │
│  │  /Silver/   → Validated data (CLEARED next run)                     │   │
│  │  /archive/  → Source of truth (timestamped historical backups)      │   │
│  │  /Gold/     → (Future: aggregated/business-ready data)              │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Recovery: Copy from /archive/YYYYMMDD_HHMMSS/ back to /landing/           │
│            and re-run pipeline to regenerate Bronze & Silver               │
│                                                                             │
│  Execution log saved to: /config/validation_results/orchestration_log_*.json│
└─────────────────────────────────────────────────────────────────────────────┘
```
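
Phase 4 in the diagram above can be sketched for a local filesystem as follows. This is a minimal illustration, not the notebook's implementation: `archive_landing` is a hypothetical helper, it uses `shutil` rather than `copy_file_fabric()` and `delete_file_fabric()`, and the formats of the manifest values are assumptions (the diagram names the fields but not their formats).

```python
import json
import shutil
from datetime import datetime
from pathlib import Path


def archive_landing(landing_dir, archive_root, validation_summary, success_rate):
    """Copy landing parquet files to a timestamped archive folder, then clear landing.

    Hypothetical helper for illustration; local filesystem only.
    """
    landing_dir, archive_root = Path(landing_dir), Path(archive_root)
    now = datetime.now()

    # 1. Create the timestamped archive folder (YYYYMMDD_HHMMSS).
    archive_dir = archive_root / now.strftime("%Y%m%d_%H%M%S")
    archive_dir.mkdir(parents=True, exist_ok=True)

    # 2. Copy every file first; a failed copy raises before anything is deleted.
    files = sorted(landing_dir.glob("*.parquet"))
    for src in files:
        shutil.copy2(src, archive_dir / src.name)

    # 3. Clear landing only once every copy is present in the archive.
    if all((archive_dir / f.name).exists() for f in files):
        for src in files:
            src.unlink()

    # 4. Save the manifest next to the archived files (value formats assumed).
    manifest = {
        "archive_date": now.strftime("%Y-%m-%d"),
        "pipeline_run": now.isoformat(),
        "files_archived": [f.name for f in files],
        "validation_summary": validation_summary,
        "success_rate": success_rate,
    }
    (archive_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return archive_dir
```

Copying all files before deleting any keeps the landing zone intact if a copy fails, which mirrors the "only after successful archive" rule in the Phase 4 box.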

---

## Data Flow Summary

```
  SFTP → /landing/ ──COPY──► /Bronze/ ──VALIDATE──► /Silver/
              │                  │
              │                  └── (regeneratable)
              │
              └──ARCHIVE──► /archive/YYYYMMDD_HHMMSS/  (source of truth)
```

| Layer | Archived? | Cleared Each Run? | Purpose |
|-------|-----------|-------------------|---------|
| `/landing/` | ✅ Yes | ✅ Phase 4 | Raw SFTP files (temporary) |
| `/Bronze/` | ❌ No | ✅ Phase 0 | Working copy (regeneratable) |
| `/Silver/` | ❌ No | ✅ Phase 2 | Validated data (regeneratable) |
| `/archive/` | N/A | ❌ Never | Historical source of truth |
| `/Gold/` | TBD | TBD | Future: Business aggregates |
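
Because `/archive/` is the only layer that is never cleared, recovery amounts to copying one archived run back into `/landing/` and re-running the pipeline. The sketch below shows that step for a local filesystem; `restore_from_archive` is a hypothetical name, and on Fabric the copy would need to go through the notebook's Fabric file helpers instead of `shutil`.

```python
import shutil
from pathlib import Path


def restore_from_archive(archive_run_dir, landing_dir):
    """Copy archived parquet files back into landing so the pipeline can be re-run.

    Hypothetical helper for illustration; local filesystem only.
    """
    archive_run_dir, landing_dir = Path(archive_run_dir), Path(landing_dir)
    landing_dir.mkdir(parents=True, exist_ok=True)

    restored = []
    for src in sorted(archive_run_dir.glob("*.parquet")):
        # Copy rather than move: the archive stays the source of truth.
        shutil.copy2(src, landing_dir / src.name)
        restored.append(src.name)
    return restored
```

Only `*.parquet` files are restored, so `manifest.json` stays in the archive folder and does not end up in the landing zone.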

---

## 🔧 Environment Detection & Validation

In [None]:
# --- UNIFIED CONFIGURATION ---
import sys
import time
from pathlib import Path
from datetime import datetime

print("Starting configuration...")
start_time = time.time()

# Add project root to path for imports
project_root = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import centralized configuration
try:
    from notebooks.config import settings
    from notebooks.lib import platform_utils, logging_utils
    from notebooks.lib.storage import StorageManager
    
    IS_FABRIC = platform_utils.IS_FABRIC
    BASE_DIR = settings.base_dir
    BRONZE_DIR = settings.bronze_dir
    SILVER_DIR = settings.silver_dir
    GOLD_DIR = settings.gold_dir
    CONFIG_DIR = settings.config_dir
    RESULTS_DIR = settings.validation_results_dir
    NUM_WORKERS = settings.max_workers
    SAMPLE_SIZE = settings.sample_size
    STORAGE_FORMAT = settings.storage_format
    storage_manager = StorageManager()
    logger = logging_utils.setup_notebook_logger("orchestration")
    
    print(f"✅ Loaded configuration for environment: {settings.environment}")
    print(f"   Platform: {'Microsoft Fabric' if IS_FABRIC else 'Local'}")
    
except ImportError as e:
    print(f"⚠️ Falling back to inline configuration: {e}")
    
    IS_FABRIC = Path("/lakehouse/default/Files").exists()
    
    if IS_FABRIC:
        BASE_DIR = Path("/lakehouse/default/Files")
    else:
        _notebook_dir = Path.cwd()
        _candidate = _notebook_dir
        for _ in range(5):
            if (_candidate / "aims_data_platform").exists() or (_candidate / "pyproject.toml").exists():
                BASE_DIR = _candidate
                break
            _candidate = _candidate.parent
        else:
            BASE_DIR = _notebook_dir.parent if _notebook_dir.name == "notebooks" else _notebook_dir
    
    BRONZE_DIR = BASE_DIR / "Bronze"
    SILVER_DIR = BASE_DIR / "Silver"
    GOLD_DIR = BASE_DIR / "Gold"
    CONFIG_DIR = BASE_DIR / "config" / "data_quality"
    RESULTS_DIR = BASE_DIR / "config" / "validation_results"
    
    NUM_WORKERS = 8 if IS_FABRIC else 4
    SAMPLE_SIZE = 100000
    STORAGE_FORMAT = "parquet"
    storage_manager = None
    logger = None
    
    print(f"   Using fallback configuration for {'Fabric' if IS_FABRIC else 'Local'}")

# Define landing and archive directories
LANDING_DIR = BASE_DIR / "landing"
ARCHIVE_DIR = BASE_DIR / "archive"

print(f"Environment detection took {time.time() - start_time:.2f}s")

# ============================================================================
# FABRIC PATH HELPER & FILE OPERATIONS
# ============================================================================
def fabric_path(path):
    """Convert Path to Fabric-compatible string (relative from lakehouse root)"""
    path_str = str(path)
    if "/lakehouse/default/" in path_str:
        return path_str.replace("/lakehouse/default/", "")
    return path_str

def list_parquet_files(directory):
    """List parquet files in directory (works on both Fabric and local)"""
    files = []
    if IS_FABRIC:
        try:
            from notebookutils import mssparkutils
            fab_dir = fabric_path(directory)
            try:
                files_info = mssparkutils.fs.ls(fab_dir)
                files = [f for f in files_info if f.name.endswith('.parquet')]
            except Exception:
                pass
        except ImportError:
            if Path(directory).exists():
                files = list(Path(directory).glob("*.parquet"))
    else:
        if Path(directory).exists():
            files = list(Path(directory).glob("*.parquet"))
    return files

def copy_file_fabric(src, dst):
    """Copy file (works on both Fabric and local)"""
    if IS_FABRIC:
        from notebookutils import mssparkutils
        mssparkutils.fs.cp(fabric_path(src), fabric_path(dst))
    else:
        import shutil
        shutil.copy2(str(src), str(dst))

def delete_file_fabric(path):
    """Delete file (works on both Fabric and local)"""
    if IS_FABRIC:
        from notebookutils import mssparkutils
        mssparkutils.fs.rm(fabric_path(path))
    else:
        Path(path).unlink()

def clear_directory(directory):
    """Clear all parquet files from a directory (works on both Fabric and local)"""
    cleared = 0
    if IS_FABRIC:
        try:
            from notebookutils import mssparkutils
            fab_dir = fabric_path(directory)
            try:
                files = mssparkutils.fs.ls(fab_dir)
                for f in files:
                    if f.name.endswith('.parquet'):
                        mssparkutils.fs.rm(f"{fab_dir}/{f.name}")
                        cleared += 1
            except Exception:
                pass
        except ImportError:
            pass
    else:
        if Path(directory).exists():
            for f in Path(directory).glob("*.parquet"):
                f.unlink()
                cleared += 1
    return cleared

def ensure_dir_exists(directory):
    """Ensure directory exists (works on both Fabric and local)"""
    if IS_FABRIC:
        try:
            from notebookutils import mssparkutils
            fab_dir = fabric_path(directory)
            try:
                mssparkutils.fs.ls(fab_dir)
            except Exception:
                mssparkutils.fs.mkdirs(fab_dir)
        except ImportError:
            Path(directory).mkdir(exist_ok=True, parents=True)
    else:
        Path(directory).mkdir(exist_ok=True, parents=True)

# ============================================================================
# AUTO-CREATE DIRECTORIES
# ============================================================================
print("\n📁 Ensuring directories exist...")
for dir_path in [LANDING_DIR, ARCHIVE_DIR, BRONZE_DIR, SILVER_DIR, GOLD_DIR, CONFIG_DIR, RESULTS_DIR]:
    ensure_dir_exists(dir_path)
    print(f"   ✓ {dir_path.name}/")

# ============================================================================
# PHASE 0: COPY LANDING → BRONZE (with full Bronze refresh)
# ============================================================================
print("\n" + "="*60)
print("PHASE 0: LANDING ZONE → BRONZE (full refresh)")
print("="*60)

landing_files = list_parquet_files(LANDING_DIR)
print(f"📂 Scanning landing zone: {LANDING_DIR}")
print(f"   Found {len(landing_files)} parquet files in landing")

# Track files for Phase 4 cleanup
LANDING_FILES_TO_ARCHIVE = []

if len(landing_files) > 0:
    # Step 1: Clear Bronze layer (fresh data each run)
    print(f"\n🧹 Clearing Bronze layer for fresh data...")
    bronze_cleared = clear_directory(BRONZE_DIR)
    print(f"   Removed {bronze_cleared} old files from Bronze")
    
    # Step 2: Copy new files from landing to Bronze
    print(f"\n📋 Copying {len(landing_files)} files to Bronze...")
    copied_count = 0
    
    for f in landing_files:
        # Get filename (Fabric FileInfo and Path objects both expose .name)
        filename = f.name
        
        # Build paths
        if IS_FABRIC:
            src_path = f"{LANDING_DIR}/{filename}"
            bronze_path = f"{BRONZE_DIR}/{filename}"
        else:
            src_path = LANDING_DIR / filename
            bronze_path = BRONZE_DIR / filename
        
        try:
            # Copy to Bronze (keep original in landing as safety net)
            copy_file_fabric(src_path, bronze_path)
            print(f"   ✅ Copied to Bronze: {filename}")
            copied_count += 1
            LANDING_FILES_TO_ARCHIVE.append(filename)
        except Exception as e:
            print(f"   ❌ Copy failed: {filename} - {e}")
    
    print(f"\n   📊 Phase 0 Summary:")
    print(f"      Bronze cleared: {bronze_cleared} old files removed")
    print(f"      Copied to Bronze: {copied_count}/{len(landing_files)}")
    print(f"      Landing preserved until Phase 4 (safety net)")
else:
    print("   ℹ️ No new files in landing zone (waiting for SFTP data)")

# ============================================================================
# SCAN BRONZE DATA
# ============================================================================
print(f"\n📂 Scanning Bronze directory: {BRONZE_DIR}")
parquet_files = list_parquet_files(BRONZE_DIR)
print(f"   Found {len(parquet_files)} parquet files")

# ============================================================================
# SUMMARY
# ============================================================================
print(f"\n📊 Configuration Summary:")
print(f"   Environment: {'Fabric' if IS_FABRIC else 'Local'}")
print(f"   Landing: {LANDING_DIR} (preserved until Phase 4)")
print(f"   Archive: {ARCHIVE_DIR}")
print(f"   Bronze (Source): {BRONZE_DIR} (CLEARED each run)")
print(f"   Silver (Target): {SILVER_DIR} (CLEARED each run)")
print(f"   Config Dir: {CONFIG_DIR}")
print(f"   Workers: {NUM_WORKERS}")
print(f"   Bronze files: {len(parquet_files)}")

if len(parquet_files) == 0:
    print(f"\n⚠️ No parquet files in Bronze directory.")
    print(f"   Upload data to Files/landing/ and re-run this cell.")
    BRONZE_DATA_AVAILABLE = False
else:
    print(f"\n✅ Ready to process {len(parquet_files)} files")
    BRONZE_DATA_AVAILABLE = True

print(f"\nConfiguration complete in {time.time() - start_time:.2f}s")

Starting configuration...
✅ Loaded configuration for environment: local
   Platform: Local
Environment detection took 0.00s
Scanning Bronze directory: /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/1_AIMS_LOCAL_2026/data/Samples_LH_Bronze_Aims_26_parquet

Configuration Summary:
   Environment: Local
   Bronze (Source): /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/1_AIMS_LOCAL_2026/data/Samples_LH_Bronze_Aims_26_parquet
   Silver (Target): /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/1_AIMS_LOCAL_2026/data/Silver
   Config Dir: /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/1_AIMS_LOCAL_2026/config/data_quality
   Workers: 4
   Found 68 parquet files
Configuration complete in 0.00s


## 📦 Package Installation & Imports

In [2]:
#!pip install --quiet --upgrade great-expectations==0.18.8 ydata-profiling==4.5.1 pyarrow fastparquet

In [3]:
import os
import json
import time  # used by the fallback timed_operation below
import pandas as pd
from datetime import datetime

# Disable Great Expectations analytics to speed up import
os.environ["GX_ANALYTICS_ENABLED"] = "False"

# Use the local library to ensure end-to-end alignment
from aims_data_platform import BatchProfiler, DataQualityValidator, DataLoader, ConfigLoader

# Import logging utilities if available
try:
    from notebooks.lib.logging_utils import timed_operation, log_phase
    LOGGING_UTILS_AVAILABLE = True
except ImportError:
    LOGGING_UTILS_AVAILABLE = False
    # Fallback: simple context manager
    from contextlib import contextmanager
    @contextmanager
    def timed_operation(description, logger=None):
        print(f"⏱️ {description}...")
        start = time.time()
        yield
        print(f"⏱️ {description} completed in {time.time() - start:.2f}s")

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## ⚙️ Pipeline Configuration

In [None]:
# Pipeline configuration - using settings where available
try:
    # Use settings pipeline_phases if available
    PIPELINE_CONFIG = {
        "run_profiling": settings.pipeline_phases.get("profiling", True),
        "run_ingestion": settings.pipeline_phases.get("ingestion", True),
        "run_monitoring": settings.pipeline_phases.get("monitoring", True),
        "run_dq_modeling": settings.pipeline_phases.get("dq_modeling", False),
        "run_bi_analytics": settings.pipeline_phases.get("bi_analytics", False),
        "force_reprocess": False,
        "dq_threshold": settings.get_dq_threshold("medium"),
        "max_workers": settings.max_workers,
        "continue_on_error": False,
    }
except (NameError, AttributeError):
    # Fallback to hardcoded defaults
    PIPELINE_CONFIG = {
        "run_profiling": True,
        "run_ingestion": True,
        "run_monitoring": True,
        "run_dq_modeling": False,
        "run_bi_analytics": False,
        "force_reprocess": False,
        "dq_threshold": 85.0,
        "max_workers": NUM_WORKERS,
        "continue_on_error": False,
    }

# Check if Bronze data is available (normally set in the configuration cell)
try:
    BRONZE_DATA_AVAILABLE
except NameError:
    BRONZE_DATA_AVAILABLE = len(parquet_files) > 0 if 'parquet_files' in dir() else False

# Disable the data processing phases whenever there is nothing to process
if not BRONZE_DATA_AVAILABLE:
    print("⚠️ No Bronze data available - disabling data processing phases")
    PIPELINE_CONFIG["run_profiling"] = False
    PIPELINE_CONFIG["run_ingestion"] = False

# Display Configuration
print("⚙️ Pipeline Configuration:")
for key, value in PIPELINE_CONFIG.items():
    print(f"   {key}: {value}")

# Initialize Execution Log
execution_log = {
    "start_time": datetime.now().isoformat(),
    "environment": "Fabric" if IS_FABRIC else "Local",
    "storage_format": STORAGE_FORMAT,
    "bronze_data_available": BRONZE_DATA_AVAILABLE,
    "config": PIPELINE_CONFIG,
    "phases": []
}

if BRONZE_DATA_AVAILABLE:
    print("\n✅ Configuration Complete - Ready to process data")
else:
    print("\n⚠️ Configuration Complete - Waiting for Bronze data")
    print("   Upload parquet files to the landing directory and re-run the configuration cell to begin processing")

⚙️ Pipeline Configuration:
   run_profiling: True
   run_ingestion: True
   run_monitoring: True
   run_dq_modeling: False
   run_bi_analytics: False
   force_reprocess: False
   dq_threshold: 85.0
   max_workers: 4
   continue_on_error: False

✅ Configuration Complete


## 🚀 Phase 1: Data Profiling

**Purpose:** Generate DQ validation configs for all Bronze layer tables

**Process:**
1. Profile each Bronze parquet file
2. Generate validation YAML configs
3. Save configs to `config/data_quality/`
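
To make the profiling step concrete, the sketch below derives three of the expectation types shown in the architecture diagram from a small sample of rows. It is a simplified stand-in, not what `BatchProfiler` does internally: `profile_rows` is a hypothetical function, it works on plain dictionaries rather than parquet data, and its rules (no nulls in the sample, all values distinct) are assumptions chosen for illustration.

```python
def profile_rows(rows):
    """Derive simple expectations from a sample of rows (a list of dicts).

    Hypothetical sketch; assumes every row has the same keys and scalar values.
    """
    if not rows:
        return []

    expectations = []
    for column in rows[0].keys():
        values = [row.get(column) for row in rows]
        non_null = [v for v in values if v is not None]

        # Every profiled column is expected to exist in future batches.
        expectations.append({"expect_column_to_exist": column})

        # Only expect "not null" when the sample contains no nulls at all.
        if len(non_null) == len(values):
            expectations.append({"expect_column_values_to_not_be_null": column})

        # Only expect uniqueness when every sampled value is present and distinct.
        if non_null and len(set(non_null)) == len(values):
            expectations.append({"expect_column_values_to_be_unique": column})
    return expectations


sample = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 3, "name": "a"},
]
suite = profile_rows(sample)
# "id" gets all three expectations; "name" only gets expect_column_to_exist.
```

A profiler working from a sample can only state what held in that sample, which is one reason the validation phase compares a success rate against a threshold rather than requiring every expectation to pass.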

In [5]:
if PIPELINE_CONFIG["run_profiling"]:
    print("\n" + "="*80)
    print("PHASE 1: DATA PROFILING")
    print("="*80)
    
    phase_start = datetime.now()
    
    try:
        with timed_operation("Phase 1: Data Profiling", logger):
            # Import profiling modules
            from aims_data_platform import BatchProfiler
            
            print(f"\n📊 Profiling Bronze layer: {BRONZE_DIR}")
            print(f"   Workers: {PIPELINE_CONFIG['max_workers']}")
            print(f"   Output: {CONFIG_DIR}")
            
            # Define custom thresholds
            custom_thresholds = {
                "severity_threshold": "medium",
                "null_tolerance": 5.0,
                "include_structural": True,
                "include_completeness": True,
                "include_validity": True
            }
            
            # Run parallel profiling using BatchProfiler
            results = BatchProfiler.run_parallel_profiling(
                input_dir=str(BRONZE_DIR),
                output_dir=str(CONFIG_DIR),
                workers=PIPELINE_CONFIG['max_workers'],
                sample_size=SAMPLE_SIZE,
                thresholds=custom_thresholds
            )
            
            # Count successes and errors
            success_results = [r for r in results if r.get('status') == 'success']
            error_results = [r for r in results if r.get('status') != 'success']
            
            # Display results
            print(f"\n✅ Profiling Complete:")
            print(f"   Files Profiled: {len(success_results)}")
            print(f"   Configs Generated: {len(list(CONFIG_DIR.glob('*.yml')))}")
            if error_results:
                print(f"   Errors: {len(error_results)}")
                for err in error_results[:5]:
                    print(f"      - {err.get('file', 'unknown')}: {err.get('error', 'unknown error')}")
        
        # Log phase execution
        execution_log["phases"].append({
            "phase": "profiling",
            "status": "success" if len(error_results) == 0 else "partial",
            "duration_seconds": (datetime.now() - phase_start).total_seconds(),
            "files_profiled": len(success_results),
            "configs_generated": len(list(CONFIG_DIR.glob('*.yml'))),
            "errors": len(error_results)
        })
        
    except Exception as e:
        print(f"\n❌ Profiling Failed: {e}")
        import traceback
        traceback.print_exc()
        
        execution_log["phases"].append({
            "phase": "profiling",
            "status": "failed",
            "error": str(e),
            "duration_seconds": (datetime.now() - phase_start).total_seconds()
        })
        
        if not PIPELINE_CONFIG.get("continue_on_error", False):
            raise
else:
    print("⏭️ Skipping Phase 1: Data Profiling (disabled in config)")


PHASE 1: DATA PROFILING
2026-01-19 13:09:35 | INFO     | orchestration | ⏱️ Phase 1: Data Profiling...

📊 Profiling Bronze layer: /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/1_AIMS_LOCAL_2026/data/Samples_LH_Bronze_Aims_26_parquet
   Workers: 4
   Output: /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/1_AIMS_LOCAL_2026/config/data_quality

✅ Profiling Complete:
   Files Profiled: 68
   Configs Generated: 68
2026-01-19 13:09:43 | INFO     | orchestration | ⏱️ Phase 1: Data Profiling completed in 7.52s


## ✅ Phase 2: Data Validation & Ingestion

**Purpose:** Validate Bronze data and ingest to Silver layer

**Process:**
1. Load validation configs
2. Validate each Bronze table
3. Write tables that meet the DQ threshold to Silver (Delta Lake in Fabric, Parquet locally)
4. Skip ingestion for tables that fall below the threshold
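
The per-table decision follows the decision box in the architecture diagram: compute the share of expectations that passed and compare it with the threshold. The sketch below illustrates that rule only. `validate_and_route` and its input shape (table name mapped to evaluated and successful counts) are assumptions, and the real cell runs `DataQualityValidator` against parquet data rather than taking counts as input.

```python
def validate_and_route(tables, threshold=85.0):
    """Decide which tables go to Silver based on their expectation success rate.

    `tables` maps a table name to (evaluated, successful) expectation counts.
    Hypothetical sketch; the counts would come from the validation run.
    """
    summary = {"total": len(tables), "passed": 0, "failed": 0}
    to_silver = []

    for name, (evaluated, successful) in sorted(tables.items()):
        # A table with no evaluated expectations is treated as failed.
        success_rate = 100.0 * successful / evaluated if evaluated else 0.0

        if success_rate >= threshold:
            summary["passed"] += 1
            to_silver.append(name)   # PASSED: write to Silver
        else:
            summary["failed"] += 1   # FAILED: skip ingestion
    return to_silver, summary


# Counts taken from the example metrics table in the architecture diagram.
to_silver, summary = validate_and_route({
    "customers": (12, 11),
    "orders": (16, 14),
    "products": (10, 7),
})
# to_silver == ["customers", "orders"]
# summary == {"total": 3, "passed": 2, "failed": 1}
```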

In [None]:
if PIPELINE_CONFIG["run_ingestion"]:
    print("\n" + "="*80)
    print("PHASE 2: DATA VALIDATION & INGESTION")
    print("="*80)
    
    phase_start = datetime.now()
    
    # Helper function to write JSON (works on both Fabric and local)
    def write_json_file(file_path, data):
        """Write JSON file - uses mssparkutils on Fabric, standard IO locally"""
        content = json.dumps(data, indent=2)
        if IS_FABRIC:
            from notebookutils import mssparkutils
            fab_path = fabric_path(file_path)
            mssparkutils.fs.put(fab_path, content, overwrite=True)
        else:
            with open(str(file_path), 'w') as f:
                f.write(content)
    
    # Helper function to read JSON (works on both Fabric and local)
    def read_json_file(file_path):
        """Read JSON file - uses mssparkutils on Fabric, standard IO locally"""
        if IS_FABRIC:
            from notebookutils import mssparkutils
            fab_path = fabric_path(file_path)
            # head() returns at most ~1 MB; a larger file would be truncated and fail to parse
            content = mssparkutils.fs.head(fab_path, 1000000)
            return json.loads(content)
        else:
            with open(str(file_path), 'r') as f:
                return json.load(f)
    
    # Helper to check if file exists (works on both Fabric and local)
    def file_exists_fabric(file_path):
        """Check if file exists - works on both Fabric and local"""
        if IS_FABRIC:
            try:
                from notebookutils import mssparkutils
                fab_path = fabric_path(file_path)
                mssparkutils.fs.head(fab_path, 1)
                return True
            except Exception:
                return False
        else:
            return Path(file_path).exists()
    
    try:
        with timed_operation("Phase 2: Validation & Ingestion", logger):
            # Import validation modules
            from aims_data_platform import DataQualityValidator, DataLoader
            
            # Clear Silver layer for complete overwrite (no append/delta)
            if storage_manager is not None:
                clear_result = storage_manager.clear_layer("silver")
                print(f"   Cleared Silver layer: {clear_result['files_cleared']} tables removed")
            elif IS_FABRIC:
                # Use mssparkutils to clear Silver on Fabric
                try:
                    from notebookutils import mssparkutils
                    fab_silver = fabric_path(SILVER_DIR)
                    try:
                        existing = mssparkutils.fs.ls(fab_silver)
                        for f in existing:
                            if f.name.endswith('.parquet'):
                                mssparkutils.fs.rm(f"{fab_silver}/{f.name}")
                        print(f"   Cleared Silver directory for fresh write")
                    except Exception:
                        print(f"   Silver directory empty or doesn't exist yet")
                except ImportError:
                    pass
            elif SILVER_DIR.exists():
                for f in SILVER_DIR.glob("*.parquet"):
                    f.unlink()
                print(f"   Cleared Silver directory for fresh write")
            
            # Track validation results
            validation_results = {
                "timestamp": datetime.now().isoformat(),
                "threshold": PIPELINE_CONFIG['dq_threshold'],
                "storage_format": STORAGE_FORMAT,
                "files": {},
                "summary": {"total": 0, "passed": 0, "failed": 0, "skipped": 0, "errors": 0}
            }
            
            validation_results["summary"]["total"] = len(parquet_files)
            print(f"   Found {len(parquet_files)} parquet files to validate\n")
            
            # Sort files by name (handles both FileInfo and Path objects)
            sorted_files = sorted(parquet_files, key=lambda f: f.name if hasattr(f, 'name') else str(f))
            
            # Track tables for Delta persistence
            TABLES_TO_PERSIST = []
            
            # Validate each file
            for parquet_file in sorted_files:
                # Handle both Fabric FileInfo and Path objects
                if hasattr(parquet_file, 'path'):
                    # Fabric FileInfo object
                    file_name = parquet_file.name
                    table_name = file_name.replace('.parquet', '')
                    file_path = str(BRONZE_DIR / file_name)
                else:
                    # Local Path object
                    file_name = parquet_file.name
                    table_name = parquet_file.stem
                    file_path = str(parquet_file)
                
                config_file = CONFIG_DIR / f"{table_name}_validation.yml"
                
                # Check if config exists (file_exists_fabric handles both Fabric and local)
                config_exists = file_exists_fabric(config_file)
                
                if not config_exists:
                    print(f"⚠️ SKIPPED: {file_name} (no config)")
                    validation_results["summary"]["skipped"] += 1
                    continue
                
                try:
                    # Load validator and data
                    validator = DataQualityValidator(config_path=str(config_file))
                    
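                    # Use DataLoader for safe loading (handles sampling).
                    # NOTE: df_batch is also what gets written to Silver below, so if
                    # SAMPLE_SIZE limits the rows loaded, only that sample is ingested.
                    df_batch = DataLoader.load_data(file_path, sample_size=SAMPLE_SIZE)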
                    result = validator.validate(df_batch)
                    
                    # Store results
                    validation_results["files"][table_name] = {
                        "overall_success": result.get('success', False),
                        "success_percentage": result.get('success_rate', 0.0),
                        "statistics": {
                            "evaluated_expectations": result.get('evaluated_checks', 0),
                            "successful_expectations": result.get('successful_checks', 0)
                        }
                    }
                    
                    # Update summary
                    if result.get('success', False):
                        validation_results["summary"]["passed"] += 1
                        print(f"✅ PASSED: {file_name} ({result.get('success_rate', 0):.1f}%)")
                        
                        # Ingest to Silver layer
                        silver_file = SILVER_DIR / f"{table_name}.parquet"
                        
                        if storage_manager is not None:
                            try:
                                storage_manager.write_to_silver(df_batch, table_name)
                                print(f"   → Ingested to Silver via StorageManager: {table_name}")
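                            except Exception as sm_err:
                                # Report why StorageManager failed before falling back to a direct write
                                print(f"   ⚠️ StorageManager write failed: {sm_err}")
                                df_batch.to_parquet(str(silver_file), index=False, engine='pyarrow')
                                print(f"   → Ingested to Silver (fallback): {silver_file.name}")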
                        else:
                            df_batch.to_parquet(str(silver_file), index=False, engine='pyarrow')
                            print(f"   → Ingested to Silver: {silver_file.name}")
                        
                        # Track for table persistence
                        TABLES_TO_PERSIST.append(table_name)
                            
                    else:
                        validation_results["summary"]["failed"] += 1
                        print(f"❌ FAILED: {file_name} ({result.get('success_rate', 0):.1f}%)")
                        
                except Exception as e:
                    validation_results["summary"]["errors"] += 1
                    print(f"💥 ERROR: {file_name} - {e}")
            
            # Save validation results using Fabric-compatible write
            results_file = RESULTS_DIR / "validation_results.json"
            write_json_file(results_file, validation_results)
            
            # Display summary
            print(f"\n{'='*70}")
            print("VALIDATION SUMMARY")
            print(f"{'='*70}")
            summary = validation_results["summary"]
            print(f"Total Files:  {summary['total']}")
            print(f"✅ Passed:     {summary['passed']}")
            print(f"❌ Failed:     {summary['failed']}")
            print(f"⚠️  Skipped:    {summary['skipped']}")
            print(f"💥 Errors:     {summary['errors']}")
            print(f"\nPass Rate: {(summary['passed']/max(summary['total'], 1)*100):.1f}%")
            print(f"Results saved to: {results_file}")
            print(f"{'='*70}")
            
            # ================================================================
            # PERSIST SILVER TO DELTA TABLES (Full Overwrite)
            # ================================================================
            if len(TABLES_TO_PERSIST) > 0:
                print(f"\n{'='*70}")
                print("PERSISTING SILVER TO TABLES (Full Overwrite)")
                print(f"{'='*70}")
                
                tables_created = 0
                tables_failed = 0
                
                if IS_FABRIC:
                    # Use Spark to create Delta tables in Fabric
                    try:
                        from pyspark.sql import SparkSession
                        spark = SparkSession.builder.getOrCreate()
                        
                        for table_name in TABLES_TO_PERSIST:
                            try:
                                # Build the Fabric-compatible path for Spark
                                # Spark in Fabric needs relative path: Files/Silver/table.parquet
                                silver_spark_path = f"Files/Silver/{table_name}.parquet"
                                
                                # Read parquet into Spark DataFrame using Fabric path
                                spark_df = spark.read.parquet(silver_spark_path)
                                
                                # Write as managed Delta table with OVERWRITE mode.
                                # This replaces the table's contents (and, with overwriteSchema,
                                # its schema) with the fresh data.
                                spark_df.write \
                                    .format("delta") \
                                    .mode("overwrite") \
                                    .option("overwriteSchema", "true") \
                                    .saveAsTable(f"silver_{table_name}")
                                
                                print(f"   ✅ Table created: silver_{table_name} ({spark_df.count()} rows)")
                                tables_created += 1
                                
                            except Exception as table_err:
                                print(f"   ❌ Table failed: silver_{table_name} - {table_err}")
                                tables_failed += 1
                        
                    except ImportError as spark_err:
                        print(f"   ⚠️ Spark not available: {spark_err}")
                        print(f"   Tables not created - Silver parquet files are available")
                else:
                    # Local environment - skip Delta table creation
                    print(f"   ℹ️ Local environment - Delta tables not created")
                    print(f"   Silver parquet files available at: {SILVER_DIR}")
                    print(f"   Tables to create on Fabric: {TABLES_TO_PERSIST}")
                
                print(f"\n   📊 Table Persistence Summary:")
                print(f"      Tables Created: {tables_created}")
                print(f"      Tables Failed: {tables_failed}")
                print(f"      Mode: FULL OVERWRITE (drop & reload)")
        
        # Log phase execution
        execution_log["phases"].append({
            "phase": "validation_ingestion",
            "status": "success",
            "duration_seconds": (datetime.now() - phase_start).total_seconds(),
            "validation_summary": validation_results["summary"],
            "tables_persisted": len(TABLES_TO_PERSIST) if IS_FABRIC else 0
        })
        
    except Exception as e:
        print(f"\n❌ Validation/Ingestion Failed: {e}")
        import traceback
        traceback.print_exc()
        
        execution_log["phases"].append({
            "phase": "validation_ingestion",
            "status": "failed",
            "error": str(e),
            "duration_seconds": (datetime.now() - phase_start).total_seconds()
        })
        
        if not PIPELINE_CONFIG.get("continue_on_error", False):
            raise


PHASE 2: DATA VALIDATION & INGESTION
2026-01-19 13:09:47 | INFO     | orchestration | ⏱️ Phase 2: Validation & Ingestion...
   Found 68 parquet files to validate

✅ PASSED: aims_activitydates.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_activitydates

✅ PASSED: aims_assetattributes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_assetattributes

✅ PASSED: aims_assetclassattributes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_assetclassattributes

✅ PASSED: aims_assetclasschangelogs.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_assetclasschangelogs

✅ PASSED: aims_assetclasses.parquet (99.1%)
   → Ingested to Silver via StorageManager: aims_assetclasses

✅ PASSED: aims_assetclassrelationships.parquet (91.5%)
   → Ingested to Silver via StorageManager: aims_assetclassrelationships

✅ PASSED: aims_assetconsents.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_assetconsents

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 85.7%)
❌ FAILED: aims_assethierarchymap.parquet (92.9%)

Validation FAILED: 3 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 97.4%)
❌ FAILED: aims_assetlocations.parquet (97.0%)

✅ PASSED: aims_assets.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_assets

✅ PASSED: aims_attributedomains.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_attributedomains

✅ PASSED: aims_attributedomainvalues.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_attributedomainvalues

✅ PASSED: aims_attributegroups.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_attributegroups

✅ PASSED: aims_attributes.parquet (96.7%)
   → Ingested to Silver via StorageManager: aims_attributes

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 90.9%)
❌ FAILED: aims_consentlinks.parquet (95.5%)

✅ PASSED: aims_consentmilestones.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_consentmilestones

✅ PASSED: aims_consentmilestonetypes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_consentmilestonetypes

✅ PASSED: aims_consents.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_consents

✅ PASSED: aims_consenttypemilestones.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_consenttypemilestones

✅ PASSED: aims_consenttypes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_consenttypes

✅ PASSED: aims_informationneedassetclass.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_informationneedassetclass

✅ PASSED: aims_informationneedattributes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_informationneedattributes

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 93.3%)
❌ FAILED: aims_informationneeddocs.parquet (96.7%)

✅ PASSED: aims_informationneedgeometries.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_informationneedgeometries

✅ PASSED: aims_informationneedlinks.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_informationneedlinks

✅ PASSED: aims_informationneedpropchngs.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_informationneedpropchngs

✅ PASSED: aims_informationneeds.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_informationneeds

✅ PASSED: aims_informationneedsourcedocs.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_informationneedsourcedocs

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 90.0%)
❌ FAILED: aims_informationneedstatusupd.parquet (95.5%)

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 90.9%)
❌ FAILED: aims_informationpackages.parquet (95.5%)

✅ PASSED: aims_links.parquet (97.6%)
   → Ingested to Silver via StorageManager: aims_links

✅ PASSED: aims_linktypes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_linktypes

✅ PASSED: aims_noncompliances.parquet (98.7%)
   → Ingested to Silver via StorageManager: aims_noncompliances

✅ PASSED: aims_organisations.parquet (96.3%)
   → Ingested to Silver via StorageManager: aims_organisations

✅ PASSED: aims_owners.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_owners

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 94.1%)
❌ FAILED: aims_people.parquet (97.6%)

✅ PASSED: aims_phases.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_phases

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 91.7%)
❌ FAILED: aims_productassetclasses.parquet (95.8%)

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 95.2%)
❌ FAILED: aims_productcharacteristics.parquet (97.6%)

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 90.0%)
❌ FAILED: aims_productlinks.parquet (95.0%)

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 95.5%)
❌ FAILED: aims_products.parquet (97.7%)

✅ PASSED: aims_projectitemactions.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_projectitemactions

✅ PASSED: aims_projectitemassignedroles.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_projectitemassignedroles

✅ PASSED: aims_projectitemattributes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_projectitemattributes

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 92.3%)
❌ FAILED: aims_projectitemlinks.parquet (96.2%)

✅ PASSED: aims_projectitems.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_projectitems

✅ PASSED: aims_relationships.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_relationships

✅ PASSED: aims_relationshiptypes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_relationshiptypes

✅ PASSED: aims_routes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_routes

✅ PASSED: aims_secondaryassetclasscodes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_secondaryassetclasscodes

✅ PASSED: aims_stages.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_stages

✅ PASSED: aims_taskdefinitions.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_taskdefinitions

✅ PASSED: aims_tracks.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_tracks

✅ PASSED: aims_ua_beneficiaries.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_ua_beneficiaries

✅ PASSED: aims_ua_comments.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_ua_comments

✅ PASSED: aims_ua_entities.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_ua_entities

✅ PASSED: aims_ua_meetingattendees.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_ua_meetingattendees

✅ PASSED: aims_ua_meetings.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_ua_meetings

✅ PASSED: aims_ua_noncompimppartytypes.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_ua_noncompimppartytypes

✅ PASSED: aims_ua_noncomplianceimpacts.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_ua_noncomplianceimpacts

✅ PASSED: aims_ua_noncompotheruas.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_ua_noncompotheruas

✅ PASSED: aims_ua_optionvalues.parquet (100.0%)
   → Ingested to Silver via StorageManager: aims_ua_optionvalues

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 97.3%)
❌ FAILED: aims_undertakings_assurances.parquet (98.9%)

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 94.4%)
❌ FAILED: aims_workbanks.parquet (97.2%)

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 90.9%)
❌ FAILED: aims_workbankworkorders.parquet (95.5%)

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 94.4%)
❌ FAILED: aims_workorderattributes.parquet (97.2%)

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 95.8%)
❌ FAILED: aims_workorders.parquet (97.9%)

Validation FAILED: 1 checks failed. Reasons: Severity 'critical' threshold 100.0% failed (actual: 94.4%)
❌ FAILED: aims_workorderstatustransition.parquet (97.2%)

VALIDATION SUMMARY
Total Files:  68
✅ Passed:     50
❌ Failed:     18
⚠️  Skipped:    0
💥 Errors:     0

Pass Rate: 73.5%
Results saved to: /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/1_AIMS_LOCAL_2026/notebooks/config/validation_results/validation_results.json
2026-01-19 13:10:51 | INFO     | orchestration | ⏱️ Phase 2: Validation & Ingestion completed in 64.40s
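
The log above shows why a table can fail despite a high overall score: `aims_assethierarchymap` scored 92.9% overall but failed because its critical checks reached only 85.7% against a 100% threshold, while `aims_assetclassrelationships` passed at 91.5%. The validator's internals are not shown in this notebook, so the following is only a minimal sketch of how per-severity gating of this kind could work. The function `evaluate_by_severity`, the `"medium"` severity with its 80% threshold, and the 7/7 split of checks are illustrative assumptions, not the `DataQualityValidator` API.

```python
from collections import defaultdict


def evaluate_by_severity(checks, thresholds):
    """Return (passed, overall_rate, per_severity_rates).

    checks: iterable of (severity, succeeded) pairs.
    thresholds: minimum pass percentage required per severity (assumed model).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for severity, succeeded in checks:
        totals[severity] += 1
        successes[severity] += int(succeeded)

    rates = {sev: 100.0 * successes[sev] / totals[sev] for sev in totals}
    overall = 100.0 * sum(successes.values()) / max(sum(totals.values()), 1)
    # A severity without a configured threshold is treated as informational
    passed = all(rate >= thresholds.get(sev, 0.0) for sev, rate in rates.items())
    return passed, overall, rates


# Hypothetical split: 7 critical checks (one failing) plus 7 passing medium checks
checks = [("critical", True)] * 6 + [("critical", False)] + [("medium", True)] * 7
passed, overall, rates = evaluate_by_severity(
    checks, {"critical": 100.0, "medium": 80.0}
)
print(f"passed={passed} overall={overall:.1f}% critical={rates['critical']:.1f}%")
```

With this assumed split, 13 of 14 checks succeed (92.9% overall) while the critical group sits at 6 of 7 (85.7%), so the table fails even though its overall score is higher than that of some passing tables.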


## 📈 Phase 3: Data Quality Monitoring

**Purpose:** Generate DQ monitoring output from the Phase 2 validation results: a per-table summary and key metrics (average quality score, pass rate)
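
As a rough illustration of the metrics this phase reports, the sketch below derives the average quality score and pass rate from a dictionary shaped like the `files` section of `validation_results.json`. It uses only the standard library rather than pandas, the function name `summarize_validation` is hypothetical, and the two sample entries reuse scores reported for this run.

```python
from statistics import mean


def summarize_validation(files):
    """Compute monitoring metrics from per-table validation results."""
    if not files:
        return {"avg_quality_score": 0.0, "pass_rate": 0.0, "tables_monitored": 0}
    scores = [r.get("success_percentage", 0.0) for r in files.values()]
    passed = sum(1 for r in files.values() if r.get("overall_success"))
    return {
        "avg_quality_score": mean(scores),
        "pass_rate": 100.0 * passed / len(files),
        "tables_monitored": len(files),
    }


# Two entries in the same shape Phase 2 stores under "files"
files = {
    "aims_activitydates": {"overall_success": True, "success_percentage": 100.0},
    "aims_assethierarchymap": {"overall_success": False, "success_percentage": 92.857143},
}
metrics = summarize_validation(files)
print(metrics)
```

Note that the average quality score and the pass rate measure different things: the first averages per-table check success, the second counts tables that cleared every threshold.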

In [None]:
if PIPELINE_CONFIG["run_monitoring"]:
    print("\n" + "="*80)
    print("PHASE 3: DATA QUALITY MONITORING")
    print("="*80)
    
    phase_start = datetime.now()
    
    try:
        with timed_operation("Phase 3: DQ Monitoring", logger):
            # Load validation results using Fabric-compatible read
            results_file = RESULTS_DIR / "validation_results.json"
            
            # Check if results file exists
            results_exist = False
            if IS_FABRIC:
                try:
                    from notebookutils import mssparkutils
                    fab_path = fabric_path(results_file)
                    mssparkutils.fs.head(fab_path, 1)
                    results_exist = True
                except Exception:
                    results_exist = False
            else:
                results_exist = results_file.exists()
            
            if not results_exist:
                print("⚠️ No validation results found. Skipping monitoring.")
            else:
                # Read validation results
                if IS_FABRIC:
                    from notebookutils import mssparkutils
                    fab_path = fabric_path(results_file)
                    content = mssparkutils.fs.head(fab_path, 1000000)
                    validation_data = json.loads(content)
                else:
                    with open(results_file, 'r') as f:
                        validation_data = json.load(f)
                
                print(f"\n📊 Generating monitoring dashboards...")
                print(f"   Data source: {results_file}")
                
                # Check if we have file results
                files_data = validation_data.get("files", {})
                if not files_data:
                    print("⚠️ No file validation results available. Run validation first.")
                    print(f"\n📊 Summary Statistics:")
                    summary = validation_data.get("summary", {})
                    print(f"   Total Files: {summary.get('total', 0)}")
                    print(f"   Passed: {summary.get('passed', 0)}")
                    print(f"   Failed: {summary.get('failed', 0)}")
                    print(f"   Skipped: {summary.get('skipped', 0)}")
                    print(f"   Errors: {summary.get('errors', 0)}")
                else:
                    # Create summary DataFrame
                    summary_data = []
                    for table_name, result in files_data.items():
                        summary_data.append({
                            "Table": table_name,
                            "Success %": result.get("success_percentage", 0),
                            "Status": "Passed" if result.get("overall_success") else "Failed",
                            "Evaluated": result.get("statistics", {}).get("evaluated_expectations", 0),
                            "Successful": result.get("statistics", {}).get("successful_expectations", 0)
                        })
                    
                    df_summary = pd.DataFrame(summary_data)
                    
                    print(f"\n📋 DQ Summary:")
                    print(df_summary.head(10).to_string(index=False))
                    
                    # Calculate key metrics
                    avg_quality = df_summary["Success %"].mean()
                    pass_rate = (df_summary["Status"] == "Passed").sum() / len(df_summary) * 100
                    
                    print(f"\n📊 Key Metrics:")
                    print(f"   Average Quality Score: {avg_quality:.1f}%")
                    print(f"   Pass Rate: {pass_rate:.1f}%")
                    print(f"   Tables Monitored: {len(df_summary)}")
                    
                    # Log phase execution
                    execution_log["phases"].append({
                        "phase": "monitoring",
                        "status": "success",
                        "duration_seconds": (datetime.now() - phase_start).total_seconds(),
                        "metrics": {
                            "avg_quality_score": float(avg_quality),
                            "pass_rate": float(pass_rate),
                            "tables_monitored": len(df_summary)
                        }
                    })
                
    except Exception as e:
        print(f"\n❌ Monitoring Failed: {e}")
        import traceback
        traceback.print_exc()
        
        execution_log["phases"].append({
            "phase": "monitoring",
            "status": "failed",
            "error": str(e),
            "duration_seconds": (datetime.now() - phase_start).total_seconds()
        })
        
        if not PIPELINE_CONFIG.get("continue_on_error", False):
            raise
else:
    print("⏭️ Skipping Phase 3: Monitoring (disabled in config)")


PHASE 3: DATA QUALITY MONITORING
2026-01-19 13:10:58 | INFO     | orchestration | ⏱️ Phase 3: DQ Monitoring...

📊 Generating monitoring dashboards...
   Data source: /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/1_AIMS_LOCAL_2026/notebooks/config/validation_results/validation_results.json

📋 DQ Summary:
                       Table  Success % Status  Evaluated  Successful
          aims_activitydates 100.000000 Passed         23          23
        aims_assetattributes 100.000000 Passed        159         159
   aims_assetclassattributes 100.000000 Passed         63          63
   aims_assetclasschangelogs 100.000000 Passed         15          15
           aims_assetclasses  99.137931 Passed        116         115
aims_assetclassrelationships  91.489362 Passed         47          43
          aims_assetconsents 100.000000 Passed         17          17
      aims_assethierarchymap  92.857143 Failed         14          13
         aims_assetlocations  97.000000 Failed        100         

## 📝 Pipeline Execution Summary

In [None]:
# Calculate success rate
successful_phases = sum(1 for p in execution_log["phases"] if p["status"] in ["success", "partial"])
total_phases = len(execution_log["phases"])
success_rate = (successful_phases / total_phases * 100) if total_phases > 0 else 0

# Finalize execution log
execution_log["end_time"] = datetime.now().isoformat()
execution_log["total_duration_seconds"] = sum(
    p.get("duration_seconds", 0) for p in execution_log["phases"]
)

print(f"\n📊 Pipeline Status (Phases 1-3):")
print(f"   Phases Completed: {successful_phases}/{total_phases}")
print(f"   Success Rate: {success_rate:.1f}%")
print(f"   Total Duration: {execution_log['total_duration_seconds']:.2f}s")

# Determine if pipeline succeeded
PIPELINE_SUCCESS = success_rate >= 80

if PIPELINE_SUCCESS:
    print(f"\n✅ Phases 1-3 PASSED - Ready for Phase 4 (Archive & Cleanup)")
else:
    print(f"\n⚠️ Pipeline had issues - Phase 4 will be SKIPPED")
    print(f"   Landing zone preserved for investigation")


📊 Overall Status:
   Phases Completed: 3/3
   Success Rate: 100.0%
   Total Duration: 71.93s

💾 Execution log saved to: /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/1_AIMS_LOCAL_2026/notebooks/config/validation_results/orchestration_log_20260119_131102.json

🎉 ALL PHASES COMPLETED SUCCESSFULLY!
2026-01-19 13:11:02 | INFO     | orchestration | Pipeline completed: 3/3 phases successful
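
The gate in the summary cell can be checked with simple arithmetic: with three logged phases, a single failure gives 2/3 ≈ 66.7%, which is below the 80% threshold, so Phase 4 would be skipped. The sketch below restates that rule as a small function. The function name and the sample phase lists are illustrative; the `"success"`/`"partial"` statuses and the 80% threshold are taken from the cell above.

```python
def pipeline_succeeded(phases, threshold=80.0):
    """Return (success_rate, ok) using the same rule as the summary cell."""
    total = len(phases)
    successful = sum(1 for p in phases if p["status"] in ("success", "partial"))
    rate = (successful / total * 100) if total > 0 else 0
    return rate, rate >= threshold


all_ok = [{"status": "success"}] * 3
one_failed = [{"status": "success"}, {"status": "failed"}, {"status": "success"}]

print(pipeline_succeeded(all_ok))      # every phase succeeded
print(pipeline_succeeded(one_failed))  # one of three phases failed
```

An empty phase list yields a rate of 0 and therefore also blocks Phase 4, which keeps the landing zone intact when nothing was logged.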


## 📦 Phase 4: Archive & Cleanup

**Purpose:** Archive processed files and clear landing zone (ONLY after successful pipeline)

**Process:**
1. Archive the landing files tracked in `LANDING_FILES_TO_ARCHIVE` to `/archive/YYYYMMDD_HHMMSS/`
2. Clear the landing zone, but only if every file was archived (ready for next SFTP batch)
3. Save final execution log
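
The ordering matters here: files are copied to the archive first, and the landing zone is cleared only once every copy has succeeded. A minimal local-filesystem sketch of that archive-then-clear pattern follows. It assumes plain `pathlib`/`shutil` in place of the notebook's `copy_file_fabric()` and `delete_file_fabric()` helpers, and the function name `archive_and_clear` and the temporary directories are illustrative only.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def archive_and_clear(landing_dir, archive_root, filenames):
    """Copy files into a timestamped archive folder, then delete the originals
    only if every copy succeeded. Returns (archived_count, cleared_count)."""
    batch_dir = Path(archive_root) / datetime.now().strftime("%Y%m%d_%H%M%S")
    batch_dir.mkdir(parents=True, exist_ok=True)

    archived = 0
    for name in filenames:
        try:
            shutil.copy2(Path(landing_dir) / name, batch_dir / name)
            archived += 1
        except OSError as err:
            print(f"Archive failed: {name} - {err}")

    cleared = 0
    # Safety net: leave landing untouched unless the whole batch was archived
    if filenames and archived == len(filenames):
        for name in filenames:
            (Path(landing_dir) / name).unlink()
            cleared += 1
    return archived, cleared


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp) / "landing"
    landing.mkdir()
    (landing / "a.parquet").write_bytes(b"demo")
    (landing / "b.parquet").write_bytes(b"demo")
    result = archive_and_clear(landing, Path(tmp) / "archive", ["a.parquet", "b.parquet"])
    remaining = sorted(p.name for p in landing.iterdir())
    print(result, remaining)
```

If any copy fails, nothing is deleted, so the landing files remain available for a retry.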

In [None]:
print("\n" + "="*80)
print("PHASE 4: ARCHIVE & CLEANUP")
print("="*80)

phase_start = datetime.now()

if not PIPELINE_SUCCESS:
    print("\n⚠️ SKIPPING Phase 4 - Pipeline did not complete successfully")
    print("   Landing zone preserved for investigation/retry")
    print("   Fix issues and re-run the pipeline")
    
    execution_log["phases"].append({
        "phase": "archive_cleanup",
        "status": "skipped",
        "reason": "Pipeline success rate below threshold",
        "duration_seconds": 0
    })
else:
    try:
        # Create dated archive folder
        archive_date = datetime.now().strftime("%Y%m%d_%H%M%S")
        ARCHIVE_BATCH_DIR = ARCHIVE_DIR / archive_date
        ensure_dir_exists(ARCHIVE_BATCH_DIR)
        
        print(f"\n📦 Archiving to: {ARCHIVE_BATCH_DIR}")
        
        archived_count = 0
        cleared_count = 0
        
        # Step 1: Archive files from landing to archive folder
        if len(LANDING_FILES_TO_ARCHIVE) > 0:
            print(f"\n📋 Archiving {len(LANDING_FILES_TO_ARCHIVE)} files from landing...")
            
            for filename in LANDING_FILES_TO_ARCHIVE:
                if IS_FABRIC:
                    src_path = f"{LANDING_DIR}/{filename}"
                    archive_path = f"{ARCHIVE_BATCH_DIR}/{filename}"
                else:
                    src_path = LANDING_DIR / filename
                    archive_path = ARCHIVE_BATCH_DIR / filename
                
                try:
                    # Copy to archive
                    copy_file_fabric(src_path, archive_path)
                    archived_count += 1
                except Exception as e:
                    print(f"   ⚠️ Archive failed: {filename} - {e}")
            
            print(f"   ✅ Archived {archived_count}/{len(LANDING_FILES_TO_ARCHIVE)} files")
        
        # Step 2: Clear landing zone (only after successful archive)
        if archived_count == len(LANDING_FILES_TO_ARCHIVE) and archived_count > 0:
            print(f"\n🧹 Clearing landing zone...")
            
            for filename in LANDING_FILES_TO_ARCHIVE:
                if IS_FABRIC:
                    src_path = f"{LANDING_DIR}/{filename}"
                else:
                    src_path = LANDING_DIR / filename
                
                try:
                    delete_file_fabric(src_path)
                    cleared_count += 1
                except Exception as e:
                    print(f"   ⚠️ Delete failed: {filename} - {e}")
            
            print(f"   ✅ Cleared {cleared_count}/{len(LANDING_FILES_TO_ARCHIVE)} files from landing")
        else:
            print(f"\n⚠️ Skipping landing cleanup - archive incomplete")
        
        # Verify landing is empty
        remaining = list_parquet_files(LANDING_DIR)
        if len(remaining) == 0:
            print(f"\n   ✅ Landing zone is now EMPTY (ready for next SFTP batch)")
        else:
            print(f"\n   ⚠️ {len(remaining)} files still in landing")
        
        # Step 3: Save manifest to archive
        # Fall back to an empty dict if no phases were logged, so [-1] cannot raise IndexError
        last_phase = (execution_log.get("phases") or [{}])[-1]
        manifest = {
            "archive_date": archive_date,
            "pipeline_run": execution_log["start_time"],
            "files_archived": LANDING_FILES_TO_ARCHIVE,
            "validation_summary": last_phase.get("validation_summary", {}),
            "success_rate": success_rate
        }
        
        manifest_path = ARCHIVE_BATCH_DIR / "manifest.json"
        manifest_content = json.dumps(manifest, indent=2)
        
        if IS_FABRIC:
            from notebookutils import mssparkutils
            mssparkutils.fs.put(fabric_path(manifest_path), manifest_content, overwrite=True)
        else:
            with open(manifest_path, 'w') as f:
                f.write(manifest_content)
        
        print(f"   📝 Manifest saved: {manifest_path}")
        
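        # Log phase execution; report "partial" if any file failed to archive or clear
        all_done = archived_count == cleared_count == len(LANDING_FILES_TO_ARCHIVE)
        execution_log["phases"].append({
            "phase": "archive_cleanup",
            "status": "success" if all_done else "partial",
            "duration_seconds": (datetime.now() - phase_start).total_seconds(),
            "files_archived": archived_count,
            "files_cleared": cleared_count,
            "archive_location": str(ARCHIVE_BATCH_DIR)
        })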
        
        print(f"\n   📊 Phase 4 Summary:")
        print(f"      Files Archived: {archived_count}")
        print(f"      Landing Cleared: {cleared_count}")
        print(f"      Archive Location: {ARCHIVE_BATCH_DIR}")
        
    except Exception as e:
        print(f"\n❌ Archive/Cleanup Failed: {e}")
        import traceback
        traceback.print_exc()
        
        execution_log["phases"].append({
            "phase": "archive_cleanup",
            "status": "failed",
            "error": str(e),
            "duration_seconds": (datetime.now() - phase_start).total_seconds()
        })
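
The manifest written in Step 3 lists every file the run intended to archive, so it can be used later to check an archive folder for gaps. The sketch below assumes a local archive folder and the `files_archived` key used in the cell above; `verify_archive` is a hypothetical helper, not something the notebook defines.

```python
import json
from pathlib import Path


def verify_archive(batch_dir: Path) -> list:
    """Return filenames listed in manifest.json that are missing from batch_dir.

    Hypothetical helper for illustration; assumes the manifest layout
    written by Phase 4 (a "files_archived" list of filenames).
    """
    manifest_text = (batch_dir / "manifest.json").read_text(encoding="utf-8")
    manifest = json.loads(manifest_text)
    return [
        name
        for name in manifest.get("files_archived", [])
        if not (batch_dir / name).is_file()
    ]
```

An empty list means every file named in the manifest is present in the archive folder.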

## ✅ Pipeline Complete

In [None]:
# Final Pipeline Summary
print("\n" + "="*80)
print("PIPELINE EXECUTION COMPLETE")
print("="*80)

# Recalculate with all phases including Phase 4
successful_phases = sum(1 for p in execution_log["phases"] if p["status"] in ["success", "partial"])
total_phases = len(execution_log["phases"])
final_success_rate = (successful_phases / total_phases * 100) if total_phases > 0 else 0

execution_log["end_time"] = datetime.now().isoformat()
execution_log["total_duration_seconds"] = sum(
    p.get("duration_seconds", 0) for p in execution_log["phases"]
)
execution_log["final_success_rate"] = final_success_rate

print(f"\n📊 Final Status:")
print(f"   Phases Completed: {successful_phases}/{total_phases}")
print(f"   Success Rate: {final_success_rate:.1f}%")
print(f"   Total Duration: {execution_log['total_duration_seconds']:.2f}s")

# Save final execution log
log_file = RESULTS_DIR / f"orchestration_log_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
log_content = json.dumps(execution_log, indent=2)

if IS_FABRIC:
    try:
        from notebookutils import mssparkutils
        fab_path = fabric_path(log_file)
        mssparkutils.fs.put(fab_path, log_content, overwrite=True)
        print(f"\n💾 Execution log saved to: {log_file}")
    except Exception as e:
        print(f"   ⚠️ Could not save log to Fabric: {e}")
else:
    with open(log_file, 'w') as f:
        f.write(log_content)
    # Report success only after the write has completed
    print(f"\n💾 Execution log saved to: {log_file}")

# Final state summary
print(f"\n📁 Final Directory State:")
landing_count = len(list_parquet_files(LANDING_DIR))
bronze_count = len(list_parquet_files(BRONZE_DIR))
silver_count = len(list_parquet_files(SILVER_DIR))

print(f"   /landing/  → {landing_count} files {'(EMPTY - ready for SFTP)' if landing_count == 0 else '(⚠️ not cleared)'}")
print(f"   /Bronze/   → {bronze_count} files (raw data)")
print(f"   /Silver/   → {silver_count} files (validated parquet)")
print(f"   /archive/  → Historical backups with timestamps")

# Show Delta tables status
if IS_FABRIC:
    print("\n📊 Delta Tables (Lakehouse):")
    # TABLES_TO_PERSIST exists only if the persistence cell has run
    tables_persisted = list(globals().get("TABLES_TO_PERSIST", []))
    if tables_persisted:
        for t in tables_persisted:
            print(f"   ✅ silver_{t} (OVERWRITTEN)")
    else:
        print("   ℹ️ No tables persisted this run")
else:
    print(f"\n📊 Delta Tables:")
    print(f"   ℹ️ Local mode - tables created on Fabric deployment")

print("\n" + "="*80)
if final_success_rate == 100:
    print("🎉 ALL PHASES COMPLETED SUCCESSFULLY!")
elif final_success_rate >= 80:
    print("✅ PIPELINE COMPLETED WITH MINOR ISSUES")
else:
    print("⚠️ PIPELINE COMPLETED WITH ERRORS - Check logs")
print("="*80)

# Log final summary if logger available
if logger:
    logger.info(f"Pipeline completed: {successful_phases}/{total_phases} phases, {final_success_rate:.1f}% success")
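
The success-rate arithmetic and the banner thresholds used in this cell can be factored into small pure functions, which makes them easy to test outside the notebook. This is a sketch under the assumption that each phase entry carries a `status` key as in `execution_log["phases"]`; the names `summarize_phases` and `final_banner` are hypothetical.

```python
def summarize_phases(phases):
    """Return (successful, total, success_rate_percent) for a list of phase dicts.

    "success" and "partial" both count as successful; any other status
    ("failed", "skipped") counts against the rate.
    """
    successful = sum(1 for p in phases if p.get("status") in ("success", "partial"))
    total = len(phases)
    rate = (successful / total * 100) if total else 0.0
    return successful, total, rate


def final_banner(rate):
    """Map a success rate to the banner text, using the 100 and 80 thresholds."""
    if rate == 100:
        return "ALL PHASES COMPLETED SUCCESSFULLY"
    if rate >= 80:
        return "PIPELINE COMPLETED WITH MINOR ISSUES"
    return "PIPELINE COMPLETED WITH ERRORS"


phases = [
    {"phase": "one", "status": "success"},
    {"phase": "two", "status": "partial"},
    {"phase": "three", "status": "success"},
    {"phase": "archive_cleanup", "status": "failed"},
]
successful, total, rate = summarize_phases(phases)
print(f"{successful}/{total} phases, {rate:.1f}% -> {final_banner(rate)}")
```

With three successful phases and a failed Phase 4, the rate is 75%, which falls below the 80% threshold and produces the error banner.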