# Disaster AWS COG Conversion Template

This template provides a comprehensive workflow for converting satellite imagery to Cloud Optimized GeoTIFFs (COGs) with:
- **Modular architecture** with single-responsibility functions
- **Automatic error handling** and recovery
- **Memory-efficient processing** for large files
- **S3 streaming and caching** capabilities

## Key Features
- ✅ Handles files from <1GB to >10GB
- ✅ Prevents striping issues with fixed chunk processing
- ✅ Automatic S3 existence checking
- ✅ ZSTD compression with optimal predictors
- ✅ Comprehensive error tracking
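
The "ZSTD compression with optimal predictors" feature can be illustrated with a small sketch. It assumes the usual GDAL GeoTIFF convention (`PREDICTOR=2`, horizontal differencing, for integer data and `PREDICTOR=3` for floating-point data). The function names and the 512-pixel block size are hypothetical and are not taken from this repository's `core.compression` module.

```python
def sketch_predictor_for_dtype(dtype_name: str) -> int:
    """Return a GDAL GeoTIFF PREDICTOR value for a NumPy-style dtype name."""
    name = dtype_name.lower()
    if name.startswith("float"):
        return 3  # floating-point predictor
    if name.startswith(("int", "uint")):
        return 2  # horizontal differencing for integer data
    return 1  # no predictor for anything else


def sketch_cog_creation_options(dtype_name: str) -> dict:
    """Build GTiff creation options (block size of 512 is an assumption)."""
    return {
        "compress": "ZSTD",
        "predictor": sketch_predictor_for_dtype(dtype_name),
        "tiled": True,
        "blockxsize": 512,
        "blockysize": 512,
    }


print(sketch_cog_creation_options("float32"))
```

Float products such as NDVI would get predictor 3 under this rule, while 8-bit RGB imagery would get predictor 2.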

---

## 📋 CONFIGURATION CELL - MODIFY PARAMETERS HERE

**This is the main cell to modify for each event.** The overwrite, product-type, and no-data cells below hold further settings.

In [10]:
# ========================================
# MAIN CONFIGURATION - MODIFY THESE VALUES
# ========================================

# Event Configuration
EVENT_NAME = '202504_SevereWx_US'  # Event identifier
PRODUCT_NAME = 'sentinel2'          # Product type (sentinel1, sentinel2, landsat, etc.)

# S3 Configuration
BUCKET = 'nasa-disasters'                                # S3 bucket name
DIR_OLD_BASE = 'drcs_activations'                       # Source directory base
DIR_NEW_BASE = 'drcs_activations_new'                   # Destination directory base
PATH_OLD = f'{DIR_OLD_BASE}/{EVENT_NAME}/{PRODUCT_NAME}' # Full source path

# File Size Thresholds (in GB)
LARGE_FILE_THRESHOLD = 3   # Files > 3GB use large file config
ULTRA_LARGE_THRESHOLD = 7  # Files > 7GB use ultra-large config

# Memory Configuration
MEMORY_LIMIT_MB = 500      # Memory limit per chunk
FORCE_FIXED_CHUNKS = True  # Use fixed chunks for large files (prevents striping)

# Output Configuration
SAVE_LOCAL = True          # Save files locally during processing
SAVE_METADATA = True       # Save processing results as CSV in the local output directory
VERBOSE = True             # Verbose output for functions

# Advanced Configuration (usually don't need to change)
USE_STREAMING = False      # Stream from S3 (set False if having issues with large files)
CACHE_DOWNLOADS = True     # Cache downloaded files
MAX_RETRIES = 3           # Maximum retry attempts

print("✅ Configuration loaded successfully!")
print(f"Event: {EVENT_NAME}")
print(f"Source: s3://{BUCKET}/{PATH_OLD}")


✅ Configuration loaded successfully!
Event: 202504_SevereWx_US
Source: s3://nasa-disasters/drcs_activations/202504_SevereWx_US/sentinel2
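
As a rough illustration of how the two size thresholds split files into tiers, the sketch below mirrors the strict `>` comparisons used in the processing loop later in this notebook. The function name and the tier labels are hypothetical.

```python
LARGE_FILE_THRESHOLD = 3   # GB, same value as the configuration cell
ULTRA_LARGE_THRESHOLD = 7  # GB, same value as the configuration cell


def sketch_size_tier(file_size_gb, large=LARGE_FILE_THRESHOLD, ultra=ULTRA_LARGE_THRESHOLD):
    """Classify a file by size; a file exactly at a threshold falls in the lower tier."""
    if file_size_gb > ultra:
        return "ultra_large"
    if file_size_gb > large:
        return "large"
    return "standard"


for size in (0.8, 3.2, 7.4):
    print(f"{size} GB -> {sketch_size_tier(size)}")
```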


## ♻️ Overwrite and Verification Configuration

Control whether to overwrite existing files and verify processing results:

In [11]:
# ========================================
# OVERWRITE AND VERIFICATION CONFIGURATION
# ========================================

# Overwrite Configuration
OVERWRITE_EXISTING = True  # Set to True to reprocess and overwrite existing files
                            # Set to False to skip existing files (default behavior)

# Verification Configuration  
VERIFY_PROCESSING = True    # Compare input vs output to verify COG transformation
SAVE_VERIFICATION_PLOTS = True  # Save comparison plots for verification
VERIFICATION_SAMPLE_SIZE = 5     # Number of files to verify per product type
VERIFICATION_DIR = f'verification/{EVENT_NAME}'  # Directory for verification results

# Quality Control
CHECK_NODATA_PROPAGATION = True  # Verify no-data values are properly handled
COMPARE_STATISTICS = True         # Compare min/max/mean between input and output

print("✅ Overwrite and verification configuration loaded")
print(f"Overwrite mode: {'ENABLED' if OVERWRITE_EXISTING else 'DISABLED (will skip existing)'}")
print(f"Verification: {'ENABLED' if VERIFY_PROCESSING else 'DISABLED'}")
if OVERWRITE_EXISTING:
    print("⚠️  WARNING: Existing files will be overwritten!")
    print("   This may incur additional processing time and S3 costs.")

✅ Overwrite and verification configuration loaded
Overwrite mode: ENABLED
Verification: ENABLED
⚠️  WARNING: Existing files will be overwritten!
   This may incur additional processing time and S3 costs.
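
The `COMPARE_STATISTICS` check could work along the following lines. This is only a sketch under stated assumptions: the actual comparison lives in `tools.verification` and is not shown here, so the function name, the dictionary layout, and the tolerances are hypothetical.

```python
import math


def sketch_compare_statistics(input_stats, output_stats, rel_tol=1e-6):
    """Return the names of statistics that differ between input and output."""
    mismatches = []
    for key in ("min", "max", "mean"):
        same = math.isclose(input_stats[key], output_stats[key],
                            rel_tol=rel_tol, abs_tol=1e-9)
        if not same:
            mismatches.append(key)
    return mismatches


before = {"min": -1.0, "max": 1.0, "mean": 0.21}
after = {"min": -1.0, "max": 1.0, "mean": 0.25}
print(sketch_compare_statistics(before, after))
```

An empty list means the three statistics agree within tolerance; any listed name points at a statistic worth inspecting, for example a mean shifted by no-data pixels.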


## 📦 Import Required Modules

In [12]:
# Standard library imports
import os
import sys
import re
import gc
import tempfile
from datetime import datetime
from pathlib import Path

# Data processing
import pandas as pd
import numpy as np

# Geospatial libraries
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling
from rasterio.windows import Window

# AWS libraries
import boto3
from botocore.exceptions import ClientError, NoCredentialsError

# Progress tracking
from tqdm import tqdm

print("✅ Standard libraries imported")

# Add parent directory to path for module imports
module_path = Path('..').resolve()
if str(module_path) not in sys.path:
    sys.path.insert(0, str(module_path))

print(f"Module path: {module_path}")

✅ Standard libraries imported
Module path: /home/jovyan/disasters-aws-conversion


## 🔌 Initialize AWS S3 Connection

In [13]:
# Initialize S3 client
# NOTE: initialize_s3_client comes from core.s3_operations, so the module
# import cell must be run before this one (otherwise the NameError below occurs)
s3_client, fs_read = initialize_s3_client(bucket_name=BUCKET, verbose=VERBOSE)

if s3_client:
    print("✅ S3 client ready for operations")
else:
    print("❌ Failed to initialize S3 client")
    print("Please check your AWS credentials")

NameError: name 'initialize_s3_client' is not defined

## 🔍 Discover Files in S3

In [7]:
# List all TIF files in the source path
if s3_client:
    keys = list_s3_files(s3_client, BUCKET, PATH_OLD, suffix='.tif')
    print(f"✅ Found {len(keys)} .tif files in s3://{BUCKET}/{PATH_OLD}")
    
    # List every file with its size
    if keys:
        print("\nFiles:")
        for key in keys:
            file_size = get_file_size_from_s3(s3_client, BUCKET, key)
            print(f"  - {os.path.basename(key)} ({file_size:.1f} GB)")
else:
    keys = []
    print("❌ No S3 client available")

✅ Found 55 .tif files in s3://nasa-disasters/drcs_activations/202504_SevereWx_US/sentinel2

Files:
  - JAN_S2A_MNDWI_20250408_merged.tif (0.8 GB)
  - JAN_S2A_NDVI_20250322_merged.tif (7.4 GB)
  - JAN_S2A_NDVI_20250408_merged.tif (3.2 GB)
  - JAN_S2A_trueColor_20250322_merged.tif (5.6 GB)
  - JAN_S2A_trueColor_20250408_merged.tif (2.4 GB)
  - JAN_S2B_NDVI_20250322_merged.tif (4.5 GB)
  - JAN_S2B_trueColor_20250322_merged.tif (3.4 GB)
  - JAN_S2C_MNDWI_20250409_merged.tif (1.9 GB)
  - JAN_S2C_NDVI_20250409_merged.tif (7.4 GB)
  - JAN_S2C_trueColor_20250409_merged.tif (5.6 GB)
  - LZK_S2B_MNDWI_20250407_merged.tif (1.9 GB)
  - LZK_S2B_NDVI_20250407_merged.tif (7.8 GB)
  - LZK_S2B_trueColor_20250407_merged.tif (5.8 GB)
  - LZK_S2C_MNDWI_20250409_merged.tif (0.8 GB)
  - LZK_S2C_NDVI_20150313_merged.tif (6.3 GB)
  - LZK_S2C_NDVI_20250409_merged.tif (3.2 GB)
  - LZK_S2C_trueColor_20150313_merged.tif (4.7 GB)
  - LZK_S2C_trueColor_20250409_merged.tif (2.4 GB)
  - MEG_S2A_MNDWI_20250408_merged.
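
For readers without access to the helper modules, a listing like the one above can be produced with boto3's `list_objects_v2` paginator. The sketch below is a hypothetical stand-in for `list_s3_files` and `get_file_size_from_s3`, not their actual implementation; it assumes sizes are reported in units of 1024³ bytes.

```python
def sketch_list_keys_with_size(s3_client, bucket, prefix, suffix=".tif"):
    """Yield (key, size_gb) for every object under prefix whose key ends with suffix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                yield obj["Key"], obj["Size"] / 1024**3
```

With a real client (for example one created by `boto3.client("s3")`), `list(sketch_list_keys_with_size(client, BUCKET, PATH_OLD))` returns the keys together with their sizes from a single listing pass, which avoids one extra request per file.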

## Select Product Types with Regex Patterns

Based on the files found in the directory, add regex patterns that select specific product types and route each type to its own output directory.

In [8]:
# Product Type Configuration
# Define patterns and output directories for different product types
# Modify this dictionary to add/remove product types as needed
PRODUCT_CONFIGS = {
    # Pattern (regex or string): Output directory relative to DIR_NEW_BASE
    'NDVI': 'Sentinel-2/NDVI',
    'MNDWI': 'Sentinel-2/MNDWI',
    'trueColor|truecolor': 'Sentinel-2/trueColor',  # Multiple patterns with |
    # Add more patterns as needed:
    # 'SAR': 'Sentinel-1/SAR',
    # 'DEM': 'Elevation/DEM',
    # 'temperature': 'Climate/Temperature',
}


for pattern, output_dir in PRODUCT_CONFIGS.items():
    print(f"  - {pattern} -> {DIR_NEW_BASE}/{output_dir}")

  - NDVI -> drcs_activations_new/Sentinel-2/NDVI
  - MNDWI -> drcs_activations_new/Sentinel-2/MNDWI
  - trueColor|truecolor -> drcs_activations_new/Sentinel-2/trueColor


## 🔧 No-Data Value Configuration

Configure how no-data values are handled during processing:

In [None]:
# ========================================
# NO-DATA VALUE CONFIGURATION
# ========================================

# Automatic no-data detection
USE_AUTO_NODATA = True  # Automatically select appropriate no-data values

# Manual no-data values per product type
# Set to None to use automatic detection for that product
MANUAL_NODATA_VALUES = {
    'NDVI': None,       # e.g., -9999 for NDVI
    'MNDWI': None,      # e.g., -9999 for MNDWI  
    'trueColor_or_truecolor': None,  # e.g., 0 for RGB images
    # Add more as needed
}

# Analysis configuration
ANALYZE_BEFORE_PROCESSING = True  # Analyze files to determine min/max before processing
VALIDATE_NODATA = True           # Validate that no-data values don't conflict with actual data
SHOW_ANALYSIS_REPORT = True      # Display analysis report before processing

print("✅ No-data configuration loaded")
print(f"Auto no-data: {USE_AUTO_NODATA}")
print(f"Manual overrides configured: {sum(v is not None for v in MANUAL_NODATA_VALUES.values())}")
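
One way automatic no-data selection with validation could work is sketched below. It is an assumption, not the notebook's actual logic: the function name, the candidate list, and its order are hypothetical. A candidate is accepted only if the data type can store it and it lies outside the observed data range.

```python
def sketch_pick_nodata(data_min, data_max, dtype_min, dtype_max,
                       candidates=(-9999, 0, 255)):
    """Return the first candidate that is representable and unused by real data, else None."""
    for value in candidates:
        representable = dtype_min <= value <= dtype_max
        conflicts = data_min <= value <= data_max
        if representable and not conflicts:
            return value
    return None


# float32 NDVI in [-1, 1]: -9999 is representable and outside the data range
print(sketch_pick_nodata(-1.0, 1.0, -3.4e38, 3.4e38))
# uint8 imagery using 1..255: -9999 cannot be stored, 0 is free
print(sketch_pick_nodata(1, 255, 0, 255))
```

A `None` result signals that every candidate collides with real data, which is the case a manual override in `MANUAL_NODATA_VALUES` is meant to handle.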

In [9]:
# Filter files by the configured product patterns
files_to_process = {}

for pattern, output_dir in PRODUCT_CONFIGS.items():
    matching_files = []
    for file_path in keys:
        # Match against the filename only, so directory names in the key cannot trigger a match
        if re.search(pattern, os.path.basename(file_path)):
            matching_files.append(file_path)
    
    if matching_files:
        # Use the pattern as key, but clean it for display
        clean_name = pattern.replace('|', '_or_')
        files_to_process[clean_name] = {
            'files': matching_files,
            'output_dir': output_dir
        }
        print(f"{clean_name}: {len(matching_files)} files -> {output_dir}")

total_files = sum(len(v['files']) for v in files_to_process.values())
print(f"\nTotal files to process: {total_files}")

# Warn if nothing matched
if not files_to_process:
    print("⚠️ No files matched the configured patterns!")

NDVI: 22 files -> Sentinel-2/NDVI
MNDWI: 11 files -> Sentinel-2/MNDWI
trueColor_or_truecolor: 22 files -> Sentinel-2/trueColor

Total files to process: 55


In [None]:
# Import disaster-aws-conversion modules
try:
    # Core modules
    from core.s3_operations import (
        initialize_s3_client,
        check_s3_file_exists,
        list_s3_files,
        get_file_size_from_s3
    )
    from core.validation import validate_cog, check_cog_with_warnings
    from core.compression import get_predictor_for_dtype, export_cog_profile
    
    # Utils
    from utils.memory_management import get_memory_usage, monitor_memory
    from utils.error_handling import cleanup_temp_files
    from utils.logging import print_status, print_summary
    
    # Processors
    from processors.batch_processor import process_file_batch, monitor_batch_progress
    
    # Configs
    from configs.profiles import select_profile_by_size
    from configs.chunk_configs import get_chunk_config
    
    # Main processor
    from main_processor import convert_to_cog
    
    print("✅ All disaster-aws-conversion modules imported successfully!")
    
except ImportError as e:
    print(f"⚠️ Import error: {e}")
    print("Make sure this notebook runs from a subdirectory of disasters-aws-conversion")

## 🏷️ Manual Filename Generation

Define custom filename generation for each product type. Modify these functions to match your specific naming conventions.

In [10]:
# Manual filename generation functions for each product type
# Modify these functions to match your specific naming conventions

def extract_date_from_filename(filename):
    """Extract date from filename in YYYYMMDD format."""
    dates = re.findall(r'\d{8}', filename)
    if dates:
        # Convert YYYYMMDD to YYYY-MM-DD
        date_str = dates[0]
        return f"{date_str[0:4]}-{date_str[4:6]}-{date_str[6:8]}"
    return None

def create_ndvi_filename(original_path, event_name):
    """
    Create filename for NDVI products.
    Example: JAN_S2A_NDVI_20250408_merged.tif -> 202504_SevereWx_US_JAN_S2A_NDVI_2025-04-08_day.tif
    """
    filename = os.path.basename(original_path)
    stem = os.path.splitext(filename)[0]
    
    # Extract components
    date = extract_date_from_filename(stem)
    
    # Remove date and _merged from stem
    stem_clean = re.sub(r'_?\d{8}', '', stem)
    stem_clean = stem_clean.replace('_merged', '')
    
    # Build filename
    if date:
        cog_filename = f"{event_name}_{stem_clean}_{date}_day.tif"
    else:
        cog_filename = f"{event_name}_{stem_clean}_day.tif"
    
    # Clean up double underscores
    cog_filename = re.sub(r'_+', '_', cog_filename)
    
    return cog_filename

def create_mndwi_filename(original_path, event_name):
    """
    Create filename for MNDWI products.
    Example: JAN_S2A_MNDWI_20250408_merged.tif -> 202504_SevereWx_US_JAN_S2A_MNDWI_2025-04-08_day.tif
    """
    filename = os.path.basename(original_path)
    stem = os.path.splitext(filename)[0]
    
    # Extract components
    date = extract_date_from_filename(stem)
    
    # Remove date and _merged from stem
    stem_clean = re.sub(r'_?\d{8}', '', stem)
    stem_clean = stem_clean.replace('_merged', '')
    
    # Build filename
    if date:
        cog_filename = f"{event_name}_{stem_clean}_{date}_day.tif"
    else:
        cog_filename = f"{event_name}_{stem_clean}_day.tif"
    
    # Clean up double underscores
    cog_filename = re.sub(r'_+', '_', cog_filename)
    
    return cog_filename

def create_truecolor_filename(original_path, event_name):
    """
    Create filename for trueColor products.
    Example: JAN_S2A_trueColor_20250408_merged.tif -> 202504_SevereWx_US_JAN_S2A_trueColor_2025-04-08_day.tif
    """
    filename = os.path.basename(original_path)
    stem = os.path.splitext(filename)[0]
    
    # Extract components
    date = extract_date_from_filename(stem)
    
    # Remove date and _merged from stem
    stem_clean = re.sub(r'_?\d{8}', '', stem)
    stem_clean = stem_clean.replace('_merged', '')
    
    # Build filename
    if date:
        cog_filename = f"{event_name}_{stem_clean}_{date}_day.tif"
    else:
        cog_filename = f"{event_name}_{stem_clean}_day.tif"
    
    # Clean up double underscores
    cog_filename = re.sub(r'_+', '_', cog_filename)
    
    return cog_filename

def create_generic_filename(original_path, event_name):
    """
    Create generic filename for any product type.
    Falls back to this if no specific handler is defined.
    """
    filename = os.path.basename(original_path)
    stem = os.path.splitext(filename)[0]
    
    # Extract components
    date = extract_date_from_filename(stem)
    
    # Remove date and _merged from stem
    stem_clean = re.sub(r'_?\d{8}', '', stem)
    stem_clean = stem_clean.replace('_merged', '')
    
    # Build filename
    if date:
        cog_filename = f"{event_name}_{stem_clean}_{date}_day.tif"
    else:
        cog_filename = f"{event_name}_{stem_clean}_day.tif"
    
    # Clean up double underscores
    cog_filename = re.sub(r'_+', '_', cog_filename)
    
    return cog_filename



# Mapping of product types to their filename creators
FILENAME_CREATORS = {
    'NDVI': create_ndvi_filename,
    'MNDWI': create_mndwi_filename,
    'trueColor_or_truecolor': create_truecolor_filename,
    # Add more mappings as needed:
    # 'SAR': create_sar_filename,
    # 'DEM': create_dem_filename,
}

print("✅ Filename generation functions defined")
print("Available product handlers:", list(FILENAME_CREATORS.keys()))

✅ Filename generation functions defined
Available product handlers: ['NDVI', 'MNDWI', 'trueColor_or_truecolor']
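
The per-product functions above follow largely the same steps, so they could be collapsed into one parameterized helper. The sketch below is a possible consolidation, not part of the repository; the name `sketch_build_cog_filename` and the `suffix` parameter are hypothetical.

```python
import os
import re


def sketch_build_cog_filename(original_path, event_name, suffix="day"):
    """Build <event>_<stem>_<YYYY-MM-DD>_<suffix>.tif from a source key."""
    stem = os.path.splitext(os.path.basename(original_path))[0]

    # Convert the first YYYYMMDD run, if any, to YYYY-MM-DD
    match = re.search(r"\d{8}", stem)
    date = None
    if match:
        d = match.group(0)
        date = f"{d[0:4]}-{d[4:6]}-{d[6:8]}"

    # Drop the date and the _merged marker from the stem
    stem_clean = re.sub(r"_?\d{8}", "", stem).replace("_merged", "")

    # Join the non-empty parts and collapse repeated underscores
    name = "_".join(part for part in (event_name, stem_clean, date, suffix) if part)
    return re.sub(r"_+", "_", name) + ".tif"


print(sketch_build_cog_filename("JAN_S2A_MNDWI_20250408_merged.tif", "202504_SevereWx_US"))
```

A product that needs a different ending could then be handled with `functools.partial(sketch_build_cog_filename, suffix="night")` instead of a copied function body.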


## 📝 Preview Filename Transformations

Review how your files will be renamed before processing begins:

In [None]:
# Preview filename transformations for each product type
print("=" * 80)
print("📋 FILENAME TRANSFORMATION PREVIEW")
print("=" * 80)

# Show sample transformations for each product type
for product_name, product_info in files_to_process.items():
    print(f"\n🔹 {product_name} Files:")
    print("-" * 60)
    
    # Get the appropriate filename creator
    filename_creator = FILENAME_CREATORS.get(product_name, create_generic_filename)
    
    # Show the first 3 files as examples (slicing handles shorter lists)
    sample_files = product_info['files'][:3]
    
    for file_path in sample_files:
        original = os.path.basename(file_path)
        transformed = filename_creator(file_path, EVENT_NAME)
        
        print(f"  Original:  {original}")
        print(f"  → New:     {transformed}")
        print()
    
    # Show count of remaining files
    remaining = len(product_info['files']) - len(sample_files)
    if remaining > 0:
        print(f"  ... and {remaining} more files")
        print()

# Summary of naming pattern
print("\n" + "=" * 80)
print("📌 NAMING PATTERN SUMMARY")
print("=" * 80)
print(f"""
The naming convention follows this pattern:
  {EVENT_NAME}_[Location]_[Satellite]_[Product]_[Date]_day.tif

Where:
  - Event: {EVENT_NAME}
  - Location: 3-letter code (e.g., JAN, LZK, MEG)
  - Satellite: S2A, S2B, S2C, etc.
  - Product: NDVI, MNDWI, trueColor
  - Date: YYYY-MM-DD format
  - Suffix: 'day' (customizable in functions)

Example transformations:
  JAN_S2A_NDVI_20250408_merged.tif 
  → {EVENT_NAME}_JAN_S2A_NDVI_2025-04-08_day.tif
""")

# Remind the user to review the preview before continuing
print("\n" + "=" * 80)
print("✅ Review the filename transformations above.")
print("   If you need to adjust the naming pattern, modify the")
print("   create_*_filename() functions in the previous cell.")
print("=" * 80)

In [None]:
# Preview S3 destination paths
print("=" * 80)
print("🗂️  S3 DESTINATION PATHS PREVIEW")
print("=" * 80)

for product_name, product_info in files_to_process.items():
    output_dir = product_info['output_dir']
    print(f"\n🔸 {product_name} files will be saved to:")
    print(f"   s3://{BUCKET}/{DIR_NEW_BASE}/{output_dir}/")
    
    # Show one example with full path
    if product_info['files']:
        filename_creator = FILENAME_CREATORS.get(product_name, create_generic_filename)
        sample_file = product_info['files'][0]
        sample_filename = filename_creator(sample_file, EVENT_NAME)
        
        print(f"\n   Example full S3 path:")
        print(f"   s3://{BUCKET}/{DIR_NEW_BASE}/{output_dir}/{sample_filename}")

print("\n" + "=" * 80)
print("📊 PROCESSING SUMMARY")
print("=" * 80)
print(f"Total files to process: {total_files}")
print(f"Event name: {EVENT_NAME}")
print(f"Source bucket: s3://{BUCKET}/{PATH_OLD}")
print(f"Destination base: s3://{BUCKET}/{DIR_NEW_BASE}/")
print("\nProduct breakdown:")
for product_name, product_info in files_to_process.items():
    print(f"  • {product_name}: {len(product_info['files'])} files")
print("=" * 80)

## 📊 Pre-Processing Analysis

Analyze sample files to understand data ranges and validate no-data configuration:
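
A minimal sketch of such an analysis, assuming pixel values have already been sampled into a plain list (for example from one raster window), might look like this. The function name and the returned keys are hypothetical.

```python
def sketch_analyze_samples(samples, nodata=None):
    """Summarize sampled pixel values, ignoring the no-data value if one is given."""
    valid = [v for v in samples if nodata is None or v != nodata]
    if not valid:
        return {"count": 0, "nodata_count": len(samples),
                "min": None, "max": None, "mean": None}
    return {
        "count": len(valid),
        "nodata_count": len(samples) - len(valid),
        "min": min(valid),
        "max": max(valid),
        "mean": sum(valid) / len(valid),
    }


print(sketch_analyze_samples([-9999, 0.0, 0.5, 1.0], nodata=-9999))
```

The reported `min` and `max` are the values a candidate no-data value must stay outside of.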

In [None]:
def process_files_by_type(file_list, product_name, output_dir, event_name, s3_client):
    """
    Process a list of files for a specific product type.
    
    Args:
        file_list: List of S3 keys to process
        product_name: Name/identifier for this batch of files (e.g., 'NDVI', 'MNDWI', 'trueColor_or_truecolor')
        output_dir: Target output directory
        event_name: Event name for output naming
        s3_client: S3 client
    
    Returns:
        DataFrame with processing results
    """
    if not file_list:
        return pd.DataFrame()
    
    print(f"\n{'='*60}")
    print(f"🚀 Processing {product_name}")
    print(f"{'='*60}")
    
    # Configuration for batch processing
    config = {
        'raw_data_bucket': BUCKET,
        'raw_data_prefix': PATH_OLD,
        'cog_data_bucket': BUCKET,
        'cog_data_prefix': f'{DIR_NEW_BASE}/{output_dir}',
        'local_output_dir': f'output/{event_name}/{product_name}' if SAVE_LOCAL else None
    }
    
    print_status(f"{product_name} Processing Configuration", config)
    
    # Get the appropriate filename creator for this product type
    filename_creator = FILENAME_CREATORS.get(product_name, create_generic_filename)
    
    # Get manual no-data value for this product type
    manual_nodata = MANUAL_NODATA_VALUES.get(product_name)
    if manual_nodata is not None:
        print(f"   📌 Using manual no-data value: {manual_nodata}")
    else:
        print(f"   🔄 Using automatic no-data selection")
    
    # Show overwrite status
    if OVERWRITE_EXISTING:
        print(f"   ♻️  OVERWRITE MODE: Existing files will be replaced")
    else:
        print(f"   ⏭️  SKIP MODE: Existing files will be skipped")
    
    # Create local output directory if needed
    if SAVE_LOCAL and config['local_output_dir']:
        os.makedirs(config['local_output_dir'], exist_ok=True)
    
    # Process each file
    results = []
    verification_queue = []  # Files to verify after processing
    
    for file_path in tqdm(file_list, desc=f"Processing {product_name}"):
        start_time = datetime.now()
        cog_filename = None  # Reset so a failure is never reported with the previous file's name
        
        try:
            # Generate COG filename manually using the appropriate function
            cog_filename = filename_creator(file_path, event_name)
            
            # Check if file already exists
            output_key = f"{config['cog_data_prefix']}/{cog_filename}"
            file_exists = check_s3_file_exists(s3_client, config['cog_data_bucket'], output_key)
            
            # Handle existing files based on OVERWRITE_EXISTING flag
            if file_exists and not OVERWRITE_EXISTING:
                print(f"   ⏭️  Skipping {os.path.basename(file_path)} - already exists")
                results.append({
                    'original_file': file_path,
                    'output_file': cog_filename,
                    'status': 'skipped',
                    'reason': 'File already exists',
                    'processing_time_s': 0,
                    'timestamp': datetime.now().isoformat()
                })
                continue
            elif file_exists and OVERWRITE_EXISTING:
                print(f"   ♻️  Overwriting existing file: {cog_filename}")
            
            print(f"\n📄 Processing: {os.path.basename(file_path)}")
            print(f"   → Output: {cog_filename}")
            
            # Get file size to determine configuration
            file_size_gb = get_file_size_from_s3(s3_client, BUCKET, file_path)
            
            # Select configuration based on size
            if file_size_gb > ULTRA_LARGE_THRESHOLD:
                print(f"   📦 Ultra-large file ({file_size_gb:.1f} GB), using fixed 128x128 chunks")
            elif file_size_gb > LARGE_FILE_THRESHOLD:
                print(f"   📦 Large file ({file_size_gb:.1f} GB), using fixed 256x256 chunks")
            else:
                print(f"   📦 Standard file ({file_size_gb:.1f} GB), using adaptive chunks")
            
            # Get chunk configuration
            chunk_config = get_chunk_config(
                file_size_gb=file_size_gb,
                memory_limit_mb=MEMORY_LIMIT_MB
            )
            
            # Override streaming setting
            chunk_config['use_streaming'] = USE_STREAMING
            
            # Call main processor with manual no-data if configured
            result = convert_to_cog(
                name=file_path,
                bucket=BUCKET,
                cog_filename=cog_filename,
                cog_data_bucket=config['cog_data_bucket'],
                cog_data_prefix=config['cog_data_prefix'],
                s3_client=s3_client,
                local_output_dir=config['local_output_dir'],
                chunk_config=chunk_config,
                manual_nodata=manual_nodata  # Pass manual no-data value
            )
            
            results.append({
                'original_file': file_path,
                'output_file': cog_filename,
                'output_key': output_key,
                'status': 'success',
                'result': result,
                'processing_time_s': (datetime.now() - start_time).total_seconds(),
                'timestamp': datetime.now().isoformat()
            })
            
            # Add to verification queue if enabled
            if VERIFY_PROCESSING and len(verification_queue) < VERIFICATION_SAMPLE_SIZE:
                verification_queue.append({
                    'input_key': file_path,
                    'output_key': output_key,
                    'filename': cog_filename
                })
            
            print(f"   ✅ Successfully processed in {(datetime.now() - start_time).total_seconds():.1f}s")
            
        except Exception as e:
            results.append({
                'original_file': file_path,
                'output_file': cog_filename if 'cog_filename' in locals() else None,
                'status': 'failed',
                'error': str(e),
                'processing_time_s': (datetime.now() - start_time).total_seconds(),
                'timestamp': datetime.now().isoformat()
            })
            
            print(f"   ❌ Error processing {file_path}: {e}")
            if VERBOSE:
                import traceback
                traceback.print_exc()
    
    # Create results DataFrame
    results_df = pd.DataFrame(results)
    
    # Monitor results
    monitor_batch_progress(results_df)
    
    # Run verification if enabled
    if VERIFY_PROCESSING and verification_queue:
        print(f"\n🔍 Verifying {len(verification_queue)} processed files...")
        verification_dir = os.path.join(VERIFICATION_DIR, product_name)
        os.makedirs(verification_dir, exist_ok=True)
        
        from tools.verification import verify_s3_files, create_verification_report
        
        verification_results = []
        for item in verification_queue:
            try:
                result = verify_s3_files(
                    BUCKET, item['input_key'],
                    BUCKET, item['output_key'],
                    verification_dir, s3_client
                )
                verification_results.append(result)
                print(f"   ✓ Verified: {item['filename']}")
            except Exception as e:
                print(f"   ✗ Verification failed for {item['filename']}: {e}")
        
        # Create verification report
        if verification_results:
            report_path = os.path.join(verification_dir, 'verification_report.json')
            create_verification_report(verification_results, report_path)
    
    # Save results if requested
    if SAVE_METADATA and not results_df.empty and config['local_output_dir']:
        csv_filename = f"{config['local_output_dir']}/processing_results_{product_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
        results_df.to_csv(csv_filename, index=False)
        print(f"\n📊 Results saved to: {csv_filename}")
    
    return results_df

print("✅ Processing functions updated with overwrite and verification support")

## 🔧 Define Processing Functions

In [11]:
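# NOTE: earlier version without overwrite or verification support. Running this cell
# after the cell above replaces that definition, so OVERWRITE_EXISTING is ignored.
def process_files_by_type(file_list, product_name, output_dir, event_name, s3_client):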
    """
    Process a list of files for a specific product type.
    
    Args:
        file_list: List of S3 keys to process
        product_name: Name/identifier for this batch of files (e.g., 'NDVI', 'MNDWI', 'trueColor_or_truecolor')
        output_dir: Target output directory
        event_name: Event name for output naming
        s3_client: S3 client
    
    Returns:
        DataFrame with processing results
    """
    if not file_list:
        return pd.DataFrame()
    
    print(f"\n{'='*60}")
    print(f"🚀 Processing {product_name}")
    print(f"{'='*60}")
    
    # Configuration for batch processing
    config = {
        'raw_data_bucket': BUCKET,
        'raw_data_prefix': PATH_OLD,
        'cog_data_bucket': BUCKET,
        'cog_data_prefix': f'{DIR_NEW_BASE}/{output_dir}',
        'local_output_dir': f'output/{event_name}/{product_name}' if SAVE_LOCAL else None
    }
    
    print_status(f"{product_name} Processing Configuration", config)
    
    # Get the appropriate filename creator for this product type
    filename_creator = FILENAME_CREATORS.get(product_name, create_generic_filename)
    
    # Create local output directory if needed
    if SAVE_LOCAL and config['local_output_dir']:
        os.makedirs(config['local_output_dir'], exist_ok=True)
    
    # Process each file
    results = []
    for file_path in tqdm(file_list, desc=f"Processing {product_name}"):
        start_time = datetime.now()
        cog_filename = None  # Reset so a failure is never reported with the previous file's name
        
        try:
            # Generate COG filename manually using the appropriate function
            cog_filename = filename_creator(file_path, event_name)
            
            # Check if file already exists
            output_key = f"{config['cog_data_prefix']}/{cog_filename}"
            
            if check_s3_file_exists(s3_client, config['cog_data_bucket'], output_key):
                print(f"   ⏭️  Skipping {os.path.basename(file_path)} - already exists")
                results.append({
                    'original_file': file_path,
                    'output_file': cog_filename,
                    'status': 'skipped',
                    'reason': 'File already exists',
                    'processing_time_s': 0,
                    'timestamp': datetime.now().isoformat()
                })
                continue
            
            print(f"\n📄 Processing: {os.path.basename(file_path)}")
            print(f"   → Output: {cog_filename}")
            
            # Get file size to determine configuration
            file_size_gb = get_file_size_from_s3(s3_client, BUCKET, file_path)
            
            # Select configuration based on size
            if file_size_gb > ULTRA_LARGE_THRESHOLD:
                print(f"   📦 Ultra-large file ({file_size_gb:.1f} GB), using fixed 128x128 chunks")
            elif file_size_gb > LARGE_FILE_THRESHOLD:
                print(f"   📦 Large file ({file_size_gb:.1f} GB), using fixed 256x256 chunks")
            else:
                print(f"   📦 Standard file ({file_size_gb:.1f} GB), using adaptive chunks")
            
            # Get chunk configuration
            chunk_config = get_chunk_config(
                file_size_gb=file_size_gb,
                memory_limit_mb=MEMORY_LIMIT_MB
            )
            
            # Override streaming setting
            chunk_config['use_streaming'] = USE_STREAMING
            
            # Call main processor
            result = convert_to_cog(
                name=file_path,
                bucket=BUCKET,
                cog_filename=cog_filename,
                cog_data_bucket=config['cog_data_bucket'],
                cog_data_prefix=config['cog_data_prefix'],
                s3_client=s3_client,
                local_output_dir=config['local_output_dir'],
                chunk_config=chunk_config
            )
            
            results.append({
                'original_file': file_path,
                'output_file': cog_filename,
                'status': 'success',
                'result': result,
                'processing_time_s': (datetime.now() - start_time).total_seconds(),
                'timestamp': datetime.now().isoformat()
            })
            
            print(f"   ✅ Successfully processed in {(datetime.now() - start_time).total_seconds():.1f}s")
            
        except Exception as e:
            results.append({
                'original_file': file_path,
                'output_file': cog_filename if 'cog_filename' in locals() else None,
                'status': 'failed',
                'error': str(e),
                'processing_time_s': (datetime.now() - start_time).total_seconds(),
                'timestamp': datetime.now().isoformat()
            })
            
            print(f"   ❌ Error processing {file_path}: {e}")
            if VERBOSE:
                import traceback
                traceback.print_exc()
    
    # Create results DataFrame
    results_df = pd.DataFrame(results)
    
    # Monitor results
    monitor_batch_progress(results_df)
    
    # Save results if requested
    if SAVE_METADATA and not results_df.empty and config['local_output_dir']:
        csv_filename = f"{config['local_output_dir']}/processing_results_{product_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
        results_df.to_csv(csv_filename, index=False)
        print(f"\n📊 Results saved to: {csv_filename}")
    
    return results_df

print("✅ Processing functions defined with manual filename generation")


✅ Processing functions defined with manual filename generation
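
The output names in the run log below (for example, `JAN_S2A_MNDWI_20250408_merged.tif` becoming `202504_SevereWx_US_JAN_S2A_MNDWI_2025-04-08_day.tif`) suggest a simple renaming pattern. The sketch below reproduces that pattern with a regular expression. `build_cog_filename` and the regex are hypothetical and inferred from the log only; they are not the template's actual naming code, and other products may follow a different pattern.

```python
import posixpath
import re

# Hypothetical pattern inferred from the log:
#   <SITE>_<SENSOR>_<PRODUCT>_<YYYYMMDD>_merged.tif
_SOURCE_NAME = re.compile(
    r"^(?P<site>[A-Za-z0-9]+)_(?P<sensor>[A-Za-z0-9]+)_(?P<product>[A-Za-z0-9]+)"
    r"_(?P<date>\d{8})_merged\.tif$"
)


def build_cog_filename(source_key, event_name):
    """Return the output name for an S3 key, or raise ValueError if it does not match."""
    name = posixpath.basename(source_key)  # S3 keys always use forward slashes
    match = _SOURCE_NAME.match(name)
    if match is None:
        raise ValueError(f"Unrecognized source filename: {name}")
    date = match["date"]
    iso_date = f"{date[:4]}-{date[4:6]}-{date[6:]}"
    return (
        f"{event_name}_{match['site']}_{match['sensor']}_"
        f"{match['product']}_{iso_date}_day.tif"
    )
```

Raising on an unrecognized name, rather than guessing, keeps a mislabeled source file from being uploaded under a misleading output name.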


## 🚀 Execute Processing

In [12]:
# Initialize results storage
all_results = []
processing_start = datetime.now()

print(f"Starting processing at {processing_start.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Memory usage at start: {get_memory_usage():.1f} MB")

Starting processing at 2025-09-27 00:05:07
Memory usage at start: 232.6 MB


In [13]:
# Process each product type
for product_name, product_info in files_to_process.items():
    results = process_files_by_type(
        file_list=product_info['files'],
        product_name=product_name,
        output_dir=product_info['output_dir'],
        event_name=EVENT_NAME,
        s3_client=s3_client
    )
    
    if not results.empty:
        all_results.append((product_name, results))
    
    # Memory cleanup
    gc.collect()
    monitor_memory(threshold_mb=1000)


🚀 Processing NDVI

NDVI Processing Configuration
  raw_data_bucket: nasa-disasters
  raw_data_prefix: drcs_activations/202504_SevereWx_US/sentinel2
  cog_data_bucket: nasa-disasters
  cog_data_prefix: drcs_activations_new/Sentinel-2/NDVI
  local_output_dir: output/202504_SevereWx_US/NDVI



Processing NDVI:  36%|███▋      | 8/22 [00:00<00:00, 79.68it/s]

   ⏭️  Skipping JAN_S2A_NDVI_20250322_merged.tif - already exists
   ⏭️  Skipping JAN_S2A_NDVI_20250408_merged.tif - already exists
   ⏭️  Skipping JAN_S2B_NDVI_20250322_merged.tif - already exists
   ⏭️  Skipping JAN_S2C_NDVI_20250409_merged.tif - already exists
   ⏭️  Skipping LZK_S2B_NDVI_20250407_merged.tif - already exists
   ⏭️  Skipping LZK_S2C_NDVI_20150313_merged.tif - already exists
   ⏭️  Skipping LZK_S2C_NDVI_20250409_merged.tif - already exists
   ⏭️  Skipping MEG_S2A_NDVI_20250322_merged.tif - already exists
   ⏭️  Skipping MEG_S2A_NDVI_20250408_merged.tif - already exists
   ⏭️  Skipping MEG_S2B_NDVI_20250322_merged.tif - already exists
   ⏭️  Skipping MEG_S2C_NDVI_20250409_merged.tif - already exists
   ⏭️  Skipping OHX_S2A_NDVI_20250322_merged.tif - already exists
   ⏭️  Skipping OHX_S2A_NDVI_20250408_merged.tif - already exists
   ⏭️  Skipping OHX_S2B_NDVI_20250322_merged.tif - already exists
   ⏭️  Skipping PAH_S2A_NDVI_20250322_merged.tif - already exists


Processing NDVI:  73%|███████▎  | 16/22 [00:00<00:00, 76.36it/s]

   ⏭️  Skipping PAH_S2A_NDVI_20250408_merged.tif - already exists


Processing NDVI: 100%|██████████| 22/22 [00:00<00:00, 77.03it/s]


   ⏭️  Skipping PAH_S2B_NDVI_20250322_merged.tif - already exists
   ⏭️  Skipping PAH_S2C_NDVI_20250409_merged.tif - already exists
   ⏭️  Skipping SHV_S2A_NDVI_20250407_merged.tif - already exists
   ⏭️  Skipping SHV_S2B_NDVI_20250331_merged.tif - already exists
   ⏭️  Skipping SHV_S2B_NDVI_20250407_merged.tif - already exists
   ⏭️  Skipping SHV_S2C_NDVI_20250313_merged.tif - already exists

BATCH PROCESSING SUMMARY
  total: 22
  skipped: 22
  success_rate: 0.00
  total_time_minutes: 0.00
  avg_time_seconds: 0.00

📊 Results saved to: output/202504_SevereWx_US/NDVI/processing_results_NDVI_20250927_000508.csv

🚀 Processing MNDWI

MNDWI Processing Configuration
  raw_data_bucket: nasa-disasters
  raw_data_prefix: drcs_activations/202504_SevereWx_US/sentinel2
  cog_data_bucket: nasa-disasters
  cog_data_prefix: drcs_activations_new/Sentinel-2/MNDWI
  local_output_dir: output/202504_SevereWx_US/MNDWI



Processing MNDWI:   0%|          | 0/11 [00:00<?, ?it/s]


📄 Processing: JAN_S2A_MNDWI_20250408_merged.tif
   → Output: 202504_SevereWx_US_JAN_S2A_MNDWI_2025-04-08_day.tif
   📦 Standard file (0.8 GB), using adaptive chunks
   [CHECK] Checking if file already exists in S3: s3://nasa-disasters/drcs_activations_new/Sentinel-2/MNDWI/202504_SevereWx_US_JAN_S2A_MNDWI_2025-04-08_day.tif
   [INFO] File size: 0.8 GB
   [TEMP] Using temp directory: /tmp
   [MEMORY] Initial: 233.5 MB, Available: 112866.6 MB
   [CACHE HIT] Using cached file: data_download/drcs_activations/202504_SevereWx_US/sentinel2/JAN_S2A_MNDWI_20250408_merged.tif
   [CHUNKS] Using adaptive chunk size starting at: 1024x1024
   [REPROJECT] Converting to EPSG:4326 using fixed-grid chunked processing...
   [ERROR] attempted relative import beyond top-level package


Traceback (most recent call last):
  File "/tmp/ipykernel_3875/49338150.py", line 88, in process_files_by_type
    result = convert_to_cog(
             ^^^^^^^^^^^^^^^
  File "/home/jovyan/disasters-aws-conversion/main_processor.py", line 164, in convert_to_cog
    process_with_fixed_chunks(
  File "/home/jovyan/disasters-aws-conversion/core/reprojection.py", line 92, in process_with_fixed_chunks
    from ..utils.memory_management import get_memory_usage
ImportError: attempted relative import beyond top-level package
Processing MNDWI:   9%|▉         | 1/11 [00:00<00:02,  4.18it/s]

   ❌ Error processing drcs_activations/202504_SevereWx_US/sentinel2/JAN_S2A_MNDWI_20250408_merged.tif: attempted relative import beyond top-level package

📄 Processing: JAN_S2C_MNDWI_20250409_merged.tif
   → Output: 202504_SevereWx_US_JAN_S2C_MNDWI_2025-04-09_day.tif
   📦 Standard file (1.9 GB), using adaptive chunks
   [CHECK] Checking if file already exists in S3: s3://nasa-disasters/drcs_activations_new/Sentinel-2/MNDWI/202504_SevereWx_US_JAN_S2C_MNDWI_2025-04-09_day.tif
   [INFO] File size: 1.9 GB
   [TEMP] Using temp directory: /tmp
   [MEMORY] Initial: 250.6 MB, Available: 112861.2 MB
   [DOWNLOAD] Downloading from S3...
   [DOWNLOAD] Downloading from S3: s3://nasa-disasters/drcs_activations/202504_SevereWx_US/sentinel2/JAN_S2C_MNDWI_20250409_merged.tif


Processing MNDWI:   9%|▉         | 1/11 [00:15<02:31, 15.19s/it]


KeyboardInterrupt: 

In [None]:
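# Process RGB/True Color files
# process_files_by_type takes `product_name` (see the loop above);
# add `output_dir` as well if the function requires it.
if PROCESS_RGB and rgb_files:
    rgb_results = process_files_by_type(
        file_list=rgb_files,
        product_name='RGB',
        event_name=EVENT_NAME,
        s3_client=s3_client
    )
    if not rgb_results.empty:
        all_results.append(('RGB', rgb_results))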
    
    # Memory cleanup
    gc.collect()
    monitor_memory(threshold_mb=1000)

In [None]:
# Combine all results
if all_results:
    # Combine DataFrames
    combined_results = pd.concat([df for _, df in all_results], ignore_index=True)
    
    print("\n" + "="*60)
    print("📊 FINAL PROCESSING REPORT")
    print("="*60)
    
    # Overall statistics
    print(f"\nTotal files processed: {len(combined_results)}")
    
    # By product type
    print("\nFiles by Product Type:")
    for product, df in all_results:
        if not df.empty:
            success = len(df[df['status'] == 'success']) if 'status' in df.columns else 0
            failed = len(df[df['status'] == 'failed']) if 'status' in df.columns else 0
            skipped = len(df[df['status'] == 'skipped']) if 'status' in df.columns else 0
            print(f"  {product}:")
            print(f"    - Total: {len(df)}")
            print(f"    - Success: {success}")
            print(f"    - Failed: {failed}")
            print(f"    - Skipped: {skipped}")
    
    # Time statistics
    total_time = (datetime.now() - processing_start).total_seconds()
    print(f"\nTotal processing time: {total_time/60:.1f} minutes")
    
    if 'processing_time_s' in combined_results.columns:
        avg_time = combined_results['processing_time_s'].mean()
        max_time = combined_results['processing_time_s'].max()
        print(f"Average time per file: {avg_time:.1f} seconds")
        print(f"Maximum time for a file: {max_time:.1f} seconds")
    
    # Memory statistics
    final_memory = get_memory_usage()
    print(f"\nFinal memory usage: {final_memory:.1f} MB")
    
    # Save combined results
    if SAVE_METADATA:
        output_dir = f"output/{EVENT_NAME}"
        os.makedirs(output_dir, exist_ok=True)
        
        # Use one timestamp so the CSV and summary file names match
        run_stamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        
        # Save CSV
        csv_path = f"{output_dir}/combined_results_{run_stamp}.csv"
        combined_results.to_csv(csv_path, index=False)
        print(f"\n📁 Results saved to: {csv_path}")
        
        # Save summary
        success_pct = (combined_results['status'] == 'success').mean() * 100
        summary_path = f"{output_dir}/processing_summary_{run_stamp}.txt"
        with open(summary_path, 'w') as f:
            f.write(f"Processing Summary for {EVENT_NAME}\n")
            f.write("=" * 60 + "\n")
            f.write(f"Total files: {len(combined_results)}\n")
            f.write(f"Total time: {total_time/60:.1f} minutes\n")
            f.write(f"Success rate: {success_pct:.1f}%\n")
        print(f"📁 Summary saved to: {summary_path}")
    
    print("\n" + "="*60)
    print("✅ PROCESSING COMPLETE!")
    print("="*60)
    
else:
    print("No files were processed")
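
The batch summary in the run log reports `success_rate: 0.00` for a run in which all 22 files were skipped, which can read like a failure. One way to make this less ambiguous is to compute the success rate only over files that were actually attempted. The sketch below uses a hypothetical `summarize_statuses` helper that is not part of the template.

```python
from collections import Counter


def summarize_statuses(statuses):
    """Count statuses and compute the success rate over attempted files only.

    `statuses` is any iterable of 'success' / 'failed' / 'skipped' strings,
    for example the `status` column of the results DataFrame.
    """
    counts = Counter(statuses)
    attempted = counts["success"] + counts["failed"]
    return {
        "total": sum(counts.values()),
        "success": counts["success"],
        "failed": counts["failed"],
        "skipped": counts["skipped"],
        # None signals "nothing was attempted" instead of a misleading 0.0
        "success_rate": counts["success"] / attempted if attempted else None,
    }
```

It could be called as `summarize_statuses(combined_results['status'])`; a `success_rate` of `None` then means every file was skipped.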

## 🔍 Manual Verification (Optional)

Run this cell to inspect the per-file results table:

In [None]:
# Display detailed results
if 'combined_results' in locals() and not combined_results.empty:
    print("\nDetailed Results DataFrame:")
    try:
        from IPython.display import display  # rich table when running in Jupyter
        display(combined_results)
    except ImportError:
        print(combined_results)

## 🔍 Troubleshooting & Validation

In [None]:
# Check for failed files and diagnose issues
if 'combined_results' in locals() and not combined_results.empty:
    failed = combined_results[combined_results['status'] == 'failed'] if 'status' in combined_results.columns else pd.DataFrame()
    
    if not failed.empty:
        print("\n⚠️ Failed Files Analysis:")
        print("="*60)
        
        for idx, row in failed.iterrows():
            print(f"\nFile: {row['original_file']}")
            print(f"Error: {row.get('error', 'Unknown error')}")
            
            # Suggest solutions based on error type
            error_str = str(row.get('error', '')).lower()
            
            if 'chunk and warp' in error_str:
                print("  💡 Solution: This is a GDAL streaming issue. Set USE_STREAMING = False")
            elif 'memory' in error_str:
                print("  💡 Solution: Reduce MEMORY_LIMIT_MB or use smaller chunks")
            elif 'permission' in error_str:
                print("  💡 Solution: Check AWS credentials and S3 permissions")
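            elif 'timeout' in error_str:
                print("  💡 Solution: Network issue. Try again or download locally first")
            elif 'relative import' in error_str:
                print("  💡 Solution: A module uses a relative import that escapes its package. Check the import named in the traceback")
            else:
                print("  💡 No known fix matched. Set VERBOSE = True and rerun to see the full traceback")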
    else:
        print("\n✅ No failed files!")

In [None]:
# Validate COGs in S3
def validate_uploaded_cogs(results_df, s3_client, sample_size=3):
    """
    Validate a sample of uploaded COGs.
    """
    if results_df.empty or 'output_file' not in results_df.columns:
        return
    
    success_files = results_df[results_df['status'] == 'success']['output_file'].tolist()
    
    if not success_files:
        return
    
    # Sample files to validate
    import random
    sample = random.sample(success_files, min(sample_size, len(success_files)))
    
    print(f"\n🔍 Validating {len(sample)} COG files in S3...")
    print("="*60)
    
    for filename in sample:
        print(f"\nValidating: {filename}")
        
        # Placeholder: no S3 request is made here yet. Build the full S3 key for
        # your destination layout and check it before reporting success.
        print("  ⚠️ Not verified: existence, COG structure, and overview checks are not implemented")

# Run validation
if 'combined_results' in locals() and s3_client:
    validate_uploaded_cogs(combined_results, s3_client)
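
The validation cell above does not contact S3. A minimal sketch of an existence check follows. It assumes a boto3-style client whose `head_object` raises an exception carrying `response['Error']['Code']` when the request fails. `cog_exists` and the `prefix/filename` key layout are hypothetical and should be adapted to the destination structure.

```python
def cog_exists(s3_client, bucket, prefix, filename):
    """Return True if s3://bucket/prefix/filename exists, False if it is missing.

    Assumption: keys are laid out as '<prefix>/<filename>'.
    """
    key = f"{prefix.rstrip('/')}/{filename}"
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as exc:
        # With boto3 this would be botocore.exceptions.ClientError; a generic
        # catch keeps the sketch free of a hard botocore dependency.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise  # permission or network errors should not be reported as "missing"
```

Re-raising anything other than a not-found code keeps a credentials problem from being mistaken for a missing file. Checking that the object is a valid COG (tiling, overviews) is a separate step that this sketch does not cover.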

## 🧹 Cleanup

In [None]:
# Optional: Clean up cache and temporary files
def cleanup_processing_artifacts():
    """
    Clean up temporary files and cache.
    """
    directories_to_clean = [
        'reproj',
        'temp_cog',
        '/tmp/tmp*.tif'
    ]
    
    cleaned_count = cleanup_temp_files(*directories_to_clean)
    print(f"✅ Cleaned up {cleaned_count} temporary files/directories")
    
    # Force garbage collection
    gc.collect()
    print(f"✅ Memory usage after cleanup: {get_memory_usage():.1f} MB")

# Uncomment to run cleanup
# cleanup_processing_artifacts()
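
The list passed to `cleanup_temp_files` mixes directory names (`reproj`, `temp_cog`) with a glob pattern (`/tmp/tmp*.tif`). How the template's helper treats each is not shown here. The standard-library sketch below handles both cases; `remove_artifacts` is a hypothetical stand-in, not the template's implementation.

```python
import glob
import os
import shutil


def remove_artifacts(*patterns):
    """Delete every file or directory matching the given paths or glob patterns.

    Returns the number of top-level entries removed. Paths that do not exist
    are ignored, because glob.glob() returns no match for them.
    """
    removed = 0
    for pattern in patterns:
        for path in glob.glob(pattern):
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)  # remove the directory and its contents
            else:
                os.remove(path)      # regular file or symlink
            removed += 1
    return removed
```

A call such as `remove_artifacts('reproj', 'temp_cog', '/tmp/tmp*.tif')` would mirror the list above. Note that a broad pattern in a shared `/tmp` may match files owned by other processes.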

## 📚 Reference & Help

### Common Issues and Solutions

1. **"Chunk and warp failed" error**
   - Set `USE_STREAMING = False` in configuration
   - File will be downloaded locally before processing

2. **Memory errors**
   - Reduce `MEMORY_LIMIT_MB` (e.g., to 250)
   - Lower `ULTRA_LARGE_THRESHOLD` so the ultra-large configuration (smaller chunks) applies to smaller files

3. **Striping in output files**
   - Ensure `FORCE_FIXED_CHUNKS = True`
   - This maintains consistent chunk alignment

4. **S3 permission errors**
   - Check AWS credentials: `aws configure list`
   - Verify bucket access: `aws s3 ls s3://bucket-name/`

5. **Files being skipped**
   - Files already exist in destination
   - Delete existing files if you want to reprocess
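
To make the threshold advice in item 2 concrete, here is a minimal sketch of size-based profile selection. It assumes the rule stated in the configuration cell comments (files larger than `LARGE_FILE_THRESHOLD` use the large-file configuration, files larger than `ULTRA_LARGE_THRESHOLD` use the ultra-large one). `select_profile` and the profile names are hypothetical; the template's own selection logic may differ.

```python
def select_profile(size_gb, large_threshold=3, ultra_large_threshold=7):
    """Pick a processing profile name from the file size in GB."""
    if size_gb > ultra_large_threshold:
        return "ultra_large"
    if size_gb > large_threshold:
        return "large"
    return "standard"


# A 5 GB file is "large" with the default thresholds ...
default_choice = select_profile(5)
# ... and moves to the ultra-large profile once that threshold is lowered to 4 GB.
lowered_choice = select_profile(5, ultra_large_threshold=4)
```

Under this rule, lowering a threshold moves more files into the more conservative profile.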

### Module Structure

- **core/** - Core functionality (S3, validation, reprojection, compression)
- **utils/** - Utilities (memory, naming, error handling, logging)
- **processors/** - Processing logic (chunks, COG creation, batches)
- **configs/** - Configuration profiles
- **main_processor.py** - Main processing orchestrator
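
The `ImportError` in the run log (`attempted relative import beyond top-level package`) is raised at `from ..utils.memory_management import get_memory_usage` inside `core/reprojection.py`. The traceback paths suggest that `core` and `utils` are imported as sibling top-level packages from the repository root, in which case `..` points above `core` and cannot be resolved. The sketch below reproduces the situation in a temporary directory with hypothetical packages `demo_core` and `demo_utils`, and compares it with an absolute import. Whether an absolute import is the right fix for the actual repository is an assumption.

```python
import importlib
import os
import sys
import tempfile
import textwrap


def write(path, text):
    """Create parent directories and write dedented source text to path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as fh:
        fh.write(textwrap.dedent(text))


def try_import(root, module_name):
    """Import module_name with root on sys.path and call its helper().

    Returns None on success, or the error message on failure.
    """
    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        importlib.import_module(module_name).helper()
        return None
    except (ImportError, ValueError) as exc:  # ValueError on Python < 3.9
        return str(exc)
    finally:
        sys.path.remove(root)


with tempfile.TemporaryDirectory() as root:
    write(os.path.join(root, "demo_utils", "__init__.py"), "")
    write(os.path.join(root, "demo_utils", "memory.py"), """
        def get_memory_usage():
            return 0.0
    """)
    write(os.path.join(root, "demo_core", "__init__.py"), "")
    # Same shape as the failing import: '..' climbs above the top-level package.
    write(os.path.join(root, "demo_core", "relative.py"), """
        def helper():
            from ..demo_utils.memory import get_memory_usage
            return get_memory_usage()
    """)
    # Absolute import of the sibling package, resolved through sys.path.
    write(os.path.join(root, "demo_core", "absolute.py"), """
        def helper():
            from demo_utils.memory import get_memory_usage
            return get_memory_usage()
    """)
    relative_error = try_import(root, "demo_core.relative")
    absolute_error = try_import(root, "demo_core.absolute")

print("relative import:", relative_error)
print("absolute import:", absolute_error or "ok")
```

The relative form would only work if `core` and `utils` were subpackages of one common parent package that is itself imported by name.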

### Links

- [VEDA File Naming Conventions](https://docs.openveda.cloud/user-guide/content-curation/dataset-ingestion/file-preparation.html)
- [Cloud Optimized GeoTIFF Info](https://www.cogeo.org/)
- [NASA Disasters Portal](https://data.disasters.openveda.cloud/)