# Process Files from CSV Mapping

This notebook loads a filename mapping CSV (created by `renaming_file_template.ipynb`) and processes files to Cloud Optimized GeoTIFFs (COGs) with the new filenames.

## Features
- **Load CSV mapping** - Import pre-defined filename transformations
- **Preview before processing** - Review what will be processed
- **Batch COG conversion** - Convert all files to COGs
- **Track results** - Save processing results to CSV

## Workflow
1. Generate mapping CSV using `renaming_file_template.ipynb`
2. Review and validate the CSV
3. Run this notebook to process files
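
Part of workflow step 2 (reviewing and validating the CSV) can be automated. The following is a minimal sketch, assuming the column names that appear in the mapping CSV loaded later in this notebook (`original_s3_path`, `output_s3_path`, `status`). The helper name `validate_mapping_rows` and the sample rows are hypothetical and not part of the project's `core` package.

```python
import csv
import io
from collections import Counter

# Columns this sketch assumes are present in the mapping CSV.
REQUIRED_COLUMNS = {"original_s3_path", "output_s3_path", "status"}


def validate_mapping_rows(rows):
    """Return a list of human-readable problems found in the mapping rows.

    `rows` is a list of dicts, e.g. produced by csv.DictReader.
    Hypothetical helper for illustration only.
    """
    if not rows:
        return ["mapping is empty"]

    missing = REQUIRED_COLUMNS - set(rows[0].keys())
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []

    # Two sources mapped to the same destination would overwrite each other.
    dest_counts = Counter(row["output_s3_path"] for row in rows)
    for dest, count in dest_counts.items():
        if count > 1:
            problems.append(f"duplicate output path ({count}x): {dest}")

    # Rows with an empty source or destination cannot be processed.
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        if not row["original_s3_path"] or not row["output_s3_path"]:
            problems.append(f"line {line_no}: empty source or output path")

    return problems


# Small in-memory example instead of a real file.
sample_csv = io.StringIO(
    "original_s3_path,output_s3_path,status\n"
    "in/a.tif,out/a.tif,valid\n"
    "in/b.tif,out/a.tif,valid\n"
)
sample_rows = list(csv.DictReader(sample_csv))
sample_problems = validate_mapping_rows(sample_rows)
print(sample_problems)
```

An empty list means none of these particular checks found a problem; it does not replace a manual review of the new filenames.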

---

## 📋 Step 1: Basic Configuration

Set your event name to load the corresponding CSV mapping file:

In [31]:
# ========================================
# INPUTS
# ========================================

# S3 Configuration
BUCKET = 'nasa-disasters'    # S3 bucket (DO NOT CHANGE)

# Event Details
EVENT_NAME = '202510_Flood_AK'  # Must match the event name used in renaming_file_template.ipynb
SUB_PRODUCT_NAME = 'sentinel2'

# CSV Mapping Configuration
CSV_DIR = f'file-mapping/{EVENT_NAME}'  # Directory where CSV mappings are stored
CSV_FILENAME = f'{EVENT_NAME}-{SUB_PRODUCT_NAME}.csv'  # CSV filename (usually {EVENT_NAME}-{subproductName}.csv)

# Processing Options
CHECK_SOURCE_IS_COG = True  # Check if source files are already valid COGs
SKIP_IF_SOURCE_IS_COG = True  # Skip processing if source is already a valid COG
OVERWRITE = False      # Set to True to replace existing files in S3 (after converting to COG)
VERIFY = True          # Set to True to verify COGs after creation
SAVE_RESULTS = True    # Save processing results to CSV

# Output
OUTPUT_DIR = 'csv_stats'  # Directory for results CSV


## 📂 Step 2: Load CSV Mapping

Load the filename mapping CSV created by the renaming template:

In [32]:
import pandas as pd
from pathlib import Path
import sys
import os

# Add parent directory to path for importing functions
sys.path.insert(0, str(Path('..').resolve()))

# Construct CSV path
csv_path = Path(CSV_DIR) / CSV_FILENAME

print("📂 LOADING CSV MAPPING")
print("="*80)
print(f"\nLooking for: {csv_path}")

# Check if CSV exists
if not csv_path.exists():
    print(f"\nERROR: CSV file not found!")
    print(f"Expected location: {csv_path.absolute()}")
    mapping_df = pd.DataFrame()
else:
    # Load CSV
    mapping_df = pd.read_csv(csv_path)
    
    print(f"\nSuccessfully loaded CSV")
    print(f"\nMapping details:")
    print(f"   Total entries: {len(mapping_df)}")

    # Display the mapping (Jupyter truncates long tables).
    # display() renders the table itself and returns None, so it is not wrapped in print().
    display(mapping_df)

📂 LOADING CSV MAPPING

Looking for: file-mapping/202510_Flood_AK/202510_Flood_AK-sentinel2.csv

Successfully loaded CSV

Mapping details:
   Total entries: 123


| Unnamed: 0 | original_filename | new_filename | category | file_size_gb | nodata_value | status | original_s3_path | output_s3_path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | S2B_MSIL2A_colorInfrared_20250913_222529_T03VV... | 202510_Flood_AK_S2B_MSIL2A_colorInfrared_22252... | colorInfrared | 0.336965 | 0 | valid | drcs_activations/202510_Flood_AK/sentinel2/S2B... | drcs_activations_new/Sentinel-2/colorIR/202510... |
| 1 | S2B_MSIL2A_colorInfrared_20250913_222529_T03VV... | 202510_Flood_AK_S2B_MSIL2A_colorInfrared_22252... | colorInfrared | 0.336965 | 0 | valid | drcs_activations/202510_Flood_AK/sentinel2/S2B... | drcs_activations_new/Sentinel-2/colorIR/202510... |
| 2 | S2B_MSIL2A_colorInfrared_20250913_222529_T03VV... | 202510_Flood_AK_S2B_MSIL2A_colorInfrared_22252... | colorInfrared | 0.336965 | 0 | valid | drcs_activations/202510_Flood_AK/sentinel2/S2B... | drcs_activations_new/Sentinel-2/colorIR/202510... |
| 3 | S2B_MSIL2A_colorInfrared_20250913_222529_T03VW... | 202510_Flood_AK_S2B_MSIL2A_colorInfrared_22252... | colorInfrared | 0.336965 | 0 | valid | drcs_activations/202510_Flood_AK/sentinel2/S2B... | drcs_activations_new/Sentinel-2/colorIR/202510... |
| 4 | S2B_MSIL2A_colorInfrared_20250913_222529_T03VW... | 202510_Flood_AK_S2B_MSIL2A_colorInfrared_22252... | colorInfrared | 0.336965 | 0 | valid | drcs_activations/202510_Flood_AK/sentinel2/S2B... | drcs_activations_new/Sentinel-2/colorIR/202510... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 118 | S2C_MSIL2A_trueColor_20251021_223601_T03VXJ.tif | 202510_Flood_AK_S2C_MSIL2A_trueColor_223601_T0... | trueColor | 0.336965 | 0 | valid | drcs_activations/202510_Flood_AK/sentinel2/S2C... | drcs_activations_new/Sentinel-2/trueColor/2025... |
| 119 | S2C_MSIL2A_trueColor_20251021_223601_T03VXK.tif | 202510_Flood_AK_S2C_MSIL2A_trueColor_223601_T0... | trueColor | 0.336965 | 0 | valid | drcs_activations/202510_Flood_AK/sentinel2/S2C... | drcs_activations_new/Sentinel-2/trueColor/2025... |
| 120 | S2C_MSIL2A_trueColor_20251021_223601_T03VXL.tif | 202510_Flood_AK_S2C_MSIL2A_trueColor_223601_T0... | trueColor | 0.336965 | 0 | valid | drcs_activations/202510_Flood_AK/sentinel2/S2C... | drcs_activations_new/Sentinel-2/trueColor/2025... |
| 121 | S2C_MSIL2A_trueColor_20251021_223601_T03WWQ.tif | 202510_Flood_AK_S2C_MSIL2A_trueColor_223601_T0... | trueColor | 0.336965 | 0 | valid | drcs_activations/202510_Flood_AK/sentinel2/S2C... | drcs_activations_new/Sentinel-2/trueColor/2025... |


## 🔍 Step 3: Filter Files to Process

Optionally filter which files to process (by category, size, etc.):

In [33]:
if not mapping_df.empty:
    # Filter out uncategorized files (they won't be processed)
    files_to_process = mapping_df[mapping_df['status'] == 'valid'].copy()
    
    # Optional: Filter by category
    # Uncomment and modify to process only specific categories:
    # CATEGORIES_TO_PROCESS = ['trueColor', 'colorInfrared']
    # files_to_process = files_to_process[files_to_process['category'].isin(CATEGORIES_TO_PROCESS)]
    
    # Optional: Filter by file size
    # Uncomment to process only files smaller than a certain size:
    # MAX_SIZE_GB = 5.0
    # files_to_process = files_to_process[files_to_process['file_size_gb'] <= MAX_SIZE_GB]
    
    print("FILES TO PROCESS")
    print("="*80)
    print(f"\nTotal files: {len(files_to_process)}")
    print(f"Total size:  {files_to_process['file_size_gb'].sum():.2f} GB")
    
    if len(files_to_process) > 0:
        print(f"\nBy category:")
        category_counts = files_to_process['category'].value_counts()
        for category, count in category_counts.items():
            print(f"   • {category}: {count} files")

        print(f"\n✅ Ready to process {len(files_to_process)} files")
    else:
        print("\n⚠️  No files match the filter criteria")

    print("\n" + "="*80)
else:
    print("⚠️ No mapping data loaded. Check Step 2.")
    files_to_process = pd.DataFrame()

FILES TO PROCESS

Total files: 123
Total size:  31.09 GB

By category:
   • colorInfrared: 41 files
   • shortwaveIR: 41 files
   • trueColor: 41 files

✅ Ready to process 123 files



## 🌐 Step 4: Connect to S3

Initialize S3 client for downloading source files and uploading processed COGs:

In [34]:
from core.s3_operations import initialize_s3_client

print("🌐 Connecting to S3...")
s3_client, fs = initialize_s3_client(bucket_name=BUCKET, verbose=True)


🌐 Connecting to S3...
🔑 Attempting to authenticate with external ID for upload permissions...
✅ S3 client initialized with UPLOAD permissions via external ID
✅ Confirmed access to nasa-disasters bucket
✅ S3 filesystem (fsspec) initialized


## ⚙️ Step 5: Process Files to COGs

Convert files to Cloud Optimized GeoTIFFs with the new filenames:

In [35]:
if not files_to_process.empty and s3_client:
    from core.cog_processing import process_single_file
    from core.s3_operations import check_s3_file_exists
    import time

    print("🚀 STARTING COG PROCESSING")
    print("="*80)
    print(f"\nProcessing {len(files_to_process)} files...")
    print("This may take several minutes depending on file sizes.\n")

    # Display processing options
    print(f"Processing options:")
    print(f"  Check source is COG: {CHECK_SOURCE_IS_COG}")
    print(f"  Skip if source is COG: {SKIP_IF_SOURCE_IS_COG}")
    print(f"  Overwrite existing: {OVERWRITE}")
    print(f"  Verify COGs: {VERIFY}\n")

    # Track results
    results = []
    total = len(files_to_process)

    # enumerate() gives a 1-based progress counter; the DataFrame index
    # can have gaps after the filtering in Step 3, so it is not used here.
    for position, (_, row) in enumerate(files_to_process.iterrows(), start=1):
        start_time = time.time()

        source_path = row['original_s3_path']
        dest_path = row['output_s3_path']
        category = row['category']

        # Get nodata value from CSV (if available)
        nodata = row.get('nodata_value', None)
        if nodata is not None and pd.isna(nodata):
            nodata = None  # Handle NaN values

        print(f"\n[{position}/{total}] Processing: {row['original_filename']}")
        print(f"    Category: {category}")
        print(f"    Size: {row['file_size_gb']:.2f} GB")
        print(f"    Nodata: {nodata}")
        print(f"    Output: {row['new_filename']}")

        # Check if file already exists (unless OVERWRITE is True)
        if not OVERWRITE:
            if check_s3_file_exists(s3_client, BUCKET, dest_path):
                print(f"    ⏭️  SKIPPED (already exists)")
                results.append({
                    'source_file': row['original_filename'],
                    'output_file': row['new_filename'],
                    'category': category,
                    'status': 'skipped',
                    'time_seconds': 0,
                    'error': 'File already exists'
                })
                continue

        try:
            # Process file to COG (COG checking is built into this function)
            success = process_single_file(
                s3_client=s3_client,
                bucket=BUCKET,
                source_key=source_path,
                dest_key=dest_path,
                nodata=nodata,  # Use nodata from CSV
                verify=VERIFY,
                check_source_is_cog=CHECK_SOURCE_IS_COG,
                skip_if_source_is_cog=SKIP_IF_SOURCE_IS_COG,
                verbose=True
            )

            elapsed = time.time() - start_time

            if success:
                print(f"    ✅ SUCCESS ({elapsed:.1f}s)")
                results.append({
                    'source_file': row['original_filename'],
                    'output_file': row['new_filename'],
                    'category': category,
                    'status': 'success',
                    'time_seconds': elapsed,
                    'error': None
                })
            else:
                print(f"    ❌ FAILED ({elapsed:.1f}s)")
                results.append({
                    'source_file': row['original_filename'],
                    'output_file': row['new_filename'],
                    'category': category,
                    'status': 'failed',
                    'time_seconds': elapsed,
                    'error': 'Processing failed'
                })

        except Exception as e:
            elapsed = time.time() - start_time
            print(f"    ❌ ERROR: {str(e)}")
            results.append({
                'source_file': row['original_filename'],
                'output_file': row['new_filename'],
                'category': category,
                'status': 'failed',
                'time_seconds': elapsed,
                'error': str(e)
            })

    # Create results DataFrame
    results_df = pd.DataFrame(results)

    print("\n" + "="*80)
    print("\n🎉 PROCESSING COMPLETE!")

else:
    print("⚠️ Cannot process: No files to process or S3 not connected")
    results_df = pd.DataFrame()

ModuleNotFoundError: No module named 'core.cog_processing'

## 📊 Step 6: View Results

Display processing statistics and results:
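
Once the processing loop in Step 5 has filled `results` (a list of dicts with `source_file`, `status`, `time_seconds`, and `error` keys), the statistics can be summarized roughly as follows. This is only a sketch run on made-up sample records, and `summarize_results` is a hypothetical helper name, not a function from the `core` package.

```python
from collections import Counter


def summarize_results(results):
    """Summarize processing results (list of dicts shaped like those in Step 5)."""
    status_counts = Counter(r["status"] for r in results)
    failures = [
        (r["source_file"], r["error"]) for r in results if r["status"] == "failed"
    ]
    return {
        "total": len(results),
        "success": status_counts.get("success", 0),
        "failed": status_counts.get("failed", 0),
        "skipped": status_counts.get("skipped", 0),
        "total_time_seconds": sum(r["time_seconds"] for r in results),
        "failures": failures,
    }


# Made-up sample records using the same keys as the processing loop.
sample_results = [
    {"source_file": "a.tif", "status": "success", "time_seconds": 42.0, "error": None},
    {"source_file": "b.tif", "status": "skipped", "time_seconds": 0, "error": "File already exists"},
    {"source_file": "c.tif", "status": "failed", "time_seconds": 3.5, "error": "Processing failed"},
]

summary = summarize_results(sample_results)
print(f"Total:   {summary['total']}")
print(f"Success: {summary['success']}")
print(f"Failed:  {summary['failed']}")
print(f"Skipped: {summary['skipped']}")
for source_file, error in summary["failures"]:
    print(f"  {source_file}: {error}")
```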

## 💾 Step 7: Save Results to CSV

Save processing results for future reference:

In [None]:
from datetime import datetime  # needed for the timestamp in the results filename

if not results_df.empty and SAVE_RESULTS:
    # Create output directory
    output_path = Path(OUTPUT_DIR) / EVENT_NAME
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Generate filename with timestamp
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    results_filename = f"processing_results_{timestamp}.csv"
    results_path = output_path / results_filename
    
    # Save results
    results_df.to_csv(results_path, index=False)
    
    print("💾 RESULTS SAVED")
    print("="*80)
    print(f"\n✅ Saved results to: {results_path.absolute()}")
    print(f"\n📊 Summary:")
    print(f"   Records saved: {len(results_df)}")
    print(f"   Successful:    {len(results_df[results_df['status'] == 'success'])}")
    print(f"   Failed:        {len(results_df[results_df['status'] == 'failed'])}")
    print(f"   Skipped:       {len(results_df[results_df['status'] == 'skipped'])}")
    
elif results_df.empty:
    print("⚠️ No results to save. Check Step 5.")
else:
    print("ℹ️  Results saving disabled (SAVE_RESULTS = False)")

## 💡 Next Steps

After processing:

1. **Review results** - Check the statistics and any failed files
2. **Verify outputs** - Spot-check processed COGs in S3
3. **Handle failures** - Investigate and retry any failed files
4. **Update metadata** - Add metadata to processed files if needed
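
For step 3 (handling failures), one option is to reload a saved results CSV and collect the failed source filenames for another run. A minimal sketch, assuming the results CSV has the `source_file` and `status` columns written in Step 7; `failed_sources` is a hypothetical helper and the sample data is invented.

```python
import csv
import io


def failed_sources(results_csv):
    """Return source filenames whose status is 'failed'.

    `results_csv` is any open text file object containing a results CSV.
    """
    reader = csv.DictReader(results_csv)
    return [row["source_file"] for row in reader if row["status"] == "failed"]


# In-memory stand-in for a saved processing_results_<timestamp>.csv file.
sample = io.StringIO(
    "source_file,output_file,category,status,time_seconds,error\n"
    "a.tif,new_a.tif,trueColor,success,41.2,\n"
    "b.tif,new_b.tif,trueColor,failed,3.1,Processing failed\n"
)
to_retry = failed_sources(sample)
print(to_retry)
```

The resulting list could then narrow the selection in Step 3, for example with `files_to_process[files_to_process['original_filename'].isin(to_retry)]`.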

## 🔧 Troubleshooting

### Common Issues:

1. **"CSV file not found"**
   - Run `renaming_file_template.ipynb` first
   - Check `EVENT_NAME` matches the CSV filename
   - Verify `CSV_DIR` path is correct

2. **"Connection failed"**
   - Check AWS credentials
   - Verify external ID is configured (if using upload role)
   - Test with `test_upload.py`

3. **"All files skipped"**
   - Files already exist in destination
   - Set `OVERWRITE = True` to replace existing files

4. **"Processing failures"**
   - Check source files are valid GeoTIFFs
   - Verify enough disk space for temp files
   - Check S3 write permissions
   - Review error messages in results

5. **"Slow processing"**
   - Large files take longer to process
   - Consider processing in smaller batches
   - Use filters in Step 3 to process by size/category
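
Processing in smaller batches, as suggested in item 5, could look like the sketch below: files are grouped greedily, in order, so that each batch stays within a size budget. The function name `batch_by_size` and the 2 GB budget are assumptions chosen for illustration.

```python
def batch_by_size(files, max_batch_gb):
    """Group (name, size_gb) pairs into batches of at most max_batch_gb.

    Files are taken in order. A single file larger than the budget
    still gets a batch of its own.
    """
    batches = []
    current, current_size = [], 0.0
    for name, size_gb in files:
        # Start a new batch if adding this file would exceed the budget.
        if current and current_size + size_gb > max_batch_gb:
            batches.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size_gb
    if current:
        batches.append(current)
    return batches


# Invented file names and sizes.
sample_files = [("a.tif", 0.5), ("b.tif", 1.0), ("c.tif", 1.5), ("d.tif", 3.0)]
batches = batch_by_size(sample_files, max_batch_gb=2.0)
print(batches)
```

With the notebook's DataFrame, the input pairs could come from `zip(files_to_process['original_filename'], files_to_process['file_size_gb'])`.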

## 📝 Notes

- Processing time varies based on file size (typically 30s-5min per file)
- COGs are created with zstd compression, level 9
- Nodata values are auto-detected unless specified
- All COGs include 5 overview levels for better performance
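
The per-file timing above gives a rough way to budget a run. The sketch below assumes files are processed one after another, as in the Step 5 loop, and takes the 30 s and 5 min figures as the lower and upper bounds; `estimate_runtime_hours` is a hypothetical helper.

```python
def estimate_runtime_hours(n_files, min_seconds=30, max_seconds=300):
    """Return (low, high) total runtime in hours for sequential processing."""
    return (n_files * min_seconds / 3600, n_files * max_seconds / 3600)


# 123 is the number of files in the example mapping loaded in this notebook.
low, high = estimate_runtime_hours(123)
print(f"Estimated runtime: between {low:.2f} and {high:.2f} hours")
```

For the 123 files in this example, that range works out to roughly one to ten hours; files skipped because they already exist or are already valid COGs would shorten it.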