# 📁 Local File Processing to S3

This notebook processes GeoTIFF files from a **local directory**, converts them to Cloud Optimized GeoTIFFs (COGs), and uploads them to S3 under standardized filenames.

## ✨ Features
- **Process local files** - No need to upload to S3 first
- **Filename transformation** - Define custom renaming functions
- **CSV mapping** - Optional export of filename mappings
- **COG conversion** - Automatic optimization with compression
- **Direct S3 upload** - Upload with new names to destination bucket

## 📋 Workflow
1. Configure local directory and S3 destination
2. List local .tif files
3. Define filename transformations
4. Preview transformations
5. (Optional) Save mapping to CSV
6. Connect to S3
7. Process and upload files

---

## 📋 Step 1: Configuration

Set your local directory path and S3 destination:

In [None]:
# ========================================
# INPUTS
# ========================================

# Local File Path
LOCAL_DIR = '/path/to/your/local/geotiffs'  # Change this to your local directory

# S3 Configuration
BUCKET = 'nasa-disasters'    # S3 bucket (DO NOT CHANGE)
DESTINATION_BASE = 'drcs_activations_new'  # Where to save COGs in S3

# Event Details
EVENT_NAME = '202510_Flood_AK'  # Your event name
SUB_PRODUCT_NAME = 'sentinel2'  # Sub-product identifier

# Processing Options
OVERWRITE = False      # Set to True to replace existing files in S3
VERIFY = True          # Verify COGs after creation
SAVE_CSV = True        # Save filename mapping to CSV
SAVE_RESULTS = True    # Save processing results to CSV

# Output
OUTPUT_DIR = 'local-processing-output'  # Directory for CSV files

print(f"Local Directory: {LOCAL_DIR}")
print(f"Event: {EVENT_NAME}")
print(f"Destination: s3://{BUCKET}/{DESTINATION_BASE}/")

## 📂 Step 2: List Local Files

Scan the local directory for GeoTIFF files:

In [None]:
# Import necessary modules
import sys
import os
import pandas as pd
from pathlib import Path
from datetime import datetime
import glob

# Add parent directory to path
sys.path.insert(0, str(Path('..').resolve()))

print("📂 SCANNING LOCAL DIRECTORY")
print("="*80)
print(f"\nSearching for .tif files in: {LOCAL_DIR}\n")

# Find all .tif files (recursively)
local_dir_path = Path(LOCAL_DIR)
if not local_dir_path.exists():
    print(f"❌ ERROR: Directory does not exist: {LOCAL_DIR}")
    print("   Please check your LOCAL_DIR path and try again.")
    files_df = pd.DataFrame()
else:
    # Find all .tif files, matching the extension case-insensitively so that
    # case-insensitive filesystems (macOS, Windows) do not list each file twice
    tif_files = sorted(
        p for p in local_dir_path.rglob('*')
        if p.is_file() and p.suffix.lower() == '.tif'
    )
    
    if tif_files:
        print(f"✅ Found {len(tif_files)} .tif files\n")
        
        # Create DataFrame with file info
        file_data = []
        for file_path in tif_files:
            file_size_bytes = file_path.stat().st_size
            file_size_gb = file_size_bytes / (1024 ** 3)
            
            file_data.append({
                'local_path': str(file_path),
                'original_filename': file_path.name,
                'file_size_gb': file_size_gb,
                'relative_path': str(file_path.relative_to(local_dir_path))
            })
        
        files_df = pd.DataFrame(file_data)
        
        # Display summary
        print(f"Total files: {len(files_df)}")
        print(f"Total size: {files_df['file_size_gb'].sum():.2f} GB\n")
        
        # Display file list
        print("File list:")
        print("-" * 80)
        for i, row in files_df.iterrows():
            print(f"{i+1:3}. {row['original_filename']:<60} ({row['file_size_gb']:.3f} GB)")
        
        print("\n" + "="*80)
    else:
        print("⚠️ No .tif files found in the specified directory.")
        print("   Check your LOCAL_DIR path.")
        files_df = pd.DataFrame()

## 🏷️ Step 3: Define Filename Transformations

Configure how files should be renamed and categorized:
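
To make the naming rule concrete before running the cell, the sketch below condenses the same date-extraction and renaming logic into one standalone function and applies it to two filenames. The helper name `standard_name` and the sample filenames are hypothetical; the regular expressions mirror those in the cell that follows.

```python
import os
import re


def standard_name(original_path, event_name):
    """Condensed, illustrative version of the renaming rule (hypothetical helper)."""
    stem = os.path.splitext(os.path.basename(original_path))[0]

    # Look for a compact YYYYMMDD date first, then a dashed YYYY-MM-DD date.
    compact = re.search(r'\d{8}', stem)
    dashed = re.search(r'\d{4}-\d{2}-\d{2}', stem)
    if compact:
        raw = compact.group(0)
        date = f"{raw[0:4]}-{raw[4:6]}-{raw[6:8]}"
    elif dashed:
        date = dashed.group(0)
    else:
        # No date found: keep the stem untouched and only add prefix and suffix.
        return f"{event_name}_{stem}_day.tif"

    # Strip every date token (and a leading underscore) from the stem,
    # then append the normalized date once at the end.
    stem_clean = re.sub(r'_?\d{8}', '', stem)
    stem_clean = re.sub(r'_?\d{4}-\d{2}-\d{2}', '', stem_clean)
    return f"{event_name}_{stem_clean}_{date}_day.tif"


# Hypothetical input filenames, for illustration only
print(standard_name('/data/S2A_trueColor_20251012.tif', '202510_Flood_AK'))
# 202510_Flood_AK_S2A_trueColor_2025-10-12_day.tif
print(standard_name('/data/flood_SWIR.tif', '202510_Flood_AK'))
# 202510_Flood_AK_flood_SWIR_day.tif
```

Note that any run of eight digits is treated as a date, so a filename carrying another eight-digit identifier would be misread; rename such inputs or tighten the pattern before processing.
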

In [None]:
# ========================================
# CATEGORIZATION AND FILENAME TRANSFORMATION
# ========================================

import re

# Define helper function to extract dates
def extract_date_from_filename(filename):
    """Return the date found in the filename as 'YYYY-MM-DD' (YYYYMMDD is tried first), or None."""
    # Try YYYYMMDD format
    dates = re.findall(r'\d{8}', filename)
    if dates:
        date_str = dates[0]
        return f"{date_str[0:4]}-{date_str[4:6]}-{date_str[6:8]}"
    
    # Try YYYY-MM-DD format
    dates = re.findall(r'\d{4}-\d{2}-\d{2}', filename)
    if dates:
        return dates[0]
    
    return None

# Define filename transformation functions
def create_standard_filename(original_path, event_name):
    """Create standardized filename."""
    filename = os.path.basename(original_path)
    stem = os.path.splitext(filename)[0]
    date = extract_date_from_filename(stem)
    
    if date:
        stem_clean = re.sub(r'_?\d{8}', '', stem)
        stem_clean = re.sub(r'_?\d{4}-\d{2}-\d{2}', '', stem_clean)
        return f"{event_name}_{stem_clean}_{date}_day.tif"
    return f"{event_name}_{stem}_day.tif"

# Configure categorization patterns
CATEGORIZATION_PATTERNS = {
    'trueColor': r'trueColor|truecolor|true_color|RGB',
    'colorInfrared': r'colorInfrared|colorIR|color_infrared|CIR',
    'naturalColor': r'naturalColor|naturalcolor|natural_color',
    'shortwaveIR': r'shortwaveIR|SWIR|shortwave'
}

# Map categories to filename functions
FILENAME_CREATORS = {
    'trueColor': create_standard_filename,
    'colorInfrared': create_standard_filename,
    'naturalColor': create_standard_filename,
    'shortwaveIR': create_standard_filename
}

# Output directories in S3
OUTPUT_DIRS = {
    'trueColor': 'Sentinel-2/trueColor',
    'colorInfrared': 'Sentinel-2/colorIR',
    'naturalColor': 'Sentinel-2/naturalColor',
    'shortwaveIR': 'Sentinel-2/shortwaveIR'
}

# Nodata values
NODATA_VALUES = {
    'trueColor': 0,
    'colorInfrared': 0,
    'naturalColor': 0,
    'shortwaveIR': 0
}

print("✅ Transformation functions defined")
print(f"\nCategories configured: {len(CATEGORIZATION_PATTERNS)}")
for category in CATEGORIZATION_PATTERNS.keys():
    print(f"   • {category}")

## 🔍 Step 4: Preview Transformations

Apply transformations and preview the results:
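
Categorization is a first-match search: each filename is tested against the patterns in dictionary order, ignoring case, and files that match nothing are labeled `uncategorized`. A minimal standalone sketch of that rule, using a two-entry subset of the patterns and hypothetical filenames:

```python
import re

# Subset of the notebook's patterns, repeated here so the example runs on its own
PATTERNS = {
    'trueColor': r'trueColor|truecolor|true_color|RGB',
    'colorInfrared': r'colorInfrared|colorIR|color_infrared|CIR',
}


def categorize(filename, patterns=PATTERNS):
    """Return the first category whose pattern occurs in the filename."""
    for category, pattern in patterns.items():
        if re.search(pattern, filename, re.IGNORECASE):
            return category
    return 'uncategorized'


# Hypothetical filenames, for illustration only
for name in ['S2_rgb_20251012.tif', 'S2_colorIR_20251012.tif', 'dem_20251012.tif']:
    print(f"{name} -> {categorize(name)}")
```

Because matching is case-insensitive and unanchored, short tokens such as `CIR` or `RGB` also match inside longer words (for example `circle`), so more specific patterns are safer when filenames are free-form.
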

In [None]:
if not files_df.empty:
    print("📋 APPLYING TRANSFORMATIONS")
    print("="*80)
    
    # Categorization function
    def categorize_file(filename):
        for category, pattern in CATEGORIZATION_PATTERNS.items():
            if re.search(pattern, filename, re.IGNORECASE):
                return category
        return 'uncategorized'
    
    # Transformation function
    def transform_filename(row):
        category = row['category']
        local_path = row['local_path']
        
        if category == 'uncategorized':
            return row['original_filename']
        
        if category in FILENAME_CREATORS:
            return FILENAME_CREATORS[category](local_path, EVENT_NAME)
        
        return row['original_filename']
    
    # Generate S3 output path
    def get_output_path(row):
        category = row['category']
        new_filename = row['new_filename']
        
        if category == 'uncategorized':
            return f"{DESTINATION_BASE}/uncategorized/{new_filename}"
        
        if category in OUTPUT_DIRS:
            return f"{DESTINATION_BASE}/{OUTPUT_DIRS[category]}/{new_filename}"
        
        return f"{DESTINATION_BASE}/{category}/{new_filename}"
    
    # Get nodata value
    def get_nodata_value(category):
        return NODATA_VALUES.get(category, None)
    
    # Apply transformations
    files_df['category'] = files_df['original_filename'].apply(categorize_file)
    files_df['new_filename'] = files_df.apply(transform_filename, axis=1)
    files_df['output_s3_path'] = files_df.apply(get_output_path, axis=1)
    files_df['nodata_value'] = files_df['category'].apply(get_nodata_value)
    files_df['status'] = files_df['category'].apply(lambda x: 'valid' if x != 'uncategorized' else 'uncategorized')
    
    # Display summary
    print(f"\nTotal files: {len(files_df)}")
    print(f"Categorized: {len(files_df[files_df['category'] != 'uncategorized'])}")
    print(f"Uncategorized: {len(files_df[files_df['category'] == 'uncategorized'])}")
    
    # Category breakdown
    print("\nFiles by category:")
    category_counts = files_df['category'].value_counts()
    for category, count in category_counts.items():
        nodata = NODATA_VALUES.get(category, 'None')
        print(f"   • {category}: {count} files (nodata={nodata})")
    
    # Show sample transformations
    print("\n📝 Sample transformations:")
    print("-" * 80)
    for i, row in files_df.head(5).iterrows():
        print(f"\n{i+1}. Original: {row['original_filename']}")
        print(f"   Category: {row['category']}")
        print(f"   New name: {row['new_filename']}")
        print(f"   Output:   s3://{BUCKET}/{row['output_s3_path']}")
    
    if len(files_df.head(5)) < len(files_df):
        print(f"\n   ... and {len(files_df) - 5} more files")
    
    # Show uncategorized files
    uncategorized = files_df[files_df['category'] == 'uncategorized']
    if not uncategorized.empty:
        print("\n⚠️  UNCATEGORIZED FILES:")
        print("-" * 80)
        for _, row in uncategorized.iterrows():
            print(f"   • {row['original_filename']}")
        print("\nAdd patterns to CATEGORIZATION_PATTERNS to categorize these files")
    
    print("\n" + "="*80)
else:
    print("⚠️ No files to process. Check Step 2.")

## 💾 Step 5: Save Mapping to CSV (Optional)

Export the filename mapping for your records:

In [None]:
if not files_df.empty and SAVE_CSV:
    # Create output directory
    output_path = Path(OUTPUT_DIR) / EVENT_NAME
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Generate filename
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    csv_filename = f"{EVENT_NAME}-{SUB_PRODUCT_NAME}-mapping_{timestamp}.csv"
    csv_path = output_path / csv_filename
    
    # Column order
    column_order = [
        'original_filename',
        'new_filename',
        'category',
        'file_size_gb',
        'nodata_value',
        'status',
        'local_path',
        'output_s3_path'
    ]
    
    # Save to CSV
    files_df[column_order].to_csv(csv_path, index=False)
    
    print("💾 CSV MAPPING SAVED")
    print("="*80)
    print(f"\n✅ Saved to: {csv_path.absolute()}")
    print(f"\n📊 Summary:")
    print(f"   Total records: {len(files_df)}")
    print(f"   Total size:    {files_df['file_size_gb'].sum():.2f} GB")
    print(f"   Valid:         {len(files_df[files_df['status'] == 'valid'])}")
    print(f"   Uncategorized: {len(files_df[files_df['status'] == 'uncategorized'])}")
    print("\n" + "="*80)
elif files_df.empty:
    print("⚠️ No files to save. Check previous steps.")
else:
    print("ℹ️  CSV export disabled (SAVE_CSV = False)")

## 🌐 Step 6: Connect to S3

Initialize S3 client with upload permissions:

In [None]:
from lib.core.s3_operations import initialize_s3_client, check_s3_file_exists

print("🌐 Connecting to S3...")
s3_client, fs = initialize_s3_client(bucket_name=BUCKET, verbose=True)

if not s3_client:
    print("\n❌ Failed to connect to S3")
    print("   Check your AWS credentials and try again.")
else:
    print("\n✅ S3 connection ready")
    print("   You can now proceed to process and upload files.")

## ⚙️ Step 7: Process and Upload Files

Convert files to COGs and upload to S3:
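
When `OVERWRITE` is `False`, the loop asks S3 whether the destination key already exists and skips the file if it does. The implementation of `check_s3_file_exists` lives in `lib.core.s3_operations` and is not shown in this notebook; the sketch below is one plausible way to perform such a check with a boto3 S3 client, using `head_object`. The function name `key_exists` and the example key are hypothetical.

```python
def key_exists(s3_client, bucket, key):
    """Return True if s3://bucket/key exists (illustrative, not the library's code)."""
    try:
        # HEAD fetches only object metadata, so no data is downloaded.
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except s3_client.exceptions.ClientError as err:
        code = err.response.get('Error', {}).get('Code', '')
        if code in ('404', 'NoSuchKey', 'NotFound'):
            return False
        # Anything else (for example 403 AccessDenied) is a real problem: re-raise.
        raise


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   client = boto3.client('s3')
#   key_exists(client, 'nasa-disasters', 'drcs_activations_new/Sentinel-2/trueColor/example.tif')
```

Re-raising on anything other than "not found" keeps a permissions problem from being mistaken for a missing file, which would otherwise trigger an upload attempt that is likely to fail as well.
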

In [None]:
if not files_df.empty and s3_client:
    # Filter to only valid files
    files_to_process = files_df[files_df['status'] == 'valid'].copy()
    
    if files_to_process.empty:
        print("⚠️ No valid files to process.")
        print("   All files are uncategorized. Update CATEGORIZATION_PATTERNS and retry.")
    else:
        print("🚀 STARTING COG PROCESSING AND UPLOAD")
        print("="*80)
        print(f"\nProcessing {len(files_to_process)} files...")
        print("This may take several minutes depending on file sizes.\n")
        
        print(f"Processing options:")
        print(f"  Overwrite existing: {OVERWRITE}")
        print(f"  Verify COGs: {VERIFY}\n")
        
        # Import processing function
        from lib.main_processor import convert_to_cog
        import time
        
        # Track results
        results = []
        
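        # Count files with enumerate: the DataFrame index keeps its pre-filter
        # values, so it cannot be used as a progress counter
        for n, (_, row) in enumerate(files_to_process.iterrows(), start=1):
            start_time = time.time()
            
            local_path = row['local_path']
            output_key = row['output_s3_path']
            nodata = row['nodata_value']
            
            print(f"\n[{n}/{len(files_to_process)}] Processing: {row['original_filename']}")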
            print(f"    Category: {row['category']}")
            print(f"    Size: {row['file_size_gb']:.2f} GB")
            print(f"    Output: {row['new_filename']}")
            
            # Check if destination exists (unless OVERWRITE)
            if not OVERWRITE:
                if check_s3_file_exists(s3_client, BUCKET, output_key):
                    print(f"    ⏭️  SKIPPED (already exists in S3)")
                    results.append({
                        'source_file': row['original_filename'],
                        'output_file': row['new_filename'],
                        'category': row['category'],
                        'status': 'skipped',
                        'time_seconds': 0,
                        'error': 'File already exists'
                    })
                    continue
            
            try:
                # Convert to COG and upload
                success = convert_to_cog(
                    name=local_path,
                    bucket=BUCKET,
                    cog_filename=row['new_filename'],
                    cog_data_bucket=BUCKET,
                    cog_data_prefix=output_key.rsplit('/', 1)[0],  # same prefix used for the existence check
                    nodata_value=nodata,
                    verify_cog=VERIFY,
                    verbose=False
                )
                
                elapsed = time.time() - start_time
                
                if success:
                    print(f"    ✅ SUCCESS ({elapsed:.1f}s)")
                    results.append({
                        'source_file': row['original_filename'],
                        'output_file': row['new_filename'],
                        'category': row['category'],
                        'status': 'success',
                        'time_seconds': elapsed,
                        'error': None
                    })
                else:
                    print(f"    ❌ FAILED ({elapsed:.1f}s)")
                    results.append({
                        'source_file': row['original_filename'],
                        'output_file': row['new_filename'],
                        'category': row['category'],
                        'status': 'failed',
                        'time_seconds': elapsed,
                        'error': 'Processing failed'
                    })
            
            except Exception as e:
                elapsed = time.time() - start_time
                print(f"    ❌ ERROR: {str(e)}")
                results.append({
                    'source_file': row['original_filename'],
                    'output_file': row['new_filename'],
                    'category': row['category'],
                    'status': 'failed',
                    'time_seconds': elapsed,
                    'error': str(e)
                })
        
        # Create results DataFrame
        results_df = pd.DataFrame(results)
        
        print("\n" + "="*80)
        print("\n🎉 PROCESSING COMPLETE!")
        
        # Save results if requested
        if SAVE_RESULTS and not results_df.empty:
            output_path = Path(OUTPUT_DIR) / EVENT_NAME
            output_path.mkdir(parents=True, exist_ok=True)
            
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            results_filename = f"processing_results_{timestamp}.csv"
            results_path = output_path / results_filename
            
            results_df.to_csv(results_path, index=False)
            print(f"\n💾 Results saved to: {results_path.absolute()}")

else:
    print("⚠️ Cannot process: No files or S3 not connected")
    results_df = pd.DataFrame()

## 📊 Step 8: Review Results

Display processing statistics:

In [None]:
if 'results_df' in locals() and not results_df.empty:
    print("📊 PROCESSING STATISTICS")
    print("="*80)
    
    # Success rate
    total = len(results_df)
    success = len(results_df[results_df['status'] == 'success'])
    failed = len(results_df[results_df['status'] == 'failed'])
    skipped = len(results_df[results_df['status'] == 'skipped'])
    
    print(f"\nTotal files: {total}")
    print(f"✅ Success: {success}")
    print(f"❌ Failed: {failed}")
    print(f"⏭️  Skipped: {skipped}")
    print(f"\nSuccess rate: {(success/total*100):.1f}%")
    
    # Show failed files
    if failed > 0:
        print("\n❌ Failed files:")
        failed_df = results_df[results_df['status'] == 'failed']
        for idx, row in failed_df.iterrows():
            print(f"  - {row['source_file']}: {row.get('error', 'Unknown error')}")
    
    # Processing times
    if 'time_seconds' in results_df.columns:
        success_df = results_df[results_df['status'] == 'success']
        if not success_df.empty:
            avg_time = success_df['time_seconds'].mean()
            max_time = success_df['time_seconds'].max()
            total_time = success_df['time_seconds'].sum()
            print(f"\n⏱️  Timing:")
            print(f"Average: {avg_time:.1f}s per file")
            print(f"Slowest: {max_time:.1f}s")
            print(f"Total:   {total_time:.1f}s ({total_time/60:.1f} minutes)")
    
    print("\n" + "="*80)
    
    # Display results table
    print("\n📋 Detailed Results:")
    display(results_df)
else:
    print("No results to display. Run Step 7 first.")

## 💡 Tips & Troubleshooting

### Common Issues:

1. **"Directory does not exist"**
   - Check `LOCAL_DIR` path is correct
   - Use absolute paths (e.g., `/Users/name/data/geotiffs`)
   - On Windows, use forward slashes or raw strings (e.g., `r"C:\path\to\files"`)

2. **"No .tif files found"**
   - Verify files have `.tif` or `.TIF` extension
   - Check subdirectories are included
   - Try listing files manually with `ls` or File Explorer

3. **"S3 connection failed"**
   - Check AWS credentials are configured
   - For upload permissions, configure external ID in `aws_credentials.py`
   - Test with `lib/test_upload.py`

4. **"Files being skipped"**
   - Files already exist in S3 destination
   - Set `OVERWRITE = True` to replace existing files

5. **"Processing failures"**
   - Check source files are valid GeoTIFFs
   - Verify enough disk space for temporary files
   - Check S3 write permissions
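
Several of the issues above can be caught before any processing starts. The following pre-flight check is a sketch using only the standard library. The function name `preflight` and the rule of thumb that free space should at least equal the largest input file are assumptions, not part of the notebook's library.

```python
import shutil
import tempfile
from pathlib import Path


def preflight(local_dir, scratch_dir=None):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    root = Path(local_dir)
    if not root.is_dir():
        return [f"Directory does not exist: {root}"]

    # Case-insensitive extension match, searching subdirectories too.
    tifs = [p for p in root.rglob('*') if p.is_file() and p.suffix.lower() == '.tif']
    if not tifs:
        problems.append(f"No .tif files found under {root}")
        return problems

    # Assumed rule of thumb: temporary COG output needs roughly as much
    # space as the largest source file.
    largest = max(p.stat().st_size for p in tifs)
    free = shutil.disk_usage(scratch_dir or tempfile.gettempdir()).free
    if free < largest:
        problems.append(
            f"Only {free / 1024**3:.2f} GB free for temporary files; "
            f"largest input is {largest / 1024**3:.2f} GB"
        )
    return problems


# Example with the placeholder path from Step 1
for line in preflight('/path/to/your/local/geotiffs') or ['No problems found']:
    print(line)
```

This check does not validate that the files are readable GeoTIFFs or that S3 write permissions are in place; those still surface during Step 7.
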

### Performance Notes:
- Processing time varies by file size (typically 30s-5min per file)
- COGs use ZSTD compression level 22
- Predictor automatically selected based on data type
- Large files (>10GB) use optimized block sizes
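
The conversion settings themselves live in the processing library and are not shown in this notebook. As a rough illustration of what the notes above describe, the sketch below builds a dictionary of creation options in the style of GDAL's COG driver (`COMPRESS`, `LEVEL`, `PREDICTOR`, `BLOCKSIZE`). The function name, the 512 and 1024 block sizes, and the dtype-to-predictor mapping are assumptions for illustration; only ZSTD level 22 and the 10 GB threshold come from the notes.

```python
def cog_creation_options(dtype_name, file_size_gb):
    """Build illustrative GDAL COG-driver creation options (not the library's code)."""
    # Floating-point rasters usually compress better with the floating-point
    # predictor; integer rasters with the standard horizontal-differencing one.
    predictor = 'FLOATING_POINT' if dtype_name.startswith('float') else 'STANDARD'

    # Assumed block sizes: GDAL's COG default of 512, doubled for very large files.
    blocksize = 1024 if file_size_gb > 10 else 512

    return {
        'COMPRESS': 'ZSTD',
        'LEVEL': 22,
        'PREDICTOR': predictor,
        'BLOCKSIZE': blocksize,
    }


print(cog_creation_options('uint8', 0.4))
print(cog_creation_options('float32', 12.5))
```

Level 22 is the highest ZSTD setting, which favors small output files over conversion speed and is consistent with the per-file times quoted above.
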

### Next Steps:
1. Review processing results and any failures
2. Verify uploaded files in S3 console
3. Check CSV files for complete mapping records
4. Re-run failed files if needed (set `OVERWRITE = False` to skip successful ones)
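
For the last step above, the results CSV written in Step 7 already records which files failed. The sketch below, using only the standard library, pulls the failed source filenames out of such a file so they can be checked and re-run. The helper name `failed_sources` and the sample rows are hypothetical; the column names match those written by the processing loop.

```python
import csv
import io


def failed_sources(csv_file):
    """Return source filenames whose status is 'failed' in a results CSV."""
    reader = csv.DictReader(csv_file)
    return [row['source_file'] for row in reader if row['status'] == 'failed']


# Inline sample with made-up rows; in practice, open the processing_results_*.csv file
sample = io.StringIO(
    "source_file,output_file,category,status,time_seconds,error\n"
    "a_RGB_20251012.tif,EV_a_RGB_2025-10-12_day.tif,trueColor,success,41.2,\n"
    "b_SWIR_20251012.tif,EV_b_SWIR_2025-10-12_day.tif,shortwaveIR,failed,3.0,Processing failed\n"
)
print(failed_sources(sample))
# ['b_SWIR_20251012.tif']
```

With a real file, pass an open handle instead, for example `with open(results_path, newline='') as f: failed_sources(f)`.
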