# Wildfire Prediction Model - Initial Data Exploration

This notebook sets up the environment and performs initial data exploration for a wildfire prediction model built on a modified Next Day Wildfire Spread (NDWS) dataset.

## Objectives:
1. Set up project environment and dependencies
2. Parse TFRecord files to understand data structure
3. Explore feature tensors and their properties
4. Understand the dataset format and contents

## 1. Import Required Libraries

Import essential libraries for project setup, data handling, and TFRecord parsing.

In [1]:
# Standard library imports
import os
import sys
import glob
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Data handling and numerical computing
import numpy as np
import pandas as pd

# TensorFlow for TFRecord parsing
import tensorflow as tf

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Set up matplotlib for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries imported successfully!")
print(f"TensorFlow version: {tf.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

Libraries imported successfully!
TensorFlow version: 2.20.0
NumPy version: 2.3.3
Pandas version: 2.3.3


## 2. Set Up Project Directory Structure

Verify our cookiecutter-data-science template structure is in place.

In [2]:
# Define project root and key directories
PROJECT_ROOT = Path().absolute().parent
DATA_DIR = PROJECT_ROOT / "data"
MODELS_DIR = PROJECT_ROOT / "models"
REPORTS_DIR = PROJECT_ROOT / "reports"
SRC_DIR = PROJECT_ROOT / "src"

print(f"Project root: {PROJECT_ROOT}")
print(f"Data directory: {DATA_DIR}")

# Verify directory structure
required_dirs = [
    DATA_DIR / "raw",
    DATA_DIR / "interim", 
    DATA_DIR / "processed",
    MODELS_DIR,
    REPORTS_DIR / "figures",
    SRC_DIR / "data",
    SRC_DIR / "features",
    SRC_DIR / "models",
    SRC_DIR / "visualization"
]

print("\nDirectory structure verification:")
for dir_path in required_dirs:
    exists = "✓" if dir_path.exists() else "✗"
    print(f"{exists} {dir_path.relative_to(PROJECT_ROOT)}")

# Add src to Python path for imports
sys.path.append(str(SRC_DIR))

Project root: c:\Users\Harshvardhan\OneDrive\Desktop\wildfire_pred
Data directory: c:\Users\Harshvardhan\OneDrive\Desktop\wildfire_pred\data

Directory structure verification:
✓ data\raw
✓ data\interim
✓ data\processed
✓ models
✓ reports\figures
✓ src\data
✓ src\features
✓ src\models
✓ src\visualization
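
`Path().absolute().parent` resolves to the project root only when the notebook is launched from a direct subdirectory of the project (for example `notebooks/`). A minimal sketch of a more forgiving lookup, assuming `requirements.txt` sits in the project root and can serve as a marker file; the helper name `find_project_root` is hypothetical and not part of this project:

```python
from pathlib import Path


def find_project_root(start=None, marker="requirements.txt"):
    """Walk upward from `start` until a directory containing `marker` is found.

    `marker` is an assumption: any file known to live in the project root works.
    """
    start = Path(start or Path.cwd()).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")
```

With this helper, `PROJECT_ROOT = find_project_root()` would give the same result from the project root or from any directory beneath it.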


## 3. Create Dependencies Configuration

Check that `requirements.txt` exists in the project root and lists the dependencies the project needs.

In [3]:
# Check if requirements.txt exists and display its contents
requirements_path = PROJECT_ROOT / "requirements.txt"

if requirements_path.exists():
    print("✓ requirements.txt found!")
    print("\nDependencies listed:")
    with open(requirements_path, 'r') as f:
        content = f.read()
        print(content)
else:
    print("✗ requirements.txt not found!")
    print("Please create requirements.txt in the project root.")

✓ requirements.txt found!

Dependencies listed:
# Data Handling
pandas>=1.5.0
numpy>=1.21.0
xarray>=2022.6.0
rioxarray>=0.12.0
tensorflow>=2.12.0

# Visualization
matplotlib>=3.5.0
seaborn>=0.11.0
geopandas>=0.13.0
plotly>=5.15.0
folium>=0.14.0

# Machine Learning
scikit-learn>=1.1.0
keras>=2.12.0

# Geospatial Processing
shapely>=1.8.0
fiona>=1.8.0
pyproj>=3.4.0
contextily>=1.3.0

# Utilities
tqdm>=4.64.0
joblib>=1.2.0
h5py>=3.7.0

# Development and Testing
jupyter>=1.0.0
ipykernel>=6.15.0
pytest>=7.1.0
black>=22.6.0
flake8>=5.0.0

# Optional: For better performance
dask[complete]>=2022.8.0
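
Printing the file shows what is required, not what is installed. The sketch below is one way to put the two side by side, assuming only the simple `name>=version` form used above. The helper names `parse_requirements` and `installed_version` are hypothetical, and the sketch reports versions without comparing them (a real comparison would need a version parser such as the third-party `packaging` library).

```python
import re
from importlib import metadata


def parse_requirements(text):
    """Return (name, minimum_version) pairs from requirements.txt-style text.

    Handles only `name>=version` lines; extras such as `dask[complete]`
    are reduced to the base package name.
    """
    pairs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = re.match(r"^([A-Za-z0-9_.\-]+)(?:\[[^\]]*\])?\s*(?:>=\s*(\S+))?", line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def installed_version(name):
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


sample = """
# Data Handling
pandas>=1.5.0
dask[complete]>=2022.8.0
"""
for name, minimum in parse_requirements(sample):
    print(f"{name:<10} required >= {minimum}, installed: {installed_version(name)}")
```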


## 4. Install and Verify Required Packages

Test imports of key libraries to ensure everything is properly installed.

In [4]:
# PRODUCTION-SCALE LIBRARY TESTING WITH ENVIRONMENT SETUP
import sys
from pathlib import Path

# Test imports of key libraries with production requirements
libraries_to_test = [
    ('pandas', 'pd'),
    ('numpy', 'np'),
    ('tensorflow', 'tf'),
    ('matplotlib.pyplot', 'plt'),
    ('seaborn', 'sns'),
    ('sklearn', None),
    ('scipy', None),
    ('joblib', None),
    ('pickle', None)
]

print("🔧 PRODUCTION ENVIRONMENT VERIFICATION")
print("=" * 50)

all_imported = True
environment_info = {}

for lib, alias in libraries_to_test:
    try:
        if alias:
            exec(f"import {lib} as {alias}")
            version = eval(f"{alias}.__version__")
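        else:
            # __import__ returns the module but does not bind its name in this
            # namespace, so read the version from the returned object.
            module = __import__(lib)
            # Some libraries don't have __version__
            version = getattr(module, "__version__", "imported successfully")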
        
        print(f"✓ {lib:<20} v{version}")
        environment_info[lib] = version
        
    except ImportError as e:
        print(f"✗ {lib:<20} FAILED: {e}")
        all_imported = False
        environment_info[lib] = f"MISSING: {e}"
    except AttributeError:
        print(f"✓ {lib:<20} imported (version not available)")
        environment_info[lib] = "version_unknown"

print("-" * 50)

# Check TensorFlow GPU availability
try:
    import tensorflow as tf
    print(f"\n🖥️  TensorFlow Configuration:")
    print(f"   Version: {tf.__version__}")
    print(f"   GPU Available: {tf.config.list_physical_devices('GPU')}")
    print(f"   Physical Devices: {len(tf.config.list_physical_devices())}")
    
    # Set memory growth for GPUs if available
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
            print(f"   ✓ GPU memory growth enabled for {len(gpus)} device(s)")
        except RuntimeError as e:
            print(f"   ⚠️  GPU setup warning: {e}")
    
    environment_info['tensorflow_gpu'] = len(gpus) > 0
    
except Exception as e:
    print(f"   ❌ TensorFlow configuration failed: {e}")
    environment_info['tensorflow_gpu'] = False

# System information
print(f"\n🖥️  System Information:")
print(f"   Python Version: {sys.version.split()[0]}")
print(f"   Platform: {sys.platform}")
print(f"   CPU Count: {os.cpu_count() or 'unknown'}")

# Save environment information for reproducibility
environment_file = PROJECT_ROOT / "environment_info.txt"
with open(environment_file, 'w') as f:
    f.write("WILDFIRE PREDICTION PROJECT - ENVIRONMENT INFO\n")
    f.write("=" * 50 + "\n\n")
    f.write(f"Python Version: {sys.version}\n")
    f.write(f"Platform: {sys.platform}\n\n")
    
    f.write("Library Versions:\n")
    f.write("-" * 20 + "\n")
    for lib, version in environment_info.items():
        f.write(f"{lib}: {version}\n")
    
    f.write(f"\nTensorFlow GPU Available: {environment_info.get('tensorflow_gpu', False)}\n")

print(f"\n💾 Environment info saved to: {environment_file}")

if all_imported:
    print("\n✅ ALL REQUIRED LIBRARIES ARE AVAILABLE!")
    print("🚀 Ready for production-scale wildfire prediction development")
else:
    print("\n❌ SOME LIBRARIES ARE MISSING")
    print("📋 Please install missing dependencies from requirements.txt")
    print("   Run: pip install -r requirements.txt")

# Set up pandas display options for better output
try:
    pd.set_option('display.max_columns', 20)
    pd.set_option('display.width', 100)
    pd.set_option('display.precision', 4)
    print("\n📊 Pandas display options configured")
except Exception:
    pass

# Set up matplotlib for high-quality plots
try:
    plt.rcParams['figure.figsize'] = (12, 8)
    plt.rcParams['figure.dpi'] = 100
    plt.rcParams['savefig.dpi'] = 300
    plt.rcParams['font.size'] = 10
    print("📈 Matplotlib configured for production plots")
except Exception:
    pass

🔧 PRODUCTION ENVIRONMENT VERIFICATION
✓ pandas               v2.3.3
✓ numpy                v2.3.3
✓ tensorflow           v2.20.0
✓ matplotlib.pyplot    imported (version not available)
✓ seaborn              v0.13.2


NameError: name 'sklearn' is not defined
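
The `NameError` above is typical of reading a version with `eval(f"{lib}.__version__")` after a bare `__import__(lib)`: the call imports the module but binds no name in the calling namespace, so `sklearn` is undefined when the string is evaluated. A minimal sketch of a version lookup that avoids `exec` and `eval` altogether, using the standard `importlib` module (the helper name `get_version` is hypothetical):

```python
import importlib


def get_version(module_name):
    """Return a module's __version__, a fallback string, or None if the import fails."""
    try:
        module = importlib.import_module(module_name)  # also handles dotted names
    except ImportError:
        return None
    # Submodules such as matplotlib.pyplot usually carry no __version__ of their own.
    return getattr(module, "__version__", "version not available")


for name in ["json", "os", "not_a_real_module_xyz"]:
    print(f"{name:<25} {get_version(name)}")
```

Because the module object is returned rather than looked up by name, the same helper works for aliased imports, plain imports, and dotted names.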

## 5. Data Ingestion Setup

Set up data paths and create utility functions for handling TFRecord format data files.

In [5]:
# PRODUCTION-SCALE DATASET DISCOVERY AND VALIDATION
import os
from pathlib import Path
import pickle

# Enhanced data paths setup with validation
TFRECORD_DATA_DIR = DATA_DIR / "raw" / "ndws_western_dataset"

print("🗂️  PRODUCTION DATASET DISCOVERY")
print("=" * 50)
print(f"Primary data directory: {TFRECORD_DATA_DIR}")
print(f"Directory exists: {TFRECORD_DATA_DIR.exists()}")

# Comprehensive file discovery and validation
dataset_info = {
    'tfrecord_files': [],
    'tfindex_files': [],
    'file_sizes': {},
    'dataset_splits': {
        'train': [],
        'eval': [],
        'test': []
    },
    'total_size_gb': 0
}

if TFRECORD_DATA_DIR.exists():
    # Find all data files
    tfrecord_files = list(TFRECORD_DATA_DIR.glob("*.tfrecord"))
    tfindex_files = list(TFRECORD_DATA_DIR.glob("*.tfindex"))
    
    dataset_info['tfrecord_files'] = tfrecord_files
    dataset_info['tfindex_files'] = tfindex_files
    
    print(f"\n📊 Dataset File Summary:")
    print(f"   TFRecord files: {len(tfrecord_files)}")
    print(f"   TFIndex files: {len(tfindex_files)}")
    
    # Calculate file sizes and organize by split
    total_size = 0
    for file_path in tfrecord_files:
        file_size = file_path.stat().st_size
        dataset_info['file_sizes'][file_path.name] = file_size
        total_size += file_size
        
        # Categorize by dataset split
        if 'train' in file_path.name:
            dataset_info['dataset_splits']['train'].append(file_path)
        elif 'eval' in file_path.name:
            dataset_info['dataset_splits']['eval'].append(file_path)
        elif 'test' in file_path.name:
            dataset_info['dataset_splits']['test'].append(file_path)
    
    dataset_info['total_size_gb'] = total_size / (1024**3)
    
    print(f"   Total dataset size: {dataset_info['total_size_gb']:.2f} GB")
    
    # Dataset split analysis
    print(f"\n📋 Dataset Split Analysis:")
    for split_name, files in dataset_info['dataset_splits'].items():
        if files:
            split_size = sum(dataset_info['file_sizes'][f.name] for f in files)
            split_size_gb = split_size / (1024**3)
            print(f"   {split_name.capitalize():<8}: {len(files):>3} files ({split_size_gb:.2f} GB)")
            
            # Show file pattern
            if len(files) <= 3:
                for f in files:
                    print(f"     - {f.name}")
            else:
                print(f"     - {files[0].name}")
                print(f"     - {files[1].name}")
                print(f"     - ... and {len(files)-2} more files")
    
    # File integrity checks
    print(f"\n🔍 File Integrity Validation:")
    missing_indices = []
    corrupted_files = []
    
    for tfrecord_file in tfrecord_files:
        # Check for corresponding .tfindex file
        index_file = tfrecord_file.with_suffix('.tfindex')
        if not index_file.exists():
            missing_indices.append(tfrecord_file.name)
        
        # Basic corruption check (file size > 0)
        if tfrecord_file.stat().st_size == 0:
            corrupted_files.append(tfrecord_file.name)
    
    if missing_indices:
        print(f"   ⚠️  Missing .tfindex files: {len(missing_indices)}")
        for missing in missing_indices[:3]:
            print(f"     - {missing}")
    else:
        print(f"   ✅ All TFRecord files have corresponding .tfindex files")
    
    if corrupted_files:
        print(f"   ❌ Potentially corrupted files (0 bytes): {len(corrupted_files)}")
        for corrupted in corrupted_files:
            print(f"     - {corrupted}")
    else:
        print(f"   ✅ No obviously corrupted files detected")
    
    # Save dataset info for future reference
    dataset_info_file = DATA_DIR / "interim" / "dataset_discovery_info.pkl"
    dataset_info_file.parent.mkdir(parents=True, exist_ok=True)
    
    # Convert Path objects to strings for serialization
    serializable_info = {
        'tfrecord_files': [str(f) for f in dataset_info['tfrecord_files']],
        'tfindex_files': [str(f) for f in dataset_info['tfindex_files']],
        'file_sizes': dataset_info['file_sizes'],
        'dataset_splits': {
            'train': [str(f) for f in dataset_info['dataset_splits']['train']],
            'eval': [str(f) for f in dataset_info['dataset_splits']['eval']],
            'test': [str(f) for f in dataset_info['dataset_splits']['test']]
        },
        'total_size_gb': dataset_info['total_size_gb'],
        'discovery_timestamp': pd.Timestamp.now().isoformat()
    }
    
    with open(dataset_info_file, 'wb') as f:
        pickle.dump(serializable_info, f)
    
    print(f"\n💾 Dataset info saved to: {dataset_info_file}")
    
else:
    print("\n❌ TFRecord data directory not found!")
    print("🔍 Searching for dataset in alternative locations...")
    
    # Comprehensive search for the dataset
    search_locations = [
        PROJECT_ROOT / "ndws_western_dataset",
        PROJECT_ROOT / "data" / "ndws_western_dataset", 
        DATA_DIR / "ndws_western_dataset",
        DATA_DIR / "raw",
        PROJECT_ROOT,
        Path.cwd()
    ]
    
    print(f"\n📍 Checking {len(search_locations)} possible locations:")
    found_alternative = False
    
    for i, loc in enumerate(search_locations, 1):
        print(f"   {i}. {loc}")
        if loc.exists():
            # Check if it contains TFRecord files
            tfrecord_files_here = list(loc.glob("*.tfrecord"))
            if tfrecord_files_here:
                print(f"      ✅ Found {len(tfrecord_files_here)} TFRecord files!")
                TFRECORD_DATA_DIR = loc
                found_alternative = True
                break
            else:
                print(f"      📁 Directory exists but no TFRecord files found")
        else:
            print(f"      ❌ Directory does not exist")
    
    if not found_alternative:
        print(f"\n🚨 DATASET NOT FOUND!")
        print(f"📋 Please ensure the wildfire dataset is available in one of these locations:")
        print(f"   • {DATA_DIR / 'raw' / 'ndws_western_dataset'}")
        print(f"   • {PROJECT_ROOT / 'ndws_western_dataset'}")
        print(f"\n📥 Dataset download instructions:")
        print(f"   1. Download the NDWS Western dataset")
        print(f"   2. Extract to: {DATA_DIR / 'raw' / 'ndws_western_dataset'}")
        print(f"   3. Verify .tfrecord files are present")
        
        # Create placeholder info for debugging
        dataset_info = {
            'tfrecord_files': [],
            'dataset_splits': {'train': [], 'eval': [], 'test': []},
            'total_size_gb': 0,
            'status': 'not_found'
        }

# Dataset readiness assessment
print(f"\n🎯 DATASET READINESS ASSESSMENT")
print("-" * 30)

if dataset_info['tfrecord_files']:
    print(f"✅ Status: READY FOR PRODUCTION")
    print(f"📊 Files available: {len(dataset_info['tfrecord_files'])}")
    print(f"💾 Total size: {dataset_info['total_size_gb']:.2f} GB")
    print(f"🔄 Splits available: {sum(1 for split in dataset_info['dataset_splits'].values() if split)}")
    
    # Estimate processing requirements
    estimated_memory_gb = dataset_info['total_size_gb'] * 2  # Conservative estimate
    print(f"\n💡 Processing Estimates:")
    print(f"   Recommended RAM: {estimated_memory_gb:.1f}+ GB")
    print(f"   Processing time estimate: {len(dataset_info['tfrecord_files']) * 2}-{len(dataset_info['tfrecord_files']) * 5} minutes")
    print(f"   GPU acceleration: {'Recommended' if dataset_info['total_size_gb'] > 1 else 'Optional'}")
    
else:
    print(f"❌ Status: NOT READY")
    print(f"🚨 Action required: Download and setup dataset")
    
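# Only announce readiness when TFRecord files were actually found
if dataset_info['tfrecord_files']:
    print(f"\n🚀 Ready to proceed to data parsing and analysis!")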

🗂️  PRODUCTION DATASET DISCOVERY
Primary data directory: c:\Users\Harshvardhan\OneDrive\Desktop\wildfire_pred\data\raw\ndws_western_dataset
Directory exists: True

📊 Dataset File Summary:
   TFRecord files: 54
   TFIndex files: 54
   Total dataset size: 7.06 GB

📋 Dataset Split Analysis:
   Train   :  25 files (3.37 GB)
     - cleaned_train_ndws_conus_western_000.tfrecord
     - cleaned_train_ndws_conus_western_001.tfrecord
     - ... and 23 more files
   Eval    :  13 files (1.51 GB)
     - cleaned_eval_ndws_conus_western_000.tfrecord
     - cleaned_eval_ndws_conus_western_001.tfrecord
     - ... and 11 more files
   Test    :  16 files (2.18 GB)
     - cleaned_test_ndws_conus_western_000.tfrecord
     - cleaned_test_ndws_conus_western_001.tfrecord
     - ... and 14 more files

🔍 File Integrity Validation:
   ✅ All TFRecord files have corresponding .tfindex files
   ✅ No obviously corrupted files detected

💾 Dataset info saved to: c:\Users\Harshvardhan\OneDrive\Desktop\wildfire_pred\d
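
The integrity check in this section only flags zero-byte files. A somewhat stronger check, sketched below, walks the record framing of each file without TensorFlow. It assumes the standard uncompressed TFRecord layout (a little-endian uint64 length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload) and skips CRC verification. The function name `count_tfrecord_records` is hypothetical.

```python
import os
import struct


def count_tfrecord_records(path):
    """Count records by following the length headers; CRCs are skipped, not verified.

    Raises ValueError if the file ends in the middle of a record.
    """
    size = os.path.getsize(path)
    position = 0
    count = 0
    with open(path, "rb") as f:
        while position < size:
            header = f.read(8)  # uint64 payload length
            if len(header) < 8:
                raise ValueError(f"Truncated header after {count} records")
            (length,) = struct.unpack("<Q", header)
            # 8-byte length + 4-byte length CRC + payload + 4-byte payload CRC
            position += 8 + 4 + length + 4
            if position > size:
                raise ValueError(f"Truncated payload after {count} records")
            f.seek(position)
            count += 1
    return count
```

The per-file counts it returns could be compared with the sample counts reported by the TensorFlow-based analysis, and a `ValueError` would point to a partially downloaded or truncated file.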

## 6. Parse TFRecord Files

Define utility functions that parse TFRecord files and extract sample records with TensorFlow's `tf.data.TFRecordDataset`.

In [6]:
# PRODUCTION-SCALE TFRECORD PARSING WITH ERROR HANDLING
import warnings
warnings.filterwarnings('ignore')

def parse_tfrecord_example_robust(example_proto):
    """
    Robust TFRecord parsing with comprehensive error handling.
    Handles mixed data types and serialization formats.
    """
    try:
        # Parse the serialized example
        parsed_example = tf.train.Example.FromString(example_proto.numpy())
        
        feature_dict = {}
        parsing_errors = []
        
        for feature_name, feature in parsed_example.features.feature.items():
            try:
                if feature.HasField('bytes_list'):
                    # Handle tensor data stored as bytes
                    bytes_data = feature.bytes_list.value[0]
                    
                    # Try multiple parsing strategies
                    parsed_successfully = False
                    
                    # Strategy 1: Parse as tensor
                    for dtype in [tf.float32, tf.int32, tf.int64]:
                        try:
                            decoded = tf.io.parse_tensor(bytes_data, dtype)
                            feature_dict[feature_name] = {
                                'type': f'tensor_{dtype.name}',
                                'shape': decoded.shape.as_list(),
                                'data': decoded.numpy(),
                                'parsing_method': 'tensor'
                            }
                            parsed_successfully = True
                            break
                        except Exception:  # payload is not a tensor of this dtype
                            continue
                    
                    # Strategy 2: Raw bytes if tensor parsing fails
                    if not parsed_successfully:
                        feature_dict[feature_name] = {
                            'type': 'bytes',
                            'shape': [len(bytes_data)],
                            'data': bytes_data,
                            'parsing_method': 'raw_bytes'
                        }
                        
                elif feature.HasField('float_list'):
                    values = list(feature.float_list.value)
                    feature_dict[feature_name] = {
                        'type': 'float_list',
                        'shape': [len(values)],
                        'data': np.array(values),
                        'parsing_method': 'direct'
                    }
                    
                elif feature.HasField('int64_list'):
                    values = list(feature.int64_list.value)
                    feature_dict[feature_name] = {
                        'type': 'int64_list',
                        'shape': [len(values)],
                        'data': np.array(values),
                        'parsing_method': 'direct'
                    }
                    
            except Exception as e:
                parsing_errors.append(f"{feature_name}: {str(e)}")
                # Add placeholder for failed features
                feature_dict[feature_name] = {
                    'type': 'parse_error',
                    'shape': 'unknown',
                    'data': None,
                    'error': str(e),
                    'parsing_method': 'failed'
                }
        
        feature_dict['_parsing_errors'] = parsing_errors
        return feature_dict
        
    except Exception as e:
        return {'_global_error': str(e)}

def analyze_dataset_comprehensively(tfrecord_files, max_files=None, samples_per_file=None):
    """
    Comprehensive analysis across multiple files and samples.
    """
    print(f"🔍 COMPREHENSIVE DATASET ANALYSIS")
    print(f"=" * 50)
    files_to_analyze = len(tfrecord_files) if max_files is None else min(len(tfrecord_files), max_files)
    print(f"Analyzing {files_to_analyze} files...")
    if samples_per_file is not None:
        print(f"Samples per file: {samples_per_file}")
    else:
        print(f"Analyzing all samples per file")
    
    all_features = {}
    file_analysis = {}
    global_stats = {
        'total_samples_analyzed': 0,
        'total_parsing_errors': 0,
        'unique_features': set(),
        'consistent_shapes': {},
        'data_type_distribution': {}
    }
    
    files_to_process = tfrecord_files if max_files is None else tfrecord_files[:max_files]
    
    for file_idx, file_path in enumerate(files_to_process):
        print(f"\n📁 Analyzing file {file_idx + 1}: {file_path.name}")
        
        file_stats = {
            'samples_processed': 0,
            'parsing_errors': 0,
            'features_found': set(),
            'file_size_mb': file_path.stat().st_size / (1024 * 1024)
        }
        
        try:
            dataset = tf.data.TFRecordDataset(str(file_path))
            
            # Process all samples if samples_per_file is None, otherwise limit
            dataset_to_process = dataset if samples_per_file is None else dataset.take(samples_per_file)
            
            for sample_idx, example in enumerate(dataset_to_process):
                parsed = parse_tfrecord_example_robust(example)
                file_stats['samples_processed'] += 1
                global_stats['total_samples_analyzed'] += 1
                
                # Check for parsing errors
                if '_parsing_errors' in parsed and parsed['_parsing_errors']:
                    file_stats['parsing_errors'] += len(parsed['_parsing_errors'])
                    global_stats['total_parsing_errors'] += len(parsed['_parsing_errors'])
                
                if '_global_error' in parsed:
                    file_stats['parsing_errors'] += 1
                    global_stats['total_parsing_errors'] += 1
                    continue
                
                # Analyze each feature
                for feature_name, info in parsed.items():
                    if feature_name.startswith('_'):
                        continue
                        
                    file_stats['features_found'].add(feature_name)
                    global_stats['unique_features'].add(feature_name)
                    
                    if feature_name not in all_features:
                        all_features[feature_name] = {
                            'shapes': [],
                            'types': [],
                            'parsing_methods': [],
                            'sample_data': [],
                            'files_found_in': [],
                            'statistics': {
                                'min_vals': [],
                                'max_vals': [],
                                'mean_vals': [],
                                'nan_counts': [],
                                'inf_counts': []
                            }
                        }
                    
                    # Collect feature information
                    feature_info = all_features[feature_name]
                    feature_info['shapes'].append(info.get('shape', 'unknown'))
                    feature_info['types'].append(info.get('type', 'unknown'))
                    feature_info['parsing_methods'].append(info.get('parsing_method', 'unknown'))
                    feature_info['files_found_in'].append(file_path.name)
                    
                    # Statistical analysis for numeric data
                    if info.get('data') is not None and isinstance(info['data'], np.ndarray):
                        data = info['data']
                        if data.size > 0 and np.issubdtype(data.dtype, np.number):
                            # Calculate statistics on finite values
                            finite_data = data[np.isfinite(data)]
                            if len(finite_data) > 0:
                                feature_info['statistics']['min_vals'].append(float(np.min(finite_data)))
                                feature_info['statistics']['max_vals'].append(float(np.max(finite_data)))
                                feature_info['statistics']['mean_vals'].append(float(np.mean(finite_data)))
                            
                            # Count problematic values
                            nan_count = int(np.sum(np.isnan(data)))
                            inf_count = int(np.sum(np.isinf(data)))
                            feature_info['statistics']['nan_counts'].append(nan_count)
                            feature_info['statistics']['inf_counts'].append(inf_count)
                    
                    # Store sample data for the first occurrence
                    if len(feature_info['sample_data']) < 3:
                        if info.get('data') is not None:
                            if isinstance(info['data'], np.ndarray) and info['data'].size <= 100:
                                feature_info['sample_data'].append(info['data'])
                
                # Progress indicator
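                if sample_idx == 0:
                    # Exclude bookkeeping keys such as '_parsing_errors' from the count
                    n_features = sum(1 for name in parsed if not name.startswith('_'))
                    print(f"  ✓ Sample {sample_idx + 1}: {n_features} features parsed")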
            
        except Exception as e:
            print(f"  ❌ Error processing {file_path.name}: {e}")
            file_stats['file_error'] = str(e)
        
        file_analysis[file_path.name] = file_stats
        print(f"  📊 File summary: {file_stats['samples_processed']} samples, {len(file_stats['features_found'])} features")
        
        if file_stats['parsing_errors'] > 0:
            print(f"  ⚠️  Parsing errors: {file_stats['parsing_errors']}")
    
    return all_features, file_analysis, global_stats

# Execute comprehensive analysis if files are available
if 'dataset_info' in locals() and dataset_info['tfrecord_files']:
    print("🚀 Starting production-scale dataset analysis...")

    # Use the TFRecord files from our dataset discovery
    available_files = [Path(f) for f in dataset_info['tfrecord_files']] if isinstance(dataset_info['tfrecord_files'][0], str) else dataset_info['tfrecord_files']
    
    # Analyze ALL files and ALL samples for comprehensive analysis
    feature_analysis, file_analysis, global_stats = analyze_dataset_comprehensively(
        available_files, 
        max_files=None,  # Analyze ALL files
        samples_per_file=None  # Analyze ALL samples per file
    )
    
    print(f"\n📋 ANALYSIS RESULTS SUMMARY")
    print("=" * 40)
    print(f"✅ Total samples analyzed: {global_stats['total_samples_analyzed']}")
    print(f"🔧 Unique features discovered: {len(global_stats['unique_features'])}")
    print(f"⚠️  Total parsing errors: {global_stats['total_parsing_errors']}")
    
    # Feature consistency analysis
    print(f"\n🔍 FEATURE CONSISTENCY ANALYSIS")
    print("-" * 30)
    
    for feature_name, info in feature_analysis.items():
        unique_shapes = set(str(shape) for shape in info['shapes'])
        unique_types = set(info['types'])
        
        consistency_status = "✅" if len(unique_shapes) == 1 and len(unique_types) == 1 else "⚠️ "
        
        print(f"{consistency_status} {feature_name}:")
        print(f"    Shapes: {list(unique_shapes)}")
        print(f"    Types: {list(unique_types)}")
        print(f"    Found in: {len(set(info['files_found_in']))} file(s)")
        
        # Value range information if available
        if info['statistics']['min_vals'] and info['statistics']['max_vals']:
            min_val = min(info['statistics']['min_vals'])
            max_val = max(info['statistics']['max_vals'])
            print(f"    Value range: [{min_val:.3f}, {max_val:.3f}]")
        
        # Data quality issues
        total_nans = sum(info['statistics'].get('nan_counts', [0]))
        total_infs = sum(info['statistics'].get('inf_counts', [0]))
        
        if total_nans > 0 or total_infs > 0:
            print(f"    ⚠️  Quality issues: {total_nans} NaNs, {total_infs} Infs")
        
        print()
    
    # Save comprehensive analysis results
    analysis_results_file = DATA_DIR / "interim" / "comprehensive_feature_analysis.pkl"
    
    with open(analysis_results_file, 'wb') as f:
        pickle.dump({
            'feature_analysis': feature_analysis,
            'file_analysis': file_analysis,
            'global_stats': global_stats,
            'analysis_timestamp': pd.Timestamp.now().isoformat(),
            'analysis_parameters': {
                'max_files_analyzed': len(available_files),
                'samples_per_file': 'all',
                'total_files_available': len(available_files)
            }
        }, f)
    
    print(f"💾 Comprehensive analysis saved to: {analysis_results_file}")
    
    # Generate summary report
    summary_report_path = DATA_DIR / "processed" / "dataset_exploration_report.txt"
    summary_report_path.parent.mkdir(parents=True, exist_ok=True)
    
    with open(summary_report_path, 'w') as f:
        f.write("WILDFIRE DATASET EXPLORATION REPORT\n")
        f.write("=" * 40 + "\n\n")
        f.write(f"Analysis Date: {pd.Timestamp.now()}\n")
        f.write(f"Dataset Location: {TFRECORD_DATA_DIR}\n")
        f.write(f"Total Files Available: {len(available_files)}\n")
        f.write(f"Files Analyzed: {len(available_files)}\n")
        f.write(f"Total Samples Analyzed: {global_stats['total_samples_analyzed']}\n\n")
        
        f.write("FEATURE SUMMARY:\n")
        f.write("-" * 20 + "\n")
        for feature_name, info in feature_analysis.items():
            f.write(f"{feature_name}:\n")
            f.write(f"  - Shapes: {set(str(s) for s in info['shapes'])}\n")
            f.write(f"  - Types: {set(info['types'])}\n")
            f.write(f"  - Found in {len(set(info['files_found_in']))} files\n")
            
            # Statistical summary
            if info['statistics']['min_vals']:
                min_val = min(info['statistics']['min_vals'])
                max_val = max(info['statistics']['max_vals'])
                mean_val = np.mean(info['statistics']['mean_vals'])
                f.write(f"  - Range: [{min_val:.3f}, {max_val:.3f}], Mean: {mean_val:.3f}\n")
            
            f.write("\n")
    
    print(f"📄 Summary report saved to: {summary_report_path}")
    
else:
    print("❌ No TFRecord files found for analysis")

print(f"\n✅ Production-scale TFRecord analysis complete!")


🚀 Starting production-scale dataset analysis...
🔍 COMPREHENSIVE DATASET ANALYSIS
Analyzing 54 files...
Analyzing all samples per file

📁 Analyzing file 1: cleaned_eval_ndws_conus_western_000.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 252 samples, 23 features

📁 Analyzing file 2: cleaned_eval_ndws_conus_western_001.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 106 samples, 23 features

📁 Analyzing file 3: cleaned_eval_ndws_conus_western_002.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 64 samples, 23 features

📁 Analyzing file 4: cleaned_eval_ndws_conus_western_003.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 134 samples, 23 features

📁 Analyzing file 5: cleaned_eval_ndws_conus_western_004.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 568 samples, 23 features

📁 Analyzing file 6: cleaned_eval_ndws_conus_western_005.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 277 samples, 23 features

📁 Analyzing fi

## 7. Feature Names and Tensor Analysis

Identify the feature names in the dataset, check each feature's tensor dimensions, and examine sample values to understand how the data is structured.
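
As a rough illustration of the per-feature checks this section performs, the sketch below reshapes a flat feature vector into a grid and summarizes its finite values. It assumes each feature holds 4096 values that form a 64x64 grid, matching the reshape logic in the parsing code of this notebook. The helper name `summarize_feature` and the synthetic `sample` dictionary are hypothetical stand-ins, not part of the dataset or of the analysis functions.

```python
import numpy as np

def summarize_feature(values, grid_shape=(64, 64)):
    """Reshape a flat feature vector to a 2-D grid and summarize its finite values."""
    arr = np.asarray(values, dtype=np.float32)
    # Only reshape when the element count matches the assumed grid size
    if arr.size == grid_shape[0] * grid_shape[1]:
        arr = arr.reshape(grid_shape)
    # Statistics are computed on finite values only, so NaN/Inf cannot skew them
    finite = arr[np.isfinite(arr)]
    return {
        "shape": arr.shape,
        "min": float(finite.min()) if finite.size else None,
        "max": float(finite.max()) if finite.size else None,
        "mean": float(finite.mean()) if finite.size else None,
        "nan_count": int(np.isnan(arr).sum()),
        "inf_count": int(np.isinf(arr).sum()),
    }

# Synthetic stand-in for one decoded sample (random values, not real dataset values)
rng = np.random.default_rng(0)
sample = {
    "elevation": rng.uniform(0.0, 3000.0, size=4096),
    "PrevFireMask": rng.integers(0, 2, size=4096).astype(np.float32),
}
sample["elevation"][0] = np.nan  # inject one missing value to exercise the NaN count

summaries = {name: summarize_feature(vals) for name, vals in sample.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Vectors whose length does not match the assumed grid are left one-dimensional, so scalar or metadata features pass through without raising an error.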

In [8]:
# PRODUCTION-SCALE FEATURE CLASSIFICATION AND RECOMMENDATIONS

def create_comprehensive_feature_descriptions():
    """
    Create comprehensive descriptions for wildfire dataset features.
    Includes data types, expected ranges, and usage recommendations.
    """
    return {
        'elevation': {
            'description': 'Elevation above sea level (meters)',
            'expected_range': (0, 5000),
            'data_type': 'continuous',
            'category': 'topographic',
            'preprocessing': 'normalize',
            'importance': 'high'
        },
        'th': {
            'description': 'Wind direction (degrees from north)', 
            'expected_range': (0, 360),
            'data_type': 'circular',
            'category': 'meteorological',
            'preprocessing': 'circular_encoding',
            'importance': 'medium'
        },
        'vs': {
            'description': 'Wind speed (m/s)',
            'expected_range': (0, 50),
            'data_type': 'continuous',
            'category': 'meteorological',
            'preprocessing': 'normalize',
            'importance': 'high'
        },
        'tmmn': {
            'description': 'Minimum air temperature (Kelvin)',
            'expected_range': (200, 330),
            'data_type': 'continuous',
            'category': 'meteorological',
            'preprocessing': 'normalize',
            'importance': 'very_high'
        },
        'tmmx': {
            'description': 'Maximum air temperature (Kelvin)',
            'expected_range': (200, 330),
            'data_type': 'continuous',
            'category': 'meteorological',
            'preprocessing': 'normalize',
            'importance': 'very_high'
        },
        'sph': {
            'description': 'Specific humidity (kg/kg)',
            'expected_range': (0, 0.03),
            'data_type': 'continuous',
            'category': 'meteorological',
            'preprocessing': 'normalize',
            'importance': 'high'
        },
        'pr': {
            'description': 'Precipitation amount (mm)',
            'expected_range': (0, 300),
            'data_type': 'continuous',
            'category': 'meteorological',
            'preprocessing': 'normalize',
            'importance': 'very_high'
        },
        'rmax': {
            'description': 'Maximum relative humidity (%)',
            'expected_range': (0, 100),
            'data_type': 'percentage',
            'category': 'meteorological',
            'preprocessing': 'normalize',
            'importance': 'high'
        },
        'rmin': {
            'description': 'Minimum relative humidity (%)',
            'expected_range': (0, 100),
            'data_type': 'percentage',
            'category': 'meteorological',
            'preprocessing': 'normalize',
            'importance': 'high'
        },
        'fm100': {
            'description': '100-hour fuel moisture (%)',
            'expected_range': (0, 50),
            'data_type': 'percentage',
            'category': 'fire_weather',
            'preprocessing': 'normalize',
            'importance': 'very_high'
        },
        'fm1000': {
            'description': '1000-hour fuel moisture (%)',
            'expected_range': (0, 50),
            'data_type': 'percentage',
            'category': 'fire_weather',
            'preprocessing': 'normalize',
            'importance': 'very_high'
        },
        'population': {
            'description': 'Population density (people/km²)',
            'expected_range': (0, 10000),
            'data_type': 'continuous',
            'category': 'anthropogenic',
            'preprocessing': 'log_transform',
            'importance': 'medium'
        },
        'erc': {
            'description': 'Energy Release Component (fire weather index)',
            'expected_range': (0, 200),
            'data_type': 'continuous',
            'category': 'fire_weather',
            'preprocessing': 'normalize',
            'importance': 'very_high'
        },
        'PrevFireMask': {
            'description': 'Previous day fire occurrence (binary mask)',
            'expected_range': (0, 1),
            'data_type': 'binary',
            'category': 'fire_history',
            'preprocessing': 'none',
            'importance': 'very_high'
        },
        'FireMask': {
            'description': 'Current day fire occurrence (target variable)',
            'expected_range': (0, 1),
            'data_type': 'binary',
            'category': 'target',
            'preprocessing': 'none',
            'importance': 'target'
        },
        'viirs_FireMask': {
            'description': 'VIIRS satellite fire detection mask',
            'expected_range': (0, 1),
            'data_type': 'binary',
            'category': 'fire_detection',
            'preprocessing': 'none',
            'importance': 'very_high'
        },
        'burned_area': {
            'description': 'Area burned (hectares)',
            'expected_range': (0, 10000),
            'data_type': 'continuous',
            'category': 'fire_impact',
            'preprocessing': 'log_transform',
            'importance': 'medium'
        },
        'impervious': {
            'description': 'Impervious surface percentage',
            'expected_range': (0, 100),
            'data_type': 'percentage',
            'category': 'land_cover',
            'preprocessing': 'normalize',
            'importance': 'low'
        },
        'water': {
            'description': 'Water body percentage',
            'expected_range': (0, 100),
            'data_type': 'percentage',
            'category': 'land_cover',
            'preprocessing': 'normalize',
            'importance': 'low'
        }
    }

def generate_preprocessing_recommendations(feature_analysis, feature_descriptions):
    """
    Generate specific preprocessing recommendations based on actual data analysis.
    """
    recommendations = {
        'normalization_strategies': {},
        'outlier_handling': {},
        'missing_value_strategies': {},
        'feature_engineering_opportunities': [],
        'data_quality_issues': [],
        'model_input_preparation': {}
    }
    
    print("🔧 GENERATING PREPROCESSING RECOMMENDATIONS")
    print("=" * 50)
    
    for feature_name, analysis_info in feature_analysis.items():
        feature_desc = feature_descriptions.get(feature_name, {})
        
        # Statistical analysis
        stats = analysis_info.get('statistics', {})
        
        # Normalization strategy based on data type and distribution
        if feature_desc.get('data_type') == 'circular':
            recommendations['normalization_strategies'][feature_name] = 'circular_encoding'
        elif feature_desc.get('data_type') == 'binary':
            recommendations['normalization_strategies'][feature_name] = 'none'
        elif feature_desc.get('preprocessing') == 'log_transform':
            recommendations['normalization_strategies'][feature_name] = 'log_normalize'
        else:
            # Use standard scaling for continuous variables
            recommendations['normalization_strategies'][feature_name] = 'standard_scaling'
        
        # Outlier detection based on expected ranges
        expected_range = feature_desc.get('expected_range')
        if expected_range and stats.get('min_vals') and stats.get('max_vals'):
            min_val = min(stats['min_vals'])
            max_val = max(stats['max_vals'])
            
            outliers_detected = False
            if expected_range[0] is not None and min_val < expected_range[0]:
                outliers_detected = True
            if expected_range[1] is not None and max_val > expected_range[1]:
                outliers_detected = True
            
            if outliers_detected:
                recommendations['outlier_handling'][feature_name] = 'clip_to_expected_range'
        
        # Missing value strategy
        nan_count = sum(stats.get('nan_counts', [0]))
        inf_count = sum(stats.get('inf_counts', [0]))
        
        if nan_count > 0 or inf_count > 0:
            if feature_desc.get('category') in ['meteorological', 'fire_weather']:
                strategy = 'interpolate_temporal'
            elif feature_desc.get('data_type') == 'binary':
                strategy = 'fill_zeros'
            else:
                strategy = 'fill_median'
            
            recommendations['missing_value_strategies'][feature_name] = strategy
            recommendations['data_quality_issues'].append({
                'feature': feature_name,
                'issue': f'{nan_count} NaN values, {inf_count} Inf values',
                'severity': 'high' if nan_count + inf_count > 100 else 'medium'
            })
    
    # Feature engineering opportunities.
    # Evaluated once after the loop and keyed on which features are present,
    # not on the category of whichever feature happened to be iterated last.
    if 'tmmn' in feature_analysis and 'tmmx' in feature_analysis:
        recommendations['feature_engineering_opportunities'].append({
            'type': 'temperature_range',
            'description': 'Create temperature range (tmmx - tmmn) feature',
            'features_involved': ['tmmn', 'tmmx']
        })
    
    if 'rmin' in feature_analysis and 'rmax' in feature_analysis:
        recommendations['feature_engineering_opportunities'].append({
            'type': 'humidity_range',
            'description': 'Create humidity range (rmax - rmin) feature',
            'features_involved': ['rmin', 'rmax']
        })
    
    return recommendations

def parse_tfrecord_example_robust(example):
    """Robust TFRecord parsing with comprehensive error handling."""
    try:
        # Parse the example
        example_proto = tf.train.Example.FromString(example.numpy())
        
        feature_dict = {}
        parsing_errors = []
        
        for feature_name, feature in example_proto.features.feature.items():
            try:
                if feature.HasField('bytes_list'):
                    # Try to decode as tensor
                    try:
                        decoded_tensor = tf.io.parse_tensor(
                            feature.bytes_list.value[0], 
                            tf.float32
                        ).numpy()
                        
                        feature_dict[feature_name] = {
                            'data': decoded_tensor,
                            'type': 'tensor_float32',
                            'shape': decoded_tensor.shape,
                            'parsing_method': 'tensor_decode'
                        }
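                    except Exception:
                        # Not a serialized float32 tensor; fall back to the raw bytes.
                        # Shape is a 1-tuple (number of byte strings) so it matches the
                        # tuple shapes recorded for the other feature types.
                        feature_dict[feature_name] = {
                            'data': feature.bytes_list.value[0],
                            'type': 'bytes',
                            'shape': (len(feature.bytes_list.value),),
                            'parsing_method': 'bytes_fallback'
                        }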
                        
                elif feature.HasField('float_list'):
                    float_data = np.array(list(feature.float_list.value), dtype=np.float32)
                    
                    # Reshape if it looks like image data (4096 = 64x64)
                    if len(float_data) == 4096:
                        float_data = float_data.reshape(64, 64)
                    
                    feature_dict[feature_name] = {
                        'data': float_data,
                        'type': 'float_list',
                        'shape': float_data.shape,
                        'parsing_method': 'float_array'
                    }
                    
                elif feature.HasField('int64_list'):
                    int_data = np.array(list(feature.int64_list.value), dtype=np.int64)
                    
                    # Reshape if it looks like image data
                    if len(int_data) == 4096:
                        int_data = int_data.reshape(64, 64)
                    
                    feature_dict[feature_name] = {
                        'data': int_data,
                        'type': 'int64_list',
                        'shape': int_data.shape,
                        'parsing_method': 'int_array'
                    }
                else:
                    parsing_errors.append(f"Unknown feature type for {feature_name}")
                    feature_dict[feature_name] = {
                        'data': None,
                        'type': 'unknown',
                        'shape': None,
                        'parsing_method': 'failed'
                    }
                    
            except Exception as e:
                parsing_errors.append(f"Error parsing {feature_name}: {str(e)}")
                feature_dict[feature_name] = {
                    'data': None,
                    'type': 'error',
                    'shape': None,
                    'parsing_method': 'failed'
                }
        
        feature_dict['_parsing_errors'] = parsing_errors
        return feature_dict
        
    except Exception as e:
        return {'_global_error': str(e)}

def analyze_dataset_comprehensively(tfrecord_files, max_files=None, samples_per_file=None):
    """
    Comprehensive analysis across multiple files and samples.
    """
    print("🔍 COMPREHENSIVE DATASET ANALYSIS")
    print("=" * 50)
    files_to_analyze = len(tfrecord_files) if max_files is None else min(len(tfrecord_files), max_files)
    print(f"Analyzing {files_to_analyze} files...")
    if samples_per_file is not None:
        print(f"Samples per file: {samples_per_file}")
    else:
        print("Analyzing all samples per file")
    
    all_features = {}
    file_analysis = {}
    global_stats = {
        'total_samples_analyzed': 0,
        'total_parsing_errors': 0,
        'unique_features': set(),
        'consistent_shapes': {},
        'data_type_distribution': {}
    }
    
    files_to_process = tfrecord_files if max_files is None else tfrecord_files[:max_files]
    
    for file_idx, file_path in enumerate(files_to_process):
        print(f"\n📁 Analyzing file {file_idx + 1}: {file_path.name}")
        
        file_stats = {
            'samples_processed': 0,
            'parsing_errors': 0,
            'features_found': set(),
            'file_size_mb': file_path.stat().st_size / (1024 * 1024)
        }
        
        try:
            dataset = tf.data.TFRecordDataset(str(file_path))
            
            # Use samples_per_file limitation if specified
            dataset_to_process = dataset.take(samples_per_file) if samples_per_file is not None else dataset
            
            for sample_idx, example in enumerate(dataset_to_process):
                parsed = parse_tfrecord_example_robust(example)
                file_stats['samples_processed'] += 1
                global_stats['total_samples_analyzed'] += 1
                
                # Check for parsing errors
                if '_parsing_errors' in parsed and parsed['_parsing_errors']:
                    file_stats['parsing_errors'] += len(parsed['_parsing_errors'])
                    global_stats['total_parsing_errors'] += len(parsed['_parsing_errors'])
                
                if '_global_error' in parsed:
                    file_stats['parsing_errors'] += 1
                    global_stats['total_parsing_errors'] += 1
                    continue
                
                # Analyze each feature
                for feature_name, info in parsed.items():
                    if feature_name.startswith('_'):
                        continue
                        
                    file_stats['features_found'].add(feature_name)
                    global_stats['unique_features'].add(feature_name)
                    
                    if feature_name not in all_features:
                        all_features[feature_name] = {
                            'shapes': [],
                            'types': [],
                            'parsing_methods': [],
                            'sample_data': [],
                            'files_found_in': [],
                            'statistics': {
                                'min_vals': [],
                                'max_vals': [],
                                'mean_vals': [],
                                'nan_counts': [],
                                'inf_counts': []
                            }
                        }
                    
                    # Collect feature information
                    feature_info = all_features[feature_name]
                    feature_info['shapes'].append(info.get('shape', 'unknown'))
                    feature_info['types'].append(info.get('type', 'unknown'))
                    feature_info['parsing_methods'].append(info.get('parsing_method', 'unknown'))
                    feature_info['files_found_in'].append(file_path.name)
                    
                    # Statistical analysis for numeric data
                    if info.get('data') is not None and isinstance(info['data'], np.ndarray):
                        data = info['data']
                        if data.size > 0 and np.issubdtype(data.dtype, np.number):
                            # Calculate statistics on finite values
                            finite_data = data[np.isfinite(data)]
                            if len(finite_data) > 0:
                                feature_info['statistics']['min_vals'].append(float(np.min(finite_data)))
                                feature_info['statistics']['max_vals'].append(float(np.max(finite_data)))
                                feature_info['statistics']['mean_vals'].append(float(np.mean(finite_data)))
                            
                            # Count problematic values
                            nan_count = int(np.sum(np.isnan(data)))
                            inf_count = int(np.sum(np.isinf(data)))
                            feature_info['statistics']['nan_counts'].append(nan_count)
                            feature_info['statistics']['inf_counts'].append(inf_count)
                
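                # Progress indicator.
                # Note: len(parsed) also counts the '_parsing_errors' metadata key, so this
                # number is one higher than the feature count in the file summary.
                if sample_idx == 0:
                    print(f"  ✓ Sample {sample_idx + 1}: {len(parsed)} features parsed")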
            
        except Exception as e:
            print(f"  ❌ Error processing {file_path.name}: {e}")
            file_stats['file_error'] = str(e)
        
        file_analysis[file_path.name] = file_stats
        print(f"  📊 File summary: {file_stats['samples_processed']} samples, {len(file_stats['features_found'])} features")
        
        if file_stats['parsing_errors'] > 0:
            print(f"  ⚠️  Parsing errors: {file_stats['parsing_errors']}")
    
    return all_features, file_analysis, global_stats

# Execute comprehensive analysis if files are available
if 'dataset_info' in locals() and dataset_info['tfrecord_files']:
    print("🚀 Starting production-scale dataset analysis...")
    
    # Use the TFRecord files from our dataset discovery.
    # Path() accepts both str and Path entries, so no type check is needed.
    available_files = [Path(f) for f in dataset_info['tfrecord_files']]
    
    # Perform comprehensive analysis
    feature_analysis, file_analysis, global_stats = analyze_dataset_comprehensively(
        available_files, 
        max_files=None, 
        samples_per_file=None
    )
    
    print(f"\n📋 ANALYSIS RESULTS SUMMARY")
    print("=" * 40)
    print(f"✅ Total samples analyzed: {global_stats['total_samples_analyzed']}")
    print(f"🔧 Unique features discovered: {len(global_stats['unique_features'])}")
    print(f"⚠️  Total parsing errors: {global_stats['total_parsing_errors']}")
    
    # Feature consistency analysis
    print(f"\n🔍 FEATURE CONSISTENCY ANALYSIS")
    print("-" * 30)
    
    for feature_name, info in feature_analysis.items():
        unique_shapes = set(str(shape) for shape in info['shapes'])
        unique_types = set(info['types'])
        
        consistency_status = "✅" if len(unique_shapes) == 1 and len(unique_types) == 1 else "⚠️ "
        
        print(f"{consistency_status} {feature_name}:")
        print(f"    Shapes: {list(unique_shapes)}")
        print(f"    Types: {list(unique_types)}")
        print(f"    Found in: {len(set(info['files_found_in']))} file(s)")
        
        # Value range information if available
        if info['statistics']['min_vals'] and info['statistics']['max_vals']:
            min_val = min(info['statistics']['min_vals'])
            max_val = max(info['statistics']['max_vals'])
            print(f"    Value range: [{min_val:.3f}, {max_val:.3f}]")
        
        # Data quality issues
        total_nans = sum(info['statistics'].get('nan_counts', [0]))
        total_infs = sum(info['statistics'].get('inf_counts', [0]))
        
        if total_nans > 0 or total_infs > 0:
            print(f"    ⚠️  Quality issues: {total_nans} NaNs, {total_infs} Infs")
        
        print()
    
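    # Save comprehensive analysis results
    import pickle  # needed for pickle.dump below; harmless if an earlier cell already imported it
    
    analysis_results_file = DATA_DIR / "interim" / "comprehensive_feature_analysis.pkl"
    analysis_results_file.parent.mkdir(parents=True, exist_ok=True)
    
    with open(analysis_results_file, 'wb') as f: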
        pickle.dump({
            'feature_analysis': feature_analysis,
            'file_analysis': file_analysis,
            'global_stats': global_stats,
            'analysis_timestamp': pd.Timestamp.now().isoformat(),
            'analysis_parameters': {
                'max_files_analyzed': len(available_files),
                'samples_per_file': 'all',
                'total_files_available': len(available_files)
            }
        }, f)
    
    print(f"💾 Comprehensive analysis saved to: {analysis_results_file}")
    
    # Generate summary report
    summary_report_path = DATA_DIR / "processed" / "dataset_exploration_report.txt"
    summary_report_path.parent.mkdir(parents=True, exist_ok=True)
    
    with open(summary_report_path, 'w') as f:
        f.write("WILDFIRE DATASET EXPLORATION REPORT\n")
        f.write("=" * 40 + "\n\n")
        f.write(f"Analysis Date: {pd.Timestamp.now()}\n")
        f.write(f"Dataset Location: {TFRECORD_DATA_DIR}\n")
        f.write(f"Total Files Available: {len(available_files)}\n")
        f.write(f"Files Analyzed: {len(available_files)}\n")
        f.write(f"Total Samples Analyzed: {global_stats['total_samples_analyzed']}\n\n")
        
        f.write("FEATURE SUMMARY:\n")
        f.write("-" * 20 + "\n")
        for feature_name, info in feature_analysis.items():
            f.write(f"{feature_name}:\n")
            f.write(f"  - Shapes: {set(str(s) for s in info['shapes'])}\n")
            f.write(f"  - Types: {set(info['types'])}\n")
            f.write(f"  - Found in {len(set(info['files_found_in']))} files\n")
            
            # Statistical summary
            if info['statistics']['min_vals']:
                min_val = min(info['statistics']['min_vals'])
                max_val = max(info['statistics']['max_vals'])
                # Unweighted mean of the per-sample means
                mean_val = np.mean(info['statistics']['mean_vals'])
                f.write(f"  - Range: [{min_val:.3f}, {max_val:.3f}], Mean: {mean_val:.3f}\n")
            
            f.write("\n")
    
    print(f"📄 Summary report saved to: {summary_report_path}")
    
else:
    print("❌ No TFRecord files found for analysis")


# Analyze multiple examples to understand data structure consistency
def analyze_tfrecord_structure(file_path, num_examples=5):
    """Analyze structure across multiple examples."""
    dataset = tf.data.TFRecordDataset(str(file_path))
    
    feature_info = {}
    
    for i, example in enumerate(dataset.take(num_examples)):
        parsed = parse_tfrecord_example_robust(example)
        
        # Skip if there was a global parsing error
        if '_global_error' in parsed:
            continue
            
        for feature_name, info in parsed.items():
            # Skip internal parsing metadata
            if feature_name.startswith('_'):
                continue
                
            if feature_name not in feature_info:
                feature_info[feature_name] = {
                    'type': info.get('type', 'unknown'),
                    'shapes': [],
                    'min_vals': [],
                    'max_vals': [],
                    'mean_vals': []
                }
            
            feature_info[feature_name]['shapes'].append(info.get('shape', 'unknown'))
            
            # Calculate statistics for any numeric array (decoded tensors, float lists,
            # int lists), ignoring NaN/Inf so they cannot turn min/max/mean into NaN
            data = info.get('data')
            if isinstance(data, np.ndarray) and np.issubdtype(data.dtype, np.number):
                finite_data = data[np.isfinite(data)]
                if finite_data.size > 0:
                    feature_info[feature_name]['min_vals'].append(float(np.min(finite_data)))
                    feature_info[feature_name]['max_vals'].append(float(np.max(finite_data)))
                    feature_info[feature_name]['mean_vals'].append(float(np.mean(finite_data)))
    
    return feature_info

# Analyze structure if files are available
if tfrecord_files:
    print("Analyzing data structure across multiple examples...")
    structure_info = analyze_tfrecord_structure(tfrecord_files[0], num_examples=3)
    
    # Create summary DataFrame
    summary_data = []
    for feature_name, info in structure_info.items():
        shapes_consistent = len(set(map(str, info['shapes']))) == 1
        
        summary_data.append({
            'Feature': feature_name,
            'Type': info['type'],
            'Shape': info['shapes'][0] if shapes_consistent else f"Variable: {info['shapes']}",
            'Shape_Consistent': shapes_consistent,
            'Min_Value': np.min(info['min_vals']) if info['min_vals'] else 'N/A',
            'Max_Value': np.max(info['max_vals']) if info['max_vals'] else 'N/A',
            'Mean_Value': np.mean(info['mean_vals']) if info['mean_vals'] else 'N/A'
        })
    
    structure_df = pd.DataFrame(summary_data)
    
    print("\n📊 TFRecord Structure Analysis:")
    print("-" * 60)
    print(structure_df.to_string(index=False, max_colwidth=20))
    
    # Check for inconsistencies
    inconsistent_features = structure_df[~structure_df['Shape_Consistent']]
    if not inconsistent_features.empty:
        print(f"\n⚠️  Features with inconsistent shapes:")
        for _, row in inconsistent_features.iterrows():
            print(f"  - {row['Feature']}: {row['Shape']}")
    else:
        print(f"\n✅ All features have consistent shapes across examples")

else:
    print("❌ No TFRecord files found for structure analysis")

print(f"\n📋 Key features identified:")
for feature in ['elevation', 'tmmn', 'tmmx', 'pr', 'fm100', 'fm1000', 'erc', 'PrevFireMask', 'FireMask']:
    print(f"  - {feature}")

# Feature names discovered from the first pass
discovered_features = [
    'wdir_wind', 'fuel1', 'bi', 'gust_med', 'avg_sph', 'wdir_gust', 'wind_75',
    'burn_index_tc', 'pr', 'vs', 'psi', 'burning_index_tc', 'population',
    'erc', 'rmax', 'rmin', 'NDVI', 'PrevFireMask', 'elevation', 'th',
    'vpd', 'sph', 'tmmn', 'tmmx', 'FireMask', 'impervious', 'water',
    'viirs_FireMask', 'viirs_PrevFireMask'
]

print(f"\n🎯 COMPREHENSIVE FEATURE ANALYSIS & RECOMMENDATIONS")
print("=" * 60)

# Execute analysis if comprehensive results are available
if 'feature_analysis' in locals() and feature_analysis:
    # Get comprehensive feature descriptions
    feature_descriptions = create_comprehensive_feature_descriptions()
    
    # Generate preprocessing recommendations
    preprocessing_recommendations = generate_preprocessing_recommendations(
        feature_analysis, feature_descriptions
    )
    
    print(f"\n📊 FEATURE CATEGORIZATION")
    print("-" * 30)
    
    # Categorize features by type and importance
    categories = {}
    for feature_name in feature_analysis.keys():
        desc = feature_descriptions.get(feature_name, {})
        category = desc.get('category', 'unknown')
        importance = desc.get('importance', 'medium')
        
        if category not in categories:
            categories[category] = {}
        if importance not in categories[category]:
            categories[category][importance] = []
        
        categories[category][importance].append(feature_name)
    
    # Display categorization
    for category, importance_dict in categories.items():
        print(f"\n🏷️  {category.upper()}:")
        for importance, features in importance_dict.items():
            print(f"  {importance.title()}: {len(features)} features")
            if len(features) <= 4:
                print(f"    • " + "\n    • ".join(features))
            else:
                print(f"    • " + "\n    • ".join(features[:3]) + f"\n    • ... and {len(features)-3} more")
    
    print(f"\n🔧 PREPROCESSING RECOMMENDATIONS")
    print("-" * 40)
    
    # Display normalization strategies
    norm_strategies = {}
    for feature, strategy in preprocessing_recommendations['normalization_strategies'].items():
        if strategy not in norm_strategies:
            norm_strategies[strategy] = []
        norm_strategies[strategy].append(feature)
    
    print("📈 Normalization Strategies:")
    for strategy, features in norm_strategies.items():
        print(f"  • {strategy.replace('_', ' ').title()}: {len(features)} features")
        if len(features) <= 3:
            print(f"    - {', '.join(features)}")
    
    # Display feature engineering opportunities
    if preprocessing_recommendations['feature_engineering_opportunities']:
        print(f"\n💡 Feature Engineering Opportunities:")
        for opp in preprocessing_recommendations['feature_engineering_opportunities']:
            print(f"  • {opp['description']}")
            print(f"    Features: {', '.join(opp['features_involved'])}")
    
    # Display data quality issues
    if preprocessing_recommendations['data_quality_issues']:
        print(f"\n⚠️  Data Quality Issues Found:")
        for issue in preprocessing_recommendations['data_quality_issues']:
            severity_icon = "🚨" if issue['severity'] == 'high' else "⚠️ "
            print(f"  {severity_icon} {issue['feature']}: {issue['issue']}")
    
    print(f"\n✅ Comprehensive feature analysis and recommendations complete!")
    print(f"📊 Analysis covers {len(feature_analysis)} features across {global_stats['total_samples_analyzed']} samples")

else:
    print("❌ Comprehensive analysis not available - using discovered feature list")
    print(f"🔍 Found {len(discovered_features)} features from initial exploration")

🚀 Starting production-scale dataset analysis...
🔍 COMPREHENSIVE DATASET ANALYSIS
Analyzing 54 files...
Analyzing all samples per file

📁 Analyzing file 1: cleaned_eval_ndws_conus_western_000.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 252 samples, 23 features

📁 Analyzing file 2: cleaned_eval_ndws_conus_western_001.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 106 samples, 23 features

📁 Analyzing file 3: cleaned_eval_ndws_conus_western_002.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 64 samples, 23 features

📁 Analyzing file 4: cleaned_eval_ndws_conus_western_003.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 134 samples, 23 features

📁 Analyzing file 5: cleaned_eval_ndws_conus_western_004.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 568 samples, 23 features

📁 Analyzing file 6: cleaned_eval_ndws_conus_western_005.tfrecord
  ✓ Sample 1: 24 features parsed
  📊 File summary: 277 samples, 23 features

📁 Analyzing fi