# EOPF Zarr Explorer Sentinel-2 L2A Data Structure Analysis

This notebook analyzes the structure of the EOPF (Earth Observation Processing Framework) Sentinel-2 L2A Zarr dataset.

## Objectives

1. **Data Structure Analysis**: Complete inventory of EOPF Sentinel-2 L2A dataset structure, chunking, and compression
2. **Hierarchy Size Analysis**: Report the size of every node in the hierarchy, with totals summed at group level
3. **Metadata Analysis**: Analyze current metadata conventions and CRS handling
4. **Performance Analysis**: Identify bottlenecks for web access patterns
5. **Optimization Recommendations**: Document findings with recommendations for optimization

## Dataset

**Target Dataset**: `s2l2_test.zarr`
- **Product Type**: Sentinel-2 Level 2A (Bottom-of-Atmosphere reflectance)
- **Processing Level**: L2A (atmospherically corrected)

## Setup and Data Loading

In [None]:
# Import required libraries
import json
import warnings
import requests
from pathlib import Path

# Import our analysis utilities
from eopf_analysis_utils import (
    load_eopf_dataset,
    analyze_hierarchy_sizes,
    print_hierarchy_sizes
)

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore', category=UserWarning)

print("✅ Libraries and utilities imported successfully")

In [None]:
# Dataset configuration (in this run the "URL" is a local filesystem path)
DATASET_URL = "/home/emathot/Workspace/eopf-explorer/data-model/tests-output/eopf_geozarr/s2l2_test.zarr"

print(f"🌐 Dataset URL: {DATASET_URL}")

# List the chunk managers (e.g. dask) that xarray can use for chunked arrays
from xarray.namedarray.parallelcompat import list_chunkmanagers
chunk_managers = list_chunkmanagers()
for cm in chunk_managers:
    print(f"Chunk manager: {cm}")

# Start a local Dask cluster on this machine; the last line displays the client summary
from dask.distributed import Client
client = Client()
client

In [None]:
# Load the EOPF dataset
zarr_store, datatree = load_eopf_dataset(DATASET_URL)

print(f"\n📁 Store keys: {list(zarr_store.keys())}")
print(f"🌳 Datatree groups: {list(datatree.groups)}")
print(f"📊 Datatree variables: {list(datatree.variables)}")

# Display basic structure
print("\n=== Zarr Store Structure ===")
print(zarr_store.tree())

## Hierarchy Size Analysis

This section measures the size of every array in the hierarchy and sums those sizes at group level, showing how data is distributed across groups and how much storage each group requires.
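
The `analyze_hierarchy_sizes` helper is imported from `eopf_analysis_utils` and its implementation is not shown in this notebook. As a rough illustration of what "sums at group level" means, the sketch below rolls array sizes up into every ancestor group. The function name `rollup_group_sizes` and the example paths and byte counts are hypothetical, chosen only for illustration; they are not taken from the utility module or measured from the dataset.

```python
from collections import defaultdict


def rollup_group_sizes(array_sizes):
    """Sum array sizes (in bytes) into every ancestor group of each array path."""
    group_sizes = defaultdict(int)
    for path, size_bytes in array_sizes.items():
        parts = path.strip("/").split("/")
        # Every proper prefix of the path is an ancestor group; "" is the root.
        for depth in range(len(parts)):
            group = "/".join(parts[:depth])
            group_sizes[group] += size_bytes
    return dict(group_sizes)


# Hypothetical sizes, for illustration only (not measured from the dataset)
example_sizes = {
    "measurements/reflectance/r10m/b02": 400,
    "measurements/reflectance/r10m/b03": 400,
    "measurements/reflectance/r20m/b05": 100,
    "quality/mask/r10m/b02": 50,
}

totals = rollup_group_sizes(example_sizes)
print(totals["measurements/reflectance/r10m"])
print(totals["measurements"])
print(totals[""])
```

Each array contributes to all of its ancestors, so the root total equals the sum of all array sizes, and a parent is never smaller than any of its children.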

In [None]:
# Analyze hierarchy sizes with group-level sums using xarray
size_analysis = analyze_hierarchy_sizes(datatree, zarr_store)

# Display hierarchy sizes in tree format
print_hierarchy_sizes(size_analysis, max_depth=4)

In [None]:
# Additional size statistics
print("\n📈 DETAILED SIZE STATISTICS:")
print(f"  Total arrays: {len(size_analysis['array_sizes'])}")
print(f"  Total groups: {len(size_analysis['group_sizes'])}")
print(f"  Hierarchy levels: {len(size_analysis['summary_by_level'])}")

# Show largest arrays
print("\n🔍 LARGEST ARRAYS (Top 10):")
sorted_arrays = sorted(
    size_analysis['array_sizes'].items(),
    key=lambda item: item[1]['size_bytes'],
    reverse=True,
)

for rank, (array_path, array_info) in enumerate(sorted_arrays[:10], start=1):
    shape_str = "x".join(map(str, array_info['shape']))
    print(f"  {rank:2d}. {array_path}: {array_info['size_formatted']} ({shape_str}, {array_info['dtype']})")

# Show group size distribution
from eopf_analysis_utils import format_size  # imported once, outside the loop

print("\n📊 GROUP SIZE DISTRIBUTION:")
for level, level_info in sorted(size_analysis['summary_by_level'].items()):
    # Average group size at this level; 0 when the level has no groups
    avg_group_size = level_info['total_size_bytes'] / level_info['group_count'] if level_info['group_count'] > 0 else 0
    print(f"  Level {level}: {level_info['group_count']} groups, avg size: {format_size(int(avg_group_size))}")

## Group-Level Size Summary

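This section breaks sizes down by the major groups in the EOPF hierarchy (`measurements`, `quality`, `conditions`) and, within the reflectance measurements, by resolution group.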
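
The cells in this notebook rely on `format_size` from `eopf_analysis_utils` and on a percentage-of-total calculation. The implementation of `format_size` is not shown here, so the sketch below is only an assumption about what such a helper might look like: `format_size_sketch` and `percent_of_total` are hypothetical names, and the 1024-based units and two-decimal formatting are illustrative choices rather than the utility's confirmed behavior.

```python
def format_size_sketch(size_bytes):
    """Render a byte count using binary (1024-based) units."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(size_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit
        if value < 1024 or unit == units[-1]:
            return f"{int(value)} B" if unit == "B" else f"{value:.2f} {unit}"
        value /= 1024


def percent_of_total(part_bytes, total_bytes):
    """Share of the total as a percentage; 0.0 when the total is zero."""
    return 100.0 * part_bytes / total_bytes if total_bytes else 0.0


print(format_size_sketch(512))
print(format_size_sketch(1536))
print(f"{percent_of_total(250, 1000):.1f}%")
```

Guarding against a zero total keeps the percentage calculation from raising `ZeroDivisionError` on an empty or unreadable store.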

In [None]:
# Analyze major group categories
print("\n🔍 MAJOR GROUP ANALYSIS:")

major_groups = ['measurements', 'quality', 'conditions']
for group_name in major_groups:
    if group_name in size_analysis['group_sizes']:
        group_info = size_analysis['group_sizes'][group_name]
        print(f"\n📁 {group_name.upper()} GROUP:")
        print(f"  Total size: {group_info['size_formatted']}")
        print(f"  Arrays: {group_info['array_count']}")
        print(f"  Subgroups: {group_info['subgroup_count']}")
        
        # Show percentage of total dataset (0.0 if the reported total is zero)
        total_bytes = size_analysis['total_size_bytes']
        percentage = (group_info['size_bytes'] / total_bytes) * 100 if total_bytes else 0.0
        print(f"  Percentage of total: {percentage:.1f}%")

# Show resolution group breakdown for measurements
print("\n📊 RESOLUTION GROUP BREAKDOWN:")
resolution_groups = {}
for group_path, group_info in size_analysis['group_sizes'].items():
    if 'measurements/reflectance/' in group_path:
        parts = group_path.split('/')
        if len(parts) >= 3 and parts[2].startswith('r') and parts[2].endswith('m'):
            res_group = parts[2]
            if res_group not in resolution_groups:
                resolution_groups[res_group] = {
                    'size_bytes': 0,
                    'array_count': 0
                }
            resolution_groups[res_group]['size_bytes'] += group_info['size_bytes']
            resolution_groups[res_group]['array_count'] += group_info['array_count']

from eopf_analysis_utils import format_size  # imported once, outside the loop

for res_group, info in sorted(resolution_groups.items()):
    print(f"  {res_group}: {format_size(info['size_bytes'])} ({info['array_count']} arrays)")

## Summary

This analysis covers:

1. **Hierarchy Structure Sizes**: Complete breakdown of data sizes at each level with group-level sums
2. **Data Distribution**: Understanding of how data is distributed across the hierarchy
3. **Storage Requirements**: Detailed size information for capacity planning
4. **Resolution Analysis**: Breakdown by spatial resolution groups (r10m, r20m, r60m)
5. **Group Categories**: Analysis of measurements, quality, and conditions groups

The hierarchy size analysis is particularly useful for:
- Understanding data volume distribution across groups
- Identifying the largest data components
- Planning storage and bandwidth requirements
- Optimizing data access patterns
- Comparing sizes across different resolution levels
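
When comparing sizes across resolution levels, it can help to know what ratio to expect from geometry alone. The sketch below extracts the resolution from a group path of the form used above (`r10m`, `r20m`, `r60m`) and computes how many fine pixels cover one coarse pixel. The helper names `resolution_metres` and `pixel_ratio` are hypothetical, and the ratio only predicts relative uncompressed size under the assumption that the compared arrays cover the same footprint with the same dtype; compressed sizes and differing band counts will shift the observed numbers.

```python
import re

# Matches a path component such as "r10m", "r20m", or "r60m"
RES_PATTERN = re.compile(r"^r(\d+)m$")


def resolution_metres(path):
    """Return the resolution in metres from the first rNNm path component, or None."""
    for part in path.strip("/").split("/"):
        match = RES_PATTERN.match(part)
        if match:
            return int(match.group(1))
    return None


def pixel_ratio(fine_m, coarse_m):
    """Number of fine-resolution pixels that cover one coarse-resolution pixel."""
    return (coarse_m / fine_m) ** 2


print(resolution_metres("measurements/reflectance/r20m/b05"))
print(pixel_ratio(10, 20))
print(pixel_ratio(10, 60))
```

For a single band over the same area, a 10 m array therefore holds 4 times as many pixels as a 20 m array and 36 times as many as a 60 m array, which gives a baseline for judging the per-resolution totals printed earlier.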