# CaFFe Dataset (Calving Fronts and where to Find them)

## Intro

CaFFe (Calving Fronts and where to Find them; [Gourmelon et al. 2022](https://doi.pangaea.de/10.1594/PANGAEA.940950)) is a dataset for glacier calving front detection and land cover classification in polar regions. It contains SAR (Synthetic Aperture Radar) imagery with corresponding zone masks that label each pixel as glacier, rock, ocean/ice melange, or undefined. Such data supports monitoring of glacier retreat and calving dynamics in climate change research.

## Dataset Characteristics

- **Modalities**: 
  - Sentinel-1 SAR imagery (single polarization)
- **Spatial Resolution**: Variable (typically 10-20 m ground sample distance)
- **Temporal Resolution**: Single acquisition per location
- **Spectral Bands**: 
  - SAR: Single band grayscale intensity
- **Image Dimensions**: 512x512 pixels per patch
- **Labels**: 4 land cover classes (zone segmentation masks)
  - Class 0: N/A / Undefined
  - Class 1: Rock
  - Class 2: Glacier
  - Class 3: Ocean/Ice melange
- **Geographic Distribution**: Arctic and Antarctic regions (multiple glacier sites)
- **Temporal Coverage**: Various acquisition dates across multiple years

## Dataset Setup and Initialization

```python
from pathlib import Path
from geobench_v2.datamodules import GeoBenchCaffeDataModule

# Setup paths
PROJECT_ROOT = Path("../../")

# Initialize datamodule
datamodule = GeoBenchCaffeDataModule(
    img_size=512,
    batch_size=16,
    num_workers=4,
    root=PROJECT_ROOT / "data" / "caffe",
    download=True,
)
datamodule.setup("fit")
datamodule.setup("test")

print("CaFFe datamodule initialized successfully!")
print(f"Training samples: {len(datamodule.train_dataset)}")
print(f"Validation samples: {len(datamodule.val_dataset)}")
print(f"Test samples: {len(datamodule.test_dataset)}")
```

## Geographic Distribution Visualization

```python
geo_fig = datamodule.visualize_geospatial_distribution()
```

## Sample Data Visualization

The dataset provides SAR imagery with 4-class zone segmentation masks for glacier calving front detection and polar land cover classification:

```python
fig, batch = datamodule.visualize_batch()
```

## GeoBenchV2 Processing Pipeline

### Preprocessing Steps

The original dataset was released as a set of single-channel grayscale PNG files of the underlying SAR imagery. It also includes a metadata CSV file from which geographic coordinates can be inferred.

1. **Patch Generation**:
   - Generated 512x512 pixel patches from the original PNG image files
   - Applied a different patch overlap per split (training: 0%, validation/test: 25%)
   - Calculated patch coordinates based on image bounds and metadata

2. **Split Generation**:
   - Used original train/test splits from source dataset
   - Created validation split by randomly sampling 10% from training data (random_state=1)
   - Applied geographic filtering: removed Southern Hemisphere samples from test set

3. **Dataset Subsampling**:
   - The final version consists of:
     - 4,000 training samples
     - 1,000 validation samples
     - 2,000 test samples
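
The patch tiling and validation split described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the GeoBenchV2 implementation: the helper names `patch_origins` and `train_val_split` are hypothetical, the handling of image borders (adding one last patch flush with the edge) is an assumption, and the split uses Python's `random` module, so it will not reproduce the exact sample selection obtained with `random_state=1` in the original pipeline.

```python
import random


def patch_origins(length: int, patch: int = 512, overlap: float = 0.0) -> list[int]:
    """Start offsets of patches along one image axis (hypothetical helper)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    if length <= patch:
        # Image is smaller than one patch: a single patch starting at 0.
        return [0]
    # Stride shrinks with overlap, e.g. 512 px at 25% overlap gives 384 px.
    stride = max(1, int(patch * (1.0 - overlap)))
    origins = list(range(0, length - patch + 1, stride))
    # Assumption: cover the remaining border with one patch flush with the edge.
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins


def train_val_split(ids, val_fraction: float = 0.1, seed: int = 1):
    """Randomly hold out a fraction of the training ids for validation."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Training patches use 0% overlap, validation/test patches use 25%.
train_offsets = patch_origins(1024, patch=512, overlap=0.0)
eval_offsets = patch_origins(1024, patch=512, overlap=0.25)
train_ids, val_ids = train_val_split(range(100))
```

Applying `patch_origins` to both the row and column axes gives the top-left corners of all patches for one image. A larger overlap for validation and test yields more patches per image and denser coverage near patch borders.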

### Label Processing

- **Zone Mask Remapping**: Converted original mask values to sequential class indices:
  - 0 → 0 (N/A)
  - 64 → 1 (Rock)
  - 127 → 2 (Glacier)
  - 254 → 3 (Ocean/Ice melange)
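
The remapping above can be expressed as a lookup table. The following is a minimal sketch assuming the zone mask has been loaded as a `uint8` NumPy array. The function name `remap_zone_mask` and the choice to raise an error on unexpected raw values are assumptions made for illustration, not part of the GeoBenchV2 code.

```python
import numpy as np

# Raw PNG gray values -> sequential class indices, as listed above.
RAW_TO_CLASS = {0: 0, 64: 1, 127: 2, 254: 3}
UNMAPPED = 255  # sentinel for raw values outside the table (assumption)


def remap_zone_mask(mask: np.ndarray) -> np.ndarray:
    """Map raw zone-mask values to class indices 0..3 (hypothetical helper)."""
    lut = np.full(256, UNMAPPED, dtype=np.uint8)
    for raw, cls in RAW_TO_CLASS.items():
        lut[raw] = cls
    # Integer-array indexing applies the lookup table to every pixel at once.
    remapped = lut[mask.astype(np.uint8)]
    if (remapped == UNMAPPED).any():
        raise ValueError("mask contains values other than 0, 64, 127, 254")
    return remapped


raw_mask = np.array([[0, 64], [127, 254]], dtype=np.uint8)
class_mask = remap_zone_mask(raw_mask)
```

A lookup table avoids one boolean comparison per class and makes unexpected mask values easy to detect.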

## References

1. Gourmelon, Nora; Seehaus, Thorsten; Braun, Matthias Holger; Maier, Andreas; Christlein, Vincent (2022): CaFFe (CAlving Fronts and where to Find thEm: a benchmark dataset and methodology for automatic glacier calving front extraction from sar imagery) [dataset]. PANGAEA, https://doi.org/10.1594/PANGAEA.940950