## Icechunk Version Control for Land Cover Data

### Overview

This notebook demonstrates **version control capabilities for geospatial raster data** using Icechunk, a new cloud-native storage format. We'll show how slowly changing, time-varying raster data (specifically NLCD land cover data) can be managed with full version control, enabling reproducible research and data lineage tracking.

#### What is Icechunk?

**Icechunk** is a cloud-native storage format that brings **Git-like version control** to large scientific datasets. It is conceptually similar to Apache Iceberg, but built for data-cube/tensor data rather than tables. Unlike traditional file systems, where a change overwrites the previous version, Icechunk:

- **Creates snapshots** of your data at each change
- **Enables time travel** to access any previous version
- **Supports branching and merging** for collaborative workflows
- **Tracks data lineage** with commit messages and metadata
- **Uses virtual references** to avoid data duplication: existing NetCDF (.nc) files or Cloud-Optimized GeoTIFFs (COGs) can be referenced in place without rewriting the data

### Dataset: National Land Cover Database (NLCD)

#### Source: https://www.mrlc.gov/data

The NLCD provides land cover classifications for the Continental United States (CONUS).

#### Land Cover Classes

The NLCD uses standardized codes for different land cover types:
- **11**: Open Water
- **12**: Perennial Ice/Snow
- **21**: Developed, Open Space
- **22**: Developed, Low Intensity
- **23**: Developed, Medium Intensity
- **24**: Developed, High Intensity
- **31**: Barren Land (Rock/Sand/Clay)
- **41**: Deciduous Forest
- **42**: Evergreen Forest
- **43**: Mixed Forest
- **52**: Shrub/Scrub
- **71**: Grassland/Herbaceous
- **81**: Pasture/Hay
- **82**: Cultivated Crops
- **90**: Woody Wetlands
- **95**: Emergent Herbaceous Wetlands
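
When working with the raster values directly, it can help to keep the class list above as a lookup table. The following is a minimal sketch; the names `NLCD_CLASSES`, `describe_nlcd`, and `is_developed` are illustrative helpers, not part of Icechunk or icefabric.

```python
# Lookup table transcribed from the class list above (code -> class name).
NLCD_CLASSES = {
    11: "Open Water",
    12: "Perennial Ice/Snow",
    21: "Developed, Open Space",
    22: "Developed, Low Intensity",
    23: "Developed, Medium Intensity",
    24: "Developed, High Intensity",
    31: "Barren Land (Rock/Sand/Clay)",
    41: "Deciduous Forest",
    42: "Evergreen Forest",
    43: "Mixed Forest",
    52: "Shrub/Scrub",
    71: "Grassland/Herbaceous",
    81: "Pasture/Hay",
    82: "Cultivated Crops",
    90: "Woody Wetlands",
    95: "Emergent Herbaceous Wetlands",
}


def describe_nlcd(code: int) -> str:
    """Return the class name for an NLCD code, or a placeholder for unlisted codes."""
    return NLCD_CLASSES.get(int(code), f"Unknown ({code})")


def is_developed(code: int) -> bool:
    """True for the four 'Developed' classes (codes 21 through 24)."""
    return 21 <= int(code) <= 24


print(describe_nlcd(41))
print(is_developed(23))
```

A table like this could be used to label a plot legend or to group pixel counts by class.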

In [None]:
import warnings
from pathlib import Path

import icechunk as ic
import matplotlib.pyplot as plt
import xarray as xr

from icefabric.helpers import load_creds

warnings.filterwarnings("ignore")

# Load credentials from the .env file (called without arguments, so load_creds uses its default location)
load_creds()

### Opening the Icechunk Repository

Unlike traditional file formats (GeoTIFF, NetCDF), Icechunk stores data in a **repository structure** similar to Git. Each repository contains:

- **Snapshots**: Immutable versions of your data
- **Branches**: Parallel development lines (like Git branches)
- **Virtual references**: Pointers to external data files (avoiding duplication)
- **Metadata**: Rich attribution and processing history

#### Virtual Chunk Architecture

Our NLCD data uses **virtual references** - instead of copying large GeoTIFF files into Icechunk, we store lightweight references pointing to the original files. This provides:

- **Fast ingestion** (no data copying)
- **Storage efficiency** (references vs. full copies)
- **Source preservation** (original files remain unchanged)
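
To make the idea concrete, here is a toy sketch of what a virtual reference amounts to: a small record of where a chunk's bytes live, resolved only when the chunk is read. This is a conceptual model using only the standard library. `ChunkRef`, `read_chunk`, and the file layout are hypothetical and do not reflect Icechunk's actual manifest format.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical stand-in for a virtual reference: where a chunk's bytes live."""

    location: str  # path (or URL) of the original file
    offset: int    # byte offset of the chunk within that file
    length: int    # number of bytes in the chunk


def read_chunk(ref: ChunkRef) -> bytes:
    """Fetch the referenced bytes from the original file on demand."""
    with open(ref.location, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# Stand-in for a large source file: a 16-byte header followed by two 4-byte chunks.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"H" * 16 + b"AAAA" + b"BBBB")
    size_before = source.stat().st_size

    # "Ingestion" records only where each chunk lives; no bytes are copied.
    manifest = {
        (0, 0): ChunkRef(str(source), offset=16, length=4),
        (0, 1): ChunkRef(str(source), offset=20, length=4),
    }

    # Reading a chunk goes back to the original file.
    second_chunk = read_chunk(manifest[(0, 1)])
    source_unchanged = source.stat().st_size == size_before

print(second_chunk, source_unchanged)
```

The manifest holds a few integers per chunk regardless of how large the chunk is. That is why ingestion is fast, and why the original files must stay in place for reads to succeed.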

In [None]:
# NOTE: This example assumes the Icechunk store already exists locally at store_path (data/land_cover)
# and that the source TIFs are in file_location (data/land_cover_tifs)
file_location = Path("data/land_cover_tifs").resolve()
store_path = Path("data/land_cover").resolve()

storage = ic.local_filesystem_storage(str(store_path))
repo = ic.Repository.open(
    storage=storage,
    authorize_virtual_chunk_access=ic.containers_credentials({f"file://{file_location}": None}),
)

### Repository History: Data Lineage Tracking

One of Icechunk's key features is **automatic lineage tracking**. Every change to the dataset creates a new snapshot with:

- **Unique identifier** (snapshot ID)
- **Timestamp** of the change
- **Commit message** describing what changed
- **Parent relationships** showing data evolution

This provides complete **audit trails** for scientific reproducibility.
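
The four pieces of lineage information above can be pictured as a linked chain of records. The sketch below is a toy in-memory model, not Icechunk's implementation. `Snapshot`, `ToyRepo`, and the commit messages are hypothetical names chosen for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Iterator, Optional


@dataclass(frozen=True)
class Snapshot:
    """Toy snapshot record: ID, timestamp, message, and a link to its parent."""

    message: str
    attrs: dict
    parent: Optional["Snapshot"] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    written_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ToyRepo:
    """Hypothetical in-memory model of a single-branch repository."""

    def __init__(self) -> None:
        self.head = Snapshot(message="Initial snapshot", attrs={})

    def commit(self, attrs: dict, message: str) -> str:
        # A commit never mutates an existing snapshot; it adds a child of the head.
        self.head = Snapshot(message=message, attrs=dict(attrs), parent=self.head)
        return self.head.id

    def ancestry(self) -> Iterator[Snapshot]:
        # Follow parent links from the newest snapshot back to the first one.
        node = self.head
        while node is not None:
            yield node
            node = node.parent


repo_model = ToyRepo()
repo_model.commit({"title": "NLCD"}, "Add land cover data")
repo_model.commit({"title": "NLCD", "sample_attr": "sample_attr"}, "Added a sample attribute")

for snap in repo_model.ancestry():
    print(f"{snap.id[:8]}  {snap.written_at:%Y-%m-%d %H:%M:%S}  {snap.message}")
```

Because every snapshot keeps a link to its parent and is never modified, walking the chain from the branch head yields the full history, newest first.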


In [None]:
# Print repo ancestry
for ancestor in repo.ancestry(branch="main"):
    print(f"Snapshot ID:\t{ancestor.id}")
    print(f"Timestamp:\t{ancestor.written_at}")
    print(f"Message:\t{ancestor.message}\n")

### Accessing Current Data

The data appears as a standard Xarray Dataset, but with version control underneath.

In [None]:
session = repo.readonly_session(branch="main")
ds = xr.open_zarr(session.store, consolidated=False)
ds

In [None]:
# Set up plot for 1990 land cover
ds["5"].sel(year=1990).plot(x="X5", y="Y5")

# Invert the y-axis to show the CONUS region correctly
plt.gca().invert_yaxis()

# Add labels and show the plot
plt.xlabel("LON")
plt.ylabel("LAT")
plt.title("1990 CONUS Land Cover")
plt.show()

### Demonstrating Version Control: Adding Metadata

Now we'll demonstrate Icechunk's version control by **adding metadata** to our dataset.

#### The Version Control Process

1. **Create a writable session** (like checking out code for editing)
2. **Modify the dataset** (add/update attributes, data, etc.)
3. **Commit changes** with descriptive message
4. **New snapshot created** automatically

**Important**: The original data remains **completely unchanged** and accessible.


In [None]:
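# Open a writable session on main; nothing is persisted until session.commit()
session = repo.writable_session("main")

# Work on a copy so the dataset loaded earlier keeps its original attributes
ds2 = ds.copy()
ds2.attrs["sample_attr"] = "sample_attr"
ds2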

In [None]:
session.store.sync_clear()  # Clears the store, but preserves snapshots and references to the data

In [None]:
# NOTE This may take 8-10 minutes
ds2.virtualize.to_icechunk(session.store)
print(session.commit("Added a sample attribute"))

### Verifying Version History

Let's examine the repository history again. Notice how we now have **two snapshots**:

1. **Original dataset** (initial commit)
2. **Dataset with metadata** (our recent addition)

This demonstrates **non-destructive updates** - both versions coexist and remain accessible.

In [None]:
# Print repo ancestry
for ancestor in repo.ancestry(branch="main"):
    print(f"Snapshot ID:\t{ancestor.id}")
    print(f"Timestamp:\t{ancestor.written_at}")
    print(f"Message:\t{ancestor.message}\n")

### Time Travel: Accessing Previous Versions

One of Icechunk's most powerful features is **time travel** - the ability to access any previous version of your data using its snapshot ID.

#### Use Cases for Time Travel:

- **Reproducing analyses** from specific points in time
- **Debugging** when something goes wrong
- **Comparing versions** to understand changes
- **Rolling back** to previous states
- **Auditing** data processing workflows

Below, we access the **original version** (before we added metadata):


In [None]:
snapshot_id = list(repo.ancestry(branch="main"))[1].id
print(f"Snapshot ID:\t{snapshot_id}")

session = repo.readonly_session(snapshot_id=snapshot_id)
_ds = xr.open_zarr(session.store, consolidated=False)
_ds

Notice how the **original version lacks the `sample_attr`** we added. This confirms that earlier versions of the data are preserved and remain readable.

In [None]:
snapshot_id = list(repo.ancestry(branch="main"))[0].id  # Latest
print(f"Snapshot ID:\t{snapshot_id}")

session = repo.readonly_session(snapshot_id=snapshot_id)
latest_ds = xr.open_zarr(session.store, consolidated=False)
latest_ds
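
To compare the two versions programmatically rather than by eye, their attribute dictionaries can be diffed. This is a minimal sketch: `diff_attrs` is a hypothetical helper, and the plain dicts stand in for `_ds.attrs` and `latest_ds.attrs` (the `title` entry is a placeholder, not an attribute known to exist in the dataset).

```python
def diff_attrs(old: dict, new: dict) -> dict:
    """Summarize how one attribute mapping differs from another."""
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "changed": {
            k: (old[k], new[k])
            for k in old.keys() & new.keys()
            if old[k] != new[k]
        },
    }


# Plain dicts standing in for `_ds.attrs` and `latest_ds.attrs`.
original_attrs = {"title": "NLCD"}
latest_attrs = {"title": "NLCD", "sample_attr": "sample_attr"}

print(diff_attrs(original_attrs, latest_attrs))
# → {'added': {'sample_attr': 'sample_attr'}, 'removed': {}, 'changed': {}}
```

In the notebook, the call would be `diff_attrs(_ds.attrs, latest_ds.attrs)`. The `!=` comparison assumes scalar or string attribute values; array-valued attributes would need an element-wise comparison instead.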

This demonstrates how Icechunk enables robust version control for geospatial data, supporting data governance, reproducibility, and collaborative research workflows in line with the FAIR principles.