## Apache Iceberg Version Control for Hydrofabric and Streamflow Data

### Overview

This notebook demonstrates **enterprise-grade version control capabilities** for hydrological datasets using Apache Iceberg. It walks through how hydrofabric layers and streamflow observations can be snapshotted, modified, queried at earlier versions, and restored.

#### What is Apache Iceberg?

**Apache Iceberg** is a high-performance table format designed for large-scale data lakes. Unlike traditional file formats, Iceberg provides:

- **Automatic snapshots** of every data change
- **Time travel queries** to access historical versions
- **ACID transactions** for data consistency
- **Schema evolution** without breaking existing queries
- **Query performance** through metadata-based partition and file pruning
- **Complete audit trails** for regulatory compliance

In [None]:
import pyarrow as pa
from pyiceberg.catalog import load_catalog

from icefabric.helpers import load_creds, load_pyiceberg_config

# Load credentials from the .env file
load_creds()

# Loading the local pyiceberg config settings
pyiceberg_config = load_pyiceberg_config()

In [None]:
# Loading SQL Catalog
# This catalog can be downloaded by running the following command with AWS creds:
# python tools/iceberg/export_catalog.py --namespace conus_hf
catalog = load_catalog(
    name="sql",
    type=pyiceberg_config["catalog"]["sql"]["type"],
    uri=pyiceberg_config["catalog"]["sql"]["uri"],
    warehouse=pyiceberg_config["catalog"]["sql"]["warehouse"],
)

# # Loading Glue Catalog
# catalog = load_catalog("glue", **{
#     "type": "glue",
#     "glue.region": "us-east-1"
# })

### Exploring the Data Catalog

Apache Iceberg organizes data into **catalogs**, **namespaces**, and **tables** - similar to databases, schemas, and tables in traditional systems. However, each table maintains complete version history automatically.
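
As a minimal sketch of that hierarchy, the helper below walks a catalog and maps each namespace to its tables. It relies only on `list_namespaces()` and `list_tables()`, which pyiceberg catalogs provide. The `describe_catalog` name and the stand-in catalog are assumptions made for illustration, so the example runs without a warehouse.

```python
from types import SimpleNamespace


def describe_catalog(catalog):
    """Map each namespace name to the sorted table names it contains."""
    layout = {}
    for namespace in catalog.list_namespaces():
        # pyiceberg returns identifiers as tuples, e.g. ("conus_hf",)
        tables = catalog.list_tables(namespace)
        layout[".".join(namespace)] = sorted(t[-1] for t in tables)
    return layout


# Stand-in catalog (hypothetical) mimicking the two methods used above.
_demo_tables = {
    ("conus_hf",): [("conus_hf", "hydrolocations")],
    ("streamflow_observations",): [("streamflow_observations", "usgs_hourly")],
}
demo_catalog = SimpleNamespace(
    list_namespaces=lambda: list(_demo_tables),
    list_tables=lambda namespace: _demo_tables[namespace],
)
demo_layout = describe_catalog(demo_catalog)
print(demo_layout)
```

Passing the `catalog` object loaded earlier in place of `demo_catalog` should give the same kind of mapping for the real warehouse.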

#### Hydrofabric Tables

The `conus_hf` namespace contains the hydrofabric layers associated with the CONUS-based geopackage.


In [None]:
catalog.list_tables("conus_hf")

Let's examine the **hydrolocations** table and make some versioned additions. Below we'll see both the snapshots of the hydrolocations table and the geopackage layer itself, exported to a pandas DataFrame.

In [None]:
table = catalog.load_table("conus_hf.hydrolocations")
table.inspect.snapshots()

In [None]:
df = table.scan().to_pandas()
df.tail()

### Snapshot Analysis: Understanding Version History

Each snapshot in Iceberg contains:
- **Unique identifier** (snapshot_id)
- **Summary metadata** describing the operation
- **Timestamp** of the change
- **File manifests** pointing to data files
- **Schema information** at that point in time

This enables **complete traceability** of how data evolved over time.
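
One way to read that history is as a timeline. The sketch below turns each snapshot into a `(snapshot_id, parent_snapshot_id, commit time)` tuple using the `snapshot_id`, `parent_snapshot_id`, and `timestamp_ms` fields of pyiceberg snapshots. The `snapshot_timeline` helper and the stand-in table with made-up IDs are assumptions for illustration.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def snapshot_timeline(table):
    """Return (snapshot_id, parent_snapshot_id, UTC commit time), oldest first."""
    rows = []
    for snap in table.snapshots():
        # timestamp_ms is milliseconds since the Unix epoch
        committed = datetime.fromtimestamp(snap.timestamp_ms / 1000, tz=timezone.utc)
        rows.append((snap.snapshot_id, snap.parent_snapshot_id, committed))
    return sorted(rows, key=lambda row: row[2])


# Stand-in table (hypothetical IDs and timestamps) exposing snapshots()
demo_table = SimpleNamespace(
    snapshots=lambda: [
        SimpleNamespace(snapshot_id=2, parent_snapshot_id=1, timestamp_ms=1_700_000_060_000),
        SimpleNamespace(snapshot_id=1, parent_snapshot_id=None, timestamp_ms=1_700_000_000_000),
    ]
)
demo_timeline = snapshot_timeline(demo_table)
for snapshot_id, parent_id, committed in demo_timeline:
    print(snapshot_id, parent_id, committed.isoformat())
```

A snapshot whose parent is `None` is the first commit. Following the parent IDs reconstructs the order in which changes were applied.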

In [None]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")
snapshot_id = table.metadata.snapshots[0].snapshot_id

### Demonstrating Version Control: Adding New Monitoring Location

Now we'll demonstrate Iceberg's version control by adding a **new hydrologic monitoring location**.

#### The Version Control Process:

1. **Modify data** (add new monitoring location)
2. **Overwrite table** (creates new snapshot automatically)
3. **Preserve history** (all previous versions remain accessible)
4. **Track changes** (complete audit trail maintained)


In [None]:
new_df = df.copy()
new_df.loc[len(new_df)] = {
    "poi_id": 99999,
    "id": "wb-0",
    "nex_id": "tnx-0",
    "hf_id": 999999,
    "hl_link": "Testing",
    "hl_reference": "testing",
    "hl_uri": "testing",
    "hl_source": "testing",
    "hl_x": -1.952088e06,
    "hl_y": 1.283884e06,
    "vpu_id": 18,
}
new_df.tail()

### Writing Changes: Automatic Snapshot Creation

When we write changes to an Iceberg table:

1. **Schema validation** ensures data compatibility
2. **New snapshot created** automatically with unique ID
3. **Previous snapshots preserved** for time travel
4. **Metadata updated** with operation summary
5. **ACID guarantees** ensure consistency

This happens **atomically**: either the entire write is committed as a new snapshot, or nothing changes. Readers never see a partially written state.
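
This notebook rewrites the whole table with `overwrite`. For adding rows only, pyiceberg also provides `Table.append`, which commits just the new rows as a single snapshot. The sketch below is an alternative under that assumption, not the workflow used here. `append_and_report` and the stand-in table class are hypothetical names for illustration.

```python
from types import SimpleNamespace


def append_and_report(table, arrow_rows):
    """Append rows in one commit and return (previous_id, new_id)."""
    before = table.current_snapshot()
    previous_id = before.snapshot_id if before is not None else None
    table.append(arrow_rows)  # one atomic commit produces one new snapshot
    return previous_id, table.current_snapshot().snapshot_id


class _AppendDemoTable:
    """Stand-in (hypothetical) that only tracks snapshot IDs."""

    def __init__(self):
        self._ids = [101]

    def current_snapshot(self):
        return SimpleNamespace(snapshot_id=self._ids[-1])

    def append(self, rows):
        self._ids.append(self._ids[-1] + 1)


demo_ids = append_and_report(_AppendDemoTable(), arrow_rows=None)
print(demo_ids)
```

With a real table, `arrow_rows` would be a `pyarrow.Table` holding only the new records, with a schema matching the Iceberg table.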


In [None]:
_df = pa.Table.from_pandas(new_df, preserve_index=False)
with table.update_schema() as update_schema:
    update_schema.union_by_name(_df.schema)
table.overwrite(_df)
table.scan().to_pandas().tail()

### Verifying New Snapshot Creation

Let's examine the updated snapshot history. Notice how we now have **multiple snapshots**:

1. **Original data** (initial snapshot)
2. **Data with new location** (our recent addition)

Each snapshot can be **read independently** of the others, which makes it usable for separate analyses or rollback scenarios.


In [None]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")

Iceberg's **time travel capability** allows querying any previous snapshot using its ID.
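
Often the question is "what did the table look like on a given date?" rather than "what is in snapshot X?". As a sketch, the helper below picks the newest snapshot committed at or before a timestamp, and its ID can then be passed to `table.scan(snapshot_id=...)`. The `snapshot_id_as_of` helper and the stand-in table are assumptions for illustration.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def snapshot_id_as_of(table, when):
    """Return the ID of the newest snapshot committed at or before `when`.

    `when` should be timezone-aware; naive datetimes are treated as local time.
    """
    cutoff_ms = int(when.timestamp() * 1000)
    candidates = [s for s in table.snapshots() if s.timestamp_ms <= cutoff_ms]
    if not candidates:
        raise ValueError(f"no snapshot exists at or before {when.isoformat()}")
    return max(candidates, key=lambda s: s.timestamp_ms).snapshot_id


def _ms(year, month, day):
    """Milliseconds since the epoch for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)


# Stand-in table (hypothetical IDs and dates)
demo_table = SimpleNamespace(
    snapshots=lambda: [
        SimpleNamespace(snapshot_id=1, timestamp_ms=_ms(2024, 1, 1)),
        SimpleNamespace(snapshot_id=2, timestamp_ms=_ms(2024, 6, 1)),
    ]
)
demo_id = snapshot_id_as_of(demo_table, datetime(2024, 3, 1, tzinfo=timezone.utc))
print(demo_id)
```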


In [None]:
snapshot_id = table.metadata.snapshots[0].snapshot_id
snapshot_id_latest = table.metadata.snapshots[-1].snapshot_id
table.scan(snapshot_id=snapshot_id).to_pandas().tail()

In [None]:
table.scan(snapshot_id=snapshot_id_latest).to_pandas().tail()

### Comparing Versions: Before and After

Notice the difference between snapshots:
- **Original snapshot**: Contains original monitoring locations
- **Latest snapshot**: Includes our new test location (poi_id: 99999)

This demonstrates **non-destructive updates** - both versions coexist and remain queryable.
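
To make the comparison explicit rather than visual, the two versions can be diffed on a key column. The sketch below works on plain lists of row dictionaries, which `DataFrame.to_dict("records")` would produce from each snapshot's DataFrame. The `diff_by_key` helper and the sample rows are made up for illustration.

```python
def diff_by_key(old_rows, new_rows, key):
    """Return (added, removed) key values between two lists of row dicts."""
    old_keys = {row[key] for row in old_rows}
    new_keys = {row[key] for row in new_rows}
    return sorted(new_keys - old_keys), sorted(old_keys - new_keys)


# Made-up rows standing in for the records of each snapshot
original = [{"poi_id": 1, "vpu_id": 18}, {"poi_id": 2, "vpu_id": 18}]
latest = original + [{"poi_id": 99999, "vpu_id": 18}]

added, removed = diff_by_key(original, latest, key="poi_id")
print(added, removed)  # [99999] []
```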


### Streamflow Observations: Time Series Version Control

Now let's examine **streamflow observations**: time series data that requires different version control considerations.

In [None]:
table = catalog.load_table("streamflow_observations.usgs_hourly")
table.inspect.snapshots()

In [None]:
df = table.scan().to_pandas().set_index("time")
df.tail()

In [None]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")
snapshot_id = table.metadata.snapshots[0].snapshot_id

### Adding Time Series Data: Simulating Real-Time Updates

We'll now add a new streamflow observation to demonstrate version control for time series data.

The process maintains **historical context** while adding new information.

In [None]:
new_streamflow_df = df.copy()
# Label the new row from the streamflow frame itself, not the hydrofabric frame
new_streamflow_df.loc[len(new_streamflow_df)] = 0.1
new_streamflow_df.tail()

In [None]:
_df = pa.Table.from_pandas(new_streamflow_df)
with table.update_schema() as update_schema:
    update_schema.union_by_name(_df.schema)
table.overwrite(_df)
table.scan().to_pandas().tail()

In [None]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")

### Time Travel with Time Series Data

Comparing different snapshots of time series data reveals:

#### Original Snapshot (Baseline Data):
- Contains original observational record
- Represents specific quality control state
- Suitable for historical analysis

#### Latest Snapshot (Updated Data):
- Includes new observations
- Represents current operational state
- Suitable for real-time applications

In [None]:
snapshot_id = table.metadata.snapshots[0].snapshot_id
snapshot_id_latest = table.metadata.snapshots[-1].snapshot_id
table.scan(snapshot_id=snapshot_id).to_pandas().tail().set_index("time")

In [None]:
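# Restore the original state: read the first snapshot and overwrite the table with it.
# The overwrite is itself committed as a new snapshot, so the history is preserved.
df = table.scan(snapshot_id=snapshot_id).to_pandas()
_df = pa.Table.from_pandas(df)
with table.update_schema() as update_schema:
    update_schema.union_by_name(_df.schema)
table.overwrite(_df)
table.scan().to_pandas().tail()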

In [None]:
table.scan(snapshot_id=snapshot_id_latest).to_pandas().tail().set_index("time")

### Demonstration Cleanup: Reverting Changes

To maintain data integrity, we'll now **revert our test changes** by removing the added records. This demonstrates:

- **Controlled rollback** procedures
- **Data management** best practices  
- **Cleanup workflows** for testing environments

**Important**: Even these cleanup operations create new snapshots, maintaining complete audit trails of all activities.
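
The rollback pattern used in this notebook can be sketched as one helper: read an earlier snapshot and overwrite the table with its rows. It uses `scan(snapshot_id=...)`, `to_arrow()`, `overwrite()`, and `current_snapshot()` from pyiceberg. The `restore_snapshot` name and the stand-in table class are hypothetical and exist only so the example runs on its own.

```python
from types import SimpleNamespace


def restore_snapshot(table, snapshot_id):
    """Rewrite the table with the rows of an earlier snapshot; return the new ID."""
    old_rows = table.scan(snapshot_id=snapshot_id).to_arrow()
    table.overwrite(old_rows)  # earlier snapshots are kept; this adds a new one
    return table.current_snapshot().snapshot_id


class _RestoreDemoTable:
    """Stand-in (hypothetical) storing each snapshot's rows by ID."""

    def __init__(self):
        self.data = {1: ["a"], 2: ["a", "b"]}
        self.current = 2

    def scan(self, snapshot_id=None):
        rows = self.data[snapshot_id or self.current]
        return SimpleNamespace(to_arrow=lambda: list(rows))

    def overwrite(self, rows):
        self.current = max(self.data) + 1
        self.data[self.current] = rows

    def current_snapshot(self):
        return SimpleNamespace(snapshot_id=self.current)


demo = _RestoreDemoTable()
demo_new_id = restore_snapshot(demo, snapshot_id=1)
print(demo_new_id, demo.data[demo_new_id])
```

Because the restore is a normal write, the snapshot containing the test record remains in the history and stays queryable by its ID.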

In [None]:
# Cleaning up hydrofabric changes
table = catalog.load_table("conus_hf.hydrolocations")
new_df = new_df.drop(new_df.index[-1])
_df = pa.Table.from_pandas(new_df, preserve_index=False)
with table.update_schema() as update_schema:
    update_schema.union_by_name(_df.schema)
table.overwrite(_df)
catalog.load_table("conus_hf.hydrolocations").scan().to_pandas().tail()

In [None]:
# Cleaning up Streamflow Observation changes
table = catalog.load_table("streamflow_observations.usgs_hourly")
new_streamflow_df = new_streamflow_df.drop(new_streamflow_df.index[-1])
_df = pa.Table.from_pandas(new_streamflow_df)
with table.update_schema() as update_schema:
    update_schema.union_by_name(_df.schema)
table.overwrite(_df)
catalog.load_table("streamflow_observations.usgs_hourly").scan().to_pandas().tail()

**This demonstration showcases Apache Iceberg's capability to provide version control for water resources data, enabling both reliability and reproducibility for large-scale hydrological modeling systems.**