# Demo: Time travel with Iceberg tables

Create a demo catalog, modify a table, and inspect the changes through the table's snapshot history.

Requires:
- `pyiceberg[sql-sqlite]` installed
- the AWS CLI (`aws`) available on your PATH, since the sample file is fetched with `aws s3 cp`
- `.env` with your AWS credentials

In [2]:
import shlex
import subprocess
from pathlib import Path

import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

In [3]:
from icefabric.helpers import load_creds

# dir is where the .env file is located
load_creds(dir=Path.cwd().parents[1])

In [3]:
# Retrieve the sample Parquet file from S3
parquet_path = Path("./data/parquet")
parquet_path.mkdir(parents=True, exist_ok=True)

subprocess.run(
    shlex.split(
        f"aws s3 cp s3://ngwpc-hydrofabric/hydrofabric_parquet/2.2/CONUS/divides.parquet {parquet_path}"
    ),
    check=True,  # raise CalledProcessError if the copy fails
)

download: s3://ngwpc-hydrofabric/hydrofabric_parquet/2.2/CONUS/divides.parquet to data/parquet/divides.parquet


CompletedProcess(args=['aws', 's3', 'cp', 's3://ngwpc-hydrofabric/hydrofabric_parquet/2.2/CONUS/divides.parquet', 'data/parquet'], returncode=0)
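
Re-running the notebook repeats the download even when the file is already on disk. A minimal sketch of one way to skip it is below. The helper name `fetch_if_missing` is hypothetical and not part of icefabric or pyiceberg; it only wraps `subprocess.run`.

```python
import subprocess
from pathlib import Path


def fetch_if_missing(dest: Path, command: list[str]) -> bool:
    """Run `command` only when `dest` does not exist yet.

    Returns True if the command ran, False if it was skipped.
    (Hypothetical helper, for illustration only.)
    """
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit code,
    # so a failed copy cannot go unnoticed.
    subprocess.run(command, check=True)
    return True


# Usage in this notebook (same source URI and destination as the cell above):
# fetch_if_missing(
#     Path("data/parquet/divides.parquet"),
#     ["aws", "s3", "cp",
#      "s3://ngwpc-hydrofabric/hydrofabric_parquet/2.2/CONUS/divides.parquet",
#      "data/parquet"],
# )
```

Passing the command as a list avoids shell quoting issues; the existence check is deliberately simple and does not detect a partially downloaded file.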

Read the divides Parquet file.

In [5]:
df = pq.read_table("data/parquet/divides.parquet")

Create a data catalog stored in the "warehouse" directory.

In [6]:
warehouse_path = Path("./warehouse")
warehouse_path.mkdir(exist_ok=True)
catalog = load_catalog(
    "default",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
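
The paths above are relative, so both the SQLite file and the warehouse location depend on the notebook's working directory. As a hedged alternative, the sketch below builds the same three properties from an absolute path. The function name `sqlite_catalog_properties` is hypothetical; only `pathlib` is used, and the resulting dictionary is meant to be passed to `load_catalog` exactly as in the cell above.

```python
from pathlib import Path


def sqlite_catalog_properties(warehouse: Path) -> dict[str, str]:
    """Build SQL-catalog properties from an absolute warehouse path.

    (Hypothetical helper, for illustration only.)
    """
    root = warehouse.resolve()  # absolute, independent of later chdir calls
    return {
        "type": "sql",
        # An absolute POSIX path yields the four-slash form sqlite:////abs/path.
        "uri": f"sqlite:///{root / 'pyiceberg_catalog.db'}",
        # Path.as_uri() requires an absolute path and returns file:///abs/path.
        "warehouse": root.as_uri(),
    }


props = sqlite_catalog_properties(Path("./warehouse"))
print(props["uri"])
print(props["warehouse"])
# catalog = load_catalog("default", **props)  # same call as in the cell above
```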

Create an Iceberg table for the divides.

In [7]:
catalog.create_namespace("default")
table = catalog.create_table(
    "default.divides",
    schema=df.schema,
)

Append the divides data to the Iceberg table and print the number of rows. There should be 831777 divides for CONUS.

In [8]:
table.append(df)
len(table.scan().to_arrow())

831777

A snapshot was created for the initial append. Store this snapshot ID for later use.

In [9]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")
snapshot_id = table.metadata.snapshots[0].snapshot_id

Snapshot ID: 6392225911450853109; Summary:  operation=Operation.APPEND


Add a new column, "lengthm", holding the flowpath length in meters. Evolve the table schema to include it, then overwrite the original table.

In [10]:
import pyarrow.compute as pc

df = df.append_column("lengthm", pc.multiply(df["lengthkm"], 1000))
with table.update_schema() as update_schema:
    update_schema.union_by_name(df.schema)
table.overwrite(df)

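The table should now have a new "lengthm" column. Note that the cell below evaluates `table.schema` without calling it, so the notebook displays the bound method, whose repr embeds the table; `table.schema()` would return the schema object itself.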

In [11]:
table.schema

<bound method Table.schema of divides(
  1: divide_id: optional string,
  2: toid: optional string,
  3: type: optional string,
  4: ds_id: optional double,
  5: areasqkm: optional double,
  6: vpuid: optional string,
  7: id: optional string,
  8: lengthkm: optional double,
  9: tot_drainage_areasqkm: optional double,
  10: has_flowline: optional boolean,
  11: geometry: optional binary,
  12: lengthm: optional double
),
partition by: [],
sort order: [],
snapshot: Operation.APPEND: id=6082380623201209864, parent_id=3770483438741118773, schema_id=1>

There should now be three snapshots: the original append, plus a delete and an append recorded by the overwrite that added the new column.

In [12]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")

Snapshot ID: 6392225911450853109; Summary:  operation=Operation.APPEND
Snapshot ID: 3770483438741118773; Summary:  operation=Operation.DELETE
Snapshot ID: 6082380623201209864; Summary:  operation=Operation.APPEND
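
The snapshot list is easier to read as a chain when each entry also shows its parent and commit time. The sketch below is one way to format it, assuming each snapshot object exposes `snapshot_id`, `parent_snapshot_id`, and `timestamp_ms` (fields that pyiceberg `Snapshot` objects are expected to carry). The function name `format_history` is hypothetical, and the demo uses stand-in objects with made-up values rather than the IDs from this notebook.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def format_history(snapshots) -> list[str]:
    """Return one line per snapshot: ID, parent ID, and UTC commit time."""
    lines = []
    for snap in snapshots:
        committed = datetime.fromtimestamp(snap.timestamp_ms / 1000, tz=timezone.utc)
        parent = snap.parent_snapshot_id if snap.parent_snapshot_id is not None else "none"
        lines.append(f"{snap.snapshot_id} <- parent {parent} @ {committed.isoformat()}")
    return lines


# Stand-in objects with made-up IDs and timestamps, for illustration only.
demo = [
    SimpleNamespace(snapshot_id=1, parent_snapshot_id=None, timestamp_ms=0),
    SimpleNamespace(snapshot_id=2, parent_snapshot_id=1, timestamp_ms=60_000),
]
print("\n".join(format_history(demo)))
# With the table from this notebook:
# print("\n".join(format_history(table.snapshots())))
```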


You can pass the first snapshot ID (saved earlier in `snapshot_id`) to the scan function to look at the table as it was before the
new column was added. That version of the table does not have "lengthm".

In [13]:
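# Scan the first snapshot; the schema printed below has no lengthm column.
print(table.scan(snapshot_id=snapshot_id).to_arrow().to_string())
# To inspect a single divide in the current table instead (the filter must be one string):
# scan = table.scan(row_filter="divide_id == 'cat-276'", selected_fields=("divide_id", "lengthm")).to_arrow()
# print(scan)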

pyarrow.Table
divide_id: large_string
toid: large_string
type: large_string
ds_id: double
areasqkm: double
vpuid: large_string
id: large_string
lengthkm: double
tot_drainage_areasqkm: double
has_flowline: bool
geometry: large_binary
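
Time travel is often phrased as "the table as of a given time" rather than by snapshot ID. The sketch below picks the latest snapshot committed at or before a timestamp, assuming a linear history and snapshot objects that expose `snapshot_id` and `timestamp_ms`. The function name `snapshot_id_as_of` is hypothetical, and the demo values are made up.

```python
from types import SimpleNamespace
from typing import Optional


def snapshot_id_as_of(snapshots, timestamp_ms: int) -> Optional[int]:
    """Return the ID of the latest snapshot committed at or before timestamp_ms."""
    eligible = [s for s in snapshots if s.timestamp_ms <= timestamp_ms]
    if not eligible:
        return None  # no snapshot existed yet at that time
    return max(eligible, key=lambda s: s.timestamp_ms).snapshot_id


# Stand-in snapshots with made-up IDs and timestamps, for illustration only.
demo_snapshots = [
    SimpleNamespace(snapshot_id=101, timestamp_ms=1_000),
    SimpleNamespace(snapshot_id=102, timestamp_ms=2_000),
    SimpleNamespace(snapshot_id=103, timestamp_ms=3_000),
]
print(snapshot_id_as_of(demo_snapshots, 2_500))
# With the table from this notebook (some_timestamp_ms is a placeholder):
# sid = snapshot_id_as_of(table.snapshots(), some_timestamp_ms)
# old = table.scan(snapshot_id=sid).to_arrow()
```

The result can be passed to `table.scan(snapshot_id=...)` in the same way as the saved `snapshot_id` above. A `None` result should be handled before scanning.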
