# Demo: Time travel with Iceberg tables

Create a demo catalog, make changes to a table, and inspect those changes through the table's snapshot history.

Requires:
- `pyiceberg[sql-sqlite]` installed
- `.env` with your AWS credentials

In [None]:
import os
from pathlib import Path

from pyiceberg.catalog import load_catalog

from icefabric.helpers import load_creds, load_pyiceberg_config

# Changes the current working dir to be the project root
current_working_dir = Path.cwd()
os.chdir(Path.cwd() / "../../")
print(
    f"Changed current working dir from {current_working_dir} to: {Path.cwd()}. This must run at the project root"
)


# dir is where the .env file is located
load_creds(dir=Path.cwd())

# Loading the local pyiceberg config settings
pyiceberg_config = load_pyiceberg_config(Path.cwd())
catalog = load_catalog(
    name="sql",
    type=pyiceberg_config["catalog"]["sql"]["type"],
    uri=pyiceberg_config["catalog"]["sql"]["uri"],
    warehouse=pyiceberg_config["catalog"]["sql"]["warehouse"],
)

Load the table that we would like to time travel through and list its snapshots.

In [None]:
table = catalog.load_table("streamflow_observations.usgs_hourly")
table.inspect.snapshots()

Let's view this data and see what's there

In [None]:
df = table.scan().to_pandas().set_index("time")
df.tail()

A snapshot was created for the initial append. Store its snapshot ID for later use.

In [None]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")
snapshot_id = table.metadata.snapshots[0].snapshot_id
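
Indexing `snapshots[0]` relies on the metadata listing snapshots oldest first. A minimal sketch of a more explicit selection is shown here, assuming each snapshot object exposes `snapshot_id` and `timestamp_ms` attributes. The helper name `earliest_snapshot_id` is hypothetical and not part of pyiceberg.

```python
def earliest_snapshot_id(snapshots):
    """Return the ID of the snapshot with the smallest commit timestamp."""
    snapshots = list(snapshots)
    if not snapshots:
        raise ValueError("table has no snapshots yet")
    # timestamp_ms is assumed to be the commit time in milliseconds since the epoch
    return min(snapshots, key=lambda s: s.timestamp_ms).snapshot_id
```

In this notebook, `earliest_snapshot_id(table.snapshots())` should return the same ID as the cell above when the snapshot list is in commit order.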

Add a new column for a fake gauge

In [None]:
import numpy as np

n = len(df)
x = np.linspace(0, n, n)
y = np.sin(2 * np.pi * 1 * x / n).astype(np.float32)

In [None]:
import pyarrow as pa

df["12345678"] = y
df
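
The cells above only change the in-memory `df`. For the column to appear in the Iceberg table, the table schema has to be updated and the data rewritten. A minimal sketch of that step follows, assuming the same `update_schema()` and `overwrite()` pattern this notebook uses later when deleting the column. The helper name `add_column_and_overwrite` and its parameters are hypothetical.

```python
def add_column_and_overwrite(table, arrow_table, column_name, field_type):
    """Add `column_name` to the Iceberg schema, then replace the table contents."""
    # The schema change is committed when the `with` block exits
    with table.update_schema() as update_schema:
        update_schema.add_column(column_name, field_type)
    # Replace the existing rows with the data that includes the new column
    table.overwrite(arrow_table)
```

Here it could be called as `add_column_and_overwrite(table, pa.Table.from_pandas(df), "12345678", FloatType())`, with `FloatType` imported from `pyiceberg.types` to match the `float32` values generated above.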

Once the column has been added to the table schema and the data rewritten, the schema should include a new "12345678" column.

In [None]:
# schema() is a method; calling it returns the current table schema
table.schema()

There should now be three snapshots: the original append, a delete, and an append with the new column.

In [None]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")

Pass the first snapshot ID (saved earlier) to the `scan` function to look at the table as it was before the
new column was added. That version of the table doesn't have the "12345678" column.

In [None]:
table.scan(snapshot_id=snapshot_id).to_pandas().tail()
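
Time travel can also be driven by a point in time rather than a known snapshot ID. A minimal sketch of choosing the snapshot that was current at a given moment follows, assuming each snapshot exposes `snapshot_id` and `timestamp_ms`. The helper `snapshot_id_as_of` and the `cutoff_ms` argument mentioned below are hypothetical names.

```python
def snapshot_id_as_of(snapshots, timestamp_ms):
    """Return the ID of the latest snapshot committed at or before `timestamp_ms`."""
    candidates = [s for s in snapshots if s.timestamp_ms <= timestamp_ms]
    if not candidates:
        raise ValueError("no snapshot exists at or before the requested time")
    # Among the snapshots that already existed, the most recent one was current
    return max(candidates, key=lambda s: s.timestamp_ms).snapshot_id
```

The result can be passed straight to the scan, for example `table.scan(snapshot_id=snapshot_id_as_of(table.snapshots(), cutoff_ms))`, where `cutoff_ms` is a time in milliseconds since the Unix epoch.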

Loading without a snapshot ID, by contrast, gives the latest data.

In [None]:
table.scan().to_pandas().tail()

Now, let's delete that column from the table in the local warehouse.

In [None]:
df = df.drop("12345678", axis=1)
df.tail()

In [None]:
with table.update_schema() as update_schema:
    update_schema.delete_column("12345678")

# Then overwrite with the data (without the column).
# The column was already dropped from df in the previous cell;
# dropping it a second time would raise a KeyError.
_df = pa.Table.from_pandas(df)
table.overwrite(_df)

In [None]:
_df.schema

Let's now check the snapshots and pull in the latest data

In [None]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")

In [None]:
table.scan().to_pandas().tail()

In [None]:
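# Time travel to the third-most-recent snapshot. Given the sequence above,
# this should be the append that still included the "12345678" column.
snapshot_id = table.snapshots()[-3].snapshot_id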
table.scan(snapshot_id=snapshot_id).to_pandas().tail()
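
To confirm what changed between two points in the history, the column names of the two scans can be compared directly. A minimal sketch follows, using a hypothetical helper `column_diff` that works on any two sequences of column names.

```python
def column_diff(before, after):
    """Return (added, removed) column names going from `before` to `after`."""
    before, after = list(before), list(after)
    added = [c for c in after if c not in before]
    removed = [c for c in before if c not in after]
    return added, removed
```

For example, `column_diff(table.scan(snapshot_id=snapshot_id).to_pandas().columns, table.scan().to_pandas().columns)` should report "12345678" as removed if the older snapshot still carries that column.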