# Demo: Time Travel with Iceberg Tables - CRUD Operations & Version Control

## Overview
This notebook demonstrates **Create, Read, Update, and Delete (CRUD) operations** on version-controlled data using Apache Iceberg tables. The notebook showcases how Iceberg's snapshot-based architecture enables time travel capabilities and maintains a complete history of all data modifications.

## Key Features Demonstrated:
- **CREATE**: Creating new tables and adding data
- **READ**: Querying current and historical data snapshots
- **UPDATE**: Modifying table schemas and data
- **DELETE**: Removing columns and dropping tables
- **VERSION CONTROL**: Time travel through snapshots to view historical states

## Prerequisites:
- A local PyIceberg catalog that is running and configured through `.pyiceberg.yaml`

## Objectives:
By the end of this notebook, you will understand how to:
1. Perform all CRUD operations on Iceberg tables
2. Leverage version control to access historical data states
3. Create and manage table snapshots
4. Navigate between different versions of your data

In [None]:
import os
from pathlib import Path

from pyiceberg.catalog import load_catalog

from icefabric.helpers import load_creds, load_pyiceberg_config

# Changes the current working dir to be the project root
current_working_dir = Path.cwd()
os.chdir(Path.cwd() / "../../")
print(
    f"Changed current working dir from {current_working_dir} to: {Path.cwd()}. This must run at the project root"
)


# dir is where the .env file is located
load_creds(dir=Path.cwd())

# Loading the local pyiceberg config settings
pyiceberg_config = load_pyiceberg_config(Path.cwd())
catalog = load_catalog(
    name="sql",
    type=pyiceberg_config["catalog"]["sql"]["type"],
    uri=pyiceberg_config["catalog"]["sql"]["uri"],
    warehouse=pyiceberg_config["catalog"]["sql"]["warehouse"],
)
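
The cell above reads three settings from a nested config dictionary and fails with a bare `KeyError` if any of them is absent. As a minimal sketch, a small check can report exactly which setting is missing before the catalog is loaded. The helper name `catalog_settings` and the example values are hypothetical; only the `catalog -> sql -> type/uri/warehouse` nesting comes from the cell above.

```python
def catalog_settings(config: dict, name: str = "sql") -> dict:
    """Return the type/uri/warehouse settings for one catalog, or raise KeyError."""
    try:
        entry = config["catalog"][name]
    except KeyError as exc:
        raise KeyError(f"No catalog named {name!r} in the pyiceberg config") from exc
    required = ("type", "uri", "warehouse")
    missing = [key for key in required if key not in entry]
    if missing:
        raise KeyError(f"Catalog {name!r} is missing settings: {missing}")
    return {key: entry[key] for key in required}


# Placeholder config with the same nesting the notebook reads from;
# the values are made up for illustration.
example_config = {
    "catalog": {
        "sql": {
            "type": "sql",
            "uri": "sqlite:///example.db",
            "warehouse": "file:///tmp/warehouse",
        }
    }
}
settings = catalog_settings(example_config)
print(settings)
```

With the real config, the result could be passed straight through as `load_catalog(name="sql", **catalog_settings(pyiceberg_config))`, since `load_catalog` accepts catalog properties as keyword arguments.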

### READ Operation: Loading and Inspecting Existing Data

We begin with the **READ** operation: loading an existing table and examining its version history. Iceberg keeps metadata for every snapshot (version) of the data, and `table.inspect.snapshots()` returns that history in tabular form.


In [None]:
table = catalog.load_table("streamflow_observations.usgs_hourly")
table.inspect.snapshots()

Let's examine the current data in the table. This represents the latest version of our dataset. Notice how we can easily convert Iceberg tables to pandas DataFrames for analysis.


In [None]:
df = table.scan().to_pandas().set_index("time")
df.tail()

### Version Control: Capturing Initial State

**Version Control Feature**: Every commit that writes data creates a snapshot with a unique ID. We capture the current snapshot ID here, before making any changes, so we can demonstrate time travel later. This snapshot is the baseline state of our data.


In [None]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")
# Record the snapshot the table currently points to; this is the pre-modification baseline
snapshot_id = table.current_snapshot().snapshot_id

### UPDATE Operation: Schema Evolution and Data Modification
 
Now we'll demonstrate the **UPDATE** operation by adding a new column to our existing table. This involves:
1. Creating synthetic data for the new column
2. Updating the table schema to accommodate the new column
3. Overwriting the table with the updated data


In [None]:
import numpy as np

# Build one full period of a sine wave with one value per row of the table
n = len(df)
x = np.linspace(0, n, n)
y = np.sin(2 * np.pi * 1 * x / n).astype(np.float32)

In [None]:
import pyarrow as pa

df["12345678"] = y
df.tail()

In [None]:
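# Convert to Arrow, then add any columns the Iceberg schema is missing (matched by name)
_df = pa.Table.from_pandas(df)
with table.update_schema() as update_schema:
    update_schema.union_by_name(_df.schema)
# Replace the table contents with the widened data
table.overwrite(_df)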

After our UPDATE operation, we can verify that the schema has been modified. The new column "12345678" should now be part of the table structure.


In [None]:
table.schema().fields[-1]
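
Looking at the last field only works if the new column happens to land at the end of the schema. As a rough sketch, comparing the column names from before and after the change gives a position-independent check. The helper `column_diff` and the column name `gauge_a` are hypothetical and used only for illustration.

```python
def column_diff(before, after):
    """Return (added, removed) column names, preserving the input order."""
    before_set, after_set = set(before), set(after)
    added = [name for name in after if name not in before_set]
    removed = [name for name in before if name not in after_set]
    return added, removed


# Hypothetical column lists standing in for the schema before and after the update
columns_before = ["time", "gauge_a"]
columns_after = ["time", "gauge_a", "12345678"]
added, removed = column_diff(columns_before, columns_after)
print(added, removed)  # ['12345678'] []
```

In the notebook, the "before" list would need to be captured ahead of the schema update, for example from `table.schema().column_names`, and compared against the same property afterwards.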

### Version Control: Tracking All Changes

**Version Control Feature**: Notice how the `overwrite()` call committed new snapshots automatically. (The schema change itself is recorded in the table metadata, not as a snapshot.) The snapshot history now shows:
- Original data snapshot
- Delete operation snapshot (part of overwrite)
- New append operation snapshot (with the new column)

This complete audit trail is essential for data governance and debugging.

In [None]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")
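
Raw snapshot IDs are hard to read as a timeline. The sketch below orders snapshots by commit time and renders each timestamp as an ISO 8601 string in UTC. It assumes each snapshot object exposes `snapshot_id` and `timestamp_ms` (milliseconds since the epoch); the `SimpleNamespace` objects are stand-ins with made-up values, and `snapshot_rows` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def snapshot_rows(snapshots):
    """Return one dict per snapshot, oldest first, with a readable UTC timestamp."""
    rows = []
    for snap in sorted(snapshots, key=lambda s: s.timestamp_ms):
        committed = datetime.fromtimestamp(snap.timestamp_ms / 1000, tz=timezone.utc)
        rows.append({"snapshot_id": snap.snapshot_id, "committed_at": committed.isoformat()})
    return rows


# Stand-in snapshots (made-up IDs and times), deliberately out of order
fake_snapshots = [
    SimpleNamespace(snapshot_id=2, timestamp_ms=1_700_000_060_000),
    SimpleNamespace(snapshot_id=1, timestamp_ms=1_700_000_000_000),
]
rows = snapshot_rows(fake_snapshots)
for row in rows:
    print(row)
```

Against the real table, the same helper could be called as `snapshot_rows(table.snapshots())`.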

**Time Travel Feature**: Using the snapshot ID we captured earlier, we can query the table as it existed before our UPDATE operation. This is Iceberg's time travel capability: any snapshot still retained in the table's metadata can be read by passing its ID to `table.scan()`.


In [None]:
table.scan(snapshot_id=snapshot_id).to_pandas().tail()
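
Time travel is often phrased as "the table as of a given moment" rather than as a specific snapshot ID. A minimal sketch of that lookup follows: pick the newest snapshot committed at or before a cutoff. It assumes each snapshot exposes `snapshot_id` and `timestamp_ms`, and it considers every snapshot in the list without checking branch ancestry, which is a simplification. The helper name and the sample values are hypothetical.

```python
from types import SimpleNamespace


def snapshot_id_as_of(snapshots, timestamp_ms):
    """Return the ID of the newest snapshot committed at or before timestamp_ms."""
    eligible = [s for s in snapshots if s.timestamp_ms <= timestamp_ms]
    if not eligible:
        raise ValueError("No snapshot exists at or before the requested time")
    return max(eligible, key=lambda s: s.timestamp_ms).snapshot_id


# Stand-in history with made-up IDs and timestamps
history = [
    SimpleNamespace(snapshot_id=101, timestamp_ms=1_000),
    SimpleNamespace(snapshot_id=102, timestamp_ms=2_000),
    SimpleNamespace(snapshot_id=103, timestamp_ms=3_000),
]
chosen = snapshot_id_as_of(history, 2_500)
print(chosen)  # 102
```

The result can be fed to the scan shown above, for example `table.scan(snapshot_id=snapshot_id_as_of(table.snapshots(), cutoff_ms))`, where `cutoff_ms` is a cutoff you choose.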

Comparing the current state (with the new column) versus the historical state (without the column) demonstrates how version control preserves all data states while allowing easy access to current data.


In [None]:
table.scan().to_pandas().tail()

Now we'll demonstrate another **UPDATE** operation by removing the column we just added. This shows how Iceberg handles schema evolution in both directions (adding and removing columns).


In [None]:
with table.update_schema() as update_schema:
    update_schema.delete_column("12345678")

df = df.drop("12345678", axis=1)
_df = pa.Table.from_pandas(df)
table.overwrite(_df)

In [None]:
table.schema().fields[-1]

### CREATE Operation: Building New Tables

Now we'll demonstrate the **CREATE** operation by building an entirely new table from scratch. This shows how to:
1. Prepare data for a new table
2. Create the table structure in the catalog
3. Populate the table with initial data

In [None]:
__df = df.copy()
__df["12345678"] = y
subset_df = __df[["12345678"]].copy()
subset_df.tail()

In [None]:
namespace = "streamflow_observations"
table_name = "testing_hourly"
arrow_table = pa.Table.from_pandas(subset_df)
iceberg_table = catalog.create_table(
    f"{namespace}.{table_name}",
    schema=arrow_table.schema,
)
iceberg_table.append(arrow_table)
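
If this notebook is interrupted before the cleanup step, the table is left behind and the next `create_table()` call fails because it already exists. One way to make the cell rerunnable, sketched under the assumption that the catalog offers `table_exists()` alongside `load_table()` and `create_table()`, is a get-or-create helper. `FakeCatalog` is a hypothetical stand-in that mimics only those three calls so the logic can be shown without a real catalog.

```python
def get_or_create_table(catalog, identifier, schema):
    """Load the table if the catalog already has it; otherwise create it."""
    if catalog.table_exists(identifier):
        return catalog.load_table(identifier)
    return catalog.create_table(identifier, schema=schema)


class FakeCatalog:
    """Hypothetical stand-in implementing only the three calls used above."""

    def __init__(self):
        self.tables = {}

    def table_exists(self, identifier):
        return identifier in self.tables

    def load_table(self, identifier):
        return self.tables[identifier]

    def create_table(self, identifier, schema):
        self.tables[identifier] = {"identifier": identifier, "schema": schema}
        return self.tables[identifier]


demo_catalog = FakeCatalog()
first = get_or_create_table(demo_catalog, "streamflow_observations.testing_hourly", ["12345678"])
second = get_or_create_table(demo_catalog, "streamflow_observations.testing_hourly", ["12345678"])
print(first is second)  # True
```

Note that calling `append()` again on a reloaded table would add the rows a second time; for a rerunnable cell, `overwrite()` may be the safer way to populate it.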

### READ Operation: Verifying New Table Creation 

After our **CREATE** operation, we can verify that the new table exists in our namespace and examine its initial snapshot. `create_table()` on its own produces an empty table with no snapshots; the first snapshot comes from the `append()` call that loaded the data.


In [None]:
catalog.list_tables(namespace)

In [None]:
table = catalog.load_table(f"{namespace}.{table_name}")
table.inspect.snapshots()

In [None]:
table.scan().to_pandas().tail()

### DELETE Operation: Table Removal

Finally, we demonstrate the **DELETE** operation by completely removing the table we just created. This shows how to clean up resources and manage table lifecycle.

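**Important**: Dropping a column leaves earlier snapshots readable through time travel, but dropping a table removes it from the catalog, so its snapshot history can no longer be reached through `catalog.load_table()`. Whether the underlying data and metadata files are also deleted depends on the catalog, so treat table deletion as permanent.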


In [None]:
catalog.drop_table(f"{namespace}.{table_name}")
catalog.list_tables(namespace)
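
Running the cleanup cell a second time fails because the table is already gone. A small guard, sketched here under the assumption that the catalog provides `table_exists()`, makes the cleanup safe to repeat. `FakeCatalog` is again a hypothetical stand-in that mimics only the calls the helper needs.

```python
def drop_table_if_exists(catalog, identifier):
    """Drop the table when it exists; return True if something was dropped."""
    if not catalog.table_exists(identifier):
        return False
    catalog.drop_table(identifier)
    return True


class FakeCatalog:
    """Hypothetical stand-in implementing only the two calls used above."""

    def __init__(self, tables):
        self.tables = set(tables)

    def table_exists(self, identifier):
        return identifier in self.tables

    def drop_table(self, identifier):
        self.tables.remove(identifier)


demo_catalog = FakeCatalog({"streamflow_observations.testing_hourly"})
first_attempt = drop_table_if_exists(demo_catalog, "streamflow_observations.testing_hourly")
second_attempt = drop_table_if_exists(demo_catalog, "streamflow_observations.testing_hourly")
print(first_attempt, second_attempt)  # True False
```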

### Summary: CRUD Operations and Version Control Demonstrated
 
This notebook demonstrated each CRUD operation on version-controlled data:
 
#### CREATE Operations:
- Created new tables with `catalog.create_table()`
- Added new columns to existing tables
- Populated tables with initial data using `append()`

#### READ Operations:
- Loaded existing tables with `catalog.load_table()`
- Queried current data states with `table.scan()`
- Accessed historical data states using snapshot IDs
- Inspected table schemas and metadata
 
#### UPDATE Operations:
- Modified table schemas by adding columns
- Updated data through `overwrite()` operations
- Removed columns from existing tables

#### DELETE Operations:
- Deleted columns from table schemas
- Removed entire tables with `catalog.drop_table()`

#### Version Control Features:
- **Snapshot Management**: Every data write creates a tracked snapshot
- **Time Travel**: Access any retained historical state using snapshot IDs
- **Audit Trail**: Complete history of all table modifications
- **Schema Evolution**: Track changes to table structure over time
