Read the divides Parquet file


5.) Create directories:  mkdir -p data/parquet warehouse  
6.) Copy example parquet file from s3:  aws s3 cp s3://ngwpc-hydrofabric/hydrofabric_parquet/2.2/CONUS/divides.parquet data/parquet  
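
The directory step can also be done from inside Python, which may be convenient when the notebook runs somewhere without a POSIX shell. The following is a minimal sketch using only the standard library. The helper name create_project_dirs is hypothetical, and the S3 copy in step 6 is not covered here.

```python
from pathlib import Path


def create_project_dirs(base: Path = Path(".")) -> list[Path]:
    """Create data/parquet and warehouse under base, similar to `mkdir -p`."""
    targets = [base / "data" / "parquet", base / "warehouse"]
    for target in targets:
        # parents=True creates intermediate directories; exist_ok=True makes reruns safe.
        target.mkdir(parents=True, exist_ok=True)
    return targets


created = create_project_dirs()
print([str(path) for path in created])
```

Calling the helper twice is harmless, since existing directories are left as they are.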

In [3]:
import sys
!{sys.executable} -m pip install "pyiceberg[pandas]"
!{sys.executable} -m pip install sqlalchemy


import pyarrow.parquet as pq
df = pq.read_table('data/parquet/divides.parquet')

Defaulting to user installation because normal site-packages is not writeable
Collecting sqlalchemy
  Using cached sqlalchemy-2.0.41-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Collecting greenlet>=1
  Using cached greenlet-3.2.2-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (580 kB)
Installing collected packages: greenlet, sqlalchemy
Successfully installed greenlet-3.2.2 sqlalchemy-2.0.41


Create a data catalog stored in the "warehouse" directory.

In [4]:
from pyiceberg.catalog import load_catalog
warehouse_path = "warehouse"
catalog = load_catalog(
    "default",
    **{
        'type': 'sql',
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
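
The configuration above uses a relative warehouse path, so it depends on the directory the notebook is started from. If that becomes a problem, one option is to resolve the path first. This is a sketch of that variation, not part of the original notebook. The variable names are illustrative, and the resulting dictionary would be passed to load_catalog in the same way as above.

```python
from pathlib import Path

# Resolve the warehouse directory to an absolute path (it need not exist yet).
warehouse_dir = Path("warehouse").resolve()

catalog_properties = {
    "type": "sql",
    # SQLAlchemy SQLite URL; an absolute POSIX path yields four slashes in total.
    "uri": f"sqlite:///{warehouse_dir / 'pyiceberg_catalog.db'}",
    # as_uri() produces a well-formed file:// URI for the absolute path.
    "warehouse": warehouse_dir.as_uri(),
}
print(catalog_properties)
```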

Create an Iceberg table for the divides, using the schema of the Parquet data.

In [5]:
catalog.create_namespace("default")
table = catalog.create_table(
    "default.divides",
    schema=df.schema,
)

Append the divides data to the Iceberg table and print the number of rows.  There should be 831777 divides for CONUS.

In [6]:
table.append(df)
len(table.scan().to_arrow())

831777

The initial append created a snapshot.  Store its snapshot ID for later use.

In [7]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")
snapshot_id = table.metadata.snapshots[0].snapshot_id

Snapshot ID: 91485850664974323; Summary:  operation=Operation.APPEND


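Add a new column, "lengthm", holding the flowpath length in meters (lengthkm multiplied by 1000).  Then update the table schema to include the new column and overwrite the original table.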

In [8]:
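import pyarrow.compute as pc

# Derive the length in meters from the existing kilometer column.
df = df.append_column("lengthm", pc.multiply(df["lengthkm"], 1000))

# Evolve the table schema so it includes the new column, then replace the data.
with table.update_schema() as update_schema:
    update_schema.union_by_name(df.schema)
table.overwrite(df)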

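The schema should now include a new "lengthm" column (field 12).  Because table.schema is referenced here without parentheses, the output is the representation of the bound method, which embeds the table description.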

In [9]:
table.schema

<bound method Table.schema of divides(
  1: divide_id: optional string,
  2: toid: optional string,
  3: type: optional string,
  4: ds_id: optional double,
  5: areasqkm: optional double,
  6: vpuid: optional string,
  7: id: optional string,
  8: lengthkm: optional double,
  9: tot_drainage_areasqkm: optional double,
  10: has_flowline: optional boolean,
  11: geometry: optional binary,
  12: lengthm: optional double
),
partition by: [],
sort order: [],
snapshot: Operation.APPEND: id=1597837331107200528, parent_id=1351760718331482642, schema_id=1>

There should now be three snapshots: the original append, a delete, and an append with the new column.  The overwrite is recorded as a delete followed by an append.

In [10]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}; Summary:  {snapshot.summary}")

Snapshot ID: 91485850664974323; Summary:  operation=Operation.APPEND
Snapshot ID: 1351760718331482642; Summary:  operation=Operation.DELETE
Snapshot ID: 1597837331107200528; Summary:  operation=Operation.APPEND


Passing the first snapshot ID (saved earlier in snapshot_id) to table.scan reads the table as it was before the new column
was added.  That version of the table has no lengthm column.

In [11]:
# Example of a filtered, column-limited scan of the current table:
# scan = table.scan(row_filter="divide_id == 'cat-276'", selected_fields=("divide_id", "lengthm")).to_arrow()
# print(scan)
# Scan the first snapshot; to_string() prints only the schema by default.
print(table.scan(snapshot_id=snapshot_id).to_arrow().to_string())

pyarrow.Table
divide_id: large_string
toid: large_string
type: large_string
ds_id: double
areasqkm: double
vpuid: large_string
id: large_string
lengthkm: double
tot_drainage_areasqkm: double
has_flowline: bool
geometry: large_binary
