# Access the Iceberg catalog

This notebook shows how to load Iceberg tables using [PyIceberg](https://github.com/apache/iceberg-python).

Note: materialize the Dagster assets before running this notebook. Otherwise, the tables will not be available.

## Connecting to the catalog

You can connect to the PostgreSQL-backed catalog using the `pyiceberg.catalog.sql.SqlCatalog` class. The credentials required for MinIO (storage) and PostgreSQL are available as environment variables.

In [1]:
import os

from pyiceberg.catalog.sql import SqlCatalog


catalog = SqlCatalog(
    name="dagster_example_catalog",
    **{
        "uri": os.environ["DAGSTER_SECRET_PYICEBERG_CATALOG_URI"],
        "s3.endpoint": os.environ["DAGSTER_SECRET_S3_ENDPOINT"],
        "s3.access-key-id": os.environ["DAGSTER_SECRET_S3_ACCESS_KEY_ID"],
        "s3.secret-access-key": os.environ[
            "DAGSTER_SECRET_S3_SECRET_ACCESS_KEY"
        ],
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
        "warehouse": os.environ["DAGSTER_SECRET_S3_WAREHOUSE"],
    }
)
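Because the cell above reads every setting with `os.environ[...]`, a missing variable raises a `KeyError` that names only the first absent key. A minimal sketch of a pre-flight check that reports all missing variables at once follows. The helper `missing_env_vars` is hypothetical and not part of the example repository; the variable names are copied from the cell above.

```python
import os

# Environment variable names copied from the catalog cell above.
REQUIRED_ENV_VARS = (
    "DAGSTER_SECRET_PYICEBERG_CATALOG_URI",
    "DAGSTER_SECRET_S3_ENDPOINT",
    "DAGSTER_SECRET_S3_ACCESS_KEY_ID",
    "DAGSTER_SECRET_S3_SECRET_ACCESS_KEY",
    "DAGSTER_SECRET_S3_WAREHOUSE",
)


def missing_env_vars(environ=os.environ, required=REQUIRED_ENV_VARS):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this before constructing the catalog makes configuration problems easier to spot than a single `KeyError`.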

The next cell lists the namespaces in the catalog. If none is listed, run `just nc` from the repository root.

In [2]:
catalog.list_namespaces()

[('air_quality',)]

If you've materialized the `daily_air_quality_data` asset, you'll see it listed under the `air_quality` namespace.

In [3]:
catalog.list_tables(namespace="air_quality")

[('air_quality', 'daily_air_quality_data')]
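Each entry returned by `list_tables` is a tuple of namespace and table name. As a small sketch, using the output above as literal data rather than a live catalog call, such tuples can be joined into dotted names of the form `namespace.table`:

```python
# Output of catalog.list_tables(namespace="air_quality"), copied from above.
table_identifiers = [("air_quality", "daily_air_quality_data")]

# Join each identifier tuple into a dotted "namespace.table" name.
table_names = [".".join(identifier) for identifier in table_identifiers]

print(table_names)
```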

You can load the table as follows:

In [4]:
table_daily_air_quality_data = catalog.load_table("air_quality.daily_air_quality_data")

table_daily_air_quality_data

daily_air_quality_data(
  1: station_number: optional string,
  2: value: optional double,
  3: timestamp_measured: optional string,
  4: formula: optional string,
  5: measurement_date: optional date,
  6: __index_level_0__: optional long
),
partition by: [measurement_date],
sort order: [],
snapshot: Operation.APPEND: id=3803352301006519707, schema_id=0

As you can see, the table is partitioned by the `measurement_date` column. This is because the `daily_air_quality_data` asset is daily-partitioned and its `partition_expr` metadata points at this column:

```python
# src/dagster_pyiceberg_example/partitions.py
daily_partition = DailyPartitionsDefinition(
    start_date=datetime.datetime(2024, 10, 20),
    end_offset=0,
    timezone="Europe/Amsterdam",
    fmt="%Y-%m-%d",
)

# src/dagster_pyiceberg_example/assets/__init__.py
@asset(
    description="Copy air quality data to iceberg table",
    compute_kind="iceberg",
    io_manager_key="warehouse_io_manager",
    partitions_def=daily_partition,
    ins={
        "ingested_data": AssetIn(
            "air_quality_data",
            # NB: need this to control which downstream asset partitions are materialized
            partition_mapping=MultiToSingleDimensionPartitionMapping(
                partition_dimension_name="daily"
            ),
            input_manager_key="landing_zone_io_manager",
            # NB: Some partitions can fail because of 500 errors from API
            #  So we need to allow missing partitions
            metadata={"allow_missing_partitions": True},
        )
    },
    code_version="v1",
    group_name="measurements",
    metadata={
        "partition_expr": "measurement_date",
    },
)
def daily_air_quality_data():
    ...
```
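Since Dagster partition keys use the `%Y-%m-%d` format and map onto `measurement_date`, a partition key can be turned into a row filter for reading a single day. The sketch below only builds the filter string. The helper `partition_row_filter` is hypothetical, and it is an assumption that PyIceberg's string filter syntax accepts a quoted ISO date for a `date` column.

```python
import datetime

PARTITION_KEY_FORMAT = "%Y-%m-%d"      # matches `fmt` in daily_partition
PARTITION_COLUMN = "measurement_date"  # matches `partition_expr` in the asset metadata


def partition_row_filter(partition_key: str) -> str:
    """Build a row filter string for one daily partition (hypothetical helper)."""
    # strptime raises ValueError if the key does not match the partition format.
    day = datetime.datetime.strptime(partition_key, PARTITION_KEY_FORMAT).date()
    return f"{PARTITION_COLUMN} = '{day.isoformat()}'"


row_filter = partition_row_filter("2024-11-08")
print(row_filter)
```

Under that assumption, the result could be passed to a scan, for example `table_daily_air_quality_data.scan(row_filter=row_filter).to_pandas()`, so that only the matching partition is read.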

Each snapshot carries metadata about the write that produced it. Its summary also records the Dagster run ID (`dagster_run_id`) and partition key (`dagster_partition_key`) that generated the snapshot.

In [5]:
table_daily_air_quality_data.snapshots()[0].model_dump()

{'snapshot-id': 3803352301006519707,
 'sequence-number': 1,
 'timestamp-ms': 1731148151326,
 'manifest-list': 's3://warehouse/air_quality.db/daily_air_quality_data/metadata/snap-3803352301006519707-0-a7e56d49-35c0-4bb5-a618-1936788e9144.avro',
 'summary': {'operation': 'append',
  'added-files-size': '33364',
  'added-data-files': '1',
  'added-records': '11288',
  'changed-partition-count': '1',
  'created_by': 'dagster',
  'dagster_run_id': 'ff374ece-7030-4d39-9f0f-3d69b85ed81b',
  'dagster_partition_key': '2024-11-08',
  'total-data-files': '1',
  'total-delete-files': '0',
  'total-records': '11288',
  'total-files-size': '33364',
  'total-position-deletes': '0',
  'total-equality-deletes': '0'},
 'schema-id': 0}
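The summary values are strings and the commit time is in epoch milliseconds, so a little post-processing helps when tracing a snapshot back to a Dagster run. A minimal sketch, operating on a subset of the dictionary shown above rather than on a live table; the helper `describe_snapshot` is hypothetical:

```python
import datetime

# Subset of the snapshot dictionary shown above.
snapshot = {
    "snapshot-id": 3803352301006519707,
    "timestamp-ms": 1731148151326,
    "summary": {
        "operation": "append",
        "added-records": "11288",
        "created_by": "dagster",
        "dagster_run_id": "ff374ece-7030-4d39-9f0f-3d69b85ed81b",
        "dagster_partition_key": "2024-11-08",
    },
}


def describe_snapshot(snapshot: dict) -> dict:
    """Extract the Dagster lineage fields from a dumped snapshot (hypothetical helper)."""
    summary = snapshot["summary"]
    # Iceberg stores the commit time as milliseconds since the Unix epoch.
    committed_at = datetime.datetime.fromtimestamp(
        snapshot["timestamp-ms"] / 1000, tz=datetime.timezone.utc
    )
    return {
        "committed_at": committed_at,
        "run_id": summary.get("dagster_run_id"),
        "partition_key": summary.get("dagster_partition_key"),
        # Summary counters are stored as strings; convert for arithmetic.
        "added_records": int(summary.get("added-records", 0)),
    }


info = describe_snapshot(snapshot)
print(info["run_id"], info["partition_key"], info["committed_at"].isoformat())
```

The same function could be applied to every entry of `table_daily_air_quality_data.snapshots()` after calling `model_dump()` on each, to see which run wrote which partition.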