# Iceberg Workshop: Getting Started

## Setup Catalog

In [1]:
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

## Example Data

| name    | id | date       |
|---------|----|------------|
| Alice   |  1 | 2018-04-02 |
| Bob     |  2 | 2020-09-07 |
| Charlie |  3 | 2022-07-01 |

In [2]:
import pyarrow as pa

example_schema = pa.schema(
    [
        ("name", pa.string()),
        ("id", pa.int32()),
        ("date", pa.date32()),
    ]
)

example_data = pa.Table.from_pylist([
    {"name": "Alice", "id": 1, "date": 17623},    # 2018-04-02 (days since 1970-01-01)
    {"name": "Bob", "id": 2, "date": 18512},      # 2020-09-07
    {"name": "Charlie", "id": 3, "date": 19174},  # 2022-07-01
], schema=example_schema)
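
The integers in the `date` column are Arrow `date32` values, which count days since 1970-01-01. As a quick sanity check, here is a minimal standard-library sketch that converts between the two forms; the helper names `date32_to_date` and `date_to_date32` are made up for this example and are not part of PyArrow.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date32_to_date(days: int) -> date:
    """Convert a date32 value (days since 1970-01-01) to a calendar date."""
    return EPOCH + timedelta(days=days)


def date_to_date32(d: date) -> int:
    """Convert a calendar date to its date32 day count."""
    return (d - EPOCH).days


for days in (17623, 18512, 19174):
    print(days, date32_to_date(days).isoformat())
# 17623 2018-04-02
# 18512 2020-09-07
# 19174 2022-07-01
```

Going the other way, `date_to_date32(date(2018, 4, 2))` gives back `17623`, which is handy when preparing new rows by hand.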

## Create an Iceberg table

First, we'll create a namespace `demo_ns` to organize the Iceberg table, then define and create the table using the specified schema.

In [3]:
catalog.create_namespace_if_not_exists("demo_ns")

In [4]:
table = catalog.create_table_if_not_exists("demo_ns.demo_table_1", schema=example_schema)
table

demo_table_1(
  1: name: optional string,
  2: id: optional int,
  3: date: optional date
),
partition by: [],
sort order: [],
snapshot: null

## What happens behind table creation?

A metadata file has been created and registered as the latest metadata of table `demo_ns.demo_table_1`. Let's log in to the MinIO console and look at the file:

- MinIO URL: http://localhost:9001/
- Username: admin
- Password: password

The bucket structure currently looks like this:

```
warehouse/
└── demo_ns/
    └── demo_table_1/
        └── metadata/
            └── <uuid>.metadata.json
```
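
To get a feel for what the metadata file holds, here is a rough sketch that parses a hand-written, heavily abbreviated stand-in for `<uuid>.metadata.json` and prints its current schema. The key names follow the Iceberg table spec, but the values (including the `location`) are illustrative and the real file contains many more fields. `describe_schema` is a hypothetical helper, not a PyIceberg function.

```python
import json

# Hand-written, abbreviated stand-in for a <uuid>.metadata.json file.
# Values are illustrative only; a real file has many more keys.
sample_metadata = json.loads("""
{
  "format-version": 2,
  "location": "s3://warehouse/demo_ns/demo_table_1",
  "current-schema-id": 0,
  "schemas": [
    {
      "schema-id": 0,
      "type": "struct",
      "fields": [
        {"id": 1, "name": "name", "required": false, "type": "string"},
        {"id": 2, "name": "id", "required": false, "type": "int"},
        {"id": 3, "name": "date", "required": false, "type": "date"}
      ]
    }
  ],
  "snapshots": []
}
""")


def describe_schema(metadata: dict) -> list:
    """Return one human-readable line per field of the current schema."""
    current_id = metadata["current-schema-id"]
    schema = next(s for s in metadata["schemas"] if s["schema-id"] == current_id)
    lines = []
    for field in schema["fields"]:
        nullability = "required" if field["required"] else "optional"
        lines.append(f"{field['id']}: {field['name']}: {nullability} {field['type']}")
    return lines


for line in describe_schema(sample_metadata):
    print(line)
```

The printed lines mirror the field list shown in the table summary above, and the empty `snapshots` list matches a table that has no data yet.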

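## Add data to the table

Writing data creates a new snapshot on the table. Because the table is still empty, `overwrite` has nothing to replace, so the snapshot is recorded as an append.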

In [5]:
table.overwrite(example_data)
table



demo_table_1(
  1: name: optional string,
  2: id: optional int,
  3: date: optional date
),
partition by: [],
sort order: [],
snapshot: Operation.APPEND: id=7769616688498351098, schema_id=0
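
Each later write adds another snapshot that points back to the one before it. The sketch below is a toy model of that chain, not PyIceberg's own classes: `Snapshot` and `lineage` are hypothetical names, and the second snapshot ID is made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]  # None for the first snapshot of a table
    operation: str


def lineage(snapshots, current_id):
    """Return snapshot IDs from the current snapshot back to the first one."""
    by_id = {s.snapshot_id: s for s in snapshots}
    chain = []
    cursor = current_id
    while cursor is not None:
        chain.append(cursor)
        cursor = by_id[cursor].parent_id
    return chain


# The first write has no parent; a hypothetical second write points back to it.
first = Snapshot(7769616688498351098, None, "append")
second = Snapshot(1234, first.snapshot_id, "append")  # made-up ID

print(lineage([first, second], second.snapshot_id))
# [1234, 7769616688498351098]
```

Walking this chain is the basic idea behind inspecting a table's history or reading it as of an earlier snapshot.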

## Read the table

We can see that the example data has been added to the table:

In [6]:
table.scan().to_pandas()

|   | name    | id | date       |
|---|---------|----|------------|
| 0 | Alice   |  1 | 2018-04-02 |
| 1 | Bob     |  2 | 2020-09-07 |
| 2 | Charlie |  3 | 2022-07-01 |


## What happens when adding data?

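The data has been written to a Parquet file, and a new snapshot has been created. The `snap-<uuid>.avro` file is the snapshot's manifest list, and `<uuid>-m0.avro` is a manifest that tracks the data file: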

```
warehouse/
└── demo_ns/
    └── demo_table_1/
        ├── metadata/
        │   ├── <uuid>.metadata.json
        │   ├── snap-<uuid>.avro
        │   └── <uuid>-m0.avro
        └── data/
            └── <uuid>.parquet
```

## Table Evolution: Make table partitioned

The table we just created is unpartitioned, but it is common practice to partition a table by one or more columns. We can add partitioning to the existing table.

Iceberg allows you to update the partitioning strategy without having to recreate the table or migrate any data.

In [7]:
from pyiceberg.transforms import YearTransform

with table.update_spec() as update:
    update.add_field("date", YearTransform())

table

demo_table_1(
  1: name: optional string,
  2: id: optional int,
  3: date: optional date
),
partition by: [date_year],
sort order: [],
snapshot: Operation.APPEND: id=7769616688498351098, schema_id=0

## Add more data to partitioned table

New data is written into separate partitions according to the partition spec. In this demo, there will be three partitions:

- date_year=2018
- date_year=2020
- date_year=2022
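
To see why these three partitions appear, here is a small standard-library sketch that buckets rows by the year of their `date` value. It only imitates the idea of a year transform: `year_of` and `group_by_year` are hypothetical helpers, and Iceberg itself stores the transform result as a year offset from 1970 rather than the calendar year used here for readability.

```python
from collections import defaultdict
from datetime import date, timedelta


def year_of(days_since_epoch: int) -> int:
    """Calendar year of a date32 value (days since 1970-01-01)."""
    return (date(1970, 1, 1) + timedelta(days=days_since_epoch)).year


def group_by_year(rows):
    """Bucket row names by the year of their `date` field."""
    partitions = defaultdict(list)
    for row in rows:
        key = f"date_year={year_of(row['date'])}"
        partitions[key].append(row["name"])
    return dict(partitions)


rows = [
    {"name": "David", "id": 4, "date": 17623},
    {"name": "John", "id": 5, "date": 18512},
    {"name": "Jonas", "id": 6, "date": 19174},
]

print(group_by_year(rows))
# {'date_year=2018': ['David'], 'date_year=2020': ['John'], 'date_year=2022': ['Jonas']}
```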

TODO: Add a preview of current file structure

In [8]:
example_data_2 = pa.Table.from_pylist([
    {"name": "David", "id": 4, "date": 17623},  # 2018-04-02 (days since 1970-01-01)
    {"name": "John", "id": 5, "date": 18512},   # 2020-09-07
    {"name": "Jonas", "id": 6, "date": 19174},  # 2022-07-01
], schema=example_schema)

table.append(example_data_2)

In [9]:
table.scan().to_pandas()

|   | name    | id | date       |
|---|---------|----|------------|
| 0 | David   |  4 | 2018-04-02 |
| 1 | John    |  5 | 2020-09-07 |
| 2 | Jonas   |  6 | 2022-07-01 |
| 3 | Alice   |  1 | 2018-04-02 |
| 4 | Bob     |  2 | 2020-09-07 |
| 5 | Charlie |  3 | 2022-07-01 |


## Table Evolution: Add new columns to the table

Iceberg lets users add new columns without recreating or rewriting the table.

In [10]:
from pyiceberg.types import StringType, DecimalType

with table.update_schema() as update:
    # Add new columns
    update.add_column("comments", StringType())
    update.add_column("salary", DecimalType(9, 3))

In [11]:
table

demo_table_1(
  1: name: optional string,
  2: id: optional int,
  3: date: optional date,
  4: comments: optional string,
  5: salary: optional decimal(9, 3)
),
partition by: [date_year],
sort order: [],
snapshot: Operation.APPEND: id=8973395508097198299, parent_id=7769616688498351098, schema_id=0

In [12]:
table.scan().to_pandas()

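|   | name    | id | date       | comments | salary |
|---|---------|----|------------|----------|--------|
| 0 | David   |  4 | 2018-04-02 |          |        |
| 1 | John    |  5 | 2020-09-07 |          |        |
| 2 | Jonas   |  6 | 2022-07-01 |          |        |
| 3 | Alice   |  1 | 2018-04-02 |          |        |
| 4 | Bob     |  2 | 2020-09-07 |          |        |
| 5 | Charlie |  3 | 2022-07-01 |          |        |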
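
The new columns come back empty because the existing data files were never rewritten. Iceberg identifies columns by field ID, so a reader can fill in nulls for any field ID that an older file does not contain. The following is a toy illustration of that idea, not PyIceberg's reader; `project` is a hypothetical helper and the row is copied from the table above.

```python
# A row as stored in a data file written before the schema change,
# keyed by field ID rather than by column name.
OLD_FILE_ROW = {1: "Alice", 2: 1, 3: "2018-04-02"}

# The evolved schema as (field ID, column name) pairs.
NEW_SCHEMA = [
    (1, "name"),
    (2, "id"),
    (3, "date"),
    (4, "comments"),
    (5, "salary"),
]


def project(row_by_field_id, schema):
    """Read a stored row through a schema, using None for missing field IDs."""
    return {name: row_by_field_id.get(field_id) for field_id, name in schema}


print(project(OLD_FILE_ROW, NEW_SCHEMA))
# {'name': 'Alice', 'id': 1, 'date': '2018-04-02', 'comments': None, 'salary': None}
```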


## TODO: More Possible Topics
- Transaction
- Time travel
- Step-by-step exploration of table update

## Interoperability with other engines: Spark

Tables created by PyIceberg can be read by Spark.

In [13]:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-aws-bundle:1.8.1")
  .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.demo.type", "rest")
  .config("spark.sql.catalog.demo.uri", "http://rest:8181")
  .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.sql.catalog.demo.warehouse", "s3://warehouse")
  .config("spark.sql.catalog.demo.s3.endpoint", "http://minio:9000")
  .config("spark.sql.catalog.demo.s3.region", "us-east-1")
  .config("spark.sql.catalog.demo.s3.path-style-access", "true")
).getOrCreate()

In [14]:
spark.sql("SELECT * FROM demo.demo_ns.demo_table_1").show()

+-------+---+----------+--------+------+
|   name| id|      date|comments|salary|
+-------+---+----------+--------+------+
|  Alice|  1|2018-04-02|    NULL|  NULL|
|    Bob|  2|2020-09-07|    NULL|  NULL|
|Charlie|  3|2022-07-01|    NULL|  NULL|
|  David|  4|2018-04-02|    NULL|  NULL|
|   John|  5|2020-09-07|    NULL|  NULL|
|  Jonas|  6|2022-07-01|    NULL|  NULL|
+-------+---+----------+--------+------+

