# Lakehousing of HEP data using Apache Iceberg

Jayjeet Chakraborty, University of California, Santa Cruz


## Features of Iceberg:

* Supports transactions
* Hidden Partitioning
* Schema Evolution
* Time Travel and Rollbacks
* Expressive SQL
* Views

## Adding Iceberg to Spark

The Iceberg Spark runtime JAR has to be copied into the Spark installation's JAR directory (here `/opt/spark/jars`).

In [88]:
!ls /opt/spark/jars | grep "iceberg"

iceberg-spark-runtime-3.3_2.12-1.1.0.jar
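
The JAR name encodes the versions it targets, and these need to match the local Spark installation. The following is a minimal sketch for reading them out, assuming the naming pattern `iceberg-spark-runtime-<spark>_<scala>-<iceberg>.jar`; the helper name `parse_runtime_jar` is hypothetical and not part of Iceberg.

```python
import re

# Assumed naming pattern: iceberg-spark-runtime-<spark>_<scala>-<iceberg>.jar
_JAR_RE = re.compile(
    r"^iceberg-spark-runtime-"
    r"(?P<spark>\d+\.\d+)_(?P<scala>\d+\.\d+)-(?P<iceberg>[\w.\-]+)\.jar$"
)


def parse_runtime_jar(filename):
    """Split an Iceberg Spark runtime JAR name into its version components."""
    match = _JAR_RE.match(filename)
    if match is None:
        raise ValueError(f"not an Iceberg Spark runtime JAR: {filename!r}")
    return match.groupdict()


print(parse_runtime_jar("iceberg-spark-runtime-3.3_2.12-1.1.0.jar"))
```

For the JAR listed above this yields Spark 3.3, Scala 2.12, and Iceberg 1.1.0.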


We need to add some configuration options for Spark to pick up Iceberg.

In [89]:
!cat /opt/spark/conf/spark-defaults.conf

spark.sql.extensions                   org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo                 org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.warehouse       warehouse
spark.sql.catalog.demo.type            hadoop
spark.sql.defaultCatalog               demo
spark.eventLog.enabled                 true
spark.eventLog.dir                     /home/iceberg/spark-events
spark.history.fs.logDirectory          /home/iceberg/spark-events
spark.sql.catalogImplementation        in-memory
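
The same options could also be set programmatically when building the session, instead of through `spark-defaults.conf`. The sketch below is one way to do that under a few assumptions: `parse_spark_defaults` and `build_session` are hypothetical helpers, the parser only handles the whitespace-separated form shown above, and `pyspark` is imported lazily so the parsing part runs without a Spark installation. The Iceberg JAR still has to be on Spark's classpath.

```python
def parse_spark_defaults(text):
    """Parse whitespace-separated spark-defaults.conf text into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # key, then the rest of the line as value
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def build_session(conf, app_name="iceberg-demo"):
    """Create a SparkSession with the given options (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import: only needed here

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# The Iceberg-related subset of the configuration shown above.
ICEBERG_CONF = parse_spark_defaults("""
spark.sql.extensions              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo            org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.warehouse  warehouse
spark.sql.catalog.demo.type       hadoop
spark.sql.defaultCatalog          demo
""")

for key, value in ICEBERG_CONF.items():
    print(f"{key} = {value}")

# Usage (assumes pyspark is installed):
# spark = build_session(ICEBERG_CONF)
```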


In [3]:
spark

## Creating an Iceberg table from a Parquet file

We read the NanoEvents data from our Parquet dataset into a Spark DataFrame.

In [35]:
df = spark.read.parquet("dataset")

We now create an Iceberg table out of this DataFrame. First, we drop any existing `hep` table.

In [36]:
%%sql

DROP TABLE IF EXISTS hep;

For the purpose of the demo, we just use a subset of the data.

In [58]:
!ls dataset/ | wc -l

64


In [62]:
!ls spark/queries

query1.sql  query3.sql	query5.sql    query6-2.sql  query8.sql
query2.sql  query4.sql	query6-1.sql  query7.sql


In [57]:
df

DataFrame[run: int, luminosityBlock: bigint, event: bigint, MET: struct<pt:float,phi:float,sumet:float,significance:float,CovXX:float,CovXY:float,CovYY:float>, HLT: struct<IsoMu24_eta2p1:boolean,IsoMu24:boolean,IsoMu17_eta2p1_LooseIsoPFTau20:boolean>, PV: struct<npvs:int,x:float,y:float,z:float>, Muon: array<struct<pt:float,eta:float,phi:float,mass:float,charge:int,pfRelIso03_all:float,pfRelIso04_all:float,tightId:boolean,softId:boolean,dxy:float,dxyErr:float,dz:float,dzErr:float,jetIdx:int,genPartIdx:int>>, Electron: array<struct<pt:float,eta:float,phi:float,mass:float,charge:int,pfRelIso03_all:float,dxy:float,dxyErr:float,dz:float,dzErr:float,cutBasedId:boolean,pfId:boolean,jetIdx:int,genPartIdx:int>>, Photon: array<struct<pt:float,eta:float,phi:float,mass:float,charge:int,pfRelIso03_all:float,jetIdx:int,genPartIdx:int>>, Jet: array<struct<pt:float,eta:float,phi:float,mass:float,puId:boolean,btag:float>>, Tau: array<struct<pt:float,eta:float,phi:float,mass:float,charge:int,decayMode:

In [39]:
df.writeTo("hep").using("iceberg").create()
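
`create()` is expected to fail if the table already exists, which is why the table is dropped beforehand. A possible alternative, sketched here, is `createOrReplace()` from the same DataFrameWriterV2 API. The wrapper name `write_iceberg_table` is hypothetical, and `df` stands for the DataFrame read earlier.

```python
def write_iceberg_table(df, table, replace=True):
    """Write a Spark DataFrame to an Iceberg table via DataFrameWriterV2.

    With replace=True an existing table is replaced, so no prior DROP TABLE
    is needed; with replace=False the call mirrors the notebook's create().
    """
    writer = df.writeTo(table).using("iceberg")
    if replace:
        writer.createOrReplace()
    else:
        writer.create()


# Usage (assumes the `df` DataFrame and Spark session from this notebook):
# write_iceberg_table(df, "hep")
```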

                                                                                

## Data, Metadata, WAL, and Snapshots

Data files

In [40]:
%%sql

SELECT file_path, file_size_in_bytes FROM demo.hep.all_data_files;

file_path,file_size_in_bytes
warehouse/hep/data/00000-89-3e24a524-389a-48f7-a19f-f095ee48ed38-00001.parquet,349541
warehouse/hep/data/00001-90-c2b20351-6b7c-41a0-80ad-c22c79004fe4-00001.parquet,349541
warehouse/hep/data/00002-91-0725c118-80dd-4e72-a2e8-2ad83bfa3d7f-00001.parquet,349541
warehouse/hep/data/00003-92-7a40a502-86be-4901-9532-45c293571f98-00001.parquet,349541
warehouse/hep/data/00004-93-97d5f294-1f90-41e7-b4d1-5d019a640a22-00001.parquet,349541
warehouse/hep/data/00005-94-b8220b97-270c-4fd6-b507-b6ede9986767-00001.parquet,349541
warehouse/hep/data/00006-95-5109e633-f8e0-4634-9080-c04cae3bbb71-00001.parquet,349541
warehouse/hep/data/00007-96-8aad0b58-c49d-4863-aba0-1042a9f91c55-00001.parquet,349541
warehouse/hep/data/00008-97-05c26e57-e5ee-4fc0-91dd-57a37e97f12e-00001.parquet,349541
warehouse/hep/data/00009-98-bd08d38e-909a-4cce-91fb-b0b600b98afe-00001.parquet,349541
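
To get a quick overview of how much data the table holds, the rows of this metadata table can be aggregated. The helper below is a small sketch; `summarize_data_files` is a hypothetical name, and it works on any rows that support `row["file_size_in_bytes"]`, which includes plain dicts and the rows returned by `collect()`.

```python
def summarize_data_files(rows):
    """Return (file_count, total_bytes) for rows with a file_size_in_bytes field."""
    sizes = [row["file_size_in_bytes"] for row in rows]
    return len(sizes), sum(sizes)


# Usage with Spark (assumes the `spark` session and `demo.hep` table above):
# rows = spark.sql(
#     "SELECT file_path, file_size_in_bytes FROM demo.hep.all_data_files"
# ).collect()
# count, total = summarize_data_files(rows)

# Stand-in rows mirroring the first two entries of the listing above.
sample_rows = [
    {"file_path": "warehouse/hep/data/00000-89-3e24a524-389a-48f7-a19f-f095ee48ed38-00001.parquet",
     "file_size_in_bytes": 349541},
    {"file_path": "warehouse/hep/data/00001-90-c2b20351-6b7c-41a0-80ad-c22c79004fe4-00001.parquet",
     "file_size_in_bytes": 349541},
]
count, total = summarize_data_files(sample_rows)
print(f"{count} files, {total} bytes")
```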


Metadata files

In [41]:
%%sql

SELECT * FROM demo.hep.manifests;

content,path,length,partition_spec_id,added_snapshot_id,added_data_files_count,existing_data_files_count,deleted_data_files_count,added_delete_files_count,existing_delete_files_count,deleted_delete_files_count,partition_summaries
0,warehouse/hep/metadata/cfd14711-8f76-4d0c-8132-2db56453e278-m0.avro,13369,0,242691178791709845,16,0,0,0,0,0,[]


Write-ahead log files that enable transactions

In [42]:
%%sql

SELECT * FROM demo.hep.metadata_log_entries;

timestamp,file,latest_snapshot_id,latest_schema_id,latest_sequence_number
2023-02-22 01:08:25.685000,warehouse/hep/metadata/v1.metadata.json,242691178791709845,0,0
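
Each commit adds a new `vN.metadata.json` file, and a writer only succeeds if the table has not changed since it read the metadata. The toy model below illustrates this optimistic commit idea in memory. It is not Iceberg's implementation: `ToyMetadataLog` is a hypothetical class, and the lock merely stands in for whatever atomic operation the catalog provides.

```python
import threading


class ToyMetadataLog:
    """In-memory stand-in for a table's chain of vN.metadata.json files."""

    def __init__(self):
        self._lock = threading.Lock()
        self.entries = ["v1.metadata.json"]  # committed files, oldest first

    @property
    def current_version(self):
        return len(self.entries)

    def commit(self, base_version):
        """Publish v(base+1) only if nobody committed since base_version."""
        with self._lock:  # stands in for the catalog's atomic swap
            if base_version != len(self.entries):
                return False  # conflict: re-read the metadata and retry
            self.entries.append(f"v{base_version + 1}.metadata.json")
            return True


log = ToyMetadataLog()
base = log.current_version      # two writers both start from version 1
print(log.commit(base))         # first writer wins
print(log.commit(base))         # second writer sees a conflict
print(log.entries)
```

The second writer would have to reload the latest metadata and retry its commit on top of version 2.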


Snapshot files

In [43]:
%%sql

SELECT snapshot_id, manifest_list FROM demo.hep.snapshots

snapshot_id,manifest_list
242691178791709845,warehouse/hep/metadata/snap-242691178791709845-1-cfd14711-8f76-4d0c-8132-2db56453e278.avro
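
Snapshot IDs are what the time travel and rollback features operate on. The sketch below builds the corresponding statements as strings. It assumes Spark 3.3 or later for the `VERSION AS OF` syntax and the Iceberg SQL extensions (configured above) for the `rollback_to_snapshot` procedure; the helper names are hypothetical, and the exact table identifier that the procedure expects for this catalog setup is an assumption.

```python
def time_travel_query(table, snapshot_id, columns="*"):
    """Build a SELECT that reads `table` as of the given snapshot."""
    return f"SELECT {columns} FROM {table} VERSION AS OF {int(snapshot_id)}"


def rollback_call(catalog, table, snapshot_id):
    """Build a CALL statement that rolls `table` back to the given snapshot."""
    return (
        f"CALL {catalog}.system.rollback_to_snapshot("
        f"'{table}', {int(snapshot_id)})"
    )


SNAPSHOT_ID = 242691178791709845  # taken from the snapshots listing above

print(time_travel_query("demo.hep", SNAPSHOT_ID, columns="count(*)"))
print(rollback_call("demo", "hep", SNAPSHOT_ID))

# Usage (assumes the `spark` session from this notebook):
# spark.sql(time_travel_query("demo.hep", SNAPSHOT_ID)).show()
```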


## Running ADL benchmark queries on Iceberg

https://github.com/iris-hep/adl-benchmarks-index/

https://arxiv.org/pdf/2104.12615.pdf

### Queries 1 to 8

In [91]:
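import time

# Run each ADL benchmark query against the Iceberg table and time it (seconds).
for query_id in ["1", "2", "3", "4", "5", "6-1", "6-2", "7", "8"]:
    with open(f"spark/queries/query{query_id}.sql") as f:
        # Substitute the {table} placeholder in the query file with our table name.
        query = f.read().replace("{table}", "hep")
    start = time.time()
    resp = spark.sql(query).collect()  # collect() forces the query to execute
    elapsed = time.time() - start
    print(f"Query {query_id}: ", elapsed)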

Query 1:  0.157670259475708
Query 2:  0.22817230224609375
Query 3:  0.257601261138916
Query 4:  0.21548938751220703
Query 5:  0.5015773773193359
Query 6-1:  1.1532421112060547
Query 6-2:  1.092984676361084
Query 7:  0.653714656829834
23/02/22 02:01:48 WARN CharVarcharUtils: The Spark cast operator does not support char/varchar type and simply treats them as string type. Please use string type directly to avoid confusion. Otherwise, you can set spark.sql.legacy.charVarcharAsString to true, so that Spark treat them as string type as same as Spark 3.0 and earlier
Query 8:  0.6282968521118164
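
The timings above come from a single run per query, so they can be skewed by warm-up and caching effects. One way to make them more robust, sketched below, is to repeat each query and report the median. `time_call` is a hypothetical helper, and the stand-in workload only demonstrates the call pattern.

```python
import statistics
import time


def time_call(run, repeats=3):
    """Call run() `repeats` times; return (median_seconds, all_timings)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


# Usage with Spark (assumes `spark` and a `query` string as in the loop above):
# median, timings = time_call(lambda: spark.sql(query).collect())

# Stand-in workload so the sketch runs without Spark.
median, timings = time_call(lambda: sum(range(100_000)), repeats=5)
print(f"median: {median:.6f} s over {len(timings)} runs")
```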
