# Exercises on Big Data File Formats

The following exercises aim to familiarize you with the different file formats and, in particular, to help you understand the specific properties and features of each format.

### CSV and JSON
Classic file formats for data processing  
**Typical properties:** simple structure, human-readable, row-oriented format

### Avro, ORC, Parquet
Formats optimized for big data, designed to read and write large volumes of data quickly  
**Typical properties:** splittable into small files, compressible, skippable, self-describing (schema stored with the data), extensible schema (schema evolution), filter pushdown
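
To make "skippable" and "filter pushdown" more concrete, here is a minimal pure-Python sketch. It is not how Parquet or ORC are actually implemented; it assumes a hypothetical layout in which rows are stored in chunks and each chunk records the minimum and maximum of one column, so a reader can skip chunks whose value range cannot match the filter. The names `Chunk`, `build_chunks` and `scan_greater_than` are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """A group of rows plus min/max statistics for the balance column."""
    rows: List[Tuple[int, str, int]]  # (id, account, balance)
    min_balance: int
    max_balance: int


def build_chunks(rows, chunk_size):
    """Split rows into chunks and record per-chunk statistics."""
    chunks = []
    for start in range(0, len(rows), chunk_size):
        part = rows[start:start + chunk_size]
        balances = [r[2] for r in part]
        chunks.append(Chunk(part, min(balances), max(balances)))
    return chunks


def scan_greater_than(chunks, threshold):
    """Return matching rows and the number of chunks that were skipped."""
    matches, skipped = [], 0
    for chunk in chunks:
        if chunk.max_balance <= threshold:
            skipped += 1  # the statistics prove that no row in this chunk can match
            continue
        matches.extend(r for r in chunk.rows if r[2] > threshold)
    return matches, skipped


rows = [(1, "alex", 1000), (2, "alex", 1500), (3, "alex", 1700), (4, "maria", 5000)]
chunks = build_chunks(rows, chunk_size=2)
matches, skipped = scan_greater_than(chunks, threshold=1600)
print(matches, skipped)
```

With the four sample rows split into two chunks, the first chunk is never scanned because its maximum balance is below the threshold.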

### Delta, Iceberg, Hudi
Extended big data formats that provide the ACID and tracing properties of a classic SQL database  
**Typical properties:** additional metadata and dedicated drivers for reading and writing, time travel, merge and update operations, audit log functionality
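
The time travel idea can be illustrated with a toy model. The sketch below is a strong simplification and not the on-disk format of Delta, Iceberg or Hudi: it assumes that every write appends a commit listing added and removed data files, and that reading version N means replaying commits 0 to N. `ToyTableLog` and the file names are hypothetical.

```python
class ToyTableLog:
    """Append-only commit log; each commit adds and/or removes data files."""

    def __init__(self):
        self.commits = []  # list of {"add": [...], "remove": [...]}

    def commit(self, add=(), remove=()):
        """Record a new table version and return its version number."""
        self.commits.append({"add": list(add), "remove": list(remove)})
        return len(self.commits) - 1

    def files_at(self, version=None):
        """Replay the log up to 'version' (latest if None) to get the live files."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for entry in self.commits[: version + 1]:
            live.update(entry["add"])
            live.difference_update(entry["remove"])
        return sorted(live)


log = ToyTableLog()
v0 = log.commit(add=["part-0.parquet", "part-1.parquet"])            # initial load
v1 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])   # an update rewrites one file

print(log.files_at(v0))  # table state as of the first commit
print(log.files_at())    # current table state
```

Because old data files are only marked as removed in the log, an earlier version stays readable as long as the files themselves are kept.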

## 1. Import Python Modules
All libraries required for this lab are imported here.


In [2]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import pyspark.sql.functions as f
from pyspark.sql.functions import col

from delta import *

from datetime import datetime, timedelta

import json
import csv

# use 95% of the screen for jupyter cell
from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important; }</style>"))

## 2. Launch Spark Jupyter and Configuration

#### Configure a Spark session for Kubernetes cluster with S3 support
### CLUSTER MANAGER
- set the Kubernetes master URL as the cluster manager (“k8s://https://” is NOT a typo; this is how Spark identifies the “provider” type)

### KUBERNETES
- set the namespace that will be used for running the driver and executor pods
- set the Docker image from which the worker/executor pods are created
- set the Kubernetes service account name and provide the authentication details for the service account (required to create worker pods)

### SPARK
- set the driver host and the driver port (find name of the driver service with 'kubectl get services' or in the helm chart configuration)
- enable Delta Lake, Iceberg, and Hudi support by setting the spark.sql.extensions
- configure Hive catalog for Iceberg
- enable S3 connector
- set the number of worker pods, their memory and cores (HINT: number of possible tasks = cores * executors)

### SPARK SESSION
- create the Spark session using the SparkSession.builder object
- get the Spark context from the created session and set the log level to "ERROR".
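
The hint above about the number of possible tasks can be checked with a short calculation. This is a minimal sketch: `parallel_task_slots` is a hypothetical helper, and it assumes the Spark setting `spark.task.cpus` (CPUs reserved per task) is left at its default of 1 unless passed explicitly. The values in the call match the worker settings used in this lab (`spark.executor.instances` = 1, `spark.executor.cores` = 2).

```python
def parallel_task_slots(executor_instances: int, executor_cores: int, task_cpus: int = 1) -> int:
    """Number of tasks that can run at the same time across all executors."""
    return executor_instances * (executor_cores // task_cpus)


# Worker settings from this lab: 1 executor with 2 cores
slots = parallel_task_slots(executor_instances=1, executor_cores=2)
print("parallel task slots:", slots)
```

A DataFrame with more partitions than slots is still processed completely; the remaining tasks simply wait until a slot becomes free.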


In [5]:
# spark.stop()

In [7]:
appName="jupyter-spark"

conf = SparkConf()

# CLUSTER MANAGER

conf.setMaster("k8s://https://kubernetes.default.svc.cluster.local:443")

# CONFIGURE KUBERNETES

conf.set("spark.kubernetes.namespace","frontend")
conf.set("spark.kubernetes.container.image", "thinkportgmbh/workshops:spark-3.3.2")
conf.set("spark.kubernetes.container.image.pullPolicy", "Always")

conf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
conf.set("spark.kubernetes.authenticate.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
conf.set("spark.kubernetes.authenticate.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token")

# CONFIGURE SPARK

conf.set("spark.sql.session.timeZone", "Europe/Berlin")
conf.set("spark.driver.host", "jupyter-spark-driver.frontend.svc.cluster.local")
conf.set("spark.driver.port", "29413")

conf.set("spark.jars", "/opt/spark/jars/spark-avro_2.12-3.3.2.jar")
conf.set("spark.driver.extraClassPath","/opt/spark/jars/spark-avro_2.12-3.3.2.jar")
conf.set("spark.executor.extraClassPath","/opt/spark/jars/spark-avro_2.12-3.3.2.jar")

conf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension, org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions, org.apache.spark.sql.hudi.HoodieSparkSessionExtension")

######## Use Hive as the metastore
conf.set("hive.metastore.uris", "thrift://hive-metastore.hive.svc.cluster.local:9083") 

######## Iceberg configs
conf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
conf.set("spark.sql.catalog.ice","org.apache.iceberg.spark.SparkCatalog") 
conf.set("spark.sql.catalog.ice.type","hive") 
conf.set("spark.sql.catalog.ice.uri","thrift://hive-metastore.hive.svc.cluster.local:9083") 


####### Hudi configs
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

# CONFIGURE S3 CONNECTOR
conf.set("spark.hadoop.fs.s3a.endpoint", "minio.minio.svc.cluster.local:9000")
conf.set("spark.hadoop.fs.s3a.access.key", "trainadm")
conf.set("spark.hadoop.fs.s3a.secret.key", "train@thinkport")
conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")

# CONFIGURE WORKER (Customize based on workload)

conf.set("spark.executor.instances", "1")
conf.set("spark.executor.memory", "1G")
conf.set("spark.executor.cores", "2")

# SPARK SESSION

spark = SparkSession\
    .builder\
    .config(conf=conf) \
    .config('spark.sql.session.timeZone', 'Europe/Berlin') \
    .appName(appName)\
    .enableHiveSupport() \
    .getOrCreate()


sc=spark.sparkContext
sc.setLogLevel("ERROR")

# print selected settings to check the configuration the session was started with
for entry in sc.getConf().getAll():
    if entry[0] in ["spark.app.name","spark.kubernetes.namespace","spark.executor.memory","spark.executor.cores","spark.driver.host","spark.master","spark.sql.extensions"]:
        print(entry[0],"=",entry[1])

spark.kubernetes.namespace = frontend
spark.sql.extensions = io.delta.sql.DeltaSparkSessionExtension, org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions, org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.master = k8s://https://kubernetes.default.svc.cluster.local:443
spark.app.name = jupyter-spark
spark.executor.memory = 1G
spark.executor.cores = 2
spark.driver.host = jupyter-spark-driver.frontend.svc.cluster.local


## Create sample data

In [160]:
# initial data
account_data1 = [
    (1,"alex","2019-01-01",1000),
    (2,"alex","2019-02-01",1500),
    (3,"alex","2019-03-01",1700),
    (4,"maria","2020-01-01",5000)
    ]

# dataset with one update and one new row
account_data2 = [
    (1,"alex","2019-03-01",3300),
    (2,"peter","2021-01-01",100)
    ]

# dataset with a new row and a new column
account_data3 = [
    (1,"otto","2019-10-01",4444,"neue Spalte 1")
]

# dataset with a new row (its id is cast to string further below)
account_data4 = [
    (5,"markus","2019-09-01",555)
]

schema = ["id","account","dt_transaction","balance"]
schema3 = ["id","account","dt_transaction","balance","new"]

df1 = spark.createDataFrame(data=account_data1, schema = schema).withColumn("dt_transaction",col("dt_transaction").cast("date")).repartition(3)
df2 = spark.createDataFrame(data=account_data2, schema = schema).withColumn("dt_transaction",col("dt_transaction").cast("date")).repartition(2)
df3 = spark.createDataFrame(data=account_data3, schema = schema3).withColumn("dt_transaction",col("dt_transaction").cast("date")).repartition(1)
df4 = spark.createDataFrame(data=account_data4, schema = schema).withColumn("dt_transaction",col("dt_transaction").cast("date")).withColumn("id",col("id").cast("string")).repartition(1)


print("++ create new dataframe and show schema and data")
print("################################################")

# df1.printSchema()
df1.show(truncate=False)
df2.show(truncate=False)
df3.show(truncate=False)
df4.show(truncate=False)

++ create new dataframe and show schema and data
################################################
+---+-------+--------------+-------+
|id |account|dt_transaction|balance|
+---+-------+--------------+-------+
|1  |alex   |2019-01-01    |1000   |
|2  |alex   |2019-02-01    |1500   |
|4  |maria  |2020-01-01    |5000   |
|3  |alex   |2019-03-01    |1700   |
+---+-------+--------------+-------+

+---+-------+--------------+-------+
|id |account|dt_transaction|balance|
+---+-------+--------------+-------+
|1  |alex   |2019-03-01    |3300   |
|2  |peter  |2021-01-01    |100    |
+---+-------+--------------+-------+

+---+-------+--------------+-------+-------------+
|id |account|dt_transaction|balance|new          |
+---+-------+--------------+-------+-------------+
|1  |otto   |2019-10-01    |4444   |neue Spalte 1|
+---+-------+--------------+-------+-------------+

+---+-------+--------------+-------+
|id |account|dt_transaction|balance|
+---+-------+--------------+-------+
|5  |markus |20

## Configure boto3

In [11]:
# Helper functions for working with S3 using simple commands (s3 mb s3://fileformats)
import boto3
from botocore.client import Config

# Bucket; it must first be created in MinIO or via a terminal command
bucket = "fileformats"
bucket_path="s3://"+bucket

options = {
    'endpoint_url': 'http://minio.minio.svc.cluster.local:9000',
    'aws_access_key_id': 'trainadm',
    'aws_secret_access_key': 'train@thinkport',
    'config': Config(signature_version='s3v4'),
    'verify': False}

s3_resource = boto3.resource('s3', **options)  
s3_client = boto3.client('s3', **options)

# show files on s3 bucket/prefix
def ls(bucket,prefix):
    '''List objects from bucket/prefix'''
    try:
        for obj in s3_resource.Bucket(bucket).objects.filter(Prefix=prefix):
            print(obj.key)
    except Exception as e: 
        print(e)
    
# show file content in files
def cat(bucket,prefix,binary=False):
    '''Show content of one or several files with same prefix/wildcard'''
    try:
        for obj in s3_resource.Bucket(bucket).objects.filter(Prefix=prefix):
            print("File:",obj.key)
            print("----------------------")
            if binary:
                print(obj.get()['Body'].read())
            else: 
                print(obj.get()['Body'].read().decode())
            print("######################")
    except Exception as e: 
        print(e)

# delete files from bucket
def rm(bucket,prefix):
    '''Delete everything from bucket/prefix'''
    # use obj instead of object to avoid shadowing the Python builtin
    for obj in s3_resource.Bucket(bucket).objects.filter(Prefix=prefix):
        print(obj.key)
        s3_client.delete_object(Bucket=bucket, Key=obj.key)
    print(f"Deleted files from {bucket}/{prefix}*")
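
One detail of these helpers is worth keeping in mind: an S3 prefix is a plain string match on the beginning of the object key, not a directory. The following sketch mimics that behaviour in pure Python without contacting S3. `filter_by_prefix` and the `csv_backup` key are hypothetical and only serve as illustration.

```python
def filter_by_prefix(keys, prefix):
    """Mimic S3 prefix filtering: a plain string match on the start of the key."""
    return [k for k in keys if k.startswith(prefix)]


keys = [
    "csv/_SUCCESS",
    "csv/part-00000.csv",
    "csv_backup/part-00000.csv",  # hypothetical sibling "folder"
    "json/part-00000.json",
]

print(filter_by_prefix(keys, "csv"))   # also matches the csv_backup key
print(filter_by_prefix(keys, "csv/"))  # only keys below csv/
```

For destructive calls such as `rm`, passing the prefix with a trailing slash is therefore the safer choice.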


In [12]:
# show everything in bucket
ls(bucket,"")
print("#############################")
# show folder
ls(bucket,"csv")
print("#############################")
# show subfolder
ls(bucket,"delta/_delta_log/")
print("#############################")
print("")
# show content of one or several files with same prefix/wildcard
cat(bucket,'csv/part')

#############################
#############################
#############################



## CSV

In [13]:
print("Number of Partitions:", df1.rdd.getNumPartitions())

# Write dataset 1 as CSV files
write_csv=(df1
           .write
           .format("csv")
           .mode("overwrite") # append
           .save(f"s3://{bucket}/csv")
          )


Number of Partitions: 3


                                                                                

In [14]:
ls(bucket,"csv")

csv/_SUCCESS
csv/part-00000-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv
csv/part-00001-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv
csv/part-00002-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv


In [15]:
cat(bucket,"csv/part")

File: csv/part-00000-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv
----------------------
1,alex,2019-01-01,1000

######################
File: csv/part-00001-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv
----------------------
2,alex,2019-02-01,1500
4,maria,2020-01-01,5000

######################
File: csv/part-00002-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv
----------------------
3,alex,2019-03-01,1700

######################


In [16]:
# read the CSV files back in and inspect column names and schema
read_csv=spark.read.format("csv").load(f"s3://{bucket}/csv")

read_csv.printSchema()
read_csv.show()

                                                                                

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)

+---+-----+----------+----+
|_c0|  _c1|       _c2| _c3|
+---+-----+----------+----+
|  2| alex|2019-02-01|1500|
|  4|maria|2020-01-01|5000|
|  1| alex|2019-01-01|1000|
|  3| alex|2019-03-01|1700|
+---+-----+----------+----+



In [17]:
# append dataset 3 (new column) to the same table
write_csv=(df3
           .write
           .format("csv")
           .mode("append") # append
           .save(f"s3://{bucket}/csv")
          )

In [18]:
ls(bucket,"csv")

csv/_SUCCESS
csv/part-00000-7e095edf-5062-4284-8dce-b3e018d3289c-c000.csv
csv/part-00000-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv
csv/part-00001-7e095edf-5062-4284-8dce-b3e018d3289c-c000.csv
csv/part-00001-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv
csv/part-00002-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv


In [19]:
cat(bucket,"csv/part")

File: csv/part-00000-7e095edf-5062-4284-8dce-b3e018d3289c-c000.csv
----------------------

######################
File: csv/part-00000-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv
----------------------
1,alex,2019-01-01,1000

######################
File: csv/part-00001-7e095edf-5062-4284-8dce-b3e018d3289c-c000.csv
----------------------
1,otto,2019-10-01,4444,neue Spalte 1

######################
File: csv/part-00001-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv
----------------------
2,alex,2019-02-01,1500
4,maria,2020-01-01,5000

######################
File: csv/part-00002-f2c3a348-3c01-472f-8f2a-cfd618a219cf-c000.csv
----------------------
3,alex,2019-03-01,1700

######################


In [20]:
# read everything in again to check whether the new column was detected correctly
read_csv=spark.read.format("csv").load(f"s3://{bucket}/csv")

read_csv.printSchema()
read_csv.show()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)

+---+-----+----------+----+
|_c0|  _c1|       _c2| _c3|
+---+-----+----------+----+
|  2| alex|2019-02-01|1500|
|  4|maria|2020-01-01|5000|
|  1| otto|2019-10-01|4444|
|  1| alex|2019-01-01|1000|
|  3| alex|2019-03-01|1700|
+---+-----+----------+----+



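#### Takeaways: CSV
* The dataset is split into several files, one per partition
* No schema is stored (neither column names nor types); every column is read back as a string
* New columns cannot be appended: the additional fifth value is dropped when the files are read back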
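
The same effect can be reproduced without Spark. The sketch below uses only Python's standard `csv` module and an in-memory buffer in place of the S3 files, so it illustrates the format itself rather than Spark's reader: nothing in the data records column names, types, or the fact that a later writer added a fifth value.

```python
import csv
import io

rows = [(1, "alex", "2019-01-01", 1000), (2, "alex", "2019-02-01", 1500)]

# Write without a header line, as in the files shown above.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

# A later writer appends a row that carries a fifth value.
csv.writer(buffer).writerow((1, "otto", "2019-10-01", 4444, "neue Spalte 1"))

buffer.seek(0)
read_back = list(csv.reader(buffer))

print(read_back[0])                 # every value comes back as a string
print([len(r) for r in read_back])  # the row lengths differ
```

A reader has to decide on its own how to handle the longer row; in the Spark run above the result has four columns.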

## JSON

### Task:
Repeat the same steps with the JSON format and observe how schema and new columns behave here

1. Write dataset 1 as json (.format("json") and path = .save(f"s3://{bucket}/json"))
2. Display the files and their content, and understand what happened
3. Read the data back in and check whether there is a schema and column names
4. Append dataset 3 (append)
5. Read the data back in and check what happens to the new column


In [31]:
print("Number of Partitions:", df1.rdd.getNumPartitions())

# Write dataset 1 as JSON files

# INSERT YOUR OWN CODE HERE - write to the path s3://{bucket}/json

Number of Partitions: 3


<details>
<summary> &#8964; Solution: write dataset 1 as a JSON file</summary>
<p>
<code>write_json=(df1
           .write
           .format("json")
           .mode("overwrite") # append
           .save(f"s3://{bucket}/json")
          )</code>
</p>
</details>

In [24]:
ls(bucket,"json")

json/_SUCCESS
json/part-00000-00811c1d-25e8-4a3a-b8ef-692095840568-c000.json
json/part-00001-00811c1d-25e8-4a3a-b8ef-692095840568-c000.json
json/part-00002-00811c1d-25e8-4a3a-b8ef-692095840568-c000.json


In [25]:
cat(bucket,"json/part")

File: json/part-00000-00811c1d-25e8-4a3a-b8ef-692095840568-c000.json
----------------------
{"id":1,"account":"alex","dt_transaction":"2019-01-01","balance":1000}

######################
File: json/part-00001-00811c1d-25e8-4a3a-b8ef-692095840568-c000.json
----------------------
{"id":2,"account":"alex","dt_transaction":"2019-02-01","balance":1500}
{"id":4,"account":"maria","dt_transaction":"2020-01-01","balance":5000}

######################
File: json/part-00002-00811c1d-25e8-4a3a-b8ef-692095840568-c000.json
----------------------
{"id":3,"account":"alex","dt_transaction":"2019-03-01","balance":1700}

######################


In [None]:
# Read the data back in and check whether there is a schema and column names

# INSERT YOUR OWN CODE HERE

In [27]:
# append dataset 3 (new column) to the same table (!! append NOT overwrite)

# INSERT YOUR OWN CODE HERE

<details>
<summary> &#8964; Solution: append dataset 3 as a JSON file</summary>
<p>
<code>write_json=(df3
           .write
           .format("json")
           .mode("append") # append
           .save(f"s3://{bucket}/json")
          )</code>
</p>
</details>

In [38]:
# read everything in again and check whether the new column and the schemas were detected correctly

read_json=# INSERT YOUR OWN CODE HERE

read_json.printSchema()
read_json.show()

<details>
<summary> &#8964; Solution: read JSON back in</summary>
<p>
<code>read_json=spark.read.format("json").load(f"s3://{bucket}/json")

read_json.printSchema()
read_json.show()
    </code>
</p>
</details>

#### Takeaways: JSON
* Are column names preserved?
* Is there a schema?
* Can new columns be appended and processed?
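
As a point of comparison for these questions, here is a small sketch that uses only Python's standard `json` module on two hand-written lines modelled on the sample data. It shows what the format itself carries (field names and basic JSON types per record) and one possible way a reader could combine records with different fields. Whether Spark's JSON reader behaves the same way is what the exercise above is meant to show.

```python
import json

# Two JSON lines: the second record carries an additional field.
lines = [
    '{"id":1,"account":"alex","dt_transaction":"2019-01-01","balance":1000}',
    '{"id":1,"account":"otto","dt_transaction":"2019-10-01","balance":4444,"new":"neue Spalte 1"}',
]

records = [json.loads(line) for line in lines]

# Union of all field names, in first-seen order.
columns = []
for record in records:
    for key in record:
        if key not in columns:
            columns.append(key)

# Fill missing fields with None, similar to a null column value.
table = [[record.get(c) for c in columns] for record in records]

print(columns)
print(table)
# Numbers keep their type, but the date is only a string in JSON.
print(type(records[0]["balance"]).__name__, type(records[0]["dt_transaction"]).__name__)
```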

## AVRO
Avro is a row-oriented format optimized for fast writes in streaming contexts.
Avro is self-describing, carries its schema, and supports schema evolution.

### Task:
Repeat the same steps with the AVRO format and observe how schema and new columns behave here

1. Write dataset 1 as avro (.format("avro") and path = .save(f"s3://{bucket}/avro"))
2. Display the files and their content, and understand what happened
3. Identify the metadata in the file
4. Read the data back in and check whether there is a schema and column names
5. Schema evolution: append dataset 3 with the new column
6. Read the data back in and check what happens to the new column
7. Schema enforcement: change the data type of an existing column and see whether and how this is handled

In [51]:
# Write dataset 1 as AVRO files

# INSERT YOUR OWN CODE HERE - write to the path s3://{bucket}/avro


<details>
<summary> &#8964; Solution: write dataset 1 as an AVRO file</summary>
<p>
<code>write_avro=(df1
           .write
           .format("avro")
           .mode("overwrite") # append
           .save(f"s3://{bucket}/avro")
          )</code>
</p>
</details>

In [52]:
ls(bucket,"avro")

avro/_SUCCESS
avro/part-00000-aa567ac2-db28-465e-bd22-686ea9ba20ac-c000.avro
avro/part-00001-aa567ac2-db28-465e-bd22-686ea9ba20ac-c000.avro
avro/part-00002-aa567ac2-db28-465e-bd22-686ea9ba20ac-c000.avro


In [98]:
# Find the metadata and the actual data in the dump of the file
# Avro is a binary format, so the binary flag of cat must be set to True
cat(bucket,"avro/part",True)

In [37]:
read_avro=# INSERT YOUR OWN CODE HERE

read_avro.printSchema()
read_avro.show()

In [39]:
# append dataset 3 (new column) to the same table (!! append NOT overwrite)

# INSERT YOUR OWN CODE HERE

<details>
<summary> &#8964; Solution: append dataset 3 as an AVRO file</summary>
<p>
<code>write_avro=(df3
           .write
           .format("avro")
           .mode("append") # append
           .save(f"s3://{bucket}/avro")
          )</code>
</p>
</details>

In [40]:
# read everything in again and check whether the new column and the schemas were detected correctly
# getting a bit repetitive by now, isn't it?

read_avro=# INSERT YOUR OWN CODE HERE

read_avro.printSchema()
read_avro.show()

<details>
<summary> &#8964; Solution: read AVRO back in</summary>
<p>
<code>read_avro=spark.read.format("avro").load(f"s3://{bucket}/avro")

read_avro.printSchema()
read_avro.show()
    </code>
</p>
</details>

In [117]:
# Add one row (from df2) to the AVRO table, but change the data type of id from long to int
print("Schema before:")
df2.printSchema()


df2a=(df2
           # only the row for peter from df2
           .where(f.col("account")=="peter")
           # id as int instead of long
           .withColumn("id", f.col("id").cast("int"))
          )

print("Schema after:")
df2a.printSchema()


write_avro=(df2a
            .write
            .format("avro")
            #.option("mergeSchema","true")
            .mode("append")
            .save(f"s3://{bucket}/avro")
           )


Schema before:
root
 |-- id: long (nullable = true)
 |-- account: string (nullable = true)
 |-- dt_transaction: date (nullable = true)
 |-- balance: long (nullable = true)

Schema after:
root
 |-- id: integer (nullable = true)
 |-- account: string (nullable = true)
 |-- dt_transaction: date (nullable = true)
 |-- balance: long (nullable = true)



In [118]:
# try to read the directory that contains files with different data types
read_avro=(spark
           .read
           .format("avro")
           .load(f"s3://{bucket}/avro"))


read_avro.printSchema()
read_avro.show()

root
 |-- id: float (nullable = true)
 |-- account: string (nullable = true)
 |-- dt_transaction: date (nullable = true)
 |-- balance: long (nullable = true)

23/06/28 20:59:52 ERROR TaskSetManager: Task 0 in stage 121.0 failed 4 times; aborting job


Py4JJavaError: An error occurred while calling o1573.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 121.0 failed 4 times, most recent failure: Lost task 0.3 in stage 121.0 (TID 221) (10.244.1.7 executor 1): org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[{"name":"id","type":["long","null"]},{"name":"account","type":["string","null"]},{"name":"dt_transaction","type":[{"type":"int","logicalType":"date"},"null"]},{"name":"balance","type":["long","null"]}]} to SQL type STRUCT<id: FLOAT, account: STRING, dt_transaction: DATE, balance: BIGINT>.
	at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:101)
	at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:73)
	at org.apache.spark.sql.avro.AvroFileFormat$$anon$1.<init>(AvroFileFormat.scala:143)
	at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
	at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:154)
	at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:139)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro field 'id' to SQL field 'id' because schema is incompatible (avroType = "long", sqlType = FLOAT)
	at org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:343)
	at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:374)
	at scala.collection.immutable.List.map(List.scala:293)
	at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:371)
	at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:83)
	... 26 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2863)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3084)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
	at jdk.internal.reflect.GeneratedMethodAccessor220.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[{"name":"id","type":["long","null"]},{"name":"account","type":["string","null"]},{"name":"dt_transaction","type":[{"type":"int","logicalType":"date"},"null"]},{"name":"balance","type":["long","null"]}]} to SQL type STRUCT<id: FLOAT, account: STRING, dt_transaction: DATE, balance: BIGINT>.
	at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:101)
	at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:73)
	at org.apache.spark.sql.avro.AvroFileFormat$$anon$1.<init>(AvroFileFormat.scala:143)
	at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
	at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:154)
	at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:139)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	... 1 more
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro field 'id' to SQL field 'id' because schema is incompatible (avroType = "long", sqlType = FLOAT)
	at org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:343)
	at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:374)
	at scala.collection.immutable.List.map(List.scala:293)
	at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:371)
	at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:83)
	... 26 more


#### Takeaways: Avro
* Are column names preserved?
* Is there a schema?
* Schema evolution: can the schema be extended, i.e. can a new column be appended?
* Schema enforcement on write: can a column with the wrong data type simply be added when writing?
* Schema enforcement on read: can a directory with several Avro files be read if one column has a different schema in one of the files?

## Parquet

### Task:
Repeat the same steps with the Parquet format and observe how the schema and new columns behave here.

1. Write dataset 1 as Parquet (.format("parquet") and path = .save(f"s3://{bucket}/parquet"))
2. List the files and their contents and understand what happened
3. Identify the metadata inside a file
4. Read the data back in and check whether a schema and column names exist
5. Schema evolution: append dataset 3, which contains a new column
6. Read the data back in and check what happens to the new column
7. Partition & pushdown filter: display the execution plan for different filters
8. Schema enforcement: change the data type of an existing column and see whether and how this is handled

In [135]:
print("Number of Partitions:", df1.rdd.getNumPartitions())

write_parquet=(df1
           .write
           # Partition by a business key when writing
           .partitionBy("account")
           .format("parquet")
           .mode("overwrite") # alternative: "append"
           .save(f"s3://{bucket}/parquet")
          )


Number of Partitions: 3


                                                                                

In [120]:
#ls(bucket,"parquet")

In [102]:
cat(bucket,"parquet/account=maria",True)

#### Filter Pushdown 

In [121]:
# Load the Parquet data with a partition filter
read_parquet=(spark
              .read.format("parquet")
              .load(f"s3://{bucket}/parquet")
              # Filter on the column the data was partitioned by
              .filter(col("account")=="alex")
             )

# Load the Parquet data with a regular filter
read_parquet2=(spark
              .read.format("parquet")
              .load(f"s3://{bucket}/parquet")
              # Filter on a regular (non-partition) column
              .filter(col("balance")>1500)
             )

# Show the physical execution plan to see which filters are pushed down to the file system or into the Parquet file
print("Partition Filter")
read_parquet.explain("simple")
print("Pushdown Filter")
read_parquet2.explain("simple")


read_parquet.printSchema()
read_parquet.show()

Partition Filter
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#1296L,dt_transaction#1297,balance#1298L,account#1299] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[s3://fileformats/parquet], PartitionFilters: [isnotnull(account#1299), (account#1299 = alex)], PushedFilters: [], ReadSchema: struct<id:bigint,dt_transaction:date,balance:bigint>


Pushdown Filter
== Physical Plan ==
*(1) Filter (isnotnull(balance#1307L) AND (balance#1307L > 1500))
+- *(1) ColumnarToRow
   +- FileScan parquet [id#1305L,dt_transaction#1306,balance#1307L,account#1308] Batched: true, DataFilters: [isnotnull(balance#1307L), (balance#1307L > 1500)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[s3://fileformats/parquet], PartitionFilters: [], PushedFilters: [IsNotNull(balance), GreaterThan(balance,1500)], ReadSchema: struct<id:bigint,dt_transaction:date,balance:bigint>


root
 |-- id: long (nullable = true)
 |-- dt_transaction: date (nullable = true)
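
In the plans above, the filter on account shows up under PartitionFilters, while the filter on balance shows up under PushedFilters. As a rough mental model, a partition filter can be answered from the directory names alone, without opening any file. The following is a plain Python sketch of that idea, not Spark's implementation; the file listing and the helper names `partition_values` and `prune` are made up for illustration.

```python
def partition_values(path):
    """Extract key=value directory segments from a Hive-style partitioned path."""
    values = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


def prune(paths, column, wanted):
    """Keep only files whose partition directory matches the filter value."""
    return [p for p in paths if partition_values(p).get(column) == wanted]


# Hypothetical listing, shaped like the result of partitionBy("account")
files = [
    "parquet/account=alex/part-00000.snappy.parquet",
    "parquet/account=alex/part-00001.snappy.parquet",
    "parquet/account=maria/part-00002.snappy.parquet",
]

selected = prune(files, "account", "alex")
print(selected)
```

A filter on a regular column such as balance cannot be decided this way, which is why it is handed to the Parquet reader instead.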


#### Schema Evolution (Appending a Column)

In [147]:
# Append a row that has a new column
write_parquet=(df3
           .write
           .format("parquet")
           .mode("append")
           # write directly into a new subdirectory, without partitioning
           .save(f"s3://{bucket}/parquet/account=otto")
          )

                                                                                

In [148]:
# Read with the mergeSchema option
read_parquet=(spark
              .read.format("parquet")
              # set mergeSchema to true/false to see the difference when reading
              # Compare: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging
              .option("mergeSchema", "false")
              .load(f"s3://{bucket}/parquet")
             )

read_parquet.printSchema()
read_parquet.show()

root
 |-- id: long (nullable = true)
 |-- dt_transaction: date (nullable = true)
 |-- balance: long (nullable = true)
 |-- account: string (nullable = true)

+---+--------------+-------+-------+
| id|dt_transaction|balance|account|
+---+--------------+-------+-------+
|  1|    2019-10-01|   4444|   otto|
|  1|    2019-01-01|   1000|   alex|
|  2|    2019-02-01|   1500|   alex|
|  3|    2019-03-01|   1700|   alex|
|  4|    2020-01-01|   5000|  maria|
+---+--------------+-------+-------+
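
With mergeSchema set to false, the column `new` written into the otto directory does not appear in the schema above. Conceptually, schema merging takes the union of the per-file schemas. The sketch below models this in plain Python with dictionaries of column name to type; `merge_schemas` is a hypothetical helper, and Spark's real merge logic covers more cases (nested types, nullability) than shown here.

```python
def merge_schemas(schemas):
    """Union per-file schemas (column -> type); fail on conflicting types."""
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                raise ValueError(
                    f"cannot merge column '{column}': {merged[column]} vs {dtype}"
                )
            merged.setdefault(column, dtype)
    return merged


# File schemas as seen in this lab (the partition column account lives in the path)
base = {"id": "long", "dt_transaction": "date", "balance": "long"}
with_new = {**base, "new": "string"}

merged = merge_schemas([base, with_new])
print(merged)
```

The same function raises an error when one file stores id as a string, which mirrors the kind of conflict provoked in the schema enforcement step.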



#### Schema Enforcement

In [145]:
# Build a record with the wrong data type (id cast to string)
df2a=(df2.where(f.col("account")=="peter").withColumn("id", f.col("id").cast("string")))


# Append the row with the wrong type
write_parquet=(df2a
           .write
           .partitionBy("account")
           .format("parquet")
           # mergeSchema is a read option for Parquet; it is presumably ignored on this write
           .option("mergeSchema", "true")
           .mode("append")
           .save(f"s3://{bucket}/parquet")
          )

                                                                                

In [146]:
# Read with the mergeSchema option
read_parquet=(spark
              .read.format("parquet")
              # set mergeSchema to true/false to see the difference when reading
              .option("mergeSchema", "false")
              .load(f"s3://{bucket}/parquet")
             )

read_parquet.printSchema()
read_parquet.show()

root
 |-- id: long (nullable = true)
 |-- dt_transaction: date (nullable = true)
 |-- balance: long (nullable = true)
 |-- account: string (nullable = true)



[Stage 185:>                                                        (0 + 1) / 1]

23/06/28 21:25:43 ERROR TaskSetManager: Task 0 in stage 185.0 failed 4 times; aborting job


Py4JJavaError: An error occurred while calling o1744.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 185.0 failed 4 times, most recent failure: Lost task 0.3 in stage 185.0 (TID 320) (10.244.1.7 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3://fileformats/parquet/account=peter/part-00001-6326d72d-c04c-42cd-88b1-b6d1c11b3c40.c000.snappy.parquet. Column: [id], Expected: bigint, Found: BINARY
	at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:724)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:278)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:561)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1125)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:187)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:175)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:316)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:212)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
	... 20 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2863)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3084)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
	at jdk.internal.reflect.GeneratedMethodAccessor220.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3://fileformats/parquet/account=peter/part-00001-6326d72d-c04c-42cd-88b1-b6d1c11b3c40.c000.snappy.parquet. Column: [id], Expected: bigint, Found: BINARY
	at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:724)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:278)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:561)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	... 1 more
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1125)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:187)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:175)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:316)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:212)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
	... 20 more


#### Takeaways: Parquet
* Are Parquet files self-describing (do they carry a schema with column names and types)?
* Partitioning and partition discovery: is the data written into directories and recognized as partitions again on read?
* Schema evolution: can the schema be extended, i.e. can a new column be appended?
* Schema enforcement on write: can a column with the wrong data type simply be added when writing?
* Schema enforcement on read: can a directory with several Parquet files be read if one column has a different schema in one of the files?

## Delta

- a **storage layer** that runs on top of existing data lakes
- supports ACID transactions and data versioning
- allows data lineage tracking
- provides optimization for streaming workloads

In [173]:
write_delta=(df1
           .write
           .format("delta")
           #.option("mergeSchema", "false")
           .mode("overwrite") 
           .save(f"s3://{bucket}/delta")
          )

                                                                                

In [39]:
ls(bucket,"delta")

delta/_delta_log/00000000000000000000.json
delta/part-00000-71e8256a-4ae9-4a30-82df-1df4300074fa-c000.snappy.parquet
delta/part-00001-740db38d-df85-4d5f-b675-785b8e5a1ff6-c000.snappy.parquet
delta/part-00002-751ee4f7-7ea7-4731-9df7-9a881d3a97ac-c000.snappy.parquet


In [40]:
ls(bucket,"delta/_delta_log")

delta/_delta_log/00000000000000000000.json


In [41]:
cat(bucket,"delta/_delta_log")

File: delta/_delta_log/00000000000000000000.json
----------------------
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"cc178cae-6af7-410b-9996-f81ccdeb1ab5","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"account\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"dt_transaction\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}},{\"name\":\"balance\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1684095467225}}
{"add":{"path":"part-00000-71e8256a-4ae9-4a30-82df-1df4300074fa-c000.snappy.parquet","partitionValues":{},"size":1236,"modificationTime":1684095469000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":1,\"account\":\"alex\",\"dt_transaction\":\"2019-01-01\",\"balance\":1000},\"maxValues\":{\"id\":1,\"account\":\"alex\",\"dt
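
The log file above is newline-delimited JSON: each line holds one action, keyed by its type (protocol, metaData, add). Note that schemaString is itself a JSON document stored as a string. The following sketch shows how such lines could be decoded with the standard library; the two lines are rebuilt in the shape shown above, with the schema shortened to two fields for brevity.

```python
import json

# Rebuild two log lines in the shape shown above (schema shortened to two fields).
schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "account", "type": "string", "nullable": True, "metadata": {}},
    ],
}
log_lines = [
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    # schemaString is a JSON document stored as a string inside the JSON line
    json.dumps({"metaData": {"schemaString": json.dumps(schema), "partitionColumns": []}}),
]

# One action per line; the single top-level key names the action type
actions = [json.loads(line) for line in log_lines]
action_types = [next(iter(action)) for action in actions]

# Decode the nested schema string into a column -> type mapping
meta = next(a["metaData"] for a in actions if "metaData" in a)
columns = {fld["name"]: fld["type"] for fld in json.loads(meta["schemaString"])["fields"]}

print(action_types)
print(columns)
```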

In [169]:
read_delta=spark.read.format("delta").load(f"s3://{bucket}/delta")
read_delta.show()

                                                                                

+---+-------+--------------+-------+
| id|account|dt_transaction|balance|
+---+-------+--------------+-------+
|  2|   alex|    2019-02-01|   1500|
|  4|  maria|    2020-01-01|   5000|
|  3|   alex|    2019-03-01|   1700|
|  1|   alex|    2019-01-01|   1000|
+---+-------+--------------+-------+



#### Schema Evolution Must Be Explicitly Enabled
Task: experiment with the mergeSchema option

In [174]:
write_delta=(df3
           .write
           .format("delta")
           # With Delta, you choose at write time whether the table schema may be extended; the default is false
           .option("mergeSchema", "true")
           .mode("append")
           .save(f"s3://{bucket}/delta")
          )

                                                                                

In [155]:
cat(bucket,"delta/_delta_log")

In [175]:
read_delta=spark.read.format("delta").load(f"s3://{bucket}/delta")
read_delta.show()

                                                                                

+---+-------+--------------+-------+-------------+
| id|account|dt_transaction|balance|          new|
+---+-------+--------------+-------+-------------+
|  1|   otto|    2019-10-01|   4444|neue Spalte 1|
|  2|   alex|    2019-02-01|   1500|         null|
|  4|  maria|    2020-01-01|   5000|         null|
|  3|   alex|    2019-03-01|   1700|         null|
|  1|   alex|    2019-01-01|   1000|         null|
+---+-------+--------------+-------+-------------+



#### Schema Enforcement
Task: experiment with the mergeSchema option

In [178]:
write_delta=(df2
           .where(f.col("account")=="peter")
           # change the column type
           .withColumn("id", f.col("id").cast("string"))
           .write
           .format("delta")
           # With Delta, you choose at write time whether the table schema may be extended; the default is false
           .option("mergeSchema", "false")
           .mode("overwrite") # alternative: "append"
           .save(f"s3://{bucket}/delta")
          )

AnalysisException: Failed to merge fields 'id' and 'id'. Failed to merge incompatible data types LongType and StringType
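
The exception above shows that the write is checked against the table schema before any data is committed. A simplified model of that check, in plain Python: new columns are accepted only when schema evolution is switched on, and a changed type for an existing column is rejected either way. `check_write` is a hypothetical helper; real Delta has further rules (for example the overwriteSchema option on overwrite) that are not modelled here.

```python
def check_write(table_schema, data_schema, merge_schema=False):
    """Return the resulting table schema, or raise if the write must be rejected."""
    result = dict(table_schema)
    for column, dtype in data_schema.items():
        if column not in table_schema:
            if not merge_schema:
                raise ValueError(f"new column '{column}' but schema evolution is off")
            result[column] = dtype  # schema evolution: extend the table schema
        elif table_schema[column] != dtype:
            raise ValueError(
                f"incompatible types for '{column}': {table_schema[column]} vs {dtype}"
            )
    return result


table = {"id": "long", "account": "string", "dt_transaction": "date", "balance": "long"}

# Appending a new column works only with merge_schema=True
evolved = check_write(table, {**table, "new": "string"}, merge_schema=True)
print(list(evolved))

# Changing id from long to string is rejected
try:
    check_write(table, {**table, "id": "string"})
except ValueError as err:
    print("rejected:", err)
```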

#### History and Audit Trail

In [179]:
deltaTable = DeltaTable.forPath(spark, f"s3://{bucket}/delta")
# --> probably a Delta version that does not match the Spark version
fullHistoryDF = deltaTable.history()    # get the full history of the table

fullHistoryDF.printSchema()

root
 |-- version: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- userId: string (nullable = true)
 |-- userName: string (nullable = true)
 |-- operation: string (nullable = true)
 |-- operationParameters: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- job: struct (nullable = true)
 |    |-- jobId: string (nullable = true)
 |    |-- jobName: string (nullable = true)
 |    |-- runId: string (nullable = true)
 |    |-- jobOwnerId: string (nullable = true)
 |    |-- triggerType: string (nullable = true)
 |-- notebook: struct (nullable = true)
 |    |-- notebookId: string (nullable = true)
 |-- clusterId: string (nullable = true)
 |-- readVersion: long (nullable = true)
 |-- isolationLevel: string (nullable = true)
 |-- isBlindAppend: boolean (nullable = true)
 |-- operationMetrics: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- userMetadata: string (nullable =

In [158]:
# Show important fields from the history
# Bonus: flatten the nested columns into separate columns
fullHistoryDF.select("version","readVersion","timestamp","userId","operation","operationParameters","operationMetrics","userMetadata").show(truncate=True)

+-------+-----------+-------------------+------+---------+--------------------+--------------------+------------+
|version|readVersion|          timestamp|userId|operation| operationParameters|    operationMetrics|userMetadata|
+-------+-----------+-------------------+------+---------+--------------------+--------------------+------------+
|      2|          1|2023-06-28 23:36:05|  null|    WRITE|{mode -> Overwrit...|{numFiles -> 2, n...|        null|
|      1|          0|2023-06-28 23:34:35|  null|    WRITE|{mode -> Append, ...|{numFiles -> 2, n...|        null|
|      0|       null|2023-06-28 23:34:24|  null|    WRITE|{mode -> Overwrit...|{numFiles -> 3, n...|        null|
+-------+-----------+-------------------+------+---------+--------------------+--------------------+------------+



In [159]:
spark.read.format("delta").load(f"s3://{bucket}/delta").show()

                                                                                

+---+-------+--------------+-------+-------------+
| id|account|dt_transaction|balance|          new|
+---+-------+--------------+-------+-------------+
|  1|   otto|    2019-10-01|   4444|neue Spalte 1|
+---+-------+--------------+-------+-------------+



## Delta: Time travel
Task: read different versions, selected by version (versionAsOf) and by date (timestampAsOf)

In [180]:
spark.read.format("delta").option("versionAsOf", "1").load(f"s3://{bucket}/delta").show()




+---+-------+--------------+-------+-------------+
| id|account|dt_transaction|balance|          new|
+---+-------+--------------+-------+-------------+
|  1|   otto|    2019-10-01|   4444|neue Spalte 1|
+---+-------+--------------+-------+-------------+
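
Time travel is possible because every commit in _delta_log records which data files were added and which were removed. Reading version N then amounts to replaying commits 0 to N. The sketch below illustrates this in plain Python; the commits and file names are hypothetical, `files_at_version` is a made-up helper, and checkpoints are ignored.

```python
def files_at_version(commits, version):
    """Replay add/remove actions of commits 0..version and return the live files."""
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


# Hypothetical commits: initial write, append, then an overwrite
commits = [
    [{"add": {"path": "a.parquet"}}, {"add": {"path": "b.parquet"}}],
    [{"add": {"path": "c.parquet"}}],
    [
        {"remove": {"path": "a.parquet"}},
        {"remove": {"path": "b.parquet"}},
        {"remove": {"path": "c.parquet"}},
        {"add": {"path": "d.parquet"}},
    ],
]

for v in range(len(commits)):
    print(v, files_at_version(commits, v))
```

An overwrite therefore does not delete old files immediately; it only marks them as removed in the log, so earlier versions remain readable.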



                                                                                

## Delta: Merge

In [182]:
deltaTable2 = DeltaTable.forPath(spark, f"s3://{bucket}/delta")


df2a=df2.withColumn("new",f.lit("test"))
df2a.show()
deltaTable2.toDF().show()

+---+-------+--------------+-------+----+
| id|account|dt_transaction|balance| new|
+---+-------+--------------+-------+----+
|  1|   alex|    2019-03-01|   3300|test|
|  2|  peter|    2021-01-01|    100|test|
+---+-------+--------------+-------+----+



                                                                                

+---+-------+--------------+-------+-------------+
| id|account|dt_transaction|balance|          new|
+---+-------+--------------+-------+-------------+
|  1|   otto|    2019-10-01|   4444|neue Spalte 1|
|  2|   alex|    2019-02-01|   1500|         null|
|  4|  maria|    2020-01-01|   5000|         null|
|  3|   alex|    2019-03-01|   1700|         null|
|  1|   alex|    2019-01-01|   1000|         null|
+---+-------+--------------+-------+-------------+



In [183]:
dt3=(deltaTable2.alias("oldData")
      .merge(df2a.alias("newData"),
            "oldData.account = newData.account AND oldData.dt_transaction = newData.dt_transaction")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
      .execute()
    )

deltaTable2.toDF().show()

                                                                                

+---+-------+--------------+-------+-------------+
| id|account|dt_transaction|balance|          new|
+---+-------+--------------+-------+-------------+
|  1|   otto|    2019-10-01|   4444|neue Spalte 1|
|  1|   alex|    2019-03-01|   3300|         test|
|  2|  peter|    2021-01-01|    100|         test|
|  2|   alex|    2019-02-01|   1500|         null|
|  4|  maria|    2020-01-01|   5000|         null|
|  1|   alex|    2019-01-01|   1000|         null|
+---+-------+--------------+-------+-------------+
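
The merge above is an upsert: rows that match on account and dt_transaction are replaced by the incoming row, and all other incoming rows are inserted. The same semantics can be sketched in plain Python using the rows from this lab; `merge` is a hypothetical helper and makes no claim about how Delta executes the operation internally.

```python
def merge(target, source, keys):
    """Upsert: update all columns of matching rows, insert rows without a match."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    updates = {key_of(row): row for row in source}
    matched = set()
    result = []
    for row in target:
        k = key_of(row)
        if k in updates:
            result.append(updates[k])  # like whenMatchedUpdateAll
            matched.add(k)
        else:
            result.append(row)
    # like whenNotMatchedInsertAll: add source rows that matched nothing
    result.extend(row for k, row in updates.items() if k not in matched)
    return result


cols = ("id", "account", "dt_transaction", "balance", "new")
target = [dict(zip(cols, r)) for r in [
    (1, "otto", "2019-10-01", 4444, "neue Spalte 1"),
    (2, "alex", "2019-02-01", 1500, None),
    (4, "maria", "2020-01-01", 5000, None),
    (3, "alex", "2019-03-01", 1700, None),
    (1, "alex", "2019-01-01", 1000, None),
]]
source = [dict(zip(cols, r)) for r in [
    (1, "alex", "2019-03-01", 3300, "test"),
    (2, "peter", "2021-01-01", 100, "test"),
]]

merged = merge(target, source, keys=("account", "dt_transaction"))
for row in merged:
    print(row)
```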



#### Takeaways: Delta
* How does the metadata management work, and what is stored in the Delta log?
* Schema evolution: can the schema be extended, i.e. can a new column be appended?
* Schema enforcement on write: can a column with the wrong data type simply be added when writing?
* Schema enforcement on read: can a directory with several Parquet files be read if one column has a different schema in one of the files? (This question takes some lateral thinking.)

## Iceberg

- a **table format**
- supports schema evolution and provides a portable table metadata format
- best suited for analytical workloads

In [51]:
print("Current Catalog:",spark.catalog.currentDatabase())
print("List Catalogs:",spark.catalog.listDatabases())
print("List Tables in current Catalog:",spark.catalog.listTables())

Current Catalog: default
List Catalogs: [Database(name='default', description='Default Hive database', locationUri='file:/home/hive/warehouse')]
List Tables in current Catalog: []


In [61]:
# create a Database(name=<db_name>, locationUri='s3a://<bucket>/')
spark.sql(f"CREATE DATABASE iceberg_db LOCATION 's3a://{bucket}/'")

DataFrame[]

In [62]:
### show databases and tables in the iceberg catalog (it only sees iceberg-formatted tables)
# all databases from hive are shown
spark.sql("SHOW databases from ice").show()

+----------+
| namespace|
+----------+
|   default|
|iceberg_db|
+----------+



In [63]:
spark.sql("show tables from iceberg_db").show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
+---------+---------+-----------+



In [None]:
#### Delete Iceberg tables: first drop the table 
#spark.sql("drop table iceberg_db.iceberg_table")
#delete_objects("aleks-test", "iceberg_table")

In [64]:
write_iceberg=(df1
                  .write
                  .format("iceberg")
                  .mode("overwrite")
                  .saveAsTable("iceberg_db.iceberg")
               )


                                                                                

In [66]:
ls(bucket,"iceberg")

iceberg/data/00000-545-3b1bf31a-0ecb-4245-8a30-5d6b4de18962-00001.parquet
iceberg/data/00000-550-7fa8b7a8-5984-487f-b524-9753864c2808-00001.parquet
iceberg/data/00001-546-96c2733b-2dc9-4e04-b06d-1c2cbf6fe159-00001.parquet
iceberg/data/00001-551-6cfea253-4caa-4fb6-bc7e-e3d9f956388b-00001.parquet
iceberg/data/00002-547-3a8580fe-e207-4cf4-8015-2e6f91d99deb-00001.parquet
iceberg/metadata/00000-2346cea3-3db3-46a4-bb55-419ae993156b.metadata.json
iceberg/metadata/00001-70cc0e6d-feed-4202-bf2a-587f8dd81bbe.metadata.json
iceberg/metadata/5dc0a4c4-3c72-4d16-a200-c299b635415c-m0.avro
iceberg/metadata/73a18816-64a1-4119-8453-bb671b8abfb3-m0.avro
iceberg/metadata/snap-2282180466624073266-1-73a18816-64a1-4119-8453-bb671b8abfb3.avro
iceberg/metadata/snap-4263160168885610306-1-5dc0a4c4-3c72-4d16-a200-c299b635415c.avro


In [65]:

write_iceberg=(df2
                   .write
                   .format("iceberg")
                   .mode("append") # append
                   .saveAsTable("iceberg_db.iceberg")
                 )

In [67]:
ls(bucket,"iceberg")

iceberg/data/00000-545-3b1bf31a-0ecb-4245-8a30-5d6b4de18962-00001.parquet
iceberg/data/00000-550-7fa8b7a8-5984-487f-b524-9753864c2808-00001.parquet
iceberg/data/00001-546-96c2733b-2dc9-4e04-b06d-1c2cbf6fe159-00001.parquet
iceberg/data/00001-551-6cfea253-4caa-4fb6-bc7e-e3d9f956388b-00001.parquet
iceberg/data/00002-547-3a8580fe-e207-4cf4-8015-2e6f91d99deb-00001.parquet
iceberg/metadata/00000-2346cea3-3db3-46a4-bb55-419ae993156b.metadata.json
iceberg/metadata/00001-70cc0e6d-feed-4202-bf2a-587f8dd81bbe.metadata.json
iceberg/metadata/5dc0a4c4-3c72-4d16-a200-c299b635415c-m0.avro
iceberg/metadata/73a18816-64a1-4119-8453-bb671b8abfb3-m0.avro
iceberg/metadata/snap-2282180466624073266-1-73a18816-64a1-4119-8453-bb671b8abfb3.avro
iceberg/metadata/snap-4263160168885610306-1-5dc0a4c4-3c72-4d16-a200-c299b635415c.avro


In [69]:
cat(bucket,"iceberg/metadata/00000-2346cea3-3db3-46a4-bb55-419ae993156b.metadata.json",False)

File: iceberg/metadata/00000-2346cea3-3db3-46a4-bb55-419ae993156b.metadata.json
----------------------
{
  "format-version" : 1,
  "table-uuid" : "c1a84965-5daa-434b-8f3a-3f27cd91ac55",
  "location" : "s3a://fileformats//iceberg",
  "last-updated-ms" : 1684152769085,
  "last-column-id" : 4,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 2,
      "name" : "account",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 3,
      "name" : "dt_transaction",
      "required" : false,
      "type" : "date"
    }, {
      "id" : 4,
      "name" : "balance",
      "required" : false,
      "type" : "long"
    } ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 2,
      "n

In [70]:
#ALTER TABLE myTable ADD COLUMNS (address VARCHAR) - the number of columns in the df3 does not match the schema of the table, so we modify the schema of the existing table
spark.sql("ALTER TABLE iceberg_db.iceberg ADD COLUMNS (new VARCHAR(50))")

DataFrame[]

In [71]:
write_iceberg=(df3
                  .write
                  .format("iceberg")
                  .mode("append") # append
                  .option("schema", schema3)
                  .saveAsTable("iceberg_db.iceberg")
                  )

In [72]:
## Read Iceberg table:

iceberg_df = spark.read.table("iceberg_db.iceberg")
iceberg_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- account: string (nullable = true)
 |-- dt_transaction: date (nullable = true)
 |-- balance: long (nullable = true)
 |-- new: string (nullable = true)



In [73]:

spark.sql("SELECT * FROM iceberg_db.iceberg.history;").show()
spark.sql("SELECT * FROM iceberg_db.iceberg.files;").show()
spark.sql("SELECT * FROM iceberg_db.iceberg.snapshots;").show()

## alternative syntax example:
# spark.read.format("iceberg").load("iceberg_db.iceberg.files").show()


+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2023-05-15 14:12:...|2282180466624073266|               null|               true|
|2023-05-15 14:13:...|4263160168885610306|2282180466624073266|               true|
|2023-05-15 14:16:...|3806458280092384569|4263160168885610306|               true|
+--------------------+-------------------+-------------------+-------------------+



                                                                                

+-------+--------------------+-----------+-------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+
|content|           file_path|file_format|spec_id|record_count|file_size_in_bytes|        column_sizes|        value_counts|   null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|
+-------+--------------------+-----------+-------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+
|      0|s3a://fileformats...|    PARQUET|      0|           1|              1534|{1 -> 52, 2 -> 55...|{1 -> 1, 2 -> 1, ...|{1 -> 0, 2 -> 0, ...|              {}|{1 ->        , 2...|{1 ->        , 2...|        null|      

### Iceberg: Time Travel
- ```snapshot-id``` selects a specific table snapshot
- ```as-of-timestamp``` selects the current snapshot at a timestamp, in milliseconds
- ```branch``` selects the head snapshot of the specified branch. Note that currently branch cannot be combined with as-of-timestamp.
- ```tag``` selects the snapshot associated with the specified tag. Tags cannot be combined with as-of-timestamp.
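
The `as-of-timestamp` option expects epoch milliseconds rather than a date string. A minimal sketch of the conversion, assuming a timezone-aware `datetime`; the helper names `to_epoch_millis` and `time_travel_options` are made up for this example and are not part of Iceberg or PySpark:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer floor division of two timedeltas avoids float rounding
    return (moment - EPOCH) // timedelta(milliseconds=1)


def time_travel_options(moment: datetime) -> dict:
    """Hypothetical helper: reader options for the snapshot that was current at `moment`."""
    return {"as-of-timestamp": str(to_epoch_millis(moment))}


# 2023-05-15 12:12:49.085 UTC corresponds to the "last-updated-ms" value in the metadata file above
options = time_travel_options(datetime(2023, 5, 15, 12, 12, 49, 85000, tzinfo=timezone.utc))
print(options)

# With a running Spark session the options would be passed to the reader, e.g.:
# spark.read.format("iceberg").options(**options).load("iceberg_db.iceberg").show()
```

Rejecting naive datetimes is a deliberate choice here: the local timezone of the driver would otherwise silently shift the selected snapshot.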

In [74]:
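# take the snapshot IDs from the history/snapshots output of iceberg_db.iceberg above
snapshot1 = spark.read \
                 .option("snapshot-id", "2282180466624073266") \
                 .format("iceberg") \
                 .load("iceberg_db.iceberg")
snapshot1.show()  # show() returns None, so it is called separately to keep the DataFrame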

+---+-------+--------------+-------+
| id|account|dt_transaction|balance|
+---+-------+--------------+-------+
|  1|   alex|    2019-01-01|   1000|
|  2|   alex|    2019-02-01|   1500|
|  4|  maria|    2020-01-01|   5000|
|  3|   alex|    2019-03-01|   1700|
+---+-------+--------------+-------+



                                                                                

In [76]:
snapshot2 = spark.read \
                 .option("snapshot-id", "4263160168885610306") \
                 .format("iceberg") \
                 .load("iceberg_db.iceberg")
snapshot2.show()  # show() returns None, so it is called separately to keep the DataFrame

+---+-------+--------------+-------+
| id|account|dt_transaction|balance|
+---+-------+--------------+-------+
|  1|   alex|    2019-01-01|   1000|
|  2|   alex|    2019-02-01|   1500|
|  4|  maria|    2020-01-01|   5000|
|  3|   alex|    2019-03-01|   1700|
|  1|   alex|    2019-03-01|   3300|
|  2|  peter|    2021-01-01|    100|
+---+-------+--------------+-------+



In [81]:
tsToExpire = f.current_timestamp() - timedelta(minutes=10)
print(tsToExpire)

Column<'(current_timestamp() - INTERVAL '0 00:10:00' DAY TO SECOND)'>


In [82]:
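## expireSnapshots() belongs to the Iceberg Java API and needs an Iceberg Table object
## Does not work yet, not understood so far: no `table` object exists in this PySpark session
table.expireSnapshots().expireOlderThan(tsToExpire).commit()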

NameError: name 'table' is not defined
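
The failing call above uses the Iceberg Java API, and `tsToExpire` is a Spark `Column` expression rather than a concrete timestamp. From Spark SQL, Iceberg also provides an `expire_snapshots` stored procedure that is invoked with `CALL`; it requires the Iceberg SQL extensions to be enabled in the session. The following is only a sketch of how such a statement could be assembled: the helper `build_expire_snapshots_sql` and the catalog name `my_catalog` are assumptions, since the catalog configured for this lab is not visible in this section.

```python
from datetime import datetime, timedelta


def build_expire_snapshots_sql(catalog: str, table: str, older_than: datetime) -> str:
    """Hypothetical helper: build a CALL statement for Iceberg's expire_snapshots procedure."""
    # Millisecond precision, e.g. 2023-05-15 14:06:00.000
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{ts}')"
    )


# Expire everything older than ten minutes, computed on the driver as a plain datetime
cutoff = datetime.now() - timedelta(minutes=10)
statement = build_expire_snapshots_sql("my_catalog", "iceberg_db.iceberg", cutoff)
print(statement)

# With a running Spark session (and the assumed catalog name replaced by the real one):
# spark.sql(statement).show()
```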

# Hudi

- a **storage abstraction layer** 
- enables data ingestion and query capability on large-scale, evolving datasets
- well-suited for real-time streaming workloads and batch processing

In [83]:
# adjust the partition path if needed, e.g. "id/dt_transaction"
record_key = "id"
partition_path = "id"
table_name = "hudi"  # assumed name; the original passed the DataFrame df1 here, but a table name must be a string

hudi_options = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.recordkey.field": record_key,
    "hoodie.datasource.write.partitionpath.field": partition_path,
    "hoodie.datasource.write.table.name": table_name,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "ts",  # Hudi uses this field to resolve conflicts between records with the same key (here: id)
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2
}
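
How the record key and the precombine field interact during an upsert can be illustrated without Spark. The sketch below is a deliberately simplified model, not Hudi's implementation: the function `upsert` and the integer `ts` values are invented for the example, and the real outcome depends on the payload/merge configuration of the Hudi version in use.

```python
def upsert(stored, incoming, record_key="id", precombine="ts"):
    """Merge incoming rows into stored rows, keeping one row per record key.

    When two rows share a key, the row with the larger precombine value wins.
    """
    merged = {row[record_key]: row for row in stored}
    for row in incoming:
        current = merged.get(row[record_key])
        if current is None or row[precombine] >= current[precombine]:
            merged[row[record_key]] = row
    return sorted(merged.values(), key=lambda row: row[record_key])


stored = [
    {"id": 1, "account": "alex", "balance": 1000, "ts": 1},
    {"id": 4, "account": "maria", "balance": 5000, "ts": 1},
]
incoming = [
    {"id": 1, "account": "otto", "balance": 4444, "ts": 3},  # same key, newer ts: replaces alex
    {"id": 5, "account": "anna", "balance": 2000, "ts": 4},  # new key: inserted
]

result = upsert(stored, incoming)
print([(row["id"], row["account"]) for row in result])
```

Because `id` is the record key, repeated writes for the same `id` end up as a single row in the table.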

In [84]:
# "ts" is the precombine field configured in hudi_options; the data has no such column, so one is added with current_timestamp()
write_hudi=(df1.withColumn("ts", f.current_timestamp()).write.format("hudi")
               .options(**hudi_options)
               .mode("overwrite")
               .save(f"s3://{bucket}/hudi")
               )

                                                                                

In [85]:
ls(bucket,"hudi")

hudi/.hoodie/.aux/.bootstrap/.fileids/
hudi/.hoodie/.aux/.bootstrap/.partitions/
hudi/.hoodie/.schema/
hudi/.hoodie/.temp/
hudi/.hoodie/20230515122157032.commit
hudi/.hoodie/20230515122157032.commit.requested
hudi/.hoodie/20230515122157032.inflight
hudi/.hoodie/archived/
hudi/.hoodie/hoodie.properties
hudi/.hoodie/metadata/.hoodie/.aux/.bootstrap/.fileids/
hudi/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/
hudi/.hoodie/metadata/.hoodie/.schema/
hudi/.hoodie/metadata/.hoodie/.temp/
hudi/.hoodie/metadata/.hoodie/00000000000000.deltacommit
hudi/.hoodie/metadata/.hoodie/00000000000000.deltacommit.inflight
hudi/.hoodie/metadata/.hoodie/00000000000000.deltacommit.requested
hudi/.hoodie/metadata/.hoodie/20230515122157032.deltacommit
hudi/.hoodie/metadata/.hoodie/20230515122157032.deltacommit.inflight
hudi/.hoodie/metadata/.hoodie/20230515122157032.deltacommit.requested
hudi/.hoodie/metadata/.hoodie/archived/
hudi/.hoodie/metadata/.hoodie/hoodie.properties
hudi/.hoodie/metadata/files/.

In [86]:
# append df2; the "ts" precombine column is added again with current_timestamp()
write_hudi=(df2.withColumn("ts", f.current_timestamp()).write.format("hudi")
               .options(**hudi_options)
               .mode("append")
               .save(f"s3://{bucket}/hudi")
               )




In [87]:
# append df3 (it carries the additional column "new"); the "ts" precombine column is added again
write_hudi=(df3.withColumn("ts", f.current_timestamp()).write.format("hudi")
               .options(**hudi_options)
               .mode("append")
               .save(f"s3://{bucket}/hudi")
               )

                                                                                

In [88]:
df_hudi = spark.read.format("hudi").load(f"s3://{bucket}/hudi")
df_hudi.show()  # show() returns None, so it is called separately to keep the DataFrame

+-------------------+--------------------+------------------+----------------------+--------------------+-------+--------------+-------+-------------+--------------------+---+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|account|dt_transaction|balance|          new|                  ts| id|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+--------------+-------+-------------+--------------------+---+
|  20230515122339203|20230515122339203...|                 1|                     1|cf0f8517-23b9-441...|   otto|    2019-10-01|   4444|neue Spalte 1|2023-05-15 14:23:...|  1|
|  20230515122157032|20230515122157032...|                 4|                     4|ee5e9af9-3f2d-471...|  maria|    2020-01-01|   5000|         null|2023-05-15 14:21:...|  4|
|  20230515122321620|20230515122321620...|                 2|                     2|0cd93c57-fb35-4ff...|  peter|    202

In [89]:
ls(bucket,"hudi")

hudi/.hoodie/.aux/.bootstrap/.fileids/
hudi/.hoodie/.aux/.bootstrap/.partitions/
hudi/.hoodie/.schema/
hudi/.hoodie/.temp/
hudi/.hoodie/20230515122157032.commit
hudi/.hoodie/20230515122157032.commit.requested
hudi/.hoodie/20230515122157032.inflight
hudi/.hoodie/20230515122321620.commit
hudi/.hoodie/20230515122321620.commit.requested
hudi/.hoodie/20230515122321620.inflight
hudi/.hoodie/20230515122339203.commit
hudi/.hoodie/20230515122339203.commit.requested
hudi/.hoodie/20230515122339203.inflight
hudi/.hoodie/archived/
hudi/.hoodie/hoodie.properties
hudi/.hoodie/metadata/.hoodie/.aux/.bootstrap/.fileids/
hudi/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/
hudi/.hoodie/metadata/.hoodie/.schema/
hudi/.hoodie/metadata/.hoodie/.temp/
hudi/.hoodie/metadata/.hoodie/00000000000000.deltacommit
hudi/.hoodie/metadata/.hoodie/00000000000000.deltacommit.inflight
hudi/.hoodie/metadata/.hoodie/00000000000000.deltacommit.requested
hudi/.hoodie/metadata/.hoodie/20230515122157032.deltacommit
hudi

### Hudi: Time Travel

In [90]:
## Read the table as of a commit instant (the values come from the _hoodie_commit_time column)

spark.read.format("hudi")\
     .option("as.of.instant", "20230515122339203")\
     .load(f"s3://{bucket}/hudi").show()

                                                                                

+-------------------+--------------------+------------------+----------------------+--------------------+-------+--------------+-------+-------------+--------------------+---+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|account|dt_transaction|balance|          new|                  ts| id|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+--------------+-------+-------------+--------------------+---+
|  20230515122339203|20230515122339203...|                 1|                     1|cf0f8517-23b9-441...|   otto|    2019-10-01|   4444|neue Spalte 1|2023-05-15 14:23:...|  1|
|  20230515122157032|20230515122157032...|                 4|                     4|ee5e9af9-3f2d-471...|  maria|    2020-01-01|   5000|         null|2023-05-15 14:21:...|  4|
|  20230515122321620|20230515122321620...|                 2|                     2|0cd93c57-fb35-4ff...|  peter|    202
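
The instants used with `as.of.instant` above follow the pattern `yyyyMMddHHmmssSSS`. A small sketch for producing such a string from a Python `datetime`; the helper `to_hudi_instant` is hypothetical. Judging from the output above, the commit instants read 12:23 while the `ts` column shows 14:23, so the timeline in this lab appears to be written in UTC; that is an observation from this run, not a general guarantee.

```python
from datetime import datetime


def to_hudi_instant(moment: datetime) -> str:
    """Format a datetime as yyyyMMddHHmmssSSS (millisecond precision)."""
    return moment.strftime("%Y%m%d%H%M%S") + f"{moment.microsecond // 1000:03d}"


instant = to_hudi_instant(datetime(2023, 5, 15, 12, 23, 39, 203000))
print(instant)

# With a running Spark session:
# spark.read.format("hudi").option("as.of.instant", instant).load(f"s3://{bucket}/hudi").show()
```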

In [91]:
account_data4 = [
    (5,"anna","2020-11-01",2000,"neue Spalte 1")
]
df4 = spark.createDataFrame(data=account_data4, schema = schema3).withColumn("dt_transaction",col("dt_transaction").cast("date")).repartition(3)

write_hudi=(df4.withColumn("ts", f.current_timestamp()).write.format("hudi")
               .options(**hudi_options)
               .mode("append")
               .save(f"s3://{bucket}/hudi")
               )

                                                                                

In [92]:
## Incremental query:

spark.read.format("hudi"). \
  load(f"s3://{bucket}/hudi"). \
  createOrReplaceTempView("hudi_snapshots")

In [93]:
commits = [row[0] for row in spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_snapshots order by commitTime").limit(10).collect()]
print(commits)

beginTime = commits[-4]  # fourth-newest commit; the incremental query returns records committed after this instant




['20230515122157032', '20230515122321620', '20230515122339203', '20230515123027657']


                                                                                

In [95]:
# incrementally query data
incremental_read_options = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.begin.instanttime': beginTime,
}

hudiIncrementalDF = spark.read.format("hudi"). \
  options(**incremental_read_options). \
  load(f"s3://{bucket}/hudi")
hudiIncrementalDF.createOrReplaceTempView("hudi_incremental")

spark.sql("select `_hoodie_commit_time`, account, balance, dt_transaction, ts from hudi_incremental").show(truncate=False)

+-------------------+-------+-------+--------------+--------------------------+
|_hoodie_commit_time|account|balance|dt_transaction|ts                        |
+-------------------+-------+-------+--------------+--------------------------+
|20230515122339203  |otto   |4444   |2019-10-01    |2023-05-15 14:23:39.428869|
|20230515123027657  |anna   |2000   |2020-11-01    |2023-05-15 14:30:27.805069|
|20230515122321620  |peter  |100    |2021-01-01    |2023-05-15 14:23:21.809413|
+-------------------+-------+-------+--------------+--------------------------+
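
The result above contains only rows whose commit instant is later than `beginTime`; the row from the first commit (maria) is missing. The selection rule can be sketched in plain Python, assuming that instants are fixed-width strings so that string comparison matches chronological order. The function `incremental_rows` and the reduced row layout are illustrative only.

```python
def incremental_rows(rows, begin_time):
    """Keep rows committed strictly after begin_time (instants compare as strings)."""
    return [row for row in rows if row["_hoodie_commit_time"] > begin_time]


# One row per commit, taken from the outputs in this notebook
rows = [
    {"_hoodie_commit_time": "20230515122157032", "account": "maria"},
    {"_hoodie_commit_time": "20230515122321620", "account": "peter"},
    {"_hoodie_commit_time": "20230515122339203", "account": "otto"},
    {"_hoodie_commit_time": "20230515123027657", "account": "anna"},
]

begin_time = "20230515122157032"  # value of commits[len(commits) - 4] in the run above
changed = incremental_rows(rows, begin_time)
print([row["account"] for row in changed])
```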



### Hudi: Table maintenance
Hudi can run its table services (cleaning, compaction and clustering) inline or asynchronously while a Structured Streaming query is running, so the user does not have to schedule them separately.
- For CoW (copy-on-write) tables, table services work in inline mode by default.
- For MoR (merge-on-read) tables, some async services are enabled by default.
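
Table services are controlled through write options. The sketch below merges a few such options into an existing options dictionary. The values are placeholders chosen for illustration, the helper `with_table_services` is hypothetical, and the option names and defaults should be checked against the documentation of the Hudi version in use; compaction settings only matter for MoR tables.

```python
# Illustrative values only; verify names and defaults for your Hudi version
table_service_options = {
    "hoodie.clean.automatic": "true",                # run cleaning after commits
    "hoodie.clean.async": "false",                   # clean inline rather than asynchronously
    "hoodie.cleaner.commits.retained": "10",         # commits kept before old file versions are cleaned
    "hoodie.compact.inline": "true",                 # MoR only: compact as part of the write
    "hoodie.compact.inline.max.delta.commits": "5",  # MoR only: delta commits before compaction
    "hoodie.clustering.inline": "false",             # no inline clustering
}


def with_table_services(base_options: dict, service_options: dict) -> dict:
    """Return a copy of base_options extended by the table service settings."""
    merged = dict(base_options)
    merged.update(service_options)
    return merged


# Stand-in for the hudi_options dictionary defined earlier in the notebook
base = {"hoodie.datasource.write.operation": "upsert"}
write_options = with_table_services(base, table_service_options)
print(len(write_options))

# With a running Spark session:
# df.write.format("hudi").options(**write_options).mode("append").save(f"s3://{bucket}/hudi")
```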

In [None]:
spark.stop()