<a href="https://colab.research.google.com/github/DenysNunes/data-examples/blob/main/spark/2%20-%20intermediate/basic_delta_lake_usage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Delta Lake Example

This example uses the third-party Delta Lake library to add lakehouse features to Spark. <br>
Delta Lake provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing on top of existing data lakes such as S3, ADLS, GCS, and HDFS. <br>
Read more at [delta.io](https://delta.io/).

## ▶ Initializing Spark

Create a local Spark session with the Delta Lake SQL extension, the Delta catalog, and the `delta-core` package configured.

In [1]:
!pip install -q pyspark==3.2.0
!rm -rf /tmp/tables/

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr
from pyspark.sql.types import StructType

spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("New Session Example") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.1.0") \
    .enableHiveSupport() \
    .getOrCreate()

from delta.tables import DeltaTable

person_path = '/tmp/tables/tb_person/'
table_meta = {
   "fields":[
      {
         "metadata":{
             "comment": "Person id."
         },
         "name":"id",
         "nullable": False,
         "type":"integer"
      },
      {
         "metadata":{
             "comment": "Person name information."
         },
         "name":"name",
         "nullable": False,
         "type":"string"
      },
      {
         "metadata":{
             "comment": "Person salary information."
         },
         "name":"salary",
         "nullable": False,
         "type":"float"
      },
      {
         "metadata":{
             "comment": "Person gender information."
         },
         "name":"gender",
         "nullable": False,
         "type":"string"
      },
      {
         "metadata":{
             "comment": "Moment when data inserted."
         },
         "name":"inserted_at",
         "nullable": False,
         "type":"timestamp"
      },
      {
         "metadata":{
             "comment": "Moment when data updated."
         },
         "name": "updated_at",
         "nullable": True,
         "type":"timestamp"
      }
   ],
   "type":"struct"
}

table_schema = StructType.fromJson(table_meta)
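
The schema dictionary above repeats the same four keys for every column. As a minimal sketch, the same layout could be generated from a short list of column descriptions. The helper `make_field` and the names `columns` and `compact_meta` are hypothetical and not part of the original notebook; only the key layout (`metadata`, `name`, `nullable`, `type`) is taken from `table_meta`.

```python
def make_field(name, type_name, comment, nullable=False):
    """Build one field entry in the same JSON layout as table_meta (hypothetical helper)."""
    return {
        "metadata": {"comment": comment},
        "name": name,
        "nullable": nullable,
        "type": type_name,
    }


# (name, type, comment, nullable) for each column of tb_person
columns = [
    ("id", "integer", "Person id.", False),
    ("name", "string", "Person name information.", False),
    ("salary", "float", "Person salary information.", False),
    ("gender", "string", "Person gender information.", False),
    ("inserted_at", "timestamp", "Moment when data inserted.", False),
    ("updated_at", "timestamp", "Moment when data updated.", True),
]

compact_meta = {
    "fields": [make_field(n, t, c, nullable) for n, t, c, nullable in columns],
    "type": "struct",
}

# Print a short summary of the generated fields
for field in compact_meta["fields"]:
    print(field["name"], field["type"], field["nullable"])
```

Passing `compact_meta` to `StructType.fromJson` should produce the same schema as `table_meta`, since both dictionaries have identical content.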




## ▶ Creating a Delta table

In [2]:
from pyspark.sql.types import Row
from datetime import datetime
import time

first_insert_ts = datetime.now()

df = spark.createDataFrame([
        Row(id=1, 
            name='Oliver',
            salary=2000.00,
            gender='M',
            inserted_at=first_insert_ts,
            updated_at=None
        ),
        Row(id=2, 
            name='Agata', 
            salary=2800.00,
            gender='F',
            inserted_at=first_insert_ts,
            updated_at=None
        ),
        Row(id=3, 
            name='Lola', 
            salary=4500.00,
            gender='F',
            inserted_at=first_insert_ts,
            updated_at=None
        )
], schema=table_schema)

# Pause briefly so that later timestamps differ from first_insert_ts
time.sleep(0.5)

df.write.format("delta").save(person_path)

## ▶ Showing the DataFrame using the **DeltaTable** class

In [3]:
deltaTable = DeltaTable.forPath(spark, person_path)

deltaTable.toDF().orderBy('id').show(200, False)

+---+------+------+------+--------------------------+----------+
|id |name  |salary|gender|inserted_at               |updated_at|
+---+------+------+------+--------------------------+----------+
|1  |Oliver|2000.0|M     |2022-01-17 05:08:36.847056|null      |
|2  |Agata |2800.0|F     |2022-01-17 05:08:36.847056|null      |
|3  |Lola  |4500.0|F     |2022-01-17 05:08:36.847056|null      |
+---+------+------+------+--------------------------+----------+



## ▶ Performing a merge operation


In [4]:
second_merge_ts = datetime.now()

# Source DataFrame for the merge: one new row (id=4) and one update to an existing row (id=1)
df_merge = spark.createDataFrame([
         Row(id=4, 
            name='Paula',
            salary=5400.00,
            gender='F',
            inserted_at=second_merge_ts, 
            updated_at=None                 
        ),
        Row(id=1, 
            name='Oliver',
            salary=3000.00,
            gender='M',
            inserted_at = None, 
            updated_at = second_merge_ts   
        )
])


when_match_upd_val = {
    "id": col("source.id"),
    "name": col("source.name"),
    "salary": col("source.salary"),
    "gender": col("source.gender"),
    "updated_at": col("source.updated_at"),
    "inserted_at": col("target.inserted_at")
}

deltaTable.alias("target") \
    .merge(df_merge.alias("source"), "target.id = source.id") \
    .whenMatchedUpdate(set=when_match_upd_val) \
    .whenNotMatchedInsertAll() \
    .execute()



In [5]:
deltaTable.toDF().orderBy('id').show(200, False)

+---+------+------+------+--------------------------+--------------------------+
|id |name  |salary|gender|inserted_at               |updated_at                |
+---+------+------+------+--------------------------+--------------------------+
|1  |Oliver|3000.0|M     |2022-01-17 05:08:36.847056|2022-01-17 05:09:02.247944|
|2  |Agata |2800.0|F     |2022-01-17 05:08:36.847056|null                      |
|3  |Lola  |4500.0|F     |2022-01-17 05:08:36.847056|null                      |
|4  |Paula |5400.0|F     |2022-01-17 05:09:02.247944|null                      |
+---+------+------+------+--------------------------+--------------------------+
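
To make the merge semantics explicit, here is a minimal pure-Python sketch of the same upsert logic on lists of dictionaries. It does not use the Delta API: `merge_rows` is a hypothetical helper that only models what `whenMatchedUpdate` and `whenNotMatchedInsertAll` do in the cell above. On a match, every column comes from the source except `inserted_at`, which is kept from the target.

```python
from datetime import datetime


def merge_rows(target, source, key="id"):
    """Toy upsert (hypothetical helper): update matched rows, insert unmatched ones."""
    merged = {row[key]: dict(row) for row in target}  # copy so target is not mutated
    for src in source:
        existing = merged.get(src[key])
        if existing is None:
            # No row with this key in the target: insert the source row as-is
            merged[src[key]] = dict(src)
        else:
            # Matching key: take source values but keep the target's inserted_at
            updated = dict(src)
            updated["inserted_at"] = existing["inserted_at"]
            merged[src[key]] = updated
    return sorted(merged.values(), key=lambda r: r[key])


t0 = datetime(2022, 1, 17, 5, 8, 36, 847056)  # first insert timestamp
t1 = datetime(2022, 1, 17, 5, 9, 2, 247944)   # merge timestamp

target = [
    {"id": 1, "name": "Oliver", "salary": 2000.0, "gender": "M", "inserted_at": t0, "updated_at": None},
    {"id": 2, "name": "Agata", "salary": 2800.0, "gender": "F", "inserted_at": t0, "updated_at": None},
    {"id": 3, "name": "Lola", "salary": 4500.0, "gender": "F", "inserted_at": t0, "updated_at": None},
]
source = [
    {"id": 4, "name": "Paula", "salary": 5400.0, "gender": "F", "inserted_at": t1, "updated_at": None},
    {"id": 1, "name": "Oliver", "salary": 3000.0, "gender": "M", "inserted_at": None, "updated_at": t1},
]

result = merge_rows(target, source)
for row in result:
    print(row["id"], row["name"], row["salary"], row["inserted_at"], row["updated_at"])
```

Keeping `inserted_at` from the target is the reason the notebook builds an explicit `set` dictionary for the matched case instead of calling `whenMatchedUpdateAll()`: the source row for `id=1` carries `inserted_at=None`, which would otherwise overwrite the original insert time.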



## ▶ Performing a delete operation

In [6]:
deltaTable.delete(condition=expr("id = 3"))

In [7]:
deltaTable.toDF().orderBy('id').show(200, False)

+---+------+------+------+--------------------------+--------------------------+
|id |name  |salary|gender|inserted_at               |updated_at                |
+---+------+------+------+--------------------------+--------------------------+
|1  |Oliver|3000.0|M     |2022-01-17 05:08:36.847056|2022-01-17 05:09:02.247944|
|2  |Agata |2800.0|F     |2022-01-17 05:08:36.847056|null                      |
|4  |Paula |5400.0|F     |2022-01-17 05:09:02.247944|null                      |
+---+------+------+------+--------------------------+--------------------------+



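## ▶ Time Travel

Another useful Delta feature is time travel: every operation that changes the table creates a new table version. The `DESCRIBE HISTORY` command lists these versions, and each operation performed so far (write, merge, delete) appears as a row in the DataFrame below.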

In [8]:
spark.sql(f"""

DESCRIBE HISTORY '{person_path}'  

""").selectExpr("version", "timestamp", "operation") \
    .orderBy("version") \
    .show(200, False)

+-------+-----------------------+---------+
|version|timestamp              |operation|
+-------+-----------------------+---------+
|0      |2022-01-17 05:08:45.066|WRITE    |
|1      |2022-01-17 05:09:06.3  |MERGE    |
|2      |2022-01-17 05:09:13.982|DELETE   |
+-------+-----------------------+---------+



## Using a specific version

Read the first version of the table (version 0) by passing the `versionAsOf` option.

In [9]:
df_old_version = spark.read.format("delta") \
                           .option("versionAsOf", 0) \
                           .load(person_path)

df_old_version.orderBy("id").show(200, False)

+---+------+------+------+--------------------------+----------+
|id |name  |salary|gender|inserted_at               |updated_at|
+---+------+------+------+--------------------------+----------+
|1  |Oliver|2000.0|M     |2022-01-17 05:08:36.847056|null      |
|2  |Agata |2800.0|F     |2022-01-17 05:08:36.847056|null      |
|3  |Lola  |4500.0|F     |2022-01-17 05:08:36.847056|null      |
+---+------+------+------+--------------------------+----------+
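
The idea behind versioned reads can be sketched with a toy append-only log in plain Python. This is an illustration only, under the assumption that each commit stores a full snapshot; Delta itself records changes in a transaction log rather than copying the whole table. The class `VersionedTable` and its methods are hypothetical names, not part of the Delta API.

```python
import copy


class VersionedTable:
    """Toy version log (hypothetical): one full snapshot per commit."""

    def __init__(self):
        self._snapshots = []   # snapshot i holds the rows as of version i
        self._operations = []  # name of the operation that produced version i

    def commit(self, operation, rows):
        """Store a new version and return its version number."""
        self._snapshots.append(copy.deepcopy(rows))
        self._operations.append(operation)
        return len(self._snapshots) - 1

    def history(self):
        """Return (version, operation) pairs, oldest first."""
        return list(enumerate(self._operations))

    def as_of(self, version):
        """Return a copy of the rows as they were at the given version."""
        return copy.deepcopy(self._snapshots[version])


table = VersionedTable()

v0 = [
    {"id": 1, "name": "Oliver", "salary": 2000.0},
    {"id": 2, "name": "Agata", "salary": 2800.0},
    {"id": 3, "name": "Lola", "salary": 4500.0},
]
table.commit("WRITE", v0)

v1 = [
    {"id": 1, "name": "Oliver", "salary": 3000.0},
    {"id": 2, "name": "Agata", "salary": 2800.0},
    {"id": 3, "name": "Lola", "salary": 4500.0},
    {"id": 4, "name": "Paula", "salary": 5400.0},
]
table.commit("MERGE", v1)

v2 = [row for row in v1 if row["id"] != 3]
table.commit("DELETE", v2)

print(table.history())                               # [(0, 'WRITE'), (1, 'MERGE'), (2, 'DELETE')]
print([row["name"] for row in table.as_of(0)])       # ['Oliver', 'Agata', 'Lola']
```

Reading `as_of(0)` returns the original three rows even though later versions changed Oliver's salary and removed Lola, which mirrors what `versionAsOf` returns in the notebook.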

