# Delta Lake Lab 
## Unit 4: CRUD Support

This lab is powered by Dataproc Serverless Spark.

In the previous unit, we -
1. Created an unpartitioned delta table
2. Created a partitioned delta table called loan_db.loans_by_state_delta
3. Studied the files created & layout in the datalake
4. Learned how to look at delta table details
5. Looked at the table history (there was none yet)
6. Created a manifest file
7. Reviewed entries in the Hive metastore

In this unit, we will learn how to -
1. Delete a record and study the delta log
2. Insert a record and study the delta log
3. Update a record and study the delta log
4. Upsert and study the delta log

### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import DeltaTable

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

22/11/01 21:46:15 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  885979867746


In [None]:
ACCOUNT_NAME = "akhjain"

In [6]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}-{ACCOUNT_NAME}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"

In [7]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-bd1648aa-3db7-43e9-beb7-db0593eefbb5-c000.snappy.parquet

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json

gs://dll-data-bucket-885979867746/delta-consumable/_symlink_format_manifest/:
gs://dll-data-bucket-885979867746/delta-consumable/_symlink_format_manifest/
gs://dll-data-bucket-885979867746/delta-consumable/_symlink_format_manifest/manifest


### 4. Delete support

In [8]:
# Get the count of parquet data files at the table root
# (DELTA_LAKE_DIR_ROOT already ends in /delta-consumable, so no extra subdirectory is needed)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l


In [9]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta WHERE addr_state='IA'").show(truncate=False)

+----------+-----+
|addr_state|count|
+----------+-----+
|IA        |1    |
+----------+-----+



                                                                                

In [10]:
spark.sql("DELETE FROM loan_db.loans_by_state_delta WHERE addr_state='IA'").show(truncate=False)



+-----------------+
|num_affected_rows|
+-----------------+
|1                |
+-----------------+



                                                                                

In [11]:
# Get the count of parquet data files at the table root
# (DELTA_LAKE_DIR_ROOT already ends in /delta-consumable, so no extra subdirectory is needed)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l


In [12]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta WHERE addr_state='IA'").show(truncate=False)

+----------+-----+
|addr_state|count|
+----------+-----+
+----------+-----+



Let's look at the data lake:

In [13]:
# Note how the delete created a new JSON file in the delta log directory
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/* 

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
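Each commit file in `_delta_log` is named after the table version it creates, zero-padded to 20 digits. The following is a minimal sketch of that naming scheme in plain Python; the helper names `commit_file_name` and `version_from_path` are hypothetical and are not part of the Delta Lake API.

```python
def commit_file_name(version: int) -> str:
    """Return the commit file name for a table version (20-digit zero padding)."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"


def version_from_path(path: str) -> int:
    """Extract the table version from a path that ends in a commit file name."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    stem, _, ext = name.partition(".")
    if ext != "json" or not stem.isdigit():
        raise ValueError(f"not a commit file: {path}")
    return int(stem)


# The two commit files from the listing above
listing = [
    "gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json",
    "gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json",
]

versions = sorted(version_from_path(p) for p in listing)
print(versions)
print(commit_file_name(versions[-1] + 1))  # name the next commit would get
```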


Let's look at the delta log:

In [14]:
# This is the original log
!gsutil cat $DELTA_LAKE_DIR_ROOT/_delta_log/00000000000000000000.json 

{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"8ac45cea-8820-4a01-b29d-3a2311ddc86b","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"addr_state\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"count\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1667338219033}}
{"add":{"path":"part-00000-bd1648aa-3db7-43e9-beb7-db0593eefbb5-c000.snappy.parquet","partitionValues":{},"size":978,"modificationTime":1667338230366,"dataChange":true,"stats":"{\"numRecords\":51,\"minValues\":{\"addr_state\":\"AK\",\"count\":1},\"maxValues\":{\"addr_state\":\"WY\",\"count\":1},\"nullCount\":{\"addr_state\":0,\"count\":0}}"}}
{"commitInfo":{"timestamp":1667338234298,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numFiles":"1","numOutp

In [15]:
# Note the delete in this log
!gsutil cat $DELTA_LAKE_DIR_ROOT/_delta_log/00000000000000000001.json 

{"remove":{"path":"part-00000-bd1648aa-3db7-43e9-beb7-db0593eefbb5-c000.snappy.parquet","deletionTimestamp":1667339216100,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":978}}
{"add":{"path":"part-00000-99635b87-214b-4afa-895e-8c124bf54009-c000.snappy.parquet","partitionValues":{},"size":973,"modificationTime":1667339215833,"dataChange":true,"stats":"{\"numRecords\":50,\"minValues\":{\"addr_state\":\"AK\",\"count\":1},\"maxValues\":{\"addr_state\":\"WY\",\"count\":1},\"nullCount\":{\"addr_state\":0,\"count\":0}}"}}
{"commitInfo":{"timestamp":1667339216154,"operation":"DELETE","operationParameters":{"predicate":"[\"(spark_catalog.loan_db.loans_by_state_delta.addr_state = 'IA')\"]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numCopiedRows":"50","numAddedChangeFiles":"0","executionTimeMs":"3975","numAddedFiles":"1","rewriteTimeMs":"1620","numDeletedRows":"1","scanTimeMs":"2354"},"engineInfo":"A
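A commit file holds one JSON action per line (`add`, `remove`, `commitInfo`, and so on). The sketch below summarizes the DELETE commit shown above using only the standard library. The three lines are abridged copies of the log output (fields such as `stats` and `partitionValues` are omitted), and `summarize_commit` is a hypothetical helper name.

```python
import json
from collections import Counter

# Abridged copies of the three actions in 00000000000000000001.json
commit_lines = [
    '{"remove":{"path":"part-00000-bd1648aa-3db7-43e9-beb7-db0593eefbb5-c000.snappy.parquet","dataChange":true,"size":978}}',
    '{"add":{"path":"part-00000-99635b87-214b-4afa-895e-8c124bf54009-c000.snappy.parquet","size":973,"dataChange":true}}',
    '{"commitInfo":{"timestamp":1667339216154,"operation":"DELETE","operationMetrics":{"numRemovedFiles":"1","numCopiedRows":"50","numAddedFiles":"1","numDeletedRows":"1"}}}',
]


def summarize_commit(lines):
    """Return (operation, action counts, numeric metrics) for one commit."""
    counts = Counter()
    operation = None
    metrics = {}
    for line in lines:
        action = json.loads(line)
        # Each line carries exactly one top-level key naming the action type
        (kind, body), = action.items()
        counts[kind] += 1
        if kind == "commitInfo":
            operation = body.get("operation")
            # Metrics are stored as strings in the log; convert them to int
            metrics = {k: int(v) for k, v in body.get("operationMetrics", {}).items()}
    return operation, dict(counts), metrics


operation, counts, metrics = summarize_commit(commit_lines)
print(operation, counts, metrics)
```

Read together, the summary says that deleting one row removed one 51-row file and added a new file holding the 50 rows that were copied over.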

### 5. Create (Insert) support

In [16]:
# Get the count of parquet data files at the table root
# (DELTA_LAKE_DIR_ROOT already ends in /delta-consumable, so no extra subdirectory is needed)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l


In [17]:
spark.sql("INSERT INTO loan_db.loans_by_state_delta VALUES ('IA',222222)")

                                                                                

DataFrame[]

In [18]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta WHERE addr_state='IA'").show(truncate=False)

+----------+------+
|addr_state|count |
+----------+------+
|IA        |222222|
+----------+------+



In [19]:
# Get the count of parquet data files at the table root
# (DELTA_LAKE_DIR_ROOT already ends in /delta-consumable, so no extra subdirectory is needed)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l


In [20]:
# Note how the insert created a new parquet file and another JSON file in the delta log
!gsutil ls -r $DELTA_LAKE_DIR_ROOT 

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-363ee39f-ff75-4e4d-8938-d866d43955ca-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-99635b87-214b-4afa-895e-8c124bf54009-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-bd1648aa-3db7-43e9-beb7-db0593eefbb5-c000.snappy.parquet

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000002.json

gs://dll-data-bucket-885979867746/delta-consumable/_symlink_format_manifest/:
gs://dll-data-bucket-885979867746/delta-consumable/_symlink_format_manifest/
gs://dll-data-bucket-885979867746/delta-co

In [21]:
# Get the count of parquet data files at the table root
# (DELTA_LAKE_DIR_ROOT already ends in /delta-consumable, so no extra subdirectory is needed)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l


In [22]:
# Let's check the delta log for the insert
!gsutil cat $DELTA_LAKE_DIR_ROOT/_delta_log/00000000000000000002.json 

{"add":{"path":"part-00000-363ee39f-ff75-4e4d-8938-d866d43955ca-c000.snappy.parquet","partitionValues":{},"size":725,"modificationTime":1667339236091,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"addr_state\":\"IA\",\"count\":222222},\"maxValues\":{\"addr_state\":\"IA\",\"count\":222222},\"nullCount\":{\"addr_state\":0,\"count\":0}}"}}
{"commitInfo":{"timestamp":1667339236274,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":1,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputRows":"1","numOutputBytes":"725"},"engineInfo":"Apache-Spark/3.3.1 Delta-Lake/2.1.0","txnId":"d43488d3-179d-488d-8c3a-4b08289a2f93"}}
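After the insert, three parquet files sit in the table directory, but only two belong to the current table version: the file removed by the DELETE stays on storage even though the log no longer references it. The sketch below replays the `add` and `remove` actions from versions 0 to 2 to make that distinction concrete. It is a simplified model of log replay in plain Python, not the Delta Lake implementation, and `active_files` is a hypothetical helper name.

```python
# (action, path) pairs per version, taken from the three commit files shown so far
FILE_V0 = "part-00000-bd1648aa-3db7-43e9-beb7-db0593eefbb5-c000.snappy.parquet"
FILE_V1 = "part-00000-99635b87-214b-4afa-895e-8c124bf54009-c000.snappy.parquet"
FILE_V2 = "part-00000-363ee39f-ff75-4e4d-8938-d866d43955ca-c000.snappy.parquet"

log = {
    0: [("add", FILE_V0)],                        # initial WRITE
    1: [("remove", FILE_V0), ("add", FILE_V1)],   # DELETE of the IA row
    2: [("add", FILE_V2)],                        # INSERT of ('IA', 222222)
}


def active_files(log, as_of=None):
    """Replay add/remove actions in version order and return the live file set."""
    files = set()
    for version in sorted(log):
        if as_of is not None and version > as_of:
            break
        for kind, path in log[version]:
            if kind == "add":
                files.add(path)
            elif kind == "remove":
                files.discard(path)
    return files


# Every file ever written, whether or not it is still referenced
all_written = {path for actions in log.values() for kind, path in actions if kind == "add"}

print("files on storage:", len(all_written))
print("files in current version:", len(active_files(log)))
print("files in version 0:", len(active_files(log, as_of=0)))
```

Replaying only up to an earlier version is also the idea behind time travel: the older file is still available because nothing has physically deleted it.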


### 6. Update support

Let's update a record and see the changes in the delta log directory.

In [23]:
# Get the count of parquet data files at the table root
# (DELTA_LAKE_DIR_ROOT already ends in /delta-consumable, so no extra subdirectory is needed)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l


In [24]:
spark.sql("UPDATE loan_db.loans_by_state_delta SET count = 11111 WHERE addr_state='IA'").show(truncate=False)



+-----------------+
|num_affected_rows|
+-----------------+
|1                |
+-----------------+



                                                                                

In [25]:
spark.sql("SELECT * FROM loan_db.loans_by_state_delta WHERE addr_state='IA'").show(truncate=False)

+----------+-----+
|addr_state|count|
+----------+-----+
|IA        |11111|
+----------+-----+



In [26]:
# Get the count of parquet data files at the table root
# (DELTA_LAKE_DIR_ROOT already ends in /delta-consumable, so no extra subdirectory is needed)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l


In [27]:
# Note how the update created a new parquet file and another JSON file in the delta log
!gsutil ls -r $DELTA_LAKE_DIR_ROOT 

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-363ee39f-ff75-4e4d-8938-d866d43955ca-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-99635b87-214b-4afa-895e-8c124bf54009-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-9a73bff5-5ed8-4ce2-9383-b6f2865cbc6a-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-bd1648aa-3db7-43e9-beb7-db0593eefbb5-c000.snappy.parquet

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000

In [28]:
# Let's check the delta log for the update
!gsutil cat $DELTA_LAKE_DIR_ROOT/_delta_log/00000000000000000003.json 

{"remove":{"path":"part-00000-363ee39f-ff75-4e4d-8938-d866d43955ca-c000.snappy.parquet","deletionTimestamp":1667339255561,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":725}}
{"add":{"path":"part-00000-9a73bff5-5ed8-4ce2-9383-b6f2865cbc6a-c000.snappy.parquet","partitionValues":{},"size":725,"modificationTime":1667339255473,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"addr_state\":\"IA\",\"count\":11111},\"maxValues\":{\"addr_state\":\"IA\",\"count\":11111},\"nullCount\":{\"addr_state\":0,\"count\":0}}"}}
{"commitInfo":{"timestamp":1667339255563,"operation":"UPDATE","operationParameters":{"predicate":"(addr_state#1436 = IA)"},"readVersion":2,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numCopiedRows":"0","numAddedChangeFiles":"0","executionTimeMs":"1650","scanTimeMs":"957","numAddedFiles":"1","numUpdatedRows":"1","rewriteTimeMs":"692"},"engineInfo":"Apache-Spark/3.3.1 Delta-Lake/2.1.0","txnId
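The UPDATE commit reports `numCopiedRows` of 0, while the earlier DELETE reported 50. Both follow from the same copy-on-write behaviour: a file containing a matching row is rewritten in full, and every non-matching row in it is copied into the new file. The sketch below is a simplified model of that rewrite in plain Python. `rewrite_file` is a hypothetical helper, and the 50 placeholder state codes (`S00` to `S49`) stand in for the real rows, which are not reproduced here.

```python
def rewrite_file(rows, predicate, update=None):
    """Model a copy-on-write rewrite of one data file.

    Rows matching `predicate` are dropped (delete) or replaced by
    `update(row)` (update); all other rows are copied unchanged.
    """
    new_rows = []
    copied = changed = 0
    for row in rows:
        if predicate(row):
            changed += 1
            if update is not None:
                new_rows.append(update(row))
        else:
            copied += 1
            new_rows.append(row)
    return new_rows, {"numCopiedRows": copied, "numChangedRows": changed}


def is_iowa(row):
    return row[0] == "IA"


# DELETE case: the original file held 51 rows, one of them IA
original_file = [(f"S{i:02d}", 1) for i in range(50)] + [("IA", 1)]
after_delete, delete_metrics = rewrite_file(original_file, is_iowa)

# UPDATE case: the file written by the INSERT held only the IA row
inserted_file = [("IA", 222222)]
after_update, update_metrics = rewrite_file(
    inserted_file, is_iowa, update=lambda row: (row[0], 11111)
)

print(delete_metrics)
print(update_metrics, after_update)
```

The update copies nothing because the file it rewrote contained only the row being changed.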

### 7. Upsert support

In [29]:
toBeMergedRows = [('IA', 555), ('CA', 12345), ('IN', 6666)]
toBeMergedColumns = ['addr_state', 'count']
toBeMergedDF = spark.createDataFrame(toBeMergedRows, toBeMergedColumns)
toBeMergedDF.createOrReplaceTempView("to_be_merged_table")
toBeMergedDF.orderBy("addr_state").show(3)

+----------+-----+
|addr_state|count|
+----------+-----+
|        CA|12345|
|        IA|  555|
|        IN| 6666|
+----------+-----+



                                                                                

In [30]:
# Get the count of parquet data files at the table root
# (DELTA_LAKE_DIR_ROOT already ends in /delta-consumable, so no extra subdirectory is needed)
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l


In [31]:
spark.sql("DELETE FROM loan_db.loans_by_state_delta WHERE addr_state='IA'").show(truncate=False)

                                                                                

+-----------------+
|num_affected_rows|
+-----------------+
|1                |
+-----------------+



In [32]:
spark.sql("SELECT addr_state,count FROM loan_db.loans_by_state_delta WHERE addr_state in ('IA','CA','IN') ORDER BY addr_state").show(truncate=False)

+----------+-----+
|addr_state|count|
+----------+-----+
|CA        |1    |
|IN        |1    |
+----------+-----+



In [33]:
mergeSQLStatement = "MERGE INTO loan_db.loans_by_state_delta as d USING to_be_merged_table as m ON (d.addr_state = m.addr_state) WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * "

print(mergeSQLStatement)


MERGE INTO loan_db.loans_by_state_delta as d USING to_be_merged_table as m ON (d.addr_state = m.addr_state) WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * 


In [34]:
spark.sql(mergeSQLStatement).show(truncate=False)

                                                                                

+-----------------+----------------+----------------+-----------------+
|num_affected_rows|num_updated_rows|num_deleted_rows|num_inserted_rows|
+-----------------+----------------+----------------+-----------------+
|3                |2               |0               |1                |
+-----------------+----------------+----------------+-----------------+
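The same upsert can also be written with the `DeltaTable` Python API instead of SQL. The following is a sketch only, assuming the `delta-spark` package that matches the session's Delta Lake version is importable; `merge_with_python_api` is a hypothetical wrapper name, and the table and view names are the ones used in this notebook. It is an alternative to the SQL cell above, not an additional step.

```python
def merge_with_python_api(
    spark,
    table_name="loan_db.loans_by_state_delta",
    source_view="to_be_merged_table",
):
    """Upsert the rows of `source_view` into `table_name`, keyed on addr_state."""
    # Imported lazily so that defining this function does not require Delta to be installed
    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, table_name)
    source = spark.table(source_view)

    (
        target.alias("d")
        .merge(source.alias("m"), "d.addr_state = m.addr_state")
        .whenMatchedUpdateAll()      # same as: WHEN MATCHED THEN UPDATE SET *
        .whenNotMatchedInsertAll()   # same as: WHEN NOT MATCHED THEN INSERT *
        .execute()
    )

    # Return the most recent history entry so the caller can inspect the MERGE commit
    return target.history(1)
```

In a notebook cell this would be called as `merge_with_python_api(spark).show(truncate=False)`.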



In [35]:
spark.sql("SELECT addr_state,count FROM loan_db.loans_by_state_delta WHERE addr_state in ('IA','CA','IN') ORDER BY addr_state").show(truncate=False)

+----------+-----+
|addr_state|count|
+----------+-----+
|CA        |12345|
|IA        |555  |
|IN        |6666 |
+----------+-----+
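The merge reported 2 updated rows and 1 inserted row because CA and IN already existed in the table, while IA had just been deleted. The sketch below reproduces those counts with a plain-Python model of upsert semantics. It is illustrative only: `upsert` is a hypothetical helper, the target holds just the two relevant rows, and the model assumes each key appears at most once in the source.

```python
def upsert(target, source, key):
    """Update rows whose key exists in target, insert the rest."""
    merged = {row[key]: dict(row) for row in target}
    updated = inserted = 0
    for row in source:
        if row[key] in merged:
            updated += 1      # WHEN MATCHED THEN UPDATE SET *
        else:
            inserted += 1     # WHEN NOT MATCHED THEN INSERT *
        merged[row[key]] = dict(row)
    result = sorted(merged.values(), key=lambda r: r[key])
    metrics = {
        "num_affected_rows": updated + inserted,
        "num_updated_rows": updated,
        "num_inserted_rows": inserted,
    }
    return result, metrics


# Relevant target rows just before the merge (IA was deleted beforehand)
target_rows = [
    {"addr_state": "CA", "count": 1},
    {"addr_state": "IN", "count": 1},
]
# The rows registered as to_be_merged_table
source_rows = [
    {"addr_state": "IA", "count": 555},
    {"addr_state": "CA", "count": 12345},
    {"addr_state": "IN", "count": 6666},
]

result, metrics = upsert(target_rows, source_rows, key="addr_state")
print(metrics)
for row in result:
    print(row)
```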



In [44]:
# Get the file count
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

6


In [37]:
# Note how the delete and the merge each created a new parquet file and another JSON file in the delta log
!gsutil ls -r $DELTA_LAKE_DIR_ROOT 

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-10887990-f51d-46ad-a768-2f8135d5fd95-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-2517e153-6923-4599-8b2e-5746fcf10973-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-363ee39f-ff75-4e4d-8938-d866d43955ca-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-99635b87-214b-4afa-895e-8c124bf54009-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-9a73bff5-5ed8-4ce2-9383-b6f2865cbc6a-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-bd1648aa-3db7-43e9-beb7-db0593eefbb5-c000.snappy.parquet

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-buc

In [38]:
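# Version 4 records the DELETE that ran just before the merge (see the "operation" field below);
# the MERGE commit itself should be in the next version of the log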
!gsutil cat $DELTA_LAKE_DIR_ROOT/_delta_log/00000000000000000004.json 

{"remove":{"path":"part-00000-9a73bff5-5ed8-4ce2-9383-b6f2865cbc6a-c000.snappy.parquet","deletionTimestamp":1667339279419,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":725}}
{"add":{"path":"part-00000-10887990-f51d-46ad-a768-2f8135d5fd95-c000.snappy.parquet","partitionValues":{},"size":397,"modificationTime":1667339279348,"dataChange":true,"stats":"{\"numRecords\":0,\"minValues\":{},\"maxValues\":{},\"nullCount\":{}}"}}
{"commitInfo":{"timestamp":1667339279420,"operation":"DELETE","operationParameters":{"predicate":"[\"(spark_catalog.loan_db.loans_by_state_delta.addr_state = 'IA')\"]"},"readVersion":3,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numCopiedRows":"0","numAddedChangeFiles":"0","executionTimeMs":"1341","numAddedFiles":"1","rewriteTimeMs":"583","numDeletedRows":"1","scanTimeMs":"758"},"engineInfo":"Apache-Spark/3.3.1 Delta-Lake/2.1.0","txnId":"79302f36-c952-4433-a489-81addb7c54fd"}}
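The `commitInfo.timestamp` values in the log are epoch milliseconds. The sketch below lines up the five commits shown in this unit (versions 0 to 4) as a readable timeline, which makes it easy to see which operation each version recorded. The version, timestamp and operation values are copied from the log output above; `to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# (version, commitInfo.timestamp in epoch milliseconds, commitInfo.operation)
commits = [
    (0, 1667338234298, "WRITE"),
    (1, 1667339216154, "DELETE"),
    (2, 1667339236274, "WRITE"),
    (3, 1667339255563, "UPDATE"),
    (4, 1667339279420, "DELETE"),
]


def to_utc(epoch_ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


timeline = [(version, to_utc(ts), op) for version, ts, op in commits]

for version, when, op in timeline:
    print(f"v{version}  {when:%Y-%m-%d %H:%M:%S} UTC  {op}")
```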


### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK