# Udemy Course: Databricks Certified Data Engineer Associate - Preparation

## Section 2: Databricks Lakehouse Platform

### Lesson 11. Understanding Delta Tables (Hands On)

In this notebook, we will work with Delta Lake tables.



#### Creating Delta Lake Tables

Let us first create an empty Delta Lake table.

Just like in SQL, you only need a `CREATE TABLE` statement, a table name (in our case, `employees`), and a table schema.

```SQL
CREATE TABLE employees
-- USING DELTA
    (id INT, name STRING, salary DOUBLE);
```

Here, `id` is of type `INT`, `name` is of type `STRING`, and `salary` is of type `DOUBLE`.

Delta Lake is the default format, so you don't need to specify the `USING DELTA` keyword and we can simply remove it.

Let us run our first command.


In [0]:
%sql
USE CATALOG hive_metastore

In [0]:
%sql
CREATE TABLE employees
    (id INT, name STRING, salary DOUBLE);

#### Catalog Explorer

Great. The table has been created.

Let's confirm this. 
Let's go to the Catalog tab.

<div style="text-align: center;">
<img src="../images/Catalog Explorer employee table.jpg" style="width:1280px" >
</div> 

Here, in the default database, we can see that the table employees has been created.

Here we can see the schema of the table (our three columns: `id`, `name`, and `salary`) along with other metadata.



#### Inserting Data

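Now we will insert some records.

Again, like in SQL, we will use `INSERT INTO` statements. In this demo, the six records are split across four statements, so each statement is its own transaction.

Let's run our second command.

The cell output shows only the result of the last statement (one inserted row), but all six records have been inserted successfully.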


In [0]:
%sql
-- NOTE: With the latest Databricks Runtimes, inserting a few records in a single transaction is optimized into a single data file.
-- For this demo, we will insert the records in multiple transactions in order to create 4 data files.

INSERT INTO employees
VALUES 
  (1, "Adam", 3500.0),
  (2, "Sarah", 4020.5);

INSERT INTO employees
VALUES
  (3, "John", 2999.3),
  (4, "Thomas", 4000.3);

INSERT INTO employees
VALUES
  (5, "Anna", 2500.0);

INSERT INTO employees
VALUES
  (6, "Kim", 6200.3)

-- NOTE: When executing multiple SQL statements in the same cell, only the last statement's result will be displayed in the cell output.

num_affected_rows,num_inserted_rows
1,1


Now we can simply query the table using a standard `SELECT` statement.


In [0]:
%sql
SELECT * FROM employees

id,name,salary
3,John,2999.3
4,Thomas,4000.3
1,Adam,3500.0
2,Sarah,4020.5
5,Anna,2500.0
6,Kim,6200.3


#### Exploring Table Metadata

Let us now see some metadata information about our table.

Here we will use the `DESCRIBE DETAIL` command on our table.
It is an important command that allows us to explore table metadata.



In [0]:
%sql
DESCRIBE DETAIL employees

format,id,name,description,location,createdAt,lastModified,partitionColumns,clusteringColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics
delta,f417c9dd-30ac-49c7-b6ae-14e820c51596,hive_metastore.default.employees,,dbfs:/user/hive/warehouse/employees,2024-10-12T04:52:39.544Z,2024-10-12T04:53:58Z,List(),List(),4,4281,Map(),1,2,"List(appendOnly, invariants)",Map()


As you can see, there is a lot of important information about our table here.

For example, we can see the location of the table.
It is the location where the table files are actually stored.

In addition, we also have the `numFiles` field,
which indicates the number of data files in the current table version.



#### Exploring Table Directory

Let us copy the table location and explore the files using the `%fs` magic command.



In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees'

path,name,size,modificationTime
dbfs:/user/hive/warehouse/employees/_delta_log/,_delta_log/,0,1728708759000
dbfs:/user/hive/warehouse/employees/part-00000-1ce54e7c-9ef3-4343-b8ed-7787e775fe0b-c000.snappy.parquet,part-00000-1ce54e7c-9ef3-4343-b8ed-7787e775fe0b-c000.snappy.parquet,1076,1728708835000
dbfs:/user/hive/warehouse/employees/part-00000-5bb80f0a-fe61-493a-892c-7c5f4945e249-c000.snappy.parquet,part-00000-5bb80f0a-fe61-493a-892c-7c5f4945e249-c000.snappy.parquet,1080,1728708836000
dbfs:/user/hive/warehouse/employees/part-00000-65976bba-d807-4f83-9b35-bfa88b852228-c000.snappy.parquet,part-00000-65976bba-d807-4f83-9b35-bfa88b852228-c000.snappy.parquet,1066,1728708837000
dbfs:/user/hive/warehouse/employees/part-00000-b0d96a56-9d8f-4eba-93a0-481be949baac-c000.snappy.parquet,part-00000-b0d96a56-9d8f-4eba-93a0-481be949baac-c000.snappy.parquet,1059,1728708838000


#### Updating Table

Let us now look at an update scenario.

In this scenario, we need to update the salary of all employees whose name starts with the letter A by adding 100 to their salary.

In [0]:
%sql
UPDATE employees 
SET salary = salary + 100
WHERE name LIKE "A%"

num_affected_rows
2


Here we can see that there are two records affected by the update operation.

Let us query the table again to see the updated data.


In [0]:
%sql
SELECT * FROM employees

id,name,salary
3,John,2999.3
4,Thomas,4000.3
1,Adam,3600.0
2,Sarah,4020.5
5,Anna,2600.0
6,Kim,6200.3


Let us now see what happened in the table directory.

In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees'

path,name,size,modificationTime
dbfs:/user/hive/warehouse/employees/_delta_log/,_delta_log/,0,1728708759000
dbfs:/user/hive/warehouse/employees/part-00000-1ce54e7c-9ef3-4343-b8ed-7787e775fe0b-c000.snappy.parquet,part-00000-1ce54e7c-9ef3-4343-b8ed-7787e775fe0b-c000.snappy.parquet,1076,1728708835000
dbfs:/user/hive/warehouse/employees/part-00000-5bb80f0a-fe61-493a-892c-7c5f4945e249-c000.snappy.parquet,part-00000-5bb80f0a-fe61-493a-892c-7c5f4945e249-c000.snappy.parquet,1080,1728708836000
dbfs:/user/hive/warehouse/employees/part-00000-65976bba-d807-4f83-9b35-bfa88b852228-c000.snappy.parquet,part-00000-65976bba-d807-4f83-9b35-bfa88b852228-c000.snappy.parquet,1066,1728708837000
dbfs:/user/hive/warehouse/employees/part-00000-7fec7995-32fc-4591-b2ce-c38399443529-c000.snappy.parquet,part-00000-7fec7995-32fc-4591-b2ce-c38399443529-c000.snappy.parquet,1076,1728708900000
dbfs:/user/hive/warehouse/employees/part-00000-b0d96a56-9d8f-4eba-93a0-481be949baac-c000.snappy.parquet,part-00000-b0d96a56-9d8f-4eba-93a0-481be949baac-c000.snappy.parquet,1059,1728708838000
dbfs:/user/hive/warehouse/employees/part-00001-f3a4dc82-ba54-4812-ab87-dfca50f7ba8f-c000.snappy.parquet,part-00001-f3a4dc82-ba54-4812-ab87-dfca50f7ba8f-c000.snappy.parquet,1066,1728708900000


We can see that two files have been added to the directory.

As we said, rather than updating the records in the files themselves, Delta Lake writes updated copies of those files.

Delta then uses the transaction log to indicate which files are valid in the current version of the table.
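
The copy-on-write behavior described above can be sketched in plain Python. This is only a toy model under stated assumptions, not Delta Lake's actual implementation: the file names are hypothetical, the helper `copy_on_write_update` is made up for illustration, and the grouping of rows into files simply mirrors the four `INSERT` statements from this lesson.

```python
# Toy model of a copy-on-write UPDATE; not Delta Lake's real implementation.
# File names are hypothetical; rows mirror the employees table in this lesson.

def copy_on_write_update(files, predicate, update):
    """Return (added, removed) without mutating any existing file.

    files: dict mapping file name -> list of (id, name, salary) rows.
    """
    added, removed = {}, []
    for file_name, rows in files.items():
        if not any(predicate(row) for row in rows):
            continue  # files with no matching rows are left untouched
        # Rewrite the whole file: matching rows are updated, the rest are copied.
        added[f"new-{file_name}"] = [
            update(row) if predicate(row) else row for row in rows
        ]
        removed.append(file_name)  # the old file stays on disk, only marked removed
    return added, removed


files = {
    "file-1.parquet": [(1, "Adam", 3500.0), (2, "Sarah", 4020.5)],
    "file-2.parquet": [(3, "John", 2999.3), (4, "Thomas", 4000.3)],
    "file-3.parquet": [(5, "Anna", 2500.0)],
    "file-4.parquet": [(6, "Kim", 6200.3)],
}

added, removed = copy_on_write_update(
    files,
    predicate=lambda row: row[1].startswith("A"),        # WHERE name LIKE "A%"
    update=lambda row: (row[0], row[1], row[2] + 100),    # SET salary = salary + 100
)

print(removed)  # ['file-1.parquet', 'file-3.parquet']
print(added)
```

In this sketch two files are rewritten, because Adam and Anna live in different files, and Sarah's row is copied unchanged because it shares a file with Adam. The four original files are never modified.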



Let us confirm this by running the `DESCRIBE DETAIL` command.



In [0]:
%sql
DESCRIBE DETAIL employees

format,id,name,description,location,createdAt,lastModified,partitionColumns,clusteringColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics
delta,f417c9dd-30ac-49c7-b6ae-14e820c51596,hive_metastore.default.employees,,dbfs:/user/hive/warehouse/employees,2024-10-12T04:52:39.544Z,2024-10-12T04:55:00Z,List(),List(),4,4281,Map(),1,2,"List(appendOnly, invariants)",Map()


Here, as you can see, the number of files is four and not six: these are the four files that represent the current version of the table.

They include the new files written by our update command.

So if we query our Delta table again,
the query engine uses the transaction log to
  * resolve all the files that are valid in the current version and 
  * ignore all other data files.


In [0]:
%sql
SELECT * FROM employees

id,name,salary
3,John,2999.3
4,Thomas,4000.3
1,Adam,3600.0
2,Sarah,4020.5
5,Anna,2600.0
6,Kim,6200.3
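
The file resolution described above can be modeled as a replay of the transaction log. The following is a minimal sketch, assuming each commit is reduced to a list of added and removed file names; the function `active_files` and all file names are hypothetical, and real Delta readers do considerably more (checkpoints, statistics, protocol checks).

```python
# Toy model of transaction-log replay; file names are hypothetical.

def active_files(log, version=None):
    """Replay commits 0..version and return the files valid at that version."""
    if version is None:
        version = len(log) - 1  # default to the latest version
    files = set()
    for commit in log[: version + 1]:
        files -= set(commit.get("remove", []))  # drop files marked as removed
        files |= set(commit.get("add", []))     # register newly written files
    return files


log = [
    {},                                  # version 0: CREATE TABLE
    {"add": ["file-1.parquet"]},         # version 1: INSERT
    {"add": ["file-2.parquet"]},         # version 2: INSERT
    {"add": ["file-3.parquet"]},         # version 3: INSERT
    {"add": ["file-4.parquet"]},         # version 4: INSERT
    {                                    # version 5: UPDATE
        "remove": ["file-1.parquet", "file-3.parquet"],
        "add": ["file-5.parquet", "file-6.parquet"],
    },
]

current = active_files(log)
print(sorted(current))
print(len(current))  # 4
```

Six files exist on disk in this model, but only four are part of the latest version. Passing an earlier version, for example `active_files(log, 4)`, returns the files as they were before the update.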


#### Exploring Table History

And since the transaction log also stores all the changes to the Delta Lake table, we can easily review the table history using the `DESCRIBE HISTORY` command.



In [0]:
%sql
DESCRIBE HISTORY employees

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
5,2024-10-12T04:55:00Z,2895352578531874,suryapulika38@gmail.com,UPDATE,"Map(predicate -> [""StartsWith(name#10523, A)""])",,List(4341422527294408),1011-150700-u18wk0fi,4.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numRemovedBytes -> 2142, numCopiedRows -> 1, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1375, scanTimeMs -> 899, numAddedFiles -> 2, numUpdatedRows -> 2, numAddedBytes -> 2142, rewriteTimeMs -> 466)",,Databricks-Runtime/13.3.x-scala2.12
4,2024-10-12T04:53:58Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,3.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1059)",,Databricks-Runtime/13.3.x-scala2.12
3,2024-10-12T04:53:57Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,2.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1066)",,Databricks-Runtime/13.3.x-scala2.12
2,2024-10-12T04:53:56Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,1.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 1080)",,Databricks-Runtime/13.3.x-scala2.12
1,2024-10-12T04:53:55Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,0.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 1076)",,Databricks-Runtime/13.3.x-scala2.12
0,2024-10-12T04:52:39Z,2895352578531874,suryapulika38@gmail.com,CREATE TABLE,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(4341422527294408),1011-150700-u18wk0fi,,WriteSerializable,True,Map(),,Databricks-Runtime/13.3.x-scala2.12


As you can see, there are six versions of the table, starting from version zero, where we created the table.

Versions 1 through 4 represent our four insert commands.
So they are write operations.

And finally, version 5 is our update command.

So as you can see, thanks to the transaction log, we have the full history of all operations that have happened on the table.



The transaction log is located under the `_delta_log` folder in the table directory.

Let us explore this folder.



In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees/_delta_log'

path,name,size,modificationTime
dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000000.crc,00000000000000000000.crc,2048,1728708761000
dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000000.json,00000000000000000000.json,1056,1728708759000
dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000001.crc,00000000000000000001.crc,2578,1728708836000
dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000001.json,00000000000000000001.json,1110,1728708835000
dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000002.crc,00000000000000000002.crc,3104,1728708837000
dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000002.json,00000000000000000002.json,1111,1728708836000
dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000003.crc,00000000000000000003.crc,3628,1728708838000
dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000003.json,00000000000000000003.json,1109,1728708837000
dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000004.crc,00000000000000000004.crc,4150,1728708839000
dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000004.json,00000000000000000004.json,1107,1728708838000


Each transaction is a new JSON file written to the Delta Lake transaction log.

Here you can see numbered JSON files, starting from zero, each representing one transaction we have made on the table.

The other files in the directory are just the checksums of the JSON files.
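
The numbering visible in the listing follows a simple rule: each commit file is named after the table version it creates, zero-padded to 20 digits. A small sketch of that mapping, where the helper names `commit_file_name` and `commit_version` are made up for illustration:

```python
def commit_file_name(version: int) -> str:
    """Return the JSON commit file name for a table version (20 zero-padded digits)."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"


def commit_version(file_name: str) -> int:
    """Recover the table version from a commit file name."""
    return int(file_name.split(".")[0])


print(commit_file_name(5))
print(commit_version(commit_file_name(5)))  # 5
```

This is why the versions reported by `DESCRIBE HISTORY` line up one-to-one with the JSON files in the log folder.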

Let us look inside the last file representing the update transaction.


In [0]:
%fs head 'dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000005.json'

For example, with the "add" elements, you can see the new files that have been written to our table.

And with the "remove" elements, you can see the list of files that have been soft deleted from our table,
meaning the files that should no longer be included in the table.
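
To make the structure of a commit file more concrete, here is a minimal sketch that reads the "add" and "remove" actions from a commit. A commit file contains one JSON action per line. The content below is hypothetical and heavily trimmed (real entries carry more fields, such as sizes, timestamps, and statistics), the file names are invented, and `summarize_commit` is an illustrative helper rather than part of any Delta Lake API.

```python
import json

# Hypothetical, trimmed commit content: one JSON action per line.
commit_text = "\n".join([
    json.dumps({"commitInfo": {"operation": "UPDATE"}}),
    json.dumps({"remove": {"path": "file-1.parquet", "dataChange": True}}),
    json.dumps({"remove": {"path": "file-3.parquet", "dataChange": True}}),
    json.dumps({"add": {"path": "file-5.parquet", "dataChange": True}}),
    json.dumps({"add": {"path": "file-6.parquet", "dataChange": True}}),
])


def summarize_commit(text):
    """Return (added_paths, removed_paths) found in a commit file's text."""
    added, removed = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        action = json.loads(line)
        if "add" in action:
            added.append(action["add"]["path"])
        elif "remove" in action:
            removed.append(action["remove"]["path"])
        # other actions (commitInfo, metaData, protocol) are ignored here
    return added, removed


added_paths, removed_paths = summarize_commit(commit_text)
print("added:", added_paths)
print("removed:", removed_paths)
```

In a notebook, the text returned by `%fs head` on a commit file could be passed through the same kind of loop, although large commits may be truncated by `head`.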
