# Databricks Delta Lake

<p align="center">
    <img src="images/DeltaLake.png" width="200" height="150"/>
</p>

Traditional data lakes often face challenges related to reliability, consistency, and performance. In a conventional data lake setup, handling concurrent writes, maintaining data consistency, enforcing schemas, and managing versions can become complex. Additionally, without *ACID properties*, data lakes may be susceptible to issues like data corruption and incomplete writes.

> *Delta Lake*, the storage layer of Databricks, is designed to overcome the limitations of traditional data lakes. It introduces a *transaction log* that ensures ACID properties, making data operations reliable and consistent. It also provides schema enforcement, which helps maintain data integrity, and allows for schema evolution without disrupting workflows.

## ACID Properties

In the landscape of data lakes, the ACID properties (*Atomicity*, *Consistency*, *Isolation*, and *Durability*) are the foundation of reliable and robust data transactions.

- **Atomicity** guarantees that data operations are treated as single, indivisible units. If a failure occurs, the system either completes the entire operation or rolls it back, maintaining a consistent state.

- **Consistency** ensures that data remains in a valid state throughout a transaction, preventing incomplete or erroneous states in the database.

- **Isolation** safeguards against interference between concurrent transactions. Each transaction appears to run in isolation, even when executed simultaneously with others.

- **Durability** ensures that once a transaction is committed, its effects are permanent and survive system failures, so data is not lost.

## Transaction Logs

> Delta Lake employs a *transaction log*, a key mechanism for maintaining ACID properties. The transaction log is a comprehensive, ordered record of every change made to the data, including insertions, updates, and deletions.

By capturing these changes, the transaction log makes it possible to reconstruct the state of the data at any given point in time. This plays a pivotal role in supporting ACID transactions and features like *time travel* in Delta Lake. We will discuss time travel in more detail later in the lesson.

## Hands-On: Creating and Managing a Delta Lake Table

In this section, we will walk through a hands-on demonstration of how to create a Delta Lake table, execute transactions, and inspect the transaction logs.

### 1. Create an Empty Delta Lake Table

- Open a Databricks notebook
- Use the following command to create an empty Delta Lake table named `employees`:

```sql
-- Create an empty 'employees' table with the specified schema
CREATE TABLE employees
(
  id INT,
  name STRING,
  title STRING,
  start_date DATE,
  salary INT
)
-- USING DELTA  (optional; if used, the clause goes here, after the column list)
```

- The `USING DELTA` clause is optional here. Delta Lake is the default storage format in Databricks, so even without specifying it explicitly, the `employees` table will be created as a Delta Lake table.

### 2. Verify Schema in Data Explorer

- Navigate to the **Data** tab in Databricks
- Find the `employees` table and explore its schema in the **Data Explorer**

<p align="center">
    <img src="images/EmptyTable.png" width="750" height="450"/>
</p>

### 3. Insert Data into the Delta Lake Table

- In the same notebook, insert some sample data into the `employees` table:

```sql
INSERT INTO employees
VALUES
  (1, 'John Doe', 'Engineer', '2023-01-01', 80000),
  (2, 'Jane Smith', 'Analyst', '2023-01-02', 70000),
  (3, 'Bob Johnson', 'Manager', '2023-01-03', 90000);
```
- After executing this statement, you will see the following output:

<p align="center">
    <img src="images/InsertOutput.png" width="750" height="250"/>
</p>

- This output indicates that the `INSERT` operation affected a total of 3 rows. It also specifies the number of rows that were inserted. In this example, that number matches the total number of rows affected, which is 3. However, in some scenarios, such as operations that also update or delete existing rows, the number of rows affected and the number of rows inserted may differ.
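
One scenario where the two counts can differ is a `MERGE` that both updates and inserts rows. The following is a minimal sketch rather than part of the walkthrough: it uses a hypothetical scratch table, `employees_demo`, with made-up rows, so that the history of the `employees` table used in later steps is not affected.

```sql
-- Hypothetical scratch table (not part of the walkthrough)
CREATE TABLE IF NOT EXISTS employees_demo
(
  id INT,
  name STRING,
  title STRING,
  start_date DATE,
  salary INT
);

-- Seed the scratch table with one existing row
INSERT INTO employees_demo
VALUES
  (1, 'John Doe', 'Engineer', '2023-01-01', 80000);

-- Upsert two rows: id 1 already exists (update), id 6 does not (insert)
MERGE INTO employees_demo AS target
USING (
  SELECT * FROM VALUES
    (1, 'John Doe', 'Senior Engineer', DATE'2023-01-01', 90000),
    (6, 'Eve Adams', 'Analyst', DATE'2023-01-06', 72000)
  AS t(id, name, title, start_date, salary)
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

On a freshly created scratch table, the `MERGE` output would be expected to report 2 affected rows, of which only 1 was inserted; the other was an update to the existing row.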

### 4. Describe Table Metadata

- Use the following command to describe the details of the `employees` table:

```sql
DESCRIBE DETAIL employees;
```

- The output of this command includes the path to the Delta Lake table, `dbfs:/user/hive/warehouse/employees`, under the **location** column.

- Use the following `%fs` command to list the contents of the `employees` Delta Lake table directory in DBFS:

```
%fs ls dbfs:/user/hive/warehouse/employees
```
<p align="center">
    <img src="images/FSOutput.png" width="800" height="250"/>
</p>

- The Snappy Parquet file in the output above is the data file that holds the rows of the Delta Lake table. Parquet is a columnar storage format, here compressed with Snappy, that is commonly used in data warehouses and big data systems. It stores the data compactly and supports fast, optimized queries.

- The output also shows a directory called `_delta_log`. This directory contains the **transaction logs** that record all changes (inserts, updates, deletes) made to the Delta Lake table. We will come back to this directory later in this walkthrough.

### 5. Update Table with New Records

- Let's now add some new records to the `employees` table:

```sql
INSERT INTO employees
VALUES
  (4, 'Alice Johnson', 'Designer', '2023-01-04', 85000),
  (5, 'Charlie Brown', 'Developer', '2023-01-05', 95000);
```
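
To confirm that both inserts landed, a quick check like the following can be run in the same notebook. This is a sketch that assumes only the two `INSERT` statements above have been executed against `employees`.

```sql
-- Count the rows currently in the table:
-- 3 from the first INSERT plus 2 from the second
SELECT COUNT(*) AS row_count FROM employees;

-- List the rows in a stable order for easy inspection
SELECT * FROM employees ORDER BY id;
```

Under that assumption, `row_count` should be 5.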

### 6. Describe Table History

- Use the following command to describe the history of the `employees` table:

```sql
DESCRIBE HISTORY employees;
```
<p align="center">
    <img src="images/History.png" width="800" height="200"/>
</p>

- The `DESCRIBE HISTORY` command output reveals three operations, starting with the table's creation (version 0) and followed by two write operations (versions 1 and 2) representing insert operations, each appending rows to the `employees` table.

- Let's navigate again to the location of the `employees` table, specifically to the transaction logs directory:

```
%fs ls dbfs:/user/hive/warehouse/employees/_delta_log
```

- For each transaction (commit) made to the Delta Lake table, you will find a pair of files in the `_delta_log` directory:

  - The `JSON` file contains metadata and transaction information in a human-readable format. It captures details about the changes made during a specific commit, such as schema evolution or data insertions.
  - The `CRC` (Cyclic Redundancy Check) file is associated with a specific `JSON` file and contains checksum information used for error checking, helping to verify that the commit data is not corrupted.
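
To see what a commit actually records, the `JSON` file can be read directly with SQL. The sketch below assumes the table location shown earlier and that commit files are named after the zero-padded, 20-digit table version, which is how Delta Lake names them. Adjust the path to match the listing in your own workspace.

```sql
-- Read the commit file for version 1 (the first INSERT) as a table.
-- Each line of the file is one JSON action, so each result row is one action.
SELECT *
FROM json.`dbfs:/user/hive/warehouse/employees/_delta_log/00000000000000000001.json`;
```

The result typically has one column per action type, such as `commitInfo` (which operation ran and when) and `add` (a data file added by the commit), with nulls where an action does not apply to a given row.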

## Time Travel in Delta Lake

Delta Lake provides a feature known as *Time Travel*, which lets users query data as it existed at earlier points in time. This capability is valuable for historical analysis and auditing.

To use time travel in Delta Lake, add a `TIMESTAMP AS OF` or `VERSION AS OF` clause to query data as it existed at a specified timestamp or version. You can look up the available timestamps and versions with the `DESCRIBE HISTORY` command.

```sql
SELECT * FROM table_name TIMESTAMP AS OF '2023-01-15T12:00:00.000Z';
```
For example, querying version 0 using the corresponding timestamp:

<p align="center">
    <img src="images/TimeStamp.png" width="850" height="125"/>
</p>

The same result should be returned if we run the same command using the version number instead:

```sql
SELECT * FROM table_name VERSION AS OF 0;
```

Both commands return `Query returned no results` because version 0 only created the table, before any data was inserted.
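
Applied to the `employees` table from the hands-on section, the same syntax can be used to compare later versions. The sketch below assumes the history matches the walkthrough exactly (version 0 for the table creation, versions 1 and 2 for the two inserts). Check `DESCRIBE HISTORY employees` first if you ran any extra statements.

```sql
-- Version 1: state after the first INSERT (ids 1 to 3)
SELECT * FROM employees VERSION AS OF 1 ORDER BY id;

-- Version 2: state after the second INSERT (ids 1 to 5)
SELECT * FROM employees VERSION AS OF 2 ORDER BY id;

-- Shorthand form: '@v<version>' appended to the table name
SELECT COUNT(*) AS row_count FROM employees@v1;
```

Under that assumption, the first query returns three rows, the second returns five, and the last reports a `row_count` of 3.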

## Key Takeaways

- Databricks Delta Lake addresses challenges in traditional data lakes, ensuring reliability, consistency, and performance, and introduces ACID properties for robust data transactions
- ACID properties include **atomicity** (indivisible operations), **consistency** (maintaining a valid data state), **isolation** (preventing interference between transactions), and **durability** (committing transactions permanently)
- **Transaction logs** serve as a comprehensive record of changes in the data lake
- **Time Travel** in Delta Lake enables querying data as of specific timestamps or versions, supporting historical analysis and auditing