# Delta Lake Fundamentals

## Overview
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. This notebook covers creating Delta tables, reading and modifying them, and the transaction log that underpins their guarantees.

## Learning Objectives
- Understand what Delta Lake is and why it matters
- Create and manage Delta tables
- Perform CRUD operations
- Use ACID transactions
- Leverage Delta Lake features

---

## 1. What is Delta Lake?

### The Problem with Traditional Data Lakes

Traditional data lakes using Parquet, CSV, or JSON have limitations:

❌ **No ACID transactions**
- Failed writes leave partial data
- No atomicity for multi-file operations

❌ **No schema enforcement**
- Data quality issues
- Schema mismatches

❌ **Difficult updates and deletes**
- Must rewrite entire partitions
- No efficient row-level operations

❌ **No time travel**
- Can't query historical versions
- Difficult to roll back changes

❌ **Performance challenges**
- Small file problem
- No Z-ordering or statistics

### Delta Lake Solution

Delta Lake adds a transaction log (a metadata layer) on top of Parquet files in cloud storage:

```
Delta Table = Parquet Files + Transaction Log
```

- ✅ **ACID Transactions**: Atomicity, Consistency, Isolation, Durability
- ✅ **Schema Enforcement & Evolution**: Safe schema changes
- ✅ **Time Travel**: Query any historical version
- ✅ **Efficient Upserts**: MERGE operations
- ✅ **Performance**: OPTIMIZE, Z-ORDER, statistics
- ✅ **Unified Batch & Streaming**: Single API for both

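The equation above can be made concrete with a toy model. The sketch below is not Delta Lake's reader; it only assumes that each commit is a numbered list of actions, where `add` registers a Parquet file and `remove` retires one. The file names and the `active_files` helper are hypothetical.

```python
# Toy model of a Delta-style transaction log (illustrative, not the real reader).
# Each commit is a list of actions: "add" registers a data file, "remove" retires it.
commits = [
    [{"add": {"path": "part-0000.parquet"}}],       # version 0: initial write
    [{"add": {"path": "part-0001.parquet"}}],       # version 1: append
    [{"remove": {"path": "part-0000.parquet"}},     # version 2: an update replaces
     {"add": {"path": "part-0002.parquet"}}],       #   one file with a rewritten copy
]


def active_files(commits, as_of=None):
    """Replay commits 0..as_of and return the set of files in that table version."""
    last = len(commits) - 1 if as_of is None else as_of
    files = set()
    for actions in commits[: last + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(sorted(active_files(commits)))            # files in the latest version
print(sorted(active_files(commits, as_of=0)))   # files in version 0
```

Replaying the log only up to an earlier version is, in this simplified picture, what makes time travel possible: the update in version 2 retires the old file in the log rather than erasing the record that it once belonged to the table.
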
## 2. Creating Delta Tables

There are multiple ways to create Delta tables in Databricks.

### Method 1: Using PySpark DataFrame Writer

In [None]:
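# Create sample data
# Import only the functions used in this notebook; a wildcard import would
# shadow Python built-ins such as sum, min, and max
from pyspark.sql.functions import col, lit, to_date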

# Sample customer data
data = [
    (1, "Alice", "alice@email.com", "2024-01-15", "NY"),
    (2, "Bob", "bob@email.com", "2024-01-20", "CA"),
    (3, "Charlie", "charlie@email.com", "2024-02-01", "TX"),
    (4, "Diana", "diana@email.com", "2024-02-10", "NY"),
    (5, "Eve", "eve@email.com", "2024-02-15", "FL")
]

columns = ["customer_id", "name", "email", "signup_date", "state"]

df = spark.createDataFrame(data, columns)

# Convert signup_date to date type
df = df.withColumn("signup_date", to_date(col("signup_date")))

display(df)

In [None]:
# Write as Delta table
delta_path = "/tmp/delta/customers"

df.write \
    .format("delta") \
    .mode("overwrite") \
    .save(delta_path)

print(f"Delta table created at: {delta_path}")

In [None]:
# Read Delta table
customers_df = spark.read.format("delta").load(delta_path)
display(customers_df)

### Method 2: Using SQL CREATE TABLE

In [None]:
# Create Delta table using SQL
spark.sql("""
CREATE TABLE IF NOT EXISTS orders (
    order_id BIGINT,
    customer_id BIGINT,
    product_id STRING,
    amount DECIMAL(10, 2),
    order_date DATE,
    status STRING
)
USING DELTA
PARTITIONED BY (order_date)
LOCATION '/tmp/delta/orders'
""")

print("Orders table created successfully")

In [None]:
# Insert data using SQL
spark.sql("""
INSERT INTO orders VALUES
    (101, 1, 'PROD-001', 150.00, '2024-02-01', 'completed'),
    (102, 1, 'PROD-002', 200.50, '2024-02-02', 'completed'),
    (103, 2, 'PROD-001', 75.25, '2024-02-03', 'pending'),
    (104, 3, 'PROD-003', 300.00, '2024-02-04', 'completed'),
    (105, 4, 'PROD-002', 120.75, '2024-02-05', 'shipped')
""")

# Query the table
display(spark.sql("SELECT * FROM orders ORDER BY order_date"))

### Method 3: SaveAsTable (Managed Tables)

In [None]:
# Create a managed Delta table
# Data is stored in the default warehouse location

products_data = [
    ("PROD-001", "Laptop", "Electronics", 999.99),
    ("PROD-002", "Phone", "Electronics", 599.99),
    ("PROD-003", "Desk", "Furniture", 299.99),
    ("PROD-004", "Chair", "Furniture", 149.99)
]

products_df = spark.createDataFrame(
    products_data, 
    ["product_id", "product_name", "category", "price"]
)

products_df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("products")

display(spark.table("products"))

## 3. Delta Table Operations

### Reading Delta Tables

In [None]:
# Read using path
df1 = spark.read.format("delta").load("/tmp/delta/customers")

# Read using table name
df2 = spark.table("orders")

# Read with SQL
df3 = spark.sql("SELECT * FROM products WHERE category = 'Electronics'")

print(f"Customers: {df1.count()} rows")
print(f"Orders: {df2.count()} rows")
print(f"Electronics: {df3.count()} rows")

### Updating Data

In [None]:
# Update using SQL
spark.sql("""
UPDATE orders
SET status = 'completed'
WHERE status = 'shipped'
""")

# Verify update
display(spark.sql("SELECT * FROM orders WHERE order_id = 105"))

In [None]:
# Update using Python Delta Table API
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/tmp/delta/customers")

delta_table.update(
    condition = "state = 'NY'",
    set = {"state": lit("NEW YORK")}
)

# Verify
display(spark.read.format("delta").load("/tmp/delta/customers"))

### Deleting Data

In [None]:
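# Delete using SQL
# With the sample data this matches no rows: the only pending order (103)
# is dated 2024-02-03, which is not before the cutoff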
spark.sql("""
DELETE FROM orders
WHERE status = 'pending' AND order_date < '2024-02-03'
""")

print(f"Orders remaining: {spark.table('orders').count()}")

In [None]:
# Delete using Python API
delta_table = DeltaTable.forName(spark, "orders")

delta_table.delete("amount < 100")

print(f"Orders remaining: {spark.table('orders').count()}")
display(spark.table("orders"))

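The updates and deletes above do not edit Parquet files in place. A minimal sketch of the classic copy-on-write path follows, assuming row-level features such as deletion vectors are not in use; `delete_where` and the file names are hypothetical, and plain dictionaries stand in for Parquet files.

```python
# Toy copy-on-write delete: only files that contain matching rows are rewritten.
files = {
    "part-0.parquet": [{"order_id": 101, "amount": 150.00},
                       {"order_id": 102, "amount": 200.50}],
    "part-1.parquet": [{"order_id": 103, "amount": 75.25},
                       {"order_id": 104, "amount": 300.00}],
}


def delete_where(files, predicate, next_id):
    """Return (new_files, log_actions) after deleting rows that match predicate."""
    new_files, actions = {}, []
    for path, rows in files.items():
        kept = [row for row in rows if not predicate(row)]
        if len(kept) == len(rows):
            new_files[path] = rows                      # untouched file is reused as-is
            continue
        actions.append({"remove": {"path": path}})      # retire the old file in the log
        if kept:
            new_path = f"part-{next_id}.parquet"
            next_id += 1
            new_files[new_path] = kept                  # surviving rows go to a new file
            actions.append({"add": {"path": new_path}})
    return new_files, actions


new_files, actions = delete_where(files, lambda row: row["amount"] < 100, next_id=2)
print(sorted(new_files))
print(actions)
```

In this model the file without matching rows is never touched, which is the contrast with rewriting an entire partition.
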
## 4. ACID Transactions

Delta Lake guarantees ACID properties.

### Atomicity

All or nothing: either the entire operation succeeds or none of it does.

```python
# large_df stands for any DataFrame
# If this write fails midway, no partial data becomes visible to readers
large_df.write.format("delta").save("/path/to/table")
```

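A minimal sketch of why a failed write leaves nothing visible, assuming readers only trust files that a commit has listed. `ToyTable` and its methods are hypothetical names for illustration, not part of Delta Lake.

```python
# Toy model: data files are staged first, and the commit is the last step.
class ToyTable:
    def __init__(self):
        self.storage = set()   # every file physically present in storage
        self.log = []          # committed versions, each a list of file names

    def write(self, files, fail_before_commit=False):
        for name in files:
            self.storage.add(name)            # step 1: write the data files
        if fail_before_commit:
            raise RuntimeError("writer crashed before committing")
        self.log.append(list(files))          # step 2: one commit makes them visible

    def visible_files(self):
        return {name for version in self.log for name in version}


table = ToyTable()
table.write(["a.parquet"])
try:
    table.write(["b.parquet", "c.parquet"], fail_before_commit=True)
except RuntimeError as err:
    print(f"write failed: {err}")

print(sorted(table.visible_files()))   # only files from committed writes
print(sorted(table.storage))           # orphaned files exist but readers ignore them
```
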
### Consistency

Data remains in a valid state before and after a transaction.

```python
# Schema enforcement rejects data that doesn't fit the table
# This append fails if the DataFrame's schema is incompatible with the table's
df.write.format("delta").mode("append").save("/path/to/table")
```

### Isolation

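Concurrent operations don't corrupt each other's results. Delta Lake uses optimistic concurrency control: each writer works from a snapshot of the table and checks for conflicts when it commits.

```python
# Writers can run simultaneously; a commit that conflicts with one
# that landed first fails with an error and can be retried
# Readers always see a consistent snapshot of the table
```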

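A minimal sketch of how two writers racing for the same table version can be resolved, assuming a commit is the exclusive creation of the next numbered log file. `try_commit` and `commit_with_retry` are hypothetical helpers, and a local temporary directory stands in for cloud storage.

```python
import os
import tempfile


def try_commit(log_dir, version, payload):
    """Return True if this writer created commit `version`, False if it lost the race."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as handle:   # mode "x" fails if the file already exists
            handle.write(payload)
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir, payload):
    """Keep trying the next free version number until the commit succeeds."""
    version = len(os.listdir(log_dir))
    while not try_commit(log_dir, version, payload):
        version += 1   # a real writer would also re-check for logical conflicts here
    return version


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, '{"writer": "A"}')     # writer A claims version 0
second = try_commit(log_dir, 0, '{"writer": "B"}')    # writer B saw the same snapshot and loses
retried = commit_with_retry(log_dir, '{"writer": "B"}')
print(first, second, retried)
```

Only one writer can create a given version, so the log stays a single ordered sequence. Whether the losing writer may simply retry depends on what the winner changed, and this sketch deliberately leaves that check out.
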
### Durability

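Once a commit is recorded in the transaction log, the change is permanent and survives cluster failures and restarts.

```python
# Committed data files and log entries live in durable storage,
# so a failure after the commit does not lose the change
```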

In [None]:
# Demonstration of an append followed by a read
# (each operation is its own atomic transaction)
from datetime import datetime

# Writer: add a new customer
new_customer = spark.createDataFrame(
    [(6, "Frank", "frank@email.com", datetime.now().date(), "CA")],
    ["customer_id", "name", "email", "signup_date", "state"]
)

new_customer.write \
    .format("delta") \
    .mode("append") \
    .save("/tmp/delta/customers")

print("Customer added successfully")

# Reader: Always sees consistent state
print(f"Total customers: {spark.read.format('delta').load('/tmp/delta/customers').count()}")

## 5. Schema Enforcement and Evolution

### Schema Enforcement

Delta Lake prevents writes with incompatible schemas.

In [None]:
# Current schema
spark.read.format("delta").load("/tmp/delta/customers").printSchema()

In [None]:
# Try to write a column the table doesn't have - will fail
# (omitting columns does not fail: Delta fills missing columns with NULL)
try:
    bad_data = spark.createDataFrame(
        [(7, "Grace", "grace@email.com", datetime.now().date(), "WA", "555-0123")],
        ["customer_id", "name", "email", "signup_date", "state", "phone"]  # Extra column!
    )
    
    bad_data.write.format("delta").mode("append").save("/tmp/delta/customers")
except Exception as e:
    print(f"Expected error: {str(e)[:100]}...")

### Schema Evolution

Allow controlled schema changes with the `mergeSchema` option.

In [None]:
# Add a new column with schema evolution

evolved_data = spark.createDataFrame(
    [(7, "Grace", "grace@email.com", datetime.now().date(), "WA", "555-0123")],
    ["customer_id", "name", "email", "signup_date", "state", "phone"]  # New column!
)

# Write with schema evolution enabled
evolved_data.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/tmp/delta/customers")

print("Schema evolved successfully!")

# Check new schema
spark.read.format("delta").load("/tmp/delta/customers").printSchema()

In [None]:
# View data - old records have NULL for new column
display(spark.read.format("delta").load("/tmp/delta/customers"))

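The append rules from this section can be summarized in a small model. This is a sketch under simplifying assumptions: schemas are plain `{column: type}` dictionaries, types must match exactly, and `check_append` is a hypothetical helper rather than a Delta Lake function. The real rules cover more cases.

```python
# Toy schema check for appends: new columns are rejected unless merge_schema is set.
def check_append(table_schema, incoming_schema, merge_schema=False):
    """Return the table schema after the append, or raise ValueError if rejected."""
    result = dict(table_schema)
    for column, dtype in incoming_schema.items():
        if column not in table_schema:
            if not merge_schema:
                raise ValueError(f"unexpected column: {column}")
            result[column] = dtype           # evolution: add the new column
        elif table_schema[column] != dtype:
            raise ValueError(
                f"type mismatch for {column}: {dtype} vs {table_schema[column]}"
            )
    return result


table = {"customer_id": "bigint", "name": "string", "email": "string",
         "signup_date": "date", "state": "string"}
incoming = {**table, "phone": "string"}

rejected = None
try:
    check_append(table, incoming)            # enforcement: the extra column is refused
except ValueError as err:
    rejected = str(err)

evolved = check_append(table, incoming, merge_schema=True)
print(rejected)
print(list(evolved))
```
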
## 6. Understanding the Transaction Log

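Every change to a Delta table is recorded in the transaction log, stored as ordered commit files in the table's `_delta_log` directory. Readers use the log to determine which Parquet files make up each table version.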

In [None]:
# View Delta table structure
display(dbutils.fs.ls("/tmp/delta/customers"))

In [None]:
# View transaction log
display(dbutils.fs.ls("/tmp/delta/customers/_delta_log"))

In [None]:
# Describe table history
display(spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/customers`"))

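Each commit file in `_delta_log` holds one JSON action per line. The sketch below summarizes such a file in plain Python. The sample content is invented for illustration: only the action names (`commitInfo`, `add`) and the fields used here follow the log format, and `summarize_commit` is a hypothetical helper.

```python
import json

# Made-up commit content with one JSON action per line.
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE", "timestamp": 1707000000000}}),
    json.dumps({"add": {"path": "part-00000.parquet", "size": 1024, "dataChange": True}}),
    json.dumps({"add": {"path": "part-00001.parquet", "size": 2048, "dataChange": True}}),
])


def summarize_commit(text):
    """Count actions by type and total the bytes added in one commit file."""
    counts, bytes_added, operation = {}, 0, None
    for line in text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        for kind, body in action.items():
            counts[kind] = counts.get(kind, 0) + 1
            if kind == "add":
                bytes_added += body.get("size", 0)
            elif kind == "commitInfo":
                operation = body.get("operation")
    return {"operation": operation, "actions": counts, "bytes_added": bytes_added}


summary = summarize_commit(sample_commit)
print(summary)
```

On a real table, one way to inspect the same information is to load the log with `spark.read.json("/tmp/delta/customers/_delta_log/*.json")`. `DESCRIBE HISTORY` remains the more convenient view.
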
## 7. Partitioning Delta Tables

In [None]:
# Create partitioned Delta table
sales_data = [
    (1, "2024-01-15", "US", 1000.0),
    (2, "2024-01-16", "US", 1500.0),
    (3, "2024-01-15", "EU", 2000.0),
    (4, "2024-02-01", "US", 1200.0),
    (5, "2024-02-01", "APAC", 3000.0),
]

sales_df = spark.createDataFrame(
    sales_data,
    ["sale_id", "sale_date", "region", "amount"]
).withColumn("sale_date", to_date(col("sale_date")))

# Write with partitioning
sales_df.write \
    .format("delta") \
    .partitionBy("sale_date", "region") \
    .mode("overwrite") \
    .save("/tmp/delta/sales")

print("Partitioned table created")

In [None]:
# View partition structure
display(dbutils.fs.ls("/tmp/delta/sales"))

In [None]:
# Query leveraging partitions (partition pruning)
# Only the region=US directories for matching dates are read
us_sales = spark.read.format("delta").load("/tmp/delta/sales") \
    .filter("region = 'US' AND sale_date >= '2024-01-15'")

display(us_sales)

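Partition pruning can be pictured as filtering files by their partition values before any data is read. The sketch below parses Hive-style directory names as a stand-in for that lookup. The file names and both helpers are hypothetical, and Delta Lake itself takes partition values from the transaction log rather than from paths.

```python
from datetime import date

# Hive-style layout produced by partitionBy("sale_date", "region").
paths = [
    "sale_date=2024-01-15/region=US/part-0.parquet",
    "sale_date=2024-01-16/region=US/part-0.parquet",
    "sale_date=2024-01-15/region=EU/part-0.parquet",
    "sale_date=2024-02-01/region=US/part-0.parquet",
    "sale_date=2024-02-01/region=APAC/part-0.parquet",
]


def partition_values(path):
    """Extract {'sale_date': ..., 'region': ...} from the directory names."""
    directories = path.split("/")[:-1]
    return dict(part.split("=", 1) for part in directories)


def prune(paths, region, min_date):
    """Keep only files whose partition values can satisfy the filter."""
    selected = []
    for path in paths:
        values = partition_values(path)
        in_region = values["region"] == region
        in_range = date.fromisoformat(values["sale_date"]) >= min_date
        if in_region and in_range:
            selected.append(path)
    return selected


to_read = prune(paths, region="US", min_date=date(2024, 1, 15))
print(f"reading {len(to_read)} of {len(paths)} files")
```
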
## Practice Exercises

### Exercise 1: Create a Product Inventory Table
Create a Delta table for product inventory with schema: product_id, product_name, category, quantity, last_updated.

In [None]:
# Exercise 1 - Your code here

# TODO: Create sample inventory data
# TODO: Write as Delta table with partitioning by product category
# TODO: Query the table

### Exercise 2: Perform CRUD Operations
On the inventory table:
1. Insert 5 new products
2. Update quantity for 2 products
3. Delete 1 product
4. Query to verify changes

In [None]:
# Exercise 2 - Your code here

# TODO: INSERT new products
# TODO: UPDATE quantities
# TODO: DELETE a product
# TODO: SELECT to verify

## Summary

In this notebook, you learned:

- ✅ What Delta Lake is and why it's important
- ✅ How to create Delta tables (multiple methods)
- ✅ CRUD operations (Create, Read, Update, Delete)
- ✅ ACID transaction guarantees
- ✅ Schema enforcement and evolution
- ✅ Transaction log architecture
- ✅ Partitioning Delta tables

## Next Steps

1. Complete the practice exercises
2. Experiment with your own data
3. Move to [02-Delta-Lake-Advanced-Features.ipynb](./02-Delta-Lake-Advanced-Features.ipynb)

## Key Takeaways

- 💡 **Delta Lake = Parquet + Transaction Log**
- 💡 **ACID transactions make data reliable**
- 💡 **Schema enforcement prevents bad data**
- 💡 **Use Delta Lake for all production workloads**

## Additional Resources

- [Delta Lake Documentation](https://docs.delta.io/)
- [Databricks Delta Guide](https://docs.databricks.com/delta/)
- [Delta Lake GitHub](https://github.com/delta-io/delta)