## Building Prerequisites

In [0]:
CREATE OR REPLACE TEMP VIEW employee_hierarchy AS
SELECT * FROM VALUES
  (1, 'CEO', NULL),
  (2, 'CTO', 1),
  (3, 'CFO', 1),
  (4, 'Engineering Manager', 2),
  (5, 'Finance Manager', 3),
  (6, 'Software Engineer', 4),
  (7, 'Accountant', 5)
AS employees(employee_id, employee_name, manager_id);

CREATE OR REPLACE TEMP VIEW orders_with_items AS
SELECT * FROM VALUES
  (1, '2025-07-01', ARRAY('apple', 'banana', 'orange')),
  (2, '2025-07-02', ARRAY('grape', 'kiwi')),
  (3, '2025-07-03', ARRAY())
AS orders(order_id, order_date, items);

CREATE OR REPLACE TEMP VIEW products_with_prices AS
SELECT * FROM VALUES
  (1, MAP('apple', 1.5, 'banana', 0.75)),
  (2, MAP('grape', 2.0, 'kiwi', 1.8)),
  (3, MAP())
AS products(product_id, price_map);

CREATE OR REPLACE TEMP VIEW employees_with_address AS
SELECT * FROM VALUES
  (1, STRUCT('John Doe' AS name, '123 Main St' AS street, 'NY' AS state)),
  (2, STRUCT('Jane Smith' AS name, '456 Park Ave' AS street, 'CA' AS state)),
  (3, NULL)
AS employees(employee_id, address);

## ⏳ Delta Lake Features: Time Travel and Versioning

Delta Lake brings powerful features on top of Apache Spark and Parquet to enable **reliable**, **ACID-compliant** data lakes.

One of its key capabilities is **time travel**, allowing you to query, compare, and restore data from previous versions of a Delta table effortlessly.

---

### 🕰️ What is Time Travel?

- Delta Lake stores all changes and versions of a table as data files and transaction logs.
- You can query the table “as of” a previous **version number** or **timestamp**.
- This enables:
  - Auditing data changes
  - Debugging by comparing versions
  - Undoing accidental deletes or corruptions
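
As a minimal sketch of the two "as of" forms, assuming the `delta_demo` table that this notebook creates further down (the timestamp is a placeholder and must fall within the table's available history):

```sql
-- Read the table as of a specific version number
SELECT * FROM delta_demo VERSION AS OF 1;

-- Read the table as of a point in time (placeholder timestamp)
SELECT * FROM delta_demo TIMESTAMP AS OF '2025-07-01 12:00:00';
```

`DESCRIBE HISTORY delta_demo` lists the version numbers and commit timestamps you can choose from.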

---

### 🔍 Comparing Two Versions to Find New Rows

You can compare two versions of a table to identify:

- Rows added
- Rows deleted
- Rows updated

For example, to find the rows that were added in version N, take version N and subtract version N-1 from it with `EXCEPT`.
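
The other two comparisons can be sketched the same way. This assumes the `delta_demo` table from this notebook, that versions 1 and 2 both exist, and that `id` uniquely identifies a row. The aliases `curr` and `prev` are illustrative.

```sql
-- Rows present in version 1 but absent from version 2 (deleted or changed)
SELECT * FROM delta_demo VERSION AS OF 1
EXCEPT
SELECT * FROM delta_demo VERSION AS OF 2;

-- Rows whose id exists in both versions but whose values differ (updated)
SELECT
  curr.id,
  prev.name  AS old_name,
  curr.name  AS new_name,
  prev.value AS old_value,
  curr.value AS new_value
FROM delta_demo VERSION AS OF 2 AS curr
JOIN delta_demo VERSION AS OF 1 AS prev
  ON curr.id = prev.id
WHERE NOT (curr.name <=> prev.name)    -- <=> is the NULL-safe equality operator
   OR NOT (curr.value <=> prev.value);
```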

---

### 🔄 Reverting to a Previous Version

If you find issues in the latest data, you can restore the table to a previous version either by **overwriting** it with that version's data or by using Delta's `RESTORE` command.
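
Both options might look like the sketch below, assuming the `delta_demo` table from this notebook and that version 1 is the state you want back. Run one option or the other, not both.

```sql
-- Option 1: restore in place; this is recorded as a new commit in the table history
RESTORE TABLE delta_demo TO VERSION AS OF 1;

-- Option 2: overwrite the current contents with an older snapshot
INSERT OVERWRITE TABLE delta_demo
SELECT * FROM delta_demo VERSION AS OF 1;
```

Neither option erases history, so the "bad" version remains visible in `DESCRIBE HISTORY` and can still be inspected.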

---

### ⚡ Benefits of Time Travel

- Simplifies data recovery and compliance.
- Enables auditability and reproducibility.
- Supports complex data pipelines with rollback and snapshot functionality.

---

Next, let's see how to use these features with SQL examples!


In [0]:
CREATE OR REPLACE TABLE delta_demo (
  id INT,
  name STRING,
  value INT
) USING DELTA;                        

In [0]:
INSERT INTO delta_demo VALUES
  (1, 'A', 100),
  (2, 'B', 200);

In [0]:
DESCRIBE HISTORY delta_demo;

In [0]:
INSERT INTO delta_demo VALUES
  (3, 'C', 300),
  (4, 'D', 400);

In [0]:
SELECT * FROM delta_demo VERSION AS OF 1;

In [0]:
SELECT * FROM delta_demo;

In [0]:
SELECT * FROM delta_demo VERSION AS OF 1
EXCEPT
SELECT * FROM delta_demo VERSION AS OF 0;

In [0]:
DROP TABLE IF EXISTS delta_demo;

## 🧹 Delta Lake VACUUM: Cleaning Up Old Data Files

When you update, delete, or overwrite data, Delta Lake keeps the older data files and transaction log entries so that previous versions remain queryable through time travel. Over time, these can consume significant storage space.

**VACUUM** helps you clean up these obsolete files safely.

---

### How VACUUM Works

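- Deletes data files that are no longer referenced by the current table version and are older than the retention threshold.
- Retains the files needed for time travel within the retention period; older versions may no longer be queryable after a `VACUUM`.
- Default retention is 7 days to avoid accidental data loss.
- You can specify a shorter retention period, but use caution: time travel to versions older than the new threshold stops working.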

---


### Example of VACUUM Usage


Clean up unreferenced files older than the default retention period (7 days):

`VACUUM delta_demo;`

Or specify the retention period explicitly in hours (for example, 168 hours = 7 days):

`VACUUM delta_demo RETAIN 168 HOURS;`
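
Before deleting anything, it can help to preview what `VACUUM` would remove. The sketch below assumes a `delta_demo` table exists. The second part shows a shorter retention period, which typically requires turning off Delta's retention safety check first. Whether that setting can be changed depends on your compute environment, so treat it as an assumption.

```sql
-- Preview the files that would be deleted, without deleting them
VACUUM delta_demo DRY RUN;

-- Shorter retention (use with caution): disable the safety check, then vacuum.
-- This breaks time travel to versions older than 24 hours.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM delta_demo RETAIN 24 HOURS;
```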


## 🔄 Recursive CTEs in Spark SQL

---

### What is a Recursive CTE?

- Recursive Common Table Expressions (CTEs) let you **iterate over hierarchical or graph-structured data**.
- They repeatedly apply a query to generate rows based on the previous result until no new rows are returned.
- Useful for things like **organizational charts, bill of materials, or folder hierarchies**.

---

### How it Works

1. **Anchor member**: Defines the base rows (starting point).
2. **Recursive member**: Joins the source table to the CTE's own previous result to fetch "next level" rows.
3. Spark SQL executes the recursive query repeatedly, accumulating results.
4. The recursion ends when no more rows are generated.
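
The four steps above can be seen in the smallest possible case, a counter that needs no source table. This is a sketch that assumes your runtime supports `WITH RECURSIVE`. The CTE name `counter` and column `n` are illustrative.

```sql
WITH RECURSIVE counter(n) AS (
  -- Anchor member: the starting row
  SELECT 1

  UNION ALL

  -- Recursive member: derive the next row from the previous result
  SELECT n + 1
  FROM counter
  WHERE n < 5   -- once n reaches 5, no new rows are produced and recursion stops
)
SELECT n FROM counter ORDER BY n;
```

The `WHERE n < 5` condition is what ends the recursion; without a terminating condition, the recursive member would keep generating rows.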

---

### Example: Employee Hierarchy

Using the `employee_hierarchy` view, find **all employees under the CEO (employee_id = 1)**, including their level in the hierarchy.


In [0]:
WITH RECURSIVE emp_cte AS (
  -- Anchor member: start with the CEO
  SELECT employee_id, employee_name, manager_id, 0 AS level
  FROM employee_hierarchy
  WHERE employee_id = 1

  UNION ALL

  -- Recursive member: get employees reporting to previous level
  SELECT e.employee_id, e.employee_name, e.manager_id, cte.level + 1
  FROM employee_hierarchy e
  JOIN emp_cte cte ON e.manager_id = cte.employee_id
)

SELECT * FROM emp_cte ORDER BY level, employee_id


### ⚠️ Important Notice on Recursive CTEs in Databricks Free Edition

Databricks Free Edition **does not allow** running recursive CTEs, which means you cannot use the elegant recursive query syntax to traverse hierarchical data.

---

### Alternative Approach: Emulating Recursive CTEs with Multiple Self-Joins

The code below produces the **same output** as a recursive CTE for up to 4 hierarchy levels. However, it is:

- **Less elegant**
- **Not dynamic** (requires manually adding joins for each level)
- **Harder to maintain** for deep or unknown hierarchy depths

In [0]:
-- Level 0: CEO
SELECT
  e0.employee_id AS employee_id,
  e0.employee_name AS employee_name,
  e0.manager_id AS manager_id,
  0 AS level
FROM employee_hierarchy e0
WHERE e0.employee_id = 1

UNION ALL

-- Level 1: Direct reports to CEO
SELECT
  e1.employee_id,
  e1.employee_name,
  e1.manager_id,
  1 AS level
FROM employee_hierarchy e1
WHERE e1.manager_id = 1

UNION ALL

-- Level 2: Reports to level 1 employees
SELECT
  e2.employee_id,
  e2.employee_name,
  e2.manager_id,
  2 AS level
FROM employee_hierarchy e2
JOIN employee_hierarchy e1 ON e2.manager_id = e1.employee_id
WHERE e1.manager_id = 1

UNION ALL

-- Level 3: Reports to level 2 employees
SELECT
  e3.employee_id,
  e3.employee_name,
  e3.manager_id,
  3 AS level
FROM employee_hierarchy e3
JOIN employee_hierarchy e2 ON e3.manager_id = e2.employee_id
JOIN employee_hierarchy e1 ON e2.manager_id = e1.employee_id
WHERE e1.manager_id = 1

ORDER BY level, employee_id;
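
A more compact variant of the same idea, offered as a sketch rather than a drop-in replacement: walk up the manager chain with `LEFT JOIN`s and derive the level from how many ancestors exist. It assumes a single root and at most 4 levels, as in the `employee_hierarchy` view. The aliases `m1` to `m3` are illustrative.

```sql
SELECT
  e.employee_id,
  e.employee_name,
  e.manager_id,
  CASE
    WHEN m1.employee_id IS NULL THEN 0  -- no manager: top of the hierarchy
    WHEN m2.employee_id IS NULL THEN 1  -- manager has no manager
    WHEN m3.employee_id IS NULL THEN 2
    ELSE 3
  END AS level
FROM employee_hierarchy e
LEFT JOIN employee_hierarchy m1 ON e.manager_id  = m1.employee_id
LEFT JOIN employee_hierarchy m2 ON m1.manager_id = m2.employee_id
LEFT JOIN employee_hierarchy m3 ON m2.manager_id = m3.employee_id
ORDER BY level, employee_id;
```

It is still not dynamic: each extra level needs another join and another `CASE` branch.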

# 📜 SQL UDFs with Unity Catalog

## What is a SQL UDF?

A **User Defined Function (UDF)** lets you extend SQL with your own reusable functions.  
You can encapsulate complex logic inside a function, then call it in your queries just like built-in SQL functions.

---

## Benefits of SQL UDFs

- Simplify and reuse complex expressions or business logic
- Improve query readability
- Centralize logic to avoid duplication
- Managed and secured with **Unity Catalog** for fine-grained access control and auditing

---

## Unity Catalog and SQL UDFs

Unity Catalog allows you to:

- Register UDFs with a fully qualified name (catalog.schema.function)
- Control access to UDFs via permissions
- Track usage for auditing and governance
- Easily share functions across workspaces and users
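
Access control on a function might look like the sketch below. All names are placeholders: `catalog_name.schema_name.function_name` stands for your own function, and `data_analysts` is a hypothetical group.

```sql
-- Allow a group to call the function (placeholder names)
GRANT EXECUTE ON FUNCTION catalog_name.schema_name.function_name TO `data_analysts`;

-- Review which principals hold privileges on the function
SHOW GRANTS ON FUNCTION catalog_name.schema_name.function_name;
```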

---

## Syntax to Create SQL UDFs in Unity Catalog

```sql
CREATE FUNCTION catalog_name.schema_name.function_name(argument_name data_type, ...)
RETURNS return_data_type
LANGUAGE SQL
RETURN sql_expression_or_query_using_arguments;
```


In [0]:
-- Build a label such as '1_CEO' from an employee's ID and name.
CREATE OR REPLACE FUNCTION full_employee_info(emp_id INT, emp_name STRING)
RETURNS STRING
LANGUAGE SQL
RETURN CONCAT(CAST(emp_id AS STRING), '_', emp_name);

-- An employee is top level when they have no manager, so pass manager_id here.
CREATE OR REPLACE FUNCTION is_top_level(mgr_id INT)
RETURNS BOOLEAN
LANGUAGE SQL
RETURN mgr_id IS NULL;

In [0]:
SELECT
  employee_id,
  employee_name,
  full_employee_info(employee_id,employee_name) AS full_employee_info,
  is_top_level(manager_id) AS is_ceo
FROM employee_hierarchy;

## 🧹 Before Cleaning Up UDFs, Check Unity Catalog!
You should be able to find the functions listed alongside tables... Nice, isn't it?
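
You can also inspect the functions from SQL instead of the catalog UI. A small sketch, assuming the two functions above were created in the current schema:

```sql
-- List user-defined functions visible in the current schema
SHOW USER FUNCTIONS;

-- Show the signature, return type, and body of one function
DESCRIBE FUNCTION EXTENDED full_employee_info;
```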

In [0]:
DROP FUNCTION IF EXISTS full_employee_info;
DROP FUNCTION IF EXISTS is_top_level;

### 💡 Important !!

**UDFs can also load complex data types like JSON objects** (for example, map or array structures from mapping tables) and act as fully-fledged dictionaries for data standardization.

See: https://docs.databricks.com/aws/en/udf/unity-catalog#extend-udfs-using-custom-dependencies

This is very powerful because:  
- You can implement reusable **standardization** or **harmonization** logic inside UDFs.  
- Standardization engines can leverage **Unity Catalog UDFs** under the hood to provide consistent, centralized, and maintainable data transformations across your pipelines.

In our project, we have a standardization/harmonization engine built this way — using UDFs to apply consistent business rules and mappings seamlessly.

This approach simplifies your ETL code and improves data quality by enforcing consistent rules via reusable functions.
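
To make the "UDF as dictionary" idea concrete, here is a minimal sketch. The function name `standardize_state` and the hard-coded mapping are hypothetical illustrations, not the project's actual engine. A real implementation would more likely read the mapping from a table.

```sql
-- Hypothetical standardization UDF: look up a state code in a map,
-- falling back to the raw value when the code is unknown.
CREATE OR REPLACE FUNCTION standardize_state(raw_state STRING)
RETURNS STRING
LANGUAGE SQL
RETURN COALESCE(
  TRY_ELEMENT_AT(
    MAP('NY', 'New York', 'CA', 'California'),
    UPPER(TRIM(raw_state))
  ),
  raw_state
);

-- Apply it to the sample view from the prerequisites
SELECT
  employee_id,
  address.state AS raw_state,
  standardize_state(address.state) AS state_name
FROM employees_with_address;

-- Clean up the illustrative function
DROP FUNCTION IF EXISTS standardize_state;
```

`TRY_ELEMENT_AT` returns `NULL` for a missing key instead of raising an error, which is what lets `COALESCE` fall back to the original value.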


# 🌟 Working with Complex Data Types in Spark SQL

Spark SQL supports powerful complex data types like **arrays**, **maps**, and **structs**, which allow you to store and manipulate nested and multi-valued data directly inside tables and views.

---

### 🟦 Arrays
- Represent ordered collections of elements (e.g., a list of items in an order).
- Useful for storing multiple values in a single column.
- Can be **exploded** into multiple rows to process each element individually.

---

### 🟩 Maps
- Key-value pairs, similar to dictionaries in Python or objects in JSON.
- Perfect for storing attributes with dynamic keys (e.g., product prices by item name).
- Can be **exploded** to transform each key-value pair into a separate row.

---

### 🟥 Structs
- Complex nested data structures, like JSON objects.
- Group multiple related fields into a single column.
- Access nested fields using dot notation (`struct_col.field_name`).

---

### 🔧 Common Operations
- **Exploding arrays and maps** lets you flatten complex data to analyze individual elements.
- **Accessing nested struct fields** allows you to work with detailed, structured info inside a row.
- These types enable flexible, scalable schemas — essential for semi-structured or hierarchical data.
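
Besides exploding, a few built-in functions let you inspect arrays and maps without changing the row count. A short sketch against the views from the prerequisites (the output column aliases are illustrative):

```sql
-- Array helpers: element count and membership test
SELECT
  order_id,
  SIZE(items) AS item_count,
  ARRAY_CONTAINS(items, 'apple') AS has_apple
FROM orders_with_items;

-- Map helpers: extract keys and values as arrays
SELECT
  product_id,
  MAP_KEYS(price_map) AS item_names,
  MAP_VALUES(price_map) AS prices
FROM products_with_prices;
```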

---

### 🚀 Why Use Complex Types in Spark SQL?

- They reduce the need for complex joins by encapsulating related data.
- Improve query expressiveness and performance by keeping data together.
- Are widely used in modern data lake architectures, especially with formats like Parquet and Delta Lake.

---

Next, we'll demonstrate queries to **explode arrays and maps**, and **access struct fields** in practice. Tip: also query the original view so you can compare the data before and after each transformation.
Let's get hands-on! 👇


In [0]:
SELECT
  order_id,
  order_date,
  item
FROM orders_with_items
LATERAL VIEW EXPLODE(items) AS item;
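
Note that `EXPLODE` produces no rows for an empty array, so order 3 disappears from the result above. If you want to keep such rows, a sketch using the `OUTER` keyword:

```sql
-- OUTER keeps rows whose array is empty or NULL; item is NULL for them
SELECT
  order_id,
  order_date,
  item
FROM orders_with_items
LATERAL VIEW OUTER EXPLODE(items) AS item;
```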


In [0]:
SELECT
  product_id,
  price_map['apple'] AS apple_price
FROM products_with_prices;

In [0]:
SELECT
  product_id,
  key,
  value
FROM products_with_prices
LATERAL VIEW EXPLODE(price_map) AS key, value;

In [0]:
SELECT
  employee_id,
  address.name AS employee_name,
  address.street AS street,
  address.state AS state
FROM employees_with_address;
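
When you want every field of a struct as its own column, star expansion is a shorter alternative to listing the fields one by one. A minimal sketch on the same view:

```sql
-- Expand all struct fields (name, street, state) into top-level columns
SELECT
  employee_id,
  address.*
FROM employees_with_address;
```

For the row where `address` is `NULL`, each expanded column is `NULL` as well.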

### 💡 Important !!

**Are we using any of this within our project?**  
.
.
**YES!** 

The whole Data Quality engine is built using map columns and the aforementioned functions, in order to reduce size, complexity, and storage/compute costs.