## ❗❗❗Before starting the training❗❗❗
- Final reminder: if you are not familiar with SQL syntax, please follow this simple training to start from scratch: https://sqlbolt.com/lesson/select_queries_introduction
- If your work requires deeper knowledge, or you are interested in a deep dive into SQL and how it works, please refer to section 3. SQL upskilling (a book for self-paced learning of SQL): https://onetakeda.atlassian.net/wiki/spaces/DIME/pages/6240174176/Data+Modeler+upskilling+guide
- This training assumes basic knowledge of SQL and focuses on our use cases (it describes and points out how we use SQL syntax when building with standardized logic).

## Building prerequisites


In [0]:
-- Sales table
CREATE OR REPLACE TEMP VIEW sales_demo AS
SELECT * FROM VALUES
  ('2025-07-01', 'Apples', 10, 2.5, 'North'),
  ('2025-07-01', 'Oranges', 5, 3.0, 'North'),
  ('2025-07-02', 'Apples', 8, 2.5, 'South'),
  ('2025-07-02', 'Bananas', 15, 1.2, NULL),
  ('2025-07-03', 'Oranges', 7, 3.0, 'East')
AS sales(date, product, quantity, price_per_unit, region);

-- product categories table
CREATE OR REPLACE TEMP VIEW product_categories AS
SELECT * FROM VALUES
  ('Apples', 'Fruit'),
  ('Oranges', 'Fruit'),
  ('Tomatoes', 'Vegetable')
AS categories(product, category);

-- Sales extra (for union)
CREATE OR REPLACE TEMP VIEW sales_extra AS
SELECT * FROM VALUES
  ('2025-07-04', 'Apples', 6, 2.5, 'North'),
  ('2025-07-04', 'Oranges', 9, 3.0, 'West'),
  ('2025-07-03', 'Oranges', 7, 3.0, 'East')
AS sales(date, product, quantity, price_per_unit, region);

-- Customers table
CREATE OR REPLACE TEMP VIEW customers AS
SELECT * FROM VALUES
  (1, 'Alice'),
  (2, 'Bob'),
  (3, 'Charlie'),
  (4, 'Diana')
AS customers(customer_id, name);

-- Orders table
CREATE OR REPLACE TEMP VIEW orders AS
SELECT * FROM VALUES
  (101, 1, '2023-01-10'),
  (102, 1, '2023-03-15'),
  (103, 1, '2023-06-20'),
  (104, 2, '2023-02-12'),
  (105, 3, '2023-04-25'),
  (106, 3, '2023-07-30')
AS orders(order_id, customer_id, order_date);



## 🧪 SELECT and Functions

Let's explore how to use SELECT queries with some of the useful functions.

For more functions, check the documentation:
🔗 https://spark.apache.org/docs/latest/api/sql/index.html 

In [0]:
SELECT
  date,
  year(date) AS sale_year,    -- alias gives the derived column a readable name
  month(date) AS sale_month,
  product,
  quantity,
  CASE 
    WHEN quantity > 10 THEN 'High'
    WHEN quantity > 5 THEN 'Medium'
    ELSE 'Low'
  END AS demand_level
FROM sales_demo;


### 🔗 String Concatenation in Spark SQL

When combining multiple columns into a single string, Spark SQL offers:

| Function     | Description |
|--------------|-------------|
| `CONCAT(col1, col2, ...)`     | Joins strings **without** a separator. Null values cause the result to be null. |
| `CONCAT_WS('sep', col1, col2, ...)` | Joins strings **with a separator**. Nulls are skipped. Safer for keys! |

---

### ✅ Example: Simple Concatenation

```sql
-- CONCAT (null-sensitive)
SELECT product, region, CONCAT(product, region) AS concat_example
FROM sales_demo;
```

In [0]:
SELECT
  UPPER(product) AS product_upper,
  region,
  CONCAT(product, ' - ', region) AS product_region,
  CONCAT_WS(' - ', product, region) AS product_region_2
FROM sales_demo;
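
The Bananas row has a NULL region, which is where the two functions differ. A minimal sketch with literal values (no tables needed) that isolates the behaviour described in the table above:

```sql
-- CONCAT returns NULL as soon as one argument is NULL;
-- CONCAT_WS skips the NULL argument and its separator.
SELECT
  CONCAT('Bananas', ' - ', CAST(NULL AS STRING))    AS concat_result,     -- NULL
  CONCAT_WS(' - ', 'Bananas', CAST(NULL AS STRING)) AS concat_ws_result;  -- Bananas
```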

### 🧠 Use Case: Surrogate Key Generation

Surrogate keys are often used in data warehousing (e.g., star schema) as unique, consistent identifiers for dimension rows, often replacing natural composite keys. Within our workflow, this gives the primary key of every table we create a uniform name and makes comparison of records easier.
The same process is applied to the non-key columns, from which we build the `MD5_key`.

A typical pattern is:

In [0]:
SELECT 
  product, region, date,
  MD5(CONCAT_WS('||', product, region, date)) AS row_key
FROM sales_demo;
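
Because `CONCAT_WS` skips NULLs, two rows that have a NULL in different key columns can end up with the same concatenated string and therefore the same hash. One possible way to guard against that, shown here only as a sketch, is to replace NULLs with a placeholder before hashing. The placeholder `'#NULL#'` and the column name `row_key_null_safe` are assumptions made for this example, not part of our standard.

```sql
SELECT
  product, region, date,
  MD5(CONCAT_WS('||',
        COALESCE(product, '#NULL#'),   -- placeholder keeps the NULL position visible
        COALESCE(region,  '#NULL#'),
        COALESCE(date,    '#NULL#')
  )) AS row_key_null_safe
FROM sales_demo;
```

With this variant every key column always contributes one segment to the hashed string, so the position of a missing value is preserved.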


## 🔗 JOIN Operations

Now let’s join the sales data with a product category table.


🔍 INNER JOIN

An **INNER JOIN** returns only the rows where there is a match in both tables based on the join condition.  
If no match is found, the row is excluded from the result.  
This is the most commonly used type of join for combining related data.

In [0]:
SELECT s.*, c.category
FROM sales_demo s
JOIN product_categories c
  ON s.product = c.product;

---> Only records where there is a match on both sides

🪁 LEFT OUTER JOIN

A **LEFT OUTER JOIN** returns all rows from the left table, and the matching rows from the right table.  
If there's no match on the right side, NULLs are returned for the right table’s columns.  
Useful when you want to keep all records from the left side, even if no related data exists on the right.

In [0]:
SELECT s.*, c.category
FROM sales_demo s
LEFT JOIN product_categories c
  ON s.product = c.product;

---> All records from sales_demo, only matching records from product_categories (where there is no match -> NULL)

🌐 FULL OUTER JOIN

A **FULL OUTER JOIN** returns all rows when there is a match in **either** the left or the right table.  
Rows without a match in one of the tables will have NULLs for the missing side.  
It’s helpful for finding unmatched records in both tables.

In [0]:
SELECT s.*, c.category
FROM sales_demo s
FULL OUTER JOIN product_categories c
  ON s.product = c.product;

---> All records from both tables (left and right), non matching sides are nulls
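
To use the FULL OUTER JOIN for finding unmatched records, as mentioned above, one option is to keep only the rows where one side is NULL. A small sketch on the demo views (the output column names `sales_product` and `category_product` are chosen for this example):

```sql
SELECT
  s.product AS sales_product,      -- NULL when the category has no sales
  c.product AS category_product,   -- NULL when the sale has no category
  c.category
FROM sales_demo s
FULL OUTER JOIN product_categories c
  ON s.product = c.product
WHERE s.product IS NULL
   OR c.product IS NULL;
```

With the demo data this should return Bananas (sold, but without a category) and Tomatoes (a category without sales).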


## ➕ UNION Operations

We can combine multiple datasets using UNION or UNION ALL.


In [0]:
SELECT * FROM sales_demo
UNION ALL
SELECT * FROM sales_extra;

In [0]:
SELECT * FROM sales_demo
UNION 
SELECT * FROM sales_extra;

--> UNION removes duplicate rows, so the record that exists in both views (2025-07-03, Oranges, East) is returned only once
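
A quick way to see the difference is to count the rows each operator returns. This is only a sketch; the aliases `u_all` and `u_distinct` and the column names are made up for the example.

```sql
SELECT 'UNION ALL' AS operator, COUNT(*) AS row_count
FROM (
  SELECT * FROM sales_demo
  UNION ALL
  SELECT * FROM sales_extra
) u_all
UNION ALL
SELECT 'UNION' AS operator, COUNT(*) AS row_count
FROM (
  SELECT * FROM sales_demo
  UNION
  SELECT * FROM sales_extra
) u_distinct;
```

With the demo views, `UNION ALL` keeps all 8 rows (5 + 3), while `UNION` returns 7 because the duplicated Oranges/East row is collapsed.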


## 🧱 Common Table Expressions (CTEs)

💡 **CTEs (Common Table Expressions)** are *temporary, named result sets* that you can reference within a larger SQL query. They make queries more readable and reusable, and they can give the optimizer an easier job when they replace deeply nested subqueries.

---

### ✅ Why Use CTEs Instead of Subqueries?

⚠️ **Best Practice**: Prefer using CTEs over deeply nested subqueries or `WHERE id IN (SELECT ...)` patterns!

---

### ❗ Key Reasons to Use CTEs:

| Advantage | Description |
|----------|-------------|
| ✅ **Better Readability** | You break down a complex query into understandable blocks. |
| ✅ **Easier Debugging** | You can test each CTE independently before chaining logic. |
| ✅ **Logical Reuse** | You can reference the same CTE multiple times in one query. |
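| ✅ **Often Better Performance** | A CTE followed by a JOIN often gives Spark a plan in which it can **optimize joins and filters** more easily than with subqueries inside `WHERE IN`. The effect depends on the query, so check the plan with `EXPLAIN` when performance matters. |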

---

### 🚀 Performance Tip: Avoid `WHERE id IN (SELECT ...)`

- In Spark, `WHERE id IN (SELECT ...)` can force **materialization of a subquery** and may lead to **inefficient broadcast joins** or **Cartesian products** if not optimized well.
- Using a CTE followed by a **JOIN** allows the Catalyst optimizer to:
  - Reorder joins
  - Prune unnecessary columns
  - Push filters earlier (predicate pushdown)
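
A minimal sketch for checking this yourself, using the `customers` and `orders` views from the prerequisites: prefix each variant with `EXPLAIN FORMATTED` and compare the physical plans. The CTE name `customers_with_orders` is made up for illustration, and the plans you see depend on your Spark version and data sizes, so treat this as an experiment rather than a rule. Run each statement in its own cell.

```sql
-- Variant 1: IN with a subquery
EXPLAIN FORMATTED
SELECT c.name
FROM customers c
WHERE c.customer_id IN (SELECT customer_id FROM orders);

-- Variant 2: CTE + JOIN (DISTINCT keeps one row per customer, as IN does)
EXPLAIN FORMATTED
WITH customers_with_orders AS (
  SELECT DISTINCT customer_id
  FROM orders
)
SELECT c.name
FROM customers c
JOIN customers_with_orders o
  ON c.customer_id = o.customer_id;
```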


### ❌ Using Subquery

In [0]:
SELECT c.name, 
       (SELECT MAX(o.order_date)
        FROM orders o
        WHERE o.customer_id = c.customer_id) AS latest_order
FROM customers c
WHERE (SELECT COUNT(*) 
       FROM orders o 
       WHERE o.customer_id = c.customer_id) > 2;

### ✅ Using CTE

In [0]:
WITH order_stats AS (   -- first, define the intermediate step
  SELECT customer_id, 
         COUNT(*) AS order_count,
         MAX(order_date) AS latest_order
  FROM orders
  GROUP BY customer_id
)
SELECT c.name, o.latest_order
FROM customers c
JOIN order_stats o      -- then use the predefined object
  ON c.customer_id = o.customer_id
WHERE o.order_count > 2;
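
The "Logical Reuse" advantage can be illustrated by referencing the same CTE twice in one query. The sketch below (an example built for this training, not a standard pattern of ours) lists customers whose order count is at least the average order count:

```sql
WITH order_stats AS (
  SELECT customer_id,
         COUNT(*) AS order_count
  FROM orders
  GROUP BY customer_id
)
SELECT c.name, s.order_count
FROM customers c
JOIN order_stats s                            -- first reference
  ON c.customer_id = s.customer_id
WHERE s.order_count >= (
  SELECT AVG(order_count) FROM order_stats    -- second reference
);
```

With the demo data the order counts are 3, 1 and 2 (average 2), so the query should return Alice and Charlie.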


## 🏅 QUALIFY — Cleaner Filtering of Ranked Rows

In traditional SQL (and in Spark SQL too), to **filter based on `ROW_NUMBER()` or `RANK()`**, people often:

- Add the window function in a subquery
- Filter the result in the outer query using `WHERE row_num = 1`

But this leads to **nested queries**, making it harder to read and optimize.




### ❌ Old Way: Using Subquery with `WHERE`

In [0]:
SELECT date,
       product,
       quantity,
       price_per_unit,
       region
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY region ORDER BY quantity DESC) AS row_num
  FROM sales_demo
) ranked_sales
WHERE row_num = 1;

### ✅ Better Way: Using QUALIFY

In [0]:
SELECT *
FROM sales_demo
QUALIFY ROW_NUMBER() OVER (PARTITION BY region ORDER BY quantity DESC) = 1;
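
A typical use of this pattern is keeping only the latest record per key. A sketch on the `orders` view from the prerequisites; adding `order_id` as a tie-breaker is an assumption made here so that the result stays deterministic if two orders share the same date:

```sql
-- Latest order per customer
SELECT customer_id, order_id, order_date
FROM orders
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY customer_id
          ORDER BY order_date DESC, order_id DESC
        ) = 1;
```

With the demo data this should return orders 103, 104 and 106.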

## ⏳ Delta Lake Features: Time Travel and Versioning

Delta Lake brings powerful features on top of Apache Spark and Parquet to enable **reliable**, **ACID-compliant** data lakes.

One of its key capabilities is **time travel**, allowing you to query, compare, and restore data from previous versions of a Delta table effortlessly.

---

### 🕰️ What is Time Travel?

- Delta Lake stores all changes and versions of a table as data files and transaction logs.
- You can query the table “as of” a previous **version number** or **timestamp**.
- This enables:
  - Auditing data changes
  - Debugging by comparing versions
  - Undoing accidental deletes or corruptions

---

### 🔍 Comparing Two Versions to Find New Rows

You can compare two versions of a table to identify:

- Rows added
- Rows deleted
- Rows updated

For example, you can find the rows that were added in version N compared to version N-1 by subtracting the older version from the newer one with `EXCEPT`.

---

### 🔄 Reverting to a Previous Version

If you find issues in the latest data, you can restore the table to a previous version by **writing over** with that version’s data or using Delta’s `RESTORE` command (Databricks Premium).

---

### ⚡ Benefits of Time Travel

- Simplifies data recovery and compliance.
- Enables auditability and reproducibility.
- Supports complex data pipelines with rollback and snapshot functionality.

---

Next, let's see how to use these features with SQL examples!


1. Initial load: creating the table produces V0

In [0]:
CREATE OR REPLACE TABLE delta_demo (
  id INT,
  name STRING,
  value INT
) USING DELTA;                        

2. First insert operation, creates V1 (each time you run an INSERT, a new version is created and tracked in the history). For now, execute it only once so that the version numbers in the rest of the exercise match.

In [0]:
INSERT INTO delta_demo VALUES
  (1, 'A', 100),
  (2, 'B', 200);

3. Observe the history of the table and the actions applied to it


In [0]:
DESCRIBE HISTORY delta_demo;

4. Insert another set of records, creates V2

In [0]:
INSERT INTO delta_demo VALUES
  (3, 'C', 300),
  (4, 'D', 400);

Select a specific version from the history of the table instead of the latest version (the default)

In [0]:
SELECT * FROM delta_demo VERSION AS OF 1;
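
Time travel also works by timestamp, as mentioned above. The timestamp in this sketch is a placeholder, not a real value: replace it with a `timestamp` taken from the `DESCRIBE HISTORY delta_demo` output, otherwise the query may fail because no version exists for that point in time.

```sql
-- Placeholder timestamp: replace with a value from DESCRIBE HISTORY
SELECT * FROM delta_demo TIMESTAMP AS OF '2025-07-01 12:00:00';
```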

Select the latest version (default state)

In [0]:
SELECT * FROM delta_demo;

Select the difference between two versions: see which records were added between version 0 (empty table after creation) and version 1 (after the first insert)

In [0]:
SELECT * FROM delta_demo VERSION AS OF 1
EXCEPT
SELECT * FROM delta_demo VERSION AS OF 0;
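
The `RESTORE` command mentioned earlier can be tried at this point. A sketch, assuming the exercise was run exactly once so that version 1 holds only the first two rows; run each statement in its own cell:

```sql
-- Roll the table back to its state after the first insert
RESTORE TABLE delta_demo TO VERSION AS OF 1;

-- The latest state should now contain only ids 1 and 2
SELECT * FROM delta_demo ORDER BY id;

-- The restore itself is recorded as a new version in the history
DESCRIBE HISTORY delta_demo;
```

Note that restoring does not delete the later versions from the history; it adds a new version whose content matches the chosen one.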

After the exercise, drop the table so it is not stored in the catalog

In [0]:
DROP TABLE IF EXISTS delta_demo;