## Building prerequisites


In [0]:
CREATE OR REPLACE TEMP VIEW sales_demo AS
SELECT * FROM VALUES
  ('2025-07-01', 'Apples', 10, 2.5, 'North'),
  ('2025-07-01', 'Oranges', 5, 3.0, 'North'),
  ('2025-07-02', 'Apples', 8, 2.5, 'South'),
  ('2025-07-02', 'Bananas', 15, 1.2, NULL),
  ('2025-07-03', 'Oranges', 7, 3.0, 'East')
AS sales(date, product, quantity, price_per_unit, region);

CREATE OR REPLACE TEMP VIEW product_categories AS
SELECT * FROM VALUES
  ('Apples', 'Fruit'),
  ('Oranges', 'Fruit'),
  ('Tomatoes', 'Vegetable')
AS categories(product, category);

CREATE OR REPLACE TEMP VIEW sales_extra AS
SELECT * FROM VALUES
  ('2025-07-04', 'Apples', 6, 2.5, 'North'),
  ('2025-07-04', 'Oranges', 9, 3.0, 'West'),
  ('2025-07-03', 'Oranges', 7, 3.0, 'East')
AS sales(date, product, quantity, price_per_unit, region);



## 🧪 SELECT and Functions

Let's explore how to use SELECT queries together with some useful built-in functions.

For more options, check the documentation:
🔗 https://spark.apache.org/docs/latest/api/sql/index.html

In [0]:
SELECT
  date,
  year(date) AS sale_year,
  month(date) AS sale_month,
  product,
  quantity,
  CASE 
    WHEN quantity > 10 THEN 'High'
    WHEN quantity > 5 THEN 'Medium'
    ELSE 'Low'
  END AS demand_level
FROM sales_demo;
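
Beyond `year`, `month`, and `CASE`, a few other built-in functions are handy on the same view. The query below is a minimal sketch: the output column names (`sale_date`, `sale_month`, `revenue`, `region_filled`) and the `'Unknown'` placeholder are illustrative choices, not part of the original data.

```sql
SELECT
  TO_DATE(date)                            AS sale_date,     -- explicit cast from string to DATE
  DATE_FORMAT(TO_DATE(date), 'yyyy-MM')    AS sale_month,    -- year-month label for grouping
  product,
  ROUND(quantity * price_per_unit, 2)      AS revenue,       -- derived numeric column
  COALESCE(region, 'Unknown')              AS region_filled  -- replaces the NULL region
FROM sales_demo;
```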


### 🔗 String Concatenation in Spark SQL

When combining multiple columns into a single string, Spark SQL offers:

| Function     | Description |
|--------------|-------------|
| `CONCAT(col1, col2, ...)`     | Joins strings **without** a separator. Null values cause the result to be null. |
| `CONCAT_WS('sep', col1, col2, ...)` | Joins strings **with a separator**. Nulls are skipped. Safer for keys! |

---

### ✅ Example: Simple Concatenation

```sql
-- CONCAT (null-sensitive)
SELECT product, region, CONCAT(product, region) AS concat_example
FROM sales_demo;
```


In [0]:
SELECT
  UPPER(product) AS product_upper,
  region,
  CONCAT(product, ' - ', region) AS product_region,
  CONCAT_WS(' - ', product, region) AS product_region_2
FROM sales_demo;
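
To see the null handling in isolation, the same two functions can be called on literals. This is a small sketch that mirrors the Bananas row, where `region` is null; the column aliases are only for illustration.

```sql
SELECT
  CONCAT('Bananas', ' - ', CAST(NULL AS STRING))    AS with_concat,     -- NULL: one null input nulls the result
  CONCAT_WS(' - ', 'Bananas', CAST(NULL AS STRING)) AS with_concat_ws;  -- 'Bananas': the null is skipped
```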

### 🧠 Use Case: Surrogate Key Generation

Surrogate keys are often used in data warehousing (e.g., star schema) as unique, consistent identifiers for dimension rows, often replacing natural composite keys. Within our workflow, this gives the primary key of every table we create a uniform name and makes comparing records easier.
The same process is applied to the non-key columns, from which we build the MD5_key.

A typical pattern is:

In [0]:
SELECT 
  product, region, date,
  MD5(CONCAT_WS('||', product, region, date)) AS row_key
FROM sales_demo;
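
One caveat with this pattern: because `CONCAT_WS` skips nulls, two rows that differ only in *which* column is null can produce the same string. For example, `('A', NULL, 'B')` and `('A', 'B', NULL)` both become `'A||B'` and therefore the same hash. A possible variation, sketched below, replaces nulls with a placeholder before hashing. The `'<null>'` token is an arbitrary assumption; any value that cannot occur in the real data would do.

```sql
SELECT
  product, region, date,
  MD5(CONCAT_WS('||',
        COALESCE(product, '<null>'),   -- placeholder keeps the column position visible
        COALESCE(region,  '<null>'),
        COALESCE(date,    '<null>')
  )) AS row_key
FROM sales_demo;
```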


## 🔗 JOIN Operations

Now let’s join the sales data with a product category table.


In [0]:
SELECT s.*, c.category
FROM sales_demo s
JOIN product_categories c
  ON s.product = c.product;

In [0]:
SELECT s.*, c.category
FROM sales_demo s
LEFT JOIN product_categories c
  ON s.product = c.product;

In [0]:
SELECT s.*, c.category
FROM sales_demo s
FULL OUTER JOIN product_categories c
  ON s.product = c.product;
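
With the full outer join above, the Tomatoes row has nulls in every `s.*` column, so the product name is not visible in the result. A small sketch of one way to handle this is to take the product from whichever side has it and to label where each row came from. The `match_status` column and its labels are illustrative names, not part of the original tables.

```sql
SELECT
  COALESCE(s.product, c.product) AS product,   -- product name from either side
  s.date,
  s.quantity,
  c.category,
  CASE
    WHEN s.product IS NULL THEN 'category only'  -- e.g. Tomatoes: no sales
    WHEN c.product IS NULL THEN 'sale only'      -- e.g. Bananas: no category
    ELSE 'matched'
  END AS match_status
FROM sales_demo s
FULL OUTER JOIN product_categories c
  ON s.product = c.product;
```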


## ➕ UNION Operations

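We can combine multiple datasets using UNION or UNION ALL. `UNION ALL` keeps every row, including duplicates, while `UNION` removes duplicate rows from the combined result.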


In [0]:
SELECT * FROM sales_demo
UNION ALL
SELECT * FROM sales_extra;

In [0]:
SELECT * FROM sales_demo
UNION
SELECT * FROM sales_extra;
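
The difference between the two operators is easiest to see by counting rows. The sketch below assumes the demo views defined at the top, where the Oranges/East row from 2025-07-03 appears in both `sales_demo` and `sales_extra`; the aliases `a`, `u`, `operation`, and `row_count` are illustrative.

```sql
SELECT 'UNION ALL' AS operation, COUNT(*) AS row_count
FROM (
  SELECT * FROM sales_demo
  UNION ALL
  SELECT * FROM sales_extra
) a                                   -- 5 + 3 = 8 rows, duplicate kept
UNION ALL
SELECT 'UNION' AS operation, COUNT(*) AS row_count
FROM (
  SELECT * FROM sales_demo
  UNION
  SELECT * FROM sales_extra
) u;                                  -- 7 rows, the shared row appears once
```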


## 🧱 Common Table Expressions (CTEs)

💡 **CTEs (Common Table Expressions)** are *named, temporary result sets* that you can reference within a larger SQL query. They make queries more readable and reusable, and in some cases they also help the engine produce a better plan.

---

### ✅ Why Use CTEs Instead of Subqueries?

⚠️ **Best Practice**: Prefer using CTEs over deeply nested subqueries or `WHERE id IN (SELECT ...)` patterns!

---

### ❗ Key Reasons to Use CTEs:

| Advantage | Description |
|----------|-------------|
| ✅ **Better Readability** | You break down a complex query into understandable blocks. |
| ✅ **Easier Debugging** | You can test each CTE independently before chaining logic. |
| ✅ **Logical Reuse** | You can reference the same CTE multiple times in one query. |
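| ✅ **Better Performance (in some cases)** | A CTE followed by a `JOIN` makes the join explicit for the optimizer. Spark typically rewrites `WHERE ... IN (SELECT ...)` into a semi join on its own, so compare the plans with `EXPLAIN` rather than assuming a speed-up. |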
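
As an illustration of the "Logical Reuse" row, the sketch below defines one CTE and references it twice: once for the per-day figures and once for the overall total. The names `daily`, `revenue`, `total_revenue`, and `revenue_share` are made up for this example.

```sql
WITH daily AS (
  SELECT date, SUM(quantity * price_per_unit) AS revenue
  FROM sales_demo
  GROUP BY date
)
SELECT
  d.date,
  d.revenue,
  t.total_revenue,
  ROUND(d.revenue / t.total_revenue, 3) AS revenue_share
FROM daily d                                                    -- first reference
CROSS JOIN (SELECT SUM(revenue) AS total_revenue FROM daily) t  -- second reference
ORDER BY d.date;
```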

---

### 🚀 Performance Tip: Avoid `WHERE id IN (SELECT ...)`

- In Spark, `WHERE id IN (SELECT ...)` is normally rewritten into a semi join, but in complex or deeply nested queries the resulting plan is harder to predict and **may** end up with an **inefficient join strategy** if it is not optimized well.
- Using a CTE followed by a **JOIN** makes the join explicit, which lets you see and control how the Catalyst optimizer can:
  - Reorder joins
  - Prune unnecessary columns
  - Push filters earlier (predicate pushdown)


In [0]:
SELECT * FROM sales_demo
WHERE product IN (
  SELECT product FROM product_categories WHERE category = 'Fruit'
);
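
Rather than guessing which form is faster, it may help to look at the plan Spark actually builds. The sketch below uses `EXPLAIN FORMATTED` on the `IN` version; running the same command on a rewritten query shows whether the two plans really differ. The exact plan text depends on the Spark version and configuration, so no output is shown here.

```sql
-- Show the physical plan chosen for the IN (SELECT ...) form
EXPLAIN FORMATTED
SELECT * FROM sales_demo
WHERE product IN (
  SELECT product FROM product_categories WHERE category = 'Fruit'
);
```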

In [0]:
WITH fruits AS (
  SELECT product FROM product_categories WHERE category = 'Fruit'
)
SELECT s.*
FROM sales_demo s
JOIN fruits f ON s.product = f.product;
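
One thing to keep in mind when replacing `IN` with a `JOIN`: the two are only equivalent when the joined key is unique. If `product_categories` listed a product twice, the inner join would return each matching sale twice, while `IN` would not. A sketch of a variant that keeps the `IN` semantics while still using a CTE is a `LEFT SEMI JOIN`, which returns each left-side row at most once:

```sql
WITH fruits AS (
  SELECT product FROM product_categories WHERE category = 'Fruit'
)
SELECT s.*
FROM sales_demo s
LEFT SEMI JOIN fruits f        -- filters sales_demo, never duplicates its rows
  ON s.product = f.product;
```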


## 🏅 QUALIFY — Cleaner Filtering of Ranked Rows

In traditional SQL (and in Spark SQL too), to **filter based on `ROW_NUMBER()` or `RANK()`**, people often:

- Add the window function in a subquery
- Filter the result in the outer query using `WHERE row_num = 1`

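But this leads to **nested queries**, which are harder to read and maintain. `QUALIFY` filters on the result of a window function directly. It is supported in Databricks SQL and recent Databricks Runtime versions; other Spark distributions may not support it, so check your engine first.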




### ❌ Old Way: Using Subquery with `WHERE`

In [0]:
SELECT date,
       product,
       quantity,
       price_per_unit,
       region
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY region ORDER BY quantity DESC) AS row_num
  FROM sales_demo
) ranked_sales
WHERE row_num = 1;

### ✅ Better Way: Using QUALIFY

In [0]:
SELECT *
FROM sales_demo
QUALIFY ROW_NUMBER() OVER (PARTITION BY region ORDER BY quantity DESC) = 1;
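
A possible variation, assuming your engine lets `QUALIFY` reference a window-function alias from the `SELECT` list (Databricks documents this form), keeps the rank visible in the output. The sketch also adds `date` and `product` as tie-breakers so that the chosen row is deterministic when two rows in a region share the same quantity; the demo data has no such ties, so the result matches the query above plus the extra `row_num` column.

```sql
SELECT
  *,
  ROW_NUMBER() OVER (
    PARTITION BY region
    ORDER BY quantity DESC, date, product   -- tie-breakers make the pick deterministic
  ) AS row_num
FROM sales_demo
QUALIFY row_num = 1;
```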