# Session 6 — Part 1: Data Transformation Fundamentals
A deep dive into **ETL vs ELT**, layered architectures, transformation patterns, and performance tuning, with real-world examples.

## 🧩 1️⃣ What Is Data Transformation & Why It Matters
Data transformation turns **raw** data into **business-ready** datasets that support analytics, reporting, ML, and decision-making.

### Real-World Scenarios
- **E‑commerce:** clicks → carts → orders → *daily sales summaries*.
- **Healthcare:** vitals + EMR → standardized metrics → *trend analysis*.
- **Banking:** transactions → cleansed & enriched → *fraud dashboards*.

## ⚙️ 2️⃣ ETL vs ELT — Two Approaches

| Aspect | ETL (Extract‑Transform‑Load) | ELT (Extract‑Load‑Transform) |
|---|---|---|
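| **Compute** | Outside the warehouse (Spark/Glue) | Inside the warehouse (SQL/dbt) |
| **Latency** | Often higher: data passes through a separate transform engine before loading | Often lower: transforms run as parallel SQL next to the data |
| **When** | Data must be cleaned or secured **before** it is stored | Cloud warehouses with cheap, scalable compute |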
| **Tools** | SSIS, Informatica, Glue ETL | dbt, Snowflake Tasks, BigQuery SQL, Synapse |

**Rule of thumb:** Use **ETL** when pre‑load scrubbing or PII removal is mandatory; **ELT** when the warehouse is your workhorse.

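The difference in ordering can be sketched in a few lines. The example below is a minimal illustration, not a production pattern: it uses Python's built-in `sqlite3` as a stand-in for a warehouse, and the table names (`etl_orders`, `raw_orders`, `elt_orders`), the sample rows, and the `mask_email` helper are all hypothetical.

```python
import hashlib
import sqlite3

# Hypothetical source rows: (order_id, email, amount)
SOURCE_ROWS = [
    (1, "ana@example.com", 120.0),
    (2, "bob@example.com", 35.5),
]


def mask_email(email: str) -> str:
    """Replace an email with a short SHA-256 digest so no raw PII is stored."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]


def run_etl(conn: sqlite3.Connection) -> None:
    """ETL: transform in Python first, then load only the scrubbed rows."""
    scrubbed = [(oid, mask_email(email), amt) for oid, email, amt in SOURCE_ROWS]
    conn.execute("CREATE TABLE etl_orders (order_id INTEGER, email_hash TEXT, amount REAL)")
    conn.executemany("INSERT INTO etl_orders VALUES (?, ?, ?)", scrubbed)


def run_elt(conn: sqlite3.Connection) -> None:
    """ELT: load the raw rows as-is, then transform with SQL inside the database."""
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, email TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", SOURCE_ROWS)
    conn.execute(
        """
        CREATE TABLE elt_orders AS
        SELECT order_id,
               substr(email, instr(email, '@') + 1) AS email_domain,
               amount
        FROM raw_orders
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT * FROM etl_orders ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM elt_orders ORDER BY order_id").fetchall()
print(etl_rows)
print(elt_rows)
```

In the ETL path the raw email never reaches the database. In the ELT path `raw_orders` still holds it, which is why pre-load scrubbing requirements tend to push a pipeline toward ETL.
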
## 🧱 3️⃣ Multi‑Layered Architecture
```
Raw  →  Staging  →  Curated  →  Data Mart  →  BI / ML / API
```
| Layer | Purpose | Example Transformation |
|---|---|---|
| Raw | Immutable source copy | Land CSV/JSON as‑is |
| Staging | Basic type/format cleanup | Cast dates, trim strings, dedupe |
| Curated | Business modeling (facts/dims) | Join orders+customers → FactSales |
| Data Mart | Domain views | Regional Sales, Customer 360 |
| Analytics | Consumption | Dashboards, feature store |

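The first three layers can be sketched end to end. This is a minimal example under stated assumptions: in-memory `sqlite3` stands in for the warehouse, the layer is encoded as a table-name prefix (`raw_`, `stg_`) because SQLite has no schemas by default, and the sample rows and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw: land source values as-is (everything is text, duplicates included).
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, customer_id TEXT, order_date TEXT, amount TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        ("1", "10", " 2025-10-01 ", "100.50"),
        ("1", "10", " 2025-10-01 ", "100.50"),  # exact duplicate
        ("2", "11", "2025-10-02", "40"),
    ],
)
conn.execute("CREATE TABLE raw_customers (id TEXT, customer_name TEXT)")
conn.executemany("INSERT INTO raw_customers VALUES (?, ?)", [("10", " Ana "), ("11", "Bob")])

# Staging: cast types, trim strings, drop exact duplicates.
conn.execute(
    """
    CREATE TABLE stg_orders AS
    SELECT DISTINCT CAST(order_id AS INTEGER)    AS order_id,
                    CAST(customer_id AS INTEGER) AS customer_id,
                    DATE(TRIM(order_date))       AS order_date,
                    CAST(amount AS REAL)         AS amount
    FROM raw_orders
    """
)
conn.execute(
    """
    CREATE TABLE stg_customers AS
    SELECT CAST(id AS INTEGER) AS id, TRIM(customer_name) AS customer_name
    FROM raw_customers
    """
)

# Curated: join orders and customers into a fact table.
conn.execute(
    """
    CREATE TABLE fact_sales AS
    SELECT o.order_id, o.order_date, c.customer_name, o.amount
    FROM stg_orders o
    JOIN stg_customers c ON o.customer_id = c.id
    """
)

fact_rows = conn.execute("SELECT * FROM fact_sales ORDER BY order_id").fetchall()
print(fact_rows)
```

The raw tables are never modified. Each later layer is rebuilt from the one before it, so a bad transform can be fixed and rerun without re-extracting from the source.
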
## 🧮 4️⃣ Transformation Patterns (SQL)
**Row/Column:**
```sql
SELECT CAST(order_date AS DATE) AS order_date,
       amount, tax, amount+tax AS total_amount
FROM staging.sales
WHERE amount > 0;
```
**Join:**
```sql
SELECT o.order_id, c.customer_name
FROM curated.orders o
JOIN curated.customers c ON o.customer_id = c.id;
```
**Aggregate:**
```sql
CREATE OR REPLACE TABLE mart.sales_by_region AS
SELECT region, SUM(amount) AS total_sales
FROM curated.fact_sales
GROUP BY region;
```
**Conditional:**
```sql
SELECT CASE WHEN country IN ('US','CA') THEN 'NA' ELSE 'Other' END AS region_group
FROM curated.customers;  -- assumes this table has a country column
```
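
To try the patterns together without a warehouse, the sketch below combines the row/column, conditional, and aggregate patterns in a single query against in-memory `sqlite3`. The `fact_sales` layout and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (order_id INTEGER, country TEXT, amount REAL, tax REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [
        (1, "US", 100.0, 8.0),
        (2, "CA", 50.0, 6.5),
        (3, "DE", 80.0, 15.0),
        (4, "US", 0.0, 0.0),  # dropped by the amount > 0 filter
    ],
)

summary = conn.execute(
    """
    SELECT CASE WHEN country IN ('US','CA') THEN 'NA' ELSE 'Other' END AS region_group,
           COUNT(*)          AS orders,
           SUM(amount + tax) AS total_amount
    FROM fact_sales
    WHERE amount > 0
    GROUP BY region_group
    ORDER BY region_group
    """
).fetchall()
print(summary)
```

SQLite accepts the `region_group` alias in `GROUP BY`. Some engines require repeating the `CASE` expression there instead.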

## ⚡ 5️⃣ Performance Optimization
| Technique | Why it helps | Example |
|---|---|---|
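| Partitioning | Queries scan only the matching partitions | Hive-style paths such as `.../year=2025/month=10/` for Parquet files |
| Predicate Pushdown | Filters run at the scan, so fewer rows are read | `WHERE event_dt >= CURRENT_DATE - 7` |
| Columnar Formats | Reads only the referenced columns, so less I/O | Parquet / ORC |
| Materialized Views | Precompute and store the results of heavy queries | `CREATE MATERIALIZED VIEW ...` |
| Broadcast Joins | Copy the small table to every worker so the large one is not shuffled | Spark `broadcast(dim)` |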

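Partition pruning, the first row of the table, can be illustrated with the standard library alone. This is a simplified sketch: it writes CSV files rather than Parquet to avoid extra dependencies, and the `write_partitioned` and `read_partition` helpers and the sample events are hypothetical. Real engines apply the same idea by skipping directories whose partition values cannot match the filter.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(root: Path, rows: list[dict]) -> None:
    """Write rows into Hive-style year=/month= directories, one CSV per partition."""
    for row in rows:
        part_dir = root / f"year={row['year']}" / f"month={row['month']:02d}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0.csv"
        is_new = not path.exists()
        with path.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["event_id", "amount"])
            if is_new:
                writer.writeheader()
            writer.writerow({"event_id": row["event_id"], "amount": row["amount"]})


def read_partition(root: Path, year: int, month: int) -> tuple[list[dict], int]:
    """Read only files under the requested partition; return rows and files opened."""
    part_dir = root / f"year={year}" / f"month={month:02d}"
    files = sorted(part_dir.glob("*.csv"))
    rows: list[dict] = []
    for path in files:
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows, len(files)


events = [
    {"event_id": 1, "year": 2025, "month": 9, "amount": 10.0},
    {"event_id": 2, "year": 2025, "month": 10, "amount": 20.0},
    {"event_id": 3, "year": 2025, "month": 10, "amount": 5.0},
    {"event_id": 4, "year": 2024, "month": 10, "amount": 7.0},
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_partitioned(root, events)
    total_files = len(list(root.rglob("*.csv")))
    october_rows, files_opened = read_partition(root, 2025, 10)

print(f"opened {files_opened} of {total_files} files")
print([row["event_id"] for row in october_rows])
```

Because the partition values live in the path, the reader never opens the files for other months. CSV values come back as strings, so a real pipeline would cast them in the staging layer.
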
## 🧰 6️⃣ Tooling Map
| Category | Tools |
|---|---|
| SQL Transform | dbt, warehouse SQL (Snowflake/Redshift/BigQuery/Synapse) |
| Distributed | Spark, Databricks, Flink |
| Orchestration | Airflow, ADF, Glue Workflows |
| Lightweight Analytics | DuckDB, Polars |

## 🖼️ 7️⃣ Visual — Layered Transformation Flow

```python
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

# Colors and layout constants; all sizes are in axes data units (0 to 1).
BG = '#e6f0ff'; FILL = '#e6f0ff'; EDGE = '#2563eb'; TXT = '#111827'
# W is sized so five boxes plus four gaps (0.94 total) fit inside the 0..1 x-range;
# PAD matches the boxstyle pad so arrows start at the visible box edge.
W, H, GAP, PAD = 0.14, 0.22, 0.06, 0.02
Y0 = 0.39
labels = [
    ('Raw', 'Land CSV/JSON'),
    ('Staging', 'Type cast / dedupe'),
    ('Curated', 'Facts & Dims'),
    ('Data Mart', 'Subject-area views'),
    ('Analytics', 'BI / ML / API')
]
fig, ax = plt.subplots(figsize=(12, 3.8))
fig.patch.set_facecolor(BG); ax.set_facecolor(BG); ax.set_axis_off()
ax.set_xlim(0, 1); ax.set_ylim(0, 1)

# Center the row of boxes horizontally.
total_w = len(labels)*W + (len(labels)-1)*GAP
x0 = (1 - total_w)/2
xs = [x0 + i*(W+GAP) for i in range(len(labels))]

def box(x, t, s):
    """Draw one rounded box with a bold title t and a subtitle s."""
    # rounding_size is in data units here, so it must stay well below W and H.
    r = FancyBboxPatch((x, Y0), W, H, boxstyle='round,pad=0.02,rounding_size=0.03',
                       fc=FILL, ec=EDGE, lw=1.6)
    ax.add_patch(r)
    ax.text(x+W/2, Y0+H*0.62, t, ha='center', va='center', fontsize=10, color=TXT, fontweight='bold')
    ax.text(x+W/2, Y0+H*0.36, s, ha='center', va='center', fontsize=9, color=TXT)

for (t, s), x in zip(labels, xs):
    box(x, t, s)

# Connect consecutive boxes with arrows at mid-height.
y = Y0 + H/2
for i in range(len(xs)-1):
    ax.annotate('', xy=(xs[i+1]-PAD, y), xytext=(xs[i]+W+PAD, y),
                arrowprops=dict(arrowstyle='->', lw=2, color='#4b5563', mutation_scale=12))
plt.tight_layout(); plt.show()
```


## 💡 8️⃣ Practice
1) Sketch your raw→staging→curated pipeline.
2) Write a SQL transform that cleans and aggregates orders by region.
3) Run the same query against CSV and Parquet copies of a dataset and compare the timings.
4) Add Session 5 quality checks after transformation.