---

# 🏗️ The Foundation — How Snowflake Executes Queries

Before we talk about optimization, let’s remember **how Snowflake works internally**:

* Data is stored in **Remote Storage** (S3, Azure Blob, GCS).
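* Data is organized into **Micro-partitions**: immutable, compressed, columnar files, each holding roughly 50 to 500 MB of uncompressed data.
* Queries don’t have to scan whole tables → Snowflake keeps min/max metadata per column for every micro-partition and uses it to **prune** (skip) the ones that cannot match your filters.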
* Compute is done in **Virtual Warehouses (VWHs)**.
* Optimizations = "How can we help Snowflake prune smarter and compute less?"

Keep this mental picture in mind. Now let’s go deeper.

---

# 🔍 Optimizing Queries (Primary Checks)

When your query feels slow, don’t immediately jump to scaling warehouses. That’s like hiring 100 workers when 5 smart ones could finish the job if instructions were clearer.

Instead, check these areas **step by step**:

---

### 2.1 Complexity of Query

👉 Ask: *“Is this query doing more than it needs to?”*

* Example: A BI analyst writes:

```sql
SELECT DISTINCT c.customer_id, c.customer_name, c.region, COUNT(*)
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN payments p ON o.order_id = p.order_id
WHERE c.region = 'APAC'
GROUP BY c.customer_id, c.customer_name, c.region;
```

Problem:

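* `DISTINCT` + `GROUP BY` = duplication of effort: `GROUP BY` already returns one row per group, so `DISTINCT` adds nothing.
* `payments` is joined but none of its columns are used. An inner join can still drop or multiply rows, so remove it only if you don’t rely on that effect (here it makes `COUNT(*)` count payment rows instead of orders).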

Fix: Remove redundancy:

```sql
SELECT c.customer_id, c.customer_name, c.region, COUNT(*) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.region = 'APAC'
GROUP BY c.customer_id, c.customer_name, c.region;
```

---

### 2.2 Unnecessary Joins or Aggregations

Case: A data scientist joins 6 tables but only uses 3 of them in SELECT.
👉 Teach your brain: *If you don’t need it, don’t join it.*

---

### 2.3 Query Profile (Where’s the time spent?)

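Snowflake provides a **Query Profile graph** for every query (open the query from Query History).

* Each box is an operator node such as `TableScan`, `Join`, or `Aggregate`.
* Each node shows its share of total execution time, and the most expensive nodes are listed separately.

👉 Suppose 80% of time is in the **JOIN step** → check whether the join outputs far more rows than it receives (duplicate keys, or a missing or wrong join condition).
👉 Suppose 70% is in the **SCAN step** → compare *Partitions scanned* with *Partitions total*; if they are close, partition pruning is poor.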

Always look here before guessing.
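
The same numbers can also be pulled with SQL, which helps when you want to compare runs. The sketch below uses the `GET_QUERY_OPERATOR_STATS` table function on the last query of the session. The JSON field names follow the documented layout as I understand it, so treat them as assumptions and check them against your account's output.

```sql
-- Run this right after the slow query, in the same session.
SELECT
    operator_id,
    operator_type,                                                        -- e.g. TableScan, Join, Aggregate
    execution_time_breakdown:overall_percentage::FLOAT AS share_of_query_time,
    operator_statistics:pruning:partitions_scanned::NUMBER AS partitions_scanned,
    operator_statistics:pruning:partitions_total::NUMBER   AS partitions_total,
    operator_statistics:input_rows::NUMBER  AS input_rows,
    operator_statistics:output_rows::NUMBER AS output_rows
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()))
ORDER BY share_of_query_time DESC;
```

A `Join` row where `output_rows` is far larger than `input_rows` points to a join problem. A `TableScan` row where `partitions_scanned` is close to `partitions_total` points to a scan problem.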

---

### 2.4 Join Keys with Duplicates

Case:

* `orders` table has duplicate `order_id` rows.
* You join with `order_items`.
  👉 This explodes into millions of rows unexpectedly.

Solution: Deduplicate first, for example with `ROW_NUMBER()` filtered through `QUALIFY`, or fix the duplicates at the source.
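
A minimal sketch of that pattern, assuming `orders` has a hypothetical `updated_at` column to pick the latest copy and `order_items` has a hypothetical `item_id` column:

```sql
-- Step 1: confirm the duplicates exist.
SELECT order_id, COUNT(*) AS copies
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Step 2: keep one row per order_id, then join.
WITH orders_dedup AS (
    SELECT *
    FROM orders
    -- QUALIFY filters on the window function result, like HAVING does for aggregates.
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
)
SELECT o.order_id, i.item_id
FROM orders_dedup o
JOIN order_items i ON i.order_id = o.order_id;
```

Which copy to keep is a business decision; `ORDER BY updated_at DESC` is only one possible rule.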

---

### 2.5 Cartesian Joins

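A Cartesian join happens when a join has no `ON` condition, so every row on one side is paired with every row on the other. A wrong or non-selective join key causes a similar row explosion.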

* Example: 1M rows × 1M rows → 1 trillion rows.
  👉 Always double-check join conditions.
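
To make this concrete, here is a small sketch using the `customers` and `orders` tables from the earlier example. The first query is the accidental form, the second is what was probably intended.

```sql
-- Accidental Cartesian join: comma-separated tables with no join predicate.
-- Result size = rows(customers) × rows(orders).
SELECT COUNT(*)
FROM customers c, orders o;

-- Intended: each order matched to its customer.
SELECT COUNT(*)
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;
```

In the Query Profile, the accidental version typically shows up as a `CartesianJoin` node with a huge output row count.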

---

### 2.6 Other Issues (Must-know)

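* **Functions in WHERE** → avoid wrapping filter columns in functions (`DATE(created_at) = '2024-01-15'`), which can prevent pruning. Filter on the raw column with a range instead (`created_at >= '2024-01-01' AND created_at < '2024-02-01'`); the half-open range also keeps all of January 31 when `created_at` is a timestamp.
* **Select only needed columns** → storage is columnar, so columns you don’t select are not read.
* **Avoid `SELECT *`** in production.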
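
A quick before/after sketch of the first two points, assuming `orders` has a timestamp column `created_at` and a hypothetical `amount` column:

```sql
-- Harder to prune: every column is read and the filter column is wrapped in a function.
SELECT *
FROM orders
WHERE DATE(created_at) = '2024-01-15';

-- Easier to prune: only the needed columns, raw column compared to a half-open range.
SELECT order_id, amount
FROM orders
WHERE created_at >= '2024-01-15'
  AND created_at <  '2024-01-16';
```

Both queries return the same rows for that day; the second gives the optimizer a plain range to compare against micro-partition min/max metadata.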

---

### 🔑 Teaching Question (for self-check)

👉 If a query is slow, how would you **differentiate** between a join problem and a scan problem in Snowflake?
*(Hint: Look at Query Profile)*

---

# 🗃️ Chapter 3: Storage Considerations

This is where your question about **Automatic Clustering, Search Optimization Service, and Materialized Views** fits in.

But before we jump:
👉 You asked: *“Where does data get ordered/changed — Remote Storage or Micro-partition layer?”*

Answer:

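* **Micro-partitions** are the files Snowflake writes to **Remote Storage** when data is loaded, and they are immutable.
* Nothing is ever edited in place. Any change (DML or reclustering) writes **new** micro-partitions and retires the old ones, which remain available for Time Travel.
* Automatic Clustering changes **which rows end up together** in the new micro-partitions. Search Optimization and Materialized Views leave the table’s micro-partitions alone and maintain **separate structures** next to it.
* In every case, the query engine uses **metadata** to decide which micro-partitions to read.

So think: **Micro-partitions in remote storage = frozen blocks of the lake. Metadata = the map that tells Snowflake where to fish.**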

---

### 3.1 Automatic Clustering

* Purpose: Keep data **physically organized** in micro-partitions along a chosen clustering key (like `region`, `date`).
* Without clustering → rows with the same key values are scattered, so queries may scan many partitions.
* With clustering → a background service rewrites poorly clustered micro-partitions into new ones, so rows with similar key values sit together.

**Scenario:**

* `sales` table = 10TB.
* Queries mostly filter `WHERE order_date BETWEEN ...`.
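* Without clustering → matching rows are spread across most micro-partitions, so Snowflake may scan close to the full 10TB.
* With clustering → Micro-partitions neatly follow date ranges → pruning could reduce the scan to something like \~500GB (an illustrative figure, not a benchmark).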

👉 But clustering costs compute. Only enable if pruning benefit >> cost.
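
For the `sales` scenario above, the setup could look roughly like this. It is a sketch, not a recommendation for your table; check the clustering report before and after to see whether the key actually helps.

```sql
-- Define the clustering key; Automatic Clustering then maintains it in the background.
ALTER TABLE sales CLUSTER BY (order_date);

-- Inspect how well the table is clustered on that key (returns a JSON report).
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(order_date)');

-- Pause or resume background reclustering if the cost is not worth it.
ALTER TABLE sales SUSPEND RECLUSTER;
ALTER TABLE sales RESUME RECLUSTER;
```

In the report, a lower `average_depth` generally means micro-partitions overlap less on `order_date`, which is what makes pruning effective.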

---

### 3.2 Search Optimization Service (SOS)

* Purpose: Speed up **point lookups** or highly selective filters.
* Example: Query = `WHERE customer_email = 'abc@gmail.com'`.
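* Normally Snowflake has to scan many micro-partitions, because min/max metadata rarely rules out a random-looking value → slow.
* SOS maintains an extra index-like structure (a *search access path*) → lets Snowflake go straight to the micro-partitions that can contain the value.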

👉 Key: Great for **high-cardinality columns** (emails, IDs), not for low-cardinality (gender = 'M').
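
A minimal sketch of enabling SOS for the email lookup above. The table name `customers` is an assumption (the example only names the column), and the feature requires Enterprise Edition or higher.

```sql
-- Optional: ask Snowflake for a cost estimate before enabling anything.
SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('customers', 'EQUALITY(customer_email)');

-- Enable search optimization only for equality lookups on this one column.
ALTER TABLE customers ADD SEARCH OPTIMIZATION ON EQUALITY(customer_email);

-- See which columns and search methods are configured on the table.
DESCRIBE SEARCH OPTIMIZATION ON customers;
```

Scoping it to one column with `ON EQUALITY(...)` keeps the maintenance cost lower than enabling it for the whole table.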

---

### 3.3 Materialized Views

* Purpose: Store **pre-computed results** of a query (unlike normal views).
* Example:

```sql
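CREATE MATERIALIZED VIEW mv_monthly_sales AS
SELECT
    region,
    DATE_TRUNC('month', order_date) AS sales_month,  -- keeps the year, unlike MONTH()
    SUM(amount) AS total_amount                      -- expressions need explicit column names
FROM sales
GROUP BY region, DATE_TRUNC('month', order_date);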
```

* When user queries:

```sql
SELECT * FROM mv_monthly_sales WHERE region='APAC';
```

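→ Served from the pre-computed result, so only a small stored result set is read instead of the full `sales` table.

* Snowflake keeps it fresh automatically with a background service when the base table changes; queries always return up-to-date results.

👉 Downside: Extra storage + compute to maintain, an Enterprise Edition requirement, and restrictions on the defining query (a single table, no joins). Use for **frequently queried aggregations** over data that changes rarely.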

---

# 🛠️ Chapter 4: Missing but Important Concepts

You didn’t mention these, but they’re **critical in Snowflake optimization**:

### 4.1 Result Caching

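* If the exact same query text runs again and the underlying data has not changed → results are served from the result cache (no warehouse compute). Cached results are kept for 24 hours.
* Best practice: Ensure queries are written consistently, and avoid functions evaluated at execution time, such as `CURRENT_TIMESTAMP`, which prevent result reuse.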
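
Here is a small sketch of the difference, assuming an `orders` table with a timestamp column `created_at`:

```sql
-- Not reusable: CURRENT_TIMESTAMP() is evaluated at execution time.
SELECT COUNT(*)
FROM orders
WHERE created_at >= DATEADD('day', -7, CURRENT_TIMESTAMP());

-- Reusable: the boundary is a literal computed by the client or scheduler.
SELECT COUNT(*)
FROM orders
WHERE created_at >= '2024-01-08';

-- When benchmarking, turn the result cache off so you measure real work.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```

Remember to set `USE_CACHED_RESULT` back to `TRUE` after benchmarking.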

---

### 4.2 Metadata & Partition Pruning

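* Filter on columns whose values correlate with how the data was loaded or clustered (like dates or sequential IDs), so min/max metadata can rule out whole micro-partitions.
* Bad: `WHERE YEAR(order_date) = 2024` (the function can hide the column from pruning).
* Good: `WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'`.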
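
You can check pruning without running the query by using `EXPLAIN`. This sketch assumes the `sales` table from the clustering scenario; the `partitionsTotal` and `partitionsAssigned` columns in the output show how many micro-partitions survive compile-time pruning.

```sql
-- Compare partitionsAssigned with partitionsTotal in the output.
EXPLAIN USING TABULAR
SELECT SUM(amount)
FROM sales
WHERE order_date >= '2024-01-01'
  AND order_date <  '2025-01-01';
```

If `partitionsAssigned` is nearly equal to `partitionsTotal`, the filter is not pruning much, and the table layout (or the predicate) is worth a second look.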

---

### 4.3 Warehouse Sizing: Scale Up vs Scale Out

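* **Scale Up**: Increase warehouse size (more CPU/memory per cluster) → good for single large or complex queries, especially ones that spill to disk.
* **Scale Out**: Add more clusters (a multi-cluster warehouse) → good for many concurrent users and queued queries.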

👉 Story: If you’re building one giant building, hire stronger workers (scale up). If you’re building 100 houses at the same time, hire more teams (scale out).
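
Both options are one `ALTER WAREHOUSE` away. The warehouse name `analytics_wh` and the sizes below are placeholders, and multi-cluster warehouses require Enterprise Edition or higher.

```sql
-- Scale up: a bigger warehouse for one heavy query or batch job.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: let Snowflake add clusters when concurrent queries start to queue.
ALTER WAREHOUSE analytics_wh SET
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY = 'STANDARD';
```

With `MIN_CLUSTER_COUNT` lower than `MAX_CLUSTER_COUNT`, the warehouse runs in auto-scale mode, so extra clusters only run (and cost credits) while the load needs them.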

---

### 4.4 Data Modeling Choices

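* For analytics, Snowflake generally performs well with **dimensional models** (star schema), where dimensions are denormalized and facts join to them on simple keys.
* Over-joining highly normalized schemas = more join work and slower queries.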

---

# 🧭 Chapter 5: Practice Scenarios

Let me give you 2 practice thought experiments:

1. Query is slow. Query Profile shows **90% scan time**.
   👉 What’s your fix? (*Improve pruning: filter on the raw column, select fewer columns, consider a clustering key*)

2. Query is slow. Query Profile shows **80% join time**.
   👉 What’s your fix? (*Fix duplicate keys, check join strategy, remove unnecessary joins*).

---

# ✅ Wrap-up

So in summary:

* **Query optimization** starts with simplifying SQL and analyzing Query Profile.
* **Storage optimization** happens at the micro-partition and metadata level (Snowflake never edits stored files in place). Techniques:

  * Automatic Clustering → better partition pruning.
  * SOS → faster point lookups.
  * Materialized Views → pre-computed results.
* Don’t forget **caching, warehouse scaling, and pruning strategies**.
* Always think: “Am I reducing **data scanned** or **data processed**?”

---
